#### Introduction to Statistical Learning, Lab 4.1

# The Smarket Data Set


We will use the `Smarket` data set for most of the labs in this lecture, so we begin by familiarising ourselves with it.

The data set represents *real* stock market data. It consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each day, the following are recorded:

  - The percentage returns for each of the five previous days (`Lag1` through `Lag5`).
  - The number of shares traded on the previous day, in billions (`Volume`).
  - The percentage return of the current day (`Today`).
  - The `Direction` of the market on the current day (either `Up` or `Down`).
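
To make the column layout concrete, the following sketch builds a frame of the same shape from a made-up return series. It assumes that `Lag1` through `Lag5` are the `Today` column shifted by one to five days and that `Direction` follows the sign of `Today`. The names `returns` and `toy` are hypothetical, and the numbers are not the real `Smarket` data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily percentage returns (made-up numbers, not Smarket).
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0, 1.0, size=20).round(3))

toy = pd.DataFrame({'Today': returns})

# Lag k holds the return observed k days earlier.
for k in range(1, 6):
    toy[f'Lag{k}'] = toy['Today'].shift(k)

# Assumption: 'Up' when the return is positive, 'Down' otherwise.
toy['Direction'] = np.where(toy['Today'] > 0, 'Up', 'Down')

# The first five rows have incomplete lags, so drop them.
toy = toy.dropna().reset_index(drop=True)
print(toy.head())
```

Each row then carries its own return together with the five returns that preceded it, which is the layout described in the list above.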


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from islpwf import datasets, utils, lmplots
sns.set()
%matplotlib inline

Let's have a quick look at the data set.

In [None]:
help(datasets.Smarket)

In [None]:
smarket = datasets.Smarket()
smarket.describe()

In [None]:
smarket.head()

Note that the `Direction` variable is qualitative.
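
A qualitative column holds labels rather than numbers, so numerical summaries do not apply to it directly. A minimal sketch, using a small made-up frame `toy` in place of `smarket`, of how one might count the labels and derive a 0/1 indicator column (the column name `Up` is hypothetical):

```python
import pandas as pd

# Made-up values standing in for the real data.
toy = pd.DataFrame({
    'Today': [0.4, -1.2, 0.7, 0.1, -0.3],
    'Direction': ['Up', 'Down', 'Up', 'Up', 'Down'],
})

# Count how often each label occurs.
counts = toy['Direction'].value_counts()
print(counts)

# Encode the labels as an integer indicator: 1 for 'Up', 0 for 'Down'.
toy['Up'] = (toy['Direction'] == 'Up').astype(int)
print(toy)
```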

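We compute the *correlation matrix* for the `Smarket` data set. The qualitative variable `Direction` has to be left out of the computation. Older versions of `pandas` dropped non-numeric columns silently, but `pandas` 2.0 and later raise an error, so we drop the column explicitly.

In [None]:
corr = smarket.drop(columns='Direction').corr()
corr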

The `utils` module imported above provides a convenience function to display a correlation matrix. (If the labels on the vertical axis are missing, this is due to a [bug](https://github.com/matplotlib/matplotlib/issues/14675) in `matplotlib` 3.1.1.)

In [None]:
utils.plot_corr(corr)
plt.show()
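
A heat map gives a visual impression. The same information can also be read off numerically by listing the variable pairs in order of absolute correlation. A minimal sketch on made-up data follows. The frame `toy` and its columns `a`, `b` and `c` are hypothetical. In the lab one would use `corr` in place of `corr_toy`.

```python
import numpy as np
import pandas as pd

# Made-up data: 'b' is built from 'a', while 'c' is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.5 * rng.normal(size=200),
    'c': rng.normal(size=200),
})
corr_toy = toy.corr()

# Keep only the entries above the diagonal so each pair appears once.
upper = np.triu(np.ones(corr_toy.shape, dtype=bool), k=1)
pairs = corr_toy.where(upper).stack().dropna()

# Order the pairs by the magnitude of the correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```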

We see that the only strong correlation is between `Year` and `Volume`. A plot of `Volume` versus `Year` is therefore interesting.

In [None]:
ax = sns.regplot(x='Year', y='Volume', data=smarket, line_kws={'color':'C1', 'lw':2})
ax.set_xlim((2000, 2006))
plt.show()
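
By default, the line drawn by `sns.regplot` is an ordinary least-squares fit of `y` on `x`. The sketch below reproduces that kind of fit with `np.polyfit` on made-up data. The arrays `year` and `volume` are hypothetical stand-ins, and the coefficients used to generate them are arbitrary.

```python
import numpy as np

# Made-up data: five years with 50 observations each and a linear trend.
rng = np.random.default_rng(2)
year = np.repeat(np.arange(2001, 2006), 50).astype(float)
volume = 1.0 + 0.2 * (year - 2001) + rng.normal(0.0, 0.1, size=year.size)

# Fit a straight line; centring the years at 2001 keeps the problem well conditioned.
slope, intercept = np.polyfit(year - 2001, volume, deg=1)
print(f'slope per year: {slope:.3f}, level in 2001: {intercept:.3f}')
```

The fitted slope estimates the average change in `volume` per year, which is the quantity the regression line in the plot conveys visually.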

The plot above shows a clear upward trend: `Volume` increases over the years. Another good way of visualising this relationship is a box plot.

In [None]:
ax = sns.boxplot(x='Year', y='Volume', data=smarket)
plt.show()
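
Each box in a box plot spans the first to the third quartile of its group, with a line at the median. These numbers can be tabulated directly with `groupby`. A minimal sketch on a made-up frame `toy` (hypothetical values, not the `Smarket` data):

```python
import numpy as np
import pandas as pd

# Made-up data: three years whose typical volume rises from year to year.
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    'Year': np.repeat([2001, 2002, 2003], 100),
    'Volume': np.concatenate([
        rng.normal(1.0, 0.1, 100),
        rng.normal(1.5, 0.1, 100),
        rng.normal(2.0, 0.1, 100),
    ]),
})

# One row per year, one column per quartile.
quartiles = toy.groupby('Year')['Volume'].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```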