## Applied - Question 7

This question applies leave-one-out cross-validation (LOOCV) to the Weekly dataset.

#### Import block

In [8]:
import pandas as pd

import sklearn.linear_model as skl_lm
from sklearn.model_selection import LeaveOneOut, cross_val_score

import statsmodels.formula.api as smf


#### Load dataset

In [9]:
# Load data from path
data_path = 'D:\\PycharmProjects\\ISLR\\data\\'
df = pd.read_csv(f'{data_path}Weekly.csv')

# Encode Direction as an integer code (Down -> 0, Up -> 1)
df['Direction2'] = df['Direction'].astype('category').cat.codes
df.head()

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction,Direction2
0,1990,0.816,1.572,-3.936,-0.229,-3.484,0.154976,-0.27,Down,0
1,1990,-0.27,0.816,1.572,-3.936,-0.229,0.148574,-2.576,Down,0
2,1990,-2.576,-0.27,0.816,1.572,-3.936,0.159837,3.514,Up,1
3,1990,3.514,-2.576,-0.27,0.816,1.572,0.16163,0.712,Up,1
4,1990,0.712,3.514,-2.576,-0.27,0.816,0.153728,1.178,Up,1
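
To see how `cat.codes` arrives at the 0/1 values in `Direction2`, here is a minimal sketch on a few made-up labels (not rows from Weekly.csv). It assumes the default behaviour of `astype('category')` for strings, which sorts the categories, so `Down` gets code 0 and `Up` gets code 1.

```python
import pandas as pd

# Toy labels, made up for illustration (not taken from Weekly.csv)
direction = pd.Series(['Down', 'Up', 'Up', 'Down'])

cat = direction.astype('category')

# Categories are sorted, so 'Down' precedes 'Up'
mapping = dict(enumerate(cat.cat.categories))

# Integer code of each observation, in the original row order
codes = cat.cat.codes.tolist()

print(mapping)
print(codes)
```

Checking the mapping this way guards against accidentally modelling the probability of `Down` instead of `Up`.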


(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2

In [11]:
model = smf.logit('Direction2 ~ Lag1 + Lag2', data=df).fit(disp=False)
model.summary().tables[1]

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.2212,0.061,3.599,0.000,0.101,0.342
Lag1,-0.0387,0.026,-1.477,0.140,-0.090,0.013
Lag2,0.0602,0.027,2.270,0.023,0.008,0.112


(b) Leave the first observation out and fit the same model

In [13]:
df2 = df.drop(df.index[0])
model = smf.logit('Direction2 ~ Lag1 + Lag2', data=df2).fit(disp=False)
model.summary().tables[1]

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.2232,0.061,3.630,0.000,0.103,0.344
Lag1,-0.0384,0.026,-1.466,0.143,-0.090,0.013
Lag2,0.0608,0.027,2.291,0.022,0.009,0.113


The results are practically identical. The coefficients change by 0.002 or less, and the
p-values barely move: Lag2 remains significant at the 5% level and Lag1 does not.

(c) Use the model from (b) to predict the direction of the first observation. This step switches to sklearn.

In [37]:
X_train = df2[['Lag1', 'Lag2']]
X_test = df[['Lag1', 'Lag2']].iloc[:1]
y_train = df2.Direction
y_test = df.Direction[0]

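# Fit logistic regression on all but the first observation, then predict it.
# Note: sklearn applies L2 regularization by default (C=1.0), unlike smf.logit.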
regr = skl_lm.LogisticRegression(solver='lbfgs')
pred = regr.fit(X_train, y_train).predict(X_test)
print(f'Prediction is : {pred[0]}')

Prediction is : Up


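The model predicts Up, but the true direction of the first observation is Down (see the table above). This single test observation is misclassified, so the test error on it is 100%.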
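
Steps (b) and (c) hold out a single observation. Repeating them for every observation in turn is exactly what LOOCV does. The following is a minimal sketch of that loop, run on synthetic stand-in data (the arrays `X` and `y` are hypothetical and are not the Weekly dataset). The function name `loocv_error` is made up for illustration. As a sanity check, the result is compared with sklearn's `cross_val_score` using `LeaveOneOut`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data (hypothetical, NOT the Weekly dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 1] + rng.normal(size=40) > 0).astype(int)


def loocv_error(X, y):
    """Misclassification rate from a hand-written leave-one-out loop."""
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i  # boolean mask that drops observation i
        clf = LogisticRegression(solver='lbfgs').fit(X[keep], y[keep])
        pred = clf.predict(X[i:i + 1])[0]  # predict the held-out observation
        errors += int(pred != y[i])
    return errors / n


manual = loocv_error(X, y)

# Cross-check: one minus the mean per-split accuracy is the same error rate
helper = 1 - cross_val_score(LogisticRegression(solver='lbfgs'), X, y,
                             cv=LeaveOneOut(), scoring='accuracy').mean()

print(manual, helper)
```

The loop fits the model n times, once per held-out observation, which is why LOOCV becomes slow on larger datasets.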

(d) This looks like a job for the LeaveOneOut class from sklearn.

In [42]:
# Set up the classifier and the leave-one-out splitter
regr = skl_lm.LogisticRegression(solver='lbfgs')
loo = LeaveOneOut()

# Average test error over all n splits. The labels are 0/1, so the
# negated MSE of the predicted labels is the misclassification rate.
X = df[['Lag1', 'Lag2']]
score = cross_val_score(regr, X, df.Direction2,
                        cv=loo, scoring='neg_mean_squared_error').mean()

# Print score
print(f'The test error rate on average is {-score}')

The test error rate on average is 0.44995408631772266
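
The cell above scores a classifier with `neg_mean_squared_error`, yet the number is reported as an error rate. That works because the labels are coded 0/1: the squared difference between a predicted and a true label is 1 for a miss and 0 for a hit. A minimal sketch with made-up labels (the helper names `mse` and `error_rate` are illustrative only):

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length label lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def error_rate(y_true, y_pred):
    """Fraction of positions where the labels disagree."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up 0/1 labels: two of the five predictions are wrong
y_true = [0, 0, 1, 1, 1]
y_pred = [1, 0, 1, 1, 0]

print(mse(y_true, y_pred), error_rate(y_true, y_pred))
```

Scoring with `'accuracy'` and reporting one minus the mean would give the same figure and states the intent more directly.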


This is a high error rate for a binary classification problem: about 45%, only slightly better than random guessing.

This does not mean that LOOCV itself did poorly. LOOCV only estimates the test error, at the cost of one model fit per
observation (hence the long compute time). The high error is not surprising when predicting stocks: returns depend very
little on their own recent past, which is exactly why they are so hard to model.


