# Classification: Beat the market ;-)

For classification, we will again use an example from ISLR (http://www-bcf.usc.edu/~gareth/ISL/).
We will try to predict, based on the percentage returns of the preceding 5 days (plus trading volume), whether the S&P 500 is going to go up or down the next day.

This example is very different from the usual classification demos: the classes are not merely not linearly separable, they are possibly not separable "at all". We don't really expect this to work, do we?

So, let's try ;-) This time, we first need to load the data from a CSV file.

## Load and inspect the data

In [1]:
# the usual imports
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
smarket = pd.read_csv('../data/Smarket.csv').iloc[:,1:]

In [7]:
smarket.head()

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
0,2001,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up
1,2001,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
2,2001,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
3,2001,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up
4,2001,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up


As we see, the data is conveniently preprocessed: there is no need to calculate the lags ourselves.

In [8]:
smarket.corr()

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today
Year,1.0,0.0297,0.030596,0.033195,0.035689,0.029788,0.539006,0.030095
Lag1,0.0297,1.0,-0.026294,-0.010803,-0.002986,-0.005675,0.04091,-0.026155
Lag2,0.030596,-0.026294,1.0,-0.025897,-0.010854,-0.003558,-0.043383,-0.01025
Lag3,0.033195,-0.010803,-0.025897,1.0,-0.024051,-0.018808,-0.041824,-0.002448
Lag4,0.035689,-0.002986,-0.010854,-0.024051,1.0,-0.027084,-0.048414,-0.0069
Lag5,0.029788,-0.005675,-0.003558,-0.018808,-0.027084,1.0,-0.022002,-0.03486
Volume,0.539006,0.04091,-0.043383,-0.041824,-0.048414,-0.022002,1.0,0.014592
Today,0.030095,-0.026155,-0.01025,-0.002448,-0.0069,-0.03486,0.014592,1.0


The target for our predictions is "Direction". We need to convert it to a numeric (binary) variable.

In [9]:
smarket['dir_0_1'] = np.where(smarket['Direction'] == 'Up', 1, 0)

## Split into training and test sets

In [10]:
x_columns = ['Lag1','Lag2','Lag3','Lag4','Lag5','Volume']
# hold out the most recent year (2005) as the test set
is_test = smarket['Year'] == 2005
X_train = smarket.loc[~is_test, x_columns].values
y_train = smarket.loc[~is_test, 'dir_0_1'].values
X_test = smarket.loc[is_test, x_columns].values
y_test = smarket.loc[is_test, 'dir_0_1'].values

## Standardize

We standardize the variables because, although the lagged returns (Lag1 to Lag5) share a scale, Volume does not.

In [12]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

scaler.mean_, scaler.scale_

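(array([ -3.55983335e-18,  -1.06795000e-17,  -3.55983335e-18,
          3.55983335e-17,   3.91581668e-17,   7.68924003e-16]),
 array([ 1.,  1.,  1.,  1.,  1.,  1.]))

Note that means of essentially zero and scales of exactly one indicate the scaler was fitted to data that had already been standardized, i.e. the cell was run a second time. The results below are unaffected, because that second transform leaves the data unchanged.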
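To make explicit what `StandardScaler` does, here is a minimal NumPy sketch on a small made-up array (the numbers are hypothetical, not taken from Smarket). The mean and standard deviation are estimated on the training rows only and then applied to both sets, so nothing from the test period leaks into the preprocessing.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
toy_train = np.array([[0.5, 1200.0],
                      [-1.0, 1500.0],
                      [0.2, 900.0],
                      [0.3, 1400.0]])
toy_test = np.array([[0.1, 1000.0],
                     [-0.4, 1600.0]])

# Estimate the statistics on the training rows only.
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same shift and scale to both sets.
toy_train_std = (toy_train - train_mean) / train_std
toy_test_std = (toy_test - train_mean) / train_std

print(toy_train_std.mean(axis=0))  # close to 0 for every column
print(toy_train_std.std(axis=0))   # close to 1 for every column
print(toy_test_std.mean(axis=0))   # generally not 0: test rows reuse training statistics
```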

## Logistic Regression

In [26]:
from sklearn import metrics
from sklearn import linear_model

def assess_classification_performance(model, X_train, y_train, X_test, y_test, short=False):
    """Print accuracy and, unless short=True, confusion matrices and precision/recall reports."""
    # predict once and reuse the results below
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    accuracy_train = metrics.accuracy_score(y_train, pred_train)
    accuracy_test = metrics.accuracy_score(y_test, pred_test)
    print('Accuracy (train/test): {} / {}\n'.format(accuracy_train, accuracy_test))

    if not short:
        # confusion matrix: rows = actual group, columns = predicted group
        print('Confusion_matrix (training data):\n {}'.format(metrics.confusion_matrix(y_train, pred_train)))
        print('\nConfusion_matrix (test data):\n {}'.format(metrics.confusion_matrix(y_test, pred_test)))

        # precision =  tp / (tp + fp)
        # recall = tp / (tp + fn) (= sensitivity)
        # F1 = 2 * (precision * recall) / (precision + recall)
        print('\nPrecision - recall (training data):')
        print(metrics.classification_report(y_train, pred_train))

        print('\nPrecision - recall (test data):')
        print(metrics.classification_report(y_test, pred_test))

In [27]:
logistic_model = linear_model.LogisticRegression()
logistic_model.fit(X_train, y_train)
print('Coefficients ({}):\n{}\n'.format(x_columns, logistic_model.coef_)) 

Coefficients (['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']):
[[-0.06632063 -0.05605806  0.00885093  0.0079177  -0.00521892 -0.03097094]]
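
As a rough illustration of how to read these coefficients, the sketch below plugs them into the logistic function for one hypothetical standardized observation. The intercept is not printed above, so it is set to 0.0 here purely as an assumption; the observation and the helper name `prob_up` are made up as well.

```python
import numpy as np

# Coefficients as printed above (order: Lag1, Lag2, Lag3, Lag4, Lag5, Volume).
coef = np.array([-0.06632063, -0.05605806, 0.00885093,
                 0.0079177, -0.00521892, -0.03097094])
intercept = 0.0  # assumption: the fitted intercept is not shown in the notebook

def prob_up(x, coef, intercept):
    """Logistic model: P(Up | x) = 1 / (1 + exp(-(intercept + x . coef)))."""
    return 1.0 / (1.0 + np.exp(-(intercept + np.dot(x, coef))))

# Hypothetical day: yesterday's return one standard deviation above average,
# everything else at its training mean (0 after standardization).
x_example = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
p = prob_up(x_example, coef, intercept)
print(p)
```

Under these assumptions the probability comes out slightly below 0.5: a negative Lag1 coefficient means that a positive return yesterday nudges the model toward "Down". With coefficients this small, predicted probabilities stay close to 0.5 for typical inputs.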



In [29]:
assess_classification_performance(logistic_model, X_train, y_train, X_test, y_test)  

Accuracy (train/test): 0.527054108216 / 0.480158730159

Confusion_matrix (training data):
 [[175 316]
 [156 351]]
Confusion_matrix (test data):
 [[77 34]
 [97 44]]

Precision - recall (training data):
             precision    recall  f1-score   support

          0       0.53      0.36      0.43       491
          1       0.53      0.69      0.60       507

avg / total       0.53      0.53      0.51       998


Precision - recall (test data):
             precision    recall  f1-score   support

          0       0.44      0.69      0.54       111
          1       0.56      0.31      0.40       141

avg / total       0.51      0.48      0.46       252
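
The formulas in the comments of `assess_classification_performance` can be checked by hand against the test confusion matrix above. The sketch below treats "Up" (1) as the positive class; the variable names are for illustration only.

```python
import numpy as np

# Test-set confusion matrix from above: rows = actual, columns = predicted.
cm = np.array([[77, 34],
               [97, 44]], dtype=float)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the days predicted "Up", how many went up
recall = tp / (tp + fn)      # of the days that went up, how many were predicted "Up"
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 2), round(recall, 2), round(f1, 2))
```

The rounded values agree with the accuracy and with the row for class 1 in the test report.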



Unfortunately, a test accuracy of 48% is worse than simply always predicting the majority class:

In [32]:
majority_vote_classifier_accuracy = max(y_test.mean(), 1 - y_test.mean())
majority_vote_classifier_accuracy

0.55952380952380953
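
Note that the number above takes the majority class from the test labels themselves. A slightly stricter variant, sketched below, picks the majority class on the training set and scores that constant prediction on the test set. The label arrays are rebuilt here from the class counts in the reports above (491/507 for training, 111/141 for test); the function name is made up for illustration.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy on y_test when always predicting the most frequent class in y_train."""
    majority_class = np.bincount(y_train).argmax()
    return np.mean(y_test == majority_class)

# Rebuild 0/1 label arrays from the support counts shown in the reports above.
y_train_counts = np.repeat([0, 1], [491, 507])
y_test_counts = np.repeat([0, 1], [111, 141])

baseline = majority_baseline_accuracy(y_train_counts, y_test_counts)
print(baseline)
```

Here both variants agree, since "Up" is the majority class in both periods: always predicting "Up" is right on 141 of 252 test days, and that is what the logistic model has to beat.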

## Logistic Regression using statsmodels

In [34]:
import statsmodels.api as sm
sm_logistic = sm.GLM(y_train, X_train, sm.families.Binomial())
sm_results = sm_logistic.fit()
print(sm_results.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                  998
Model:                            GLM   Df Residuals:                      992
Model Family:                Binomial   Df Model:                            5
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -690.68
Date:                Fri, 01 Jul 2016   Deviance:                       1381.4
Time:                        22:34:03   Pearson chi2:                     998.
No. Iterations:                     6                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0666      0.064     -1.046      0.295        -0.191     0.058
x2            -0.0563      0.064     -0.884      0.3
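
One detail worth flagging: unlike scikit-learn's `LogisticRegression`, `sm.GLM` fits exactly the design matrix it is given and adds no intercept. The summary accordingly reports six fitted parameters (`Df Residuals: 992` out of 998 observations) and lists only the `x` rows. statsmodels offers `sm.add_constant` for this; the NumPy sketch below shows the equivalent transformation on a made-up array (`with_intercept` is a hypothetical helper name).

```python
import numpy as np

def with_intercept(X):
    """Prepend a column of ones so the first coefficient acts as an intercept."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(X.shape[0]), X])

# Hypothetical standardized predictors: 3 observations, 2 features.
X_demo = np.array([[0.5, -1.2],
                   [-0.3, 0.8],
                   [1.1, 0.1]])
X_demo_const = with_intercept(X_demo)
print(X_demo_const.shape)  # one extra column compared to X_demo
```

Passing such a matrix as the second argument to `sm.GLM` would add an intercept row to the summary and may shift the other estimates slightly.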

## Logistic Regression, lag1 & lag2 predictors only

In [36]:
X_train_lag12 = X_train[:,0:2]
X_test_lag12 = X_test[:,0:2]
logistic_model_lag12 = linear_model.LogisticRegression()
logistic_model_lag12.fit(X_train_lag12, y_train)
print('Coefficients ({}): {}\n'.format('Lag1, Lag2', logistic_model_lag12.coef_))

assess_classification_performance(logistic_model_lag12, X_train_lag12, y_train, X_test_lag12, y_test, short = True)

Coefficients (Lag1, Lag2): [[-0.06808079 -0.0544505 ]]

Accuracy (train/test): 0.516032064128 / 0.559523809524
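
The test accuracy here equals the majority-class baseline from earlier to every printed digit, i.e. 141 of 252 days correct. That can happen either because the model predicts "Up" for every test day or because its hits on both classes happen to add up to 141, and the short report cannot tell these apart. A small check like the following sketch would settle it; `is_constant_prediction` is a hypothetical helper, and the arrays are made up to show both cases.

```python
import numpy as np

def is_constant_prediction(y_pred):
    """True if every prediction is the same class."""
    return np.unique(y_pred).size == 1

# Made-up predictions for illustration only.
always_up = np.ones(252, dtype=int)
mixed = np.array([1, 0, 1, 1, 0, 1])

print(is_constant_prediction(always_up))
print(is_constant_prediction(mixed))
```

In the notebook this would be called as `is_constant_prediction(logistic_model_lag12.predict(X_test_lag12))`.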

