# Classification

- [Load Datasets](#Load-Datasets)
- [Logistic Regression](#logreg)
- [Getting Probabilistic Predictions](#prob)


In [4]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_recall_fscore_support, roc_auc_score
from sklearn.preprocessing import LabelEncoder

%matplotlib inline
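# matplotlib 3.6 renamed the 'seaborn-white' style to 'seaborn-v0_8-white'; use whichever is available
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')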

## Load Datasets
Datasets available on http://www-bcf.usc.edu/~gareth/ISL/data.html

We will again work with the stock market data we used last time.

In [5]:
smarket = pd.read_csv('Smarket.csv')
smarket.head()

|   | Unnamed: 0 | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2001 | 0.381 | -0.192 | -2.624 | -1.055 | 5.01 | 1.1913 | 0.959 | Up |
| 1 | 2 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up |
| 2 | 3 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down |
| 3 | 4 | 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.276 | 0.614 | Up |
| 4 | 5 | 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 | 0.213 | Up |


## Logistic Regression
<a id='logreg'></a>

We will go through the same basic steps as last time with LDA and kNN, but now with logistic regression.

## Training Data

The instances from prior to 2005 form the training set.

In [6]:
smark_train = smarket[smarket['Year'] < 2005]

X_train = smark_train[['Lag1', 'Lag2', 'Volume']]
y_train = smark_train['Direction']

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression()

## Examining the Model

Here are the coefficients associated with the three features.

In [7]:
print(X_train.columns.values)
print(logreg.coef_)

['Lag1' 'Lag2' 'Volume']
[[-0.05414095 -0.04584958 -0.11374391]]
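To read coefficients like these, it helps to know that for a binary problem scikit-learn's `coef_` describes the log-odds of the second entry in `classes_` (here that would be 'Up'), and that exponentiating a coefficient gives an odds ratio. The following is a minimal sketch on synthetic data, not the Smarket data; the names `X_demo`, `y_demo`, `demo_model`, and `coef_table` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the three features (NOT the real Smarket data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['Lag1', 'Lag2', 'Volume'])
y_demo = np.where(rng.random(200) < 0.5, 'Down', 'Up')

demo_model = LogisticRegression().fit(X_demo, y_demo)

# For a binary problem, coef_ has shape (1, n_features) and describes
# the log-odds of classes_[1], the second class in sorted order.
coef_table = pd.DataFrame({
    'feature': X_demo.columns,
    'log_odds_coef': demo_model.coef_[0],
    'odds_ratio': np.exp(demo_model.coef_[0]),
})
print('coefficients refer to class:', demo_model.classes_[1])
print(coef_table)
```

An odds ratio below 1 means that increasing the feature lowers the estimated odds of the second class, which matches a negative coefficient.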


## Testing Data

The instances from 2005 form the test data.

In [8]:
smark_test = smarket[smarket['Year'] == 2005]

X_test = smark_test[['Lag1', 'Lag2', 'Volume']]
y_test = smark_test['Direction']

preds = logreg.predict(X_test)
conf = confusion_matrix(y_test, preds)
print(logreg.classes_)
print(conf)

acc = accuracy_score(y_test, preds)
print('accuracy is: ', acc)

['Down' 'Up']
[[77 34]
 [97 44]]
accuracy is:  0.4801587301587302
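The printed matrix is easier to read with labels attached. In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, in the order given by `classes_`. The sketch below re-enters the numbers printed above by hand (the name `conf_table` is made up for illustration) and recomputes the accuracy as the share of instances on the diagonal.

```python
import numpy as np
import pandas as pd

# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels.
labels = ['Down', 'Up']
conf = np.array([[77, 34],
                 [97, 44]])

conf_table = pd.DataFrame(
    conf,
    index=[f'true {c}' for c in labels],
    columns=[f'pred {c}' for c in labels],
)
print(conf_table)

# Accuracy is the fraction of instances on the diagonal: (77 + 44) / 252.
accuracy = np.trace(conf) / conf.sum()
print('accuracy is: ', accuracy)
```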


## Getting Probabilistic Predictions
<a id='prob'></a>

Often we want probabilities rather than just predicted class labels. With the predicted probability for each instance, we can choose the threshold above which an instance is assigned to the positive class. For example, the following code raises the threshold from the default of 0.5 to 0.52: an instance is predicted to be a member of the positive class ("Down") only if its predicted probability for that class is greater than 0.52.

In [9]:
preds = logreg.predict_proba(X_test)
classes = logreg.classes_

threshold = 0.52

newpreds = []
for p in preds:
    if p[0] > threshold:
        newpreds.append(classes[0])
    else:
        newpreds.append(classes[1])

print(logreg.classes_)        
conf = confusion_matrix(y_test, newpreds)
print(conf)

acc = accuracy_score(y_test, newpreds)
print('accuracy is: ', acc)


['Down' 'Up']
[[ 24  87]
 [ 25 116]]
accuracy is:  0.5555555555555556
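The thresholding loop above can also be written without an explicit loop. This is a small sketch using `np.where` on a hand-made probability array; `demo_proba`, `demo_classes`, and `demo_newpreds` are hypothetical names, and the numbers are not model output.

```python
import numpy as np

# Hypothetical predict_proba-style output: column 0 is P(Down), column 1 is P(Up).
demo_proba = np.array([[0.55, 0.45],
                       [0.51, 0.49],
                       [0.48, 0.52],
                       [0.60, 0.40]])
demo_classes = np.array(['Down', 'Up'])
threshold = 0.52

# Predict classes[0] when its probability exceeds the threshold, else classes[1].
demo_newpreds = np.where(demo_proba[:, 0] > threshold, demo_classes[0], demo_classes[1])
print(demo_newpreds)
```

The second row shows the effect of the raised threshold: 0.51 would be predicted 'Down' at the default 0.5, but is predicted 'Up' at 0.52.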


Notice that accuracy increases, but the confusion matrix shows many instances where y=Down and the prediction is Up (87 of the 111 Down instances). Raising the probability threshold for the positive class increases false negatives.

In general, the accuracy score can be misleading when the two classes are not balanced, i.e. when we have more examples of one class than the other. In that case, precision and recall can be more informative.

In [11]:
precision_recall_fscore_support(y_test, newpreds, pos_label='Down', average='binary')

(0.4897959183673469, 0.21621621621621623, 0.30000000000000004, None)

With "Down" as the positive class, the precision is fairly low (about 0.49) and the recall is even lower (about 0.22). The F-score (about 0.30) is the harmonic mean of precision and recall.
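These three numbers can be checked by hand from the confusion matrix of the thresholded predictions. The sketch below re-enters that matrix manually and treats 'Down' (the first row and column) as the positive class; the variable names `tp`, `fn`, and `fp` are chosen for illustration.

```python
import numpy as np

# Confusion matrix of the thresholded predictions, copied from the output above.
# Rows are true labels, columns are predictions, in the order ['Down', 'Up'].
conf = np.array([[24, 87],
                 [25, 116]])

tp = conf[0, 0]  # true Down, predicted Down
fn = conf[0, 1]  # true Down, predicted Up
fp = conf[1, 0]  # true Up, predicted Down

precision = tp / (tp + fp)  # of the predicted Down, how many were Down
recall = tp / (tp + fn)     # of the actual Down, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

The results agree with the tuple returned by `precision_recall_fscore_support` above.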

Let's also calculate the area under the ROC curve (the AUC score). 

The sklearn function for calculating this takes binary labels, which we encode here as True/False, together with the predicted probabilities of the positive class.

In [12]:
p_pred = [p[0] for p in preds] # probability of the positive class
pos_class = logreg.classes_[0]

# convert class labels to True or False
newlabs = (y_test == pos_class)


auc = roc_auc_score(newlabs, p_pred)
print(auc)

0.523225353012587


An AUC of about 0.52 is barely above 0.5, the value expected from random guessing, so this particular model is not much better than random.
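When computing the AUC it matters that the score column matches the class encoded as True. The toy sketch below (hand-made values, not the Smarket data; `y_demo`, `proba_demo`, and `is_down` are hypothetical names) shows that passing the other class's probabilities gives one minus the intended AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predict_proba-style output (NOT model output):
# column 0 is P(Down), column 1 is P(Up).
y_demo = np.array(['Down', 'Up', 'Down', 'Up'])
proba_demo = np.array([[0.90, 0.10],
                       [0.40, 0.60],
                       [0.35, 0.65],
                       [0.20, 0.80]])

# Encode 'Down' as the positive (True) class.
is_down = (y_demo == 'Down')

auc_matched = roc_auc_score(is_down, proba_demo[:, 0])     # scores for 'Down'
auc_mismatched = roc_auc_score(is_down, proba_demo[:, 1])  # scores for 'Up'
print(auc_matched, auc_mismatched)
```

Here three of the four (Down, Up) pairs are ranked correctly by P(Down), so the matched AUC is 0.75 and the mismatched one is 0.25.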