# Classification

Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Let's start with a simple example.

In [1]:
import pandas as pd
df = pd.read_csv('data.csv')
df[:5]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Above is data from the Titanic. It lists all the passengers in the manifest and whether they survived or not. What features predict whether a passenger survives?

## Precision and Recall
Before we start, what makes a good classifier? Our initial thought would be simply to measure the accuracy of the classifier (the fraction of data it correctly predicts). However, accuracy is a little nuanced when it comes to classification problems. Let's do some data analysis to see what we are up against:

In [2]:
df.groupby('Survived')['Survived'].count()

Survived
0    549
1    342
Name: Survived, dtype: int64

Note that since more people died than survived, a trivial classifier that always predicts "die" achieves about 62% accuracy (549 of 891), which at first glance seems significantly better than random. However, the classification errors are not symmetric: every passenger who died is classified correctly, while every survivor is misclassified.


To account for situations like this, we decompose accuracy into two quantities: precision and recall. Precision penalizes false positives (we predicted they would die but they didn't), and recall penalizes false negatives (we predicted they would survive but they actually died). (Which class counts as "positive" is a matter of convention; here it is dying.) With tp, fp, and fn denoting the counts of true positives, false positives, and false negatives, precision is tp/(tp + fp) and recall is tp/(tp + fn).

The always-die classifier has perfect recall (it has no false negatives), but its precision is only about 62%, because every survivor counts as a false positive.
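As a quick check of the definitions above, the following sketch plugs the counts from the `Survived` tally into the formulas for the always-die classifier. It assumes "died" is the positive class; the variable names are illustrative only.

```python
# Counts from the Survived tally above; "died" is treated as the positive class.
died, survived = 549, 342

# The always-die classifier labels every passenger as positive.
tp = died       # died and predicted to die
fp = survived   # survived but predicted to die
fn = 0          # nobody is predicted to survive, so no deaths are missed

accuracy = tp / (died + survived)   # there are no true negatives to add
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.616 precision=0.616 recall=1.000
```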

## Featurization
Just as when we calculated correlations, we have to come up with a reasonable way to featurize the data. This process usually starts with some exploratory data analysis, where we try to understand how different variables relate to each other. The obvious one to try first (from the movie!!) is breaking down survival by gender:

In [3]:
df.groupby(['Survived','Sex'])[['Survived']].count()

Survived,Sex,Survived (count)
0,female,81
0,male,468
1,female,233
1,male,109


That seems to be a valuable feature. What about Age?

In [4]:
df['f0_is_male'] = 1.0*(df['Sex'] == 'male')

df.groupby(['Survived','Age'])[['Age']].count()

Survived,Age,Age (count)
0,1.0,2
0,2.0,7
0,3.0,1
0,4.0,3
0,6.0,1
...,...,...
1,58.0,3
1,60.0,2
1,62.0,2
1,63.0,2


It looks like there is a correlation, but with this many groups we can't be certain. Let's try using a correlation coefficient:

In [5]:
from scipy.stats import pearsonr 
import numpy as np

print(pearsonr(df['Age'],df['Survived']))

ValueError: array must not contain infs or NaNs

In [6]:
df['Age'].isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool

Some ages are missing. One sensible fix is to replace the missing values with the median age:

In [7]:
df['Age'] = df['Age'].fillna(np.nanmedian(df['Age'])) #why nanmedian?
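On the question raised in the comment above: a small sketch of the difference between `np.median` and `np.nanmedian`, using three made-up ages rather than the real column.

```python
import numpy as np

ages = np.array([22.0, np.nan, 38.0])  # toy values, not the real Age column

print(np.median(ages))     # nan: the missing value propagates into the result
print(np.nanmedian(ages))  # 30.0: NaNs are ignored before taking the median
```

Since the column still contains NaNs at the moment the fill value is computed, the NaN-aware version is the one that yields a usable number.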

In [8]:
print(pearsonr(df['Age'],df['Survived']))

(-0.06491041993052585, 0.05276068847579861)


Looks like there is a small negative correlation. Do we trust this?
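One reason for caution is that Pearson's coefficient only measures linear association. The sketch below uses synthetic data (not the Titanic columns) and `np.corrcoef` in place of `pearsonr` to show a relationship that is perfectly predictable yet has essentially zero correlation.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # y is fully determined by x, but the relationship is not linear

# Off-diagonal entry of the 2x2 correlation matrix is Pearson's r.
r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-9)
```

If survival depends on age in a similarly non-linear way (for example, if the very young fare differently from everyone else), a single coefficient over raw age could hide it, which motivates splitting age into groups.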

In [9]:
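# One indicator column per age group.
# Note: as written, ages in (5, 6] match none of these groups.
df['f1_Toddler'] = 1.0*(df['Age'] <= 5)
df['f2_Teen'] = 1.0*((df['Age'] > 6) & (df['Age'] < 18) )
df['f3_Adult'] = 1.0*((df['Age'] >= 18) & (df['Age'] < 55) )
df['f4_Elder'] = 1.0*((df['Age'] >= 55) )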
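An alternative way to build age-group indicators is `pd.cut` followed by `pd.get_dummies`, which guarantees that every age lands in exactly one group. This is only a sketch: the bin edges below are an assumption (they place ages 5 to 6 with the toddlers, which differs slightly from the hand-written conditions), and the sample ages are made up.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.5, 5.0, 6.0, 17.0, 18.0, 54.0, 55.0, 80.0])  # toy ages

# Hypothetical edges: [0, 6), [6, 18), [18, 55), [55, inf)
bins = [0, 6, 18, 55, np.inf]
labels = ['Toddler', 'Teen', 'Adult', 'Elder']

# right=False makes each interval closed on the left and open on the right.
bucket = pd.cut(ages, bins=bins, labels=labels, right=False)

# One 0/1 column per group.
one_hot = pd.get_dummies(bucket, dtype=float)
print(one_hot)
```

Because the intervals tile the whole range, each row of `one_hot` sums to 1.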

In [10]:
print(pearsonr(df['f1_Toddler'],df['Survived']))
print(pearsonr(df['f2_Teen'],df['Survived']))
print(pearsonr(df['f3_Adult'],df['Survived']))
print(pearsonr(df['f4_Elder'],df['Survived']))

(0.15030438360027215, 6.6107330947242835e-06)
(0.023498939817841964, 0.483586050381892)
(-0.08830656015721033, 0.00835514004175807)
(-0.0339878171617528, 0.31087290830938435)


In [11]:
df.groupby(['Survived','Pclass'])[['Pclass']].count()

Survived,Pclass,Pclass (count)
0,1,80
0,2,97
0,3,372
1,1,136
1,2,87
1,3,119


In [12]:
df['f5_is_first'] = 1.0*(df['Pclass'] == 1)
df['f6_is_second'] = 1.0*(df['Pclass'] == 2)
df['f7_is_third'] = 1.0*(df['Pclass'] == 3)

In [13]:
df.groupby(['Survived','SibSp'])[['SibSp']].count()

Survived,SibSp,SibSp (count)
0,0,398
0,1,97
0,2,15
0,3,12
0,4,15
0,5,5
0,8,7
1,0,210
1,1,112
1,2,13


In [14]:
df.groupby(['Survived','Embarked'])[['Embarked']].count()

Survived,Embarked,Embarked (count)
0,C,75
0,Q,47
0,S,427
1,C,93
1,Q,30
1,S,217


In [15]:
feature_cols = ['f0_is_male','f1_Toddler','f2_Teen', 'f3_Adult','f4_Elder','f5_is_first','f6_is_second','f7_is_third']
df[feature_cols]

,f0_is_male,f1_Toddler,f2_Teen,f3_Adult,f4_Elder,f5_is_first,f6_is_second,f7_is_third
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...
886,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
887,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
888,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
889,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


In [20]:
X = df[feature_cols].to_numpy()
X.shape

(891, 8)

In [21]:
Y = df['Survived'].to_numpy()
Y.shape

(891,)

## Training and Testing Data
A test set is held-out data used to provide an unbiased evaluation of a final model fit on the training set. Once our data is in numerical form, we can split it into a training and a test set:

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)

In [23]:
X_train.shape, X_test.shape

((712, 8), (179, 8))
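The split above is random, so the numbers change from run to run. A minimal sketch of a reproducible variant follows, assuming you also want the class ratio preserved in both parts. The `random_state` and `stratify` arguments are additions not used in the original cell, and the arrays are stand-ins with the same shape and class counts as the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shape and class counts as the Titanic data.
X = np.zeros((891, 8))
Y = np.array([0] * 549 + [1] * 342)

# random_state fixes the shuffle; stratify=Y keeps the survivor share
# roughly equal in the training and test parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y
)

print(X_train.shape, X_test.shape)
print(Y_train.mean(), Y_test.mean())
```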

Another useful scikit-learn routine is the classification report. Once you have predictions, you can plug them in to get precision and recall scores. As a sanity check, comparing the test labels against themselves gives perfect scores:

In [24]:
from sklearn.metrics import classification_report
print(classification_report(Y_test, Y_test))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        99
           1       1.00      1.00      1.00        80

    accuracy                           1.00       179
   macro avg       1.00      1.00      1.00       179
weighted avg       1.00      1.00      1.00       179



In [26]:
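all_women = (X_test[:,0] == 0.0) # simple classifier: predict that exactly the women survive
all_die = (X_test[:,0]*0) # baseline: predict that nobody survives

# Note: classification_report expects (y_true, y_pred). With the predictions passed
# first, as here, the precision and recall columns trade places.
print(classification_report(all_women, Y_test))

print(classification_report(all_die, Y_test))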

              precision    recall  f1-score   support

       False       0.89      0.75      0.81       117
        True       0.64      0.82      0.72        62

    accuracy                           0.78       179
   macro avg       0.76      0.79      0.77       179
weighted avg       0.80      0.78      0.78       179

              precision    recall  f1-score   support

         0.0       1.00      0.55      0.71       179
         1.0       0.00      0.00      0.00         0

    accuracy                           0.55       179
   macro avg       0.50      0.28      0.36       179
weighted avg       1.00      0.55      0.71       179
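The scikit-learn metric functions take the ground truth first and the predictions second, and the order matters when reading reports like the ones above. A toy sketch with made-up labels, using `precision_score` and `recall_score` for the positive class:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 1]  # toy ground truth
y_pred = [0, 0, 1, 1]  # toy predictions: one false positive, no false negatives

# Documented order: ground truth first, predictions second.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))  # 0.5 1.0

# Swapped order: precision and recall trade places.
print(precision_score(y_pred, y_true), recall_score(y_pred, y_true))  # 1.0 0.5
```

Accuracy is unaffected by the order, but per-class precision, recall, and support are.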





## Learning Classifiers from Data
Luckily, we don't have to write such classifiers by hand (anymore...); we can learn these rules from data. One simple approach is called logistic regression. You can think of it as a generalization of a best-fit line to classifying data:

In [27]:
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression()
logit.fit(X_train, Y_train)

Y_pred = logit.predict(X_test)



In [28]:
print(classification_report(Y_pred , Y_test))

              precision    recall  f1-score   support

           0       0.99      0.70      0.82       141
           1       0.46      0.97      0.63        38

    accuracy                           0.75       179
   macro avg       0.73      0.83      0.72       179
weighted avg       0.88      0.75      0.78       179



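The logistic regression reaches 75% accuracy on this test split, well above the always-die baseline (55%) but slightly below the gender-only rule (78%). We can try to interpret what the logistic regression is doing: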

In [29]:
list(zip(logit.coef_[0], feature_cols))

[(-2.488473078346335, 'f0_is_male'),
 (1.1961845553548103, 'f1_Toddler'),
 (0.19063740516635383, 'f2_Teen'),
 (-0.1552550481560051, 'f3_Adult'),
 (-0.8409967031899755, 'f4_Elder'),
 (1.3416117778992407, 'f5_is_first'),
 (0.4301507479670454, 'f6_is_second'),
 (-0.8101630338291245, 'f7_is_third')]
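One common way to read such coefficients is as odds ratios: exponentiating a coefficient gives the factor by which the odds of survival are multiplied when that feature switches from 0 to 1, with the other inputs held fixed. The sketch below uses a few of the coefficients printed above, rounded to three decimals; the dictionary is only for illustration.

```python
import math

# A few coefficients copied from the output above, rounded to three decimals.
coefs = {
    'f0_is_male': -2.488,
    'f1_Toddler': 1.196,
    'f5_is_first': 1.342,
    'f7_is_third': -0.810,
}

# exp(coefficient) is the multiplicative change in the odds of survival.
odds_ratios = {name: math.exp(w) for name, w in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Under this reading, being male cuts the odds of survival to roughly a twelfth, while travelling first class multiplies them by a bit under four.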

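Most models have "hyper"-parameters that affect accuracy as well. For logistic regression, C is the inverse of the regularization strength (smaller values regularize more strongly), and changing it can shift the tradeoff between precision and recall in a dataset like this: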

In [36]:
logit = LogisticRegression(C=100)
logit.fit(X_train, Y_train)
Y_pred = logit.predict(X_test)
print('C=100',classification_report(Y_pred , Y_test))

C=100               precision    recall  f1-score   support

           0       0.89      0.77      0.82       115
           1       0.66      0.83      0.74        64

    accuracy                           0.79       179
   macro avg       0.78      0.80      0.78       179
weighted avg       0.81      0.79      0.79       179
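To get a feel for what C does, the sketch below fits the same model with several values of C and prints the size of the learned coefficient vector. It uses synthetic data from `make_classification` rather than the Titanic features, and the particular C values and `max_iter` setting are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the Titanic features.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

norms = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means a stronger penalty on large coefficients.
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {norms[C]:.3f}")
```

With a small C the coefficients are pulled toward zero, so the model makes more cautious predictions; with a large C it is free to fit the training data more closely.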





It is usually up to you to search over the hyperparameters to find the best performance for your problem. The benefit of scikit-learn is that you can try out a bunch of different classifiers:

In [37]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X_train, Y_train)
Y_pred = tree.predict(X_test)
print('Tree',classification_report(Y_pred , Y_test))

Tree               precision    recall  f1-score   support

           0       1.00      0.69      0.81       144
           1       0.44      1.00      0.61        35

    accuracy                           0.75       179
   macro avg       0.72      0.84      0.71       179
weighted avg       0.89      0.75      0.77       179



In [38]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()
forest.fit(X_train, Y_train)
Y_pred = forest.predict(X_test)
print('Forest',classification_report(Y_pred , Y_test))

Forest               precision    recall  f1-score   support

           0       1.00      0.69      0.81       144
           1       0.44      1.00      0.61        35

    accuracy                           0.75       179
   macro avg       0.72      0.84      0.71       179
weighted avg       0.89      0.75      0.77       179
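Because these classifiers share the same `fit`/`predict` interface, the comparison above can be written as a loop. The following is a sketch on synthetic data; the dataset, the `random_state` values, and the model settings are assumptions chosen only to make the example reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic features.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

models = {
    'logit': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(random_state=0),
    'forest': RandomForestClassifier(n_estimators=50, random_state=0),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Ground truth first, predictions second.
    test_accuracy[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {test_accuracy[name]:.3f}")
```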



