# Classification using scikit-learn

This notebook gives an overview of widely used supervised ML models in scikit-learn, specifically for classification problems.
- **Supervised Learning**: learn a model from a dataset of labeled training examples
- **Classification Problem**: assign a discrete class based on certain input features
    - *Binary Classification*: cancer or no cancer, cats vs dogs
    - *Multiclass Classification*: handwritten digits, cats vs dogs vs monkeys
- **scikit-learn**: Simple Python ML Library, https://scikit-learn.org/

## Dataset
**UCI Heart Disease Dataset**: https://www.kaggle.com/ronitf/heart-disease-uci<br/>
Goal: predict the presence/absence of heart disease based on the following health-related features

- *age*: age in years 
- *sex*: (1 = male; 0 = female) 
- *cp*: chest pain type 
- *trestbps*: resting blood pressure (in mm Hg on admission to the hospital) 
- *chol*: serum cholesterol in mg/dl
- *fbs*: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) 
- *restecg*: resting electrocardiographic results 
- *thalach*: maximum heart rate achieved 
- *exang*: exercise induced angina (1 = yes; 0 = no) 
- *oldpeak*: ST depression induced by exercise relative to rest 
- *slope*: the slope of the peak exercise ST segment 
- *ca*: number of major vessels (0-3) colored by fluoroscopy
- *thal*: 3 = normal; 6 = fixed defect; 7 = reversible defect
- *target*: have disease or not (1=yes, 0=no)

**Data Preprocessing Techniques**:
- *One-Hot Encoding*: for categorical features
    - for example, the sex feature above is coded as (1 = male; 0 = female)
    - this may cause some models to learn a spurious ordering such as male > female
    - instead, we one-hot encode and create 2 features: is_female, is_male
    - at a low level, this looks like: [1 0 1] => [[0 1] [1 0] [0 1]] 
- *Feature Normalization*: scale from 0 to 1
    - the range of values for different features usually varies widely
    - this can distort models that compare features directly (for example, distance-based models such as k-NN), because features with large ranges dominate
    - thus, we rescale all features from 0 to 1 using min-max normalization
    - for a feature x, the formula is: x' = (x - min(x)) / (max(x) - min(x))
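
The two preprocessing steps above can be sketched with pandas as follows. This is a minimal illustration on a made-up three-row frame, not the code in `load_data.py` (which is not shown here); the name `toy` and its values are assumptions for the example.

```python
import pandas as pd

# Toy frame (hypothetical values): one categorical and one numeric feature.
toy = pd.DataFrame({"sex": [1, 0, 1], "chol": [200.0, 250.0, 300.0]})

# One-hot encoding: 'sex' is replaced by one indicator column per category
# (sex_0 and sex_1 here, playing the role of is_female and is_male).
encoded = pd.get_dummies(toy, columns=["sex"], dtype=float)

# Min-max normalization: x' = (x - min(x)) / (max(x) - min(x)).
chol = encoded["chol"]
encoded["chol"] = (chol - chol.min()) / (chol.max() - chol.min())

# chol becomes 0.0, 0.5, 1.0; sex_1 is 1.0, 0.0, 1.0
print(encoded)
```

In practice the minimum and maximum are usually computed on the training data only and then reused for the testing data, so that no information from the test set leaks into preprocessing.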
    
**Data for Models**:
- *Training Data*: examples used to train the model
- *Testing Data*: examples used to test the model, separate from training data
- We randomly choose 75% of all data for training, and the remaining 25% for testing. 
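
One way such a random split could be produced is with `train_test_split`, which is already imported below. This is a sketch on synthetic arrays; the actual split logic lives in `load_data.py` and may differ. The `random_state` and `stratify` arguments are choices made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 examples, 2 features, balanced binary labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 25% of the examples for testing; stratify keeps class proportions
# similar in both parts, and random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
```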

In [172]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [173]:
# External script provides X_train, y_train, X_test, y_test as pandas DataFrames.
# X_train: features of training examples  
# y_train: labels of training examples
# X_test: features of testing examples  
# y_test: labels of testing examples

%run -i load_data.py

In [174]:
# Original Dataset 
data.head()

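| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |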


In [175]:
# Features of training examples, after preprocessing 
X_train.head()

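| | age | sex | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | ca | ... | cp_1 | cp_2 | cp_3 | thal_0 | thal_1 | thal_2 | thal_3 | slope_0 | slope_1 | slope_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 153 | 0.770833 | 0.0 | 0.490566 | 0.347032 | 0.0 | 0.0 | 0.618321 | 0.0 | 0.0 | 0.25 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 23 | 0.666667 | 1.0 | 0.528302 | 0.267123 | 1.0 | 0.5 | 0.503817 | 1.0 | 0.16129 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 263 | 0.708333 | 0.0 | 0.132075 | 0.326484 | 0.0 | 0.5 | 0.748092 | 1.0 | 0.290323 | 0.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 110 | 0.729167 | 0.0 | 0.811321 | 0.454338 | 0.0 | 0.5 | 0.633588 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 81 | 0.333333 | 1.0 | 0.320755 | 0.415525 | 0.0 | 0.0 | 0.755725 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |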


## Simple Classifiers

**Model Evaluation Metrics**:
- *Accuracy*: overall proportion of correct predictions
- *AUROC*: probability a classifier will rank a randomly chosen positive example higher than a randomly chosen negative example 
- *Precision*: proportion of predictions of a given class that are correct
- *Recall*: proportion of actual instances of a given class that are predicted correctly
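
To make the four metrics concrete, here is a small worked example on four made-up predictions. The labels, scores, and the 0.5 decision threshold are assumptions chosen for illustration. The last lines recompute AUROC directly from its pairwise-ranking definition.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth and classifier scores for four examples.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = (y_score >= 0.5).astype(int)  # thresholded predictions: [0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)    # 3 of 4 correct -> 0.75
prec = precision_score(y_true, y_pred)  # 1 of 1 predicted positives correct -> 1.0
rec = recall_score(y_true, y_pred)      # 1 of 2 actual positives found -> 0.5
auroc = roc_auc_score(y_true, y_score)  # 3 of 4 (positive, negative) pairs ranked correctly -> 0.75

# AUROC from its definition: fraction of (positive, negative) pairs
# in which the positive example receives the higher score.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(acc, prec, rec, auroc, pairwise)
```

Note that `precision_score` and `recall_score` report the positive class (label 1) by default, while `classification_report` lists both classes.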

**K-Nearest Neighbors (k-NN)**: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

- k-NN outputs the most common class among a given test element's *k* closest training examples (nearest neighbors)
- "closeness" is measured by feature similarity: how close two elements are in feature space (Euclidean distance with scikit-learn's default settings)
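
The idea can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it uses Euclidean distance and an unweighted majority vote, whereas the cell below passes `weights='distance'` so that closer neighbors count more. The function name `knn_predict` and the toy points are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Return the majority class among the k training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k smallest distances
    votes = Counter(y_train[nearest].tolist())   # count class labels among the neighbors
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical): class 0 near the origin, class 1 near (1, 1).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.2, 0.1]])
y_toy = np.array([0, 0, 1, 1, 0])

label_a = knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3)  # all 3 neighbors are class 0
label_b = knn_predict(X_toy, y_toy, np.array([0.95, 0.90]), k=3)  # 2 of 3 neighbors are class 1
print(label_a, label_b)
```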

In [176]:
from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=10, weights='distance')
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Accuracy:  0.7850877192982456
AUROC:  0.8818471833177716
              precision    recall  f1-score   support

           0       0.76      0.76      0.76       102
           1       0.81      0.80      0.80       126

   micro avg       0.79      0.79      0.79       228
   macro avg       0.78      0.78      0.78       228
weighted avg       0.79      0.79      0.79       228



**Naive Bayes**: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
- What is the probability that a given test element belongs to a class on the basis of its features?
- Conditional probabilities calculated using training examples and Bayes' Rule 
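
Written out, the two points above correspond to the standard naive Bayes decomposition. The "naive" part is the assumption that features are conditionally independent given the class. `GaussianNB` additionally models each feature's likelihood as a Gaussian; the symbols $\mu_{iy}$ and $\sigma_{iy}^2$ (mean and variance of feature $i$ within class $y$, estimated from the training examples) are notation introduced here.

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^2}} \exp\left(-\frac{(x_i - \mu_{iy})^2}{2\sigma_{iy}^2}\right)
$$

The predicted class is the $y$ that maximizes the right-hand side of the first expression.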

In [177]:
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB(var_smoothing=10)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Accuracy:  0.7982456140350878
AUROC:  0.8796296296296298
              precision    recall  f1-score   support

           0       0.78      0.76      0.77       102
           1       0.81      0.83      0.82       126

   micro avg       0.80      0.80      0.80       228
   macro avg       0.80      0.80      0.80       228
weighted avg       0.80      0.80      0.80       228



**Decision Trees**: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- Decision trees embody a flowchart-like structure; sequential, hierarchical decisions are used to choose a class for a given example. 
- Decision rules are usually based on feature thresholds (e.g. cholesterol level > X)
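
To see what a learned threshold rule looks like, the sketch below fits a depth-1 tree on six made-up cholesterol values and prints its rules with `export_text`. The data and the feature name `chol` are assumptions for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical one-feature dataset: low values are class 0, high values class 1.
X_toy = [[180], [190], [210], [260], [280], [300]]
y_toy = [0, 0, 0, 1, 1, 1]

# A depth-1 tree can learn exactly one threshold rule.
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# export_text renders the fitted tree as indented if/else rules.
rules = export_text(tree, feature_names=["chol"])
print(rules)

# The split lands midway between the two classes (210 and 260).
print(tree.tree_.threshold[0])  # 235.0
```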

In [178]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=3, min_samples_leaf=0.1, max_features=None)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]   

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Accuracy:  0.7236842105263158
AUROC:  0.7903439153439153
              precision    recall  f1-score   support

           0       0.69      0.70      0.69       102
           1       0.75      0.75      0.75       126

   micro avg       0.72      0.72      0.72       228
   macro avg       0.72      0.72      0.72       228
weighted avg       0.72      0.72      0.72       228



**Logistic Regression**: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

- Logistic Regression finds a boundary (hyperplane) to distinguish classes in feature space
- Feature Space: n-dimensional space (where n is the number of features); the features of a given example determine its coordinates in the n-dimensional space.
- In comparison to SVM, Logistic Regression models the probability of each class and finds the boundary that maximizes the likelihood of the training labels.
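
The link between the hyperplane and the predicted probabilities can be checked directly: for a binary problem, the probability of class 1 is the sigmoid of the signed score `w·x + b`. The sketch below uses synthetic data from `make_classification` (an assumption; it is not the heart disease data) with the same hyperparameters as the cell that follows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X_toy, y_toy)

# Signed score relative to the hyperplane: z = w . x + b
z = X_toy @ model.coef_.ravel() + model.intercept_[0]

# Sigmoid of the score gives the predicted probability of class 1.
manual_proba = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(manual_proba, model.predict_proba(X_toy)[:, 1]))
```

Examples with `z > 0` fall on the class-1 side of the boundary, which corresponds to a predicted probability above 0.5.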

In [179]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l2', solver='liblinear', C=0.1)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]   

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Accuracy:  0.7807017543859649
AUROC:  0.8765172735760971
              precision    recall  f1-score   support

           0       0.74      0.78      0.76       102
           1       0.82      0.78      0.80       126

   micro avg       0.78      0.78      0.78       228
   macro avg       0.78      0.78      0.78       228
weighted avg       0.78      0.78      0.78       228



**Support Vector Machines (SVM)**: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

- SVM finds a boundary (hyperplane) to distinguish classes in feature space
- Feature Space: n-dimensional space (where n is the number of features); the features of a given example determine its coordinates in the n-dimensional space.
- In comparison to logistic regression, SVM finds the boundary with the widest possible separating margin between the classes in the feature space.
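
One way to see the difference between the two models is through the per-example loss each one minimizes, expressed as a function of the margin `m = y * f(x)` with labels in {-1, +1}. The sketch below evaluates the hinge loss, the squared hinge loss (the default `loss` of `LinearSVC`), and the logistic loss at a few hand-picked margins; the margin values are arbitrary.

```python
import numpy as np

# Margins m = y * f(x): negative means misclassified, >= 1 means
# correctly classified and outside the SVM margin.
margins = np.array([-1.0, 0.0, 1.0, 3.0])

hinge = np.maximum(0.0, 1.0 - margins)      # [2, 1, 0, 0]
squared_hinge = hinge ** 2                  # [4, 1, 0, 0]
logistic = np.log1p(np.exp(-margins))       # always > 0, shrinks as the margin grows

for m, h, s, l in zip(margins, hinge, squared_hinge, logistic):
    print(f"margin={m:+.1f}  hinge={h:.2f}  squared_hinge={s:.2f}  logistic={l:.4f}")
```

The hinge-type losses are exactly zero once an example is beyond the margin, so only examples near or across the boundary influence the SVM solution. The logistic loss never reaches zero, so every example contributes at least a little.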

In [None]:
from sklearn.svm import LinearSVC

clf = LinearSVC(penalty='l2', C=0.01, dual=False)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
# LinearSVC has no predict_proba; decision_function scores are enough to rank examples for AUROC
scores = clf.decision_function(X_test)

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

## Gridsearch
**Gridsearch**: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- *Hyperparameters*: settings of the model that are chosen before training and that can be adjusted across successive trials to improve performance.
- *Gridsearch*: exhaustive search over a grid of candidate hyperparameter values to find the best-performing combination for a given model
- *K-Fold Cross Validation*: evaluate specific hyperparameters by training and validating on different folds of the training data (the testing data is used only for the final evaluation, after tuning hyperparameters)
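
What a grid search does internally can be approximated by hand: for each candidate value, train on all but one fold, score on the held-out fold, and average. The sketch below does this for k-NN on synthetic data. It is a simplification, since `GridSearchCV` with an integer `cv` uses stratified folds for classifiers, while this example uses a shuffled `KFold`. The helper `mean_cv_accuracy` and the candidate list are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=150, n_features=5, random_state=0)


def mean_cv_accuracy(n_neighbors, n_splits=3):
    """Average validation accuracy of k-NN over n_splits folds."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    accs = []
    for train_idx, val_idx in folds.split(X_toy):
        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        model.fit(X_toy[train_idx], y_toy[train_idx])              # train on the other folds
        accs.append(model.score(X_toy[val_idx], y_toy[val_idx]))   # validate on the held-out fold
    return float(np.mean(accs))


# Evaluate each candidate and keep the one with the best mean accuracy.
candidates = [1, 5, 15]
cv_scores = {k: mean_cv_accuracy(k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```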

In [180]:
from sklearn import neighbors
from sklearn.model_selection import GridSearchCV

base_clf = neighbors.KNeighborsClassifier()
parameters = {'n_neighbors': [1, 2, 5, 10, 15, 25], 'weights': ['uniform', 'distance']}

clf = GridSearchCV(base_clf, parameters, cv=3)
clf.fit(X_train, y_train)
print('Best Hyperparameters: ', clf.best_params_, '\n')

pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]   

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Best Hyperparameters:  {'weights': 'distance', 'n_neighbors': 10} 

Accuracy:  0.7850877192982456
AUROC:  0.8818471833177716
              precision    recall  f1-score   support

           0       0.76      0.76      0.76       102
           1       0.81      0.80      0.80       126

   micro avg       0.79      0.79      0.79       228
   macro avg       0.78      0.78      0.78       228
weighted avg       0.79      0.79      0.79       228



## Ensemble Methods

**Ensemble Methods**: https://scikit-learn.org/stable/modules/ensemble.html <br/>
Ensemble methods combine the predictions of several base classifiers, which helps improve generalizability / robustness over a single classifier. There are two common approaches:
- *Averaging Methods*: build several classifiers independently and average their predictions
- *Boosting Methods*: build several classifiers sequentially, where each new classifier tries to improve the previous one

**Random Forest**: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

- A random forest is an averaging ensemble method that combines several decision tree classifiers.
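
The "averaging" can be verified directly: the forest's predicted probabilities are the mean of the probabilities predicted by its individual trees, which are exposed as `estimators_` after fitting. The sketch uses synthetic data and small hyperparameter values chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X_toy, y_toy)

# Stack each tree's class probabilities: shape (n_trees, n_samples, n_classes).
per_tree = np.stack([tree.predict_proba(X_toy) for tree in forest.estimators_])

# Averaging over the tree axis reproduces the forest's own predict_proba.
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X_toy)))
```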

In [181]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

base_clf = RandomForestClassifier(n_estimators=100,
                             max_features=None, 
                             min_samples_leaf=1,
                             min_samples_split=2)
parameters = {'n_estimators': [10, 100], 'min_samples_split': [0.1, 2], 'min_samples_leaf': [0.1, 1], 'max_depth': [1, 3, 5]}

clf = GridSearchCV(base_clf, parameters, cv=3)
clf.fit(X_train, y_train)
print('Best Hyperparameters: ', clf.best_params_, '\n')

pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]   

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Best Hyperparameters:  {'min_samples_leaf': 0.1, 'max_depth': 5, 'min_samples_split': 0.1, 'n_estimators': 10} 

Accuracy:  0.7456140350877193
AUROC:  0.8569872393401805
              precision    recall  f1-score   support

           0       0.74      0.67      0.70       102
           1       0.75      0.81      0.78       126

   micro avg       0.75      0.75      0.75       228
   macro avg       0.74      0.74      0.74       228
weighted avg       0.75      0.75      0.74       228



**Gradient Boosted Decision Trees (GBDT)**: 
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier

- GBDT is a boosting ensemble method that combines several decision tree classifiers.
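
The sequential nature of boosting can be observed with `staged_predict_proba`, which yields the ensemble's predictions after each added tree. The sketch below tracks the training log loss stage by stage on synthetic data; the dataset and hyperparameter values are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=1, random_state=0)
gbdt.fit(X_toy, y_toy)

# One entry per boosting stage: the loss of the ensemble built so far.
train_losses = [log_loss(y_toy, proba) for proba in gbdt.staged_predict_proba(X_toy)]

print(f"after 1 tree:   {train_losses[0]:.4f}")
print(f"after 50 trees: {train_losses[-1]:.4f}")
```

The training loss falls as trees are added because each new tree is fit to correct the errors of the current ensemble. Loss on held-out data need not keep falling, which is why `n_estimators` is tuned in the grid search below.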

In [182]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

base_clf = GradientBoostingClassifier()
parameters = {'n_estimators': [10, 100], 'min_samples_split': [0.1, 2], 'min_samples_leaf': [0.1, 1], 'max_depth': [1, 3, 5]}

clf = GridSearchCV(base_clf, parameters, cv=3)
clf.fit(X_train, y_train)
print('Best Hyperparameters: ', clf.best_params_, '\n')

pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]   

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Best Hyperparameters:  {'min_samples_leaf': 0.1, 'max_depth': 1, 'min_samples_split': 0.1, 'n_estimators': 100} 

Accuracy:  0.7807017543859649
AUROC:  0.8640678493619671
              precision    recall  f1-score   support

           0       0.77      0.73      0.75       102
           1       0.79      0.83      0.81       126

   micro avg       0.78      0.78      0.78       228
   macro avg       0.78      0.78      0.78       228
weighted avg       0.78      0.78      0.78       228



**Voting**: 
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier

- Voting is an averaging ensemble method that combines several classifiers of different types.
- The example here uses voting with k-NN, GBDT, and Random Forests.  
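
With `voting='soft'`, as used below, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch checks this on synthetic data with two members (k-NN and a small random forest); the data and settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

vote = VotingClassifier(
    estimators=[('knn', KNeighborsClassifier()),
                ('rf', RandomForestClassifier(n_estimators=10, random_state=0))],
    voting='soft',
)
vote.fit(X_toy, y_toy)

# Unweighted mean of the fitted members' class probabilities.
manual = np.mean([est.predict_proba(X_toy) for est in vote.estimators_], axis=0)

print(np.allclose(manual, vote.predict_proba(X_toy)))
```

The names given to the members (`'knn'`, `'rf'`) are what make the `name__parameter` keys in the grid search below work.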

In [183]:
from sklearn.ensemble import VotingClassifier
from sklearn import neighbors
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

clf1 = neighbors.KNeighborsClassifier()
clf2 = RandomForestClassifier()
clf3 = GradientBoostingClassifier(max_depth=1, min_samples_split=0.1)

voting_clf = VotingClassifier(estimators=[('knn', clf1), ('rf', clf2), ('gb', clf3)], voting='soft')

# We can tune individual parameters of each classifier using gridsearch as follows 
params = {'knn__n_neighbors': [5, 10, 25], 
          'rf__n_estimators': [10, 100], 'rf__min_samples_split': [0.1, 2], 'rf__min_samples_leaf': [0.1, 1], 'rf__max_depth': [1, 3, 5],
          'gb__n_estimators': [10, 100], 'gb__min_samples_split': [0.1, 2], 'gb__min_samples_leaf': [0.1, 1], 'gb__max_depth': [1, 3, 5]}

clf = GridSearchCV(voting_clf, params, cv=3)
clf.fit(X_train, y_train)
print(clf.best_params_)

pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]   

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

{'rf__n_estimators': 10, 'gb__min_samples_split': 2, 'gb__min_samples_leaf': 0.1, 'gb__n_estimators': 10, 'gb__max_depth': 1, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 2, 'knn__n_neighbors': 10, 'rf__max_depth': 3}
Accuracy:  0.8026315789473685
AUROC:  0.885932150638033
              precision    recall  f1-score   support

           0       0.79      0.76      0.78       102
           1       0.81      0.83      0.82       126

   micro avg       0.80      0.80      0.80       228
   macro avg       0.80      0.80      0.80       228
weighted avg       0.80      0.80      0.80       228

