# Machine Learning Introduction: Classification

# Artificial Intelligence Introduction

What is the difference between AI, ML and DL?

![](images/AI_ML_DL.png)

Machine Learning is a wide field with many applications.

Its two most common families are:
- Supervised learning: you know the target you want to learn
- Unsupervised learning: you only have features, no target labels

![](images/supervised_unsupervised.png)

Supervised learning can be separated into two main parts:
- Regression: predict a continuous value (e.g. house price)
- Classification: predict a category (e.g. dog or cat)

![](images/classif_vs_reg.jpg)

We will begin by doing a regression, and then we will work on a classification.

We will use a very famous French library to do so: **scikit-learn**.

![](images/sklearn.png)

# I. Data exploration and preparation

We will now apply classification algorithms to a very common problem: customer churn.

This is a binary classification: the idea is to predict whether a customer is a churner (will leave for another company) or not.

We will use the dataset `telecom.csv`, which contains 3333 customers described by 18 features and 1 target.

Have a look at it, explore it briefly if you want, and then move on to data preparation.

## I.1 Data cleaning



In [1]:
import pandas as pd
import numpy as np

churn=pd.read_csv("telecom.csv")

churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 19 columns):
Account Length    3333 non-null int64
Area Code         3333 non-null int64
Int'l Plan        3333 non-null object
VMail Plan        3333 non-null object
VMail Message     3333 non-null int64
Day Mins          3333 non-null float64
Day Calls         3333 non-null int64
Day Charge        3333 non-null float64
Eve Mins          3333 non-null float64
Eve Calls         3333 non-null int64
Eve Charge        3333 non-null float64
Night Mins        3333 non-null float64
Night Calls       3333 non-null int64
Night Charge      3333 non-null float64
Intl Mins         3333 non-null float64
Intl Calls        3333 non-null int64
Intl Charge       3333 non-null float64
CustServ Calls    3333 non-null int64
Churn?            3333 non-null object
dtypes: float64(8), int64(8), object(3)
memory usage: 494.9+ KB


In [2]:
churn.head()

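| | Account Length | Area Code | Int'l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 415 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False. |
| 1 | 107 | 415 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False. |
| 2 | 137 | 415 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False. |
| 3 | 84 | 408 | yes | no | 0 | 299.4 | 71 | 50.9 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False. |
| 4 | 75 | 415 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False. |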


In [3]:
churn.duplicated().sum()

0

We seem to have no missing data and no duplicates.
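
The two checks can also be made explicit by counting missing values and duplicated rows directly. The snippet below is a minimal sketch on a small toy DataFrame (the column names and values are made up for illustration and are not taken from `telecom.csv`):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing value and one duplicated row
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 1.0],
                    "b": ["x", "y", "z", "x"]})

missing_per_column = toy.isna().sum()   # number of NaN values in each column
n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row

print(missing_per_column)
print("duplicated rows:", n_duplicates)
```

On the churn data, the same two calls applied to `churn` should both report zeros, in line with the `info()` output above.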

Exploring the data further would be a good idea, but for pedagogical reasons we will go directly to data preparation.

## I.2 Data Preparation

What are the steps of data preparation?

- one-hot-encoding of categorical data
- rescaling of quantitative data
- feature engineering (optional for a first iteration)
- define `X` and `y`
- data splitting into train and test sets

### Categorical data

What are the categorical data here that we have to process?

How do we process them?

In [4]:
categorical_cols = ['Area Code', "Int'l Plan", 'VMail Plan', 'Churn?']

dummies = pd.get_dummies(churn[categorical_cols].astype(str), drop_first=True)
dummies.head()

| | Area Code_415 | Area Code_510 | Int'l Plan_yes | VMail Plan_yes | Churn?_True. |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 | 0 |
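
To see what `drop_first=True` does, here is a small sketch on toy data (the `plan` and `area` columns and their values are hypothetical). With `drop_first=True`, the first category of each column is dropped, because it can be deduced from the remaining dummy columns:

```python
import pandas as pd

# Toy data (hypothetical values, not taken from telecom.csv)
toy = pd.DataFrame({"plan": ["no", "yes", "no"],
                    "area": ["408", "415", "510"]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each column dropped

print(list(full.columns))     # plan_no, plan_yes, area_408, area_415, area_510
print(list(reduced.columns))  # plan_yes, area_415, area_510
```

Note that recent pandas versions return boolean dummy columns by default; passing `dtype=int` to `get_dummies` gives 0/1 integers like in the output above.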


### Quantitative data

Now how do we process quantitative data?

In [5]:
scaled = churn.drop(columns=categorical_cols)
scaled = (scaled - scaled.mean())/scaled.std()

scaled.describe()

| | Account Length | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 1.449652e-16 | 8.527366e-17 | -3.446655e-15 | -2.025249e-16 | -2.205923e-15 | 4.881917e-16 | 3.336332e-16 | 2.607242e-15 | -2.164885e-15 | -4.849939e-17 | -6.284668e-15 | 1.569035e-15 | 7.461445e-18 | 1.382712e-14 | 5.969156e-17 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | -2.512795 | -0.5916711 | -3.300601 | -5.004496 | -3.300667 | -3.963027 | -5.025157 | -3.963085 | -3.513121 | -3.429355 | -3.514838 | -3.666863 | -1.820015 | -3.66766 | -1.18804 |
| 25% | -0.6796428 | -0.5916711 | -0.6623247 | -0.6694697 | -0.6622766 | -0.6779283 | -0.6582622 | -0.6782106 | -0.669754 | -0.6698335 | -0.667579 | -0.6222756 | -0.6011049 | -0.6163417 | -0.4278678 |
| 50% | -0.0016274 | -0.5916711 | -0.006886644 | 0.02812069 | -0.006729054 | 0.008274899 | -0.005737769 | 0.008458004 | 0.00648483 | -0.005504263 | 0.004690538 | 0.02246056 | -0.1948014 | 0.02045516 | -0.4278678 |
| 75% | 0.6512763 | 0.8694238 | 0.6724189 | 0.6758832 | 0.6725781 | 0.6767314 | 0.6969809 | 0.676568 | 0.6807464 | 0.658825 | 0.681354 | 0.6671967 | 0.6178056 | 0.6705186 | 0.3323046 |
| max | 3.564231 | 3.134121 | 3.13995 | 3.217105 | 3.140331 | 3.208584 | 3.507855 | 3.207498 | 3.838505 | 3.827165 | 3.836188 | 3.496872 | 6.306055 | 3.496304 | 5.653511 |
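
The manual rescaling above is a standardization (z-score). scikit-learn offers `StandardScaler` for the same purpose, with one small difference worth knowing: pandas' `.std()` uses the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below illustrates this on toy values (the `minutes` column is hypothetical); with 3333 rows the difference is negligible.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values (hypothetical, not taken from telecom.csv)
values = pd.DataFrame({"minutes": [100.0, 150.0, 200.0, 250.0]})

# Manual z-score, as in the cell above: pandas .std() defaults to ddof=1
manual = (values - values.mean()) / values.std()

# StandardScaler divides by the population standard deviation (ddof=0)
sk = StandardScaler().fit_transform(values)

print(manual["minutes"].std(ddof=1))  # unit sample standard deviation
print(sk.std(ddof=0))                 # unit population standard deviation
print(np.allclose(manual.to_numpy(), sk))  # the two results differ slightly
```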


### Data splitting

We now have to define our `X` and `y` and then split into train and test sets.

In [6]:
concat = pd.concat([dummies, scaled], axis=1)
y = concat['Churn?_True.']
X = concat.drop(columns='Churn?_True.')

X.shape, y.shape

((3333, 19), (3333,))

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
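
The `stratify=y` argument keeps the proportion of churners the same in the train and test sets, which matters when classes are imbalanced. Here is a minimal sketch on a made-up target with 15% positives (`X_toy` and `y_toy` are hypothetical, not the churn data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (hypothetical): 150 positives out of 1000 samples
y_toy = np.array([1] * 150 + [0] * 850)
X_toy = np.arange(1000).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# The positive rate is preserved in both subsets
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```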

# II. Model training and optimization

We now have our data well prepared. 

The next step is model training!

There are many classification models in ML:
- Logistic Regression
- k-NN
- SVM
- Gradient Boosting
- Random Forest
- ...

They are all available in scikit-learn, so you won't have to rewrite the algorithms!

And they all work the same way:
- first you instantiate the model
- then you train the model with the `.fit(X, y)` method
- finally you can predict with the `.predict(X)` method
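
The three steps above can be sketched on synthetic data as follows. The dataset comes from `make_classification` and the variable names are chosen for illustration only; any scikit-learn classifier could replace `LogisticRegression` here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = LogisticRegression()   # 1. instantiate the model
model.fit(X_tr, y_tr)          # 2. train it on the training set
preds = model.predict(X_te)    # 3. predict on unseen data

print(preds[:10])
```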

## II.1 Model training

Let's now instantiate and train some models:
- k-NN
- logistic regression
- gradient boosting

Up to you to test more models!

In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

knn = KNeighborsClassifier()
lr = LogisticRegression()
gdb = GradientBoostingClassifier()

knn.fit(X_train, y_train)
lr.fit(X_train, y_train)
gdb.fit(X_train, y_train)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

## II.2 Model evaluation

We can now evaluate the performance of our models.

What metrics would you use to evaluate the performance of a binary classification?

Well, there are **many** metrics to evaluate a binary classification:
- Accuracy
- Precision
- Recall
- F-score
- ROC AUC
- ...

Let's see how to compute the most common of these metrics, and when to use them.

### Accuracy

This is the most intuitive metric: the percentage of correctly classified samples.

It can be computed with the scikit-learn function `accuracy_score`:

In [9]:
from sklearn.metrics import accuracy_score

y_pred_knn = knn.predict(X_test)
y_pred_lr = lr.predict(X_test)
y_pred_gdb = gdb.predict(X_test)


accuracy_knn = accuracy_score(y_test, y_pred_knn)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
accuracy_gdb = accuracy_score(y_test, y_pred_gdb)

print("Accuracy kNN:", accuracy_knn)
print("Accuracy LR:", accuracy_lr)
print("Accuracy GDB:", accuracy_gdb)

Accuracy kNN: 0.881559220389805
Accuracy LR: 0.8590704647676162
Accuracy GDB: 0.9385307346326837


### Confusion matrix

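The confusion matrix gives more than one value: for each class (churner or not), it counts the correctly and incorrectly classified samples. In scikit-learn, rows correspond to the true classes and columns to the predicted classes.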

In [10]:
from sklearn.metrics import confusion_matrix

confusion_matrix_knn = confusion_matrix(y_test, y_pred_knn)
confusion_matrix_lr = confusion_matrix(y_test, y_pred_lr)
confusion_matrix_gdb = confusion_matrix(y_test, y_pred_gdb )


print("Confusion matrix kNN:", confusion_matrix_knn, sep="\n")
print("Confusion matrix LR:", confusion_matrix_lr, sep="\n")
print("Confusion matrix GDB:", confusion_matrix_gdb, sep="\n")

Confusion matrix kNN:
[[561   9]
 [ 70  27]]
Confusion matrix LR:
[[551  19]
 [ 75  22]]
Confusion matrix GDB:
[[557  13]
 [ 28  69]]
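
The accuracy can be recovered from a confusion matrix, and the matrix also shows what accuracy hides on imbalanced data. The sketch below copies the GDB matrix printed above and compares its accuracy to that of a trivial model that always predicts "no churn" (the variable names are chosen for illustration):

```python
import numpy as np

# GDB confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm_gdb = np.array([[557, 13],
                   [28, 69]])
tn, fp, fn, tp = cm_gdb.ravel()

accuracy = (tn + tp) / cm_gdb.sum()   # correctly classified samples
baseline = (tn + fp) / cm_gdb.sum()   # always predicting "no churn"

print("accuracy:", accuracy)
print("baseline:", baseline)
```

The baseline is about 0.855, so the logistic regression accuracy of about 0.859 reported earlier is barely better than predicting that nobody churns. This is why accuracy alone is not enough here.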


### Recall, precision and F-score

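Precision is the fraction of predicted positives that are truly positive, TP / (TP + FP). Recall is the fraction of actual positives that are detected, TP / (TP + FN). The F-score is the harmonic mean of precision and recall: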

$$F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

#### Visual representation of Precision and Recall

<img src='images/precision_rappel.png' width='300px' class="center">
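
As a worked example, precision, recall and F-score for the churner class can be computed by hand from the GDB confusion matrix shown earlier (69 true positives, 13 false positives, 28 false negatives). This is a minimal sketch; scikit-learn's `precision_score`, `recall_score` and `f1_score` compute the same quantities from `y_test` and the predictions.

```python
# Counts for the positive class (churners), read from the GDB confusion matrix
tp, fp, fn = 69, 13, 28

precision = tp / (tp + fp)   # among predicted churners, how many really churn
recall = tp / (tp + fn)      # among real churners, how many are detected
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.84 0.71 0.77
```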


### Classification Report

All these metrics can be computed individually with the corresponding scikit-learn functions (`precision_score`, `recall_score`, `f1_score`), or all together with `classification_report()`:

In [11]:
from sklearn.metrics import classification_report
print("GDB classification report:", classification_report(y_test, y_pred_gdb) ,sep="\n")

GDB classification report:
              precision    recall  f1-score   support

           0       0.95      0.98      0.96       570
           1       0.84      0.71      0.77        97

    accuracy                           0.94       667
   macro avg       0.90      0.84      0.87       667
weighted avg       0.94      0.94      0.94       667



### ROC AUC

The ROC curve (Receiver Operating Characteristic) plots the true positive rate against the false positive rate as the decision threshold varies. It is a very informative curve, but we usually need a single value, which is why the ROC AUC (Area Under the Curve) is commonly used.

A totally random model would have a ROC AUC of 0.5, while a perfect model would have a ROC AUC of 1.

<img src='images/ROCcurve.png' width='300px' class="center">


In [12]:
from sklearn.metrics import roc_auc_score

auc_knn = roc_auc_score(y_test, knn.predict_proba(X_test)[:,1])
auc_lr = roc_auc_score(y_test, lr.predict_proba(X_test)[:,1])
auc_gdb = roc_auc_score(y_test, gdb.predict_proba(X_test)[:,1])

print("ROC AUC kNN:" ,auc_knn)
print("ROC AUC LR:" ,auc_lr)
print("ROC AUC GDB:" ,auc_gdb)

ROC AUC kNN: 0.794592150479291
ROC AUC LR: 0.8170012660517273
ROC AUC GDB: 0.8874660879001628
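
Note that the ROC AUC is computed from scores (here the predicted probability of the positive class), not from hard 0/1 predictions. One way to read it: it is the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The sketch below checks this on four made-up samples (`y_true` and `scores` are hypothetical):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores (hypothetical)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the ROC curve, computed from the curve itself
fpr, tpr, thresholds = roc_curve(y_true, scores)
area = auc(fpr, tpr)

# Same value as the fraction of (positive, negative) pairs ranked correctly
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(area, roc_auc_score(y_true, scores), pairwise)  # 0.75 for all three
```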


## II.3 Hyperparameter optimization: GridSearch

We can now perform a grid search to try to improve our performance.

Let's try to improve our Gradient Boosting.

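To help you out, here is the signature of the Gradient Boosting Classifier in scikit-learn, as of the version used when this notebook was written (some of these parameters, such as `presort` and `min_impurity_split`, have been removed in more recent releases):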

```python
GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001)
```

For more information, you can go check the [online documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).

**Exercise:**

Optimize the hyperparameters of your model.

Here is what you have to do:
- Select the hyperparameters you want to optimize
- Build a Python dict containing the hyperparameter values to test (do not try too many, as it may take a while...)
- Instantiate your grid search object
- Fit your grid search and have a look at the best params
- Evaluate your optimized model: did the performance improve?

In [13]:
from sklearn.model_selection import GridSearchCV

params= {"max_depth":[3, 5], "n_estimators":[10, 50, 100]}

grid = GridSearchCV(GradientBoostingClassifier(), params, 
                    scoring="accuracy", cv=5)

grid.fit(X_train, y_train)

print('best params are:', grid.best_params_)

print('the optimized accuracy is:', accuracy_score(y_test, grid.predict(X_test)))

best params are: {'max_depth': 5, 'n_estimators': 100}
the optimized accuracy is: 0.9475262368815592
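
Beyond `best_params_`, a fitted `GridSearchCV` also exposes the cross-validated score of every combination through `cv_results_`, and the best mean cross-validation score through `best_score_`. The sketch below shows how to inspect them; it runs on synthetic data from `make_classification` with a deliberately small grid, so the names and values are for illustration only:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

grid_demo = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"max_depth": [2, 3], "n_estimators": [10, 50]},
    scoring="accuracy",
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

# One row per hyperparameter combination, best first
results = pd.DataFrame(grid_demo.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
print("best CV accuracy:", grid_demo.best_score_)
```

Keep in mind that `best_score_` is measured by cross-validation on the training data, so it is not the same number as the accuracy on the held-out test set.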


# III. Titanic classification

You can now reload the Titanic data prepared in the previous course, using pickle:
```python
import pickle
data = pickle.load(open('my_data.pkl', 'rb'))  # `data` is an arbitrary variable name
```

Then perform classification on it, and try to get the best accuracy you can!