# Machine Learning introduction course

Freely derived from https://www.kaggle.com/c/titanic

In [None]:
import numpy as np
import pandas as pd

In [None]:
# upgrade seaborn to solve an issue with annotations in the correlation matrix
# !pip install seaborn --upgrade

## Read the data

You should know how to do that by now... hint: 🐼

In [None]:
titanic = 

## Explore the data

In [None]:
titanic.info()

#### Columns in the dataset

0. **pclass**: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd). Proxy for socio-economic status.

1. **survived**: Survival (0 = No; 1 = Yes).

2. **name**: Name of the passenger.

3. **sex**: Sex of the passenger.

4. **age**: Age of the passenger.

5. **sibsp**: Number of Siblings/Spouses Aboard.

6. **parch**: Number of Parents/Children Aboard.

7. **ticket**: Ticket Number.

8. **fare**: Passenger Fare.

9. **cabin**: Cabin number.

10. **embarked**: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).

11. **boat**: Lifeboat (if the passenger survived and was on a lifeboat).

12. **body**: Body identification number (if the passenger did not survive and their body was found).

13. **home.dest**: Home/Destination of the passenger.


In [None]:
titanic.head()

In [None]:
titanic.hist(figsize=(15,10));

## Find some correlations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Calculate the correlation matrix
df_numeric = titanic.select_dtypes(include=['float64', 'int64'])

corr_matrix = df_numeric.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

### Exercise: Find another one by encoding the `sex` column into integer and rerun the correlation code

In [None]:
titanic['sex']

In [None]:
titanic['sex']=='male'

In [None]:
titanic['male'] = (titanic['sex']=='male').astype('int64')

In [None]:
titanic.head()

In [None]:
# Calculate the correlation matrix
df_numeric = titanic.select_dtypes(include=['float64', 'int64'])

corr_matrix = df_numeric.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Machine Learning

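Here we will try to predict the survival probability of unknown passengers.    
The target variable will then be `survived`, and we must drop the `body` variable because it is only filled in for passengers who did not survive and would therefore leak the answer.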


In [None]:
target = 'survived'

### Choose your training features as a list of strings from the colnames

In [None]:
training_features = ['age', 'fare', 'pclass']

In [None]:
# we will work with this data for now

data = titanic[training_features + [target]]
data.head()

### Fixing the random seed for reproducibility of results

In [None]:
np.random.seed(42)

## Divide the dataset into train and test

In [None]:
# Let's drop all rows containing NaN

In [None]:
# Reassign instead of using inplace=True to avoid pandas' SettingWithCopyWarning
data = data.dropna()

In [None]:
from sklearn.model_selection import train_test_split


# X will be the input, so we drop the target feature. y is the output = target.
X = data.drop(target, axis=1)[training_features]  
y = data[target]  

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Train your first classifier 

In [None]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier()

In [None]:
classifier.fit(X_train, y_train)

In [None]:
pred = classifier.predict(X_test)
pred

In [None]:
pd.DataFrame(np.transpose([y_test.values, pred]), columns=['truth', 'prediction'])

Is this good ? ... 

## Evaluation of the prediction

### Classification Metrics in Machine Learning

#### True / False - Positives / Negatives

![true_pos](https://miro.medium.com/v2/resize:fit:640/format:webp/1*7EYylA6XlXSGBCF77j_rOA.png)

When we build a classification model, it's important to assess its performance. This is done using various metrics that each provide different insights into the quality of the model. Here are some of the main classification metrics:

#### 1. Accuracy

Accuracy is the most intuitive performance measure. It is simply a ratio of correctly predicted observations to the total observations.

$$ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{\text{True negatives + True positives}}{\text{Total Number of Predictions}} $$

**Advantages:**
- Easy to understand and interpret.
- Useful when the target classes are well balanced.

**Drawbacks:**
- Can be misleading if the data is imbalanced. A model that always predicts the majority class will have high accuracy but may be a poor model.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/Sensitivity_and_specificity_1.01.svg/341px-Sensitivity_and_specificity_1.01.svg.png" alt="Sensitivity and Specificity" width="341" align='right'>

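#### 2. Precision (Positive Predictive Value)

Precision is the ratio of correctly predicted positive observations to the total predicted positives. High precision means that few of the predicted positives are false positives. Note that precision is not the same as specificity, which is the true negative rate TN / (TN + FP).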

$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$


**Advantages:**
- Useful when the cost of False Positives is high. For example, in spam detection, you would rather let a few spam emails through (False Negatives) than accidentally mark a legitimate email as spam (False Positive).

**Drawbacks:**
- Not useful if the cost of False Negatives is high. It doesn't take into account the False Negatives.


source: https://en.wikipedia.org/wiki/File:Sensitivity_and_specificity_1.01.svg

#### 3. Recall (Sensitivity)

Recall (also known as sensitivity or true positive rate) is the ratio of correctly predicted positive observations to all observations that actually belong to the positive class.

$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$

**Advantages:**
- Useful when the cost of False Negatives is high. For example, in disease prediction, you would rather predict that a healthy person is sick (False Positive) than overlook a sick person (False Negative).

**Drawbacks:**
- Not useful if the cost of False Positives is high. It doesn't take into account the False Positives.
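
The two formulas above can be checked by hand on a small example. The sketch below uses made-up labels (not the Titanic data) and plain Python, so the variable names and values are illustrative assumptions only.

```python
# Toy labels: 1 = positive class, 0 = negative class (illustrative values only)
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(toy_true, toy_pred))

# Count each outcome by comparing the truth with the prediction
toy_tp = sum(t == 1 and p == 1 for t, p in pairs)  # positives correctly found
toy_fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives wrongly flagged
toy_fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives that were missed

toy_precision = toy_tp / (toy_tp + toy_fp)
toy_recall = toy_tp / (toy_tp + toy_fn)

print(f"precision = {toy_precision:.2f}, recall = {toy_recall:.2f}")
```

Precision only looks at the samples predicted as positive, while recall only looks at the samples that are truly positive, which is why the two values can differ a lot for the same predictions.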


#### 4. F1 Score

The F1 Score is the harmonic mean of Precision and Recall. It tries to find the balance between precision and recall, and it is low as soon as either of the two is low. It is useful in the case of uneven class distribution problems.

$$ \text{F1 Score} = 2*\frac{\text{Precision}*\text{Recall}}{\text{Precision} + \text{Recall}} $$

**Advantages:**
- Balances the trade-off between Precision and Recall.
- Useful when the data is imbalanced.

**Drawbacks:**
- Not interpretable in terms of the business problem.
- Assumes that Precision and Recall are equally important, which might not be the case.
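
To see how the F1 formula behaves, here is a minimal sketch in plain Python. The helper name `f1_from_pr` and the input values are hypothetical, chosen only for illustration.

```python
def f1_from_pr(precision, recall):
    """F1 score computed from precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Same arithmetic mean (0.5) in both cases, but very different F1 scores
balanced = f1_from_pr(0.5, 0.5)
lopsided = f1_from_pr(0.9, 0.1)
arithmetic_mean = (0.9 + 0.1) / 2

print(f"balanced F1 = {balanced:.2f}")
print(f"lopsided F1 = {lopsided:.2f} (arithmetic mean = {arithmetic_mean:.2f})")
```

A model with very high precision but very low recall is penalised much more by F1 than by a simple average of the two.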


#### 5. Area Under ROC (Receiver Operating Characteristic) Curve (AUC-ROC)

The AUC-ROC is a performance measurement for classification problems at various threshold settings. The ROC is a probability curve and the AUC represents the degree of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.

Each of these metrics provides a different perspective on the performance of the model, and it's important to consider all of them when evaluating your model. The choice of metric depends on your scientific objective.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/Roc_curve.svg/1024px-Roc_curve.svg.png" alt="ROC curve" width="500" >

**Advantages:**
- Useful for binary classification problems.
- Takes into account both False Positive rate (1-Specificity) and True Positive rate (Recall).

**Drawbacks:**
- Not suitable for imbalanced datasets: can be overly optimistic if classes are highly imbalanced.
- More difficult to interpret in terms of the scientific objective.
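
One way to build intuition for the AUC: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counting as one half). The sketch below computes it that way in plain Python; the function name `auc_by_ranking` and the toy scores are assumptions for illustration, not part of scikit-learn.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = auc_by_ranking(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.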

---- 
Consider the case of a population of 1000 patients: 990 of the patients are healthy and 10 have a disease.  

We build a test that returns the following results:
- Patients with disease (predicted) = 0 (our model predicts everyone as healthy)
- Healthy patients (predicted) = 1000 (our model predicts everyone as healthy)

Give TP, TN, FP, FN.

What is the accuracy? Is this a good measure of our test performances?

----

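Taking "has the disease" as the positive class:

- TP = 0
- TN = 990
- FP = 0
- FN = 10

accuracy = (TP + TN) / 1000 = 990 / 1000 = 99%, even though the test does not detect a single sick patient.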
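
The patient example can be reproduced in a few lines of plain Python. This is a minimal sketch that takes "has the disease" as the positive class (label 1); the variable names are illustrative.

```python
# 10 sick patients (1) and 990 healthy patients (0)
patients_true = [1] * 10 + [0] * 990
# The test predicts "healthy" for everyone
patients_pred = [0] * 1000

pairs = list(zip(patients_true, patients_pred))
n_tp = sum(t == 1 and p == 1 for t, p in pairs)
n_tn = sum(t == 0 and p == 0 for t, p in pairs)
n_fp = sum(t == 0 and p == 1 for t, p in pairs)
n_fn = sum(t == 1 and p == 0 for t, p in pairs)

patients_accuracy = (n_tp + n_tn) / len(patients_true)
patients_recall = n_tp / (n_tp + n_fn)

print(f"TP={n_tp}, TN={n_tn}, FP={n_fp}, FN={n_fn}")
print(f"accuracy = {patients_accuracy:.0%}, recall = {patients_recall:.0%}")
```

The accuracy looks excellent while the recall on the sick patients is zero, which is why accuracy alone is misleading on imbalanced data.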


Now let's compute these metrics for our dataset.    
Is the test dataset balanced ?



In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred)
## cm = cm/cm.sum()  # if you prefer %

sns.heatmap(cm, annot=True, cmap='Blues')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

### Question: where are the true positives, true negatives, false positives and false negatives ?

- TP = 39
- TN = 90
- FP = 50
- FN = 30

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, pred)

print(f"The accuracy of the prediction is: {accuracy*100:.2f}%")

### Exercise: compute the precision, recall and f1_score. 
Be mindful of the `pos_label`.

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

In [None]:
precision_score(y_test, pred, pos_label=1)

In [None]:
recall_score(y_test, pred, pos_label=0)

In [None]:
f1_score(y_test, pred, pos_label=1)

### Prediction of the probability 

Instead of predicting the class directly, the classifier computes a probability for each sample and each class, then applies a threshold to pick the class.
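
The thresholding step can be sketched in plain Python. The probabilities below are made up, and `apply_threshold` is a hypothetical helper; for a binary classifier, a threshold of 0.5 on the probability of class 1 is roughly what `predict` does.

```python
# Hypothetical probabilities of belonging to class 1 for six samples
proba_positive = [0.05, 0.20, 0.45, 0.55, 0.70, 0.95]


def apply_threshold(probabilities, threshold=0.5):
    """Predict class 1 when the probability exceeds the threshold, else class 0."""
    return [1 if p > threshold else 0 for p in probabilities]


default_pred = apply_threshold(proba_positive)                 # threshold 0.5
lenient_pred = apply_threshold(proba_positive, threshold=0.3)  # lower threshold

print(default_pred)
print(lenient_pred)
```

Lowering the threshold turns more samples into predicted positives, which tends to increase recall at the cost of precision.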

In [None]:
pred_proba = classifier.predict_proba(X_test)
pred_proba[:10]

### ROC Curve and AUC score

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, pred_proba[:,1])
roc_auc = roc_auc_score(y_test, pred_proba[:,1])

plt.plot(fpr, tpr, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()


# Improving the performance

We have now trained our first classifier, but let's try to improve its performance.
There are many ways to do so; here we'll look at:
- trying different algorithms
- playing with hyperparameters
- feature engineering


**_Metric ?_**    
Let's say we care most about not missing passengers who will not survive (a disease-like problem, where the class of interest is `survived = 0`), so we will try to **maximise recall with a positive label = 0**

## Trying different algorithms

![](https://scikit-learn.org/stable/_static/ml_map.png)
source: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html


### Exercise: Model and hyperparameters

Try the prediction with a GradientBoostingClassifier

Then play with the hyperparameters to try to improve the performances.

Note down the best recall along with the hyperparameters used.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
classifier_2 = GradientBoostingClassifier()

# train new classifier:
classifier_2.fit(X_train, y_train)

# make new prediction:
pred = classifier_2.predict(X_test)

pred

In [None]:
accuracy_score(y_test, pred)

In [None]:
recall_score(y_test, pred, pos_label=0)

### Question: what are the issues with this approach?

### A word on overfitting

#### Why is increasing the number of estimators not always a good thing?

In [None]:
n_estimators = [10, 20, 50, 100, 400, 1000]

recall = []
recall_train = []

for n in n_estimators:
    classifier = GradientBoostingClassifier(n_estimators=n)
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test)
    pred_train = classifier.predict(X_train)
    
    recall.append(recall_score(y_test, pred, pos_label=0))
    recall_train.append(recall_score(y_train, pred_train, pos_label=0))

In [None]:
plt.plot(n_estimators, recall, label='test recall')
plt.plot(n_estimators, recall_train, label='train recall')
plt.xlabel('n_estimators')
plt.ylabel('score')
plt.legend()

#### Why is optimising (even by hand!) to find the best score on the test dataset not good practice either?

In [None]:
n_estimators = [5, 10, 20]
learning_rate = [0.01, 0.1, 1]
max_depth = [1, 3, 5]

best_params = None
highest_recall = 0

for n in n_estimators:
    for lr in learning_rate:
        for md in max_depth:
            gbc = GradientBoostingClassifier(n_estimators=n, learning_rate=lr, max_depth=md)
            gbc.fit(X_train, y_train)
            preds = gbc.predict(X_test)
            recall = recall_score(y_test, preds, pos_label=0)
            if recall > highest_recall:
                highest_recall = recall
                best_params = {'n_estimators': n, 'learning_rate': lr, 'max_depth': md}

print("Best parameters: ", best_params)
print("Highest recall on test set: ", highest_recall)

### Validation dataset

To avoid the issue of information leak from the test dataset into the model, we use a validation dataset taken from the training dataset.
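
The idea can be sketched with plain Python indices. The helper `three_way_split` and the 60/20/20 proportions are assumptions for illustration; in practice one could also call `train_test_split` twice.

```python
import random


def three_way_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle sample indices and cut them into train / validation / test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility

    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)

    test_idx = indices[:n_test]                # only used for the final evaluation
    val_idx = indices[n_test:n_test + n_val]   # used to compare hyperparameters
    train_idx = indices[n_test + n_val:]       # used to fit the model
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Hyperparameters are then chosen using the validation part only, and the test part is touched once at the very end.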

<div align="center">
    <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*Nv2NNALuokZEcV6hYEHdGA.png" alt="Train, validation and test split" width="600" align="center"/>
</div>


#### Cross-validation

> However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

> A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
>
> - A model is trained using `k-1` of the folds as training data;
> - the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
>
> The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.

<div align="center">
    <img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" alt="Grid search with k-fold cross-validation" width="800"/>
</div>


source: https://scikit-learn.org/stable/modules/cross_validation.html
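
The splitting procedure quoted above can be sketched in plain Python. This is a simplified illustration without shuffling, and `k_fold_indices` is a hypothetical helper, not the scikit-learn implementation.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs, each fold serving once as validation set."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so the sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

A model would be fitted on each `train_idx`, scored on the matching `val_idx`, and the k scores averaged to obtain the cross-validation score.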

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score

param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 1],
    'max_depth': [1, 3, 5]
}

classifier = GradientBoostingClassifier()

scorer = make_scorer(recall_score, pos_label=0)


grid_search = GridSearchCV(estimator = classifier,
                           param_grid = param_grid,
                           scoring=scorer,
                           cv = 5)


grid_search.fit(X_train, y_train)

print("Best parameters: ", grid_search.best_params_)

print("Best score: ", grid_search.best_score_)

## Feature engineering

### Visualize features importance

In [None]:
import matplotlib.pyplot as plt


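# Get feature importances from the best model found by the grid search
# (`classifier` itself is never fitted: GridSearchCV trains clones of it)
importances = grid_search.best_estimator_.feature_importances_

feature_names = training_features
# Sort feature importances in ascending order (barh then shows the largest on top)
indices = np.argsort(importances)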

# Rearrange feature names so they match the sorted feature importances
names = [feature_names[i] for i in indices]

plt.figure(figsize=(9,4))
plt.barh(range(len(indices)), importances[indices])

plt.yticks(range(len(indices)), names, rotation=0);
plt.tight_layout()


In [None]:
from copy import deepcopy

# Work on a copy of the dataset in case you perform destructive operations
df = deepcopy(titanic)

In [None]:
df.head()

### Filling missing values
One can fill missing values in order to increase the size of training data.
There are many strategies to fill the missing values.
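
As a minimal sketch of a few common strategies (mean, median, most frequent value), here is a plain Python example on made-up values. The helper `fill_missing` is hypothetical and `None` stands in for a missing entry.

```python
from statistics import mean, median, mode

# Toy columns with missing entries (illustrative values only)
toy_ages = [22, 38, None, 35, None, 54, 2]
toy_ports = ["S", "C", "S", None, "Q", "S"]


def fill_missing(values, strategy):
    """Replace None entries by a statistic computed on the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = strategy(observed)
    return [fill_value if v is None else v for v in values]


ages_mean = fill_missing(toy_ages, mean)      # sensitive to extreme values
ages_median = fill_missing(toy_ages, median)  # more robust to outliers
ports_filled = fill_missing(toy_ports, mode)  # most frequent category

print(ages_mean)
print(ages_median)
print(ports_filled)
```

Numerical columns are often filled with the mean or the median, while categorical columns are usually filled with the most frequent value.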

In [None]:
# example with the port: fill missing embarked values with the most common port
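# (assigning the result back avoids the chained in-place call, which recent pandas versions warn about)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])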

#### Exercise: Proceed with the other columns with missing values

### Create new features 

In [None]:
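# Example: extract the title (Mr, Mrs, Miss, ...) from the `name` column
# (raw string so that `\.` is passed to the regex engine unchanged)
df['Title'] = df['name'].str.extract(r' ([A-Za-z]+)\.', expand=False)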


#### Exercise: create a `family_size` column by combining `sibsp` and `parch`

#### Exercise: Bin `age` into different age groups to create a column `age_group`

### Scale the features 

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler to the 'age' and 'fare' columns and transform them
df[['age', 'fare']] = scaler.fit_transform(df[['age', 'fare']])
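
For reference, standard scaling subtracts the mean of a column and divides by its standard deviation (the population standard deviation in the case of `StandardScaler`). Here is a minimal sketch on made-up values; `standardize` is a hypothetical helper.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


toy_fares = [10.0, 20.0, 30.0, 40.0, 50.0]  # illustrative values only
toy_scaled = standardize(toy_fares)

print([round(v, 3) for v in toy_scaled])
```

After scaling, features measured in very different units (years, pounds) become comparable in magnitude.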


### Encode features

In [None]:
df = pd.get_dummies(df, columns=['sex'])


####  Q: Another column we can apply this to?

In [None]:
# Finally, drop features that might not be useful
df = df.drop(['name', 'ticket', 'cabin', 'body', 'home.dest', 'boat'], axis=1)

df.head()


## Train a new classifier with this data
- try different hyperparameters (use `GridSearchCV`)
- who got the best classifier ?

## Putting it all together