# Machine Learning with scikit-learn

Tutors: Jaime Rodríguez-Guerra (jaime.rodriguez@charite.de), Jan Philipp Albrecht (j.p.albrecht@fu-berlin.de)

> Based on [previous work](https://github.com/volkamerlab/TeachOpenCADD/blob/master/talktorials/7_machine_learning/T7_machine_learning.ipynb) by Jan Philipp Albrecht and Jacob Gora

# Aims of this session

Familiarize yourself with the basic concepts of machine learning while applying popular algorithms and patterns with Python's `scikit-learn`.

# Learning goals


## Theory

* Machine Learning (ML) methods
* Data preparation

## Practical

* Prepare your data
* Use regression to find correlations between MMSE and hippocampus volume
* Apply Random Forests to predict Alzheimer's disease based on volumetric measurements and other variables
* Cluster with k-means to guess the thresholds used for MMSE-based diagnosis
* Validate your models

# References

* ML:
    * Random forest (RF): [http://ect.bell-labs.com/who/tkh/publications/papers/odt.pdf](http://ect.bell-labs.com/who/tkh/publications/papers/odt.pdf)
    * Support vector machines (SVM): [https://link.springer.com/article/10.1007%2FBF00994018](https://link.springer.com/article/10.1007%2FBF00994018)
    * Artificial neural networks (ANN): [https://www.frontiersin.org/research-topics/4817/artificial-neural-networks-as-models-of-neural-information-processing](https://www.frontiersin.org/research-topics/4817/artificial-neural-networks-as-models-of-neural-information-processing)
* Performance: 
    * [Sensitivity and specificity (Wikipedia)](https://en.wikipedia.org/wiki/Sensitivity_and_specificity)
    * [ROC curve and AUC (Wikipedia)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve)
* See also [this notebook by B. Merget](https://github.com/Team-SKI/Publications/tree/master/Profiling_prediction_of_kinase_inhibitors) from [*J. Med. Chem.*, 2017, 60, 474−485](https://pubs.acs.org/doi/10.1021/acs.jmedchem.6b01611) 

## Theory


### Supervised vs unsupervised learning

In supervised learning, the algorithm is trained on labelled data: each sample comes tagged with the correct answer, and the model learns to predict that answer for new samples (e.g. classification). In unsupervised learning, no labels are provided and the algorithm looks for structure in the data on its own (e.g. clustering).
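As a minimal sketch of the difference, the snippet below fits one supervised and one unsupervised model on the same made-up data. The data, the choice of `LogisticRegression` and `KMeans`, and all variable names are illustrative assumptions, not part of the course material.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic 1-D data: two well-separated groups (illustrative only)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)  # known labels, used only by the supervised model

# Supervised: the classifier learns from (X, y) pairs
clf = LogisticRegression().fit(X, y)
supervised_pred = clf.predict(X)

# Unsupervised: the clustering algorithm only ever sees X
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_  # arbitrary group ids, not the original labels

print("supervised accuracy:", (supervised_pred == y).mean())
print("cluster sizes:", np.bincount(cluster_ids))
```

Note that the cluster ids are arbitrary: cluster `0` is not guaranteed to correspond to label `0`.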


### Machine Learning (ML)

ML can be applied for (see also [scikit-learn page](http://scikit-learn.org/stable/)):

* **Classification (supervised)**: Identify to which category an object belongs (Nearest neighbors, Naive Bayes, RF, SVM, ...)
* **Regression (supervised)**: Prediction of a continuous-valued attribute associated with an object
* **Clustering (unsupervised)**: Automated grouping of similar objects into sets

#### Supervised learning

The learning algorithm creates rules by finding patterns in the training data.

* **Decision trees**: A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.
* **Random Forest (RF)**: An ensemble of decision trees, each trained on a random subsample of the data, whose predictions are combined (a vote across trees for classification, a mean for regression).
* **Support Vector Machines (SVM)**: Classifiers that look for the decision boundary with the maximum margin between classes. SVMs can also perform non-linear classification efficiently using the so-called kernel trick, which implicitly maps the inputs into a high-dimensional feature space.
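The three algorithms above share the same `fit`/`score` interface in `scikit-learn`. The following sketch trains each of them on a synthetic dataset generated with `make_classification`; the dataset and the hyperparameters (`max_depth=3`, `n_estimators=100`, RBF kernel) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (illustrative only)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

train_accuracy = {}
for name, model in models.items():
    model.fit(X, y)  # learn rules from the training data
    train_accuracy[name] = model.score(X, y)  # accuracy on the same data
    print(f"{name}: {train_accuracy[name]:.2f}")
```

These scores are computed on the training data itself, so they are optimistic; they say little about how the models behave on unseen samples.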

 
#### Validation strategies

All models must be validated before using them!

__Train-test split__

The data is split _beforehand_ in two sets:

* The training set, which will be used to _train_ the model.
* The testing set, which will be used to validate the answer provided by the trained model. This data shouldn't have been _seen_ by the model.

There are several criteria to split the data into these two sets, but a common one is a random split where 20% of the data goes to the testing set and the remaining 80% goes to the training set. This is called an 80/20 split.

__K-fold cross validation__

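This model validation technique splits the dataset into k subsets (folds) of similar size and iterates over them:

* Training dataset: In each iteration, k-1 folds are treated as the known data on which the model is trained
* Test dataset: The remaining fold is treated as unknown data on which the model is then tested
* The process is repeated k times, so that each fold serves as the test set exactly once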

The goal is to test the ability of the model to predict data it has never seen before, in order to flag problems such as overfitting and to assess how well the model generalizes.
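A possible way to run k-fold cross validation in `scikit-learn` is sketched below with `KFold` and `cross_val_score`. The synthetic dataset, the choice of k = 5, and the random forest settings are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification problem (illustrative only)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

k = 5
cv = KFold(n_splits=k, shuffle=True, random_state=0)

# Check that every sample lands in the test fold exactly once
test_counts = np.zeros(len(X), dtype=int)
for train_idx, test_idx in cv.split(X):
    test_counts[test_idx] += 1

# One accuracy value per fold; the model is refitted from scratch each time
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

A large spread between folds can be a hint that the model is sensitive to which samples it was trained on.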

#### Performance measures

Having defined a True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN), we have:

* **Sensitivity**, also true positive rate: TPR = TP/(FN+TP)
* **Specificity**, also true negative rate: TNR = TN/(FP + TN)
* **Accuracy**, also the trueness: ACC = (TP + TN)/(TP + TN + FP + FN)
* **ROC-curve**, receiver operating characteristic curve
    * A graphical plot that illustrates the diagnostic ability of our classifier
    * Plots the sensitivity (TPR) against the false positive rate (FPR = 1 - specificity) at varying decision thresholds
* **AUC**, the area under the ROC curve:
    * Describes the probability that a classifier will rank a randomly chosen positive instance higher than a negative one
    * Values between 0 and 1, the higher the better; 0.5 corresponds to random guessing
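The formulas above can be checked on a small hand-made example. The labels and scores below are invented for illustration; `confusion_matrix` and `roc_auc_score` are the `scikit-learn` helpers used to obtain the counts and the AUC.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Invented ground truth and classifier scores (4 positives, 6 negatives)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.6, 0.75])
y_pred = (y_score >= 0.5).astype(int)  # decision threshold of 0.5

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (fn + tp)                 # 3 / 4  = 0.75
specificity = tn / (fp + tn)                 # 4 / 6  ~ 0.667
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 7 / 10 = 0.7
auc = roc_auc_score(y_true, y_score)         # 0.875

# AUC as a ranking probability: fraction of (positive, negative) pairs
# in which the positive sample receives the higher score
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise_auc = np.mean(pos[:, None] > neg[None, :])  # 21 / 24 = 0.875

print(sensitivity, specificity, accuracy, auc, pairwise_auc)
```

Sensitivity, specificity and accuracy depend on the chosen threshold, whereas the AUC is computed from the scores directly.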


### Data preparation

Some algorithms will expect the data in a specific form so, before the models are trained and evaluated, the data needs to be formatted and sanitized accordingly. Perhaps surprisingly, this is often one of the trickiest parts in ML: getting good data and preparing it for the study!

Some common tasks include:

* Dropping non-numeric values
* Cleaning unneeded columns
* Standardizing labels and/or magnitudes
* Normalizing ranges
* Converting measurement units
* Finding a good representation
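Some of these tasks could look as follows in `pandas` and `scikit-learn`. This is only a sketch on a made-up table: the column names mimic the course dataset (`Hippocampus` is an assumed name), the values are invented, and the mm³ unit of the volume is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table; values are invented for illustration
raw = pd.DataFrame({
    "MMSE": [29, 21, np.nan, 27, 18],
    "Hippocampus": [8336.0, 5319.0, 6100.0, np.nan, 5600.0],
    "PTGENDER": ["Male", "female", "Female", "Male", "male "],
    "update_stamp": ["2019-12-04"] * 5,
})

clean = (
    raw.drop(columns=["update_stamp"])          # clean unneeded columns
       .dropna(subset=["MMSE", "Hippocampus"])  # drop rows with missing values
       .copy()
)

# Standardize labels: same spelling and capitalization for every category
clean["PTGENDER"] = clean["PTGENDER"].str.strip().str.capitalize()

# Convert measurement units (assuming the volume is given in mm^3)
clean["Hippocampus_cm3"] = clean["Hippocampus"] / 1000.0

# Normalize ranges: zero mean and unit variance per column
scaled = StandardScaler().fit_transform(clean[["MMSE", "Hippocampus"]])
print(clean)
print(scaled)
```

When a train-test split is involved, the scaler is usually fitted on the training set only and then applied to the test set, so that no information leaks from the test data.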

## Practical

Some questions to guide the session:

* Classification: Is the MMSE score connected to Alzheimer's? MMSE is a 30-question test.
* Classification: ADASQ4 would also be a nice candidate, but it contains lots of NaN values.
* DX is the simplified version of the diagnosis label. Use that instead of DX_bl.
* Regression: Volumetric measurements, especially of the hippocampus and ventricles, can correlate with dementia.
    * Does the hippocampus volume correlate with the questionnaire score?
* Whole brain volumes should NOT correlate.
* Cluster by sex.

### Prepare your data

Before we can use the supplied data, we need to select and transform the parts we are interested in. Then, we will perform an 80/20 split, leaving it ready for analysis.

In [1]:
import pandas as pd

In [4]:
# Load data as cleaned in day two
# df = pd.read_csv("../data/alzheimers_disease_reduced.csv")
df = pd.read_csv("data/alzheimers_disease.csv")

In [5]:
# Print some info
print(df.shape)
print(df.info())
df.head()

(14532, 109)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14532 entries, 0 to 14531
Columns: 109 entries, RID to update_stamp
dtypes: float64(87), int64(5), object(17)
memory usage: 12.1+ MB
None


| Unnamed: 0 | RID | VISCODE | SITE | EXAMDATE | DX_bl | AGE | PTGENDER | PTEDUCAT | WORK | PTETHCAT | ... | TAU_bl | PTAU_bl | FDG_bl | PIB_bl | AV45_bl | Years_bl | Month_bl | Month | M | update_stamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | bl | 11 | 2005-09-08 | CN | 74.3 | Male | 16 | technical writer and editor | Not Hisp/Latino | ... |  |  | 1.36665 |  |  | 0.0 | 0.0 | 0 | 0.0 | 2019-12-04 04:19:56.0 |
| 1 | 3 | bl | 11 | 2005-09-12 | AD | 81.3 | Male | 18 | Secretary | Not Hisp/Latino | ... | 239.7 | 22.83 | 1.08355 |  |  | 0.0 | 0.0 | 0 | 0.0 | 2019-12-04 04:19:56.0 |
| 2 | 3 | m06 | 11 | 2006-03-13 | AD | 81.3 | Male | 18 | Elementary school teacher | Not Hisp/Latino | ... | 239.7 | 22.83 | 1.08355 |  |  | 0.498289 | 5.96721 | 6 | 6.0 | 2019-12-04 04:19:56.0 |
| 3 | 3 | m12 | 11 | 2006-09-12 | AD | 81.3 | Male | 18 | Communication | Not Hisp/Latino | ... | 239.7 | 22.83 | 1.08355 |  |  | 0.999316 | 11.9672 | 12 | 12.0 | 2019-12-04 04:19:56.0 |
| 4 | 3 | m24 | 11 | 2007-09-12 | AD | 81.3 | Male | 18 | Accounting | Not Hisp/Latino | ... | 239.7 | 22.83 | 1.08355 |  |  | 1.99863 | 23.9344 | 24 | 24.0 | 2019-12-04 04:19:56.0 |


In `sklearn`, the naming convention states that:

- `X` (uppercase) refers to the dataset containing the known data; e.g. MMSE scores, hippocampus volume
- `y` (lowercase) refers to the labels (unknowns) of that data; e.g. diagnosis

One of the first tasks you will need to do will be separating the dataframe into the useful parts for each question. For example:

In [6]:
# Double brackets keep X two-dimensional, as sklearn estimators expect
X = df[['MMSE']]
y = df['DX']

Before rushing to analyze the data, though, there are some quality-check strategies we can consider. One of the easiest to implement is a train-test split; i.e. "hide" some of the data from the model as a "test" set, which we'll use to verify whether the predictions are useful outside the training set. One common split is 20% test, 80% train. The library `sklearn` contains a utility function for this precise action:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Use regression to find correlations between MMSE and hippocampus volume

MMSE tests are used to diagnose AD. Low-scoring participants are often diagnosed with some kind of cognitive impairment. Some literature suggests that a reduced hippocampus volume might correlate with some forms of dementia. Would you be able to find a correlation between MMSE scores and hippocampus volume? Let's see if we can find it with linear regression.
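One way to approach this is sketched below with `LinearRegression`. To keep the example self-contained it runs on synthetic data with a built-in positive trend; the column name `Hippocampus`, the value ranges, and the strength of the trend are all assumptions, so the numbers say nothing about the real dataset. With the course data, the same steps would be applied to the corresponding columns of `df`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real table (invented values and trend)
rng = np.random.default_rng(42)
n = 300
hippocampus = rng.normal(6500, 1000, n)
mmse = np.clip(10 + 0.0025 * hippocampus + rng.normal(0, 2, n), 0, 30)
toy = pd.DataFrame({"Hippocampus": hippocampus, "MMSE": mmse})
toy.loc[::25, "Hippocampus"] = np.nan  # mimic missing measurements

# Regression cannot handle NaN, so keep only complete rows
data = toy[["Hippocampus", "MMSE"]].dropna()
X = data[["Hippocampus"]]  # 2-D feature table
y = data["MMSE"]           # 1-D target

reg = LinearRegression().fit(X, y)
r2 = reg.score(X, y)  # coefficient of determination
pearson_r = data["Hippocampus"].corr(data["MMSE"])

print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("R^2:", r2, "Pearson r:", pearson_r)
```

For a linear regression with a single feature, R² on the fitted data equals the square of the Pearson correlation coefficient, so both numbers tell the same story.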

### Apply Random Forests to predict Alzheimer's disease based on volumetric measurements and other variables

The literature suggests that some volumetric measurements of the brain correlate with the development of the disease. Let's check if that's true.

The dataset we are using contains information on several volumetric measurements. Namely:

- Ventricles
- Hippocampus
- Entorhinal cortex
- Fusiform gyrus

Can those be accurate descriptors for AD?
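A possible starting point is sketched below: a `RandomForestClassifier` trained on the four volumes and evaluated on a held-out test set. The data is synthetic, with group differences built in on purpose, and the column names `Ventricles`, `Hippocampus`, `Entorhinal` and `Fusiform` are assumed, so the accuracy and importances only illustrate the workflow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real table (invented values and effects)
rng = np.random.default_rng(0)
n = 400
is_ad = rng.random(n) < 0.4
toy = pd.DataFrame({
    "Ventricles": rng.normal(35000, 8000, n) + 12000 * is_ad,
    "Hippocampus": rng.normal(7200, 700, n) - 1400 * is_ad,
    "Entorhinal": rng.normal(3800, 500, n) - 800 * is_ad,
    "Fusiform": rng.normal(18000, 2000, n) - 1500 * is_ad,
    "DX": np.where(is_ad, "AD", "CN"),
})

features = ["Ventricles", "Hippocampus", "Entorhinal", "Fusiform"]
X_train, X_test, y_train, y_test = train_test_split(
    toy[features], toy["DX"], test_size=0.2, stratify=toy["DX"], random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
test_accuracy = rf.score(X_test, y_test)

# Relative contribution of each feature to the forest's decisions
importances = pd.Series(rf.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)

print("test accuracy:", test_accuracy)
print(importances)
```

The `stratify` argument keeps the proportion of each diagnosis similar in both sets, which matters when the classes are imbalanced.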

### Cluster with k-means to guess the thresholds used for MMSE-based diagnosis

1. Feed MMSE and the volumetric data (everything we have used so far) into a dimensionality reduction algorithm such as PCA
2. Cluster the reduced data with k-means
3. Label the data points with their diagnosis and check whether the clusters recover the five or three groups
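The three steps above can be sketched as follows. The example uses `make_blobs` to generate synthetic data with three known groups in place of the real measurements; the number of features, the number of clusters and the use of `adjusted_rand_score` for the comparison are choices made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 features, 3 hidden groups (illustrative only)
X, true_groups = make_blobs(
    n_samples=300, centers=3, n_features=5, cluster_std=1.0, random_state=0
)

# 1. Scale the features and project them onto two principal components
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = reducer.fit_transform(X)

# 2. Cluster the reduced data with k-means
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_2d)

# 3. Compare the clusters against the known groups
#    (1.0 means identical partitions, values near 0 mean random agreement)
ari = adjusted_rand_score(true_groups, cluster_ids)
print("adjusted Rand index:", ari)
```

Scaling before PCA matters here because features measured in large units would otherwise dominate the principal components.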

### Validate your models

Now we will compare the models you have defined with the following techniques:

* Cross validation
* Confusion matrix: TP, FP, TN, FN
* ROC
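A compact way to combine these three techniques is sketched below: out-of-fold predictions are collected with `cross_val_predict`, and a confusion matrix and ROC AUC are computed from them for each model. The synthetic dataset and the two models being compared are placeholders for the models built during the session.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary classification problem (illustrative only)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "logistic regression": LogisticRegression(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Each sample is scored by a model that did not see it during training
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    pred = (proba >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    results[name] = {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "AUC": roc_auc_score(y, proba),
    }
    print(name, results[name])
```

Using the same `cv` object for every model means all of them are evaluated on identical folds, which keeps the comparison fair.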

# Discussion

Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges.

# Quiz

Ask three questions that the user should be able to answer after doing this talktorial. Choose important take-aways from this talktorial for your questions.

1. Question
2. Question
3. Question