Video https://youtu.be/04aJqN0kmOw

# Plan for the seminar

1. Mice protein data.
2. Train a KNN classifier and compute its performance.
3. K-fold and GridSearchCV to optimize a classifier.
4. Classification metrics: a quick reminder.
5. Predict on test (completely unseen) data.
6. GroupKFold.
7. Putting things together: pipeline, scaling, grid search.

---

## Mice Protein Expression Data Set
https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression
- The data set consists of the expression levels of 77 proteins that produced detectable signals in the nuclear fraction of cortex.
- There are 38 control mice and 34 trisomic mice (Down syndrome). 
- In the experiments, 15 measurements were registered of each protein per sample/mouse. 
- Therefore, for control mice, there are 38x15, or 570 measurements, and for trisomic mice, there are 34x15, or 510 measurements. So, the dataset contains a total of 1080 measurements per protein. 
- Each measurement can be considered as an independent sample/mouse. 

# 1. Load data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

plt.rc('font', **{'size':22})

In [None]:
# url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00342/Data_Cortex_Nuclear.xls'
# data = pd.read_excel(url)
# data.to_csv('mouse_protein_data.csv', index=False)

# Load the data


In [None]:
# Show first several rows


In [None]:
# Set index to MouseID


## Some of the columns contain missing values

For now, we simply drop those columns.
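
A minimal sketch of this step on a toy frame may help. The column names and values below are made up for illustration and are not the real protein columns. `isna().sum()` counts missing values per column, and `dropna(axis=1)` drops every column that has at least one missing value:

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are hypothetical stand-ins
toy = pd.DataFrame({
    "protein_a": [0.5, 0.7, 0.6],
    "protein_b": [1.2, np.nan, 1.1],
    "protein_c": [np.nan, np.nan, 0.3],
    "Genotype": ["Control", "Control", "Trisomic"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Keep only the columns without any missing value
toy_complete = toy.dropna(axis=1)
```

Dropping whole columns is the simplest option but discards information; imputation is the usual alternative.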

In [None]:
# Count missing values



In [None]:
# Drop columns with missing values



In [None]:
# Create X and y


# 2. Split data into train and test parts

Run k-fold cross-validation on the training part.

For the test part we will use a prespecified set of rows instead of a random `train_test_split`.

In [None]:
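# Rows 105..974 form the training set; the first and last 105 rows are held out for testing
train_index = list(range(105, 975))
test_index = [*range(105), *range(975, 1080)]

# The first 28 columns are used as features, 'Genotype' is the target
X_train, y_train = data.iloc[train_index, :28], data.iloc[train_index]['Genotype']
X_test, y_test = data.iloc[test_index, :28], data.iloc[test_index]['Genotype']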

# 3. KNN + GridSearchCV on the training data
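
One possible shape of this step, sketched on synthetic data rather than on the mice data: `X_demo`, `y_demo`, and the grid of candidate `k` values are assumptions made for the illustration. `GridSearchCV` fits a `KNeighborsClassifier` for every candidate `n_neighbors` on each `KFold` split and keeps the value with the highest mean validation accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train, y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate values of k (an arbitrary illustrative range)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]  # k with the highest mean validation accuracy
best_score = search.best_score_              # that mean validation accuracy
```

After fitting, `search.best_estimator_` is a model refit on all of the data passed to `fit` with the best `k`, so it can be used directly for prediction.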

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, KFold

In [None]:
# Look for the best k for KNeighborsClassifier


In [None]:
# What is the best k?



In [None]:
# What is the corresponding model performance?


# 4. Classification metrics

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

In [None]:
# Let's compute predictions on the validation folds



In [None]:
# What is the accuracy?

In [None]:
# What are Type I and Type II errors?



## Confusion matrix

A short reminder:


- `True negative` (TN) = # of observations of class **-1** predicted as **-1**
- `True positive` (TP) = # of observations of class **1** predicted as **1**
- `False negative` (FN) =  # of observations of class **1** predicted as **-1**
- `False positive` (FP) =  # of observations of class **-1** predicted as **1**
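
The four counts can be read off `sklearn.metrics.confusion_matrix`. The sketch below uses hand-made label vectors (not model output) with the same **-1** / **1** coding as the list above. Passing `labels=[-1, 1]` fixes the row and column order so that `ravel()` returns the counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made true labels and predictions, for illustration only
y_true = np.array([1, 1, 1, -1, -1, -1, -1, 1])
y_pred = np.array([1, -1, 1, -1, 1, -1, -1, 1])

# With labels=[-1, 1], rows are true classes and columns are predictions:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
tn, fp, fn, tp = cm.ravel()

# Accuracy is (TP + TN) divided by the number of observations
accuracy = accuracy_score(y_true, y_pred)
```

Here TN = 3, FP = 1, FN = 1, TP = 3, so the accuracy is 6 / 8 = 0.75.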

![](https://downloader.disk.yandex.ru/preview/41e8787a8165a44f12d8ffe173005a851837d2b362a10a52d74993243dd96b69/5f62036b/ByxQGYV6TOA0SKM5C1wjQzf62RnoUHLkxFT6pa_5VPIfzjRLsxn-hQdij6wQzVN2R8-jQN4a-Hpzufn2KVjpsw==?uid=0&filename=Screenshot+from+2020-09-16+11-21-32.png&disposition=inline&hash=&limit=0&content_type=image%2Fpng&tknv=v2&owner_uid=159868851&size=2048x2048)

https://en.wikipedia.org/wiki/Confusion_matrix

## Type I error, Type II error

![](https://chemicalstatistician.files.wordpress.com/2014/05/pregnant.jpg)
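
In the usual convention, a Type I error is a false positive and a Type II error is a false negative. As a small sketch with made-up counts (the numbers are not results from this notebook), the corresponding error rates follow directly from the confusion matrix entries:

```python
# Made-up counts from a hypothetical confusion matrix
tn, fp, fn, tp = 50, 10, 5, 35

# Type I error rate: share of actual negatives predicted as positive
false_positive_rate = fp / (fp + tn)

# Type II error rate: share of actual positives predicted as negative
false_negative_rate = fn / (fn + tp)
```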

# 5. Predict on test

In [None]:
# What is the performance of our model on unseen test data?



In [None]:
# Accuracy



In [None]:
# Confusion matrix



# Groups!

Our data contains repeated measurements of the same mice. Recall that:

- There are 38 control mice and 34 trisomic mice (Down syndrome).
- In the experiments, 15 measurements were registered of each protein per sample/mouse.

It is common for a data set to contain multiple rows for the same subject.

The problem is that two observations from the same subject are typically much more similar to each other
than two observations from different subjects of the same class, so a random split lets the model recognize the mouse rather than the genotype.

---

Let's see how our model performance changes once all observations from the same mouse are put either into
train or into test.
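
The effect can be reproduced on purely synthetic data. In the sketch below everything is an assumption made for the illustration: 40 artificial "subjects" with 15 noisy repeats each, and labels assigned per subject that carry no information about the features. A 1-nearest-neighbour model scored with a shuffled `KFold` can still look very accurate, because the nearest neighbour of a validation row is usually another row of the same subject. Scoring with `GroupKFold` removes that shortcut.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_subjects, n_repeats, n_features = 40, 15, 5

# Each synthetic subject has its own centre; repeats are small perturbations of it
centres = rng.normal(size=(n_subjects, n_features))
X_demo = np.repeat(centres, n_repeats, axis=0)
X_demo = X_demo + rng.normal(scale=0.05, size=X_demo.shape)

# Labels are assigned per subject and are unrelated to the features
subject_labels = np.arange(n_subjects) % 2
y_demo = np.repeat(subject_labels, n_repeats)
groups_demo = np.repeat(np.arange(n_subjects), n_repeats)

knn = KNeighborsClassifier(n_neighbors=1)

# Rows of one subject end up in both training and validation folds
leaky_acc = cross_val_score(
    knn, X_demo, y_demo, cv=KFold(n_splits=5, shuffle=True, random_state=0)
).mean()

# All rows of one subject stay in the same fold
grouped_acc = cross_val_score(
    knn, X_demo, y_demo, cv=GroupKFold(n_splits=5), groups=groups_demo
).mean()
```

Since the labels are uninformative by construction, the grouped estimate is the honest one; the gap between `leaky_acc` and `grouped_acc` is what leakage through repeated subjects looks like.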

In [None]:
import seaborn as sns

In [None]:
data['mouse'] = data.index
data['mouse'] = data.mouse.apply(lambda x: x.split('_')[0])

In [None]:
groups = X_train.index.map(lambda x: x.split('_')[0])
index1 = [*list(range(30)), *list(range(1065, 1080))]

In [None]:
sns.pairplot(pd.concat([data.iloc[index1, :4], data.iloc[index1]['mouse']], axis=1));#,  hue='mouse');

# 6. Group k fold
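
A hedged sketch of how `GroupKFold` plugs into a grid search, on synthetic data: the arrays `X_demo`, `y_demo`, `groups_demo` and the small grid are assumptions made for the illustration. The splitter needs a group label for every row, and `GridSearchCV` passes the `groups` argument of `fit` on to it.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in: 12 subjects with 5 rows each
groups_demo = np.repeat(np.arange(12), 5)
X_demo = rng.normal(size=(60, 4))
y_demo = np.repeat(np.arange(12) % 2, 5)

group_cv = GroupKFold(n_splits=3)

# Check that no subject appears on both sides of any split
no_overlap = all(
    set(groups_demo[train_idx]).isdisjoint(groups_demo[val_idx])
    for train_idx, val_idx in group_cv.split(X_demo, y_demo, groups=groups_demo)
)

# GridSearchCV forwards `groups` from fit() to the splitter
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5]},
    cv=group_cv,
)
search.fit(X_demo, y_demo, groups=groups_demo)
best_k = search.best_params_["n_neighbors"]
```

For the mice data, the `groups` variable built from the mouse identifiers in the cells above would play the role of `groups_demo`.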

In [None]:
# Import GroupKFold from sklearn.model_selection


In [None]:
# Look for the best value of k, using the correct folding technique



In [None]:
# What is the best cross-validation score?



In [None]:
# What is the best number of neighbours?


In [None]:
# What is the performance of our model on unseen test data?



# 7. Data normalization

kNN relies on distances between observations, so features with larger numeric ranges dominate the result unless the data is normalized before training.
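
A minimal sketch of scaling inside a pipeline, on synthetic data with deliberately mismatched feature scales (the data, the group layout, and the grid are assumptions made for the illustration). Putting `MinMaxScaler` into the pipeline means the scaler is refit on the training part of every fold, so the validation rows do not influence the minimum and maximum used for scaling.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(2)

# Synthetic stand-in: 20 subjects x 6 rows, features on very different scales
groups_demo = np.repeat(np.arange(20), 6)
y_demo = np.repeat(np.arange(20) % 2, 6)
X_demo = rng.normal(size=(120, 3)) * np.array([1.0, 100.0, 0.01])
X_demo[:, 2] += 0.02 * y_demo  # the informative feature has the smallest scale

pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7]}

search = GridSearchCV(pipe, param_grid, cv=GroupKFold(n_splits=5))
search.fit(X_demo, y_demo, groups=groups_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
```

The `step__parameter` naming in `param_grid` is how a grid search addresses a parameter of one step in a pipeline.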

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

In [None]:
# Prepare a pipeline consisting of MinMaxScaler and KNN



In [None]:
# Find the optimal value of k (number of neighbours) for KNN with normalized data,
# using GridSearchCV with the correct folding


In [None]:
# Make predictions for unseen test data



## Let's Try Logistic Regression as an alternative classifier
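
The same pipeline pattern carries over to logistic regression. The sketch below is an illustration on synthetic, ungrouped data with an arbitrary grid for `C`; on the mice data the grouped folding from the previous sections would be used instead of the plain `cv=5`. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values regularize more.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger regularization; the grid values are arbitrary
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

best_C = search.best_params_["logisticregression__C"]
validation_score = search.best_score_  # mean validation accuracy for the best C
```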

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# Prepare a pipeline consisting of MinMaxScaler and a LogisticRegression classifier


In [None]:
# Find the optimal value of C (inverse regularization strength) for LogisticRegression with normalized data,
# using GridSearchCV with the correct folding.



In [None]:
# What is the performance of LogisticRegression with the best C? (validation score)



# What is next?

1. Try to impute missing values, e.g. include `sklearn.impute.SimpleImputer` in your pipeline.
    - Example https://machinelearningmastery.com/handle-missing-data-python/
    
2. Try a different classifier, e.g. `DecisionTreeClassifier`.
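
As a starting point for the first suggestion, here is a hedged sketch of an imputation step placed at the front of a pipeline. The data is synthetic, with entries removed at random, and the mean strategy and `n_neighbors=5` are arbitrary choices made for the illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)

# Synthetic data with roughly 10% of the entries knocked out
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Impute first (column means), then scale, then classify
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    MinMaxScaler(),
    KNeighborsClassifier(n_neighbors=5),
)
pipe.fit(X_demo, y_demo)
predictions = pipe.predict(X_demo)
```

With the imputer inside the pipeline, the columns that were dropped earlier could be kept, and the imputation statistics would be learned from the training folds only during cross-validation.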
