# **Welcome to the MRInference machine learning tutorial!**
by Sandra Vieira

---
This webpage contains a brief step-by-step tutorial on implementing a standard supervised machine learning pipeline in the Python programming language. Before starting this tutorial, make sure to go through the pre-recorded lectures:  

*   Introduction to Machine Learning
*   The Machine Learning Pipeline 


---
## Machine Learning: Methods and Applications to Brain Disorders
This tutorial and both pre-recorded lectures above are based on the book [Machine Learning: Methods and Applications to Brain Disorders](https://www.amazon.co.uk/Machine-Learning-Methods-Applications-Disorders/dp/0128157399). The pre-recorded lectures are based on chapters 1-3 and this tutorial is a shorter version of Chapter 19. You can access the full tutorial of Chapter 19 [here](https://github.com/MLMH-Lab/How-To-Build-A-Machine-Learning-Model).

---  
## Aim and structure of the tutorial
For this tutorial you will use a toy dataset containing the grey matter volume and thickness of different brain regions, extracted with FreeSurfer, to classify patients with schizophrenia and healthy controls using a Support Vector Machine (SVM). The main steps of the tutorial follow the pipeline presented in the lecture *The Machine Learning Pipeline* and are shown in the figure below.

![workflow](https://raw.githubusercontent.com/MLMH-Lab/How-To-Build-A-Machine-Learning-Model/master/figures/Figure%201.png)



## Importing libraries

Python code is organised into libraries. Each library contains a set of functions for a specific purpose. For example, numpy is a popular library for manipulating numerical data, while pandas is most commonly used to handle tabular data. There are several libraries for machine learning analysis; in this tutorial we will use scikit-learn.

In [None]:
# SNIPPET 1

# Manipulate data
import numpy as np
import pandas as pd

# Plots
import seaborn as sns
import matplotlib.pyplot as plt

# Statistical tests
import scipy.stats as stats

# Machine learning
from sklearn.svm import LinearSVC
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV

# Ignore WARNING
import warnings

warnings.filterwarnings('ignore')



Some steps in our analysis involve randomness. We set the seed to a fixed number so that we get the same results every time we run the code.

In [None]:
# SNIPPET 2
random_seed = 1
np.random.seed(random_seed)
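
To illustrate what fixing the seed does, here is a minimal sketch. The helper `draw_numbers` is hypothetical and only exists for this illustration; it is not part of the tutorial code.

```python
import numpy as np


def draw_numbers(seed, n=5):
    """Fix the global NumPy seed, then draw n pseudo-random numbers."""
    np.random.seed(seed)
    return np.random.rand(n)


first_run = draw_numbers(seed=1)
second_run = draw_numbers(seed=1)
other_seed = draw_numbers(seed=2)

# Same seed: identical numbers. Different seed: different numbers.
print(np.array_equal(first_run, second_run))
print(np.array_equal(first_run, other_seed))
```

Note that `np.random.seed` only controls NumPy's global random generator. scikit-learn objects that shuffle data usually also accept their own `random_state` argument.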

## Problem formulation

In this tutorial, our machine learning problem is:

> *Classify patients with schizophrenia and healthy controls using structural MRI data.*

From this formulation we can derive the main elements of our machine learning problem:

*   **Features**: Structural MRI data
*   **Task**: Binary classification
*   **Target**: Patients with schizophrenia and healthy controls



---



## Data Preparation

The aim of this step is to perform a series of statistical analyses to get the data ready for the machine learning model. In this tutorial, we will assume the data are ready to be analysed. In a real project, however, we would want to pay close attention to several things, including class imbalance (number of healthy controls, HC, vs number of patients with schizophrenia, SZ), missing data (is imputation needed?), confounding variables (e.g. age, sex) and dimensionality (number of features vs number of participants).
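
Some of the checks listed above can be sketched in a few lines of pandas. The example below uses a small hypothetical DataFrame (`toy_df`, with made-up values and column names) rather than the tutorial dataset, so the numbers are for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 participants, 2 features, one missing value
toy_df = pd.DataFrame(
    {
        'Diagnosis': ['hc', 'hc', 'hc', 'hc', 'sz', 'sz'],
        'feature_a': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
        'feature_b': [0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
    }
)

# Class imbalance: how many participants per group?
class_counts = toy_df['Diagnosis'].value_counts()

# Missing data: how many missing values per column?
missing_per_column = toy_df.isna().sum()

# Dimensionality: number of features vs number of participants
n_participants = toy_df.shape[0]
n_features = toy_df.drop(columns='Diagnosis').shape[1]

print(class_counts)
print(missing_per_column)
print('Features: %d, participants: %d' % (n_features, n_participants))
```

Checking for confounding variables needs the relevant columns (e.g. age, sex), which this toy example does not include.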

### Loading the data

In [None]:
# SNIPPET 4
# dataset_file = Path('./Chapter_19_data.csv')
dataset_url = 'https://raw.githubusercontent.com/sandramv/MRInference_ML_Tutorial/main/ml_tutorial_data.csv'
dataset_df = pd.read_csv(dataset_url, index_col='ID')

In [None]:
# SNIPPET 6
dataset_df[0:6]

ID,Diagnosis,Left Lateral Ventricle,Left Inf Lat Vent,Left Cerebellum White Matter,Left Cerebellum Cortex,Left Thalamus Proper,Left Caudate,Left Putamen,Left Pallidum,rd Ventricle,th Ventricle,Brain Stem,Left Hippocampus,Left Amygdala,CSF,Left Accumbens area,Left VentralDC,Right Lateral Ventricle,Right Inf Lat Vent,Right Cerebellum White Matter,Right Cerebellum Cortex,Right Thalamus Proper,Right Caudate,Right Putamen,Right Pallidum,Right Hippocampus,Right Amygdala,Right Accumbens area,Right VentralDC,CC Posterior,CC Mid Posterior,CC Central,CC Mid Anterior,CC Anterior,lh bankssts volume,lh caudalanteriorcingulate volume,lh caudalmiddlefrontal volume,lh cuneus volume,lh entorhinal volume,lh fusiform volume,...,lh superiortemporal thickness,lh supramarginal thickness,lh frontalpole thickness,lh temporalpole thickness,lh transversetemporal thickness,lh insula thickness,rh bankssts thickness,rh caudalanteriorcingulate thickness,rh caudalmiddlefrontal thickness,rh cuneus thickness,rh entorhinal thickness,rh fusiform thickness,rh inferiorparietal thickness,rh inferiortemporal thickness,rh isthmuscingulate thickness,rh lateraloccipital thickness,rh lateralorbitofrontal thickness,rh lingual thickness,rh medialorbitofrontal thickness,rh middletemporal thickness,rh parahippocampal thickness,rh paracentral thickness,rh parsopercularis thickness,rh parsorbitalis thickness,rh parstriangularis thickness,rh pericalcarine thickness,rh postcentral thickness,rh posteriorcingulate thickness,rh precentral thickness,rh precuneus thickness,rh rostralanteriorcingulate thickness,rh rostralmiddlefrontal thickness,rh superiorfrontal thickness,rh superiorparietal thickness,rh superiortemporal thickness,rh supramarginal thickness,rh frontalpole thickness,rh temporalpole thickness,rh transversetemporal thickness,rh insula thickness
c001,hc,4226.907844,414.407845,12242.90784,43410.50784,7020.107844,4133.407844,6467.707844,2048.207844,825.507845,1751.707844,18918.50784,3423.907844,917.007845,907.207845,691.607845,3500.107844,3517.007844,568.407845,13079.90784,44261.50784,6855.307844,4248.407844,6746.707844,1941.307844,3427.807844,1297.107844,618.107845,3837.307844,812.507845,429.107845,520.707845,407.907845,745.407845,1242.007844,911.007845,4082.007844,3236.007844,1992.007844,8908.007844,...,2.464844,2.409844,2.712844,1.940844,2.206844,2.895844,2.320844,2.229844,2.517844,1.879844,3.004844,2.726844,2.516844,2.604844,2.430844,2.323844,2.507844,2.072844,2.571844,2.731844,2.866844,2.279844,2.505844,2.828844,2.433844,1.523844,1.999844,2.487844,2.484844,2.312844,2.440844,2.522844,2.656844,2.123844,2.638844,2.420844,2.489844,2.235844,2.300844,2.645844
c002,hc,4954.912699,414.812699,16519.5127,38808.3127,7013.312699,3882.912699,5781.012699,1735.912699,457.512699,1123.312699,20193.9127,3582.712699,1578.712699,708.212699,593.812699,3802.312699,3420.612699,258.512699,16028.8127,44035.4127,6654.512699,3477.012699,5121.112699,1619.612699,3322.212699,1402.112699,529.412699,3842.312699,1034.512699,446.112699,495.012699,772.512699,815.412699,2596.012699,1493.012699,7759.012699,3268.012699,1850.012699,9055.012699,...,2.635699,2.560699,3.013699,2.452699,2.308699,2.859699,2.573699,2.141699,2.687699,1.904699,3.248699,2.408699,2.485699,2.558699,2.233699,2.181699,2.595699,2.095699,2.375699,2.659699,2.301699,2.456699,2.426699,2.823699,2.513699,1.967699,2.002699,2.297699,2.625699,2.273699,2.507699,2.470699,2.645699,2.132699,2.848699,2.425699,2.883699,2.622699,2.322699,2.673699
c003,hc,4470.611989,370.111989,10193.51199,38637.51199,5802.911989,2941.711989,5802.511989,1467.411989,835.011989,1050.011989,17577.51199,3338.211989,1318.311989,754.911989,702.611989,3444.511989,4097.511989,157.611989,14706.71199,42082.01199,5799.311989,3225.411989,4863.311989,1402.311989,3645.711989,1347.911989,588.011989,3924.011989,1067.911989,450.011989,492.411989,476.011989,888.611989,2556.011989,1633.011989,6815.011989,3291.011989,1782.011989,10994.01199,...,2.410989,2.451989,2.125989,2.332989,2.057989,2.786989,2.409989,2.421989,2.680989,2.150989,3.018989,2.963989,2.510989,2.797989,2.518989,2.237989,2.512989,2.462989,2.787989,2.754989,2.674989,2.358989,2.418989,3.020989,2.570989,1.849989,2.188989,2.488989,2.679989,2.556989,2.545989,2.589989,2.885989,2.317989,2.326989,2.454989,2.482989,2.232989,2.267989,2.795989
c004,hc,7553.310654,521.010654,12716.01065,41933.31065,5998.310654,2869.110654,5854.810654,1886.210654,867.310654,1577.310654,17785.41065,3468.710654,1242.410654,858.910654,491.410654,3704.910654,4481.710654,392.610654,13933.11065,43434.61065,6052.810654,2965.610654,5342.810654,1882.910654,4024.410654,1469.510654,478.610654,3533.110654,793.110654,348.110654,406.910654,377.710654,793.910654,1959.010654,1299.010654,6208.010654,2800.010654,1567.010654,9986.010654,...,2.427654,2.436654,3.303654,2.576654,2.368654,2.714654,2.548654,2.257654,2.599654,1.736654,2.678654,2.410654,2.354654,2.489654,2.210654,2.185654,2.517654,2.098654,2.623654,2.749654,2.750654,2.458654,2.427654,2.654654,2.401654,1.660654,2.067654,2.420654,2.589654,2.327654,2.323654,2.411654,2.770654,2.149654,2.458654,2.307654,3.284654,1.956654,2.297654,2.731654
c005,hc,8785.212771,396.912771,12077.41277,41818.91277,5839.812771,3614.812771,6013.112771,1550.712771,1226.612771,1008.412771,19291.61277,2821.512771,1197.412771,770.612771,451.712771,3553.912771,5712.312771,416.112771,12102.51277,44240.81277,5555.912771,3736.412771,5476.512771,1682.112771,3220.512771,1477.212771,702.312771,4192.512771,787.912771,479.612771,454.812771,437.612771,619.112771,2154.012771,978.012771,6817.012771,2844.012771,1891.012771,8445.012771,...,2.626771,2.391771,3.504771,3.140771,2.145771,2.953771,2.398771,2.550771,2.465771,1.886771,3.068771,2.745771,2.466771,2.673771,2.310771,2.144771,2.888771,1.948771,2.621771,2.749771,2.876771,2.250771,2.570771,2.959771,2.487771,1.750771,1.882771,2.456771,2.399771,2.417771,3.211771,2.467771,2.772771,2.051771,2.588771,2.325771,3.266771,3.162771,2.081771,2.607771
c006,hc,5083.706643,172.106643,11927.50664,38730.80664,5693.506643,3422.606643,6301.706643,1466.806643,1103.306643,1927.206643,18726.90664,3207.606643,1350.906643,842.506643,696.606643,3449.906643,4331.606643,192.606643,10098.00664,35352.20664,5239.106643,3190.706643,5036.606643,1540.506643,2749.306643,1674.006643,574.306643,3910.706643,944.906643,435.406643,451.206643,500.906643,821.206643,2223.006643,1224.006643,5884.006643,3787.006643,1395.006643,9760.006643,...,2.633643,2.455643,2.539643,2.554643,2.377643,3.164643,2.445643,2.458643,2.526643,2.094643,3.689643,2.732643,2.440643,2.305643,2.256643,2.262643,2.703643,2.215643,2.749643,2.541643,2.932643,2.413643,2.505643,2.693643,2.456643,1.850643,2.172643,2.402643,2.633643,2.466643,2.562643,2.603643,2.948643,2.177643,2.489643,2.362643,2.314643,3.512643,2.591643,2.606643


In [None]:
# SNIPPET 8
# Note: shape[1] counts every column, including the 'Diagnosis' label column
print('Number of features = %d' % dataset_df.shape[1])
print('Number of participants = %d' % dataset_df.shape[0])

Number of features = 170
Number of participants = 740


In [None]:
# SNIPPET 11
dataset_df['Diagnosis'].value_counts()

sz    372
hc    368
Name: Diagnosis, dtype: int64

### Feature set and target

Our next step is to retrieve the target and features from the dataset.

In [None]:
# SNIPPET 17
# Target
targets_df = dataset_df['Diagnosis']

# Features
features_names = dataset_df.columns[1:]
features_df = dataset_df[features_names]

In [None]:
# SNIPPET 17a
features_df

ID,Left Lateral Ventricle,Left Inf Lat Vent,Left Cerebellum White Matter,Left Cerebellum Cortex,Left Thalamus Proper,Left Caudate,Left Putamen,Left Pallidum,rd Ventricle,th Ventricle,Brain Stem,Left Hippocampus,Left Amygdala,CSF,Left Accumbens area,Left VentralDC,Right Lateral Ventricle,Right Inf Lat Vent,Right Cerebellum White Matter,Right Cerebellum Cortex,Right Thalamus Proper,Right Caudate,Right Putamen,Right Pallidum,Right Hippocampus,Right Amygdala,Right Accumbens area,Right VentralDC,CC Posterior,CC Mid Posterior,CC Central,CC Mid Anterior,CC Anterior,lh bankssts volume,lh caudalanteriorcingulate volume,lh caudalmiddlefrontal volume,lh cuneus volume,lh entorhinal volume,lh fusiform volume,lh inferiorparietal volume,...,lh superiortemporal thickness,lh supramarginal thickness,lh frontalpole thickness,lh temporalpole thickness,lh transversetemporal thickness,lh insula thickness,rh bankssts thickness,rh caudalanteriorcingulate thickness,rh caudalmiddlefrontal thickness,rh cuneus thickness,rh entorhinal thickness,rh fusiform thickness,rh inferiorparietal thickness,rh inferiortemporal thickness,rh isthmuscingulate thickness,rh lateraloccipital thickness,rh lateralorbitofrontal thickness,rh lingual thickness,rh medialorbitofrontal thickness,rh middletemporal thickness,rh parahippocampal thickness,rh paracentral thickness,rh parsopercularis thickness,rh parsorbitalis thickness,rh parstriangularis thickness,rh pericalcarine thickness,rh postcentral thickness,rh posteriorcingulate thickness,rh precentral thickness,rh precuneus thickness,rh rostralanteriorcingulate thickness,rh rostralmiddlefrontal thickness,rh superiorfrontal thickness,rh superiorparietal thickness,rh superiortemporal thickness,rh supramarginal thickness,rh frontalpole thickness,rh temporalpole thickness,rh transversetemporal thickness,rh insula thickness
c001,4226.907844,414.407845,12242.907840,43410.50784,7020.107844,4133.407844,6467.707844,2048.207844,825.507845,1751.707844,18918.50784,3423.907844,917.007845,907.207845,691.607845,3500.107844,3517.007844,568.407845,13079.90784,44261.50784,6855.307844,4248.407844,6746.707844,1941.307844,3427.807844,1297.107844,618.107845,3837.307844,812.507845,429.107845,520.707845,407.907845,745.407845,1242.007844,911.007845,4082.007844,3236.007844,1992.007844,8908.007844,12371.007840,...,2.464844,2.409844,2.712844,1.940844,2.206844,2.895844,2.320844,2.229844,2.517844,1.879844,3.004844,2.726844,2.516844,2.604844,2.430844,2.323844,2.507844,2.072844,2.571844,2.731844,2.866844,2.279844,2.505844,2.828844,2.433844,1.523844,1.999844,2.487844,2.484844,2.312844,2.440844,2.522844,2.656844,2.123844,2.638844,2.420844,2.489844,2.235844,2.300844,2.645844
c002,4954.912699,414.812699,16519.512700,38808.31270,7013.312699,3882.912699,5781.012699,1735.912699,457.512699,1123.312699,20193.91270,3582.712699,1578.712699,708.212699,593.812699,3802.312699,3420.612699,258.512699,16028.81270,44035.41270,6654.512699,3477.012699,5121.112699,1619.612699,3322.212699,1402.112699,529.412699,3842.312699,1034.512699,446.112699,495.012699,772.512699,815.412699,2596.012699,1493.012699,7759.012699,3268.012699,1850.012699,9055.012699,12232.012700,...,2.635699,2.560699,3.013699,2.452699,2.308699,2.859699,2.573699,2.141699,2.687699,1.904699,3.248699,2.408699,2.485699,2.558699,2.233699,2.181699,2.595699,2.095699,2.375699,2.659699,2.301699,2.456699,2.426699,2.823699,2.513699,1.967699,2.002699,2.297699,2.625699,2.273699,2.507699,2.470699,2.645699,2.132699,2.848699,2.425699,2.883699,2.622699,2.322699,2.673699
c003,4470.611989,370.111989,10193.511990,38637.51199,5802.911989,2941.711989,5802.511989,1467.411989,835.011989,1050.011989,17577.51199,3338.211989,1318.311989,754.911989,702.611989,3444.511989,4097.511989,157.611989,14706.71199,42082.01199,5799.311989,3225.411989,4863.311989,1402.311989,3645.711989,1347.911989,588.011989,3924.011989,1067.911989,450.011989,492.411989,476.011989,888.611989,2556.011989,1633.011989,6815.011989,3291.011989,1782.011989,10994.011990,13008.011990,...,2.410989,2.451989,2.125989,2.332989,2.057989,2.786989,2.409989,2.421989,2.680989,2.150989,3.018989,2.963989,2.510989,2.797989,2.518989,2.237989,2.512989,2.462989,2.787989,2.754989,2.674989,2.358989,2.418989,3.020989,2.570989,1.849989,2.188989,2.488989,2.679989,2.556989,2.545989,2.589989,2.885989,2.317989,2.326989,2.454989,2.482989,2.232989,2.267989,2.795989
c004,7553.310654,521.010654,12716.010650,41933.31065,5998.310654,2869.110654,5854.810654,1886.210654,867.310654,1577.310654,17785.41065,3468.710654,1242.410654,858.910654,491.410654,3704.910654,4481.710654,392.610654,13933.11065,43434.61065,6052.810654,2965.610654,5342.810654,1882.910654,4024.410654,1469.510654,478.610654,3533.110654,793.110654,348.110654,406.910654,377.710654,793.910654,1959.010654,1299.010654,6208.010654,2800.010654,1567.010654,9986.010654,12236.010650,...,2.427654,2.436654,3.303654,2.576654,2.368654,2.714654,2.548654,2.257654,2.599654,1.736654,2.678654,2.410654,2.354654,2.489654,2.210654,2.185654,2.517654,2.098654,2.623654,2.749654,2.750654,2.458654,2.427654,2.654654,2.401654,1.660654,2.067654,2.420654,2.589654,2.327654,2.323654,2.411654,2.770654,2.149654,2.458654,2.307654,3.284654,1.956654,2.297654,2.731654
c005,8785.212771,396.912771,12077.412770,41818.91277,5839.812771,3614.812771,6013.112771,1550.712771,1226.612771,1008.412771,19291.61277,2821.512771,1197.412771,770.612771,451.712771,3553.912771,5712.312771,416.112771,12102.51277,44240.81277,5555.912771,3736.412771,5476.512771,1682.112771,3220.512771,1477.212771,702.312771,4192.512771,787.912771,479.612771,454.812771,437.612771,619.112771,2154.012771,978.012771,6817.012771,2844.012771,1891.012771,8445.012771,11016.012770,...,2.626771,2.391771,3.504771,3.140771,2.145771,2.953771,2.398771,2.550771,2.465771,1.886771,3.068771,2.745771,2.466771,2.673771,2.310771,2.144771,2.888771,1.948771,2.621771,2.749771,2.876771,2.250771,2.570771,2.959771,2.487771,1.750771,1.882771,2.456771,2.399771,2.417771,3.211771,2.467771,2.772771,2.051771,2.588771,2.325771,3.266771,3.162771,2.081771,2.607771
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
p368,8283.573193,209.713156,12776.713160,50836.41316,7949.513156,3358.213156,5381.313156,1532.913156,1496.636755,3102.413156,20893.31316,4287.257241,1402.913156,1051.113156,472.913156,4016.113156,8589.813156,174.813156,12935.91316,52015.11316,6695.913156,3288.513156,4899.413156,1378.313156,4139.113156,1458.900500,411.313156,3816.313156,829.713156,405.313156,402.013156,407.813156,768.113156,2555.013156,1789.013156,6239.013156,2386.013156,1701.013156,8641.013156,12615.013160,...,2.680156,2.389156,3.127156,3.846156,2.002156,2.837156,2.263156,2.682156,2.447156,1.595156,3.442156,2.315156,2.152156,2.341156,2.223156,1.864156,2.377156,1.766156,2.317156,2.835156,2.473156,2.199156,2.599156,2.716156,2.423156,1.368156,1.783156,2.364156,2.115156,2.251156,2.853156,2.302156,2.660156,2.000156,2.656156,2.326156,2.853156,3.634156,2.129156,2.913156
p369,5507.374607,96.003605,17634.203600,55609.30360,8785.803605,3592.103605,5334.103605,1546.903605,637.897786,1460.303605,23739.50360,4427.290443,1284.103605,1010.403605,435.603605,4206.003605,6587.903605,36.503605,15485.60360,57829.70360,7950.803605,3655.803605,5027.903605,1490.803605,3958.003605,1582.118722,455.003605,4103.503605,1141.803605,493.703605,527.503605,481.503605,937.903605,2075.003605,1735.003605,7747.003605,2789.003605,2718.003605,10033.003600,14443.003600,...,2.920605,2.617605,2.392605,3.746605,2.323605,3.132605,2.616605,2.521605,2.540605,1.793605,3.674605,2.709605,2.365605,2.636605,2.615605,2.203605,2.443605,2.025605,2.395605,2.825605,2.630605,2.147605,2.556605,2.181605,2.548605,1.433605,2.026605,2.653605,2.258605,2.414605,2.813605,2.487605,2.645605,2.216605,2.887605,2.476605,2.811605,3.901605,2.093605,2.892605
p370,3607.623866,305.201604,13822.401600,52828.50160,8195.201604,3126.401604,5213.201604,1604.101604,905.963115,1382.001604,20869.10160,4570.248506,1689.301604,782.001604,441.401604,3746.901604,4239.601604,225.801604,14815.60160,55345.60160,6983.401604,3062.401604,5005.201604,1517.001604,4494.101604,1527.116406,389.501604,3588.801604,757.401604,470.101604,359.001604,411.101604,610.301604,2301.001604,1253.001604,5886.001604,2618.001604,2218.001604,8632.001604,9469.001604,...,2.784604,2.519604,2.336604,3.972604,2.306604,3.058604,2.764604,2.758604,2.575604,1.598604,3.495604,2.744604,2.581604,2.503604,2.554604,2.002604,2.597604,1.795604,2.701604,3.024604,2.879604,2.326604,2.547604,2.789604,2.493604,1.483604,2.042604,2.872604,2.382604,2.486604,3.044604,2.538604,2.888604,2.301604,2.892604,2.644604,2.761604,4.059604,2.571604,3.066604
p371,8276.575805,255.310420,11482.110420,56545.41042,7251.510420,3445.010420,5015.910420,1368.610420,1382.500931,1586.810420,20569.51042,3291.992360,1424.510420,1080.510420,356.410420,3511.310420,7552.710420,154.810420,12207.41042,57303.01042,6515.710420,3310.910420,4492.610420,1399.410420,4070.510420,1415.569344,346.510420,3330.710420,708.310420,316.010420,323.510420,320.110420,638.610420,2373.010420,1369.010420,8252.010420,2423.010420,2188.010420,10296.010420,11911.010420,...,3.104420,2.800420,2.762420,4.052420,2.630420,3.166420,2.916420,2.663420,2.755420,1.733420,3.558420,2.621420,2.676420,2.420420,2.598420,2.329420,2.145420,2.047420,2.110420,3.341420,2.863420,2.567420,2.715420,2.504420,2.482420,1.524420,2.022420,2.756420,2.481420,2.589420,3.165420,2.564420,3.058420,2.413420,3.037420,2.916420,3.010420,4.361420,2.700420,2.631420


In [None]:
# SNIPPET 19
targets_df = targets_df.map({'hc': 0, 'sz': 1})
targets = targets_df.values.astype('int')

features = features_df.values.astype('float32')
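
One thing to watch with `Series.map` is that any label missing from the dictionary silently becomes `NaN`. The sketch below shows a quick check on hypothetical labels (the IDs and the stray `'HC'` entry are made up for illustration).

```python
import pandas as pd

# Hypothetical labels; 'HC' (upper case) is not a key in the mapping
labels = pd.Series(['hc', 'sz', 'HC', 'sz'],
                   index=['c001', 'p001', 'c002', 'p002'])

mapped = labels.map({'hc': 0, 'sz': 1})

# Labels without a dictionary entry are mapped to NaN
n_unmapped = int(mapped.isna().sum())
unmapped_ids = list(mapped[mapped.isna()].index)

print('Unmapped labels: %d' % n_unmapped)
print(unmapped_ids)
```

Running a check like this before converting the targets to integers makes unexpected labels easy to spot.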

## Feature engineering


### Feature extraction
In our example, we want to use neuroanatomical data to classify SZ and HC. This requires the extraction of brain morphometric information from the raw MRI images. In the dataset used here, this step has already been done with FreeSurfer.


### Feature scaling/normalization 
In brain disorders research, we often deal with datasets whose features vary in units and range. However, to model the data correctly and effectively, many machine learning algorithms, including the SVM, need the features to be on the same scale. Since normalization relies on statistics (e.g. mean and variance) that should be estimated from the training set only, at this point we split the data into training and test sets following a cross-validation scheme.
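
The idea of estimating the scaling statistics on the training set and reusing them on the test set can be sketched as follows. The arrays `train` and `test` are hypothetical toy values (one feature in the thousands, one in single digits), not data from the tutorial.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
train = np.array([[1000.0, 2.0],
                  [2000.0, 3.0],
                  [3000.0, 4.0]])
test = np.array([[2000.0, 5.0]])

# Fit on the training set only, then apply to both sets
scaler = StandardScaler().fit(train)
train_norm = scaler.transform(train)
test_norm = scaler.transform(test)

# Equivalent manual z-score using the training mean and standard deviation
manual_test_norm = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_norm.mean(axis=0))  # approximately 0 for each feature
print(train_norm.std(axis=0))   # approximately 1 for each feature
print(np.allclose(test_norm, manual_test_norm))
```

Because the test set is transformed with the training statistics, no information from the test participants leaks into the scaling step.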

In [None]:
# SNIPPET 20
n_folds = 10
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_seed)

![alt text](https://raw.githubusercontent.com/MLMH-Lab/How-To-Build-A-Machine-Learning-Model/master/figures/Figure%202.png)
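
To see what "stratified" means in practice, the sketch below runs `StratifiedKFold` on a hypothetical toy target vector (12 controls and 8 patients) and counts the classes in each test fold. The names `toy_targets`, `toy_features` and `toy_skf` are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 12 participants of class 0 and 8 of class 1
toy_targets = np.array([0] * 12 + [1] * 8)
toy_features = np.arange(len(toy_targets), dtype=float).reshape(-1, 1)

toy_skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)

test_class_counts = []
for train_idx, test_idx in toy_skf.split(toy_features, toy_targets):
    # Count how many participants of each class ended up in this test fold
    counts = np.bincount(toy_targets[test_idx], minlength=2)
    test_class_counts.append(tuple(int(c) for c in counts))

# Each test fold keeps the 12:8 class ratio of the full sample
print(test_class_counts)
```

Every participant appears in exactly one test fold, and each fold preserves the class proportions of the whole dataset.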

In [None]:
# SNIPPET 21
predictions_df = pd.DataFrame(targets_df)
predictions_df['predictions'] = np.nan

bac_cv = np.zeros((n_folds, 1))
sens_cv = np.zeros((n_folds, 1))
spec_cv = np.zeros((n_folds, 1))
coef_cv = np.zeros((n_folds, len(features_names)))

In [None]:
# SNIPPET 22
for i_fold, (train_idx, test_idx) in enumerate(skf.split(features, targets)):
    features_train, features_test = features[train_idx], features[test_idx]
    targets_train, targets_test = targets[train_idx], targets[test_idx]

    print('CV iteration: %d' % (i_fold + 1))
    print('Training set size: %d' % len(targets_train))
    print('Test set size: %d' % len(targets_test))

    # --------------------------------------------------------------------------
    # SNIPPET 23
    # Feature scaling/normalization
    scaler = StandardScaler()

    scaler.fit(features_train)

    features_train_norm = scaler.transform(features_train)
    features_test_norm = scaler.transform(features_test)

    # --------------------------------------------------------------------------
    # SNIPPET 24
    # Here, we will use the linear kernel, as this will make it easier to extract the coefficients
    #  of the SVM model (feature importance) later on.
    clf = LinearSVC(loss='hinge')

    # --------------------------------------------------------------------------
    # SNIPPET 25
    # SVM relies on a hyperparameter C that regulates how much we want to avoid misclassifying each
    #  training example.

    # Hyper-parameter search space
    param_grid = {'C': [2 ** -6, 2 ** -5, 2 ** -4, 2 ** -3, 2 ** -2, 2 ** -1, 2 ** 0, 2 ** 1]}

    # Grid search
    internal_cv = StratifiedKFold(n_splits=10)
    grid_cv = GridSearchCV(estimator=clf,
                           param_grid=param_grid,
                           cv=internal_cv,
                           scoring='balanced_accuracy',
                           verbose=1)

    # --------------------------------------------------------------------------
    # SNIPPET 26
    # Model training
    grid_result = grid_cv.fit(features_train_norm, targets_train)

    # --------------------------------------------------------------------------
    # SNIPPET 27
    print('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']

    for mean, stdev, param in zip(means, stds, params):
        print('%f (%f) with: %r' % (mean, stdev, param))

    # --------------------------------------------------------------------------
    # SNIPPET 28
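    best_clf = grid_cv.best_estimator_

    # Store this fold's SVM coefficients (feature importance) for later inspection
    coef_cv[i_fold, :] = best_clf.coef_.ravel()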

    # --------------------------------------------------------------------------
    # SNIPPET 30
    # Model evaluation
    # Finally, we use the trained model best_clf to make predictions on the test set.
    target_test_predicted = best_clf.predict(features_test_norm)

    # --------------------------------------------------------------------------
    # SNIPPET 31
    print('Confusion matrix')
    cm = confusion_matrix(targets_test, target_test_predicted)
    print(cm)

    tn, fp, fn, tp = cm.ravel()

    bac_test = balanced_accuracy_score(targets_test, target_test_predicted)
    sens_test = tp / (tp + fn)
    spec_test = tn / (tn + fp)

    print('Balanced accuracy: %.3f ' % bac_test)
    print('Sensitivity: %.3f ' % sens_test)
    print('Specificity: %.3f ' % spec_test)

    bac_cv[i_fold, :] = bac_test
    sens_cv[i_fold, :] = sens_test
    spec_cv[i_fold, :] = spec_test
    print('--------------------------------------------------------------------------')
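
As a sanity check on the metrics computed in the loop, the sketch below recomputes them by hand from the confusion matrix printed for the first CV iteration. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with 0 = HC and 1 = SZ as in the label mapping used earlier.

```python
import numpy as np

# Confusion matrix reported for CV iteration 1
cm = np.array([[29, 8],
               [13, 24]])

# For binary labels [0, 1], ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)   # proportion of patients correctly identified
specificity = tn / (tn + fp)   # proportion of controls correctly identified

# For a binary problem, balanced accuracy is the mean of the two
balanced_accuracy = (sensitivity + specificity) / 2

print('Balanced accuracy: %.3f' % balanced_accuracy)  # 0.716
print('Sensitivity: %.3f' % sensitivity)              # 0.649
print('Specificity: %.3f' % specificity)              # 0.784
```

These values agree with the figures printed for the first fold.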

CV iteration: 1
Training set size: 666
Test set size: 74
Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    7.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best: 0.681996 using {'C': 0.03125}
0.656239 (0.082053) with: {'C': 0.015625}
0.681996 (0.084610) with: {'C': 0.03125}
0.672950 (0.086047) with: {'C': 0.0625}
0.681863 (0.084353) with: {'C': 0.125}
0.660561 (0.069703) with: {'C': 0.25}
0.663592 (0.070552) with: {'C': 0.5}
0.660517 (0.063467) with: {'C': 1}
0.650000 (0.070569) with: {'C': 2}
Confusion matrix
[[29  8]
 [13 24]]
Balanced accuracy: 0.716 
Sensitivity: 0.649 
Specificity: 0.784 
--------------------------------------------------------------------------
CV iteration: 2
Training set size: 666
Test set size: 74
Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    7.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best: 0.684537 using {'C': 0.25}
0.657442 (0.060636) with: {'C': 0.015625}
0.675535 (0.071928) with: {'C': 0.03125}
0.683200 (0.069288) with: {'C': 0.0625}
0.683066 (0.061052) with: {'C': 0.125}
0.684537 (0.068300) with: {'C': 0.25}
0.678565 (0.069170) with: {'C': 0.5}
0.655971 (0.067399) with: {'C': 1}
0.650089 (0.080730) with: {'C': 2}
Confusion matrix
[[27 10]
 [10 27]]
Balanced accuracy: 0.730 
Sensitivity: 0.730 
Specificity: 0.730 
--------------------------------------------------------------------------
CV iteration: 3
Training set size: 666
Test set size: 74
Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    7.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best: 0.678565 using {'C': 0.25}
0.671257 (0.083172) with: {'C': 0.015625}
0.657843 (0.090354) with: {'C': 0.03125}
0.672504 (0.071297) with: {'C': 0.0625}
0.675579 (0.085828) with: {'C': 0.125}
0.678565 (0.096041) with: {'C': 0.25}
0.675535 (0.100533) with: {'C': 0.5}
0.656105 (0.095358) with: {'C': 1}
0.656283 (0.092216) with: {'C': 2}
Confusion matrix
[[26 11]
 [12 25]]
Balanced accuracy: 0.689 
Sensitivity: 0.676 
Specificity: 0.703 
--------------------------------------------------------------------------
CV iteration: 4
Training set size: 666
Test set size: 74
Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    7.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best: 0.677050 using {'C': 0.125}
0.666667 (0.087044) with: {'C': 0.015625}
0.659091 (0.086701) with: {'C': 0.03125}
0.672549 (0.076349) with: {'C': 0.0625}
0.677050 (0.074486) with: {'C': 0.125}
0.665152 (0.066903) with: {'C': 0.25}
0.648574 (0.062914) with: {'C': 0.5}
0.657665 (0.074417) with: {'C': 1}
0.663681 (0.085317) with: {'C': 2}
Confusion matrix
[[23 14]
 [ 6 31]]
Balanced accuracy: 0.730 
Sensitivity: 0.838 
Specificity: 0.622 
--------------------------------------------------------------------------
CV iteration: 5
Training set size: 666
Test set size: 74
Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    7.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best: 0.660829 using {'C': 0.125}
0.654768 (0.066926) with: {'C': 0.015625}
0.645766 (0.077823) with: {'C': 0.03125}
0.639840 (0.074877) with: {'C': 0.0625}
0.660829 (0.072983) with: {'C': 0.125}
0.651738 (0.076762) with: {'C': 0.25}
0.635294 (0.085356) with: {'C': 0.5}
0.632308 (0.097250) with: {'C': 1}
0.636809 (0.080568) with: {'C': 2}
Confusion matrix
[[22 15]
 [ 3 34]]
Balanced accuracy: 0.757 
Sensitivity: 0.919 
Specificity: 0.595 
--------------------------------------------------------------------------
CV iteration: 6
Training set size: 666
Test set size: 74
Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    7.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best: 0.651337 using {'C': 0.125}
0.615553 (0.087564) with: {'C': 0.015625}
0.622995 (0.085502) with: {'C': 0.03125}
0.634804 (0.080206) with: {'C': 0.0625}
0.651337 (0.075556) with: {'C': 0.125}
0.633422 (0.063303) with: {'C': 0.25}
0.623039 (0.082717) with: {'C': 0.5}
0.630704 (0.073362) with: {'C': 1}
0.644430 (0.074283) with: {'C': 2}
Confusion matrix
[[31  6]
 [ 8 29]]
Balanced accuracy: 0.811 
Sensitivity: 0.784 
Specificity: 0.838 
--------------------------------------------------------------------------
CV iteration: 7
Training set size: 666
Test set size: 74
Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    7.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best: 0.665152 using {'C': 0.125}
0.662166 (0.086544) with: {'C': 0.015625}
0.659046 (0.087600) with: {'C': 0.03125}
0.657487 (0.074425) with: {'C': 0.0625}
0.665152 (0.077787) with: {'C': 0.125}
0.653075 (0.056146) with: {'C': 0.25}
0.634848 (0.078374) with: {'C': 0.5}
0.622772 (0.076761) with: {'C': 1}
0.630348 (0.072909) with: {'C': 2}
Confusion matrix
[[29  8]
 [11 26]]
Balanced accuracy: 0.743 
Sensitivity: 0.703 
Specificity: 0.784 
--------------------------------------------------------------------------
CV iteration: 8
Training set size: 666
Test set size: 74
Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    7.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best: 0.672906 using {'C': 0.03125}
0.659492 (0.097342) with: {'C': 0.015625}
0.672906 (0.087826) with: {'C': 0.03125}
0.662389 (0.078545) with: {'C': 0.0625}
0.666934 (0.089875) with: {'C': 0.125}
0.657888 (0.086627) with: {'C': 0.25}
0.653298 (0.080025) with: {'C': 0.5}
0.654813 (0.077727) with: {'C': 1}
0.645900 (0.069074) with: {'C': 2}
Confusion matrix
[[30  7]
 [10 27]]
Balanced accuracy: 0.770 
Sensitivity: 0.730 
Specificity: 0.811 
--------------------------------------------------------------------------
CV iteration: 9
Training set size: 666
Test set size: 74
Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    7.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best: 0.672193 using {'C': 0.0625}
0.648128 (0.076796) with: {'C': 0.015625}
0.652674 (0.078949) with: {'C': 0.03125}
0.672193 (0.068385) with: {'C': 0.0625}
0.651471 (0.077171) with: {'C': 0.125}
0.658913 (0.077373) with: {'C': 0.25}
0.640686 (0.073699) with: {'C': 0.5}
0.640686 (0.085128) with: {'C': 1}
0.642291 (0.081810) with: {'C': 2}
Confusion matrix
[[26 10]
 [ 9 29]]
Balanced accuracy: 0.743 
Sensitivity: 0.763 
Specificity: 0.722 
--------------------------------------------------------------------------
CV iteration: 10
Training set size: 666
Test set size: 74
Fitting 10 folds for each of 8 candidates, totalling 80 fits
Best: 0.681684 using {'C': 0.125}
0.661943 (0.088166) with: {'C': 0.015625}
0.664929 (0.102367) with: {'C': 0.03125}
0.676961 (0.082928) with: {'C': 0.0625}
0.681684 (0.070080) with: {'C': 0.125}
0.678788 (0.071619) with: {'C': 0.25}
0.660695 (0.059294) with: {'C': 0.5}
0.653119 (0.067356) with: {'C': 1}
0.657754 (0.081060) with: {'C': 2}
Confusion matrix
[[28

[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    7.2s finished


In [None]:
# SNIPPET 32
# Summarise performance across the CV iterations as mean (SD)
print('CV results')
print('Bac: Mean(SD) = %.3f(%.3f)' % (bac_cv.mean(), bac_cv.std()))
print('Sens: Mean(SD) = %.3f(%.3f)' % (sens_cv.mean(), sens_cv.std()))
print('Spec: Mean(SD) = %.3f(%.3f)' % (spec_cv.mean(), spec_cv.std()))

CV results
Bac: Mean(SD) = 0.737(0.036)
Sens: Mean(SD) = 0.737(0.092)
Spec: Mean(SD) = 0.736(0.075)
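
The per-iteration metrics printed in the log above can be recomputed by hand from each confusion matrix. The sketch below is only illustrative: the helper name `metrics_from_confusion` is hypothetical, and it assumes the matrix is laid out as `[[TN, FP], [FN, TP]]` (rows are true classes, columns are predicted classes, with healthy controls first and patients second). That layout is inferred from the fact that it reproduces the printed values, not stated in the output itself.

```python
def metrics_from_confusion(cm):
    """Return (balanced accuracy, sensitivity, specificity).

    Assumes cm is laid out as [[TN, FP], [FN, TP]]:
    rows are true classes, columns are predicted classes.
    """
    (tn, fp), (fn, tp) = cm
    sens = tp / (tp + fn)      # proportion of patients correctly identified
    spec = tn / (tn + fp)      # proportion of controls correctly identified
    bac = (sens + spec) / 2    # balanced accuracy is the mean of the two
    return bac, sens, spec


# Confusion matrix printed under "CV iteration: 3" in the log above
cm_iter3 = [[26, 11],
            [12, 25]]

bac, sens, spec = metrics_from_confusion(cm_iter3)
print('Balanced accuracy: %.3f' % bac)   # 0.689
print('Sensitivity: %.3f' % sens)        # 0.676
print('Specificity: %.3f' % spec)        # 0.703
```

Because balanced accuracy averages sensitivity and specificity, an iteration can reach a reasonable balanced accuracy while the two underlying rates differ considerably, as in iteration 5 (sensitivity 0.919, specificity 0.595). This is also why the standard deviations of sensitivity and specificity in the CV results are larger than that of balanced accuracy.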


## Post-hoc analysis

Once we have our final model, we can run several additional analyses. This tutorial does not cover these analyses, but we could look at the following:

*   Test balanced accuracy, sensitivity and specificity for statistical significance via permutation testing
*   Identify the features that provided the greatest contribution to the task 
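
As a rough illustration of both ideas, the sketch below uses scikit-learn's `permutation_test_score` to test balanced accuracy against a null distribution obtained by shuffling the labels, and then inspects the weights of a linear SVM. It is not the analysis from the full tutorial. Several things here are assumptions made for the example: the data are synthetic (generated with `make_classification`), the classifier is a `LinearSVC` with `C=0.125`, the features are standardised inside a pipeline, and a single 5-fold CV with 100 permutations is used to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the tutorial data (assumption: NOT the real dataset)
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

# Scaling sits inside the pipeline so it is re-fitted on each training fold
clf = make_pipeline(StandardScaler(), LinearSVC(C=0.125, max_iter=10000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 1) Permutation test: refit the model on label-shuffled data many times
#    and compare the observed score with the resulting null distribution
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, scoring='balanced_accuracy', cv=cv,
    n_permutations=100, random_state=0)

print('Observed balanced accuracy: %.3f' % score)
print('Mean balanced accuracy under permutation: %.3f' % perm_scores.mean())
print('p-value: %.4f' % p_value)

# 2) Feature contribution: for a linear SVM, each feature has one weight
clf.fit(X, y)
weights = clf.named_steps['linearsvc'].coef_.ravel()
ranking = np.argsort(np.abs(weights))[::-1]   # largest absolute weight first

print('Five features with the largest absolute weight:', ranking[:5])
```

With 100 permutations the smallest p-value this procedure can return is 1/101, so more permutations are needed for finer resolution. The weights of a linear model are only a first look at feature contribution: they depend on feature scaling and on correlations between features, so they should be interpreted with caution.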