<a href="https://colab.research.google.com/github/retico/cmepda_medphys/blob/master/L7_code/Lecture7_ML_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Classification and cross-validation with sklearn**

We will implement ML methods to distinguish subjects with autism spectrum disorder (ASD) from controls, based on brain features computed with the [FreeSurfer](https://surfer.nmr.mgh.harvard.edu/) segmentation software. We analyze a subsample of the large number of features generated by FreeSurfer for the [ABIDE I](http://fcon_1000.projects.nitrc.org/indi/abide/) data cohort.

We will use [pandas](https://pandas.pydata.org/) and [sklearn](https://scikit-learn.org/stable/). Both libraries are already installed on Colab. For some operations it will be necessary to convert the pandas DataFrame into a NumPy array.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC  # Support Vector Classification

# Read the dataset


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
dataset_file = "/content/gdrive/MyDrive/CMEPDA_MedPhys_datasets/FEATURES/Brain_MRI_FS_ABIDE/FS_features_ABIDE_males_someGlobals.csv"
# check and modify the path of the FS_features_ABIDE_males_someGlobals.csv file you downloaded in your drive
df = pd.read_csv(dataset_file, sep=';')
df.head()

As in the previous example, we add a column with the *Site* information, which we can derive from the *FILE_ID*.

In [None]:
df['Site'] = df['FILE_ID'].apply(lambda x: x.split('_')[0])
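
The same extraction can also be written with the vectorized pandas string methods. The sketch below runs on a toy frame: the `FILE_ID` values are invented for illustration and only assume that the site name precedes the first underscore.

```python
import pandas as pd

# Hypothetical FILE_ID values, assuming a "<Site>_<subject id>" pattern
toy = pd.DataFrame({'FILE_ID': ['KKI_0050772', 'Stanford_0051160', 'UCLA_1_0051201']})

# Vectorized equivalent of apply(lambda x: x.split('_')[0])
toy['Site'] = toy['FILE_ID'].str.split('_').str[0]
site_list = toy['Site'].tolist()
print(site_list)
```

Note that any text after the first underscore is discarded, so identifiers that share a prefix end up in the same site.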

In [None]:
df.columns

As in previous examples, we make the DX_GROUP column more readable. This time we add a column with the readable labels and we keep the numerical labels [-1,1], which can be used directly in the classifier training process.


In [None]:
df['DX_GROUP_STR'] = df.DX_GROUP.apply(lambda x: 'Controls' if x==-1 else 'ASD')
df.head()

In [None]:
print(df.DX_GROUP_STR.unique())
print(df.DX_GROUP.unique())

We can count the entries grouped by the diagnostic group

In [None]:
df.groupby('DX_GROUP_STR')['FILE_ID'].count()

We have a comparable number of subjects in the two diagnostic categories, which is fine for training a classifier.
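
Class balance matters because it sets the accuracy that a trivial classifier would reach by always predicting the most frequent class. A minimal sketch on invented labels (not the ABIDE data) shows how to read that baseline from the class fractions:

```python
import pandas as pd

# Toy labels standing in for df['DX_GROUP'] (-1 = Controls, 1 = ASD)
labels = pd.Series([-1, -1, -1, 1, 1, 1, 1, -1, 1, -1])

# Fraction of subjects in each class
fractions = labels.value_counts(normalize=True)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = fractions.max()
print(fractions)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```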

# Binary classification: ASD vs. control subjects

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [None]:
df.columns

We can select a set of features that we expect to be relevant for predicting the diagnostic group.

In [None]:
features = ['AGE_AT_SCAN', 'lh_MeanThickness',
       'rh_MeanThickness', 'lhCortexVol', 'rhCortexVol',
       'lhCerebralWhiteMatterVol', 'rhCerebralWhiteMatterVol', 'TotalGrayVol',
       ]

We split the data sample into train and test subsets.

In [None]:
train_set, test_set = train_test_split(df, test_size = 0.3)
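
The split above is random, so it changes at every run and the class proportions may differ slightly between the two subsets. A possible variant, sketched here on synthetic data rather than on the ABIDE features, uses the `stratify` and `random_state` arguments of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 subjects, 3 features, balanced labels in {-1, 1}
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([-1] * 50 + [1] * 50)

# stratify keeps the class proportions equal in both subsets;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42)

print(X_train.shape, X_test.shape)
print((y_train == 1).mean(), (y_test == 1).mean())
```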

In [None]:
train_set[features]

In [None]:
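# Fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler().fit(train_set[features])
classifier = SVC(kernel='linear', probability=True)
classifier = classifier.fit(scaler.transform(train_set[features]), train_set['DX_GROUP'])
classifier

We can compute the classification accuracy on the test set, applying the scaling parameters learned from the training set

In [None]:
classifier.score(scaler.transform(test_set[features]), test_set['DX_GROUP'])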

## k-fold cross-validation

When training an ML model, the data should be partitioned into a train set and a test set. The k-fold cross-validation (CV) scheme repeats this partition k times, so that each sample is used for testing exactly once, and provides a more robust estimate of the performance together with its associated error. Usually k=5 or k=10 is used, depending on the dataset size and on the available computing resources.
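
As a compact illustration of the scheme, the sketch below runs a 5-fold stratified CV with `cross_val_score` on synthetic data generated with `make_classification` (an assumption made only to keep the example self-contained). The scaler is placed in a pipeline so that it is fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem (not the ABIDE data)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# The pipeline refits the scaler on each training fold
model = make_pipeline(StandardScaler(), SVC(kernel='linear'))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy value per fold
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring='accuracy')
print(f"Accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

The mean over the folds is the performance estimate and the standard deviation gives an indication of its variability.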

We have to import the model selection, preprocessing and metric functions from the sklearn library.
For some operations it will be necessary to convert the pandas DataFrame into a NumPy array.

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc
import numpy as np
from numpy import interp
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.pipeline import Pipeline

As we did before, we select the features we will use as predictors


In [None]:
features = ['AGE_AT_SCAN', 'lh_MeanThickness',
       'rh_MeanThickness', 'lhCortexVol', 'rhCortexVol',
       'lhCerebralWhiteMatterVol', 'rhCerebralWhiteMatterVol', 'TotalGrayVol',
]

As the features (i.e. volume and thickness measures) span different ranges of values, we rescale them column-wise to bring them all to a comparable range. We can apply a z-score transform, e.g. with *sklearn.preprocessing.StandardScaler*, or another normalization transform, e.g. *sklearn.preprocessing.RobustScaler*, which removes the median and scales the data according to a quantile range (by default the IQR: interquartile range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
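
To make the two transforms concrete, the following sketch applies both scalers to a single toy feature with one large outlier and checks them against the formulas written by hand. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One toy feature with a single large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

z_scored = StandardScaler().fit_transform(x)
robust = RobustScaler().fit_transform(x)

# Manual z-score: subtract the mean, divide by the standard deviation
manual_z = (x - x.mean()) / x.std()

# Manual robust scaling: subtract the median, divide by the IQR
q1, median, q3 = np.percentile(x, [25, 50, 75])
manual_robust = (x - median) / (q3 - q1)

print(z_scored.ravel())
print(robust.ravel())
```

The outlier inflates the mean and the standard deviation used by the z-score, while the median and the IQR used by the robust scaling are unaffected by it.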

In [None]:
X, y = df[features], df['DX_GROUP']
# X = StandardScaler().fit_transform(X)

We define a function which implements the k-fold CV, computes and averages the AUC values over the folds and provides plots of the ROC curve.

In [None]:
def plot_cv_roc(X, y, classifier, n_splits=5, scaler=None):
  """
  plot_cv_roc trains the classifier on X data with y labels using a
  stratified k-fold CV with k=n_splits, optionally applying a feature scaler
  (passed as a class, e.g. RobustScaler) fitted within each training fold.
  It plots the ROC curve of each fold and their average, and displays
  the corresponding AUC values and their standard deviation over the k folds.
  """
  if scaler:
    model = Pipeline([('scaler', scaler()),
                    ('classifier', classifier)])
  else:
    model = classifier

  try:
    y = y.to_numpy()
    X = X.to_numpy()
  except AttributeError:
    pass

  cv = StratifiedKFold(n_splits)

  tprs = [] #True positive rate
  aucs = [] #Area under the ROC Curve
  interp_fpr = np.linspace(0, 1, 100)
  plt.figure()
  i = 0
  for train, test in cv.split(X, y):
    probas_ = model.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    # print(f"{fpr} - {tpr} - {thresholds}\n")
    interp_tpr = interp(interp_fpr, fpr, tpr)
    tprs.append(interp_tpr)

    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
            label=f'ROC fold {i} (AUC = {roc_auc:.2f})')
    i += 1
  plt.legend()
  plt.xlabel('False Positive Rate (FPR)')
  plt.ylabel('True Positive Rate (TPR)')
  plt.show()

  plt.figure()
  plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
        label='Chance', alpha=.8)

  mean_tpr = np.mean(tprs, axis=0)
  mean_tpr[-1] = 1.0
  mean_auc = auc(interp_fpr, mean_tpr)
  std_auc = np.std(aucs)
  plt.plot(interp_fpr, mean_tpr, color='b',
          # raw f-string: avoids the invalid escape sequence '\p' in $\pm$
          label=rf'Mean ROC (AUC = {mean_auc:.2f} $\pm$ {std_auc:.2f})',
          lw=2, alpha=.8)

  std_tpr = np.std(tprs, axis=0)
  tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
  tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
  plt.fill_between(interp_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                  label=r'$\pm$ 1 std. dev.')

  plt.xlim([-0.01, 1.01])
  plt.ylim([-0.01, 1.01])
  plt.xlabel('False Positive Rate',fontsize=18)
  plt.ylabel('True Positive Rate',fontsize=18)
  plt.title('Cross-Validation ROC of SVM',fontsize=18)
  plt.legend(loc="lower right", prop={'size': 15})
  plt.show()

*It's ok! We've just defined a function, no output is expected*

In [None]:
help(plot_cv_roc)

In [None]:
classifier = SVC(kernel='linear', probability=True)
plot_cv_roc(X,y, classifier, 5, scaler=RobustScaler)
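
The averaging step inside `plot_cv_roc` works because each fold's ROC curve is interpolated onto a common FPR grid before taking the mean. A minimal sketch of that step alone, without plotting, on invented labels and scores for two hypothetical folds:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy (labels, scores) pairs for two hypothetical folds
folds = [
    (np.array([-1, -1, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8])),
    (np.array([-1, 1, -1, 1]), np.array([0.2, 0.9, 0.3, 0.7])),
]

grid = np.linspace(0, 1, 11)  # common FPR grid shared by all folds
tprs, aucs = [], []
for y_true, y_score in folds:
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # Each fold has its own FPR values: resample the TPR on the common grid
    tprs.append(np.interp(grid, fpr, tpr))
    aucs.append(auc(fpr, tpr))

mean_tpr = np.mean(tprs, axis=0)  # point-wise average ROC curve
print(aucs, np.mean(aucs), np.std(aucs))
```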

## Exploring data sample subsets

If we are convinced that the heterogeneity introduced by the wide age range is excessive, we can reduce the number of subjects according to a predefined age range.

In [None]:
boxplot = df.boxplot(column='AGE_AT_SCAN', by='Site', showfliers=False)
boxplot.set_title("Box plot of subjects' age at scan")
boxplot.get_figure().suptitle('');
boxplot.set_ylabel('Age [y]')

boxplot.set_xticklabels(labels=boxplot.get_xticklabels(), rotation=50);

### Age < threshold

In [None]:
features = ['AGE_AT_SCAN', 'lh_MeanThickness',
       'rh_MeanThickness', 'lhCortexVol', 'rhCortexVol',
       'lhCerebralWhiteMatterVol', 'rhCerebralWhiteMatterVol', 'TotalGrayVol',
]

In [None]:
reduced_df = df[df.AGE_AT_SCAN<20]
X, y = reduced_df[features], reduced_df['DX_GROUP']

In [None]:
df.shape, X.shape

In [None]:
classifier = SVC(kernel='linear', probability=True)
plot_cv_roc(X,y, classifier, 5, scaler=RobustScaler)

### Similar sites
We can explore a subset of sites with similar age characteristics

In [None]:
selected_sites = df[(df['Site'] == 'KKI') | (df['Site'] == 'Stanford') | (df['Site'] == 'UCLA')]
X, y = selected_sites[features], selected_sites['DX_GROUP']
df.shape, X.shape

In [None]:
selected_sites.groupby(['DX_GROUP_STR','Site'])['FILE_ID'].count()

In [None]:
classifier = SVC(kernel='linear', probability=True)
plot_cv_roc(X,y, classifier, 5, scaler=RobustScaler)

A classifier trained on data from site A that has learnt to distinguish the subjects' category through a confounding variable is unlikely to work on data from site B.
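
One way to probe this kind of generalization is a leave-one-site-out CV, where each fold tests on a site that was never seen during training. The sketch below uses `LeaveOneGroupOut` on synthetic data with an invented site-specific offset added to the features; the site labels and effect sizes are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_site = 40
sites = np.repeat(['A', 'B', 'C'], n_per_site)   # hypothetical site labels
y_toy = np.tile([-1, 1], 3 * n_per_site // 2)    # balanced labels within each site

# Features: a class effect plus a site-specific offset (the confounder)
site_offset = np.repeat([0.0, 1.0, 2.0], n_per_site)[:, None]
X_toy = rng.normal(size=(3 * n_per_site, 4)) + 0.5 * y_toy[:, None] + site_offset

model = make_pipeline(RobustScaler(), SVC(kernel='linear'))

# Each fold trains on two sites and tests on the held-out one
scores = cross_val_score(model, X_toy, y_toy, groups=sites, cv=LeaveOneGroupOut())
print(scores)
```

A marked drop of the held-out-site scores with respect to a standard k-fold CV would suggest that the classifier is relying on site-related information.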




# Binary classification: Site A vs. site B


Let's see whether the Site information is a confounding variable for the ASD vs. control classification. To do so, we evaluate how well a classifier can distinguish site A from site B using the same features.

The first thing we need to do is to select the dataframe rows that belong to two different sites, e.g. KKI and Stanford.

In [None]:
two_sites = df[(df['Site'] == 'KKI') | (df['Site'] == 'Stanford')]

two_sites.tail()

In [None]:
features = ['AGE_AT_SCAN', 'lh_MeanThickness',
       'rh_MeanThickness', 'lhCortexVol', 'rhCortexVol',
       'lhCerebralWhiteMatterVol', 'rhCerebralWhiteMatterVol', 'TotalGrayVol',
]

# Leave X unscaled here: plot_cv_roc fits the scaler inside each CV fold
X = two_sites[features]
y = two_sites['Site'].apply(lambda x: 1 if x=='KKI' else -1)

In [None]:
classifier = SVC(kernel='linear', probability=True, random_state=1)
plot_cv_roc(X, y, classifier, 5, scaler=RobustScaler)

# Accounting for confounders in the analysis

To mitigate the effect of the different acquisition sites on the features, we have to harmonize the data across sites. We can attempt to do so by applying, for example, a per-site feature normalization (*sklearn.preprocessing.RobustScaler*).

In [None]:
df_site1 = df[df.Site == 'KKI']
df_site2 = df[df.Site == 'Stanford']
df_site3 = df[df.Site == 'UCLA']

In [None]:
features = ['AGE_AT_SCAN', 'lh_MeanThickness',
       'rh_MeanThickness', 'lhCortexVol', 'rhCortexVol',
       'lhCerebralWhiteMatterVol', 'rhCerebralWhiteMatterVol', 'TotalGrayVol',
]

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
X_site1 = RobustScaler().fit_transform(df_site1[features])
X_site2 = RobustScaler().fit_transform(df_site2[features])
X_site3 = RobustScaler().fit_transform(df_site3[features])

In [None]:
X = np.concatenate((X_site1, X_site2, X_site3))
y = np.concatenate((df_site1['DX_GROUP'], df_site2['DX_GROUP'], df_site3['DX_GROUP']))
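
The per-site scaling above can also be written for any number of sites with a pandas `groupby`. The sketch below is one possible formulation on a toy frame with invented values; `robust_scale` is a helper defined here for illustration, and the result is checked against `RobustScaler` for one site.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy frame with invented values; column names mirror those of the notebook
toy = pd.DataFrame({
    'Site': ['KKI', 'Stanford', 'KKI', 'Stanford', 'KKI', 'Stanford'],
    'TotalGrayVol': [700.0, 650.0, 720.0, 640.0, 760.0, 690.0],
    'lh_MeanThickness': [2.5, 2.7, 2.6, 2.9, 2.4, 2.8],
})
cols = ['TotalGrayVol', 'lh_MeanThickness']

def robust_scale(col):
    """Subtract the median and divide by the IQR (RobustScaler defaults)."""
    return (col - col.median()) / (col.quantile(0.75) - col.quantile(0.25))

# transform applies robust_scale within each site and keeps the row order
scaled = toy.groupby('Site')[cols].transform(robust_scale)

# Check against RobustScaler fitted on the KKI rows only
kki = toy['Site'] == 'KKI'
expected = RobustScaler().fit_transform(toy.loc[kki, cols])
matches = np.allclose(scaled.loc[kki, cols].to_numpy(), expected)
print(matches)
```

Unlike the concatenation of per-site arrays, this version keeps the rows in their original order, so the labels can be taken directly from the same frame.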

In [None]:
X.shape

In [None]:
classifier = SVC(kernel='linear', probability=True)
plot_cv_roc(X, y, classifier, 5)

# Conclusions
If we obtain good/bad discrimination performance between two different diagnostic classes, are we sure the classifier is exploiting the right descriptive features?

To learn more about how to evaluate the effect of confounding variables in your data, you can read the paper by Ferrari E, *et al.*, [*Dealing with confounders and outliers in classification medical studies: the Autism Spectrum Disorders case study*](https://www.sciencedirect.com/science/article/pii/S0933365719306086), Artif Intell Med 2020, 108, 101926. doi: 10.1016/j.artmed.2020.101926