<div style="text-align: left;">
<table style="width:100%; background-color:transparent;">
  <tr style="background-color:transparent;">
    <td style="background-color:transparent;"><img href="http://www.datascience-paris-saclay.fr" src="http://project.inria.fr/saclaycds/files/2017/02/logoUPSayPlusCDS_990.png" ></td>
    <td style="background-color:transparent;"><img href="https://research.pasteur.fr/en/team/group-roberto-toro" src="https://paris-saclay-cds.github.io/autism_challenge/images/institut_pasteur_logo.svg" ></td>
    <td style="background-color:black;"><img href="fer.unizg.hr" src="https://www.fer.unizg.hr/_pub/themes_static/fer2016/default/img/FER_logo.png"></td>
  </tr>
</table> 
</div>

<center><h1>Impact of sMRI preprocessing on autism classification using machine learning methods</h1></center>



<center><h3>Forked from data challenge on Autism Spectrum Disorder detection</h3></center>
<br/>
<center>Roberto Toro (Institut Pasteur), Nicolas Traut (Institut Pasteur), Anita Beggiato (Institut Pasteur), Katja Heuer (Institut Pasteur),<br /> Gael Varoquaux (Inria, Parietal), Alex Gramfort (Inria, Parietal), Balazs Kegl (LAL),<br /> Guillaume Lemaitre (CDS), Alexandre Boucaud (CDS), and Joris van den Bossche (CDS)<br />Lana Barić (FER), Roko Krstičević (FER)</center>

## Table of Contents

0. [Prerequisites](#Software-prerequisites)
1. [Introduction](#Introduction:-detecting-autism)
2. [The data](#The-data)
3. [Workflow](#Workflow)
4. [Evaluation](#Evaluation)
5. [Submission](#Submitting-to-the-online-challenge:-ramp.studio)
6. [More information](#More-information)
7. [Questions](#Question)

**To download and run this notebook**: download the [full starting kit](https://github.com/ramp-kits/autism/archive/master.zip), with all the necessary files.

## Software prerequisites

This starting kit requires the following dependencies:

* `numpy`
* `scipy`
* `pandas`
* `scikit-learn`
* `matplotlib`
* `seaborn`
* `nilearn`
* `jupyter`
* `ramp-workflow`

The following two cells install any missing dependencies. They are commented out; uncomment them if needed.

In [None]:
# import sys
# !{sys.executable} -m pip install scikit-learn==0.21.3 seaborn==0.10.0 nilearn==0.7.1

Install `ramp-workflow` (release 0.2.1) from GitHub.

In [None]:
# !{sys.executable} -m pip install https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/0.2.1

In [None]:
%load_ext autoreload
%autoreload 2

## Introduction: detecting autism

Autism spectrum disorder (ASD) is a developmental disorder affecting communication and behavior, with a wide range of symptom severity. ASD has been reported to affect approximately 1 in 166 children.

Although there is a consensus that ASD is related to atypical brain networks and anatomy, the nature of these differences in brain anatomy and functional connectivity remains unclear. To address this issue, studies on large cohorts of subjects are necessary to ensure relevant findings.

## The data
We start by downloading the data from the internet.

In [None]:
from problem import get_train_data

data_train = get_train_data()

In [None]:
data_train

In [None]:
print(data_train['participants_asd'])

In [None]:
print('Number of subjects in the training set: {}'.format(data_train['participants_asd'].size))

#### Participant features

In [None]:
data_train_participants = data_train[[col for col in data_train.columns if col.startswith('participants')]]
data_train_participants.head()

#### Structural MRI features

A set of structural features has been extracted for each subject: (i) normalized brain volume computed using the subcortical segmentation of FreeSurfer and (ii) cortical thickness and area for the right and left hemispheres, also obtained from FreeSurfer.

In [None]:
data_train_anatomy = data_train[[col for col in data_train.columns if col.startswith('anatomy')]]
data_train_anatomy.head()

Note that the column `anatomy_select` contains a label assigned during a manual quality check (i.e. `0` and `3` reject, `1` accept, `2` accept with reserve). This column can be used during training, for instance to exclude noisy data.

In [None]:
data_train_anatomy['anatomy_select'].head()
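
As a minimal sketch of how this column could be used to exclude rejected scans, the example below filters a small made-up DataFrame. The values and the column `anatomy_volume` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real training set (values are invented)
toy = pd.DataFrame({
    'anatomy_volume': [1.0, 1.1, 0.9, 1.2, 1.05],
    'anatomy_select': [0, 1, 2, 3, 1],
})

# Keep scans that were accepted (1) or accepted with reserve (2)
kept = toy[toy['anatomy_select'].isin([1, 2])]
print(kept)
```

Whether to keep the "accept with reserve" scans is a modelling choice; restricting the list to `[1]` would keep only the accepted ones.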

#### Testing data

The testing data can be loaded similarly as follows:

In [None]:
from problem import get_test_data

data_test = get_test_data()

In [None]:
data_test.head()

In [None]:
print(data_test['participants_asd'])

## Workflow

<img src="./img/workflow2.png" width="100%">

### Quality selector

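The quality selector resamples the data set according to a `quality_factor` between 0 and 1, using the manual quality-check label `anatomy_select`. With `quality_factor = 0` only rejected images (`0` or `3`) are drawn, with `0.5` only images accepted with reserve (`2`), and with `1` only accepted images (`1`); intermediate values mix two adjacent groups. Each group is sampled with replacement, and every selected row is then relabeled with `anatomy_select = 1`.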

In [None]:
import pandas as pd
from problem import get_train_data

data_train = get_train_data()


def quality_selector(data, quality_factor=1):
    """Resample `data` by image quality; `quality_factor` must lie in [0, 1]."""
    if not 0 <= quality_factor <= 1:
        raise ValueError('quality_factor must be between 0 and 1.')

    # Fraction of each quality group to draw
    bad_quality = 1 - min(2 * quality_factor, 1)
    if quality_factor > 0.5:
        decent_quality = 2 - 2 * quality_factor
        good_quality = 2 * quality_factor - 1
    else:
        decent_quality = 2 * quality_factor
        good_quality = 0

    # Split the data by manual quality-check label:
    # 0 or 3 = rejected, 2 = accepted with reserve, 1 = accepted
    groups = [
        ('bad', data[data['anatomy_select'].isin([0, 3])], bad_quality),
        ('decent', data[data['anatomy_select'] == 2], decent_quality),
        ('good', data[data['anatomy_select'] == 1], good_quality),
    ]

    # Sample each group with replacement according to its fraction
    sampled = []
    for name, group, fraction in groups:
        if fraction > 0:
            group = group.sample(frac=fraction, replace=True)
            print(f"Amount of {name} quality images: {len(group)}")
            if not group.empty:
                sampled.append(group)
        else:
            print(f"Amount of {name} quality images: 0")

    print()

    # Finally we concatenate the sampled groups
    if not sampled:
        raise ValueError('All sampled data is empty. Try a larger quality_factor.')
    tmp_data = pd.concat(sampled)

    # Relabel every selected row as accepted
    tmp_data['anatomy_select'] = 1

    return tmp_data


# Example: quality_factor = 0 draws only from the rejected images
a = quality_selector(data_train, 0)
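
To make the mapping from `quality_factor` to sampling fractions concrete, the following minimal sketch reproduces only the fraction arithmetic of `quality_selector`. The helper name `quality_fractions` is hypothetical, and no data are sampled.

```python
def quality_fractions(quality_factor):
    """Return the (bad, decent, good) sampling fractions for a factor in [0, 1]."""
    bad = 1 - min(2 * quality_factor, 1)
    if quality_factor <= 0.5:
        decent = 2 * quality_factor
    else:
        decent = 2 - 2 * quality_factor
    good = max(2 * quality_factor - 1, 0)
    return bad, decent, good


for q in (0, 0.25, 0.5, 0.75, 1):
    bad, decent, good = quality_fractions(q)
    print(f"quality_factor={q}: bad={bad}, decent={decent}, good={good}")
```

The three fractions always sum to 1, but each one is applied to a group of a different size, so the total number of sampled rows varies with `quality_factor`.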



### Evaluation

The framework is evaluated with a cross-validation approach. The metrics used are the area under the ROC curve (ROC-AUC) and the accuracy.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from problem import get_cv

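def evaluation(X, y):
    # FeatureExtractor and Classifier are defined in the cells below;
    # those cells must be run before calling this function.
    pipe = make_pipeline(FeatureExtractor(), Classifier())
    cv = get_cv(X, y)
    results = cross_validate(pipe, X, y, scoring=['roc_auc', 'accuracy'], cv=cv,
                             verbose=1, return_train_score=True,
                             n_jobs=1)

    return results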
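
The cell above relies on `get_cv` from `problem.py` and on the `FeatureExtractor` and `Classifier` classes defined in the following cells. As a self-contained sketch of the same evaluation scheme, the example below runs `cross_validate` with both metrics on synthetic data. The `StratifiedShuffleSplit` splitter and the synthetic dataset are assumptions standing in for `get_cv` and the real features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem (stand-in for the anatomical features)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Assumed splitter: repeated stratified train/validation splits
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))
results = cross_validate(pipe, X, y, scoring=['roc_auc', 'accuracy'], cv=cv,
                         return_train_score=True)

# One score per split is stored under 'train_<metric>' and 'test_<metric>'
for key in ('train_roc_auc', 'test_roc_auc', 'train_accuracy', 'test_accuracy'):
    print(f"{key}: {results[key].mean():.3f} +- {results[key].std():.3f}")
```

A large gap between the training and validation scores would suggest overfitting, which is why both are requested with `return_train_score=True`.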

#### FeatureExtractor

The available structural data can be used directly for classification. To this end, we use a feature extractor (i.e. `FeatureExtractor`). This extractor selects only the anatomical features, dropping any information regarding the fMRI-based features.

In [None]:
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin


class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X_df, y=None):
        # nothing to learn: the extractor only selects columns
        return self

    def transform(self, X_df):
        # get only the anatomical information
        X = X_df[[col for col in X_df.columns if col.startswith('anatomy')]]
        # drop the quality-check label, which is not a feature
        return X.drop(columns='anatomy_select')
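
To illustrate what the extractor returns, the sketch below applies the same class to a small made-up DataFrame. All column names except `anatomy_select` are hypothetical, and the values are invented.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X_df, y=None):
        return self

    def transform(self, X_df):
        # Keep the anatomy columns, then drop the quality-check label
        X = X_df[[col for col in X_df.columns if col.startswith('anatomy')]]
        return X.drop(columns='anatomy_select')


# Toy input with hypothetical column names and invented values
toy = pd.DataFrame({
    'participants_age': [10, 12],
    'anatomy_thickness': [2.5, 2.7],
    'anatomy_area': [1500.0, 1480.0],
    'anatomy_select': [1, 2],
})

features = FeatureExtractor().fit_transform(toy)
print(list(features.columns))
```

Only the columns whose names start with `anatomy` survive, and `anatomy_select` is removed because it is a quality label rather than a measurement.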


#### Classifier

We propose to use a logistic regression classifier preceded by a scaler, which removes the mean and scales to unit variance using statistics computed on the training set.

In [None]:
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self
        
    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)
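
As a small illustration of the scaling step, the sketch below checks that `StandardScaler` standardizes new data using the mean and standard deviation of the training set only. The arrays are invented values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [5.0]])

# The statistics come from the training set only
scaler = StandardScaler().fit(X_train)

# Manual standardization with the training mean and (population) standard deviation
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(scaler.transform(X_test).ravel())
print(np.allclose(scaler.transform(X_test), manual))
```

Placing the scaler inside the pipeline means that, during cross-validation, it is refit on each training split and never sees the validation data.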


#### Testing the submission

We can test our pipeline locally using the `evaluation` function defined earlier.

In [None]:
import numpy as np

In [None]:
from problem import get_train_data

data_train = get_train_data()

data_train = quality_selector(data_train, 0.8)

labels_train = data_train['participants_asd']

results = evaluation(data_train.drop('participants_asd', axis=1), labels_train)

print("Training score ROC-AUC: {:.3f} +- {:.3f}".format(np.mean(results['train_roc_auc']),
                                                        np.std(results['train_roc_auc'])))
print("Validation score ROC-AUC: {:.3f} +- {:.3f} \n".format(np.mean(results['test_roc_auc']),
                                                          np.std(results['test_roc_auc'])))

print("Training score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['train_accuracy']),
                                                         np.std(results['train_accuracy'])))
print("Validation score accuracy: {:.3f} +- {:.3f}".format(np.mean(results['test_accuracy']),
                                                           np.std(results['test_accuracy'])))

## More information

You can find more information in the [README](https://github.com/paris-saclay-cds/ramp-workflow/blob/master/README.md) of the [ramp-workflow library](https://github.com/paris-saclay-cds/ramp-workflow).