<h1 style="background-color:#DC143C; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Machine Learning in Amyotrophic Lateral Sclerosis: Achievements, Pitfalls, and Future Directions</h1>

Authors: Vincent Grollemund, Pierre-François Pradat, Giorgia Querin, François Delbot, Gaétan Le Chat, Jean-François Pradat-Peyre and Peter Bede

Front. Neurosci., 28 February 2019 | https://doi.org/10.3389/fnins.2019.00135

Survival from symptom onset ranges from 3 to 5 years depending on genetic, demographic, and phenotypic factors. Despite tireless research efforts, the core etiology of the disease remains elusive and drug development efforts are confounded by the lack of accurate monitoring markers. 

From a mathematical perspective, the main barrier to the development of validated diagnostic, prognostic, and monitoring indicators stems from limited sample sizes. The combination of multiple clinical, biofluid, and imaging biomarkers is likely to increase the accuracy of mathematical modeling and contribute to optimized clinical trial designs.

ML methods utilized in ALS research include Random Forests (RF), Support Vector Machines (SVM), Neural Networks (NN), Gaussian Mixture Models (GMM), Boosting methods, k-Nearest Neighbors (k-NN), Generalized linear regression models, Latent Factor models and Hidden Markov Models (HMM).

https://www.frontiersin.org/articles/10.3389/fnins.2019.00135/full

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Code By Paul Mooney  https://www.kaggle.com/paultimothymooney/starter-notebook-for-end-als-kaggle-challenge

ctrl_vs_case = '/kaggle/input/end-als/end-als/transcriptomics-data/DESeq2/ctrl_vs_case.csv'

In [None]:
#Code by Paul Mooney  https://www.kaggle.com/paultimothymooney/starter-notebook-for-end-als-kaggle-challenge

df = pd.read_csv(ctrl_vs_case)
df.to_csv('/kaggle/working/ctrl_vs_case.csv')

In [None]:
df.info()

In [None]:
df.head()

In [None]:
from sklearn.preprocessing import LabelEncoder

#fill in mean for floats
for c in df.columns:
    if pd.api.types.is_float_dtype(df[c]):
        # fillna returns a new Series, so the result must be assigned back
        df[c] = df[c].fillna(df[c].mean())

#fill in -999 for categoricals
df = df.fillna(-999)
# Label Encoding
for f in df.columns:
    if df[f].dtype=='object': 
        lbl = LabelEncoder()
        lbl.fit(list(df[f].values))
        df[f] = lbl.transform(list(df[f].values))
        
print('Labelling done.')

In [None]:
df = pd.get_dummies(df)

#ValueError: could not convert string to float: 'NEUDJ536EVH'

In [None]:
df.head()

In [None]:
x = df.drop(['Participant_ID', 'CtrlVsCase_Classifier'], axis=1)
x.fillna(999999, inplace=True)
y = df['CtrlVsCase_Classifier']

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.20, random_state = 0)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

x_train_std = ss.fit_transform(x_train)
x_test_std = ss.transform(x_test)

#KNN

In [None]:
#Code by Ravi Chaubey https://www.kaggle.com/ravichaubey1506/predictive-modelling-knn-ann-xgboost

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

knn = KNeighborsClassifier()

param_grid = {'n_neighbors':[5,10,15,25,30,50]}

grid_knn = GridSearchCV(knn,param_grid,scoring='accuracy',cv = 10,refit = True)

In [None]:
grid_knn.fit(x_train_std,y_train)
print("Best Score ==> ", grid_knn.best_score_)
print("Tuned Parameters ==> ",grid_knn.best_params_)
print("Accuracy on Train set ==> ", grid_knn.score(x_train_std,y_train))
print("Accuracy on Test set ==> ", grid_knn.score(x_test_std,y_test))

#Decision Tree

In [None]:
#Code by Ravi Chaubey https://www.kaggle.com/ravichaubey1506/predictive-modelling-knn-ann-xgboost

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()

param_grid = {'criterion':['gini','entropy'],'max_depth':np.arange(2,10),'min_samples_leaf':[0.2,0.4,0.6,0.8,0.9,1]}

grid_dtc = GridSearchCV(dtc,param_grid,scoring='accuracy',cv = 10,refit = True)

In [None]:
grid_dtc.fit(x_train_std,y_train)
print("Best Score ==> ", grid_dtc.best_score_)
print("Tuned Parameters ==> ",grid_dtc.best_params_)
print("Accuracy on Train set ==> ", grid_dtc.score(x_train_std,y_train))
print("Accuracy on Test set ==> ", grid_dtc.score(x_test_std,y_test))

#Decision tree model for diagnosis.

The available data consist of three basic neuroimaging features: average Corticospinal Tract (CST) Fractional Anisotropy (FA), Motor Cortex (MC) thickness, and average Corpus Callosum (CC) FA. For patient 0, these features are reduced CST FA, reduced MC thickness, and reduced CC FA. The target is to classify subjects as healthy or ALS. Establishing a diagnosis requires running through the decision tree until there are no more questions to answer. At step 1, the closed question directs to the right node due to patient 0's CST pathology. At step 2, the closed question directs to the right node due to patient 0's MC pathology. At step 3, the closed question directs to the left node due to patient 0's CC involvement. Step 3 is the last step, as there are no more steps below it. The diagnosis for patient 0 is the value of the arrival cell.

![](https://www.frontiersin.org/files/Articles/438192/fnins-13-00135-HTML/image_m/fnins-13-00135-g003.jpg)https://www.frontiersin.org/articles/10.3389/fnins.2019.00135/full
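The walk through the tree described above can be sketched in a few lines of Python. This is only an illustration: the class names, the feature keys, and above all the leaf labels are assumptions made for the example (the text gives the path taken by patient 0, not the values stored in the cells).

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    diagnosis: str  # hypothetical label stored in an arrival cell


@dataclass
class Node:
    feature: str   # boolean feature tested by the closed question
    yes_goes: str  # branch ("left" or "right") taken when the answer is yes
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def diagnose(tree, patient):
    """Follow the closed questions until a leaf is reached."""
    path = []
    node = tree
    while isinstance(node, Node):
        if patient[node.feature]:
            direction = node.yes_goes
        else:
            direction = "left" if node.yes_goes == "right" else "right"
        path.append(direction)
        node = node.left if direction == "left" else node.right
    return node.diagnosis, path


# Branch directions follow the text; leaf labels are assumed for illustration.
step_3 = Node("cc_fa_reduced", "left", left=Leaf("ALS"), right=Leaf("healthy"))
step_2 = Node("mc_thickness_reduced", "right", left=Leaf("healthy"), right=step_3)
step_1 = Node("cst_fa_reduced", "right", left=Leaf("healthy"), right=step_2)

patient_0 = {
    "cst_fa_reduced": True,
    "mc_thickness_reduced": True,
    "cc_fa_reduced": True,
}

diagnosis, path = diagnose(step_1, patient_0)
print(path, diagnosis)
```

The grid-searched `DecisionTreeClassifier` above learns its questions and thresholds from data; the hand-written tree here only mirrors the three steps of the figure.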

#SVC

In [None]:
#Code by Ravi Chaubey https://www.kaggle.com/ravichaubey1506/predictive-modelling-knn-ann-xgboost

from sklearn.svm import SVC

svc = SVC(probability=True)

param_grid = {'kernel':['rbf','linear'],'C':[0.01,0.1,1,0.001],'gamma':[0.1,0.01,0.2,0.4]}

grid_svc = GridSearchCV(svc,param_grid,scoring='accuracy',cv = 10,refit = True)

In [None]:
grid_svc.fit(x_train_std,y_train)
print("Best Score ==> ", grid_svc.best_score_)
print("Tuned Parameters ==> ",grid_svc.best_params_)
print("Accuracy on Train set ==> ", grid_svc.score(x_train_std,y_train))
print("Accuracy on Test set ==> ", grid_svc.score(x_test_std,y_test))

#SVM model for prognosis.

The available data consist of basic clinical and demographic features: age and site of onset. The objective is to classify patients according to 3-year survival. In the input space (where features are interpretable), no linear hyperplane can divide the two patient populations. The SVM model projects the data into a higher dimensional space, in our example a three-dimensional space. The set of two features is mapped to a set of three features. In the feature space, a linear hyperplane can be computed which discriminates the two populations accurately. The three features used for discrimination are unavailable for analysis and interpretability is lost in the process.


![](https://www.frontiersin.org/files/Articles/438192/fnins-13-00135-HTML/image_m/fnins-13-00135-g005.jpg)https://www.frontiersin.org/files/Articles/438192/fnins-13-00135-HTML/image_t/fnins-13-00135-g005.gif
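The idea of mapping two features to three so that a plane can separate the classes can be illustrated with a toy example. The points below are made-up coordinates, not age or site of onset, and the explicit map `feature_map` is an assumption chosen for clarity; a kernel SVM such as `SVC(kernel='rbf')` performs a comparable mapping implicitly through its kernel function rather than by computing new features.

```python
# Toy 2-D points: class 0 sits near the origin, class 1 surrounds it,
# so no straight line in the input space separates the two classes.
inner = [(0.0, 0.5), (0.5, 0.0), (-0.5, 0.0), (0.0, -0.5)]
outer = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (0.0, -2.0)]


def feature_map(point):
    """Map (x1, x2) to (x1, x2, x1^2 + x2^2): two features become three."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)


def classify(point, threshold=1.0):
    """Linear rule in the 3-D feature space: side of the plane z = threshold."""
    return int(feature_map(point)[2] > threshold)


predictions_inner = [classify(p) for p in inner]
predictions_outer = [classify(p) for p in outer]
print(predictions_inner, predictions_outer)
```

The third coordinate has no direct clinical meaning, which is the loss of interpretability the passage refers to.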

#Voting Classifier

In [None]:
#Code by Ravi Chaubey https://www.kaggle.com/ravichaubey1506/predictive-modelling-knn-ann-xgboost

from sklearn.ensemble import VotingClassifier

classifiers = [('knn',grid_knn),('tree',grid_dtc),('svc',grid_svc)]

vtc = VotingClassifier(classifiers)

In [None]:
vtc.fit(x_train_std,y_train)
print("Accuracy on Test set ==> ", vtc.score(x_test_std,y_test))

#Feature Selection

In [None]:
#Code by Ravi Chaubey https://www.kaggle.com/ravichaubey1506/predictive-modelling-knn-ann-xgboost

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

#for i in range(2,7):
 #   rfe = RFE(estimator=RandomForestClassifier(),n_features_to_select=i, verbose=0)
  #  rfe.fit(x_train_std,y_train)
   # print(f"Accuracy with Feature {i} ==>",metrics.accuracy_score(y_test, rfe.predict(x_test_std)))

In [None]:
#rfe = RFE(estimator=RandomForestClassifier(),n_features_to_select=5, verbose=0)
#rfe.fit(x_train_std,y_train)

#Random forest for diagnosis. 
 
The available data consist of basic biomarker features: MUNIX, CSF Neurofilament (NF) levels, Vital Capacity (VC), and BMI. The objective is to classify subjects as healthy or ALS patients. The RF contains three decision trees, which use different feature subsets to learn a diagnosis model. Tree A learns on all available features, Tree B learns on MUNIX and VC, and Tree C learns on NF levels and BMI. Each tree proposes a diagnosis. The RF diagnosis is computed as the majority vote of the trees contained in the forest. Given that two out of three trees concluded that patient 0 had ALS, the final diagnosis suggested by the model is ALS.

![](https://www.frontiersin.org/files/Articles/438192/fnins-13-00135-HTML/image_m/fnins-13-00135-g004.jpg)https://www.frontiersin.org/articles/10.3389/fnins.2019.00135/full
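The majority vote can be sketched as follows. Everything numeric here is hypothetical: the patient values and the cut-offs inside each "tree" are arbitrary placeholders, not clinical thresholds, and which two trees vote ALS is an assumption, since the text only states that two out of three did. Each tree is reduced to a single rule over its feature subset to keep the example short.

```python
from collections import Counter

# Hypothetical values for patient 0 (placeholders, not clinical data).
patient_0 = {"munix": 80, "nf": 3000, "vc": 85, "bmi": 22}


def tree_a(p):
    """Tree A may use all features; here an arbitrary rule on MUNIX and NF."""
    return "ALS" if p["munix"] < 100 and p["nf"] > 2000 else "healthy"


def tree_b(p):
    """Tree B only sees MUNIX and VC."""
    return "ALS" if p["munix"] < 100 and p["vc"] < 80 else "healthy"


def tree_c(p):
    """Tree C only sees NF levels and BMI."""
    return "ALS" if p["nf"] > 2000 or p["bmi"] < 18 else "healthy"


votes = [tree(patient_0) for tree in (tree_a, tree_b, tree_c)]
final_diagnosis = Counter(votes).most_common(1)[0][0]
print(votes, "->", final_diagnosis)
```

In `RandomForestClassifier`, the feature subsets and split rules are drawn and learned automatically, and the number of trees is set by `n_estimators`.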

In [None]:
#print("Important Features are ==> ",list(df.columns[:7][rfe.support_]))

#Random Forest Classifier

In [None]:
#Code by Ravi Chaubey https://www.kaggle.com/ravichaubey1506/predictive-modelling-knn-ann-xgboost

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

param_grid = {'n_estimators':[200,500,1000],
              'max_depth':[2,3,4,5],
              'min_samples_leaf':[0.2,0.4,0.6,0.8,1],
              'max_features':['sqrt','log2'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
              'criterion':['gini','entropy']}

grid_rfc = RandomizedSearchCV(rfc,param_grid,n_iter=20,scoring='accuracy',cv = 10,refit = True)

In [None]:
grid_rfc.fit(x_train_std,y_train)
print("Best Score ==> ", grid_rfc.best_score_)
print("Tuned Parameters ==> ",grid_rfc.best_params_)
print("Accuracy on Train set ==> ", grid_rfc.score(x_train_std,y_train))
print("Accuracy on Test set ==> ", grid_rfc.score(x_test_std,y_test))

#XGBoost Classifier

In [None]:
#Code by Ravi Chaubey https://www.kaggle.com/ravichaubey1506/predictive-modelling-knn-ann-xgboost

#import xgboost as xgb

#xgbcl = xgb.XGBClassifier()

#param_grid = {'booster':['gbtree','gblinear'],
 #            'colsample_bytree':[0.4,0.6,0.8,1],
  #           'learning_rate':[0.01,0.1,0.2,0.4],
   #          'max_depth':[2,3,4,6],
    #         'n_estimators':[200,300,400,500],
     #         'subsample':[0.4,0.6,0.8,1]}

#grid_xgb = RandomizedSearchCV(xgbcl,param_grid,n_iter=30,scoring='accuracy',cv = 10,refit = True)

In [None]:
#grid_xgb.fit(x_train_std,y_train)
#print("Best Score ==> ", grid_xgb.best_score_)
#print("Tuned Parameters ==> ",grid_xgb.best_params_)
#print("Accuracy on Train set ==> ", grid_xgb.score(x_train_std,y_train))
#print("Accuracy on Test set ==> ", grid_xgb.score(x_test_std,y_test))

#ANN

In [None]:
#Code by Ravi Chaubey https://www.kaggle.com/ravichaubey1506/predictive-modelling-knn-ann-xgboost

import keras
from keras.models import Sequential
from keras.layers import Dense

classifier = Sequential()

# Take the input width from the data instead of hard-coding the column count
classifier.add(Dense(units= 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = x_train_std.shape[1]))
classifier.add(Dense(units= 6, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

classifier.fit(x_train_std, y_train, batch_size = 10, epochs = 100)

In [None]:
import sklearn.metrics as metrics


y_pred_test = classifier.predict(x_test_std)
y_pred_test=y_pred_test>0.5

y_pred_train = classifier.predict(x_train_std)
y_pred_train=y_pred_train>0.5

print("Accuracy on Train Set ==> ",metrics.accuracy_score(y_train,y_pred_train))
print("Accuracy on Test Set ==> ",metrics.accuracy_score(y_test,y_pred_test))

#Neural Network model for prognosis.

The available data consist of basic demographic and clinical features: age, BMI and diagnostic delay. For patient 0, these features are 50, 15 kg/m², and 15 months, respectively. The objective is to predict ALSFRS-r in 1 year. The multi-layer perceptron consists of two layers. Nodes are fed by input with un-shaded arrows. At layer 1, the three features are combined linearly to compute three node values, C1, C2, and C3. C1 is a linear combination of age and delay, C2 is a linear combination of age, delay and BMI, and C3 is a linear combination of BMI and delay. For patient 0, computing the three values returns 10, 2, and 2 for C1, C2, and C3, respectively. At layer 2, outputs from layer 1 (i.e., C1, C2, and C3) are combined linearly to compute two values, CA and CB. CA is a linear combination of C1 and C2 while CB is a linear combination of C1 and C3. For patient 0, computing the two values gives 24 and 14 for CA and CB, respectively. The model output is obtained by computing a linear combination of CA and CB and applying a non-linear function (in this case a maximum function, which can be seen as a thresholding function that accepts only positive values). The output is the predicted motor function decline rate. For patient 0, the returned score is 26.

![](https://www.frontiersin.org/files/Articles/438192/fnins-13-00135-HTML/image_m/fnins-13-00135-g006.jpg)https://www.frontiersin.org/files/Articles/438192/fnins-13-00135-HTML/image_t/fnins-13-00135-g005.gif
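The forward pass described above can be reproduced step by step. The text gives the node values but not the weights, so every weight below is a hypothetical choice made only so that the intermediate values (10, 2, 2, then 24, 14) and the final score (26) match the description. `Fraction` is used to keep the arithmetic exact.

```python
from fractions import Fraction

patient_0 = {"age": 50, "bmi": 15, "delay": 15}

# Hypothetical weights: each node only connects to the inputs named in the text.
layer_1 = {
    "C1": {"age": Fraction(1, 2), "delay": -1},
    "C2": {"age": Fraction(1, 25), "delay": 1, "bmi": -1},
    "C3": {"bmi": 1, "delay": Fraction(-13, 15)},
}
layer_2 = {
    "CA": {"C1": 2, "C2": 2},
    "CB": {"C1": 1, "C3": 2},
}
output_weights = {"CA": Fraction(1, 2), "CB": 1}


def linear_layer(inputs, weights):
    """Compute one linear combination of the inputs per node."""
    return {
        node: sum(w * inputs[name] for name, w in node_weights.items())
        for node, node_weights in weights.items()
    }


hidden_1 = linear_layer(patient_0, layer_1)  # C1, C2, C3
hidden_2 = linear_layer(hidden_1, layer_2)   # CA, CB

# Non-linearity: max(0, .) keeps only positive values.
score = max(0, sum(w * hidden_2[name] for name, w in output_weights.items()))

print({k: int(v) for k, v in hidden_1.items()})
print({k: int(v) for k, v in hidden_2.items()})
print(int(score))
```

In the Keras model above, the corresponding weights are learned by `classifier.fit` rather than set by hand, and a non-linear activation follows every layer.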

<h1 style="background-color:#DC143C; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">The Limitations of Machine Learning Approaches</h1>

Despite the pragmatic advantages, the application of ML models requires a clear understanding of what determines model performance and the potential pitfalls of specific models.

Concerns regarding data analyses should be examined first; these include data sparsity, data bias, and causality assumptions. Good practice recommendations for model design will then be presented, including the management of missing data, model overfitting, model validation, and performance reporting.
