<h1 style="background-color:#DC143C; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Familial ALS (FALS Amyotrophic Lateral Sclerosis)</h1>

Citation: Boylan K. Familial Amyotrophic Lateral Sclerosis. Neurol Clin. 2015;33(4):807-830. doi:10.1016/j.ncl.2015.07.001

"Familial incidence of ALS was described in scattered publications beginning in the mid 1800s but received limited attention in the literature until the report in 1955 by Kurland and Mulder, which suggested that ALS may be familial in nearly 10% of cases. The application of molecular genetic techniques to ALS, marked by the report in 1993 of linkage of the superoxide dismutase 1 (SOD1) gene in familial ALS, signaled an increasing focus on genetics in ALS as a means to gain insights into the pathogenesis of the disease, identify therapeutic targets and facilitate diagnosis."

"In recent years a rapidly expanding list of genetic variations linked to ALS and their related clinical and pathological correlates continues to provide key insights into the causes of ALS and inform therapy development."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4670044/

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
import plotly.express as px

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

![](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fng1001-103/MediaObjects/41588_2001_Article_BFng1001103_Fig2_HTML.gif)nature.com

In [None]:
df = pd.read_csv('../input/end-als/end-als/clinical-data/filtered-metadata/metadata/clinical/Family_History_Log.csv', encoding='ISO-8859-2')
pd.set_option('display.max_columns', None)
df.head()

**<span style="color:#DC143C;">Column Names:</span>**

famgen = Gender

famguid = GUID of selected relative

famher = Heredity

famhx1 = A detailed family history was obtained

famrel = Relative

famrelsp = Other relative

fhalz = Alzheimer

fhasth = Asthma

fhdem = Dementia

fhgen = Was genetic testing for ALS/Dementia/FTD performed

fhgnc9 = Family member has C9ORF72 mutation

fhgnfus = Family member has FUS mutation

fhgnot = Family Member Has Other Gene Mutation

fhgnsod1 = Family member has SOD1 mutation

fhgntdp = Family member has TDP-43 mutation

fhpsy = Specify psychiatric disorder

fhstk = Stroke
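
The column codes listed above can be turned into a lookup table so that tables and plots show readable labels. This is only a sketch: the helper name `with_readable_columns` and the small demo frame (including its `subject_id` column and values) are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Code-to-description map copied from the column list above.
COLUMN_LABELS = {
    "famgen": "Gender",
    "famguid": "GUID of selected relative",
    "famher": "Heredity",
    "famhx1": "A detailed family history was obtained",
    "famrel": "Relative",
    "famrelsp": "Other relative",
    "fhalz": "Alzheimer",
    "fhasth": "Asthma",
    "fhdem": "Dementia",
    "fhgen": "Was genetic testing for ALS/Dementia/FTD performed",
    "fhgnc9": "Family member has C9ORF72 mutation",
    "fhgnfus": "Family member has FUS mutation",
    "fhgnot": "Family Member Has Other Gene Mutation",
    "fhgnsod1": "Family member has SOD1 mutation",
    "fhgntdp": "Family member has TDP-43 mutation",
    "fhpsy": "Specify psychiatric disorder",
    "fhstk": "Stroke",
}


def with_readable_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose known column codes are replaced by descriptions."""
    return frame.rename(columns=COLUMN_LABELS)


# Hypothetical stand-in frame; columns missing from the map are left unchanged.
demo = pd.DataFrame({"famgen": [1, 2], "fhalz": [0, 1], "subject_id": ["a", "b"]})
readable = with_readable_columns(demo)
print(list(readable.columns))
```

Renaming a copy keeps the short codes available for the modelling cells, which refer to columns such as `famgen` by code.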

In [None]:
df.shape

In [None]:
df.isnull().sum()

# **<span style="color:#DC143C;">Genetic susceptibility to ALS</span>**

"ALS clinical registry data and more recent meta-analysis based on prospective population based registries suggest that up to 10% of ALS patients have a family history of ALS in a first- or second-degree relative, generally classified as familial ALS (FALS). The remaining 90% of patients with no evident family history of ALS are designated as sporadic ALS (SALS), a potentially misleading designation for several reasons. First, persons with ALS associated with a causative gene variant may lack a family history of ALS as a result of reduced penetrance or small family size. In addition, family history may be incomplete or inaccurate owing to incomplete family history, incorrect diagnoses in ancestors, or death from other causes prior to onset of ALS in relatives genetically at risk."

"Estimates of the heritability of ALS, a measure of the extent of phenotypic variability that is attributable to genetic variation, provide additional evidence that genetic factors play a significant role in sporadic as well as familial ALS. In a study of identical twins that included twins with or without a history of ALS in other relatives heritability was estimated to be about 76% (95% CI=60–86%) for twins with a family history of ALS, and approximately 61% (95% CI=38–78%) for twins with no other family history of ALS ."

"Recently it has been suggested that genetic contributions to ALS may represent the inheritance of risk variants of multiple genes, acting interdependently to cause ALS. The hypothesis that ALS may be oligogenic implies that at least two pathogenic ALS gene variants are required to initiate disease."

"Several studies have shown that a subset of FALS and SALS (sporadic ALS) patients carry at least one known ALS-linked gene variant in conjunction with a second potentially pathogenic variant and offer support for the oligogenic concept of ALS genetics, but these data have been questioned on the basis that the second gene variant may represent a benign variant, potential cohort selection bias and small sample size, and further validation was recommended." 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4670044/

In [None]:
from sklearn.compose import make_column_transformer,ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold,cross_val_score,RepeatedStratifiedKFold,train_test_split
from sklearn.preprocessing import RobustScaler,StandardScaler,LabelEncoder,LabelBinarizer
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV
from numpy import absolute,mean,std
from imblearn.over_sampling import SMOTE,ADASYN
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,accuracy_score,auc
from imblearn.ensemble import BalancedRandomForestClassifier,EasyEnsembleClassifier,BalancedBaggingClassifier
from sklearn.metrics import classification_report,f1_score,roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold,GridSearchCV,RandomizedSearchCV,StratifiedKFold,cross_val_score
from sklearn.svm import LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier,BaggingClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB,BernoulliNB
from sklearn.gaussian_process import GaussianProcessClassifier
from xgboost import XGBClassifier

# for model explanation
import shap 
import eli5
from eli5.sklearn import PermutationImportance

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

def multiclass_roc_auc_score(y_test, y_pred, average="macro"):
    lb = LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_auc_score(y_test, y_pred, average=average)
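
To see what this helper measures, it can be run on a toy three-class example. The labels below are invented for illustration. Because the helper binarizes hard class predictions rather than predicted probabilities, each class contributes the AUC of a single operating point, so the result is closer to a macro-averaged balanced accuracy than to a ranking-based ROC AUC.

```python
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer


def multiclass_roc_auc_score(y_test, y_pred, average="macro"):
    """One-vs-rest ROC AUC computed from hard class labels."""
    lb = LabelBinarizer()
    lb.fit(y_test)
    return roc_auc_score(lb.transform(y_test), lb.transform(y_pred), average=average)


# Hypothetical labels: three classes, two samples each.
y_true = ["a", "b", "c", "a", "b", "c"]

# Every label predicted correctly.
perfect = multiclass_roc_auc_score(y_true, y_true)

# One "b" sample is mistaken for "c".
one_miss = multiclass_roc_auc_score(y_true, ["a", "b", "c", "a", "c", "c"])

print("perfect predictions:", perfect)
print("one misclassification:", one_miss)
```

If probability outputs are available, passing `predict_proba` results to `roc_auc_score` with `multi_class="ovr"` would be the more conventional route; that is a different metric from the one used in this notebook.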

# **<span style="color:#DC143C;">Familial inheritance patterns in ALS</span>**

"Inheritance of most forms of familial ALS is autosomal dominant although autosomal recessive and X-linked dominant familial ALS also occur. Different modes of inheritance may be associated with the same gene depending on the specific sequence variant involved."

"Seemingly sporadic ALS patient with a family history of FTD (frontotemporal dementia) in a first-degree relative would be considered to have possible familial ALS. Validity of this concept is supported by the discovery that an abnormal expansion of a hexanucleotide repeat (GGGGCC) in chromosome 9 open reading frame 72 (C9ORF72), a gene of unknown function, is the most common gene variant linked to ALS and is also commonly associated with ALS-FTD and pure FTD (frontotemporal dementia)." 

**<span style="color:#DC143C;">Gene variants linked to ALS pathogenesis</span>**

"A growing number gene variants associated with Mendelian inheritance of ALS have been reported. In outbred populations approximately 60–70% of FALS is accounted for by known ALS-linked genes. However, reports of families in which linkage to known loci has been excluded indicate further genetic heterogeneity."

"Although associations between the foregoing gene variants and FALS are well established each is also found infrequently in SALS patients. The possibility of incomplete information regarding the family history may be the basis for some of these observations, but documented nonpenetrance is established for the C9ORF72 repeat expansion and for some SOD1, TARDBP, and FUS variants."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4670044/

In [None]:
plt.rcParams['figure.figsize'] = (15, 10)
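# Correlate numeric columns only; text columns cannot be correlated
sns.heatmap(df.select_dtypes(include='number').corr(), cmap = 'copper')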
plt.title('Heat Map for Correlations', fontsize = 20)
plt.show()

In [None]:
df1 = df.copy()
scores = pd.DataFrame(columns=['Model', 'Score'])
scores_ohe = pd.DataFrame(columns=['Model', 'Score'])

![](https://core4.bmctoday.net/storage/images/1556826169-0519_NN_Fig.png)practicalneurology.com

# **<span style="color:#DC143C;">Chromosome 9 open reading frame 72 (C9ORF72)</span>**

"A GGGGCC hexanucleotide repeat in the first intron of a gene that encodes a protein of unknown function on chromosome 9, C9ORF72, is the most common gene variant associated with FALS, found in 40% of FALS and about 6–8% of SALS patients, with ethnic variation."

# **<span style="color:#DC143C;">Superoxide dismutase 1 (SOD1)</span>**
 
"Sequence variants in the Cu/Zn superoxide dismutase gene (SOD1) on chromosome 21q12.1, were the first causative gene variants identified in ALS. Disease-linked variants are mainly point mutations and account for a approximately 12% of patients with FALS and 1–2% of SALS, with ethnic variation in prevalence." 

# **<span style="color:#DC143C;">Transactive response DNA binding protein 43 (TARDBP)</span>**

"Identification of TARDBP variants in ALS patients followed the discovery in 2006 that neuronal cytoplasmic inclusions immunoreactive for ubiquitin, a pathological hallmark in the large majority of cases of FALS and SALS, are also immunoreactive for TDP-43. Gene variants in TARDBP, which encodes the 43-kD TAR DNA-binding protein 43 (TDP-43), are found in approximately 4% of FALS and 1% of SALS (sporadic), with some regional variation."

# **<span style="color:#DC143C;">Fused in sarcoma (FUS)</span>**

"Variants in the gene fused in sarcoma (FUS) are linked to autosomal dominant ALS in about 4% of FALS and 1% of SALS patients. FUS appears to regulate DNA and RNA metabolism and be involved in RNA transcription, splicing and processing; gene sequence variants that alter these functions may contribute to neurodegeneration but the molecular pathogenesis of FUS related neurodegeneration is not fully defined ."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4670044/
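
As a rough cross-check, the shares quoted for the four genes above can be added up. The sketch assumes that carriers of different genes do not overlap, which is a simplification (the oligogenic hypothesis quoted earlier suggests some patients carry more than one variant), so the totals are only indicative.

```python
# Approximate percentages quoted above, stored as (low, high).
# Single figures repeat the same value for both ends of the range.
fals_share = {"C9ORF72": (40, 40), "SOD1": (12, 12), "TARDBP": (4, 4), "FUS": (4, 4)}
sals_share = {"C9ORF72": (6, 8), "SOD1": (1, 2), "TARDBP": (1, 1), "FUS": (1, 1)}


def total_range(shares):
    """Sum the low and high ends, assuming no patient is counted twice."""
    low = sum(lo for lo, _ in shares.values())
    high = sum(hi for _, hi in shares.values())
    return low, high


fals_total = total_range(fals_share)
sals_total = total_range(sals_share)

print("FALS covered by the four genes (%):", fals_total)
print("SALS covered by the four genes (%):", sals_total)
```

The FALS total of about 60% sits at the lower end of the "approximately 60–70%" attributed to known ALS-linked genes in the excerpt above, leaving the remainder to the less common genes.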

In [None]:
# First, collect the numerical features that contain at least one NaN value
numerical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes!='O']
numerical_nan

# **<span style="color:#DC143C;">Other ALS risk genes and insights from ALS genetics on the pathogenesis of ALS</span>**

"The list of additional genes with sequence variants associated with ALS and related phenotypes continues to grow, aided by technological advances in large scale genetic screening in FALS and SALS patients, particularly whole exome analysis in recent studies. Although most of these genes contribute to a relatively small proportion of FALS and/or SALS, they and more common FALS genes have offered insights regarding ALS pathogenesis."

"Disease-linked variants in these genes are uncommon in ALS but they have implicated toxic conformational changes in RNA-binding proteins with prion-like domains, such as TDP-43 and FUS, in neurodegeneration."

# **<span style="color:#DC143C;">ALS-susceptibility genes associated with lower risk and potential disease modifying genes</span>**

"These variants tend to be uncommon, with limited data supporting linkage with ALS. Further studies are needed to clarify the level of ALS risk associated with these genes and, in some cases confirm that the reported variant is associated with ALS rather than being a benign variant ."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4670044/

In [None]:
df[numerical_nan].isna().sum()

In [None]:
## Replacing the numerical Missing Values

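for feature in numerical_nan:
    # Impute with the median, which is less sensitive to outliers than the mean
    median_value = df[feature].median()
    # Assign the result back instead of using inplace=True on a column selection,
    # which newer pandas versions may not propagate to df
    df[feature] = df[feature].fillna(median_value)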
    
df[numerical_nan].isnull().sum()
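
The comment in the cell above prefers the median because of outliers. A minimal illustration of that choice, using an invented column with one extreme value and one missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical column: three small values, one outlier, one missing entry.
values = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])

# The mean is pulled towards the outlier; the median is not.
mean_filled = values.fillna(values.mean())
median_filled = values.fillna(values.median())

print("value imputed with the mean  :", mean_filled.iloc[-1])
print("value imputed with the median:", median_filled.iloc[-1])
```

With the mean, the imputed value lands far from the bulk of the observations, whereas the median stays among the typical values.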

In [None]:
# Column famguid (GUID of selected relative) still has 723 missing values after imputation, so it is dropped together with famrelsp

cols_to_drop=['famguid', 'famrelsp']
df=df.drop(cols_to_drop,axis=1)
df.columns

In [None]:
# categorical features with missing values
categorical_nan = [feature for feature in df.columns if df[feature].isna().sum()>0 and df[feature].dtypes=='O']
print(categorical_nan)

In [None]:
# replacing missing values in categorical features
for feature in categorical_nan:
    df[feature] = df[feature].fillna('None')

In [None]:
df[categorical_nan].isna().sum()

# **<span style="color:#DC143C;">Pathology of Familial and Sporadic ALS</span>**


Sporadic and hereditary amyotrophic lateral sclerosis (ALS)

Authors: Senda Ajroud-Driss, Teepu Siddique

Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease
Volume 1852, Issue 4, April 2015, Pages 679-684 - https://doi.org/10.1016/j.bbadis.2014.08.010


"Identification of TDP43 as the major component of the ubiquitinated inclusions in sporadic ALS and in non-SOD1-linked familial ALS, showed that FUS-immunoreactive inclusions were also present in spinal anterior horn neurons in all sporadic ALS and in non-SOD1-familial ALS cases." 

"The FUS-containing inclusions were also immunoreactive with antibodies to TDP43, P62 and ubiquitin. OPTN immunoreactivity was also found in skein-like inclusions of anterior horn neurons and their neurites in spinal cords of sporadic ALS and in non-SOD1 Familial ALS cases."

"Finally, UBQLN2-positive inclusions were identified in spinal cord sections of patients with X-ALS and found to co-localize with ubiquitin, P62, TDP43, FUS and OPTN but not SOD1. UBQLN2 immunoreactivity was also observed in spinal cord sections of sporadic ALS, ALS with dementia and in non-SOD1 Familial ALS, suggesting that SOD1 mutations may perform their deleterious effects through distinct pathways.

https://www.sciencedirect.com/science/article/pii/S0925443914002634

In [None]:
df1 = df.copy()
scores = pd.DataFrame(columns=['Model', 'Score'])
scores_ohe = pd.DataFrame(columns=['Model', 'Score'])

In [None]:
df1_ohe = df1.copy()
x = df1.drop(['famgen'], axis = 1)
y = df1['famgen']

print("Shape of x :", x.shape)
print("Shape of y :", y.shape)

In [None]:
le = LabelEncoder()
for col in x.columns:
    # Integer-encode every text (object) column; numeric columns are left as they are
    if x[col].dtype == 'object':
        x[col] = le.fit_transform(x[col])
        
print(x)
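
The loop above reuses a single `LabelEncoder`, so the mapping for each column is lost once the next column is fitted. A possible variation, sketched here on an invented two-column frame, keeps one encoder per column so that the integer codes can be translated back to the original categories later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical text columns standing in for the notebook's features.
toy = pd.DataFrame({
    "famrel": ["Mother", "Father", "None", "Mother"],
    "fhalz": ["Yes", "No", "None", "No"],
})

encoders = {}
encoded = toy.copy()
for col in toy.columns:
    # Fit a separate encoder per column and remember it for decoding.
    encoders[col] = LabelEncoder().fit(toy[col])
    encoded[col] = encoders[col].transform(toy[col])

print(encoded)

# Codes follow the sorted order of the categories in each column.
print(list(encoders["famrel"].classes_))

# Decode the integers back into the original labels.
decoded = encoders["famrel"].inverse_transform(encoded["famrel"])
print(list(decoded))
```

Note that integer codes impose an arbitrary order on the categories; tree-based models usually tolerate this, while linear models may be better served by one-hot encoding.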

**<span style="color:#DC143C;">Novel therapeutic approaches to ALS</span>**

"Gene silencing through selective manipulation of RNA processing to reduce toxic protein level is an exciting novel therapeutic opportunity for familial ALS."

"Despite tremendous progress in the field of genomics, ALS remains an incurable disease."

https://www.sciencedirect.com/science/article/pii/S0925443914002634

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

print("Shape of X_train :", X_train.shape)
print("Shape of X_test :", X_test.shape)
print("Shape of y_train :", y_train.shape)
print("Shape of y_test :", y_test.shape)

In [None]:
train_cols_list = X_train.columns.values.tolist()

In [None]:
from collections import Counter
print('Classes and number of values in trainset',Counter(y_train))
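
When the class counts printed above are uneven, a stratified split keeps the class proportions the same in the training and test sets. The notebook's own split does not stratify; the sketch below shows the option on invented data with an 80/20 class ratio.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
X_demo = [[i] for i in range(100)]
y_demo = [0] * 80 + [1] * 20

# stratify=y_demo preserves the 80/20 ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print("train:", Counter(y_tr))
print("test :", Counter(y_te))
```

Without stratification, a small minority class can end up under-represented in the test set, which makes the per-class metrics in the classification reports less stable.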

In [None]:
# Fix class imbalance with ADASYN (currently disabled)

#from imblearn.over_sampling import SMOTE,ADASYN

#oversample = ADASYN()


#X_train,y_train = oversample.fit_resample(X_train,y_train)
#print('Classes and number of values in trainset after ADASYN:',Counter(y_train))

In [None]:
from sklearn.preprocessing import RobustScaler,StandardScaler,LabelEncoder,LabelBinarizer


In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)

lrpred = model_lr.predict(X_test)

print("Training Accuracy: ", model_lr.score(X_train, y_train))
print('Testing Accuracy: ', model_lr.score(X_test, y_test))
print('F1 Score',f1_score(y_test, lrpred,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,lrpred)
print("ROC AUC Score - ",roc_score)


# confusion matrix
cm = confusion_matrix(y_test, lrpred)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm, annot = True, cmap = 'rainbow')
plt.show()

# classification report
cr = classification_report(y_test, lrpred)
print(cr)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
scores = pd.concat([scores, pd.DataFrame([{'Model': 'Logistic Regression', 'Score': roc_score}])], ignore_index=True)

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

model_lr_cv = LogisticRegressionCV(cv=10)
model_lr_cv.fit(X_train, y_train)

lrpred_cv = model_lr_cv.predict(X_test)

print("Training Accuracy: ", model_lr_cv.score(X_train, y_train))
print('Testing Accuracy: ', model_lr_cv.score(X_test, y_test))
print('F1 Score',f1_score(y_test, lrpred_cv,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,lrpred_cv)
print("ROC AUC Score - ",roc_score)


# confusion matrix
cm = confusion_matrix(y_test, lrpred_cv)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm, annot = True, cmap = 'viridis')
plt.show()

# classification report
cr = classification_report(y_test, lrpred_cv)
print(cr)
scores = pd.concat([scores, pd.DataFrame([{'Model': 'Logistic RegressionCV', 'Score': roc_score}])], ignore_index=True)

![](https://www.futuremedicine.com/cms/10.2217/fnl.10.47/asset/images/medium/figure1.gif)futuremedicine.com

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

model_nb = GaussianNB()
model_nb.fit(X_train, y_train)

nbpred = model_nb.predict(X_test)

print('F1 Score',f1_score(y_test, nbpred,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,nbpred)
print("ROC AUC Score - ",roc_score)


# confusion matrix
cm = confusion_matrix(y_test, nbpred)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm, annot = True, cmap = 'twilight')
plt.show()

# classification report
cr = classification_report(y_test, nbpred)
print(cr)
scores = pd.concat([scores, pd.DataFrame([{'Model': 'GaussianNB', 'Score': roc_score}])], ignore_index=True)

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

model_u = RandomForestClassifier(n_estimators=100, class_weight='balanced',random_state=0)
model_u.fit(X_train, y_train)

y_pred_rf = model_u.predict(X_test)

print("Training Accuracy: ", model_u.score(X_train, y_train))
print('Testing Accuracy: ', model_u.score(X_test, y_test))
print('F1 Score',f1_score(y_test, y_pred_rf,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,y_pred_rf)
print("ROC AUC Score - ",roc_score)

# confusion matrix
cm = confusion_matrix(y_test, y_pred_rf)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm, annot = True, cmap = 'winter')
plt.show()

# classification report
cr = classification_report(y_test, y_pred_rf)
print(cr)
scores = pd.concat([scores, pd.DataFrame([{'Model': 'Random-Forest', 'Score': roc_score}])], ignore_index=True)

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

model_gb = GradientBoostingClassifier()
model_gb.fit(X_train, y_train)

y_pred_rf = model_gb.predict(X_test)

print("Training Accuracy: ", model_gb.score(X_train, y_train))
print('Testing Accuracy: ', model_gb.score(X_test, y_test))
print('F1 Score',f1_score(y_test, y_pred_rf,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,y_pred_rf)
print("ROC AUC Score - ",roc_score)

# confusion matrix
cm = confusion_matrix(y_test, y_pred_rf)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm, annot = True, cmap = 'vlag')
plt.show()

# classification report
cr = classification_report(y_test, y_pred_rf)
print(cr)
scores = pd.concat([scores, pd.DataFrame([{'Model': 'Gradient Boosting', 'Score': roc_score}])], ignore_index=True)

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

model_xgb = XGBClassifier()
model_xgb.fit(X_train, y_train)

y_pred_rf = model_xgb.predict(X_test)

print("Training Accuracy: ", model_xgb.score(X_train, y_train))
print('Testing Accuracy: ', model_xgb.score(X_test, y_test))
print('F1 Score',f1_score(y_test, y_pred_rf,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,y_pred_rf)
print("ROC AUC Score - ",roc_score)

# confusion matrix
cm = confusion_matrix(y_test, y_pred_rf)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm, annot = True, cmap = 'vlag_r')
plt.show()

# classification report
cr = classification_report(y_test, y_pred_rf)
print(cr)
scores = pd.concat([scores, pd.DataFrame([{'Model': 'XGradient Boosting', 'Score': roc_score}])], ignore_index=True)

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

model_brf = BalancedRandomForestClassifier(n_estimators = 100,max_depth=8, random_state = 0)

model_brf.fit(X_train, y_train)
y_pred_brf = model_brf.predict(X_test)

print("Training Accuracy: ", model_brf.score(X_train, y_train))
print('Testing Accuracy: ', model_brf.score(X_test, y_test))
print('F1 Score',f1_score(y_test, y_pred_brf,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,y_pred_brf)
print("ROC AUC Score - ",roc_score)


# making a classification report
cr = classification_report(y_test,  y_pred_brf)
print(cr)

# making a confusion matrix
plt.rcParams['figure.figsize'] = (5, 5)
cm = confusion_matrix(y_test, y_pred_brf)
sns.heatmap(cm, annot = True, cmap = 'magma')
plt.show()
scores = pd.concat([scores, pd.DataFrame([{'Model': 'Balanced RF', 'Score': roc_score}])], ignore_index=True)

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

model1 = EasyEnsembleClassifier(n_estimators = 100, random_state = 0)

model1.fit(X_train, y_train)
y_pred_ef = model1.predict(X_test)

print("Training Accuracy: ", model1.score(X_train, y_train))
print('Testing Accuracy: ', model1.score(X_test, y_test))
print('F1 Score',f1_score(y_test, y_pred_ef,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,y_pred_ef)
print("ROC AUC Score - ",roc_score)



# making a classification report
cr = classification_report(y_test,  y_pred_ef)
print(cr)

# making a confusion matrix
cm = confusion_matrix(y_test, y_pred_ef)
sns.heatmap(cm, annot = True, cmap = 'copper')
plt.show()
scores = pd.concat([scores, pd.DataFrame([{'Model': 'Easy-Ensemble CLF', 'Score': roc_score}])], ignore_index=True)

In [None]:
# random_state is set to 1 here; the original script used 0.

#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

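# The estimator is passed positionally because its keyword name differs across
# imbalanced-learn releases (base_estimator in older ones, estimator in newer ones)
model2 = BalancedBaggingClassifier(RandomForestClassifier(),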
                                 sampling_strategy = 'auto',
                                 replacement = False,
                                 random_state = 1)

model2.fit(X_train, y_train)
y_pred_bc = model2.predict(X_test)

print("Training Accuracy: ", model2.score(X_train, y_train))
print('Testing Accuracy: ', model2.score(X_test, y_test))
print('F1 Score',f1_score(y_test, y_pred_bc,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,y_pred_bc)
print("ROC AUC Score - ",roc_score)


# making a classification report
cr = classification_report(y_test,  y_pred_bc)
print(cr)

# making a confusion matrix
cm = confusion_matrix(y_test, y_pred_bc)
sns.heatmap(cm, annot = True, cmap = 'Purples')
plt.show()
scores = pd.concat([scores, pd.DataFrame([{'Model': 'Balanced Bagging Classifier', 'Score': roc_score}])], ignore_index=True)

In [None]:
scores.sort_values(by='Score', ascending=False)

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

n_estimators = [int(x) for x in np.linspace(start = 100, stop = 2000, num = 10)]
max_depth = [int(x) for x in np.linspace(1, 10, num = 10)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(random_grid)

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

rf_random = RandomizedSearchCV(estimator=model_u, param_distributions=random_grid,
                              n_iter = 100, scoring='f1_weighted', 
                              cv = 3, verbose=2, random_state=42, n_jobs=-1,
                              return_train_score=True)

rf_random.fit(X_train, y_train)

In [None]:
rf_random.best_params_

In [None]:
rf_random.cv_results_
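
Rather than copying the winning values out of `best_params_` by hand, the fitted search object also exposes the refitted model directly. The sketch below demonstrates this on synthetic data from `make_classification`; the small parameter grid and sample sizes are chosen only to keep the example fast and are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the ALS family history table).
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [10, 20, 30],
        "max_depth": [2, 4, None],
        "min_samples_split": [2, 5, 10],
    },
    n_iter=4,
    scoring="f1_weighted",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# With refit=True (the default), the best configuration is retrained on all
# data passed to fit and is available as best_estimator_.
tuned = search.best_estimator_

print(search.best_params_)
print("mean cross-validated f1_weighted:", round(search.best_score_, 3))
```

Using `best_estimator_` avoids transcription mistakes and keeps the tuned model in sync if the search is rerun with a different grid or seed.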

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

# Fit a random forest using the tuned hyperparameters
model_tuned = RandomForestClassifier(n_estimators =  944,min_samples_split = 10,min_samples_leaf= 1,max_depth = None,bootstrap= False)
model_tuned.fit(X_train, y_train)

y_pred_rf = model_tuned.predict(X_test)

print("Training Accuracy: ", model_tuned.score(X_train, y_train))
print('Testing Accuracy: ', model_tuned.score(X_test, y_test))
print('F1 Score',f1_score(y_test, y_pred_rf,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,y_pred_rf)
print("ROC AUC Score - ",roc_score)

# confusion matrix
cm = confusion_matrix(y_test, y_pred_rf)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm, annot = True, cmap = 'YlOrBr')
plt.show()

# classification report
cr = classification_report(y_test, y_pred_rf)
print(cr)

# Setting up ML Pipeline

Validation of Pipeline created
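
One way to validate a pipeline is to cross-validate it as a single unit, so that the scaler is refitted inside every fold and never sees the held-out data. The following is a sketch on synthetic data; the forest size and the number of folds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled stand-in data.
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

pipe = Pipeline(steps=[
    ("scaling", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Each fold fits the scaler and the forest on the training part only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1_weighted")

print("per-fold f1_weighted:", fold_scores.round(3))
print("mean:", fold_scores.mean().round(3), "std:", fold_scores.std().round(3))
```

Because the pipeline contains its own scaler, it is meant to receive raw features; feeding it data that has already been standardised applies the scaling twice.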

In [None]:
#Code by Shashwat Tiwari https://www.kaggle.com/shashwatwork/insurance-company-complaint-prediction/notebook

model_pipeline = Pipeline(steps=[('scaling',StandardScaler()),
                                 ('RFTuned', RandomForestClassifier(n_estimators =  944,min_samples_split = 10,min_samples_leaf= 1,max_depth = None,bootstrap= False))])
model_pipeline.fit(X_train, y_train)

model_pipeline_pred = model_pipeline.predict(X_test)

print("Training Accuracy: ", model_pipeline.score(X_train, y_train))
print('Testing Accuracy: ', model_pipeline.score(X_test, y_test))
print('F1 Score',f1_score(y_test, model_pipeline_pred,average = 'weighted'))
roc_score = multiclass_roc_auc_score(y_test,model_pipeline_pred)
print("ROC AUC Score - ",roc_score)

# confusion matrix
cm = confusion_matrix(y_test, model_pipeline_pred)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(cm, annot = True, cmap = 'RdGy')
plt.show()

# classification report
cr = classification_report(y_test, model_pipeline_pred)
print(cr)

In [None]:
!pip install dexplot -q

In [None]:
import dexplot as dxp

dxp.bar(x = 'Model',y = 'Score',data = scores,cmap='geyser',figsize=(10,5),title='Model Score without OHE')

In [None]:
## Using Random Forest Model For Interpretation
shap_values = shap.TreeExplainer(model_u).shap_values(X_train)

In [None]:
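# Pass the column names explicitly as feature names (X_train is a plain array after scaling)
shap.summary_plot(shap_values, feature_names=train_cols_list, plot_type="bar")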
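
As a complementary check on the SHAP ranking, permutation importance measures how much a held-out score drops when one feature is shuffled. This is a different technique from SHAP and is shown here with scikit-learn's `permutation_importance` on synthetic data; the feature names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with two informative features out of five.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # placeholder names

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature several times on the test set and record the score drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

ranking = sorted(
    zip(feature_names, result.importances_mean), key=lambda pair: pair[1], reverse=True
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Computing the importance on held-out data, rather than on the training set, helps avoid crediting features that the forest has merely memorised.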

# **<span style="color:#DC143C;">Conclusions</span>**

"Although genetic mechanisms in ALS pathogenesis appear to play a major role in the development of ALS in a minority of patients, studies suggest that genetic factors at some level are important components of disease risk in the majority of ALS patients. However, identification of gene variants associated with ALS, regardless of the prevalence or magnitude of associated risk, has informed concepts of the pathogenesis of ALS, aided the identification of therapeutic targets, facilitated research to develop new ALS biomarkers, and supported the establishment of clinical diagnostic tests for ALS-linked genes."

"New treatment strategies aimed at blocking expression of ALS gene mutations have successfully completed early phase safety testing in the case SOD1 anti-sense oligonucleotide therapy, and efforts are underway to introduce small molecule and gene therapy targeting expression of the C9ORF72 repeat expansion."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4670044/

In [None]:
#Code by Olga Belitskaya https://www.kaggle.com/olgabelitskaya/sequential-data/comments
from IPython.display import display,HTML
c1,c2,f1,f2,fs1,fs2=\
'#eb3434','#eb3446','Akronim','Smokum',30,15
def dhtml(string,fontcolor=c1,font=f1,fontsize=fs1):
    display(HTML("""<style>
    @import 'https://fonts.googleapis.com/css?family="""\
    +font+"""&effect=3d-float';</style>
    <h1 class='font-effect-3d-float' style='font-family:"""+\
    font+"""; color:"""+fontcolor+"""; font-size:"""+\
    str(fontsize)+"""px;'>%s</h1>"""%string))
    
    
dhtml('Thank you Shashwat Tiwari for the script, @shashwatwork' )