# End ALS Challenge Solution for Task 1
## by Randy Williams

![](https://alsnewstoday.com/wp-content/uploads/2017/04/shutterstock_576832348-1000x480@2x.jpeg)

# Task 1: One mechanism of action or multiple independent mechanisms of action?

### Does ALS have one mechanism of action (one pathway) or is it caused by multiple independent or different mechanisms of action (multiple pathways)? For example, what is the genetic difference between people with ALS with Bulbar onset (they start the symptoms in bulbar functions) versus Limb (they start the symptoms in the limbs)?

# Introduction

For this project, I attempt the first task of the End ALS challenge: investigating whether Amyotrophic Lateral Sclerosis (ALS) has one pathway or multiple pathways. In this notebook, I apply several machine learning approaches to this question. I build a model that predicts the class label of each transcriptomics dataset in the DESeq2 folder. After building and validating the predictive model, I perform an over-representation analysis to investigate the ontologies of the model's most important features.

In [None]:
# import the basics
!pip install nb_black > /dev/null
%load_ext lab_black
import pandas as pd
import numpy as np
import sys
import os

In [None]:
# import graphing packages
import matplotlib.pyplot as plt

%matplotlib inline
import seaborn as sns

In [None]:
# imports for run_bagged_validation
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold
from sklearn.metrics import f1_score, log_loss, precision_score, recall_score
import lightgbm as lgb
from tqdm.notebook import tqdm

In [None]:
# imports for Laplacian score combined with distance entropy for feature selection
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics.pairwise import euclidean_distances
import scipy
import scipy.sparse
from sklearn import preprocessing
from scipy.sparse.linalg import expm

In [None]:
# import for over-representation analysis
!pip install gseapy
import gseapy as gp
from gseapy.plot import barplot, dotplot

# Loading DESeq2 Datasets

In [None]:
example_clinical_data_path_1 = "/kaggle/input/end-als/end-als/clinical-data/filtered-metadata/metadata/clinical/Demographics.csv"
example_clinical_data_path_2 = "/kaggle/input/end-als/end-als/clinical-data/filtered-metadata/metadata/clinical/ALSFRS_R.csv"
example_transcriptomics_DESEQ2_data_path_1 = (
    "/kaggle/input/end-als/end-als/transcriptomics-data/DESeq2/bulbar_vs_limb.csv"
)
example_transcriptomics_DESEQ2_data_path_2 = (
    "/kaggle/input/end-als/end-als/transcriptomics-data/DESeq2/ctrl_vs_case.csv"
)
example_transcriptomics_DESEQ2_data_path_3 = (
    "/kaggle/input/end-als/end-als/transcriptomics-data/DESeq2/median_low_vs_high.csv"
)
example_transcriptomics_3counts_data_path = "/kaggle/input/end-als/end-als/transcriptomics-data/L3_counts/CASE-NEUZX521TKK/CASE-NEUZX521TKK-5793-T/CASE-NEUZX521TKK-5793-T_P85.exon.txt"

demographics = pd.read_csv(example_clinical_data_path_1)
demographics.to_csv("/kaggle/working/demographics.csv")
alsfrs_scores = pd.read_csv(example_clinical_data_path_2)
alsfrs_scores.to_csv("/kaggle/working/alsfrs_scores.csv")
bulbar_vs_limb = pd.read_csv(example_transcriptomics_DESEQ2_data_path_1)
bulbar_vs_limb.to_csv("/kaggle/working/bulbar_vs_limb.csv")
ctrl_vs_case = pd.read_csv(example_transcriptomics_DESEQ2_data_path_2)
ctrl_vs_case.to_csv("/kaggle/working/ctrl_vs_case.csv")
median_low_vs_high = pd.read_csv(example_transcriptomics_DESEQ2_data_path_3)
median_low_vs_high.to_csv("/kaggle/working/median_low_vs_high.csv")
example_transcriptomics_3counts_data = pd.read_csv(
    example_transcriptomics_3counts_data_path,
    sep=r"\s+",  # whitespace-delimited; replaces the deprecated delim_whitespace=True
    skiprows=1,
    low_memory=False,
)
example_transcriptomics_3counts_data.to_csv("/kaggle/working/L3_counts.csv")

# Data Description

## Ctrl_vs_Cases

The target variable is "CtrlVsCase_Classifier", which indicates whether a participant is a case with ALS (CtrlVsCase_Classifier=1) or a control (CtrlVsCase_Classifier=0).

* Data Dimensions: 169 Patients and approx. 53000 Genes
* 32 are listed as controls
* 137 are listed as cases with ALS


## Bulbar_vs_Limb

The target variable indicates whether the onset of the disease was in the bulbar region (SiteOnset_Class=0) or in the patient's limbs (SiteOnset_Class=1).

* Data Dimensions: 116 Patients and approx. 53000 Genes
* 31 patients with onset in the bulbar region
* 85 patients with onset in the limbs


## Median_Low_vs_High

The Median Low vs High dataset has the classifier "ALSFRS_Class_Median", which indicates whether the patient's ALSFRS score was greater than or equal to the median ALSFRS score (ALSFRS_Class_Median=1) or below it (ALSFRS_Class_Median=0). The Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS) is a metric for evaluating the functional status of ALS patients and monitoring how it changes over time.

* Data Dimensions: 92 Patients and approx. 53000 Genes
* 45 Patients have ALSFRS scores less than the median score
* 46 Patients have ALSFRS scores greater than or equal to the median score

# Methods

For my approach, I design a cross-validation pipeline that first splits the patients into training and validation folds. The pipeline then trains a separate gradient boosting model on each fold. Within the model training step, the pipeline performs feature selection based on Fisher score or Laplacian score. In the final step, I evaluate the validation predictions from the folds.
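The key property of this design is that feature selection sees only the training rows of each split, so the validation rows cannot leak into the choice of genes. The sketch below illustrates that idea on random toy data. The helper `top_k_by_fisher` and the arrays `X_toy` and `y_toy` are hypothetical names introduced for illustration, and the model-fitting step is left as a comment.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def top_k_by_fisher(X, y, k):
    """Rank columns by |m1 - m2| / (s1 + s2) and return the k best column indices."""
    case, ctrl = X[y == 1], X[y == 0]
    score = np.abs(case.mean(axis=0) - ctrl.mean(axis=0)) / (
        case.std(axis=0, ddof=1) + ctrl.std(axis=0, ddof=1)
    )
    return np.argsort(score)[::-1][:k]


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 200))  # 60 hypothetical patients, 200 hypothetical genes
y_toy = np.array([0] * 20 + [1] * 40)  # imbalanced labels

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=529)
selected_per_split = []
for tr_idx, val_idx in splitter.split(X_toy, y_toy):
    # Selection uses the training rows of this split only.
    keep = top_k_by_fisher(X_toy[tr_idx], y_toy[tr_idx], k=10)
    selected_per_split.append(keep)
    # A model would be fit on X_toy[tr_idx][:, keep]
    # and scored on X_toy[val_idx][:, keep].
```

Because selection is repeated inside every split, the selected genes can differ from split to split, which is why the feature importances are later collected per split.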

# Data Cleaning 

Each RNA-seq dataset in the DESeq2 folder contains over 53,000 genes. Some form of gene filtering and feature selection is necessary for the model to properly predict the disease status of a patient. Before applying any of these methods, we may be able to rule out some genes as uninformative. One example is "pseudogenes", which are nonfunctional segments of DNA that resemble functional genes. Some pseudogenes in the datasets could appear correlated with our target variable by random chance yet have no true biological significance in predicting ALS. Therefore, it may be in our interest to remove these genes before applying any feature selection methods. We can use a gene annotation database system such as Ensembl BioMart to generate a list of gene IDs that are identified as pseudogenes.

![](https://upload.wikimedia.org/wikipedia/commons/7/7c/Pseudogene_defects.png)

In [None]:
biomart_path = "/kaggle/input/biomart-annotation/mart_export.txt"
mart_export = pd.read_csv(biomart_path)
mart_export.head()

I made an annotation dataset from the BioMart database publicly available on Kaggle under the name "BioMart_Annotation" (https://www.kaggle.com/rwilliams7653/biomart-annotation). The columns of the dataframe are: "Gene stable ID", "Gene stable ID version", "Transcript stable ID", "Transcript stable ID version", "Gene type", "Gene name". The only columns we are interested in are "Gene stable ID", "Gene type", and "Gene name". "Gene type" denotes the type of the gene. "Gene stable ID" holds Ensembl gene IDs, which start with the prefix "ENSG" (for example ENSG00000281806), while "Gene name" holds gene symbols determined by the HUGO Gene Nomenclature Committee (HGNC). Gene symbols and gene IDs usually overlap. For example, the Ensembl gene ID ENSG00000277620 and the gene symbol KIR3DL3 refer to the same gene. However, there are plenty of cases where a gene has an Ensembl ID but no corresponding gene symbol. This might explain why features in the transcriptomics datasets in the DESeq2 folder are denoted with both gene symbols and Ensembl IDs.

The next cell lists the unique gene types.

In [None]:
mart_export["Gene type"].unique()

I created a list of all of the unique gene types that referred to pseudogenes so that I can filter the annotation data to only the pseudogenes later.

In [None]:
pseudo_genes = [
    "transcribed_processed_pseudogene",
    "polymorphic_pseudogene",
    "processed_pseudogene",
    "unprocessed_pseudogene",
    "pseudogene",
    "rRNA_pseudogene",
    "IG_V_pseudogene",
    "TR_V_pseudogene",
    "unitary_pseudogene",
    "IG_C_pseudogene",
    "IG_J_pseudogene",
    "transcribed_unitary_pseudogene",
    "TR_J_pseudogene",
    "translated_processed_pseudogene",
    "TR_J_pseudogene",
    "translated_unprocessed_pseudogene",
    "IG_pseudogene",
]

Before subsetting the annotation to the pseudogenes, I create a variable called "Gene_name2" that mimics the labeling scheme of the transcriptomics datasets provided by the authors. The next cell displays the first 26 gene features of the ctrl_vs_case dataset. The feature names are a mixture of HGNC symbols (gene labels such as "WASH7P") and Ensembl IDs (gene labels with the prefix "ENSG").

In [None]:
pd.DataFrame(ctrl_vs_case.columns[2:28], columns=["gene id"])

As I mentioned earlier, even though HGNC symbols and Ensembl IDs overlap, there are plenty of genes that are only identifiable by their Ensembl IDs. The authors of the datasets appear to have labeled each gene with its HGNC symbol when one is available and with its Ensembl ID otherwise. In the next cell, I make a variable in the annotation set called Gene_name2 that follows this convention, so that it is easier to search the column names of the DESeq2 datasets for pseudogenes.

In [None]:
# Use the HGNC symbol when present; otherwise fall back to the Ensembl gene ID
mart_export["Gene_name2"] = mart_export["Gene name"].fillna(
    mart_export["Gene stable ID"]
)
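The fallback rule (symbol when present, Ensembl ID otherwise) can be checked on a two-row toy frame. This is only an illustrative sketch: `toy_mart` is a hypothetical frame, the first row reuses the KIR3DL3 pair mentioned above, and the second row uses a made-up placeholder ID with no symbol.

```python
import pandas as pd

toy_mart = pd.DataFrame(
    {
        "Gene stable ID": ["ENSG00000277620", "ENSG_PLACEHOLDER"],
        "Gene name": ["KIR3DL3", None],  # second gene has no HGNC symbol
    }
)

# Keep the symbol when it exists, otherwise use the Ensembl gene ID.
toy_mart["Gene_name2"] = toy_mart["Gene name"].fillna(toy_mart["Gene stable ID"])
print(toy_mart["Gene_name2"].tolist())
```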

Next, I subset the annotation data to the pseudogene annotations and use the variable Gene_name2 to find the pseudogenes in the ctrl_vs_case dataset. Before applying the CV pipeline to these datasets, I drop these columns.

In [None]:
pseudogenes_annotate = mart_export[mart_export["Gene type"].isin(pseudo_genes)]
pseudogenes = ctrl_vs_case.columns[
    ctrl_vs_case.columns.isin(pseudogenes_annotate["Gene_name2"])
]

# Feature Selection Methods

In this section, I list the functions used for feature selection. Each function is applied only on the training split of data for each fold in the CV pipeline.

The following code shows my implementation of Fisher score, a supervised feature selection method. For a given gene, the fisher function calculates the mean and the standard deviation of the case and control observations in the training set separately and applies them to the score formula:
$ \frac{|m_1-m_2|}{{\sigma}_1 +{\sigma}_2} $

where $ m_1 $ is the mean of the cases, $ m_2 $ is the mean of the controls, ${\sigma}_1$ is the standard deviation of the cases, and ${\sigma}_2$ is the standard deviation of the controls. The function computes the score for all genes and returns the K best features. The higher the Fisher score, the better the feature separates the two groups.
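As a quick worked example of the formula, consider one hypothetical gene measured in three cases and three controls (the values below are invented for illustration). The sample standard deviation is used here, matching the default of pandas `.std()`.

```python
import numpy as np

cases = np.array([5.0, 6.0, 7.0])  # hypothetical expression values for cases
controls = np.array([1.0, 2.0, 3.0])  # hypothetical expression values for controls

m1, m2 = cases.mean(), controls.mean()  # 6.0 and 2.0
s1, s2 = cases.std(ddof=1), controls.std(ddof=1)  # 1.0 and 1.0
fisher_score = abs(m1 - m2) / (s1 + s2)
print(fisher_score)  # 2.0
```

A gene whose group means are far apart relative to the spread within each group gets a high score. A gene with the same mean in both groups scores 0.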

In [None]:
def fisher(data_df, target, num_features):
    """
    Makes feature Fisher selection according to the following formula:
    D(f) = |m1(f) - m2(f)| / (sigma1(f) + sigma2(f))
    :return: the list of most significant features.

    Arguments:
    1. data_df: the dataframe, which is a pandas object
    2. target: the name of the target variable, given as a string
    3. num_features: specifies K for finding the top K features
    """
    cases = data_df[data_df[target] == 1]
    controls = data_df[data_df[target] == 0]
    fisher_df = pd.DataFrame()
    # numeric_only=True skips non-numeric columns such as Participant_ID
    fisher_df["mean_case"] = cases.mean(numeric_only=True)
    fisher_df["mean_control"] = controls.mean(numeric_only=True)
    fisher_df["std_case"] = cases.std(numeric_only=True)
    fisher_df["std_control"] = controls.std(numeric_only=True)

    fisher_df["diff_mean"] = (fisher_df["mean_case"] - fisher_df["mean_control"]).abs()
    fisher_df["sum_std"] = (fisher_df["std_case"] + fisher_df["std_control"]).abs()
    fisher_df["fisher_coeff"] = fisher_df["diff_mean"] / fisher_df["sum_std"]
    # The target itself scores infinity (zero spread within each class), so drop it
    # before ranking; otherwise it would occupy one of the top K slots.
    fisher_df = fisher_df.drop(index=target, errors="ignore")
    fisher_df = fisher_df.sort_values(["fisher_coeff"], ascending=False)

    most_significant_features = fisher_df.index[:num_features]
    return list(most_significant_features)

The next cell shows how I implement Laplacian score for feature selection. Laplacian score is an unsupervised method. First I create a function that builds the symmetric weight matrix used by the method, and then a function that computes the Laplacian score. The concrete steps are:

1. Make a k-nearest neighbor graph. (The graph has an edge between two observations if one is among the k nearest neighbors of the other.)

2. Define a matrix S measuring the similarity between two connected nodes (observations) using a pre-defined distance measure.

3. Define the graph Laplacian from S.

4. Compute the Laplacian score of each feature.

5. Select the n best features (those with the n lowest Laplacian scores), where n is the number of features we want.


Because calculating Laplacian scores is computationally expensive, the cross-validation (CV) pipeline applies Laplacian score only to the 10,000 most variable features when this method is specified.
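For orientation, the five steps can be sketched in plain NumPy on a small toy matrix. This is the textbook formulation with an element-wise heat kernel, written as an assumption-laden illustration. It is not the notebook's own `construct_W` / `LaplacianScoreTopK` code, and the names `laplacian_scores` and `X_toy` are hypothetical.

```python
import numpy as np


def laplacian_scores(X, n_neighbors=4, t=1.0):
    """Textbook Laplacian score per column of X (lower = smoother on the graph)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between samples.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Step 1: k-nearest-neighbour graph (self excluded), made symmetric.
    masked = sq.copy()
    np.fill_diagonal(masked, np.inf)
    nearest = np.argsort(masked, axis=1)[:, :n_neighbors]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), n_neighbors), nearest.ravel()] = True
    adj = adj | adj.T
    # Step 2: element-wise heat-kernel similarity on connected pairs.
    S = np.where(adj, np.exp(-sq / t), 0.0)
    # Step 3: degree vector and graph Laplacian L = D - S.
    d = S.sum(axis=1)
    L = np.diag(d) - S
    # Step 4: centre each feature with degree weights, then take the ratio.
    F = X - (d @ X) / d.sum()
    numerator = (F * (L @ F)).sum(axis=0)
    denominator = (F * F * d[:, None]).sum(axis=0)
    return numerator / denominator


rng = np.random.default_rng(0)
cluster = np.repeat([0.0, 5.0], 20)  # two hypothetical groups of 20 samples
X_toy = np.column_stack(
    [
        cluster + rng.normal(0, 0.1, 40),  # follows the cluster structure
        rng.normal(0, 1, 40),  # noise
        rng.normal(0, 1, 40),  # noise
    ]
)
scores = laplacian_scores(X_toy)
best_feature = int(np.argmin(scores))  # Step 5: lowest score ranks first
```

In this toy setup the first column, which varies little between neighbouring samples but a lot overall, receives the lowest score and would be selected first.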

In [None]:
def construct_W(X, neighbour_size=4, t=1):
    n_samples, n_features = np.shape(X)
    S = kneighbors_graph(
        X, neighbour_size + 1, metric="euclidean"
    )  # sqeuclidean distance works only with mode=connectivity; results were absurd
    # Note: when S is a scipy sparse matrix (csr_matrix), `S * S` is a matrix product;
    # `expm` is always the matrix exponential, not an element-wise exponential.
    S = (-1 * (S * S)) / (2 * t * t)
    S = S.tocsc()
    S = expm(S)  # matrix exponential
    S = S.tocsr()
    # [1]  M. Belkin and P. Niyogi, “Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering,” Advances in Neural Information Processing Systems,
    # Vol. 14, 2001. Following the paper, the step below makes the weights matrix symmetric.
    bigger = np.transpose(S) > S
    S = S - S.multiply(bigger) + np.transpose(S).multiply(bigger)
    return S


def LaplacianScoreTopK(X, topK, columnlist, neighbour_size=4, t=1):
    """
    Arguments:
    1. X: the feature matrix, which has to be predefined as a numpy array
    2. topK: specifies K for finding the top K features
    3. columnlist: specifies the variable names of X
    4. neighbour_size: specifies the neighbor size when making the k-nearest neighbor graph.
    5. t: a parameter used for calculating the weights matrix in construct_W
    """

    W = construct_W(X, t=t, neighbour_size=neighbour_size)
    n_samples, n_features = np.shape(X)

    # construct the diagonal matrix
    D = np.array(W.sum(axis=1))
    D = scipy.sparse.diags(np.transpose(D), [0])
    # construct graph Laplacian L
    L = D - W.toarray()

    # construct 1= [1,···,1]'
    I = np.ones((n_samples, n_features))

    # construct fr' => fr= [fr1,...,frn]'
    Xt = np.transpose(X)

    # construct fr^=fr-(frt D I/It D I)I
    t1 = np.matmul(np.matmul(Xt, D.toarray()), I) / np.matmul(
        np.matmul(np.transpose(I), D.toarray()), I
    )
    t1 = t1[:, 0]
    t1 = np.tile(t1, (n_samples, 1))
    fr = X - t1

    # Compute Laplacian Score
    fr_t = np.transpose(fr)
    Lr = np.matmul(np.matmul(fr_t, L), fr) / np.matmul(np.dot(fr_t, D.toarray()), fr)

    Lap_vals = np.diag(Lr)
    top_index = pd.Series(Lap_vals).sort_values().head(topK).index
    return list(columnlist[top_index])

## CV Pipeline

In the next line of code, we define a function for making a dataframe of the feature importances for each split.

In [None]:
def get_split_feature_importance(clf, split_n):
    fi_df = pd.DataFrame(
        clf.feature_importances_,
        index=clf.feature_name_,
        columns=[f"importance_{split_n}"],
    )
    return fi_df

We also define a function for computing the validation scores using four metrics: F1 score, precision, recall, and log loss.

In [None]:
def compute_metrics(y_val, pred_val, pred_probs_val):
    f1 = f1_score(y_val, pred_val)
    prec = precision_score(y_val, pred_val)
    rec = recall_score(y_val, pred_val)
    loglosscore = log_loss(y_val, pred_probs_val)
    print(
        f"F1 Score: {f1:0.4f} - Precision {prec:0.4f} - Recall {rec:0.4f} - LogLoss {loglosscore:0.4f}"
    )
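To make the four metrics concrete, here is a hand computation on five hypothetical validation samples (all labels and probabilities below are invented). One point worth keeping in mind: when scikit-learn's `log_loss` receives a 1-D array, it treats the values as the probability of the positive class, so `p_case` below is the probability of class 1.

```python
import math

y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
p_case = [0.9, 0.8, 0.4, 0.6, 0.1]  # predicted probability of class 1

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Log loss: mean negative log-probability assigned to the true class.
logloss = -sum(
    math.log(p) if t == 1 else math.log(1 - p) for t, p in zip(y_true, p_case)
) / len(y_true)

print(f"F1 {f1:0.4f} - Precision {precision:0.4f} - Recall {recall:0.4f} - LogLoss {logloss:0.4f}")
```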

Below is the function for our CV pipeline.

In [None]:
def run_bagged_validation_new(
    data,
    target_col,
    nsplits,
    lgb_params,
    gene_subset,
    test_size=0.1,
    random_state=529,
    method=None,
    topK=None,
    print_metrics=True,
):
    """
    cross-validation pipeline
    Arguments:
     1. data: the whole dataset with the target and feature variables included
     2. target_col: specifies the name of the target variable in the dataset.
                    The value is a string.
     3. nsplits: specifies the number of splits to make for cross validation.
     4. lgb_params: specifies the parameters to be applied to the gradient boosting model.
     5. gene_subset: pre-specifies a subset of genes to be included in the model before
                     applying feature selection.
     6. test_size: represents the proportion of the dataset to include in the test split.
     7. random_state: sets the random seed for splitting.
     8. method: specifies the feature selection method to use.
                The options are "Fisher" and "Laplacian".
                The default is None.
     9. topK: specifies K for finding the top K features.
              This parameter is ignored if method=None.
              * Note: if method="Laplacian", then K cannot be higher than 10,000.
                      The score is calculated on the top 10,000 highly variable features.
    10. print_metrics: specifies if the score report should be printed. The default is True.

    """
    # Create X, y
    X = data[gene_subset].copy()
    y = data[target_col].copy()

    # Random stratified shuffle split, repeated nsplits times
    sss = StratifiedShuffleSplit(
        n_splits=nsplits, test_size=test_size, random_state=random_state
    )

    pred_val_probs_all = []
    pred_val_all = []
    y_val_all = []
    fis = []  # To Store feature importances
    split_n = 0
    for tr_idx, val_idx in tqdm(sss.split(X, y), total=nsplits):
        y_tr = y.iloc[tr_idx]
        y_val = y.iloc[val_idx]
        if method == "Fisher" and (topK is not None):
            fisher_features = fisher(data.iloc[tr_idx], target_col, topK)
            X_tr = data.iloc[tr_idx][gene_subset + fisher_features]
            X_val = data.iloc[val_idx][gene_subset + fisher_features]
        elif method == "Laplacian" and (topK is not None):
            N = 10000
            TOP_N_HIGH_VARIANCE = (
                data.iloc[tr_idx]
                .drop(["Participant_ID", target_col], axis=1)
                .var()
                .sort_values(ascending=False)
                .head(N)
                .index.tolist()
            )
            data2 = data[TOP_N_HIGH_VARIANCE]
            laplacian_features = LaplacianScoreTopK(
                np.array(data2.iloc[tr_idx]),
                topK=topK,
                columnlist=data2.columns,
            )
            X_tr = data.iloc[tr_idx][gene_subset + laplacian_features]
            X_val = data.iloc[val_idx][gene_subset + laplacian_features]
        else:
            X_tr = X.iloc[tr_idx]
            X_val = X.iloc[val_idx]
        clf = lgb.LGBMClassifier(**lgb_params)
        clf.fit(X_tr, y_tr)
        # Column 1 is the probability of class 1, which log_loss expects for 1-D input
        pred_val_probs = clf.predict_proba(X_val)[:, 1]
        pred_val = clf.predict(X_val)
        # Store predictions for each split
        pred_val_probs_all.append(pred_val_probs)
        pred_val_all.append(pred_val)
        y_val_all.append(y_val)

        fis.append(get_split_feature_importance(clf, split_n))

        split_n += 1

    # Flatten Predictions for scoring
    pred_val_all = np.concatenate(pred_val_all)
    pred_val_probs_all = np.concatenate(pred_val_probs_all)
    y_val_all = np.concatenate(y_val_all)

    results = pd.DataFrame(
        [y_val_all, pred_val_all, pred_val_probs_all],
        index=[target_col, "pred", "pred_probs"],
    ).T

    fis_all = pd.concat(fis, axis=1)
    if print_metrics:
        compute_metrics(y_val_all, pred_val_all, pred_val_probs_all)
    return results, fis_all

# Results

I applied the cross-validation pipeline over 10 splits to three models: a model with only the known genes ("C9orf72", "SOD1", "TARDBP", "FUS"), a model whose features are the K best by Fisher score, and a model whose features are the K best by Laplacian score. For validation, I used the F1 score as the performance criterion.
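When reading the F1 scores, it can help to have a naive reference point. The sketch below computes the F1 score of a hypothetical classifier that labels every sample as class 1, using the class counts listed in the Data Description section. It assumes the validation splits have roughly the same class proportions as the full datasets, which stratified splitting aims for. `always_positive_f1` is a helper name introduced here.

```python
def always_positive_f1(n_pos, n_neg):
    """F1 of a classifier that labels every sample as class 1."""
    tp, fp, fn = n_pos, n_neg, 0
    return 2 * tp / (2 * tp + fp + fn)


baselines = {
    "CtrlVsCase_Classifier": always_positive_f1(137, 32),
    "SiteOnset_Class": always_positive_f1(85, 31),
    "ALSFRS_Class_Median": always_positive_f1(46, 45),
}
for name, value in baselines.items():
    print(f"{name}: {value:0.3f}")
```

Because F1 ignores true negatives, an imbalanced dataset with many positives yields a high F1 even for this trivial rule, so a model's F1 is most informative when compared against this kind of reference.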

# Model of Ctrl vs Case with the known genes 

In [None]:
KNOWN_GENES = ["C9orf72", "SOD1", "TARDBP", "FUS"]
lgb_params = {
    "num_leaves": 31,
    "max_depth": -1,
    "learning_rate": 0.1,
    "n_estimators": 10_000,
}
results, fis_all = run_bagged_validation_new(
    ctrl_vs_case,
    gene_subset=KNOWN_GENES,
    target_col="CtrlVsCase_Classifier",
    nsplits=10,
    lgb_params=lgb_params,
)

The plot below shows the feature importance for the known genes.

In [None]:
feature_order = fis_all.mean(axis=1).sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(10, 20))
sns.barplot(data=fis_all.loc[feature_order].T, palette="Blues_d", orient="h", ax=ax)

# Model of Ctrl vs Case with  the top K highest fisher scores added

In [None]:
lgb_params = {
    "num_leaves": 31,
    "max_depth": -1,
    "learning_rate": 0.1,
    "n_estimators": 10_000,
}
results_fisher, fis_all_fisher = run_bagged_validation_new(
    ctrl_vs_case.drop(columns=pseudogenes),
    gene_subset=[],
    target_col="CtrlVsCase_Classifier",
    method="Fisher",
    nsplits=10,
    lgb_params=lgb_params,
    topK=10,
)

Using Fisher score feature selection improved the F1 score.

The plot below shows the feature importance for the fisher score model.

In [None]:
feature_order_fisher = fis_all_fisher.mean(axis=1).sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(10, 20))
sns.barplot(
    data=fis_all_fisher.loc[feature_order_fisher].T,
    palette="Blues_d",
    orient="h",
    ax=ax,
)

# Model of Ctrl vs Case with  the top K best Laplacian scores added

In [None]:
lgb_params = {
    "num_leaves": 31,
    "max_depth": -1,
    "learning_rate": 0.1,
    "n_estimators": 10_000,
}
results_lap, fis_all_lap = run_bagged_validation_new(
    ctrl_vs_case.drop(columns=pseudogenes),
    gene_subset=[],
    method="Laplacian",
    target_col="CtrlVsCase_Classifier",
    nsplits=10,
    lgb_params=lgb_params,
    topK=10,
)

As the F1 score shows, the model using Laplacian score selection outperformed the known-genes model.

The following bar plot shows the feature importance of the variables selected by the Laplacian score model.

In [None]:
feature_order_lap = fis_all_lap.mean(axis=1).sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(10, 20))
sns.barplot(
    data=fis_all_lap.loc[feature_order_lap].T, palette="Blues_d", orient="h", ax=ax
)

Predictably, the genes selected by Laplacian score differ from those selected by Fisher score.

## Over-Representation Analysis for Fisher Score

Unfortunately, the over-representation analysis of the genes selected by Fisher score found no significant terms known to be associated with ALS.

In [None]:
fis_all_fisher_T = fis_all_fisher.loc[feature_order_fisher].T
ctrlvscase_genelist_fisher = list(
    fis_all_fisher_T.loc[:, fis_all_fisher_T.sum() > 600].columns
)
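The gene list passed to the enrichment step keeps only genes whose importance, summed over the splits, exceeds a cutoff (600 here). The toy frame below mimics the shape of the per-split importance table with invented numbers and hypothetical gene labels. A gene that was not selected in a split appears as NaN after the per-split tables are concatenated, and pandas `.sum()` skips NaN by default.

```python
import pandas as pd

toy_fis = pd.DataFrame(
    {
        "importance_0": [400.0, 10.0, 0.0],
        "importance_1": [350.0, float("nan"), 5.0],  # GENE_B not selected in split 1
    },
    index=["GENE_A", "GENE_B", "GENE_C"],  # hypothetical gene labels
)

toy_T = toy_fis.T  # rows = splits, columns = genes
kept = list(toy_T.loc[:, toy_T.sum() > 600].columns)
print(kept)  # ['GENE_A']
```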

In [None]:
gp.get_library_name()

## Over-Representation Analysis for Laplacian Score

In [None]:
fis_all_lap_T = fis_all_lap.loc[feature_order_lap].T
ctrlvscase_genelist_lap = list(fis_all_lap_T.loc[:, fis_all_lap_T.sum() > 600].columns)

In [None]:
disease_enr3 = gp.enrichr(
    ctrlvscase_genelist_lap,
    gene_sets=["RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO"],
)
#disease_enr3.res2d

In [None]:
disease_enr3.res2d.head(10)

In [None]:
barplot(
    disease_enr3.res2d,
    title="RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO",
)

According to the over-representation analysis, the term "ALS IPSCs-derived neurons", with a gene overlap of MACF1, SNN, DST, PNMA1, DNAJC5, AKAP6 and DCAF7, was significant after multiple testing correction.
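As background on what "significant" means here: over-representation tools of this kind typically score a term with a one-sided Fisher exact test, which is the upper tail of a hypergeometric distribution. The sketch below computes that tail with the standard library. The gene counts in the example call are hypothetical and are not taken from the Enrichr output, and `hypergeom_pvalue` is a name introduced for illustration.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, term_size, universe):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from a universe that contains term_size genes annotated to the term."""
    upper = min(list_size, term_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(term_size, k) * comb(universe - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 7 of 100 selected genes fall in a 300-gene term,
# with a background of 20,000 genes (expected overlap by chance: 1.5).
p_value = hypergeom_pvalue(7, 100, 300, 20000)
print(f"{p_value:.2e}")
```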

In [None]:
disease_enr3 = gp.enrichr(
    ctrlvscase_genelist_lap,
    gene_sets=["SysMyo_Muscle_Gene_Sets"],
)
disease_enr3.res2d

The term "Human prim myotube caveolinopathy v normal" from the SysMyo Muscle Gene Sets library was significant after multiple testing correction, with 7 overlapping genes.

In [None]:
disease_enr3 = gp.enrichr(
    ctrlvscase_genelist_lap,
    gene_sets=["ARCHS4_Tissues"],
)
disease_enr3.res2d

In [None]:
barplot(
    disease_enr3.res2d,
    title="ARCHS4_Tissues",
)

The term fetal cortex was significant in the over-representation analysis using the ARCHS4 Tissues database after multiple testing correction, with an overlap of 16 genes including ANKRD36C, DFFA, DST, SRD5A1 and NEXMIF. Terms such as prefrontal cortex and cerebral cortex had significant unadjusted p-values for similar genes, but their adjusted p-values were not significant. These results suggest that a cluster of the genes selected through Laplacian score is likely associated with brain tissue.
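The gap between raw and adjusted p-values comes from multiple testing correction. A common choice is the Benjamini-Hochberg procedure, sketched below under the assumption that the adjusted p-values reported by the enrichment tool follow it. The p-values in the example are invented, and `benjamini_hochberg` is a helper written for this illustration.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted


raw = np.array([0.001, 0.03, 0.04, 0.045, 0.5])  # hypothetical raw p-values
adj = benjamini_hochberg(raw)
print((raw < 0.05).sum(), "terms pass before correction")
print((adj < 0.05).sum(), "terms pass after correction")
```

In this toy case four terms pass the 0.05 threshold on raw p-values but only one survives the adjustment, which is the same pattern described above for the cortex terms.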

In [None]:
disease_enr3 = gp.enrichr(
    ctrlvscase_genelist_lap,
    gene_sets=["HumanCyc_2016"],
)
disease_enr3.res2d

# Bulbar vs Limb

# Model of Bulbar vs Limb with the known genes 

In [None]:
KNOWN_GENES = ["C9orf72", "SOD1", "TARDBP", "FUS"]
lgb_params = {
    "num_leaves": 31,
    "max_depth": -1,
    "learning_rate": 0.1,
    "n_estimators": 10_000,
}
results2, fis_all2 = run_bagged_validation_new(
    bulbar_vs_limb,
    gene_subset=KNOWN_GENES,
    target_col="SiteOnset_Class",
    nsplits=10,
    lgb_params=lgb_params,
)

The plot below shows the feature importance for the known genes.

In [None]:
feature_order2 = fis_all2.mean(axis=1).sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(10, 20))
sns.barplot(data=fis_all2.loc[feature_order2].T, palette="Blues_d", orient="h", ax=ax)

# Model of Bulbar vs Limb with the top K highest fisher scores added

In [None]:
lgb_params = {
    "num_leaves": 31,
    "max_depth": -1,
    "learning_rate": 0.1,
    "n_estimators": 10_000,
}
results2, fis_all2_fisher = run_bagged_validation_new(
    bulbar_vs_limb.drop(columns=pseudogenes),
    gene_subset=[],
    method="Fisher",
    target_col="SiteOnset_Class",
    nsplits=10,
    lgb_params=lgb_params,
    topK=10,
)

There is a noticeable improvement in the F1 score from applying fisher score feature selection.

The plot below shows the feature importance for the Fisher score selection model.

In [None]:
feature_order2_fisher = fis_all2_fisher.mean(axis=1).sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(10, 20))
sns.barplot(
    data=fis_all2_fisher.loc[feature_order2_fisher].T,
    palette="Blues_d",
    orient="h",
    ax=ax,
)

# Model of Bulbar vs Limb with the top K best Laplacian scores added

In [None]:
lgb_params = {
    "num_leaves": 31,
    "max_depth": -1,
    "learning_rate": 0.1,
    "n_estimators": 10_000,
}
results2_lap, fis_all2_lap = run_bagged_validation_new(
    bulbar_vs_limb.drop(columns=pseudogenes),
    gene_subset=[],
    method="Laplacian",
    target_col="SiteOnset_Class",
    nsplits=10,
    lgb_params=lgb_params,
    topK=10,
)

There is a noticeable improvement in the F1 score from applying Laplacian score feature selection.

The plot below shows the feature importance for the Laplacian score selection model.

In [None]:
feature_order2_lap = fis_all2_lap.mean(axis=1).sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(10, 20))
sns.barplot(
    data=fis_all2_lap.loc[feature_order2_lap].T, palette="Blues_d", orient="h", ax=ax
)

## Over-Representation Analysis for Fisher Score

In [None]:
fis_all2_fisher_T = fis_all2_fisher.loc[feature_order2_fisher].T
bulbvslimb_genelist_fisher = list(
    fis_all2_fisher_T.loc[:, fis_all2_fisher_T.sum() > 600].columns
)

In [None]:
disease_enr3 = gp.enrichr(
    bulbvslimb_genelist_fisher,
    gene_sets=["ARCHS4_IDG_Coexp"],
)
disease_enr3.res2d

In [None]:
barplot(
    disease_enr3.res2d,
    title="ARCHS4_IDG_Coexp",
)

A couple of terms from the ARCHS4 IDG Coexpression gene set were significant after multiple testing correction.

## Over-Representation Analysis for Laplacian Score

No significant terms associated with ALS were found for the genes selected by Laplacian score.

In [None]:
fis_all2_lap_T = fis_all2_lap.loc[feature_order2_lap].T
bulbvslimb_genelist_lap = list(
    fis_all2_lap_T.loc[:, fis_all2_lap_T.sum() > 600].columns
)

In [None]:
disease_enr3 = gp.enrichr(
    bulbvslimb_genelist_lap,
    gene_sets=["ClinVar_2019"],
)
disease_enr3.res2d

# Median Low vs High

# Model of Median Low vs High with the known genes

In [None]:
KNOWN_GENES = ["C9orf72", "SOD1", "TARDBP", "FUS"]
lgb_params = {
    "num_leaves": 31,
    "max_depth": -1,
    "learning_rate": 0.1,
    "n_estimators": 10_000,
}

results3, fis_all3 = run_bagged_validation_new(
    median_low_vs_high.drop(columns=pseudogenes),
    gene_subset=KNOWN_GENES,
    target_col="ALSFRS_Class_Median",
    nsplits=10,
    lgb_params=lgb_params,
)

The F1 score is fairly low for the known genes model.

In [None]:
feature_order3 = fis_all3.mean(axis=1).sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(10, 20))
sns.barplot(data=fis_all3.loc[feature_order3].T, palette="Blues_d", orient="h", ax=ax)

# Model of Median Low vs High with the top K highest Fisher scores added

In [None]:
lgb_params = {
    "num_leaves": 31,
    "max_depth": -1,
    "learning_rate": 0.1,
    "n_estimators": 10_000,
}
results3, fis_all3_fisher = run_bagged_validation_new(
    median_low_vs_high.drop(columns=pseudogenes),
    gene_subset=[],
    target_col="ALSFRS_Class_Median",
    nsplits=10,
    method="Fisher",
    lgb_params=lgb_params,
    topK=10,
)

Using Fisher score for feature selection substantially improved the F1 score, but the F1 score is still well below 0.7.

The graph below shows the feature importance for the Fisher score selection model.

In [None]:
feature_order3_fisher = fis_all3_fisher.mean(axis=1).sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(10, 20))
sns.barplot(
    data=fis_all3_fisher.loc[feature_order3_fisher].T,
    palette="Blues_d",
    orient="h",
    ax=ax,
)

# Model of Median Low vs High with the top K best Laplacian scores added

In [None]:
lgb_params = {
    "num_leaves": 31,
    "max_depth": -1,
    "learning_rate": 0.1,
    "n_estimators": 10_000,
}
results3_lap, fis_all3_lap = run_bagged_validation_new(
    median_low_vs_high.drop(columns=pseudogenes),
    gene_subset=[],
    target_col="ALSFRS_Class_Median",
    nsplits=10,
    method="Laplacian",
    lgb_params=lgb_params,
    topK=10,
)

After using Laplacian score for feature selection, the F1 score decreased slightly compared with the baseline F1 score from the known genes.

The graph below shows the feature importance of the Laplacian score selection model.

In [None]:
feature_order3_lap = fis_all3_lap.mean(axis=1).sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(10, 20))
sns.barplot(
    data=fis_all3_lap.loc[feature_order3_lap].T, palette="Blues_d", orient="h", ax=ax
)

## Over-Representation Analysis for Fisher Score and Laplacian Score

Unfortunately, feature selection with Fisher score and Laplacian score did not improve enough (if at all) on the poor baseline predictions of the model with the known ALS genes. The F1 scores were too low to suggest that the features selected by either method explain the target variable. Therefore, I refrained from interpreting the over-representation analysis with Fisher score and Laplacian score for the Median Low vs High ALSFRS score data.

## Over-Representation Analysis for Fisher Score

In [None]:
fis_all3_fisher_T = fis_all3_fisher.loc[feature_order3_fisher].T
highvslow_genelist_fisher = list(
    fis_all3_fisher_T.loc[:, fis_all3_fisher_T.sum() > 600].columns
)
disease_enr = gp.enrichr(highvslow_genelist_fisher, gene_sets=["DisGeNET"])

In [None]:
barplot(
    disease_enr.res2d,
    title="DisGeNET",
)

## Over-Representation Analysis for Laplacian Score

In [None]:
fis_all3_lap_T = fis_all3_lap.loc[feature_order3_lap].T
highvslow_genelist_lap = list(
    fis_all3_lap_T.loc[:, fis_all3_lap_T.sum() > 3000].columns
)
disease_enr = gp.enrichr(highvslow_genelist_lap, gene_sets=["DisGeNET"])

In [None]:
disease_enr.res2d

# Discussion/Conclusion

Using Fisher score and Laplacian score for feature selection modestly improved the F1 score, and the over-representation analysis suggests that the selected features may be relevant to ALS. The results also suggest that there may be multiple mechanisms of action for the disease.

It is important to note that all three models perform poorly on the classifier "ALSFRS_Class_Median" based on the F1 score. Since predictive performance was lackluster even for the model built on the known ALS-associated genes, "ALSFRS_Class_Median" may be a poor target variable for the transcriptomic data.

There are some limitations to the feature selection methods that I used. For example, since calculating the Laplacian score has a high computational cost, I applied the method to only the 10,000 most variable genes in the datasets. This approach carries a hidden assumption that genes with more variance are usually stronger predictors. However, this may not always be the case: strong predictors of the target may lie outside the 10,000 most variable genes and are neglected by this approach. One disadvantage of Fisher score is that it is calculated using information from both cases and controls in the training set. If one of the two groups is underrepresented in the training split, the accuracy of the results suffers.
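The variance caveat can be illustrated with two simulated genes: one with large variance that is unrelated to the label, and one with small variance that tracks the label closely. All values below are simulated, and the names `noisy_gene`, `quiet_gene` and `y_demo` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
y_demo = np.array([0] * 50 + [1] * 50)

noisy_gene = rng.normal(0.0, 10.0, size=100)  # high variance, unrelated to y
quiet_gene = y_demo * 0.5 + rng.normal(0.0, 0.05, size=100)  # low variance, tracks y

variances = {
    "noisy_gene": noisy_gene.var(ddof=1),
    "quiet_gene": quiet_gene.var(ddof=1),
}
kept_by_variance = max(variances, key=variances.get)  # a top-1 variance filter

corr_noisy = abs(np.corrcoef(noisy_gene, y_demo)[0, 1])
corr_quiet = abs(np.corrcoef(quiet_gene, y_demo)[0, 1])
print(kept_by_variance, round(corr_noisy, 2), round(corr_quiet, 2))
```

A variance-based pre-filter keeps the noisy gene and discards the quiet one, even though the quiet gene is the one correlated with the label.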


Nevertheless, the results from our CV pipeline seem promising. One next step would be to build gradient boosting models with the features selected by Laplacian score and Fisher score on a new set of data to evaluate the generalizability of the models.