In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<h1 style="background-color:#DC143C; font-family:'Brush Script MT',cursive;color:white;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">ALSgeneScanner is a pipeline designed for the analysis of NGS data of ALS patients</h1>

Alfredo Iacoangeli, Ahmad Al Khleifat, William Sproviero, Aleksey Shatunov, Ashley R. Jones, Sarah Opie-Martin, Ersilia Naselli, Simon D. Topp, Isabella Fogh, Angela Hodges, Richard J. Dobson, Stephen J. Newhouse & Ammar Al-Chalabi (2019) ALSgeneScanner: a pipeline for the analysis and interpretation of DNA sequencing data of ALS patients, Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, 20:3-4, 207-215, DOI: 10.1080/21678421.2018.1562553

# **<span style="color:#DC143C;">Variant prioritization</span>**


The pathogenicity prediction programs SIFT (52), PolyPhen-2 HDIV and PolyPhen-2 HVAR (53), LRT (54), MutationTaster (55), MutationAssessor (56), Fathmm (57), PROVEAN (58), Fathmm-MKL coding (59), MetaSVM (60), and CADD (61) are used to prioritize variants. Each variant receives a score between 0 and 11, equal to the number of tools (11 in total) that predict it to be pathogenic. For each tool, the authors' recommendations were used for the categorical interpretation of the variants. A higher priority is given to variants that are reported as “likely pathogenic” or “pathogenic” in ClinVar. So that users remain free to customize the prioritization criteria, both the cumulative score and the categorical variant interpretations from the 11 tools are included in the final results.
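The scoring rule described above can be sketched in a few lines of pandas. This is only an illustration, not the ALSgeneScanner implementation: the column labels, the boolean encoding (True means the tool predicts the variant to be pathogenic), the `clinvar_pathogenic` column, and the three toy variants are all assumptions made for the example.

```python
import pandas as pd

# The 11 tools named in the text; these column labels are illustrative.
TOOLS = [
    "SIFT", "PolyPhen2_HDIV", "PolyPhen2_HVAR", "LRT", "MutationTaster",
    "MutationAssessor", "FATHMM", "PROVEAN", "FATHMM_MKL", "MetaSVM", "CADD",
]

def prioritize(calls: pd.DataFrame) -> pd.DataFrame:
    """Add a 0-11 cumulative score and list ClinVar-pathogenic variants first.

    `calls` holds one boolean column per tool (True = predicted pathogenic)
    plus a hypothetical boolean column `clinvar_pathogenic`.
    """
    out = calls.copy()
    # Summing booleans across the tool columns counts the pathogenic calls.
    out["score"] = out[TOOLS].sum(axis=1).astype(int)
    # ClinVar status takes priority; the cumulative score breaks ties.
    return out.sort_values(["clinvar_pathogenic", "score"], ascending=False)

# Toy input: three made-up variants.
toy = pd.DataFrame(
    [[True] * 11, [False] * 11, [True] * 4 + [False] * 7],
    columns=TOOLS,
    index=["var_a", "var_b", "var_c"],
)
toy["clinvar_pathogenic"] = [False, False, True]

ranked = prioritize(toy)
print(ranked[["clinvar_pathogenic", "score"]])
```

Keeping the per-tool columns next to the score mirrors the idea that users can apply their own prioritization criteria instead of the cumulative count.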

https://www.tandfonline.com/doi/full/10.1080/21678421.2018.1562553

# ALSgeneScanner variant prioritization performance

VariBench: Precision, Sensitivity, Accuracy.

VariBenchFiltered: PrecisionA, SensitivityA, AccuracyA.

ClinVar ALS variants: PrecisionB, SensitivityB, AccuracyB.

![](https://onlinelibrary.wiley.com/cms/asset/581f38c2-e172-4097-bf5b-06fdc08226f3/mgg3302-toc-0001-m.jpg)onlinelibrary.wiley.com

In [None]:
nRowsRead = 1000  # set to None to read the whole file
# The file is semicolon-delimited and encoded as ISO-8859-2
df = pd.read_csv('../input/cusersmarildownloadsvariantcsv/variant.csv',
                 delimiter=';', encoding='ISO-8859-2', nrows=nRowsRead)
df.dataframeName = 'variant.csv'
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')
df.head()

# **<span style="color:#DC143C;">VariBench, VariBenchFiltered and ClinVar</span>**

To assess the variant prioritization approach, the authors used a set of non-synonymous variants from the VariBench dataset (63) for which the effect is known, together with all ALS-associated non-synonymous variants stored in ClinVar (71 benign and 121 pathogenic).

The VariBench variants are not specific to ALS genes, but because each one is annotated as deleterious or not, the general principles of the method could be tested on them. The dataset includes VariBench protein tolerance dataset 1.

In order to minimize the overlap between training and evaluation sets, they derived a subset of variants (VariBenchFiltered) from the VariBench dataset by filtering out its overlap with HumVar (53), the CADD training dataset (61) and ExoVar (65), which are commonly used to train the tools (66).
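The filtering step can be pictured as an anti-join: keep only the benchmark variants whose identifier appears in none of the training sets. The sketch below assumes variants are keyed by hypothetical `chrom`, `pos`, `ref`, and `alt` columns and uses made-up rows; the source does not say how the overlap was actually computed.

```python
import pandas as pd

KEY = ["chrom", "pos", "ref", "alt"]  # hypothetical variant identifier columns

def remove_overlap(benchmark: pd.DataFrame, training_sets: list) -> pd.DataFrame:
    """Return the benchmark rows whose variant key appears in no training set."""
    seen = pd.concat([t[KEY] for t in training_sets]).drop_duplicates()
    # indicator=True adds a `_merge` column saying where each row was found.
    merged = benchmark.merge(seen, on=KEY, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# Toy data: made-up variants standing in for VariBench and two training sets.
varibench = pd.DataFrame({
    "chrom": ["1", "2", "21"],
    "pos": [100, 200, 300],
    "ref": ["A", "C", "G"],
    "alt": ["T", "G", "A"],
    "deleterious": [True, False, True],
})
humvar_like = pd.DataFrame({"chrom": ["1"], "pos": [100], "ref": ["A"], "alt": ["T"]})
cadd_like = pd.DataFrame({"chrom": ["2"], "pos": [200], "ref": ["C"], "alt": ["G"]})

varibench_filtered = remove_overlap(varibench, [humvar_like, cadd_like])
print(varibench_filtered)
```

Only the variant at position 300 survives in this toy case, because the other two also occur in a training set.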

https://www.tandfonline.com/doi/full/10.1080/21678421.2018.1562553

DATAPREP

Dataprep is an initiative by the SFU Data Science Research Group to speed up data science. Dataprep.eda attempts to simplify the entire EDA process with minimal lines of code. EDA is an essential and time-consuming part of the data science pipeline, so a tool that eases the process is a boon.

https://dataprep.ai/

In [None]:
#https://dataprep.ai/

!pip install dataprep

In [None]:
from dataprep.eda import plot, plot_correlation, create_report, plot_missing

In [None]:
plot(df)

# **<span style="color:#DC143C;">Evaluation of Performance</span>**

Receiver operating characteristic (ROC) curves and their corresponding area under the curve (AUC) statistics were calculated using easyROC. Accuracy, precision, and sensitivity are defined as in the equations below, where Tp is true positives, Fp false positives, Fn false negatives, and Tn true negatives.

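$$\text{Precision} = \frac{\mathrm{Tp}}{\mathrm{Tp} + \mathrm{Fp}}, \qquad \text{Sensitivity} = \frac{\mathrm{Tp}}{\mathrm{Tp} + \mathrm{Fn}}, \qquad \text{Accuracy} = \frac{\mathrm{Tp} + \mathrm{Tn}}{\mathrm{Tp} + \mathrm{Tn} + \mathrm{Fn} + \mathrm{Fp}}$$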
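As a quick check of these definitions, the three ratios can be computed from the four confusion-matrix counts. The function name and the counts in the example are made up for illustration, and the sketch assumes the denominators are non-zero.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, sensitivity, and accuracy from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
    }

# Made-up counts: 90 true positives, 10 false positives,
# 30 false negatives, 70 true negatives.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=70)
print(metrics)
```

With these counts, precision is 90/100, sensitivity is 90/120, and accuracy is 160/200.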

Their table of sensitivity, specificity, and accuracy means that the appropriate cutoff can be used to interrogate data, depending on whether the aim is the exclusion of potentially harmful variants, or the detection of definitely harmful variants.
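To see how the choice of cutoff trades sensitivity against specificity, one can recompute the metrics at every possible score threshold. The sketch below assumes a variant is called pathogenic when its score is greater than or equal to the cutoff; that convention, the function name, and the toy scores and labels are assumptions for illustration, not values from the paper.

```python
import numpy as np

def metrics_by_cutoff(scores, is_pathogenic, max_score=11):
    """Sensitivity, specificity, and accuracy for each score cutoff."""
    scores = np.asarray(scores)
    truth = np.asarray(is_pathogenic, dtype=bool)
    rows = []
    for cutoff in range(max_score + 1):
        called = scores >= cutoff  # assumed calling rule
        tp = int(np.sum(called & truth))
        fp = int(np.sum(called & ~truth))
        fn = int(np.sum(~called & truth))
        tn = int(np.sum(~called & ~truth))
        rows.append({
            "cutoff": cutoff,
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
            "accuracy": (tp + tn) / len(scores),
        })
    return rows

# Toy data: six made-up variants with their scores and true labels.
toy_scores = [0, 2, 5, 9, 11, 3]
toy_truth = [False, False, True, True, True, False]

table = metrics_by_cutoff(toy_scores, toy_truth)
for row in table:
    print(row)
```

A low cutoff catches every harmful variant at the cost of false positives, which suits exclusion of potentially harmful variants; a high cutoff keeps only the calls most tools agree on, which suits detection of definitely harmful ones.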

https://www.tandfonline.com/doi/full/10.1080/21678421.2018.1562553

In [None]:
plot(df, "Precision")

In [None]:
plot(df, "Sensitivity")

In [None]:
plot(df, "Accuracy")

In [None]:
plot(df, "Precision","Sensitivity")

In [None]:
plot(df, "Accuracy","Sensitivity")

In [None]:
!pip install dabl
import dabl

In [None]:
dabl.detect_types(df)

In [None]:
dabl.plot(df, target_col="Sensitivity")

In [None]:
dabl.plot(df, target_col="SensitivityA")

In [None]:
dabl.plot(df, target_col="Precision")

In [None]:
dabl.plot(df, target_col="Accuracy")

In [None]:
#API Correlation
plot_correlation(df)

# **<span style="color:#DC143C;">Correlation analysis</span>**

Correlation analysis was performed on the 11 tools that make up the score, using the categorical results of each individual tool on the VariBenchFiltered dataset. The average correlation was 45%, with a standard deviation of 14%. Only PolyPhen-2 HDIV and PolyPhen-2 HVAR showed a strong correlation (83%). PolyPhen-2 HDIV differs from PolyPhen-2 HVAR in its training dataset, which included only Mendelian disease variants. The moderate correlation among the other tools suggests that they can provide the user with complementary information.
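A pairwise analysis of this kind can be sketched with pandas by encoding each tool's categorical call as 0 or 1 and summarizing the off-diagonal entries of the correlation matrix. The source does not state which correlation coefficient was used, so Pearson correlation on binary calls is an assumption here, and the three tools and six variants are toy data.

```python
import numpy as np
import pandas as pd

def pairwise_tool_correlation(calls: pd.DataFrame):
    """Correlation matrix plus mean and standard deviation over tool pairs."""
    corr = calls.astype(int).corr()  # Pearson on 0/1 calls (an assumption)
    # Take each unordered pair once: the strict upper triangle of the matrix.
    upper = np.triu_indices_from(corr.to_numpy(), k=1)
    pair_values = corr.to_numpy()[upper]
    return corr, float(pair_values.mean()), float(pair_values.std(ddof=1))

# Toy data: categorical calls (True = pathogenic) for six made-up variants.
toy_calls = pd.DataFrame({
    "tool_a": [True, True, False, False, True, False],
    "tool_b": [True, True, False, False, True, False],
    "tool_c": [True, False, True, False, True, False],
})

corr, mean_corr, sd_corr = pairwise_tool_correlation(toy_calls)
print(corr.round(2))
print(f"mean = {mean_corr:.2f}, sd = {sd_corr:.2f}")
```

In the toy data `tool_a` and `tool_b` agree on every variant, so they play the role of a strongly correlated pair, while `tool_c` agrees with them only part of the time.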

ALSgeneScanner puts a powerful bioinformatics tool, able to exploit the potential of next-generation sequencing data, in the hands of patients, ALS researchers, and clinicians.

https://www.tandfonline.com/doi/full/10.1080/21678421.2018.1562553