# <span style="color:#8B4513;"> Machine Learning and RNA-Seq Data of Parkinson's Disease
</span>



[<span style="color:#8B4513;">Author: **Zainab Nazari**</span>](mailto:z.nazari@ebri.com)
 
 <span style="color:#8B4513;">EBRI – European Brain Research Institute Rita Levi-Montalcini | MHPC - Master in High Performance Computing</span>
 


## Introduction
By applying machine learning to the PPMI clinical data set, we can develop predictive models that aid in the early diagnosis of Parkinson's disease. These models can potentially identify specific genetic markers or gene signatures that correlate with disease progression or response to treatment.

## Table of Contents
- [Matrix of Gene IDs and Counts for Patients](#matrixcreation)
- [Data Preprocessing STEP I](#preprocessing)
- [Data Preprocessing STEP II](#preprocessing2)
- [Model Training](#training)
- [Results and Evaluation](#results)

## Matrix of Gene IDs and Counts for Patients
- Load the data from the IR3/counts folder and extract the last column (counts) of each patient file for the BL (baseline) visit.


## Data Preprocessing STEP I
- We remove patients flagged for these genetic variants: SNCA (ENRLSNCA), GBA (ENRLGBA), LRRK2 (ENRLLRRK2).
- We keep only genes found in the intersection of the counts and quant files that are protein-coding or lncRNA.
- We remove the duplicated gene IDs, which are also lowly expressed.
- We keep only patients with a diagnosis of Healthy Control or Parkinson's Disease.

## Data Preprocessing STEP II
1. Removing lowly expressed genes by keeping only genes that had more than five counts in at least 10% of the individuals, which left us with 21,273 genes.

2. DESeq2: we estimated size factors, normalized the library size bias using these factors, performed independent filtering to remove lowly expressed genes using the mean of normalized counts as a filter statistic. This left us with 12,612 genes

3. DESeq2: we applied a variance stabilizing transformation to accommodate the problem of unequal variance across the range of mean values.

4. limma: we used control samples to estimate the batch effect of the site, which we subsequently removed in both controls and cases.

5. limma: we removed further confounding effects due to sex and RIN value.
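The idea behind step 4 can be illustrated with a simplified sketch: estimate a per-gene site offset from control samples only, then subtract it from every sample at that site. This is not limma's implementation or parameterisation; the function name `remove_site_effect` and the toy values are hypothetical, and the covariate adjustment of step 5 (sex, RIN) is not covered.

```python
import numpy as np

def remove_site_effect(expr, site, is_control):
    """Subtract a per-gene site offset estimated from control samples.

    expr: (genes, samples) array of transformed expression values.
    site: per-sample site labels.
    is_control: boolean mask marking the control samples.
    """
    expr = np.asarray(expr, dtype=float)
    site = np.asarray(site)
    is_control = np.asarray(is_control, dtype=bool)
    corrected = expr.copy()
    # Reference level: per-gene mean over all control samples.
    overall = expr[:, is_control].mean(axis=1)
    for s in np.unique(site):
        controls_at_site = is_control & (site == s)
        if not controls_at_site.any():
            continue  # no controls at this site, so no offset can be estimated
        # Offset = control mean at this site minus the overall control mean.
        effect = expr[:, controls_at_site].mean(axis=1) - overall
        # Remove the offset from controls and cases of this site alike.
        corrected[:, site == s] -= effect[:, None]
    return corrected

# Toy example: one gene, two sites, one control and one case per site.
expr = np.array([[10.0, 12.0, 14.0, 16.0]])
site = np.array(["A", "A", "B", "B"])
is_control = np.array([True, False, True, False])
corrected = remove_site_effect(expr, site, is_control)
print(corrected)
```

In the toy data the two sites differ by a constant shift of 4, so after correction the controls agree across sites and the case-control difference within each site is preserved.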

## Model Training
Build and train machine learning models on the prepared data. Explain the choice of models, feature engineering techniques, and hyperparameter tuning. Provide code and comments to walk through the model training process.

## Results and Evaluation
Present the results of the trained models, including performance metrics, accuracy, or any relevant evaluation measures. Interpret the findings and discuss the implications. Include visualizations or tables to support the results.

## Conclusion
Summarize the key findings, limitations of the analysis, and potential future work or improvements. Offer closing remarks or suggestions for further exploration.

## References
- [**Parkinson’s Progression Markers Initiative (PPMI)**](https://www.ppmi-info.org/)

- [**A Machine Learning Approach to Parkinson’s Disease Blood Transcriptomics**](https://www.mdpi.com/2073-4425/13/5/727)

- [**Quality Control Metrics for Whole Blood Transcriptome Analysis in the Parkinson’s Progression Markers Initiative (PPMI)**](https://www.medrxiv.org/content/10.1101/2021.01.05.21249278v1)



In [1]:
# If any of the following packages is not installed, uncomment the corresponding installation line.

import pandas as pd
import numpy as np
import os
import glob
import functools
from pathlib import Path
import matplotlib.pyplot as plt

#!pip install dask[complete];
# Run the following lines if dask raises an error; it may need an update.
#!pip install --upgrade pandas "dask[complete]"
#python -m pip install "dask[dataframe]" --upgrade
import dask.dataframe as dd

from sklearn.model_selection import train_test_split

from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve, accuracy_score
from sklearn.metrics import roc_auc_score, confusion_matrix, precision_recall_curve

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.feature_selection import SelectFromModel
from sklearn.utils import class_weight

#!pip3 install xgboost
from xgboost import XGBClassifier

#!pip install conorm
import conorm # for tmm normalisation

#!pip install pydeseq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data

In [2]:
# Note that the counts files in IR3 take up around 152 GB and are located in the scratch area.

path_to_files="/scratch/znazari/PPMI_ver_sep2022/RNA_Seq_data/star_ir3/counts/"
path1=Path("/scratch/znazari/PPMI_ver_sep2022/RNA_Seq_data/star_ir3/counts/")
path2 = Path("/home/znazari/data") # where the output data will be saved at the end.
path3=Path("/scratch/znazari/PPMI_ver_sep2022/study_data/Subject_Characteristics/")

<a id="matrixcreation"></a>
## Matrix of Gene IDs and Counts for Patients
Load the data from the IR3/counts folder and extract the last column (counts) of each patient file for the BL (baseline) visit.

In [3]:
# Read the files that belong to the BL (baseline) visit.
specific_word = 'BL'
ending_pattern = '*.txt'
file_pattern = f'*{specific_word}*.{ending_pattern}'
file_paths = glob.glob(path_to_files + file_pattern)
# 'bl.txt' is a file that contains the names of the patient files for BL, IR3, counts.
filename = 'bl.txt'
file_path_2 = os.path.join(path_to_files, filename)
bl_files = pd.read_csv(file_path_2,header=None)

# Return the second dot-separated token of a file name, which is the patient ID.
# We use this function to extract the patient IDs from the file names.
def function_names(fname):
    tokens=fname.split('.')
    return tokens[1]

# Build the list of patient IDs, one per BL file.
bl_list = [function_names(bl_files.iloc[i][0]) for i in range(len(bl_files))]

# Read all the files of the baseline visit (BL) from the counts folder, which holds the files
# for all the patients and all the visits.
list_bl_files = [dd.read_csv(path1/bl_files.iloc[i][0],skiprows=1,delimiter='\t') for i in range(len(bl_files))]


# Take the last column (the counts) of each file in the list.
last_columns = [ddf.iloc[:, -1:] for ddf in list_bl_files]

# Concatenate the columns into a single dataframe.
single_file = dd.concat(last_columns, axis=1)

# Rename the columns with the patient IDs.
single_file.columns = bl_list

# we get the Geneid column and convert it to dask dataframe
pd_tmp_file = list_bl_files[3].compute()
geneid = pd_tmp_file['Geneid']
ddf_geneid = dd.from_pandas(geneid, npartitions=1)

# here we set the Geneid column as the index of the matrix.
ddf_new_index = single_file.set_index(ddf_geneid)

# converting to pandas data frame and saving.
ir3_counts = ddf_new_index.compute()
ir3_counts.to_csv(path2/"matrix_ir3_counts_bl.csv")

We're assuming that the indices of each dataframes are aligned. This assumption is not generally safe.
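The warning above concerns row alignment during the column-wise concatenation. One way to avoid relying on row order is to index each per-patient table by gene ID before concatenating, so rows are matched by label. The sketch below uses plain pandas on toy data; the helper name `build_count_matrix` is hypothetical, and it assumes each file has a `Geneid` column with the counts in the last column, as in the cell above.

```python
import pandas as pd

def build_count_matrix(frames, patient_ids):
    """Join per-patient count tables on their 'Geneid' column.

    frames: list of DataFrames, each with a 'Geneid' column and the counts
            in the last column (assumed layout).
    patient_ids: one label per frame, used as the output column names.
    """
    columns = []
    for frame, patient_id in zip(frames, patient_ids):
        counts = frame.set_index("Geneid").iloc[:, -1].rename(patient_id)
        columns.append(counts)
    # Concatenating along axis=1 aligns on the Geneid index, so the row order
    # of the input files no longer matters; join="inner" keeps only the genes
    # present in every file.
    return pd.concat(columns, axis=1, join="inner")

# Toy data: the second file lists the same genes in a different order.
a = pd.DataFrame({"Geneid": ["g1", "g2", "g3"], "Length": [10, 20, 30], "counts": [5, 0, 7]})
b = pd.DataFrame({"Geneid": ["g3", "g1", "g2"], "Length": [30, 10, 20], "counts": [1, 2, 3]})
matrix = build_count_matrix([a, b], ["3000", "3001"])
print(matrix)
```

If all files really share the same gene order, this yields the same matrix as a positional concatenation; if they do not, the label-based join prevents counts from being silently assigned to the wrong gene.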


<a id="preprocessing"></a>
## Data Preprocessing STEP I

- We remove patients flagged for these genetic variants: SNCA (ENRLSNCA), GBA (ENRLGBA), LRRK2 (ENRLLRRK2).
- We remove patients who are taking dopamine drugs.
- We keep only genes found in the intersection of the counts and quant files that are protein-coding or lncRNA.
- We remove the duplicated gene IDs, which are also lowly expressed.
- We keep only patients with a diagnosis of Healthy Control or Parkinson's Disease.

In [6]:
# reading the file
read_ir3_counts = pd.read_csv(path2/"matrix_ir3_counts_bl.csv")
# setting the geneid as indexing column
read_ir3_counts.set_index('Geneid', inplace=True)
# Strip the version suffix (the part after the dot) from the gene IDs.
read_ir3_counts.index =read_ir3_counts.index.str.split('.').str[0]


# Remove the duplicated gene IDs: they are duplicated and also very lowly
# expressed (zero, or one in rare cases).

# Flag repeated index values (every occurrence after the first)
is_duplicate = read_ir3_counts.index.duplicated()

# TODO: check whether duplicated gene IDs are still present after this step.
# Collect the duplicated gene IDs
duplicate_indices = read_ir3_counts.index[is_duplicate]

# Drop every row that carries one of these gene IDs
to_be_deleted = list(duplicate_indices)
read_ir3_counts = read_ir3_counts.drop(to_be_deleted)

# we read the file where we have an intersection of geneIDs in IR3, counts, quant
intersect = pd.read_csv(path2/"intersect_IR3_ENG_IDs_LincRNA_ProtCoding_counts_quant_gene_transcript_only_tot_intsersect.txt")
intersection = read_ir3_counts.index.intersection(intersect['[IR3_gene_counts] and [IR3_quant_gene] and [IR3_quant_trans] and [lncRNA+ProtCod]: '])
filtered_read_ir3_counts = read_ir3_counts.loc[intersection]
filtered_read_ir3_counts

# reading the file which contains diagnosis
diago=pd.read_csv(path3/"Participant_Status.csv", header=None )
diago1=diago.rename(columns=diago.iloc[0]).drop(diago.index[0]).reset_index(drop=True)

# Keep only patients with none of these genetic flags set: SNCA (ENRLSNCA), GBA (ENRLGBA), LRRK2 (ENRLLRRK2)
filtered_SNCA_GBA_LRRK2 = diago1[(diago1['ENRLSNCA'] == "0")& (diago1['ENRLGBA'] == "0")& (diago1['ENRLLRRK2'] == "0")]

#patients with their diagnosis
patinets_diagnosis = filtered_SNCA_GBA_LRRK2[['PATNO','COHORT_DEFINITION']].reset_index(drop=True)

# Define the particular names to keep
names_to_keep = ['Healthy Control', "Parkinson's Disease"]


# Filter the dataframe based on the specified names
PK_HC_pateints = patinets_diagnosis[patinets_diagnosis['COHORT_DEFINITION'].isin(names_to_keep)]

# Get the list of patient IDs with diagnosis from the second dataframe
patient_ids_with_diagnosis = PK_HC_pateints['PATNO']
list_patients=list(patient_ids_with_diagnosis)

# Filter the columns in the first dataframe based on patient IDs with diagnosis
rna_filtered = filtered_read_ir3_counts.filter(items=list_patients)
# Note: index=False writes the counts without the gene ID index.
rna_filtered.to_csv(path2/'ir3_rna.csv', index=False)

# Keep only the patients that are common to the two dataframes.
common_patient_ids = list(set(PK_HC_pateints['PATNO']).intersection(rna_filtered.columns))
patient11_filtered = PK_HC_pateints[PK_HC_pateints['PATNO'].isin(common_patient_ids)]
# reset_index returns a new dataframe, so the result must be assigned back.
patient11_filtered = patient11_filtered.reset_index(drop=True)

# we save the output into data folder
patient11_filtered.to_csv(path2/'patients_HC_PK_diagnosis.csv', index=False)
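The duplicate handling in the cell above can be checked on a small example. The sketch below uses hypothetical gene IDs; the `_PAR_Y` suffix is only an assumed example of how two distinct versioned IDs can collapse to the same ID once everything after the first dot is stripped. It flags every copy of a duplicated ID with `duplicated(keep=False)`, which has the same effect as dropping the duplicated labels, and then confirms that no duplicates remain.

```python
import pandas as pd

# Hypothetical versioned gene IDs; two of them share the same base ID.
counts = pd.DataFrame(
    {"3000": [40, 0, 563], "3001": [13, 1, 815]},
    index=["GENE_A.14", "GENE_A.14_PAR_Y", "GENE_B.5"],
)

# Strip everything after the first dot, as in the cell above.
counts.index = counts.index.str.split(".").str[0]

# keep=False marks every copy of a repeated ID, not only the later ones.
dup_mask = counts.index.duplicated(keep=False)
print(counts[dup_mask])  # inspect the duplicated rows before removing them

# Remove all copies of the duplicated IDs and verify the result.
deduplicated = counts[~dup_mask]
remaining_duplicates = int(deduplicated.index.duplicated().sum())
print(remaining_duplicates)
```

Inspecting `counts[dup_mask]` before dropping also makes it easy to confirm that the duplicated genes are indeed lowly expressed.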

In [56]:
patient_dopomine = pd.read_csv(path2/'Patient_IDs_taking_dopamine_drugs.txt',delimiter='\t',  header=None)
patient_dopomine = patient_dopomine.rename(columns={0: 'Pateint IDs'})

<a id="preprocessing2"></a>
## Data Preprocessing STEP II

1. Removing lowly expressed genes by keeping only genes that had more than five counts in at least 10% of the individuals, which left us with 21,273 genes.

2. DESeq2: we estimated size factors, normalized the library size bias using these factors, performed independent filtering to remove lowly expressed genes using the mean of normalized counts as a filter statistic. This left us with 12,612 genes

3. DESeq2: we applied a variance stabilizing transformation to accommodate the problem of unequal variance across the range of mean values.

4. limma: we used control samples to estimate the batch effect of the site, which we subsequently removed in both controls and cases.

5. limma: we removed further confounding effects due to sex and RIN value.
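Step 1 of this list can be sketched with pandas as follows. The function name `filter_low_expression` and the toy matrix are hypothetical; the thresholds (more than five counts, at least 10% of the individuals) are the ones stated in the list, and the matrix is assumed to have genes as rows and patients as columns.

```python
import pandas as pd

def filter_low_expression(counts, min_count=5, min_fraction=0.10):
    """Keep genes (rows) with more than `min_count` reads in at least
    `min_fraction` of the samples (columns)."""
    # Fraction of samples in which each gene exceeds the count threshold.
    fraction_passing = (counts > min_count).mean(axis=1)
    return counts.loc[fraction_passing >= min_fraction]

# Toy matrix: 3 genes x 10 patients.
patients = [str(3000 + i) for i in range(10)]
toy = pd.DataFrame(
    [
        [0] * 10,        # never expressed, removed
        [6] + [0] * 9,   # more than 5 counts in 1 of 10 patients (10%), kept
        [5] + [0] * 9,   # exactly 5 counts is not "more than five", removed
    ],
    index=["gene_a", "gene_b", "gene_c"],
    columns=patients,
)
kept = filter_low_expression(toy)
print(kept.index.tolist())
```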

In [54]:
# Patient IDs to exclude, converted to strings to match the column names of the count matrix.
ids_to_remove = patient_dopomine['Pateint IDs'].tolist()
strings = [str(num) for num in ids_to_remove]

In [57]:
rna = pd.read_csv(path2/'ir3_rna.csv')
diagnosis = pd.read_csv(path2/'patients_HC_PK_diagnosis.csv')

In [55]:
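# Drop the columns whose name exactly matches an excluded patient ID
# (a substring test would also drop, e.g., column '41410' for ID '4141').
excluded = set(strings)
new_columns = [col for col in rna.columns if col not in excluded]
rna = rna[new_columns]
rna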

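| Unnamed: 0 | 3000 | 3001 | 3002 | 3003 | 3004 | 3006 | 3007 | 3008 | 3009 | 3010 | ... | 4121 | 4122 | 4123 | 4124 | 4125 | 4126 | 4135 | 4136 | 4139 | 41410 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 13 | 87 | 11 | 27 | 22 | 25 | 24 | 33 | 14 | ... | 15 | 26 | 15 | 16 | 19 | 41 | 11 | 16 | 15 | 15 |
| 1 | 4 | 0 | 28 | 2 | 10 | 0 | 2 | 0 | 0 | 2 | ... | 3 | 0 | 0 | 2 | 8 | 1 | 1 | 5 | 1 | 0 |
| 2 | 563 | 815 | 879 | 855 | 1194 | 770 | 779 | 980 | 945 | 1185 | ... | 529 | 1406 | 851 | 653 | 606 | 1183 | 536 | 426 | 754 | 645 |
| 3 | 1869 | 1510 | 1438 | 1593 | 2418 | 1925 | 1446 | 1607 | 1923 | 2210 | ... | 1275 | 2529 | 1677 | 1681 | 1428 | 2008 | 1378 | 1037 | 1390 | 1630 |
| 4 | 512 | 367 | 460 | 444 | 581 | 522 | 504 | 488 | 621 | 605 | ... | 320 | 682 | 410 | 432 | 473 | 601 | 478 | 343 | 291 | 529 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30419 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 30420 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 30421 | 2 | 0 | 11 | 1 | 5 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 4 | 1 | 2 | 2 | 0 | 0 |
| 30422 | 1 | 0 | 8 | 0 | 5 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 1 | 0 |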
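Step 2 of the preprocessing list relies on DESeq2 size factors to correct for library size. As a rough illustration of the median-of-ratios idea behind them, the NumPy sketch below computes one size factor per sample from a raw count matrix. It is a minimal sketch under stated assumptions, not the DESeq2 or pydeseq2 implementation; the function name `median_of_ratios_size_factors` and the toy counts are hypothetical.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample.

    counts: array of shape (genes, samples) holding raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    # Use only genes with a positive count in every sample, so all logs are finite.
    usable = (counts > 0).all(axis=1)
    log_counts = np.log(counts[usable])
    # Log of the per-gene geometric mean across samples.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Per sample: median over genes of log(count / geometric mean).
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy example: the second sample was sequenced twice as deeply as the first.
toy_counts = np.array([
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],   # contains a zero, so it is ignored when estimating the factors
])
size_factors = median_of_ratios_size_factors(toy_counts)
normalized = toy_counts / size_factors
print(size_factors)
```

Dividing each column by its size factor puts the samples on a common scale, which is the normalization that the later filtering and variance-stabilizing steps build on.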


<a id="training"></a>
## Model Training 

Build and train machine learning models on the prepared data. Explain the choice of models, feature engineering techniques, and hyperparameter tuning. Provide code and comments to walk through the model training process.

<a id="results"></a>
## Results and Evaluation 

Present the results of the trained models, including performance metrics, accuracy, or any relevant evaluation measures. Interpret the findings and discuss the implications. Include visualizations or tables to support the results.