# <span style="color:#8B4513;">Machine Learning and RNA-seq Data of Parkinson's Disease</span>



[<span style="color:#8B4513;">Author: **Zainab Nazari**</span>](mailto:z.nazari@ebri.com)
 
 <span style="color:#8B4513;">EBRI – European Brain Research Institute Rita Levi-Montalcini | MHPC - Master in High Performance Computing</span>
 


## Introduction
By applying machine learning to the PPMI clinical data set, we can develop predictive models that aid the early diagnosis of Parkinson's disease. Such models may also identify genetic markers or gene signatures that correlate with disease progression or response to treatment.

## Table of Contents
- [Matrix of Gene IDs and Patient Numbers](#matrixcreation)
- [Data Preprocessing](#preprocessing)
- [Model Training](#training)
- [Results and Evaluation](#results)

## Matrix of Gene IDs and Patient Numbers
- Load the data from the IR3/counts folder and extract the last column (the counts) of each patient's file for the BL (baseline) visit.


## Data Preprocessing 
- We remove patients flagged for any of these genes: SNCA (ENRLSNCA), GBA (ENRLGBA), LRRK2 (ENRLLRRK2).

## Model Training
Build and train machine learning models on the prepared data. Explain the choice of models, feature engineering techniques, and hyperparameter tuning. Provide code and comments to walk through the model training process.

## Results and Evaluation
Present the results of the trained models, including performance metrics, accuracy, or any relevant evaluation measures. Interpret the findings and discuss the implications. Include visualizations or tables to support the results.

## Conclusion
Summarize the key findings, limitations of the analysis, and potential future work or improvements. Offer closing remarks or suggestions for further exploration.

## References
- [**Parkinson’s Progression Markers Initiative (PPMI)**](https://www.ppmi-info.org/)

- [**A Machine Learning Approach to Parkinson’s Disease Blood Transcriptomics**](https://www.mdpi.com/2073-4425/13/5/727)

- [**Quality Control Metrics for Whole Blood Transcriptome Analysis in the Parkinson’s Progression Markers Initiative (PPMI)**](https://www.medrxiv.org/content/10.1101/2021.01.05.21249278v1)



In [1]:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import os
import glob
import functools
from pathlib import Path

# If dask raises an error it may need an update; uncomment and run one of these:
#!pip install "dask[complete]"
#!pip install --upgrade pandas "dask[complete]"
#!python -m pip install "dask[dataframe]" --upgrade

<a id="matrixcreation"></a>
## Matrix of Gene IDs and Patient Numbers
Load the data from the IR3/counts folder and extract the last column (the counts) of each patient's file for the BL (baseline) visit.

In [None]:
# Note: the IR3 counts folder is around 152 GB, and the files are located in the scratch area.
path_to_files = "/scratch/znazari/PPMI_ver_sep2022/RNA_Seq_data/star_ir3/counts/"
path1 = Path(path_to_files)

# Match the files that belong to the BL (baseline) visit.
specific_word = 'BL'
ending_pattern = 'txt'
file_pattern = f'*{specific_word}*.{ending_pattern}'
file_paths = glob.glob(path_to_files + file_pattern)
# 'bl.txt' is a file that contains the names of the per-patient BL IR3 count files.
filename = 'bl.txt'
file_path_2 = os.path.join(path_to_files, filename)
bl_files = pd.read_csv(file_path_2, header=None)

# Return the second dot-separated token of a file name.
# That token is the patient ID, so this function extracts patient IDs from file names.
def function_names(fname):
    tokens = fname.split('.')
    return tokens[1]

# Build the list of patient IDs, one per file.
bl_list = [function_names(bl_files.iloc[i][0]) for i in range(len(bl_files))]

# Read every baseline (BL) file from the counts folder, which holds the files
# for all patients and all visits.
list_bl_files = [dd.read_csv(path1/bl_files.iloc[i][0], skiprows=1, delimiter='\t') for i in range(len(bl_files))]

# we get the Geneid column and convert it to dask dataframe
pd_tmp_file = list_bl_files[3].compute()
geneid = pd_tmp_file['Geneid']
ddf_geneid = dd.from_pandas(geneid, npartitions=1)

# Take the last column (the counts) of each file in the list.
last_columns = [ddf.iloc[:, -1:] for ddf in list_bl_files]

# Concatenate the list of columns into a single dataframe.
single_file = dd.concat(last_columns, axis=1)

# Rename each column with the corresponding patient ID.
single_file.columns = bl_list

# Set the Geneid column as the index of the matrix.
ddf_new_index = single_file.set_index(ddf_geneid)
# Convert to a pandas dataframe and save it.
path2 = Path("/home/znazari/ppmi_files/data_preprocessing/data")  # output folder; must be defined before saving
ir3_counts = ddf_new_index.compute()
ir3_counts.to_csv(path2/"matrix_ir3_counts_bl.csv")
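
The steps in the cell above can be condensed into a small pandas-only sketch for experimenting without the full data. It assumes each count file has one leading comment line, a `Geneid` column, and the counts in the last column, which is what the `skiprows=1` and `iloc[:, -1:]` calls above imply. The helper names `patient_id` and `build_count_matrix`, the toy file names, and all values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def patient_id(fname: str) -> str:
    """Return the second dot-separated token of a file name (the patient ID)."""
    return fname.split(".")[1]


def build_count_matrix(folder: Path, file_names: list) -> pd.DataFrame:
    """Join the last column of each count file into one genes x patients matrix."""
    columns = []
    for name in file_names:
        # skiprows=1 drops the leading comment line, as in the cell above.
        table = pd.read_csv(folder / name, sep="\t", skiprows=1, index_col="Geneid")
        columns.append(table.iloc[:, -1].rename(patient_id(name)))
    # Joining on the Geneid index avoids relying on identical row order.
    return pd.concat(columns, axis=1)


# Toy demonstration with two made-up files (names and values are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    toy = {
        "PPMI.1001.BL.toy.txt": [5, 0],
        "PPMI.1002.BL.toy.txt": [7, 3],
    }
    for name, values in toy.items():
        lines = ["# toy comment line", "Geneid\tLength\tsample.bam"]
        lines += [f"gene{i}\t100\t{v}" for i, v in enumerate(values)]
        (folder / name).write_text("\n".join(lines) + "\n")
    toy_matrix = build_count_matrix(folder, sorted(toy))

print(toy_matrix)
```

Joining on the `Geneid` index, rather than concatenating by position, is a design choice of this sketch: it stays correct even if two files list the genes in a different order.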

<a id="preprocessing"></a>
## Data Preprocessing 

Conduct exploratory data analysis to gain insights into the dataset. Generate visualizations, compute summary statistics, and analyze relationships between variables. Include clear descriptions and captions for each plot.

In [2]:
# reading the file
path2 = Path("/home/znazari/ppmi_files/data_preprocessing/data") # where data will be saved at the end.
read_ir3_counts = pd.read_csv(path2/"matrix_ir3_counts_bl.csv")
# setting the geneid as indexing column
read_ir3_counts.set_index('Geneid', inplace=True)
# result
read_ir3_counts

| Geneid | 10874 | 12499 | 12593 | 13039 | 13424 | 14281 | 14331 | 14426 | 15761 | 16580 | ... | 60024 | 60036 | 60095 | 60171 | 65002 | 65003 | 65005 | 65008 | 70239 | 85242 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000000003.14 | 29 | 18 | 32 | 11 | 18 | 20 | 9 | 15 | 23 | 7 | ... | 20 | 28 | 6 | 27 | 19 | 17 | 44 | 32 | 10 | 5 |
| ENSG00000000005.5 | 3 | 0 | 9 | 0 | 0 | 9 | 0 | 0 | 5 | 0 | ... | 0 | 1 | 0 | 0 | 2 | 2 | 1 | 14 | 0 | 0 |
| ENSG00000000419.12 | 877 | 704 | 574 | 820 | 677 | 310 | 863 | 1252 | 520 | 631 | ... | 699 | 707 | 550 | 501 | 1049 | 841 | 1438 | 431 | 513 | 314 |
| ENSG00000000457.13 | 1686 | 1781 | 1704 | 1682 | 1656 | 695 | 1431 | 1778 | 1232 | 1034 | ... | 1670 | 1651 | 1331 | 1013 | 1288 | 1732 | 2057 | 710 | 1275 | 858 |
| ENSG00000000460.16 | 574 | 570 | 551 | 527 | 669 | 236 | 423 | 555 | 312 | 347 | ... | 531 | 571 | 446 | 304 | 431 | 550 | 542 | 300 | 397 | 196 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ENSG00000285990.1 | 1 | 2 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ENSG00000285991.1 | 3 | 2 | 5 | 0 | 8 | 3 | 3 | 0 | 1 | 2 | ... | 0 | 0 | 1 | 0 | 0 | 3 | 1 | 8 | 2 | 1 |
| ENSG00000285992.1 | 7 | 0 | 5 | 1 | 1 | 3 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 3 | 1 | 0 |
| ENSG00000285993.1 | 3 | 0 | 7 | 1 | 1 | 2 | 0 | 0 | 3 | 0 | ... | 0 | 3 | 0 | 0 | 0 | 3 | 1 | 6 | 0 | 1 |
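
As a first summary statistic on a matrix like this, one could look at the total counts per patient and at how often each gene has a zero count. The sketch below is only an illustration: it uses the first three genes and the first three patients from the output above, and the variable names `library_size` and `zero_fraction` are hypothetical.

```python
import pandas as pd

# Toy matrix: the first three genes and patients from the output above.
counts = pd.DataFrame(
    {"10874": [29, 3, 877], "12499": [18, 0, 704], "12593": [32, 9, 574]},
    index=pd.Index(
        ["ENSG00000000003.14", "ENSG00000000005.5", "ENSG00000000419.12"],
        name="Geneid",
    ),
)

# Total counts per patient (column sums).
library_size = counts.sum(axis=0)
# Share of patients with a zero count, per gene (row-wise mean of a boolean mask).
zero_fraction = (counts == 0).mean(axis=1)

print(library_size)
print(zero_fraction)
```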


The patient IDs need to be in the same order as the columns of the BL matrix that we created previously.
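
One possible way to enforce this ordering is to reindex the patient table by the matrix columns. This is a minimal sketch, not the notebook's own method: `matrix` and `meta` are hypothetical stand-ins, and it assumes the patient IDs are strings on both sides.

```python
import pandas as pd

# Hypothetical stand-ins: a count matrix with patient IDs as columns ...
matrix = pd.DataFrame(
    {"3002": [1, 2], "3000": [3, 4], "3001": [5, 6]},
    index=["geneA", "geneB"],
)
# ... and a patient table listed in a different order.
meta = pd.DataFrame(
    {"PATNO": ["3000", "3001", "3002"], "COHORT": ["2", "1", "1"]}
)

# Reindex the patient table by PATNO so its rows follow the matrix column order.
aligned = meta.set_index("PATNO").reindex(matrix.columns)

# A patient missing from the table would show up as a NaN row; fail early instead.
assert not aligned["COHORT"].isna().any()
print(aligned)
```

If the IDs were integers in one table and strings in the other, `reindex` would find no matches and return only NaN rows, which is why the check after it is useful.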

In [5]:
path1 = Path("/scratch/znazari/PPMI_ver_sep2022/study_data/Subject_Characteristics/")
mypath = Path("/scratch/znazari/PPMI_ver_sep2022/study_data/Lab_Collection_Procedures/")
# Read without a header (values are kept as strings), then promote the first row to column names.
cno = pd.read_csv(mypath/"Laboratory_Procedures_with_Elapsed_Times.csv", header=None)
cno2 = cno.rename(columns=cno.iloc[0]).drop(cno.index[0]).reset_index(drop=True)
cno2

|  | REC_ID | QUERY | CNO | PATNO | EVENT_ID | PAG_NAME | INFODT | LMDT | LMTM | FASTSTAT | ... | PLASBFCT | DurUT1TM | DurUT1SPNTM | DurUT1FFTM | DurPLASTM | DurPLASSPNTM | DurPLASFFTM | DurBLDSERTM | DurSERSPNTM | DurSERFFTM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 272495701 |  | 001 | 3000 | BL | LAB | 02/2011 | 02/2011 | 12:00:00 |  | ... |  | -194 | -3 | -22 | -77 | -15 | -39 | -77 | -15 | -39 |
| 1 | 269559601 |  | 001 | 3000 | SC | LAB | 01/2011 | 01/2011 | 12:00:00 |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 288345201 |  | 001 | 3000 | V01 | LAB | 04/2011 | 04/2011 | 07:30:00 |  | ... |  |  |  |  | -103 | -21 | -65 | -104 | -20 | -64 |
| 3 | 308429301 |  | 001 | 3000 | V02 | LAB | 08/2011 | 08/2011 | 12:00:00 |  | ... |  | -190 | -6 | -24 | -178 | -35 | -62 | -177 | -36 | -63 |
| 4 | 319985501 |  | 001 | 3000 | V03 | LAB | 11/2011 | 11/2011 | 11:45:00 |  | ... |  |  |  |  | -98 | -25 | -49 | -98 | -25 | -49 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15276 | 3d6d0417-2d2e-4657-9fdf-cf5fd12dd227 |  | 289 | 160040 | BL | LAB | 07/2022 | 07/2022 | 09:25:00 | 2 | ... | 1 | -55 | -15 | -35 | 0 | -55 | -80 | 0 | -55 | -80 |
| 15277 | c90a93fa-be41-4763-9cf9-5258b45968eb |  | 120 | 160231 | BL | LAB | 08/2022 | 08/2022 | 05:00:00 | 2 | ... | 1 | -285 | -15 | -30 | -290 | -25 | -40 | -290 | -25 | -40 |
| 15278 | 0bc811a8-f682-426e-a968-bd1589d2bbcf |  | 289 | 161236 | BL | LAB | 07/2022 | 07/2022 | 06:30:00 | 2 | ... | 1 | -140 | -85 | -100 | -180 | -30 | -60 | -180 | -30 | -60 |
| 15279 | 5de0cc42-b0ff-4db4-a680-f29f69b477ba |  | 17 | 162140 | BL | LAB | 08/2022 | 08/2022 | 20:00:00 | 1 | ... | 1 | -850 | -54 | -80 | -942 | -40 | -63 | -942 | -40 | -63 |
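
The `header=None` plus `rename` pattern used above has a side effect worth knowing: because the header row is read as data, every column holds strings, so a code such as `001` keeps its leading zeros. The sketch below illustrates this on a two-row extract built from the values shown above; the in-memory CSV and the names `promoted` and `typed` are hypothetical.

```python
import io

import pandas as pd

# Two-row extract using values from the table above (in-memory, for illustration).
raw = "CNO,PATNO,EVENT_ID\n001,3000,BL\n001,3000,SC\n"

# Pattern used in this notebook: read without a header, then promote row 0.
no_header = pd.read_csv(io.StringIO(raw), header=None)
promoted = (
    no_header.rename(columns=no_header.iloc[0])
    .drop(no_header.index[0])
    .reset_index(drop=True)
)

# Default parsing infers numeric columns instead, so "001" becomes the integer 1.
typed = pd.read_csv(io.StringIO(raw))

print(promoted["CNO"].tolist())
print(typed["CNO"].tolist())
```

This is also why comparisons against these tables use string literals such as `"0"` rather than the integer `0`.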


In [28]:
diago=pd.read_csv(path1/"Participant_Status.csv", header=None )
diago1=diago.rename(columns=diago.iloc[0]).drop(diago.index[0]).reset_index(drop=True)
diago1

|  | PATNO | COHORT | COHORT_DEFINITION | ENROLL_DATE | ENROLL_STATUS | STATUS_DATE | ENROLL_AGE | INEXPAGE | AV133STDY | PPMI_ONLINE_ENROLL | ... | COMMENTS | CONDATE | ENRLPINK1 | ENRLPRKN | ENRLSRDC | ENRLHPSM | ENRLRBD | ENRLLRRK2 | ENRLSNCA | ENRLGBA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3000 | 2 | Healthy Control | 02/2011 | enrolled | 05/2021 | 69.1 | INEXHC | 0 | NO | ... |  | 06/2021 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3001 | 1 | Parkinson's Disease | 03/2011 | enrolled | 09/2021 | 65.1 | INEXPD | 0 | NO | ... |  | 06/2021 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3002 | 1 | Parkinson's Disease | 03/2011 | enrolled | 09/2021 | 67.6 | INEXPD | 0 | NO | ... |  | 06/2021 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3003 | 1 | Parkinson's Disease | 04/2011 | enrolled | 01/2022 | 56.7 | INEXPD | 0 | NO | ... |  | 06/2021 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3004 | 2 | Healthy Control | 04/2011 | enrolled | 01/2022 | 59.4 | INEXHC | 0 | YES | ... |  | 06/2021 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2346 | 162994 | 1 | Parkinson's Disease |  | screened | 07/2022 |  | INEXPD | 0 | NO | ... |  |  | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2347 | 163265 | 1 | Parkinson's Disease |  | screened | 07/2022 |  | INEXPD | 0 | YES | ... |  |  | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2348 | 164900 | 1 | Parkinson's Disease |  | screened | 07/2022 |  | INEXPD | 0 | NO | ... |  |  | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2349 | 167222 | 1 | Parkinson's Disease |  | screened | 08/2022 |  | INEXPD | 0 | NO | ... |  |  | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |


In [34]:
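# Keep only participants who are not flagged for any of these genes:
# SNCA (ENRLSNCA), GBA (ENRLGBA), LRRK2 (ENRLLRRK2).
# The flags are compared with the string "0" because the table was read with header=None.
filtered_SNCA = diago1[(diago1['ENRLSNCA'] == "0") & (diago1['ENRLGBA'] == "0") & (diago1['ENRLLRRK2'] == "0")]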
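
The same filter can be written over a list of flag columns, which makes it easier to add or drop a gene later. This is a sketch under assumptions: the `status` table, its `PATNO` values, and the names `flag_columns` and `keep` are hypothetical, and all values are strings as in `diago1`.

```python
import pandas as pd

# Hypothetical participant table; all values are strings, as in diago1 above.
status = pd.DataFrame(
    {
        "PATNO": ["3000", "3001", "4001", "4002"],
        "ENRLSNCA": ["0", "0", "1", "0"],
        "ENRLGBA": ["0", "0", "0", "0"],
        "ENRLLRRK2": ["0", "0", "0", "1"],
    }
)

flag_columns = ["ENRLSNCA", "ENRLGBA", "ENRLLRRK2"]

# Keep a row only if every flag equals the string "0".
keep = (status[flag_columns] == "0").all(axis=1)
filtered = status[keep]

print(f"kept {keep.sum()} of {len(status)} participants")
print(filtered["PATNO"].tolist())
```

Printing how many rows survive the filter is a cheap check that the string comparison matched as intended; a result of zero kept rows would suggest the flags are stored as integers.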

<a id="training"></a>
## Model Training 

Build and train machine learning models on the prepared data. Explain the choice of models, feature engineering techniques, and hyperparameter tuning. Provide code and comments to walk through the model training process.

<a id="results"></a>
## Results and Evaluation 

Present the results of the trained models, including performance metrics, accuracy, or any relevant evaluation measures. Interpret the findings and discuss the implications. Include visualizations or tables to support the results.