# <span style="color:#8B4513;">Preprocessing RNA-Seq Data (PPMI)</span>


[<span style="color:#8B4513;">Author: **Zainab Nazari**</span>](mailto:z.nazari@ebri.com)
 

## Data Preprocessing STEP I
- We remove patients who carry mutations in these genes: SNCA (ENRLSNCA), GBA (ENRLGBA), LRRK2 (ENRLLRRK2).
- We keep only genes that lie in the intersection of the counts and quant files and are protein-coding genes or lncRNAs.
- We remove the duplicated gene IDs, which are also lowly expressed.
- We keep only patients with a diagnosis of Healthy Control or Parkinson's disease.
- We check whether any patients were taking dopaminergic drugs and exclude them. Dopaminergic medication can alter gene expression patterns and affect the interpretation of experimental measurements, so removing these patients gives less biased data.

## Data Preprocessing STEP II
1. We remove lowly expressed genes by keeping only genes that have more than five counts in at least 10% of the individuals, which leaves us with 21,273 genes.
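
The low-expression filter described above can be sketched as follows. This is a minimal illustration on a hypothetical toy matrix (the names `counts`, `min_count`, and `min_fraction` are assumptions, not taken from the notebook), not the exact code used to obtain the 21,273 genes.

```python
import pandas as pd

# Toy count matrix: rows are genes, columns are individuals (hypothetical values).
counts = pd.DataFrame(
    {
        "s1": [0, 12, 6, 0],
        "s2": [1, 30, 0, 0],
        "s3": [0, 25, 0, 7],
        "s4": [2, 18, 0, 0],
        "s5": [0, 40, 0, 0],
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)

min_count = 5        # a gene must exceed this count ...
min_fraction = 0.10  # ... in at least this fraction of individuals

# Fraction of individuals in which each gene has more than `min_count` counts.
fraction_expressed = (counts > min_count).mean(axis=1)

# Keep only the genes that pass the threshold.
expressed = counts.loc[fraction_expressed >= min_fraction]
print(expressed.index.tolist())
```

Here `geneA` never exceeds five counts and is dropped, while the other three genes exceed five counts in at least one of the five individuals and are kept.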





In [3]:
import pandas as pd
import numpy as np
import os
import glob
import functools
from pathlib import Path

In [5]:
# Note: the IR3 counts files take up around 152 GB and are located in the scratch area.
path_to_files="/scratch/znazari/PPMI_ver_sep2022/RNA_Seq_data/star_ir3/counts/"
path1=Path("/scratch/znazari/PPMI_ver_sep2022/RNA_Seq_data/star_ir3/counts/")
path2 = Path("/home/znazari/data") # where the output data will be saved at the end.
path3=Path("/scratch/znazari/PPMI_ver_sep2022/study_data/Subject_Characteristics/")

<a id="preprocessing"></a>
## Data Preprocessing STEP I

- We keep only individuals with a diagnosis of Healthy Control or Parkinson's disease.
- We remove patients who carry mutations in the genes SNCA, GBA, or LRRK2, and patients taking dopaminergic drugs.
- We keep only genes that lie in the intersection of the counts and quant files and are protein-coding genes or lncRNAs.
- We remove the duplicated gene IDs, which are also lowly expressed.


In [31]:
# Read the main table of gene IDs vs. individuals
read_ir3_counts = pd.read_csv(path2/"matrix_ir3_counts_bl.csv")

# Set the gene ID as the index column
read_ir3_counts.set_index('Geneid', inplace=True)

# Optionally strip the version suffix (the part after the dot) from the gene IDs
#read_ir3_counts.index =read_ir3_counts.index.str.split('.').str[0]
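
As a minimal sketch of what the commented-out version stripping does, assuming a hypothetical one-column table `df` indexed by versioned Ensembl IDs:

```python
import pandas as pd

# Hypothetical table indexed by versioned Ensembl gene IDs.
df = pd.DataFrame(
    {"3001": [13, 0, 815]},
    index=pd.Index(
        ["ENSG00000000003.14", "ENSG00000000005.5", "ENSG00000000419.12"],
        name="Geneid",
    ),
)

# Keep only the part of each ID before the first dot.
df.index = df.index.str.split(".").str[0]

# Stripping versions can make two IDs collapse to the same base ID,
# so it is worth checking uniqueness afterwards.
print(df.index.is_unique)
```

If the check reports a non-unique index, the duplicated IDs need to be handled before any label-based selection.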

In [32]:
# Read the file that contains the diagnosis of each participant
diago=pd.read_csv(path3/"Participant_Status.csv", header=None )
diago1=diago.rename(columns=diago.iloc[0]).drop(diago.index[0]).reset_index(drop=True)

# Keep only participants diagnosed with Parkinson's disease or healthy controls.
selected_diagnosis_pd_hc = diago1[diago1['COHORT'].isin(['0', '1'])]
pd_hc = selected_diagnosis_pd_hc['PATNO']

filtered_df = read_ir3_counts.loc[:, read_ir3_counts.columns.isin(pd_hc)]

In [45]:
# Read the file that lists patients with gene mutations or dopaminergic drug use
union_drugs_mutations=pd.read_csv(path2/'union_drugs_mutations.csv', index_col=0)
s_union_drugs_mutations= union_drugs_mutations['0']
s_union_drugs_mutations_str = s_union_drugs_mutations.astype(str)

In [46]:
filtered_df_filtered = filtered_df.drop(columns=s_union_drugs_mutations_str, errors='ignore')
filtered_df_filtered

Geneid,3001,3002,3003,3010,3012,3014,3020,3024,3026,3027,...,4071,4072,4075,4076,4081,4091,4102,4108,4115,4136
ENSG00000000003.14,13,87,11,14,20,7,13,17,23,52,...,20,23,60,22,18,43,20,7,8,16
ENSG00000000005.5,0,28,2,2,0,0,1,1,0,0,...,1,1,21,1,0,19,0,1,0,5
ENSG00000000419.12,815,879,855,1185,672,762,1124,719,874,1374,...,1230,975,492,528,687,468,855,555,628,426
ENSG00000000457.13,1510,1438,1593,2210,1573,1635,2078,1778,1872,3093,...,1065,1296,923,1160,1808,1223,1586,1271,1378,1037
ENSG00000000460.16,367,460,444,605,515,449,570,407,493,887,...,264,345,291,398,530,356,395,438,510,343
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000285990.1,0,1,0,0,0,0,0,1,0,0,...,0,0,3,0,0,1,0,0,0,4
ENSG00000285991.1,0,5,2,1,0,1,2,1,2,1,...,1,0,10,4,1,11,2,1,0,2
ENSG00000285992.1,2,5,2,0,0,0,1,0,0,1,...,0,1,3,5,0,3,0,1,0,5
ENSG00000285993.1,0,5,0,3,0,0,0,0,0,1,...,4,0,9,4,0,5,0,0,1,2


In [44]:

# get the duplicated indices
duplicated_indices = read_ir3_counts.index[read_ir3_counts.index.duplicated()]

# create a new dataframe with the duplicated indices
new_df = read_ir3_counts.loc[duplicated_indices]

# print the new dataframe
new_df


Geneid,10874,12499,12593,13039,13424,14281,14331,14426,15761,16580,...,60024,60036,60095,60171,65002,65003,65005,65008,70239,85242
ENSG00000002586,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000002586,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
ENSG00000124333,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000124333,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000124334,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000277120,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000280767,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000280767,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000281849,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
# Total count of each duplicated gene ID across all individuals
new_df['sum'] = new_df.sum(axis=1)

# Sort by total count, highest first, and display the result
new_df = new_df.sort_values(by='sum', ascending=False)
new_df

Geneid,10874,12499,12593,13039,13424,14281,14331,14426,15761,16580,...,60036,60095,60171,65002,65003,65005,65008,70239,85242,sum
ENSG00000168939,0,2,2,11,3,0,0,3,0,1,...,5,6,1,2,5,3,1,3,1,11433
ENSG00000002586,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,363
ENSG00000169084,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,249
ENSG00000182162,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,189
ENSG00000124333,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,183
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000225661,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000226179,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000226179,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000227159,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [64]:
# Total count of each gene across all individuals
read_ir3_counts['sum'] = read_ir3_counts.sum(axis=1)

# Sort by total count, highest first, and display the result
read_ir3_counts = read_ir3_counts.sort_values(by='sum', ascending=False)
read_ir3_counts

Geneid,10874,12499,12593,13039,13424,14281,14331,14426,15761,16580,...,60036,60095,60171,65002,65003,65005,65008,70239,85242,sum
ENSG00000274012,4842739,4523401,2585196,9615714,3454575,2096910,5610571,2056424,3424037,9379058,...,5769465,3588164,7351899,6043716,3040828,2010075,3773460,7493521,2803423,7153566704
ENSG00000251562,946869,859787,809913,800705,682514,371587,994011,1310891,895987,706537,...,748005,705478,520674,735021,846743,1341232,388549,724521,412501,1298264007
ENSG00000156508,149688,110618,68167,115096,125252,74783,182210,261729,104127,119114,...,168459,64591,95248,204670,157353,226827,67023,71463,55519,199890071
ENSG00000166710,152963,102118,67065,168057,158965,68313,151715,208818,72636,112782,...,143442,64499,90815,213597,106190,193813,68078,64908,43000,188816914
ENSG00000245532,65599,305467,102899,114589,344990,101243,46252,61346,62653,56240,...,250245,128055,45094,54774,111978,140862,53421,183106,62113,169802429
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000274760,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000266610,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000274777,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSG00000274790,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [65]:

# Select the genes whose total count is zero, without overwriting the main table
zero_count_genes = read_ir3_counts[read_ir3_counts['sum'] == 0]

zero_count_genes.shape
# To remove these genes instead, keep only the rows with a nonzero sum:
#df = df[df['sum'] != 0]

(1148, 1531)

In [26]:
is_duplicate = read_ir3_counts.index.duplicated()
duplicate_indices = read_ir3_counts.index[is_duplicate]

In [48]:
read_ir3_counts

Geneid,10874,12499,12593,13039,13424,14281,14331,14426,15761,16580,...,60024,60036,60095,60171,65002,65003,65005,65008,70239,85242
ENSG00000000003,29,18,32,11,18,20,9,15,23,7,...,20,28,6,27,19,17,44,32,10,5
ENSG00000000005,3,0,9,0,0,9,0,0,5,0,...,0,1,0,0,2,2,1,14,0,0
ENSG00000000419,877,704,574,820,677,310,863,1252,520,631,...,699,707,550,501,1049,841,1438,431,513,314
ENSG00000000457,1686,1781,1704,1682,1656,695,1431,1778,1232,1034,...,1670,1651,1331,1013,1288,1732,2057,710,1275,858
ENSG00000000460,574,570,551,527,669,236,423,555,312,347,...,531,571,446,304,431,550,542,300,397,196
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000285990,1,2,2,0,1,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
ENSG00000285991,3,2,5,0,8,3,3,0,1,2,...,0,0,1,0,0,3,1,8,2,1
ENSG00000285992,7,0,5,1,1,3,0,0,1,0,...,0,1,0,0,0,2,0,3,1,0
ENSG00000285993,3,0,7,1,1,2,0,0,3,0,...,0,3,0,0,0,3,1,6,0,1


In [None]:

# Here we delete the duplicated gene IDs: first we find them, then we drop them.
# They are duplicated and also very lowly expressed (zero, or one in rare cases).

# Flag the repeated occurrences of each index value
is_duplicate = read_ir3_counts.index.duplicated()

# Collect the duplicated index values
duplicate_indices = read_ir3_counts.index[is_duplicate]

# Drop them: every row carrying a duplicated ID is removed
# (45 duplicated IDs, 90 rows dropped)
to_be_deleted = list(duplicate_indices)
read_ir3_counts = read_ir3_counts.drop(to_be_deleted)
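
To make the behaviour of the duplicate removal explicit, here is a minimal sketch on a hypothetical toy table (`toy` and its values are assumptions). It shows that `Index.duplicated()` flags only the later copies by default, while dropping by label removes every row with that label, which is equivalent to filtering with `keep=False`.

```python
import pandas as pd

# Hypothetical table in which the gene ID "g2" appears twice.
toy = pd.DataFrame(
    {"s1": [5, 0, 1, 9], "s2": [7, 0, 0, 3]},
    index=pd.Index(["g1", "g2", "g2", "g3"], name="Geneid"),
)

# Default keep="first": only the second and later occurrences are flagged.
later_copies = toy.index[toy.index.duplicated()]

# keep=False flags every row whose label occurs more than once.
all_copies_mask = toy.index.duplicated(keep=False)

# Dropping by label removes every row carrying that label ...
dropped_by_label = toy.drop(later_copies.unique())
# ... which matches filtering with the keep=False mask.
dropped_by_mask = toy[~all_copies_mask]

print(dropped_by_label.index.tolist())
```

This is consistent with the note above that 45 duplicated IDs lead to 90 dropped rows when each ID occurs exactly twice.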

In [6]:

# we read the file where we have an intersection of geneIDs in IR3, counts, quant
intersect = pd.read_csv(path2/"intersect_IR3_ENG_IDs_LincRNA_ProtCoding_counts_quant_gene_transcript_only_tot_intsersect.txt")
intersection = read_ir3_counts.index.intersection(intersect['[IR3_gene_counts] and [IR3_quant_gene] and [IR3_quant_trans] and [lncRNA+ProtCod]: '])
filtered_read_ir3_counts = read_ir3_counts.loc[intersection]

In [10]:
len(intersection)

30448
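
The number above is the size of the index intersection. A minimal sketch of the same pattern on hypothetical toy IDs (`toy_counts` and `allowed_ids` are illustrative names, not files from this project):

```python
import pandas as pd

# Hypothetical count table indexed by gene ID.
toy_counts = pd.DataFrame(
    {"s1": [3, 0, 8, 1]},
    index=pd.Index(["g1", "g2", "g3", "g4"], name="Geneid"),
)

# Hypothetical list of allowed gene IDs (e.g. protein coding + lncRNA).
allowed_ids = pd.Series(["g3", "g1", "g9"])

# IDs present in both the table index and the allowed list.
shared = toy_counts.index.intersection(allowed_ids)

# Restrict the table to the shared IDs.
toy_filtered = toy_counts.loc[shared]
print(len(shared))
```

IDs that appear only in the allowed list (`g9` here) are silently ignored, so comparing `len(shared)` with the length of the list is a quick way to see how many IDs were not found.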

In [None]:


# we read the file where we have an intersection of geneIDs in IR3, counts, quant
intersect = pd.read_csv(path2/"intersect_IR3_ENG_IDs_LincRNA_ProtCoding_counts_quant_gene_transcript_only_tot_intsersect.txt")
intersection = read_ir3_counts.index.intersection(intersect['[IR3_gene_counts] and [IR3_quant_gene] and [IR3_quant_trans] and [lncRNA+ProtCod]: '])
filtered_read_ir3_counts = read_ir3_counts.loc[intersection]

# reading the file which contains diagnosis
diago=pd.read_csv(path3/"Participant_Status.csv", header=None )
diago1=diago.rename(columns=diago.iloc[0]).drop(diago.index[0]).reset_index(drop=True)

# Remove patients enrolled with a mutation in any of these genes: SNCA (ENRLSNCA), GBA (ENRLGBA), LRRK2 (ENRLLRRK2)
filtered_SNCA_GBA_LRRK2 = diago1[(diago1['ENRLSNCA'] == "0") & (diago1['ENRLGBA'] == "0") & (diago1['ENRLLRRK2'] == "0")]

# Patients with their diagnosis
patinets_diagnosis = filtered_SNCA_GBA_LRRK2[['PATNO','COHORT_DEFINITION']].reset_index(drop=True)

# Define the particular names to keep
names_to_keep = ['Healthy Control', "Parkinson's Disease"]


# Filter the dataframe based on the specified names
PK_HC_pateints = patinets_diagnosis[patinets_diagnosis['COHORT_DEFINITION'].isin(names_to_keep)]

# Get the list of patient IDs with diagnosis from the second dataframe
patient_ids_with_diagnosis = PK_HC_pateints['PATNO']
list_patients=list(patient_ids_with_diagnosis)

# Filter the columns in the first dataframe based on patient IDs with diagnosis
rna_filtered = filtered_read_ir3_counts.filter(items=list_patients)

# Read the file containing the IDs of patients who were taking dopaminergic drugs and need to be excluded.
patient_dopomine = pd.read_csv(path2/'Patient_IDs_taking_dopamine_drugs.txt',delimiter='\t',  header=None)
patient_dopomine = patient_dopomine.rename(columns={0: 'Pateint IDs'})
ids_to_remove = patient_dopomine['Pateint IDs'].tolist() # put the patient IDs into a list
strings = [str(num) for num in ids_to_remove] # convert them to strings

# Keep every column whose name is not exactly one of the IDs in `strings`.
# An exact comparison is used rather than a substring test, so that a short ID
# such as "300" cannot accidentally match a longer column name such as "3001".
new_columns = [col for col in rna_filtered.columns if col not in strings]
rna_filtered = rna_filtered[new_columns]
# In our case no column (no patient in this drug list) had to be excluded.
# IF ANY PATIENTS ARE REMOVED HERE, the diagnosis file below needs to be amended too.

rna_filtered.to_csv(path2/'ir3_rna_step1.csv', index=True)

# Keep only the patients that are common to the two dataframes:
common_patient_ids = list(set(PK_HC_pateints['PATNO']).intersection(rna_filtered.columns))
patient11_filtered = PK_HC_pateints[PK_HC_pateints['PATNO'].isin(common_patient_ids)]
patient11_filtered = patient11_filtered.reset_index(drop=True)

# we save the output into data folder
patient11_filtered.to_csv(path2/'patients_HC_PK_diagnosis.csv', index=False)
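
When excluding patient columns by ID, it matters whether IDs are compared exactly or as substrings. The following minimal sketch uses hypothetical column names and a hypothetical exclusion list to show the difference:

```python
# Hypothetical column names (patient IDs stored as strings) and IDs to exclude.
columns = ["3001", "3002", "13001", "4071"]
ids_to_remove = [str(i) for i in [3001]]

# Substring test: also drops "13001", because "3001" occurs inside it.
kept_substring = [c for c in columns if not any(s in c for s in ids_to_remove)]

# Exact test: drops only the column named exactly "3001".
exclude = set(ids_to_remove)
kept_exact = [c for c in columns if c not in exclude]

print(kept_substring)
print(kept_exact)
```

Because PPMI patient numbers vary in length, the exact comparison is the safer choice for this kind of filter.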