# Proteomics CSF Data from PPMI 

**Cerebrospinal fluid (CSF)** is a clear fluid that surrounds the brain and spinal cord. It cushions the brain and helps remove waste products. CSF proteomics measures the proteins present in this fluid.

- We organise CSF proteomic data from patients with Parkinson's disease and healthy controls. The data are split across 7 files named `Project_151_pQTL_in_CSF_{}_of_7_Batch_Corrected_.csv`, which we combine into a single DataFrame and save at the end.

- The collected CSF can be analyzed to look for various markers, proteins, or other substances that may be indicative of neurological conditions, including Parkinson's disease.



In [1]:
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import LabelEncoder
from datetime import datetime

In [2]:
# Path (directory) to the data
path = Path("/home/znazari/data/open_proteomic/")

In [3]:
# Template for the batch file names; {} is replaced by the batch index
base_file_name = "Project_151_pQTL_in_CSF_{}_of_7_Batch_Corrected_.csv"

# Number of files
num_files = 7

# List to store DataFrames
dfs = []

# Loop through the file indices and read each file
for file_index in range(1, num_files + 1):
    file_name = base_file_name.format(file_index)
    file_path = path / file_name
    
    # Check if the file exists before attempting to read it
    if file_path.is_file():
        # Read the CSV file and append it to the list
        df = pd.read_csv(file_path, delimiter=',')
        dfs.append(df)
    else:
        print(f"File {file_name} not found.")
        
# Concatenate all batches into one DataFrame (raises ValueError if no file was read)
result_df = pd.concat(dfs, ignore_index=True)

# Filter out patients diagnosed as Prodromal
result_df = result_df[result_df['COHORT'] != 'Prodromal']
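
The loading loop above prints a message for each missing file but carries on, so a partly missing dataset can go unnoticed. A minimal sketch of one way to make this explicit, assuming a hypothetical helper `load_batches` and toy file names that are not part of the original notebook:

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_batches(directory, template, num_files):
    """Read every available batch file; return (combined DataFrame, missing file names)."""
    frames, missing = [], []
    for index in range(1, num_files + 1):
        file_path = Path(directory) / template.format(index)
        if file_path.is_file():
            frames.append(pd.read_csv(file_path))
        else:
            missing.append(file_path.name)
    if not frames:
        # Fail early with a clear message instead of letting pd.concat fail on an empty list
        raise FileNotFoundError(f"No batch files found in {directory}")
    return pd.concat(frames, ignore_index=True), missing


# Toy demonstration: only batches 1 and 3 of 3 hypothetical files exist
with tempfile.TemporaryDirectory() as tmp:
    template = "toy_batch_{}_of_3.csv"  # hypothetical file name pattern
    for i in (1, 3):
        toy_batch = pd.DataFrame({"PATNO": [i], "COHORT": ["PD"]})
        toy_batch.to_csv(Path(tmp) / template.format(i), index=False)
    combined, missing = load_batches(tmp, template, num_files=3)

print(len(combined), missing)
```

Returning the list of missing file names lets the caller decide whether to continue with a partial dataset or stop.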

In [4]:
# One row per patient with their diagnosis
patient_diagnosis_df = result_df[['PATNO', 'COHORT']].drop_duplicates(subset='PATNO').copy()

# Encode 'COHORT' as integers; LabelEncoder numbers the classes in sorted order
label_encoder = LabelEncoder()
patient_diagnosis_df['COHORT'] = label_encoder.fit_transform(patient_diagnosis_df['COHORT'])

# Reset the index and store patient numbers as strings
new_diagnosis = patient_diagnosis_df.reset_index(drop=True)
new_diagnosis['PATNO'] = new_diagnosis['PATNO'].astype(str)

# Pivot from long to wide format: one row per patient, one column per protein (TESTNAME)
new_patients_csf = result_df.pivot(index='PATNO', columns='TESTNAME', values='TESTVALUE').reset_index()
new_patients_csf['PATNO'] = new_patients_csf['PATNO'].astype(str)

# Merge the diagnosis with the proteomic measurements
final_proteomic_patients = pd.merge(new_diagnosis, new_patients_csf, how='inner', on='PATNO')

# Save the final DataFrame
final_proteomic_patients.to_csv(path / 'final_proteomic_patients.csv', index=False)
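
`DataFrame.pivot` raises a `ValueError` when the same (`PATNO`, `TESTNAME`) pair occurs more than once, for example if a patient was measured at several visits. The sketch below, on made-up toy data, checks for such duplicates first and falls back to `pivot_table`. Averaging the repeated values is an assumption made for illustration, not something the notebook does.

```python
import pandas as pd

# Toy long-format data; all values are made up for illustration
toy = pd.DataFrame({
    "PATNO": [1, 1, 2, 2, 2],
    "TESTNAME": ["A", "B", "A", "B", "B"],
    "TESTVALUE": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Count rows that repeat an earlier (PATNO, TESTNAME) pair
n_duplicates = int(toy.duplicated(subset=["PATNO", "TESTNAME"]).sum())

if n_duplicates == 0:
    # Safe to reshape directly
    wide = toy.pivot(index="PATNO", columns="TESTNAME", values="TESTVALUE")
else:
    # Assumption: repeated measurements may be averaged
    wide = toy.pivot_table(index="PATNO", columns="TESTNAME",
                           values="TESTVALUE", aggfunc="mean")

print(n_duplicates)
print(wide)
```

If averaging is not appropriate for the analysis, the duplicate rows could instead be inspected and filtered (for instance by visit) before pivoting.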


In [5]:
final_proteomic_patients

Unnamed: 0,PATNO,COHORT,10000-28_3,10001-7_3,10003-15_3,10006-25_3,10008-43_3,10009-2_3,10010-10_3,10011-65_3,...,9984-12_3,9986-14_3,9987-30_3,9989-12_3,9991-112_3,9993-11_3,9994-217_3,9995-6_3,9997-12_3,9999-1_3
0,53595,1,6.733870,7.572464,6.335528,6.302477,6.008085,5.829655,6.058751,7.902585,...,6.124683,9.322580,5.876997,6.274316,7.365320,6.240341,7.130272,9.185032,8.802108,8.985598
1,3029,0,6.881810,7.698871,6.235733,6.397376,5.898213,5.814846,6.204347,7.770759,...,6.000616,9.258282,5.840534,6.320405,7.183427,6.182855,6.954768,9.604112,8.947430,9.358625
2,3963,1,6.843277,7.940987,6.253929,6.248572,5.881750,5.813261,6.203654,7.828046,...,6.169101,8.977272,5.852367,6.352225,6.882810,6.252332,6.960120,9.578161,8.595830,9.151849
3,3316,0,6.833672,8.154299,6.226679,6.417975,5.950047,5.731352,6.329086,7.940774,...,6.054259,8.856002,5.765146,6.259030,7.287134,6.254280,7.052670,9.688273,9.096634,9.614637
4,3124,1,6.685086,7.365837,6.227807,6.318911,6.109293,5.822314,6.023443,7.706822,...,6.186214,8.766998,5.939028,6.361248,7.139187,6.364042,7.523021,8.960747,8.346206,9.197527
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
798,3812,0,7.014804,7.469007,6.306823,6.314757,5.893445,6.001855,6.143816,7.752982,...,6.270356,8.784713,6.170978,6.270834,6.675186,6.919316,7.224437,8.986365,8.306326,9.158217
799,3817,0,6.787784,7.847977,6.257174,6.412752,5.943464,5.825788,6.183518,7.829379,...,6.128340,9.528539,5.840993,6.326199,7.422274,6.160604,7.063162,9.438577,8.590444,9.395105
800,3792,1,6.881702,7.962447,6.224932,6.268376,6.019198,5.847903,6.152021,7.931321,...,6.222008,9.231704,6.029250,6.332100,7.402976,6.506793,6.756406,9.300107,8.453467,9.147362
801,3793,1,6.906777,7.918118,6.138533,6.500583,5.863784,5.778317,6.155346,7.687791,...,6.098178,9.763621,5.657412,6.234351,6.923669,6.299240,7.255767,9.800064,9.033998,9.623652
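
The `COHORT` column above holds integer codes rather than cohort names. `LabelEncoder` numbers the classes in sorted order, so the mapping can be recovered from the sorted labels. The sketch below uses `pd.factorize(sort=True)` as a pandas-only stand-in that follows the same sorted ordering for string labels; the cohort names `"PD"` and `"Control"` are assumed for illustration.

```python
import pandas as pd

# Assumed cohort names, for illustration only
cohort = pd.Series(["PD", "Control", "PD", "Control", "PD"])

# sort=True assigns codes by sorted label order, as LabelEncoder does
codes, classes = pd.factorize(cohort, sort=True)

# Map each integer code back to its label
code_to_label = dict(enumerate(classes))

print(code_to_label)
print(codes.tolist())
```

In the notebook itself, `dict(enumerate(label_encoder.classes_))` should give the corresponding mapping for the fitted encoder.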


In [6]:
# Get the current date
current_date = datetime.now().date()

# Print the current date
print("Last update :", current_date)

Last update : 2024-02-23
