# Proteomics CSF Data from PPMI 

**Cerebrospinal fluid (CSF)** is a clear fluid that surrounds the brain and spinal cord. It cushions the brain and helps remove waste products. CSF proteomics measures the proteins present in this fluid.

- We organise CSF proteomic data from Parkinson's disease patients and healthy controls. The data are split across 7 files named `Project_151_pQTL_in_CSF_{}_of_7_Batch_Corrected_.csv`, which we combine into a single DataFrame and save at the end.

- The collected CSF can be analyzed to look for various markers, proteins, or other substances that may be indicative of neurological conditions, including Parkinson's disease.
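
The combining and reshaping described above can be sketched on toy data. This is a minimal illustration only: the two small batches, the patient numbers, the protein names, the cohort labels and the values are all invented. Only the column names `PATNO`, `COHORT`, `TESTNAME` and `TESTVALUE` are taken from the code in this notebook.

```python
import pandas as pd

# Two toy "batch" tables in long format (hypothetical values, not PPMI data)
batch_1 = pd.DataFrame({
    "PATNO": [1001, 1001, 1002, 1002],
    "COHORT": ["PD", "PD", "Control", "Control"],
    "TESTNAME": ["protein_A", "protein_B", "protein_A", "protein_B"],
    "TESTVALUE": [0.5, 1.2, 0.7, 0.9],
})
batch_2 = pd.DataFrame({
    "PATNO": [1001, 1002],
    "COHORT": ["PD", "Control"],
    "TESTNAME": ["protein_C", "protein_C"],
    "TESTVALUE": [2.1, 1.8],
})

# Stack the batches row-wise into one long table
long_df = pd.concat([batch_1, batch_2], ignore_index=True)

# Reshape to wide format: one row per patient, one column per protein
wide_df = long_df.pivot(index="PATNO", columns="TESTNAME", values="TESTVALUE").reset_index()
wide_df.columns.name = None  # drop the leftover "TESTNAME" axis label

print(wide_df)
```

Each patient ends up on a single row, which is the shape most modelling code expects. Note that `pivot` raises an error if the same patient and protein pair appears more than once.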



In [2]:
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import LabelEncoder
from datetime import datetime

In [2]:
# Path (directory) to the data
path = Path("/home/znazari/data/open_proteomic/")

In [3]:
# Base file name; {} is replaced by the file index (1 to 7)
base_file_name = "Project_151_pQTL_in_CSF_{}_of_7_Batch_Corrected_.csv"

# Number of files
num_files = 7

# List to store DataFrames
dfs = []

# Loop through the file indices and read each file
for file_index in range(1, num_files + 1):
    file_name = base_file_name.format(file_index)
    file_path = path / file_name
    
    # Check if the file exists before attempting to read it
    if file_path.is_file():
        # Read the CSV file and append it to the list
        df = pd.read_csv(file_path, delimiter=',')
        dfs.append(df)
    else:
        print(f"File {file_name} not found.")
        
# Concatenate the files into one long-format DataFrame
result_df = pd.concat(dfs, ignore_index=True)

# Filter out patients diagnosed as Prodromal
result_df = result_df[result_df['COHORT'] != 'Prodromal']

In [4]:
# Patients with their diagnosis 
patient_diagnosis_df = result_df[['PATNO', 'COHORT']]

# Drop duplicate rows based on 'PATNO'
unique_patient_diagnosis_df = patient_diagnosis_df.drop_duplicates(subset='PATNO')
unique_patient_diagnosis_dff = unique_patient_diagnosis_df.copy()

# Pivot from long to wide format: one row per patient, one column per protein
# (pivot raises a ValueError if a PATNO/TESTNAME pair occurs more than once)
patients_csf = result_df.pivot(index='PATNO', columns='TESTNAME', values='TESTVALUE')

# Move PATNO from the index into a regular column
new_patients_csf = patients_csf.reset_index()

# Convert int patient numbers to str
new_patients_csf['PATNO'] = new_patients_csf['PATNO'].astype(str)

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'COHORT' column
label = label_encoder.fit_transform(unique_patient_diagnosis_df['COHORT'])

# Replace the cohort names with their integer codes
# (LabelEncoder numbers the classes in sorted order; see label_encoder.classes_)
unique_patient_diagnosis_dff['COHORT'] = label

# Reset the indices
new_diagnosis = unique_patient_diagnosis_dff.reset_index(drop=True)

# Convert int patient numbers to str so the merge keys have the same type
new_diagnosis['PATNO'] = new_diagnosis['PATNO'].astype(str)

# Merge the proteomic dataframe with diagnosis
final_proteomic_patients = pd.merge(new_diagnosis, new_patients_csf, how='inner', on='PATNO')

final_proteomic_patients.to_csv(path/'final_proteomic_patients.csv', index=False)
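
Because the `COHORT` column in the saved file holds integer codes, it can help to record which code stands for which cohort. `LabelEncoder` numbers the classes in sorted order and exposes them as `classes_`. The sketch below reproduces that ordering with pandas alone. The cohort names `PD` and `Control` are assumptions for illustration; the actual values in the PPMI files may differ.

```python
import pandas as pd

# Hypothetical cohort labels; the real COHORT values may be spelled differently
cohorts = pd.Series(["PD", "Control", "PD", "Control", "PD"])

# Sorted unique classes, the same ordering LabelEncoder uses for string labels
classes = sorted(cohorts.unique())

# Integer code for each class, then the encoded column
code_of = {name: code for code, name in enumerate(classes)}
encoded = cohorts.map(code_of)

print(code_of)
print(encoded.tolist())
```

With the notebook's own encoder, the same mapping can be read from `label_encoder.classes_` after fitting: the position of each name in that array is its code.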


In [3]:
# Get the current date
current_date = datetime.now().date()

# Print the current date
print("Last update :", current_date)

Last update : 2024-02-05
