# <span style="color:#8B4513;">Table of Ensembl Gene IDs (ENSG) vs. Patient Numbers from PPMI @ LONI</span>

- For each patient's baseline (BL) visit, the final column (the counts) is extracted from that patient's file in the 'IR3/counts' folder.

- Data source: Project 133 RNA Sequencing Feature Counts/TPM (IR3/B38/Phases 1-2, version 2021-04-02).

In [2]:
import dask.dataframe as dd
import pandas as pd
import numpy as np
import os
import glob
import functools
from pathlib import Path
import time
from datetime import datetime

In [2]:
# Note: the counts data in IR3 total about 152 GB, and the files are located in the scratch area.

path1 = Path("/scratch/znazari/PPMI_ver_sep2022/RNA_Seq_data/star_ir3/counts/")
path2 = Path("/home/znazari/data")  # where the output data will be saved at the end
path3 = Path("/scratch/znazari/PPMI_ver_sep2022/study_data/Subject_Characteristics/")


<a id="matrixcreation"></a>
## Matrix of Gene IDs and Counts for Patients
Load the data from the IR3/counts folder and extract the last column (the counts) from each patient's file for the BL visit.
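The file-selection step can be illustrated on made-up file names before touching the real folder. This is a minimal sketch, not the notebook's method: the names in `example_names` and the helper `parse_name` are hypothetical, and the layout "prefix.patient.visit..." is an assumption (the notebook only relies on the patient ID being the second dot-separated token). It shows how a substring test for "BL" can pick up an unrelated file, while comparing the visit token directly does not.

```python
# Hypothetical file names for illustration only; real PPMI names may differ.
example_names = [
    "PPMI-Phase1-IR3.3001.BL.sampleA.featureCounts.txt",
    "PPMI-Phase1-IR3.3001.V02.sampleB.featureCounts.txt",
    "PPMI-Phase2-IR3.3002.BL.sampleC.featureCounts.txt",
    "PPMI-Phase2-IR3.BLX9.V04.sampleD.featureCounts.txt",
]

def parse_name(fname):
    """Return (patient_id, visit), assuming a 'prefix.patient.visit...' layout."""
    tokens = fname.split(".")
    return tokens[1], tokens[2]

# Substring test: also matches the last name, whose patient token contains "BL".
substring_hits = [n for n in example_names if "BL" in n]

# Token test: keeps only files whose visit token is exactly "BL".
token_hits = [n for n in example_names if parse_name(n)[1] == "BL"]
patient_ids = [parse_name(n)[0] for n in token_hits]

# Each patient should contribute at most one BL column to the matrix.
has_duplicates = len(set(patient_ids)) != len(patient_ids)

print(len(substring_hits), len(token_hits), patient_ids, has_duplicates)
```

If the real file names follow a different layout, only `parse_name` would need to change; the duplicate check is worth keeping either way, since duplicate patient IDs would produce duplicate column names in the matrix.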

In [5]:
# Get all file names in the folder
all_files = [file.name for file in path1.glob('*')]

# Filter the files that contain "BL" in their names
bl_files2 = [file for file in all_files if "BL" in file]

# Convert to a DataFrame
bl_files = pd.DataFrame(bl_files2)

# Define a function that returns the second dot-separated token of a file name.
# That token is the patient ID, so this function extracts patient IDs from file names.
def function_names(fname):
    tokens = fname.split('.')
    return tokens[1]

# Create a list with the patient ID of each BL file.
bl_list = [function_names(fname) for fname in bl_files2]

start_time = time.time()

# Read every baseline (BL) visit file from the counts folder, which holds the files
# for all patients and all visits. skiprows=1 skips the first line of each file.
list_bl_files = [dd.read_csv(path1/bl_files.iloc[i][0],skiprows=1,delimiter='\t') for i in range(len(bl_files))]


# Get the last column of each file in the list.
last_columns = [ddf.iloc[:, -1:] for ddf in list_bl_files]

# Concatenate the list of columns into a single dataframe.
single_file = dd.concat(last_columns, axis=1, ignore_unknown_divisions=True)

# Rename each column with the corresponding patient number.
single_file.columns = bl_list

# Get the Geneid column from one file and convert it to a Dask series.
# Any file works here, assuming all files list the same genes in the same order.
pd_tmp_file = list_bl_files[0].compute()
geneid = pd_tmp_file['Geneid']
ddf_geneid = dd.from_pandas(geneid, npartitions=1)

# Set the Geneid column as the index of the matrix.
ddf_new_index = single_file.set_index(ddf_geneid)

# Convert to a pandas DataFrame and save.
ir3_counts = ddf_new_index.compute()
ir3_counts.to_csv(path2/"matrix_ir3_counts_bl.csv")

end_time = time.time()

execution_time = end_time - start_time
print(f"Execution Time: {execution_time} seconds")

Execution Time: 891.5956218242645 seconds
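Because the full run takes about 15 minutes, it can help to check the logic on toy data first. The sketch below is an illustration with plain pandas rather than Dask, and everything in it is invented: the helper `make_toy_file`, the gene IDs, the patient IDs, and the counts. It assumes each file has one leading comment line followed by a header row starting with `Geneid`, which matches the `skiprows=1` and `Geneid` usage in the cell above.

```python
import io
import pandas as pd

def make_toy_file(bam_name, counts):
    """Build featureCounts-style text: a comment line, a header, then one row per gene."""
    genes = ["ENSG_A", "ENSG_B", "ENSG_C"]  # invented gene IDs
    lines = [
        "# Program:featureCounts (toy example)",
        f"Geneid\tChr\tStart\tEnd\tStrand\tLength\t{bam_name}",
    ]
    for gene, count in zip(genes, counts):
        lines.append(f"{gene}\tchr1\t1\t100\t+\t100\t{count}")
    return "\n".join(lines) + "\n"

# Invented patient IDs mapped to their toy file contents.
toy_files = {
    "1001": make_toy_file("p1001.bam", [5, 0, 12]),
    "1002": make_toy_file("p1002.bam", [7, 3, 0]),
}

columns = {}
for patient_id, text in toy_files.items():
    # Skip the comment line and index the rows by gene ID.
    df = pd.read_csv(io.StringIO(text), sep="\t", skiprows=1, index_col="Geneid")
    # The last column holds the counts for this patient.
    columns[patient_id] = df.iloc[:, -1]

# Building the frame from Series aligns rows on the gene ID index,
# so the result does not depend on every file sharing the same row order.
toy_matrix = pd.DataFrame(columns)
print(toy_matrix)
```

Indexing each file by `Geneid` before combining is a design choice of this sketch: it trades a little extra work for alignment by gene ID instead of by row position.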


In [3]:
# Get the current date
current_date = datetime.now().date()

# Print the current date
print("Last update :", current_date)

Last update : 2024-02-02
