# Condense Region Dataframe

This notebook attempts to extract the most important features from the average region intensity. 
The original dataset is the average region intensity for the subjects from all the sites.
The output dataset will be the regions from the original dataset that have the highest correlation to the diagnosis.

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### get_base_filepath()

Access the filepath for the base folder of the project. 
From here, any other asset of the project can be located.

In [2]:
def get_base_filepath():
    '''
    Access the filepath for the base folder of the project
    
    Input: None
    
    Output: The filepath to the root of the folder
    '''
    # Resolve the folder two levels above the current working directory.
    # The working directory itself is left unchanged, so re-running the
    # cell always returns the same path.
    base_folder_filepath = os.path.abspath(os.path.join(os.curdir, '..', '..'))
    return base_folder_filepath

### extract_features()

Create a dataframe using the mean of regions over time.

In [3]:
def extract_features(filepath):
    '''
    Create a dataframe using the mean of regions over time.
    
    Input: filepath to open the dataframe
    
    Output: dataframe of mean for each region
    '''
    # Read the file as a dataframe (split on runs of whitespace and use the first line as the header)
    df = pd.read_csv(filepath, sep=r'\s{1,}', engine='python', header=0)
    
    # Drop the two non-numeric bookkeeping columns that get in the way of evaluation
    df = df.drop(columns=['File', 'Sub-brick'])
    
    # Return the mean for each of the features (method of vectorizing)
    return df.mean()
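
To make the parsing step concrete, here is a minimal sketch on a made-up two-row table laid out the way the function expects: a `File` column, a `Sub-brick` column, then one `Mean_*` column per region. The sample values and the helper name `mean_region_intensity` are illustrative assumptions, not data or code from the study.

```python
import io

import pandas as pd


def mean_region_intensity(source):
    """Average each region column over time for one subject (hypothetical helper)."""
    # Split on runs of whitespace; the first line holds the column names
    df = pd.read_csv(source, sep=r'\s+', engine='python', header=0)

    # Drop the non-numeric bookkeeping columns before averaging
    df = df.drop(columns=['File', 'Sub-brick'])

    # One mean per region: a Series indexed by region name
    return df.mean()


# Made-up sample: two time points, two regions
sample = io.StringIO(
    "File Sub-brick Mean_2001 Mean_2002\n"
    "scan.nii t0 1.0 4.0\n"
    "scan.nii t1 3.0 8.0\n"
)
region_means = mean_region_intensity(sample)
print(region_means)
```

Each subject is reduced to a single row this way, which is what lets the later cells stack all subjects into one subject-by-region dataframe.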

## Open files

In this section, the files for all of the patients are opened and collected into two lists, which are used to build a dataframe in the next section.

### Filepaths

Access the filepath to the preprocessed data folder. 
This is where the data for all of the sites are located.

The filepath to the phenotypic data folder is also added here. 
This is where all of the phenotypic data files are located.

In [4]:
# The folder for the project
base_folder_filepath = get_base_filepath()

# Preprocessed data site folder
sites_filepath = os.path.join(base_folder_filepath, 'Data', 'Preprocessed_data', 'Sites')

# Phenotypic data site folder
phenotypics_filepath = os.path.join(base_folder_filepath, 'Data', 'Phenotypic', 'Sites')

### Subjects

Open the 'sfnwmrda' file for each subject in the study. 

Add the features to one list and the subject ids to another.

Some subjects do not have the file in their folder, and some folders are empty. 
In either case, add the subject to a separate list so it can be dropped from the phenotypic data later.

In [5]:
# Create empty lists to store important values
subjects = []
subject_features = []
subjects_dropped = []

# Loop through every site in the folder
for site_folder in os.listdir(sites_filepath):
    # Access the filepath to the site's folder
    site_folder_path = os.path.join(sites_filepath, site_folder)
        
    # Loop through every patient in the site's folder
    for patient_id_folder in os.listdir(site_folder_path):            
        # Access the filepath to the patient's folder
        patient_id_folder_path = os.path.join(site_folder_path, patient_id_folder)
        
        # Skip anything that is not a folder (listing a plain file would raise an error)
        if not os.path.isdir(patient_id_folder_path):
            continue

        # Skip the folder if it is empty
        if len(os.listdir(patient_id_folder_path)) == 0:
            print(f"Skipping empty folder: {patient_id_folder}")
            subjects_dropped.append(patient_id_folder)
            continue

        # Get the file name (dependent on folder name)
        file_name = f"sfnwmrda{patient_id_folder}_session_1_rest_1_aal_TCs.1D"

        # Join the file name to its path
        file_path = os.path.join(patient_id_folder_path, file_name)

        # Skip the folder if the file is not in it
        if not os.path.exists(file_path):
            print(f"Skipping folder {file_name}: not found.")
            subjects_dropped.append(patient_id_folder)
            continue

        # Extract the features and add them to the list of subject features
        subject_features.append(extract_features(file_path))

        # Add the patient ID to the subjects list
        subjects.append(patient_id_folder)

Skipping empty folder: 0010016
Skipping empty folder: 0010027
Skipping empty folder: 0010055
Skipping empty folder: 0010098
Skipping empty folder: 0010105
Skipping empty folder: 0010127
Skipping folder sfnwmrda0015001_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015004_session_1_rest_1_aal_TCs.1D: not found.
Skipping empty folder: 0015011
Skipping folder sfnwmrda0015016_session_1_rest_1_aal_TCs.1D: not found.
Skipping empty folder: 0015018
Skipping folder sfnwmrda0015026_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015027_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015032_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015036_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015052_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015057_session_1_rest_1_aal_TCs.1D: not found.
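
The traversal above can also be sketched as a small function that returns the subjects it found and the ones it had to skip, which makes the logic easy to check on a throwaway directory tree. This is only a sketch: the function name `find_subject_files`, the site name `SiteA`, and the subject ids are hypothetical, and the file-name pattern is copied from the cell above.

```python
import tempfile
from pathlib import Path


def find_subject_files(sites_dir, prefix="sfnwmrda",
                       suffix="_session_1_rest_1_aal_TCs.1D"):
    """Return (found, missing): found maps subject id -> file path."""
    found, missing = {}, []
    for site_dir in sorted(Path(sites_dir).iterdir()):
        if not site_dir.is_dir():
            continue  # ignore stray files next to the site folders
        for subject_dir in sorted(site_dir.iterdir()):
            if not subject_dir.is_dir():
                continue  # only folders represent subjects
            candidate = subject_dir / f"{prefix}{subject_dir.name}{suffix}"
            if candidate.exists():
                found[subject_dir.name] = candidate
            else:
                # Covers both empty folders and folders lacking the file
                missing.append(subject_dir.name)
    return found, missing


# Throwaway tree with one complete and one empty subject folder
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "SiteA"
    (site / "0000001").mkdir(parents=True)
    (site / "0000002").mkdir()
    (site / "0000001" / "sfnwmrda0000001_session_1_rest_1_aal_TCs.1D").touch()

    found, missing = find_subject_files(tmp)
    found_ids = sorted(found)

print(found_ids, missing)
```

Returning the skipped ids instead of only printing them keeps the list available for the later step that removes those subjects from the diagnosis data.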


### Diagnosis

Open the phenotypic file for each subject in the study. 

Add the diagnosis column to one list and the patient ids to another.

In [6]:
# Create empty lists to store important values
dx = [] # For the diagnosis
pheno_index = [] # For the patient id

# Iterate through each file in the folder
for site_pheno in os.listdir(phenotypics_filepath):
    # Access the filepath to the phenotypic data
    site_pheno_filepath = os.path.join(phenotypics_filepath, site_pheno)
    
    # Check if the current item in the directory is a file
    if os.path.isfile(site_pheno_filepath):
        # Read the file as a dataframe
        df_pheno = pd.read_csv(site_pheno_filepath, index_col='ScanDir ID')
        
        # Add the diagnosis to the list
        dx.append(df_pheno['DX'])
        
        # Add the patient id to the list
        pheno_index.append(df_pheno.index)

## Build the dataframe

Create a dataframe of the subjects, regions and their diagnosis.

### Subject x Region

Build a matrix of subjects vs. regions.

In [7]:
# Turn the list of features into a dataframe with the subject id as the index
df_subject_x_region = pd.DataFrame(subject_features, index=subjects)
df_subject_x_region.head()

Unnamed: 0,Mean_2001,Mean_2002,Mean_2101,Mean_2102,Mean_2111,Mean_2112,Mean_2201,Mean_2202,Mean_2211,Mean_2212,...,Mean_9081,Mean_9082,Mean_9100,Mean_9110,Mean_9120,Mean_9130,Mean_9140,Mean_9150,Mean_9160,Mean_9170
10001,0.001918,0.001396,0.000917,0.001579,0.00162,0.000398,0.000401,0.000248,-6e-06,-0.001791,...,-0.001946,-0.00154,0.002221,0.00164,-0.000227,-0.000473,-0.000525,0.00246,0.00181,-0.000823
10002,0.000535,-0.000911,-0.00437,1.3e-05,-0.012312,0.001798,-0.001885,0.000525,-0.002277,0.015622,...,-0.000176,-0.001465,-0.002169,-0.000968,0.001107,0.00105,0.000374,-0.000629,-2.5e-05,0.001806
10003,0.004598,0.001763,0.001807,-0.000461,-0.004121,-0.007068,0.003899,0.004255,-0.001597,-0.011144,...,-0.001121,-0.001566,-0.00923,-0.002198,0.006707,0.009246,-0.000108,0.00162,-5.9e-05,-0.007794
10004,-0.000559,0.00083,-0.003498,-0.001282,-0.004143,0.001574,-0.001477,-0.000162,-0.005601,0.002853,...,0.0008,-0.000904,-0.000326,0.000155,0.003007,0.001742,0.002644,0.000302,-0.000304,-0.00053
10005,0.003364,0.006273,0.014627,0.015924,0.000704,0.002034,0.01669,0.014993,0.004241,0.00822,...,0.007601,0.004895,0.001707,-0.004593,-0.007235,-0.008659,-0.007546,-0.000393,-0.003564,-0.001598


### Diagnosis Series

Create a Series of patient diagnoses to combine with the region dataframe.

Flatten the patient ids from all sites into a single list.

In [8]:
# Condense the indices in the phenotypic data to a single list
patient_ids = [p_id for site_pheno in pheno_index for p_id in site_pheno]

Unify the patient id formatting and create a Series for the diagnosis.

In [9]:
# Fix some of the patient ids
for i in range(len(patient_ids)):
    # Access the current patient id
    s_id = patient_ids[i]
    
    # If the length of the patient id is 5...
    if len(str(s_id)) == 5:
        # ... add '00' to the beginning to match formatting with the folder names
        patient_ids[i] = '00' + str(s_id)
        
    # Otherwise, turn the current id into a string value
    else:
        patient_ids[i] = str(s_id)
    
# Make the diagnosis a series with the phenotypic array as the index
diagnosis = pd.Series([diag for site_pheno in dx for diag in site_pheno], index=patient_ids)
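
The id rule can be isolated into a one-line helper and checked on a few values. The sketch below mirrors the rule in the cell above (five-character ids gain a `'00'` prefix, everything else is only converted to a string); the helper name `normalize_patient_id` and the sample ids and diagnosis codes are made up for illustration.

```python
import pandas as pd


def normalize_patient_id(raw_id):
    """Mirror the rule above: five-character ids gain a '00' prefix."""
    text = str(raw_id)
    return '00' + text if len(text) == 5 else text


raw_ids = [10001, 1018959, '15001']  # made-up ids, mixed int and str
raw_dx = [3, 0, 1]                   # made-up diagnosis codes

clean_ids = [normalize_patient_id(r) for r in raw_ids]
diagnosis_example = pd.Series(raw_dx, index=clean_ids)
print(diagnosis_example)
```

The ids need to match the folder names exactly because the later merge relies on index labels, not on row order.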

### Combine

Add the diagnosis Series to the regions dataframe.

In [10]:
# Make a copy of the region dataframe
df_region_w_dx = df_subject_x_region.copy()

# Drop the subjects with missing files or folders from the Series
# (errors='ignore' skips dropped subjects that have no phenotypic entry)
filtered_diagnosis = diagnosis.drop(index=subjects_dropped, errors='ignore')

# Add the diagnosis to the region dataframe
df_region_w_dx['DX'] = filtered_diagnosis

df_region_w_dx.head()

Unnamed: 0,Mean_2001,Mean_2002,Mean_2101,Mean_2102,Mean_2111,Mean_2112,Mean_2201,Mean_2202,Mean_2211,Mean_2212,...,Mean_9082,Mean_9100,Mean_9110,Mean_9120,Mean_9130,Mean_9140,Mean_9150,Mean_9160,Mean_9170,DX
10001,0.001918,0.001396,0.000917,0.001579,0.00162,0.000398,0.000401,0.000248,-6e-06,-0.001791,...,-0.00154,0.002221,0.00164,-0.000227,-0.000473,-0.000525,0.00246,0.00181,-0.000823,3
10002,0.000535,-0.000911,-0.00437,1.3e-05,-0.012312,0.001798,-0.001885,0.000525,-0.002277,0.015622,...,-0.001465,-0.002169,-0.000968,0.001107,0.00105,0.000374,-0.000629,-2.5e-05,0.001806,3
10003,0.004598,0.001763,0.001807,-0.000461,-0.004121,-0.007068,0.003899,0.004255,-0.001597,-0.011144,...,-0.001566,-0.00923,-0.002198,0.006707,0.009246,-0.000108,0.00162,-5.9e-05,-0.007794,0
10004,-0.000559,0.00083,-0.003498,-0.001282,-0.004143,0.001574,-0.001477,-0.000162,-0.005601,0.002853,...,-0.000904,-0.000326,0.000155,0.003007,0.001742,0.002644,0.000302,-0.000304,-0.00053,0
10005,0.003364,0.006273,0.014627,0.015924,0.000704,0.002034,0.01669,0.014993,0.004241,0.00822,...,0.004895,0.001707,-0.004593,-0.007235,-0.008659,-0.007546,-0.000393,-0.003564,-0.001598,2
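
Assigning a Series to a dataframe column matches values by index label, not by position, which is why the diagnosis lines up with the right subject even though the two were read in different orders. A minimal sketch with made-up ids and values:

```python
import pandas as pd

# Made-up region values for three subjects
regions = pd.DataFrame(
    {'Mean_2001': [0.1, 0.2, 0.3]},
    index=['0000001', '0000002', '0000003'],
)

# Made-up diagnoses in a different order; '0000009' has no region data
dx_series = pd.Series([1, 0, 3], index=['0000003', '0000001', '0000009'])

combined = regions.copy()
combined['DX'] = dx_series  # aligned on the index labels

# Subjects without a diagnosis end up as NaN in the new column
n_missing = int(combined['DX'].isna().sum())
print(combined)
print(n_missing)
```

Labels that appear only in the Series are ignored, and subjects with no matching label receive `NaN`, so it can be worth checking the number of missing diagnoses after the merge.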


## Determine Features

Determine what features are most correlated to the diagnosis.

Find the correlation of each feature with the diagnosis.

In [11]:
correlations = df_region_w_dx.drop('DX', axis=1).corrwith(df_region_w_dx['DX'])

Keep the features whose absolute correlation with the diagnosis is at least 0.05.

In [12]:
correlation_features = correlations.loc[abs(correlations) >= 0.05]
correlation_features.count()

20
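
The two steps above (correlate every column with the target, then keep the columns that pass an absolute threshold) can be seen on a toy table where the answer is obvious. The column names and values below are made up; only the 0.05 threshold is taken from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'Mean_a': [1.0, 2.0, 3.0, 4.0],    # rises with DX
    'Mean_b': [4.0, 3.0, 2.0, 1.0],    # falls with DX
    'Mean_c': [1.0, -1.0, -1.0, 1.0],  # uncorrelated with DX
    'DX':     [0, 1, 2, 3],
})

# Pearson correlation of each feature column with the target
corr = toy.drop(columns='DX').corrwith(toy['DX'])

# Keep features by absolute value so strong negative correlations survive
selected = corr[corr.abs() >= 0.05]
print(selected)
```

Taking the absolute value matters here: `Mean_b` is as informative as `Mean_a` but would be discarded by a plain `>=` comparison.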

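Split the data into training and test sets and repeat the correlation filter on the training set only.

In [13]: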
X = df_region_w_dx.drop('DX', axis=1)
y = df_region_w_dx['DX']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [14]:
correlations_train = X_train.corrwith(y_train)

In [18]:
correlation_features_train = correlations_train.loc[abs(correlations_train) >= 0.05]
correlation_features_train.count()

27
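
The notebook does not say why the filter is repeated on the training split. One common reason, assumed here, is to keep the test rows from influencing which features are chosen. The sketch below shows that pattern on synthetic data: the columns are selected from the training rows and the same column list is then applied to the test rows. The data, the 0.5 threshold (picked to suit the toy data), and the use of `DataFrame.sample` as a stand-in for `train_test_split` are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 4, size=n)

data = pd.DataFrame({
    'Mean_signal': target + rng.normal(scale=0.5, size=n),  # tied to the target
    'Mean_noise': rng.normal(size=n),                       # unrelated to it
})
labels = pd.Series(target, name='DX')

# Stand-in for train_test_split: 75% of the rows go to training
train_rows = data.sample(frac=0.75, random_state=1).index
X_train, y_train = data.loc[train_rows], labels.loc[train_rows]
X_test = data.drop(index=train_rows)

# Choose the features from the training rows only
train_corr = X_train.corrwith(y_train)
kept = train_corr[train_corr.abs() >= 0.5].index

# Apply the same column selection to both splits
X_train_small = X_train[kept]
X_test_small = X_test[kept]
print(list(kept))
```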

Create two dataframes that contain only the selected features, one for each selection, and add the diagnosis column to each.

In [19]:
# Copy the selected columns so that adding 'DX' does not write to a slice
# of df_region_w_dx (this avoids the SettingWithCopyWarning)
df_correlation_features = df_region_w_dx[correlation_features.index].copy()
df_correlation_features['DX'] = df_region_w_dx['DX']

df_correlation_features_train = df_region_w_dx[correlation_features_train.index].copy()
df_correlation_features_train['DX'] = df_region_w_dx['DX']


Export the condensed dataframes as .csv files.

In [20]:
df_correlation_features.to_csv(os.path.join(base_folder_filepath, 'Data', 'Preprocessed_data', '2023.7.14-Region_Condensed_Dataframe.csv'))
df_correlation_features_train.to_csv(os.path.join(base_folder_filepath, 'Data', 'Preprocessed_data', '2023.7.20-Region_Condensed_Train_Dataframe.csv'))