# Condense Region Correlation Dataframe

This notebook extracts the most informative features from the region correlations.
The input dataset holds the pairwise correlations between atlas regions for the subjects from all sites.
The output dataset keeps only the features whose correlation with the diagnosis is strongest.

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### get_base_filepath()

Access the filepath for the base folder of the project.
From here, any other asset of the project can be located.

In [2]:
def get_base_filepath():
    '''
    Access the filepath for the base folder of the project
    
    Input: None
    
    Output: The filepath to the root of the folder
    '''
    # Resolve the folder two levels above the current working directory
    # (computed without os.chdir, so re-running the cell gives the same result)
    base_folder_filepath = os.path.abspath(os.path.join(os.curdir, '..', '..'))
    return base_folder_filepath

### extract_features()

Compute the pairwise correlations between the region time courses of one subject and return them as a flat vector.

In [3]:
def extract_features(filepath):
    '''
    Compute the correlations between the region time courses in a file
    
    Input: filepath to the subject's region time-course file
    
    Output: list of the pairwise correlations between regions (upper triangle, diagonal excluded)
    '''
    # Read the file as a dataframe (one or more whitespace characters as separator, first line as header)
    df = pd.read_csv(filepath, sep=r'\s{1,}', engine='python', header=0)
    
    # Drop two features that get in the way of evaluation
    df = df.drop('File', axis=1)
    df = df.drop('Sub-brick', axis=1)
    
    # Get the correlation matrix of the dataframe
    cor = df.corr()
    
    # Create an empty list to store the correlations
    corr_vector = []
    
    # Loop through every row in the dataframe
    for row in range(len(cor.index)):
        # Loop through every feature in the dataframe
        for feature in range(len(cor.columns)):
            # Exclude unwanted values
            #    the diagonal (row number = feature number), where a region is correlated with itself
            #    the lower triangle (row number > feature number), which repeats the upper triangle
            if row >= feature:
                continue
            
            # Add the correlation value to the vector
            corr_vector.append(cor.iloc[row, feature])
    
    # Return the flattened upper triangle of the correlation matrix
    return corr_vector
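
The nested loop above keeps only the entries strictly above the diagonal of the correlation matrix, row by row. As a hedged alternative, the same vector can be obtained with `np.triu_indices`. The helper name `upper_triangle_vector` and the toy data below are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def upper_triangle_vector(cor):
    """Return the entries strictly above the diagonal, row by row (hypothetical helper)."""
    # k=1 excludes the diagonal, so only pairs with row < column are kept
    rows, cols = np.triu_indices(len(cor), k=1)
    return cor.to_numpy()[rows, cols].tolist()


# Toy time courses: 50 time points for 4 made-up regions
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["r1", "r2", "r3", "r4"])
toy_cor = toy.corr()

vector = upper_triangle_vector(toy_cor)
print(len(vector))  # 6, i.e. n * (n - 1) / 2 for n = 4
```

For n regions the vector has n(n - 1)/2 entries. With 116 regions that gives 6670 values, which is consistent with the 6670 feature columns of the subject-by-region dataframe built later in the notebook.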

## Open files

In this section, the files for all of the patients are opened and combined into two lists, which are used to build a dataframe in the next section.

### Filepaths

Access the filepath to the preprocessed data folder. 
This is where the data for all of the sites are located.

The filepath to the phenotypic data folder is also added here.
This is where all of the phenotypic data files are located.

In [4]:
# The folder for the project
base_folder_filepath = get_base_filepath()

# Preprocessed data site folder (os.path.join keeps the path portable across operating systems)
sites_filepath = os.path.join(base_folder_filepath, 'Data', 'Preprocessed_data', 'Sites')

# Phenotypic data site folder
phenotypics_filepath = os.path.join(base_folder_filepath, 'Data', 'Phenotypic', 'Sites')

### Subjects

Open the 'sfnwmrda' file for each subject in the study.

Add the features to one list and the subject ids to another.

There may be instances where the subject does not have the file in their folder.
In this case, add the subject to a list of subjects to be dropped from the phenotypic data later.

In [5]:
# Create empty lists to store important values
subjects = []
subject_features = []
subjects_dropped = []

# Loop through every site in the folder
for site_folder in os.listdir(sites_filepath):
    # Access the filepath to the site's folder
    site_folder_path = os.path.join(sites_filepath, site_folder)
        
    # Loop through every patient in the site's folder
    for patient_id_folder in os.listdir(site_folder_path):            
        # Access the filepath to the patient's folder
        patient_id_folder_path = os.path.join(site_folder_path, patient_id_folder)
        
        # Skip anything that is not a folder (os.listdir below would fail on a file)
        if not os.path.isdir(patient_id_folder_path):
            continue

        # Skip the folder if it is empty
        if len(os.listdir(patient_id_folder_path)) == 0:
            print(f"Skipping empty folder: {patient_id_folder}")
            subjects_dropped.append(patient_id_folder)
            continue

        # Get the file name (dependent on folder name)
        file_name = f"sfnwmrda{patient_id_folder}_session_1_rest_1_aal_TCs.1D"

        # Join the file name to its path
        file_path = os.path.join(patient_id_folder_path, file_name)

        # Skip the folder if the file is not in it
        if not os.path.exists(file_path):
            print(f"Skipping folder {file_name}: not found.")
            subjects_dropped.append(patient_id_folder)
            continue

        # Extract the features and add them to the list of subject features
        subject_features.append(extract_features(file_path))

        # Add the patient ID to the subjects list
        subjects.append(patient_id_folder)

Skipping empty folder: 0010016
Skipping empty folder: 0010027
Skipping empty folder: 0010055
Skipping empty folder: 0010098
Skipping empty folder: 0010105
Skipping empty folder: 0010127
Skipping folder sfnwmrda0015001_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015004_session_1_rest_1_aal_TCs.1D: not found.
Skipping empty folder: 0015011
Skipping folder sfnwmrda0015016_session_1_rest_1_aal_TCs.1D: not found.
Skipping empty folder: 0015018
Skipping folder sfnwmrda0015026_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015027_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015032_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015036_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015052_session_1_rest_1_aal_TCs.1D: not found.
Skipping folder sfnwmrda0015057_session_1_rest_1_aal_TCs.1D: not found.


### Diagnosis

Open the phenotypic file for each subject in the study. 

Add the diagnoses to one list and the patient ids to another.

In [6]:
# Create empty lists to store important values
dx = [] # For the diagnosis
pheno_index = [] # For the patient id

# Iterate through each file in the folder
for site_pheno in os.listdir(phenotypics_filepath):
    # Access the filepath to the phenotypic data
    site_pheno_filepath = os.path.join(phenotypics_filepath, site_pheno)
    
    # Check if the current item in the directory is a file
    if os.path.isfile(site_pheno_filepath):
        # Read the file as a dataframe
        df_pheno = pd.read_csv(site_pheno_filepath, index_col='ScanDir ID')
        
        # Add the diagnosis to the list
        dx.append(df_pheno['DX'])
        
        # Add the patient id to the list
        pheno_index.append(df_pheno.index)

## Build the dataframe

Create a dataframe of the subjects, regions and their diagnosis.

### Subject x Region Correlation

Build a matrix of subjects vs. region correlation.

In [7]:
# Turn the list of features into a dataframe with the subject id as the index
df_subject_x_region = pd.DataFrame(subject_features, index=subjects)
df_subject_x_region.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6660,6661,6662,6663,6664,6665,6666,6667,6668,6669
10001,0.611553,0.4497,0.236411,0.04313,0.036921,0.433403,0.108195,0.310783,0.178693,0.726007,...,0.667023,0.168741,0.210081,0.267632,0.101423,0.191426,0.327108,0.703022,0.392078,0.562829
10002,0.522595,-0.119311,-0.236553,-0.226454,-0.04842,0.529094,0.056163,0.185524,-0.359501,0.66403,...,0.505595,0.485639,0.053753,-0.341721,0.047124,-0.251568,-0.268276,0.447499,-0.1588,0.09614
10003,0.713222,-0.253519,-0.204299,-0.399722,-0.532061,0.244321,-0.116494,-0.516092,-0.609289,0.613314,...,0.495047,0.399334,0.339714,-0.001517,0.806626,0.474968,0.139881,0.692053,0.200051,0.605858
10004,0.597482,0.171575,-0.265142,-0.358434,-0.225153,-0.127013,-0.396526,-0.334154,-0.223024,0.423363,...,0.676761,0.552446,0.29708,0.333399,0.543472,0.316075,0.443875,0.563456,0.538783,0.352373
10005,0.776555,0.42465,0.441287,-0.430811,-0.452683,0.36748,0.263006,-0.299944,-0.364459,0.175737,...,0.678166,0.829442,0.729792,0.723418,0.673523,0.71136,0.773908,0.915097,0.862756,0.903599


### Diagnosis Series

Create a series of the patient diagnoses to combine with the region dataframe.

Make a vector of the patient ids.

In [8]:
# Condense the indices in the phenotypic data to a vector
patient_ids = [p_id for site_pheno in pheno_index for p_id in site_pheno]

Unify patient id formatting and create a series for the diagnosis

In [9]:
# Fix some of the patient ids
for i in range(len(patient_ids)):
    # Access the current patient id
    s_id = patient_ids[i]
    
    # If the length of the patient id is 5...
    if len(str(s_id)) == 5:
        # ... add '00' to the beginning to match formatting with the folder names
        patient_ids[i] = '00' + str(s_id)
        
    # Otherwise, turn the current id into a string value
    else:
        patient_ids[i] = str(s_id)
    
# Make the diagnosis a series with the formatted patient ids as the index
diagnosis = pd.Series([diag for site_pheno in dx for diag in site_pheno], index=patient_ids)
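
The loop above pads five-character ids with two leading zeros so that they match the folder names. A minimal sketch of the same idea with `str.zfill` follows. It assumes that every folder name is seven characters long, which matches the ids printed earlier but is not confirmed for all sites. The helper name `normalize_patient_id` and the example ids are hypothetical.

```python
def normalize_patient_id(raw_id, width=7):
    """Convert an id to a string left-padded with zeros to `width` characters."""
    # zfill only pads; ids that are already `width` characters or longer are unchanged
    return str(raw_id).zfill(width)


# Made-up ids: an integer read from a CSV, an already padded string, a 7-digit integer
example_ids = [10001, "0010016", 1018959]
normalized = [normalize_patient_id(i) for i in example_ids]
print(normalized)  # ['0010001', '0010016', '1018959']
```

Unlike the original loop, this pads any id shorter than seven characters, not only those with exactly five, so it is a slightly different rule rather than a drop-in replacement.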

### Combine

Add the diagnosis Series to the region correlations dataframe.

In [10]:
# Make a copy of the region dataframe
df_region_w_dx = df_subject_x_region.copy()

# Drop the rows with missing files or folders from the Series
filtered_diagnosis = diagnosis.drop(index=subjects_dropped)

# Add the diagnosis to the region dataframe
df_region_w_dx['DX'] = filtered_diagnosis
df_region_w_dx.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6661,6662,6663,6664,6665,6666,6667,6668,6669,DX
10001,0.611553,0.4497,0.236411,0.04313,0.036921,0.433403,0.108195,0.310783,0.178693,0.726007,...,0.168741,0.210081,0.267632,0.101423,0.191426,0.327108,0.703022,0.392078,0.562829,3
10002,0.522595,-0.119311,-0.236553,-0.226454,-0.04842,0.529094,0.056163,0.185524,-0.359501,0.66403,...,0.485639,0.053753,-0.341721,0.047124,-0.251568,-0.268276,0.447499,-0.1588,0.09614,3
10003,0.713222,-0.253519,-0.204299,-0.399722,-0.532061,0.244321,-0.116494,-0.516092,-0.609289,0.613314,...,0.399334,0.339714,-0.001517,0.806626,0.474968,0.139881,0.692053,0.200051,0.605858,0
10004,0.597482,0.171575,-0.265142,-0.358434,-0.225153,-0.127013,-0.396526,-0.334154,-0.223024,0.423363,...,0.552446,0.29708,0.333399,0.543472,0.316075,0.443875,0.563456,0.538783,0.352373,0
10005,0.776555,0.42465,0.441287,-0.430811,-0.452683,0.36748,0.263006,-0.299944,-0.364459,0.175737,...,0.829442,0.729792,0.723418,0.673523,0.71136,0.773908,0.915097,0.862756,0.903599,2
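
Assigning a Series to a dataframe column aligns the values on index labels, so a subject without a matching diagnosis silently receives `NaN`. The sketch below, on made-up toy data, shows one way to check for such gaps. It also shows the `errors="ignore"` option of `Series.drop`, which avoids a `KeyError` when a dropped label is absent from the index. All names prefixed with `toy_` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the region dataframe and the diagnosis Series
toy_features = pd.DataFrame(
    {0: [0.1, 0.2, 0.3]}, index=["0010001", "0010002", "0010003"]
)
toy_diagnosis = pd.Series([3, 0, 1], index=["0010001", "0010003", "0010016"])

# Drop subjects without a usable file; a label missing from the index is ignored
toy_filtered = toy_diagnosis.drop(index=["0010016", "9999999"], errors="ignore")

# Values are matched on index labels, not on position
toy_features["DX"] = toy_filtered

# List the subjects that did not receive a diagnosis
missing = toy_features.index[toy_features["DX"].isna()].tolist()
print(missing)  # ['0010002']
```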


## Determine Features

Determine which features are most correlated with the diagnosis.

Find the correlation of each feature with the diagnosis.

In [11]:
correlations = df_region_w_dx.drop('DX', axis=1).corrwith(df_region_w_dx['DX'])
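
As a small illustration of what `corrwith` computes here: with its default settings it returns the Pearson correlation between each column and the given Series. The toy values below are made up, and the comparison against `np.corrcoef` is only a sanity check, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy feature columns and a toy numeric diagnosis code
toy_X = pd.DataFrame({0: [0.1, 0.4, 0.3, 0.8], 1: [0.9, 0.2, 0.5, 0.1]})
toy_y = pd.Series([0, 1, 1, 3])

# Column-wise Pearson correlation with the target
by_corrwith = toy_X.corrwith(toy_y)

# The same quantity computed one column at a time
by_hand = pd.Series(
    {col: np.corrcoef(toy_X[col], toy_y)[0, 1] for col in toy_X.columns}
)

print(np.allclose(by_corrwith, by_hand))  # True
```

Note that this treats the diagnosis code as a numeric value, so the result depends on how the diagnosis categories are numbered.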

Find all of the features whose absolute correlation with the diagnosis is at least 0.1.

In [12]:
correlation_features = correlations.loc[abs(correlations) >= 0.1]
correlation_features

235     0.113613
238     0.100459
269    -0.102065
278    -0.102554
279    -0.122630
          ...   
6652    0.143752
6653    0.109580
6654    0.111561
6657    0.115970
6658    0.102800
Length: 245, dtype: float64

In [13]:
correlation_features_strict = correlations.loc[abs(correlations) >= 0.125]
correlation_features_strict.count()

58

Create a dataframe of only the features with the highest correlation.

In [14]:
df_correlation_features = df_region_w_dx[correlation_features.index]
df_correlation_features['DX'] = df_region_w_dx['DX']

df_correlation_features_strict = df_region_w_dx[correlation_features_strict.index]
df_correlation_features_strict['DX'] = df_region_w_dx['DX']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_correlation_features['DX'] = df_region_w_dx['DX']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_correlation_features_strict['DX'] = df_region_w_dx['DX']
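
The warning above appears because a column is added to a dataframe that was itself selected from another dataframe. One common way to avoid it, sketched here on toy data, is to select the feature columns and `DX` together and take an explicit copy. The names `toy` and `selected` are assumptions, with `selected` standing in for an index such as `correlation_features.index`.

```python
import pandas as pd

# Toy stand-in for the region dataframe with a diagnosis column
toy = pd.DataFrame(
    {0: [0.1, 0.2], 1: [0.3, 0.4], "DX": [3, 0]}, index=["0010001", "0010002"]
)

# Stand-in for the index of the selected features
selected = pd.Index([1])

# Select the feature columns and DX in one step, then copy explicitly
subset = toy[selected.tolist() + ["DX"]].copy()
print(subset.columns.tolist())  # [1, 'DX']
```

Because `subset` is an independent copy, later changes to it do not affect `toy`, and pandas has no ambiguity to warn about.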


View this dataframe.

In [15]:
df_correlation_features.head()

Unnamed: 0,235,238,269,278,279,281,296,300,363,368,...,6585,6594,6617,6618,6652,6653,6654,6657,6658,DX
10001,0.092469,0.143063,-0.407458,-0.306035,-0.291062,-0.199889,0.145942,0.464097,0.515416,0.494145,...,0.105502,0.468128,-0.024701,0.392749,0.299324,0.458237,0.372736,0.052531,0.319021,3
10002,-0.488431,-0.212865,-0.183747,-0.45768,-0.298935,-0.405845,0.463829,0.209191,0.249426,0.537003,...,0.104692,0.147202,0.1891,0.025902,0.200124,0.468132,0.068159,0.342826,0.444667,3
10003,0.377359,0.393602,-0.245501,-0.249565,-0.037718,-0.43099,0.307692,0.023051,0.470801,0.655254,...,-0.50425,0.255154,0.360344,0.011818,-0.148461,0.175543,0.083083,-0.19689,0.001334,0
10004,-0.275275,-0.210242,-0.173657,-0.053023,0.133152,-0.307762,0.202635,0.134885,0.595765,0.288715,...,0.207912,0.150542,0.318745,0.324831,0.455398,0.280313,0.392798,0.288368,0.290472,0
10005,0.187964,0.673444,-0.256754,-0.532281,-0.435204,-0.664507,0.284294,-0.081279,0.21971,0.391106,...,0.376563,0.658235,0.311448,0.4292,0.598177,0.582079,0.766891,0.186045,0.054409,2


In [16]:
df_correlation_features_strict.head()

Unnamed: 0,368,369,432,434,669,693,723,727,757,806,...,6299,6325,6339,6448,6449,6450,6485,6618,6652,DX
10001,0.494145,0.67198,-0.353855,-0.209514,0.033789,0.283208,-0.262988,-0.119845,0.202768,0.361097,...,0.052908,-0.155903,-0.174839,0.314585,0.310112,0.221745,0.177072,0.392749,0.299324,3
10002,0.537003,0.638984,0.072294,-0.06563,0.058265,0.284916,-0.408666,0.216585,-0.315267,0.029702,...,0.27885,0.241276,0.151769,0.100793,-0.078102,0.034784,0.114098,0.025902,0.200124,3
10003,0.655254,0.716004,-0.405801,-0.499284,-0.530587,0.094302,-0.48149,0.080406,-0.547677,0.728816,...,0.348791,0.493606,0.055793,0.152796,0.376222,0.144087,-0.064512,0.011818,-0.148461,0
10004,0.288715,0.385358,-0.05599,-0.16336,-0.287324,0.342097,-0.478443,-0.162471,-0.037828,0.126863,...,0.051891,-0.075995,0.254207,0.546152,0.177562,0.180795,0.309937,0.324831,0.455398,0
10005,0.391106,0.527228,-0.525517,-0.698435,-0.484852,0.277689,-0.720616,-0.0926,-0.290185,0.479555,...,0.588343,0.252319,0.029536,0.390409,0.284506,0.587892,0.467803,0.4292,0.598177,2


Export condensed dataframe as a .csv file.

In [17]:
df_correlation_features.to_csv(base_folder_filepath + '\\Data\\Preprocessed_data\\Condensed\\2023.7.14-Region_Correlation_Condensed_Dataframe.csv')
df_correlation_features_strict.to_csv(base_folder_filepath + '\\Data\\Preprocessed_data\\Condensed\\2023.7.14-Region_Correlation_Condensed_Strict_Dataframe.csv')

In [18]:
X = df_region_w_dx.drop('DX', axis=1)
y = df_region_w_dx['DX']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [19]:
correlations_train = X_train.corrwith(y_train)

In [20]:
correlation_features_train = correlations_train.loc[abs(correlations_train) >= 0.1]
correlation_features_train

56     -0.114239
102     0.117607
104     0.103131
148     0.101531
191    -0.100170
          ...   
6625   -0.111713
6626    0.102850
6635    0.110228
6652    0.131794
6657    0.121656
Length: 404, dtype: float64

In [21]:
correlation_features_strict_train = correlations_train.loc[abs(correlations_train) >= 0.125]
correlation_features_strict_train.count()

114

In [22]:
correlation_features_very_strict_train = correlations_train.loc[abs(correlations_train) >= 0.14]
correlation_features_very_strict_train.count()

50

In [23]:
correlation_features_15p_strict_train = correlations_train.loc[abs(correlations_train) >= 0.15]
correlation_features_15p_strict_train.count()

28
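
The four cells above apply the same filter with thresholds 0.1, 0.125, 0.14 and 0.15. A compact sketch of that sweep is shown below. It runs on randomly generated toy correlations (an assumption for illustration), so the counts it prints will not match the notebook's results.

```python
import numpy as np
import pandas as pd

# Toy correlations standing in for correlations_train
rng = np.random.default_rng(1)
toy_correlations = pd.Series(rng.uniform(-0.2, 0.2, size=1000))

# Count the features whose absolute correlation reaches each threshold
thresholds = [0.1, 0.125, 0.14, 0.15]
counts = {t: int((toy_correlations.abs() >= t).sum()) for t in thresholds}
print(counts)
```

Raising the threshold can only keep the same number of features or fewer, which is the pattern seen in the counts above (404, 114, 50, 28).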

In [24]:
df_correlation_features_train = df_region_w_dx[correlation_features_train.index]
df_correlation_features_train['DX'] = df_region_w_dx['DX']

df_correlation_features_strict_train = df_region_w_dx[correlation_features_strict_train.index]
df_correlation_features_strict_train['DX'] = df_region_w_dx['DX']

df_correlation_features_very_strict_train = df_region_w_dx[correlation_features_very_strict_train.index]
df_correlation_features_very_strict_train['DX'] = df_region_w_dx['DX']

df_correlation_features_15p_strict_train = df_region_w_dx[correlation_features_15p_strict_train.index]
df_correlation_features_15p_strict_train['DX'] = df_region_w_dx['DX']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_correlation_features_train['DX'] = df_region_w_dx['DX']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_correlation_features_strict_train['DX'] = df_region_w_dx['DX']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_correlation_features_very_strict_train['DX'] = df_region_w_dx['DX']
A value i

In [25]:
df_correlation_features_train.to_csv(base_folder_filepath + '\\Data\\Preprocessed_data\\Condensed\\2023.7.19-Region_Correlation_Condensed_Train_Dataframe.csv')
df_correlation_features_strict_train.to_csv(base_folder_filepath + '\\Data\\Preprocessed_data\\Condensed\\2023.7.19-Region_Correlation_Condensed_Strict_Train_Dataframe.csv')
df_correlation_features_very_strict_train.to_csv(base_folder_filepath + '\\Data\\Preprocessed_data\\Condensed\\2023.7.20-Region_Correlation_Condensed_Very_Strict_Train_Dataframe.csv')
df_correlation_features_15p_strict_train.to_csv(base_folder_filepath + '\\Data\\Preprocessed_data\\Condensed\\2023.7.20-Region_Correlation_Condensed_15p_Strict_Train_Dataframe.csv')