# Condense Region Correlation Dataframe

This notebook extracts the most important features from the region correlations.
The original dataset holds the atlas region correlations for the subjects from all of the sites.
The output dataset keeps the features from the original dataset that correlate most strongly with the diagnosis.

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### get_base_filepath()

Access the filepath for the base folder of the project.
From here, any other asset of the project can be located.

In [2]:
def get_base_filepath():
    '''
    Access the filepath for the base folder of the project
    
    Input: None
    
    Output: The filepath to the root of the folder
    '''
    # Go up two directory levels (note: this changes the working directory)
    os.chdir('..')
    os.chdir('..')

    # Set baseline filepath to the project folder directory
    base_folder_filepath = os.path.abspath(os.curdir)
    return base_folder_filepath
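
Because `os.chdir('..')` changes the working directory of the whole process, calling `get_base_filepath()` a second time would climb two more levels. A minimal sketch of a variant that computes the same path without moving the working directory could look like this; the name `resolve_base_filepath` and its `start` and `levels_up` parameters are hypothetical and not part of the original notebook.

```python
import os


def resolve_base_filepath(start=None, levels_up=2):
    """Return the absolute path `levels_up` directories above `start`.

    Hypothetical helper: `start` defaults to the current working directory,
    which this function never changes.
    """
    path = os.path.abspath(start if start is not None else os.curdir)
    # Strip one trailing path component per level
    for _ in range(levels_up):
        path = os.path.dirname(path)
    return path
```

Since the working directory is left untouched, repeated calls return the same path.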

### extract_features()

Build a vector of the pairwise correlations between regions over time.

In [3]:
def extract_features(filepath):
    '''
    Build a vector of the pairwise correlations between regions over time
    
    Input: filepath of the file to open as a dataframe
    
    Output: list of correlations, one per unique pair of regions
    '''
    # Read the file as a dataframe (any run of whitespace is a separator and the first line is the header)
    df = pd.read_csv(filepath, sep=r'\s{1,}', engine='python', header=0)
    
    # Drop two features that get in the way of evaluation
    df = df.drop('File', axis=1)
    df = df.drop('Sub-brick', axis=1)
    
    # Get the correlation matrix of the dataframe
    cor = df.corr()
    
    # Create an empty list to store the correlations
    corr_vector = []
    
    # Loop through every row in the dataframe
    for row in range(len(cor.index)):
        # Loop through every feature in the dataframe
        for feature in range(len(cor.columns)):
            # Skip unwanted values:
            #    the diagonal (row == feature), where the correlation is always 1
            #    the lower triangle (row > feature), which repeats the upper triangle
            if row >= feature:
                continue
            
            # Add the correlation value to the vector
            corr_vector.append(cor.iloc[row, feature])
    
    # Return the correlations for every unique pair of regions as a flat vector
    return corr_vector
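
The nested loop keeps only the entries above the diagonal of the correlation matrix, in row-major order. As a sketch, the same vector can be obtained with NumPy's `np.triu_indices`; the helper name `upper_triangle_vector` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def upper_triangle_vector(cor):
    """Return the entries strictly above the diagonal of a square
    correlation DataFrame, in row-major order, as a plain list."""
    # k=1 skips the diagonal (self-correlations equal to 1)
    rows, cols = np.triu_indices(len(cor.columns), k=1)
    return cor.to_numpy()[rows, cols].tolist()


# Toy example with three hypothetical regions measured at four time points
toy = pd.DataFrame({
    'region_a': [1.0, 2.0, 3.0, 4.0],
    'region_b': [2.0, 1.0, 4.0, 3.0],
    'region_c': [4.0, 3.0, 2.0, 1.0],
})
toy_vector = upper_triangle_vector(toy.corr())
```

For `n` regions the vector holds `n * (n - 1) / 2` values, so the three toy regions yield three correlations.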

## Open files

In this section, the files for all of the patients are opened and combined into two matrices to build a dataframe in the next section.

### Filepaths

Access the filepath to the preprocessed data folder. 
This is where the data for all of the sites are located.

The filepath to the phenotypic data is also added here.
This file holds the cleaned phenotypic data for the training sites.

In [4]:
# The folder for the project
base_folder_filepath = get_base_filepath()

# Preprocessed data site folder (the trailing '' keeps the final path separator)
sites_filepath = os.path.join(base_folder_filepath, 'Data', 'Preprocessed_data', '')

# Phenotypic data file
phenotypics_filepath = os.path.join(base_folder_filepath, 'Data', 'Phenotypic', '2023.7.11-Cleaned_Phenotypic_Training_Sites.csv')

df_pheno = pd.read_csv(phenotypics_filepath, index_col=0)

In [5]:
df_pheno.head()

ID,Gender,Age,Handedness,Verbal IQ,Performance IQ,IQ,DX
1038415,1,14.92,1.0,109.0,103.0,107.0,3
1201251,1,12.33,1.0,115.0,103.0,110.0,3
1245758,0,8.58,1.0,121.0,88.0,106.0,0
1253411,1,8.08,1.0,119.0,106.0,114.0,0
1419103,0,9.92,1.0,124.0,76.0,102.0,0


In [6]:
correlations = df_pheno.drop('DX', axis=1).corrwith(df_pheno['DX'])

Keep the features whose correlation with the diagnosis has an absolute value of at least 0.1.

In [7]:
correlation_features = correlations.loc[abs(correlations) >= 0.1]
correlation_features

Gender            0.168812
Verbal IQ        -0.217291
Performance IQ   -0.234445
IQ               -0.184566
dtype: float64
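
The selection step can be illustrated on toy data. The sketch below computes each column's Pearson correlation with a target column via `DataFrame.corrwith` and keeps the columns whose absolute correlation meets a threshold. The column names, the values, and the helper `select_correlated_features` are made up for illustration; the default cutoff of 0.1 mirrors the one used above.

```python
import pandas as pd


def select_correlated_features(df, target, threshold=0.1):
    """Return the correlations with `target` whose absolute value is >= threshold."""
    correlations = df.drop(target, axis=1).corrwith(df[target])
    return correlations.loc[correlations.abs() >= threshold]


toy = pd.DataFrame({
    'feature_up':   [1.0, 2.0, 3.0, 4.0],    # rises with the target
    'feature_down': [4.0, 3.0, 2.0, 1.0],    # falls as the target rises
    'feature_flat': [1.0, -1.0, -1.0, 1.0],  # uncorrelated with the target
    'target':       [0.0, 1.0, 2.0, 3.0],
})
selected = select_correlated_features(toy, 'target')
```

Because the filter uses the absolute value, both positively and negatively correlated features are kept, while `feature_flat` is dropped.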

In [8]:
# Take an explicit copy so that adding the DX column does not trigger a SettingWithCopyWarning
df_correlation_features = df_pheno[correlation_features.index].copy()
df_correlation_features['DX'] = df_pheno['DX']


View this dataframe.

In [9]:
df_correlation_features.head()

ID,Gender,Verbal IQ,Performance IQ,IQ,DX
1038415,1,109.0,103.0,107.0,3
1201251,1,115.0,103.0,110.0,3
1245758,0,121.0,88.0,106.0,0
1253411,1,119.0,106.0,114.0,0
1419103,0,124.0,76.0,102.0,0


Export condensed dataframe as a .csv file.

In [10]:
df_correlation_features.to_csv(os.path.join(base_folder_filepath, 'Data', 'Preprocessed_data', '2023.7.21-Phenotypic_Condensed.csv'))