# Correlation Matrices with OHSU

This notebook investigates the Pearson correlation matrices for subjects from the OHSU site.

First, the image data for each subject are combined into a single dataframe (subject x region correlation).

Next, each subject's diagnosis is added to the corresponding row.

Finally, machine learning models are trained on the dataframe and their predictions are compared to the true values.

## Imports

These imports are required for the notebook to run:

- `os` to access the file system

- `pandas` to work with dataframes

- `numpy` for linear algebra

- `seaborn` for nicer looking graphs

- `matplotlib.pyplot` for graphing the matrix

- `train_test_split()` for splitting data into a training and testing set

- `LogisticRegression` for a logistic regression machine learning model

- `KNeighborsClassifier` for a KNN machine learning model

- `SVC` for an SVM machine learning model

- `accuracy_score()` to evaluate the accuracy of the model

- `StratifiedKFold` and `cross_val_score()` for cross validation

In [1]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

In [2]:
models = []

models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC()))

## Functions

There are three basic functions that will be used to build and evaluate the machine learning models:

1. get_base_filepath()

2. extract_features()

3. perform_cross_validation()

### get_base_filepath()

Access the filepath for the base folder of the project.
From here, any other asset of the project can be located.

In [3]:
def get_base_filepath():
    '''
    Access the filepath for the base folder of the project
    
    Input: None
    
    Output: The filepath to the root of the folder
    '''
    # Resolve the parent of the current working directory (the project folder).
    # The working directory itself is left unchanged, so repeated calls
    # return the same path instead of climbing one level each time.
    base_folder_filepath = os.path.abspath(os.pardir)
    return base_folder_filepath

### extract_features()

Create a vector of the pairwise correlations between the region time courses of one subject.

In [4]:
def extract_features(filepath):
    '''
    Create a vector of pairwise correlations between region time courses
    
    Input: filepath to the subject's region time-course file
    
    Output: list of correlations between each pair of regions (upper triangle of the correlation matrix)
    '''
    # Read the file as a dataframe (one or more whitespace characters as separator, first line as header)
    df = pd.read_csv(filepath, sep=r'\s{1,}', engine='python', header=0)
    
    # Drop two features that get in the way of evaluation
    df = df.drop('File', axis=1)
    df = df.drop('Sub-brick', axis=1)
    
    # Get the correlation matrix of the dataframe
    cor = df.corr()
    
    # Create an empty list to store the correlations
    corr_vector = []
    
    # Loop through every row in the dataframe
    for row in range(len(cor.index)):
        # Loop through every feature in the dataframe
        for feature in range(len(cor.columns)):
            # Exclude unwanted values
            #    1 when row number = feature number
            #    repeat when row number > feature number
            if row >= feature:
                continue
            
            # Add the correlation value to the vector
            corr_vector.append(cor.iloc[row, feature])
    
    # Return the correlation for each of the regions (method of vectorizing)
    return corr_vector
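
The nested loop keeps only the entries above the diagonal of the correlation matrix, so n regions give n(n-1)/2 values (the 6670 columns in the dataframe below correspond to n = 116). A minimal sketch of the same selection with `numpy.triu_indices` follows. The helper name `upper_triangle_vector` and the random toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def upper_triangle_vector(cor):
    """Return the entries strictly above the diagonal, row by row (hypothetical helper)."""
    # k=1 skips the diagonal, where every correlation equals 1
    rows, cols = np.triu_indices(len(cor), k=1)
    return cor.to_numpy()[rows, cols].tolist()

# Toy time courses: 50 time points for 4 made-up regions
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("ABCD"))
cor = toy.corr()

vec = upper_triangle_vector(cor)

# Same selection written as the explicit loop, for comparison
loop_vec = [cor.iloc[r, c] for r in range(4) for c in range(4) if r < c]
print(len(vec), np.allclose(vec, loop_vec))
```

Both versions visit the pairs in the same row-major order, so the resulting feature columns line up.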

### perform_cross_validation()

Use stratified K-fold cross validation to score the three classification models.

In [5]:
def perform_cross_validation(X_train, y_train):
    '''
    Input: 
        - A dataframe containing the features used to build the model
        - A Series of the true values associated with the feature list
    
    Output: Printed result for the mean and standard deviation of each model
    '''
    # Create an empty dictionary to store the results
    results = dict()

    # Loop through the models
    for name, model in models:
        # Create a Stratified K-fold for cross validation
        kfold = StratifiedKFold(n_splits=10)
        
        # Apply cross validation using the current model
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
        
        # Add the mean and standard deviation to the dictionary
        results[name] = (cv_results.mean(), cv_results.std())

    # Print the results
    print('Model\t\tCV Mean\t\tCV std')
    print(results)
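
As a point of comparison, here is a minimal sketch of the same loop that shuffles the folds with a fixed seed and prints one row per model. The synthetic data from `make_classification`, the choice of `n_splits=5`, and `random_state=0` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic two-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(n_samples=100, n_features=20, random_state=0)

demo_models = [('KNN', KNeighborsClassifier()), ('SVM', SVC())]

# Shuffling with a fixed seed makes the fold assignment reproducible
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

demo_results = {}
print(f"{'Model':<8}{'CV mean':>10}{'CV std':>10}")
for name, model in demo_models:
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring='accuracy')
    demo_results[name] = (scores.mean(), scores.std())
    print(f"{name:<8}{scores.mean():>10.3f}{scores.std():>10.3f}")
```

Note that a stratified split can only balance a class across folds if the class has at least as many members as there are folds.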

## Open files

In this section, the files for all of the patients are opened and collected into two lists, which are used to build a dataframe in the next section.

### Filepaths

Access the filepath to the OHSU folder. 
This is where the data for all of the patients at the OHSU site are located.

In [6]:
base_folder_filepath = get_base_filepath()
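# Build the paths with os.path.join so the separators match the operating system
ohsu_filepath = os.path.join(base_folder_filepath, 'Data', 'Preprocessed_data', 'Sites', 'OHSU')
phenotypics_filepath = os.path.join(base_folder_filepath, 'Data', 'Phenotypic', 'Sites', 'OHSU_phenotypic.csv')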

### Subjects

Open the 'sfnwmrda' file for each subject at the OHSU site. 

Add the features to one list and the subject IDs to another.

In [7]:
subjects = []
subject_features = []

# Access all sfnwmrda files in the OHSU folder
# Access the patient folders within the site folder
for patient_id_folder in os.listdir(ohsu_filepath):
    # Access the filepath to the folder
    patient_id_folder_path = os.path.join(ohsu_filepath, patient_id_folder)
    
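    # Check if the filepath is a folder, continue if it is a folder
    if os.path.isdir(patient_id_folder_path):
        # Record the subject only for real folders so that subjects and
        # subject_features stay the same length and in the same order
        subjects.append(patient_id_folder)
        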
        # Get the file name (dependent on folder name)
        file_name = f"sfnwmrda{patient_id_folder}_session_1_rest_1_aal_TCs.1D"
        
        # Join the file name to its path
        file_path = os.path.join(patient_id_folder_path, file_name)
        
        # Extract the features and add it to the list of subjects
        subject_features.append(extract_features(file_path))

## Build the dataframe

Using the subjects, their features, and their phenotypics, create a dataframe of subjects.

### Subject x Region Correlation

Using the code from the previous cell, build a matrix of subjects vs. region correlations.

In [8]:
df_subject_x_region = pd.DataFrame(subject_features, index=subjects)
df_subject_x_region.head()

| Subject | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 6660 | 6661 | 6662 | 6663 | 6664 | 6665 | 6666 | 6667 | 6668 | 6669 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1084283 | 0.627503 | 0.102603 | 0.202119 | -0.253194 | 0.020725 | 0.262696 | 0.676474 | 0.199277 | 0.203969 | 0.443223 | ... | 0.698502 | 0.732939 | 0.039584 | -0.599 | 0.817161 | 0.01036 | -0.732348 | 0.232411 | -0.584458 | 0.297212 |
| 1084884 | 0.616312 | -0.004942 | -0.261927 | -0.063935 | 0.162726 | 0.05485 | 0.203061 | 0.482815 | 0.338529 | 0.764273 | ... | 0.742337 | 0.461013 | -0.010316 | -0.229072 | 0.629281 | -0.021019 | -0.15916 | 0.561103 | 0.480508 | 0.896453 |
| 1108916 | 0.79414 | -0.3421 | 0.231283 | -0.473526 | -0.262899 | -0.193719 | 0.037132 | -0.292466 | -0.121465 | 0.347377 | ... | 0.615086 | 0.331752 | -0.033449 | -0.353265 | 0.565515 | 0.246508 | -0.168515 | 0.71212 | 0.212148 | 0.532709 |
| 1206380 | 0.683042 | 0.400591 | 0.05056 | 0.399696 | -0.205624 | -0.066348 | -0.422396 | -0.077063 | -0.083302 | 0.146099 | ... | 0.565635 | 0.467435 | -0.348167 | -0.300269 | 0.595721 | -0.011787 | 0.079066 | 0.280008 | 0.213431 | 0.818195 |
| 1340333 | 0.93056 | -0.405752 | -0.420236 | -0.130116 | 0.095078 | 0.127039 | 0.291338 | -0.003459 | 0.190546 | 0.675852 | ... | 0.556788 | 0.371579 | 0.076717 | 0.312111 | 0.487547 | 0.057883 | 0.197871 | 0.708691 | 0.147649 | 0.297182 |


### Diagnosis

Add the subject's diagnosis to the dataframe

Read the phenotypic file as a dataframe.

Extract the diagnosis as a numpy array.

In [9]:
df_phenotypic = pd.read_csv(phenotypics_filepath, index_col='ScanDir ID')
diagnosis = df_phenotypic['DX'].to_numpy()

Add the diagnosis to a new dataframe

In [10]:
df_region_w_dx = df_subject_x_region.copy()

df_region_w_dx['DX'] = diagnosis

df_region_w_dx.head()

| Subject | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 6661 | 6662 | 6663 | 6664 | 6665 | 6666 | 6667 | 6668 | 6669 | DX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1084283 | 0.627503 | 0.102603 | 0.202119 | -0.253194 | 0.020725 | 0.262696 | 0.676474 | 0.199277 | 0.203969 | 0.443223 | ... | 0.732939 | 0.039584 | -0.599 | 0.817161 | 0.01036 | -0.732348 | 0.232411 | -0.584458 | 0.297212 | 1 |
| 1084884 | 0.616312 | -0.004942 | -0.261927 | -0.063935 | 0.162726 | 0.05485 | 0.203061 | 0.482815 | 0.338529 | 0.764273 | ... | 0.461013 | -0.010316 | -0.229072 | 0.629281 | -0.021019 | -0.15916 | 0.561103 | 0.480508 | 0.896453 | 0 |
| 1108916 | 0.79414 | -0.3421 | 0.231283 | -0.473526 | -0.262899 | -0.193719 | 0.037132 | -0.292466 | -0.121465 | 0.347377 | ... | 0.331752 | -0.033449 | -0.353265 | 0.565515 | 0.246508 | -0.168515 | 0.71212 | 0.212148 | 0.532709 | 1 |
| 1206380 | 0.683042 | 0.400591 | 0.05056 | 0.399696 | -0.205624 | -0.066348 | -0.422396 | -0.077063 | -0.083302 | 0.146099 | ... | 0.467435 | -0.348167 | -0.300269 | 0.595721 | -0.011787 | 0.079066 | 0.280008 | 0.213431 | 0.818195 | 3 |
| 1340333 | 0.93056 | -0.405752 | -0.420236 | -0.130116 | 0.095078 | 0.127039 | 0.291338 | -0.003459 | 0.190546 | 0.675852 | ... | 0.371579 | 0.076717 | 0.312111 | 0.487547 | 0.057883 | 0.197871 | 0.708691 | 0.147649 | 0.297182 | 1 |
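
Assigning the `DX` array directly relies on the phenotypic file listing subjects in the same order that `os.listdir()` returned the folders. A minimal sketch of aligning on the subject ID instead is shown here. It assumes the folder names are the string form of `ScanDir ID`, and it uses small toy frames rather than the real data.

```python
import pandas as pd

# Toy feature table indexed by folder name (strings), deliberately out of order
features = pd.DataFrame({0: [0.1, 0.2, 0.3]},
                        index=['1084884', '1084283', '1108916'])

# Toy phenotypic table indexed by integer ID, as read with index_col='ScanDir ID'
phenotypic = pd.DataFrame({'DX': [1, 0, 1]},
                          index=pd.Index([1084283, 1084884, 1108916], name='ScanDir ID'))

# Convert the folder names to integers so both indexes share a type
features_int = features.copy()
features_int.index = features_int.index.astype(int)

# Join on the index: each row receives the DX of the matching ID;
# how='inner' drops subjects that appear in only one of the two tables
with_dx = features_int.join(phenotypic['DX'], how='inner')
print(with_dx)
```

With this approach a mismatch in ordering or a missing subject changes the number of rows rather than silently attaching a diagnosis to the wrong subject.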


## Model

Build a machine learning model and use it to make predictions on the dataset. 
Evaluate the model based on its accuracy.

Separate the data into features and targets

In [11]:
X = df_region_w_dx.drop('DX', axis=1)
y = df_region_w_dx['DX']

### Build a model and make predictions

Logistic Regression

In [12]:
model_LR = LogisticRegression().fit(X, y)
y_pred_LR = model_LR.predict(X)

accuracy_LR = accuracy_score(y, y_pred_LR)
accuracy_LR

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


1.0
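
The accuracy of 1.0 is measured on the same rows the model was fit on, and the solver warning indicates that the fit stopped at the iteration limit. A minimal sketch of a held-out evaluation follows, using `train_test_split` (imported above) together with a scaled pipeline and a larger `max_iter`, as the warning suggests. The synthetic data, `test_size=0.25`, `max_iter=1000`, and `random_state=0` are illustrative assumptions, not values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (assumption)
X_demo, y_demo = make_classification(n_samples=120, n_features=30, random_state=0)

# Hold out a quarter of the rows; stratify keeps class proportions similar
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

# Scale the features, then fit logistic regression with a higher iteration cap
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, clf.predict(X_tr))
test_acc = accuracy_score(y_te, clf.predict(X_te))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Comparing the two numbers gives a rough sense of how much of the training accuracy comes from memorizing the training rows.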

KNN

In [13]:
model_KNN = KNeighborsClassifier().fit(X, y)
y_pred_KNN = model_KNN.predict(X)

accuracy_KNN = accuracy_score(y, y_pred_KNN)
accuracy_KNN

0.5443037974683544

SVM

In [14]:
model_SVM = SVC().fit(X, y)
y_pred_SVM = model_SVM.predict(X)

accuracy_SVM = accuracy_score(y, y_pred_SVM)
accuracy_SVM

0.7088607594936709

## Evaluate Accuracy

Understand the model accuracies better

### Best model

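Based on the results from the model building, logistic regression had the highest accuracy (1.0), followed by SVM (0.71) and KNN (0.54). These scores are measured on the same data the models were fit on, so they show how closely each model reproduces the training labels rather than how well it generalizes.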

In [15]:
print('Accuracies:')
print('\nLogistic Regression:\t', accuracy_LR)
print('KNN:\t\t\t', accuracy_KNN)
print('SVM:\t\t\t', accuracy_SVM)

Accuracies:

Logistic Regression:	 1.0
KNN:			 0.5443037974683544
SVM:			 0.7088607594936709


Perform cross validation to estimate how each model performs on subjects it was not trained on, and compare the result with the training accuracies above.

In [16]:
perform_cross_validation(X,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Model		CV Mean		CV std
{'LR': (0.4660714285714286, 0.1541206954808802), 'KNN': (0.48035714285714287, 0.18148515709252358), 'SVM': (0.5321428571428571, 0.05101020306102036)}


### Understand the differences

View the value counts to better understand how the predictions and true values are distributed.

#### Logistic Regression

In [17]:
pd.Series(y_pred_LR).value_counts()

0    42
1    23
3    12
2     2
Name: count, dtype: int64

#### SVM

In [18]:
pd.Series(y_pred_SVM).value_counts()

0    65
1    14
Name: count, dtype: int64
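
The value counts show that the SVM never predicts classes 2 or 3. A confusion matrix makes this kind of pattern visible per class. The sketch below uses `sklearn.metrics.confusion_matrix` on a handful of made-up labels; `y_true_toy` and `y_pred_toy` are assumptions for illustration and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: the toy predictions only ever use classes 0 and 1
y_true_toy = np.array([0, 0, 1, 1, 2, 3])
y_pred_toy = np.array([0, 0, 0, 1, 0, 1])

# Rows are true classes, columns are predicted classes;
# passing labels fixes the ordering and keeps never-predicted classes in the matrix
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2, 3])
print(cm)

# Columns that sum to zero correspond to classes the model never predicted
never_predicted = [label for label, total in zip([0, 1, 2, 3], cm.sum(axis=0)) if total == 0]
print(never_predicted)
```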

#### True

In [19]:
y.value_counts()

DX
0    42
1    23
3    12
2     2
Name: count, dtype: int64
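
Since 42 of the 79 subjects have DX 0, a model that always predicts 0 would score 42/79 (about 0.532), which is a useful reference point for the cross-validated accuracies above. A minimal sketch with `DummyClassifier` follows. The labels are rebuilt from the value counts, and the all-zero feature matrix is a placeholder assumption, since the baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels rebuilt from the value counts of the true diagnoses
y_counts = {0: 42, 1: 23, 3: 12, 2: 2}
y_demo = np.concatenate([np.full(n, label) for label, n in y_counts.items()])

# Placeholder features: the most-frequent strategy never looks at them
X_placeholder = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_placeholder, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_placeholder))
print(round(baseline_acc, 4))  # 42 / 79
```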