# Machine Learning with OHSU

This notebook investigates the performance of machine learning models to recognize ADHD in subjects. 
This dataset consits of preprocessed data from the OHSU from the ADHD-200 Competition set and the diagnosis corresponding to each subject.

The features for this analysis contains the average signal intensity for each region determined by the AAL atlas. 

This notebook runs two tests to evaluate the accuracy of multiple classification models.
1. Multi-class diagnosis (uses all diagnosis types)
2. Binary classification (if subject has ADHD or not)

## Imports

These are the imports that are required for this notebook to run properly

- `os` to access the file

- `pandas` to work with dataframes

- `numpy` for linear algebra

- `train_test_split()` for splitting data into a training and testing set

- `LogisticRegression` for a logistic regression machine learning model

- `KNeighborsClassifier` for a KNN machine learning model

- `SVC` for a SVM machine learning model

- `accuracy_score()` to evaluate the accuracy of the model

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

## Functions

There are basic functions that will be used to create the machine learning model

1. get_base_filepath()

2. extract_features()

3. make_predictions()

4. evaluate_models()

5. get_accuracies()

### get_base_filepath()

Access the filepath for th ebase folder of the project. 
From here, any other asset of the project can be located.

In [2]:
def get_base_filepath():
    '''
    Access the filepath for the base folder of the project
    
    Input: None
    
    Output: The filepath to the root of the folder
    '''
    # Get current directory
    os.path.abspath(os.curdir)

    # Go up a directory level
    os.chdir('..')

    # Set baseline filepath to the project folder directory
    base_folder_filepath = os.path.abspath(os.curdir)
    return base_folder_filepath

### extract_features()

Create a dataframe using the mean of regions over time.

In [3]:
def extract_features(filepath):
    '''
    Create a dataframe using the mean of regions over time.
    
    Input: filepath to open the dataframe
    
    Output: dataframe of mean for each region
    '''
    df = pd.read_csv(filepath, sep=r'\s{1,}', engine='python', header=0)
    df = df.drop('File', axis=1)
    df = df.drop('Sub-brick', axis=1)
    return df.mean()

### make_predictions()

Fit a model using the training data, 
make predictions on a testing set, 
and get the accuracy of the model.

Used in evaluate_models()

In [4]:
def make_predictions(model, X_trn, X_tst, y_trn, y_tst):
    '''
    Get the accuracy of a model
    
    Input:
        - A model to use to make predictions
        - Set of training features
        - Set of testing features
        - Set of training targets
        - Set of testing targets
        
    Output: Accuracy of the model
    '''
    
    # Train the model on the training set
    model_fit = model.fit(X_trn, y_trn)
    
    # Make predictions on the testing features
    y_pred = model_fit.predict(X_tst)
    
    # Compare the predictions to the true values
    accuracy = accuracy_score(y_pred, y_tst)
    
    # Return the accuracy
    return accuracy

### evaluate_models()

Evaluate the performance of models on a set of features and targets.

Uses make_predictions()

Used in get_accuracies()

In [5]:
def evaluate_models(X, y):
    '''
    Evaluate the performance of models on a set of features and targets.
    
    Input:
        - Set of features
        - Set of targets
        
    Output: Accuracy of three models (Logistic regression, KNN, SVM)
    '''
    # Separate the data into training and testing sets
    X_trn, X_tst, y_trn, y_tst = train_test_split(X, y)
    
    # Evaluate the accuracies using each of the three models
    lr_acc = make_predictions(LogisticRegression(), X_trn, X_tst, y_trn, y_tst)
    knn_acc = make_predictions(KNeighborsClassifier(), X_trn, X_tst, y_trn, y_tst)
    svm_acc = make_predictions(SVC(), X_trn, X_tst, y_trn, y_tst)
    lda_acc = make_predictions(LinearDiscriminantAnalysis(), X_trn, X_tst, y_trn, y_tst)
    
    # Return the accuracy in a list format
    return [lr_acc, knn_acc, svm_acc, lda_acc]

### get_accuracies()

Get 100 accuracies for three models (Logistic regression, KNN, SVM).

In [6]:
def get_accuracies(X, y):
    '''
    Get 100 accuracies for three models (Logistic regression, KNN, SVM).
    
    Input:
        - Set of features
        - Set of targets
        
    Output: List of 100 accuracies for the three models
    '''
    # Create an empty list to store the accuracies for each model
    lr_acc = []
    knn_acc = []
    svm_acc = []
    lda_acc = []
    
    # Run 100 iterations of evaluating the model
    for i in range(100):
        # Get the accuracy for this iteration
        accuracies = evaluate_models(X, y)
        
        # Add it to the corresponding model holder
        lr_acc.append(accuracies[0])
        knn_acc.append(accuracies[1])
        svm_acc.append(accuracies[2])
        lda_acc.append(accuracies[3])
        
    # Return a list of all accuracies
    return [lr_acc, knn_acc, svm_acc, lda_acc]

## Open files

In this section, the files for all of the patients is opened and combined into two matrices to build a dataframe in the next section.

###  Filepaths

Access the filepath to the OHSU folder. 
This is where the data for all of the patients at the OHSU site are located.

In [7]:
base_folder_filepath = get_base_filepath()
ohsu_filepath = base_folder_filepath +  '\\Data\\Preprocessed_data\\Sites\\OHSU\\'
phenotypics_filepath = base_folder_filepath + '\\Data\\Phenotypic\\Sites\\OHSU_phenotypic.csv'

### Subjects

Open the 'sfnwmrda' file for each subject at the OHSU site. 

Add the features to a matrix and the subjects to a different matrix.

In [8]:
subjects = []
subject_features = []

# Access all sfnwmrda files in the OHSU folder
# Access the patient folders within the site folder
for patient_id_folder in os.listdir(ohsu_filepath):
    # Access the filepath to the folder
    patient_id_folder_path = os.path.join(ohsu_filepath, patient_id_folder)
    
    subjects.append(patient_id_folder)
    
    # Check if the filepath is a folder, continue if it is a folder
    if os.path.isdir(patient_id_folder_path):
        # Get the file name (dependent on folder name)
        file_name = f"sfnwmrda{patient_id_folder}_session_1_rest_1_aal_TCs.1D"
        
        # Join the file name to its path
        file_path = os.path.join(patient_id_folder_path, file_name)
        
        # Extract the features and add it to the list of subjects
        subject_features.append(extract_features(file_path))
        
subjects[:3]

['1084283', '1084884', '1108916']

# Multi-Class Classificaiton

This section investigates how models perform when predicting the type of ADHD the subject has or if they are a control.

This is accomplished by extracting the diagnosis from the phenotypic data and adding it to the regions. 
Each number corresponds to a type diagnosis for ADHD.

    0 = TDC (Typically developing children)
    1 = ADHD-Combined
    2 = ADHD-Hyperactive/Impulsive
    3 = ADHD-Inattentive

### Subject x Region

Using the code from the previous cell, build a matrix of subjects vs. regions.

In [9]:
df_subject_x_region = pd.DataFrame(subject_features, index=subjects)
df_subject_x_region.head()

Unnamed: 0,Mean_2001,Mean_2002,Mean_2101,Mean_2102,Mean_2111,Mean_2112,Mean_2201,Mean_2202,Mean_2211,Mean_2212,...,Mean_9081,Mean_9082,Mean_9100,Mean_9110,Mean_9120,Mean_9130,Mean_9140,Mean_9150,Mean_9160,Mean_9170
1084283,0.001087,0.006426,0.003046,0.007911,0.002836,0.010834,0.014295,0.013514,0.000804,0.044241,...,0.02071,0.027268,0.029133,0.005323,-0.006976,-0.010725,-0.018854,-0.017208,-0.021859,0.033943
1084884,-0.002832,0.032711,-0.019582,-0.008242,-0.006916,-0.009639,-0.02009,-0.014911,-0.025029,-0.025237,...,-0.004995,-0.013224,-0.051607,0.011342,-0.006839,0.000509,-0.02116,-0.012922,-0.022561,-0.031755
1108916,-0.015434,-0.013533,0.006497,-0.003192,0.002811,0.004337,-0.003781,-0.006387,-0.008912,0.004389,...,0.000488,0.002434,0.018162,-0.022221,-0.020608,-0.010677,-0.022314,0.001143,0.010789,0.010671
1206380,-0.001697,0.003231,-0.008367,-0.006932,-0.009974,-0.000491,-0.003736,-0.007092,-0.001125,0.002679,...,0.004499,0.001607,0.016533,0.000625,0.002733,-0.021268,-0.01648,-0.003075,0.007536,0.01351
1340333,-0.007934,0.002444,-0.004535,-0.004931,0.004147,-0.003268,-0.018939,-0.009468,-0.00449,-0.004108,...,0.009226,-0.00752,0.026861,0.017277,-0.016428,-0.018125,-0.006291,-0.021627,-0.021855,0.040774


### Diagnosis

Add the subject's diagnosis to the dataframe

Read the phenotypic file as a dataframe.

Extract the diagnosis as a numpy array.

In [10]:
df_phenotypic = pd.read_csv(phenotypics_filepath, index_col='ScanDir ID')
diagnosis = df_phenotypic['DX'].to_numpy()

Add the diagnosis to a new dataframe

In [11]:
df_region_w_dx = df_subject_x_region.copy()

df_region_w_dx['DX'] = diagnosis

df_region_w_dx.head()

Unnamed: 0,Mean_2001,Mean_2002,Mean_2101,Mean_2102,Mean_2111,Mean_2112,Mean_2201,Mean_2202,Mean_2211,Mean_2212,...,Mean_9082,Mean_9100,Mean_9110,Mean_9120,Mean_9130,Mean_9140,Mean_9150,Mean_9160,Mean_9170,DX
1084283,0.001087,0.006426,0.003046,0.007911,0.002836,0.010834,0.014295,0.013514,0.000804,0.044241,...,0.027268,0.029133,0.005323,-0.006976,-0.010725,-0.018854,-0.017208,-0.021859,0.033943,1
1084884,-0.002832,0.032711,-0.019582,-0.008242,-0.006916,-0.009639,-0.02009,-0.014911,-0.025029,-0.025237,...,-0.013224,-0.051607,0.011342,-0.006839,0.000509,-0.02116,-0.012922,-0.022561,-0.031755,0
1108916,-0.015434,-0.013533,0.006497,-0.003192,0.002811,0.004337,-0.003781,-0.006387,-0.008912,0.004389,...,0.002434,0.018162,-0.022221,-0.020608,-0.010677,-0.022314,0.001143,0.010789,0.010671,1
1206380,-0.001697,0.003231,-0.008367,-0.006932,-0.009974,-0.000491,-0.003736,-0.007092,-0.001125,0.002679,...,0.001607,0.016533,0.000625,0.002733,-0.021268,-0.01648,-0.003075,0.007536,0.01351,3
1340333,-0.007934,0.002444,-0.004535,-0.004931,0.004147,-0.003268,-0.018939,-0.009468,-0.00449,-0.004108,...,-0.00752,0.026861,0.017277,-0.016428,-0.018125,-0.006291,-0.021627,-0.021855,0.040774,1


## Model

Build a machine learning model and use it to make predictions on the dataset. 
Evaluate the model based on its accuracy.

Separate the data into features and targets

In [12]:
X = df_region_w_dx.drop('DX', axis=1)
y = df_region_w_dx['DX']

### Build a model and make predictions

Logistic Regression

In [13]:
model_LR = LogisticRegression().fit(X, y)
y_pred_LR = model_LR.predict(X)

accuracy_LR = accuracy_score(y_pred_LR, y)
accuracy_LR

0.5443037974683544

KNN

In [14]:
model_KNN = KNeighborsClassifier().fit(X, y)
y_pred_KNN = model_KNN.predict(X)

accuracy_KNN = accuracy_score(y_pred_KNN, y)
accuracy_KNN

0.5949367088607594

SVM

In [15]:
model_SVM = SVC().fit(X, y)
y_pred_SVM = model_SVM.predict(X)

accuracy_SVM = accuracy_score(y_pred_SVM, y)
accuracy_SVM

0.6962025316455697

## Evaluate Accuracy

Understand the model accuracies better

### Best model

Based on the results from the model building, SVM had the best accuracy.

In [16]:
print('Accuracies:')
print('\nLogistic Regression:\t', accuracy_LR)
print('KNN:\t\t\t', accuracy_KNN)
print('SVM:\t\t\t', accuracy_SVM)

Accuracies:

Logistic Regression:	 0.5443037974683544
KNN:			 0.5949367088607594
SVM:			 0.6962025316455697


### Add the predictions to the model

In [17]:
df_w_preds = df_region_w_dx.copy()

df_w_preds['DX_pred'] = y_pred_SVM
df_w_preds.head()

Unnamed: 0,Mean_2001,Mean_2002,Mean_2101,Mean_2102,Mean_2111,Mean_2112,Mean_2201,Mean_2202,Mean_2211,Mean_2212,...,Mean_9100,Mean_9110,Mean_9120,Mean_9130,Mean_9140,Mean_9150,Mean_9160,Mean_9170,DX,DX_pred
1084283,0.001087,0.006426,0.003046,0.007911,0.002836,0.010834,0.014295,0.013514,0.000804,0.044241,...,0.029133,0.005323,-0.006976,-0.010725,-0.018854,-0.017208,-0.021859,0.033943,1,0
1084884,-0.002832,0.032711,-0.019582,-0.008242,-0.006916,-0.009639,-0.02009,-0.014911,-0.025029,-0.025237,...,-0.051607,0.011342,-0.006839,0.000509,-0.02116,-0.012922,-0.022561,-0.031755,0,0
1108916,-0.015434,-0.013533,0.006497,-0.003192,0.002811,0.004337,-0.003781,-0.006387,-0.008912,0.004389,...,0.018162,-0.022221,-0.020608,-0.010677,-0.022314,0.001143,0.010789,0.010671,1,0
1206380,-0.001697,0.003231,-0.008367,-0.006932,-0.009974,-0.000491,-0.003736,-0.007092,-0.001125,0.002679,...,0.016533,0.000625,0.002733,-0.021268,-0.01648,-0.003075,0.007536,0.01351,3,0
1340333,-0.007934,0.002444,-0.004535,-0.004931,0.004147,-0.003268,-0.018939,-0.009468,-0.00449,-0.004108,...,0.026861,0.017277,-0.016428,-0.018125,-0.006291,-0.021627,-0.021855,0.040774,1,0


### Understand the differences

View the value counts to better understand how the predictions and true values are distributed.

The model was abel to recognize 3 out of the 4 diagnosis types. The frequency follows a similar frequency to the true values

In [18]:
df_w_preds['DX_pred'].value_counts()

DX_pred
0    66
1    12
3     1
Name: count, dtype: int64

In [19]:
df_w_preds['DX'].value_counts()

DX
0    42
1    23
3    12
2     2
Name: count, dtype: int64

--------------------------------------------------------------------------------------------------------------------------------

# Binary Classificaiton

This section investigates how models perform when predicting whether a patient has ADHD or not. 

This is accomplished by converting the diagnosis to a binary value based on if their diagnosis is a control or has some type of ADHD. 
For this feature, 'True' signifies the subject has ADHD and 'False' signifies the subject is a control and does not have ADHD.

Theoretically, this model should perform better than the multi-class classification since it is simpler.

## Build the dataframe

Create a dataframe of the subjects, regions and their diagnosis.

### Combine

Add the diagnosis Series to the regions dataframe.

In [20]:
# Make a copy of the region dataframe
df_region_w_dx_binary = df_subject_x_region.copy()

# Add the diagnosis to the region dataframe
df_region_w_dx_binary['DX'] = diagnosis>0

df_region_w_dx_binary.head()

Unnamed: 0,Mean_2001,Mean_2002,Mean_2101,Mean_2102,Mean_2111,Mean_2112,Mean_2201,Mean_2202,Mean_2211,Mean_2212,...,Mean_9082,Mean_9100,Mean_9110,Mean_9120,Mean_9130,Mean_9140,Mean_9150,Mean_9160,Mean_9170,DX
1084283,0.001087,0.006426,0.003046,0.007911,0.002836,0.010834,0.014295,0.013514,0.000804,0.044241,...,0.027268,0.029133,0.005323,-0.006976,-0.010725,-0.018854,-0.017208,-0.021859,0.033943,True
1084884,-0.002832,0.032711,-0.019582,-0.008242,-0.006916,-0.009639,-0.02009,-0.014911,-0.025029,-0.025237,...,-0.013224,-0.051607,0.011342,-0.006839,0.000509,-0.02116,-0.012922,-0.022561,-0.031755,False
1108916,-0.015434,-0.013533,0.006497,-0.003192,0.002811,0.004337,-0.003781,-0.006387,-0.008912,0.004389,...,0.002434,0.018162,-0.022221,-0.020608,-0.010677,-0.022314,0.001143,0.010789,0.010671,True
1206380,-0.001697,0.003231,-0.008367,-0.006932,-0.009974,-0.000491,-0.003736,-0.007092,-0.001125,0.002679,...,0.001607,0.016533,0.000625,0.002733,-0.021268,-0.01648,-0.003075,0.007536,0.01351,True
1340333,-0.007934,0.002444,-0.004535,-0.004931,0.004147,-0.003268,-0.018939,-0.009468,-0.00449,-0.004108,...,-0.00752,0.026861,0.017277,-0.016428,-0.018125,-0.006291,-0.021627,-0.021855,0.040774,True


View the number of subjects with and without ADHD.

In [21]:
df_region_w_dx_binary['DX'].value_counts()

DX
False    42
True     37
Name: count, dtype: int64

## Evaluate Accuracy

Build models and evaluate the accuracy

Separate dataframe into features and targets

In [22]:
X_binary = df_region_w_dx_binary.drop('DX', axis=1)
y_binary = df_region_w_dx_binary['DX']

Get 100 accuracies for the models

In [23]:
accs_binary = get_accuracies(X_binary, y_binary)
accuracies_binary = np.asarray(accs_binary)

Get statistics describing accuracies.

In [24]:
means_binary = [accuracies_binary[0].mean(), accuracies_binary[1].mean(), accuracies_binary[2].mean(), accuracies_binary[3].mean()]
stds_binary  = [accuracies_binary[0].std(),  accuracies_binary[1].std(),  accuracies_binary[2].std(),  accuracies_binary[2].std()]
maxes_binary = [accuracies_binary[0].max(),  accuracies_binary[1].max(),  accuracies_binary[2].max(),  accuracies_binary[2].max()]
mins_binary  = [accuracies_binary[0].min(),  accuracies_binary[1].min(),  accuracies_binary[2].min(),  accuracies_binary[2].min()]

Turn the statistics into a dataframe for simpler analysis.

In [25]:
results_binary = pd.DataFrame([means_binary, stds_binary, maxes_binary, mins_binary], 
                              index=['Mean', 'STD', 'Max', 'Min'], 
                              columns=['LR_binary', 'KNN_binary', 'SVM_binary', 'LDA_binary'])
results_binary

Unnamed: 0,LR,KNN,SVM,LDA
Mean,0.468,0.4575,0.541,0.474
STD,0.107592,0.109402,0.116271,0.116271
Max,0.7,0.8,0.85,0.85
Min,0.25,0.2,0.3,0.3


In [32]:
results_complete = pd.concat([results, results_binary], axis=1)
results_complete

Unnamed: 0,LR_multiclass,KNN_multiclass,SVM_multiclass,LDA_multiclass,LR_binary,KNN_binary,SVM_binary,LDA_binary
Mean,0.495,0.5,0.538,0.5325,0.491,0.4905,0.523,0.513
STD,0.101858,0.091924,0.099529,0.099529,0.092839,0.100671,0.089839,0.089839
Max,0.75,0.7,0.75,0.75,0.7,0.7,0.75,0.75
Min,0.2,0.3,0.25,0.25,0.25,0.2,0.3,0.3


In [34]:
results_complete.to_csv('2023.6.30-Region_Correlation-Results-OHSU.csv')