# Machine Learning for Neuroimaging

### Python/ML basics programming - 10.5.2023

### Outline of this tutorial:
1. Create Jupyter Notebook
2. Load useful packages
3. Numpy
4. Lists
5. Dictionaries
6. Dataframes
7. Load dataset
8. Data outlier exploration
9. Fill in missing values
10. Visualize ages of cohorts
11. Visualize frequency across sexes
12. Classify subjects using random forest
13. Classify subjects using KNN
14. Classify subjects using multi-layer perceptron

In [None]:
import numpy as np # matrices and mathematics
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
print('Hello Jupyter!')

# Simple numerical operations

a = 12
b = 2

print(a + b)
print(a**b)
print(a/b)

In [None]:
# Working with lists

# Lists are a versatile way of organizing your data in Python. 
xList = [1, 2, 3, 4]
print(xList)

# Concatenation
x = [1, 2, 3, 4]
y = [5, 6, 7, 8]

print(x + y)

# Sum a list of numbers
print(np.sum(x))
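The outline lists NumPy as its own topic, so here is a minimal sketch contrasting Python lists with NumPy arrays. The variable names (`x_list`, `x_arr`, and so on) are illustrative and not part of the original notebook. The point is that `+` concatenates lists but adds arrays element by element.

```python
import numpy as np

x_list = [1, 2, 3, 4]
y_list = [5, 6, 7, 8]

# Convert the lists to NumPy arrays
x_arr = np.array(x_list)
y_arr = np.array(y_list)

# On lists, + concatenates
concatenated = x_list + y_list

# On arrays, + adds element by element
elementwise = x_arr + y_arr

# Arrays also support vectorized arithmetic and summary statistics
doubled = x_arr * 2
mean_value = x_arr.mean()

print(concatenated)
print(elementwise)
print(doubled, mean_value)
```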

In [None]:
# Dictionaries are useful for storing and retrieving data as key-value pairs. 
# Here is a short dictionary of some example subjects. 
# The keys are subject names, and the values are the corresponding ages.

mw = {'Magda': 30, 'David': 18, 'George':32, 'Jenny': 44}
print(mw)

# We can add a new value to an existing dictionary.
mw['Lisa'] = 50
print(mw)

# We can also retrieve a value from a dictionary.
print(mw['Lisa'])

# We can also delete a value from a dictionary
del mw['David']
print(mw)

In [None]:
# A for loop is a useful means of iterating over all key-value pairs of a dictionary.

for name in mw.keys():
    print(f"The age of {name} is {mw[name]}")

# Dictionaries can be quickly sorted by key
for name in sorted(mw):
    print(f"{name} {mw[name]}")
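As a small extension of the loops above, the sketch below iterates with `items()` and sorts by value rather than by key. The dictionary `ages` is a hypothetical copy of the tutorial dictionary after adding Lisa and deleting David.

```python
# Hypothetical copy of the tutorial dictionary (Lisa added, David deleted)
ages = {'Magda': 30, 'George': 32, 'Jenny': 44, 'Lisa': 50}

# items() yields (key, value) pairs, so no second lookup is needed
lines = [f"{name} is {age} years old" for name, age in ages.items()]

# Sort by value (age) instead of by key, oldest first
oldest_first = sorted(ages.items(), key=lambda pair: pair[1], reverse=True)

print(lines)
print(oldest_first)
```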


In [None]:
# From dictionaries to dataframes --> Great for data analysis and visualization
df_mw = pd.DataFrame.from_dict(mw, orient='index', columns=['Age'])
print(df_mw)
print('-------------------')

# To view data in case you have a large dataset you can use the head() function
print(df_mw.head(3)) # first 3 rows
print('-------------------')

# describe() shows a quick statistic summary of your data
print(df_mw.describe())
print('-------------------')

# Transposing your data
print(df_mw.T)
print('-------------------')

# Get a specific column
print(df_mw['Age'])
print('-------------------')
# Get a specific row
print(df_mw.loc['Magda'])

In [None]:
# Add a new column
df_mw['Year Of Birth'] = 2023 - df_mw['Age']
print(df_mw)
print('-------------------')

# Select multiple columns
df_mw_select = df_mw[['Age', 'Year Of Birth']]
print(df_mw_select)
print('-------------------')

# Select values with a condition
df_mw_select = df_mw_select[df_mw_select['Age'] > 30]
print(df_mw_select)
print('-------------------')
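Conditions can also be combined. The following sketch rebuilds a hypothetical copy of the small dataframe used above and selects rows that satisfy two conditions at once. The threshold 1975 is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Hypothetical rebuild of the tutorial dataframe
ages = {'Magda': 30, 'George': 32, 'Jenny': 44, 'Lisa': 50}
df = pd.DataFrame.from_dict(ages, orient='index', columns=['Age'])
df['Year Of Birth'] = 2023 - df['Age']

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (df['Age'] > 30) & (df['Year Of Birth'] > 1975)

# .loc selects the matching rows and, optionally, a subset of columns
selected = df.loc[mask, ['Age']]
print(selected)
```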


In [None]:

import matplotlib.pyplot as plt # Plots
import seaborn as sns # Plots
from sklearn.ensemble import RandomForestClassifier # Random forest classifier
from sklearn.datasets import make_classification # Utils for classification
from sklearn.neural_network import MLPClassifier # Multi-layer perceptron
from sklearn.model_selection import train_test_split # Split data to train and test set
from sklearn.metrics import accuracy_score, balanced_accuracy_score # Accuracy metrics for classification
from sklearn import metrics # Metrics for classification (we will use Area Under the Curve)

In [None]:
# Parameters for the size/resolution of plots
%matplotlib inline
a4_dims = (9.7, 3.27)
plt.rcParams['figure.dpi'] = 500
plt.rcParams['savefig.dpi'] = 500
plt.rcParams["figure.autolayout"] = True

## Autism Screening on Adults

Autism, or autism spectrum disorder (ASD), refers to a broad range of conditions characterized by challenges with social skills, repetitive behaviors, speech and nonverbal communication.

This dataset is composed of survey results for more than 700 subjects who filled in an app form containing a quick referral guide for adults with suspected autism who do not have a learning disability. The labels (Control vs. ASD) indicate whether the subjects received a diagnosis of autism, based on the [AQ-10 Autism Spectrum Quotient (AQ) NHS Questionnaire](https://docs.autismresearchcentre.com/tests/AQ10.pdf), and were referred to a specialist for further diagnostic assessment.

[Download dataset from Kaggle](https://www.kaggle.com/datasets/andrewmvd/autism-screening-on-adults?resource=download)

[Download dataset from Google Drive](https://drive.google.com/file/d/1eaiAcaHQuMEpP5xTmGg3AI8DaZt09er_/view?usp=sharing)

[Dataset Website](https://archive.ics.uci.edu/ml/datasets/Autism+Screening+Adult#)

- Thabtah, Fadi. "An accessible and efficient autism screening method for behavioural data and predictive analyses." Health informatics journal 25.4 (2019): 1739-1755.

- Allison, Carrie, Bonnie Auyeung, and Simon Baron-Cohen. "Toward brief “red flags” for autism screening: the short autism spectrum quotient and the short quantitative checklist in 1,000 cases and 3,000 controls." Journal of the American Academy of Child & Adolescent Psychiatry 51.2 (2012): 202-212.

| Feature      | Description |
| ----------- | ----------- |
| index      | Participant ID       |
| A1_Score to A10_Score | Score for each of the 10 items of the Autism Spectrum Quotient (AQ-10) screening tool |
| age   | Age in years        |
| gender   | Participant gender        |
| ethnicity   | Ethnicities in text form        |
| jaundice   | Whether or not the participant was born with jaundice       |
| austim (typo in the original csv) | Whether or not anyone in the immediate family has been diagnosed with autism |
| country_of_res | Country of residency |
| used_app_before | Whether the participant has used a screening app |
| result | Score from the AQ-10 screening tool (sum of positive categories) |
| age_desc | Age as categorical variable |
| relation | Relation of person who completed the test |
| Class/ASD | Participant classification as control or ASD |

## Data Loading

In [None]:
# Read csv dataset
dataframe = pd.read_csv('autism_screening.csv')

# Print the first 5 rows of our dataset
dataframe.head(5)

# Notice the dataframe: What can you observe?

## Data Exploration

In [None]:
# Look for outliers in the data

# Describe returns information about the numerical columns of our dataset
dataframe.describe()

# Max age?
# What is the plan of action?
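One possible plan of action is to flag implausible values automatically instead of reading them off the summary table. The sketch below applies the common 1.5 × IQR rule to a small, made-up age series (the values are not taken from the dataset); it is one heuristic among several, not necessarily the approach this tutorial follows.

```python
import pandas as pd

# Made-up ages with one implausible entry
ages = pd.Series([17, 22, 25, 29, 31, 35, 42, 383], name="age")

# Interquartile range (IQR) of the ages
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
cleaned = ages[(ages >= lower) & (ages <= upper)]

print(outliers)
print(cleaned.describe())
```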

In [None]:
plt.figure()
ax1 = sns.boxplot(x="Class/ASD", y="age", data=dataframe, notch=True)
ax1.set_ylabel("Age - With outlier", fontsize = 14)
ax1.set_xlabel("Class/ASD", fontsize = 14)
ax1.set_xticklabels(["Control", "Autism"])

## Data Cleaning

In [None]:
# Remove rows with outlier from the data
outlier_row = dataframe.loc[dataframe['age']==383]
dataframe.drop(outlier_row.index, inplace=True)

# Describe returns information about the numerical columns of our dataset
dataframe.describe()

In [None]:
# Dealing with Missing Values

# Sometimes missing values are marked with ? or NaN or an empty cell
# (np.nan is used here because the np.NaN alias was removed in NumPy 2.0)
dataframe = dataframe.replace({'?': np.nan})

# Check how many values are missing per column
missing_values_before = dataframe.isnull().sum()

# Show amount of missing values per column
pd.DataFrame(missing_values_before, columns=["Missing Data"])

# What do you suggest?

In [None]:
# Fill in missing age values with the mean age (continuous value)
dataframe['age'] = dataframe['age'].fillna(np.round(dataframe['age'].mean(), 0))

# Replace missing ethnicities and relations with 'Other' as we do not know the actual category
dataframe = dataframe.fillna('Other')

missing_values_after = dataframe.isnull().sum()

# Visualize the missing values per column after imputation
dictionary = {"Missing Data Before Imputation": missing_values_before, "Missing Data After Imputation": missing_values_after}

pd.DataFrame.from_dict(dictionary)
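The imputation steps above can be tried on a tiny example first. The sketch below uses a made-up dataframe `toy` (not the real dataset) in which `?` marks a missing category, and fills numeric and categorical columns separately.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing age and two missing ethnicities marked with '?'
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "ethnicity": ["Asian", "?", "Latino", "?"],
})

# Turn the '?' markers into proper missing values
toy = toy.replace("?", np.nan)
missing_before = toy.isnull().sum()

# Numeric column: fill with the rounded mean of the observed values
toy["age"] = toy["age"].fillna(np.round(toy["age"].mean(), 0))

# Categorical column: fill with an explicit 'Other' category
toy["ethnicity"] = toy["ethnicity"].fillna("Other")
missing_after = toy.isnull().sum()

print(pd.DataFrame({"Before": missing_before, "After": missing_after}))
```

Filling column by column keeps the numeric `age` column numeric, whereas a single dataframe-wide fill with a string would turn any numeric column that still had gaps into a mixed-type column.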

## Age of cohorts

In [None]:
# Get dataframes for controls and positive subjects after data pre-processing
controls = dataframe[dataframe['Class/ASD']=='NO']
autism = dataframe[dataframe['Class/ASD'] == 'YES']

plt.figure()
ax1 = sns.histplot(x="age", data=dataframe, hue="Class/ASD", element="step")
ax1.set_ylabel("Number of Subjects", fontsize = 14)
ax1.set_xlabel("Age", fontsize = 14)


In [None]:
plt.figure()
ax1 = sns.boxplot(x="Class/ASD", y="age", data=dataframe, notch=True)
ax1.set_ylabel("Age - Without outlier", fontsize = 14)
ax1.set_xlabel("Class/ASD", fontsize = 14)
ax1.set_xticklabels(["Control", "Autism"])

# Notice the notches of the boxplot!
# If the notches of the two boxes do not overlap, this is strong evidence
# (at roughly the 95% confidence level) that the true medians differ.
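To make the notch comparison concrete, the sketch below computes a notch interval by hand, assuming Matplotlib's default (non-bootstrapped) formula of median ± 1.57 × IQR / √n. The helper `notch_interval` and the two age groups are hypothetical and only serve as illustration.

```python
import numpy as np

def notch_interval(values):
    """Approximate 95% interval around the median, as drawn by boxplot notches."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(len(values))
    return median - half_width, median + half_width

# Two made-up age groups with clearly different medians (30 and 45)
group_a = np.arange(20, 41)
group_b = np.arange(35, 56)

low_a, high_a = notch_interval(group_a)
low_b, high_b = notch_interval(group_b)

# Two intervals overlap if each one starts before the other one ends
notches_overlap = (low_a <= high_b) and (low_b <= high_a)

print((low_a, high_a), (low_b, high_b), notches_overlap)
```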

## Frequencies of sexes

In [None]:
# Count males/females
plt.figure()
ax1 = sns.histplot(data=dataframe, x='gender', hue="Class/ASD", shrink=.8, multiple="dodge")
ax1.set_ylabel("Number of Subjects", fontsize = 14)
ax1.set_xlabel("Sex", fontsize = 14)
ax1.set_xticks([0, 1])
ax1.set_xticklabels(["Female", "Male"], fontsize = 14)
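The counts behind such a plot can also be tabulated directly. The sketch below uses `pd.crosstab` on a made-up dataframe `toy`; the codes `f`/`m` and `NO`/`YES` are assumptions about how the columns are encoded.

```python
import pandas as pd

# Made-up data with assumed category codes
toy = pd.DataFrame({
    "gender": ["f", "m", "m", "f", "m", "f"],
    "Class/ASD": ["NO", "NO", "YES", "YES", "NO", "NO"],
})

# Absolute counts per sex and class
counts = pd.crosstab(toy["gender"], toy["Class/ASD"])

# Row-wise proportions: share of each class within each sex
proportions = pd.crosstab(toy["gender"], toy["Class/ASD"], normalize="index")

print(counts)
print(proportions.round(2))
```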

## Autism Classification 

Train a model $f$ that, given an input $x$, predicts a classification label $y$.

In this problem, $x$ denotes the input features from the questionnaire and $y$ is the subject's classification as control or ASD based on their responses.

Which classifier would you use? Why?


In [None]:
# Preparing data for classification

# The features we want to use to classify Control vs ASD are the responses to the ten items of the autism questionnaire
features_for_classification = ['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 
                               'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score']

# Input feature for training
X = dataframe[features_for_classification]

# Class label
Y = dataframe['Class/ASD']

# Transform labels from 'YES'/'NO' to One-hot encodings
Y = pd.get_dummies(Y)
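The one-hot step can be inspected on a few made-up labels. The sketch below shows that `pd.get_dummies` orders its columns alphabetically (`NO`, then `YES`), so `argmax(axis=1)` maps `NO` to 0 and `YES` to 1. A plain 0/1 vector, shown as an alternative, carries the same information.

```python
import pandas as pd

# Made-up labels in the same 'YES'/'NO' format
labels = pd.Series(["NO", "YES", "NO", "NO", "YES"], name="Class/ASD")

# One-hot encoding: one column per class, ordered alphabetically
one_hot = pd.get_dummies(labels)

# argmax over the columns recovers the class index (0 = NO, 1 = YES)
recovered = one_hot.values.argmax(axis=1)

# Alternative: a single 0/1 column
binary = (labels == "YES").astype(int)

print(one_hot)
print(recovered, binary.tolist())
```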

## Random Forest Classifier

In [None]:
# Split to train/test sets - we use the same data splits for the different classifiers
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state=3)

# For reproducibility we fix the random seed, so that the classifiers are compared fairly
np.random.seed(seed=3)

Y_test = Y_test.values.argmax(axis=1)

# Define Random Forest classifier
model_1 = RandomForestClassifier(max_depth=3, n_estimators=200, random_state=3)

# Fit classifier using the training set and labels
model_1.fit(X_train, Y_train)
print('Random Forest has been trained!')

# Get the predictions of trained model on the test set
Y_pred_model_1 = model_1.predict(X_test)
Y_pred_model_1 = Y_pred_model_1.argmax(axis=1)

# Which metrics do you think we can use to evaluate our predictions?

In [None]:
# Random Forest Evaluation

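# Area under the ROC curve
# Note: Y_pred_model_1 holds hard 0/1 predictions, so the ROC curve has a single operating point
# and this AUC equals the balanced accuracy; predicted probabilities give a threshold-free AUC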
fpr, tpr, thresholds = metrics.roc_curve(Y_test, Y_pred_model_1, pos_label=1)
auc_model_1 = metrics.auc(fpr, tpr)
print("Area under the curve: {:.3f}".format(auc_model_1))

# Calculate the accuracy score
accuracy = accuracy_score(Y_test, Y_pred_model_1)
print("Accuracy: {:.2f}%".format(accuracy * 100))

# Calculate balanced accuracy
b_accuracy = balanced_accuracy_score(Y_test, Y_pred_model_1)
print("Balanced Accuracy: {:.2f}%".format(b_accuracy * 100))
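The `roc_curve` call above receives hard 0/1 predictions. The sketch below, on synthetic data from `make_classification` with 1-D labels (both are assumptions and differ from the tutorial's one-hot setup), contrasts an AUC computed from hard predictions with one computed from predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the questionnaire data (10 features, binary label)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=3)

clf = RandomForestClassifier(max_depth=3, n_estimators=200, random_state=3)
clf.fit(X_tr, y_tr)

# Hard class predictions versus probability of the positive class
hard = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]

auc_hard = roc_auc_score(y_te, hard)      # a single operating point
auc_proba = roc_auc_score(y_te, proba)    # uses the full ranking of subjects
bal_acc = balanced_accuracy_score(y_te, hard)

print(f"AUC (hard labels): {auc_hard:.3f}")
print(f"AUC (probabilities): {auc_proba:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
```

With hard labels the ROC curve has only one interior point, so its area reduces to (sensitivity + specificity) / 2, which is the balanced accuracy.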

In [None]:
# Feature Importance with Random Forest

# Get feature names and importances from trained model
feature_names = X.columns
feature_imp = pd.Series(model_1.feature_importances_,index=feature_names.values).sort_values(ascending=False)

# Create a bar plot
sns.barplot(x=feature_imp[0:10], y=feature_imp.index[0:10])
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Feature Names')
plt.title("Visualizing Random Forest Important Features")
plt.show()
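The impurity-based importances plotted above are computed on the training data. As a complementary check, the sketch below uses scikit-learn's `permutation_importance` on a held-out set. It runs on synthetic data with hypothetical feature names, and it is an alternative technique rather than the method used in this tutorial.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, random_state=3)
names = [f"feature_{i}" for i in range(1, 11)]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=3)

forest = RandomForestClassifier(max_depth=3, n_estimators=100, random_state=3)
forest.fit(X_tr, y_tr)

# Shuffle one feature at a time on the test set and measure the drop in score
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=3)
ranking = pd.Series(result.importances_mean, index=names).sort_values(ascending=False)

print(ranking)
```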

## KNN Classifier

In [None]:
# KNN Classifier

from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, Y_train)

print('KNN Classifier has been trained!')

# Get the predictions of trained model on the test set
Y_pred_neigh = neigh.predict(X_test)
Y_pred_neigh = Y_pred_neigh.argmax(axis=1)

# KNN Evaluation

# Area Under the curve
fpr, tpr, thresholds = metrics.roc_curve(Y_test, Y_pred_neigh, pos_label=1)
auc_neigh = metrics.auc(fpr, tpr)
print("Area under the curve: {:.3f}".format(auc_neigh))

# Calculate the accuracy score
accuracy = accuracy_score(Y_test, Y_pred_neigh)
print("Accuracy: {:.2f}%".format(accuracy * 100))

# Calculate balanced accuracy
b_accuracy = balanced_accuracy_score(Y_test, Y_pred_neigh)
print("Balanced Accuracy: {:.2f}%".format(b_accuracy * 100))
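The choice of `n_neighbors=3` above is fixed by hand. One common way to choose it, sketched below on synthetic data, is to compare a few candidate values with cross-validation. The candidate list and the scoring metric are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the questionnaire data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=3)

candidate_ks = [1, 3, 5, 7, 9]
mean_scores = {}
for k in candidate_ks:
    # 5-fold cross-validated balanced accuracy for this number of neighbours
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
                             cv=5, scoring="balanced_accuracy")
    mean_scores[k] = scores.mean()

# Keep the k with the highest mean score
best_k = max(mean_scores, key=mean_scores.get)

print(mean_scores)
print("Best k:", best_k)
```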

## Multi-Layer Perceptron (MLP)

In [None]:
# Multi-Layer Perceptron Classifier

# hidden_layer_sizes=(16,8) --> two hidden layers, one with 16 neurons and one with 8; four layers in total:
# Input (10 features) + hidden layer 1 (16) + hidden layer 2 (8) + Output layer (2 outputs for binary classification)
model_2 = MLPClassifier(random_state=3, max_iter=800,verbose=True, hidden_layer_sizes=(16,8))
model_2.fit(X_train, Y_train)

print('Multi-layer perceptron has been trained!')

Y_pred_model_2 = model_2.predict(X_test)
Y_pred_model_2 = Y_pred_model_2.argmax(axis=1)

In [None]:
# MLP Evaluation

# Calculate area under the curve
fpr, tpr, thresholds = metrics.roc_curve(Y_test, Y_pred_model_2, pos_label=1)
auc_model_2 = metrics.auc(fpr, tpr)
print("Area under the curve: {:.3f}".format(auc_model_2))

# Calculate the accuracy score
accuracy = accuracy_score(Y_test, Y_pred_model_2)
print("Accuracy: {:.2f}%".format(accuracy * 100))

# Calculate balanced accuracy
b_accuracy = balanced_accuracy_score(Y_test, Y_pred_model_2)
print("Balanced Accuracy: {:.2f}%".format(b_accuracy * 100))

# How can we compare the three classifiers?
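One simple way to compare classifiers is to evaluate them on the same held-out split and collect the metrics in one table. The sketch below does this on synthetic data with 1-D labels, reusing the hyperparameters from this tutorial. The resulting numbers say nothing about the autism dataset itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; every model sees exactly the same split
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=3)

models = {
    "Random Forest": RandomForestClassifier(max_depth=3, n_estimators=200, random_state=3),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "MLP": MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=800, random_state=3),
}

rows = []
for name, model in models.items():
    # Train on the shared training set and predict on the shared test set
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_te, pred),
        "Balanced accuracy": balanced_accuracy_score(y_te, pred),
    })

comparison = pd.DataFrame(rows).set_index("Model")
print(comparison.round(3))
```

A single split gives a noisy estimate; repeating the comparison with cross-validation or several random seeds gives a more reliable picture.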