# Team 112
## Members:
### Tomas Anthony

# Github Link

https://github.com/tomasanthony/cs598-project


## Video Presentation
https://youtu.be/ERturLrhywY

# Introduction

## Problem

Alzheimer’s disease is a prominent brain disease and the fifth-leading cause of death among
individuals aged 65 and older, according to data from 2021 (Association, 2023). An estimated
10% of people aged 65 and older have Alzheimer’s, and 8-11% may have Mild
Cognitive Impairment (MCI), a precursor to Alzheimer’s (Association, 2023). Alzheimer’s and MCI
are also under-diagnosed in the primary care setting, especially in the earlier stages of the
disease (MCI) (Association, 2023). Golovanevsky
et al. (2022) noted the difficulty of diagnosing Alzheimer’s disease and emphasized the potential of machine learning to aid medical professionals.

## Original Paper

The paper replicated here is “Multimodal Attention-based
Deep Learning for Alzheimer’s Disease Diagnosis”, published in the Journal of the American Medical Informatics Association in 2022 (Golovanevsky et al., 2022).

### Paper Methodologies

#### Attention Layers
The approach used by Golovanevsky et al. involved a multi-modal deep learning model with multiple attention layers. The multi-modal approach mentioned in the title involved ingesting different
modalities of data - imaging, clinical, and genetic data - and using the attention layers to preserve inter-modal interactions in the data.

The researchers followed the example set in the original Transformers paper and used self-attention and
cross-attention mechanisms for their attention architecture.
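
To make the distinction between the two mechanisms concrete, the following is a minimal sketch of scaled dot-product attention in plain Python. It is not the paper's implementation: the learned query, key, and value projections of a real attention layer are omitted, and the toy `imaging` and `clinical` embeddings are hypothetical values chosen only for illustration. In self-attention the queries, keys, and values come from the same modality; in cross-attention the queries come from one modality and the keys and values from another.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over small lists of vectors (no learned projections)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Hypothetical toy embeddings: two vectors per modality, two features each
imaging = [[1.0, 0.0], [0.0, 1.0]]
clinical = [[0.5, 0.5], [1.0, 1.0]]

self_att = attention(imaging, imaging, imaging)     # queries, keys, values from one modality
cross_att = attention(imaging, clinical, clinical)  # queries from imaging, keys/values from clinical
print(self_att)
print(cross_att)
```

Each output row is a weighted average of the value vectors, so the cross-attention output for an imaging query is built entirely from clinical values.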

#### Fully Connected Networks
Each data modality was first fed through its own fully connected neural network, and each of these networks was tuned separately.
The outputs of the fully connected networks were then fed into the attention layers.

#### Output Layer

The output of the cross-attention layer
is fed into a final fully connected layer. The performance of the model produced at the end of the
pipeline was used to select new tuning parameters and rerun the pipeline, and the best-performing parameters
were used for the final model (Golovanevsky et al., 2022).

#### Data
The data used by Golovanevsky et al. comprised the ADNI1, ADNI2, and GO
datasets from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database.


## Current State of the Art

The MADDi model introduced by Golovanevsky et al. (2022) achieved state-of-the-art results at the time of publication. The methods used include:
- **Multiple Attention Layers:** Utilizing both self-attention and cross-attention layers, an approach that was modeled after the original Transformers paper.
- **Multi-modal Data:** The use of multi-modal data was a defining characteristic of the approach. Clinical, genetic, and imaging data were all used to train the model.
- **Hyper-parameter Tuning:** Each data modality was initially fed through a fully connected neural network, whose parameters were defined using extensive hyper-parameter tuning.

The MADDi model improved upon the previous state of the art F1-Score and Accuracy by 4% and 10.8%, respectively.

## Innovations

- **Cross-Modal Attention:** The primary innovation was the extensive use of attention layers. Each data modality is included in a cross-attention layer, in addition to its own self-attention layer. This approach captures the significance of cross-modality interactions in the data. The researchers noted that previous multi-modal approaches simply concatenated the different data instead of treating the multiple modalities as distinct data types.

## Contributions
The paper's contributions lie in integrating multimodal inputs, multi-task
classification, and cross-modal attention for capturing interactions:
- **Multimodal Inputs:** The paper introduced the use of multimodal inputs, as distinct from multimodal data that is concatenated into a single input.
- **Multi-task Classification:** Other approaches did not use multi-task classification to address control, mild cognitive impairment, and Alzheimer's diagnosis in a single model.
- **Cross-Modal Attention:** As noted previously, the use of cross-modal attention and the transformer architecture was a key contribution to the field.

## Performance
- **F1-Score:** 91.41%
- **Accuracy:** 96.88%




# Scope of Reproducibility:

1. State-of-the-art results for Accuracy, Precision, Recall, and
F1-Score, averaged across three classes (Control, Mild Cognitive Impairment, and Alzheimer’s
disease).
2. Elevated significance of clinical data compared to other data modalities.
3. Multi-modal models utilizing all three modalities outperform uni-modal and dual-modality models.
4. Inclusion of additional, more recent data from
the ADNI database, specifically ADNI3, will lead to performance improvements.
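
As an illustration of what "averaged across three classes" means in the first item, here is a minimal sketch of macro-averaged precision, recall, and F1 in plain Python. The labels below are hypothetical and are not ADNI results; the helper names `per_class_metrics` and `macro_average` are introduced only for this example. Macro averaging computes each metric per class and then takes the unweighted mean, so a small class counts as much as a large one.

```python
def per_class_metrics(y_true, y_pred, n_classes):
    """Return a (precision, recall, f1) tuple for each class."""
    metrics = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics

def macro_average(metrics):
    """Unweighted mean of each metric across classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics) / n for i in range(3))

# Hypothetical labels: 0 = Control, 1 = MCI, 2 = Alzheimer's disease
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]

macro_p, macro_r, macro_f1 = macro_average(per_class_metrics(y_true, y_pred, 3))
print(f"precision={macro_p:.4f} recall={macro_r:.4f} f1={macro_f1:.4f}")
```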


# Methodology
The existing code for the MADDi model from the original paper was used and adjusted for this project.



## Environment


### Python Version
This project was developed using **Python 3.10**.

The MADDi model can be replicated with Python versions greater than **3.7**.


### Dependencies

#### Notebook
The notebook primarily uses **tensorflow**, **numpy**, **pandas**, **sklearn**, and **matplotlib**.

To fully replicate the original paper, a `requirements.txt` file is provided [here](https://github.com/tomasanthony/cs598-project/blob/main/general/requirements.txt).





##  Data
Data includes the ADNI 1, 2, 3, and GO datasets.
### Data Source
Use of the ADNI datasets requires requesting access. The data is stored in the Image and Data
Archive (IDA). The access request involves providing information about the intended use of the
data and the researchers involved. Access to the datasets has already been granted to this project’s
participants.

### Data Download Instructions

To access the ADNI dataset, follow these steps:


1.   **Complete the Data Use Agreement form**: The  [ADNI Data Use Agreement](https://ida.loni.usc.edu/collaboration/access/appLicense.jsp) can be accessed directly.
2.   **Apply for Access**: Submit an application for accessing the ADNI dataset. This includes providing your affiliation and the use case for the data.
3. **Download Data**: Once the access request is approved, users can log in to the LONI website to access and download the ADNI datasets, including the study, imaging, and genetic data.

### Data Description

The dataset used in this study comprises multi-modal data sourced from ADNI (Alzheimer's Disease Neuroimaging Initiative). Given the confidential nature of this data, detailed charts and visualizations cannot be provided. Below is a general description of the types of data included, based on the methodologies outlined in the MADDi paper:

1. **Imaging Data**:
   - Consists of cross-sectional MRI scans from the baseline screenings of participants.
   - Images have been standardized by ADNI with GradWarp, B1 Correction, and N3 preprocessing steps.
   - Three slices of MRI images are included per patient, resized to 72x72 pixels.

2. **Genetic Data**:
   - Consists of whole genome sequencing (WGS) performed by Illumina's non-Clinical Laboratory Improvement Amendments (non-CLIA) process.
   - The initial dataset is reduced down to approximately 15,000 SNPs during preprocessing.
   - Data is in the form of large VCF files curated by ADNI in 2014.

3. **Clinical Data**:
   - Consists of initial assessments from neurological exams, cognitive tests, and demographic information.
   - Includes 29 features.

### Dataset Overlap

The final dataset included an overlap of 239 patients.
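
The overlap is the set of participants who appear in all three modalities. A minimal sketch of that idea, using hypothetical subject IDs rather than real ADNI identifiers:

```python
# Hypothetical subject IDs per modality (not real ADNI identifiers)
clinical_ids = {"S001", "S002", "S003", "S004", "S005"}
imaging_ids = {"S002", "S003", "S005", "S006"}
genetic_ids = {"S001", "S003", "S005", "S007"}

# Only subjects present in every modality can be used for multi-modal training
overlap = clinical_ids & imaging_ids & genetic_ids
print(sorted(overlap))
```

The same effect is obtained in the notebook by joining the modality tables on the subject column, which keeps only rows whose subject appears in each table.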

### Note on non-public nature of the datasets

The datasets used in this project are non-public and the reproduction and visualization of data samples will not be done to preserve their confidential nature.





In [None]:
import pandas as pd

data = {
    "Total Participants": [2384, 551, 805, 239],
    "Control": [942, 278, 241, 165],
    "Mild Cognitive Impairment": [796, 123, 318, 39],
    "Alzheimer's Disease": [646, 150, 246, 35]
}

index_labels = ['Clinical', 'Imaging', 'Genetic', 'Overlap']

df = pd.DataFrame(data, index=index_labels)

df


| | Total Participants | Control | Mild Cognitive Impairment | Alzheimer's Disease |
|---|---|---|---|---|
| Clinical | 2384 | 942 | 796 | 646 |
| Imaging | 551 | 278 | 123 | 150 |
| Genetic | 805 | 241 | 318 | 246 |
| Overlap | 239 | 165 | 39 | 35 |
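
The overlap row is imbalanced: 165 control participants against 39 with MCI and 35 with Alzheimer's disease. The training cell in this notebook handles imbalance with scikit-learn's `compute_class_weight` in `'balanced'` mode, which to my understanding weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the overlap counts for illustration only; in the notebook the weights are computed on the training split, so the actual values would differ slightly. The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

# Class counts taken from the Overlap row of the table above
overlap_counts = {
    "Control": 165,
    "Mild Cognitive Impairment": 39,
    "Alzheimer's Disease": 35,
}

weights = balanced_class_weights(overlap_counts)
for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

Rarer classes receive larger weights, so misclassifying an MCI or Alzheimer's case costs more in the loss than misclassifying a control.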


### Preprocessing the data
The preprocessing code for each modality can be found on GitHub.

1. [Initial Preparation](https://github.com/tomasanthony/cs598-project/blob/main/general/diagnosis_making.ipynb)
2. [Clinical Preprocessing](https://github.com/tomasanthony/cs598-project/tree/main/preprocess_clinical)
3. [Genetic Preprocessing](https://github.com/tomasanthony/cs598-project/tree/main/preprocess_genetic)
4. [Image Preprocessing](https://github.com/tomasanthony/cs598-project/tree/main/preprocess_images)

The final preprocessing step, which involves merging the datasets, is included below.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# dir and function to load raw data
raw_data_dir = '/content/gdrive/My Drive/Colab Notebooks/DL4H Project/'

clinical_data_dir = raw_data_dir + "clinical_data/"
image_data_dir = raw_data_dir + "image_data/"
genetic_data_dir = raw_data_dir + "genetic_data/"


def load_raw_data(raw_data_dir):
  vcf = pd.read_pickle("all_vcfs.pkl")
  c = pd.read_csv("clinical.csv").drop("Unnamed: 0", axis=1).rename(columns={"PTID":"subject"})
  img = pd.read_pickle('mri_meta.pkl')[["img_array", "subject", "label"]]
  return vcf, c, img

vcf, c, img = load_raw_data(raw_data_dir)

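# Merge modalities and calculate/merge stats
def merge_modalities(vcf, c, img):
  # Align the clinical label column name with the genetic data before joining
  c = c.rename(columns = {"Group":"GROUP"})
  # Inner joins keep only subjects present in all three modalities
  a = vcf.merge(c, on = ["subject", "GROUP"]).merge(img, on = "subject")
  return a

merged_data = merge_modalities(vcf, c, img)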

def calculate_stats(a):
  subject_counts = a["subject"].value_counts()
  group_counts = a["GROUP"].value_counts()

  return subject_counts, group_counts

# process raw data
def process_data(a):
  # Set columns for aggregated data
  cols = list(set(a.columns) - set(["PTID", "label", "GROUP",
                                  "RID", "ID", "Group", "Phase", "SITEID", "VISCODE", "VISCODE2", "USERDATE", "USERDATE2", "update_stamp", "DX", "Unnamed: 0"]))

  X= a[cols]
  y = a["GROUP"]
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
  X_test[["subject"]].to_csv("overlap_test_set.csv")
  # Column subsets are converted to lists because recent pandas versions do not accept sets as indexers
  snp_cols = list(set(X_train.columns).intersection(set(vcf.columns)))
  X_train_snp = X_train[snp_cols]
  X_test_snp = X_test[snp_cols]
  img_cols = list(set(X_train.columns).intersection(set(img.columns)))
  print(len(img.columns))
  print(len(img_cols))
  X_train_img = X_train[img_cols]
  X_test_img = X_test[img_cols]
  clin_cols = list(set(X_train.columns).intersection(set(c.columns)))
  print(len(c.columns))
  print(len(clin_cols))
  X_train_clin = X_train[clin_cols]
  X_test_clin = X_test[clin_cols]

  # Define the directory in Google Drive where the output files will be stored
  output_pickle_path = raw_data_dir + "processed_pickle_data/"

  # Converting arrays to pandas DataFrame and saving them as Pickle files
  # This is useful for data persistence, making it easier to load preprocessed data without re-running the preprocessing steps

  # SNP datasets
  pd.DataFrame(X_train_snp).to_pickle(output_pickle_path + "X_train_snp.pkl")
  pd.DataFrame(X_test_snp).to_pickle(output_pickle_path + "X_test_snp.pkl")

  # Labels
  pd.DataFrame(y_train).to_pickle(output_pickle_path + "y_train.pkl")
  pd.DataFrame(y_test).to_pickle(output_pickle_path + "y_test.pkl")

  # Clinical data
  pd.DataFrame(X_train_clin).to_pickle(output_pickle_path + "X_train_clinical.pkl")
  pd.DataFrame(X_test_clin).to_pickle(output_pickle_path + "X_test_clinical.pkl")

  # Imaging data
  pd.DataFrame(X_train_img).to_pickle(output_pickle_path + "X_train_img.pkl")
  pd.DataFrame(X_test_img).to_pickle(output_pickle_path + "X_test_img.pkl")

processed_data = process_data(merged_data)



##   Model

### Original Paper
[“Multimodal Attention-based Deep Learning for Alzheimer’s Disease Diagnosis”](https://arxiv.org/abs/2206.08826), Michal Golovanevsky, Carsten Eickhoff, Ritambhara Singh.

[Original Paper Github](https://github.com/rsinghlab/MADDi/tree/main)

### Description
The model uses a multi-modal, attention-based approach. It incorporates three distinct data modalities from the ADNI dataset - imaging, clinical, and genetic.

These three modalities serve as inputs to the self-attention layers and then to the cross-attention layers, where the modalities are integrated. The result is fed into a fully connected layer for the final output.

- **Imaging Input**: This modality uses CNN layers to process the baseline screening MRI images for study participants. The CNN layers are then flattened and fed through a dense layer.
- **Clinical Input**: This modality uses dense layers, which are then normalized and fed through dropout layers.
- **Genetic Input**: This modality also uses dense layers, followed by normalization and dropout layers.

### Attention Architecture
- **Self-Attention**: Each modality is fed to a self-attention layer.
- **Cross-Modal Attention**: The cross-attention layer integrates the separate modalities and captures interactions between them.


In [None]:
import os
import random
import gc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import compute_class_weight
from sklearn.metrics import classification_report, precision_recall_curve, precision_recall_fscore_support
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Dropout, Flatten, BatchNormalization, Conv2D, MultiHeadAttention, concatenate
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

# Configure TensorFlow session
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)


In [None]:
def make_img(t_img):
    """ Load image data from pickle and prepare for training. """
    img = pd.read_pickle(t_img)
    img_l = [img.values[i][0] for i in range(len(img))]
    return np.array(img_l)

def reset_random_seeds(seed):
    """ Fix random seed for reproducibility. """
    os.environ['PYTHONHASHSEED'] = str(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)
    random.seed(seed)

In [None]:
def create_model_snp():
    """ Build and return a Sequential model for SNP data. """
    model = Sequential([
        Input(shape=(15965,)),  # Input layer with shape = the number of SNPs used as features from the genetic dataset
        Dense(200, activation="relu"),
        BatchNormalization(),
        Dropout(0.5),
        Dense(100, activation="relu"),
        BatchNormalization(),
        Dropout(0.3),
        Dense(50, activation="relu"),
        BatchNormalization(),
        Dropout(0.2)
    ])
    return model

def create_model_clinical():
    """ Build and return a Sequential model for clinical data. """
    model = Sequential([
        Input(shape=(185,)),  # Input layer with shape = the number of features in the clinical dataset
        Dense(200, activation="relu"),
        BatchNormalization(),
        Dropout(0.5),
        Dense(100, activation="relu"),
        BatchNormalization(),
        Dropout(0.3),
        Dense(50, activation="relu"),
        BatchNormalization(),
        Dropout(0.2)
    ])
    return model

def create_model_img():
    """ Build and return a Sequential model for imaging data. """
    model = Sequential([
        Input(shape=(72, 72, 3)),  # Shape = height, width, number of slices from the MRI screenings images of ADNI1 dataset
        Conv2D(72, (3, 3), activation='relu'),
        Conv2D(64, (3, 3), activation='relu'),
        Conv2D(32, (3, 3), activation='relu'),
        Flatten(),
        Dense(50, activation='relu')
    ])
    return model

In [None]:
from tensorflow.keras.utils import model_to_dot
from IPython.display import HTML

def get_svg(model):
    return model_to_dot(model, show_shapes=True, show_layer_names=True, dpi=90).create(prog='dot', format='svg').decode('utf-8')



model_snp = create_model_snp()
model_clinical = create_model_clinical()
model_img = create_model_img()

print("SNP Model Summary:")
model_snp.summary()
print("\nClinical Model Summary:")
model_clinical.summary()
print("\nImaging Model Summary:")
model_img.summary()

model_snp_svg = get_svg(model_snp)
model_clinical_svg = get_svg(model_clinical)
model_img_svg = get_svg(model_img)

SNP Model Summary:
Model: "sequential_32"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_76 (Dense)            (None, 200)               3193200   
                                                                 
 batch_normalization_66 (Ba  (None, 200)               800       
 tchNormalization)                                               
                                                                 
 dropout_66 (Dropout)        (None, 200)               0         
                                                                 
 dense_77 (Dense)            (None, 100)               20100     
                                                                 
 batch_normalization_67 (Ba  (None, 100)               400       
 tchNormalization)                                               
                                                                 
 dropout_67 (Dropout)        (None

In [None]:
html = f"""
<div style='width: 100%; display: flex;'>
    <div style='width: 33%; text-align: center;'>
        <h2>SNP Model</h2>
        <div>{model_snp_svg}</div>
    </div>
    <div style='width: 33%; text-align: center;'>
        <h2>Clinical Model</h2>
        <div>{model_clinical_svg}</div>
    </div>
    <div style='width: 33%; text-align: center;'>
        <h2>Imaging Model</h2>
        <div>{model_img_svg}</div>
    </div>
</div>
"""
HTML(html)


In [None]:
def plot_classification_report(y_tru, y_prd, mode, learning_rate, batch_size,epochs, figsize=(7, 7), ax=None):
    """ Generate and save a classification report as a heatmap. """

    plt.figure(figsize=figsize)

    xticks = ['precision', 'recall', 'f1-score', 'support']
    yticks = ["Control", "Moderate", "Alzheimer's" ]
    yticks += ['avg']

    rep = np.array(precision_recall_fscore_support(y_tru, y_prd)).T
    avg = np.mean(rep, axis=0)
    avg[-1] = np.sum(rep[:, -1])
    rep = np.insert(rep, rep.shape[0], avg, axis=0)

    sns.heatmap(rep,
                annot=True,
                cbar=False,
                xticklabels=xticks,
                yticklabels=yticks,
                ax=ax, cmap = "Blues")

    plt.savefig('report_' + str(mode) + '_' + str(learning_rate) +'_' + str(batch_size)+'_' + str(epochs)+'.png')

In [None]:
def calc_confusion_matrix(result, test_label,mode, learning_rate, batch_size, epochs):
    test_label = to_categorical(test_label,3)

    true_label= np.argmax(test_label, axis =1)

    predicted_label= np.argmax(result, axis =1)

    n_classes = 3
    precision = dict()
    recall = dict()
    thres = dict()
    for i in range(n_classes):
        precision[i], recall[i], thres[i] = precision_recall_curve(test_label[:, i],
                                                            result[:, i])


    print ("Classification Report :")
    print (classification_report(true_label, predicted_label))
    cr = classification_report(true_label, predicted_label, output_dict=True)
    return cr, precision, recall, thres

In [None]:
# Attention/Multi-modal Model

def cross_modal_attention(x, y):
    x = tf.expand_dims(x, axis=1)
    y = tf.expand_dims(y, axis=1)
    a1 = MultiHeadAttention(num_heads = 4,key_dim=50)(x, y)
    a2 = MultiHeadAttention(num_heads = 4,key_dim=50)(y, x)
    a1 = a1[:,0,:]
    a2 = a2[:,0,:]
    return concatenate([a1, a2])


def self_attention(x):
    x = tf.expand_dims(x, axis=1)
    attention = MultiHeadAttention(num_heads = 4, key_dim=50)(x, x)
    attention = attention[:,0,:]
    return attention


def multi_modal_model(mode, train_clinical, train_snp, train_img):

    # Shapes are passed as one-element tuples; (n) without a trailing comma is just an int
    in_clinical = Input(shape=(train_clinical.shape[1],))

    in_snp = Input(shape=(train_snp.shape[1],))

    in_img = Input(shape=(train_img.shape[1], train_img.shape[2], train_img.shape[3]))

    dense_clinical = create_model_clinical()(in_clinical)
    dense_snp = create_model_snp()(in_snp)
    dense_img = create_model_img()(in_img)

    ########### Attention Layer ############

    ## Cross Modal Bi-directional Attention ##

    if mode == 'MM_BA':

        vt_att = cross_modal_attention(dense_img, dense_clinical)
        av_att = cross_modal_attention(dense_snp, dense_img)
        ta_att = cross_modal_attention(dense_clinical, dense_snp)

        merged = concatenate([vt_att, av_att, ta_att, dense_img, dense_snp, dense_clinical])




    ## Self Attention ##
    elif mode == 'MM_SA':

        vv_att = self_attention(dense_img)
        tt_att = self_attention(dense_clinical)
        aa_att = self_attention(dense_snp)

        merged = concatenate([aa_att, vv_att, tt_att, dense_img, dense_snp, dense_clinical])

    ## Self Attention and Cross Modal Bi-directional Attention##
    elif mode == 'MM_SA_BA':

        vv_att = self_attention(dense_img)
        tt_att = self_attention(dense_clinical)
        aa_att = self_attention(dense_snp)

        vt_att = cross_modal_attention(vv_att, tt_att)
        av_att = cross_modal_attention(aa_att, vv_att)
        ta_att = cross_modal_attention(tt_att, aa_att)

        merged = concatenate([vt_att, av_att, ta_att, dense_img, dense_snp, dense_clinical])


    ## No Attention ##
    elif mode == 'None':

        merged = concatenate([dense_img, dense_snp, dense_clinical])

    else:
        print ("Mode must be one of 'MM_SA', 'MM_BA', 'MM_SA_BA' or 'None'.")
        return


    ########### Output Layer ############

    output = Dense(3, activation='softmax')(merged)
    model = Model([in_clinical, in_snp, in_img], output)

    return model

## Training

### Hyperparameter Tuning

#### Batch Size
The batch size was tuned. The best batch size in the original paper was 32.

#### Epochs
The number of training epochs was tuned. The best-performing number of epochs in the original paper was 50.

#### Learning Rate
The learning rate was tuned. The best-performing learning rate in the original paper was 0.001.

#### Seed
The random seed was tuned in a range of 0-200 in steps of 5.
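
A search over these settings can be enumerated as a grid. The sketch below is illustrative only: apart from the reported best values (batch size 32, 50 epochs, learning rate 0.001), the candidate lists are hypothetical, and treating the seed range as inclusive of 200 is an assumption.

```python
from itertools import product

# Hypothetical candidate values; only 32, 50 and 0.001 are reported as best in the original paper
batch_sizes = [16, 32, 64]
epoch_options = [25, 50, 100]
learning_rates = [0.0001, 0.001, 0.01]

# Seeds 0, 5, ..., 200 (assuming the upper bound is inclusive)
seeds = list(range(0, 201, 5))

grid = list(product(batch_sizes, epoch_options, learning_rates))
print(len(grid), "hyperparameter combinations,", len(seeds), "seeds")
```

Each tuple in `grid` could be passed, together with a seed, to a training function such as the `train(mode, batch_size, epochs, learning_rate, seed)` function defined in this notebook.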


### Computational Requirements and Considerations:
#### Hardware
Using a T4 GPU or another GPU hardware accelerator is necessary to process the large amount of imaging data as well as the extremely large VCF files for the genetic data.

#### Runtime
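The runtime per epoch varies considerably. It depends mainly on the batch size and the amount of data: a smaller batch size means more gradient updates per epoch, which typically lengthens each epoch on a GPU, while the learning rate affects how many epochs are needed to converge rather than the time each epoch takes.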
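
One concrete driver of per-epoch runtime is the number of gradient updates, which is the number of training samples divided by the batch size, rounded up. The sketch below assumes roughly 193 samples are left for fitting once about 10% of the 239 overlapping patients is held out for testing and a further 10% for validation; the exact figure depends on how each library rounds the splits. The helper name `steps_per_epoch` is hypothetical.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of gradient updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

n_fit = 193  # assumed number of samples used for fitting (see lead-in)

for batch_size in (16, 32, 64):
    print(batch_size, steps_per_epoch(n_fit, batch_size))
```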

In [None]:
def train(mode, batch_size, epochs, learning_rate, seed):


    train_clinical = pd.read_csv("X_train_clinical.csv").drop("Unnamed: 0", axis=1).values
    test_clinical= pd.read_csv("X_test_clinical.csv").drop("Unnamed: 0", axis=1).values


    train_snp = pd.read_csv("X_train_snp.csv").drop("Unnamed: 0", axis=1).values
    test_snp = pd.read_csv("X_test_snp.csv").drop("Unnamed: 0", axis=1).values


    train_img= make_img("X_train_img.pkl")
    test_img= make_img("X_test_img.pkl")


    train_label= pd.read_csv("y_train.csv").drop("Unnamed: 0", axis=1).values.astype("int").flatten()
    test_label= pd.read_csv("y_test.csv").drop("Unnamed: 0", axis=1).values.astype("int").flatten()

    reset_random_seeds(seed)
    class_weights = compute_class_weight(class_weight = 'balanced',classes = np.unique(train_label),y = train_label)
    d_class_weights = dict(enumerate(class_weights))

    # compile model #
    model = multi_modal_model(mode, train_clinical, train_snp, train_img)
    model.compile(optimizer=Adam(learning_rate = learning_rate), loss='sparse_categorical_crossentropy', metrics=['sparse_categorical_accuracy'])


    # summarize results
    history = model.fit([train_clinical,
                         train_snp,
                         train_img],
                        train_label,
                        epochs=epochs,
                        batch_size=batch_size,
                        class_weight=d_class_weights,
                        validation_split=0.1,
                        verbose=1)



    score = model.evaluate([test_clinical, test_snp, test_img], test_label)

    acc = score[1]
    test_predictions = model.predict([test_clinical, test_snp, test_img])
    cr, precision_d, recall_d, thres = calc_confusion_matrix(test_predictions, test_label, mode, learning_rate, batch_size, epochs)


    """
    plt.clf()
    plt.plot(history.history['sparse_categorical_accuracy'])
    plt.plot(history.history['val_sparse_categorical_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()
    plt.savefig('accuracy_' + str(mode) + '_' + str(learning_rate) +'_' + str(batch_size)+'.png')
    plt.clf()
    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()
    plt.savefig('loss_' + str(mode) + '_' + str(learning_rate) +'_' + str(batch_size)+'.png')
    plt.clf()
    """



    # release gpu memory #
    tf.keras.backend.clear_session()
    del model, history
    gc.collect()


    print ('Mode: ', mode)
    print ('Batch size:  ', batch_size)
    print ('Learning rate: ', learning_rate)
    print ('Epochs:  ', epochs)
    print ('Test Accuracy:', '{0:.4f}'.format(acc))
    print ('-'*55)

    return acc, batch_size, learning_rate, epochs, seed

In [None]:
# Train the best-performing mode with five random seeds and keep the most accurate run
m_a = {}
seeds = random.sample(range(1, 200), 5)
for s in seeds:
    acc, bs_, lr_, e_ , seed= train('MM_SA_BA', 32, 50, 0.001, s)
    m_a[acc] = ('MM_SA_BA', acc, bs_, lr_, e_, seed)
print(m_a)
print ('-'*55)
max_acc = max(m_a, key=float)
print("Highest accuracy of: " + str(max_acc) + " with parameters: " + str(m_a[max_acc]))

# Results

Training the MADDi model ran into several challenges that proved fatal to reproducing the original paper's results, and training could not be completed.

Expected outcomes are included below.

In [None]:
## Expected Outcomes

# Upon completion of model training, we would expect the results to match or improve upon the following classification results.

import pandas as pd
from IPython.display import display

# Create a DataFrame with the performance metrics
data = {
    'Category': ['Control', 'Moderate Cognitive Impairment', 'Alzheimer’s Disease'],
    'Accuracy (%)': [96.66, 96.66, 100],
    'Precision (%)': [96.78, 90.00, 100],
    'Recall (%)': [98.88, 70.00, 100],
    'F1-Score (%)': [97.81, 76.66, 100]
}

df = pd.DataFrame(data)

display(df)

| | Category | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|
| 0 | Control | 96.66 | 96.78 | 98.88 | 97.81 |
| 1 | Moderate Cognitive Impairment | 96.66 | 90.0 | 70.0 | 76.66 |
| 2 | Alzheimer’s Disease | 100.0 | 100.0 | 100.0 | 100.0 |


### Ablations

#### Model Composition
Ablations were performed in the original paper to find the ideal model composition of MADDi. Models were trained with different "modes".
Different architecture features were ablated to determine the importance of the different data modalities.

The best-performing mode in the original paper was **Self Attention and Cross Modal Bi-directional Attention**.
#### Mode Types

*  **Cross Modal Bi-directional Attention**
*  **Self Attention**
*  **Self Attention and Cross Modal Bi-directional Attention**
*  **General concatenated model, no attention**
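
The modes differ in what is concatenated before the final output layer. The sketch below computes the width of that merged vector for each mode, based on the `multi_modal_model` function in this notebook: each modality network ends in a 50-unit dense layer, and it assumes each attention call returns a vector of the same width as its query (the default behavior of Keras `MultiHeadAttention` when no output shape is given, as far as I know). The helper name `merged_width` is hypothetical.

```python
EMBED_DIM = 50       # width of each modality network's final Dense layer
N_MODALITIES = 3     # imaging, clinical, genetic
N_PAIRS = 3          # image-clinical, genetic-image, clinical-genetic

def merged_width(mode):
    """Width of the concatenated vector fed to the output layer for a given mode."""
    dense_part = N_MODALITIES * EMBED_DIM    # raw modality embeddings, always included
    self_part = N_MODALITIES * EMBED_DIM     # one self-attention output per modality
    cross_part = N_PAIRS * 2 * EMBED_DIM     # two directions concatenated per pair
    if mode == 'MM_SA':
        return self_part + dense_part
    if mode in ('MM_BA', 'MM_SA_BA'):
        # In MM_SA_BA the self-attention outputs feed the cross-attention
        # and are not concatenated separately
        return cross_part + dense_part
    if mode == 'None':
        return dense_part
    raise ValueError("Mode must be one of 'MM_SA', 'MM_BA', 'MM_SA_BA' or 'None'.")

for mode in ('MM_BA', 'MM_SA', 'MM_SA_BA', 'None'):
    print(mode, merged_width(mode))
```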

## Model comparison

A comparison is given using the original paper's results in the absence of reproduced results in this project.

In [None]:
import pandas as pd
from IPython.display import display

comparison_data = {
    'Study': [
        'Bucholc et al, 2019',
        'Fang et al, 2020',
        'Abuhmed et al, 2021',
        'Venugopalan et al, 2021',
        'MADDi, Original Paper'
    ],
    'Modality': [
        'MRI, PET, Clinical',
        'MRI, PET',
        'MRI, PET, Clinical',
        'MRI, SNP, Clinical',
        'MRI, SNP, Clinical'
    ],
    'Accuracy (%)': [
        82.90,
        66.29,
        86.08,
        78.00,
        96.88
    ],
    'F1-Score (%)': [
        'Not reported',
        'Not reported',
        87.67,
        78.00,
        91.41
    ],
    'Method': [
        'SVM',
        'GDCA',
        'Multivariate BiLSTM',
        'DL + RF',
        'DL + Attention'
    ]
}

comparison_df = pd.DataFrame(comparison_data)

display(comparison_df)

| | Study | Modality | Accuracy (%) | F1-Score (%) | Method |
|---|---|---|---|---|---|
| 0 | Bucholc et al, 2019 | MRI, PET, Clinical | 82.9 | Not reported | SVM |
| 1 | Fang et al, 2020 | MRI, PET | 66.29 | Not reported | GDCA |
| 2 | Abuhmed et al, 2021 | MRI, PET, Clinical | 86.08 | 87.67 | Multivariate BiLSTM |
| 3 | Venugopalan et al, 2021 | MRI, SNP, Clinical | 78.0 | 78.0 | DL + RF |
| 4 | MADDi, Original Paper | MRI, SNP, Clinical | 96.88 | 91.41 | DL + Attention |
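
The improvement figures quoted earlier for MADDi (about 4% in F1-Score and 10.8% in Accuracy) can be checked against this comparison. The short sketch below takes the best prior value for each metric from the table and subtracts it from the MADDi value; the differences are in percentage points.

```python
# Values copied from the comparison table; None marks "Not reported"
prior = {
    'Bucholc et al, 2019': {'accuracy': 82.90, 'f1': None},
    'Fang et al, 2020': {'accuracy': 66.29, 'f1': None},
    'Abuhmed et al, 2021': {'accuracy': 86.08, 'f1': 87.67},
    'Venugopalan et al, 2021': {'accuracy': 78.00, 'f1': 78.00},
}
maddi = {'accuracy': 96.88, 'f1': 91.41}

best_prior_acc = max(v['accuracy'] for v in prior.values())
best_prior_f1 = max(v['f1'] for v in prior.values() if v['f1'] is not None)

acc_gain = round(maddi['accuracy'] - best_prior_acc, 2)
f1_gain = round(maddi['f1'] - best_prior_f1, 2)
print(f"Accuracy gain: {acc_gain} points, F1 gain: {f1_gain} points")
```

The F1 difference of 3.74 points rounds to the 4% stated in the text.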


## Discussion

### Reproducibility
The paper was not found to be easily reproducible.

#### Barriers to reproducibility
- **Data Access:** The ADNI data is not publicly accessible, and the process of downloading and working with the datasets involved in the paper is cumbersome. The genetic data is extremely large when downloaded, but is then processed down to a more manageable size.

- **Imprecise, Outdated Data Descriptions**: The original paper's description of the datasets used was imprecise and is now out of date. A lot of guesswork is involved in finding the proper datasets to train MADDi with. Since the original model was trained, ADNI3 was completed and ADNI4 data has also become available, crowding the data archives of LONI. Several GitHub issues with questions about the data used were opened in the original repo and received vague responses from the original researcher. The non-public nature of the dataset may have led the researcher to be overly cautious when sharing descriptions of the data, and the changing nature of the LONI website itself may also have contributed.

- **Data Size**: Without precise data instructions from the original paper, it is necessary to download large swathes of data which can then be preprocessed and pared down to the size of the original dataset. VCF files are extremely large, and so are the images in the MRI datasets. This introduces a complexity when developing in a Google Colab environment, where memory is limited and the transfer of data between Google Drives is slow.

#### Conducive to reproducibility
- **Model Architecture:** The model architecture is now extremely popular with the rise of transformers and attention, making the structure of the model involved easy to copy and improve upon with the range of research available to draw on.

### Next Phase
The next phase is to complete the training and perform additional ablations using the now complete ADNI3 dataset.

### Recommendations for Increasing Reproducibility
- **Detailed Descriptions of Data Acquisition** - adding more detailed descriptions of how the particular datasets were acquired within LONI would make reproducing MADDi extremely simple. The ADNI datasets within LONI are all named, or have defining features, and there are named filters as well. Instead of providing vague directions on which filters should be used, the explicit names of the filters can be provided to simplify the process.


# References

Alzheimer’s Association. 2023. 2024 Alzheimer’s disease facts and figures. https://www.alz.org/media/Documents/alzheimers-facts-and-figures.pdf. Accessed on 2024-03-24.

Michal Golovanevsky, Carsten Eickhoff, and Ritambhara Singh. 2022. Multimodal attention-based
deep learning for Alzheimer’s disease diagnosis. Journal of the American Medical Informatics
Association 29(12):2014–2022. https://doi.org/10.1093/jamia/ocac168.

