# EBM Project Notebook
Now that we have learned how to use the EBM package, we are going to 
take the ADNI data set that we cleaned from the last project notebook
and use it to replicate the results from the *Brain* paper by Alex Young *et al.*

As a reminder, here are the features we are going to use:
* Cerebrospinal Fluid (CSF) INNO-BIA AlzBio3 immunoassay ('INNO')
  * Amyloid Beta 1-42
  * phosphorylated tau - log transformed to improve normality
  * total tau - log transformed to improve normality
* Volumetric measurements from **1.5T** magnetic resonance imaging (MRI), 
normalised by dividing by Intracranial Volume (ICV)
  * Whole brain volume
  * Ventricular volume
  * Entorhinal cortex volume
  * Hippocampal volume
  * Middle temporal cortex volume
  * Fusiform cortex volume 
* Direct measurements of atrophy between two timepoints using volumetric MRI
  * Annualised whole brain atrophy between 0 and **12 months** using Boundary Shift Integral (BSI)
  * Annualised hippocampal atrophy between 0 and **12 months** using Boundary Shift Integral (BSI)
* Cognitive measures
  * Mini mental state examination (MMSE)
  * ADAS-COG13
  * Rey Auditory Verbal Learning Test (RAVLT)

In [None]:
# Put any import steps that you want to keep here


## Step 1 - Open the data
First up, load up the spreadsheet that you created from the second project notebook.

In [None]:
# Your answer to Step 1
# Below put your code that will load in the data that you created from the ADNI data


## Step 2 - Initial Processing of data
Our spreadsheets from the first exercise still need a few more tweaks before running the EBM.

Our current MRI features are the raw volumes. However, the volumes of most brain structures are highly associated with the head size of the individual. We have a good proxy variable for head size, extracted from the image data, called *total intracranial volume* (TIV, the same quantity as the ICV in the feature list above). In order to reduce the effect of head size, Young *et al.* divided all of the brain structure volumes by TIV before putting them into the EBM. So remember to do the same with your features. 
However, there are two exceptions. The *boundary shift integral* or BSI measures do not measure the size of the brain structure but how much the size has **changed** over one year. These changes have not been shown to be associated with TIV and so they **do not** need to be normalised by TIV before the EBM.
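
A minimal sketch of the normalisation described above, run on a toy table. The column names (`TIV`, `Hippocampus`, `Ventricles`, `BSI_brain`) are placeholders chosen for illustration, so swap in the names from your own spreadsheet.

```python
import pandas as pd

# Toy data; the column names are placeholders, not the real ADNI headers.
df = pd.DataFrame({
    "TIV": [1500.0, 1600.0, 1400.0],
    "Hippocampus": [7.5, 6.4, 7.0],
    "Ventricles": [30.0, 48.0, 28.0],
    "BSI_brain": [0.5, 1.2, 0.4],  # atrophy rate: deliberately left untouched
})

# Only the raw volumes are normalised; the BSI columns are excluded.
volume_cols = ["Hippocampus", "Ventricles"]

# Divide each volume column by TIV, row by row.
df[volume_cols] = df[volume_cols].div(df["TIV"], axis=0)
```

Keeping the volume columns in an explicit list makes it harder to normalise a BSI column by accident.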

We will discuss this further in future steps, but we need to have a column that provides a special coding of the three values in the baseline diagnosis (`DX_bl`) column, where:
* Cognitively Normal (CN) = 0
* Alzheimer's Disease (AD) = 1
* Late Mild Cognitive Impairment (LMCI) = 2

Also, I suggest making some Python lists to hold key sets of column names:
* Key *metadata* that are helpful for processing but not going to be in the EBM, and 
* names of *features* that you want to include in the EBM ordering. 
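
One possible way to build the diagnosis coding and the two lists is sketched here on toy data. The column names `RID` and `MMSE` and the new column name `DX_code` are assumptions made for the example; only `DX_bl` and the 0/1/2 coding come from the text above.

```python
import pandas as pd

# Toy rows standing in for the cleaned ADNI spreadsheet.
df = pd.DataFrame({
    "RID": [1, 2, 3, 4],                   # placeholder subject identifier
    "DX_bl": ["CN", "LMCI", "AD", "CN"],
    "MMSE": [30, 27, 22, 29],
})

# Special coding: controls = 0, AD = 1, everyone else (LMCI) = 2.
dx_codes = {"CN": 0, "AD": 1, "LMCI": 2}
df["DX_code"] = df["DX_bl"].map(dx_codes)  # "DX_code" is a hypothetical name

# Metadata columns are kept for bookkeeping but never passed to the EBM.
metadata_cols = ["RID", "DX_bl", "DX_code"]
feature_cols = [c for c in df.columns if c not in metadata_cols]
```

Any diagnosis label missing from `dx_codes` maps to `NaN`, which is a quick way to spot unexpected values in `DX_bl`.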

In [None]:
# Your answer to Step 2
# Below put your code for normalising the MRI volumes by TIV
# and creating helpful lists of metadata variables and feature variables

## Step 3 - Visualise distributions of features
For each of the features, produce a plot showing the difference between the CN, LMCI, and AD groups.
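
As a starting point, here is a hedged sketch that draws one box plot per feature, split by diagnostic group. It uses simulated values and placeholder feature names; box plots are just one option, and histograms or violin plots would work equally well.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_features_by_group(df, feature_cols, group_col="DX_bl",
                           groups=("CN", "LMCI", "AD")):
    """Draw one box plot per feature, with one box per diagnostic group."""
    fig, axes = plt.subplots(1, len(feature_cols),
                             figsize=(4 * len(feature_cols), 3),
                             squeeze=False)
    for ax, feature in zip(axes[0], feature_cols):
        data = [df.loc[df[group_col] == g, feature].dropna().to_numpy()
                for g in groups]
        ax.boxplot(data)
        ax.set_xticks(range(1, len(groups) + 1))
        ax.set_xticklabels(groups)
        ax.set_title(feature)
    fig.tight_layout()
    return fig, axes[0]


# Simulated data standing in for the real spreadsheet.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "DX_bl": np.repeat(["CN", "LMCI", "AD"], 30),
    "MMSE": np.concatenate([rng.normal(29, 1, 30),
                            rng.normal(27, 2, 30),
                            rng.normal(23, 2, 30)]),
    "Hippocampus": np.concatenate([rng.normal(5.0, 0.5, 30),
                                   rng.normal(4.5, 0.5, 30),
                                   rng.normal(4.0, 0.5, 30)]),
})
fig, axes = plot_features_by_group(toy, ["MMSE", "Hippocampus"])
```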

In [None]:
# Your answer to Step 3
# Below put your code for visualising the differences in each feature according to disease diagnosis

## Step 4 - Statistical tests
After you have visually inspected the data, perform some statistical tests to assess whether there is evidence that the groups differ significantly from each other. Remember that there are three groups. The most important comparison for ensuring that there is a disease-specific signal is CN vs AD, as MCI may be caused by AD or by other etiologies.
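
The notebook does not prescribe a particular test, so the following is only one reasonable choice: a Kruskal-Wallis test across the three groups, followed by a Mann-Whitney U test for the key CN vs AD comparison. The data are simulated and the feature name is a placeholder.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated data standing in for one feature of the real spreadsheet.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "DX_bl": np.repeat(["CN", "LMCI", "AD"], 40),
    "MMSE": np.concatenate([rng.normal(29, 1, 40),
                            rng.normal(27, 2, 40),
                            rng.normal(23, 2, 40)]),
})

# Collect the values of the feature for each diagnostic group.
groups = {name: g["MMSE"].to_numpy() for name, g in toy.groupby("DX_bl")}

# Omnibus, non-parametric test across all three groups.
h_stat, p_omnibus = stats.kruskal(groups["CN"], groups["LMCI"], groups["AD"])

# The key pairwise comparison: CN vs AD.
u_stat, p_cn_ad = stats.mannwhitneyu(groups["CN"], groups["AD"],
                                     alternative="two-sided")
```

When this is repeated over all features, consider correcting the p-values for multiple comparisons.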

In [None]:
# Your answer to Step 4
# Below put your code for basic statistical analysis testing the difference in feature
# values between groups 


## Step 5 - Mixture model
In the walkthrough, we used the KDE approach to estimate, for each biomarker, the probability distributions of its values before and after the corresponding event has occurred. However, in the original implementation for the Young *et al.* paper, this was done using a parametric approach with Gaussian Mixture Models. This can be done using the `fit_all_gmm_models` command instead of the `fit_all_kde_models` command. Make sure to use the special coding of diagnosis as your labels for the EBM input.

One other difference is critically important. In the walkthrough we simulated 100 "completely healthy" individuals as controls, i.e. they had no evidence of abnormality in any of the biomarkers. In the ADNI cohort this is not the case. We know that Alzheimer's disease pathology begins very early in the process, in some cases 20 years before signs of cognitive impairment are so pronounced that they would meet the criteria for a diagnosis of probable AD. As a result, there are many controls in our group who actually have abnormal biomarkers. This is mentioned in the Young *et al.* paper:
> In sporadic Alzheimer’s disease, however, a significant proportion of the cognitively
> normal control group may have presymptomatic Alzheimer’s disease. To counter this,
> we fitted a mixture of two normal distributions to each biomarker separately using 
> data from all subjects to obtain the two models. To ensure a robust fit, 
> particularly for biomarkers where the distributions of the healthy and diseased 
> population overlap significantly, we constrain the standard deviations so that the 
> standard deviation of P(x|~E_i) and P(x|E_i) is less than or equal to that of the 
> cognitively normal and Alzheimer’s disease group, respectively.

The constraint mentioned above can be implemented in `fit_all_gmm_models` by setting the function argument `fit_all_subjects` to `True`. This is why we needed to have our labels such that cognitively normal subjects were 0 and AD subjects were 1, so that the constraints would be applied to the correct groups. 

After the mixture models have been fitted for each feature, make sure to take a look at them and see if they make sense.
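
To make the quoted constraint concrete, here is a standalone sketch for a single biomarker. It is not the package's implementation: it fits a two-Gaussian mixture to all subjects by direct maximum likelihood with `scipy.optimize.minimize`, and bounds each standard deviation by that of the CN (label 0) or AD (label 1) group. The function name, the starting values and the simulated data are all assumptions for the example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def fit_constrained_mixture(x, labels):
    """Fit a two-Gaussian mixture to all of x with bounded standard deviations.

    labels: 0 = CN, 1 = AD; other codes (e.g. 2 = LMCI) contribute to the
    fit but not to the bounds.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    cn, ad = x[labels == 0], x[labels == 1]
    sd_cn, sd_ad = cn.std(ddof=1), ad.std(ddof=1)

    def neg_log_lik(theta):
        mu_n, sd_n, mu_e, sd_e, w = theta
        lik = (1 - w) * norm.pdf(x, mu_n, sd_n) + w * norm.pdf(x, mu_e, sd_e)
        return -np.sum(np.log(lik + 1e-300))  # small offset guards log(0)

    spread = x.max() - x.min()
    start = [cn.mean(), sd_cn, ad.mean(), sd_ad, 0.5]
    bounds = [
        (x.min(), x.max()), (1e-3 * spread, sd_cn),  # sd(not E) <= sd(CN)
        (x.min(), x.max()), (1e-3 * spread, sd_ad),  # sd(E)     <= sd(AD)
        (0.01, 0.99),                                # mixing weight
    ]
    result = minimize(neg_log_lik, start, bounds=bounds, method="L-BFGS-B")
    names = ["mu_normal", "sd_normal", "mu_event", "sd_event", "weight_event"]
    return dict(zip(names, result.x))


# Simulated biomarker: 20% of the controls are already abnormal.
rng = np.random.default_rng(0)
x = np.concatenate([
    rng.normal(0, 1, 80), rng.normal(3, 1, 20),   # CN
    rng.normal(3, 1, 100),                        # AD
    rng.normal(0, 1, 50), rng.normal(3, 1, 50),   # LMCI
])
labels = np.concatenate([np.zeros(100), np.ones(100), np.full(100, 2)])
params = fit_constrained_mixture(x, labels)
```

Because some simulated controls are abnormal, the standard deviation of the CN group is inflated, and the fitted `sd_normal` can fall below that upper bound.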

In [None]:
# Your answer to Step 5
# Below put your code for performing the mixture modelling

## Step 6 - Running the EBM
Now that you have the mixture models, it is time to run the EBM. In the code cell(s) below, please run your MCMC sampling.
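
For intuition about what the sampler does, the sketch below implements a bare-bones Metropolis sampler over event orderings on simulated data. It is an illustration under stated assumptions (uniform prior over stages, proposals that swap two events, hypothetical function names), not the package's own MCMC routine, which you should use for the actual exercise.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm


def ordering_log_likelihood(log_yes, log_no, ordering):
    """Log-likelihood of an ordering, summed over subjects.

    log_yes[j, i] = log P(x_ij | E_i); log_no[j, i] = log P(x_ij | not E_i).
    Stages 0..N are given a uniform prior.
    """
    ly, ln = log_yes[:, ordering], log_no[:, ordering]
    zeros = np.zeros((ly.shape[0], 1))
    # Column k holds the contribution of the first k events having occurred...
    occurred = np.hstack([zeros, np.cumsum(ly, axis=1)])
    # ...and of the remaining events not having occurred yet.
    pending = np.hstack([np.cumsum(ln[:, ::-1], axis=1)[:, ::-1], zeros])
    n_stages = occurred.shape[1]
    return np.sum(logsumexp(occurred + pending, axis=1) - np.log(n_stages))


def mcmc_orderings(log_yes, log_no, n_iter=2000, seed=0):
    """Metropolis sampling: propose swapping two events, accept by ratio."""
    rng = np.random.default_rng(seed)
    n_events = log_yes.shape[1]
    current = rng.permutation(n_events)
    current_ll = ordering_log_likelihood(log_yes, log_no, current)
    samples = np.empty((n_iter, n_events), dtype=int)
    log_liks = np.empty(n_iter)
    for it in range(n_iter):
        i, j = rng.choice(n_events, size=2, replace=False)
        proposal = current.copy()
        proposal[[i, j]] = proposal[[j, i]]
        proposal_ll = ordering_log_likelihood(log_yes, log_no, proposal)
        if np.log(rng.random()) < proposal_ll - current_ll:
            current, current_ll = proposal, proposal_ll
        samples[it] = current
        log_liks[it] = current_ll
    return samples, log_liks


# Simulated cohort: each biomarker shifts by 3 once its event has occurred.
rng = np.random.default_rng(1)
true_ordering = np.array([2, 0, 3, 1])
n_subjects, n_events = 200, 4
true_stages = rng.integers(0, n_events + 1, size=n_subjects)
x = rng.normal(0.0, 1.0, size=(n_subjects, n_events))
for subj, stage in enumerate(true_stages):
    x[subj, true_ordering[:stage]] += 3.0

log_yes = norm.logpdf(x, 3.0, 1.0)
log_no = norm.logpdf(x, 0.0, 1.0)
samples, log_liks = mcmc_orderings(log_yes, log_no)
best_ordering = samples[np.argmax(log_liks)]
```

Here `best_ordering` lists event indices from earliest to latest. With real data the event likelihoods come from the fitted mixture models rather than from known Gaussians.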

In [None]:
# Your answer to Step 6
# Below put your code for performing the MCMC sampling
# and print out the most likely sample

## Step 7 - Positional Variance Diagram
Perform the necessary steps to get the positional variance diagram (PVD). For which features are we confident of the position, and for which are we not? Does your positional variance diagram agree with the one in the paper? In what ways do they differ? What are the best practices for producing a PVD?
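
The matrix behind a positional variance diagram can be sketched in a few lines: entry `[e, p]` is the fraction of MCMC samples in which event `e` sits at position `p`. The tiny `samples` array below is made up for the example, and the function name is hypothetical.

```python
import numpy as np


def positional_variance(samples):
    """Return pvd[event, position] = share of samples with event at position."""
    n_samples, n_events = samples.shape
    pvd = np.zeros((n_events, n_events))
    for position in range(n_events):
        events, counts = np.unique(samples[:, position], return_counts=True)
        pvd[events, position] = counts / n_samples
    return pvd


# Four made-up orderings of three events (each row is one MCMC sample).
samples = np.array([
    [0, 1, 2],
    [0, 2, 1],
    [0, 1, 2],
    [1, 0, 2],
])
pvd = positional_variance(samples)
```

A sharp diagonal after sorting the rows by the most likely ordering indicates confident positions, while smeared rows flag events whose position is uncertain. The matrix can be displayed as a heat map, for example with matplotlib's `imshow`.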

In [None]:
# Your answer to Step 7
# Below put your code for producing the positional variance diagram


## Step 8 - Staging
Validate the model by looking at the stage assigned to each individual. What stages are the controls in? What does the staging of the MCI group look like compared to that of the AD group?
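
Staging assigns each subject the stage with the highest likelihood under a given ordering. The sketch below shows that calculation on hand-made probabilities, assuming you already have per-subject values of P(x | E) and P(x | not E) from the mixture models. The function name and the numbers are illustrative only.

```python
import numpy as np


def stage_subjects(p_yes, p_no, ordering):
    """Assign each subject the most likely stage under a fixed ordering.

    p_yes[j, i] = P(x_ij | E_i); p_no[j, i] = P(x_ij | not E_i).
    Stage k means the first k events in `ordering` have occurred.
    """
    log_yes = np.log(p_yes[:, ordering])
    log_no = np.log(p_no[:, ordering])
    zeros = np.zeros((log_yes.shape[0], 1))
    occurred = np.hstack([zeros, np.cumsum(log_yes, axis=1)])
    pending = np.hstack([np.cumsum(log_no[:, ::-1], axis=1)[:, ::-1], zeros])
    stage_log_lik = occurred + pending
    return stage_log_lik.argmax(axis=1), stage_log_lik


ordering = np.array([2, 0, 1])   # event 2 first, then event 0, then event 1
p_yes = np.array([
    [0.1, 0.1, 0.1],   # subject A: every biomarker looks normal
    [0.1, 0.1, 0.9],   # subject B: only biomarker 2 looks abnormal
    [0.9, 0.9, 0.9],   # subject C: every biomarker looks abnormal
])
p_no = 1.0 - p_yes     # toy choice so that the two values sum to one
stages, stage_log_lik = stage_subjects(p_yes, p_no, ordering)
```

Cross-tabulating `stages` against the diagnosis column, for instance with `pandas.crosstab`, gives a quick view of where the CN, LMCI and AD groups fall.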

In [None]:
# Your answer to Step 8
# Below put your code for performing the staging analysis of your data

## Step 9 - Explore
**Congratulations**, you have succeeded in the Team Coders task! Now the goal is to take what you have learned and think about what questions you would like to answer with the EBM. What other analyses were done in the paper? What things could we change?
Here are some potential things you may want to try:
* Use the KDE method of mixture modelling. How does the ordering change? Looking at what we stated in Step 5, why might these orderings be different?
* Look at specific populations, for example individuals who carry the risk gene *APOE* $\epsilon4$.
* Try more principled methods of estimating uncertainty, such as cross-validation.
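
For the cross-validation idea, one possible skeleton is sketched below: refit the ordering on each training fold and compare the resulting orderings. It uses scikit-learn's `StratifiedKFold`. The `toy_fit_ordering` function is a deliberately crude placeholder (it ranks features by CN vs AD mean difference), and in practice it would be replaced by your mixture-model fit plus MCMC run.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def cross_validated_orderings(X, y, fit_ordering, n_splits=5, seed=0):
    """Refit the ordering on each training fold and stack the results."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return np.array([fit_ordering(X[train], y[train])
                     for train, _ in folds.split(X, y)])


def toy_fit_ordering(X, y):
    # Placeholder for the real EBM fit: largest CN-vs-AD difference first.
    effect = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return np.argsort(-effect)


# Simulated data: 50 controls (0) and 50 AD subjects (1), three features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(0.0, 1.0, size=(100, 3))
X[y == 1] += np.array([1.0, 5.0, 3.0])

orderings = cross_validated_orderings(X, y, toy_fit_ordering)
```

Each row of `orderings` is the ordering from one fold, so the array can be summarised in the same way as a set of MCMC samples.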

In [None]:
# Your answer to Step 9
# Below put the additional experiments that you have tried and 
# summarize the findings as part of the presentation.