
# FAES_BIOF509_FINAL

Predicting neurodegeneration from global proteomics

Project based on the [_drivendata/cookiecutter-data-science_](https://github.com/drivendata/cookiecutter-data-science) project structure

[![Code Style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)
[![GPL License](https://badges.frapsoft.com/os/gpl/gpl.svg?v=103)](https://opensource.org/licenses/GPL-3.0/)


## Command line notebooks!

Shout out to Martin's nbless package!

```bash
nbless notebooks/slides/slide_* -o notebooks/index.ipynb
nbdeck notebooks/index.ipynb -o notebooks/index.ipynb
nbconv notebooks/index.ipynb -e slides -o index.html
cp index.html reports/index.html
```



## Inspiration
"Global quantitative analysis of the human brain proteome in Alzheimer’s and Parkinson’s Disease"
doi:10.1038/sdata.2018.36


## Challenges

1. Dimensionality - 40 samples (10 per group) and ~12000 features
1. Data wrangling - the published data is not particularly well structured


## Plan

1. Data wrangling
1. Exploratory data visualisation - volcano plot, UMAP
1. Dimensionality reduction/feature selection
1. Machine learning - comparison between classification and clustering
1. Validation - leave-one-out cross-validation, as the sample size is so small
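
The last three steps of the plan could be sketched roughly as below. This is an assumption-laden illustration, not the project's pipeline: the data is random noise shaped like the problem (40 samples, 4 groups of 10, and only 500 features to keep it fast), and the choice of `SelectKBest` and `LogisticRegression` from scikit-learn is purely illustrative. The point it shows is that feature selection sits inside the pipeline, so it is refit on each leave-one-out training fold and never sees the held-out sample.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the proteomics matrix (hypothetical data)
rng = np.random.default_rng(0)
n_samples, n_features = 40, 500
X = rng.normal(size=(n_samples, n_features))
y = np.repeat(np.arange(4), 10)  # 4 groups, 10 samples each

# Scaling and feature selection are part of the pipeline, so both are
# fitted only on the training portion of every fold
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)

# One fold per sample: train on 39, test on the remaining 1
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"Leave-one-out accuracy: {scores.mean():.2f}")
```

Each entry of `scores` is 0 or 1 (a single held-out sample), so only the mean is meaningful. On pure noise like this it should sit near chance.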


## Sub-classing DataFrames

```python
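import pandas as pd


class CleanFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # Ensure pandas operations return a CleanFrame, not a plain DataFrame
        return CleanFrame

    def filter_by_val(self, col="", vals=None, keep=True, inplace=False):
        # Avoid a mutable default argument
        vals = [] if vals is None else vals
        # Type check inputs
        # Operate, checking whether to keep or discard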
        if keep:
            new_data = self[self[col].isin(vals)]
        else:
            new_data = self[~self[col].isin(vals)]

        # self._update_inplace is a private pandas method (inherited from NDFrame)
        if inplace:
            self._update_inplace(new_data)
        else:
            return new_data
```
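
A short usage sketch of the idea, trimmed to the non-inplace path so it stands alone. The sample table and the group labels (`AD`, `PD`, `CTRL`) are hypothetical. It illustrates why `_constructor` is overridden: filtered results stay `CleanFrame` instances, so custom methods can be chained.

```python
import pandas as pd


class CleanFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # Results of pandas operations are built with this constructor
        return CleanFrame

    def filter_by_val(self, col, vals, keep=True):
        # Keep (or discard) rows whose value in `col` is in `vals`
        mask = self[col].isin(vals)
        return self[mask] if keep else self[~mask]


# Hypothetical sample sheet
samples = CleanFrame(
    {"sample": ["s1", "s2", "s3", "s4"], "group": ["AD", "PD", "CTRL", "AD"]}
)

disease = samples.filter_by_val("group", ["AD", "PD"])
no_ctrl = samples.filter_by_val("group", ["CTRL"], keep=False)

# Still a CleanFrame, so the custom method chains
ad_only = disease.filter_by_val("group", ["AD"])
print(type(ad_only).__name__, len(ad_only))
```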


## Make Data

```python
import glob
import pandas as pd
import src.data.CleanFrame as cf

def make_data(
    files, usecols=None, names=None, index_col=None, axis=0, join="outer", keys=None
):
    # Type check files
    # Find files
    paths = glob.iglob(files)
    # Read in files, sep=None with engine='python' will auto determine delim
    reads = (
        pd.read_csv(
            file,
            usecols=usecols,
            header=0,
            names=names,
            index_col=index_col,
            sep=None,
            engine="python",
        )
        for file in paths
    )
    # Convert to CleanFrame
    cfs = (cf.CleanFrame(i) for i in reads)
    # Clean data
    clean = (i.prep_data() for i in cfs)
    # Create final CleanFrame
    data = cf.CleanFrame(
        pd.concat(clean, axis=axis, join=join, keys=keys, sort=False, copy=False)
    )
    return data
```
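
To illustrate the two pandas features `make_data` leans on, here is a small sketch with in-memory text standing in for files. The column names and the keys `frontal` and `cingulate` are made up for the example. With `sep=None` and `engine="python"`, pandas sniffs the delimiter of each input, and `keys` labels each source in the concatenated index.

```python
import io

import pandas as pd

# Two hypothetical inputs with different delimiters
comma_text = "protein,ctrl_1,ad_1\nP1,1.0,2.0\nP2,3.0,4.0\n"
tab_text = "protein\tctrl_1\tad_1\nP3\t5.0\t6.0\n"

# sep=None with the python engine detects the delimiter per input
frames = [
    pd.read_csv(io.StringIO(text), sep=None, engine="python", index_col=0)
    for text in (comma_text, tab_text)
]

# keys adds an outer index level recording where each row came from
combined = pd.concat(
    frames, axis=0, join="outer", keys=["frontal", "cingulate"], sort=False
)
print(combined)
```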


## Volcano Plots

```python
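import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap


class CleanFrame(pd.DataFrame):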
    def volcano(
        self,
        x,
        y,
        is_log=True,
        fold_cut=0.585,
        q_cut=1.301,
        title="Volcano Plot",
        title_size=12,
        label_size=8,
        show=True,
        save=False,
        path="reports/figures/volcano.png",
    ):

        # Type check inputs
        # Log, if necessary
        # Create custom red, black, green color map
        cmap = LinearSegmentedColormap.from_list(
            "Volcano", [(1, 0, 0), (0, 0, 0), (0, 1, 0)], N=3
        )

        # Establish colors
        conditions = [(y >= q_cut) & (x >= fold_cut), (y >= q_cut) & (x <= -fold_cut)]
        choices = [2, 0]
        colors = np.select(conditions, choices, default=1)

        # Plot data
        plt.scatter(x, y, c=colors, cmap=cmap, s=2, alpha=0.7)
        plt.axvline(fold_cut, linestyle="--", color="gray", linewidth=1)
        plt.axvline(-fold_cut, linestyle="--", color="gray", linewidth=1)
        plt.axhline(q_cut, linestyle="--", color="gray", linewidth=1)

        # Plot settings
        sns.despine(offset=5, trim=False)
        # And others...

        # Show or save
        if save:
            plt.savefig(path, dpi=600)
        if show:
            plt.show()

```
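
The default cut-offs look like a 1.5-fold change on a log2 scale and a q-value of 0.05 on a -log10 scale. That reading is an inference from the numbers, not something stated in the code. The sketch below checks the arithmetic and applies the same `np.select` colour logic to a few made-up points.

```python
import numpy as np

# Assumed interpretation of the defaults
fold_cut = np.log2(1.5)   # ~0.585
q_cut = -np.log10(0.05)   # ~1.301

# Hypothetical values: log2 fold change and -log10(q-value)
log2_fc = np.array([1.2, -0.9, 0.1, 2.0])
neg_log10_q = np.array([2.0, 1.5, 3.0, 0.5])

# Same rule as in volcano(): significant and up -> 2, significant and down -> 0
conditions = [
    (neg_log10_q >= q_cut) & (log2_fc >= fold_cut),
    (neg_log10_q >= q_cut) & (log2_fc <= -fold_cut),
]
codes = np.select(conditions, [2, 0], default=1)
print(codes)
```

With the three-colour map red, black, green, code 0 is drawn red (down), 1 black (not significant) and 2 green (up).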


## Volcano Plots - Frontal Cortex - Alzheimer's

![Frontal_volcano_plot](reports/figures/Frontal_mean_ad.png)


## Volcano Plots - Frontal Cortex - Parkinson's

![Frontal_volcano_plot](reports/figures/Frontal_mean_pd.png)


## Volcano Plots - Frontal Cortex - Dual Diagnosis

![Frontal_volcano_plot](reports/figures/Frontal_mean_adpd.png)


## Volcano Plots - Anterior Cingulate Gyrus - Alzheimer's

![Cingulate_volcano_plot](reports/figures/Cingulate_mean_ad.png)


## Volcano Plots - Anterior Cingulate Gyrus - Parkinson's

![Cingulate_volcano_plot](reports/figures/Cingulate_mean_pd.png)


## Volcano Plots - Anterior Cingulate Gyrus - Dual Diagnosis

![Cingulate_volcano_plot](reports/figures/Cingulate_mean_adpd.png)


## UMAP Plots

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import umap


class CleanFrame(pd.DataFrame):
    def umap(
        self,
        X_list,
        y_name,
        plt_comp=(0, 1),
        title="UMAP Plot",
        title_size=12,
        label_size=8,
        show=True,
        save=False,
        path="reports/figures/umap.png",
        **kwargs,
    ):
        # Type check inputs
        # Reducer for umap
        X, y = self[X_list], self[y_name]
        reducer = umap.UMAP(random_state=1, **kwargs)
        embedding = reducer.fit_transform(X)

        # Create conditions/choices for colors, leave first for default
        # Plot UMAP
        plt.scatter(
            embedding[:, plt_comp[0]],
            embedding[:, plt_comp[1]],
            s=5,
            c=np.select(conditions, choices, 0),
            cmap="Spectral",
        )

        # Plot settings
        # Same as volcano

        # Show or save
        if save:
            plt.savefig(path, dpi=600)
        if show:
            plt.show()

```


## UMAP Plots - Labels

![UMAP frontal](reports/figures/Frontal_label.png) ![UMAP cingulate](reports/figures/Cingulate_label.png)


## UMAP Plots - Batch Effect

![UMAP frontal](reports/figures/Frontal_batch.png) ![UMAP cingulate](reports/figures/Cingulate_batch.png)


## UMAP Plots - Components 2 and 3

![UMAP frontal](reports/figures/Frontal_batch_23.png) ![UMAP cingulate](reports/figures/Cingulate_batch_23.png)


## UMAP Plots - Summary Data

![UMAP frontal](reports/figures/Frontal_sum_umap.png) ![UMAP cingulate](reports/figures/Cingulate_sum_umap.png)


## And That's it...so far!

- Batch effect removal
- Model selection with Yellowbrick's classification report
- Model tuning - GridSearchCV, but also Yellowbrick
   - Confusion matrix, ROCAUC, Precision-Recall, Validation Curve, Learning Curve
- Refactoring!
