
# FAES_BIOF509_FINAL

Predicting neurodegeneration from global proteomics

Project based on the [_drivendata/cookiecutter-data-science_](https://github.com/drivendata/cookiecutter-data-science) project structure

[![Code Style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)
[![GPL License](https://badges.frapsoft.com/os/gpl/gpl.svg?v=103)](https://opensource.org/licenses/GPL-3.0/)


## Inspiration
"Global quantitative analysis of the human brain proteome in Alzheimer’s and Parkinson’s Disease"
doi:10.1038/sdata.2018.36


## Challenges

1. Dimensionality - 40 samples (10 per group) and ~12000 features
1. Data wrangling - the published data is not particularly well structured


## Plan

1. Data wrangling
1. Exploratory data visualisation - volcano plot, tSNE
1. Dimensionality reduction/feature selection
1. Machine learning - comparison between classification and clustering
1. Validation - leave-one-out, as the sample size is so small
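
As a rough illustration of the last three steps, the sketch below runs leave-one-out validation over a pipeline that includes feature selection, so the selection is refit on each training fold. It is not the project's code: the data are synthetic random numbers shaped like the study (40 samples in 4 groups of 10), the feature count is cut to 500 for speed, and the choice of `SelectKBest`, `k=20` and `LogisticRegression` is an assumption made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the proteomics matrix (NOT the real data):
# 40 samples, 4 groups of 10, 500 random features
X = rng.normal(size=(40, 500))
y = np.repeat(np.arange(4), 10)

# Feature selection sits inside the pipeline so it only ever sees the
# training samples of each fold (hypothetical choice of steps and k)
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)

# One fold per sample: train on 39, test on the one held out
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
loo_accuracy = scores.mean()
print(f"Leave-one-out accuracy: {loo_accuracy:.2f}")
```

With so few samples, selecting features on the full data before cross-validating would leak information from the held-out sample, which is why the selector is placed inside the pipeline here.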


In [None]:
import pandas as pd


class CleanFrame(pd.DataFrame):
    """Subclassed DataFrame with extra methods for cleaning.

    Loading data often involves cleaning steps that have no direct equivalent in pandas.
    This class adds those steps on top of pandas.

    Methods
    -------
    clean_cols:
        Cleans column names by stripping leading and trailing white space, removing internal white space, and converting all characters to either lower or upper case
    filter_by_val:
        Select rows based on values in a given column
    """

    @property
    def _constructor(self):
        return CleanFrame
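
The `clean_cols` method named in the docstring is not shown here. A minimal sketch of what such a step might look like, written as a standalone function on a plain `DataFrame`, is given below. The function signature, the `case` parameter, and the choice to replace internal whitespace with underscores (the docstring only says it is removed) are all assumptions for illustration.

```python
import pandas as pd


def clean_cols(df: pd.DataFrame, case: str = "lower") -> pd.DataFrame:
    """Return a copy of df with tidied column names (illustrative sketch)."""
    if case not in ("lower", "upper"):
        raise ValueError("case must be 'lower' or 'upper'")
    # Strip outer whitespace, then collapse inner whitespace runs to "_"
    # (the underscore is an assumption, not taken from the project)
    cols = df.columns.astype(str).str.strip().str.replace(r"\s+", "_", regex=True)
    cols = cols.str.lower() if case == "lower" else cols.str.upper()
    out = df.copy()
    out.columns = cols
    return out


# Toy column names, invented for the example
example = pd.DataFrame({" Master Accession ": [1], "Q Score": [0.01]})
cleaned = clean_cols(example)
print(list(cleaned.columns))
```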


In [None]:
def filter_by_val(self, col="", vals=(), keep=True, inplace=False):
    # Type check inputs (report the argument name, not its value)
    for name, flag in (("keep", keep), ("inplace", inplace)):
        if not isinstance(flag, bool):
            raise ValueError(f"{name} must be a bool")
    if not isinstance(col, str):
        raise ValueError("col must be a str in self.columns")
    if not isinstance(vals, (list, tuple)):
        raise ValueError("vals must be a list or tuple")

    # Operate, checking whether to keep or discard
    if keep:
        new_data = self[self[col].isin(vals)]
    else:
        new_data = self[~self[col].isin(vals)]

    # self._update_inplace is from pandas.core.frame
    if inplace:
        self._update_inplace(new_data)
    else:
        return new_data
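
The core of `filter_by_val` is a boolean mask built with `Series.isin`, inverted with `~` when rows are to be discarded. The sketch below shows that pattern on a plain `DataFrame`; the column names and values are made up for the example.

```python
import pandas as pd

# Toy table, invented for illustration
samples = pd.DataFrame(
    {
        "group": ["AD", "PD", "Control", "ADPD"],
        "intensity": [1.2, 0.8, 1.0, 1.5],
    }
)

wanted = ["AD", "Control"]
mask = samples["group"].isin(wanted)

kept = samples[mask]      # equivalent to keep=True
dropped = samples[~mask]  # equivalent to keep=False

print(kept)
print(dropped)
```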


In [None]:
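import glob

import pandas as pd

# NOTE: `cf` is assumed to be the project module that defines CleanFrame;
# import it under that alias before calling make_data.


def make_data(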
    files, usecols=None, names=None, index_col=None, axis=0, join="outer", keys=None
):
    # Type check files
    if not isinstance(files, str):
        raise ValueError(f"files must be a str, not {type(files)}")
    # Find files
    paths = glob.iglob(files)
    # Read in files, sep=None with engine='python' will auto determine delim
    reads = (
        pd.read_csv(
            file,
            usecols=usecols,
            header=0,
            names=names,
            index_col=index_col,
            sep=None,
            engine="python",
        )
        for file in paths
    )
    # Convert to CleanFrame
    cfs = (cf.CleanFrame(i) for i in reads)
    # Clean data
    clean = (i.prep_data() for i in cfs)
    # Create final CleanFrame
    data = cf.CleanFrame(
        pd.concat(clean, axis=axis, join=join, keys=keys, sort=False, copy=False)
    )
    return data


In [None]:
if __name__ == "__main__":
    # Frontal cortex data
    frontal = make_data(
        "data/raw/f*",
        usecols=[2, 5, 9, 10, 72, 73, 74, 75, 76, 77, 78, 79],
        names=[
            "master",
            "accession",
            "q_score",
            "pep_score",
            "AD1",
            "AD2",
            "Control1",
            "Control2",
            "PD1",
            "PD2",
            "ADPD1",
            "ADPD2",
        ],
        index_col=1,
        axis=1,
        join="inner",
        keys=[1, 2, 3, 4, 5],
    )
    pd.to_pickle(frontal, "data/interim/frontal_full.pkl")
    # And again with the cingulate data...
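
The `axis=1`, `join="inner"` and `keys=[...]` arguments passed to `make_data` above are forwarded to `pd.concat`. The sketch below shows their effect on two tiny made-up frames: files are placed side by side, only rows whose index appears in every file survive, and `keys` adds an outer column level that identifies the source file. The protein IDs and values are invented.

```python
import pandas as pd

# Two toy "files" sharing a column name but not all row labels
batch1 = pd.DataFrame({"AD1": [1.0, 2.0, 3.0]}, index=["P1", "P2", "P3"])
batch2 = pd.DataFrame({"AD1": [4.0, 5.0]}, index=["P2", "P3"])

combined = pd.concat([batch1, batch2], axis=1, join="inner", keys=[1, 2])

# Columns are now a two-level MultiIndex: (file key, original column name)
print(combined)
print(combined.loc["P2", (1, "AD1")])
```

The inner join drops `P1` because it is missing from the second frame, which is the behaviour to expect when a protein is not quantified in every file.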


![cingulate_volcano_plots](reports/figures/Anterior_Cingulate_Gyrus_Volcano_Plots.png)
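
A volcano plot places each protein at (log2 fold change, -log10 p-value). The sketch below computes those two coordinates for synthetic data; it is not the code that produced the figure above. It assumes the intensities are already log2-transformed, so that a difference of group means is a log2 fold change, and it uses a plain two-sample t-test, which may differ from the test used in the project.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic log2 intensities (NOT real data): 50 proteins x 10 samples per group
control = rng.normal(loc=20.0, scale=1.0, size=(50, 10))
disease = rng.normal(loc=20.0, scale=1.0, size=(50, 10))
disease[:5] += 3.0  # shift the first five proteins so they stand out

# x-axis: difference of group means on the log2 scale
log2_fc = disease.mean(axis=1) - control.mean(axis=1)

# y-axis: -log10 of the per-protein t-test p-value
t_stat, p_val = stats.ttest_ind(disease, control, axis=1)
neg_log10_p = -np.log10(p_val)

print(log2_fc[:5])
print(neg_log10_p[:5])
```

With roughly 12000 proteins tested, the raw p-values would normally need a multiple-testing correction before being used to call hits.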


![tsne frontal](reports/figures/tsne_frontal.png)


![tsne_cingulate](reports/figures/tsne_cingulate.png)


In [None]:
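from yellowbrick.text import TSNEVisualizer

# Reducer for t-SNE; decompose=None skips the preliminary decomposition step
tsne_frontal = TSNEVisualizer(decompose=None, random_state=1, perplexity=10)
tsne_frontal.fit(X_frontal, y_frontal)
# poof() was renamed show() in yellowbrick 1.0
tsne_frontal.poof(outpath='reports/figures/tsne_frontal.png')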


![UMAP frontal](reports/figures/UMAP_Frontal.png)


![UMAP_cingulate](reports/figures/UMAP_Cingulate.png)


## And That's it...so far!

- Model selection with yellowbrick's classification report
- Model tuning - GridSearchCV, but also yellowbrick
   - Confusion matrix, ROCAUC, Precision-Recall, Validation Curve, Learning Curve
- Refactoring!
