# Data Processing Pipeline

## Processing Summary

* Primary processing done with MINFI R package, described in this [notebook](./Methylation_Normalization_MINFI.ipynb)  
    * All data quantile normalized together 
    * Needs large memory machine (~100GB)
* Cell-counts calculated using estimateCellCounts 
    * Uses flow-sorted data 
        * from [Houseman et al. paper](http://www.ncbi.nlm.nih.gov/pubmed/22568884)
        * Obtained from R Bioconductor package [FlowSorted.Blood.450k](http://www.bioconductor.org/packages/release/data/experiment/html/FlowSorted.Blood.450k.html)
        * Quantile normalized with rest of data for consistency
    * Benchmarked for HIV data in [Cell Composition Benchmark](../Benchmarks/Cell_Composition_Bechmark.ipynb)
* Detection p-values obtained by detectionP function
    * Processed and dumped to HDFS in [this notebook](../PreProcessing/save_detection_p_values.ipynb)  
* Data are read in from the flat .csv generated in the R pipeline and dumped into HDFS files in [this notebook](./PreProcessing/Pan_Study_Data.ipynb) 
    * Considerable savings in both time and memory usage 
    * I like to keep these HDFS files on my SSD to speed things up even more
* All data normalized to reference distribution via BMIQ 
    * Modified code from Horvath paper 
    * Use the Hannum data as the gold standard for the Hannum model, and Horvath's supplied gold standard for his model  
    * [Horvath pipeline script](../PreProcessing/BMIQ_Horvath.ipynb)  
    * [Hannum pipeline script](../PreProcessing/BMIQ_Normalization.ipynb)
* Each probe adjusted for cell-composition
    * Hannum: quantile normalization -> adjustment -> BMIQ
    * Horvath: quantile normalization -> BMIQ -> adjustment 
    * Exploration of this processing step is done [here](../Benchmarks/Cell_Composition_Bechmark.ipynb)
* Probe annotations obtained from R Bioconductor 450k annotation package [link](http://www.bioconductor.org/packages/release/data/annotation/html/IlluminaHumanMethylation450kanno.ilmn12.hg19.html), dumped into HDFS format in [this notebook](../PreProcessing/Compile_Probe_Annotations.ipynb)
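
For readers unfamiliar with the quantile normalization step mentioned above, here is a minimal pure-Python sketch of the general technique: every sample is forced onto a shared reference distribution (the mean of the sorted values across samples). This is only an illustration and is not the MINFI implementation; the function name `quantile_normalize` and the dict-of-lists input layout are assumptions made for the example.

```python
from statistics import mean


def quantile_normalize(samples):
    """Quantile normalize {sample_id: [values]} onto a shared distribution.

    Illustrative sketch only (not the MINFI code). Every sample must list
    the same probes in the same order.
    """
    ids = list(samples)
    n_probes = len(samples[ids[0]])
    if any(len(samples[s]) != n_probes for s in ids):
        raise ValueError("all samples must cover the same probes")

    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(samples[s]) for s in ids]
    reference = [mean(col[k] for col in sorted_cols) for k in range(n_probes)]

    normalized = {}
    for s in ids:
        values = samples[s]
        # Probe indices ordered by value (ties keep their original order).
        order = sorted(range(n_probes), key=lambda i: values[i])
        out = [0.0] * n_probes
        for rank, probe_idx in enumerate(order):
            # The probe with the rank-th smallest value gets the rank-th
            # reference value.
            out[probe_idx] = reference[rank]
        normalized[s] = out
    return normalized
```

After normalization every sample contains exactly the same set of values, and only the assignment of values to probes differs between samples. Normalizing the flow-sorted reference data together with the study data, as described above, means both end up on this one shared distribution.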

## Pipeline Dependencies

1. Process Raw Data
    * [Methylation_Normalization_MINFI](../PreProcessing/Methylation_Normalization_MINFI.ipynb) 
2. Prepare data, save into HDFS 
    * [Pan_Study_Data](../PreProcessing/Pan_Study_Data.ipynb) 
    * [save_detection_p_values](../PreProcessing/save_detection_p_values.ipynb)  
    * [Compile_Probe_Annotations](../PreProcessing/Compile_Probe_Annotations.ipynb)
3. Adjust for cell composition and perform BMIQ normalization 
    * [BMIQ_Horvath](../PreProcessing/BMIQ_Horvath.ipynb)
    * [BMIQ_Normalization](../PreProcessing/BMIQ_Normalization.ipynb)
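
The ordering above can also be written down as a small dependency graph, which is handy if the notebooks are ever run from a script. The sketch below uses the standard-library `graphlib` module (Python 3.9+). The notebook names are taken from the list, but the rule that every notebook waits for all notebooks in the previous stage is an assumption, and the helper names `build_dependencies` and `run_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Stages as listed above. Treating each stage as depending on the whole
# previous stage is an assumption, not something the notebooks enforce.
STAGES = [
    ["Methylation_Normalization_MINFI"],
    ["Pan_Study_Data", "save_detection_p_values", "Compile_Probe_Annotations"],
    ["BMIQ_Horvath", "BMIQ_Normalization"],
]


def build_dependencies(stages):
    """Map each notebook to the set of notebooks that must run before it."""
    deps = {}
    previous = []
    for stage in stages:
        for notebook in stage:
            deps[notebook] = set(previous)
        previous = stage
    return deps


def run_order(stages):
    """Return one valid execution order for the notebooks."""
    return list(TopologicalSorter(build_dependencies(stages)).static_order())


if __name__ == "__main__":
    for name in run_order(STAGES):
        print(name)
```

Notebooks within the same stage have no dependency on each other in this sketch, so their relative order in the output is not meaningful.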

## Datasets

#### HIV Dataset   
* Our dataset collected for this study 
* 142 cases, 50 controls  

#### Hannum Dataset 
* Data used in [Hannum et al.](http://www.ncbi.nlm.nih.gov/pubmed/23177740)
* Data available from GEO ([GSE40279](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40279)) 
* 908 patients 

#### EPIC Dataset 
* European Prospective Investigation into Cancer and Nutrition [link](http://www.ncbi.nlm.nih.gov/pubmed/12639222)
* Data available from GEO ([GSE51032](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51032)) 
* 845 patients

#### Summary

* Summarized in [Read HIV Data](../Setup.Read_HIV_Data) notebook 
* 142 cases, 50 controls
* All white, 2 with current alcohol use, 1 with cannabis  
* None are diabetic or HCV+ 
* All adherent to drugs

#### Filters (QC)

* 5 patients not treated with HAART therapy 
* 2 female controls removed 
* 2 controls removed due to data quality (MINFI pipeline, probe detection criteria) 

#### Filters (age models)

Looking at agreement between Hannum and Horvath models. The idea here is that when they disagree, it is likely a sign of poor data quality, or that our models do not do a great job of describing the aging process for a particular patient ([Relevant code](../HIV_Age_Advancement_Composition_Adj.ipynb#Quality-Control)).

* Drop 5 cases, 1 control 
* 2/6 samples have non-zero detection p-values for probes in one or both of the models
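
As a rough illustration of the agreement filter described above, the sketch below flags samples whose two predicted ages differ by more than a cutoff. It is not the code from the linked notebook: the function name `flag_model_disagreement`, the absolute-difference criterion, and the 10-year default are all assumptions chosen for the example.

```python
def flag_model_disagreement(hannum_age, horvath_age, max_diff_years=10.0):
    """Return sorted sample ids where the two age predictions disagree.

    hannum_age, horvath_age: {sample_id: predicted age in years}.
    max_diff_years: hypothetical cutoff, not the value used in the study.
    Only samples present in both dicts are compared.
    """
    shared = hannum_age.keys() & horvath_age.keys()
    return sorted(
        s for s in shared
        if abs(hannum_age[s] - horvath_age[s]) > max_diff_years
    )


# Made-up predictions, for illustration only.
hannum = {"s1": 40.0, "s2": 55.0, "s3": 30.0}
horvath = {"s1": 42.0, "s2": 70.0, "s3": 31.0}
flagged = flag_model_disagreement(hannum, horvath)
```

With these made-up values only `s2` is flagged, since its predictions differ by 15 years. Flagged samples could then be cross-checked against the detection p-values, as in the last bullet above.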