# README

This repository contains all of the analysis notebooks required to reproduce the manuscript __Advancement of aging in HIV-positive individuals revealed by epigenomic profiling__.  

This is a fairly complex analysis pipeline with many steps.  If you are planning on running the entire pipeline starting from raw data, this should be possible with the attached code, but I highly recommend you contact me (agross AT ucsd DOT edu) for assistance. Some steps in this process have relatively specialized requirements, such as high-memory machines or use of a compute cluster. 

In addition, I have tried to make available a number of intermediate files, which should allow for targeted re-analysis of the data or testing of new hypotheses. 

Finally, all of the analyses done here are represented in IPython notebooks.  This is meant to allow for __high level inspection__ of the analysis logic used in this study. I have done my best to document the notebooks so that they can be understood __without running the actual code__.  If you are interested in reproducing this study or conducting a similar one, I highly recommend __looking before you leap__ into trying to get the code to run.  

## Dependencies

This code uses a number of features in the scientific Python stack as well as a small set of standard R libraries. Thus far, this code has only been tested in a Linux environment, so it may take some modification to run on other operating systems.

I highly recommend installing a scientific Python distribution such as [Anaconda](http://continuum.io/) or [Enthought](https://www.enthought.com/) to handle the majority of the Python dependencies in this project (other than rPy2 and matplotlib_venn).  These are both free for academic use.

### Python Dependencies 

* [Numpy and Scipy](http://www.scipy.org/), numeric calculations and statistics in Python 
* [matplotlib](http://matplotlib.org/), plotting in Python
* [Pandas](http://pandas.pydata.org/), data-frames for Python, handles the majority of data-structures  
* [statsmodels](http://statsmodels.sourceforge.net/), used for statistics  
* [scikit-learn](http://scikit-learn.org/stable/), used for supervised learning
* [rPy2](http://rpy.sourceforge.net/rpy2.html), communication between R and Python  
  * __NOT IN DISTRIBUTIONS__  
  * I recommend installing with `pip install rpy2`  
  * Needs R to be compiled with shared libraries  
* [seaborn](http://stanford.edu/~mwaskom/software/seaborn/index.html) 
  * __NOT IN DISTRIBUTIONS__  
  * I recommend installing with `pip install seaborn` 
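
Before opening the notebooks, it can help to confirm that the Python packages listed above are visible to your interpreter. The snippet below is a minimal sketch, not part of this repository: the `missing_modules` helper and the `REQUIRED` list are illustrative names, and the list assumes the usual import names for each package (for example `sklearn` for scikit-learn).

```python
from importlib.util import find_spec

# Import names (not pip names) for the packages listed above.
# Assumption: these are the standard top-level import names, e.g.
# "sklearn" for scikit-learn; "matplotlib_venn" is the extra package
# mentioned in the Dependencies section.
REQUIRED = [
    "numpy",
    "scipy",
    "matplotlib",
    "pandas",
    "statsmodels",
    "sklearn",
    "rpy2",
    "seaborn",
    "matplotlib_venn",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing Python dependencies: " + ", ".join(missing))
else:
    print("All listed Python dependencies were found.")
```

Note that this only checks whether each package can be located. It does not verify that rPy2 can actually talk to your R installation, which still requires R to be compiled with shared libraries.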


### My Internal Package Dependencies

These are Python packages that I use internally for things such as statistics and visualization. They are all available on [my Github page](https://github.com/theandygross); I recommend downloading them and installing them with `python setup.py install`.  I apologize for the generic names. I am hoping to develop these a bit more and turn them into proper packages in my next code refactor.   

* [Figures](https://github.com/theandygross/Figures) 
  * Code for better figure generation, mainly using Pandas data-structures 
  * I am slowly phasing this out and replacing it with the very nice [seaborn](http://stanford.edu/~mwaskom/software/seaborn/index.html) library  
  
* [Stats](https://github.com/theandygross/Stats)  
  * Contains two packages, __Stats__ and __Helpers__ 
  * __Stats__ has a number of helper functions that wrap calls to R or scipy statistics functions and allow them to work more smoothly with Pandas data-structures  
  * __Helpers__ has functions for a number of common tasks that I invoke to make code a bit more readable

### R Dependencies