

This repository contains all of the analysis notebooks required to reproduce the manuscript Methylome-wide analysis of chronic HIV infection reveals five-year increase in biological age and epigenetic targeting of HLA.

This is a fairly complex analysis pipeline with many steps. If you are planning on running the entire pipeline starting from raw data, this should be possible with the attached code, but I highly recommend you contact me (the.andrew.gross AT gmail DOT com) for assistance. It's probably not going to run straight through. Some steps in this process have relatively specialized requirements, such as high-memory machines or use of a compute cluster.

In addition, I have also tried to make available a number of intermediate files which should allow for targeted re-analysis of the data or testing of new hypotheses.

Finally, all of the analyses done here are represented in IPython notebooks. This was meant to allow for high-level inspection of the analysis logic used in this study. I have done my best to document the notebooks so that they can be understood without running the actual code. If you are interested in reproducing this study or conducting a similar one, I highly recommend looking before you leap into trying to get the code to run.

For step-by-step running instructions, see Guide to Running.

Main Data Analysis

Here is where I do the main data analysis for the manuscript.

  • Unsupervised Age HIV Analysis
    Unsupervised analysis of age associated probes with the aim of showing a shared influence of age and HIV on the methylome. This contains the code for the generation of Figure 1.
  • HIV Age Advancement
    Here I read in the data and run the methylation age models.
  • HIV Age Advancement: Confounders
    Here I look at confounding from patients' blood composition, as well as the association of age advancement with the other clinical variables we have available.
  • Figure 2
    Generation of Figure 2 for the manuscript.
  • Validation_figure
    Generation of Figure 3 of the manuscript.
  • Figure 3
    Generation of Figure 4 for the manuscript. The manuscript's Figure 3 was added during revisions, so I am keeping the original notebook name to preserve its version-control history.
  • Figure 5_top
    Generation of Figure 5a and 5b for the manuscript. Also includes a look at general disorder in response to age and HIV, as well as some post-hoc analysis of the HLA and surrounding regions.
  • Figure_5_bottom
    Generation of Figure 5c-f for the manuscript.


This code uses a number of features in the scientific Python stack as well as a small set of standard R libraries. Thus far, this code has only been tested in a Linux environment; it may take some modification to run on other operating systems.

I highly recommend installing a scientific Python distribution such as Anaconda or Enthought to handle the majority of the Python dependencies in this project (other than rPy2 and matplotlib_venn). Both are free for academic use.

Python Dependencies

  • Numpy and Scipy, numeric calculations and statistics in Python
  • matplotlib, plotting in Python
  • Pandas, data-frames for Python, handles the majority of data-structures
  • statsmodels, used for statistics
  • scikit-learn, used for supervised learning
  • rPy2, communication between R and Python
    • I recommend installing with pip install rpy2
    • Needs R to be compiled with shared libraries
  • seaborn
    • I recommend installing with pip install seaborn
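
Before opening the notebooks, it can help to confirm that the Python dependencies above are importable. The following is a minimal sketch, not part of the repository: the helper name missing_modules is hypothetical, and the import names (for example sklearn for scikit-learn and rpy2 for rPy2) are assumed to be the usual ones for these packages.

```python
import importlib.util

# Import names assumed for the packages listed above
# (scikit-learn is imported as "sklearn", rPy2 as "rpy2").
REQUIRED_MODULES = [
    "numpy",
    "scipy",
    "matplotlib",
    "pandas",
    "statsmodels",
    "sklearn",
    "rpy2",
    "seaborn",
    "matplotlib_venn",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be located."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python dependencies: " + ", ".join(missing))
    else:
        print("All listed Python dependencies were found.")
```

This only checks that each module can be located; it does not verify versions or that rPy2 can actually reach a working R installation.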

My Internal Package Dependencies

These are Python packages that I use internally for things such as statistics and visualization. They are all available on my Github page; I recommend downloading them and installing them with python install. I apologize for the generic names. I am hoping to develop these a bit more and turn them into proper packages, up to spec, in my next code refactor.

  • Figures

    • Code for better figure generation, mainly using Pandas data-structures
    • I am slowly phasing this out and replacing with the very nice seaborn library
  • Stats

    • Contains two packages, Stats and Helpers
    • Stats has a number of helper functions that wrap calls to R or scipy statistics functions and allow them to play nicer with Pandas data-structures
    • Helpers has a number of common tasks that I invoke to make code a bit more readable
  • NotebookImport

    • Utility for importing IPython notebooks as modules
    • Code taken from MinRK's Gist
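
To give a rough idea of what importing a notebook as a module involves, here is a simplified, standard-library-only sketch. It is not the NotebookImport code or MinRK's Gist: the function name load_notebook_as_module is hypothetical, it assumes the nbformat 4 JSON layout (a top-level "cells" list), and it only handles cells containing plain Python, so cells with IPython magics would fail.

```python
import json
import types


def load_notebook_as_module(path, module_name="notebook_module"):
    """Execute the code cells of a .ipynb file inside a fresh module object."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)

    module = types.ModuleType(module_name)
    module.__file__ = path

    # Assumes nbformat 4: cells live at the top level, and each cell's
    # "source" is either one string or a list of line strings.
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        # Plain Python only: IPython magics (%, !) are not translated here.
        exec(compile(source, path, "exec"), module.__dict__)

    return module
```

Names defined in the notebook's code cells then become attributes of the returned module, which is what lets one notebook reuse the variables and functions of another.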

R Dependencies