[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rsinghlab/pyaging/blob/main/tutorials/tutorial_dnam.ipynb) [![Open In nbviewer](https://img.shields.io/badge/View%20in-nbviewer-orange)](https://nbviewer.jupyter.org/github/rsinghlab/pyaging/blob/main/tutorials/tutorial_dnam.ipynb)

# RRBS DNA methylation

This tutorial focuses on predicting age from Mus musculus reduced-representation bisulfite sequencing (RRBS) data. A few clocks trained on RRBS data are available. It is also possible to use Horvath's mammalian clocks by mapping genomic locations to the probes on the Horvath methylation array.

In [1]:
import pandas as pd
import pyaging as pya
import os
import numpy as np

## Download and load example data

Let's download the publicly available dataset GSE130735, which contains RRBS samples from mouse. Because the data are RRBS, the number of CpG sites is very large (about 1.8 million in this subset).

In [2]:
pya.data.download_example_data('GSE130735')

|-----> 🏗️ Starting download_example_data function
|-----------> Data found in pyaging_data/GSE130735_subset.pkl
|-----> 🎉 Done! [0.5425s]


In [3]:
df = pd.read_pickle('pyaging_data/GSE130735_subset.pkl')

Note that the features for RRBS clocks are genomic coordinates written as `chr:position` (for example, `chr1:3020814`), as shown below.

In [4]:
df.head()

Unnamed: 0,chr1:3020814,chr1:3020842,chr1:3020877,chr1:3020891,chr1:3020945,chr1:3020971,chr1:3020987,chr1:3021012,chr1:3037802,chr1:3037820,...,chrY:1825397,chrY:4682362,chrY:32122892,chrY:85867071,chrY:85867083,chrY:85867117,chrY:85867137,chrY:85867139,chrY:85867178,chrY:88224179
GSM3752631,0.609,0.25,0.408,0.189,0.068,0.373,0.571,0.252,0.333,0.158,...,,,,,,,,,,
GSM3752625,,,0.973,0.984,0.912,0.915,0.987,0.974,0.991,0.932,...,,,,,,,,,,
GSM3752634,,,0.526,0.131,0.0,0.038,0.469,0.769,0.772,0.146,...,,,,,,,,,,
GSM3752620,0.931,0.92,0.988,0.949,0.897,0.921,0.907,0.958,1.0,0.867,...,,,,,,,,,,
GSM3752622,,,0.205,0.382,0.091,0.132,0.174,0.227,0.108,0.053,...,,,,,,,,,,
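
Because the clocks look up features by these coordinate labels, it can help to confirm that every column name follows the expected pattern before going further. The following is a minimal sketch, not part of pyaging: `parse_coordinate` is a hypothetical helper, and the pattern assumes labels of the form `chr<name>:<integer position>` with no underscores in the chromosome name.

```python
import re

# Hypothetical helper (not part of pyaging): split a label such as
# "chr1:3020814" into its chromosome name and integer position.
COORD_PATTERN = re.compile(r"^chr([0-9A-Za-z]+):(\d+)$")


def parse_coordinate(label):
    match = COORD_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unexpected feature label: {label!r}")
    return match.group(1), int(match.group(2))


# Two labels copied from the table above, plus one probe-style name
# that deliberately does not match the coordinate pattern.
labels = ["chr1:3020814", "chrY:88224179", "cg05347424"]

parsed = {}
malformed = []
for label in labels:
    try:
        parsed[label] = parse_coordinate(label)
    except ValueError:
        malformed.append(label)

print(parsed)
print(malformed)
```

On a real dataset the same loop could run over `df.columns`; any entries collected in `malformed` would be worth inspecting before prediction.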


## Convert data to AnnData object

AnnData objects are highly flexible and are thus our preferred method of organizing data for age prediction.

In [5]:
adata = pya.pp.df_to_adata(df, imputer_strategy='mean') # knn might be a bit slow

|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Create anndata object started
|-----> ✅ Create anndata object finished [0.9882s]
|-----> ⚙️ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0006s]
|-----> ⚙️ Log data statistics started
|-----------> There are 14 observations
|-----------> There are 1778324 features
|-----------> Total missing values: 6322346
|-----------> Percentage of missing values: 25.39%
|-----> ✅ Log data statistics finished [0.0205s]
|-----> ⚙️ Impute missing values started
|-----------> Imputing missing values using mean strategy
|-----> ✅ Impute missing values finished [0.4631s]
|-----> ⚙️ Add imputer strategy to adata.uns started
|-----> ✅ Add imputer strategy to adata.uns finished [0.0087s]
|-----> 🎉 Done! [1.4897s]
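
To make the logged statistics concrete, here is a small sketch of what a missing-value percentage and a mean imputation look like on a toy matrix. It assumes that the `mean` strategy replaces each missing value with the mean of the observed values for the same feature; this illustrates the idea with plain pandas and is not a description of pyaging's internals. All values are made up.

```python
import numpy as np
import pandas as pd

# Toy methylation matrix: rows are samples, columns are CpG sites.
# The values are invented for illustration.
toy = pd.DataFrame(
    {
        "chr1:100": [0.2, np.nan, 0.4],
        "chr1:200": [np.nan, np.nan, 0.9],
        "chr1:300": [0.5, 0.7, 0.6],
    },
    index=["sample_a", "sample_b", "sample_c"],
)

# Share of missing entries across the whole matrix, in percent.
percent_missing = 100 * toy.isna().to_numpy().mean()

# Replace each missing value with the mean of the observed values
# in the same column (feature).
imputed = toy.fillna(toy.mean(axis=0))

print(f"{percent_missing:.2f}% missing")
print(imputed)
```

With only 14 samples and roughly a quarter of the entries missing, many imputed values rest on very few observations, which is worth keeping in mind when interpreting the predictions.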


This is what the `adata` object looks like:

## Predict age with RRBS clocks

We can run one clock at a time or several at once. For convenience, let's pass all four available RRBS clocks together. The function is case-insensitive with respect to clock names.

In [6]:
pya.pred.predict_age(adata, ['Thompson', 'Meer', 'Petkovich', 'Stubbs'])

|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0033s]
|-----> 🕒 Processing clock: thompson
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/thompson.pt
|-----------> ✅ Load clock finished [0.5324s]
|-----------> ⚙️ Check features in adata started
|-----------------? 1 out of 582 features (0.17%) are missing: ['chr4:91376687'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_thompson]
|-----------> ⚠️ Check features in adata finished [0.0654s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> There is no postprocessing necessary
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0013s]
|-----------> ⚙️ Add predicted ages and clock metadata to adata star
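
The log reports that one Thompson feature is absent from the data and is filled with 0. The sketch below shows one way such an alignment could be done with pandas; it is an illustration under assumptions rather than pyaging's actual code. The three-feature `clock_features` list is hypothetical, apart from `chr4:91376687`, which is the site named in the log.

```python
import pandas as pd

# Toy data containing two of the three features a hypothetical clock expects.
toy = pd.DataFrame(
    {"chr1:100": [0.2, 0.4], "chr1:300": [0.5, 0.6]},
    index=["sample_a", "sample_b"],
)

# Hypothetical feature list for a clock. "chr4:91376687" is the site the
# log above reports as missing for the Thompson clock.
clock_features = ["chr1:100", "chr4:91376687", "chr1:300"]

missing = [f for f in clock_features if f not in toy.columns]
percent_missing = 100 * len(missing) / len(clock_features)

# Reorder columns to the clock's order and fill absent features with 0.
model_input = toy.reindex(columns=clock_features, fill_value=0.0)

print(missing, f"{percent_missing:.2f}%")
print(model_input)
```

Filling with 0 is harmless when very few features are missing, but the reported percentage is worth checking for each clock, since a high value means much of the model input is a constant.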

All of the age predictions are in units of months.

In [7]:
adata.obs.head()

| | thompson | meer | petkovich | stubbs |
|---|---|---|---|---|
| GSM3752631 | 19.634113 | 7.315183 | 8.075177 | 0.95777 |
| GSM3752625 | -1.410461 | 0.028221 | 2.953822 | -0.074265 |
| GSM3752634 | 61.058783 | 21.322178 | 9.640489 | 1.389193 |
| GSM3752620 | -2.663815 | 1.611947 | 3.019351 | -0.09271 |
| GSM3752622 | 20.594114 | 7.592145 | 7.104766 | 0.667168 |


Printing this much information can be overwhelming, particularly when running several clocks at once. In such cases, set `verbose=False`.

In [9]:
pya.data.download_example_data('GSE130735', verbose=False)
df = pd.read_pickle('pyaging_data/GSE130735_subset.pkl')
adata = pya.preprocess.df_to_adata(df, imputer_strategy='mean', verbose=False)
pya.pred.predict_age(adata, ['Thompson', 'Meer', 'Petkovich', 'Stubbs'], verbose=False)

In [10]:
adata.obs.head()

| | thompson | meer | petkovich | stubbs |
|---|---|---|---|---|
| GSM3752631 | 19.634113 | 7.315183 | 8.075177 | 0.95777 |
| GSM3752625 | -1.410461 | 0.028221 | 2.953822 | -0.074265 |
| GSM3752634 | 61.058783 | 21.322178 | 9.640489 | 1.389193 |
| GSM3752620 | -2.663815 | 1.611947 | 3.019351 | -0.09271 |
| GSM3752622 | 20.594114 | 7.592145 | 7.104766 | 0.667168 |


After age prediction, the predicted ages are added to `adata.obs`, one column per clock. The percentage of missing values for each clock and other metadata are stored in `adata.uns`.

In [11]:
adata

AnnData object with n_obs × n_vars = 14 × 1778324
    obs: 'thompson', 'meer', 'petkovich', 'stubbs'
    var: 'percent_na'
    uns: 'imputer_strategy', 'thompson_percent_na', 'thompson_missing_features', 'thompson_metadata', 'meer_percent_na', 'meer_missing_features', 'meer_metadata', 'petkovich_percent_na', 'petkovich_missing_features', 'petkovich_metadata', 'stubbs_percent_na', 'stubbs_missing_features', 'stubbs_metadata'
    layers: 'X_original', 'X_imputed'

## Predict age with mammalian clocks

We can predict age by converting the genomic locations directly into the probes from Horvath's methylation array. 

In [12]:
os.system('git clone https://github.com/shorvath/MammalianMethylationConsortium.git')

# Let's read the manifest from the mammalian consortium
annotation_df = pd.read_csv('MammalianMethylationConsortium/Annotations, Amin Haghani/Mammals/Mus_musculus.grcm38.100.HorvathMammalMethylChip40.v1.csv', index_col=0)
annotation_df = annotation_df[~annotation_df.seqnames.isna()]
mm_genomic_locations = 'chr' + annotation_df['seqnames'].astype(str) + ':' + annotation_df['CGstart'].astype(int).astype(str)
mm_genomic_locations = mm_genomic_locations.tolist()
mammalian_probes = annotation_df['CGid'].tolist()
mm_loc_to_probe = dict(zip(mm_genomic_locations, mammalian_probes))

# Let's get the previous RRBS dataset and filter only for the genomic locations in the manifest file
df_columns_set = set(df.columns)
mm_loc_to_probe_set = set(mm_loc_to_probe.keys())
common_columns = df_columns_set.intersection(mm_loc_to_probe_set)
df_converted = df[list(common_columns)].copy()

# Then, convert the genomic location to the probe name
df_converted.columns = [mm_loc_to_probe[col] for col in df_converted.columns]

# Let's remove the cloned repository now that the manifest has been read
os.system('rm -r MammalianMethylationConsortium')

0

In [13]:
df_converted.head()

Unnamed: 0,cg05347424,cg26718996,cg07727941,cg16852837,cg12870762,cg26080798,cg02899039,cg12839061,cg05267150,cg13170453,...,cg02179016,cg20836420,cg18831685,cg08992395,cg13679010,cg12982463,cg17146242,cg13649253,cg07588415,cg14814195
GSM3752631,,,0.0,0.0,0.015,0.0,0.005,,0.023,0.0,...,,0.0,0.028,,0.0,0.0,0.018,0.0,0.021,
GSM3752625,0.938,,0.0,,0.0,0.0,0.0,,0.596,,...,,0.0,,,0.0,0.895,0.227,0.156,0.025,
GSM3752634,0.125,,0.0,,0.627,0.017,0.033,,0.745,,...,,0.495,,,0.014,0.278,0.519,0.786,0.012,
GSM3752620,0.769,,0.091,0.0,0.07,0.006,0.012,,0.607,0.092,...,,0.01,0.054,,0.0,0.933,0.277,0.148,0.0,
GSM3752622,,,0.0,,0.0,0.0,0.0,,0.052,,...,,0.0,,,0.0,0.0,0.064,0.022,0.0,


Now we can finally put the dataframe into pyaging after defining the species as Mus musculus.

In [14]:
df_converted['Mus musculus'] = 1
adata_mammalian = pya.pp.df_to_adata(df_converted, imputer_strategy='mean')

|-----> 🏗️ Starting df_to_adata function
|-----> ⚙️ Create anndata object started
|-----> ✅ Create anndata object finished [0.0057s]
|-----> ⚙️ Add metadata to anndata started
|-----------? No metadata provided. Leaving adata.obs empty
|-----> ⚠️ Add metadata to anndata finished [0.0006s]
|-----> ⚙️ Log data statistics started
|-----------> There are 14 observations
|-----------> There are 5150 features
|-----------> Total missing values: 17862
|-----------> Percentage of missing values: 24.77%
|-----> ✅ Log data statistics finished [0.0013s]
|-----> ⚙️ Impute missing values started
|-----------> Imputing missing values using mean strategy
|-----> ✅ Impute missing values finished [0.0060s]
|-----> ⚙️ Add imputer strategy to adata.uns started
|-----> ✅ Add imputer strategy to adata.uns finished [0.0004s]
|-----> 🎉 Done! [0.0174s]


Let's use these five mammalian predictors.

In [15]:
pya.pred.predict_age(adata_mammalian, ['Mammalian1', 'Mammalian2', 'Mammalian3', 'MammalianLifespan', 'MammalianFemale'])

|-----> 🏗️ Starting predict_age function
|-----> ⚙️ Set PyTorch device started
|-----------> Using device: cpu
|-----> ✅ Set PyTorch device finished [0.0013s]
|-----> 🕒 Processing clock: mammalian1
|-----------> ⚙️ Load clock started
|-----------------> Data found in pyaging_data/mammalian1.pt
|-----------> ✅ Load clock finished [0.4780s]
|-----------> ⚙️ Check features in adata started
|-----------------? 274 out of 335 features (81.79%) are missing: ['cg00249943', 'cg00250826', 'cg00292639'], etc.
|-----------------> Filling missing features entirely with 0
|-----------------> Added prepared input matrix to adata.obsm[X_mammalian1]
|-----------> ⚠️ Check features in adata finished [0.0173s]
|-----------> ⚙️ Predict ages with model started
|-----------------> There is no preprocessing necessary
|-----------------> The postprocessing method is anti_logp2
|-----------------> in progress: 100.0000%
|-----------> ✅ Predict ages with model finished [0.0083s]
|-----------> ⚙️ Add predicted 

Note that RRBS clocks are in units of months whereas the mammalian clocks are in units of years.

In [16]:
adata_mammalian.obs

| | mammalian1 | mammalian2 | mammalian3 | mammalianlifespan | mammalianfemale |
|---|---|---|---|---|---|
| GSM3752631 | 2.537895 | -0.0098 | -0.048414 | 1.202134 | 0.732238 |
| GSM3752625 | 3.353935 | 0.064448 | -0.048203 | 1.48008 | 0.952105 |
| GSM3752634 | 4.49061 | 0.813899 | -0.035059 | 1.400278 | 0.978554 |
| GSM3752620 | 3.603802 | 0.122934 | -0.046958 | 1.644701 | 0.948646 |
| GSM3752622 | 2.951263 | 0.005617 | -0.047725 | 1.385282 | 0.741336 |
| GSM3752637 | 5.718515 | 0.895781 | -0.037649 | 1.399769 | 0.975114 |
| GSM4558216 | 7.456245 | 0.684752 | -0.012195 | 1.448119 | 0.78583 |
| GSM3752643 | 5.881943 | 0.880053 | -0.037656 | 1.408483 | 0.969979 |
| GSM4558213 | 6.72008 | 0.855574 | -0.026211 | 1.477236 | 0.821308 |
| GSM3752640 | 6.452934 | 0.766701 | -0.031763 | 1.371831 | 0.938162 |
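
Because of this unit difference, predictions from the two groups of clocks need a conversion before they can be placed side by side. A minimal sketch, assuming 12 months per year and using two Thompson values copied from the RRBS predictions earlier in this tutorial:

```python
# Thompson predictions (months) for two samples, copied from the RRBS output.
thompson_months = {"GSM3752631": 19.634113, "GSM3752634": 61.058783}

MONTHS_PER_YEAR = 12

# Convert to years so the values sit on the same scale as the mammalian clocks.
thompson_years = {
    sample: months / MONTHS_PER_YEAR for sample, months in thompson_months.items()
}

for sample, years in thompson_years.items():
    print(f"{sample}: {years:.2f} years")
```

The same division could be applied to a whole column of `adata.obs` before comparing it with a column of `adata_mammalian.obs`.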


## Get citation

The DOI, citation, and some metadata are automatically added to the AnnData object under `adata.uns['CLOCKNAME_metadata']`, where `CLOCKNAME` is the lowercase clock name.

In [17]:
adata.uns['thompson_metadata']

{'clock_name': 'thompson',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2018,
 'approved_by_author': '✅',
 'citation': 'Thompson, Michael J., et al. "A multi-tissue full lifespan epigenetic clock for mice." Aging (Albany NY) 10.10 (2018): 2832.',
 'doi': 'https://doi.org/10.18632/aging.101590',
 'notes': None,
 'version': None}

In [18]:
adata.uns['meer_metadata']

{'clock_name': 'meer',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2018,
 'approved_by_author': '⌛',
 'citation': 'Meer, Margarita V., et al. "A whole lifespan mouse multi-tissue DNA methylation clock." Elife 7 (2018): e40675.',
 'doi': 'https://doi.org/10.7554/eLife.40675',
 'notes': None,
 'version': None}

In [19]:
adata.uns['petkovich_metadata']

{'clock_name': 'petkovich',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2017,
 'approved_by_author': '⌛',
 'citation': 'Petkovich, Daniel A., et al. "Using DNA methylation profiling to evaluate biological age and longevity interventions." Cell metabolism 25.4 (2017): 954-960.',
 'doi': 'https://doi.org/10.1016/j.cmet.2017.03.016',
 'notes': None,
 'version': None}

In [20]:
adata.uns['stubbs_metadata']

{'clock_name': 'stubbs',
 'data_type': 'methylation',
 'species': 'Mus musculus',
 'year': 2017,
 'approved_by_author': '⌛',
 'citation': 'Stubbs, Thomas M., et al. "Multi-tissue DNA methylation age predictor in mouse." Genome biology 18 (2017): 1-14.',
 'doi': 'https://doi.org/10.1186/s13059-017-1203-5',
 'notes': None,
 'version': None}

In [21]:
adata_mammalian.uns['mammalian1_metadata']

{'clock_name': 'mammalian1',
 'data_type': 'methylation',
 'species': 'multi',
 'year': 2023,
 'approved_by_author': '⌛',
 'citation': 'Lu, A. T., et al. "Universal DNA methylation age across mammalian tissues." Nature aging 3.9 (2023): 1144-1166.',
 'doi': 'https://doi.org/10.1038/s43587-023-00462-6',
 'notes': None,
 'version': None}

In [22]:
adata_mammalian.uns['mammalianlifespan_metadata']

{'clock_name': 'mammalianlifespan',
 'data_type': 'methylation',
 'species': 'multi',
 'year': 2023,
 'approved_by_author': '⌛',
 'citation': 'Li, Caesar Z., et al. "Epigenetic predictors of species maximum lifespan and other life history traits in mammals." bioRxiv (2023): 2023-11.',
 'doi': 'https://doi.org/10.1101/2023.11.02.565286',
 'notes': None,
 'version': None}
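
When several clocks have been run, it can be convenient to gather all DOIs in one pass instead of inspecting each entry by hand. The sketch below uses a plain dictionary named `uns` as a stand-in for `adata.uns`, with two entries abridged from the outputs above; it relies only on the `_metadata` key suffix and on the `clock_name` and `doi` fields shown there.

```python
# Stand-in for `adata.uns`, with two entries abridged from the outputs above.
uns = {
    "imputer_strategy": "mean",
    "thompson_metadata": {
        "clock_name": "thompson",
        "year": 2018,
        "doi": "https://doi.org/10.18632/aging.101590",
    },
    "stubbs_metadata": {
        "clock_name": "stubbs",
        "year": 2017,
        "doi": "https://doi.org/10.1186/s13059-017-1203-5",
    },
}

# Keep only the per-clock metadata entries and map clock name -> DOI.
dois = {
    value["clock_name"]: value["doi"]
    for key, value in uns.items()
    if key.endswith("_metadata")
}

for clock, doi in sorted(dois.items()):
    print(f"{clock}: {doi}")
```

Replacing `uns` with `adata.uns` should give the same kind of listing for every clock that was run on that object.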