# Using `memento` to analyze Interferon-B response in monocytes

To install the pre-release version of `memento` (for Ye Lab members), install it directly from GitHub by running:

```pip install git+https://github.com/yelabucsf/scrna-parameter-estimation.git@release-v0.0.3```

This requires that you have access to the Ye Lab organization. 

In [1]:
# This is only for development purposes

import sys
sys.path.append('/data/home/Github/scrna-parameter-estimation/dist/memento-0.0.4-py3.7.egg')
import memento

ModuleNotFoundError: No module named 'estimator'

In [2]:
import scanpy as sc
import memento




ModuleNotFoundError: No module named 'bootstrap'

In [3]:
fig_path = '/data/home/Github/scrna-parameter-estimation/figures/fig4/'
data_path = '/data_volume/parameter_estimation/'

In [4]:
import pickle as pkl

### Read IFN data and filter for monocytes

For `memento`, we need the raw count matrix. Preferably, feed in the matrix with all genes so that we can choose which genes to look at later.

One of the columns in `adata.obs` should be the discrete groups to compare mean, variability, and co-variability across. In this case, it's called `stim`. 

The column containing the covariate that you want p-values for should either:
- Be binary (i.e., the column contains only two unique values, such as 'A' and 'B'). Here, the values are either 'stim' or 'ctrl'.
- Be numeric (e.g., the column contains -1, 0, or 1 for each genotype value).

I recommend changing the labels to something numeric (here, I use 0 for `ctrl` and 1 for `stim`). Otherwise, the sign of the DE/DV/DC test coefficients will be very hard to interpret.
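
To make the recoding explicit, here is a minimal sketch in plain Python. The helper `encode_labels` and the toy `labels` list are hypothetical stand-ins for the `stim` column; they are not part of `memento` or scanpy.

```python
def encode_labels(values, mapping):
    """Map each label to its numeric code, failing loudly on unknown labels."""
    unknown = sorted(set(values) - set(mapping))
    if unknown:
        raise ValueError(f"unmapped labels: {unknown}")
    return [mapping[v] for v in values]


# Hypothetical labels standing in for adata.obs['stim'].
labels = ['ctrl', 'stim', 'stim', 'ctrl']

# ctrl -> 0 and stim -> 1, so a positive coefficient means "higher in stim".
codes = encode_labels(labels, {'ctrl': 0, 'stim': 1})
print(codes)  # [0, 1, 1, 0]
```

An explicit mapping like this raises an error on a misspelled label, whereas an "everything that is not `ctrl` becomes 1" rule would silently assign it to the stimulated group.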

In [8]:
adata = sc.read(data_path + 'interferon_filtered.h5ad')
adata = adata[adata.obs.cell == 'CD14+ Monocytes'].copy()
print(adata)

AnnData object with n_obs × n_vars = 5341 × 35635
    obs: 'tsne1', 'tsne2', 'ind', 'stim', 'cluster', 'cell', 'multiplets', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'cell_type'
    var: 'gene_ids', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'cell_type_colors'
    obsm: 'X_tsne'


In [9]:
adata.obs['stim'] = adata.obs['stim'].apply(lambda x: 0 if x == 'ctrl' else 1)

In [10]:
adata.obs[['ind', 'stim', 'cell']].sample(5)

index,ind,stim,cell
AATCCGGAAGCAAA-1,1015,1,CD14+ Monocytes
ACGATTCTCACCAA-1,1015,1,CD14+ Monocytes
CGGCACGACTACCC-1,1256,1,CD14+ Monocytes
AGCATCGAACTCAG-1,1016,1,CD14+ Monocytes
CGTACCACAAAGTG-1,1256,1,CD14+ Monocytes


### Create groups for hypothesis testing and compute 1D parameters

`memento` groups cells by whatever labels define a meaningful group; here, we just divide the cells into `stim` and `ctrl`. We could further divide the cells by individual by adding the `ind` column to the `label_columns` argument when calling `create_groups`.
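
As a rough illustration of what adding a label column does to the grouping, the sketch below counts groups in plain Python. The `cells` list is made-up metadata, and this is not how `memento` builds its groups internally; it only shows how the number of groups grows.

```python
from collections import Counter

# Hypothetical per-cell metadata: (stim, ind) for six cells.
cells = [
    (0, '1015'), (0, '1016'), (1, '1015'),
    (1, '1015'), (1, '1016'), (0, '1015'),
]

# Grouping on stim alone gives one group per condition.
by_stim = Counter(stim for stim, _ in cells)

# Grouping on (stim, ind) gives one group per condition and individual.
by_stim_ind = Counter(cells)

print(len(by_stim), len(by_stim_ind))  # 2 4
```

Per the description above, the corresponding call would be `memento.create_groups(adata, label_columns=['stim', 'ind'])`. Keep in mind that each extra label column leaves fewer cells in each group.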

The values in `q_column` are rough estimates of the overall UMI efficiency across both sampling and sequencing. If `s` is the sequencing saturation, multiply `s` by 0.07 for 10X v1, 0.15 for v2, and 0.25 for v3. Because the value is stored per cell, you can enter a different number for each batch, since batches likely have different saturation levels. This will NOT account for wildly different sequencing scenarios.
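
The arithmetic is simple enough to sketch. The chemistry factors below are the ones quoted above; the function name, the batch names, and the saturation values are hypothetical.

```python
# Chemistry factors quoted in the text above (10X v1, v2, v3).
CHEMISTRY_FACTOR = {'v1': 0.07, 'v2': 0.15, 'v3': 0.25}


def capture_rate(saturation, chemistry):
    """Rough overall UMI efficiency: sequencing saturation times chemistry factor."""
    if not 0.0 <= saturation <= 1.0:
        raise ValueError("saturation must be a fraction between 0 and 1")
    return saturation * CHEMISTRY_FACTOR[chemistry]


# Hypothetical batches with made-up saturation values.
batch_info = {'batch_a': (0.80, 'v2'), 'batch_b': (0.60, 'v3')}
rates = {batch: capture_rate(s, chem) for batch, (s, chem) in batch_info.items()}
print(rates)
```

If `adata.obs` had a batch column (say `batch`, which this dataset may not have), something like `adata.obs['capture_rate'] = adata.obs['batch'].map(rates)` would give each cell the rate of its batch.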

By default, `memento` will consider all genes whose expression is high enough to calculate an accurate variance. If you wish to include fewer genes, increase `filter_mean_thresh`.

In [11]:
adata.obs['capture_rate'] = 0.07
memento.setup_memento(adata, q_column='capture_rate')

In [12]:
memento.create_groups(adata, label_columns=['stim'])

In [13]:
memento.compute_1d_moments(adata,
    min_perc_group=.9) # fraction of groups (0.9 = 90%) that must satisfy the condition for a gene to be considered

### Perform 1D hypothesis testing

`formula_like` determines the linear model that is used for hypothesis testing, while `cov_column` is used to pick out the variable that you actually want p-values for. 
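
To see why the 0/1 coding makes the sign easy to read, here is a generic ordinary least squares sketch for a model of the form `1 + stim`. This is only the textbook linear-model reading of the formula with made-up numbers; `memento`'s actual estimators and bootstrap are more involved, and the helper name is hypothetical.

```python
from statistics import mean


def fit_intercept_slope(x, y):
    """Ordinary least squares for y ~ 1 + x with a single covariate."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up per-group values: three ctrl (0) and three stim (1) observations.
stim = [0, 0, 0, 1, 1, 1]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

intercept, slope = fit_intercept_slope(stim, y)
print(intercept, slope)  # 2.0 3.0
```

With a 0/1 covariate, the intercept is the `ctrl` mean and the slope is the `stim` mean minus the `ctrl` mean, so a positive coefficient reads directly as "higher in `stim`".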

`num_cpus` controls how many CPUs to parallelize this operation across. In general, I recommend using 3-6 CPUs for reasonable performance on any of the AWS machines that we have access to (I'm currently using a c5.2xlarge instance with 8 vCPUs).

In [11]:
memento.ht_1d_moments(
    adata, 
    formula_like='1 + stim',
    cov_column='stim', 
    num_boot=5000, 
    verbose=1,
    num_cpus=6)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    2.9s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    7.5s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:   14.8s
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:   25.6s
[Parallel(n_jobs=6)]: Done 1238 tasks      | elapsed:   40.1s
[Parallel(n_jobs=6)]: Done 1788 tasks      | elapsed:   57.4s
[Parallel(n_jobs=6)]: Done 1877 out of 1877 | elapsed:  1.0min finished


In [12]:
result_1d = memento.get_1d_ht_result(adata)

In [13]:
result_1d.query('de_coef > 0').sort_values('de_pval').head(10)

index,gene,de_coef,de_pval,dv_coef,dv_pval
1527,CCL2,1.480777,6.327806e-10,-1.792969,3.574865e-43
602,SERPINB9,1.05397,6.670234e-10,-0.070163,0.8366
37,IFI6,2.743425,1.467053e-08,-2.228537,8.010761e-20
455,LAP3,1.667709,2.446649e-08,-1.012301,2.533813e-06
1831,APOBEC3A,3.580056,3.398756e-08,-2.219382,8.19819e-17
1594,MYL12A,1.146715,3.889374e-08,-0.307194,0.0072
0,ISG15,4.633936,6.150075e-08,-4.411395,1.155646e-26
1165,CLEC2B,0.912022,6.680043e-08,-0.159225,0.5968
472,CXCL10,5.641742,7.403303e-08,-4.737284,6.209386e-16
1771,NAPA,1.410677,7.991084e-08,-0.486402,0.174


In [14]:
result_1d.sort_values('dv_pval').head(10)

index,gene,de_coef,de_pval,dv_coef,dv_pval
1529,CCL8,5.955085,3.755407e-06,-4.213191,2.752558e-49
1527,CCL2,1.480777,6.327806e-10,-1.792969,3.574865e-43
1421,ISG20,3.650556,1.11858e-07,-2.98361,2.893859e-38
1039,IFITM3,3.396494,1.383244e-06,-3.291879,1.263268e-30
0,ISG15,4.633936,6.150075e-08,-4.411395,1.155646e-26
915,LY6E,3.435867,9.966522e-07,-3.306035,7.227040e-26
876,IDO1,3.935713,1.630819e-07,-2.202358,3.542871e-22
1528,CCL7,2.130995,5.637202e-06,-1.107335,2.390108e-21
217,RSAD2,4.847639,4.51046e-06,-2.512864,7.912491e-21
37,IFI6,2.743425,1.467053e-08,-2.228537,8.010761e-20


### Perform 2D hypothesis testing

For differential coexpression testing, we can specify which gene pairs to test. The function takes a list of gene pairs, where each element in the list is a tuple. Here, we focus on one transcription factor (IRF7) and its correlations with the rest of the transcriptome.

Similar to the 1D case, 2D hypothesis testing scales with the number of pairs of genes to test. If you have a smaller set of candidate genes, it will run faster.
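
If you do have a smaller candidate set, a sketch like the one below builds the pair list. The helper `make_gene_pairs` and the gene lists are hypothetical. It assumes that pairs should only include genes present in the dataset and that self-pairs are not wanted; both are my choices here, not documented `memento` requirements.

```python
import itertools


def make_gene_pairs(regulators, candidates, available):
    """Pair each regulator with each candidate gene, as a list of tuples."""
    available = set(available)
    regs = [g for g in regulators if g in available]
    cands = [g for g in candidates if g in available]
    # Skip self-pairs such as ('IRF7', 'IRF7').
    return [(a, b) for a, b in itertools.product(regs, cands) if a != b]


# Made-up stand-in for adata.var.index.tolist().
available_genes = ['IRF7', 'ISG15', 'IFI6', 'CXCL10', 'ACTB']

pairs = make_gene_pairs(
    ['IRF7'],
    ['ISG15', 'CXCL10', 'IRF7', 'STAT1'],  # STAT1 is absent from the stand-in list
    available_genes,
)
print(pairs)  # [('IRF7', 'ISG15'), ('IRF7', 'CXCL10')]
```

The resulting list has the same shape as the `gene_pairs` argument used in this section, just with far fewer entries to bootstrap.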

In [15]:
import itertools

In [16]:
gene_pairs = list(itertools.product(['IRF7'], adata.var.index.tolist()))

In [17]:
memento.compute_2d_moments(adata, gene_pairs)

In [18]:
memento.ht_2d_moments(
    adata, 
    formula_like='1 + stim', 
    cov_column='stim', 
    num_cpus=13, 
    num_boot=5000)

[Parallel(n_jobs=13)]: Using backend LokyBackend with 13 concurrent workers.
[Parallel(n_jobs=13)]: Done   6 tasks      | elapsed:    0.9s
[Parallel(n_jobs=13)]: Done 102 tasks      | elapsed:    6.7s
[Parallel(n_jobs=13)]: Done 262 tasks      | elapsed:   14.9s
[Parallel(n_jobs=13)]: Done 486 tasks      | elapsed:   25.8s
[Parallel(n_jobs=13)]: Done 774 tasks      | elapsed:   41.1s
[Parallel(n_jobs=13)]: Done 1126 tasks      | elapsed:   59.5s
[Parallel(n_jobs=13)]: Done 1542 tasks      | elapsed:  1.4min
[Parallel(n_jobs=13)]: Done 1876 out of 1876 | elapsed:  1.7min finished


In [19]:
result_2d = memento.get_2d_ht_result(adata)

In [20]:
result_2d.sort_values('corr_pval').head(10)

index,gene_1,gene_2,corr_coef,corr_pval
641,IRF7,HLA-DQA1,0.278193,0.000134
104,IRF7,GCLM,0.375144,0.000325
1688,IRF7,OAZ1,0.37138,0.000374
211,IRF7,GPR137B,0.392756,0.000415
716,IRF7,ACTB,0.270017,0.000472
1390,IRF7,ANXA2,0.272612,0.000517
638,IRF7,HLA-DRA,0.245894,0.00053
943,IRF7,CTSL,0.203125,0.000575
1148,IRF7,HSPA8,0.275049,0.000609
1406,IRF7,PKM,0.308369,0.000618


### Save your results

Some of the objects that `memento` attaches to `adata` don't play nicely with scanpy when writing to disk, so call the `prepare_to_save` function before saving.

In [13]:
memento.prepare_to_save(adata)

In [15]:
adata.write(data_path + 'ifn_tutorial.h5ad')

... storing 'memento_group' as categorical
