# Using `memento` to analyze Interferon-B response in monocytes

To install the pre-release version of `memento` (for Ye Lab members), install it directly from GitHub by running:

```pip install git+https://github.com/yelabucsf/scrna-parameter-estimation.git@release-v0.0.3```

This requires that you have access to the Ye Lab organization. 

In [1]:
# This is only for development purposes

import sys
sys.path.append('/data/home/Github/scrna-parameter-estimation/dist/memento-0.0.3-py3.7.egg')
import memento



In [2]:
import scanpy as sc
import memento

In [3]:
fig_path = '/data/home/Github/scrna-parameter-estimation/figures/fig4/'
data_path = '/data_volume/parameter_estimation/'

In [4]:
import pickle as pkl

### Read IFN data and filter for monocytes

For `memento`, we need the raw count matrix. Preferably, provide the matrix with all genes so that we can choose which genes to look at.

One of the columns in `adata.obs` should contain the discrete groups across which to compare mean, variability, and co-variability. In this case, it's called `stim`.

The column containing the covariate that you want p-values for should either:
- Be binary (the column contains only two unique values, such as 'A' and 'B'). Here, the values are either 'stim' or 'ctrl'.
- Be numeric (for example, the column contains -1, 0, or 1 for each genotype value).

I recommend changing the labels to something numeric (here, I use 0 for `ctrl` and 1 for `stim`). Otherwise, the sign of the DE/EV/DC test results will be very hard to interpret.
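
The recoding can also be written as an explicit mapping, which makes the direction of the comparison visible and catches unexpected labels. This is a minimal sketch on a toy pandas DataFrame; the rows are made up, and only the `stim` column name and the `ctrl`/`stim` labels come from this dataset.

```python
import pandas as pd

# Toy stand-in for adata.obs; the rows are hypothetical.
obs = pd.DataFrame({'stim': ['ctrl', 'stim', 'stim', 'ctrl']})

# Explicit mapping: control is 0 and stimulated is 1, so a positive
# coefficient means "higher in stimulated cells".
label_to_code = {'ctrl': 0, 'stim': 1}
obs['stim'] = obs['stim'].map(label_to_code)

# .map() yields NaN for labels missing from the mapping, so check for that
# before casting to integers.
assert obs['stim'].notna().all(), 'found a label outside the mapping'
obs['stim'] = obs['stim'].astype(int)

print(obs['stim'].tolist())
```

With the mapping written out, it is easy to confirm which condition a positive coefficient refers to when reading the results.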

In [5]:
adata = sc.read(data_path + 'interferon_filtered.h5ad')
adata = adata[adata.obs.cell == 'CD14+ Monocytes'].copy()
print(adata)

AnnData object with n_obs × n_vars = 5341 × 35635
    obs: 'tsne1', 'tsne2', 'ind', 'stim', 'cluster', 'cell', 'multiplets', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'cell_type'
    var: 'gene_ids', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'cell_type_colors'
    obsm: 'X_tsne'


In [6]:
adata.obs['stim'] = adata.obs['stim'].apply(lambda x: 0 if x == 'ctrl' else 1)

In [7]:
adata.obs[['ind', 'stim', 'cell']].sample(5)

| index | ind | stim | cell |
| --- | --- | --- | --- |
| TTCCAAACGTTCTT-1 | 1016 | 0 | CD14+ Monocytes |
| GGTACTGACGCAAT-1 | 1256 | 0 | CD14+ Monocytes |
| CCAGCTACACGGGA-1 | 1015 | 1 | CD14+ Monocytes |
| GCTTAACTCCTTGC-1 | 1244 | 0 | CD14+ Monocytes |
| GGTTGAACAGTCAC-1 | 1488 | 1 | CD14+ Monocytes |


### Create groups for hypothesis testing and compute 1D parameters

`memento` groups cells by any combination of `adata.obs` columns that defines a reasonable group; here, we simply divide the cells into `stim` and `ctrl`. We can further divide the cells by individual by adding the `ind` column to the `label_columns` argument when calling `create_groups`.

The values in the `q_column` are rough estimates of the overall UMI efficiency across both sampling and sequencing. If `s` is the sequencing saturation, multiply `s` by 0.07 for 10X v1, 0.15 for v2, and 0.25 for v3. Because this is a per-cell column, you can enter a different number for each batch, since batches likely have different saturation. This will NOT account for wildly different sequencing scenarios.
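
The per-batch calculation can be sketched with pandas. The `batch` column, the batch names, and the saturation values below are hypothetical; only the chemistry multipliers come from the paragraph above.

```python
import pandas as pd

# Rough efficiency per 10X chemistry, as quoted above.
CHEMISTRY_EFFICIENCY = {'v1': 0.07, 'v2': 0.15, 'v3': 0.25}

# Hypothetical per-batch metadata: chemistry and sequencing saturation.
batch_info = pd.DataFrame({
    'batch': ['A', 'B'],
    'chemistry': ['v1', 'v2'],
    'saturation': [0.9, 0.8],
}).set_index('batch')

# q = saturation * chemistry efficiency, one value per batch.
batch_info['capture_rate'] = (
    batch_info['saturation'] * batch_info['chemistry'].map(CHEMISTRY_EFFICIENCY)
)

# Toy stand-in for adata.obs with a hypothetical 'batch' column;
# each cell inherits the capture rate of its batch.
obs = pd.DataFrame({'batch': ['A', 'A', 'B']})
obs['capture_rate'] = obs['batch'].map(batch_info['capture_rate'])

print(obs)
```

The resulting `capture_rate` column would then be the one named in `q_column`.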

By default, `memento` will consider all genes whose expression is high enough to calculate an accurate variance. If you wish to include fewer genes, increase `filter_mean_thresh`.

In [8]:
adata.obs['capture_rate'] = 0.07

In [9]:
memento.create_groups(adata, label_columns=['stim'], inplace=True, q_column='capture_rate')

In [10]:
memento.compute_size_factors(adata)

In [11]:
memento.compute_1d_moments(
    adata, 
    inplace=True, 
    filter_mean_thresh=0.07, # minimum raw mean of each gene within a group for the gene to be considered 
    min_perc_group=.9) # percentage of groups that satisfy the condition for a gene to be considered. 
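
To make the two filtering arguments concrete, here is an illustration of the rule as the comments above describe it. It is a sketch on made-up numbers using NumPy, not `memento`'s actual implementation, and the exact comparison (strict or non-strict) is an assumption.

```python
import numpy as np

# Hypothetical raw mean counts: rows are groups, columns are genes.
group_means = np.array([
    [0.50, 0.05, 0.10],   # group 1 (e.g. ctrl)
    [0.80, 0.20, 0.01],   # group 2 (e.g. stim)
])

filter_mean_thresh = 0.07
min_perc_group = 0.9

# Fraction of groups in which each gene clears the mean threshold.
frac_groups_passing = (group_means > filter_mean_thresh).mean(axis=0)

# Keep a gene only if enough groups pass.
keep = frac_groups_passing >= min_perc_group

print(keep)
```

With only two groups, a value of 0.9 means a gene has to clear the threshold in both of them under this reading.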

### Perform 1D hypothesis testing

`formula_like` determines the linear model that is used for hypothesis testing, while `cov_column` is used to pick out the variable that you actually want p-values for. 

`num_cpus` controls how many CPUs this operation is parallelized across. In general, I recommend using 3-6 CPUs for reasonable performance on any of the AWS machines that we have access to (I'm currently using a c5.2xlarge instance with 8 vCPUs).
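
One way to pick a value on an unfamiliar machine is to derive it from the number of available cores. This is a minimal sketch; leaving two cores free is an assumption made here for illustration, and the cap of 6 follows the recommendation above.

```python
import os

# os.cpu_count() can return None on some platforms, so fall back to 1.
available = os.cpu_count() or 1

# Leave two cores free for the rest of the system (an arbitrary choice),
# never go below 1, and cap at 6.
num_cpus = max(1, min(6, available - 2))

print(num_cpus)
```

The resulting `num_cpus` can be passed directly to the hypothesis testing call.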

In [12]:
memento.ht_1d_moments(
    adata, 
    formula_like='1 + stim',
    cov_column='stim', 
    num_boot=5000, 
    verbose=1,
    num_cpus=6)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    3.1s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:    8.2s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:   16.0s
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:   27.5s
[Parallel(n_jobs=6)]: Done 1238 tasks      | elapsed:   43.0s
[Parallel(n_jobs=6)]: Done 1788 tasks      | elapsed:  1.0min
[Parallel(n_jobs=6)]: Done 1877 out of 1877 | elapsed:  1.1min finished


In [13]:
result_1d = memento.get_1d_ht_result(adata)

In [14]:
result_1d.query('de_coef > 0').sort_values('de_pval').head(10)

| | gene | de_coef | de_pval | dv_coef | dv_pval |
| --- | --- | --- | --- | --- | --- |
| 930 | CHMP5 | 0.837601 | 2.006705e-09 | -0.724331 | 0.0602 |
| 1038 | IFITM2 | 3.055226 | 7.02992e-09 | -0.85691 | 0.0009595041 |
| 370 | CCR5 | 1.213793 | 7.61532e-09 | 0.429341 | 0.0544 |
| 31 | TMEM50A | 0.95407 | 1.402343e-08 | -0.220844 | 0.8556 |
| 1787 | IL4I1 | 1.851509 | 1.92669e-08 | -0.199887 | 0.8918 |
| 0 | ISG15 | 4.630542 | 2.848378e-08 | -3.849747 | 1.10808e-28 |
| 166 | SLAMF7 | 2.269208 | 3.35747e-08 | -0.708408 | 0.9354 |
| 752 | HSPB1 | 1.912223 | 3.410813e-08 | -0.81887 | 1.992621e-11 |
| 1286 | TNFSF13B | 3.453565 | 3.506394e-08 | -1.372564 | 1.011775e-06 |
| 421 | PLSCR1 | 1.793992 | 3.963257e-08 | -0.782122 | 0.001294553 |


In [16]:
result_1d.sort_values('dv_pval').head(10)

| | gene | de_coef | de_pval | dv_coef | dv_pval |
| --- | --- | --- | --- | --- | --- |
| 915 | LY6E | 3.432749 | 8.376168e-07 | -3.242748 | 1.414952e-38 |
| 1039 | IFITM3 | 3.393203 | 2.161355e-07 | -3.238808 | 1.396884e-35 |
| 1527 | CCL2 | 1.477253 | 1.715381e-07 | -1.726231 | 2.125858e-33 |
| 0 | ISG15 | 4.630542 | 2.848378e-08 | -3.849747 | 1.10808e-28 |
| 876 | IDO1 | 3.932135 | 6.092923e-07 | -2.093306 | 4.359662e-25 |
| 1421 | ISG20 | 3.646966 | 1.145916e-07 | -2.904924 | 5.481186e-25 |
| 1529 | CCL8 | 5.952347 | 5.593736e-06 | -3.810503 | 1.10213e-23 |
| 37 | IFI6 | 2.740365 | 1.222784e-06 | -2.210914 | 5.077364e-21 |
| 1300 | PSME2 | 0.816086 | 2.366761e-07 | -1.008491 | 6.435719e-18 |
| 263 | TMSB10 | 1.257169 | 8.024314e-08 | -0.952123 | 8.107112e-18 |
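
Because the mean and variability results share one table, the two tests can be combined with ordinary pandas filtering. The sketch below uses a few rows copied from the outputs above as a stand-in for `result_1d`; the cutoff `alpha` is a hypothetical threshold on uncorrected p-values, chosen only for illustration.

```python
import pandas as pd

# A few rows copied from the outputs above, standing in for result_1d.
toy_result = pd.DataFrame({
    'gene': ['CHMP5', 'TMEM50A', 'ISG15', 'HSPB1'],
    'de_coef': [0.837601, 0.95407, 4.630542, 1.912223],
    'de_pval': [2.006705e-09, 1.402343e-08, 2.848378e-08, 3.410813e-08],
    'dv_coef': [-0.724331, -0.220844, -3.849747, -0.81887],
    'dv_pval': [0.0602, 0.8556, 1.10808e-28, 1.992621e-11],
})

alpha = 0.05  # hypothetical cutoff, not a recommendation

# Genes whose mean goes up and whose variability goes down with stimulation.
mean_up = (toy_result['de_coef'] > 0) & (toy_result['de_pval'] < alpha)
var_down = (toy_result['dv_coef'] < 0) & (toy_result['dv_pval'] < alpha)
up_mean_down_var = toy_result[mean_up & var_down]

print(up_mean_down_var['gene'].tolist())
```

The same masks would apply unchanged to the full `result_1d` DataFrame.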


### Perform 2D hypothesis testing

For differential coexpression testing, we can specify which gene pairs to perform hypothesis testing on. `compute_2d_moments` takes a list of gene pairs, where each element of the list is a tuple. Here, we focus on one transcription factor and its correlations with the rest of the transcriptome.

As in the 1D case, the runtime of 2D hypothesis testing scales with the number of gene pairs to test. If you have a smaller set of candidate genes, it will run faster.
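
A smaller list of pairs can be built with `itertools.product` from two candidate lists. This is a minimal sketch: the gene lists are hypothetical, and dropping pairs of a gene with itself is a choice made here, not a documented requirement of `memento`.

```python
import itertools

# Hypothetical candidate lists; in practice these would be subsets of adata.var.index.
tfs = ['IRF7', 'STAT1']
targets = ['ISG15', 'IFITM3', 'IRF7']

# Every (TF, target) combination as a tuple, skipping a gene paired with itself.
gene_pairs = [
    (tf, target)
    for tf, target in itertools.product(tfs, targets)
    if tf != target
]

print(len(gene_pairs))
```

The number of tuples in the list is the number of tests that will be run, which makes the runtime easy to anticipate.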

In [17]:
import itertools

In [18]:
gene_pairs = list(itertools.product(['IRF7'], adata.var.index.tolist()))

In [19]:
memento.compute_2d_moments(adata, gene_pairs)

In [20]:
memento.ht_2d_moments(
    adata, 
    formula_like='1 + stim', 
    cov_column='stim', 
    num_cpus=13, 
    num_boot=5000)

[Parallel(n_jobs=13)]: Using backend LokyBackend with 13 concurrent workers.
[Parallel(n_jobs=13)]: Done   6 tasks      | elapsed:    1.0s
[Parallel(n_jobs=13)]: Done 102 tasks      | elapsed:    6.9s
[Parallel(n_jobs=13)]: Done 262 tasks      | elapsed:   15.7s
[Parallel(n_jobs=13)]: Done 486 tasks      | elapsed:   27.3s
[Parallel(n_jobs=13)]: Done 774 tasks      | elapsed:   43.5s
[Parallel(n_jobs=13)]: Done 1126 tasks      | elapsed:  1.0min
[Parallel(n_jobs=13)]: Done 1542 tasks      | elapsed:  1.4min
[Parallel(n_jobs=13)]: Done 1876 out of 1876 | elapsed:  1.7min finished


In [21]:
result_2d = memento.get_2d_ht_result(adata)

In [22]:
result_2d.sort_values('corr_pval').head(10)

| | gene_1 | gene_2 | corr_coef | corr_pval |
| --- | --- | --- | --- | --- |
| 1390 | IRF7 | ANXA2 | 0.272273 | 0.000272 |
| 716 | IRF7 | ACTB | 0.270438 | 0.000307 |
| 574 | IRF7 | CD74 | 0.315809 | 0.000337 |
| 1688 | IRF7 | OAZ1 | 0.371533 | 0.000372 |
| 329 | IRF7 | ARPC2 | 0.283066 | 0.000444 |
| 1148 | IRF7 | HSPA8 | 0.274291 | 0.000486 |
| 638 | IRF7 | HLA-DRA | 0.24531 | 0.000499 |
| 1815 | IRF7 | SDF2L1 | 0.30707 | 0.000512 |
| 241 | IRF7 | RTN4 | 0.379202 | 0.000515 |
| 1406 | IRF7 | PKM | 0.309065 | 0.000554 |


### Save your results

Some objects that `memento` stores in `adata` don't play nicely with scanpy when writing to disk, so give it a heads-up with the `prepare_to_save` function before saving.

In [13]:
memento.prepare_to_save(adata)

In [15]:
adata.write(data_path + 'ifn_tutorial.h5ad')

... storing 'memento_group' as categorical
