# Using `memento` to analyze Interferon-B response in monocytes

To install the pre-release version of `memento` (for Ye Lab members), install it directly from GitHub by running:

```pip install git+https://github.com/yelabucsf/scrna-parameter-estimation.git@release-v0.0```

This requires that you have access to the Ye Lab organization. 

In [1]:
import sys
sys.path.append('/data/home/Github/scrna-parameter-estimation/dist/memento-0.0.1-py3.7.egg')
import memento


In a future version of Scanpy, `scanpy.api` will be removed.
Simply use `import scanpy as sc` and `import scanpy.external as sce` instead.



In [2]:
import scanpy as sc
import memento

In [3]:
fig_path = '/data/home/Github/scrna-parameter-estimation/figures/fig4/'
data_path = '/data/parameter_estimation/'

### Read IFN data and filter for monocytes

For `memento`, we need the raw count matrix. Preferably, pass in the matrix with all genes so that we can choose which genes to look at later.

One of the columns in `adata.obs` should be the discrete groups to compare mean, variability, and co-variability across. In this case, it's called `stim`. 

The column containing the covariate that you want p-values for should either:
- Be binary (i.e., the column contains only two unique values, such as 'A' and 'B'). Here, the values are either 'stim' or 'ctrl'.
- Be numeric (e.g., the column contains -1, 0, or 1 for each genotype value).
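
The two accepted shapes of the covariate column can be checked before running any tests. The helper below is a minimal sketch, not part of `memento`: the name `covariate_kind` and the toy values are hypothetical and only illustrate the rule stated in the list above.

```python
from numbers import Number

def covariate_kind(values):
    """Classify a covariate column as 'binary', 'numeric', or 'unsupported'.

    Hypothetical helper for illustration; it is not a memento function.
    """
    unique = set(values)
    # Exactly two distinct values (e.g. 'stim' / 'ctrl') counts as binary.
    if len(unique) == 2:
        return "binary"
    # Otherwise every distinct value must be a number (e.g. genotype codes).
    if unique and all(isinstance(v, Number) for v in unique):
        return "numeric"
    return "unsupported"

print(covariate_kind(["stim", "ctrl", "stim"]))  # two labels
print(covariate_kind([-1, 0, 1, 0]))             # toy genotype codes
print(covariate_kind(["A", "B", "C"]))           # three labels, not numeric
```

In a notebook, the same check could be applied to the real column with `covariate_kind(adata.obs['stim'])`.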

In [4]:
adata = sc.read(data_path + 'interferon_filtered.h5ad')
adata = adata[adata.obs.cell == 'CD14+ Monocytes'].copy()
print(adata)

AnnData object with n_obs × n_vars = 5341 × 35635 
    obs: 'tsne1', 'tsne2', 'ind', 'stim', 'cluster', 'cell', 'multiplets', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'cell_type'
    var: 'gene_ids', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'cell_type_colors'
    obsm: 'X_tsne'


In [5]:
adata.obs[['ind', 'stim', 'cell']].sample(5)

index,ind,stim,cell
TATCTGACAAAAGC-1,1256,ctrl,CD14+ Monocytes
ATCGCAGATTCTAC-1,1488,ctrl,CD14+ Monocytes
GTACGAACTAACCG-1,1015,ctrl,CD14+ Monocytes
CCATAGGACAGATC-1,1244,stim,CD14+ Monocytes
GTGACCCTAGAATG-1,1244,ctrl,CD14+ Monocytes


### Create groups for hypothesis testing and compute 1D parameters

`memento` groups cells by whatever labels define a meaningful group; here, we simply divide the cells into `stim` and `ctrl`. To further divide the cells by individual, add the `ind` column to the `label_columns` argument when calling `create_groups`.
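
To see how adding a label column refines the groups, the sketch below counts cells per label combination using the five rows sampled from `adata.obs` above. It only mimics the idea of grouping with the standard library; `group_sizes` is a hypothetical helper and says nothing about how `create_groups` works internally.

```python
from collections import Counter

# The five cells sampled from adata.obs above, reduced to the two label columns.
cells = [
    {"stim": "ctrl", "ind": 1256},
    {"stim": "ctrl", "ind": 1488},
    {"stim": "ctrl", "ind": 1015},
    {"stim": "stim", "ind": 1244},
    {"stim": "ctrl", "ind": 1244},
]

def group_sizes(cells, label_columns):
    """Count cells in each group defined by the chosen label columns."""
    return Counter(tuple(cell[col] for col in label_columns) for cell in cells)

by_stim = group_sizes(cells, ["stim"])             # one group per condition
by_stim_ind = group_sizes(cells, ["stim", "ind"])  # one group per condition and individual

print(by_stim)
print(by_stim_ind)
```

Finer groups mean fewer cells per group, which is worth keeping in mind when choosing `label_columns`.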

`q` is a rough estimate of the overall UMI efficiency across both sampling and sequencing. To obtain it, take the sequencing saturation `s` and multiply it by 0.07 for 10X v1, 0.15 for v2, or 0.25 for v3.
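
The rule above amounts to a single multiplication. The sketch below wraps it in a hypothetical helper, `estimate_q`; it assumes `s` is expressed as a fraction between 0 and 1, and the saturation value of 0.8 is made up for illustration.

```python
# Per-chemistry factors quoted in the text above (10X v1, v2, v3).
CAPTURE_FACTOR = {"v1": 0.07, "v2": 0.15, "v3": 0.25}

def estimate_q(saturation, chemistry):
    """Rough overall UMI efficiency: sequencing saturation times the chemistry factor.

    Assumes saturation is a fraction in [0, 1]; this helper is not part of memento.
    """
    if not 0.0 <= saturation <= 1.0:
        raise ValueError("saturation must be a fraction between 0 and 1")
    return saturation * CAPTURE_FACTOR[chemistry]

# Hypothetical example: a 10X v2 library at 80% sequencing saturation.
q = estimate_q(0.8, "v2")
print(round(q, 3))
```

The resulting value would then be passed as the `q` argument of `create_groups`.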

By default, `memento` will consider all genes whose expression is high enough to calculate an accurate variance. If you wish to include fewer genes, increase `filter_mean_thresh`.

In [15]:
memento.create_groups(adata, label_columns=['stim'], inplace=True, q=0.07)

In [16]:
memento.compute_1d_moments(
    adata, 
    inplace=True, 
    filter_mean_thresh=0.07, # minimum raw mean of each gene within a group for the gene to be considered
    min_perc_group=.9) # fraction of groups that must satisfy the condition for a gene to be considered

### Perform 1D hypothesis testing

`formula_like` determines the linear model that is used for hypothesis testing, while `cov_column` is used to pick out the variable that you actually want p-values for. 

`num_cpus` controls how many CPUs this operation is parallelized across. In general, I recommend using 3-6 CPUs for reasonable performance on any of the AWS machines that we have access to (I'm currently using a c5.2xlarge instance with 8 vCPUs).
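
One way to pick a value for `num_cpus` without hard-coding it is to derive it from the machine. This is a minimal sketch: `pick_num_cpus`, the `reserve` of 2 cores, and the `cap` of 6 (the upper end of the range recommended above) are all assumptions for illustration, not `memento` settings.

```python
import os

def pick_num_cpus(available=None, reserve=2, cap=6):
    """Choose a worker count: leave `reserve` cores free and never exceed `cap`.

    Hypothetical helper; the defaults are illustrative assumptions.
    """
    if available is None:
        # os.cpu_count() can return None, so fall back to a single core.
        available = os.cpu_count() or 1
    return max(1, min(cap, available - reserve))

print(pick_num_cpus(8))  # an 8-vCPU machine such as the one mentioned above
print(pick_num_cpus(4))
print(pick_num_cpus(1))
```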

In [17]:
memento.ht_1d_moments(
    adata, 
    formula_like='1 + stim',
    cov_column='stim', 
    num_boot=5000, 
    verbose=1,
    num_cpus=6)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  64 tasks      | elapsed:    2.9s
[Parallel(n_jobs=6)]: Done 364 tasks      | elapsed:   15.1s
[Parallel(n_jobs=6)]: Done 864 tasks      | elapsed:   36.7s
[Parallel(n_jobs=6)]: Done 1564 tasks      | elapsed:  1.1min
[Parallel(n_jobs=6)]: Done 1866 out of 1877 | elapsed:  1.3min remaining:    0.5s
[Parallel(n_jobs=6)]: Done 1877 out of 1877 | elapsed:  1.4min finished


In [19]:
result_1d = memento.get_1d_ht_result(adata)
result_1d['de_fdr'] = memento.util._fdrcorrect(result_1d['de_pval'])
result_1d['dv_fdr'] = memento.util._fdrcorrect(result_1d['dv_pval'])
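
The implementation of `memento.util._fdrcorrect` is not shown in this notebook. The FDR values in the tables below look consistent with a Benjamini-Hochberg adjustment (for example, the smallest `de_pval` multiplied by the 1877 genes tested gives roughly the reported `de_fdr`), so the sketch below shows that procedure in plain Python. Treat it as an illustration of the technique under that assumption, not as the library's code; `bh_adjust` and the toy p-values are hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values, not taken from the analysis.
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

The running minimum is why several neighbouring rows in a sorted result table can share exactly the same adjusted value.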

In [22]:
result_1d.sort_values('de_fdr').head(10)

index,gene,de_coef,de_pval,dv_coef,dv_pval,de_fdr,dv_fdr
357,RPL15,-0.830076,4.407524e-77,0.531724,0.000182,8.272923e-74,0.004828
1158,GAPDH,-0.648887,9.88215e-67,0.033378,0.003,9.274396999999999e-64,0.047319
49,MARCKSL1,-1.27712,7.097472e-65,0.478701,0.0532,4.440651e-62,0.324209
856,RPL10,-0.666116,9.276986000000001e-61,0.238853,0.5998,4.3532259999999996e-58,0.889168
1105,FAU,-0.379072,2.0310120000000002e-54,0.525444,0.0102,7.62442e-52,0.117456
1238,RPL6,-1.343297,6.179513e-54,0.816432,0.0804,1.9331579999999998e-51,0.405944
11,ENO1,-1.080259,8.320121e-54,0.722192,0.6636,2.230981e-51,0.919522
1763,PLAUR,-1.002734,1.306125e-53,-0.052274,0.0132,3.064496e-51,0.138416
716,ACTB,-0.746223,4.067266e-53,-0.016385,0.000125,8.482508999999999e-51,0.003714
1497,PFN1,-0.782788,1.129414e-51,0.465121,0.000183,2.11991e-49,0.004828


In [27]:
result_1d.sort_values('dv_fdr').head(10)


index,gene,de_coef,de_pval,dv_coef,dv_pval,de_fdr,dv_fdr
915,LY6E,3.497728,9.358025e-09,-3.048274,5.533871e-37,3.721401e-08,1.038708e-33
472,CXCL10,5.699998,2.306696e-06,-3.385572,2.613791e-34,6.293122e-06,2.453042e-31
1039,IFITM3,3.458744,1.172596e-06,-2.698284,2.285528e-30,3.391313e-06,1.4299790000000001e-27
876,IDO1,3.987419,3.14778e-07,-1.980406,3.747547e-30,1.020446e-06,1.758537e-27
1527,CCL2,1.561331,0.000195183,-0.939252,9.723871e-27,0.0003385939,3.650341e-24
1421,ISG20,3.707429,3.211474e-08,-2.312789,1.797152e-24,1.196019e-07,5.62209e-22
284,IL1RN,4.34953,4.602906e-06,-1.750747,4.04765e-18,1.172273e-05,1.085348e-15
1300,PSME2,0.883725,2.048874e-07,-0.813046,6.302435e-17,6.770661e-07,1.478709e-14
37,IFI6,2.81194,7.646498e-08,-1.910721,1.204907e-16,2.692773e-07,2.5129e-14
217,RSAD2,4.909576,5.487289e-07,-2.470896,6.276557e-16,1.70242e-06,1.17811e-13


### Perform 2D hypothesis testing

For differential co-variability testing, we can specify which gene pairs to perform hypothesis testing on. `memento` takes a list of gene pairs, where each element in the list is a tuple. Here, we focus on one transcription factor and its correlations to the rest of the transcriptome.

Similar to the 1D case, 2D hypothesis testing scales with the number of pairs of genes to test. If you have a smaller set of candidate genes, it will run faster.
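
Because runtime grows with the number of pairs, it can help to build the list from a short candidate set. The sketch below pairs one factor with a few genes and drops the pair of the factor with itself; `pairs_for_factor` is a hypothetical helper, the three gene names are just examples, and whether `memento` requires the self-pair to be removed is not stated in this notebook.

```python
import itertools

def pairs_for_factor(factor, genes):
    """Pair one gene with every gene in `genes`, skipping the self-pair."""
    return [(a, b) for a, b in itertools.product([factor], genes) if a != b]

# Small illustrative candidate set instead of the full gene list.
candidate_genes = ["IRF7", "CD74", "ACTB"]
example_pairs = pairs_for_factor("IRF7", candidate_genes)
print(example_pairs)
```

Each element is a tuple, which is the shape described above for the gene pair list.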

In [9]:
import itertools

In [29]:
gene_pairs = list(itertools.product(['IRF7'], adata.var.index.tolist()))

In [30]:
memento.compute_2d_moments(adata, gene_pairs)

In [31]:
memento.ht_2d_moments(
    adata, 
    formula_like='1 + stim', 
    cov_column='stim', 
    num_cpus=6, 
    num_boot=5000)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  20 tasks      | elapsed:    2.8s
[Parallel(n_jobs=6)]: Done 116 tasks      | elapsed:   14.8s
[Parallel(n_jobs=6)]: Done 276 tasks      | elapsed:   35.9s
[Parallel(n_jobs=6)]: Done 500 tasks      | elapsed:  1.1min
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:  1.7min
[Parallel(n_jobs=6)]: Done 1140 tasks      | elapsed:  2.5min
[Parallel(n_jobs=6)]: Done 1556 tasks      | elapsed:  3.4min
[Parallel(n_jobs=6)]: Done 1876 out of 1876 | elapsed:  4.1min finished


In [32]:
result_2d = memento.get_2d_ht_result(adata)

In [33]:
result_2d.sort_values('corr_pval').head(10)

index,gene_1,gene_2,corr_coef,corr_pval,corr_fdr
574,IRF7,CD74,0.316293,0.000123,0.073478
1815,IRF7,SDF2L1,0.304159,0.000181,0.073478
104,IRF7,GCLM,0.396597,0.000283,0.073478
716,IRF7,ACTB,0.272642,0.000334,0.073478
638,IRF7,HLA-DRA,0.252754,0.000473,0.073478
158,IRF7,LMNA,0.306795,0.00056,0.073478
1108,IRF7,MALAT1,0.249095,0.000571,0.073478
493,IRF7,ANXA5,0.275887,0.000642,0.073478
211,IRF7,GPR137B,0.397522,0.000659,0.073478
626,IRF7,HLA-C,0.268576,0.000686,0.073478
