# Using `memento` to analyze Interferon-B response in monocytes

To install the pre-release version of `memento` (for Ye Lab members), install it directly from GitHub by running:

```pip install git+https://github.com/yelabucsf/scrna-parameter-estimation.git@release-v0.0.1```

This requires that you have access to the Ye Lab organization. 

In [2]:
import scanpy as sc
import memento

In [3]:
fig_path = '/data/home/Github/scrna-parameter-estimation/figures/fig4/'
data_path = '/data/parameter_estimation/'

### Read IFN data and filter for monocytes

For `memento`, we need the raw count matrix. Preferably, provide the matrix with all genes so that we can choose which genes to look at.

One of the columns in `adata.obs` should be the discrete groups to compare mean, variability, and co-variability across. In this case, it's called `stim`. 

The column containing the covariate that you want p-values for should either:
- Be binary (i.e., the column contains only two unique values, such as 'A' and 'B'). Here, the values are either 'stim' or 'ctrl'.
- Be numeric (e.g., the column contains -1, 0, or 1 for each genotype value).
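The two conditions above can be checked before running any tests. The following is a minimal sketch using only pandas; `check_covariate` is a hypothetical helper written for this tutorial, not part of `memento`, and the toy `obs` table stands in for `adata.obs`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_covariate(column: pd.Series) -> str:
    """Classify a covariate column as 'binary' or 'numeric', or raise.

    Hypothetical helper (not a memento function).
    """
    values = column.dropna()
    if values.nunique() == 2:
        return 'binary'
    if is_numeric_dtype(values):
        return 'numeric'
    raise ValueError(
        f"Column {column.name!r} is neither binary nor numeric "
        f"({values.nunique()} unique non-numeric values)."
    )


# Toy stand-in for adata.obs
obs = pd.DataFrame({
    'stim': ['stim', 'stim', 'ctrl', 'stim', 'ctrl'],
    'genotype': [-1, 0, 1, 0, 1],
    'cell': ['a', 'b', 'c', 'd', 'e'],
})

stim_kind = check_covariate(obs['stim'])          # two unique labels
genotype_kind = check_covariate(obs['genotype'])  # numeric values
```

A column such as `cell` in the toy table, with more than two string labels, raises a `ValueError`, which is a cue to recode it before using it as the covariate.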

In [4]:
adata = sc.read(data_path + 'interferon_filtered.h5ad')
adata = adata[adata.obs.cell == 'CD14+ Monocytes'].copy()
print(adata)

AnnData object with n_obs × n_vars = 5341 × 35635 
    obs: 'tsne1', 'tsne2', 'ind', 'stim', 'cluster', 'cell', 'multiplets', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'cell_type'
    var: 'gene_ids', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'cell_type_colors'
    obsm: 'X_tsne'


In [5]:
adata.obs[['ind', 'stim', 'cell']].sample(5)

index,ind,stim,cell
CAGCACCTAACGTC-1,1488,stim,CD14+ Monocytes
GTATCTACGGTTAC-1,107,stim,CD14+ Monocytes
AACAGAGATGCTTT-1,1016,ctrl,CD14+ Monocytes
TCCTACCTAAGTGA-1,1244,stim,CD14+ Monocytes
AGGTCTGAGTTGTG-1,1015,ctrl,CD14+ Monocytes


### Create groups for hypothesis testing and compute 1D parameters

`memento` groups cells by the columns passed in `label_columns`; here, we just divide the cells into `stim` and `ctrl`. We can easily divide the cells further by individual by adding the `ind` column to the `label_columns` argument when calling `create_groups`.
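Splitting by more columns makes each group smaller, so it can help to look at group sizes first. This is a small pandas sketch on a toy table (the rows are made up for illustration and only mimic the `ind` and `stim` columns of `adata.obs`); it does not call `memento`.

```python
import pandas as pd

# Toy stand-in for adata.obs[['ind', 'stim']]
obs = pd.DataFrame({
    'ind': ['1488', '107', '1016', '1244', '1015', '1488'],
    'stim': ['stim', 'stim', 'ctrl', 'stim', 'ctrl', 'ctrl'],
})

# Number of cells per group when grouping by condition only
by_stim = obs.groupby('stim').size()

# Number of cells per group when grouping by condition and individual
by_stim_ind = obs.groupby(['stim', 'ind']).size()

print(by_stim)
print(by_stim_ind)
```

On real data, the same two `groupby` calls on `adata.obs` show how many cells each group would contain for a given choice of `label_columns`.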

`q` is a rough estimate of the overall UMI efficiency across both sampling and sequencing. If `s` is the sequencing saturation, obtain `q` by multiplying `s` by 0.07 for 10X v1, 0.15 for v2, and 0.25 for v3.
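As a worked example of the rule above, here is a short sketch that computes `q` from a saturation value. The helper name `estimate_q` and the saturation of 0.8 are assumptions for illustration; only the three multipliers come from the text.

```python
# Multipliers quoted above for each 10X chemistry version
CAPTURE_RATE = {'v1': 0.07, 'v2': 0.15, 'v3': 0.25}


def estimate_q(saturation: float, chemistry: str) -> float:
    """Rough overall UMI efficiency: saturation times the chemistry multiplier.

    Hypothetical helper (not a memento function).
    """
    if not 0.0 < saturation <= 1.0:
        raise ValueError('saturation should be a fraction in (0, 1]')
    return saturation * CAPTURE_RATE[chemistry]


# Hypothetical saturation of 80% on v2 chemistry: 0.8 * 0.15
q_v2 = estimate_q(0.8, 'v2')
print(round(q_v2, 3))
```

The resulting value is what would be passed as `q` to `create_groups`.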

By default, `memento` will consider all genes whose expression is high enough to calculate an accurate variance. If you wish to include fewer genes, increase `filter_mean_thresh`.

In [6]:
memento.create_groups(adata, label_columns=['stim'], inplace=True, q=0.07)

In [7]:
memento.compute_1d_moments(
    adata, 
    inplace=True, 
    filter_mean_thresh=0.07,  # minimum raw mean of a gene within a group for the gene to be considered
    min_perc_group=.9)  # fraction of groups that must satisfy the condition for a gene to be considered
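To make the two filter arguments concrete, here is a NumPy sketch of the rule as the comments describe it: a gene is kept if its raw mean exceeds the threshold in at least the given fraction of groups. This is an illustration under that reading, not `memento`'s implementation (for instance, whether the comparison is strict is an assumption), and `genes_passing_filter` is a hypothetical name.

```python
import numpy as np


def genes_passing_filter(group_means, filter_mean_thresh, min_perc_group):
    """Return a boolean mask over genes.

    group_means: array of shape (n_groups, n_genes) with raw mean counts.
    Hypothetical helper illustrating the described rule.
    """
    group_means = np.asarray(group_means, dtype=float)
    passes = group_means > filter_mean_thresh  # per group, per gene
    frac_groups = passes.mean(axis=0)          # fraction of groups passing
    return frac_groups >= min_perc_group


# Toy means for three genes in two groups (made-up numbers)
group_means = np.array([
    [0.50, 0.02, 0.10],  # ctrl
    [0.90, 0.30, 0.08],  # stim
])

keep = genes_passing_filter(group_means, filter_mean_thresh=0.07, min_perc_group=0.9)
print(keep)
```

With only two groups and `min_perc_group=.9`, a gene has to pass the threshold in both groups, so the second toy gene is dropped.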

### Perform 1D hypothesis testing

`formula_like` determines the linear model that is used for hypothesis testing, while `cov_column` is used to pick out the variable that you actually want p-values for. 
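To illustrate what a formula such as `1 + stim` describes, here is a sketch that builds the corresponding design matrix by hand: an intercept column plus a 0/1 indicator for stimulation. How `memento` encodes the covariate internally is an assumption here, and the ordinary least-squares fit on made-up responses is only meant to show which coefficient `cov_column` refers to.

```python
import numpy as np
import pandas as pd

# Toy condition labels and a made-up response per group
stim = pd.Series(['stim', 'ctrl', 'stim', 'ctrl'], name='stim')
y = np.array([5.0, 1.0, 3.0, 1.0])

# Design matrix for '1 + stim': intercept plus an indicator column
design = pd.DataFrame({
    'Intercept': np.ones(len(stim)),
    'stim': (stim == 'stim').astype(float),
})

# Ordinary least squares; memento's own estimator may differ
coef, *_ = np.linalg.lstsq(design.to_numpy(), y, rcond=None)

# The coefficient of interest is the one on the 'stim' column
cov_index = design.columns.get_loc('stim')
stim_effect = coef[cov_index]
print(stim_effect)
```

In this toy fit the `stim` coefficient is the difference between the mean response of the stimulated and control groups, while the intercept absorbs the control mean.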

`num_cpus` controls how many CPUs this operation is parallelized across. In general, I recommend using 3-6 CPUs for reasonable performance on any of the AWS machines that we have access to (I'm currently using a c5.2xlarge instance with 8 vCPUs).

In [8]:
memento.ht_1d_moments(
    adata, 
    formula_like='1 + stim',
    cov_column='stim', 
    num_boot=5000, 
    verbose=1,
    num_cpus=6)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    3.8s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:   10.0s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:   19.4s
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:   34.0s
[Parallel(n_jobs=6)]: Done 1238 tasks      | elapsed:   53.1s
[Parallel(n_jobs=6)]: Done 1788 tasks      | elapsed:  1.3min
[Parallel(n_jobs=6)]: Done 1877 out of 1877 | elapsed:  1.3min finished


In [9]:
result_1d = memento.get_1d_ht_result(adata)
result_1d['de_fdr'] = memento.util._fdrcorrect(result_1d['de_pval'])
result_1d['dv_fdr'] = memento.util._fdrcorrect(result_1d['dv_pval'])

In [10]:
result_1d.sort_values('de_fdr').head(10)

,gene,de_coef,de_pval,dv_coef,dv_pval,de_fdr,dv_fdr
927,RPS6,-0.680596,9.234272e-58,0.569209,0.000649,8.666365000000001e-55,0.015043
890,RPL7,-1.022373,8.665941e-58,0.695691,6.4e-05,8.666365000000001e-55,0.002071
1497,PFN1,-0.737511,2.523154e-57,0.450652,0.8774,1.5786540000000002e-54,0.976307
1238,RPL6,-1.304283,1.363269e-53,0.852596,0.003078,6.397139e-51,0.051123
32,SH3BGRL3,-0.434325,3.725468e-52,0.128927,0.0194,1.398541e-49,0.176766
140,S100A8,-1.377003,5.37408e-51,0.472991,0.3826,1.681191e-48,0.763979
1688,OAZ1,-0.452429,3.6549819999999997e-50,0.331381,0.4432,9.800573e-48,0.806749
1268,RGCC,-2.0139,5.246259e-49,0.196163,0.5882,1.230904e-46,0.88485
1158,GAPDH,-0.600653,1.746462e-46,0.021866,0.0726,3.2781089999999995e-44,0.386035
922,RPL8,-0.661875,1.687282e-46,0.248327,0.4282,3.2781089999999995e-44,0.799733
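The `de_fdr` values above are consistent with a Benjamini-Hochberg adjustment over the 1877 tested genes (for example, 2.523154e-57 × 1877 / 3 ≈ 1.5787e-54 for PFN1). Whether `memento.util._fdrcorrect` is exactly this procedure is an assumption; the sketch below is a plain NumPy version of Benjamini-Hochberg, with `bh_adjust` as a hypothetical name.

```python
import numpy as np


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # Scale the i-th smallest p-value by n / i
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

The monotonicity step explains why tied FDR values appear in the table: a gene's adjusted value can never exceed that of a gene with a larger raw p-value.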


In [11]:
result_1d.sort_values('dv_fdr').head(10)


,gene,de_coef,de_pval,dv_coef,dv_pval,de_fdr,dv_fdr
1527,CCL2,1.616199,2.022121e-07,-0.905586,6.715195000000001e-33,8.341803e-07,1.2604420000000001e-29
472,CXCL10,5.735854,1.621175e-06,-3.385255,6.553847e-32,5.55282e-06,6.150786e-29
915,LY6E,3.545791,1.340808e-06,-3.020043,1.1527460000000001e-27,4.645574e-06,7.212347000000001e-25
1039,IFITM3,3.510266,8.798592e-06,-2.615273,2.0706670000000003e-27,2.517524e-05,9.716606e-25
37,IFI6,2.861663,3.058336e-07,-1.933102,1.098119e-21,1.221382e-06,4.1223409999999995e-19
876,IDO1,4.056669,1.044322e-07,-1.953791,6.107846e-21,4.631759e-07,1.910738e-18
1421,ISG20,3.75964,1.62265e-07,-2.276321,5.354147e-19,6.828956e-07,1.435676e-16
0,ISG15,4.737471,1.771858e-08,-2.465596,2.202423e-18,8.874323e-08,5.167436e-16
1376,B2M,0.427152,4.378063e-07,-0.082712,6.53353e-18,1.715579e-06,1.362604e-15
217,RSAD2,4.944483,6.823462e-06,-2.515631,4.2442880000000004e-17,2.010618e-05,7.966528e-15
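Once the result table has FDR columns, picking out genes of interest is ordinary pandas filtering. The sketch below copies four rows from the tables above into a small DataFrame and selects genes whose mean goes up while variability goes down; the 0.05 cutoff is an assumption, not a `memento` default.

```python
import pandas as pd

# Four rows copied from the result tables above
toy_result = pd.DataFrame({
    'gene': ['CCL2', 'CXCL10', 'B2M', 'PFN1'],
    'de_coef': [1.616199, 5.735854, 0.427152, -0.737511],
    'dv_coef': [-0.905586, -3.385255, -0.082712, 0.450652],
    'de_fdr': [8.341803e-07, 5.55282e-06, 1.715579e-06, 1.5786540000000002e-54],
    'dv_fdr': [1.2604420000000001e-29, 6.150786e-29, 1.362604e-15, 0.976307],
})

fdr_cutoff = 0.05  # assumed significance cutoff

# Significant for both mean and variability
hits = toy_result[(toy_result['de_fdr'] < fdr_cutoff) & (toy_result['dv_fdr'] < fdr_cutoff)]

# Mean increases with stimulation while variability decreases
up_mean_down_var = hits[(hits['de_coef'] > 0) & (hits['dv_coef'] < 0)]['gene'].tolist()
print(up_mean_down_var)
```

The same two filters can be applied directly to `result_1d`.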


### Perform 2D hypothesis testing

For differential co-variability testing, we can specify which genes we want to perform hypothesis testing on. The function takes a list of gene pairs, where each element of the list is a tuple. Here, we focus on one transcription factor (IRF7) and its correlations with the rest of the transcriptome.

Similar to the 1D case, 2D hypothesis testing scales with the number of pairs of genes to test. If you have a smaller set of candidate genes, it will run faster.
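As a small sketch of how such a pair list can be built and kept short, the helper below pairs each transcription factor with a list of candidate genes and skips the self-pair. `pairs_with_targets` is a hypothetical name, the gene list is a toy subset, and dropping the self-pair is a choice made for this illustration.

```python
import itertools


def pairs_with_targets(tfs, targets):
    """List of (tf, gene) tuples, excluding pairs of a gene with itself.

    Hypothetical helper (not a memento function).
    """
    return [
        (tf, gene)
        for tf, gene in itertools.product(tfs, targets)
        if tf != gene
    ]


# Toy candidate list; in the notebook this would be adata.var.index.tolist()
genes = ['IRF7', 'CD74', 'ACTB', 'HLA-C']
gene_pairs = pairs_with_targets(['IRF7'], genes)
print(gene_pairs)
```

Restricting `targets` to a smaller candidate set shrinks the list, and with it the number of tests to run.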

In [12]:
import itertools

In [29]:
gene_pairs = list(itertools.product(['IRF7'], adata.var.index.tolist()))

In [30]:
memento.compute_2d_moments(adata, gene_pairs)

In [31]:
memento.ht_2d_moments(
    adata, 
    formula_like='1 + stim', 
    cov_column='stim', 
    num_cpus=6, 
    num_boot=5000)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  20 tasks      | elapsed:    2.8s
[Parallel(n_jobs=6)]: Done 116 tasks      | elapsed:   14.8s
[Parallel(n_jobs=6)]: Done 276 tasks      | elapsed:   35.9s
[Parallel(n_jobs=6)]: Done 500 tasks      | elapsed:  1.1min
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:  1.7min
[Parallel(n_jobs=6)]: Done 1140 tasks      | elapsed:  2.5min
[Parallel(n_jobs=6)]: Done 1556 tasks      | elapsed:  3.4min
[Parallel(n_jobs=6)]: Done 1876 out of 1876 | elapsed:  4.1min finished


In [32]:
result_2d = memento.get_2d_ht_result(adata)

In [33]:
result_2d.sort_values('corr_pval').head(10)

,gene_1,gene_2,corr_coef,corr_pval,corr_fdr
574,IRF7,CD74,0.316293,0.000123,0.073478
1815,IRF7,SDF2L1,0.304159,0.000181,0.073478
104,IRF7,GCLM,0.396597,0.000283,0.073478
716,IRF7,ACTB,0.272642,0.000334,0.073478
638,IRF7,HLA-DRA,0.252754,0.000473,0.073478
158,IRF7,LMNA,0.306795,0.00056,0.073478
1108,IRF7,MALAT1,0.249095,0.000571,0.073478
493,IRF7,ANXA5,0.275887,0.000642,0.073478
211,IRF7,GPR137B,0.397522,0.000659,0.073478
626,IRF7,HLA-C,0.268576,0.000686,0.073478
