# Accounting for between-sample variability - eQTL analysis

The key point of this tutorial is to demonstrate how to account for interindividual variation using hierarchical resampling.

Most of this tutorial also applies to testing any feature defined at the replicate or individual level. For example, comparing cases vs controls would follow a similar procedure, since the independent variable is defined for each person rather than for each cell.
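One way to check that a variable really is defined at the individual level is to confirm that it takes a single value per donor in the cell-level metadata, and then collapse it to one row per individual. The sketch below uses a small made-up `obs` table. The column names `ind_cov` and `SLE_status` mirror the toy AnnData used later; the helper name, the `Healthy` label, and the 0/1 encoding are assumptions for illustration.

```python
import pandas as pd

def donor_level_variable(obs, donor_column, variable):
    """Collapse a cell-level column to one value per donor.

    Raises if the variable takes more than one value within a donor,
    since it is then not defined at the individual level.
    """
    n_values = obs.groupby(donor_column)[variable].nunique()
    if (n_values > 1).any():
        bad = n_values[n_values > 1].index.tolist()
        raise ValueError(f"{variable} varies within donors: {bad}")
    return obs.groupby(donor_column)[variable].first()

# Made-up cell-level metadata: three donors, a few cells each.
obs = pd.DataFrame({
    "ind_cov": ["d1", "d1", "d2", "d2", "d2", "d3"],
    "SLE_status": ["SLE", "SLE", "Healthy", "Healthy", "Healthy", "SLE"],
})

status = donor_level_variable(obs, "ind_cov", "SLE_status")
# Encode case/control as 0/1 with one row per individual,
# the same layout as the SNP table used in this tutorial.
case_control = (status == "SLE").astype(int).to_frame("case")
```

The resulting table has individuals as rows and the variable of interest as a column, which is the shape the rest of the tutorial expects for genotypes.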

The toy data files used in this tutorial can be found at:

- genotypes: https://memento-examples.s3.us-west-2.amazonaws.com/toy-eqtl/toy_genotypes.csv
- covariates: https://memento-examples.s3.us-west-2.amazonaws.com/toy-eqtl/toy_covariates.csv
- AnnData object: https://memento-examples.s3.us-west-2.amazonaws.com/toy-eqtl/toy_adata.h5ad
- gene_snp_pairs: https://memento-examples.s3.us-west-2.amazonaws.com/toy-eqtl/toy_gene_snp_pairs.csv
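
To fetch these files into a local directory, something like the following should work. It uses only the Python standard library; the destination directory name in the usage comment is an assumption, and the download itself needs network access.

```python
import os
import urllib.request

BASE_URL = "https://memento-examples.s3.us-west-2.amazonaws.com/toy-eqtl/"
FILES = [
    "toy_genotypes.csv",
    "toy_covariates.csv",
    "toy_adata.h5ad",
    "toy_gene_snp_pairs.csv",
]

def download_toy_data(dest_dir):
    """Download any toy file that is not already present in dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for name in FILES:
        target = os.path.join(dest_dir, name)
        if not os.path.exists(target):
            urllib.request.urlretrieve(BASE_URL + name, target)
    return [os.path.join(dest_dir, name) for name in FILES]

# Example (requires network access; the directory name is arbitrary):
# download_toy_data("toy-eqtl")
```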

In [1]:
import scanpy as sc
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

%matplotlib inline

In [2]:
import memento

In [3]:
data_path = '/home/ubuntu/Data/tutorial_data/toy-eqtl/'

## Read the inputs: variables of interest (SNPs), covariates, SNP-gene pairs.

In both the SNP and covariate tables, each row is an individual and each column is a different variable of interest.

For the tutorial, we use the genotypes and covariates from the 2022 paper by Perez, Gordon, Subramaniam et al. from the lab. These inputs are identical to Matrix eQTL inputs, except that I transpose them here because I think it makes more sense for observations to be rows.
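
If you are starting from Matrix eQTL-style tables (variables as rows, individuals as columns), the transpose is a one-liner in pandas. The sketch below uses small made-up tables whose IDs follow the format of the toy data, plus one invented donor (`9999_9999`) to show how individuals missing from one table can be dropped. The variable names `geno_meqtl` and `cov_meqtl` are assumptions.

```python
import pandas as pd

# Made-up Matrix eQTL-style tables: rows are variables, columns are individuals.
geno_meqtl = pd.DataFrame(
    {"1132_1132": [1, 0], "1285_1285": [2, 1], "9999_9999": [0, 2]},
    index=["10:5642859", "2:242610773"],
)
cov_meqtl = pd.DataFrame(
    {"1285_1285": [39.0, 0.0], "1132_1132": [45.0, 0.0]},
    index=["age", "Female"],
)

# Transpose so that each row is an individual.
snps = geno_meqtl.T
cov = cov_meqtl.T

# Keep only individuals present in both tables, in the same order.
shared = snps.index.intersection(cov.index)
snps = snps.loc[shared]
cov = cov.loc[shared]
```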

For the tutorial, we just set up some random SNP-gene pairs to test; however, you can flexibly design this mapping to fit your needs. I purposefully didn't encode all possible variations of how you can define gene-SNP relationships.
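
As one example of designing the mapping yourself, the sketch below builds cis pairs: each gene is paired with SNPs on the same chromosome that lie within a fixed window of its transcription start site. The `chrom:pos` SNP ID format matches the genotype columns in the toy data. The gene names, TSS coordinates, window size, and the `cis_pairs` helper are all hypothetical placeholders, not real annotations or part of `memento`.

```python
import pandas as pd

def cis_pairs(snp_ids, gene_tss, window=100_000):
    """Pair each gene with SNPs on the same chromosome within `window` bp of its TSS.

    snp_ids:  iterable of 'chrom:pos' strings (the format of the genotype columns).
    gene_tss: DataFrame with columns 'gene', 'chrom', 'tss' (hypothetical layout).
    """
    snp_table = pd.DataFrame({"SNP": list(snp_ids)})
    parts = snp_table["SNP"].str.split(":", expand=True)
    snp_table["chrom"] = parts[0]
    snp_table["pos"] = parts[1].astype(int)

    # Candidate pairs share a chromosome; then filter by distance to the TSS.
    merged = gene_tss.merge(snp_table, on="chrom")
    in_window = (merged["pos"] - merged["tss"]).abs() <= window
    return merged.loc[in_window, ["gene", "SNP"]].reset_index(drop=True)

# Placeholder genes and TSS coordinates, NOT real annotations.
gene_tss = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B"],
    "chrom": ["10", "19"],
    "tss": [5_600_000, 10_000_000],
})
pairs = cis_pairs(["10:5642859", "19:56931672", "3:164807972"], gene_tss)
```

The output has the same `gene` and `SNP` columns as `toy_gene_snp_pairs.csv`, so a table built this way could be swapped in for the random pairs.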

In [4]:
snps_path = data_path + 'toy_genotypes.csv'
cov_path = data_path + 'toy_covariates.csv'

In [5]:
snps = pd.read_csv(snps_path, index_col=0)
cov = pd.read_csv(cov_path, index_col=0)
gene_snp_pairs = pd.read_csv(data_path + 'toy_gene_snp_pairs.csv', index_col=0)


In [6]:
snps.head(2)

Unnamed: 0,10:5642859,19:56931672,3:164807972,19:2870759,1:17629912,2:242610773,7:50618260,17:48408836,12:10155412,1:162350451,...,7:117595082,1:151334520,20:2410952,21:31736032,3:88123610,13:103092137,7:129651617,13:76180042,3:40349948,17:34345661
1132_1132,1,1,1,1,1,0,2,1,0,0,...,1,1,1,2,2,2,1,1,1,2
1285_1285,2,2,1,2,1,1,2,1,2,0,...,2,0,1,2,2,2,2,0,0,2


In [7]:
cov.head(2)

Unnamed: 0,age,Female,status,PC1_e,PC2_e,PC3_e,PC4_e,PC5_e,PC6_e,PC7_e,...,batch_cov_b_14,batch_cov_b_15,batch_cov_b_2,batch_cov_b_3,batch_cov_b_4,batch_cov_b_5,batch_cov_b_6,batch_cov_b_7,batch_cov_b_8,batch_cov_b_9
1132_1132,45.0,0.0,0.0,19.067178,17.787198,10.275343,-2.82957,-3.546597,-1.269196,-2.183796,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1285_1285,39.0,0.0,0.0,14.471841,18.737343,12.465061,11.195105,-2.246129,-11.168822,2.230269,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
gene_snp_pairs.head(2)

| | gene | SNP |
|---|---|---|
| 606417 | CALML3 | 10:5642859 |
| 3298886 | ZSCAN5A | 19:56931672 |


### Read h5ad object

Standard h5ad file in the scanpy workflow. Some things to keep in mind:

- `adata.X` should contain the raw counts for all detected genes. Typically, this will be a matrix of N cells by ~30k genes in a standard 10X experiment.
- Here, we will just use the T4 cells, defined by one of the `AnnData.obs` columns, for a subset of individuals.
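
A quick sanity check for the first point is to confirm that every stored value in the matrix is a non-negative integer, even if the dtype is a float. The sketch below is a generic check on a SciPy sparse or dense matrix; the helper name `looks_like_raw_counts` is an assumption and is not part of `memento` or scanpy.

```python
import numpy as np
from scipy import sparse

def looks_like_raw_counts(X):
    """Return True if every stored value is a non-negative integer."""
    values = X.data if sparse.issparse(X) else np.asarray(X).ravel()
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))

# Tiny made-up examples: raw counts vs. log-normalized values.
raw = sparse.csr_matrix(np.array([[0, 3, 1], [2, 0, 0]], dtype=np.float32))
normalized = sparse.csr_matrix(np.log1p(raw.toarray()))

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(normalized))
```

On a real object, the same function could be called as `looks_like_raw_counts(adata.X)`.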


In [9]:
adata = sc.read(data_path + 'toy_adata.h5ad')


In [10]:
print('We have {} cells'.format(adata.shape[0]))

We have 16907 cells


In [11]:
adata.obs.head(3)

Unnamed: 0,batch_cov,ind_cov,Processing_Cohort,louvain,cg_cov,ct_cov,L3,ind_cov_batch_cov,Age,Sex,pop_cov,Status,SLE_status
AAGCCGCGTCGAACAG-1-1-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-1-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0,dmx_YE_7-13,1068_1068,2.0,2,T4,T4_em,0.0,1068_1068:dmx_YE_7-13,45.0,Female,European,Managed,SLE
CAAGATCGTGTCCTCT-1-1-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-1-0-0-0-0-0,dmx_YS-JY-22_pool6,1545_1545,4.0,10,T4,T4_naive,1.0,1545_1545:dmx_YS-JY-22_pool6,38.0,Female,European,Managed,SLE
GGGCATCGTCTGGTCG-1-1-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-1-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0,dmx_YE_7-19,1132_1132,2.0,2,T4,T4_em,1.0,1132_1132:dmx_YE_7-19,45.0,Female,European,Managed,SLE


In [12]:
# adata.X should be a sparse matrix with counts
print('Confirming that adata.X is a sparse matrix of counts.')
print('Row sums are:')
print(adata.X.sum(axis=1)[:5])
print('')
print('The matrix itself:')
adata.X

Confirming that adata.X is a sparse matrix of counts.
Row sums are:
[[3313.]
 [2012.]
 [3166.]
 [3324.]
 [1552.]]

The matrix itself:


<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 11790934 stored elements and shape (16907, 32738)>

### Run memento

Because it resamples at the single-cell level, `memento` generally takes much longer than tools like Matrix eQTL or FastQTL, which work on pseudobulks. It is, however, much faster than fitting linear mixed models to millions of cells.
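
To give some intuition for why resampling has to respect the donor structure, here is a generic two-level bootstrap in NumPy: donors are resampled first, then cells within each resampled donor. This is a textbook-style illustration on simulated data, not `memento`'s actual estimator or implementation; the function name and simulation settings are assumptions.

```python
import numpy as np

def hierarchical_bootstrap_means(values, donors, num_boot=1000, seed=0):
    """Bootstrap the mean by resampling donors, then cells within each donor."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    donors = np.asarray(donors)
    # Group the cell values by donor once, up front.
    groups = [values[donors == d] for d in np.unique(donors)]

    boot_means = np.empty(num_boot)
    for b in range(num_boot):
        # Level 1: resample donors with replacement.
        picked = rng.integers(0, len(groups), size=len(groups))
        # Level 2: resample cells with replacement within each picked donor.
        cells = [rng.choice(groups[i], size=len(groups[i]), replace=True)
                 for i in picked]
        boot_means[b] = np.concatenate(cells).mean()
    return boot_means

# Simulated data: 8 donors, 50 cells each, with donor-specific shifts.
rng = np.random.default_rng(1)
donor_ids = np.repeat(np.arange(8), 50)
donor_shift = rng.normal(0, 2.0, size=8)[donor_ids]
expr = donor_shift + rng.normal(0, 1.0, size=donor_ids.size)

se_hier = hierarchical_bootstrap_means(expr, donor_ids).std()
# The naive cell-level standard error ignores the donor structure.
se_naive = expr.std(ddof=1) / np.sqrt(expr.size)
```

When cells from the same donor are correlated, treating every cell as independent understates the uncertainty, which is why `se_naive` tends to be smaller than `se_hier` in simulations like this one.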

I would recommend using as many cores as you can afford on a high-performance computing cluster. For members of the Ye lab, consider something like a c5.24xlarge instance for just a few hours, with `num_cpu` set to 80.

`memento` is fairly flexible here: if you're willing to use more CPUs, the speed scales almost linearly.

In [13]:
eqtl_results = memento.run_eqtl(
    adata=adata,
    snps=snps,
    cov=cov,
    gene_snp_pairs=gene_snp_pairs,
    donor_column='ind_cov',
    num_cpu=14,
    num_blocks=1, # increase this if you run out of memory.
    num_boot=5000,
)

blocksize 50
working on block 0
19


[Parallel(n_jobs=14)]: Using backend LokyBackend with 14 concurrent workers.
  beta = np.einsum('ijk,ij->jk', A_mA * sample_weight[:, :, np.newaxis], B_mB).T/sample_weight.sum(axis=0) / ssA.T
  x = np.asarray((x - loc)/scale, dtype=dtyp)
  x = np.asarray((x - loc)/scale, dtype=dtyp)
[Parallel(n_jobs=14)]: Done  12 out of  19 | elapsed:    1.9s remaining:    1.1s
[Parallel(n_jobs=14)]: Done  19 out of  19 | elapsed:    2.0s finished


In [14]:
# de_pval is deliberately left uncorrected; users can correct the P-values however they please.
eqtl_results.head(10)

| | gene | tx | de_coef | de_se | de_pval |
|---|---|---|---|---|---|
| 0 | ABCG1 | 21:43562525 | 4.858312e-06 | 6e-06 | 0.314122 |
| 0 | ATG4B | 2:242610773 | -1.624035e-06 | 3e-06 | 0.675984 |
| 0 | COX18 | 4:73896353 | -9.146643e-07 | 3e-06 | 0.694078 |
| 0 | EIF1B | 3:40349948 | -1.351412e-05 | 1.1e-05 | 0.234317 |
| 0 | GIMAP2 | 7:150447413 | -2.730682e-06 | 7e-06 | 0.943794 |
| 0 | LEO1 | 15:52208330 | 1.176742e-06 | 2e-06 | 0.703611 |
| 0 | LMO7 | 13:76180042 | -5.564781e-06 | 1.1e-05 | 0.418247 |
| 0 | MRPL13 | 8:121415567 | 3.032256e-06 | 1e-05 | 0.723698 |
| 0 | NCALD | 8:102739834 | -1.17329e-05 | 1.3e-05 | 0.299957 |
| 0 | PBX2 | 6:32187605 | 2.246472e-06 | 3e-06 | 0.366647 |
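
Since `de_pval` is left uncorrected, one common choice is the Benjamini-Hochberg procedure. The sketch below implements it directly in NumPy and applies it to the ten P-values in the output of cell 14; the function name is an assumption, and any other correction method could be substituted. On the full results, the input would be `eqtl_results['de_pval']`.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted P-values, in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # Scale the i-th smallest P-value by n / i.
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    qvals = np.empty(n)
    qvals[order] = np.clip(ranked, 0, 1)
    return qvals

# The de_pval column from the ten rows in the output of cell 14.
de_pval = [0.314122, 0.675984, 0.694078, 0.234317, 0.943794,
           0.703611, 0.418247, 0.723698, 0.299957, 0.366647]
qvals = benjamini_hochberg(de_pval)
```

With only random SNP-gene pairs in the toy data, none of these tests would be expected to survive correction.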
