# eQTL analysis for a dataset with many individuals.

Most of this tutorial also applies to testing any feature defined at the replicate/individual level. For example, comparing cases vs controls would follow a similar procedure, since the independent variable is defined for each person and not for each cell.
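As an illustration of an individual-level variable, a case/control label stored per cell can be collapsed into one row per individual, which is the same layout the SNP table uses later in this tutorial. This is only a minimal sketch: the column names `ind_cov` and `SLE_status` mirror the `adata.obs` columns shown further down, but the toy values and the `is_case` column name are made up for illustration.

```python
import pandas as pd

# Toy per-cell metadata standing in for adata.obs (values are illustrative only).
obs = pd.DataFrame({
    'ind_cov': ['d1', 'd1', 'd2', 'd2', 'd3'],
    'SLE_status': ['SLE', 'SLE', 'Healthy', 'Healthy', 'SLE'],
})

# A donor-level variable must take exactly one value per individual.
labels_per_donor = obs.groupby('ind_cov')['SLE_status'].nunique()
assert (labels_per_donor == 1).all(), 'status varies within an individual'

# One row per individual, one 0/1 column for the variable of interest.
status = obs.drop_duplicates('ind_cov').set_index('ind_cov')['SLE_status']
case_control = (status == 'SLE').astype(int).to_frame('is_case')
print(case_control)
```

The uniqueness check is the important part: if a label differs between cells of the same person, it is not an individual-level variable and this layout does not apply.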

In [1]:
import scanpy as sc
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import itertools
from pybedtools import BedTool
import statsmodels.formula.api as smf
import statsmodels.api as sm
import importlib  # the `imp` module is deprecated and was removed in Python 3.12

import os
import pickle as pkl
%matplotlib inline

In [2]:
import sys
sys.path.append('/home/ssm-user/Github/scrna-parameter-estimation/dist/memento-0.0.9-py3.8.egg')
import memento

In [3]:
data_path  = '/data_volume/memento/lupus/'

### Read the inputs: variables of interest (SNPs), covariates, SNP-gene pairs.

In both the SNP and covariate tables, each row is an individual and each column is a variable of interest.

For the tutorial, we use the genotypes and covariates from the lab's 2022 paper by Perez, Gordon, Subramaniam et al. These inputs are identical to Matrix eQTL inputs; I just transpose them here because I think it makes more sense for observations to be rows.

For the tutorial, we just set up some random SNP-gene pairs to test; however, you can design this mapping to fit your needs. I purposefully didn't encode every possible way of defining gene-SNP relationships.
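One common way to define the mapping is to pair each gene with the SNPs that fall within a fixed window around it. The sketch below shows the idea with pandas on toy data. The gene annotation table (`genes`, with columns `gene`, `chrom`, `pos`) and all coordinates are hypothetical; only the `CHROM:POS` SNP ID format and the `gene`/`SNP` output columns follow this tutorial.

```python
import pandas as pd

WINDOW = 100_000  # bp on either side of the gene position

# Hypothetical gene annotation: one position per gene (coordinates are made up).
genes = pd.DataFrame({
    'gene': ['GENE_A', 'GENE_B'],
    'chrom': ['1', '2'],
    'pos': [1_000_000, 500_000],
})

# SNP IDs in the same CHROM:POS format as the genotype table columns.
snp_table = pd.DataFrame({
    'SNP': ['1:950000', '1:1200000', '2:450000', '2:599999', '3:500000'],
})
parts = snp_table['SNP'].str.split(':', expand=True)
snp_table['chrom'] = parts[0]
snp_table['snp_pos'] = parts[1].astype(int)

# Join on chromosome, then keep SNPs inside the window around each gene.
merged = genes.merge(snp_table, on='chrom')
in_window = (merged['snp_pos'] - merged['pos']).abs() <= WINDOW
pairs = merged.loc[in_window, ['gene', 'SNP']].reset_index(drop=True)
print(pairs)
```

The per-chromosome merge creates every gene-SNP combination before filtering, so for a genome-wide annotation an interval tool such as bedtools is probably a better fit than this approach.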

In [4]:
pop = 'eur'
snps_path = data_path + 'mateqtl_input/{}_genos.tsv'.format(pop)
cov_path = data_path + 'mateqtl_input/{}_mateqtl_cov.txt'.format(pop)

In [5]:
snps = pd.read_csv(snps_path, sep='\t', index_col=0).T
cov = pd.read_csv(cov_path, sep='\t', index_col=0).T

In [6]:
# Print the first 5 SNPs for the first 5 individuals to show the structure
snps.iloc[:, :5].head(5)

CHROM:POS,3:165182446,6:122682327,22:40561759,3:104381193,15:57107863
1132_1132,1,1,1,1,2
1285_1285,2,1,2,1,2
1961_1961,1,2,2,0,2
HC-526,1,1,2,2,1
1414_1414,0,0,2,1,2


In [7]:
# Print the covariates for the first 5 individuals to show the structure
cov.head(5)

Unnamed: 0,age,Female,status,PC1_e,PC2_e,PC3_e,PC4_e,PC5_e,PC6_e,PC7_e,...,batch_cov_b_14,batch_cov_b_15,batch_cov_b_2,batch_cov_b_3,batch_cov_b_4,batch_cov_b_5,batch_cov_b_6,batch_cov_b_7,batch_cov_b_8,batch_cov_b_9
1132_1132,45.0,0.0,0.0,19.067178,17.787198,10.275343,-2.82957,-3.546597,-1.269196,-2.183796,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1285_1285,39.0,0.0,0.0,14.471841,18.737343,12.465061,11.195105,-2.246129,-11.168822,2.230269,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1961_1961,43.0,0.0,0.0,-7.343628,38.241007,-7.431836,0.60722,-13.730105,-2.339229,-3.238375,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
HC-526,58.0,0.0,1.0,0.495487,-17.795535,0.458286,5.384761,-10.269823,-2.239953,-4.240055,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1414_1414,41.0,1.0,0.0,-10.31384,-3.423322,1.635042,9.192646,-3.507571,11.446228,-3.834848,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [8]:
# You can define this mapping DataFrame however you want - for example, you can pair each gene with the SNPs within a 100kb window.
# Here, to make the tutorial faster, we'll just randomly sample 50k of the pairs.
gene_snp_pairs = pd.read_csv(data_path + 'mateqtl_input/{}/gene_snp_pairs_hg19_100kb.csv'.format(pop))
gene_snp_pairs.columns = ['gene', 'SNP']
gene_snp_pairs = gene_snp_pairs.query('SNP in @snps.columns').sample(50000)

In [9]:
gene_snp_pairs.head(10)

Unnamed: 0,gene,SNP
5457519,HIVEP2,6:143300076
5327418,RRP36,6:42920844
2681034,TSEN54,17:73461930
1512075,SNRPF,12:96256344
6033642,GPR20,8:142275326
1814111,MGAT2,14:50039198
2235233,NPIPB4,16:21827665
2825863,VPS4B,18:61111603
780221,PYROXD2,10:100228931
2618184,DLX3,17:47999822


### Read h5ad object

Standard h5ad file in the scanpy workflow. Some things to keep in mind:

- `adata.X` should hold the raw counts for all detected genes. In a standard 10X experiment, this is typically a matrix of N cells by ~30k genes.
- Here, we will use only the T4 cells, as labeled in one of the `adata.obs` columns.
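A quick way to check the first point is to confirm that every stored value is a non-negative integer, which normalized or log-transformed matrices generally fail. This is a minimal sketch on toy matrices; `looks_like_raw_counts` is a hypothetical helper written for this example and is not part of scanpy or memento.

```python
import numpy as np
from scipy import sparse

def looks_like_raw_counts(X):
    """Return True if every stored value is a non-negative integer."""
    values = X.data if sparse.issparse(X) else np.asarray(X).ravel()
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))

# Toy matrices standing in for adata.X (3 cells x 4 genes).
raw = sparse.csr_matrix(
    np.array([[0, 2, 0, 1], [5, 0, 0, 0], [0, 0, 3, 0]], dtype=np.float32)
)
normalized = raw.copy()
normalized.data = np.log1p(normalized.data)  # simulate a log-transformed matrix

print(looks_like_raw_counts(raw))         # True
print(looks_like_raw_counts(normalized))  # False
```

On a real object the same call would be `looks_like_raw_counts(adata.X)`. For sparse input only the stored nonzero entries are inspected, so the check stays cheap even for large matrices.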


In [10]:
ct = 'T4'

In [11]:
adata = sc.read(data_path + 'single_cell/{}_{}.h5ad'.format(pop, ct))
adata = adata[adata.obs.ind_cov.isin(snps.index)].copy() # pick out individuals we have genotype and covariates for


In [12]:
print('We have {} cells labeled as T4'.format(adata.shape[0]))

We have 129531 cells labeled as T4


In [13]:
adata.obs.head(3)

Unnamed: 0,batch_cov,ind_cov,Processing_Cohort,louvain,cg_cov,ct_cov,L3,ind_cov_batch_cov,Age,Sex,pop_cov,Status,SLE_status
GTCACGGAGATTACCC-1-1-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-1-0-0-0-0-0-0-0-0-0-0-0-0-0,dmx_YE_8-2,1368_1368,2.0,1,T4,,0.0,1368_1368:dmx_YE_8-2,45.0,Male,European,Managed,SLE
GTCATTTCAGAGTGTG-1-1-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-1-0-0-0-0-0-0,dmx_YS-JY-22_pool5,HC-540,4.0,2,T4,T4_em,1.0,HC-540:dmx_YS-JY-22_pool5,68.0,Female,European,Healthy,Healthy
AAAGATGGTTCACGGC-1-1-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-1-0-0-0-0-0-0-0-0-0-0,dmx_YS-JY-20_pool3,HC-006,4.0,1,T4,T4_naive,1.0,HC-006:dmx_YS-JY-20_pool3,53.0,Female,European,Healthy,Healthy


In [14]:
adata.X[:5, :15]


<5x15 sparse matrix of type '<class 'numpy.float32'>'
	with 0 stored elements in Compressed Sparse Row format>

In [15]:
# adata.X should be a sparse matrix with counts
print('Confirming that adata.X is a sparse matrix of counts.')
print('Row sums are:')
print(adata.X.sum(axis=1)[:5])
print('')
print('The matrix itself:')
adata.X

Confirming that adata.X is a sparse matrix of counts.
Row sums are:
[[1905.]
 [2104.]
 [2102.]
 [1209.]
 [2030.]]

The matrix itself:


<129531x32738 sparse matrix of type '<class 'numpy.float32'>'
	with 83322139 stored elements in Compressed Sparse Row format>

### Run memento

Because it resamples at the single-cell level, `memento` generally takes much longer than tools such as Matrix eQTL or FastQTL, which work on pseudobulks. It is, however, much faster than fitting linear mixed models to millions of cells.

I recommend using as many cores as you can afford on a high-performance computing cluster. (For members of the Ye lab, consider something like a c5.24xlarge instance for a few hours, with `num_cpu` set to 80.)

`memento` parallelizes well: if you're willing to use more CPUs, the speed scales almost linearly.

In [18]:
eqtl_results = memento.run_eqtl(
    adata=adata,
    snps=snps,
    cov=cov,
    gene_snp_pairs=gene_snp_pairs,
    donor_column='ind_cov',
    num_cpu=60,
    num_blocks=3, # increase this if you run out of memory.
    num_boot=5000
)

blocksize 16666
working on block 0
(91, 35)
(91, 3285470)
3661


[Parallel(n_jobs=60)]: Using backend LokyBackend with 60 concurrent workers.
[Parallel(n_jobs=60)]: Done  80 tasks      | elapsed:    9.3s
[Parallel(n_jobs=60)]: Done 330 tasks      | elapsed:   17.8s
[Parallel(n_jobs=60)]: Done 680 tasks      | elapsed:   29.0s
[Parallel(n_jobs=60)]: Done 1130 tasks      | elapsed:   43.1s
[Parallel(n_jobs=60)]: Done 1680 tasks      | elapsed:  1.0min
[Parallel(n_jobs=60)]: Done 2330 tasks      | elapsed:  1.4min
[Parallel(n_jobs=60)]: Done 3080 tasks      | elapsed:  1.8min
[Parallel(n_jobs=60)]: Done 3661 out of 3661 | elapsed:  2.3min finished


working on block 1
(91, 35)
(91, 3285470)
3648


[Parallel(n_jobs=60)]: Using backend LokyBackend with 60 concurrent workers.
[Parallel(n_jobs=60)]: Done  80 tasks      | elapsed:    3.4s
[Parallel(n_jobs=60)]: Done 330 tasks      | elapsed:   11.5s
[Parallel(n_jobs=60)]: Done 680 tasks      | elapsed:   22.6s
[Parallel(n_jobs=60)]: Done 1130 tasks      | elapsed:   36.6s
[Parallel(n_jobs=60)]: Done 1680 tasks      | elapsed:   53.6s
[Parallel(n_jobs=60)]: Done 2330 tasks      | elapsed:  1.3min
[Parallel(n_jobs=60)]: Done 3080 tasks      | elapsed:  1.6min
[Parallel(n_jobs=60)]: Done 3648 out of 3648 | elapsed:  2.2min finished


working on block 2
(91, 35)
(91, 3285470)
3682


[Parallel(n_jobs=60)]: Using backend LokyBackend with 60 concurrent workers.
[Parallel(n_jobs=60)]: Done  80 tasks      | elapsed:    3.5s
[Parallel(n_jobs=60)]: Done 330 tasks      | elapsed:   11.5s
[Parallel(n_jobs=60)]: Done 680 tasks      | elapsed:   22.6s
[Parallel(n_jobs=60)]: Done 1130 tasks      | elapsed:   36.7s
[Parallel(n_jobs=60)]: Done 1680 tasks      | elapsed:   54.0s
[Parallel(n_jobs=60)]: Done 2330 tasks      | elapsed:  1.3min
[Parallel(n_jobs=60)]: Done 3080 tasks      | elapsed:  1.6min
[Parallel(n_jobs=60)]: Done 3682 out of 3682 | elapsed:  2.2min finished


In [19]:
# The de_pval is intentionally not corrected for multiple testing - users can correct the P-values however they please.
eqtl_results.head(10)

Unnamed: 0,gene,tx,de_coef,de_se,de_pval,dv_coef,dv_se,dv_pval
0,A1BG,19:58909377,-0.001081,0.001542,0.483418,0.0,1.612603e-16,1.0
1,A1BG,19:58949046,-0.003161,0.002025,0.118595,0.0,1.928353e-16,1.0
0,A2M-AS1,12:9177898,-0.000219,0.001075,0.838651,0.0,2.47457e-16,1.0
0,AAGAB,15:67501033,-0.000242,0.019809,0.990291,0.0,2.807912e-16,1.0
0,AAMP,2:219117482,-0.002689,0.002074,0.194787,0.0,1.57421e-16,1.0
0,AASDH,4:57149118,-0.013158,0.004572,0.004006,0.0,1.713491e-16,1.0
0,AASDHPPT,11:106017549,-0.002468,0.002917,0.397558,0.0,2.482082e-16,1.0
0,AATF,17:35243370,3.7e-05,0.000783,0.962072,0.0,1.922602e-16,1.0
0,ABCA5,17:67346426,0.000524,0.000682,0.442494,0.0,1.927617e-16,1.0
1,ABCA5,17:67354696,-0.000494,0.000644,0.443477,0.0,1.832501e-16,1.0
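Since `de_pval` is left uncorrected, one option is a Benjamini-Hochberg adjustment. The sketch below implements it with NumPy and applies it to the ten `de_pval` values shown above, purely to illustrate the mechanics; `benjamini_hochberg` is a helper written for this example, not a memento function. In practice the correction should be run over all tested pairs, not just the first ten rows.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # Scale the i-th smallest p-value by n / i.
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, then cap at 1.
    adjusted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    out = np.empty(n)
    out[order] = adjusted
    return out

# de_pval values from the ten rows shown above.
de_pval = [0.483418, 0.118595, 0.838651, 0.990291, 0.194787,
           0.004006, 0.397558, 0.962072, 0.442494, 0.443477]
qvals = benjamini_hochberg(de_pval)
print(qvals.round(4))
```

The same adjustment is available as `multipletests(pvals, method='fdr_bh')` in `statsmodels.stats.multitest`, which may be more convenient given that statsmodels is already imported at the top of this notebook.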
