### Expression Quantitative Trait Loci (eQTL) Association Study

In this notebook we show how to run a simple eQTL analysis using the `cellink` package.

First we import the necessary libraries.

In [None]:
import logging
import warnings

import anndata as ad
import scanpy as sc
import numpy as np
import pandas as pd

from tqdm.notebook import tqdm
from pathlib import Path
from statsmodels.stats.multitest import fdrcorrection

from cellink.io import read_sgkit_zarr
from cellink.tl import get_best_eqtl_on_single_gene

warnings.filterwarnings("ignore")

logger = logging.getLogger(__name__)

An eQTL analysis is run separately for each cell type and each chromosome. In this tutorial we work on chromosome 22 and the CD4 NC cell type.

In [None]:
DEBUG = False
TARGET_CELL_TYPE = "CD4 NC"
TARGET_CHROMOSOME = "22"
THRESHOLD = 0.05

Next we define the data paths. They point to the authors' local copy of the OneK1K data, so adjust `DATA` to match your own setup.

In [None]:
## paths
DATA = Path("/home/lollo/Work/hackathon/data/Yazar_OneK1K")
# DATA = Path("/Users/jan.engelmann/projects/sc-eqtl/data")

vcf_file_path = DATA / "OneK1K_imputation_post_qc_r2_08/filter_vcf_r08/chr22.dose.filtered.R2_0.8.vcf.gz"

zarr_path = vcf_file_path.parent.parent / "filter_zarr_r08"
zarr_path.mkdir(exist_ok=True)

icf_file_path = zarr_path / vcf_file_path.with_suffix(".icf").name
zarr_file_path = (zarr_path / vcf_file_path.stem).with_suffix(".vcz")

if DEBUG:
    scdata_path = DATA / "debug_OneK1K_cohort_gene_expression_matrix_14_celltypes.h5ad"
else:
    scdata_path = DATA / "OneK1K_cohort_gene_expression_matrix_14_celltypes.h5ad.gz"

print(zarr_file_path, scdata_path)
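
Because the VCF file name contains several dots, it helps to check how `pathlib` derives the `.icf` and `.vcz` names: `with_suffix` and `stem` act only on the last suffix. The sketch below reproduces both derivations on the bare file name with `PurePosixPath`, so it runs without the data being present.

```python
from pathlib import PurePosixPath

vcf_name = PurePosixPath("chr22.dose.filtered.R2_0.8.vcf.gz")

# with_suffix replaces only the final suffix (".gz"), so ".vcf" is kept
icf_name = vcf_name.with_suffix(".icf").name
# stem drops ".gz"; with_suffix then replaces the remaining ".vcf"
vcz_name = PurePosixPath(vcf_name.stem).with_suffix(".vcz").name

print(icf_name)  # chr22.dose.filtered.R2_0.8.vcf.icf
print(vcz_name)  # chr22.dose.filtered.R2_0.8.vcz
```

Note that the two derived names differ in whether `.vcf` is retained.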

Now we can read the single-cell data with the `anndata.read_h5ad` function.
Since the analysis is run per cell type, we subset the data to the target cell type before proceeding with the rest of the pipeline.

In [None]:
## reading single cell data
scdata = ad.read_h5ad(scdata_path)
## filtering by the target cell type
scdata = scdata[scdata.obs.cell_label == TARGET_CELL_TYPE]
scdata

We read the genetic data from the `zarr` file using the `cellink.io.read_sgkit_zarr` function.

In [None]:
gdata = read_sgkit_zarr(zarr_file_path)
gdata.obs = gdata.obs.set_index("id")
gdata

In order to proceed with the analysis, we need to extend the single cell data with the biomart annotations.

In [None]:
## annotating the single cell data
annot = (
    sc.queries.biomart_annotations(
        "hsapiens",
        ["ensembl_gene_id", "start_position", "end_position", "chromosome_name"],
    )
    ## keep one row per gene so that the index is unique
    .drop_duplicates(subset="ensembl_gene_id")
    .set_index("ensembl_gene_id")
)

## copy so that the annotations are written to real data rather than a view
scdata = scdata[:, scdata.var.index.isin(annot.index)].copy()
scdata.var["chrom"] = annot.loc[scdata.var.index, "chromosome_name"].values
scdata.var["start"] = annot.loc[scdata.var.index, "start_position"].values
scdata.var["end"] = annot.loc[scdata.var.index, "end_position"].values
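
When joining annotations by gene ID, each ID should map to exactly one row. The sketch below uses a hypothetical toy table (the gene IDs and coordinates are made up) to show that `drop_duplicates()` called after `set_index` compares column values only, whereas filtering on `index.duplicated()` leaves one row per gene.

```python
import pandas as pd

# Toy annotation table (hypothetical values): G1 is listed twice with different coordinates
annot = pd.DataFrame(
    {
        "ensembl_gene_id": ["G1", "G1", "G2"],
        "start_position": [100, 150, 500],
        "end_position": [200, 250, 900],
        "chromosome_name": ["22", "22", "22"],
    }
).set_index("ensembl_gene_id")

# drop_duplicates() ignores the index, so both G1 rows survive
by_values = annot.drop_duplicates()
# keeping the first row per index label guarantees one row per gene
by_index = annot[~annot.index.duplicated(keep="first")]

print(by_values.index.is_unique, by_index.index.is_unique)
```

A unique index matters here because `annot.loc[scdata.var.index, ...]` would otherwise return more rows than there are genes.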

We can now normalize the counts and log-transform them.

In [None]:
sc.pp.normalize_total(scdata)
sc.pp.log1p(scdata)
sc.pp.normalize_total(scdata)
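
As a rough illustration of what total-count normalization followed by a log transform computes, here is a NumPy sketch on a hypothetical toy matrix. It assumes the default behaviour of `sc.pp.normalize_total` (`target_sum=None`), where each cell is scaled to the median total count; it is not a replacement for the scanpy calls above.

```python
import numpy as np

# Toy count matrix (hypothetical): 3 cells x 2 genes
counts = np.array([[1.0, 3.0], [2.0, 2.0], [10.0, 30.0]])

# per-cell library size
totals = counts.sum(axis=1, keepdims=True)
# default target: the median library size across cells
target = np.median(totals)
# after scaling, every cell sums to the target
normalized = counts / totals * target
# log(1 + x) transform
logged = np.log1p(normalized)

print(normalized.sum(axis=1))
```

The first and third cells have the same relative composition, so they become identical after normalization even though their raw depths differ tenfold.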

Since our genetic data covers only chromosome 22, we subset the single-cell data to the genes that the biomart annotation places on that chromosome.

In [None]:
scdata = scdata[:, scdata.var.chrom == TARGET_CHROMOSOME]

Since each observation in the genetic data is a donor, we need to pseudo-bulk the single cell data to have a representation at the same level.

In [None]:
## aggregating the data
pbdata = sc.get.aggregate(scdata, "individual", "mean")
gdata = gdata[pbdata.obs.index]
pbdata.X = pbdata.layers["mean"]
pbdata
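
Conceptually, pseudo-bulking with `"mean"` averages the expression of all cells that belong to the same donor, after which the genotype table is reordered to match. A minimal pandas sketch on hypothetical toy data (the donor IDs, gene names, and variant name are made up):

```python
import pandas as pd

# Toy single-cell table: one row per cell, with the donor it came from
cells = pd.DataFrame(
    {
        "individual": ["d1", "d1", "d2", "d2", "d2"],
        "GENE_A": [1.0, 3.0, 0.0, 2.0, 4.0],
        "GENE_B": [0.0, 0.0, 1.0, 1.0, 4.0],
    }
)

# one row per donor: mean expression over that donor's cells
pseudobulk = cells.groupby("individual").mean()

# Toy genotype table with donors in a different order, plus one donor without cells
genotypes = pd.DataFrame({"variant_x": [0, 1, 2]}, index=["d2", "d3", "d1"])

# reorder and subset the genotypes so that rows line up with the pseudo-bulk
aligned = genotypes.loc[pseudobulk.index]

print(pseudobulk)
print(aligned)
```

Indexing the genotypes by the pseudo-bulk index mirrors `gdata[pbdata.obs.index]` above: it both drops donors without cells and fixes the row order.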

We run a sanity check to make sure that the observations match, in the same order, across the two data sources (pseudo-bulked single-cell data and genetic data).

In [None]:
## sanity check (we have all the individuals from both data sources)
assert (pbdata.obs.index == gdata.obs.index).all()

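We also filter out the genes that are expressed in fewer than 10 individuals. After pseudo-bulking, each observation is an individual, so `min_cells` counts individuals here.

In [None]:
## first we need to filter out genes that are expressed in fewer than ten individuals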
sc.pp.filter_genes(pbdata, min_cells=10)
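
The filter keeps a gene if it has non-zero expression in at least `min_cells` observations. The NumPy sketch below applies the same criterion to a hypothetical toy matrix, with a threshold of 2 rather than 10 to keep the example small.

```python
import numpy as np

# Toy pseudo-bulk matrix (hypothetical): 4 individuals x 3 genes
X = np.array(
    [
        [0.0, 1.2, 0.0],
        [0.5, 0.9, 0.0],
        [0.0, 1.1, 0.0],
        [0.3, 0.0, 0.0],
    ]
)
min_individuals = 2  # the tutorial uses 10

# number of individuals with non-zero expression, per gene
n_expressing = (X > 0).sum(axis=0)
keep = n_expressing >= min_individuals
X_filtered = X[:, keep]

print(n_expressing, keep)
```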

We can now run the eQTL test for each gene on chromosome 22.

In [None]:
## retrieving the genes associated to chromosome 22
genes_chrom_22 = pbdata[:, pbdata.var["chrom"] == TARGET_CHROMOSOME].var.index.values
## running the eqtl test
cis_window = 1_000_000
results = []
## iterating over the genes with a progress bar
for target_gene in tqdm(genes_chrom_22):
    eqtl_results = get_best_eqtl_on_single_gene(pbdata, gdata, target_gene, cis_window)
    results.append(eqtl_results)
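
To give an intuition for a per-gene cis scan, here is a self-contained sketch on simulated data. It is not the `cellink` implementation: it assumes the cis window extends `cis_window` base pairs on either side of the gene body and that each variant is tested with a simple linear regression of expression on dosage (`scipy.stats.linregress`). The `best_variant` key and all values are hypothetical.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n_donors = 100

# Simulated variants: genomic positions and per-donor dosages (0, 1, or 2)
positions = np.array([500_000, 1_900_000, 2_100_000, 2_600_000, 3_500_000, 9_000_000])
dosages = rng.integers(0, 3, size=(n_donors, positions.size)).astype(float)

# Simulated gene whose expression is driven by the variant at index 2
gene_start, gene_end = 2_000_000, 2_500_000
expression = 2.0 * dosages[:, 2] + rng.normal(scale=0.1, size=n_donors)

# Assumption: the cis window spans cis_window bp on either side of the gene
cis_window = 1_000_000
in_cis = (positions >= gene_start - cis_window) & (positions <= gene_end + cis_window)
cis_idx = np.flatnonzero(in_cis)

# One regression per cis variant; keep the smallest p-value
pvalues = np.array([linregress(dosages[:, j], expression).pvalue for j in cis_idx])
result = {
    "min_pv": float(pvalues.min()),
    "no_tested_variants": int(cis_idx.size),
    "best_variant": int(cis_idx[np.argmin(pvalues)]),
}
print(result)
```

Recording the number of tested variants alongside the minimum p-value is what makes the per-gene Bonferroni correction possible in the next step.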

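To interpret the results, we compute for each gene a Bonferroni-adjusted p-value (the minimum p-value multiplied by the number of variants tested for that gene, clipped to 1), and then a q-value by applying the Benjamini-Hochberg procedure to those adjusted p-values across all genes.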

In [None]:
## constructing output DataFrame
eqtl_results_df = pd.DataFrame(results)
eqtl_results_df["pv_reject"] = eqtl_results_df["min_pv"] < THRESHOLD
eqtl_results_df["bf_pv"] = np.clip(eqtl_results_df["min_pv"] * eqtl_results_df["no_tested_variants"], 0, 1)
eqtl_results_df["bf_pv_reject"] = eqtl_results_df["bf_pv"] < THRESHOLD
eqtl_results_df["q_val"] = fdrcorrection(eqtl_results_df["bf_pv"].values)[1]
eqtl_results_df["q_val_reject"] = eqtl_results_df["q_val"] < THRESHOLD
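
The two-stage correction can be traced by hand on a few numbers. The sketch below uses hypothetical p-values and variant counts, and a small NumPy implementation of the Benjamini-Hochberg adjustment written for illustration; the notebook itself relies on `fdrcorrection` from `statsmodels`.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # scale the i-th smallest p-value by n / i
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, starting from the largest p-value
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(monotone, 0.0, 1.0)
    return q


# Hypothetical per-gene results: minimum p-value and number of cis variants tested
min_pv = np.array([1e-6, 0.002, 0.04, 0.3])
no_tested_variants = np.array([1000, 2000, 1500, 500])

# Stage 1: Bonferroni within each gene, clipped to [0, 1]
bf_pv = np.clip(min_pv * no_tested_variants, 0, 1)
# Stage 2: Benjamini-Hochberg across genes
q_val = benjamini_hochberg(bf_pv)

print(bf_pv, q_val)
```

Only the first gene survives both stages here: its nominal p-value is small enough to remain significant after multiplying by 1000 variants and then adjusting across four genes.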

Once the analysis is complete, we can save the resulting `DataFrame` to disk.

In [None]:
## saving the resulting dataframe
eqtl_results_df.to_csv(f"/home/lollo/Work/hackathon/dump/eqtl_{TARGET_CELL_TYPE}.csv")

In [None]:
eqtl_results_df

## TODOs

- [x] Run on all genes on Chromosome 22
- [x] For each gene store: minimum p value, number of variants tested, id of minimum pv variant, gene name
- [x] Bonferroni correction per hit: pv_gene = pv * num_cis_variants, `np.clip` to [0, 1]
- [x] subset to Bonferroni-significant hits (pv_gene < 0.05)
- [x] Benjamini-Hochberg across tests -> qv
- [x] report # of qv < 0.05
- [ ] check how many hits you have compared to OneK1K
- [x] add gwas to tools
- [x] Figure out how to render several notebooks
- [ ] Stretch goal: all cell types