# Tutorial: Rare Variant Association Testing with `cellink`

This tutorial demonstrates how to perform rare variant association testing (RVAT) in single-cell data using the `cellink` package. We focus on two major types of RVAT: burden tests and the Sequence Kernel Association Test (SKAT). Rare variants can be especially informative for understanding disease biology, but they require careful aggregation and testing approaches due to their low frequency. 

We’ll walk through how to:
- Prepare single-cell and genotype data using `DonorData` objects.
- Apply burden tests with multiple variant annotation-based weights.
- Combine p-values from multiple tests using ACAT.
- Perform SKAT for gene-level rare variant analysis.

This tutorial uses a subset of data from the OneK1K project, filtered to the **CD4 Naive** T cell type, and focuses on **chromosome 22** to keep the runtime short. It builds on earlier tutorials that cover pseudobulk expression and variant annotation, which should be reviewed for full context.


> Note: This notebook assumes that you have already run the variant annotation step as described in the [Variant Annotation Tutorial](./explore_annotations.ipynb).

> Prerequisites: Running this notebook requires the RVAT dependencies, which can be installed with:
```bash
pip install -e "sc-genetics[rvat,datasets]"
conda install -c conda-forge chiscore
```

## Environment Setup

We begin by importing necessary libraries and defining key parameters for our analysis. `cellink` provides utilities that extend `AnnData` to handle both donor-level genotype data and single-cell RNA-seq data efficiently.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd

In [3]:
from pathlib import Path
import warnings

import anndata as ad
import scanpy as sc
import dask.array as da
import numpy as np
from tqdm.auto import tqdm

import cellink as cl
from cellink._core import DAnn, GAnn
from cellink.tl._rvat import run_burden_test, run_skat_test, beta_weighting
from cellink.utils import column_normalize, gaussianize
from cellink.at.acat import acat_test
from cellink.resources import get_onek1k



In [4]:
DATA = Path(cl.__file__).parent.parent.parent / "docs/tutorials/data"

In [5]:
n_gpcs = 20
n_epcs = 15
batch_e_pcs_n_top_genes = 2000
chrom = 22
cis_window = 100_000
cell_type = "CD4 Naive"
pb_gex_key = f"PB_{cell_type}"  # pseudobulk expression in dd.G.obsm[key_added]
original_donor_col = "donor_id"
min_percent_donors_expressed = 0.1
celltype_key = "predicted.celltype.l2"
do_debug = False

## Load and Prepare Data

Here, we load a prepared dataset (`onek1k`) that includes genotype and expression information from human donors. We also retrieve gene coordinates from Ensembl via `pybiomart`; these are needed to define the cis-window around each gene in the association tests.

In [6]:
dd = get_onek1k(config_path="../../src/cellink/resources/config/onek1k.yaml", verify_checksum=False)
dd

INFO:root:/data/ouga/home/ag_gagneur/hoev/cellink_sample_data/onek1k/onek1k_cellxgene.h5ad already exists
INFO:root:/data/ouga/home/ag_gagneur/hoev/cellink_sample_data/onek1k/OneK1K.noGP.vcf.gz already exists
INFO:root:/data/ouga/home/ag_gagneur/hoev/cellink_sample_data/onek1k/OneK1K.noGP.vcf.gz.csi already exists
INFO:root:/data/ouga/home/ag_gagneur/hoev/cellink_sample_data/onek1k/gene_counts_Ensembl_105_phenotype_metadata.tsv.gz already exists




In [7]:
def _get_ensembl_gene_id_start_end_chr():
    from pybiomart import Server

    server = Server(host="http://www.ensembl.org")
    dataset = server.marts["ENSEMBL_MART_ENSEMBL"].datasets["hsapiens_gene_ensembl"]
    ensembl_gene_id_start_end_chr = dataset.query(
        attributes=["ensembl_gene_id", "start_position", "end_position", "chromosome_name"]
    )
    ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.set_index("Gene stable ID")
    ensembl_gene_id_start_end_chr = ensembl_gene_id_start_end_chr.rename(
        columns={
            "Gene start (bp)": GAnn.start,
            "Gene end (bp)": GAnn.end,
            "Chromosome/scaffold name": GAnn.chrom,
        }
    )
    return ensembl_gene_id_start_end_chr

In [8]:
ensembl_gene_id_start_end_chr = _get_ensembl_gene_id_start_end_chr()
ensembl_gene_id_start_end_chr

Gene stable ID,start,end,chrom
ENSG00000210049,577,647,MT
ENSG00000211459,648,1601,MT
ENSG00000210077,1602,1670,MT
ENSG00000210082,1671,3229,MT
ENSG00000209082,3230,3304,MT
...,...,...,...
ENSG00000197989,28578536,28583132,1
ENSG00000229388,28643159,28648829,1
ENSG00000289291,28736044,28737670,1
ENSG00000274978,28648600,28648734,1


In [9]:
dd.C.var = dd.C.var.join(ensembl_gene_id_start_end_chr)
dd.C.obs[DAnn.donor] = dd.C.obs[original_donor_col]
dd.G.obsm["gPCs"] = dd.G.obsm["gPCs"][dd.G.obsm["gPCs"].columns[:n_gpcs]]

## Normalize and Filter Expression Data

We normalize single-cell expression using total-count normalization and log transformation, and compute expression PCs from donor-level mean expression. We then restrict the dataset to cells of a single type (`CD4 Naive`). Analyzing one cell type at a time avoids mixing cell-type-specific expression effects, which can increase power to detect associations.
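As a rough illustration of these steps, the sketch below applies total-count normalization, a log transform, and a per-donor mean to a tiny made-up count matrix using plain NumPy. The counts and donor labels are hypothetical, and the code only mirrors the idea behind `sc.pp.normalize_total`, `sc.pp.log1p`, and mean aggregation; it is not how `scanpy` or `cellink` implement them.

```python
import numpy as np

# Hypothetical toy data: 4 cells x 3 genes, two cells per donor
counts = np.array([[4, 0, 6], [1, 1, 0], [0, 5, 5], [2, 2, 4]], dtype=float)
donor = np.array(["d1", "d1", "d2", "d2"])

# Total-count normalization: scale each cell to the median total count
# (the default target of scanpy's normalize_total when target_sum is None)
target = np.median(counts.sum(axis=1))
norm = counts / counts.sum(axis=1, keepdims=True) * target

# Log transform, log(1 + x)
logged = np.log1p(norm)

# Donor-level pseudobulk: mean expression over each donor's cells
donors = np.unique(donor)
pseudobulk = np.vstack([logged[donor == d].mean(axis=0) for d in donors])
print(donors, pseudobulk.round(3))
```

Each row of `pseudobulk` is one donor, which is the unit of analysis for the association tests that follow.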

In [10]:
sc.pp.normalize_total(dd.C)
sc.pp.log1p(dd.C)
sc.pp.normalize_total(dd.C)

# Note: expression PCs are computed from donor-level means across all cell types,
# because this cell runs before the cell-type filter below
mdata = sc.get.aggregate(dd.C, by=DAnn.donor, func="mean")
mdata.X = mdata.layers.pop("mean")

sc.pp.highly_variable_genes(mdata, n_top_genes=batch_e_pcs_n_top_genes)
sc.tl.pca(mdata, n_comps=n_epcs)

dd.G.obsm["ePCs"] = mdata[dd.G.obs_names].obsm["X_pca"]

In [11]:
dd = dd[..., dd.C.obs[celltype_key] == cell_type, :].copy()
dd



In [12]:
dd.aggregate(key_added=pb_gex_key, sync_var=True, verbose=True)
dd.aggregate(obs=["sex", "age"], func="first", add_to_obs=True)
dd

INFO:cellink._core.donordata:Aggregated X to PB_CD4 Naive
INFO:cellink._core.donordata:Observation found for 981 donors.




In [13]:
print(f"{pb_gex_key} shape:", dd.G.obsm[pb_gex_key].shape)
print("dd.shape:", dd.shape)

keep_genes = ((dd.G.obsm[pb_gex_key] > 0).mean(axis=0) >= min_percent_donors_expressed).values
dd = dd[..., keep_genes]
print("after filtering")
print(f"{pb_gex_key} shape:", dd.G.obsm[pb_gex_key].shape)
print("dd.shape:", dd.shape)

PB_CD4 Naive shape: (981, 36469)
dd.shape: (981, 136776, 259012, 36469)
after filtering
PB_CD4 Naive shape: (981, 15770)
dd.shape: (981, 136776, 259012, 15770)


In [14]:
# alternative to dd[:, dd.G.var.chrom == str(chrom), :, dd.C.var.chrom == str(chrom)]
dd = dd.sel(G_var=dd.G.var.chrom == str(chrom), C_var=dd.C.var.chrom == str(chrom)).copy()
dd



## Step 1: Annotate Variants with Functional Information

To inform the weighting in burden tests, we annotate variants using the VEP tool. These annotations include predicted functional consequences (e.g., missense, stop gained), CADD scores, and distance to gene transcription start sites (TSS).

This information allows biologically meaningful prioritization of variants when aggregating their effects.
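The annotated table shown further down contains one `Consequence_*` indicator column per consequence term. A minimal pandas sketch of that kind of one-hot encoding is given here; the SNP identifiers and values are made up, and this is only an assumption about what such an encoding looks like, not the code behind `add_vep_annos_to_gdata`.

```python
import pandas as pd

# Hypothetical VEP-like rows; a variant can carry several comma-separated terms
vep = pd.DataFrame(
    {
        "Consequence": [
            "missense_variant",
            "intron_variant",
            "missense_variant,splice_region_variant",
        ],
        "CADD_RAW": [2.1, -0.3, 1.4],
    },
    index=pd.Index(["snp_a", "snp_b", "snp_c"], name="snp_id"),
)

# One indicator column per consequence term, prefixed like the table columns
onehot = vep["Consequence"].str.get_dummies(sep=",").add_prefix("Consequence_")

# Keep the numeric score next to the indicators
annotated = pd.concat([onehot, vep[["CADD_RAW"]]], axis=1)
print(annotated)
```

Indicator columns like these can serve as 0/1 weights, while continuous scores such as `CADD_RAW` can be used as weights directly.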

In [15]:
vep_annotation_file = DATA / "variant_annotation/variants_vep_annotated.txt"

In [16]:
cl.tl.add_vep_annos_to_gdata(vep_anno_file=vep_annotation_file, gdata=dd.G, dummy_consequence=True)
dd.G.uns["variant_annotation_vep"]

INFO:cellink.tl._annotate_snps_genotype_data:Preparing VEP annotations for addition to gdata
INFO:cellink.tl._annotate_snps_genotype_data:Reading annotation file /data/nasif12/home_if12/hoev/git/sc-genetics/docs/tutorials/data/variant_annotation/variants_vep_annotated.txt
INFO:cellink.tl._annotate_snps_genotype_data:Annotation file loaded
INFO:cellink.tl._annotate_snps_genotype_data:Annotation columns: ['snp_id', 'Location', 'Allele', 'gene_id', 'transcript_id', 'Feature_type', 'Consequence', 'cDNA_position', 'CDS_position', 'Protein_position', 'Amino_acids', 'Codons', 'Existing_variation', 'IMPACT', 'DISTANCE', 'STRAND', 'FLAGS', 'BIOTYPE', 'CANONICAL', 'ENSP', 'SIFT', 'PolyPhen', 'gnomADe_AF', 'gnomADe_AFR_AF', 'gnomADe_AMR_AF', 'gnomADe_ASJ_AF', 'gnomADe_EAS_AF', 'gnomADe_FIN_AF', 'gnomADe_NFE_AF', 'gnomADe_OTH_AF', 'gnomADe_SAS_AF', 'CLIN_SIG', 'SOMATIC', 'PHENO', 'CADD_PHRED', 'CADD_RAW', 'TSSDistance']
INFO:cellink.tl._annotate_snps_genotype_data:Changing dtype of categorical col

snp_id,gene_id,transcript_id,Consequence_3_prime_UTR_variant,Consequence_5_prime_UTR_variant,Consequence_NMD_transcript_variant,Consequence_coding_sequence_variant,Consequence_downstream_gene_variant,Consequence_frameshift_variant,Consequence_inframe_deletion,Consequence_inframe_insertion,Consequence_intergenic_variant,Consequence_intron_variant,...,Codons,cDNA_position,gnomADe_ASJ_AF,CDS_position,gnomADe_AF,IMPACT,BIOTYPE,CLIN_SIG,Protein_position,PHENO
22_16388891_G_A,ENSG00000230643,ENST00000447704,0,0,0,0,1,0,0,0,0,0,...,-,-,,-,,MODIFIER,unprocessed_pseudogene,-,-,-
22_16388968_C_T,ENSG00000230643,ENST00000447704,0,0,0,0,1,0,0,0,0,0,...,-,-,,-,,MODIFIER,unprocessed_pseudogene,-,-,-
22_16389525_A_G,ENSG00000230643,ENST00000447704,0,0,0,0,0,0,0,0,0,0,...,-,78/118,,-,,MODIFIER,unprocessed_pseudogene,-,-,-
22_16390411_G_A,ENSG00000230643,ENST00000447704,0,0,0,0,0,0,0,0,0,0,...,-,-,,-,,MODIFIER,unprocessed_pseudogene,-,-,-
22_16391555_G_C,ENSG00000230643,ENST00000447704,0,0,0,0,0,0,0,0,0,0,...,-,-,,-,,MODIFIER,unprocessed_pseudogene,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22_50796327_CCA_C,ENSG00000100239,ENST00000395741,0,0,0,0,0,0,0,0,0,1,...,-,-,,-,,MODIFIER,protein_coding,-,-,-
22_50798021_A_G,ENSG00000100239,ENST00000395741,0,0,0,0,0,0,0,0,0,1,...,-,-,,-,,MODIFIER,protein_coding,-,-,-
22_50798635_T_C,ENSG00000100239,ENST00000395741,0,0,0,0,0,0,0,0,0,1,...,-,-,,-,,MODIFIER,protein_coding,-,-,-
22_50799821_A_C,ENSG00000100239,ENST00000395741,0,0,0,0,0,0,0,0,0,1,...,-,-,,-,,MODIFIER,protein_coding,-,-,-


In [17]:
cl.tl.aggregate_annotations_for_varm(
    dd.G, "variant_annotation_vep", agg_type="first", return_data=True
)  # TODO change agg type

INFO:cellink.tl._annotate_snps_genotype_data:Aggregating using method: first


snp_id,gene_id,transcript_id,Consequence_3_prime_UTR_variant,Consequence_5_prime_UTR_variant,Consequence_NMD_transcript_variant,Consequence_coding_sequence_variant,Consequence_downstream_gene_variant,Consequence_frameshift_variant,Consequence_inframe_deletion,Consequence_inframe_insertion,...,Codons,cDNA_position,gnomADe_ASJ_AF,CDS_position,gnomADe_AF,IMPACT,BIOTYPE,CLIN_SIG,Protein_position,PHENO
22_16388891_G_A,ENSG00000230643,ENST00000447704,0,0,0,0,1,0,0,0,...,-,-,,-,,MODIFIER,unprocessed_pseudogene,-,-,-
22_16388968_C_T,ENSG00000230643,ENST00000447704,0,0,0,0,1,0,0,0,...,-,-,,-,,MODIFIER,unprocessed_pseudogene,-,-,-
22_16389525_A_G,ENSG00000230643,ENST00000447704,0,0,0,0,0,0,0,0,...,-,78/118,,-,,MODIFIER,unprocessed_pseudogene,-,-,-
22_16390411_G_A,ENSG00000230643,ENST00000447704,0,0,0,0,0,0,0,0,...,-,-,,-,,MODIFIER,unprocessed_pseudogene,-,-,-
22_16391555_G_C,ENSG00000230643,ENST00000447704,0,0,0,0,0,0,0,0,...,-,-,,-,,MODIFIER,unprocessed_pseudogene,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22_50796327_CCA_C,ENSG00000100239,ENST00000395741,0,0,0,0,0,0,0,0,...,-,-,,-,,MODIFIER,protein_coding,-,-,-
22_50798021_A_G,ENSG00000100239,ENST00000395741,0,0,0,0,0,0,0,0,...,-,-,,-,,MODIFIER,protein_coding,-,-,-
22_50798635_T_C,ENSG00000100239,ENST00000395741,0,0,0,0,0,0,0,0,...,-,-,,-,,MODIFIER,protein_coding,-,-,-
22_50799821_A_C,ENSG00000100239,ENST00000395741,0,0,0,0,0,0,0,0,...,-,-,,-,,MODIFIER,protein_coding,-,-,-


In [18]:
dd.G.varm["variant_annotation"].columns

Index(['gene_id', 'transcript_id', 'Consequence_3_prime_UTR_variant',
       'Consequence_5_prime_UTR_variant', 'Consequence_NMD_transcript_variant',
       'Consequence_coding_sequence_variant',
       'Consequence_downstream_gene_variant', 'Consequence_frameshift_variant',
       'Consequence_inframe_deletion', 'Consequence_inframe_insertion',
       'Consequence_intergenic_variant', 'Consequence_intron_variant',
       'Consequence_mature_miRNA_variant', 'Consequence_missense_variant',
       'Consequence_non_coding_transcript_exon_variant',
       'Consequence_non_coding_transcript_variant',
       'Consequence_protein_altering_variant',
       'Consequence_splice_acceptor_variant',
       'Consequence_splice_donor_5th_base_variant',
       'Consequence_splice_donor_region_variant',
       'Consequence_splice_donor_variant',
       'Consequence_splice_polypyrimidine_tract_variant',
       'Consequence_splice_region_variant', 'Consequence_start_lost',
       'Consequence_stop_gained

In [19]:
dd.G.uns["variant_annotation_vep"]["CADD_RAW"].describe()

count    32966.000000
mean         0.122248
std          0.542875
min         -1.705125
25%         -0.115726
50%          0.023902
75%          0.205452
max          9.936896
Name: CADD_RAW, dtype: float64

## Step 2: Burden Testing

We now perform burden tests, which evaluate whether the **aggregate effect** of rare variants within a gene is associated with a phenotype, in this case gene expression in CD4 Naive cells. Different weighting schemes let us prioritize variants by functional score, allele frequency, or position relative to the gene:
- **CADD_RAW**: Raw CADD scores.
- **maf_beta**: Beta weighting based on minor allele frequency (MAF).
- **tss_distance**: Distance to the transcription start site (TSS).
- **tss_distance_exp**: Exponential weighting based on TSS distance.
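To make the aggregation concrete, here is a minimal NumPy sketch of a weighted-sum burden score followed by a covariate-adjusted linear regression. It runs on simulated data, uses ordinary least squares with a large-sample normal approximation for the p-value, and the helper name `burden_test_sketch` is hypothetical. It illustrates the idea only and should not be read as the implementation behind `run_burden_test`.

```python
import math

import numpy as np


def burden_test_sketch(G, weights, y, F):
    """Hypothetical helper: weighted-sum burden score tested by OLS.

    G: (donors, variants) dosages; weights: (variants,);
    y: (donors,) phenotype; F: (donors, covariates) including an intercept.
    """
    burden = G @ weights                       # one aggregated score per donor
    X = np.column_stack([F, burden])           # covariates plus burden score
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof               # residual variance
    se = math.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])
    z = beta[-1] / se
    pv = math.erfc(abs(z) / math.sqrt(2))      # two-sided, normal approximation
    return float(beta[-1]), se, pv


# Simulated example: 500 donors, 20 rare variants, equal weights
rng = np.random.default_rng(0)
n_donors, n_variants = 500, 20
G = rng.binomial(2, 0.02, size=(n_donors, n_variants)).astype(float)
weights = np.ones(n_variants)
F = np.column_stack([np.ones(n_donors), rng.normal(size=n_donors)])
y = 0.8 * (G @ weights) + rng.normal(size=n_donors)   # simulated burden effect

beta_burden, se_burden, pv_burden = burden_test_sketch(G, weights, y, F)
print(beta_burden, se_burden, pv_burden)
```

Swapping `weights` for an annotation column (for example a functional score or a MAF-based weight) changes which variants dominate the burden score without changing the regression step.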

In [20]:
burden_agg_fct = "sum"
run_lrt = True
annotation_cols = ["CADD_RAW", "maf_beta", "tss_distance", "tss_distance_exp"]

rare_maf_threshold = 0.05

### Filtering for Rare Variants
Burden tests focus on rare variants—typically those with minor allele frequency (MAF) below 0.01 or 0.05. We filter our genotype data accordingly to isolate variants in this frequency range.
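The following sketch shows one common way to derive MAF from a dosage matrix and apply such a threshold. The genotypes are simulated, and how `dd.G.var.maf` was actually computed is not stated in this tutorial, so treat the calculation as an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated dosages (0/1/2 alternate-allele counts): 200 donors x 6 variants
true_af = np.array([0.001, 0.01, 0.03, 0.2, 0.5, 0.97])
G = rng.binomial(2, true_af, size=(200, true_af.size))

alt_af = G.mean(axis=0) / 2             # alternate-allele frequency per variant
maf = np.minimum(alt_af, 1 - alt_af)    # fold to the minor allele

rare_maf_threshold = 0.05
is_rare = maf < rare_maf_threshold      # boolean mask over variants
G_rare = G[:, is_rare]
print(maf.round(3), is_rare)
```

Note that a variant with a very high alternate-allele frequency also counts as rare after folding, because its minor allele is the reference allele.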

In [21]:
dd = dd.sel(G_var=dd.G.var.maf < rare_maf_threshold).copy()
dd



### Custom MAF Weights

We add a MAF-based weight that is commonly used in rare variant tests, `Beta(MAF, 1, 25)`, which gives rarer variants more weight. TSS-distance weights, as used in the SAIGE-QTL paper, are added for each gene inside the test loop below.
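For reference, the Beta(1, 25) density evaluated at a variant's MAF reduces to `25 * (1 - MAF) ** 24`. The sketch below computes it for a few example frequencies. The helper `beta_density_weights` is hypothetical; whether `beta_weighting` returns exactly this density (or, for instance, a rescaled version) is not confirmed here.

```python
import math

import numpy as np


def beta_density_weights(maf, a=1.0, b=25.0):
    """Hypothetical helper: Beta(a, b) probability density evaluated at each MAF."""
    maf = np.asarray(maf, dtype=float)
    # Normalizing constant 1 / B(a, b), computed on the log scale
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * maf ** (a - 1) * (1 - maf) ** (b - 1)


maf_example = np.array([0.001, 0.005, 0.01, 0.05])
w = beta_density_weights(maf_example)
print(w.round(3))
```

The weights decrease monotonically with MAF, so the rarest variants contribute most to the aggregated score.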

In [22]:
dd.G.varm["variant_annotation"]["maf_beta"] = beta_weighting(dd.G.var["maf"])

### Run burden tests using each annotation individually for weighting

Next, we specify the covariate matrix `F`, which contains an intercept, sex, age, genetic PCs, and expression PCs. This controls for potential confounders in the association analysis.
We then iterate over all genes on chromosome 22, define a cis-window of ±100 kb around each gene (`cis_window = 100_000`), and run a burden test using each of the annotation-based weighting schemes.
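The window selection and the two TSS-based weights can be illustrated on their own. The sketch below uses made-up gene coordinates and variant positions and mirrors the arithmetic of the loop that follows, where the gene start is used as a strand-unaware stand-in for the TSS.

```python
import numpy as np

cis_window = 100_000
gene_start, gene_end = 1_000_000, 1_050_000   # hypothetical gene coordinates
pos = np.array([850_000, 905_000, 1_000_000, 1_020_000, 1_149_999, 1_150_000])

# Window boundaries, clipped at the chromosome start
start = max(0, gene_start - cis_window)
end = gene_end + cis_window
in_cis = (pos > start) & (pos < end)          # strict bounds, as in the loop

# Distance to the gene start (used as the TSS, ignoring strand)
tss_distance = np.abs(gene_start - pos[in_cis])

# Exponential decay: variants close to the TSS get weights near 1
tss_distance_exp = np.exp(-1e-5 * tss_distance)
print(in_cis, tss_distance, tss_distance_exp.round(3))
```

Used directly as a weight, `tss_distance` grows with distance from the gene start, whereas `tss_distance_exp` shrinks with it, so the two schemes emphasize opposite ends of the window.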

In [23]:
# This specifies covariates/fixed effects
F = np.concatenate(
    [
        np.ones((dd.shape[0], 1)),
        dd.G.obs[["sex"]].values - 1,
        dd.G.obs[["age"]].values,
        dd.G.obsm["gPCs"].values,
        dd.G.obsm["ePCs"],
    ],
    axis=1,
).astype(np.float64)
F[:, 2:] = column_normalize(F[:, 2:])

In [24]:
results = []
if isinstance(dd.G.X, da.Array | ad._core.views.DaskArrayView):
    if dd.G.is_view:
        dd._G = dd._G.copy()
    dd.G.X = dd.G.X.compute()

if do_debug:
    warnings.filterwarnings("ignore", category=RuntimeWarning)

for gene, row in tqdm(dd.C.var.iterrows(), total=dd.shape[3]):
    Y = gaussianize(dd.G.obsm[pb_gex_key][[gene]].values.astype(np.float64) + 1e-5 * np.random.randn(dd.shape[0], 1))

    start = max(0, row.start - cis_window)
    end = row.end + cis_window
    _G = dd.G[:, (dd.G.var.pos < end)]
    _G = _G[:, (_G.var.pos > start)]
    _G = _G[:, (_G.X.std(0) != 0)]
    _G = _G.copy()

    # TODO make strand aware
    _G.varm["variant_annotation"]["tss_distance"] = np.abs(row.start - _G.var["pos"])
    _G.varm["variant_annotation"]["tss_distance_exp"] = np.exp(-1e-5 * _G.varm["variant_annotation"]["tss_distance"])

    rdf = run_burden_test(
        _G, Y, F, gene, annotation_cols=annotation_cols, burden_agg_fct=burden_agg_fct, run_lrt=run_lrt
    )
    results.append(rdf)

rdf = pd.concat(results)
rdf

100%|██████████| 404/404 [00:14<00:00, 27.20it/s]


,burden_gene,egene,weight_col,burden_agg_fct,pv,beta,betaste,lrt
0,ENSG00000206195,ENSG00000206195,CADD_RAW,sum,,,,
1,ENSG00000206195,ENSG00000206195,maf_beta,sum,,,,
2,ENSG00000206195,ENSG00000206195,tss_distance,sum,,,,
3,ENSG00000206195,ENSG00000206195,tss_distance_exp,sum,,,,
0,ENSG00000177663,ENSG00000177663,CADD_RAW,sum,,,,
...,...,...,...,...,...,...,...,...
3,ENSG00000100299,ENSG00000100299,tss_distance_exp,sum,0.024520,-3.623585e-03,1.611285e-03,5.057453
0,ENSG00000079974,ENSG00000079974,CADD_RAW,sum,,,,
1,ENSG00000079974,ENSG00000079974,maf_beta,sum,0.343694,6.644163e-04,7.016820e-04,0.896602
2,ENSG00000079974,ENSG00000079974,tss_distance,sum,0.977858,4.148339e-09,1.494648e-07,0.000770


## Step 3: Combine Multiple Tests with ACAT

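To summarize evidence from multiple annotations, we use **ACAT (Aggregated Cauchy Association Test)**. ACAT combines the p-values of the per-annotation burden tests for each gene into a single p-value. Because the Cauchy combination stays approximately valid when the individual tests are correlated, it is well suited to weighting schemes that are applied to the same set of variants.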
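The Cauchy combination itself is short enough to write out. The sketch below transforms each p-value to a Cauchy variate, averages them, and converts the average back to a p-value. The function `acat_sketch` is a hypothetical stand-in with equal weights by default; `acat_test` may differ in details such as the handling of extremely small p-values.

```python
import numpy as np


def acat_sketch(pvalues, weights=None):
    """Hypothetical helper: Cauchy combination of p-values."""
    p = np.asarray(pvalues, dtype=float)
    if weights is None:
        w = np.full(p.shape, 1.0 / p.size)     # equal weights summing to 1
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    # Each term is standard Cauchy under the null; so is their weighted mean
    t = np.sum(w * np.tan((0.5 - p) * np.pi))
    # Upper tail of the standard Cauchy distribution
    return float(0.5 - np.arctan(t) / np.pi)


p_single = acat_sketch([0.3])
p_combined = acat_sketch([0.01, 0.5, 0.9])
print(p_single, p_combined)
```

A single p-value is returned unchanged, and one small p-value among several large ones still yields a small combined p-value, which is the behavior that makes the method useful when only some weighting schemes capture the signal.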

In [25]:
combined = rdf.dropna(subset=["pv"]).groupby("egene")["pv"].agg(lambda x: acat_test(x.values)).reset_index()
combined.sort_values("pv")

,egene,pv
84,ENSG00000100219,2.287059e-14
290,ENSG00000212939,2.274714e-11
75,ENSG00000100154,8.218144e-05
186,ENSG00000133460,9.702130e-05
198,ENSG00000167077,1.625786e-04
...,...,...
158,ENSG00000100427,9.859188e-01
338,ENSG00000244625,9.891262e-01
107,ENSG00000100294,9.921173e-01
44,ENSG00000100030,9.949769e-01


## Step 4: SKAT

As an alternative to burden testing, we apply the Sequence Kernel Association Test (SKAT). Instead of collapsing variants into one score and assuming that they all act in the same direction, SKAT tests a variance component for the variant effects. It is therefore more robust when variant effects differ in direction or magnitude.

Currently, `cellink` supports SKAT with the standard Beta(MAF, 1, 25) weighting scheme.
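To show what the SKAT statistic measures, the sketch below computes the weighted sum of squared per-variant scores on simulated data with opposite-direction effects. SKAT obtains its p-value analytically from a mixture of chi-square distributions; here a simple permutation of residuals is swapped in to keep the example short, which is only appropriate for the intercept-only model used below. The helper names are hypothetical and this is not the code behind `run_skat_test`.

```python
import numpy as np


def skat_q(G, weights, resid):
    """SKAT-style statistic: sum over variants of (w_j * g_j' r)^2."""
    score = (G * weights).T @ resid
    return float(score @ score)


def skat_permutation_sketch(G, weights, y, F, n_perm=2000, seed=0):
    """Hypothetical helper: empirical p-value for Q by permuting residuals."""
    rng = np.random.default_rng(seed)
    beta, *_ = np.linalg.lstsq(F, y, rcond=None)
    resid = y - F @ beta                        # residuals under the null model
    q_obs = skat_q(G, weights, resid)
    q_null = np.array(
        [skat_q(G, weights, rng.permutation(resid)) for _ in range(n_perm)]
    )
    pv = (1 + np.sum(q_null >= q_obs)) / (1 + n_perm)
    return q_obs, float(pv)


# Simulated example: two causal variants with opposite effects
rng = np.random.default_rng(42)
n_donors, n_variants = 400, 10
G = rng.binomial(2, 0.03, size=(n_donors, n_variants)).astype(float)
maf = G.mean(axis=0) / 2
weights = 25 * (1 - maf) ** 24                  # Beta(1, 25) density at each MAF
effects = np.zeros(n_variants)
effects[:2] = [1.5, -1.5]
y = G @ effects + rng.normal(size=n_donors)
F = np.ones((n_donors, 1))                      # intercept only

q_obs, pv_skat = skat_permutation_sketch(G, weights, y, F)
print(q_obs, pv_skat)
```

Because each variant's score is squared before summing, effects of opposite sign add up instead of cancelling, which is the situation where a burden score loses power.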

In [26]:
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARNING)  # Suppress INFO and DEBUG esp. from SKAT Test

In [27]:
results = []

for gene, row in tqdm(dd.C.var.iterrows(), total=dd.shape[3]):
    Y = gaussianize(dd.G.obsm[pb_gex_key][[gene]].values.astype(np.float64) + 1e-5 * np.random.randn(dd.shape[0], 1))

    start = max(0, row.start - cis_window)
    end = row.end + cis_window
    _G = dd.G[:, (dd.G.var.pos < end)]
    _G = _G[:, (_G.var.pos > start)]
    _G = _G[:, (_G.X.std(0) != 0)]

    rdict = run_skat_test(_G, Y, F, gene)
    results.append(rdict)

rdf = pd.DataFrame(results)
rdf

100%|██████████| 404/404 [00:16<00:00, 23.80it/s]


,burden_gene,egene,weight_col,pv
0,ENSG00000206195,ENSG00000206195,maf_beta,[[1.0]]
1,ENSG00000177663,ENSG00000177663,maf_beta,[[0.7540438404596421]]
2,ENSG00000069998,ENSG00000069998,maf_beta,[[0.03897879062889942]]
3,ENSG00000185837,ENSG00000185837,maf_beta,[[0.8117376585909591]]
4,ENSG00000093072,ENSG00000093072,maf_beta,[[0.9819123716041604]]
...,...,...,...,...
399,ENSG00000205560,ENSG00000205560,maf_beta,[[0.02343900249308839]]
400,ENSG00000100288,ENSG00000100288,maf_beta,[[0.5147669611167375]]
401,ENSG00000205559,ENSG00000205559,maf_beta,[[0.23597911440366892]]
402,ENSG00000100299,ENSG00000100299,maf_beta,[[0.0004379433659393861]]


