# QTL Association Testing

## Description

We perform QTL association testing using TensorQTL [[cf. Taylor-Weiner et al. (2019)](https://doi.org/10.1186/s13059-019-1836-7)].

## Input

- List of molecular phenotype files: a list of `bed.gz` files, each containing the table for a molecular phenotype. Each file should have a companion index file in `tbi` format. These files are the output of gene_annotation or phenotype_by_chrom.
- List of genotypes in both PLINK binary format (bed/bim/fam) and PLINK 2 binary genotype table (pgen/pvar/psam) for each chromosome, previously processed through our genotype QC pipelines.
- Covariate file: a table whose header is `#id` followed by the sample names, with one covariate per row. It contains fixed and known covariates as well as hidden covariates recovered from factor analysis.
- Optionally, a list of traits (genes, regions of molecular features etc) to analyze.

### Example phenotype list


The header of the bed.gz is per the [TensorQTL](https://github.com/broadinstitute/tensorqtl) convention:

- Phenotypes must be provided in BED format, sorted by chromosome and [start, end] position, with a single header line starting with `#`. The first four columns are chr, start, end, and phenotype_id; the remaining columns correspond to samples (the identifiers must match those in the genotype input). The BED file should specify the cis-window (usually the TSS), with start = the minimum start for each gene and end = the maximum end for each gene (extracted from the GTF file referenced by the phenotype).

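The format requirements above can be checked before running the pipeline. The following is a minimal sketch using only the Python standard library; the function name `check_phenotype_bed` and the toy rows are illustrative assumptions and are not part of TensorQTL or this pipeline. It checks the `#` header, the start/end ordering, and the sort order within each chromosome.

```python
import csv
import io

def check_phenotype_bed(handle):
    """Return a list of problems found in a tab-delimited phenotype BED stream."""
    problems = []
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if not header[0].startswith("#"):
        problems.append("header line must start with '#'")
    if len(header) < 5:
        problems.append("expected 4 annotation columns plus at least one sample")
    seen_chroms = []
    previous = None  # (chrom, start, end) of the previous data row
    for line_no, row in enumerate(reader, start=2):
        chrom, start, end = row[0], int(row[1]), int(row[2])
        if start > end:
            problems.append(f"line {line_no}: start > end")
        if previous is not None and chrom == previous[0]:
            # Same chromosome: [start, end] must not decrease
            if (start, end) < previous[1:]:
                problems.append(f"line {line_no}: not sorted by position")
        elif chrom in seen_chroms:
            # Chromosome reappears after another one: rows are not grouped
            problems.append(f"line {line_no}: chromosome {chrom} is not contiguous")
        if chrom not in seen_chroms:
            seen_chroms.append(chrom)
        previous = (chrom, start, end)
    return problems

# Toy input (hypothetical values, two samples only)
header = "\t".join(["#chr", "start", "end", "ID", "sample0", "sample1"])
row_a = "\t".join(["chr9", "3364328", "3364791", "ENSG00000215297", "-0.47", "-0.49"])
row_b = "\t".join(["chr9", "4792943", "4885916", "ENSG00000120158", "-0.36", "1.37"])
good_bed = "\n".join([header, row_a, row_b]) + "\n"
bad_bed = "\n".join([header, row_b, row_a]) + "\n"   # rows swapped

print(check_phenotype_bed(io.StringIO(good_bed)))
print(check_phenotype_bed(io.StringIO(bad_bed)))
```

For real data, the same function could be applied to a handle opened with `gzip.open(path, "rt")`.
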
In [None]:
# -----------------------------
# Load Phenotype File List
# -----------------------------
# The input is a text file listing phenotype .bed.gz files by chromosome
pheno_path <- fread("output/phenotype/phenotype_by_chrom/bulk_rnaseq.phenotype_by_chrom_files.txt")
head(pheno_path)

# -----------------------------
# Load Phenotype Data for One Chromosome (e.g., Chr9)
# -----------------------------
# Each file is in BED-like format (.bed.gz), commonly used in QTL pipelines
pheno <- fread("output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr9.bed.gz")
pheno[1:5, 1:8]


#id,#dir
<int>,<chr>
9,output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr9.bed.gz
19,output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr19.bed.gz
1,output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr1.bed.gz
6,output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr6.bed.gz
15,output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr15.bed.gz
11,output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr11.bed.gz


#chr,start,end,ID,strand,sample0,sample1,sample2
<chr>,<int>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
chr9,3364328,3364791,ENSG00000215297,-,-0.4736331,-0.4922869,-0.4186161
chr9,4792943,4885916,ENSG00000120158,+,-0.3648387,1.3652152,-0.2774209
chr9,5098340,5099324,ENSG00000235917,+,-0.7548631,1.1789688,-0.9199175
chr9,5299863,5304715,ENSG00000107014,-,0.6902035,-0.9455643,-0.9455643
chr9,5357970,5437924,ENSG00000107020,-,-0.1081115,-0.5687344,0.4551426


### Example genotype file

In [None]:
# Load required libraries
library(data.table)
library(genio)

# -----------------------------
# Load Genotype File Paths
# -----------------------------
# The input is a text file with two columns: chromosome ID and corresponding PLINK prefix path
geno_path <- fread("output/genotype_by_chrom/wgs.merged.plink_qc.genotype_by_chrom_files.txt")
head(geno_path)

# -----------------------------
# Read PLINK Files for One Chromosome (e.g., Chr21)
# -----------------------------
# This will automatically read .bed, .bim, and .fam files
file_path <- "output/genotype_by_chrom/wgs.merged.plink_qc.21"
plink_data <- read_plink(file_path)

# Extract genotype matrix (SNPs x individuals)
genotypes <- plink_data$X
genotypes[1:5, 1:5]




#id,#path
<int>,<chr>
11,/mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype_by_chrom/wgs.merged.plink_qc.11.bed
3,/mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype_by_chrom/wgs.merged.plink_qc.3.bed
10,/mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype_by_chrom/wgs.merged.plink_qc.10.bed
22,/mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype_by_chrom/wgs.merged.plink_qc.22.bed
20,/mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype_by_chrom/wgs.merged.plink_qc.20.bed
15,/mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype_by_chrom/wgs.merged.plink_qc.15.bed


Reading: output/genotype_by_chrom/wgs.merged.plink_qc.21.bim

Reading: output/genotype_by_chrom/wgs.merged.plink_qc.21.fam

Reading: output/genotype_by_chrom/wgs.merged.plink_qc.21.bed



Unnamed: 0,sample0,sample1,sample2,sample3,sample4
chr21:5091891_A_G,0,0,0,,0
chr21:5097593_CGTCCCTTCCCGAGGTTCCAGGCGGACGT_C,0,0,0,,0
chr21:5097593_CGTCCCTTCCCGAGGTTCCAGGCGGACGT_CGTCCCTTCCCGAGGTTCCAGGCGGACGTGTCCCTTCCCGAGGTTCCAGGCGGACGT,0,0,0,,0
chr21:5097593_CGTCCCTTCCCGAGGTTCCAGGCGGACGT_TGTCCCTTCCCGAGGTTCCAGGCGGACGT,0,0,0,,0
chr21:5103954_G_C,0,0,0,0.0,0


### Example covariates file

In [None]:
# cov file:
cov = fread('output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz')
cov[,1:5]
dim(cov)

#id,sample0,sample1,sample2,sample3
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
sex,0.0,0.0,1.0,0.0
age,91.0,92.0,52.0,85.0
rin,4.0,6.0,3.0,5.0
pmi,4.0,33.0,21.0,28.0
PC1,0.00797194,0.01551892,0.007742702,0.017287771
PC2,0.211086385,-0.08141101,-0.132460948,0.014621938
PC3,-0.050633431,-0.1268738,-0.120602702,0.055990494
PC4,-0.014065335,0.12781518,-0.017367114,0.020156398
PC5,-0.124368908,0.03655637,0.028215755,0.156124781
PC6,-0.175836871,0.01975288,-0.007233068,-0.078524241

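The covariate table is stored as covariates by samples, while the pipeline works with samples by covariates (it reads the file with `pd.read_csv(..., index_col=0).T`) and derives the sample size from the header as the number of columns minus one. A minimal standard-library sketch of the same reshaping, on toy values taken from the example above; `read_covariates` is an illustrative helper, not a pipeline function.

```python
import csv
import io

def read_covariates(handle):
    """Parse a covariates-by-samples table into {sample: {covariate: value}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]                     # first column is "#id"
    by_sample = {sample: {} for sample in samples}
    for row in reader:
        covariate, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            by_sample[sample][covariate] = float(value)
    return by_sample

toy = io.StringIO(
    "#id\tsample0\tsample1\tsample2\n"
    "sex\t0\t0\t1\n"
    "age\t91\t92\t52\n"
)
covariates = read_covariates(toy)
n_samples = len(covariates)                  # header columns minus the "#id" column
print(n_samples, covariates["sample2"])
```
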

### Example customized cis window file (Optional)

In [None]:
# -----------------------------
# Load Customized Cis-Window File
# -----------------------------
# This file specifies the gene-level cis-window: start and end positions 
# used to define the SNP search region for each gene.
# If you do not want to use a customized window, you can instead specify `--window` in the pipeline,
# which by default sets a symmetric ±1Mb window around each gene TSS.

window <- fread("reference_data/TAD/TADB_enhanced_cis.bed")
head(window)

#chr,start,end,gene_id
<chr>,<int>,<int>,<chr>
chr1,0,6480000,ENSG00000008128
chr1,0,6480000,ENSG00000008130
chr1,0,6480000,ENSG00000067606
chr1,0,7101193,ENSG00000069424
chr1,0,7960000,ENSG00000069812
chr1,0,6480000,ENSG00000078369

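When a customized window file lists the same gene more than once, the pipeline collapses the duplicates by taking the minimum start and maximum end per (chromosome, gene) pair and strips the `chr` prefix. The sketch below mirrors that logic with plain Python; `merge_cis_windows` is an illustrative name, and the second window for `ENSG00000008128` is a hypothetical duplicate added only to show the merge.

```python
def merge_cis_windows(rows):
    """Collapse duplicated (chr, gene_id) rows into one window per gene."""
    merged = {}
    for chrom, start, end, gene_id in rows:
        key = (chrom.replace("chr", ""), gene_id)   # drop the "chr" prefix
        if key in merged:
            old_start, old_end = merged[key]
            # Union of the windows: smallest start, largest end
            merged[key] = (min(old_start, start), max(old_end, end))
        else:
            merged[key] = (start, end)
    return [(c, s, e, g) for (c, g), (s, e) in merged.items()]

windows = [
    ("chr1", 0, 6480000, "ENSG00000008128"),
    ("chr1", 5000000, 7101193, "ENSG00000008128"),  # hypothetical duplicate
    ("chr1", 0, 7960000, "ENSG00000069812"),
]
unique_windows = merge_cis_windows(windows)
for window in unique_windows:
    print(window)
```
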

### Example interaction file

In [None]:
# -----------------------------
# Load Interaction File
# -----------------------------
# For iQTL analysis, if the interaction term is not included in your covariate file,
# you need to provide it as a separate file via the `--interaction` parameter.

int <- fread("data/ROSMAP_interaction_example.tsv")
head(int)

sample_id,int
<chr>,<int>
sample0,0
sample1,0
sample2,0
sample3,0
sample4,1
sample5,1



For cis-analysis:

- Optionally, a list of genomic regions associated with each molecular feature to analyze. The default cis-analysis uses a window around the TSS; this can be customized by providing start and end genomic coordinates. We currently suggest using a 1 Mb window around each gene, because longer customized cis-windows (such as those extended by TAD) do not yield significant improvements.

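As an illustration of the default symmetric window, the sketch below selects the variants that fall within a given radius of a TSS and reports a signed distance for each. It is a simplified stand-in, not the TensorQTL implementation: the function name is hypothetical, the window boundaries are assumed to be inclusive, and strand is ignored when computing the distance.

```python
def variants_in_cis_window(tss, variant_positions, window=1_000_000):
    """Return (position, position - tss) for variants within +/- window of the TSS."""
    lower = max(0, tss - window)     # do not run past the chromosome start
    upper = tss + window
    return [(pos, pos - tss) for pos in variant_positions if lower <= pos <= upper]

tss = 3_364_328                       # toy TSS position
positions = [2_000_000, 3_000_000, 4_364_328, 4_364_329]
in_window = variants_in_cis_window(tss, positions)
print(in_window)
```

Here the first and last positions fall outside the 1 Mb radius and are dropped.
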

For trans-analysis:

 **Computational strategy designed for trans analysis:**
 
 Trans analysis faces significant memory challenges as we calculate all associations between all molecular traits × all genetic variants across the genome, creating a massive computational burden. To address this challenge, we implement a two-stage chromosome-based parallelization approach:

 **Stage 1 (trans_1): Chromosome-based parallelization**
 - Phenotype data is processed per chromosome (e.g., 22 separate jobs for autosomes)
 - For each phenotype chromosome, we test associations against variants from all 22 chromosomes
 - This creates phenotype_chr × genotype_chr combinations (e.g., phenotype chr1 vs genotype chr1-22); garbage is collected after each chromosome combination to release memory
 - Results are combined across all chromosome combinations and saved as compressed files

 **Stage 2 (trans_2): Significance filtering**
 - Supports p-value cutoffs (`--pvalue-cutoff`) or q-value cutoffs (`--qvalue-cutoff`)

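The two stages can be sketched as follows. This is a toy, standard-library illustration of the loop structure only: the association test is a Pearson correlation with a Fisher z-transform normal approximation, which is an assumption made for brevity and not the test TensorQTL performs, and all function names and data are hypothetical.

```python
import gc
import math

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_pvalue(r, n):
    """Two-sided p-value from the Fisher z-transform (normal approximation)."""
    r = max(min(r, 0.999999), -0.999999)          # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))

def trans_stage1(phenotypes, genotypes):
    """Test every trait against every variant, one chromosome pair at a time."""
    results = []
    for pheno_chr, traits in phenotypes.items():
        for geno_chr, variants in genotypes.items():
            for trait_id, y in traits.items():
                for variant_id, g in variants.items():
                    r = pearson_r(g, y)
                    results.append({"trait": trait_id, "variant": variant_id,
                                    "pvalue": approx_pvalue(r, len(y))})
            gc.collect()                          # release memory between combinations
    return results

def trans_stage2(results, pvalue_cutoff=5e-8):
    """Keep only associations that pass the p-value cutoff."""
    return [row for row in results if row["pvalue"] <= pvalue_cutoff]

# Toy data: geneA tracks the chr1 variant closely, geneB tracks nothing
phenotypes = {"1": {"geneA": [0.1, 0.9, 2.05, -0.05, 1.0, 2.1, -0.1, 1.05]},
              "2": {"geneB": [0.5, -0.2, 0.1, 0.3, -0.4, 0.0, -0.1, 0.2]}}
genotypes = {"1": {"chr1:100_A_G": [0, 1, 2, 0, 1, 2, 0, 1]},
             "2": {"chr2:200_C_T": [1, 1, 1, 0, 2, 0, 2, 1]}}

all_pairs = trans_stage1(phenotypes, genotypes)
hits = trans_stage2(all_pairs)
print(len(all_pairs), [(h["trait"], h["variant"]) for h in hits])
```

Keeping the inner loops scoped to one chromosome pair is what bounds peak memory; the filtering step then reduces the combined table to the associations of interest.
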




## Output  

For each chromosome, several summary statistics files are generated, including both nominal test statistics for each test and region (gene) level association evidence.  

### Nominal Association Results  

The columns of the nominal association result are as follows:  

- **chrom**: Variant chromosome.  
- **pos**: Variant chromosomal position (basepairs).  
- **molecular_trait_id**: Molecular trait identifier (gene).  
- **variant_id**: ID of the variant (rsid or chr:position:ref:alt).  
- **tss_distance**: Distance of the SNP to the gene transcription start site (TSS).  
- **tes_distance**: Distance of the SNP to the gene transcription end site (TES).  
- **cis_window_start_distance**: Distance of the SNP to the start of the cis window (if using a customized cis window).  
- **cis_window_end_distance**: Distance of the SNP to the end of the cis window (if using a customized cis window).  
- **af**: The allele frequency of this SNP.  
- **ma_samples**: Number of samples carrying the minor allele.  
- **ma_count**: Total number of minor alleles across individuals.  
- **pvalue**: Nominal P-value from linear regression.  
- **bhat**: Slope of the linear regression.  
- **sebhat**: Standard error of bhat.  
- **n**: Number of phenotypes after basic QC.  
#### Multiple Testing Corrected Results:  
- **qvalue**: Calculated q-value for each SNP (grouped by gene).  

### Interaction Association Results  

The columns of interaction association results are as follows (FIXME):  

**Model:**  
$$
\text{phenotype} = \beta_0 + \beta_1 \cdot \text{snp} + \beta_2 \cdot \text{msex} + \beta_3 \cdot (\text{snp} \times \text{msex}) + \epsilon
$$



(Taking msex as the interaction factor)  

- **chrom**: Chromosome number.  
- **pos**: Variant chromosomal position (basepairs).  
- **a2**: Variant reference allele (A, C, T, or G).  
- **a1**: Variant alternate allele.  
- **molecular_trait_id**: Molecular trait identifier, varies from phenotypes to phenotypes.  
- **variant_id**: ID of the top variant (rsid or chr:position:ref:alt).  
- **af**: Alternative allele frequency in the MiGA cohort.  
- **ma_samples**: Number of samples carrying the minor allele.  
- **ma_count**: Total number of minor alleles across individuals.  
- **pvalue**: P-value of the main effect from the interaction model.  
- **bhat**: Slope of the main effect from the interaction model.  
- **se**: Standard error of bhat.  
- **pvalue_msex**: P-value of the msex term from the interaction model.  
- **bhat_msex**: Slope of the msex term from the interaction model.  
- **se_msex**: Standard error of bhat_msex.  
- **pvalue_msex_interaction**: P-value of the interaction term from the interaction model.  
- **bhat_msex_interaction**: Slope of the interaction term from the interaction model.  
- **se_msex_interaction**: Standard error of bhat_msex_interaction.  
- **molecular_trait_object_id**: An intermediate ID (can be ignored).  
- **n**: Number of samples.
#### Multiple Testing Corrected Results:  
- **qvalue_main**: The q-value of the main effect.  
- **qvalue_interaction**: The q-value of the interaction effect.  

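To make the coefficients of the interaction model concrete, assume for illustration that msex is coded as 0/1 (this coding is an assumption; the source does not state it). Substituting the two values into the model gives one regression line per group:

$$
\begin{aligned}
\text{msex} = 0:&\quad \text{phenotype} = \beta_0 + \beta_1 \cdot \text{snp} + \epsilon \\
\text{msex} = 1:&\quad \text{phenotype} = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \cdot \text{snp} + \epsilon
\end{aligned}
$$

Under this coding, `bhat` estimates $\beta_1$ (the genotype effect when msex = 0), `bhat_msex` estimates $\beta_2$ (the shift in intercept), and `bhat_msex_interaction` estimates $\beta_3$ (the difference in genotype effect between the two groups).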
### Region (Gene) Level Association Evidence  

The column specifications for region-level association evidence are as follows:  

- **chrom**: Chromosome number.  
- **pos**: Variant chromosomal position (basepairs).  
- **n_variant**: Total number of variants tested in cis.  
- **beta_shape1**: First parameter value of the fitted beta distribution.  
- **beta_shape2**: Second parameter value of the fitted beta distribution.  
- **true_df**: Effective degrees of freedom of the beta distribution approximation.  
- **p_true_df**: Empirical P-value for the beta distribution approximation.  
- **variant_id**: ID of the top variant (rsid or chr:position:ref:alt).  
- **tss_distance**: Distance of the SNP to the gene transcription start site (TSS).  
- **tes_distance**: Distance of the SNP to the gene transcription end site (TES).  
- **ma_samples**: Number of samples carrying the minor allele.  
- **ma_count**: Total number of minor alleles across individuals.  
- **af**: Alternative allele frequency.  
- **p_nominal**: Nominal P-value from linear regression.  
- **bhat**: Slope of the linear regression.  
- **sehat**: Standard error of the bhat.  
- **p_perm**: First permutation P-value directly obtained from the permutations with the direct method.  
- **p_beta**: Second permutation P-value obtained via beta approximation (this is the one to use for downstream analysis).  
- **molecular_trait_object_id**: Molecular trait identifier (gene).  
- **n_traits**: Group size in the permutation test.  
- **genomic_inflation**: Genomic inflation factor (lambda), quantifying the extent of bulk inflation and the excess false positive rate.  
#### Multiple Testing Corrected Results:  
- **q_beta**: Q-value for p_beta using Storey's method (qvalue); typically less conservative than the Benjamini-Hochberg adjustment.  
- **q_perm**: Q-value for p_perm using Storey's method (qvalue); typically less conservative than the Benjamini-Hochberg adjustment.  
- **fdr_beta**: Adjusted P-value for p_beta using the Benjamini-Hochberg method (FDR).  
- **fdr_perm**: Adjusted P-value for p_perm using the Benjamini-Hochberg method (FDR).  
- **p_nominal_threshold**: Nominal p-value threshold for variants in the corresponding molecular trait, derived from empirical beta distribution as a result of permutation testing.  

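For reference, the Benjamini-Hochberg adjustment behind columns such as `fdr_beta` and `fdr_perm` can be written in a few lines. This is a generic sketch of the standard procedure using only the Python standard library; it is not the code the pipeline runs, and the input p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])   # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

p_beta = [0.01, 0.04, 0.03, 0.005]       # toy gene-level p-values
fdr_beta = benjamini_hochberg(p_beta)
print(fdr_beta)                          # approximately [0.02, 0.04, 0.04, 0.02]
```
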

## Minimal Working Example Steps



The data can be found on [Synapse](https://www.synapse.org/#!Synapse:syn36416559/files/).

### i. Cis TensorQTL Command 

In [None]:
sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file output/genotype_by_chrom/wgs.merged.plink_qc.genotype_by_chrom_files.txt \
    --phenotype-file output/phenotype/phenotype_by_chrom/bulk_rnaseq.phenotype_by_chrom_files.txt \
    --covariate-file output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz \
    --customized-cis-windows reference_data/TAD/TADB_enhanced_cis.bed \
    --cwd output/tensorqtl_cis/ \
    --MAC 5 

```
INFO: Running cis_1: 
INFO: cis_1 (index=6) is completed.
INFO: cis_1 (index=7) is completed.
INFO: cis_1 (index=3) is completed.
INFO: cis_1 (index=5) is completed.
INFO: cis_1 (index=2) is completed.
INFO: cis_1 (index=1) is completed.
INFO: cis_1 (index=0) is completed.
INFO: cis_1 (index=4) is completed.
INFO: cis_1 (index=11) is completed.
INFO: cis_1 (index=13) is completed.
INFO: cis_1 (index=9) is completed.
INFO: cis_1 (index=8) is completed.
INFO: cis_1 (index=12) is completed.
INFO: cis_1 (index=15) is completed.
INFO: cis_1 (index=10) is completed.
INFO: cis_1 (index=14) is completed.
INFO: cis_1 (index=19) is completed.
INFO: cis_1 (index=17) is completed.
INFO: cis_1 (index=16) is completed.
INFO: cis_1 (index=20) is completed.
INFO: cis_1 (index=21) is completed.
INFO: cis_1 (index=18) is completed.
INFO: cis_1 output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/tensorqtl_cis/bulk_rnaseq.chr20.cis_qtl_pairs.20.parquet /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/tensorqtl_cis/bulk_rnaseq.chr20_chr20.cis_qtl.pairs.tsv.gz... (66 items in 22 groups)
INFO: Running cis_2: 
INFO: cis_2 is completed.
INFO: cis_2 output:   output/tensorqtl_cis/bulk_rnaseq.cis_qtl_regional_significance.tsv.gz output/tensorqtl_cis/bulk_rnaseq.cis_qtl_regional_significance.summary.txt
INFO: Workflow cis (ID=wf24d8ec17aef888e) is executed successfully with 2 completed steps and 23 completed substeps.

```

### ii. Trans TensorQTL Command 


In [None]:
sos run pipeline/TensorQTL.ipynb trans \
    --genotype-file data/wgs.merged.plink_qc.genotype_trans_files.txt \
    --phenotype-file output/phenotype/phenotype_by_chrom_for_trans/bulk_rnaseq.phenotype_by_chrom_files.txt \
    --region-list data/combined_AD_genes.csv \
    --region-list-phenotype-column 4 \
    --covariate-file output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
    --cwd output/tensorqtl_trans/ \
    --MAC 5 

### iii. Interaction TensorQTL Command 


In [None]:
# Example 1: Running full iQTL scan in parallel across all chromosomes
# If using .txt files for genotype and phenotype input (with no --chromosome specified), the pipeline will run in parallel across all chromosomes.
# Note: "sex" must be a column in your covariate file; otherwise, replace --interaction with a path to an interaction file.

sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file output/genotype_by_chrom/wgs.merged.plink_qc.genotype_by_chrom_files.txt \
    --phenotype-file output/phenotype/phenotype_by_chrom/bulk_rnaseq.phenotype_by_chrom_files.txt \
    --covariate-file output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz \
    --customized-cis-windows reference_data/TAD/TADB_enhanced_cis.bed \
    --cwd output/tensorqtl_int/ \
    --no-permutation \
    --maf-threshold 0.05 \
    --interaction sex \
    -j 22

# Example 2: Run TensorQTL for a specific chromosome
sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file output/genotype_by_chrom/wgs.merged.plink_qc.genotype_by_chrom_files.txt \
    --phenotype-file output/phenotype/phenotype_by_chrom/bulk_rnaseq.phenotype_by_chrom_files.txt \
    --covariate-file output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz \
    --customized-cis-windows reference_data/TAD/TADB_enhanced_cis.bed \
    --cwd output/tensorqtl_int/ \
    --chromosome 21 \
    --no-permutation \
    --maf-threshold 0.05 \
    --interaction sex

# Example 3: Run TensorQTL for a single chromosome with specific genotype and phenotype files
sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file output/genotype_by_chrom/wgs.merged.plink_qc.21.bed \
    --phenotype-file output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr21.bed.gz \
    --covariate-file output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz \
    --customized-cis-windows reference_data/TAD/TADB_enhanced_cis.bed \
    --cwd output/tensorqtl_int/ \
    --chromosome 21 \
    --no-permutation \
    --maf-threshold 0.05 \
    --interaction sex \
    -s build

# Example 4: Use a specific interaction file instead of an interaction column
sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file output/genotype_by_chrom/wgs.merged.plink_qc.21.bed \
    --phenotype-file output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr21.bed.gz \
    --covariate-file output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz \
    --customized-cis-windows reference_data/TAD/TADB_enhanced_cis.bed \
    --cwd output/tensorqtl_int/ \
    --chromosome 21 \
    --no-permutation \
    --maf-threshold 0.05 \
    --interaction data/ROSMAP_interaction_example.tsv \
    -s build


## Troubleshooting

| Step | Substep | Problem | Possible Reason | Solution |
|------|---------|---------|------------------|---------|
|  |  |  |  |  |




## Command Interface 

In [3]:
sos run TensorQTL.ipynb -h

usage: sos run TensorQTL.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  cis
  trans

Global Workflow Options:
  --cwd output (as path)
                        Path to the work directory of the analysis.
  --phenotype-file VAL (as path, required)
                        Phenotype file, or a list of phenotype per region.
  --genotype-file VAL (as path, required)
                        A genotype file in PLINK binary format (bed/bim/fam)
                        format, or a list of genotype per chrom
  --covariate-file VAL (as path, required)
                        Covariate file
  --name  f"{phenotype_file:bn}_{covariate_file:bn}"

                        Prefix for the analysi

## Old Minimal working example

An MWE is uploaded to [google drive](https://drive.google.com/drive/folders/1yjTwoO0DYGi-J9ouMsh9fHKfDmsXJ_4I?usp=sharing).
The singularity image (sif) for running this MWE is uploaded to [google drive](https://drive.google.com/drive/folders/1mLOS3AVQM8yTaWtCbO8Q3xla98Nr5bZQ)

FIXME: need to update these links. 

FIXME: Also need to update the example commands below using our new example dataset.

In [None]:
sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file plink_files_list.txt \
    --phenotype-file MWE.bed.recipe \
    --covariate-file ALL.covariate.pca.BiCV.cov.gz \
    --cwd ./output/ \
    --MAC 5

In [None]:
sos run pipeline/TensorQTL.ipynb trans \
    --genotype-file plink_files_list.txt \
    --phenotype-file MWE.bed.recipe \
    --covariate-file ALL.covariate.pca.BiCV.cov.gz \
    --cwd ./output/ \
    --MAC 5 --region-name  gene_name

## Setup and global parameters

In [3]:
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Phenotype file, or a list of phenotype per region.
parameter: phenotype_file = path
# A genotype file in PLINK binary format (bed/bim/fam), or a list of genotype files per chromosome
parameter: genotype_file = path
# Covariate file
parameter: covariate_file = path
# Optional pattern to filter covariates (list of covariate prefixes or exact names)
parameter: covariate_pattern = []
# Prefix for the analysis output
parameter: name = ""
# An optional subset of regions of molecular features to analyze. The last column is the gene names
parameter: region_list = path()
parameter: region_list_phenotype_column = 4
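# Optional file listing the sample IDs to keep
parameter: keep_sample = path()
# Interaction term for iQTL analysis: either the name of a covariate in the covariate file
# (e.g. `sex`) or the path to a tab-delimited interaction file indexed by sample ID.
# Leave empty to run a standard (non-interaction) analysis.
parameter: interaction = ""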

# An optional list documenting the custom cis window for each region to analyze, with four columns: chr, start, end, region ID (e.g. gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_cis_windows = path()

# The phenotype group file to group molecule_trait into molecule_trait_object
# This applies to multiple molecular events in the same region, such as sQTL analysis.
parameter: phenotype_group = path() 

# Chromosomes to analyze; if empty, every chromosome in the input file lists is analyzed
parameter: chromosome = []
# Minor allele count cutoff
parameter: MAC = 0

# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# This parameter will be set to zero if `customized_cis_windows` is provided.
parameter: window = 1000000

# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
import re

# Use the header of the covariate file to decide the sample size
import pandas as pd
N = len(pd.read_csv(covariate_file, sep = "\t",nrows = 1).columns) - 1

# Minor allele frequency cutoff; defaults to the value implied by `MAC` and overrides it when set explicitly.
# Consider a higher value for interaction analysis if statistical power is a concern.
parameter: maf_threshold = MAC/(2.0*N)

# Filtering significant trans associations (for trans_2 workflow)
parameter: pvalue_cutoff = "5e-8"
parameter: qvalue_cutoff = ""


import os

def adapt_file_path(file_path, reference_file):
    """
    Adapt a single file path based on its existence and a reference file's path.

    Args:
    - file_path (str): The file path to adapt.
    - reference_file (str): File path to use as a reference for adaptation.

    Returns:
    - str: Adapted file path.

    Raises:
    - FileNotFoundError: If no valid file path is found.
    """
    reference_path = os.path.dirname(reference_file)

    # Check if the file exists
    if os.path.isfile(file_path):
        return file_path

    # Check file name without path
    file_name = os.path.basename(file_path)
    if os.path.isfile(file_name):
        return file_name

    # Check file name in reference file's directory
    file_in_ref_dir = os.path.join(reference_path, file_name)
    if os.path.isfile(file_in_ref_dir):
        return file_in_ref_dir

    # Check original file path prefixed with reference file's directory
    file_prefixed = os.path.join(reference_path, file_path)
    if os.path.isfile(file_prefixed):
        return file_prefixed

    # If all checks fail, raise an error
    raise FileNotFoundError(f"No valid path found for file: {file_path}")

def adapt_file_path_all(df, column_name, reference_file):
    return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))


if (str(genotype_file).endswith("bed") or str(genotype_file).endswith("pgen")) and str(phenotype_file).endswith("bed.gz"):
    input_files = [[phenotype_file, genotype_file]]
    if len(chromosome) > 0:
        input_chroms = [int(x) for x in chromosome]
    else:
        input_chroms = [0]
else:
    molecular_pheno_files = pd.read_csv(phenotype_file, sep = "\t")
    
    if "#dir" in molecular_pheno_files.columns and "#chr" not in molecular_pheno_files.columns:
        molecular_pheno_files = molecular_pheno_files.rename(columns={"#dir": "path"})
        if "#id" in molecular_pheno_files.columns:
            molecular_pheno_files = molecular_pheno_files.rename(columns={"#id": "#chr"})
    
    if "#chr" in molecular_pheno_files.columns:
        molecular_pheno_files = molecular_pheno_files.groupby(['#chr','path']).size().reset_index(name='count').drop("count",axis = 1).rename(columns = {"#chr":"#id"})
    genotype_files = pd.read_csv(genotype_file,sep = "\t")
    genotype_files["#id"] = [x.replace("chr","") for x in genotype_files["#id"].astype(str)] # e.g. convert chr1 to 1
    genotype_files["#path"] = genotype_files["#path"].apply(lambda x: adapt_file_path(x, genotype_file))
    molecular_pheno_files["#id"] = [x.replace("chr","") for x in molecular_pheno_files["#id"].astype(str)]
    input_files = molecular_pheno_files.merge(genotype_files, on = "#id")
    
    # Only keep chromosome specified in --chromosome
    if len(chromosome) > 0:
        input_files = input_files[input_files['#id'].isin(chromosome)]
    input_files = input_files.values.tolist()
    input_chroms = [x[0] for x in input_files]
    input_files = [x[1:] for x in input_files]
    if len(name) == 0:
        name = f'{path(input_files[0][0]):bnn}' if len(input_files) == 1 else f'{path(input_files[0][0]):bnnn}'

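The default `maf_threshold` in the `[global]` cell is derived from `MAC` and the sample size, where the sample size is read from the covariate header (number of columns minus one) and each sample contributes two alleles. A small worked example of that arithmetic, with a hypothetical helper name and a made-up header of 100 samples:

```python
def default_maf_threshold(mac, covariate_header):
    """Convert a minor allele count cutoff into a minor allele frequency cutoff."""
    n_samples = len(covariate_header) - 1     # first column is "#id"
    return mac / (2.0 * n_samples)            # two alleles per diploid sample

header = ["#id"] + [f"sample{i}" for i in range(100)]
maf = default_maf_threshold(5, header)
print(maf)                                    # 5 / (2 * 100)
```

With `--MAC 5` and 100 samples this gives a cutoff of 0.025; with the default `MAC = 0` the cutoff is 0, so no variant is excluded by frequency.
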

## cis-xQTL association testing

In [None]:
[cis_1]
# parse input file lists
# Skip the nominal association step if its output files already exist.
# This is False by default, which means everything is recomputed.
# It is only relevant when the `parquet` files for nominal results exist but the other files do not,
# and you want to avoid computing the nominal results again.
parameter: skip_nominal_if_exist = False
parameter: permutation = True

# Extract interaction name
var_interaction = interaction
if os.path.isfile(interaction):
    interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
    var_interaction = interaction_s.columns[0] # interaction name
test_regional_association = permutation and len(var_interaction) == 0

input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output_files = dict([("parquet", f'{cwd:a}/{_input[0]:bnn}{"_%s" % var_interaction if interaction else ""}.cis_qtl_pairs.{"" if input_chroms[_index] == 0 else input_chroms[_index]}.parquet'), # This convention is necessary to match the pattern of map_norminal output
                     ("nominal", f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.pairs.tsv.gz')])
if test_regional_association:
    output_files["regional"] = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.regional.tsv.gz'
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output["nominal"]:bnnn}'
python: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout' , container = container
    import pandas as pd
    import numpy as np
    import os
    import tensorqtl
    from tensorqtl import genotypeio, cis
    from scipy.stats import chi2
    import multiprocessing as mp
    from tqdm import tqdm

    ## Define paths
    plink_prefix_path = $[_input[1]:nar]
    expression_bed = $[_input[0]:ar]

    covariates_file = "$[covariate_file:a]"
    window = $[window]
    interaction = "$[interaction]"
    ## Load Data
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
    phenotype_id = phenotype_pos_df.index.name

    ## Analyze only the regions listed
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list:a]", comment="#", header=None, sep="\t" )
        phenotype_column = 1 if len(region.columns) == 1 else  $[region_list_phenotype_column]
        keep_region = region.iloc[:,phenotype_column-1].to_list()
        phenotype_df = phenotype_df[phenotype_df.index.isin(keep_region)]
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_region)]
    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
    genotype_df, variant_df = genotypeio.load_genotypes(plink_prefix_path, dosages = True)

    
    ## use custom sample list to subset the covariates data
    if $[keep_sample.is_file()]:
        sample_list = pd.read_csv("$[keep_sample:a]", comment="#", header=None, names=["sample_id"], sep="\t")
        covariates_df = covariates_df.loc[sample_list.sample_id]

    # Read interaction files or extract from covariates file.
    var_interaction = interaction
    interaction_s = []
    if os.path.isfile(interaction):
        # update var_interaction and interaction_s
        interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
        interaction_s = interaction_s[interaction_s.index.isin(covariates_df.index)] 
        var_interaction = interaction_s.columns[0] # interaction name
    # If the interaction term is a column of the covariates file and interaction_s has not been loaded yet, extract it from the covariates
    if var_interaction in covariates_df.columns:
        # only load from covariate if it has not been loaded yet
        if len(interaction_s) == 0:
            interaction_s = covariates_df[var_interaction].to_frame()
        covariates_df = covariates_df.drop(columns=[var_interaction])
    if len(interaction) and len(interaction_s) == 0:
        raise ValueError(f"Cannot find interaction variable or file {interaction}")

    # drop samples with missing values in the interaction term
    if len(interaction_s):
        interaction_s = interaction_s.dropna() 

    ## Retaining only common samples
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)] 

    if len(interaction_s):
        phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, interaction_s.index)]
        interaction_s = interaction_s[interaction_s.index.isin(phenotype_df.columns)]    
        interaction_s = interaction_s.loc[phenotype_df.columns]

    covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
    
    ## To simplify things, there should really not be "chr" prefix
    phenotype_pos_df.chr = phenotype_pos_df.chr.astype(str).str.replace("chr", "")
    variant_df.chrom =  variant_df.chrom.astype("str").str.replace("chr", "") 

    ## use custom cis windows list
    if $[customized_cis_windows.is_file()]:
        cis_list = pd.read_csv("$[customized_cis_windows:a]", comment="#", header=None, names=["chr","start","end",phenotype_id], sep="\t")
        cis_list.chr = cis_list.chr.astype(str).str.replace("chr", "")  ## Again to simplify things for chr format concordance.
        if cis_list[['chr', 'ID']].duplicated().sum() != 0: # if cis_list is not unique using identifier ['#chr', 'ID']
                cis_list = cis_list.groupby(['ID', 'chr']).agg({ # use union start-end position and make cis_list unique
                    'start': 'min',
                    'end': 'max'
                }).reset_index()

                cis_list = cis_list[['chr', 'start', 'end', 'ID']]
                #cis_list = cis_list.set_index('chr')
        
        phenotype_pos_df = phenotype_pos_df.reset_index() #move the phenotype id index to a new column of the dataframe
        phenotype_df = phenotype_df.reset_index()
        # Ensure phenotype_pos_df is a subset of cis_list based on ['#chr', 'ID']
        original_count = len(phenotype_pos_df)
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.set_index(['chr', 'ID']).index.isin(cis_list.set_index([cis_list.columns[0],cis_list.columns[3]]).index)] 
        phenotype_df = phenotype_df[phenotype_df.set_index('ID').index.isin(cis_list.set_index(cis_list.columns[3]).index)] 
        phenotype_df = phenotype_df.set_index('ID')
        removed_count = original_count - len(phenotype_pos_df)
        print(f"{removed_count} rows were removed from phenotype_pos_df")

        # Merge the dataframes on 'chr' and 'ID', including 'start' and 'end'
        phenotype_pos_df = phenotype_pos_df.merge(cis_list[['chr', 'ID', 'start', 'end']], 
                                        left_on = ["chr",phenotype_id],
                                        right_on = [cis_list.columns[0],cis_list.columns[3]],
                                        suffixes=('_pheno', ''))

        # Function to decide whether to keep or rename columns:
        # If the 'start' and 'end' values are the same in both dataframes, we keep the original columns without suffixes;
        # if they differ, we give the columns from cis_list a '_cis' suffix and keep the '_pheno' suffix for the columns from phenotype_pos_df.
        # In some cases (gene expression for eQTLs) the phenotype_id may be in the cis_list file
        def rename_if_different(row, col):
            if row[f'{col}'] == row[f'{col}_pheno']:
                return row[col]
            else:
                return pd.Series({f'{col}_pheno': row[f'{col}_pheno'], f'{col}_cis': row[col]})
        
        # Apply the renaming logic
        for col in ['start', 'end']:
            if f'{col}_pheno' in phenotype_pos_df.columns:  # skip when the original phenotype_pos_df has a single pos column instead of start and end
                temp = phenotype_pos_df.apply(lambda row: rename_if_different(row, col), axis=1)
                phenotype_pos_df = phenotype_pos_df.drop(columns=[f'{col}_pheno']).join(temp)
                print(f"Dropped columns due to value mismatch: {col}_pheno")
        
        phenotype_pos_df = phenotype_pos_df.set_index(phenotype_id)[["chr","start","end"]] # The final phenotype_pos_df will have three columns(chr, start, end) and index is the phenotype ID
        
        if len(phenotype_df.index) != len(phenotype_pos_df.index):
            raise ValueError("cannot uniquely match all the phenotype data in the input to the customized cis windows provided")
        window = 0 # In the updated tensorQTL, by default if there is a customized cis window, the actual cis window will be start - window & end + window, so it is necessary to change the window parameter to 0

    ## Read phenotype group if available
    if $[phenotype_group.is_file()]:
        group_s = pd.read_csv($[phenotype_group:r], sep='\t', header=None, index_col=0).squeeze()
    else:
        group_s = None

    ## cis-QTL mapping: nominal associations for all variant-phenotype pairs
    if not ($[skip_nominal_if_exist] and $[_output["parquet"].is_file()]):
        if len(interaction_s):
            cis.map_nominal(genotype_df, variant_df, 
                    phenotype_df, 
                    phenotype_pos_df, 
                    $[_output["parquet"]:nnnr],
                    covariates_df=covariates_df,
                    interaction_df=interaction_s, 
                    maf_threshold_interaction=$[maf_threshold],
                    window=window,
                    group_s=group_s,
                    run_eigenmt=True)
        else:
            cis.map_nominal(genotype_df, variant_df,
                phenotype_df,
                phenotype_pos_df,
                $[_output["parquet"]:nnnr],
                covariates_df=covariates_df, 
                window=window, 
                maf_threshold=$[maf_threshold],
                run_eigenmt=$['False' if permutation else 'True'],
                group_s=group_s)

    ## Load the parquet and save it as txt
    pairs_df = pd.read_parquet($[_output["parquet"]:r])
    ## Remove rows whose 'pval_gi' is null for following t_pval_conversion
    if len(interaction_s):
        pairs_df = pairs_df.dropna(subset=['pval_gi'])
    # print general information of parquet
    print('Output Information:')
    print("This is the file containing the immediate output of TensorQTL's map_nominal function ")
    print("Parquet file size (bytes):", os.path.getsize($[_output["parquet"]:r]))

    ## Adds the group columns to pairs_df, if there is group_s use group_s, else use phenotype_id
    if group_s is not None:
        pairs_df = pairs_df.merge(pd.DataFrame( {"molecular_trait_object_id": group_s}),left_on = "phenotype_id", right_index = True)
    else:
        pairs_df["molecular_trait_object_id"] = pairs_df.phenotype_id
    ## if pos in phenotype_pos_df(start distance and end distance are the same), 
    ## add the column 'end_distance' with the same value as 'start_distance' to avoid mismatch of column and names
    if 'end_distance' not in pairs_df.columns:
        # Insert end_distance right after start_distance, with identical values
        start_pos = pairs_df.columns.get_loc('start_distance')
        pairs_df.insert(start_pos + 1, 'end_distance', pairs_df['start_distance'])

    # rename columns
    column_map = {'phenotype_id': 'molecular_trait_id'}
   
    if len(interaction_s):
        # calculate genomic inflation factor lambda on interaction 
        lambda_col_interaction = pairs_df.groupby("molecular_trait_object_id").apply(lambda x: chi2.ppf(1. - np.median(x.pval_gi), 1)/chi2.ppf(0.5,1))
        column_map.update({
            'pval_g': 'pvalue', 'b_g': 'bhat', 'b_g_se': 'sebhat',
            'pval_i': f'pvalue_{var_interaction}',
            'b_i': f'bhat_{var_interaction}',
            'b_i_se': f'sebhat_{var_interaction}',
            'pval_gi': f'pvalue_{var_interaction}_interaction',
            'b_gi': f'bhat_{var_interaction}_interaction',
            'b_gi_se': f'sebhat_{var_interaction}_interaction'
        })
    else:
        column_map.update({'pval_nominal': 'pvalue', 'slope': 'bhat', 'slope_se': 'sebhat'})
    if $[customized_cis_windows.is_file()]:
        column_map.update({'start_distance': 'cis_window_start_distance', 'end_distance': 'cis_window_end_distance'})
    else:
        column_map.update({'start_distance': 'tss_distance', 'end_distance': 'tes_distance'})
    pairs_df.rename(columns=column_map, inplace=True)
    pairs_df["n"] = len(phenotype_df.columns.values)
    pairs_df = variant_df.merge(pairs_df, right_on='variant_id', left_index=True)
    pairs_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
    # sort the table by chrom and pos if pos is not already in ascending order
    if not pairs_df['pos'].is_monotonic_increasing:
        pairs_df = pairs_df.sort_values(by=['chrom', 'pos'])
    # save file
    pairs_df.to_csv($[_output["nominal"]:nr], sep='\t', index = None)
    # print general information of pairs_df
    print('Output Information:')
    print("Output Rows:", len(pairs_df))
    print("Output Columns:", pairs_df.columns.tolist())
    print("Output Preview:", pairs_df.iloc[0:5, 0:10])

    if $[test_regional_association]:
        # calculate genomic inflation factor lambda for main variant effect 
        lambda_col = pairs_df.groupby("molecular_trait_object_id").apply(lambda x: chi2.ppf(1. - np.median(x.pvalue), 1)/chi2.ppf(0.5,1))
        cis_df = cis.map_cis(genotype_df, 
                            variant_df, 
                            phenotype_df,
                            phenotype_pos_df,
                            covariates_df=covariates_df, 
                            seed=999, 
                            window=window, 
                            maf_threshold = $[maf_threshold],
                            group_s=group_s)
        cis_df.index.name = "molecular_trait_id"
        ## Add groups columns for eQTL analysis
        if "group_id" not in cis_df.columns:
            cis_df["group_id"] = cis_df.index
            cis_df["group_size"] = 1
        cis_df.rename(columns={"group_id": "molecular_trait_object_id", "group_size": "n_traits", 
                        'start_distance': 'tss_distance', 'end_distance': 'tes_distance',
                        "num_var": "n_variants", "pval_nominal": "p_nominal", 
                        'slope': 'bhat', 'slope_se': 'sebhat',
                        "pval_true_df": "p_true_df", "pval_perm": "p_perm", "pval_beta": "p_beta"}, inplace = True)
        cis_df = cis_df.assign(genomic_inflation = lambda dataframe : dataframe["molecular_trait_object_id"].map(lambda molecular_trait_object_id:lambda_col[molecular_trait_object_id]))
        # merge cis_df with variant_df
        cis_df = variant_df.merge(cis_df, right_on='variant_id', left_index=True)
        cis_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
        # sort the table by chrom and pos if pos is not already in ascending order
        if not cis_df['pos'].is_monotonic_increasing:
            cis_df = cis_df.sort_values(by=['chrom', 'pos'])
        # save file
        cis_df.to_csv(str($[_output["nominal"]:nnnr])+str('.regional.tsv'), sep='\t', index = None)
        # print general information of cis_df
        print('Output Information:')
        print("Output Rows:", len(cis_df))
        print("Output Columns:", cis_df.columns.tolist())
        print("Output Preview:", cis_df.iloc[0:5, 0:10])

R: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
    library(purrr)
    library(tidyr)
    library(readr)
    library(dplyr)
    library(qvalue)
  
    pairs_df = read_delim($[_output["nominal"]:nr,], delim = '\t')
    compute_qvalues <- function(pvalues) {
        tryCatch({
            if(length(pvalues) < 2) {
                return(pvalues)
            } else {
                return(qvalue(pvalues)$qvalues)
            }
        }, error = function(e) {
            message("Too few p-values to calculate qvalue, fall back to BH")
            qvalue(pvalues, pi0 = 1)$qvalues
        })
    }
    
    var_interaction <- "$[interaction]"
    # Check if 'interaction' is a file
    if (file.exists(var_interaction)) {
        # Read the file into 'interaction_s' dataframe
        interaction_s <- read.delim(var_interaction, row.names = 1)
        # Update 'var_interaction' to the first column name of 'interaction_s'
        var_interaction <- names(interaction_s)[1]
    }

    if (is.null(var_interaction) || var_interaction == "") {
        pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue = compute_qvalues(pvalue))
    } else {
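        # Build the interaction p-value column name from the R-side var_interaction,
        # which may have been updated from the interaction file above
        interaction_col <- paste0("pvalue_", var_interaction, "_interaction")
        pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue_main = compute_qvalues(pvalue), qvalue_interaction = compute_qvalues(.data[[interaction_col]]))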
    }

    pairs_df %>% write_delim($[_output["nominal"]:nr],"\t")
  
bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
        bgzip --compress-level 9 $[_output["nominal"]:n] 
        tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]]

done_if(not test_regional_association)

bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
        bgzip --compress-level 9 $[_output["nominal"]:nnn].regional.tsv
        tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]:nnn].regional.tsv.gz

In [1]:
[cis_2]
done_if("regional" not in _input.labels)
input: group_by = "all"
output_file_prefix = name if len(_input["nominal"]) > 1 else f'{_input["nominal"][0]:bnnnn}'
output: f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.tsv.gz',
        f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.summary.txt'
input_files = [str(x) for x in _input["regional"]]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    library("purrr")
    library("tidyr")
    library("dplyr")
    library("readr")
    library("qvalue")
    emprical_pd = tibble(map(c($[_input["regional"]:r,]), ~read_delim(.x,"\t")))%>%unnest()
    emprical_pd["q_beta"] = tryCatch(qvalue(emprical_pd$p_beta)$qvalues, error = function(e) {
        print("Too few p-values to calculate qvalue, falling back to BH")
        qvalue(emprical_pd$p_beta, pi0 = 1)$qvalues
    })

    emprical_pd["q_perm"] = tryCatch(qvalue(emprical_pd$p_perm)$qvalues, error = function(e) {
        print("Too few p-values to calculate qvalue, falling back to BH")
        qvalue(emprical_pd$p_perm, pi0 = 1)$qvalues
    })
    emprical_pd["fdr_beta"] = p.adjust(emprical_pd$p_beta,"fdr")    
    emprical_pd["fdr_perm"] = p.adjust(emprical_pd$p_perm,"fdr")   


    # Calculate the global nominal p-value threshold based on q_beta at FDR 0.05
    if (!all(is.na(emprical_pd$p_beta))) {
      lb <- emprical_pd %>% 
        filter(q_beta <= 0.05) %>% 
        pull(p_beta) %>% 
        sort()
      
      ub <- emprical_pd %>% 
        filter(q_beta > 0.05) %>% 
        pull(p_beta) %>% 
        sort()
      
      if (length(lb) > 0) {
        lb_val <- tail(lb, 1)
        threshold <- if (length(ub) > 0) (lb_val + head(ub, 1)) / 2 else lb_val
        message(sprintf("min p-value threshold @ FDR 0.05: %g", threshold))
        
        emprical_pd <- emprical_pd %>% 
          mutate(p_nominal_threshold = qbeta(threshold, beta_shape1, beta_shape2))
      }
    }

    summary = tibble("fdr_perm_0.05" =  sum(emprical_pd["fdr_perm"] < 0.05), 
                      "fdr_beta_0.05" = sum(emprical_pd["fdr_beta"] < 0.05),
                      "q_perm_0.05" = sum(emprical_pd["q_perm"] < 0.05),
                      "q_beta_0.05" = sum(emprical_pd["q_beta"] < 0.05),
                      "fdr_perm_0.01" =  sum(emprical_pd["fdr_perm"] < 0.01), 
                      "fdr_beta_0.01" = sum(emprical_pd["fdr_beta"] < 0.01),
                      "q_perm_0.01" = sum(emprical_pd["q_perm"] < 0.01),
                      "q_beta_0.01" = sum(emprical_pd["q_beta"] < 0.01)  )
    emprical_pd%>%write_delim("$[_output[0]]","\t")
    summary%>%write_delim("$[_output[1]]","\t")

## Trans-xQTL association testing

For trans-QTL analysis, if you output the p-values for all variant-trait pairs across many genes (the default setting, `pval_threshold = 1.0`), we suggest requesting the largest memory and CPU thread allocation available on a compute node, e.g. 250G of memory and more than 32 threads.
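
To see why keeping every p-value is memory-hungry, a rough back-of-envelope estimate may help. The sketch below is illustrative only: `estimate_trans_output_gb` is a hypothetical helper, not part of TensorQTL or this pipeline. The 16 bytes per pair (four 4-byte numeric columns) and the example counts are assumptions. It ignores variant and trait identifiers, the genotype matrix and intermediate copies, so real usage will be higher.

```python
def estimate_trans_output_gb(n_traits: int, n_variants: int, bytes_per_pair: int = 16) -> float:
    """Rough lower bound, in GB (1 GB = 1e9 bytes), on the size of a table
    holding one record per variant-trait pair.

    bytes_per_pair is an assumed value, not a measured TensorQTL figure.
    """
    n_pairs = n_traits * n_variants
    return n_pairs * bytes_per_pair / 1e9


# Hypothetical example: 1,000 traits tested against 5 million variants
print(f"{estimate_trans_output_gb(1_000, 5_000_000):.0f} GB")
```

Under these assumptions the table grows linearly with both the number of traits and the number of variants, so lowering `pval_threshold` below 1.0 is the most direct way to shrink what has to be held in memory.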

In [None]:
[trans]

parameter: batch_size = 10000
parameter: pval_threshold = 1.0
# Permutation testing is incorrect when the analysis is done by chrom
parameter: permutation = False
parameter: pval = 0.0

input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output: nominal = f'{cwd:a}/{_input[0]:bnn}{"_%s" % input_chroms[_index] if input_chroms[_index] != 0 else ""}.trans_qtl{"_p_%.0e" % pval if pval > 0.0 else ""}.pairs.tsv.gz'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container =container
    import pandas as pd
    import numpy as np
    import tensorqtl
    from tensorqtl import genotypeio, trans
    from scipy.stats import chi2
    import gc
    import os
    from statsmodels.stats.multitest import multipletests

    ## Define paths
    plink_prefix_path = $[_input[1]:nar]
    expression_bed = $[_input[0]:ar]
    covariates_file = "$[covariate_file:a]"
    window = $[window]
    current_chrom = "$[input_chroms[_index]]" if "$[input_chroms[_index]]" != "0" else None
    
    print(f"Processing with output name: {current_chrom}")
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
    phenotype_id = phenotype_pos_df.index.name


    ## Analyze only the regions listed
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list:a]")
        phenotype_column = 1 if len(region.columns) == 1 else $[region_list_phenotype_column]
        keep_region = region.iloc[:, phenotype_column-1].astype(str).str.strip().to_list()
        phenotype_df = phenotype_df[phenotype_df.index.isin(keep_region)]
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_region)]


    ## use custom cis windows
    if $[customized_cis_windows.is_file()]:
        cis_list = pd.read_csv("$[customized_cis_windows:a]", comment="#", header=None, names=["chr","start","end",phenotype_id], sep="\t")
        phenotype_pos_df_reset = phenotype_pos_df.reset_index()
        phenotype_pos_df = phenotype_pos_df_reset.merge(cis_list, left_on=["chr",phenotype_id], right_on=[cis_list.columns[0],cis_list.columns[3]])
        if len(phenotype_df.index) - phenotype_pos_df_reset[~phenotype_pos_df_reset[phenotype_id].isin(cis_list[phenotype_id])].shape[0]!= len(phenotype_pos_df.index):
            raise ValueError("cannot uniquely match all the phenotype data in the input to the customized cis windows provided")
        phenotype_pos_df = phenotype_pos_df.set_index(phenotype_id)[["chr","start","end"]]
        window = 0
        if phenotype_pos_df_reset[~phenotype_pos_df_reset[phenotype_id].isin(cis_list[phenotype_id])].shape[0] != 0:
            phenotype_df = phenotype_df.loc[phenotype_df.index.isin(cis_list[phenotype_id])]

    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T


    ## Filter covariates based on covariate_pattern if provided
    covariate_pattern_list = $[covariate_pattern]
    if covariate_pattern_list:
        print(f"Filtering covariates using pattern: {covariate_pattern_list}")
        pattern_mapping = {
            "pheno_PC": ["Hidden_Factor_PC"],  # Map pheno_PC to columns starting with Hidden_Factor_PC
            "geno_PC": ["PC"]                 # Map geno_PC to columns starting with PC
        }
        
        keep_cols = []
        for col in covariates_df.columns:
            if col in covariate_pattern_list:
                keep_cols.append(col)
                continue
        
            for pattern in covariate_pattern_list:
                if pattern in pattern_mapping:
                    # For special patterns like pheno_PC and geno_PC, check their mappings
                    for mapped_pattern in pattern_mapping[pattern]:
                        if col.startswith(mapped_pattern):
                            keep_cols.append(col)
                            break
        
        if not keep_cols:
            print("Warning: No covariate columns match the provided pattern!")
        else:
            print(f"Keeping {len(keep_cols)} covariates: {keep_cols}")
            covariates_df = covariates_df[keep_cols]
       

    genotype_df, variant_df = genotypeio.load_genotypes(plink_prefix_path, dosages = True) 
    ## use custom sample list to subset the covariates data
    if $[keep_sample.is_file()]:
        sample_list = pd.read_csv("$[keep_sample:a]", comment="#", header=None, names=["sample_id"], sep="\t")
        covariates_df = covariates_df.loc[sample_list.sample_id]

    ## Retaining only common samples
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)] 
    covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
    pr = genotypeio.PlinkReader(plink_prefix_path)
    bim_df = pr.bim
    bim_df['chrom'] = bim_df['chrom'].str.replace("chr", "")

    
    ## Get all chromosomes from genotype and phenotype data
    ## To simplify things, there should really not be "chr" prefix 
    phenotype_pos_df.chr = phenotype_pos_df.chr.astype(str).str.replace("chr", "")
    variant_df.chrom = variant_df.chrom.astype("str").str.replace("chr", "") 
    
    pheno_chroms = sorted(phenotype_pos_df.chr.unique().tolist())
    print(f"Phenotype data contains chromosomes: {pheno_chroms}")

    geno_chroms = sorted(variant_df.chrom.unique().tolist())
    print(f"Genotype data contains chromosomes: {geno_chroms}")

    # If current chromosome specified, only process that phenotype chromosome
    if current_chrom and current_chrom in pheno_chroms:
        pheno_chroms_to_process = [current_chrom]
        print(f"Processing only phenotype chromosome: {current_chrom}")
    else:
        pheno_chroms_to_process = pheno_chroms
        print(f"Processing all phenotype chromosomes: {pheno_chroms_to_process}")
    
    # Determine genotype chromosomes to process
    geno_chroms_to_process = geno_chroms
    print(f"Processing all genotype chromosomes: {geno_chroms_to_process}")
    
    # Calculate total combinations
    total_combinations = len(pheno_chroms_to_process) * len(geno_chroms_to_process)
    print(f"Total combinations to process: {total_combinations}")
    all_results = []
    

    # Process each combination
    combination_count = 0
    for pheno_chrom in pheno_chroms_to_process:
        # Filter phenotype data to keep only current chromosome
        phenotype_pos_df_filtered = phenotype_pos_df[phenotype_pos_df.chr == pheno_chrom]
        phenotype_df_filtered = phenotype_df[phenotype_df.index.isin(phenotype_pos_df_filtered.index)]
        
        print(f"Filtered phenotypes for chromosome {pheno_chrom}: {len(phenotype_df_filtered)} remaining")
        
        if len(phenotype_df_filtered) == 0:
            print(f"No phenotypes found for chromosome {pheno_chrom}, skipping")
            continue
        
        for geno_chrom in geno_chroms_to_process:
            combination_count += 1
            print(f"Processing combination {combination_count}/{total_combinations}: phenotype chr{pheno_chrom} x genotype chr{geno_chrom}")
            
            # Filter genotype data to keep only current chromosome
            chrom_variants = variant_df[variant_df.chrom == geno_chrom].index.tolist()
            
            if len(chrom_variants) == 0:
                print(f"No variants found for chromosome {geno_chrom}, skipping")
                continue
            
            # Load genotype data
            print(f"Loading genotypes for chromosome {geno_chrom}...")
            genotype_df = pr.load_genotypes()
            variant_df = bim_df.set_index('snp')[['chrom', 'pos', 'a0', 'a1']]
            
            # Keep only current chromosome variants
            genotype_df_chr = genotype_df.loc[chrom_variants]
            variant_df_chr = variant_df.loc[chrom_variants]

            del genotype_df
            gc.collect()
            # Find common samples
            common_samples = np.intersect1d(phenotype_df_filtered.columns, genotype_df_chr.columns)
            common_samples = np.intersect1d(common_samples, covariates_df.index)
            
            if len(common_samples) == 0:
                print(f"No common samples between phenotypes, genotypes, and covariates, skipping")
                del genotype_df_chr
                gc.collect()
                continue
            
            phenotype_df_final = phenotype_df_filtered[common_samples]
            genotype_df_final = genotype_df_chr[common_samples]
            covariates_df_final = covariates_df.loc[common_samples]
            
            print(f"Final analysis dimensions:")
            print(f"  Samples: {len(common_samples)}")
            print(f"  Phenotypes: {len(phenotype_df_final)}")
            print(f"  Variants: {len(genotype_df_final)}")
            
            # Trans analysis
            print(f"Running trans analysis for pheno chr{pheno_chrom} x geno chr{geno_chrom}...")
            try:
                trans_df = trans.map_trans(genotype_df_final, 
                                        phenotype_df_final,
                                        covariates_df_final, 
                                        batch_size=$[batch_size],
                                        return_sparse=True, 
                                        return_r2=True,
                                        pval_threshold=$[pval_threshold], 
                                        maf_threshold=$[maf_threshold])
                
                del genotype_df_chr, genotype_df_final
                gc.collect()

                # Filter out cis signal
                if trans_df is not None and not trans_df.empty:
                    print(f"Filtering cis signals...")
                    trans_df = trans.filter_cis(trans_df, phenotype_pos_df_filtered, variant_df_chr, window=window)

                    if trans_df is not None and not trans_df.empty:
                        print(f"Found {len(trans_df)} trans-QTLs")
                        trans_df.rename(columns={"phenotype_id": "molecular_trait_id", 
                                            "pval": "pvalue", 
                                            "b": "bhat", "b_se": "sebhat"}, inplace=True)
                        trans_df["n"] = len(common_samples)
                        
                        # Merge variant information
                        trans_df = variant_df_chr.merge(trans_df, right_on='variant_id', left_index=True)
                        trans_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
                        trans_df['pheno_chrom'] = pheno_chrom
                        trans_df['geno_chrom'] = geno_chrom
                        all_results.append(trans_df)
                    else:
                        print(f"No trans-QTLs found after cis filtering")
                else:
                    print(f"No trans-QTLs found")
            except Exception as e:
                print(f"Error during trans analysis: {e}")
                continue
    
    if all_results:
        print(f"Merging {len(all_results)} result sets...")
        combined_results = pd.concat(all_results, ignore_index=True)
        print(f"Total trans-QTLs found: {len(combined_results)}")
        
        # Calculate genomic inflation factor
        lambda_col = combined_results.groupby("molecular_trait_id").apply(
            lambda x: chi2.ppf(1. - np.median(x.pvalue), 1)/chi2.ppf(0.5,1)
        )
        lambda_col = lambda_col.reset_index()
        lambda_col.columns = ['molecular_trait_id', 'genomic_inflation_lambda']
        lambda_col.to_csv("$[_output:nnn].genomic_inflation.tsv.gz", sep='\t', index=None, 
                        compression={'method': 'gzip', 'compresslevel': 9})
        combined_results = combined_results.sort_values(by=['chrom', 'pos', 'molecular_trait_id'])

        # Output information
        print('Output Information:')
        print(f"Output Rows: {len(combined_results)}")
        print(f"Output Columns: {combined_results.columns.tolist()}")
        if len(combined_results) > 0:
            print(f"Output Preview (first 5 rows):")
            print(combined_results.iloc[0:min(5, len(combined_results)), 0:10])
        
        if $[pval] > 0:
            # Record initial number of variants
            initial_n = len(combined_results)

            # Calculate p-value distribution by each 10th percentile (before filtering)
            pval_percentiles = np.percentile(combined_results['pvalue'], np.arange(0, 110, 10))

            # Filter combined_results by p-value threshold
            combined_results = combined_results[combined_results['pvalue'] < $[pval]]

            # Print summary
            print(f"Number of variants initially: {initial_n}")
            print(f"Number of variants after filtering: {len(combined_results)}")
            print("P-value distribution by each 10th percentile (before filtering):")
            print(dict(zip([f"{i}%" for i in range(0, 110, 10)], pval_percentiles)))

            # Save summary to TSV
            summary_df = pd.DataFrame({
                'metric': ['initial_n', 'after_filtering'] + [f"{i}%" for i in range(0, 110, 10)],
                'value': [initial_n, len(combined_results)] + list(pval_percentiles)
            })

            summary_df.to_csv(f"$[_output['nominal']:nn].summary.tsv", sep="\t", index=False)
            
        output_file = "$[_output['nominal']:n]"
        combined_results.to_csv(output_file, sep='\t', index=None)
        print(f"Results saved to {output_file}")
    else:
        print("No trans-QTLs found across all chromosome combinations")
        with open("$[_output['nominal']:n]", 'w') as f:
            f.write("No trans-QTLs found across any chromosome combinations\n")

R: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
    library(purrr)
    library(tidyr)
    library(readr)
    library(dplyr)
    library(qvalue)

    compute_qvalues <- function(pvalues) {
        tryCatch({
            if(length(pvalues) < 2) {
                return(pvalues)
            } else {
                return(qvalue(pvalues)$qvalues)
            }
        }, error = function(e) {
            message("Too few p-values to calculate qvalue, fall back to BH")
            qvalue(pvalues, pi0 = 1)$qvalues
        })
    }
    
    pairs_df <- read.table("$[_output['nominal']:n]", header = TRUE, sep = '\t', stringsAsFactors = FALSE)

    if (nrow(pairs_df) <= 1) {
        message("File is empty or has only header. No qvalues to calculate.")
    } else {
        unique_traits <- unique(pairs_df$molecular_trait_id)
        for(trait in unique_traits) {
            trait_rows <- pairs_df$molecular_trait_id == trait
            pairs_df$qvalue[trait_rows] <- compute_qvalues(pairs_df$pvalue[trait_rows])
        }
        
        write.table(pairs_df, "$[_output['nominal']:n]", sep = '\t', row.names = FALSE, quote = FALSE)
    }

bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
        bgzip --compress-level 9 $[_output["nominal"]:n]
        tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]]
