# Covariate data preprocessing

This notebook walks through the workflow for processing covariate files for TensorQTL. It also computes PCA-derived covariates from genotype and phenotype data.

## Methods overview

This workflow applies the covariate-related sections of the xQTL project pipeline.

## Data Input
- Phenotype data in `bed.gz` format.
- PCs from genotypes generated in the [genotype_pca](https://github.com/cumc/brain-xqtl-analysis/tree/main/analysis/Wang_Columbia/ROSMAP/pqtl/genotype_pca) step.
- Fixed covariate file including information such as sex, age at death, PMI, etc.

First, we read the covariate data and the metadata. The metadata is used to match `projid` to `sampleid`: here we replace each `projid` in the raw covariate data with its corresponding sample ID. Some `projid`s have no corresponding `sampleid` in the metadata list, which is fine because the following steps keep only the IDs that overlap with the phenotype data. You can adjust `df_cov.head()` to view more rows.
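The ID matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy tables, their values, and the variable names (`raw_cov`, `meta`, `pheno_samples`) are hypothetical and not taken from the pipeline.

```python
import csv
import io

# Hypothetical toy stand-ins for the raw covariate table and the metadata table.
raw_cov = io.StringIO("projid\tsex\tage\n101\t0\t91\n102\t1\t52\n103\t0\t85\n")
meta = io.StringIO("projid\tsampleid\n101\tsample0\n102\tsample1\n")

# Build the projid -> sampleid lookup from the metadata.
id_map = {row["projid"]: row["sampleid"]
          for row in csv.DictReader(meta, delimiter="\t")}

# Re-key covariate rows by sample ID; projids without a match are set aside.
cov_by_sample = {}
unmatched = []
for row in csv.DictReader(raw_cov, delimiter="\t"):
    projid = row.pop("projid")
    if projid in id_map:
        cov_by_sample[id_map[projid]] = row
    else:
        unmatched.append(projid)

# Keep only samples that also appear in the phenotype data (hypothetical list).
pheno_samples = ["sample0", "sample2"]
kept = {s: cov_by_sample[s] for s in pheno_samples if s in cov_by_sample}

print("unmatched projids:", unmatched)
print("covariates kept:", kept)
```

Unmatched `projid`s are recorded rather than raising an error, which mirrors the point above that missing matches are tolerated because only IDs shared with the phenotype data are kept.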


In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
head data/covariate/covariates.tsv

id	sex	age	rin	pmi
sample0	0	91	4	4
sample1	0	92	6	33
sample2	1	52	3	21
sample3	0	85	5	28
sample4	1	54	7	36
sample5	1	77	4	18
sample6	0	83	2	43
sample7	0	78	9	29
sample8	1	65	6	25


## Data Output
- `output/data_preprocessing/covariate_data/`: contains all covariates, i.e. genotype PCs, known covariates, and hidden factors.

### Merge covariates and genotype PCA

First, check how many genotype PCs to include.  
The scree file comes from the PCA output of genotype_pca and shows the variance explained by each principal component.

Here we see that 15 PCs explain about 80% of the variation in the data, so we include 15 PCs in this case. In practice, discuss the choice of PCs with your collaborators and/or PI in light of the results from the previous PCA.

We therefore set the `--k` parameter to 15.
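As a rough illustration of how a number of PCs can be read off a scree table, the sketch below uses a hypothetical table with made-up values. The column layout (PC index, proportion of variance, cumulative proportion) is an assumption. It mirrors a filter that keeps rows with cumulative proportion below 0.8 and takes the last one, and contrasts that with taking the first PC that reaches the threshold.

```python
# Hypothetical scree table: (PC index, proportion of variance, cumulative proportion).
scree = [
    (1, 0.40, 0.40),
    (2, 0.25, 0.65),
    (3, 0.10, 0.75),
    (4, 0.08, 0.83),
    (5, 0.05, 0.88),
]


def last_pc_below(scree_rows, threshold=0.8):
    """Index of the last PC whose cumulative proportion is still below threshold."""
    below = [pc for pc, _prop, cum in scree_rows if cum < threshold]
    return below[-1] if below else None


def first_pc_reaching(scree_rows, threshold=0.8):
    """Index of the first PC whose cumulative proportion reaches threshold."""
    for pc, _prop, cum in scree_rows:
        if cum >= threshold:
            return pc
    return None


k_below = last_pc_below(scree)
k_reaching = first_pc_reaching(scree)
print(k_below, k_reaching)
```

The two rules differ by one PC whenever no row lands exactly on the threshold, so it is worth checking which convention matches the intent before fixing `--k`.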

About `[merge_genotype_pc]`:    

`Aim`: To merge the results of a genotype Principal Component Analysis (PCA) with other covariate data for subsequent analyses.

`Inputs`:

- pcaFile: An RDS file, the output of the genotype PCA module.
- covFile: A file containing covariate data.
- k: The number of principal components to retain; the default is 20. In this case, we set it to 15.
- outliersFile: A file listing samples considered outliers.
- remove_outliers: A flag indicating whether outliers should be removed from the analysis.
- tol_cov: The maximum allowed missingness rate in covariates. If tol_cov = -1, missingness is not checked.
- mean_impute: A flag indicating whether missing values in covariates should be imputed with their mean.

`Output`:    

A file that merges the PCA and covariate data, with the suffix `.plink_qc.prune.pca.gz`.

In summary, this step first checks for sample overlap between the PCA and covariate data, then handles missing values in the covariates, and finally merges the processed data and saves it to an output file. After this cell, you will have a merged PCA and covariate file that can be used in subsequent analyses.
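The missing-value handling summarized above can be sketched as follows. This is a minimal illustration, not the pipeline's code. It assumes that `tol_cov` is applied per covariate, that a covariate is dropped when its missing rate is strictly greater than the threshold, and that `None` marks a missing value. The function name `filter_and_impute` and the toy values are hypothetical.

```python
def filter_and_impute(covariates, tol_cov=0.4, mean_impute=True):
    """covariates: dict mapping covariate name -> list of values (None = missing)."""
    result = {}
    for name, values in covariates.items():
        observed = [v for v in values if v is not None]
        missing_rate = 1 - len(observed) / len(values)
        # tol_cov = -1 disables the missingness filter (assumed semantics).
        if tol_cov != -1 and missing_rate > tol_cov:
            continue
        # Replace missing entries with the mean of the observed entries.
        if mean_impute and observed and len(observed) < len(values):
            mean = sum(observed) / len(observed)
            values = [mean if v is None else v for v in values]
        result[name] = values
    return result


# Hypothetical toy covariates for four samples.
toy = {
    "age": [91, None, 52, 85],     # 25% missing: kept and imputed
    "rin": [None, None, None, 5],  # 75% missing: dropped at tol_cov = 0.4
    "pmi": [4, 33, 21, 28],        # complete: kept unchanged
}
cleaned = filter_and_impute(toy, tol_cov=0.4)
print(cleaned)
```

Mean imputation keeps every sample in the analysis at the cost of shrinking the variance of the imputed covariate, which is why a missingness cap such as `tol_cov` is usually applied first.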


In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
# Assumes the previous genotype QC step found no related samples, so the PCA file is plink_qc.prune.pca.rds
sos run pipeline/covariate_formatting.ipynb merge_genotype_pc \
    --cwd output/covariate/ \
    --pcaFile output/genotype/genotype_pca/wgs.merged.plink_qc.plink_qc.prune.pca.rds \
    --covFile data/covariate/covariates.tsv \
    --tol-cov 0.4 \
    --k `awk '$3 < 0.8' output/genotype/genotype_pca/wgs.merged.plink_qc.plink_qc.prune.pca.scree.txt | tail -1 | cut -f 1 ` 
    

INFO: Running merge_genotype_pc: 
INFO: merge_genotype_pc is completed.
INFO: merge_genotype_pc output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz
INFO: Workflow merge_genotype_pc (ID=wca247f02ec8db517) is executed successfully with 1 completed step.


### Compute residuals on merged covariates and perform hidden factor analysis
This step computes residuals on the merged covariates (`Marchenko_PC_1`) and performs hidden factor analysis (`Marchenko_PC_2`).

`Background`:    
Hidden factor analysis aims to identify and quantify unobserved (hidden) factors that influence observed variables. In the context of genomics and transcriptomics, these hidden factors can be various sources of variability, such as batch effects, technical artifacts, or other unmeasured biological factors that can influence gene expression levels.

PCA on Residuals:   
Principal Component Analysis (PCA) is a dimensionality reduction technique that captures the major sources of variability in the data. By performing PCA on the residuals (which are the parts of the data unexplained by known covariates), the workflow aims to capture the variability due to hidden factors.

Marchenko-Pastur Distribution:   
The chooseMarchenkoPastur function is used to determine the number of principal components (PCs) to retain. The Marchenko-Pastur distribution is a mathematical tool used to decide how many PCs are likely representing true biological or technical variability (hidden factors) versus random noise. By comparing the eigenvalues (variances) of the PCs to this distribution, one can decide which PCs are likely representing hidden factors.

Principal components (the output):   
The resulting principal components represent hidden factors that contribute to the molecular phenotype variation but are not explained by the known covariates. These factors are:
- Hidden confounders: Unobserved variables that affect gene expression patterns across samples, such as batch effects, population substructure, or unmeasured environmental factors.
- Systematic variation: Sources of correlated variation across genes that may arise from technical factors or biological processes not captured in the covariate file.
- Quality-controlled factors: Only principal components that pass the Marchenko-Pastur significance threshold are retained, ensuring that noise components are excluded.

Let's look at the workflow step by step:  

`Workflow 1: *_1(computing residual on merged covariates)`   

`Aim`: To compute residuals on the merged covariates, i.e. the portion of the phenotype data that cannot be explained by the covariates. Removing the effects of the known covariates leaves the remaining signal in the phenotype data for the subsequent hidden factor analysis, so that PCA captures variation not already accounted for by those covariates.

`Inputs`:     
- phenoFile: A file containing phenotype data.    
- covFile: The merged PCA and covariate `.gz` file produced by `merge_genotype_pc`.

`Output`:   
`Mic.log2cpm.mic.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.residual.bed.gz`. A compressed `.bed.gz` file containing the phenotype residuals after regressing out the merged covariates.

`In summary`, it loads the phenotype and covariate files, extracts the overlapping samples, and computes residuals on that subset using a linear model fit. The residuals, along with the phenotype IDs, are then written to a tab-delimited file, compressed with bgzip, and indexed with tabix.
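The residual computation summarized above can be illustrated with an ordinary least-squares fit for a single phenotype. This is a minimal sketch in plain Python and not the pipeline's implementation. The helper names (`solve`, `residuals`), the intercept term, and the toy values are assumptions made for the example.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def residuals(y, covariates):
    """Residuals of y after least-squares regression on covariates plus an intercept."""
    design = [[1.0] + list(row) for row in covariates]  # one row per sample
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    return [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(design, y)]


# Hypothetical toy data: one phenotype and one covariate, keyed by sample ID.
pheno = {"sample0": 1.0, "sample1": 3.1, "sample2": 4.9, "sample3": 7.0, "sample9": 5.0}
cov = {"sample0": [0.0], "sample1": [1.0], "sample2": [2.0], "sample3": [3.0], "sample8": [4.0]}

# Keep only samples present in both tables, then regress out the covariate.
shared = [s for s in pheno if s in cov]
res = residuals([pheno[s] for s in shared], [cov[s] for s in shared])
print(dict(zip(shared, res)))
```

By construction the residuals sum to zero and are uncorrelated with every covariate in the model, which is what makes them a suitable input for the PCA in the next workflow step.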

`Workflow 2: Marchenko_PC_2`:      

`Aim`: To perform Principal Component Analysis (PCA) on the residuals and determine the number of principal components (hidden factors) to retain based on the Marchenko-Pastur distribution.

`Inputs`:     
- Residuals from the previous workflow.
- Covariate file (covFile).

`Output`:   
`Mic.log2cpm.mic.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz`. A compressed file containing the principal components (hidden factors, written as Hidden_Factor_PC1, Hidden_Factor_PC2, ...) and the covariates (e.g. msex, age_death, pmi). This is the final covariate file used in subsequent analyses (e.g. TensorQTL).

`In summary`, the Marchenko_PC_2 workflow is essentially performing hidden factor analysis on the residuals. By extracting the major sources of variability from the residuals using PCA and determining the number of PCs to retain based on the Marchenko-Pastur distribution, the workflow identifies the hidden factors in the data. These hidden factors, represented by the principal components, provide insights into the underlying structure of the data and can be crucial for correcting batch effects, reducing technical noise, and improving the interpretability of downstream analyses.
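To make the selection rule concrete, the sketch below counts how many PCA eigenvalues exceed the textbook upper edge of the Marchenko-Pastur distribution, $\sigma^2 (1 + \sqrt{p/n})^2$, for a matrix with $n$ samples and $p$ features. It is an illustration under stated assumptions: the eigenvalues, dimensions, and noise variance are made up, and the `chooseMarchenkoPastur` function named above may estimate or scale these quantities differently.

```python
import math


def marchenko_pastur_upper_edge(n_samples, n_features, noise_variance=1.0):
    """Largest eigenvalue expected from pure noise: sigma^2 * (1 + sqrt(p / n))^2."""
    ratio = n_features / n_samples
    return noise_variance * (1 + math.sqrt(ratio)) ** 2


def count_hidden_factors(eigenvalues, n_samples, n_features, noise_variance=1.0):
    """Number of PCs whose variance exceeds the pure-noise upper edge."""
    edge = marchenko_pastur_upper_edge(n_samples, n_features, noise_variance)
    return sum(1 for ev in eigenvalues if ev > edge)


# Hypothetical eigenvalues of the residual covariance matrix, largest first.
eigenvalues = [9.0, 4.1, 2.3, 2.2, 1.0, 0.4]
edge = marchenko_pastur_upper_edge(n_samples=100, n_features=25)
n_factors = count_hidden_factors(eigenvalues, n_samples=100, n_features=25)
print(edge, n_factors)
```

Eigenvalues at or below the edge are treated as indistinguishable from noise, so only the leading components above it would be written out as hidden factors.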

In [2]:
sos run pipeline/covariate_hidden_factor.ipynb Marchenko_PC \
   --cwd output/covariate \
   --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz  \
   --covFile output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz \
   --mean-impute-missing 

INFO: Running computing residual on merged covariates: 
INFO: computing residual on merged covariates is completed.
INFO: computing residual on merged covariates output:   output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.residual.bed.gz
INFO: Running Marchenko_PC_2: 
INFO: Marchenko_PC_2 is completed.
INFO: Marchenko_PC_2 output:   output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz
INFO: Workflow Marchenko_PC (ID=w93382653426c6a87) is executed successfully with 2 completed steps.


### Summary analysis of covariates preprocessing results

In [3]:
# preview of the final Marchenko_PC.gz file
zcat output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz | head | cut -f 1-6

#id	sample0	sample1	sample2	sample3	sample4
sex	0	0	1	0	1
age	91	92	52	85	54
rin	4	6	3	5	7
pmi	4	33	21	28	36
PC1	0.0079719403735707	0.01551892391210383	0.00774270222373719	0.01728777110680747	-0.0038685278561914
PC2	0.21108638503242705	-0.08141101340591093	-0.1324609481859247	0.01462193776343482	0.06596595896560435
PC3	-0.05063343099042864	-0.12687379700590845	-0.12060270181509317	0.05599049431828046	-0.06686964706351353
PC4	-0.01406533451346132	0.12781517725968655	-0.01736711354272593	0.02015639819991523	-0.03304252616108331
PC5	-0.124368907986055	0.03655636585203538	0.0282157547046425	0.1561247811903808	-0.07371828928862333


##### The final covariates list     
The number of covariates listed is 29: sex, age, rin, and pmi from the raw covariate data, PC1-15 from the genotype PCA, and Hidden_Factor_PC1-10, the hidden factors from Marchenko_PC.

`1. Number of covariates in the raw covariates file: 4. `    

- sex (`msex` in ROSMAP): This refers to the genetic sex of the individual. In many genetic studies, sex is an important covariate because males and females can have different baseline levels of gene expression and can respond differently to genetic and environmental factors.

- age (`age_death` in ROSMAP): This refers to the age at which the individual died. Age can influence gene expression patterns, with some genes being more or less active at different stages of life. Including age at death as a covariate can help account for age-related variations in gene expression.

- rin: This stands for "RNA integrity number", a measure of the RNA quality of the sample.

- pmi: This stands for "post-mortem interval." It refers to the time elapsed between an individual's death and when their tissue or samples were collected or preserved. PMI can influence the quality and stability of RNA and other molecules in the sample. By including PMI as a covariate, the analysis can account for potential biases or noise introduced by variations in sample quality due to differing post-mortem intervals.

`2. Number of covariates from genotype pca: 15. `    
Including PCs from genotype data as covariates helps control for potential confounders, especially population stratification, which reduces the risk of false-positive findings and makes the eQTL analysis more robust. Here, 15 PCs explain about 80% of the variation in the data, so we include 15 PCs.

`3. Number of covariates from hidden factor: 10. `    
Hidden factors represent sources of variability in the data that are not directly measured but can significantly influence the observed variables. In genomics, they can arise from batch effects, technical artifacts, or other unmeasured biological factors that influence gene expression levels. Identifying and accounting for these hidden factors is crucial to reduce confounding and improve interpretability.


In summary, including these covariates in the analysis helps to ensure that the observed associations between genotypes and gene expression are not confounded by these other factors.
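As a quick sanity check, the covariate names in the first column of the final file can be tallied by group. The sketch below hard-codes the names as they appear in this notebook's listing. The grouping rule by name prefix and the helper `group_of` are assumptions made for the example.

```python
from collections import Counter

# Covariate names as listed in the first column of the final covariate file.
names = (
    ["sex", "age", "rin", "pmi"]
    + [f"PC{i}" for i in range(1, 16)]
    + [f"Hidden_Factor_PC{i}" for i in range(1, 11)]
)


def group_of(name):
    """Assign a covariate to a group based on its name prefix (assumed convention)."""
    if name.startswith("Hidden_Factor_PC"):
        return "hidden_factor"
    if name.startswith("PC"):
        return "genotype_pc"
    return "known"


counts = Counter(group_of(n) for n in names)
total = sum(counts.values())
print(dict(counts), total)
```

The hidden-factor prefix is checked first so that those names are never confused with genotype PCs.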

In [4]:
zcat output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz | cut -f 1


#id
sex
age
rin
pmi
PC1
PC2
PC3
PC4
PC5
PC6
PC7
PC8
PC9
PC10
PC11
PC12
PC13
PC14
PC15
Hidden_Factor_PC1
Hidden_Factor_PC2
Hidden_Factor_PC3
Hidden_Factor_PC4
Hidden_Factor_PC5
Hidden_Factor_PC6
Hidden_Factor_PC7
Hidden_Factor_PC8
Hidden_Factor_PC9
Hidden_Factor_PC10
