# Phenotype preprocessing
This notebook records the workflow for preprocessing proteomics phenotype files for TensorQTL.

## Data Input

- `output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz`
- `reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf`

### Phenotype Annotation
This step annotates each gene in the original phenotype matrix with the corresponding `chr`, `start`, `end`, `ID`, and `strand`.

In this case, each row of the original mic data is one gene. The first column is the ID (gene name or Ensembl gene ID) and the remaining columns are sample IDs.

The annotation produces a `bed.gz` file. A `zcat` cell after the annotation command previews its first rows and columns.

In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
zcat output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz | head | cut -f 1-6

#chr	start	end	gene_id	sample0	sample1
chr1	91104	91105	ENSG00000239945	-0.8112562251907688	-0.8112562251907688
chr1	959308	959309	ENSG00000188976	-0.5687343634272857	-0.8948618788124498
chr1	1206591	1206592	ENSG00000186891	-0.7329709314251284	-0.799775190677156
chr1	2555638	2555639	ENSG00000157873	-0.7329709314251284	0.3471524102625916
chr1	7784319	7784320	ENSG00000049246	-0.04151238794160813	0.9718486924642757
chr1	7999933	7999934	ENSG00000284716	0.40056323012421163	1.3652151742017207
chr1	9960786	9960787	ENSG00000283611	-0.3826399361206268	-0.3826399361206268
chr1	10298965	10298966	ENSG00000199562	-0.32957402197565067	-0.40056323012421163
chr1	10306464	10306465	ENSG00000264501	0.07477045310977722	0.12482480218232882


In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
sos run pipeline/gene_annotation.ipynb annotate_coord \
    --cwd output/rnaseq \
    --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
    --coordinate-annotation reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --phenotype-id-column gene_id

  import pkg_resources
INFO: Running annotate_coord:
INFO: annotate_coord (index=0) is ignored due to saved signature
INFO: annotate_coord output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.region_list.txt
INFO: Workflow annotate_coord (ID=wc0f3b36281bafaba) is ignored with 1 ignored step.


The annotated output looks as follows:

In [5]:
zcat output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz | head | cut -f 1-6

#chr	start	end	ID	strand	sample0
chr1	89550	91104	ENSG00000239945	-	-0.8112562251907688
chr1	944202	959308	ENSG00000188976	-	-0.5687343634272857
chr1	1203507	1206591	ENSG00000186891	-	-0.7329709314251284
chr1	2555638	2565381	ENSG00000157873	+	-0.7329709314251284
chr1	7784319	7845176	ENSG00000049246	+	-0.0415123879416081
chr1	7998186	7999933	ENSG00000284716	-	0.4005632301242116
chr1	9950571	9960786	ENSG00000283611	-	-0.3826399361206268
chr1	10298965	10299071	ENSG00000199562	+	-0.3295740219756506
chr1	10306464	10306756	ENSG00000264501	+	0.0747704531097772
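
Conceptually, the annotation is a join between the phenotype matrix and the gene records of the GTF. The sketch below is only an illustration of that idea, not the `annotate_coord` implementation. It assumes a standard tab-delimited GTF whose gene records carry `gene_id "..."` in column 9, a phenotype matrix whose first column is the gene ID, and it copies the GTF start and end unchanged, without any coordinate conversion the pipeline may apply. The function name `annotate_sketch` is hypothetical.

```shell
# Hypothetical helper (not part of the pipeline): annotate_sketch GTF PHENO
#   GTF   : plain-text GTF file
#   PHENO : tab-delimited matrix, header line first, gene ID in column 1
annotate_sketch() {
  awk -F'\t' -v OFS='\t' '
    # First file (GTF): remember chr, start, end, strand for every gene record.
    NR == FNR {
      if ($0 ~ /^#/ || $3 != "gene") next
      if (match($9, /gene_id "[^"]+"/)) {
        id = substr($9, RSTART + 9, RLENGTH - 10)   # text between the quotes
        coord[id] = $1 OFS $4 OFS $5 OFS id OFS $7
      }
      next
    }
    # Second file (phenotype matrix): rewrite the header line.
    FNR == 1 {
      out = "#chr" OFS "start" OFS "end" OFS "ID" OFS "strand"
      for (i = 2; i <= NF; i++) out = out OFS $i
      print out
      next
    }
    # Data lines: prepend coordinates; genes missing from the GTF are skipped.
    ($1 in coord) {
      out = coord[$1]
      for (i = 2; i <= NF; i++) out = out OFS $i
      print out
    }
  ' "$1" "$2"
}
```

With hypothetical file names, a call would look like `annotate_sketch genes.gtf matrix.tsv | head`. Dropping genes that are absent from the GTF is a choice made for this sketch; the pipeline may handle them differently.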


### Imputation
The phenotype_imputation module provides multiple imputation algorithms to handle missing values in molecular phenotype data. The primary recommended method is Empirical Bayes Matrix Factorization (EBMF), particularly the grouped version (gEBMF), as described in phenotype_imputation.ipynb:32.

This workflow includes eight imputation methods:

- gEBMF: Grouped Empirical Bayes Matrix Factorization (recommended method)
- EBMF: Standard Empirical Bayes Matrix Factorization
- missforest: Random forest-based imputation
- knn: k-nearest neighbors imputation
- soft: SoftImpute via SVD
- mean: Mean imputation
- lod: Limit of detection imputation
- bed_filter_na: Imputation with feature filtering (phenotype_imputation.ipynb:177–186)

#### Input Format
The input is a molecular phenotype file with missing values, formatted as follows:

The first four columns must be: chr, start, end, ID

The remaining columns represent sample-level measurements (phenotype_imputation.ipynb:42–44)

#### Processing Steps
1. Quality Control Preprocessing

All imputation methods apply the following QC filters before imputation:
- Remove features with >40% missingness
- Remove features with >95% zero values
(phenotype_imputation.ipynb:302–306)
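
The two QC thresholds above can be illustrated with a small awk filter. This is a sketch under stated assumptions rather than the pipeline's code: missing values are assumed to be encoded as `NA` or an empty field, both fractions are computed over all sample columns, and the function name `qc_filter_sketch` is hypothetical.

```shell
# Hypothetical helper: reads a tab-delimited phenotype table on stdin
# (header line, then chr/start/end/ID followed by sample columns)
# and writes the rows that pass both filters to stdout.
qc_filter_sketch() {
  awk -F'\t' -v OFS='\t' -v max_na=0.40 -v max_zero=0.95 '
    NR == 1 { print; next }                     # keep the header line
    {
      n = NF - 4; na = 0; zero = 0
      for (i = 5; i <= NF; i++) {
        if ($i == "NA" || $i == "") na++        # assumed missing-value encoding
        else if ($i + 0 == 0) zero++
      }
      if (n > 0 && na / n > max_na) next        # more than 40% missing
      if (n > 0 && zero / n > max_zero) next    # more than 95% zeros
      print
    }'
}
```

For a gzipped input, the filter can be used as `zcat input.bed.gz | qc_filter_sketch`, where `input.bed.gz` is a placeholder name.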

2. gEBMF Method (Recommended)

The core steps for gEBMF are:
- Data grouping: Partition data by chromosome groups
- Cluster initialization: Use flash_init_cluster_for_grouped_data
- Backfitting optimization: Run specified iterations of backfitting
- Imputation: Fill missing values using the trained EBMF model
- Postprocessing: If data is in [0,1] range, apply inverse normal transformation
(phenotype_imputation.ipynb:418–445)

3. Logic of Other Methods
- EBMF: Uses the flashier package for matrix factorization (phenotype_imputation.ipynb:335–338)
- missforest: Applies random forest for imputation (phenotype_imputation.ipynb:500)
- soft: Uses softImpute based on SVD (phenotype_imputation.ipynb:669–670)
- mean: Fills missing values with row means (phenotype_imputation.ipynb:724–726)
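
Of the methods listed, row-mean imputation is simple enough to show in a few lines. The following is a minimal sketch of the idea, not the notebook's implementation. It assumes missing values are encoded as `NA` or an empty field, and the function name `mean_impute_sketch` is hypothetical.

```shell
# Hypothetical helper: replaces missing sample values in each row (feature)
# with the mean of that row's observed values. Reads a tab-delimited table
# on stdin: header line, then chr/start/end/ID followed by sample columns.
mean_impute_sketch() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 { print; next }                 # keep the header line
    {
      sum = 0; n = 0
      for (i = 5; i <= NF; i++)
        if ($i != "NA" && $i != "") { sum += $i; n++ }
      if (n > 0) {                          # rows with no observed value stay unchanged
        m = sum / n
        for (i = 5; i <= NF; i++)
          if ($i == "NA" || $i == "") $i = m   # written with awk default precision
      }
      print
    }'
}
```

Because awk formats the imputed numbers with its default precision of six significant digits, this sketch is suited to illustration only.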

#### Output Format
The output is the fully imputed molecular phenotype matrix, with the same structure as the input:
- First four columns: chr, start, end, ID
- Remaining columns: imputed sample values

File format: *.imputed.bed.gz (bgzipped and indexed)
(phenotype_imputation.ipynb:53–55)
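
A quick sanity check on an imputed file is to confirm the four leading columns and count any values that are still missing. The helper below is a hypothetical convenience, not part of the pipeline, and it assumes missing values would appear as `NA` or an empty field.

```shell
# Hypothetical helper: reads a tab-delimited table on stdin, checks that the
# first four header fields are chr, start, end, ID (a leading "#" is allowed),
# and reports how many sample values are still missing.
check_imputed_sketch() {
  awk -F'\t' '
    NR == 1 {
      sub(/^#/, "", $1)
      if ($1 != "chr" || $2 != "start" || $3 != "end" || $4 != "ID") {
        print "bad header"; bad = 1; exit
      }
      next
    }
    { for (i = 5; i <= NF; i++) if ($i == "NA" || $i == "") missing++ }
    END { if (!bad) print "missing values: " (missing + 0) }
  '
}
```

For the gEBMF run in this notebook, the check could be applied as `zcat output/phenotype/impute_gebmf/protocol_example.protein.bed.imputed.bed.gz | check_imputed_sketch`.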

In [None]:
# Step ii. Missing value imputation
# This step imputes the missing entries in molecular phenotype data. It is optional for eQTL analysis but necessary for other QTL analyses. The missing entries are imputed by flashier, an Empirical Bayes Matrix Factorization model.

sos run pipeline/phenotype_imputation.ipynb gEBMF \
    --phenoFile data/protocol_example.protein.bed.gz \
    --cwd output/phenotype/impute_gebmf \
    --no-qc-prior-to-impute # skip QC before imputation

  import pkg_resources
INFO: Running gEBMF:
INFO: gEBMF is completed.
INFO: gEBMF output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/phenotype/impute_gebmf/protocol_example.protein.bed.imputed.bed.gz
INFO: Workflow gEBMF (ID=w8553fc84f43b1203) is executed successfully with 1 completed step.


### Partition by chromosome

This is necessary for cis TensorQTL analysis. The output consists of two sets of files.
For each chromosome (chr1 to chr22), a `chr#.bed.gz` file and a `chr#.bed.gz.tbi` file are generated. A metadata text file, `phenotype_by_chrom_files.txt`, is also generated and lists the path for each chromosome.
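
The partitioning itself amounts to routing each row to a file named after its first column and recording those files in a list. The sketch below shows only that idea and is not the `phenotype_by_chrom` implementation. It writes plain-text `.bed` files instead of bgzipped, indexed ones, splits on every chromosome it encounters rather than a requested list, and uses a two-column `#chr`/`path` layout for the list file, which is an assumption. The function name `split_by_chrom_sketch` is hypothetical.

```shell
# Hypothetical helper: split_by_chrom_sketch BED OUTDIR NAME
#   BED    : uncompressed tab-delimited phenotype file with a header line
#   OUTDIR : directory for the per-chromosome files (created if needed)
#   NAME   : prefix for the output file names
split_by_chrom_sketch() {
  mkdir -p "$2"
  awk -F'\t' -v dir="$2" -v name="$3" '
    NR == 1 { header = $0; next }
    {
      f = dir "/" name "." $1 ".bed"
      if (!(f in seen)) {                 # first row of a new chromosome
        print header > f
        seen[f] = 1
        order[++n] = $1
      }
      print > f
    }
    END {
      # List file: one line per chromosome with the path of its file.
      list = dir "/phenotype_by_chrom_files.txt"
      print "#chr\tpath" > list
      for (i = 1; i <= n; i++)
        print order[i] "\t" dir "/" name "." order[i] ".bed" > list
    }
  ' "$1"
}
```

In practice each per-chromosome file would still need to be compressed with `bgzip` and indexed with `tabix` to match the `chr#.bed.gz` and `chr#.bed.gz.tbi` outputs described above.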

In [11]:
# This uses the phenotype file after it has been annotated with gene_annotation.ipynb annotate_coord
sos run pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
    --cwd output/phenotype/phenotype_by_chrom_for_cis \
    --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz \
    --name bulk_rnaseq \
    --chrom `for i in {1..22}; do echo chr$i; done`

  import pkg_resources
INFO: Running phenotype_by_chrom_1:
INFO: phenotype_by_chrom_1 (index=1) is completed.
INFO: phenotype_by_chrom_1 (index=0) is completed.
INFO: phenotype_by_chrom_1 (index=2) is completed.
INFO: phenotype_by_chrom_1 (index=5) is completed.
INFO: phenotype_by_chrom_1 (index=3) is completed.
INFO: phenotype_by_chrom_1 (index=4) is completed.
INFO: phenotype_by_chrom_1 (index=6) is completed.
INFO: phenotype_by_chrom_1 (index=8) is completed.
INFO: phenotype_by_chrom_1 (index=7) is completed.
INFO: phenotype_by_chrom_1 (index=10) is completed.
INFO: phenotype_by_chrom_1 (index=9) is completed.
INFO: phenotype_by_chrom_1 (index=12) is completed.
INFO: phenotype_by_chrom_1 (index=11) is completed.
INFO: phenotype_by_chrom_1