# PCA on genotypes of selected samples

This notebook contains a workflow that computes PCA-derived covariates from genotype data.

## Methods overview

This workflow is an application of `PCA.ipynb` from the xQTL project pipeline.

## Data Input

- `output/plink/wgs.merged.plink_qc.bed`
- `output/plink/wgs.merged.plink_qc.bim`
- `output/plink/wgs.merged.plink_qc.fam`

## Data Output
- no related samples: `output/genotype/genotype_pca/wgs.merged.plink_qc.plink_qc.prune.pca.rds`
- with related samples: `output/genotype/genotype_pca/wgs.merged.plink_qc.wgs.merged.king.related.plink_qc.extracted.pca.projected.rds`


## Steps in detail

### Kinship QC only on proteomics samples

To accurately estimate the genotype PCs, we split participants into related and unrelated groups based on their kinship coefficients, as estimated by KING.

#### Sample match with genotype 
-- `Aim`: Keep only the genotype samples that also have phenotype data, so that the KING estimation runs on the overlapping samples. The `genotype_phenotype_sample_overlap` step writes `sample_genotypes.txt`, which is then used as the keep-sample list in the `king` step that follows.

-- `Main input`: 
- phenoFile: the `bed.gz` file produced by phenotype preprocessing.   
- genoFile: the output of genotype preprocessing.

-- `Output`:    
`sample_overlap.txt` and `sample_genotypes.txt`.    
These files list the genotype samples that overlap with the phenotype data.    
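
The overlap amounts to a set intersection of sample IDs. A minimal sketch, assuming the phenotype BED header lists sample IDs from the fifth column onward and the `.fam` file holds individual IDs in its second column. The toy files and sample names are made up for illustration, and this is not the pipeline's actual implementation:

```shell
# Toy inputs (hypothetical): a PLINK .fam file and a phenotype BED header.
workdir=$(mktemp -d)
printf 'sample1 sample1 0 0 0 -9\nsample2 sample2 0 0 0 -9\nsample3 sample3 0 0 0 -9\n' > "$workdir/toy.fam"
printf '#chr\tstart\tend\tID\tsample2\tsample3\tsample9\n' > "$workdir/toy.bed"

# Sample IDs in the phenotype header (columns 5 onward), one per line, sorted.
head -n 1 "$workdir/toy.bed" | cut -f 5- | tr '\t' '\n' | sort > "$workdir/pheno_ids.txt"

# Individual IDs (column 2) in the .fam file, sorted.
awk '{print $2}' "$workdir/toy.fam" | sort > "$workdir/geno_ids.txt"

# IDs present in both lists.
overlap=$(comm -12 "$workdir/geno_ids.txt" "$workdir/pheno_ids.txt")
echo "$overlap"
rm -r "$workdir"
```

For a gzipped phenotype file, the header would be read with `zcat file.bed.gz | head -n 1` instead.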

In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
sos run pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
        --cwd output/genotype/ \
        --genoFile output/plink/wgs.merged.plink_qc.fam  \
        --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz

INFO: Running genotype_phenotype_sample_overlap: This workflow extracts overlapping samples for genotype data with phenotype data, and output the filtered sample genotype list as well as sample phenotype list
INFO: genotype_phenotype_sample_overlap is completed.
INFO: genotype_phenotype_sample_overlap output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.sample_overlap.txt /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.sample_genotypes.txt
INFO: Workflow genotype_phenotype_sample_overlap (ID=wb19c4f2294a7958c) is executed successfully with 1 completed step.


#### Kinship
`[king_1]`:   
-- `Aim`: Infer relationships within the sample set to identify closely related individuals.   
-- `Main input`: the PLINK genotype file and `kin_maf`, the minor allele frequency threshold used to filter SNPs. If `keep_samples` or `remove_samples` files are provided, the `--keep` and `--remove` options are used to include or exclude specific samples.  
-- `Output`: a `.kin0` file containing the kinship coefficients for pairs of individuals. A higher kinship coefficient indicates a closer genetic relationship between the two individuals.  

`[king_2]`:   
-- `Aim`: Select a list of unrelated individuals, maximizing the number retained while filtering out those who are related. This matters because relatedness can confound the results of genetic studies.   
-- `Main input`: the `.kin0` file from `king_1`, and `maximize_unrelated`, a boolean parameter: `True` keeps as many unrelated individuals as possible, `False` removes entire families that contain any related individuals.     
-- `Output`: a file with the extension `.related_id`, listing the related individuals to exclude from further analysis.   

`[king_3]`:   
-- `Aim`: Split the genotype data into two sets, one containing unrelated samples and the other containing related samples.   
-- `Main input`: `output_from(2)`, the output of `king_2` (the list of related individuals), and `genoFile`, the genotype data to split.  
-- `Output`: `unrelated_bed`, the genotype data for unrelated individuals, and `related_bed`, the genotype data for related individuals.

`In summary`, the `king` workflow handles relatedness in three steps: it identifies related individuals, selects a set of unrelated samples, and splits the genotype data accordingly, so that downstream analyses run on appropriately filtered datasets.

In [3]:
# Note: --keep-samples takes the sample_genotypes.txt file written by the previous cell.
sos run pipeline/GWAS_QC.ipynb king \
    --cwd output/genotype/kinship \
    --genoFile output/plink/wgs.merged.plink_qc.bed \
    --name wgs.merged.king \
    --keep-samples output/genotype/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.sample_genotypes.txt

INFO: Running king_1: Inference of relationships in the sample to identify closely related individuals
INFO: king_1 is completed.
INFO: king_1 output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/kinship/wgs.merged.plink_qc.wgs.merged.king.kin0
INFO: Running king_2: Select a list of unrelated individual with an attempt to maximize the unrelated individuals selected from the data
INFO: king_2 is completed.
INFO: king_2 output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/kinship/wgs.merged.plink_qc.wgs.merged.king.related_id
INFO: Running king_3: Split genotype data into related and unrelated samples, if related individuals are detected
INFO: king_3 is completed.
INFO: king_3 output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/kinship/wgs.merged.plink_qc.wgs.merged.king.unrelate

The related pairs reported in the `.kin0` file are shown below.

**Columns Explanation**:  
-- FID1 & IID1: Family and individual identifiers for the first sample.  
-- FID2 & IID2: Family and individual identifiers for the second sample.  
-- NSNP: The number of SNPs (Single Nucleotide Polymorphisms) with non-missing genotypes in both samples, used for the comparison.  
-- HETHET: The proportion of those SNPs where both samples are heterozygous.  
-- IBS0: The proportion of those SNPs where the two samples share no allele (opposite homozygous genotypes).  
-- KINSHIP: The estimated kinship coefficient between the two samples; higher values indicate closer relationships.  

In [5]:
cat output/genotype/kinship/wgs.merged.plink_qc.wgs.merged.king.kin0

#FID1	IID1	FID2	IID2	NSNP	HETHET	IBS0	KINSHIP
sample4	sample4	sample2	sample2	125472	0.0836681	0.026747	0.0681404
sample62	sample62	sample4	sample4	125511	0.078758	0.0249221	0.0646603
sample87	sample87	sample85	sample85	125963	0.0818812	0.0234355	0.0760353
sample88	sample88	sample59	sample59	125446	0.0828484	0.0267127	0.0627207
sample118	sample118	sample39	sample39	125497	0.0799222	0.0251002	0.0633735
sample118	sample118	sample46	sample46	125942	0.087175	0.0246145	0.0855043
sample118	sample118	sample95	sample95	125965	0.0930735	0.0281666	0.0793313
sample118	sample118	sample96	sample96	125983	0.0946794	0.0276704	0.0752263
sample120	sample120	sample59	sample59	125971	0.0840511	0.0242754	0.0710927
sample120	sample120	sample96	sample96	126269	0.0973556	0.0286373	0.0798903
sample122	sample122	sample37	sample37	126155	0.0792755	0.0251675	0.0667528
sample136	sample136	sample4	sample4	125235	0.086158	0.0264862	0.0678869
sample136	sample136	sample96	sample96	125777	0.0868362	0.0261336	0.0649119
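
To see which samples drive the relatedness, it can help to count how many related pairs each individual appears in. A minimal sketch using three rows copied from the table above. The counting logic is illustrative and is not part of the `king` workflow:

```shell
# A few rows copied from the .kin0 output above (tab-separated).
kin0=$(mktemp)
printf '#FID1\tIID1\tFID2\tIID2\tNSNP\tHETHET\tIBS0\tKINSHIP\n' > "$kin0"
printf 'sample118\tsample118\tsample46\tsample46\t125942\t0.087175\t0.0246145\t0.0855043\n' >> "$kin0"
printf 'sample118\tsample118\tsample96\tsample96\t125983\t0.0946794\t0.0276704\t0.0752263\n' >> "$kin0"
printf 'sample120\tsample120\tsample96\tsample96\t126269\t0.0973556\t0.0286373\t0.0798903\n' >> "$kin0"

# Count how often each individual (IID1 or IID2) appears in a related pair,
# skipping the header, then list the most frequently involved individuals first.
pair_counts=$(awk -F '\t' 'NR > 1 { n[$2]++; n[$4]++ } END { for (id in n) print n[id], id }' "$kin0" | sort -k1,1nr -k2,2)
echo "$pair_counts"
rm "$kin0"
```

Individuals that appear in many pairs, such as `sample118` in the full table, are natural candidates for removal when the goal is to retain as many unrelated individuals as possible.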

Next, we apply variant-level and sample-level QC to the unrelated individuals (filtering on missingness > 10%), followed by LD pruning, in preparation for PCA.    


**Be aware:**    

**If `king_2` reports `No related individuals detected from *.kin0`, no related individuals were found among the samples in `--keep-samples`. In this case, this step produces no output for unrelated individuals.**

**For example, with the ROSMAP proteomics data the message `No related individuals detected from *.kin0` occurred, so no separate genotype data was generated for unrelated individuals. In such cases, we need to work from the original genotype data and must pass `--keep-samples` to `qc` to extract the samples for PCA.**

#### QC on unrelated samples


This intermediate genotype data is only needed for PCA and can be removed afterwards, so it can be written to a `cache` folder instead of `output` (the commands below use `output/genotype/`). We also filter out variants with a minor allele count < 5.

**If your data has `*.unrelated.bed` generated, there are related individuals in your data. In that case, we use the output from the KING step for unrelated individuals.**
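
Which of the two cases applies can be checked from the KING output directory. A minimal sketch, assuming the file names shown in the logs above. The helper name `pick_qc_input` is made up for illustration:

```shell
# Print the genotype file to pass to `qc`: the KING unrelated subset if it
# exists, otherwise the original genotype data.
pick_qc_input() {
    unrelated_bed=$1
    original_bed=$2
    if [ -f "$unrelated_bed" ]; then
        echo "$unrelated_bed"
    else
        echo "$original_bed"
    fi
}

pick_qc_input \
    output/genotype/kinship/wgs.merged.plink_qc.wgs.merged.king.unrelated.bed \
    output/plink/wgs.merged.plink_qc.bed
```

When the fallback to the original genotype data is taken, `--keep-samples` still needs to be passed to `qc`, as noted above.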

About `qc`:   
1. `[qc_no_prune, qc_1 (basic QC filters)]`:  
-- `aim`: To filter SNPs and select individuals based on various quality control (QC) criteria. The goal is to ensure that the genotype data is of high quality and free from potential errors or biases before further analysis.   
-- `Input`:   
genoFile: The primary input file containing genotype data.  
Various parameters that dictate the QC criteria:  
maf_filter, maf_max_filter: Minimum and maximum Minor Allele Frequency (MAF) thresholds.  
mac_filter, mac_max_filter: Minimum and maximum Minor Allele Count (MAC) thresholds.  
geno_filter: Maximum missingness per variant.  
mind_filter: Maximum missingness per sample.  
hwe_filter: Hardy-Weinberg Equilibrium (HWE) filter threshold.  
other_args: Other optional PLINK arguments.  
meta_only: Flag to determine if only SNP and sample lists should be output.  
rm_dups: Flag to remove duplicate variants.  
-- `Output`: A file (or set of files) with the suffix .plink_qc (and possibly .extracted if specific variants are kept). The exact format (e.g., .bed or .snplist) depends on the meta_only parameter.  

2. `[qc_2 (LD pruning)]`:   
-- `aim`: To perform Linkage Disequilibrium (LD) pruning and remove related individuals (both individuals of a pair). LD pruning is a common step in genotype data quality control, aiming to remove highly correlated SNPs, thus reducing redundancy in the data and enhancing the accuracy of subsequent analyses.   
-- `Input`:
_input: The primary input file containing genotype data that has undergone basic quality control.   
Pruning parameters:   
window: The window size for calculating LD between SNPs.   
shift: The number of SNPs to shift the window each time.   
r2: The LD threshold for pruning   
-- `Output`:  
.prune.bed: The binary PLINK format file of the pruned genotype data.   
.prune.in: A list containing the SNPs to retain.

In [29]:
# If no related individuals were detected:
# 1. Run QC (basic filters + LD pruning) on the full PLINK genotype data
sos run pipeline/GWAS_QC.ipynb qc \
   --cwd output/genotype/ \
   --genoFile output/plink/wgs.merged.plink_qc.bed \
   --mac-filter 5

# 2. Run PCA on the whole QC'ed and pruned genotype file
sos run pipeline/PCA.ipynb flashpca \
   --cwd output/genotype/genotype_pca \
   --genoFile output/genotype/wgs.merged.plink_qc.plink_qc.prune.bed

INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: qc_1 (index=0) is ignored due to saved signature
INFO: basic QC filters output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/wgs.merged.plink_qc.plink_qc.bed
INFO: Running LD pruning: LD prunning and remove related individuals (both ind of a pair) Plink2 has multi-threaded calculation for LD prunning
INFO: qc_2 (index=0) is ignored due to saved signature
INFO: LD pruning output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/wgs.merged.plink_qc.plink_qc.prune.bed /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/wgs.merged.plink_qc.plink_qc.prune.in
INFO: Workflow qc (ID=w5583b53e391fc494) is ignored with 2 ignored steps.
INFO: Running flashpca_1: Run PCA analysis using flashpca
INFO: flashpca_1 (index=0) is ig

If the KING step produced both unrelated and related genotype data, process the two sets separately:

In [None]:
# qc on unrelated geno data: basic qc + ld pruning
sos run pipeline/GWAS_QC.ipynb qc \
   --cwd output/genotype/ \
   --genoFile output/genotype/kinship/wgs.merged.plink_qc.wgs.merged.king.unrelated.bed \
   --mac-filter 5 -s force

INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: basic QC filters is completed.
INFO: basic QC filters output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/wgs.merged.plink_qc.wgs.merged.king.unrelated.plink_qc.bed
INFO: Running LD pruning: LD prunning and remove related individuals (both ind of a pair) Plink2 has multi-threaded calculation for LD prunning
INFO: LD pruning is completed.
INFO: LD pruning output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/wgs.merged.plink_qc.wgs.merged.king.unrelated.plink_qc.prune.bed /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/wgs.merged.plink_qc.wgs.merged.king.unrelated.plink_qc.prune.in
INFO: Workflow qc (ID=w0fe7a998e4af2e47) is executed successfully with 2 completed steps.


#### QC on related samples

In [None]:
# QC on related samples: basic QC only, no LD pruning. Instead, --keep-variants restricts the
# related samples to the variants retained (prune.in) for the unrelated samples.
# Output: *.related.plink_qc.extracted.bed
sos run pipeline/GWAS_QC.ipynb qc_no_prune \
   --cwd output/genotype \
   --genoFile output/genotype/kinship/wgs.merged.plink_qc.wgs.merged.king.related.bed \
   --maf-filter 0 \
   --geno-filter 0 \
   --mind-filter 0.1 \
   --hwe-filter 0 \
   --keep-variants output/genotype/wgs.merged.plink_qc.wgs.merged.king.unrelated.plink_qc.prune.in

INFO: Running qc_no_prune: Filter SNPs and select individuals
INFO: qc_no_prune is completed.
INFO: qc_no_prune output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/wgs.merged.plink_qc.wgs.merged.king.related.plink_qc.extracted.bed
INFO: Workflow qc_no_prune (ID=w1f09f0c82a7cfe40) is executed successfully with 1 completed step.


#### PCA on unrelated samples
Note any outliers in the PC1 vs. PC2 plot.

About `[flashpca]`:   
1. `[flashpca_1]`:     
-- `aim`: To perform Principal Component Analysis (PCA) on genotype data using the flashpcaR library. PCA is a statistical method used to emphasize variation and bring out strong patterns in a dataset. In the context of genomics, PCA is often used to identify and correct for population stratification in genome-wide association studies.   
-- `Input`:    
genoFile: A binary PLINK file containing genotype data after qc.    
Various parameters for PCA and data filtering, such as min_pop_size, stand, and others.   
-- `Output`:    
.pca.rds: An RDS file containing the PCA results, including the PCA model, scores, and metadata.    
.txt: A text file containing the PCA scores for each individual.   

2. `[flashpca_2, project_samples_2]`:   Outlier Detection   
-- `aim`: To detect outliers based on Mahalanobis distance, which measures the distance of a point from a distribution.     
-- `Input`:  pca result     
-- `Output`:         
distance: An RDS file containing Mahalanobis distances for each sample.    
identified_outliers: A file listing the identified outliers.    
analysis_summary: A markdown file summarizing the analysis.   
qqplot_mahalanobis: A QQ plot visualizing the Mahalanobis distances.    
hist_mahalanobis: A histogram of the Mahalanobis distances.    

3. `[flashpca_3, project_samples_3]`: PCA Visualization    
-- `aim`: To visualize the PCA results, highlighting any identified outliers.   
-- `Input`:  
PCA results from the previous step.    
List of identified outliers.   
-- `Output`:    
PCA plot (`*.pc.png`): A scatter plot of 2 adjacent principal components, with outliers highlighted.    
Scree plot (`*.scree.png`): A plot showing the variance explained by each principal component.    

In [17]:
# Use only the unrelated, QC'ed and LD-pruned genotype file as PCA input, so that family structure
# does not interfere with the inference of population structure.
# This ensures the PCs reflect real population stratification rather than family relationships.
sos run pipeline/PCA.ipynb flashpca \
   --cwd output/genotype/genotype_pca \
   --genoFile output/genotype/wgs.merged.plink_qc.wgs.merged.king.unrelated.plink_qc.prune.bed
   

INFO: Running flashpca_1: Run PCA analysis using flashpca
INFO: flashpca_1 is completed.
INFO: flashpca_1 output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/genotype_pca/wgs.merged.plink_qc.wgs.merged.king.unrelated.plink_qc.prune.pca.rds
INFO: Running flashpca_2: 
INFO: flashpca_2 is completed (pending nested workflow).
INFO: Running detect_outliers: Calculate Mahalanobis distance per population and report outliers
/bin/bash: /home/al4225/miniconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
INFO: detect_outliers is completed.
INFO: detect_outliers output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/genotype_pca/wgs.merged.plink_qc.wgs.merged.king.unrelated.plink_qc.prune.pca.mahalanobis.rds /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/genotype_pca/wgs.merged.pl

#### Project PCA results back to related samples

This workflow projects the related samples onto the PCA model computed from the unrelated samples. PCA is typically performed on unrelated samples to avoid the confounding effects of relatedness. Once the primary PCA model is established with the unrelated samples, the related samples are projected onto it to obtain their principal component scores. This places the related samples in the same "space" as the unrelated samples, making the results comparable and interpretable.



In [None]:
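# Take the last PC (column 1) whose column 3 value, presumably the cumulative proportion
# of variance explained, is still below 0.8. The same expression supplies --maha-k in the next cell.
awk '$3 < 0.8' output/genotype/genotype_pca/wgs.merged.plink_qc.plink_qc.prune.pca.scree.txt | tail -1 | cut -f 1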

15
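
The one-liner above keeps the index of the last PC whose third-column value is below 0.8. A minimal sketch on a toy scree table, assuming tab-separated columns of PC index, proportion of variance explained, and cumulative proportion. The column layout and the numbers are assumptions for illustration, not taken from the actual `scree.txt`:

```shell
# Toy scree table (hypothetical values): PC index, variance explained, cumulative.
scree=$(mktemp)
printf '1\t0.40\t0.40\n2\t0.25\t0.65\n3\t0.10\t0.75\n4\t0.08\t0.83\n5\t0.05\t0.88\n' > "$scree"

# Keep rows with cumulative variance below 0.8, take the last one, print its PC index.
n_pcs=$(awk '$3 < 0.8' "$scree" | tail -n 1 | cut -f 1)
echo "$n_pcs"
rm "$scree"
```

Because the filter is strict, the selected PCs together explain just under 80% of the variance in this toy table. If no row passes the filter, the result is empty, so it is worth checking the value before passing it on.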


In [None]:
sos run pipeline/PCA.ipynb project_samples \
        --cwd output/genotype/genotype_pca \
        --genoFile output/genotype/wgs.merged.plink_qc.wgs.merged.king.related.plink_qc.extracted.bed \
        --pca-model output/genotype/genotype_pca/wgs.merged.plink_qc.wgs.merged.king.unrelated.plink_qc.prune.pca.rds \
        --maha-k `awk '$3 < 0.8' output/genotype/genotype_pca/wgs.merged.plink_qc.plink_qc.prune.pca.scree.txt | tail -1 | cut -f 1`

INFO: Running project_samples_1: Project back to PCA model additional samples
INFO: project_samples_1 is completed.
INFO: project_samples_1 output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/genotype_pca/wgs.merged.plink_qc.wgs.merged.king.related.plink_qc.extracted.pca.projected.rds
INFO: Running project_samples_2: 
INFO: project_samples_2 is completed (pending nested workflow).
INFO: Running detect_outliers: Calculate Mahalanobis distance per population and report outliers
INFO: detect_outliers (index=0) is ignored due to saved signature
INFO: detect_outliers output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/genotype_pca/wgs.merged.plink_qc.wgs.merged.king.related.plink_qc.extracted.pca.projected.mahalanobis.rds /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/genotype/genotype_pca/wgs.merged.plink_qc.wg

The final PCA output, used later in covariate processing, is `output/genotype/genotype_pca/wgs.merged.plink_qc.wgs.merged.king.related.plink_qc.extracted.pca.projected.rds`.