Genome-Phenome Wide Association Study
Zhikai Liang, Yumou Qiu and James C. Schnable.
Liang, Z., Qiu, Y. and Schnable, JC (2020). Genome-phenome wide association in maize and Arabidopsis identifies a common molecular and evolutionary signature. Molecular Plant
Installation and Package loading
# R version >= 3.4.4 is required.
# Before using the package for the first time, make sure the MASS and leaps packages are installed in your R environment. If not, install them with the commands below.
> install.packages("leaps")
> install.packages("MASS")
# Install the "devtools" package in your R environment, then install GPWAS from GitHub.
> install.packages("devtools")
> devtools::install_github("shanwai1234/GPWAS")
> library(GPWAS)
All input data must be organized in the following formats.
Gene: Keep the header of this item as "Gene" and do not change it. List all gene names below it.
SNP: Keep the header of this item as "SNP" and do not change it. Each SNP name should be written as "S"+chromosome+"_"+SNP position (for example, a SNP at position 100 on chromosome 1 is "S1_100").
Sample: Individual sample name.
Note: All items should be separated by tabs. Make sure there is no missing data in your genotype file. The SNPs within a gene should not be completely identical to each other. We recommend filtering SNPs by MAF (Minor Allele Frequency).
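The checks in the note above can be sketched in R. This is only an illustration, not part of GPWAS: `make_snp_id`, `maf`, and `check_genotype` are hypothetical helpers, the table is assumed to hold one SNP per row with one numeric column per sample, and genotypes are assumed to be coded as 0/1/2 allele counts. The 0.05 MAF cutoff is an arbitrary example value.

```r
# Build SNP identifiers in the "S<chromosome>_<position>" form.
# format() avoids scientific notation for large positions (e.g. 1e+05).
make_snp_id <- function(chrom, pos) {
  paste0("S", chrom, "_", format(pos, scientific = FALSE, trim = TRUE))
}

# Minor allele frequency of one SNP, ASSUMING 0/1/2 allele-count coding.
maf <- function(x) {
  p <- mean(x, na.rm = TRUE) / 2
  min(p, 1 - p)
}

# Hypothetical helper: report problems in a genotype table with columns
# "Gene", "SNP", and one numeric column per sample.
check_genotype <- function(geno, min_maf = 0.05) {
  calls <- as.matrix(geno[, setdiff(names(geno), c("Gene", "SNP")), drop = FALSE])
  list(
    has_missing = anyNA(calls),
    low_maf = geno$SNP[apply(calls, 1, maf) < min_maf],
    # SNPs whose calls repeat an earlier SNP of the same gene
    duplicated_within_gene = geno$SNP[duplicated(cbind(geno$Gene, calls))]
  )
}

# Toy data (made up for illustration only)
geno <- data.frame(
  Gene = c("Gene1", "Gene1", "Gene1", "Gene2"),
  SNP  = make_snp_id(c(1, 1, 1, 2), c(100, 250, 300, 40)),
  ind1 = c(0, 0, 2, 0),
  ind2 = c(2, 2, 0, 0),
  ind3 = c(0, 0, 2, 0),
  ind4 = c(2, 2, 0, 0),
  stringsAsFactors = FALSE
)
report <- check_genotype(geno)
```

In the toy data, the second SNP of Gene1 repeats the first and the Gene2 SNP is monomorphic, so both are flagged for removal before running GPWAS.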
Pheno: Name of the phenotype.
Sample: Individual sample name.
Note: All items should be separated by spaces. Make sure there is no missing data in your phenotype file; any row containing missing data will be removed from the following analysis. The number of analyzed phenotypes should preferably not exceed the number of individuals in the studied population. If you have extremely high-dimensional phenotypes, we suggest removing those that are too similar to others.
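One possible way to prepare a phenotype table along these lines is sketched below. `clean_phenotypes` is a hypothetical helper, not a GPWAS function. It assumes the first column holds sample names and the remaining columns are numeric, non-constant phenotypes, and it uses absolute Pearson correlation with an arbitrary 0.95 cutoff as the measure of "too similar".

```r
# Hypothetical helper: drop rows with missing data, then greedily keep each
# phenotype only if it is not too correlated with a phenotype already kept.
clean_phenotypes <- function(pheno, max_abs_cor = 0.95) {
  pheno <- pheno[stats::complete.cases(pheno), , drop = FALSE]
  traits <- names(pheno)[-1]          # first column is assumed to be the sample name
  keep <- character(0)
  for (trait in traits) {
    r <- if (length(keep) > 0) abs(stats::cor(pheno[[trait]], pheno[keep])) else 0
    if (all(r <= max_abs_cor)) keep <- c(keep, trait)
  }
  if (length(keep) > nrow(pheno)) {
    warning("More phenotypes than individuals remain after filtering.")
  }
  pheno[c(names(pheno)[1], keep)]
}

# Toy data (made up for illustration only)
pheno <- data.frame(
  Sample    = paste0("ind", 1:6),
  height    = c(1, 2, 3, 4, 5, NA),
  height_cm = c(10, 20, 30, 40, 50, 60),
  yield     = c(3, 1, 4, 1, 5, 9)
)
cleaned <- clean_phenotypes(pheno)
```

Here the sample with a missing height is removed, and `height_cm` is dropped because it is perfectly correlated with `height`.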
Version 1.0.1 of the GPWAS package controls for population structure using PC scores generated by PCA. For each gene, population structure is controlled using PC scores calculated from the SNPs on all other chromosomes. Generating the PC scores is not included in the GPWAS package; we suggest calculating them with a function such as prcomp in R, or with TASSEL.
You need to prepare a separate PC covariate file for each chromosome, computed with that chromosome excluded. If you have 10 chromosomes, you need to prepare 10 separate PC covariate files.
Note: All items should be separated by spaces. The order of samples should be identical to the order of samples in both the genotype and phenotype files.
Example: For the file that excludes chromosome 1, use a file name such as "exclude-chr1.txt". Then store all of these files in one folder.
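A minimal sketch of producing these files with `prcomp` follows. `write_pc_files` is a hypothetical helper, and several details are assumptions rather than documented GPWAS requirements: the genotypes are taken as a numeric matrix with samples in rows and SNPs in columns, column names follow the "S"+chromosome+"_"+position convention, and each output file is written with a header row and sample names in the first column. Check the exact layout against the demo files in the population-structure-demo folder before using it.

```r
# Hypothetical helper: for each chromosome, run PCA on the SNPs of all OTHER
# chromosomes and write the leading PC scores to "exclude-chr<chr>.txt".
write_pc_files <- function(geno, out_dir, n_pc = 3) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  # Chromosome label parsed from SNP names such as "S1_100"
  chrom <- sub("^S(.+)_[0-9]+$", "\\1", colnames(geno))
  for (chr in unique(chrom)) {
    rest <- geno[, chrom != chr, drop = FALSE]
    pcs <- stats::prcomp(rest, center = TRUE, scale. = FALSE)$x
    pcs <- pcs[, seq_len(min(n_pc, ncol(pcs))), drop = FALSE]
    # Space-separated output; rows stay in the sample order of `geno`
    utils::write.table(pcs, file.path(out_dir, paste0("exclude-chr", chr, ".txt")),
                       sep = " ", quote = FALSE, row.names = TRUE, col.names = TRUE)
  }
  invisible(list.files(out_dir))
}

# Toy data (random, for illustration only): 8 samples, 5 SNPs on each of 2 chromosomes
set.seed(1)
geno <- matrix(sample(0:2, 8 * 10, replace = TRUE), nrow = 8,
               dimnames = list(paste0("ind", 1:8),
                               c(paste0("S1_", 1:5 * 100), paste0("S2_", 1:5 * 100))))
pc_dir <- file.path(tempdir(), "population-structure")
write_pc_files(geno, pc_dir, n_pc = 3)
```

Because the rows of each PC file follow the row order of the genotype matrix, the sample order stays consistent as long as the genotype and phenotype files use that same order.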
How to use
> gpwas(ingeno, inpheno, inpc, gp, gv, R = num)
ingeno: Input genotype file name/directory. It is recommended to split a big genotype file into multiple files in order to reduce memory load.
inpheno: Input phenotype file name/directory.
inpc: Input folder containing the PCA-derived population structure covariate files. For n chromosomes, it should contain n separate files, because PCA is performed once with the SNPs on each chromosome excluded.
gp: Output file name/directory for the phenotypes selected for each gene, together with the p value of each selected phenotype (both gene names and values are just examples; PC[,1] shows the p value for the first principal component, and likewise for the other included principal components; Pheno1 and Pheno2 are the phenotypes incorporated for Gene1).
gv: Output file name/directory for the p value of each gene at model termination (both gene names and values are just examples).
R: Number of iterations for scanning all input phenotypes against one specific gene. Too large a number is redundant and computationally costly. A value between 10 and 50 is suggested.
Note: Once you repeatedly see "No new add-in" and "No leave-out", the model is stable.
# Customizing more
> gpwas(ingeno, inpheno, inpc, g, gp, gv, R = num, pc = 3, selectIn = 0.01, selectOut = 0.01)
g: A list of specific genes to analyze. By default, the model runs on all genes detected in the input genotype file.
Example as below:
pc: Number of PCs to include to control for population structure.
selectIn: p value threshold that determines whether a phenotype can be selected into the model.
selectOut: p value threshold that determines whether a phenotype can be dropped from the model.
# Run the demo data
# Demo data is stored in the Data/ directory of the GPWAS package
> gpwas(ingeno='GPWAS-demo.geno', inpheno='GPWAS-demo.pheno', pc=3, inpc = 'population-structure-demo', gp='output-geno-phenotypes.txt', gv='output-geno-pvalue.txt', R=5)
GPWAS genes selection
After running the GPWAS model on the phenotype and genotype data collected for a given population, you will obtain two files, gp and gv (named according to the file names you provided), which store the selected phenotypes per gene and the p value per gene. GPWAS genes can be selected based on the significance level of each gene. If you want to set a threshold for selecting GPWAS genes, we recommend shuffling each phenotype across all genotypes in your original phenotype data matrix N times, re-running GPWAS on each shuffled phenotype matrix, and merging the per-gene p values from the N gv outputs. This gives two lists of per-gene p values: the real-data gv and the gv from the N permutations. For a given p-value threshold P, count the R genes in the real-data gv that pass P and the M genes in the permutation data that pass P (here R is a gene count, not the R argument of gpwas). The FDR can then be determined using
FDR = (M/N)/R
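The permutation procedure and the formula above can be sketched as follows. Both functions are hypothetical helpers, not part of GPWAS. The sketch assumes that each phenotype column is shuffled independently across samples, that the first column of the phenotype table holds sample names, that the per-gene p values have already been read from the gv files into numeric vectors, and that a gene "passes" P when its p value is at most P.

```r
# Shuffle every phenotype column across samples; the sample-name column
# (assumed to be the first column) stays in place.
shuffle_phenotypes <- function(pheno) {
  pheno[-1] <- lapply(pheno[-1], sample)
  pheno
}

# FDR = (M / N) / R, where
#   R = number of genes passing the threshold in the real data,
#   M = number of genes passing it summed over all permutations,
#   N = number of permutations.
permutation_fdr <- function(real_p, perm_p, threshold) {
  n_perm <- length(perm_p)
  r_real <- sum(real_p <= threshold)
  m_perm <- sum(vapply(perm_p, function(p) sum(p <= threshold), integer(1)))
  if (r_real == 0) return(NA_real_)   # undefined when no real gene passes
  (m_perm / n_perm) / r_real
}

# Toy p values (made up for illustration only)
real_p <- c(1e-8, 1e-6, 0.02, 0.4, 1e-7)
perm_p <- list(c(0.3, 1e-6, 0.5, 0.8, 0.2),
               c(0.9, 0.1, 0.6, 0.7, 0.05))
fdr <- permutation_fdr(real_p, perm_p, threshold = 1e-5)
```

With these toy values, three real genes and one permuted gene pass the threshold across two permutations, so the estimate is (1/2)/3. Each shuffled table returned by `shuffle_phenotypes` would be written to its own phenotype file and passed to gpwas as a separate run.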
Speed up the computation of GPWAS
Depending on the size of your data matrix, we recommend splitting your genotype matrix into multiple subsets if you have too many phenotypes, too many SNPs per gene, and/or too many individuals in the population. Submitting the resulting jobs in parallel to a computing cluster can then shorten the computing time substantially.
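One way to split a genotype table is by gene, so that all SNPs of a gene stay in the same subset. The sketch below is an illustration under assumptions: `split_genotype_by_gene` is a hypothetical helper, the output file names are arbitrary, and the genotype table is assumed to be a data frame with a "Gene" column and one SNP per row.

```r
# Hypothetical helper: write the genotype table as `n_chunks` tab-separated
# files, keeping every gene's SNPs together in one file.
split_genotype_by_gene <- function(geno, n_chunks, out_dir, prefix = "geno-part") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  genes <- unique(geno$Gene)
  # Assign genes to chunks of roughly equal gene count
  chunk_id <- ceiling(seq_along(genes) * n_chunks / length(genes))
  files <- character(0)
  for (k in unique(chunk_id)) {
    part <- geno[geno$Gene %in% genes[chunk_id == k], , drop = FALSE]
    path <- file.path(out_dir, sprintf("%s-%d.geno", prefix, as.integer(k)))
    utils::write.table(part, path, sep = "\t", quote = FALSE, row.names = FALSE)
    files <- c(files, path)
  }
  files
}

# Toy data (made up for illustration only): 5 genes with 2 SNPs each
geno <- data.frame(
  Gene = rep(paste0("Gene", 1:5), each = 2),
  SNP  = paste0("S1_", seq(100, 1000, by = 100)),
  ind1 = rep(c(0, 2), 5),
  ind2 = rep(c(2, 0), 5),
  stringsAsFactors = FALSE
)
parts <- split_genotype_by_gene(geno, n_chunks = 2,
                                out_dir = file.path(tempdir(), "geno-parts"))
```

Each resulting file can then be passed as ingeno to its own gpwas job, with separate gp and gv output names per job so the results can be merged afterwards.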