Skip to content
Genome-Phenome Wide Association Study
R
Branch: master
Clone or download

Latest commit

Fetching latest commit…
Cannot retrieve the latest commit at this time.

Files

Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
R
data
man
DESCRIPTION
GPWAS-logo2.png
LICENSE
NAMESPACE
README.md

README.md

GPWAS License: GPL v3

Genome-Phenome Wide Association Study

Authors:

Zhikai Liang, Yumou Qiu and James C. Schnable.

Contact:

[zliang@huskers.unl.edu](Zhikai Liang)

Citation:

Liang, Z., Qiu, Y. and Schnable, JC (2020). Genome-phenome wide association in maize and Arabidopsis identifies a common molecular and evolutionary signature. Molecular Plant


Installation and Package loading

# R version is required >= 3.4.4
# When the first time to use the package, please make sure MASS and leaps packages are installed under your R environment, if not, please use commands below to install
> install.packages("leaps")
> install.packages("MASS")
# install "devtools" package in your R environment
> devtools::install_github("shanwai1234/GPWAS")
> library(GPWAS)

Data Preparation

All of input data are required to be organized in following format.

Genotype Data

Gene: Keep the title of this item as "Gene" and do not change it. All of gene names should be kept below it.

SNP: Keep the title of this item as "SNP" and do not change it. The format of SNP should be written as "S"+chromosome+"_"+SNP position.

Sample: Individual sample name.

Note: All items should be split by \tab. Making sure there is no missing data in your genotype file. For each SNP per gene, they should not be completely identical. We do recommend you to filter SNP based upon MAF (Minor Allele Frequency).

Gene SNP Sample1 Sample2 Sample3
Zm00001d000501 S3_27396852 0 1 0
Zm00001d000501 S3_27397388 0 2 1
Zm00001d000632 S3_156234383 0 1 1
Zm00001d000632 S3_156234434 1 2 1
Zm00001d000654 S3_178548901 2 1 1

Phenotype Data

Pheno: Name of phenotype name.

Sample: Individual sample name.

Note: All items should be split by space. Make sure there is no missing data in your phenotype file or any row containing missing data will be removed for following analysis. The number of analyzed phenotype is better not exceed the number of individuals in the studied population. If you have extremely high-dimensional phenotypes, it is suggested to reduce ones that are too similar with others.

Taxa Pheno1 Pheno2 Pheno3
Sample1 23 50 55
Sample2 10 12 8
Sample3 120 150 133

PC covariates

Version 1.0.1 GPWAS package controls population structure using PC scores generated by PCA. For each individual gene, population structure is controlled by PC scores calculated by the rest of other chromosomes. But generating PC scores were not included in GPWAS package. You are suggested to calculate it using function like prcomp or TASSEL

"PC1" "PC2" "PC3"
"Sample1" -11 8 31
"Sample2" -36 35 138
"Sample3" 15 10 -7

You need to prepare separate PC covariate files excluding each individual chromosome. If you have 10 chromosomes, you need to prepare 10 separate PC covariate files.

Note: All items should be split by space. The order of samples should be identical to the order of samples in both genotype and phenotype file.

Example: When you want to exclude chromosome 1, you need to make the file name such as "exclude-chr1.txt". Then store all of these files to a folder.

How to use

> gpwas(ingeno, inpheno, inpc, gp, gv, R = num)

ingeno: Input genotype file name/directory. It is recommended to split big genotype file into multiple in order to reduce memory load.

inpheno: Input phenotype file name/directory.

inpc: Input folder with PCA parsed population structure covariance. If n number of chromosomes, n number of separate files should be included, as SNPs on each chromosome is excluded for performing PCA once.

gp: Output file name/directory for selected phenotypes with every gene as well as p value of each selected phenotypes (both gene names and values are just examples, PC[,1] shows the p value for the first principal component, and same for other selected principal components, Pheno1 and Pheno2 are incorporated phenotypes for this Gene1).

"Gene" "Predictor" "Predictor p-value"
"Gene1" PC[,1] 1.0e-4
"Gene1" PC[,2] 1.2e-2
"Gene1" PC[,3] 4.3e-2
"Gene1" Pheno1 3.7e-4
"Gene1" Pheno2 1.4e-3

gv: Output file name/directory of terminated p value for each gene (both gene names and values are just examples).

"Gene" "GPWAS p-value"
"Gene1" 2.2e-6
"Gene2" 1.3e-8
"Gene3" 4.5e-10

R: Number of iteration for scanning all of input phenotypes with one specific gene. Too big number will be redundancy and computationally cost. Suggested ranging from 10-50.

Note: Once you continueously seeing "No new add-in" and "No leave-out" means the model is stable.

# Customizing more
> gpwas(ingeno, inpheno, inpc, g, gp, gv, R = num, pc = 3, selectIn = 0.01, selectOut = 0.01)

g: A list of specific gene that needs to analyze. By default the model will run for all of genes detected in the input genotype file.

Example as below:

Candidate
Gene1
Gene2
Gene3

pc: Number of PCs that needs to be included to control the population structure.

selectIn: p value threshold to determine if a phenotype could be selected in the model.

selectOut: p value threshold to determine if a phenotype could be dropped out from the model.

# Run the demo data
# Demo data was stored in Data/ directory of GPWAS package
> gpwas(ingeno='GPWAS-demo.geno', inpheno='GPWAS-demo.pheno', pc=3, inpc = 'population-structure-demo', gp='output-geno-phenotypes.txt', gv='output-geno-pvalue.txt', R=5)

GPWAS genes selection

After running GPWAS model for collected phenotype and genotype data in a given population, you will obtain two files gp and gv (depends on your provided file names) with stored selected phenotypes per gene and p value per gene. GPWAS genes could be selected upon significant level per gene. However, if you are willing to set a threshold for GPWAS genes selection, we recommend you to shuffle each phenotype across all genotypes in your original phenotype data matrix for N times, then re-run GPWAS on each shuffled phenotype matrix and merge p-value per gene in your N gv outputs. Finally, you will get two list of p-value per gene: real data gv and N permutation data gv. When setting a p-value threshold P, you will know R genes in real data gv above P and M genes in permutation data above P. FDR could be determined using

FDR = (M/N)/R

Speed up the computation of GPWAS

Depend on the size of actual data matrix you have, we recommend you to split your genotype matrix into multiple subsets if you have too many phenotypes or/and too dense SNP per gene or/and too many individuals in the given population. Then submitting jobs in parallel to a computing cluster would shorten computing time efficiently.

You can’t perform that action at this time.