tbrunetti edited this page Jan 9, 2018 · 3 revisions

Welcome to the GP3 wiki!

The links below may be useful for quickly starting an already installed pipeline, troubleshooting pipeline problems, and understanding the pipeline setup.

Quick Start Guide

Pipeline Setup Logic

Troubleshooting

Pipeline Arguments

| Argument | Usage | Type | Default | Explanation |
| --- | --- | --- | --- | --- |
| -inputPLINK | required | str ending in .bed/.ped | NA | Full path to the PLINK file, ending in .bed or .ped; whitespace characters are not allowed |
| -phenoFile | required | NA | NA | Full path to the populated phenotype file: sample_sheet_template.xlsx |
| --config | required only if the .json is not in the chunky .home directory | str | searches the chunky .home directory | Configuration file produced after running chunky config run_GWAS_analysis_pipeline.py |
| --outDir | optional | str | current working directory | Full path to an already existing directory or location where you would like GP3 to build the project |
| --projectName | optional | str | year-month-date-hour-min-sec | Name of the project to be created in the outDir location; whitespace characters are not allowed |
| --startStep | optional | str | hwe | Point of the pipeline where you would like to start analysis |
| --endStep | optional | str | PCA_indi_graph (no TGP) or PCA_TGP_graph (if --TGP used) | options: if --TGP is not set -> [hwe, LD, maf, het, ibd, PCA_indi, or PCA_indi_graph]; if --TGP is set -> [hwe, LD, maf, het, ibd, outlier_removal, PCA_TGP, or PCA_TGP_graph]. Point of the pipeline where you would like to stop analysis. This step is inclusive! |
| --hweThresh | optional | float | 1e-6 | Filters out SNPs whose HWE p-value is smaller than this threshold, since these are likely genotyping errors |
| --LDmethod | optional | str | indep | options: [indep, indep-pairwise, or indep-pairphase]. Method used to calculate linkage disequilibrium. See the PLINK documentation for more information. |
| --VIF | optional | int | 2 | Variance inflation factor, used by the indep LD pruning method only; the indep-pairwise and indep-pairphase methods do not use VIF |
| --rsq | optional | float | 0.50 | any floating point number between 0.0 and 1.0; r-squared threshold for the indep-pairwise or indep-pairphase LD pruning method; the indep method does not use rsq |
| --windowSize | optional | int | 50 | any integer; the window size in kb for LD analysis |
| --stepSize | optional | int | 5 | any integer; variant count by which to shift the window after each iteration |
| --maf | optional | float | 0.05 | any floating point number between 0.0 and 1.0; filters the remaining LD-pruned variants by MAF; any variant with a MAF below the threshold is filtered out |
| --hetMethod | optional | str | meanStd | options: minMax or meanStd; method used to filter on heterozygosity. The minMax method filters using --hetThresh as the maximum F inbreeding coefficient and --hetThreshMin as the minimum F inbreeding coefficient, which by default are 0.10 and -0.10, respectively. The meanStd method calculates a het_score, 1-[observed(HOM)/total], and then filters out any samples that are more than 3 standard deviations from the mean het_score. The number of standard deviations from the mean can be changed using the --het_std parameter |
| --het_std | optional | int or float | 3 | any floating point number or integer; when using --hetMethod meanStd, sets how many standard deviations away from the mean het_score is allowable for heterozygosity. A value of 3 is interpreted as +/-3 standard deviations from the mean of the het_score, calculated as 1-[observed(HOM)/total] |
| --hetThresh | optional | float | 0.10 | any floating point number; filters out samples whose inbreeding coefficient is greater than this threshold (heterozygosity filtering); only used when the minMax method is selected for --hetMethod |
| --hetThreshMin | optional | float | -0.10 | any floating point number; filters out samples whose inbreeding coefficient is smaller than this minimum threshold (heterozygosity filtering); only used when the minMax method is selected for --hetMethod |
| --sampleMiss | optional | float | 0.03 | any floating point number between 0.0 and 1.0; maximum genotype-call missingness allowed in a sample before it is filtered out, where 0 is no missing calls and 1 is all calls missing (0.03 is interpreted as 3 percent of SNP calls missing in a sample) |
| --snpMiss | optional | float | 0.03 | any floating point number between 0.0 and 1.0; maximum genotype-call missingness allowed in a SNP cluster before the SNP is filtered out, where 0 is no missing calls and 1 is all calls missing (0.03 is interpreted as 3 percent of sample calls missing in a SNP) |
| --TGP | optional | flag | NA | Generates PCA plots with TGP data merged into the given cohort data set for the 5 superpopulations in TGP (AFR, AMR, EAS, EUR, SAS) |
| --centerPop | optional | str | myGroup | options: the literal string myGroup or any TGP group merged into the input dataset. When using the --TGP flag, specifies which population cohort the PCs should be centered around for boxplots. By default this is set to your group(s) listed in the sample sheet. You can pick a TGP superpopulation listed in the TGP_Sub_and_SuperPopulation_info.txt file. CASE SENSITIVE! |
| --outliers | optional | str | None | A tab-delimited txt file of FID and IID, one sample per line, listing outliers that should be removed from the sample set (PCA outlier removal). Use the original names (original FID and IID), not the names renamed 1-n for GENESIS formatting |
| --pcmat | optional | int | 5 | any integer; number of predicted admixture populations in the dataset, used in the GENESIS calculation for PCA |
| --reanalyze | optional | flag | NA | Indicates that the dataset being passed through the pipeline has already been partially or fully analyzed by this pipeline. WARNING! May overwrite existing data! Required if using --startStep, or if using --endStep on an already existing project |
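
The two heterozygosity filters described for --hetMethod can be sketched in a few lines of Python. This is an illustrative sketch only, not GP3's actual code: the function names and input layout are hypothetical, and the use of the sample standard deviation is an assumption, since the description above does not say which estimator the pipeline uses.

```python
import statistics


def het_score(observed_hom, total):
    """het_score = 1 - observed(HOM) / total, as described for meanStd."""
    return 1.0 - observed_hom / total


def mean_std_outliers(samples, n_std=3.0):
    """Return sorted IDs of samples failing the meanStd-style filter.

    samples: dict mapping sample ID -> (observed_hom, total).
    n_std plays the role of --het_std (default 3).
    Assumption: the sample standard deviation is used.
    """
    scores = {sid: het_score(hom, tot) for sid, (hom, tot) in samples.items()}
    values = list(scores.values())
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return sorted(sid for sid, s in scores.items() if abs(s - mean) > n_std * std)


def min_max_outliers(f_coeffs, f_min=-0.10, f_max=0.10):
    """Return sorted IDs of samples failing the minMax-style filter.

    f_coeffs: dict mapping sample ID -> F inbreeding coefficient.
    f_max and f_min play the roles of --hetThresh and --hetThreshMin.
    """
    return sorted(sid for sid, f in f_coeffs.items() if f > f_max or f < f_min)
```

With the meanStd sketch, raising `n_std` loosens the filter, and with the minMax sketch a sample is removed only if its F coefficient falls outside the closed interval from `f_min` to `f_max`.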