Running Pipeline with Defaults

If you run the minimum number of requirements like so:

chunky run run_GWAS_analysis_pipeline.py -inputPLINK <myPlink.bed/.ped> -phenoFile <sample_sheet_template.xlsx>

This will run through all the default parameters of the pipeline from HWE through graphing PCAs (noTGP). By running the above command the following flow will occur:

Output directory is current working directory and the project name is set to date and time stamp [--outDir, --projectName]
Split out all ethnicies separately (the subsequent steps are recalculated for each individual ethnic group listed in the sample_sheet_template.xlsx)
FOR EACH ETHNIC GROUP CALCULATE AND FILTER:
remove snps below 97% call rate [--snpMiss]
remove snps with HWE less than 1e-6 [--hweThresh]
remove samples below 97% call rate [--sampleMiss]
LD prune snps using method indep with VIF of 2 and window size of 50, step size of 5 [--LDmethod, --VIF, --rsq, --windowSize, --stepSize]
remove snps with MAF <= 0.05 [--maf]
remove samples with method meanStd with excess heterozygosity of more than 3 standard deviations from the mean for each population [--hetMethod, --hetTresh, --hetThreshMin, --het_std]
perform IBD and remove true duplicate samples
Remove outliers specified by user (if available) [--outliers]
PCA without TGP for each group using GENESIS (includes running KING) and using admixture population number estimation of 5 [--pcmat]
graph each group with PCA screen plots and boxplots with mean and standard deviations centered around each input cohort [--centerPop]
generate a list of samples to remove

Running Pipeline with TGP with Defaults

You can run the default steps above with TGP by specifying the --TGP flag at the command line prompt:

chunky run run_GWAS_analysis_pipeline.py -inputPLINK <myPlink.bed/.ped> -phenoFile <sample_sheet_template.xlsx> --TGP

This will run through all the default parameters of the pipeline from HWE through graphing PCAs (only with TGP). By running the above command the following flow will occur:

Output directory is current working directory and the project name is set to data and time stamp [--outDir, --projectName]
Split out all ethnicies separately (the subsequent steps are recalculated for each individual ethnic group listed in the sample_sheet_template.xlsx)
FOR EACH ETHNIC GROUP CALCULATE AND FILTER:
remove snps below 97% call rate [--snpMiss]
remove snps with HWE less than 1e-6 [--hweThresh]
remove samples below 97% call rate [--sampleMiss]
LD prune snps using method indep with VIF of 2 and window size of 50, step size of 5 [--LDmethod, --VIF, --rsq, --windowSize, --stepSize]
remove snps with MAF <= 0.05 [--maf]
remove samples with method meanStd with excess heterozygosity of more than 3 standard deviations from the mean for each population [--hetMethod, --hetTresh, --hetThreshMin, --het_std]
perform IBD and remove true duplicate samples
Remove outliers specified by user (if available) [--outliers]
Merge TGP samples with input samples and only take common snps between groups [--TGP]
PCA including TGP for each group using GENESIS (includes running KING) and using admixture population number estimation of 5 [--pcmat]
graph each group with PCA screen plots and boxplots with mean and standard deviations centered around each input cohort [--centerPop]
generate a list of samples to remove

Pipeline Arguments

Argument	Usage	Type	Default	Explanation
-inputPLINK	required	str ending in .bed/.ped	NA	Full path to PLINK file ends in .bed or .ped, whitespace characters are not allowed
-phenoFile	required	NA	NA	Full path to populated phenotype file: sample_sheet_template.xlsx
--config	only required if .json not in .home of chunky	str	search in chunky .home directory	configuration file produced after running `chunky config run_GWAS_analysis_pipeline.py`
--outDir	optional	str	current working directory	Full path to an already existing directory or location where you would like GP3 to build the project
--projectName	optional	str	year-month-date-hour-min-sec	Name of project to be created in the outDir location whitespace characters are not allowed
--startStep	optional	str	hwe	options: if --TGP not specified -> [hwe, LD, maf, het, ibd, outlier_removal, PCA_indi, or PCA_indi_graph] if --TGP is set ->[hwe, LD, maf, het, ibd, outlier_removal, PCA_TGP, or PCA_TGP_graph]] The part of the pipeline you would like to start with, SHOULD BE USED ONLY ON REANALYZED DATA ON EXISTING PROJECT CREATED WITH GP3 WITH --reanalyze flag
--endStep	optional	str	PCA_indi_graph (noTGP) or PCA_TGP_graph (if --TGP used)	options: if --TGP not set -> [hwe, LD, maf, het, ibd, outlier_removal, PCA_indi, or PCA_indi_graph] if --TGP is set -> [hwe, LD, maf, het, ibd, outlier_removal, PCA_TGP, or PCA_TGP_graph]] Point of the pipeline where you would like to stop analysis. This step is inclusive! SHOULD BE USED ONLY ON REANALYZED DATA ON EXISTING PROJECT CREATED GP3 WITH --reanalyze flag
--hweThresh	optional	float	1e-6	Filters out SNPs that are smaller than this threshold due to liklihood of genotyping error
--LDmethod	optional	str	indep	options:[indep, indep-pairwise or indep-pairphase] Method to calculate linkage disequilibrium. See PLINK documentation for more information.
--VIF	optional	int	2	variant inflation factor for indep method LD pruning method only; indep-pairwise or indep-pairphase method will not use VIF
--rsq	optional	float	0.50	any floating point number between 0.0-1.0; r-squared threshold for indep-pairwise or indep-pairphase LD pruning method; indep method will not use rsq
--windowSize	optional	int	50	any integer; the window size in kb for LD analysis
--stepSize	optional	int	5	any integer; variant count to shift window after each iteration
--maf	optional	float	0.05	any floating point number between 0.0-1.0; filter remaining LD pruned variants by MAF, any MAF below set threshold is filtered out
--hetMethod	optional	str	meanStd	options: minMax or meanStd; method to use to determine heterozygosity. minMax filter based on the parameters --hetThresh as the max F-inbreeding coefficient and --hetThreshMin for the minimum F-inbreeding coeffient, which by default are 0.10 and -0.10, respectively. The meanStd filter method calculates a het_score: 1-[observed[HOM]/total] and then filters out any samples that are more than 3 std deviations from the mean het_score. The number of standard deviations from the mean can be changed using the --het_std parameter
--het_std	optional	int or float	3	any floating point number or integer; if using hetMethod=meanStd you can determine how many standard deviations aways from the mean is allowable for heterozygosity. Setting to 3 is interpreted as +/-3 standard deviations away from the mean of the het_score, calculated as 1-[observed(HOM)/total]
--hetThresh	optional	float	0.10	any floating point number; filter out samples where inbreeding coefficient is greater than threshold (heterozygosity filtering); only used when method minMax for --hetMethod is selected
--hetThreshMin	optional	float	-0.10	any floating point number; filter out samples where inbreeding coefficient is samller than min threshold set (heterozygosity filtering); only used when method minMax for hetThresh is selected
--sampleMiss	optional	float	0.03	any floating point number between 0.0-1.0; Maximum missingness of genotype call in sample before it should be filtered out. Where 0 is no missing, and 1 is all missing (0.03 is interpreted as 3 percent of snp calls are missing in a sample)
--snpMiss	optional	float	0.03	any floating point number between 0.0-1.0; Maximum missingness of genotype call in a SNP cluster before the SNP should be filtered out. Where 0 is no missing, and 1 is all missing (0.03 is interpreted as 3 percent of sample calls are missing in a snp)
--TGP	optional	flag	NA	specifying this flag means to generate PCA plots with TGP data merged into the given cohort data set for the 5 superpopulations in TGP (AFR, AMR, EAS, EUR, SAS)
--centerPop	optional	str	myGroup	options: literally the string myGroup or available TGP group merged into input dataset; when using the TGP flag, you have the option to specify which population cohort that PCs should be centered around for boxplots. By default this is set to your group(s) listed in the sample sheet. You can pick a TGP super population listed in the TGP_Sub_and_SuperPopulation_info.txt file. CASE SENSITIVE!
--outliers	optional	str	None	A txt file of FID and IID, tab-delimited and one sample per line, that are outliers that should be removed from the sample set (PCA outlier removal); Use original names (original FID and IID), not renamed 1-n for GENESIS formatting
--pcmat	optional	int	5	any integer; Number of predicted admixture populations in dataset to be used in GENESIS calculation for PCA
--reanalyze	optional	flag	NA	by adding this flag, it means you are going to pass a dataset through the pipeline that has already been partially/fully analyzed by this pipeline. WARNING! May over write exisiting data!! required if using --startStep argument OR if using --endStep arguments on an already existing project

Pipeline Help Flag

For more help you can use the -h or --help flag at the command line prompt and it will give you a detailed list of arguments that can be used:

chunky run run_GWAS_analysis_pipeline.py -h

The brackets in bold following the step numbers listed above, indicate the arguments you can call at the command prompt to change the default values and default methods. The -h or --help will list all the input options for using and changing these parameters.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quick_Start_FAQs.md

Quick_Start_FAQs.md

Running Pipeline with Defaults

Running Pipeline with TGP with Defaults

Pipeline Arguments

Pipeline Help Flag

Files

Quick_Start_FAQs.md

Latest commit

History

Quick_Start_FAQs.md

File metadata and controls

Running Pipeline with Defaults

Running Pipeline with TGP with Defaults

Pipeline Arguments

Pipeline Help Flag