Garrett McKinney 6/23/2020
The raw sequence data used for this example is a subset of the data from: McKinney et al. 2020. Dense SNP panels resolve closely related Chinook salmon populations. Canadian Journal of Fisheries and Aquatic Sciences (77) 451-461. DOI: 10.1139/cjfas-2019-0067
Sequence data can be downloaded from NCBI BioProject PRJNA561127
I have also made three “contaminated” samples by combining sequence reads from the following individuals: KAROL05_0009 and KAROL05_0010, KAROL05_0025 and KAROL05_0026, and KAROL05_0028 and KAROL05_0029. These are used to demononstrate how to identify contaminated samples.
The read counter is written in perl (AmpliconReadCounter.pl).
The program:
-
Identifies each unique sequence, then counts the number of times each unique sequence occurs within an individual
-
Aligns each unique sequence with primer and probe; if the sequence doesn’t align, then it is excluded as an off-target sequence and reports by individual and by locus are given.
- Note: By default, all primers are trimmed to the length of the shortest primer to increase speed. Optionally the full length primer can be used for the primer but this may significantly increase run timing depending on variation in primer lengths across loci.
Input flags for this script are:
- –p a tab delimited file containing primer/probe information for each locus
- –files a text file containing a list of .fastq sequence files to count reads from.
Optional flags:
- –prefix optional prefix for output file names
- –inDir option to specify directory containing sequence data
- –outDir option to specify directory for output files
- –i option to choose input type: fq (fastq) or fastqgz (gzipped fastq). This defaults to fq
- –useFullPrimer option to use the full primer for counting reads rather than the trimmed primer
- –alleleOrder option to specify order of alleles output in locusTable file. Options are original (matches primer-probe file order) or alphabetical (default)
- –printMatched option to output matched reads for each individual
- –printDiscarded option to output discarded reads for each individual
AmpliconReadCounter.pl can be run on the command line or called from R
create a folder with the sampleFile.txt file containing names of all the samples you’d like to genotype
You can run AmpliconReadCounter.pl from the command line using the following example: perl AmpliconReadCounter.pl -p primerProbeFile.txt –files sampleList.txt
If you prefer to run this through R then you can use the following example:
system2("perl",
args="AmpliconReadCounter.pl -p primerProbeFile.txt --files sampleFiles.txt --inDir samples")AmpliconReadCounter.pl outputs a LocusTable file and an AlleleReads file for single-SNP and haplotype data, plus two summary files.
The default names for these files are:
- LocusTable_singleSNPs.txt
- AlleleReads_singleSNPs.txt
- LocusTable_haplotypes.txt
- AlleleReads_haplotypes.txt
- GTscore_individualSummary.txt
- GTscore_locusSummary.txt
#load GTscore
source("GTscore.R")
#load locus table and allele reads file
singleSNP_locusTable<-read.delim("LocusTable_singleSNPs.txt",header=TRUE,stringsAsFactors=FALSE)
singleSNP_alleleReads<-read.delim("AlleleReads_singleSNPs.txt",header=TRUE,row.names=1,stringsAsFactors=FALSE)
head(singleSNP_locusTable)## Locus_ID ploidy alleles correctionFactors
## 1 Ots_100884-287_1 2 C,T 0,0
## 2 Ots_101554-407_1 2 C,G 0,0
## 3 Ots_101704-143_1 2 G,T 0,0
## 4 Ots_102213-210_1 2 A,G 0,0
## 5 Ots_102414-395_1 2 A,G 0,0
## 6 Ots_102457-132_1 2 A,G 0,0
singleSNP_alleleReads[1:5,1:5]## KEEK02_0001 KEEK02_0002 KEEK02_0003 KEEK02_0004 KEEK02_0005
## Ots_100884-287_1 0,84 29,37 0,51 0,37 0,73
## Ots_101554-407_1 0,94 0,81 0,82 24,26 0,107
## Ots_101704-143_1 35,0 54,0 63,0 20,0 4,32
## Ots_102213-210_1 0,15 4,7 0,12 0,9 0,9
## Ots_102414-395_1 16,0 0,10 9,6 3,3 0,16
This step is optional, but if run it does require that read correction factors were included in the primer probe file. If the primer probe file did not inlcude correction factors this step will generate errors.
Some loci may have alleles that amplify unevenly which can result in allele ratios deviating from expectations. This can be due to allele-specific differences in amplification efficiency, amplification of off-target sequence that is counted towards one allele but not the other, or if the locus is a diverged duplicate (tetraploid but inherited disomically, and only one copy has a variant SNP).
One method of dealing with this is to adjust the read counts to account for this uneven amplification and bring the allele reads back to expected ratios. GTscore uses the same read adjustment method as the GT-seq pipeline by Nate Campbell (https://github.com/GTseq). If off-target sequence is causing the bias then I suggest redesigning the bioinformatic probes whenever possible rather than relying on read count adjustment.
singleSNP_alleleReads_adjusted<-correctReads(singleSNP_locusTable,singleSNP_alleleReads)
#write to file
write.table(singleSNP_alleleReads_adjusted,"singleSNP_alleleReads_adjusted.txt",quote=FALSE,sep="\t")Genotyping is accomplished using the polyGen function. The genotyping algorithm is described in McKinney et al. 2018 and is a maximum likelihood algorithm capable of genotyping any number of alleles and ploidy per locus. This allows genoyping of single SNPs as well as microhaplotypes, and loci with elevated ploidy.
Two arguments are required for polyGen, the locusTable and alleleReads files output by AmpliconReadCounter.
Optional arguments for polyGen are:
- p_thresh - threshold p-value for likelihood ratio test (default 0.05)
- epsilon - error rate for genotyping model (default 0.01)
#generate singleSNP genotypes using the polyGen algorithm
polyGenResults_singleSNP<-polyGen(singleSNP_locusTable,singleSNP_alleleReads)
polyGenResults_singleSNP[1:5,1:5]## KEEK02_0001 KEEK02_0002 KEEK02_0003 KEEK02_0004 KEEK02_0005
## Ots_100884-287_1 "T,T" "C,T" "T,T" "T,T" "T,T"
## Ots_101554-407_1 "G,G" "G,G" "G,G" "C,G" "G,G"
## Ots_101704-143_1 "G,G" "G,G" "G,G" "G,G" "T,T"
## Ots_102213-210_1 "G,G" "A,G" "G,G" "G,G" "G,G"
## Ots_102414-395_1 "A,A" "G,G" "A,G" "A,G" "G,G"
load sample summary from AmpliconReadCounter
GTscore_individualSummary<-read.delim("GTscore_individualSummary.txt",header=TRUE,stringsAsFactors=FALSE)
head(GTscore_individualSummary)## Sample Total.Reads Off.target.Reads Primer.Only.Reads Primer.Probe.Reads
## 1 KEEK02_0001 244962 67436 92887 84639
## 2 KEEK02_0002 320926 92684 143384 84858
## 3 KEEK02_0003 297859 89582 142959 65318
## 4 KEEK02_0004 263027 79737 130565 52725
## 5 KEEK02_0005 284616 77930 113175 93511
## 6 KEEK02_0006 291600 87751 137316 66533
## Off.target.Proportion Primer.Only.Proportion Primer.Probe.Proportion
## 1 0.28 0.38 0.35
## 2 0.29 0.45 0.26
## 3 0.30 0.48 0.22
## 4 0.30 0.50 0.20
## 5 0.27 0.40 0.33
## 6 0.30 0.47 0.23
summarize single SNP results for samples
singleSNP_sampleSummary<-summarizeSamples(polyGenResults_singleSNP,singleSNP_alleleReads)
head(singleSNP_sampleSummary)## sample GenotypeRate Heterozygosity conScore
## 1 KAROL05_0009 0.9826938 0.1963883 0.1762452
## 2 KAROL05_0009_KAROL05_0010 0.9947329 0.3754703 0.7374749
## 3 KAROL05_0010 0.9954853 0.2249812 0.1438127
## 4 KAROL05_0025 0.9924755 0.2129421 0.1236749
## 5 KAROL05_0025_KAROL05_0026 0.9939804 0.3739654 0.6881288
## 6 KAROL05_0026 0.9857035 0.1723100 0.1528384
combine AmpliconReadCounter individual summary data with GTscore sample summary
GTscore_individualSummary<-merge(GTscore_individualSummary,singleSNP_sampleSummary,by.x="Sample",by.y="sample")plot histogram of genotype rate
ggplot()+geom_histogram(data=GTscore_individualSummary,aes(x=GenotypeRate),binwidth=0.03)+xlim(-0.01,1.01)+
labs(title="Sample Genotype Rate", x="Genotype Rate", y="Count")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))plot histogram of heterozygosity
ggplot()+geom_histogram(data=GTscore_individualSummary,aes(x=Heterozygosity),binwidth=0.03)+xlim(-0.01,1.01)+
labs(title="Sample Heterozygosity", x="Heterozygosity", y="Count")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))plot genotype rate vs primer probe reads
#dashed line added at 90% genotype rate, this is not a strict threshold, just a goal to aim for
ggplot()+geom_point(data=GTscore_individualSummary,aes(x=Primer.Probe.Reads,y=GenotypeRate))+
labs(title="Genotype Rate vs Total Reads per Sample", x="Primer Probe Reads", y="Genotype Rate")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))+
geom_hline(yintercept=0.9,lty="dashed")samples with unusually high heterozygosity may be contaminated or have elevated ploidy. In this case, the three “contaminated” samples that were generated by combining reads from multiple individuals all show elevated heterozygosity.
plot heterozygosity vs primer probe reads
ggplot()+geom_point(data=GTscore_individualSummary,aes(x=Primer.Probe.Reads,y=Heterozygosity))+
labs(title="Heterozygosity vs Total Reads per Sample", x="Primer Probe Reads", y="Heterozygosity")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))+
geom_hline(yintercept=0.3, lty="dashed")samples with unusually high heterozygosity relative to others are candidates for contamination. The three contaminated samples again are well outside the range of the other samples.
plot heterozygosity vs genotype rate per sample
ggplot()+geom_point(data=GTscore_individualSummary,aes(x=GenotypeRate,y=Heterozygosity))+
labs(title="Heterozygosity vs Genotype Rate per Sample", x="Genotype Rate", y="Heterozygosity")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))A contamination score is included in the sample summary. The contamination score is the proportion of heterozygous genotypes whose allele ratios significantly differ from 1:1 ratios
The three contaminated samples all show elevated contamination scores and heterozygosity relative to the rest of the samples.
plot heterozygosity vs contamination score per sample
ggplot()+geom_point(data=GTscore_individualSummary,aes(x=conScore,y=Heterozygosity))+
labs(title="Heterozygosity vs Contamination Score per Sample", x="Contamination Score", y="Heterozygosity")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))summarize single SNP results for loci
singleSNP_summary<-summarizeGTscore(singleSNP_alleleReads, singleSNP_locusTable, polyGenResults_singleSNP)
head(singleSNP_summary)## Locus_ID AvgReadDepth GenotypeRate minAF majAF alleles
## 1 Ots_100884-287_1 40.470803 0.9927273 0.12454212 0.8754579 C,T
## 2 Ots_101119-381_1 18.021978 0.9745455 0.00000000 1.0000000 C,T
## 3 Ots_101554-407_1 48.861818 1.0000000 0.08909091 0.9109091 C,G
## 4 Ots_101704-143_1 34.617647 0.9890909 0.03308824 0.9669118 G,T
## 5 Ots_101770-82_1 30.385455 1.0000000 0.00000000 1.0000000 G,T
## 6 Ots_102213-210_1 9.360294 0.9345455 0.01556420 0.9844358 A,G
## allFreqs conScore
## 1 0.12,0.88 0.04687500
## 2 1,0 NaN
## 3 0.09,0.91 0.00000000
## 4 0.97,0.03 0.08333333
## 5 1,0 NaN
## 6 0.02,0.98 0.25000000
#write results
write.table(singleSNP_summary,"singleSNP_summary.txt",quote=FALSE,sep="\t",row.names=FALSE)Generate plots for single SNP results
plot genotype rate
ggplot()+geom_histogram(data=singleSNP_summary,aes(x=GenotypeRate),binwidth=0.03)+xlim(-0.01,1.01)+
labs(title="Locus Genotype Rate", x="Genotype Rate", y="Count")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))plot average read depth for single SNP data
ggplot()+geom_histogram(data=singleSNP_summary,aes(x=AvgReadDepth),binwidth=20)+
labs(title="Average Read Depth per SNP", x="Average Read Depth", y="Count")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))plot genotype rate relative to average depth
ggplot()+geom_point(data=singleSNP_summary,aes(x=AvgReadDepth,y=GenotypeRate))+ylim(0,1)+
labs(title="Genotype Rate vs Average Depth per SNP", x="Average Depth", y="Genotype Rate")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))plot distribution of minor allele frequency
ggplot()+geom_histogram(data=singleSNP_summary,aes(x=minAF),binwidth=0.01)+
labs(title="Minor Allele Frequency Single SNP", x="Minor Allele Frequency", y="Count")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))plot distribution of major allele frequency
ggplot()+geom_histogram(data=singleSNP_summary,aes(x=majAF),binwidth=0.01)+
labs(title="Major Allele Frequency Single SNP", x="Major Allele Frequency", y="Count")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))The IDduplicateSamples function does all pairwise comparisons of samples and outputs two metrics:
- proportionCommon - The proportion of loci that had genotypes for both samples in a pair
- proportionMatch - The proportion of loci that have identical genotypes in the sample pair
#convert missing genotypes "0" to NA
polyGenResults_singleSNP_NA<-polyGenResults_singleSNP
polyGenResults_singleSNP_NA[polyGenResults_singleSNP_NA=="0"]<-NA
#compare all samples, for each comparison get proportion of loci that have genotypes
#in both samples (proportionCommon) and proportion of shared loci that have identical
#genotypes (proportionMatch).
#Samples with a high proportionCommon and high proportionMatch are likely duplicates
polyGenResults_dupTest<-IDduplicateSamples(polyGenResults_singleSNP_NA)
write.table(polyGenResults_dupTest,"polyGenResults_dupTest.txt",quote=FALSE,sep="\t",row.names=FALSE)polyGenResults_dupTest<-read.delim("polyGenResults_dupTest.txt",stringsAsFactors=FALSE)
head(polyGenResults_dupTest)## Sample1 Sample2 matchedGenotypes commonGenotypes proportionMatch
## 1 KEEK02_0001 KEEK02_0002 903 1300 0.6946154
## 2 KEEK02_0001 KEEK02_0003 873 1289 0.6772692
## 3 KEEK02_0001 KEEK02_0004 915 1292 0.7082043
## 4 KEEK02_0001 KEEK02_0005 895 1303 0.6868764
## 5 KEEK02_0001 KEEK02_0006 892 1296 0.6882716
## 6 KEEK02_0001 KEEK02_0007 885 1297 0.6823439
## proportionCommon totalLoci
## 1 0.9781791 1329
## 2 0.9699022 1329
## 3 0.9721595 1329
## 4 0.9804364 1329
## 5 0.9751693 1329
## 6 0.9759217 1329
polyGenResults_dupTest
Plot results of IDduplicateSamples
I generally set thresholds of proportionMatch > 0.8 and proportionCommon > 0.75 to identify duplicate sample pairs. The appropriate thresholds will depend on the marker set being used and the relatedness among samples. In this plot, it is possible that individuals at the lower range of identical genotypes (~0.85) are relatives rather than duplicated samples.
#potential duplicates are proportionMatch>0.8 and proportionCommon>0.75
#feel free to adjust these thresholds as needed
matchThresh=0.8
commonThresh=0.75
#plot results
ggplot()+geom_point(data=polyGenResults_dupTest,aes(x=proportionCommon,y=proportionMatch))+
geom_segment(aes(x=commonThresh,xend=1,y=matchThresh,yend=matchThresh),lty="dashed")+
geom_segment(aes(x=commonThresh,xend=commonThresh,y=matchThresh,yend=1),lty="dashed")Identify duplicate sample pairs with thresholds set previously
#filter to potentially duplicated samples using thresholds above
polyGenResults_dupTest %>% filter(proportionMatch>=matchThresh,proportionCommon>=commonThresh)## Sample1 Sample2 matchedGenotypes commonGenotypes
## 1 KAROL05_0010 KAROL05_0049 1306 1310
## 2 KAROL05_0010 KAROL05_0009_KAROL05_0010 1110 1317
## 3 KAROL05_0025 KAROL05_0025_KAROL05_0026 1088 1313
## 4 KAROL05_0028 KAROL05_0028_KAROL05_0029 1099 1324
## 5 KAROL05_0029 KAROL05_0028_KAROL05_0029 1084 1319
## 6 KAROL05_0049 KAROL05_0009_KAROL05_0010 1099 1308
## 7 KGEOR05_0061 KGEOR05_0062 1097 1308
## 8 KGEOR05_0061 KGEOR05_0063 1096 1307
## 9 KGEOR05_0062 KGEOR05_0063 1307 1313
## 10 KGEOR05_0085 KGEOR05_0704 1319 1321
## proportionMatch proportionCommon totalLoci
## 1 0.9969466 0.9857035 1329
## 2 0.8428246 0.9909707 1329
## 3 0.8286367 0.9879609 1329
## 4 0.8300604 0.9962378 1329
## 5 0.8218347 0.9924755 1329
## 6 0.8402141 0.9841986 1329
## 7 0.8386850 0.9841986 1329
## 8 0.8385616 0.9834462 1329
## 9 0.9954303 0.9879609 1329
## 10 0.9984860 0.9939804 1329
Contaminated samples can be identified through elevated heterozygosity
#plot heterozygosity vs genotype rate per sample
#samples with unusually high heterozygosity relative to others are candidates for contamination
#the sample above the dashed line in this example is likely contaminated
ggplot()+geom_point(data=GTscore_individualSummary,aes(x=GenotypeRate,y=Heterozygosity))+
labs(title="Heterozygosity vs Genotype Rate per Sample", x="Genotype Rate", y="Heterozygosity")+
theme_bw()+theme(plot.title=element_text(hjust=0.5),plot.subtitle=element_text(hjust=0.5))+
geom_hline(yintercept=0.30, lty="dashed")Set heterozygosity treshold to identify likely contaminated sample
#identify likely contaminated sample
contaminatedSample<-GTscore_individualSummary %>% filter(Heterozygosity>0.30) %>% pull(Sample)
contaminatedSample## [1] "KAROL05_0009_KAROL05_0010" "KAROL05_0025_KAROL05_0026"
## [3] "KAROL05_0028_KAROL05_0029"
Contaminated samples can alao be identified by an elevated contamination score. In this example, samples with a contamination score > 0.3 are known to be contaminated.
#plot histogram of contamination score
ggplot()+geom_histogram(data=GTscore_individualSummary,aes(x=conScore),binwidth=0.02)+geom_vline(xintercept=0.3,lty="dashed")Set contamination score treshold to identify contaminated samples
#identify likely contaminated samples
contaminatedSamples2<-GTscore_individualSummary %>% filter(conScore>=0.3) %>% pull(Sample)
contaminatedSamples2## [1] "KAROL05_0009_KAROL05_0010" "KAROL05_0025_KAROL05_0026"
## [3] "KAROL05_0028_KAROL05_0029"
Plot genotype scatterplots for each sample
#Scatter Plots can show evidence of contamination or elevated ploidy
plotGenotypes_sample(singleSNP_locusTable, singleSNP_alleleReads, polyGenResults_singleSNP, type='scatter', savePlot="Y", saveDir="scatterPlots_sample")
#look at plots for samples, particularly for putatively contaminated samples identified by high heterozygosityThis is an example of a sample with no signs of contamination
This is an example of a potentially moderately contaminated sample
This is an example of a one of the contaminated samples that we created by combining reads from two individuals
Flag SNPs with < 50% genotype rate as candidates for removal
poorQualitySNPs<-singleSNP_summary %>% filter(GenotypeRate<0.5) %>% mutate(Locus_ID=as.character(Locus_ID)) %>% pull(Locus_ID)
poorQualitySNPs## [1] "Ots_129170-683_1" "Ots_RAD10099_1" "Ots_U2305-63_1" "RAD21430_84"
Scatter Plots can show evidence of systemic issues in read counting (off-target reads, biases) or elevated ploidy
Plot genotype scatterplots per locus
plotGenotypes(singleSNP_locusTable, singleSNP_alleleReads, polyGenResults_singleSNP, type='scatter', savePlot="Y", saveDir="scatterPlots_loci")plot histogram of contamination score for loci. An arbitrary score of 0.3 was set to identify problematic loci, but a lower or higher threshold may be more suitable. Visual examinstion of loci with elevated scores is recommended. In cases where there are very few heterozygous individuals, a high contamination score may be not truly reflect locus performance..
#loci with high contamination scores should have scatterplots examined to ensure genotypes look accurate
singleSNP_summary %>% ggplot(data=.) + geom_histogram(aes(x=conScore),binwidth=0.01)+geom_vline(xintercept=0.3)Plot contamination score vs minor allele frequency. Loci with both a high contamination score and high minor allele frequency are likely to be problematic.
singleSNP_summary %>% ggplot(data=.) + geom_point(aes(x=conScore,y=minAF))write.table(polyGenResults_singleSNP,"polyGenResults_singleSNP.txt",quote=FALSE,sep="\t")Look at scatterplots for loci with high contamination score
Example of a paralog
Example of a locus with with likely off-target sequence
Example of a locus that has high contamination score but may be fine
When exporting files, you can either export the full results and remove any poor quality samples or loci afterwards, or you can use whitelists or blacklists to filter during export. Both the exportGenepop and exportRubias functions have the following options: sampleWhitelist, locusWhitelist, sampleBlacklist, locusBlacklist. The whitelists are a list of samples or loci to retain, the blacklists are lists of samples or loci to discard.
export genepop format with no filtering
#example with no filtering
exportGenepop(polyGenResults_singleSNP,singleSNP_locusTable,filename="polyGenResults_singleSNP_genepop.txt")export genepop format with poor quality loci and contaminated sample removed
#example with poor quality loci and contaminated sample removed
exportGenepop(polyGenResults_singleSNP,singleSNP_locusTable,
locusBlacklist=poorQualitySNPs,sampleBlacklist=contaminatedSample,
filename="polyGenResults_singleSNP_genepop_filtered.txt")Data output for Rubias should have metadata associated with each sample that specifies the sample type, mixture if a mixture sample, reference if a baseline sample, the reporting group, the sample collection, and individual name.
Sample metadata columns should have the following names: sample_type, repunit, collection, indiv
metadata can be created here or loaded from file
create sample metadata for mixture sample
sampleMetaData<-data.frame(sample_type="mixture",repunit=NA,collection="SSCK17P2",indiv=colnames(polyGenResults_singleSNP))
head(sampleMetaData)## sample_type repunit collection indiv
## 1 mixture NA SSCK17P2 KEEK02_0001
## 2 mixture NA SSCK17P2 KEEK02_0002
## 3 mixture NA SSCK17P2 KEEK02_0003
## 4 mixture NA SSCK17P2 KEEK02_0004
## 5 mixture NA SSCK17P2 KEEK02_0005
## 6 mixture NA SSCK17P2 KEEK02_0006
export Rubias format with no filtering
exportRubias(polyGenResults_singleSNP,singleSNP_locusTable,sampleMetaData,filename="polyGenResults_singleSNP_rubias.txt")export Rubias format with poor quality loci and contaminated sample removed
exportRubias(polyGenResults_singleSNP,singleSNP_locusTable,sampleMetaData,
locusBlacklist=poorQualitySNPs,sampleBlacklist=contaminatedSample,
filename="polyGenResults_singleSNP_rubias_filtered.txt")











-1.png)








