PNGseqR

Rapid candidate gene Selection and Visualization through multiple Bulk Segregate Analysis with next-generation sequencing

PNGseqR provides a convenient tool to map genetic loci controlling both quantitative and qualitative trait by performing BSA analysis, which integrates multiple algorithms and methods to indicate the magnitude of genome wide signals and to define the candidate regions, and prioritizes the causal genes through DEG and GO analysis. It takes single nucleotide polymorphism (SNP) markers from Next-generation sequencing (NGS) data in Variant Call Format (VCF) format as the input file, and provides four published BSA algorithms to rapidly define the candidate region through permutation test and fractile quantile. Users can choose the appropriate methods according to their data and experimental design. In addition, it also supports performing DEG analysis and GO analysis in order to prioritize the target gene. Once the analysis is completed, the plots can conveniently be exported.

PNGseqR is still under development and is offered with out any guarantee, our article has also been published，user can download through https://www.mdpi.com/2223-7747/11/14/1821/pdf for more detailed information

PNGseqR was developed in R version 3.5.0 and depends on a number of packages for various aspects of its implementation

c("vcfR", "dplyr", "tidyr", "readr", "locfit", "DESeq2", "ggplot2", "topGO", "rtracklayer", "grDevices", "graphics","methods","stats","utils")

Installation

PNGseqR can be install using devtools, either directly from GitHub,

devtools::install_github("https://github.com/smilejeen/PNGseqR")

or by downloading the repository to your computer, unzipping, and installing the PNGseqR folder.

devtools::install("PNGseqR")

or users can download the archive file from “https://github.com/smilejeen/PNGseqR/raw/main/PNGseqR_0.0.0.9.tar.gz”,

and installed the package by following command:

install.packages("{Your PATH}/PNGseqR_0.0.0.9.tar.gz", repos = NULL, type = "source")

Note: Apart from regular package dependencies, there are some Bioconductor tools that we use as well, as such you will be prompted to install support for Bioconductor, if you haven’t already. Because of this, in order to install PNGseqR from github you will be required to install some compiling tools (Rtools for Windowsand Mac, respectively).

If you use PNGseqR in published research, please cite:

We also recommend citing the paper for the corresponding method you work with.

△SNP method:

Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, C., Uemura, A., Utsushi, H., Tamiru, M., Takuno, S., Innan, H., Cano, L. M., Kamoun, S. and Terauchi, R. (2013), QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. Plant J, 74: 174–183. doi:10.1111/tpj.12105

G prime method:

Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk Segregant Analysis Using Next Generation Sequencing. PLOS Computational Biology 7(11): e1002255. doi.org/10.1371/journal.pcbi.1002255

Euclidean Distance method:

Hill JT, Demarest BL, Bisgrove BW, Gorsi B, Yost HJ (2013) Mmappr: mutation mapping analysis pipeline for pooled rna-seq. Genome Research 23(4):687-697. doi: 10.1101/gr.146936.112

Empirical Bayesian method:

Liu, S., Yeh, C. T., Tang, H. M., Dan, N., Schnable PS. (2012), Gene mapping via bulked segregant rna-seq (bsr-seq). Plos One, 7(5), e36406. doi:10.1371/journal.pone.0036406

Input files

At a minimum, PNGseqR requires one input data set:

genotype information of SNP Calling in VCF format(produce from GATK https://github.com/broadinstitute/gatk/releases VCF format http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk)

Three optional data frames may also be supplied:

a dataset containing reads count of replicate samples for bulks
go terms corresponding to gene id for analysis species
a dataset in GTF format contain the position information of genes

The formatting parameters of required and all optional input files are summarized below.

vcf_data

vcf_data is a dataset containing SNP information in Variant Call Format(VCF) one row per SNP, with eleven columns, containing the following columns:

Column	Description
`CHROM`	The chromosome which SNP fall in. Data type: character
`POS`	The position on each chromosome of SNP, in base pairs. Data type: character
`ID`	The name of each SNP. Data type: character
`REF`	The base type at this position of reference genome. Data type: character
`ALT`	Compare to the reference genome,the of alternative base type at this position. Data type: character
`QUAL`	Quality value of variation site detection. Data type: character
`FILTER`	Used to judge if the site passed the filter criteria. Data type: character
`FORMAT`	Including the genotype information(GT),allele depth in each sample information(AD), total depth of this site information(DP), Genotype quality value(GQ), Likelihood value of genotype(PL). Data type: character
`BULK1`	The value cprresponding to FORMAT in BULK1 sample, i.e. "B73_MU". Data type: character
`BULK2`	The value cprresponding to FORMAT in BULK2 sample, i.e. "B73_WT". Data type: character

> library(vcfR) 
> vcf_data.example <- read.vcfR("snp.vcf",verbose = TRUE) # snp.vcf is a example dataset
> head(vcf_data.example)

   	CHROM POS      ID REF ALT QUAL      FILTER	FORMAT           B73_MU                           B73_WT                        
[1,]	"1"   "49527"  NA "A" "G" "428.44"  NA    "GT:AD:DP:GQ:PL" "0/1:8,7:15:99:215,0,222"        "0/1:4,6:10:99:221,0,130"
[2,]	"1"   "51053"  NA "A" "T" "6117.44" NA    "GT:AD:DP:GQ:PL" "0/1:87,161:248:99:4854,0,2383"  "0/1:80,51:131:99:1271,0,2314"
[3,]	"1"   "55373"  NA "A" "T" "7487.44" NA    "GT:AD:DP:GQ:PL" "0/1:108,137:246:99:4585,0,3353" "0/1:75,87:163:99:2910,0,2376"
[4,]	"1"   "139436" NA "G" "A" "637.44"  NA    "GT:AD:DP:GQ:PL" "0/1:11,12:23:99:391,0,362"      "0/1:4,8:12:99:254,0,134"    
[5,]	"1"   "199964" NA "C" "T" "1026.96" NA    "GT:AD:DP:GQ:PL" "1/1:0,15:15:45:625,45,0"        "0/1:1,11:12:0:406,0,0"       
[6,]	"1"   "199995" NA "T" "C" "1120.34" NA    "GT:AD:DP:GQ:PL" "1/1:1,17:18:12:670,12,0"        "1/1:0,11:11:33:456,33,0"

DEG_reads

DEG_reads is an optional dataframe containing multiple replicates reads count of each bulk, one row per gene, with at least five columns:

Column	Description
`Gene id`	Gene symbol/name used for differential expression analysis refers to. Data type: character
`BULK1-1`	Reads number of each gene in the first sample of bulk1 replicate. Data type: integer
`BULK1-2`	Reads number of each gene in the second sample of bulk1 replicate. Data type: integer
`BULK1-n`	Reads number of each gene in the nth sample of bulk1Note: DEG_reads can contain multiple samples such as BULK1-4, BULK1-5 but the number of sample must corresponding with BULK2-n. Data type: integer
`BULK2-1`	Reads number of each gene in the first sample of bulk2 replicate. Data type: integer
`BULK2-2`	Reads number of each gene in the second sample of bulk2 replicate. Data type: integer
`BULK2-n`	Reads number of each gene in the nth sample of bulk2Note: DEG_reads can contain multiple samples such as BULK2-4, BULK2-5 but the number of sample must corresponding with BULK1-n. Data type: integer

> data("DEG_reads.example")
> tail(DEG_reads.example)
             Gene id  MU1  MU2  WT1  WT2
46113 Zm00001d054106 1606 1810 1318 1431
46114 Zm00001d054107 4746 4490 5527 7359
46115 Zm00001d054109    0    0    0    0
46116 Zm00001d054110   55   46   16    4
46117 Zm00001d054111    6    4   27   10
46118 Zm00001d054112   26    5   29   23

GO_Term

GO_Term is an optional data frame, one row per gene and a go term, which should contain the following columns:

Note: The example is a dataset containing maize GO Terms of B73 version4 from agrigo(http://bioinfo.cau.edu.cn/agriGO/index.php)

Column	Description
`Gene id`	Gene symbol/name for gene ontology analysis refers to. Data type: character
`GO Term`	GO term corresponding to the gene. Data type: character

> data("GO_Term.example")
> head(GO_Term.example,25)
   Gene id    GO Term
1  Zm00001d012464 GO:0044282
2  Zm00001d012464 GO:0005777
3  Zm00001d012464 GO:0055114
4  Zm00001d012464 GO:0010264
5  Zm00001d012464 GO:0071482
6  Zm00001d012464 GO:0003959
7  Zm00001d012464 GO:0003954
8  Zm00001d012464 GO:0046496
9  Zm00001d012464 GO:0006091
10 Zm00001d012464 GO:0031304
11 Zm00001d012464 GO:0050660
12 Zm00001d012464 GO:0005759
13 Zm00001d012464 GO:0005975
14 Zm00001d010658 GO:0016020
15 Zm00001d010658 GO:0009410
16 Zm00001d010658 GO:0009863
17 Zm00001d010658 GO:0043565
18 Zm00001d010658 GO:0006355
19 Zm00001d010658 GO:0046983
20 Zm00001d010658 GO:0005737
21 Zm00001d010658 GO:0005634
22 Zm00001d010658 GO:0009627
23 Zm00001d010658 GO:0003700
24 Zm00001d010659 GO:0080092
25 Zm00001d010659 GO:0061025

gtf_data

gtf_data is an optional GRanges object with 6 ranges and 15 metadata columns which should contain the following columns:

Note: The example is a dataset containing part of maize genes positional information of B73 version4 from ensembl plant(ftp://ftp.ensemblgenomes.org/pub/plants/release-34/gtf/zea_mays/Zea_mays.AGPv4.34.gtf.gz)

Column	Description
`seqnames`	The ID of the sequence, which can be the ID of chromosome, scaffold or contig.. Data type: factor
`start`	The start position of the feature on the sequence. Data type: integer
`end`	The end position of the feature on the sequence. Data type: integer
`width`	The width of the feature on the sequence. Data type: integer
`source`	The source of this file. Data type: factor
`type`	The feature represented by the region between start and end can be gene, non coding RNA, TE, exon, trascript, etc. Data type: factor
`score`	A floating-point value. When there is a value, it represents the reliability of the above features. Data type: numeric
`phase`	Only valid for comment type "CDS", indicating the position of the starting code. The valid values are 0, 1 and 2. Data type: integer
`gene_id`	The name of this sequence. Data type: character
`gene_biotype`	optional column.The biological type of this sequence such as gene, lincRNA, snoRNA, etc. Data type: character
`transcript_id`	The name of this sequence, if the sequence is a transcript. Data type: character
`transcript_source`	optional column.The source of this transcript. Data type: character
`transcript_biotype`	optional column.The biological type of this sequence. Data type: character
`exon_number`	optional column.The exon number of the transcript. Data type: character
`exon_id`	optional column.The name of exon if above feature is an exon. Data type: character
`protein_id`	optional column.The name of corresponding protein. Data type: character
`protein_version`	optional column.The version of protein function annotation. Data type: character
`gene_name`	optional column.indicate the function of gene. Data type: character

> library(rtracklayer)
> gtf_data.example <- import("zma.v4.example.gtf")
> head(gtf_data.example)
GRanges object with 6 ranges and 15 metadata columns:
      seqnames      ranges strand |   source        type     score     phase        gene_id  gene_source   gene_biotype       transcript_id transcript_source transcript_biotype  exon_number                exon_id          protein_id protein_version   gene_name
         <Rle>   <IRanges>  <Rle> | <factor>    <factor> <numeric> <integer>    <character>   <character>    <character>         <character>       <character>        <character>      <character>          <character>         <character>     <character> <character>
  [1]        8 77622-80307      + |  gramene gene               NA      <NA> Zm00001d008171    gramene protein_coding                <NA>              <NA>               <NA>              <NA>                   <NA>                <NA>            <NA>        <NA>
  [2]        8 77622-80307      + |  gramene transcript         NA      <NA> Zm00001d008171    gramene protein_coding Zm00001d008171_T001           gramene     protein_coding              <NA>                   <NA>                <NA>            <NA>        <NA>
  [3]        8 77622-78130      + |  gramene exon               NA      <NA> Zm00001d008171    gramene protein_coding Zm00001d008171_T001           gramene     protein_coding              <NA>                   <NA>                <NA>            <NA>        <NA>
  [4]        8 77951-78130      + |  gramene CDS                NA         0 Zm00001d008171    gramene protein_coding Zm00001d008171_T001           gramene     protein_coding                 1 Zm00001d008171_T001...                <NA>            <NA>        <NA>
  [5]        8 77951-77953      + |  gramene start_codon        NA         0 Zm00001d008171    gramene protein_coding Zm00001d008171_T001           gramene     protein_coding                 1                   <NA> Zm00001d008171_P001               1        <NA>
  [6]        8 78839-79033      + |  gramene exon               NA      <NA> Zm00001d008171    gramene protein_coding Zm00001d008171_T001           gramene     protein_coding                 2 Zm00001d008171_T001...                <NA>            <NA>        <NA>

Function arguments

Required Arguments

Function	Parameter	Description
`vcf2table()`		The function converts file in VCF format into a data frame
``	file	The name of genotype file in VCF format from GATK SNP calling pipeline
``	BulkA	The sample name of one bulk
``	BulkB	The sample name of the other bulk
``	chromlist	A string vector of the chromosomes to be used in the analysis, used to filter out unneeded contigs etc.
`BSA_filter()`		The function to filter SNPs based on read depth and quality of each bulk
``	file	The data frame exported by vcf2table()
``	Bulk.DP	The minimum total read depth for a SNP
``	Bulk2.Ref	The minimum read depth of a reference SNP allele
``	Bulk2.Alt	The minimum read depth of a alternative SNP allele
``	Depth.diff	The maximum absolute differences between the read depths of the bulks
``	min.GQ	The minimum Genotype Quality output by GATK
``	verbose	Default value is “TRUE”. If the value is "TRUE", the number of SNPs filtered in each step will print
`DeltaSNP()`		The function to perform BSA analysis with △SNP algorithm
``	file	The data frame exported by BSA_filter()
``	Windowsize	A numeric value indicates the window size (in base pairs) containing SNPs subjected to calculating △SNP
``	threshold	From 0 to 1, default value is “0.95”. A numeric threshold value to extract candidate region by fractile quantile
``	pval	A numeric threshold value. Default value is “0.05”
`Bayesian()`		The function to perform BSA analysis with empirical bayesian algorithm
``	file	The data frame exported by BSA_filter()
``	Bulksize	The number of samples in each pool
``	Genomelength	The total length of reference genome
``	Windowsize	A numeric value indicates the window size (in base pairs) containing SNPs subjected to calculating empirical bayesian
``	threshold	From 0 to 1, default value is “0.95”. A numeric threshold value to extract candidate region by fractile quantile
``	pval	A numeric threshold value. Default value is “0.05”
`ED()`		The function to perform BSA analysis with Euclidean distance algorithm
``	file	The data frame exported by BSA_filter()
``	power	A numeric value indicates the power to raise ED. It was suggested to be 4~6
``	Windowsize	A numeric value indicates the window size (in base pairs) containing SNPs subjected to calculating Euclidean distance
``	threshold	From 0 to 1, default value is “0.95”. A numeric threshold value to extract candidate region by fractile quantile
``	pval	A numeric threshold value, default value is “0.05”
`Gprime()`		The function to perform BSA analysis with G-test algorithm
``	file	The data frame exported by BSA_filter()
``	Windowsize	A numeric value indicates the window size (in base pairs) containing SNPs subjected to calculating G statistic
``	threshold	From 0 to 1, default value is “0.95”. A numeric threshold value to extract candidate region by fractile quantile
``	pval	A numeric threshold value. Default value is “0.05”
`plot_BSA()`		The function to output BSA analysis result into scatter and line plots
``	file	The data frame exported by the BSA analysis functions
``	chromlist	A string vector of the chromosome names used to plot line plots specific chromosomes, default value is “NULL”
``	type	A character to indicate the selected algorithm
``	file.name	A character to name the output plot
``	threshold	From 0 to 1, default value is “NULL”. The red threshold line displayed in the output plot
``	pval	Default value is “NULL”. A numeric threshold value drawn in blue line in the output plot
`DEG_analysis()`		The function to perform differential expression analysis
``	file	A data frame containing read count of each pool
``	rep.num	A numeric value indicates the number of replications in differential expression analysis
``	sig.level	A numeric value indicates the FDR threshold in differential expression analysis
``	exp.fold	A numeric value indicates the log2FoldChange threshold in differential expression analysis
``	plot	Default value is “TRUE”. If "TRUE", the volcano plot will be created
`GO_analysis()`		The function to perform GO analysis
``	GO_term	A data frame contains at least two columns, one column is genes' name and the other column is the GO term
``	DEG_id	A vector containing the names of the differential expression genes
``	type	A character indicates the mode of GO analysis. There are four options: "CC" indicates cellular component; "BP" indicates biological process, "MF" indicates molecular function, and "all" indicates all three modes. The default is “all”
``	algorithm	A character indicates the algorithm for GO analysis. Three options "classic","weight01", and "elim" indicate three different algorithms
``	statistic	A character indicates two different significance test methods. Two options "fisher" and "ks" indicates two different methods
``	top_num	A numeric value to indicate the number of significant GO terms
``	plot	Default value is “TRUE”. If "TRUE", the acyclic, bubble, and histogram plots will be created
`result_anno()`		The function to extract genes fall in the candidate region and judge whether gene is differential expression
``	BSA_data	A data frame contains three columns, including the chromosome name, the start and the end positions of the candidate regions
``	gtf_data	The name of reference genome in GTF format
``	DEG_data	A vector containing the names of differential expression genes,default is "NULL"

Use Examples:

This is a basic example which shows you how to import and analyze NGS-BSA data.

#load the package
library("PNGseqR")

#Set file and sample names
fl <- "snp.vcf"
MutBulk <- "B73_MU"
WtBulk <- "B73_WT"



#Choose which chromosomes will be included in the analysis (i.e. exclude smaller contigs)
seqname <- c(1:10)


#Import SNP data from file
df <- vcf2table(
        file = fl,
        BulkA = MutBulk,
        BulkB = WtBulk,
        chromlist = seqname
     )

head(df) # The genotype data frame convert from vcf file

#Filter SNPs based on some criteria
data_filt <- BSA_filter(file=df,
                        Bulk.DP=6,
                        Bulk2.Ref=3,
                        Bulk2.Alt=3,
                        Depth.diff=100,
                        min.GQ=99,
                        verbose = TRUE)

head(data_filt) # The remaining SNPs after filter

###Four algorithms for BSA analysis, for more detailed instructions please read the supplement materials

			 
#BSA analysis

#Run Delta SNP analysis
df_deltaSNP <- DeltaSNP(file = data_filt,
                        Windowsize = 5e6,
                        threshold = 0.995,
                        pval = 0.001)

head(df_deltaSNP$total.result) # The total analysis result 
head(df_deltaSNP$sigthreshold) # The SNPs selected based on the fractile quntile threshold
head(df_deltaSNP$sigpval.result) #The SNPs selected based on the permutation test threshold

#Run Baysian analysis
df_bayes <- Bayesian(file = data_filt,
                     Bulksize = 30,     #The number of samples in the mixing bulks is 30
                     Genomelength = 2338,
                     Windowsize = 5e6,
                     threshold = 0.995,
                     pval = 0.001)

#Run Euclidean Distance analysis
df_ED <- ED(file = data_filt,
            power = 4,
            Windowsize = 5e6,
            threshold = 0.995,
            pval = 0.001)

#Run G-test analysis
df_G <- Gprime(file = data_filt,
               Windowsize = 5e6,
               threshold = 0.995,
               pval = 0.001)

Figure 1

Figure 1 Candidate region identified by different algorithms in PNGseqR for maize small kernel mutant. Plots produced by the plot_BSA() function with a 5 Mb sliding window: (A-D).The scatter plots exported from the BSA results, the used algorithms are Δ(SNP-index), empirical bayes, ED, and G-test. (E-H) the tricube-smoothed values on the genome. Red line threshold shows the region containing SNPs whose tricube-smoothed values are higher than 99.5% of all values, and blue line threshold shows the SNPs whose P-values are lower than 0.001 based on permutation test.

#Plot

##deltaSNP
plot_BSA(file = df_deltaSNP$total.result,
         chromlist = seqname,
         type = "DeltaSNP",
         file.name = "DS",
         threshold = 0.995,
              pval = 0.001)
         
##Baysian
plot_BSA(file = df_bayes$total.result,
         chromlist = seqname,
         type = "Bayesian",
         file.name = "Bayesian",
         threshold = 0.995,
         pval = 0.001)

##Euclidean Distance
plot_BSA(file = df_ED$total.result,
         chromlist = seqname,
         type = "ED",
         file.name = "ED",
         threshold = 0.995,
         pval = 0.001)

##Gprime
plot_BSA(file = df_G$total.result,
         chromlist = seqname,
         type = "Gprime",
         file.name = "Gprime",
         threshold = 0.995,
         pval = 0.001)

###Plots for DEG analysis and GO analysis,the plot can be found

Figure 2

Figure 2 Differential expression analysis and GO analysis for maize small kernel mutant. (A) The volcano plot shows the DEGs for maize small kernel mutant, the analysis was performed by using DEG_analysis(). (B) The histogram shows the result of GO analysis for maize small kernel mutant, the analysis was performed by using GO_analysis(). Top 20 terms of each mode (BP, CC, MF) are showed in this plot. (C) A part of directed acyclic plot plotted by using the “MF” mode of GO_analysis(). (D) Bubble plot shows the result of GO analysis, the GO terms are same as (B).

#Differential expression analysis

## Set data and parameters
DEG_reads <- DEG_reads.example
rep <- 2
pvalue <- 0.05
log2fold <- 1.5
res <- DEG_analysis(file = DEG_reads,
                    rep.num = rep,
                    sig.level = pvalue,
                    exp.fold = log2fold,
                    file.name = "test",
                    plot = TRUE)

head(res$DEG_total) # The differential expression of all genes
head(res$DEG_sig) # The significant differential expression genes based on the threshold

#GO analysis
## Set data and parameters
sigDEG <- res$DEG_sig
geneid <- sigDEG$id
go_term <- GO_Term.example
mode <- "all"
algorithm <- "classic"
statistic <- "fisher"
num <- 20

GO_result <- GO_analysis(GO_term = go_term,
                         DEG_id = geneid,
                         type = mode,
                         algorithm = algorithm,
                         statistic = statistic,
                         top_num = num,
                         plot = TRUE)

head(GO_result) # Get the top significant gene ontology analysis result

write.table(GO_result,"GO_terms.txt",sep = "\t",quote = F,row.names = F)

#result annotation
## Set data and parameters
gtf <- "zma.v4.example.gtf"
chr <- 8
start <- 79943
end <- 3418075
BSA_position <- data.frame(chr,start,end) # Construct the candidate region of data frame format

sigDEG <- res$DEG_sig

                           
BSA_gene <- result_anno(BSA_data = BSA_position,
                        gtf_data = gtf,
                        DEG_data = sigDEG)

BSA_gene$genes_in_region # The genes fall in the candidate region
BSA_gene$DEGgenes_in_region # The differential expression genes fall in the candidate region

Pipeline to get genotype file in VCF format

To help users got genotype and reads count files suitable for our package, we offer command line here:

For DNA sequencing data, the pipeline to get genotype data need install software BWA SAMtools Picard GATK baesd on a Linux system

First create index for reference genome([Reference.Genome.fa]) by bwa SAMtools and Picard

bwa index [Reference.Genome.fa]

samtools faidx [Reference.Genome.fa]

java -jar picard.jar CreateSequenceDictionary R=[Reference.Genome.fa] O=Reference.Genome.dict

Second mapping the sequencing data to the reference genome by bwa and transform result to Binary format

bwa mem -t 16 -M [Reference.Genome.fa] *R1.fq.gz *R2.fq.gz > *.sam
# * R1. fq Gz * R2.fq.gz are the paired-end sequencing data

samtools view -b -S *.sam > *.bam

Third The pretreatment for mapping result,including clean sort and adding group information for mapping result, getting unique reads and create index

java -jar picard.jar CleanSam INPUT=*.bam OUTPUT=*.cleaned.bam

java -jar picard.jar FixMateInformation INPUT=*.cleaned.bam OUTPUT=*.cleaned_fixed.bam SO=coordinate

java -jar picard.jar AddOrReplaceReadGroups INPUT=*.cleaned_fixed.bam OUTPUT=*cleaned_fixed_group.bam LB=WH SO=coordinate RGPL=ILLUMINA PU=barcode SM=WH

samtools index *cleaned_fixed_group.bam

java -jar picard.jar MarkDuplicatesWithMateCigar INPUT=*cleaned_fixed_group.bam OUTPUT=*cleaned_fixed_group_DEDUP.bam M=*cleaned_fixed_group_DEDUP.mx AS=true REMOVE_DUPLICATES=true MINIMUM_DISTANCE=500

samtools index *cleaned_fixed_group_DEDUP.bam

Then,last perform SNP Calling by GATK for each pool

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R [Reference.Genome.fa] -I *cleaned_fixed_group_DEDUP.bam -o *.gvcf

At last, combining the SNP Calling result of different pools to obtain the final file for BSA analysis by GATK

java -jar GenomeAnalysisTK.jar CombineGVCFs -R [Reference.Genome.fa] -V bulk1.gvcf -V bulk2.gvcf -O bsa.gvcf 
# bulk1.gvcf and bulk2.gvcf are SNP Calling result of different pool

java -jar GenomeAnalysisTK.jar GenotypeGVCFs -R [Reference.Genome.fa] -V bsa.gvcf -O bsa.vcf

For RNA sequencing data, the pipeline to get genotype data need install software Hisat2 SAMtools Picard GATK baesd on a Linux system

First,create index for reference genome by hisat2

hisat2/extract_exons.py [Reference.Genome.gtf] > exons.txt 
hisat2/extract_splice_sites.py [Reference.Genome.gtf] > splicesites.txt
hisat2/hisat2-build -f [Reference.Genome.fa] --ss splicesites.txt --exon exons.txt [Reference.Genome.index]
# [Reference.Genome.fa] is reference genome，[Reference.Genome.gtf] is the reference genome annotation file，[Reference.Genome.index] is the created reference genome index file

Second,mapping RNA sequencing data to reference genome by hisat2

hisat2 --dta -x [Reference.Genome.index] -1 *R1.fq.gz -2 *R2.fq.gz -S *.sam --known-splicesite-infile splicesites.txt
# *R1.fq.gz *R2.fq.gz are the paired-end sequencing data

Third,The pretreatment for mapping result,including clean sort and adding group information for mapping result, getting unique reads and create index

samtools view -h -q 50 *.sam | samtools view -bS -> unique.*.bam

samtools sort unique.*.bam unique.*.sort

samtools index unique.*.sort.bam 

java -jar picard.jar AddOrReplaceReadGroups.jar I=unique.*.sort.bam O=unique.*.GR.bam RGID=unique.* RGLB=A RGPL=illumina RGPU=nounit RGSM=unique.*

samtools sort unique.*.GR.bam unique.sort.*.GR

samtools index unique.sort.*.GR.bam

At last,perform SNP calling by GATK

java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R [Reference.Genome.fa] -I unique.sort.bulk1.GR.bam -I unique.sort.bulk2.GR.bam --heterozygosity 0.1 -stand_call_conf 50.0 -stand_emit_conf 20.0 -glm BOTH --num_threads 2 -ploidy 2 -U ALLOW_N_CIGAR_READS -o bsr_snp.vcf
# unique.sort.bulk1.GR.bam and unique.sort.bulk2.GR.bam is the pretreated mapping result of two different pools

count the reads fall in each gene region by HT-seqcount

htseq-count unique.*.sort.bam -f bam [Reference.Genome.gtf] > *.htseq.RC.txt
# unique.*.sort.bam is the mapping result after cleaning and sorting

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
R		R
data		data
man		man
.Rbuildignore		.Rbuildignore
.Rhistory		.Rhistory
.gitignore		.gitignore
BSA_analysis_plots.jpg		BSA_analysis_plots.jpg
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
PNGseqR.Rproj		PNGseqR.Rproj
PNGseqR_0.0.0.9.tar.gz		PNGseqR_0.0.0.9.tar.gz
ReadMe.md		ReadMe.md
expression_analysis_plots.jpg		expression_analysis_plots.jpg
snp.vcf		snp.vcf
zma.v4.example.gtf		zma.v4.example.gtf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PNGseqR

Rapid candidate gene Selection and Visualization through multiple Bulk Segregate Analysis with next-generation sequencing

Table of Contents

Installation

Input files

vcf_data

DEG_reads

GO_Term

gtf_data

Function arguments

Use Examples:

Figure 1

Figure 2

Pipeline to get genotype file in VCF format

About

Releases

Packages

Languages

smilejeen/PNGseqR

Folders and files

Latest commit

History

Repository files navigation

PNGseqR

Rapid candidate gene Selection and Visualization through multiple Bulk Segregate Analysis with next-generation sequencing

Table of Contents

Installation

Input files

vcf_data

DEG_reads

GO_Term

gtf_data

Function arguments

Use Examples:

Figure 1

Figure 2

Pipeline to get genotype file in VCF format

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages