This pipeline provides QC information for lanes of Group B Strep (GBS) sequences that have been imported on farm5 and QC-ed, assembled and mapped on pf. This pipeline gives:
- Relative abundance of GBS reads from Kraken
- Number of contigs
- GC content
- Genome length
- Coverage breadth
- Coverage depth
- Percentage HET SNPs out of total SNPs
- Download pipeline in a directory where you keep your software or pipelines:
git clone https://github.com/sanger-bentley-group/GBS_QC_nf.git
- Go into pipeline directory
cd GBS_QC_nf
- Load nextflow module
module load nextflow
- Run QC analysis using bsub:
bsub -o gbs_qc.%J.out -e gbs_qc.%J.err -R"select[mem>4000] rusage[mem=4000]" -M4000 'nextflow run main.nf --qc_reports_directory /path/to/gbs_qc_reports --lanes /path/to/gbs_lanes.txt'
Change:
/path/to/gbs_lanes.txt
to the file location of your list of lanes (that are imported and can be accessed via pf), e.g.
20280_5#1
20280_5#10
20280_5#100
20280_5#101
20280_5#102
20280_5#103
20280_5#104
20280_5#105
20280_5#106
20280_5#107
/path/to/gbs_qc_reports
to the directory where the generated reports should be written. (Default is the current directory)
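Before submitting the job, it can help to check that the lanes file is well formed. The following is a minimal sketch, not part of the pipeline: the helper name check_lane_ids is hypothetical, and the lane ID pattern (run, lane and tag written as digits_digits#digits) is an assumption inferred from the example lanes above.

```python
import re

# Assumed lane ID shape, inferred from the example lanes: <run>_<lane>#<tag>
LANE_ID_PATTERN = re.compile(r"^\d+_\d+#\d+$")


def check_lane_ids(lines):
    """Split lines into (valid lane IDs, rejected lines), skipping blank lines."""
    valid, rejected = [], []
    for line in lines:
        lane = line.strip()
        if not lane:
            continue  # ignore empty lines
        if LANE_ID_PATTERN.match(lane):
            valid.append(lane)
        else:
            rejected.append(lane)
    return valid, rejected


valid, rejected = check_lane_ids(["20280_5#1", "20280_5#10", "", "20280_5_100"])
print(valid)     # ['20280_5#1', '20280_5#10']
print(rejected)  # ['20280_5_100']
```

To check a real file, pass the open file handle instead of a list, for example check_lane_ids(open("/path/to/gbs_lanes.txt")).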
You should get two tab-delimited output reports, qc_report_summary.txt and qc_report_complete.txt, in the --qc_reports_directory you specified. qc_report_summary.txt gives the lane_id and PASS/FAIL status. qc_report_complete.txt gives the PASS/FAIL status for each QC.
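Because the reports are tab-delimited, they are easy to post-process. The sketch below lists the lanes that did not pass in a summary report. It assumes the report has a header row, and the column names lane_id and status are assumptions: check the header of your own report and adjust them if they differ. The function name failed_lanes is hypothetical.

```python
import csv
import io


def failed_lanes(report_file, id_column="lane_id", status_column="status"):
    """Return the IDs of all lanes whose status is not PASS."""
    reader = csv.DictReader(report_file, delimiter="\t")
    return [row[id_column] for row in reader if row[status_column] != "PASS"]


# In-memory stand-in for qc_report_summary.txt (made-up rows for illustration)
sample = io.StringIO("lane_id\tstatus\n20280_5#1\tPASS\n20280_5#10\tFAIL\n")
print(failed_lanes(sample))  # ['20280_5#10']
```

For a real report, open the file with newline="" and pass the handle to failed_lanes.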
In qc_report_summary.txt, if there are empty values for:
- rel_abundance, then these lanes may not have been imported, or not imported properly with a Kraken report. Solution: Contact path-help@sanger.ac.uk to import those lanes again.
- contig_no, gc_content or genome_len, then these lanes may not have been assembled, or not assembled properly. Solution: Check the status of the assemblies using the pf status command. If the status is "-", contact path-help@sanger.ac.uk to assemble those lanes. If it is Failed/Running/Pending, ask path-help to re-trigger the assemblies (although Failed assemblies can suggest a problem with the read coverage).
- cov_breadth or cov_depth, then these were not calculated in pf. Solution: Contact path-help@sanger.ac.uk to ask why these values for these lanes are not available in pf data -s.
- HET_SNPs, then these lanes may not have had SNPs called. Solution: Check the status of the SNP calling using the pf status command. If the status is "-", contact path-help@sanger.ac.uk to call SNPs. If it is Failed/Running/Pending, ask path-help to re-trigger the SNP calling.
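The troubleshooting steps above can be sketched as a small script that scans a report for empty values and prints the suggested follow-up. This is an illustrative sketch only: the function name missing_values is hypothetical, it assumes the report has a header row with a lane_id column and the QC column names listed above, and the action texts are shortened paraphrases of the notes above.

```python
import csv
import io

# Follow-up actions, paraphrased from the troubleshooting notes
ACTIONS = {
    "rel_abundance": "ask path-help to import the lane again",
    "contig_no": "check the assembly with pf status",
    "gc_content": "check the assembly with pf status",
    "genome_len": "check the assembly with pf status",
    "cov_breadth": "ask path-help why the value is missing from pf data -s",
    "cov_depth": "ask path-help why the value is missing from pf data -s",
    "HET_SNPs": "check the SNP calling with pf status",
}


def missing_values(report_file, id_column="lane_id"):
    """Return (lane_id, column, action) for every empty QC value.

    Columns that are absent from the report header are not checked.
    """
    found = []
    for row in csv.DictReader(report_file, delimiter="\t"):
        for column, action in ACTIONS.items():
            if column in row and not (row[column] or "").strip():
                found.append((row[id_column], column, action))
    return found


# Made-up example: one lane with no contig_no value
sample = io.StringIO("lane_id\trel_abundance\tcontig_no\n20280_5#1\t92.5\t\n")
for lane, column, action in missing_values(sample):
    print(f"{lane}: {column} is empty, {action}")
```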
--rel_abund_threshold Pass read QC if rel_abundance is > rel_abund_threshold. (Default: 70)
--species Species of interest. (Default: 'Streptococcus agalactiae')
--contig_no_threshold Pass contig number QC if contig_no < contig_no_threshold. (Default: 500)
--assembler Assemblies of interest e.g. velvet or spades. (Default: spades)
--gc_content_lower_threshold GC content must be >= gc_content_lower_threshold to pass. (Default: 32)
--gc_content_higher_threshold GC content must be <= gc_content_higher_threshold to pass. (Default: 38)
--genome_len_lower_threshold Genome length/total number of bases > genome_len_lower_threshold to pass. (Default: 1450000)
--genome_len_higher_threshold Genome length/total number of bases < genome_len_higher_threshold to pass. (Default: 2800000)
--cov_depth_threshold Genome depth of coverage > cov_depth_threshold to pass. (Default: 20)
--cov_breadth_threshold Genome breadth of coverage > cov_breadth_threshold to pass. (Default: 70)
--het_snps_threshold Number of HET SNPs <= het_snps_threshold to pass. (Default: 20)
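To make the comparisons concrete, here is a minimal sketch of how the thresholds could be applied to one lane, using the defaults and the comparison operators stated in the option descriptions above. It is not the pipeline's own code: the function name qc_status and the input dictionary layout are hypothetical, the example values are made up, and how the pipeline combines the individual checks into the summary status is not covered.

```python
# Default thresholds as listed in the options above
DEFAULTS = {
    "rel_abund_threshold": 70,
    "contig_no_threshold": 500,
    "gc_content_lower_threshold": 32,
    "gc_content_higher_threshold": 38,
    "genome_len_lower_threshold": 1450000,
    "genome_len_higher_threshold": 2800000,
    "cov_depth_threshold": 20,
    "cov_breadth_threshold": 70,
    "het_snps_threshold": 20,
}


def qc_status(values, thresholds=DEFAULTS):
    """Return PASS/FAIL per QC metric for one lane (illustrative only)."""
    t = thresholds
    checks = {
        "rel_abundance": values["rel_abundance"] > t["rel_abund_threshold"],
        "contig_no": values["contig_no"] < t["contig_no_threshold"],
        # GC content bounds are inclusive
        "gc_content": t["gc_content_lower_threshold"]
        <= values["gc_content"]
        <= t["gc_content_higher_threshold"],
        # Genome length bounds are exclusive
        "genome_len": t["genome_len_lower_threshold"]
        < values["genome_len"]
        < t["genome_len_higher_threshold"],
        "cov_depth": values["cov_depth"] > t["cov_depth_threshold"],
        "cov_breadth": values["cov_breadth"] > t["cov_breadth_threshold"],
        "HET_SNPs": values["HET_SNPs"] <= t["het_snps_threshold"],
    }
    return {name: "PASS" if ok else "FAIL" for name, ok in checks.items()}


# Made-up values for a single lane
lane = {
    "rel_abundance": 92.5, "contig_no": 40, "gc_content": 35.4,
    "genome_len": 2100000, "cov_depth": 85.0, "cov_breadth": 93.1,
    "HET_SNPs": 3,
}
print(qc_status(lane))
```

Passing a modified copy of DEFAULTS as the second argument mirrors overriding an option on the command line, e.g. dict(DEFAULTS, contig_no_threshold=300).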
The methods used for finding relative abundance from Kraken, coverage breadth, coverage depth and percentage HET SNPs out of total SNPs are described here (Sanger access only).
To run Python unit tests:
pytest tests
To test this pipeline on the farm:
module load nextflow/20.10.0-5430
bsub -G <YOUR GROUP> -o gbs_qc.o -e gbs_qc.e -R"select[mem>4000] rusage[mem=4000]" -M4000 'nextflow run main.nf --qc_reports_directory gbs_qc_report --lanes tests/test_data/test_lanes.txt'