Bioinformatic pipeline for evaluating and characterizing a reference genome sequence for a prokaryotic reference material using replicate whole genome sequencing data from orthogonal sequencing methods. This pipeline is being developed for the characterization of NIST candidate microbial genomic DNA reference materials and is provided to the research community for use in characterizing other prokaryotic genomic materials, including genomic DNA materials labs use in whole genome sequencing method validation and quality control.
Whole genome sequencing and bioinformatic analysis is a complex process for which the sources of bias and error are not fully understood. The methods presented here are intended to help users better characterize and understand their material, not to provide an error-free reference genome. Results generated from this pipeline are intended for research use only.
Genome sequences and additional information about the NIST microbial genomic reference materials are available at https://github.com/usnistgov/NIST_Micro_Genomic_RM_Data.
The bioinformatic component of PEPR consists of three parts:
- evaluation - validates and corrects the reference genome
- characterization - characterizes the base level purity and homogeneity of the material using the corrected reference genome from evaluation pipeline
- genomic_purity - performs taxonomic read classification of short read data
The evaluation pipeline is first used to refine the user-provided reference genome, and the characterization pipeline then characterizes the refined reference genome. The genomic purity pipeline is independent of the evaluation and characterization pipelines and is used to identify contaminant DNA.
Running PEPR
- Pipeline dependencies are installed in docker containers.
- A config file is used to define the metadata for the sequencing datasets used in the pipeline.
After running the PEPR evaluation, characterization, and genomic_purity pipelines, the R package peprr (https://github.com/usnistgov/peprr) is used to generate a SQLite database, peprDB, with the results for use in downstream analysis. peprDB is used to generate a generic Report of Analysis.
The YAML config file includes sample metadata and accession numbers for the datasets run by the pipeline. See the example YAML file in the repository utils directory.
- `project_id`, `center`, `RM`, `plat`, `vial`, and `rep` are used to define the read group in bam files
  - Read Group Definition: ID=`project_id`, LB=`vial`-`rep`, PL=`plat`, PU=barcoded, SM=`accession`, CN=`center`
- `project_id` - also used as the name of the directory where the results are saved
- `accession` - Genbank SRA accession id for the dataset
- `center` - sequencing center
- `plat` - sequencing platform
- `vial` - unique id for sample-level technical replicates
- `rep` - unique id for library prep technical replicates
- `ref` - name of the reference genome sequence; the file root name is used in the name of the results directory for the run
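The fields above might appear in a config along these lines. This is an illustrative sketch only: the field names come from the list above, but the layout, nesting (e.g. the `datasets` key), and all values are placeholders; the actual structure should be taken from the example YAML file in the repository utils directory.

```yaml
# Illustrative sketch only - nesting and values are placeholders;
# see the example YAML file in the utils directory for the real layout.
project_id: example_project
ref: example_genome
datasets:
  - accession: SRR0000000
    center: NIST
    RM: example_rm
    plat: miseq
    vial: 1
    rep: 1
```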
- `evaluation` pipeline should only include Illumina data
- `characterization` includes all datasets
- `genomic_purity` only includes short read data, e.g. Illumina and Ion Torrent
The evaluation and characterization pipelines can be run directly from the command line; the genomic_purity pipeline needs to be run from within the container.
- PEPR has two primary arguments
- -c, --config is the path to a YAML file with the sample accessions and metadata
- -p, --pipeline is the name of the pipeline being run
- Additional arguments for status text messaging capability
- --send_status is a YAML file with Sinch account information (https://www.sinch.com)
- --send_number is the mobile number to send the status messages to; use "+1" and a ten digit number with no spaces, e.g. "+1##########"
  - the number must be able to accept sms text messages
- Status messages are only sent when a pipeline starts and finishes; a message will not be sent if the pipeline crashes.
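If you use the text messaging option, the file passed to --send_status holds the Sinch account credentials. The key names below are hypothetical placeholders, not documented values; check pepr.py for the keys it actually reads.

```yaml
# Hypothetical key names - verify against what pepr.py reads.
app_key: your-sinch-app-key
app_secret: your-sinch-app-secret
```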
- Running the `evaluation` and `characterization` pipelines from the host command line with `nohup docker … > filename.log &` will save the standard output to `filename.log` and allow you to close the terminal without errors.
- The pipeline checks that the expected output files are produced and skips steps (not present in all places) if the file is present. The pipeline does not check that a file is non-empty; if a file is incomplete you will need to remove that file and subsequent files and rerun the pipeline. An easy way to do this is to remove the directory generated for that step of the pipeline. I plan to add additional checks for empty files to avoid this issue.
- If a text message is not received within the expected period of time check the nohup log file for an error message.
- Pipeline steps
  1. download fastq data from the database
  2. copy the ref genome to the `ref` directory in the project_id directory and index the genome
  3. map the miseq data to the reference genome
  4. evaluate the reference genome using Pilon
- Pipeline command - see Running PEPR using nohup for additional information
```
nohup docker run -v path/to/pepr:/pepr -v path/to/pepr-data:/pepr-data natedolson/pepr python /pepr/pepr.py -c pipeline_config.yaml -p evaluation --send_status sinch_config.yaml --send_number +1########## > run_specific_name.log &
```
- Output
  - `changes` and `fasta` files produced by Pilon
    - These files will be in the `path/to/pepr-data/project_id/ref_pilon` directory; `project_id` and `ref` are the values in the YAML config file.
  - Additional
    1. `path/to/pepr-data/project_id/ref_mapping` - mapping results files
    2. `path/to/pepr-data/project_id/ref` - reference genome sequences and index files
    3. `path/to/pepr-data/project_id/fastq` - sequence data
- Next steps after the pipeline completes
  1. Check the `path/to/pepr-data/project_id/ref_pilon/ref_miseq.changes` file to see what modifications were made to the reference genome.
     - If there were no changes, move on to the characterization pipeline.
     - If there were changes, re-run the evaluation pipeline with `path/to/pepr-data/project_id/ref_pilon/ref_miseq.fasta` as the reference:
       - you will need to create a new YAML file
       - copy `path/to/pepr-data/project_id/ref_pilon/ref_miseq.fasta` to the `path/to/pepr-data/` directory along with the new YAML file
       - re-run the `evaluation` pipeline
  2. If the only changes made undo the changes from the previous run, move on to the `characterization` pipeline; otherwise continue to re-run until no new changes are made (should only take 2-3 runs).
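The no-changes check in step 1 can also be scripted; a minimal sketch, using the illustrative directory layout described above (an empty `.changes` file means Pilon made no corrections):

```shell
# Decide whether another evaluation round is needed based on the Pilon
# .changes file. The path is illustrative, matching the layout above.
changes_file="path/to/pepr-data/project_id/ref_pilon/ref_miseq.changes"
if [ -s "$changes_file" ]; then
    echo "Pilon made changes - re-run evaluation with ref_miseq.fasta as the reference"
else
    echo "No changes - proceed to the characterization pipeline"
fi
```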
- Pipeline steps
  1. download fastq data from the database
  2. copy the ref genome to the `ref` directory in the project_id directory and index the genome
  3. map the miseq data to the reference genome
  4. calculate qc metrics for datasets, e.g. read count, length, coverage, etc.
  5. base level purity analysis
  6. homogeneity analysis
- Pipeline command
```
nohup docker run -v path/to/pepr:/pepr -v path/to/pepr-data:/pepr-data natedolson/pepr python /pepr/pepr.py -c pipeline_config.yaml -p characterization --send_status sinch_config.yaml --send_number +1########## > run_specific_name.log &
```
- Output
  - Primary
    1. qc - `_stat` and `depth` files in the `/path/to/pepr-data/project_id/ref_qc_metrics` directory; `project_id` and `ref` are the values in the YAML config file.
    2. `tsv` files in the `path/to/pepr-data/project_id/ref_consensus` directory
    3. `snp` and `indel` files in `path/to/pepr-data/project_id/ref_homogeneity`
  - Additional
    1. `path/to/pepr-data/project_id/ref_mapping` - mapping results files
    2. `path/to/pepr-data/project_id/ref` - reference genome sequences and index files
    3. `path/to/pepr-data/project_id/fastq` - sequence data
- Next steps after the pipeline completes
  1. Check for empty files; use `ls -lh` to check for `bam`, `depth`, and `tsv` files with size 0.
     - bam files are in `path/to/pepr-data/project_id/ref_mapping`
       - if a bam file is empty, check the size of the sam file in `/path/to/pepr-data/project_id/ref_mapping/tmp`
       - if both the sam and bam files are empty, remove the bam and sam files along with the bam index file (`bai`) and rerun the characterization pipeline
       - if only the bam file is empty, remove the bam and bam index file and re-run the pipeline
  2. If `depth` or `tsv` files are empty and the bam file for that accession is not, check the appropriate log files.
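The empty-file check in step 1 can be run over all three file types at once with `find`. The project path and the `.depth`/`.tsv` extensions are assumptions based on the file types named above.

```shell
# List zero-size bam, depth, and tsv files under the project directory.
# Path and extensions are illustrative, based on the description above.
project_dir="path/to/pepr-data/project_id"
if [ -d "$project_dir" ]; then
    find "$project_dir" \( -name '*.bam' -o -name '*.depth' -o -name '*.tsv' \) -size 0
fi
```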
- Estimated run time is ~2-7 days depending on the size of the datasets and the similarity of the genome to sequences in the database
- Pipeline steps
- download fastq data
- quality trimming of fastq files
- mapping reads to micro_rm_patho_db
- running pathoscope identification module
- Pipeline command - run from within the docker-pathoscope container; see notes on running docker containers
  1. Start the docker container:

     ```
     docker run -it -v /path/to/pepr:/pepr -v /path/to/pepr-data:/pepr-data -v /path/to/patho_db:/patho_db natedolson/docker-pathoscope /bin/bash
     ```

  2. From within the container:

     ```
     python /pepr/pepr.py -c pipeline_config.yaml -p genomic_purity --send_status sinch_config.yaml --send_number +1##########
     ```
- Output
  1. Primary
     - `-sam-report.tsv` files in the `/path/to/pepr-data/project_id/ref_genomic_purity` directory
- Next steps after the pipeline completes
  1. use `ls -lh` to check for empty `-sam-report.tsv` files
- Run the following command:

```
nohup docker run -v /path/to/pepr:/pepr -v /path/to/pepr-data:/pepr-data natedolson/pepr python /pepr/pepr.py -c pipeline_config.yaml -p pipeline_name --send_status sinch_config.yaml --send_number +1########## > run_specific_name.log &
```
1. replace `pipeline_config.yaml` with the name of the config file used for the pipeline run
2. include the mobile number that should receive text status update messages
- You should receive a text shortly after running the command saying that the pipeline has started, then another after the run has completed.
- Keep the ssh connection open until you receive a text message saying that the pipeline has started. If you do not receive one within ~1-2 minutes:
  1. Check the log file with `less run_specific_name.log`
     - if the log file is empty, there was most likely an error with sending the text message
       - use `top` to see if any pipeline commands are running, e.g. bwa, tmap, samtools, picard, etc.
       - if the pipeline is running, you can leave it running and check back later to see if it completed, with `tail run_specific_name.log`
     - if the log file is not empty, the run may have terminated prematurely, or there was an error with the text message
       - check the file for indications of premature termination - error messages, etc.
       - if there are no signs of errors:
         - use `top` as above to check whether the pipeline is still running
         - if programs are running, follow the instructions for when the log file is empty
       - to receive text messages, you will need to restart the pipeline:
         - run `docker ps -a` to get a list of running containers
         - record the name of the most recently started container running the pepr pipeline
         - force remove the container with `docker rm -f container_name`
         - check the command used to run the pipeline for errors in the send message arguments
- To start up a container's command line:

```
docker run -it -v /full_path_to_local_directory:/full_path_mount_directory container-name /bin/bash
```

- After the pipeline has started and is running okay, simply close the terminal; do not exit out of the container, as that will terminate the run. Breaking the ssh connection without exiting the container will keep the container running.
- To check on the status of the run, use `top` to see if any of the programs executed by the pipeline are still running.
  1. Programs you would expect to see for each pipeline:
     - `evaluation` - bwa, java, and samtools
     - `characterization` - tmap, bwa, java, samtools, and varscan
     - `genomic_purity` - python, perl, grep, gzip, and bowtie
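Instead of scanning `top` by eye, this check can be scripted with `pgrep`. A minimal sketch; the program names are the characterization list from above, so swap in the list for the pipeline you are running:

```shell
# Report which of the programs a pipeline launches are currently running.
# Program names are the characterization pipeline's list from above.
for prog in tmap bwa java samtools varscan; do
    if pgrep -x "$prog" > /dev/null; then
        echo "$prog is running"
    fi
done
```

If the loop prints nothing, none of the listed programs are running, which suggests the pipeline has finished or terminated.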
Certain commercial equipment, instruments, or materials are identified here only to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose.