HCV pipeline evaluation

Project at Public Health England for STP Elective.

De novo HCV pipeline used to determine species, quasiaspecies and drug resistance set up to replace Sanger sequencing. As Sanger is not as sensitive as the NGS-based de novo pipeline, this makes an evaluation of determining the sample composition against a known dataset unavilable.

Starting dataset

FASTAs were made to simulate samples with different compositions. These have been turned into fastqs using ART.

Overview of analysis

Create a consensus sequence from the FASTAs themselves, and also create a frequency matrix from them (scripts/FASTA_consensus.py, run by process_samples.py)
Then align the synthetic FASTQs (used BWA) to the conensus sequence and then create a quasibam for the consensus (process_samples.py)
Compare the frequency differences between the consensus frequency matrix and the consensus quasibam (scripts/frequency-matrix_quaisbam_comparison.Rmd)
Compare each pipeline quasibam to the consensus frequency matrix (quasibam-pipeline_comparison.Rmd)

Requirements

Python 3.5+ required for scripts, have not tested on anything lower. Requirements for pip in pip_requirements.txt
Rmd reports, issues with proxy that didn't have time to sort so can't use Bioconductor BiocLite or packrat/anaconda for package management. Follow steps in scripts/R-setup.R
- R 3.4+
- Ideally I'd use packrat or an anaconda package, but issues with proxy
- Tidyverse and Biostrings (Biostrings is a bioconductor package)
- Rstudio if doing single Rmarkdown reports
- Pandoc for knitting Rmarkdown
For consensus gap, the following in pipeline-resources/
- hcv.fasta and associated files
- lastz-distrib/bin/lastz and associated files
- lastz_bestref.pl
- lastz_analyser.WITH_REVCOMP.pl
- genome_maker2b.pl
- cons_mv.pl
- N_remover_from_consensus.pl
- majvarcheck2.pl
Module files show versions of each tool used

Running stages

If running any processing, do this on the cluster
Then copy results folder to local machine and run the report generation
- Had issues with pandoc and R package installation on the cluster and didn't have time to smooth it out.

Default usage

Requires Data in results/:
- Samples following the same YYMMDD_N prefix (N = sample number):
- FASTA file with each individual synthetic virus named in {YYMMDD_N}_quasi.fas format
- FASTQ files from derived from the FASTA named in {YYMMDD_N}_R1.fq and {YYMMDD_N}_R2.fq formats.

Load modules:

module load anaconda/4.2.1_python3
module load phe/quasi_bam/2-3
module load samtools # 0.1.19-44428cd
module load bwa # 0.7.9a-r786
module load smalt/0.7.6
module load blast+/2.2.27
module load vphaser # 2.0

Navigate to root directory of the project.
Run sample processing on the cluster, e.g. python3 process_samples.py 170908

Human removal

Same as default but run with --remove-human flag python3 process_samples.py 171009 --remove-human

Analysing processed data: Rmarkdown reports

Requires the following in the data folder:
- Output from samples to be processed by default or remove-human settings. e.g:
  - 171009_1_quasi_sorted.txt, 1 71009_1_quasi_frequency_matrix.txt for frequency matrix comparison, with all samples to be processed
  - 171009_1_{pipeline_name}_quasibam.txt for each pipeline
  - Easiest just to copy all of the *.txt files in results
- Quasibam .txt output from pipelines e.g. 171009_12_vicuna_bwa_quasibam.txt. This format is required.
- If running another batch of 171009 samples, change the has_true_positives <- params$date_prefix == "171009" in the first code chunk to be the new date prefix
- If vphaser has not been run, the vphaser section will be skipped
Navigate to root directory of the project.
Run python script with --reports flag, e.g. python3 process_samples.py 170908 --reports
- If you want to run a single pipeline only, adding the --pipeline flag and specify the pipeline combination. e.g. python3 process_samples.py 170908 --reports --pipeline vicuna_bwa
  - This is part of the filename, see the data requirements.
- You can also run a single pipeline by opening up the scripts/quasibam-pipeline_comparison.Rmd file in Rstudio, editing the header parameters and then knitting the document in Rstudio.
Reports will be generated in the reports folder, look at the html or add the folders and md to gitlab/github for them to be rendered.

Vphaser

Requires
- Samples to be processed by default or remove-human settings
- Loading of modules listed in default

Consensus gap

Requires: Same as default usage
Navigate to root directory of the project
Run sample processing on the cluster python3 process_samples.py 170908 --consensus-gap
Copy results/gap_files to local machine's project
Open the scripts/consensus-gap-filling.Rmd in Rstudio and edit the all_samples <- paste0("170908_", 1:18) line to fit the correct samples.
- Then knit in Rstudio

Example usage

On cluster:

python3 process_samples.py 170908
python3 process_samples.py 170908 --vphaser
python3 process_samples.py 171009 --remove-human

Locally after copying over results/ contents:

python3 process_samples.py 170908 --reports
python3 process_samples.py 171009 --reports --pipeline iva_bwa

Testing

Python unit tests were made for FASTA_conensus.py. From the root of the project run:

python3 -m pytest tests/ or pytest tests/

R unit tests were made for helper_functions. From the root of the project run:

Rscript tests/test_helper_functions.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

HCV pipeline evaluation

Starting dataset

Overview of analysis

Requirements

Running stages

Default usage

Human removal

Analysing processed data: Rmarkdown reports

Vphaser

Consensus gap

Example usage

Testing

Files

README.md

Latest commit

History

README.md

File metadata and controls

HCV pipeline evaluation

Starting dataset

Overview of analysis

Requirements

Running stages

Default usage

Human removal

Analysing processed data: Rmarkdown reports

Vphaser

Consensus gap

Example usage

Testing