Project at Public Health England for STP Elective.
De novo HCV pipeline used to determine species, quasiaspecies and drug resistance set up to replace Sanger sequencing. As Sanger is not as sensitive as the NGS-based de novo pipeline, this makes an evaluation of determining the sample composition against a known dataset unavilable.
FASTAs were made to simulate samples with different compositions. These have been turned into fastqs using ART.
- Create a consensus sequence from the FASTAs themselves, and also create a frequency matrix from them (scripts/FASTA_consensus.py, run by process_samples.py)
- Then align the synthetic FASTQs (used BWA) to the conensus sequence and then create a quasibam for the consensus (process_samples.py)
- Compare the frequency differences between the consensus frequency matrix and the consensus quasibam (scripts/frequency-matrix_quaisbam_comparison.Rmd)
- Compare each pipeline quasibam to the consensus frequency matrix (quasibam-pipeline_comparison.Rmd)
- Python 3.5+ required for scripts, have not tested on anything lower. Requirements for pip in pip_requirements.txt
- Rmd reports, issues with proxy that didn't have time to sort so can't use Bioconductor BiocLite or packrat/anaconda for package management. Follow steps in scripts/R-setup.R
- R 3.4+
- Ideally I'd use packrat or an anaconda package, but issues with proxy
- Tidyverse and Biostrings (Biostrings is a bioconductor package)
- Rstudio if doing single Rmarkdown reports
- Pandoc for knitting Rmarkdown
- For consensus gap, the following in
pipeline-resources/
hcv.fasta
and associated fileslastz-distrib/bin/lastz
and associated fileslastz_bestref.pl
lastz_analyser.WITH_REVCOMP.pl
genome_maker2b.pl
cons_mv.pl
N_remover_from_consensus.pl
majvarcheck2.pl
- Module files show versions of each tool used
- If running any processing, do this on the cluster
- Then copy
results
folder to local machine and run the report generation- Had issues with pandoc and R package installation on the cluster and didn't have time to smooth it out.
- Requires Data in
results/
:- Samples following the same YYMMDD_N prefix (N = sample number):
- FASTA file with each individual synthetic virus named in
{YYMMDD_N}_quasi.fas
format - FASTQ files from derived from the FASTA named in
{YYMMDD_N}_R1.fq
and{YYMMDD_N}_R2.fq
formats.
- Load modules:
module load anaconda/4.2.1_python3 module load phe/quasi_bam/2-3 module load samtools # 0.1.19-44428cd module load bwa # 0.7.9a-r786 module load smalt/0.7.6 module load blast+/2.2.27 module load vphaser # 2.0
- Navigate to root directory of the project.
- Run sample processing on the cluster, e.g.
python3 process_samples.py 170908
- Same as default but run with --remove-human flag
python3 process_samples.py 171009 --remove-human
- Requires the following in the
data
folder:- Output from samples to be processed by default or remove-human settings. e.g:
171009_1_quasi_sorted.txt
,1 71009_1_quasi_frequency_matrix.txt
for frequency matrix comparison, with all samples to be processed171009_1_{pipeline_name}_quasibam.txt
for each pipeline- Easiest just to copy all of the *.txt files in results
- Quasibam .txt output from pipelines
e.g.
171009_12_vicuna_bwa_quasibam.txt
. This format is required. - If running another batch of 171009 samples, change the
has_true_positives <- params$date_prefix == "171009"
in the first code chunk to be the new date prefix - If vphaser has not been run, the vphaser section will be skipped
- Output from samples to be processed by default or remove-human settings. e.g:
- Navigate to root directory of the project.
- Run python script with --reports flag, e.g.
python3 process_samples.py 170908 --reports
- If you want to run a single pipeline only, adding the --pipeline flag and specify the pipeline combination.
e.g.
python3 process_samples.py 170908 --reports --pipeline vicuna_bwa
- This is part of the filename, see the data requirements.
- You can also run a single pipeline by opening up the
scripts/quasibam-pipeline_comparison.Rmd
file in Rstudio, editing the header parameters and then knitting the document in Rstudio.
- If you want to run a single pipeline only, adding the --pipeline flag and specify the pipeline combination.
e.g.
- Reports will be generated in the
reports
folder, look at the html or add the folders and md to gitlab/github for them to be rendered.
- Requires
- Samples to be processed by default or remove-human settings
- Loading of modules listed in default
- Requires: Same as default usage
- Navigate to root directory of the project
- Run sample processing on the cluster
python3 process_samples.py 170908 --consensus-gap
- Copy
results/gap_files
to local machine's project - Open the
scripts/consensus-gap-filling.Rmd
in Rstudio and edit theall_samples <- paste0("170908_", 1:18)
line to fit the correct samples.- Then knit in Rstudio
On cluster:
python3 process_samples.py 170908
python3 process_samples.py 170908 --vphaser
python3 process_samples.py 171009 --remove-human
Locally after copying over results/
contents:
python3 process_samples.py 170908 --reports
python3 process_samples.py 171009 --reports --pipeline iva_bwa
Python unit tests were made for FASTA_conensus.py. From the root of the project run:
python3 -m pytest tests/
orpytest tests/
R unit tests were made for helper_functions. From the root of the project run:
Rscript tests/test_helper_functions.R