
HTSQualC

HTSQualC is an automated quality control analysis tool for single and paired-end high-throughput sequencing (HTS) data generated from Illumina sequencing platforms.

Features

  • Simultaneously filter and/or trim reads for adapter or primer contamination, uncalled bases (N), and low-quality reads
  • Supports single and paired-end reads
  • Analyze multiple samples simultaneously
  • Parallel computation for accelerating the speed of analysis
  • Visualization and statistics
  • Docker image is available
  • Available on CyVerse Discovery Environment (DE)
  • No dependency on an external open-source tool

Getting Started

Prerequisites

You need Python 3 (tested on 3.6 and 3.7) to install and run HTSQualC. The following Python 3 packages must be installed before running HTSQualC. If they are not installed, HTSQualC will guide you through installing them.

numpy
pysam
matplotlib
termcolor
datetime
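As a quick pre-flight check, the sketch below reports which of the packages listed above are not importable in the current environment. It uses only the standard library; the helper name `missing_packages` is hypothetical and is not part of HTSQualC.

```python
import importlib.util

# Packages listed in the Prerequisites section above.
REQUIRED = ["numpy", "pysam", "matplotlib", "termcolor", "datetime"]


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any package reported as missing can then be installed with the package manager of your choice before running HTSQualC.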

Installing

Clone HTSQualC using the following command:

git clone https://github.com/reneshbedre/HTSQualC.git

To install HTSQualC, run the following command in the root folder:

python setup.py install

Alternatively, install using conda:

conda install -c bioconda htseqqc

How to use

Print the help message to see all required and optional parameters:

filter.py -h
usage: filter.py [-h] [-a INPUT_FILES_1] [-b INPUT_FILES_2] [-c QUAL_FMT]
                 [-e N_CONT] [-f ADPT_SEQS] [-d MIN_SIZE] [-g ADPT_MATCH]
                 [-i QUAL_THRESH] [-n TRIM_OPT] [-p WIND_SIZE]
                 [-r MIN_LEN_FILT] [-q CPU] [-m OUT_FMT] [-v VIS_OPT] [-z COMPRESS]
                 [--version]

Quality control analysis of single and paired-end sequence data


optional arguments:
  -h, --help            show this help message and exit
  -a INPUT_FILES_1, --p1 INPUT_FILES_1
                        Single end input files or left files for paired-end
                        data (.fastq, .fq). Multiple sample files must be
                        separated by comma or space
  -b INPUT_FILES_2, --p2 INPUT_FILES_2
                        Right files for paired-end data (.fastq, .fq).
                        Multiple files must be separated by comma or space
  -c QUAL_FMT, --qfmt QUAL_FMT
                        Quality value format [1= Illumina 1.8, 2= Illumina
                        1.3,3= Sanger]. If quality format not provided, it
                        will automatically detect based on sequence data
  -e N_CONT, --nb N_CONT
                        Filter the reads containing given % of uncalled bases
                        (N)
  -f ADPT_SEQS, --adp ADPT_SEQS
                        Trim the adapter and truncate the read sequence
                        (multiple adapter sequences must be separated by
                        comma)
  -d MIN_SIZE, --msz MIN_SIZE
                        Filter the reads which are lesser than minimum size
  -g ADPT_MATCH, --per ADPT_MATCH
                        Truncate the read sequence if it matches to adapter
                        sequence equal or more than given percent (0.0-1.0)
                        [default=0.9]
  -i QUAL_THRESH, --qthr QUAL_THRESH
                        Filter the read sequence if average quality of bases
                        in reads is lower than threshold (1-40) [default:20]
  -n TRIM_OPT, --trim TRIM_OPT
                        If trim option set to True, the reads with low quality
                        (as defined by option --qthr) will be trimmed instead
                        of discarding [True|False] [default: False]
  -p WIND_SIZE, --wsz WIND_SIZE
                        The window size for trimming (5->3) the reads. This
                        option should always set when -trim option is defined
                        [default: 5]
  -r MIN_LEN_FILT, --mlk MIN_LEN_FILT
                        Minimum length of the reads to retain after trimming
  -q CPU, --cpu CPU     Number of CPU [default:2]
  -m OUT_FMT, --ofmt OUT_FMT
                        Output file format (fastq/fasta) [default:fastq]
  -v VIS_OPT, --no-vis VIS_OPT
                        No figures will be produced [True|False]
                        [default:False]
  -z COMPRESS, --compress COMPRESS
                        Compress (.gz) the filtered FASTQ output [True|False]   
                        [default:False]                   
  --version             show program's version number and exit
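To make the meaning of the `--qthr` and `--nb` thresholds concrete, here is a minimal sketch of a per-read filter. It is not HTSQualC's implementation. It assumes Phred+33 quality encoding, and it assumes that a read is dropped when its percentage of N bases is greater than or equal to the `--nb` value; the help text does not state whether the comparison is inclusive. The function names are hypothetical.

```python
def mean_phred(quality, offset=33):
    """Average Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def percent_n(sequence):
    """Percentage of uncalled bases (N) in a read."""
    if not sequence:
        return 0.0
    return 100.0 * sequence.upper().count("N") / len(sequence)


def keep_read(sequence, quality, qual_thresh=20, n_cont=None):
    """Decide whether a read passes the two filters.

    qual_thresh mirrors --qthr (default 20 in the help text).
    n_cont mirrors --nb; None means the N filter is not applied.
    """
    # Discard reads whose average base quality is below the threshold.
    if mean_phred(quality) < qual_thresh:
        return False
    # Assumption: discard reads with at least n_cont percent N bases.
    if n_cont is not None and percent_n(sequence) >= n_cont:
        return False
    return True


if __name__ == "__main__":
    print(keep_read("ACGTACGT", "IIIIIIII"))            # high quality, no N
    print(keep_read("ACGNACGT", "IIIIIIII", n_cont=5))  # 12.5% N bases
```

With `--trim True`, the help text says low-quality reads are trimmed rather than discarded, so the first check above would be replaced by a windowed trimming step that this sketch does not cover.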

Filter single-end reads

# for single sample
filter.py OPTIONS -a fastq_file

# for multiple samples
filter.py OPTIONS -a fastq_file_1,fastq_file_2

Filter paired-end reads

# for single sample
filter.py OPTIONS -a fastq_file_left -b fastq_file_right
# for multiple samples
filter.py OPTIONS -a fastq_file_left_1,fastq_file_left_2 -b fastq_file_right_1,fastq_file_right_2
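When many samples are involved, the comma-separated arguments can be assembled programmatically. The sketch below builds the argument list for `filter.py` from lists of left and right files, using only the `-a` and `-b` options documented in the help message. The helper `build_filter_command` and the file names are hypothetical.

```python
def build_filter_command(left_files, right_files=None, options=()):
    """Build a filter.py argument list for one or more samples.

    left_files:  single-end files, or left files for paired-end data.
    right_files: right files for paired-end data, in the same sample order.
    options:     extra arguments passed through unchanged, e.g. ["--cpu", "4"].
    """
    if not left_files:
        raise ValueError("at least one input file is required")
    if right_files is not None and len(left_files) != len(right_files):
        raise ValueError("left and right file lists must have the same length")

    # Multiple sample files are separated by commas, as in the examples above.
    cmd = ["filter.py", *options, "-a", ",".join(left_files)]
    if right_files:
        cmd += ["-b", ",".join(right_files)]
    return cmd


if __name__ == "__main__":
    cmd = build_filter_command(
        ["sample1_1.fastq", "sample2_1.fastq"],
        ["sample1_2.fastq", "sample2_2.fastq"],
        options=["--cpu", "4"],
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the analysis, provided `filter.py` is on your `PATH`.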

Output

HTSQualC produces the filtered, cleaned HTS data as FASTQ/FASTA files, together with statistics and visualizations of the filtered datasets. The output is saved in a folder whose name ends with filtering_out.
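After a run, the output folders can be located by their name suffix. The sketch below lists directories whose names end with `filtering_out` under a given root. It assumes the folders are created directly inside that root (for example, the working directory), which the text above does not specify; the helper name `find_output_dirs` is hypothetical.

```python
from pathlib import Path


def find_output_dirs(root="."):
    """Return directories directly under root whose names end with 'filtering_out'."""
    return sorted(
        p for p in Path(root).iterdir()
        if p.is_dir() and p.name.endswith("filtering_out")
    )


if __name__ == "__main__":
    for out_dir in find_output_dirs("."):
        print(out_dir)
```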

License

This project is available under the MIT License. See the LICENSE file for complete details.

HTSQualC Analysis commands used for test datasets

Download the paired-end and single-end test data using the NCBI SRA Toolkit

fastq-dump --split-files SRR2165176
fastq-dump --split-files SRR2165177
fastq-dump --split-files SRR2165178
fastq-dump  SRR1805340

Run HTSQualC as a command line tool (Linux and Mac)

  • for paired-end data with default parameters (setting 1)

filter.py --cpu 18 --p1 SRR2165176_1.fastq --p2 SRR2165176_2.fastq

  • for paired-end data with quality threshold, adapter sequence, and uncalled base parameters (setting 2)

filter.py --cpu 18 --qthr 25 --nb 5 --adp AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --p1 SRR2165176_1.fastq --p2 SRR2165176_2.fastq

  • for paired-end data with default parameters and multiple samples (setting 3)

filter.py --cpu 18 --p1 SRR2165176_1.fastq,SRR2165177_1.fastq,SRR2165178_1.fastq --p2 SRR2165176_2.fastq,SRR2165177_2.fastq,SRR2165178_2.fastq

  • for single-end data with default parameters (setting 4)

filter.py --cpu 18 --p1 SRR1805340.fastq
