genomech/FastContext

FastContext

Description

FastContext is a tool for identifying adapters and other sequence patterns in next-generation sequencing (NGS) data. The algorithm parses FastQ files (in single-end or paired-end mode), searches each read or read pair for user-specified patterns, and then generates a human-readable representation of the search results, which we call the "read structure". FastContext also gathers statistics on the frequency of occurrence of each read structure.

Installation

python3 -m pip install FastContext

Check installation:

FastContext --help

Usage

Arguments:

-1, --r1

Required.
Format: String
Description: FastQ input R1 file. May be uncompressed, gzipped or bzipped.
Usage: -1 input.fastq.gz

-p, --patterns

Required.
Description: Patterns to look for. The order of patterns is the order of search.
Format: Plain JavaScript object string (key-value). Names must contain 2-24 Latin letters, digits, and the characters - and _. Sequences must contain more than one symbol and consist only of A, T, G, and C.
Usage: -p '{"First": "CTCAGCGCTGAG", "Second": "AAAAAA", "Third": "GATC"}'
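A minimal Python sketch of how the -p value could be prepared and checked before a run. The two regular expressions are one reading of the naming and sequence rules above, not FastContext's own validation code, and the helper name `patterns_argument` is hypothetical.

```python
import json
import re

# Assumed reading of the name rule: 2-24 characters from Latin letters, digits, "-" and "_".
NAME_RE = re.compile(r"[A-Za-z0-9_-]{2,24}")
# Assumed reading of the sequence rule: at least two characters, A/T/G/C only.
SEQUENCE_RE = re.compile(r"[ATGC]{2,}")


def patterns_argument(patterns):
    """Validate a name -> sequence mapping and return the string to pass to -p."""
    for name, sequence in patterns.items():
        if not NAME_RE.fullmatch(name):
            raise ValueError(f"Bad pattern name: {name!r}")
        if not SEQUENCE_RE.fullmatch(sequence):
            raise ValueError(f"Bad sequence for pattern {name!r}: {sequence!r}")
    # dicts keep insertion order, so the order of search is preserved in the text
    return json.dumps(patterns)


print(patterns_argument({"First": "CTCAGCGCTGAG", "Second": "AAAAAA", "Third": "GATC"}))
```

Building the string with `json.dumps` avoids quoting mistakes when the patterns come from a script rather than being typed by hand.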

-s, --summary

Required.
Description: Output HTML file. Contains statistics summary in human-readable form.
Format: String
Usage: -s statistics.htm

-2, --r2

Description: FastQ input R2 file. May be uncompressed, gzipped or bzipped. Omit this option in single-end mode.
Format: String
Usage: -2 input_R2.fastq.gz

-j, --json

Description: Output JSON.GZ file (gzipped JSON). Contains extended statistics on the pattern sequences and on each read or read pair: read structure and Levenshtein distances (see -l option).
Format: String
Usage: -j statistics.json.gz

-k, --kmer-size

Description: Maximum length of an unrecognized sequence that is written as a K-mer with its length. Longer unrecognized sequences are written using the replacement name (see -u option).
Format: Non-negative Integer
Default: 0
Usage: -k 9

-u, --unrecognized

Description: Name that replaces long unrecognized sequences in the read structure.
Format: 2-24 Latin letters, digits, and the characters - and _
Default: unknown
Usage: -u genome

-m, --max-reads

Description: Maximum number of reads to analyze (0 means no limit). Note that analyzing more reads than recommended may cause memory overflow.
Format: Non-negative Integer
Default: 1000000
Usage: -m 1000

-f, --rate-floor

Description: Minimum rate a read structure must reach to be written into the statistics TSV table.
Format: Float from 0 to 1
Default: 0.001
Usage: -f 0.1

-@, --threads

Description: Number of threads.
Format: Non-negative integer less than 2 * cpu_count()
Default: cpu_count()
Usage: -@ 10

-d, --dont-check-read-names

Description: Don't check read names. Use this if you have unusual (non-Illumina) paired read names. Makes sense only in paired-end mode.
Usage: -d

-l, --levenshtein

Description: Calculate the Levenshtein distance of each pattern at each position in the read. Results are written into the extended statistics file (JSON.GZ). Note that this greatly increases processing time.
Usage: -l
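To illustrate what a per-position distance profile can look like, here is a minimal Python sketch. It assumes that "each position" means comparing the pattern with the read window of the same length starting at that position, and that positions where the window no longer fits are reported as NaN. This is an interpretation, not FastContext's actual implementation, and `distance_profile` is a hypothetical name.

```python
import math


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_profile(read, pattern):
    """Distance between the pattern and each same-length window of the read."""
    size = len(pattern)
    profile = []
    for start in range(len(read)):
        window = read[start:start + size]
        # Assumption: windows shorter than the pattern are reported as NaN
        profile.append(levenshtein(window, pattern) if len(window) == size else math.nan)
    return profile


profile = distance_profile("TTGATCAA", "GATC")
print(profile.index(0))  # position of the exact match
```

Each read costs roughly read length times pattern length squared operations per pattern and strand in this naive form, which gives a feel for why the option is slow.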

-h, --help

Description: Show help message and exit.
Usage: -h

-v, --version

Description: Show program's version number and exit.
Usage: -v
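The options above can be combined into one call. The following Python sketch assembles a paired-end command line using only the flags documented in this section. The helper name `fastcontext_command` is hypothetical, and the file names are the placeholder names from the Usage lines.

```python
import json
import shlex


def fastcontext_command(r1, patterns, summary, r2=None, json_out=None,
                        kmer_size=None, max_reads=None, levenshtein=False):
    """Assemble an argument list from the options documented above."""
    command = ["FastContext", "-1", r1, "-p", json.dumps(patterns), "-s", summary]
    if r2 is not None:
        command += ["-2", r2]
    if json_out is not None:
        command += ["-j", json_out]
    if kmer_size is not None:
        command += ["-k", str(kmer_size)]
    if max_reads is not None:
        command += ["-m", str(max_reads)]
    if levenshtein:
        command.append("-l")
    return command


command = fastcontext_command(
    "input.fastq.gz",
    {"First": "CTCAGCGCTGAG", "Second": "AAAAAA", "Third": "GATC"},
    "statistics.htm",
    r2="input_R2.fastq.gz",
    json_out="statistics.json.gz",
    kmer_size=9,
    max_reads=1000,
)
# Print a shell-quoted version for inspection or copy-paste
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing the arguments as a list keeps the JSON pattern string intact without manual shell quoting.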

Examples

Summary statistics table

Contains counts, percentages, and read structures. Within a read structure element, the text after the colon is either the pattern strand (F for forward, R for reverse) or the K-mer length.

Example:

R1

Count    Percentage    Read Structure
5,197    48.807        {unknown}
3,297    30.963        {unknown}--{oligme:F}--{oligb:F}--{701:F}--{unknown}
114      1.070         {unknown}--{oligb:F}--{701:F}--{unknown}
71       0.666         {unknown}--{oligme:F}--{unknown}
69       0.648         {unknown}--{oligme:F}--{unknown}--{701:F}--{unknown}
60       0.563         {unknown}--{oligme:F}--{oligb:F}--{701:F}--{kmer:14bp}

R2

Count    Percentage    Read Structure
7,545    70.858        {unknown}
616      5.785         {unknown}--{oligme:F}--{oliga:R}--{502:R}--{unknown}
540      5.071         {unknown}--{oligme:F}--{unknown}
441      4.141         {unknown}--{oligme:F}--{oliga:R}--{unknown}
298      2.798         {unknown}--{oliga:R}--{unknown}
263      2.469         {unknown}--{502:R}--{unknown}
233      2.188         {unknown}--{oligme:F}--{kmer:14bp}--{502:R}--{unknown}
163      1.530         {unknown}--{oliga:R}--{502:R}--{unknown}
56       0.525         {unknown}--{502:F}--{unknown}
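The text form of a read structure is easy to split back into its elements. The Python sketch below does so under a few assumptions drawn from the tables above: an F or R suffix marks a pattern strand, a suffix such as 14bp marks a K-mer, and an element without a suffix is an unrecognized sequence. The function name and the "kmer" element type are illustrative, not part of FastContext.

```python
import re

# One element: {name} or {name:suffix}
ELEMENT_RE = re.compile(r"\{([^{}:]+)(?::([^{}]+))?\}")


def parse_read_structure(text):
    """Turn '{unknown}--{oligme:F}--{kmer:14bp}' into a list of dicts."""
    elements = []
    # findall yields '' for the suffix when an element has no colon part
    for name, suffix in ELEMENT_RE.findall(text):
        if suffix in ("F", "R"):
            elements.append({"type": "pattern", "name": name, "strand": suffix})
        elif suffix.endswith("bp") and suffix[:-2].isdigit():
            # Assumed element type for K-mers; the length is given in base pairs
            elements.append({"type": "kmer", "length": int(suffix[:-2])})
        else:
            elements.append({"type": "unrecognized", "name": name})
    return elements


for element in parse_read_structure("{unknown}--{oligme:F}--{oligb:F}--{701:F}--{kmer:14bp}"):
    print(element)
```

Matching whole elements rather than splitting on "--" keeps the parser safe for pattern names that contain hyphens.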

Extended statistics JSON.GZ file

Contains extended statistics: run options, performance, pattern analysis, the full summary without the rate floor applied, and the analysis of each read. The example is shortened.

{
	"FastQ": {
		"R1": "tests/standard_test_R1.fastq.gz",
		"R2": "tests/standard_test_R2.fastq.gz"
	},
	"RunData": {
		"Read Type": "Paired-end",
		"Max Reads": 100,
		"Rate Floor": 0.001
	},
	"Performance": {
		"Reads Analyzed": 100,
		"Threads": 4,
		"Started": "2022-07-13T18:15:48.277660",
		"Finished": "2022-07-13T18:15:48.964721"
	},
	"PatternsData": {
		"PatternsList": {
			"oligme": {
				"F": "CTGTCTCTTATACACATCT",
				"R": "AGATGTGTATAAGAGACAG",
				"Length": 19
			},
			"s502": {
				"F": "CTCTCTAT",
				"R": "ATAGAGAG",
				"Length": 8
			}
		},
		"PatternsAnalysis": [
			{
				"Analysis": "reverse complement only",
				"FirstPattern": "oligme",
				"SecondPattern": "oligme",
				"FirstLength": 19,
				"SecondLength": 19,
				"LevenshteinAbsolute": 11,
				"LevenshteinSimilarity": 0.42105263157894735,
				"Type": "good",
				"Risk": "low"
			},
			{
				"Analysis": "full",
				"FirstPattern": "oligme",
				"SecondPattern": "s502",
				"FirstLength": 19,
				"SecondLength": 8,
				"LevenshteinAbsolute": 2,
				"LevenshteinSimilarity": 0.75,
				"Type": "nested",
				"Risk": "medium"
			}
		],
		"Other": {
			"Unrecognized Sequence": "unknown",
			"K-mer Max Size": 15
		}
	},
	"Summary": {
		"R1": {
			"{unknown}--{oligme:F}--{oligb:F}--{s701:F}--{unknown}": {
				"Count": 34,
				"Percentage": 34.0,
				"ReadStructure": [
					{ "type": "unrecognized" },
					{ "type": "pattern", "name": "oligme", "strand": "F" },
					{ "type": "pattern", "name": "oligb", "strand": "F" },
					{ "type": "pattern", "name": "s701", "strand": "F" },
					{ "type": "unrecognized" }
				]
			},
			"{unknown}--{oligme:F}--{unknown}--{s701:F}--{unknown}": [ "..." ],
			"{unknown}--{s701:F}--{unknown}": [ "..." ]
		},
		"R2": [ "..." ]
	},
	"RawDataset": [
		{
			"Name": "M02435:112:000000000-DFC9M:1:1101:14970:1484",
			"R1": {
				"Sequence": "ACCTAGAAGAGCCAAAAGACTCT...AATCTCGTATGCCGTCT",
				"PhredQual": [29,32,32,33,33,37,37,37,37,"...",38,38,38,13],
				"Levenshtein": [
					{
						"name": "oligme",
						"strand": "F",
						"length": 19,
						"values": [14,14,12,13,12,12,12,"...",NaN,NaN,NaN]
					},
					{
						"name": "oligme",
						"strand": "R",
						"length": 19,
						"values": [12,11,10,9,9,9,10,10,"...",NaN,NaN,NaN]
					}
				],
				"ReadStructure": [
					{ "type": "unrecognized" },
					{ "type": "pattern", "name": "oligme", "strand": "F" },
					{ "type": "pattern", "name": "oligb", "strand": "F" },
					{ "type": "pattern", "name": "s701", "strand": "F" },
					{ "type": "unrecognized" }
				],
				"TextReadStructure": "{unknown}--{oligme:F}--{oligb:F}--{s701:F}--{unknown}"
			},
			"R2": "..." 
		}
	]
}
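A file like the one above can be read back with the Python standard library. This is a minimal sketch: the key names ("Summary", "Count", "Percentage") are taken from the shortened example, the helper names are hypothetical, and "Percentage" is assumed to be on a 0-100 scale (34 of 100 reads is shown as 34.0).

```python
import gzip
import json


def load_statistics(path):
    """Read a gzipped JSON statistics file into a dict."""
    # Python's json module accepts the NaN literals used in the Levenshtein arrays
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        return json.load(stream)


def frequent_structures(statistics, read="R1", rate_floor=0.001):
    """List (structure, count, percentage) rows at or above the rate floor."""
    summary = statistics["Summary"][read]
    rows = [
        (text, entry["Count"], entry["Percentage"])
        for text, entry in summary.items()
        if entry["Percentage"] / 100 >= rate_floor
    ]
    # Most frequent structures first
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Usage, assuming a file written with -j statistics.json.gz:
# statistics = load_statistics("statistics.json.gz")
# for text, count, percentage in frequent_structures(statistics, "R1"):
#     print(count, percentage, text)
```

Because the JSON summary is stored without the rate floor, the filter can be re-applied with a different threshold without re-running the analysis.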