Skip to content

Scripts

Qiyun Zhu edited this page May 21, 2023 · 7 revisions

This scripts directory hosts multiple Python scripts for analysis and formatting outside the BinaRena main program. They are useful for preparing the input files for BinaRena. Each script has a command-line interface when you execute them. This page describes their typical usages.

Basic sequence information

The script sequence_basics.py obtains length, GC content, and coverage (if applicable) from contig sequences (multi-FASTA format). Example:

sequence_basics.py -i contigs.fna -o basic.tsv

When the input file was generated by SPAdes or MEGAHIT, the script will automatically recognize and extract the coverage values from the sequence titles.

  • Note: This "coverage" is not the same as one would get from mapping reads to contigs.

Alternatively, one can "pipe" the input and output files. This mechanism is available for most scripts here.

zcat contigs.fna.gz | sequence_basics.py | gzip > basic.tsv.gz

One can specify a minimum contig length threshold (say 1 kb):

sequence_basics.py -i contigs.fna -l 1000 -o basic.tsv

Contig annotation

orf_to_contig.py converts an ORF-to-feature mapping into a contig-to-feature(s) mapping.

orf_to_contig.py orf-to-gene.map > ctg-to-genes.map

The resulting mapping file can be appended to a dataset as a "feature set" column for BinaRena.

The input file needs to be a two-column table. Multiple common formats in bioinformatics can be converted to this simple format using some Linux commands, simple or complex, before feeding into this script. Here are some examples:

cat blast.out6 | cut -f1,2 | orf_to_contig.py > output.map
samtools view -F 4 bowtie2.bam | cut -f1,3 | orf_to_contig.py > output.map
cat pfam.tblout | grep -v ^'#' | tr -s ' ' | cut -d ' ' -f1,5 | tr ' ' '\t' | orf_to_contig.py > output.map

For GFF, a common genome annotation format, there is gff_to_features.py to extract certain attributes and to generate a mapping file:

gff_to_features.py -i prokka.gff -t gene -o gene.map

k-mer frequencies

count_kmers.py calculates k-mer frequencies of contig sequences. For example:

count_kmers.py -i contigs.fna -k 4 -o tetramers.tsv

This is an exact k-mer counter. The frequencies of all k-mers will be reported. Both forward and reverse strands are considered. k-mers with non-ACGT characters are discarded. Upper and lower cases are both fine.

The output is a tab-separated table with rows as sequence identifiers and columns as all possible k-mers (including unobserved ones).

  • Note: This k-mer counter is optimized for small k-values (k = 4, 5, 6...) and many sequences, which are typical for the task of contig binning. It is not efficient for large k-values (e.g., k = 35).

  • Note: The output file may be large, especially for relatively big k's. One may consider piping the output to a downstream application (see below).

Dimensionality reduction

reduce_dimension.py is a pipeline for reducing the dimensionality of certain contig properties (such as k-mer frequencies and per-sample coverages) such that they can be visualized in a low-dimension space (e.g., a 2D or 3D scatter plot). Three commonly used dimensionality reduction methods are implemented, including principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP).

This script requires Python libraries scikit-learn, and (optional, only for UMAP) umap-learn. You need to install them first:

conda install -c conda-forge scikit-learn umap-learn

Then you can run the script. Example:

reduce_dimension.py -i kmer_freqs.tsv --pca --tsne --umap -o output

The input is a table (TSV), with rows as contigs and columns as properties. The output files will be output.pca.tsv, output.tsne.tsv, output.umap.tsv.

You may connect count_kmers.py and reduce_dimension.py using a pipe, such that the intermediate k-mer frequencies file (which is large) can be skipped. This is efficient for data-intensive analyses. Example:

count_kmers.py -i contigs.fna -k 5 | reduce_dimension.py --pca --tsne --umap -o output

Details of the pipeline:

  1. Add a pseudocount (default: 1) to any feature that has zeros.
  2. Perform centered log-ratio transform (CLR) on each feature.
  3. Perform PCA on the transformed data.
  4. If there are >200 features, perform PCA to reduce the transformed data to 50 principal components.
  5. Perform t-SNE on the reduced data.
  6. Perform UMAP on the reduced data.

Lineage string conversion

Lineage strings like the following are widely used in microbiome research to denote the taxonomic classification of biological entities at 7-8 fixed ranks. Many software tools, such as QIIME 2, GTDB-tk and MetaPhlAn, support exporting taxonomy files in this format.

ctg1 <tab> d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia; s__Escherichia coli

lineage_to_table.py converts a contig-to-lineage mapping file into a table, with each column as a taxonomic rank:

lineage_to_table.py lineage.map > taxonomy.tsv

The output file will be like:

ID domain phylum class order family genus species
ctg1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli

Kraken taxonomic profile

Kraken2 is a widely used tool for assigning taxonomy to contig sequences. You can run Kraken2 like this (see the Kraken2 manual for more details):

kraken2 --db db_dir contigs.fna --output kraken.output --report kraken.report

The two output files, kraken.output is a mapping of contig IDs to taxonomy IDs; kraken.report is a hierarchical taxonomic profile of the sample.

You can use kraken_to_table.py to convert the result into a table, with rows as contig IDs and columns as taxonomic ranks:

kraken_to_table.py -i kraken.output -r kraken.report -o taxonomy.tsv

If you don't have the Kraken report files, but have the original NCBI taxdump, you can do this instead (the directory taxdump_dir should contain nodes.dmp and names.dmp):

kraken_to_table.py -i kraken.output -d taxdump_dir -o taxonomy.tsv

CheckM marker genes

CheckM is the standard tool for evaluating bin qualities. It calculates completeness and contamination (redundancy) of each bin by assessing the distribution of lineage-specific marker genes. The scripts checkm_marker_map.py and checkm_marker_list.py can reformat the CheckM output files such that they can enable BinaRena to calculate completeness and redundancy in real time.

Step 1. Have the entire set of contigs (unbinned) in a multi-FASTA file input/contigs.fna.

  • Note: This is slightly different from the typical use of CheckM, which should run on individual bins. You can do that too, if you are only interested in binned contigs.

Step 2. Determine which target taxon should be analyzed. For example, domain Bacteria (which is quite generic) is used in the following example. (A list of available taxa can be found using checkm taxon_list.)

Step 3. Run CheckM taxonomic-specific workflow as recommended:

checkm taxonomy_wf domain Bacteria input output

Step 4. Convert the CheckM output into a contig-to-markers map using checkm_marker_map.py:

checkm_marker_map.py output/storage/marker_gene_stats.tsv > Bacteria.map

The output file Bacteria.map can be imported into BinaRena as a "feature set" field.

Step 5. Convert a CheckM marker set definition into a marker list using checkm_marker_list.py:

checkm_marker_list.py output/Bacteria.ms > Bacteria.lst

The output file Bacteria.lst can be imported into BinaRena as a feature group membership list.

Step 6. You can now use the "feature group" menu item to calculate completeness and redundancy (contamination) scores of selection contigs on the fly! See details.

  • Tip: You may load multiple marker gene sets into BinaRena and calculate using each of them.

  • Note: The output values are analogous to CheckM's completeness and contamination scores, and they can be interpreted in a similar way. However there is one difference: CheckM considers the collocation of marker genes when calculating these metrics. BinaRena does not, however, and the output values are based on a plain list of features. Therefore the results may be different, although they are highly correlated.

  • Note: Also, BinaRena can only analyze specific marker gene sets (in this example: Bacteria), unlike CheckM's lineage_wf which can automatically determine the lineage of each bin and use the corresponding lineage-specific marker gene set for that bin.