Taxonomic Workflows

sarahet edited this page Jul 17, 2023 · 22 revisions

If you are using LAMBDA in taxonomic workflows, you might want to make use of some of the following features:

Printing taxonomic IDs of subject sequences

During the indexing step

You only need to do this once:

  1. Make sure your subject sequences contain accession numbers; GIs are not supported. The following accession numbers are automatically detected and extracted from fasta/fastq headers:
  2. Download a mapping file from the NCBI (make sure it's the correct one) or from UniProt (UniProt supported since lambda-1.9.3).
  3. Rebuild your index, but add --acc-tax-map /path/to/file.accession2taxid[.gz] (you don't have to unzip the file).
    • Building the new index will take longer, but it only increases the index's size by a few MBs.
    • If LAMBDA fails to assign most of your sequences to taxa, it will warn you!
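
Before spending time on an index rebuild, it can help to estimate how many subject sequences the mapping file actually covers. The following is a minimal Python sketch of such a pre-check, not part of LAMBDA. It assumes the NCBI accession2taxid layout (a header line, then tab-separated accession, accession.version, taxid, gi) and that the first whitespace-delimited token of each FASTA header is the accession.version. All function names and the demo data are hypothetical, and UniProt mapping files are not covered.

```python
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed text file (hypothetical helper)."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def fasta_accessions(handle):
    """Yield the first whitespace-delimited token of each FASTA header.

    Assumption: that token is the accession.version identifier.
    """
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def read_acc2taxid(handle, wanted):
    """Return {accession.version: taxid} for the accessions in `wanted`.

    Assumes the NCBI layout: a header line, then tab-separated
    accession, accession.version, taxid, gi.
    """
    mapping = {}
    next(handle, None)  # skip the header line
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[1] in wanted:
            mapping[fields[1]] = int(fields[2])
    return mapping


def mapped_fraction(accessions, mapping):
    """Fraction of accessions that received a taxid."""
    accessions = list(accessions)
    if not accessions:
        return 0.0
    return sum(acc in mapping for acc in accessions) / len(accessions)


# Made-up demo data; real input would come from open_text(...).
demo_fasta = io.StringIO(">ACC00001.1 first\nMKV\n>ACC00002.1 second\nMAA\n")
demo_map = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "ACC00001\tACC00001.1\t20\t0\n"
)
accessions = list(fasta_accessions(demo_fasta))
mapping = read_acc2taxid(demo_map, set(accessions))
fraction = mapped_fraction(accessions, mapping)
print(f"{fraction:.0%} of subject sequences have a taxid")
```

Only the accessions present in the FASTA file are kept in memory, because the full NCBI mapping files are very large.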

When running LAMBDA

You need to tell it to print the taxonomic information:

  • for the tabular BLAST Output Formats, specify e.g. --output-columns 'std staxids'.
  • for the SAMTOOLS Output Formats, specify e.g. --sam-bam-tags 'AS NM ae ai qf st' (the last tag is the important one).
  • there is no impact on the run-time of lambda.

Note that this does not perform any taxonomic binning. You only get the taxa of the subject sequences of your individual matches, i.e. staxids is a per-match specifier.
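
As an illustration of the per-match nature of staxids, the following Python sketch collects the taxonomic IDs of each match, grouped by query. It assumes that 'std staxids' yields the 12 standard BLAST tabular columns followed by staxids as column 13, and that multiple IDs in one field are separated by semicolons as in BLAST. Both points are assumptions about the output layout, and the function name and demo rows are made up.

```python
import csv
import io
from collections import defaultdict

# The 12 columns that 'std' stands for in BLAST tabular output.
STD_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def taxids_per_query(handle):
    """Return {query id: [taxid list of match 1, taxid list of match 2, ...]}."""
    result = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip empty lines, comment lines and rows without a staxids column.
        if not row or row[0].startswith("#") or len(row) <= len(STD_COLUMNS):
            continue
        field = row[len(STD_COLUMNS)]
        # Assumption: several IDs in one field are separated by ';'.
        taxids = [int(t) for t in field.split(";") if t.strip().isdigit()]
        result[row[0]].append(taxids)
    return dict(result)


def demo_row(query, subject, staxids):
    """Build one made-up tabular line with placeholder alignment values."""
    return "\t".join([query, subject, "98.0", "50", "1", "0",
                      "1", "50", "10", "59", "1e-20", "95.0", staxids])


demo_output = io.StringIO("\n".join([
    demo_row("read1", "subjA", "20"),
    demo_row("read1", "subjB", "20;21"),
    demo_row("read2", "subjC", ""),  # subject without an assigned taxon
]) + "\n")

per_query = taxids_per_query(demo_output)
print(per_query)
```

Each inner list belongs to exactly one match. Nothing is combined across matches at this stage.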

LCA computation / taxonomic binning

LAMBDA can do taxonomic binning, i.e. it will compute the lowest common ancestor (LCA) taxon of all matches of one query sequence. This helps with taxonomic assessment, but note that LAMBDA does not perform any statistical evaluation or weighting of matches. Other tools like SLIMM do a more complex analysis.
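
To make the idea concrete, here is a minimal Python sketch of an unweighted LCA over a child-to-parent map. It illustrates the concept only and is not LAMBDA's implementation. The toy taxonomy, its IDs and the function names are made up, and the root is assumed to be its own parent (as taxon 1 is in the NCBI taxonomy).

```python
def lineage(taxid, parent):
    """Return the path from `taxid` up to the root, both inclusive."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:  # the root is its own parent
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxids, parent):
    """Return the deepest taxon shared by the lineages of all `taxids`."""
    taxids = list(taxids)
    if not taxids:
        return None
    # Start with the first lineage, ordered from the taxon up to the root.
    common = lineage(taxids[0], parent)
    for taxid in taxids[1:]:
        other = set(lineage(taxid, parent))
        # Keep only shared ancestors; the deepest-first order is preserved.
        common = [t for t in common if t in other]
    return common[0]  # never empty, because every lineage contains the root


# Made-up toy taxonomy: 1 is the root, 20 and 21 are siblings below 10.
TOY_PARENT = {1: 1, 10: 1, 20: 10, 21: 10, 30: 1}

lca_siblings = lowest_common_ancestor([20, 21], TOY_PARENT)
lca_distant = lowest_common_ancestor([20, 21, 30], TOY_PARENT)
lca_single = lowest_common_ancestor([20], TOY_PARENT)
print(lca_siblings, lca_distant, lca_single)
```

Every match counts equally here, which mirrors the remark that no statistical evaluation or weighting takes place.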

During the indexing step

You only need to do this once:

  1. Do all of the things for printing subject taxonomic IDs as described above.
  2. Before rebuilding the index, also download taxdump.tar.gz and extract it to a directory of your choice.
  3. Rebuild your index, but in addition to --acc-tax-map /path/to/file.accession2taxid[.gz], also add the path to the untarred taxdump directory, i.e. --tax-dump-dir /path/to/directory
    • This will only marginally increase your indexing build time and index size.
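
To sanity-check the untarred directory, you can look at the tree it encodes. The Python sketch below reads the child-to-parent relation from nodes.dmp. It assumes the standard NCBI taxdump layout, in which fields are separated by "\t|\t", each line ends with "\t|", and the first three fields are the taxon ID, the parent taxon ID and the rank. Which files LAMBDA itself reads from the directory is not stated here. The function name and the three demo lines are made up.

```python
import os
import tempfile


def read_nodes_dmp(path):
    """Return ({taxid: parent taxid}, {taxid: rank}) parsed from nodes.dmp."""
    parent, rank = {}, {}
    with open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.endswith("\t|"):
                line = line[:-2]  # drop the end-of-line terminator
            fields = line.split("\t|\t")
            if len(fields) < 3:
                continue  # not a complete record
            taxid = int(fields[0])
            parent[taxid] = int(fields[1])
            rank[taxid] = fields[2]
    return parent, rank


# Made-up three-node file; a real nodes.dmp has more fields per line.
demo_lines = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tgenus\t|\n",
    "20\t|\t10\t|\tspecies\t|\n",
]
with tempfile.TemporaryDirectory() as tax_dump_dir:
    nodes_path = os.path.join(tax_dump_dir, "nodes.dmp")
    with open(nodes_path, "wt") as out:
        out.writelines(demo_lines)
    parent, rank = read_nodes_dmp(nodes_path)

print(parent, rank)
```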

When running LAMBDA

You need to tell it to print the LCA information (either as taxon ID or scientific name):

  • for the tabular BLAST Output Formats, specify e.g. --output-columns 'std lcaid lcataxid'.
  • for the SAMTOOLS Output Formats, specify e.g. --sam-bam-tags 'AS NM ae ai qf ls lt' (the last two tags are the important ones).
  • there is no significant impact on the run-time of lambda.
  • although this field is printed per match (like all other fields), it refers to all matches of the query sequence and is therefore identical for every match of that query.

Some things to note:

  • matches against subject sequences that have no identified taxon id do not contribute to LCA computation. Alternatively we could assign all unknown sequences to the root taxon, but this would skew results strongly.
  • the --num-matches parameter strongly influences the LCA. Choose smaller values if you always end up with very generic LCAs.
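
Both notes can be illustrated with a small Python sketch. It is not LAMBDA's implementation. It assumes that the match limit keeps the best-scoring matches of a query, which this page does not state, and it uses a made-up toy taxonomy in which the root is its own parent. Matches without a known taxon are represented by None and are skipped.

```python
def lca_of_known(taxids, parent):
    """LCA over the taxids present in `parent`; None if none is known."""
    def path_to_root(taxid):
        path = [taxid]
        while parent[taxid] != taxid:  # the root is its own parent
            taxid = parent[taxid]
            path.append(taxid)
        return path

    # Matches without an identified taxon do not contribute.
    known = [t for t in taxids if t in parent]
    if not known:
        return None
    common = path_to_root(known[0])
    for taxid in known[1:]:
        seen = set(path_to_root(taxid))
        common = [t for t in common if t in seen]
    return common[0]


# Made-up toy taxonomy: 1 is the root, 20 and 21 are siblings below 10.
TOY_PARENT = {1: 1, 10: 1, 20: 10, 21: 10, 30: 1}

# Taxa of one query's matches, best match first (assumed ordering).
# None marks a subject sequence without an identified taxon.
MATCH_TAXIDS = [20, 20, None, 21, 30]

# Hypothetical stand-in for different match limits.
lca_by_limit = {
    limit: lca_of_known(MATCH_TAXIDS[:limit], TOY_PARENT)
    for limit in (2, 4, 5)
}
print(lca_by_limit)
```

In this toy example the LCA moves from taxon 20 to its parent 10 and finally to the root as more matches are taken into account, while the unknown match changes nothing.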