## Classify Taxonomy with vsearch

In this notebook we will run vsearch on the GCMP sequences against each of the four references we've
created, and save the resulting taxonomy classifications to /output/.

In [10]:
from qiime2 import Artifact
from qiime2.plugins.feature_classifier.methods import classify_consensus_vsearch
from qiime2.metadata import Metadata
from os.path import abspath, exists

In [14]:
working_dir = abspath('../')
metadata_path = working_dir + '/input/GCMP_EMP_map_r28_no_empty_samples.txt'
seqs_path = working_dir + '/output/GCMP_seqs.qza'
taxonomy_reference_dir = working_dir + '/output/taxonomy_references/'

#### Verify that input files exist at the above paths

Let's run a quick check to make sure we have everything we need:

In [15]:
print("Verifying that all needed starting data files and directories exist.")
for required_path in [working_dir, metadata_path, seqs_path, taxonomy_reference_dir]:
    # Each entry may be a file or a directory; exists() covers both.
    if not exists(required_path):
        raise FileNotFoundError(f"Required file or directory {required_path} not found. Please ensure it exists at that path.")
print("Done.")

Verifying that all needed starting data files and directories exist.
Done.


#### Annotate the GCMP sequences

Next, we'll use vsearch to annotate taxonomy. This will be run once for each of our different taxonomic schemes:

- Greengenes
- SILVA
- Greengenes + MeTaxa2 mitochondrial sequences
- SILVA + MeTaxa2 mitochondrial sequences

**NOTE**: This step can take quite a while to run (roughly all day on my MacBook Pro using 4 threads), so we recommend scheduling it overnight or for a time when you have other things to do.

In [8]:
references = ['greengenes', 'silva', 'greengenes_metaxa2', 'silva_metaxa2']
metadata = Metadata.load(metadata_path)
seqs = Artifact.load(seqs_path)
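
Because the classification loop is slow, it may be worth confirming up front that every reference artifact it will load is actually on disk. The sketch below is one way to do that. The helper name `missing_reference_files` is hypothetical; the `{reference}_otus.qza` and `{reference}_taxonomy.qza` file names are taken from the loop in the next cell.

```python
from os.path import exists, join


def missing_reference_files(reference_dir, references):
    """Return the expected reference artifact paths that do not exist.

    Hypothetical helper: assumes each reference has an '<name>_otus.qza'
    and an '<name>_taxonomy.qza' file inside reference_dir.
    """
    missing = []
    for reference in references:
        for suffix in ("otus", "taxonomy"):
            path = join(reference_dir, f"{reference}_{suffix}.qza")
            if not exists(path):
                missing.append(path)
    return missing
```

In this notebook it could be called as `missing_reference_files(taxonomy_reference_dir, references)`; an empty list means all eight reference files were found.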

In [16]:
vsearch_results = {}
for reference in references:
    reference_otu_path = taxonomy_reference_dir + f'{reference}_otus.qza'
    reference_taxonomy_path = taxonomy_reference_dir + f'{reference}_taxonomy.qza'
    reads = Artifact.load(reference_otu_path)
    taxonomy = Artifact.load(reference_taxonomy_path)
    vsearch_results[reference] = classify_consensus_vsearch(seqs, reads, taxonomy, threads=4)

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /var/folders/ph/3tshftys2pn638_ypcxd_w58t0gb61/T/qiime2-archive-1giqcoxg/444b6d5b-e008-4c59-aa36-7c57adf1cdc7/data/dna-sequences.fasta --id 0.8 --query_cov 0.8 --strand both --maxaccepts 10 --maxrejects 0 --db /var/folders/ph/3tshftys2pn638_ypcxd_w58t0gb61/T/qiime2-archive-lnfd6_kt/73cb4a20-a0dd-480b-b64a-7fed6571b7cd/data/dna-sequences.fasta --threads 4 --output_no_hits --blast6out /var/folders/ph/3tshftys2pn638_ypcxd_w58t0gb61/T/tmp_8sa6tvi

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /var/folders/ph/3tshftys2pn638_ypcxd_w
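
The logged command shows vsearch keeping up to 10 hits per query (`--maxaccepts 10`) at 80% identity and 80% query coverage. The classifier then reduces those hits to a single consensus lineage. The sketch below illustrates the general idea of a consensus assignment: keep extending the lineage while a sufficient fraction of hits still agree. It is a simplified illustration, not the plugin's implementation. The function name, the example lineages, and the `min_consensus=0.51` threshold (which I believe matches the plugin's default, but check your installed version) are assumptions.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage prefix shared by >= min_consensus of hits.

    Simplified illustration only; hit_taxonomies is a list of
    semicolon-delimited lineage strings, one per accepted hit.
    """
    if not hit_taxonomies:
        return "Unassigned"
    lineages = [[level.strip() for level in t.split(";")] for t in hit_taxonomies]
    consensus = []
    depth = 0
    while True:
        # Count lineage prefixes of length depth + 1; hits whose lineage
        # is shorter than that cannot vote at this depth.
        prefixes = Counter(
            tuple(lineage[: depth + 1]) for lineage in lineages if len(lineage) > depth
        )
        if not prefixes:
            break
        best_prefix, count = prefixes.most_common(1)[0]
        # The fraction is taken over all hits, so short lineages count against.
        if count / len(lineages) < min_consensus:
            break
        consensus = list(best_prefix)
        depth += 1
    return "; ".join(consensus) if consensus else "Unassigned"
```

With two hits that agree to phylum but split at class, for example, the consensus stops at phylum, because neither class reaches the threshold.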

#### Save each of the resulting taxonomy annotations for the GCMP sequences 

In [18]:
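for reference in vsearch_results:
    # Access the taxonomy output by name rather than by tuple unpacking, so this
    # still works if the installed plugin version returns additional outputs.
    classification_taxonomy = vsearch_results[reference].classification
    classification_taxonomy.save(working_dir + f'/output/{reference}_reference_taxonomy.qza')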
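
Since each classification can take hours, it can help to know which references already have a saved result before rerunning the notebook after an interruption. The following is a minimal sketch of that check. The helper name `pending_references` is hypothetical; the `{reference}_reference_taxonomy.qza` naming follows the save step above.

```python
from os.path import exists, join


def pending_references(references, output_dir):
    """Return references whose saved taxonomy artifact is not yet on disk.

    Hypothetical helper: assumes results are saved to output_dir as
    '<reference>_reference_taxonomy.qza'.
    """
    return [
        reference
        for reference in references
        if not exists(join(output_dir, f"{reference}_reference_taxonomy.qza"))
    ]
```

Looping over `pending_references(references, working_dir + '/output/')` instead of `references`, and saving each result as soon as it is produced, would let a rerun skip the references that have already finished.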