## SILVA + RESCRIPt

We will create a database specifically tailored for classification of our sequences.

### Downloading data

In [None]:
from rescript import get_data 
from qiime2.plugins import rescript, feature_classifier
from qiime2 import Artifact

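# SILVA release and target dataset to download
version = "138.1"
target = "SSURef_NR99"

# The underscore-prefixed helpers are private RESCRIPt functions and may change between releases
queries = get_data._assemble_silva_data_urls(version, target, download_sequences=True)
results = get_data._retrieve_data_from_silva(queries)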

retrieving sequences from: https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz
retrieving taxonomy map from: https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.1.txt.gz
retrieving taxonomy tree from: https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/tax_slv_ssu_138.1.tre.gz
retrieving taxonomy ranks from: https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/tax_slv_ssu_138.1.txt.gz


In [None]:
tax = rescript.methods.parse_silva_taxonomy(
    taxonomy_tree = results["taxonomy tree"],
    taxonomy_map = results["taxonomy map"],
    taxonomy_ranks = results["taxonomy ranks"]
)

In [None]:
tax.taxonomy.save("../data/artifacts/silva-taxonomy-ssu-nr99-138.1.qza")

'../data/artifacts/silva-taxonomy-ssu-nr99-138.1.qza'

SILVA stores its data as RNA sequences, so they have been imported as `FeatureData[RNASequence]`. To make sure things run smoothly downstream, we'll convert the data to `FeatureData[DNASequence]`.

In [None]:
seq_dna = rescript.methods.reverse_transcribe(results["sequences"])
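
As a rough illustration of what this conversion amounts to at the sequence level, the sketch below rewrites an RNA string in the DNA alphabet by replacing `U` with `T`. It is a plain-Python simplification, not RESCRIPt's implementation; the function name and the example sequence are made up for illustration.

```python
def reverse_transcribe(rna: str) -> str:
    """Return an RNA sequence rewritten in the DNA alphabet (U -> T).

    Hypothetical helper for illustration only; it works on plain
    strings rather than QIIME 2 artifacts.
    """
    return rna.upper().replace("U", "T")


# Made-up example sequence, including one ambiguous base (N)
rna_example = "GAUUACAGGNU"
dna_example = reverse_transcribe(rna_example)
print(dna_example)
```

Ambiguity codes such as `N` pass through unchanged, which is why the culling step that follows is still needed.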

### “Culling” low-quality sequences with cull-seqs
Here we’ll remove sequences that contain 5 or more ambiguous bases (IUPAC-compliant ambiguity codes), as well as sequences that contain a homopolymer of 8 or more bases. These are the default thresholds of `cull_seqs`, so no extra parameters are needed.

In [None]:
clean_seq_dna = rescript.methods.cull_seqs(seq_dna.dna_sequences)
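
To make the two criteria concrete, here is a minimal plain-Python sketch of the same kind of check. It is an approximation written for illustration, not the code RESCRIPt runs; the helper name `passes_cull` and the toy sequences are hypothetical, and the two thresholds simply mirror the values quoted above.

```python
import re

# IUPAC degenerate (ambiguous) DNA codes
AMBIGUOUS = set("RYSWKMBDHVN")


def passes_cull(seq, num_degenerates=5, homopolymer_length=8):
    """Return True if a sequence would be kept under the two criteria.

    Hypothetical helper: rejects sequences with `num_degenerates` or more
    ambiguous bases, or with a run of `homopolymer_length` or more
    identical bases.
    """
    seq = seq.upper()
    n_ambiguous = sum(base in AMBIGUOUS for base in seq)
    if n_ambiguous >= num_degenerates:
        return False
    # (.)\1{7,} matches any base followed by 7+ copies of itself (8+ total)
    homopolymer = re.compile(r"(.)\1{%d,}" % (homopolymer_length - 1))
    return homopolymer.search(seq) is None


# Toy records, made up for illustration
records = {
    "ok": "ACGTACGTNNACGT",
    "too_ambiguous": "ACGTNNNNNACGT",
    "homopolymer": "ACGT" + "A" * 8 + "GT",
    "seven_run": "ACGT" + "A" * 7 + "GT",
}
kept = [name for name, seq in records.items() if passes_cull(seq)]
print(kept)
```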

### Filtering sequences by length and taxonomy
Rather than blindly filtering out all reference sequences below a single length, we'll filter differentially based on the taxonomy of each reference sequence. The reason: if we remove every sequence below 1000 or 1200 bp, many of the reference sequences associated with Archaea (and some Bacteria) will be lost, whereas a single lower cutoff would retain shorter, lower-quality bacterial and eukaryotic sequences. Either choice introduces undue selection bias into the database. To mitigate this, we will remove rRNA gene sequences that do not meet the following criteria: Archaea (16S) >= 900 bp, Bacteria (16S) >= 1200 bp, and Eukaryota (18S) >= 1400 bp. See the help text of `filter_seqs_length_by_taxon` for more information.

In [None]:
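filtered = rescript.methods.filter_seqs_length_by_taxon(
    # Use the culled sequences from the previous step
    sequences = clean_seq_dna.clean_sequences,
    taxonomy = tax.taxonomy,
    # Labels must match the taxonomy strings exactly; min_lens is in the same order
    labels = ["Archaea", "Bacteria", "Eukaryota"],
    min_lens = [900, 1200, 1400]
)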
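
The idea behind label-based length filtering can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not RESCRIPt's implementation: the helper name is hypothetical, labels are matched as substrings of the taxonomy string, the first matching label decides the threshold, and sequences matching no label are kept.

```python
def keep_by_taxon_length(seq_len, taxon, labels, min_lens):
    """Decide whether a sequence passes its taxon-specific minimum length.

    Hypothetical helper: the first label found in the taxonomy string
    sets the threshold; unmatched sequences are kept (an assumption of
    this sketch).
    """
    for label, min_len in zip(labels, min_lens):
        if label in taxon:
            return seq_len >= min_len
    return True


labels = ["Archaea", "Bacteria", "Eukaryota"]
min_lens = [900, 1200, 1400]

# Made-up examples: the same 950 bp length is acceptable for an
# archaeal reference but too short for a bacterial one.
archaeal_ok = keep_by_taxon_length(950, "d__Archaea; p__Crenarchaeota", labels, min_lens)
bacterial_ok = keep_by_taxon_length(950, "d__Bacteria; p__Proteobacteria", labels, min_lens)
print(archaeal_ok, bacterial_ok)
```

One consequence of matching by label, at least in this sketch, is that a misspelled label matches nothing and the corresponding sequences escape filtering without any error, so the spelling of each label is worth double-checking against the taxonomy strings.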

### Dereplication
According to the notes for the SILVA 138.1 NR99 release, there may be identical full-length sequences with either identical or different taxonomies. We'll dereplicate the data before moving forward, which removes redundant sequence data from the database prior to downstream processing. RESCRIPt provides several options for sequence-taxonomy dereplication; see the help text of `dereplicate` for more information.

**Dereplicating in uniq mode**
Here we will use the default `uniq` approach. That is, we’ll retain identical sequence records that have differing taxonomies. We specify the mode explicitly for the sake of clarity, but feel free to use any other `mode` value (`--p-mode` on the command line) that makes sense for your data.
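
As a rough mental model of `uniq` dereplication, the sketch below keeps one record per distinct (sequence, taxonomy) pair. It is a plain-Python simplification for illustration only and does not reproduce RESCRIPt's vsearch-based implementation; the function name, IDs, and taxonomy strings are made up.

```python
def dereplicate_uniq(seqs, taxa):
    """Keep the first record of every distinct (sequence, taxonomy) pair.

    Hypothetical helper: `seqs` maps ID -> sequence and `taxa` maps
    ID -> taxonomy string. Returns the IDs that are retained.
    """
    seen = set()
    kept = []
    for seq_id, seq in seqs.items():
        key = (seq, taxa[seq_id])
        if key not in seen:
            seen.add(key)
            kept.append(seq_id)
    return kept


# Toy data: a, b and c share a sequence, but c has a different taxonomy
seqs = {"a": "ACGT", "b": "ACGT", "c": "ACGT", "d": "GGCC"}
taxa = {
    "a": "d__Bacteria; g__X",
    "b": "d__Bacteria; g__X",
    "c": "d__Bacteria; g__Y",
    "d": "d__Archaea; g__Z",
}
kept_ids = dereplicate_uniq(seqs, taxa)
print(kept_ids)
```

In this toy case only `b` is dropped: it duplicates both the sequence and the taxonomy of `a`, while `c` survives because its taxonomy differs.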

In [None]:
dereplicated = rescript.methods.dereplicate(sequences = filtered.filtered_seqs,
                                            taxa = tax.taxonomy,
                                            mode = "uniq")

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --derep_fulllength /tmp/qiime2/vbezshapkin/data/63a60e39-5216-40f1-8772-a9a42d060d1d/data/dna-sequences.fasta --output /tmp/tmpkxlg__fe --uc /tmp/tmpgvbxtnss --xsize --threads 1



vsearch v2.22.1_linux_x86_64, 1007.1GB RAM, 144 cores
https://github.com/torognes/vsearch

Dereplicating file /tmp/qiime2/vbezshapkin/data/63a60e39-5216-40f1-8772-a9a42d060d1d/data/dna-sequences.fasta 100%
740229621 nt in 505000 seqs, min 900, max 4000, avg 1466
Sorting 100%
462805 unique sequences, avg cluster 1.1, median 1, max 893
Writing FASTA output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])

In [None]:
dereplicated.dereplicated_sequences.save("../data/artifacts/seqs-silva-138.1-ssu-nr99-filtered-derep.qza")
dereplicated.dereplicated_taxa.save("../data/artifacts/tax-silva-138.1-ssu-nr99-filtered-derep.qza")

'../data/artifacts/tax-silva-138.1-ssu-nr99-filtered-derep.qza'

In [None]:
# Load the filtered, dereplicated SILVA sequences saved above
silva_seqs = Artifact.load("../data/artifacts/seqs-silva-138.1-ssu-nr99-filtered-derep.qza")

# Extract the region amplified by our primer pair
reads = feature_classifier.methods.extract_reads(
    sequences = silva_seqs,
    f_primer = "CCTACGGGNGGCWGCAG",
    r_primer = "GACTACHVGGGTATCTAATCC",
    n_jobs = 16,
    read_orientation = "forward"
)

# Dereplicate again: trimming to the amplicon can make distinct references identical
derep = rescript.methods.dereplicate(
    sequences = reads.reads,
    taxa = Artifact.load("../data/artifacts/tax-silva-138.1-ssu-nr99-filtered-derep.qza"),
    mode = "uniq"
)
derep.dereplicated_sequences.save("../data/artifacts/seqs-silva-138.1-ssu-nr99-filtered-derep-341f-806r.qza")
derep.dereplicated_taxa.save("../data/artifacts/tax-silva-138.1-ssu-nr99-filtered-derep-341f-806r.qza")
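
For intuition about how degenerate primers such as `CCTACGGGNGGCWGCAG` select a region, the sketch below performs a simplified in-silico extraction with regular expressions. It is an illustration under stated assumptions, not what `extract_reads` does internally: it requires exact matches to the IUPAC codes (no mismatch tolerance), returns the region between the primers, and uses a made-up template; all helper names are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a DNA string, including IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Translate a degenerate primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer)


def extract_amplicon(seq, f_primer, r_primer):
    """Return the region between the primers, or None if either is absent."""
    pattern = re.compile(
        primer_regex(f_primer) + "(.*?)" + primer_regex(reverse_complement(r_primer))
    )
    match = pattern.search(seq)
    return match.group(1) if match else None


f_primer = "CCTACGGGNGGCWGCAG"
r_primer = "GACTACHVGGGTATCTAATCC"

# Made-up template: flank + forward site + insert + reverse site + flank
template = (
    "AAAA"
    + "CCTACGGGAGGCAGCAG"
    + "TTGACCA"
    + reverse_complement("GACTACAGGGGTATCTAATCC")
    + "TTTT"
)
amplicon = extract_amplicon(template, f_primer, r_primer)
print(amplicon)
```

Because many full-length references differ only outside the amplified region, extraction tends to create new duplicates, which is why a second dereplication pass follows the extraction.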

In [None]:
classifier = rescript.pipelines.evaluate_fit_classifier(
    sequences = derep.dereplicated_sequences,
    taxonomy = derep.dereplicated_taxa,
    n_jobs = 16
)

  taxa = taxa.loc[seq_ids]

Validation: 8.24s
Training: 871.66s
Classification: 1193.74s
Evaluation: 6.70s
Total Runtime: 2080.34s


In [None]:
classifier.classifier.save("../data/artifacts/classifier-silva-138.1-ssu-341f-806r.qza")
classifier.evaluation.save("../visualizations/eval-classifier-silva-138.1-ssu-341f-806r.qzv")
classifier.observed_taxonomy.save("../data/silva-138-341f-806r-taxonomy.qza")

'../data/silva-138-341f-806r-taxonomy.qza'