<h1 style="background-color:#03e8fc; font-family:'Brush Script MT',cursive;color:black;font-size:200%; text-align:center;border-radius: 50% 20% / 10% 40%">Taxonomy prediction</h1>

"Taxonomy prediction is a science involving the hierarchical classification of DNA fragments up to the rank species. Given species diversity on Earth, taxonomy prediction gets challenging with (i) increasing number of species (labels) to classify and (ii) decreasing input (DNA) size."

https://vtechworks.lib.vt.edu/handle/10919/89752

Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences

Author: Robert C. Edgar

![](https://dfzljdn9uc3pi.cloudfront.net/2018/4652/1/fig-6-1x.jpg)

https://peerj.com/articles/4652/

In [None]:
import numpy as np # linear algebra
import pandas as pd
from pathlib import Path
import os.path
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns

<h1><span class="label label-default" style="background-color:#03e8fc;border-radius:100px 100px; font-weight: bold; font-family:Garamond; font-size:20px; color:black; padding:10px">Deep Learning for Taxonomy Prediction</span></h1><br>

"Deep Learning for Taxonomy Prediction"

"The last decade has seen great advances in Next-Generation Sequencing technologies, and, as a result, there has been a rise in the number of genomes sequenced each year. In 2017, there were as many as 10,000 new organisms sequenced and added into the RefSeq Database."

"Taxonomy prediction is a science involving the hierarchical classification of DNA fragments up to the rank species. In this research, the authors introduced Predicting Linked Organisms, Plinko, for short. Plinko is a fully-functioning, state-of-the-art predictive system that accurately captures DNA - Taxonomy relationships where other state-of-the-art algorithms falter."

"Plinko leverages multi-view CNNs and the pre-defined taxonomy tree structure to improve multi-level taxonomy prediction. In the Plinko strategy, each network takes advantage of different word usage patterns corresponding to different levels of evolutionary divergence. Plinko has the advantages of relatively low storage, GPGPU parallel training and inference, making the solution portable, and scalable with anticipated genome database growth."

"To the best of our knowledge, Plinko is the first to use multi-view CNN as the core algorithm in a compositional, alignment-free approach to taxonomy prediction."

https://vtechworks.lib.vt.edu/handle/10919/89752
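The "word usage patterns" and "compositional, alignment-free" wording above can be made concrete with a small k-mer sketch. This is not Plinko's code: the function name `kmer_profile`, the toy fragment, and the choice of word lengths 2, 3 and 4 are all assumptions made for illustration. It only shows how one DNA fragment can be turned into several fixed-length composition vectors without any alignment.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=3):
    """Return normalized k-mer frequencies for a DNA string, in a fixed A/C/G/T order."""
    sequence = sequence.upper()
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing other symbols (for example N) are not in the vocabulary and are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


# Hypothetical toy fragment; each word length gives a different "view" of the same input.
fragment = "ACGTACGTTAGC"
views = {k: kmer_profile(fragment, k) for k in (2, 3, 4)}
for k, profile in views.items():
    print(k, len(profile), round(sum(profile), 6))
```

Each view has 4^k entries, so longer words give sparser but more specific vectors. A real multi-view model would feed such views to separate networks, which this sketch does not attempt.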

In [None]:
# Code by Venkatkumar R https://www.kaggle.com/venkatkumar001/hfp-2-eda-tensorflow/notebook

import json

# Load the competition metadata; undecodable bytes are skipped instead of raising.
with open("../input/herbarium-2022-fgvc9/train_metadata.json", 'r',
          encoding='utf-8', errors='ignore') as f:
    train_meta = json.load(f)

with open("../input/herbarium-2022-fgvc9/test_metadata.json", 'r',
          encoding='utf-8', errors='ignore') as f:
    test_meta = json.load(f)

<h1><span class="label label-default" style="background-color:#03e8fc;border-radius:100px 100px; font-weight: bold; font-family:Garamond; font-size:20px; color:black; padding:10px">Taxonomy relationships</span></h1><br>

"Taxonomy prediction is a science involving the hierarchical classification of DNA fragments up to the rank species. Given species diversity on Earth, taxonomy prediction gets challenging with (i) increasing number of species (labels) to classify and (ii) decreasing input (DNA) size."

"In that research, the authors introduced Predicting Linked Organisms, Plinko, for short. Plinko is a fully-functioning, state-of-the-art predictive system that accurately captures DNA - Taxonomy relationships where other state-of-the-art algorithms falter."

"Three major challenges in taxonomy prediction are (i) large dataset sizes (order of 10^9 sequences) (ii) large label spaces (order of 10^3 labels) and (iii) low resolution inputs (100 base pairs or less). Plinko leverages multi-view CNNs and the pre-defined taxonomy tree structure to improve multi-level taxonomy prediction for hard to classify sequences under the three conditions stated."

"Plinko has the advantage of relatively low storage footprint, making the solution portable, and scalable with anticipated genome database growth. To the best of our knowledge, Plinko is the first to use multi-view CNN as the core algorithm in a compositional, alignment-free approach to taxonomy prediction."

https://vtechworks.lib.vt.edu/handle/10919/89752
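One simple way a pre-defined taxonomy tree can support multi-level prediction is to roll leaf-level probabilities up to their parents. The sketch below is a generic illustration and not the Plinko architecture: the three species, the `genus_of` mapping and the probabilities are made-up values standing in for the output of some species-level classifier.

```python
import numpy as np

# Hypothetical toy taxonomy: each species (leaf label) maps to its genus.
species = ["Acer rubrum", "Acer saccharum", "Quercus alba"]
genus_of = {"Acer rubrum": "Acer", "Acer saccharum": "Acer", "Quercus alba": "Quercus"}

# Made-up species-level probabilities, e.g. the softmax output of a classifier.
species_probs = np.array([0.40, 0.35, 0.25])

# Sum the probability mass of all species that share a genus.
genera = sorted(set(genus_of.values()))
genus_probs = np.zeros(len(genera))
for name, p in zip(species, species_probs):
    genus_probs[genera.index(genus_of[name])] += p

best_species = species[int(np.argmax(species_probs))]
best_genus = genera[int(np.argmax(genus_probs))]
print(best_species, best_genus, genus_probs)
```

In this toy case the species call is uncertain (0.40 against 0.35) while the genus call is much clearer, which is the general reason coarser ranks are easier to predict from short inputs.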

In [None]:
# Code by Venkatkumar R https://www.kaggle.com/venkatkumar001/hfp-2-eda-tensorflow/notebook

display(train_meta.keys())

<h1><span class="label label-default" style="background-color:#03e8fc;border-radius:100px 100px; font-weight: bold; font-family:Garamond; font-size:20px; color:black; padding:10px">Taxonomy Prediction with Tree-Structured Covariances</span></h1><br>

Taxonomic Prediction with Tree-Structured Covariances

Authors: Matthew B. Blaschko, Wojciech Zaremba, Arthur Gretton 
DOI https://doi.org/10.1007/978-3-642-40991-2_20

"Taxonomies have been proposed numerous times in the literature in order to encode semantic relationships between classes. Such taxonomies have been used to improve classification results by increasing the statistical efficiency of learning, as similarities between classes can be used to increase the amount of relevant data during training."

"In that paper, the authors show how data-derived taxonomies may be used in a structured prediction framework, and compare the performance of learned and semantically constructed taxonomies. Structured prediction in this case is multi-class categorization with the assumption that categories are taxonomically related."

"They made three main contributions: (i) They proved the equivalence between tree-structured covariance matrices and taxonomies; (ii) They used this covariance representation to develop a highly computationally efficient optimization algorithm for structured prediction with taxonomies; (iii) They showed that the taxonomies learned from data using the Hilbert-Schmidt Independence Criterion (HSIC) often perform better than imputed semantic taxonomies."

"Source code of this implementation, as well as machine readable learned taxonomies are available for download from https://github.com/blaschko/tree-structured-covariance."

https://link.springer.com/chapter/10.1007/978-3-642-40991-2_20
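To give a feel for what a tree-structured covariance matrix looks like, the sketch below builds one from a tiny rooted taxonomy. It assumes the common shared-path reading, where the entry for two classes is the length of the path they share from the root, and it assumes every edge has length 1. The lineages, the helper `shared_path_length` and the unit edge lengths are illustrative choices, not values or code from the paper.

```python
import numpy as np

# Hypothetical toy lineages (family, genus, species); every edge is assumed to have length 1.
lineages = {
    "Acer rubrum": ("Sapindaceae", "Acer", "Acer rubrum"),
    "Acer saccharum": ("Sapindaceae", "Acer", "Acer saccharum"),
    "Quercus alba": ("Fagaceae", "Quercus", "Quercus alba"),
}


def shared_path_length(a, b):
    """Count the edges shared by two root-to-leaf paths."""
    length = 0
    for node_a, node_b in zip(a, b):
        if node_a != node_b:
            break
        length += 1
    return length


names = list(lineages)
cov = np.array(
    [[shared_path_length(lineages[i], lineages[j]) for j in names] for i in names],
    dtype=float,
)
print(cov)
```

The two Acer species share two edges (root to family, family to genus), so their off-diagonal entry is 2, while either of them shares nothing with Quercus alba. The diagonal holds each leaf's full depth.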

In [None]:
# Code by Venkatkumar R https://www.kaggle.com/venkatkumar001/hfp-2-eda-tensorflow/notebook

# One row per category (taxon) listed in the training metadata.
taxonomy = pd.DataFrame(train_meta['categories'])
display(taxonomy)
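A flat table like this can be folded back into a family → genus → species hierarchy. The sketch below uses a small stand-in DataFrame named `toy` so it runs on its own. The column names `family`, `genus` and `species` are assumed to match the real table, and the rows are made up.

```python
import pandas as pd

# Hypothetical stand-in for the `taxonomy` table; column names are assumed.
toy = pd.DataFrame({
    "family": ["Sapindaceae", "Sapindaceae", "Fagaceae"],
    "genus": ["Acer", "Acer", "Quercus"],
    "species": ["rubrum", "saccharum", "alba"],
})

# Nested dict: family -> genus -> list of species epithets.
tree = {}
for family, genus, species in toy[["family", "genus", "species"]].itertuples(index=False):
    tree.setdefault(family, {}).setdefault(genus, []).append(species)

# Number of distinct genera in each family.
genera_per_family = toy.groupby("family")["genus"].nunique()

print(tree)
print(genera_per_family)
```

Swapping `toy` for `taxonomy` should work the same way, provided those three columns are present.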

<h1><span class="label label-default" style="background-color:#03e8fc;border-radius:100px 100px; font-weight: bold; font-family:Garamond; font-size:20px; color:black; padding:10px">Accuracy of Taxonomy Prediction</span></h1><br>

"Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences"

Author: Robert C. Edgar

"Prediction of taxonomy for marker gene sequences such as 16S ribosomal RNA (rRNA) is a fundamental task in microbiology. Most experimentally observed sequences are diverged from reference sequences of authoritatively named organisms, creating a challenge for prediction methods."

"The author assessed the accuracy of several algorithms using cross-validation by identity, a new benchmark strategy which explicitly models the variation in distances between query sequences and the closest entry in a reference database. When the accuracy of genus predictions was averaged over a representative range of identities with the reference database (100%, 99%, 97%, 95% and 90%), all tested methods had ≤50% accuracy on the currently-popular V4 region of 16S rRNA."

"Accuracy was found to fall rapidly with identity; for example, better methods were found to have V4 genus prediction accuracy of ∼100% at 100% identity but ∼50% at 97% identity. The relationship between identity and taxonomy was quantified as the probability that a rank is the lowest shared by a pair of sequences with a given pair-wise identity. With the V4 region, 95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal."

https://peerj.com/articles/4652/
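The two quantities discussed above, pairwise identity and the lowest rank shared by a pair of sequences, can be sketched in a few lines. This is a simplified illustration, not the benchmark's implementation: identity is taken as the fraction of matching positions in two pre-aligned, equal-length strings (gap handling is ignored), and the sequences, rank list and placeholder lineage labels are all made up.

```python
RANKS = ["class", "order", "family", "genus"]


def pairwise_identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def lowest_shared_rank(lineage_a, lineage_b):
    """Return the most specific rank at which two lineages still agree, or None."""
    shared = None
    for rank in RANKS:
        if lineage_a[rank] != lineage_b[rank]:
            break
        shared = rank
    return shared


# Hypothetical pair: same class, order and family, different genus.
seq_a, seq_b = "ACGTACGTAC", "ACGTACGTAA"
lin_a = {"class": "c1", "order": "o1", "family": "f1", "genus": "g1"}
lin_b = {"class": "c1", "order": "o1", "family": "f1", "genus": "g2"}

identity = pairwise_identity(seq_a, seq_b)
rank = lowest_shared_rank(lin_a, lin_b)
print(identity, rank)
```

Tabulating `rank` over many pairs binned by `identity` would give an empirical version of the probabilities described in the abstract.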


In [None]:
taxonomy["family"].value_counts()

<h1><span class="label label-default" style="background-color:#03e8fc;border-radius:100px 100px; font-weight: bold; font-family:Garamond; font-size:20px; color:black; padding:10px">Plants Conservation Relevant Predictions</span></h1><br>

"The conservation status of most plant species is currently unknown, despite the fundamental role of plants in ecosystem health. To facilitate the costly process of conservation assessment, the authors developed a predictive protocol using a ML approach to predict conservation status of over 150,000 land plant species. Their study uses open-source geographic, environmental, and morphological trait data, making this the largest assessment of conservation risk to date and the only global assessment for plants."

"Their results indicated that a large number of unassessed species are likely at risk and identify several geographic regions with the highest need of conservation efforts, many of which are not currently recognized as regions of global concern."

"By providing conservation-relevant predictions at multiple spatial and taxonomic scales, predictive frameworks such as the one developed here fill a pressing need for biodiversity science."

https://www.pnas.org/content/115/51/13027

In [None]:
# Code by Pooja Jain https://www.kaggle.com/jainpooja/av-guided-hackathon-predict-youtube-likes/notebook

from wordcloud import WordCloud, STOPWORDS

text_cols = ['scientificName', 'family', 'genus', 'species']

wc = WordCloud(stopwords=set(STOPWORDS) | {'|'}, random_state=42,
               background_color='green', colormap='Dark2')
fig, axes = plt.subplots(2, 2, figsize=(20, 12))

# One word cloud per taxonomic column, built from every value in that column.
for ax, col in zip(axes.flatten(), text_cols):
    # Join the values explicitly: str(Series) would only give pandas' truncated repr.
    text = ' '.join(taxonomy[col].dropna().astype(str))
    ax.imshow(wc.generate(text))
    ax.set_title(col.upper(), fontsize=24)
    ax.axis('off')

In [None]:
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')

In [None]:
#Code by Kohei-mu https://www.kaggle.com/koheimuramatsu/industrial-accident-causal-analysis/notebook

#Fixed by Des https://www.kaggle.com/desalegngeb/bokeh-visualization-library-guide-for-beginners

family_cnt = np.round((taxonomy['family'].value_counts(normalize=True) *100).head(10))
hv.Bars(family_cnt[::-1]).opts(title="North America Flora Families", color="purple", xlabel="Flora Families", ylabel="Percentage", xformatter='%d%%')\
                .opts(opts.Bars(width=600, height=600,tools=['hover'],show_grid=True,invert_axes=True))

In [None]:
#Code by Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

color = plt.cm.Greens(np.linspace(0,1,20))
taxonomy["scientificName"].value_counts().head(20).plot.pie(colors=color, autopct="%0.1f%%")
plt.title("North America Flora Scientific Names")
plt.axis("off")
plt.show()

In [None]:
ax = taxonomy['scientificName'].value_counts()[:20].plot.barh(figsize=(16, 8), color='green')
ax.set_title('North America Flora Scientific Names', size=18, color='orange')
ax.set_ylabel('Scientific Names', size=10)
ax.set_xlabel('Count', size=10);

In [None]:
#Code by Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

color = plt.cm.viridis(np.linspace(0,1,20))
taxonomy["family"].value_counts().head(20).plot.pie(colors=color, autopct="%0.1f%%")
plt.title("North America Flora Families")
plt.axis("off")
plt.show()

In [None]:
ax = taxonomy['family'].value_counts()[:20].plot.barh(figsize=(16, 8), color='orange')
ax.set_title('North America Flora Families', size=18, color='green')
ax.set_ylabel('Families', size=10)
ax.set_xlabel('Count', size=10);

In [None]:
#Code by Kohei-mu https://www.kaggle.com/koheimuramatsu/industrial-accident-causal-analysis/notebook

#Fixed by Des https://www.kaggle.com/desalegngeb/bokeh-visualization-library-guide-for-beginners

genus_cnt = np.round((taxonomy['genus'].value_counts(normalize=True) *100).head(10))
hv.Bars(genus_cnt[::-1]).opts(title="North America Flora Genus", color="red", xlabel="Flora Genus", ylabel="Percentage", xformatter='%d%%')\
                .opts(opts.Bars(width=600, height=600,tools=['hover'],show_grid=True,invert_axes=True))

In [None]:
#Code by Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

color = plt.cm.Oranges(np.linspace(0,1,20))
taxonomy["genus"].value_counts().head(20).plot.pie(colors=color, autopct="%0.1f%%")
plt.title("North America Flora Genus")
plt.axis("off")
plt.show()

In [None]:
ax = taxonomy['genus'].value_counts()[:20].plot.barh(figsize=(16, 8), color='purple')
ax.set_title('North America Flora Genus', size=18, color='red')
ax.set_ylabel('Genus', size=10)
ax.set_xlabel('Count', size=10);

<h1><span class="label label-default" style="background-color:#03e8fc;border-radius:100px 100px; font-weight: bold; font-family:Garamond; font-size:20px; color:black; padding:10px">Plants Biodiversity</span></h1><br>

"Biodiversity is essential for ecosystem function yet is being lost at an unprecedented rate. This threat to ecosystem function has downstream economic and cultural consequences that affect human health and well-being."

"Plants are the foundation of ecosystem architecture and agriculture, and as such, changes in plant species diversity strongly influence processes such as biomass production, decomposition, and nutrient cycling. Plant diversity is therefore critical for diversity on other trophic levels."

In [None]:
#Code by Kohei-mu https://www.kaggle.com/koheimuramatsu/industrial-accident-causal-analysis/notebook

#Fixed by Des https://www.kaggle.com/desalegngeb/bokeh-visualization-library-guide-for-beginners

species_cnt = np.round((taxonomy['species'].value_counts(normalize=True) *100).head(10))
hv.Bars(species_cnt[::-1]).opts(title="North America Flora Species", color="blue", xlabel="Flora Species", ylabel="Percentage", xformatter='%d%%')\
                .opts(opts.Bars(width=600, height=600,tools=['hover'],show_grid=True,invert_axes=True))

In [None]:
#Code by Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

color = plt.cm.winter(np.linspace(0,1,20))
taxonomy["species"].value_counts().head(20).plot.pie(colors=color, autopct="%0.1f%%")
plt.title("North America Flora Species")
plt.axis("off");

In [None]:
ax = taxonomy['species'].value_counts()[:20].plot.barh(figsize=(16, 8), color='yellow')
ax.set_title('North America Flora Species', size=18, color='red')
ax.set_ylabel('Species', size=10)
ax.set_xlabel('Count', size=10);

That's it: this taxonomy Kaggle notebook makes no predictions.

# Acknowledgements

Venkatkumar R https://www.kaggle.com/venkatkumar001/hfp-2-eda-tensorflow/notebook

Pooja Jain https://www.kaggle.com/jainpooja/av-guided-hackathon-predict-youtube-likes/notebook

Taha07  https://www.kaggle.com/taha07/data-scientists-jobs-analysis-visualization/notebook

Kohei-mu https://www.kaggle.com/koheimuramatsu/industrial-accident-causal-analysis/notebook

Des https://www.kaggle.com/desalegngeb/bokeh-visualization-library-guide-for-beginners

