**What is a herbarium?**
<br>
A herbarium is a collection of preserved plants that are stored, catalogued and arranged systematically for study by professional taxonomists (scientists who name and identify plants), botanists and amateurs.

Creating a herbarium specimen involves pressing and drying plants between sheets of paper, a practice that has changed very little since it began some 500 years ago. Thanks to this simple technique, most of the characteristics of living plants are visible on the dried plant. The few that are not (e.g. flower colour, scent, height of a tree, vegetation type) are written on the collection label by the collector. Most importantly, the label should tell us where and when the specimen was collected.

**A working reference collection**<br>
A herbarium acts like a plant library or vast catalogue, with each of our three million specimens providing unique information: where it was found, when it flowered, what it looks like and its DNA, which remains intact for many years. DNA is now routinely extracted from herbarium specimens. The most important specimens are called 'types'. The type specimen, chosen by the author of the species name, becomes the physical reference for the new species.

This unique working reference collection brings species from all over the world together into one place to be discovered, described and compared. The work is disseminated through the writing of Floras (a description of all the plants in a country or region), monographs (a description of plants or fungi within a group, such as a family) and scientific papers. This fundamental research provides an essential baseline for other plant-based research and helps inform conservation practices.

[Click here for further details.](https://www.rbge.org.uk/science-and-conservation/herbarium/)

In [None]:
import os
import json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


#To visualise the trend and analyse.
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_dark"

from plotly.offline import init_notebook_mode

# Load plotly.js from the CDN so figures render inside the notebook.
init_notebook_mode(connected=True)
%matplotlib inline


In [None]:
Train_data = "../input/herbarium-2020-fgvc7/nybg2020/train/"
Test_data = "../input/herbarium-2020-fgvc7/nybg2020/test/"
Meta_info  = "metadata.json"

**According to the data description, the dataset is in COCO format, so the metadata has to be handled accordingly.**<br>
[For further information on this data format, click here.](http://cocodataset.org/#format-data)
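
As a rough illustration of what COCO-style metadata looks like, the sketch below builds a tiny hand-made JSON document with the same top-level sections this notebook reads later (`annotations`, `categories`, `images`, `licenses`, `regions`) and loads each section into a DataFrame. All values are invented, and the per-record field names are an assumption based on the usual COCO convention, not something confirmed by the data description.

```python
import json

import pandas as pd

# Hypothetical, hand-made metadata: every value below is invented for illustration.
toy_metadata_json = """
{
  "annotations": [{"category_id": 7, "id": 100, "image_id": 100, "region_id": 1}],
  "categories":  [{"family": "ExampleFamily", "genus": "ExampleGenus", "id": 7, "name": "Example species"}],
  "images":      [{"file_name": "images/example.jpg", "height": 1000, "id": 100, "license": 1, "width": 661}],
  "licenses":    [{"id": 1, "name": "Example licence", "url": "https://example.org/licence"}],
  "regions":     [{"id": 1, "name": "Example region"}]
}
"""

toy_metadata = json.loads(toy_metadata_json)

# Each top-level section is a list of flat records, so it maps directly to a table.
toy_tables = {name: pd.DataFrame(records) for name, records in toy_metadata.items()}

for name, table in toy_tables.items():
    print(name, list(table.columns))
```

Records in different sections refer to each other through their id fields, which is why the sections can later be joined into a single table.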

In [None]:
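import codecs


def meta_ifo():
    """Load the training and test metadata JSON files and return them as dicts."""
    # errors="ignore" silently drops bytes that are not valid UTF-8.
    with codecs.open(Train_data + Meta_info, "r", encoding="utf-8", errors="ignore") as f:
        training_meta_info = json.load(f)

    with codecs.open(Test_data + Meta_info, "r", encoding="utf-8", errors="ignore") as f:
        testing_meta_info = json.load(f)

    return training_meta_info, testing_meta_info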

In [None]:
train_meta_info ,test_meta_info = meta_ifo()
train_meta_info.keys()

# **Column Renaming**

In [None]:
annotations = pd.DataFrame(train_meta_info['annotations'])
annotations.columns = ['category_id', 'id', 'image_id', 'region_id']

categories = pd.DataFrame(train_meta_info['categories'])
categories.columns = ['family', 'genus', 'category_id', 'category_name']

images = pd.DataFrame(train_meta_info['images'])
images.columns = ['image_file_name', 'height', 'image_id', 'license', 'width']

licenses = pd.DataFrame(train_meta_info['licenses'])
licenses.columns = ['licenses_id', 'license_name', 'url']

regions = pd.DataFrame(train_meta_info['regions'])
regions.columns = ['region_id', 'region_name']
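
Assigning a full list to `.columns` depends on the order in which the fields appear in the loaded JSON. A possible name-based alternative is `DataFrame.rename` with an explicit mapping, sketched here on a toy frame. The raw field names (`file_name`, `id`, and so on) are an assumption about the metadata file, and `raw_images`, `image_column_map` and `renamed_images` are hypothetical names.

```python
import pandas as pd

# Hypothetical raw record; the field names are an assumption about the metadata file.
# The keys are deliberately listed in a different order than the notebook expects.
raw_images = pd.DataFrame(
    [{"width": 661, "file_name": "images/example.jpg", "id": 1, "license": 1, "height": 1000}]
)

# Map old name -> new name explicitly, so column order in the file no longer matters.
image_column_map = {"file_name": "image_file_name", "id": "image_id"}
renamed_images = raw_images.rename(columns=image_column_map)

print(sorted(renamed_images.columns))
```

Columns that are not listed in the mapping keep their original names.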

In [None]:
column_info = {
                "categories":categories.columns,
                "annotations":annotations.columns,
                "images":images.columns,
                "licenses":licenses.columns,
                "regions":regions.columns    
                }

In [None]:
dataframe = annotations.copy(deep=True)
dataframe = dataframe.merge(categories,on="category_id",how="outer")
dataframe = dataframe.merge(images,on="image_id",how="outer")
dataframe = dataframe.merge(regions,on="region_id",how="outer")
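
An outer merge keeps rows that have no partner in the other table and fills the gaps with NaN. The sketch below, on invented toy frames, shows one way to check for such rows using the `indicator` and `validate` options of `DataFrame.merge`. The frame names are hypothetical.

```python
import pandas as pd

# Toy data: category 99 has no category record, category 20 has no annotation.
toy_annotations = pd.DataFrame({"image_id": [1, 2, 3], "category_id": [10, 10, 99]})
toy_categories = pd.DataFrame({"category_id": [10, 20], "family": ["FamilyA", "FamilyB"]})

# validate="m:1" raises MergeError if category_id is duplicated in toy_categories.
# indicator=True adds a "_merge" column that records where each row came from.
checked = toy_annotations.merge(
    toy_categories, on="category_id", how="outer", validate="m:1", indicator=True
)

# Rows that are not "both" are annotations without a category, or the reverse.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```

If `unmatched` is empty for the real tables, an inner merge would give the same result as the outer merge used above.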

In [None]:
dataframe.sample(n=10)

In [None]:
imageFiles = dataframe.dropna(subset=['image_file_name'])
# Use a new name here so the `images` DataFrame built earlier is not overwritten.
image_file_names = imageFiles['image_file_name'].tolist()
train_images = [Train_data + name for name in image_file_names]

In [None]:
imageFiles.tail()

# **Training Images Plot**

In [None]:
import matplotlib.image as mpimg
max_rows = 5
max_cols = 5
pic_index = 0
pic_index += 250
fig = plt.gcf()
fig.set_size_inches(max_cols * 5 , max_rows * 5)

for i, img_path in enumerate(train_images[pic_index - 25:pic_index]):
    # Set up subplot; subplot indices start at 1
    sp = plt.subplot(max_rows, max_cols, (i+1))
    sp.axis('Off')  # Don't show axes (or gridlines)
    img = mpimg.imread(img_path)
    plt.imshow(img)

plt.show()

# **Exploratory data analysis**

In [None]:
sortedData = dataframe.groupby(by=['category_id'],as_index=False,sort=True)['family'].count().sort_values(['family'], ascending=False)
sortedData = sortedData.head(n=10000)
sortedData.columns = ["Category","Total Specimen"]
sortedData.head()

# **Top 10,000 categories by specimen count**

In [None]:
# Bubble chart: marker size and colour both encode the specimen count of a category.

fig = px.scatter(sortedData,
                 x="Category",
                 y="Total Specimen",
                 size="Total Specimen",
                 color="Total Specimen",
                 hover_name="Total Specimen",
                 log_x=True,
                 height=1000,
                 size_max=60)
fig.show()

In [None]:
# Count how many images share each (height, width) combination.
imageFilesCopyDf = imageFiles.groupby(["height","width"]).size().reset_index(name='Total')
imageFilesCopyDf.sort_values("Total",axis=0,ascending=False)


**The table above shows 211 distinct (height, width) combinations across the image set.**
> We need to resize these images to a common shape.
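
To make "a common shape" concrete, here is a minimal sketch that resizes arrays of different sizes to one target size with nearest-neighbour sampling in NumPy. This is a deliberately simple stand-in, not the method this notebook uses: in practice an image library's resize routine would likely be preferred. The function `resize_nearest` and the 224 x 224 target are assumptions chosen for illustration.

```python
import numpy as np


def resize_nearest(image, target_height, target_width):
    """Resize an H x W (x C) array by nearest-neighbour sampling."""
    source_height, source_width = image.shape[:2]
    # For every output row and column, pick the index of a matching source pixel.
    row_index = np.arange(target_height) * source_height // target_height
    col_index = np.arange(target_width) * source_width // target_width
    return image[row_index[:, None], col_index]


# Two blank toy "images" with different widths, standing in for real specimens.
toy_images = [
    np.zeros((1000, 661, 3), dtype=np.uint8),
    np.zeros((1000, 680, 3), dtype=np.uint8),
]
resized = [resize_nearest(img, 224, 224) for img in toy_images]
print({img.shape for img in resized})
```

Forcing every image to a square target changes its aspect ratio, so padding or cropping before resizing may be worth considering.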

In [None]:
image_training_dataset = imageFiles[["category_id","family","genus","image_file_name"]]

In [None]:
image_training_dataset.sample(n=10)

In [None]:
from sklearn.model_selection import train_test_split as TTS
train_set , validation_set= TTS(image_training_dataset,test_size=0.2,shuffle=True,random_state=42)
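
A purely random split does not guarantee that every category ends up in the training set. The sketch below shows one way to check for this on an invented toy frame. It uses `DataFrame.sample` as a stand-in for the 80/20 split above, and all names (`toy`, `toy_train`, `toy_validation`, `missing_from_train`) are hypothetical.

```python
import pandas as pd

# Toy data: category 3 has only two images, so it is the one most at risk.
toy = pd.DataFrame({
    "category_id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "image_file_name": [f"img_{i}.jpg" for i in range(10)],
})

# Stand-in for the 80/20 split: hold out a random 20% of rows for validation.
toy_validation = toy.sample(frac=0.2, random_state=42)
toy_train = toy.drop(toy_validation.index)

# Categories seen in validation but never in training cannot be learned.
missing_from_train = set(toy_validation["category_id"]) - set(toy_train["category_id"])
print(sorted(missing_from_train))
```

Running the same check on `train_set` and `validation_set` would show whether any rare categories were held out entirely.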