<a href="https://colab.research.google.com/github/rahiakela/kaggle-competition-projects/blob/herbarium-2020-fgvc7-competition/1_herbarium_2020_fgvc7_competition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Herbarium 2020 - FGVC7
**Identify plant species from herbarium specimens. Data from New York Botanical Garden.**

The Herbarium 2020 FGVC7 Challenge is to identify vascular plant species from a large, long-tailed collection of herbarium specimens provided by the [New York Botanical Garden](https://www.nybg.org/plant-research-and-conservation/) (NYBG).

The Herbarium 2020 dataset contains over 1M images representing over 32,000 plant species. This is a dataset with a long tail: every species has at least 3 specimens, while some species are represented by more than a hundred. The dataset contains only vascular land plants, which include lycophytes, ferns, gymnosperms, and flowering plants. The extinct forms of lycophytes are the major component of coal deposits, ferns are indicators of ecosystem health, gymnosperms provide major habitats for animals, and flowering plants provide all of our crops, vegetables, and fruits.

<img src='https://github.com/rahiakela/img-repo/blob/master/herbarium.png?raw=1' width='800'/>

The teams with the most accurate models will be contacted, with the intention of applying their models to the unnamed plant collections in the NYBG herbarium; the results will be assessed by NYBG plant specialists.

Reference:

https://www.kaggle.com/rsingh99/getting-started-with-herbarium-2020




## Background

The New York Botanical Garden (NYBG) herbarium contains more than 7.8 million plant and fungal specimens. Herbaria are a massive repository of plant diversity data. These collections not only represent a vast amount of plant diversity, but since herbarium collections include specimens dating back hundreds of years, they provide snapshots of plant diversity through time. The integrity of the plant is maintained in herbaria as a pressed, dried specimen; a specimen collected nearly two hundred years ago by Darwin looks much the same as one collected a month ago by an NYBG botanist. All specimens not only maintain their morphological features but also include collection dates and locations, and the name of the person who collected the specimen. This information, multiplied by millions of plant collections, provides the framework for understanding plant diversity on a massive scale and learning how it has changed over time.

## Data Description

The training and test sets contain images of herbarium specimens from over 32,000 species of vascular plants. Each image contains exactly one specimen. The text and barcode labels on the specimen images have been blurred to remove category information from the image.

The data has been approximately split 80%/20% for training/test. Each category has at least 1 instance in both the training and test datasets. Note that the test set distribution is slightly different from the training set distribution. The training set contains species with hundreds of examples, but the test set has the number of examples per species capped at a maximum of 10.
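
Because the test set caps each species at 10 images while the training set does not, a validation split that mimics this cap may give a more representative estimate of test performance. The following is a minimal sketch, assuming annotations are available as `(image_id, category_id)` pairs. The function name `capped_holdout`, the 20% holdout fraction, and the seed are illustrative choices, not part of the competition kit.

```python
import random
from collections import defaultdict


def capped_holdout(pairs, holdout_frac=0.2, cap=10, seed=0):
    """Split (image_id, category_id) pairs into train and validation lists.

    For each category, roughly `holdout_frac` of its images (at least 1,
    at most `cap`) go to validation, but one image always stays in training.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for image_id, category_id in pairs:
        by_category[category_id].append(image_id)

    train, valid = [], []
    for category_id, image_ids in by_category.items():
        rng.shuffle(image_ids)
        n_valid = min(cap, max(1, round(len(image_ids) * holdout_frac)))
        n_valid = min(n_valid, len(image_ids) - 1)  # keep one for training
        valid += [(i, category_id) for i in image_ids[:n_valid]]
        train += [(i, category_id) for i in image_ids[n_valid:]]
    return train, valid


# Toy example: category 0 has 200 images, category 1 has only 3.
pairs = [(i, 0) for i in range(200)] + [(200 + i, 1) for i in range(3)]
train_pairs, valid_pairs = capped_holdout(pairs)
print(len(train_pairs), len(valid_pairs))
```

In the toy example the large category contributes 10 validation images (the cap) and the small one contributes 1, so rare species are not drained from the training split.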


### Dataset Details

Image dimensions vary from image to image, with a maximum of 1000 pixels in the larger dimension. The images have been resized from their original resolution. All images are in JPEG format.

### Dataset Format

This dataset uses the [COCO dataset format](http://cocodataset.org/#format-data) with additional annotation fields. In addition to the species category labels, we also provide region and supercategory information.

The training set metadata (train/metadata.json) and test set metadata (test/metadata.json) are JSON files in the format below. Naturally, the test set metadata file omits the "annotations", "categories" and "regions" elements.

```json
{
  "annotations" : [annotation],
  "categories" : [category],
  "images" : [image],
  "info" : info,
  "licenses": [license],
  "regions": [region]
}

info {
  "year" : int,
  "version" : str,
  "url": str,
  "description" : str,
  "contributor" : str,
  "date_created" : datetime
}

image {
  "id" : int,
  "width" : int,
  "height" : int,
  "file_name" : str,
  "license" : int
}

annotation {
  "id": int,
  "image_id": int,
  "category_id": int,
  # Region where this specimen was collected.
  "region_id": int
}

category {
  "id" : int,
  # Species name
  "name" : str,
  # We also provide the super-categories for each species.
  "family": str,
  "genus": str
}

region {
  "id": int,
  "name": str
}

license {
  "id": 1,
  "name": str,
  "url": str
}
```
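
To make the relationships between these sections concrete, here is a small sketch that resolves each annotation into a flat record by following `image_id`, `category_id`, and `region_id`. The metadata below is a toy stand-in: all ids, names, and file names are made up for illustration and do not come from the real files.

```python
import json

# Toy metadata in the documented layout; every value here is made up.
toy_metadata = json.loads("""
{
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 7, "region_id": 3},
    {"id": 11, "image_id": 2, "category_id": 8, "region_id": 3}
  ],
  "categories": [
    {"id": 7, "name": "Species a", "family": "Family x", "genus": "Genus p"},
    {"id": 8, "name": "Species b", "family": "Family x", "genus": "Genus q"}
  ],
  "images": [
    {"id": 1, "width": 680, "height": 1000, "file_name": "toy/1.jpg", "license": 1},
    {"id": 2, "width": 700, "height": 1000, "file_name": "toy/2.jpg", "license": 1}
  ],
  "regions": [{"id": 3, "name": "Region r"}]
}
""")

# Index the lookup tables by their "id" field.
categories = {c["id"]: c for c in toy_metadata["categories"]}
regions = {r["id"]: r for r in toy_metadata["regions"]}
images = {i["id"]: i for i in toy_metadata["images"]}

# Resolve every annotation into one flat record per image.
records = []
for ann in toy_metadata["annotations"]:
    category = categories[ann["category_id"]]
    records.append({
        "file_name": images[ann["image_id"]]["file_name"],
        "species": category["name"],
        "genus": category["genus"],
        "family": category["family"],
        "region": regions[ann["region_id"]]["name"],
    })

for record in records:
    print(record)
```

The same dictionary lookups should apply to the real training metadata once it is loaded, since it follows the layout described above.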

The training set images are organized in subfolders:
```python 
train/<subfolder1>/<subfolder2>/<image id>.jpg
```

The test set images are organized in subfolders:
```python
test/<subfolder>/<image id>.jpg
```

## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
from tensorflow import keras

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

## Load dataset from Kaggle

In [4]:
!pip uninstall -y kaggle
!pip install --upgrade pip
!pip install kaggle==1.5.6
!kaggle -v

Found existing installation: kaggle 1.5.6
Uninstalling kaggle-1.5.6:
  Successfully uninstalled kaggle-1.5.6
Requirement already up-to-date: pip in /usr/local/lib/python3.6/dist-packages (20.0.2)
Processing /root/.cache/pip/wheels/01/3e/ff/77407ebac3ef71a79b9166a8382aecf88415a0bcbe3c095a01/kaggle-1.5.6-py3-none-any.whl
Installing collected packages: kaggle
Successfully installed kaggle-1.5.6
Kaggle API 1.5.6


First, copy the kaggle.json file to the .kaggle directory.

In [0]:
# copy kaggle.json to the ~/.kaggle directory, creating it if needed
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [0]:
# show all available datasets
!kaggle datasets list

ref                                                         title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
----------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
unanimad/dataisbeautiful                                    Reddit - Data is Beautiful                           11MB  2020-03-21 22:28:28            443         49  1.0              
allen-institute-for-ai/CORD-19-research-challenge           COVID-19 Open Research Dataset Challenge (CORD-19)  646MB  2020-03-20 23:31:34          22457       3252  0.88235295       
rubenssjr/brasilian-houses-to-rent                          brazilian_houses_to_rent                            117KB  2020-03-15 01:12:22            488         27  1.0              
sudalairajkumar/novel-corona-virus-2019-dataset             Novel Corona Virus 2

In [0]:
# Try to download data for the herbarium-2020-fgvc7 challenge.
!kaggle competitions download -c herbarium-2020-fgvc7

Downloading covid19-global-forecasting-week-1.zip to /content
  0% 0.00/195k [00:00<?, ?B/s]
100% 195k/195k [00:00<00:00, 89.8MB/s]


### Unzip dataset

In [0]:
import os, shutil
import zipfile

# path to the directory where the original dataset was uncompressed
original_dataset_dir = 'kaggle_herbarium_2020_fgvc7'

# remove the directory if it already exists
shutil.rmtree(original_dataset_dir, ignore_errors=True)

# create the directory
os.mkdir(original_dataset_dir)

In [0]:
# unzip dataset
with zipfile.ZipFile("herbarium-2020-fgvc7.zip","r") as zip_ref:
    zip_ref.extractall(original_dataset_dir)

In [0]:
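# These paths assume the competition data is mounted under ../input (as in a
# Kaggle kernel); adjust them if the archive was extracted elsewhere.
TRAIN       = "../input/herbarium-2020-fgvc7/nybg2020/train/"
TEST        = "../input/herbarium-2020-fgvc7/nybg2020/test/"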
META        = "metadata.json"
BATCH_SIZE  = 7
NUM_WORKERS = 2
BATCH_EVAL  = 1
SHUFFLE     = True
EPOCHS      = 3
RESIZE      = (800, 600)
CLASSES     = 32094
LENGTH      = 2*CLASSES

## DATA INSIGHT

The dataset is in COCO format.

COCO is a large image dataset designed for object detection, segmentation, person keypoint detection, stuff segmentation, and caption generation. The COCO API package provides Matlab, Python, and Lua APIs that assist in loading, parsing, and visualizing COCO annotations. The Matlab and Python APIs are complete; the Lua API provides only basic functionality. Please visit http://cocodataset.org/ for more information on COCO, including the data, paper, tutorials, and the exact annotation format.

### Train file

In [0]:
import json
from os.path import join

with open(join(TRAIN, META), 'r', encoding='ISO-8859-1') as file:
  metadata = json.load(file)

print('Metadata has {} sections'.format(len(metadata)))
print('All the sections in metadata:')
for key in metadata:
  print(' - ', key)

print('Number of images in the training set:', len(metadata['images']))
print('\nSample and number of elements in each section:\n')
for key, section in metadata.items():
  # list sections show their first element; dict sections (e.g. info) are shown whole
  sample = section[0] if isinstance(section, list) else section
  print(' - sample and number of elements in {}: '.format(key), len(section))
  print('\t', sample, end='\n\n')

### Test file

In [0]:
import json
from os.path import join

with open(join(TEST, META), 'r', encoding='ISO-8859-1') as file:
  metadata_test = json.load(file)

print('Metadata has {} sections'.format(len(metadata_test)))
print('All the sections in metadata:')
for key in metadata_test:
  print(' - ', key)

print('Number of images in the test set:', len(metadata_test['images']))
print('\nSample and number of elements in each section:\n')
for key, section in metadata_test.items():
  # list sections show their first element; dict sections (e.g. info) are shown whole
  sample = section[0] if isinstance(section, list) else section
  print(' - sample and number of elements in {}: '.format(key), len(section))
  print('\t', sample, end='\n\n')

There are 1,030,747 images in the training set.

There are 32,094 classes in the dataset.

### Visualize some image samples

Now let us look at an image sample.

In [0]:
train_img = pd.DataFrame(metadata['images'])
train_ann = pd.DataFrame(metadata['annotations'])
# join each annotation to its image record: annotation.image_id == image.id
train_df = (
    pd.merge(train_ann, train_img, left_on='image_id', right_on='id',
             how='left', suffixes=('_ann', '_img'))
    .drop('image_id', axis=1)
    .sort_values(by=['category_id'])
)
train_df.head()

In [0]:
from PIL import Image

# TRAIN already ends with a slash, so the relative file name can be appended
img = Image.open(TRAIN + 'images/156/72/354106.jpg')
print('Image 354106 (category 15672) is shown below.')
img

### Data Distribution

In [0]:
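import cv2

img_size = (28, 28)

fig = plt.figure(figsize=(72, 72))
for i in range(60):
  ax = fig.add_subplot(12, 12, i + 1)
  img = cv2.imread(TRAIN + metadata['images'][i]['file_name'])
  # OpenCV loads images as BGR; convert to RGB before showing with matplotlib
  img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
  img = cv2.resize(img, img_size)
  ax.imshow(img)
plt.show()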
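
The long-tailed class distribution described in the introduction can also be summarized numerically. Below is a minimal sketch using only the standard library; the helper name `summarize_class_counts` and the threshold of 10 images are illustrative assumptions, and the input here is toy data rather than the real annotations.

```python
from collections import Counter


def summarize_class_counts(category_ids):
    """Return basic long-tail statistics for a list of category ids."""
    per_class = Counter(category_ids)   # category_id -> image count
    counts = sorted(per_class.values())
    return {
        "num_classes": len(counts),
        "min": counts[0],
        # upper median: the middle element of the sorted counts
        "median": counts[len(counts) // 2],
        "max": counts[-1],
        # number of classes with fewer than 10 images
        "classes_with_fewer_than_10": sum(1 for c in counts if c < 10),
    }


# Toy input; with the real data one would pass
# [a["category_id"] for a in metadata["annotations"]] instead.
toy_ids = [0] * 120 + [1] * 15 + [2] * 3 + [3] * 3
summary = summarize_class_counts(toy_ids)
print(summary)
```

A large gap between the median and the maximum is the signature of a long tail, and is worth checking before choosing a sampling or loss-weighting strategy.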

## DATALOADER