# Herbarium 2021: Half-Earth Challenge - FGVC8 - Exploratory Data Analysis

A quick exploratory data analysis for the [Herbarium 2021: Half-Earth Challenge - FGVC8](https://www.kaggle.com/c/herbarium-2021-fgvc8) competition.

The Herbarium 2021: Half-Earth Challenge is to identify vascular plant specimens provided by the [New York Botanical Garden (NY)](https://www.nybg.org/), [Bishop Museum (BPBM)](https://www.bishopmuseum.org/), [Naturalis Biodiversity Center (NL)](https://www.naturalis.nl/en), [Queensland Herbarium (BRI)](https://www.qld.gov.au/environment/plants-animals/plants/herbarium), and [Auckland War Memorial Museum (AK)](https://www.aucklandmuseum.com/).

![](https://storage.googleapis.com/kaggle-competitions/kaggle/25558/logos/header.png)

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:brown; border:0' role="tab" aria-controls="home"><center>Quick Navigation</center></h3>

* [Overview](#1)
* [Data Visualization](#2)
    
    
* [Competition Metric](#100)
* [Sample Submission](#101)
    
    
* [Modeling](#200)

<a id="1"></a>
<h2 style='background:brown; border:0; color:white'><center>Overview</center></h2>

The training and test set contain images of herbarium specimens from nearly 65,000 species of vascular plants. Each image contains exactly one specimen. The text labels on the specimen images have been blurred to remove category information in the image.

The data has been approximately split 80%/20% for training/test. Each category has at least 1 instance in both the training and test datasets. Note that the test set distribution is slightly different from the training set distribution. The training set contains species with hundreds of examples, but the test set has the number of examples per species capped at a maximum of 10.
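
Because the test set is capped at 10 images per species, a validation subset with the same cap may resemble the test distribution more closely than a plain random sample. The following is a minimal sketch on a toy DataFrame; `df_toy` and `MAX_PER_CATEGORY` are illustrative names and the counts are made up, not taken from the competition data.

```python
import pandas as pd

# Toy stand-in for the real metadata: 3 categories with 25, 12 and 2 images.
df_toy = pd.DataFrame({
    "id": range(39),
    "category": [0] * 25 + [1] * 12 + [2] * 2,
})

MAX_PER_CATEGORY = 10  # the cap described for the test set

# Shuffle once, then keep at most MAX_PER_CATEGORY rows per category.
shuffled = df_toy.sample(frac=1.0, random_state=0)
df_capped = shuffled.groupby("category").head(MAX_PER_CATEGORY)

capped_counts = df_capped["category"].value_counts().sort_index()
print(capped_counts.to_dict())  # {0: 10, 1: 10, 2: 2}
```

In a real split the selected rows would also have to be held out from training; the sketch only shows the capping step.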

In [None]:
import os
import json
import collections

import numpy as np
import pandas as pd
import cv2
import matplotlib.pyplot as plt

### Read the metadata file

In [None]:
PATH_BASE = "../input/herbarium-2021-fgvc8/"
PATH_TRAIN = os.path.join(PATH_BASE, "train/")
PATH_TRAIN_META = os.path.join(PATH_TRAIN, "metadata.json")


with open(PATH_TRAIN_META) as json_file:
    metadata = json.load(json_file)

### First level elements

In [None]:
metadata.keys()

### Check the number of images and their annotations

In [None]:
len(metadata["annotations"]), len(metadata["images"])

### Check first samples from each key

In [None]:
print(metadata["annotations"][0])
print(metadata["images"][0])
print(metadata["categories"][0])
print(metadata["licenses"][0])
print(metadata["institutions"][0])

### Calculate the total number of classes

In [None]:
len(set([annotation["category_id"] for annotation in metadata["annotations"]]))

### Create DataFrame with main information

In [None]:
ids = []
categories = []
paths = []

for annotation, image in zip(metadata["annotations"], metadata["images"]):
    assert annotation["image_id"] == image["id"]
    ids.append(image["id"])
    categories.append(annotation["category_id"])
    paths.append(image["file_name"])
        
df_meta = pd.DataFrame({"id": ids, "category": categories, "path": paths})
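
The loop above relies on `annotations` and `images` being stored in the same order, which the `assert` checks. If that assumption ever fails, joining the two lists on the image id is a possible alternative. This is a sketch on made-up toy metadata (the ids and file names below are invented for illustration), not the notebook's actual method.

```python
import pandas as pd

# Toy metadata in the same layout; the two lists are deliberately
# stored in different orders. All values are invented.
toy_metadata = {
    "annotations": [
        {"id": 10, "image_id": 2, "category_id": 7},
        {"id": 11, "image_id": 1, "category_id": 3},
    ],
    "images": [
        {"id": 1, "file_name": "images/000/01/1.jpg"},
        {"id": 2, "file_name": "images/000/02/2.jpg"},
    ],
}

df_ann = pd.DataFrame(toy_metadata["annotations"])[["image_id", "category_id"]]
df_img = pd.DataFrame(toy_metadata["images"])[["id", "file_name"]]

# Join on the image id; validate="one_to_one" raises if an id is duplicated.
df_joined = df_img.merge(
    df_ann, left_on="id", right_on="image_id", validate="one_to_one"
).drop(columns="image_id")
df_joined = df_joined.rename(columns={"category_id": "category", "file_name": "path"})
print(df_joined)
```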

In [None]:
df_meta

### Class distribution

In [None]:
df_meta["category"].value_counts()
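
The raw `value_counts()` output is long, so a few summary numbers can make the long tail easier to see. A minimal sketch on a toy series standing in for `df_meta["category"]`; the threshold of 5 images for a "rare" category is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy category column standing in for df_meta["category"].
toy_categories = pd.Series([0] * 50 + [1] * 8 + [2] * 3 + [3] * 1)

counts = toy_categories.value_counts()

summary = {
    "n_categories": int(counts.size),
    "min_images": int(counts.min()),
    "median_images": float(counts.median()),
    "max_images": int(counts.max()),
    # Categories with fewer than 5 images (arbitrary threshold) form the rare tail.
    "n_rare": int((counts < 5).sum()),
}
print(summary)
```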

### Look up the name, family, and order of each class by id

In [None]:
d_categories = {category["id"]: category["name"] for category in metadata["categories"]}
d_families = {category["id"]: category["family"] for category in metadata["categories"]}
d_orders = {category["id"]: category["order"] for category in metadata["categories"]}

df_meta["category_name"] = df_meta["category"].map(d_categories)
df_meta["family_name"] = df_meta["category"].map(d_families)
df_meta["order_name"] = df_meta["category"].map(d_orders)
df_meta

<a id="2"></a>
<h2 style='background:brown; border:0; color:white'><center>Data Visualization</center></h2>

In [None]:
def visualize_train_batch(paths, categories, families, orders):
    plt.figure(figsize=(16, 16))
    
    for ind, info in enumerate(zip(paths, categories, families, orders)):
        path, category, family, order = info
        
        plt.subplot(2, 3, ind + 1)
        
        image = cv2.imread(os.path.join(PATH_TRAIN, path))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        plt.imshow(image)
        
        plt.title(
            f"FAMILY: {family} ORDER: {order}\n{category}", 
            fontsize=10,
        )
        plt.axis("off")
    
    plt.show()

In [None]:
def visualize_by_id(df, _id=None):
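    # Restrict to one category if an id is given, otherwise use all images
    subset = df if _id is None else df[df["category"] == _id]
    # Some categories have fewer than 6 images, so cap the sample size
    tmp = subset.sample(min(6, len(subset)))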

    visualize_train_batch(
        tmp["path"].tolist(), 
        tmp["category_name"].tolist(),
        tmp["family_name"].tolist(),
        tmp["order_name"].tolist(),
    )

In [None]:
visualize_by_id(df_meta, 22344)

In [None]:
visualize_by_id(df_meta, 42811)

In [None]:
visualize_by_id(df_meta, 1719)

In [None]:
visualize_by_id(df_meta, 1)

### Random samples

In [None]:
visualize_by_id(df_meta)

<a id="100"></a>
<h2 style='background:brown; border:0; color:white'><center>Competition Metric</center></h2>

Submissions are evaluated using the [macro F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).

$$F_1 = 2\frac{precision \cdot recall}{precision+recall}$$

where:

$$precision = \frac{TP}{TP+FP}$$

$$recall = \frac{TP}{TP+FN}$$

In "macro" F1, a separate F1 score is calculated for each species, and the per-species scores are then averaged with equal weight, so rare species count as much as common ones.
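
To make the definition concrete, here is a small from-scratch sketch of macro F1 in plain Python. It uses the identity $F_1 = \frac{2TP}{2TP+FP+FN}$, which follows from the precision and recall formulas above. The helper name `macro_f1` and the choice of label set (every label that appears in either list) are assumptions for illustration; the competition's scorer may handle edge cases differently.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 0, 0, 1, 2]

# Always predicting the most frequent class gets 4 of 6 samples right,
# but classes 1 and 2 each contribute an F1 of 0 to the average.
constant = [0] * len(y_true)
print(round(macro_f1(y_true, constant), 3))  # 0.267
```

Class 0 reaches an F1 of 0.8 here, and dividing by three classes gives about 0.267. With tens of thousands of species, a constant prediction scores close to zero under this metric.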

<a id="101"></a>
<h2 style='background:brown; border:0; color:white'><center>Sample Submission</center></h2>

In [None]:
df_submission = pd.read_csv(
    "../input/herbarium-2021-fgvc8/sample_submission.csv",
    index_col=0,
)

### Predict one of the most frequent classes in the train data

In [None]:
df_submission["Predicted"] = 25229

In [None]:
df_submission.to_csv("submission.csv")

In [None]:
pd.read_csv("submission.csv", index_col=0)

<a id="200"></a>
<h2 style='background:brown; border:0; color:white'><center>Modeling</center></h2>

### The idea: compute an average feature vector per category with a pretrained model (MobileNetV2), then assign each test sample the category of its nearest average vector
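
The approach in the heading above is essentially a nearest-class-mean classifier on top of fixed features. A minimal sketch on toy 2-D vectors follows; the cluster centers, sample counts, and variable names are invented for illustration, and real vectors would come from the network rather than a random generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "embeddings": 3 categories, 20 train vectors each, clustered around
# well-separated centers.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
train_labels = np.repeat(np.arange(3), 20)
train_vectors = centers[train_labels] + rng.normal(scale=0.5, size=(60, 2))

# One prototype per category: the mean of its train vectors.
prototypes = np.stack(
    [train_vectors[train_labels == c].mean(axis=0) for c in range(3)]
)

# Classify test vectors by the nearest prototype (Euclidean distance).
test_vectors = np.array([[9.5, 0.3], [0.2, 9.0], [-0.4, 0.1]])
distances = np.linalg.norm(
    test_vectors[:, None, :] - prototypes[None, :, :], axis=2
)
predicted = distances.argmin(axis=1)
print(predicted.tolist())  # [1, 2, 0]
```

No training is involved, which keeps the baseline cheap, but its quality depends entirely on how well the pretrained features separate the species.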

In [None]:
FULL_PIPELINE = False

### Import libraries

In [None]:
import os
import random

import numpy as np
from numpy import save, load
import pandas as pd
import cv2
import albumentations as A
from albumentations import pytorch as ATorch
import torch
from torch.utils import data as torch_data
from torch import nn as torch_nn
from torch.nn import functional as torch_functional
import torchvision
from tqdm import tqdm
from sklearn.metrics.pairwise import euclidean_distances

### Define the model

You can use any of the pretrained models, for example:
- [PYTORCH HUB FOR RESEARCHERS](https://pytorch.org/hub/research-models)
- [TORCHVISION.MODELS](https://pytorch.org/vision/stable/models.html)

In [None]:
class MobileNetV2(torch.nn.Module):
    def __init__(self):
        super().__init__()
        tmp_net = torch.hub.load(
            "pytorch/vision:v0.6.0", "mobilenet_v2", pretrained=True
        )
        self.net = torch_nn.Sequential(*(list(tmp_net.children())[:-1]))

    def forward(self, x):
        return self.net(x)

### Define your dataset class for getting image samples

In [None]:
class DataRetriever(torch_data.Dataset):
    def __init__(
        self, 
        paths, 
        categories=None,
        transforms=None,
        base_path=PATH_TRAIN
    ):
        self.paths = paths
        self.categories = categories
        self.transforms = transforms
        self.base_path = base_path
          
    def __len__(self):
        return len(self.paths)
    
    def __getitem__(self, index):
        img = cv2.imread(os.path.join(self.base_path, self.paths[index]))
        img = cv2.resize(img, (224, 224))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        
        if self.transforms:
            img = self.transforms(image=img)["image"]
        
        if self.categories is None:
            return img
        
        y = self.categories[index] 
        return img, y
    
    
def get_transforms():
    return A.Compose(
        [
            A.Normalize(
                mean=[0.485, 0.456, 0.406], 
                std=[0.229, 0.224, 0.225], 
                p=1.0
            ),
            ATorch.transforms.ToTensorV2(p=1.0),
        ], 
        p=1.0
    )
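
For readers unfamiliar with the two transforms, this is a NumPy sketch of what they do numerically as far as I understand them: scale pixels to [0, 1], standardize each channel with the ImageNet statistics, and move the channel axis first. It assumes the default maximum pixel value of 255; the function name `normalize_and_to_chw` is hypothetical, and the sketch returns a NumPy array rather than a torch tensor.

```python
import numpy as np

# ImageNet channel statistics, the same values passed to A.Normalize.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_and_to_chw(img_uint8):
    """Scale an HxWx3 uint8 RGB image to [0, 1], standardize, return CxHxW."""
    scaled = img_uint8.astype(np.float32) / 255.0
    standardized = (scaled - MEAN) / STD
    return np.transpose(standardized, (2, 0, 1))


# A flat mid-gray image stands in for a resized specimen photo.
img = np.full((224, 224, 3), 128, dtype=np.uint8)
out = normalize_and_to_chw(img)
print(out.shape)  # (3, 224, 224)
```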

### For each category (target), run all of its train images through the model and average the resulting vectors


In [None]:
df_train = df_meta[["category", "path"]].sort_values(by="category")

df_train

In [None]:
tmp_path = df_train["path"].tolist()
tmp_category = df_train["category"].tolist()
# If FULL_PIPELINE is False we use small subset of data
if not FULL_PIPELINE:
    tmp_path = tmp_path[:256 * 8]
    tmp_category = tmp_category[:256 * 8]

train_data_retriever = DataRetriever(
    tmp_path,
    tmp_category,
    transforms=get_transforms(),
)

train_loader = torch_data.DataLoader(
    train_data_retriever,
    batch_size=256,
    shuffle=False,
    num_workers=8,
)

### Initialize the model

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = MobileNetV2()
model.to(device)
model.eval();

### Save output vectors from the model and average by category

In [None]:
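# Count only the images that are actually fed through the loader, so the
# averages stay correct when FULL_PIPELINE is False and a subset is used
category_counts = collections.Counter(tmp_category)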

In [None]:
final_vectors = np.zeros((len(category_counts), 1280))

with torch.no_grad():
    for batch in tqdm(train_loader):
        X, y = batch
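        # Global average pooling over the spatial dimensions -> (batch, 1280)
        vectors = model(X.to(device)).mean(axis=(2, 3)).cpu().numpy()

        # Add each image's share of its category mean
        for category, vector in zip(y.numpy().tolist(), vectors):
            final_vectors[category] += vector / category_counts[category]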
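
The loop above divides each vector by its category count before adding it, so every row holds the category mean once all images have been seen. An equivalent vectorized form can be sketched with `np.bincount` and `np.add.at`; the toy shapes and names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 6 feature vectors of size 4 belonging to 3 categories.
vectors = rng.normal(size=(6, 4))
labels = np.array([0, 0, 1, 2, 2, 2])
n_categories = 3

# Number of vectors per category.
counts = np.bincount(labels, minlength=n_categories)

# Scatter-add: np.add.at accumulates correctly for repeated indices, unlike
# `sums[labels] += vectors`, which applies only one update per index.
sums = np.zeros((n_categories, vectors.shape[1]))
np.add.at(sums, labels, vectors)

# Divide by the counts; the maximum with 1 avoids dividing by zero
# for categories that received no vectors.
means = sums / np.maximum(counts, 1)[:, None]
print(means.shape)  # (3, 4)
```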

### Save and load category vectors (you can precompute them once and reuse them)

In [None]:
save("average_vectors.npy", final_vectors)

In [None]:
final_vectors = load("average_vectors.npy")

### Get test paths

In [None]:
PATH_TEST = os.path.join(PATH_BASE, "test/")
PATH_TEST_META = os.path.join(PATH_TEST, "metadata.json")


with open(PATH_TEST_META) as json_file:
    metadata = json.load(json_file)

    
id2path = {
    img["id"]: img["file_name"] for img in metadata["images"]
}

In [None]:
df_submission = pd.read_csv(
    "../input/herbarium-2021-fgvc8/sample_submission.csv",
    index_col=0,
)

df_submission["Id"] = df_submission.index
df_submission["Path"] = df_submission["Id"].map(lambda x: id2path[x])

### Create test data loader

In [None]:
tmp_path = df_submission["Path"].tolist()
# If FULL_PIPELINE is False we use small subset of data
if not FULL_PIPELINE:
    tmp_path = tmp_path[:256 * 2]

test_data_retriever = DataRetriever(
    tmp_path,
    transforms=get_transforms(),
    base_path=PATH_TEST,
)

test_loader = torch_data.DataLoader(
    test_data_retriever,
    batch_size=256,
    shuffle=False,
    num_workers=8,
)

### Compute test vectors, find the nearest category vector (by Euclidean distance), and take its category

In [None]:
res = []

with torch.no_grad():
    for ind, X in enumerate(tqdm(test_loader)):
        vectors = model(X.to(device)).mean(axis=(2, 3))
        tmp = euclidean_distances(vectors.cpu().numpy(), final_vectors)
        res.extend(list(tmp.argmin(axis=1)))

### Save results to submission file

In [None]:
df_submission.iloc[:len(res), 0] = res

df_submission[["Predicted"]].to_csv("submission.csv")

pd.read_csv("submission.csv", index_col=0)

### I prepared a submission file for the full dataset using the algorithm described above; you can use it for a fast submission

In [None]:
PATH_PREPARED_SUBMISSION = "../input/herbarium-2021-submissions/submission-mobilenetv2-mean.csv"

prepared_submission = pd.read_csv(PATH_PREPARED_SUBMISSION, index_col=0)
prepared_submission.to_csv("prepared_submission.csv")

pd.read_csv("prepared_submission.csv", index_col=0)

## Work In Progress...