# Vector Validation

This notebook continues from the two notebooks below. Please read them first if you haven't already:

* https://www.kaggle.com/code/prubyg/hotel-id-vector-extraction
* https://www.kaggle.com/code/prubyg/hotel-id-vector-indexing/edit/run/92585356

This notebook attempted to validate the process of finding similar images with a KNN search over the stored vectors. TL;DR - it was a total flop: the validation score was 0.000 on the first run, and only 0.003 after a bug fix. Let's see how we achieved this incredible result.

Like the extraction notebook, we will be extracting vectors from images - our validation set this time. For this, we need the timm library of models.

In [None]:
!pip install timm

Import our dependencies...

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random

from annoy import AnnoyIndex
import pyarrow as pa
from pyarrow.parquet import ParquetFile

import torch.nn as nn
from torch.utils.data import DataLoader
import timm

In [None]:
from PIL import Image as pil_image
from tqdm import tqdm

In [None]:
DEVICE = 'cpu' # We're dealing with small numbers of images, so CPU is fine.
VECTOR_WIDTH = 320
IMG_SIZE = 256
VECTOR_FILE = '/kaggle/input/hotel-id-vector-extraction/vectors.parquet'
INDEX_FILE = '/kaggle/input/hotel-id-vector-indexing/vectors.annoy'

K = 10 # Number of nearest-neighbours to retrieve for each feature vector

DATA_FOLDER = "../input/hotelid-2022-train-images-256x256/"
IMAGE_FOLDER = DATA_FOLDER + "images/"

We load our Parquet file and the Annoy approximate-KNN index.

In [None]:
vector_db = ParquetFile(VECTOR_FILE, read_dictionary=['file'])
index = AnnoyIndex(VECTOR_WIDTH, 'angular')
index.load(INDEX_FILE)
N = index.get_n_items()
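
As a reminder of what the `'angular'` metric measures: Annoy documents it as the Euclidean distance between the normalised vectors, which works out to `sqrt(2 * (1 - cos(u, v)))`. Below is a minimal pure-Python sketch of that formula, for intuition only. The function name `annoy_angular_distance` is made up for this illustration and is not part of the Annoy API.

```python
import math

def annoy_angular_distance(u, v):
    """Hypothetical helper: sqrt(2 * (1 - cos(u, v))), as documented for Annoy's 'angular' metric."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1]
    cosine = max(-1.0, min(1.0, cosine))
    return math.sqrt(2.0 * (1.0 - cosine))

print(annoy_angular_distance([1.0, 0.0], [2.0, 0.0]))   # same direction: 0.0
print(annoy_angular_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal: about 1.414
print(annoy_angular_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite: 2.0
```

So distances range from 0 (same direction) to 2 (opposite direction), and vector length is ignored.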

This time, we read two columns of the Parquet file completely into memory - the target and the image file name. Together they are small enough to fit in memory easily.

In [None]:
meta = vector_db.read(['target', 'file'])
targets = meta['target']
files = meta['file']
assert len(targets) == N  # one metadata row per indexed vector

We load our validation set from the "extraction" notebook. This was emitted from that notebook to be sure we're using a consistent train/validation split, and not splitting differently at the end here. The validation set's images will not appear in the stored index.

In [None]:
val_df = pd.read_csv('/kaggle/input/hotel-id-vector-extraction/validation.csv')


The dataset and vector extractor below are the same as in the vector extraction notebook.

In [None]:
class HotelTrainDataset:
    def __init__(self, data, transform=None, data_path="train_images/"):
        self.data = data
        self.data_path = data_path
        self.transform = transform

    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        record = self.data.iloc[idx]
        image_path = self.data_path + record["image_id"]
        image = np.array(pil_image.open(image_path)).astype(np.uint8)

        if self.transform:
            transformed = self.transform(image=image)
            image = transformed["image"]
        
        hotel_id = record['hotel_id']
        
        return {
            "image" : image,
            "image_id": record["image_id"],
            "target" : hotel_id
        }

In [None]:
class VectorExtractor(nn.Module):
    def __init__(self, backbone_name='efficientnet_b0', layer_to_extract=4):
        super(VectorExtractor, self).__init__()
        self.backbone_name = backbone_name
        self.layer_to_extract = layer_to_extract
        self.backbone = timm.create_model(self.backbone_name, pretrained=True, features_only=True)

    def forward(self, x):
        layers = self.backbone(x)
        return layers[self.layer_to_extract]

We do have a slightly more involved image augmentation here - we simulate the masks from the test dataset using the method from Michal's notebooks.

In [None]:
import albumentations as A
import albumentations.pytorch as APT
import cv2 

# used for validation dataset - only occlusions
val_transform = A.Compose([
    A.CoarseDropout(p=1.0, max_holes=1, 
                    min_height=IMG_SIZE//4, max_height=IMG_SIZE//2,
                    min_width=IMG_SIZE//4,  max_width=IMG_SIZE//2, 
                    fill_value=(255,0,0)),# simulating occlusions
    A.ToFloat(),
    APT.transforms.ToTensorV2(),
])

# no augmentations
base_transform = A.Compose([
    A.ToFloat(),
    APT.transforms.ToTensorV2(),
])

Construct dataset and loader...

In [None]:
val_dataset = HotelTrainDataset(val_df, val_transform, data_path=IMAGE_FOLDER)
valid_loader = DataLoader(val_dataset, num_workers=2, batch_size=1, shuffle=False)

And we do the actual work. For each validation image, we run it through efficientnet_b0, extract the same layer we used for the indexed features, and find the nearest neighbours of that vector. We look up which hotel each corresponds to, and record a vote for that hotel.

I considered weighting votes by vector distance here, but never got that far before abandoning this path. You'll see why.
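
For the record, here is a rough sketch of what distance-weighted voting could have looked like. This was never run in this notebook. It assumes the neighbours come from `index.get_nns_by_vector(features, K, include_distances=True)`, which returns a pair of lists (ids, distances). The helper name `weighted_votes`, the inverse-distance weighting and the `eps` value are all illustrative choices, and the hotel ids and distances are invented.

```python
from collections import defaultdict

def weighted_votes(neighbour_hotels, distances, eps=1e-6):
    """Hypothetical helper: each neighbour votes with weight 1 / (distance + eps)."""
    votes = defaultdict(float)
    for hotel, dist in zip(neighbour_hotels, distances):
        # Closer neighbours count for more; eps avoids division by zero
        votes[hotel] += 1.0 / (dist + eps)
    return dict(votes)

# Invented neighbours for one feature vector: hotel ids and their distances
hotels = [101, 202, 202, 303]
dists = [0.1, 0.8, 0.9, 1.2]

votes = weighted_votes(hotels, dists)
winner = max(votes, key=votes.get)
print(winner)  # 101: one very close match outweighs two distant ones
```

With plain counting, hotel 202 would have won this example with two votes. Weighting flips the result towards the single close match.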

In [None]:
extractor = VectorExtractor().to(DEVICE)
correct = 0
valid_votes = 0
n = 0

bar = tqdm(valid_loader)
for record in bar:
    # Per validation image...
    votes = {}
    
    # Run it through our extractor
    vectors = extractor(record['image'].to(DEVICE))  # keep the input on the same device as the model
    vectors = vectors.detach().cpu()
    
    # Iterate through the x and y dimensions of the result to get each feature vector
    for x in range(vectors.shape[2]):
        for y in range(vectors.shape[3]):
            features = vectors[0, :, x, y].numpy()
            # Use our annoy index to find the closest entries to our feature vector
            knn = index.get_nns_by_vector(features, K)
            for neighbour in knn:  # not named `nn`, which would shadow torch.nn
                # Find the corresponding hotel in our metadata
                hotel = targets[neighbour].as_py()
                # Record a vote for the hotel
                votes[hotel] = votes.get(hotel, 0) + 1
    
    # Find the candidate with the most votes. There's probably a more Pythonic way of doing this, but I didn't find it quickly so defaulted to C-style algorithms.
    candidate = None
    max_votes = 0
    for hotel, vs in votes.items(): # Corrected bug here, thank you Michal
        if vs > max_votes:
            candidate = hotel
            max_votes = vs
    
    # Check which hotel it actually was... were we right?
    target = int(record['target'][0])
    if candidate == target:
        correct += 1
    
    # Count the total number of votes cast for the right target. The Result Will SHOCK YOU!
    valid_votes += votes.get(target,0)
    n += 1
    
    bar.set_postfix(Correct=correct, ValidVotes=float(valid_votes)/float(n))
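
On the "more Pythonic way" of picking the winner: one option is `collections.Counter`, whose `most_common(1)` returns the highest count directly. A minimal sketch follows. `pick_candidate` is a made-up helper name and the hotel ids are invented. On ties it keeps the first hotel seen, which I believe matches the `>` comparison in the loop above.

```python
from collections import Counter

def pick_candidate(neighbour_hotels):
    """Hypothetical helper: return (hotel with most votes, all vote counts)."""
    votes = Counter(neighbour_hotels)
    if not votes:
        # No neighbours at all, so there is no candidate
        return None, votes
    candidate, _ = votes.most_common(1)[0]
    return candidate, votes

# Invented list of hotels voted for by the neighbours of one image
candidate, votes = pick_candidate([7, 3, 7, 9, 3, 7])
print(candidate)        # 7
print(votes[7])         # 3
print(votes.get(5, 0))  # 0, a hotel nobody voted for
```

A `Counter` is still a dict, so `votes.get(target, 0)` keeps working unchanged.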

Wow, ten out of 3116 are now correct - that's 0.003. An infinite-fold improvement.
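
To put the progress-bar numbers in context, here is the arithmetic behind them. The stride of 32 is an assumption: it is the usual reduction of the last `features_only` stage of `efficientnet_b0` in timm, and is not stated anywhere in this notebook. If it holds, each 256x256 image yields an 8x8 grid of feature vectors, and `ValidVotes` is an average out of 640 votes per image.

```python
IMG_SIZE = 256
K = 10
STRIDE = 32  # assumption: reduction factor of the extracted feature map

positions = (IMG_SIZE // STRIDE) ** 2   # feature vectors per image
votes_per_image = positions * K         # votes cast per validation image
accuracy = 10 / 3116                    # correct images / validation images

print(positions)           # 64
print(votes_per_image)     # 640
print(round(accuracy, 4))  # 0.0032
```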

So what went on here? Clearly the extracted feature vectors are not a dense representation of "lamp", "bed post", etc. Maybe this approach could be useful for augmenting another method, but on its own it's not going to work with this pre-trained model.