# Extracting Semantic Vectors

So here's the idea: convolutional networks for image recognition need to distill raw pixel information and turn it into a representation of what's actually in each region of the image. If we extract vectors from late in the network, perhaps these vectors will carry useful meaning about the content of that part of the image, in a form that can be compared by distance.
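
To make "compared by distance" concrete, here is a minimal sketch using made-up 4-dimensional vectors (the real ones in this notebook have 320 dimensions). The helper names and the toy values are my own assumptions for illustration, not output from the network.

```python
import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance: smaller means more similar.
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    # Angle-based similarity: closer to 1.0 means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature vectors for three image regions (toy values).
bed_a = np.array([0.9, 0.1, 0.0, 0.2], dtype=np.float32)
bed_b = np.array([0.8, 0.2, 0.1, 0.2], dtype=np.float32)
window = np.array([0.0, 0.9, 0.8, 0.1], dtype=np.float32)

print("bed_a vs bed_b :", euclidean_distance(bed_a, bed_b), cosine_similarity(bed_a, bed_b))
print("bed_a vs window:", euclidean_distance(bed_a, window), cosine_similarity(bed_a, window))
```

If the hope above holds, regions with similar content should land close together, the way the two "bed" vectors do here.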

In this notebook, we will iterate over the training set, process every image through a convolutional neural net, extract the feature vectors from near the end of the network, and store those vectors. The output of this notebook is a file containing all those vectors, ready for further processing.

This notebook uses some parts of Michal's excellent starter notebook here: https://www.kaggle.com/code/michaln/hotel-id-starter-classification-traning

We need the following packages:
* timm, for common network designs
* pyarrow, which we will use to write a file too large to hold in memory

In [None]:
!pip install timm pyarrow

We import all of our modules...

In [None]:
import numpy as np
import pandas as pd
import random
import os
import gc
import math

In [None]:
from PIL import Image as pil_image
from tqdm import tqdm

import matplotlib
import matplotlib.pyplot as plt

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.nn.parameter import Parameter

import timm

In [None]:
import pyarrow as pa
from pyarrow.parquet import ParquetWriter


Set some parameters for this run.

In [None]:
IMG_SIZE = 256 # Images will be squares of this width and height
SEED = 42 # Random seed for consistency
DEVICE = 'cuda'
BATCH_SIZE = 32

VAL_SAMPLES = 1
PROJECT_FOLDER = "../input/hotel-id-to-combat-human-trafficking-2022-fgvc9/"
DATA_FOLDER = "../input/hotelid-2022-train-images-256x256/"
IMAGE_FOLDER = DATA_FOLDER + "images/"

train_df = pd.read_csv(os.path.join(DATA_FOLDER, 'train.csv'))

By setting a consistent random seed, we make runs repeatable, as far as the underlying GPU libraries allow.

In [None]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
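

As a quick sanity check of what seeding buys us, the sketch below reseeds the Python and NumPy generators and confirms the same numbers come out both times. The `draw` helper is hypothetical and only covers the CPU-side generators; GPU kernels may still introduce small run-to-run differences.

```python
import random
import numpy as np

def draw(seed):
    # Reseed both generators, then take a few samples from each.
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3)

first = draw(42)
second = draw(42)
different = draw(43)

print("same seed, same python draw:", first[0] == second[0])
print("same seed, same numpy draw :", np.array_equal(first[1], second[1]))
print("other seed, same numpy draw:", np.array_equal(first[1], different[1]))
```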

Basic processing of images. We don't do much augmentation with this method, as every single run through the training images produces a lot of data.

In [None]:
import albumentations as A
import albumentations.pytorch as APT
import cv2 

# used for validation dataset - only occlusions
val_transform = A.Compose([
    A.CoarseDropout(p=1.0, max_holes=1, 
                    min_height=IMG_SIZE//4, max_height=IMG_SIZE//2,
                    min_width=IMG_SIZE//4,  max_width=IMG_SIZE//2, 
                    fill_value=(255,0,0)),# simulating occlusions
    A.ToFloat(),
    APT.transforms.ToTensorV2(),
])

# no augmentations
base_transform = A.Compose([
    A.ToFloat(),
    APT.transforms.ToTensorV2(),
])
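

For anyone unsure what the two steps in `base_transform` do to an image, this is a rough NumPy stand-in, assuming 8-bit RGB input: `A.ToFloat()` scales pixel values into the range 0 to 1, and `ToTensorV2()` moves the channel axis to the front. The function name `to_float_chw` is hypothetical, and it returns a NumPy array rather than a torch tensor.

```python
import numpy as np

def to_float_chw(image_uint8):
    # Scale 0..255 integers to 0..1 floats (the effect of ToFloat on uint8 input).
    scaled = image_uint8.astype(np.float32) / 255.0
    # Reorder height x width x channel to channel x height x width (as ToTensorV2 does).
    return np.transpose(scaled, (2, 0, 1))

# A solid red 256x256 test image.
image = np.zeros((256, 256, 3), dtype=np.uint8)
image[..., 0] = 255

out = to_float_chw(image)
print(out.shape, out.dtype)
```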

The HotelTrainDataset class loads the training image for each row in the data frame.

In [None]:
class HotelTrainDataset:
    def __init__(self, data, transform=None, data_path="train_images/"):
        self.data = data
        self.data_path = data_path
        self.transform = transform

    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        record = self.data.iloc[idx]
        image_path = self.data_path + record["image_id"]
        image = np.array(pil_image.open(image_path)).astype(np.uint8)

        if self.transform:
            transformed = self.transform(image=image)
            image = transformed["image"]
        
        hotel_id = record['hotel_id']
        
        return {
            "image" : image,
            "image_id": record["image_id"],
            "target" : hotel_id
        }

This is our "model" for the purposes of this notebook. We don't really train a model; this is what we will run against each image, storing the result.

The timm library supports loading models in a "features only" mode, intended for feature pyramids. This returns the feature maps from several stages of the backbone, rather than just the output of the last layer or classifier, to allow for models that need to inspect fine-grained state. We extract and return a single feature map from that pyramid.

Our default parameters here are to use the EfficientNet B0, with pre-trained weights, and extract the feature map at index 4, which is the fifth and last one returned. For a 256x256 input this feature map has dimensions of 320x8x8: 64 grid tiles of 320 features.

In [None]:
class VectorExtractor(nn.Module):
    def __init__(self, backbone_name='efficientnet_b0', layer_to_extract=4):
        super(VectorExtractor, self).__init__()
        self.backbone_name = backbone_name
        self.layer_to_extract = layer_to_extract
        self.backbone = timm.create_model(self.backbone_name, pretrained=True, features_only=True)

    def forward(self, x):
        # Because the backbone is in "features only" mode, this returns a list of tensors, from finer to coarser resolution.
        layers = self.backbone(x)
        # We extract and return the results from a single layer.
        return layers[self.layer_to_extract]
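

The 320x8x8 shape can be checked with a little arithmetic. The channel counts and reduction factors below are what I'd expect timm to report for `efficientnet_b0` in features-only mode; treat them as assumptions and confirm them with `extractor.backbone.feature_info.channels()` and `extractor.backbone.feature_info.reduction()`. The helper `feature_map_shape` is hypothetical.

```python
IMG_SIZE = 256

# Assumed per-stage values for efficientnet_b0 (verify against feature_info).
channels = [16, 24, 40, 112, 320]
reductions = [2, 4, 8, 16, 32]

def feature_map_shape(index, img_size=IMG_SIZE):
    # Each stage shrinks the image side by its reduction factor.
    side = img_size // reductions[index]
    return (channels[index], side, side)

for index in range(len(channels)):
    c, h, w = feature_map_shape(index)
    print(f"index {index}: {c}x{h}x{w} -> {h * w} tiles of {c} features")
```

Index 4 gives 256 / 32 = 8 tiles per side, so each image contributes 64 vectors of 320 values.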
        

We read our training data here, split it into a training and validation set, and then save those to disk. This ensures we use a consistent validation set down the track.

In [None]:
data_df = pd.read_csv(DATA_FOLDER + "train.csv")
val_df = data_df.groupby("hotel_id").sample(VAL_SAMPLES, random_state=SEED)
train_df = data_df[~data_df["image_id"].isin(val_df["image_id"])]

train_df.to_csv('training.csv')
val_df.to_csv('validation.csv')
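

To see what this split does, here is the same logic on a toy data frame (the file names and hotel IDs are invented). One thing it makes visible: a hotel with only a single image ends up entirely in the validation set, with nothing left for training.

```python
import pandas as pd

# Invented example data: hotel 1 has three images, hotel 2 has two, hotel 3 has one.
toy_df = pd.DataFrame({
    "image_id": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg"],
    "hotel_id": [1, 1, 1, 2, 2, 3],
})

# One validation image per hotel; everything else is training data.
toy_val = toy_df.groupby("hotel_id").sample(1, random_state=42)
toy_train = toy_df[~toy_df["image_id"].isin(toy_val["image_id"])]

print("validation hotels:", sorted(toy_val["hotel_id"]))
print("training hotels  :", sorted(toy_train["hotel_id"].unique()))
```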

Because the results of this process will be too large to hold in memory, we need a method to write records continuously to disk.

For this purpose, I selected the Parquet format, writing it through pyarrow. The interface to pyarrow was phenomenally awkward, so I probably won't be using it again.

In [None]:
class TileVectorWriter():
    def __init__(self, path, width, capacity=32768, existing=False):
        self.path = path
        self.width = width
        self.capacity = capacity
        self.record_count = 0
        # The schema describes the columns and their types that will be written in this file
        self.schema = pa.schema([('i', pa.int32()), ('vector', pa.binary(width * 4)), ('target', pa.int32()), ('file', pa.string()), ('x', pa.int32()), ('y', pa.int32())])
        self.buffer = []
        # We construct a writer to allow continuous streaming of records to disk
        self.writer = ParquetWriter(path, self.schema, use_dictionary=['target', 'file'])
    
    def append(self, item):
        vec, target, file, x, y = item
        i = self.record_count
        self.buffer.append({"i": int(i), "vector": vec.astype(np.float32).tobytes(), "target": int(target), "file": str(file), "x": int(x), "y": int(y)})
        if len(self.buffer) >= self.capacity:
            # Whenever we reach capacity, write all records to disk and clear the internal buffer
            self.flush()
        self.record_count += 1
    
    def __len__(self):
        return self.record_count
        
    def flush(self):
        # This is a very clunky and inefficient way of getting our records into pyarrow, via pandas.
        # Finding methods that work was hard, and pyarrow had major API changes between the version in Kaggle and the latest.
        if len(self.buffer) > 0:
            batch = pa.Table.from_pandas(pd.DataFrame(self.buffer), schema=self.schema)
            self.writer.write_table(batch)
            self.buffer = []
    
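    def close(self):
        # Safe to call more than once: __del__ also calls this after an explicit close().
        if self.writer is not None:
            self.flush()
            self.writer.close()
            self.writer = None
    
    # For anyone unfamiliar, this is a destructor. It ensures we neatly close the file when this goes out of scope.
    def __del__(self):
        self.close()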
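

Since each vector is stored as raw bytes (`width * 4` of them, 4 bytes per float32), whatever reads the file later has to turn those bytes back into numbers. A minimal sketch of that round trip follows; `encode_vector` and `decode_vector` are hypothetical helper names, and the sketch assumes the reader runs on a machine with the same byte order as the writer.

```python
import numpy as np

WIDTH = 320  # features per vector, matching the writer's width

def encode_vector(vec):
    # Same conversion the writer applies before storing a vector.
    return np.asarray(vec).astype(np.float32).tobytes()

def decode_vector(raw, width=WIDTH):
    # Reinterpret the bytes as float32 values; the result is read-only, so copy it to modify.
    vec = np.frombuffer(raw, dtype=np.float32)
    if vec.shape[0] != width:
        raise ValueError(f"expected {width} values, got {vec.shape[0]}")
    return vec

original = np.random.default_rng(0).standard_normal(WIDTH)
raw = encode_vector(original)
restored = decode_vector(raw)

print(len(raw), restored.shape, restored.dtype)
```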

This is the workhorse function of the process we'll be following - iterate over the training set, extract our vectors, and send them to the collection.

In [None]:
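def extract_vectors(dataset, extractor, vector_list, batch_size=32, num_workers=2, device='cuda'):
    # drop_last=False so the final partial batch is processed too; every image should yield vectors.
    loader = DataLoader(dataset, num_workers=num_workers, batch_size=batch_size, shuffle=True, drop_last=False)
    
    # Inference mode: batch norm uses its pre-trained running statistics rather than per-batch statistics.
    extractor.eval()
    
    bar = tqdm(loader, total=len(loader))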
    # For each batch
    for data in bar:
        images = data['image'].to(device)

        with torch.no_grad():
            # Run the extractor over the image
            vectors = extractor(images)
            
            # Get our vectors
            vectors = vectors.detach().cpu()

            # Iterate over batch
            for b in range(vectors.shape[0]):
                image_id = data['image_id'][b]
                target = int(data['target'][b])

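                # Iterate over the spatial grid. The tensor layout is (batch, channel, height, width),
                # so x indexes rows and y indexes columns here.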
                for x in range(vectors.shape[2]):
                    for y in range(vectors.shape[3]):
                        # Send the features to our collection, with the associated hotel, image file, and position as metadata.
                        features = vectors[b, :, x, y].numpy()
                        vector_list.append((features, target, image_id, x, y))
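

The nested loops above are easy to follow, but the same tiles can be pulled out in one step by rearranging the array. This is only a sketch of an alternative, written in NumPy on a small random array; `flatten_tiles` is a hypothetical helper and is not what the notebook runs.

```python
import numpy as np

def flatten_tiles(feature_maps):
    b, c, h, w = feature_maps.shape
    # (batch, channel, row, col) -> (batch, row, col, channel) -> one row per tile.
    flat = feature_maps.transpose(0, 2, 3, 1).reshape(b * h * w, c)
    # Matching batch / row / col indices, in the same order as the nested loops.
    batch_idx, rows, cols = np.meshgrid(np.arange(b), np.arange(h), np.arange(w), indexing="ij")
    return flat, batch_idx.ravel(), rows.ravel(), cols.ravel()

maps = np.random.default_rng(0).standard_normal((2, 5, 3, 4)).astype(np.float32)
flat, batch_idx, rows, cols = flatten_tiles(maps)

print(flat.shape)
```

Row `k` of `flat` is the vector the loops would produce for image `batch_idx[k]` at grid position (`rows[k]`, `cols[k]`).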

Actually run everything, and we should end up with "vectors.parquet".

In [None]:
%%time

seed_everything(SEED)
dataset = HotelTrainDataset(train_df, base_transform, data_path=IMAGE_FOLDER)
extractor = VectorExtractor().to(DEVICE)
vectors = TileVectorWriter("vectors.parquet", 320)
extract_vectors(dataset, extractor, vectors, batch_size=BATCH_SIZE, device=DEVICE)
vectors.close()
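

To get a feel for how big the output gets, the sketch below estimates the record count and raw vector bytes for a given number of images. The image count used here is a made-up round number, not the size of the actual training set, and the estimate ignores metadata columns and Parquet compression.

```python
TILES_PER_IMAGE = 8 * 8    # 8x8 grid from the extracted feature map
BYTES_PER_VECTOR = 320 * 4  # 320 float32 values

def estimate_output(num_images):
    # Returns (number of records, raw bytes of vector data).
    records = num_images * TILES_PER_IMAGE
    return records, records * BYTES_PER_VECTOR

hypothetical_images = 10_000  # illustrative only
records, raw_bytes = estimate_output(hypothetical_images)
print(f"{records:,} records, about {raw_bytes / 2**30:.2f} GiB of vector data")
```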

This notebook series continues with https://www.kaggle.com/code/prubyg/hotel-id-vector-indexing/