In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Vector Indexing

This notebook follows on from https://www.kaggle.com/code/prubyg/hotel-id-vector-extraction - please read that if you haven't already.

The purpose of this notebook is to index all the vectors extracted by that process using the "annoy" library from Spotify. Indexing lives in a separate notebook for two reasons: it does not need a GPU, so the GPU work can be kept separate, and both the extraction and the indexing steps produce many GB of output. In some earlier attempts I was pushing the output storage limits of notebooks, so the split was helpful.

We set some parameters for this run up-front.

In [None]:
VECTOR_WIDTH = 320
NUM_TREES = 30

OUTPUT_DIR = '/kaggle/working/'
INPUT_FILE = '/kaggle/input/hotel-id-vector-extraction/vectors.parquet'

Import our libraries: primarily pyarrow for data (please do not treat this as a recommendation) and annoy for indexing (highly recommended).

In [None]:
import pyarrow as pa
from pyarrow.parquet import ParquetFile
from annoy import AnnoyIndex

Open our input file. Only the file metadata is read at this point; the row data is read from disk iteratively as we stream batches further down.

In [None]:
vector_db = ParquetFile(INPUT_FILE)
display(vector_db.metadata)

In [None]:
%%time

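# Construct our index and set it up to be built on disk, as it is too large to build in memory.
# The metric is passed explicitly: 'angular' is annoy's historical default, and omitting it is deprecated.
index = AnnoyIndex(VECTOR_WIDTH, 'angular')
# OUTPUT_DIR already ends with a slash, so no extra separator is needed.
index.on_disk_build(f'{OUTPUT_DIR}vectors.annoy')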

# Note with Parquet files we can specify which columns to read. The format is organised to allow other data to be entirely skipped, saving disk I/O.
for batch in vector_db.iter_batches(columns=['i', 'vector']):
    for i, vector in zip(batch['i'], batch['vector']):
        # We extract the global index and vector from the file, then add them to the index.
        i = i.as_py()
        vec = np.frombuffer(vector.as_buffer(), dtype=np.float32)
        index.add_item(i, vec)

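Finally, we build the index which will accelerate KNN search. The "build" process creates the actual search trees. Because we called `on_disk_build` earlier, the built index is already in the output file and no separate save step is needed.

Note for annoy we have to specify the number of search trees to build, because each one is random and approximate. More trees give more accurate results, at the cost of a longer build and a larger index file.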

In [None]:
%%time

index.build(NUM_TREES)

This notebook series continues with https://www.kaggle.com/code/prubyg/hotel-id-vector-validation/