# Learned Metric Index demo notebook
This notebook walks you through the whole process of creating and using a Learned Metric Index (LMI).

## Steps
1. Load the dataset
2. Build the LMI
3. Run a query
4. Find out its k-NN performance

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

### Creating an LMI instance
`LMI` is the basic object to interact with when working with learned indexes. It provides:
- loading of the dataset
- an interface for training with various classifiers
- an interface for searching

In [4]:
from LMI import LMI
import tensorflow as tf
from tensorflow import keras

# specify the path with the Mtree data.
li = LMI("./Mtree-Cophir-100k")
df = li.get_dataset()
df.head(2)

03-03-21 14:19 INFO: Loaded dataset of shape: (100000, 285)


Unnamed: 0,L1,L2,object_id,0,1,2,3,4,5,6,...,272,273,274,275,276,277,278,279,280,281
0,8,31,1264121,-1.242989,0.183268,0.226676,-0.915374,0.252619,-1.130569,-1.174948,...,0.376475,0.246309,-1.161265,0.238361,0.191588,0.133651,0.191612,0.181059,0.071334,0.292033
1,8,31,1269339,-1.499727,-0.376083,-0.169159,-0.178085,-1.059864,1.100678,-0.675192,...,0.376475,0.246309,-0.91233,0.648106,0.191588,0.133651,0.191612,0.181059,0.071334,-0.206513


The dataset is composed of labels (`L1`, `L2`), identifiers (`object_id`), and numerical data. The numerical columns are the normalized descriptors of the M-tree CoPhIR dataset. The labels describe an object's location within the M-tree: the `L1`-th node in the first level and the `L2`-th node in the second level.
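
As a rough illustration of this layout, the sketch below separates the label columns from the descriptor columns and counts objects per (`L1`, `L2`) bucket. It uses a small made-up frame rather than the real dataset: only the first two rows echo values shown above, the remaining rows and the `toy`, `label_cols`, and `descriptor_cols` names are assumptions for the example.

```python
import pandas as pd

# Toy stand-in for the real dataset: two label columns, an identifier,
# and two descriptor columns (the real data has 282 descriptors).
toy = pd.DataFrame({
    "L1": [8, 8, 8, 9],
    "L2": [31, 31, 32, 1],
    "object_id": [1264121, 1269339, 1000001, 1000002],  # last two ids are made up
    "0": [-1.24, -1.50, 0.10, 0.30],
    "1": [0.18, -0.38, 0.20, -0.40],
})

label_cols = ["L1", "L2"]
# Everything that is neither a label nor the identifier is treated as a descriptor.
descriptor_cols = [c for c in toy.columns if c not in label_cols + ["object_id"]]

# Number of objects stored under each second-level node.
bucket_sizes = toy.groupby(label_cols).size()

# Descriptor matrix that a classifier would be trained on.
X = toy[descriptor_cols].to_numpy()
```

Here `bucket_sizes` is indexed by `(L1, L2)` pairs, so the two rows sharing `L1 = 8`, `L2 = 31` fall into the same bucket.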

### Build the LMI (Training phase)
Training is governed by the `train()` method of `LMI`. To specify the classifiers to use and their basic hyperparameters, pass it a `training_specs` dictionary. The first table below lists the currently supported classifiers with their parameter keys; the second explains what each parameter means:

| classifier | Hyp. 1 | Hyp. 2 |
|------------|--------|--------|
| RF         | depth  | n_est  |
| LogReg     | ep     |        |
| NN         | model  | opt    |
| NNMulti    | model  | opt    |

| classifier                 | Hyperparameter 1                                    | Hyperparameter 2                      |
|----------------------------|-----------------------------------------------------|---------------------------------------|
| RandomForestClassifier     | max_depth of the trees                              | number of trees                       |
| Logistic Regression        | number of epochs                                    |                                       |
| Neural networks            | a classifier function (one of those in networks.py) | optimizer (one of keras.optimizers)   |
| Multilabel neural networks | a classifier function (one of those in networks.py) | optimizer (one of keras.optimizers)   |

In [7]:
from networks import Adam, construct_fully_connected_model_282_128, construct_mlp
#training_specs = {"RF": [{"n_est": 100, "depth": 30}, {"n_est": 100, "depth": 30}]}
#training_specs = {"LogReg": [{"ep": 10}, {"ep": 10}]}
training_specs = {"NN": [{"model": construct_fully_connected_model_282_128, "opt": Adam(learning_rate=0.0001), "ep": 1},
                         {"model": construct_mlp, "opt": Adam(learning_rate=0.001), "ep": 5}]}

df_result = li.train(df, training_specs)

ModuleNotFoundError: No module named 'keras'

The training logs report which level and node is being trained and, for NNs, the accuracy as training progresses. Note that since we train on the whole dataset, we do not use a validation dataset.

### Searching

Once the index has been trained, we can search in it.

In [6]:
df_result.head(2)

NameError: name 'df_result' is not defined

In [None]:
result = li.search(df_result, df_result.iloc[0]["object_id"], stop_cond_objects=[500, 1000], debug=True)
result

If `debug=True` is specified when searching, the log output walks through the whole search process.
The search begins with the default step of popping the root node and collecting the probabilities of the nodes in the first level (`Step 1: L1 added`). It then pops nodes in the first level and collects the probabilities of their children, and continues until it pops the buckets themselves.

The `search` operation returns the following:
- `id` - the id of the queried object (= `object_id`)
- `time_checkpoints` - the time (in seconds) it took to reach each checkpoint
- `popped_nodes_checkpoints` - the nodes popped up to the point where their combined number of objects reached the corresponding `stop_cond_objects` threshold
- `objects_checkpoints` - the actual number of objects found at each checkpoint; it is slightly higher than the corresponding `stop_cond_objects` value
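
The search procedure described above can be sketched as a priority queue that always pops the most probable node next. This is a minimal illustration, not the code inside `LMI.search`: the function `search_sketch`, its arguments, and all probabilities and bucket sizes are hypothetical, and it assumes that child probabilities are compared as-is against the probabilities of first-level nodes.

```python
import heapq


def search_sketch(l1_probs, l2_probs, bucket_sizes, stop_cond_objects):
    """Pop nodes by decreasing probability until enough objects are collected.

    l1_probs:     {l1: probability of the first-level node}
    l2_probs:     {l1: {l2: probability of the child bucket}}
    bucket_sizes: {(l1, l2): number of objects in the bucket}
    """
    # heapq is a min-heap, so probabilities are negated to pop the largest first.
    queue = [(-p, (l1,)) for l1, p in l1_probs.items()]
    heapq.heapify(queue)

    popped, n_objects = [], 0
    while queue and n_objects < stop_cond_objects:
        _, node = heapq.heappop(queue)
        popped.append(node)
        if len(node) == 1:
            # First-level node: enqueue its children with their probabilities.
            for l2, p in l2_probs[node[0]].items():
                heapq.heappush(queue, (-p, (node[0], l2)))
        else:
            # Bucket: all of its objects become candidates.
            n_objects += bucket_sizes[node]
    return popped, n_objects


# Made-up probabilities and bucket sizes for a two-level index.
l1_probs = {5: 0.3, 15: 0.6}
l2_probs = {15: {75: 0.7, 76: 0.2}, 5: {52: 0.5}}
bucket_sizes = {(15, 75): 300, (15, 76): 250, (5, 52): 206}

popped, n_objects = search_sketch(l1_probs, l2_probs, bucket_sizes, stop_cond_objects=500)
```

Because whole buckets are added at once, the sketch stops with 506 objects for a threshold of 500, which is the kind of overshoot `objects_checkpoints` reports.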

## k-NN result evaluation

In [53]:
from knn_search import evaluate_knn_per_query, get_knn_buckets_for_query

In [38]:
# get the ground truth of the 30 nearest neighbors for each object (query)
knns = li.load_knns()
len(knns)

100000

### k-NN ground truth

The following output shows the ground-truth bucket of every nearest neighbor of our query. The k-NN recall is the number of the query's 30 nearest neighbors that lie in the visited buckets, divided by 30.

In [54]:
get_knn_buckets_for_query(df_result, result['id'], knns)

{'C.1.15.60': ['1359232'],
 'C.1.15.14': ['85346435'],
 'C.1.15.75': ['88363019',
  '21399194',
  '6575940',
  '83916626',
  '100662876',
  '74283088',
  '86992232'],
 'C.1.5.52': ['53438476', '100021289'],
 'C.1.15.52': ['83916430'],
 'C.1.5.84': ['75273245'],
 'C.1.15.80': ['32374815', '96749349'],
 'C.1.5.14': ['91841316'],
 'C.1.8.33': ['40731301'],
 'C.1.12.76': ['94160177'],
 'C.1.15.20': ['88062129'],
 'C.1.15.76': ['96978627', '77887056'],
 'C.1.5.64': ['17955785'],
 'C.1.15.63': ['101036492'],
 'C.1.15.79': ['85310046'],
 'C.1.15.33': ['10875747'],
 'C.1.15.65': ['84525796'],
 'C.1.15.84': ['82244198', '60558950'],
 'C.1.5.49': ['99784947'],
 'C.1.5.46': ['86869882']}

In [44]:
evaluate_knn_per_query(result, df_result, knns)

Evaluating k-NN performance on 2 checkpoints: [506, 1008]
C.1.15.75
C.1.5.52
C.1.5.84
N. of knns found: 10 in 6 buckets.
C.1.15.75
C.1.5.52
C.1.5.84
C.1.5.46
C.1.15.76
C.1.15.14
N. of knns found: 14 in 13 buckets.


[0.3333333333333333, 0.4666666666666667]
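
The two recall values can be reproduced by hand from the outputs above. The sketch below is not the code of `evaluate_knn_per_query`; the `recall` helper is a hypothetical stand-in. It takes the per-bucket neighbor counts from the ground-truth listing and the visited buckets that the evaluation printed for each checkpoint.

```python
K = 30  # number of ground-truth neighbors per query

# Number of the query's nearest neighbors in each ground-truth bucket,
# counted from the get_knn_buckets_for_query output.
neighbors_per_bucket = {
    "C.1.15.60": 1, "C.1.15.14": 1, "C.1.15.75": 7, "C.1.5.52": 2,
    "C.1.15.52": 1, "C.1.5.84": 1, "C.1.15.80": 2, "C.1.5.14": 1,
    "C.1.8.33": 1, "C.1.12.76": 1, "C.1.15.20": 1, "C.1.15.76": 2,
    "C.1.5.64": 1, "C.1.15.63": 1, "C.1.15.79": 1, "C.1.15.33": 1,
    "C.1.15.65": 1, "C.1.15.84": 2, "C.1.5.49": 1, "C.1.5.46": 1,
}

# Visited buckets that contain at least one neighbor, per checkpoint,
# as printed by the evaluation.
visited_hits = [
    ["C.1.15.75", "C.1.5.52", "C.1.5.84"],
    ["C.1.15.75", "C.1.5.52", "C.1.5.84", "C.1.5.46", "C.1.15.76", "C.1.15.14"],
]


def recall(visited, per_bucket, k=K):
    """Fraction of the k nearest neighbors located in the visited buckets."""
    return sum(per_bucket.get(bucket, 0) for bucket in set(visited)) / k


recalls = [recall(v, neighbors_per_bucket) for v in visited_hits]
```

The first checkpoint covers 7 + 2 + 1 = 10 neighbors (10/30) and the second adds 4 more (14/30), matching the reported list.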