#### Background
This notebook demonstrates hyperparameter optimization (HPO) with Optuna for two approximate nearest neighbor indexes: FAISS (HNSW-flat) and cuVS (CAGRA). It covers installing and configuring these libraries and their dependencies for a Jupyter notebook in Amazon SageMaker Studio. It also shows how to write an Optuna objective function that evaluates the build time, search latency, and recall of different hyperparameter configurations.

#### Prerequisites:
Follow these steps to use RAPIDS 24.10 for running this Jupyter notebook in the SageMaker Studio environment.

##### 1: Create a Conda Environment: In a terminal, use the following command to create a new conda environment named rapids-24.10. This environment includes RAPIDS 24.10, Python 3.12, and the other packages this notebook needs (ipykernel, Optuna, faiss-cpu, and h5py):

```conda create -n rapids-24.10 -c rapidsai-nightly -c conda-forge -c nvidia rapids=24.10 python=3.12 'cuda-version>=12.0,<=12.5' ipykernel optuna faiss-cpu h5py```

##### 2: Activate the Conda Environment:

```conda activate rapids-24.10```

##### 3: Install the Jupyter Kernel:
```python -m ipykernel install --user --name cuvs-rapids-24.10 --display-name "Python (rapids-24.10)"```

##### 4: Restart the kernel and select the kernel "Python (rapids-24.10)" for your Jupyter notebook.
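
Once the kernel is selected, a quick check from a notebook cell can confirm that the packages imported below are visible to it. The following is a minimal sketch that uses only the standard library; `find_missing_packages` is a hypothetical helper name, and the package list simply mirrors the imports in the next cell.

```python
import importlib.util

def find_missing_packages(names):
    """Return the names from `names` that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level packages imported by this notebook.
required = ["cupy", "numpy", "cuvs", "optuna", "faiss"]
missing = find_missing_packages(required)
print("missing packages:", missing if missing else "none")
```

If the list is not empty, the notebook is probably still attached to a different kernel than the one installed in step 3.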

In [4]:
import cupy as cp
import numpy as np
from cuvs.neighbors import cagra
import time
import optuna
from utils import calc_recall
import os
import faiss


In [5]:
import tarfile
import urllib.request

def extract_tar_dataset(dataset_url, tarfilename, work_dir):
    """Download the tar archive if needed, extract it, and return the extraction directory."""
    # wiki-all datasets are distributed as tar archives
    tar_path = os.path.join(work_dir, tarfilename)
    if os.path.exists(tar_path):
        print("tar file is already downloaded")
    else:
        urllib.request.urlretrieve(dataset_url, tar_path)
    # Extract into a folder named after the archive (e.g. wiki_all_1M)
    extract_path = os.path.join(work_dir, tarfilename.split(".")[0])
    if os.path.exists(extract_path):
        print("Files already extracted")
        return extract_path
    with tarfile.open(tar_path, "r") as tar:
        tar.extractall(extract_path)
    return extract_path

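def read_data(file_path, dtype, use_cupy):
    """Read a .fbin/.ibin file: two int32 header values (rows, cols) followed by the data."""
    # Return a CuPy (GPU) array when use_cupy is True, otherwise a NumPy (CPU) array
    np_lib = cp if use_cupy else np
    with open(file_path, "rb") as f:
        rows, cols = np.fromfile(f, count=2, dtype=np.int32)
        # Cast to Python int so rows * cols cannot overflow int32 on larger datasets
        d = np.fromfile(f, count=int(rows) * int(cols), dtype=dtype).reshape(rows, cols)
    return np_lib.asarray(d)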
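
The layout that `read_data` expects can be made explicit with a tiny round trip: an 8-byte header holding the row and column counts as two int32 values, followed by the values in row-major order. The sketch below uses only the standard library so the byte layout stays visible; `write_fbin` and `read_fbin` are hypothetical helper names, and it assumes native byte order, which is what `np.fromfile` uses.

```python
import os
import struct
import tempfile
from array import array

def write_fbin(path, rows):
    """Write equal-length float rows as: int32 n_rows, int32 n_cols, float32 data."""
    n_rows, n_cols = len(rows), len(rows[0])
    with open(path, "wb") as f:
        f.write(struct.pack("=ii", n_rows, n_cols))  # 8-byte header
        for row in rows:
            f.write(array("f", row).tobytes())       # row-major float32 payload

def read_fbin(path):
    """Read the file back into a list of rows, mirroring read_data without NumPy."""
    with open(path, "rb") as f:
        n_rows, n_cols = struct.unpack("=ii", f.read(8))
        flat = array("f")
        flat.frombytes(f.read(4 * n_rows * n_cols))
    return [list(flat[i * n_cols:(i + 1) * n_cols]) for i in range(n_rows)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.fbin")
    write_fbin(path, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    size_on_disk = os.path.getsize(path)
    rows = read_fbin(path)

print(size_on_disk)  # 8-byte header + 2 * 3 * 4 bytes of data = 32
print(rows)          # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

The `.ibin` ground-truth file follows the same pattern with int32 values in place of float32, which is why `read_data` takes the `dtype` as a parameter.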

In [None]:
work_dir = os.path.expanduser("~/")
extracted_path=extract_tar_dataset('https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar', 'wiki_all_1M.tar', work_dir)

### FAISS HNSW Flat

In [21]:
# Read the base vectors from the file and convert them to numpy float32 type
vectors = read_data(extracted_path + "/base.1M.fbin", np.float32, use_cupy=False)

# Read the query vectors from the file and convert them to numpy float32 type
queries = read_data(extracted_path + "/queries.fbin", np.float32, use_cupy=False)

# Read the ground truth neighbors from the file and convert them to numpy int32 type
gt_neighbors = read_data(extracted_path + "/groundtruth.1M.neighbors.ibin", np.int32, use_cupy=False)

#Note: The use_cupy parameter is set to False, indicating that the data conversion should be performed using NumPy (CPU-based).


In [31]:
#Get the dataset size of database vectors
dataset_size = vectors.shape[0]

#Get the dimension
dim = vectors.shape[1]

In [25]:
def multi_objective_hnsw_flat(trial):
    """
    This function performs a multi-objective optimization for HNSW flat using the Optuna library. It optimizes the parameters 'ef_construction' and 'ef_search' 
    to balance the trade-offs between build time, search latency, and recall.

    Parameters:
    trial (optuna.trial.Trial): A trial object that suggests values for hyperparameters.

    Returns:
    tuple: A tuple containing build time in seconds, search latency in milliseconds, and recall value, each rounded to 4 decimal places.
    """
    ef_construction_val = trial.suggest_categorical('ef_construction_val', [32, 64, 128, 256]) # candidate list size while building the graph
    ef_search_val = trial.suggest_categorical('ef_search_val', [16, 32, 64, 128]) # candidate list size while searching

    start_build_time = time.time()
    # M=32 links per node, passed positionally
    index = faiss.IndexHNSWFlat(dim, 32)
    # set efConstruction and efSearch parameters
    index.hnsw.efConstruction = ef_construction_val
    index.hnsw.efSearch = ef_search_val
    # add data to index
    index.add(vectors)
    build_time_in_secs = time.time() - start_build_time
    
    # Perform the search
    start_search_time = time.time()
    distances, indices = index.search(queries, k=10)
    search_time = time.time() - start_search_time
    
    latency_in_ms = (search_time * 1000)/queries.shape[0]
            
    recall = calc_recall(indices, gt_neighbors, use_cupy=False)
    # Discard configurations that miss the 0.80 recall target
    if recall < 0.80:
        raise optuna.TrialPruned()
    return round(build_time_in_secs,4), round(latency_in_ms,4), round(recall,4)
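
The objective relies on `calc_recall` from a local `utils` module that is not shown in this notebook. As a rough illustration of what such a function typically computes, here is a minimal pure-Python sketch of recall@k. The name `recall_at_k` and its signature are assumptions, not the actual helper, and it assumes the ground-truth rows may hold more than k neighbors.

```python
def recall_at_k(found, ground_truth, k=10):
    """Average fraction of the true top-k neighbors present in the returned top-k."""
    hits = 0
    for found_row, true_row in zip(found, ground_truth):
        # Order within the top-k does not matter, only membership.
        hits += len(set(found_row[:k]) & set(true_row[:k]))
    return hits / (len(found) * k)

found = [[0, 1, 2], [5, 9, 7]]   # neighbor ids returned by the index, one row per query
truth = [[0, 2, 4], [7, 5, 9]]   # exact neighbor ids for the same queries
recall = recall_at_k(found, truth, k=3)
print(round(recall, 4))          # 5 of 6 true neighbors found -> 0.8333
```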

In [26]:
%%time 
hnsw_flat_study = optuna.create_study(directions=['minimize', 'minimize', 'maximize'])
hnsw_flat_study.optimize(multi_objective_hnsw_flat, n_trials=10)

[I 2024-10-07 21:01:22,471] A new study created in memory with name: no-name-8e8e21c9-de27-4577-9e28-9eabc07979d5
[I 2024-10-07 21:02:12,136] Trial 0 finished with values: [47.7321, 0.1486, 0.9752] and parameters: {'ef_construction_val': 32, 'ef_search_val': 128}. 
[I 2024-10-07 21:03:04,159] Trial 1 finished with values: [51.2019, 0.0375, 0.8533] and parameters: {'ef_construction_val': 64, 'ef_search_val': 32}. 
[I 2024-10-07 21:04:47,146] Trial 2 pruned. 
[I 2024-10-07 21:05:35,595] Trial 3 finished with values: [47.4594, 0.0551, 0.8967] and parameters: {'ef_construction_val': 32, 'ef_search_val': 32}. 
[I 2024-10-07 21:08:56,474] Trial 4 finished with values: [199.6245, 0.0816, 0.9597] and parameters: {'ef_construction_val': 256, 'ef_search_val': 64}. 
[I 2024-10-07 21:09:48,753] Trial 5 finished with values: [51.2019, 0.0636, 0.9165] and parameters: {'ef_construction_val': 64, 'ef_search_val': 64}. 
[I 2024-10-07 21:10:40,211] Trial 6 pruned. 
[I 2024-10-07 21:11:29,403] Trial 7 fi

CPU times: user 8h 57min 43s, sys: 31.3 s, total: 8h 58min 15s
Wall time: 11min 48s


#### Optimizing HNSW-flat with Optuna took about 12 minutes of wall time for 10 trials

In [28]:
def summarize_best_trials(trials, metric_indices=[0, 1, 2], metric_labels=["lowest build time in secs", "lowest latency in ms", "highest recall"]):
    """
    Summarizes the best trials from a list of trials based on specified metrics.

    Parameters:
    trials (list): A list of trial objects, where each trial has attributes 'number', 'params', and 'values'.
    metric_indices (list): A list of indices indicating which metrics to consider. Default is [0, 1, 2].
    metric_labels (list): A list of labels describing each metric. Default is ["lowest build time in secs", "lowest latency in ms", "highest recall"].

    Functionality:
    - Iterates over the provided metric indices and labels.
    - For each metric, finds the best trial:
        - If the metric index is 0 or 1, it considers lower values as better (minimization).
        - For other indices, it considers higher values as better (maximization).
    - Prints a summary of the best trial for each metric, including trial number, parameters, and values.
    """
    for index, label in zip(metric_indices, metric_labels):
        if index in (0, 1):
            best_trial = min(trials, key=lambda t: t.values[index])
        else:
            best_trial = max(trials, key=lambda t: t.values[index])
        print(f"Trial with {label}:")
        print(f"\tnumber: {best_trial.number}")
        print(f"\tparams: {best_trial.params}")
        print(f"\tvalues: {best_trial.values}")

In [29]:
summarize_best_trials(hnsw_flat_study.best_trials)

Trial with lowest build time in secs:
	number: 3
	params: {'ef_construction_val': 32, 'ef_search_val': 32}
	values: [47.4594, 0.0551, 0.8967]
Trial with lowest latency in ms:
	number: 1
	params: {'ef_construction_val': 64, 'ef_search_val': 32}
	values: [51.2019, 0.0375, 0.8533]
Trial with highest recall:
	number: 0
	params: {'ef_construction_val': 32, 'ef_search_val': 128}
	values: [47.7321, 0.1486, 0.9752]
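
In a multi-objective study, `best_trials` returns the trials on the Pareto front: those that no other trial matches or beats on every objective at once. The summary above picks its per-metric winners from that set. The sketch below shows the dominance test in plain Python; the value tuples are illustrative and are not results from this notebook, and `dominates` and `pareto_front` are hypothetical helper names rather than Optuna functions.

```python
def dominates(a, b, directions):
    """True if `a` is at least as good as `b` on every objective and strictly better on one."""
    at_least_as_good = True
    strictly_better = False
    for x, y, direction in zip(a, b, directions):
        if direction == "maximize":
            x, y = -x, -y  # turn maximization into minimization
        if x > y:
            at_least_as_good = False
        elif x < y:
            strictly_better = True
    return at_least_as_good and strictly_better

def pareto_front(points, directions):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p, directions) for q in points)]

directions = ["minimize", "minimize", "maximize"]
# Illustrative (build seconds, latency ms, recall) tuples.
candidates = [
    (50.0, 0.10, 0.97),
    (50.0, 0.04, 0.85),
    (60.0, 0.12, 0.95),   # worse than the first tuple on all three metrics
    (200.0, 0.08, 0.99),
]
front = pareto_front(candidates, directions)
print(front)  # the third tuple is dropped, the other three remain
```

Because every point on the front is a legitimate trade-off, the final choice still depends on which metric matters most for the application.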


### cuVS CAGRA

In [8]:
vectors = read_data(extracted_path + "/base.1M.fbin", np.float32, use_cupy=True)
queries = read_data(extracted_path + "/queries.fbin", np.float32, use_cupy=True)
gt_neighbors = read_data(extracted_path + "/groundtruth.1M.neighbors.ibin", np.int32, use_cupy=True)

# Here, use_cupy=True indicates that the data conversion should be performed using CuPy (GPU-based).

In [9]:
#Get the dataset size of database vectors
dataset_size = vectors.shape[0]
dim = vectors.shape[1]

In [10]:
def multi_objective_cagra(trial):
    """
    This function performs a multi-objective optimization for cuVS CAGRA using the Optuna library. It optimizes the parameters 'intermediate_graph_degree', 'graph_degree'
    and 'itopk_size' to balance the trade-offs between build time, search latency, and recall.

    Parameters:
    trial (optuna.trial.Trial): A trial object that suggests values for hyperparameters.

    Returns:
    tuple: A tuple containing build time in seconds, search latency in milliseconds, and recall value, each rounded to 4 decimal places.
    """
    # Suggest values for build parameters
    intermediate_graph_degree = trial.suggest_categorical('intermediate_graph_degree', [64, 128, 256])
    graph_degree = trial.suggest_categorical('graph_degree', [32, 64])
    
    # Suggest a value for itopk_size, the number of intermediate results kept during search
    itopk_size = trial.suggest_categorical('itopk_size', [16, 32, 64, 128])

    build_params = cagra.IndexParams(
        intermediate_graph_degree=intermediate_graph_degree,
        graph_degree=graph_degree,
        build_algo="nn_descent"
    )

    start_build_time = time.time()
    cagra_index = cagra.build(build_params, vectors)
    build_time_in_secs = time.time() - start_build_time

    # Configure search parameters
    search_params = cagra.SearchParams(itopk_size=itopk_size)

    # Perform the search (no separate refinement step is applied)
    start_search_time = time.time()
    distances, indices = cagra.search(search_params, cagra_index, queries, k=10)
    search_time = time.time() - start_search_time

    latency_in_ms = (search_time * 1000)/queries.shape[0]

    recall = calc_recall(indices, gt_neighbors, use_cupy=True)

    # Discard configurations that miss the 0.80 recall target
    if recall < 0.80:
        raise optuna.TrialPruned()

    return round(build_time_in_secs,4), round(latency_in_ms,4), round(recall,4)
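
Because all three hyperparameters are categorical, the search space is a small finite grid, and it helps to know its size when choosing `n_trials`. The sketch below enumerates the combinations with `itertools.product`; the dictionary simply restates the choices from the objective function above.

```python
from itertools import product

search_space = {
    "intermediate_graph_degree": [64, 128, 256],
    "graph_degree": [32, 64],
    "itopk_size": [16, 32, 64, 128],
}

# Every combination of the categorical choices.
grid = [dict(zip(search_space, combo)) for combo in product(*search_space.values())]
print(len(grid))  # 3 * 2 * 4 = 24

n_trials = 10
print(f"{n_trials} trials cover at most {n_trials / len(grid):.0%} of the grid")  # 42%
```

The default sampler does not guarantee distinct combinations, so 10 trials can cover fewer than 10 grid points. For a space this small, an exhaustive search with Optuna's `GridSampler` may be a reasonable alternative if the extra build time is acceptable.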

In [11]:
%%time 
cagra_study = optuna.create_study(directions=['minimize', 'minimize', 'maximize'])
cagra_study.optimize(multi_objective_cagra, n_trials=10)

[I 2024-10-07 20:46:55,995] A new study created in memory with name: no-name-ef9a66c8-266c-47b0-85aa-09edfe875695
[I 2024-10-07 20:47:17,224] Trial 0 finished with values: [15.1621, 0.0525, 0.9925] and parameters: {'intermediate_graph_degree': 64, 'graph_degree': 64, 'itopk_size': 128}. 
[I 2024-10-07 20:47:37,912] Trial 1 finished with values: [15.1199, 0.0121, 0.9483] and parameters: {'intermediate_graph_degree': 64, 'graph_degree': 64, 'itopk_size': 32}. 
[I 2024-10-07 20:48:00,553] Trial 2 finished with values: [17.083, 0.0061, 0.8127] and parameters: {'intermediate_graph_degree': 128, 'graph_degree': 32, 'itopk_size': 32}. 
[I 2024-10-07 20:48:37,587] Trial 3 pruned. 
[I 2024-10-07 20:49:00,817] Trial 4 finished with values: [17.5252, 0.0235, 0.9778] and parameters: {'intermediate_graph_degree': 128, 'graph_degree': 64, 'itopk_size': 64}. 
[I 2024-10-07 20:49:37,553] Trial 5 pruned. 
[I 2024-10-07 20:49:58,935] Trial 6 finished with values: [15.1299, 0.041, 0.9927] and parameters:

CPU times: user 19min 17s, sys: 1min 45s, total: 21min 3s
Wall time: 4min 6s


#### Optimizing CAGRA with Optuna took about 4 minutes of wall time for 10 trials

In [32]:
summarize_best_trials(cagra_study.best_trials)

Trial with lowest build time in secs:
	number: 7
	params: {'intermediate_graph_degree': 64, 'graph_degree': 64, 'itopk_size': 16}
	values: [15.1149, 0.0059, 0.8532]
Trial with lowest latency in ms:
	number: 7
	params: {'intermediate_graph_degree': 64, 'graph_degree': 64, 'itopk_size': 16}
	values: [15.1149, 0.0059, 0.8532]
Trial with highest recall:
	number: 6
	params: {'intermediate_graph_degree': 64, 'graph_degree': 64, 'itopk_size': 128}
	values: [15.1299, 0.041, 0.9927]
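
The per-metric winners above are extremes: the lowest-latency trial has a recall of only 0.8532. In practice a common way to pick a single configuration is to fix a recall target and take the fastest trial that meets it. The sketch below applies that rule to the CAGRA values visible in the log and summary above, using `SimpleNamespace` objects as stand-ins for Optuna trials; `fastest_with_recall` is a hypothetical helper name.

```python
from types import SimpleNamespace

# (build seconds, latency ms, recall) for the CAGRA trials visible in the output above.
trials = [
    SimpleNamespace(number=0, values=[15.1621, 0.0525, 0.9925]),
    SimpleNamespace(number=1, values=[15.1199, 0.0121, 0.9483]),
    SimpleNamespace(number=2, values=[17.083, 0.0061, 0.8127]),
    SimpleNamespace(number=4, values=[17.5252, 0.0235, 0.9778]),
    SimpleNamespace(number=6, values=[15.1299, 0.041, 0.9927]),
    SimpleNamespace(number=7, values=[15.1149, 0.0059, 0.8532]),
]

def fastest_with_recall(trials, min_recall):
    """Lowest-latency trial whose recall meets the target, or None if none does."""
    eligible = [t for t in trials if t.values[2] >= min_recall]
    return min(eligible, key=lambda t: t.values[1], default=None)

best = fastest_with_recall(trials, min_recall=0.95)
print(best.number, best.values)  # 4 [17.5252, 0.0235, 0.9778]
```

With a real study, the same function can be called on `cagra_study.best_trials` or `hnsw_flat_study.best_trials`, since those trial objects expose the same `number` and `values` attributes.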
