# Car Grid study

This example shows how to define custom search methods and a custom corpus processor for the retrieval optimizer.

In [2]:
# load data

import json

with open('data/car_corpus.json', 'r') as f:
    corpus = json.load(f)

with open('data/car_queries.json', 'r') as f:
    queries = json.load(f)

with open('data/car_qrels.json', 'r') as f:
    qrels = json.load(f)


In [6]:
# data is in a different format than the last example
corpus[0]

{'text': "Mazda3_8Y64-EA-08A_Edition1 Page1 Tuesday, November 27 2007 9:0 AM\n\nForm No.8Y64-EA-08A\n\nBlack plate (1,1)\n\nMazda3_8Y64-EA-08A_Edition1 Page2 Tuesday, November 27 2007 9:0 AM\n\nForm No.8Y64-EA-08A\n\nBlack plate (2,1)\n\nMazda3_8Y64-EA-08A_Edition1 Page3 Tuesday, November 27 2007 9:0 AM\n\nBlack plate (3,1)\n\nA Word to Mazda Owners\n\nThank you for choosing a Mazda. We at Mazda design and build vehicles with complete customer satisfaction in mind.\n\nTo help ensure enjoyable and trouble-free operation of your Mazda, read this manual carefully and follow its recommendations.\n\nAn Authorized Mazda Dealer knows your vehicle best. So when maintenance or service is necessary, that's the place to go.\n\nOur nationwide network of Mazda professionals is dedicated to providing you with the best possible service.\n\nWe assure you that all of us at Mazda have an ongoing interest in your motoring pleasure and in your full satisfaction with your Mazda product.\n\nMazda Motor Corp

In [5]:
queries["car-1"]

{'query': 'At what speed should I shift from 2 to 3 with a manual transmission?',
 'query_metadata': {'make': 'mazda', 'model': '3'}}

In [9]:
# qrels must follow this format: {query_id: {doc_id: relevance}}
qrels

{'car-1': {'mazda_3:86': 1},
 'car-2': {'mazda_3:92': 1, 'mazda_3:93': 1},
 'car-3': {'mazda_3:84': 1, 'mazda_3:75': 1, 'mazda_3:105': 1},
 'car-4': {'mazda_3:188': 1},
 'car-5': {'mazda_3:68': 1, 'mazda_3:69': 1},
 'car-6': {'mazda_3:105': 1, 'mazda_3:83': 1},
 'car-7': {'mazda_3:195': 1, 'mazda_3:194': 1},
 'car-8': {'mazda_3:226': 1,
  'mazda_3:227': 1,
  'mazda_3:229': 1,
  'mazda_3:76': 1},
 'car-9': {'mazda_3:176': 1, 'mazda_3:175': 1},
 'car-10': {'mazda_3:179': 1,
  'mazda_3:209': 1,
  'mazda_3:211': 1,
  'mazda_3:212': 1,
  'mazda_3:213': 1,
  'mazda_3:210': 1}}
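Because the qrels must follow this exact shape, it can help to sanity-check them before running a study. The following is a minimal sketch, not part of the retrieval optimizer: `validate_qrels` is a hypothetical helper, and the sample data only mirrors the structure shown above.

```python
def validate_qrels(qrels: dict, queries: dict) -> list:
    """Return a list of problems; an empty list means the qrels look well-formed."""
    problems = []
    for query_id, judgments in qrels.items():
        # every judged query should also exist in the queries file
        if query_id not in queries:
            problems.append(f"{query_id}: not present in queries")
        # each entry should map document ids to relevance labels
        if not isinstance(judgments, dict) or not judgments:
            problems.append(f"{query_id}: expected a non-empty dict of doc_id -> relevance")
            continue
        for doc_id, relevance in judgments.items():
            if not isinstance(doc_id, str):
                problems.append(f"{query_id}: doc id {doc_id!r} is not a string")
            if isinstance(relevance, bool) or not isinstance(relevance, int):
                problems.append(f"{query_id}: relevance for {doc_id} is not an integer")
    return problems


# Sample data mirroring the structure above; "car-2" is deliberately
# left out of the queries to show what a reported problem looks like.
sample_queries = {
    "car-1": {
        "query": "At what speed should I shift from 2 to 3 with a manual transmission?",
        "query_metadata": {"make": "mazda", "model": "3"},
    }
}
sample_qrels = {
    "car-1": {"mazda_3:86": 1},
    "car-2": {"mazda_3:92": 1, "mazda_3:93": 1},
}

print(validate_qrels(sample_qrels, sample_queries))
# ['car-2: not present in queries']
```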

# Define a study config

To set the parameters of our study we need to define a study configuration file.

Example:
```yaml
# paths to necessary data files
corpus: "data/car_corpus.json" # optional if from_existing
queries: "data/car_queries.json"
qrels: "data/car_qrels.json"

# vector field names
index_settings:
  name: "car"
  prefix: "car" # prefix for index name
  vector_field_name: "vector" # name of the vector field to search on
  text_field_name: "text" # name of the text field for lexical search
  id_field_name: "_id"
  from_existing: false
  additional_fields:
    - name: "make" # fields to match our situation
      type: "tag"
    - name: "model"
      type: "tag"
  vector_dim: 384 # should match first embedding model or from_existing

# will run all search methods for each embedding model and then iterate
embedding_models: # embedding cache would be awesome here.
# if from_existing is true, first record is assumed to be the one used to create the index
  - type: "hf"
    model: "sentence-transformers/all-MiniLM-L6-v2"
    dim: 384
    embedding_cache_name: "vec-cache" # avoid names that include 'ret-opt', as this can cause collisions

search_methods: ["basic_vector", "pre_filter_vector"] # must match what is passed as search_method_map
```
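The comments in the config point at two constraints: `vector_dim` should match the first embedding model, and every entry in `search_methods` needs a matching implementation. A rough sketch of checking both up front is shown here. `check_study_config` is a hypothetical helper, and the config is written as a trimmed Python dict rather than loaded from the YAML file to keep the example dependency-free.

```python
def check_study_config(config: dict, available_methods: set) -> list:
    """Return a list of inconsistencies; an empty list means none were found."""
    problems = []

    # the index dimension should agree with the first embedding model
    index_dim = config["index_settings"]["vector_dim"]
    first_model = config["embedding_models"][0]
    if index_dim != first_model["dim"]:
        problems.append(
            f"vector_dim {index_dim} != dim {first_model['dim']} of {first_model['model']}"
        )

    # every configured search method needs an implementation to call
    for name in config["search_methods"]:
        if name not in available_methods:
            problems.append(f"search method '{name}' has no implementation")
    return problems


# Trimmed copy of the example config above
config = {
    "index_settings": {"vector_dim": 384},
    "embedding_models": [
        {"type": "hf", "model": "sentence-transformers/all-MiniLM-L6-v2", "dim": 384}
    ],
    "search_methods": ["basic_vector", "pre_filter_vector"],
}

print(check_study_config(config, {"basic_vector", "pre_filter_vector"}))
# []
print(check_study_config(config, {"basic_vector"}))
# ["search method 'pre_filter_vector' has no implementation"]
```

When `from_existing` is true the dimension comes from the existing index instead, so the first check would not apply in that case.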

## Custom search methods

The data for this study has `make` and `model` fields, which are good candidates for a pre-filter. None of the default search methods filter on these particular fields, but we can easily define our own.

In [1]:
from redisvl.query import VectorQuery
from redisvl.query.filter import Tag

from redis_retrieval_optimizer.search_methods.vector import make_score_dict_vec

def vector_query(query_info, num_results: int, emb_model) -> VectorQuery:
    vector = emb_model.embed(query_info["query"], as_buffer=True)

    return VectorQuery(
        vector=vector,
        vector_field_name="vector",
        num_results=num_results,
        return_fields=["_id", "make", "model", "text"],  # update to read from env maybe?
    )

def pre_filter_query(query_info, num_results, emb_model) -> VectorQuery:
    vec = emb_model.embed(query_info["query"])
    make = query_info["query_metadata"]["make"]
    model = query_info["query_metadata"]["model"]

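    # restrict the search to documents matching the query's make and model
    # (named tag_filter to avoid shadowing the built-in `filter`)
    tag_filter = (Tag("make") == make) & (Tag("model") == model)

    # Create a vector query
    query = VectorQuery(
        vector=vec,
        vector_field_name="vector",
        num_results=num_results,
        filter_expression=tag_filter,
        return_fields=["_id", "make", "model", "text"]
    )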

    return query

def gather_pre_filter_results(queries, index, emb_model):
    redis_res_vector = {}

    for key in queries:
        query_info = queries[key]
        vec_query = pre_filter_query(query_info, 10, emb_model)
        try:
            res = index.query(vec_query)
            score_dict = make_score_dict_vec(res)
        except Exception as e:
            # report the failing query and fall back to an empty result
            print(f"failed for {key}, {query_info['query']}: {e}")
            score_dict = {}

        redis_res_vector[key] = score_dict
    return redis_res_vector


def gather_vector_results(queries, index, emb_model):
    redis_res_vector = {}

    for key in queries:
        query_info = queries[key]
        vec_query = vector_query(query_info, 10, emb_model)
        res = index.query(vec_query)
        score_dict = make_score_dict_vec(res)
        redis_res_vector[key] = score_dict
    return redis_res_vector
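
Both functions above return a dict keyed by query id, where each value is the score dict built from that query's results. The sketch below illustrates that shape without a Redis connection. It is not the library's `make_score_dict_vec`, whose exact scoring may differ: `to_score_dict` is a hypothetical stand-in that assumes each result is a dict with `_id` and `vector_distance` keys and converts cosine distance to similarity as `1 - distance`. The distances are made up for illustration.

```python
def to_score_dict(results: list) -> dict:
    """Map each result's document id to a similarity score (assumed conversion)."""
    scores = {}
    for record in results:
        # distances are assumed to arrive as strings, so cast before converting
        scores[record["_id"]] = 1.0 - float(record["vector_distance"])
    return scores


# Illustrative results for a single query; ids follow the qrels naming
fake_results = [
    {"_id": "mazda_3:86", "vector_distance": "0.25"},
    {"_id": "mazda_3:92", "vector_distance": "0.5"},
]

# A search method collects one score dict per query id
search_output = {"car-1": to_score_dict(fake_results)}
print(search_output)
# {'car-1': {'mazda_3:86': 0.75, 'mazda_3:92': 0.5}}
```

The output has the same nesting as the qrels (`query_id -> doc_id -> value`), so the document ids returned by a search method need to match the ids used in the qrels.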



# Custom corpus processor

In [2]:
def process_car_corpus(
    corpus, emb_model
):
    corpus_data = []
    corpus_texts = [c["text"] for c in corpus]

    text_embeddings = emb_model.embed_many(corpus_texts, as_buffer=True)

    for i, c in enumerate(corpus):
        corpus_data.append(
            {
                "_id": c["item_id"],
                "text": c["text"],
                "make": c["query_metadata"]["make"],
                "model": c["query_metadata"]["model"],
                "vector": text_embeddings[i],
            }
        )

    return corpus_data
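
The records produced by a corpus processor need to carry the fields named in `index_settings`: `_id`, `text`, `vector`, plus the additional `make` and `model` tags. As a quick pre-indexing check, something like the following could be used. `missing_fields` is a hypothetical helper, and the sample records use placeholder text and bytes rather than real corpus content or embeddings.

```python
# Field names taken from the index_settings in the study config
REQUIRED_FIELDS = {"_id", "text", "vector", "make", "model"}


def missing_fields(records: list, required: set = REQUIRED_FIELDS) -> dict:
    """Map record position -> sorted list of required fields that are absent."""
    report = {}
    for position, record in enumerate(records):
        absent = sorted(required - set(record))
        if absent:
            report[position] = absent
    return report


# Placeholder records: the second one is missing its vector on purpose
complete = {
    "_id": "mazda_3:86",
    "text": "placeholder text",
    "vector": b"\x00" * 4,
    "make": "mazda",
    "model": "3",
}
incomplete = {key: value for key, value in complete.items() if key != "vector"}

print(missing_fields([complete, incomplete]))
# {1: ['vector']}
```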

# Run a study

In [3]:
import os
from redis_retrieval_optimizer.grid_study import run_grid_study
from dotenv import load_dotenv

CUSTOM_SEARCH_METHOD_MAP = {
    "basic_vector": gather_vector_results,
    "pre_filter_vector": gather_pre_filter_results,
}

# load environment variables containing necessary credentials
load_dotenv()

redis_url = os.environ.get("REDIS_URL", "redis://localhost:6379/0")

metrics = run_grid_study(
    config_path="custom_grid_study_config.yaml",
    redis_url=redis_url,  # use the value read from the environment above
    corpus_processor=process_car_corpus,
    search_method_map=CUSTOM_SEARCH_METHOD_MAP,
)


Recreating: loading corpus from file



Running search method: basic_vector



Running search method: pre_filter_vector


In [4]:
metrics

{'search_method': ['basic_vector', 'pre_filter_vector'],
 'ret_k': [6, 6],
 'algorithm': ['flat', 'flat'],
 'ef_construction': [0, 0],
 'ef_runtime': [0, 0],
 'm': [0, 0],
 'distance_metric': ['cosine', 'cosine'],
 'vector_data_type': ['float32', 'float32'],
 'model': ['sentence-transformers/all-MiniLM-L6-v2',
  'sentence-transformers/all-MiniLM-L6-v2'],
 'model_dim': [384, 384],
 'recall@k': [0.775, 0.9333333333333333],
 'ndcg@k': [0.6519546808446834, 0.8822834961862057],
 'f1@k': [0.4628787878787879, 0.5786075036075037],
 'total_indexing_time': [0, 0],
 'precision': [0.36, 0.4600000000000001]}
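
The metrics come back column-oriented, with one list per field and one position per search method. To compare methods side by side, the columns can be zipped into rows. This is a small sketch using only the standard library, with a subset of the values copied from the output above; `to_rows` is a hypothetical helper.

```python
# Subset of the metrics returned above
metrics = {
    "search_method": ["basic_vector", "pre_filter_vector"],
    "recall@k": [0.775, 0.9333333333333333],
    "ndcg@k": [0.6519546808446834, 0.8822834961862057],
    "f1@k": [0.4628787878787879, 0.5786075036075037],
    "precision": [0.36, 0.4600000000000001],
}


def to_rows(columns: dict) -> list:
    """Turn a dict of equal-length lists into a list of per-run dicts."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


rows = to_rows(metrics)
for row in rows:
    print(
        f"{row['search_method']:<20}"
        f" recall@k={row['recall@k']:.3f}"
        f" ndcg@k={row['ndcg@k']:.3f}"
        f" precision={row['precision']:.2f}"
    )

# pick the run with the highest ndcg@k
best = max(rows, key=lambda row: row["ndcg@k"])
print("best by ndcg@k:", best["search_method"])
# best by ndcg@k: pre_filter_vector
```

On this data the pre-filtered search scores higher than the basic vector search on every reported metric.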