# Grid study

As an example, we will define a study config that compares the retrieval scores of BM25 search against those of vector search.

## Data Requirements

The hardest part of any evaluation is collecting and formatting a good dataset. To run a study with the retrieval optimizer, you will need the following sets of data.

### Corpus

This is the full set of objects we are searching against, and it is what will be indexed in Redis.

General form:
```json
{
    "corpus_id": {
        "text": "text to be searched on or vectorized",
        "title": "associated title"
    }
}
```

Concrete example:
```json
{
    "MED-10": {
        "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. ...",
        "title": "Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland"
    },
    ...
}
```

### Queries

The queries that will be executed against the corpus to measure performance.

General form:
```json
{
    "query_id": "query text",
    ...
}
```

Concrete example:
```json
{
    "PLAIN-2": "Do Cholesterol Statin Drugs Cause Breast Cancer?",
    "PLAIN-12": "Exploiting Autophagy to Live Longer",
    ...
}
```

### Qrels

The labeled set of scores used for evaluation of the queries against the corpus.

General form:
```json
{
    "query_id": {
        "corpus_id": "score",
        ...
    },
    ...
}
```

Concrete example:
```json
{
    "PLAIN-2": {
        "MED-2427": 2,
        "MED-2440": 1,
        "MED-2434": 1,
        "MED-2435": 1,
        "MED-2436": 1
    },
    "PLAIN-12": {
        "MED-2513": 2,
        "MED-5237": 2
    }
}
```

Note: for precision, recall, and F1, the simple existence of the key (1 or 0) is what drives the metric. Scores other than 1 only become relevant for NDCG and other ranking metrics, since those account for ranking.
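
To make that distinction concrete, here is a minimal sketch that scores one query from the qrels example above. The `retrieved` and `swapped` rankings are made up for illustration, and the helper functions are hypothetical stand-ins rather than the optimizer's own metric code, which may use a different NDCG variant.

```python
import math

# Qrels for one query, copied from the concrete example above.
qrels = {"MED-2427": 2, "MED-2440": 1, "MED-2434": 1, "MED-2435": 1, "MED-2436": 1}

# Hypothetical ranked result list for that query (illustrative only).
retrieved = ["MED-2440", "MED-9999", "MED-2427"]


def precision_recall(retrieved, qrels):
    """Binary metrics: only the presence of a doc id in qrels matters."""
    hits = sum(1 for doc_id in retrieved if doc_id in qrels)
    return hits / len(retrieved), hits / len(qrels)


def dcg(gains):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg(retrieved, qrels):
    """Graded metric: both the relevance score and the rank position matter."""
    gains = [qrels.get(doc_id, 0) for doc_id in retrieved]
    ideal = sorted(qrels.values(), reverse=True)[: len(retrieved)]
    return dcg(gains) / dcg(ideal)


precision, recall = precision_recall(retrieved, qrels)
print(precision, recall, ndcg(retrieved, qrels))

# Same documents, but the score-2 document now ranks first.
swapped = ["MED-2427", "MED-9999", "MED-2440"]
print(precision_recall(swapped, qrels), ndcg(swapped, qrels))
```

Both rankings contain the same two relevant documents, so precision and recall are identical, while NDCG is higher for `swapped` because the document labeled 2 is ranked ahead of the document labeled 1.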


## Sourcing data

For this example, we are making use of the awesome datasets available through the [BEIR benchmarking IR project](https://github.com/beir-cellar/beir). There is a helper for testing any of your specific methods against these datasets. For the most use-case-specific optimization, however, you can create your own dataset; it simply needs to follow the general structure described above. Additionally, if you have a few examples of your data, it can be very helpful to use an LLM to extend them and create more examples for your testing.


In [None]:
from redis_retrieval_optimizer.corpus_processors import eval_beir

# check the link above for different datasets to try
beir_dataset_name = "nfcorpus"

# Load sample data
corpus, queries, qrels = eval_beir.get_beir_dataset(beir_dataset_name)

09:43:35 beir.datasets.data_loader INFO   Loading Corpus...



09:43:35 beir.datasets.data_loader INFO   Loaded 3633 TEST Documents.
09:43:35 beir.datasets.data_loader INFO   Doc Example: {'text': 'Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (rang

Now that we have our data, we will save it locally to the gitignored `data/` folder.

In [5]:
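import json
import os

# make sure the output folder exists before writing to it
os.makedirs("data", exist_ok=True)
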
with open(f"data/{beir_dataset_name}_corpus.json", "w") as f:
    json.dump(corpus, f)

with open(f"data/{beir_dataset_name}_queries.json", "w") as f:
    json.dump(queries, f)

with open(f"data/{beir_dataset_name}_qrels.json", "w") as f:
    json.dump(qrels, f)
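
If you build your own dataset instead of using BEIR, it can help to sanity-check that the three structures line up before running a study. The sketch below is one possible check; `check_dataset` is a hypothetical helper (not part of the retrieval optimizer), and the toy ids are invented for illustration.

```python
def check_dataset(corpus, queries, qrels):
    """Return a list of readable problems; an empty list means the ids line up."""
    problems = []
    for doc_id, doc in corpus.items():
        if "text" not in doc:
            problems.append(f"corpus[{doc_id!r}] has no 'text' field")
    for query_id, judgments in qrels.items():
        if query_id not in queries:
            problems.append(f"qrels query {query_id!r} is missing from queries")
        for doc_id in judgments:
            if doc_id not in corpus:
                problems.append(f"qrels doc {doc_id!r} is missing from corpus")
    return problems


# Toy data in the same shape as the corpus, queries, and qrels files.
toy_corpus = {"doc-1": {"text": "text to be searched", "title": "a title"}}
toy_queries = {"q-1": "query text"}
toy_qrels = {"q-1": {"doc-1": 2, "doc-404": 1}}

for problem in check_dataset(toy_corpus, toy_queries, toy_qrels):
    print(problem)
```

The same function can be pointed at the `corpus`, `queries`, and `qrels` objects loaded earlier, or at the JSON files after reading them back with `json.load`.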

# Define a study config

To set the parameters of our study, we need to define a study configuration file.

Example:
```yaml
# paths to necessary data files
corpus: "data/nfcorpus_corpus.json" # optional if from_existing
queries: "data/nfcorpus_queries.json"
qrels: "data/nfcorpus_qrels.json"

# vector field names
vector_field_name: "vector" # name of the vector field to search on
text_field_name: "text" # name of the text field for lexical search
index_settings:
  name: "optimize"
  from_existing: false # true if sourcing corpus from existing redis instance

# will run all search methods for each embedding model and then iterate
embedding_models: # embedding cache would be awesome here.
# if from_existing is true, first record is assumed to be the one used to create the index
  - type: "hf"
    model: "sentence-transformers/all-MiniLM-L6-v2"
    dim: 384
    embedding_cache_name: "vec-cache" # avoid names that include 'ret-opt', as this can cause collisions

search_methods: ["bm25", "vector"] # the search methods to run against the dataset
```
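
As the comment in the config notes, every search method is run for each embedding model. A rough sketch of how that grid expands is shown here; the `config` dict simply mirrors the YAML above, and the loop is an illustration of the combinations, not the optimizer's actual implementation.

```python
from itertools import product

# Plain-dict mirror of the relevant parts of the YAML config above.
config = {
    "embedding_models": [
        {"type": "hf", "model": "sentence-transformers/all-MiniLM-L6-v2", "dim": 384},
    ],
    "search_methods": ["bm25", "vector"],
}

# One trial per (embedding model, search method) pair.
trials = [
    (emb["model"], method)
    for emb, method in product(config["embedding_models"], config["search_methods"])
]

for model, method in trials:
    print(f"{model} -> {method}")
```

With one embedding model and two search methods this yields two trials; adding a second model to `embedding_models` would double that.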

## Available search methods

The available search methods are defined in `redis_retrieval_optimizer.search_methods.__init__`, where the active `SEARCH_METHOD_MAP` maps each string given in `search_methods` to the corresponding function.

You can define your own `SEARCH_METHOD_MAP` and pass it in to define your custom retrieval logic.

In [1]:
from redis_retrieval_optimizer.search_methods import SEARCH_METHOD_MAP

SEARCH_METHOD_MAP



10:28:20 sentence_transformers.cross_encoder.CrossEncoder INFO   Use pytorch device: mps




{'bm25': <function redis_retrieval_optimizer.search_methods.bm25.gather_bm25_results(queries, index, emb_model)>,
 'rerank': <function redis_retrieval_optimizer.search_methods.rerank.gather_rerank_results(queries, index, emb_model)>,
 'lin_combo': <function redis_retrieval_optimizer.search_methods.lin_combo.gather_lin_combo_results(queries, index, emb_model, alpha=0.7)>,
 'vector': <function redis_retrieval_optimizer.search_methods.vector.gather_vector_results(queries, index, emb_model)>,
 'weighted_rrf': <function redis_retrieval_optimizer.search_methods.weighted_rrf.gather_weighted_rrf(queries, index, emb_model)>}
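
A custom map could follow the same shape as the one printed above: a dict from method name to a function taking `(queries, index, emb_model)`. The sketch below runs on its own with stand-ins, so several things are assumptions: `gather_title_match_results` is a hypothetical method, the `index` is a plain dict rather than a Redis index, and the `{query_id: {doc_id: score}}` return format is a guess that should be checked against the built-in methods. How the map is handed to the study is not shown in this section.

```python
def gather_bm25_results(queries, index, emb_model):
    """Stand-in for the library's BM25 method so this sketch runs on its own."""
    return {query_id: {} for query_id in queries}


def gather_title_match_results(queries, index, emb_model):
    """Hypothetical custom method with the same (queries, index, emb_model) signature.

    Assumption: results are returned as {query_id: {doc_id: score}}.
    """
    results = {}
    for query_id, text in queries.items():
        words = set(text.lower().split())
        # Score 1.0 for every document whose title shares a word with the query.
        results[query_id] = {
            doc_id: 1.0
            for doc_id, doc in index.items()
            if words & set(doc["title"].lower().split())
        }
    return results


# Extend a base map with the custom entry instead of mutating it.
BASE_SEARCH_METHOD_MAP = {"bm25": gather_bm25_results}
CUSTOM_SEARCH_METHOD_MAP = {
    **BASE_SEARCH_METHOD_MAP,
    "title_match": gather_title_match_results,
}

toy_index = {"doc-1": {"title": "Statin use and survival"}}
toy_queries = {"q-1": "statin drugs"}
method = CUSTOM_SEARCH_METHOD_MAP["title_match"]
print(method(toy_queries, toy_index, emb_model=None))
```

The key added to the map (`"title_match"` here) is the string you would then list under `search_methods` in the study config.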

# Run a study

In [1]:
import os
from redis_retrieval_optimizer.grid_study import run_grid_study
from redis_retrieval_optimizer.corpus_processors import eval_beir
from dotenv import load_dotenv

# load environment variables containing necessary credentials
load_dotenv()

redis_url = os.environ.get("REDIS_URL", "redis://localhost:6379/0")

metrics = run_grid_study(
    config_path="grid_study_config.yaml",
    redis_url=redis_url,
    corpus_processor=eval_beir.process_corpus
)



11:01:43 sentence_transformers.cross_encoder.CrossEncoder INFO   Use pytorch device: mps
11:01:43 sentence_transformers.SentenceTransformer INFO   Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2




11:01:44 sentence_transformers.SentenceTransformer INFO   Use pytorch device_name: mps



11:01:45 sentence_transformers.SentenceTransformer INFO   Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
11:01:46 sentence_transformers.SentenceTransformer INFO   Use pytorch device_name: mps





Recreating: loading corpus from file
11:01:48 sentence_transformers.SentenceTransformer INFO   Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
11:01:49 sentence_transformers.SentenceTransformer INFO   Use pytorch device_name: mps



Running search method: bm25
Running search method: vector


In [2]:
metrics

{'search_method': ['bm25', 'vector'],
 'ret_k': [6, 6],
 'algorithm': ['flat', 'flat'],
 'ef_construction': [0, 0],
 'ef_runtime': [0, 0],
 'm': [0, 0],
 'distance_metric': ['cosine', 'cosine'],
 'vector_data_type': ['float32', 'float32'],
 'model': ['sentence-transformers/all-MiniLM-L6-v2',
  'sentence-transformers/all-MiniLM-L6-v2'],
 'model_dim': [384, 384],
 'recall@k': [0.11579800796961262, 0.11965315421231895],
 'ndcg@k': [0.16890919365750603, 0.1655733514156841],
 'f1@k': [0.12598202968456776, 0.12115336056131429],
 'total_indexing_time': [0, 0],
 'precision': [0.32389060887512905, 0.30299277605779157]}
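
The returned dict is columnar: each key holds one value per trial, in the same order as `search_method`. A small sketch for reading it per method follows; the values are copied from the output above (only the score columns are kept), and the `rows` and `best` names are just for illustration.

```python
# Score columns copied from the study output above.
metrics = {
    "search_method": ["bm25", "vector"],
    "recall@k": [0.11579800796961262, 0.11965315421231895],
    "ndcg@k": [0.16890919365750603, 0.1655733514156841],
    "f1@k": [0.12598202968456776, 0.12115336056131429],
    "precision": [0.32389060887512905, 0.30299277605779157],
}

score_columns = ["recall@k", "ndcg@k", "f1@k", "precision"]

# Turn the columnar dict into one row per search method.
rows = [
    {"search_method": name, **{col: metrics[col][i] for col in score_columns}}
    for i, name in enumerate(metrics["search_method"])
]

# For each metric, find the search method with the highest score.
best = {
    col: max(rows, key=lambda row: row[col])["search_method"]
    for col in score_columns
}

for col, name in best.items():
    print(f"{col}: {name}")
```

On this run, vector search has the higher recall while BM25 has the higher NDCG, F1, and precision, though the gaps are small.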