## Lesson 1 - Can we just generate blind synonyms?

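In this lesson we try generating blind synonyms with an LLM. This is often the first stop for search teams. Before generating anything, we set up a BM25 baseline and compare its top 10 results against the ideal ranking from the relevance labels.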

To run:
* Either make sure Ollama is running or set the `OPENAI_API_KEY` environment variable

In [1]:
from cheat_at_search.wands_data import products, queries
from cheat_at_search.strategy import (
    BM25Search,
    MiniLMSearch,
    EnrichedBM25Search,
    EnrichedJustRoomBM25Search,
    SynonymSearch,
)

bm25 = BM25Search(products)

2025-05-28 10:06:46,920 - cheat_at_search.wands_data - INFO - Directory /Users/doug/ws/cheat-with-llms/notebooks/data/wands already exists. Skipping clone.
2025-05-28 10:06:46,920 - cheat_at_search.wands_data - INFO - Loading relevance labels from /Users/doug/ws/cheat-with-llms/notebooks/data/wands/dataset/label.csv
2025-05-28 10:06:46,984 - cheat_at_search.wands_data - INFO - Loaded 231873 relevance labels
2025-05-28 10:06:46,984 - cheat_at_search.wands_data - INFO - Directory /Users/doug/ws/cheat-with-llms/notebooks/data/wands already exists. Skipping clone.
2025-05-28 10:06:46,984 - cheat_at_search.wands_data - INFO - Loading queries from /Users/doug/ws/cheat-with-llms/notebooks/data/wands/dataset/query.csv
2025-05-28 10:06:46,986 - cheat_at_search.wands_data - INFO - Loaded 480 queries
2025-05-28 10:06:46,986 - cheat_at_search.wands_data - INFO - Directory /Users/doug/ws/cheat-with-llms/notebooks/data/wands already exists. Skipping clone.
2025-05-28 10:06:46,986 - cheat_at_search.w

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['product_description'].fillna('', inplace=True)
  df['product_name'].fillna('', inplace=True)


2025-05-28 10:06:50,313 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-05-28 10:06:50,314 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-05-28 10:06:50,315 - searcharray.indexing - INFO - Tokenizing 42994 documents
2025-05-28 10:06:50,413 - searcharray.indexing - INFO - Tokenized 10000 (23.259059403637718%)
2025-05-28 10:06:50,507 - searcharray.indexing - INFO - Tokenized 20000 (46.518118807275435%)
2025-05-28 10:06:50,602 - searcharray.indexing - INFO - Tokenized 30000 (69.77717821091315%)
2025-05-28 10:06:50,695 - searcharray.indexing - INFO - Tokenized 40000 (93.03623761455087%)
2025-05-28 10:06:50,740 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-05-28 10:06:50,741 - searcharray.indexing - INFO - Tokenization -- DONE
2025-05-28 10:06:50,742 - searcharray.indexing - INFO - Inverting docs->terms
2025-05-28 10:06:50,760 - searcharray.indexing - INFO - Encoding positions to bit array
2025-05-28 10:06:50,773 - searcharray.indexing - I

In [2]:
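from cheat_at_search.eval import grade_results

# Run every query through the BM25 baseline and grade the top 10 results
results = bm25.search_all(queries)
graded = grade_results(
    results,
    max_grade=2,
    k=10,
)

# idcg is the same on every row (see the idcg column in the output), so take the first value
idcg = graded['idcg'].iloc[0]
# Per-query DCG is the sum of the discounted gains of that query's results
dcgs = graded.groupby(['query', 'query_id'])['discounted_gain'].sum().sort_values(ascending=False).rename('dcg')
ndcgs = dcgs / idcg
ndcgs = ndcgs.rename('ndcg')

# Attach the per-query dcg and ndcg to every result row
graded = graded.merge(dcgs, on=['query', 'query_id'])
graded = graded.merge(ndcgs, on=['query', 'query_id'])

graded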



Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,features,...,query_id,rank,query_class,id,label,grade,discounted_gain,idcg,dcg,ndcg
0,7465,hair salon chair,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,offers a wide selection of professional salon ...,fauxleathertype : pu|legheight-toptobottom:18|...,69.0,4.5,53.0,"[fauxleathertype : pu, legheight-toptobottom:1...",...,0,1,Massage Chairs,80.0,Exact,2.0,3.00,8.786905,8.536905,0.971549
1,7468,mercer41 hair salon chair hydraulic styling ch...,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,mercer41 beauty offers a wide selection profes...,seatfillmaterial : foam|waterrepellant : no re...,1.0,5.0,1.0,"[seatfillmaterial : foam, waterrepellant : no ...",...,0,2,Massage Chairs,104.0,Exact,2.0,1.50,8.786905,8.536905,0.971549
2,25431,barberpub salon massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,salon chairs are a wonderful avenue for hairst...,supplierintendedandapproveduse : non residenti...,4.0,5.0,4.0,[supplierintendedandapproveduse : non resident...,...,0,3,Massage Chairs,29.0,Exact,2.0,1.00,8.786905,8.536905,0.971549
3,25432,barberpub hydraulic salon spa reclining massag...,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,salon chairs are a wonderful avenue for hairst...,backheight-seattotopofback:15.7|recliningtyped...,5.0,4.0,5.0,"[backheight-seattotopofback:15.7, recliningtyp...",...,0,4,Massage Chairs,28.0,Exact,2.0,0.75,8.786905,8.536905,0.971549
4,15612,massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,features heavy duty steel frame . premium chro...,overallheight-toptobottom:35.5|productcare : d...,59.0,4.5,50.0,"[overallheight-toptobottom:35.5, productcare :...",...,0,5,Massage Chairs,101.0,Exact,2.0,0.60,8.786905,8.536905,0.971549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4795,40243,madisen hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,complement your farmhouse kitchen decor with t...,producttype : wine glass rack|overallwidth-sid...,29.0,5.0,20.0,"[producttype : wine glass rack, overallwidth-s...",...,487,6,,,,0.0,0.00,8.786905,0.000000,0.000000
4796,40244,kena hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,spruce up your farmhouse kitchen decor with th...,warrantylength:1 year|producttype : wine glass...,23.0,5.0,18.0,"[warrantylength:1 year, producttype : wine gla...",...,487,7,,,,0.0,0.00,8.786905,0.000000,0.000000
4797,39976,wall mounted wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,"the latest addition to this collection , this ...",overallheight-toptobottom:4|design : wall moun...,34.0,4.5,18.0,"[overallheight-toptobottom:4, design : wall mo...",...,487,8,,,,0.0,0.00,8.786905,0.000000,0.000000
4798,40247,winn hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,are you looking for a safe and decorative solu...,overallheight-toptobottom:1.5|overallwidth-sid...,305.0,5.0,187.0,"[overallheight-toptobottom:1.5, overallwidth-s...",...,487,9,,,,0.0,0.00,8.786905,0.000000,0.000000
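
The `discounted_gain`, `idcg`, `dcg`, and `ndcg` values in this output are consistent with a gain of `2**grade - 1` discounted by `1 / rank`. The sketch below reproduces the arithmetic under that assumption. The formula is inferred from the printed numbers, not taken from the `grade_results` source, and the example grades are hypothetical.

```python
def dcg_at_k(grades, k=10):
    """Sum of (2**grade - 1) / rank over the top k results (assumed formula)."""
    return sum((2 ** g - 1) / rank for rank, g in enumerate(grades[:k], start=1))

# Ideal case for max_grade=2, k=10: every slot holds a grade-2 result
idcg = dcg_at_k([2] * 10)

# Hypothetical ranking: grade 2 everywhere except a grade-1 result at rank 8
grades = [2, 2, 2, 2, 2, 2, 2, 1, 2, 2]
dcg = dcg_at_k(grades)
ndcg = dcg / idcg

print(round(idcg, 6), round(dcg, 6), round(ndcg, 6))
# 8.786905 8.536905 0.971549
```

These are the same values shown on the `salon chair` rows, although the truncated output does not show which ranking actually produced them.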


In [3]:
from cheat_at_search.wands_data import labeled_queries

# Rank each query's labeled products by grade, best first, to build an ideal ordering
ideal_results = labeled_queries.sort_values(['query_id', 'grade'], ascending=(True, False))
ideal_results['rank'] = ideal_results.groupby('query_id').cumcount() + 1
# Keep the ideal top 10 and prefix its columns so they do not clash with the BM25 columns
ideal_top_10 = (
    ideal_results[ideal_results['rank'] <= 10]
    .add_prefix('ideal_')
    .rename(columns={'ideal_query_id': 'query_id', 'ideal_query': 'query'})
)

# Look up the product name for each ideal result
ideal_top_10 = ideal_top_10.merge(
    products[['product_id', 'product_name']], how='left', left_on='ideal_product_id', right_on='product_id'
).rename(columns={'product_name': 'ideal_product_name'}).drop(columns='ideal_query_class')

ideal_top_10

Unnamed: 0,query_id,query,ideal_product_id,ideal_id,ideal_label,ideal_grade,ideal_rank,product_id,ideal_product_name
0,0,salon chair,1197,99,Exact,2.0,1,1197,lizzy 25.6 '' w bariatric antimicrobial waitin...
1,0,salon chair,1198,116,Exact,2.0,2,1198,rackham 26.8 '' w bariatric antimicrobial pvc ...
2,0,salon chair,2636,3,Exact,2.0,3,2636,25 '' wide faux leather manual swivel standard...
3,0,salon chair,2638,66,Exact,2.0,4,2638,faux leather massage chair
4,0,salon chair,5936,10,Exact,2.0,5,5936,69 '' w metal seat waiting room chair with met...
...,...,...,...,...,...,...,...,...,...
4732,487,rack glass,110,202050,Partial,1.0,6,110,west harptree 34 '' w garment rack
4733,487,rack glass,285,202336,Partial,1.0,7,285,3 shelf movable storage rack
4734,487,rack glass,573,202462,Partial,1.0,8,573,5 - hook freestanding coat rack
4735,487,rack glass,635,202178,Partial,1.0,9,635,4 - hook freestanding coat rack


In [9]:
# Select and rename the BM25 columns so they can sit next to the ideal columns
graded_view = graded[['query_id', 'query', 'rank', 'product_id', 'product_name', 'grade', 'dcg', 'ndcg']].rename(
    columns={'product_id': 'product_id_actual', 'product_name': 'product_name_actual'})

# Line up ideal rank N with BM25 rank N for each query
side_by_side = ideal_top_10.merge(graded_view,
                                  how='left',
                                  left_on=['query_id', 'query', 'ideal_rank'],
                                  right_on=['query_id', 'query', 'rank'])
side_by_side

Unnamed: 0,query_id,query,ideal_product_id,ideal_id,ideal_label,ideal_grade,ideal_rank,product_id,ideal_product_name,rank,product_id_actual,product_name_actual,grade,dcg,ndcg
0,0,salon chair,1197,99,Exact,2.0,1,1197,lizzy 25.6 '' w bariatric antimicrobial waitin...,1,7465,hair salon chair,2.0,8.536905,0.971549
1,0,salon chair,1198,116,Exact,2.0,2,1198,rackham 26.8 '' w bariatric antimicrobial pvc ...,2,7468,mercer41 hair salon chair hydraulic styling ch...,2.0,8.536905,0.971549
2,0,salon chair,2636,3,Exact,2.0,3,2636,25 '' wide faux leather manual swivel standard...,3,25431,barberpub salon massage chair,2.0,8.536905,0.971549
3,0,salon chair,2638,66,Exact,2.0,4,2638,faux leather massage chair,4,25432,barberpub hydraulic salon spa reclining massag...,2.0,8.536905,0.971549
4,0,salon chair,5936,10,Exact,2.0,5,5936,69 '' w metal seat waiting room chair with met...,5,15612,massage chair,2.0,8.536905,0.971549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4732,487,rack glass,110,202050,Partial,1.0,6,110,west harptree 34 '' w garment rack,6,40243,madisen hanging wine glass rack,0.0,0.000000,0.000000
4733,487,rack glass,285,202336,Partial,1.0,7,285,3 shelf movable storage rack,7,40244,kena hanging wine glass rack,0.0,0.000000,0.000000
4734,487,rack glass,573,202462,Partial,1.0,8,573,5 - hook freestanding coat rack,8,39976,wall mounted wine glass rack,0.0,0.000000,0.000000
4735,487,rack glass,635,202178,Partial,1.0,9,635,4 - hook freestanding coat rack,9,40247,winn hanging wine glass rack,0.0,0.000000,0.000000
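
To see what this merge does in isolation, here is a toy version with made-up frames. The product names and row counts are invented for illustration. Only the column names and the merge keys mirror the cell above.

```python
import pandas as pd

# Made-up "ideal" side: three labeled results for one query
ideal = pd.DataFrame({
    'query_id': [0, 0, 0],
    'query': ['salon chair'] * 3,
    'ideal_rank': [1, 2, 3],
    'ideal_product_name': ['ideal a', 'ideal b', 'ideal c'],
})

# Made-up "actual" side: the search only returned two results
actual = pd.DataFrame({
    'query_id': [0, 0],
    'query': ['salon chair'] * 2,
    'rank': [1, 2],
    'product_name_actual': ['actual x', 'actual y'],
})

# Pair ideal rank N with actual rank N; how='left' keeps every ideal row
toy = ideal.merge(actual, how='left',
                  left_on=['query_id', 'query', 'ideal_rank'],
                  right_on=['query_id', 'query', 'rank'])
print(toy[['ideal_rank', 'ideal_product_name', 'rank', 'product_name_actual']])
```

Because this is a left merge, an ideal row with no search result at the same rank is kept, and its `rank` and `product_name_actual` columns come back as NaN.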


In [10]:
side_by_side[side_by_side['query'] == 'ottoman bed queen']

Unnamed: 0,query_id,query,ideal_product_id,ideal_id,ideal_label,ideal_grade,ideal_rank,product_id,ideal_product_name,rank,product_id_actual,product_name_actual,grade,dcg,ndcg
3381,344,ottoman bed queen,545,37893,Partial,1.0,1,545,loggins storage ottoman,1,2089,elzadie queen four poster bed,0.0,0.0,0.0
3382,344,ottoman bed queen,998,37883,Partial,1.0,2,998,karlsefni 37.5 '' wide tufted rectangle standa...,2,38301,sinead queen upholstered platform bed,0.0,0.0,0.0
3383,344,ottoman bed queen,1030,37836,Partial,1.0,3,1030,48 '' tufted rectangle standard ottoman,3,14834,abbot diamond queen upholstered standard bed,0.0,0.0,0.0
3384,344,ottoman bed queen,1367,37901,Partial,1.0,4,1367,roft 50 '' tufted rectangle storage ottoman,4,7371,carrollton queen upholstered standard bed,0.0,0.0,0.0
3385,344,ottoman bed queen,1537,37873,Partial,1.0,5,1537,henninger 25 '' tufted round storage ottoman,5,29090,lamberton queen upholstered platform bed,0.0,0.0,0.0
3386,344,ottoman bed queen,1563,37894,Partial,1.0,6,1563,luper 41.75 '' wide tufted rectangle storage o...,6,17289,masonville queen platform bed,0.0,0.0,0.0
3387,344,ottoman bed queen,1730,37910,Partial,1.0,7,1730,tufted storage ottoman,7,17219,bianaca queen low profile four poster bed,0.0,0.0,0.0
3388,344,ottoman bed queen,1753,37854,Partial,1.0,8,1753,camden 28.5 '' wide square folding bed cocktai...,8,6503,karmakar queen upholstered platform bed,0.0,0.0,0.0
3389,344,ottoman bed queen,1844,37889,Partial,1.0,9,1844,lampkins 24 '' rectangle standard ottoman,9,34268,fabienne queen standard bed,0.0,0.0,0.0
3390,344,ottoman bed queen,1927,37864,Partial,1.0,10,1927,denali 51.18 '' wide faux leather rectangle st...,10,32412,louque queen upholstered platform bed,0.0,0.0,0.0


In [11]:
side_by_side.groupby('query')['ndcg'].max().sort_values()

query
one alium way              0.0
merlyn 6                   0.0
ottoman bed queen          0.0
white abstract             0.0
large bases                0.0
                          ... 
toilet paper stand         1.0
retractable side awning    1.0
delta trinsic              1.0
desk and chair set         1.0
kids chair                 1.0
Name: ndcg, Length: 480, dtype: float64

In [12]:
merlyn_6 = labeled_queries[labeled_queries['query'] == 'merlyn 6'].sort_values('grade', ascending=False)
merlyn_6[merlyn_6['product_id'] == 8531]

Unnamed: 0,query_id,query,query_class,product_id,id,label,grade
141127,304,merlyn 6,Outdoor Conversation Sets,8531,131167,Partial,1.0


In [13]:
products[products['product_id'] == 8531]

Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,features,product_name_snowball,product_description_snowball
8531,8531,open console with 6 drawers,Sofa & Console Tables,Furniture / Living Room Furniture / Console Ta...,"crafted in italy , in timeless design , this i...",topmaterial : solid wood|countryoforigin : ita...,,,,"[topmaterial : solid wood, countryoforigin : i...","Terms({'drawer', 'open', '6', 'consol', 'with'})","Terms({'design', 'itali', 'will', 'space', 'in..."


In [31]:
import numpy as np
query_sample = side_by_side['query'].unique()
np.random.shuffle(query_sample)
query_sample = query_sample[:5]
query_sample

array(['parsons chairs', 'pedistole sink', 'penny round tile',
       'benjiamino faux leather power lift chair', 'shoe closet'],
      dtype=object)
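
`np.random.shuffle` draws from NumPy's global random state, so unless a seed was set elsewhere, each run of the cell picks a different five queries. If a repeatable sample is useful, one option is a seeded generator. The seed and the query list below are placeholders.

```python
import numpy as np

# Placeholder queries standing in for side_by_side['query'].unique()
all_queries = np.array(['salon chair', 'ottoman bed queen', 'merlyn 6',
                        'rack glass', 'kids chair', 'shoe closet'], dtype=object)

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for repeatability
# Draw five distinct queries without modifying all_queries in place
query_sample = rng.choice(all_queries, size=5, replace=False)
print(query_sample)
```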

In [32]:
print(side_by_side[side_by_side['query'].isin(query_sample)].to_markdown())

|      |   query_id | query                                    |   ideal_product_id |   ideal_id | ideal_label   |   ideal_grade |   ideal_rank |   product_id | ideal_product_name                                                                             |   rank |   product_id_actual | product_name_actual                                                                  |   grade |      dcg |      ndcg |
|-----:|-----------:|:-----------------------------------------|-------------------:|-----------:|:--------------|--------------:|-------------:|-------------:|:-----------------------------------------------------------------------------------------------|-------:|--------------------:|:-------------------------------------------------------------------------------------|--------:|---------:|----------:|
| 1801 |        182 | penny round tile                         |               3580 |      25875 | Exact         |             2 |            1 |         3580 | 1 '' x 1 '' marble pe

