## Lesson 2 - Let's try to improve on BM25

Crazy idea: let's actually *look at our data*, then figure out how to use LLMs.

In [1]:
from cheat_at_search.wands_data import products, queries
from cheat_at_search.strategy import (
    BM25Search,
    MiniLMSearch,
    EnrichedBM25Search,
    EnrichedJustRoomBM25Search,
    SynonymSearch,
)

bm25 = BM25Search(products)

2025-05-28 19:05:21,549 - cheat_at_search.wands_data - INFO - Directory /Users/doug/ws/cheat-with-llms/notebooks/data/wands already exists. Skipping clone.
2025-05-28 19:05:21,549 - cheat_at_search.wands_data - INFO - Loading relevance labels from /Users/doug/ws/cheat-with-llms/notebooks/data/wands/dataset/label.csv
2025-05-28 19:05:21,613 - cheat_at_search.wands_data - INFO - Loaded 231873 relevance labels
2025-05-28 19:05:21,613 - cheat_at_search.wands_data - INFO - Directory /Users/doug/ws/cheat-with-llms/notebooks/data/wands already exists. Skipping clone.
2025-05-28 19:05:21,614 - cheat_at_search.wands_data - INFO - Loading queries from /Users/doug/ws/cheat-with-llms/notebooks/data/wands/dataset/query.csv
2025-05-28 19:05:21,615 - cheat_at_search.wands_data - INFO - Loaded 480 queries
2025-05-28 19:05:21,615 - cheat_at_search.wands_data - INFO - Directory /Users/doug/ws/cheat-with-llms/notebooks/data/wands already exists. Skipping clone.
2025-05-28 19:05:21,615 - cheat_at_search.w

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['product_description'].fillna('', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['product_name'].fillna('', inplace=True)


2025-05-28 19:05:24,977 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-05-28 19:05:24,978 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-05-28 19:05:24,979 - searcharray.indexing - INFO - Tokenizing 42994 documents
2025-05-28 19:05:25,068 - searcharray.indexing - INFO - Tokenized 10000 (23.259059403637718%)
2025-05-28 19:05:25,159 - searcharray.indexing - INFO - Tokenized 20000 (46.518118807275435%)
2025-05-28 19:05:25,249 - searcharray.indexing - INFO - Tokenized 30000 (69.77717821091315%)
2025-05-28 19:05:25,341 - searcharray.indexing - INFO - Tokenized 40000 (93.03623761455087%)
2025-05-28 19:05:25,384 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-05-28 19:05:25,385 - searcharray.indexing - INFO - Tokenization -- DONE
2025-05-28 19:05:25,386 - searcharray.indexing - INFO - Inverting docs->terms
2025-05-28 19:05:25,404 - searcharray.indexing - INFO - Encoding positions to bit array
2025-05-28 19:05:25,416 - searcharray.indexing - I

### Ideal top 10 for each query

In [2]:
from cheat_at_search.wands_data import labeled_queries

def ideal10():
    """Return up to 10 of the highest-graded labeled products for each query."""
    # Sort by query, best grade first, then number the rows within each query
    ideal_results = labeled_queries.sort_values(['query_id', 'grade'], ascending=(True, False))
    ideal_results['rank'] = ideal_results.groupby('query_id').cumcount() + 1
    # Keep ranks 1-10 and prefix every column except the query keys with `ideal_`
    ideal_top_10 = ideal_results[ideal_results['rank'] <= 10] \
        .add_prefix('ideal_') \
        .rename(columns={'ideal_query_id': 'query_id', 'ideal_query': 'query'})

    # Attach product names so the ideal results are readable
    ideal_top_10 = ideal_top_10.merge(
        products[['product_id', 'product_name']], how='left', left_on='ideal_product_id', right_on='product_id'
    ).rename(columns={'product_name': 'ideal_product_name'}).drop(columns='ideal_query_class')

    return ideal_top_10

ideal_top_10 = ideal10()
ideal_top_10

Unnamed: 0,query_id,query,ideal_product_id,ideal_id,ideal_label,ideal_grade,ideal_rank,product_id,ideal_product_name
0,0,salon chair,1197,99,Exact,2.0,1,1197,lizzy 25.6 '' w bariatric antimicrobial waitin...
1,0,salon chair,1198,116,Exact,2.0,2,1198,rackham 26.8 '' w bariatric antimicrobial pvc ...
2,0,salon chair,2636,3,Exact,2.0,3,2636,25 '' wide faux leather manual swivel standard...
3,0,salon chair,2638,66,Exact,2.0,4,2638,faux leather massage chair
4,0,salon chair,5936,10,Exact,2.0,5,5936,69 '' w metal seat waiting room chair with met...
...,...,...,...,...,...,...,...,...,...
4732,487,rack glass,110,202050,Partial,1.0,6,110,west harptree 34 '' w garment rack
4733,487,rack glass,285,202336,Partial,1.0,7,285,3 shelf movable storage rack
4734,487,rack glass,573,202462,Partial,1.0,8,573,5 - hook freestanding coat rack
4735,487,rack glass,635,202178,Partial,1.0,9,635,4 - hook freestanding coat rack


## Run a simple BM25 search, evaluate

* Simple BM25 search, `/cheat_at_search/strategies/bm25.py`

In [3]:
from cheat_at_search.eval import grade_results

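def run_strategy(strategy):
    """Search every query with `strategy`, grade the top 10, and attach per-query DCG and NDCG."""
    results = strategy.search_all(queries)
    graded = grade_results(
        results,
        max_grade=2,
        k=10,
    )

    # A single ideal DCG, read from the first row, is applied to every query
    idcg = graded['idcg'].iloc[0]
    # Per-query DCG is the sum of the discounted gains of that query's results
    dcgs = graded.groupby(["query", 'query_id'])["discounted_gain"].sum().sort_values(ascending=False).rename('dcg')
    ndcgs = dcgs / idcg
    ndcgs = ndcgs.rename('ndcg')

    # Copy the per-query scores back onto every result row
    graded = graded.merge(dcgs, on=['query', 'query_id'])
    graded = graded.merge(ndcgs, on=['query', 'query_id'])

    return graded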

graded_bm25 = run_strategy(bm25)
graded_bm25


Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,features,...,query_id,rank,query_class,id,label,grade,discounted_gain,idcg,dcg,ndcg
0,7465,hair salon chair,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,offers a wide selection of professional salon ...,fauxleathertype : pu|legheight-toptobottom:18|...,69.0,4.5,53.0,"[fauxleathertype : pu, legheight-toptobottom:1...",...,0,1,Massage Chairs,80.0,Exact,2.0,3.00,8.786905,8.536905,0.971549
1,7468,mercer41 hair salon chair hydraulic styling ch...,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,mercer41 beauty offers a wide selection profes...,seatfillmaterial : foam|waterrepellant : no re...,1.0,5.0,1.0,"[seatfillmaterial : foam, waterrepellant : no ...",...,0,2,Massage Chairs,104.0,Exact,2.0,1.50,8.786905,8.536905,0.971549
2,25431,barberpub salon massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,salon chairs are a wonderful avenue for hairst...,supplierintendedandapproveduse : non residenti...,4.0,5.0,4.0,[supplierintendedandapproveduse : non resident...,...,0,3,Massage Chairs,29.0,Exact,2.0,1.00,8.786905,8.536905,0.971549
3,25432,barberpub hydraulic salon spa reclining massag...,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,salon chairs are a wonderful avenue for hairst...,backheight-seattotopofback:15.7|recliningtyped...,5.0,4.0,5.0,"[backheight-seattotopofback:15.7, recliningtyp...",...,0,4,Massage Chairs,28.0,Exact,2.0,0.75,8.786905,8.536905,0.971549
4,15612,massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,features heavy duty steel frame . premium chro...,overallheight-toptobottom:35.5|productcare : d...,59.0,4.5,50.0,"[overallheight-toptobottom:35.5, productcare :...",...,0,5,Massage Chairs,101.0,Exact,2.0,0.60,8.786905,8.536905,0.971549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4795,40243,madisen hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,complement your farmhouse kitchen decor with t...,producttype : wine glass rack|overallwidth-sid...,29.0,5.0,20.0,"[producttype : wine glass rack, overallwidth-s...",...,487,6,,,,0.0,0.00,8.786905,0.000000,0.000000
4796,40244,kena hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,spruce up your farmhouse kitchen decor with th...,warrantylength:1 year|producttype : wine glass...,23.0,5.0,18.0,"[warrantylength:1 year, producttype : wine gla...",...,487,7,,,,0.0,0.00,8.786905,0.000000,0.000000
4797,39976,wall mounted wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,"the latest addition to this collection , this ...",overallheight-toptobottom:4|design : wall moun...,34.0,4.5,18.0,"[overallheight-toptobottom:4, design : wall mo...",...,487,8,,,,0.0,0.00,8.786905,0.000000,0.000000
4798,40247,winn hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,are you looking for a safe and decorative solu...,overallheight-toptobottom:1.5|overallwidth-sid...,305.0,5.0,187.0,"[overallheight-toptobottom:1.5, overallwidth-s...",...,487,9,,,,0.0,0.00,8.786905,0.000000,0.000000
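
The numbers in this output suggest how the grading works, although `grade_results` itself is not shown here. The `discounted_gain` values 3.00, 1.50, 1.00, 0.75, 0.60 for five grade-2 results at ranks 1 to 5 match a gain of `(2**grade - 1) / rank`, and the `idcg` of 8.786905 matches ten grade-2 results scored the same way. The sketch below reproduces that arithmetic. The function names are made up for illustration, and the formula is inferred from the displayed values, not taken from the library.

```python
def gain(grade, rank):
    """Gain of one result: (2**grade - 1), discounted by 1/rank (inferred formula)."""
    return (2 ** grade - 1) / rank


def dcg(grades):
    """Sum the discounted gains of a ranked list of grades (rank starts at 1)."""
    return sum(gain(g, r) for r, g in enumerate(grades, start=1))


def ideal_dcg(max_grade=2, k=10):
    """DCG of a hypothetical perfect list: k results that all have max_grade."""
    return dcg([max_grade] * k)


def ndcg(grades, max_grade=2, k=10):
    """DCG of the top k grades divided by the fixed ideal DCG."""
    return dcg(grades[:k]) / ideal_dcg(max_grade, k)


# Five grade-2 results at ranks 1..5, as in the salon chair rows above
print([round(gain(2, r), 2) for r in range(1, 6)])  # [3.0, 1.5, 1.0, 0.75, 0.6]
print(round(ideal_dcg(), 6))                        # 8.786905
```

Under this reading the ideal DCG is the same for every query (ten grade-2 results) rather than the best ordering of each query's own labels, which is consistent with `run_strategy` reading `idcg` from a single row.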


In [4]:
def side_by_side(graded, ideal_top_10):
    """Pair each ideal result with the strategy's result at the same rank for the same query."""
    graded_view = graded[['query_id', 'query', 'rank', 'product_id', 'product_name', 'grade', 'dcg', 'ndcg']].rename(
        columns={'product_id': 'product_id_actual', 'product_name': 'product_name_actual'})

    sxs = ideal_top_10.merge(graded_view,
                             how='left',
                             left_on=['query_id', 'query', 'ideal_rank'],
                             right_on=['query_id', 'query', 'rank'])
    return sxs


sxs_bm25 = side_by_side(graded_bm25, ideal_top_10)
sxs_bm25

Unnamed: 0,query_id,query,ideal_product_id,ideal_id,ideal_label,ideal_grade,ideal_rank,product_id,ideal_product_name,rank,product_id_actual,product_name_actual,grade,dcg,ndcg
0,0,salon chair,1197,99,Exact,2.0,1,1197,lizzy 25.6 '' w bariatric antimicrobial waitin...,1,7465,hair salon chair,2.0,8.536905,0.971549
1,0,salon chair,1198,116,Exact,2.0,2,1198,rackham 26.8 '' w bariatric antimicrobial pvc ...,2,7468,mercer41 hair salon chair hydraulic styling ch...,2.0,8.536905,0.971549
2,0,salon chair,2636,3,Exact,2.0,3,2636,25 '' wide faux leather manual swivel standard...,3,25431,barberpub salon massage chair,2.0,8.536905,0.971549
3,0,salon chair,2638,66,Exact,2.0,4,2638,faux leather massage chair,4,25432,barberpub hydraulic salon spa reclining massag...,2.0,8.536905,0.971549
4,0,salon chair,5936,10,Exact,2.0,5,5936,69 '' w metal seat waiting room chair with met...,5,15612,massage chair,2.0,8.536905,0.971549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4732,487,rack glass,110,202050,Partial,1.0,6,110,west harptree 34 '' w garment rack,6,40243,madisen hanging wine glass rack,0.0,0.000000,0.000000
4733,487,rack glass,285,202336,Partial,1.0,7,285,3 shelf movable storage rack,7,40244,kena hanging wine glass rack,0.0,0.000000,0.000000
4734,487,rack glass,573,202462,Partial,1.0,8,573,5 - hook freestanding coat rack,8,39976,wall mounted wine glass rack,0.0,0.000000,0.000000
4735,487,rack glass,635,202178,Partial,1.0,9,635,4 - hook freestanding coat rack,9,40247,winn hanging wine glass rack,0.0,0.000000,0.000000


In [5]:
sxs_bm25.groupby('query')['ndcg'].max().sort_values(ascending=True).head(20)

query
one alium way                          0.000000
merlyn 6                               0.000000
ottoman bed queen                      0.000000
white abstract                         0.000000
large bases                            0.000000
promo codes or discounts               0.000000
pull out sleeper loveseat              0.000000
midcentury tv unit                     0.000000
rug for teen room                      0.000000
rack glass                             0.000000
small ladies rocker swivel recliner    0.000000
small loving roomtables                0.000000
dull bed with shirt head board         0.000000
drum picture                           0.000000
drudge report                          0.000000
teal chair                             0.000000
star wars rug                          0.000000
minnestrista                           0.000000
pantry grey                            0.012645
wisdom stone river 3-3/4               0.016258
Name: ndcg, dtype: float64

In [6]:
sxs_bm25.groupby('query')['ndcg'].max().sort_values(ascending=True).head(40).tail(20)

query
carpet 5x6                      0.022761
lowes tile                      0.024026
donaldson teak couch            0.030348
burnt orange curtains           0.031613
living room ideas               0.034142
floating bed                    0.035226
outdoor waterproof chest        0.035406
oliver parsons                  0.036987
french molding                  0.040284
wand bunk beds                  0.046606
pedistole sink                  0.049316
3/4 size mattress               0.052161
milk cow chair                  0.060064
propane gas dryer               0.064806
wall art fiji                   0.066838
podium with locking cabinet     0.069548
black fluffy stool              0.071445
johan desk by laurel foundry    0.073341
full metal bed rose gold        0.076322
palram harmony greenhouses      0.077270
Name: ndcg, dtype: float64

In [7]:
# pull out sleeper loveseat: miscategorized
# star wars rug: miscategorized
# drudge report: weird
# midcentury tv unit: style
# dull bed with shirt head board: misspelling
# rug for teen room: perhaps mislabeled? unknown
sxs_bm25[sxs_bm25['query'] == 'pantry grey']

Unnamed: 0,query_id,query,ideal_product_id,ideal_id,ideal_label,ideal_grade,ideal_rank,product_id,ideal_product_name,rank,product_id_actual,product_name_actual,grade,dcg,ndcg
3850,395,pantry grey,555,157019,Partial,1.0,1,555,abernathy 48 '' kitchen pantry,1,36491,dewhitt 67 '' pantry cabinet,0.0,0.111111,0.012645
3851,395,pantry grey,17023,40719,Partial,1.0,2,17023,uli 76 '' kitchen pantry,2,2414,olivas 60 '' kitchen pantry,0.0,0.111111,0.012645
3852,395,pantry grey,17835,157045,Partial,1.0,3,17835,ezara stand 61 '' kitchen pantry,3,32995,halstead 72 '' kitchen pantry,0.0,0.111111,0.012645
3853,395,pantry grey,18245,157020,Partial,1.0,4,18245,dubach 70 '' kitchen pantry,4,9924,hythe 35 '' kitchen pantry,0.0,0.111111,0.012645
3854,395,pantry grey,18246,157017,Partial,1.0,5,18246,bryaunna 34 '' kitchen pantry,5,18234,anie 66 '' kitchen pantry,0.0,0.111111,0.012645
3855,395,pantry grey,21484,157032,Partial,1.0,6,21484,wire baskets for organizing household pantry b...,6,40162,bayuga 31 '' kitchen pantry,0.0,0.111111,0.012645
3856,395,pantry grey,21815,157026,Partial,1.0,7,21815,charliedavid 70.8 '' kitchen pantry,7,14507,madalyn unfinished 72 '' kitchen pantry,0.0,0.111111,0.012645
3857,395,pantry grey,24056,156998,Partial,1.0,8,24056,easy essentials pantry bread box and divided 1...,8,9925,elliana storage 50 '' kitchen pantry,0.0,0.111111,0.012645
3858,395,pantry grey,25183,156979,Partial,1.0,9,25183,cockfosters 71 '' kitchen pantry,9,555,abernathy 48 '' kitchen pantry,1.0,0.111111,0.012645
3859,395,pantry grey,26545,157015,Partial,1.0,10,26545,mint pantry fragoza mango wood pineapple cutti...,10,36490,assent 78 '' kitchen pantry,0.0,0.111111,0.012645
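
The scores in this table can be checked by hand, assuming a gain of $(2^{\text{grade}} - 1)/\text{rank}$ and the fixed ideal DCG of 8.786905 shown in the `idcg` column earlier (the gain formula is an inference from the displayed numbers, not a confirmed description of `grade_results`). Only one returned product carries a grade, grade 1 at rank 9:

$$
\mathrm{DCG} = \frac{2^{1} - 1}{9} \approx 0.111111,
\qquad
\mathrm{NDCG} = \frac{0.111111}{8.786905} \approx 0.012645
$$

Both values match the `dcg` and `ndcg` columns above.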


In [8]:
products[products['product_id'] == 8759]

Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,features,product_name_snowball,product_description_snowball
8759,8759,daniel-john 28 '' wide manual swivel standard ...,Recliners,Furniture / Living Room Furniture / Chairs & S...,swivel rocker fabric recliner chair - reclinin...,durability : stain resistant|warrantylength:30...,13.0,4.0,12.0,"[durability : stain resistant, warrantylength:...","Terms({'wide', 'manual', 'john', 'standard', '...","Terms({'fabric', 'manual', 'room', 'seat', 'li..."


In [9]:
print(products[products['product_id'] == 8759]['product_description'].iloc[0])

swivel rocker fabric recliner chair - reclining chair manual , single modern sofa home theater seating for living room .


In [10]:
print(products[products['product_id'] == 8759]['product_name'].iloc[0])

daniel-john 28 '' wide manual swivel standard recliner


In [11]:
from cheat_at_search.strategy import (
    SpellingCorrectedSearch
)
corrected = SpellingCorrectedSearch(products)

2025-05-28 19:05:30,936 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-05-28 19:05:30,938 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-05-28 19:05:30,939 - searcharray.indexing - INFO - Tokenizing 42994 documents
2025-05-28 19:05:31,026 - searcharray.indexing - INFO - Tokenized 10000 (23.259059403637718%)
2025-05-28 19:05:31,120 - searcharray.indexing - INFO - Tokenized 20000 (46.518118807275435%)
2025-05-28 19:05:31,212 - searcharray.indexing - INFO - Tokenized 30000 (69.77717821091315%)
2025-05-28 19:05:31,303 - searcharray.indexing - INFO - Tokenized 40000 (93.03623761455087%)
2025-05-28 19:05:31,347 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-05-28 19:05:31,348 - searcharray.indexing - INFO - Tokenization -- DONE
2025-05-28 19:05:31,350 - searcharray.indexing - INFO - Inverting docs->terms
2025-05-28 19:05:31,368 - searcharray.indexing - INFO - Encoding positions to bit array
2025-05-28 19:05:31,381 - searcharray.indexing - I

In [12]:
graded_corrected = run_strategy(corrected)
graded_corrected

Query: foutains with brick look -> Corrected: fountains with brick look*
Query: wood coffee table set by storage -> Corrected: wood coffee table set buy storage*
Query: kohen 5 drawer dresser -> Corrected: cohen 5 drawer dresser*
Query: westling coffee table -> Corrected: wrestling coffee table*
Query: tollette teal outdoor rug -> Corrected: toilette teal outdoor rug*
Query: 7 draw white dresser -> Corrected: 7 drawer white dresser*
Query: regner power loom red -> Corrected: regency power loom red*
Query: liberty hardware francisco -> Corrected: liberty hardware san francisco*
Query: big basket for dirty cloths -> Corrected: big basket for dirty clothes*
Query: benjiamino faux leather power lift chair -> Corrected: benjamin faux leather power lift chair*
Query: biycicle plant stands -> Corrected: bicycle plant stands*
Query: chabely 5 draw chest -> Corrected: chevaly 5 drawer chest*
Query: desk for kids tjat ate 10 year old -> Corrected: desk for kids that are 10 year old*
Query: dull 


Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,features,...,query_id,rank,query_class,id,label,grade,discounted_gain,idcg,dcg,ndcg
0,7465,hair salon chair,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,offers a wide selection of professional salon ...,fauxleathertype : pu|legheight-toptobottom:18|...,69.0,4.5,53.0,"[fauxleathertype : pu, legheight-toptobottom:1...",...,0,1,Massage Chairs,80.0,Exact,2.0,3.00,8.786905,8.536905,0.971549
1,7468,mercer41 hair salon chair hydraulic styling ch...,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,mercer41 beauty offers a wide selection profes...,seatfillmaterial : foam|waterrepellant : no re...,1.0,5.0,1.0,"[seatfillmaterial : foam, waterrepellant : no ...",...,0,2,Massage Chairs,104.0,Exact,2.0,1.50,8.786905,8.536905,0.971549
2,25431,barberpub salon massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,salon chairs are a wonderful avenue for hairst...,supplierintendedandapproveduse : non residenti...,4.0,5.0,4.0,[supplierintendedandapproveduse : non resident...,...,0,3,Massage Chairs,29.0,Exact,2.0,1.00,8.786905,8.536905,0.971549
3,25432,barberpub hydraulic salon spa reclining massag...,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,salon chairs are a wonderful avenue for hairst...,backheight-seattotopofback:15.7|recliningtyped...,5.0,4.0,5.0,"[backheight-seattotopofback:15.7, recliningtyp...",...,0,4,Massage Chairs,28.0,Exact,2.0,0.75,8.786905,8.536905,0.971549
4,15612,massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,features heavy duty steel frame . premium chro...,overallheight-toptobottom:35.5|productcare : d...,59.0,4.5,50.0,"[overallheight-toptobottom:35.5, productcare :...",...,0,5,Massage Chairs,101.0,Exact,2.0,0.60,8.786905,8.536905,0.971549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4795,40243,madisen hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,complement your farmhouse kitchen decor with t...,producttype : wine glass rack|overallwidth-sid...,29.0,5.0,20.0,"[producttype : wine glass rack, overallwidth-s...",...,487,6,,,,0.0,0.00,8.786905,0.000000,0.000000
4796,40244,kena hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,spruce up your farmhouse kitchen decor with th...,warrantylength:1 year|producttype : wine glass...,23.0,5.0,18.0,"[warrantylength:1 year, producttype : wine gla...",...,487,7,,,,0.0,0.00,8.786905,0.000000,0.000000
4797,39976,wall mounted wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,"the latest addition to this collection , this ...",overallheight-toptobottom:4|design : wall moun...,34.0,4.5,18.0,"[overallheight-toptobottom:4, design : wall mo...",...,487,8,,,,0.0,0.00,8.786905,0.000000,0.000000
4798,40247,winn hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,are you looking for a safe and decorative solu...,overallheight-toptobottom:1.5|overallwidth-sid...,305.0,5.0,187.0,"[overallheight-toptobottom:1.5, overallwidth-s...",...,487,9,,,,0.0,0.00,8.786905,0.000000,0.000000


In [13]:
graded_corrected.groupby('query_id')['ndcg'].first().mean()

np.float64(0.5259454643303375)

In [14]:
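# Per-query NDCG change from spelling correction (negative means the correction hurt).
# Series subtraction aligns on the `query` index, so the right-hand side needs no sorting.
deltas = graded_corrected.groupby('query')['ndcg'].first() - sxs_bm25.groupby('query')['ndcg'].max()
deltas = deltas.rename('ndcg_delta')
deltas[deltas != 0].sort_values()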

query
kisner                                                    -1.000000
chaise lounge couch                                       -0.382198
malachi sled                                              -0.360565
tressler rug                                              -0.333333
rattan truck                                              -0.333333
bed side table                                            -0.319108
platform bed side table                                   -0.269792
liberty hardware francisco                                -0.259540
wood coffee table set by storage                          -0.227611
kohen 5 drawer dresser                                    -0.227611
grantola wall mirror                                      -0.227611
pennfield playhouse                                       -0.227611
mahone porch rocking chair                                -0.202321
odum velvet                                               -0.198212
mobley zero gravity adjustable bed with wi

In [15]:
def correct_query(query):
    # Uses the strategy's private `_corrected` helper to expose the rewritten query
    return corrected._corrected(query).corrected_keywords


sxs_corrected = side_by_side(graded_corrected, ideal_top_10)
sxs_corrected['corrected_query'] = sxs_corrected['query'].apply(correct_query)
sxs_corrected = sxs_corrected.merge(deltas, how='left', on='query')

sxs_corrected[sxs_corrected['query'] == 'malachi sled']

Unnamed: 0,query_id,query,ideal_product_id,ideal_id,ideal_label,ideal_grade,ideal_rank,product_id,ideal_product_name,rank,product_id_actual,product_name_actual,grade,dcg,ndcg,corrected_query,ndcg_delta
3071,313,malachi sled,20932,35792,Exact,2.0,1,20932,malachi sled end table,1,28909,cocktail sled coffee table,1.0,2.617857,0.297927,malachite sled,-0.360565
3072,313,malachi sled,20933,35788,Exact,2.0,2,20933,malachi 44 '' console table,2,4999,winfred sled coffee table,1.0,2.617857,0.297927,malachite sled,-0.360565
3073,313,malachi sled,20934,35791,Exact,2.0,3,20934,malachi sled coffee table,3,39954,ralph sled coffee table,1.0,2.617857,0.297927,malachite sled,-0.360565
3074,313,malachi sled,307,35790,Partial,1.0,4,307,malachi end table with storage,4,5089,ari ? n sled coffee table,1.0,2.617857,0.297927,malachite sled,-0.360565
3075,313,malachi sled,602,35805,Partial,1.0,5,602,sled coffee table,5,19202,amarillo sled coffee table,0.0,2.617857,0.297927,malachite sled,-0.360565
3076,313,malachi sled,604,133686,Partial,1.0,6,604,octopus sled coffee table,6,5913,sienna sled coffee table,1.0,2.617857,0.297927,malachite sled,-0.360565
3077,313,malachi sled,610,133695,Partial,1.0,7,610,sled coffee table,7,39800,hoffman sled coffee table,1.0,2.617857,0.297927,malachite sled,-0.360565
3078,313,malachi sled,1105,35797,Partial,1.0,8,1105,new haven sled coffee table,8,42263,johanna sled coffee table,1.0,2.617857,0.297927,malachite sled,-0.360565
3079,313,malachi sled,2087,35806,Partial,1.0,9,2087,sontag sled coffee table,9,22038,samara sled coffee table,0.0,2.617857,0.297927,malachite sled,-0.360565
3080,313,malachi sled,2495,35777,Partial,1.0,10,2495,armen sled coffee table,10,17106,telfair sled coffee table with storage,1.0,2.617857,0.297927,malachite sled,-0.360565
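
The table shows `malachi` rewritten to `malachite` even though `malachi` appears in several product names, which is the kind of over-correction that costs NDCG here. One possible guard, sketched below, keeps any original token that already occurs in the catalog vocabulary. This is an illustration only: `guard_corrections` and the tiny vocabulary are hypothetical, and nothing in the notebook says `SpellingCorrectedSearch` or `SpellingCorrectedSearch3` works this way.

```python
def guard_corrections(original, corrected, vocabulary):
    """Keep an original token whenever the catalog vocabulary already contains it."""
    orig_tokens = original.lower().split()
    corr_tokens = corrected.lower().split()
    if len(orig_tokens) != len(corr_tokens):
        # The correction split or merged tokens, so there is no one-to-one
        # mapping to check; accept the correction unchanged.
        return corrected
    kept = [o if o in vocabulary else c for o, c in zip(orig_tokens, corr_tokens)]
    return " ".join(kept)


# Toy vocabulary built from a few product names that appear in the table above
product_names = ["malachi sled end table", "malachi sled coffee table", "sled coffee table"]
vocabulary = {token for name in product_names for token in name.split()}

print(guard_corrections("malachi sled", "malachite sled", vocabulary))
# malachi sled
print(guard_corrections("foutains with brick look", "fountains with brick look", vocabulary))
# fountains with brick look
```

A real vocabulary would come from the indexed product fields, and a guard like this trades away corrections for misspellings that happen to collide with a catalog token.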


In [16]:
print(sxs_corrected[sxs_corrected['ndcg_delta'] > 0].groupby('query').first().to_markdown())

| query                                       |   query_id |   ideal_product_id |   ideal_id | ideal_label   |   ideal_grade |   ideal_rank |   product_id | ideal_product_name                                       |   rank |   product_id_actual | product_name_actual                                                         |   grade |     dcg |     ndcg | corrected_query                            |   ndcg_delta |
|:--------------------------------------------|-----------:|-------------------:|-----------:|:--------------|--------------:|-------------:|-------------:|:---------------------------------------------------------|-------:|--------------------:|:----------------------------------------------------------------------------|--------:|--------:|---------:|:-------------------------------------------|-------------:|
| 7 draw white dresser                        |         80 |               1134 |      10192 | Exact         |             2 |            1 |         1134 | kilduff 7 d

In [17]:
sxs_corrected[sxs_corrected['ndcg_delta'] > 0].groupby('query').first()

Unnamed: 0_level_0,query_id,ideal_product_id,ideal_id,ideal_label,ideal_grade,ideal_rank,product_id,ideal_product_name,rank,product_id_actual,product_name_actual,grade,dcg,ndcg,corrected_query,ndcg_delta
query,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
7 draw white dresser,80,1134,10192,Exact,2.0,1,1134,kilduff 7 drawer dresser,1,32233,marybella 7 drawer dresser,1.0,3.636905,0.413901,7 drawer white dresser,0.004064
desk for kids tjat ate 10 year old,197,2633,26756,Exact,2.0,1,2633,donnelly 26.14 '' art desk and chair set,1,21286,kids multifunctional desk and chair set adjust...,2.0,7.325,0.833627,desk for kids that are 10 year old,0.0378
foutains with brick look,20,443,1850,Exact,2.0,1,443,reposa natural stone cascading fountain with l...,1,17963,resin rainfall in brick design medallion fount...,2.0,5.078968,0.578016,fountains with brick look,0.496003
glass lsmp shades,401,473,41036,Exact,2.0,1,473,4.75 '' h glass bell pendant shade ( screw on ...,1,18944,8 '' glass sphere lamp shade,0.0,2.266667,0.25796,glass lamp shades,0.029264
love seat wide faux leather tuxedo arm sofa,210,674,27456,Exact,2.0,1,674,hogarth 66 '' faux leather tuxedo arm loveseat,1,20674,telfair 64 '' leather match tuxedo arm loveseat,1.0,4.703571,0.535293,loveseat wide faux leather tuxedo arm sofa,0.119767
sheets for twinxl,220,1475,28092,Exact,2.0,1,1475,itasca microfiber fitted top sheet set,1,17392,beth twin-xl 3pc sheet set,2.0,6.092857,0.693402,sheets for twin xl,0.092354
twin over full bunk beds cool desins,224,2110,28382,Exact,2.0,1,2110,rimstone twin over full bunk bed,1,14495,twin over full bunk bed,1.0,3.595635,0.409204,twin over full bunk beds cool designs,0.049316
tye dye duvet cover,459,10316,48728,Exact,2.0,1,10316,hearts in tie dye duvet cover set,1,10316,hearts in tie dye duvet cover set,2.0,5.378968,0.612157,tie dye duvet cover,0.092309


https://chatgpt.com/c/68376b81-2154-8004-8c86-9ed6151c8d4a

In [18]:
from cheat_at_search.strategy import (
    SpellingCorrectedSearch3
)
corrected = SpellingCorrectedSearch3(products)
graded_corrected = run_strategy(corrected)
graded_corrected

2025-05-28 19:05:37,097 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2025-05-28 19:05:37,099 - searcharray.indexing - INFO - 0 Batch Start tokenization
2025-05-28 19:05:37,099 - searcharray.indexing - INFO - Tokenizing 42994 documents
2025-05-28 19:05:37,190 - searcharray.indexing - INFO - Tokenized 10000 (23.259059403637718%)
2025-05-28 19:05:37,286 - searcharray.indexing - INFO - Tokenized 20000 (46.518118807275435%)
2025-05-28 19:05:37,378 - searcharray.indexing - INFO - Tokenized 30000 (69.77717821091315%)
2025-05-28 19:05:37,472 - searcharray.indexing - INFO - Tokenized 40000 (93.03623761455087%)
2025-05-28 19:05:37,518 - searcharray.indexing - INFO - Tokenization -- vstacking
2025-05-28 19:05:37,519 - searcharray.indexing - INFO - Tokenization -- DONE
2025-05-28 19:05:37,521 - searcharray.indexing - INFO - Inverting docs->terms
2025-05-28 19:05:37,538 - searcharray.indexing - INFO - Encoding positions to bit array
2025-05-28 19:05:37,552 - searcharray.indexing - I


Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,features,...,query_id,rank,query_class,id,label,grade,discounted_gain,idcg,dcg,ndcg
0,7465,hair salon chair,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,offers a wide selection of professional salon ...,fauxleathertype : pu|legheight-toptobottom:18|...,69.0,4.5,53.0,"[fauxleathertype : pu, legheight-toptobottom:1...",...,0,1,Massage Chairs,80.0,Exact,2.0,3.00,8.786905,8.536905,0.971549
1,7468,mercer41 hair salon chair hydraulic styling ch...,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,mercer41 beauty offers a wide selection profes...,seatfillmaterial : foam|waterrepellant : no re...,1.0,5.0,1.0,"[seatfillmaterial : foam, waterrepellant : no ...",...,0,2,Massage Chairs,104.0,Exact,2.0,1.50,8.786905,8.536905,0.971549
2,25431,barberpub salon massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,salon chairs are a wonderful avenue for hairst...,supplierintendedandapproveduse : non residenti...,4.0,5.0,4.0,[supplierintendedandapproveduse : non resident...,...,0,3,Massage Chairs,29.0,Exact,2.0,1.00,8.786905,8.536905,0.971549
3,25432,barberpub hydraulic salon spa reclining massag...,Massage Chairs|Recliners,Furniture / Living Room Furniture / Chairs & S...,salon chairs are a wonderful avenue for hairst...,backheight-seattotopofback:15.7|recliningtyped...,5.0,4.0,5.0,"[backheight-seattotopofback:15.7, recliningtyp...",...,0,4,Massage Chairs,28.0,Exact,2.0,0.75,8.786905,8.536905,0.971549
4,15612,massage chair,Massage Chairs,Furniture / Living Room Furniture / Chairs & S...,features heavy duty steel frame . premium chro...,overallheight-toptobottom:35.5|productcare : d...,59.0,4.5,50.0,"[overallheight-toptobottom:35.5, productcare :...",...,0,5,Massage Chairs,101.0,Exact,2.0,0.60,8.786905,8.536905,0.971549
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4795,40243,madisen hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,complement your farmhouse kitchen decor with t...,producttype : wine glass rack|overallwidth-sid...,29.0,5.0,20.0,"[producttype : wine glass rack, overallwidth-s...",...,487,6,,,,0.0,0.00,8.786905,0.000000,0.000000
4796,40244,kena hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,spruce up your farmhouse kitchen decor with th...,warrantylength:1 year|producttype : wine glass...,23.0,5.0,18.0,"[warrantylength:1 year, producttype : wine gla...",...,487,7,,,,0.0,0.00,8.786905,0.000000,0.000000
4797,39976,wall mounted wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,"the latest addition to this collection , this ...",overallheight-toptobottom:4|design : wall moun...,34.0,4.5,18.0,"[overallheight-toptobottom:4, design : wall mo...",...,487,8,,,,0.0,0.00,8.786905,0.000000,0.000000
4798,40247,winn hanging wine glass rack,Wine Racks,Kitchen & Tabletop / Tableware & Drinkware / B...,are you looking for a safe and decorative solu...,overallheight-toptobottom:1.5|overallwidth-sid...,305.0,5.0,187.0,"[overallheight-toptobottom:1.5, overallwidth-s...",...,487,9,,,,0.0,0.00,8.786905,0.000000,0.000000
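
The `discounted_gain`, `idcg`, `dcg` and `ndcg` columns above can be reproduced by hand. The sketch below is inferred from the displayed numbers only, not from the library source. It assumes a gain of `(2**grade - 1) / rank`, since a grade of 2.0 at ranks 1 to 5 shows gains of 3.00, 1.50, 1.00, 0.75 and 0.60. It also assumes the ideal ranking is ten results at the maximum grade of 2, since `idcg` is 8.786905 for every row shown. The function names and the example grade list are hypothetical: ranks 6 to 10 for query 0 are hidden in the output, and a single grade of 1 at rank 8 is one assignment consistent with the displayed `dcg`.

```python
def discounted_gains(grades):
    """Per-rank gain, assuming (2**grade - 1) / rank with ranks starting at 1."""
    return [(2 ** g - 1) / rank for rank, g in enumerate(grades, start=1)]


def ndcg(grades, max_grade=2, k=10):
    """NDCG@k against an assumed ideal of k results at max_grade."""
    dcg = sum(discounted_gains(grades[:k]))
    idcg = sum(discounted_gains([max_grade] * k))
    return dcg / idcg


# Hypothetical grades for query 0: all Exact (2) except one Partial (1) at rank 8.
grades = [2, 2, 2, 2, 2, 2, 2, 1, 2, 2]

print(discounted_gains(grades)[:5])
print(sum(discounted_gains(grades)), ndcg(grades))
```

Note that this discount is `1 / rank` rather than the more common `1 / log2(rank + 1)`, so the top positions dominate the score more heavily.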


In [23]:
graded_corrected.groupby('query_id')['ndcg'].first().mean(), graded_bm25.groupby('query_id')['ndcg'].first().mean()

(np.float64(0.5368180689156843), np.float64(0.534594548314742))

In [24]:
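# Per-query NDCG change: spelling-corrected search minus plain BM25.
delta = graded_corrected.groupby('query')['ndcg'].first() - graded_bm25.groupby('query')['ndcg'].first()
# Show only the queries whose NDCG changed, worst regressions first.
delta[delta != 0].sort_values()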

query
outdoor sectional doning                      -0.036671
pedistole sink                                -0.026555
desk for kids tjat ate 10 year old            -0.002032
7 draw white dresser                           0.004064
twin over full bunk beds cool desins           0.011381
love seat wide faux leather tuxedo arm sofa    0.045522
tye dye duvet cover                            0.054374
glass lsmp shades                              0.111141
sheets for twinxl                              0.117961
foutains with brick look                       0.150793
midcentury tv unit                             0.637312
Name: ndcg, dtype: float64
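
A quick way to sanity-check these deltas against the two means in `In [23]` is to sum them and divide by the number of queries, since unchanged queries contribute zero. The sketch below retypes the printed deltas by hand and takes 480 from the "Loaded 480 queries" log line. It assumes both runs cover the same 480 queries, and the variable names are illustrative.

```python
import pandas as pd

# Non-zero per-query NDCG deltas, copied from the output above.
delta = pd.Series(
    {
        "outdoor sectional doning": -0.036671,
        "pedistole sink": -0.026555,
        "desk for kids tjat ate 10 year old": -0.002032,
        "7 draw white dresser": 0.004064,
        "twin over full bunk beds cool desins": 0.011381,
        "love seat wide faux leather tuxedo arm sofa": 0.045522,
        "tye dye duvet cover": 0.054374,
        "glass lsmp shades": 0.111141,
        "sheets for twinxl": 0.117961,
        "foutains with brick look": 0.150793,
        "midcentury tv unit": 0.637312,
    },
    name="ndcg",
)

NUM_QUERIES = 480  # assumption: taken from the "Loaded 480 queries" log line

hurt = int((delta < 0).sum())      # queries where correction lowered NDCG
helped = int((delta > 0).sum())    # queries where correction raised NDCG
net_gain = delta.sum()             # total NDCG change across all queries
mean_shift = net_gain / NUM_QUERIES
top_share = delta.max() / net_gain  # share of the net gain from the best query

print(f"hurt={hurt} helped={helped}")
print(f"net gain={net_gain:.6f} mean shift={mean_shift:.6f}")
print(f"largest single-query share of net gain={top_share:.1%}")
```

Only 11 of the 480 queries move at all, and a single query ("midcentury tv unit") supplies more than half of the net gain, so the small difference between the two means rests on very few queries.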

In [26]:
corrected._corrected('outdoor sectional doning')

SpellingCorrectedQuery(keywords='outdoor sectional doning', corrected_keywords='outdoor sectional dining')

In [22]:
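# Second level of the category hierarchy (the text between the first and second '/').
# The split is on '/' alone, so the labels keep their surrounding whitespace.
second_level = products['category hierarchy'].fillna('//').str.split('/').apply(lambda x: str(x[1] if len(x) > 1 else '')).rename('subcategory').to_frame()
second_level_counts = second_level['subcategory'].value_counts()
second_level_counts[second_level_counts > 10]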

subcategory
Living Room Furniture                    7834
Kitchen & Dining Furniture               3304
Bedroom Furniture                        2843
Bathroom Remodel & Bathroom Fixtures     1924
Office Furniture                         1863
                                         ... 
Holiday Lighting                           13
Storage & Organization Sale                13
Outdoor Heating                            13
Office Organization                        11
Entry & Mudroom Furniture                  11
Name: count, Length: 95, dtype: int64

In [29]:
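# Top level of the category hierarchy (the text before the first '/').
# Products with no hierarchy fall into the empty-string bucket.
top_level = products['category hierarchy'].fillna('//').str.split('/').apply(lambda x: str(x[0] if len(x) > 1 else '')).rename('category').to_frame()
top_level_counts = top_level['category'].value_counts()
top_level_counts[top_level_counts > 10]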

category
Furniture                         16039
Home Improvement                   4686
Décor & Pillows                    4612
Outdoor                            3394
Storage & Organization             2175
Lighting                           2072
Rugs                               2002
Bed & Bath                         1865
                                   1645
Kitchen & Tabletop                 1615
Baby & Kids                        1204
School Furniture and Supplies       455
Appliances                          307
Holiday Décor                       212
Commercial Business Furniture       177
Pet                                 164
Contractor                          160
Sale                                 80
Foodservice                          36
Shop Product Type                    33
Browse By Brand                      24
Reception Area                       17
Clips                                14
Name: count, dtype: int64

In [32]:
top_level_counts[top_level_counts > 10].index.to_list()

['Furniture ',
 'Home Improvement ',
 'Décor & Pillows ',
 'Outdoor ',
 'Storage & Organization ',
 'Lighting ',
 'Rugs ',
 'Bed & Bath ',
 '',
 'Kitchen & Tabletop ',
 'Baby & Kids ',
 'School Furniture and Supplies ',
 'Appliances ',
 'Holiday Décor ',
 'Commercial Business Furniture ',
 'Pet ',
 'Contractor ',
 'Sale ',
 'Foodservice ',
 'Shop Product Type ',
 'Browse By Brand ',
 'Reception Area ',
 'Clips']
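
Most of the labels above end in a space (`'Furniture '`) while `'Clips'` does not, because the hierarchy is split on `'/'` alone. Labels that differ only in whitespace are counted as separate categories. A minimal sketch of stripping before counting is below. The hierarchy strings are hypothetical toy data modeled on the `Furniture / Living Room Furniture / ...` format, not rows from the WANDS products.

```python
import pandas as pd

# Hypothetical hierarchy strings in the "A / B / C" format seen above.
hierarchy = pd.Series(
    [
        "Furniture / Living Room Furniture / Chairs",
        "Rugs / Area Rugs",                 # second level is last: no trailing space
        "Rugs / Area Rugs / Outdoor Rugs",  # second level is followed by ' /'
        None,                               # product with no hierarchy
    ]
)

parts = hierarchy.fillna("").str.split("/")

# Second level as-is: surrounding whitespace is kept.
raw = parts.str[1].fillna("")
# Second level with whitespace stripped before counting.
clean = raw.str.strip()

print(raw.value_counts())
print(clean.value_counts())
```

Stripping before `value_counts()` merges `' Area Rugs'` and `' Area Rugs '` into one bucket, which also changes which categories clear a count threshold.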

In [35]:
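# strip() runs after counting, so labels that differ only in whitespace (e.g. 'Area Rugs') appear twice.
[a.strip() for a in second_level_counts[second_level_counts > 100].index.to_list()]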

['Living Room Furniture',
 'Kitchen & Dining Furniture',
 'Bedroom Furniture',
 'Bathroom Remodel & Bathroom Fixtures',
 'Office Furniture',
 '',
 'Outdoor & Patio Furniture',
 'Flooring, Walls & Ceiling',
 'Area Rugs',
 'Hardware',
 'Bedding',
 'Tableware & Drinkware',
 'Window Treatments',
 'Decorative Pillows & Blankets',
 'Garage & Outdoor Storage & Organization',
 'Wall Décor',
 'Art',
 'Garden',
 'Toddler & Kids Bedroom Furniture',
 'Outdoor Décor',
 'Home Accessories',
 'Ceiling Lights',
 'Wall Lights',
 'Mirrors',
 'Bathroom Storage & Organization',
 'Kitchen Mats',
 'Area Rugs',
 'Table & Floor Lamps',
 'Outdoor Lighting',
 'Light Bulbs & Hardware',
 'Wall Shelving & Organization',
 'Flowers & Plants',
 'Kitchen Remodel & Kitchen Fixtures',
 'Clocks',
 'Doormats',
 'Shower Curtains & Accessories',
 'Kitchen Organization',
 'Mattresses & Foundations',
 'Toddler & Kids Playroom',
 'Doors & Door Hardware',
 'Bedding Essentials',
 'Shoe Storage',
 'Closet Storage & Organization',
