# Adding a steering and stopping strategy

Compared to the previous agentic search notebook, this one adds two enhancements:

## Search tools error on similar queries

If a query has been seen before (or is similar enough to one seen before), the search tool *errors*. This helps the agent move AWAY from places it has already explored.

Notice the tool accepts a special variable, `agent_state`. That's a specially named variable we can pass to the tool *from the harness*. It lets us maintain some global state for one instantiation of the loop, so the tool can check whether the state of the world allows the search to be run.
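
A minimal sketch of this pattern, under stated assumptions: `toy_search` and `run_tool` are hypothetical names (not part of cheat-at-search), and exact-match de-duplication stands in for the embedding similarity used later in this notebook.

```python
from typing import Optional


def toy_search(keywords: str, agent_state: Optional[dict] = None) -> dict:
    """Hypothetical tool: errors when the same query was already run."""
    if agent_state is not None:
        # State lives in the dict owned by the harness, not in the tool
        seen = agent_state.setdefault("seen_queries", set())
        normalized = keywords.strip().lower()
        if normalized in seen:
            return {"results": [], "error": f"Already searched '{keywords}'"}
        seen.add(normalized)
    return {"results": [f"result for {keywords}"], "error": None}


def run_tool(tool_fn, llm_args: dict, agent_state: dict) -> dict:
    """Hypothetical harness helper: the LLM supplies llm_args, the harness adds agent_state."""
    return tool_fn(**llm_args, agent_state=agent_state)


agent_state = {}  # one dict per run of the agent loop
first = run_tool(toy_search, {"keywords": "red couch"}, agent_state)
second = run_tool(toy_search, {"keywords": "Red Couch "}, agent_state)
print(first["error"])
print(second["error"])
```

The LLM never sees or sets `agent_state`. Only the harness does, which is what keeps the state trustworthy.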

## Automated user feedback to agent

In the form of a user prompt, we tell the agent what "we" are satisfied with. This sort of feedback could be replaced by manual intervention, a reranker, or an LLM as a judge.

This helps the agent move TOWARDS a type of query / set of results.

In [None]:
!pip install git+https://github.com/softwaredoug/cheat-at-search.git
from cheat_at_search.data_dir import mount
mount(use_gdrive=True)    # colab, share data across notebook runs on gdrive
# mount(use_gdrive=False) # <- colab without gdrive
# mount(use_gdrive=False, manual_path="/path/to/directory")  # <- force data path to specific directory, ie you're running locally.


Collecting git+https://github.com/softwaredoug/cheat-at-search.git
  Cloning https://github.com/softwaredoug/cheat-at-search.git to /tmp/pip-req-build-wqcdllha
  Running command git clone --filter=blob:none --quiet https://github.com/softwaredoug/cheat-at-search.git /tmp/pip-req-build-wqcdllha
  Resolved https://github.com/softwaredoug/cheat-at-search.git to commit 6a08d097f1d6eaa068fb61af47c621df1682f5e2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Mounted at /content/drive


## Get an OpenAI Key

This will prompt you for an OpenAI Key to interact with GPT-5

In [None]:
from cheat_at_search.data_dir import key_for_provider
from openai import OpenAI

OPENAI_KEY = key_for_provider("openai")

openai = OpenAI(api_key=OPENAI_KEY)

## Load the Wayfair corpus

We'll recommend products only from this corpus

In [None]:
from cheat_at_search.wands_data import corpus

corpus['category'] = corpus['category'].str.strip()
corpus['sub_category'] = corpus['sub_category'].str.strip()

corpus

Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count,features,doc_id,title,description,category,sub_category,cat_subcat
0,0,solid wood platform bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,"good , deep sleep can be quite difficult to ha...",overallwidth-sidetoside:64.7|dsprimaryproducts...,15.0,4.5,15.0,"[overallwidth-sidetoside:64.7, dsprimaryproduc...",0,solid wood platform bed,"good , deep sleep can be quite difficult to ha...",Furniture,Bedroom Furniture,Furniture / Bedroom Furniture
1,1,all-clad 7 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,"create delicious slow-cooked meals , from tend...",capacityquarts:7|producttype : slow cooker|pro...,100.0,2.0,98.0,"[capacityquarts:7, producttype : slow cooker, ...",1,all-clad 7 qt . slow cooker,"create delicious slow-cooked meals , from tend...",Kitchen & Tabletop,Small Kitchen Appliances,Kitchen & Tabletop / Small Kitchen Appliances
2,2,all-clad electrics 6.5 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,prepare home-cooked meals on any schedule with...,features : keep warm setting|capacityquarts:6....,208.0,3.0,181.0,"[features : keep warm setting, capacityquarts:...",2,all-clad electrics 6.5 qt . slow cooker,prepare home-cooked meals on any schedule with...,Kitchen & Tabletop,Small Kitchen Appliances,Kitchen & Tabletop / Small Kitchen Appliances
3,3,all-clad all professional tools pizza cutter,"Slicers, Peelers And Graters",Browse By Brand / All-Clad,this original stainless tool was designed to c...,overallwidth-sidetoside:3.5|warrantylength : l...,69.0,4.5,42.0,"[overallwidth-sidetoside:3.5, warrantylength :...",3,all-clad all professional tools pizza cutter,this original stainless tool was designed to c...,Browse By Brand,All-Clad,Browse By Brand / All-Clad
4,4,baldwin prestige alcott passage knob with roun...,Door Knobs,Home Improvement / Doors & Door Hardware / Doo...,the hardware has a rich heritage of delivering...,compatibledoorthickness:1.375 '' |countryofori...,70.0,5.0,42.0,"[compatibledoorthickness:1.375 '' , countryofo...",4,baldwin prestige alcott passage knob with roun...,the hardware has a rich heritage of delivering...,Home Improvement,Doors & Door Hardware,Home Improvement / Doors & Door Hardware
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42989,42989,malibu pressure balanced diverter fixed shower...,Shower Panels,Home Improvement / Bathroom Remodel & Bathroom...,the malibu pressure balanced diverter fixed sh...,producttype : shower panel|spraypattern : rain...,3.0,4.5,2.0,"[producttype : shower panel, spraypattern : ra...",42989,malibu pressure balanced diverter fixed shower...,the malibu pressure balanced diverter fixed sh...,Home Improvement,Bathroom Remodel & Bathroom Fixtures,Home Improvement / Bathroom Remodel & Bathro...
42990,42990,emmeline 5 piece breakfast dining set,Dining Table Sets,Furniture / Kitchen & Dining Furniture / Dinin...,,basematerialdetails : steel| : gray wood|ofhar...,1314.0,4.5,864.0,"[basematerialdetails : steel, : gray wood, of...",42990,emmeline 5 piece breakfast dining set,,Furniture,Kitchen & Dining Furniture,Furniture / Kitchen & Dining Furniture
42991,42991,maloney 3 piece pub table set,Dining Table Sets,Furniture / Kitchen & Dining Furniture / Dinin...,this pub table set includes 1 counter height t...,additionaltoolsrequirednotincluded : power dri...,49.0,4.0,41.0,[additionaltoolsrequirednotincluded : power dr...,42991,maloney 3 piece pub table set,this pub table set includes 1 counter height t...,Furniture,Kitchen & Dining Furniture,Furniture / Kitchen & Dining Furniture
42992,42992,fletcher 27.5 '' wide polyester armchair,Teen Lounge Furniture|Accent Chairs,Furniture / Living Room Furniture / Chairs & S...,"bring iconic , modern style to your space in a...",legmaterialdetails : rubberwood|backheight-sea...,1746.0,4.5,1226.0,"[legmaterialdetails : rubberwood, backheight-s...",42992,fletcher 27.5 '' wide polyester armchair,"bring iconic , modern style to your space in a...",Furniture,Living Room Furniture,Furniture / Living Room Furniture


### Index the furniture

We'll index title, description, and category with basic (Snowball) stemming so we can retrieve products by them.

In [None]:
from searcharray import SearchArray
from cheat_at_search.tokenizers import snowball_tokenizer

corpus['title_snowball'] = SearchArray.index(corpus['title'].fillna(''), snowball_tokenizer)
corpus['description_snowball'] = SearchArray.index(corpus['description'].fillna(''), snowball_tokenizer)
corpus['category_snowball'] = SearchArray.index(corpus['category'].fillna(''), snowball_tokenizer)

2026-02-11 03:42:55,101 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2026-02-11 03:42:55,132 - searcharray.indexing - INFO - 0 Batch Start tokenization
2026-02-11 03:42:55,138 - searcharray.indexing - INFO - Tokenizing 42994 documents
2026-02-11 03:42:56,116 - searcharray.indexing - INFO - Tokenized 10000 (23.259059403637718%)
2026-02-11 03:42:56,883 - searcharray.indexing - INFO - Tokenized 20000 (46.518118807275435%)
2026-02-11 03:42:57,513 - searcharray.indexing - INFO - Tokenized 30000 (69.77717821091315%)
2026-02-11 03:42:58,105 - searcharray.indexing - INFO - Tokenized 40000 (93.03623761455087%)
2026-02-11 03:42:58,520 - searcharray.indexing - INFO - Tokenization -- vstacking
2026-02-11 03:42:58,549 - searcharray.indexing - INFO - Tokenization -- DONE
2026-02-11 03:42:58,576 - searcharray.indexing - INFO - Inverting docs->terms
2026-02-11 03:42:58,659 - searcharray.indexing - INFO - Encoding positions to bit array
2026-02-11 03:42:58,769 - searcharray.indexing - INFO - Batch tokenization complete
2026-02-11 03:42:58,772 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2026-02-11 03:42:58,838 - searcharray.indexing - INFO - Indexing from tokenization complete
2026-02-11 03:42:58,903 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2026-02-11 03:42:58,916 - searcharray.indexing - INFO - 0 Batch Start tokenization
2026-02-11 03:42:58,918 - searcharray.indexing - INFO - Tokenizing 42994 documents
2026-02-11 03:43:01,130 - searcharray.indexing - INFO - Tokenized 10000 (23.259059403637718%)
2026-02-11 03:43:02,853 - searcharray.indexing - INFO - Tokenized 20000 (46.518118807275435%)
2026-02-11 03:43:04,392 - searcharray.indexing - INFO - Tokenized 30000 (69.77717821091315%)
2026-02-11 03:43:06,123 - searcharray.indexing - INFO - Tokenized 40000 (93.03623761455087%)
2026-02-11 03:43:06,909 - searcharray.indexing - INFO - Tokenization -- vstacking
2026-02-11 03:43:06,983 - searcharray.indexing - INFO - Tokenization -- DONE
2026-02-11 03:43:07,036 - searcharray.indexing - INFO - Inverting docs->terms
2026-02-11 03:43:07,778 - searcharray.indexing - INFO - Encoding positions to bit array
2026-02-11 03:43:08,215 - searcharray.indexing - INFO - Batch tokenization complete
2026-02-11 03:43:08,221 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2026-02-11 03:43:08,442 - searcharray.indexing - INFO - Indexing from tokenization complete
2026-02-11 03:43:08,609 - searcharray.indexing - INFO - Indexing begins w/ 4 workers
2026-02-11 03:43:08,618 - searcharray.indexing - INFO - 0 Batch Start tokenization
2026-02-11 03:43:08,623 - searcharray.indexing - INFO - Tokenizing 42994 documents
2026-02-11 03:43:08,875 - searcharray.indexing - INFO - Tokenized 10000 (23.259059403637718%)
2026-02-11 03:43:09,130 - searcharray.indexing - INFO - Tokenized 20000 (46.518118807275435%)
2026-02-11 03:43:09,427 - searcharray.indexing - INFO - Tokenized 30000 (69.77717821091315%)
2026-02-11 03:43:09,635 - searcharray.indexing - INFO - Tokenized 40000 (93.03623761455087%)
2026-02-11 03:43:09,830 - searcharray.indexing - INFO - Tokenization -- vstacking
2026-02-11 03:43:09,833 - searcharray.indexing - INFO - Tokenization -- DONE
2026-02-11 03:43:09,839 - searcharray.indexing - INFO - Inverting docs->terms
2026-02-11 03:43:09,847 - searcharray.indexing - INFO - Encoding positions to bit array
2026-02-11 03:43:09,854 - searcharray.indexing - INFO - Batch tokenization complete
2026-02-11 03:43:09,858 - searcharray.indexing - INFO - (main thread) Processing 1 batch results
2026-02-11 03:43:09,878 - searcharray.indexing - INFO - Indexing from tokenization complete


## Add a search tracker

Here we track which searches we've seen, to prevent the agent from re-running searches it has already tried.
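
To make "similar enough" concrete, here is a small sketch of a cosine-similarity cutoff. The 3-dimensional vectors are toy stand-ins for real query embeddings, and the variable names are illustrative assumptions, not part of the notebook's code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


SIMILARITY_THRESHOLD = 0.95  # assumed cutoff for "too similar"

# Toy 3-d vectors standing in for query embeddings (hypothetical values)
red_couch = [1.0, 0.0, 0.0]
crimson_sofa = [0.98, 0.1, 0.0]
blue_lamp = [0.0, 1.0, 0.0]

near_duplicate = cosine_similarity(red_couch, crimson_sofa) >= SIMILARITY_THRESHOLD
different = cosine_similarity(red_couch, blue_lamp) >= SIMILARITY_THRESHOLD
print(near_duplicate, different)
```

A higher threshold rejects only near-identical queries. A lower one pushes the agent to explore more aggressively, at the risk of blocking legitimate refinements.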

## Create a furniture products search function

Here is a function that searches a Wayfair product dataset. It's just a Python function that returns the top `top_k` products (5 by default).

Right now we'll call it directly. Soon we'll help ChatGPT interact with it.

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import Optional

minilm = SentenceTransformer('all-MiniLM-L6-v2')


class SearchTracker:
    def __init__(self, similarity_threshold=0.95):
        self.queries = []
        self.query_embeddings = []
        self.query_log = []
        self.similarity_threshold = similarity_threshold

    def similar_search(self, new_query: str) -> Optional[str]:
        if not new_query:
            # Nothing to compare; None means "no similar search found"
            return None

        new_embedding = minilm.encode(new_query, convert_to_numpy=True)

        for i, existing_embedding in enumerate(self.query_embeddings):
            # Cosine similarity: dot product divided by the product of the norms
            similarity = np.dot(new_embedding, existing_embedding) / (np.linalg.norm(new_embedding) * np.linalg.norm(existing_embedding))
            if similarity >= self.similarity_threshold:
                return self.queries[i]

        self.queries.append(new_query)
        self.query_embeddings.append(new_embedding)
        return None

# Example usage (not wired up to the main agent logic yet)
tracker = SearchTracker()
print(tracker.similar_search("red couch"))
print(tracker.similar_search("red couch"))
# print(tracker.similar_search("blue chair"))
# print(tracker.similar_search("large red couch"))

None
red couch


In [None]:
import numpy as np

from pydantic import BaseModel, Field
from typing import Optional, Literal


Categories = Literal['Furniture', 'Kitchen & Tabletop', 'Browse By Brand',
                     'Home Improvement', 'Décor & Pillows', 'Outdoor',
                     'Storage & Organization', 'Bed & Bath', 'Baby & Kids',
                     'Pet', 'Lighting', 'Rugs', 'School Furniture and Supplies',
                     'Commercial Business Furniture', 'Holiday Décor', 'Fountains',
                     'Contractor', 'Appliances', 'Sale', 'Reception Area',
                     'Foodservice', 'Institutional Furniture Parts & Accessories',
                     'Landscaping Screens & Bridges', 'Shop Product Type', 'Clips',
                     'Slicers, Peelers And Graters', 'Bed Accessories',
                     'Accommodations', 'Buffet Accessories', 'Specialty Serving',
                     'Display Cases', 'Key Organizers', 'Ergonomic Accessories',
                     'Slow Cookers', 'Bath Rugs & Mats', 'Furniture Cushions',
                     'Early Education', 'Learning Resources',
                     'Physical Education Equipment', 'Faux Plants and Trees',
                     'Desk Parts', 'Serving Dishes & Platters', 'Water Filter Pitchers',
                     'Shower Curtain Rods', 'Table Accessories',
                     'Sandboxes & Sand Toys', 'Meeting & Collaborative Spaces',
                     'Desktop Organizers & Desk Pads',
                     'Napkin Rings, Place Card Holders & Food Markers',
                     'Partition & Panel Hardware Accessories', 'Cash Handling', 'Hooks',
                     'Novelty Lighting', 'Protection Plans',
                     'Stages, Risers and Accessories']


class ToolSearchResult(BaseModel):
    id: int = Field(description="The id of the product")
    title: str = Field(description="The title of the product")
    description: str = Field(description="The description of the product")
    category: str = Field(description="The category of the product")
    score: float = Field(description="The score of the product")


class ToolSearchResults(BaseModel):
    search_results: list[ToolSearchResult] = Field(description="The search results")
    error: Optional[str] = Field(description="Any error message", default=None)



def search_wayfair(keywords: str,
                   category: Optional[Categories] = None,
                   top_k: int = 5,
                   agent_state: Optional[dict] = None
                   ) -> ToolSearchResults:
    """Search the wayfair home goods + furniture catalog, get top_k results

    This is direct keyword search along with optional category filtering.

    Args:
        keywords: The search query string.
        category: category to filter products by, unfiltered when not present
        top_k: The number of top results to return.

    Returns:
        Search results or error message

    """
    print("search", keywords)
    # ******
    # Check search tracker and reject if too similar
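    query_log = []
    if agent_state is not None:
        # Reuse the shared query log, creating it on first use.
        # Guarding on None keeps the tool callable without a harness.
        query_log = agent_state.setdefault('log', query_log)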
    category_lookup = "no_category"
    if category is not None:
        category_lookup = category.lower()
    search_tracker = None
    if agent_state is not None:
        search_trackers = {}
        try:
            search_trackers = agent_state['search_tracker']
        except KeyError:
            agent_state['search_tracker'] = search_trackers

        # Get tracker for category
        try:
            search_tracker = search_trackers[category_lookup]
        except KeyError:
            search_trackers[category_lookup] = SearchTracker()
            search_tracker = search_trackers[category_lookup]

        assert agent_state['search_tracker'][category_lookup] is not None
        duplicate_search = search_tracker.similar_search(keywords)
        if duplicate_search is not None:
            error_msg = f"""
                You searched '{keywords}', but you already searched for very similar '{duplicate_search}' in the past in category {category_lookup}

                Try exploring a different query to explore more of the space
                (different / no categories, or different keywords in this category)
            """
            print(error_msg)
            query_log.append((keywords, category, True))
            return ToolSearchResults(search_results=[],
                                     error=error_msg)
    # Done search tracking error checking

    bm25_scores = np.zeros(len(corpus))
    for term in snowball_tokenizer(keywords):
        bm25_scores += corpus['title_snowball'].array.score(term) * 10
        bm25_scores += corpus['description_snowball'].array.score(term) * 1

    # Filter by category
    if category:
        print("Filtering by category:", category)
        cat_tokenized = snowball_tokenizer(category)
        category_mask = corpus['category_snowball'].array.score(cat_tokenized) > 0
        bm25_scores = bm25_scores * category_mask


    top_k_indices = np.argsort(bm25_scores)[-top_k:][::-1]
    bm25_scores = bm25_scores[top_k_indices]
    top_products = corpus.iloc[top_k_indices].copy()
    top_products.loc[:, 'score'] = bm25_scores

    results = []
    for _, row in top_products.iterrows():
        results.append(ToolSearchResult(
            id=row['doc_id'],
            title=row['title'],
            description=row['description'],
            category=row['category'],
            score=row['score']
        ))
    search_results = ToolSearchResults(search_results=results, error=None)
    if agent_state:
        query_log.append((keywords, category, False))
        agent_state['search_tracker'][category_lookup] = search_tracker
    return search_results



agent_state = {}
search_wayfair("geometric style couch", top_k=5, agent_state=agent_state)
search_wayfair("geometric style couch", top_k=5, agent_state=agent_state)

search geometric style couch
search geometric style couch

                You searched 'geometric style couch', but you already searched for very similar 'geometric style couch' in the past in category no_category

                Try exploring a different query to explore more of the space
                (different / no categories, or different keywords in this category)
            


ToolSearchResults(search_results=[], error="\n                You searched 'geometric style couch', but you already searched for very similar 'geometric style couch' in the past in category no_category\n\n                Try exploring a different query to explore more of the space\n                (different / no categories, or different keywords in this category)\n            ")

In [None]:
agent_state['log']

[('geometric style couch', None, False),
 ('geometric style couch', None, True),
 ('geometric style couch', 'Furniture', False),
 ('geometric style couch', 'Furniture', True)]

## Describe the search tool to the LLM

There is a specific schema for telling OpenAI about our tools / functions. However, the cheat-at-search library adds some conveniences:

* We use the function name as the name to OpenAI
* We use the doc string to get a description
* The typing information gets encoded in parameters and return value

So IMPORTANTLY -- all these things are part of the prompt

### Annoying serialization / deserialization

When we get it in an OpenAI-friendly format, we also keep around some book-keeping for annoying serialization / deserialization of the arguments

With this we get some plumbing information in a 3-tuple:

* The arguments model (one pydantic class holding the tool's arguments)
* The tool as OpenAI sees it
* The function to call to delegate to this tool

Don't get too lost in the weeds here. In future notebooks, cheat-at-search helper code will just do this for you behind the scenes.
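
To illustrate why the name, docstring, and type hints all end up in the prompt, here is a simplified sketch using only the standard library. It is not how `make_tool_adapter` is implemented, and the output is not the real OpenAI tool schema. `describe_tool` and `toy_search` are hypothetical names.

```python
import inspect
from typing import get_type_hints


def describe_tool(fn) -> dict:
    """Build a simplified tool description from a function's name, docstring, and type hints."""
    hints = get_type_hints(fn)
    hints.pop("return", None)  # only parameters are described here
    signature = inspect.signature(fn)
    # Parameters without a default must be supplied by the caller
    required = [name for name, param in signature.parameters.items()
                if param.default is inspect.Parameter.empty]
    return {
        "type": "function",
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": {name: str(hint) for name, hint in hints.items()},
        "required": required,
    }


def toy_search(keywords: str, top_k: int = 5) -> list:
    """Search a toy catalog and return the top_k titles."""
    return []


tool = describe_tool(toy_search)
print(tool)
```

Renaming the function or rewording its docstring changes what the model reads, so treat both as prompt text.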

In [None]:
from cheat_at_search.agent.pydantize import make_tool_adapter

search_tool = make_tool_adapter(search_wayfair)

tool_info = {search_wayfair.__name__: search_tool}
tool_info

{'search_wayfair': (cheat_at_search.agent.pydantize.Search_wayfairArgs,
  {'type': 'function',
   'name': 'search_wayfair',
   'description': 'Search the wayfair home goods + furniture catalog, get top_k results\n\n    This is direct keyword search along with optional category filtering.\n\n    Args:\n        keywords: The search query string.\n        category: category to filter products by, unfiltered when not present\n        top_k: The number of top results to return.\n\n    Returns:\n        Search results or error message',
   'parameters': {'properties': {'keywords': {'title': 'Keywords',
      'type': 'string'},
     'category': {'anyOf': [{'enum': ['Furniture',
         'Kitchen & Tabletop',
         'Browse By Brand',
         'Home Improvement',
         'Décor & Pillows',
         'Outdoor',
         'Storage & Organization',
         'Bed & Bath',
         'Baby & Kids',
         'Pet',
         'Lighting',
         'Rugs',
         'School Furniture and Supplies',
      

In [None]:
import json
print(json.dumps(tool_info['search_wayfair'][1], indent=2))

{
  "type": "function",
  "name": "search_wayfair",
  "description": "Search the wayfair home goods + furniture catalog, get top_k results\n\n    This is direct keyword search along with optional category filtering.\n\n    Args:\n        keywords: The search query string.\n        category: category to filter products by, unfiltered when not present\n        top_k: The number of top results to return.\n\n    Returns:\n        Search results or error message",
  "parameters": {
    "properties": {
      "keywords": {
        "title": "Keywords",
        "type": "string"
      },
      "category": {
        "anyOf": [
          {
            "enum": [
              "Furniture",
              "Kitchen & Tabletop",
              "Browse By Brand",
              "Home Improvement",
              "D\u00e9cor & Pillows",
              "Outdoor",
              "Storage & Organization",
              "Bed & Bath",
              "Baby & Kids",
              "Pet",
              "Lighting",
     

In [None]:
def call_tool(tool_info, agent_state, item) -> dict:

    # Lookup how the agent wants to call the tool
    tool_name = item.name
    tool = tool_info[tool_name]
    ToolArgsModel = tool[0]
    tool_fn = tool[2]
    fn_args: ToolArgsModel = ToolArgsModel.model_validate_json(item.arguments)

    print(f"Calling {tool_name} with args {fn_args}")
    # The tool call function itself (ie search)
    # wrapped in something helping with serialization
    py_resp, json_resp = tool_fn(fn_args, agent_state=agent_state)
    print("output", py_resp)

    # 4. Provide function call results to the model
    return {
        "type": "function_call_output",
        "call_id": item.call_id,
        "output": json_resp,
    }


## Grade helpers

Just some utilities that map between grades -> emojis

(A bit of emotional valence helps LLMs in my experience)

We also label the structured `SearchResults` we get back from the agent, returning the title, doc_id, and an emoji grade for each result.

In [None]:
from cheat_at_search.wands_data import labeled_query_products


def _grade_to_emoji(grade):
    if grade == 0:
        return '☹️'
    elif grade == 1:
        return '😑'
    elif grade == 2:
        return '😃'
    return '☹️'


def grades(query, search_results):
    """Get array of grades for each search result."""
    query_judgments = labeled_query_products[labeled_query_products['query'] == query]
    r_value = []
    for doc_id in search_results.ranked_results:
        title = corpus[corpus['doc_id'] == doc_id]['title'].iloc[0]
        doc_judgments = query_judgments[query_judgments['doc_id'] == doc_id]
        if len(doc_judgments) == 0:
            r_value.append((title, doc_id, _grade_to_emoji(None)))
        else:
            r_value.append((title, doc_id, _grade_to_emoji(int(doc_judgments['grade'].values[0]))))
    return r_value


## The core agentic loop

Here we have the main loop we use. This is the basic tool calling loop we know and love. It calls OpenAI and accumulates tools to call.
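
Before the real loop, here is a stripped-down sketch of the same control flow with a scripted stand-in for the model, so it runs without an API key. `fake_model`, `toy_search`, and `run_loop` are hypothetical names, and the dict shapes only loosely mirror the items the real API returns.

```python
def fake_model(inputs: list) -> list:
    """Stand-in for the LLM: asks for one search, then answers once it sees tool output."""
    if any(item.get("type") == "function_call_output" for item in inputs):
        return [{"type": "message", "content": "Here are your results"}]
    return [{"type": "function_call", "call_id": "call_1",
             "name": "toy_search", "arguments": {"keywords": "woven pouf"}}]


def toy_search(keywords: str) -> str:
    return f"3 products matching '{keywords}'"


def run_loop(inputs: list, tools: dict) -> list:
    tool_calls = True
    while tool_calls:
        output = fake_model(inputs)
        inputs = inputs + output  # the model's output becomes part of the next input
        tool_calls = False        # keep looping only if this turn requested a tool
        for item in output:
            if item["type"] == "function_call":
                tool_calls = True
                result = tools[item["name"]](**item["arguments"])
                inputs.append({"type": "function_call_output",
                               "call_id": item["call_id"],
                               "output": result})
    return inputs


transcript = run_loop([{"role": "user", "content": "small woven pouf"}],
                      {"toy_search": toy_search})
for item in transcript:
    print(item)
```

The loop ends on the first model turn that requests no tools, which is when the final answer is available.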


In [None]:
import textwrap
from time import sleep
from cheat_at_search.wands_data import labeled_query_products


def agent_run(tool_info,
              text_format,
              inputs,
              agent_state,
              model='gpt-5-mini',
              summary=True) -> tuple:

    assert inputs
    tool_calls = True
    resp = None
    while tool_calls:
        failing = True
        num_failures = 0
        while failing:
            try:
                # print(inputs)
                resp = openai.responses.parse(
                    model=model,
                    input=inputs,
                    tools=[tool[1] for tool in tool_info.values()],
                    reasoning={
                        "effort": "medium",
                        "summary": "auto" if summary else "none"
                    },
                    text_format=text_format
                )
                failing = False
            except Exception:
                failing = True
                num_failures += 1
                if num_failures > 3:
                    raise
                sleep(1)
        inputs += resp.output
        if summary:
            usage = resp.usage
            print("--")
            print(f"InpTok: {usage.input_tokens}")
            print(f"OutTok: {usage.output_tokens}")
            for item in resp.output:
                if item.type == "reasoning":
                    print("Reasoning:")
                    for summary_item in item.summary:
                        print(textwrap.fill(summary_item.text, 80), "\n")
                    item.summary = []

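        # Reset once per response. Resetting per item would forget a tool call
        # whenever a later output item is not a function call.
        tool_calls = False
        for item in resp.output:
            if item.type == "function_call":
                tool_calls = True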
                # *** Get the tool, and package
                # up the call to the tool (our python function)
                tool_response = call_tool(tool_info,
                                          agent_state=agent_state,
                                          item=item)

                # 4. Provide function call results to the model
                inputs.append(tool_response)
    return resp, inputs


--
InpTok: 843
OutTok: 137
Reasoning:
**Searching for poufs**  The user is looking for a "small woven pouf," so I need
to use the search function to find at least 10 results. The tool typically
returns the top 5 results by default, so I’ll adjust the call to get 10 results
instead. I’m considering which category to use, either "Décor & Pillows" or
"Furniture." A pouf seems to fit better under "Décor & Pillows," so I'll go with
that. Time to make the search call! 

Calling search_wayfair with args keywords='small woven pouf' category='Décor & Pillows' top_k=10
search small woven pouf
Filtering by category: Décor & Pillows
output {'search_results': [{'id': 31076, 'title': 'small picture frame', 'description': 'this beautiful photo frame makes the perfect finishing touch to your favorite photos and prints . with a crisp white finish and durable polystyrene construction , you can proudly display your favorite logophilia print in chic style .', 'category': 'Décor & Pillows', 'score': 36.659

SearchResults(results_summary='I re-ranked results so the items you marked 😃 (kiki woven pouf, Mayes water-hyacinth pouf, Colt woven pouf, Dickson braided jute pouf) are at the top. After those I placed the closest woven/natural-fiber and small pouf matches (seagrass, handwoven cotton 18" poufs, small outdoor pouf, jute/cotton blends) so you have 10 strong candidates. If you want, I can replace any of the 😑 items by searching for a specific material (jute, seagrass, water hyacinth, cotton knit), shape (round vs cube), size, or color to get more 😃 matches.', ranked_results=[38620, 2070, 38616, 9800, 19532, 23332, 23333, 38423, 36130, 6058])

## Harness - a baby harness letting us drive the search

Here's a harness that lets us drive the agent to do the search work we want.

Here it has some basic error checking (sending the inputs back in with a user message on error).

BUT **importantly**, it also automates feedback into the agent during the search process, which helps the agent adjust and steer towards relevant results.
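
As a rough sketch of what such automated feedback could look like, the snippet below turns graded results into a user message. The function name, message wording, and sample data are illustrative assumptions, not the harness's actual implementation.

```python
GRADE_EMOJI = {0: "☹️", 1: "😑", 2: "😃"}


def feedback_message(graded: list) -> dict:
    """Turn (title, doc_id, grade) tuples into a user message for the agent."""
    lines = []
    for title, doc_id, grade in graded:
        emoji = GRADE_EMOJI.get(grade, "☹️")  # unlabeled results count as ☹️
        lines.append(f"{emoji} {title} (doc_id={doc_id})")
    content = "Here is my feedback on your results:\n" + "\n".join(lines)
    return {"role": "user", "content": content}


# Hypothetical graded results; a grade of None means no judgment exists
message = feedback_message([
    ("woven pouf", 1, 2),
    ("picture frame", 2, None),
])
print(message["content"])
```

Appending this message to `inputs` and running the agent again gives it something concrete to steer towards.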

In [None]:
system_prompt = """
    You take user search queries and use a search tool to find furniture / home goods products.

    Look at the search tools you have, their limitations, how they work, etc when forming your plan.

    Finally return results to the user per the SearchResults schema, ranked best to worst.

    Gather results until you have 10 best matches you can find. It's important to return at least 10.

    You'll get feedback from the user on whether the results satisfy them:

    ☹️ - ACTIVELY FRUSTRATES USER, do not include unless absolutely necessary
    😑 - Meh results. OK in a pinch. But there could be better.
    😃 - Solves the user's problem. Good job! Rank this highest

    If you get this feedback, try again to:

    * Find results that can replace the ☹️ / 😑 with 😃
    * Hypothesize new queries that might find 😃
    * Use the content in known 😃 results to create new queries to the search tool

    Consider possibly

    * Not searching categories if no relevant results found

    It's very important you consider carefully the correct ranking as you'll be evaluated on
    how close that is to the average furniture shopper's ideal ranking.

"""
# system_prompt = build_few_shot_prompt(system_prompt)

from pydantic import BaseModel, Field  # needed by SearchResults below


class SearchResults(BaseModel):
    """The ranked, top 10 search results ordered most relevant to least."""
    results_summary: str = Field(description="The message from you summarizing what you found")

    ranked_results: list[int] = Field(description="Top ranked search results (their doc_ids)")

def _error_msg(error):
    print(error)
    return {"role": "user", "content": f"Oh this isn't good, it turns out: {error}. Please try again"}

def search(query, num_attempts=5):
    """A little search harness."""
    inputs = [{"role": "system", "content": system_prompt},
            {"role": "user", "content": query}]
    error = True
    resp = None
    attempt = 0
    agent_state = {}
    while error or attempt < num_attempts:
        resp, inputs = agent_run(tool_info,
                                 text_format=SearchResults,
                                 inputs=inputs,
                                 agent_state=agent_state)
        # Validate what came back, ensure it fits
        ranked_results = resp.output_parsed.ranked_results
        num_results = len(ranked_results)
        if num_results != 10:
            inputs.append(_error_msg(f"Expected 10 ranked_results, got {num_results}"))
            error = True
            continue
        # Look up each doc id in the response. Collect the misses first so the
        # `continue` below restarts the outer while loop, not an inner for loop.
        missing = [item for item in ranked_results
                   if item not in corpus['doc_id'].values]
        if missing:
            inputs.append(_error_msg(f"Doc ids {missing} are not in corpus"))
            error = True
            continue
        graded_results = grades(query, resp.output_parsed)
        graded_summary = "Here are your results reflecting whether they satisfy the user:\n"
        for title, doc_id, grade in graded_results:
            graded_summary += f"\n{title} (doc_id:{doc_id}) {grade}"
        graded_summary += "\n\n Given this, try some additional searches to find more candidates and rerank using this information."
        inputs.append({'role': 'user', 'content': graded_summary})
        print(f"Attempt {attempt} -- Results evaluated::")
        print(graded_summary)
        # Printout log
        print("Log:")
        for item in agent_state['log']:
            print(f"keywords: {item[0]}")
            print(f"category: {item[1]}")
            print(f"skipped? {item[2]}")
        attempt += 1
        error = False

    return resp.output_parsed


resp = search("small woven pouf")
resp
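
The harness retries on validation errors without a limit, so a persistently malformed response would keep it running. Below is a minimal sketch of one way to bound both kinds of iteration. `run_with_budgets`, `step`, and `max_errors` are hypothetical names; `step` stands in for one `agent_run` call plus validation and is stubbed out here.

```python
def run_with_budgets(step, num_attempts=5, max_errors=3):
    """Run `step` until num_attempts valid rounds or max_errors failures.

    `step(attempt)` is a hypothetical callable returning (result, error),
    where error is None when the result passed validation.
    """
    attempt = 0
    errors = 0          # cumulative failures, not consecutive ones
    last_valid = None
    while attempt < num_attempts and errors < max_errors:
        result, error = step(attempt)
        if error is not None:
            errors += 1      # invalid output: spend error budget and retry
            continue
        last_valid = result  # valid output: keep it and spend a feedback round
        attempt += 1
    return last_valid, attempt, errors


# Stub step: fails validation on its first two calls, then succeeds.
calls = {"n": 0}


def flaky_step(attempt):
    calls["n"] += 1
    if calls["n"] <= 2:
        return None, "Expected 10 ranked_results, got 7"
    return f"ranking-{attempt}", None


result, attempts, errors = run_with_budgets(flaky_step, num_attempts=3, max_errors=5)
print(result, attempts, errors)
```

Returning the last valid result means the caller still gets a usable ranking if the error budget runs out partway through.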

## Compare to BM25

In [None]:
from cheat_at_search.strategy import SearchStrategy
from cheat_at_search.search import run_strategy
from cheat_at_search.wands_data import judgments

class AgenticSearchStrategy(SearchStrategy):
    def __init__(self,
                 corpus,
                 workers=1):
        super().__init__(corpus, workers=workers)

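    def search(self, query, k):
        # agentic_query is only printed for logging; search() receives the raw
        # query, which it also passes to grades(). k is unused because the
        # harness only accepts responses with exactly 10 results.
        agentic_query = "Find me: " + query
        print("_________---___________")
        print(agentic_query)
        print("_________---___________")
        resp = search(query)
        return (resp.ranked_results,
                [1.0] * len(resp.ranked_results))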

import numpy as np

# Sample 8 random queries from judgments
seed = 1234
np.random.seed(seed)
random_queries = np.random.choice(judgments['query'].unique(), 8)
selected_judgments = judgments[judgments['query'].isin(random_queries)]
selected_judgments

strategy = AgenticSearchStrategy(corpus, workers=1)
graded_agentic = run_strategy(strategy, selected_judgments)

_________---___________
Find me: carolyn console table
_________---___________


Searching:   0%|          | 0/8 [00:00<?, ?it/s]

--
InpTok: 844
OutTok: 115
Reasoning:
**Searching for furniture**  I need to use the functions.search_wayfair tool to
search the Wayfair catalog for results. The goal is to get the top 10 items.
Although the tool defaults to 5, it seems I can pass any number up to that. I’ll
use the keywords "Carolyn console table," and categorize it under Furniture. So,
my plan is to call search_wayfair with those specific keywords and the category,
asking for top_k 10. 

Calling search_wayfair with args keywords='Carolyn console table' category='Furniture' top_k=10
search Carolyn console table
Filtering by category: Furniture
output {'search_results': [{'id': 24572, 'title': "carolyn 48 '' solid wood console table", 'description': "offering up space for staging , serving , and stowing , three-tier tables are ideal for any living room or entryway ensemble . take this console table for example : pairing a dark brown-finished metal frame with a trio of pinewood levels , it 's sure to fit right into indu

Searching:  12%|█▎        | 1/8 [02:10<15:11, 130.18s/it]

--
InpTok: 32753
OutTok: 780
Reasoning:
**Deciding on next steps**  I think the best thing to do next is to ask the user
if they want me to broaden my search to other retailers like Amazon, Overstock,
or Etsy. Alternatively, I could ask for more details or an image related to
their request, including specifics like finish or size. I’ll propose these
options concisely and confirm if I should search other retailers. It’s important
to structure the final output in the required SearchResults format. However, I
need to acknowledge that there were no new matches found. 

**Crafting the results summary**  I need to create a results summary that
clearly explains I couldn't find more Carolyn items on Wayfair. In this summary,
I should propose the next steps and ask the user how to proceed: either by
expanding to other retailers or providing an image or using additional filters.
I’ll also include the ranked_results with the best 10 IDs we have, even if they
were previously marked. Finally, I'll 

Searching:  25%|██▌       | 2/8 [04:17<12:50, 128.41s/it]

--
InpTok: 23085
OutTok: 664
Reasoning:
**Reranking candidates for selection**  I've got new candidates, mostly 24-25"
tufted wingbacks. First, I'll prioritize small tufted velvet items and those
that the user may like. The top item is the 24" velvet (14165), which stays on
top. Other possibilities include the 24.25" Hollange (25553) — the user reacted
with a 😑 earlier, so it might be risky to keep it. New contenders like the 25"
Rizzuto velvet (24076) and others need consideration too. I'll focus on items
25.5" and smaller, especially those yielding a 😃 response. 

**Ranking best matches**  I need to sort through the best matches based on the
user's previous reactions. Among the likely contenders are 14165, and some 24 to
25" tufted options. I have several items with past reactions of 😑, such as
Liptak (31698) and Bilski (31692). I’ll rank them while keeping 14165 at the
top, followed by the unreviewed Rizzuto (24076), Hollange (25553), and others. I
must provide at least 10 options a

Searching:  38%|███▊      | 3/8 [07:10<12:23, 148.74s/it]

--
InpTok: 29852
OutTok: 597
Reasoning:
**Evaluating chair options**  I'm noticing that my earlier search results
included some chairs like the Agustin 30654 and Marcy 11630. The user seems
satisfied with five candidates already, so I need to add five more to make a
total of ten. The previous list that the user marked positively includes some
IDs too. It looks like I'll need to rank options that closely meet their
preferences for CH24/wishbone-like chairs and paper-cord seats. I’ll ensure the
IDs I mentioned exist in my earlier findings. 

**Finalizing ranked results**  I’m working on the final ranked results. I want
to keep the top five approved options: 37312, 41475, 4196, 17645, and 5574. Then
I'll add 30654 (Agustin), 11630 (Marcy), 23733 (Aalaysha), 16680 (Waterbury),
and 27058 (Solid Wood Stacking) to complete a total of ten. The results summary
will explain the reranking based on user feedback and mention that I can fetch
further details. Finally, I’ll ensure it’s formatted in J

Searching:  50%|█████     | 4/8 [10:41<11:34, 173.52s/it]

--
InpTok: 27343
OutTok: 649
Reasoning:
**Rethinking ranked list**  I have new results, and I need to create a ranked
list of the 10 best matches for the user. They’ve marked several options with
negative feedback. I should focus on the ones they found positive. It looks like
the solid wood platform bed and minimalist options could be better choices for
them. The user's "dull bed with shirt headboard" request might imply they want
something plainer. I’ll prioritize simple, unadorned items. 

**Finalizing ranked list**  I need to avoid the items marked with ☹️ by the
user, like IDs 37195, 35161, and 15674, among others. I’ll also skip slatted
headboards since they were marked with 😑. It seems better to focus on solid,
plain wood platform beds. Based on the search, I've got the following IDs to
rank: 35166, 40381, 15672, and others. I’ll rank them from best to worst
according to the user's preference for "dull/plain." Now I’ll get everything
into the SearchResults format for a response. 

Searching:  62%|██████▎   | 5/8 [13:56<09:03, 181.25s/it]

--
InpTok: 38270
OutTok: 671
Reasoning:
**Exploring hardware options**  The tool returned mostly plumbing results, and
it feels like the Wayfair catalog has limited exact matches. I should be
transparent with the user about this. I can suggest two paths: either I find
light wood nightstands and list silver or chrome hardware with examples of how
to swap them, or widen the search to other retailers and styles. I need to get
the user's preference, acknowledging the limited matches while offering
alternative solutions. I’ll provide an updated ranked list of the best available
matches and propose various hardware options. 

**Assembling ranked list of items**  I need to choose the top 10 items likely
based on what the user rated. I’ll create a final ranked list of items that are
the closest matches. For example, jerry (22910) has silver hardware, aadi
(40106) features a silver knob, and pierce (32186) includes brushed steel knobs.
Some items like roxane (15999) and nocci (27419) have brush

Searching:  75%|███████▌  | 6/8 [16:52<05:58, 179.25s/it]

--
InpTok: 28005
OutTok: 830
Reasoning:
**Finding suitable poufs**  I’m thinking we should conduct broader searches like
"seagrass pouf small," "water hyacinth pouf," and "handwoven round small pouf."
We've already tried some, but with tool limitations and time, I want to produce
a reranked list with five confirmed options and add five similar candidates
based on previous data.   It might be good to ask the user what trait makes them
smile about the results, like material or size, to help pinpoint their
preferences before diving into more searches. 

**Clarifying user preferences**  I’m planning to run more targeted searches and
ask the user which attributes are most important, like size (less than or equal
to 16 inches), material (jute, cotton, seagrass, water hyacinth), color, and
price. I’ll confirm their preferences and provide the current best five options.
Since they requested more searches, it seems reasonable to ask for
clarification. I'll check if they prefer natural fibers on

Searching:  88%|████████▊ | 7/8 [19:41<02:56, 176.07s/it]

--
InpTok: 30313
OutTok: 681
Reasoning:
**Searching for fairy lights**  I’m noticing the search tool is returning many
bulb items rather than the desired 60-bulb fairy options. It seems like the
Wayfair catalog is limited in choices. The user is expecting more candidates.
I’ve tried several queries, but they keep yielding the same results. I should
rerank the list of fairy lights, prioritizing short battery/USB strings, then
small novelty types, longer strings, and finally solar options. I need to return
at least 10 of the best matches. 

**Reranking based on satisfaction**  The developer message requires us to
include at least 10 items. I need to rerank the results based on user
satisfaction, starting with the top-rated item (29105), followed by those rated
as 😑 (like 15252, 4287, 21952, and 2193). The items rated ☹️ (25436, 19139,
28566, and 28521) will be ranked lower. I’ll summarize the logic for the rerank
in the results_summary message and provide the top 10 doc_ids. I’ll keep
ev

Searching: 100%|██████████| 8/8 [21:52<00:00, 164.09s/it]

--
InpTok: 34733
OutTok: 638
Reasoning:
**Rethinking search strategy**  It looks like the search tool keeps returning
similar results, and the user seems frustrated. I think I need to clarify what
they're really looking for. They asked for more searches, so I'll explore
variations like "Cliffside chandelier Wayfair." But maybe the product isn’t
there. I can suggest checking other retailers like Amazon or Houzz. I need to
explain that I can’t find the exact item on Wayfair and ask what they'd prefer
next. 

**Refining search results**  I need to provide the user with an update on my
search for the Cliffside chandelier and suggest next steps. They weren’t
satisfied with previous options, so I’m thinking of offering 10 alternatives
that might be similar, like items with tassels or antiqued silver finishes. I’ll
compile a ranked results list, mentioning the inability to find exactly what
they wanted and asking if they’d like me to look at other retailers or provide
links. I'll keep it conc




In [None]:
from cheat_at_search.search import ndcgs, graded_bm25, vs_ideal
# 5,5,4,7
ndcgs(graded_agentic).sort_index(), ndcgs(graded_agentic).mean()

(query
 carolyn console table                             0.333333
 cliffside 5 light candle style drum chandelier    0.333333
 dull bed with shirt head board                    0.000000
 led 60                                            0.506435
 light wood nightstand with silver accents         0.162625
 queen wingback chair                              0.560945
 small woven pouf                                  0.853046
 wishbone chair                                    0.909272
 Name: ndcg, dtype: float64,
 np.float64(0.4573736621054057))
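
For readers unfamiliar with the metric, here is a minimal sketch of NDCG under one common formulation (gain `2**grade - 1`, log2 rank discount). This is not necessarily the formula `ndcgs` uses; the function names, the cutoff `k`, and the 0-2 grade scale are assumptions for illustration.

```python
import math


def dcg(grades):
    """Discounted cumulative gain for grades listed in ranked order."""
    return sum((2 ** g - 1) / math.log2(rank + 2)
               for rank, g in enumerate(grades))


def ndcg(ranked_grades, all_grades, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ranking.

    ranked_grades: grades of the returned docs, in returned order.
    all_grades: grades of every judged doc for the query.
    """
    ideal_dcg = dcg(sorted(all_grades, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant docs exist, so nothing can be gained
    return dcg(ranked_grades[:k]) / ideal_dcg


# A grade-2 doc first, then an irrelevant doc, then a grade-1 doc.
score = ndcg([2, 0, 1], all_grades=[2, 1, 0])
print(round(score, 4))
```

A score of 1.0 means the ranking matches the ideal ordering of the judged documents; placing a relevant document lower in the list reduces its contribution.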

In [None]:
bm25_ndcgs = ndcgs(graded_bm25[graded_bm25['query'].isin(random_queries)])
bm25_ndcgs.sort_index(), bm25_ndcgs.mean()

(query
 carolyn console table                             0.333333
 cliffside 5 light candle style drum chandelier    0.333333
 dull bed with shirt head board                    0.000000
 led 60                                            0.333333
 light wood nightstand with silver accents         0.288624
 queen wingback chair                              0.307727
 small woven pouf                                  0.468003
 wishbone chair                                    0.674750
 Name: ndcg, dtype: float64,
 np.float64(0.3423881136250733))
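
To read the two outputs side by side, here is a small sketch that computes the per-query NDCG difference. The dictionaries transcribe the rounded values printed above; the variable names are just for illustration.

```python
# NDCG per query, copied from the two outputs above (rounded as printed).
agentic = {
    "carolyn console table": 0.333333,
    "cliffside 5 light candle style drum chandelier": 0.333333,
    "dull bed with shirt head board": 0.000000,
    "led 60": 0.506435,
    "light wood nightstand with silver accents": 0.162625,
    "queen wingback chair": 0.560945,
    "small woven pouf": 0.853046,
    "wishbone chair": 0.909272,
}
bm25 = {
    "carolyn console table": 0.333333,
    "cliffside 5 light candle style drum chandelier": 0.333333,
    "dull bed with shirt head board": 0.000000,
    "led 60": 0.333333,
    "light wood nightstand with silver accents": 0.288624,
    "queen wingback chair": 0.307727,
    "small woven pouf": 0.468003,
    "wishbone chair": 0.674750,
}

# Positive delta means the agentic strategy scored higher than BM25.
deltas = {q: agentic[q] - bm25[q] for q in agentic}
wins = sum(d > 1e-9 for d in deltas.values())
losses = sum(d < -1e-9 for d in deltas.values())
ties = len(deltas) - wins - losses
mean_delta = sum(deltas.values()) / len(deltas)

for q, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{q:50s} {d:+.3f}")
print(f"wins={wins} losses={losses} ties={ties} mean_delta={mean_delta:+.3f}")
```

On these 8 queries the agentic run is ahead on 4, behind on 1 (the nightstand query), and tied on 3. With a sample this small, the mean difference is only a rough signal.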