# Week 5: Systematically Improving Your RAG Application

## Why Use LLM-Generated Metadata

LLM-generated metadata adds structured fields to our data, which lets us apply more complex filters before retrieval.

This means we can potentially exclude irrelevant items by investing in two things: query understanding, which turns a user query into filters, and pre-processing, which attaches the matching fields to each item.
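As a minimal sketch of the idea, the snippet below filters a toy catalogue on structured fields before any ranking would take place. The rows and the `matches` helper are made up for illustration; only the field names mirror the metadata extracted later in this notebook.

```python
# Toy catalogue: the rows are hypothetical, the field names mirror this notebook's metadata.
items = [
    {"id": 0, "material": ["cotton"], "occasion": ["casual"]},
    {"id": 1, "material": ["silk"], "occasion": ["formal", "party"]},
    {"id": 2, "material": ["cotton", "synthetic"], "occasion": ["workwear"]},
]


def matches(item: dict, filters: dict[str, list[str]]) -> bool:
    """Keep an item if, for every non-empty filter, it shares at least one value."""
    return all(
        not wanted or bool(set(wanted) & set(item[field]))
        for field, wanted in filters.items()
    )


# An empty list means "no constraint" for that field.
filters = {"material": ["cotton"], "occasion": []}
candidates = [item["id"] for item in items if matches(item, filters)]
print(candidates)  # [0, 2]
```

Only the surviving candidates would then be passed on to retrieval or ranking.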

In this notebook, we'll be using the `ivanleomk/ecommerce-items` dataset that we generated in the previous notebook to add more structured fields to the items.

In [9]:
from datasets import load_dataset

ds = load_dataset("ivanleomk/ecommerce-items")

Generating train split: 100%|██████████| 418/418 [00:00<00:00, 18752.01 examples/s]


In [10]:
ds["train"][0]


{'title': 'Teal Lace Top',
 'category': 'Tops',
 'subcategory': 'Blouses',
 'brand': 'H&M',
 'description': 'This elegant teal blouse features a delicate lace design on the upper portion, offering a chic and stylish look for work or special events. Perfect for pairing with high-waisted jeans for a sophisticated ensemble.',
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=768x1024>,
 'id': 0}

We'll generate synthetic queries by mimicking a few different user intents:

- They want to find an item for a specific occasion
- They're looking for an item to go with something else in their wardrobe
- They want to find an item that's a fit for a specific style that they're exploring

We'll also randomly add some extra constraints to make these queries more interesting, such as "must be <color of item>", "must be <brand>", or "made of <material>".
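The sampling step on its own can be sketched without any LLM call. The lists below are abbreviated, hypothetical stand-ins for the intents and constraints described above, and `sample_query_spec` is an illustrative helper rather than part of the notebook.

```python
import random

# Abbreviated, hypothetical stand-ins for the intents and constraints described above.
intents = ["occasion", "outfit_matching", "style"]
constraints = ["reference the material", "reference the brand", "reference the color"]


def sample_query_spec(rng: random.Random) -> dict:
    """Pick one intent and one extra constraint for a synthetic query."""
    return {"intent": rng.choice(intents), "constraint": rng.choice(constraints)}


rng = random.Random(0)  # seeded so the sample is reproducible
specs = [sample_query_spec(rng) for _ in range(5)]
for spec in specs:
    print(spec)
```

Each sampled pair would then be rendered into the prompt for one item, which is what the next cell does with its own lists.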

In [33]:
from pydantic import BaseModel
import instructor
from openai import AsyncOpenAI
from asyncio import Semaphore, timeout
from tenacity import retry, stop_after_attempt, wait_fixed
from tqdm.asyncio import tqdm_asyncio as asyncio
import random
import tempfile

# Configure instructor with OpenAI client
client = instructor.from_openai(AsyncOpenAI())


class SyntheticQuery(BaseModel):
    chain_of_thought: str
    query: str


# Example prompts for different intents
INTENT_PROMPTS = {
    "occasion": "Generate a query from someone looking for clothing for a specific occasion like a wedding, party, job interview etc",
    "outfit_matching": "Generate a query from someone trying to find an item that matches with something else in their wardrobe",
    "style": "Generate a query from someone exploring a particular style or aesthetic like minimalist, streetwear, bohemian etc",
}


@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
async def generate_synthetic_query(
    intent_type: str, item: dict, sem: Semaphore
) -> dict:
    intent_prompt = INTENT_PROMPTS[intent_type]
    image = item["image"]
    query_condition = [
        "the query must reference the item material ",
        "the query must reference the item brand indirectly",
        "the query must reference the item color where possible ( or give a few options where the color is within)",
        "the query must reference the item category/subcategory",
    ]

    user_message_type = [
        "Short and concise (at most 1 sentence or 10 words)",
        "longer and a bit more detailed (at most 2 sentences or 20 words)",
        "short with some spelling mistakes (at most 1 sentence or 10 words)",
    ]

    with tempfile.NamedTemporaryFile(suffix=".jpg") as temp_file:
        image.save(temp_file.name)

        async with sem, timeout(30):
            response = await client.chat.completions.create(
                model="gpt-4o",
                response_model=SyntheticQuery,
                messages=[
                    {
                        "role": "system",
                        "content": """Generate a hypothetical user query for this item that a user with {{ intent }} would ask which the following item is highly relevant for. Make sure that {{ query_condition }} and to reference specific attributes of the item where possible. The message should be {{ user_message_type }}.

                        Here are examples of good queries:
                        - 'I'm looking for a t-shirt that goes well with a striped green skirt I have. Ideally it'd be the same color if possible'
                        - 'Need a cotton blouse in navy or dark blue to match my work pants'
                        - 'Looking for a formal H&M blazer similar to their slim-fit black one but in grey'""",
                    },
                    {
                        "role": "user",
                        "content": [
                            """
                            Here is some information about the item

                            Title: {{ item['title'] }}
                            Description: {{ item['description'] }}
                            Brand: {{ item['brand'] }}
                            Category: {{ item['category'] }}
                            Subcategory: {{ item['subcategory'] }}

                            
                            """,
                            instructor.Image.from_path(temp_file.name),
                        ],
                    },
                ],
                context={
                    "intent": intent_prompt,
                    "item": item,
                    "query_condition": random.choice(query_condition),
                    "user_message_type": random.choice(user_message_type),
                },
            )
        return {
            "id": item["id"],
            "query": response.query,
        }


In [34]:
from itertools import islice

intents = list(INTENT_PROMPTS.keys())
sem = Semaphore(10)
coros = [
    generate_synthetic_query(random.choice(intents), item, sem)
    for item in islice(ds["train"], 30)
]

queries = await asyncio.gather(*coros)


100%|██████████| 30/30 [00:14<00:00,  2.12it/s]


In [37]:
import json


with open("queries_v2.jsonl", "w") as f:
    for query in queries:
        f.write(json.dumps(query) + "\n")


In [36]:
for query in queries[:5]:
    print(query)


{'id': 0, 'query': 'Need teal lace top to wear for a wedding.'}
{'id': 1, 'query': 'Looking for a cotton top to match black high-waisted jeans.'}
{'id': 2, 'query': 'Need a vibrant green t-shirt for my casual jeans.'}
{'id': 3, 'query': "I'm searching for a blouse or top that complements a green and white striped pleated Zara skirt for a garden party."}
{'id': 4, 'query': 'Looking for a trendy navy crop with bold sporty vibes?'}


Now let's test our queries and see how well vector search performs.
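The evaluation reports recall@k and MRR@k through `get_metrics_at_k`, which comes from a `helpers` module that is not shown here. The sketch below uses the standard definitions of both metrics; treating them as what `helpers` computes is an assumption, and the ranking is made up.

```python
def recall_at_k(predicted: list[int], expected: list[int], k: int) -> float:
    """Fraction of expected ids that appear in the top-k predictions."""
    top_k = set(predicted[:k])
    return sum(1 for item_id in expected if item_id in top_k) / len(expected)


def mrr_at_k(predicted: list[int], expected: list[int], k: int) -> float:
    """Reciprocal rank of the first expected id within the top-k, else 0."""
    for rank, item_id in enumerate(predicted[:k], start=1):
        if item_id in expected:
            return 1 / rank
    return 0.0


predicted = [7, 3, 0, 12]  # hypothetical ranking of item ids
expected = [0]  # each synthetic query has exactly one target item

print(recall_at_k(predicted, expected, 1))  # 0.0
print(recall_at_k(predicted, expected, 3))  # 1.0
print(mrr_at_k(predicted, expected, 3))  # 0.3333333333333333
```

With a single expected item per query, recall@1 and MRR@1 coincide, which is consistent with the matching `mrr@1` and `recall@1` scores in the results.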

In [25]:
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

db = lancedb.connect("./lance")
func = get_registry().get("openai").create(name="text-embedding-3-small")


class Item(LanceModel):
    id: int
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()


if "descriptions" not in db.table_names():
    table = db.create_table("descriptions", schema=Item, mode="overwrite")

    items = [{"id": row["id"], "text": row["description"]} for row in ds["train"]]

    table.add(items)
else:
    table = db.open_table("descriptions")

[2024-12-02T13:03:14Z WARN  lance::dataset] No existing dataset at /Users/ivanleo/Documents/coding/systematically-improving-rag/cohort_2/week5/lance/descriptions.lance, it will be created


In [35]:
from braintrust import Eval, Score
from helpers import get_metrics_at_k, task


def evaluate_braintrust(input, output, **kwargs):
    metrics = get_metrics_at_k(metrics=["mrr", "recall"], sizes=[1, 3, 5, 10, 15, 25])
    return [
        Score(
            name=metric,
            score=score_fn(output, kwargs["expected"]),
            metadata={"query": input, "result": output, **kwargs["metadata"]},
        )
        for metric, score_fn in metrics.items()
    ]


await Eval(
    "filters",
    data=lambda: [
        {
            "input": question["query"],
            "expected": [question["id"]],
        }
        for question in queries
    ],  # One eval case per synthetic query
    task=lambda query: task(
        user_query=query, table=table, reranker=None, max_k=25
    ),  # Retrieval task: vector search without a reranker
    scores=[evaluate_braintrust],
)

Experiment week-5-1733144834 is running at https://www.braintrust.dev/app/567/p/filters/experiments/week-5-1733144834
filters (data): 30it [00:00, 42182.07it/s]
filters (tasks): 100%|██████████| 30/30 [00:01<00:00, 22.70it/s]



week-5-1733144834 compared to main-1732799310:
63.33% 'mrr@1'     score
67.78% 'mrr@3'     score
68.61% 'mrr@5'     score
68.61% 'mrr@10'    score
69.46% 'mrr@15'    score
70.01% 'mrr@25'    score
63.33% 'recall@1'  score
73.33% 'recall@3'  score
76.67% 'recall@5'  score
76.67% 'recall@10' score
86.67% 'recall@15' score
96.67% 'recall@25' score

0.84s duration

See results for week-5-1733144834 at https://www.braintrust.dev/app/567/p/filters/experiments/week-5-1733144834


EvalResultWithSummary(summary="...", results=[...])

In [40]:
from typing import Literal
import instructor
from openai import AsyncOpenAI
from pydantic import BaseModel


# Patch the OpenAI client with instructor
client = instructor.from_openai(AsyncOpenAI())
sem = Semaphore(10)


# Define the metadata schema
class ClothingMetadata(BaseModel):
    occasion: list[Literal["casual", "smart-casual", "formal", "party", "workwear"]]
    style: list[
        Literal[
            "bohemian",
            "classic",
            "contemporary",
            "elegant",
            "minimalist",
            "preppy",
            "romantic",
            "streetwear",
            "vintage",
        ]
    ]

    material: list[
        Literal["cotton", "denim", "leather", "linen", "silk", "synthetic", "wool"]
    ]


@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
async def extract_metadata(item: dict, sem: Semaphore) -> dict:
    """Extract metadata from the item's description and image using gpt-4o-mini, merged into the item dict"""
    async with sem, timeout(30):
        # Create a temporary file to save the image
        with tempfile.NamedTemporaryFile(suffix=".jpg", delete=True) as temp_file:
            # Save the PIL image to the temporary file
            item["image"].save(temp_file.name)
            temp_file.flush()

            metadata = await client.chat.completions.create(
                model="gpt-4o-mini",
                response_model=ClothingMetadata,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            """Analyze this clothing item and extract the following attributes:
                                1. Occasions it's suitable for (casual, smart-casual, formal, party, workwear)
                                2. Style categories (bohemian, classic, contemporary, elegant, minimalist, preppy, romantic, streetwear, vintage)
                                3. Materials used (cotton, denim, leather, linen, silk, synthetic, wool)

                                Here is some information about the item

                                Title: {{ item['title'] }}
                                Description: {{ item['description'] }}
                                Brand: {{ item['brand'] }}
                                Category: {{ item['category'] }}
                                Subcategory: {{ item['subcategory'] }}
                            """,
                            instructor.Image.from_path(temp_file.name),
                        ],
                    }
                ],
                context={
                    "item": item,
                },
            )

            return {
                **item,
                **metadata.model_dump(),
            }


await extract_metadata(ds["train"][0], sem)

{'title': 'Teal Lace Top',
 'category': 'Tops',
 'subcategory': 'Blouses',
 'brand': 'H&M',
 'description': 'This elegant teal blouse features a delicate lace design on the upper portion, offering a chic and stylish look for work or special events. Perfect for pairing with high-waisted jeans for a sophisticated ensemble.',
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=768x1024>,
 'id': 0,
 'occasion': ['workwear', 'party'],
 'style': ['elegant', 'contemporary'],
 'material': ['cotton', 'synthetic']}

In [41]:
coros = [extract_metadata(item, sem) for item in ds["train"]]

metadata = await asyncio.gather(*coros)


100%|██████████| 418/418 [02:28<00:00,  2.81it/s]


In [9]:
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

db = lancedb.connect("./lance")
func = get_registry().get("openai").create(name="text-embedding-3-small")
table_name = "metadata"


class Item(LanceModel):
    id: int
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()
    occasions: str
    style: str
    material: str


if table_name not in db.table_names():
    table = db.create_table(table_name, schema=Item, mode="overwrite")

    # Rows are added in the next cell, once the metadata lists have been joined into strings
else:
    table = db.open_table(table_name)

In [45]:
items = []
for item in metadata:
    items.append(
        {
            "id": item["id"],
            "text": item["description"],
            "occasions": ", ".join(item["occasion"]),
            "style": ", ".join(item["style"]),
            "material": ", ".join(item["material"]),
        }
    )

table.add(items)

In [1]:
import json

# Reload the synthetic queries written to queries_v2.jsonl earlier in this notebook
with open("queries_v2.jsonl") as f:
    queries = [json.loads(line) for line in f]
queries



[{'id': 0, 'query': 'Need teal lace top to wear for a wedding.'},
 {'id': 1,
  'query': 'Looking for a cotton top to match black high-waisted jeans.'},
 {'id': 2, 'query': 'Need a vibrant green t-shirt for my casual jeans.'},
 {'id': 3,
  'query': "I'm searching for a blouse or top that complements a green and white striped pleated Zara skirt for a garden party."},
 {'id': 4, 'query': 'Looking for a trendy navy crop with bold sporty vibes?'},
 {'id': 5, 'query': 'Streetwear style jeans with distressed finish pls?'},
 {'id': 6,
  'query': 'Need a plaid crop top with thin straps for a summer party.'},
 {'id': 7,
  'query': 'Looking for Zara plaid shorts with a tie waist for a picnic.'},
 {'id': 8,
  'query': 'Looking for a neon yellow graphic T-shirt for streetwear vibe.'},
 {'id': 9, 'query': 'Looking for a cotton top to match classic denim jeans.'},
 {'id': 10,
  'query': "I'm attending a garden party and need a sleeveless shirt with eyelet detailing and beautiful gold accents. Would t

In [4]:
from typing import Annotated
from pydantic import BaseModel, PlainValidator

def split_csv(v: str) -> list[str]:
    return [x.strip() for x in v.split(',')] if isinstance(v, str) else v

CsvList = Annotated[list[str], PlainValidator(split_csv)]

class MyModel(BaseModel):
    tags: CsvList
    categories: CsvList  # Reuse the annotation
    labels: CsvList      # Can use it multiple times

model = MyModel(
    tags="a,b,c",
    categories="foo,bar",
    labels="1,2,3"
)
model

MyModel(tags=['a', 'b', 'c'], categories=['foo', 'bar'], labels=['1', '2', '3'])

In [2]:
from typing import Literal
from openai import AsyncOpenAI
import instructor
from asyncio import Semaphore, timeout
from pydantic import BaseModel
from tenacity import retry, stop_after_attempt, wait_fixed
from tqdm.asyncio import tqdm_asyncio as asyncio


client = instructor.from_openai(AsyncOpenAI())

# Define the metadata schema
class MetadataFilters(BaseModel):
    occasion: list[Literal["casual", "smart-casual", "formal", "party", "workwear"]]
    style: list[
        Literal[
            "bohemian",
            "classic",
            "contemporary",
            "elegant",
            "minimalist",
            "preppy",
            "romantic",
            "streetwear",
            "vintage",
        ]
    ]

    material: list[
        Literal["cotton", "denim", "leather", "linen", "silk", "synthetic", "wool"]
    ]

@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
async def extract_metadata(id: int, query: str, sem: Semaphore) -> dict:
    async with sem, timeout(30):
        resp = await client.chat.completions.create(
            model="gpt-4o",
            response_model=MetadataFilters,
            messages=[
                {
                    "role": "system",
                    "content": "You're an expert query understanding AI, extract out any relevant filters from the query. It's ok to return an empty list for that specific filter if it's not required/relevant to the query. Look closely at mentions of desired materials, styles and occasions of what the user is looking for"
                },
                {"role": "user", "content": query},
            ],
        )
    
    return {
        "id": id,
        "query": query,
        **resp.model_dump()
    }

sem = Semaphore(10)
coros = [extract_metadata(query["id"], query["query"], sem) for query in queries]
metadata = await asyncio.gather(*coros)


100%|██████████| 30/30 [00:03<00:00,  9.85it/s]


In [5]:
metadata[0]

{'id': 0,
 'query': 'Need teal lace top to wear for a wedding.',
 'occasion': ['formal'],
 'style': [],
 'material': []}

In [6]:
class TableItem(BaseModel):
    id: int
    text: str
    occasions: CsvList
    style: CsvList
    material: CsvList


In [16]:
def apply_filter(item_values:list[str], filter_values:list[str]):
    for filter_value in filter_values:
        if filter_value in item_values:
            return True
    return False

def search_with_filter(query_with_filter: dict, table, max_k: int) -> list[int]:
    query = query_with_filter["query"]

    # Retrieve the top max_k items by vector similarity, then post-filter them
    items = [TableItem(**item) for item in table.search(query).limit(max_k).to_list()]

    # An empty filter list means "no constraint" for that field
    if query_with_filter["occasion"]:
        items = [item for item in items if apply_filter(item.occasions, query_with_filter["occasion"])]

    if query_with_filter["style"]:
        items = [item for item in items if apply_filter(item.style, query_with_filter["style"])]

    if query_with_filter["material"]:
        items = [item for item in items if apply_filter(item.material, query_with_filter["material"])]

    return [item.id for item in items]
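To make the any-overlap check concrete, here is a standalone sketch that runs it on labels already shown in this notebook: the occasions extracted for item 0 and the filter extracted from query 0. It assumes the labels stored in the table match the single-item output shown earlier, which may not hold since the extraction was re-run over the full dataset.

```python
def apply_filter(item_values: list[str], filter_values: list[str]) -> bool:
    """True if the item shares at least one value with the filter."""
    return any(value in item_values for value in filter_values)


# Labels taken from the outputs shown earlier in this notebook.
item_occasions = ["workwear", "party"]  # extracted for item 0 (Teal Lace Top)
query_occasions = ["formal"]  # extracted from "Need teal lace top to wear for a wedding."

print(apply_filter(item_occasions, query_occasions))  # False
print(apply_filter(item_occasions, ["party"]))  # True
```

Under that assumption, the target item for query 0 would be removed by the occasion filter even though vector search retrieved it, so a mismatch between query-side and item-side labels can lower recall.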




In [20]:
from braintrust import Eval, Score
from helpers import get_metrics_at_k, task


def evaluate_braintrust(input, output, **kwargs):
    metrics = get_metrics_at_k(metrics=["mrr", "recall"], sizes=[1, 3, 5, 10, 15, 25])
    return [
        Score(
            name=metric,
            score=score_fn(output, kwargs["expected"]),
            metadata={"query": input, "result": output, **kwargs["metadata"]},
        )
        for metric, score_fn in metrics.items()
    ]


await Eval(
    "filters",
    data=lambda: [
        {
            "input": question,
            "expected": [question["id"]],
        }
        for question in metadata
    ],  # One eval case per query, carrying its extracted filters
    task=lambda query: search_with_filter(
        query, table=table, max_k=200
    ),  # Retrieval task: vector search followed by metadata filtering
    scores=[evaluate_braintrust],
)

Experiment week-5-1733149659 is running at https://www.braintrust.dev/app/567/p/filters/experiments/week-5-1733149659
filters (data): 30it [00:00, 121927.44it/s]
filters (tasks): 100%|██████████| 30/30 [00:01<00:00, 15.48it/s]



week-5-1733149659 compared to main-1732799310:
50.00% 'mrr@1'     score
53.33% 'mrr@3'     score
53.33% 'mrr@5'     score
53.33% 'mrr@10'    score
54.18% 'mrr@15'    score
54.56% 'mrr@25'    score
50.00% 'recall@1'  score
56.67% 'recall@3'  score
56.67% 'recall@5'  score
56.67% 'recall@10' score
66.67% 'recall@15' score
73.33% 'recall@25' score

1.19s duration

See results for week-5-1733149659 at https://www.braintrust.dev/app/567/p/filters/experiments/week-5-1733149659


EvalResultWithSummary(summary="...", results=[...])