In [1]:
%reload_ext autoreload
%autoreload 2

# Week 5: Enhancing RAG with Structured Metadata

Effective RAG systems need more than just text matching - they need to understand and filter based on specific attributes. This notebook demonstrates how to use LLMs to generate structured metadata that enables more precise and relevant retrieval.

## Why Generate Metadata with LLMs?

Consider a typical e-commerce query: "I'm looking for a black cotton t-shirt under $50"

This simple request contains multiple filtering criteria:
- Color (black)
- Material (cotton)
- Product type (t-shirt)
- Price range (< $50)

Traditional text search struggles with such queries because answering them requires things that plain text matching does not provide:
1. Structured data for filtering
2. Consistent attribute labeling
3. Standardized taxonomies
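
To make the point concrete, here is a minimal sketch of what the example query looks like once it has been turned into structured filters. The `Product` and `QueryFilter` classes, the `matches` helper, and the catalog rows are all made up for illustration; they are not part of this notebook's pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    title: str
    color: str
    material: str
    product_type: str
    price: float


@dataclass
class QueryFilter:
    # None means "no constraint on this field"
    color: Optional[str] = None
    material: Optional[str] = None
    product_type: Optional[str] = None
    max_price: Optional[float] = None


def matches(product: Product, query: QueryFilter) -> bool:
    """Return True when the product satisfies every constraint that is set."""
    if query.color is not None and product.color != query.color:
        return False
    if query.material is not None and product.material != query.material:
        return False
    if query.product_type is not None and product.product_type != query.product_type:
        return False
    # "under $50" is treated as a strict upper bound
    if query.max_price is not None and product.price >= query.max_price:
        return False
    return True


# Hypothetical catalog rows, made up for this example
catalog = [
    Product("Black Cotton Tee", "Black", "Cotton", "T-Shirt", 29.99),
    Product("Black Silk Tee", "Black", "Silk", "T-Shirt", 45.00),
    Product("Premium Black Cotton Tee", "Black", "Cotton", "T-Shirt", 79.00),
    Product("White Cotton Tee", "White", "Cotton", "T-Shirt", 19.99),
]

# "I'm looking for a black cotton t-shirt under $50", expressed as filters
query = QueryFilter(color="Black", material="Cotton", product_type="T-Shirt", max_price=50.0)
matching_titles = [p.title for p in catalog if matches(p, query)]
print(matching_titles)
```

The filter only works because every product carries the same attribute names and the same vocabulary of values ("Black", not "black" or "jet"), which is exactly what the three requirements above describe.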

## Benefits of LLM Metadata Generation

1. **Consistency**
   - Standardized attribute extraction
   - Taxonomy compliance
   - Quality control at scale

2. **Efficiency**
   - Faster than manual labeling
   - Often more accurate than rule-based systems
   - Cost-effective for large catalogs

3. **Flexibility**
   - Adapts to new product types
   - Handles complex attributes
   - Supports taxonomy updates

## Our Approach

We'll build a metadata generation system in three phases:

1. **Taxonomy Definition**
   - Define product categories
   - Specify valid attributes
   - Create validation rules

2. **Data Generation**
   - Process product images
   - Extract structured metadata
   - Validate against taxonomy

3. **Quality Assurance**
   - Enforce consistency
   - Validate attributes
   - Enable team collaboration

## What We'll Create

Using the `irow/ClothingControlV2` dataset, we'll:
1. Define a robust e-commerce taxonomy
2. Generate structured product metadata
3. Create a validation pipeline
4. Enable efficient filtering

## Prerequisites

- Basic understanding of e-commerce data
- Familiarity with product taxonomies
- Python environment with required libraries

Let's dive in and see how LLMs can help create better structured data for RAG systems!

# Generating our Dataset

In this portion, we'll generate a dataset that mimics an e-commerce company's product catalog. To do so, we'll extract item data from images using `gpt-4o` and then classify the items against a taxonomy.

## Loading in our Taxonomy

E-commerce companies use what's called a taxonomy to classify their products. In our case, we've chosen the following fields:

- `category`: A high-level category such as Men's, Women's, Unisex, etc.
- `subcategory`: A more specific category, such as T-Shirts or Blouses, that sits under a given category.
- `types`: More specific product types such as Crew Neck T-Shirt, V-Neck T-Shirt, etc.
- `attributes`: Attributes that are specific to items with a given category, subcategory and type combination.
- `common_attributes`: Attributes that are common to all items in our database, such as the sizes and colors in stock.

We want to define a taxonomy ahead of time for three main reasons:

1. **Consistency**: With a fixed taxonomy, we can ensure that we only generate data for items that fall within it.
2. **Filtering**: It makes it easy to map user queries to a set of known metadata fields, which we can later use to filter retrieved items.
3. **Non-Technical Help**: By using a human-readable format like YAML, we can ask non-technical team members to help define the taxonomy. You could also store raw taxonomies or configs in a NoSQL database, but we've chosen to keep things simple for now.

Let's now read in our taxonomy and see how we can enforce these fields. We'll define some simple Pydantic models to make it easy to work with the YAML data.

We've defined a `process_taxonomy_file` function to process the YAML file and convert it into a dictionary. We use `pydantic` and `instructor` to make sure that our LLM-generated metadata conforms to our taxonomy.
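
The helper itself lives in `helpers.py` and isn't shown in this notebook, so the following is only a sketch of the kind of lookup structure the validators further down rely on: `taxonomy_map[category][subcategory]` holding a `product_type` list and an `attributes` mapping. The layout of `raw_taxonomy` and the `build_taxonomy_map` function are assumptions made for illustration, and most of the values are invented.

```python
# Hypothetical parsed YAML; the real taxonomy.yml layout may differ.
raw_taxonomy = [
    {
        "category": "Women",
        "subcategories": [
            {
                "name": "Tops",
                "product_type": ["Tank Tops", "Blouses"],
                "attributes": {
                    "Sleeve Length": ["Sleeveless", "Short Sleeve", "Long Sleeve"],
                    "Neckline": ["Crew Neck", "V-Neck"],
                    "Fit": ["Regular", "Slim"],
                },
            }
        ],
    }
]


def build_taxonomy_map(raw: list[dict]) -> dict:
    """Index the nested taxonomy as map[category][subcategory] for fast lookups."""
    taxonomy_map: dict = {}
    for category in raw:
        subcategories = taxonomy_map.setdefault(category["category"], {})
        for sub in category["subcategories"]:
            subcategories[sub["name"]] = {
                "product_type": list(sub["product_type"]),
                "attributes": {
                    name: list(values) for name, values in sub["attributes"].items()
                },
            }
    return taxonomy_map


example_taxonomy_map = build_taxonomy_map(raw_taxonomy)
print(example_taxonomy_map["Women"]["Tops"]["product_type"])
```

Indexing by category and then subcategory means each validation step is a dictionary lookup rather than a scan over the whole taxonomy.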


In [2]:
from helpers import process_taxonomy_file

taxonomy_data = process_taxonomy_file("taxonomy.yml")

taxonomy_data.keys()

dict_keys(['taxonomy_map', 'occasions', 'materials', 'common_attributes', 'taxonomy'])

In [3]:
taxonomy_data["common_attributes"].keys()

dict_keys(['Size', 'Color', 'Material', 'Pattern', 'Occasion'])

In [17]:
from pydantic import model_validator, ValidationInfo, BaseModel


class ItemAttribute(BaseModel):
    name: str
    value: str


class ItemMetadata(BaseModel):
    title: str
    brand: str
    description: str
    category: str
    subcategory: str
    product_type: str
    attributes: list[ItemAttribute]
    material: str
    pattern: str

    @model_validator(mode="after")
    def validate_material_and_pattern(self, info: ValidationInfo):
        context = info.context
        if not context or not context.get("taxonomy_data"):
            raise ValueError("Taxonomy data is required for validation")

        if self.pattern not in context["taxonomy_data"]["common_attributes"]["Pattern"]:
            raise ValueError(
                f"Pattern {self.pattern} is not a valid pattern. Valid patterns are {context['taxonomy_data']['common_attributes']['Pattern']}"
            )

        if self.material not in context["taxonomy_data"]["common_attributes"]["Material"]:
            raise ValueError(
                f"Material {self.material} is not a valid material. Valid materials are {context['taxonomy_data']['common_attributes']['Material']}"
            )

        return self

    @model_validator(mode="after")
    def validate_category_and_attributes(self, info: ValidationInfo):
        context = info.context
        if not context or not context.get("taxonomy_data"):
            raise ValueError("Taxonomy data is required for validation")

        taxonomy_map = context["taxonomy_data"]["taxonomy_map"]

        # 1. Validate category
        if self.category not in taxonomy_map:
            raise ValueError(
                f"Category {self.category} is not valid. Valid categories are {list(taxonomy_map.keys())}"
            )

        # 2. Validate subcategory
        if self.subcategory not in taxonomy_map[self.category]:
            raise ValueError(
                f"Subcategory {self.subcategory} does not exist under category {self.category}"
            )

        subcategory_data = taxonomy_map[self.category][self.subcategory]

        # 3. Validate product type
        if self.product_type not in subcategory_data["product_type"]:
            raise ValueError(
                f"Product type {self.product_type} is not valid for subcategory {self.subcategory}. Valid types are {subcategory_data['product_type']}"
            )

        # 4. Validate attributes
        for attr in self.attributes:
            if attr.name not in subcategory_data["attributes"]:
                raise ValueError(
                    f"Attribute {attr.name} is not valid for subcategory {self.subcategory}. Valid attributes are {list(subcategory_data['attributes'].keys())}"
                )

            if attr.value not in subcategory_data["attributes"][attr.name]:
                raise ValueError(
                    f"Value {attr.value} is not valid for attribute {attr.name}. Valid values are {subcategory_data['attributes'][attr.name]}"
                )

        return self
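
The validators above stop at the first problem they find. As a point of comparison, here is a plain-Python sketch that walks the same category, subcategory, product type and attribute checks but collects every error message, which can be handy when inspecting a bad record by hand. It is not the notebook's validator: `collect_taxonomy_errors`, the small `example_map`, and both records are assumptions made for illustration.

```python
def collect_taxonomy_errors(record: dict, taxonomy_map: dict) -> list[str]:
    """Return every taxonomy violation in a record instead of raising on the first."""
    category = taxonomy_map.get(record["category"])
    if category is None:
        # Nothing below can be checked without a valid category
        return [f"Category {record['category']} is not valid. Valid categories are {list(taxonomy_map)}"]

    subcategory = category.get(record["subcategory"])
    if subcategory is None:
        return [f"Subcategory {record['subcategory']} does not exist under category {record['category']}"]

    errors: list[str] = []
    if record["product_type"] not in subcategory["product_type"]:
        errors.append(f"Product type {record['product_type']} is not valid for subcategory {record['subcategory']}")

    for attr in record["attributes"]:
        allowed_values = subcategory["attributes"].get(attr["name"])
        if allowed_values is None:
            errors.append(f"Attribute {attr['name']} is not valid for subcategory {record['subcategory']}")
        elif attr["value"] not in allowed_values:
            errors.append(f"Value {attr['value']} is not valid for attribute {attr['name']}")
    return errors


# Hypothetical taxonomy fragment, shaped like the map the validators index into
example_map = {
    "Women": {
        "Tops": {
            "product_type": ["Tank Tops", "Blouses"],
            "attributes": {
                "Sleeve Length": ["Sleeveless", "Short Sleeve"],
                "Neckline": ["Crew Neck", "V-Neck"],
            },
        }
    }
}

good_record = {
    "category": "Women",
    "subcategory": "Tops",
    "product_type": "Tank Tops",
    "attributes": [{"name": "Neckline", "value": "Crew Neck"}],
}
bad_record = {
    "category": "Women",
    "subcategory": "Tops",
    "product_type": "Hoodies",
    "attributes": [
        {"name": "Neckline", "value": "Turtleneck"},
        {"name": "Heel Height", "value": "High"},
    ],
}

good_errors = collect_taxonomy_errors(good_record, example_map)
bad_errors = collect_taxonomy_errors(bad_record, example_map)
print(good_errors)
print(bad_errors)
```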

## Extracting Item Data from Images

The `irow/ClothingControlV2` dataset contains images of clothing items that were generated using a control net. It doesn't include any product data, so we'll extract the item data from the images ourselves.

We use `instructor` to extract the item data from the images. Note that we render the entire taxonomy into the prompt as context. We do so for two reasons:

1. Providing all of the possible choices gives the model more flexibility in deciding what the right metadata fields are.
2. If we have a large taxonomy, we can use techniques like prompt caching to save on costs. Keeping the initial portion of the prompt identical across requests allows it to be cached, which can also speed up the extraction process.
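
The caching point depends on the start of the prompt being byte-for-byte identical from one request to the next. Here is a minimal sketch of that idea, with the taxonomy serialized deterministically and placed first, and the per-image content last. The `build_messages` function, the `SYSTEM_PREFIX` text, and the image file names are hypothetical, and whether a shared prefix is actually cached depends on the provider.

```python
import json

# Static instructions; the taxonomy is appended so the whole system message is stable
SYSTEM_PREFIX = (
    "You are an expert at extracting item data from images. "
    "Here are the categories, subcategories, types and attributes "
    "that you can choose from: "
)


def build_messages(taxonomy_map: dict, image_ref: str) -> list[dict]:
    """Put the unchanging taxonomy first and the per-request content last."""
    # sort_keys=True makes the serialized text independent of dict insertion order
    taxonomy_text = json.dumps(taxonomy_map, sort_keys=True)
    return [
        {"role": "system", "content": SYSTEM_PREFIX + taxonomy_text},
        {"role": "user", "content": f"Here is the image: {image_ref}"},
    ]


# The same hypothetical taxonomy, built with two different key orders
map_a = {
    "Women": {"Tops": {"product_type": ["Tank Tops"]}},
    "Men": {"Tops": {"product_type": ["T-Shirts"]}},
}
map_b = {
    "Men": {"Tops": {"product_type": ["T-Shirts"]}},
    "Women": {"Tops": {"product_type": ["Tank Tops"]}},
}

first = build_messages(map_a, "image_001.png")
second = build_messages(map_b, "image_002.png")
print(first[0] == second[0])
```

Only the final user message differs between the two requests, so everything before it is a candidate for reuse.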

We'll be using `gpt-4o` here for the extraction since it supports multimodal inputs (in this case, images).

In [4]:
from datasets import load_dataset

ds = [item for item in load_dataset("irow/ClothingControlV2", streaming=True)["train"].take(2)]


In [19]:
from openai import AsyncOpenAI
import instructor
import tempfile

client = instructor.from_openai(AsyncOpenAI())

with open("taxonomy.yml", "r") as f:
    taxonomy = f.read()

with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as f:
    ds[0]["image"].save(f.name)
    items = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are an expert at extracting item data from images. Extract 1-2 items seen in the images based on the taxonomy provided. Here are the categories, subcategories, types and attributes that you can choose from: {{ taxonomy_data['taxonomy_map'] }}",
            },
            {
                "role": "user",
                "content": [
                    "Here is the image, choose a brand that likely sells these items - choose a real brand that exists in real life and make up a name if that's not possible. Also generate a short description of 1-2 sentences of the item that would be suitable for an e-commerce website",
                    instructor.Image.from_path(f.name),
                ],
            },
        ],
        response_model=list[ItemMetadata],
        context={
            "taxonomy_data": taxonomy_data
        },
        
    )

    print(items)

[ItemMetadata(title='Lace Detail Sleeveless Top', brand='H&M', description='Elevate your wardrobe with this elegant sleeveless top featuring intricate lace detailing at the neckline. Perfect for a chic daytime look or a night out, this versatile piece combines comfort and style effortlessly.', category='Women', subcategory='Tops', product_type='Tank Tops', attributes=[ItemAttribute(name='Sleeve Length', value='Sleeveless'), ItemAttribute(name='Neckline', value='Crew Neck'), ItemAttribute(name='Fit', value='Regular')], material='Cotton', pattern='Solid')]


We can see that `gpt-4o` was able to extract the item data from the image and that the metadata fields conform to the taxonomy we've defined. The result is a list of the items in the image, each mapped to a category, subcategory and other metadata fields.

In [5]:
from openai import AsyncOpenAI
import instructor
import tempfile
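from asyncio import Semaphore, timeout  # asyncio.timeout requires Python 3.11+
from tenacity import retry, stop_after_attempt, wait_fixed

# Note: this alias shadows the standard-library `asyncio` name in this notebook;
# we use it for its progress-bar-aware `gather`
from tqdm.asyncio import tqdm_asyncio as asyncio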


@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
async def generate_dataset_label(
    dataset_item: dict, client: instructor.AsyncInstructor, sem: Semaphore, taxonomy_data: dict
):
    async with sem, timeout(30):
        with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as f:
            dataset_item["image"].save(f.name)
            items = await client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {
                        "role": "system",
                        "content": "You are an expert at extracting item data from images. Extract 1-2 items seen in the images based on the taxonomy provided. Here are the categories, subcategories, types and attributes that you can choose from: {{ taxonomy_data['taxonomy_map'] }}",
                    },
                    {
                        "role": "user",
                        "content": [
                            "Here is the image, choose a brand that likely sells these items - choose a real brand that exists in real life and make up a name if that's not possible. Also generate a short description of 1-2 sentences of the item that would be suitable for an e-commerce website",
                            instructor.Image.from_path(f.name),
                        ],
                    },
                ],
                response_model=list[ItemMetadata],
                context={
                    "taxonomy_data": taxonomy_data
                },
                
            )

            return [
                {"image": dataset_item["image"], "metadata": item} for item in items
            ]


In [30]:
import instructor

client = instructor.from_openai(AsyncOpenAI())
sem = Semaphore(15)
n_rows = 150

ds = [item for item in load_dataset("irow/ClothingControlV2", streaming=True)["train"].take(n_rows)]
results = await asyncio.gather(
    *[generate_dataset_label(ds_row, client, sem, taxonomy_data) for ds_row in ds]
)

100%|██████████| 150/150 [02:19<00:00,  1.08it/s]
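
The labelling run above uses a semaphore to cap how many requests are in flight at once. The sketch below isolates that pattern using only the standard library, with a short sleep standing in for the API call. Here `asyncio` is the standard-library module rather than the `tqdm` alias used in the cell above, and `label_one`, `label_all`, and the shared `state` dictionary are hypothetical names.

```python
import asyncio


async def label_one(index: int, sem: asyncio.Semaphore, state: dict) -> int:
    async with sem:
        # Track how many tasks are inside the semaphore at the same time
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)  # stands in for the network call
        state["in_flight"] -= 1
        return index * 2


async def label_all(n_items: int, limit: int) -> tuple[list[int], int]:
    sem = asyncio.Semaphore(limit)
    state = {"in_flight": 0, "peak": 0}
    # gather returns results in the same order as the inputs
    results = await asyncio.gather(
        *(label_one(i, sem, state) for i in range(n_items))
    )
    return results, state["peak"]


# In a notebook, use `await label_all(...)` instead of asyncio.run
labels, peak_concurrency = asyncio.run(label_all(n_items=20, limit=5))
print(len(labels), peak_concurrency)
```

All twenty tasks are created up front, but the semaphore admits at most five at a time, which is how the run above keeps 150 requests within a limit of 15.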


Before we can create a dataset, we need to flatten the list of items we've extracted from the images. We'll also flatten the metadata fields so that the dataset has a consistent schema. Since our attributes are a nested list of objects, we'll convert them to a JSON string for convenience.

In [31]:
# Flatten results list of lists into a single list
flattened_results = [item for sublist in results for item in sublist]
flattened_results[0]["metadata"].model_dump()

{'title': 'Lace Detail Sleeveless Top',
 'brand': 'H&M',
 'description': "Elevate your casual wardrobe with this elegant sleeveless top featuring intricate lace detailing at the neckline. Perfect for both day and night, it's crafted from a soft, breathable fabric for all-day comfort.",
 'category': 'Women',
 'subcategory': 'Tops',
 'product_type': 'Tank Tops',
 'attributes': [{'name': 'Sleeve Length', 'value': 'Sleeveless'},
  {'name': 'Neckline', 'value': 'Crew Neck'}],
 'material': 'Cotton',
 'pattern': 'Solid'}

In [32]:
import json
import random

def flatten_item(item: dict, id: int):
    flattened_item = {"image": item["image"], **item["metadata"].model_dump()}

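    return {
        **flattened_item,
        "id": id,
        # Synthetic price: the source images come with no product data
        "price": round(random.uniform(10.0, 400.0), 2),
        # Store the nested attributes as a JSON string to keep the schema flat
        "attributes": json.dumps(flattened_item["attributes"]),
    }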


hf_dataset_items = [flatten_item(item, id+1) for id, item in enumerate(flattened_results)]
hf_dataset_items[0]

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=768x1024>,
 'title': 'Lace Detail Sleeveless Top',
 'brand': 'H&M',
 'description': "Elevate your casual wardrobe with this elegant sleeveless top featuring intricate lace detailing at the neckline. Perfect for both day and night, it's crafted from a soft, breathable fabric for all-day comfort.",
 'category': 'Women',
 'subcategory': 'Tops',
 'product_type': 'Tank Tops',
 'attributes': '[{"name": "Sleeve Length", "value": "Sleeveless"}, {"name": "Neckline", "value": "Crew Neck"}]',
 'material': 'Cotton',
 'pattern': 'Solid',
 'id': 1,
 'price': 181.04}
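
Because `attributes` is now a JSON string, it has to be decoded before it can be used for filtering. Here is a minimal sketch of that round trip. The first row reuses values from the output above, the second row is made up, and `decode_attributes` and `filter_rows` are hypothetical helpers rather than part of this notebook.

```python
import json
from typing import Optional

rows = [
    {
        "id": 1,
        "title": "Lace Detail Sleeveless Top",
        "price": 181.04,
        "attributes": '[{"name": "Sleeve Length", "value": "Sleeveless"}, {"name": "Neckline", "value": "Crew Neck"}]',
    },
    {
        # Invented row for illustration
        "id": 2,
        "title": "Basic Short Sleeve Tee",
        "price": 39.99,
        "attributes": '[{"name": "Sleeve Length", "value": "Short Sleeve"}]',
    },
]


def decode_attributes(row: dict) -> dict:
    """Turn the stored JSON string back into a name -> value mapping."""
    return {attr["name"]: attr["value"] for attr in json.loads(row["attributes"])}


def filter_rows(
    rows: list[dict],
    max_price: Optional[float] = None,
    required: Optional[dict] = None,
) -> list[dict]:
    """Keep rows under max_price whose attributes match every required pair."""
    required = required or {}
    selected = []
    for row in rows:
        if max_price is not None and row["price"] >= max_price:
            continue
        attributes = decode_attributes(row)
        if all(attributes.get(name) == value for name, value in required.items()):
            selected.append(row)
    return selected


sleeveless_ids = [r["id"] for r in filter_rows(rows, required={"Sleeve Length": "Sleeveless"})]
budget_ids = [r["id"] for r in filter_rows(rows, max_price=50.0)]
print(sleeveless_ids, budget_ids)
```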

In [33]:
# Now let's create a new HF dataset with the labelled data so that we have it stored
# Convert to HuggingFace Dataset format
from datasets import Dataset

# Create HF dataset
dataset = Dataset.from_list(hf_dataset_items)
dataset.push_to_hub("ivanleomk/ecommerce-taxonomy")


Map: 100%|██████████| 191/191 [00:00<00:00, 21290.32 examples/s]
Creating parquet from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 317.61ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:03<00:00,  3.01s/it]


CommitInfo(commit_url='https://huggingface.co/datasets/ivanleomk/ecommerce-taxonomy/commit/f404c96bf9e1ec0d3de7026312f5bcd36f18ceef', commit_message='Upload dataset', commit_description='', oid='f404c96bf9e1ec0d3de7026312f5bcd36f18ceef', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/ivanleomk/ecommerce-taxonomy', endpoint='https://huggingface.co', repo_type='dataset', repo_id='ivanleomk/ecommerce-taxonomy'), pr_revision=None, pr_num=None)

In this notebook, we've demonstrated how to enhance RAG systems with structured metadata generated by LLMs. This builds on our query understanding work from Week 4 while adding a new dimension of structured data.

Key accomplishments:
1. Developed a robust taxonomy for metadata
2. Created an LLM-powered metadata generation pipeline
3. Implemented validation to ensure metadata quality

This work takes our RAG system beyond pure text matching, leveraging the insights from our query classification system in Week 4. In the next notebook, we'll show how to use this metadata to improve retrieval performance.

Looking ahead to Week 6, this structured metadata approach will help our tool selection system make more informed decisions. The validation techniques we've developed here will also prove valuable for ensuring reliable tool selection.

Remember that metadata schemas should evolve with your application - start simple and add complexity only as needed based on user requirements and performance metrics.