# Use LabelKit to experiment and evaluate different approaches
There are many ways to build a labeling pipeline that accomplish the same result. The goal of `LabelKit` is to enable rapid, robust experimentation so that you can understand the performance, accuracy, and cost tradeoffs between approaches.

In this example, we'll experiment with two approaches to a categorization pipeline we want to build. `LabelKit` makes this experimentation quick, and at the end we'll have a solid understanding of how the approaches perform.


### Task
The task at hand is to categorize furniture items into a multi-level taxonomy based on their name and description. 

For example:
Name: `Blair Table by homestyles`

Description: `This Blair Table by homestyles is perfect for Sunday brunches or game night. The round pedestal table is available as shown, or as part of a five-piece set. Features solid hardwood construction in a black finish that can easily match a traditional or contemporary aesthetic. Measures: 30"H x 42" Diameter`

Correct classification: `Tables & Desks > Dining Tables`

### Approaches
We want to try two approaches:
1. LLMs + embeddings
2. Hierarchical prompting


In [56]:
import os

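# Set your OpenAI key as an environment variable, or replace the placeholder below.
# setdefault leaves an existing OPENAI_API_KEY untouched instead of overwriting it.
os.environ.setdefault('OPENAI_API_KEY', 'YOUR_OPENAI_API_KEY')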

In [57]:
# %pip install cohere

import pandas as pd
from labelkit import *
from pydantic import BaseModel, Field
import cohere
import os
import numpy as np
from typing import List

## Data processing
We'll start by reading in our data and building our taxonomy. Building a taxonomy is a project in and of itself, and there are many taxonomies available online that you can use. In our case, we're building the taxonomy from our ground truth dataset. Because the dataset is large, we can be reasonably confident that all values are represented. As you'll see, our approach does not use the ground truth data as training data, so it will be easy to expand the taxonomy without needing additional data.

In [58]:
df = pd.read_csv('./furniture_clean.csv')

In [59]:
# Remove the 'Furniture > ' from each string in the 'category' column since they all start with Furniture.
df['category_new'] = df['category'].str.replace('Furniture > ', '')


For our embeddings approach we want each taxonomy entry to be a single string, such as `Chairs > Recliners`. We'll create the taxonomy from the ground truth data.

In [60]:
taxonomy = list(set(df['category_new']))
taxonomy[0:5]


['Beds & Headboards > Bedframes',
 'Storage > Dressers',
 'Outdoor Tables > Outdoor Coffee Tables',
 'Tables & Desks > Kitchen Islands',
 'Chairs > Recliners']

However, for our hierarchical approach we need to understand the structure of the taxonomy a little more, so we'll create a lookup table between first and second level categories.

In [61]:
# Create a lookup table with first level taxonomy as keys and second level as values
lookup_table = df['category_new'].str.split(' > ', expand=True).groupby(0)[1].apply(list).apply(set)
lookup_table['Chairs']

{'Accent Chairs', 'Desk Chairs', 'Dining Chairs', 'Recliners'}
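
The same grouping can be sketched without pandas, which may make the logic easier to follow. This is an illustrative sketch only: `build_lookup` is a hypothetical helper, not part of `LabelKit`, and the example paths are a hand-picked subset. It stores sorted lists instead of sets so that the position of each second level category stays stable from run to run (the iteration order of a set of strings can differ between Python processes).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_lookup(paths: Iterable[str], sep: str = " > ") -> Dict[str, List[str]]:
    """Group 'First > Second' strings into {first: sorted unique seconds}."""
    grouped = defaultdict(set)
    for path in paths:
        # Split only on the first separator so deeper levels stay together.
        first, second = path.split(sep, 1)
        grouped[first].add(second)
    # Sorting gives every second level category a reproducible index.
    return {first: sorted(seconds) for first, seconds in grouped.items()}


# Hand-picked example paths (an assumption for illustration).
example_paths = [
    "Chairs > Recliners",
    "Chairs > Accent Chairs",
    "Chairs > Recliners",  # duplicates collapse into one entry
    "Storage > Dressers",
]
lookup = build_lookup(example_paths)
print(lookup["Chairs"])
```

A stable ordering matters later, when a model returns an index into one of these lists.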

## Building our pipeline using LabelKit

### Approach 1: Embeddings
The first approach is similar to the approach we took in the `Product Categorization` example we gave in the project repo. We are omitting the Google Search step because we already have item descriptions. 
1. Write a simple description of the product given name and description
2. Vector embedding search for top N categories
3. LLM: pick the best category



In [62]:
short_description_prompt = lambda row: f"""
You are given a product name and description for a piece of furniture.
Return a single sentence describing the product.
Product name: {row['name']}
Product description: {row['description']}
"""

class ShortDescription(BaseModel):
  short_description: str = Field(description="A single sentence describing the product")
  
short_description_step = steps.LLMStep(
  prompt=short_description_prompt,
  model=models.gpt35,
  out_schema=ShortDescription,
  name="short_description"
)

We are using Cohere to embed both our description and the taxonomy, but you can substitute any embeddings provider with the `EmbeddingClassificationStep`. LLMs are good at ignoring irrelevant information, but we've learned from experience that short, simple descriptions work better in embedding space than descriptions that try to include too much. This is something you can and should experiment with.

In [63]:
# set your cohere api key as an env var or set it directly here
COHERE_API_KEY = os.environ.get('COHERE_API_KEY')
co = cohere.Client(COHERE_API_KEY)

def embed(texts: List[str]):
  embeddings = co.embed(
    model="embed-english-v3.0",
    texts=texts,
    input_type='classification'
  ).embeddings
  return np.array(embeddings).astype('float32')

embedding_search_prompt = lambda row: row["short_description"]

embedding_search_step = steps.EmbeddingClassificationStep(
  search_prompt= embedding_search_prompt,
  embed=embed,
  k=5,    
  categories=taxonomy,
  name="embedding_search"
)
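
To make the idea behind this step concrete, here is a minimal sketch of a top-k search by cosine similarity. It is not `LabelKit`'s implementation, which is not shown in this notebook. The helper names and the two-dimensional toy vectors are assumptions made up for illustration; real embeddings would come from the `embed` function above.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_categories(
    query: Sequence[float],
    category_vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the k category names whose vectors are most similar to the query."""
    ranked = sorted(
        category_vectors,
        key=lambda name: cosine_similarity(query, category_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors, invented for this example only.
toy_vectors = {
    "Beds & Headboards > Beds": [1.0, 0.0],
    "Tables & Desks > Dining Tables": [0.0, 1.0],
    "Beds & Headboards > Bedframes": [0.9, 0.1],
}
toy_query = [1.0, 0.05]
top_two = top_k_categories(toy_query, toy_vectors, k=2)
print(top_two)
```

A larger `k` makes it more likely that the right category is among the candidates, at the price of a longer prompt in the next step.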

We now take the results of the embedding search and ask the LLM to pick the best one. It's important that our embedding search is optimized for recall: if the correct answer isn't among the `k` returned categories, our categorize step has no chance of succeeding.

In [64]:

def categorize_prompt(row):
    categories = ""
    i = 1
    while f"category{i}" in row:
        categories += f'{i}. {row[f"category{i}"]}\n'
        i += 1

    return f"""
    You are given a product description and {i-1} options for the product's category.
    Pick the index of the most accurate category.
    The index must be between 1 and {i-1}.
    Product description: {row['short_description']}
    Categories:
    {categories}
    """
    
class CategoryIndex(BaseModel):
    category_index: int = Field(description="The index of the most accurate category")
    
categorize_step = steps.LLMStep(
  prompt=categorize_prompt,
  model=models.gpt35,
  out_schema=CategoryIndex,
  name="categorize"
)

By returning just the index we can ensure that the actual string we use is in the taxonomy since LLMs sometimes hallucinate characters. Additionally, we don't need to waste response tokens on printing the entire string.

In [65]:
class Category(BaseModel):
    predicted_category: str = Field(description="The most accurate category")

select_category_step = steps.CustomStep(
  transform=lambda row: {"predicted_category": row[f'category{row["category_index"]}']},
  out_schema=Category,
  name="select_category"
)
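
The lambda above assumes the model always returns an index that matches one of the `category{i}` columns. A defensive variant might look like the sketch below. `resolve_category` is a hypothetical helper, not a `LabelKit` function, and returning `None` for a bad index is just one possible policy.

```python
from typing import Mapping, Optional


def resolve_category(
    row: Mapping[str, object],
    index_key: str = "category_index",
    prefix: str = "category",
) -> Optional[str]:
    """Map a 1-based index back to its candidate string, or None if invalid."""
    try:
        index = int(row.get(index_key))
    except (TypeError, ValueError):
        # Missing or non-numeric index: nothing to look up.
        return None
    value = row.get(f"{prefix}{index}")
    # An out-of-range index has no matching column, so value is None.
    return value if isinstance(value, str) else None


example_row = {
    "category1": "Beds & Headboards > Headboards",
    "category2": "Beds & Headboards > Beds",
    "category_index": 2,
}
print(resolve_category(example_row))
print(resolve_category({**example_row, "category_index": 9}))
```

Rows that come back as `None` can then be counted or retried instead of raising in the middle of a run.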

We'd like to test our end-to-end pipeline before we go any further. We'll make a copy of the first five rows of the dataframe and run the pipeline on it to make sure it works.

In [66]:
test_df = df.head(5).copy()

In [67]:
evaluate = lambda row: row['predicted_category'].lower() == row['category_new'].lower()

categorizer = pipeline.Pipeline([
  short_description_step, 
  embedding_search_step, 
  categorize_step,
  select_category_step
], evaluation_fn=evaluate)

categorizer.apply(test_df)

Running step short_description...


100%|██████████| 5/5 [00:06<00:00,  1.36s/it]


Running step embedding_search...
Running step categorize...


100%|██████████| 5/5 [00:02<00:00,  1.79it/s]


Running step select_category...


100%|██████████| 5/5 [00:00<00:00, 14453.15it/s]


Unnamed: 0,name,description,category,brand.name,category_new,__short_description__,short_description,category1,category2,category3,category4,category5,__categorize__,category_index,predicted_category
0,EnGauge Deluxe Bedframe,Introducing the Engauge Deluxe Bedframe - the ...,Furniture > Beds & Headboards > Bedframes,,Beds & Headboards > Bedframes,"{'input_tokens': 313, 'output_tokens': 46, 'su...",Introducing the Engauge Deluxe Bedframe - the ...,Beds & Headboards > Bedframes,Mattresses & Box Springs > Box Springs & Found...,Beds & Headboards > Beds,Mattresses & Box Springs > Mattresses,Beds & Headboards > Headboards,"{'input_tokens': 192, 'output_tokens': 10, 'su...",1,Beds & Headboards > Bedframes
1,Sparrow & Wren Sullivan King Channel-Stitched ...,"85""L x 83""W x 56""H | Total weight: 150 lbs. | ...",Furniture > Beds & Headboards > Beds,Sparrow & Wren,Beds & Headboards > Beds,"{'input_tokens': 169, 'output_tokens': 76, 'su...",The Sparrow & Wren Sullivan King Channel-Stitc...,Beds & Headboards > Headboards,Beds & Headboards > Beds,Beds & Headboards > Bedframes,Mattresses & Box Springs > Mattresses,Kids Beds & Headboards > Kid's Beds,"{'input_tokens': 221, 'output_tokens': 10, 'su...",2,Beds & Headboards > Beds
2,Queen Bed With Frame,Dimensions:Head Board -49H x 63.75W x 1.5DFoot...,Furniture > Beds & Headboards > Beds,Hillsdale,Beds & Headboards > Beds,"{'input_tokens': 124, 'output_tokens': 58, 'su...",The Queen Bed With Frame features a head board...,Beds & Headboards > Bedframes,Beds & Headboards > Beds,Beds & Headboards > Headboards,Kids Beds & Headboards > Kid's Beds,Sets > Bedroom Furniture Sets,"{'input_tokens': 200, 'output_tokens': 10, 'su...",1,Beds & Headboards > Bedframes
3,Dylan Queen Bed,Add a touch of a modern farmhouse to your bedr...,Furniture > Beds & Headboards > Beds,,Beds & Headboards > Beds,"{'input_tokens': 140, 'output_tokens': 42, 'su...",Add a touch of modern farmhouse to your bedroo...,Beds & Headboards > Headboards,Beds & Headboards > Beds,Beds & Headboards > Bedframes,Sets > Bedroom Furniture Sets,Kids Beds & Headboards > Kid's Beds,"{'input_tokens': 184, 'output_tokens': 10, 'su...",2,Beds & Headboards > Beds
4,Sparrow & Wren Mara Full Diamond-Tufted Bed,"78""L x 56""W x 51""H | Total weight: 130 lbs. | ...",Furniture > Beds & Headboards > Beds,Sparrow & Wren,Beds & Headboards > Beds,"{'input_tokens': 168, 'output_tokens': 81, 'su...",The Sparrow & Wren Mara Full Diamond-Tufted Be...,Beds & Headboards > Headboards,Beds & Headboards > Beds,Beds & Headboards > Bedframes,Mattresses & Box Springs > Mattresses,Kids Beds & Headboards > Kid's Beds,"{'input_tokens': 226, 'output_tokens': 10, 'su...",2,Beds & Headboards > Beds
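
Because the categorize step can only choose among the retrieved candidates, it can help to check the recall of the embedding search separately from final accuracy. The sketch below does that for two rows copied from the output above (rows 1 and 2), written as plain dictionaries. `candidate_recall` and `accuracy` are hypothetical helpers, not `LabelKit` functions.

```python
from typing import Dict, List


def candidate_recall(rows: List[Dict[str, str]], truth_key: str = "category_new") -> float:
    """Fraction of rows whose true category appears among category1..categoryN."""
    hits = 0
    for row in rows:
        candidates = []
        i = 1
        while f"category{i}" in row:
            candidates.append(row[f"category{i}"].lower())
            i += 1
        hits += row[truth_key].lower() in candidates
    return hits / len(rows)


def accuracy(rows: List[Dict[str, str]], truth_key: str = "category_new") -> float:
    """Fraction of rows whose predicted category equals the true category."""
    hits = sum(r["predicted_category"].lower() == r[truth_key].lower() for r in rows)
    return hits / len(rows)


rows = [
    {  # row 1 in the output above
        "category_new": "Beds & Headboards > Beds",
        "category1": "Beds & Headboards > Headboards",
        "category2": "Beds & Headboards > Beds",
        "category3": "Beds & Headboards > Bedframes",
        "category4": "Mattresses & Box Springs > Mattresses",
        "category5": "Kids Beds & Headboards > Kid's Beds",
        "predicted_category": "Beds & Headboards > Beds",
    },
    {  # row 2 in the output above
        "category_new": "Beds & Headboards > Beds",
        "category1": "Beds & Headboards > Bedframes",
        "category2": "Beds & Headboards > Beds",
        "category3": "Beds & Headboards > Headboards",
        "category4": "Kids Beds & Headboards > Kid's Beds",
        "category5": "Sets > Bedroom Furniture Sets",
        "predicted_category": "Beds & Headboards > Bedframes",
    },
]
print(candidate_recall(rows), accuracy(rows))
```

On these two rows the correct category is always retrieved, so the one miss comes from the categorize step rather than from the embedding search.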


Let's create a table to nicely visualize our pipeline statistics.

In [68]:
# TODO: refactor this into labelkit 

from prettytable import PrettyTable
import json

def create_pretty_table(statistics_json_list, names=None):
    # If no names are provided, generate default names
    if not names or len(names) != len(statistics_json_list):
        names = [f"Pipeline {i+1}" for i in range(len(statistics_json_list))]
    
    # Create a PrettyTable object
    table = PrettyTable()
    
    # Add columns to the table
    table.field_names = ["Name", "Model", "Input Tokens", "Output Tokens", "Num Success", "Num Failure", "Total Latency"]
    
    # Iterate over each statistics JSON and its corresponding name
    for name, statistics_json in zip(names, statistics_json_list):
        # Parse the JSON statistics data
        data = json.loads(statistics_json)
        
        # Extract shared statistics
        num_success = data.get("num_success", "")
        num_failure = data.get("num_failure", "")
        total_latency = data.get("total_latency", "")
        
        # Initialize a flag to indicate the first row for shared statistics
        first_row = True
        
        # Add rows to the table for each model
        for model, input_tokens in data.get("input_tokens", {}).items():
            output_tokens = data.get("output_tokens", {}).get(model, "")
            
            # Only add shared statistics to the first row
            if first_row:
                table.add_row([name, model, input_tokens, output_tokens, num_success, num_failure, total_latency])
                first_row = False
            else:
                table.add_row(["", model, input_tokens, output_tokens, "", "", ""])
    
    # Return the table as a string
    return table

In [69]:
# Example usage with the provided data from the model dumps
statistics_json_list = [
    str(categorizer.statistics),
]

names = ["Categorizer"]

# Create and print the pretty table
pretty_table = create_pretty_table(statistics_json_list, names)
print(pretty_table)

+-------------+--------------------+--------------+---------------+-------------+-------------+-------------------+
|     Name    |       Model        | Input Tokens | Output Tokens | Num Success | Num Failure |   Total Latency   |
+-------------+--------------------+--------------+---------------+-------------+-------------+-------------------+
| Categorizer | gpt-3.5-turbo-0125 |     1937     |      353      |      5      |      0      | 9.570192332903389 |
+-------------+--------------------+--------------+---------------+-------------+-------------+-------------------+


In [70]:
print(f"Accuracy: {categorizer.score}")

Accuracy: 0.8


Our pipeline is doing well but that's only on 5 data points. Let's try it on a few more.

In [71]:
test_df100 = df.head(100).copy()
categorizer.apply(test_df100)

Running step short_description...


100%|██████████| 100/100 [01:49<00:00,  1.09s/it]


Running step embedding_search...
Running step categorize...


100%|██████████| 100/100 [00:47<00:00,  2.10it/s]


Running step select_category...


100%|██████████| 100/100 [00:00<00:00, 19546.57it/s]


Unnamed: 0,name,description,category,brand.name,category_new,__short_description__,short_description,category1,category2,category3,category4,category5,__categorize__,category_index,predicted_category
0,EnGauge Deluxe Bedframe,Introducing the Engauge Deluxe Bedframe - the ...,Furniture > Beds & Headboards > Bedframes,,Beds & Headboards > Bedframes,"{'input_tokens': 313, 'output_tokens': 66, 'su...",Introducing the Engauge Deluxe Bedframe - a st...,Beds & Headboards > Bedframes,Beds & Headboards > Beds,Mattresses & Box Springs > Mattresses,Beds & Headboards > Headboards,Mattresses & Box Springs > Box Springs & Found...,"{'input_tokens': 212, 'output_tokens': 10, 'su...",1,Beds & Headboards > Bedframes
1,Sparrow & Wren Sullivan King Channel-Stitched ...,"85""L x 83""W x 56""H | Total weight: 150 lbs. | ...",Furniture > Beds & Headboards > Beds,Sparrow & Wren,Beds & Headboards > Beds,"{'input_tokens': 169, 'output_tokens': 55, 'su...",The Sparrow & Wren Sullivan King Channel-Stitc...,Beds & Headboards > Headboards,Beds & Headboards > Beds,Kids Beds & Headboards > Kid's Beds,Beds & Headboards > Bedframes,Mattresses & Box Springs > Mattresses,"{'input_tokens': 200, 'output_tokens': 10, 'su...",2,Beds & Headboards > Beds
2,Queen Bed With Frame,Dimensions:Head Board -49H x 63.75W x 1.5DFoot...,Furniture > Beds & Headboards > Beds,Hillsdale,Beds & Headboards > Beds,"{'input_tokens': 124, 'output_tokens': 56, 'su...",A queen bed with frame featuring a head board ...,Beds & Headboards > Bedframes,Beds & Headboards > Beds,Beds & Headboards > Headboards,Kids Beds & Headboards > Kid's Beds,Sets > Bedroom Furniture Sets,"{'input_tokens': 198, 'output_tokens': 10, 'su...",2,Beds & Headboards > Beds
3,Dylan Queen Bed,Add a touch of a modern farmhouse to your bedr...,Furniture > Beds & Headboards > Beds,,Beds & Headboards > Beds,"{'input_tokens': 140, 'output_tokens': 37, 'su...",The Dylan Queen Bed combines rustic and contem...,Beds & Headboards > Beds,Beds & Headboards > Headboards,Beds & Headboards > Bedframes,Sets > Bedroom Furniture Sets,Kids Beds & Headboards > Kid's Beds,"{'input_tokens': 179, 'output_tokens': 10, 'su...",1,Beds & Headboards > Beds
4,Sparrow & Wren Mara Full Diamond-Tufted Bed,"78""L x 56""W x 51""H | Total weight: 130 lbs. | ...",Furniture > Beds & Headboards > Beds,Sparrow & Wren,Beds & Headboards > Beds,"{'input_tokens': 168, 'output_tokens': 81, 'su...",The Sparrow & Wren Mara Full Diamond-Tufted Be...,Beds & Headboards > Headboards,Beds & Headboards > Beds,Beds & Headboards > Bedframes,Mattresses & Box Springs > Mattresses,Sets > Bedroom Furniture Sets,"{'input_tokens': 222, 'output_tokens': 10, 'su...",2,Beds & Headboards > Beds
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Modway Melanie Tufted Button Upholstered Fabri...,"Twin | Clean lines, a straightforward profile,...",Furniture > Beds & Headboards > Beds,Modway,Beds & Headboards > Beds,"{'input_tokens': 225, 'output_tokens': 57, 'su...","Chic and elegant, the Modway Melanie Tufted Bu...",Beds & Headboards > Headboards,Beds & Headboards > Beds,Sets > Bedroom Furniture Sets,Beds & Headboards > Bedframes,Kids Beds & Headboards > Kid's Beds,"{'input_tokens': 198, 'output_tokens': 10, 'su...",2,Beds & Headboards > Beds
96,Concord Queen Panel Bed,Looking for a new bed that has it all? Check o...,Furniture > Beds & Headboards > Beds,Daniel's Amish,Beds & Headboards > Beds,"{'input_tokens': 205, 'output_tokens': 54, 'su...",The Concord Queen Panel Bed is a contemporary ...,Beds & Headboards > Headboards,Beds & Headboards > Beds,Beds & Headboards > Bedframes,Kids Beds & Headboards > Kid's Beds,Mattresses & Box Springs > Mattresses,"{'input_tokens': 199, 'output_tokens': 10, 'su...",2,Beds & Headboards > Beds
97,Sparrow & Wren Myers King Bed,"Dimensions: 85""L x 82""W x 56""H | Headboard hei...",Furniture > Beds & Headboards > Beds,Sparrow & Wren,Beds & Headboards > Beds,"{'input_tokens': 271, 'output_tokens': 50, 'su...",The Sparrow & Wren Myers King Bed is a handcra...,Beds & Headboards > Beds,Beds & Headboards > Headboards,Beds & Headboards > Bedframes,Kids Beds & Headboards > Kid's Beds,Sets > Bedroom Furniture Sets,"{'input_tokens': 192, 'output_tokens': 10, 'su...",1,Beds & Headboards > Beds
98,Loden Beige 3 Pc Queen Upholstered Bed with 2 ...,A classic design and sophisticated silhouette ...,Furniture > Beds & Headboards > Beds,Rooms To Go,Beds & Headboards > Beds,"{'input_tokens': 181, 'output_tokens': 57, 'su...",The Loden Beige 3 Pc Queen Upholstered Bed wit...,Storage > Dressers,Storage > Nightstands,Beds & Headboards > Headboards,Beds & Headboards > Beds,Sets > Bedroom Furniture Sets,"{'input_tokens': 191, 'output_tokens': 10, 'su...",4,Beds & Headboards > Beds


In [72]:
statistics_json_list = [
    str(categorizer.statistics),
]

names = ["Categorizer"]

# Create and print the pretty table
pretty_table = create_pretty_table(statistics_json_list, names)
print(pretty_table)
print(f"Accuracy: {categorizer.score}")

+-------------+--------------------+--------------+---------------+-------------+-------------+--------------------+
|     Name    |       Model        | Input Tokens | Output Tokens | Num Success | Num Failure |   Total Latency    |
+-------------+--------------------+--------------+---------------+-------------+-------------+--------------------+
| Categorizer | gpt-3.5-turbo-0125 |    39680     |      6659     |     100     |      0      | 166.04607970733196 |
+-------------+--------------------+--------------+---------------+-------------+-------------+--------------------+
Accuracy: 0.93
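
Token counts like the ones in this table can be turned into a rough cost estimate. The sketch below shows the arithmetic. The per-million-token rates are placeholder assumptions, so substitute the current prices for your model; `estimate_cost` is a hypothetical helper, not part of `LabelKit`.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimated spend in USD for a given number of input and output tokens."""
    total = input_tokens * usd_per_million_input + output_tokens * usd_per_million_output
    return total / 1_000_000


# Token counts from the table above; the rates are ASSUMED placeholder values.
ASSUMED_INPUT_RATE = 0.50   # USD per 1M input tokens (assumption)
ASSUMED_OUTPUT_RATE = 1.50  # USD per 1M output tokens (assumption)

cost = estimate_cost(39680, 6659, ASSUMED_INPUT_RATE, ASSUMED_OUTPUT_RATE)
print(f"Estimated cost: ${cost:.6f}")
```

The estimate covers LLM tokens only; embedding calls are billed separately by the embeddings provider.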


At current gpt-3.5-turbo pricing, this batch of 100 rows cost about $0.030673 and took about 166 seconds of total latency (a little under three minutes) for 93% accuracy. Let's see how hierarchical prompting does.

### Approach 2: Hierarchical prompting
Next we want to try forgoing embeddings altogether and simply stuffing the categories into the prompt. There are too many categories to do this all in one go, but we can use the fact that our categories are hierarchical and take a step-by-step approach.
1. LLM: given product name, description, and first level categories, pick the best one.
2. LLM: given product name, description, and second level categories, pick the best one.

We may want to iterate a bit on this process. For example, we may want to use one model in step 1 and a different model in step 2. `LabelKit` makes this type of hyperparameter tuning easy and robust.

In our first step we're just asking the model to pick the right top-level category. This is relatively easy if the categories are non-overlapping, but it can be very difficult if several categories are plausibly correct. We'll only know by trying and inspecting our losses.

In [73]:
first_level_categories = list(lookup_table.keys())

def first_level_category_prompt(row):
    i = len(first_level_categories)

    return f"""
    You are given a product name, description and {i} options for the product's top level category.
    Pick the index of the most accurate category.
    The index must be between 1 and {i}.
    Product description: {row['description']}
    Product name: {row['name']}
    Categories:
    {first_level_categories}
    """
    
class FirstLevelCategoryIndex(BaseModel):
    first_category_index: int = Field(description="The index of the most accurate first level category")
    
first_level_category_step = steps.LLMStep(
  prompt=first_level_category_prompt,
  model=models.gpt35,
  out_schema=FirstLevelCategoryIndex,
  name="first_categorize"
)

In [74]:
class FirstCategory(BaseModel):
    predicted_first_category: str = Field(description="The most accurate first level category")

select_first_category_step = steps.CustomStep(
  transform=lambda row: {"predicted_first_category": first_level_categories[row["first_category_index"] - 1]},
  out_schema=FirstCategory,
  name="select_first_category"
)
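
The prompt above interpolates the Python list directly, so the model sees the options without explicit numbers and has to count positions itself. In addition, an index of `0` would silently select the last category, because `first_level_categories[0 - 1]` is a valid negative index in Python. The sketch below shows one way to number the options, as `categorize_prompt` does in Approach 1, and to reject out-of-range indices. Both helpers are hypothetical and were not used to produce the results in this notebook.

```python
from typing import Optional, Sequence


def format_numbered_options(options: Sequence[str]) -> str:
    """Render options as '1. first', '2. second', ... one per line."""
    return "\n".join(f"{i}. {option}" for i, option in enumerate(options, start=1))


def pick_by_index(options: Sequence[str], index: int) -> Optional[str]:
    """Return the option for a 1-based index, or None when it is out of range."""
    if 1 <= index <= len(options):
        return options[index - 1]
    return None


# Two category names taken from the taxonomy shown earlier.
example_options = ["Beds & Headboards", "Chairs"]
print(format_numbered_options(example_options))
print(pick_by_index(example_options, 2))
print(pick_by_index(example_options, 0))
```

Whether numbering the options changes accuracy here is an open question that a quick experiment could answer.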

Next we'll give the second layer of the taxonomy to the model to classify. Just as before, we are predicting the index to make sure our final output is valid.

In [75]:
def second_level_category_prompt(row):
    second_level_categories = list(lookup_table[row['predicted_first_category']])
    i = len(second_level_categories)

    return f"""
    You are given a product name, description, first level category 
    and {i} options for the product's second level category.
    Pick the index of the most accurate category.
    The index must be between 1 and {i}.
    Product description: {row['description']}
    Product name: {row['name']}
    First level category: {row['predicted_first_category']}
    Categories:
    {second_level_categories}
    """
    
class SecondLevelCategoryIndex(BaseModel):
    second_category_index: int = Field(description="The index of the most accurate second level category")
    
second_level_category_step = steps.LLMStep(
  prompt=second_level_category_prompt,
  model=models.gpt35,
  out_schema=SecondLevelCategoryIndex,
  name="second_categorize"
)

In [76]:
class SecondCategory(BaseModel):
    predicted_second_category: str = Field(description="The most accurate second level category")

select_second_category_step = steps.CustomStep(
  transform=lambda row: {"predicted_second_category": list(lookup_table[row['predicted_first_category']])[row["second_category_index"] - 1]},
  out_schema=SecondCategory,
  name="select_second_category"
)

Let's combine our results so we can properly compare to our ground truth column. 

In [77]:
class PredictedTaxonomy(BaseModel):
    predicted_taxonomy: str = Field(description="The predicted taxonomy based on LLM categorization")

combine_taxonomy_step = steps.CustomStep(
    transform=lambda row: {"predicted_taxonomy": f"{row['predicted_first_category']} > {row['predicted_second_category']}"},
    out_schema=PredictedTaxonomy,
    name='combine_taxonomy'
)

In [78]:
test_df2 = df.head(5).copy()

evaluate2 = lambda row: row['predicted_taxonomy'].lower() == row['category_new'].lower()

categorizer_llm = pipeline.Pipeline([
  first_level_category_step, 
  select_first_category_step,
  second_level_category_step,
  select_second_category_step,
  combine_taxonomy_step
], evaluation_fn=evaluate2)

categorizer_llm.apply(test_df2)

Running step first_categorize...


100%|██████████| 5/5 [00:02<00:00,  2.09it/s]


Running step select_first_category...


100%|██████████| 5/5 [00:00<00:00, 9767.82it/s]


Running step second_categorize...


100%|██████████| 5/5 [00:02<00:00,  2.08it/s]


Running step select_second_category...


100%|██████████| 5/5 [00:00<00:00, 4322.24it/s]


Running step combine_taxonomy...


100%|██████████| 5/5 [00:00<00:00, 13443.28it/s]


Unnamed: 0,name,description,category,brand.name,category_new,__first_categorize__,first_category_index,predicted_first_category,__second_categorize__,second_category_index,predicted_second_category,predicted_taxonomy
0,EnGauge Deluxe Bedframe,Introducing the Engauge Deluxe Bedframe - the ...,Furniture > Beds & Headboards > Bedframes,,Beds & Headboards > Bedframes,"{'input_tokens': 419, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 372, 'output_tokens': 11, 'su...",2,Bedframes,Beds & Headboards > Bedframes
1,Sparrow & Wren Sullivan King Channel-Stitched ...,"85""L x 83""W x 56""H | Total weight: 150 lbs. | ...",Furniture > Beds & Headboards > Beds,Sparrow & Wren,Beds & Headboards > Beds,"{'input_tokens': 275, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 228, 'output_tokens': 11, 'su...",3,Headboards,Beds & Headboards > Headboards
2,Queen Bed With Frame,Dimensions:Head Board -49H x 63.75W x 1.5DFoot...,Furniture > Beds & Headboards > Beds,Hillsdale,Beds & Headboards > Beds,"{'input_tokens': 230, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 183, 'output_tokens': 11, 'su...",1,Beds,Beds & Headboards > Beds
3,Dylan Queen Bed,Add a touch of a modern farmhouse to your bedr...,Furniture > Beds & Headboards > Beds,,Beds & Headboards > Beds,"{'input_tokens': 246, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 199, 'output_tokens': 11, 'su...",1,Beds,Beds & Headboards > Beds
4,Sparrow & Wren Mara Full Diamond-Tufted Bed,"78""L x 56""W x 51""H | Total weight: 130 lbs. | ...",Furniture > Beds & Headboards > Beds,Sparrow & Wren,Beds & Headboards > Beds,"{'input_tokens': 274, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 227, 'output_tokens': 11, 'su...",1,Beds,Beds & Headboards > Beds
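
With a two-step pipeline it is useful to know which level the mistakes come from. The sketch below computes first level accuracy and full path accuracy for the five rows in the output above, written as `(category_new, predicted_taxonomy)` pairs. `level_accuracy` is a hypothetical helper, not a `LabelKit` function.

```python
from typing import Dict, List, Tuple


def level_accuracy(pairs: List[Tuple[str, str]], sep: str = " > ") -> Dict[str, float]:
    """Accuracy on the first level alone and on the full 'First > Second' path."""
    first_hits = 0
    full_hits = 0
    for truth, predicted in pairs:
        truth_first = truth.split(sep, 1)[0].lower()
        predicted_first = predicted.split(sep, 1)[0].lower()
        first_hits += truth_first == predicted_first
        full_hits += truth.lower() == predicted.lower()
    n = len(pairs)
    return {"first_level": first_hits / n, "full_path": full_hits / n}


# (ground truth, prediction) pairs copied from the five rows above.
pairs = [
    ("Beds & Headboards > Bedframes", "Beds & Headboards > Bedframes"),
    ("Beds & Headboards > Beds", "Beds & Headboards > Headboards"),
    ("Beds & Headboards > Beds", "Beds & Headboards > Beds"),
    ("Beds & Headboards > Beds", "Beds & Headboards > Beds"),
    ("Beds & Headboards > Beds", "Beds & Headboards > Beds"),
]
scores = level_accuracy(pairs)
print(scores)
```

On this small sample every first level choice is right, and the single error is a second level confusion between `Beds` and `Headboards`.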


It works. Let's run it on more data, as we did before.

In [79]:
test_df2_100 = df.head(100).copy()
categorizer_llm.apply(test_df2_100)

Running step first_categorize...


100%|██████████| 100/100 [01:06<00:00,  1.50it/s]


Running step select_first_category...


100%|██████████| 100/100 [00:00<00:00, 41169.06it/s]


Running step second_categorize...


100%|██████████| 100/100 [00:45<00:00,  2.19it/s]


Running step select_second_category...


100%|██████████| 100/100 [00:00<00:00, 37272.76it/s]


Running step combine_taxonomy...


100%|██████████| 100/100 [00:00<00:00, 36192.11it/s]


Unnamed: 0,name,description,category,brand.name,category_new,__first_categorize__,first_category_index,predicted_first_category,__second_categorize__,second_category_index,predicted_second_category,predicted_taxonomy
0,EnGauge Deluxe Bedframe,Introducing the Engauge Deluxe Bedframe - the ...,Furniture > Beds & Headboards > Bedframes,,Beds & Headboards > Bedframes,"{'input_tokens': 419, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 372, 'output_tokens': 11, 'su...",2,Bedframes,Beds & Headboards > Bedframes
1,Sparrow & Wren Sullivan King Channel-Stitched ...,"85""L x 83""W x 56""H | Total weight: 150 lbs. | ...",Furniture > Beds & Headboards > Beds,Sparrow & Wren,Beds & Headboards > Beds,"{'input_tokens': 275, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 228, 'output_tokens': 11, 'su...",1,Beds,Beds & Headboards > Beds
2,Queen Bed With Frame,Dimensions:Head Board -49H x 63.75W x 1.5DFoot...,Furniture > Beds & Headboards > Beds,Hillsdale,Beds & Headboards > Beds,"{'input_tokens': 230, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 183, 'output_tokens': 11, 'su...",3,Headboards,Beds & Headboards > Headboards
3,Dylan Queen Bed,Add a touch of a modern farmhouse to your bedr...,Furniture > Beds & Headboards > Beds,,Beds & Headboards > Beds,"{'input_tokens': 246, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 199, 'output_tokens': 11, 'su...",1,Beds,Beds & Headboards > Beds
4,Sparrow & Wren Mara Full Diamond-Tufted Bed,"78""L x 56""W x 51""H | Total weight: 130 lbs. | ...",Furniture > Beds & Headboards > Beds,Sparrow & Wren,Beds & Headboards > Beds,"{'input_tokens': 274, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 227, 'output_tokens': 11, 'su...",1,Beds,Beds & Headboards > Beds
...,...,...,...,...,...,...,...,...,...,...,...,...
95,Modway Melanie Tufted Button Upholstered Fabri...,"Twin | Clean lines, a straightforward profile,...",Furniture > Beds & Headboards > Beds,Modway,Beds & Headboards > Beds,"{'input_tokens': 331, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 284, 'output_tokens': 11, 'su...",1,Beds,Beds & Headboards > Beds
96,Concord Queen Panel Bed,Looking for a new bed that has it all? Check o...,Furniture > Beds & Headboards > Beds,Daniel's Amish,Beds & Headboards > Beds,"{'input_tokens': 311, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 264, 'output_tokens': 11, 'su...",1,Beds,Beds & Headboards > Beds
97,Sparrow & Wren Myers King Bed,"Dimensions: 85""L x 82""W x 56""H | Headboard hei...",Furniture > Beds & Headboards > Beds,Sparrow & Wren,Beds & Headboards > Beds,"{'input_tokens': 377, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 330, 'output_tokens': 11, 'su...",3,Headboards,Beds & Headboards > Headboards
98,Loden Beige 3 Pc Queen Upholstered Bed with 2 ...,A classic design and sophisticated silhouette ...,Furniture > Beds & Headboards > Beds,Rooms To Go,Beds & Headboards > Beds,"{'input_tokens': 287, 'output_tokens': 11, 'su...",1,Beds & Headboards,"{'input_tokens': 240, 'output_tokens': 11, 'su...",1,Beds,Beds & Headboards > Beds


Let's compare approach 1 to approach 2. 

In [80]:
statistics_json_list = [
    str(categorizer.statistics),
    str(categorizer_llm.statistics),
]

names = ["Embeddings+LLM", "Heiarchical prompting"]

# Create and print the pretty table
pretty_table = create_pretty_table(statistics_json_list, names)
print(pretty_table)
print(f"Embeddings+LLM accuracy: {categorizer.score}")
print(f"Heiarchical prompting accuracy: {categorizer_llm.score}")

+-----------------------+--------------------+--------------+---------------+-------------+-------------+--------------------+
|          Name         |       Model        | Input Tokens | Output Tokens | Num Success | Num Failure |   Total Latency    |
+-----------------------+--------------------+--------------+---------------+-------------+-------------+--------------------+
|     Embeddings+LLM    | gpt-3.5-turbo-0125 |    39680     |      6659     |     100     |      0      | 166.04607970733196 |
| Heiarchical prompting | gpt-3.5-turbo-0125 |    55706     |      2313     |     100     |      0      | 117.02653779514367 |
+-----------------------+--------------------+--------------+---------------+-------------+-------------+--------------------+
Embeddings+LLM accuracy: 0.93
Heiarchical prompting accuracy: 0.76


Our hierarchical approach cost almost the same, at $0.032814 per 100 rows. It took about 30% less time (roughly 117 seconds of total latency versus 166), but accuracy dropped from 93% to 76%. However, we're not done just yet. The power of `LabelKit` is that we can easily try many different permutations of our pipeline using a grid search.

## Grid search

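Our first pipeline has three steps we could search over:
1. Short description: vary the model
2. Embedding search: vary the number of results (`k`)
3. Categorize: vary the model

It's not clear which permutation will work best, so we'll try every combination in the grid. To keep the run small, the grid below varies only the embedding search `k` and the categorize model (3 × 2 = 6 combinations) and leaves the short description model fixed.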

In [81]:
from labelkit import grid_search

params_grid = {
    embedding_search_step.name: {
        'k': [3, 5, 7],  
    },
    categorize_step.name: {
        'model': [models.gpt35, models.gpt4], 
    },
}

small_df = df.head(30).copy()


search_embeddings = grid_search.GridSearch(categorizer, params_grid)
search_embeddings.apply(small_df)

Iteration 1 of 6
Params:  {'embedding_search': {'k': 3}, 'categorize': {'model': 'gpt-3.5-turbo-0125'}}
Result:  {'embedding_search__k': 3, 'categorize__model': 'gpt-3.5-turbo-0125', 'score': 0.9333333333333333, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 11254}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 2047}), 'num_success': 30, 'num_failure': 0, 'total_latency': 51.57649916619994, 'index': -5329565791411564406}
Iteration 2 of 6
Params:  {'embedding_search': {'k': 3}, 'categorize': {'model': 'gpt-4-turbo-preview'}}
Result:  {'embedding_search__k': 3, 'categorize__model': 'gpt-4-turbo-preview', 'score': 0.9333333333333333, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 5852, 'gpt-4-turbo-preview': 5295}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 1643, 'gpt-4-turbo-preview': 300}), 'num_success': 30, 'num_failure': 0, 'total_latency': 65.59831066709012, 'index': 3662522137852334499}
Iteration 3

|   | embedding_search__k | categorize__model | score | input_tokens | output_tokens | num_success | num_failure | total_latency | index |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | gpt-3.5-turbo-0125 | 0.933333 | {'gpt-3.5-turbo-0125': 11254} | {'gpt-3.5-turbo-0125': 2047} | 30 | 0 | 51.576499 | -5329565791411564406 |
| 1 | 3 | gpt-4-turbo-preview | 0.933333 | {'gpt-3.5-turbo-0125': 5852, 'gpt-4-turbo-prev... | {'gpt-3.5-turbo-0125': 1643, 'gpt-4-turbo-prev... | 30 | 0 | 65.598311 | 3662522137852334499 |
| 2 | 5 | gpt-3.5-turbo-0125 | 0.9 | {'gpt-3.5-turbo-0125': 11909} | {'gpt-3.5-turbo-0125': 2081} | 30 | 0 | 49.899396 | -7605723625856707069 |
| 3 | 5 | gpt-4-turbo-preview | 0.933333 | {'gpt-3.5-turbo-0125': 5852, 'gpt-4-turbo-prev... | {'gpt-3.5-turbo-0125': 1708, 'gpt-4-turbo-prev... | 30 | 0 | 64.65049 | -7543548363153777932 |
| 4 | 7 | gpt-3.5-turbo-0125 | 0.9 | {'gpt-3.5-turbo-0125': 12521} | {'gpt-3.5-turbo-0125': 2105} | 30 | 0 | 48.981345 | 6522858634821065972 |
| 5 | 7 | gpt-4-turbo-preview | 0.966667 | {'gpt-3.5-turbo-0125': 5852, 'gpt-4-turbo-prev... | {'gpt-3.5-turbo-0125': 1688, 'gpt-4-turbo-prev... | 30 | 0 | 64.679221 | -2711728362350690716 |


The results of our grid search are conveniently put into a dataframe for us to review.

Notice that more embedding results don't always do better. With GPT-3.5 as the categorizer, 3 embedding results score 0.933 while 5 and 7 both score 0.9. On 30 rows that gap is a single classification, so we'd want to run this on more data to validate, but it highlights the value of experimentation.
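
To get a feel for how little a 30-row sample can distinguish, the scores can be converted back into correct-row counts along with a rough binomial standard error. This is a back-of-the-envelope sketch using the GPT-3.5 scores from the results above. The helper `correct_and_stderr` is hypothetical, and the normal approximation it relies on is crude at this sample size.

```python
import math


def correct_and_stderr(score, n):
    """Return (number of correct rows, binomial standard error of the score)."""
    correct = round(score * n)
    p = correct / n
    stderr = math.sqrt(p * (1 - p) / n)
    return correct, stderr


n = 30  # rows in the grid search sample
# (k, score) pairs for the GPT-3.5 categorizer, taken from the results above
for k, score in [(3, 0.933333), (5, 0.9), (7, 0.9)]:
    correct, stderr = correct_and_stderr(score, n)
    print(f"k={k}: {correct}/{n} correct, standard error ~ {stderr:.3f}")
```

The standard error on each score is larger than the one-row gap (1/30 ≈ 0.033) between the configurations, which is why a bigger sample is needed before preferring one value of `k` over another.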

Let's do the same for our hierarchical prompting approach. This time we'll just vary the model selection for each step.

In [83]:
# Try each model at both levels of the hierarchy
params_grid = {
    first_level_category_step.name: {
        'model': [models.gpt35, models.gpt4],
    },
    second_level_category_step.name: {
        'model': [models.gpt35, models.gpt4],
    },
}

# Use the same first 30 rows as the embedding grid search
small_df2 = df.head(30).copy()

search_llm = grid_search.GridSearch(categorizer_llm, params_grid)
search_llm.apply(small_df2)

Iteration 1 of 4
Params:  {'first_categorize': {'model': 'gpt-3.5-turbo-0125'}, 'second_categorize': {'model': 'gpt-3.5-turbo-0125'}}
Result:  {'first_categorize__model': 'gpt-3.5-turbo-0125', 'second_categorize__model': 'gpt-3.5-turbo-0125', 'score': 0.7333333333333333, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 16648}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 660}), 'num_success': 30, 'num_failure': 0, 'total_latency': 30.506660460319836, 'index': 9064737584790964269}
Iteration 2 of 4
Params:  {'first_categorize': {'model': 'gpt-3.5-turbo-0125'}, 'second_categorize': {'model': 'gpt-4-turbo-preview'}}
Result:  {'first_categorize__model': 'gpt-3.5-turbo-0125', 'second_categorize__model': 'gpt-4-turbo-preview', 'score': 0.9333333333333333, 'input_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 9032, 'gpt-4-turbo-preview': 7616}), 'output_tokens': defaultdict(<class 'int'>, {'gpt-3.5-turbo-0125': 330, 'gpt-4-turbo-preview': 330

|   | first_categorize__model | second_categorize__model | score | input_tokens | output_tokens | num_success | num_failure | total_latency | index |
|---|---|---|---|---|---|---|---|---|---|
| 0 | gpt-3.5-turbo-0125 | gpt-3.5-turbo-0125 | 0.733333 | {'gpt-3.5-turbo-0125': 16648} | {'gpt-3.5-turbo-0125': 660} | 30 | 0 | 30.50666 | 9064737584790964269 |
| 1 | gpt-3.5-turbo-0125 | gpt-4-turbo-preview | 0.933333 | {'gpt-3.5-turbo-0125': 9032, 'gpt-4-turbo-prev... | {'gpt-3.5-turbo-0125': 330, 'gpt-4-turbo-previ... | 30 | 0 | 45.977288 | 7013088740952582084 |
| 2 | gpt-4-turbo-preview | gpt-3.5-turbo-0125 | 0.833333 | {'gpt-4-turbo-preview': 9032, 'gpt-3.5-turbo-0... | {'gpt-4-turbo-preview': 330, 'gpt-3.5-turbo-01... | 30 | 0 | 42.584932 | 1541332106710414306 |
| 3 | gpt-4-turbo-preview | gpt-4-turbo-preview | 0.9 | {'gpt-4-turbo-preview': 16663} | {'gpt-4-turbo-preview': 660} | 30 | 0 | 58.213133 | -5983245934859978448 |


These results highlight the importance of experimentation and optimization. As we can see, the GPT-3.5 + GPT-4 hierarchical pipeline performs the best of the four (0.933) with relatively low latency. In fact, it matches or beats every embedding configuration except the most expensive one (7 embedding results with GPT-4 as the categorizer, at 0.967).

When we compare cost, we see that the best LLM-only approach costs $0.003 per classification, compared to merely $0.00028 for the equally performing embedding approach (3 embedding results with GPT-3.5), around a 10x difference. We take a slight hit on latency (about 52 seconds versus 46 for the 30 rows), but it's minor.
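
The per-classification figures can be roughly reproduced from the token counts in the grid search output. The sketch below assumes list prices of $0.50 / $1.50 per million input / output tokens for `gpt-3.5-turbo-0125` and $10 / $30 for `gpt-4-turbo-preview`. Those prices are an assumption, not something stated in this notebook, and the cost of the embedding calls themselves is not included. `cost_per_row` is a hypothetical helper.

```python
# Assumed list prices in USD per 1M tokens as (input, output); not from the notebook.
PRICES = {
    "gpt-3.5-turbo-0125": (0.50, 1.50),
    "gpt-4-turbo-preview": (10.00, 30.00),
}


def cost_per_row(input_tokens, output_tokens, num_rows):
    """Total LLM cost in USD divided by the number of rows classified."""
    total = 0.0
    for model, tokens in input_tokens.items():
        total += tokens * PRICES[model][0] / 1_000_000
    for model, tokens in output_tokens.items():
        total += tokens * PRICES[model][1] / 1_000_000
    return total / num_rows


# Token counts copied from the grid search output for the two 0.933 pipelines.
hierarchical = cost_per_row(
    {"gpt-3.5-turbo-0125": 9032, "gpt-4-turbo-preview": 7616},
    {"gpt-3.5-turbo-0125": 330, "gpt-4-turbo-preview": 330},
    num_rows=30,
)
embedding = cost_per_row(
    {"gpt-3.5-turbo-0125": 11254},
    {"gpt-3.5-turbo-0125": 2047},
    num_rows=30,
)

print(f"hierarchical: ${hierarchical:.5f} per row")
print(f"embedding:    ${embedding:.5f} per row")
print(f"ratio:        {hierarchical / embedding:.1f}x")
```

Under these assumed prices, almost all of the hierarchical pipeline's cost comes from the GPT-4 input tokens in the second step.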

It looks like an embeddings-based approach is our best bet. We can now make that claim with conviction, backed up by data and experimentation.