## Exploring the Walmart Amazon dataset using Autolabel

#### Set up the API keys for the providers that you want to use

In [1]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = '<your-openai-api-key>'
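
Pasting a key directly into a notebook cell makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reuses a key that is already set in the environment and only prompts for one if it is missing. `ensure_api_key` is a hypothetical helper written for this example; it is not part of Autolabel.

```python
import os
from getpass import getpass


def ensure_api_key(var_name: str, prompt_fn=getpass) -> str:
    """Return the key stored in `var_name`, prompting for it only if it is unset."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Nothing in the environment yet: ask the user without echoing the input.
        key = prompt_fn(f"Enter value for {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} must not be empty")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key('OPENAI_API_KEY')` in the setup cell would then keep the key out of the saved notebook.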

#### Install the autolabel library

In [2]:
!pip install 'refuel-autolabel[openai]'





#### Download the dataset

In [2]:
from autolabel import get_data

get_data('walmart_amazon')

Downloading example dataset from https://autolabel-benchmarking.s3.us-west-2.amazonaws.com/walmart_amazon/seed.csv to seed.csv...
Downloading example dataset from https://autolabel-benchmarking.s3.us-west-2.amazonaws.com/walmart_amazon/test.csv to test.csv...
100% [........................................] [929245/929245] bytes

This downloads two datasets:
* `test.csv`: This is the larger dataset we are trying to label using LLMs
* `seed.csv`: This is a small dataset where we already have human-provided labels
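
Before labeling, it can help to check how balanced the human-provided labels are. The sketch below counts label values in a CSV file using only the standard library. It assumes the label column is named `label`, as in the `dataset.label_column` entry of the config used in this notebook; `label_distribution` is a hypothetical helper, not an Autolabel function.

```python
import csv
from collections import Counter


def label_distribution(csv_path: str, label_column: str = "label") -> Counter:
    """Count how often each value appears in `label_column` of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or label_column not in reader.fieldnames:
            raise KeyError(f"column {label_column!r} not found in {csv_path}")
        # Counter consumes the reader while the file is still open.
        return Counter(row[label_column] for row in reader)
```

For example, `label_distribution('seed.csv')` would return a `Counter` keyed by label value, which makes a heavily skewed seed set easy to spot.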

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `config.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

In [1]:
import json

from autolabel import LabelingAgent

In [2]:
# load the config
with open('config_walmart_amazon.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `entity_matching` (since it's an entity matching task)
* `model`: `{'provider': 'openai', 'name': 'gpt-3.5-turbo'}` (use a specific OpenAI model)
* `prompt.labels`: `['duplicate', 'not duplicate']` (the full list of labels to choose from)
* `prompt.task_guidelines`: `'You are an expert at identifying duplicate products from online product catalogs...'` (how we describe the task to the LLM)
* `prompt.few_shot_num`: 10 (how many labeled examples to provide to the LLM)
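
Since the loaded config is a plain dictionary, a quick sanity check can confirm that the keys listed above hold what you expect before any model is loaded. The following is a rough sketch: `find_config_issues` is a hypothetical helper, and the specific checks (expected provider, at least two unique labels, non-negative `few_shot_num`) are assumptions chosen for illustration, not validation rules taken from Autolabel.

```python
from typing import List


def find_config_issues(config: dict, expected_provider: str = "openai") -> List[str]:
    """Return human-readable problems found in a labeling config dict."""
    issues = []
    model = config.get("model", {})
    prompt = config.get("prompt", {})

    provider = model.get("provider")
    if provider != expected_provider:
        issues.append(f"model.provider is {provider!r}, expected {expected_provider!r}")

    labels = prompt.get("labels", [])
    if len(labels) < 2:
        issues.append("prompt.labels should list at least two labels")
    if len(set(labels)) != len(labels):
        issues.append("prompt.labels contains repeated entries")

    few_shot_num = prompt.get("few_shot_num")
    if few_shot_num is not None and few_shot_num < 0:
        issues.append("prompt.few_shot_num must not be negative")

    return issues
```

Running `find_config_issues(config)` right after loading the file returns an empty list when everything matches, and otherwise lists each mismatch, such as a provider that differs from the one you intended to use.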

In [3]:
config

{'task_name': 'ProductCatalogEntityMatch',
 'task_type': 'entity_matching',
 'dataset': {'label_column': 'label', 'delimiter': ','},
 'model': {'provider': 'huggingface_pipeline', 'name': 'gpt-3.5-turbo'},
 'prompt': {'task_guidelines': 'You are an expert at identifying duplicate products from online product catalogs.\nYou will be given information about two product entities, and your job is to tell if they are the same (duplicate) or different (not duplicate). Your answer must be from one of the following options:\n{labels}\n',
  'labels': ['duplicate', 'not duplicate'],
  'output_guidelines': '\nYou will simply return the option that corresponds to the answer. Do NOT output any explanation. Do NOT output any thing other than the answer from the options above.\n',
  'example_template': 'Title of entity1: {Title_entity1}; category of entity1: {Category_entity1}; brand of entity1: {Brand_entity1}; model number of entity1: {ModelNo_entity1}; price of entity1: {Price_entity1}\nTitle of en

In [4]:
# create an agent for labeling
agent = LabelingAgent(config=config)

2023-07-20 08:26:09 torch.distributed.nn.jit.instantiator INFO: Created a temporary directory at /tmp/tmppxomf1um
2023-07-20 08:26:09 torch.distributed.nn.jit.instantiator INFO: Writing /tmp/tmppxomf1um/_remote_module_non_scriptable.py


[2023-07-20 08:26:09,421] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
agent.plan('test.csv')

2023-07-20 08:26:46 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-07-20 08:26:47 sentence_transformers.SentenceTransformer INFO: Use pytorch device: cuda


Batches:   0%|          | 0/7 [00:00<?, ?it/s]

Output()

You are an expert at identifying duplicate products from online product catalogs.
You will be given information about two product entities, and your job is to tell if they are the same (duplicate) or different (not duplicate). Your answer must be from one of the following options:
duplicate
not duplicate



You will simply return the option that corresponds to the answer. Do NOT output any explanation. Do NOT output any thing other than the answer from the options above.


Some examples with their output answers are provided below:

Title of entity1: wintec filemate 4gb compactflash memory card; category of entity1: usb drives; brand of entity1: wintec; model number of entity1: 3fmcf4gb-r; price of entity1: 15.98
Title of entity2: sandisk 4gb ultra ii compactflash memory card 15mb s; category of entity2: memory cards; brand of entity2: sandisk; model number of entity2: sdcfh-4096; price of entity2: 17.96
Duplicate or not: not duplicate

Title of entity1: dane-elec 8gb high-speed compac

In [8]:
labels, df, metrics_list = agent.run('test.csv', max_items=1000)

2023-07-20 16:39:42 autolabel.labeler INFO: Task run already exists.


You are an expert at identifying duplicate products from online product catalogs.
You will be given information about two product entities, and your job is to tell if they are the same (duplicate) or different (not duplicate). Your answer must be from one of the following options:
duplicate
not duplicate



You will simply return the option that corresponds to the answer. Do NOT output any explanation. Do NOT output any thing other than the answer from the options above.


Some examples with their output answers are provided below:

Title of entity1: xfx ati radeon 5670 hd pci-express 1gb ddr3 graphics card; category of entity1: electronics - general; brand of entity1: xfx; model number of entity1: hd567xznl3; price of entity1: 91.88
Title of entity2: xfx ati radeon hd 5570 1 gb ddr2 vga dvi hdmi pci-express video card hd557xzhf2; category of entity2: computers accessories; brand of entity2: xfx; model number of entity2: hd-557x-zhf2; price of entity2: 68.61
Duplicate or not: not dupli

duplicate


 y


2023-07-20 16:39:49 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-07-20 16:39:50 sentence_transformers.SentenceTransformer INFO: Use pytorch device: cuda


Batches: 0it [00:00, ?it/s]

Output()

Actual Cost: 0.0


We are at 97% accuracy when labeling the first 100 examples. Let's see if we can use confidence scores to improve accuracy further by removing the less confident examples from our labeled set.

### Compute confidence scores


In [12]:
# Start computing confidence scores (using Refuel's LLMs)
os.environ['REFUEL_API_KEY'] = 'xxxxxxxxxxxxxxxxx'

In [13]:
config["model"]["compute_confidence"] = True

In [14]:
agent = LabelingAgent(config=config)

In [15]:
agent.plan('test.csv')

You are an expert at identifying duplicate products from online product catalogs.
Your job is to tell if the two given entities are duplicates or not duplicate. Your answer must be from one of the following options:
duplicate
not duplicate

You will return the answer with one element: "the correct option"


Some examples with their output answers are provided below:

Title of entity1: lexmark extra high yield return pgm print cartridge - magenta; category of entity1: printers; brand of entity1: lexmark; model number of entity1: c782u1mg; price of entity1: 214.88
Title of entity2: lexmark 18c1428 return program print cartridge black; category of entity2: inkjet printer ink; brand of entity2: lexmark; model number of entity2: 18c1428; price of entity2: 19.97
Duplicate or not: not duplicate

Title of entity1: edge tech proshot 4gb sdhc class 6 memory card; category of entity1: usb drives; brand of entity1: edge tech; model number of entity1: pe209780; price of entity1: 10.88
Title of enti

In [16]:
labels, df, metrics_list = agent.run('test.csv', max_items=100)

Metric: auroc: 0.9725
Actual Cost: 0.0043


Looking at the table above, we can see that setting the confidence threshold at `0.71` lets us label at 100% accuracy with a completion rate of 91%. In other words, we would discard every data point whose confidence score is below `0.71` (around 9% of all samples). In exchange, the examples we keep form a very high-quality labeled dataset.
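
The trade-off between accuracy and completion rate can be sketched in a few lines. The function below takes per-example confidence scores and correctness flags and reports both numbers for a given threshold. It is an illustrative sketch with a hypothetical name (`accuracy_and_completion`), it does not reproduce how Autolabel builds its own table, and it assumes an example is kept when its confidence is at least the threshold.

```python
from typing import Sequence, Tuple


def accuracy_and_completion(
    confidences: Sequence[float],
    is_correct: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (accuracy on kept examples, fraction of examples kept)."""
    if len(confidences) != len(is_correct):
        raise ValueError("confidences and is_correct must have the same length")
    # Keep only the examples whose confidence meets the threshold.
    kept = [ok for conf, ok in zip(confidences, is_correct) if conf >= threshold]
    completion = len(kept) / len(confidences) if confidences else 0.0
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return accuracy, completion


# Synthetic values for illustration only, not taken from the run above.
toy_confidences = [0.90, 0.80, 0.75, 0.60, 0.50]
toy_is_correct = [True, True, True, False, True]
toy_accuracy, toy_completion = accuracy_and_completion(
    toy_confidences, toy_is_correct, threshold=0.71
)
```

Sweeping the threshold over the observed confidence values and recording both numbers at each step gives the kind of accuracy versus completion view used to pick a cutoff.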