## Exploring the painting style classification dataset using Autolabel

#### Set up the API keys for the providers you want to use

In [3]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = 'sk-'
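
The `'sk-'` value above is only a placeholder, and a run started with it will fail at the first API call. A minimal sketch of an early check, assuming you prefer to fail before any labeling starts; `require_api_key` is a hypothetical helper, not part of Autolabel:

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment variable `name`.

    Hypothetical helper for this notebook; raises if the key looks unset.
    """
    key = os.environ.get(name, "")
    # 'sk-' on its own is the placeholder used in this notebook, not a real key
    if key in ("", "sk-"):
        raise RuntimeError(f"{name} is not set; provide a real key before labeling")
    return key
```

Calling `require_api_key()` right after setting the environment variable surfaces a missing key immediately instead of partway through a run.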

In [1]:
from autolabel import get_data

get_data('painting-style-classification')

Downloading example dataset from https://autolabel-benchmarking.s3.us-west-2.amazonaws.com/painting-style-classification/seed.csv to seed.csv...
Downloading example dataset from https://autolabel-benchmarking.s3.us-west-2.amazonaws.com/painting-style-classification/test.csv to test.csv...
100% [........................................] [57543/57543] bytes

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `image_classification.json`, loaded into `config` below)
* Next, we do a dry run on our dataset using the LLM specified in the configuration by running `agent.plan`
* Finally, we run the labeling with `agent.run`

### First labeling run

In [4]:
import json

from autolabel import LabelingAgent

In [5]:
# load the config
with open('image_classification.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `classification` (since it's a classification task)
* `model`: `{'provider': 'openai_vision', 'name': 'gpt-4-vision-preview'}` (use a specific OpenAI vision model)
* `dataset.image_url_column`: `image_url` (the column that holds the URL of each painting's image)
* `prompt.task_guidelines`: `'Given the description of a painting, predict the style...` (how we describe the task to the LLM)
* `prompt.labels`: `['Impressionism', 'Color Field Painting', 'Early Renaissance', ...]` (the full list of labels to choose from)
* `prompt.few_shot_num`: how many labeled examples to provide to the LLM

In [6]:
config

{'task_name': 'ImageClassification',
 'task_type': 'classification',
 'dataset': {'label_column': 'label',
  'delimiter': ',',
  'image_url_column': 'image_url'},
 'model': {'provider': 'openai_vision', 'name': 'gpt-4-vision-preview'},
 'prompt': {'task_guidelines': "Given the description of a painting, predict the style of the paining. You will be first shown multiple descriptions and their styles. For the last input, you'll be shown an image along with the description and your job is to predict the style for this input. Your answer must be from one of the following categories:\n{labels}",
  'labels': ['Impressionism',
   'Color Field Painting',
   'Early Renaissance',
   'Fauvism',
   'Minimalism',
   'Romanticism',
   'Mannerism Late Renaissance',
   'Post Impressionism',
   'Contemporary Realism',
   'Pointillism',
   'Ukiyo e',
   'Abstract Expressionism',
   'Analytical Cubism',
   'Art Nouveau Modern',
   'Expressionism',
   'High Renaissance',
   'Cubism',
   'Naive Art Primiti
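
The `task_guidelines` string above ends with a `{labels}` placeholder, which stands in for the entries of `prompt.labels`. The sketch below illustrates that substitution with plain `str.format` on a trimmed config of three labels. It assumes the labels are listed one per line; Autolabel's own formatting may differ, and `sketch_config` and `render_guidelines` are hypothetical names used only for illustration.

```python
# Trimmed, illustrative config: same key layout as the notebook's config,
# but with shortened guidelines and only three of the labels.
sketch_config = {
    "task_type": "classification",
    "model": {"provider": "openai_vision", "name": "gpt-4-vision-preview"},
    "prompt": {
        "task_guidelines": (
            "Predict the style of the painting. Your answer must be from "
            "one of the following categories:\n{labels}"
        ),
        "labels": ["Impressionism", "Fauvism", "Cubism"],
    },
}


def render_guidelines(config: dict) -> str:
    """Fill the {labels} placeholder with the configured label list."""
    prompt = config["prompt"]
    # Assumption: one label per line; the library may format the list differently
    return prompt["task_guidelines"].format(labels="\n".join(prompt["labels"]))


rendered = render_guidelines(sketch_config)
print(rendered)
```

Keeping the label list in one place and referencing it through the placeholder means the guidelines and the allowed answers cannot drift apart when labels are added or removed.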

In [7]:
# create an agent for labeling
agent = LabelingAgent(config=config)

In [8]:
from autolabel import AutolabelDataset

# dry-run -- this tells us how much this will cost and shows an example prompt
ds = AutolabelDataset("data/painting-style-classification/test.csv", config=config)
agent.plan(ds)


In [9]:
# now, do the actual labeling (max_items=10 limits the run to 10 rows)
ds = agent.run(ds, max_items=10)

