# Labeling the [banking](https://huggingface.co/datasets/banking77) dataset using Autolabel

This is a multi-class classification task: the inputs are customer service queries, and each one must be labeled with one of 77 intents.

## Install Autolabel
Also set up your OpenAI API key, since we'll be using `gpt-3.5-turbo` as our LLM for labeling.

In [None]:
!pip install 'refuel-autolabel[openai]'

In [1]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = '<your-openai-api-key>'
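
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key was already exported in the shell that launched the notebook; `require_api_key` is a hypothetical helper, not part of Autolabel:

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key
```

Calling `require_api_key()` in the first cell fails fast with a clear message instead of surfacing an authentication error midway through a labeling run.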

## Download the dataset

This dataset can be downloaded through Autolabel.

In [2]:
from autolabel import get_data

get_data('banking')

This downloads two datasets:
* `test.csv`: This is the larger dataset we are trying to label using LLMs
* `seed.csv`: This is a small dataset where we already have human-provided labels

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `config.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

### First labeling run

In [2]:
import json

from autolabel import LabelingAgent

In [3]:
# load the config
with open('config_faire_themes.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `classification` (since it's a classification task)
* `model`: `{'provider': 'openai', 'name': 'gpt-4'}` (use a specific OpenAI model)
* `prompt.task_guidelines`: `'Given an online product title and description, predict the stationary theme of the product...'` (how we describe the task to the LLM)
* `prompt.labels`: `['Abstract & Geometric', 'Alphabet, Dates & Numbers', 'Animals', ...]` (the full list of labels to choose from)
* `prompt.few_shot_num`: 10 (how many labeled examples to provide to the LLM)
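
Before creating an agent, it can help to confirm that the loaded config contains the keys listed above. A minimal sketch in plain Python; `missing_config_keys` is a hypothetical helper, and the list of required keys is taken only from this notebook, not from Autolabel's own schema:

```python
def missing_config_keys(
    config,
    required=(
        "task_type",
        "model.provider",
        "model.name",
        "prompt.task_guidelines",
        "prompt.labels",
    ),
):
    """Return the dotted key paths from `required` that are absent from `config`."""
    missing = []
    for path in required:
        node = config
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


# Illustrative config with made-up values
example = {
    "task_type": "classification",
    "model": {"provider": "openai", "name": "gpt-4"},
    "prompt": {"task_guidelines": "...", "labels": ["Animals", "Fashion"]},
}
print(missing_config_keys(example))  # prints []
```

An empty list means every expected key is present; anything else names the paths to fix in the JSON file.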

In [4]:
config

{'task_name': 'FaireThemesClassification',
 'task_type': 'classification',
 'dataset': {'label_column': 'label', 'delimiter': ','},
 'model': {'provider': 'openai', 'name': 'gpt-4'},
 'prompt': {'task_guidelines': 'Given an online product title and description, predict the stationary theme of the product. If you cannot infer the theme from the product title and product description, select N/A as the answer. Provide ONLY one CONCISE answer WITHOUT explanation. If there are multiple possible answers, choose the one that is most closely associated with the given product. If none of the options apply, output N/A. Categories: {labels}.',
  'output_guidelines': 'You will answer with just the the correct output label and nothing else.',
  'labels': ['Abstract & Geometric',
   'Alphabet, Dates & Numbers',
   'Animals',
   'Art & Music',
   'Beach & Coastal',
   'Books & Reading',
   'Busy Scenes',
   'Cats, Dogs & Other Pets',
   'Family & Friendship',
   'Fashion',
   'Female Empowerment',
  

In [5]:
# create an agent for labeling
agent = LabelingAgent(config=config)

In [11]:
# dry-run -- this tells us how much this will cost and shows an example prompt
from autolabel import AutolabelDataset
ds = AutolabelDataset("test_set_themes_processed_2.csv", config=config)
agent.plan(ds)


In [12]:
# now, do the actual labeling
ds = agent.run(ds)


2023-10-09 07:30:35 openai INFO: error_code=rate_limit_exceeded error_message='Rate limit reached for default-gpt-4 in organization org-etZVkYhAIYGmLcxLmarMmAPo on tokens per min. Limit: 40000 / min. Please try again in 1ms. Contact us through our help center at help.openai.com if you continue to have issues.' error_param=None error_type=tokens message='OpenAI API error received' stream_error=False
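
The log output above shows rate-limit responses from the OpenAI API. Whether and how Autolabel retries them is not shown in this notebook. For code that calls a rate-limited API directly, a common pattern is exponential backoff with jitter. A generic sketch, in which `RateLimited`, `call_with_backoff`, and all parameters are hypothetical names rather than part of Autolabel or the `openai` package:

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for whatever rate-limit error the client raises."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimited, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error to the caller
            delay = base_delay * (2 ** attempt)
            # up to 10% random jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the waiting behavior can be checked without actually pausing.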

In [13]:
ds.save("themes_test_set_inference.csv")

In [14]:
import json

import pandas as pd

dataset = pd.read_csv("themes_test_set_inference.csv")

# flatten newlines to '|' and JSON-encode each input so it stays on one line
dataset['input'] = dataset['input'].apply(lambda text: json.dumps(text.replace('\n', '|')))

# missing values (including the literal string "N/A", which pandas parses as NaN
# by default) come back as floats; restore them as "N/A" in both label columns
dataset[['label', 'label_label']] = dataset[['label', 'label_label']].fillna("N/A")

# drop the columns we don't need, then write a tab-separated file
dataset = dataset.drop(columns=["label_annotation", "label_prompt"])
dataset.to_csv("themes_test_set_inference_all.csv", index=False, sep="\t")
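
With reference labels and predictions side by side in one file, a quick agreement check is straightforward. A minimal sketch, assuming `label` holds the reference label and `label_label` holds the LLM's prediction (an assumption about the output column naming that this notebook does not confirm); `agreement_rate` is a hypothetical helper and the rows are made up for illustration:

```python
def agreement_rate(rows, truth_col="label", pred_col="label_label"):
    """Fraction of rows where the predicted label equals the reference label."""
    rows = list(rows)
    if not rows:
        return 0.0
    matches = sum(1 for row in rows if row[truth_col] == row[pred_col])
    return matches / len(rows)


# Hypothetical rows, not taken from the real output file
sample = [
    {"label": "Animals", "label_label": "Animals"},
    {"label": "Fashion", "label_label": "Art & Music"},
    {"label": "N/A", "label_label": "N/A"},
    {"label": "Animals", "label_label": "Animals"},
]
print(agreement_rate(sample))  # prints 0.75
```

On the DataFrame from the cell above, `agreement_rate(dataset.to_dict('records'))` would compute the same figure over the full file.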

In [22]:
import json

import pandas as pd

dataset = pd.read_csv("complete_seed.csv")

# build one JSON-encoded input string from the product name and description
dataset['input'] = [
    json.dumps(f"Product Name: {name} Description: {description}")
    for name, description in zip(dataset['name'], dataset['description'])
]
# rows without a manual label are read as NaN; mark them "N/A"
dataset['label'] = dataset['manual label'].fillna("N/A")

dataset = dataset.drop(columns=["name", "description", "manual label", "ignore"])
dataset.to_csv("complete_seed_processed.csv", index=False)

