# Labeling the [banking](https://huggingface.co/datasets/banking77) dataset using Autolabel

This is a multi-class classification task: the inputs are customer service queries, and each one must be labeled with one of 77 intents.

## Install Autolabel
Also, set up your OpenAI API key, since we'll be using `gpt-3.5-turbo-instruct` as our LLM for labeling.

In [None]:
!pip install 'refuel-autolabel[openai]'

In [1]:
import os

# provide your own OpenAI API key here (never commit a real key to a shared notebook)
os.environ['OPENAI_API_KEY'] = 'sk-xxxxxxxxxxxx'
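
Hardcoding a key in a notebook makes it easy to leak. As a minimal sketch of an alternative, the hypothetical helper below (`ensure_api_key` is not part of Autolabel) reuses a key that is already in the environment and only prompts for one when it is missing.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key stored in the environment, prompting only if it is missing.

    `ensure_api_key` is a hypothetical helper for this notebook, not an Autolabel API.
    """
    key = os.environ.get(var_name)
    if not key:
        # getpass hides the typed value, so the key never appears in cell output
        key = prompt_fn(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key

# Usage (prompts once if the variable is unset):
# ensure_api_key("OPENAI_API_KEY")
```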


## Download the dataset

This dataset is available to download via Autolabel.

In [None]:
from autolabel import get_data

get_data('banking')

This downloads two datasets:
* `test.csv`: This is the larger dataset we are trying to label using LLMs
* `seed.csv`: This is a small dataset where we already have human-provided labels
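
Before labeling, it can help to check how the labels are distributed in each file. The sketch below uses only the standard library and assumes the label column is named `label` and the delimiter is `,`, matching the `dataset` section of the config shown later in this notebook. `label_distribution` is a hypothetical helper, not an Autolabel function.

```python
import csv
from collections import Counter


def label_distribution(path, label_column="label", delimiter=","):
    """Count how often each label appears in a CSV file.

    Assumes the file has a header row containing `label_column`.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        return Counter(row[label_column] for row in reader)

# Usage, once the files have been downloaded:
# label_distribution("seed.csv").most_common(5)
```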

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `config_banking.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config_banking.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

### First labeling run

In [2]:
import json

from autolabel import LabelingAgent

In [3]:
# load the config
with open('config_banking.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `classification` (since it's a classification task)
* `model`: `{'provider': 'openai', 'name': 'gpt-3.5-turbo-instruct'}` (use a specific OpenAI model)
* `prompt.task_guidelines`: `'You are an expert at understanding bank customers support complaints and queries...` (how we describe the task to the LLM)
* `prompt.labels`: `['age_limit', 'apple_pay_or_google_pay', 'atm_support', ...]` (the full list of labels to choose from)
* `prompt.few_shot_num`: 10 (how many labeled examples to provide to the LLM)
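
The dotted names in the list above refer to nested keys, so `prompt.labels` means `config['prompt']['labels']`. As a small sketch, the hypothetical helper `get_nested` below resolves such a path. The toy dictionary only mimics the shape of the real config, with shortened values.

```python
from functools import reduce


def get_nested(config, dotted_key):
    """Look up a dotted path such as 'prompt.labels' in a nested dict."""
    return reduce(lambda node, key: node[key], dotted_key.split("."), config)


# Toy config with the same shape as the keys listed above (values shortened).
example_config = {
    "task_type": "classification",
    "prompt": {"labels": ["age_limit", "atm_support"], "few_shot_num": 10},
}

print(get_nested(example_config, "prompt.few_shot_num"))  # prints 10
```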

In [4]:
config

{'task_name': 'BankingComplaintsClassification',
 'task_type': 'classification',
 'dataset': {'label_column': 'label', 'delimiter': ','},
 'model': {'provider': 'openai', 'name': 'gpt-3.5-turbo-instruct'},
 'prompt': {'task_guidelines': 'You are an expert at understanding bank customers support complaints and queries.\nYour job is to correctly classify the provided input example into one of the following categories.\nCategories:\n{labels}',
  'output_guidelines': 'You will answer with just the the correct output label and nothing else.',
  'labels': ['activate_my_card',
   'age_limit',
   'apple_pay_or_google_pay',
   'atm_support',
   'automatic_top_up',
   'balance_not_updated_after_bank_transfer',
   'balance_not_updated_after_cheque_or_cash_deposit',
   'beneficiary_not_allowed',
   'cancel_transfer',
   'card_about_to_expire',
   'card_acceptance',
   'card_arrival',
   'card_delivery_estimate',
   'card_linking',
   'card_not_working',
   'card_payment_fee_charged',
   'card_paym

In [5]:
# create an agent for labeling
agent = LabelingAgent(config=config)

In [6]:
# dry-run -- this tells us how much this will cost and shows an example prompt
from autolabel import AutolabelDataset
ds = AutolabelDataset("test.csv", config=config)
agent.plan(ds)

In [7]:
# now, do the actual labeling
ds = agent.run(ds, max_items=100)

We are at 76% accuracy when labeling the first 100 examples. Let's see if we can use confidence scores to improve accuracy further by removing the less confident examples from our labeled set.

## Compute confidence scores

In [None]:
# Start computing confidence scores (using Refuel's LLMs)
os.environ['REFUEL_API_KEY'] = 'sk-xxxxxxxxxxxx'

In [None]:
# set `compute_confidence` -> True
config["model"]["compute_confidence"] = True

In [None]:
agent = LabelingAgent(config=config)

In [None]:
from autolabel import AutolabelDataset
ds = AutolabelDataset("test.csv", config=config)
agent.plan(ds)

In [None]:
ds = agent.run(ds, max_items=100)

Looking at the table printed by the run above, we can see that if we set the confidence threshold at `0.9305`, we can label at 90% accuracy with a completion rate of 74%. This means we would ignore every data point whose confidence score is below `0.9305` (around 26% of all samples). In exchange, the labels we keep form a much higher-quality dataset.
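
To make the trade-off concrete, here is a minimal sketch of how accuracy and completion rate at a given threshold could be computed from (gold label, predicted label, confidence) triples. The function name `threshold_metrics` and the toy rows are illustrative assumptions. This is not how Autolabel computes its table internally.

```python
def threshold_metrics(rows, threshold):
    """Return (accuracy, completion_rate) for predictions kept at `threshold`.

    rows: iterable of (gold_label, predicted_label, confidence) tuples.
    Rows with confidence below the threshold are dropped.
    """
    rows = list(rows)
    kept = [(gold, pred) for gold, pred, conf in rows if conf >= threshold]
    completion_rate = len(kept) / len(rows) if rows else 0.0
    accuracy = sum(gold == pred for gold, pred in kept) / len(kept) if kept else 0.0
    return accuracy, completion_rate


# Toy data: two confident correct labels, one wrong and one correct low-confidence label.
toy_rows = [
    ("card_arrival", "card_arrival", 0.98),
    ("age_limit", "age_limit", 0.95),
    ("atm_support", "card_not_working", 0.60),
    ("cancel_transfer", "cancel_transfer", 0.40),
]

accuracy, completion = threshold_metrics(toy_rows, threshold=0.9)
print(accuracy, completion)  # prints 1.0 0.5
```

Raising the threshold generally trades completion rate for accuracy, so a sweep over candidate thresholds shows where the trade-off is acceptable for a given use case.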