## Exploring the Ethos dataset using Autolabel

#### Set up the API keys for the providers you want to use

In [1]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = 'sk-xxxxxxxxxxxxxxxxx'

#### Install the autolabel library

In [2]:
!pip install 'refuel-autolabel[openai]'





#### Download the dataset

In [1]:
from autolabel import get_data

get_data('ethos')

This downloads two datasets:
* `test.csv`: This is the larger dataset we are trying to label using LLMs
* `seed.csv`: This is a small dataset where we already have human-provided labels
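
Before labeling, it can help to look at a few labeled rows. The sketch below assumes each row has a `text` column and an `output_dict` column holding a JSON object of attribute labels. The column names come from the config used later in this notebook; the JSON encoding, the helper name `read_labeled_rows`, and the sample row are assumptions, so check them against your own `seed.csv`.

```python
import csv
import io
import json


def read_labeled_rows(handle, text_column="text", label_column="output_dict"):
    """Yield (text, labels) pairs from an open CSV file handle.

    Assumes the label column stores one JSON object per row.
    """
    for row in csv.DictReader(handle):
        yield row[text_column], json.loads(row[label_column])


# Made-up stand-in for seed.csv. With the real file, pass
# open("seed.csv", newline="") instead of this in-memory sample.
sample = io.StringIO(
    'text,output_dict\n'
    '"an example sentence","{""violence"": ""not_violent"", ""gender"": ""false""}"\n'
)
rows = list(read_labeled_rows(sample))
print(rows[0])
```

If `json.loads` fails on the real file, the labels are stored in a different encoding than this sketch assumes.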

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `config.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

### First labeling run

In [2]:
import json

from autolabel import LabelingAgent

In [3]:
# load the config
with open('config_ethos.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `attribute_extraction` (since we are extracting several attributes from each piece of text)
* `model`: `{'provider': 'openai', 'name': 'gpt-3.5-turbo'}` (use a specific OpenAI model)
* `prompt.task_guidelines`: `'You are an expert at classifying hate speech and identifying the type of hate speech...'` (how we describe the task to the LLM)
* `prompt.attributes`: the attributes to extract (`violence`, `directed_vs_generalized`, and `gender`), each with a `description` and the list of `options` to choose from
* `prompt.few_shot_examples`: `seed.csv` (the labeled examples used to build the prompt)
* `prompt.few_shot_num`: 3 (how many labeled examples to provide to the LLM)

In [4]:
config

{'task_name': 'EthosAttributeExtraction',
 'task_type': 'attribute_extraction',
 'dataset': {'label_column': 'output_dict',
  'text_column': 'text',
  'delimiter': ','},
 'model': {'provider': 'openai', 'name': 'gpt-3.5-turbo'},
 'prompt': {'task_guidelines': 'You are an expert at classifying hate speech and identifying the type of hate speech. Read the following tweets and extract the following attributes from the text.',
  'attributes': [{'name': 'violence',
    'options': ['not_violent', 'violent'],
    'description': 'If the tweet mentions violence towards a person or a group.'},
   {'name': 'directed_vs_generalized',
    'options': ['generalized', 'directed'],
    'description': 'If the hate speech is generalized towards a group or directed towards a specific person.'},
   {'name': 'gender',
    'options': ['true', 'false'],
    'description': 'If the hate speech uses gendered language and attacks a particular gender.'}],
  'few_shot_examples': 'seed.csv',
  'few_shot_selection': 
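
Each entry under `attributes` in the config above defines a closed set of `options`, so a returned label can be checked against it. The following is a minimal sketch of such a check; `find_invalid_labels` is a hypothetical helper written for illustration and is not part of Autolabel.

```python
def find_invalid_labels(attributes, labels):
    """Return {attribute_name: problem} for every missing or unexpected label."""
    problems = {}
    for attribute in attributes:
        name = attribute["name"]
        if name not in labels:
            problems[name] = "missing"
        elif labels[name] not in attribute["options"]:
            problems[name] = f"unexpected value {labels[name]!r}"
    return problems


# Attribute names and options copied from the config above.
attributes = [
    {"name": "violence", "options": ["not_violent", "violent"]},
    {"name": "directed_vs_generalized", "options": ["generalized", "directed"]},
    {"name": "gender", "options": ["true", "false"]},
]

# Hypothetical model outputs used only to exercise the check.
good = {"violence": "not_violent", "directed_vs_generalized": "directed", "gender": "false"}
bad = {"violence": "maybe", "gender": "true"}

print(find_invalid_labels(attributes, good))
print(find_invalid_labels(attributes, bad))
```

An empty result means every attribute is present and uses one of its allowed options.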

In [5]:
# create an agent for labeling
agent = LabelingAgent(config=config)

In [6]:
# dry-run -- this tells us how much this will cost and shows an example prompt
from autolabel import AutolabelDataset
ds = AutolabelDataset("test.csv", config=config)
agent.plan(ds)

Output()

You are an expert at classifying hate speech and identifying the type of hate speech. Read the following tweets and extract the following attributes from the text.

You will return the extracted attributes as a json with the following keys:
{
    "violence": "If the tweet mentions violence towards a person or a group.\nOptions:\nnot_violent,violent",
    "directed_vs_generalized": "If the hate speech is generalized towards a group or directed towards a specific person.\nOptions:\ngeneralized,directed",
    "gender": "If the hate speech uses gendered language and attacks a particular gender.\nOptions:\ntrue,false"
}

Some examples with their output answers are provided below:

Text: You should know women's sports are a joke
Output: {"violence": "not_violent", "directed_vs_generalized": "generalized", "gender": "true"}

Text: You look like Sloth with deeper Down’s syndrome
Output: {"violence": "not_violent", "directed_vs_generalized": "directed", "gender": "false"}

Text: You look like R
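
The JSON block in the example prompt above pairs each attribute name with its description followed by its options. The sketch below reproduces that layout from the config's `attributes` list. It is inferred from the printed prompt and is not Autolabel's actual implementation; `describe_attributes` is a hypothetical name.

```python
import json


def describe_attributes(attributes):
    """Map each attribute name to its description followed by its options."""
    described = {}
    for attribute in attributes:
        options = ",".join(attribute["options"])
        described[attribute["name"]] = (
            attribute["description"] + "\nOptions:\n" + options
        )
    return described


# One attribute copied from the config shown earlier in this notebook.
attributes = [
    {
        "name": "violence",
        "options": ["not_violent", "violent"],
        "description": "If the tweet mentions violence towards a person or a group.",
    }
]

print(json.dumps(describe_attributes(attributes), indent=4))
```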

In [7]:
# now, do the actual labeling
ds = agent.run(ds, max_items=100)

2023-09-07 18:24:05 autolabel.labeler INFO: Task run already exists.


Output()

Actual Cost: 0.0


We are at 88.9% accuracy when labeling the first 100 examples. Let's see if we can use confidence scores to improve accuracy further by removing the less confident examples from our labeled set.

### Compute confidence scores


In [8]:
# Start computing confidence scores (using Refuel's LLMs)
os.environ['REFUEL_API_KEY'] = 'xxxxxxxxxxxxxxxxx'

In [9]:
# set `compute_confidence` -> True
config["model"]["compute_confidence"] = True

In [10]:
agent = LabelingAgent(config=config)

In [11]:
from autolabel import AutolabelDataset
ds = AutolabelDataset("test.csv", config=config)
agent.plan(ds)

Output()

You are an expert at classifying hate speech and identifying the type of hate speech. Read the following tweets and extract the following attributes from the text.

You will return the extracted attributes as a json with the following keys:
{
    "violence": "If the tweet mentions violence towards a person or a group.\nOptions:\nnot_violent,violent",
    "directed_vs_generalized": "If the hate speech is generalized towards a group or directed towards a specific person.\nOptions:\ngeneralized,directed",
    "gender": "If the hate speech uses gendered language and attacks a particular gender.\nOptions:\ntrue,false"
}

Some examples with their output answers are provided below:

Text: You should know women's sports are a joke
Output: {"violence": "not_violent", "directed_vs_generalized": "generalized", "gender": "true"}

Text: You look like Sloth with deeper Down’s syndrome
Output: {"violence": "not_violent", "directed_vs_generalized": "directed", "gender": "false"}

Text: You look like R

Looking at the table above, we can see that if we set the confidence threshold at `0.7573`, we can label at 82% accuracy with a completion rate of 74%. This means we would ignore all data points whose confidence score falls below that threshold (around 26% of all samples). The trade-off is a smaller labeled dataset, restricted to the examples the model is most confident about.

In [12]:
ds = agent.run(ds, max_items=100)

2023-09-07 18:24:33 autolabel.labeler INFO: Task run already exists.


Output()

Actual Cost: 0.0


Looking at the table above, we can see that if we set the confidence threshold at `0.7314`, we can label at 90.45% accuracy with a completion rate of 80%. This means we would ignore all data points whose confidence score is less than `0.7314` (around 20% of all samples). In exchange for leaving those samples unlabeled, accuracy on the remaining ones rises from 88.9% to 90.45%.
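
The accuracy and completion-rate trade-off described above can be sketched for any threshold as follows. The `(confidence, is_correct)` pairs are made up for illustration and are not results from this notebook; `evaluate_threshold` is a hypothetical helper, not an Autolabel function.

```python
def evaluate_threshold(records, threshold):
    """Return (accuracy, completion_rate) over rows with confidence >= threshold.

    records: list of (confidence, is_correct) pairs.
    """
    kept = [is_correct for confidence, is_correct in records if confidence >= threshold]
    completion_rate = len(kept) / len(records)
    # Accuracy is undefined when every row is filtered out.
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return accuracy, completion_rate


# Made-up data: confidence score and whether the label matched ground truth.
records = [
    (0.95, True), (0.90, True), (0.80, True), (0.75, False), (0.60, True),
    (0.40, False), (0.85, True), (0.70, True), (0.55, False), (0.99, True),
]

for threshold in (0.0, 0.7):
    accuracy, completion = evaluate_threshold(records, threshold)
    print(f"threshold={threshold}: accuracy={accuracy:.3f}, completion={completion:.2f}")
```

Raising the threshold discards low-confidence rows, which lowers the completion rate and, when confidence tracks correctness, raises accuracy on what remains.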