## Exploring the SemEval 2018 dataset using Autolabel

#### Set up the API keys for the providers you want to use

In [None]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = 'sk-xxxxxxxxxxxxxxxxx'

#### Install the autolabel library

In [None]:
!pip install 'refuel-autolabel[openai]'

#### Download the dataset

In [1]:
from autolabel import get_data

get_data('twitter_emotion_detection')

2023-06-27 15:46:06 autolabel.utils ERROR: twitter_emotion_detection not in list of available datasets: ['banking', 'civil_comments', 'ledgar', 'walmart_amazon', 'company', 'squad_v2', 'sciq', 'conll2003', 'movie_reviews']. Exiting...


This downloads two files:
* `test.csv`: the larger dataset we want to label using LLMs
* `seed.csv`: a small dataset that already has human-provided labels

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `config.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

In [2]:
import json

from autolabel import LabelingAgent

In [3]:
# load the config
with open('config_twitter_emotion_detection.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `multi_label_classification` (since this is a multi-label classification task)
* `model`: `{'provider': 'openai', 'name': 'gpt-3.5-turbo'}` (use a specific OpenAI model)
* `prompt.labels`: `['neutral', 'anger', 'anticipation', 'disgust', 'fear', ...]` (the full list of labels to choose from)
* `prompt.task_guidelines`: `'You are an expert at classifying tweets as...'` (how we describe the task to the LLM)
* `prompt.few_shot_num`: 5 (how many labeled examples to provide to the LLM)
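To make the role of these keys concrete, here is a minimal sketch of how a few-shot prompt could be assembled from the `prompt` section of such a config. It is an illustration only, not Autolabel's actual implementation: `build_prompt`, `demo_config`, and `demo_seed` are hypothetical names, the seed row is made up, and the few-shot examples are simply taken from the front of the list rather than selected by semantic similarity.

```python
def build_prompt(prompt_config, seed_examples, example):
    """Assemble a few-shot prompt from an Autolabel-style `prompt` config.

    Hypothetical helper for illustration; not part of the autolabel library.
    `seed_examples` is a list of (text, label_string) pairs.
    """
    # Substitute the label list into the {labels} placeholder, one per line.
    labels_text = "\n".join(prompt_config["labels"])
    guidelines = prompt_config["task_guidelines"].format(labels=labels_text)

    # Render the first `few_shot_num` seed rows (assumption: no similarity search).
    template = prompt_config["example_template"]
    shots = [
        template.format(example=text, label=label)
        for text, label in seed_examples[: prompt_config["few_shot_num"]]
    ]

    # The example to be labeled is rendered with an empty label slot.
    query = template.format(example=example, label="").rstrip()

    parts = [
        guidelines,
        prompt_config["output_guidelines"],
        "Some examples with their output answers are provided below:",
        *shots,
        query,
    ]
    return "\n\n".join(parts)


demo_config = {
    "task_guidelines": "Label the input with one or more of the following categories:\n{labels}",
    "output_guidelines": "Return a comma separated list of labels.",
    "labels": ["anger", "fear", "joy"],
    "few_shot_num": 1,
    "example_template": "Input: {example}\nOutput: {label}",
}
demo_seed = [("This is outrageous", "anger")]  # made-up seed row

prompt = build_prompt(demo_config, demo_seed, "What a lovely day")
print(prompt)
```

The sketch follows the same overall layout as the prompts printed by `agent.plan`: guidelines with the label list, output guidelines, a few labeled examples, then the unlabeled input.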

In [4]:
config

{'task_name': 'EmotionClassification',
 'task_type': 'multi_label_classification',
 'dataset': {'label_column': 'label', 'delimiter': ','},
 'model': {'provider': 'openai', 'name': 'gpt-3.5-turbo'},
 'prompt': {'task_guidelines': 'You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.\nYour job is to correctly label the provided input example into one or more of the following categories:\n{labels}',
  'output_guidelines': 'You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: "label1, label2, label3"',
  'labels': ['neutral',
   'anger',
   'anticipation',
   'disgust',
   'fear',
   'joy',
   'love',
   'optimism',
   'pessimism',
   'sadness',
   'surprise',
   'trust'],
  'few_shot_examples': 'seed.csv',
  'few_shot_selection': 'semantic_similarity',
  'few_shot_num': 5,
  'example_template': 'Input: {example}\nOutput: {label}'}}
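
The `output_guidelines` above ask the model for a comma-separated list of labels in alphabetical order. A minimal sketch of how such an answer could be normalized on the consumer side is shown below. `parse_labels` is a hypothetical helper, not an Autolabel function, and lowercasing and dropping unknown labels are assumptions made for this illustration.

```python
def parse_labels(raw_output, allowed_labels):
    """Turn a comma-separated LLM answer into a sorted list of known labels.

    Hypothetical helper: whitespace is stripped, case is ignored, and
    anything outside `allowed_labels` is silently dropped.
    """
    allowed = set(allowed_labels)
    candidates = {part.strip().lower() for part in raw_output.split(",")}
    return sorted(label for label in candidates if label in allowed)


allowed = ["neutral", "anger", "disgust", "fear", "sadness"]
parsed = parse_labels(" Fear, anger ,not_a_label", allowed)
print(parsed)
```

Normalizing to a sorted list makes two answers comparable regardless of the order or spacing the model used.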

In [5]:
# create an agent for labeling
agent = LabelingAgent(config=config)

In [6]:
agent.plan('test.csv')

Output()

You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.
Your job is to correctly label the provided input example into one or more of the following categories:
neutral
anger
anticipation
disgust
fear
joy
love
optimism
pessimism
sadness
surprise
trust

You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: "label1, label2, label3"

Some examples with their output answers are provided below:

Input: @MaryamNSharif I think just becz u have so much terror in pak nd urself being  a leader u forgot d difference btw a leader nd terrorist !
Output: anger, disgust, fear

Input: In wake of fresh #terror threat and sounding of alert in #Mumbai, praying for safety &amp; security of everybody in the city #Maharashtra #news
Output: fear

Input: Somewhere I rd a rpt tht Pakis wr afraid of TSD &amp; asked it 2 shut dn. Congis obliged &amp; exposed it,hounded them.time 

In [7]:
# now, do the actual labeling
labels, df, metrics_list = agent.run('test.csv', max_items=100)

2023-06-27 15:46:37 autolabel.labeler INFO: Task run already exists.


You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.
Your job is to correctly label the provided input example into one or more of the following categories:
neutral
anger
anticipation
disgust
fear
joy
love
optimism
pessimism
sadness
surprise
trust

You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: "label1, label2, label3"

Some examples with their output answers are provided below:

Input: @MatherFamilys @SDICenter @CAllstadt \n\ndo REDNECKS intellectually intimidate you and force you to be their dancing clown?
Output: anger, disgust, pessimism

Input: @Gielnorian @HedonismGaming cmode grimrail made me want to eat angry bees
Output: anger, disgust

Input: @healeyraine I'm offended, I actually am
Output: anger, sadness

Input: @Bridget_Delia @originalaubs omg that's terrible
Output: anger, disgust, fear, sadness

Input: @loismackey_ @Dory nah way

neutral


Output()

Actual Cost: 0.0302


We reach an F1 score of 0.4889 when labeling the first 100 examples. Let's see if we can use confidence scores to improve F1 further by removing the less confident examples from our labeled set.
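
For intuition about what an F1 score means in a multi-label setting, here is a minimal sketch of one common definition, micro-averaged F1 over label sets. This is an assumption for illustration: the notebook does not state which averaging Autolabel uses, so the number above may be computed differently. `micro_f1` and the sample data are hypothetical.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over per-example label sets.

    Counts true positives, false positives, and false negatives across
    all examples, then combines them into a single precision and recall.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up examples: two tweets with gold and predicted label sets.
gold = [{"anger", "disgust"}, {"fear"}]
pred = [{"anger"}, {"fear", "sadness"}]
score = micro_f1(gold, pred)
print(round(score, 4))
```

Here one label is missed and one is spurious, so precision and recall are both 2/3 and so is the F1.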

### Compute confidence scores

In [8]:
# Start computing confidence scores (using Refuel's LLMs)
os.environ['REFUEL_API_KEY'] = 'sk-xxxxxxxxxxxxxxxxx'

In [9]:
# set `compute_confidence` -> True
config["model"]["compute_confidence"] = True

In [10]:
agent = LabelingAgent(config=config)

In [11]:
agent.plan('test.csv')

Output()

You are an expert at classifying tweets as neutral or one or more of the given emotions that best represent the mental state of the poster.
Your job is to correctly label the provided input example into one or more of the following categories:
neutral
anger
anticipation
disgust
fear
joy
love
optimism
pessimism
sadness
surprise
trust

You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: "label1, label2, label3"

Some examples with their output answers are provided below:

Input: @MaryamNSharif I think just becz u have so much terror in pak nd urself being  a leader u forgot d difference btw a leader nd terrorist !
Output: anger, disgust, fear

Input: In wake of fresh #terror threat and sounding of alert in #Mumbai, praying for safety &amp; security of everybody in the city #Maharashtra #news
Output: fear

Input: Somewhere I rd a rpt tht Pakis wr afraid of TSD &amp; asked it 2 shut dn. Congis obliged &amp; exposed it,hounded them.time 

In [12]:
labels, df, metrics_list = agent.run('test.csv', max_items=100)

Output()

Metric: auroc: 0.7582
Actual Cost: 0.0029


Looking at the table above, we can see that if we set the confidence threshold at `0.6805`, we can label at 0.65 F1 with a completion rate of 30%. This means we would ignore all data points whose confidence score is below `0.6805` (around 70% of all samples). In exchange, we get a higher-quality labeled dataset.
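
The thresholding idea can be sketched in a few lines. The code below is an illustration under stated assumptions: `filter_by_confidence` is a hypothetical helper, the predictions are made-up dictionaries with `label` and `confidence` keys, and it does not reflect the actual structure of the objects returned by `agent.run`.

```python
def filter_by_confidence(predictions, threshold):
    """Keep predictions at or above `threshold` and report the completion rate.

    Completion rate is the fraction of examples that survive the filter.
    """
    kept = [p for p in predictions if p["confidence"] >= threshold]
    completion_rate = len(kept) / len(predictions) if predictions else 0.0
    return kept, completion_rate


# Made-up predictions for illustration only.
predictions = [
    {"label": "anger, disgust", "confidence": 0.91},
    {"label": "fear", "confidence": 0.42},
    {"label": "joy", "confidence": 0.70},
    {"label": "neutral", "confidence": 0.55},
    {"label": "sadness", "confidence": 0.30},
]

kept, completion_rate = filter_by_confidence(predictions, threshold=0.6805)
print(len(kept), completion_rate)
```

Raising the threshold trades coverage for quality: fewer examples are kept, but the kept ones are those the model was more confident about.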