# Labeling the [Civil Comments](https://huggingface.co/datasets/civil_comments) dataset using Autolabel

This dataset contains public comments collected from news websites. The task is binary classification: is the provided comment toxic or not?

## Install Autolabel
Also set up your OpenAI API key, since we'll be using `gpt-3.5-turbo` as our LLM for labeling.

In [None]:
!pip3 install 'refuel-autolabel[openai]'

In [1]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = 'sk-xxxxxxxxxxxxxxxxxxxxxx'
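
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. As an alternative, here is a minimal sketch that assumes the key was exported in the shell before the notebook started. `require_env` is a hypothetical helper written for this example, not part of Autolabel.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return value
```

Calling `require_env("OPENAI_API_KEY")` before creating the agent surfaces a missing key early, instead of failing on the first API request.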

## Download the dataset

In [2]:
from autolabel import get_data

get_data('civil_comments')

Downloading seed example dataset to "seed.csv"...
100% [..............................................................................] 65757 / 65757

Downloading test dataset to "test.csv"...
100% [............................................................................] 610663 / 610663

This downloads two datasets:

* `test.csv`: This is the larger dataset we are trying to label using LLMs
* `seed.csv`: This is a small dataset where we already have human-provided labels
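
Before labeling, it can help to check how balanced the human-provided labels are. The sketch below counts labels with the standard library only. It assumes the CSV has a `label` column (matching `label_column` in the configuration used later), and the in-memory rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, label_column="label"):
    """Count how often each label occurs in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Tiny in-memory stand-in for seed.csv (made-up rows, assumed column names).
sample = io.StringIO(
    "example,label\n"
    "Thanks for the thoughtful article.,not toxic\n"
    "You are an idiot.,toxic\n"
    "I disagree with this policy.,not toxic\n"
)
counts = label_counts(sample)
print(counts)  # Counter({'not toxic': 2, 'toxic': 1})
```

For the real file, the same function can be called as `label_counts(open("seed.csv", newline=""))`.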

## Start the labeling process!
Labeling with Autolabel is a 3-step process:

* First, we specify a labeling configuration (see `config.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

### Experiment #1: Very simple guidelines

In [1]:
from autolabel import LabelingAgent

In [6]:
config = {
    "task_name": "ToxicCommentClassification",
    "task_type": "classification",
    "dataset": {
        "label_column": "label",
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo"
    },
    "prompt": {
        "task_guidelines": "Is the provided comment 'toxic' or 'not toxic'?",
        "labels": [
            "toxic",
            "not toxic"
        ],
        "example_template": "Input: {example}\nOutput: {label}"
    }
}

Let's review the configuration file above. You'll notice the following useful keys:

* `task_type`: `classification` (since it's a classification task)
* `model`: `{'provider': 'openai', 'name': 'gpt-3.5-turbo'}` (use a specific OpenAI model)
* `prompt.task_guidelines`: Is the provided comment 'toxic' or 'not toxic'? (how we describe the task to the LLM)
* `prompt.labels`: ['toxic', 'not toxic'] (the two labels to choose from)
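
A quick sanity check on the configuration can catch typos before any API calls are made. The sketch below is a hypothetical helper (`check_config` is not part of Autolabel). It assumes that every `{placeholder}` in `example_template` should correspond to a dataset column.

```python
from string import Formatter


def check_config(config, dataset_columns):
    """Return a list of human-readable problems found in a labeling config."""
    problems = []
    prompt = config.get("prompt", {})

    labels = prompt.get("labels", [])
    if len(labels) < 2:
        problems.append("a classification task needs at least two labels")
    if len(set(labels)) != len(labels):
        problems.append("labels must be unique")

    # Collect the {placeholders} used in the example template.
    template = prompt.get("example_template", "")
    fields = {name for _, name, _, _ in Formatter().parse(template) if name}
    missing = fields - set(dataset_columns)
    if missing:
        problems.append(f"template fields missing from dataset: {sorted(missing)}")
    return problems


demo_config = {
    "prompt": {
        "labels": ["toxic", "not toxic"],
        "example_template": "Input: {example}\nOutput: {label}",
    }
}
print(check_config(demo_config, ["example", "label"]))  # []
print(check_config(demo_config, ["text", "label"]))
```

The second call reports that `example` is not a column of the dataset, which is the kind of mismatch that is otherwise only discovered at prompt-construction time.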

In [7]:
# create an agent for labeling
agent = LabelingAgent(config=config)

In [8]:
# dry-run -- this tells us how much this will cost and shows an example prompt
agent.plan('test.csv')


Is the provided comment 'toxic' or 'not toxic'?

You will return the answer with just one element: "the correct label"

Now I want you to label the following example:
Input: [ Integrity means that you pay your debts.]

Does this apply to President Trump too?
Output: 


In [9]:
# now, do the actual labeling
labels, df, metrics_list = agent.run('test.csv', max_items=100)


Actual Cost: 0.022


47% accuracy is not very good! Let's see if we can improve this further.
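
For reference, accuracy here is simply the fraction of rows where the LLM label matches the human label. A minimal sketch follows, using plain dictionaries and made-up rows. The column name `ToxicCommentClassification_llm_label` is the one that appears in the output DataFrame later in this notebook.

```python
def accuracy(rows, truth_col, pred_col):
    """Fraction of rows where the predicted label equals the ground-truth label."""
    if not rows:
        raise ValueError("no rows to score")
    correct = sum(1 for row in rows if row[truth_col] == row[pred_col])
    return correct / len(rows)


# Made-up rows for illustration only.
demo_rows = [
    {"label": "toxic", "ToxicCommentClassification_llm_label": "toxic"},
    {"label": "not toxic", "ToxicCommentClassification_llm_label": "toxic"},
    {"label": "not toxic", "ToxicCommentClassification_llm_label": "not toxic"},
    {"label": "toxic", "ToxicCommentClassification_llm_label": "toxic"},
]
print(accuracy(demo_rows, "label", "ToxicCommentClassification_llm_label"))  # 0.75
```

With the DataFrame returned by `agent.run`, the rows can be obtained with `df.to_dict("records")`.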

### Experiment #2: Few-shot prompting to provide helpful examples

In [27]:
config = {
    "task_name": "ToxicCommentClassification",
    "task_type": "classification",
    "dataset": {
        "label_column": "label",
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo"
    },
    "prompt": {
        "task_guidelines": "Is the provided comment 'toxic' or 'not toxic'?",
        "labels": [
            "toxic",
            "not toxic"
        ],
        "example_template": "Input: {example}\nOutput: {label}"
    }
}

In [14]:
config["prompt"]["few_shot_examples"] = "seed.csv"
config["prompt"]["few_shot_selection"] = "semantic_similarity"
config["prompt"]["few_shot_num"] = 10
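
With `semantic_similarity`, the few-shot examples for each input are presumably the seed rows most similar to that input. The toy sketch below illustrates the idea with word-count vectors and cosine similarity. This is a stand-in chosen for the example: it is not Autolabel's implementation, and a real system would typically use learned embeddings instead of word counts. The seed rows are made up.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, seed_rows, k):
    """Return the k seed rows whose text is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        seed_rows,
        key=lambda row: cosine(q, bag_of_words(row["example"])),
        reverse=True,
    )
    return ranked[:k]


seed_rows = [
    {"example": "The mayor raised taxes again", "label": "not toxic"},
    {"example": "I love this recipe", "label": "not toxic"},
    {"example": "The mayor is a complete moron", "label": "toxic"},
]
picked = select_examples("Why does the mayor want more taxes?", seed_rows, k=2)
for row in picked:
    print(row["label"], "|", row["example"])
```

Both selected rows mention the mayor, while the unrelated recipe comment is left out. This is why similarity-based selection tends to produce longer, more topical prompts than a fixed set of examples.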

In [15]:
# create an agent for labeling
agent = LabelingAgent(config, cache=False)

In [16]:
# dry-run -- this tells us how much this will cost and shows an example prompt
agent.plan(dataset='test.csv')


Is the provided comment 'toxic' or 'not toxic'?

You will return the answer with just one element: "the correct label"

Some examples with their output answers are provided below:

Input: If Trump wants to totally reinvent the world political and economic order, I think he owes the American public some specific plans and proposed policies. Having "lots of meetings", and telling everyone what a great negotiator he is, is not a plan. The USA is already a great nation with great influence in the world, and we need to honor our long term alliances and commitments, or risk losing much financial and political power and influence. For Pete's sake, we owe these countries trillions of dollars of our national debit. What happens if they ask for their money back? But then Trump is unquestionably experienced in bankruptcies.
Output: not toxic

Input: Are you talking about the crooked Trump Foundation?
Output: not toxic

Input: Trump has tarnished yet another man of integrity

Kelly let himself get

In [18]:
_, output_df, _ = agent.run('test.csv', max_items=100)


2023-06-14 01:27:16 openai INFO: error_code=None error_message='Rate limit reached for default-gpt-3.5-turbo in organization org-etZVkYhAIYGmLcxLmarMmAPo on tokens per min. Limit: 90000 / min. Current: 89188 / min. Contact us through our help center at help.openai.com if you continue to have issues.' error_param=None error_type=tokens message='OpenAI API error received' stream_error=False
2023-06-14 01:27:17 openai INFO: error_code=None error_message='Rate limit reached for default-gpt-3.5-turbo in organization org-etZVkYhAIYGmLcxLmarMmAPo on tokens per min. Limit: 90000 / min. Current: 87566 / min. Contact us through our help center at help.openai.com if you continue to have issues.' error_param=None error_type=tokens message='OpenAI API error received' stream_error=False
2023-06-14 01:27:20 openai INFO: error_code=None error_message='Rate limit reached for default-gpt-3.5-turbo in organization org-etZVkYhAIYGmLcxLmarMmAPo on tokens per min. Limit: 90000 / min. Current: 88369 / min. C

Actual Cost: 0.2157


Accuracy is 54%. That is only a modest gain over the 47% from Experiment #1, so few-shot examples didn't help much on this dataset. Let's try something else!

### Experiment #3: Improving prompts with error analysis (Prompt engineering)

`agent.run()` returned an output pandas DataFrame. We'll review the mistakes and update the labeling guidelines to "teach" the LLM our content moderation policies.

In [20]:
import pandas as pd

pd.set_option('max_colwidth', None)

In [24]:
# review the first 10 mistakes
display(output_df[output_df['label'] != output_df['ToxicCommentClassification_llm_label']].head(10))

Unnamed: 0,example,label,ToxicCommentClassification_llm_labeled_successfully,ToxicCommentClassification_llm_label
1,This is malfeasance by the Administrator and the Board. They are wasting our money!,not toxic,yes,toxic
2,"@Rmiller101 - Spoken like a true elitist. But look out bud. The re-awakening in Europe, Brexit and now the Trump victory are just the beginning. The unwashed masses are coming for you and your ilk….",not toxic,yes,toxic
4,Sorry you missed high school. Eisenhower sent troops to Vietnam after the French withdrew in 1954 and before that America was providing 50% of the cost of that war. WWI & WWII were won by Democrat Presidents and the last win a Republican had was 1865 but the did surrender in Korea and Vietnam and fail to win in desert Storm or these two wars.,not toxic,yes,toxic
5,Let’s see if I understand this; Berkowitz announces a $14M surplus then he rails against Proposition 8 because it would constrain his ability to confiscate even more money from citizens. Berkowitz and his allies on the Assembly changed the tax cap language to facilitate confiscating greater amounts of citizen’s money. Now why would a mayor who professes fidelity with the taxpayers change the law to enable him to take more money from them? Probably because he plans to take more money from them. Berkowitz said if Proposition 8 passes “it severely constrains our ability to give out property tax relief.” So Berkowitz wants the ability to confiscate more of the citizen’s money to provide property relief. How about taking less of that money in the first place? Let’s hope Proposition 8 passes.,not toxic,yes,toxic
8,What would Jerry Prevo think about that story? Exactly what I think about his....,not toxic,yes,toxic
9,When wil the indigenous be accountable and take responsibility instead of blaming others ? They have to be accountable and pay their own way or go it alone.,not toxic,yes,toxic
10,the people against O'Leary are the same people against\n\nTrump\nBrexit\n\nnot a good track record so far\n2 down and 1 to go,not toxic,yes,toxic
12,"Since whining is the lifeblood of online comment sites, the Pope's sign will find little support here. The self-pitying ones with their myriad complaints would have you believe that Canada is a dreadful place to live -- why they haven't left for their perfect foreign paradise is a mystery -- when by any relative standard it remains a refuge of stability in a chaotic world, its ongoing issues not likely to be solved by endless griping from anonymous online cranks.",not toxic,yes,toxic
15,IT IS DOA! As Collins announced her intentions to vote NO!,not toxic,yes,toxic
23,"Kitty Piercy did her best to make the almighty ""legacy"" (city) hall larger-than-life, at the city council work session on 7-11. Move the slide time-bar to 11:18 minutes, on the webcast of that meeting:\n\nhttp://ceapps.eugene-or.gov/CEWebcast/WebCast/Play.aspx?mid=2996",not toxic,yes,toxic


Based on the mistakes we are making, we will make some adjustments to our `task_guidelines`.
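
All ten mistakes shown above are 'not toxic' comments that the LLM labeled 'toxic'. To check whether that pattern holds across every mistake, the errors can be grouped by (human label, LLM label) pair. The following is a minimal sketch with made-up rows. `error_breakdown` is a hypothetical helper written for this example.

```python
from collections import Counter


def error_breakdown(rows, truth_col, pred_col):
    """Count mistakes by (true label, predicted label) pair."""
    return Counter(
        (row[truth_col], row[pred_col])
        for row in rows
        if row[truth_col] != row[pred_col]
    )


# Made-up rows for illustration only.
demo_rows = [
    {"label": "not toxic", "llm_label": "toxic"},
    {"label": "not toxic", "llm_label": "toxic"},
    {"label": "not toxic", "llm_label": "toxic"},
    {"label": "toxic", "llm_label": "not toxic"},
    {"label": "toxic", "llm_label": "toxic"},
    {"label": "not toxic", "llm_label": "not toxic"},
]
breakdown = error_breakdown(demo_rows, "label", "llm_label")
for (truth, predicted), count in breakdown.most_common():
    print(f"{truth!r} labeled as {predicted!r}: {count}")
```

If one direction dominates, as it does in this made-up sample, the guidelines should spell out what does not count as toxic.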

In [28]:
new_task_guidelines = """
You are an expert at identifying toxic comments.

You aim to act in a fair and balanced manner, where comments that provide fair criticism of something or someone are labelled 'not toxic'. Similarly, criticisms of policy and politicians are marked 'not toxic', unless the comment includes obscenities, racial slurs or sexually explicit material.

Any comments that are explicit, obscene, or insults a person, demographic or race are not allowed and labeled 'toxic'.
"""

In [29]:
config = {
    "task_name": "ToxicCommentClassification",
    "task_type": "classification",
    "dataset": {
        "label_column": "label",
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo"
    },
    "prompt": {
        "task_guidelines": new_task_guidelines,
        "labels": [
            "toxic",
            "not toxic"
        ],
        "example_template": "Input: {example}\nOutput: {label}"
    }
}

In [30]:
# create an agent for labeling
agent = LabelingAgent(config, cache=False)

In [32]:
# dry-run -- this tells us how much this will cost and shows an example prompt
agent.plan(dataset='test.csv')



You are an expert at identifying toxic comments.

You aim to act in a fair and balanced manner, where comments that provide fair criticism of something or someone are labelled 'not toxic'. Similarly, criticisms of policy and politicians are marked 'not toxic', unless the comment includes obscenities, racial slurs or sexually explicit material.

Any comments that are explicit, obscene, or insults a person, demographic or race are not allowed and labeled 'toxic'.


You will return the answer with just one element: "the correct label"

Now I want you to label the following example:
Input: [ Integrity means that you pay your debts.]

Does this apply to President Trump too?
Output: 


In [33]:
# now, do the actual labeling
_, output_df, _ = agent.run('test.csv', max_items=100)


2023-06-14 01:34:31 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 2a50dd9a45d7ff3696b5d64d0e2c62a9 in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
2023-06-14 01:35:19 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 065f2b04aa5df268bc6fe30d9ecae8e2 in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False


Actual Cost: 0.0376


We see a jump in labeling accuracy to 77% - this is promising! Let's see if we can push this up even further. 

### Experiment #4: Using confidence scores

In [34]:
# Start computing confidence scores (using Refuel's LLMs)
os.environ['REFUEL_API_KEY'] = 'sk-xxxxxxxxxxxx'

In [35]:
config = {
    "task_name": "ToxicCommentClassification",
    "task_type": "classification",
    "dataset": {
        "label_column": "label",
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo",
        # this is new ->
        "compute_confidence": True
    },
    "prompt": {
        "task_guidelines": new_task_guidelines,
        "labels": [
            "toxic",
            "not toxic"
        ],
        "example_template": "Input: {example}\nOutput: {label}"
    }
}

In [36]:
# create an agent for labeling
agent = LabelingAgent(config, cache=False)

In [37]:
# dry-run -- this tells us how much this will cost and shows an example prompt
agent.plan('test.csv')



You are an expert at identifying toxic comments.

You aim to act in a fair and balanced manner, where comments that provide fair criticism of something or someone are labelled 'not toxic'. Similarly, criticisms of policy and politicians are marked 'not toxic', unless the comment includes obscenities, racial slurs or sexually explicit material.

Any comments that are explicit, obscene, or insults a person, demographic or race are not allowed and labeled 'toxic'.


You will return the answer with just one element: "the correct label"

Now I want you to label the following example:
Input: [ Integrity means that you pay your debts.]

Does this apply to President Trump too?
Output: 


In [39]:
# now, do the actual labeling
_, output_df, _ = agent.run('test.csv', start_index=0, max_items=100)


2023-06-14 01:41:11 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID e770282334ac0056ba78ca394eb9ad94 in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
2023-06-14 01:42:46 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 137017496026d027c25431687deebae6 in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False


Metric: auroc: 0.8858
Actual Cost: 0.0376
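
One way to read the AUROC figure: assuming it treats the confidence score as a predictor of whether the LLM label is correct (an assumption about how the metric is computed here), it is the probability that a randomly chosen correct label has a higher confidence than a randomly chosen incorrect one. The sketch below computes that quantity directly from pairs, using made-up numbers.

```python
def auroc(confidences, is_correct):
    """Probability that a correct label outranks an incorrect one (ties count half)."""
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up confidence scores and correctness flags.
demo_conf = [0.9, 0.8, 0.7, 0.6, 0.4]
demo_ok = [True, True, False, True, False]
print(round(auroc(demo_conf, demo_ok), 4))  # 0.8333
```

A value of 0.5 would mean the confidence score carries no information about correctness. Values close to 1 mean that thresholding on confidence can separate good labels from bad ones.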


Looking at the table above, we can see that if we set the confidence threshold at `0.6682`, we can label at 96% accuracy with a completion rate of 63%. In other words, we would discard every data point whose confidence score is below `0.6682` (around 37% of all samples) in exchange for a much higher-quality labeled dataset.
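
The trade-off between accuracy and completion rate can be reproduced for any threshold. The sketch below is a hypothetical helper with made-up scores. It keeps only the items whose confidence is at or above the threshold and reports the accuracy on the kept items and the fraction kept.

```python
def threshold_report(confidences, is_correct, threshold):
    """Return (accuracy on kept items, completion rate) for a confidence threshold."""
    if not confidences:
        raise ValueError("no items to evaluate")
    kept = [ok for c, ok in zip(confidences, is_correct) if c >= threshold]
    completion_rate = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return accuracy, completion_rate


# Made-up confidence scores and correctness flags.
demo_conf = [0.9, 0.8, 0.7, 0.6, 0.4]
demo_ok = [True, True, False, True, False]

for threshold in (0.0, 0.65, 0.75):
    acc, completion = threshold_report(demo_conf, demo_ok, threshold)
    print(f"threshold={threshold:.2f} accuracy={acc:.2f} completion={completion:.2f}")
```

Raising the threshold in this toy data increases accuracy on the kept items while lowering the completion rate. That is the same trade-off described above for the `0.6682` threshold.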