## Exploring the SQUADv2 dataset using Autolabel

#### Setup the API Keys for providers that you want to use

In [1]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = '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'

#### Install the autolabel library

In [1]:
!pip install 'refuel-autolabel[openai]'


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


#### Download the dataset

In [2]:
from autolabel import get_data

get_data('squad_v2')

This downloads two datasets:
* `test.csv`: This is the larger dataset we are trying to label using LLMs
* `seed.csv`: This is a small dataset where we already have human-provided labels

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `config.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

In [9]:
import json

from autolabel import LabelingAgent

In [10]:
# load the config
with open('config_squad_v2.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `question_answering` (since it's a question answering task)
* `model`: `{'provider': 'openai', 'name': 'gpt-3.5-turbo'}` (use a specific OpenAI model)
* `prompt.task_guidelines`: `'You are an expert at answering questions based on wikipedia articles` (how we describe the task to the LLM)
* `prompt.few_shot_num`: 3 (how many labeled examples to provide to the LLM)

In [11]:
config

{'task_name': 'OpenbookQAWikipedia',
 'task_type': 'question_answering',
 'dataset': {'label_column': 'answer', 'delimiter': ','},
 'model': {'provider': 'openai', 'name': 'gpt-4'},
 'prompt': {'task_guidelines': 'You are an expert at answering questions based on wikipedia articles. Your job is to answer the following questions using the context provided with the question.\nThe answer is a continuous span of words from the context.\nUse the context to answer the question. If the question cannot be answered using the context and the context alone without any outside knowledge, answer the question as unanswerable.',
  'output_guidelines': 'You will return the answer one element: "the correct answer". If the question is unanswerable, return the answer as "unanswerable"\n',
  'few_shot_examples': 'seed.csv',
  'few_shot_selection': 'semantic_similarity',
  'few_shot_num': 4,
  'example_template': 'Context: {context}\nQuestion: {question}\nAnswer: {answer}'}}

In [12]:
# create an agent for labeling
agent = LabelingAgent(config=config)

In [13]:
agent.plan('test.csv')

Output()

You are an expert at answering questions based on wikipedia articles. Your job is to answer the following questions using the context provided with the question.
The answer is a continuous span of words from the context.
Use the context to answer the question. If the question cannot be answered using the context and the context alone without any outside knowledge, answer the question as unanswerable.

You will return the answer one element: "the correct answer". If the question is unanswerable, return the answer as "unanswerable"


Some examples with their output answers are provided below:

Context: The final major evolution of the steam engine design was the use of steam turbines starting in the late part of the 19th century. Steam turbines are generally more efficient than reciprocating piston type steam engines (for outputs above several hundred horsepower), have fewer moving parts, and provide rotary power directly instead of through a connecting rod system or similar means. Steam

In [14]:
labels, df, metrics_list = agent.run('test.csv', max_items=10)

Output()

Actual Cost: 0.3234


We are at 59% accuracy when labeling the first 100 examples. Let's see if we can use confidence scores to improve accuracy further by removing the less confident examples from our labeled set.

### Compute confidence scores


In [2]:
# Start computing confidence scores (using Refuel's LLMs)
os.environ['REFUEL_API_KEY'] = 'sk-xxxxxxxxxxxxxxxx'

In [12]:
config["model"]["compute_confidence"] = True

In [13]:
agent = LabelingAgent(config=config)

In [14]:
agent.plan('test.csv')

Output()

You are an expert at answering questions based on wikipedia articles. Your job is to answer the following questions using the context provided with the question. The answer is a continuous span of words from the context. Use the context to answer the question. If the question cannot be answered using the context, answer the question as unanswerable.

You will return the answer one element: "the correct label"


Some examples with their output answers are provided below:

Context: The modern Conservative Party was created out of the 'Pittite' Tories of the early 19th century. In the late 1820s disputes over political reform broke up this grouping. A government led by the Duke of Wellington collapsed amidst dire election results. Following this disaster Robert Peel set about assembling a new coalition of forces. Peel issued the Tamworth Manifesto in 1834 which set out the basic principles of Conservatism; – the necessity in specific cases of reform in order to survive, but an opposition 

In [15]:
labels, df, metrics_list = agent.run('test.csv', max_items=100)

Output()

Metric: auroc: 0.864
Actual Cost: 0.0095


Looking at the table above, we can see that if we set the confidence threshold at `0.8449`, we are able to label at 80.65% accuracy and getting a completion rate of 65%. This means, we would ignore all the data points where confidence score is less than `0.8449` (which would end up being around 35% of all samples). This would, however, guarantee a very high quality labeled dataset for us.