## Exploring the SQUADv2 dataset using Autolabel

#### Set up the API keys for the providers you want to use

In [1]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = 'sk-xxxxxxxxxxxxxxxxx'
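
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has already been exported in the shell that launched the notebook. `require_env` is a hypothetical helper written for this example, not part of Autolabel:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return value


# Example usage (uncomment once the variable is exported in your shell):
# openai_key = require_env("OPENAI_API_KEY")
```

Failing early with a clear message tends to be easier to debug than an authentication error surfacing in the middle of a labeling run.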

#### Install the autolabel library

In [2]:
!pip3 install 'refuel-autolabel[openai]'





#### Download the dataset

In [3]:
from autolabel import get_data

get_data('squad_v2')

Downloading seed example dataset to "seed.csv"...
100% [........................................................] 177008 / 177008

Downloading test dataset to "test.csv"...
100% [......................................................] 1786563 / 1786563

This downloads two datasets:
* `test.csv`: This is the larger dataset we are trying to label using LLMs
* `seed.csv`: This is a small dataset where we already have human-provided labels
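
Before labeling, it can help to glance at the columns and a few rows of each file. The sketch below uses only the standard library; `peek_csv` is a hypothetical helper, and the column names it returns depend on the downloaded files, so none are assumed here:

```python
import csv


def peek_csv(path: str, n_rows: int = 3):
    """Return (column_names, first n_rows rows as dicts) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = []
        for i, row in enumerate(reader):
            if i >= n_rows:
                break
            rows.append(row)
        # fieldnames is None for an empty file, so fall back to an empty list
        return list(reader.fieldnames or []), rows


# Example usage (run after the download cell has produced the files):
# columns, rows = peek_csv("seed.csv")
# print(columns)
```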

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `config.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

In [5]:
import json

from autolabel import LabelingAgent

In [33]:
# load the config
with open('config_squad_v2.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `question_answering` (since it's a question answering task)
* `model`: `{'provider': 'openai', 'name': 'gpt-3.5-turbo'}` (use a specific OpenAI model)
* `prompt.task_guidelines`: `'You are an expert at answering questions based on wikipedia articles. ...'` (how we describe the task to the LLM)
* `prompt.few_shot_num`: 3 (how many labeled examples to provide to the LLM)
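
The dotted names in the list above refer to nested keys in the configuration dictionary. A small sketch of how such a key can be looked up, using a trimmed-down stand-in for the config that contains only the values quoted above. `get_nested` and `example_config` are hypothetical names for illustration:

```python
def get_nested(config: dict, dotted_key: str):
    """Look up a dotted key such as 'prompt.few_shot_num' in a nested dict."""
    value = config
    for part in dotted_key.split("."):
        value = value[part]  # raises KeyError if a level is missing
    return value


# Stand-in config holding only the keys discussed above (not the full file)
example_config = {
    "task_type": "question_answering",
    "model": {"provider": "openai", "name": "gpt-3.5-turbo"},
    "prompt": {"few_shot_num": 3},
}

for key in ("task_type", "model.provider", "model.name", "prompt.few_shot_num"):
    print(f"{key}: {get_nested(example_config, key)}")
```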

In [34]:
config

{'task_name': 'OpenbookQAWikipedia',
 'task_type': 'question_answering',
 'dataset': {'label_column': 'answer', 'delimiter': ','},
 'model': {'provider': 'openai', 'name': 'gpt-3.5-turbo'},
 'prompt': {'task_guidelines': 'You are an expert at answering questions based on wikipedia articles. Your job is to answer the following questions using the context provided with the question. Use the context to answer the question - the answer is a continuous span of words from the context.\n',
  'output_guidelines': 'Your answer will consist of an explanation, followed by the correct answer. The last line of the response should always be is JSON format with one key: {"label": "the correct answer"}.\n If the question cannot be answered using the context and the context alone without any outside knowledge, the question is unanswerable. If the question is unanswerable, return the answer as {"label": "unanswerable"}\n',
  'few_shot_examples': 'seed.csv',
  'few_shot_selection': 'semantic_similarity',

In [22]:
# generate explanations for chain of thought
# (requires `agent`, which is created by the `LabelingAgent(config=config)` cell below)
seed_with_explanations = agent.generate_explanations('seed.csv')

2023-06-14 15:59:48 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 87d1cac9c00b5c4994a0e5fe446a2087 in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False


In [23]:
# # Use seedset with explanations in config
# config['prompt']['few_shot_examples'] = seed_with_explanations

In [35]:
# create an agent for labeling
agent = LabelingAgent(config=config)

In [25]:
agent.plan('test.csv')

You are an expert at answering questions based on wikipedia articles. Your job is to answer the following questions using the context provided with the question. Use the context to answer the question - the answer is a continuous span of words from the context.


Your answer will consist of an explanation, followed by the correct answer. The last line of the response should always be is JSON format with one key: {"label": "the correct answer"}.
 If the question cannot be answered using the context and the context alone without any outside knowledge, the question is unanswerable. If the question is unanswerable, return the answer as {"label": "unanswerable"}


Some examples with their output answers are provided below:

Context: The final major evolution of the steam engine design was the use of steam turbines starting in the late part of the 19th century. Steam turbines are generally more efficient than reciprocating piston type steam engines (for outputs above several hundred horsepow

In [36]:
labels, df, metrics_list = agent.run('test.csv', max_items=100)

2023-06-14 16:28:05 openai INFO: error_code=None error_message='The server had an error while processing your request. Sorry about that!' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False


Actual Cost: 0.1827


We are at 21% accuracy when labeling the first 100 examples. Let's see if we can use confidence scores to improve accuracy further by removing the less confident examples from our labeled set.
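
To make the accuracy figure concrete, here is a minimal sketch of one common way to score question answering output: exact match after trimming whitespace and lowercasing. This is an assumption made for illustration; the notebook does not show how Autolabel computes its accuracy metric, and `exact_match_accuracy` is a hypothetical helper:

```python
def exact_match_accuracy(predictions, gold_labels):
    """Fraction of predictions equal to the gold label after trimming and lowercasing."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not predictions:
        return 0.0
    matches = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, gold_labels)
    )
    return matches / len(predictions)


# Made-up predictions and labels, two of three match
print(exact_match_accuracy(["Paris", "unanswerable", "1066"], ["paris ", "Rome", "1066"]))
```

Exact match is strict: an answer that differs from the gold span by a single word counts as wrong, which is one reason span-based tasks can show low headline accuracy.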

### Compute confidence scores


In [47]:
# Start computing confidence scores (using Refuel's LLMs)
os.environ['REFUEL_API_KEY'] = 'xxxxxxxxxxxxxxxxx'

In [43]:
config["model"]["compute_confidence"] = True

In [44]:
agent = LabelingAgent(config=config)

In [45]:
agent.plan('test.csv')

You are an expert at answering questions based on wikipedia articles. Your job is to answer the following questions using the context provided with the question. Use the context to answer the question - the answer is a continuous span of words from the context.


Your answer will consist of an explanation, followed by the correct answer. The last line of the response should always be is JSON format with one key: {"label": "the correct answer"}.
 If the question cannot be answered using the context and the context alone without any outside knowledge, the question is unanswerable. If the question is unanswerable, return the answer as {"label": "unanswerable"}


Some examples with their output answers are provided below:

Context: The final major evolution of the steam engine design was the use of steam turbines starting in the late part of the 19th century. Steam turbines are generally more efficient than reciprocating piston type steam engines (for outputs above several hundred horsepow

In [46]:
labels, df, metrics_list = agent.run('test.csv', max_items=100)

2023-06-14 16:33:52 autolabel.labeler INFO: Task run already exists.


Metric: auroc: 0.5


You are an expert at answering questions based on wikipedia articles. Your job is to answer the following questions using the context provided with the question. Use the context to answer the question - the answer is a continuous span of words from the context.


Your answer will consist of an explanation, followed by the correct answer. The last line of the response should always be is JSON format with one key: {"label": "the correct answer"}.
 If the question cannot be answered using the context and the context alone without any outside knowledge, the question is unanswerable. If the question is unanswerable, return the answer as {"label": "unanswerable"}


Some examples with their output answers are provided below:

Context: Fourth, national courts have a duty to interpret domestic law "as far as possible in the light of the wording and purpose of the directive". Textbooks (though not the Court itself) often called this "indirect effect". In Marleasing SA v La Comercial SA the Court

Van Gend en Loos v Nederlandse Administratie der Belastingen {"label": "Van Gend en Loos v Nederlandse Administratie der Belastingen"}




2023-06-14 16:38:35 openai INFO: error_code=None error_message='The server had an error while processing your request. Sorry about that!' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
2023-06-14 16:38:40 openai INFO: error_code=None error_message='The server had an error while processing your request. Sorry about that!' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
2023-06-14 16:39:50 openai INFO: error_code=None error_message='The server had an error while processing your request. Sorry about that!' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
2023-06-14 16:39:55 openai INFO: error_code=None error_message='The server had an error while processing your request. Sorry about that!' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False


Metric: auroc: 0.9204
Actual Cost: 0.0113


Looking at the table above, setting the confidence threshold at `0.9281` lets us label at 73.9% accuracy with a completion rate of 23%. In other words, we would discard every data point whose confidence score is below `0.9281` (around 77% of all samples). The trade-off is lower coverage in exchange for a far more accurate labeled dataset than the 21% baseline.
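
The trade-off between accuracy and completion rate can be computed directly from per-example confidence scores. The sketch below uses made-up scores and a hypothetical helper, `accuracy_at_threshold`; it does not read Autolabel's output, and no Autolabel column names are assumed:

```python
def accuracy_at_threshold(confidences, is_correct, threshold):
    """Return (accuracy, completion_rate) when keeping items with confidence >= threshold."""
    if len(confidences) != len(is_correct):
        raise ValueError("confidences and is_correct must have the same length")
    # Keep the correctness flag of every example that clears the threshold
    kept = [ok for conf, ok in zip(confidences, is_correct) if conf >= threshold]
    completion_rate = len(kept) / len(confidences) if confidences else 0.0
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return accuracy, completion_rate


# Made-up data: five examples with a confidence score and a correctness flag
confidences = [0.95, 0.93, 0.90, 0.60, 0.40]
is_correct = [True, True, False, False, True]

accuracy, completion = accuracy_at_threshold(confidences, is_correct, threshold=0.92)
print(f"accuracy={accuracy:.2f}, completion_rate={completion:.2f}")
```

Sweeping `threshold` over the observed confidence values and recording both numbers at each step produces the kind of table the paragraph above refers to.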