# Labeling the [Lexical Relationship](https://huggingface.co/datasets/relbert/lexical_relation_classification) dataset using Autolabel

This is a multi-class classification task: the input is a pair of English words, and the goal is to classify the lexical relationship between them using one of 5 labels.

## Install Autolabel
You will also need to set up your OpenAI API key, since we'll be using `gpt-3.5-turbo` as our LLM for labeling.

In [12]:
!pip install 'refuel-autolabel[openai]'



In [1]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = 'sk-'


## Download the dataset

This dataset is available for download via Autolabel.

In [2]:
from autolabel import get_data

get_data('lexical_relation')

This downloads two datasets:
* `test.csv`: This is the larger dataset we are trying to label using LLMs
* `seed.csv`: This is a small dataset where we already have human-provided labels
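Before labeling, it can help to check how the human-provided labels in `seed.csv` are distributed. The snippet below is a minimal sketch using only the standard library. It assumes the seed file sits next to `test.csv` under `data/lexical_relation/` and that the label column is named `label` (as in the configuration used later); `count_labels` is a hypothetical helper, not part of Autolabel.

```python
import csv
import os
from collections import Counter


def count_labels(csv_lines, label_column="label"):
    """Count how often each label appears in CSV content.

    `csv_lines` can be an open file or any iterable of text lines
    whose first line is the header row.
    """
    reader = csv.DictReader(csv_lines)
    return Counter(row[label_column] for row in reader)


# Assumed location: mirrors the test.csv path used later in this notebook.
seed_path = "data/lexical_relation/seed.csv"
if os.path.exists(seed_path):
    with open(seed_path, newline="") as f:
        print(count_labels(f))
```

A heavily skewed distribution in the seed set is worth knowing about, since the few-shot examples shown to the LLM are drawn from it.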

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `config.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

### First labeling run

In [3]:
import json

from autolabel import LabelingAgent

In [4]:
# load the config
with open('config_lexical_relation.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `classification` (since it's a classification task)
* `model`: `{'provider': 'openai', 'name': 'gpt-3.5-turbo'}` (use a specific OpenAI model)
* `prompt.task_guidelines`: `'You are an expert at understanding lexical relationships...'` (how we describe the task to the LLM)
* `prompt.labels`: `['RANDOM', 'SYN', 'ANT', 'HYPER', 'PART_OF']` (the full list of labels to choose from)
* `prompt.few_shot_num`: 10 (how many labeled examples to provide to the LLM)
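A few quick sanity checks on these keys can catch copy-paste mistakes before any API calls are made. The function below is a rough sketch of such checks. `check_config` is a hypothetical helper written for this notebook, not Autolabel's own validation, and the rules it applies (every label is described in the guidelines, `few_shot_num` is a positive integer) are assumptions about what makes a sensible config.

```python
def check_config(config):
    """Return a list of human-readable problems found in a labeling config."""
    problems = []
    prompt = config.get("prompt", {})
    labels = prompt.get("labels", [])
    guidelines = prompt.get("task_guidelines", "")

    if config.get("task_type") != "classification":
        problems.append("task_type is not 'classification'")
    if len(labels) != len(set(labels)):
        problems.append("labels contains duplicates")
    # Each label should be explained somewhere in the task guidelines.
    for label in labels:
        if label not in guidelines:
            problems.append(f"label {label!r} is never described in task_guidelines")
    few_shot_num = prompt.get("few_shot_num")
    if not isinstance(few_shot_num, int) or few_shot_num <= 0:
        problems.append("few_shot_num should be a positive integer")
    return problems


# Values copied from the configuration used in this notebook.
example_config = {
    "task_type": "classification",
    "prompt": {
        "task_guidelines": (
            "You are an expert at understanding lexical relationships.\n"
            " Given two words A and B, classify the relationship between them"
            " into one of: {labels}. RANDOM refers to an arbitrary relationship."
            " SYN means that the words are synonyms. ANT means that the two words"
            " are antonyms. HYPER means that the two words are hypernyms."
            " PART_OF means that the second word is a subset of the first word."
        ),
        "labels": ["RANDOM", "SYN", "ANT", "HYPER", "PART_OF"],
        "few_shot_num": 10,
    },
}

print(check_config(example_config))
```

An empty list means none of these particular checks failed; it does not guarantee that Autolabel will accept the config.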

In [5]:
config

{'task_name': 'LexicalRelationClassification',
 'task_type': 'classification',
 'dataset': {'label_column': 'label', 'delimiter': ','},
 'model': {'provider': 'openai', 'name': 'gpt-3.5-turbo'},
 'prompt': {'task_guidelines': 'You are an expert at understanding lexical relationships.\n Given two words A and B, classify the relationship between them into one of: {labels}. RANDOM refers to an arbitrary relationship. SYN means that the words are synonyms. ANT means that the two words are antonyms. HYPER means that the two words are hypernyms. PART_OF means that the second word is a subset of the first word.',
  'output_guidelines': 'You will answer with just the the correct output label and nothing else.',
  'labels': ['RANDOM', 'SYN', 'ANT', 'HYPER', 'PART_OF'],
  'few_shot_examples': 'seed.csv',
  'few_shot_selection': 'semantic_similarity',
  'few_shot_num': 10,
  'example_template': 'Input: {example}\nOutput: {label}'}}
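To get a feel for how the `{labels}`, `{example}` and `{label}` placeholders in this config turn into a prompt, here is a simplified sketch. It is not Autolabel's actual prompt builder: `render_prompt` is a hypothetical helper, the overall layout is an assumption, the guidelines string is abbreviated, the word pairs are made up for illustration (including how a pair is written in the `example` column), and few-shot selection simply takes the first examples rather than using semantic similarity. The dry run with `agent.plan` shows the real prompt.

```python
def render_prompt(prompt_cfg, seed_examples, new_input):
    """Assemble a few-shot prompt from a config's prompt section.

    seed_examples is a list of (example, label) pairs.
    """
    # Fill the {labels} placeholder with the comma-separated label list.
    guidelines = prompt_cfg["task_guidelines"].replace(
        "{labels}", ", ".join(prompt_cfg["labels"])
    )
    template = prompt_cfg["example_template"]
    # Simplification: take the first few_shot_num seeds (no similarity search).
    shots = [
        template.format(example=example, label=label)
        for example, label in seed_examples[: prompt_cfg["few_shot_num"]]
    ]
    # The item to be labeled uses the same template with the label left blank.
    query = template.format(example=new_input, label="").rstrip()
    return "\n\n".join(
        [guidelines, prompt_cfg["output_guidelines"], *shots, query]
    )


prompt_cfg = {
    # Abbreviated version of the task guidelines shown above.
    "task_guidelines": (
        "You are an expert at understanding lexical relationships.\n"
        " Given two words A and B, classify the relationship between them"
        " into one of: {labels}."
    ),
    "output_guidelines": (
        "You will answer with just the the correct output label and nothing else."
    ),
    "labels": ["RANDOM", "SYN", "ANT", "HYPER", "PART_OF"],
    "few_shot_num": 10,
    "example_template": "Input: {example}\nOutput: {label}",
}

# Made-up seed pairs, for illustration only.
seeds = [("hot, cold", "ANT"), ("big, large", "SYN")]
prompt = render_prompt(prompt_cfg, seeds, "car, wheel")
print(prompt)
```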

In [6]:
# create an agent for labeling
agent = LabelingAgent(config=config)

In [7]:
# dry-run -- this tells us how much this will cost and shows an example prompt
from autolabel import AutolabelDataset
ds = AutolabelDataset("data/lexical_relation/test.csv", config=config)
agent.plan(ds)

Output()

In [8]:
# now, do the actual labeling
ds = agent.run(ds, max_items=100)

2023-09-29 00:03:26 autolabel.labeler INFO: Task run already exists.




Output()