## Exploring the SciQ dataset using Autolabel

#### Set up the API keys for the providers you want to use

In [2]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = 'sk-'
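
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. Below is a minimal sketch of an alternative, assuming the key is either already exported in your shell or can be typed at a prompt. `ensure_api_key` is a hypothetical helper written for this notebook, not part of Autolabel.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key stored in `var_name`, prompting for it only if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        # Prompt without echoing the key, then store it for this process only.
        key = prompt_fn(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` leaves an existing `OPENAI_API_KEY` untouched and only asks for input when the variable is unset or empty.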

#### Install the autolabel library

In [None]:
!pip install 'refuel-autolabel[openai]'

#### Download the dataset

In [3]:
from autolabel import get_data

get_data('multimodal_science_qa')

This downloads two datasets:
* `test.csv`: This is the larger dataset we are trying to label using LLMs
* `seed.csv`: This is a small dataset where we already have human-provided labels
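
Before labeling, it can help to peek at the downloaded files. The sketch below uses only the standard library and assumes the files are comma-delimited with a header row. `preview_csv` is a hypothetical helper, not an Autolabel function.

```python
import csv
from itertools import islice


def preview_csv(path, n_rows=3):
    """Return (column names, first n_rows rows as dicts) for a comma-delimited CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Reading the first rows also populates reader.fieldnames from the header.
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows
```

For example, `preview_csv("data/multimodal_science_qa/seed.csv")` would return the column names and the first three rows, assuming `seed.csv` is saved in the same directory as `test.csv`.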

## Start the labeling process!

Labeling with Autolabel is a 3-step process:
* First, we specify a labeling configuration (see `config.json` below)
* Next, we do a dry-run on our dataset using the LLM specified in `config.json` by running `agent.plan`
* Finally, we run the labeling with `agent.run`

In [8]:
import json

from autolabel import LabelingAgent

In [9]:
# load the config
with open('config_multimodal_sciq.json', 'r') as f:
    config = json.load(f)

Let's review the configuration file below. You'll notice the following useful keys:
* `task_type`: `question_answering` (since it's a question answering task)
* `model`: `{'provider': 'openai_vision', 'name': 'gpt-4-vision-preview'}` (use a specific OpenAI vision model, since each question comes with an image)
* `prompt.task_guidelines`: `'You are an expert at answer science questions...'` (how we describe the task to the LLM)
* `dataset.image_url_column`: `image_url` (the column that holds the image URL for each question)
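
Dotted names such as `prompt.task_guidelines` refer to keys nested inside the config dictionary. As a rough sketch, a small lookup helper makes it easy to check individual settings. `get_nested` is a hypothetical helper, and `example_config` is a trimmed stand-in for the loaded `config`.

```python
def get_nested(config, dotted_key, default=None):
    """Look up a dotted key such as 'model.name' in a nested dict."""
    current = config
    for part in dotted_key.split("."):
        # Stop early if the path leaves the dict structure or the key is absent.
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


# Trimmed stand-in for the loaded config, for illustration only.
example_config = {
    "task_type": "question_answering",
    "model": {"provider": "openai_vision", "name": "gpt-4-vision-preview"},
}

print(get_nested(example_config, "task_type"))
print(get_nested(example_config, "model.name"))
print(get_nested(example_config, "prompt.few_shot_num", default="not set"))
```

Passing a `default` lets you distinguish settings that are absent from the file from those that are explicitly set.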

In [10]:
config

{'task_name': 'ScienceQuestionAnswering',
 'task_type': 'question_answering',
 'dataset': {'label_column': 'answer',
  'delimiter': ',',
  'image_url_column': 'image_url'},
 'model': {'provider': 'openai_vision', 'name': 'gpt-4-vision-preview'},
 'prompt': {'task_guidelines': "You are an expert at answer science questions. Your job is to answer the given question, using the options provided for each question. You'll also be given an image for each question - use that as context as needed. Choose the best answer for the question from among the options provided. Output just the answer (from the given options) and nothing else.",
  'example_template': 'Question: {question}\nOptions: {choices}\nAnswer: {answer}'}}
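
To get a feel for what `example_template` produces, the sketch below fills it in with plain `str.format`. This only illustrates the placeholder substitution. Autolabel's own prompt construction may differ, and the row shown is made up for illustration rather than taken from the dataset.

```python
example_template = "Question: {question}\nOptions: {choices}\nAnswer: {answer}"

# Made-up row for illustration; real rows come from the CSV columns of the same names.
row = {
    "question": "Which organ pumps blood?",
    "choices": "['heart', 'lung', 'liver']",
    "answer": "heart",
}

# Each {placeholder} is replaced by the value of the matching column.
rendered = example_template.format(**row)
print(rendered)
```

Each placeholder in the template must match a column name in the dataset, with `{answer}` corresponding to the configured `label_column`.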

In [11]:
# create an agent for labeling
agent = LabelingAgent(config=config)

In [12]:
from autolabel import AutolabelDataset

# load the dataset we want to label, then do a dry run with agent.plan
ds = AutolabelDataset("data/multimodal_science_qa/test.csv", config=config)
agent.plan(ds)


In [13]:
# run the labeling on at most 10 items
ds = agent.run(ds, max_items=10)
