## Running Async Transformations in Jupyter

In [1]:
!pip install refuel-autolabel[all]
!pip install beautifulsoup4 httpx fake_useragent

## Finding the State of a National Park using Autolabel

We will use Autolabel to find the state in which a national park is located, given a URL to the park's website. First, a transform extracts the content of the website. Then, we frame the problem as a question_answering task that extracts the park's state from that content.

Notice the "transforms" section of the config. The transform reads the URL from the "url" column and writes the text of the webpage to the "content" column, as named under "output_columns". The "example_template" then references this "content" column, so the website text is sent to the model together with the question about the park's state.
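
To make the role of the "content" placeholder concrete, here is a minimal sketch of how a template of this shape could be filled from a row. It uses plain Python `str.format` and is not Autolabel's own rendering code. The row, its page text, and the `render_prompt` helper are hypothetical and exist only for illustration.

```python
# Hypothetical row, as it might look after the transform has added "content".
# The page text below is made up for illustration.
row = {
    "url": "https://www.nps.gov/dena/index.htm",
    "name": "Denali National Park",
    "content": "Example page text describing a park in Alaska.",
}

# A template of the same shape as "example_template" in the config.
example_template = "Content of webpage: {content}\nState:"


def render_prompt(template, row):
    """Fill the template's placeholders from the row's columns.

    str.format ignores keyword arguments the template does not use,
    so extra columns such as "url" and "name" are harmless.
    """
    return template.format(**row)


prompt = render_prompt(example_template, row)
print(prompt)
```

If the "content" column were missing from the row, `str.format` would raise a `KeyError`, which is one reason the transform has to run before labeling.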

In [1]:
config = {
    "task_name": "NationalPark",
    "task_type": "question_answering",
    "dataset": {
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo"
    },
    "transforms": [{
        "name": "webpage_transform",
        "params": {
            "url_column": "url"
        },
        "output_columns": {
            "content_column": "content"
        }
    }],
    "prompt": {
        "task_guidelines": "You are an expert at understanding websites of national parks. You will be given a webpage about a national park. Answer with the US State that the national park is located in.",
        "output_guidelines": "Answer in one word the state that the national park is located in.",
        "example_template": "Content of webpage: {content}\nState:",
    }
}

In [2]:
import os

# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = 'sk-XXXXXXXXXXXXXXXXXXXXXXXX'


In [3]:
from autolabel import LabelingAgent, AutolabelDataset, AutolabelConfig
agent = LabelingAgent(config)

Below is a small, manually collected dataset of national parks and their websites. We intend to use the LLM to find each park's state, which may be buried in different parts of the website.

In [4]:
import pandas as pd
df = pd.DataFrame([
    {
        "url": "https://www.visitmt.com/places-to-go/glacier-national-park",
        "name": "Glacier National Park"
    },
    {
        "url": "https://www.nps.gov/dena/index.htm",
        "name": "Denali National Park"
    },
    {
        "url": "https://www.nps.gov/lavo/index.htm",
        "name": "Lassen Volcanic National Park"
    },
    {
        "url": "https://www.nps.gov/olym/index.htm",
        "name": "Olympic National Park"
    },
    {
        "url": "https://www.nps.gov/pinn/index.htm",
        "name": "Pinnacles National Park"
    }
])

In [5]:
ds = AutolabelDataset(df, config)

## Running the transform
First, we call agent.transform to run the webpage transform and populate the "content" column of the dataset.

In [6]:
ds = agent.transform(ds)
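
As a rough mental model of this step, the sketch below fetches several pages concurrently and reduces each to its visible text. It is an illustration only and not Autolabel's implementation: the in-memory `FAKE_PAGES` dictionary, the example.org URLs, and every function name here are hypothetical, and the standard library's `html.parser` stands in for a real HTML library so the code runs without network access.

```python
import asyncio
from html.parser import HTMLParser

# Hypothetical in-memory "web" so the sketch runs offline.
FAKE_PAGES = {
    "https://example.org/park-a": (
        "<html><head><title>Park A</title><style>p{}</style></head>"
        "<body><p>Park A is in Montana.</p></body></html>"
    ),
    "https://example.org/park-b": (
        "<html><body><script>var x = 1;</script>"
        "<h1>Park B</h1><p>Located in Alaska.</p></body></html>"
    ),
}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


async def fetch_html(url):
    """Stand-in for a real async HTTP request."""
    await asyncio.sleep(0)
    return FAKE_PAGES[url]


async def transform_row(row):
    """Add a "content" key holding the page text for the row's URL."""
    html = await fetch_html(row["url"])
    return {**row, "content": html_to_text(html)}


async def transform_all(rows):
    """Transform all rows concurrently; results keep the input order."""
    return await asyncio.gather(*(transform_row(r) for r in rows))


rows = [{"url": url} for url in FAKE_PAGES]
# In a plain script, asyncio.run starts the event loop. Inside a Jupyter
# cell a loop is already running, so use `await transform_all(rows)` instead.
transformed = asyncio.run(transform_all(rows))
```

The design point is that each row's fetch is an independent coroutine, so slow pages do not block the others, while `asyncio.gather` still returns results in the original row order.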

## Running the labeling function
Now, we send the content of the website along with the question to the LLM, which returns the state of the national park.

In [7]:
ds = agent.run(ds)

In [8]:
ds.df

| | url | name | content | content_in_bytes_column | soup_column | metadata_column | NationalPark_label | NationalPark_error | NationalPark_successfully_labeled | NationalPark_annotation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.visitmt.com/places-to-go/glacier-n... | Glacier National Park | `\n\n\n\n\n\n\nGlacier National Park\n\n\n\n\n\...` | `b'\n<!doctype html>\n <html lang="en">\n<head...` | `[\n, html, \n, [\n, [\n, Google Tag Manager ,...` | `{'url': 'https://www.visitmt.com/places-to-go/...` | Montana | | True | `b'\x80\x04\x95q\x00\x00\x00\x00\x00\x00\x00\x8...` |
| 1 | https://www.nps.gov/dena/index.htm | Denali National Park | `\n Denali National Park & Preserve (U.S. N...` | `b'<!doctype html> <html lang="en" class="no-js...` | `[html, \n, [ , Content Copyright National Par...` | `{'url': 'https://www.nps.gov/dena/index.htm', ...` | Alaska | | True | `b'\x80\x04\x95p\x00\x00\x00\x00\x00\x00\x00\x8...` |
| 2 | https://www.nps.gov/lavo/index.htm | Lassen Volcanic National Park | `\n Lassen Volcanic National Park (U.S. Nat...` | `b'<!doctype html> <html lang="en" class="no-js...` | `[html, \n, [ , Content Copyright National Par...` | `{'url': 'https://www.nps.gov/lavo/index.htm', ...` | California | | True | `b'\x80\x04\x95t\x00\x00\x00\x00\x00\x00\x00\x8...` |
| 3 | https://www.nps.gov/olym/index.htm | Olympic National Park | `\n Olympic National Park (U.S. National Pa...` | `b'<!doctype html> <html lang="en" class="no-js...` | `[html, \n, [ , Content Copyright National Par...` | `{'url': 'https://www.nps.gov/olym/index.htm', ...` | Washington | | True | `b'\x80\x04\x95t\x00\x00\x00\x00\x00\x00\x00\x8...` |
| 4 | https://www.nps.gov/pinn/index.htm | Pinnacles National Park | `\n Pinnacles National Park (U.S. National ...` | `b'<!doctype html> <html lang="en" class="no-js...` | `[html, \n, [ , Content Copyright National Par...` | `{'url': 'https://www.nps.gov/pinn/index.htm', ...` | California | | True | `b'\x80\x04\x95t\x00\x00\x00\x00\x00\x00\x00\x8...` |
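
A quick way to sanity-check results like these is to compare the labels against a small hand-entered reference. The sketch below does this with plain dictionaries. The `predicted` values are copied from the NationalPark_label column above, while the `expected` mapping and the `accuracy` helper are assumptions added for illustration and are not part of the dataset or of Autolabel.

```python
# Labels copied from the NationalPark_label column of the output above.
predicted = {
    "Glacier National Park": "Montana",
    "Denali National Park": "Alaska",
    "Lassen Volcanic National Park": "California",
    "Olympic National Park": "Washington",
    "Pinnacles National Park": "California",
}

# Reference states entered by hand for this check (not part of the dataset).
expected = {
    "Glacier National Park": "Montana",
    "Denali National Park": "Alaska",
    "Lassen Volcanic National Park": "California",
    "Olympic National Park": "Washington",
    "Pinnacles National Park": "California",
}


def accuracy(predicted, expected):
    """Fraction of parks whose label matches, ignoring case and whitespace."""
    matches = sum(
        predicted.get(name, "").strip().lower() == state.lower()
        for name, state in expected.items()
    )
    return matches / len(expected)


score = accuracy(predicted, expected)
mismatches = [
    name
    for name, state in expected.items()
    if predicted.get(name, "").strip().lower() != state.lower()
]
print(f"accuracy: {score:.0%}, mismatches: {mismatches}")
```

Normalizing case and whitespace before comparing keeps the check from failing on answers that differ only in formatting, such as a trailing space.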
