# Evaporate Demo

This demo shows how to extract a DataFrame from raw text using the approach described in the Evaporate paper (Arora et al.).

The idea is to first "fit" on a set of training text: the fitting process uses the LLM to generate a set of parsing functions from that text.
These fitted functions are then applied to new text at inference time.
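
As a rough illustration of this fit-then-apply pattern, the sketch below replaces the LLM call with a hard-coded response so that it runs offline. The names `fake_llm_generate`, `compile_function`, and `apply_function` are hypothetical and are not part of llama_index; the library's actual fitting logic may differ.

```python
import re
from typing import Callable, List


def fake_llm_generate(field: str, sample_text: str) -> str:
    """Hypothetical stand-in for the LLM call made during fitting.

    A real fit step would prompt the model with `sample_text`; here the
    returned source code is hard-coded so the example needs no API key.
    """
    return (
        f"def get_{field}_field(text: str):\n"
        f"    match = re.search(r'{field} of (\\d[\\d,]*)', text)\n"
        f"    return int(match.group(1).replace(',', '')) if match else None\n"
    )


def compile_function(source: str, name: str) -> Callable[[str], object]:
    """Turn generated source code into a callable."""
    namespace = {"re": re}
    exec(source, namespace)  # only acceptable for code you trust
    return namespace[name]


def apply_function(fn: Callable[[str], object], texts: List[str]) -> List[object]:
    """Inference step: run the fitted function over unseen text."""
    return [fn(text) for text in texts]


# "Fit" once on a training sample, then reuse the function without the LLM.
source = fake_llm_generate("population", "Boston has a population of 675,647.")
fn = compile_function(source, "get_population_field")
results = apply_function(
    fn, ["Seattle has a population of 749,256.", "No figure here."]
)
print(results)
```

The point of the pattern is that the expensive LLM call happens once per field during fitting, while inference is plain Python that can be run over many documents.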

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from llama_index import (
    SimpleDirectoryReader,
    ServiceContext,
    LLMPredictor
)
from llama_index.program.predefined import DFEvaporateProgram, EvaporateExtractor, MultiValueEvaporateProgram
from langchain.chat_models import ChatOpenAI
import requests


## Use `DFEvaporateProgram` 

The `DFEvaporateProgram` extracts a 2D DataFrame from a set of datapoints, given a set of fields and some training data on which to "fit" the extraction functions.

### Load data

Here we download the Wikipedia articles for a set of cities.

In [3]:
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

In [4]:
from pathlib import Path

# create the output directory once, before the download loop
data_path = Path('data')
data_path.mkdir(exist_ok=True)

for title in wiki_titles:
    # fetch the plain-text extract of the article from the MediaWiki API
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params={
            'action': 'query',
            'format': 'json',
            'titles': title,
            'prop': 'extracts',
            # 'exintro': True,
            'explaintext': True,
        }
    ).json()
    page = next(iter(response['query']['pages'].values()))
    wiki_text = page['extract']

    # write one text file per city
    with open(data_path / f"{title}.txt", 'w', encoding='utf-8') as fp:
        fp.write(wiki_text)


In [5]:
# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(input_files=[f"data/{wiki_title}.txt"]).load_data()

### Parse Data

In [20]:
# setup service context
# llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-4"))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, chunk_size=512
)

In [7]:
# get nodes for each document
city_nodes = {}
for wiki_title in wiki_titles:
    docs = city_docs[wiki_title]
    nodes = service_context.node_parser.get_nodes_from_documents(docs)
    city_nodes[wiki_title] = nodes

In [8]:
# a list of nodes, one per city, each corresponding to the article's intro paragraph
city_pop_nodes = [city_nodes["Toronto"][0], city_nodes["Seattle"][0]]

### Running the DFEvaporateProgram

Here we demonstrate how to extract datapoints with our `DFEvaporateProgram`. Given a set of fields, the `DFEvaporateProgram` can first fit functions on a set of training data, and then run extraction over inference data.

In [14]:
# define program
program = DFEvaporateProgram.from_defaults(fields_to_extract=["population"], service_context=service_context)

### Fitting Functions

In [15]:
program.fit_fields(city_nodes["Boston"][:1])

Here is a sample of text:

{context_str}


Question: {query_str}

Given the function signature, write Python code to extract the 
"population" field from the text.
Return the result as a single value (string, int, float), and not a list.
Make sure there is a return statement in the code. Do not leave out a return statement.

import re

def get_population_field(text: str):
    """
    Function to extract the "population field", and return the result 
    as a single value.
    """
    


{'population': 'def get_population_field(text: str):\n    """\n    Function to extract population. \n    """\n    \n    # Use regex to extract the population field\n    population_field = re.search(r\'population of (\\d+,?\\d*)\', text).group(1)\n    \n    # Return the population field as a single value\n    return int(population_field.replace(\',\', \'\'))'}

In [16]:
# view extracted function
print(program.get_function_str("population"))

def get_population_field(text: str):
    """
    Function to extract population. 
    """
    
    # Use regex to extract the population field
    population_field = re.search(r'population of (\d+,?\d*)', text).group(1)
    
    # Return the population field as a single value
    return int(population_field.replace(',', ''))
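
The fitted function is only as general as the single training node it was generated from. The sketch below re-runs the same regex on a few hand-written strings (not taken from the Wikipedia data) to probe its behavior; `safe_extract` is a hypothetical wrapper added for illustration and is not part of llama_index.

```python
import re
from typing import Optional


def get_population_field(text: str):
    # Same body as the fitted function printed above.
    population_field = re.search(r'population of (\d+,?\d*)', text).group(1)
    return int(population_field.replace(',', ''))


def safe_extract(text: str) -> Optional[int]:
    """Hypothetical wrapper: return None instead of raising on no match."""
    try:
        return get_population_field(text)
    except AttributeError:  # re.search returned None, so .group failed
        return None


# Hand-written probe strings (assumptions, not real article text).
one_comma = get_population_field("a population of 749,256 in 2022")
two_commas = get_population_field("a population of 2,794,356 in 2021")
no_match = safe_extract("the population was 675,647")
print(one_comma, two_commas, no_match)
```

Because the pattern allows at most one comma, a seven-digit figure such as `2,794,356` is cut off after the second group of digits, and text that does not contain the exact phrase "population of" makes the unwrapped function raise. Checking the fitted function on a few held-out samples before running inference helps surface such cases.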


### Run Inference

In [23]:
seattle_df = program(infer_data=city_nodes["Seattle"][:1])

In [24]:
seattle_df

DataFrameRowsOnly(rows=[DataFrameRow(row_values=[749256])])

## Use `MultiValueEvaporateProgram` 

In contrast to the `DFEvaporateProgram`, which assumes the output obeys a 2D tabular format (one row per node), the `MultiValueEvaporateProgram` returns a list of `DataFrameRow` objects. Each object corresponds to a column and can contain a variable number of values. This helps when we want to extract multiple values for one field from a given piece of text.
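
To make the difference in output shape concrete, here is a small sketch that uses plain Python lists instead of the library's `DataFrameRow` classes. The sample texts and both extraction functions are invented for illustration, and the sketch assumes that multi-value results are pooled per field.

```python
import re
from typing import List, Optional

# Hypothetical pattern and texts, used only to contrast the two shapes.
PATTERN = re.compile(r"population of (\d[\d,]*)")

texts = [
    "Seattle has a population of 749,256.",
    "Boston has a population of 675,647. "
    "Nearby Cambridge has a population of 118,403.",
]


def to_int(raw: str) -> int:
    return int(raw.replace(",", ""))


def extract_single(text: str) -> Optional[int]:
    """Single-value style: at most one value per piece of text."""
    match = PATTERN.search(text)
    return to_int(match.group(1)) if match else None


def extract_multi(text: str) -> List[int]:
    """Multi-value style: every match in the text is kept."""
    return [to_int(raw) for raw in PATTERN.findall(text)]


# Tabular shape: one row per text, one cell per field.
table_rows = [[extract_single(text)] for text in texts]

# Column shape: one variable-length list of values per field.
column_values = [value for text in texts for value in extract_multi(text)]

print(table_rows)
print(column_values)
```

The tabular form drops the second figure in the Boston text because each row holds exactly one value per field, whereas the column form keeps all three values.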

In this example, we use this program to extract restaurant names from a web page listing the best restaurants in San Francisco.

In [None]:
!pip install llama-hub

In [3]:
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-4"))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, chunk_size=1024, chunk_overlap=0
)

In [4]:
from llama_hub.web.simple_web.base import SimpleWebPageReader
from llama_hub.web.beautiful_soup_web.base import BeautifulSoupWebReader
from llama_hub.web.unstructured_web.base import UnstructuredURLLoader


# reader = SimpleWebPageReader(html_to_text=True)
# reader = BeautifulSoupWebReader()
# reader = UnstructuredURLLoader(urls=["https://www.theinfatuation.com/san-francisco/guides/best-restaurants-in-san-francisco"])
reader = UnstructuredURLLoader(urls=["https://www.timeout.com/san-francisco/restaurants/best-restaurants-in-san-francisco"])

In [5]:
# documents = reader.load_data(urls=["https://www.theinfatuation.com/san-francisco/guides/best-restaurants-in-san-francisco"])
# documents = reader.load_data(urls=["https://www.timeout.com/san-francisco/restaurants/best-restaurants-in-san-francisco"])
documents = reader.load()

In [6]:
nodes = service_context.node_parser.get_nodes_from_documents(documents)

In [7]:
print(nodes[3].get_content())

a more neighborhood-focused contemporary California restaurant with Moroccan influences. There are still some classic Moroccan dishes that diners love, such as Basteeya, hand-rolled couscous with aged butter, and fresh new dishes. A glowing bar is beautiful with blue-green tiles. The airy main dining room can get loud on busy nights, so ask to be seated in the cozy back room if you want a quieter experience.

Read more

Book online

Photograph: Courtesy of Bodega/Erin Ng

16. Bodega SF

For nearly 15 years, Bodega Bistro was a solid choice for Vietnamese fare in the Tenderloin until it shut down in 2017. The eatery was run by Matt Ho's father and uncles, and Ho always wanted to bring it back. He did just that in the form of a pop-up in 2019. That plan shifted during the pandemic to offering weekly meal kits out of Rooster and Rice in the Castro. Finally, in June 2022, Ho opened Bodega SF in its original neighborhood of the Tenderloin as a sit-down restaurant offering high-end yet appro

In [8]:
from llama_index.program.predefined import MultiValueEvaporateProgram
program = MultiValueEvaporateProgram.from_defaults(fields_to_extract=["restaurant_names"], service_context=service_context)

In [9]:
program.fit(nodes[3:4], "restaurant_names", expected_output=["Bodega SF", "Waterbar", "The Progress", "Ju-Ni"])
# program.fit(nodes[:1], "restaurant_names")

Here is a sample of text:

{context_str}


Question: {query_str}
Expected function output: ['Bodega SF', 'Waterbar', 'The Progress', 'Ju-Ni']


Given the function signature, write Python code to extract the 
"restaurant_names" field from the text.
Return the result as a list of values (if there is just one item, return a single element list).
Make sure there is a return statement in the code. Do not leave out a return statement.

import re

def get_restaurant_names_field(text: str) -> List:
    """
    Function to extract the "restaurant_names field", and return the result 
    as a single value.
    """
    


'def get_restaurant_names_field(text: str):\n    """\n    Function to extract restaurant_names. \n    """\n    \n    # Use regex to extract the restaurant names\n    restaurant_names = re.findall(r\'[A-Z][a-z]+\\s[A-Z][a-z]+\', text)\n    \n    # Return the list of restaurant names\n    return restaurant_names'

In [10]:
print(program.get_function_str("restaurant_names"))

def get_restaurant_names_field(text: str):
    """
    Function to extract restaurant_names. 
    """
    
    # Use regex to extract the restaurant names
    restaurant_names = re.findall(r'[A-Z][a-z]+\s[A-Z][a-z]+', text)
    
    # Return the list of restaurant names
    return restaurant_names


In [11]:
program(nodes=nodes[:1])

[DataFrameRow(row_values=['San Francisco', 'Book Online', 'Courtesy Sula', 'Cavallo Point', 'Golden Gate', 'Murray Circle', 'New American', 'Michael Garcia', 'Bay Area', 'Fort Bragg', 'Book Online', 'Courtesy Empress', 'Allen Chen', 'Chef Ho', 'Chee Boon', 'Chef Ho', 'Jean Bai', 'Executive Chef', 'Joe Hou', 'Per Se', 'Negroni Coast', 'Post Martini', 'Book Online', 'Anson Smart', 'Courtesy Nari', 'Pim Techamuanvivit', 'Kin Khao', 'Meghan Clark', 'Book Online', 'Lorena Masso', 'Courtesy El', 'Buen Comer', 'El Buen', 'Glen Park', 'This Bernal'])]
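
The output above contains many strings that are not restaurant names, because the fitted regex matches any two consecutive capitalized words. One way to notice this early is to score the fitted function against the expected output on the training text. The sketch below does so on a shortened excerpt of the node printed earlier; `score` is a hypothetical helper and is not part of llama_index.

```python
import re
from typing import List, Tuple


def get_restaurant_names_field(text: str) -> List[str]:
    # Same regex as the fitted function printed above.
    return re.findall(r'[A-Z][a-z]+\s[A-Z][a-z]+', text)


def score(predicted: List[str], expected: List[str]) -> Tuple[float, float]:
    """Hypothetical helper: set-based precision and recall."""
    pred, exp = set(predicted), set(expected)
    hits = len(pred & exp)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(exp) if exp else 0.0
    return precision, recall


# Shortened excerpt of the training node shown earlier in this demo.
sample = (
    "16. Bodega SF\n\n"
    "For nearly 15 years, Bodega Bistro was a solid choice for Vietnamese "
    "fare in the Tenderloin until it shut down in 2017. The eatery was run "
    "by Matt Ho's father and uncles."
)

predicted = get_restaurant_names_field(sample)
precision, recall = score(predicted, ["Bodega SF"])
print(predicted, precision, recall)
```

On this excerpt the regex returns "Bodega Bistro" and "Matt Ho" but misses "Bodega SF", since "SF" contains no lowercase letters. A low score like this suggests refitting with a different training node or a more descriptive field name before running inference.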

## Bonus: Use the underlying `EvaporateExtractor`

The underlying `EvaporateExtractor` offers additional functionality, such as identifying candidate fields in a set of text.

Here we show how you can use `identify_fields` to determine the relevant fields for a general `topic`.

In [25]:
extractor = program.extractor

In [26]:
# Try with Toronto and Seattle (should extract "population")
existing_fields = extractor.identify_fields(city_pop_nodes, topic="city", fields_top_k=1)

In [None]:
existing_fields

['population']