# How to Extract & Structure with `truffles`
`truffles` extends Playwright's functionality:
- A `truffles` page can be used just like a `playwright` page, but with added tools to extract and structure lists.

- If (god forbid) the LLM does not work, you can still use conventional Playwright without problems.

## Import & Setup Playwright

In [1]:
# import playwright (truffles itself is imported further below)
from playwright.async_api import async_playwright

# start playwright
p = await async_playwright().start()
browser = await p.chromium.launch()
playwright_page = await browser.new_page(  # a realistic user agent goes a long way
    user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
)

## Setup & Initialize `truffles`

In [2]:
URL = "https://kaufleuten.ch/events"  # try your own page, it could well work out-of-the-box

In [None]:
# you may need to:
# import os
# os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

import truffles

# wrap with truffles
await truffles.start()
page = await truffles.wrap(playwright_page)

await page.goto(URL)

## Get all list elements

`page.tools` exposes a set of tools for interacting with the page. They augment Playwright with functionality that requires LLM calls.
Currently the main tool is:

- `get_main_list`: Returns a list of locators for the elements that make up the main list on the page.

In [4]:
locators = await page.tools.get_main_list()  # congrats! you now have a list of locators.

### `get_main_list`, what is it good for?

- If the page content or page structure changes, you can still use the same code.

- You can use the same code to extract the main list from multiple different pages.

- Selector caching via `StoreManager` prevents redundant LLM calls in many cases.

- The locators we received are real locators on the page
    - no hallucinations, yay! 

## Extracting Data

To make sure the data we receive has the expected shape, we can use a pydantic model to structure it.


In [5]:
from pydantic import BaseModel, Field


class Event(BaseModel):  # define a pydantic model to structure the data
    title: str = Field(description="The title of the event")
    date: str = Field(
        description="The date of the event",
        # pydantic v2 deprecates arbitrary keyword arguments on Field
        json_schema_extra={"format": "YYYY-MM-DD"},
    )
    description: str = Field(description="Description of the event. If longer than 100 words, summarize.")

    # providing default values works very well
    location: str = Field(description="The location of the event.", default="")

We can now use `locator.tools.to_structure()` to extract the data from each locator.

```python
from tqdm import tqdm

results = []
for loc in tqdm(locators): # This takes about 1.5 seconds per locator
    results.append(await loc.tools.to_structure(Event))
```

Since most of the time is spent waiting for the LLM, we can run these calls concurrently for **ultra-speed**:


In [6]:
import asyncio

task_list = [
    # note the missing await: these coroutines only run once gathered below
    loc.tools.to_structure(Event)
    for loc in locators
]

# If the OpenAI API complies, this takes about 12 seconds total
# or roughly 0.1 seconds per locator!
results = await asyncio.gather(*task_list)
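
Firing every request at once can run into API rate limits. One common way to keep the calls concurrent while capping how many are in flight is an `asyncio.Semaphore`. The sketch below is an assumption about how you might do this, not something `truffles` provides: `gather_limited` and `fake_to_structure` are hypothetical names, and `fake_to_structure` merely stands in for `loc.tools.to_structure(Event)`.

```python
import asyncio


async def gather_limited(coros, limit: int = 8):
    """Await all coroutines, running at most `limit` of them at the same time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        # each coroutine waits here until one of the `limit` slots is free
        async with semaphore:
            return await coro

    # gather returns results in the same order as its inputs
    return await asyncio.gather(*(run(c) for c in coros))


async def fake_to_structure(i: int) -> dict:
    # hypothetical stand-in for `loc.tools.to_structure(Event)`
    await asyncio.sleep(0.01)
    return {"title": f"event {i}"}
```

In a notebook cell this could be called as `results = await gather_limited([loc.tools.to_structure(Event) for loc in locators], limit=5)`. A good value for `limit` depends on your API quota.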

> Note: with `.to_structure(filter_relevance=True)`, the LLM returns `None` if the locator is not relevant to the pydantic schema (for example, a purely functional `<div/>` or `<span/>`).

### Check the Results

We can now print the results and compare them to the original page.

In [None]:
import json

for result in results[:5]:
    if result is None:
        print("<!-- filtered element -->")
        continue

    print(json.dumps(result.model_dump(), indent=2, ensure_ascii=False))
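
LLM output does not always follow the requested `YYYY-MM-DD` format, so it can be worth checking the `date` field before trusting it. The following is a minimal standard-library sketch, not part of `truffles`: `is_iso_date` and `split_by_date` are hypothetical helper names, and the sample records are made up for illustration. It works on plain dictionaries such as those returned by `result.model_dump()`.

```python
import re
from datetime import datetime
from typing import Optional

# exactly four digits, two digits, two digits, separated by dashes
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value: str) -> bool:
    """Return True if `value` is a real calendar date written as YYYY-MM-DD."""
    if not _ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")  # rejects e.g. 2024-02-30
    except ValueError:
        return False
    return True


def split_by_date(records: list[Optional[dict]]) -> tuple[list[dict], list[dict]]:
    """Drop filtered (None) entries and split the rest into (valid, suspicious)."""
    valid, suspicious = [], []
    for record in records:
        if record is None:  # element was filtered out as irrelevant
            continue
        date_text = str(record.get("date", ""))
        target = valid if is_iso_date(date_text) else suspicious
        target.append(record)
    return valid, suspicious


# made-up sample data for illustration
sample = [
    {"title": "Jazz Night", "date": "2024-06-01"},
    None,
    {"title": "Open Mic", "date": "01.06.2024"},
]
valid, suspicious = split_by_date(sample)
```

With real results you could pass `[r.model_dump() if r else None for r in results]` and review the `suspicious` entries by hand.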

### It gets a bit more interesting

Internally, `truffles` uses a `StoreManager` to cache the locators and selectors. So, if you call `get_main_list` again, it will return the same locators and selectors, but:
- almost instantly \*

- without the need for LLM calls \*


\* if the locators and selectors are cached and the page structure has not changed too much.

In [8]:
# you can set a custom cache marker to reuse the same cache across different pages
locators_again = await page.tools.get_main_list(marker_id="main-list-1")

In [9]:
# now this is instantaneous
more_locators = await page.tools.get_main_list(marker_id="main-list-1")
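
To build intuition for why the second call is fast, here is a toy illustration of a cache keyed by a marker id. It is only a conceptual sketch and makes no claim about how `StoreManager` is actually implemented: `SelectorCache`, `find_selector_with_llm`, and the selector string are all hypothetical, and a real cache would also need to check that a stored selector still matches the page.

```python
from typing import Callable, Dict


class SelectorCache:
    """Toy in-memory cache mapping a marker id to a CSS selector."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.misses = 0  # how often the expensive lookup had to run

    def get_or_compute(self, marker_id: str, compute: Callable[[], str]) -> str:
        if marker_id not in self._store:
            # cache miss: run the expensive lookup once and remember the result
            self.misses += 1
            self._store[marker_id] = compute()
        return self._store[marker_id]


def find_selector_with_llm() -> str:
    # hypothetical stand-in for an expensive LLM call
    return "ul.events > li"


cache = SelectorCache()
first = cache.get_or_compute("main-list-1", find_selector_with_llm)
second = cache.get_or_compute("main-list-1", find_selector_with_llm)  # served from the cache
```

The second lookup never calls `find_selector_with_llm`, which mirrors why a repeated `get_main_list` call with the same `marker_id` can skip the LLM.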

## The Entire Script

For many pages that you want to extract with the same `pydantic` schema, the script is exactly the same (incredible, right?).
> If you want to help increase the number of pages where 3 lines are enough, open an issue and/or contribute!

In [10]:
await page.goto(URL)

locators = await page.tools.get_main_list(marker_id="main-list-1")  # congrats! you now have a list of locators.

results = await asyncio.gather(*[loc.tools.to_structure(Event) for loc in locators])

### That is **3 lines of code** to extract from many, many pages.

In fact, we can make this even easier with a `ListingTask` that takes care of the entire process.

In [None]:
from truffles.tasks import ListingTask

task = ListingTask(page=page, schema=Event)
results = await task.run()

for res in results[:5]:
    if not res:
        print("<!-- filtered element -->")
        continue
    print(res)