# Scraping Wikipedia

[View on GitHub](https://github.com/villagecomputing/superpipe/tree/main/docs/examples/web_scraping/web_scraping.ipynb)

We'll use Superpipe to build a pipeline that takes a famous person's name and determines their date of birth, whether they're still alive, and, if not, their cause of death.

The pipeline works in four steps:

1. Run a Google search for the person's name
2. Use an LLM to extract the URL of their Wikipedia page from the search results
3. Fetch the contents of the Wikipedia page and convert them to markdown
4. Use an LLM to extract the date of birth, living status, and cause of death from the Wikipedia contents

We'll build the pipeline, evaluate it on some data, and optimize it to maximize accuracy while reducing cost and latency.

## Step 1: Building the pipeline

In [1]:
from superpipe.steps import LLMStructuredStep, CustomStep, SERPEnrichmentStep
from superpipe import models
from pydantic import BaseModel, Field

# Step 1: use Superpipe's built-in SERP enrichment step to search for the person's Wikipedia page
# Include a unique "name" for the step; it will be used to reference this step's output in future steps

search_step = SERPEnrichmentStep(
  prompt= lambda row: f"{row['name']} wikipedia",
  name="search"
)

# Step 2: Use an LLM to extract the Wikipedia URL from the search results
# First, define a Pydantic model that specifies the structured output we want from the LLM

class ParseSearchResult(BaseModel):
  wikipedia_url: str = Field(description="The URL of the Wikipedia page for the person")

# Then we use the built-in LLMStructuredStep and specify a model and a prompt
# The prompt is a function that has access to all the fields in the input as well as the outputs of previous steps

parse_search_step = LLMStructuredStep(
  model=models.gpt35,
  prompt= lambda row: f"Extract the Wikipedia URL for {row['name']} from the following search results: \n\n {row['search']}",
  out_schema=ParseSearchResult,
  name="parse_search"
)
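
To make the data flow concrete, here is a minimal, library-free sketch of how a row could accumulate step outputs that later steps then read. It is not Superpipe's implementation: `run_steps` and the two stand-in steps are hypothetical names used only for illustration, and the sketch stores each result under the step's name, whereas a structured step also exposes the fields of its output schema (such as `wikipedia_url`).

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
# A step here is just (name, function); the function sees the whole row so far.
Step = Tuple[str, Callable[[Row], Any]]


def run_steps(row: Row, steps: List[Step]) -> Row:
    """Apply steps in order, storing each result under the step's name."""
    result = dict(row)  # copy so the caller's input is not mutated
    for name, fn in steps:
        result[name] = fn(result)
    return result


# Hypothetical stand-ins for the search and parse steps (no network, no LLM).
steps: List[Step] = [
    ("search", lambda row: f"results for '{row['name']} wikipedia'"),
    ("parse_search", lambda row: row["search"].upper()),
]

out = run_steps({"name": "Jean-Paul Sartre"}, steps)
print(out["search"])
print(out["parse_search"])
```

The second stand-in step can read `row["search"]` only because the first step has already written it, which is the same ordering constraint the real pipeline relies on.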

In [None]:
from superpipe.pipeline import Pipeline
import requests
import html2text
import json

h = html2text.HTML2Text()
h.ignore_links = True

# Step 3: create a CustomStep, which can execute an arbitrary function (transform)
# The function fetches the contents of the Wikipedia URL and converts them to markdown

fetch_wikipedia_step = CustomStep(
  transform=lambda row: h.handle(requests.get(row['wikipedia_url']).text),
  name="wikipedia"
)

# Step 4: extract the date of birth, living/dead status and cause of death from the Wikipedia contents

class ExtractedData(BaseModel):
    date_of_birth: str = Field(description="The date of birth of the person in the format YYYY-MM-DD")
    alive: bool = Field(description="Whether the person is still alive")
    cause_of_death: str = Field(description="The cause of death of the person. If the person is alive, return 'N/A'")

extract_step = LLMStructuredStep(
  model=models.gpt4,
  prompt= lambda row: f"""Extract the date of birth for {row['name']}, whether they're still alive \
  and if not, their cause of death from the following Wikipedia content: \n\n {row['wikipedia']}""",
  out_schema=ExtractedData,
  name="extract_data"
)

# Finally we define and run the pipeline

pipeline = Pipeline([
  search_step,
  parse_search_step,
  fetch_wikipedia_step,
  extract_step
])

output = pipeline.run({"name": "Jean-Paul Sartre"})
print(json.dumps(output, indent=2))
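
The value of `wikipedia_url` is produced by an LLM, so it may be worth checking it before passing it to `requests.get`. The following is a sketch of one possible guard using only the standard library. `is_wikipedia_url` is a hypothetical helper, not part of Superpipe, and the rule it applies (http or https on `wikipedia.org` or a subdomain) is an assumption about what counts as acceptable.

```python
from urllib.parse import urlparse


def is_wikipedia_url(url: str) -> bool:
    """Return True only for http(s) URLs on wikipedia.org or one of its subdomains."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and (
        host == "wikipedia.org" or host.endswith(".wikipedia.org")
    )


print(is_wikipedia_url("https://en.wikipedia.org/wiki/Jean-Paul_Sartre"))
print(is_wikipedia_url("https://example.com/wiki/Jean-Paul_Sartre"))
```

One way to use it would be to call it inside the `transform` of the fetch step and skip the request when the check fails, so that a bad URL shows up as an empty `wikipedia` field rather than an unrelated page.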

## Step 2: Evaluating the pipeline

Now we'll evaluate the pipeline on a dataset. Think of this as unit tests for your code: just as you wouldn't ship code to production without testing it, you shouldn't ship LLM pipelines to production without evaluating them.

To do this, we need:

1. **A dataset with labels** - in this case, a list of famous people along with the true date of birth, living status and cause of death of each person
2. **An evaluation function** - a function that defines what "correct" is. We'll use simple comparison for date of birth and living status, and an LLM call to evaluate the correctness of cause of death.

In [8]:
import pandas as pd

data = [
  ("Ruth Bader Ginsburg", "1933-03-15", False, "Pancreatic cancer"),
  ("Bill Gates", "1955-10-28", True, "N/A"),
  ("Steph Curry", "1988-03-14", True, "N/A"),
  ("Scott Belsky", "1980-04-18", True, "N/A"),
  ("Steve Jobs", "1955-02-24", False, "Pancreatic tumor/cancer"),
  ("Paris Hilton", "1981-02-17", True, "N/A"),
  ("Kurt Vonnegut", "1922-11-11", False, "Brain injuries"),
  ("Snoop Dogg", "1971-10-20", True, "N/A"),
  ("Kobe Bryant", "1978-08-23", False, "Helicopter crash"),
  ("Aaron Swartz", "1986-11-08", False, "Suicide")
]
df = pd.DataFrame([{"name": d[0], "dob_label": d[1], "alive_label": d[2], "cause_label": d[3]} for d in data])

class EvalResult(BaseModel):
  result: bool = Field(description="Is the answer correct or not?")

cause_evaluator = LLMStructuredStep(
  model=models.gpt4,
  prompt=lambda row: f"This is the correct cause of death: {row['cause_label']}. Is this provided cause of death accurate? The phrasing might be slightly different. Use your judgement: \n{row['cause_of_death']}",
  out_schema=EvalResult,
  name="cause_evaluator")

def eval_fn(row):
  score = 0
  if row['date_of_birth'] == row['dob_label']:
    score += 0.25
  if row['alive'] == row['alive_label']:
    score += 0.25
  if row['cause_label'] == "N/A":
    if row['cause_of_death'] == "N/A":
      score += 0.5
  elif cause_evaluator.run(row)['result']:
    score += 0.5
  return score

pipeline.run(df)
print("Score: ", pipeline.evaluate(eval_fn))
df

Applying step search: 100%|██████████| 10/10 [00:08<00:00,  1.16it/s]
Applying step parse_search: 100%|██████████| 10/10 [00:10<00:00,  1.02s/it]
Applying step wikipedia: 100%|██████████| 10/10 [00:04<00:00,  2.27it/s]
Applying step extract_data: 100%|██████████| 10/10 [01:26<00:00,  8.66s/it]


Score:  1.0


| | name | dob_label | alive_label | cause_label | search | `__parse_search__` | wikipedia_url | wikipedia | `__extract_data__` | date_of_birth | alive | cause_of_death | `__eval_fn__` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ruth Bader Ginsburg | 1933-03-15 | False | Pancreatic cancer | `{"searchParameters":{"q":"Ruth Bader Ginsburg ...` | `{'input_tokens': 1922, 'output_tokens': 23, 'i...` | https://en.wikipedia.org/wiki/Ruth_Bader_Ginsburg | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 46522, 'output_tokens': 37, '...` | 1933-03-15 | False | complications of metastatic pancreatic cancer | 1.0 |
| 1 | Bill Gates | 1955-10-28 | True | | `{"searchParameters":{"q":"Bill Gates wikipedia...` | `{'input_tokens': 1809, 'output_tokens': 20, 'i...` | https://en.wikipedia.org/wiki/Bill_Gates | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 46613, 'output_tokens': 32, '...` | 1955-10-28 | True | | 1.0 |
| 2 | Steph Curry | 1988-03-14 | True | | `{"searchParameters":{"q":"Steph Curry wikipedi...` | `{'input_tokens': 1339, 'output_tokens': 20, 'i...` | https://en.wikipedia.org/wiki/Stephen_Curry | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 64861, 'output_tokens': 32, '...` | 1988-03-14 | True | | 1.0 |
| 3 | Scott Belsky | 1980-04-18 | True | | `{"searchParameters":{"q":"Scott Belsky wikiped...` | `{'input_tokens': 1566, 'output_tokens': 21, 'i...` | https://en.wikipedia.org/wiki/Scott_Belsky | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 2227, 'output_tokens': 32, 'i...` | 1980-04-18 | True | | 1.0 |
| 4 | Steve Jobs | 1955-02-24 | False | Pancreatic tumor/cancer | `{"searchParameters":{"q":"Steve Jobs wikipedia...` | `{'input_tokens': 1625, 'output_tokens': 20, 'i...` | https://en.wikipedia.org/wiki/Steve_Jobs | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 47086, 'output_tokens': 42, '...` | 1955-02-24 | False | respiratory arrest related to a pancreatic neu... | 1.0 |
| 5 | Paris Hilton | 1981-02-17 | True | | `{"searchParameters":{"q":"Paris Hilton wikiped...` | `{'input_tokens': 1322, 'output_tokens': 20, 'i...` | https://en.wikipedia.org/wiki/Paris_Hilton | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 49288, 'output_tokens': 32, '...` | 1981-02-17 | True | | 1.0 |
| 6 | Kurt Vonnegut | 1922-11-11 | False | Brain injuries | `{"searchParameters":{"q":"Kurt Vonnegut wikipe...` | `{'input_tokens': 1369, 'output_tokens': 22, 'i...` | https://en.wikipedia.org/wiki/Kurt_Vonnegut | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 29700, 'output_tokens': 45, '...` | 1922-11-11 | False | brain injuries incurred several weeks prior, f... | 1.0 |
| 7 | Snoop Dogg | 1971-10-20 | True | | `{"searchParameters":{"q":"Snoop Dogg wikipedia...` | `{'input_tokens': 1702, 'output_tokens': 20, 'i...` | https://en.wikipedia.org/wiki/Snoop_Dogg | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 40901, 'output_tokens': 32, '...` | 1971-10-20 | True | | 1.0 |
| 8 | Kobe Bryant | 1978-08-23 | False | Helicopter crash | `{"searchParameters":{"q":"Kobe Bryant wikipedi...` | `{'input_tokens': 1355, 'output_tokens': 21, 'i...` | https://en.wikipedia.org/wiki/Kobe_Bryant | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 74108, 'output_tokens': 33, '...` | 1978-08-23 | False | helicopter crash | 1.0 |
| 9 | Aaron Swartz | 1986-11-08 | False | Suicide | `{"searchParameters":{"q":"Aaron Swartz wikiped...` | `{'input_tokens': 1329, 'output_tokens': 21, 'i...` | https://en.wikipedia.org/wiki/Aaron_Swartz | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 37532, 'output_tokens': 34, '...` | 1986-11-08 | False | Suicide by hanging | 1.0 |
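
To see how the weights in `eval_fn` combine, here is a self-contained sketch of the same rubric that runs without an LLM. The `keyword_judge` function is a hypothetical, much cruder stand-in for `cause_evaluator` (it only checks that every word of the label appears in the answer), and `score_row` is an illustrative name, not a Superpipe function.

```python
from typing import Any, Callable, Dict


def keyword_judge(label: str, answer: str) -> bool:
    """Hypothetical stand-in for the LLM judge: every label word must appear in the answer."""
    answer_lower = answer.lower()
    return all(word in answer_lower for word in label.lower().split())


def score_row(row: Dict[str, Any], judge: Callable[[str, str], bool] = keyword_judge) -> float:
    """Same weights as eval_fn: 0.25 date of birth, 0.25 living status, 0.5 cause of death."""
    score = 0.0
    if row["date_of_birth"] == row["dob_label"]:
        score += 0.25
    if row["alive"] == row["alive_label"]:
        score += 0.25
    if row["cause_label"] == "N/A":
        # For living people, only the literal "N/A" counts as correct.
        if row["cause_of_death"] == "N/A":
            score += 0.5
    elif judge(row["cause_label"], row["cause_of_death"]):
        score += 0.5
    return score


row = {
    "dob_label": "1978-08-23", "alive_label": False, "cause_label": "Helicopter crash",
    "date_of_birth": "1978-08-23", "alive": False, "cause_of_death": "helicopter crash",
}
print(score_row(row))                                     # all three fields match
print(score_row({**row, "date_of_birth": "1978-08-24"}))  # wrong date of birth
```

Because the date comparison is an exact string match, a correct date in any format other than `YYYY-MM-DD` would lose the 0.25 for that field, which is why the format is spelled out in the `ExtractedData` field description.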


## Step 3: Optimizing the pipeline

This pipeline has an accuracy score of 100%, but perhaps there's room for improvement on cost and speed. First let's view the cost and latency of each step to figure out which one is the bottleneck.

In [4]:
for step in pipeline.steps:
  print(f"Step {step.name}:")
  print(f"- Latency: {step.statistics.total_latency}")
  print(f"- Cost: {step.statistics.input_cost + step.statistics.output_cost}")

Step search:
- Latency: 12.000389575958252
- Cost: 0.0
Step parse_search:
- Latency: 10.51110366685316
- Cost: 0.008334
Step wikipedia:
- Latency: 4.235257387161255
- Cost: 0.0
Step extract_data:
- Latency: 90.95815300196409
- Cost: 4.7203800000000005


Clearly the final step (`extract_data`) is responsible for the bulk of the cost and latency. This makes sense, because we're feeding the entire Wikipedia article to GPT-4, one of the most expensive models.
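
To put numbers on "the bulk", the short calculation below turns the per-step totals printed above into shares of the whole run. The figures are copied from that output; latency is assumed to be in seconds, and cost is used in whatever unit Superpipe reports.

```python
# Per-step totals copied from the output above: (latency, cost).
stats = {
    "search": (12.000389575958252, 0.0),
    "parse_search": (10.51110366685316, 0.008334),
    "wikipedia": (4.235257387161255, 0.0),
    "extract_data": (90.95815300196409, 4.7203800000000005),
}

total_latency = sum(latency for latency, _ in stats.values())
total_cost = sum(cost for _, cost in stats.values())

# Fraction of the run's latency and cost attributable to each step.
shares = {
    name: (latency / total_latency, cost / total_cost)
    for name, (latency, cost) in stats.items()
}

for name, (latency_share, cost_share) in shares.items():
    print(f"{name:>12}: {latency_share:6.1%} of latency, {cost_share:6.1%} of cost")
```

With these figures, `extract_data` accounts for about 77% of the latency and over 99% of the cost.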

Let's find out if we can get away with a cheaper, faster model. Many models cannot handle the number of tokens in a whole Wikipedia article, so we'll try two that can and that are also cheaper than GPT-4: Claude 3 Sonnet and Claude 3 Haiku.
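
A rough way to check the token constraint is to compare the input token counts recorded during evaluation with each candidate's context window. In the sketch below the token counts are copied from the `__extract_data__` column of the evaluation output, while the context window sizes are assumptions that should be verified against each provider's documentation. Token counts also vary between tokenizers, so treat the result as an estimate.

```python
# Input token counts for extract_data, copied from the __extract_data__ column
# of the evaluation output (as counted for the GPT-4 run).
input_tokens = [46522, 46613, 64861, 2227, 47086, 49288, 29700, 40901, 74108, 37532]

# ASSUMPTION: context window sizes in tokens; verify against each provider's docs.
context_windows = {
    "gpt-3.5-turbo-0125": 16_385,
    "claude-3-haiku-20240307": 200_000,
    "claude-3-sonnet-20240229": 200_000,
}

largest = max(input_tokens)

# Number of articles that would not fit in each model's context window.
too_long_by_model = {
    model: sum(tokens > window for tokens in input_tokens)
    for model, window in context_windows.items()
}

for model, too_long in too_long_by_model.items():
    fits = largest <= context_windows[model]
    print(f"{model}: fits largest article = {fits}, articles too long = {too_long}")
```

Under these assumed limits, nine of the ten articles would be too long for GPT-3.5, while both Claude 3 models have room for all of them.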

In [5]:
from superpipe.grid_search import GridSearch
from superpipe.models import claude3_haiku, claude3_sonnet
from superpipe.steps import LLMStructuredCompositeStep

# We need to use LLMStructuredCompositeStep, which uses GPT-3.5 for structured JSON extraction,
# because Claude does not support JSON mode or function calling out of the box
new_extract_step = LLMStructuredCompositeStep(
  model=models.claude3_haiku,
  prompt=extract_step.prompt,
  out_schema=ExtractedData,
  name="extract_data_new"
)

new_pipeline = Pipeline([
  search_step,
  parse_search_step,
  fetch_wikipedia_step,
  new_extract_step
], evaluation_fn=eval_fn)

param_grid = {
  new_extract_step.name:{
    "model": [claude3_haiku, claude3_sonnet]}
}
grid_search = GridSearch(new_pipeline, param_grid)
grid_search.run(df)

Applying step search: 100%|██████████| 10/10 [00:08<00:00,  1.20it/s]
Applying step parse_search: 100%|██████████| 10/10 [00:10<00:00,  1.06s/it]
Applying step wikipedia: 100%|██████████| 10/10 [00:03<00:00,  2.56it/s]
Applying step extract_data_new: 100%|██████████| 10/10 [01:26<00:00,  8.63s/it]
Applying step search: 100%|██████████| 10/10 [00:08<00:00,  1.18it/s]
Applying step parse_search: 100%|██████████| 10/10 [00:10<00:00,  1.03s/it]
Applying step wikipedia: 100%|██████████| 10/10 [00:03<00:00,  2.57it/s]
Applying step extract_data_new: 100%|██████████| 10/10 [05:17<00:00, 31.73s/it]


| | `extract_data_new__model` | score | input_cost | output_cost | total_latency | input_tokens | output_tokens | num_success | num_failure | index |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | claude-3-haiku-20240307 | 1.0 | 0.129856 | 0.001945 | 109.038948 | `defaultdict(, {'gpt-3.5-turbo-0125': 15056, 'claude-3-haiku-20240307': 487402})` | `defaultdict(, {'gpt-3.5-turbo-0125': 208, 'claude-3-haiku-20240307': 1218})` | 10 | 0 | 4643861466949536679 |
| 1 | claude-3-sonnet-20240229 | 0.45 | 1.465117 | 0.022944 | 339.825781 | `defaultdict(, {'gpt-3.5-turbo-0125': 14733, 'claude-3-sonnet-20240229': 488036})` | `defaultdict(, {'gpt-3.5-turbo-0125': 208, 'claude-3-sonnet-20240229': 1786})` | 10 | 0 | 3722756468172814577 |


Strangely, Claude 3 Haiku is more accurate (100% vs. 45%) and also cheaper and faster than Claude 3 Sonnet. This is surprising but useful information that we wouldn't have found out unless we built and evaluated pipelines on _our specific data_ rather than benchmark data.
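
Conceptually, a grid search runs the pipeline once for every combination of candidate values in the parameter grid and records the score, cost, and latency of each run. The sketch below shows only the combination step, for a grid shaped like `param_grid`. `expand_grid` is a hypothetical helper written for illustration and is not how Superpipe's `GridSearch` is necessarily implemented.

```python
from itertools import product
from typing import Any, Dict, Iterator, List

ParamGrid = Dict[str, Dict[str, List[Any]]]


def expand_grid(param_grid: ParamGrid) -> Iterator[Dict[str, Dict[str, Any]]]:
    """Yield one {step_name: {param: value}} dict per combination in the grid."""
    # Flatten to (step, param, candidate values) so product() can combine them.
    axes = [
        (step, param, values)
        for step, params in param_grid.items()
        for param, values in params.items()
    ]
    for combo in product(*(values for _, _, values in axes)):
        config: Dict[str, Dict[str, Any]] = {}
        for (step, param, _), value in zip(axes, combo):
            config.setdefault(step, {})[param] = value
        yield config


# Model names as plain strings, standing in for the model objects used above.
grid = {"extract_data_new": {"model": ["claude-3-haiku-20240307", "claude-3-sonnet-20240229"]}}
configs = list(expand_grid(grid))
for config in configs:
    print(config)
```

With one parameter and two candidates this yields two configurations, matching the two rows in the results table. Adding a second parameter with three candidates would yield six, so the number of pipeline runs (and their cost) grows multiplicatively.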

In [6]:
best_params = grid_search.best_params
new_pipeline.update_params(best_params)
new_pipeline.run(df)
print("Score: ", new_pipeline.score)
for step in new_pipeline.steps:
  print(f"Step {step.name}:")
  print(f"- Latency: {step.statistics.total_latency}")
  print(f"- Cost: {step.statistics.input_cost + step.statistics.output_cost}")

Applying step search: 100%|██████████| 10/10 [00:08<00:00,  1.14it/s]
Applying step parse_search: 100%|██████████| 10/10 [00:11<00:00,  1.15s/it]
Applying step wikipedia: 100%|██████████| 10/10 [00:03<00:00,  2.52it/s]
Applying step extract_data_new: 100%|██████████| 10/10 [01:27<00:00,  8.76s/it]


Score:  1.0
Step search:
- Latency: 8.75270938873291
- Cost: 0.0
Step parse_search:
- Latency: 11.506851500831544
- Cost: 0.007930999999999999
Step wikipedia:
- Latency: 3.9602952003479004
- Cost: 0.0
Step extract_data_new:
- Latency: 87.57113150181249
- Cost: 0.12396325000000001
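
As a closing check, the calculation below compares the whole pipeline before and after the model swap, using the per-step totals printed in this section. The numbers are copied from those outputs; latency is assumed to be in seconds, and the comparison covers a single run of each pipeline, so small differences may be noise.

```python
# Per-step (latency, cost) totals copied from the two runs printed in this section.
before = {
    "search": (12.000389575958252, 0.0),
    "parse_search": (10.51110366685316, 0.008334),
    "wikipedia": (4.235257387161255, 0.0),
    "extract_data": (90.95815300196409, 4.7203800000000005),
}
after = {
    "search": (8.75270938873291, 0.0),
    "parse_search": (11.506851500831544, 0.007930999999999999),
    "wikipedia": (3.9602952003479004, 0.0),
    "extract_data_new": (87.57113150181249, 0.12396325000000001),
}


def totals(stats):
    """Return (total latency, total cost) summed over all steps."""
    latency = sum(step_latency for step_latency, _ in stats.values())
    cost = sum(step_cost for _, step_cost in stats.values())
    return latency, cost


latency_before, cost_before = totals(before)
latency_after, cost_after = totals(after)
cost_ratio = cost_before / cost_after
latency_ratio = latency_before / latency_after

print(f"Cost:    {cost_before:.4f} -> {cost_after:.4f} ({cost_ratio:.1f}x lower)")
print(f"Latency: {latency_before:.1f} -> {latency_after:.1f} ({latency_ratio:.2f}x lower)")
```

On these figures the total cost falls by a factor of roughly 36, while total latency improves by only about 5%, because the extraction call still dominates the runtime.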
