# Scraping Wikipedia

[View on GitHub](https://github.com/villagecomputing/superpipe/tree/main/docs/examples/web_scraping/web_scraping.ipynb)

Here's how to use Superpipe to build a pipeline that takes a list of famous people's names and finds each person's date of birth and whether they are still alive.

The pipeline works in four steps:

1. Run a Google search for the person's name
2. Use an LLM to pick out the URL of their Wikipedia page from the search results
3. Fetch the contents of the Wikipedia page and convert them to Markdown
4. Use an LLM to extract the date of birth and whether the person is still alive from the page contents
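
To make the data flow concrete before bringing in Superpipe, here is a minimal sketch in plain Python of how each step's output can be stored under the step's name and read by later steps. This is a toy runner for illustration only, not Superpipe's implementation: `run_steps` and the stub lambdas are hypothetical, and the stubs stand in for the real search, LLM, and fetch calls.

```python
# Illustrative only: a toy runner, not Superpipe's implementation.
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Step = Tuple[str, Callable[[Row], object]]


def run_steps(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply each step to every row, storing the result under the step's name."""
    results = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not mutated
        for name, fn in steps:
            row[name] = fn(row)  # later steps can read row[name]
        results.append(row)
    return results


# Stub functions standing in for the search, LLM, and fetch calls.
steps: List[Step] = [
    ("search", lambda row: f"results for '{row['name']} wikipedia'"),
    ("wikipedia_url", lambda row: "https://en.wikipedia.org/wiki/" + str(row["name"]).replace(" ", "_")),
    ("wikipedia", lambda row: f"markdown of {row['wikipedia_url']}"),
    ("extract_data", lambda row: {"date_of_birth": "unknown", "alive": None}),
]

rows = run_steps([{"name": "Bill Gates"}], steps)
print(rows[0]["wikipedia_url"])
```

The stub that builds the URL from the name is deliberately naive. A name such as "Steph Curry" does not map directly to a page title, which is why the real pipeline goes through a search and an LLM instead.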

In [1]:
import pandas as pd

names = [
  "Reid Hoffman",
  "Bill Gates",
  "Steph Curry",
  "Scott Belsky",
  "Paris Hilton",
  "Snoop Dogg",
  "Ryan Reynolds",
  "Kevin Durant",
  "Mustafa Suleyman",
  "Aaron Swartz" # RIP
]

names_df = pd.DataFrame([{"name": name} for name in names])


In [2]:
from superpipe.steps import LLMStructuredStep, CustomStep, SERPEnrichmentStep
from superpipe import models
from pydantic import BaseModel, Field
import requests

# Step 1: use Superpipe's built-in SERP enrichment step to search for the person's Wikipedia page
# Give the step a unique "name"; later steps use it to reference this step's output

search_step = SERPEnrichmentStep(
  prompt= lambda row: f"{row['name']} wikipedia",
  name="search"
)

# Step 2: Use an LLM to extract the wikipedia URL from the search results
# First, define a Pydantic model that specifies the structured output we want from the LLM

class ParseSearchResult(BaseModel):
  wikipedia_url: str = Field(description="The URL of the Wikipedia page for the person")

# Then we use the built-in LLMStructuredStep and specify a model and a prompt
# The prompt is a function that has access to all the fields in the input as well as the outputs of previous steps

parse_search_step = LLMStructuredStep(
  model=models.gpt35,
  prompt= lambda row: f"Extract the Wikipedia URL for {row['name']} from the following search results: \n\n {row['search']}",
  out_schema=ParseSearchResult,
  name="parse_search"
)
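
Because the URL in `wikipedia_url` comes from an LLM, it may be worth checking it before fetching the page in the next step. The sketch below is one possible check using only the standard library. The helper name `is_wikipedia_article_url` is hypothetical and not part of Superpipe, and the rule it applies (an http or https link to a `/wiki/` path on a `wikipedia.org` host) is an assumption about what counts as a valid result.

```python
from urllib.parse import urlparse


def is_wikipedia_article_url(url: str) -> bool:
    """Return True if url is an http(s) link to a /wiki/ page on wikipedia.org."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return (
        parsed.scheme in ("http", "https")
        and (host == "wikipedia.org" or host.endswith(".wikipedia.org"))
        and parsed.path.startswith("/wiki/")
        and len(parsed.path) > len("/wiki/")  # require a page title after /wiki/
    )


print(is_wikipedia_article_url("https://en.wikipedia.org/wiki/Reid_Hoffman"))
print(is_wikipedia_article_url("https://example.com/wiki/Reid_Hoffman"))
```

A function like this could be called at the start of a fetch function so that an unexpected URL is skipped rather than requested.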

In [3]:
from superpipe.pipeline import Pipeline
import html2text
h = html2text.HTML2Text()
h.ignore_links = True

# Step 3: create a CustomStep, which can run any function (the "transform") on a row
# The function fetches the contents of the Wikipedia URL and converts them to Markdown

fetch_wikipedia_step = CustomStep(
  transform=lambda row: h.handle(requests.get(row['wikipedia_url']).text),
  name="wikipedia"
)

# Step 4: extract the date of birth and whether the person is still alive from the Wikipedia contents

class ExtractedData(BaseModel):
    date_of_birth: str = Field(description="The date of birth of the person in the format YYYY-MM-DD")
    alive: bool = Field(description="Whether the person is still alive, make sure to return true or false")

extract_step = LLMStructuredStep(
  model=models.gpt4,
  prompt= lambda row: f"Extract the date of birth for {row['name']} and whether they're still alive from the following Wikipedia content: \n\n {row['wikipedia']}",
  out_schema=ExtractedData,
  name="extract_data"
)

# Finally we define and run the pipeline

pipeline = Pipeline([
  search_step,
  parse_search_step,
  fetch_wikipedia_step,
  extract_step
])

pipeline.run(names_df)

Applying step search: 100%|██████████| 10/10 [00:09<00:00,  1.07it/s]
Applying step parse_search: 100%|██████████| 10/10 [00:08<00:00,  1.15it/s]
Applying step wikipedia: 100%|██████████| 10/10 [00:03<00:00,  2.55it/s]
Applying step extract_data: 100%|██████████| 10/10 [01:12<00:00,  7.24s/it]


| | `name` | `search` | `__parse_search__` | `wikipedia_url` | `wikipedia` | `__extract_data__` | `date_of_birth` | `alive` |
|---|---|---|---|---|---|---|---|---|
| 0 | Reid Hoffman | `{"searchParameters":{"q":"Reid Hoffman wikiped...` | `{'input_tokens': 1421, 'output_tokens': 22, 'i...` | https://en.wikipedia.org/wiki/Reid_Hoffman | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 12075, 'output_tokens': 22, '...` | 1967-08-05 | True |
| 1 | Bill Gates | `{"searchParameters":{"q":"Bill Gates wikipedia...` | `{'input_tokens': 1429, 'output_tokens': 20, 'i...` | https://en.wikipedia.org/wiki/Bill_Gates | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 46437, 'output_tokens': 22, '...` | 1955-10-28 | True |
| 2 | Steph Curry | `{"searchParameters":{"q":"Steph Curry wikipedi...` | `{'input_tokens': 1123, 'output_tokens': 20, 'i...` | https://en.wikipedia.org/wiki/Stephen_Curry | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 64784, 'output_tokens': 22, '...` | 1988-03-14 | True |
| 3 | Scott Belsky | `{"searchParameters":{"q":"Scott Belsky wikiped...` | `{'input_tokens': 1144, 'output_tokens': 21, 'i...` | https://en.wikipedia.org/wiki/Scott_Belsky | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 2154, 'output_tokens': 22, 'i...` | 1980-04-18 | True |
| 4 | Paris Hilton | `{"searchParameters":{"q":"Paris Hilton wikiped...` | `{'input_tokens': 1309, 'output_tokens': 20, 'i...` | https://en.wikipedia.org/wiki/Paris_Hilton | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 49211, 'output_tokens': 22, '...` | 1981-02-17 | True |
| 5 | Snoop Dogg | `{"searchParameters":{"q":"Snoop Dogg wikipedia...` | `{'input_tokens': 1607, 'output_tokens': 20, 'i...` | https://en.wikipedia.org/wiki/Snoop_Dogg | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 40844, 'output_tokens': 22, '...` | 1971-10-20 | True |
| 6 | Ryan Reynolds | `{"searchParameters":{"q":"Ryan Reynolds wikipe...` | `{'input_tokens': 1572, 'output_tokens': 21, 'i...` | https://en.wikipedia.org/wiki/Ryan_Reynolds | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 18883, 'output_tokens': 22, '...` | 1976-10-23 | True |
| 7 | Kevin Durant | `{"searchParameters":{"q":"Kevin Durant wikiped...` | `{'input_tokens': 1491, 'output_tokens': 21, 'i...` | https://en.wikipedia.org/wiki/Kevin_Durant | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 49727, 'output_tokens': 22, '...` | 1988-09-29 | True |
| 8 | Mustafa Suleyman | `{"searchParameters":{"q":"Mustafa Suleyman wik...` | `{'input_tokens': 1204, 'output_tokens': 23, 'i...` | https://en.wikipedia.org/wiki/Mustafa_Suleyman | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 4924, 'output_tokens': 22, 'i...` | 1984-08-01 | True |
| 9 | Aaron Swartz | `{"searchParameters":{"q":"Aaron Swartz wikiped...` | `{'input_tokens': 1086, 'output_tokens': 21, 'i...` | https://en.wikipedia.org/wiki/Aaron_Swartz | `Jump to content\n\nMain menu\n\nMain menu\n\nm...` | `{'input_tokens': 37484, 'output_tokens': 22, '...` | 1986-11-08 | |
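
The `date_of_birth` field is declared as a plain `str`, so the `YYYY-MM-DD` format is only requested in the field description and nothing guarantees the LLM follows it. A small post-processing check along the following lines could catch values that do not parse. This is a sketch using only the standard library, and `parse_date_of_birth` is a hypothetical helper, not part of Superpipe.

```python
from datetime import date, datetime
from typing import Optional


def parse_date_of_birth(value: str) -> Optional[date]:
    """Parse a YYYY-MM-DD string into a date, or return None if it does not match."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        # TypeError covers non-string values such as None or NaN
        return None


print(parse_date_of_birth("1967-08-05"))
print(parse_date_of_birth("August 5, 1967"))
```

Rows where the helper returns `None` could then be flagged for review or re-run.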
