# Example Usage of Scraipe Library

Here's a quick example using scraipe to extract mentions of celebrities in news articles.

## Setup
Install and import the dependencies.

In [1]:
%pip install scraipe
#import sys; sys.path.append('scraipe')

ERROR: Could not find a version that satisfies the requirement scraipe (from versions: none)
ERROR: No matching distribution found for scraipe
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from scraipe import Workflow
from scraipe.scrapers import NewsScraper
from scraipe.analyzers import OpenAiAnalyzer
from pydantic import BaseModel

## Extract links
First, we need a list of links to target with scraipe. We will extract all links from the front page of https://apnews.com.

In [3]:
import requests
import re

url = "https://apnews.com/"
response = requests.get(url)
html = response.text

# Use a regex to extract article links
pattern = r'href=["\'](?=[^"\']*/article)([^"\']+)["\']'
news_links = re.findall(pattern, html)

# Remove duplicates
news_links = list(set(news_links))

# Display a summary of the links
news_links_df = pd.DataFrame(news_links, columns=['link'])
import time
print(f"Found {len(news_links_df)} front page AP News links on {time.strftime('%Y-%m-%d')}")
display(news_links_df.head())

Found 143 front page AP News links on 2025-03-24


,link
0,https://apnews.com/article/weather-forecasts-w...
1,https://apnews.com/article/minnesota-senator-j...
2,https://apnews.com/article/north-macedonia-pro...
3,https://apnews.com/article/greenland-inuit-tra...
4,https://apnews.com/article/shelter-dog-tiktok-...
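
The regex above returns whatever string sits inside each `href`, so relative paths could slip through, and `list(set(...))` discards the page order. A minimal sketch of one way to handle both, assuming a hypothetical helper named `extract_article_links` and a made-up HTML snippet: it resolves each match against the page URL with `urllib.parse.urljoin` and de-duplicates while keeping first-seen order.

```python
import re
from urllib.parse import urljoin

# Same pattern as the cell above: any href whose value contains "/article"
ARTICLE_HREF = re.compile(r'href=["\'](?=[^"\']*/article)([^"\']+)["\']')

def extract_article_links(html: str, base_url: str) -> list[str]:
    """Return absolute article links in first-seen order, without duplicates."""
    absolute = (urljoin(base_url, href) for href in ARTICLE_HREF.findall(html))
    # dict.fromkeys keeps insertion order, unlike set()
    return list(dict.fromkeys(absolute))

# Invented sample markup, not taken from apnews.com
sample_html = """
<a href="https://apnews.com/article/example-one">One</a>
<a href='/article/example-two'>Two</a>
<a href="https://apnews.com/article/example-one">One again</a>
<a href="/hub/sports">Not an article</a>
"""
links = extract_article_links(sample_html, "https://apnews.com/")
print(links)
```

Keeping the order is optional for scraping, but it makes the resulting DataFrame reproducible between runs on the same HTML.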


## Configure Workflow
Now we'll configure the scraipe workflow using NewsScraper and OpenAiAnalyzer.

Store your OpenAI key in a file named .openai_key before running this code block.
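
If the key file is missing, a bare `open()` call fails with a `FileNotFoundError` that does not say what to do next. One possible way to make the loading step more forgiving is sketched below. `load_api_key` is a hypothetical helper, not part of scraipe, and the `OPENAI_API_KEY` environment-variable fallback is an assumption added here for illustration.

```python
import os
from pathlib import Path

def load_api_key(path: str = ".openai_key", env_var: str = "OPENAI_API_KEY") -> str:
    """Read the key from a file, falling back to an environment variable."""
    key_file = Path(path)
    if key_file.is_file():
        key = key_file.read_text(encoding="utf-8").strip()
    else:
        key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"No API key found in {path!r} or ${env_var}")
    return key

# Usage (raises RuntimeError if neither source provides a key):
# api_key = load_api_key()
```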

In [4]:
#===Configure NewsScraper===
# NewsScraper doesn't require any extra configuration
scraper = NewsScraper()

#===Configure OpenAiAnalyzer===
# Load API key from a file (the context manager closes the file handle)
with open(".openai_key") as key_file:
    api_key = key_file.read().strip()

# Define the instruction for the LLM. Ensure the instruction specifies a return schema.
instruction = '''Extract a list of celebrities mentioned in the article text.
Return a JSON dictionary with the following schema:
{"celebrities":["celebrity1", "celebrity2", ...]}'''

# (Optional) Create a pydantic schema to validate the LLM output
from typing import List
class ExpectedOutput(BaseModel):
    celebrities: List[str]

# Create the analyzer with the API key, instruction, and schema
analyzer = OpenAiAnalyzer(api_key, instruction, pydantic_schema=ExpectedOutput)

#===Create Workflow===
# Create a workflow with the configured scraper and analyzer
workflow = Workflow(scraper, analyzer)
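
To illustrate what the optional schema can catch, here is a small sketch of pydantic validation on its own, independent of scraipe. How `OpenAiAnalyzer` applies `pydantic_schema` internally is not shown in this notebook, so the `validate_output` helper and the sample strings below are hypothetical.

```python
import json
from typing import List, Optional
from pydantic import BaseModel, ValidationError

class ExpectedOutput(BaseModel):
    celebrities: List[str]

def validate_output(raw: str) -> Optional[ExpectedOutput]:
    """Parse a JSON string and check it against ExpectedOutput; None on failure."""
    try:
        return ExpectedOutput(**json.loads(raw))
    except (json.JSONDecodeError, TypeError, ValidationError):
        # Malformed JSON, a non-dict payload, or a schema mismatch
        return None

good = validate_output('{"celebrities": ["Bob Marley"]}')
bad = validate_output('{"people": ["Bob Marley"]}')  # required key is missing
print(good.celebrities if good else None)
print(bad)
```

A response that is valid JSON but uses the wrong key is rejected here, which is the kind of drift a free-text instruction alone cannot guard against.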

## Scrape content from news links
Next we will scrape content from news links. This content will be saved within the workflow's scrape store.

In [5]:
# Scrape the news links
workflow.scrape(news_links)
# Display the scraped content
scrape_store_df = workflow.get_scrapes()
display(scrape_store_df.iloc[0:1])

Scraping 143/143 new or failed links...


Scraping URLs: 100%|██████████| 143/143 [01:03<00:00,  2.25it/s]

Successfully scraped 143/143 links.





,link,content,success,error
0,https://apnews.com/article/weather-forecasts-w...,"WASHINGTON (AP) — With massive job cuts, the N...",True,
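
The scrape store comes back as a DataFrame with `link`, `content`, `success`, and `error` columns, as the output above shows. A sketch of pulling out the links that failed, using a hand-built frame with those columns (the rows and the `failed_links` helper are invented for illustration):

```python
import pandas as pd

def failed_links(scrapes: pd.DataFrame) -> list:
    """Return the links whose scrape did not succeed."""
    return scrapes.loc[~scrapes["success"].astype(bool), "link"].tolist()

# Stand-in for workflow.get_scrapes(); the values are invented
scrapes = pd.DataFrame({
    "link": ["https://example.com/a", "https://example.com/b"],
    "content": ["Some article text", None],
    "success": [True, False],
    "error": [None, "HTTP 404"],
})
print(failed_links(scrapes))
```

The "new or failed links" progress message suggests that calling `workflow.scrape` again would retry only the failures, though that is inferred from the log line rather than confirmed by documentation.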


## Analyze content with OpenAI
Next we will analyze the stored scrapes.

In [6]:
# Analyze the scraped content
workflow.analyze()

# Display the analyses
analysis_store_df = workflow.get_analyses()
display(analysis_store_df.head())

Analyzing 143/143 new or failed links...


Analyzing content: 100%|██████████| 143/143 [02:13<00:00,  1.07it/s]

Successfully analyzed 143/143 links.





,link,output,success,error
0,https://apnews.com/article/weather-forecasts-w...,"{'celebrities': ['D. James Baker', 'Ryan Maue'...",True,
1,https://apnews.com/article/minnesota-senator-j...,{'celebrities': []},True,
2,https://apnews.com/article/north-macedonia-pro...,{'celebrities': []},True,
3,https://apnews.com/article/greenland-inuit-tra...,"{'celebrities': ['Naja Parnuuna', 'Markus Olse...",True,
4,https://apnews.com/article/shelter-dog-tiktok-...,{'celebrities': []},True,


## Compile the results
Finally, let's export the completed analysis. 

In [7]:
export_df = workflow.export()
display(export_df)
export_df.to_csv('celebrities.csv', index=False)

,link,celebrities
0,https://apnews.com/article/weather-forecasts-w...,"[D. James Baker, Ryan Maue, Elbert ""Joe"" Frida..."
1,https://apnews.com/article/minnesota-senator-j...,[]
2,https://apnews.com/article/north-macedonia-pro...,[]
3,https://apnews.com/article/greenland-inuit-tra...,"[Naja Parnuuna, Markus Olsen, Bob Marley, Mart..."
4,https://apnews.com/article/shelter-dog-tiktok-...,[]
...,...,...
138,https://apnews.com/article/kabaddi-world-cup-e...,[]
139,https://apnews.com/article/wellness-advice-soc...,"[Robert F. Kennedy Jr., Dr. Mehmet Oz, Donald ..."
140,https://apnews.com/article/pro-palestinian-wat...,[]
141,https://apnews.com/article/trump-america-boyco...,"[Donald Trump, Jeff Bezos, Volodymyr Zelenskyy..."
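
One detail worth knowing about the CSV export: `to_csv` writes each list as its string representation, so reading the file back yields plain strings rather than lists. The sketch below demonstrates the round trip with an in-memory buffer and invented sample data, and shows how `ast.literal_eval` restores the list.

```python
import io
from ast import literal_eval

import pandas as pd

# Invented sample row standing in for workflow.export()
df = pd.DataFrame({
    "link": ["https://example.com/a"],
    "celebrities": [["Ann Example", "Bo Sample"]],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
as_text = restored.loc[0, "celebrities"]  # a str holding the list's repr
as_list = literal_eval(as_text)           # parsed back into a real list
print(type(as_text).__name__, type(as_list).__name__)
```

`literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text that originated from an LLM.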


## Analyze the results
Now you can conduct your own analysis on the structured data collected by the scraipe workflow.

In [8]:
# Load the extracted data
celebrities_df = pd.read_csv('celebrities.csv')
from ast import literal_eval
celebrities_df['celebrities'] = celebrities_df['celebrities'].apply(literal_eval)

# Explode the nested list of celebrities
celebrities_df = celebrities_df.explode('celebrities')
celebrities_df['celebrities'] = celebrities_df['celebrities'].str.strip()

# Count mentions per celebrity, then display the 10 most mentioned
celebrities_df = celebrities_df['celebrities'].value_counts().reset_index()
celebrities_df.columns = ['celebrity', 'mentions']
celebrities_df = celebrities_df.sort_values('mentions', ascending=False)
celebrities_df.head(10)

,celebrity,mentions
0,Donald Trump,57
1,Elon Musk,20
2,Joe Biden,6
4,Marco Rubio,5
5,Robert F. Kennedy Jr.,5
3,Barack Obama,5
9,Kamala Harris,4
8,Vladimir Putin,4
7,Volodymyr Zelenskyy,4
6,John Roberts,4
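
For comparison, the same tally can be sketched without pandas using `collections.Counter`. The sample rows below are invented and are not the notebook's results.

```python
from collections import Counter

# Each inner list stands for one article's extracted names (invented data)
rows = [
    ["Ann Example", "Bo Sample"],
    [],                      # article with no mentions contributes nothing
    ["Ann Example "],        # trailing space, as an LLM might return
]

# Flatten, strip whitespace, and count occurrences across all articles
counts = Counter(name.strip() for names in rows for name in names)
print(counts.most_common(10))
```

Like the pandas version, this counts one mention per article in which a name appears, assuming the model does not repeat a name within a single list.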
