## This notebook uses the OpenAI API to summarize a webpage or website. It uses Playwright to load the page and BeautifulSoup to extract its text, then passes that text to the OpenAI API as input to summarize it. The resulting output is displayed as Markdown.

In [1]:
! pip install playwright
! playwright install



## Import the libraries

In [2]:
# imports

import os
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from openai import OpenAI

# If you get an error running this cell, then please head over to the troubleshooting notebook!

# Connecting to OpenAI

Load in the environment variables in your `.env` file and connect to OpenAI.

In [3]:
# Load environment variables from a file called .env

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')
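The cell above reads `api_key` but never inspects it, so a missing key only surfaces later as an API error. A minimal sketch of an early check follows; `describe_api_key` is a hypothetical helper name, and the messages are illustrative only.

```python
def describe_api_key(api_key):
    """Return a short diagnostic message for the value read from OPENAI_API_KEY."""
    if not api_key:
        return "No API key was found; check that .env defines OPENAI_API_KEY."
    if api_key != api_key.strip():
        return "An API key was found, but it has leading or trailing whitespace."
    return "An API key was found."


# Example with placeholder values (not real keys)
print(describe_api_key(None))
print(describe_api_key("placeholder-key"))
```

In the notebook, the call would be `print(describe_api_key(api_key))` right after `load_dotenv`.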


In [4]:
openai = OpenAI()

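## Scrape the website and capture the results in the `Website` class
##### Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. In this notebook, Playwright loads the page in a browser first, and Beautiful Soup then extracts the text from the rendered HTML.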
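To make "sits atop a parser" more concrete, here is a minimal sketch that uses Python's standard-library `html.parser` directly instead of Beautiful Soup. It is not how Beautiful Soup is implemented; it only illustrates the kind of bookkeeping that calls such as `decompose()` and `get_text()` take care of. `TextExtractor` and `extract_text` are hypothetical names, and the HTML string is made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><title>T</title><script>var x = 1;</script></head>"
    "<body><p>Hello</p><p>World</p></body></html>"
)
print(extract_text(sample))
```

With Beautiful Soup, the same result takes only a few lines, which is why the `Website` class uses it.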

In [5]:
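import asyncio

import nest_asyncio
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

# Jupyter already runs an event loop; nest_asyncio lets asyncio.run() be called inside it
nest_asyncio.apply()

class Website:
    """A web page loaded in a browser and reduced to its title and visible text."""
    title: str
    text: str
    url: str

    def __init__(self, url):
        # Only the URL is stored here; title and text are set once main() has run
        self.url = url

    async def run(self, playwright):
        # headless=False opens a visible browser window
        browser = await playwright.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(self.url)
        await page.wait_for_load_state('load')

        # Capture the title and the rendered HTML, then close the browser
        self.title = await page.title()
        html = await page.content()
        await browser.close()

        # Drop elements that carry no readable text before extracting it
        soup = BeautifulSoup(html, 'html.parser')
        for irrelevant in soup(["script", "style", "img", "input"]):
            irrelevant.decompose()
        self.text = soup.get_text(separator="\n", strip=True)

    async def main(self):
        # Start Playwright, load the page, and shut Playwright down again
        async with async_playwright() as playwright:
            await self.run(playwright)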
    
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

# Quick end-to-end check. It relies on system_prompt and user_prompt_for,
# which are defined in later cells, so run those cells first.
if __name__ == "__main__":
    site = Website('https://openai.com')
    asyncio.run(site.main())
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages_for(site)
    )

    web_summary = response.choices[0].message.content
    display(Markdown(web_summary))

Task exception was never retrieved
future: <Task finished name='Task-2' coro=<Connection.run() done, defined at D:\Installs\Anaconda\envs\llmenv\Lib\site-packages\playwright\_impl\_connection.py:272> exception=NotImplementedError()>
Traceback (most recent call last):
  File "D:\Installs\Anaconda\envs\llmenv\Lib\asyncio\tasks.py", line 277, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "D:\Installs\Anaconda\envs\llmenv\Lib\site-packages\playwright\_impl\_connection.py", line 279, in run
    await self._transport.connect()
  File "D:\Installs\Anaconda\envs\llmenv\Lib\site-packages\playwright\_impl\_transport.py", line 133, in connect
    raise exc
  File "D:\Installs\Anaconda\envs\llmenv\Lib\site-packages\playwright\_impl\_transport.py", line 120, in connect
    self._proc = await asyncio.create_subprocess_exec(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Installs\Anaconda\envs\llmenv\Lib\asyncio\subprocess.py", line 223, in create_subpr

NotImplementedError: 
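The `NotImplementedError` above is raised from `asyncio.create_subprocess_exec`. A likely cause (an assumption, based on the Windows paths in the traceback) is that the notebook's event loop is selector-based. On Windows such a loop does not support subprocesses, while Playwright needs to start a driver process. One possible workaround is to run the coroutine on a separate event loop in a worker thread. The sketch below shows that idea; `run_coroutine_in_thread` is a hypothetical helper, not part of Playwright or asyncio.

```python
import asyncio
import sys
import threading


def run_coroutine_in_thread(coro_factory):
    """Run coro_factory() to completion on a fresh event loop in a worker thread."""
    outcome = {}

    def worker():
        # On Windows, a ProactorEventLoop supports subprocesses;
        # on other platforms the default new loop is used.
        if sys.platform == "win32":
            loop = asyncio.ProactorEventLoop()
        else:
            loop = asyncio.new_event_loop()
        try:
            asyncio.set_event_loop(loop)
            outcome["value"] = loop.run_until_complete(coro_factory())
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc
        finally:
            loop.close()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome.get("value")


async def demo():
    # Stand-in coroutine so the sketch runs without a browser
    await asyncio.sleep(0)
    return "done"


print(run_coroutine_in_thread(demo))
```

With the `Website` class above, the call would be `run_coroutine_in_thread(site.main)` in place of `asyncio.run(site.main())`. This has not been verified against the environment shown in the traceback.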

## Simple test

In [None]:
web = Website("https://bbc.com")
asyncio.run(web.main())  # load the page; title and text are only set after this
print(web.title)
print(web.text)

## Types of prompts


Models expect to receive:

**A system prompt** that tells them what task they are performing and what tone they should use

**A user prompt** -- the conversation starter that they should reply to

In [10]:
# Define our system prompt 

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [11]:
# A function that writes a User Prompt that asks for summaries of websites:

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website are as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

In [12]:
print(user_prompt_for(web))

You are looking at a website titled BBC Home - Breaking News, World News, US News, Sports, Business, Innovation, Climate, Culture, Travel, Video & Audio
The contents of this website are as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.

Skip to content
British Broadcasting Corporation
Watch Live
Home
News
Sport
Business
Innovation
Culture
Arts
Travel
Earth
Audio
Video
Live
Home
News
Israel-Gaza War
War in Ukraine
US & Canada
UK
UK Politics
England
N. Ireland
N. Ireland Politics
Scotland
Scotland Politics
Wales
Wales Politics
Africa
Asia
China
India
Australia
Europe
Latin America
Middle East
In Pictures
BBC InDepth
BBC Verify
Sport
Business
Executive Lounge
Technology of Business
Future of Business
Innovation
Technology
Science & Health
Artificial Intelligence
AI v the Mind
Culture
Film & TV
Music
Art & Design
Style
Books
Entertainment News
Arts
Arts in Motion
Travel
Destinations
Africa
Antarctica
Asia
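The prompt printed above contains a long run of navigation text, and the whole page text is sent to the model. A minimal sketch of capping the text length before building the prompt follows. `truncate_text`, the `max_chars` default, and the `[truncated]` marker are all hypothetical choices, not something the notebook or the OpenAI API prescribes.

```python
def truncate_text(text, max_chars=5000):
    """Cut text to at most max_chars characters, preferring a line boundary."""
    if len(text) <= max_chars:
        return text
    # Look for the last newline before the limit so a line is not split in half
    cut = text.rfind("\n", 0, max_chars)
    if cut <= 0:
        cut = max_chars
    return text[:cut] + "\n[truncated]"


print(truncate_text("line one\nline two\nline three", max_chars=12))
```

In `user_prompt_for`, the last step would then become `user_prompt += truncate_text(website.text)`.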


## Messages

The API from OpenAI expects to receive messages in a particular structure.
Many of the other APIs share this structure:

```
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```

## Build system and user prompts

In [13]:
# See how this function creates exactly the format above

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

In [14]:
messages_for(web)

[{'role': 'system',
  'content': 'You are an assistant that analyzes the contents of a website and provides a short summary, ignoring text that might be navigation related. Respond in markdown.'},
 {'role': 'user',

## Summarize the data by calling the OpenAI API

In [15]:
# And now: call the OpenAI API. You will get very familiar with this!

def summarize(url):
    website = Website(url)
    asyncio.run(website.main())  # fetch the page so title and text are populated
    response = openai.chat.completions.create(
        model = "gpt-4o-mini",
        messages = messages_for(website)
    )
    return response.choices[0].message.content
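Because `summarize` calls the real API, it cannot be exercised offline. One option, sketched here as an assumption rather than as the notebook's design, is to pass the client in as a parameter so that a stand-in object can replace it in tests. `summarize_messages` and `make_fake_client` are hypothetical names. The fake client only mimics the `chat.completions.create(...)` call shape and the `choices[0].message.content` access used in the cell above.

```python
from types import SimpleNamespace


def summarize_messages(client, messages, model="gpt-4o-mini"):
    """Send messages to a chat-completions style client and return the reply text."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


def make_fake_client(reply):
    """Build a stand-in client for offline testing (hypothetical helper)."""
    def create(model, messages):
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


fake = make_fake_client("# Summary\n\nA placeholder reply.")
messages = [
    {"role": "system", "content": "You summarize websites."},
    {"role": "user", "content": "Some page text"},
]
print(summarize_messages(fake, messages))
```

With the real client, the call would be `summarize_messages(openai, messages_for(website))`.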

In [16]:
summarize("https://bbc.com")

"# BBC Home Summary\n\nThe BBC Home page features a wide range of coverage on breaking news across various categories including world events, US & Canada, business, sports, culture, and innovation. It offers in-depth articles, real-time updates, and multimedia content.\n\n## Recent Noteworthy News:\n\n- **Israel-Gaza War**: Ongoing developments surrounding the conflict.\n- **Ukraine-Russia Negotiations**: US Senator Marco Rubio states a partial ceasefire plan proposed for Ukraine “has promise” as talks progress.\n- **European Stock Markets**: Markets showed stability after declines in US shares linked to economic comments from President Trump.\n- **High-profile Arrests**: Former Philippines President Rodrigo Duterte has been arrested on an ICC warrant for drug-related killings.\n- **Crime Incident**: A massive drone attack on Moscow before essential US-Ukraine peace talks resulted in casualties.\n- **Duterte Analyzed**: Coverage highlights the complex legacy of Duterte as a controversi

## Use Markdown to display the summary nicely

In [17]:
# A function to display this nicely in the Jupyter output, using markdown

def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [22]:
display_summary("https://bbc.com")

# BBC Home - Summary

The BBC Home website serves as a portal for breaking news across various categories including World News, US News, Sports, Business, Innovation, Culture, and Climate. Users can access live coverage, in-depth reports, and audio-visual materials covering a range of topics.

### Key Highlights:

- **Israel-Gaza War** and **War in Ukraine** remain prominent news features.
- **US News**: Notable stories include President Trump's comments on the economy and incidents involving natural disasters and public transport.
- **Business Updates**: European stocks showed stability following declines in the US, while reports of a major ship collision are under scrutiny.
- **International News**:
  - **Ukraine** highlighted ongoing peace talks with the US amid military escalations.
  - **Duterte** is under arrest in the Philippines for alleged crimes against humanity.
- **Cultural Insights**: 
  - Critiques of Hollywood's handling of scandals.
  - The legacy of prominent entertainment figures and recent developments in the arts.
- **Travel** topics address sustainable tourism and unique destinations.

In addition to news, the site features content related to technology (especially in AI), health, and environmental issues, such as climate change innovations and their impacts.

Various segments encourage engagement through newsletters, podcasts, and shows that delve deeper into specific topics or stories. Overall, the BBC Home serves as a comprehensive resource for current affairs and cultural commentary.

## Testing on a website like "https://openai.com" doesn't work because the website relies on JavaScript. The next version of the code will be updated to address that.

In [23]:
display_summary("https://openai.com")

# Website Summary

The website currently appears to have no specific title or content available for analysis. It requests users to enable JavaScript and cookies in order to proceed. Without additional information or a description of the site's purpose or content, a more detailed summary cannot be provided.
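In the output above, the model was asked to summarize a placeholder page rather than real content. A rough heuristic can flag such pages before calling the API. This is only a sketch: `looks_like_js_wall`, the hint phrases, and the length threshold are hypothetical and would need tuning against real pages.

```python
# Phrases that often appear on "please enable JavaScript" placeholder pages (assumed list)
JS_WALL_HINTS = (
    "enable javascript",
    "javascript is required",
    "enable cookies",
    "javascript and cookies",
)


def looks_like_js_wall(text, max_length=500):
    """Return True if the scraped text is short and asks the user to enable JavaScript."""
    lowered = text.lower()
    return len(text) < max_length and any(hint in lowered for hint in JS_WALL_HINTS)


print(looks_like_js_wall("Please enable JavaScript and cookies to continue"))
print(looks_like_js_wall("BBC Home\nNews\nSport\nBusiness"))
```

A caller could check `looks_like_js_wall(website.text)` and skip the summary, or report the problem, when it returns `True`.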