#### Set up web scraping functions with BeautifulSoup

In [1]:
from bs4 import BeautifulSoup
import requests
import os
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display

In [2]:


# Standard headers to fetch a website
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}


def fetch_website_contents(url):
    """
    Return the title and contents of the website at the given url;
    truncate to 2,000 characters as a sensible limit
    """
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    # soup.title.string is None when <title> is empty or contains nested tags
    title = soup.title.string if soup.title and soup.title.string else "No title found"
    if soup.body:
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()
        text = soup.body.get_text(separator="\n", strip=True)
    else:
        text = ""
    return (title + "\n\n" + text)[:2_000]


def fetch_website_links(url):
    """
    Return the links on the website at the given url
    I realize this is inefficient as we're parsing twice! This is to keep the code in the lab simple.
    Feel free to use a class and optimize it!
    """
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    links = [link.get("href") for link in soup.find_all("a")]
    return [link for link in links if link]
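
The `href` values returned by `fetch_website_links` are whatever the page author wrote, so they may be relative paths, fragments, or `mailto:` links. Below is a minimal sketch of one way to tidy them up, assuming you only want unique, absolute http(s) URLs. `normalize_links` is a hypothetical helper introduced here for illustration; it is not part of the lab code.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve raw href values against base_url; keep unique http(s) URLs in order."""
    seen = set()
    result = []
    for href in hrefs:
        # Relative paths are resolved against the page they were found on
        absolute = urljoin(base_url, href.strip())
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and similar
        absolute = urldefrag(absolute).url  # drop any "#fragment" part
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical sample hrefs, as fetch_website_links might return them
sample = ["/live-scores", "news/latest", "mailto:info@example.com", "https://example.org/a"]
print(normalize_links("https://example.com/cricket/", sample))
```

Passing the page URL as `base_url` matters: the same relative `href` resolves differently depending on which page it came from.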


In [3]:
fetch_website_contents("https://cricbuzz.com")

"Women's Premier League 2026 | Live Cricket Score, Schedule, Latest News, Stats &amp; Videos | Cricbuzz.com\n\nMenu\nLive Scores\nSchedule\nArchives\nNews\nSeries\nTeams\nVideos\nRankings\nMore\nMATCHES\nPC\nvs\nPR\n-\nPR won\nMIW\nvs\nUPW\n-\nUPW won\nGGTW\nvs\nRCBW\n-\nPreview\nUSAU19\nvs\nINDU19\n-\nINDU19 won\nAUSU19\nvs\nIREU19\n-\nPreview\nALL\nAll\nLive Now\nToday\nLEAGUE\nSA20\nPretoria Capitals vs Paarl Royals\n25th Match\nMI Cape Town vs Sunrisers Eastern Cape\n26th Match\nBBL 2025-26\nPerth Scorchers vs Melbourne Renegades\n36th Match\nSydney Sixers vs Sydney Thunder\n37th Match\nBPL 2025-26\nChattogram Royals vs Noakhali Express\n25th Match\nRajshahi Warriors vs Sylhet Titans\n26th Match\nDhaka Capitals vs Rangpur Riders\n27th Match\nChattogram Royals vs Rajshahi Warriors\n28th Match\nSuper Smash 2025-26\nWellington vs Otago\n20th Match\nCentral Districts vs Auckland\n21st Match\nDOMESTIC\nICC Under 19 World Cup 2026\nZimbabwe U19 vs Scotland U19\n2nd Match, Group B\nUnited

In [4]:
# Load the environment variables from the .env file


load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")



API key found and looks good so far!


#### Quick call to the API

In [5]:

message = "Hello, Ai! This is my first ever message to you! Hi!"
messages = [{"role": "user", "content": message}]
messages


[{'role': 'user',
  'content': 'Hello, Ai! This is my first ever message to you! Hi!'}]

In [6]:
openai = OpenAI()

response = openai.chat.completions.create(model="gpt-5-nano", messages=messages)
response.choices[0].message.content

'Hi there! Nice to meet you. Thanks for saying hello.\n\nI can help with a lot of things, like:\n- Answering questions and explaining topics\n- Writing, editing, and brainstorming\n- Coding help and debugging\n- Planning, organizing, and idea generation\n- Quick summaries or research-style overviews\n\nWhat would you like to do today? Tell me a topic or task, and your preferred style (concise or detailed), and we’ll dive in.'

### Types of prompts

**A system prompt** that tells the model what task it is performing and what tone it should use

**A user prompt** -- the conversation starter that the model should reply to

In [None]:
system_prompt = """
You are a professional assistant that analyzes the contents of a website,
and provides a short, insightful, humorous summary with proper headings and subheadings, ignoring text that might be navigation related.
Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.
"""

In [8]:
user_prompt_prefix = """
Here are the contents of a website.
Provide a short summary of this website.
If it includes news or announcements, then summarize these too.

"""

### Messages

The API from OpenAI expects to receive messages in a particular structure.
Many of the other APIs share this structure:

```python
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```
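
Since the structure is just a list of dictionaries, it is easy to sanity-check before sending it. Here is a minimal sketch, assuming the simplified case where every message has one of three common roles and plain string content. `check_messages` and `ALLOWED_ROLES` are hypothetical names, and real APIs accept more than this (for example other roles, or content given as a list of parts).

```python
# Assumption: only the three most common roles are allowed in this sketch
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_messages(messages):
    """Return a list of problems found in a chat-style messages list (empty if none)."""
    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not a dict")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} content is not a string")
    return problems


good = [
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"},
]
print(check_messages(good))  # prints []
```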
To give you a preview, the next cell makes a rather simple call - we won't stretch the mighty GPT (yet!)

In [9]:
messages = [
    {"role": "system", "content": "You are a flirty assistant"},
    {"role": "user", "content": "What is 2 + 2?"}
]

response = openai.chat.completions.create(model="gpt-4.1-nano", messages=messages)
response.choices[0].message.content

"Well, mathematically it's 4, but I think you're asking about something a little more intriguing, right? 😉 Want to make that equation more interesting?"

In [10]:
# See how this function creates exactly the format above

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_prefix + website}
    ]

In [11]:
cricbuzz = fetch_website_contents("https://cricbuzz.com")
messages_for(cricbuzz)

[{'role': 'system',
  'content': '\nYou are a Professioal assistant that analyzes the contents of a website,\nand provides a short,meaningful, humorous summary, ignoring text that might be navigation related.\nRespond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n'},
 {'role': 'user',
  'content': "\nHere are the contents of a website.\nProvide a short summary of this website.\nIf it includes news or announcements, then summarize these too.\n\nWomen's Premier League 2026 | Live Cricket Score, Schedule, Latest News, Stats &amp; Videos | Cricbuzz.com\n\nMenu\nLive Scores\nSchedule\nArchives\nNews\nSeries\nTeams\nVideos\nRankings\nMore\nMATCHES\nPC\nvs\nPR\n-\nPR won\nMIW\nvs\nUPW\n-\nUPW won\nGGTW\nvs\nRCBW\n-\nPreview\nUSAU19\nvs\nINDU19\n-\nINDU19 won\nAUSU19\nvs\nIREU19\n-\nPreview\nALL\nAll\nLive Now\nToday\nLEAGUE\nSA20\nPretoria Capitals vs Paarl Royals\n25th Match\nMI Cape Town vs Sunrisers Eastern Cape\n26th Match\nBBL 2025-26\nPerth Sco

In [12]:
# And now: call the OpenAI API. You will get very familiar with this!

def summarize(url):
    website = fetch_website_contents(url)
    response = openai.chat.completions.create(
        model="gpt-4.1-mini",
        messages=messages_for(website)
    )
    return response.choices[0].message.content
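
To see the request and response shapes without spending tokens, the same flow can be exercised against a stand-in client. This is a minimal sketch: `FakeCompletions`, `fake_client`, `extract_text`, and `summarize_text` are hypothetical names, and the fake only imitates the `chat.completions.create(...)` call path and the `choices[0].message.content` attribute path used above, not the real OpenAI client.

```python
from types import SimpleNamespace


def extract_text(response):
    """Pull the assistant's reply out of a chat-completions-style response."""
    return response.choices[0].message.content


def summarize_text(client, website_text, model="gpt-4.1-mini"):
    """Same flow as summarize(), but with the client and page text passed in."""
    messages = [
        {"role": "system", "content": "Summarize the website contents."},
        {"role": "user", "content": website_text},
    ]
    response = client.chat.completions.create(model=model, messages=messages)
    return extract_text(response)


class FakeCompletions:
    """Stand-in that returns an object shaped like the response used above."""

    def create(self, model, messages):
        reply = f"[{model}] saw {len(messages)} messages"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))
print(summarize_text(fake_client, "Example page text"))  # prints: [gpt-4.1-mini] saw 2 messages
```

Passing the client in as an argument keeps the function easy to test, and the `openai` object created earlier could be passed in its place for a real call.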

In [13]:
summarize("https://cricbuzz.com")

"# Women's Premier League 2026 on Cricbuzz\n\nCricbuzz’s Women’s Premier League 2026 hub serves up a live cricket feast for fans, featuring real-time scores, schedules, team rankings, and video highlights. Whether you're tracking Mumbai Indians Women vs UP Warriorz Women or Gujarat Giants Women vs Royal Challengers Bengaluru Women, the site has all the juicy ball-by-ball updates and match outcomes--yes, UP Warriorz Women just clinched a 7-wicket win like cricket pros!\n\nBesides WPL action, there’s a cricket universe here: Under-19 World Cup, domestic leagues like SA20 and BBL, and even four-day series--all live or previewed. Plus, handy tabs for stats and forecasts to calculate your cricket obsession level.\n\n**News & Highlights:**  \n- Paarl Royals secured a 6-wicket win over Pretoria Capitals.  \n- UP Warriorz Women triumphed against Mumbai Indians Women by 7 wickets.  \n- The ICC U19 World Cup is underway with India U19 winning their opener.\n\nIn short, it’s your go-to 

In [14]:
# A function to display this nicely in the output, using markdown

def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [15]:
display_summary("https://cricbuzz.com")

# Women's Premier League 2026 on Cricbuzz

This website is your all-access pass to live cricket action with a focus on the Women's Premier League 2026. It dishes out real-time scores, schedules, previews, and results for matches across men's, women's, and youth tournaments worldwide. From thrilling T20 finishes to under-19 World Cup battles, Cricbuzz serves up comprehensive cricket updates with stats, videos, and forecasts.

### Latest Highlights:
- **WPL 2026**: UP Warriorz Women and Paarl Royals just clinched wins in exciting T20 matches.
- **Youth & Domestic Cricket**: The ICC Under-19 World Cup and domestic leagues like the Vijay Hazare Trophy are in full swing.
- **Global League Coverage**: Includes SA20, BBL, BPL, and Super Smash matches happening live or soon.

**In short:** If cricket is your game, this site is like a never-ending scoreboard party, where you get front-row seats to all the nail-biters, power-hitters, and boundary sprees -- with a special spotlight on women smashing it in 2026. Cricket fans, ready your teas and tension! 🏏📱

In [16]:
display_summary("https://nationalgeographic.com")

# National Geographic: Where Curiosity Meets the Extraordinary

This website is a treasure trove of fascinating stories spanning health, travel, environment, science, and animals--basically every corner of the natural world and human adventure. Recent headlines include:

- **Steroid use** is sneakier and more widespread than you might think.
- Dreaming of winter escapes? Check out **6 sunny island archipelagos** guaranteed to melt your snow boots.
- The **world’s priciest spice** is facing a future as uncertain as your last grocery receipt.
- Scientists are teaming up with **virtual reality to zap chronic pain**--because who knew VR was the new aspirin?
- A marine scientist bravely explains why she munches on seafood, including the controversial octopus.
- A **woolly rhino genome** was unearthed in a frozen wolf’s stomach--talk about a snack with benefits.
- Get lost in the secretive lives of **seahorses** or dive into the mystical **wolf hunts of western Mongolia**.
- Wonder if the grumpy-faced Texas horned lizard’s cuteness is enough to save it from extinction? Spoiler: opinions vary.

The site also features an archive ("From the Vault") bringing classic National Geographic tales back to life online, covering topics like sunken ceramics revolutionizing archaeology and a 100-year-old sun compass aiding polar explorers.

Oh, and for pop culture plus science thrills, Chris Hemsworth takes on epic challenges in *Limitless*.

In short, National Geographic’s website is your go-to for awe-inspiring stories that combine science, nature, and a sprinkle of adventure--perfect for feeding your inner explorer without leaving your couch.