This project summarizes a webpage using an open-source model running locally via Ollama.

**Benefits:**
1. No API charges - open-source
2. Data doesn't leave your box

**Disadvantages:**
1. Significantly less powerful than a frontier model

## Installation of Ollama

Visit [ollama.com](https://ollama.com) and install!

Once installation is complete, the Ollama server should already be running locally.  
If you visit:  
[http://localhost:11434/](http://localhost:11434/)

You should see the message `Ollama is running`.  

If not, open a new Terminal (Mac) or PowerShell (Windows) window and enter `ollama serve`.  
Then try [http://localhost:11434/](http://localhost:11434/) again.
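
The same check can be scripted. The following is a minimal sketch that uses only the Python standard library; the helper names `looks_like_ollama` and `ollama_is_running` and the 2-second timeout are illustrative assumptions, not part of Ollama itself.

```python
import urllib.request

OLLAMA_URL = "http://localhost:11434/"  # default local address used in this notebook


def looks_like_ollama(body: str) -> bool:
    """Return True if the response body contains the expected status message."""
    return "Ollama is running" in body


def ollama_is_running(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if a server at base_url answers with the Ollama status message."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers refused connections, timeouts and HTTP errors
        # (urllib's URLError is a subclass of OSError)
        return False
    return looks_like_ollama(body)
```

If `ollama_is_running()` returns `False`, start the server with `ollama serve` and try again.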

In [16]:
# imports

import requests
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
import ollama
from openai import OpenAI

In [17]:
# Constants

# Download the model first if needed: run `ollama pull llama3.2` in a terminal
MODEL = "llama3.2"

In [19]:
# A class to represent a Webpage

class Website:
    """
    A utility class to represent a Website that we have scraped
    """
    url: str
    title: str
    text: str

    def __init__(self, url):
        """
        Create this Website object from the given url using the BeautifulSoup library
        """
        self.url = url
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
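        if soup.body:
            # Drop elements that carry no readable text before extracting it
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            # Some responses have no <body>; fall back to empty text
            self.text = ""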

In [20]:
# Let's try one out

ed = Website("https://edwarddonner.com")
print(ed.title)
print(ed.text)

Home - Edward Donner
Home
Connect Four
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. Recruiters use our product today to source, understand, engage and manage talent. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.
We work with groundbreaking, proprietary LLMs verticalized for talent, we’ve
patented
our matching model, and our award-winning platform has happy customers and tons of press coverage.
Connec

## Types of prompts

**A system prompt** tells the model what task it is performing and what tone it should use.

**A user prompt** is the conversation starter that the model should reply to.

In [21]:
# Define our system prompt

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [22]:
# A function that writes a User Prompt that asks for summaries of websites:

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    # Start on a new line so the title does not run into the next sentence
    user_prompt += "\nThe contents of this website are as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

## Messages

The API from Ollama expects the same message format as OpenAI:

```
[
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"}
]
```

In [23]:
# See how this function creates exactly the format above

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]
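
A malformed message list is an easy mistake to make when building prompts by hand. The sketch below is a hypothetical helper (`validate_messages` is not part of `ollama` or `openai`) that checks the shape described above before anything is sent. It assumes only the `system`, `user`, and `assistant` roles are in use.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: roles used in simple chats


def validate_messages(messages):
    """Raise ValueError unless messages is a non-empty list of role/content dicts."""
    if not isinstance(messages, list) or not messages:
        raise ValueError("messages must be a non-empty list")
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            raise ValueError(f"message {index} is not a dict")
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"message {index} has an unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"message {index} needs string content")
    return messages


example = [
    {"role": "system", "content": "system message goes here"},
    {"role": "user", "content": "user message goes here"},
]
validate_messages(example)  # passes silently and returns the list unchanged
```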

In [24]:
# call the Ollama function to get a summary

def summarize(url):
    website = Website(url)
    messages = messages_for(website)
    response = ollama.chat(model=MODEL, messages=messages)
    return response['message']['content']
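
For readers curious about what travels over the wire, the sketch below builds a comparable request against Ollama's REST endpoint `/api/chat` using only the standard library. It is an illustration, not the `ollama` package's actual implementation; `build_chat_request` and `chat_via_rest` are hypothetical names, and `"stream": False` is set so that the server answers with a single JSON object.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(model, messages, url=OLLAMA_CHAT_URL):
    """Build (but do not send) a POST request for Ollama's chat endpoint."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_via_rest(model, messages, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, messages)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["message"]["content"]
```

With the server running, `chat_via_rest(MODEL, messages_for(website))` should return the same kind of string that `summarize` does.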

In [28]:
# There's actually an alternative approach that some people might prefer:
# you can use the OpenAI Python client library to call Ollama through
# its OpenAI-compatible endpoint at /v1

def summarize_with_openai(url):
    website = Website(url)
    # The client requires an api_key, but Ollama ignores its value
    ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

    response = ollama_via_openai.chat.completions.create(
        model=MODEL,
        messages=messages_for(website)
    )

    return response.choices[0].message.content

In [26]:
summarize("https://edwarddonner.com")

'### Website Summary\n#### About the Founder\nThe website is owned by Edward Donner, a co-founder and CTO of Nebula.io, an AI company applying AI to help people discover their potential. He also co-founded AI startup untapt, acquired in 2021.\n\n#### News/Announcements\n* December 21, 2024: LLM Workshop – Hands-on with Agents – resources available\n* January 23, 2025: LLM Workshop announcement (no details provided)\n\n#### Website Features\n* **Outsmart**: An arena where LLMs compete in a battle of diplomacy and deviousness.\n* **Connect Four**: No information is given about this section on the website.'

In [29]:
summarize_with_openai("https://edwarddonner.com")

"# Website Summary\n\n## About the Creator\nEd, the creator of this website, is a co-founder and CTO at Nebula.io, an AI company. He has also founded other AI startups, including untapt, which was acquired in 2021.\n\n## Products/Services\nNebula.io creates LLMs (Large Language Models) for talent sourcing and management. They have patented their matching model and received award-winning platform recognition.\n\n## News and Announcements\n- **December 21, 2024**: LLM Workshop – Hands-on with Agents – resources\n- **November 13, 2024**: Welcome, SuperDataScientists!\n- **October 16, 2024**: Mastering AI and LLM Engineering – Resources\n- **September [missing] (likely missing due to formatting) From Software Engineer to AI Data Scientist – resources**\n\n- **January 23, 2025** has a future date: 'upcoming' but the content provided shows no specific information of what is being worked on"

In [12]:
# A function to display this nicely in the Jupyter output, using markdown

def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [13]:
display_summary("https://edwarddonner.com")

**Summary**
================

The website belongs to Edward Donner, a co-founder and CTO of Nebula.io. He is also the founder and CEO of AI startup untapt.

### News/Announcements

* **LLM Workshop – Hands-on with Agents**: A workshop on January 23, 2025
* **December 21, 2024: Welcome, SuperDataScientists!** 
* **Mastering AI and LLM Engineering - Resources**: Available on October 16, 2024
* **From Software Engineer to AI Data Scientist – resources**: Available on October 16, 2024

### About the Founder

Edward Donner is a self-described enthusiast of writing code, experimenting with Large Language Models (LLMs), and DJing. He is also an avid reader of Hacker News.

### Contact Information

* Email: ed [at] edwarddonner [dot] com
* Website: www.edwarddonner.com
* Social Media:
  * LinkedIn: 
  * Twitter: 
  * Facebook:

## Let's try more websites

This will only work on websites that can be scraped using this simplistic approach.

Websites that are rendered with JavaScript, like React apps, won't work, because `requests` fetches only the initial HTML and never runs scripts. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!).

Also, websites protected by CloudFront (and similar services) may return 403 errors - many thanks to Andy J for pointing this out.

But many websites will work just fine!

In [14]:
display_summary("https://cnn.com")

Here are the top stories from CNN:

1. **Heathrow Airport Defends Decision to Shut Down Amid Blame Game**: The UK's Heathrow airport has defended its decision to shut down amid a blame game between politicians and airlines.
2. **LA Firefighters Worry About Cancer Risk After Fighting Massive Blazes**: Los Angeles firefighters are worried about the risk of cancer after fighting massive blazes in recent months.
3. **TikTok Withdraws Controversial 'Chubby' Filter Amid Backlash**: TikTok has withdrawn its controversial "chubby" filter amid backlash from users and critics who accused it of perpetuating negative body image.
4. **Australia's Red Fire Ant Population Sends 23 People to Hospital**: The spread of Australia's red fire ant population has sent 23 people to hospital, highlighting the dangers of these invasive insects.
5. **50,000 Killed in Gaza Since Start of Israel-Hamas War, Health Ministry Says**: A health ministry in Gaza has said that 50,000 people have been killed since the start of the conflict between Israel and Hamas.

These are just a few of the top stories from CNN. You can find more information on these topics and others by visiting the CNN website or mobile app.

Additionally, you can check out the latest news, trends, and analysis on various subjects such as:

* Business: Invest in stocks, get market updates
* Entertainment: Get the latest movie reviews, TV show news, celebrity gossip
* Health: Stay up-to-date with the latest health news, fitness tips, wellness advice
* Politics: Follow CNN's coverage of the latest political news, elections, and policy issues
* Sports: Get the latest sports news, scores, highlights, and analysis

Let me know if you have any specific topic or category in mind, and I can try to help you find more information on that subject.

In [15]:
display_summary("https://anthropic.com")

# Website Summary
### Title: Just a Moment

**Content Summary**

* The website appears to be a waiting page or a loading indicator, with the title "Just a Moment" suggesting that some action is taking place.
* The text "Enable JavaScript and cookies to continue" indicates that certain functionality may not be available without these settings enabled.

### No News or Announcements

No news or announcements were found in the provided content.
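
The result above summarizes a block page, not the real site. One option is to detect that case before calling the model. The heuristic below is only a sketch: `looks_blocked` and its marker phrases are assumptions drawn from this single example, not a general-purpose detector.

```python
def looks_blocked(title: str, text: str, status_code: int = 200) -> bool:
    """Guess whether a scraped page is an access-denied or challenge page."""
    if status_code in (401, 403):
        return True
    # Assumption: marker phrases taken from the blocked output shown above
    if title.strip().lower().startswith("just a moment"):
        return True
    return "enable javascript and cookies to continue" in text.lower()
```

With the `Website` class, this could be called as `looks_blocked(site.title, site.text)`. To use the status code, the class would first need to store it on the object, which it does not currently do.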