# Brief Description

In this notebook, I compare how well different models summarize the content of a web page that has been fetched with requests and parsed with Beautiful Soup.

I will use two approaches. The first calls the OpenAI API; the second runs models locally with Ollama.

# First Approach: Using OpenAI API

In [2]:
# imports

import os
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from openai import OpenAI

# If you get an error running this cell, then please head over to the troubleshooting notebook!

**Connecting to OpenAI**

In [3]:
# Load environment variables in a file called .env

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

API key found and looks good so far!
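
The check above can also be written as a small function, which makes the order of the tests easier to reason about. In the cell above, the prefix test runs before the whitespace test, so a key with a leading space would be reported as having the wrong prefix. The sketch below is one possible reordering; the name `check_api_key` and the short status strings are made up for illustration, and the `sk-proj-` prefix is taken from the cell above.

```python
def check_api_key(api_key):
    """Return a short status string for an API key (hypothetical helper)."""
    if not api_key:
        return "missing"
    # Test for stray whitespace first, so a padded key is not misreported as a bad prefix
    if api_key.strip() != api_key:
        return "whitespace"
    if not api_key.startswith("sk-proj-"):
        return "unexpected prefix"
    return "ok"


# Made-up keys, only to exercise each branch
for candidate in [None, " sk-proj-abc", "sk-abc", "sk-proj-abc"]:
    print(repr(candidate), "->", check_api_key(candidate))
```

Returning a status instead of printing keeps the check easy to reuse, for example to stop the notebook early when the key is missing.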


In [4]:
openai = OpenAI()

Testing a quick call to a Frontier model - gpt-4o-mini

In [5]:
# To give you a preview -- calling OpenAI with these messages is this easy. Any problems, head over to the Troubleshooting notebook.

message = "Hello, GPT! This is my first ever message to you! Please be nice to me as I have started my learning journey!!"
response = openai.chat.completions.create(model="gpt-4o-mini", messages=[{"role":"user", "content":message}])
print(response.choices[0].message.content)

Hello! Welcome to your learning journey! I'm so glad you're here, and I'm more than happy to help you with anything you need. Don't hesitate to ask questions or share what you're curious about. Let's make this a great experience together!


**Starting the project**

In [6]:
# A class to represent a Webpage
# If you're not familiar with Classes, check out the "Intermediate Python" notebook

# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:

    def __init__(self, url):
        """
        Create this Website object from the given url using the BeautifulSoup library
        """
        self.url = url
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()
        self.text = soup.body.get_text(separator="\n", strip=True)
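
To try the parsing step without a network request, the same Beautiful Soup calls can be applied to an HTML string. This is a minimal sketch, not the notebook's `Website` class: `extract_title_and_text` is a hypothetical helper, and the guard for a missing `<body>` is an added assumption about pages that may not have one.

```python
from bs4 import BeautifulSoup


def extract_title_and_text(html):
    """Parse an HTML string and return (title, visible body text)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title and soup.title.string else "No title found"
    body = soup.body
    if body is None:
        # Assumption: treat a page without <body> as having no text instead of raising
        return title, ""
    # Drop elements that carry no readable content, as in the Website class
    for irrelevant in body(["script", "style", "img", "input"]):
        irrelevant.decompose()
    return title, body.get_text(separator="\n", strip=True)


sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><script>var x = 1;</script><p>World</p></body></html>"
)
title, text = extract_title_and_text(sample_html)
print(title)
print(text)
```

Separating parsing from fetching also makes it possible to test the cleanup rules on saved HTML before pointing the notebook at a live site.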

In [7]:
# testing the website functionality

zkh = Website("https://zinkohlaing.com")
print(zkh.title)
print(zkh.text)

A blog by ZinKo
A blog by ZinKo
Home
About
Sign in
Subscribe
Thoughts, stories and ideas.
Subscribe below to receive my latest posts directly in your inbox. Promise no spams!
jamie@example.com
Subscribe
Trip to New Zealand - Queenstown!
Jan 27, 2025
1 min read
Hello from my new blog!
Nov 8, 2024
1 min read
My Experience with Georgia Tech OMSA Program Part-4
May 1, 2024
6 min read
My Experience with Georgia Tech OMSA Program Part-3
Apr 30, 2024
3 min read
My Experience with Georgia Tech OMSA Program Part-2
Apr 29, 2024
4 min read
My Experience with Georgia Tech OMSA Program Part-1
Mar 25, 2024
3 min read
A blog by ZinKo © 2025. Powered by
Ghost


**Defining system prompt versus user prompt**

In [8]:
# Define our system prompt - you can experiment with this later, e.g. changing the last sentence to 'Respond in markdown in Spanish.'

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

In [10]:
# A function that writes a User Prompt that asks for summaries of websites:

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website is as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

In [11]:
print(user_prompt_for(zkh))

You are looking at a website titled A blog by ZinKo
The contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.

A blog by ZinKo
Home
About
Sign in
Subscribe
Thoughts, stories and ideas.
Subscribe below to receive my latest posts directly in your inbox. Promise no spams!
jamie@example.com
Subscribe
Trip to New Zealand - Queenstown!
Jan 27, 2025
1 min read
Hello from my new blog!
Nov 8, 2024
1 min read
My Experience with Georgia Tech OMSA Program Part-4
May 1, 2024
6 min read
My Experience with Georgia Tech OMSA Program Part-3
Apr 30, 2024
3 min read
My Experience with Georgia Tech OMSA Program Part-2
Apr 29, 2024
4 min read
My Experience with Georgia Tech OMSA Program Part-1
Mar 25, 2024
3 min read
A blog by ZinKo © 2025. Powered by
Ghost


**Designing messages structure for OpenAI API**

In [16]:
# example
messages = [
    {"role": "system", "content": "You are a nerdy assistant who likes to use GenZ lingo to explain things."},
    {"role": "user", "content": "What is the meaning of a universe?"}
]

In [17]:
# To give you a preview -- calling OpenAI with system and user messages:

response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)

Alright, fam, let’s break it down. The universe is basically everything, like the ultimate vibe check of existence. 🌌 It includes all the stars, planets, galaxies, and even the empty spaces in between. Think of it as the biggest group chat ever where everything from cosmic dust to supernovae is chilling together.

In more scientific terms, it’s the totality of space-time and all the matter and energy that exists. It’s estimated to be around 13.8 billion years old, which is like, ancient AF. So yeah, the universe is the ultimate canvas, stretching way beyond what we can even see with our telescopes. It's literally the GOAT of all things! 🚀✨


Now, back to our project: we will build a function that writes the messages for the GPT model.

In [18]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

In [19]:
messages_for(zkh)

[{'role': 'system',
  'content': 'You are an assistant that analyzes the contents of a website and provides a short summary, ignoring text that might be navigation related. Respond in markdown.'},
 {'role': 'user',
  'content': 'You are looking at a website titled A blog by ZinKo\nThe contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\n\nA blog by ZinKo\nHome\nAbout\nSign in\nSubscribe\nThoughts, stories and ideas.\nSubscribe below to receive my latest posts directly in your inbox. Promise no spams!\njamie@example.com\nSubscribe\nTrip to New Zealand - Queenstown!\nJan 27, 2025\n1 min read\nHello from my new blog!\nNov 8, 2024\n1 min read\nMy Experience with Georgia Tech OMSA Program Part-4\nMay 1, 2024\n6 min read\nMy Experience with Georgia Tech OMSA Program Part-3\nApr 30, 2024\n3 min read\nMy Experience with Georgia Tech OMSA Program Part-2\nApr 29, 2024\n4 min read\nMy E
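
The message list is plain Python data, so its shape can be checked without fetching a page or calling any model. The sketch below assumes a hypothetical `Page` dataclass as a stand-in for `Website` and a simplified user prompt; `build_messages` is an illustrative name, not part of the notebook.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for Website: a title and text, with no fetching."""
    title: str
    text: str


def build_messages(page, system_prompt):
    """Return the two-message list: one system message, then one user message."""
    user_prompt = f"You are looking at a website titled {page.title}\n\n{page.text}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


page = Page(title="Demo page", text="First post\nSecond post")
demo_messages = build_messages(page, "Summarize the page in markdown.")

for m in demo_messages:
    # Print each role with the size of its content as a quick check before sending
    print(m["role"], len(m["content"]))
```

Printing the content length is a cheap way to notice when a busy page produces a very large user prompt.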

**Calling OpenAI API**

The final step is to call the OpenAI API and return the summary of a website.

In [20]:
# And now: call the OpenAI API. You will get very familiar with this!

def summarize(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model = "gpt-4o-mini",
        messages = messages_for(website)
    )
    return response.choices[0].message.content

In [21]:
summarize("https://zinkohlaing.com")

'# Summary of "A blog by ZinKo"\n\n"A blog by ZinKo" is a personal blog focused on sharing thoughts, stories, and ideas. The author encourages readers to subscribe to receive updates directly in their inbox.\n\n## Recent Posts\n- **Trip to New Zealand - Queenstown!** (Jan 27, 2025) - A brief post about a trip to Queenstown.\n- **Hello from my new blog!** (Nov 8, 2024) - An introductory post welcoming visitors to the blog.\n- **My Experience with Georgia Tech OMSA Program** series:\n  - **Part-4** (May 1, 2024) - Continuation of experiences in the program, focusing on a specific aspect.\n  - **Part-3** (Apr 30, 2024) - Insights shared from the previous segment.\n  - **Part-2** (Apr 29, 2024) - Further reflections on the program experience.\n  - **Part-1** (Mar 25, 2024) - An initial overview of the author\'s journey through the program.\n\nThe blog showcases a blend of travel experiences and educational insights from the author\'s academic journey.'

In [22]:
# A function to display this nicely in the Jupyter output, using markdown

def display_summary(url):
    summary = summarize(url)
    display(Markdown(summary))

In [23]:
display_summary("https://zinkohlaing.com")

# Summary of "A blog by ZinKo"

**About the Blog:**  
"A blog by ZinKo" features thoughts, stories, and ideas from the author, ZinKo. Readers are encouraged to subscribe for the latest posts via email, with a promise of no spam.

## Recent Posts:
1. **Trip to New Zealand - Queenstown!**  
   *Date:* Jan 27, 2025  
   *Read Time:* 1 min  

2. **Hello from my new blog!**  
   *Date:* Nov 8, 2024  
   *Read Time:* 1 min  

3. **My Experience with Georgia Tech OMSA Program Part-4**  
   *Date:* May 1, 2024  
   *Read Time:* 6 min  

4. **My Experience with Georgia Tech OMSA Program Part-3**  
   *Date:* Apr 30, 2024  
   *Read Time:* 3 min  

5. **My Experience with Georgia Tech OMSA Program Part-2**  
   *Date:* Apr 29, 2024  
   *Read Time:* 4 min  

6. **My Experience with Georgia Tech OMSA Program Part-1**  
   *Date:* Mar 25, 2024  
   *Read Time:* 3 min  

The blog appears to include a series of posts about the Georgia Tech OMSA program, detailing the author's experiences in multiple parts, alongside personal travel reflections.

In [24]:
# Let's try a busy website like the NYT!
display_summary("https://www.nytimes.com/")

# Summary of The New York Times Website

The New York Times provides comprehensive coverage of breaking news across multiple categories including U.S. news, world affairs, politics, business, culture, and lifestyle. Readers can explore in-depth articles, analyses, and opinion pieces covering significant global events such as the ongoing Russia-Ukraine war, the Middle East crisis, and U.S. domestic policies under the Trump administration.

### Key Highlights:
- **U.S. News**: The site covers a range of topics including significant political developments, budget plans, and social issues like poverty in New York City.
- **International News**: Articles delve into international relations, with recent discussions around U.S.-China dynamics and political changes in Syria.
- **Business**: Insights into economic trends with a focus on the U.S. economy and corporate strategies, including BP's shift in investment towards fossil fuels.
- **Arts and Lifestyle**: Features on entertainment, including awards season coverage and cultural critiques.
- **Opinion Section**: Offers varied perspectives from well-known writers on current topics, with analysis on complications in U.S. politics and societal changes.

### Notable Recent Articles:
- **Trump's New Crackdown on China**: Discussion on the administration's measures against Chinese investments.
- **As Poverty Rises in New York City**: Statistics revealing an alarming poverty rate almost double the national average.
- **Chile Declares Curfew**: Coverage of a power outage resulting in a nationwide curfew.
- **Unknown Deadly Illness in Congo**: Reports of an illness possibly linked to bat consumption.

The website also features podcasts, newsletters, and interactive games, making it a versatile source for news consumption and engagement.

# Second Approach: Using locally run LLMs

First, install Ollama on your computer. Once it is installed, open the app and visit http://localhost:11434/

You should see the message "Ollama is running".

In [25]:
import ollama

**Starting with llama3.2 model**

In [26]:
# Let's just make sure the model is loaded

!ollama pull llama3.2

pulling manifest
pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB                         
pulling 966de95ca8a6... 100% ▕████████

In [37]:
# define model constant
model = "llama3.2"

We now need to rewrite our function to work with the Ollama package.

In [38]:
# New function; named summarize_local to avoid confusion with summarize.

def summarize_local(url):
    website = Website(url)
    response = ollama.chat(
        model = model,
        messages = messages_for(website)
    )
    return response['message']['content']

In [39]:
# now let's test
summarize_local("https://zinkohlaing.com")

'# Summary of A Blog by ZinKo\n\n*   The website is a personal blog written by ZinKo.\n*   It features a mix of stories, experiences, and ideas shared by the author.\n*   News and Announcements:\n    *   "Trip to New Zealand - Queenstown!" (January 27, 2025) - A recent post about visiting Queenstown in New Zealand.\n    *   "Hello from my new blog!" (November 8, 2024) - An introduction to the author\'s new blog.'
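
`summarize` and `summarize_local` differ only in which client is called and how the reply text is read from the response. One way to compare models side by side is to wrap each client in a function that takes the messages and returns a string, then run them all on the same input. This is a sketch under that assumption: `compare_models` and the stand-in backends are hypothetical, and the commented wiring simply reuses the two calls already shown in this notebook.

```python
def compare_models(backends, messages):
    """Run the same messages through each backend and collect replies by name."""
    results = {}
    for name, chat in backends.items():
        try:
            results[name] = chat(messages)
        except Exception as exc:
            # Keep going if one backend is unavailable (e.g. Ollama not running)
            results[name] = f"[error: {exc}]"
    return results


# Possible wiring with the clients from this notebook (not executed here):
# backends = {
#     "gpt-4o-mini": lambda m: openai.chat.completions.create(
#         model="gpt-4o-mini", messages=m).choices[0].message.content,
#     "llama3.2": lambda m: ollama.chat(model="llama3.2", messages=m)["message"]["content"],
# }

# Stand-in backends so the sketch runs without any API key or local model
fake_backends = {
    "echo": lambda m: m[-1]["content"],
    "broken": lambda m: 1 / 0,
}
demo = compare_models(fake_backends, [{"role": "user", "content": "hi"}])
for name, reply in demo.items():
    print(name, "->", reply)
```

Catching the exception per backend means one failing model does not hide the results of the others.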

In [40]:
# nice. so let's make it pretty

# A function to display this nicely in the Jupyter output, using markdown

def display_summary_local(url):
    summary_local = summarize_local(url)
    display(Markdown(summary_local))

In [42]:
display_summary_local("https://zinkohlaing.com")

**Summary of A Blog by ZinKo**
==========================

*   The website is a personal blog where the author shares their thoughts, stories, and ideas.
*   It appears to be a personal blog with a focus on storytelling, rather than a news or informational website.
*   **Recent Posts:**

    *   "Trip to New Zealand - Queenstown!" (January 27, 2025) - A travel post about the author's experience in Queenstown, New Zealand.
    *   "Hello from my new blog!" (November 8, 2024) - An introductory post welcoming readers to the blog.
*   The blog seems to be relatively new, with most posts dating back to 2024 or earlier.

**Observations:**

As you can see, the local model's response is weaker than OpenAI's: it lists only two of the six posts and leaves out the Georgia Tech OMSA series entirely. Sometimes it is really bad, making up the author's name or the dates the blog was active.

**Trying this on a high-traffic site**

In [43]:
display_summary_local("https://nytimes.com")

It appears that you've entered a New York Times website, but the page seems to be in an unclear state. The content is mixed with articles, news, and other sections like audio, games, and puzzles.

To help you better, I'll need more information on what specific section or topic you're looking for. Are you interested in:

1. News and current events?
2. Entertainment (movies, TV, music, etc.)?
3. Sports?
4. Science and technology?
5. Something else?

Please let me know, and I can try to assist you with a more focused response!

**Observation**

For NYT, the model does not produce a summary at all: it seems to treat the page text as an unclear request and asks which section or topic to focus on.

**Trying out DeepSeek**

In [36]:
!ollama pull deepseek-r1:1.5b

pulling manifest

In [45]:
model = "deepseek-r1:1.5b"

In [46]:
# Redefining the function for the DeepSeek model; the body is unchanged because `model` is read when the function is called
def summarize_local(url):
    website = Website(url)
    response = ollama.chat(
        model = model,
        messages = messages_for(website)
    )
    return response['message']['content']

<think>

</think>

A blog by ZinKo  
Home  
About  
Sign in  
Subscribe  
Thoughts, stories and ideas.  

Subscribe below to receive my latest posts directly in your inbox. Promise no spams!  

 jamie@example.com  
Subscribe  

Trip to New Zealand - Queenstown!  
Jan 27, 2025  
1 min read  

Hello from my new blog!  
Nov 8, 2024  
1 min read  

My Experience with Georgia Tech OMSA Program Part-4  
May 1, 2024  
6 min read  

My Experience with Georgia Tech OMSA Program Part-3  
Apr 30, 2024  
3 min read  

My Experience with Georgia Tech OMSA Program Part-2  
Apr 29, 2024  
4 min read  

My Experience with Georgia Tech OMSA Program Part-1  
Mar 25, 2024  
3 min read  

A blog by ZinKo © 2025. Powered by Ghost

**Observation**

Again, as you can see, it's not great: the thinking block is empty and the model echoes the page text back instead of summarizing it.

In [47]:
display_summary_local("https://nytimes.com")

<think>
Okay, so I'm trying to figure out how to approach this problem. The question is about the New York Times offering a free daily puzzle called Wordle in exchange for their ads. It says that the ad on the right side of the screen includes more than just the word, and it's not entirely clear whether the answer is included. They also mention that they're working with a brand studio to create this puzzle each day.

Hmm, first I need to understand what exactly the user is asking. Are they looking for tips on how to solve Wordle puzzles, or are they confused about why this specific ad includes more than just the answer? Or maybe they're trying to decide whether to support this subscription model?

I think I should break it down. The Times offers a free daily puzzle to replace their ads, which is interesting because it's a mix between the word game Wordle and the ad that's part of their daily news rotation.

I remember that Wordle is a popular online game where players guess letters based on hints given after each guess. It has its own rules about when they win—like when you guess correctly or if you run out of attempts, but in this case, it seems like the puzzle itself isn't part of their ad; instead, they're offering a free puzzle.

So why would the Times offer that? Maybe they think the puzzle is valuable on its own. They might be trying to create a recurring revenue stream without relying solely on ads. By providing the puzzle for free in exchange for the ad, they could reduce the dependency on their usual news ads and maybe attract more readers who want puzzles.

But then again, the user didn't specify what exactly they're asking about. They just provided this context. I wonder if they're trying to discuss the strategy of solving Wordle puzzles, or if they're unsure how this specific puzzle works or why it's included in their ad.

I should also consider whether there are any potential downsides. If the Times is offering a free puzzle, do they risk engaging with the readers and people who might not have time for word games? Or could that attract attention away from the main ads?

Another angle is to think about audience preferences. The New York Times has a strong reputation for delivering quality news and their ad ecosystem. By offering a new service, they might be expanding into other revenue streams, which could enhance their brand presence and diversification.

I should also check if there are any official statements or press releases from the New York Times regarding this subscription model. Maybe that will give me more insight into how it's perceived by readers or the media.

In summary, I'm a bit confused about the specific interaction here, but I think the key points to consider are: the substitution of ads with free puzzles, potential benefits in terms of revenue streams and audience engagement, and possible concerns about losing touch with readers interested in word games.
</think>

The New York Times is offering a free daily Wordle puzzle as part of their ad rotation, aiming to diversify their revenue streams. While there are several considerations:

1. **Revenue Streams**: The substitution replaces ads, which could reduce reliance on traditional news ads and attract readers interested in word games.

2. **Audience Engagement**: Wordle puzzles might attract those not engaged with word games, potentially affecting the main ad ecosystem's attention to content.

3. **Brand Diversification**: By introducing a new service, they expand their brand beyond news, enhancing exposure and differentiation.

4. **Potential Concerns**: Readers interested in word games may lose touch, though this could be mitigated by promoting Wordle alongside other puzzles or articles.

In conclusion, the interaction highlights a strategy for diversifying revenue and engaging with different audiences through substitution of ads with word-based challenges.
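
The deepseek-r1 outputs above include the model's reasoning between `<think>` and `</think>` before the answer. If only the final answer should be displayed, the block can be removed with a regular expression before passing the text to `Markdown`. This is a minimal sketch; `strip_think` is a hypothetical helper, and it assumes the tags appear exactly as in the outputs shown here.

```python
import re

# Non-greedy match across newlines, including any whitespace after the closing tag
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text):
    """Remove <think>...</think> blocks and surrounding whitespace from model output."""
    return THINK_BLOCK.sub("", text).strip()


sample = "<think>\nOkay, let me reason about this.\n</think>\n\nThe actual summary."
print(strip_think(sample))
```

Keeping the raw output around is still useful for debugging, since the reasoning often shows how the model misread the page.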

**Conclusion**

In summary, the OpenAI model performs noticeably better at this task than the locally run models. That is the trade-off to consider. Even so, it is impressive to get this level of output from LLMs running locally on my not-so-powerful machine.