
### Web-Based Text Summarization Using GenAI 

In the era of information overload, extracting meaningful insights from extensive textual content is a critical need. 
This project focuses on developing a **web-based text summarization tool**, leveraging Natural Language Processing (NLP) techniques 
to generate concise summaries from long-form content. The tool is designed to provide accurate and context-aware summarization 
to aid users in consuming key information efficiently.

In [1]:
import os
import openai
import ollama
import requests
from openai import OpenAI
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from IPython.display import Markdown, display

load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv("OPENAI_API_KEY")

In [2]:
# A class to represent a Webpage
# Some websites need you to use proper headers when fetching them:
headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"}
class Website:
    def __init__(self, url):
        """
        Create this Website object from the given url using the BeautifulSoup library
        """
        self.url = url
        response = requests.get(url, headers = headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()
        self.text = soup.body.get_text(separator="\n", strip=True)

In [3]:
# Let's try one out. Change the website and add print statements to follow along.

ed = Website("https://mydatascienceenthusiast.com")
print(ed.title)
#print(ed.text)

Welcome


In [4]:
# Define our system prompt 
system_prompt = "You are an assistant that analyzes the contents of a website \
                and provides a short summary, ignoring text that might be navigation related. \
                Respond in markdown."

# A function that writes a User Prompt that asks for summaries of websites:
def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}\n"
    user_prompt += "The contents of this website is as follows; please provide a short summary of this website in markdown.\n"
    user_prompt += "If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

In [5]:
#print(user_prompt_for(ed))

In [6]:
# See how this function creates exactly the format above
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]

# Try this out, and then try for a few more websites
print(messages_for(ed))

[{'role': 'system', 'content': 'You are an assistant that analyzes the contents of a website                 and provides a short summary, ignoring text that might be navigation related.                 Respond in markdown.'}, {'role': 'user', 'content': 'You are looking at a website titled Welcome\nThe contents of this website is as follows; please provide a short summary of this website in markdown.\nIf it includes news or announcements, then summarize these too.\n\nSkip to content\nNo results\nAbout Us\nBlog\nContact\nHome\nWelcome\nHome\nAbout Us\nBlog\nContact\nSearch\nWelcome\nMenu\nTazeb Abera\nHey, I am Tazeb\nWelcome to my Profile, I am a GenAI Data Scientist, I Love to share my Knowledge and Experience\nLinkedin\nFacebook\nTwitter\nYoutube\nContact me\nAbout Me\nOur strong determination and passion towards web development have inspired us to offer premium quality web development services to the global clients, including 1200+ satisfied customers.\nDownload cv\nGenAI\nPython\n

### OpenAI web Text Summarization

In [7]:
client = openai.OpenAI() 
def summarize(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model = "gpt-4o-mini",
        messages = messages_for(website) )
    return response.choices[0].message.content

In [8]:
aa = summarize("https://mydatascienceenthusiast.com/blog/")

In [9]:
print(aa)

# Summary of Blog – Welcome

The blog post titled "DeepSeek vs. OpenAI – New Race in AI," published on January 29, 2024, discusses the author's journey exploring AI technologies, specifically comparing OpenAI's models with DeepSeek, an emerging open-source LLM from China.

## Key Points:
- The author highlights their experience using OpenAI's frontier models for various tasks and their curiosity about DeepSeek after its recent launch.
- **DeepSeek** raises concerns regarding **data security**, requiring users to log in and share personal information, which parallels worries about how user data might be handled, especially in light of the recent TikTok ban in the US.
- **UI/UX Issues**: The author encountered significant navigation problems on DeepSeek’s website, questioning its stability and user-friendliness compared to OpenAI.
- **Political Sensitivity**: DeepSeek avoids addressing sensitive political topics, raising doubts about the trustworthiness of its outputs.
- The author concl

In [10]:
summarize("https://cnn.com")

"# CNN News Summary\n\nCNN is a leading source for breaking news, covering a wide range of topics including US and world news, politics, business, health, entertainment, sports, and science. The site features live updates and in-depth analyses on critical issues and events currently shaping the world.\n\n## Key Highlights:\n- **Midair Collision Incident:** The FAA has restricted helicopter operations near Reagan National Airport following a deadly collision involving an American Airlines jet and an Army helicopter. This incident has led to criticism of former President Trump's handling of aviation safety.\n- **Ukraine-Russia War:** Reports indicate that North Korean troops have pulled back from frontline positions after suffering significant losses; this development continues to highlight the ongoing conflict dynamics.\n- **Israeli-Palestinian Conflict:** A father of one of the youngest hostages in Gaza is expected to be released, furthering discussions about the ongoing humanitarian i

In [11]:
summarize("https://mydatascienceenthusiast.com/")

"# Website Summary\n\n**Title:** Welcome\n\n## Overview\nThe website serves as a personal profile for Tazeb Abera, a GenAI Data Scientist dedicated to sharing knowledge and experience in the fields of AI, Machine Learning, and web development. Tazeb emphasizes a strong passion for premium quality web development services, having served over 1200 satisfied clients.\n\n## Education & Experience\n- **PhD in Data Science** (2023 - Present) at North Central University\n- **MSc in Data Science** (2018-2020) from Southern Methodist University\n- **MSc in Sustainable Energy Engineering** (2012-2014)\n- **BSc in Computer Science** (2004-2007)\n\nTazeb has held various roles in the tech field, including Senior Data Scientist, Data Scientist, Business Analyst, and Power Engineer.\n\n## Services Offered\nTazeb specializes in several areas:\n- **AI, Machine Learning, and Deep Learning:** Developing intelligent systems for automation, predictive insights, and innovation.\n- **Data Engineering and Cl

### Ollama Web Text Summarization

In [12]:
def summarize(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model = "gpt-4o-mini",
        messages = messages_for(website) )
    return response.choices[0].message.content

In [13]:
# Constants
OLLAMA_API = "http://localhost:11434/api/chat"
HEADERS = {"Content-Type": "application/json"}
MODEL = "llama3.2"

In [14]:
# Option I  for website summerization with Ollama  / open source no payment required
ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
def summarize_ollama(url):
    website = Website(url)
    response = ollama_via_openai.chat.completions.create(
        model = "llama3.2",
        messages = messages_for(website) )
    return response.choices[0].message.content

def display_summary(url):
    summary = summarize_ollama(url)
    display(summary)

display_summary("https://en.wikipedia.org/wiki/Ed_Sheeran")

'Here\'s a summary of Ed Sheeran\'s biography:\n\n**Early Life**\n\n* Born on February 17, 1991, in Hebden Bridge, West Yorkshire, England\n* Raised in Framlingham, Suffolk, with influences from his Irish parents\n\n**Music Career**\n\n* Began busking in London and released his debut EP, "Loose Change," in 2008\n* Signed with Atlantic Records in 2010 and released his debut album, "+," in 2011\n* Became a global pop sensation with hits like "The A Team" and "Lego House"\n\n**Personal Life**\n\n* Known for his charity work, including supporting the Elton John AIDS Foundation\n* Has been involved in high-profile relationships with athletes Ellie Goulding and Cherry Seaborn (whom he married in 2018)\n* Has a daughter with Cherry Seaborn\n\n**Awards and Achievements**\n\n* Won numerous awards, including four Grammy Awards and two Brit Awards\n* Nominated for several MTV Europe Music Awards and Billboard Music Awards\n* Became the best-selling solo artist of 2020 with his album "+"\n\n**Publ

In [15]:
# Option II for website summerization with Ollama  / open source no payment required
MODEL = "llama3.2"
def summarize(url):
    website = Website(url)
    messages = messages_for(website)
    response = ollama.chat(model = MODEL, messages = messages)
    return response['message']['content']

In [16]:
summarize("https://en.wikipedia.org/wiki/Ed_Sheeran")

'Ed Sheeran is a British singer-songwriter, musician, and record producer. He was born on February 17, 1991, in Hebden Bridge, West Yorkshire, England.\n\nSheeran rose to fame in the late 2000s with his unique blend of folk, pop, and hip-hop music. He has released several successful albums, including "+", "x", "÷", and "No.6 Collaborations Project". Some of his most popular songs include "Shape of You", "Thinking Out Loud", "Photograph", and "Perfect".\n\nSheeran has won numerous awards for his music, including four Grammy Awards, four Brit Awards, and an Ivor Novello Award. He has also broken multiple records in the music industry, including becoming the first artist to have seven songs simultaneously on the US Billboard Hot 100 chart.\n\nIn addition to his music career, Sheeran is known for his philanthropic efforts, particularly in the area of education and children\'s charities. He has been involved in several high-profile charity projects, including the "Songs for Love" campaign, 