# Website Summarizer

This notebook fetches a webpage and uses Ollama's API to summarize it.

1. Install Ollama: https://ollama.com/download
2. Run `ollama serve`
3. Visit `http://localhost:11434/` to confirm the server is running
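
Step 3 can also be checked from code. The following is a minimal sketch using only the standard library; the helper name `ollama_is_running` is hypothetical, and the check only confirms that something answers with HTTP 200 at the given URL, not that the responder is actually Ollama.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url="http://localhost:11434/", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed URL
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, the later `ollama.chat` call will most likely fail as well, so it is worth re-running `ollama serve` first.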

In [72]:
import requests
import time
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
import ollama

In [114]:
class Website:
    url: str
    title: str
    text: str
    summary: str
    system_prompt: str
    user_prompt: str
    selenium: bool

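    def __init__(self, url, selenium=False):
        self.url = url
        # The title is always read from the static HTML, even when selenium=True
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # soup.title.string is None when <title> is empty or has nested tags
        self.title = soup.title.string if soup.title and soup.title.string else "Unknown"
        self.selenium = selenium
        # Extract the body text from the rendered page (Selenium) or the static HTML (requests)
        if selenium:
            self._selenium()
        else:
            self._requests()
        # Each call below replaces the method of the same name on this instance
        # with the generated prompt string, so it can only be called once.
        self.system_prompt()
        self.user_prompt()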

    def _wait(self, driver, timeout=20):
        # WebDriverWait is already imported at the top of the notebook
        from selenium.webdriver.common.by import By

        try:
            # Wait for the body to contain some text
            WebDriverWait(driver, timeout).until(lambda d: d.find_element(By.TAG_NAME, "body").text.strip() != "")
        except Exception:
            # Body stayed empty or the lookup failed; fall through to the stability check
            pass

        start = time.time()
        prev_html = ""
        cur_html = ""
        stable_html_count = 0
        # Halt at timeout
        while time.time() - start < timeout:
            cur_html = driver.page_source
            if cur_html == prev_html:
                stable_html_count += 1
                # Consider the page stable once the HTML is unchanged for more than 5 consecutive polls
                if stable_html_count > 5:
                    return
            else:
                stable_html_count = 0
            prev_html = cur_html
            time.sleep(0.5)

    def _requests(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.content, 'html.parser')
        for no_text_tags in soup.body(["script", "style", "img", "input"]):
            no_text_tags.decompose()
        self.text = soup.body.get_text(separator="\n", strip=True)

    def _selenium(self):
        options = Options()
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")
        options.add_argument("--window-size=1920,1080")

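        driver = webdriver.Chrome(options=options)
        try:
            driver.get(self.url)
            # Wait until the page has loaded and its HTML stops changing
            self._wait(driver)
            html = driver.page_source
        finally:
            # Always close the browser, even if loading fails
            driver.quit()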

        soup = BeautifulSoup(html, 'html.parser')
        for no_text_tags in soup.body(["script", "style", "img", "input"]):
            no_text_tags.decompose()
        self.text = soup.body.get_text(separator="\n", strip=True)

    def printer(self):
        print(f'Website: {self.url}') 
        print(f'Title: {self.title}')
        print(f'Text Head: {self.text[:100]}...')

    # The prompt was reworded with GPT-4o to improve llama3.2's accuracy in Markdown output
    def system_prompt(self):
        self.system_prompt = """
    You are a website summarizer. You will be given a webpage's raw text content.
    
    Your task is to output a **Markdown-formatted table** with **exactly four rows** and **two columns**:
    
    - Column 1 (header): `Field`
    - Column 2 (header): `Value`
    
    The rows must be:
    1. Field: `URL`, Value: the page URL
    2. Field: `Title`, Value: the page title
    3. Field: `Summary`, Value: a concise summary of the webpage (up to 300 words)
    4. Field: `Noteworthy`, Value: 2–3 important bullet points divided by `•` (short, relevant highlights)
    
    ⚠️ Follow these strict rules:
    - Use **only** Markdown table syntax.
    - **Do not** modify or add any fields.
    - **Do not** include explanations, reasoning, or extra text before or after the table.
    - Use `•` for bullets inside the Noteworthy cell (use `<br>` to separate lines for each bullet).
    - **Do not** use more than 4 table rows for the final output.
    - If a value is missing, fill it with `"Unknown"` or infer it from the content.
    
    Here is the exact format to follow (replace only the values):
    
    ```
    | Field      | Value         |
    |------------|---------------|
    | URL        | <url>         |
    | Title      | <title>       |
    | Summary    | <summary>     |
    | Noteworthy | • point 1<br>• point 2<br>• point 3 |
    ```
    """
        return self.system_prompt

    def user_prompt(self):
        self.user_prompt = f"""
        The website URL is {self.url}. The website Title is {self.title}. The website content is ```{self.text}```. 
        Please summarize.
        """
        return self.user_prompt
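
The polling loop in `_wait` can be read in isolation as a generic "wait until a value stops changing" helper. The sketch below is an illustration of that idea, not the class's actual method: the name `wait_until_stable` and the injectable `clock` and `sleep` parameters are assumptions added here so the logic can be exercised without a browser.

```python
import time


def wait_until_stable(read_value, timeout=20.0, interval=0.5, required_repeats=5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll read_value() until it repeats its previous value required_repeats
    times in a row, or until timeout seconds have elapsed.

    Returns True if the value stabilized, False if the timeout was hit.
    """
    start = clock()
    previous = object()  # sentinel that never equals a real reading
    repeats = 0
    while clock() - start < timeout:
        current = read_value()
        if current == previous:
            repeats += 1
            if repeats >= required_repeats:
                return True
        else:
            # The value changed, so start counting again
            repeats = 0
        previous = current
        sleep(interval)
    return False
```

With Selenium, a call along the lines of `wait_until_stable(lambda: driver.page_source)` would play the same role as the loop in `_wait`, with the added benefit that the caller can tell a stable page from a timeout.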

In [115]:
def api_message(website):
    return [
        {"role": "system", "content": website.system_prompt},
        {"role": "user", "content": website.user_prompt}
    ]

def summarizer(url, selenium=False):
    website = Website(url, selenium)
    response = ollama.chat(model="llama3.2", messages=api_message(website))
    return response['message']['content']
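
The system prompt asks for a table with exactly four rows, so a caller may want to verify the reply before displaying it. The following is a rough sketch of such a check; `parse_summary_table`, `follows_format`, and `EXPECTED_FIELDS` are hypothetical names, and the parser is deliberately simple (a literal `|` inside a value would break it).

```python
EXPECTED_FIELDS = ["URL", "Title", "Summary", "Noteworthy"]


def parse_summary_table(markdown):
    """Parse a two-column Markdown table into a list of (field, value) pairs.

    The header row and the separator row are skipped.
    """
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2:
            continue  # wrong number of columns
        field, value = cells
        is_separator = bool(field) and set(field) <= set("-:")
        if field == "Field" or is_separator:
            continue
        rows.append((field, value))
    return rows


def follows_format(markdown):
    """Return True if the table has exactly the four expected fields, in order."""
    fields = [field for field, _ in parse_summary_table(markdown)]
    return fields == EXPECTED_FIELDS
```

One possible use is to call `follows_format(summary)` after `summarizer(...)` and retry the request when it returns `False`. Rows with an empty `Field` cell count as extra rows and therefore fail the check.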

In [116]:
summary = summarizer("https://gtsig.eu")
display(Markdown(summary))

| Field      | Value         |
|------------|---------------|
| URL        | https://gtsig.eu |
| Title      | George T.      |
| Summary    | George T. is a cybersecurity expert with experience in Information Security, Music & good food. He has worked for various companies, including SKROUTZ.GR and EUROPEAN RELIANCE - ALLIANZ. He also has a degree in Piano and is proficient in multiple programming languages.         |
| Noteworthy | • Penetration Tests: 9<br>• Internal Audits ISO 9001:2015, 27001:2013: 5<br>• Cybersecurity Training Courses: 2<br>• Vulnerability Assessments<br>• Tool Assisted Code Audits<br>• Custom CI/CD Security Integrations<br>• Custom SIEM setup on Elasticsearch |

In [117]:
summary = summarizer("https://www.airbnb.com/")
display(Markdown(summary))

| Field      | Value         |
|------------|---------------|
| URL        | https://www.airbnb.com/          |
| Title      | Airbnb: Vacation Rentals, Cabins, Beach Houses, Unique Homes & Experiences     |
| Summary    | A platform for booking unique vacation rentals worldwide, offering a diverse range of properties, including homes, cabins, and villas. Users can filter by location, price, and amenities to find their ideal getaway. The site also features listings for experiences, such as cooking classes and wine tastings.
| Noteworthy | • Airbnb offers a wide selection of unique vacation rentals, catering to various interests and budgets. • The platform allows users to explore destinations worldwide and discover new experiences. • Airbnb has implemented measures to promote sustainability, such as eco-friendly properties and carbon offsetting options.

In [118]:
summary = summarizer("https://www.airbnb.com/", selenium=True)
display(Markdown(summary))

| Field      | Value         |
|------------|---------------|
| URL        | https://www.airbnb.com/          |
| Title      | Airbnb: Vacation Rentals, Cabins, Beach Houses, Unique Homes & Experiences    |
| Summary    | Airbnb is a popular platform for vacation rentals, offering a wide range of unique homes and experiences worldwide. Users can search and book properties in various locations, including cities, beaches, and countryside areas. The website also provides travel tips, inspiration, and resources for hosts.          |
| Noteworthy | • Pet-Friendly Vacation Rentals<br>• Luxury Dog-Friendly Cottages<br>• Beachfront Rentals with a Pool |

In [119]:
summary = summarizer("https://www.netflix.com/", selenium=True)
display(Markdown(summary))

| Field      | Value         |
|------------|---------------|
| URL        | https://www.netflix.com/ |
| Title      | Netflix Greece - Watch TV Shows Online, Watch Movies Online  |
| Summary    | Netflix is a streaming service offering a wide variety of award-winning TV shows, movies, anime, documentaries, and more. It allows users to watch as much as they want, whenever they want without commercials, for a fixed monthly fee ranging from €8.99 to €15.99. Users can access the platform on various devices, including smartphones, tablets, smart TVs, and streaming devices. |
| Noteworthy | • The Netflix Kids experience is included in membership, providing family-friendly content with PIN-protected parental controls.|
|            | • The service uses cookies and similar technologies to collect information about user browsing activities for analysis, personalization, and customization of online advertisements.|
|            | • Users can opt out of advertising cookies, but may still see targeted ads based on other data sources.