# Website Content Extraction and Brochure Generation with LLaMA

## Description

This notebook demonstrates how to extract useful information from a company website and generate a professional brochure with a LLaMA language model served through Ollama. The process has four steps:

1. **Website Scraping:**
   - The `Website` class retrieves and processes content from a specified URL.
   - The page title and textual content are extracted, while `<script>`, `<style>`, `<img>`, and `<input>` elements are removed.
   - Every link (`href`) found on the page is collected and resolved to an absolute URL.

2. **Contact Information Extraction:**
   - Using regular expressions, phone numbers and email addresses are extracted from the website's textual content.
   - This information is returned as a dictionary.

3. **Link Classification for Brochure Content:**
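   - A prompt is sent to a LLaMA model to classify and filter the links found on the website.
   - The model identifies important links such as "About Us", "Mission", "Contact", and "Careers" pages, which are critical for building a company brochure.
   - In this version the classified links are printed but not fetched; the brochure itself is built from the landing page only.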

4. **Brochure Content Generation:**
   - The model generates the brochure from the website summary (the page title plus the first 1,000 characters of page text) and the extracted contact information.
   - The model writes sections like "About Us", "Vision/Mission", and "Contact" using the retrieved content.
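
The link list from step 1 is passed to the model as collected, so it can include duplicates, `mailto:` addresses, and in-page anchors. The sketch below shows one possible cleanup pass before classification. `clean_links` is a hypothetical helper, not part of the notebook, and it uses only the standard library.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def clean_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against base_url, keep http(s) only, drop fragments and duplicates."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for href in hrefs:
        # Make the link absolute, then strip any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, tel:, javascript: and similar
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


# Made-up hrefs for illustration.
links = clean_links(
    "https://example.com/corporate",
    ["/about", "/about#team", "mailto:info@example.com", "careers", "https://example.com/about"],
)
print(links)
```

A shorter, de-duplicated list also keeps the classification prompt smaller, which matters for pages with hundreds of navigation links.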

## Key Features
- **Link Classification:** Helps extract relevant sections of the website that are crucial for creating a company brochure.
- **Contact Information Extraction:** Retrieves phone numbers and emails from the website content, as a fallback when the page summary does not state contact details clearly.
- **Automated Brochure Generation:** AI-driven creation of professional brochure content from the website’s textual summary and contact information.
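
The phone pattern used for contact extraction treats any whitespace as a separator, and that includes the newlines that separate text blocks in the scraped page. The example below runs it on made-up sample text and compares it with a line-bounded variant. `PHONE_LINE_RE` is an assumption for illustration and not part of the notebook.

```python
import re

# Same phone pattern as the notebook's extract_contact_info.
PHONE_RE = re.compile(r'(\+?\d[\d\s\-\(\)]{7,}\d)')
# Hypothetical variant: only spaces, hyphens and parentheses may separate
# digit groups, so a match cannot continue across a line break.
PHONE_LINE_RE = re.compile(r'(\+?\d[\d \-\(\)]{7,}\d)')

# Made-up page text: two unrelated years on adjacent lines, then a phone number.
sample = "Founded 1988\n2024 Annual Report\nCall +1 212-555-0100"

loose = PHONE_RE.findall(sample)
strict = PHONE_LINE_RE.findall(sample)

print(loose)   # the two years are merged into one false positive
print(strict)  # only the phone number remains
```

Either pattern is a heuristic, so the results are best treated as candidates rather than verified contact details.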

This notebook uses the **Ollama** Python client with the `llama3` model to classify the links and write the brochure text, guided by the system prompts defined in the code.
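
Because the model replies in free text, the JSON has to be located inside the reply before it can be parsed. The notebook does this with a greedy regular expression, which can over-capture when the reply contains another closing brace after the JSON. As an alternative sketch, `json.JSONDecoder.raw_decode` parses the first valid object and ignores whatever follows. `first_json_object` is a hypothetical name, and the reply text is made up.

```python
import json


def first_json_object(reply: str):
    """Return the first decodable JSON object in reply, or None if there is none."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and tolerates trailing text after it.
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = reply.find("{", start + 1)
    return None


reply = (
    "Here are the links:\n"
    '{"links": [{"type": "about page", "url": "https://example.com/about"}]}\n'
    "Let me know if you need more {details}."
)
parsed = first_json_object(reply)
print(parsed)
```

On this reply, a greedy `{...}` match would run through to the final `}` in `{details}` and fail to decode, while the decoder-based scan stops at the end of the first object.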


In [7]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import ollama
import json
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}


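class Website:
    """Fetch a page and keep its title, visible text, and absolute links."""

    def __init__(self, url):
        self.url = url
        # A timeout keeps an unresponsive server from hanging the notebook.
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        # soup.title.string is None when <title> is empty or contains nested tags.
        title = soup.title.string if soup.title else None
        self.title = title.strip() if title else "No title found"

        # Drop elements that contribute no readable text.
        for tag in soup(["script", "style", "img", "input"]):
            tag.decompose()

        self.text = soup.body.get_text(separator="\n", strip=True) if soup.body else ""
        # Resolve relative hrefs against the page URL.
        raw_links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
        self.links = [urljoin(self.url, link) for link in raw_links]

    def get_summary(self):
        """Return the title plus the first 1,000 characters of page text."""
        return f"Webpage Title: {self.title}\n\nContent Summary:\n{self.text[:1000]}..."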


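def extract_contact_info(text: str) -> dict:
    """Return phone numbers and email addresses found in text, de-duplicated."""
    # Loose patterns: expect some false positives, such as long digit runs.
    phones = re.findall(r'(\+?\d[\d\s\-\(\)]{7,}\d)', text)
    emails = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', text)
    # sorted() makes the output order stable between runs.
    return {
        "phones": sorted(set(phones)),
        "emails": sorted(set(emails))
    }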


def extract_json_block(text: str):
    """Return the span from the first '{' to the last '}' in text, or None if there is none."""
    match = re.search(r'({[\s\S]+})', text)
    return match.group(1) if match else None


link_system_prompt = """You are a brochure assistant. You are given a list of links found on a company's website.
Identify only the links that are helpful to someone creating a company brochure.
This includes: About Us, Company History, Mission, Vision, Team, Careers, Jobs, and Contact pages.

Output should be in JSON format as:
{
    "links": [
        {"type": "about page", "url": "https://example.com/about"},
        {"type": "careers page", "url": "https://example.com/careers"},
        {"type": "contact page", "url": "https://example.com/contact"}
    ]
}
"""


def get_links_user_prompt(website: Website) -> str:
    return f"Website URL: {website.url}\nList of links:\n" + "\n".join(website.links)


def ask_ollama(prompt: str, system_prompt: str, model: str = "llama3") -> str:
    """Send one system message and one user message to an Ollama model and return the reply text."""
    response = ollama.chat(model=model, messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ])
    return response["message"]["content"]


def build_brochure(website: Website, contact_info: dict, model: str = "llama3"):
    prompt = f"""You are a company brochure builder AI. Based on the following company website summary, generate content for a professional brochure.
Include sections like About Us, Vision/Mission, and Contact. Use the following contact info if the webpage doesn't clearly state it.

Website Summary:
{website.get_summary()}

Contact Info (fallback):
Phone(s): {", ".join(contact_info['phones']) or "N/A"}
Email(s): {", ".join(contact_info['emails']) or "N/A"}
"""

    brochure_system_prompt = "You are a professional writer generating brochure content from company website information."
    return ask_ollama(prompt, brochure_system_prompt, model)


# ===== FINAL EXECUTION BLOCK =====

if __name__ == "__main__":
    url = "https://www.blackrock.com/corporate"
    website = Website(url)

    print("\n🎯 Filtered Brochure Links:\n")
    raw_links_response = ask_ollama(get_links_user_prompt(website), link_system_prompt)
    json_str = extract_json_block(raw_links_response)

    if json_str:
        try:
            parsed_links = json.loads(json_str)
            print(json.dumps(parsed_links, indent=2))
        except json.JSONDecodeError as e:
            print("⚠️ JSON decoding failed:", e)
            print("🔍 Raw content:\n", raw_links_response)
    else:
        print("⚠️ No JSON block found.\nRaw content:\n", raw_links_response)

    contact_info = extract_contact_info(website.text)

    print("\n📞 Extracted Contact Info:")
    print(json.dumps(contact_info, indent=2))

    print("\n📘 Brochure Content:\n")
    brochure = build_brochure(website, contact_info)
    print("📄 Final Brochure Output:\n")
    print("*******************************************************************************************")
    print(brochure)



🎯 Filtered Brochure Links:

{
  "links": [
    {
      "type": "about page",
      "url": "https://www.blackrock.com/corporate/about-us"
    },
    {
      "type": "company history",
      "url": "https://www.blackrock.com/corporate/about-us/blackrock-history"
    },
    {
      "type": "mission statement",
      "url": "https://www.blackrock.com/corporate/about-us/mission-and-principles"
    },
    {
      "type": "leadership team",
      "url": "https://www.blackrock.com/corporate/about-us/leadership"
    },
    {
      "type": "team",
      "url": "https://careers.blackrock.com/search-jobs"
    },
    {
      "type": "careers page",
      "url": "https://careers.blackrock.com"
    },
    {
      "type": "contact page",
      "url": "https://www.blackrock.com/corporate/newsroom/media-contacts"
    },
    {
      "type": "sustainability page",
      "url": "https://www.blackrock.com/corporate/sustainability"
    }
  ]
}

📞 Extracted Contact Info:
{
  "phones": [
    "44 (0)20 7743 30