<a href="https://colab.research.google.com/github/sufiyansayyed19/LLM_Learning/blob/main/W1Day5_1_Business_Brochure_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### BUSINESS CHALLENGE:

Create a product that builds a Brochure for a company to be used for prospective clients, investors and potential recruits.

We will be provided a company name and their primary website.

See the end of this notebook for examples of real-world business applications.

And remember: I'm always available if you have problems or ideas! Please do reach out.

All imports  
import os  
import json  
from IPython.display import Markdown, display, update_display  
from google.colab import userdata  
from openai import OpenAI   

## 1.Scraper Functions

This section defines functions to scrape website content and extract links. These functions are crucial for gathering the necessary information to create the brochure.

### Get website content

In [None]:
# 1. Install necessary libraries
!pip install beautifulsoup4 requests markdownify

import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin

# --- THESE ARE THE FUNCTIONS ED HAS IN 'scraper.py' ---

def fetch_website_contents(url):
    """
    Fetches the text content of a website, stripping out scripts and styles.
    """
    print(f"Scraping: {url}...")
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Remove script, style, nav and footer elements so only page content remains
        for element in soup(["script", "style", "nav", "footer"]):
            element.extract()

        # Get text
        text = soup.get_text()

        # Break into lines and remove leading/trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # Break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # Drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)

        return text[:5000] # Limit to 5000 chars to save tokens
    except Exception as e:
        return f"Error fetching {url}: {e}"



Collecting markdownify
  Downloading markdownify-1.2.2-py3-none-any.whl.metadata (9.9 kB)
Downloading markdownify-1.2.2-py3-none-any.whl (15 kB)
Installing collected packages: markdownify
Successfully installed markdownify-1.2.2
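
The whitespace clean-up inside `fetch_website_contents` can be tried offline on a plain string. The sketch below is a minimal illustration, assuming a hypothetical helper name `clean_text` that is not part of the notebook. It repeats the same three steps (strip each line, split on double spaces, drop blanks) without any network access.

```python
def clean_text(text, limit=5000):
    """Hypothetical helper mirroring the clean-up steps in fetch_website_contents."""
    # Strip leading/trailing whitespace from every line
    lines = (line.strip() for line in text.splitlines())
    # Split on double spaces so side-by-side phrases land on separate lines
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks, rejoin with newlines, then truncate to the limit
    return "\n".join(chunk for chunk in chunks if chunk)[:limit]


sample = "  Home  About  \n\n\n   Welcome to my portfolio   \n"
print(clean_text(sample))
```

The two menu words end up on separate lines and the blank lines disappear, which is the shape of text that later goes into the prompt.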


### Get website links

In [None]:
def fetch_website_links(url):
    """
    Fetches all links from a website and converts relative links to absolute.
    """
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')

        links = []
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            # Convert relative links (e.g. "/about") to full links
            full_url = urljoin(url, href)
            if full_url.startswith('http'):
                links.append(full_url)

        # Remove duplicates while keeping the order in which links appear on the page
        return list(dict.fromkeys(links))
    except Exception as e:
        print(f"Error fetching links: {e}")
        return []

print("Scraper functions loaded!")

Scraper functions loaded!
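
To see how relative links become absolute ones without touching the network, the following sketch parses a small inline HTML string. It is an illustration only: `LinkCollector` is a hypothetical class built on the standard library's `html.parser`, used here instead of BeautifulSoup so the example has no third-party dependencies, and `https://example.com/home/` is a placeholder base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical stdlib-only stand-in for the BeautifulSoup loop."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links (e.g. "/about") against the page URL
        full_url = urljoin(self.base_url, href)
        # Keep http(s) links only and skip duplicates, preserving page order
        if full_url.startswith("http") and full_url not in self.links:
            self.links.append(full_url)


html = (
    '<a href="/about">About</a>'
    '<a href="projects/">Projects</a>'
    '<a href="mailto:me@example.com">Mail</a>'
    '<a href="/about">About again</a>'
)
collector = LinkCollector("https://example.com/home/")
collector.feed(html)
print(collector.links)
```

The `mailto:` link is dropped because it does not start with `http`, and the repeated `/about` link is kept only once.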


## 2.API Key Set Up for Colab

### Add the key to Colab Secrets and check it with the code below

In [None]:
from google.colab import userdata

try:
    key = userdata.get('OPENAI_API_KEY')
    print(f"Success! Key found. It starts with: {key[:8]}...")
except Exception as e:
    print("Error: Could not find key. Did you turn the toggle switch ON?")

Success! Key found. It starts with: sk-or-v1...
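
If the notebook is ever run outside Colab, `google.colab` will not be importable. The sketch below shows one possible fallback, assuming hypothetical helpers `load_api_key` and `mask` that are not part of the notebook: it tries Colab Secrets first and then falls back to an environment mapping. The key in the usage line is a made-up placeholder.

```python
import os


def load_api_key(name="OPENAI_API_KEY", env=None):
    """Hypothetical helper: read a key from Colab Secrets, else from the environment."""
    env = os.environ if env is None else env
    try:
        # google.colab only exists inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Outside Colab (ImportError), or the secret is missing or not shared
        key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} not found in Colab Secrets or the environment")
    return key


def mask(key, visible=8):
    """Show only the first few characters, as the check above does."""
    return f"{key[:visible]}..."


print(mask(load_api_key(env={"OPENAI_API_KEY": "sk-or-v1-placeholder-not-a-real-key"})))
```

Printing only a short prefix keeps the full key out of saved notebook outputs.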


## 3.API Call Setup

In [None]:
import os
from openai import OpenAI

# 1. Set up the API key (make sure you added OPENAI_API_KEY in Colab Secrets)
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# 2. Setup Client (OpenRouter)
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENAI_API_KEY"],
)

# 3. Define Models
# We use a cheap, fast model for link selection
LINK_MODEL = "openai/gpt-4o-mini"
# The brochure writer could use a stronger model; this notebook uses the same model for both steps
WRITER_MODEL = "openai/gpt-4o-mini"

print("Client and Models configured!")

Client and Models configured!


## 4.Prompt Design

### i-System Prompts

#### Link system prompt

In [None]:
# # --- SYSTEM PROMPTS ---
# link_system_prompt = """
# You are provided with a list of links found on a webpage.
# You are able to decide which of the links would be most relevant to include in a brochure about the company,
# such as links to an About page, or a Company page, or Careers/Jobs pages.
# You should respond in JSON as in this example:
# {
#     "links": [
#         {"type": "about page", "url": "https://full.url/goes/here/about"},
#         {"type": "careers page", "url": "https://another.full.url/careers"}
#     ]
# }
# """

In [None]:
# --- SYSTEM PROMPTS ---
link_system_prompt = """
You are provided with a list of links found on a portfolio webpage.
You are able to decide which of the links would be most relevant for generating an attractive one-page CV.
You should respond in JSON as in this example:
{
    "links": [
        {"type": "About me", "url": "https://full.url/goes/here/about"},
        {"type": "Projects", "url": "https://another.full.url/projects/.."}
    ]
}
"""

#### Brochure System prompt

In [None]:

brochure_system_prompt = """
You are an expert career assistant and resume writer.

You will be provided with a portfolio website URL that contains multiple internal links (such as About, Projects, Experience, Skills, Blogs, GitHub, Case Studies, etc.).

Your task is to:

Analyze the entire portfolio, including all relevant linked pages

Extract meaningful information about:

Skills and technologies

Projects and their impact

Experience, internships, or freelancing work

Education and certifications

Open-source contributions or research (if any)

Achievements and measurable results

Infer strengths, specialization, and career direction based on evidence (not assumptions)

Then generate a high-quality, ATS-friendly, and recruiter-attractive CV tailored for modern tech roles (e.g., Software Engineer, ML Engineer, Data Scientist, AI Engineer, Full-Stack Developer, depending on the portfolio content).

CV Requirements

Use strong action verbs and impact-driven bullet points

Quantify results wherever possible (performance, scale, users, accuracy, efficiency, etc.)

Highlight real projects over generic skills

Reflect industry-level professionalism, not student-level wording

Optimize for clarity, conciseness, and credibility

Output Format (in Markdown)

Professional Summary (2-3 lines, sharp and role-focused)

Core Skills (grouped by category)

Experience / Internships / Freelance (if applicable)

Projects (most important section)

Education

Certifications / Achievements (if available)

Optional: Publications, Blogs, Open Source

Do not invent information.
If something is missing, omit it gracefully.
Focus on making the candidate look hire-ready and competitive.
"""


### ii- User Prompt

#### User prompt to select relevant links

In [None]:
# def get_links_user_prompt(url):
#     raw_links = fetch_website_links(url)
#     # Take only first 30 links to save tokens
#     links_text = "\n".join(raw_links[:30])
#     user_prompt = f"""
#     Here is the list of links on the website {url} -
#     Please decide which of these are relevant web links for a brochure about the company,
#     respond with the full https URL in JSON format.
#     Do not include Terms of Service, Privacy, email links.

#     Links:
#     {links_text}
#     """
#     return user_prompt

In [None]:
def get_links_user_prompt(url):
    raw_links = fetch_website_links(url)
    # Take only first 30 links to save tokens
    links_text = "\n".join(raw_links[:30])
    user_prompt = f"""
    Here is the list of links on the website {url} -
    Please decide which of these are relevant web links for a CV generated from this portfolio website,
    respond with the full https URL in JSON format.
    Do not include Terms of Service, Privacy, email links.

    Links:
    {links_text}
    """
    return user_prompt
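
Because only the first 30 links reach the model, it can help to discard obviously irrelevant ones before truncating. The sketch below is one way to do that, assuming a hypothetical helper `prefilter_links` and an illustrative keyword list. Neither is part of the notebook, and the keywords would need tuning for real sites.

```python
from urllib.parse import urlparse

# Assumed keyword list for illustration only
SKIP_KEYWORDS = ("privacy", "terms", "cookie", "login")


def prefilter_links(links, limit=30):
    """Hypothetical helper: drop obviously irrelevant links before prompting."""
    kept = []
    for link in links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            continue  # skips mailto:, tel:, javascript: and similar
        if any(word in parsed.path.lower() for word in SKIP_KEYWORDS):
            continue  # skips legal and account pages
        if link not in kept:
            kept.append(link)
    return kept[:limit]


links = [
    "https://example.com/about",
    "mailto:me@example.com",
    "https://example.com/privacy-policy",
    "https://example.com/about",
    "https://example.com/projects",
]
print(prefilter_links(links))
```

The model still makes the final selection; this step only stops the 30 available slots from being spent on links the prompt already asks it to ignore.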

#### User prompt to Create brochure

In [None]:
# def get_brochure_user_prompt(company_name, url):
#     print("📥 Fetching website content (this takes a moment)...")
#     content = fetch_page_and_all_relevant_links(url)

#     user_prompt = f"""
#     You are looking at a company called: {company_name}
#     Here are the contents of its landing page and other relevant pages;
#     use this information to build a short brochure of the company in markdown without code blocks.\n\n
#     {content[:10000]}
#     """
#     # (Limited to 10k chars to ensure we don't overflow context)
#     return user_prompt

In [None]:
def get_cv_user_prompt(candidate_name, portfolio_url):
    print("📥 Fetching portfolio content (this may take a moment)...")
    content = fetch_page_and_all_relevant_links(portfolio_url)

    user_prompt = f"""
You are an expert career assistant and professional resume writer.

You are analyzing the **portfolio website** of a candidate named: {candidate_name}

The portfolio contains multiple linked pages such as About, Projects, Skills, Experience,
Blogs, GitHub, Case Studies, or similar sections.

Your task is to:
- Analyze the entire portfolio content provided below
- Extract verified information about:
  - Technical skills and tools
  - Projects and their real-world impact
  - Internships, freelance work, or professional experience
  - Education and certifications
  - Open-source contributions, research, or blogs (if present)
- Infer strengths and specialization ONLY from the given evidence

Then generate a **high-quality, ATS-friendly, recruiter-ready CV** in **markdown format**
(without code blocks).

### CV Writing Guidelines
- Use strong action verbs and impact-focused bullet points
- Quantify results where possible (accuracy, performance, scale, users, revenue, etc.)
- Emphasize real projects over generic skill lists
- Maintain industry-level, professional language
- Do NOT invent or assume missing information

### CV Structure
- Professional Summary (2-3 concise lines)
- Core Skills (grouped by category)
- Experience / Internships / Freelance (if available)
- Projects (most important section)
- Education
- Certifications / Achievements (if available)
- Optional: Publications, Blogs, Open Source

Here is the portfolio content to analyze:

{content[:10000]}
"""

    # Limited to 10k characters to avoid context overflow
    return user_prompt


## 5.Content Get functions

### Page and all Links function

In [None]:
def fetch_page_and_all_relevant_links(url):
    # 1. Get Main Page
    contents = fetch_website_contents(url)

    # 2. Get Relevant Links
    relevant_links = select_relevant_links(url)

    result = f"## Landing Page:\n\n{contents}\n## Relevant Links:\n"

    # 3. Get Content of Relevant Links
    for link in relevant_links['links']:
        print(f"   Reading: {link['type']} ({link['url']})")
        link_content = fetch_website_contents(link["url"])
        result += f"\n\n### Link: {link['type']}\n{link_content}"

    return result
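
One consequence of the limits used in this notebook: each page is cut to 5,000 characters, and the user prompt keeps only the first 10,000 characters of the combined text. When pages hit the per-page cap, the landing page and roughly one linked page fill the whole budget, and later links are cut off. The sketch below shows an alternative that gives every page an equal share. `combine_with_budget` is a hypothetical helper, not part of the notebook.

```python
def combine_with_budget(pages, total_chars=10_000):
    """Hypothetical helper: give every page an equal share of the character budget.

    `pages` is a list of (title, text) tuples.
    """
    if not pages:
        return ""
    share = total_chars // len(pages)
    sections = []
    for title, text in pages:
        header = f"### {title}\n"
        # Reserve room for the header so each section stays within its share
        body = text[: max(share - len(header), 0)]
        sections.append(header + body)
    # Final slice guards against the separators pushing past the budget
    return "\n\n".join(sections)[:total_chars]


pages = [
    ("Landing", "a" * 5000),
    ("About", "b" * 5000),
    ("Projects", "c" * 5000),
]
combined = combine_with_budget(pages)
print(len(combined), "Projects" in combined)
```

With three full-length pages, each one contributes about a third of the budget instead of the third page being dropped entirely.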

### Select relevant links from all links

In [None]:
import json

def select_relevant_links(url):
    print(f"🔍 Analyzing links for {url}...")
    response = client.chat.completions.create(
        model=LINK_MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(url)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
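    try:
        links = json.loads(result)
        print(f"✅ Found {len(links['links'])} relevant links.")
        return links
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fall back to an empty list if the reply is not the expected JSON shape
        print("Error parsing JSON")
        return {"links": []}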
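
The model's reply is only as reliable as its JSON. As a rough illustration of stricter validation, the sketch below checks the shape of the reply before the rest of the pipeline reads `link['type']` and `link['url']`. `parse_links_response` is a hypothetical helper and not part of the notebook.

```python
import json


def parse_links_response(raw):
    """Hypothetical helper: turn a model reply into a safe {"links": [...]} dict."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"links": []}
    if not isinstance(data, dict):
        return {"links": []}
    links = data.get("links")
    if not isinstance(links, list):
        return {"links": []}
    # Keep only entries that have both keys the caller reads later
    valid = [
        item for item in links
        if isinstance(item, dict)
        and isinstance(item.get("url"), str)
        and isinstance(item.get("type"), str)
    ]
    return {"links": valid}


reply = '{"links": [{"type": "About me", "url": "https://example.com/about"}, {"url": 5}]}'
print(parse_links_response(reply))
print(parse_links_response("not json"))
```

Malformed entries are dropped individually, so one bad item does not discard the whole selection.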

## 6.Brochure Generator function

In [None]:
from IPython.display import Markdown, display, update_display

def stream_brochure(company_name, url):
    print(f"🚀 Generating brochure for {company_name}...")
    stream = client.chat.completions.create(
        model=WRITER_MODEL,
        messages=[
            {"role": "system", "content": brochure_system_prompt},
            # get_cv_user_prompt is the prompt builder defined above;
            # get_brochure_user_prompt only exists as commented-out code
            {"role": "user", "content": get_cv_user_prompt(company_name, url)}
        ],
        stream=True
    )
    response = ""
    display_handle = display(Markdown("Wait for it..."), display_id=True)
    for chunk in stream:
        content = chunk.choices[0].delta.content or ''
        response += content
        update_display(Markdown(response), display_id=display_handle.display_id)
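
The accumulation loop can be exercised without an API call. The sketch below is an illustration under stated assumptions: `fake_stream` is a hypothetical stand-in that yields objects with the same attribute path the loop reads (`choices[0].delta.content`), and `collect` is a hypothetical helper that repeats the `or ''` handling for chunks whose content is `None`.

```python
from types import SimpleNamespace


def fake_stream(pieces):
    """Yield objects shaped like the chunks read in stream_brochure (hypothetical)."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect(stream):
    """Accumulate streamed text, treating missing content as an empty string."""
    response = ""
    for chunk in stream:
        if not chunk.choices:
            continue  # defensive: skip chunks that carry no choices
        response += chunk.choices[0].delta.content or ""
    return response


text = collect(fake_stream(["# CV\n", None, "## Skills"]))
print(text)
```

Returning the accumulated string, as `collect` does, would also let the caller save the generated Markdown to a file after streaming finishes.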

## 7.Driver

In [None]:
# --- RUN IT! ---
# Portfolio URL to generate from (the saved output below comes from an earlier run on https://huggingface.co)
stream_brochure("Sufiyan Portfolio Site", "https://sufiyan-sayyed.vercel.app/")

🚀 Generating brochure for HuggingFace...
📥 Fetching website content (this takes a moment)...
Scraping: https://huggingface.co...
🔍 Analyzing links for https://huggingface.co...
✅ Found 4 relevant links.
   Reading: about page (https://huggingface.co/docs)
Scraping: https://huggingface.co/docs...
   Reading: blog page (https://huggingface.co/blog)
Scraping: https://huggingface.co/blog...
   Reading: careers page (https://huggingface.co/enterprise)
Scraping: https://huggingface.co/enterprise...
   Reading: pricing page (https://huggingface.co/pricing)
Scraping: https://huggingface.co/pricing...


# Hugging Face Brochure

## Overview
**Hugging Face** is a leading platform for the machine learning community dedicated to building the future of AI. Through collaboration, innovation, and open-source values, we empower users to create, discover, and collaborate on machine learning models, datasets, and applications.

## Our Mission
At Hugging Face, we believe in the potential of artificial intelligence to transform our world. We serve as the home of machine learning, providing tools, resources, and a vibrant community for anyone interested in AI.

## Key Offerings
- **Collaborative Platform**: Host, discover, and collaborate on over **2 million models, applications,** and **500k datasets**.
- **Open Source**: We contribute to the foundation of ML tooling with open-source libraries like Transformers, Diffusers, and Tokenizers.
- **Enterprise Solutions**: Tailored offerings for organizations to accelerate AI development with robust security and dedicated support.

## Who We Serve
Over **50,000 organizations** use Hugging Face, including industry leaders like Google, Microsoft, Amazon, and Intel. Our user base spans diverse sectors including tech, education, and non-profits, all leveraging our platform for sustainable AI development.

## Company Culture
We pride ourselves on a **community-driven culture** that emphasizes collaboration and inclusivity. Our team fosters an environment that encourages innovation and promotes shared learning among developers, researchers, and enthusiasts in the AI space.

### Core Values
- **Collaboration**: We thrive on the active participation of our users and contributors.
- **Innovation**: Continuous learning and pioneering new technologies is at the core of our mission.
- **Transparency**: Open-source accessibility ensures everyone can contribute and benefit from our tools.

## Join Us
### Careers at Hugging Face
At Hugging Face, we are always looking for passionate individuals to join our team of experts and innovators. We offer a range of opportunities across various domains in AI development, engineering, and community support.

**Why Work With Us?**
- Be a part of a passionate and collaborative team.
- Work on cutting-edge technology that shapes the future of AI.
- Engage with a vibrant community of developers and researchers.

Explore our current job openings and find your place in the AI revolution.

## Get Involved
Join our community and explore how you can utilize our resources, participate in discussions, and contribute to the ever-growing field of machine learning. Whether you're a seasoned expert or a curious beginner, Hugging Face welcomes you to embark on this journey with us.

**Discover more at**: [Hugging Face Website](https://huggingface.co)  
**Connect with us**: [Join our Discord](https://huggingface.co/discord)  |  [Read our Blog](https://huggingface.co/blog)

---

Welcome to Hugging Face, where together we are building a brighter future through AI.

## Function Call Flow for `stream_brochure`

The `stream_brochure` function orchestrates the entire brochure generation process by calling several other functions. Here's the sequence of execution:

1.  **`stream_brochure(company_name, url)`**:
    *   Initiates the process and prints a message indicating brochure generation has started.
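    *   Internally calls the user-prompt builder to prepare the prompt for the writer model. The CV version of this notebook defines it as `get_cv_user_prompt(candidate_name, portfolio_url)`; the original brochure version names it `get_brochure_user_prompt(company_name, url)`.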
    *   Sends this prompt to the `WRITER_MODEL` via `client.chat.completions.create` and streams the response to the display.

2.  **`get_cv_user_prompt(candidate_name, portfolio_url)`** (`get_brochure_user_prompt(company_name, url)` in the original brochure version):
    *   Prints a message about fetching website content.
    *   Its primary role is to gather all necessary content by calling `fetch_page_and_all_relevant_links(url)`.
    *   Formats the first 10,000 characters of the retrieved content into a user prompt for the writer model.

3.  **`fetch_page_and_all_relevant_links(url)`**:
    *   First, it fetches the content of the main URL by calling `fetch_website_contents(url)`.
    *   Then, it identifies relevant links on the website by calling `select_relevant_links(url)`.
    *   For each relevant link found, it prints which link type is being read and then calls `fetch_website_contents(link["url"])` again to get the content of that specific linked page.
    *   Aggregates all fetched content into a single string.

4.  **`select_relevant_links(url)`**:
    *   Prints a message indicating link analysis is in progress.
    *   It prepares a prompt for the `LINK_MODEL` by calling `get_links_user_prompt(url)`.
    *   Sends this prompt to the `LINK_MODEL` via `client.chat.completions.create` to get a JSON response of relevant links.
    *   Parses the JSON response and returns a dictionary whose `links` key holds the relevant links (an empty list if parsing fails).

5.  **`get_links_user_prompt(url)`**:
    *   Calls `fetch_website_links(url)` to get all raw links from the given URL.
    *   Joins the first 30 of these links into a user prompt for the `LINK_MODEL`.

6.  **`fetch_website_links(url)`**:
    *   Uses `requests` and `BeautifulSoup` to scrape all `<a>` tags from the specified URL.
    *   Converts relative URLs to absolute URLs and returns a unique list of all found links.

7.  **`fetch_website_contents(url)`**:
    *   Prints the URL being scraped.
    *   Uses `requests` and `BeautifulSoup` to fetch the HTML content of the URL.
    *   Strips out script, style, navigation, and footer elements.
    *   Extracts and cleans the visible text content, limiting it to the first 5000 characters to save tokens.
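
The order of calls in step 3 can be traced offline with stubs. The sketch below is a simplified illustration, not the notebook's code: `fetch_contents_stub`, `select_links_stub` and `fetch_page_and_links` are hypothetical stand-ins that record each call in a list instead of hitting the network or a model, and `https://example.com` is a placeholder URL.

```python
calls = []


def fetch_contents_stub(url):
    """Stand-in for fetch_website_contents: records the call, returns fake text."""
    calls.append(f"fetch_website_contents({url})")
    return f"text of {url}"


def select_links_stub(url):
    """Stand-in for select_relevant_links: records the call, returns one fake link."""
    calls.append(f"select_relevant_links({url})")
    return {"links": [{"type": "About me", "url": f"{url}/about"}]}


def fetch_page_and_links(url, fetch=fetch_contents_stub, select=select_links_stub):
    """Same shape as fetch_page_and_all_relevant_links, with injectable helpers."""
    result = f"## Landing Page:\n\n{fetch(url)}\n## Relevant Links:\n"
    for link in select(url)["links"]:
        result += f"\n\n### Link: {link['type']}\n{fetch(link['url'])}"
    return result


fetch_page_and_links("https://example.com")
for call in calls:
    print(call)
```

The trace shows the landing page fetched first, then link selection, then one fetch per selected link.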