## A full business solution: generating a brochure for a company

#### BUSINESS CHALLENGE:

Create a product that builds a brochure for a company, to be used for prospective clients, investors and potential recruits.

We will be given a **company name** and its **primary website**.

See the end of this notebook for examples of real-world business applications.


In [1]:
import os
import json
import requests
from typing import List
from openai import OpenAI
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from IPython.display import Markdown, display, update_display

In [2]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')
if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key? Please visit the troubleshooting notebook!")
    
MODEL = 'gpt-4o-mini'
openai = OpenAI()

API key looks good so far


In [3]:
# A class to represent a webpage.
# The Website class performs the following tasks:
#   - Fetches the webpage using requests.
#   - Parses the HTML with BeautifulSoup.
#   - Removes irrelevant elements (script, style, img, input).
#   - Extracts the title, visible text, and links.
#   - Provides a method to return the extracted content.

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"}
class Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """
    def __init__(self, url): # The __init__ method is the constructor,  that initializes an instance of Website with a given url.
        self.url = url # Stores the URL of the webpage
        response = requests.get(url, headers=headers) # Sends an HTTP GET request to the given url using the specified headers to retrieve the webpage's content.
        self.body = response.content # Stores the raw HTML content of the webpage
        soup = BeautifulSoup(self.body, 'html.parser') # Parses the HTML using BeautifulSoup with the html.parser to enable easy extraction of data.
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]): # soup.body(["script", "style", "img", "input"]) finds all <script>, <style>, <img>, and <input> elements.
                irrelevant.decompose() # .decompose() permanently removes them from the parsed HTML.
            self.text = soup.body.get_text(separator="\n", strip=True) # Extracts the visible text from the webpage. strip=True removes extra spaces.
        else:
            self.text = "" # Assigns an empty string if there’s no body content
        links = [link.get('href') for link in soup.find_all('a')] # Finds all hyperlinks (<a> tags) and reads each href attribute (None when the tag has no href).
        self.links = [link for link in links if link] # Keeps only the links that actually have an href value.
        
    def get_contents(self): # Defines a method to return the webpage title and text.
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n" #  Formats the extracted data into a readable string.    

   

### Why we exclude the following elements from the extracted text:

- `<script>`: contains JavaScript code, which is not part of the visible text on a webpage.
- `<style>`: contains CSS styles, which define the appearance but do not contribute to the visible content.
- `<img>`: represents images, which are non-textual elements and cannot be extracted as readable content.
- `<input>`: represents form fields (text boxes, checkboxes, buttons, etc.), which are interactive but do not contain useful textual information.


`self.title = soup.title.string if soup.title else "No title found"` is equivalent to:

        if soup.title:
            self.title = soup.title.string
        else:
            self.title = "No title found"

`links = [link.get('href') for link in soup.find_all('a')]`, combined with the filter `self.links = [link for link in links if link]` on the next line, is equivalent to:

        links = []
        for link in soup.find_all('a'):
            href = link.get('href')
            if href:
                links.append(href)

Practical example: in the HTML below, the third `<a>` tag has no `href` attribute, so it is filtered out and only three URLs remain.

        <html>
            <body>
                <a href="https://example.com/home">Home</a>
                <a href="https://example.com/about">About Us</a>
                <a>Contact</a> <!-- No href attribute -->
                <a href="https://example.com/services">Services</a>
            </body>
        </html>
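
To make the filtering concrete, here is a minimal sketch that runs the same "collect every `href`, skip tags without one" logic on the sample HTML. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; `LinkCollector` and `sample_html` are names made up for this illustration and are not part of the notebook.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, skipping tags without one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # same filter as `if link` in the Website class
                self.links.append(href)


sample_html = """
<html>
    <body>
        <a href="https://example.com/home">Home</a>
        <a href="https://example.com/about">About Us</a>
        <a>Contact</a> <!-- No href attribute -->
        <a href="https://example.com/services">Services</a>
    </body>
</html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
# ['https://example.com/home', 'https://example.com/about', 'https://example.com/services']
```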

        

In [4]:
ed = Website("https://mydatascienceenthusiast.com/")
print(ed.links)
# print(ed.title)
# print(ed.text)
#print(ed.get_contents())

['#main', 'https://mydatascienceenthusiast.com/about-us/', 'https://mydatascienceenthusiast.com/blog/', 'https://mydatascienceenthusiast.com/contact/', 'https://mydatascienceenthusiast.com/', 'https://mydatascienceenthusiast.com/', 'https://mydatascienceenthusiast.com/', 'https://mydatascienceenthusiast.com/', 'https://mydatascienceenthusiast.com/about-us/', 'https://mydatascienceenthusiast.com/blog/', 'https://mydatascienceenthusiast.com/contact/', 'https://mydatascienceenthusiast.com/', 'https://mydatascienceenthusiast.com/', 'https://www.linkedin.com/in/tazeb-abera/', 'https://www.facebook.com/addisumng', '#contact-section', 'https://mydatascienceenthusiast.com/wp-content/uploads/2025/01/TazebAbera_Resume.pdf', '#contact-section', '#contact-section', '#contact-section', '#contact-section', '#contact-section', '#contact-section', 'https://creativethemes.com']


## First step: Have GPT-4o-mini figure out which links are relevant

### Use a call to gpt-4o-mini to read the links on a webpage, and respond in structured JSON.  
It should decide which links are relevant, and replace relative links such as "/about" with "https://company.com/about".  
We will use "one-shot prompting", in which the prompt includes one example of how the model should respond.

This is an excellent use case for an LLM, because it requires nuanced understanding. Imagine trying to code this without LLMs by parsing and analyzing the webpage - it would be very hard!

Sidenote: there is a more advanced technique called "Structured Outputs" in which we require the model to respond according to a spec. We cover this technique in Week 8 during our autonomous Agentic AI project.
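
Deciding which links are *relevant* needs the model, but resolving relative links does not. As an optional pre-processing step (not part of the original notebook), the link list could be cleaned up before it goes into the prompt. The sketch below uses `urllib.parse` from the standard library; `normalize_links` is a hypothetical helper name and `https://company.com/` is a placeholder.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, links):
    """Resolve relative links, drop in-page anchors and non-HTTP links, and de-duplicate."""
    seen = set()
    cleaned = []
    for link in links:
        if link.startswith("#"):
            continue  # in-page anchor such as "#main"
        # Make the link absolute, then strip any "#fragment" suffix
        absolute, _fragment = urldefrag(urljoin(base_url, link))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # e.g. mailto: or tel: links
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


print(normalize_links(
    "https://company.com/",
    ["#main", "/about", "https://company.com/about", "mailto:hi@company.com", "careers/", "/about#team"],
))
# ['https://company.com/about', 'https://company.com/careers/']
```

Shortening the list this way also reduces the number of tokens sent to the model, since pages often repeat the same navigation links many times.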

In [5]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [6]:
print(link_system_prompt)

You are provided with a list of links found on a webpage. You are able to decide which of the links would be most relevant to include in a brochure about the company, such as links to an About page, or a Company page, or Careers/Jobs pages.
You should respond in JSON as in this example:
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
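
Hand-written JSON inside a prompt is easy to get subtly wrong, and the model tends to imitate whatever example it is shown. One possible safeguard, sketched here, is to build the example from a Python dict with `json.dumps` so that it is valid JSON by construction. `example_links` and `build_link_system_prompt` are hypothetical names for this sketch; the prompt wording and placeholder URLs are the same as above.

```python
import json

# Hypothetical names for illustration; the URLs are placeholders, not real pages.
example_links = {
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"},
    ]
}


def build_link_system_prompt(example):
    """Return the link-selection system prompt with a JSON example generated by json.dumps."""
    prompt = (
        "You are provided with a list of links found on a webpage. "
        "You are able to decide which of the links would be most relevant to include "
        "in a brochure about the company, "
        "such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
    )
    prompt += "You should respond in JSON as in this example:\n"
    prompt += json.dumps(example, indent=4)
    return prompt


print(build_link_system_prompt(example_links))
```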



In [7]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. "
    user_prompt += "Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [8]:
print(get_links_user_prompt(ed))

Here is the list of links on the website of https://mydatascienceenthusiast.com/ - please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
#main
https://mydatascienceenthusiast.com/about-us/
https://mydatascienceenthusiast.com/blog/
https://mydatascienceenthusiast.com/contact/
https://mydatascienceenthusiast.com/
https://mydatascienceenthusiast.com/
https://mydatascienceenthusiast.com/
https://mydatascienceenthusiast.com/
https://mydatascienceenthusiast.com/about-us/
https://mydatascienceenthusiast.com/blog/
https://mydatascienceenthusiast.com/contact/
https://mydatascienceenthusiast.com/
https://mydatascienceenthusiast.com/
https://www.linkedin.com/in/tazeb-abera/
https://www.facebook.com/addisumng
#contact-section
https://mydatascienceenthusiast.com/wp-content/uploads/2025/01/TazebAbera_Resume.pdf
#contact-s

In [9]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)
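
`response_format={"type": "json_object"}` asks for syntactically valid JSON, but it does not guarantee that the reply has the `links` structure requested in the prompt. A defensive parsing step could look like the sketch below. `parse_links_response` is a hypothetical helper, not part of the notebook, and falling back to an empty list is just one possible policy.

```python
import json


def parse_links_response(raw):
    """Parse the model's reply and keep only well-formed link entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"links": []}
    if not isinstance(data, dict):
        return {"links": []}
    links = data.get("links", [])
    if not isinstance(links, list):
        return {"links": []}
    valid = [
        {"type": str(item.get("type", "unknown")), "url": item["url"]}
        for item in links
        if isinstance(item, dict)
        and isinstance(item.get("url"), str)
        and item["url"].startswith(("http://", "https://"))
    ]
    return {"links": valid}


reply = '{"links": [{"type": "about page", "url": "https://company.com/about"}, {"type": "no url"}, {"url": "/relative"}]}'
print(parse_links_response(reply))
# {'links': [{'type': 'about page', 'url': 'https://company.com/about'}]}
```

With a helper like this, `get_links` could return `parse_links_response(result)` in place of `json.loads(result)`, so that later code can rely on every entry having a `type` and an absolute `url`.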

In [10]:
# Anthropic has made their site harder to scrape, so I'm using Hugging Face instead.
huggingface = Website("https://huggingface.co")
# huggingface.links

In [11]:
get_links("https://huggingface.co")

{'links': [{'type': 'homepage', 'url': 'https://huggingface.co/'},
  {'type': 'about page', 'url': 'https://huggingface.co/huggingface'},
  {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'},
  {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'},
  {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'},
  {'type': 'blog page', 'url': 'https://huggingface.co/blog'},
  {'type': 'community page', 'url': 'https://discuss.huggingface.co'},
  {'type': 'GitHub page', 'url': 'https://github.com/huggingface'},
  {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'},
  {'type': 'LinkedIn page',
   'url': 'https://www.linkedin.com/company/huggingface/'}]}

## Second step: make the brochure!

Assemble all the details into another prompt to GPT-4o-mini.

In [12]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [13]:
#print(get_all_details("https://huggingface.co"))

In [14]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."

# Or uncomment the lines below for a more humorous brochure - this demonstrates how easy it is to incorporate 'tone':
# system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
# and creates a short humorous, entertaining, jokey brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
# Include details of company culture, customers and careers/jobs if you have the information."


In [15]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
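
Cutting the whole prompt at 5,000 characters means a long landing page can use up the budget before any linked page is included. One possible alternative, sketched below, gives each page an equal share of the budget. It assumes the page contents are available as `(title, text)` pairs, which is not how `get_all_details` currently returns them; `truncate_evenly` is a hypothetical helper.

```python
def truncate_evenly(sections, max_chars=5_000):
    """Give each (title, text) section an equal share of the character budget."""
    if not sections:
        return ""
    per_section = max_chars // len(sections)
    parts = []
    for title, text in sections:
        header = f"{title}\n"
        body_budget = max(per_section - len(header), 0)
        parts.append(header + text[:body_budget])
    # The separators add a few characters, so apply the overall cap once more
    return "\n\n".join(parts)[:max_chars]


pages = [("Landing page:", "a" * 10_000), ("about page", "b" * 10_000)]
prompt_body = truncate_evenly(pages, max_chars=1_000)
print(len(prompt_body), "about page" in prompt_body)
# 1000 True
```

A character budget is only a rough proxy for tokens, so the limit may still need tuning for a given model.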

In [16]:
#get_brochure_user_prompt("Anthropic", "https://anthropic.com")

In [17]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [18]:
create_brochure("Ethiopian ", "https://www.eotcdbht.org/")

Found links: {'links': [{'type': 'company homepage', 'url': 'https://www.eotcdbht.org'}, {'type': 'services page', 'url': 'https://www.eotcdbht.org/አገልግሎቶች-ministries'}, {'type': 'staff page', 'url': 'https://www.eotcdbht.org/አገልጋዮች-staff'}, {'type': 'nearby locations page', 'url': 'https://www.eotcdbht.org/አጥቢያ-አብያተ-ክርስትያናት-near-by-eotc'}, {'type': 'contact page', 'url': 'https://www.eotcdbht.org/ያግኙን-contact'}, {'type': 'membership page', 'url': 'https://www.eotcdbht.org/copy-of-የአባልነት-ቅጽ-membership-1'}, {'type': 'job openings page', 'url': 'https://www.eotcdbht.org/jobopenning'}, {'type': 'book online page', 'url': 'https://www.eotcdbht.org/book-online'}]}


```markdown
# Debre Berehan Holy Trinity Ethiopian Orthodox Tewahedo Church

**Location:**  
4406 Broadway Blvd, Garland, TX 75043  
**Contact:**  
Email: office@eotcdbht.com  
Phone: 469-688-4084  

---

## Welcome to Our Church

At the Debre Berehan Holy Trinity Ethiopian Orthodox Tewahedo Church, we honor our rich tradition and welcome all who seek to connect with our spiritual community. Located in Garland, Texas, we strive to be a source of light, faith, and fellowship for our congregation.

---

## Our Ministries

Our church offers a variety of ministries that cater to different spiritual needs:

- **Sunday School:**  
  We provide relevant teachings for children and engage them in learning about the Orthodox faith.  
  Contact: Mengistu Zafu  
  Email: sundayschool@eotcdbht.org  
  Phone: 469-363-6066  

- **Baptismal Services:**  
  We celebrate the entrance of new believers into our community through baptism.  
  Contact: Kesis Henok Tezera  
  Email: office@eotcdbht.org  
  Phone: 469-432-1879  

- **Evangelism:**  
  Our evangelism team reaches out to the wider community to share the love of Christ.

- **Gift Shop:**  
  Open for parishioners to purchase various religious items and books.  
  Contact: Senait Hunde  
  Email: info@eotcdbht.com  
  Phone: 469-487-4338  

---

## Church Schedule

- **Sunday Services:** 4 AM  
- **Office Hours:**  
  - **Monday - Friday:** 6 AM - 5 PM  
  - **Saturday:** 5 AM - 9 AM  
  - **Sunday:** 3 AM - 12 PM  

---

## Company Culture

We pride ourselves on a welcoming and inclusive environment that embodies the teachings of Jesus Christ. Our church is not just a place for worship, but a community where friendships are nurtured and faith is deepened through shared experiences and acts of service.

---

## Join Us

Whether you're a prospective member wanting to know more about our faith, an investor interested in supporting our church initiatives, or a recruit looking for a community to grow with, we welcome you to get involved and experience the warmth and hospitality of our congregation.  

To learn more about membership or job openings, please visit our [Membership Page](#) or [Job Opportunities](#).

---

## Support Your Church

We appreciate your support and donations, which help sustain our ministries and outreach programs.  
[Donate Now](#)

---

We look forward to welcoming you into our community at the Debre Berehan Holy Trinity Ethiopian Orthodox Tewahedo Church!
```
This markdown brochure outlines the key aspects of the Debre Berehan Holy Trinity Ethiopian Orthodox Tewahedo Church for potential customers, investors, and recruits.

## Finally - a minor improvement

With a small adjustment, we can change this so that the results stream back from OpenAI with the familiar typewriter animation.

In [19]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
          ],
        stream=True
    )
    
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
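        response += chunk.choices[0].delta.content or ''
        # Strip code-fence markers for display only. Cleaning a copy keeps the raw text intact,
        # so a fence split across chunks still matches and the word "markdown" in body text survives.
        cleaned = response.replace("```markdown", "").replace("```", "")
        update_display(Markdown(cleaned), display_id=display_handle.display_id)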
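
Stripping fence markers from streamed text has one subtlety: the opening fence can arrive split across several chunks, and the word "markdown" may also appear legitimately in the brochure text. The sketch below simulates a stream with a plain list of strings (no API call is made) and cleans a copy of the accumulated text on every step. `render_frames` and the sample chunks are made up for this illustration.

```python
def render_frames(chunks):
    """Yield the text that would be displayed after each streamed chunk."""
    raw = ""
    for piece in chunks:
        raw += piece or ""  # a streamed delta can be None
        # Clean a copy; the raw text is kept so split fences still match later
        yield raw.replace("```markdown", "").replace("```", "")


# The opening fence is deliberately split across the first three chunks
simulated_chunks = ["``", "`mark", "down\n# Title\n", "Body about markdown", None]
frames = list(render_frames(simulated_chunks))
print(repr(frames[-1]))
# '\n# Title\nBody about markdown'
```

Intermediate frames may briefly show part of a fence (for example the first frame here is two backticks), but the final frame is clean and the body text keeps the word "markdown".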

In [20]:
#stream_brochure("HuggingFace", "https://www.eotcdbht.org/")

In [21]:
# Try changing the system prompt to the humorous version when you make the Brochure for Hugging Face:

stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'community page', 'url': 'https://discuss.huggingface.co'}, {'type': 'GitHub page', 'url': 'https://github.com/huggingface'}, {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'}, {'type': 'LinkedIn page', 'url': 'https://www.linkedin.com/company/huggingface/'}]}


# Hugging Face Brochure

## Welcome to Hugging Face
The AI community building the future, Hugging Face is at the forefront of machine learning innovation. Our platform serves as a collaborative space where developers, researchers, and enthusiasts converge to create, share, and utilize cutting-edge models, datasets, and applications.

---

## What We Offer
### Models, Datasets, & Spaces
- **Models**: Explore a library of over **400,000** models across various tasks and modalities including text, images, video, and audio.
- **Datasets**: Access and share **100,000+** datasets to advance your projects and research.
- **Spaces**: Create and host applications seamlessly with our community-focused features.

### Advanced Solutions
- **Compute**: Take advantage of our optimized inference endpoints or upgrade your applications with GPU support starting at **$0.60/hour**.
- **Enterprise Solutions**: Organizations can benefit from enterprise-grade security, priority support, and dedicated resources starting at **$20/user/month**.

### Leading Companies
Join a robust network of over **50,000 organizations** including Amazon Web Services, Google, Microsoft, Grammarly, and more that leverage Hugging Face for their AI solutions.

---

## Company Culture
At Hugging Face, we foster a vibrant and inclusive culture that prioritizes **collaboration and community**. Our open-source ethos empowers everyone—from novices to experts—to contribute to and benefit from shared knowledge in machine learning.

- **Community-Driven**: We believe in the power of collaboration and have built a strong community around our models and datasets.
- **Innovation**: Our commitment to pushing the boundaries of AI ensures that we are always at the cutting-edge of technology and research.

---

## Careers at Hugging Face
We are always on the lookout for passionate talent to join our mission in transforming the world with AI. Working at Hugging Face means being part of a **dynamic team** that values creativity, diversity, and growth. 

### Current Opportunities
Explore various roles across:
- Software Engineering
- Data Science
- Product Management
- Marketing

To learn more and apply, visit our [Careers Page](https://huggingface.co/jobs).

---

## Join Us
Discover how Hugging Face is reshaping the future of AI by visiting our [website](https://huggingface.co). Together, we can build remarkable solutions that drive progress and innovation for everyone.

---

**Follow Us**
Connect with us on social platforms to stay updated with our latest news, models, and community events.
- [GitHub](https://github.com/huggingface)
- [Twitter](https://twitter.com/huggingface)
- [LinkedIn](https://www.linkedin.com/company/hugging-face)
- [Discord](https://discord.gg/huggingface)

--- 

### Hugging Face - The AI Community Building the Future.