# A full business solution


### BUSINESS CHALLENGE:

Create a product that builds a brochure for a company, to be used with prospective clients, investors and potential recruits.

We will be provided with a company name and its primary website.

See the end of this notebook for examples of real-world business applications.

And remember: I'm always available if you have problems or ideas! Please do reach out.

In [1]:
# imports
# If these fail, please check you're running from an 'activated' environment with (llms) in the command prompt

import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

In [2]:
# Initialize and constants

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key? Please visit the troubleshooting notebook!")
    
MODEL = 'gpt-4o-mini'
openai = OpenAI()

API key looks good so far


In [51]:
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that we have scraped, now with links.
    """

    def __init__(self, url):
        self.url = url
        try:
            # Fetch webpage content
            response = requests.get(url, headers=headers, timeout=10)  # timeout avoids hanging on unresponsive sites
            response.raise_for_status()  # Raise exception for bad responses (4xx/5xx)
            self.body = response.content
        except Exception as e:
            print(f"Error fetching URL: {url}. Details: {e}")
            self.body = None

        if self.body:
            # Parse the HTML using BeautifulSoup
            soup = BeautifulSoup(self.body, 'html.parser')
            self.title = soup.title.string if soup.title else "No title found"

            # Extract clean text from the body (removing irrelevant tags)
            if soup.body:
                for irrelevant in soup.body(["script", "style", "img", "input"]):
                    irrelevant.decompose()
                self.text = soup.body.get_text(separator="\n", strip=True)
            else:
                self.text = ""

            # Extract and resolve links
            raw_links = soup.find_all('a')
            self.links = []
            for link in raw_links:
                href = link.get('href')
                if href:  # Only process non-empty links
                    absolute_url = urljoin(url, href)  # Resolve relative links
                    parsed_url = urlparse(absolute_url)
                    if parsed_url.scheme in ["http", "https"] and parsed_url.netloc:
                        self.links.append(absolute_url)
        else:
            # Handle cases where the content could not be fetched
            self.title = "No title found"
            self.text = ""
            self.links = []

    def get_contents(self):
        """
        Returns the title and text content of the webpage.
        """
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

    def get_links(self):
        """
        Returns the extracted links from the webpage.
        """
        return self.links

In [39]:
ed = Website("https://edwarddonner.com")
#ed.links
ed.url

'https://edwarddonner.com'
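
To see what the link handling in `Website.__init__` does without hitting the network, here is a small offline sketch of the same `urljoin` / `urlparse` steps. The base URL and hrefs are made-up examples, and `resolve_links` is an illustrative helper rather than part of the notebook.

```python
from urllib.parse import urljoin, urlparse

def resolve_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep only http(s) links that have a host."""
    resolved = []
    for href in hrefs:
        if not href:
            continue  # skip missing or empty href attributes
        absolute_url = urljoin(base_url, href)  # relative links become absolute
        parsed = urlparse(absolute_url)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            resolved.append(absolute_url)
    return resolved

# Hypothetical hrefs as they might appear in a page's <a> tags
sample_hrefs = ["/about", "careers", "https://other.example/press", "mailto:hi@company.com", None]
print(resolve_links("https://company.com/team/", sample_hrefs))
```

The `mailto:` entry is dropped because its scheme is not http(s), and `None` is skipped just as a tag without an `href` would be.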

## First step: Have GPT-4o-mini figure out which links are relevant

### Use a call to gpt-4o-mini to read the links on a webpage, and respond in structured JSON.  
It should decide which links are relevant, and replace relative links such as "/about" with "https://company.com/about".  
We will use "one-shot prompting", in which we provide an example of how it should respond in the prompt.

This is an excellent use case for an LLM, because it requires nuanced understanding. Imagine trying to code this without LLMs by parsing and analyzing the webpage - it would be very hard!

Sidenote: there is a more advanced technique called "Structured Outputs" in which we require the model to respond according to a spec. We cover this technique in Week 8 during our autonomous Agentic AI project.

In [29]:
link_system_prompt = """
You will be provided with a list of links extracted from a website. Your task is to analyze the links and classify them into types that are relevant for creating an AI-related website brochure. The classification should include categories such as "about page," "careers page," "products/services page," "contact page," or any other type you deem relevant for an AI website brochure.

You should return the output as a JSON object in the following format:
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"},
        {"type": "products/services page", "url": "https://yet.another.url/products"},
        {"type": "contact page", "url": "https://yet.another.url/contact"}
    ]
}

If no relevant links are found, return:
{
    "links": []
}

Ensure the URLs are classified correctly based on their content or intended purpose as indicated by the link text or URL structure. Use logical reasoning and common naming conventions (e.g., "about," "careers," "services," "contact") to categorize the links appropriately.
"""

In [30]:
print(link_system_prompt)


You will be provided with a list of links extracted from a website. Your task is to analyze the links and classify them into types that are relevant for creating an AI-related website brochure. The classification should include categories such as "about page," "careers page," "products/services page," "contact page," or any other type you deem relevant for an AI website brochure.

You should return the output as a JSON object in the following format:
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"},
        {"type": "products/services page", "url": "https://yet.another.url/products"},
        {"type": "contact page", "url": "https://yet.another.url/contact"}
    ]
}

If no relevant links are found, return:
{
    "links": []
}

Ensure the URLs are classified correctly based on their content or intended purpose as indicated by the link text or URL structure. Use logical

In [40]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [41]:
print(get_links_user_prompt(ed))

Here is the list of links on the website of https://edwarddonner.com - please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
https://edwarddonner.com/
https://edwarddonner.com/outsmart/
https://edwarddonner.com/about-me-and-about-nebula/
https://edwarddonner.com/posts/
https://edwarddonner.com/
https://news.ycombinator.com
https://nebula.io/?utm_source=ed&utm_medium=referral
https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html
https://patents.google.com/patent/US20210049536A1/
https://www.linkedin.com/in/eddonner/
https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/
https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/
https://edwarddonner.com/2024/11/13/llm-engineering-resources/
https://edwarddonner.com/2024/11/13/ll
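
The listing above contains the same URL more than once, which spends prompt tokens without adding information. One possible tweak, sketched here with an illustrative helper name (`dedupe_links` is not part of the notebook), is to drop repeats while keeping the first-seen order:

```python
def dedupe_links(links):
    """Return links with duplicates removed, preserving first-seen order."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(links))

sample_links = [
    "https://edwarddonner.com/",
    "https://edwarddonner.com/outsmart/",
    "https://edwarddonner.com/",
    "https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/",
    "https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/",
]
print("\n".join(dedupe_links(sample_links)))
```

If you wanted to try it, `"\n".join(dedupe_links(website.links))` could stand in for the join in `get_links_user_prompt`.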

In [42]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)
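
`get_links` trusts the model to return exactly the JSON shape described in the system prompt. As a hedged sketch of a more defensive version, the helper below (a hypothetical name, `parse_links_response`, not part of the notebook) falls back to an empty list when the reply is not valid JSON and keeps only entries that have a string `type` and an http(s) `url`:

```python
import json

def parse_links_response(raw):
    """Parse the model's reply and return {'links': [...]} containing only well-formed entries."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"links": []}  # not JSON at all (or None)
    if not isinstance(data, dict):
        return {"links": []}
    links = data.get("links", [])
    if not isinstance(links, list):
        return {"links": []}
    clean = [
        {"type": item["type"], "url": item["url"]}
        for item in links
        if isinstance(item, dict)
        and isinstance(item.get("type"), str)
        and isinstance(item.get("url"), str)
        and item["url"].startswith(("http://", "https://"))
    ]
    return {"links": clean}

sample_reply = '{"links": [{"type": "about page", "url": "https://company.com/about"}, {"type": "broken"}]}'
print(parse_links_response(sample_reply))
```

In `get_links`, this would replace the final `json.loads(result)` call if you prefer an empty result over an exception.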

In [44]:
# Anthropic has made their site harder to scrape, so the later examples use HuggingFace.
# Note: this cell still loads edwarddonner.com, which is what the output below shows.

huggingface = Website("https://edwarddonner.com")
huggingface.links

['https://edwarddonner.com/',
 'https://edwarddonner.com/outsmart/',
 'https://edwarddonner.com/about-me-and-about-nebula/',
 'https://edwarddonner.com/posts/',
 'https://edwarddonner.com/',
 'https://news.ycombinator.com',
 'https://nebula.io/?utm_source=ed&utm_medium=referral',
 'https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html',
 'https://patents.google.com/patent/US20210049536A1/',
 'https://www.linkedin.com/in/eddonner/',
 'https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/',
 'https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/',
 'https://edwarddonner.com/2024/11/13/llm-engineering-resources/',
 'https://edwarddonner.com/2024/11/13/llm-engineering-resources/',
 'https://edwarddonner.com/2024/10/16/from-software-engineer-to-ai-data-scientist-resources/',
 'https://edwarddonner.com/2024/10/16/from-software-engineer-to-ai-data-scientist-resources/',
 'https://edwarddonner.com/

In [45]:
get_links("https://huggingface.co")

{'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'},
  {'type': 'products/services page', 'url': 'https://huggingface.co/models'},
  {'type': 'products/services page', 'url': 'https://huggingface.co/datasets'},
  {'type': 'products/services page', 'url': 'https://huggingface.co/spaces'},
  {'type': 'blog', 'url': 'https://huggingface.co/blog'},
  {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'},
  {'type': 'contact page', 'url': 'https://discuss.huggingface.co'},
  {'type': 'social media', 'url': 'https://twitter.com/huggingface'},
  {'type': 'social media',
   'url': 'https://www.linkedin.com/company/huggingface/'}]}
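
Because the selection is produced by the model, it may include a URL that was never in the scraped list, and such a URL can fail when fetched later. A minimal sketch of a guard, assuming you have both the model's result and the scraped links at hand (`keep_scraped_links` is an illustrative name, not part of the notebook):

```python
def keep_scraped_links(selected, scraped_links):
    """Drop any model-selected entry whose URL was not among the scraped links."""
    # Compare without trailing slashes so "https://x.com/a" matches "https://x.com/a/"
    known = {link.rstrip("/") for link in scraped_links}
    kept = [item for item in selected["links"] if item["url"].rstrip("/") in known]
    return {"links": kept}

# Made-up data for illustration
scraped = ["https://company.com/about/", "https://company.com/careers"]
selected = {
    "links": [
        {"type": "about page", "url": "https://company.com/about"},
        {"type": "contact page", "url": "https://company.com/contact"},  # not in the scraped list
    ]
}
print(keep_scraped_links(selected, scraped))
```

The trade-off is that a correct URL the model inferred, but that was not linked on the page, is also discarded.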

## Second step: make the brochure!

Assemble all the details into another prompt to GPT-4o-mini.

In [46]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [52]:
print(get_all_details("https://huggingface.co"))

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/about'}, {'type': 'products/services page', 'url': 'https://huggingface.co/models'}, {'type': 'products/services page', 'url': 'https://huggingface.co/datasets'}, {'type': 'products/services page', 'url': 'https://huggingface.co/spaces'}, {'type': 'products/services page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'products/services page', 'url': 'https://huggingface.co/pricing'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'contact page', 'url': 'https://huggingface.co/contact'}, {'type': 'learn page', 'url': 'https://huggingface.co/learn'}, {'type': 'community page', 'url': 'https://huggingface.co/discuss'}, {'type': 'social media page', 'url': 'https://twitter.com/huggingface'}, {'type': 'social media page', 'url': 'https://www.linkedin.com/company/huggingface/'}]}
Error fetching URL: https://h

In [53]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."

# Or uncomment the lines below for a more humorous brochure - this demonstrates how easy it is to incorporate 'tone':

# system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
# and creates a short humorous, entertaining, jokey brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
# Include details of company culture, customers and careers/jobs if you have the information."
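
Rather than commenting prompts in and out, the tone could be passed as a parameter. This is only a sketch: `make_system_prompt` and its `tone` argument are hypothetical names, and the wording mirrors the prompt above.

```python
def make_system_prompt(tone=""):
    """Build the brochure system prompt, optionally with tone adjectives such as 'humorous, entertaining, jokey'."""
    style = f"{tone} " if tone else ""
    return (
        "You are an assistant that analyzes the contents of several relevant pages from a company website "
        f"and creates a short {style}brochure about the company for prospective customers, investors and recruits. "
        "Respond in markdown. "
        "Include details of company culture, customers and careers/jobs if you have the information."
    )

print(make_system_prompt("humorous, entertaining, jokey"))
```

Calling `make_system_prompt()` with no argument gives the neutral version.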


In [54]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
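
Note that the truncation applies to the combined text, so a long landing page can use up the whole 5,000 characters before any linked page is included. One alternative, sketched below with hypothetical names (`build_details`, `per_section_limit`), is to cap each page separately:

```python
def build_details(sections, per_section_limit=1_000):
    """Join (title, text) pairs, truncating each text so no single page uses the whole budget."""
    parts = []
    for title, text in sections:
        parts.append(f"{title}\n{text[:per_section_limit]}")
    return "\n\n".join(parts)

# Made-up page contents for illustration
sections = [("Landing page:", "A" * 3_000), ("about page", "B" * 3_000)]
details = build_details(sections, per_section_limit=1_000)
print(len(details))
```

The limit is a tuning knob: a smaller value fits more pages into the prompt at the cost of less detail per page.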

In [55]:
get_brochure_user_prompt("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'products/services page', 'url': 'https://huggingface.co/models'}, {'type': 'products/services page', 'url': 'https://huggingface.co/datasets'}, {'type': 'products/services page', 'url': 'https://huggingface.co/spaces'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'documentation page', 'url': 'https://huggingface.co/docs'}, {'type': 'community page', 'url': 'https://discuss.huggingface.co'}, {'type': 'contact page', 'url': 'https://huggingface.co/chat'}]}


'You are looking at a company called: HuggingFace\nHere are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\nLanding page:\nWebpage Title:\nHugging Face – The AI community building the future.\nWebpage Contents:\nHugging Face\nModels\nDatasets\nSpaces\nPosts\nDocs\nEnterprise\nPricing\nLog In\nSign Up\nThe AI community building the future.\nThe platform where the machine learning community collaborates on models, datasets, and applications.\nTrending on\nthis week\nModels\ndeepseek-ai/DeepSeek-V3\nUpdated\n7 days ago\n•\n66.2k\n•\n1.26k\ndeepseek-ai/DeepSeek-V3-Base\nUpdated\n7 days ago\n•\n8.17k\n•\n1.15k\nPowerInfer/SmallThinker-3B-Preview\nUpdated\nabout 4 hours ago\n•\n4.66k\n•\n240\nblack-forest-labs/FLUX.1-dev\nUpdated\nAug 16, 2024\n•\n1.17M\n•\n7.75k\nhexgrad/Kokoro-82M\nUpdated\nabout 1 hour ago\n•\n861\n•\n232\nBrowse 400k+ models\nSpaces\nRunning\n501\n🦀\nGemini Cod

In [56]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [57]:
create_brochure("HuggingFace", "https://huggingface.com")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.com/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'products/services page', 'url': 'https://huggingface.com/models'}, {'type': 'products/services page', 'url': 'https://huggingface.com/datasets'}, {'type': 'products/services page', 'url': 'https://huggingface.com/spaces'}, {'type': 'products/services page', 'url': 'https://huggingface.com/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.com/pricing'}, {'type': 'documentation page', 'url': 'https://huggingface.com/docs'}, {'type': 'blog page', 'url': 'https://huggingface.com/blog'}, {'type': 'contact page', 'url': 'https://huggingface.com/contact'}]}
Error fetching URL: https://huggingface.com/contact. Details: 404 Client Error: Not Found for url: https://huggingface.co/contact


# Hugging Face Brochure

## About Us
**Hugging Face** is at the forefront of the AI revolution, creating a collaborative platform for the machine learning community. Our mission is to democratize access to cutting-edge machine learning technologies, allowing developers, researchers, and organizations to build the future of AI together.

---

## Our Offerings
- **Models:** Over 400,000 AI models, including state-of-the-art architectures like Transformers and diffusion models.
- **Datasets:** Access a vast repository of over 100,000 datasets for various AI tasks including NLP, image processing, and more.
- **Spaces:** Host and collaborate on applications seamlessly, with various tools for deployment and scalability.
- **Enterprise Solutions:** Tailored services for organizations, featuring enterprise-grade security, dedicated support, and team collaboration tools.

---

## Join Our Community
More than **50,000 organizations** are utilizing Hugging Face, including renowned companies like Google, Microsoft, and Amazon Web Services. Our community thrives on collaboration, with trending projects and ongoing contributions from users around the globe.

### Recent Trends:
- **Top Models:** Notable recent models include DeepSeek-V3 and FLUX.1-dev, with significant user engagement.
  
---

## Company Culture
At Hugging Face, we embrace an open-source philosophy and value community engagement. We encourage collaboration, exploration, and knowledge sharing, fostering an innovative environment where everyone can contribute to the growth of AI technologies.

- **Team Size:** Our team consists of over 200 skilled professionals dedicated to advancing AI.
- **Work Environment:** We emphasize flexibility, support, and inclusivity, ensuring that everyone can thrive and develop their skills.

## Career Opportunities
We are looking for talented individuals who are passionate about AI and machine learning. Join us to work on impactful projects and contribute to the development of cutting-edge technology.

- **Open Positions:** Explore various roles at Hugging Face that cater to your skills and aspirations.
  
**Discover more at** [Hugging Face Careers](https://huggingface.co/jobs)

---

## Contact Us
For inquiries regarding partnerships, support, or general information, please visit our [Contact Page](https://huggingface.co/contact).

---

**Join us at Hugging Face and become a part of the AI community building the future!**

## Finally - a minor improvement

With a small adjustment, we can change this so that the results stream back from OpenAI,
with the familiar typewriter animation.

In [58]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True
    )
    
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        # Strip code-fence wrappers only, so the word "markdown" in the brochure text is left alone
        cleaned = response.replace("```markdown", "").replace("```", "")
        update_display(Markdown(cleaned), display_id=display_handle.display_id)
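
The accumulation pattern can be tried without calling the API. The sketch below feeds a list of made-up text deltas (standing in for `chunk.choices[0].delta.content`, which can be `None`) through a hypothetical helper, `accumulate_stream`, and yields the text that would be displayed after each chunk. The fence-stripping rule shown is one possible choice.

```python
def accumulate_stream(deltas):
    """Yield the cleaned, cumulative text after each streamed delta (None is treated as empty)."""
    response = ""
    for delta in deltas:
        response += delta or ""
        # Clean a copy for display; the raw text keeps accumulating untouched
        yield response.replace("```markdown", "").replace("```", "")

# Made-up deltas; note the opening fence is split across two chunks
fake_deltas = ["```mark", "down\n# Hugging", " Face", None, "\n```"]
for partial in accumulate_stream(fake_deltas):
    print(repr(partial))
```

Cleaning a copy on each step means a fence split across chunks is still removed once the rest of it arrives.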

In [59]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'products/services page', 'url': 'https://huggingface.co/models'}, {'type': 'products/services page', 'url': 'https://huggingface.co/datasets'}, {'type': 'products/services page', 'url': 'https://huggingface.co/spaces'}, {'type': 'products/services page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'products/services page', 'url': 'https://huggingface.co/pricing'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'community/discussion page', 'url': 'https://discuss.huggingface.co'}, {'type': 'documentation page', 'url': 'https://huggingface.co/docs'}]}


# Hugging Face Brochure

## Welcome to Hugging Face
### The AI Community Building the Future

At Hugging Face, we’re on a mission to transform how the machine learning community collaborates and innovates. Our platform serves as a hub for building, sharing, and deploying machine learning models, datasets, and applications seamlessly.

---

## Our Offerings
- **Models**: Explore our extensive library with over **400,000 models** allowing for cutting-edge AI development.
- **Datasets**: Access **100,000+ datasets** to enhance your projects.
- **Spaces**: Collaborate and experiment with applications in real-time.
  
### Enterprise Solutions
We cater to numerous organizations with:
- Advanced enterprise-grade security
- Access controls
- Priority support and resource management

Join the ranks of **50,000+ organizations** that leverage our platform including big names like Google, Meta, Microsoft, and AWS.

---

## Our Community and Culture
### Collaboration at Our Core
Hugging Face is built on community collaboration and open-source principles. We’re committed to democratizing AI and making it accessible to everyone. By providing powerful tools, we enable individuals and organizations to create and innovate without barriers.

### Values
- **Inclusivity**: We welcome talent and contributions from diverse backgrounds.
- **Innovation**: We strive to lead the way in AI and ML technologies.
- **Community Support**: Our community is our strength; we maintain forums and resources to ensure shared learning.

---

## Careers at Hugging Face
### Join Us!
We’re always looking for passionate individuals who want to make a difference in the AI space. Our culture fosters **creativity, collaboration**, and **continuous learning**. You’ll have the chance to:
- Work on exciting projects with a talented team.
- Contribute to cutting-edge technology.
- Be part of a community that values every voice.

For more details on available roles, check our [Careers Page](https://huggingface.co/jobs).

---

## Connect With Us
Follow our journey and be part of our vibrant community:
- [GitHub](https://github.com/huggingface)
- [Twitter](https://twitter.com/huggingface)
- [LinkedIn](https://linkedin.com/company/huggingface)
- [Discord](https://discord.com/invite/huggingface)

**Join us and start building the future of AI together!**  
Explore our website: [Hugging Face](https://huggingface.co)  
Sign up today and push the boundaries of what’s possible with AI!

--- 

**Hugging Face – Where AI & ML communities thrive!**

In [None]:
# Try changing the system prompt to the humorous version when you make the Brochure for Hugging Face:

stream_brochure("HuggingFace", "https://huggingface.co")