# A full business solution

## Now we will take our project from Day 1 to the next level

### BUSINESS CHALLENGE:

Create a product that builds a Brochure for a company to be used for prospective clients, investors and potential recruits.

We will be provided a company name and their primary website.

See the end of this notebook for examples of real-world business applications.

And remember: I'm always available if you have problems or ideas! Please do reach out.

In [1]:
# imports
# If these fail, please check you're running from an 'activated' environment with (llms) in the command prompt

import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

In [2]:
# Initialize and constants

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:
    print("API key looks good so far")
else:
    print("There might be a problem with your API key? Please visit the troubleshooting notebook!")
    
MODEL = 'gpt-4o-mini'
openai = OpenAI()

API key looks good so far


In [3]:
# A class to represent a Webpage

# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """

    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers)
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [4]:
ed = Website("https://edwarddonner.com")
ed.links

['https://edwarddonner.com/',
 'https://edwarddonner.com/outsmart/',
 'https://edwarddonner.com/about-me-and-about-nebula/',
 'https://edwarddonner.com/posts/',
 'https://edwarddonner.com/',
 'https://news.ycombinator.com',
 'https://nebula.io/?utm_source=ed&utm_medium=referral',
 'https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html',
 'https://patents.google.com/patent/US20210049536A1/',
 'https://www.linkedin.com/in/eddonner/',
 'https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/',
 'https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/',
 'https://edwarddonner.com/2024/11/13/llm-engineering-resources/',
 'https://edwarddonner.com/2024/11/13/llm-engineering-resources/',
 'https://edwarddonner.com/2024/10/16/from-software-engineer-to-ai-data-scientist-resources/',
 'https://edwarddonner.com/2024/10/16/from-software-engineer-to-ai-data-scientist-resources/',
 'https://edwarddonner.com/

## First step: Have GPT-4o-mini figure out which links are relevant

### Use a call to gpt-4o-mini to read the links on a webpage, and respond in structured JSON.  
It should decide which links are relevant, and replace relative links such as "/about" with "https://company.com/about".  
We will use "one shot prompting" in which we provide an example of how it should respond in the prompt.

This is an excellent use case for an LLM, because it requires nuanced understanding. Imagine trying to code this without LLMs by parsing and analyzing the webpage - it would be very hard!

Sidenote: there is a more advanced technique called "Structured Outputs" in which we require the model to respond according to a spec. We cover this technique in Week 8 during our autonomous Agentic AI project.

In [5]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page": "url": "https://another.full.url/careers"}
    ]
}
"""

In [6]:
print(link_system_prompt)

You are provided with a list of links found on a webpage. You are able to decide which of the links would be most relevant to include in a brochure about the company, such as links to an About page, or a Company page, or Careers/Jobs pages.
You should respond in JSON as in this example:
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page": "url": "https://another.full.url/careers"}
    ]
}



In [7]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [8]:
print(get_links_user_prompt(ed))

Here is the list of links on the website of https://edwarddonner.com - please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
https://edwarddonner.com/
https://edwarddonner.com/outsmart/
https://edwarddonner.com/about-me-and-about-nebula/
https://edwarddonner.com/posts/
https://edwarddonner.com/
https://news.ycombinator.com
https://nebula.io/?utm_source=ed&utm_medium=referral
https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html
https://patents.google.com/patent/US20210049536A1/
https://www.linkedin.com/in/eddonner/
https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/
https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/
https://edwarddonner.com/2024/11/13/llm-engineering-resources/
https://edwarddonner.com/2024/11/13/ll

In [9]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
      ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)

In [10]:
# Anthropic has made their site harder to scrape, so I'm using HuggingFace..

huggingface = Website("https://huggingface.co")
huggingface.links

['/',
 '/models',
 '/datasets',
 '/spaces',
 '/posts',
 '/docs',
 '/enterprise',
 '/pricing',
 '/login',
 '/join',
 '/hexgrad/Kokoro-82M',
 '/microsoft/phi-4',
 '/openbmb/MiniCPM-o-2_6',
 '/NovaSky-AI/Sky-T1-32B-Preview',
 '/deepseek-ai/DeepSeek-V3',
 '/models',
 '/spaces/hexgrad/Kokoro-TTS',
 '/spaces/JeffreyXiang/TRELLIS',
 '/spaces/lllyasviel/iclight-v2',
 '/spaces/stabilityai/stable-point-aware-3d',
 '/spaces/FaceOnLive/Face-Search-Online',
 '/spaces',
 '/datasets/fka/awesome-chatgpt-prompts',
 '/datasets/NovaSky-AI/Sky-T1_data_17k',
 '/datasets/DAMO-NLP-SG/multimodal_textbook',
 '/datasets/HumanLLMs/Human-Like-DPO-Dataset',
 '/datasets/cfahlgren1/react-code-instructions',
 '/datasets',
 '/join',
 '/pricing#endpoints',
 '/pricing#spaces',
 '/pricing',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/allenai',
 '/facebook',
 '/amazon',
 '/google',
 '/Intel',
 '/microsoft',
 '/grammarly',
 '/Writer',
 '/docs/transforme

In [11]:
get_links("https://huggingface.co")

{'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'},
  {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'},
  {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'},
  {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}]}

## Second step: make the brochure!

Assemble all the details into another prompt to GPT4-o

In [12]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [13]:
print(get_all_details("https://huggingface.co"))

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'community discussion', 'url': 'https://discuss.huggingface.co'}, {'type': 'GitHub page', 'url': 'https://github.com/huggingface'}, {'type': 'LinkedIn page', 'url': 'https://www.linkedin.com/company/huggingface/'}, {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'}]}
Landing page:
Webpage Title:
Hugging Face – The AI community building the future.
Webpage Contents:
Hugging Face
Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up
The AI community building the future.
The platform where the machine learning community collaborates on models, datasets, and applications.
Trending on
this

In [14]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown.\
Include details of company culture, customers and careers/jobs if you have the information."

# Or uncomment the lines below for a more humorous brochure - this demonstrates how easy it is to incorporate 'tone':

# system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
# and creates a short humorous, entertaining, jokey brochure about the company for prospective customers, investors and recruits. Respond in markdown.\
# Include details of company culture, customers and careers/jobs if you have the information."


In [15]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += f"Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt

In [16]:
get_brochure_user_prompt("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/about'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog', 'url': 'https://huggingface.co/blog'}, {'type': 'company page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'join page', 'url': 'https://huggingface.co/join'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'community forum', 'url': 'https://discuss.huggingface.co'}, {'type': 'GitHub page', 'url': 'https://github.com/huggingface'}, {'type': 'LinkedIn page', 'url': 'https://www.linkedin.com/company/huggingface/'}]}


"You are looking at a company called: HuggingFace\nHere are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\nLanding page:\nWebpage Title:\nHugging Face – The AI community building the future.\nWebpage Contents:\nHugging Face\nModels\nDatasets\nSpaces\nPosts\nDocs\nEnterprise\nPricing\nLog In\nSign Up\nThe AI community building the future.\nThe platform where the machine learning community collaborates on models, datasets, and applications.\nTrending on\nthis week\nModels\nhexgrad/Kokoro-82M\nUpdated\nabout 12 hours ago\n•\n18.1k\n•\n1.65k\nmicrosoft/phi-4\nUpdated\n8 days ago\n•\n88.1k\n•\n1.36k\nopenbmb/MiniCPM-o-2_6\nUpdated\nabout 3 hours ago\n•\n4.3k\n•\n415\nNovaSky-AI/Sky-T1-32B-Preview\nUpdated\n3 days ago\n•\n3.81k\n•\n412\ndeepseek-ai/DeepSeek-V3\nUpdated\n17 days ago\n•\n137k\n•\n1.93k\nBrowse 400k+ models\nSpaces\nRunning\non\nZero\n912\n❤️\nKokoro TTS\nNow in 5 languages!\nRunning\non\nZe

In [17]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
          ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [18]:
create_brochure("HuggingFace", "https://huggingface.com")

Found links: {'links': [{'type': 'home page', 'url': 'https://huggingface.com'}, {'type': 'about page', 'url': 'https://huggingface.com/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog page', 'url': 'https://huggingface.com/blog'}, {'type': 'community page', 'url': 'https://discuss.huggingface.co'}, {'type': 'GitHub page', 'url': 'https://github.com/huggingface'}, {'type': 'Twitter page', 'url': 'https://twitter.com/huggingface'}, {'type': 'LinkedIn page', 'url': 'https://www.linkedin.com/company/huggingface/'}]}


```markdown
# Hugging Face: The AI Community Building the Future

Welcome to **Hugging Face**, a vibrant platform where the machine learning community collaborates to build models, datasets, and applications that shape the future of AI. 

## Our Mission
At Hugging Face, we're on a mission to advance AI technology by fostering collaboration and innovation. Our platform enables creators, researchers, and organizations to openly share and improve upon the state-of-the-art machine learning tools and models.

## What We Offer

### **Models**
With over **400k+ models** available, you can browse recent developments like:
- **hexgrad/Kokoro-82M**: 18.1k updates
- **microsoft/phi-4**: 88.1k followers
- **deepseek-ai/DeepSeek-V3**: 137k updates.

### **Datasets**
Access a wealth of resources with **100k+ datasets**, including challenges and tools for computer vision, audio, and natural language processing.

### **Spaces**
Explore various applications, such as:
- **Kokoro TTS**: Multi-language text-to-speech
- **TRELLIS**: 3D generation from images
- **Stable Point-Aware 3D**: Revolutionary 3D modeling.

## Our Customers
More than **50,000 organizations** use Hugging Face, including industry giants like:
- **Google**
- **Amazon Web Services**
- **Microsoft**
- **Meta**.

Our tools empower enterprises to build advanced AI applications with enterprise-grade security.

## Company Culture
Hugging Face thrives on a culture of openness, collaboration, and community. We believe that the collective power of the community can drive innovation in AI. Our **open source** ethos means that we prioritize transparency and collaboration, enabling members to contribute and learn from one another.

## Careers at Hugging Face
Join our team and be part of a pioneering force in machine learning! Hugging Face seeks passionate individuals who are excited about AI and dedicated to making a difference. Check our **Jobs** page for current opportunities.

### Why Work with Us?
- Impactful projects at the forefront of AI.
- Collaborative and inclusive company culture.
- Opportunities for professional growth and learning.

## Get In Touch
To learn more about our services and how to join our community:
- **Visit our website:** [Hugging Face](https://huggingface.co)
- **Follow us on social media:** [Twitter](https://twitter.com/huggingface) | [LinkedIn](https://linkedin.com/company/huggingface)

Together, let's build the future of AI!
```


## Finally - a minor improvement

With a small adjustment, we can change this so that the results stream back from OpenAI,
with the familiar typewriter animation

In [19]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
          ],
        stream=True
    )
    
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        response = response.replace("```","").replace("markdown", "")
        update_display(Markdown(response), display_id=display_handle.display_id)

In [20]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog', 'url': 'https://huggingface.co/blog'}, {'type': 'models page', 'url': 'https://huggingface.co/models'}, {'type': 'datasets page', 'url': 'https://huggingface.co/datasets'}, {'type': 'spaces page', 'url': 'https://huggingface.co/spaces'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'documentation page', 'url': 'https://huggingface.co/docs'}]}


# Hugging Face Brochure

---

## Company Overview

**Hugging Face** is the vibrant, collaborative platform where the machine learning (ML) community converges to create, discover, and innovate. As a leader in artificial intelligence, Hugging Face provides cutting-edge tools and resources for building the future of AI.

**Tagline:** The AI community building the future.

---

## What We Offer

### Models
With a repository of over **400,000 models**, Hugging Face allows users to explore and implement state-of-the-art machine learning technologies. Our platform supports a diverse range of modalities, including text, image, video, audio, and even 3D formats.

### Datasets
Access **100,000+ datasets** to enhance your ML projects. Hugging Face serves as a hub for both developers and researchers to contribute and collaborate on various data sources.

### Spaces
Create and run ML applications seamlessly within **Hugging Face Spaces**. This feature lets users showcase their models and applications in an interactive, user-friendly environment.

### Enterprise Solutions
Hugging Face caters to over **50,000 organizations**, offering enterprise-grade solutions that emphasize security, resource management, and priority support, making it easier to integrate AI into business functions.

---

## Community & Company Culture

At Hugging Face, community is at our core. We believe in building an open and inclusive culture where collaboration is encouraged and innovation thrives. By engaging with developers, researchers, and enterprises alike, Hugging Face creates an ecosystem where ideas can flourish, and projects can multiply. Our team consists of over **222 talented professionals**, all working together to push the boundaries of what artificial intelligence can achieve.

---

## Careers at Hugging Face

Hugging Face is always looking for passionate individuals to join our team. We value creativity, collaboration, and a commitment to excellence. By joining us, you'll have the opportunity to work on groundbreaking AI projects and be a part of a community that shapes the future.

**Explore Careers:**
- **Open Positions:** [Find your dream job](https://huggingface.co/jobs)
  
Join us in transforming the landscape of AI and machine learning!

---

## Why Choose Hugging Face?

1. **Robust Infrastructure:** Harness the power of advanced ML tools and resources at your fingertips.
2. **Extensive Community:** Be a part of a diverse and engaged community dedicated to collaboration and innovation.
3. **Enterprise Solutions:** Tailored support for businesses of all sizes, ensuring that AI integration is effective and secure.

---

Ready to embark on an exciting journey in AI with Hugging Face? [Sign Up Now](https://huggingface.co/signup) and become part of a thriving community driving the future of technology!

--- 

For further inquiries and more information, please visit our [website](https://huggingface.co). 

Social Media: [Twitter](https://twitter.com/huggingface) | [LinkedIn](https://linkedin.com/company/huggingface) | [GitHub](https://github.com/huggingface) | [Discord](https://discord.gg/huggingface) 

--- 

*Experience the possibilities with Hugging Face – where technology meets community.*

In [21]:
# Try changing the system prompt to the humorous version when you make the Brochure for Hugging Face:

stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/huggingface'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'blog', 'url': 'https://huggingface.co/blog'}, {'type': 'models page', 'url': 'https://huggingface.co/models'}, {'type': 'datasets page', 'url': 'https://huggingface.co/datasets'}, {'type': 'spaces page', 'url': 'https://huggingface.co/spaces'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}]}



# Hugging Face Brochure

## About Us
### Hugging Face: The AI Community Building the Future
Hugging Face is the leading collaboration platform for the machine learning (ML) community. We empower researchers, developers, and enterprises to create, discover, and collaborate on state-of-the-art machine learning models, datasets, and applications. Our platform fosters community engagement, innovation, and rapid development in AI.

## Our Offerings
- **Models & Datasets**: Offering access to over **400,000 models** and **100,000 datasets** spanning various AI modalities—text, images, videos, audio, and more.
- **Spaces**: A robust environment where users can host and collaborate on their projects while leveraging advanced computation.
- **Enterprise Solutions**: Tailored services for organizations, providing enterprise-grade security, dedicated support, and access to advanced AI capabilities.

## Customers
We proudly serve more than **50,000 organizations**, including industry leaders like:
- **Amazon Web Services**
- **Google**
- **Microsoft**
- **Grammarly**

## Community Engagement
Our mission is grounded in open-source collaboration. Hugging Face is a hub for machine learning researchers and practitioners, promoting the sharing of knowledge and resources. Our large community contributes to various projects, including:
- **Transformers**: A highly popular library for state-of-the-art ML.
- **Diffusers**: For AI-generated images and audio.
- **Tokenizers**: Fast tokenization solutions for text processing.

## Company Culture
At Hugging Face, we are proud of our collaborative and inclusive culture that encourages innovation. Our team embodies diversity, creativity, and a passion for AI. We believe in the power of community and actively support an environment where everyone can contribute, learn, and grow.

## Careers at Hugging Face
Explore opportunities to join our growing team! We are always looking for talented individuals who are passionate about AI and want to make a difference in the field. Check our **Jobs page** for current openings and become part of the future of AI.

## Get Involved
Join the Hugging Face community today—whether you are looking to deploy state-of-the-art AI solutions, contribute to open-source projects, or explore exciting career opportunities, there is a place for you in our journey towards building the future of artificial intelligence. 

- [Sign Up Now](https://huggingface.co) to start collaborating and innovating!



<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../business.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#181;">Business applications</h2>
            <span style="color:#181;">In this exercise we extended the Day 1 code to make multiple LLM calls, and generate a document.

This is perhaps the first example of Agentic AI design patterns, as we combined multiple calls to LLMs. This will feature more in Week 2, and then we will return to Agentic AI in a big way in Week 8 when we build a fully autonomous Agent solution.

Generating content in this way is one of the very most common Use Cases. As with summarization, this can be applied to any business vertical. Write marketing content, generate a product tutorial from a spec, create personalized email content, and so much more. Explore how you can apply content generation to your business, and try making yourself a proof-of-concept prototype.</span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../important.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#900;">Before you move to Week 2 (which is tons of fun)</h2>
            <span style="color:#900;">Please see the week1 EXERCISE notebook for your challenge for the end of week 1. This will give you some essential practice working with Frontier APIs, and prepare you well for Week 2.</span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../resources.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#f71;">A reminder on 2 useful resources</h2>
            <span style="color:#f71;">1. The resources for the course are available <a href="https://edwarddonner.com/2024/11/13/llm-engineering-resources/">here.</a><br/>
            2. I'm on LinkedIn <a href="https://www.linkedin.com/in/eddonner/">here</a> and I love connecting with people taking the course!
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left;">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../thankyou.jpg" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#090;">Finally! I have a special request for you</h2>
            <span style="color:#090;">
                My editor tells me that it makes a MASSIVE difference when students rate this course on Udemy - it's one of the main ways that Udemy decides whether to show it to others. If you're able to take a minute to rate this, I'd be so very grateful! And regardless - always please reach out to me at ed@edwarddonner.com if I can help at any point.
            </span>
        </td>
    </tr>
</table>

In [30]:
link_system_prompt = "You are provided with a list of links found on a Dutch real estate webpage. \
You retrieve all links that provide information about floors or 'woningzoekers' that contain houses, \
so not the links of a specific appartment or house.\n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "woningaanbod", "url": "https://crossroadsamsterdam.nl/woningaanbod/"},
        {"type": "woningaanbod - verdieping 0 ": "url": "https://crossroadsamsterdam.nl/woningaanbod/verdieping-0/"}
    ]
}
"""
link_system_prompt += "You should avoid these links of specific appartments or houses:"
link_system_prompt += """
{
    "links": [
        {"type": "woningaanbod - verdieping 0  - appartement 6310' ": "url": "https://crossroadsamsterdam.nl/woningaanbod/verdieping-27/#6310"}
    ]
}
"""

def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links that contain information about the appartments, respond with the full https URL in JSON format. \
Do not include general information of the building, only of specific appartments or houses.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [31]:
# Anthropic has made their site harder to scrape, so I'm using HuggingFace..

cross = Website("https://crossroadsamsterdam.nl/")
cross.links


['https://crossroadsamsterdam.nl',
 'https://crossroadsamsterdam.nl/',
 'https://crossroadsamsterdam.nl/locatie/',
 'https://crossroadsamsterdam.nl/woningaanbod/',
 'https://crossroadsamsterdam.nl/voordelen/',
 'https://crossroadsamsterdam.nl/veelgesteldevragen/',
 'https://crossroadsamsterdam.nl/contact/',
 'https://account.crossroadsamsterdam.nl/#aanmelden',
 'https://crossroadsamsterdam.nl/woningaanbod',
 'https://account.crossroadsamsterdam.nl/#aanmelden',
 'https://crossroadsamsterdam.nl/veelgesteldevragen/',
 'https://crossroadsamsterdam.nl/contact',
 'https://crossroadsamsterdam.nl/cookies',
 'https://www.door.business']

In [37]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
      ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)

def get_more_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
      ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    print("Found links:", json.loads(result)["links"])
    # do another iteration of the found links to see if there are more links to be found
    for link in json.loads(result)["links"]:
        print(link)
        website = Website(link["url"])
        print("Getting links from", link["url"])
        links = get_links(link["url"])
        print("Found links:", links)
        for link in links["links"]:
            if link not in json.loads(result)["links"]:
                print("Adding link", link)
                json.loads(result)["links"].append(link)
    return json.loads(result)


In [33]:
get_links("https://crossroadsamsterdam.nl/")




{'links': [{'type': 'woningaanbod',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/'}]}

In [None]:
get_links("https://crossroadsamsterdam.nl/woningaanbod/")

{'links': [{'type': 'woningaanbod',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/'},
  {'type': 'woningaanbod - verdieping 3',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-3/'},
  {'type': 'woningaanbod - verdieping 4',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-4/'},
  {'type': 'woningaanbod - verdieping 5',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-5/'},
  {'type': 'woningaanbod - verdieping 6',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-6/'},
  {'type': 'woningaanbod - verdieping 7',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-7/'},
  {'type': 'woningaanbod - verdieping 8',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-8/'},
  {'type': 'woningaanbod - verdieping 9',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-9/'},
  {'type': 'woningaanbod - verdieping 10',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdiepin

In [38]:
get_more_links("https://crossroadsamsterdam.nl")

Found links: [{'type': 'woningaanbod', 'url': 'https://crossroadsamsterdam.nl/woningaanbod/'}]
{'type': 'woningaanbod', 'url': 'https://crossroadsamsterdam.nl/woningaanbod/'}
Getting links from https://crossroadsamsterdam.nl/woningaanbod/
Found links: {'links': [{'type': 'woningaanbod', 'url': 'https://crossroadsamsterdam.nl/woningaanbod/'}, {'type': 'woningaanbod - verdieping 0', 'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-0/'}, {'type': 'woningaanbod - verdieping 1', 'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-1/'}, {'type': 'woningaanbod - verdieping 2', 'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-2/'}, {'type': 'woningaanbod - verdieping 3', 'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-3/'}, {'type': 'woningaanbod - verdieping 4', 'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-4/'}, {'type': 'woningaanbod - verdieping 5', 'url': 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-5/'}, {'

{'links': [{'type': 'woningaanbod',
   'url': 'https://crossroadsamsterdam.nl/woningaanbod/'}]}

In [None]:
website = Website("https://crossroadsamsterdam.nl/woningaanbod/")
website.links


['https://crossroadsamsterdam.nl',
 'https://crossroadsamsterdam.nl/',
 'https://crossroadsamsterdam.nl/locatie/',
 'https://crossroadsamsterdam.nl/woningaanbod/',
 'https://crossroadsamsterdam.nl/voordelen/',
 'https://crossroadsamsterdam.nl/veelgesteldevragen/',
 'https://crossroadsamsterdam.nl/contact/',
 'https://account.crossroadsamsterdam.nl/#aanmelden',
 '#',
 '#6303',
 '#6304',
 '#6305',
 '#6307',
 '#6298',
 '#6299',
 '#6300',
 '#6301',
 '#6302',
 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-27/#6309',
 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-27/#6310',
 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-27/#6311',
 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-27/#6312',
 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-3/#6318',
 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-3/#6319',
 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-3/#6321',
 'https://crossroadsamsterdam.nl/woningaanbod/verdieping-3/#6322',
 

In [None]:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Scrape_Website:
    """
    A utility class to represent a Website that we have scraped, now with links
    """

    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers)
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\n\n"
    
    # def get_house_information(self):

        


In [28]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Scrape_Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [29]:
print(get_all_details("https://crossroadsamsterdam.nl/"))

Found links: {'links': [{'type': 'woningaanbod', 'url': 'https://crossroadsamsterdam.nl/woningaanbod/'}]}
Landing page:
Webpage Title:
Crossroads Amsterdam



woningaanbod
Webpage Title:
Crossroads Amsterdam | Woningaanbod
Webpage Contents:
Toggle navigation
Home
Locatie
Woningaanbod
Voordelen
Veelgestelde vragen
Contact
Account / Inschrijven
Beschikbaar woningaanbod
15
te huur
62
in optie
43
verhuurd
Filteren
Appartementen filteren
Soort huur
middenhuur
(119)
vrije sector
(77)
Aantal slaapkamers
1
(52)
2
(136)
3
(1)
combi
(7)
Oppervlakte
m2
-
m2
Huurprijs
€
,-
€
,-
Verdieping
begane grond
(5)
verdieping 1
(4)
verdieping 2
(4)
verdieping 3
(15)
verdieping 4
(11)
verdieping 5
(10)
verdieping 6
(13)
verdieping 7
(9)
verdieping 8
(7)
verdieping 9
(7)
verdieping 10
(7)
verdieping 11
(7)
verdieping 12
(7)
verdieping 13
(7)
verdieping 14
(7)
verdieping 15
(7)
verdieping 16
(7)
verdieping 17
(7)
verdieping 18
(7)
verdieping 19
(7)
verdieping 20
(7)
verdieping 21
(7)
verdieping 22
(7)
verdiepi