
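What sets this project apart is that it is not just a brochure generator: it is flexible enough to serve different purposes (sales, attracting investment, recruiting). It also takes the company's website as its input, so it can draw on up-to-date company information. It is a development exercise designed with practical business use in mind.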

In [1]:
import os
import requests
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from openai import OpenAI

In [2]:
## Initialize and constants
load_dotenv(override=True)

api_key = os.getenv("OPENAI_API_KEY")

MODEL = "gpt-4o-mini"
openai = OpenAI()

In [3]:
# A class to represent a Webpage

# Some websites need you to use proper headers when fetching them:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    """
    A utility class to represent a Website that we have scraped, now with links.
    """
    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers)
        self.body = response.content
        soup = BeautifulSoup(self.body, "html.parser")
        self.title = soup.title.string if soup.title else "No title found"
        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)
        else:
            self.text = ""
        links = [link.get("href") for link in soup.find_all("a")]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"

In [4]:
ed = Website("https://edwarddonner.com")
ed.links

['https://edwarddonner.com/',
 'https://edwarddonner.com/outsmart/',
 'https://edwarddonner.com/about-me-and-about-nebula/',
 'https://edwarddonner.com/posts/',
 'https://edwarddonner.com/',
 'https://news.ycombinator.com',
 'https://nebula.io/?utm_source=ed&utm_medium=referral',
 'https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html',
 'https://patents.google.com/patent/US20210049536A1/',
 'https://www.linkedin.com/in/eddonner/',
 'https://edwarddonner.com/2025/01/23/llm-workshop-hands-on-with-agents-resources/',
 'https://edwarddonner.com/2025/01/23/llm-workshop-hands-on-with-agents-resources/',
 'https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/',
 'https://edwarddonner.com/2024/12/21/llm-resources-superdatascience/',
 'https://edwarddonner.com/2024/11/13/llm-engineering-resources/',
 'https://edwarddonner.com/2024/11/13/llm-engineering-resources/',
 'https://edwarddonner.com/2024/10/16/from-soft

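The steps for using GPT-4o-mini to analyze a web page's links and respond with structured JSON:

Have GPT-4o-mini read the links and judge their relevance.
Convert relative links (e.g. "/about") into absolute links (e.g. "https://company.com/about").
Use one-shot prompting (showing a single example) to specify the expected output format.

This approach plays to the strengths of an LLM and, for a judgment call such as relevance, is more effective than conventional hand-coded parsing.
Note: Week 8 covers a more advanced technique, "Structured Outputs", which is used in the AI agent project.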
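
The relative-to-absolute conversion described above can also be done deterministically before the links ever reach the model. The following is a minimal sketch using the standard library's `urllib.parse.urljoin`; the helper name `normalize_links` is hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, links):
    """Resolve links against base_url, keep http(s) only, drop duplicates."""
    seen = set()
    result = []
    for link in links:
        absolute = urljoin(base_url, link)
        # Skip mailto:, javascript:, tel: and similar non-web schemes.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        # Preserve first-seen order while removing repeats.
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


normalized = normalize_links(
    "https://huggingface.co",
    ["/", "/models", "/models", "mailto:hi@example.com", "https://example.com/docs"],
)
print(normalized)
```

Pre-resolving links this way would shorten the prompt and leave the model with only the relevance judgment, at the cost of a little extra code.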

In [5]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Careers/Jobs pages. \n"
link_system_prompt += "You should respond in JSON as in this example:"
link_system_prompt += """
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [6]:
print(link_system_prompt)

You are provided with a list of links found on a webpage. You are able to decide which of the links would be most relevant to include in a brochure about the company, such as links to an About page, or a Company page, or Careers/Jobs pages. 
You should respond in JSON as in this example:
{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}



In [7]:
def get_links_user_prompt(website):
    # Use website.url here; interpolating the object itself would print its repr.
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += "please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \
Do not include Terms of Service, Privacy, email links.\n"
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)
    return user_prompt

In [8]:
print(get_links_user_prompt(ed))

Here is the list of links on the website of https://edwarddonner.com - please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. Do not include Terms of Service, Privacy, email links.
Links (some might be relative links):
https://edwarddonner.com/
https://edwarddonner.com/outsmart/
https://edwarddonner.com/about-me-and-about-nebula/
https://edwarddonner.com/posts/
https://edwarddonner.com/
https://news.ycombinator.com
https://nebula.io/?utm_source=ed&utm_medium=referral
https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html
https://patents.google.com/patent/US20210049536A1/
https://www.linkedin.com/in/eddonner/
https://edwarddonner.com/2025/01/23/llm-workshop-hands-on-with-agents-resources/
https://edwarddonner.com/2025/01/23/llm-workshop-hands-on-with-agents-resources/
https://edwarddonner.com/2024/12/21/llm-resources-superdata

In [16]:
def get_links(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(website)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    return json.loads(result)
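
`get_links` assumes the model's reply always has the shape shown in the one-shot example. As a hedged sketch of how the reply could be checked before use, the helper below (the name `parse_links_reply` is hypothetical, not part of the notebook) returns an empty list when the JSON is malformed or an entry lacks the expected fields.

```python
import json


def parse_links_reply(raw):
    """Return a list of {"type", "url"} dicts, or [] if the reply is malformed."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(data, dict):
        return []
    links = data.get("links", [])
    if not isinstance(links, list):
        return []
    cleaned = []
    for item in links:
        # Keep only entries that carry both expected string fields.
        if (
            isinstance(item, dict)
            and isinstance(item.get("type"), str)
            and isinstance(item.get("url"), str)
        ):
            cleaned.append({"type": item["type"], "url": item["url"]})
    return cleaned


reply = '{"links": [{"type": "about page", "url": "https://full.url/about"}, {"type": "no url"}]}'
print(parse_links_reply(reply))
```

Whether to drop bad entries silently, as here, or raise an error is a design choice; silent filtering keeps the brochure pipeline running when one entry is off.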

In [17]:
huggingface = Website("https://huggingface.co")
huggingface.links

['/',
 '/models',
 '/datasets',
 '/spaces',
 '/posts',
 '/docs',
 '/enterprise',
 '/pricing',
 '/login',
 '/join',
 '/deepseek-ai/DeepSeek-R1',
 '/hexgrad/Kokoro-82M',
 '/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B',
 '/deepseek-ai/DeepSeek-R1-Zero',
 '/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B',
 '/models',
 '/spaces/hexgrad/Kokoro-TTS',
 '/spaces/tencent/Hunyuan3D-2',
 '/spaces/lllyasviel/iclight-v2',
 '/spaces/webml-community/deepseek-r1-webgpu',
 '/spaces/JeffreyXiang/TRELLIS',
 '/spaces',
 '/datasets/fka/awesome-chatgpt-prompts',
 '/datasets/cais/hle',
 '/datasets/bespokelabs/Bespoke-Stratos-17k',
 '/datasets/HumanLLMs/Human-Like-DPO-Dataset',
 '/datasets/yale-nlp/MMVU',
 '/datasets',
 '/join',
 '/pricing#endpoints',
 '/pricing#spaces',
 '/pricing',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/enterprise',
 '/allenai',
 '/facebook',
 '/amazon',
 '/google',
 '/Intel',
 '/microsoft',
 '/grammarly',
 '/Writer',
 '/docs/transformers',

In [18]:
get_links("https://huggingface.co")

{'links': [{'type': 'about page', 'url': 'https://huggingface.co'},
  {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'},
  {'type': 'company page',
   'url': 'https://www.linkedin.com/company/huggingface/'}]}

## Second step: make the brochure!

Assemble all the details into another prompt for GPT-4o-mini.

In [19]:
def get_all_details(url):
    result = "Landing page:\n"
    result += Website(url).get_contents()
    links = get_links(url)
    print("Found links:", links)
    for link in links["links"]:
        result += f"\n\n{link['type']}\n"
        result += Website(link["url"]).get_contents()
    return result

In [20]:
print(get_all_details("https://huggingface.co"))

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'models page', 'url': 'https://huggingface.co/models'}, {'type': 'datasets page', 'url': 'https://huggingface.co/datasets'}, {'type': 'spaces page', 'url': 'https://huggingface.co/spaces'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'join page', 'url': 'https://huggingface.co/join'}]}
Landing page:
Webpage Title:
Hugging Face – The AI community building the future.
Webpage Contents:
Hugging Face
Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up
The AI community building the future.
The platform where the machine learning community collaborates on models, datasets, and applications.
Trending on
this week
Models
deepseek-ai/DeepSeek-R1
Updated
1 day ago
•
69.6k
•
2.32k
hexgrad/Kokoro-82M
Updated
about 5 hours

In [21]:
system_prompt = "You are an assistant that analyzes the contents of several relevant pages from a company website \
and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \
Include details of company culture, customers and careers/jobs if you have the information."

In [22]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"You are looking at a company called: {company_name}\n"
    user_prompt += "Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\n"
    user_prompt += get_all_details(url)
    # Truncate to 5,000 characters to keep the prompt size bounded.
    user_prompt = user_prompt[:5_000]
    return user_prompt
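
The hard slice at 5,000 characters can cut the prompt in the middle of a line, as the truncated output of this function shows. One possible refinement, sketched here under the assumption that ending on a line boundary is preferable, is a small helper (the name `truncate_at_line` is hypothetical) that backs up to the last newline within the limit.

```python
def truncate_at_line(text, limit=5_000):
    """Cut text to at most `limit` characters, ending on a line boundary if possible."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    last_newline = cut.rfind("\n")
    # Fall back to a hard cut when the first `limit` characters contain no usable newline.
    return cut if last_newline <= 0 else cut[:last_newline]


sample = "Landing page:\nWebpage Title:\nHugging Face"
print(truncate_at_line(sample, 20))
```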

In [25]:
print(get_brochure_user_prompt("Hugging Face", "https://huggingface.co"))

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'models page', 'url': 'https://huggingface.co/models'}, {'type': 'datasets page', 'url': 'https://huggingface.co/datasets'}, {'type': 'spaces page', 'url': 'https://huggingface.co/spaces'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}]}
You are looking at a company called: Hugging Face
Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.
Landing page:
Webpage Title:
Hugging Face – The AI community building the future.
Webpage Contents:
Hugging Face
Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up
The AI community building the future.
The platform where the machine learning community collaborates on models, datasets, and applicat

In [26]:
def create_brochure(company_name, url):
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
    )
    result = response.choices[0].message.content
    display(Markdown(result))

In [27]:
create_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'company page', 'url': 'https://www.linkedin.com/company/huggingface/'}]}


# Hugging Face Brochure

---

### **About Us**
**Hugging Face** is more than just a machine learning company; it is a thriving **AI community** dedicated to developing cutting-edge technologies that shape the future of artificial intelligence. Our platform brings together experts and enthusiasts alike to collaborate on models, datasets, and applications.

### **Our Offerings**
- **Models**: Access over 400,000 pre-trained models ranging from text generation to image and audio processing.
- **Datasets**: Browse through a collection of over 100,000 datasets tailored to various machine learning tasks.
- **Spaces**: Collaboratively run applications and deploy machine learning models in an interactive environment.

### **Why Choose Us?**
- **Open Source**: Our commitment to open-source projects such as Transformers and Diffusers allows anyone to contribute to or utilize state-of-the-art ML tools freely.
- **Enterprise Solutions**: We provide tailored enterprise-grade solutions with robust security, access management, and dedicated support, catering to organizations looking to leverage AI at scale.
- **Compute Resources**: With a seamless transition from development to deployment, our computational resources start at just $0.60 per hour for GPU processing.

### **Our Customers**
Join a growing list of over **50,000 organizations** that trust Hugging Face, including industry leaders like:
- **Meta**
- **Amazon Web Services**
- **Google**
- **Microsoft**

### **Company Culture**
At Hugging Face, we believe in fostering a culture of collaboration, innovation, and inclusivity. Our community-driven approach ensures that everyone has a chance to contribute and grow. We prioritize continuous learning and support initiatives that promote well-being and creativity within our teams.

### **Careers**
We’re always looking for passionate individuals to join our ranks. Whether you're a seasoned developer, a data scientist, or a newcomer eager to learn, we have exciting opportunities to help you thrive in the ML space:
- Explore a variety of job roles across engineering, product management, and research.
- Work in a dynamic environment that encourages creativity and collaboration.

### **Join Us**
Become part of the AI revolution with Hugging Face and help build the future of machine learning. Visit our [website](https://huggingface.co) to learn more about our projects, community, and career opportunities.

--- 

**Together, we are building a future where technology is accessible and beneficial for everyone. Welcome to Hugging Face!**

### Finally - a minor improvement

With a small adjustment, we can change this so that the results stream back from OpenAI, with the familiar typewriter animation.

In [31]:
def stream_brochure(company_name, url):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True
    )
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        # delta.content can be None, so fall back to an empty string.
        response += chunk.choices[0].delta.content or ""
        # Strip code-fence markers from a copy for display; keeping the raw text
        # avoids deleting the word "markdown" from the brochure body itself.
        cleaned = response.replace("```markdown", "").replace("```", "")
        update_display(Markdown(cleaned), display_id=display_handle.display_id)

In [32]:
stream_brochure("HuggingFace", "https://huggingface.co")

Found links: {'links': [{'type': 'about page', 'url': 'https://huggingface.co/'}, {'type': 'careers page', 'url': 'https://apply.workable.com/huggingface/'}, {'type': 'enterprise page', 'url': 'https://huggingface.co/enterprise'}, {'type': 'pricing page', 'url': 'https://huggingface.co/pricing'}, {'type': 'models page', 'url': 'https://huggingface.co/models'}, {'type': 'datasets page', 'url': 'https://huggingface.co/datasets'}, {'type': 'spaces page', 'url': 'https://huggingface.co/spaces'}, {'type': 'blog page', 'url': 'https://huggingface.co/blog'}, {'type': 'learn page', 'url': 'https://huggingface.co/learn'}, {'type': 'community page', 'url': 'https://discuss.huggingface.co'}]}



# Hugging Face Brochure

## Introduction
**Hugging Face** is at the forefront of the AI revolution, embodying a vibrant community dedicated to building the future of machine learning. Our platform is a collaborative hub where individuals and organizations can interact, innovate, and share breakthroughs in machine learning models, datasets, and applications.

---

## Our Vision
**“The AI community building the future.”**  
At Hugging Face, we believe in the power of collaboration and open-source technology. Our mission is to democratize AI by making it accessible to everyone, regardless of their technical background.

---

## Products & Services
### Models
- Browse our extensive library of **400k+ models** including cutting-edge developments like DeepSeek and Kokoro.
- Contribute to the community by sharing your own models and accessing those created by others.

### Datasets
- Access and share **100k+ datasets** tailored to a multitude of tasks in computer vision, audio, and NLP.
- Collaborate on and improve datasets to enhance machine learning training and evaluation.

### Spaces & Applications
- Explore **150k+ applications** that leverage AI across various industries and projects.
- Create and share your own applications on our platform, enhancing visibility and engagement.

### Enterprise Solutions
- Count on advanced features for businesses including security, access controls, and dedicated support.
- Our **enterprise offerings** start at $20/user/month, designed to meet the needs of large organizations.

---

## Customers
More than **50,000 organizations** trust Hugging Face:
- Leading companies like **Google**, **Microsoft**, **Amazon Web Services**, and **Intel** rely on our technology.
- Our platform serves as the backbone for startups, enterprises, and research institutions alike, fostering innovation across diverse sectors.

---

## Company Culture
### Community-Driven
At Hugging Face, we cherish a culture of collaboration and inclusivity. Our team is a melting pot of talents, driven by a shared passion for artificial intelligence and its transformative potential.

### Open Source Ethos
We actively contribute to the open-source community, providing tools and resources that facilitate education and innovation in machine learning technologies.

---

## Careers at Hugging Face
Join a vibrant team of innovators! We are always on the lookout for talented individuals who are passionate about AI and want to make a difference. 

### Opportunities
- Explore various career paths ranging from **Machine Learning Engineers** to **Community Managers.**
- Help us continue to shape the future of AI by joining our ambitious team.

**Visit our Careers page** to see available positions and learn more about our work culture!

---

## Connect with Us
- **Website:** [huggingface.co](https://huggingface.co)
- **Social Media:** Follow us on [Twitter](https://twitter.com/huggingface), [LinkedIn](https://linkedin.com/company/huggingface), and [GitHub](https://github.com/huggingface).

Together, let’s build the next generation of AI technology!
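
The accumulation loop inside `stream_brochure` can be exercised without calling the API. The sketch below uses stand-in objects that mimic only the `chunk.choices[0].delta.content` attribute path used in the notebook; `fake_chunk` and `accumulate` are hypothetical names, and real stream chunks carry more fields than this.

```python
from types import SimpleNamespace


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a stream chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def accumulate(stream):
    """Yield the growing response text after each chunk arrives."""
    response = ""
    for chunk in stream:
        # content may be None, so substitute an empty string.
        response += chunk.choices[0].delta.content or ""
        yield response


chunks = [fake_chunk("# Hugging"), fake_chunk(" Face"), fake_chunk(None)]
snapshots = list(accumulate(chunks))
print(snapshots)
```

Each snapshot is what would be passed to `update_display`, which is why the rendered brochure appears to grow as chunks arrive.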



In [33]:
links = {'links': [{'type': 'About page', 'url': 'https://www.langchain.com/about'},
  {'type': 'Company page', 'url': 'https://smith.langchain.com/'},
  {'type': 'Careers page',
   'url': 'https://langchain-ai.github.io/langgraphjs/tutorials/quickstart#careers'},
  {'type': 'Blog', 'url': 'https://blog.langchain.dev/'},
  {'type': 'Twitter handle', 'url': 'https://twitter.com/LangChainAI'},
  {'type': 'LinkedIn profile',
   'url': 'https://www.linkedin.com/company/langchain/'},
  {'type': 'YouTube channel', 'url': 'https://www.youtube.com/@LangChain'},
  {'type': 'Ebook',
   'url': 'https://drive.google.com/drive/folders/17xybjzmVBdsQA-VxouuGLxF6bDsHDe80?usp=sharing'}]}