# AI Brochure Generator
### Business Challenge:
Create a product that builds a brochure for a company, to be used for prospective clients, investors, and potential recruits.

We will be given the company name and its primary website.

In [38]:
import os 
import requests 
import json
from typing import List
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from IPython.display import Markdown, display, update_display
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.schema import SystemMessage, HumanMessage

In [10]:
load_dotenv()
google_api_key = os.getenv('GEMINI_API_KEY')

model = "gemini-1.5-flash"
gemini = ChatGoogleGenerativeAI(
    model = model,
    temperature=0.4,
    google_api_key=google_api_key
)


In [21]:
# A class to represent a webpage
class Website:
    """
    A utility class to represent a website that we have scraped, along with its links
    """
    url: str
    title: str
    body: bytes
    links: List[str]
    text: str

    def __init__(self, url):
        self.url = url
        response = requests.get(url, timeout=10)  # timeout value is an arbitrary choice; avoids hanging on slow sites
        self.body = response.content
        soup = BeautifulSoup(self.body, 'html.parser')
        self.title = soup.title.string if soup.title else "No Title Found"

        if soup.body:
            for irrelevant in soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = soup.body.get_text(separator="\n", strip=True)

        else:
            self.text = ""
        links = [link.get('href') for link in soup.find_all('a')]
        self.links = [link for link in links if link]

    def get_contents(self):
        return f"Webpage Title:\n{self.title} \nWebpage Contents:\n{self.text}\n\n"

In [41]:
web = Website("https://www.geeksforgeeks.org/machine-learning/machine-learning/")
print(web.get_contents())

Webpage Title:
Machine Learning Tutorial - GeeksforGeeks 
Webpage Contents:
Skip to content
Courses
DSA / Placements
GATE 2026 Prep
ML & Data Science
Development
Cloud / DevOps
Programming Languages
All Courses
Tutorials
Python
Java
DSA
ML & Data Science
Interview Corner
Programming Languages
Web Development
GATE
CS Subjects
DevOps
School Learning
Software and Tools
Practice
Practice Coding Problems
Nation Skillup- Free Courses
Problem of the Day
Jobs
Become a Mentor
Apply Now!
Post Jobs
Job-A-Thon: Hiring Challenge
Jobs Updates
Notifications
Mark all as read
All
View All
Notifications
Mark all as read
All
Unread
Read
You're all caught up!!
Python for Machine Learning
Machine Learning with R
Machine Learning Algorithms
EDA
Math for Machine Learning
Machine Learning Interview Questions
ML Projects
Deep Learning
NLP
Computer vision
Data Science
Artificial Intelligence
Sign In
▲
Open In App
Share Your Experiences
Machine Learning Basics
Introduction to Machine Learning
Types of Machine Le

In [39]:
# Show the first 10 links found on the webpage
web.links[:10]

['#main',
 'https://www.geeksforgeeks.org/',
 'https://www.geeksforgeeks.org/courses/category/dsa-placements',
 'https://www.geeksforgeeks.org/courses/category/gate/',
 'https://www.geeksforgeeks.org/courses/category/machine-learning-data-science',
 'https://www.geeksforgeeks.org/courses/category/development-testing',
 'https://www.geeksforgeeks.org/courses/category/cloud-devops',
 'https://www.geeksforgeeks.org/courses/category/programming-languages',
 'https://www.geeksforgeeks.org/courses',
 'https://www.geeksforgeeks.org/python/python-programming-language-tutorial/']
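The list above mixes an in-page anchor (`#main`) with absolute URLs, and other sites may also return relative paths or `mailto:` links. As a minimal sketch, assuming we want to clean the list before sending it to the model, the standard library's `urllib.parse` can resolve and filter the links. The helper name `normalize_links` and the `example.com` URLs are hypothetical and not part of the notebook.

```python
from typing import List
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_links(base_url: str, links: List[str]) -> List[str]:
    """Resolve links against base_url, drop in-page anchors and non-HTTP
    schemes, and remove duplicates while preserving order.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    seen = set()
    result = []
    for href in links:
        # Pure fragments such as '#main' only point back into the same page
        if href.startswith("#"):
            continue
        # Resolve relative paths, then strip any trailing '#fragment'
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        # Skip mailto:, tel:, javascript: and similar schemes
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


cleaned = normalize_links(
    "https://example.com/docs/",
    ["#main", "/about", "careers", "https://example.com/about#team", "mailto:hi@example.com"],
)
print(cleaned)
```

With something like this in place, the model would no longer need to guess how to expand relative links, although the notebook itself leaves that job to the prompt.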

### Now we'll use the Gemini model to read the links on a webpage and respond in a structured JSON format


In [42]:
link_system_prompt = "You are provided with a list of links found on a webpage. \
You are able to decide which of the links would be most relevant to include in a brochure about the company, \
such as links to an About page, or a Company page, or Career/Jobs pages. \n"

link_system_prompt += "You should respond in JSON as in this example:" 

link_system_prompt += """
{
    "links" : [
        {"type" : "about page", "url" : "https://full.url/here/about"},
        {"type" : "careers page", "url" : "https://another.full.url/here/about"}
    ]
}
"""

In [43]:
print(link_system_prompt)

You are provided with a list of links found on a webpage. You are able to decide which of the links would be most relevant to include in a brochure about the company, such as links to an About page, or a Company page, or Career/Jobs pages. 
You should respond in JSON as in this example:
{
    "links" : [
        {"type" : "about page", "url" : "https://full.url/here/about"},
        {"type" : "careers page", "url" : "https://another.full.url/here/about"}
    ]
}
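Hand-written JSON examples inside prompts are easy to get subtly wrong (a missing comma between objects, for instance). One possible alternative, sketched here under the assumption that the example should always be valid JSON, is to build it from a Python dict and serialize it with `json.dumps`. The names `example`, `example_json`, and `prompt_sketch` are hypothetical and only serve this illustration.

```python
import json

# Example payload mirroring the structure shown in the system prompt
example = {
    "links": [
        {"type": "about page", "url": "https://full.url/here/about"},
        {"type": "careers page", "url": "https://another.full.url/here/about"},
    ]
}

# json.dumps always emits syntactically valid JSON
example_json = json.dumps(example, indent=4)

prompt_sketch = "You should respond in JSON as in this example:\n" + example_json
print(prompt_sketch)
```

The trade-off is a slightly less readable prompt definition in exchange for an example that cannot drift into invalid JSON.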



In [35]:
def get_links_user_prompt(website):
    user_prompt = f"Here is the list of links on the website of {website.url} - "
    user_prompt += """Please decide which of these are relevant web links for a brochure about the company, and respond with the full URL.
Do not include Terms of Service, Privacy, or email links.\n"""
    user_prompt += "Links (some might be relative links):\n"
    user_prompt += "\n".join(website.links)

    return user_prompt

In [36]:
print(get_links_user_prompt(web))

Here is the list of links on the website of https://www.geeksforgeeks.org/machine-learning/machine-learning/ - Please decide which of these are relevant web links for a brochure about the company, and respond with the full URL.
Do not include Terms of Service, Privacy, or email links.
Links (some might be relative links):
#main
https://www.geeksforgeeks.org/
https://www.geeksforgeeks.org/courses/category/dsa-placements
https://www.geeksforgeeks.org/courses/category/gate/
https://www.geeksforgeeks.org/courses/category/machine-learning-data-science
https://www.geeksforgeeks.org/courses/category/development-testing
https://www.geeksforgeeks.org/courses/category/cloud-devops
https://www.geeksforgeeks.org/courses/category/programming-languages
https://www.geeksforgeeks.org/courses
https://www.geeksforgeeks.org/python/python-programming-language-tutorial/
https://www.geeksforgeeks.org/java/java/
https://www.geeksforgeeks.org/learn-data-structures-and-algorithms-dsa-tutorial/
https://www.geeksf

In [75]:
from pydantic import BaseModel

# Schema
class Link(BaseModel):
    url_type: str
    url: str

class LinksSchema(BaseModel):
    links: List[Link]

# Structured LLM
structured_llm = gemini.with_structured_output(LinksSchema)

# # Alternative: making the LLM respond in a structured format using a dict schema
# structured_llm = gemini.with_structured_output({
#     "links" : [{
#         "url_type" : "string",
#         "url" : "string"
#     }]
#   })

In [76]:
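def get_links(url):
    """Scrape the page at `url` and ask Gemini which links belong in a brochure. Returns a JSON string."""
    website = Website(url)

    messages = [
        SystemMessage(content=link_system_prompt),
        HumanMessage(content=get_links_user_prompt(website))
    ]

    # structured_llm returns a LinksSchema instance; serialize it for display
    result = structured_llm.invoke(messages)
    json_result = json.dumps(result.model_dump(), indent=2)

    return json_result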
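Since `get_links` returns a JSON string rather than Python objects, any later step has to parse it again. The sketch below shows one way that might look, assuming the string follows the `LinksSchema` shape (`links`, each with `url_type` and `url`). The helper `parse_links` and the `sample` string are hypothetical; the sample reuses two entries of the kind the model returns for the Anthropic site.

```python
import json
from typing import List, Tuple


def parse_links(json_result: str) -> List[Tuple[str, str]]:
    """Return (url_type, url) pairs from a JSON string shaped like the
    output of get_links, skipping entries that lack either field.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    data = json.loads(json_result)
    pairs = []
    for entry in data.get("links", []):
        url_type = entry.get("url_type")
        url = entry.get("url")
        if isinstance(url_type, str) and isinstance(url, str):
            pairs.append((url_type, url))
    return pairs


# Stand-in for a get_links result, so the sketch runs without calling the model
sample = json.dumps({
    "links": [
        {"url_type": "company page", "url": "https://www.anthropic.com/company"},
        {"url_type": "careers page", "url": "https://www.anthropic.com/careers"},
    ]
})

for url_type, url in parse_links(sample):
    print(f"{url_type}: {url}")
```

An alternative design would be for `get_links` to return `result.model_dump()` directly and call `json.dumps` only when printing, which would avoid the round trip through a string.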

In [81]:
anthropic = Website("https://www.anthropic.com/")
anthropic.links

['#main',
 '#footer',
 'https://www.anthropic.com/',
 'https://www.anthropic.com/claude',
 'https://www.anthropic.com/claude-code',
 'https://www.anthropic.com/max',
 'https://www.anthropic.com/team',
 'https://www.anthropic.com/enterprise',
 'https://www.anthropic.com/pricing',
 'https://claude.ai/download',
 'https://claude.ai/',
 'https://www.anthropic.com/news/claude-character',
 'https://www.anthropic.com/api',
 'https://docs.anthropic.com/',
 'https://www.anthropic.com/pricing#api',
 'https://console.anthropic.com/',
 'https://docs.anthropic.com/en/docs/welcome',
 'https://www.anthropic.com/solutions/agents',
 'https://www.anthropic.com/solutions/code-modernization',
 'https://www.anthropic.com/solutions/coding',
 'https://www.anthropic.com/solutions/customer-support',
 'https://www.anthropic.com/solutions/education',
 'https://www.anthropic.com/solutions/financial-services',
 'https://www.anthropic.com/solutions/government',
 'https://www.anthropic.com/customers',
 'https://www.

In [77]:
print(get_links("https://www.anthropic.com/"))

{
  "links": [
    {
      "url_type": "company page",
      "url": "https://www.anthropic.com/company"
    },
    {
      "url_type": "team page",
      "url": "https://www.anthropic.com/team"
    },
    {
      "url_type": "careers page",
      "url": "https://www.anthropic.com/careers"
    },
    {
      "url_type": "research page",
      "url": "https://www.anthropic.com/research"
    },
    {
      "url_type": "customers page",
      "url": "https://www.anthropic.com/customers"
    },
    {
      "url_type": "solutions page",
      "url": "https://www.anthropic.com/solutions"
    },
    {
      "url_type": "pricing page",
      "url": "https://www.anthropic.com/pricing"
    },
    {
      "url_type": "claude page",
      "url": "https://www.anthropic.com/claude"
    },
    {
      "url_type": "max page",
      "url": "https://www.anthropic.com/max"
    },
    {
      "url_type": "enterprise page",
      "url": "https://www.anthropic.com/enterprise"
    },
    {
      "url_type": "