# AI-powered Brochure Generator
---
- Task: Generate a company brochure using its name and website for clients, investors, and recruits.
- Model: Toggle `USE_OPENAI` to switch between OpenAI and Ollama models
- Data Extraction: Scraping website content and filtering key links (About, Products, Careers, Contact).
- Output Format: a Markdown-formatted brochure streamed in real-time.
- Tools: BeautifulSoup, OpenAI API, and IPython display, ollama.
- Skill Level: Intermediate.

🛠️ Requirements
- ⚙️ Hardware: ✅ CPU is sufficient — no GPU required
- 🔑 OpenAI API Key 
- Install Ollama and pull llama3.2:3b or another lightweight model
---
📢 Find more LLM notebooks on my [GitHub repository](https://github.com/shivachaudhary46/LLM_Engineering)

In [26]:
from openai import OpenAI
import ollama 
import os
import json
import requests
from IPython.display import display, Markdown
from bs4 import BeautifulSoup
from dotenv import load_dotenv

In [27]:
# load api key 
# load_dotenv()
# api_key = os.getenv('OPENAI_API_KEY')
# if not api_key or not api_key.startswith('sk-'):
#     raise ValueError("Invalid OpenAI API Key. Check your .env file")

use_ollama = True # True if we are using ollama and False to use OpenAI
model = "llama3.1" if use_ollama else 'gpt-4o'

# openai_client = OpenAI() if use_openai else None

# since we don't have paid version of api of gpt-4o 
# so i will be commenting this code if you want to use it 
# you can do that. 

In [20]:
headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
    }

class website:

    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No Title Found"
        self.text = self.extract_texts(soup)
        self.link = self.extract_links(soup)

    def extract_texts(self, soup):
        if soup.body:
            for irrelevant in soup.body(['img', "input", "script", "style"]):
                irrelevant.decompose()

            return soup.body.get_text(separator='\n', strip=True)
        return ""
    
    def extract_links(self, soup):
        links = [link.get('href') for link in soup.find_all('a')]
        return [link for link in links if 'http' or 'https' in link]
    
    def get_contents(self):
        return f"website title: \n{self.title}\n and website content: \n{self.text}\n\n"

In [21]:
a_j = website("https://www.mbmc.edu.np/")
# print(a_j.title)
# print(a_j.text)
# print(a_j.links)
print(a_j.get_contents())

website title: 
Madan Bhandari Memorial College
 and website content: 
info@mbmc.edu.np
+977-01-5172175, 5172715 (BScCSIT)
HEMIS Login
Policy, Procedure and Guidelines
Contact Us
Payment
FAQs
About Us
About Us
Madan Bhandari Memorial College, a non-profit making community institution, was established in 2001 to impart quality education at an affordable cost. The college offers a wide range of academic courses in BA, BBS, BBM, BCA, BScCSIT, and Master’s Degree courses in Sociology, Journalism, and English. Since its inception, the college has achieved remarkable success in terms of quality education and infrastructural development.
Read More
Management Committee
Institutional Overview
Organizational Structure
College Statute
Facilities and Services
Staff
Institutional Expertise
Publication
Research Publication
Research Article by Faculty Members
Conference Paper by Faculty Members
Reports
Annual Report
Audit Report
Shweta Shardul
Tracer Study Report
Survey Report
EMIS Report
Other Publi

In [None]:
class generate_LLM_link:

    def __init__(self, model=model):
        self.model = model

    def get_relevant_link(self, website):
        link_system_prompt = """
            you will be given a list of links from a company website. 
            select only relevant links for a brochure like (About, Company, Carrers, Products)
            Exclude (login, emails, terms, and privacy) which are not needed for brochure. 

            Instructions to follow: 
            - return on valid json
            - do not include explanations, comments or markdown
            - example of output: 
            {
                "links": [
                    {"type": "about", "url": "https://company.com/about"},
                    {"type": "contact", "url": "https://company.com/contact"},
                    {"type": "product", "url": "https://company.com/products"}
                ]
            }
        """
        links = { link for link in website.link}
        link_user_prompt = f"""
            Here is a list of the links you can find on the website of {website.url}
            please identify the relevant web links needed for company brochure. Respond in json.
            Do not include login, emails, terms and privacy.
            {links}. 
        """
        link_user_prompt += """
            return in json format, like this:
            {
                "links": [
                    {"type": "about", "url": "https://company.com/about"},
                    {"type": "contact", "url": "https://company.com/contact"},
                    {"type": "product", "url": "https://company.com/products"}
                ]
            }
        """
        if use_ollama:
            response = ollama.chat(
                model = model,
                messages = [
                    {"role": "system", "content": link_system_prompt},
                    {'role': "user", "content": link_user_prompt}
                ]
            )     

            return response['message']['content'].strip()



links = generate_LLM_link()   
ed = website("https://www.sharesansar.com/")
print(links.get_relevant_link(ed))

ValueError: Invalid format specifier ' "about", "url": "https://company.com/about"' for object of type 'str'

In [34]:
print(ed.link)
print(type(ed.link))

['https://www.facebook.com/ShareSansar.com.np/', 'https://twitter.com/share_sansar', 'https://www.sharesansar.com/login', 'https://www.sharesansar.com/register', '/contactus', '/faq', '/write-for-us', 'https://www.sharesansar.com', '/', '#', '/news-page', '/announcement', '#', '/inflation', '/bullion', '/gdp-market-capitalization', '/gdp-nepse-index', '/weekly-deposit-lending', '/government-revenue-expenditure', '/capital-expenditure', '/remittance', '/bfis-deposit-lending', '/short-term-interest-rates', '/budget', '#', '/market', '/live-trading', '#', '/stock-heat-map/turnover', '/stock-heat-map/volume', '/today-share-price', '/floorsheet', '/agm-list', '/proposed-dividend', '/sectorwise-share-price', '/indices-sub-indices', '/datewise-indices', '/top-brokers', '/nepse-candlestick-chart', '/index-history-data', '#', '/ipo-result', '/category/ipo-fpo-news', '/category/share-allotment', '#', '/company-list', '/mutual-fund-navs', '/merged-companies', '/merger-acquisition', '/suspended-co