# Web Scraping and Brochure Generator

In [1]:
import os
import json
from dotenv import load_dotenv
from IPython.display import Markdown, display, update_display
from scraper import fetch_website_links, fetch_website_contents
from openai import OpenAI
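
The `scraper` module imported above is not shown in this notebook. As a rough idea of what the link-gathering half might involve, here is a minimal sketch using only the standard library. `LinkCollector` and `extract_links` are hypothetical names, and the real `fetch_website_links` may work quite differently (for example, by downloading the page first and using a third-party HTML parser).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Hypothetical stand-in for the parsing step of fetch_website_links."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


sample = '<a href="/about">About</a> <a>no href</a> <a href="https://example.com/jobs">Jobs</a>'
print(extract_links(sample))  # ['/about', 'https://example.com/jobs']
```

Note that this only parses HTML that has already been downloaded; fetching the page is left out so the sketch runs offline.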

In [2]:
# Load environment variables from a file called .env
GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"

load_dotenv(override=True)
api_key = os.getenv('GEMINI_API_KEY')

# Check the key

if not api_key:
    print("No API key was found - please be sure to add your key to the .env file, and save the file! Or you can skip the next 2 cells if you don't want to use Gemini")
elif not api_key.startswith("AIz"):
    print("An API key was found, but it doesn't start with AIz")
else:
    print("API key found and looks good so far!")


API key found and looks good so far!


In [3]:
# To give you a preview: calling a model through the OpenAI client with these messages is this easy. If you hit any problems, head over to the Troubleshooting notebook.

message = "Hello, GPT! This is my first ever message to you! Hi!"

messages = [{"role": "user", "content": message}]
messages


[{'role': 'user',
  'content': 'Hello, GPT! This is my first ever message to you! Hi!'}]

In [4]:
gemini = OpenAI(base_url=GEMINI_BASE_URL, api_key=api_key)
response = gemini.chat.completions.create(model="gemini-2.5-flash-lite", messages=messages)
response.choices[0].message.content

"Hello! It's great to hear from you, and welcome! I'm happy to be your first ever message recipient.  How can I help you today? What's on your mind?"

In [2]:
links = fetch_website_links("https://edwarddonner.com")

In [6]:
ed = fetch_website_contents("https://edwarddonner.com")
print(ed)

Home - Edward Donner

Home
AI Curriculum
Connect Four
Outsmart
An arena that pits LLMs against each other in a battle of diplomacy and deviousness
About
Posts
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy amateur electronic music production (
very
amateur) and losing myself in
Hacker News
, nodding my head sagely to things I only half understand.
I’m the co-founder and CTO of
Nebula.io
. We’re applying AI to a field where it can make a massive, positive impact: helping people discover their potential and pursue their reason for being. I’m previously the founder and CEO of AI startup untapt,
acquired in 2021
.
If left unchecked, I will happily drone on about LLMs to anyone that will listen. My friends got fed up with my impromptu lectures, and convinced me to make some Udemy courses. To my total joy (and shock) they’ve become best-selling, top-rated courses, with 400,000 enrolled across 190 countries. 

## Types of prompts

You may know this already - but if not, you will get very familiar with it!

Models like GPT have been trained to receive instructions in a particular way.

They expect to receive:

**A system prompt** that tells them what task they are performing and what tone they should use

**A user prompt** -- the conversation starter that they should reply to

In [7]:

system_prompt = """
You are a best assistant that analyzes the contents of a website,
and provides a short, accurate, perfect summary, ignoring text that might be navigation related.
Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.
"""

user_prompt_prefix = """
Here are the contents of a website.
Provide a short summary of this website.
If it includes news or announcements, then summarize these too.

"""

In [8]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_prefix + website}
    ]

In [9]:
messages_for(ed)

[{'role': 'system',
  'content': '\nYou are a best assistant that analyzes the contents of a website,\nand provides a short, accurate, perfect summary, ignoring text that might be navigation related.\nRespond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n'},
 {'role': 'user',

In [10]:
# And now: call Gemini through its OpenAI-compatible API. You will get very familiar with this!

def summarize(url):
    gemini = OpenAI(base_url=GEMINI_BASE_URL, api_key=api_key)

    website = fetch_website_contents(url)
    response = gemini.chat.completions.create(
        model="gemini-2.5-flash-lite",
        messages=messages_for(website)
    )
    summary = response.choices[0].message.content
    display(Markdown(summary))

In [11]:
summarize("https://edwarddonner.com")

This website belongs to Edward Donner, a co-founder and CTO of Nebula.io, who focuses on AI and LLMs. He previously founded and sold an AI startup called untapt. The site features his work, including an "AI Curriculum" which has become a best-selling Udemy course with over 400,000 enrollees. It also showcases projects like "Connect Four" and "Outsmart," an arena for LLM diplomacy and deviousness. The site includes blog posts on AI topics, with recent entries discussing AI builders, AI live events, and MLOps. Edward also offers a newsletter for updates.

In [12]:
summarize("https://anthropic.com")

Anthropic is a public benefit corporation focused on AI research and products, emphasizing safety and humanity's long-term well-being. They have introduced Claude Opus 4.5, their latest model designed for coding, agents, computer use, and enterprise workflows, with advanced tool use capabilities on their Developer Platform.

In [13]:
link_system_prompt = """
You are provided with a list of links found on a webpage.
You are able to decide which of the links would be most relevant to include in a brochure about the company,
such as links to an About page, or a Company page, or Careers/Jobs pages.
You should respond in JSON as in this example:

{
    "links": [
        {"type": "about page", "url": "https://full.url/goes/here/about"},
        {"type": "careers page", "url": "https://another.full.url/careers"}
    ]
}
"""

In [14]:
def get_links_user_prompt(url):
    user_prompt = f"""
Here is the list of links on the website {url} -
Please decide which of these are relevant web links for a brochure about the company, 
respond with the full https URL in JSON format.
Do not include Terms of Service, Privacy, email links.

Links (some might be relative links):

"""
    links = fetch_website_links(url)
    user_prompt += "\n".join(links)
    return user_prompt

In [15]:
print(get_links_user_prompt("https://edwarddonner.com"))


Here is the list of links on the website https://edwarddonner.com -
Please decide which of these are relevant web links for a brochure about the company, 
respond with the full https URL in JSON format.
Do not include Terms of Service, Privacy, email links.

Links (some might be relative links):

https://edwarddonner.com/
https://edwarddonner.com/curriculum/
https://edwarddonner.com/connect-four/
https://edwarddonner.com/outsmart/
https://edwarddonner.com/about-me-and-about-nebula/
https://edwarddonner.com/posts/
https://edwarddonner.com/
https://news.ycombinator.com
https://nebula.io/?utm_source=ed&utm_medium=referral
https://www.prnewswire.com/news-releases/wynden-stark-group-acquires-nyc-venture-backed-tech-startup-untapt-301269512.html
https://edwarddonner.com/curriculum/
https://edwarddonner.com/2026/01/04/ai-builder-with-n8n-create-agents-and-voice-agents/
https://edwarddonner.com/2026/01/04/ai-builder-with-n8n-create-agents-and-voice-agents/
https://edwarddonner.com/2025/11/11/
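
The list above contains repeated URLs, and the prompt warns that some links might be relative. One option is to tidy the list before sending it to the model. The following is a minimal sketch; `normalize_links` is a hypothetical helper, not part of the `scraper` module.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, links):
    """Resolve relative links against base_url, drop non-web schemes,
    and remove duplicates while keeping the original order."""
    seen = set()
    cleaned = []
    for link in links:
        absolute = urljoin(base_url, link.strip())
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, tel:, javascript: and similar
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


raw_links = [
    "/posts/",
    "https://edwarddonner.com/posts/",
    "mailto:someone@example.com",
    "https://news.ycombinator.com",
]
print(normalize_links("https://edwarddonner.com", raw_links))
```

Sending fewer, cleaner links keeps the prompt shorter and means the model does not have to resolve relative paths itself.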

In [21]:
def select_relevant_links(url):
    print(f"Selecting relevant links for {url} by calling gemini-2.5-flash-lite")
    response = gemini.chat.completions.create(
        model="gemini-2.5-flash-lite",
        messages=[
            {"role": "system", "content": link_system_prompt},
            {"role": "user", "content": get_links_user_prompt(url)}
        ],
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    links = json.loads(result)
    print(f"Found {len(links['links'])} relevant links")
    return links

In [22]:
select_relevant_links("https://edwarddonner.com")

Selecting relevant links for https://edwarddonner.com by calling gemini-2.5-flash-lite
Found 3 relevant links


{'links': [{'type': 'about page',
   'url': 'https://edwarddonner.com/about-me-and-about-nebula/'},
  {'type': 'company page', 'url': 'https://edwarddonner.com/'},
  {'type': 'careers page',
   'url': 'https://nebula.io/?utm_source=ed&utm_medium=referral'}]}
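
`select_relevant_links` trusts the model to return well-formed JSON with a `links` list of full URLs. Even with `response_format` set, it can be worth guarding against an unexpected reply. Here is a minimal sketch of such a check; `parse_links_reply` is a hypothetical helper, and returning an empty list on failure is just one possible policy.

```python
import json


def parse_links_reply(raw):
    """Return a list of link dicts with absolute http(s) URLs,
    or an empty list if the reply is not usable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(data, dict):
        return []
    links = data.get("links", [])
    if not isinstance(links, list):
        return []
    return [
        link for link in links
        if isinstance(link, dict)
        and isinstance(link.get("url"), str)
        and link["url"].startswith(("http://", "https://"))
    ]


reply = '{"links": [{"type": "about page", "url": "https://example.com/about"}, {"type": "bad", "url": "/relative"}]}'
print(parse_links_reply(reply))
```

Filtering out relative URLs here also protects the later page fetches, which expect full addresses.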

In [23]:
def fetch_page_and_all_relevant_links(url):
    contents = fetch_website_contents(url)
    relevant_links = select_relevant_links(url)
    result = f"## Landing Page:\n\n{contents}\n## Relevant Links:\n"
    for link in relevant_links['links']:
        result += f"\n\n### Link: {link['type']}\n"
        result += fetch_website_contents(link["url"])
    return result

In [35]:
brochure_system_prompt = """
You are an assistant that analyzes the contents of several relevant pages from a college website
and creates a short brochure about the college for prospective students, parents and recruits.
Respond in markdown without code blocks.
Include details of college culture, departments, courses offered, careers/jobs and contact details such as phone number and email ID, if you have the information.
"""

In [36]:
def get_brochure_user_prompt(company_name, url):
    user_prompt = f"""
You are looking at a company called: {company_name}
Here are the contents of its landing page and other relevant pages;
use this information to build a short brochure of the company in markdown without code blocks.\n\n
"""
    user_prompt += fetch_page_and_all_relevant_links(url)
    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters
    return user_prompt
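
Truncating the combined prompt at 5,000 characters means that when many links are selected, the landing page may use up the whole allowance and the later pages are cut off entirely. One alternative is to give each page an equal share. The following is a minimal sketch under that assumption; `truncate_sections` is a hypothetical helper, and the budget applies to page text only, not to the added headings.

```python
def truncate_sections(sections, total_budget=5_000):
    """Give each (title, text) section an equal share of total_budget characters.

    The headings and separators are added on top of the budget."""
    if not sections:
        return ""
    per_section = total_budget // len(sections)
    parts = []
    for title, text in sections:
        parts.append(f"### {title}\n{text[:per_section]}")
    return "\n\n".join(parts)


pages = [("Landing page", "a" * 100), ("About page", "b" * 100)]
print(truncate_sections(pages, total_budget=20))
```

With this approach every selected page contributes something to the brochure, at the cost of less detail from each one.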

In [37]:
def stream_brochure(company_name, url):
    stream = gemini.chat.completions.create(
        model="gemini-2.5-flash-lite",
        messages=[
            {"role": "system", "content": brochure_system_prompt},
            {"role": "user", "content": get_brochure_user_prompt(company_name, url)}
        ],
        stream=True
    )
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        update_display(Markdown(response), display_id=display_handle.display_id)
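
The loop above builds the brochure by appending each streamed delta, treating a missing `content` as an empty string. The same accumulation can be tried offline with stand-in objects. In this minimal sketch, `collect_stream` and `fake_chunk` are hypothetical names, and the check for empty `choices` is a defensive assumption rather than documented behavior of any particular endpoint.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the text deltas from an iterable of chat-completion chunks."""
    response = ""
    for chunk in chunks:
        if not chunk.choices:
            continue  # defensive: skip chunks that carry no choices
        response += chunk.choices[0].delta.content or ""
    return response


def fake_chunk(text):
    """Stand-in shaped like a streamed chunk, for offline testing only."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


chunks = [fake_chunk("# Bro"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("chure")]
print(collect_stream(chunks))  # # Brochure
```

Returning the accumulated text, rather than only displaying it, also makes it easy to save the brochure afterwards.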

In [38]:
stream_brochure("srmeaswari eswari engineering college", "https://srmeaswari.ac.in/")

Selecting relevant links for https://srmeaswari.ac.in/ by calling gemini-2.5-flash-lite
Found 26 relevant links


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


# Easwari Engineering College (SRM Group)

## Welcome to Easwari Engineering College!

Easwari Engineering College, a proud part of the SRM Group, is dedicated to fostering innovation, academic excellence, and holistic development. We are committed to providing a world-class education that prepares students for successful careers and impactful contributions to society.

## Our Culture & Values

At Easwari Engineering College, we cultivate an environment that encourages intellectual curiosity, collaboration, and a strong sense of community. Our vision is to be a leading institution in engineering and technology, driven by a mission to impart quality education and promote research. We adhere to a stringent Quality Policy to ensure continuous improvement across all aspects of our operations.

## Academic Excellence & Programs

We offer a comprehensive range of undergraduate, postgraduate, and Ph.D. programs across various engineering disciplines. Our curriculum is designed to be industry-relevant and future-focused, ensuring our graduates are equipped with the skills and knowledge demanded by the global market.

### Undergraduate Programs (B.E. & B.Tech.)

*   B.E. Automobile Engineering
*   B.E. Bio Medical Engineering
*   B.E. Civil Engineering
*   B.E. Computer Science and Engineering
*   B.E. Computer Science and Engineering (Artificial Intelligence and Machine Learning)
*   B.E. Computer Science and Engineering (Cyber Security)
*   B.E. Computer Science and Design
*   B.E. Electrical and Electronics Engineering
*   B.E. Electronics and Communication Engineering
*   B.E. Mechanical Engineering
*   B.E. Robotics and Automation Engineering
*   B.Tech. Artificial Intelligence and Data Science
*   B.Tech. Biotechnology
*   B.Tech. Computer Science and Business System (with TCS partnership)
*   B.Tech. Information Technology

### Postgraduate Programs (M.E., M.B.A., M.C.A.)

*   M.B.A
*   M.C.A
*   M.E. Communication Systems
*   M.E. Computer Science and Engineering
*   M.E. Embedded System Technologies
*   M.E. Engineering Design
*   M.E. Structural Engineering

### Ph.D. Programs

We also offer Ph.D. programs for aspiring researchers.

### Science & Humanities

Our Science & Humanities department offers foundational courses in:
*   Chemistry
*   English
*   Mathematics
*   Physics
*   Tamil

## Career Opportunities & Future Prospects

Our strong industry connections and career-oriented approach ensure that our students are well-prepared for diverse career paths. Graduates from Easwari Engineering College are highly sought after by leading companies in various sectors. We focus on developing skills that are crucial for success in today's dynamic job market, including areas like Artificial Intelligence, Machine Learning, Cyber Security, Data Science, and more.

## Admissions & Further Information

We are approved by AICTE and affiliated with Anna University. Our institution is accredited by NAAC and NBA, and we are recognized in NIRF rankings.

For detailed information on our programs, admission procedures, eligibility criteria, scholarships, and more, please visit our website or contact us directly.

**Why Choose SRM Easwari?**

*   Comprehensive Program Offerings
*   Industry-Relevant Curriculum
*   Dedicated Faculty and Research Opportunities
*   Strong Placement Assistance
*   Vibrant Campus Life

## Contact Us

**Enquiry:**
*   **Call:** [Insert Phone Number Here]
*   **Email:** [Insert Email ID Here]

**Apply Now!**

We look forward to welcoming you to the Easwari Engineering College family.

## Using Ollama on localhost

In [42]:
!ollama pull llama3.2

'ollama' is not recognized as an internal or external command,
operable program or batch file.


In [43]:
OLLAMA_BASE_URL = "http://localhost:11434/v1"

ollama = OpenAI(base_url=OLLAMA_BASE_URL, api_key='ollama')

In [44]:
response = ollama.chat.completions.create(model="llama3.2", messages=[{"role": "user", "content": "Tell me a fun fact"}])

response.choices[0].message.content

"Did you know that honey never spoils? Archaeologists have found pots of honey in ancient Egyptian tombs that are over 3,000 years old and still perfectly edible! Honey's longevity is due to its unique composition: it's extremely acidic and has antibacterial properties, making it a challenging environment for bacteria and microorganisms to grow. Is that sweet fact sticky in your mind?"

In [45]:
# deepseek-r1:1.5b - this is DeepSeek "distilled" into Qwen from Alibaba Cloud

!ollama pull deepseek-r1:1.5b

'ollama' is not recognized as an internal or external command,
operable program or batch file.


In [47]:
response = ollama.chat.completions.create(model="deepseek-r1:1.5b", messages=[{"role": "user", "content": "Tell me a fun fact"}])

response.choices[0].message.content

'Sure, here\'s a fun fact for you: **"Water is the only liquid that boils at 100° Celsius."** The boiling point of water varies depending on atmospheric pressure; it reaches 100° when in pure oxygen and hydrogen atmosphere. This means that without oxygen, most substances boil at lower temperatures.'