# Building a RAG System for MBA Program Information with Gemini API
## Retrieval-Augmented Generation for MBA College Websites

### Project Overview

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using Google's Gemini API. The system extracts information from MBA college websites, indexes the content, and answers questions about the MBA program based on the retrieved information.

### Objectives
- Extract content from MBA college website URLs
- Create a vector database using ChromaDB for efficient retrieval
- Implement a RAG pipeline with Gemini API for question answering
- Answer typical MBA program inquiries with accurate information

### Required Libraries

In [3]:
# Install required packages
!pip install langchain langchain_community langchain_chroma google-generativeai chromadb unstructured beautifulsoup4 python-dotenv

# Import libraries
import os
from langchain_community.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from dotenv import load_dotenv
import google.generativeai as genai




In [5]:
pip install langchain langchain_community langchain-google-genai python-dotenv streamlit langchain_experimental sentence-transformers langchain_chroma langchainhub pypdf rapidocr-onnxruntime




### Configuration Setup


In [58]:
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os

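# Load the API key from the environment (for example, a local .env file) rather than hard-coding it in the notebook
load_dotenv()
if not os.environ.get("GOOGLE_API_KEY"):
    raise RuntimeError("Set GOOGLE_API_KEY in the environment or in a .env file")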

# Initialize embeddings
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Generate vector representation for the query
vector = embeddings.embed_query("hello, world!")

# Print first 5 elements of the vector
print(vector[:5])


[0.05168594419956207, -0.030764883384108543, -0.03062233328819275, -0.02802734263241291, 0.01813093200325966]


## Data Collection

### 1. URL Loading




In [12]:
# Define URLs to load - replace with your MBA college URLs
urls = [
    'https://www.greatlakes.edu.in/chennai/',  # Home page / program overview
    'https://www.greatlakes.edu.in/chennai/recruiters/placement-report-pgpm',  # PGPM placement report
    'https://www.greatlakes.edu.in/chennai/pgpm/curriculum',  # Curriculum
    'https://www.greatlakes.edu.in/chennai/pgpm',  # PGPM Program Details
    'https://www.greatlakes.edu.in/chennai/',  # General Information (same URL as the first entry, so this page is loaded twice)
    'https://www.greatlakes.edu.in/chennai/recruiters/past-recruiters',
    'https://www.greatlakes.edu.in/chennai/faculty-category/full-time-faculty',
    'https://www.greatlakes.edu.in/chennai/alumni/',
    'https://www.greatlakes.edu.in/chennai/accreditations',
    'https://www.greatlakes.edu.in/chennai/rankings',
    'https://www.greatlakes.edu.in/chennai/pgdm'

]
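
The URL list above contains the Chennai home page twice, so that page is fetched and indexed twice. A minimal sketch of one way to drop repeated entries while keeping the original order is shown here; `dedupe_preserving_order` and `sample_urls` are illustrative names that are not part of the notebook, and applying this step would change the document and chunk counts reported in the later outputs.

```python
def dedupe_preserving_order(items):
    """Return the items with later duplicates removed, keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Shortened sample list for illustration (not the full list used in the notebook)
sample_urls = [
    "https://www.greatlakes.edu.in/chennai/",
    "https://www.greatlakes.edu.in/chennai/pgpm",
    "https://www.greatlakes.edu.in/chennai/",
]

unique_urls = dedupe_preserving_order(sample_urls)
print(unique_urls)
```

In the notebook, the same call could be applied to `urls` before it is passed to the loader.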


In [13]:
# Load content from URLs
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()

print(f"Loaded {len(data)} documents")
print(f"Example content from the fourth document (index 3): {data[3].page_content[:200]}...")

Loaded 11 documents
Example content from the fourth document (index 3): PGPM

A Truly Transformational 1 year on-campus MBA Program for Professionals with 2+ Years of Experience

Application Deadline: 5th March, 2025

Home

Full Time Programs

PGPM

PGPM

ADMISSIONS

Admi...


## Data Processing

### 1. Text Chunking

In [14]:
# Split data into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(data)

print(f"Total number of documents after splitting: {len(docs)}")
print(f"Example chunk: {docs[0].page_content[:200]}...")

Total number of documents after splitting: 79
Example chunk: Admissions Open for PGPM & PGDM

Applications open for PGPM and PGPM Family Business & Entrepreneurship 2025-26 & PGDM 2025-27. Learn More.

PGPM | PGDM | PGPM-FBE

PGPM Family Business & Entrepreneur...
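
To make the effect of a chunk size and an overlap concrete, here is a simplified sketch of fixed-width chunking with a sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter works recursively on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical helper written only for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij"
chunks = split_with_overlap(sample, chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk repeats the last character of the previous one, which is the role overlap plays: text near a chunk boundary stays retrievable from either side.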


In [23]:
docs

[Document(metadata={'source': 'https://www.greatlakes.edu.in/chennai/'}, page_content="Admissions Open for PGPM & PGDM\n\nApplications open for PGPM and PGPM Family Business & Entrepreneurship 2025-26 & PGDM 2025-27. Learn More.\n\nPGPM | PGDM | PGPM-FBE\n\nPGPM Family Business & Entrepreneurship\n\nAdmission open to India's first 1 Year Full Time MBA Program for Family Business Scions & Entrepreneurs\n\nRead more\n\nPGXPM Executive MBA Program\n\nKnow more about our Executive MBA Program for Senior Professionals. Admissions open for PGXPM 2025 intake\n\nRead more\n\nCAMPUS PLACEMENTS\n\nCorporate & Career Services is delighted to announce the Campus Recruitment Program for the class of 2025.\n\nRead more\n\nThought Leaders at Great Lakes\n\nAn interactive series with distinguished speakers that helps discover and steer your ambition\n\nRead more\n\nKARMA YOGA\n\nPositively impacting the lives of over 11,000 households in 27 villages.\n\nRead more\n\nPGPM\n\nOne Year MBA program for pr

## Vector Database Creation

### 1. Create ChromaDB Vector Store

In [22]:
vectorstore = Chroma.from_documents(documents=docs, embedding=GoogleGenerativeAIEmbeddings(model="models/embedding-001"))

In [24]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})

retrieved_docs = retriever.invoke("What is the program overview?")


In [26]:
len(retrieved_docs)

10

In [27]:
print(retrieved_docs[0].page_content)

Product Labs

Data Visualization and Insight Generation*

Business Consulting Lab

Technology Business Consulting Lab

IData Visualization and Insight Generation*

Predictive Analytics Labs - Supervised and Unsupervised ML+ ensemble techniques with SQL, cloud computing

Generative AI/Deep Learning

Dual specialisation: Developing Functional and Industry Expertise

The PGPM program offers specialisations allowing students to gain deep expertise in two disciplines. In addition to Finance, Marketing, and Operations, you can pursue dual majors in Analytics, Data Science, Consulting and Strategy, gaining expertise tailored to today’s complex business environment. These specialisations, designed in consultation with industry leaders, emphasize problem-solving and in-depth industry competencies.

Industry Specialisations

Strategic Product Development and Execution Part 1

Technology Product Sales and Marketing

Strategic Product Development and Execution Part 2


In [59]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Gemini 1.5 Pro is used here; substitute any Gemini chat model available to your API key
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro-latest",  # Alias that tracks the latest Gemini 1.5 Pro release
    temperature=0.3,
    max_tokens=500
)

# Example usage
response = llm.invoke("What are the latest LLM trends?")
print(response)


content="The field of Large Language Models (LLMs) is rapidly evolving. Here are some of the latest trends:\n\n**1. Focus on Efficiency:**\n\n* **Smaller, more efficient models:**  The trend is moving away from solely pursuing larger models to developing smaller, more specialized models that require less computational power and are cheaper to train and deploy. Techniques like quantization, pruning, and knowledge distillation are key here.\n* **Optimized inference:**  Research is focused on optimizing the inference process to make LLMs faster and more responsive, crucial for real-time applications.\n* **Parameter-efficient fine-tuning (PEFT):** Methods like LoRA, prompt tuning, and adapter modules allow adapting pre-trained models to specific tasks with minimal changes to the original model weights, saving computational resources.\n\n**2. Enhanced Reasoning and Tool Use:**\n\n* **Chain-of-thought prompting:**  Guiding LLMs through a step-by-step reasoning process improves performance on

### 2. Create RAG Chain

In [63]:
# Define the system prompt
system_prompt = (
    "You are an assistant for answering questions about MBA programs. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

# Create chat prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
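
The `{context}` placeholder in the system prompt is where the retrieved chunks end up. A rough sketch of that "stuffing" step in plain Python follows, assuming the chunks are simply joined with blank lines before substitution; `build_messages` and `SYSTEM_TEMPLATE` are hypothetical names, and the real chain builds LangChain message objects rather than tuples.

```python
SYSTEM_TEMPLATE = (
    "You are an assistant for answering questions about MBA programs. "
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def build_messages(chunks, question, separator="\n\n"):
    """Join the retrieved chunk texts and place them into the system prompt."""
    context = separator.join(chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


# Made-up chunk texts for illustration
messages = build_messages(
    ["Chunk one text.", "Chunk two text."],
    "What is the duration of the program?",
)
for role, text in messages:
    print(f"[{role}] {text}")
```

Because every retrieved chunk is placed in one prompt, the chunk size and the retriever's `k` together determine how much text the model receives per question.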

In [64]:
# Create document chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)

# Create retrieval chain
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

In [65]:
# Test the RAG system with a sample MBA question
response = rag_chain.invoke({"input": "What is the duration of the full-time MBA program?"})
print("Answer:")
print(response["answer"])

Answer:
The Great Lakes Post Graduate Program in Management (PGPM) is a one-year, full-time MBA program.  There is also a two-year, full-time MBA program, the PGDM.  Both are designed for professionals, the PGPM for those with 2+ years of experience and the PGDM for those with up to 3 years of experience.
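
Besides `"answer"`, the dictionary returned by a chain built with `create_retrieval_chain` also carries the retrieved documents under the `"context"` key, which makes it possible to show where an answer came from. The sketch below formats an answer together with its distinct source URLs. `FakeDocument` is a stand-in for LangChain's `Document` class and the sample response is invented, so treat this as an illustration rather than output from the notebook.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document; only the attributes used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_answer_with_sources(response):
    """Return the answer text followed by the distinct source URLs of its context."""
    sources = list(dict.fromkeys(
        doc.metadata.get("source", "unknown")
        for doc in response.get("context", [])
    ))
    lines = [response["answer"], "", "Sources:"]
    lines.extend(f"- {src}" for src in sources)
    return "\n".join(lines)


# Invented response in the same shape as the chain output
sample_response = {
    "input": "Sample question?",
    "answer": "Sample answer text.",
    "context": [
        FakeDocument("chunk a", {"source": "https://www.greatlakes.edu.in/chennai/pgpm"}),
        FakeDocument("chunk b", {"source": "https://www.greatlakes.edu.in/chennai/pgpm"}),
        FakeDocument("chunk c", {"source": "https://www.greatlakes.edu.in/chennai/pgdm"}),
    ],
}

formatted = format_answer_with_sources(sample_response)
print(formatted)
```

With the real chain, the same function could be called on the `response` dictionary from the cell above.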


## Advanced Features

### 1. Custom Retrieval Parameters

In [66]:
def ask_with_custom_retrieval(question, k=4, fetch_k=20):
    """
    Ask a question with custom retrieval parameters

    Args:
        question (str): User question
        k (int): Number of documents returned to the chain after MMR re-ranking
        fetch_k (int): Number of candidate documents fetched before MMR selects k of them

    Returns:
        str: Generated answer
    """
    # Create custom retriever
    custom_retriever = vectorstore.as_retriever(
        search_type="mmr",  # Maximum Marginal Relevance for diversity
        search_kwargs={"k": k, "fetch_k": fetch_k}
    )

    # Create custom RAG chain
    custom_rag_chain = create_retrieval_chain(custom_retriever, question_answer_chain)

    # Query the system
    response = custom_rag_chain.invoke({"input": question})
    return response["answer"]

In [67]:
# Example
print("Custom retrieval example:")
print(ask_with_custom_retrieval("What is the MBA program's ranking nationally and internationally?", k=3, fetch_k=10))

Custom retrieval example:
This program is consistently ranked among the top 10 business schools in India, specifically 3rd among top standalone institutions.  The ranking mentions Analytics B-Schools as 5th.  There is no information about international rankings provided.
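
Maximum Marginal Relevance (MMR) trades relevance against redundancy: from the `fetch_k` candidates it repeatedly picks the one that is similar to the query but dissimilar to what has already been picked, until `k` are chosen. The following is a small pure-Python sketch of that selection loop on toy vectors. It assumes cosine similarity and a `lambda_mult` weight of 0.5; `mmr_select` is an illustrative function, not the vector store's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k indices, balancing query relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, float("-inf")
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already selected (0.0 if nothing is)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy candidates: the first two are near-duplicates, the third is different
query = [1.0, 1.0]
candidate_vecs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]

diverse = mmr_select(query, candidate_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidate_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking, so the two near-duplicates are returned; at 0.5 the second pick switches to the less similar but more distinct candidate.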


## Conclusion

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system focused on MBA program information using Google's Gemini API. Key components include:

1. **Data Collection**: Loading content from MBA college websites
2. **Text Processing**: Splitting documents into manageable chunks
3. **Vector Database**: Creating embeddings and storing them in ChromaDB
4. **RAG Pipeline**: Implementing a retrieval and generation chain with Gemini
5. **MBA-Specific Questions**: Answering common questions about MBA programs, admissions, curriculum, etc.
