<a href="https://colab.research.google.com/github/usshaa/Cheatsheets/blob/main/Activity8_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🤖 Task: Build a Chatbot for College Placement Insights using Gemini + RAG



### 🎯 Objective:

Develop a chatbot that uses Retrieval-Augmented Generation (RAG) with the Gemini LLM to answer questions about student placement data, such as:

* 📊 “What’s the average MBA % of placed students?”
* 📉 “Which degree stream has the lowest placement rate?”
* 📈 “What factors affect placement the most?”
* ❓ “Suggest improvements for students who are not placed.”

## 🏗️ Chatbot Architecture Outline:

### ✅ Step 1: Prepare the Dataset for RAG

In [1]:
import pandas as pd
import os

# Load cleaned placement data
df = pd.read_csv('/content/Placement_Data_Full_Class.csv')

# Create the rag_chunks directory if it doesn't exist
os.makedirs("rag_chunks", exist_ok=True)

# Save each row as its own text chunk; Step 2 loads these files
for i, row in df.iterrows():
    with open(f"rag_chunks/chunk_{i}.txt", "w", encoding="utf-8") as f:
        f.write(row.to_string())
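
Row-level chunks let the retriever find individual students, but questions such as the average MBA % of placed students span the whole table, and only a few rows reach the model per query. One option is to precompute a few aggregates and save them as an extra chunk. The sketch below is a minimal illustration, not part of the original notebook: the helper name `summarize_placements` is hypothetical, and the column names `status`, `mba_p`, and `degree_t` (with `status` values `Placed` / `Not Placed`) are assumptions about this CSV that should be checked against `df.columns`.

```python
from collections import defaultdict
from statistics import mean


def summarize_placements(rows):
    """Build a plain-text summary from a list of row dicts.

    Assumed columns (verify against df.columns): 'status' with values
    'Placed' / 'Not Placed', 'mba_p' (MBA percentage), 'degree_t' (degree stream).
    """
    placed = [r for r in rows if r["status"] == "Placed"]
    lines = [f"Total students: {len(rows)}", f"Placed students: {len(placed)}"]
    if placed:
        avg_mba = mean(r["mba_p"] for r in placed)
        lines.append(f"Average MBA % of placed students: {avg_mba:.2f}")

    # Placement rate per degree stream
    by_stream = defaultdict(list)
    for r in rows:
        by_stream[r["degree_t"]].append(r["status"] == "Placed")
    for stream, flags in sorted(by_stream.items()):
        rate = 100 * sum(flags) / len(flags)
        lines.append(f"Placement rate for {stream}: {rate:.1f}%")
    return "\n".join(lines)


# Toy rows for illustration only; in the notebook, pass df.to_dict("records").
sample_rows = [
    {"status": "Placed", "mba_p": 60.0, "degree_t": "Sci&Tech"},
    {"status": "Placed", "mba_p": 70.0, "degree_t": "Comm&Mgmt"},
    {"status": "Not Placed", "mba_p": 55.0, "degree_t": "Comm&Mgmt"},
]
summary_text = summarize_placements(sample_rows)
print(summary_text)
```

Writing the result of `summarize_placements(df.to_dict("records"))` to a file inside `rag_chunks/` before Step 2 would index it alongside the per-row chunks.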

### ✅ Step 2: Vectorize Dataset for Retrieval (Using FAISS or Similar)

In [2]:
!pip install -U langchain-community
!pip install faiss-cpu

In [3]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
import os  # Used to list the chunk files

# Initialize an empty list to store documents
documents = []

# Iterate through files in the rag_chunks directory
for filename in os.listdir('rag_chunks/'):
    if filename.endswith('.txt'): # Process only text files
        file_path = os.path.join('rag_chunks/', filename)
        # Use TextLoader for each file, reading it with the encoding it was written in
        loader = TextLoader(file_path, encoding="utf-8")
        documents.extend(loader.load())

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

embedding = HuggingFaceEmbeddings()
db = FAISS.from_documents(docs, embedding)

# Save DB
db.save_local("faiss_index")
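
Conceptually, the vector store answers a query by embedding it and returning the stored chunks whose vectors lie closest to the query vector. The following is a minimal brute-force sketch of that idea using Euclidean distance on toy 3-dimensional vectors. The vectors, chunk labels, and the helper `top_k` are made up for illustration; FAISS performs this kind of nearest-neighbour search with optimized native code.

```python
import math


def top_k(query_vec, chunk_vecs, k=3):
    """Return the k chunk labels closest to query_vec by Euclidean distance."""
    scored = [(math.dist(query_vec, vec), label) for label, vec in chunk_vecs.items()]
    scored.sort()  # smallest distance first
    return [label for _, label in scored[:k]]


# Hypothetical embeddings; real ones come from the embedding model
# and have hundreds of dimensions.
toy_index = {
    "chunk_0": (0.9, 0.1, 0.0),
    "chunk_1": (0.0, 1.0, 0.0),
    "chunk_2": (0.8, 0.2, 0.1),
    "chunk_3": (0.0, 0.0, 1.0),
}
nearest = top_k((1.0, 0.0, 0.0), toy_index, k=2)
print(nearest)
```

Here the query vector points along the first axis, so the two chunks with the largest first component are returned.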


### ✅ Step 3: RAG-based Query Handling with Gemini

In [41]:
from google import genai
from google.genai import types
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata

# Load the vector store created and saved in Step 2.
# allow_dangerous_deserialization=True unpickles the index, so only load files you created yourself.
embedding = HuggingFaceEmbeddings()
db = FAISS.load_local("faiss_index", embedding, allow_dangerous_deserialization=True)

# Configure the Gemini client.
# ⚠️ Never hard-code an API key in a notebook; read it from Colab's Secrets panel instead.
api_key = userdata.get('GOOGLE_API_KEY')  # Or the key name you used in Colab
client = genai.Client(api_key=api_key)

def ask_gemini_with_rag(user_question):
    # Step 1: Retrieve relevant documents
    relevant_docs = db.similarity_search(user_question, k=3)
    context = "\n\n".join([doc.page_content for doc in relevant_docs])

    # Step 2: Build the prompt using Gemini's Part API
    contents = [
        types.Part.from_text(text=f"""You are a helpful assistant answering questions about college placements.

        === CONTEXT ===
        {context}

        === QUESTION ===
        {user_question}

        Answer based only on the context above.
        """)
    ]

    # Step 3: Generate answer with Gemini 2.0 Flash
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=contents
    )

    return response.text
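
Because the prompt is assembled inside an indented triple-quoted string, the indentation becomes part of the text sent to the model. A possible alternative, sketched below, is to build the prompt in a small helper that can be tested without calling Gemini. The name `build_rag_prompt` and the empty-context message are hypothetical additions, not part of the original notebook.

```python
def build_rag_prompt(chunks, question):
    """Assemble a RAG prompt from retrieved text chunks and the user's question."""
    # Fall back to an explicit marker so the model is not handed an empty context block.
    context = "\n\n".join(chunks) if chunks else "(no matching records found)"
    return (
        "You are a helpful assistant answering questions about college placements.\n\n"
        f"=== CONTEXT ===\n{context}\n\n"
        f"=== QUESTION ===\n{question}\n\n"
        "Answer based only on the context above."
    )


prompt = build_rag_prompt(["status  Placed\nmba_p  58.8"], "Was this student placed?")
print(prompt)
```

Inside `ask_gemini_with_rag`, the returned string could be wrapped with `types.Part.from_text(text=prompt)` in place of the inline f-string.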



### ✅ Step 4: Build a Chatbot Interface (Optional UI)

In [26]:
# Install a compatible version
!pip install gradio==3.50.0


In [None]:
import gradio as gr

# Interface function
def gradio_chat_interface(question):
    try:
        answer = ask_gemini_with_rag(question)
        return answer
    except Exception as e:
        return f"An error occurred: {str(e)}"

# Create Gradio interface
iface = gr.Interface(
    fn=gradio_chat_interface,
    inputs=gr.Textbox(lines=2, placeholder="Enter your question here..."),
    outputs="text",
    title="🎓 College Placement Chatbot",
    description="Ask questions related to college placement records. Powered by Gemini + RAG."
)

# Launch in Colab or locally
iface.launch(share=True, debug=True)  # `share=True` gives public link in Colab


### ✅ Step 5: Sample User Prompts

* What percentage of students in this dataset were placed after their MBA?
* What is the average salary of placed students?
* Is there any noticeable pattern between work experience and placement status?
* Which specialisation (Mkt&HR or Mkt&Fin) seems to have better placement outcomes?
* How does undergraduate degree percentage affect placement status or salary?

📊 Correlation/Comparison
* Compare average MBA scores of placed vs not placed students.
* Do students from the Science stream in 12th grade perform better in placements than Commerce or Arts?
* Is there a trend between 10th percentage and final placement salary?
* Which SSC or HSC board shows better placement results or average salary?
* Among students without work experience, who secured the highest salary?

🔍 Advanced or Analytical
* Does having work experience increase the chances of placement or higher salary?
* Which combination of 12th stream and degree stream leads to higher placement success?
* Is there a minimum MBA percentage required to be placed in this dataset?
* What is the distribution of salaries among placed students?
* List students who have MBA percentage above 60 but were not placed.

💡 Natural Language Style Prompts for Chatbot
* “Tell me which degree streams had the highest placement rate.”
* “Show me students with the highest and lowest salary.”
* “Who were placed despite having low academic performance?”
* “Find students with Commerce in 12th and who were placed.”
* “List students with no work experience and above 70% in 10th and 12th.”
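
To try several of these prompts in one go, a small loop can pass each one to the question-answering function and collect the replies. The sketch below takes that function as a parameter so it can run offline with a stand-in; `run_prompts` and `fake_answer` are hypothetical names, and in the notebook `ask_gemini_with_rag` would be passed instead.

```python
def run_prompts(prompts, answer_fn):
    """Call answer_fn on each prompt and return (prompt, answer) pairs.

    Errors are captured per prompt so one failed request does not stop the batch.
    """
    results = []
    for prompt in prompts:
        try:
            results.append((prompt, answer_fn(prompt)))
        except Exception as exc:
            results.append((prompt, f"An error occurred: {exc}"))
    return results


def fake_answer(question):
    """Stand-in for ask_gemini_with_rag so the sketch runs without an API key."""
    if not question.strip():
        raise ValueError("empty question")
    return f"(answer to: {question})"


sample_prompts = [
    "What is the average salary of placed students?",
    "",  # deliberately empty to show the error path
]
for prompt, answer in run_prompts(sample_prompts, fake_answer):
    print(f"Q: {prompt}\nA: {answer}\n")
```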





## 🧠 Learning Goals:

* Understand RAG (combining retrieval with generation)
* Use the Gemini LLM for natural-language conversation
* Use a vector store (FAISS) for similarity search
* Enable real-time question answering over structured data