# LearnMate: AI-Powered RAG Chatbot for Students



**Project Description:**
LearnMate is a Retrieval-Augmented Generation (RAG) chatbot designed to act as an interactive study assistant. Instead of relying purely on an LLM's pre-trained knowledge, LearnMate searches uploaded course materials (like PDF textbooks) for the passages most relevant to a student's query and answers from those passages. This reduces hallucinations and keeps answers grounded in the course material.

This project demonstrates the end-to-end implementation of a RAG pipeline:
1. **Document Loading & Processing**
2. **Text Chunking**
3. **Vector Embeddings & Storage (FAISS)**
4. **Semantic Retrieval**
5. **Answer Generation (using Llama 3 via Groq API)**

Let's get started by setting up the environment dependencies.

In [None]:
# Step 0: Install the tools we need

%pip -q install langchain langchain-community langchain-text-splitters langchain-huggingface pypdf sentence-transformers faiss-cpu ipywidgets requests==2.32.4 groq

In [None]:
!unzip resources.zip

In [None]:
# Silence noisy warnings from transformers and pypdf so the notebook output stays readable
import logging
from transformers.utils import logging as t_logging

t_logging.set_verbosity_error()
logging.getLogger("pypdf").setLevel(logging.ERROR)

### Phase 1: Data Ingestion

The first step in building the RAG pipeline is to ingest the knowledge base. For this implementation, I am using a sample textbook on HTML, provided as a PDF file. I'll use LangChain's `PyPDFLoader` to parse the file and extract its text page by page.

In [None]:
# PyPDFLoader is a tool from LangChain that reads PDF files page by page

from langchain_community.document_loaders import PyPDFLoader

file_path = "HTML - Book.pdf"
loader = PyPDFLoader(file_path)

In [None]:
# Read the PDF — this loads all pages into memory
document = loader.load()

print("Number of pages in the document:", len(document))

In [None]:
# Let's peek at the first page to see what the loader extracted
# You'll see the text content + some metadata (info about the file)

print("First page of the document:", document[0])

### Phase 2: Document Chunking

Feeding entire documents into an LLM is inefficient and can exceed the model's context window. To optimize retrieval, I will split the extracted text into smaller, overlapping chunks. I chose a chunk size of 500 characters with an overlap of 20 characters to preserve contextual continuity between adjacent segments.

In [None]:
# Using a "text splitter" to cut the document into chunks
# chunk_size=500 means each piece will be ~500 characters long (about 1 paragraph)
# chunk_overlap=20 means pieces slightly overlap so we don't lose info at the edges

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
texts = text_splitter.split_documents(document)
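To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window chunker in plain Python. It is a simplification I wrote for illustration: `chunk_text` is a hypothetical helper, and `RecursiveCharacterTextSplitter` behaves differently because it prefers to split on separators such as paragraph and line breaks rather than at exact character offsets.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=20):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Synthetic 1200-character string, used only to show the window arithmetic
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print("Chunk lengths:", [len(c) for c in chunks])
print("Shared edge:", chunks[0][-20:] == chunks[1][:20])
```

The last 20 characters of one chunk are repeated at the start of the next, which is what keeps a sentence that straddles a boundary visible in both chunks.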

In [None]:
# The count depends on how much text each page holds; with ~500-character chunks, expect several chunks per page

print("Total number of chunks:", len(texts))

In [None]:
# Each chunk is a small piece of text from the PDF

print("Example chunk:")
print(texts[5].page_content)

### Phase 3: Text Embeddings

To enable semantic search—allowing the system to search by meaning rather than exact keyword matches—I need to convert the text chunks into dense mathematical vectors (embeddings). I'm utilizing the pre-trained `all-MiniLM-L6-v2` model from HuggingFace, which is lightweight yet highly effective for sentence-level semantic representation.

In [None]:
# We'll use a free, pre-trained model to create our embeddings
# This model was trained on millions of sentences so it understands meaning well

from langchain_huggingface import HuggingFaceEmbeddings

In [None]:
# Load the embedding model — this might take a minute the first time
# "all-MiniLM-L6-v2" is a small but powerful model for understanding text meaning

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

print("Embeddings model ready!")
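To illustrate what "searching by meaning" amounts to numerically, the sketch below compares vectors with cosine similarity. The three-dimensional vectors are toy values I invented for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and `cosine_similarity` is a hypothetical helper, not part of the pipeline above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not real model output)
toy_embeddings = {
    "center text": [0.9, 0.1, 0.0],
    "align text in the middle": [0.8, 0.2, 0.1],
    "insert an image": [0.1, 0.1, 0.9],
}

query = toy_embeddings["center text"]
scores = {label: cosine_similarity(query, vec) for label, vec in toy_embeddings.items()}
for label, score in scores.items():
    print(f"{score:.3f}  {label}")
```

Phrases with similar meaning point in similar directions and score close to 1, while unrelated phrases score much lower, even when they share no keywords.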

### Phase 4: Vector Database Initialization

With the embedding model loaded, I need an efficient way to store and query the vectors it produces. I've integrated FAISS (Facebook AI Similarity Search), an open-source similarity-search library. `FAISS.from_documents` embeds every chunk with the model, builds an index over the resulting vectors, and allows rapid similarity comparisons at scale.

In [None]:
# FAISS is a fast vector database created by Facebook/Meta
# It takes our text chunks + embedding model, converts everything to numbers,
# and organizes them for fast searching

from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(texts, embedding_model)

print("Vector store ready! Total items stored:", vector_store.index.ntotal)

### Phase 5: Semantic Retrieval Configuration

With the FAISS vector store ready, I can create the retriever. I've configured it to run a similarity search and return the 3 most relevant text chunks (`k=3`) for any given user query. Below is a quick sanity check to confirm that the retrieved chunks are relevant.

In [None]:
# Create a "retriever" that will find the top 3 most relevant chunks for any question
# k=3 means: "give me the 3 best matches"

retriever = vector_store.as_retriever(search_kwargs={"k": 3})

In [None]:
# Let's test it! Ask a question and see which chunks the system finds

question = "How to place the text in center?"

docs = retriever.invoke(question)

# Print the top 3 chunks that matched our question
for i, doc in enumerate(docs, start=1):
    print(f"--- Chunk {i} ---")
    print(doc.page_content[:400])
    print()
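Conceptually, the retriever embeds the question and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea as a brute-force search in plain Python. It is an illustration under assumptions: the two-dimensional vectors and chunk labels are invented, `top_k` is a hypothetical helper, and I am assuming Euclidean distance as the metric, which to my understanding is the default for LangChain's FAISS wrapper.

```python
import math


def top_k(query_vec, stored, k=3):
    """Return the k (distance, text) pairs closest to query_vec (smaller is closer)."""
    scored = [(math.dist(query_vec, vec), text) for text, vec in stored]
    return sorted(scored)[:k]


# Invented chunk labels and toy vectors, for illustration only
stored = [
    ("chunk about text alignment", [0.9, 0.1]),
    ("chunk about headings", [0.6, 0.4]),
    ("chunk about images", [0.1, 0.9]),
    ("chunk about tables", [0.2, 0.7]),
]

results = top_k([1.0, 0.0], stored, k=3)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

A library like FAISS performs the same ranking over real embeddings, with index structures that keep the search fast as the number of chunks grows.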

### Phase 6: LLM Integration and Answer Generation

The retrieval system successfully isolates the raw context, but LearnMate needs to synthesize this data into conversational, human-readable answers.

I am connecting to Groq's high-speed inference API using the `llama-3.1-8b-instant` model. The prompt instructs the LLM to answer *only* from the provided context, which reduces the risk of hallucinated information.

In [None]:
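# Read API keys from file (one key per line) and pick one at random

import random

with open("api_keys.txt", "r") as f:
    # Skip blank lines so an empty string is never used as a key
    api_keys = [line.strip() for line in f if line.strip()]

api_key = random.choice(api_keys)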
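If you would rather not keep keys in a file next to the notebook, one alternative is to read the key from an environment variable and fall back to the file only when the variable is unset. The sketch below is a suggestion, not part of the original pipeline: `load_api_key` is a hypothetical helper, and the variable name `GROQ_API_KEY` is an assumption you can change.

```python
import os


def load_api_key(env_var="GROQ_API_KEY", fallback_path="api_keys.txt"):
    """Return an API key from the environment, or from the first non-blank file line."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if os.path.exists(fallback_path):
        with open(fallback_path, "r") as f:
            keys = [line.strip() for line in f if line.strip()]
        if keys:
            return keys[0]
    raise RuntimeError(
        f"No API key found: set {env_var} or add a key to {fallback_path}"
    )
```

Calling `api_key = load_api_key()` would then replace the file-reading cell, and a missing key produces a clear error instead of a failed request later on.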

In [None]:
# Set up Groq, a cloud inference service with a free tier, to generate answers for us
# Get your own API key at: https://console.groq.com

from groq import Groq

GROQ_API_KEY = api_key
client = Groq(api_key=GROQ_API_KEY)

def build_prompt(context, question):
    """Create the instructions we send to the AI along with the context and question."""
    return (
        f"""
            You are a document-based Question Answering assistant helping students prepare for exams.

            IMPORTANT:
            The provided document context is the ONLY source of truth.
            Answer strictly using information available in the document.
            Do NOT use outside knowledge, assumptions, or prior training information.

            Instructions:
            1. Carefully read the entire document context before answering.
            2. Extract the answer only from the provided context.
            3. If relevant information appears in multiple places, combine them logically.
            4. Do not invent, assume, or expand beyond the document.
            5. If the answer is not clearly present in the context, respond exactly with:
              Not found in document
            6. Keep answers clear, simple, and concise (maximum 2–3 sentences).

            Document Context:
            {context}

            Question:
            {question}

            Answer (based only on the document):
"""
    )

def generate_answer(prompt):
    """Send our prompt to Groq and get an answer back."""
    try:
        chat = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": prompt}]
        )
        return chat.choices[0].message.content.strip()
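    except Exception as e:
        # Report API failures separately so they are not mistaken for a missing answer
        print(f"Error: {e}")
        return "Sorry, the answer could not be generated. Please try again."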

print("Groq AI is ready!")
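The code above sends the rules, the context, and the question together as a single `user` message. A common alternative is to place the fixed rules in a `system` message and only the context and question in the `user` message. The sketch below shows that variant under assumptions: `SYSTEM_RULES` and `build_messages` are hypothetical names, and the rules are a shortened version of the prompt above.

```python
# Shortened version of the instructions from build_prompt (illustrative wording)
SYSTEM_RULES = (
    "You are a document-based question answering assistant helping students "
    "prepare for exams. Answer only from the provided document context. "
    "If the answer is not clearly present, respond exactly with: "
    "Not found in document"
)


def build_messages(context, question):
    """Return a chat message list with the rules separated from the per-question data."""
    user_content = f"Document Context:\n{context}\n\nQuestion:\n{question}"
    return [
        {"role": "system", "content": SYSTEM_RULES},
        {"role": "user", "content": user_content},
    ]


messages = build_messages("The <center> tag centers text.", "How do I center text?")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

The resulting list could be passed as the `messages` argument of `client.chat.completions.create`. Whether this improves answers for this textbook is something to test rather than assume.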

### Phase 7: Interactive Chatbot User Interface

To make LearnMate easy to use, I've built a small chat interface directly inside the notebook using `ipywidgets`. It lets users ask questions about the textbook just as they would in a standard chat application.

In [None]:
# Now try it yourself! Type any question about your PDF below.

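import ipywidgets as widgets
from IPython.display import display, clear_output

# Google Colab needs its custom widget manager enabled; skip this step elsewhere
try:
    from google.colab import output
    output.enable_custom_widget_manager()
except ImportError:
    pass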

title = widgets.HTML("<h3>PDF Chatbot</h3><p style='color:gray'>Ask anything about the uploaded document</p>")
input_box = widgets.Text(
    placeholder="Type your question here...",
    layout=widgets.Layout(width="70%")
)
ask_button = widgets.Button(
    description="Ask",
    button_style="primary",
    layout=widgets.Layout(width="100px")
)
clear_button = widgets.Button(
    description="Clear Chat",
    button_style="warning",
    layout=widgets.Layout(width="100px")
)
chat_out = widgets.Output(
    layout=widgets.Layout(
        border="1px solid #ddd",
        padding="10px",
        height="350px",
        overflow="auto"  # scroll when the conversation outgrows the box
    )
)
status = widgets.HTML("")

chat_history = []

def handle_ask(_):
    question = input_box.value.strip()
    if not question:
        return
    input_box.value = ""
    status.value = "<i style='color:gray'>Thinking...</i>"

    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = build_prompt(context, question)
    answer = generate_answer(prompt)

    chat_history.append(("You", question))
    chat_history.append(("Bot", answer))
    status.value = ""

    with chat_out:
        clear_output()
        for role, text in chat_history:
            if role == "You":
                print(f"You: {text}\n")
            else:
                print(f"Bot: {text}\n")
                print("-" * 50)

def handle_clear(_):
    chat_history.clear()
    with chat_out:
        clear_output()

ask_button.on_click(handle_ask)
clear_button.on_click(handle_clear)
input_box.on_submit(handle_ask)

buttons = widgets.HBox([ask_button, clear_button])
display(title, widgets.HBox([input_box]), buttons, status, chat_out)
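One possible refinement is to move the transcript formatting out of `handle_ask` into a pure function, so it can be checked without running any widgets. The sketch below is a hypothetical refactor: `format_history` is a name I introduced, and it is meant to produce the same text that the print loop above writes.

```python
def format_history(chat_history, separator="-" * 50):
    """Render (role, text) pairs as the transcript shown in the chat box."""
    lines = []
    for role, text in chat_history:
        lines.append(f"{role}: {text}\n")
        if role == "Bot":
            lines.append(separator)  # divider after each completed exchange
    return "\n".join(lines)


demo_history = [("You", "How to center text?"), ("Bot", "Use text-align: center.")]
print(format_history(demo_history))
```

Inside `handle_ask`, the loop under `with chat_out:` could then be replaced by `print(format_history(chat_history))`.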