# **End-to-End RAG-Based System for Educational Content Generation**

This notebook provides a complete, runnable example of a Retrieval-Augmented Generation (RAG) system. It demonstrates how to leverage a local vector database (Weaviate), a modern embedding model, and a powerful LLM (Google Gemini) to build a system that can generate structured educational content with proper citations.

The core components of this system are:
- **Document Processing:** Loading, chunking, and embedding documents.
- **Vector Store:** Storing the document chunks and their embeddings in a vector database.
- **Retrieval:** Retrieving the most relevant chunks based on a user's query.
- **Generation:** Using an LLM to generate a comprehensive article based on the retrieved context.

For this example, the system is configured to generate an educational article about phishing using a set of public PDF guides.

### **1. Setup and Initialization**

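This first cell imports the classes and functions the notebook needs from the project. The wildcard import from `utils.config` supplies the constants used in later cells, such as `LLM_MODEL_NAME`, `GOOGLE_API_KEY`, and `DOCUMENT_SEPERATOR`. The cell also suppresses all Python warnings to keep the output clean.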

In [1]:
from preprocessing.embedding import Embedding
from preprocessing.document import DocumentProcessor
from utils.db_config import DB
from utils.config import *
from warnings import filterwarnings

filterwarnings("ignore")

### **2. Connect to the Database and Load the Embedding Model**

Here, we establish a connection to the local Weaviate database. If the required collection doesn't exist, the `DB().connect()` method will create it automatically. We then load the `Embedding` model, which will be used to convert text into vector representations.

In [2]:
client = DB().connect()
embed = Embedding()

### **3. Define Document Sources**

This cell defines the list of documents that will be processed and used as the knowledge base for the RAG system. The documents are a collection of publicly available PDF guides on phishing.

In [3]:
phishing_resources = [
  {
    "title": "Phishing Guidance: Stopping the Attack Cycle at Phase One",
    "url": "https://www.cisa.gov/sites/default/files/2023-10/Phishing%20Guidance%20-%20Stopping%20the%20Attack%20Cycle%20at%20Phase%20One_508c.pdf"
  },
  {
    "title": "Phishing from Université Côte d'Azur",
    "url": "https://www.i3s.unice.fr/~bmartin/Phishing.pdf"
  },
  {
    "title": "Lesson 1: Phishing Analysis for Beginners",
    "url": "https://internews.org/wp-content/uploads/2021/03/Lesson1.pdf"
  },
  {
    "title": "Phishing Guide 2023 by IT.ie",
    "url": "https://it.ie/wp-content/uploads/2023/09/IT.ie-Phishing-Guide-2023.pdf"
  }
]

### **4. Process and Store Documents**

This loop iterates through the list of resources. For each document, it initializes a `DocumentProcessor` and runs the `process_document` method. This method automatically handles the entire pipeline of loading, splitting, and embedding the document, then stores the processed chunks in the Weaviate database. It also includes a check to skip documents that have already been processed to save time.

In [4]:
for item in phishing_resources:
    doc = DocumentProcessor(item['url'], item['title'])
    doc.process_document(embed, client)

Document https://www.cisa.gov/sites/default/files/2023-10/Phishing%20Guidance%20-%20Stopping%20the%20Attack%20Cycle%20at%20Phase%20One_508c.pdf already exists in the database. Skipping processing.
Document https://www.i3s.unice.fr/~bmartin/Phishing.pdf already exists in the database. Skipping processing.
Document https://internews.org/wp-content/uploads/2021/03/Lesson1.pdf already exists in the database. Skipping processing.
Document https://it.ie/wp-content/uploads/2023/09/IT.ie-Phishing-Guide-2023.pdf already exists in the database. Skipping processing.
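
The split-and-skip behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the project's `DocumentProcessor`: the character-based chunking, the `chunk_size` and `overlap` parameters, and the `InMemoryStore` class are hypothetical stand-ins, and the embedding step is omitted.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


class InMemoryStore:
    """Hypothetical stand-in for the vector database, keyed by document URL."""

    def __init__(self):
        self.chunks_by_url = {}

    def has_document(self, url):
        return url in self.chunks_by_url

    def add(self, url, chunks):
        self.chunks_by_url[url] = chunks


def process_document(url, text, store, chunk_size=200, overlap=50):
    """Chunk and store a document; return False if it was already stored."""
    if store.has_document(url):
        return False
    store.add(url, split_into_chunks(text, chunk_size, overlap))
    return True


store = InMemoryStore()
url = "https://example.org/guide.pdf"  # placeholder URL for the demo
first = process_document(url, "abcdefghij", store, chunk_size=4, overlap=2)
second = process_document(url, "abcdefghij", store, chunk_size=4, overlap=2)
print(first, second, store.chunks_by_url[url])
```

The second call returns without touching the store, which mirrors the "already exists in the database" messages in the output above.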


### **5. Create the RAG Chain**

This is where we assemble the RAG components. 

- We initialize the `Retriever` and configure its search parameters, such as the number of chunks to fetch (`k`) and the blend between keyword and vector search (`alpha`).
- We then load the `GoogleGenerativeAI` model to serve as our LLM.
- Finally, we define two `PromptTemplate`s: `DOCUMENT_PROMPT`, which formats each retrieved chunk together with its source and page number, and `prompt`, built from `BASE_PROMPT`, which instructs the LLM how to write the lesson. `create_stuff_documents_chain` and `create_retrieval_chain` then combine these components into a single, executable `rag_chain`. The cell ends by defining the `query` used in the following sections.
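
To make the role of `alpha` concrete, here is a minimal sketch of weighted score fusion. It assumes the project's `Retriever` forwards `alpha` to a Weaviate-style hybrid search, where `alpha=0` means pure keyword search and `alpha=1` means pure vector search. The function names, the min-max normalization, and the toy scores are illustrative only, not Weaviate's exact implementation.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a constant score list maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(keyword_scores, vector_scores, alpha=0.25, k=3):
    """Rank ids by alpha * vector score + (1 - alpha) * keyword score."""
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    # Sort by descending fused score, breaking ties by id for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))[:k]


# Toy scores: "a" wins on keywords, "b" and "c" win on vector similarity.
keyword_scores = {"a": 3.0, "b": 1.0, "c": 0.0}
vector_scores = {"a": 0.2, "b": 0.9, "c": 0.8}

keyword_leaning = hybrid_rank(keyword_scores, vector_scores, alpha=0.25)
vector_only = hybrid_rank(keyword_scores, vector_scores, alpha=1.0)
print(keyword_leaning, vector_only)
```

Under this assumption, the value `alpha=0.25` used in this notebook weights keyword matches three times as heavily as vector similarity.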

In [None]:
from retriever.weaviate_retriever import Retriever
from langchain_google_genai import GoogleGenerativeAI
from langchain_openai import OpenAI 
from langchain.chains.retrieval import create_retrieval_chain 
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import PromptTemplate

retriever = Retriever(client, embed).as_retriever(
    search_kwargs={
        'source_ids': None,
        'auto_merge': False,
        'k': 30,
        'top_k': 15,
        'alpha': 0.25,
    }
)

llm = GoogleGenerativeAI(model=LLM_MODEL_NAME, google_api_key=GOOGLE_API_KEY, temperature=0)
# Alternative: a local OpenAI-compatible endpoint.
# llm = OpenAI(base_url='http://172.22.176.1:6969/v1', model='qwen3-8b', api_key='None', temperature=0)

BASE_PROMPT ="""You are an expert {domain} instructor. 
Your task is to write a detailed, 800-word lesson on the topic of ```{input}``` for a {domain} trainee. 
Use the following source materials to ensure accuracy and factual correctness:

```{context}```

Your lesson must be structured with the following sections: an introduction that defines the core concepts, 
a detailed explanation of the key technical principles, a "real-world application" section, and a summary. 
Maintain a professional and technical tone throughout. At the end of the lesson, include a "References" section. 
For each piece of information, cite the source using a numerical reference corresponding to the snippet it cites. 

The references should be in the following format:
[1] Source Title - Page Number 
[2] Source Title - Page Number, etc.

Ensure that the lesson is comprehensive and covers all aspects of the topic.
Do not include information not found in the source material.
Cite only the sources provided in the context."""

DOCUMENT_PROMPT = PromptTemplate(
    input_variables=["page_content", "page_no", "source_id"],
    template="***Source ID***: {source_id}\n***Page Number***: {page_no}\n\nContent: {page_content}"
)
prompt = PromptTemplate(template=BASE_PROMPT, input_variables=["input", "domain", "context"])
document_chain = create_stuff_documents_chain(llm, prompt, document_prompt=DOCUMENT_PROMPT, document_separator=DOCUMENT_SEPERATOR)
rag_chain = create_retrieval_chain(retriever, document_chain)

query = "What is phishing and how can it be prevented in cybersecurity?"
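
The "stuffing" step formats every retrieved document with the document prompt and joins the results with the separator to fill `{context}`. A minimal sketch in plain Python, without LangChain, might look as follows. The `SEPARATOR` value and the dictionary-based documents are hypothetical; the real separator comes from `DOCUMENT_SEPERATOR` in the project's configuration.

```python
DOCUMENT_TEMPLATE = (
    "***Source ID***: {source_id}\n"
    "***Page Number***: {page_no}\n\n"
    "Content: {page_content}"
)
SEPARATOR = "\n\n---\n\n"  # hypothetical value; the notebook uses DOCUMENT_SEPERATOR


def build_context(docs, template=DOCUMENT_TEMPLATE, separator=SEPARATOR):
    """Format each document with the template and join them into one string."""
    return separator.join(
        template.format(page_content=doc["page_content"], **doc["metadata"])
        for doc in docs
    )


# Toy documents shaped like the retriever output (page numbers are strings).
docs = [
    {
        "page_content": "Phishing is a social engineering attack.",
        "metadata": {"source_id": "Guide A", "page_no": "{3}"},
    },
    {
        "page_content": "Report suspicious emails to the security team.",
        "metadata": {"source_id": "Guide B", "page_no": "{7}"},
    },
]

context = build_context(docs)
print(context)
```

Because each snippet carries its source title and page number, the LLM has the information it needs to build the "References" section requested in `BASE_PROMPT`.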

### **6. Execute the RAG Chain and Stream the Output**

This cell executes the `rag_chain` on the `query` defined in the previous cell. Instead of waiting for the full response, it uses `astream` to stream the output as it is generated, so text appears as soon as the model produces it. The answer is printed to the console and saved to a Markdown file named `phishing_lesson.md`, while the retrieved documents are collected in `rag_context` for inspection.
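
The streaming pattern can be tried without a model or database. In the sketch below, `fake_chain_stream` is a hypothetical stand-in for `rag_chain.astream(...)`: it yields partial dictionaries, some carrying `context` and some carrying pieces of `answer`, in the same shape the cell below consumes.

```python
import asyncio
import os
import tempfile


async def fake_chain_stream():
    """Hypothetical stand-in for rag_chain.astream(...): yields partial dicts."""
    for chunk in (
        {"input": "What is phishing?"},
        {"context": ["chunk-1", "chunk-2"]},
        {"answer": "Phishing is "},
        {"answer": "a social engineering attack."},
    ):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def stream_to_file(stream, path):
    """Write answer pieces to `path` as they arrive and collect the context."""
    context, parts = [], []
    with open(path, "w", encoding="utf-8") as out:
        async for chunk in stream:
            if "answer" in chunk:
                out.write(chunk["answer"])
                parts.append(chunk["answer"])
            if "context" in chunk:
                context.extend(chunk["context"])
    return "".join(parts), context


async def run_demo():
    """Stream into a temporary file, then read the file back for comparison."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "lesson.md")
        answer, context = await stream_to_file(fake_chain_stream(), path)
        with open(path, encoding="utf-8") as saved:
            return answer, context, saved.read()


answer, context, saved_text = asyncio.run(run_demo())
print(answer)
```

Inside a notebook, where an event loop is already running, replace the `asyncio.run(run_demo())` call with `await run_demo()`.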

In [6]:
async def generate_and_stream():
    rag_context = []
    with open("phishing_lesson.md", "w", encoding="utf-8") as md_file:
        async for chunk in rag_chain.astream({"input": query, 
                                              "domain": "cybersecurity"}):
            if "answer" in chunk:
                md_file.write(chunk["answer"])
                print(chunk["answer"], end="", flush=True)
            if "context" in chunk:
                rag_context.extend(chunk["context"])
    return rag_context

rag_context = await generate_and_stream()

Hello Trainee,

Welcome to today's lesson. As your cybersecurity instructor, my goal is to provide you with a foundational understanding of one of the most pervasive threats in the digital landscape. Today, we will conduct a deep dive into the topic of phishing: what it is, how it works, and the critical strategies for its prevention.

### **Introduction: Defining the Threat**

At its core, phishing is a type of social engineering attack [1]. Social engineering is the practice of tricking someone into revealing information or taking an action they otherwise would not [2]. Phishing specifically leverages this deception, typically via email, to lure victims into visiting a malicious website or providing sensitive data [2]. The primary goal for malicious actors is to steal user data, most commonly login credentials and credit card numbers [1].

These attackers pose as trustworthy sources—such as colleagues, familiar organizations, or other acquaintances—to exploit the victim's trust [7]. 

In [7]:
rag_context

[Document(metadata={'source_id': "Phishing from Université Côte d'Azur", 'page_no': '{3}'}, page_content='WHAT IS PHISHING\n- Phishing is a social engineering attack used to steal user data.\n- The common stolen data are login credentials and credit card numbers.\n- The goal is to send an email that seems something that the victim needs or wants and induce him/her to click a link or download an attachment.'),
 Document(metadata={'source_id': 'Phishing Guidance: Stopping the Attack Cycle at Phase One', 'page_no': '{3}'}, page_content='OVERVIEW\nSocial engineering is the attempt to trick someone into revealing information (e.g., a password) or taking an action that can be used to compromise systems or networks. Phishing is a form of social engineering where malicious actors lure victims (typically via email) to visit a malicious site or deceive them into providing login credentials. Malicious actors primarily leverage phishing for:\n- Obtaining login credentials. Malicious actors conduct

### **7. Use Hypothetical Document Embeddings (HyDE) for Better Document Retrieval**

This section applies the HyDE (Hypothetical Document Embeddings) approach. The LLM first writes a hypothetical answer to the user's query. That generated passage, rather than the query itself, is then used to retrieve context from the vector database, because a passage written in answer form tends to resemble the stored chunks more closely than a short question does. The retrieved context is passed to the base chain, which generates the final output.
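
The effect of HyDE can be sketched with a toy retriever. Everything here is illustrative: the bag-of-words `embed` function replaces the real embedding model, `write_hypothetical_answer` returns a hard-coded string in place of an LLM call, and the three-document corpus is invented for the demo.

```python
import math
import re
from collections import Counter

CORPUS = {
    "doc_mfa": "Enable multi-factor authentication and email filtering to block credential theft.",
    "doc_def": "Phishing is a social engineering attack that tricks victims into revealing credentials.",
    "doc_misc": "Quarterly budget report for the finance department.",
}


def embed(text):
    """Toy embedding: word counts (a real system would use a neural model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(text, corpus, k=2):
    """Return the ids of the k documents most similar to `text`."""
    query_vec = embed(text)
    scores = {doc_id: cosine(query_vec, embed(doc)) for doc_id, doc in corpus.items()}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:k]


def write_hypothetical_answer(question):
    """Hypothetical stand-in for the LLM call that drafts an answer."""
    return (
        "Phishing is a social engineering attack. Enable multi-factor "
        "authentication and email filtering to block it."
    )


QUERY = "How do I stop phishing?"
hypothetical = write_hypothetical_answer(QUERY)

direct_score = cosine(embed(QUERY), embed(CORPUS["doc_mfa"]))
hyde_score = cosine(embed(hypothetical), embed(CORPUS["doc_mfa"]))
hyde_results = retrieve(hypothetical, CORPUS, k=2)
print(direct_score, round(hyde_score, 3), hyde_results)
```

In this toy setting the short question shares no words with the prevention document, while the drafted answer does, so only the HyDE path surfaces it. Real embedding models are less brittle than word counts, but the same idea applies.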

In [None]:
from langchain_core.prompts.base import format_document

HYDE_PROMPT = """You are an expert {domain} instructor. 
Please write a concise, clear paragraph that answers the following question in technical detail: {query}"""

hyde_prompt = PromptTemplate(
    input_variables=["query", "domain"],
    template=HYDE_PROMPT
)

hyde_chain = hyde_prompt | GoogleGenerativeAI(model="gemini-2.5-flash", google_api_key=GOOGLE_API_KEY, temperature=0)

hypothetical_document = hyde_chain.invoke({"query": query, 
                                           "domain": "cybersecurity"})

docs = retriever.invoke(hypothetical_document)
context = DOCUMENT_SEPERATOR.join(
        format_document(doc, DOCUMENT_PROMPT)
        for doc in docs
)

base_chain = prompt | llm

In [9]:
async def generate_and_stream_hyde():
    with open("phishing_lesson_hyde.md", "w", encoding="utf-8") as md_file:
        async for chunk in base_chain.astream({"input": query,
                                              "context": context,
                                              "domain": "cybersecurity"}):
            md_file.write(chunk)
            print(chunk, end="", flush=True)

await generate_and_stream_hyde()

Hello Trainee,

Welcome to today's lesson. As a cybersecurity expert, my goal is to provide you with the foundational knowledge necessary to excel in this field. Today, we will cover one of the most pervasive and dangerous threats you will encounter: phishing. This lesson will define what phishing is, explore the techniques attackers use, and detail the critical prevention strategies that organizations and individuals must implement.

### **Introduction: Defining the Threat**

At its core, phishing is a form of **social engineering**, which is the attempt to trick someone into revealing information or taking an action that can be used to compromise systems or networks [1]. Phishing is a specific type of social engineering attack where malicious actors, posing as trustworthy sources, lure victims into providing sensitive data [1, 11]. These attacks are most commonly delivered via email but can also occur through other communication channels [1].

The primary objectives of a phishing cam

In [10]:
docs

[Document(metadata={'source_id': 'Phishing Guidance: Stopping the Attack Cycle at Phase One', 'page_no': '{3}'}, page_content='OVERVIEW\nSocial engineering is the attempt to trick someone into revealing information (e.g., a password) or taking an action that can be used to compromise systems or networks. Phishing is a form of social engineering where malicious actors lure victims (typically via email) to visit a malicious site or deceive them into providing login credentials. Malicious actors primarily leverage phishing for:\n- Obtaining login credentials. Malicious actors conduct phishing campaigns to steal login credentials for initial network access.\n- Malware deployment. Malicious actors commonly conduct phishing campaigns to deploy malware for follow-on activity, such as interrupting or damaging systems, escalating user privileges, and maintaining persistence on compromised systems.\nThe Cybersecurity and Infrastructure Security Agency (CISA), National Security Agency (NSA), Fede