# **End-to-End RAG System with Weaviate and Google Gemini**

This notebook provides a complete, runnable example of a Retrieval-Augmented Generation (RAG) system. It combines a local vector database (Weaviate), an embedding model, and an LLM (Google Gemini) to answer questions from a provided set of documents.

The core components of this system are:
- **Document Processing:** Loading, chunking, and embedding documents.
- **Vector Store:** Storing the document chunks and their embeddings in a vector database.
- **Retrieval:** Retrieving the most relevant chunks based on a user's query.
- **Generation:** Using an LLM to generate a comprehensive answer based on the retrieved context.

For this example, the system is configured to answer questions about phishing using a set of public PDF guides.

### **1. Setup and Initialization**

This first cell imports the classes and configuration values the notebook needs from the project. It also suppresses all Python warnings to keep the output clean.

In [1]:
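# Project-local helpers for embedding, document processing, and the database connection
from preprocessing.embedding import Embedding
from preprocessing.document import DocumentProcessor
from utils.db_config import DB
# Explicit names instead of a wildcard import; both are assumed to be defined in utils.config
from utils.config import GOOGLE_API_KEY, LLM_MODEL_NAME
import warnings

# Silence all warnings so the cell output stays readable
warnings.filterwarnings("ignore")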

### **2. Connect to the Database and Load the Embedding Model**

Here, we establish a connection to the local Weaviate database. If the required collection doesn't exist, the `DB().connect()` method will create it automatically. We then load the `Embedding` model, which will be used to convert text into vector representations.

In [2]:
client = DB().connect()
embed = Embedding()

### **3. Define Document Sources**

This cell defines the list of documents that will be processed and used as the knowledge base for the RAG system. The documents are a collection of publicly available PDF guides on phishing.

In [3]:
phishing_resources = [
  {
    "title": "Phishing Guidance: Stopping the Attack Cycle at Phase One",
    "url": "https://www.cisa.gov/sites/default/files/2023-10/Phishing%20Guidance%20-%20Stopping%20the%20Attack%20Cycle%20at%20Phase%20One_508c.pdf"
  },
  {
    "title": "Phishing from Université Côte d'Azur",
    "url": "https://www.i3s.unice.fr/~bmartin/Phishing.pdf"
  },
  {
    "title": "Lesson 1: Phishing Analysis for Beginners",
    "url": "https://internews.org/wp-content/uploads/2021/03/Lesson1.pdf"
  },
  {
    "title": "Phishing Guide 2023 by IT.ie",
    "url": "https://it.ie/wp-content/uploads/2023/09/IT.ie-Phishing-Guide-2023.pdf"
  }
]

### **4. Process and Store Documents**

This loop iterates through the list of resources. For each document, it initializes a `DocumentProcessor` and runs the `process_document` method. This method automatically handles the entire pipeline of loading, splitting, and embedding the document, then stores the processed chunks in the Weaviate database. It also includes a check to skip documents that have already been processed to save time.
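The internals of `process_document` are not shown in this notebook, so the following is only a minimal sketch of the split-then-store flow with the skip check described above. `split_into_chunks`, `InMemoryStore`, and `process_once` are hypothetical names, the store is a plain dictionary standing in for the Weaviate collection, and the embedding step is left out.

```python
from dataclasses import dataclass, field


def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the Weaviate collection, keyed by source URL."""
    chunks_by_source: dict[str, list[str]] = field(default_factory=dict)

    def has_source(self, source_id: str) -> bool:
        return source_id in self.chunks_by_source

    def add(self, source_id: str, chunks: list[str]) -> None:
        self.chunks_by_source[source_id] = chunks


def process_once(store: InMemoryStore, source_id: str, text: str) -> bool:
    """Chunk and store a document; return False if it was already stored."""
    if store.has_source(source_id):
        print(f"Document {source_id} already exists in the database. Skipping processing.")
        return False
    store.add(source_id, split_into_chunks(text))
    return True
```

Calling `process_once` twice with the same `source_id` stores the chunks the first time and skips the second time, which is the behavior visible in the cell output below when the notebook is re-run.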

In [4]:
for item in phishing_resources:
    doc = DocumentProcessor(item['url'], item['title'])
    doc.process_document(embed, client)

Document https://www.cisa.gov/sites/default/files/2023-10/Phishing%20Guidance%20-%20Stopping%20the%20Attack%20Cycle%20at%20Phase%20One_508c.pdf already exists in the database. Skipping processing.
Document https://www.i3s.unice.fr/~bmartin/Phishing.pdf already exists in the database. Skipping processing.
Document https://internews.org/wp-content/uploads/2021/03/Lesson1.pdf already exists in the database. Skipping processing.
Document https://it.ie/wp-content/uploads/2023/09/IT.ie-Phishing-Guide-2023.pdf already exists in the database. Skipping processing.


### **5. Create the RAG Chain**

This is where we assemble the RAG components. 

- We initialize the `Retriever` and configure its search parameters, such as the number of chunks to fetch (`k`) and the blend between keyword and vector search (`alpha`). 
- We then load the `GoogleGenerativeAI` model to serve as our LLM.
- Finally, we define two `PromptTemplate`s: one for structuring the retrieved document context (`DOCUMENT_PROMPT`) and another for guiding the LLM to generate the final response (`RETRIEVAL_PROMPT`). The final `rag_chain` combines these components into a single, executable pipeline.
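To make the role of `alpha` concrete: in Weaviate's hybrid search, `alpha = 0` is pure keyword (BM25) search and `alpha = 1` is pure vector search, so `0.25` would lean toward keyword matches if the project's `Retriever` forwards the value unchanged (an assumption, since its source is not shown here). The sketch below illustrates the general idea of weighted score fusion with min-max normalization. It is a simplified illustration, not Weaviate's exact implementation, and `min_max_normalize` and `hybrid_rank` are hypothetical helper names.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range [0, 1] so the two result lists are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (value - lo) / (hi - lo) for doc_id, value in scores.items()}


def hybrid_rank(
    keyword_scores: dict[str, float],
    vector_scores: dict[str, float],
    alpha: float = 0.25,
    k: int = 3,
) -> list[tuple[str, float]]:
    """Blend keyword and vector scores; alpha is the weight of the vector side."""
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    # Highest fused score first, truncated to the k best chunks
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up scores for four chunk IDs, for illustration only
keyword_scores = {"a": 3.0, "b": 1.0, "c": 2.0}
vector_scores = {"a": 0.2, "b": 0.9, "d": 0.55}
ranking = hybrid_rank(keyword_scores, vector_scores, alpha=0.25, k=4)
```

With these made-up scores and `alpha=0.25`, chunk `a` (the strongest keyword match) ranks first, while `alpha=1.0` would put chunk `b` (the strongest vector match) on top.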

In [5]:
from retriever.weaviate_retriever import Retriever
from langchain_google_genai import GoogleGenerativeAI
from langchain.chains.retrieval import create_retrieval_chain 
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import PromptTemplate

retriever = Retriever(client, embed).as_retriever(
    search_kwargs={
        'source_ids': None,
        'auto_merge': False,
        'k': 30,
        'top_k': 15,
        'alpha': 0.25,
    }
)

llm = GoogleGenerativeAI(model=LLM_MODEL_NAME, google_api_key=GOOGLE_API_KEY, temperature=0)

RETRIEVAL_PROMPT = """You are an expert {domain} instructor.
Your task is to write a detailed, 800-word lesson on the topic of ```{input}``` for a {domain} trainee. 
Use the following source materials to ensure accuracy and factual correctness:

```{context}```

Your lesson must be structured with the following sections: an introduction that defines the core concepts, 
a detailed explanation of the key technical principles, a "real-world application" section, and a summary. 
Maintain a professional and technical tone throughout. At the end of the lesson, include a "References" section. 
For each piece of information, cite the source using a numerical reference corresponding to the snippet it cites. 

The references should be in the following format:
[1] Source Title - Page Number 
[2] Source Title - Page Number, etc.

Ensure that the lesson is comprehensive and covers all aspects of the topic.
Do not include information not found in the source material.
Cite only the sources provided in the context."""

DOCUMENT_PROMPT = PromptTemplate(
    input_variables=["page_content", "page_no", "source_id"],
    template="***Source ID***: {source_id}\n***Page Number***: {page_no}\n\nContent: {page_content}"
)
prompt = PromptTemplate(template=RETRIEVAL_PROMPT, input_variables=["input", "domain", "context"])
question_answer_chain = create_stuff_documents_chain(llm, prompt, document_prompt=DOCUMENT_PROMPT)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
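To see what the LLM actually receives as `{context}`, the following sketch approximates the "stuff" step in plain Python: each retrieved chunk is rendered with the document template, and the rendered chunks are joined with a blank line, which is the default separator in `create_stuff_documents_chain`. The sample chunks are invented for illustration, and `build_context` is a hypothetical helper rather than a LangChain function.

```python
# Same template string as DOCUMENT_PROMPT in the cell above
DOCUMENT_TEMPLATE = (
    "***Source ID***: {source_id}\n***Page Number***: {page_no}\n\nContent: {page_content}"
)


def build_context(chunks: list[dict], separator: str = "\n\n") -> str:
    """Render every chunk with the document template and join the results."""
    return separator.join(DOCUMENT_TEMPLATE.format(**chunk) for chunk in chunks)


# Invented chunks; in the real chain these fields come from each retrieved
# document's page content and metadata
sample_chunks = [
    {"source_id": "guide-a", "page_no": 3, "page_content": "Phishing relies on impersonation."},
    {"source_id": "guide-b", "page_no": 7, "page_content": "Multi-factor authentication limits damage."},
]

context = build_context(sample_chunks)
print(context)
```

Because the template references `source_id` and `page_no`, every chunk has to carry both fields; in this sketch a missing key raises `KeyError`.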

### **6. Execute the RAG Chain and Stream the Output**

This final cell defines the user's query and executes the `rag_chain`. Instead of waiting for the full response, it uses `astream` to stream the output as it is generated, so text appears as soon as it is produced and the notebook feels more responsive. The output is printed to the console and also saved to a markdown file named `phishing_lesson.md`.
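The streaming pattern can be tried without a database or API key. In the sketch below, `fake_rag_stream` is a hypothetical stand-in for `rag_chain.astream(...)`. It assumes the chain yields dictionaries in which only some chunks carry an `"answer"` key, which is why the consumer filters on that key.

```python
import asyncio
from typing import AsyncIterator


async def fake_rag_stream() -> AsyncIterator[dict]:
    """Hypothetical stand-in that mimics the shape of streamed chain output."""
    yield {"input": "What is phishing?"}
    yield {"context": ["chunk 1", "chunk 2"]}
    for piece in ["Phishing ", "is a ", "social engineering attack."]:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield {"answer": piece}


async def collect_answer(stream: AsyncIterator[dict]) -> str:
    """Print answer pieces as they arrive and return the full text."""
    parts = []
    async for chunk in stream:
        if "answer" in chunk:
            print(chunk["answer"], end="", flush=True)
            parts.append(chunk["answer"])
    return "".join(parts)


def run_demo() -> str:
    """Entry point for a plain script, where no event loop is running yet."""
    return asyncio.run(collect_answer(fake_rag_stream()))
```

In a notebook, a top-level `await collect_answer(fake_rag_stream())` works because the kernel already runs an event loop; in a regular script, use `asyncio.run(...)` as in `run_demo`.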

In [6]:
query = "What is phishing and how can it be prevented in cybersecurity?"

async def stream_result_to_md():
    with open("phishing_lesson.md", "w", encoding="utf-8") as md_file:
        async for chunk in rag_chain.astream({"input": query, 
                                              "domain": "cybersecurity"}):
            if "answer" in chunk:
                md_file.write(chunk["answer"])
                print(chunk["answer"], end="", flush=True)

await stream_result_to_md()

### **Lesson: What is Phishing and How Can It Be Prevented in Cybersecurity?**

#### **Introduction**

Welcome, trainee. This lesson provides a foundational understanding of phishing, one of the most pervasive and dangerous threats in the digital landscape. Phishing is a type of social engineering attack where malicious actors attempt to trick individuals into revealing sensitive information or taking an action that compromises their security [2]. At its core, social engineering is the art of manipulation, and phishing is its primary digital weapon [2].

The objective of a phishing attack is to steal valuable user data, most commonly login credentials and credit card numbers [1]. Attackers achieve this by posing as trustworthy sources—such as colleagues, well-known organizations, or acquaintances—to lure victims [7]. They craft deceptive emails designed to appear as something the victim needs or wants, inducing them to click a malicious link or download a compromised attachment [1]. Th