# Lesson 1: Creating a Document Processor for Contextual Retrieval


Welcome to the first lesson of our course on building a RAG-powered chatbot with LangChain and Python! In this course, we’ll be creating a complete Retrieval-Augmented Generation (RAG) system that can intelligently answer questions based on your documents.

At the heart of any RAG system is the **document processor**. This component is responsible for taking your raw documents, processing them into a format that can be efficiently searched, and retrieving the most relevant information when a query is made. Think of it as the librarian of your RAG system—organizing information and fetching exactly what you need when you ask for it.

---

### Understanding the Document Processor

The document processing pipeline we’ll build today consists of several key steps:

1. **Loading documents** from files (like PDFs)
2. **Splitting** these documents into smaller, manageable chunks
3. **Creating vector embeddings** for each chunk
4. **Storing** these embeddings in a vector database
5. **Retrieving** the most relevant chunks when a query is made

This document processor will serve as the foundation for our RAG chatbot. In later units, we’ll:

* Build a chat engine that can maintain conversation history
* Integrate both components into a complete RAG system

By the end of this course, you’ll have a chatbot that answers questions using passages retrieved from your own document collection.

---

## 1. Setting Up the Document Processor Class

First, we need to create a class that will handle all our document processing needs. This class will encapsulate functionality for loading, processing, and retrieving information from documents.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000
        self.chunk_overlap = 100
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None
```

In this `__init__`:

* **`chunk_size`**: Number of characters per chunk (default: 1000)
* **`chunk_overlap`**: Overlap between chunks (default: 100)
* **`embedding_model`**: OpenAI Embeddings instance for vectorizing text
* **`vectorstore`**: Placeholder for our FAISS index

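Both values can be tuned to your documents. Larger chunks keep more surrounding context together, which can help with dense technical material, but they make matches less precise. More overlap reduces the chance that a sentence is cut off at a chunk boundary, at the cost of embedding some text twice.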
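
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, so its real output will differ. The function name `chunk_text` is a hypothetical helper, not part of LangChain.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    this ignores paragraph and sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role `chunk_overlap` plays in the real splitter as well.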

---

## 2. Implementing Document Loading and Chunking

### 2.1 Loading Documents

We’ll start by writing a method to load documents based on file type:

```python
def load_document(self, file_path):
    """Load a document based on its file type."""
    if file_path.endswith('.pdf'):
        loader = PyPDFLoader(file_path)
    else:
        raise ValueError("Unsupported file format")
        
    return loader.load()
```

Currently only PDFs are supported, but you can extend this to `.txt`, `.docx`, `.html`, etc.
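
One possible way to structure that extension is a lookup table from file extension to loader, so that supporting a new format means adding one entry. The sketch below uses only the standard library: `load_plain_text` and the dict-based page format are placeholders for illustration, and in a LangChain project each entry would point at a real loader class instead.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_plain_text(file_path: str) -> list[dict]:
    """Hypothetical stand-in loader: returns one 'page' holding the whole file."""
    text = Path(file_path).read_text(encoding="utf-8")
    return [{"page_content": text, "metadata": {"source": file_path}}]


# Map each supported extension to the function that knows how to load it.
LOADERS: dict[str, Callable[[str], list[dict]]] = {
    ".txt": load_plain_text,
    ".md": load_plain_text,
}


def load_document(file_path: str) -> list[dict]:
    """Pick a loader based on the lower-cased file extension."""
    suffix = Path(file_path).suffix.lower()
    loader = LOADERS.get(suffix)
    if loader is None:
        supported = ", ".join(sorted(LOADERS))
        raise ValueError(f"Unsupported file format {suffix!r}; supported: {supported}")
    return loader(file_path)


# Usage: write a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "NOTES.TXT"
    note.write_text("Hello, Watson.", encoding="utf-8")
    pages = load_document(str(note))
    print(pages[0]["page_content"])
```

Lower-casing the suffix means `NOTES.TXT` and `notes.txt` are treated the same way.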

### 2.2 Processing and Chunking

Next, implement a method to split a loaded document into chunks and add them to FAISS:

```python
def process_document(self, file_path):
    """Process a document and add it to the vector store."""
    # 1. Load
    docs = self.load_document(file_path)
    
    # 2. Split
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=self.chunk_size, 
        chunk_overlap=self.chunk_overlap
    )
    split_docs = text_splitter.split_documents(docs)
    
    # 3. Create or update vector store
    if self.vectorstore is None:
        self.vectorstore = FAISS.from_documents(split_docs, self.embedding_model)
    else:
        self.vectorstore.add_documents(split_docs)
```

* **First document**: builds a new FAISS index
* **Subsequent documents**: appends to existing index

This incremental approach lets you grow your knowledge base without rebuilding from scratch.

---

## 3. Implementing Context Retrieval

To fetch relevant information at query time, we need a retrieval method:

```python
def retrieve_relevant_context(self, query, k=3):
    """Retrieve relevant document chunks for a query."""
    if self.vectorstore is None:
        return []
        
    return self.vectorstore.similarity_search(query, k=k)
```

* **`query`**: Question string
* **`k`**: Number of top chunks to return (default: 3)

If no documents have been processed yet, it safely returns an empty list.
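
Conceptually, a similarity search compares the query's embedding with each stored chunk embedding and keeps the `k` closest. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. Treat it as a toy under stated assumptions: real embeddings have hundreds or thousands of dimensions, and FAISS uses optimized index structures (and, depending on configuration, a different metric such as Euclidean distance) rather than this linear scan. The names `cosine_similarity` and `top_k` are illustrative helpers.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query.

    `store` is a list of (text, vector) pairs; an empty store yields [].
    """
    if not store:
        return []
    ranked = heapq.nlargest(
        k, store, key=lambda item: cosine_similarity(query_vec, item[1])
    )
    return [text for text, _ in ranked]


# Toy "embeddings", made up for illustration.
store = [
    ("chunk about a photograph", [0.9, 0.1, 0.0]),
    ("chunk about a goose", [0.0, 0.2, 0.9]),
    ("chunk about a king", [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

print(top_k(query_vec, store, k=2))
```

Asking for more results than the store holds simply returns everything, ranked from most to least similar.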

---

## 4. Resetting the Vector Store

A utility to clear your processor’s memory:

```python
def reset(self):
    """Reset the document processor (clear vector store)."""
    self.vectorstore = None
```

Use this to start fresh with a new document collection.
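
The state transitions involved (no store, store created on first use, store appended to, store cleared) can be sketched without any embeddings. The `ToyProcessor` class below is a hypothetical stand-in that keeps chunks in a plain list and matches by substring; it only mirrors the create-or-append and reset logic, not FAISS itself.

```python
class ToyProcessor:
    """Stand-in for DocumentProcessor that stores chunks in a plain list.

    Illustration only: there is no embedding or FAISS index here, and
    retrieval is a naive substring match instead of a similarity search.
    """

    def __init__(self):
        self.store = None  # mirrors `self.vectorstore = None`

    def add_chunks(self, chunks):
        # Create the store on first use, append to it afterwards.
        if self.store is None:
            self.store = list(chunks)
        else:
            self.store.extend(chunks)

    def retrieve(self, query, k=3):
        # Same guard as retrieve_relevant_context: nothing stored, nothing found.
        if self.store is None:
            return []
        matches = [c for c in self.store if query.lower() in c.lower()]
        return matches[:k]

    def reset(self):
        self.store = None


processor = ToyProcessor()
processor.add_chunks(["Holmes examined the photograph."])
processor.add_chunks(["The goose held a blue carbuncle."])

before = processor.retrieve("photograph")
processor.reset()
after = processor.retrieve("photograph")

print("Before reset:", before)
print("After reset:", after)
```

Because `retrieve` checks for `None` first, calling it after `reset()` is safe and returns an empty list.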

---

## 5. Putting It All Together: A Complete RAG Workflow

Here’s an end-to-end example that:

1. Initializes the processor
2. Processes a PDF
3. Retrieves context for a query
4. Invokes a chat model with RAG

```python
from document_processor import DocumentProcessor
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Initialize
processor = DocumentProcessor()

# Process a PDF (e.g., Sherlock Holmes)
file_path = "data/a_scandal_in_bohemia.pdf"
processor.process_document(file_path)

# Prepare chat model
chat = ChatOpenAI()

# Define a query
query = "What is the main mystery in the story?"

# Retrieve context
relevant_docs = processor.retrieve_relevant_context(query)
context = "\n\n".join(doc.page_content for doc in relevant_docs)

# Build RAG prompt
prompt_template = ChatPromptTemplate.from_template(
    "Answer the following question based on the provided context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)
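# format_messages() returns chat messages; format() would return one "Human: ..." string
messages = prompt_template.format_messages(context=context, question=query)

# Invoke and print
response = chat.invoke(messages)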
print(f"Question: {query}")
print(f"Answer: {response.content}")
```

**Expected output example:**

> **Question:** What is the main mystery in the story?
> **Answer:** The main mystery in "A Scandal in Bohemia" revolves around retrieving a compromising photograph that Irene Adler possesses…
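
To see exactly what text the model receives, the prompt assembly can be reproduced with plain string formatting. This sketch assumes the retrieved chunks are plain strings and uses a hypothetical helper, `build_rag_prompt`; it mirrors the template used in the workflow but does not depend on LangChain.

```python
RAG_TEMPLATE = (
    "Answer the following question based on the provided context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)


def build_rag_prompt(chunks: list[str], question: str) -> str:
    """Join retrieved chunks with blank lines and fill in the template."""
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(context=context, question=question)


# Placeholder chunk texts, invented for this illustration.
example_chunks = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_rag_prompt(example_chunks, "What is the main mystery in the story?")
print(prompt)
```

Printing the assembled prompt like this is a quick way to check that the retrieved context is what you expect before sending it to the model.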

---

## Summary and Next Steps

In this lesson, you’ve learned to:

* Create a **DocumentProcessor** class
* **Load** and **chunk** documents
* Build and manage a **FAISS** vector store
* **Retrieve** relevant chunks for a query
* Integrate with a chat model for a basic **RAG** workflow

Next unit: we’ll craft a **chat engine** that maintains conversation history, then merge it with our document processor to build a fully conversational RAG chatbot. Stay tuned!


## Implementing Document Loading Logic

Your first task is to implement the method that loads a document in the `DocumentProcessor` class. This method should:

* Check if the file path ends with `.pdf`
* Use `PyPDFLoader` to load PDF documents
* Raise a `ValueError` for unsupported file formats
* Return the loaded documents

Then, in `main.py`, use this method to load a PDF file, print the content of the first page, and handle any errors that might occur.

```python
from langchain_community.document_loaders import PyPDFLoader


class DocumentProcessor:
        
    # TODO: Implement a method to load a document that accepts a file path parameter
        # TODO: Check if the file path ends with '.pdf'
            # TODO: If so, create a PyPDFLoader instance with the file path
        # TODO: If it's not a PDF, raise a ValueError with an appropriate message
        # TODO: Return the loaded documents
```


Here’s an example of how you can implement `load_document` in your `DocumentProcessor` and then use it in a simple `main.py`.

```python
# document_processor.py

from langchain_community.document_loaders import PyPDFLoader

class DocumentProcessor:
    def load_document(self, file_path: str):
        """
        Load a document based on its file type.

        Args:
            file_path (str): Path to the file.

        Returns:
            List[Document]: A list of LangChain Document objects.

        Raises:
            ValueError: If the file format is unsupported.
        """
        # Check for PDF extension
        if file_path.lower().endswith('.pdf'):
            loader = PyPDFLoader(file_path)
        else:
            raise ValueError(f"Unsupported file format: {file_path}. Only PDF files are supported.")
        
        # Load and return the documents
        return loader.load()
```

```python
# main.py

from document_processor import DocumentProcessor

def main():
    processor = DocumentProcessor()
    file_path = "data/a_scandal_in_bohemia.pdf"  # adjust path as needed

    try:
        docs = processor.load_document(file_path)
        if not docs:
            print("No pages found in document.")
            return

        # Print the content of the first page
        first_page = docs[0].page_content
        print("=== First Page Content ===")
        print(first_page.strip())

    except ValueError as e:
        # Handle unsupported file formats
        print(f"Error: {e}")

    except Exception as e:
        # Catch-all for other errors (e.g., file not found, loader issues)
        print(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    main()
```

**Explanation:**

1. **`load_document` method**

   * Checks whether `file_path` ends with `.pdf` (case-insensitive).
   * Uses `PyPDFLoader` to load PDF pages into a list of `Document` objects.
   * Raises a `ValueError` if the extension is not `.pdf`.

2. **`main.py`**

   * Instantiates `DocumentProcessor` and attempts to load a PDF.
   * If successful, prints the text of the first page.
   * Catches and reports:

     * `ValueError` for unsupported formats
     * Any other exception (I/O issues, missing file, etc.)


## Initializing Vector Store and Retrieving Context

You've done well setting up the method to load documents. Now, let's focus on initializing our vector store and retrieving context.

Your task is to:

1. Complete the constructor for the `DocumentProcessor` class by initializing a `vectorstore` attribute to `None`.
2. Implement the missing method to retrieve relevant context by performing similarity searches:

   * This method should accept a `query` parameter and an optional `k` parameter (default: 3)
   * Check if the `vectorstore` is `None` and return an empty list if so
   * If the `vectorstore` is not `None`, return the result of a similarity search

After implementing these, test your code in `main.py` by using your retrieval method. Print the result to confirm that it returns an empty list, indicating that the vector store is not yet initialized.

Keep going; you're building a strong foundation!

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS


class DocumentProcessor:
    def __init__(self):
        # TODO: Initialize a vectorstore attribute to None
        
    def load_document(self, file_path):
        """Load a document based on its file type"""
        if file_path.endswith('.pdf'):
            loader = PyPDFLoader(file_path)
        else:
            raise ValueError("Unsupported file format")
            
        return loader.load()
        
    # TODO: Implement a method to retrieve relevant context
    # This method should accept a query parameter and an optional k parameter (default=3)
        # TODO: Check if the vectorstore is None and return an empty list if so
        # TODO: If vectorstore is not None, return the result of a similarity search

# --- main.py (separate file) ---
from document_processor import DocumentProcessor

# Initialize the document processor
processor = DocumentProcessor()

# Define a query
query = "What is the main mystery in the story?"

# TODO: Use the method to retrieve context and store the result

# TODO: Print the result to verify it's an empty list
# 
```

```python
# document_processor.py

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS

class DocumentProcessor:
    def __init__(self):
        # Initialize the vectorstore to None
        self.vectorstore = None

    def load_document(self, file_path):
        """Load a document based on its file type."""
        if file_path.lower().endswith('.pdf'):
            loader = PyPDFLoader(file_path)
        else:
            raise ValueError("Unsupported file format: only PDF is supported.")
        return loader.load()

    def retrieve_relevant_context(self, query: str, k: int = 3):
        """
        Retrieve the top-k most relevant document chunks for a given query.
        
        Args:
            query (str): The search query.
            k (int): How many chunks to retrieve (default: 3).
        
        Returns:
            List[Document]: The list of matching chunks, or empty if no vectorstore.
        """
        if self.vectorstore is None:
            return []
        return self.vectorstore.similarity_search(query, k=k)
```

```python
# main.py

from document_processor import DocumentProcessor

def main():
    processor = DocumentProcessor()

    query = "What is the main mystery in the story?"
    results = processor.retrieve_relevant_context(query)

    # Since we haven't processed any documents yet, vectorstore is None,
    # so we expect an empty list.
    print("Retrieved context:", results)

if __name__ == "__main__":
    main()
```

**What to expect when you run `main.py`:**

```
Retrieved context: []
```

This confirms that, before any documents are processed, calling `retrieve_relevant_context` returns an empty list.



## Processing Documents for Vector Storage

Excellent work on setting up document loading and retrieval functionality! Now it's time to implement the heart of our document processor — the method that processes documents and adds them to the vector store.

This method needs to:

* Load the document using your existing `load_document` method.
* Split the document into manageable chunks using a `RecursiveCharacterTextSplitter`.
* Create or update the vector store with these chunks.

Then, in `main.py`, you'll put everything together by processing a document and retrieving relevant context for a query. This is where your RAG system really comes to life: once you complete this exercise, you'll have a working document processor that can find relevant information from your documents!

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS


class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000
        self.chunk_overlap = 100
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None
        
    def load_document(self, file_path):
        """Load a document based on its file type"""
        if file_path.endswith('.pdf'):
            loader = PyPDFLoader(file_path)
        else:
            raise ValueError("Unsupported file format")
            
        return loader.load()
        
    # TODO: Implement a method to process a document and add it to the vector store
        # TODO: Load the document using the load_document method
        # TODO: Create a RecursiveCharacterTextSplitter with the chunk_size and chunk_overlap attributes
        # TODO: Split the loaded documents into chunks using the text splitter
        # TODO: Check if the vectorstore is None
            # TODO: If it is, create a new FAISS vectorstore from the split documents
        # TODO: If not, add the documents to the existing vectorstore
        
    def retrieve_relevant_context(self, query, k=3):
        """Retrieve relevant document chunks for a query"""
        if self.vectorstore is None:
            return []
            
        return self.vectorstore.similarity_search(query, k=k)


# --- main.py (separate file) ---
from document_processor import DocumentProcessor

# Initialize the document processor
processor = DocumentProcessor()

# Define the file path of the document
file_path = "data/a_scandal_in_bohemia.pdf"

# TODO: Process the document

# Define a query
query = "What is the main mystery in the story?"

# Retrieve relevant context
relevant_docs = processor.retrieve_relevant_context(query)

# Check if any relevant documents were found
if relevant_docs:
    # Print the content of the first one
    print(relevant_docs[0].page_content)
else:
    print("No relevant documents found.")
    
```

Here’s how you can implement the `process_document` method for the `DocumentProcessor` class to handle document processing, splitting, and updating the vector store:

```python
# document_processor.py

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000
        self.chunk_overlap = 100
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None
        
    def load_document(self, file_path):
        """Load a document based on its file type."""
        if file_path.lower().endswith('.pdf'):
            loader = PyPDFLoader(file_path)
        else:
            raise ValueError("Unsupported file format: only PDF is supported.")
        return loader.load()
        
    def process_document(self, file_path):
        """
        Process a document, split it into chunks, and add it to the vector store.
        
        Args:
            file_path (str): The path to the document to process.
        """
        # Load the document
        docs = self.load_document(file_path)
        
        # Split the document into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size, 
            chunk_overlap=self.chunk_overlap
        )
        split_docs = text_splitter.split_documents(docs)
        
        # Check if vectorstore exists, and create or update it
        if self.vectorstore is None:
            self.vectorstore = FAISS.from_documents(split_docs, self.embedding_model)
        else:
            self.vectorstore.add_documents(split_docs)
        
    def retrieve_relevant_context(self, query, k=3):
        """Retrieve relevant document chunks for a query."""
        if self.vectorstore is None:
            return []
            
        return self.vectorstore.similarity_search(query, k=k)
```

### Explanation:

1. **`process_document` method**:

   * Loads a document using `load_document()`.
   * Splits the document into chunks using `RecursiveCharacterTextSplitter` with the defined `chunk_size` and `chunk_overlap`.
   * Checks if the `vectorstore` is `None`. If so, it creates a new FAISS vector store from the document chunks. If the vectorstore already exists, it adds the new chunks to it.

---

### Main Program (`main.py`)

Now, let's put everything together in `main.py` to test the document processor.

```python
# main.py

from document_processor import DocumentProcessor

def main():
    # Initialize the document processor
    processor = DocumentProcessor()

    # Define the file path of the document
    file_path = "data/a_scandal_in_bohemia.pdf"  # Ensure the correct path to your PDF file

    # Process the document (this will add the document chunks to the vector store)
    processor.process_document(file_path)

    # Define a query
    query = "What is the main mystery in the story?"

    # Retrieve relevant context from the vector store
    relevant_docs = processor.retrieve_relevant_context(query)

    # Check if any relevant documents were found
    if relevant_docs:
        # Print the content of the first relevant document
        print("=== First Relevant Document ===")
        print(relevant_docs[0].page_content)
    else:
        print("No relevant documents found.")

if __name__ == "__main__":
    main()
```

### Expected Output:

After running this program, if the document processing works as expected, you should see the raw text of the most relevant chunk. Because retrieval returns passages from the PDF rather than a generated answer, the output is an excerpt of the story itself, and its exact content depends on how the document was chunked:

```
=== First Relevant Document ===
<text of the most relevant chunk from the story>
```

---

### Summary:

* **`process_document`** method splits documents into chunks and adds them to the vector store.
* **`retrieve_relevant_context`** method performs a similarity search for a query and returns the most relevant document chunks.
* The **main program** processes a document, queries the vector store, and prints the most relevant content based on the query.

You now have a working retrieval component, the "R" in Retrieval-Augmented Generation (RAG)!



## Building a Multi-Document Knowledge Base

Now that you've built a working document processor that can handle a single document, let's take it to the next level! One of the most powerful features of our vector store is its ability to store and search across multiple documents.

In this exercise, you'll expand your document processor to work with multiple documents:

* Complete the code in `main.py` to process the second document (the file path is already provided)
* Run the code to see how the system retrieves relevant chunks from both documents
* Notice how the system automatically retrieves the most relevant information regardless of which document it's stored in

This exercise demonstrates a key advantage of RAG systems: the ability to build a knowledge base incrementally by adding multiple documents to the same vector store. The similarity search will find the most relevant information across your entire document collection, not just within a single file.

```python
from document_processor import DocumentProcessor

# Initialize the document processor
processor = DocumentProcessor()

# Process the first document
first_file_path = "data/a_scandal_in_bohemia.pdf"
processor.process_document(first_file_path)

# TODO: Process the second document
second_file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"


# Create a query that might retrieve content from both documents
query = "What methods does Sherlock Holmes use to solve cases?"

# Retrieve relevant context
print(f"Query: {query}")
relevant_docs = processor.retrieve_relevant_context(query, k=4)

# Print the results and check which document they came from
for i, doc in enumerate(relevant_docs):
    # Extract source file from metadata
    source = doc.metadata.get('source', 'Unknown source')
    
    # Print chunk information
    print(f"\nChunk {i+1} (from {source}):")
    print(f"Content preview:\n{doc.page_content[:150]}...")


# --- document_processor.py (separate file) ---
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS


class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000
        self.chunk_overlap = 100
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None
        
    def load_document(self, file_path):
        """Load a document based on its file type"""
        if file_path.endswith('.pdf'):
            loader = PyPDFLoader(file_path)
        else:
            raise ValueError("Unsupported file format")
            
        return loader.load()
        
    def process_document(self, file_path):
        """Process a document and add it to the vector store"""
        # Load the document
        docs = self.load_document(file_path)
        
        # Split the document into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size, 
            chunk_overlap=self.chunk_overlap
        )
        split_docs = text_splitter.split_documents(docs)
        
        # Create or update the vector store
        if self.vectorstore is None:
            self.vectorstore = FAISS.from_documents(split_docs, self.embedding_model)
        else:
            self.vectorstore.add_documents(split_docs)
        
        return split_docs
        
    def retrieve_relevant_context(self, query, k=3):
        """Retrieve relevant document chunks for a query"""
        if self.vectorstore is None:
            return []
            
        return self.vectorstore.similarity_search(query, k=k)

```

To expand the document processor to handle multiple documents and integrate them into the same vector store, we just need to process the second document and ensure the system correctly retrieves relevant context from both documents.

### Steps:

1. **Process the second document** using the `process_document` method. This will add the second document's chunks to the existing vector store.
2. **Perform a query** that could retrieve information from either of the two documents.
3. **Ensure that each document chunk has metadata** that indicates its source so we can easily identify which document the chunk came from.

### Complete Code:

Here’s how to implement the solution:

#### `main.py`

```python
from document_processor import DocumentProcessor

# Initialize the document processor
processor = DocumentProcessor()

# Process the first document
first_file_path = "data/a_scandal_in_bohemia.pdf"
processor.process_document(first_file_path)

# Process the second document
second_file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"
processor.process_document(second_file_path)

# Create a query that might retrieve content from both documents
query = "What methods does Sherlock Holmes use to solve cases?"

# Retrieve relevant context
print(f"Query: {query}")
relevant_docs = processor.retrieve_relevant_context(query, k=4)

# Print the results and check which document they came from
for i, doc in enumerate(relevant_docs):
    # Extract source file from metadata
    source = doc.metadata.get('source', 'Unknown source')
    
    # Print chunk information
    print(f"\nChunk {i+1} (from {source}):")
    print(f"Content preview:\n{doc.page_content[:150]}...")
```

### Updates to `DocumentProcessor`

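For the source lookup in `main.py` to work, each chunk's metadata must contain a `source` entry. `PyPDFLoader` already records the file path under `source` for every page, and the text splitter copies that metadata onto each chunk, so this works without changes for PDFs. The version below also sets `source` explicitly, which is redundant here but makes the behavior obvious and guards against loaders that do not populate the field.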

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000
        self.chunk_overlap = 100
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None
        
    def load_document(self, file_path):
        """Load a document based on its file type"""
        if file_path.endswith('.pdf'):
            loader = PyPDFLoader(file_path)
        else:
            raise ValueError("Unsupported file format")
            
        return loader.load()
        
    def process_document(self, file_path):
        """Process a document and add it to the vector store"""
        # Load the document
        docs = self.load_document(file_path)
        
        # Split the document into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size, 
            chunk_overlap=self.chunk_overlap
        )
        split_docs = text_splitter.split_documents(docs)
        
        # Add metadata to each document chunk
        for doc in split_docs:
            doc.metadata['source'] = file_path
        
        # Create or update the vector store
        if self.vectorstore is None:
            self.vectorstore = FAISS.from_documents(split_docs, self.embedding_model)
        else:
            self.vectorstore.add_documents(split_docs)
        
        return split_docs
        
    def retrieve_relevant_context(self, query, k=3):
        """Retrieve relevant document chunks for a query"""
        if self.vectorstore is None:
            return []
            
        return self.vectorstore.similarity_search(query, k=k)
```

### Explanation:

1. **Processing the Second Document**:
   Calling `process_document` a second time adds the second document's chunks to the existing vector store instead of replacing it. Each chunk carries a `source` entry (the file path) in its metadata, so we can track which document it came from.

2. **Querying Across Multiple Documents**:
   Once both documents are processed, we can query for relevant context, and the system will return the most relevant chunks from any of the documents in the vector store. The source of each chunk is displayed to let us know which document the chunk came from.

3. **Handling Metadata**:
   Each chunk’s metadata contains the source document file path, allowing us to easily identify which document a chunk belongs to when displaying the results.

### Running the Code:

When you run this code, the output lists the four most similar chunks along with the source of each. The previews contain raw text from the stories, and the mix of sources depends on which chunks score highest, so all four may come from a single document. The general shape looks like this:

```
Query: What methods does Sherlock Holmes use to solve cases?

Chunk 1 (from data/a_scandal_in_bohemia.pdf):
Content preview:
<first 150 characters of the chunk>...

Chunk 2 (from data/the_adventure_of_the_blue_carbuncle.pdf):
Content preview:
<first 150 characters of the chunk>...

Chunk 3 (from data/a_scandal_in_bohemia.pdf):
Content preview:
<first 150 characters of the chunk>...

Chunk 4 (from data/the_adventure_of_the_blue_carbuncle.pdf):
Content preview:
<first 150 characters of the chunk>...
```

This demonstrates that the system is capable of processing and searching across multiple documents, retrieving relevant information regardless of the document source. This is one of the core benefits of building a RAG system with incremental document additions.
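
One way to check how the retrieved chunks are distributed across documents is to tally them by their `source` metadata. The sketch below uses a small stand-in `Chunk` dataclass in place of LangChain's `Document` so that it runs without an index; the helper name `count_by_source` and the chunk contents are placeholders for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Minimal stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_by_source(chunks: list[Chunk]) -> Counter:
    """Count how many retrieved chunks came from each source file."""
    return Counter(c.metadata.get("source", "Unknown source") for c in chunks)


# Placeholder results standing in for retrieve_relevant_context(query, k=4).
retrieved = [
    Chunk("...", {"source": "data/a_scandal_in_bohemia.pdf"}),
    Chunk("...", {"source": "data/the_adventure_of_the_blue_carbuncle.pdf"}),
    Chunk("...", {"source": "data/a_scandal_in_bohemia.pdf"}),
    Chunk("...", {}),  # a chunk with no source recorded
]

for source, n in count_by_source(retrieved).most_common():
    print(f"{n} chunk(s) from {source}")
```

The same function should work on real results, since it only relies on each item having a `metadata` dictionary.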


## Implementing Reset for Document Management

Your document processor has been growing with each exercise, and now it's time to implement a way to clear the vector store when needed. This is particularly useful when you want to start fresh with a new set of documents or when testing different document collections.

Your tasks are:

1. Implement the `reset` method in the `DocumentProcessor` class that sets the vector store to `None`.
2. Complete the `main.py` file to test this functionality by:

   * Calling your `reset` method.
   * Confirming that queries after the reset return empty results.

This exercise will teach you proper state management, an essential skill when building systems that need to be flexible and maintainable over time.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS


class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000
        self.chunk_overlap = 100
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None
        
    def load_document(self, file_path):
        """Load a document based on its file type"""
        if file_path.endswith('.pdf'):
            loader = PyPDFLoader(file_path)
        else:
            raise ValueError("Unsupported file format")
            
        return loader.load()
        
    def process_document(self, file_path):
        """Process a document and add it to the vector store"""
        # Load the document
        docs = self.load_document(file_path)
        
        # Split the document into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size, 
            chunk_overlap=self.chunk_overlap
        )
        split_docs = text_splitter.split_documents(docs)
        
        # Create or update the vector store
        if self.vectorstore is None:
            self.vectorstore = FAISS.from_documents(split_docs, self.embedding_model)
        else:
            self.vectorstore.add_documents(split_docs)
        
        return split_docs
        
    def retrieve_relevant_context(self, query, k=3):
        """Retrieve relevant document chunks for a query"""
        if self.vectorstore is None:
            return []
            
        return self.vectorstore.similarity_search(query, k=k)
        
    # TODO: Implement the method to clear the vector store


# main.py (a separate file)

from document_processor import DocumentProcessor

# Initialize the document processor
processor = DocumentProcessor()

# Process a document
file_path = "data/a_scandal_in_bohemia.pdf"
processor.process_document(file_path)

# Define a query
query = "What is the main mystery in the story?"

# Verify the document was processed correctly
print("Test 1: Retrieving context before reset")
relevant_docs = processor.retrieve_relevant_context(query)
if relevant_docs:
    print(f"First chunk preview: {relevant_docs[0].page_content[:100]}...")
else:
    print("No relevant documents found.")

# TODO: Call the method to clear the vector store

# TODO: Verify that no documents are returned after reset

```

Here’s how you can add a simple `reset` method to your `DocumentProcessor` and then finish your `main.py` to test that it really clears out the vector store.

```python
# document_processor.py

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS


class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000
        self.chunk_overlap = 100
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None
        
    def load_document(self, file_path):
        """Load a document based on its file type"""
        if file_path.endswith('.pdf'):
            loader = PyPDFLoader(file_path)
        else:
            raise ValueError("Unsupported file format")
            
        return loader.load()
        
    def process_document(self, file_path):
        """Process a document and add it to the vector store"""
        # Load the document
        docs = self.load_document(file_path)
        
        # Split the document into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size, 
            chunk_overlap=self.chunk_overlap
        )
        split_docs = text_splitter.split_documents(docs)
        
        # Create or update the vector store
        if self.vectorstore is None:
            self.vectorstore = FAISS.from_documents(split_docs, self.embedding_model)
        else:
            self.vectorstore.add_documents(split_docs)
        
        return split_docs
        
    def retrieve_relevant_context(self, query, k=3):
        """Retrieve relevant document chunks for a query"""
        if self.vectorstore is None:
            return []
            
        return self.vectorstore.similarity_search(query, k=k)
        
    def reset(self):
        """Clear the vector store so we can start fresh."""
        self.vectorstore = None
```

```python
# main.py

from document_processor import DocumentProcessor

# Initialize the document processor
processor = DocumentProcessor()

# Process a document
file_path = "data/a_scandal_in_bohemia.pdf"
processor.process_document(file_path)

# Define a query
query = "What is the main mystery in the story?"

# Verify the document was processed correctly
print("Test 1: Retrieving context before reset")
relevant_docs = processor.retrieve_relevant_context(query)
if relevant_docs:
    print(f"First chunk preview: {relevant_docs[0].page_content[:100]}...")
else:
    print("No relevant documents found.")

# --- Now reset and test again ---
processor.reset()

print("\nTest 2: Retrieving context after reset")
relevant_docs_after_reset = processor.retrieve_relevant_context(query)
if not relevant_docs_after_reset:
    print("✅ Vector store has been reset. No documents found.")
else:
    print("❌ Found documents after reset, something went wrong.")
```

With this in place:

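1. Calling `processor.reset()` sets `vectorstore` back to `None`, which drops the processor's reference to the existing FAISS index. Python can then free the in-memory index once nothing else refers to it.
2. A subsequent call to `retrieve_relevant_context(...)` hits the `if self.vectorstore is None` guard and returns an empty list, confirming your reset logic works.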
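
Because `process_document` already handles the `None` case, a reset processor can be reused right away: the next document rebuilds the store from scratch. The sketch below illustrates that lifecycle without calling the OpenAI API. `InMemoryProcessor`, `process_text`, and the word-overlap search are hypothetical stand-ins invented for this illustration (a plain list replaces FAISS and there are no embeddings); only the `None`-based control flow mirrors the `DocumentProcessor` above.

```python
class InMemoryProcessor:
    """Hypothetical stand-in for DocumentProcessor: a plain list replaces FAISS."""

    def __init__(self):
        self.vectorstore = None  # same "empty" sentinel as DocumentProcessor

    def process_text(self, text):
        """Add one chunk of text, creating the store on first use."""
        if self.vectorstore is None:
            self.vectorstore = [text]      # plays the role of FAISS.from_documents
        else:
            self.vectorstore.append(text)  # plays the role of add_documents

    def retrieve_relevant_context(self, query, k=3):
        """Return up to k chunks sharing a word with the query (not a real similarity search)."""
        if self.vectorstore is None:
            return []
        words = set(query.lower().split())
        hits = [c for c in self.vectorstore if words & set(c.lower().split())]
        return hits[:k]

    def reset(self):
        """Clear the store so the next document starts a new collection."""
        self.vectorstore = None


processor = InMemoryProcessor()
processor.process_text("Holmes studies the photograph")
before = processor.retrieve_relevant_context("photograph")

processor.reset()
after_reset = processor.retrieve_relevant_context("photograph")

# Reuse the same processor for a new collection
processor.process_text("Watson writes a letter")
after_reload = processor.retrieve_relevant_context("letter")
old_query = processor.retrieve_relevant_context("photograph")

print("Before reset:", before)
print("After reset:", after_reset)
print("After reload:", after_reload)
print("Old content after reload:", old_query)
```

The last query comes back empty because nothing from the first collection survives the reset. Using `None` as the "empty" marker keeps the class simple, but it also means every method that touches `vectorstore` needs the same `is None` check.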
