<a href="https://colab.research.google.com/github/roblein/Optimized-RAG-Pipeline-for-Mortgage-Document-Analysis/blob/main/RL_Optimized_RAG_Pipeline_for_Mortgage_Document_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Optimized RAG Pipeline for Document Question Answering

This notebook demonstrates the construction of an Optimized Retrieval-Augmented Generation (RAG) pipeline designed for question answering over custom documents. The pipeline is built using a combination of open-source libraries, including:

*   **`llama-cpp-python`**: For loading and running a quantized Large Language Model (LLM) efficiently, leveraging GPU acceleration.
*   **`sentence-transformers`**: For generating high-quality text embeddings used for semantic search and retrieval.
*   **Manual Indexing and Retrieval**: A custom implementation to create an in-memory vector index and perform similarity search, bypassing potential compatibility issues with standard LlamaIndex vector stores in this environment.
*   **Manual Response Synthesis**: Logic to construct a prompt with retrieved context and use the loaded LLM to synthesize an answer.
*   **`gradio`**: For creating an interactive web interface to easily query the RAG pipeline.

The pipeline processes an uploaded PDF document, splits it into manageable chunks, generates embeddings for each chunk, and stores them in an in-memory index. When a user submits a query via the Gradio interface, the pipeline finds the most relevant document chunks using embedding similarity, feeds them into the LLM along with the query, and returns the generated answer. Because the LLM is instructed to answer from the retrieved chunks rather than from prior knowledge, responses stay grounded in the document content.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Installed Libraries Overview

This notebook relies on the following key libraries, installed via `pip`, to build the Optimized RAG Pipeline:

*   **`torch`**: Fundamental library for building and running deep learning models.
*   **`llama-index` (and related packages like `llama-index-core`, `llama-index-llms-llama-cpp`, `llama-index-embeddings-huggingface`)**: Provides the framework and tools for building RAG applications, including document loading, indexing, retrieval, and query engines.
*   **`llama-cpp-python`**: Enables efficient execution of quantized Large Language Models (LLMs) in GGUF format, leveraging GPU acceleration.
*   **`sentence-transformers`**: Used for generating high-quality text embeddings for semantic search and retrieval.
*   **`pymupdf` / `pypdf`**: Libraries for processing PDF documents and extracting text content.
*   **`gradio`**: Used to create the interactive web-based user interface for the RAG pipeline.
*   **`accelerate`, `bitsandbytes`, `einops`**: Libraries that assist in optimizing model loading and inference performance.
*   **`nest_asyncio`**: Helps manage asynchronous operations in environments like Colab.
*   **`python-dotenv`**: Used for loading environment variables for configuration.

In [None]:
# Install required libraries (the CUDA build of llama-cpp-python is installed in a later cell)
!pip install -q torch
!pip install -q --upgrade llama-index llama-index-llms-gemini
!pip install -q llama-index-embeddings-huggingface
!pip install -q pymupdf pypdf
!pip install -q python-dotenv gradio
!pip install nest_asyncio einops accelerate
!pip install -U bitsandbytes
!pip install llama-index-llms-llama-cpp

Collecting llama-cpp-python<0.4.0,>=0.3.0 (from llama-index-llms-llama-cpp)
  Using cached llama_cpp_python-0.3.12-cp311-cp311-linux_x86_64.whl
Installing collected packages: llama-cpp-python
  Attempting uninstall: llama-cpp-python
    Found existing installation: llama_cpp_python 0.2.90
    Uninstalling llama_cpp_python-0.2.90:
      Successfully uninstalled llama_cpp_python-0.2.90
Successfully installed llama-cpp-python-0.3.12


### CUDA Version Check and llama-cpp-python Installation

This cell performs two main actions:

1.  **Check CUDA Version**: The command `!nvcc --version` checks the version of the NVIDIA CUDA compiler installed in the environment. This is useful for verifying the CUDA version and ensuring compatibility with CUDA-enabled libraries.
2.  **Install CUDA-enabled `llama-cpp-python`**: The command `!pip install --no-cache-dir llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123` installs a specific version (`0.2.90`) of the `llama-cpp-python` library.
    *   `llama-cpp-python` is used to run Large Language Models (LLMs) in GGUF format efficiently, including on GPUs.
    *   `--no-cache-dir` prevents pip from using cached versions during installation.
    *   `--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123` tells pip to look for packages at this specific URL, which hosts CUDA-enabled builds of `llama-cpp-python` compatible with CUDA 12.3.

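These cells set up the environment to run GGUF models with GPU acceleration using `llama-cpp-python`. As the install log shows, pinning `0.2.90` replaces the `0.3.12` build that was pulled in as a dependency of `llama-index-llms-llama-cpp` during the previous install step.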
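
Since two different versions of `llama-cpp-python` are installed and uninstalled across these cells, it can help to confirm which one is actually present before loading the model. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_pin` are illustrative and not part of the notebook.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package: str, expected: str) -> bool:
    """Print the installed and expected versions and report whether they match."""
    found = installed_version(package)
    print(f"{package}: installed={found}, expected={expected}")
    return found == expected


# The pin used in this notebook's install command
check_pin("llama-cpp-python", "0.2.90")
```

If the check returns `False`, re-running the pinned install cell is the simplest remedy.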

In [None]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

CUDA available: True
GPU: Tesla T4


In [None]:
# Check CUDA version
!nvcc --version

# Install llama-cpp-python with CUDA 12.x support
!pip install --no-cache-dir llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu123
Collecting llama-cpp-python==0.2.90
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.90-cu123/llama_cpp_python-0.2.90-cp311-cp311-linux_x86_64.whl (444.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m444.5/444.5 MB[0m [31m32.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: llama-cpp-python
  Attempting uninstall: llama-cpp-python
    Found existing installation: llama_cpp_python 0.3.12
    Uninstalling llama_cpp_python-0.3.12:
      Successfully uninstalled llama_cpp_python-0.3.12
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source

### Upload Document

This cell uses the `google.colab.files` module to allow you to upload a document (like a PDF) from your local computer to this Colab environment. The uploaded file will be used as the knowledge source for the RAG pipeline.

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving sample_blob_1.pdf to sample_blob_1 (1).pdf
User uploaded file "sample_blob_1 (1).pdf" with length 2602908 bytes
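
Note that Colab renamed the upload to `sample_blob_1 (1).pdf` because a file with the original name already existed. To avoid hard-coding a path that may not match the upload, the path can be derived from the keys of `uploaded`. This is a sketch under two assumptions: uploads land in `/content` (the default Colab working directory), and the helper name `first_pdf_path` is hypothetical.

```python
import posixpath


def first_pdf_path(uploaded_names, base_dir="/content"):
    """Return the path of the first uploaded file ending in .pdf, or None."""
    for name in uploaded_names:
        if name.lower().endswith(".pdf"):
            return posixpath.join(base_dir, name)
    return None


# In the notebook this would be first_pdf_path(uploaded.keys())
print(first_pdf_path(["notes.txt", "sample_blob_1 (1).pdf"]))
```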


### Download and Load Mistral LLM (llama_cpp)

This cell performs the following steps to get and load the Mistral language model:

1.  **Import Libraries**: Imports the necessary `Llama` class from `llama_cpp` for model loading and the `os` module for file system operations.
2.  **Define Model Path**: Sets the local path where the model file will be stored (`/content/mistral.gguf`).
3.  **Download Model**: It checks if the model file already exists at the defined path. If not, it uses the `!wget` command to download a specific Mistral GGUF model file from a Hugging Face repository.
4.  **Verify Download**: After the potential download, it verifies that the model file exists and prints its size.
5.  **Load Model**: It initializes a `Llama` object from the `llama_cpp` library.
    *   `model_path`: Specifies the path to the downloaded GGUF model file.
    *   `n_gpu_layers`: Specifies how many layers of the model are offloaded to the GPU for accelerated inference. The cell below sets it to `4`, which offloads only 4 of the model's 33 layers (see the load log). A higher value, or `-1` to offload all layers, generally gives much faster inference on a GPU like the Tesla T4, provided the layers fit in GPU memory.
    *   `n_ctx`: Sets the context window size of the model, determining how much text the model can consider at once.
    *   `verbose=True`: Enables detailed loading progress output.
6.  **Error Handling**: A `try...except` block is used to catch any errors that might occur during the model loading process.

This cell is vital as it loads the core language model (`llm` object) that is used by your RAG pipeline's `predict` function (in cell `d224aed8`) to synthesize answers based on the retrieved document content.

In [None]:
from llama_cpp import Llama
import os

# Download Mistral model
model_path = "/content/mistral.gguf"
if not os.path.exists(model_path):
    !wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf -O {model_path}
    print(f"Model downloaded to {model_path}")

# Verify file exists and check size
if os.path.exists(model_path):
    print(f"Model file exists. Size: {os.path.getsize(model_path) / (1024 * 1024):.2f} MB")
else:
    print("Model file not found!")

# Load the model with GPU acceleration
try:
    llm = Llama(
        model_path=model_path,
        n_gpu_layers=4,  # Layers offloaded to the GPU; -1 offloads all layers
        n_ctx=2048,      # Context window size
        verbose=True     # Loading progress
    )

    print("Model loaded successfully!")

except Exception as e:
    print(f"Error loading model: {e}")

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /content/mistral.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.

Model file exists. Size: 4166.07 MB


llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 4 repeating layers to GPU
llm_load_tensors: offloaded 4/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4165.37 MiB
llm_load_tensors:      CUDA0 buffer size =   530.00 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   224.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    32.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer

Model loaded successfully!


AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': 
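
The load log above can be used for a rough estimate of how many layers would fit on the GPU. With 4 layers offloaded, it reports a 530 MiB CUDA weight buffer and a 32 MiB CUDA KV buffer, or roughly 140 MiB per layer at `n_ctx=2048`. The sketch below is back-of-the-envelope arithmetic only: the helper name `estimate_gpu_layers`, the free-memory figure, and the reserve are assumptions, and real usage also includes compute buffers.

```python
import math


def estimate_gpu_layers(free_vram_mib, per_layer_mib, total_layers, reserve_mib=1024.0):
    """Estimate how many layers fit in GPU memory, keeping `reserve_mib` spare."""
    usable = free_vram_mib - reserve_mib
    if usable <= 0 or per_layer_mib <= 0:
        return 0
    return min(total_layers, math.floor(usable / per_layer_mib))


# Per-layer cost derived from the log: (530 MiB weights + 32 MiB KV cache) / 4 layers
per_layer = (530.0 + 32.0) / 4

# 15000 MiB of free GPU memory is a hypothetical figure, not a measurement
print(estimate_gpu_layers(free_vram_mib=15000, per_layer_mib=per_layer, total_layers=33))
```

Under these assumptions all 33 layers would fit, which is consistent with the common practice of passing `n_gpu_layers=-1` for a 7B Q4 model on a 16 GB card.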

### Query Expansion Function

This code defines the `expand_query` function, which utilizes the loaded Mistral language model (`llm`) to generate alternative versions of a given input query.

*   The function takes a `query` string and an optional `num_expansions` as input.
*   It constructs a detailed prompt that instructs the LLM to create related queries, focusing on different terminology and relevant legal terms, and to format the output as a list.
*   It calls the `llm.create_completion()` method to send the prompt to the Mistral model and get a text response.
*   The response text is then parsed to extract the generated alternative queries.
*   Finally, the original query is prepended to the list if the model did not already return it, and the combined list of queries is returned.

This function is designed to enhance search recall by querying the RAG pipeline with multiple related queries instead of just the original one.
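
The model often returns its alternatives as a numbered or bulleted list, so the raw lines may carry markers such as `1.` or `-`. The sketch below shows one possible way to strip those markers and cap the number of results. The helper name `parse_query_list` and the sample completion text are illustrative only; the notebook's own function keeps the lines as returned.

```python
import re

# Matches leading list markers such as "1.", "2)", "-", "*", or "•"
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def parse_query_list(text: str, limit: int) -> list:
    """Split completion text into at most `limit` unique, marker-free queries."""
    queries = []
    for line in text.splitlines():
        cleaned = _LIST_MARKER.sub("", line).strip().strip('"')
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries[:limit]


sample = '1. "Borrower duties under the loan"\n2) Obligations of the mortgagor\n\n- Covenants of the borrower\n* extra'
print(parse_query_list(sample, limit=3))
```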

In [None]:
# Simple query expansion function using the Mistral model loaded above.
# Relies on the global `llm` (a llama_cpp.Llama instance); no extra imports are needed.
def expand_query(query: str, num_expansions: int = 3) -> list:
    """Expand a query to include related terms using Mistral."""
    prompt = f"""
    I need to search a legal contract with this query: "{query}"

    Please help me expand this query by generating {num_expansions} alternative versions that:
    1. Use different but related terminology
    2. Include relevant legal terms that might appear in a contract
    3. Cover similar concepts but phrased differently

    Format your response as a list of alternative queries only, with no additional text.
    """

    # Use the global `llm` object (a llama_cpp.Llama instance running Mistral)
    response = llm.create_completion(prompt, max_tokens=128, temperature=0.1)

    # Extract the expanded queries: one query per non-empty line of the completion text
    expanded_queries = [line.strip() for line in response['choices'][0]['text'].split('\n') if line.strip()]

    # Add the original query if needed
    if query not in expanded_queries:
        expanded_queries = [query] + expanded_queries

    return expanded_queries

### Test LLM Completion

This cell tests the text generation capability of the loaded Mistral LLM (`llm` object) using the `create_completion` method.

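It sends a sample prompt to the model and prints the generated response to confirm that the model is loaded correctly and can produce output. Because no document context is supplied at this stage, the answer reflects the model's general knowledge rather than the uploaded PDF.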
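
The chat template printed in the model metadata wraps each user turn as `[INST] ... [/INST]`, whereas the test sends a bare string. A minimal sketch of applying that wrapper by hand is shown below; the helper name `format_mistral_instruct` is hypothetical, and the sketch assumes the beginning-of-sequence token is added by the tokenizer rather than by the prompt string.

```python
def format_mistral_instruct(user_message: str) -> str:
    """Wrap a single user turn in the [INST] ... [/INST] markers from the chat template."""
    return f"[INST] {user_message.strip()} [/INST]"


print(format_mistral_instruct("what are the borrowers obligations?"))
```

The formatted string could then be passed to `llm.create_completion` in place of the bare prompt; whether this improves answers for this document is something to verify empirically.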

In [None]:
# Smoke-test the model with a bare prompt (no retrieved context is supplied yet)
prompt = "what are the borrowers obligations?"
print(f"\nSending prompt: {prompt}")

# Generate text directly from the LLM
response = llm.create_completion(prompt, max_tokens=128, temperature=0.1)
print("\nResponse:")
# The completion is a dict; the generated text is under choices[0]['text']
print(response['choices'][0]['text'])


Sending prompt: what are the borrowers obligations?



llama_print_timings:        load time =    3990.64 ms
llama_print_timings:      sample time =       5.71 ms /    90 runs   (    0.06 ms per token, 15756.30 tokens per second)
llama_print_timings: prompt eval time =    3990.55 ms /     8 tokens (  498.82 ms per token,     2.00 tokens per second)
llama_print_timings:        eval time =   63466.35 ms /    89 runs   (  713.11 ms per token,     1.40 tokens per second)
llama_print_timings:       total time =   67534.40 ms /    97 tokens



Response:


The borrower's obligations typically include making regular payments on the loan according to the agreed-upon schedule, maintaining adequate collateral (if the loan is secured), and complying with any other terms and conditions of the loan agreement. Failure to meet these obligations can result in default and potential legal action by the lender. It's important for borrowers to carefully review the loan agreement and understand their obligations before signing.


## Integrating Open-Source LLMs into RAG (with a PDF!)

### PDF Text Extraction (PyMuPDF)

This code cell uses PyMuPDF (imported as `fitz`) to load a specified PDF file and extract all the text content from its pages.

1.  **Import `fitz`**: Imports the necessary library.
2.  **Define PDF Path**: Sets the file path to the uploaded PDF document.
3.  **Load PDF**: Opens the PDF file using `fitz.open()`.
4.  **Extract Text**: Iterates through each page of the document (`for page in doc`), extracts the text from each page using `page.get_text()`, and joins all the extracted text strings together with newline characters in between (`"\n".join(...)`).
5.  **Print Word Count**: Prints the total number of words extracted as a simple verification.

The resulting `text` variable contains the entire textual content of the PDF, ready for further processing like chunking and embedding.

In [None]:
import fitz  # PyMuPDF

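# Path to the sample contract PDF (adjust it if Colab renamed the upload, e.g. "sample_blob_1 (1).pdf")
pdf_path = "/content/sample_blob_1.pdf"

# Open the PDF, extract text from all pages, and close the file afterwards
with fitz.open(pdf_path) as doc:
    text = "\n".join(page.get_text() for page in doc)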

print(f"Extracted {len(text.split())} words from the PDF.")

Extracted 21141 words from the PDF.


### Embedding Model Setup and Document Processing

This code cell initializes the embedding model and processes the extracted document text into a format suitable for manual retrieval.

1.  **Import Libraries**: Imports necessary libraries including `Document`, `TextNode`, `SentenceSplitter` from `llama_index.core`, `SentenceTransformer` for embeddings, `numpy` for array operations, and `torch` for device checking.
2.  **Initialize Embedding Model**: Initializes the `SentenceTransformer` model (`BAAI/bge-small-en-v1.5`) which will be used to convert text into numerical embeddings. It explicitly sets the device (`cuda` or `cpu`) based on availability.
3.  **Create Documents and Split into Nodes**:
    *   Creates a LlamaIndex `Document` object from the extracted PDF text (`text` from cell `x9xb-U1uNDa7`).
    *   Uses a `SentenceSplitter` to break down the document text into smaller chunks called `nodes`.
4.  **Generate Embeddings**: Iterates through each created `node` and uses the `embed_model_st.encode()` method to generate a vector embedding for the text content of that node.
5.  **Store Nodes and Embeddings**: Stores the original `node` objects in a list (`indexed_nodes`) and their corresponding embeddings in another list (`node_embeddings`).
6.  **Convert Embeddings to NumPy Array**: Converts the list of embeddings into a `numpy` array (`node_embeddings_np`) for efficient similarity calculations during retrieval.
7.  **Shared Variables**: `indexed_nodes`, `node_embeddings_np`, and `embed_model_st` are assigned at the top level of the notebook, so they are already global and accessible to later cells, such as the `predict` function.

This cell is crucial for preparing the document data and the embedding model, which are essential components for the manual retrieval process implemented in the `predict` function.
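
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window chunker. It is a sketch, not how `SentenceSplitter` works internally: the real splitter counts tokens and respects sentence boundaries, while this version counts whitespace-separated words. The function name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into word windows of `chunk_size`, each sharing `chunk_overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # The last window already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g", chunk_size=3, chunk_overlap=1))
# ['a b c', 'c d e', 'e f g']
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, which is the motivation for the notebook's `chunk_overlap=20`.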

In [None]:
from llama_index.core import Document
from llama_index.core.schema import TextNode
from llama_index.core.node_parser import SentenceSplitter
from sentence_transformers import SentenceTransformer
import numpy as np
import torch

# The Mistral `llm` object loaded earlier (cell qXyDQ-kQMqSp) is reused for generation.

# Initialize the SentenceTransformer embedding model, on the GPU when available
embed_model_st = SentenceTransformer("BAAI/bge-small-en-v1.5", device="cuda" if torch.cuda.is_available() else "cpu")

# Create documents from your text and split into nodes
documents = [Document(text=text)]

# Simple text splitting
parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)

# Generate embeddings for each node using SentenceTransformer and store nodes and embeddings
node_embeddings = []
indexed_nodes = []
for node in nodes:
    embedding = embed_model_st.encode(node.get_content()).tolist()
    node_embeddings.append(embedding)
    indexed_nodes.append(node) # Store the node itself

# Convert embeddings list to a numpy array for efficient search
node_embeddings_np = np.array(node_embeddings)

print(f"Created {len(indexed_nodes)} nodes with embeddings of dimension {node_embeddings_np.shape[1]}.")


# Indexing and retrieval are done manually in the predict function rather than
# through a LlamaIndex index, retriever, or query engine.

# indexed_nodes, node_embeddings_np, and embed_model_st are module-level variables,
# so predict can read them directly; no `global` statement is needed here.
# `llm` is likewise global and was initialized in cell qXyDQ-kQMqSp.

Created 73 nodes with embeddings of dimension 384.


In [None]:
import gradio as gr

### Prediction Function (Manual RAG Logic)

This cell defines the `predict` function, which serves as the main logic for your RAG pipeline. It takes a user query and performs the following steps to generate an answer based on the document content:

1.  **Access Global Components**: It accesses the necessary components initialized in previous cells as global variables: the indexed nodes (`indexed_nodes`), the numpy array of node embeddings (`node_embeddings_np`), the loaded Mistral LLM (`llm`), and the Sentence Transformer embedding model (`embed_model_st`). It includes a check to ensure these components are initialized.
2.  **Expand the Query and Embed**: It calls `expand_query` to generate alternative phrasings of `query_text` (falling back to the original query alone if expansion fails), embeds every query with `embed_model_st`, and averages the embeddings into a single query vector.
3.  **Calculate Similarity and Retrieve**: It calculates the cosine similarity between the averaged query embedding and the embeddings of all document nodes (`node_embeddings_np`). It then identifies the indices of the top-k most similar nodes (currently set to 3).
4.  **Retrieve Relevant Nodes**: It retrieves the actual `TextNode` objects corresponding to the top-k indices. These are the document chunks most relevant to the query.
5.  **Construct Synthesis Prompt**: It combines the text content of the retrieved nodes into a single context string. It then formats a prompt for the LLM, including the retrieved context and the original user query, instructing the LLM to answer based *only* on the provided context.
6.  **Synthesize Response**: It calls the `create_completion` method on the `llm` object (your `llama_cpp.Llama` Mistral model) with the constructed prompt. The LLM generates a response based on the provided context.
7.  **Extract and Return Answer**: It extracts the generated text from the LLM's response and returns it as the function's output. A `try...except` block is included to catch any errors that might occur during the LLM call.

This function effectively implements the manual retrieval and synthesis steps of the RAG pipeline, using the prepared document nodes and the loaded language model to answer user questions.
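
The retrieval steps can be sketched on toy vectors using only the standard library. The notebook itself uses `numpy` and scikit-learn's `cosine_similarity`; the helper names below (`cosine`, `mean_vector`, `top_k`) and the two-dimensional vectors are purely illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k document vectors most similar to the query, best first."""
    scored = [(cosine(query_vec, doc), index) for index, doc in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stand-ins for node embeddings
query_vecs = [[1.0, 0.0], [1.0, 0.2]]             # original query plus one expansion
averaged = mean_vector(query_vecs)
print(top_k(averaged, doc_vecs, k=2))  # [0, 2]
```

Averaging collapses all phrasings into one search. An alternative design would retrieve separately for each expanded query and merge the results, at the cost of more similarity computations.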

In [None]:
from sklearn.metrics.pairwise import cosine_similarity  # Cosine similarity for retrieval
import numpy as np

# expand_query is defined in an earlier cell and is already available in the notebook namespace

def predict(query_text):
    """
    Performs manual retrieval and synthesis using the prepared nodes and LLM, with query expansion.

    Args:
        query_text: The user's query string.

    Returns:
        The synthesized response from the LLM.
    """
    global indexed_nodes, node_embeddings_np, llm, embed_model_st # Access global variables

    if indexed_nodes is None or node_embeddings_np is None or llm is None or embed_model_st is None:
         return "Error: RAG components not initialized. Please run the setup cells first."

    # 1. Perform Query Expansion
    try:
        expanded_queries = expand_query(query_text)
        print(f"Original Query: {query_text}")
        print(f"Expanded Queries: {expanded_queries}")
    except Exception as e:
        print(f"Error during query expansion: {e}")
        # Continue with just the original query if expansion fails
        expanded_queries = [query_text]


    # 2. Generate embeddings for all queries (original + expanded)
    query_embeddings = [embed_model_st.encode(q).reshape(1, -1) for q in expanded_queries]

    # 3. Average the query embeddings to get a single representative query embedding
    if query_embeddings:
        averaged_query_embedding = np.mean(query_embeddings, axis=0)
    else:
        # Fallback to original query embedding if expansion failed and was not added
        averaged_query_embedding = embed_model_st.encode(query_text).reshape(1, -1)


    # 4. Calculate similarity between the averaged query embedding and all node embeddings
    similarities = cosine_similarity(averaged_query_embedding, node_embeddings_np).flatten()

    # 5. Get indices of top-k most similar nodes (e.g., top 3)
    top_k_indices = np.argsort(similarities)[-3:][::-1] # Get top 3 indices in descending order of similarity

    # 6. Retrieve the top-k nodes
    retrieved_nodes = [indexed_nodes[i] for i in top_k_indices]

    # 7. Construct prompt for synthesis
    # Combine the content of retrieved nodes
    context_text = "\n\n".join([node.get_content() for node in retrieved_nodes])

    # Create a prompt for the LLM using the retrieved context and the original query
    prompt = f"""
    Context information is below.
    ---------------------
    {context_text}
    ---------------------
    Given the context information and not prior knowledge, answer the query.
    Query: {query_text}
    Answer:
    """

    # 8. Synthesize response using the llama_cpp.Llama LLM ('llm' object)
    try:
        # Call create_completion directly on the llama_cpp.Llama instance
        response = llm.create_completion(prompt, max_tokens=256, temperature=0.7)
        synthesized_response = response['choices'][0]['text'] # Access text from response dict

    except Exception as e:
        return f"Error during LLM synthesis: {e}"

    return synthesized_response
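
One thing worth checking with these settings is the context budget. The model was loaded with `n_ctx=2048`, the splitter uses `chunk_size=512`, three chunks are retrieved, and up to 256 tokens are generated. The arithmetic below is a rough sketch: the helper name `context_budget` is hypothetical, and the splitter's token counts may differ from the Mistral tokenizer's, so the real margin can be smaller.

```python
def context_budget(n_ctx: int, chunk_tokens: int, top_k: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt template and the query after context and output."""
    return n_ctx - chunk_tokens * top_k - max_new_tokens


remaining = context_budget(n_ctx=2048, chunk_tokens=512, top_k=3, max_new_tokens=256)
print(remaining)  # 256
```

With only about 256 tokens left for the template and the query, a long query or full-size chunks could overflow the window; raising `n_ctx` or lowering the chunk size would add headroom.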

### Gradio Interface Setup and Interaction

This section wires the `predict` function into an interactive web interface and launches it.

*   **`gr.Interface(...)`**: This creates the Gradio web interface.
    *   `fn=predict`: Specifies that the `predict` function (defined in cell `d224aed8`) will be executed when a user interacts with the interface.
    *   `inputs=gr.Textbox(...)`: Defines a textbox as the input component for the user's query.
    *   `outputs=gr.Textbox(...)`: Defines a textbox as the output component to display the generated response.
    *   `title="..."`: Sets the title of the Gradio application.
    *   `description="..."`: Provides a brief description displayed below the title.
*   **`iface.launch()`**: This method starts the Gradio web server and launches the interface. In environments like Colab, it provides a public URL to access the interface.

This cell makes your RAG pipeline accessible through a user-friendly interactive web application.
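
Because the interface is reachable through a public URL, it may be worth guarding the handler against empty or oversized input and unexpected exceptions. The wrapper below is a sketch in plain Python: the name `make_safe_handler` and the 1000-character limit are assumptions, and the stub stands in for the notebook's `predict`.

```python
def make_safe_handler(predict_fn, max_chars=1000):
    """Wrap a query function with basic input validation and error reporting."""
    def handler(query_text):
        query = (query_text or "").strip()
        if not query:
            return "Please enter a question."
        if len(query) > max_chars:
            return f"Query too long (limit: {max_chars} characters)."
        try:
            return predict_fn(query)
        except Exception as exc:  # Surface the failure to the user instead of crashing
            return f"Error while answering: {exc}"
    return handler


def stub_predict(query):
    """Stand-in for the notebook's predict function."""
    return f"answer to: {query}"


safe = make_safe_handler(stub_predict)
print(safe("  What is the interest rate?  "))
```

In the notebook, the wrapped function could be passed as `fn=make_safe_handler(predict)` when constructing the interface.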

In [None]:
# Create the Gradio interface
iface = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(label="Enter your query:"),
    outputs=gr.Textbox(label="Response:"),
    title="Mortgage Document Query Analyst",
    description="Ask questions about the uploaded mortgage documents using an optimized RAG pipeline."
)

# Launch the interface
iface.launch()

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://cd1e15694cd19c536f.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


