SPDX-License-Identifier: Apache-2.0 Copyright (c) 2023, Aryan Karumuri srirajx.aryan.karumuri@intel.com

# PDF Interaction with Hybrid Retrievals 📄🔍

Welcome to the future of document interaction! Whether you're a student, researcher, developer, or simply someone who works with PDFs, this guide will show you how to make the most of advanced retrieval techniques to chat with your PDFs.

Have you ever wished you could ask questions about a PDF document and get quick, relevant answers? With **Hybrid Retrievals**, **Conceptualization**, **Reranking**, and **Contextual Compression**, you can enhance your PDF experience like never before.

Choose your PDF, and the system will analyze the content using both traditional keyword search and advanced machine learning models. This hybrid approach allows you to query the document, extract information, and even have a back-and-forth conversation about its contents.

This tool is powered by cutting-edge AI technology, ensuring high-quality results and accurate, context-aware responses. No complex setup, just easy interaction with your documents in real time.

Ready to unlock the potential of your PDFs and have intelligent conversations with them? Let’s dive in!


# Setup & Imports 🚀

### Setup

1. Run the cell below to create a new virtual environment and install the dependencies.
2. To select the new environment/kernel in Jupyter, choose **"Python (llm_rag_venv)"** from the dropdown menu in the top-right corner of this notebook.


In [1]:
# Step 1: Create a Python virtual environment
!python3 -m venv llm_rag_venv

# Step 2: Install dependencies (torch, transformers, etc.) in the virtual environment
# Since we can't activate the environment directly in Jupyter, we'll use the full path to `pip` inside the venv.
!llm_rag_venv/bin/pip install torch==2.3.1+cxx11.abi \
                             torchvision==0.18.1+cxx11.abi \
                             torchaudio==2.3.1+cxx11.abi \
                             oneccl_bind_pt==2.3.100+xpu \
                             intel-extension-for-pytorch==2.3.110+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

!llm_rag_venv/bin/pip install transformers==4.44.2 \
                             langchain==0.3.3 \
                             langchain-community==0.3.0 \
                             langchain-core==0.3.10 \
                             langchain-huggingface==0.1.0 \
                             langchain-text-splitters==0.3.0 \
                             langsmith==0.1.128 \
                             InstructorEmbedding==1.0.1 \
                             huggingface-hub==0.25.1 \
                             chroma-hnswlib==0.7.6 \
                             chromadb==0.5.11 \
                             accelerate==1.0.1 \
                             pypdf \
                             ipywidgets \
                             rank-bm25==0.2.2

!llm_rag_venv/bin/pip install sentence-transformers==2.2.2

# Step 3: Install `ipykernel` to register the virtual environment as a Jupyter kernel
!llm_rag_venv/bin/pip install ipykernel

# Step 4: Register the virtual environment as a Jupyter kernel
!llm_rag_venv/bin/python -m ipykernel install --user --name=llm_rag_venv --display-name "Python (llm_rag_venv)"


Looking in indexes: https://pypi.org/simple, https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
Collecting sentence-transformers>=2.6.0
  Using cached sentence_transformers-3.3.1-py3-none-any.whl (268 kB)
Installing collected packages: sentence-transformers
  Attempting uninstall: sentence-transformers
    Found existing installation: sentence-transformers 2.2.2
    Uninstalling sentence-transformers-2.2.2:
      Successfully uninstalled sentence-transformers-2.2.2
Successfully installed sentence-transformers-3.3.1
Collecting sentence-transformers==2.2.2
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Installing collected packages: sentence-transformers
  Attempting uninstall: sentence-transformers
    Found existing installation: sentence-transformers 3.3.1
    Uninstalling sentence-transformers-3.3.1:
      Successfully uninstalled sentence-transformers-3.3.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are i

### Imports 📦

In this section, we import all the necessary libraries and modules required for various tasks like document loading, embeddings, model generation, and chaining.

### Key Libraries:
- **PyTorch**: Used for deep learning model operations, and device management (CPU or GPU).
- **LangChain**: A framework to work with language models, document processing, embeddings, and chains of operations.
- **Transformers**: Hugging Face's `transformers` library is used to work with pretrained models for text generation and embedding extraction.
- **Chroma**: A library for creating and managing vector databases, storing embeddings for document retrieval.

Additionally, we suppress any warnings that might arise from model loading and setup.


## Now select the kernel 'Python (llm_rag_venv)' in the notebook

In [2]:
# Standard library imports
import os
import warnings
from pathlib import Path

warnings.filterwarnings("ignore")

# Third-party imports
import ipywidgets as widgets
import torch
import intel_extension_for_pytorch as ipex
import langchain
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, pipeline

# Langchain related imports
from langchain_huggingface import HuggingFacePipeline
from langchain.schema import Document
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.cache import InMemoryCache
from langchain.storage import LocalFileStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

# Chroma and related imports
from chromadb.config import Settings
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker

# Cross-encoder related imports
from langchain_community.cross_encoders import HuggingFaceCrossEncoder


## Device Setup 🖥️🔧

This section checks whether a compatible Intel GPU (XPU) is available. If so, the code sets the device to `xpu` for accelerated computation; otherwise it falls back to the CPU.

### Key Components:
- **Device Selection**: The device is set based on availability. If an `xpu` device (Intel GPU) is available, it is used; otherwise the system defaults to the CPU.
- **Memory Management**: When running on XPU, the cache is emptied so that memory cached by previous operations is released before the next computation.


In [3]:
# Check and set the device (either XPU or CPU)
device = torch.device("xpu" if torch.xpu.is_available() else "cpu")

# If XPU is available, use it, otherwise fall back to CPU
if device.type == "xpu":
    # Empty the XPU cache
    torch.xpu.empty_cache()
    print(f"Using device: {torch.xpu.get_device_name()}")
else:
    print("Using CPU")


Using device: Intel(R) Data Center GPU Max 1100




## Paths Configuration 📂

This section sets up the paths for various directories used in the project. These paths are essential for file management, including loading documents, storing embeddings, and saving the final database.

### Key Directories:
- `ROOT_DIRECTORY`: The base directory of the project where all source code and notebooks reside.
- `SOURCE_DIRECTORY`: Directory containing source documents (e.g., PDFs) for processing.
- `PERSIST_DIRECTORY`: Directory where embeddings and other persisted data will be stored.
- `selected_pdf_path`: Path to the specific PDF file that will be processed.

A **LocalFileStore** object is also created here to manage embedding caching.

**Note**: By default, "attention.pdf" is selected. To use a different PDF, choose it from the dropdown and then run the remaining cells one by one, rather than running the whole notebook at once.


In [4]:
# Define root and source directories
ROOT_DIRECTORY = Path.cwd()
SOURCE_DIRECTORY = ROOT_DIRECTORY / 'SOURCE_DOCUMENTS'
PERSIST_DIRECTORY = ROOT_DIRECTORY / 'DB'  # Path to the DB directory

# Local file store for caching
store = LocalFileStore("./db_cache/cache/")

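# Remove the PERSIST_DIRECTORY if it already exists
if PERSIST_DIRECTORY.exists():
    import shutil  # local import keeps this cell self-contained
    shutil.rmtree(PERSIST_DIRECTORY)  # portable, and safe for paths containing spaces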

# Ensure the PERSIST_DIRECTORY exists (create it if it doesn't)
PERSIST_DIRECTORY.mkdir(parents=True, exist_ok=True)

# You can now confirm the directory was created
print(f"PERSIST_DIRECTORY exists at: {PERSIST_DIRECTORY}")

# Define the Chroma settings
CHROMA_SETTINGS = Settings(
    anonymized_telemetry=False,
    is_persistent=True,
)

# List of PDF filenames in the data directory
pdf_files = [name for name in os.listdir(ROOT_DIRECTORY / 'data') if name.endswith('pdf')]

# Create the full paths for the PDFs
pdf_paths = [ROOT_DIRECTORY / 'data' / pdf for pdf in pdf_files]

# Create a dropdown widget for selecting a PDF
dropdown = widgets.Dropdown(
    options=pdf_files,
    description='Select PDF:',
    disabled=False
)

# Define a function to handle the dropdown change
selected_pdf = None

def on_dropdown_change(change):
    global selected_pdf, selected_pdf_path
    selected_pdf = change.new    
    selected_pdf_path = ROOT_DIRECTORY / 'data' / selected_pdf 

if selected_pdf is None:
    selected_pdf = "attention.pdf"
    selected_pdf_path = ROOT_DIRECTORY / 'data' / selected_pdf     
        

# Attach the function to the dropdown's 'value' change event
dropdown.observe(on_dropdown_change, names='value')

# Display the dropdown
display(dropdown)

PERSIST_DIRECTORY exists at: /home/u5c2dbc0bf2849dd5288e3311262c709/dev/Usecase/Main_Usecase_GPT/notebook/DB


Dropdown(description='Select PDF:', options=('attention.pdf', 'semantic_search_&_recommendation_algorithms.pdf…

## Configuration and Session Management ⚙️📅

### Embedding Model:
- **EMBEDDING_MODEL_NAME**: Specifies the embedding model used for generating document and query embeddings. Here, `"hkunlp/instructor-large"` is used, which is optimized for instruction-based embeddings suitable for document retrieval tasks.

### Hugging Face Model:
- **MODEL_ID**: Identifies the language model for text generation. `"NousResearch/Llama-2-7b-chat-hf"` is a conversational model designed for chat-based interactions, used to generate responses based on user queries.

### Generation Config:
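- **MAX_LENGTH**: Maximum sequence length passed to the generation pipeline as `max_length` (4096 tokens). In Hugging Face `transformers`, this limit counts the prompt tokens plus the generated tokens, not the response alone.
- **TEMPERATURE**: Controls randomness in generation; the low value of `0.1` makes sampling strongly favor the most likely tokens, so outputs are close to deterministic.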
- **REPETITION_PENALTY**: Discourages repeated phrases, set to `1.15`.
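
To make the effect of `TEMPERATURE` and `REPETITION_PENALTY` concrete, here is a minimal pure-Python sketch of the two adjustments applied to a toy score vector. The function names and the three-token vocabulary are hypothetical; this only illustrates the idea and is not the code path used by the pipeline.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Make already-generated tokens less likely: positive scores are
    divided by the penalty, negative scores are multiplied by it."""
    adjusted = list(logits)
    for i in set(seen_token_ids):
        adjusted[i] = adjusted[i] / penalty if adjusted[i] > 0 else adjusted[i] * penalty
    return adjusted

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for a 3-token vocabulary
print(softmax_with_temperature(toy_logits, 1.0))
print(softmax_with_temperature(toy_logits, 0.1))  # mass concentrates on the top token
print(apply_repetition_penalty(toy_logits, seen_token_ids=[0], penalty=1.15))
```

With a temperature of `0.1`, almost all probability lands on the highest-scoring token, which is why low temperatures give nearly repeatable answers.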

### Session State Management:
- **session_store**: A dictionary that stores chat history for each session.
- **session_id**: A default session ID (`"default_session"`) is used.
- **ChatMessageHistory**: Manages the chat history, ensuring that past interactions are accessible for contextual responses in ongoing conversations.


In [5]:
# Embedding Model
EMBEDDING_MODEL_NAME = "hkunlp/instructor-large"

# Hugging Face Model
MODEL_ID = "NousResearch/Llama-2-7b-chat-hf"

# Generation config parameters
MAX_LENGTH = 4096
TEMPERATURE = 0.1
REPETITION_PENALTY = 1.15

# Session state management
session_store = {}
session_id = "default_session"

if session_id not in session_store:
    session_store[session_id] = ChatMessageHistory()

## Load & Chunk Documents 📥📑

This section deals with loading a PDF file and splitting its content into manageable chunks for processing.

### Key Steps:
1. **Loading the PDF**: The `PyPDFLoader` is used to load the document from the selected `selected_pdf_path`.
2. **Splitting the Document**: The document is split into chunks using `RecursiveCharacterTextSplitter`. This is done to break the document into smaller sections, allowing more efficient processing.
3. **Filtering Chunks**: After splitting, the chunks are filtered to remove any empty content before embedding.

This process is necessary to prepare the document content for embedding and retrieval in a vector store.
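
As a rough illustration of what `chunk_size` and `chunk_overlap` mean, the sketch below splits text into fixed-size character windows that share an overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this hypothetical `chunk_text` helper does not attempt.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # drop empty or whitespace-only chunks
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(600))  # hypothetical 600-character text
chunks = chunk_text(sample, chunk_size=240, chunk_overlap=50)
print([len(c) for c in chunks])
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still seen whole in at least one chunk.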


In [6]:
# Loading the PDF
loader = PyPDFLoader(selected_pdf_path)
docs = loader.load()

if not docs:
    raise ValueError("No documents loaded from the PDF.")

# Splitting the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=240, chunk_overlap=50)
document_chunks = text_splitter.split_documents(docs)

# Creating documents from the chunks
documents = [
    Document(page_content=chunk.page_content) 
    for chunk in document_chunks if chunk.page_content
]


## Getting Embeddings & Embedder Setup 🧠🔗

### `get_embeddings(device_type)`:
- This function selects the appropriate embedding model based on the value of `EMBEDDING_MODEL_NAME`.
    - If the model name contains "instructor" (indicating an instruction-based embedding model like `hkunlp/instructor-large`), it initializes the `HuggingFaceInstructEmbeddings` class, which generates embeddings for both documents and queries with custom instructions.
    - If the model is not instruction-based, the function defaults to using `HuggingFaceEmbeddings`, a general-purpose embedding model.

### Getting Embeddings:
- `get_embeddings(device)` is called to build the embedding model on the selected device (XPU or CPU).

### Cache-Backed Embeddings:
- **`CacheBackedEmbeddings`**: This class wraps the selected embedding model with a cache, so that text that has already been embedded is read from the cache instead of being recomputed.
- **`from_bytes_store`**: This method builds the cache-backed embedder on top of the byte store `store` (the `LocalFileStore` created earlier). The `namespace` argument (`EMBEDDING_MODEL_NAME`) is used as part of the cache keys, so embeddings produced by different models do not collide.
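
The caching idea can be sketched in a few lines of plain Python. The `CachedEmbedder` class and the `toy_embed` function below are hypothetical stand-ins (a dict replaces the on-disk store and a trivial function replaces the model); they show the lookup-then-compute pattern, not LangChain's implementation.

```python
import hashlib

class CachedEmbedder:
    """Wrap an embedding function with a key-value cache (a plain dict here)."""

    def __init__(self, embed_fn, namespace):
        self.embed_fn = embed_fn
        self.namespace = namespace
        self.store = {}   # stands in for an on-disk byte store
        self.misses = 0   # how many texts actually had to be embedded

    def _key(self, text):
        # Prefix the hash with the namespace so different models never share keys
        return self.namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts):
        vectors = []
        for text in texts:
            key = self._key(text)
            if key not in self.store:
                self.store[key] = self.embed_fn(text)
                self.misses += 1
            vectors.append(self.store[key])
        return vectors

def toy_embed(text):
    """Hypothetical stand-in for a real model: [length, number of vowels]."""
    return [float(len(text)), float(sum(ch in "aeiou" for ch in text.lower()))]

cached = CachedEmbedder(toy_embed, namespace="toy-model")
cached.embed_documents(["attention is all you need", "multi-head attention"])
cached.embed_documents(["attention is all you need"])  # served from the cache
print(cached.misses)
```

Only the first request for a given text reaches the embedding function; repeated texts are answered from the store.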


In [7]:
# Getting embeddings
def get_embeddings(device_type):
    if "instructor" in EMBEDDING_MODEL_NAME:
        return HuggingFaceInstructEmbeddings(
            model_name=EMBEDDING_MODEL_NAME,
            model_kwargs={"device": device_type},
            embed_instruction="Represent the document for retrieval:",
            query_instruction="Represent the question for retrieving supporting documents:"
        )
    else:
        return HuggingFaceEmbeddings(
            model_name=EMBEDDING_MODEL_NAME,
            model_kwargs={"device": device_type}
        )

# Get embeddings
embeddings = get_embeddings(device)

# Create embedder with cache-backed embeddings
embedder = CacheBackedEmbeddings.from_bytes_store(embeddings, store, namespace=EMBEDDING_MODEL_NAME)


load INSTRUCTOR_Transformer
max_seq_length  512


## Database Creation with Chroma 🗃️

### `Chroma.from_documents()`:
- This function creates a vector database from the provided documents, using embeddings generated by the `embedder`.
    - **`documents`**: A list of documents that will be embedded and stored in the database.
    - **`embedder`**: The embedding model used to generate embeddings for the documents.
    - **`persist_directory`**: Specifies the directory where the vector database will be stored persistently. The `PERSIST_DIRECTORY` path is converted to a string to ensure compatibility.
    - **`client_settings`**: Configuration settings for the Chroma client. Here, `CHROMA_SETTINGS` disables anonymized telemetry and enables persistence.

The `Chroma` database enables efficient vector search and retrieval, where documents are indexed by their embeddings for fast querying.
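
Conceptually, a vector search compares the query embedding with every stored embedding and returns the closest ones. The sketch below does this exhaustively with cosine similarity on hypothetical three-dimensional vectors. Chroma itself uses an approximate index and a configurable distance metric, so treat the metric and the brute-force loop here as assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, indexed, k):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical index: each chunk is stored next to its embedding
toy_index = [
    ("chunk about attention", [0.9, 0.1, 0.0]),
    ("chunk about training", [0.1, 0.9, 0.0]),
    ("chunk about results", [0.0, 0.2, 0.9]),
]
hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print([text for text, _ in hits])
```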


In [8]:
# DB creation
db = Chroma.from_documents(
    documents,
    embedder,
    persist_directory=str(PERSIST_DIRECTORY),
    client_settings=CHROMA_SETTINGS
)

## Hybrid Retriever Setup 🔍
In this section, we set up a hybrid retriever that combines multiple retrieval methods to answer user queries effectively.

### Key Components:

- **DB Retriever**: 
    - Uses the Chroma vector database (`db`) for efficient vector-based retrieval, leveraging the embeddings stored in the database for document search.
    
- **BM25 Retriever**:
    - A classical information retrieval technique based on the BM25 algorithm, which ranks documents according to their relevance to the query. The `k=5` limits the number of retrieved documents to 5.
    
- **Contextual Compression Retriever**:
    - The **Contextual Compression Retriever** wraps a base retriever with a document compressor. It first retrieves candidate documents with the base retriever (in this notebook, the Chroma `db_retriever`) and then passes them to the compressor (here, a `CrossEncoderReranker`), which keeps only the most relevant ones.
    - **Purpose**: The primary goal of the Contextual Compression Retriever is to enhance the relevance of the returned documents by considering the **context of the query**. This matters for complex queries where context and intent are more important than keyword matches. By reranking, it selects the documents that best align with the user's intent.
    - **Key Feature**: The reranker keeps the **top 3 documents** (`top_n=3`), so only the most relevant candidates are passed on.

- **Reranking**:
    - **Reranking** is the step in which documents returned by an initial retriever (in this notebook, the Chroma retriever) are scored by a more sophisticated model and reordered. It refines relevance by considering the semantic context of both the query and the documents, not only keyword matches.
    - **How It Works**: The reranking model (`BAAI/bge-reranker-base`) is a transformer-based cross-encoder that reads the query together with each document and scores how well that document answers the question. The documents are then reordered by this score, so the final ranking reflects the **contextual meaning** of the query rather than keyword overlap.
    - **Purpose**: The goal of reranking is to improve the quality of the retrieved documents by applying a more **context-aware model**, so that the documents returned are semantically aligned with the user's query.

- **Ensemble Retriever**:
    - Combines the Contextual Compression Retriever and the BM25 retriever into an ensemble. The weights apply in the order the retrievers are listed (`weights=[0.7, 0.3]`), so the reranked vector results carry a weight of 0.7 and the BM25 results a weight of 0.3. This lets the hybrid retriever draw on both classical and modern retrieval methods, balancing broad keyword coverage with contextual accuracy.
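
To my understanding, LangChain's `EnsembleRetriever` merges the ranked lists from its retrievers with weighted Reciprocal Rank Fusion. The sketch below shows that fusion rule on hypothetical chunk identifiers; the helper name `weighted_rrf` and the constant `c=60` are assumptions made for illustration, not a copy of the library code.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each document earns weight / (rank + c) per list."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two retrievers
reranked_hits = ["chunk_a", "chunk_b", "chunk_c"]
bm25_hits = ["chunk_c", "chunk_d", "chunk_a", "chunk_e", "chunk_f"]

fused = weighted_rrf([reranked_hits, bm25_hits], weights=[0.7, 0.3])
print(fused)
```

Documents found by both retrievers accumulate score from each list, so they tend to rise to the top of the fused ranking.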

### Output:
The ensemble retriever returns the most relevant documents by leveraging both traditional and advanced retrieval methods, combining the strengths of keyword-based ranking (BM25) and context-aware reranking. Reranking ensures that the documents returned are not only relevant based on keywords but also semantically aligned with the user's query, leading to more precise and contextually appropriate results.
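
For readers unfamiliar with BM25, here is a compact sketch of the scoring formula with whitespace tokenization. The `bm25_scores` helper, the toy corpus, and the parameter values `k1=1.5` and `b=0.75` are assumptions for illustration; libraries such as `rank-bm25` differ in details like the IDF variant and tokenization.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in corpus against query (whitespace tokenization)."""
    docs = [doc.lower().split() for doc in corpus]
    avg_len = sum(len(d) for d in docs) / len(docs)
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant; libraries differ in this detail
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

toy_corpus = [
    "multi-head attention attends to different subspaces",
    "the model is trained on translation data",
    "attention weights are computed with softmax",
]
scores = bm25_scores("multi-head attention", toy_corpus)
print(scores)
```

A document that shares no term with the query scores zero, which is the gap the embedding-based retriever is meant to cover.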


In [9]:
def hybrid_retrievers():
    # DB Retriever
    db_retriever = db.as_retriever()
    
    # BM25 Retriever
    sparse_retriever = BM25Retriever.from_documents(documents)
    sparse_retriever.k = 5

    # Adding Contextual Compression Retriever
    model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
    compressor = CrossEncoderReranker(model=model, top_n=3)
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=db_retriever
    )

    # Creating Ensemble Retriever
    ensemble_retriever = EnsembleRetriever(
        retrievers=[compression_retriever, sparse_retriever],
        weights=[0.7, 0.3]
    )
    
    return ensemble_retriever

## LLM Setup 🤖🔧

This function sets up a text generation pipeline using Hugging Face's transformer model, with integration into LangChain for enhanced capabilities.

### Key Components:

- **Model and Tokenizer Setup**:
    - Loads the pre-trained language model (`MODEL_ID`) and tokenizer for tokenizing text and generating outputs.
    - The model uses `AutoModelForCausalLM` and is set to use `bfloat16` precision for efficient memory usage.

- **Generation Configuration**:
    - Loads model-specific generation settings (e.g., `max_length`, `temperature`, `repetition_penalty`) to control the behavior of the generated text.

- **Text Generation Pipeline**:
    - A Hugging Face pipeline is set up for text generation with the model and tokenizer.
    - Configured with parameters like `max_length`, `temperature`, and `num_return_sequences` to guide output generation.

- **LangChain Cache**:
    - An in-memory cache (`InMemoryCache`) is set up for LangChain's LLM to store responses efficiently.

- **Return Hugging Face Pipeline**:
    - The Hugging Face pipeline is wrapped in LangChain’s `HuggingFacePipeline` for seamless integration and use in advanced tasks like retrieval-augmented generation.



In [10]:
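from functools import lru_cache

@lru_cache(maxsize=1)  # llm() is called for both chains; cache it so the model is loaded only once
def llm():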
    # Load the model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Load the generation configuration for the model
    generation_config = GenerationConfig.from_pretrained(MODEL_ID)

    # Create a text generation pipeline
    pipe = pipeline(
        "text-generation",
        model=model,
        device=device,
        tokenizer=tokenizer,
        max_length=MAX_LENGTH,
        temperature=TEMPERATURE,
        repetition_penalty=REPETITION_PENALTY,
        generation_config=generation_config,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Set up in-memory cache for LangChain's LLM
    langchain.llm_cache = InMemoryCache()

    # Return the HuggingFace pipeline wrapped in LangChain's HuggingFacePipeline
    return HuggingFacePipeline(pipeline=pipe)


## History-Aware Retriever 📜🔍

This section defines a **history-aware retriever** that processes chat history to refine the user's question and retrieve the most relevant documents or answers.

### Key Components:

- **Contextualize User Question**:
    - A system prompt is defined to formulate a standalone question from the chat history and the user's latest query. 
    - The prompt ensures that the user's query can be understood without needing the entire history, focusing on reformulating the question if needed.

- **Chat Prompt Template**:
    - The `ChatPromptTemplate` is created using the system prompt, chat history, and user input. It enables the model to interpret the context and refine the query for better retrieval.

- **History-Aware Retriever**:
    - The history-aware retriever is created by passing the **language model** (`llm()`), **hybrid retriever** (`hybrid_retrievers()`), and **contextualize question prompt** (`contextualize_q_prompt`) to the `create_history_aware_retriever` function. This combination allows the retriever to understand the context from previous messages and adjust the retrieval process accordingly.
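
The control flow can be sketched as follows: with an empty history the question goes straight to the retriever, otherwise it is first rewritten into a standalone question. The functions `toy_rewrite` and `toy_retrieve` are hypothetical stand-ins for the LLM call and the hybrid retriever; the sketch mirrors the behavior described above, not the library's internals.

```python
def history_aware_retrieve(question, chat_history, rewrite_fn, retrieve_fn):
    """Rewrite the question using the history (if any), then retrieve."""
    if not chat_history:
        return retrieve_fn(question)  # nothing to resolve, search as-is
    standalone = rewrite_fn(chat_history, question)
    return retrieve_fn(standalone)

# Hypothetical stand-ins for the LLM call and the hybrid retriever
def toy_rewrite(chat_history, question):
    last_topic = chat_history[-1]
    return question.replace("it", last_topic)

def toy_retrieve(query):
    return [f"chunk matching: {query}"]

print(history_aware_retrieve("How does it work?", [], toy_rewrite, toy_retrieve))
print(history_aware_retrieve("How does it work?", ["multi-head attention"], toy_rewrite, toy_retrieve))
```

The second call shows why the rewrite matters: a follow-up such as "How does it work?" carries no searchable topic until the pronoun is resolved from the history.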


In [11]:
def history_aware_retriever():
    # Define the system prompt to contextualize the user question
    contextualize_q_system_prompt = (
        "Given a chat history and the latest user question, "
        "which might or might not reference context in the chat history, "
        "check the query carefully and formulate a standalone question which can be understood "
        "without the chat history. Do NOT answer the question, "
        "just reformulate it if needed and otherwise return it as is."
    )

    # Create the chat prompt using the system prompt and user input
    contextualize_q_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", contextualize_q_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )

    # Create the history-aware retriever
    history_retriever = create_history_aware_retriever(
        llm(), 
        hybrid_retrievers(), 
        contextualize_q_prompt
    )
    
    return history_retriever


## Question Answering (QA) Prompt Setup ❓💬

This section defines the prompt template used to interact with the language model for question-answering tasks.

### Key Components:
- **System Prompt**: A system-level prompt defines the task for the assistant: to answer questions strictly using the provided context.
- **Chat Prompt Template**: The `ChatPromptTemplate` is used to manage user input and chat history to generate context-sensitive answers.

This is used to format the interaction with the language model in a structured manner, ensuring that the model responds only with contextually relevant information.


In [12]:
def qa_prompt():
    # Define the system prompt for efficient question-answering
    system_prompt = (
        "You are an assistant that answers questions based strictly on the provided context. "
        "If the answer is not clear or not present in the context, respond with 'I don't know'. "
        "Do not provide any additional information beyond the context provided. "
        "Use the context below to answer the user's question:\n\n"
        "{context}"
    )

    # Create the QA prompt with improved instructions
    qa_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )

    return qa_prompt


## Session History Management 🗂️🔒

This function manages the chat history for a given session. It stores and retrieves chat messages to ensure that the conversation history is maintained.

### Key Functionality:
1. **Session History**: A dictionary (`session_store`) is used to store the chat history for each session.
2. **History Retrieval**: The `get_session_history()` function retrieves the conversation history for a specific session (either existing or default).

This ensures that the assistant can recall previous interactions and provide contextually relevant answers.


In [13]:
def get_session_history(session: str) -> ChatMessageHistory:
    # Check if the session exists, if not, create a new one
    if session not in session_store:
        session_store[session] = ChatMessageHistory()
    
    # Return the chat history for the given session
    return session_store[session]


## Conversational Chain Setup 🔗💬

In this section, we set up the full conversational chain that connects the retrieval mechanism and the language model for generating answers.

### Key Components:
- **Retrieval Chain**: Combines the hybrid retriever with the question-answering chain to create a comprehensive retrieval-based question-answering system.
- **Session-Aware Chain**: The `RunnableWithMessageHistory` integrates session history, allowing the model to keep track of previous interactions.

This creates a seamless conversation flow, where the system can answer based on both previous chat history and relevant retrieved documents.
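
What a session-aware wrapper does can be sketched in plain Python: look up the history for the session, pass it to the chain together with the new input, then record the new turn. `run_with_history` and `toy_chain` are hypothetical names used only for this illustration; `RunnableWithMessageHistory` handles the same bookkeeping with more generality.

```python
def run_with_history(chain_fn, store, session_id, user_input):
    """Call chain_fn with the stored history, then record the new turn."""
    history = store.setdefault(session_id, [])
    answer = chain_fn({"input": user_input, "chat_history": list(history)})
    history.append(("human", user_input))
    history.append(("ai", answer))
    return answer

def toy_chain(inputs):
    """Hypothetical chain: reports how many earlier messages it was given."""
    return f"seen {len(inputs['chat_history'])} earlier messages"

toy_store = {}
print(run_with_history(toy_chain, toy_store, "default_session", "first question"))
print(run_with_history(toy_chain, toy_store, "default_session", "second question"))
```

Because the history grows by two messages per turn, every later question is answered with a longer prompt than the one before it.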


In [14]:
# Create the stuff documents chain for QA
question_answer_chain = create_stuff_documents_chain(llm(), qa_prompt())
   
# Create the retrieval chain with the history-aware retriever
rag_chain = create_retrieval_chain(history_aware_retriever(), question_answer_chain)
    
# Create a Runnable with message history to allow conversation state management
conversational_rag_chain = RunnableWithMessageHistory(
        rag_chain,
        get_session_history,
        input_messages_key="input",      # Key for the input messages
        history_messages_key="chat_history",  # Key for the historical messages
        output_messages_key="answer"      # Key for the response
    )


## Main Loop Interaction 🧑‍💻

The main loop allows continuous interaction with the assistant. It listens for user input, processes the input using the conversational chain, and returns an answer based on the retrieved context.

### Key Functionality:
1. **User Input**: The loop takes user input, which is passed to the conversational chain.
2. **Response Generation**: The response is generated based on the current session's context, including both the user input and retrieved documents.
3. **Exit Condition**: If the user types 'exit', the loop terminates.

The assistant responds to the user's queries, using both previous conversations and retrieved documents to provide a relevant and context-aware answer.
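
The last step of the loop trims role markers such as `Assistant:` from the generated text, presumably because the output can echo parts of the prompt. That post-processing can be isolated in a small helper; `extract_final_answer` and the sample string are hypothetical and only mirror the string splitting done in the loop.

```python
def extract_final_answer(raw_text):
    """Keep only the text after the last role marker in the generated output."""
    for marker in ("Assistant:", "System:", "Human:"):
        raw_text = raw_text.split(marker)[-1]
    return raw_text.strip()

# Hypothetical raw output that echoes the prompt before the answer
raw = "System: Use the context.\nHuman: What is attention?\nAssistant: A weighting mechanism."
print(extract_final_answer(raw))
```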


In [15]:
# Main loop for continuous interaction

print("Selected PDF: ", selected_pdf)
while True:
    user_input = input("\nYour question (or type 'exit' to quit): ")
    
    if user_input.lower() == 'exit':
        print("Exiting the conversation. Bye!")        
        break

    # Retrieve the session's message history
    session_history = get_session_history(session_id)

    # Invoke the conversational RAG chain with the user input
    response = conversational_rag_chain.invoke(
        {"input": user_input},
        config={"configurable": {"session_id": session_id}},
    )
    
    answer = response['answer']
    assistant_response = answer.split("Assistant:")[-1].strip()
    final_answer = assistant_response.split("System:")[-1].split("Human:")[-1].strip()
    print(final_answer)

Selected PDF:  attention.pdf



Your question (or type 'exit' to quit):  Introduction


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


According to the context provided, the dimensions of the input and output in the model are dmodel = 512 and dff = 2048, respectively.



Your question (or type 'exit' to quit):  Explain about Potential Limitations and Future Research Directions.



Based on the context provided, there is no information regarding potential limitations and future research directions in the given paper. Therefore, I cannot provide any insights on this topic.



Your question (or type 'exit' to quit):  what is Multi-Head Attention?


Multi-Head Attention is a technique used in the model to allow the attention mechanism to focus on different aspects of the input sequence simultaneously. It allows the model to jointly attend to information from different representation subspaces at different positions. However, this concept is not explained in detail in the provided paper.



Your question (or type 'exit' to quit):  exit


Exiting the conversation. Bye!


# Disclaimer for Using Large Language Models
Please be aware that while Large Language Models are powerful tools for text generation, they may sometimes produce results that are unexpected, biased, or inconsistent with the given prompt. It's advisable to carefully review the generated text and consider the context and application in which you are using these models.

For detailed information on each model's capabilities, licensing, and attribution, please refer to the respective model cards:

1. **NousResearch-llama-2-7b-chat-hf** 
    - Model Card: https://huggingface.co/NousResearch/Llama-2-7b-chat-hf

Usage of these models must also adhere to the licensing agreements and be in accordance with ethical guidelines and best practices for AI. If you have any concerns or encounter issues with the models, please refer to the respective model cards and documentation provided in the links above. To the extent that any public or non-Intel datasets or models are referenced by or accessed using these materials those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.

Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.

Intel’s provision of these resources does not expand or otherwise alter Intel’s applicable published warranties or warranty disclaimers for Intel products or solutions, and no additional obligations, indemnifications, or liabilities arise from Intel providing such resources. Intel reserves the right, without notice, to make corrections, enhancements, improvements, and other changes to its materials.

# Limitations

There are a few limitations that users may encounter when using the assistant. These limitations are as follows:

1. **Token Length**: The `MAX_LENGTH` parameter in the code is set to the maximum token limit supported by the model (NousResearch-llama-2-7b-chat-hf). Each prompt contains the retrieved content and the chat history in addition to the question, so the limit is typically reached after 3-4 queries in a session. If you need to process more queries or handle larger content, consider using a model with a longer context window; a larger parameter count does not by itself raise the token limit.

2. **Hallucination**: Occasionally, the assistant may respond with "I don't know" even when the question is derived from the corpus. This can happen when the relevant passage is not among the retrieved chunks, so the model cannot find the information in the context it is given. To work around this, try restarting the kernel and resubmitting your queries, which clears the chat history and gives the model a fresh context.

3. **Out of Memory**: If you encounter an "Out of Memory" error, the device does not have enough memory left to generate a response, which typically happens when the prompt (history plus retrieved content) has grown very long. In this case, try restarting the kernel and resubmitting your queries. If the problem persists, consider a smaller model or a device with more memory; a model with more parameters needs more memory, not less.
