<center><img src="https://upload.wikimedia.org/wikipedia/commons/e/e9/4_RGB_McCombs_School_Brand_Branded.png" width="300" height="100"/>
  <img src="https://mma.prnewswire.com/media/1458111/Great_Learning_Logo.jpg?p=facebook" width="200" height="100"/></center>

<center><font size=10>Generative AI for Business Applications</font></center>
<center><font size=6>Retrieval Augmented Generation - Week 3</font></center>

<center><img src="https://i.ibb.co/pBF9nKpf/apple.png" width="720"></center>

<center><font size=6>Apple HBR Report Document Q&A</font></center>

# Problem Statement

## Business Context

As organizations grow and scale, they are often inundated with large volumes of data, reports, and documents that contain critical information for decision-making. In real-world business settings, such as venture capital firms like Andreessen Horowitz, business analysts are required to sift through large datasets, research papers, or reports to extract relevant information that impacts strategic decisions.

For instance, suppose you've just joined Andreessen Horowitz, a renowned venture capital firm, and are tasked with analyzing a dense report like the Harvard Business Review's **"How Apple is Organized for Innovation."** Going through such reports manually becomes extremely time-consuming as their size and complexity increase. However, by using **Semantic Search** and **Retrieval-Augmented Generation (RAG)**, you can significantly streamline this process.

Imagine having the capability to directly ask questions like, “How does Apple structure its teams for innovation?” and get immediate, relevant answers drawn from the report. This ability to extract and organize specific insights quickly and accurately enables you to focus on higher-level analysis and decision-making, rather than being bogged down by information retrieval.

## Objective

The goal is to develop a RAG application that helps business analysts efficiently extract key insights from extensive reports, such as “How Apple is Organized for Innovation.”

Specifically, the system aims to:

- Answer user queries by retrieving relevant content directly from lengthy documents.

- Support natural-language interaction without requiring a full manual read-through.

- Act as an intelligent assistant that streamlines the report analysis process.

Through this solution, analysts can save time, improve productivity, and make faster, more informed strategic decisions.

## Data Description

**How Apple is Organized for Innovation** - An 11-page article in PDF format

# Installing and Importing the Necessary Libraries

In [None]:
# Installation for GPU llama-cpp-python
# run the following line if a GPU is being used (comment it out otherwise)
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

# Installation for CPU llama-cpp-python
# uncomment and run the following code in case GPU is not being used
# !CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

**Note**:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in ***this notebook***.

In [None]:
# For installing the libraries & downloading models from HF Hub
!pip install -q pandas \
            tiktoken \
            pymupdf \
            langchain \
            langchain-community \
            chromadb \
            sentence-transformers \
            datasets

**Note**:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in ***this notebook***.

In [None]:
import json
import os
import tiktoken

import pandas as pd

#Libraries for Loading Data, Chunking, Embedding, and Vector Databases
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Loading the data

In [None]:
# uncomment and run the below code snippets if the dataset is present in the Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
pdf_file = "/content/HBR_How_Apple_Is_Organized_For_Innovation.pdf"

In [None]:
pdf_loader = PyMuPDFLoader(pdf_file)

### Downloading and loading the LLM

We are going to download and use the Mistral-7B-Instruct model, which has roughly 7 billion parameters, in a quantized GGUF format (the `Q2_K` variant). The file is a few gigabytes in size, so a good internet connection is recommended for the download, and a GPU is recommended for generating responses.

In [None]:
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
model_basename = "mistral-7b-instruct-v0.1.Q2_K.gguf" # the model is in gguf format

In [None]:
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)

The given code downloads a model file from **Hugging Face Hub** using the `hf_hub_download` function. Here's a breakdown:

- `hf_hub_download(...)`: Fetches a file from the **Hugging Face Model Hub**.
- `repo_id=model_name_or_path`: Specifies the **repository ID** (i.e., the model's name or path on Hugging Face).
- `filename=model_basename`: Specifies the **file name** to download from the model repository.

This is typically used to **download pre-trained models**, embeddings, or other necessary files from Hugging Face for tasks like **text generation, embeddings, or fine-tuning**.

In [None]:
# Run the snippet below if the runtime is connected to a GPU (comment it out on CPU-only runtimes).
lcpp_llm = Llama(
    model_path=model_path,
    n_ctx=2300,
    n_gpu_layers=38,
    n_batch=512
)

In [None]:
# # uncomment the below snippet of code if the runtime is connected to CPU only.
# lcpp_llm = Llama(
#    model_path=model_path,
#    n_ctx=2300
# )

The given code loads the downloaded model for local inference through the `Llama` class of `llama-cpp-python`. Here's a breakdown of each parameter:

- **`Llama(...)`**: Loads a **GGUF model** (here, Mistral-7B-Instruct) for text generation.
- **`model_path=model_path`**: Specifies the **file path** of the downloaded model (from Hugging Face or another source).
- **`n_ctx=..`**: Sets the **context window** (i.e., the maximum number of tokens, prompt plus generated output, that the model can process at once).
- **`n_batch=..`**: Defines the **batch size** for processing prompt tokens. A higher value can improve speed but requires more memory.
- **`n_gpu_layers=..`**: Determines how many **layers** are offloaded to the **GPU**. Adjust this based on available **VRAM**.

This setup runs the model **locally**, leveraging both **CPU and GPU** for inference. The parameters should be adjusted based on **hardware constraints** (CPU, GPU, and RAM availability).
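Because `n_ctx` caps the prompt and the generated answer together, it can help to check that the retrieved context will fit. The following is a rough sketch, assuming the settings used later in this notebook (3 retrieved chunks of at most 512 tokens each, and `max_tokens=512`). The helper `remaining_prompt_budget` is hypothetical, and chunk sizes in this notebook are measured with `cl100k_base` rather than the model's own tokenizer, so real token counts will differ somewhat.

```python
def remaining_prompt_budget(n_ctx, k, chunk_size, max_tokens):
    """Tokens left for the system message and question after reserving
    room for k retrieved chunks and the generated answer."""
    reserved = k * chunk_size + max_tokens
    return n_ctx - reserved

# Values taken from this notebook's settings (approximate, see note above).
budget = remaining_prompt_budget(n_ctx=2300, k=3, chunk_size=512, max_tokens=512)
print(budget)  # 252
```

A small or negative result suggests lowering `k`, the chunk size, or `max_tokens`, or raising `n_ctx`.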

# Question Answering using Base model

## Generation Function

In [None]:
def generate_response(user_input, llm):

    # Querying the LLM
    try:
        response = llm(
                prompt=user_input,
                max_tokens=512,
                temperature=0.4,
                top_p=0.95,
                repeat_penalty=1.2,
                top_k=25,
                stop=['INST'],
                echo=False
                )

        prediction =  response["choices"][0]["text"]

    except Exception as e:
        prediction = f'Sorry, I encountered the following error: \n {e}'

    return  prediction

### Question 1: Who are the authors of this article and who published this article ?

In [None]:
question_1 = "Who are the authors of this article and who published this article ?"
base_answer_1=generate_response(question_1,lcpp_llm)
print(base_answer_1)

### Question 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [None]:
question_2 = "List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines."
base_answer_2=generate_response(question_2,lcpp_llm)
base_answer_2

### Question 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [None]:
question_3 = "Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?"
base_answer_3=generate_response(question_3,lcpp_llm)
base_answer_3

# Retrieval Augmented Generation Implementation

### Split the Loaded PDF into Chunks for Further Processing

In [None]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=512,
    chunk_overlap=20
)

The given code initializes a **RecursiveCharacterTextSplitter** to split the text into manageable chunks for embedding and retrieval. Here's a breakdown:

- `RecursiveCharacterTextSplitter.from_tiktoken_encoder(...)`: Uses a **tiktoken encoding** to measure chunk length in tokens rather than characters.
- `encoding_name='cl100k_base'`: Specifies the **tiktoken encoding** (used by OpenAI models like GPT-4 and GPT-3.5).
- `chunk_size=512`: Each text chunk will have a maximum of **512 tokens**.
- `chunk_overlap=20`: Allows an **overlap** of up to 20 tokens between consecutive chunks to preserve context.

This approach ensures that text is split **intelligently** while maintaining **semantic meaning** for better retrieval and embeddings.
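To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window splitting with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which splits recursively on separators (paragraphs, lines, words) and merges the pieces up to the size limit. The function `split_with_overlap` is hypothetical, and whitespace-separated words stand in for tiktoken tokens.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens, where each
    window repeats the last chunk_overlap tokens of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

# Whitespace "tokens" stand in for tiktoken tokens in this toy example.
tokens = "a b c d e f g h i j".split()
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each printed window starts with the last token of the previous window, which is the overlap that carries context across chunk boundaries.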

In [None]:
document_chunks = pdf_loader.load_and_split(text_splitter)

(Note: Expect that the above cell will take time to execute).

Let's take a look at consecutive chunks from the document.

In [None]:
i = 5
document_chunks[i]

In [None]:
document_chunks[i+1]

As we can see there is some overlap between the chunks. This improves the coherence and relevance of retrieved results, as the model can better understand the relationship between adjacent parts of the document. It also helps in maintaining the flow of ideas and ensuring that critical context is available when generating answers, leading to more accurate and contextually consistent outputs.
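The overlap can also be checked programmatically instead of by eye. The sketch below uses a hypothetical helper, `shared_overlap`, that returns the longest suffix of one text that is also a prefix of the next; the two strings are toy stand-ins for chunk contents.

```python
def shared_overlap(previous_text, next_text):
    """Return the longest suffix of previous_text that is also a prefix of next_text."""
    max_len = min(len(previous_text), len(next_text))
    for size in range(max_len, 0, -1):
        if previous_text.endswith(next_text[:size]):
            return next_text[:size]
    return ""

# Toy strings standing in for two consecutive chunks.
first = "the first chunk ends with shared words"
second = "shared words begin the second chunk"
print(repr(shared_overlap(first, second)))  # 'shared words'
```

In this notebook it could be called as `shared_overlap(document_chunks[i].page_content, document_chunks[i+1].page_content)`. An empty result is plausible when the two chunks come from different pages, assuming the loader produces one document per page and each page is split separately.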


### Generate Vector Embeddings for Text Chunks

In [None]:
embedding_model = SentenceTransformerEmbeddings(model_name='thenlper/gte-large')

Now that we have chunked the raw input, **we can present these chunks to an embedding model and then store the generated embeddings into a vector database.**
  - We generate a vector for each chunk and save this chunk along with the vector representation in a specialized database.
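The idea of storing each chunk together with its vector can be illustrated without any external library. This is only a toy sketch: `toy_embed` is a hypothetical stand-in that counts characters into buckets, whereas a real embedding model such as `gte-large` produces learned vectors, and a JSON file stands in for the vector database.

```python
import json
import os
import tempfile

def toy_embed(text, dims=8):
    """Hypothetical stand-in for an embedding model: counts characters into a
    fixed number of buckets. Real models produce learned semantic vectors."""
    vector = [0.0] * dims
    for ch in text.lower():
        vector[ord(ch) % dims] += 1.0
    return vector

chunks = ["first chunk of text", "second chunk of text"]

# One record per chunk: the text and its vector are kept side by side.
records = [{"text": chunk, "vector": toy_embed(chunk)} for chunk in chunks]

# Persist the records to disk and load them back, as a vector database would.
path = os.path.join(tempfile.mkdtemp(), "toy_store.json")
with open(path, "w") as f:
    json.dump(records, f)
with open(path) as f:
    restored = json.load(f)

print(len(restored), len(restored[0]["vector"]))
```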

### Creating a Vector Database

In [None]:
out_dir = 'apple_db'

if not os.path.exists(out_dir):
  os.makedirs(out_dir)

In [None]:
vectorstore = Chroma.from_documents(
    document_chunks,
    embedding_model,
    persist_directory=out_dir)

The given code initializes a **vector database** (also called a **vector store**) using **Chroma**, a popular open-source vector database, to store **document embeddings** for retrieval in a **Retrieval-Augmented Generation (RAG)** system. Here's a breakdown of what each part does:

- **`vectorstore = Chroma.from_documents(...)`**
   - This creates a **Chroma vector store** from a set of **document chunks**. Chroma is used to store and retrieve embeddings efficiently.

- **Parameters Passed to `Chroma.from_documents()`**
   - `document_chunks`: A list of **text chunks** (split portions of a document) that will be converted into embeddings.
   - `embedding_model`: The model responsible for **embedding** the document chunks into vector representations. Common choices include OpenAI’s embeddings, Sentence Transformers, or other dense vector models.
   - `persist_directory=out_dir`: The directory where the vector store is saved on disk so that it can be reloaded later without recomputing the embeddings.

In [None]:
vectorstore = Chroma(persist_directory=out_dir,embedding_function=embedding_model)

### Retrieval

We will now create a retriever that takes an input query and retrieves the top-$k$ most relevant documents from the vector store.

- Under the hood, a similarity score is computed between the embedded query and the chunks in the database.
- The top $k$ chunks with the highest similarity scores are then returned.
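The two steps above can be sketched as a brute-force search. This is an illustration only: Chroma relies on its own index and configured distance metric rather than this loop, and the two-dimensional vectors and the helper names `cosine_similarity` and `top_k` are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, chunk_vectors, k):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vec), idx)
        for idx, vec in enumerate(chunk_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy 2-dimensional "embeddings" for four chunks and one query.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]
query_vector = [1.0, 0.1]
print(top_k(query_vector, chunk_vectors, k=2))  # [0, 2]
```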

In [None]:
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 3}
)

The given code initializes a **retriever** from the **Chroma vector store** to fetch similar documents based on embeddings. Here's a breakdown:

- `vectorstore.as_retriever(...)`: Converts the **Chroma vector store** into a retriever for querying.
- `search_type='similarity'`: Specifies that retrieval is based on **embedding similarity** (cosine similarity or whichever distance metric the Chroma collection is configured with).
- `search_kwargs={'k': 3}`: Retrieves the **top 3 most similar** chunks for a given query.

This allows for **efficient information retrieval**, where the retriever finds the most relevant document chunks based on their **semantic similarity** to a user's query.

#### **Retrieving the Relevant Documents**

Let's ask a simple query and see what document chunks are returned based on the similarity search.

In [None]:
user_input = "How does Apple develop and ship products that require good coordination between the teams?"

relevant_document_chunks = retriever.get_relevant_documents(user_input)

In [None]:
len(relevant_document_chunks)

In [None]:
for document in relevant_document_chunks:
    print(document.page_content.replace("\t", " "))

In [None]:
len(relevant_document_chunks[0].page_content)

The retrieved chunks are related to the user query and may contain the answer.

### Generation

#### Designing the System Prompt

Designing the prompt is a crucial part of building a RAG-based system. The prompt consists of two main parts:

- system message: The instruction given to the LLM.
- user message template: A template that holds the context retrieved from the document chunks and the user query.
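How these two parts combine into a single prompt can be sketched as follows. The strings here are shortened toy stand-ins for the notebook's actual messages, `build_prompt` is a hypothetical helper, and the `[INST] ... [/INST]` wrapper follows the instruction format used with Mistral-Instruct models.

```python
# Shortened stand-ins for the system message and user message template.
system_message = "Answer only from the ###Context. If the answer is not there, say \"I don't know\"."

user_message_template = """
###Context
{context}

###Question
{question}
"""

def build_prompt(system_message, chunks, question):
    """Join the retrieved chunks into one context string and fill the template."""
    context = ". ".join(chunk.replace("\t", " ") for chunk in chunks)
    user_message = user_message_template.format(context=context, question=question)
    return f"[INST]{system_message}\n{user_message}[/INST]"

# Toy chunks standing in for retrieved document passages.
toy_chunks = ["First retrieved passage", "Second retrieved passage"]
prompt = build_prompt(system_message, toy_chunks, "What do the passages say?")
print(prompt)
```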

In [None]:
qna_system_message = """
You are an assistant whose work is to give answers to questions with respect to a context.
User input will have the context required by you to answer user questions.

This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Strictly answer only using the information provided in the ###Context.
Do not mention anything about the information in ###Context or the question in ###Question in your final answer.

If the answer to ###Question cannot be derived from the ###Context, just respond by saying "I don't know".

Remember that the answer to ###Question might not always be directly present in the information provided in the ###Context.
The answer can be indirectly derived from the information in ###Context.

"""

**Note**: It is important to specify that the LLM should not attempt to answer the question if the context provided (retrieved from the knowledge base provided) doesn't contain the information required. We don't want the LLM to use the knowledge from its training data and/or hallucinate to share a "seemingly correct" answer.

In [None]:
qna_user_message_template = """
Consider the following ###Context and ###Question
###Context
{context}

###Question
{question}
"""

### Defining the function for generating responses

Let's create a function that takes a user query and an LLM as input, finds the relevant chunks, and uses them as context to generate an answer.

In [None]:
def generate_rag_response(user_input , llm):
    """
    Args:
        user_input: Takes a user input for which the response should be retrieved from the vectorDB.
        llm: The LLM to be used for generating the response
    Returns:
        The generated response based on the user query and the context from the knowledge base
    """
    relevant_document_chunks = retriever.get_relevant_documents(user_input)
    context_list = [d.page_content.replace("\t", " ") for d in relevant_document_chunks]
    context_for_query = ". ".join(context_list)



    # Combine user_prompt and system_message to create the prompt
    prompt = f"""[INST]{qna_system_message}\n
                {'user'}: {qna_user_message_template.format(context=context_for_query, question=user_input)}
                [/INST]"""


    # Querying the LLM
    try:
        response = llm(
                prompt=prompt,
                max_tokens=512,
                temperature=0.4,
                top_p=0.95,
                repeat_penalty=1.2,
                top_k=25,
                stop=['INST'],
                echo=False
                )

        prediction =  response["choices"][0]["text"]

    except Exception as e:
        prediction = f'Sorry, I encountered the following error: \n {e}'

    return  prediction

Let's try this function on the same three questions we asked the base model and see whether it can answer them from the document.

# Question Answering using RAG

### Question 1: Who are the authors of this article and who published this article ?

In [None]:
question_1 = "Who are the authors of this article and who published this article ?"
rag_answer_1=generate_rag_response(question_1,lcpp_llm)
print(rag_answer_1)

- The answer is clear, concise, and focused, without any unnecessary information.  

- For queries like this, we expect a response of this nature.

### Question 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [None]:
question_2 = "List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines."
rag_answer_2=generate_rag_response(question_2,lcpp_llm)
print(rag_answer_2)

### Question 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [None]:
question_3 = "Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?"
rag_answer_3=generate_rag_response(question_3,lcpp_llm)
print(rag_answer_3)

# Output Evaluation

**Why Do We Need Output Evaluation in a RAG-Based System?**  
Output evaluation in a **Retrieval-Augmented Generation (RAG) system** is essential to ensure that the system produces **accurate, relevant, and reliable** responses. Since RAG systems rely on both **retrieval** and **generation**, issues like **irrelevant retrieval, hallucinations, or poor contextual grounding** can degrade the output quality. Evaluating the outputs helps in:  

- **Measuring Groundedness** - Ensuring that the generated response is **faithfully derived** from the retrieved documents.  
- **Assessing Relevance** - Checking if the retrieved information directly answers the user’s query.    


## LLM as a Judge

In [None]:
groundedness_rater_system_message = """
You will be given a ###Question, ###Context, and an AI-generated ###Answer.

Your task: Rate how well the ###Answer is derived from the ###Context.

Use the following scale from 1 to 5:
1 = Not derived at all
2 = Derived to a limited extent
3 = Derived to a good extent
4 = Derived mostly
5 = Derived completely

Return only the score, in a dictionary format (not JSON). The score must be in the range 1 to 5.
Example {groundedness_score:4}
"""


This prompt is designed to evaluate the groundedness of the AI-generated answer, i.e., how well the answer is derived from the provided context. It asks the LLM judge to compare the answer with the context and rate it on a scale of 1 to 5, where:

- 1 indicates no derivation from the context, and
- 5 indicates complete derivation from the context.

This helps in assessing whether the model is hallucinating or if its response is factually accurate and grounded in the retrieved information.

In [None]:
relevance_rater_system_message = """
You will be given a ###Question, ###Context, and an AI-generated ###Answer.

Your task: Rate how relevant the ###Answer is to the ###Question, based on the ###Context.

Use the following scale from 1 to 5:
1 = Not relevant at all
2 = Slightly relevant, misses key aspects
3 = Moderately relevant, addresses some parts but misses important details
4 = Mostly relevant, covers key aspects with minor gaps
5 = Fully relevant, directly answers all important aspects with details from Context

Return only the score, in a dictionary format (not JSON). The score must be in the range 1 to 5.
Example {relevance_score:4}
"""


This prompt is focused on evaluating the relevance of the generated answer. It checks if the answer addresses the main aspects of the question using the provided context. The rating is again on a scale of 1 to 5, where:

- 1 indicates the answer is irrelevant, and
- 5 indicates it is completely relevant to the question.

This ensures that the output is not only accurate but also contextually appropriate and directly answers the user's query.

In [None]:
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""

In [None]:
def generate_ground_relevance_response(user_input, answer, llm):
    # Retrieve relevant document chunks
    relevant_document_chunks = retriever.get_relevant_documents(query=user_input)
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_query = ". ".join(context_list)


    # Build the groundedness-rating prompt from the rater system message and the filled user template
    groundedness_prompt = f"""[INST]{groundedness_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    # Build the relevance-rating prompt from the rater system message and the filled user template
    relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    response_1 = llm(
            prompt=groundedness_prompt,
            max_tokens=1024,
            temperature= 0.3,
            top_p= 0.95,
            top_k= 50,
            stop=None,
            echo=False
            )

    response_2 = llm(
            prompt=relevance_prompt,
            max_tokens= 1024,
            temperature= 0.3,
            top_p= 0.95,
            top_k= 50,
            stop=None,
            echo=False
            )

    return response_1['choices'][0]['text'],response_2['choices'][0]['text']

### **Evaluation 1: Base Prompt Response Evaluation**

In [None]:
groundedness_report_1, relevance_report_1 = generate_ground_relevance_response(question_1,base_answer_1,lcpp_llm)

In [None]:
print(groundedness_report_1, '\n\n', relevance_report_1)

In [None]:
groundedness_report_2, relevance_report_2 = generate_ground_relevance_response(question_2,base_answer_2,lcpp_llm)

In [None]:
print(groundedness_report_2, '\n\n', relevance_report_2)

In [None]:
groundedness_report_3, relevance_report_3 = generate_ground_relevance_response(question_3,base_answer_3,lcpp_llm)

In [None]:
print(groundedness_report_3, '\n\n', relevance_report_3)

Even with a strict prompt instructing the model to return a dictionary, it does not do so consistently across all answers. This may be due to the model's limited instruction-following capability or to the heavy quantization of the file used here (`Q2_K`, stored in GGUF format), which shrinks the model so that it can run on modest hardware but can reduce output quality.
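Since the judge's output format varies, one practical option is to extract the score with a tolerant parser instead of relying on an exact dictionary. The sketch below is one possible heuristic, not part of the original pipeline: the hypothetical `extract_score` first looks for a digit following the word "score" and otherwise falls back to the last standalone digit from 1 to 5.

```python
import re

def extract_score(judge_output):
    """Return an integer score from 1 to 5 found in the judge's reply, or None."""
    # Prefer a digit that directly follows the word "score" (e.g. "{relevance_score:4}").
    keyed = re.search(r"score\W*([1-5])(?!\d)", judge_output, flags=re.IGNORECASE)
    if keyed:
        return int(keyed.group(1))
    # Otherwise fall back to the last standalone digit between 1 and 5.
    standalone = re.findall(r"(?<!\d)[1-5](?!\d)", judge_output)
    return int(standalone[-1]) if standalone else None

print(extract_score("{groundedness_score:4}"))
print(extract_score("The answer is mostly derived. Score: 3"))
print(extract_score("I cannot rate this."))
```

A `None` result signals that the reply contained no usable score, so the evaluation for that answer can be retried or flagged for manual review.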

### **Evaluation 2: RAG Response Evaluation**

In [None]:
groundedness_report_1, relevance_report_1 = generate_ground_relevance_response(question_1,rag_answer_1,lcpp_llm)

In [None]:
print(groundedness_report_1, '\n\n', relevance_report_1)

In [None]:
groundedness_report_2, relevance_report_2 = generate_ground_relevance_response(question_2,rag_answer_2,lcpp_llm)

In [None]:
print(groundedness_report_2, '\n\n', relevance_report_2)

In [None]:
groundedness_report_3, relevance_report_3 = generate_ground_relevance_response(question_3,rag_answer_3,lcpp_llm)

In [None]:
print(groundedness_report_3, '\n\n', relevance_report_3)

The results from the LLM Judge show that responses generated using RAG consistently outperform those without RAG across both evaluation metrics: groundedness and relevance.

# **Conclusion**

- We've learned how to create a Retrieval-Augmented Generation (RAG) based application that can perform Q&A from documents for accurate information retrieval.
    - First, we chunked the data to create multiple splits with overlaps.
    - Then we used embedding models to encode the different data splits.
    - Then we stored these embeddings in a vector database.
    - Then we set up an LLM that takes the user query along with the relevant retrieved chunks as context.
    - Finally, we assembled all these components to build the RAG-based system.
- We've also learned how to evaluate the output of a RAG-based system using the LLM-as-a-Judge technique to check the groundedness and relevance of the generated output.
- Lastly, we compared the output of a standalone LLM with that of a RAG-based system and examined the differences in groundedness and relevance between the two.

<font size = 6 color ='#4682B4' > Power Ahead </font>
___