<center><img src="https://upload.wikimedia.org/wikipedia/commons/e/e9/4_RGB_McCombs_School_Brand_Branded.png" width="300" height="100"/>
  <img src="https://mma.prnewswire.com/media/1458111/Great_Learning_Logo.jpg?p=facebook" width="200" height="100"/></center>

<center><font size=10>Generative AI for Business Applications</font></center>
<center><font size=6>Retrieval Augmented Generation - Week 3</font></center>

<center><img src="https://i.ibb.co/pBF9nKpf/apple.png" width="720"></center>

<center><font size=6>Apple HBR Report Document Q&A</font></center>

# Problem Statement

## Business Context

As organizations grow and scale, they are often inundated with large volumes of data, reports, and documents that contain critical information for decision-making. In real-world business settings, such as venture capital firms like Andreessen Horowitz, business analysts are required to sift through large datasets, research papers, or reports to extract relevant information that impacts strategic decisions.

For instance, suppose you have just joined Andreessen Horowitz, a renowned venture capital firm, and are tasked with analyzing a dense report such as the Harvard Business Review's **"How Apple is Organized for Innovation."** Reading such reports manually becomes extremely time-consuming as their size and complexity increase. However, by using **Semantic Search** and **Retrieval-Augmented Generation (RAG)**, you can significantly streamline this process.

Imagine having the capability to directly ask questions like, “How does Apple structure its teams for innovation?” and get immediate, relevant answers drawn from the report. This ability to extract and organize specific insights quickly and accurately enables you to focus on higher-level analysis and decision-making, rather than being bogged down by information retrieval.

## Objective

The goal is to develop a RAG application that helps business analysts efficiently extract key insights from extensive reports, such as “How Apple is Organized for Innovation.”

Specifically, the system aims to:

- Answer user queries by retrieving relevant content directly from lengthy documents.

- Support natural-language interaction without requiring a full manual read-through.

- Act as an intelligent assistant that streamlines the report analysis process.

Through this solution, analysts can save time, improve productivity, and make faster, more informed strategic decisions.

## Data Description

**How Apple is Organized for Innovation** - An 11-page article in PDF format

# Installing and Importing the Necessary Libraries

In [None]:
# Install required libraries
!pip install -q langchain_community==0.3.27 \
              langchain==0.3.27 \
              chromadb==1.0.15 \
              pymupdf==1.26.3 \
              tiktoken==0.9.0 \
              ragas==0.3.0 \
              datasets==4.0.0 \
              evaluate==0.4.5

**Note**:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- When you run the cell above, you might see a warning about package dependencies. It can be ignored, because the pinned versions above are sufficient to run the code in ***this notebook***.

In [None]:
# Import core libraries
import os                                                                       # Interact with the operating system (e.g., set environment variables)
import json                                                                     # Read/write JSON data

# Import libraries for working with PDFs and OpenAI
from langchain_community.document_loaders import PyMuPDFLoader                  # Load and extract text from PDF files
from openai import OpenAI                                                       # Access OpenAI's models and services

# Import libraries for processing dataframes and text
import tiktoken                                                                 # Tokenizer used for counting and splitting text for models
import pandas as pd                                                             # Load, manipulate, and analyze tabular data

# Import LangChain components for data loading, chunking, embedding, and vector DBs
from langchain.text_splitter import RecursiveCharacterTextSplitter              # Break text into overlapping chunks for processing
from langchain_community.embeddings import OpenAIEmbeddings                     # Create vector embeddings using OpenAI's models  # type: ignore
from langchain_community.vectorstores import Chroma                             # Store and search vector embeddings using Chroma DB  # type: ignore

# Import components to run evaluation on RAG pipeline outputs
from ragas import evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    LLMContextPrecisionWithoutReference,
)
from datasets import Dataset                                                    # Used to structure the input (questions, answers, contexts etc.) in tabular format
from langchain_openai import ChatOpenAI                                         # This is needed since LLM is used in metric computation

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Loading the data

In [None]:
# uncomment and run the below code snippets if the dataset is present in the Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
pdf_file = "/content/HBR_How_Apple_Is_Organized_For_Innovation.pdf"

In [None]:
pdf_loader = PyMuPDFLoader(pdf_file)

### OpenAI API Calling



In [None]:
# Load the JSON file and extract values
file_name = 'config.json'                                                       # Name of the configuration file
with open(file_name, 'r') as file:                                              # Open the config file in read mode
    config = json.load(file)                                                    # Load the JSON content as a dictionary
    OPENAI_API_KEY = config.get("OPENAI_API_KEY")                                             # Extract the API key from the config
    OPENAI_API_BASE = config.get("OPENAI_API_BASE")                             # Extract the OpenAI base URL from the config

# Store API credentials in environment variables
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY                                          # Set API key as environment variable
os.environ["OPENAI_BASE_URL"] = OPENAI_API_BASE                                 # Set API base URL as environment variable

# Initialize OpenAI client
client = OpenAI()                                                               # Create an instance of the OpenAI client
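
If a key is missing from `config.json`, `config.get(...)` returns `None`, and assigning `None` to `os.environ` raises a `TypeError` that does not say which key was absent. The sketch below is one possible way to fail earlier with a clearer message. The helper name `load_config` and the placeholder values are assumptions made for illustration; only the two key names come from the cell above.

```python
import json
import os
import tempfile

# Key names taken from the notebook cell above
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENAI_API_BASE")

def load_config(path):
    """Hypothetical helper: load a JSON config and check that every required key has a value."""
    with open(path, "r") as file:
        config = json.load(file)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"config file is missing: {', '.join(missing)}")
    return config

# Demonstration with a throwaway file and placeholder values (not real credentials)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w") as file:
        json.dump(
            {"OPENAI_API_KEY": "sk-placeholder",
             "OPENAI_API_BASE": "https://example.invalid/v1"},
            file,
        )
    demo_config = load_config(demo_path)

print(sorted(demo_config))
```

In the notebook itself, the same check could be applied to the real `config.json` before the environment variables are set.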

# Question Answering using Base Model

In [None]:
def generate_response(user_input, max_tokens=500, temperature=0.3, top_p=0.95):
    """Answer a question with the base model only (no retrieved context)."""
    prompt = "Answer the question"
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": user_input}
            ],
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
        )
        # Extract the generated text from the response
        response = response.choices[0].message.content.strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response

### Question 1: Who are the authors of this article and who published this article ?

In [None]:
question_1 = "Who are the authors of this article and who published this article ?"
base_answer_1=generate_response(question_1)
print(base_answer_1)

### Question 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [None]:
question_2 = "List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines."
base_answer_2=generate_response(question_2)
print(base_answer_2)

### Question 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [None]:
question_3 = "Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?"
base_answer_3=generate_response(question_3)
print(base_answer_3)

# Retrieval Augmented Generation Implementation

### Split the Loaded PDF into Chunks for Further Processing

In [None]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=512,
    chunk_overlap=20
)

The given code initializes a **RecursiveCharacterTextSplitter** to split the text into manageable chunks for embedding and retrieval. Here's a breakdown:

- `RecursiveCharacterTextSplitter.from_tiktoken_encoder(...)`: Uses a **tiktoken encoding** to measure chunk length in tokens rather than characters.
- `encoding_name='cl100k_base'`: Specifies the **tiktoken encoding** (used by OpenAI models like GPT-4 and GPT-3.5).
- `chunk_size=512`: Each text chunk will have a maximum of **512 tokens**.
- `chunk_overlap=20`: Allows up to **20 tokens** of overlap between consecutive chunks to preserve context.

This approach splits on natural boundaries (paragraphs, then lines, then words) where possible, which helps each chunk stay coherent for embedding and retrieval.
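
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed sliding window over a token list. It is a simplification and not the library's algorithm: `RecursiveCharacterTextSplitter` splits on separators first and then merges pieces, whereas the hypothetical `chunk_tokens` helper below simply slides a window. The token names are made up and stand in for real tiktoken tokens.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens.

    Each window starts chunk_size - chunk_overlap tokens after the previous
    one, so consecutive windows share chunk_overlap tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Made-up tokens standing in for real tiktoken tokens
tokens = [f"t{i}" for i in range(12)]
chunks = chunk_tokens(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=5` and `chunk_overlap=2`, each window begins three tokens after the previous one, so the last two tokens of one chunk reappear at the start of the next.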

In [None]:
document_chunks = pdf_loader.load_and_split(text_splitter)

(Note: Expect that the above cell will take time to execute).

Let's take a look at consecutive chunks from the document.

In [None]:
i = 5
document_chunks[i]

In [None]:
document_chunks[i+1]

As we can see, there is some overlap between the chunks.
- This improves the coherence and relevance of retrieved results, as the model can better understand the relationship between adjacent parts of the document.
- It also helps in maintaining the flow of ideas and ensuring that critical context is available when generating answers, leading to more accurate and contextually consistent outputs.
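
Rather than comparing two chunks by eye, the overlap can be checked programmatically. The sketch below assumes the overlap appears as an exact suffix of one chunk and prefix of the next; `shared_overlap` is a hypothetical helper, and the two sample strings are invented for illustration rather than quoted from the article.

```python
def shared_overlap(previous_text, next_text):
    """Return the longest suffix of previous_text that is also a prefix of next_text."""
    max_len = min(len(previous_text), len(next_text))
    # Try the longest candidate first and shrink until a match is found
    for size in range(max_len, 0, -1):
        if previous_text.endswith(next_text[:size]):
            return next_text[:size]
    return ""

# Invented sample strings, not taken from the article
first = "Apple is organized around functional expertise"
second = "functional expertise rather than business units"
print(repr(shared_overlap(first, second)))
```

In this notebook the same function could be called as `shared_overlap(document_chunks[i].page_content, document_chunks[i+1].page_content)`. An empty result is possible, for example when the two chunks come from different pages or when whitespace was trimmed at a chunk boundary.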


### Generate Vector Embeddings for Text Chunks

In [None]:
# Initialize the OpenAI Embeddings model with API credentials
embedding_model = OpenAIEmbeddings(
    openai_api_key=OPENAI_API_KEY,                                                     # Your OpenAI API key for authentication
    openai_api_base=OPENAI_API_BASE                                             # The OpenAI API base URL endpoint
)

Now that we have chunked the raw input, **we can add these chunks to an embedding model and then store the generated embeddings into a vector database.**
  - We generate a vector for each chunk and save this chunk along with the vector representation in a specialized database.
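
The idea of "one vector per chunk, stored next to the chunk text" can be sketched without calling any API. The example below is only an illustration under strong assumptions: `toy_embed` is a hypothetical word-hashing function, not the OpenAI embedding model, and a plain Python list stands in for the vector database. The sample chunk strings are invented.

```python
import hashlib
import math

def toy_embed(text, dims=8):
    """Toy stand-in for an embedding model: hash each word into one of
    `dims` buckets, count the hits, and L2-normalise the result."""
    vector = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector

# Invented sample chunks, not taken from the article
sample_chunks = [
    "experts lead experts",
    "leaders know the details of their organization",
]

# Each record keeps the chunk text together with its vector
store = [{"text": chunk, "vector": toy_embed(chunk)} for chunk in sample_chunks]
print(len(store), len(store[0]["vector"]))
```

A real embedding model places texts with similar meaning close together, which this word-hashing toy does not attempt; the sketch only shows the shape of the data that ends up in the store.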

### Creating a Vector Database

In [None]:
out_dir = 'apple_db'

if not os.path.exists(out_dir):
  os.makedirs(out_dir)

In [None]:
vectorstore = Chroma.from_documents(
    document_chunks,
    embedding_model,
    persist_directory=out_dir)

The given code initializes a **vector database** (also called **vector store**) using **Chroma**, a popular open-source vector database, to store **document embeddings** for retrieval in a **Retrieval-Augmented Generation (RAG)** system. Here's a breakdown of what each part does:  

- **`vectorstore = Chroma.from_documents(...)`**  
   - This creates a **Chroma vector store** from a set of **document chunks**. Chroma is used to store and retrieve embeddings efficiently.

- **Parameters Passed to `Chroma.from_documents()`**  
   - `document_chunks`: A list of **text chunks** (split portions of a document) that will be converted into embeddings.  
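   - `embedding_model`: The model responsible for **embedding** the document chunks into vector representations. Common choices include OpenAI’s embeddings, Sentence Transformers, or other dense vector models.  
   - `persist_directory`: The folder (`apple_db` here) where Chroma saves the database on disk so that it can be reloaded later.  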

In [None]:
vectorstore = Chroma(persist_directory=out_dir,embedding_function=embedding_model)

### Retrieval

We will now create a retriever that can query an input text and retrieve the top-k documents that are most relevant from the vector store.

- Under the hood, a similarity score is computed between the embedded query and all the chunks in the database.
- The top k chunks with the highest similarity scores are then returned.
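
The two steps above can be sketched in plain Python. This is a minimal illustration that assumes cosine similarity as the score and a brute-force scan over every stored vector; a vector database such as Chroma may use a different distance metric and an approximate index. The vectors and chunk labels are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vector, stored, k):
    """Score every stored (text, vector) pair and return the k best as (score, text)."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up three-dimensional vectors for illustration
stored = [
    ("chunk A", [1.0, 0.0, 0.0]),
    ("chunk B", [0.7, 0.7, 0.0]),
    ("chunk C", [0.0, 0.0, 1.0]),
]
results = top_k([1.0, 0.1, 0.0], stored, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The query vector points mostly along the first axis, so "chunk A" ranks first and "chunk B" second, while "chunk C" (orthogonal to the query) is left out.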

In [None]:
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 4}
)

The given code initializes a **retriever** from the **Chroma vector store** to fetch similar documents based on embeddings. Here's a breakdown:

- `vectorstore.as_retriever(...)`: Converts the **Chroma vector store** into a retriever for querying.
- `search_type='similarity'`: Returns the chunks whose embeddings are closest to the query embedding, using the distance metric configured for the Chroma collection.
- `search_kwargs={'k': 4}`: Retrieves the **top 4 most similar** documents for a given query.

This allows for **efficient information retrieval**, where the retriever finds the most relevant document chunks based on their **semantic similarity** to a user's query.

#### **Retrieving the Relevant Documents**

Let's ask a simple query and see what document chunks are returned based on the similarity search.

In [None]:
user_input = "Who are the authors of this article and who published this article ?"

relevant_document_chunks = retriever.get_relevant_documents(user_input)

In [None]:
len(relevant_document_chunks)

In [None]:
for document in relevant_document_chunks:
    print(document.page_content)

In [None]:
len(relevant_document_chunks[0].page_content)

It can be observed that the chunks are related to the user query and can perhaps contain the answer.

### Generation

#### Designing the System Prompt

Prompt design is a crucial part of building a RAG-based system. The prompt consists mainly of two parts:

- System message: The instruction given to the LLM.
- User message template: A message template that contains the context retrieved from the document chunks and the user query.

In [None]:
qna_system_message = """
You are an assistant whose work is to give answers to questions with respect to a context.
User input will have the context required by you to answer user questions.

This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Strictly answer only using the information provided in the ###Context.
Do not mention anything about the information in ###Context or the question in ###Question in your final answer.

If the answer to ###Question cannot be derived from the ###Context, just respond by saying "I don't know".

Remember that the answer to ###Question might not always be directly present in the information provided in the ###Context.
The answer can be indirectly derived from the information in ###Context.

"""

**Note**: It is important to specify that the LLM should not attempt to answer the question if the context provided (retrieved from the knowledge base provided) doesn't contain the information required. We don't want the LLM to use the knowledge from its training data and/or hallucinate to share a "seemingly correct" answer.

In [None]:
qna_user_message_template = """
Consider the following ###Context and ###Question
###Context
{context}

###Question
{question}
"""
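
To see how the two parts fit together before any model call is made, here is a small sketch that fills a template and builds the chat message list. The shortened prompt strings, the helper name `build_messages`, and the placeholder chunks are assumptions made for illustration; the notebook's own prompts are the ones defined above.

```python
# Shortened stand-ins for the notebook's system message and user template
SYSTEM_MESSAGE = (
    "Answer only from the ###Context. "
    "If the answer is not there, say \"I don't know\"."
)

USER_TEMPLATE = """###Context
{context}

###Question
{question}
"""

def build_messages(question, context_chunks):
    """Hypothetical helper: join the chunks into one context and fill the template."""
    context = "\n\n".join(context_chunks)
    user_message = USER_TEMPLATE.replace("{context}", context)
    user_message = user_message.replace("{question}", question)
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user_message},
    ]

messages = build_messages(
    "Who published the article?",
    ["placeholder chunk one", "placeholder chunk two"],
)
print(messages[1]["content"])
```

Printing the filled user message like this is a quick way to confirm that the retrieved context and the question land under the right markers before spending API calls.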

### Defining the function for generating responses




Let's create a function that takes a user query as input, retrieves the relevant chunks, and uses them as context for the LLM to generate an answer.

In [None]:
def generate_rag_response(user_input, k=5, max_tokens=500, temperature=0.3, top_p=0.95):
    """Retrieve the top-k chunks for the query and answer using only that context."""
    # Retrieve relevant document chunks
    relevant_document_chunks = retriever.get_relevant_documents(query=user_input, k=k)
    context_list = [d.page_content for d in relevant_document_chunks]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    # Fill the user message template with the context and the question
    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    # Generate the response
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": qna_system_message},
                {"role": "user", "content": user_message}
            ],
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
        )
        # Extract the generated text from the response
        response = response.choices[0].message.content.strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response

Let's try this function on the previous user query and see whether it can generate an answer.

# Question Answering using RAG

### Question 1: Who are the authors of this article and who published this article ?

In [None]:
question_1 = "Who are the authors of this article and who published this article ?"
rag_answer_1=generate_rag_response(question_1)
print(rag_answer_1)

### Question 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [None]:
question_2 = "List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines."
rag_answer_2=generate_rag_response(question_2)
print(rag_answer_2)

### Question 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [None]:
question_3 = "Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?"
rag_answer_3=generate_rag_response(question_3)
print(rag_answer_3)

# Output Evaluation

**Why Do We Need These Evaluation Metrics in a RAG-Based System?**

When evaluating a RAG system, using multiple metrics helps us capture different aspects of response quality. Each metric plays a distinct role in identifying weaknesses and ensuring the system produces trustworthy outputs.

* **Faithfulness** - Checks whether the generated response stays true to the retrieved context without adding unsupported or hallucinated information.
* **Answer Relevancy** - Measures how directly the response addresses the user's query, ensuring that the answer is not only correct but also useful.
* **Context Precision** - Evaluates how precisely the retrieved context contributes to answering the query, reducing noise and irrelevant details.
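
Two of these metrics reduce to simple arithmetic once per-item verdicts are available. The sketch below assumes the verdicts are given as hand-written booleans; in RAGAS they come from an LLM judge, and the function names here are hypothetical rather than part of the RAGAS API. Faithfulness is taken as the share of answer claims supported by the context, and context precision as the mean of precision@k over the ranks that hold a relevant chunk.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

def context_precision_score(chunk_relevant):
    """Mean of precision@k taken over the ranks k that hold a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank   # precision@rank at this relevant position
    return precision_sum / hits if hits else 0.0

# Hand-written verdicts for illustration only
faith = faithfulness_score([True, True, False, True])    # 3 of 4 claims supported
precision = context_precision_score([True, False, True])  # relevant chunks at ranks 1 and 3
print(faith, round(precision, 4))
```

Because precision@k is weighted by rank, the same set of relevant chunks scores higher when they appear near the top of the retrieved list.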

## Evaluating Responses using RAGAS

### Evaluation 1: Base Prompt Response Evaluation

In [None]:
# Initialize the evaluator LLM
evaluator_llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Initialize evaluation metrics
faithfulness = Faithfulness()
answer_relevancy = AnswerRelevancy()
context_precision = LLMContextPrecisionWithoutReference()

In [None]:
questions = [question_1,question_2,question_3]                            # List of user questions
responses_with_base = [base_answer_1,base_answer_2,base_answer_3]               # Responses from Base Model

# Retrieve top-k documents as context for each question
contexts = [
    [doc.page_content for doc in retriever.get_relevant_documents(q, k=6)]      # Get top 6 docs for each question
    for q in questions
]

In [None]:
# Wrap into HuggingFace Dataset
ragas_dataset_with_base = Dataset.from_dict({
    "question": questions,
    "answer": responses_with_base,
    "contexts": contexts,
    "reference": questions
})

# Run RAGAS evaluation
result_with_base = evaluate(
    ragas_dataset_with_base,
    metrics=[
        answer_relevancy,
        context_precision,
        faithfulness,
    ],
    llm=evaluator_llm,
    embeddings=embedding_model
)

# Convert results to DataFrame
df_base = result_with_base.to_pandas()
df_base

### Evaluation 2: RAG Response Evaluation

In [None]:
questions = [question_1,question_2,question_3]                                    # List of user questions
responses_with_rag = [rag_answer_1,rag_answer_2,rag_answer_3]                     # Responses from RAG pipeline

# Retrieve top-k documents as context for each question
contexts = [
    [doc.page_content for doc in retriever.get_relevant_documents(q, k=6)]        # Get top 6 docs for each question
    for q in questions
]

In [None]:
# Wrap into HuggingFace Dataset
ragas_dataset_with_RAG = Dataset.from_dict({
    "question": questions,
    "answer": responses_with_rag,
    "contexts": contexts,
    "reference": questions
})

# Run RAGAS evaluation
result_with_rag = evaluate(
    ragas_dataset_with_RAG,
    metrics=[
        answer_relevancy,
        context_precision,
        faithfulness,
    ],
    llm=evaluator_llm,
    embeddings=embedding_model
)

# Convert results to DataFrame
df_rag = result_with_rag.to_pandas()
df_rag

The RAGAS results show that responses generated using RAG consistently outperform those from the base model across all three evaluation metrics: answer relevancy, LLM context precision (without reference), and faithfulness.

# **Conclusion**

* We've learned how to create a Retrieval-Augmented Generation (RAG) based application using an **OpenAI model** that can perform Q\&A from documents for accurate information retrieval.

  * First, we chunked the data to create multiple splits with overlaps.
  * Then we used embedding models to encode the different data splits.
  * Then we stored these embeddings in a vector database.
  * Then we defined the OpenAI model that takes the user query together with the relevant context retrieved from the stored chunks.
  * Finally, we assembled all these components to build the RAG-based system.
* We've also learned how to evaluate the output of a RAG-based system using the **RAGAS framework**, which here measures faithfulness, answer relevancy, and context precision.
* Lastly, we compared the output from an OpenAI model alone with that from a RAG-based system and examined the differences in faithfulness, answer relevancy, and context precision between the two methods.


<font size = 6 color = '#4682B4' > Power Ahead </font>
___