## Problem Statement

### Business Context

As organizations grow, business analysts increasingly face the challenge of navigating large volumes of reports, research papers, and strategic documents. Extracting the right insights from lengthy materials can be time-consuming and overwhelming, especially when these insights directly influence key business decisions.

Consider joining a venture capital firm like Andreessen Horowitz and being assigned a dense report such as Harvard Business Review’s **“How Apple is Organized for Innovation.”** Manually reviewing such documents requires significant effort, slowing down the analysis process and increasing the chances of missing important details.

To overcome this information overload, businesses can leverage **Semantic Search** and **Retrieval-Augmented Generation (RAG)**. Semantic search finds the passages whose meaning is closest to a query, and RAG passes those passages to a language model so that its answer is grounded in the source document. Analysts can then ask natural-language questions like, *“How does Apple structure its teams for innovation?”* and quickly retrieve relevant insights from the report.

By integrating such AI-driven retrieval systems, organizations can streamline research workflows, reduce manual effort, and enable analysts to focus on high-value strategic thinking, ultimately improving decision-making speed and quality.


**Common Questions to Answer**

1. Who are the authors of this article and who published this article?

2. List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

3. Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?



### Objective

As an AI specialist, your task is to develop a RAG-based application that enables business analysts to efficiently extract insights from extensive business reports such as “How Apple is Organized for Innovation.” The objectives are to:

* Understand the challenges of navigating long, information-dense documents.
* Apply retrieval-augmented generation techniques to surface only the most relevant content.
* Analyze how this improves the speed and accuracy of report interpretation.
* Evaluate its potential to enhance strategic decision-making and analyst productivity.
* Create a functional prototype that answers queries, summarizes key insights, and supports natural-language interaction without requiring users to read the entire report.

### Data Description

**“How Apple is Organized for Innovation”** is a detailed Harvard Business Review article that examines Apple’s unique approach to structuring teams, driving innovation, and maintaining a culture of excellence. The article is provided as a PDF consisting of **11 pages**, offering in-depth insights into Apple’s organizational design, leadership principles, and decision-making processes.


## **Please read the instructions carefully before starting the project.**

This is a commented Python Notebook file in which all the instructions and tasks to be performed are mentioned.
* Blanks '_____' are provided in the notebook and need to be filled with appropriate code to get the correct result. Every '_____' blank is accompanied by a comment that briefly describes what needs to be filled in.
* Identify the task to be performed correctly, and only then proceed to write the required code.
* Please run the code cells sequentially from the beginning to avoid unnecessary errors.
* Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same. Any mathematical or computational details which are a graded part of the project can be included in the Appendix section of the presentation.

## Installing and Importing Necessary Libraries and Dependencies

In [1]:
# Install required libraries
!pip install -q langchain_community==0.3.27 \
              langchain==0.3.27 \
              chromadb==1.0.15 \
              pymupdf==1.26.3 \
              tiktoken==0.9.0 \
              datasets==4.0.0 \
              evaluate==0.4.5 \
              langchain_openai==0.3.30


**Note**:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- On executing the above cell, you might see a warning regarding package dependencies. This warning can be ignored, as the pinned versions above ensure that all necessary libraries and their dependencies are available to successfully execute the code in ***this notebook***.
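After restarting, it can help to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch using only the standard library. The helper name `find_version_mismatches` is illustrative and not part of the notebook, and the pins mirror a subset of the install cell above.

```python
from importlib.metadata import PackageNotFoundError, version

# A few of the pins from the install cell; extend as needed.
EXPECTED_VERSIONS = {
    "langchain": "0.3.27",
    "langchain-community": "0.3.27",
    "chromadb": "1.0.15",
    "pymupdf": "1.26.3",
    "tiktoken": "0.9.0",
}

def find_version_mismatches(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches

# Print one line per package whose installed version differs from the pin.
for package, (wanted, installed) in find_version_mismatches(EXPECTED_VERSIONS).items():
    print(f"{package}: expected {wanted}, found {installed}")
```

If nothing is printed, the listed packages match their pins.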

In [None]:
# Import core libraries
import os                                                                       # Interact with the operating system (e.g., set environment variables)
import json                                                                     # Read/write JSON data
import requests  # type: ignore                                                 # Make HTTP requests (e.g., API calls); ignore type checker

# Import libraries for working with PDFs and OpenAI
from langchain_community.document_loaders import PyMuPDFLoader                  # Load and extract text from PDF files
# from langchain_community.document_loaders import PyPDFLoader                    # Load and extract text from PDF files
from openai import OpenAI                                                       # Access OpenAI's models and services

# Import libraries for processing dataframes and text
import tiktoken                                                                 # Tokenizer used for counting and splitting text for models
import pandas as pd                                                             # Load, manipulate, and analyze tabular data

# Import LangChain components for data loading, chunking, embedding, and vector DBs
from langchain.text_splitter import RecursiveCharacterTextSplitter              # Break text into overlapping chunks for processing
from langchain_community.embeddings import OpenAIEmbeddings                     # Create vector embeddings using OpenAI's models
from langchain_community.vectorstores import Chroma                             # Store and search vector embeddings using Chroma DB

from datasets import Dataset                                                    # Used to structure the input (questions, answers, contexts etc.) in tabular format
from langchain_openai import ChatOpenAI                                         # This is needed since LLM is used in metric computation

## Question Answering using LLM

### OpenAI API Calling



> **Note:** <br> Make sure to create a `config.json` file in your project directory containing your OpenAI credentials in the following format:
<br><br>```{"OPENAI_API_KEY": "your_openai_api_key_here","OPENAI_API_BASE": "your_api_base"}```<br><br>
Replace the placeholders with your actual API key and API base URL. Keeping credentials in this file avoids hardcoding them in the notebook; do not share the file or commit it to version control.
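A missing or misspelled key in `config.json` otherwise only surfaces later as a confusing error, so it may be worth validating the file on load. Below is a minimal sketch of a stricter loader. The helper name `load_openai_config` is hypothetical, the key names are the ones the loading code in this notebook reads, and the demonstration writes placeholder values to a throwaway file.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENAI_API_BASE")

def load_openai_config(path):
    """Load the config file and raise a clear error if a required key is missing or empty."""
    with open(path, "r") as file:
        config = json.load(file)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"{path} is missing required key(s): {', '.join(missing)}")
    return {key: config[key] for key in REQUIRED_KEYS}

# Demonstration with a throwaway file and placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w") as file:
        json.dump({"OPENAI_API_KEY": "your_openai_api_key_here",
                   "OPENAI_API_BASE": "your_api_base"}, file)
    demo_config = load_openai_config(demo_path)
    print(sorted(demo_config))
```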


In [None]:
# Load the JSON file and extract values
file_name = "config.json"                                                       # Name of the configuration file
with open(file_name, 'r') as file:                                              # Open the config file in read mode
    config = json.load(file)                                                    # Load the JSON content as a dictionary
    OPENAI_API_KEY = config.get("OPENAI_API_KEY")                                             # Extract the API key from the config
    OPENAI_API_BASE = config.get("OPENAI_API_BASE")                             # Extract the OpenAI base URL from the config

# Store API credentials in environment variables
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY                                          # Set API key as environment variable
os.environ["OPENAI_BASE_URL"] = OPENAI_API_BASE                                 # Set API base URL as environment variable

# Initialize OpenAI client
client = OpenAI()                                                               # Create an instance of the OpenAI client

### Response

In [None]:
# Define a function to get a response
def response(user_prompt, max_tokens=5, temperature=1, top_p=0.9):   # Complete the code to set default parameters
    # Create a chat completion using the OpenAI client
    completion = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # Complete the code by specifying the model to be used.
        messages=[
            {"role": "user", "content": user_prompt}                            # User prompt is the input/query to respond to
        ],
        max_tokens=max_tokens,                                                  # Max number of tokens to generate in the response
        temperature=temperature,                                                # Controls randomness in output
        top_p=top_p                                                             # Controls diversity via nucleus sampling
    )
    return completion.choices[0].message.content                                # Return the text content from the model's reply
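To make the `temperature` and `top_p` parameters more concrete, here is a toy sketch of how temperature scaling and nucleus (top-p) filtering reshape a next-token distribution. The token scores are invented for illustration, the function names are not part of any library, and real inference servers may differ in details such as tie-breaking.

```python
import math

def apply_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution.

    This sketch requires temperature > 0.
    """
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}

def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}  # renormalize to sum to 1

# Invented scores for four candidate next tokens.
toy_logits = {"Apple": 3.0, "Google": 2.0, "banana": 0.5, "xylophone": -1.0}
probs = apply_temperature(toy_logits, temperature=1.0)
print(nucleus_filter(probs, top_p=0.9))
```

With these made-up scores, only the two most likely tokens survive the `top_p=0.9` filter; the model would then sample from that reduced set.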

### Question 1: Who are the authors of this article and who published this article?

In [None]:
question_1 = "Who are the authors of this article and who published this article?"
base_prompt_response_1=response(question_1)

### Question 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [None]:
question_2 = "_____" #Complete the code to define the question #2
base_prompt_response_2=response(_____) #Complete the code to pass the user input

### Question 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [None]:
question_3 = "_____" #Complete the code to define the question #3
base_prompt_response_3=response(_____) #Complete the code to pass the user input

**Observations:**
-
-

## Question Answering using LLM with Prompt Engineering

In the next step, we will use prompt engineering to check the effect of a more detailed and well-engineered prompt on the output of the model.

In [None]:
system_prompt = """
___
""" #Complete the code to define the system prompt

### Defining the function to Generate a Response From the LLM

In [None]:
# Define a function to get a response from the OpenAI chat model
def response(system_prompt, user_prompt, max_tokens=_____, temperature=______, top_p=_____):  # Complete the code to set default parameters
    # Create a chat completion using the OpenAI client
    completion = client.chat.completions.create(
        model="_______",                                                        # Complete the code by specifying the model to be used.
        messages=[
            {"role": "system", "content": system_prompt},                       # System prompt sets the assistant's behavior
            {"role": "user", "content": user_prompt}                            # User prompt is the input/query to respond to
        ],
        max_tokens=max_tokens,                                                  # Max number of tokens to generate in the response
        temperature=temperature,                                                # Controls randomness in output (0 = deterministic)
        top_p=top_p                                                             # Controls diversity via nucleus sampling
    )
    return completion.choices[0].message.content                                # Return the text content from the model's reply

### Question 1: Who are the authors of this article and who published this article?

In [None]:
response_with_prompt_eng_1=response(system_prompt,question_1)
response_with_prompt_eng_1

### Question 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [None]:
response_with_prompt_eng_2=response(_____,_____) #Complete the code to pass the system prompt and the user prompt, in that order
response_with_prompt_eng_2

### Question 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [None]:
response_with_prompt_eng_3=response(_____,_____) #Complete the code to pass the system prompt and the user prompt, in that order
response_with_prompt_eng_3

**Observations**:
-
-


## Data Preparation for RAG

### Loading the Data

In [None]:
# uncomment and run the below code snippets if the dataset is present in the Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
pdf_path = "_____" #Complete the code to define the file name

In [None]:
pdf_loader = PyMuPDFLoader(pdf_path)

In [None]:
pdf = pdf_loader.load()

### Data Overview

#### Checking the first 5 pages

In [None]:
for i in range(5):
    print(f"Page Number : {i+1}",end="\n")
    print(pdf[i].page_content,end="\n")

### Data Chunking

#### Chunk the PDF into Manageable Text Sections Using a Token-Based Splitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=_____, #Complete the code to define the chunk size
    chunk_overlap= _____ #Complete the code to define the chunk overlap
)
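To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of a sliding-window chunker. It works on a plain list of tokens rather than the `cl100k_base` tokenizer and ignores the separator hierarchy that `RecursiveCharacterTextSplitter` applies, so it illustrates only the overlap idea. The function name and the sample values are illustrative, not recommendations for the blanks above.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reached the end
    return chunks

# Ten placeholder tokens standing in for real tokenizer output.
sample_tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_with_overlap(sample_tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window repeats the last token of the previous one, which is how overlap keeps a sentence that straddles a boundary visible in both chunks.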

#### Split the Loaded PDF into Chunks for Further Processing

In [None]:
document_chunks = pdf_loader.load_and_split(text_splitter)

#### Check the Number of Chunks Created

In [None]:
len(document_chunks)

### Generate Vector Embeddings for Text Chunks Using OpenAI

In [None]:
# Initialize the OpenAI Embeddings model with API credentials
embedding_model = OpenAIEmbeddings(
    openai_api_key=OPENAI_API_KEY,                                                     # Your OpenAI API key for authentication
    openai_api_base=OPENAI_API_BASE                                             # The OpenAI API base URL endpoint
)

# Generate embeddings (vector representations) for the first two document chunks
embedding_1 = embedding_model.embed_query(document_chunks[0].page_content)      # Embedding for chunk 0
embedding_2 = embedding_model.embed_query(document_chunks[1].page_content)      # Embedding for chunk 1

# Check and print the dimension (length) of the embedding vector
print("Dimension of the embedding vector ", len(embedding_1))                   # Model-dependent, e.g., 1536 (text-embedding-3-small) or 3072 (text-embedding-3-large)

In [None]:
# Verify that both embeddings have the same dimension (should print True)
print(len(embedding_1) == len(embedding_2))

# Display the two embedding vectors for further inspection or use
embedding_1, embedding_2
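Embeddings are useful because vectors for related texts tend to point in similar directions. Below is a minimal sketch of cosine similarity, a common way to compare them, on small made-up vectors. The same function could be applied to `embedding_1` and `embedding_2`; the function name is illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors standing in for real embeddings.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```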

### Vector Database

#### Setup Vector Database Directory

In [None]:
out_dir = '____'    # complete the code to define the name of the vector database

os.makedirs(out_dir, exist_ok=True)                                             # Create the directory if it does not already exist

#### Create Vector Database from Documents

In [None]:
# Building the vector store and saving it to disk for future use
vectorstore = Chroma.from_documents(
    document_chunks,                                                            # Documents to index
    embedding_model,                                                            # Embedding model for converting text to vectors
    persist_directory=out_dir                                                   # Save vector DB files here
)

#### Load Vector Database

In [None]:
vectorstore = Chroma(
    persist_directory=out_dir,
    embedding_function=embedding_model
)

#### Explore Vector Database and Perform Searches

In [None]:
vectorstore.embeddings

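In the next cell, run a similarity search against the vector store: pass a query string and a value of `k`, the number of most similar chunks to return.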

In [None]:
vectorstore.similarity_search("_____",k=_____) #Complete the code to pass a query and an appropriate k value
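Conceptually, a similarity search ranks the stored vectors by their distance to the query vector and returns the `k` closest. The following is a brute-force sketch on made-up 2-dimensional vectors using squared Euclidean distance. Vector databases typically rely on an approximate nearest-neighbor index rather than scanning every vector, so this illustrates the idea rather than Chroma's implementation; all names and values are hypothetical.

```python
import heapq

def top_k_nearest(query, store, k):
    """Return the k (distance, text) pairs with the smallest squared L2 distance."""
    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    scored = ((squared_l2(query, vector), text) for text, vector in store.items())
    return heapq.nsmallest(k, scored)

# Made-up vectors keyed by a label describing each stored chunk.
toy_store = {
    "chunk about leadership": [0.9, 0.1],
    "chunk about org structure": [0.7, 0.3],
    "chunk about supply chain": [0.1, 0.9],
}
toy_query = [1.0, 0.0]

for distance, text in top_k_nearest(toy_query, toy_store, k=2):
    print(f"{distance:.2f}  {text}")
```

A larger `k` returns more context for the model to draw on, at the cost of including chunks that are less closely related to the query.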

### Retrieval and Response Generation using Vector Search

#### Convert Vector Database into a Retriever and Retrieve Relevant Documents

In [None]:
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': _____} #Complete the code to pass an appropriate k value
)

### System and User Prompt Template

Prompts guide the model to generate accurate responses. Here, we define two parts:

1. The system message describing the assistant's role.
2. A user message template including context and the question.
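The response function defined later in this notebook fills the user message template by replacing the literal placeholders `{context}` and `{question}`. Here is a minimal sketch of that substitution, using a hypothetical template and made-up context strings. The `###Context` and `###Question` delimiters are one possible convention, not a requirement, and `fill_template` is an illustrative name.

```python
# Hypothetical template; the wording and delimiters are only an example.
example_template = """
###Context
{context}

###Question
{question}
"""

def fill_template(template, context_chunks, question):
    """Join retrieved chunks and substitute them, with the question, into the template."""
    context = ". ".join(context_chunks)
    message = template.replace("{context}", context)
    return message.replace("{question}", question)

# Made-up strings standing in for retrieved document chunks.
made_up_chunks = ["First retrieved passage", "Second retrieved passage"]
print(fill_template(example_template, made_up_chunks, "What does the report say?"))
```

One caveat of sequential `replace` calls: if a retrieved chunk happens to contain the literal text `{question}`, the second call will substitute it too.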

In [None]:
qna_system_message = """
___
"""  #Complete the code to define the system message

In [None]:
qna_user_message_template = """
___
"""  #Complete the code to define the user message

### Response Function

In [None]:
def generate_rag_response(user_input,k=3,max_tokens=128,temperature=0,top_p=0.95):
    global qna_system_message,qna_user_message_template
    # Retrieve relevant document chunks
    relevant_document_chunks = retriever.get_relevant_documents(query=user_input,k=k)
    context_list = [d.page_content for d in relevant_document_chunks]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    # Generate the response
    try:
        completion = client.chat.completions.create(
            model="_____",   # Complete the code by specifying the model to be used.
            messages=[
                {"role": "system", "content": qna_system_message},
                {"role": "user", "content": user_message}
            ],
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
        )
        # Extract the generated text from the response
        response = completion.choices[0].message.content.strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response

## Question Answering using RAG

### Question 1: Who are the authors of this article and who published this article?

In [None]:
response_with_rag_1 = generate_rag_response(question_1)
response_with_rag_1

### Question 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [None]:
response_with_rag_2 = generate_rag_response(_____) #Complete the code to pass the user input
response_with_rag_2

### Question 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [None]:
response_with_rag_3 = generate_rag_response(_____) #Complete the code to pass the user input
response_with_rag_3

### **Observations**

-
-
-

## Actionable Insights and Business Recommendations


*   
*  
*



<font size=6 color='#4682B4'>Power Ahead</font>
___