<font size=10>**Retrieval Augmented Generation (RAG)**</font>

<font size=6>**Salesforce CRM Q&A Assistant**</font>

## <font color='blue'>**Solution Approach**</font>

<font color='blue'>**1. Data Ingestion and Chunking**</font>
* Install the necessary libraries and load the CRM documents from a local folder using **LangChain’s** `PyPDFLoader`.
* Apply RecursiveCharacterTextSplitter to break the documents into overlapping text chunks while preserving semantic context.

<font color='blue'>**2. Embedding & Vector Store Creation**</font>
* Generate text embeddings using OpenAIEmbeddings to convert text chunks into vector representations.
* Create a vector store using Chroma, and persist it locally by passing `persist_directory` so it can be reused.

<font color='blue'>**3. Query Handling & Answer Generation**</font>
* Retrieve top-k most relevant document chunks from Chroma using semantic similarity search.
* Feed the retrieved context and user query into a custom prompt template built with LangChain’s `ChatPromptTemplate`, and send it to `ChatOpenAI`.
* Generate a final answer based on the prompt + context and strictly grounded in the provided context.

<font color='blue'>**4. LLM-Based Multi-Metric Evaluation (MLS)**</font>
* Use a second LLM call (through the OpenAI client) to evaluate the generated response using five key evaluation metrics:
  * **Groundedness –** Is the answer supported by retrieved context?

  * **Relevance –** Does the answer align with the user’s query?

  * **Faithfulness –** Are the statements logically valid and consistent?

  * **Context Precision –** Does the answer avoid including irrelevant context?

  * **Context Recall –** Has it captured all important info from context?

* Each metric is scored on a scale of 1–5 and accompanied by a justification generated by the LLM.


## **Business Questions for Evaluation**

The RAG system is designed to embed and store large documentation sets, conduct vector similarity search, and generate accurate answers using a Large Language Model (LLM) with clear provenance of the retrieved context.

Here are example questions that the internal sales team frequently asks, which the RAG assistant should be able to answer accurately:

*1. What are the key features and functionalities of Salesforce CRM's Sales Cloud module?*

*2. What are the steps to integrate third-party marketing tools with Salesforce CRM?*

*3. How do I generate a report showing lead conversion rates by region?*

*4. How can we track and report on customer satisfaction in Salesforce?*

*5. Where can I find historical engagement data for a specific customer?*

*6. What is the best way to segment customers for an upcoming campaign?*

*7. What are the best practices for updating opportunities during a sales review?*

*8. Can we automate lead assignment based on geography or product interest?*

These questions are intentionally business-critical, ensuring the RAG system is tested for both completeness and precision in responses.


⚠️ Note:
All cell outputs have been intentionally cleared to protect sensitive
and proprietary data. This repository focuses on demonstrating
architecture, evaluation methodology, and implementation.


# **<font color='blue'>Library Installation and OpenAI LLM Calling</font>**

In [None]:
# Install required libraries
!pip install langchain_community langchain chromadb pypdf tiktoken openai pandas

In [3]:
# Import libraries
import os
from langchain_community.document_loaders import PyPDFLoader
from openai import OpenAI
import json
import requests # type: ignore

In [None]:
# Load the JSON file and extract values
# Create a config.json file containing API_KEY and OPENAI_API_BASE for this cell to work
file_name = 'config.json'
with open(file_name, 'r') as file:
    config = json.load(file)
    API_KEY = config.get("API_KEY") # Loading the API Key
    OPENAI_API_BASE = config.get("OPENAI_API_BASE") # Loading the API Base Url

model_name = "gpt-4o-mini"

# Storing API credentials in environment variables
os.environ['OPENAI_API_KEY'] = API_KEY
os.environ["OPENAI_BASE_URL"] = OPENAI_API_BASE

# Initialize OpenAI client
client = OpenAI()

# Create a chat completion
completion = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "Hello, can you return an executive summary of multiple pdfs? I need them for my meeting. "}
    ]
)

# Print the assistant's reply
print(completion.choices[0].message.content)
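
The configuration loading at the top of the cell above fails with a less obvious error later on if a key is missing from `config.json`. As a minimal sketch, assuming the same two keys (`API_KEY` and `OPENAI_API_BASE`), a small helper can fail early instead. The function name `load_config` is hypothetical and not part of the original notebook.

```python
import json


def load_config(path: str, required=("API_KEY", "OPENAI_API_BASE")) -> dict:
    """Read a JSON config file and fail early if a required key is missing or empty."""
    with open(path, "r", encoding="utf-8") as fh:
        config = json.load(fh)
    # Collect every required key that is absent or has an empty value
    missing = [key for key in required if not config.get(key)]
    if missing:
        raise KeyError(f"Missing keys in {path}: {', '.join(missing)}")
    return config


# Possible usage in the notebook (assumes config.json exists):
# config = load_config("config.json")
# API_KEY, OPENAI_API_BASE = config["API_KEY"], config["OPENAI_API_BASE"]
```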


# **<font color='blue'>Data Ingestion and Chunking</font>**

**This section loads all Salesforce-related PDF documents and chunks them into manageable sizes for LLM input.**

In [None]:
! unzip "sample_pdfs.zip" -d ./sample_pdfs

In [16]:
# Uploading multiple pdfs:
from glob import glob
from langchain_community.document_loaders import PyPDFLoader

# Path to folder with PDFs
DOC_FOLDER = "Salesforce/"
pdf_files = glob(DOC_FOLDER + "*.pdf")  # grabs all PDFs in the folder

all_pages = []

# Load each PDF and extract pages
for pdf_path in pdf_files:
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()
    all_pages.extend(pages)

In [None]:
print(pdf_files)

In [18]:
# Chunking the data
from langchain.text_splitter import RecursiveCharacterTextSplitter # type: ignore


# Split the docs into smaller chunks of up to 1000 characters, with a 150-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

chunks = text_splitter.split_documents(all_pages)
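
To illustrate what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window chunker in plain Python. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraphs and sentences, which this hypothetical `sliding_window_chunks` function does not attempt.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 150-character overlap plays in the notebook: content cut at a chunk boundary still appears intact in one of the neighbouring chunks.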


# **<font color='blue'>Embedding and Vector Store Creation</font>**

The next step is to convert the above chunks into vector embeddings using OpenAI Embeddings. These embeddings are stored in a Chroma vector database for later retrieval during question answering.

In [19]:
# Directory to store vector database
CHROMA_PATH = "./chroma_db"

In [None]:
# Calculate the embeddings and save in database
from langchain_community.embeddings import OpenAIEmbeddings # type: ignore
from langchain_community.vectorstores import Chroma # type: ignore

# Get OpenAI Embedding model
embeddings = OpenAIEmbeddings(openai_api_key=API_KEY, openai_api_base=OPENAI_API_BASE)

# Embed the chunks as vectors and load them into the database
db_chroma = Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_PATH)

In [89]:
# Let us start with the following user query (user_input)

user_input = 'How will Sales Cloud improve female representation in our office?'

In [90]:
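# Retrieve context: the top 5 most relevant (closest) chunks to the query vector.
# The returned score is a distance, so lower means closer. Chroma collections default to
# squared L2 distance unless configured otherwise (e.g. for cosine).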

docs_chroma = db_chroma.similarity_search_with_score(user_input, k=5)
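
The scores returned by `similarity_search_with_score` are distances, so smaller values indicate closer chunks. The sketch below shows two common distance functions and a brute-force top-k ranking in plain Python. It is an illustration only; the vector store uses an approximate index internally, and the helper names here are hypothetical.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, vectors, k, distance):
    """Return the indices of the k vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
    return ranked[:k]


# Toy 2-dimensional "embeddings"
vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], vectors, k=2, distance=squared_l2))  # [1, 2]
```

For unit-length embeddings the two metrics produce the same ranking, because the squared L2 distance is then exactly twice the cosine distance.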


In [None]:
# @title Metadata Display
# @markdown For each chunk, it prints: Cleaned content (tab characters removed), File source, Page number


for i, (doc, _score) in enumerate(docs_chroma): # unpack the tuple into doc and _score
    print(f"Retrieved chunk {i+1}: \n")
    print(doc)
    print(doc.page_content.replace('\t', ' '))
    print("Source: ", doc.metadata['source'],"\n ")
    print("Page Number: ",doc.metadata['page'],"\n===================================================== \n")
    print('\n')

In [None]:
# Concatenate all the retrieved chunk texts to form a single context_text block
context_text = "\n\n".join([doc.page_content for doc, _score in docs_chroma])

# Check how many documents were actually retrieved (should be 5)
len(docs_chroma)

*`context_text` is used as input for the final LLM call to generate an answer.*
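
Joining only `page_content` drops the source file and page number that the metadata display prints. If provenance should stay visible in the prompt, one option is to label each chunk before joining. This is a sketch under assumptions: `Chunk` is a hypothetical stand-in for a LangChain document (same `page_content` and `metadata` attributes), and `format_context_with_sources` is not part of the original notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context_with_sources(scored_chunks) -> str:
    """Join (chunk, score) pairs into one context string, labelling each chunk with its origin."""
    blocks = []
    for index, (chunk, _score) in enumerate(scored_chunks, start=1):
        source = chunk.metadata.get("source", "unknown")
        page = chunk.metadata.get("page", "?")
        blocks.append(f"[{index}] (source: {source}, page: {page})\n{chunk.page_content}")
    return "\n\n".join(blocks)


example = [
    (Chunk("Sales Cloud tracks leads.", {"source": "Salesforce/guide.pdf", "page": 3}), 0.21),
    (Chunk("Reports can be grouped by region."), 0.35),
]
print(format_context_with_sources(example))
```

In the notebook, the same function could be called on `docs_chroma` directly, since each item there is also a `(document, score)` tuple.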

# **<font color='blue'>Query Handling and Answer Generation</font>**

This section implements the response generation step of the RAG pipeline using LangChain and an OpenAI-compatible chat model.

`ChatPromptTemplate` is used for formatting dynamic prompts.

`ChatOpenAI` provides a wrapper around OpenAI's chat models *(like gpt-3.5-turbo, gpt-4o-mini, etc.)*

In [None]:
from langchain.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatOpenAI

# You can use a prompt template
PROMPT_TEMPLATE = """
You are an assistant to a Sales Team. Your task is to summarize and provide relevant information to the team's question based on the provided context.

Answer the question based only on the following context:
{context}
Answer the question based on the above context: {question}.

Please adhere to the following guidelines:
- Provide a detailed answer.
- Don’t justify your answers.
- Don’t give information not mentioned in the CONTEXT INFORMATION.
- Do not say "according to the context" or "mentioned in the context" or similar.
- If the answer is not found in the context, it is very very important for you to respond with "Sorry, this is out of my knowledge base"
"""

# Load retrieved context and user query in the prompt template
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=user_input)
print(prompt)

In [None]:
# @title Model Invocation and Response generation

# Call LLM model to generate the answer based on the given context and query
model = ChatOpenAI(
    model_name=model_name,
    openai_api_key=API_KEY,
    openai_api_base=OPENAI_API_BASE
    )

response_text = model.invoke(prompt).content
print(response_text)
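
The prompt instructs the model to reply with a fixed fallback sentence when the answer is not in the context, as is expected for the off-topic query above. A small check for that sentence makes it possible to skip or separately flag such responses before running the evaluation metrics. This is a sketch; the helper name `is_out_of_scope` is hypothetical, and substring matching is an assumption about how closely the model follows the instruction.

```python
FALLBACK_MESSAGE = "Sorry, this is out of my knowledge base"


def is_out_of_scope(response_text: str) -> bool:
    """Return True if the response contains the fallback sentence from the prompt template."""
    # Lowercase and collapse whitespace so minor formatting differences do not matter
    normalized = " ".join(response_text.lower().split())
    return FALLBACK_MESSAGE.lower() in normalized


print(is_out_of_scope("Sorry, this is out of my knowledge base."))  # True
print(is_out_of_scope("Sales Cloud provides lead management."))     # False
```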

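Before evaluation, the retrieved context is rebuilt as `context_for_query` and printed further below, to verify what information is passed into the evaluator prompts. The template in the next cell formats the question, context, and answer into one structured message.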

In [27]:
# A template string that formats a structured message.

user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""

In [38]:
user_input = "What are the key features and functionalities of Salesforce CRM Sales Cloud?"

Each item in `docs_chroma` is a tuple: `(document_chunk, similarity_score)`.

In [36]:
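# Retrieve the top 5 chunks for the new query (scores are distances: lower means closer)
docs_chroma = db_chroma.similarity_search_with_score(user_input, k=5)

context_list = [d[0].page_content for d in docs_chroma]
context_for_query = ". ".join(context_list)

# Regenerate the answer so that response_text matches the new question and context
prompt = prompt_template.format(context=context_for_query, question=user_input)
response_text = model.invoke(prompt).content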

This extracts the text content from each document in `docs_chroma`, resulting in a list of context chunks as strings.
These chunks are then joined into a single string, separated by a period and space, to form the complete context.
The combined context is printed to verify the information being passed into the model prompt.

In [None]:
print(context_for_query)

# **<font color='blue'>LLM Based Evaluation</font>**

## **Groundedness**

This metric assesses whether the AI-generated **answer is strictly based on the provided context** without introducing any external or hallucinated information.

**Why It Matters:**

In Retrieval-Augmented Generation (RAG) systems, hallucination is a major risk. Groundedness evaluation helps ensure that **answers are trustworthy** and **traceable to source documents**.



In [32]:
# @title Metric 1


# This prompt effectively turns the LLM into a groundedness evaluator, instructing it to reflect, explain, and then score the answer.

groundedness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer should be derived only from the information presented in the context.

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.

Output Format:
Arrange your output in the following JSON format.
{
    "steps": write down the steps that are needed to evaluate the answer as per the metric.
    "explanation": provide a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
    "evaluation": the extent to which the metric is followed.
    "rating": <1,2,3,4,5>
}
DO NOT output anything else before or after the JSON output.

"""

In [33]:
groundedness_prompt = [
    {'role':'system', 'content': groundedness_rater_system_message},
    {'role': 'user', 'content': user_message_template.format(
        question=user_input,
        context=context_for_query,
        answer=response_text
        )
    }
]
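
The same two-message structure is rebuilt for every metric later in the notebook. As a sketch, it can be factored into one helper so that only the system message changes per metric. The function name `build_eval_messages` is hypothetical; the template text mirrors `user_message_template` from above and is repeated here only to keep the example self-contained.

```python
USER_MESSAGE_TEMPLATE = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""


def build_eval_messages(system_message: str, question: str, context: str, answer: str) -> list:
    """Return the chat messages for one evaluation metric."""
    user_content = USER_MESSAGE_TEMPLATE.format(
        question=question, context=context, answer=answer
    )
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_content},
    ]


# Possible usage in the notebook:
# groundedness_prompt = build_eval_messages(
#     groundedness_rater_system_message, user_input, context_for_query, response_text
# )
```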

In [None]:
groundedness_eval = client.chat.completions.create(
    model=model_name,
    messages=groundedness_prompt,
    temperature=0
)

print(groundedness_eval.choices[0].message.content)
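
The evaluator is asked to return JSON, but model output is not guaranteed to be valid JSON and is sometimes wrapped in a Markdown code fence. The sketch below extracts the numeric rating defensively and returns `None` when parsing fails. The helper name `parse_evaluation` and the returned dictionary layout are assumptions, not part of the original notebook.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_evaluation(raw: str) -> dict:
    """Extract the 1-5 rating from an evaluator response; rating is None if unparseable."""
    text = raw.strip()
    # Remove a surrounding code fence (optionally tagged "json") if the model added one
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)].strip()
        if text.lower().startswith("json"):
            text = text[4:].strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {"rating": None, "evaluation": None, "raw": raw}
    if not isinstance(parsed, dict):
        return {"rating": None, "evaluation": None, "raw": raw}
    try:
        rating = int(parsed.get("rating"))
    except (TypeError, ValueError):
        rating = None
    if rating is not None and not 1 <= rating <= 5:
        rating = None  # outside the rubric's 1-5 scale
    return {"rating": rating, "evaluation": parsed.get("evaluation"), "raw": raw}


print(parse_evaluation('{"evaluation": "mostly followed", "rating": 4}')["rating"])  # 4
```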

## **Relevance**

This metric evaluates **how relevant the AI-generated answer is to the question**, based on the context used.

*Relevance means the answer should cover all important aspects of the question without adding unrelated information or omitting important information.*
  - *It shouldn’t go off-topic or miss key points.*


**Why It Matters:**

Even if an answer is factual and grounded, it might not actually address the **user’s intent**.

- Relevance ensures answers aren’t off-topic, overly generic, or missing key information.

- In RAG, this avoids responses like:

  "*Here is some info about the topic...*" instead of actually answering the question.



In [None]:
# @title Metric 2


# This turns the LLM into a Relevance Rater, ensuring the answer meaningfully covers what the question asked, and nothing more or less.

relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the context as per the metric.
2. Give a step-by-step explanation if the context adheres to the metric considering the question as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the context using the evaluation criteria and assign a score.

Output Format:
Arrange your output in the following JSON format.
{
    "steps": write down the steps that are needed to evaluate the context as per the metric.
    "explanation": provide a step-by-step explanation if the context adheres to the metric considering the question as the input.
    "evaluation": the extent to which the metric is followed.
    "rating": <1,2,3,4,5>
}
DO NOT output anything else before or after the JSON output.

"""

In [None]:
relevance_prompt = [
    {'role':'system', 'content': relevance_rater_system_message},
    {'role': 'user', 'content': user_message_template.format(
        question=user_input,
        context=context_for_query,
        answer=response_text
        )
    }
]

In [None]:
# Calling the evaluation API

relevance_eval = client.chat.completions.create(
    model=model_name,
    messages=relevance_prompt,
    temperature=0
)

print(relevance_eval.choices[0].message.content)

## **Faithfulness**

This metric assesses whether the **answer contains only claims that are explicitly or implicitly supported** by the provided context (i.e., faithful to source).

*`Faithfulness ensures the answer doesn't make things up, distort facts, or draw unjustified conclusions.`*


**Why It Matters:**

- In **RAG pipelines**, even grounded answers can be **unfaithful** — the LLM might use valid sources but make an incorrect inference.

- This metric flags **hallucinations** or **logical overreach** from the model.



In [None]:
# @title Metric 3


# This turns the LLM into a faithfulness checker to ensure the generated content doesn’t hallucinate beyond what the context states.

faithfulness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer can be directly inferred based on the context.

Instructions:
1. First write down the steps that are needed to evaluate the context as per the metric.
2. Give a step-by-step explanation if the context adheres to the metric considering the question as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the context using the evaluation criteria and assign a score.

Output Format:
Arrange your output in the following JSON format.
{
    "steps": write down the steps that are needed to evaluate the context as per the metric.
    "explanation": provide a step-by-step explanation if the context adheres to the metric considering the question as the input.
    "evaluation": the extent to which the metric is followed.
    "rating": <1,2,3,4,5>
}
DO NOT output anything else before or after the JSON output.

"""

In [None]:
faithfulness_prompt = [
    {'role':'system', 'content': faithfulness_rater_system_message},
    {'role': 'user', 'content': user_message_template.format(
        question=user_input,
        context=context_for_query,
        answer=response_text
        )
    }
]

In [None]:
# Invoke the LLM model to perform faithfulness evaluation

faithfulness_eval = client.chat.completions.create(
    model=model_name,
    messages=faithfulness_prompt,
    temperature=0
)

print(faithfulness_eval.choices[0].message.content)

## **Context Precision**

This metric assesses **whether the context was actually useful** in helping the AI generate its answer.


**What It Measures:**

Determines whether the **retrieved context** was actually **used** to construct the answer.

1. `High score` = Answer is tightly derived from the context

2. `Low score` = Context was either irrelevant or ignored


**Why It Matters:**

Even if the answer is correct, if the **context wasn't used**, then the RAG pipeline is being **inefficient or misleading**.

- Helps diagnose over-retrieval or irrelevant chunking

- Improves **contextual alignment** in future retrievals



In [None]:
# @title Metric 4


# This prompt essentially turns the LLM into a “context usage checker,” ensuring that irrelevant or unused context doesn’t sneak in.

context_precision_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Verify if the context was useful in arriving at the given answer.

Instructions:
1. First write down the steps that are needed to evaluate the context as per the metric.
2. Give a step-by-step explanation if the context adheres to the metric considering the question as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the context using the evaluation criteria and assign a score.

Output Format:
Arrange your output in the following JSON format.
{
    "steps": write down the steps that are needed to evaluate the context as per the metric.
    "explanation": provide a step-by-step explanation if the context adheres to the metric considering the question as the input.
    "evaluation": the extent to which the metric is followed.
    "rating": <1,2,3,4,5>
}
DO NOT output anything else before or after the JSON output.

"""

In [None]:
context_precision_prompt = [
    {'role':'system', 'content': context_precision_rater_system_message},
    {'role': 'user', 'content': user_message_template.format(
        question=user_input,
        context=context_for_query,
        answer=response_text
        )
    }
]

In [None]:
context_precision_eval = client.chat.completions.create(
    model=model_name,
    messages=context_precision_prompt,
    temperature=0
)

print(context_precision_eval.choices[0].message.content)

## **Context Recall**

This metric evaluates not just whether the context was *used*, but whether **each sentence in the answer** can be **traced back** to something present in the context.


**Why It Matters:**

- High scores mean **low hallucination risk**

- Low scores highlight answers with **extra information not grounded in retrieved docs**

This metric is key in **evaluating trustworthiness and factual accuracy** in RAG systems.



In [None]:
# @title Metric 5


# This is the system instruction prompt provided to the evaluator LLM. It sets the rules for evaluating Context Recall.

context_recall_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Analyze each sentence in the answer and classify if the sentence can be attributed to the given context or not.

Instructions:
1. First write down the steps that are needed to evaluate the context as per the metric.
2. Give a step-by-step explanation if the context adheres to the metric considering the question as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the context using the evaluation criteria and assign a score.

Output Format:
Arrange your output in the following JSON format.
{
    "steps": write down the steps that are needed to evaluate the context as per the metric.
    "explanation": provide a step-by-step explanation if the context adheres to the metric considering the question as the input.
    "evaluation": the extent to which the metric is followed.
    "rating": <1,2,3,4,5>
}
DO NOT output anything else before or after the JSON output.

"""

The next cell pairs this system message with the formatted user message, so the evaluator LLM receives the question, context, and answer and follows the context recall rubric for scoring.



In [None]:
context_recall_prompt = [
    {'role':'system', 'content': context_recall_rater_system_message},
    {'role': 'user', 'content': user_message_template.format(
        question=user_input,
        context=context_for_query,
        answer=response_text
        )
    }
]

The evaluator output from the next cell shows which answer segments are traceable to the context and which are not.

In [None]:
context_recall_eval = client.chat.completions.create(
    model=model_name,
    messages=context_recall_prompt,
    temperature=0
)

print(context_recall_eval.choices[0].message.content)
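
All five metrics follow the same call pattern and differ only in the system message. As a sketch, they can be run in one loop. To keep the example independent of any API client, the LLM call is passed in as a callable; `run_all_metrics` and the `complete` parameter are hypothetical names, and the stand-in below replaces the real model call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_all_metrics(
    system_messages: Dict[str, str],
    user_message: str,
    complete: Callable[[List[Message]], str],
) -> Dict[str, str]:
    """Run every metric's rater prompt and collect the raw evaluator outputs by metric name."""
    results = {}
    for metric_name, system_message in system_messages.items():
        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ]
        results[f"{metric_name}_evaluation"] = complete(messages).strip()
    return results


# Stand-in for the model call, used here only so the sketch runs offline
def fake_complete(messages: List[Message]) -> str:
    return '{"rating": 5}\n'


outputs = run_all_metrics(
    {"groundedness": "rate groundedness", "relevance": "rate relevance"},
    "###Question\n...",
    fake_complete,
)
print(sorted(outputs))  # ['groundedness_evaluation', 'relevance_evaluation']

# In the notebook, `complete` could wrap the existing client, for example:
# complete = lambda msgs: client.chat.completions.create(
#     model=model_name, messages=msgs, temperature=0
# ).choices[0].message.content
```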

# **<font color='blue'>Inference and Output Generation</font>**

The purpose of this section is to collect all the **evaluation outputs** from each metric (groundedness, relevance, etc.), compile them into a **single pandas DataFrame**, and render both raw data and a **clean Markdown version** of the model response for user inspection.

In [None]:
import pandas as pd
from IPython.display import display, Markdown

In [None]:
# @title Load Evaluation Results and show the outputs



# Each *_eval.choices[0].message.content contains the model's structured rating (steps, explanation, score).
# .strip() cleans up extra whitespace.

# Get the full raw content from each evaluation
groundedness_output = groundedness_eval.choices[0].message.content.strip()
relevance_output = relevance_eval.choices[0].message.content.strip()
faithfulness_output = faithfulness_eval.choices[0].message.content.strip()
context_precision_output = context_precision_eval.choices[0].message.content.strip()
context_recall_output = context_recall_eval.choices[0].message.content.strip()

# Build the DataFrame row with full JSON strings
row = {
    "query": user_input,
    "response": response_text,
    "groundedness_evaluation": groundedness_output,
    "relevance_evaluation": relevance_output,
    "faithfulness_evaluation": faithfulness_output,
    "context_precision_evaluation": context_precision_output,
    "context_recall_evaluation": context_recall_output
}

# Create the DataFrame
df = pd.DataFrame([row])

* `pd.DataFrame([row])`: Wraps the dictionary into a one-row DataFrame

* `display.expand_frame_repr = True`: Lets wide DataFrames wrap across multiple lines when printed as text

* `max_colwidth = 300`: Shows up to 300 characters per cell, so long JSON fields are cut off later than with the pandas default of 50

In [None]:
# Show up to 300 characters of each JSON string instead of the default 50
pd.set_option('display.expand_frame_repr', True)
pd.set_option('display.max_colwidth', 300)

This final step:

* Displays the entire evaluation **DataFrame** (one row per query)

* Then, renders the model’s `response` in **Markdown** for better readability (preserving lists, headings, formatting)

In [None]:
# Display
display(df)
display(Markdown(f"### Response Generated \n\n{df.loc[0, 'response']}"))
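
Once numeric ratings have been extracted from the raw evaluation strings, a short summary makes weak spots easier to see than reading five JSON blobs. The sketch below assumes a dictionary that maps each metric name to an integer rating, or `None` where parsing failed. The function name `summarize_ratings` and the threshold of 3 are hypothetical choices, not values taken from the notebook.

```python
from statistics import mean


def summarize_ratings(ratings: dict, threshold: int = 3) -> dict:
    """Summarize per-metric ratings: mean score, metrics below threshold, metrics without a score."""
    scored = {name: value for name, value in ratings.items() if value is not None}
    return {
        "mean_rating": mean(scored.values()) if scored else None,
        "below_threshold": sorted(name for name, value in scored.items() if value < threshold),
        "unscored": sorted(name for name, value in ratings.items() if value is None),
    }


# Illustrative ratings only, not real evaluation results
example_ratings = {"groundedness": 5, "relevance": 2, "faithfulness": None}
print(summarize_ratings(example_ratings))
# {'mean_rating': 3.5, 'below_threshold': ['relevance'], 'unscored': ['faithfulness']}
```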