# Crayon RAG based chatbot system - RAG Chatbot using ChromaDB, LlamaIndex and Gradio

 - Environment Setup 
 - Data EDA
    - Loading data
    - Data exploration
 - Building RAG system and vector database
    - Split documents into chunks
    - Embed chunks into vectors and store the vectors in a local directory
 - Building RAG based ChatBot with WebUI and LLM function calling 
 - Model evaluation 

## Step 1: Environment Setting

Configure the environment variables and Python packages.

In [64]:
import os
import json
import llama_index
import chromadb  # vector database
import gradio as gr  # WebUI
from openai import OpenAI  # OpenAI
from importlib.metadata import version

# Loading environment variables
from dotenv import load_dotenv,find_dotenv  

# LlamaIndex
from llama_index.embeddings.openai import OpenAIEmbedding
# from llama_index.llms.openai import OpenAI
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.core.settings import Settings
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.vector_stores.chroma import ChromaVectorStore

from sklearn.metrics import confusion_matrix
import numpy as np


# Verify the LlamaIndex version and Vector database version
print(f"LlamaIndex version: {version('llama_index')}")
print(f"chromadb version: {version('chromadb')}")

# Use this line of code if you have a local .env file
load_dotenv(find_dotenv()) 
# Optionally set the LLM in LlamaIndex, e.g. Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1), using the OpenAI class from llama_index.llms.openai
Settings.embed_model = OpenAIEmbedding()

LlamaIndex version: 0.11.18
chromadb version: 0.5.16
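
The setup cell relies on `load_dotenv` having populated `OPENAI_API_KEY` before any OpenAI call is made. A minimal sketch of failing early when the key is missing is shown below; `require_env` is a hypothetical helper, not part of the notebook or of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv(find_dotenv())` would surface a missing key at setup time rather than at the first embedding or chat request.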


## Step 2: Data EDA

- Loading the dataset
- Explore the data

In [24]:
# Load data
documents = SimpleDirectoryReader("./dataset/policy").load_data()
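
The cell above only loads the data; the exploration step is not shown. As a rough sketch of what a first look could include, the hypothetical helper `summarize_texts` below computes basic size statistics over a list of strings. It assumes each loaded document exposes its content as `.text`, so it could be fed with `[doc.text for doc in documents]`; the sample strings here are made up for illustration.

```python
from statistics import mean


def summarize_texts(texts):
    """Return basic size statistics for a list of document strings."""
    # Whitespace-separated words are a crude proxy for tokens
    word_counts = [len(text.split()) for text in texts]
    return {
        "num_documents": len(texts),
        "total_words": sum(word_counts),
        "min_words": min(word_counts, default=0),
        "max_words": max(word_counts, default=0),
        "mean_words": mean(word_counts) if word_counts else 0.0,
    }


# Illustrative stand-in for the loaded policy documents
sample = ["Data privacy policy text.", "AI ethics policy with a few more words."]
print(summarize_texts(sample))
```

Statistics like these help judge whether the chunk size chosen in the next step is reasonable for the corpus.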

## Step 3: Building RAG system
Building the RAG system consists of several steps:
- Split documents into chunks (Nodes)
- Embed the chunks into vectors and store the vectors in a vector database
- Create a retriever on top of the vector database
- Run semantic search with a query

### Step 3.1 : Chunk documents into Nodes

As the whole document is too large to fit into the context window of the LLM, you will need to partition it into smaller text chunks, which are called Nodes in LlamaIndex.

With the SimpleNodeParser, each document is split into chunks of up to 258 tokens, and consecutive chunks overlap by 50 tokens.

In [None]:
# split documents into chunks
node_parser = SimpleNodeParser.from_defaults(chunk_size=258, chunk_overlap=50)
# Extract nodes from documents
nodes = node_parser.get_nodes_from_documents(documents)
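
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not how `SimpleNodeParser` is implemented (the real parser works with a tokenizer and tries to respect sentence boundaries); `sliding_window_chunks` is a hypothetical helper that operates on an already tokenized list.

```python
def sliding_window_chunks(tokens, chunk_size=258, chunk_overlap=50):
    """Split a token list into fixed-size chunks that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Small numbers so the overlap is easy to see
for chunk in sliding_window_chunks(list(range(10)), chunk_size=4, chunk_overlap=1):
    print(chunk)
# [0, 1, 2, 3]
# [3, 4, 5, 6]
# [6, 7, 8, 9]
```

The shared tokens at the chunk boundaries reduce the chance that a sentence relevant to a query is cut in half between two chunks.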

### Step 3.2: Embed chunks into vectors and store the vectors in a vector database via ChromaDB

In [None]:

# initialize client, setting path to save data
db = chromadb.PersistentClient(path="./dataset/chroma_db")

# create collection
chroma_collection = db.get_or_create_collection("policy_knowledge")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create your index
# build a VectorStoreIndex that embeds the pre-chunked nodes and stores the embeddings in Chroma for future retrieval
index = VectorStoreIndex(nodes, storage_context=storage_context)


### Step 3.3: Create a retriever using Chroma 
Create a retriever that supports semantic search.

First, initialize the PersistentClient with the same path you specified when creating the Chroma vector store. Then retrieve the "policy_knowledge" collection you created previously from Chroma. Use this collection to initialize the ChromaVectorStore that holds the embeddings of the policy documents, and load the index with the from_vector_store function of VectorStoreIndex.

In [None]:
# Load from disk
load_client = chromadb.PersistentClient(path="./dataset/chroma_db")

# Fetch the collection
chroma_collection = load_client.get_collection("policy_knowledge")

# Fetch the vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Get the index from the vector store
index = VectorStoreIndex.from_vector_store(vector_store)

### Step 3.4: Semantic search with vector database

In [None]:
def get_external_policy_knowledge(query):
    # Search for the relevant chunks via semantic search
    # and generate a response with the LLM

    # Build a query engine on the index; running a query also checks that retrieval works
    test_query_engine = index.as_query_engine()
    response = test_query_engine.query(query)
    return str(response)  # Convert response to string to ensure it's serializable


# Test semantic search and get an answer generated by the OpenAI GPT-3.5 model
print(get_external_policy_knowledge("Tell me about the government financial policy"))


def get_relevant_chunks(query):
    # Search for the single most relevant chunk via semantic search
    retriever = VectorIndexRetriever(index=index, similarity_top_k=1)
    # Retrieve the nodes that best match the query
    nodes = retriever.retrieve(query)
    return nodes[0].text

The provided context information does not mention anything about government financial policy.
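
Semantic search ranks chunks by how close their embedding vectors are to the query embedding. The sketch below illustrates that ranking with cosine similarity over tiny, made-up 3-dimensional vectors; it does not model Chroma's actual index or distance settings, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=1):
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [(cid, score) for score, cid in scored[:k]]


# Made-up embeddings: one axis per policy topic, purely for illustration
chunks = {
    "privacy": [1.0, 0.0, 0.0],
    "ethics": [0.0, 1.0, 0.0],
    "governance": [0.0, 0.0, 1.0],
}
print(top_k([0.9, 0.1, 0.0], chunks, k=1))
```

Note that a retriever with `similarity_top_k=1` always returns its nearest chunk, even for an out-of-domain query such as the financial-policy question above; it is the LLM that then reports that the context does not cover the topic.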


## Step 4: Chatbot - OpenAI with function calling
Build a RAG-based chatbot with a WebUI and LLM function calling:
 - Use the LLM with function calling as the router to support multi-turn conversation
 - Build the WebUI with Gradio as a prototype
 - Use gpt-4 as the LLM router
 - Use gpt-3.5-turbo as the answer summarizer, which summarizes the query and the extracted text chunks

In [48]:
# Define the function schema that OpenAI will use
functions = [
    {
        "name": "get_external_policy_knowledge",
        "description": "Retrieve knowledge from external policy documents. This document database consists of the Data Privacy Policy, AI Ethics Policy and Model Governance Policy",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The query to search for in the policy documents"
                }
            },
            "required": ["query"]
        }
    }
]
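
The schema marks `query` as required, but the arguments come back from the model as a JSON string that may be malformed or incomplete. A defensive parsing step could look like the sketch below; `parse_function_args` is a hypothetical helper and is not used by the notebook's `predict` function.

```python
import json


def parse_function_args(raw_arguments, required=("query",)):
    """Parse a function-call arguments string; return None if unusable."""
    try:
        args = json.loads(raw_arguments)
    except (TypeError, json.JSONDecodeError):
        return None  # not valid JSON (or not a string at all)
    if not isinstance(args, dict):
        return None  # the schema expects a JSON object
    if any(not args.get(key) for key in required):
        return None  # a required field is missing or empty
    return args


print(parse_function_args('{"query": "data retention period"}'))
print(parse_function_args('{"query": '))
```

Returning `None` lets the caller fall back to a direct answer or an error message instead of raising inside the chat handler.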

In [None]:
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"),)

def predict(message, history):
    history_openai_format = [{"role": "system", "content": "You are an AI assistant. Only answer the question based on the chat history or the information extracted from the external database. Don't answer questions based on your own knowledge."}]
    for human, assistant in history:
        history_openai_format.append({"role": "user", "content": human})
        history_openai_format.append({"role": "assistant", "content": assistant})
    history_openai_format.append({"role": "user", "content": message})

    # First, use the LLM as the query router: it decides whether to answer directly or to call the function that retrieves relevant knowledge from the vector database
    # Get the response from OpenAI, which may contain a function call
    response = client.chat.completions.create(
        model='gpt-4',
        messages=history_openai_format,
        functions=functions,
        function_call="auto",
        temperature=0.1
    )

    response_message = response.choices[0].message

    # Check if the model wants to call a function
    if response_message.function_call:
        # Get the function call details
        function_name = response_message.function_call.name
        function_args = json.loads(response_message.function_call.arguments)

        # Call the function
        if function_name == "get_external_policy_knowledge":
            function_response = get_external_policy_knowledge(function_args.get("query"))
            print(function_response)

            return function_response

            # # Add the function result to the message history
            # history_openai_format.append({
            #     "role": "function",
            #     "name": function_name,
            #     "content": function_response
            # })

            # # Get a new response from the model with the function result
            # second_response = client.chat.completions.create(
            #     model='gpt-4',
            #     messages=history_openai_format,
            #     temperature=0.1
            # )
            
            # return second_response.choices[0].message.content
    else:
        # If no function call is needed, return the response directly
        return response_message.content


# Create and launch the Gradio WebUI interface
gr.ChatInterface(predict).launch()



* Running on local URL:  http://127.0.0.1:7870

To create a public link, set `share=True` in `launch()`.




Develop a comprehensive ethical risk assessment toolkit that includes templates, best practices, and guidelines to standardize the assessment process across the company.
The document does not provide information on government financial policy.
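
The commented-out branch in `predict` hints at a second model call that would summarize the retrieved text instead of returning it verbatim. The message handling for that step can be sketched without calling the API; `build_messages` and `with_function_result` are hypothetical helpers, and the history is assumed to be a list of `(user, assistant)` pairs as in the loop above.

```python
def build_messages(system_prompt, history, message):
    """Convert (user, assistant) pairs plus the new message to chat format."""
    messages = [{"role": "system", "content": system_prompt}]
    for human, assistant in history:
        messages.append({"role": "user", "content": human})
        messages.append({"role": "assistant", "content": assistant})
    messages.append({"role": "user", "content": message})
    return messages


def with_function_result(messages, function_name, function_response):
    """Return a new message list that ends with the function's output."""
    return messages + [
        {"role": "function", "name": function_name, "content": function_response}
    ]


msgs = build_messages("You are an AI assistant.", [("Hi.", "Hello!")], "Where is data stored?")
msgs = with_function_result(msgs, "get_external_policy_knowledge", "Retrieved policy text...")
print([m["role"] for m in msgs])
# ['system', 'user', 'assistant', 'user', 'function']
```

Passing the extended list to a second chat completion would let the model phrase the final answer from the retrieved content, at the cost of one more request per turn.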


## Step 5: Model evaluation
### Eval 1: LLM as a router performance 
 - confusion matrix

 - **Test data**: both in-domain and out-of-domain questions

### Eval 2: RAG system performance
 - Accuracy
   - BLEU or ROUGE
 - Faithfulness
 - RAG retriever relevance
 - Latency 
    - LLM response time
    - RAG response time

 - **Test data:** in-domain questions that are relevant to the documents


### Step 5.1: Define several labeled test cases
Manually provide several test cases with corresponding answers.
The test dataset contains 5 test cases, which consist of:
- In-domain questions
- An irrelevant question
- An out-of-distribution question

In [76]:
labeled_text_cases = [
    # In-domain questions in terms of data policy, AI policy and model policy
    {   # Data privacy policy question
        "text": "Where should the data be stored.",
        "expected_answer": "Data is securely stored in state-of-the-art data centers located in the United States, the European Union, and other jurisdictions, depending on the nature of the data and the services provided. Each location is chosen based on stringent security standards and data protection compliance.",
        "label": "function_call"
    },
    {   # AI ethics policy question 
        "text": "Tell me about the AI Transparency rules.",
        "expected_answer": "Enhance transparency by developing interfaces that allow users to query AI decisions and receive explanations in understandable terms. Document all AI systems decision-making processes and methodologies, ensuring that this documentation is accessible to all relevant stakeholders and regularly updated.",
        "label": "function_call"
    },
    {   # Model governance policy question 
        "text": "How to test and validate the model?",
        "expected_answer": "Models must undergo rigorous testing to validate their accuracy, performance, and fairness. Validation tests should be designed to cover various operational scenarios and should include stress and failure mode analysis. Documentation of all test results is mandatory for auditability and further review.",
        "label": "function_call"
    },

    # Out-of-distribution question
    {   # Question from another domain
        "text": "Tell me about the Australia financial regulation",
        "expected_answer": "I cannot answer this question",
        "label": "direct_answer"
    },

    # Irrelevant question
    {   # General question; there is no need to call the RAG system
        "text": "Hi.",
        "expected_answer": "Hello, what can I help you today",
        "label": "direct_answer"
    }
]


### Step 5.2: Evaluate the LLM as a router
Check whether the LLM routes queries to the RAG system correctly.

In [70]:

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"),)

def indicator_call_RAG(query):
    # Providing no response is more acceptable than offering an incorrect one, as the latter could mislead the user regarding their query
    # history_openai_format = [{"role": "system", "content": "You are AI assistant, only answer the question based on the chat history or the information extracted from the external database. Don't answer questions based on you own knowledge."}]
    history_openai_format = [{"role": "system", "content": "You are a policy documents assistant, and you are responsible for answering company data privacy, AI ethics and data governance policy questions. You can provide no response, as it is more acceptable than offering an incorrect one, as the latter could mislead the user regarding their query."}]
    history_openai_format.append({"role": "user", "content": query})
    # First, use the LLM as the query router: it decides whether to answer directly or to call the function that retrieves relevant knowledge from the vector database
    # Get the response from OpenAI, which may contain a function call
    response = client.chat.completions.create(
        model='gpt-4',
        messages=history_openai_format,
        functions=functions,
        function_call="auto",
        temperature=0
    )

    response_message = response.choices[0].message

    # Check if the model wants to call a function
    if response_message.function_call:
        return "function_call"
    else:
        return "direct_answer"

for item in labeled_text_cases:
    item["predicted_label"] = indicator_call_RAG(item["text"])


In [69]:
import pandas as pd
# Extract labels and predicted labels
y_true = [item['label'] for item in labeled_text_cases]
y_pred = [item['predicted_label'] for item in labeled_text_cases]

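# Calculate the confusion matrix (rows: true labels, columns: predicted labels)
labels = ["function_call", "direct_answer"]  # Fixed order; list(set(...)) can reorder labels between runs
cm = confusion_matrix(y_true, y_pred, labels=labels)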

# Create a DataFrame for better visualization
cm_df = pd.DataFrame(cm, index=labels, columns=labels)

# Print the confusion matrix
print("Confusion Matrix:")
print(cm_df)

Confusion Matrix:
               function_call  direct_answer
function_call              3              0
direct_answer              1              1
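
Reading the matrix with rows as true labels and columns as predicted labels, all three `function_call` cases were routed correctly, while one of the two `direct_answer` cases was sent to the RAG function. Treating `function_call` as the positive class, the usual summary metrics can be derived as in this sketch (`router_metrics` is a hypothetical helper; the counts are taken from the output above).

```python
def router_metrics(tp, fn, fp, tn):
    """Accuracy, precision and recall for the positive (function_call) class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# tp: function_call predicted as function_call, fn: function_call predicted as direct_answer
# fp: direct_answer predicted as function_call, tn: direct_answer predicted as direct_answer
print(router_metrics(tp=3, fn=0, fp=1, tn=1))
# {'accuracy': 0.8, 'precision': 0.75, 'recall': 1.0}
```

With only five test cases these numbers are indicative at best; a larger labeled set would be needed before drawing conclusions about the router.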


### Step 5.3: RAG system performance
 - Accuracy
   - BLEU or ROUGE
 - Faithfulness
 - RAG retriever relevance
 - Latency 
    - LLM response time
    - RAG response time

 - **Test data:** in-domain questions that are relevant to the documents

In [78]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
import time
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu

# Load the embedding model (e.g., SentenceTransformer)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight and fast for semantic similarity

def evaluate_rag_system_with_semantics(labeled_text_cases):
    """
    Evaluate the RAG system based on the provided labeled cases, with semantic analysis for faithfulness.

    Metrics:
    - Accuracy: BLEU, ROUGE
    - Faithfulness: Semantic similarity between retrieved chunks and LLM response
    - RAG retriever relevance: Compare retrieved chunks with expected
    - Latency: Measure response time for retriever and LLM

    Args:
        labeled_text_cases (list): A list of test cases with queries, expected answers, and labels.

    Returns:
        dict: Evaluation metrics results.
    """
    metrics = {
        "accuracy": {"bleu": [], "rouge": []},
        "faithfulness": [],
        "retriever_relevance": [],
        "latency": {"retriever_time": [], "llm_time": []},
    }

    rouge = Rouge()

    # Keep only the first three cases, which are the in-domain questions
    labeled_text_cases = labeled_text_cases[:3]

    for case in labeled_text_cases:
        query = case["text"]
        expected_answer = case["expected_answer"]

        # Measure retriever latency
        retriever_start = time.time()
        retrieved_chunks = get_relevant_chunks(query)
        retriever_end = time.time()
        metrics["latency"]["retriever_time"].append(retriever_end - retriever_start)

        # Measure LLM latency
        llm_start = time.time()
        llm_response = get_external_policy_knowledge(query)
        llm_end = time.time()
        metrics["latency"]["llm_time"].append(llm_end - llm_start)

        # Accuracy: BLEU and ROUGE
        bleu_score = sentence_bleu([expected_answer.split()], llm_response.split())
        rouge_scores = rouge.get_scores(llm_response, expected_answer, avg=True)
        metrics["accuracy"]["bleu"].append(bleu_score)
        metrics["accuracy"]["rouge"].append(rouge_scores)

        # Faithfulness: Semantic similarity between retrieved chunks and LLM response
        retrieved_embedding = embedding_model.encode(retrieved_chunks, convert_to_tensor=True)
        llm_response_embedding = embedding_model.encode(llm_response, convert_to_tensor=True)
        faithfulness_score = util.pytorch_cos_sim(retrieved_embedding, llm_response_embedding).item()
        metrics["faithfulness"].append(faithfulness_score)

        # RAG retriever relevance: strict check that the whole retrieved chunk
        # appears verbatim inside the expected answer (rarely true for long chunks)
        retriever_relevance = retrieved_chunks in expected_answer
        metrics["retriever_relevance"].append(retriever_relevance)

    # Summarize results
    results = {
        "accuracy": {
            "avg_bleu": sum(metrics["accuracy"]["bleu"]) / len(metrics["accuracy"]["bleu"]),
            "avg_rouge": {
                key: sum([score[key]["f"] for score in metrics["accuracy"]["rouge"]]) / len(metrics["accuracy"]["rouge"])
                for key in ["rouge-1", "rouge-2", "rouge-l"]
            },
        },
        "faithfulness": np.mean(metrics["faithfulness"]),
        "retriever_relevance": sum(metrics["retriever_relevance"]) / len(metrics["retriever_relevance"]),
        "latency": {
            "avg_retriever_time": sum(metrics["latency"]["retriever_time"]) / len(metrics["latency"]["retriever_time"]),
            "avg_llm_time": sum(metrics["latency"]["llm_time"]) / len(metrics["latency"]["llm_time"]),
        },
    }

    return results

# Example usage
results = evaluate_rag_system_with_semantics(labeled_text_cases)
print("Evaluation Results:")
print(f"{results}")

The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Evaluation Results:
{'accuracy': {'avg_bleu': 0.6449922429023552, 'avg_rouge': {'rouge-1': 0.7373737323752678, 'rouge-2': 0.6558307483542042, 'rouge-l': 0.7272727222742578}}, 'faithfulness': 0.6541802883148193, 'retriever_relevance': 0.0, 'latency': {'avg_retriever_time': 0.6374595165252686, 'avg_llm_time': 2.9588892459869385}}
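
The retriever relevance of 0.0 follows from the strict check `retrieved_chunks in expected_answer`, which is only true when the entire chunk is a substring of the expected answer. A softer alternative, shown here as a sketch and not as the notebook's method, is the fraction of expected-answer tokens found in the retrieved chunk; `tokenize` and `token_recall` are hypothetical helpers, and the strings are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_recall(expected_answer, retrieved_chunk):
    """Fraction of expected-answer tokens that appear in the retrieved chunk."""
    expected = tokenize(expected_answer)
    if not expected:
        return 0.0
    return len(expected & tokenize(retrieved_chunk)) / len(expected)


expected = "Data is stored securely"
chunk = "Section 3. Data is securely stored in data centers."
print(chunk in expected)              # strict substring check
print(token_recall(expected, chunk))  # token-level overlap
```

A threshold on this score (or on an embedding similarity, as already used for faithfulness) would give a relevance signal that does not collapse to zero whenever the chunk is longer than the reference answer.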


In [82]:

def format_results_as_table(results):
    """
    Format the RAG evaluation results into a table (Pandas DataFrame).

    Args:
        results (dict): The evaluation results.

    Returns:
        pd.DataFrame: A formatted table of results.
    """
    # Convert the nested results into a flat structure for better readability
    table_data = {
        "Metric": ["BLEU", "ROUGE-1", "ROUGE-2", "ROUGE-L", "Faithfulness", "Retriever Relevance", "Avg Retriever Time (s)", "Avg LLM Time (s)"],
        "Value": [
            results["accuracy"]["avg_bleu"],
            results["accuracy"]["avg_rouge"]["rouge-1"],
            results["accuracy"]["avg_rouge"]["rouge-2"],
            results["accuracy"]["avg_rouge"]["rouge-l"],
            results["faithfulness"],
            results["retriever_relevance"],
            results["latency"]["avg_retriever_time"],
            results["latency"]["avg_llm_time"]
        ]
    }

    # Create a DataFrame from the table data
    df_results = pd.DataFrame(table_data)
    return df_results

# Reformat the results into a table
df_evaluation_results = format_results_as_table(results)

df_evaluation_results

| | Metric | Value |
|---|---|---|
| 0 | BLEU | 0.644992 |
| 1 | ROUGE-1 | 0.737374 |
| 2 | ROUGE-2 | 0.655831 |
| 3 | ROUGE-L | 0.727273 |
| 4 | Faithfulness | 0.65418 |
| 5 | Retriever Relevance | 0.0 |
| 6 | Avg Retriever Time (s) | 0.63746 |
| 7 | Avg LLM Time (s) | 2.958889 |
