<h1 align="center">
    <img 
        src="../img/logo_white_bg.jpeg" 
        width="200" 
        border="1" />
</h1>
<h1 align="center">
    <b>GenAISHAP</b>
</h1>
<h4 align="center">
    <i>Explanations for Generative AI, LLM-and-SLM-Based, Solutions</i> ⚡️
</h4>



Generative AI SHAP (GenAISHAP) is a Python library that supports the creation of explanations for the metrics obtained from solutions based on LLMs (Large Language Models) or SLMs (Small Language Models).

When building an LLM- or SLM-based solution, one of the first challenges is how to measure the quality of the responses from the "Agent". Libraries like [RAGAS](https://github.com/explodinggradients/ragas) or [promptflow](https://github.com/microsoft/promptflow) help evaluate the quality of the solution using metrics such as **faithfulness**, **groundedness**, **context precision**, and **context recall**, among others.

The open challenge is to add explainability to those quality metrics, in order to answer questions like:

- *Why is a particular question marked with a higher or lower metric (e.g., faithfulness)?*
- *What are the common characteristics of the user questions that produce good or bad **faithfulness**?*
- *What type of prompts produce higher or lower **context recall**?*

Answering those questions helps debug the overall solution and gives more insight into where to focus the next steps to improve the metrics.

This notebook shows an example of how to create the **Input** for GenAISHAP, which is a simple Pandas DataFrame with the evaluation dataset. This dataset should have at least the following columns:

- User input, question or prompt. Column name should be `user_input` and its type should be string.
- One column for each metric already calculated, for example **faithfulness**, **context precision**, **context recall**.  All numerical columns will be assumed to be a metric column.

At the end of this notebook, a pandas DataFrame like the following will be produced:

<img src="../img/input_example.png" width="1200" />

> In this example, the column `user_input` will be used to refer to the user prompt, and the columns `faithfulness`, `context_precision` and `context_recall` will be used as metric columns since those columns are numerical.
>
> The other columns, `retrieved_contexts`, `response`, and `reference` are not needed for **GenAISHAP** but are normally required for the calculation of the metrics.
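
The input requirements described above can be checked before handing a DataFrame to GenAISHAP. The following is a minimal sketch, not part of the library: the helper name `check_genaishap_input` and the two-row example dataset are hypothetical, and the only rules it encodes are the ones stated here (a string `user_input` column, and every numerical column treated as a metric).

```python
import pandas as pd


def check_genaishap_input(df: pd.DataFrame) -> list[str]:
    """Return the metric columns of `df`, raising if `user_input` is unusable."""
    if "user_input" not in df.columns:
        raise ValueError("missing required column 'user_input'")
    if not df["user_input"].map(lambda value: isinstance(value, str)).all():
        raise ValueError("'user_input' must contain only strings")
    # Every numerical column is assumed to be a metric column.
    metric_columns = df.select_dtypes(include="number").columns.tolist()
    if not metric_columns:
        raise ValueError("no numerical (metric) columns found")
    return metric_columns


# Hypothetical two-row dataset, for illustration only.
example = pd.DataFrame(
    {
        "user_input": ["What is X?", "How does Y work?"],
        "response": ["X is ...", "Y works by ..."],
        "faithfulness": [1.0, 0.5],
        "context_recall": [0.0, 1.0],
    }
)
metric_columns = check_genaishap_input(example)
print(metric_columns)
```

Non-numerical columns such as `response` are simply ignored by this check, which mirrors how `retrieved_contexts`, `response` and `reference` are not needed as metric columns.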

# 1. Import relevant libraries and load environment variables

Before running this notebook, make sure you have a `.env` file that defines the following variables with their respective values:

- OPENAI_API_VERSION
- AZURE_OPENAI_ENDPOINT
- OPENAI_API_KEY
- PYTHONIOENCODING=utf-8
- PYTHONUTF8=1
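
A quick way to catch a missing variable early is to check the environment right after `load_dotenv()`. This is a small sketch using only the standard library; the helper name `missing_env_vars` is hypothetical, and it checks only the three connection variables listed above.

```python
import os

# The three variables read later through os.environ[...] in this notebook.
REQUIRED_VARS = ("OPENAI_API_VERSION", "AZURE_OPENAI_ENDPOINT", "OPENAI_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# An empty list means every required variable is available.
print(missing_env_vars())
```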

> Note that the **genaishap** library is not imported here, since this notebook is only an example of how to create the input for the GenAISHAP execution.


In [1]:
from llama_index.core.llama_dataset import download_llama_dataset
from dotenv import load_dotenv
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
import os
from llama_index.core import VectorStoreIndex, Settings
from ragas.metrics import (
    Faithfulness,
    ContextPrecision,
    ContextRecall
)
from ragas.llms import LlamaIndexLLMWrapper
from ragas.embeddings import LlamaIndexEmbeddingsWrapper
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.evaluation import evaluate
from ragas.run_config import RunConfig

In [2]:
load_dotenv()

True

# 2. Download the example dataset

As an example, the [Mini ESG Bench Dataset](https://llamahub.ai/l/llama_datasets/Mini%20ESG%20Bench%20Dataset?from=llama_datasets) will be used. This dataset provides, among others:

- 6 PDFs with the 2022 or 2021 sustainability report from Microsoft, Apple, Amazon, Google, Meta and Netflix.
- 50 user queries that should be answered using those documents.
- Reference answer for each query that will be used as ground truth to calculate the RAGAS metrics.

In [3]:
rag_dataset, documents = download_llama_dataset(
    llama_dataset_class="MiniEsgBenchDataset", 
    download_dir="./data",
    show_progress=True
)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:08<00:00,  1.38s/it]
Loading files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:19<00:00,  3.29s/file]


The previous cell should have created a folder named `data` with 2 elements:
- a `rag_dataset.json` file with the information shown after executing the next cell.
- a `source_files` folder with the PDF sustainability reports listed above.
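
To confirm the download produced the layout described above, the folder can be inspected with the standard library. This is a sketch under the assumption that the layout is exactly `rag_dataset.json` plus a `source_files` folder; the helper name `summarize_download` is hypothetical.

```python
from pathlib import Path


def summarize_download(download_dir: str) -> dict:
    """Report whether `rag_dataset.json` exists and list the files in `source_files`."""
    root = Path(download_dir)
    source_dir = root / "source_files"
    # A missing folder is reported as an empty list instead of raising.
    source_files = sorted(p.name for p in source_dir.iterdir()) if source_dir.is_dir() else []
    return {
        "has_rag_dataset": (root / "rag_dataset.json").is_file(),
        "source_files": source_files,
    }


print(summarize_download("./data"))
```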

In [4]:
rag_dataset.to_pandas()

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,Can you provide for me the three highlights fo...,[GHG emissions\n65%\ncumulative GHG\nemissions...,"Sure, they are: \n1. 65% cumulative GHG emissi...",human,human
1,What percentage of waste from Google's offices...,"[64%\nlandfill diversion\nIn 2021, we reached ...",Sixty-four percent.,human,human
2,Can you present me with the performance highli...,[EMPOWERING USERS WITH TECHNOLOGY\nProducts\nT...,Sure! The Performance Highlights for Empowerin...,human,human
3,What was the listed key achievement regarding ...,[We’ve been a leader on sustainability and cli...,"In 2017, Google became the first major company...",human,human
4,Did Google reach its intended Waste target und...,[Target: Achieve UL 2799 Zero Waste to Landfil...,"No, this target has not been met in 2021. Howe...",human,human
5,How many EV charging locations were there on G...,"[200,000\nEV charging locations\non Google Map...",200000,human,human
6,On what page of the report can I find the perf...,[EMPOWERING USERS WITH TECHNOLOGY\nProducts\nT...,The performance highlights for Empowering User...,human,human
7,Can you please provide for me the glossary of ...,[Glossary\nCFE: carbon-free energyCO2e: carbon...,"Sure, here is the glossary:\nGlossary\nCFE: ca...",human,human
8,On what page can I find details about Amazons ...,[Contents\nIntroduction\n2 About Amazon\n3 Ope...,You can find information on driving climate so...,human,human
9,"For the listed Renewable Energy goals, by when...",[Renewable Energy\nGoal: Power our operations ...,Amazon set the goal of becoming powered by 100...,human,human


# 3. Create semantic index for RAG

The following cell creates the connection objects for the embedding model and the language model that will be used to calculate the response to each user query in the dataset:

In [5]:
embed_model = AzureOpenAIEmbedding(
    model='text-embedding-3-small', # Update with the embeddings deployment name
    api_key=os.environ['OPENAI_API_KEY'],
    api_version=os.environ['OPENAI_API_VERSION'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT']
)

llm = AzureOpenAI(
    engine="gpt-4o", # Update with the language model deployment name 
    model="gpt-4o", # Update with the language model name
    temperature=0.0,
    api_key=os.environ['OPENAI_API_KEY'],
    api_version=os.environ['OPENAI_API_VERSION'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT']
)

The next two cells create the "in-memory" semantic index with chunks of the downloaded documents.

In [6]:
Settings.embed_model = embed_model
Settings.llm = llm

In [7]:
index = VectorStoreIndex.from_documents(
    documents=documents,
    show_progress=True
)
query_engine = index.as_query_engine()

Parsing nodes:   0%|          | 0/455 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/508 [00:00<?, ?it/s]
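
Conceptually, a semantic index stores one embedding vector per chunk and, at query time, returns the chunks whose vectors are closest to the query vector. The toy sketch below illustrates that idea with cosine similarity over hypothetical 3-dimensional vectors. It is not how `VectorStoreIndex` is implemented, and the function names and vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings; real ones come from the embedding model.
chunks = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.9, 0.1, 0.0],
}
query = [1.0, 0.0, 0.0]
print(top_k_chunks(query, chunks, k=2))
```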

# 4. Calculate the response
The next cell calculates the response by querying the semantic index and calling the language model. After this step the following elements are available for each user query in the original dataset:

- `contexts`. The list of the top chunks retrieved from the semantic index, passed as context for the LLM to calculate the answer. It is used later as `retrieved_contexts`.
- `response`. Response of the LLM to the user query using the retrieved chunks as context. 

> *This process could take several minutes, even hours, depending on the rate limit (max tokens per minute) of the deployments used.*

In [8]:
%%time
predictions = rag_dataset.make_predictions_with(
    predictor = query_engine,
    show_progress = True,
    batch_size = 20
)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:25<00:00,  1.29s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:21<00:00,  1.08s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:10<00:00,  1.00s/it]

CPU times: user 2.88 s, sys: 85.9 ms, total: 2.97 s
Wall time: 57.4 s
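
The three progress bars above (20, 20 and 10 items) correspond to the 50 queries being processed in batches of `batch_size = 20`. A minimal sketch of that splitting, written for illustration and not taken from the library:

```python
def split_into_batches(items, batch_size):
    """Split `items` into consecutive lists of at most `batch_size` elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


queries = [f"query {i}" for i in range(50)]  # placeholder queries
batch_sizes = [len(batch) for batch in split_into_batches(queries, 20)]
print(batch_sizes)  # [20, 20, 10]
```

A smaller batch size sends fewer requests at once, which can help when the deployment has a low rate limit.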





# 5. Data preparation for the calculation of the RAGAS metrics

The following cell shows an example of how to prepare the data for the calculation of the RAGAS metrics. For this example we only need to use the following elements:

- `user_input`. In this case, the user input is just the `query` in the original dataset.
- `reference`. The reference column in RAGAS is the **expected answer**. We have it as `reference_answer` in the original dataset.
- `response`. Bot response. We just calculated it in the last step.
- `retrieved_contexts`. List of documents or chunks used as context for the Bot to create the answer. It was also calculated in the last step.


In [9]:
list_of_samples = []

for idx in range(len(rag_dataset.examples)):
    list_of_samples.append(
        SingleTurnSample (
            user_input = rag_dataset.examples[idx].query,
            reference = rag_dataset.examples[idx].reference_answer,
            response = predictions.predictions[idx].response,
            retrieved_contexts = predictions.predictions[idx].contexts
        )
    )

ragas_evaluation_dataset = EvaluationDataset(list_of_samples)
ragas_evaluation_dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,response,reference
0,Can you provide for me the three highlights fo...,"[31. In 2018, to align with industry best prac...",The three highlights for the GHG emissions sec...,"Sure, they are: \n1. 65% cumulative GHG emissi..."
1,What percentage of waste from Google's offices...,[Performance highlights\nThe following section...,The percentage of waste from Google's offices ...,Sixty-four percent.
2,Can you present me with the performance highli...,"[Education\nFor more than 40 years, we’ve work...",The performance highlights for empowering user...,Sure! The Performance Highlights for Empowerin...
3,What was the listed key achievement regarding ...,[Our approach\nWe believe that every business ...,There is no listed key achievement for Google ...,"In 2017, Google became the first major company..."
4,Did Google reach its intended Waste target und...,[BUILDING BETTER DEVICES AND SERVICES\nTarget ...,"Yes, in 2021, Google achieved the Waste target...","No, this target has not been met in 2021. Howe..."
5,How many EV charging locations were there on G...,[This guidance does not recognize existing ren...,The provided information does not specify the ...,200000
6,On what page of the report can I find the perf...,"[Employee Recruitment, Inclusion and Performan...",The performance highlights for the Empowering ...,The performance highlights for Empowering User...
7,Can you please provide for me the glossary of ...,[GRI INDEX\nGRI 304 - Biodiversity\nGRI 103 Ma...,The provided information does not include a gl...,"Sure, here is the glossary:\nGlossary\nCFE: ca..."
8,On what page can I find details about Amazons ...,[IntroductionSustainability\nDriving Climate S...,Details about Amazon's climate solutions can b...,You can find information on driving climate so...
9,"For the listed Renewable Energy goals, by when...",[IntroductionSustainability\nDriving Climate S...,Amazon intends to have all operations powered ...,Amazon set the goal of becoming powered by 100...


# 6. Calculation of RAGAS metrics

The following 2 cells will execute the calculation of the RAGAS metrics. For this example we are going to calculate **faithfulness**, **context_precision**, and **context_recall**. To see all the RAGAS metrics and the full RAGAS documentation follow this link: [https://docs.ragas.io/en/latest/](https://docs.ragas.io/en/latest/) 

> *This process could take several minutes, even hours, depending on the rate limit (max tokens per minute) of the deployments used.*
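
To give an intuition for the scores, the sketch below computes a context precision value from per-chunk relevance verdicts, following our reading of the formula in the RAGAS documentation: the mean of precision@k over the ranks that hold a relevant chunk. In RAGAS the verdicts come from the evaluator LLM; here they are hypothetical inputs, and the function is an illustration rather than the library's implementation.

```python
def context_precision(verdicts):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    `verdicts[k - 1]` is 1 if the chunk at rank k was judged relevant, else 0.
    """
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, verdict in enumerate(verdicts, start=1):
        hits += verdict
        # precision@k only contributes when the chunk at rank k is relevant.
        score += (hits / k) * verdict
    return score / relevant_total


# With two retrieved chunks, only these values are possible:
print(context_precision([0, 0]))  # 0.0
print(context_precision([0, 1]))  # 0.5
print(context_precision([1, 0]))  # 1.0
```

Under this formula a relevant chunk ranked first scores higher than the same chunk ranked second, so the metric rewards retrieval that puts useful context at the top.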

In [10]:
evaluator_llm = LlamaIndexLLMWrapper(llm)
evaluator_embeddings = LlamaIndexEmbeddingsWrapper(embed_model)

In [11]:
%%time

metrics = [
    Faithfulness(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm)
]
ragas_evaluation_result = evaluate(
    dataset=ragas_evaluation_dataset,
    metrics=metrics,
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    run_config=RunConfig(timeout=1800, max_wait=180, max_retries=20),
    show_progress=True,
    batch_size=20
)

Evaluating:   0%|          | 0/150 [00:00<?, ?it/s]

Batch 1/8:   0%|          | 0/20 [00:00<?, ?it/s]

CPU times: user 4.14 s, sys: 192 ms, total: 4.33 s
Wall time: 1min 41s


# 7. Data cleansing

In some cases the RAGAS metrics can return **null** values. The following cell removes all the records with null metrics.

In [12]:
df_ragas_result = ragas_evaluation_result.to_pandas()

# Remove the records where any of the metrics is null
df_ragas_result = df_ragas_result.dropna(
    subset=['faithfulness', 'context_precision', 'context_recall']
).reset_index(drop=True)

df_ragas_result

Unnamed: 0,user_input,retrieved_contexts,response,reference,faithfulness,context_precision,context_recall
0,Can you provide for me the three highlights fo...,"[31. In 2018, to align with industry best prac...",The three highlights for the GHG emissions sec...,"Sure, they are: \n1. 65% cumulative GHG emissi...",1.0,0.0,0.0
1,What percentage of waste from Google's offices...,[Performance highlights\nThe following section...,The percentage of waste from Google's offices ...,Sixty-four percent.,1.0,0.0,0.0
2,Can you present me with the performance highli...,"[Education\nFor more than 40 years, we’ve work...",The performance highlights for empowering user...,Sure! The Performance Highlights for Empowerin...,1.0,0.0,0.0
3,What was the listed key achievement regarding ...,[Our approach\nWe believe that every business ...,There is no listed key achievement for Google ...,"In 2017, Google became the first major company...",1.0,1.0,1.0
4,Did Google reach its intended Waste target und...,[BUILDING BETTER DEVICES AND SERVICES\nTarget ...,"Yes, in 2021, Google achieved the Waste target...","No, this target has not been met in 2021. Howe...",0.5,1.0,0.5
5,How many EV charging locations were there on G...,[This guidance does not recognize existing ren...,The provided information does not specify the ...,200000,1.0,0.0,0.0
6,On what page of the report can I find the perf...,"[Employee Recruitment, Inclusion and Performan...",The performance highlights for the Empowering ...,The performance highlights for Empowering User...,1.0,0.0,0.0
7,Can you please provide for me the glossary of ...,[GRI INDEX\nGRI 304 - Biodiversity\nGRI 103 Ma...,The provided information does not include a gl...,"Sure, here is the glossary:\nGlossary\nCFE: ca...",0.0,0.0,0.0
8,On what page can I find details about Amazons ...,[IntroductionSustainability\nDriving Climate S...,Details about Amazon's climate solutions can b...,You can find information on driving climate so...,0.0,0.0,0.0
9,"For the listed Renewable Energy goals, by when...",[IntroductionSustainability\nDriving Climate S...,Amazon intends to have all operations powered ...,Amazon set the goal of becoming powered by 100...,1.0,0.5,0.0
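
Before removing records, it can be useful to see how many null values each metric produced, since a large count may point to rate-limit or parsing problems during evaluation. A small sketch on a hypothetical three-row DataFrame (the scores below are made up):

```python
import pandas as pd

metric_columns = ["faithfulness", "context_precision", "context_recall"]

# Hypothetical scores with some missing values, for illustration only.
scores = pd.DataFrame(
    {
        "user_input": ["q1", "q2", "q3"],
        "faithfulness": [1.0, None, 0.5],
        "context_precision": [0.0, 1.0, None],
        "context_recall": [0.0, 1.0, 0.5],
    }
)

# Number of null values per metric column.
nulls_per_metric = scores[metric_columns].isnull().sum().to_dict()
# Number of records with at least one null metric.
rows_with_null = int(scores[metric_columns].isnull().any(axis=1).sum())

print(nulls_per_metric)
print(rows_with_null)
```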


# 8. Store the input calculated as a JSON file

In [13]:
df_ragas_result.to_json('./test-dataset.json', orient='records', indent=4)
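
The stored file is a JSON array with one object per record, so it can be read back with the standard library. A minimal sketch, assuming the file was written with `orient='records'` as in the cell above; the helper name `load_records` is hypothetical.

```python
import json


def load_records(path):
    """Load a JSON file written with orient='records' as a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records
```

For example, `load_records('./test-dataset.json')` returns one dictionary per evaluated query. To get a DataFrame back instead, `pd.read_json('./test-dataset.json', orient='records')` should work as well.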