<h1 align="center">
    <img 
        src="../img/logo_white_bg.jpeg" 
        width="200" 
        border="1" />
</h1>
<h1 align="center">
    <b>GenAISHAP</b>
</h1>
<h4 align="center">
    <i>Explanations for Generative AI, LLM-and-SLM-Based, Solutions</i> ⚡️
</h4>



Generative AI SHAP (GenAISHAP) is a Python library that supports the creation of explanations for the metrics obtained from solutions based on LLMs (Large Language Models) or SLMs (Small Language Models).

The previous notebook showed an example of how to create the **Input** for ***GenAISHAP***, which is a simple pandas DataFrame with the evaluation dataset. A pandas DataFrame like the following was produced and stored as a JSON file:

<img src="../img/input_example.png" width="1200" />

> The column `user_input` will be used to refer to the user prompt, and the columns `faithfulness`, `context_precision` and `context_recall` will be used as metric columns since those columns are numerical.
>
> The other columns, `retrieved_contexts`, `response`, and `reference` are not needed for **GenAISHAP** but were required for the calculation of the metrics.
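
As a minimal sketch of the column roles described above, the numerical columns of a DataFrame can be picked out with the pandas `select_dtypes` method. The two-row DataFrame and the helper name `split_columns` are illustrative assumptions and are not part of GenAISHAP.

```python
import pandas as pd


def split_columns(df: pd.DataFrame, prompt_column: str = "user_input"):
    """Return the prompt column name and the list of numerical (metric) columns."""
    metric_columns = df.select_dtypes(include="number").columns.tolist()
    return prompt_column, metric_columns


# Hypothetical two-row dataset reusing the column names of the example above
df_example = pd.DataFrame(
    {
        "user_input": ["question 1", "question 2"],
        "response": ["answer 1", "answer 2"],
        "faithfulness": [1.0, 0.9],
        "context_precision": [1.0, 1.0],
        "context_recall": [1.0, 0.857143],
    }
)

prompt_column, metric_columns = split_columns(df_example)
print(prompt_column, metric_columns)
```

Text columns such as `response` are skipped because they are not numerical, which matches the way the metric columns are identified above.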

***GenAISHAP*** works as follows. It creates one regression model per metric, which we call **black-box models**, and uses those models to produce explanations for each metric. The models are built from features extracted from the provided questions. Those **question features** can be generated automatically with the **Featurizer** tool included in the library, or they can be created manually.

This notebook explains how to use the **Featurizer** tool.  

In [1]:
import pandas as pd
from genaishap import Featurizer
from dotenv import load_dotenv

# 1. Load input and set environment variables

The **Featurizer** tool uses an LLM to produce the features and their corresponding values for each sample. That is why the environment variables need to be loaded in this notebook. The loaded dataset is the one created in the previous notebook.


In [2]:
load_dotenv()

True
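
Since `load_dotenv()` returns `True` whenever a `.env` file is found, it can be useful to also check that the expected variables are actually set. The sketch below does that with the standard library only. The variable names are an assumption: they are the ones read by the official `openai` Python SDK for Azure, and the names GenAISHAP expects may differ.

```python
import os

# Assumed variable names (read by the official `openai` SDK for Azure);
# adjust them to whatever your GenAISHAP setup actually requires.
ASSUMED_VARIABLES = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "OPENAI_API_VERSION",
)


def missing_variables(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


missing = missing_variables(ASSUMED_VARIABLES)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```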

In [3]:
df_test_dataset = pd.read_json('./test-dataset.json', orient='records')
df_test_dataset.head(10)

Unnamed: 0,user_input,retrieved_contexts,response,reference,faithfulness,context_precision,context_recall
0,"In the essay, the author mentions his early ex...",[What I Worked On\n\nFebruary 2021\n\nBefore c...,The first computer the author used for program...,The first computer the author used for program...,1.0,1.0,1.0
1,The author switched his major from philosophy ...,[I couldn't have put this into words when I wa...,The author developed an interest in AI due to ...,The two specific influences that led the autho...,0.9,1.0,1.0
2,"In the essay, the author discusses his initial...",[I couldn't have put this into words when I wa...,The author was initially drawn to AI by two ma...,The two main influences that initially drew th...,1.0,1.0,1.0
3,The author mentions his shift of interest towa...,[I couldn't have put this into words when I wa...,The author shifted his interest towards Lisp b...,The author shifted his interest towards Lisp a...,0.9,1.0,1.0
4,"In the essay, the author mentions his interest...",[So I looked around to see what I could salvag...,"During his time in grad school, the author att...","The author in the essay is Paul Graham, who wa...",0.846154,1.0,1.0
5,The author discusses his decision to write a b...,[I couldn't have put this into words when I wa...,The author decided to write a book on Lisp hac...,The author decided to write a book on Lisp hac...,0.5,1.0,1.0
6,"In the essay, the author mentions a quick deci...",[So I looked around to see what I could salvag...,The author made a quick decision to attempt to...,The author decided to attempt writing his diss...,1.0,1.0,1.0
7,The author describes the atmosphere and practi...,"[I didn't want to drop out of grad school, but...",The author describes the atmosphere at the Acc...,"According to the author's account, the student...",1.0,1.0,1.0
8,"In the essay, the author discusses his experie...","[We actually had one of those little stoves, f...",The author describes painting still lives as d...,"In the essay, the author explains that paintin...",0.923077,1.0,1.0
9,The author shares his work experience at a com...,"[We actually had one of those little stoves, f...",Interleaf had added a unique feature to their ...,"Interleaf, the company where the author worked...",0.8125,1.0,0.857143


The following code initializes the **Featurizer** with the data loaded in the previous step:

In [4]:
featurizer = Featurizer.from_pandas(df_test_dataset)

# 2. Create features automatically

The following code uses an Azure OpenAI LLM deployment to create the features. By default, the number of features is 12, but this number can be changed if required. If the quality of the black-box models is low, one option to improve it is to increase the number of features generated.

The automatically generated features are either **boolean** or **list of strings**. The goal is to capture the characteristics of the different user queries in a way that a human can easily interpret, while still allowing these features to be engineered into regressors for the black-box regression models.
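
To illustrate what engineering these two feature types into regressors could look like, the sketch below maps booleans to 0/1 and expands each list of strings into one indicator column per distinct value (a multi-hot encoding). This is only one possible encoding under stated assumptions: the helper `encode_features` is hypothetical and is not necessarily how GenAISHAP builds its regressors internally.

```python
def encode_features(rows):
    """Turn rows of boolean / list-of-strings features into numeric rows."""
    # Collect every distinct value seen in each list-valued feature
    vocabulary = {}
    for row in rows:
        for name, value in row.items():
            if isinstance(value, list):
                vocabulary.setdefault(name, set()).update(value)

    encoded = []
    for row in rows:
        out = {}
        for name, value in row.items():
            if isinstance(value, bool):
                out[name] = int(value)
            elif isinstance(value, list):
                # One 0/1 indicator column per distinct value of the feature
                for item in sorted(vocabulary[name]):
                    out[f"{name}__{item}"] = int(item in value)
        encoded.append(out)
    return encoded


# Hypothetical feature values for two questions
rows = [
    {"mentions_programming_language": True, "programming_languages": ["Lisp"]},
    {"mentions_programming_language": False, "programming_languages": []},
]
encoded = encode_features(rows)
print(encoded)
```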

The following is an example of how to create the features, and how to visualize them.

In [5]:
%%time

featurizer.create_features_using_azure_openai(
    deployment_name="gpt-4o",  # Update with the name of your Azure OpenAI LLM deployment
    num_features=12
)
print(featurizer.features.model_dump_json(indent=4))

{
    "features": [
        {
            "feature": "there_is_any_person_identified_in_the_question",
            "ftype": "boolean"
        },
        {
            "feature": "list_of_people_identified_in_the_question",
            "ftype": "list_of_strings"
        },
        {
            "feature": "there_is_any_educational_institution_identified_in_the_question",
            "ftype": "boolean"
        },
        {
            "feature": "list_of_educational_institutions_identified_in_the_question",
            "ftype": "list_of_strings"
        },
        {
            "feature": "there_is_any_programming_language_identified_in_the_question",
            "ftype": "boolean"
        },
        {
            "feature": "list_of_programming_languages_identified_in_the_question",
            "ftype": "list_of_strings"
        },
        {
            "feature": "there_is_any_company_identified_in_the_question",
            "ftype": "boolean"
        },
        {
            "feature"

These automatically generated features can be modified if required. To do so, use the `featurizer.features.features` list to remove or add features. For example, to remove the second feature (index = 1), you can use `del` as follows:

```python
del featurizer.features.features[1]
```

To add an additional feature, you can use `append` as follows:

```python
from genaishap import Feature

featurizer.features.features.append(
    Feature(
        feature="is_an_open_ended_question",
        ftype="boolean"
    )
)
```


# 3. Fill out the features automatically

Once the list of features is created, the **Featurizer** can automatically fill out the values of each feature. It uses an Azure OpenAI LLM to do the job. The `batch_size` parameter sets how many questions are filled out per batch, for example 20. The batch size helps to control the number of tokens used per LLM call.

In [6]:
%%time

featurizer.fill_out_features_using_azure_openai(
    deployment_name="gpt-4o", 
    batch_size=20
)

  0%|          | 0/3 [00:00<?, ?it/s]

CPU times: user 126 ms, sys: 9.79 ms, total: 135 ms
Wall time: 2min 16s
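
The batching idea can be sketched in a few lines of plain Python. The helper `make_batches` and the dataset size of 50 questions are hypothetical and only illustrate how a batch size of 20 splits the work. The library's internal batching logic may differ.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical dataset of 50 questions
questions = [f"question {i}" for i in range(50)]
batches = make_batches(questions, batch_size=20)
print(len(batches), [len(batch) for batch in batches])
```

With these assumed numbers the questions are split into three batches, the last one smaller than the others. A smaller batch size means more LLM calls with fewer tokens each.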


In [7]:
df_features = featurizer.to_pandas()
df_features.style.set_table_styles(
    [{'selector': 'th', 'props': [('font-size', '5pt')]}]
).set_properties(**{'font-size': '8pt',})

Unnamed: 0,there_is_any_person_identified_in_the_question,list_of_people_identified_in_the_question,there_is_any_educational_institution_identified_in_the_question,list_of_educational_institutions_identified_in_the_question,there_is_any_programming_language_identified_in_the_question,list_of_programming_languages_identified_in_the_question,there_is_any_company_identified_in_the_question,list_of_companies_identified_in_the_question,is_a_question_about_personal_experiences_or_decisions,is_a_question_about_technological_or_programming_concepts,is_a_question_about_art_or_creative_processes,is_a_question_related_to_business_or_startup_strategies
0,True,['author'],False,[],True,['programming'],False,[],True,True,False,False
1,True,['author'],True,['college'],False,[],False,[],True,True,False,False
2,True,['author'],False,[],False,[],False,[],True,True,False,False
3,True,['author'],False,[],True,['Lisp'],False,[],True,True,False,False
4,True,['author'],True,['grad school'],False,[],False,[],True,True,True,False
5,True,['author'],False,[],True,['Lisp'],False,[],True,True,False,False
6,True,['author'],False,[],False,[],False,[],True,False,False,False
7,True,['author'],True,['Accademia di Belli Arti'],False,[],False,[],True,False,True,False
8,True,['author'],False,[],False,[],False,[],True,False,True,False
9,True,['author'],False,[],False,[],True,['Interleaf'],True,True,False,True


It is possible to manually edit the feature values, or to manually add or remove an entire column. The easiest way to do it is to manipulate the `df_features` DataFrame directly, using pandas DataFrame methods and functions. The following restrictions apply to manual manipulation:
- The total number of records of `df_features` has to be the same as the number of records of `df_test_dataset`.
- The column names of `df_features` should be self-explanatory since they are going to be used for the explanations.
- The type of the columns in `df_features` needs to be **boolean** or **list of strings**.
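
The first and third restrictions above can be checked mechanically after a manual edit. The following is a minimal sketch: the helper `validate_feature_records` is hypothetical, not part of GenAISHAP, and works on a list of row dictionaries such as the one returned by `df_features.to_dict(orient="records")`.

```python
def validate_feature_records(records, expected_rows):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(records) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(records)}")
    for index, record in enumerate(records):
        for column, value in record.items():
            is_boolean = isinstance(value, bool)
            is_list_of_strings = isinstance(value, list) and all(
                isinstance(item, str) for item in value
            )
            if not (is_boolean or is_list_of_strings):
                problems.append(
                    f"row {index}, column {column!r}: unsupported value {value!r}"
                )
    return problems


# Hypothetical records: the second one holds an integer instead of a boolean
records = [
    {"is_an_open_ended_question": True, "list_of_people": ["author"]},
    {"is_an_open_ended_question": 1, "list_of_people": []},
]
problems = validate_feature_records(records, expected_rows=2)
print(problems)
```

Whether the column names are self-explanatory still has to be judged by a person, so that restriction is not covered by the check.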

# 4. Store features and values as JSON file

In [11]:
df_features.to_json('./test-features.json', orient='records', indent=4)

Just for visualization, the following cell shows all the columns calculated so far in a single table: user input, metrics, and question features.

In [10]:
df_test_dataset.join(df_features)

Unnamed: 0,user_input,retrieved_contexts,response,reference,faithfulness,context_precision,context_recall,there_is_any_person_identified_in_the_question,list_of_people_identified_in_the_question,there_is_any_educational_institution_identified_in_the_question,list_of_educational_institutions_identified_in_the_question,there_is_any_programming_language_identified_in_the_question,list_of_programming_languages_identified_in_the_question,there_is_any_company_identified_in_the_question,list_of_companies_identified_in_the_question,is_a_question_about_personal_experiences_or_decisions,is_a_question_about_technological_innovations_or_concepts,is_a_question_about_artistic_pursuits_or_styles,is_a_question_related_to_business_or_startup_strategies
0,"In the essay, the author mentions his early ex...",[What I Worked On\n\nFebruary 2021\n\nBefore c...,The first computer the author used for program...,The first computer the author used for program...,1.0,1.0,1.0,True,[author],False,[],True,[programming],False,[],True,False,False,False
1,The author switched his major from philosophy ...,[I couldn't have put this into words when I wa...,The author developed an interest in AI due to ...,The two specific influences that led the autho...,0.9,1.0,1.0,True,[author],True,[college],False,[],False,[],True,True,False,False
2,"In the essay, the author discusses his initial...",[I couldn't have put this into words when I wa...,The author was initially drawn to AI by two ma...,The two main influences that initially drew th...,1.0,1.0,1.0,True,[author],False,[],False,[],False,[],True,True,False,False
3,The author mentions his shift of interest towa...,[I couldn't have put this into words when I wa...,The author shifted his interest towards Lisp b...,The author shifted his interest towards Lisp a...,0.9,1.0,1.0,True,[author],False,[],True,[Lisp],False,[],True,True,False,False
4,"In the essay, the author mentions his interest...",[So I looked around to see what I could salvag...,"During his time in grad school, the author att...","The author in the essay is Paul Graham, who wa...",0.846154,1.0,1.0,True,[author],True,[grad school],False,[],False,[],True,False,True,False
5,The author discusses his decision to write a b...,[I couldn't have put this into words when I wa...,The author decided to write a book on Lisp hac...,The author decided to write a book on Lisp hac...,0.5,1.0,1.0,True,[author],False,[],True,[Lisp],False,[],True,True,False,False
6,"In the essay, the author mentions a quick deci...",[So I looked around to see what I could salvag...,The author made a quick decision to attempt to...,The author decided to attempt writing his diss...,1.0,1.0,1.0,True,[author],False,[],False,[],False,[],True,False,False,False
7,The author describes the atmosphere and practi...,"[I didn't want to drop out of grad school, but...",The author describes the atmosphere at the Acc...,"According to the author's account, the student...",1.0,1.0,1.0,True,"[students, faculty]",True,[Accademia di Belli Arti],False,[],False,[],False,False,True,False
8,"In the essay, the author discusses his experie...","[We actually had one of those little stoves, f...",The author describes painting still lives as d...,"In the essay, the author explains that paintin...",0.923077,1.0,1.0,True,[author],False,[],False,[],False,[],False,False,True,False
9,The author shares his work experience at a com...,"[We actually had one of those little stoves, f...",Interleaf had added a unique feature to their ...,"Interleaf, the company where the author worked...",0.8125,1.0,0.857143,True,[author],False,[],False,[],True,[Interleaf],True,True,False,True
