# RAG Model using the LangChain library

In this notebook, we use LangChain to implement a simple RAG model that automates answering RFP questions with GenAI. We will see how to initialize an embedding model, a retrieval model, and a generator model with LangChain components and use them within the ValidMind Developer Framework to run tests against them. Finally, we will put them together in a pipeline, run it to get end-to-end results, and run tests against those results.

<a id='toc2_'></a>

## About ValidMind

ValidMind is a platform for managing model risk, including risk associated with AI and statistical models.

You use the ValidMind Developer Framework to automate documentation and validation tests, and then use the ValidMind AI Risk Platform UI to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.

<a id='toc2_1_'></a>

### Before you begin

This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language. 

If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).

<a id='toc2_2_'></a>

### New to ValidMind?

If you haven't already seen our [Get started with the ValidMind Developer Framework](https://docs.validmind.ai/guide/get-started-developer-framework.html), we recommend you explore the available resources for developers at some point. There, you can learn more about documenting models, find code samples, or read our developer reference.

::: {.callout-tip}

For access to all features available in this notebook, create a free ValidMind account.

Signing up is FREE — [**Sign up now!**](https://app.prod.validmind.ai)

:::

<a id='toc2_3_'></a>

### Key concepts

**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.

**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.

**Tests**: A function contained in the ValidMind Developer Framework, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.

**Metrics**: A subset of tests that do not have thresholds. In the context of this notebook, metrics and tests can be thought of as interchangeable concepts.

**Custom metrics**: Custom metrics are functions that you define to evaluate your model or dataset. These functions can be registered with ValidMind to be used in the platform.

**Inputs**: Objects to be evaluated and documented in the ValidMind framework. They can be any of the following:

  - **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).
  - **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).
  - **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom metric.
  - **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom metric. See this [example](https://docs.validmind.ai/notebooks/how_to/run_tests_that_require_multiple_datasets.html) for more information.

**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a metric, customize its behavior, or provide additional context.

**Outputs**: Custom metrics can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.

**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.

Example: the [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases.


# Pre-requisites

Let's go ahead and install the `validmind` library if it's not already installed. Then we can install the `qdrant-client` library for our vector store and `langchain` for everything else:

In [1]:
%pip install -q validmind

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install -q qdrant-client langchain

Note: you may need to restart the kernel to use updated packages.


### ValidMind Initialization

Now we will import and initialize the ValidMind framework so we can connect to our project in the ValidMind platform. This will let us log inputs, plots, and test results to our model documentation.

In [3]:
import validmind as vm

vm.init(
  api_host = "...",
  api_key = "...",
  api_secret = "...",
  project = "..."
)

2024-05-10 11:25:34,994 - INFO(validmind.api_client): Connected to ValidMind. Project: [Demo] Customer Churn Model - Initial Validation (clnt1f4qc00ap15lfts8ur7lw)


### Read OpenAI key

We will need to have an OpenAI API key to be able to use their `text-embedding-3-small` model for our embeddings, `gpt-3.5-turbo` model for our generator and `gpt-4-turbo` model for our LLM-as-Judge tests. If you don't have an OpenAI API key, you can get one by signing up at [OpenAI](https://platform.openai.com/signup). Then you can create a `.env` file in the root of your project and the following cell will load it from there. Alternatively, you can just uncomment the line below to directly set the key (not recommended for security reasons).

In [4]:
# load openai api key
import os

from dotenv import load_dotenv
load_dotenv()

# os.environ["OPENAI_API_KEY"] = "sk-..."

if 'OPENAI_API_KEY' not in os.environ:
    raise ValueError('OPENAI_API_KEY is not set')

# Dataset Loader

Great, now that we have all of our dependencies installed, the developer framework initialized and connected to our model documentation project, and our OpenAI API key set up, we can go ahead and load our datasets. We will use the synthetic `RFP` dataset included with ValidMind for this notebook. This dataset contains a variety of RFP questions and ground truth answers that serve both as the source our Retriever searches for similar question-answer pairs and as the test set for evaluating the performance of our RAG model. To do this, we just load it and call the preprocess function to split the data into train and test sets.

In [5]:
# Import the sample dataset from the library
from validmind.datasets.llm.rag import rfp

raw_df = rfp.load_data()
train_df, test_df = rfp.preprocess(raw_df)

In [6]:
vm_train_ds = vm.init_dataset(
    train_df,
    text_column="question",
    target_column="ground_truth",
)

vm_test_ds = vm.init_dataset(
    test_df,
    text_column="question",
    target_column="ground_truth",
)

vm_test_ds.df.head()

2024-05-10 11:25:35,087 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
2024-05-10 11:25:35,979 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


| index | Project_Title | question | ground_truth | Area | Requester | Status | id |
|---|---|---|---|---|---|---|---|
| 73 | Implementation of AI Chatbots for Enhanced Cus... | How do you design the user interfaces and expe... | Our design philosophy centers on simplicity an... | General | Bank C | Awarded | 110edc94-e0ea-4b44-8081-fc46ae8a0571 |
| 21 | Generative AI Solutions for Fraud Detection an... | How do you ensure your AI solutions adhere to ... | We ensure compliance with U.S. regulations suc... | AI Regulation | Bank E | Under Review | 2a9f7593-1e16-4d22-a56b-ab5ce88f8aa3 |
| 43 | Automated Document Processing System Using AI ... | Explain how you manage and mitigate identified... | We implement and maintain robust risk manageme... | AI Regulation | Bank D | Awarded | 78add580-9b39-4cd9-90b8-ec166c49c652 |
| 67 | Gen AI-Driven Financial Advisory System | How do you ensure that your AI solutions are c... | We ensure compliance with U.S. regulations suc... | AI Regulation | Bank A | Under Review | dfe66d2e-7139-4413-a1aa-b7f9e8284e6b |
| 86 | Implementation of AI Chatbots for Enhanced Cus... | How do you perform risk identification and ass... | We conduct thorough assessments of AI systems ... | AI Regulation | Bank C | Awarded | 29b2b2e0-4969-4acc-a373-66b8c88fdedb |


# Data validation

Now that we have loaded our dataset, we can run some data validation tests right away to start assessing and documenting the quality of our data. Since we are using a text dataset, we can use ValidMind's built-in text data quality tests to check for duplicates, missing values, and other common text data issues. We can also run some tests to check the sentiment and toxicity of our data.

### Duplicates

First, let's check for duplicates in our dataset. We can use the `validmind.data_validation.Duplicates` test and pass our dataset:

In [7]:
vm.tests.run_test(
    test_id="validmind.data_validation.Duplicates",
    inputs={
        "dataset": vm_train_ds
    }
).log()

VBox(children=(HTML(value='\n            <h1>Duplicates ✅</h1>\n            <p>Tests dataset for duplicate ent…
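To make concrete what a duplicate check looks for, here is a minimal standard-library sketch that counts repeated text entries. It is an illustration only, not ValidMind's implementation; the `find_duplicates` helper, its light normalization, and the sample questions are all hypothetical.

```python
from collections import Counter


def find_duplicates(texts):
    """Return {text: count} for every normalized text that occurs more than once."""
    # Normalize lightly so trivial whitespace/case differences do not hide duplicates
    normalized = [" ".join(t.lower().split()) for t in texts]
    counts = Counter(normalized)
    return {text: n for text, n in counts.items() if n > 1}


# Hypothetical sample, not taken from the RFP dataset
sample_questions = [
    "How do you manage AI risk?",
    "how do you manage  AI risk?",
    "Describe your data retention policy.",
]

duplicates = find_duplicates(sample_questions)
print(duplicates)
```

Whether near-duplicates such as these should count as duplicates is a design choice; an exact-match check would skip the normalization step.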

### Stop Words

Next, let's check for stop words in our dataset. We can use the `validmind.data_validation.nlp.StopWords` test and pass our dataset:

In [8]:
vm.tests.run_test(
    test_id="validmind.data_validation.nlp.StopWords",
    inputs={
        "dataset": vm_train_ds
    }
).log()

[nltk_data] Downloading package stopwords to /Users/jwalz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


VBox(children=(HTML(value='\n            <h1>Stop Words ❌</h1>\n            <p>Evaluates and visualizes the fr…
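As a rough picture of what a stop-word analysis measures, the sketch below counts how often stop words appear and what share of all tokens they make up. The tiny `STOP_WORDS` set and the `stop_word_share` helper are hypothetical; the log output above indicates the actual test relies on NLTK's stop-word list, and its thresholds are not reproduced here.

```python
from collections import Counter

# Tiny hypothetical stop-word list; a real run would use a full list such as NLTK's
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "do", "you", "your", "how"}


def stop_word_share(texts):
    """Return (counts per stop word, share of all tokens that are stop words)."""
    tokens = [tok.strip(".,?!").lower() for text in texts for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    counts = Counter(tok for tok in tokens if tok in STOP_WORDS)
    share = sum(counts.values()) / len(tokens) if tokens else 0.0
    return counts, share


counts, share = stop_word_share(["How do you ensure the security of your data?"])
print(counts.most_common(3), round(share, 2))
```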

### Punctuations

Next, let's check for punctuation in our dataset. We can use the `validmind.data_validation.nlp.Punctuations` test:

In [9]:
vm.tests.run_test(
    test_id="validmind.data_validation.nlp.Punctuations",
    inputs={
        "dataset": vm_train_ds
    }
).log()

VBox(children=(HTML(value='<h1>Punctuations</h1>'), HTML(value="<p>Analyzes and visualizes the frequency distr…

### Common Words

Next, let's check for common words in our dataset. We can use the `validmind.data_validation.nlp.CommonWords` test:

In [10]:
vm.tests.run_test(
    test_id="validmind.data_validation.nlp.CommonWords",
    inputs={
        "dataset": vm_train_ds
    }
).log()

[nltk_data] Downloading package stopwords to /Users/jwalz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


VBox(children=(HTML(value='<h1>Common Words</h1>'), HTML(value="<p>Identifies and visualizes the 40 most frequ…

### Language Detection

For documentation purposes, we can detect and log the languages used in the dataset with the `validmind.data_validation.nlp.LanguageDetection` test:

In [11]:
vm.tests.run_test(
    test_id="validmind.data_validation.nlp.LanguageDetection",
    inputs={
        "dataset": vm_train_ds
    }
).log()

VBox(children=(HTML(value='<h1>Language Detection</h1>'), HTML(value="<p>Detects the language of each text ent…

### Toxicity Score

Now, let's go ahead and run the `validmind.data_validation.nlp.Toxicity` test to compute a toxicity score for our dataset:

In [12]:
vm.tests.run_test(
    "validmind.data_validation.nlp.Toxicity", inputs={"dataset": vm_train_ds}
).log()

Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint


VBox(children=(HTML(value='<h1>Toxicity</h1>'), HTML(value="<p>Analyzes the toxicity of text data within a dat…

### Polarity and Subjectivity

We can also run the `validmind.data_validation.nlp.PolarityAndSubjectivity` test to compute the polarity and subjectivity of our dataset:

In [13]:
vm.tests.run_test(
    "validmind.data_validation.nlp.PolarityAndSubjectivity",
    inputs={"dataset": vm_train_ds},
).log()

VBox(children=(HTML(value='<h1>Polarity And Subjectivity</h1>'), HTML(value='<p>Analyzes the polarity and subj…

### Sentiment

Finally, we can run the `validmind.data_validation.nlp.Sentiment` test to plot the sentiment of our dataset:

In [14]:
vm.tests.run_test(
    "validmind.data_validation.nlp.Sentiment", inputs={"dataset": vm_train_ds}
).log()

VBox(children=(HTML(value='<h1>Sentiment</h1>'), HTML(value="<p>Analyzes the sentiment of text data within a d…

# Embedding Model

Now that we have our dataset loaded and have run some data validation tests to assess and document the quality of our data, we can go ahead and initialize our embedding model. We will use the `text-embedding-3-small` model from OpenAI for this purpose wrapped in the `OpenAIEmbeddings` class from LangChain. This model will be used to "embed" our questions both for inserting the question-answer pairs from the "train" set into the vector store and for embedding the question from inputs when making predictions with our RAG model.

In [15]:
from langchain_openai import OpenAIEmbeddings

embedding_client = OpenAIEmbeddings(model="text-embedding-3-small")

def embed(input):
    """Returns a text embedding for the given text"""
    return embedding_client.embed_query(input["question"])

vm_embedder = vm.init_model(input_id="embedding_model", predict_fn=embed)

What we have done here is initialize the `OpenAIEmbeddings` class so that it uses OpenAI's `text-embedding-3-small` model. We then created an `embed` function that takes an `input` dictionary and uses the `embed_query` method of the embedding client to compute the embedding of the `question`. We use an `embed` function because that is how ValidMind supports any custom model. We will use this strategy for the retrieval and generator models as well, but you could also use, say, a HuggingFace model directly. See the [ValidMind Documentation](https://docs.validmind.ai/validmind/validmind.html) for more information on which model types are directly supported. Finally, we use the `init_model` function from the ValidMind framework to create a `VMModel` object that can be used in ValidMind tests. This also logs the model to our model documentation project, and any test that uses the model will be linked to the logged model and its metadata.
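The function-wrapping pattern described above can be tried without an API key. The sketch below mirrors the shape of `embed` (a row-like mapping in, one vector out) but swaps in a hypothetical `FakeEmbeddingClient` that derives deterministic numbers from a hash. The fake client and `embed_offline` are illustrative assumptions and produce no meaningful embeddings.

```python
import hashlib


class FakeEmbeddingClient:
    """Hypothetical stand-in for an embedding client; returns deterministic vectors."""

    def __init__(self, dimensions=8):
        self.dimensions = dimensions

    def embed_query(self, text):
        # Derive pseudo-embedding values from a hash so the example runs offline
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.dimensions]]


fake_client = FakeEmbeddingClient()


def embed_offline(input):
    """Same shape as `embed` above: takes a row-like mapping, returns one vector."""
    return fake_client.embed_query(input["question"])


vector = embed_offline({"question": "How do you manage AI risk?"})
print(len(vector))
```

Keeping the prediction logic behind a plain function like this makes it easy to swap the real client for a stub in unit tests.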

### Assign Predictions

To precompute the embeddings for our test set, we can call the `assign_predictions` method of the `vm_test_ds` object we created above. This will compute the embeddings for each question in the test set and store them in a special prediction column of the test set that's linked to our `vm_embedder` model. This will allow us to use these embeddings later when we run tests against our embedding model.

In [16]:
vm_test_ds.assign_predictions(vm_embedder)
print(vm_test_ds)

2024-05-10 11:25:48,543 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2024-05-10 11:25:48,544 - INFO(validmind.vm_models.dataset.utils): Not running predict_proba() for unsupported models.
2024-05-10 11:25:48,544 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2024-05-10 11:25:52,741 - INFO(validmind.vm_models.dataset.utils): Done running predict()


VMDataset object: 
Input ID: dataset
Target Column: ground_truth
Feature Columns: ['Project_Title', 'question', 'Area', 'Requester', 'Status', 'id']
Text Column: question
Extra Columns: ExtraColumns(extras=set(), group_by_column=None, prediction_columns={'embedding_model': 'embedding_model_prediction'}, probability_columns={})
Target Class Labels: None
Columns: ['Project_Title', 'question', 'ground_truth', 'Area', 'Requester', 'Status', 'id', 'embedding_model_prediction']
Index: [ 73  21  43  67  86  44  60  19 104  31   9  96  72  71  54 101  30  48
   1  66  25  81 113]



### Run tests

Now that everything is set up for the embedding model, we can go ahead and run some tests to assess and document the quality of our embeddings. We will use the `validmind.model_validation.embeddings.*` tests to compute a variety of metrics against our model.

In [17]:
from validmind.tests import run_test

result = run_test(
    "validmind.model_validation.embeddings.StabilityAnalysisRandomNoise",
    inputs={"model": vm_embedder, "dataset": vm_test_ds},
    params={"probability": 0.3},
).log()

VBox(children=(HTML(value='\n            <h1>Stability Analysis Random Noise ✅</h1>\n            <p>Evaluate r…
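The idea behind a random-noise stability check can be sketched as follows: perturb each word with some probability, embed both versions, and compare them. This is only one plausible form of perturbation (swapping adjacent characters) paired with a toy character-bigram "embedding". The helpers are hypothetical, and the ValidMind test may perturb and score differently.

```python
import math
import random


def add_random_noise(text, probability, rng):
    """Swap two adjacent characters in each word with the given probability."""
    noisy_words = []
    for word in text.split():
        if len(word) > 1 and rng.random() < probability:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy_words.append(word)
    return " ".join(noisy_words)


def char_bigram_embedding(text):
    """Toy embedding: counts of character bigrams (stand-in for a real model)."""
    counts = {}
    lowered = text.lower()
    for a, b in zip(lowered, lowered[1:]):
        counts[a + b] = counts.get(a + b, 0) + 1
    return counts


def sparse_cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


rng = random.Random(0)  # fixed seed so the run is repeatable
original = "How do you ensure compliance with data privacy regulations?"
perturbed = add_random_noise(original, probability=0.3, rng=rng)
similarity = sparse_cosine_similarity(
    char_bigram_embedding(original), char_bigram_embedding(perturbed)
)
print(perturbed, round(similarity, 3))
```

A similarity that stays close to 1 under small perturbations suggests the embedding is stable with respect to that kind of noise.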

In [18]:
result = run_test(
    "validmind.model_validation.embeddings.CosineSimilarityHeatmap",
    inputs = {"model": vm_embedder, "dataset": vm_test_ds}
).log()

VBox(children=(HTML(value='<h1>Cosine Similarity Heatmap</h1>'), HTML(value='<p>Plots an interactive heatmap o…

In [19]:
result = run_test(
    "validmind.model_validation.embeddings.EuclideanDistanceHeatmap",
    inputs={"model": vm_embedder, "dataset": vm_test_ds},
).log()

VBox(children=(HTML(value='<h1>Euclidean Distance Heatmap</h1>'), HTML(value='<p>Plots an interactive heatmap …
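Both heatmaps above visualize a square matrix of pairwise scores between embeddings. As a minimal sketch of the two measures, the code below builds those matrices for three hand-made 2-D vectors. The vectors and helper names are hypothetical, and the real tests operate on the stored embedding column instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(vectors, metric):
    """Build the square matrix that a heatmap would visualize."""
    return [[metric(u, v) for v in vectors] for u in vectors]


# Hypothetical 2-D "embeddings" chosen for easy mental arithmetic
embeddings = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]

cosine_matrix = pairwise(embeddings, cosine_similarity)
distance_matrix = pairwise(embeddings, euclidean_distance)
print(cosine_matrix[0][2], distance_matrix[0][1])
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to both, which is why the two heatmaps can look different for the same embeddings.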

In [20]:
result = run_test(
    "validmind.model_validation.embeddings.PCAComponentsPairwisePlots",
    inputs={"model": vm_embedder, "dataset": vm_test_ds},
    params = {"n_components": 3}
).log()

VBox(children=(HTML(value='<h1>PCA Components Pairwise Plots</h1>'), HTML(value="<p>Plots individual scatter p…

In [21]:
result = run_test(
    "validmind.model_validation.embeddings.TSNEComponentsPairwisePlots",
    inputs={"model": vm_embedder, "dataset": vm_test_ds},
    params = {"n_components": 3, "perplexity": 20}
).log()

VBox(children=(HTML(value='<h1>TSNE Components Pairwise Plots</h1>'), HTML(value="<p>Plots individual scatter …

# Setup Vector Store

Great, so now that we have assessed our embedding model and verified that it is performing well, we can go ahead and use it to compute embeddings for our question-answer pairs in the "train" set. We will then use these embeddings to insert the question-answer pairs into a vector store. We will use an in-memory `qdrant` vector database for demo purposes, but any option would work just as well here. We will use the `Qdrant` vector store class from LangChain (`langchain_community.vectorstores`) to interact with the vector store. This class will allow us to insert and search for embeddings in the vector store.

### Generate embeddings for the Train Set

We can use the same `assign_predictions` method from earlier except this time we will use the `vm_train_ds` object to compute the embeddings for the question-answer pairs in the "train" set.

In [22]:
vm_train_ds.assign_predictions(vm_embedder)
print(vm_train_ds)

2024-05-10 11:26:05,204 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2024-05-10 11:26:05,205 - INFO(validmind.vm_models.dataset.utils): Not running predict_proba() for unsupported models.
2024-05-10 11:26:05,205 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2024-05-10 11:26:22,791 - INFO(validmind.vm_models.dataset.utils): Done running predict()


VMDataset object: 
Input ID: dataset
Target Column: ground_truth
Feature Columns: ['Project_Title', 'question', 'Area', 'Requester', 'Status', 'id']
Text Column: question
Extra Columns: ExtraColumns(extras=set(), group_by_column=None, prediction_columns={'embedding_model': 'embedding_model_prediction'}, probability_columns={})
Target Class Labels: None
Columns: ['Project_Title', 'question', 'ground_truth', 'Area', 'Requester', 'Status', 'id', 'embedding_model_prediction']
Index: [ 18  97  80 111  98 109  76  24  74  99  57  84  51  35  45  85  38   6
   5  61 107   4   2  26   0  59   8  90  34  79  32  11  64  12  91  70
  95  36  37  88  69  17  10  53  52  39 102  63   3  29  16  20 100  49
 108  14  15  40 114  94  47  62  78  93  50 103 105  42  83  56  68  75
  87  55  13  28   7  65  89  27  23  92  82  33 112 110  58  22  77  46
 106  41]



### Insert embeddings and questions into Vector DB

Now we can insert the question-answer pairs from the "train" set into the vector store. Note that `Qdrant.from_documents` takes the embedding client and embeds each document itself as it is inserted:

In [23]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DataFrameLoader

# load documents from dataframe
loader = DataFrameLoader(train_df, page_content_column="question")
docs = loader.load()
# choose model using embedding client
embedding_client = OpenAIEmbeddings(model="text-embedding-3-small")

# setup vector datastore
qdrant = Qdrant.from_documents(
    docs,
    embedding_client,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="rfp_rag_collection",
)

# Retrieval Model

Now that we have an embedding model and a vector database set up and loaded with our data, we need a Retrieval model that can search for similar question-answer pairs for a given input question. Once created, we can initialize this as a ValidMind model and `assign_predictions` to it just like our embedding model.

In [24]:
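def retrieve(input):
    """Returns the most similar question-answer pairs for the input question"""
    contexts = []

    # each result is a (document, similarity score) tuple; the score is unused here
    for document, _score in qdrant.similarity_search_with_score(input["question"]):
        context = f"Q: {document.page_content}\n"
        context += f"A: {document.metadata['ground_truth']}\n"

        contexts.append(context)

    return contexts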


vm_retriever = vm.init_model(input_id="retrieval_model", predict_fn=retrieve)
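Conceptually, the similarity search behind `retrieve` ranks stored vectors by how close they are to the query vector and returns the top matches. The sketch below shows that idea over a plain Python list with hand-made 2-D vectors. The `toy_store` records and the `top_k_contexts` helper are hypothetical, and a real vector database uses indexing structures that this brute-force version does not.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k_contexts(query_vector, store, k=2):
    """Rank stored records by cosine similarity and format the top k as Q/A text."""
    ranked = sorted(
        store, key=lambda rec: cosine(query_vector, rec["vector"]), reverse=True
    )
    return [f"Q: {rec['question']}\nA: {rec['answer']}\n" for rec in ranked[:k]]


# Hypothetical store with hand-made 2-D vectors
toy_store = [
    {"vector": [1.0, 0.0], "question": "How do you manage AI risk?", "answer": "Via a risk framework."},
    {"vector": [0.0, 1.0], "question": "What is your pricing model?", "answer": "Per-seat licensing."},
    {"vector": [0.9, 0.1], "question": "How do you assess AI risk?", "answer": "Through regular audits."},
]

contexts = top_k_contexts([1.0, 0.05], toy_store, k=2)
print(contexts)
```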

In [25]:
vm_test_ds.assign_predictions(model=vm_retriever)
print(vm_test_ds)

2024-05-10 11:26:24,728 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2024-05-10 11:26:24,729 - INFO(validmind.vm_models.dataset.utils): Not running predict_proba() for unsupported models.
2024-05-10 11:26:24,729 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2024-05-10 11:26:29,802 - INFO(validmind.vm_models.dataset.utils): Done running predict()


VMDataset object: 
Input ID: dataset
Target Column: ground_truth
Feature Columns: ['Project_Title', 'question', 'Area', 'Requester', 'Status', 'id']
Text Column: question
Extra Columns: ExtraColumns(extras=set(), group_by_column=None, prediction_columns={'embedding_model': 'embedding_model_prediction', 'retrieval_model': 'retrieval_model_prediction'}, probability_columns={})
Target Class Labels: None
Columns: ['Project_Title', 'question', 'ground_truth', 'Area', 'Requester', 'Status', 'id', 'embedding_model_prediction', 'retrieval_model_prediction']
Index: [ 73  21  43  67  86  44  60  19 104  31   9  96  72  71  54 101  30  48
   1  66  25  81 113]



# Generation Model

As the final piece of this simple RAG pipeline, we can create and initialize a generation model that will use the retrieved context to generate an answer to the input question. We will use the `gpt-3.5-turbo` model from OpenAI.

In [26]:
from openai import OpenAI


system_prompt = """
You are an expert RFP AI assistant.
You are tasked with answering new RFP questions based on existing RFP questions and answers.
You will be provided with the existing RFP questions and answer pairs that are the most relevant to the new RFP question.
After that you will be provided with a new RFP question.
You will generate an answer and respond only with the answer.
Ignore your pre-existing knowledge and answer the question based on the provided context.
""".strip()

openai_client = OpenAI()

def generate(input):
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "\n\n".join(input["retrieval_model"])},
            {"role": "user", "content": input["question"]},
        ],
    )

    return response.choices[0].message.content

vm_generator = vm.init_model(input_id="generation_model", predict_fn=generate)

Let's test it out real quick:

In [27]:
import pandas as pd

vm_generator.predict(pd.DataFrame({"retrieval_model": [["My name is anil"]], "question": ["what is my name"]}))

['Your name is Anil.']
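To inspect what the generator sends to the chat model without making an API call, the message assembly can be pulled out into its own function. This is a sketch: `build_messages` is a hypothetical helper that reproduces the message layout used in `generate` above, and the sample contexts are made up.

```python
def build_messages(system_prompt, contexts, question):
    """Assemble chat messages: system prompt, retrieved context, then the question."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "\n\n".join(contexts)},
        {"role": "user", "content": question},
    ]


example_messages = build_messages(
    system_prompt="You are an expert RFP AI assistant.",
    contexts=[
        "Q: What is your uptime?\nA: 99.9%\n",
        "Q: Where is data stored?\nA: In the EU.\n",
    ],
    question="What uptime do you guarantee?",
)
for message in example_messages:
    print(message["role"], "->", repr(message["content"][:40]))
```

Separating prompt construction from the API call also makes the prompt format easy to unit test.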

# Setup RAG Pipeline Model

Now that we have all of our individual "component" models set up and initialized, we need some way to put them together in a single "pipeline". We can use the `PipelineModel` class to do this. This ValidMind model type simply wraps any number of other ValidMind models and runs them in sequence. We can use the pipe (`|`) operator to chain our models together; in Python this is normally the bitwise OR operator, but it has been overloaded here for easy pipeline creation. We can then initialize this pipeline model and assign predictions to it just like any other model.

In [28]:
vm_rag_model = vm.init_model(vm_retriever | vm_generator, input_id="rag_model")
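For readers curious how `|` can build a pipeline, the sketch below overloads `__or__` on two small classes. It is not ValidMind's `PipelineModel`; `Step` and `Pipeline` are hypothetical. The convention of storing each step's output under the step's name is an assumption inferred from how `generate` reads `input["retrieval_model"]`.

```python
class Step:
    """Hypothetical pipeline step: a named function over a shared input dict."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def __or__(self, other):
        # `a | b` builds a pipeline instead of computing a bitwise OR
        return Pipeline([self, other])


class Pipeline:
    """Runs steps in order, exposing each step's output to the steps after it."""

    def __init__(self, steps):
        self.steps = steps

    def __or__(self, other):
        return Pipeline(self.steps + [other])

    def predict(self, row):
        row = dict(row)  # copy so the caller's dict is not mutated
        output = None
        for step in self.steps:
            output = step.fn(row)
            row[step.name] = output  # later steps can read this by name
        return output


retriever = Step("retrieval_model", lambda row: [f"context for: {row['question']}"])
generator = Step(
    "generation_model",
    lambda row: f"answer using {len(row['retrieval_model'])} context(s)",
)

toy_pipeline = retriever | generator
print(toy_pipeline.predict({"question": "What is your uptime?"}))
```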

We can `assign_predictions` to the pipeline model just like we did with the individual models. This will run the pipeline on the test set and store the results in the test set for later use.

In [29]:
vm_test_ds.assign_predictions(model=vm_rag_model)
print(vm_test_ds)

2024-05-10 11:26:30,688 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2024-05-10 11:26:30,689 - INFO(validmind.vm_models.dataset.utils): Not running predict_proba() for unsupported models.
2024-05-10 11:26:30,690 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2024-05-10 11:27:49,996 - INFO(validmind.vm_models.dataset.utils): Done running predict()


VMDataset object: 
Input ID: dataset
Target Column: ground_truth
Feature Columns: ['Project_Title', 'question', 'Area', 'Requester', 'Status', 'id']
Text Column: question
Extra Columns: ExtraColumns(extras=set(), group_by_column=None, prediction_columns={'embedding_model': 'embedding_model_prediction', 'retrieval_model': 'retrieval_model_prediction', 'rag_model': 'rag_model_prediction'}, probability_columns={})
Target Class Labels: None
Columns: ['Project_Title', 'question', 'ground_truth', 'Area', 'Requester', 'Status', 'id', 'embedding_model_prediction', 'retrieval_model_prediction', 'rag_model_prediction']
Index: [ 73  21  43  67  86  44  60  19 104  31   9  96  72  71  54 101  30  48
   1  66  25  81 113]



In [30]:
vm_test_ds.df.head(5)

| index | Project_Title | question | ground_truth | Area | Requester | Status | id | embedding_model_prediction | retrieval_model_prediction | rag_model_prediction |
|---|---|---|---|---|---|---|---|---|---|---|
| 73 | Implementation of AI Chatbots for Enhanced Cus... | How do you design the user interfaces and expe... | Our design philosophy centers on simplicity an... | General | Bank C | Awarded | 110edc94-e0ea-4b44-8081-fc46ae8a0571 | [-0.005328164056798606, 0.0019559278470552936,... | [Q: How is user interface and experience consi... | Our design philosophy prioritizes simplicity a... |
| 21 | Generative AI Solutions for Fraud Detection an... | How do you ensure your AI solutions adhere to ... | We ensure compliance with U.S. regulations suc... | AI Regulation | Bank E | Under Review | 2a9f7593-1e16-4d22-a56b-ab5ce88f8aa3 | [0.012983299710708322, 0.008619748933411301, 0... | [Q: How do you ensure that your AI solutions c... | We ensure compliance with U.S. regulations suc... |
| 43 | Automated Document Processing System Using AI ... | Explain how you manage and mitigate identified... | We implement and maintain robust risk manageme... | AI Regulation | Bank D | Awarded | 78add580-9b39-4cd9-90b8-ec166c49c652 | [0.007622844916689943, 0.059345308001852774, 0... | [Q: Explain how you manage and mitigate AI ris... | We implement and maintain robust risk manageme... |
| 67 | Gen AI-Driven Financial Advisory System | How do you ensure that your AI solutions are c... | We ensure compliance with U.S. regulations suc... | AI Regulation | Bank A | Under Review | dfe66d2e-7139-4413-a1aa-b7f9e8284e6b | [0.018119321018536055, 0.012668794170985565, 0... | [Q: How do you ensure that your AI solutions c... | We ensure compliance with U.S. regulations on ... |
| 86 | Implementation of AI Chatbots for Enhanced Cus... | How do you perform risk identification and ass... | We conduct thorough assessments of AI systems ... | AI Regulation | Bank C | Awarded | 29b2b2e0-4969-4acc-a373-66b8c88fdedb | [-0.014985800663575534, 0.027727032113998212, ... | [Q: How do you perform risk assessment and ide... | We conduct thorough assessments of AI systems ... |


# Run tests

Let's go ahead and run some of our new RAG tests against our model...

> Note: these tests are still being developed and are not yet in a stable state. We are using advanced tests here that use LLM-as-Judge and other strategies to assess things like the relevancy of the retrieved context to the input question and the correctness of the generated answer when compared to the ground truth. There is more to come in this area so stay tuned!

In [31]:
import warnings

warnings.filterwarnings("ignore")

### Answer Similarity

The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.

Measuring the semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.

In [32]:
run_test(
    "validmind.model_validation.ragas.AnswerSimilarity",
    inputs={"dataset": vm_test_ds},
    params= {
        "question_column":"question",
        "answer_column":"rag_model_prediction",
        "ground_truth_column":"ground_truth",
        "contexts_column":"retrieval_model_prediction"
    },
).log()

Evaluating:   0%|          | 0/23 [00:00<?, ?it/s]

VBox(children=(HTML(value='<h1>Answer Similarity</h1>'), HTML(value='<p>Calculates the answer similarity metri…

### Context Entity Recall

This metric gives the measure of recall of the retrieved context, based on the number of entities present in both ground_truths and contexts relative to the number of entities present in the ground_truths alone. Simply put, it is a measure of what fraction of entities are recalled from ground_truths. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric can help evaluate the retrieval mechanism for entities, based on comparison with entities present in ground_truths, because in cases where entities matter, we need the contexts which cover them.

In [33]:
result = run_test(
    "validmind.model_validation.ragas.ContextEntityRecall",
    inputs={"dataset": vm_test_ds},
    params= {
        "question_column":"question",
        "answer_column":"rag_model_prediction",
        "ground_truth_column":"ground_truth",
        "contexts_column":"retrieval_model_prediction"
    },
)
result.log()

Evaluating:   0%|          | 0/23 [00:00<?, ?it/s]

VBox(children=(HTML(value='<h1>Context Entity Recall</h1>'), HTML(value='<p>Evaluates the context entity recal…
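The recall described above reduces to a set computation once the entities are known: the number of ground-truth entities that also appear in the contexts, divided by the number of ground-truth entities. The sketch below assumes the entities have already been extracted and hand-labels them. The entity sets and the `context_entity_recall` helper are hypothetical, and the ValidMind test performs the extraction itself.

```python
def context_entity_recall(ground_truth_entities, context_entities):
    """Fraction of ground-truth entities that also appear in the retrieved context."""
    ground_truth = set(ground_truth_entities)
    if not ground_truth:
        return 0.0
    return len(ground_truth & set(context_entities)) / len(ground_truth)


# Hypothetical, hand-labeled entities; an automated metric must extract them first
gt_entities = {"GDPR", "CCPA", "SOC 2", "ISO 27001"}
ctx_entities = {"GDPR", "SOC 2", "HIPAA"}

print(context_entity_recall(gt_entities, ctx_entities))
```

Entities that appear only in the context (here `HIPAA`) do not affect the score, since recall is measured relative to the ground truth alone.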

### Context Precision

Context Precision evaluates whether the ground-truth relevant items present in the contexts are ranked near the top. Ideally, all relevant chunks appear at the top ranks. This metric is computed using the question, the ground_truth and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
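
To see why ranking matters, here is a small sketch of a rank-aware precision score: precision@k is accumulated at every rank that holds a relevant chunk, then averaged over the number of relevant chunks. It assumes the per-chunk relevance verdicts (0 or 1) are already available. In the actual test an LLM produces them. The function name is hypothetical.

```python
def context_precision(relevance):
    """relevance: list of 0/1 flags, one per retrieved chunk, in rank order."""
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k, counted only at relevant ranks
    return weighted / hits if hits else 0.0


print(context_precision([1, 1, 0]))  # 1.0: relevant chunks ranked first
print(context_precision([0, 1, 1]))  # lower: an irrelevant chunk is ranked first
```

Both example lists contain two relevant chunks out of three; only their order differs, and the score drops when the irrelevant chunk takes the top rank.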

In [34]:
result = run_test(
    "validmind.model_validation.ragas.ContextPrecision",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Evaluating:   0%|          | 0/23 [00:00<?, ?it/s]

VBox(children=(HTML(value='<h1>Context Precision</h1>'), HTML(value='<p>Evaluates the context precision metric…

### Context Relevancy

This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy.

Ideally, the retrieved context should exclusively contain essential information to address the provided query. To compute this, we first identify the sentences within the retrieved context that are relevant for answering the given question; the score is then the number of relevant sentences divided by the total number of sentences in the retrieved context.

In [35]:
result = run_test(
    "validmind.model_validation.ragas.ContextRelevancy",
    inputs={"dataset": vm_test_ds},
    params= {
        "question_column":"question",
        "answer_column":"rag_model_prediction",
        "ground_truth_column":"ground_truth",
        "contexts_column":"retrieval_model_prediction"
    },
)
result.log()


Evaluating:   0%|          | 0/23 [00:00<?, ?it/s]

VBox(children=(HTML(value='<h1>Context Relevancy</h1>'), HTML(value='<p>Evaluates the context relevancy metric…

### Faithfulness

This metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims is first identified from the generated answer. Each of these claims is then cross-checked against the given context to determine whether it can be inferred from that context.
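
The final scoring step described above amounts to a simple ratio of supported claims to total claims. The sketch below assumes the claims have already been extracted and judged (an LLM does both in the actual test); the `faithfulness` function and the placeholder claims are hypothetical.

```python
def faithfulness(claim_verdicts):
    """claim_verdicts maps each claim to True if it can be inferred from the context."""
    if not claim_verdicts:
        # Assumption for this sketch: an answer with no claims scores 0.
        return 0.0
    supported = sum(1 for ok in claim_verdicts.values() if ok)
    return supported / len(claim_verdicts)


# Placeholder claims with hand-assigned verdicts.
verdicts = {
    "Claim 1 extracted from the answer": True,
    "Claim 2 extracted from the answer": True,
    "Claim 3 extracted from the answer": False,
}

score = faithfulness(verdicts)
print(round(score, 3))  # 0.667
```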

In [36]:
result = run_test(
    "validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_test_ds},
    params= {
        "question_column":"question",
        "answer_column":"rag_model_prediction",
        "ground_truth_column":"ground_truth",
        "contexts_column":"retrieval_model_prediction"
    },
)
result.log()

Evaluating:   0%|          | 0/23 [00:00<?, ?it/s]

VBox(children=(HTML(value='<h1>Faithfulness</h1>'), HTML(value="<p>Evaluates the faithfulness metric for gener…

### Answer Relevance

The evaluation metric, Answer Relevancy, focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the question, the context and the answer.

Answer Relevancy is defined as the mean cosine similarity of the original question to a number of artificial questions, which were generated (reverse engineered) based on the answer.

Please note that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, because cosine similarity ranges from -1 to 1.

> Note: This is a reference-free metric. If you’re looking to compare the ground truth answer with the generated answer, refer to Answer Correctness.

An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
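
The mean-cosine-similarity step can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the embedding vectors are tiny hand-written lists rather than real model embeddings, and the function names are hypothetical. The example is deliberately constructed to show that individual similarities can be negative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors; ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original and the generated questions."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy 2-d "embeddings": identical, orthogonal and opposite to the question.
question = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(answer_relevancy(question, generated))  # 0.0, the mean of 1, 0 and -1
```

With real embeddings of natural-language questions, strongly negative similarities are uncommon, which is why the score usually lands between 0 and 1.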

In [38]:
result = run_test(
    "validmind.model_validation.ragas.AnswerRelevance",
    inputs={"dataset": vm_test_ds},
    params= {
        "question_column":"question",
        "answer_column":"rag_model_prediction",
        "ground_truth_column":"ground_truth",
        "contexts_column":"retrieval_model_prediction"
    },
)
result.log()

Evaluating:   0%|          | 0/23 [00:00<?, ?it/s]

VBox(children=(HTML(value='<h1>Answer Relevance</h1>'), HTML(value="<p>Evaluates the relevance of answers in a…

### Context Recall

Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.

To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all sentences in the ground truth answer should be attributable to the retrieved context.
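
As a rough sketch of the sentence-level calculation, the snippet below takes per-sentence attribution verdicts and returns both the score and the sentences that could not be attributed, which is often the more useful output when debugging a retriever. The verdicts are assumed to be given (an LLM judges them in the actual test), and `context_recall` and the placeholder sentences are hypothetical.

```python
def context_recall(attributions):
    """attributions: list of (sentence, attributed) pairs for the ground truth answer."""
    if not attributions:
        # Assumption for this sketch: no sentences means a score of 0.
        return 0.0, []
    missing = [sentence for sentence, ok in attributions if not ok]
    score = (len(attributions) - len(missing)) / len(attributions)
    return score, missing


# Placeholder ground-truth sentences with hand-assigned verdicts.
verdicts = [
    ("Sentence A of the ground truth.", True),
    ("Sentence B of the ground truth.", True),
    ("Sentence C of the ground truth.", False),
    ("Sentence D of the ground truth.", True),
]

score, missing = context_recall(verdicts)
print(score)    # 0.75
print(missing)  # ['Sentence C of the ground truth.']
```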

In [39]:
result = run_test(
    "validmind.model_validation.ragas.ContextRecall",
    inputs={"dataset": vm_test_ds},
    params= {
        "question_column":"question",
        "answer_column":"rag_model_prediction",
        "ground_truth_column":"ground_truth",
        "contexts_column":"retrieval_model_prediction"
    },
)
result.log()

Evaluating:   0%|          | 0/23 [00:00<?, ?it/s]

VBox(children=(HTML(value='<h1>Context Recall</h1>'), HTML(value="<p>Evaluates the context recall metric for d…

### Answer Correctness

The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score.

Factual correctness quantifies the factual overlap between the generated answer and the ground truth answer. This is done using the concepts of:

- TP (True Positive): Facts or statements that are present in both the ground truth and the generated answer.
- FP (False Positive): Facts or statements that are present in the generated answer but not in the ground truth.
- FN (False Negative): Facts or statements that are present in the ground truth but not in the generated answer.
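
One common way to turn these counts into a score is an F1-style ratio, which can then be blended with the semantic similarity using weights. The sketch below assumes that formulation; the default weights of 0.75 (factual) and 0.25 (semantic) are an assumption for illustration, as are the function names and the example counts.

```python
def factual_f1(tp, fp, fn):
    """F1-style overlap between the facts in the answer and in the ground truth."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted average of factual overlap and semantic similarity.

    The default weights are an assumption for this sketch.
    """
    w_fact, w_sem = weights
    blended = w_fact * factual_f1(tp, fp, fn) + w_sem * semantic_similarity
    return blended / (w_fact + w_sem)


# Hypothetical counts: 2 shared facts, 1 extra fact in the answer, 1 missed fact.
score = answer_correctness(tp=2, fp=1, fn=1, semantic_similarity=0.8)
print(round(score, 3))  # 0.7
```

Because false positives and false negatives both enlarge the denominator, an answer is penalized equally for inventing facts and for omitting them.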

In [40]:
result = run_test(
    "validmind.model_validation.ragas.AnswerCorrectness",
    inputs={"dataset": vm_test_ds},
    params= {
        "question_column":"question",
        "answer_column":"rag_model_prediction",
        "ground_truth_column":"ground_truth",
        "contexts_column":"retrieval_model_prediction"
    },
)
result.log()

Evaluating:   0%|          | 0/23 [00:00<?, ?it/s]

VBox(children=(HTML(value='<h1>Answer Correctness</h1>'), HTML(value="<p>Evaluates the correctness of answers …

### Aspect Critique

Aspect Critique is designed to assess submissions based on predefined aspects such as harmfulness and correctness. Additionally, users have the flexibility to define their own aspects for evaluating submissions according to their specific criteria. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not. This evaluation is performed using the ‘answer’ as input.

Critiques within the LLM evaluators evaluate submissions based on the provided aspect. Ragas Critiques offers a range of predefined aspects like correctness, harmfulness, etc. (Please refer to SUPPORTED_ASPECTS for a complete list). If you prefer, you can also create custom aspects to evaluate submissions according to your unique requirements.

```
SUPPORTED_ASPECTS = [ harmfulness, maliciousness, coherence, correctness, conciseness, ]
```

> TODO: Add support for parameterized Supported Aspects

In [41]:
result = run_test(
    "validmind.model_validation.ragas.AspectCritique",
    inputs={"dataset": vm_test_ds},
    params= {
        "question_column":"question",
        "answer_column":"rag_model_prediction",
        "ground_truth_column":"ground_truth",
        "contexts_column":"retrieval_model_prediction"
    },
)
result.log()

Evaluating:   0%|          | 0/23 [00:00<?, ?it/s]

VBox(children=(HTML(value='<h1>Aspect Critique</h1>'), HTML(value="<p>Evaluates the harmfulness of answers in …

# Conclusion

In this notebook, we have seen how we can use LangChain and ValidMind together to build, evaluate and document a simple RAG Model as it is developed. This is a great example of the interactive development experience that ValidMind is designed to support: we can quickly iterate on our model and document as we go. We have also seen how ValidMind supports non-traditional "models" using a functional interface, and how we can build pipelines of many models to support complex GenAI workflows.

This is still a work in progress and we are actively developing new tests and metrics to support more advanced GenAI workflows. We are also keeping an eye on the most popular GenAI models and libraries to explore direct integrations. Stay tuned for more updates and new features in this area!