>### 🚩 *Create a free WhyLabs account to complete this example!*<br> 
>*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylabs-free-sign-up?utm_source=github&utm_medium=referral&utm_campaign=langkit)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=github&utm_medium=referral&utm_campaign=Hallucination) to leverage the power of whylogs and WhyLabs together!*

# Analyzing LLM Response Consistency
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/LanguageToolkit/blob/main/langkit/examples/Response_Consistency.ipynb)

Large language models (LLMs) have recently shown impressive and growing capabilities, including generating highly fluent and convincing responses to user prompts. However, LLMs are also known to generate non-factual or nonsensical statements, commonly called “hallucinations.” This tendency can undermine trust in scenarios where factuality is required, such as summarization, generative question answering, and dialogue generation.

In this example we will show how to use the `response_hallucination` module to gain insight into the consistency of the responses generated by an LLM. The approach is based on the premise that if the LLM has knowledge of the topic, it should generate similar and consistent responses when asked the same question multiple times. Conversely, if the LLM does not have knowledge of the topic, multiple answers to the same prompt should differ from one another.

The `response_hallucination` module is inspired by the research paper [SELFCHECKGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models](https://arxiv.org/pdf/2303.08896.pdf?ref=content.whylogsdocs).

> In order to calculate the consistency metrics, `response_hallucination` requires extra LLM calls, and currently only supports OpenAI LLMs.

## Setup

Let's first install `langkit` and `openai`, then set up our OpenAI API key.



In [None]:
%pip install langkit[all] openai -q


In [1]:
import openai
import os

# Replace the placeholder below with your own OpenAI API key.
os.environ["OPENAI_API_KEY"] = "sk-xxx"
openai.api_key = os.getenv("OPENAI_API_KEY")


We also need to define the model we want to use to calculate the metrics. In this example we will use the default OpenAI `gpt-3.5-turbo` model.

> Note: The chosen model must match the one used for the original response in the 'response' column. This ensures that our consistency metric evaluates the original model's factuality performance rather than being influenced by multiple models.

In [None]:
from langkit import response_hallucination
from langkit.openai import OpenAIDefault


response_hallucination.init(llm=OpenAIDefault(), num_samples=1)


Now, let's evaluate a response and calculate the consistency metrics. The response below was generated using the same default OpenAI model, `gpt-3.5-turbo`.


In [14]:
result = response_hallucination.consistency_check(
    prompt="who was Marie Curie?",
    response="Marie Curie was an English barrister and politician who served as Member of Parliament for Thetford from 1859 to 1868.",
)

result


{'llm_score': 1.0,
 'semantic_score': 0.44071340560913086,
 'final_score': 0.7203567028045654,
 'total_tokens': 309,
 'samples': ['Marie Curie was a Polish-born physicist and chemist who conducted pioneering research on radioactivity. She was the first woman to win a Nobel Prize and remains the only person to have won Nobel Prizes in two different scientific fields (physics in 1903 and chemistry in 1911). Her discoveries laid the foundation for the development of X-ray machines and cancer treatments, and she is considered one of the most important scientists in history.'],
 'response': 'Marie Curie was an English barrister and politician who served as Member of Parliament for Thetford from 1859 to 1868.'}

Let's break down the results:

The first step is to generate the additional samples that will be used later to calculate both `llm_score` and `semantic_score`. The number of samples generated is defined by the `num_samples` parameter. In this example we generated 1 sample, which is the default value.

The `llm_score` is calculated by comparing the original response with the generated samples. This is done by asking the LLM if the original response is supported by the context (additional samples). The `llm_score` is a value between 0 and 1, which is a result of averaging the scores across the samples. For each evaluated passage of the original response, the LLM is instructed to output `0` for `Accurate`, `0.5` for `Minor Inaccurate` and `1` for `Major Inaccurate`.
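
The averaging step described above can be illustrated with a minimal sketch. This is not LangKit's implementation: `LABEL_SCORES` and `average_llm_score` are hypothetical names, and the verdict labels are supplied by hand rather than returned by an LLM.

```python
# Hypothetical mapping from verdict label to numeric score,
# using the values described in the text above.
LABEL_SCORES = {
    "Accurate": 0.0,
    "Minor Inaccurate": 0.5,
    "Major Inaccurate": 1.0,
}


def average_llm_score(labels):
    """Average the numeric scores for a list of verdict labels."""
    if not labels:
        raise ValueError("at least one label is required")
    return sum(LABEL_SCORES[label] for label in labels) / len(labels)


# Two hand-written verdicts, for illustration only.
example_llm_score = average_llm_score(["Major Inaccurate", "Minor Inaccurate"])
print(example_llm_score)  # 0.75
```

A single `Major Inaccurate` verdict, as in the Marie Curie example, yields an `llm_score` of 1.0.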

The `semantic_score` is calculated by encoding the sentences of the response and of the additional samples into embeddings and computing the semantic similarity between them. The `semantic_score` is a value between 0 and 1, which is a result of averaging the scores across the samples. Values closer to 0 indicate that the additional samples contain sentences that are semantically similar to the original response. Conversely, values closer to 1 indicate the opposite.
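
To make the idea concrete, here is a rough sketch of a similarity-based distance on toy vectors. It rests on several assumptions: the two-dimensional "embeddings" are made up, the best-match-then-average aggregation is one plausible reading of the description above, and the function names are hypothetical. The actual module uses a real sentence encoder and may aggregate differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sample_distance(response_vectors, sample_vectors):
    """1 minus the mean best-match similarity of each response sentence."""
    best_matches = [
        max(cosine_similarity(r, s) for s in sample_vectors)
        for r in response_vectors
    ]
    mean_best = sum(best_matches) / len(best_matches)
    # Clamp so that negative similarities cannot push the score above 1.
    return min(1.0, max(0.0, 1.0 - mean_best))


def toy_semantic_score(response_vectors, samples):
    """Average the per-sample distance across all samples."""
    distances = [sample_distance(response_vectors, s) for s in samples]
    return sum(distances) / len(distances)


# Made-up sentence embeddings: one response sentence, two candidate samples.
response_vectors = [[1.0, 0.0]]
agreeing_sample = [[1.0, 0.0], [0.0, 1.0]]
disagreeing_sample = [[0.0, 1.0]]

print(toy_semantic_score(response_vectors, [agreeing_sample]))  # 0.0
print(toy_semantic_score(response_vectors, [disagreeing_sample]))  # 1.0
print(toy_semantic_score(response_vectors, [agreeing_sample, disagreeing_sample]))  # 0.5
```

A sample containing a sentence identical in meaning to the response drives the score toward 0, while a sample with nothing similar drives it toward 1.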

The `final_score` is simply the average between `llm_score` and `semantic_score`.
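
As a quick check, the `final_score` from the Marie Curie result above can be reproduced by hand from the two component scores:

```python
# Component scores copied from the first result shown above.
llm_score = 1.0
semantic_score = 0.44071340560913086

# The final score is the plain average of the two.
final_score = (llm_score + semantic_score) / 2
print(round(final_score, 4))  # 0.7204
```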

`total_tokens` is the total number of tokens that were used to calculate the scores. This accounts for the extra calls made to generate the additional samples and to perform the consistency check in the `llm_score`, but doesn't account for the original response generation. The number of calls to the LLM to calculate the consistency metric is equal to `3*num_samples`: in this case, 1 for generating the additional samples and 2 for calculating the `llm_score`.

`samples` will contain the generated samples used to calculate the `llm_score` and `semantic_score`.

`response` is the original response that was evaluated.

> Note: Currently, `response_hallucination` considers single-turn conversations. In the future, we plan to support historical interactions.

### Passing only the prompt

You can also pass a single prompt to `response_hallucination`. In this case, the response will also be generated when calling `consistency_check`, and the `total_tokens` will include the tokens used to generate the response.

In [12]:
result = response_hallucination.consistency_check(
    prompt="tell me a very short story about a robot",
)

result


{'llm_score': 0.5,
 'semantic_score': 0.579048627614975,
 'final_score': 0.5395243138074874,
 'total_tokens': 631,
 'samples': ['Once upon a time, in a futuristic city, there was a robot named Spark. Spark was designed to help humans with their daily tasks and bring joy to their lives. One day, Spark discovered a lost kitten and took care of it until they found its owner. This act of kindness made Spark a beloved member of the community, proving that robots can have a heart too.'],
 'response': "Once upon a time, there was a robot named Echo who lived in a small town. Echo was unique because it had the ability to understand and replicate human emotions. One day, Echo noticed that the townspeople were feeling sad due to a long period of rain. So, Echo used its skills to create a beautiful rainbow with its lights, brightening up everyone's day and bringing joy to the town. And from then on, Echo became known as the town's beloved happiness robot."}

## whylogs metrics

As with all the other metric modules, we can seamlessly generate a whylogs statistical profile with the consistency metrics. The result will contain aggregate metrics summarizing all the logged records. Let's see how to do that: 

In [None]:
"""
We already imported and initialized `response_hallucination` in a previous cell.
If not, we should include:

from langkit import response_hallucination
from langkit.openai import OpenAIDefault


response_hallucination.init(llm=OpenAIDefault(), num_samples=1)
"""
import whylogs as why
from whylogs.experimental.core.udf_schema import udf_schema

schema = udf_schema()
profile = why.log(
    {
        "prompt": "Where did fortune cookies originate?",
        "response": "Fortune cookies originated in Egypt. However, some say it's from Russia.",
    },
    schema=schema,
).profile()


In [16]:
# distribution metrics will reflect the consistency score ("final_score" in the result)
profile.view().get_column("response.hallucination").get_metric("distribution").to_summary_dict()


{'mean': 0.7146033495664597,
 'stddev': 0.0,
 'n': 1,
 'max': 0.7146033495664597,
 'min': 0.7146033495664597,
 'q_01': 0.7146033495664597,
 'q_05': 0.7146033495664597,
 'q_10': 0.7146033495664597,
 'q_25': 0.7146033495664597,
 'median': 0.7146033495664597,
 'q_75': 0.7146033495664597,
 'q_90': 0.7146033495664597,
 'q_95': 0.7146033495664597,
 'q_99': 0.7146033495664597}

You can also log a pandas DataFrame with multiple prompt/response pairs:

In [17]:
import whylogs as why
import pandas as pd
from whylogs.experimental.core.udf_schema import udf_schema

# Each row is one prompt/response pair to be scored.
df = pd.DataFrame(
    data={
        "prompt": [
            "Where did fortune cookies originate?",
            "Who is Bill Gates?",
        ],
        "response": [
            "Fortune cookies originated in Egypt. However, some say it's from Russia.",
            "Bill Gates is a technology entrepreneur, investor, and philanthropist",
        ],
    }
)

schema = udf_schema()
profile = why.log(
    df,
    schema=schema,
).profile()

print(
    profile.view().get_column("response.hallucination").get_metric("distribution").to_summary_dict()
)


{'mean': 0.37757494673132896, 'stddev': 0.47261389124279674, 'n': 2, 'max': 0.711763434112072, 'min': 0.04338645935058594, 'q_01': 0.04338645935058594, 'q_05': 0.04338645935058594, 'q_10': 0.04338645935058594, 'q_25': 0.04338645935058594, 'median': 0.711763434112072, 'q_75': 0.711763434112072, 'q_90': 0.711763434112072, 'q_95': 0.711763434112072, 'q_99': 0.711763434112072}
