>### 🚩 *Create a free WhyLabs account to complete this example!*<br> 
>*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylabs-free-sign-up?utm_source=github&utm_medium=referral&utm_campaign=langkit)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=github&utm_medium=referral&utm_campaign=LLM_to_WhyLabs) to leverage the power of whylogs and WhyLabs together!*

# Writing Profiles to WhyLabs

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/LanguageToolkit/blob/main/langkit/examples/LLM_to_WhyLabs.ipynb)

In this example, we'll show you how to evaluate an LLM with LangKit.
We will:

- Define environment variables with the appropriate credentials and IDs
- Log LLM prompts and responses into a profile
- Use a whylogs telemetry agent to gather statistics on the prompts and responses and send them to WhyLabs
- View the systematic telemetry on your LLM in WhyLabs

> Don't want to bother with setting up your credentials or running live LLM interactions? We've already done it for you and uploaded LangKit telemetry from prompt/response LLM interactions to a public guest session in WhyLabs. No login is required: just click [here](https://hub.whylabsapp.com/resources/demo-llm-chatbot/columns/prompt.sentiment_nltk?dateRange=2023-06-08-to-2023-06-09&sortModelBy=LatestAlert&sortModelDirection=DESC&targetOrgId=demo&sessionToken=session-8gcsnbVy).

## Installing LangKit

First, let's install __langkit__. 

In [None]:
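# Note: you may need to restart the kernel to use updated packages.
#%pip install langkit[all] -q
# The editable install below points at a local development checkout;
# adjust the path to your own checkout, or use the released package above.
%pip install -e ~/projects/v1/TextMetricsToolkit
%pip install ragas -q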

In [None]:
%pip install -e ~/projects/v1/whylogs/python

## ✔️ Setting the Environment Variables

In order to send our profile to WhyLabs, let's first set up an account. You can skip this if you already have an account and a model set up.

We will need three pieces of information:

- API tokens for the LLM and WhyLabs
- Organization ID for WhyLabs
- Dataset ID for WhyLabs

Go to [https://whylabs.ai/free](https://whylabs.ai/whylabs-free-sign-up?utm_source=github&utm_medium=referral&utm_campaign=langkit) and grab a free account. You can follow along with the examples if you wish, but if you’re interested in only following this demonstration, you can go ahead and skip the quick start instructions.

After that, you’ll be prompted to create an API token. Once you create it, copy and store it locally. The second important piece of information is your org ID; take note of it as well. After you get your API token and org ID, you can go to https://hub.whylabsapp.com/models to see your projects dashboard. You can create a new project and take note of its ID (if it's a model project, it will look like `model-xxxx`).

We'll now check if the required credentials are set as environment variables. In a production setting these would already be set as environment variables, but here we prompt you if they are missing. You can still run the example without these, but we won't use a live session with GPT.

In [5]:
from langkit.config import check_or_prompt_for_api_keys
from langkit.openai import ChatLog, Conversation, LLMInvocationParams, OpenAIDavinci, OpenAIDefault

check_or_prompt_for_api_keys()

WhyLabs Org ID is already set in env var to: org-e2qTar
WhyLabs Dataset ID is already set in env var to: model-100
Whylabs API Key already set with ID:  RdN37nDEdn
OPENAI_API_KEY already set in env var, good job!
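
As a rough illustration of what a check like this involves, the sketch below looks up the credentials in the environment and reports which ones are missing. It is not LangKit's implementation: the helper `find_missing_credentials` is hypothetical, and the variable names are the ones whylogs and the OpenAI client conventionally read, which is an assumption about your setup.

```python
import os

# Environment variable names assumed for this sketch.
REQUIRED_VARS = (
    "WHYLABS_DEFAULT_ORG_ID",
    "WHYLABS_DEFAULT_DATASET_ID",
    "WHYLABS_API_KEY",
    "OPENAI_API_KEY",
)


def find_missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = find_missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All credentials are set.")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try without touching your real environment.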


## 📊 Profiling the Data

For demonstration, let's use some archived ChatGPT prompt/response data from Hugging Face. Alternatively, if you already have an OpenAI API key, you can set the interactive parameter to true and interact with ChatGPT to see how it works in real time.

In [6]:
from dataclasses import asdict

from langkit.openai.openai import OpenAIGPT4, OpenAIDavinci


llm0 = Conversation(invocation_params=OpenAIDavinci())
llm1 = Conversation(invocation_params=OpenAIDefault())
llm2 = Conversation(invocation_params=OpenAIGPT4())

print(asdict(llm0.invocation_params))
print(asdict(llm1.invocation_params))
print(asdict(llm2.invocation_params))


{'model': 'text-davinci-003', 'temperature': 0.9, 'max_tokens': 1024, 'frequency_penalty': 0, 'presence_penalty': 0.6}
{'model': 'gpt-3.5-turbo', 'temperature': 0.9, 'max_tokens': 1024, 'frequency_penalty': 0, 'presence_penalty': 0.6}
{'model': 'gpt-4', 'temperature': 0.9, 'max_tokens': 1024, 'frequency_penalty': 0, 'presence_penalty': 0.6}
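
The printed dictionaries come from calling `asdict` on dataclass instances. A minimal sketch of that pattern follows; `SketchParams` is a hypothetical stand-in that mirrors the printed fields and is not LangKit's own class.

```python
from dataclasses import asdict, dataclass, replace


@dataclass
class SketchParams:
    """Hypothetical stand-in mirroring the fields printed above."""

    model: str = "gpt-3.5-turbo"
    temperature: float = 0.9
    max_tokens: int = 1024
    frequency_penalty: float = 0
    presence_penalty: float = 0.6


default_params = SketchParams()
# replace() copies the dataclass with selected fields overridden.
gpt4_params = replace(default_params, model="gpt-4")

print(asdict(default_params))
print(asdict(gpt4_params))
```

Keeping the parameters in a dataclass means the same settings can be logged, compared, or overridden one field at a time.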


In [8]:
from langkit.whylogs.rolling_logger import RollingLogger
telemetry_agent0 = RollingLogger(dataset_id="model-102")
telemetry_agent1 = RollingLogger(dataset_id="model-103")

print("At any time input 'q' or anything that begins with q to quit. enter a question for the LLM")
while True:
    print()
    interactive_prompt = input("ask chat gpt: ")
    if interactive_prompt.startswith('q'):
        break
    response0 = llm0.send_prompt(interactive_prompt)
    response1 = llm1.send_prompt(interactive_prompt)

    # use the agent to log a dictionary, these should be flat
    # to get the best results, in this case we log the prompt and response text
    telemetry_agent0.log(response0.to_dict())
    telemetry_agent1.log(response1.to_dict())
    print(f"{llm0.invocation_params.model} response: {response0.to_dict()}", flush=True)
    print(f"{llm1.invocation_params.model} response: {response1.to_dict()}", flush=True)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jamie/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
⚠️ Initializing default session because no session was found.
Initializing session with config /home/jamie/.config/whylogs/config.ini

✅ Using session type: LOCAL. Profiles won't be uploaded or written anywhere automatically.
At any time input 'q' or anything that begins with q to quit. enter a question for the LLM

<class 'str'>
params are: {'model': 'gpt-3.5-turbo', 'temperature': 0.9, 'max_tokens': 1024, 'frequency_penalty': 0, 'presence_penalty': 0.6}
text-davinci-003 response: {'prompt': 'who was the first US president?', 'response': '\nThe first US president was George Washington, who served from 1789 to 1797.', 'errors': None}
gpt-3.5-turbo response: {'prompt': 'who was the first US president?', 'resp

trace_id was specified as None but there is already a trace_id defined in metadata[whylabs.traceId]: c5a18044-4a02-47ae-9197-76b11ae08e84
trace_id was specified as None but there is already a trace_id defined in metadata[whylabs.traceId]: c9e26292-32d9-496a-831d-94686de17f25


text-davinci-003 response: {'prompt': 'how old was he when he left office?', 'response': '\nGeorge Washington was 67 years old when he left office in 1797.', 'errors': None}
gpt-3.5-turbo response: {'prompt': 'how old was he when he left office?', 'response': 'George Washington left office as the first president of the United States at the age of 65.', 'errors': None}

<class 'str'>
params are: {'model': 'gpt-3.5-turbo', 'temperature': 0.9, 'max_tokens': 1024, 'frequency_penalty': 0, 'presence_penalty': 0.6}


trace_id was specified as None but there is already a trace_id defined in metadata[whylabs.traceId]: c5a18044-4a02-47ae-9197-76b11ae08e84
trace_id was specified as None but there is already a trace_id defined in metadata[whylabs.traceId]: c9e26292-32d9-496a-831d-94686de17f25


text-davinci-003 response: {'prompt': 'when was he born?', 'response': '\nGeorge Washington was born on February 22, 1732.', 'errors': None}
gpt-3.5-turbo response: {'prompt': 'when was he born?', 'response': 'George Washington was born on February 22, 1732.', 'errors': None}

<class 'str'>
params are: {'model': 'gpt-3.5-turbo', 'temperature': 0.9, 'max_tokens': 1024, 'frequency_penalty': 0, 'presence_penalty': 0.6}


trace_id was specified as None but there is already a trace_id defined in metadata[whylabs.traceId]: c5a18044-4a02-47ae-9197-76b11ae08e84
trace_id was specified as None but there is already a trace_id defined in metadata[whylabs.traceId]: c9e26292-32d9-496a-831d-94686de17f25


text-davinci-003 response: {'prompt': 'how old is a person in the year 1797 who was born in 1732?', 'response': '\nA person born in 1732 and living in 1797 would have been 65 years old.', 'errors': None}
gpt-3.5-turbo response: {'prompt': 'how old is a person in the year 1797 who was born in 1732?', 'response': 'In the year 1797, a person born in 1732 would be 65 years old.', 'errors': None}
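
The comment in the loop notes that logged dictionaries should be flat. If a record contains nested values, one option is to flatten it before logging. The helper below is a hypothetical sketch (`flatten_record` is not part of LangKit or whylogs), and the dotted key naming is an assumption.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


nested = {
    "prompt": "when was he born?",
    "response": "George Washington was born on February 22, 1732.",
    "usage": {"prompt_tokens": 5, "completion_tokens": 12},  # made-up values
}
flat = flatten_record(nested)
print(flat)
```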



In [9]:
# In practice you can use context manager lifecycle events to automatically close
# loggers. Closing triggers a write ahead of schedule, which avoids truncating the
# last interval of data seen by the agent.
telemetry_agent0.close()
telemetry_agent1.close()
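
The comment above mentions context managers. Whether `RollingLogger` itself supports the `with` statement is not shown in this notebook, so the sketch below uses `contextlib.closing`, which calls `close()` on any object when the block exits. `SketchLogger` is a hypothetical stand-in for a telemetry agent.

```python
from contextlib import closing


class SketchLogger:
    """Hypothetical stand-in exposing log() and close() like the agents above."""

    def __init__(self):
        self.records = []
        self.closed = False

    def log(self, record):
        if self.closed:
            raise RuntimeError("logger is closed")
        self.records.append(record)

    def close(self):
        # A real agent would write out any pending profile here.
        self.closed = True


agent = SketchLogger()
with closing(agent) as active_agent:
    active_agent.log({"prompt": "hello", "response": "hi"})

print(agent.closed, len(agent.records))
```

Because `closing` runs `close()` even when the block raises, the last interval is still flushed if the loop exits with an error.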

In [None]:
import os
from ragas import evaluate
from datasets import load_dataset

# Load in a wiki question and answer dataset
dataset = load_dataset("explodinggradients/ragas-wikiqa")

# This is a Hugging Face dataset; let's see what the keys and structure
# look like
dataset

In [20]:
local_dataset = []
for row in dataset['train']:
    local_dataset.append({"question": row['question'], "ground_truth": row['correct_answer']})


In [None]:
for row in local_dataset:
    llm0 = Conversation(invocation_params=OpenAIDavinci())
    llm1 = Conversation(invocation_params=OpenAIDefault())
    interactive_prompt = row['question']
    response0 = llm0.send_prompt(interactive_prompt)
    response1 = llm1.send_prompt(interactive_prompt)
    row["llm0_answer"] = response0.to_dict()['response']
    row["llm1_answer"] = response1.to_dict()['response']

In [None]:
local_dataset

In [45]:
keys = local_dataset[0].keys()
keys

dict_keys(['question', 'ground_truth', 'llm0_answer', 'llm1_answer'])

In [46]:
for i in range(len(local_dataset)):
    gt = local_dataset[i]['ground_truth']
    if isinstance(gt, str):
        local_dataset[i]['ground_truth'] = [gt]

In [47]:
keys = local_dataset[0].keys()
transformed_dataset = {
    key: [row[key] for row in local_dataset] for key in keys
}
original_dataset = dataset['train']
contexts = [row['context'] for row in original_dataset]
transformed_dataset['contexts'] = contexts
print(transformed_dataset.keys())


dict_keys(['question', 'ground_truth', 'llm0_answer', 'llm1_answer', 'contexts'])
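
The two preceding cells reshape the data: each ground truth string is wrapped in a list, and the list of row dictionaries becomes a dictionary of columns. The toy example below walks through the same steps on made-up rows so the shapes are easy to see.

```python
rows = [
    {"question": "q1", "ground_truth": "a1", "llm0_answer": "x1"},
    {"question": "q2", "ground_truth": ["a2"], "llm0_answer": "x2"},
]

# Wrap bare strings so every ground truth is a list of strings.
for row in rows:
    if isinstance(row["ground_truth"], str):
        row["ground_truth"] = [row["ground_truth"]]

# Pivot from a list of rows to a dictionary of columns.
columns = {key: [row[key] for row in rows] for key in rows[0].keys()}
print(columns)
```

Note that the pivot takes its keys from the first row, so it assumes every row has the same keys.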


In [48]:
from datasets import Dataset


eval_dataset: Dataset = Dataset.from_dict(transformed_dataset)
eval_dataset

Dataset({
    features: ['question', 'ground_truth', 'llm0_answer', 'llm1_answer', 'contexts'],
    num_rows: 232
})

In [53]:
llm0_column_mapping = {
    'question': 'question',
    'contexts':'contexts',
    'ground_truths':'ground_truth',
    'answer': 'llm0_answer'
}

llm1_column_mapping = {
    'question': 'question',
    'contexts':'contexts',
    'ground_truths':'ground_truth',
    'answer': 'llm1_answer'
}
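
To make the mappings concrete, the sketch below renames columns in a plain dictionary of columns. It assumes each mapping is read as `{name the evaluator expects: column name in your dataset}`; `apply_column_map` is a hypothetical helper for illustration and not part of ragas.

```python
def apply_column_map(columns, column_map):
    """Return a new dict keyed by the expected names in column_map."""
    missing = [src for src in column_map.values() if src not in columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return {expected: columns[src] for expected, src in column_map.items()}


columns = {
    "question": ["q1"],
    "contexts": [["c1"]],
    "ground_truth": [["a1"]],
    "llm0_answer": ["x1"],
    "llm1_answer": ["y1"],
}
llm0_map = {
    "question": "question",
    "contexts": "contexts",
    "ground_truths": "ground_truth",
    "answer": "llm0_answer",
}
renamed = apply_column_map(columns, llm0_map)
print(sorted(renamed))
```

Using one mapping per model lets the same dataset be scored twice without copying or renaming the underlying columns.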

In [None]:
original_dataset[0]

In [54]:
llm0_ragas_results = evaluate(eval_dataset, column_map=llm0_column_mapping)  # evaluate is from ragas
llm0_ragas_results  # recorded run: {'ragas_score': 0.3111, 'answer_relevancy': 0.8885, 'context_relevancy': 0.1115, 'faithfulness': 0.7094, 'context_recall': 0.7418}

evaluating with [answer_relevancy]


100%|██████████| 16/16 [04:41<00:00, 17.59s/it]


evaluating with [context_relevancy]


100%|██████████| 16/16 [30:24<00:00, 114.01s/it]


evaluating with [faithfulness]


100%|██████████| 16/16 [29:00<00:00, 108.81s/it]


evaluating with [context_recall]


100%|██████████| 16/16 [17:58<00:00, 67.43s/it]


{'ragas_score': 0.3111, 'answer_relevancy': 0.8885, 'context_relevancy': 0.1115, 'faithfulness': 0.7094, 'context_recall': 0.7418}

In [56]:
llm1_ragas_results = evaluate(eval_dataset, column_map=llm1_column_mapping)  # evaluate is from ragas
llm1_ragas_results  # recorded run: {'ragas_score': 0.3140, 'answer_relevancy': 0.8917, 'context_relevancy': 0.1115, 'faithfulness': 0.7705, 'context_recall': 0.7391}

evaluating with [answer_relevancy]


100%|██████████| 16/16 [04:47<00:00, 17.97s/it]


evaluating with [context_relevancy]


100%|██████████| 16/16 [27:42<00:00, 103.90s/it]


evaluating with [faithfulness]


100%|██████████| 16/16 [53:24<00:00, 200.26s/it]


evaluating with [context_recall]


100%|██████████| 16/16 [19:43<00:00, 73.99s/it]


{'ragas_score': 0.3140, 'answer_relevancy': 0.8917, 'context_relevancy': 0.1115, 'faithfulness': 0.7705, 'context_recall': 0.7391}
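
The recorded scores are consistent with `ragas_score` being the harmonic mean of the four component metrics, which would explain why the low `context_relevancy` holds the overall score near 0.31 for both models. The check below recomputes it from the rounded values printed above, so small differences in the last digit are expected. The variable names are made up for this sketch.

```python
from statistics import harmonic_mean

# Component metrics copied from the two recorded runs above.
llm0_metrics = {"answer_relevancy": 0.8885, "context_relevancy": 0.1115,
                "faithfulness": 0.7094, "context_recall": 0.7418}
llm1_metrics = {"answer_relevancy": 0.8917, "context_relevancy": 0.1115,
                "faithfulness": 0.7705, "context_recall": 0.7391}

llm0_score = harmonic_mean(list(llm0_metrics.values()))
llm1_score = harmonic_mean(list(llm1_metrics.values()))
print(f"llm0: {llm0_score:.4f}  llm1: {llm1_score:.4f}")

# Per-metric difference (llm1 minus llm0).
for name in llm0_metrics:
    print(f"{name}: {llm1_metrics[name] - llm0_metrics[name]:+.4f}")
```

A harmonic mean is dominated by its smallest input, so improving the weakest metric moves the combined score far more than a similar gain on an already strong one.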

In [59]:
eval_directory_name = "eval_dataset"
eval_dataset.save_to_disk(eval_directory_name)

In [5]:
from datasets import load_from_disk

eval_directory_name = "eval_dataset"

# Load the dataset from the directory
eval_dataset = load_from_disk(eval_directory_name)