# LLMOps Tutorial
## Evaluations for models from chapter 4

**The A100 GPU option is strongly preferred for this notebook.**

In [None]:
!pip install trulens_eval openai peft langkit[all]


In [None]:
import os

import pandas as pd
import numpy as np
import torch
from transformers import pipeline, AutoModelForCausalLM
from peft import PeftModel, PeftConfig


## Process data

This file is obtained by running
```
tweetsum_datasets['test'].to_csv('tweetsumm-test.csv')
```
in the ch4 notebook.

We do this just so we don't have to repeat all the code for tweetsumm pre-processing.

All of the code that follows should probably be optimized to use a HF Dataset instead of a Pandas DataFrame, but the latter was a little easier to experiment with.

In [None]:
tweetsumm_test = pd.read_csv('./tweetsumm-test.csv')
tweetsumm_test.head()

Unnamed: 0,id,text,question,answer
0,bbde6d8ec7c39c4551da1ff6024f997b,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is complaining that the watchlist is ...
1,1d1a6617ae65baa429c2232ccc908840,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is asking about the ACC to link to th...
2,9555f25de7b6c8dfb8204f56f8bc4dd0,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is complaining about the new updates ...
3,54fe18905f0a19ee163a2b452e31e07d,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is complaining about parcel service ...
4,f6cc57227f74737de08efd03782d015e,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,The customer says that he is stuck at Staines ...


Take a small sample from the test set as a demonstration.

In [None]:
tweet_subset = tweetsumm_test.sample(100)

A couple of helper functions to parse outputs.

In [None]:
def split_conversation(question):
    conversation = question.split('### Conversation:')[-1]
    return conversation.split('### Summary:')[0].strip('\n')

def generate_tweetsumm(question, generator):
    response = generator(question)[0]['generated_text']
    summary = response.split('### Summary:')[-1]

    # Remove the <END_OF_SECOND_SENTENCE> tokens
    if '<END_' in summary:
        summary = summary.split('<END_')[0]

    return summary
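
As a quick check of these helpers, the sketch below runs them on a made-up prompt with a stand-in generator. The prompt text, `fake_generator`, and its completion are assumptions for illustration only; the stand-in just mimics the list-of-dicts shape that a `text-generation` pipeline returns.

```python
# Copied from the cell above so that the sketch is self-contained.
def split_conversation(question):
    conversation = question.split('### Conversation:')[-1]
    return conversation.split('### Summary:')[0].strip('\n')

def generate_tweetsumm(question, generator):
    response = generator(question)[0]['generated_text']
    summary = response.split('### Summary:')[-1]
    if '<END_' in summary:
        summary = summary.split('<END_')[0]
    return summary

# Hypothetical prompt laid out like the `question` column.
question = (
    "### Instruction:\nRead the following conversation and summarize it.\n"
    "### Conversation:\nCustomer: My order is late.\nAgent: Sorry, checking now.\n"
    "### Summary:"
)

def fake_generator(prompt):
    # Stand-in for a pipeline: echoes the prompt and appends a completion.
    completion = " Customer reports a late order.<END_OF_SECOND_SENTENCE>"
    return [{'generated_text': prompt + completion}]

conversation = split_conversation(question)
summary = generate_tweetsumm(question, fake_generator)
print(repr(conversation))
print(repr(summary))
```

The pipeline output repeats the prompt before the completion, which is why `generate_tweetsumm` keeps only the text after the last `### Summary:` marker.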

For convenience in our evaluations later, we split the conversation from the instruction portion of the inputs.

In [None]:
tweet_subset['conversation'] = tweet_subset['question'].apply(split_conversation)

## Set API keys for HuggingFace and OpenAI

Click the key icon on the left to add secrets in Colab.
- For HuggingFace, you can store your secret as HF_TOKEN and it will be automatically loaded.
- For OpenAI, copy the secret into an environment variable as below.

In [None]:
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('openai')


In [None]:
# Replace with your HuggingFace id if you want to use your own models.

HF_USER = "sarahsor"

## Apply models

TODO: The ch 5 tutorial should call out which of the resulting models will be used in this notebook, so the reader can save their own versions to the Hub if desired. But if they didn't do that exercise, they can use mine. (They will have to authenticate to HF Hub either way to access the Llama-2 base weights.)

We use the fine-tuned version of DistilGPT2 (most efficient) and the LoRA version of Llama-2 (highest performance) as two points of comparison.

In [None]:
generator_distilgpt2 = pipeline("text-generation",
                     model=f'{HF_USER}/distilgpt2-tweetsumm-finetune',
                     tokenizer=f'{HF_USER}/distilgpt2-tweetsumm-finetune',
                     device='cuda:0',
                     max_new_tokens=100)


In [None]:
tweet_subset['distilgpt2'] = tweet_subset['question'].apply(generate_tweetsumm, generator=generator_distilgpt2)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
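
The warning above appears because `apply` sends one prompt at a time to the pipeline. A minimal sketch of a batched alternative follows, assuming the pipeline accepts a list of prompts together with a `batch_size` argument and returns one list of candidates per prompt. `generate_tweetsumm_batched` and `stub_generator` are hypothetical names, and the stub only imitates the pipeline's output shape so that the sketch runs without a GPU. Real batching also requires the tokenizer to define a padding token.

```python
def generate_tweetsumm_batched(questions, generator, batch_size=8):
    """Summarize many prompts with one pipeline call instead of one call per row."""
    outputs = generator(list(questions), batch_size=batch_size)
    summaries = []
    for candidates in outputs:
        text = candidates[0]['generated_text']
        summary = text.split('### Summary:')[-1]
        # Same clean-up as generate_tweetsumm: cut at the first <END_...> token.
        summaries.append(summary.split('<END_')[0])
    return summaries

def stub_generator(prompts, batch_size=1):
    # Stand-in for a text-generation pipeline (an assumption for this sketch).
    return [[{'generated_text': p + ' stub summary<END_OF_SECOND_SENTENCE>'}]
            for p in prompts]

prompts = ['### Conversation: A\n### Summary:', '### Conversation: B\n### Summary:']
batched_summaries = generate_tweetsumm_batched(prompts, stub_generator, batch_size=2)
print(batched_summaries)
```

With the real pipeline, the equivalent call would pass `tweet_subset['question']` and `generator_distilgpt2` in place of the stub inputs.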


## Evaluate with TruLens using their OpenAI plugin

In [None]:
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI

openai_provider = OpenAI()



We can measure the "coherence" of an output with a few lines of code, using a built-in feedback function from TruLens. Under the hood, it composes a prompt that asks an OpenAI model how coherent the text is.

In [None]:
coherence_feedback = Feedback(
    openai_provider.coherence
).on_output()

✅ In coherence, input text will be set to __record__.main_output or `Select.RecordOutput` .


In [None]:
coherence_scores = []

for summary in tweet_subset['distilgpt2']:
    try:
        coherence_scores.append(coherence_feedback(summary))
    except Exception:
        # Occasionally, OpenAI may produce responses without valid scores.
        # We will simply ignore these here.
        pass

In [None]:
distilgpt2_coherence = np.mean(coherence_scores)
print(distilgpt2_coherence)

0.6663265306122449
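
Because failed calls are skipped silently, the mean above may cover fewer summaries than were sampled. A small sketch of a helper that also reports how many scores were dropped is shown here. `score_all` and `flaky_scorer` are hypothetical names, and the stand-in scorer replaces the OpenAI call so that the example runs offline.

```python
from statistics import mean

def score_all(texts, score_fn):
    """Return the scores that succeeded plus a count of the calls that failed."""
    scores, n_failed = [], 0
    for text in texts:
        try:
            scores.append(score_fn(text))
        except Exception:
            n_failed += 1
    return scores, n_failed

def flaky_scorer(text):
    # Stand-in for a feedback function: fails on empty input.
    if not text:
        raise ValueError('no valid score')
    return min(len(text) / 10, 1.0)

scores, n_failed = score_all(['short', '', 'a longer summary'], flaky_scorer)
print(f'mean={mean(scores):.2f} over {len(scores)} scores, {n_failed} dropped')
```

Reporting the dropped count alongside the mean shows whether two models were averaged over comparable numbers of summaries.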


The same code works for "conciseness", which is another built-in feedback function.

In [None]:
conciseness_feedback = Feedback(openai_provider.conciseness).on_output()


✅ In conciseness, input text will be set to __record__.main_output or `Select.RecordOutput` .


In [None]:
conciseness_scores = []

for summary in tweet_subset['distilgpt2']:
    try:
        conciseness_scores.append(conciseness_feedback(summary))
    except Exception:
        pass

In [None]:
distilgpt2_conciseness = np.mean(conciseness_scores)
print(distilgpt2_conciseness)

0.7889999999999999


Now we will extend the concept of feedback functions to create our own. We want to measure how well the models are doing at our specific task, not just generic concepts like coherence.

In [None]:
class CustomOpenAI(OpenAI):
    def tweetsumm_eval(self, conversation: str, summary: str) -> float:
        prompt = f'''I am going to give you a conversation between a Customer and a customer service Agent.
        Please read the conversation, then read the summary below it and judge whether the summary reasonably matches the conversation.
        Then, give the summary a score from 0 to 10, where 0 is a poor match and 10 is an excellent match.

### Conversation: {conversation}
### Summary: {summary}'''
        return self.generate_score(prompt)

In [None]:
custom_provider = CustomOpenAI()

custom_feedback = Feedback(
    custom_provider.tweetsumm_eval, higher_is_better=True
).on_output()

✅ In tweetsumm_eval, input conversation will be set to __record__.main_output or `Select.RecordOutput` .


In [None]:
custom_scores = []

for _, row in tweet_subset.iterrows():
    try:
        custom_scores.append(custom_feedback(row['conversation'], row['distilgpt2']))
    except Exception:
        pass


I would rate this summary a 6 out of 10.

I would rate this summary a 6 out of 10.

I would rate this summary a 2 out of 10.
I would rate this summary a 4 out of 10.

The summary captures the customer's concern about the price increase and the agent's response to log the comments for review. However, it does not mention the specific details of the price change from £1/kg to £1.50/725g, which was a significant increase pointed out by the customer.

I would rate this summary a 2 out of 10.

I would rate this summary a 2 out of 10.

I would rate this summary a 2 out of 10.
I would rate this summary a 3 out of 10.

I would rate this summary a 6 out of 10.

I would rate this summary a 2 out of 10.

I would rate this summary a 2 out of 10.

I would rate this summary a 2 out of 10.

I would rate this summary a 4 out of 10.

I would rate this summary a 1 out of 10.

The summary does not match the conversation very well. It inaccurately states that the customer is enquiring about a delay in ad

In [None]:
distilgpt2_custom = np.mean(custom_scores)
print(distilgpt2_custom)

0.28099999999999997


The custom function has a harder time parsing scores, and it doesn't seem to produce very favorable results. We would need to do some prompt engineering to make it work as well as the TruLens feedback functions. This underscores why it helps to have a framework that has already implemented (and tested) some evaluations for us. TruLens has other capabilities as well, including a dashboard to compare evaluation results.
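
One direction for that prompt engineering is to ask for the number in a fixed format and to parse it strictly. The sketch below is an assumption about how such parsing could look, not TruLens's implementation; `PROMPT_SUFFIX`, `SCORE_PATTERN`, and `parse_score` are hypothetical names. It normalizes an `N out of 10` rating to the 0 to 1 range and returns `None` when no rating is found.

```python
import re

# Hypothetical instruction that could be appended to the evaluation prompt.
PROMPT_SUFFIX = (
    'Answer with exactly one line in the form "Score: N", '
    'where N is an integer from 0 to 10.'
)

# Accept either the requested "Score: N" form or the free-text "N out of 10" form.
SCORE_PATTERN = re.compile(
    r'(?:Score:\s*(\d+))|(?:\b(\d+)\s+out of\s+10\b)', re.IGNORECASE
)

def parse_score(response):
    match = SCORE_PATTERN.search(response)
    if match is None:
        return None
    value = int(match.group(1) or match.group(2))
    if not 0 <= value <= 10:
        return None
    return value / 10

print(parse_score('I would rate this summary a 6 out of 10.'))
print(parse_score('Score: 9'))
print(parse_score('The summary does not match the conversation very well.'))
```

Returning `None` for unparseable responses keeps them out of the mean explicitly, instead of relying on an exception being raised and swallowed.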

Next, we will try running all of the same code again for the Llama-2 model. This needs a lot of GPU memory, so it may be necessary to free up the memory used by the distilgpt2 model.

In [None]:
del generator_distilgpt2  # empty_cache() can only release memory that is no longer referenced
torch.cuda.empty_cache()

In [None]:
config = PeftConfig.from_pretrained(f"{HF_USER}/Llama-2-7b-tweetsumm-lora")
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, f"{HF_USER}/Llama-2-7b-tweetsumm-lora")

generator_llama2 = pipeline('text-generation',
                            model=model,
                            tokenizer='meta-llama/Llama-2-7b-hf',
                            device='cuda:0',
                            max_new_tokens=100)


The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalL

In [None]:
tweet_subset['llama2'] = tweet_subset['question'].apply(generate_tweetsumm, generator=generator_llama2)

In [None]:
coherence_scores = []
conciseness_scores = []
custom_scores = []

for _, row in tweet_subset.iterrows():
    try: coherence_scores.append(coherence_feedback(row['llama2']))
    except Exception: pass
    try: conciseness_scores.append(conciseness_feedback(row['llama2']))
    except Exception: pass
    try: custom_scores.append(custom_feedback(row['conversation'],
                                              row['llama2']))
    except Exception: pass


I would rate this summary a 2 out of 10.

I would rate this summary a 6 out of 10.

I would rate this summary a 6 out of 10.

I would rate this summary a 2 out of 10.

I would rate this summary a 2 out of 10.

I would rate this summary a 7 out of 10.

I would rate this summary a 4 out of 10.

I would rate this summary a 7 out of 10.
I would rate this summary a 9 out of 10.
I would rate this summary a 2 out of 10.

I would rate this summary a 2 out of 10.

I would rate this summary a 6 out of 10.

I would rate this summary a 6 out of 10.

I would rate this summary a 5 out of 10.

I would rate it a 2 out of 10.

I would rate it a 2 out of 10.

I would rate this summary a 2 out of 10.

I would rate this summary a 9 out of 10.

I would rate this summary a 5 out of 10.

I would rate this summary a 6 out of 10.

The summary captures the main points of the conversation, but it misses the customer's initial frustration and the sarcastic tone in their responses. It also does not mention the cu

In [None]:
llama2_coherence = np.mean(coherence_scores)
print('Llama-2 coherence:', llama2_coherence)
llama2_conciseness = np.mean(conciseness_scores)
print('Llama-2 conciseness:', llama2_conciseness)
llama2_custom = np.mean(custom_scores)
print('Llama-2 custom:', llama2_custom)

Llama-2 coherence: 0.7839999999999998
Llama-2 conciseness: 0.8160000000000001
Llama-2 custom: 0.6000000000000001
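
Putting the two sets of results side by side makes the comparison easier to read. The sketch below hard-codes the means printed in this notebook, rounded to three decimals; `score_table` is a hypothetical name. The values come from one random sample of 100 rows, and each mean covers only the calls that returned a valid score, so the differences are indicative only.

```python
# Mean scores copied from the outputs above, rounded to three decimals.
score_table = {
    'coherence':   {'distilgpt2': 0.666, 'llama2': 0.784},
    'conciseness': {'distilgpt2': 0.789, 'llama2': 0.816},
    'custom':      {'distilgpt2': 0.281, 'llama2': 0.600},
}

print(f"{'metric':<12}{'distilgpt2':>12}{'llama2':>10}{'diff':>8}")
for metric, scores in score_table.items():
    diff = scores['llama2'] - scores['distilgpt2']
    print(f"{metric:<12}{scores['distilgpt2']:>12.3f}{scores['llama2']:>10.3f}{diff:>+8.3f}")
```

In this run, Llama-2 scores higher on all three metrics, with the largest gap on the custom task-specific score.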


## LangKit

LangKit is an open-source package from WhyLabs. In this example, we'll just populate the logs with some existing data so we can see some of the capabilities it offers and why it's useful.

To start, we'll get the lengths of all the questions and create a new column to sort on. This will be used to create two data subsets with different distributions.

In [None]:
tweet_subset['length'] = tweet_subset['question'].str.len()
tweetsumm_ordered = tweet_subset.sort_values(by='length')
tweetsumm_ordered.head()

Unnamed: 0,id,text,question,answer,conversation,distilgpt2,llama2,length
48,3ff89398150845b4ea11c95b66d24c07,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,The customer asking how to resolve the error a...,\nCustomer: Hi @AskPlayStation How to resolve...,Customer facing problem with downloading a g...,Customer is complaining about the error CE-3...,827
49,a99eab7003cc7d3d74e6fa2bb6edd03a,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer enquiring about the cancellation and ...,\nCustomer: signed up for $75 package June of...,,Customer is complaining that they have been ...,892
75,757914cc35260d8649cee896153b49e5,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,The customer says that his power went out the ...,\nCustomer: my power went out the other day a...,Power is not turned on as power has gone out...,Customer is complaining that his power went ...,893
83,21985d98d2332bf3428f89c7cfaaaf9f,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is enquiring about streaming music wi...,\nCustomer: @115858 can someone explain to me...,Customer is asking that does he know about t...,Customer is complaining that the watchOS 4.1...,896
2,9555f25de7b6c8dfb8204f56f8bc4dd0,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is complaining about the new updates ...,\nCustomer: the new update ios11 sucks. I can...,Customer is complaining that the new update ...,Customer complains about the new update ios1...,897


In [None]:
short_subset = tweetsumm_ordered.head(50).copy()
long_subset = tweetsumm_ordered.tail(50).copy()


In [None]:
short_subset.head()

Unnamed: 0,id,text,question,answer,conversation,distilgpt2,llama2,length
48,3ff89398150845b4ea11c95b66d24c07,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,The customer asking how to resolve the error a...,\nCustomer: Hi @AskPlayStation How to resolve...,Customer facing problem with downloading a g...,Customer is complaining about the error CE-3...,827
49,a99eab7003cc7d3d74e6fa2bb6edd03a,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer enquiring about the cancellation and ...,\nCustomer: signed up for $75 package June of...,,Customer is complaining that they have been ...,892
75,757914cc35260d8649cee896153b49e5,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,The customer says that his power went out the ...,\nCustomer: my power went out the other day a...,Power is not turned on as power has gone out...,Customer is complaining that his power went ...,893
83,21985d98d2332bf3428f89c7cfaaaf9f,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is enquiring about streaming music wi...,\nCustomer: @115858 can someone explain to me...,Customer is asking that does he know about t...,Customer is complaining that the watchOS 4.1...,896
2,9555f25de7b6c8dfb8204f56f8bc4dd0,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is complaining about the new updates ...,\nCustomer: the new update ios11 sucks. I can...,Customer is complaining that the new update ...,Customer complains about the new update ios1...,897


In [None]:
long_subset.head()

Unnamed: 0,id,text,question,answer,conversation,distilgpt2,llama2,length
104,7907fbebd9809f1c2fe67f275f6d3dc0,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Here customer taking a trail to sing up log a...,\nCustomer: Hi guys we have signed up a trial...,The customer is complaining that he had sign...,Customer is complaining that they are unable...,1334
65,e5408833ea2b769a9e4dbc508800a494,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,The customer was complaining that he was tryin...,\nCustomer: Spent hour trying to book with IH...,Customer is complaining about the inconvenie...,Customer is complaining about the rate of ho...,1344
6,37bb8b5e805036bba6ed0882151d5492,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer says the mileage total for this week ...,\nCustomer: doing the 5K plan on the run app ...,The customer is unable to solve the issue an...,Customer complains that the app updated base...,1356
62,13fddbc086427e1ffd41187dea0f8f95,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer complaints about playlist being reuin...,"\nCustomer: Hey, why won't the Tom Petty solo...",Customer is enquiring about the timings of t...,Customer is complaining about the tom petty ...,1367
8,96582ff6b36a9f65cabfe79bed4401c9,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer having an issue with data speed in hi...,\nCustomer: if you have a commercial that say...,Customer is asking that if they have a comme...,Customer is complaining about the commercial...,1368


In [None]:
# shortest conversation
print(short_subset.conversation.iloc[0])

 
Customer: Hi @AskPlayStation How to resolve this error CE-30022-7 
 Agent: Hey there. What you were doing in the console when you get this error code? 
 Customer: I had inserted TLOU Remastered. 1st time. Working now but it says cant download maps for m.p. 
 Agent: Do you see another error code? 
 Customer: Its possible it happend when i checked for the maps. Right now i am in game. Il try again later. If error what to do? 
 Agent: Let us know the error code! 
 Customer: Of course. Thanks guys. Will update if it arises again. 
 Agent: You're welcome, Happy gaming! 
 


In [None]:
# longest conversation
print(long_subset.conversation.iloc[-1])

 
Customer: Custrelations case 17063826. I’ve been waiting over 3 weeks for compensation that was promised on 1Nov. Please advise soonest. 
 Agent: Thanks for speaking with me today, John. I would recommend removing your Tweet containing your case reference to protect your personal details. This will be visable to the general public. If you follow us you'll be able to keep in touch via DM. ^Claire 
 Customer: Spoke to one of your team after this tweet. Assured all would be sorted. 5 days on still no funds. Unbelievable ! 
 Agent: Hi John, we appreciate your frustration and can only apologise for any inconvenience this has caused. We're currently awaiting a response from our Payments team and unfortunately we're unable to provide you with a timescale for how long this will take. Please be 1/2 
 Agent: assured once we've an update we'll be in contact with you soon as possible. Again, we can only reiterate our sincere apologies. 2/2 ^Cody 
 Customer: Cody I can send an inter EU payment in

The code below is greatly simplified, since all we're doing for the purposes of this notebook is populating some existing data to get an idea of what model monitoring looks like. WhyLabs provides a comprehensive set of tools that instrument applications for real-time monitoring.

In [None]:
from langkit import llm_metrics
import whylogs as why

why.init()
schema = llm_metrics.init()

❓ What kind of session do you want to use?
 ⤷ 1. WhyLabs. Use an api key to upload to WhyLabs.
 ⤷ 2. WhyLabs Anonymous. Upload data anonymously to WhyLabs and get a viewing url.


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...



Enter a number from the list: 2
Initializing session with config /root/.config/whylogs/config.ini

✅ Using session type: WHYLABS_ANONYMOUS
 ⤷ session id: <will be generated before upload>


In [None]:
df = pd.DataFrame({'prompt': short_subset['conversation'],
                   'response': short_subset['llama2']})
df.head()

Unnamed: 0,prompt,response
48,\nCustomer: Hi @AskPlayStation How to resolve...,Customer is complaining about the error CE-3...
49,\nCustomer: signed up for $75 package June of...,Customer is complaining that they have been ...
75,\nCustomer: my power went out the other day a...,Customer is complaining that his power went ...
83,\nCustomer: @115858 can someone explain to me...,Customer is complaining that the watchOS 4.1...
2,\nCustomer: the new update ios11 sucks. I can...,Customer complains about the new update ios1...


In [None]:
results = why.log(df, name="today", schema=schema)



✅ Aggregated 50 rows into profile today

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-1/profiles?profile=ref-WxhLQ5VqA0IvumFL&sessionToken=session-h4CwNppp


In [None]:
historical = pd.DataFrame({'prompt': long_subset['conversation'],
                           'response': long_subset['llama2']})
historical.head()

Unnamed: 0,prompt,response
104,\nCustomer: Hi guys we have signed up a trial...,Customer is complaining that they are unable...
65,\nCustomer: Spent hour trying to book with IH...,Customer is complaining about the rate of ho...
6,\nCustomer: doing the 5K plan on the run app ...,Customer complains that the app updated base...
62,"\nCustomer: Hey, why won't the Tom Petty solo...",Customer is complaining about the tom petty ...
8,\nCustomer: if you have a commercial that say...,Customer is complaining about the commercial...


In [None]:
results = why.log(historical, name="historical", schema=schema)


✅ Aggregated 50 rows into profile historical

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-1/profiles?profile=ref-O80mvH5PSgNOL7jk&sessionToken=session-h4CwNppp


Click the link to explore the WhyLabs monitoring UI. The left pane lets you select the two profiles we created above and compare them. We can imagine a scenario in which we have some historical data (the longer conversations) and a new set of data where the conversations are suddenly much shorter. Model monitoring/observability tools can bring it to our attention when things like data drift are happening so we can assess whether the models we've deployed are still performing their intended task.
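
As a rough illustration of the idea, and not of how WhyLabs detects drift, the sketch below compares the mean question length of a reference batch with that of a new batch and flags a large relative change. The lengths are the `length` values shown in the `long_subset.head()` and `short_subset.head()` outputs; `length_drift` and the 20% threshold are assumptions chosen for this example.

```python
from statistics import mean

def length_drift(reference, current, threshold=0.2):
    """Flag drift when the mean length changes by more than `threshold` (relative)."""
    ref_mean, cur_mean = mean(reference), mean(current)
    relative_change = (cur_mean - ref_mean) / ref_mean
    return relative_change, abs(relative_change) > threshold

historical_lengths = [1334, 1344, 1356, 1367, 1368]  # from long_subset.head()
today_lengths = [827, 892, 893, 896, 897]            # from short_subset.head()

change, drifted = length_drift(historical_lengths, today_lengths)
print(f'relative change: {change:+.1%}, drift flagged: {drifted}')
```

A monitoring platform tracks many such statistics per column and over time; a single mean with a fixed threshold is only the simplest version of the check.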