# Chapter 8 Tutorial: Preparing experimental models for production deployment

Requirements:

1. A Hugging Face account with an API token generated.
2. An OpenAI account with an API key.
3. The Distil-GPT2 and Llama-2-7b models from the Chapter 4 tutorial, saved to your Hugging Face account.


You will also have to request access to the Llama-2 family of models if you have not already. Please visit https://huggingface.co/meta-llama/Llama-2-7b-hf and request access. Once granted, you can log in with huggingface_hub in the runtime and you will be allowed to download the Llama-2 model.

NOTE As of August 2024, an L4 or A100 GPU instance is recommended for this tutorial.

# Installation and Imports

## Run once after creating run-time

In [1]:
!pip install -q trulens_eval openai peft langkit[all] huggingface_hub pyarrow==14.0.1 fsspec==2024.6.1 bitsandbytes==0.40.2


In [2]:
## If using Colab, click the key icon on the left to add secrets
## Add an item with name = hf_login and value = your API token.

from google.colab import userdata
hf_login = userdata.get('hf_login')
!huggingface-cli login --token $hf_login

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
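
The login cell above assumes a Colab runtime. Outside Colab, one option is to fall back to an environment variable. The sketch below is only an illustration: the helper name `get_secret` and the convention of exporting the secret under the same name as the Colab secret are assumptions, not part of the original notebook.

```python
import os

def get_secret(name):
    """Return a secret from Colab if available, otherwise from the environment."""
    try:
        # This module is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab, the secret is exported as an
        # environment variable with the same name.
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"Secret {name!r} not found in the environment")
        return value
    return userdata.get(name)
```

With this helper, `hf_login = get_secret('hf_login')` would behave the same in Colab and read from the environment elsewhere.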


## Run every time you restart the session

### Imports

In [3]:
import os

import pandas as pd
import numpy as np
import torch
from tqdm import tqdm
from transformers import pipeline, AutoModelForCausalLM
from peft import PeftModel, PeftConfig

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI

from langkit import llm_metrics
import whylogs as why

from google.colab import userdata

## Set your HF username

HF_USER = "<your-username>"



In [4]:
## Set API key for OpenAI

## If using Colab, click the key icon on the left to add secrets
## Add an item with name = openai and value = your API token.

os.environ["OPENAI_API_KEY"] = userdata.get('openai')

### Load and process data

In [5]:
## Load the tweetsumm test datafile. This file is obtained
## by running:
##
##    tweetsum_datasets['test'].to_csv('tweetsumm-test.csv')
##
## in the Chapter 4 notebook. It is also located in the folder
## tutorials/chapter8/data in the tutorials Github repository.

tweetsumm_test = pd.read_csv('./tweetsumm-test.csv')
tweetsumm_test.head()

Unnamed: 0,id,text,question,answer
0,bbde6d8ec7c39c4551da1ff6024f997b,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is complaining that the watchlist is ...
1,1d1a6617ae65baa429c2232ccc908840,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is asking about the ACC to link to th...
2,9555f25de7b6c8dfb8204f56f8bc4dd0,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is complaining about the new updates ...
3,54fe18905f0a19ee163a2b452e31e07d,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is complaining about parcel service ...
4,f6cc57227f74737de08efd03782d015e,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,The customer says that he is stuck at Staines ...


In [7]:
## A couple of helper functions to parse outputs.

def split_conversation(question):
    conversation = question.split('### Conversation:')[-1]
    return conversation.split('### Summary:')[0].strip('\n')

def generate_tweetsumm(question, generator):
    response = generator(question)[0]['generated_text']
    summary = response.split('### Summary:')[-1]

    ## Remove the <END_OF_SECOND_SENTENCE> tokens
    if '<END_' in summary:
        summary = summary.split('<END_')[0]

    return summary


## Take a small sample from the test set as a demonstration.
tweet_subset = tweetsumm_test.sample(100)

## For convenience in our evaluations later, we split the conversation from
## the instruction portion of the inputs.
tweet_subset['conversation'] = tweet_subset['question'].apply(split_conversation)
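
To see what the two helpers do without loading a model, here is a small self-contained sketch. The prompt text and the `fake_generator` function are made up for illustration; `fake_generator` only mimics the list-of-dicts shape that the helper indexes into.

```python
def split_conversation(question):
    conversation = question.split('### Conversation:')[-1]
    return conversation.split('### Summary:')[0].strip('\n')

def generate_tweetsumm(question, generator):
    response = generator(question)[0]['generated_text']
    summary = response.split('### Summary:')[-1]
    if '<END_' in summary:
        summary = summary.split('<END_')[0]
    return summary

# Hypothetical prompt laid out like the 'question' column.
prompt = (
    "### Instruction:\nSummarize the conversation.\n\n"
    "### Conversation:\nCustomer: My app crashes.\nAgent: Please reinstall it.\n\n"
    "### Summary:"
)

def fake_generator(text):
    # Stand-in for a text-generation pipeline: echoes the prompt,
    # then appends a completion and an end marker.
    completion = " Customer reports a crash. Agent suggests reinstalling.<END_OF_SECOND_SENTENCE>"
    return [{'generated_text': text + completion}]

conversation = split_conversation(prompt)
summary = generate_tweetsumm(prompt, fake_generator)

print(conversation)
print(summary.strip())
```

The conversation is whatever sits between the `### Conversation:` and `### Summary:` markers, and the summary is whatever follows the last `### Summary:` marker, cut at the first `<END_` token.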

# Apply models

For this project, we will use two models created in the Chapter 4 tutorial:

> 1) A fully fine-tuned version of DistilGPT2
>
> 2) A LoRA fine-tuned version of Llama-2.


The former offers greater inference speed, while the latter shows better performance. The Chapter 4 tutorial included cells for saving these models in the following default locations on HuggingFace:

> `huggingface.co/<your-hf-username>/distilgpt2-tweetsumm-finetune`
>
> `huggingface.co/<your-hf-username>/Llama-2-7b-tweetsumm-lora`

You may also use the versions saved while building these notebooks by entering the username `sarahsor` into the above URLs.

## Distil-GPT2

In [9]:
## Create a generator with the DistilGPT2 model

generator_distilgpt2 = pipeline("text-generation",
                     model=f'{HF_USER}/distilgpt2-tweetsumm-finetune',
                     tokenizer=f'{HF_USER}/distilgpt2-tweetsumm-finetune',
                     device='cuda:0',
                     max_new_tokens=100)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [10]:
## Apply the generator to the Tweetsumm questions

tweet_subset['distilgpt2'] = tweet_subset['question'].apply(generate_tweetsumm, generator=generator_distilgpt2)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


### TruLens Coherence

In [11]:
openai_provider = OpenAI()

In [12]:
## We can measure the "coherence" of an output with a simple piece of code,
## using a built in feedback function from TruLens. Under the hood, it's
## composing a prompt to ask OpenAI's models how coherent the text is.

coherence_feedback = Feedback(
    openai_provider.coherence
).on_output()

✅ In coherence, input text will be set to __record__.main_output or `Select.RecordOutput` .


In [13]:
## Let's try one example to see what they look like

summary = tweet_subset['distilgpt2'].iloc[0]
coherence_score = coherence_feedback(summary)

print(f'Distil-GPT2 sentence:\n{summary}')
print(f'\nCoherence score:\n{coherence_score}')

Distil-GPT2 sentence:
  Customer is complaining that he is not able to get the resolution of of the games even after reinstalling the game. Agent requests to uninstall the game by uninstalling the game after the reinstalling the game. Here is the link to uninstall the game: and reinstall the game.  Finally customer asks if he has tried reinstalling the game after the reinstalling the console  for installation. Here is the screen shot. Here is the screen shot. Here error code is in

Coherence score:
0.2


In [14]:
coherence_scores = []

for summary in tqdm(tweet_subset['distilgpt2']):
    try:
        coherence_scores.append(coherence_feedback(summary))
    except Exception:
        # Occasionally, OpenAI may produce responses without valid scores.
        # We will simply ignore these here.
        pass

100%|██████████| 100/100 [01:31<00:00,  1.10it/s]


In [15]:
distilgpt2_coherence = np.mean(coherence_scores)
print(distilgpt2_coherence)

0.539
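
Because the loop silently skips failed calls, the mean may be computed over fewer than 100 scores. A minimal sketch of one way to keep track of the skipped items follows. The names `score_all` and `stub_feedback` are hypothetical; `stub_feedback` stands in for `coherence_feedback` so that the example runs without an API key.

```python
def score_all(texts, feedback_fn):
    """Score each text and return (scores, indices of texts that failed)."""
    scores, failed = [], []
    for i, text in enumerate(texts):
        try:
            scores.append(float(feedback_fn(text)))
        except Exception:
            # Record the failure instead of dropping it silently.
            failed.append(i)
    return scores, failed

def stub_feedback(text):
    # Stand-in for a real feedback function; raises on empty input
    # to mimic a response without a valid score.
    if not text:
        raise ValueError("no score returned")
    return 0.5

scores, failed = score_all(["a summary", "", "another summary"], stub_feedback)
mean_score = sum(scores) / len(scores) if scores else float("nan")

print(f"mean={mean_score}, scored={len(scores)}, failed={failed}")
```

Reporting the number of scored items next to the mean makes it easier to judge how much weight the average deserves.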


### TruLens Conciseness

In [16]:
## The same code works for "conciseness" which is another built-in feedback

conciseness_feedback = Feedback(
    openai_provider.conciseness
).on_output()

✅ In conciseness, input text will be set to __record__.main_output or `Select.RecordOutput` .


In [17]:
## Let's try one example to see what they look like

summary = tweet_subset['distilgpt2'].iloc[0]
conciseness_score = conciseness_feedback(summary)

print(f'Distil-GPT2 sentence:\n{summary}')
print(f'\nConciseness score:\n{conciseness_score}')

Distil-GPT2 sentence:
  Customer is complaining that he is not able to get the resolution of of the games even after reinstalling the game. Agent requests to uninstall the game by uninstalling the game after the reinstalling the game. Here is the link to uninstall the game: and reinstall the game.  Finally customer asks if he has tried reinstalling the game after the reinstalling the console  for installation. Here is the screen shot. Here is the screen shot. Here error code is in

Conciseness score:
0.8


In [18]:
conciseness_scores = []

for summary in tqdm(tweet_subset['distilgpt2']):
    try:
        conciseness_scores.append(conciseness_feedback(summary))
    except Exception:
        pass

100%|██████████| 100/100 [01:27<00:00,  1.14it/s]


In [19]:
distilgpt2_conciseness = np.mean(conciseness_scores)
print(distilgpt2_conciseness)

0.7180000000000001


### TruLens Custom Function

In [20]:
## Now we will extend the concept of feedback functions to create our own.
## We want to measure how well the models are doing at our specific task,
## not just generic concepts like coherence.

class CustomOpenAI(OpenAI):
    def tweetsumm_eval(self, conversation: str, summary: str) -> float:
        prompt = f'''I am going to give you a conversation between a Customer and a customer service Agent.
        Please read the conversation, then read the summary below it and judge whether the summary reasonably matches the conversation.
        Then, give the summary a score from 0 to 10, where 0 is a poor match and 10 is an excellent match.

### Conversation: {conversation}
### Summary: {summary}'''
        return self.generate_score(prompt)

In [21]:
## We can create a TruLens wrapper to run this function with OpenAI.

custom_provider = CustomOpenAI()

custom_feedback = Feedback(
    custom_provider.tweetsumm_eval, higher_is_better=True
).on_output()

✅ In tweetsumm_eval, input conversation will be set to __record__.main_output or `Select.RecordOutput` .


In [22]:
## Let's try one example to see what they look like

conversation, summary = tweet_subset['conversation'].iloc[0], tweet_subset['distilgpt2'].iloc[0]
custom_score = custom_feedback(conversation,summary)

print(f'Original Conversation:\n{conversation}\n')
print(f'Distil-GPT2 sentence:\n{summary}')
print(f'\nCustom score:\n{custom_score}')

Original Conversation:
 
Customer: Hi @AskPlayStation How to resolve this error CE-30022-7 
 Agent: Hey there. What you were doing in the console when you get this error code? 
 Customer: I had inserted TLOU Remastered. 1st time. Working now but it says cant download maps for m.p. 
 Agent: Do you see another error code? 
 Customer: Its possible it happend when i checked for the maps. Right now i am in game. Il try again later. If error what to do? 
 Agent: Let us know the error code! 
 Customer: Of course. Thanks guys. Will update if it arises again. 
 Agent: You're welcome, Happy gaming! 
 

Distil-GPT2 sentence:
  Customer is complaining that he is not able to get the resolution of of the games even after reinstalling the game. Agent requests to uninstall the game by uninstalling the game after the reinstalling the game. Here is the link to uninstall the game: and reinstall the game.  Finally customer asks if he has tried reinstalling the game after the reinstalling the console  for 

In [23]:
custom_scores = []

for _, row in tqdm(tweet_subset.iterrows(), total=len(tweet_subset)):
    try:
        custom_scores.append(custom_feedback(row['conversation'], row['distilgpt2']))
    except Exception:
        pass


Score: 10

I would rate this summary a 1 out of 10.

I would rate it a 2 out of 10.

I would rate this summary a 2 out of 10.

I would rate it a 9 out of 10.

Score: 2/10

I would rate this summary a 2 out of 10.

The summary does not accurately capture the conversation between the Customer and the Agent. It includes incorrect information about double ups in Super Mario, limited website information, and holding stock. It also lacks clarity and coherence.

I would rate this summary a 7 out of 10.

I would rate this summary a 2 out of 10.

I would rate this summary a 2 out of 10.

I would rate this summary a 6 out of 10.

The summary does not accurately capture the conversation between the Customer and the Agent. It misses important details such as the customer's complaint about booking seats online, the agent's explanation about data protection rules, and the request for the customer to provide contact information for further assistance. Additionally, the summary is confusing and does 

In [24]:
distilgpt2_custom = np.mean(custom_scores)
print(distilgpt2_custom)

0.269


In [39]:
## The custom function has a harder time parsing scores, and it doesn't seem to produce very favorable results.
## We would need to do some prompt engineering to make it work as nicely as the TruLens feedback functions.
## This really underscores why it can be nice to have this type of framework that has already implemented
## (and tested) some evaluations for us. TruLens has other capabilities as well, including a dashboard to
## compare evaluation results.
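
The printed responses above come in several phrasings ("Score: 10", "Score: 2/10", "I would rate this summary a 2 out of 10.", or no number at all). As a rough illustration of why parsing is fragile, here is a sketch of a regex-based parser for those phrasings. The helper `parse_score` is hypothetical and is not how TruLens parses scores internally.

```python
import re

# Patterns for the phrasings seen in the printed responses.
_PATTERNS = [
    re.compile(r"(\d+(?:\.\d+)?)\s*(?:/|out of)\s*10\b", re.IGNORECASE),
    re.compile(r"score:\s*(\d+(?:\.\d+)?)", re.IGNORECASE),
]

def parse_score(response, max_score=10.0):
    """Return a score normalized to [0, 1], or None if no score is found."""
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            value = float(match.group(1))
            if 0.0 <= value <= max_score:
                return value / max_score
    return None

examples = [
    "Score: 10",
    "Score: 2/10",
    "I would rate this summary a 7 out of 10.",
    "The summary does not accurately capture the conversation.",
]
parsed = [parse_score(text) for text in examples]
print(parsed)
```

Responses that contain only a written critique yield `None`, which is the case a tighter prompt (for example, asking for the score on its own line) would aim to eliminate.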

## Llama-2-7B

In [25]:
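## Next, we will run the same code for the Llama-2 model. This needs a lot of GPU
## memory, so we first free the memory used by the DistilGPT2 pipeline. Note that
## empty_cache() only releases memory that is no longer referenced, so we delete
## the pipeline object first.

del generator_distilgpt2
torch.cuda.empty_cache()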

In [26]:
## Load the base Llama-2 model in 8-bit precision to reduce GPU memory use.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=True,
    device_map={"": 0}
)
base_model.config.use_cache = False

## Attach the LoRA adapter saved in Chapter 4. Make sure the repository name
## matches the one you saved, e.g. Llama-2-7b-tweetsumm-lora.
#model = PeftModel.from_pretrained(base_model, f"{HF_USER}/Llama-2-7b-tweetsumm-lora")
model = PeftModel.from_pretrained(base_model, f"{HF_USER}/llama2-tweetsumm-finetuned-lora")

generator_llama2 = pipeline('text-generation',
                            model=model,
                            tokenizer='meta-llama/Llama-2-7b-hf',
                            max_new_tokens=100)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM'

In [27]:
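## Run each test prompt through the Llama-2 generator. Note that the pipeline's
## generated_text contains the prompt followed by the completion.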
llama_responses = []
for theprompt in tqdm(tweet_subset['question']):
    with torch.autocast("cuda"):
        llama_output = generator_llama2(theprompt)
    llama_responses.append(llama_output[0]['generated_text'])

tweet_subset['llama2'] = llama_responses

  4%|▍         | 4/100 [01:53<45:15, 28.29s/it]


KeyboardInterrupt: 

In [30]:
## Now run each of the three tests we ran above. Let's
## try one sample first:

conversation, summary = tweet_subset['conversation'].iloc[0], tweet_subset['llama2'].iloc[0]
coherence_score = coherence_feedback(summary)
conciseness_score = conciseness_feedback(summary)
custom_score = custom_feedback(conversation,summary)

print(f'Original Conversation:\n{conversation}\n')
print(f'Llama-2 sentence:\n{summary}\n')
print(f'\nCoherence score:\n{coherence_score}')
print(f'\nConciseness score:\n{conciseness_score}')
print(f'\nCustom score:\n{custom_score}')

Original Conversation:
 
Customer: Hi @AskPlayStation How to resolve this error CE-30022-7 
 Agent: Hey there. What you were doing in the console when you get this error code? 
 Customer: I had inserted TLOU Remastered. 1st time. Working now but it says cant download maps for m.p. 
 Agent: Do you see another error code? 
 Customer: Its possible it happend when i checked for the maps. Right now i am in game. Il try again later. If error what to do? 
 Agent: Let us know the error code! 
 Customer: Of course. Thanks guys. Will update if it arises again. 
 Agent: You're welcome, Happy gaming! 
 

Llama-2 sentence:
### Instruction:
Read the following conversation between a customer and a customer service agent, and then create a two sentence summary of the conversation, describing the customer's question and the agent's response.

### Conversation: 
Customer: Hi @AskPlayStation How to resolve this error CE-30022-7 
 Agent: Hey there. What you were doing in the console when you get this erro

In [31]:
## Run each of the three tests we ran above

coherence_scores = []
conciseness_scores = []
custom_scores = []

for _, row in tweet_subset.iterrows():
    try:
        coherence_scores.append(coherence_feedback(row['llama2']))
    except Exception:
        pass
    try:
        conciseness_scores.append(conciseness_feedback(row['llama2']))
    except Exception:
        pass
    try:
        custom_scores.append(custom_feedback(row['conversation'], row['llama2']))
    except Exception:
        pass


This summary captures the essence of the conversation by highlighting the customer's question about the train delay and the agent's response regarding the delays caused by incidents on the network. However, it could be improved by including a bit more detail about the customer's frustration and dissatisfaction with the service. 
I would rate this summary a 7 out of 10.


In [32]:
llama2_coherence = np.mean(coherence_scores)
print('Llama-2 coherence:', llama2_coherence)
llama2_conciseness = np.mean(conciseness_scores)
print('Llama-2 conciseness:', llama2_conciseness)
llama2_custom = np.mean(custom_scores)
print('Llama-2 custom:', llama2_custom)

Llama-2 coherence: 0.8
Llama-2 conciseness: 0.825
Llama-2 custom: 0.825
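
To compare the two models side by side, the averages printed in this notebook can be collected into one small table. The sketch below simply copies the numbers from the outputs above; rerunning the notebook will give different values because the sample is random.

```python
# Mean scores copied from the cell outputs in this notebook.
results = {
    "distilgpt2": {"coherence": 0.539, "conciseness": 0.718, "custom": 0.269},
    "llama2": {"coherence": 0.8, "conciseness": 0.825, "custom": 0.825},
}
metrics = ["coherence", "conciseness", "custom"]

print(f"{'metric':<12}{'distilgpt2':>12}{'llama2':>10}{'diff':>8}")
for metric in metrics:
    d = results["distilgpt2"][metric]
    l = results["llama2"][metric]
    print(f"{metric:<12}{d:>12.3f}{l:>10.3f}{l - d:>+8.3f}")

# Which model scores highest on each metric.
best = {m: max(results, key=lambda name: results[name][m]) for m in metrics}
print(best)
```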


# LangKit

LangKit is an open-source package from WhyLabs. In this example, we'll populate the logs with existing data so we can see some of the capabilities it offers and why it's useful.

To start, we'll get the lengths of all the questions and create a new column to sort on. This will be used to create two data subsets with different distributions.

In [33]:
tweet_subset['length'] = tweet_subset['question'].str.len()
tweetsumm_ordered = tweet_subset.sort_values(by='length')
tweetsumm_ordered.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tweet_subset['length'] = tweet_subset['question'].str.len()


Unnamed: 0,id,text,question,answer,conversation,distilgpt2,llama2,length
48,3ff89398150845b4ea11c95b66d24c07,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,The customer asking how to resolve the error a...,\nCustomer: Hi @AskPlayStation How to resolve...,Customer is complaining that he is not able ...,### Instruction:\nRead the following conversat...,827
53,803c59fcf57c37dc0027e63363efa2bd,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is inquiring about flight timing to c...,\nCustomer: may i know the flight number from...,Customer is enquiring about the flights numb...,### Instruction:\nRead the following conversat...,1075
97,301dca7b586b984346b1b14c75d370cf,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer enquirers about iOS app. Agent infor...,\nCustomer: the auto checkin on the iOS app c...,Customer is complaining about the auto check...,### Instruction:\nRead the following conversat...,1107
34,e10620de5879b5064bd9e4d558b13b78,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer asking about how is the 846 fnb - wat...,\nCustomer: morning. How is the 846 fnb-wat t...,The customer has a terrible experience thoug...,### Instruction:\nRead the following conversat...,1905


In [34]:
short_subset = tweetsumm_ordered.head(50).copy()
long_subset = tweetsumm_ordered.tail(50).copy()
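
Note that `head(50)` and `tail(50)` only give disjoint subsets when the sorted frame has at least 100 rows; with fewer rows the two slices overlap. A minimal sketch of a split that stays disjoint for any size is shown below, using plain Python lists and made-up strings in place of the `question` column.

```python
# Hypothetical stand-ins for the 'question' column.
questions = ["short", "a bit longer", "x" * 40, "medium length text", "y" * 80, "tiny"]

# Sort by length, then split at the midpoint so the halves never overlap.
ordered = sorted(questions, key=len)
half = len(ordered) // 2
short_half = ordered[:half]
long_half = ordered[len(ordered) - half:]

print([len(q) for q in short_half])
print([len(q) for q in long_half])
```

With pandas, the same idea would be `tweetsumm_ordered.iloc[:half]` and `tweetsumm_ordered.iloc[-half:]`, where `half = len(tweetsumm_ordered) // 2`.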


In [35]:
short_subset.head()

Unnamed: 0,id,text,question,answer,conversation,distilgpt2,llama2,length
48,3ff89398150845b4ea11c95b66d24c07,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,The customer asking how to resolve the error a...,\nCustomer: Hi @AskPlayStation How to resolve...,Customer is complaining that he is not able ...,### Instruction:\nRead the following conversat...,827
53,803c59fcf57c37dc0027e63363efa2bd,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is inquiring about flight timing to c...,\nCustomer: may i know the flight number from...,Customer is enquiring about the flights numb...,### Instruction:\nRead the following conversat...,1075
97,301dca7b586b984346b1b14c75d370cf,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer enquirers about iOS app. Agent infor...,\nCustomer: the auto checkin on the iOS app c...,Customer is complaining about the auto check...,### Instruction:\nRead the following conversat...,1107
34,e10620de5879b5064bd9e4d558b13b78,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer asking about how is the 846 fnb - wat...,\nCustomer: morning. How is the 846 fnb-wat t...,The customer has a terrible experience thoug...,### Instruction:\nRead the following conversat...,1905


In [36]:
long_subset.head()

Unnamed: 0,id,text,question,answer,conversation,distilgpt2,llama2,length
48,3ff89398150845b4ea11c95b66d24c07,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,The customer asking how to resolve the error a...,\nCustomer: Hi @AskPlayStation How to resolve...,Customer is complaining that he is not able ...,### Instruction:\nRead the following conversat...,827
53,803c59fcf57c37dc0027e63363efa2bd,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer is inquiring about flight timing to c...,\nCustomer: may i know the flight number from...,Customer is enquiring about the flights numb...,### Instruction:\nRead the following conversat...,1075
97,301dca7b586b984346b1b14c75d370cf,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer enquirers about iOS app. Agent infor...,\nCustomer: the auto checkin on the iOS app c...,Customer is complaining about the auto check...,### Instruction:\nRead the following conversat...,1107
34,e10620de5879b5064bd9e4d558b13b78,### Instruction:\nRead the following conversat...,### Instruction:\nRead the following conversat...,Customer asking about how is the 846 fnb - wat...,\nCustomer: morning. How is the 846 fnb-wat t...,The customer has a terrible experience thoug...,### Instruction:\nRead the following conversat...,1905


In [37]:
# shortest conversation
print(short_subset.conversation.iloc[0])

 
Customer: Hi @AskPlayStation How to resolve this error CE-30022-7 
 Agent: Hey there. What you were doing in the console when you get this error code? 
 Customer: I had inserted TLOU Remastered. 1st time. Working now but it says cant download maps for m.p. 
 Agent: Do you see another error code? 
 Customer: Its possible it happend when i checked for the maps. Right now i am in game. Il try again later. If error what to do? 
 Agent: Let us know the error code! 
 Customer: Of course. Thanks guys. Will update if it arises again. 
 Agent: You're welcome, Happy gaming! 
 


In [38]:
# longest conversation
print(long_subset.conversation.iloc[-1])

 
Customer: morning. How is the 846 fnb-wat train looking today? 
 Agent: Hi Nick, this train is currently on time between Winchester and Basingstoke. ^PN 
 Customer: We left on time. What's with the stop start crawl into Surbiton? Bet you're going to make me late, as usual 
 Customer: Ah, 12 minutes behind, so far. Marvellous. Keep purposefully running a bad service so you can cut loads out in the new timetable 
 Agent: Hi Nick, there have been two separate incidents on our network this morning which have led to delays: sorry for the delay this morning. ^PN 
 Customer: You really are terrible at your jobs. There hasn't been a day without some sort of huge delay since you took over. And I bet commuters get no void day refunds on our season tickets now, do we? 
 Customer: What would be really good is if, as well as making everyone late for work, you can make them late home tonight as well. Get them nice and stressed at the end of a day. That'd be great 
 Customer: Guard reckons we'll ev

In [39]:
## The code below is simplified, since what we are doing for this notebook is populating some existing data to
## get an idea of what model monitoring looks like. WhyLabs provides a comprehensive set of tools that instrument
## applications for real-time monitoring.

why.init()
schema = llm_metrics.init()

## Enter 2 when prompted

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


❓ What kind of session do you want to use?
 ⤷ 1. WhyLabs. Use an api key to upload to WhyLabs.
 ⤷ 2. WhyLabs Anonymous. Upload data anonymously to WhyLabs and get a viewing url.

Enter a number from the list: 2
Initializing session with config /root/.config/whylogs/config.ini

✅ Using session type: WHYLABS_ANONYMOUS
 ⤷ session id: <will be generated before upload>


In [40]:
df = pd.DataFrame({'prompt': short_subset['conversation'],
                   'response': short_subset['llama2']})
df.head()

Unnamed: 0,prompt,response
48,\nCustomer: Hi @AskPlayStation How to resolve...,### Instruction:\nRead the following conversat...
53,\nCustomer: may i know the flight number from...,### Instruction:\nRead the following conversat...
97,\nCustomer: the auto checkin on the iOS app c...,### Instruction:\nRead the following conversat...
34,\nCustomer: morning. How is the 846 fnb-wat t...,### Instruction:\nRead the following conversat...


In [41]:
results = why.log(df, name="today", schema=schema)



✅ Aggregated 4 rows into profile today

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-1/profiles?profile=ref-YRJzV4H9qknksuZv&sessionToken=session-w9VD2cay


In [42]:
historical = pd.DataFrame({'prompt': long_subset['conversation'],
                           'response': long_subset['llama2']})
historical.head()

Unnamed: 0,prompt,response
48,\nCustomer: Hi @AskPlayStation How to resolve...,### Instruction:\nRead the following conversat...
53,\nCustomer: may i know the flight number from...,### Instruction:\nRead the following conversat...
97,\nCustomer: the auto checkin on the iOS app c...,### Instruction:\nRead the following conversat...
34,\nCustomer: morning. How is the 846 fnb-wat t...,### Instruction:\nRead the following conversat...


In [43]:
results = why.log(historical, name="historical", schema=schema)


✅ Aggregated 4 rows into profile historical

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-1/profiles?profile=ref-bhG2zEQk0pO6kgbr&sessionToken=session-w9VD2cay


Click the link to explore the WhyLabs monitoring UI. The left pane lets you select the two profiles we created above and compare them. We can imagine a scenario in which we have some historical data (the longer conversations) and a new set of data where the conversations are suddenly much shorter. Model monitoring and observability tools can alert us when data drift like this occurs, so we can assess whether the models we've deployed are still performing their intended task.
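
To make the idea of drift concrete, here is a deliberately simple sketch that compares prompt lengths between a reference set and a new set. It is not what WhyLabs computes; the function `length_drift`, the threshold of 2 standard deviations, and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def length_drift(reference, current, threshold=2.0):
    """Return (shift, drifted), where shift is the distance between the mean
    lengths measured in reference standard deviations."""
    ref_lengths = [len(text) for text in reference]
    cur_lengths = [len(text) for text in current]
    ref_mean, ref_std = mean(ref_lengths), stdev(ref_lengths)
    gap = abs(mean(cur_lengths) - ref_mean)
    if ref_std == 0:
        # All reference lengths are identical: any gap counts as drift.
        shift = 0.0 if gap == 0 else float("inf")
    else:
        shift = gap / ref_std
    return shift, shift > threshold

# Made-up data: long historical prompts, much shorter new prompts.
historical_prompts = ["x" * n for n in (1800, 1900, 2000, 2100)]
todays_prompts = ["x" * n for n in (800, 900, 1000)]

shift, drifted = length_drift(historical_prompts, todays_prompts)
print(f"shift={shift:.2f} std devs, drifted={drifted}")
```

A production monitor would track many more signals than length, but the pattern is the same: summarize a reference window, summarize the current window, and flag large gaps.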