# Seq2Seq Decoder-Only DQ Test Notebook

In this notebook we test the dq client for **DecoderOnly** models using simulated / fake data. The main intention is to battle-test the different components of the client without training an actual model, i.e. we optimize for speed!

Things that we test:
1. Using the `watch` function to set the tokenizer, `response_template`, and `generation_config`
2. Logging data (input + target output + formatted prompt) - ensuring
   that we properly handle identifying the `response_template` / the 
   response tokens
3. Logging model outputs for 1+ epochs - ensuring we keep only the logits
   for the response tokens
4. Fake model generations - interestingly the best way to do this may be with a small validation dataset + a real LLM model. This depends a bit on design decisions around logging for generation.

NOTE: For a first pass we work with just a training dataset.

Let's get testing

In [1]:
from datasets import load_dataset, Dataset
import numpy as np

%load_ext autoreload
%autoreload 2

## Pull data from hf hub

Since part of the dq processing involves tokenizing and aligning text / token indices, we work with a small real-world dataset - rather than dummy data.

The Billsum dataset contains three columns:

<p style="text-align: center;">|| text || summary || title ||</p>

We look at just **summary** and **title** and map them as follows:
<p style="text-align: center;">(summary, title) --> (input context,  target output)</p>

For **DecoderOnly** models we need to specify a formatting function. We use a simple formatting function to create the `formatted_prompt`:
```
formatted_prompt = f"""###Input: {summary}\n\n###Response: {title}"""
```

We also use only a small subset: the first 100 rows of each split!

In [2]:
response_template = "###Response:"
def create_formatted_prompt(row, idx):
    formatted_prompt = f"""###Input: {row['summary']}\n\n###Response: {row['title']}"""
    return {"formatted_prompt": formatted_prompt, "id": idx}

In [3]:
dataset_size = 100

ds = load_dataset("billsum")
ds = ds.remove_columns('text')
# Add ids
ds = ds.map(create_formatted_prompt, with_indices=True)
# Keep only the first `dataset_size` rows of each split
ds_train = Dataset.from_dict(ds['train'][:dataset_size])
ds_val = Dataset.from_dict(ds['test'][:dataset_size])
ds_train

Found cached dataset billsum (/Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc/cache-c6e334700689d4a3.arrow
Loading cached processed dataset at /Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc/cache-f8df2c4d228909d7.arrow
Loading cached processed dataset at /Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc/cache-985b0a5345b5b56c.arrow


Dataset({
    features: ['summary', 'title', 'formatted_prompt', 'id'],
    num_rows: 100
})

## Tokenizing the Data

Tokenize the data so that, when we later fake the logging step, we log the correct number of logits per sample.

In [4]:
from transformers import AutoTokenizer, GenerationConfig, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=True)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

In [5]:
# Tokenize things
def tokenize_formatted_prompts(row):
    return tokenizer(row['formatted_prompt'])

ds_train = ds_train.map(tokenize_formatted_prompts)
ds_val = ds_val.map(tokenize_formatted_prompts)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [6]:
ds_train[0]

{'summary': "Shields a business entity from civil liability relating to any injury or death occurring at a facility of that entity in connection with a use of such facility by a nonprofit organization if: (1) the use occurs outside the scope of business of the business entity; (2) such injury or death occurs during a period that such facility is used by such organization; and (3) the business entity authorized the use of such facility by the organization. \nMakes this Act inapplicable to an injury or death that results from an act or omission of a business entity that constitutes gross negligence or intentional misconduct, including misconduct that: (1) constitutes a hate crime or a crime of violence or act of international terrorism for which the defendant has been convicted in any court; or (2) involves a sexual offense for which the defendant has been convicted in any court or misconduct for which the defendant has been found to have violated a Federal or State civil rights law. \nP

In [7]:
batch = ds_train[:10]
model_inputs = {
    'input_ids': batch['input_ids'],
    'attention_mask': batch['attention_mask'],
}
# Pad the batch to its longest sequence and convert to PyTorch tensors
model_inputs = tokenizer.pad(model_inputs, padding=True, return_tensors='pt')
# For a causal LM the labels are the input ids themselves
model_inputs['labels'] = model_inputs['input_ids'].clone()
model_outputs = model(**model_inputs)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [8]:
model_outputs.logits.shape

torch.Size([10, 839, 50272])

In [11]:
model_outputs.logits[:, :-1].take_along_dim(model_inputs['labels'][:, 1:, None], dim=-1).shape

torch.Size([10, 838, 1])
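
The slicing in the cell above pairs the output at position `t` with the label at position `t + 1`, since a causal LM predicts the next token. Below is a minimal pure-Python sketch of the same alignment on a toy sequence, with a log-softmax applied first so the gathered values are log-probabilities. The logits, the vocabulary size, and the helper names `log_softmax` and `token_logprobs` are made up for illustration and are not part of dq or transformers.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def token_logprobs(logits, labels):
    """Return log p(labels[t + 1] | tokens up to t) for t = 0..T-2.

    logits: T rows, each holding V values (one per vocabulary entry)
    labels: T token ids (the input ids themselves for a causal LM)
    """
    # Drop the last logits row (nothing left to predict) and the first label
    # (nothing predicts it), mirroring logits[:, :-1] and labels[:, 1:].
    return [log_softmax(row)[label] for row, label in zip(logits[:-1], labels[1:])]

# Toy example: 4 positions, vocabulary of 3 tokens (made-up numbers)
toy_logits = [
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
    [1.0, 1.0, 1.0],
]
toy_labels = [1, 0, 1, 2]
toy_result = token_logprobs(toy_logits, toy_labels)
print(len(toy_result))  # one fewer than the sequence length
```

The result has one entry fewer than the input, which matches the `838` vs. `839` in the shapes above.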

In [60]:
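import os
from getpass import getpass

os.environ['GALILEO_CONSOLE_URL'] = "https://console.dev.rungalileo.io"
os.environ["GALILEO_USERNAME"] = "galileo@rungalileo.io"
# Prompt for the password instead of hard-coding a credential in the notebook
if "GALILEO_PASSWORD" not in os.environ:
    os.environ["GALILEO_PASSWORD"] = getpass("Galileo password: ")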

import dataquality as dq
from dataquality.integrations.seq2seq.hf import watch
dq.configure()



📡 https://console.dev.rungalileo.io
🔭 Logging you into Galileo

🚀 You're logged in to Galileo as galileo@rungalileo.io!


In [61]:
dq.init("seq2seq", project_name="Seq2Seq_DecoderOnly_Log_Logprobs")

temperature = 0.4
generation_config = GenerationConfig(
    max_new_tokens=15,
    # Whether we use multinomial sampling
    do_sample=temperature >= 1e-5,
    temperature=temperature,
)

watch(
    model,
    tokenizer,
    generation_config,
    generation_splits=[],
    max_input_tokens=1024,
    response_template=response_template
)

✨ Initializing existing public project 'Seq2Seq_DecoderOnly_Log_Logprobs'
🏃‍♂️ Creating new run '2023-11-13_9'
🛰 Connected to existing project 'Seq2Seq_DecoderOnly_Log_Logprobs', and new run '2023-11-13_9'.




In [62]:
def log_dataset(ds, input_col="summary", target_col="title", formatted_prompt="formatted_prompt"):
    dq.log_dataset(
        ds,
        text=input_col,
        label=target_col,
        formatted_prompt=formatted_prompt,
        split="training"
    )

# Log just for training
log_dataset(ds_train)

Logging 100 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 

## Logging Model Outputs
Log 1 epoch of model output data. With `log_logprobs = True` we log the per-token log-probabilities of the label tokens; otherwise we log the raw logits.
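
Two details of the logging loop are easy to get wrong: a final batch may hold fewer than `batch_size` rows, and the shifted log-probabilities have one entry fewer than the input. Here is a small pure-Python sketch of both. The helper names `iter_batches` and `full_length_logprobs` are hypothetical and not part of dq; using `0.0` as the placeholder for the first position mirrors the `torch.zeros` buffer in the cell below.

```python
def iter_batches(num_rows, batch_size):
    """Yield (start, stop) index pairs that cover num_rows in order."""
    for start in range(0, num_rows, batch_size):
        # Clamp the stop index so the last batch may be shorter
        yield start, min(start + batch_size, num_rows)

def full_length_logprobs(shifted):
    """Prepend 0.0 so the row has one value per input token.

    `shifted` holds T - 1 values (one per predicted token); position 0 has
    no prediction, so it gets a placeholder of 0.0.
    """
    return [0.0] + list(shifted)

batches = list(iter_batches(num_rows=25, batch_size=10))
print(batches)  # the last pair covers only 5 rows

row = full_length_logprobs([-0.5, -1.2, -0.1])
print(len(row))  # one value per input token
```

Sizing any per-batch buffer from `stop - start` rather than from `batch_size` keeps the shapes consistent when the dataset size is not a multiple of the batch size.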

In [63]:
import torch

batch_size = 10

# If True, log per-token log-probabilities; otherwise log the full logits
log_logprobs = True

@torch.no_grad()
def log_epoch(ds):
    max_seq_length = max(len(ids) for ids in ds['input_ids'])
    print("max seq len", max_seq_length)
    for i in range(0, len(ds), batch_size):
        print(f"Processing batch {i // batch_size}")
        batch = ds[i: i + batch_size]
        batch_ids = batch['id']
        model_inputs = {
            'input_ids': batch['input_ids'],
            'attention_mask': batch['attention_mask'],
        }
        model_inputs = tokenizer.pad(model_inputs, padding=True, return_tensors='pt')
        model_inputs['labels'] = model_inputs['input_ids'].clone()
        print("Model is working...")
        model_outputs = model(**model_inputs)
        print("DONE!")
        print()

        if log_logprobs:
            # Size the buffer from the actual batch: the last batch may be shorter than batch_size
            logprobs = torch.zeros(model_outputs.logits.shape[:2])
            model_logprobs = torch.nn.functional.log_softmax(model_outputs.logits, dim=-1)
            # Logits at position t predict token t + 1, so shift the labels by one
            extracted_logprobs = model_logprobs[:, :-1].take_along_dim(model_inputs['labels'][:, 1:, None], dim=-1).squeeze(-1)
            # Position 0 has no prediction and keeps its placeholder value of 0
            logprobs[:, 1:] = extracted_logprobs
            dq.log_model_outputs(
                probs=logprobs,
                ids=batch_ids
            )
        else:
            dq.log_model_outputs(
                logits=model_outputs.logits,
                ids=batch_ids
            )

dq.set_epoch(0)
dq.set_split("train")
log_epoch(ds_train)

max seq len 839
Processing batch 0
Model is working...
DONE!

Processing batch 1
Model is working...
DONE!

Processing batch 2
Model is working...
DONE!

Processing batch 3
Model is working...
DONE!

Processing batch 4
Model is working...
DONE!

Processing batch 5
Model is working...
DONE!

Processing batch 6
Model is working...
DONE!

Processing batch 7
Model is working...
DONE!

Processing batch 8
Model is working...
DONE!

Processing batch 9
Model is working...
DONE!



In [64]:
dq.finish()

☁️ Uploading Data
CuML libraries not found, running standard process. For faster Galileo processing, consider installing
`pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/`


training:   0%|          | 0/1 [00:00<?, ?it/s]

Skipping generation for split training


training (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/342k [00:00<?, ?B/s]

Job default successfully submitted. Results will be available soon at https://console.dev.rungalileo.io/insights/4551e611-011a-4004-a8ba-b3113980f2ca/2f2fdd25-68ae-4be0-be94-ef62590a7345?split=training&taskType=8
Waiting for job (you can safely close this window)...
	Downloading all embedding files for this run
	Finding semantic clusters for training
Done! Job finished with status completed
🧹 Cleaning up
🧹 Cleaning up


'https://console.dev.rungalileo.io/insights/4551e611-011a-4004-a8ba-b3113980f2ca/2f2fdd25-68ae-4be0-be94-ef62590a7345?split=training&taskType=8'

In [67]:
model.generate(torch.tensor([[2,
  48134,
  48214,
  35,
  18573,
  10,
  265,
  10014,
]]))



tensor([[    2, 48134, 48214,    35, 18573,    10,   265, 10014,    19,    10,
           923,     9,   321,     4,   245,    50,   540,     4,  1437,  1437]])

In [83]:
# Create a little fake sample for generation with decoder only models
sample = "Copy the following text - this is a test.###Response: this is a test."
response_template = "###Response:"
# Hard-coded token ids expected for the response " this is a test."
response_tokens = [42, 16, 10, 1296, 4]

inputs = tokenizer(sample, return_tensors="pt")['input_ids']
t_response_template = tokenizer(response_template, add_special_tokens=False)['input_ids']

# Drop the response tokens so only the prompt (ending in the response template) remains
input_prompt = inputs[:, :-len(response_tokens)]
input_prompt

tensor([[    2, 48233,     5,   511,  2788,   111,    42,    16,    10,  1296,
             4, 48134, 47806,    35]])

In [84]:
gen_tokens = model.generate(input_prompt)

In [96]:
outputs = model(input_ids=gen_tokens, labels=gen_tokens.clone())

In [97]:
outputs.logits.shape

torch.Size([1, 20, 50272])

In [91]:
generated_tokens = gen_tokens[0, input_prompt.shape[1]:]
generated_tokens

tensor([50118, 50118,   133,   511,  2788,   111])

In [92]:
tokenizer.decode(generated_tokens)

'\n\nThe following text -'

In [72]:
tokenizer(response_template, add_special_tokens=False)

{'input_ids': [48134, 47806, 35], 'attention_mask': [1, 1, 1]}
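
The template ids above can be used to find where the response starts, rather than slicing by a hard-coded response length. This is a minimal pure-Python sketch of that lookup and may differ from how dq does it internally. The helper names `find_subsequence` and `split_on_template` are hypothetical; the ids are copied from the outputs in this notebook, and the trailing response ids are assumed to match the hard-coded `response_tokens`.

```python
def find_subsequence(haystack, needle):
    """Return the start index of the first occurrence of needle, or -1."""
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return -1

def split_on_template(token_ids, template_ids):
    """Split token ids into (prompt including the template, response)."""
    start = find_subsequence(token_ids, template_ids)
    if start == -1:
        raise ValueError("response template not found in token ids")
    cut = start + len(template_ids)
    return token_ids[:cut], token_ids[cut:]

# Ids copied from the outputs above; the five response ids are the notebook's
# hard-coded `response_tokens` (assumed to match the tokenizer output).
template_ids = [48134, 47806, 35]
prompt_ids = [2, 48233, 5, 511, 2788, 111, 42, 16, 10, 1296, 4, 48134, 47806, 35]
response_ids = [42, 16, 10, 1296, 4]

prompt_part, response_part = split_on_template(prompt_ids + response_ids, template_ids)
print(len(prompt_part), response_part)
```

The sketch matches the first occurrence of the template. If the input text itself could contain the template string, searching from the end of the sequence may be the safer choice.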