# Seq2Seq Decoder-Only DQ Test Notebook

In this notebook we test the dq client for **DecoderOnly** models using simulated / fake data. The main intention is to battle-test the different components of the client without training an actual model, i.e. optimizing for speed!

Things that we test:
1. Using the `watch` function to set the tokenizer, `response_template`, and `generation_config`
2. Logging data (input + target output + formatted prompt), ensuring
   that we correctly identify the `response_template` and the
   response tokens
3. Logging model outputs for 1+ epochs, ensuring we keep only the logits
   for the response tokens
4. Fake model generations. Interestingly, the best way to do this may be with a small validation dataset and a real LLM. This depends a bit on design decisions around logging for generation.
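
To make item 2 concrete, here is a minimal sketch of one way to locate the response tokens: search the tokenized prompt for the token ids of the `response_template` and treat everything after the match as the response. The helper name `find_response_start` and the toy ids are hypothetical, and choosing the last match is an assumption of this sketch, not necessarily what the dq client does internally.

```python
from typing import List, Optional


def find_response_start(input_ids: List[int], template_ids: List[int]) -> Optional[int]:
    """Return the index of the first token after the last occurrence of
    template_ids in input_ids, or None if the template is not found."""
    n, m = len(input_ids), len(template_ids)
    if m == 0 or m > n:
        return None
    # Scan right to left so that a template-like span inside the input
    # context is skipped in favor of the final one (an assumption of this sketch).
    for start in range(n - m, -1, -1):
        if input_ids[start:start + m] == template_ids:
            return start + m
    return None


# Toy ids (hypothetical, not real tokenizer output)
template_ids = [7, 8, 9]
input_ids = [1, 2, 3, 7, 8, 9, 4, 5]

response_start = find_response_start(input_ids, template_ids)
response_ids = input_ids[response_start:]
```

The `None` branch matters in practice: a template tokenized on its own may not produce the same ids as the same text tokenized in the middle of a prompt, in which case no match is found.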

NOTE: For a first pass we work with just a training dataset

Let's get testing

In [None]:
from datasets import load_dataset, Dataset
import numpy as np

%load_ext autoreload
%autoreload 2

## Pull Data from the HF Hub

Since part of the dq processing involves tokenizing and aligning text / token indices, we work with a small real-world dataset rather than dummy data.

The Billsum dataset contains three columns:

<p style="text-align: center;">|| text || summary || title ||</p>

We look at just **summary** and **title** and map them as follows:
<p style="text-align: center;">(summary, title) --> (input context,  target output)</p>

For **DecoderOnly** models we need to specify a formatting function. We use a simple formatting function to create the `formatted_prompt`:
```
formatted_prompt = f"""###Input: {summary}\n\n###Response: {title}"""
```

We also keep only a small subset: the first `dataset_size` (10) rows of each split!

In [None]:
response_template = "###Response:"
def create_formatted_prompt(row, idx):
    formatted_prompt = f"""###Input: {row['summary']}\n\n###Response: {row['title']}"""
    return {"formatted_prompt": formatted_prompt, "id": idx}

In [None]:
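dataset_size = 10

ds = load_dataset("billsum")
ds = ds.remove_columns("text")
# Subset first so the map below only touches `dataset_size` rows per split
ds_train = ds["train"].select(range(dataset_size))
ds_val = ds["test"].select(range(dataset_size))
# Add the formatted prompt and an integer id (the row index)
ds_train = ds_train.map(create_formatted_prompt, with_indices=True)
ds_val = ds_val.map(create_formatted_prompt, with_indices=True)
ds_train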

## Tokenizing the Data

Tokenize the data for use later when faking our logging, i.e. to make sure we log the correct number of logits.
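
As a rough illustration of what "the correct number of logits" means for a batch: a causal LM returns one row of logits per position, padding included, so the sequence dimension of a batch equals the length of its longest sample. The sketch below pads toy token ids by hand; `pad_batch` is a hypothetical helper and right-padding is an assumption made here for simplicity.

```python
from typing import List, Tuple


def pad_batch(batch_ids: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return input_ids, attention_mask


# Toy token ids (hypothetical, not real tokenizer output)
batch_ids = [[5, 6, 7], [5, 6], [5, 6, 7, 8, 9]]
input_ids, attention_mask = pad_batch(batch_ids, pad_id=0)

# The logits for this batch would have shape (3, 5, vocab_size)
batch_shape = (len(input_ids), len(input_ids[0]))
# Number of non-padding tokens per sample, recovered from the mask
real_token_counts = [sum(mask) for mask in attention_mask]
```

Keeping the per-sample token counts around makes it possible to tell real positions from padding when the logits are later trimmed.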

In [None]:
from transformers import AutoTokenizer, GenerationConfig, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=True)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

In [None]:
# Tokenize things
def tokenize_formatted_prompts(row):
    return tokenizer(row['formatted_prompt'])

ds_train = ds_train.map(tokenize_formatted_prompts)
ds_val = ds_val.map(tokenize_formatted_prompts)

In [None]:
ds_train[0]

In [None]:
import os
# Fill in the Galileo console URL and credentials before running
os.environ["GALILEO_CONSOLE_URL"] = ""
os.environ["GALILEO_USERNAME"] = ""
os.environ["GALILEO_PASSWORD"] = ""

import dataquality as dq
from dataquality.integrations.seq2seq.core import watch
dq.configure()

In [None]:
dq.init("seq2seq", project_name="Seq2Seq_DecoderOnly_Generation")

temperature = 0.
generation_config = GenerationConfig(
    max_new_tokens=15,
    # Whether we use multinomial sampling
    do_sample=temperature >= 1e-5,
    temperature=temperature,
)

response_template = "###Response:"
# Tokenize the template without special tokens (e.g. BOS) so that its ids
# can be matched inside the tokenized prompts
response_template_ids = tokenizer(response_template, add_special_tokens=False)["input_ids"]

watch(
    tokenizer,
    "decoder_only",
    model,
    generation_config,
    generation_splits=[],
    max_input_tokens=1024,
    response_template=response_template_ids
)

In [None]:
def log_dataset(ds, input_col="summary", target_col="title", formatted_prompt="formatted_prompt"):
    dq.log_dataset(
        ds,
        text=input_col,
        label=target_col,
        formatted_prompt=formatted_prompt,
        split="training"
    )

# Log just for training
log_dataset(ds_train)

## Logging Model Outputs
Log 1 epoch of model output data from a plain forward pass (no training): this includes just the logits!
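
For intuition about keeping only the response-token logits, the sketch below slices one sample's logits by hand. It relies on the general causal-LM convention that the logits at position `t` score the token at position `t + 1`. The helper `response_logits`, the toy shapes, and the right-padding assumption are all hypothetical; the notebook does not state how the dq client performs this step internally.

```python
import numpy as np


def response_logits(logits: np.ndarray, response_start: int, seq_len: int) -> np.ndarray:
    """Slice one sample's logits down to the rows that predict response tokens.

    logits: array of shape (padded_len, vocab_size), assuming right-padding.
    response_start: index of the first response token in the sequence.
    seq_len: number of non-padding tokens in the sequence.
    """
    if not 1 <= response_start <= seq_len:
        raise ValueError("response_start must lie within the unpadded sequence")
    # Row t predicts token t + 1, so shift the window one position left.
    return logits[response_start - 1:seq_len - 1]


# Toy logits: 8 padded positions, vocabulary of 5 (hypothetical values)
padded_len, vocab_size = 8, 5
logits = np.arange(padded_len * vocab_size, dtype=np.float32).reshape(padded_len, vocab_size)

# 6 real tokens, with the response occupying positions 4 and 5
sliced = response_logits(logits, response_start=4, seq_len=6)
```

Here `sliced` has one row per response token: the rows at positions 3 and 4, which predict the tokens at positions 4 and 5.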

In [None]:
batch_size = 10

def log_epoch(ds):
    max_seq_length = np.max([len(ids) for ids in ds['input_ids']])
    print("max seq len", max_seq_length)
    for i in range(0, len(ds), batch_size):
        print(f"Processing batch {i // batch_size}")
        # Slicing a Dataset returns a dict of columns for rows i .. i + batch_size - 1
        batch = ds[i: i + batch_size]
        batch_ids = batch['id']
        model_inputs = {
            'input_ids': batch['input_ids'],
            'attention_mask': batch['attention_mask'],
        }
        model_inputs = tokenizer.pad(model_inputs, padding=True, return_tensors='pt')
        model_inputs['labels'] = model_inputs['input_ids'].clone()
        print("Model is working...")
        model_outputs = model(**model_inputs)
        print("DONE!")
        print()

        dq.log_model_outputs(
            logits=model_outputs.logits,
            ids=batch_ids,
        )

dq.set_epoch(0)
dq.set_split("training")  # same split name as used in log_dataset above
log_epoch(ds_train)

In [None]:
dq.finish(data_embs_col="title")