# Seq2Seq DQ Test Notebook

In this notebook we test the dq client for Seq2Seq using simulated (fake) data. The main goal is to battle-test the different components of the client without training an actual model, i.e. we optimize for speed!

Things that we want to test:
1. Setting the tokenizer
2. Logging data (input + target outputs)
3. Logging model outputs for 1+ epochs
4. Fake model generations - interestingly, the best way to do this may be with a small validation dataset + a real LLM. This depends a bit on design decisions around logging for generation.

NOTE: For a first pass we work with just a training dataset

Let's get testing

In [1]:
# from transformers import T5Tokenizer, T5ForConditionalGeneration
from datasets import load_dataset, Dataset
import numpy as np
# import torch

%load_ext autoreload
%autoreload 2

## Pull data from hf hub

Since part of the dq processing involves tokenizing and aligning text / token indices, we work with a small real-world dataset - rather than dummy data.

The Billsum dataset contains three columns:

<p style="text-align: center;">|| text || summary || title ||</p>

We look at just **summary** and **title** and map them as follows:
<p style="text-align: center;">(summary, title) --> (input context,  target output)</p>

We also use only a small subset: the first 100 rows of each split.
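
The column mapping and row selection described above can be sketched without downloading anything. The snippet below is a minimal illustration on made-up rows; the helpers `add_ids` and `first_n_columns` are hypothetical names that only mimic what dropping a column, mapping with indices, and slicing do, and they are not part of the `datasets` library.

```python
# Toy rows standing in for Billsum records (contents are made up).
rows = [
    {"text": "full bill text 0", "summary": "summary 0", "title": "title 0"},
    {"text": "full bill text 1", "summary": "summary 1", "title": "title 1"},
    {"text": "full bill text 2", "summary": "summary 2", "title": "title 2"},
]


def add_ids(rows, drop=("text",)):
    """Drop unused columns and attach a running integer id to each row."""
    return [
        {**{k: v for k, v in row.items() if k not in drop}, "id": idx}
        for idx, row in enumerate(rows)
    ]


def first_n_columns(rows, n):
    """Return the first n rows as a dict of columns (column name -> list)."""
    subset = rows[:n]
    return {key: [row[key] for row in subset] for key in subset[0]}


subset = first_n_columns(add_ids(rows), n=2)
# (summary, title) --> (input context, target output)
pairs = list(zip(subset["summary"], subset["title"]))
print(pairs)
```

Slicing a `datasets.Dataset` also returns a plain dict of columns rather than a new dataset, which is why the cell below wraps each slice in `Dataset.from_dict`.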

In [15]:
dataset_size = 100

ds = load_dataset("billsum")
ds = ds.remove_columns('text')
# Add ids
ds = ds.map(lambda _, idx: {"id": idx}, with_indices=True)
# Keep only the first `dataset_size` rows of each split
ds_train = Dataset.from_dict(ds['train'][:dataset_size])
ds_val = Dataset.from_dict(ds['test'][:dataset_size])
ds_train

Found cached dataset billsum (/Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc/cache-8163760ca7c203c4.arrow
Loading cached processed dataset at /Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc/cache-6f832c3394bf0964.arrow
Loading cached processed dataset at /Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc/cache-fa696985d54ba920.arrow


Dataset({
    features: ['summary', 'title', 'id'],
    num_rows: 100
})

In [3]:
ds_train[0]

{'summary': "Shields a business entity from civil liability relating to any injury or death occurring at a facility of that entity in connection with a use of such facility by a nonprofit organization if: (1) the use occurs outside the scope of business of the business entity; (2) such injury or death occurs during a period that such facility is used by such organization; and (3) the business entity authorized the use of such facility by the organization. \nMakes this Act inapplicable to an injury or death that results from an act or omission of a business entity that constitutes gross negligence or intentional misconduct, including misconduct that: (1) constitutes a hate crime or a crime of violence or act of international terrorism for which the defendant has been convicted in any court; or (2) involves a sexual offense for which the defendant has been convicted in any court or misconduct for which the defendant has been found to have violated a Federal or State civil rights law. \nP

## Logging Data

1. Before logging the input data, log the tokenizer (making sure we use the fast tokenizer)
2. Log the input and target output data
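
A likely reason for insisting on the fast tokenizer is that fast tokenizers in `transformers` can return per-token character offsets (`return_offsets_mapping=True`), which is the information needed to align characters with tokens. That dq relies on exactly this is an assumption on our side. The sketch below shows the idea with a toy whitespace tokenizer; `tokenize_with_offsets` and `char_to_token` are hypothetical helpers, not dq or `transformers` functions.

```python
import re


def tokenize_with_offsets(text):
    """Split on whitespace and return (token, (start, end)) pairs."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


def char_to_token(text):
    """Map every character index to its token index (None for whitespace)."""
    mapping = [None] * len(text)
    for tok_idx, (_, (start, end)) in enumerate(tokenize_with_offsets(text)):
        for char_idx in range(start, end):
            mapping[char_idx] = tok_idx
    return mapping


title = "Small Business Act"  # made-up target string
tokens = tokenize_with_offsets(title)
mapping = char_to_token(title)
print(tokens)
print(mapping)
```

With offsets available, a per-token value can be projected back onto the exact character span of the original text, which is what makes highlighting individual tokens in the raw string possible.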

In [16]:
from transformers import AutoTokenizer, GenerationConfig, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=True)
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Tokenize the target outputs (titles) into label ids
def tokenize_outputs(row):
    label_ids = tokenizer(row['title'])['input_ids']
    return {'labels': label_ids}

ds_train = ds_train.map(tokenize_outputs)
ds_val = ds_val.map(tokenize_outputs)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [17]:
ds_train[0]

{'summary': "Shields a business entity from civil liability relating to any injury or death occurring at a facility of that entity in connection with a use of such facility by a nonprofit organization if: (1) the use occurs outside the scope of business of the business entity; (2) such injury or death occurs during a period that such facility is used by such organization; and (3) the business entity authorized the use of such facility by the organization. \nMakes this Act inapplicable to an injury or death that results from an act or omission of a business entity that constitutes gross negligence or intentional misconduct, including misconduct that: (1) constitutes a hate crime or a crime of violence or act of international terrorism for which the defendant has been convicted in any court; or (2) involves a sexual offense for which the defendant has been convicted in any court or misconduct for which the defendant has been found to have violated a Federal or State civil rights law. \nP

In [30]:
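import os
from getpass import getpass

os.environ['GALILEO_CONSOLE_URL']="https://console.dev.rungalileo.io"
os.environ["GALILEO_USERNAME"]="galileo@rungalileo.io"
# Prompt for the password instead of hardcoding credentials in the notebook
os.environ["GALILEO_PASSWORD"]=getpass("Galileo password: ")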

import dataquality as dq
from dataquality.integrations.seq2seq.hf import watch
dq.configure()
dq.init("seq2seq")

temperature = 0.4
generation_config = GenerationConfig(
    max_new_tokens=15,
    # Whether we use multinomial sampling
    do_sample=temperature >= 1e-5,
    temperature=temperature,
)

watch(
    model,
    tokenizer,
    generation_config,
    generate_training_data=True
)



📡 https://console.dev.rungalileo.io
🔭 Logging you into Galileo

🚀 You're logged in to Galileo as galileo@rungalileo.io!
✨ Initializing new public project 'wee_ivory_starfish_82a97'
🏃‍♂️ Creating new run '2023-09-27_1'
🛰 Connected to new project 'wee_ivory_starfish_82a97', and new run '2023-09-27_1'.
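
The generation settings above switch between greedy decoding and sampling based on the temperature. The snippet below is a minimal sketch of that idea in plain Python, under the assumption that sampling divides the logits by the temperature before the softmax; it is a toy illustration and not how `transformers` implements generation. The names `softmax_with_temperature` and `pick_token` are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, rng, eps=1e-5):
    """Greedy argmax when temperature < eps, otherwise multinomial sampling."""
    if temperature < eps:
        return max(range(len(logits)), key=logits.__getitem__)
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
rng = random.Random(0)
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.4))  # sharper than temperature 1.0
print(pick_token(logits, 0.0, rng))           # falls back to greedy
```

A temperature of 0.4 concentrates probability mass on the highest-scoring token, so sampled generations stay close to the greedy output while still varying between runs.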




In [31]:
def log_dataset(ds, input_col="summary", target_col="title"):
    dq.log_dataset(
        ds,
        text=input_col,
        label=target_col,
        split="training"
    )

# Log just for training
log_dataset(ds_train)

Aligning characters with tokens:   0%|          | 0/100 [00:00<?, ?it/s]

Logging 100 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 

## Logging Model Outputs
Log 1 epoch of fake model output data. The fake outputs include just logits!
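
It helps to know what constant fake logits imply downstream. The worked example below is a small sketch in plain Python with a tiny stand-in vocabulary: when every logit is equal, the softmax is uniform and the per-token cross-entropy equals `log(vocab_size)` regardless of the target. The helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability of the target token."""
    return -math.log(softmax(logits)[target_index])


vocab_size = 8  # tiny stand-in for len(tokenizer)
fake_logits = [1.0] * vocab_size
probs = softmax(fake_logits)
loss = cross_entropy(fake_logits, target_index=3)
print(probs[0], 1 / vocab_size)
print(loss, math.log(vocab_size))
```

So any probability-based value computed from these logits will be the same for every token; the run exercises the logging and processing plumbing, not the quality of the computed values.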

In [32]:
num_logits = len(tokenizer)

def log_epoch(ds, batch_size=100):
    ids = ds['id']
    # Longest tokenized target in the dataset; all fake logits use this length
    max_seq_length = max(len(label_ids) for label_ids in ds['labels'])
    print("max seq len", max_seq_length)
    for i in range(0, len(ids), batch_size):
        print(i)
        batch_ids = ids[i: i + batch_size]
        # Shape - [bs, max_seq_len, num_logits]
        fake_logits = np.ones((len(batch_ids), max_seq_length, num_logits))
        dq.log_model_outputs(
            logits=fake_logits,
            ids=batch_ids
        )

dq.set_epoch(0)
dq.set_split("train")
log_epoch(ds_train)

max seq len 111
0
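
Since the goal here is speed, it is worth estimating how large one fake logits batch is. The sketch below does the arithmetic for the batch of 100 samples and the `max seq len 111` printed above, with `np.ones` defaulting to 8-byte floats. The vocabulary size of 32,100 is an assumption about `len(tokenizer)` for `t5-small`, and both helper functions are hypothetical.

```python
def logits_nbytes(batch_size, seq_len, vocab_size, bytes_per_value=8):
    """Bytes needed for a dense [batch_size, seq_len, vocab_size] array."""
    return batch_size * seq_len * vocab_size * bytes_per_value


def batch_slices(n_items, batch_size):
    """(start, stop) index pairs covering n_items in chunks of batch_size."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]


VOCAB_SIZE = 32_100  # assumption: approximate len(tokenizer) for t5-small

full = logits_nbytes(100, 111, VOCAB_SIZE)                      # one batch of 100, float64
small = logits_nbytes(10, 111, VOCAB_SIZE, bytes_per_value=4)   # batch of 10, float32
print(f"{full / 1e9:.2f} GB per batch")
print(f"{small / 1e9:.2f} GB per batch")
print(batch_slices(100, 10)[:3])
```

Under these assumptions a single batch of 100 takes a few gigabytes of memory, so a smaller batch size or a 4-byte dtype may be a cheap way to keep the test fast on a laptop.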


In [33]:
dq.finish()

☁️ Uploading Data
CuML libraries not found, running standard process. For faster Galileo processing, consider installing
`pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/`


training:   0%|          | 0/1 [00:00<?, ?it/s]

Creating Register Function
Doing sample 0 out of 1
=====


Aligning characters with tokens:   0%|          | 0/1 [00:00<?, ?it/s]

Doing sample 0 out of 100
=====


Aligning characters with tokens:   0%|          | 0/1 [00:00<?, ?it/s]

...

Doing sample 99 out of 100
=====


Aligning characters with tokens:   0%|          | 0/1 [00:00<?, ?it/s]

About to Flatten!
Flattened Vaex DF
Joining
Seperating
Done
About to Upload


training (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

About to upload: emb


Uploading data to Galileo:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

About to upload: prob


Uploading data to Galileo:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

About to upload: data


Uploading data to Galileo:   0%|          | 0.00/620k [00:00<?, ?B/s]

done with it all
Job default successfully submitted. Results will be available soon at https://console.dev.rungalileo.io/insights?projectId=8e34b273-d6bd-4da3-a798-21ff162f0351&runId=40b94aa7-97c9-4bc6-8105-16e9237dba13&taskType=8&split=training
Waiting for job (you can safely close this window)...
	Downloading all embedding files for this run
	Uploading processed training data
Done! Job finished with status completed
🧹 Cleaning up
🧹 Cleaning up


'https://console.dev.rungalileo.io/insights?projectId=8e34b273-d6bd-4da3-a798-21ff162f0351&runId=40b94aa7-97c9-4bc6-8105-16e9237dba13&taskType=8&split=training'