# Seq2Seq DQ Test Notebook

In this notebook we test the dq client for Seq2Seq using simulated (fake) data. The main goal is to battle-test the different components of the client without training an actual model, i.e. optimizing for speed!

Things that we want to test:
1. Setting the tokenizer
2. Logging data (input + target outputs)
3. Logging model outputs for one or more epochs
4. Fake model generations. Interestingly, the best way to do this may be with a small validation dataset plus a real LLM. This depends somewhat on design decisions around logging for generation.

NOTE: For a first pass we work with just a training dataset.

Let's get testing!

In [1]:
# from transformers import T5Tokenizer, T5ForConditionalGeneration
from datasets import load_dataset, Dataset
import numpy as np
# import torch

%load_ext autoreload
%autoreload 2

## Pull data from hf hub

Since part of the dq processing involves tokenizing and aligning text / token indices, we work with a small real-world dataset rather than dummy data.

The Billsum dataset contains three columns:

<p style="text-align: center;">|| text || summary || title ||</p>

We look at just **summary** and **title** and map them as follows:
<p style="text-align: center;">(summary, title) --> (input context,  target output)</p>

We also use only a small subset: the first 10 rows of the train and test splits.

In [112]:
dataset_size = 10  # number of rows kept from each split

ds = load_dataset("billsum")
ds = ds.remove_columns('text')
# Add an integer id to every row (indices restart at 0 within each split)
ds = ds.map(lambda _, idx: {"id": idx}, with_indices=True)
ds_train = Dataset.from_dict(ds['train'][:dataset_size])
ds_val = Dataset.from_dict(ds['test'][:dataset_size])
ds_train

Found cached dataset billsum (/Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc/cache-8163760ca7c203c4.arrow
Loading cached processed dataset at /Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc/cache-6f832c3394bf0964.arrow
Loading cached processed dataset at /Users/jonathangomesselman/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc/cache-fa696985d54ba920.arrow


Dataset({
    features: ['summary', 'title', 'id'],
    num_rows: 10
})

In [113]:
ds_train[0]

{'summary': "Shields a business entity from civil liability relating to any injury or death occurring at a facility of that entity in connection with a use of such facility by a nonprofit organization if: (1) the use occurs outside the scope of business of the business entity; (2) such injury or death occurs during a period that such facility is used by such organization; and (3) the business entity authorized the use of such facility by the organization. \nMakes this Act inapplicable to an injury or death that results from an act or omission of a business entity that constitutes gross negligence or intentional misconduct, including misconduct that: (1) constitutes a hate crime or a crime of violence or act of international terrorism for which the defendant has been convicted in any court; or (2) involves a sexual offense for which the defendant has been convicted in any court or misconduct for which the defendant has been found to have violated a Federal or State civil rights law. \nP

## Logging Data

1. Before logging input data, log the tokenizer (making sure we use the fast tokenizer)
2. Log the input and target output data

In [114]:
from transformers import AutoTokenizer, T5ForConditionalGeneration, GenerationConfig

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

# Tokenize the target text (the title column) into label ids
def tokenize_outputs(row):
    label_ids = tokenizer(row['title'])['input_ids']
    return {'labels': label_ids}

ds_train = ds_train.map(tokenize_outputs)
ds_val = ds_val.map(tokenize_outputs)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [115]:
ds_train[0]

{'summary': "Shields a business entity from civil liability relating to any injury or death occurring at a facility of that entity in connection with a use of such facility by a nonprofit organization if: (1) the use occurs outside the scope of business of the business entity; (2) such injury or death occurs during a period that such facility is used by such organization; and (3) the business entity authorized the use of such facility by the organization. \nMakes this Act inapplicable to an injury or death that results from an act or omission of a business entity that constitutes gross negligence or intentional misconduct, including misconduct that: (1) constitutes a hate crime or a crime of violence or act of international terrorism for which the defendant has been convicted in any court; or (2) involves a sexual offense for which the defendant has been convicted in any court or misconduct for which the defendant has been found to have violated a Federal or State civil rights law. \nP

In [150]:
import os
import dataquality as dq
from dataquality.integrations.seq2seq.hf import watch

# os.environ['GALILEO_CONSOLE_URL']="http://localhost:8088"
# os.environ["GALILEO_USERNAME"]="user@example.com"
# os.environ["GALILEO_PASSWORD"]="Th3secret_"
# Avoid hardcoding secrets in the notebook: GALILEO_PASSWORD is expected to be
# set in the environment before this cell runs.
os.environ["GALILEO_USERNAME"]="galileo@rungalileo.io"
dq.set_console_url("https://console.dev.rungalileo.io")

dq.configure()
dq.init("seq2seq")
dq.set_tokenizer(tokenizer)
generation_config = GenerationConfig(
    max_new_tokens=10,
)
watch(
    model,
    generation_config,
    generate_training_data=False
)



📡 https://console.dev.rungalileo.io
🔭 Logging you into Galileo

🚀 You're logged in to Galileo as galileo@rungalileo.io!
✨ Initializing new public project 'anxious_jade_squid_cca44'
🏃‍♂️ Creating new run '2023-08-29_1'
🛰 Connected to new project 'anxious_jade_squid_cca44', and new run '2023-08-29_1'.


In [151]:
def log_dataset(ds, input_col="summary", target_col="title"):
    dq.log_dataset(
        ds,
        text=input_col,
        label=target_col,
        split="training"
    )

# Log just for training
log_dataset(ds_train)

Aligning characters with tokens:   0%|          | 0/10 [00:00<?, ?it/s]

Logging 10 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 

## Logging Model Outputs
Log one epoch of fake model output data. The fake outputs contain just logits!

In [152]:
num_logits = len(tokenizer)  # vocabulary size, including added tokens


def log_epoch(ds):
    ids = ds['id']
    # Longest tokenized target in the dataset; the fake logits are padded to this length
    max_seq_length = max(len(label_ids) for label_ids in ds['labels'])
    print("len ids", len(ids))
    print("max seq len", max_seq_length)
    # Shape - [bs, max_seq_len, num_logits]
    fake_logits = np.random.randn(len(ids), max_seq_length, num_logits)
    dq.log_model_outputs(
        logits=fake_logits,
        ids=ids
    )

dq.set_epoch(0)
dq.set_split("train")
log_epoch(ds_train)

len ids 10
max seq len 37
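
Because the logits above are pure noise, any metric derived from them should look like an uninformed model. The sketch below illustrates why, under the assumption that perplexity is the exponential of the mean negative log-probability of the target tokens (the standard definition; dq's exact computation is not shown in this notebook). The helper names, the toy vocabulary size, and the label ids are hypothetical, and plain Python lists stand in for the NumPy array.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over the logits of one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_perplexity(token_logits, label_ids):
    """exp(mean negative log-prob of the labels); padded positions past the labels are ignored."""
    # zip stops at the shorter input, so only the first len(label_ids) positions are scored
    logprobs = [log_softmax(pos)[lab] for pos, lab in zip(token_logits, label_ids)]
    return math.exp(-sum(logprobs) / len(logprobs))


vocab_size = 50       # hypothetical toy vocabulary
labels = [3, 17, 42]  # hypothetical target token ids
max_seq_len = 5       # the logits are padded beyond the 3 labels

rng = random.Random(0)
fake_logits = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in range(max_seq_len)]
ppl_random = sequence_perplexity(fake_logits, labels)

# All-equal logits give a uniform distribution, so perplexity equals the vocabulary size
uniform_logits = [[0.0] * vocab_size for _ in range(max_seq_len)]
ppl_uniform = sequence_perplexity(uniform_logits, labels)

print(ppl_random, ppl_uniform)
```

With random logits the perplexity lands on the order of the vocabulary size, so very large perplexity values for this run are expected rather than a sign of a logging bug.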


In [153]:
dq.finish()

☁️ Uploading Data
CuML libraries not found, running standard process. For faster Galileo processing, consider installing
`pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/`


training:   0%|          | 0/1 [00:00<?, ?it/s]

Skipping generation for split training


training (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/41.5k [00:00<?, ?B/s]

Job default successfully submitted. Results will be available soon at https://console.dev.rungalileo.io/insights?projectId=397a6d6e-4bc0-46ed-87cf-74d9f9082fac&runId=f633722e-2bc7-4545-aadc-de92291e3921&taskType=8&split=training
Waiting for job (you can safely close this window)...
Done! Job finished with status completed
🧹 Cleaning up
🧹 Cleaning up


'https://console.dev.rungalileo.io/insights?projectId=397a6d6e-4bc0-46ed-87cf-74d9f9082fac&runId=f633722e-2bc7-4545-aadc-de92291e3921&taskType=8&split=training'

In [144]:
import vaex

df = vaex.open("/Users/jonathangomesselman/Downloads/data (2).arrow")

In [145]:
df

#,token_deps,data_error_potential,perplexity,token_logprobs,top_logprobs,input,target,split,token_label_positions,token_label_offsets,generated_output,generated_token_label_positions,generated_token_label_offsets,generated_token_logprobs,generated_top_logprobs,id,bleu,rouge
0,"'[0.5005900859832764, 0.5010707378387451, 0.5003...",0.500567,57352.1,"'[-9.821720123291016, -12.884685516357422, -11.4...","""[[('Explo', -6.697160720825195), ('Brent', -7.1...",'Shields a business entity from civil liability ...,'A bill to limit the civil liability of business...,training,"'[[0], [], [1], [], [2], [], [3], [], [4], [], [...","'[[0, 1], [1, 2], [2, 6], [6, 7], [7, 9], [9, 10...",'<extra_id_0> if such State laws are inconsisten...,"'[[0], [], [1, 2], [2], [], [3], [], [4], [], [5...","'[[0, 12], [12, 13], [13, 14], [14, 15], [15, 16...","'[-3.5762778338721546e-07, -2.454937219619751, -...","""[[('', -16.516389846801758), ('<extra_id_3>', -...",0,0.0,0.0
1,"'[0.5005096197128296, 0.5005630850791931, 0.5005...",0.500563,50127.5,"'[-10.991921424865723, -11.315881729125977, -10....","""[[('pirate', -7.196037292480469), ('museum', -7...",'Human Rights Information Act - Requires certain...,Human Rights Information Act,training,"[[0], [], [1], [], [2], [], [3]]","'[[0, 5], [5, 6], [6, 12], [12, 13], [13, 24], [...","<extra_id_0> to review, declassify, and publicly","'[[0], [], [1], [], [2], [3], [], [4], [5], [6],...","'[[0, 12], [12, 13], [13, 15], [15, 16], [16, 22...","'[-1.6689285757820471e-06, -1.7782273292541504, ...","""[[('', -15.420294761657715), ('<extra_id_3>', -...",1,0.0,0.0
2,"'[0.5006729364395142, 0.5008552074432373, 0.5009...",0.500627,73623.3,"'[-11.187646865844727, -9.137674331665039, -11.8...","""[[('segments', -6.600510120391846), ('seamlessl...",'Jackie Robinson Commemorative Coin Act - Direct...,Jackie Robinson Commemorative Coin Act,training,"'[[0], [], [1], [], [2], [3], [4], [], [5], [], ...","'[[0, 6], [6, 7], [7, 15], [15, 16], [16, 21], [...",Jackie Robinson Commemorative Coin Act - Direct,"'[[0], [], [1], [], [2], [3], [4], [], [5], [], ...","'[[0, 6], [6, 7], [7, 15], [15, 16], [16, 21], [...","'[-0.4167225956916809, -0.0022107940167188644, -...","""[[('Jackie', -0.4167225956916809), ('<extra_id_...",2,61.4788,0.909091
3,"'[0.5006540417671204, 0.5014073848724365, 0.5003...",0.500672,81645.9,"'[-12.535238265991211, -11.838638305664062, -10....","""[[('pouvoir', -6.636413097381592), ('hari', -6....",'Amends the Internal Revenue Code to provide (te...,'To amend the Internal Revenue Code to provide t...,training,"'[[0], [], [1], [], [2], [], [3], [], [4], [], [...","'[[0, 2], [2, 3], [3, 8], [8, 9], [9, 12], [12, ...",<extra_id_0> stock.,"[[0], [], [1], [2]]","[[0, 12], [12, 13], [13, 18], [18, 19]]","'[-0.632779061794281, -2.335423469543457, -2.436...","""[[('Mod', -3.213479995727539), ('Der', -3.14257...",3,1.25675,0.0
4,"'[0.500501811504364, 0.500938892364502, 0.500492...",0.500577,32879.0,"'[-9.629667282104492, -10.671297073364258, -11.0...","""[[('arg', -7.079741477966309), ('zumindest', -6...",'Native American Energy Act - (Sec. 3) Amends th...,Native American Energy Act,training,"[[0], [], [1], [], [2], [], [3]]","'[[0, 6], [6, 7], [7, 15], [15, 16], [16, 22], [...",'<extra_id_0> to be approved if the Secretary do...,"'[[0], [], [1], [], [2], [], [3], [], [4, 5], [5...","'[[0, 12], [12, 13], [13, 15], [15, 16], [16, 18...","'[-2.8132995794294402e-05, -2.1928133964538574, ...","""[[('', -12.974101066589355), ('(', -11.31648349...",4,0.0,0.0
5,"'[0.500636100769043, 0.5006024241447449, 0.50052...",0.500606,54585.6,"'[-12.30510139465332, -12.59483814239502, -10.84...","""[[('Visiting', -7.1746649742126465), ('ultimate...",'Holocaust Victims Insurance Relief Act of 2001 ...,'To provide for the establishment of the Holocau...,training,"'[[0], [], [1], [], [2], [], [3], [], [4], [], [...","'[[0, 2], [2, 3], [3, 10], [10, 11], [11, 14], [...",'<extra_id_0> or<extra_id_1> a policyholder domi...,"'[[0], [], [1], [2], [], [3, 4], [], [5], [6], [...","'[[0, 12], [12, 13], [13, 15], [15, 27], [27, 28...","'[-1.7165990357170813e-05, -2.3321456909179688, ...","""[[('<extra_id_0>', -1.7165990357170813e-05), ('...",5,0.0,0.0
6,"'[0.5006597638130188, 0.5003625154495239, 0.5004...",0.500632,58794.1,"'[-9.034614562988281, -9.330867767333984, -10.11...","""[[('solid', -6.543978691101074), ('nici', -7.18...",'Amends the Elementary and Secondary Education A...,'To amend the Elementary and Secondary Education...,training,"'[[0], [], [1], [], [2], [], [3], [], [4], [], [...","'[[0, 2], [2, 3], [3, 8], [8, 9], [9, 12], [12, ...","'<extra_id_0>,<extra_id_1> of 1965 to establish ...","'[[0], [1, 2], [3], [], [4], [], [5], [], [6], [...","'[[0, 12], [12, 13], [13, 25], [25, 26], [26, 28...","'[-1.060956947185332e-05, -1.9707493782043457, -...","""[[('Am', -13.150713920593262), ('<extra_id_0>',...",6,3.64009,0.130435
7,"'[0.5005189776420593, 0.5008106827735901, 0.5004...",0.500517,69722.6,"'[-9.317082405090332, -12.136747360229492, -11.5...","""[[('Highway', -7.318516731262207), ('reason', -...",'Gallatin Land Consolidation Act of 1998 - Provi...,Gallatin Land Consolidation Act of 1998,training,"'[[0], [1], [], [2], [], [3], [4], [5], [], [6],...","'[[0, 3], [3, 8], [8, 9], [9, 13], [13, 14], [14...",<extra_id_0> (GALlatin Land Consolidation Act,"'[[0], [], [1], [2], [3], [4], [], [5], [], [6],...","'[[0, 12], [12, 13], [13, 14], [14, 15], [15, 17...","'[-2.7418100216891617e-06, -1.2507578134536743, ...","""[[('<extra_id_0>', -2.7418100216891617e-06), ('...",7,12.606,0.615385
8,"'[0.5004742741584778, 0.5004647374153137, 0.5004...",0.500543,62778.4,"'[-10.367053985595703, -12.382671356201172, -10....","""[[('fighting', -6.927959442138672), ('theater',...",'Marine Debris Act Reauthorization Amendments of...,'To reauthorize and amend the Marine Debris Rese...,training,"'[[0], [], [1, 2], [2], [3], [4], [], [5], [], [...","'[[0, 2], [2, 3], [3, 4], [4, 5], [5, 11], [11, ...",<extra_id_0> the Marine Debris Act Reauthorization,"'[[0], [], [1], [], [2], [], [3], [4], [5], [], ...","'[[0, 12], [12, 13], [13, 16], [16, 17], [17, 23...","'[-2.264974000354414e-06, -2.5343732833862305, -...","""[[('<extra_id_19>', -16.30010986328125), ('<ext...",8,10.5496,0.5
9,"'[0.50039142370224, 0.5006950497627258, 0.500831...",0.500599,49746.6,"'[-11.250298500061035, -11.582002639770508, -11....","""[[('Evil', -7.1361470222473145), ('Encyclopedia...",'Indian Needs Assessment and Program Evaluation ...,'A bill to provide for periodic Indian needs ass...,training,"'[[0], [], [1], [], [2], [], [3], [], [4], [], [...","'[[0, 1], [1, 2], [2, 6], [6, 7], [7, 9], [9, 10...",<extra_id_0> of 2001 - Directs the Secretary of,"'[[0], [], [1], [], [2], [], [3, 4], [], [5], [6...","'[[0, 12], [12, 13], [13, 15], [15, 16], [16, 20...","'[-1.6689285757820471e-06, -1.7190946340560913, ...","""[[('', -15.471539497375488), ('<extra_id_0>', -...",9,0.0,0.0
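
To help read the `token_label_offsets` and `token_label_positions` columns above, here is a minimal sketch that reproduces their shape for row 1 (`Human Rights Information Act`). It assumes the simplified case where every word is a single token, so spans coincide with alternating word and whitespace runs. This is only an illustration, not dq's alignment code: dq works from the fast tokenizer's output, and in row 7 a word like "Gallatin" is split into several spans (`[0, 3], [3, 8]`). The function names are hypothetical.

```python
import re

# Alternating runs of whitespace and non-whitespace characters
SEGMENT_RE = re.compile(r"\s+|\S+")


def segment_offsets(text):
    """Return [start, end) character spans for each word or whitespace run."""
    return [[m.start(), m.end()] for m in SEGMENT_RE.finditer(text)]


def segment_positions(text):
    """Map each span to the index of its word; whitespace spans map to an empty list."""
    positions = []
    word_idx = 0
    for m in SEGMENT_RE.finditer(text):
        if m.group().isspace():
            positions.append([])
        else:
            positions.append([word_idx])
            word_idx += 1
    return positions


title = "Human Rights Information Act"
print(segment_offsets(title))
print(segment_positions(title))
```

Under that assumption the positions match the `[[0], [], [1], [], [2], [], [3]]` value shown for row 1, and the first offsets match `[[0, 5], [5, 6], [6, 12], [12, 13], [13, 24], ...`.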
