## End-to-end examples of logging data to Galileo for Text Classification, Multi-Label Text Classification (MLTC), and NER

### To understand the client and get started, see the [Dataquality Demo](./Dataquality-Client-Demo.ipynb)
### Check out the full documentation [here](https://rungalileo.gitbook.io/galileo/getting-started)
### For end-to-end notebooks that train real ML models, see [here](https://drive.google.com/drive/folders/17-cHuRzXIpWaD8rYwy69RMQr__HiAiDk?usp=sharing)

In [2]:
## Local

import os

os.environ['GALILEO_CONSOLE_URL']="http://localhost:8088"
os.environ["GALILEO_USERNAME"]="user@example.com"
os.environ["GALILEO_PASSWORD"]="Th3secret_"

In [3]:
%%time
import dataquality as dq
dq.configure()

📡 http://localhost:8088
🔭 Logging you into Galileo

🚀 You're logged in to Galileo as user@example.com!
CPU times: user 10.3 ms, sys: 5.28 ms, total: 15.6 ms
Wall time: 311 ms


In [3]:
from dataquality import Condition, AggregateFunction, Operator

dq.init("text_classification", "Important Project", "My Report Run")

conf_cond = Condition(
    agg=AggregateFunction.avg,
    metric="confidence",
    operator=Operator.lt,
    threshold=0.99,
)
dep_cond = Condition(
    agg=AggregateFunction.min,
    metric="data_error_potential",
    operator=Operator.gt,
    threshold=0.65,
)
dq.register_run_report(conditions=[conf_cond, dep_cond], emails=["echartock3@gmail.com", "elliott@rungalileo.io"])

📡 Retrieving run from existing project, Important Project
🛰 Connected to project, Important Project, and run, My Report Run.
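
The two conditions above read as "average `confidence` is less than 0.99" and "minimum `data_error_potential` is greater than 0.65". The following is a minimal sketch of how such an aggregate-then-compare check could be evaluated on a column of values. It is plain Python and does not use or reproduce dataquality's internals; `evaluate_condition`, `AGGREGATES`, and `OPERATORS` are hypothetical names, and the sample values are made up for illustration.

```python
import operator
from statistics import mean

# Hypothetical lookup tables mirroring the names used in the cell above.
AGGREGATES = {"avg": mean, "min": min, "max": max}
OPERATORS = {"lt": operator.lt, "gt": operator.gt}


def evaluate_condition(values, agg, op, threshold):
    """Aggregate a column of metric values, then compare to a threshold."""
    aggregated = AGGREGATES[agg](values)
    return OPERATORS[op](aggregated, threshold)


# Made-up metric values, for illustration only.
confidence = [0.91, 0.97, 0.99]
data_error_potential = [0.70, 0.82, 0.66]

conf_result = evaluate_condition(confidence, "avg", "lt", 0.99)
dep_result = evaluate_condition(data_error_potential, "min", "gt", 0.65)
print(conf_result, dep_result)
```

Keeping the aggregate and the comparison as separate lookups makes each condition a small piece of data rather than custom code.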




## Text Classification

In [3]:
from tqdm.notebook import tqdm
import time
import numpy as np
from uuid import uuid4
import pandas as pd
from sklearn.datasets import fetch_20newsgroups


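BATCH_SIZE = 16  # number of samples logged per call to log_model_outputs
EMB_DIM = 768  # dimension of the simulated embeddings
NUM_EPOCHS = 1  # number of simulated training epochs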

dq.init("text_classification", "foo","bar")

newsgroups = fetch_20newsgroups(subset="train", remove=('headers', 'footers', 'quotes'))
dataset = pd.DataFrame()
dataset["text"] = newsgroups.data
label_ind = newsgroups.target_names
dataset["label"] = [label_ind[i] for i in newsgroups.target]
dataset["id"] = list(range(len(dataset)))

dataset = dataset[:200]


def generate_random_embeddings(batch_size: int, emb_dims: int) -> np.ndarray:
    return np.random.rand(batch_size, emb_dims)


def generate_random_probabilities(batch_size: int, num_classes: int) -> np.ndarray:
    probs = np.random.rand(batch_size, num_classes)
    return probs / probs.sum(axis=-1).reshape(-1, 1)  # Normalize to sum to 1


t_start = time.time()
dq.set_labels_for_run(dataset["label"].unique())

print("Logging input data")
for split in ["training", "test"]:
    dq.log_dataset(dataset, split=split)
    
print("Done")
print(f"Input logging took {time.time() - t_start} seconds\n\n")


print("Logging model outputs")
t_start = time.time()
num_classes = dataset["label"].nunique()
# Simulates model training loop
for epoch_idx in range(NUM_EPOCHS):
    print(f"Epoch {epoch_idx}")
    print('-'*100)
    for split in ["training", "test"]:
        print(split.capitalize())
        dq.set_split(split)
        for i in tqdm(range(0, len(dataset), BATCH_SIZE)):
            batch = dataset[i : i + BATCH_SIZE]
            embeddings = generate_random_embeddings(len(batch), EMB_DIM)
            probs = generate_random_probabilities(len(batch), num_classes)
            dq.log_model_outputs(
                embs=embeddings,
                probs=probs,
                epoch=epoch_idx,
                ids=batch["id"],
            )
    print('-'*100,end="\n\n")
            
print("Done")

time_spent = time.time() - t_start
print(f"Logging output took {time_spent} seconds")
dq.finish(data_embs=True)

💭 Project foo was not found.
✨ Initializing public project foo
🏃‍♂️ Starting run bar
🛰 Created project, foo, and new run, bar.
Logging input data
Logging 200 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
Logging 200 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 Done
Input logging took 0.23761200904846191 seconds


Logging model outputs
Epoch 0
----------------------------------------------------------------------------------------------------
Training


  0%|          | 0/13 [00:00<?, ?it/s]

Test




  0%|          | 0/13 [00:00<?, ?it/s]

----------------------------------------------------------------------------------------------------

Done
Logging output took 0.2591838836669922 seconds
☁️ Uploading Data


training:   0%|          | 0/1 [00:00<?, ?it/s]

Getting data embeddings for training


Uploading data to Galileo:   0%|          | 0.00/613k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/13 [00:00<?, ?it/s]

training (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/1.18M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/49.6k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/286k [00:00<?, ?B/s]

test:   0%|          | 0/1 [00:00<?, ?it/s]

Getting data embeddings for test


Uploading data to Galileo:   0%|          | 0.00/613k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/13 [00:00<?, ?it/s]

test (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/1.18M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/49.6k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/286k [00:00<?, ?B/s]

Job default successfully submitted. Results will be available soon at http://127.0.0.1:3000/insights?projectId=59c3f732-15cd-4d45-8f31-8171d02ec0d2&runId=9734bcec-31f0-424b-bc1a-805ce53fe77c&split=training&metric=f1&depHigh=1&depLow=0&taskType=0
Waiting for job...
Done! Job finished with status completed
🧹 Cleaning up


{'project_id': '59c3f732-15cd-4d45-8f31-8171d02ec0d2',
 'run_id': '9734bcec-31f0-424b-bc1a-805ce53fe77c',
 'job_name': 'default',
 'labels': ['rec.autos',
  'comp.sys.mac.hardware',
  'comp.graphics',
  'sci.space',
  'talk.politics.guns',
  'sci.med',
  'comp.sys.ibm.pc.hardware',
  'comp.os.ms-windows.misc',
  'rec.motorcycles',
  'talk.religion.misc',
  'misc.forsale',
  'alt.atheism',
  'sci.electronics',
  'comp.windows.x',
  'rec.sport.hockey',
  'rec.sport.baseball',
  'soc.religion.christian',
  'talk.politics.mideast',
  'talk.politics.misc',
  'sci.crypt'],
 'task_type': 0,
 'tasks': None,
 'non_inference_logged': False,
 'migration_name': None,
 'xray': True,
 'process_existing_inference_runs': False,
 'message': 'Processing job!',
 'link': 'http://127.0.0.1:3000/insights?projectId=59c3f732-15cd-4d45-8f31-8171d02ec0d2&runId=9734bcec-31f0-424b-bc1a-805ce53fe77c&split=training&metric=f1&depHigh=1&depLow=0&taskType=0'}

In [17]:
dq.metrics.get_data_embeddings("automatic_amethyst_impala", "striking_aquamarine_armadillo", "train")

#,id,data_emb
0,0,"'array([-3.12296189e-02, 3.87282576e-03, 1.073..."
1,1,"'array([ 1.31226787e-02, 2.68374998e-02, 1.898..."
2,2,"'array([ 3.13321277e-02, 3.75694223e-02, -2.715..."
3,3,"'array([-1.62603483e-02, 2.05614753e-02, 2.298..."
4,4,"'array([-1.49948930e-03, 9.73128434e-03, 2.439..."
...,...,...
195,195,"'array([-4.94875461e-02, 2.15062592e-02, 2.261..."
196,196,"'array([-1.70215108e-02, -8.53871256e-02, -1.704..."
197,197,"'array([-1.07716899e-02, 9.03651938e-02, -8.732..."
198,198,"'array([-8.67577456e-03, 4.50826325e-02, 6.657..."


In [39]:
dq.metrics.get_dataframe(
    "automatic_amethyst_impala", "striking_aquamarine_armadillo", "train", include_embs=True, include_data_embs=True
)

#,epoch,pred,text,split,data_schema_version,id,galileo_text_length,galileo_language_id,galileo_pii,confidence,data_error_potential,label,likely_mislabeled,cbo_cluster,data_x,data_y,x,y,pred_idx,emb,data_emb
0,0,comp.sys.mac.hardware,'I was wondering if anyone out there could enlig...,training,1,0,475,en,,0.09836423642484288,0.5338762023612155,rec.autos,True,-1,11.376409530639648,4.246286392211914,5.65283203125,5.5749897956848145,1,"'array([0.29837902, 0.92121278, 0.92973643, 0.28...","'array([-3.12296189e-02, 3.87282576e-03, 1.073..."
1,0,rec.sport.hockey,'A fair number of brave souls who upgraded their...,training,1,1,530,en,,0.08033560776517272,0.5020206264381786,comp.sys.mac.hardware,False,-1,9.92272663116455,6.8370280265808105,10.7771635055542,2.547896385192871,14,"'array([0.38165844, 0.03603427, 0.02338958, 0.47...","'array([ 1.31226787e-02, 2.68374998e-02, 1.898..."
2,0,sci.electronics,"'well folks, my mac plus finally gave up the gho...",training,1,2,1659,en,email,0.08354285554819539,0.5233631578957361,comp.sys.mac.hardware,True,-1,11.355062484741211,4.144324779510498,7.036731243133545,6.68037223815918,12,"'array([0.59774918, 0.68268885, 0.10923718, 0.82...","'array([ 3.13321277e-02, 3.75694223e-02, -2.715..."
3,0,sci.crypt,"""\nDo you have Weitek's address/phone number? I'...",training,1,3,95,en,,0.10009139798351302,0.5003215371552514,comp.graphics,False,-1,11.398351669311523,4.21092414855957,7.649347305297852,7.090616703033447,19,"'array([3.97452249e-01, 4.40846703e-01, 5.954833...","'array([-1.62603483e-02, 2.05614753e-02, 2.298..."
4,0,rec.sport.hockey,"'From article <C5owCB.n3p@world.std.com>, by tom...",training,1,4,448,en,email,0.08497681769539624,0.5016142055310292,sci.space,False,-1,9.694826126098633,1.8458075523376465,5.622284412384033,3.9054837226867676,14,"'array([5.82038555e-01, 2.08934225e-01, 4.488748...","'array([-1.49948930e-03, 9.73128434e-03, 2.439..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0,talk.religion.misc,'Hi ... Recently I found XV for MS-DOS in a subd...,training,1,195,927,en,email,0.09561871203092301,0.5071909220607094,comp.graphics,False,-1,5.677307605743408,3.221129894256592,9.530557632446289,1.93036687374115,9,"'array([6.83725933e-01, 9.07856418e-01, 8.802090...","'array([-4.94875461e-02, 2.15062592e-02, 2.261..."
196,0,comp.sys.mac.hardware,"'\nAs a general rule, no relay will cleanly switc...",training,1,196,1726,en,phone,0.1221157191642118,0.5442886540665051,sci.electronics,True,-1,10.978202819824219,2.6762144565582275,10.83487319946289,6.180272579193115,1,"'array([6.23284972e-01, 4.31596115e-01, 3.121943...","'array([-1.70215108e-02, -8.53871256e-02, -1.704..."
197,0,sci.electronics,' ...\n\nI think this is a big leap sex->depressi...,training,1,197,659,en,,0.13224859900285157,0.5653523086457009,alt.atheism,True,-1,8.261764526367188,1.5314429998397827,5.971230506896973,6.017798900604248,12,"'array([2.48097100e-01, 4.47150206e-02, 9.063618...","'array([-1.07716899e-02, 9.03651938e-02, -8.732..."
198,0,alt.atheism,"""\n\n\n\n\nI don't sign any blank checks.\n\nWhen Doug ...",training,1,198,380,en,,0.10058767737327697,0.5424329941556727,talk.politics.guns,True,-1,9.371979713439941,7.027573108673096,8.880782127380371,7.061689853668213,11,"'array([1.88811188e-01, 7.81076818e-01, 5.208729...","'array([-8.67577456e-03, 4.50826325e-02, 6.657..."
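
The `confidence` values in the table above sit near 0.1 because the logged probabilities are random. A minimal sketch, assuming confidence is the probability of the predicted (highest-scoring) class, which is an assumption about the metric rather than something stated in this notebook: with 20 classes and normalized random scores, the highest probability cannot fall below 1/20 and typically stays close to 0.1. `random_probabilities` is a hypothetical helper written with the standard library only.

```python
import random

NUM_CLASSES = 20  # 20 newsgroups labels, as in the run above


def random_probabilities(num_classes, rng):
    """Draw random scores and normalize them so they sum to 1."""
    scores = [rng.random() for _ in range(num_classes)]
    total = sum(scores)
    return [s / total for s in scores]


rng = random.Random(0)  # seeded so the sketch is repeatable
probs = random_probabilities(NUM_CLASSES, rng)

# Assumption: confidence is the probability of the predicted (argmax) class.
pred_idx = max(range(NUM_CLASSES), key=lambda i: probs[i])
confidence = probs[pred_idx]
print(pred_idx, round(confidence, 3))
```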


In [18]:
from sentence_transformers import SentenceTransformer
import transformers

transformers.logging.disable_progress_bar()
# data_model = SentenceTransformer("all-mpnet-base-v2")
data_model = SentenceTransformer("distilbert-base-uncased")


data_model.encode(["sentence 1"])

No sentence-transformers model found with name /Users/benepstein/.cache/torch/sentence_transformers/distilbert-base-uncased. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /Users/benepstein/.cache/torch/sentence_transformers/distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


array([[ 5.67681640e-02, -9.06216875e-02, -2.78602570e-01,
        -9.06290337e-02, -7.34410807e-02, -1.70276463e-01,
         4.53511685e-01, -2.41690978e-01, -1.65875703e-02,
        -9.15749073e-02, -1.47593975e-01, -4.29968312e-02,
        -1.12826794e-01,  1.16887763e-01, -2.06177652e-01,
        -9.26956162e-02,  4.54158708e-02, -2.31197104e-02,
         1.44801602e-01, -6.28933543e-03,  4.81691927e-01,
         1.10726506e-02,  5.85196912e-02,  1.45137534e-01,
         1.45566568e-01,  5.66426963e-02, -2.00803354e-01,
         2.16374829e-01, -2.80160725e-01,  2.14290842e-02,
         6.51769191e-02, -1.65645778e-01, -1.98399633e-01,
         1.28243044e-01, -2.78879851e-02, -1.83226213e-01,
         1.39946863e-02, -2.60219634e-01, -2.71594673e-02,
        -8.38839263e-03,  1.46430552e-01, -6.63785934e-02,
        -3.20758641e-01, -3.61926854e-02,  1.20771997e-01,
        -5.45867756e-02, -4.38974559e-01, -2.53210038e-01,
        -3.28152597e-01,  1.50458515e-01, -2.98882395e-0

In [94]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sentence_transformers import SentenceTransformer

LOCAL_TOKENIZER_PATH = "tmp/testing-random-distilbert-tokenizer"
LOCAL_MODEL_PATH = "tmp/testing-random-distilbert-sq"

tokenizer = AutoTokenizer.from_pretrained(
    "hf-internal-testing/tiny-random-distilbert"
)
# Save the tokenizer into the model directory so both can be loaded from one path
tokenizer.save_pretrained(LOCAL_MODEL_PATH)

model = AutoModelForSequenceClassification.from_pretrained(
    "hf-internal-testing/tiny-random-distilbert"
)
model.save_pretrained(LOCAL_MODEL_PATH)

Some weights of the model checkpoint at hf-internal-testing/tiny-random-distilbert were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'qa_outputs.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'qa_outputs.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [93]:
model.save_pretrained?

In [78]:
dir(model)

['T_destination',
 '__annotations__',
 '__call__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_apply',
 '_auto_class',
 '_backward_compatibility_gradient_checkpointing',
 '_backward_hooks',
 '_buffers',
 '_call_impl',
 '_can_retrieve_inputs_from_name',
 '_convert_head_mask_to_5d',
 '_create_repo',
 '_expand_inputs_for_generation',
 '_forward_hooks',
 '_forward_pre_hooks',
 '_from_config',
 '_get_backward_hooks',
 '_get_decoder_start_token_id',
 '_get_files_timestamps',
 '_get_logits_processor',
 '_get_logits_warper',
 '_get_name',
 '_get_resized_embeddings',
 '_get_resized_lm_head',
 '_get_stopping_criteria',
 '_hook_rss_memory_po

In [95]:
!ls -lsah {LOCAL_MODEL_PATH}

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
total 856
  0 drwxr-xr-x  8 benepstein  staff   256B Nov 17 13:48 .
  0 drwxr-xr-x  4 benepstein  staff   128B Nov 17 10:06 ..
  8 -rw-r--r--  1 benepstein  staff   580B Nov 17 13:48 config.json
768 -rw-r--r--  1 benepstein  staff   369K Nov 17 13:48 pytorch_model.bin
  8 -rw-r--r--  1 benepstein  staff   125B Nov 17 13:48 special_tokens_map.json
 48 -rw-r--r--  1 benepstein  staff    23K Nov 17 13:48 tokenizer.json
  8 -rw-r--r--  1 benepstein  staff   379B Nov 17 13:48 tokenizer_config.json
 16 -rw-r--r--  1 benepstein  staff   4.6K Nov 17 13:48 vocab.txt


In [88]:
!cp {LOCAL_MODEL_PATH}/* {LOCAL_TOKENIZER_PATH}
!ls -lash {LOCAL_TOKENIZER_PATH}

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
total 832
  0 drwxr-xr-x  8 benepstein  staff   256B Nov 17 13:47 .
  0 drwxr-xr-x  4 benepstein  staff   128B Nov 17 10:06 ..
  8 -rw-r--r--  1 benepstein  staff   580B Nov 17 13:47 config.json
744 -rw-r--r--  1 benepstein  staff   369K Nov 17 13:47 pytorch_model.bin
  8 -rw-r--r--  1 benepstein  staff   125B Nov 17 10:07 special_tokens_map.json
 48 -rw-r--r--  1 benepstein  staff    23K Nov 17 10:07

In [84]:
SentenceTransformer?

In [96]:
data_model = SentenceTransformer(LOCAL_MODEL_PATH)

No sentence-transformers model found with name tmp/testing-random-distilbert-sq. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at tmp/testing-random-distilbert-sq were not used when initializing DistilBertModel: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [90]:
data_model.encode("test")

array([ 1.31875408e+00, -1.36274144e-01, -4.55679804e-01,  3.05176288e-01,
        4.53322917e-01,  2.71099836e-01,  3.02597076e-01,  3.70480180e-01,
       -5.49280405e-01, -2.95543194e-01,  1.08455203e-01, -1.40951172e-01,
       -2.37448409e-01, -2.34409809e-01, -1.31957665e-01, -3.15684766e-01,
        2.78756022e-04,  1.02355920e-01, -5.70327580e-01,  2.87402362e-01,
       -1.25006303e-01,  8.92032325e-01, -4.03520346e-01,  2.39682391e-01,
       -1.28863156e-01,  1.04067922e-02,  1.15602612e-02,  1.05618455e-01,
       -1.26513556e-01,  2.16349378e-01, -1.14398706e+00, -1.24702230e-04],
      dtype=float32)

In [36]:
token_embs = data_model.encode(
    [
        "this is a long sentence about how jon is really pretty and tall and has great eyes",
        "this is a long sentence about how jon is "
    ],
    output_value=None, 
    convert_to_numpy=True
)
token_embs[0]

{'input_ids': tensor([ 101, 2023, 2003, 1037, 2146, 6251, 2055, 2129, 6285, 2003, 2428, 3492,
         1998, 4206, 1998, 2038, 2307, 2159,  102]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'token_embeddings': tensor([[-0.0989, -0.1244, -0.0870,  ...,  0.1521,  0.4170,  0.0777],
         [-0.1934, -0.1937, -0.3912,  ...,  0.1710,  0.5467,  0.0530],
         [-0.1983, -0.0436, -0.0218,  ...,  0.3699,  0.0821,  0.4078],
         ...,
         [ 0.1655,  0.3409,  0.2666,  ..., -0.1975,  0.1927, -0.1595],
         [-0.0490,  0.0316, -0.0028,  ...,  0.0534,  0.2502, -0.5220],
         [ 0.7531, -0.0019, -0.3970,  ...,  0.1781, -0.3043, -0.6650]]),
 'sentence_embedding': tensor([ 1.1392e-02,  5.2962e-03,  2.4081e-02,  5.5669e-02,  2.9600e-01,
         -2.5092e-01,  3.5211e-04,  1.0010e+00, -3.2073e-01, -2.2458e-01,
          1.2752e-01, -5.7345e-01, -3.2998e-02,  4.6655e-01, -2.2689e-01,
          2.9207e-01,  2.8244e-01,  2.1811e-01, -8.0712e-02, 

In [38]:
SentenceTransformer("all-MiniLM-L6-v2")

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
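
The printed module list shows a `Pooling` layer with `pooling_mode_mean_tokens` set to `True`, followed by `Normalize()`. As a rough sketch of what those two steps compute, the following averages token vectors over the positions where the attention mask is 1 and then scales the result to unit length. It is written in plain Python on a tiny made-up input to stay dependency-free; `mean_pool` and `l2_normalize` are hypothetical helper names, and this is not the sentence-transformers implementation.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Made-up example: three token vectors of dimension 2; the last is padding.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_embeddings, attention_mask)
sentence_embedding = l2_normalize(pooled)
print(pooled, sentence_embedding)
```

Masking matters when sentences in a batch are padded to a common length: without it, padding vectors would pull the average away from the real tokens.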

In [32]:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

toks = tok.encode("this is a long sentence about how jon is really pretty and tall and has great eyes")
toks

[101,
 2023,
 2003,
 1037,
 2146,
 6251,
 2055,
 2129,
 6285,
 2003,
 2428,
 3492,
 1998,
 4206,
 1998,
 2038,
 2307,
 2159,
 102]

In [24]:
import transformers

dir(transformers.logging)

['CRITICAL',
 'DEBUG',
 'ERROR',
 'EmptyTqdm',
 'FATAL',
 'INFO',
 'NOTSET',
 'Optional',
 'WARN',
 '__annotations__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_configure_library_root_logger',
 '_default_handler',
 '_default_log_level',
 '_get_default_logging_level',
 '_get_library_name',
 '_get_library_root_logger',
 '_lock',
 '_reset_library_root_logger',
 '_tqdm_active',
 '_tqdm_cls',
 'add_handler',
 'disable_default_handler',
 'disable_progress_bar',
 'disable_propagation',
 'enable_default_handler',
 'enable_explicit_format',
 'enable_progress_bar',
 'enable_propagation',
 'get_log_levels_dict',
 'get_logger',
 'get_verbosity',
 'hf_hub_utils',
 'is_progress_bar_enabled',
 'log_levels',
 'logging',
 'os',
 'remove_handler',
 'reset_format',
 'set_verbosity',
 'set_verbosity_debug',
 'set_verbosity_error',
 'set_verbosity_info',
 'sys',
 'threading',
 'tqdm',
 'tqdm_lib',

## Multi Label

In [None]:
from typing import List
from random import choice
import numpy as np


dq.init("text_multi_label", "test-mltc-run")
dq.set_labels_for_run([["not "+_label, _label] for _label in ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult','identity_hate']]) 
dq.set_tasks_for_run(['task_0', 'task_1', 'task_2', 'task_3', 'task_4', 'task_5'])

n = 5000

texts: List[str] = [f"text sample {i}" for i in range(n)]

labels: List[List[str]] = [
    [choice(i) for i in dq.get_data_logger().logger_config.labels]
    for _ in range(n)
]

ids = list(range(n))


dq.log_data_samples(texts=texts, task_labels=labels, ids=ids, split="training")
dq.log_data_samples(texts=texts, task_labels=labels, ids=ids, split="test")
dq.log_data_samples(texts=texts, task_labels=labels, ids=ids, split="validation")

for split in ["training", "test", "validation"]:
    for epoch in range(5):
        emb=np.random.rand(n, 768)
        logits=[[np.random.rand(2)] * 6] * n
        ids=list(range(n))
        
        # Log in batches of 32; the slice width matches the loop stride so every sample is logged
        for i in range(0, n, 32):
            dq.log_model_outputs(
                embs=emb[i:i+32],
                logits=logits[i:i+32],
                ids=ids[i:i+32],
                split=split,
                epoch=epoch
            )

dq.finish()
# see_results() is assumed to be a helper defined elsewhere in the session
df_train, df_test, df_val = see_results()
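
When logging model outputs in batches, the slice width should match the loop stride so that every sample is logged exactly once. The sketch below shows one way to generate such batches; `batch_slices` is a hypothetical helper, not part of dataquality, and the logging call itself is left as a comment.

```python
def batch_slices(n, batch_size):
    """Yield (start, stop) pairs that cover range(n) without gaps or overlap."""
    for start in range(0, n, batch_size):
        yield start, min(start + batch_size, n)


n = 5000
batch_size = 32
ids = list(range(n))

logged_ids = []
for start, stop in batch_slices(n, batch_size):
    # In a real run, the embs, logits, and ids slices for this batch would be logged here.
    logged_ids.extend(ids[start:stop])

print(len(logged_ids), logged_ids == ids)
```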


## NER

In [None]:
import numpy as np
from tqdm.notebook import tqdm


dq.init("text_ner", "test-ner-run")


def log_inputs():
    text_inputs = ['what movies star bruce willis', 'show me films with drew barrymore from the 1980s', 'what movies starred both al pacino and robert deniro', 'find me all of the movies that starred harold ramis and bill murray', 'find me a movie with a quote about baseball in it']
    tokens = [[(0, 4), (5, 11), (12, 16), (17, 22), (17, 22), (23, 29), (23, 29)], [(0, 4), (5, 7), (8, 13), (14, 18), (19, 23), (24, 33), (24, 33), (24, 33), (34, 38), (39, 42), (43, 48)], [(0, 4), (5, 11), (12, 19), (20, 24), (25, 27), (28, 34), (28, 34), (28, 34), (35, 38), (39, 45), (39, 45), (46, 52), (46, 52)], [(0, 4), (5, 7), (8, 11), (12, 14), (15, 18), (19, 25), (26, 30), (31, 38), (39, 45), (39, 45), (39, 45), (46, 51), (46, 51), (52, 55), (56, 60), (61, 67), (61, 67), (61, 67)], [(0, 4), (5, 7), (8, 9), (10, 15), (16, 20), (21, 22), (23, 28), (29, 34), (35, 43), (44, 46), (47, 49)]]
    gold_spans = [[{'start': 17, 'end': 29, 'label': 'ACTOR'}], [{'start': 19, 'end': 33, 'label': 'ACTOR'}, {'start': 43, 'end': 48, 'label': 'YEAR'}], [{'start': 25, 'end': 34, 'label': 'ACTOR'}, {'start': 39, 'end': 52, 'label': 'ACTOR'}], [{'start': 39, 'end': 51, 'label': 'ACTOR'}, {'start': 56, 'end': 67, 'label': 'ACTOR'}], []]
    ids = [0, 1, 2, 3, 4]

    labels = ['[PAD]', '[CLS]', '[SEP]', 'O', 'B-ACTOR', 'I-ACTOR', 'B-YEAR', 'B-TITLE', 'B-GENRE', 'I-GENRE', 'B-DIRECTOR', 'I-DIRECTOR', 'B-SONG', 'I-SONG', 'B-PLOT', 'I-PLOT', 'B-REVIEW', 'B-CHARACTER', 'I-CHARACTER', 'B-RATING', 'B-RATINGS_AVERAGE', 'I-RATINGS_AVERAGE', 'I-TITLE', 'I-RATING', 'B-TRAILER', 'I-TRAILER', 'I-REVIEW', 'I-YEAR']
    dq.set_labels_for_run(labels)
    dq.set_tagging_schema("BIO")
    dq.log_data_samples(texts=text_inputs, text_token_indices=tokens, ids=ids, gold_spans=gold_spans, split="training")
    dq.log_data_samples(texts=text_inputs, text_token_indices=tokens, ids=ids, gold_spans=gold_spans, split="validation")
    dq.log_data_samples(texts=text_inputs, text_token_indices=tokens, ids=ids, gold_spans=gold_spans, split="test")

def log_outputs():
    num_classes = 28  # must match the number of labels set in log_inputs
    embs = [np.random.rand(119, 768) for _ in range(5)]
    logits = [np.random.rand(119, num_classes) for _ in range(5)]
    ids = list(range(5))
    for epoch in tqdm(range(6)):
        for split in ["training", "test", "validation"]:
            dq.log_model_outputs(
                embs=embs, logits=logits, ids=ids, split=split, epoch=epoch
            )
    
def finish():
    dq.finish()
    
    
def runit():
    log_inputs()
    log_outputs()
    finish()
    
runit()
# see_results() is assumed to be a helper defined elsewhere in the session
df_train, df_test, df_val = see_results()
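
In the NER inputs above, `text_token_indices` holds character offsets for each token and `gold_spans` holds character offsets for each labeled entity. Below is a minimal sketch of a sanity check on the first sample, assuming a span is well-formed when it starts on a token start and ends on a token end. `span_is_aligned` is a hypothetical helper, not a dataquality function. The repeated offsets such as `(17, 22), (17, 22)` presumably come from sub-word tokens that share a word's boundaries; that reading is an assumption.

```python
def span_is_aligned(span, token_offsets):
    """Check that a span starts on a token start and ends on a token end."""
    starts = {start for start, _ in token_offsets}
    ends = {end for _, end in token_offsets}
    return span["start"] in starts and span["end"] in ends


# First sample from the cell above.
text = "what movies star bruce willis"
token_offsets = [(0, 4), (5, 11), (12, 16), (17, 22), (17, 22), (23, 29), (23, 29)]
gold_span = {"start": 17, "end": 29, "label": "ACTOR"}

span_text = text[gold_span["start"]:gold_span["end"]]
print(span_text, span_is_aligned(gold_span, token_offsets))
```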