# Automatic Data Insights
The fastest way to improve your data


**What is DQ auto?**

dq.auto is a helper function that trains a state-of-the-art transformer (or any model of your choosing from HuggingFace) on your dataset so that Galileo can process it. You provide the data; Galileo trains the model and returns data quality insights.

Requirements:
```bash
pip install dataquality
```

In [1]:
%%capture
%pip install dataquality
%pip install evaluate

# Get Started
First, we import dataquality and create our train, test, validation, and inference DataFrames.

In [2]:
# 🔭🌕 Galileo logging
import dataquality as dq
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
# Load the full 20 newsgroups dataset from sklearn
newsgroups_ds = fetch_20newsgroups(subset='all')
# Convert to a pandas dataframe with "text" and "label" columns
df = pd.DataFrame({"text": newsgroups_ds.data, "label": newsgroups_ds.target})
# Hold out 20% for test, then 20% of the remainder for validation
df_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train, test_size=0.2, random_state=1)
# Carve 20% of the validation data off as the inference split
df_val, df_inf = train_test_split(df_val, test_size=0.2, random_state=1)
# Halve the inference split (the halves are not used in the cells below)
df_inf_1, df_inf_2 = train_test_split(df_inf, test_size=0.5, random_state=1)

Welcome to Galileo Cloud v0.9.3!
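
The chained `train_test_split` calls above can be hard to follow, so here is a small sketch that reproduces the resulting split sizes with plain arithmetic. It assumes the full dataset holds 18,846 documents and that sklearn rounds a fractional `test_size` up and gives the remainder to the training side; both assumptions are consistent with the sample counts (12060, 2412, 3770, 604) reported in the logging output further down. The helper name `split_sizes` is made up for this illustration.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test side is rounded up and the train side gets the rest.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# Assumed size of fetch_20newsgroups(subset='all')
n_total = 18846

# Same order of splits as in the cell above
n_rest, n_test = split_sizes(n_total, 0.2)
n_train, n_val_pool = split_sizes(n_rest, 0.2)
n_val, n_inf = split_sizes(n_val_pool, 0.2)
n_inf_1, n_inf_2 = split_sizes(n_inf, 0.5)

print("train:", n_train, "val:", n_val, "test:", n_test, "inference:", n_inf)
```

Because each split is taken from the remainder of the previous one, the validation and inference sets end up much smaller than a flat 20% of the data.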


# Run `dq.auto` for insights on text classification
Simply run dq.auto with pandas dataframes and you are ready to go.

Training takes around 14 minutes in Google Colab with early stopping.

In [5]:
# 🔭🌕 Galileo logging:
# Automatically fine-tune a HuggingFace model. Once training completes,
# the HuggingFace trainer is returned.
t = dq.auto(
    train_data=df_train,
    test_data=df_test,
    val_data=df_val,
    inference_data={"inference_test": df_inf},
    labels=newsgroups_ds.target_names,
    project_name="newsgroups_work",
    run_name="run_1_raw_data"
)

Casting the dataset:   0%|          | 0/12060 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/2412 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/3770 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/604 [00:00<?, ? examples/s]

📡 https://console.dev.rungalileo.io
🔭 Logging you into Galileo

Go to https://console.dev.rungalileo.io/get-token to generate a new API Key
🔐 Enter your API Key:··········
🚀 You're logged in to Galileo as franz@rungalileo.io!
✨ Initializing existing public project 'newsgroups_work'
🏃‍♂️ Fetching existing run 'run_1_raw_data'
🛰 Connected to existing project 'newsgroups_work', and existing run 'run_1_raw_data'.




🚀 Found existing run labels. Setting labels for run to ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']. You do not need to set labels for this run.
Logging 12060 samples [########################################] 100.00% elapsed time  :     0.05s =  0.0m =  0.0h
Logging 2412 samples [########################################] 100.00% elapsed time  :     0.01s =  0.0m =  0.0h
Logging 3770 samples [########################################] 100.00% elapsed time  :     0.01s =  0.0m =  0.0h
Logging 604 samples [########################################] 100.00% elapsed time  :     0.01s =  0.0m =  0.0h
 

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/12060 [00:00<?, ? examples/s]

Map:   0%|          | 0/2412 [00:00<?, ? examples/s]

Map:   0%|          | 0/3770 [00:00<?, ? examples/s]

Map:   0%|          | 0/604 [00:00<?, ? examples/s]

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.

Found layer "classifier" in model layers: "distilbert, pre_classifier, classifier"


| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 0.567533 | 0.841245 |
| 2 | No log | 0.359053 | 0.897793 |
| 3 | 0.699900 | 0.382828 | 0.898249 |
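
The epoch table above is worth a second look: validation loss and F1 do not peak at the same epoch. The sketch below copies the three rows as comma-separated text and picks the best epoch under each metric. Which metric the early stopping inside dq.auto actually monitors is not stated in this notebook, so treat this only as a way to read the table, not as a description of the stopping rule.

```python
import csv
import io

# Rows copied from the training output above
TRAINING_LOG = """Epoch,Training Loss,Validation Loss,F1
1,No log,0.567533,0.841245
2,No log,0.359053,0.897793
3,0.699900,0.382828,0.898249
"""

rows = list(csv.DictReader(io.StringIO(TRAINING_LOG)))

# Lowest validation loss and highest F1 can point at different epochs
best_loss_row = min(rows, key=lambda r: float(r["Validation Loss"]))
best_f1_row = max(rows, key=lambda r: float(r["F1"]))

best_loss_epoch = int(best_loss_row["Epoch"])
best_f1_epoch = int(best_f1_row["Epoch"])

print("lowest validation loss at epoch", best_loss_epoch)
print("highest F1 at epoch", best_f1_epoch)
```

Here the validation loss is lowest at epoch 2, while F1 still improves slightly at epoch 3.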


☁️ Uploading Data
CuML libraries not found, running standard process. For faster Galileo processing, consider installing
`pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/`


training:   0%|          | 0/3 [00:00<?, ?it/s]

Processing data for upload:   0%|          | 0/189 [00:00<?, ?it/s]

training (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/1.20M [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/189 [00:00<?, ?it/s]

training (epoch=1):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/35.4M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/1.20M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/22.7M [00:00<?, ?B/s]

Creating and uploading data embeddings for training


Uploading data to Galileo:   0%|          | 0.00/17.8M [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/189 [00:00<?, ?it/s]

training (epoch=2):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/35.4M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/1.20M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/22.7M [00:00<?, ?B/s]

validation:   0%|          | 0/3 [00:00<?, ?it/s]

Processing data for upload:   0%|          | 0/38 [00:00<?, ?it/s]

validation (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/256k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/38 [00:00<?, ?it/s]

validation (epoch=1):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/7.09M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/256k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/4.53M [00:00<?, ?B/s]

Creating and uploading data embeddings for validation


Uploading data to Galileo:   0%|          | 0.00/3.56M [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/38 [00:00<?, ?it/s]

validation (epoch=2):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/7.09M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/256k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/4.53M [00:00<?, ?B/s]

test:   0%|          | 0/1 [00:00<?, ?it/s]

Creating and uploading data embeddings for test


Uploading data to Galileo:   0%|          | 0.00/5.56M [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/59 [00:00<?, ?it/s]

test (epoch=2):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/11.1M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/394k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/6.46M [00:00<?, ?B/s]

inference:   0%|          | 0/1 [00:00<?, ?it/s]

Creating and uploading data embeddings for inference/inference_test


Uploading data to Galileo:   0%|          | 0.00/921k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/10 [00:00<?, ?it/s]

inference (inf_name=inference_test):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/1.78M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/64.4k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/1.13M [00:00<?, ?B/s]

Job inference successfully submitted. Results will be available soon at https://console.dev.rungalileo.io/insights?projectId=aa143acc-27f6-4356-a7e2-7fd7246684df&runId=ca47ebfb-91cd-4b93-9e30-ced8c3042a6f&taskType=0&split=inference&inferenceName=inference_test&sortBy=confidence&groupedBy=pred&sortDirection=asc
Waiting for job (you can safely close this window)...
	Found embs. Analyzing dimensions
	Applying dimensionality reduction to embs
	Applying dimensionality reduction to data embs
	Found embs. Analyzing dimensions
	Applying dimensionality reduction to embs
	Applying dimensionality reduction to data embs
	Adding smart features to df
	Finding semantic clusters for training
	Measuring sample similarity for training
	Saving processed training data
	Uploading processed training data
	Finding semantic clusters for validation
	Looking for likely mislabeled validation samples
	Saving processed validation data
	Adding confidence to df
	Finding semantic clusters for test
	Measuring sample s

In [6]:
# Use the returned trainer to save the model
t.save_model("finetuned_hf_model")

# Running inference on an existing model with Trainer
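First we load the fine-tuned model and tokenizer saved in the previous step from the local `finetuned_hf_model` folder, using the HuggingFace `from_pretrained` API.
Then we wrap the model in a `Trainer` and run it for inference.
To load the last training checkpoint instead, check the trainer's output folder for the most recent checkpoint and pass that path as the model name.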
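
The cell below loads the saved `finetuned_hf_model` folder directly. If you would rather pick up the most recent training checkpoint, the following is a minimal sketch of how to locate it, assuming the output folder contains subfolders named `checkpoint-<step>` (the naming the HuggingFace `Trainer` uses by default). The helper `find_last_checkpoint` is hypothetical and written for this illustration; the demo runs against a temporary folder rather than a real training output.

```python
import os
import re
import tempfile

# Assumption: checkpoints are stored as "checkpoint-<global step>" folders
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the path of the checkpoint folder with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_PATTERN.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            # Compare step numbers as integers, not as strings
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path


# Demo with empty placeholder folders instead of real checkpoints
with tempfile.TemporaryDirectory() as demo_dir:
    empty_result = find_last_checkpoint(demo_dir)
    for step in (90, 500, 1000):
        os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
    last_checkpoint_name = os.path.basename(find_last_checkpoint(demo_dir))

print(last_checkpoint_name)
```

The returned path can then be passed to `from_pretrained` in place of the model name. Comparing the step numbers as integers matters: sorted as strings, `checkpoint-90` would come after `checkpoint-1000`.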

In [13]:
# 🔭🌕 Galileo logging
import dataquality as dq
from dataquality.integrations.transformers_trainer import watch
from datasets import Dataset
from transformers import Trainer, AutoModelForSequenceClassification, AutoTokenizer
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
# Load the newsgroups dataset and convert a random sample of 2000 rows to a pandas dataframe
newsgroups_ds = fetch_20newsgroups(subset='all')
df = pd.DataFrame({"text": newsgroups_ds.data, "label": newsgroups_ds.target}).sample(2000)
# Split into train, test, validation and inference dataframes
df_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train, test_size=0.2, random_state=1)
df_val, df_inf = train_test_split(df_val, test_size=0.2, random_state=1)

# Local model
model_name = "finetuned_hf_model"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

ds = Dataset.from_pandas(df.sample(200), preserve_index=False)
# 🔭🌕 Galileo logging: Add id column to dataset
ds = ds.map(lambda x, idx: {"id": idx}, with_indices=True)
# 🔭🌕 Galileo logging: Init project
dq.init(
    task_type="text_classification",
    project_name="newsgroups_work",
    run_name="run_1_raw_data",
)
# 🔭🌕 Galileo logging: Log inference dataset
dq.log_dataset(ds, split="inference", inference_name="inference_foo_bar")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
# Make sure to apply tokenization after logging the dataset
ds = ds.map(tokenize_function, batched=True)


# 🔭🌕 Galileo logging: Set labels
dq.set_labels_for_run(labels=newsgroups_ds.target_names)
trainer = Trainer(model)
# 🔭🌕 Galileo logging: Watch trainer
watch(trainer)
# 🔭🌕 Galileo logging: Set split and start predicting
dq.set_split("inference", inference_name="inference_foo_bar")
preds = trainer.predict(test_dataset=ds)
# 🔭🌕 Galileo logging: Finish run
dq.finish()

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

✨ Initializing existing public project 'newsgroups_work'
🏃‍♂️ Fetching existing run 'run_1_raw_data'
🛰 Connected to existing project 'newsgroups_work', and existing run 'run_1_raw_data'.




🚀 Found existing run labels. Setting labels for run to ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']. You do not need to set labels for this run.
Logging 200 samples [########################################] 100.00% elapsed time  :     0.01s =  0.0m =  0.0h
 Found layer "classifier" in model layers: "distilbert, pre_classifier, classifier"


☁️ Uploading Data
CuML libraries not found, running standard process. For faster Galileo processing, consider installing
`pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/`


inference:   0%|          | 0/1 [00:00<?, ?it/s]

Processing data for upload:   0%|          | 0/25 [00:00<?, ?it/s]

inference (inf_name=inference_foo_bar_col_cpu):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/613k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/28.8k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/487k [00:00<?, ?B/s]

Job inference successfully submitted. Results will be available soon at https://console.dev.rungalileo.io/insights?projectId=aa143acc-27f6-4356-a7e2-7fd7246684df&runId=ca47ebfb-91cd-4b93-9e30-ced8c3042a6f&taskType=0&split=inference&inferenceName=inference_foo_bar_checkpoint_cpu&sortBy=confidence&groupedBy=pred&sortDirection=asc
Waiting for job (you can safely close this window)...
	Building visualization of embeddings
	Finding semantic clusters for inference
	Looking for data embeddings
	[inference] 👀 Looking for data anomalies
Done! Job finished with status completed
Click here to see your run! https://console.dev.rungalileo.io/insights?projectId=aa143acc-27f6-4356-a7e2-7fd7246684df&runId=ca47ebfb-91cd-4b93-9e30-ced8c3042a6f&taskType=0&split=inference&inferenceName=inference_foo_bar_checkpoint_cpu&sortBy=confidence&groupedBy=pred&sortDirection=asc
🧹 Cleaning up


'https://console.dev.rungalileo.io/insights?projectId=aa143acc-27f6-4356-a7e2-7fd7246684df&runId=ca47ebfb-91cd-4b93-9e30-ced8c3042a6f&taskType=0&split=inference&inferenceName=inference_foo_bar_checkpoint_cpu&sortBy=confidence&groupedBy=pred&sortDirection=asc'
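
The `preds` object returned by `trainer.predict` in the inference cell holds raw logits in its `predictions` field, one row per sample and one column per label. As a minimal sketch of turning those into readable labels, the code below applies a softmax and an argmax in plain Python. The logits and the three-label subset are invented for illustration; in the notebook you would pass the rows of `preds.predictions` and `newsgroups_ds.target_names` instead. The helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities."""
    peak = max(logits)
    # Subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_predictions(logit_rows, label_names):
    """Return a (label, confidence) pair for each row of logits."""
    decoded = []
    for row in logit_rows:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        decoded.append((label_names[best], probs[best]))
    return decoded


# Made-up values standing in for preds.predictions and the label list
label_names = ["alt.atheism", "comp.graphics", "sci.space"]
example_logits = [[0.1, 2.3, -1.0], [-0.5, 0.0, 3.2]]

decoded = decode_predictions(example_logits, label_names)
for label, confidence in decoded:
    print(f"{label}: {confidence:.3f}")
```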