# Text Classification using SetFit and 🔭 Galileo

In this tutorial, we'll train a model with SetFit and explore the results in Galileo.

**Make sure to select GPU in your Runtime! (Runtime -> Change Runtime type)**

### Install Galileo 🔭🌕

Simply run ```pip install dataquality```

In [1]:
%%capture

%pip install dataquality
%pip install setfit transformers

Import dataquality and set the project name and run name.

In [2]:
#@title 1. Initialize Galileo
# 🔭🌕 Galileo setup
import dataquality as dq

project_name="text_classification_with_setfit"
run_name="set_fit_sst2"

Welcome to Galileo Cloud v0.8.45!


In [3]:
#@title 2. Data preparation
#@markdown Use Hugging Face 🤗 Datasets for training SetFit
from datasets import load_dataset
from setfit import sample_dataset

dataset_id = "sst2"
dataset = load_dataset(dataset_id)

train_dataset = sample_dataset(dataset["train"], num_samples=8)
eval_dataset = dataset["validation"]

Downloading builder script:   0%|          | 0.00/3.77k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.85k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/5.10k [00:00<?, ?B/s]

Downloading and preparing dataset sst2/default to /root/.cache/huggingface/datasets/sst2/default/2.0.0/9896208a8d85db057ac50c72282bcb8fe755accc671a57dd8059d4e130961ed5...


Downloading data:   0%|          | 0.00/7.44M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

Dataset sst2 downloaded and prepared to /root/.cache/huggingface/datasets/sst2/default/2.0.0/9896208a8d85db057ac50c72282bcb8fe755accc671a57dd8059d4e130961ed5. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]
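
The cell above keeps only a handful of labeled examples per class, which is the few-shot setting SetFit targets. As a rough illustration of that idea, and not SetFit's actual implementation, the sketch below draws a fixed number of rows per label from a toy list of records. The helper name `sample_per_class` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_class(records, label_key="label", num_samples=8, seed=42):
    """Return up to `num_samples` records for each distinct label."""
    # Group the records by their label value.
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    # Draw the same number of rows from every class (fewer if a class is small).
    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        rows = by_label[label]
        sampled.extend(rng.sample(rows, min(num_samples, len(rows))))
    return sampled


# Toy two-class data standing in for the SST-2 training split.
toy_records = [{"sentence": f"example {i}", "label": i % 2} for i in range(100)]
few_shot = sample_per_class(toy_records, num_samples=8)
print(len(few_shot))
```

With two classes and `num_samples=8`, this yields 16 rows, which matches the 16 training samples logged later in this notebook.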

# 3. Training the model with dataquality integration
Integrate dataquality with the SetFit Trainer
```python
from dataquality.integrations.setfit import watch
watch(trainer, labels=labels, project_name=project_name, run_name=run_name)
```
Once the Trainer is initialized, dataquality's ```watch``` function hooks into the model.

SetFit only runs with training and validation data. To log test data as well, pass ```finish=False``` to the ```watch``` function and follow this guide. You can also pass inference data to the ```dq_evaluate``` function.
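
The training cell below passes a `column_mapping` so that the SST-2 column names (`sentence`, `label`, `idx`) line up with the names used downstream (`text`, `label`, `id`). Here is a minimal sketch of what such a mapping amounts to, using plain dictionaries rather than a Hugging Face dataset. The helper `apply_column_mapping` and the sample row are hypothetical and only illustrate the renaming.

```python
def apply_column_mapping(row, column_mapping):
    """Rename the keys of `row` according to `column_mapping` (source -> target)."""
    # Fail early if the row lacks a column the mapping refers to.
    missing = [name for name in column_mapping if name not in row]
    if missing:
        raise KeyError(f"Columns missing from row: {missing}")
    # Keep only the mapped columns, under their new names.
    return {target: row[source] for source, target in column_mapping.items()}


column_mapping = {"sentence": "text", "label": "label", "idx": "id"}
row = {"idx": 0, "sentence": "a gripping and well-acted film", "label": 1}  # made-up row
mapped_row = apply_column_mapping(row, column_mapping)
print(mapped_row)
```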


In [4]:
# 🔭🌕 Galileo logging
import dataquality as dq
from dataquality.integrations.setfit import watch

from setfit import SetFitModel
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitTrainer


model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)
column_mapping = {"sentence": "text", "label": "label", "idx": "id"}
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    column_mapping=column_mapping,
)
labels = dataset["train"].features["label"].names

# 🔭🌕 Galileo logging
watch(trainer, labels=labels,
      project_name=project_name, run_name=run_name,
      batch_size=512, # Speed up prediction
      # 🔭🌕 Set finish to False to add test
      finish=False
      )

trainer.train()

model.save_pretrained("./trained_model")

metrics = trainer.evaluate()
# 🔭🌕 Galileo logging for custom split (inference)
dq_evaluate = watch(model, batch_size=512)
preds = dq_evaluate(
    dataset["test"],
    split="test",
    column_mapping=column_mapping
    # for inference set the split to inference
    # and pass an inference_name="inference_run_1"
    )
dq.finish()

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to training dataset


Generating Training Pairs:   0%|          | 0/20 [00:00<?, ?it/s]

***** Running training *****
  Num examples = 640
  Num epochs = 1
  Total optimization steps = 40
  Total train batch size = 16


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/40 [00:00<?, ?it/s]

📡 https://console.dev.rungalileo.io
🔭 Logging you into Galileo

Go to https://console.dev.rungalileo.io/get-token to generate a new API Key
🔐 Enter your API Key:··········
🚀 You're logged in to Galileo as franz@rungalileo.io!
✨ Initializing existing public project 'text_classification_with_setfit'
🏃‍♂️ Fetching existing run 'set_fit_sst2'
🛰 Connected to existing project 'text_classification_with_setfit', and existing run 'set_fit_sst2'.




🚀 Found existing run labels. Setting labels for run to ['negative', 'positive']. You do not need to set labels for this run.
Logging 16 samples [########################################] 100.00% elapsed time  :     0.01s =  0.0m =  0.0h
Logging 872 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 

Applying column mapping to evaluation dataset
***** Running evaluation *****


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Logging 1821 samples [########################################] 100.00% elapsed time  :     0.01s =  0.0m =  0.0h
 ☁️ Uploading Data
CuML libraries not found, running standard process. For faster Galileo processing, consider installing
`pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/`


training:   0%|          | 0/1 [00:00<?, ?it/s]

Creating and uploading data embeddings for training


Uploading data to Galileo:   0%|          | 0.00/36.5k [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/1 [00:00<?, ?it/s]

training (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/60.5k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/13.4k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/17.5k [00:00<?, ?B/s]

validation:   0%|          | 0/1 [00:00<?, ?it/s]

Creating and uploading data embeddings for validation


Uploading data to Galileo:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/55 [00:00<?, ?it/s]

validation (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/2.57M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/33.0k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/136k [00:00<?, ?B/s]

test:   0%|          | 0/1 [00:00<?, ?it/s]

Creating and uploading data embeddings for test


Uploading data to Galileo:   0%|          | 0.00/2.69M [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/29 [00:00<?, ?it/s]

test (epoch=0):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/5.35M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/55.2k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/249k [00:00<?, ?B/s]

Job default successfully submitted. Results will be available soon at https://console.dev.rungalileo.io/insights?projectId=3594e1b6-4c58-4702-b3ed-e458018528ce&runId=4db70894-c403-48e9-a1cc-9a58431ce3f7&taskType=0&split=training
Waiting for job (you can safely close this window)...
	Applying dimensionality reduction to embs
	Applying dimensionality reduction to data embs
	Measuring class overlap
	[training] 👀 Looking for data anomalies
	Finding semantic clusters for validation
	Saving processed validation data
	[validation] 👀 Looking for data anomalies
	Finding semantic clusters for test
	Saving processed test data
	[test] 👀 Looking for data anomalies
Done! Job finished with status completed
Click here to see your run! https://console.dev.rungalileo.io/insights?projectId=3594e1b6-4c58-4702-b3ed-e458018528ce&runId=4db70894-c403-48e9-a1cc-9a58431ce3f7&taskType=0&split=training
🧹 Cleaning up
🧹 Cleaning up


'https://console.dev.rungalileo.io/insights?projectId=3594e1b6-4c58-4702-b3ed-e458018528ce&runId=4db70894-c403-48e9-a1cc-9a58431ce3f7&taskType=0&split=training'
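
The training log above reports 640 examples and 40 optimization steps even though only 16 sentences were sampled. These numbers are consistent with contrastive pair generation in which every sentence yields one positive and one negative pair per iteration. That reading of the log is an assumption and is not stated in the output itself. Under that assumption, the arithmetic works out as follows:

```python
import math

num_train_sentences = 16              # "Logging 16 samples" in the log
num_iterations = 20                   # num_iterations passed to SetFitTrainer
pairs_per_sentence_per_iteration = 2  # assumption: one positive + one negative pair
batch_size = 16                       # "Total train batch size = 16"
num_epochs = 1                        # "Num epochs = 1"

# Total sentence pairs used for contrastive fine-tuning.
num_pairs = num_train_sentences * num_iterations * pairs_per_sentence_per_iteration
# Number of optimizer steps needed to see every pair once per epoch.
steps = math.ceil(num_pairs / batch_size) * num_epochs
print(num_pairs, steps)  # 640 40
```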

# 4. Model Inference from pretrained model
Next, we run inference with the trained model and log the predictions with dataquality. First, import the necessary modules and load the trained model.
Then initialize Galileo, making sure the project name and run name are identical to those of your training run:
```python
import dataquality as dq

dq.init(task_type="text_classification",
        project_name=project_name,
        run_name=run_name)
dq.set_labels_for_run(labels)
```
The ```watch``` function returns an evaluation function for inference, which takes the dataset as input.
```python
from dataquality.integrations.setfit import watch
dq_evaluate = watch(model)
preds = dq_evaluate(
    dataset,
    split="inference",
    inference_name="inference_test",
    column_mapping=column_mapping)
# Finalizing the inference
dq.finish()
```
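
The notebook cell that follows appends a short random suffix to the inference name, presumably so that repeated executions do not collide with an earlier inference run of the same name. The motivation is an assumption, since the notebook does not state it. Here is a minimal sketch of that naming pattern, with a hypothetical helper `make_inference_name`:

```python
import uuid


def make_inference_name(prefix="inference_test", suffix_length=4):
    """Build an inference name with a short random hexadecimal suffix."""
    # uuid4().hex is a 32-character lowercase hex string; keep the first few characters.
    suffix = uuid.uuid4().hex[:suffix_length]
    return f"{prefix}_{suffix}"


name_a = make_inference_name()
name_b = make_inference_name()
print(name_a, name_b)
```

A four-character suffix gives 65,536 possible values, so collisions remain possible across many runs. A longer suffix reduces that risk.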


In [5]:
# skip ci: true
from setfit import SetFitModel
import uuid

model = SetFitModel.from_pretrained("./trained_model")

# 🔭🌕 Galileo logging
dq.init(task_type="text_classification",
        project_name=project_name,
        run_name=run_name)

labels = dataset["train"].features["label"].names

# 🔭🌕 Galileo logging
dq.set_labels_for_run(labels)
dq_evaluate = watch(model,
                    batch_size=512 # Speed up prediction
                    )

column_mapping = {"sentence": "text", "label": "label", "idx": "id"}
print("Starting inference")
preds = dq_evaluate(
    dataset["test"],
    split="inference",
    inference_name=f"inference_test_{uuid.uuid4().hex[:4]}",
    column_mapping=column_mapping)
# Finalizing the inference
dq.finish()

✨ Initializing existing public project 'text_classification_with_setfit'
🏃‍♂️ Fetching existing run 'set_fit_sst2'
🛰 Connected to existing project 'text_classification_with_setfit', and existing run 'set_fit_sst2'.




🚀 Found existing run labels. Setting labels for run to ['negative', 'positive']. You do not need to set labels for this run.
Logging 1821 samples [########################################] 100.00% elapsed time  :     0.00s =  0.0m =  0.0h
 ☁️ Uploading Data
CuML libraries not found, running standard process. For faster Galileo processing, consider installing
`pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/`


inference:   0%|          | 0/1 [00:00<?, ?it/s]

Creating and uploading data embeddings for inference/inference_test


Uploading data to Galileo:   0%|          | 0.00/2.69M [00:00<?, ?B/s]

Processing data for upload:   0%|          | 0/29 [00:00<?, ?it/s]

inference (inf_name=inference_test):   0%|          | 0/3 [00:00<?, ?it/s]

Uploading data to Galileo:   0%|          | 0.00/5.35M [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/41.0k [00:00<?, ?B/s]

Uploading data to Galileo:   0%|          | 0.00/292k [00:00<?, ?B/s]

Job inference successfully submitted. Results will be available soon at https://console.dev.rungalileo.io/insights?projectId=3594e1b6-4c58-4702-b3ed-e458018528ce&runId=4db70894-c403-48e9-a1cc-9a58431ce3f7&taskType=0&split=inference&inferenceName=inference_test&sortBy=confidence&groupedBy=pred&sortDirection=asc
Waiting for job (you can safely close this window)...
	Building visualization of embeddings
	Finding semantic clusters for inference
	Found data embeddings. Analyzing embedding dimensions
	Saving processed inference data
	[inference] 👀 Looking for data anomalies
Done! Job finished with status completed
Click here to see your run! https://console.dev.rungalileo.io/insights?projectId=3594e1b6-4c58-4702-b3ed-e458018528ce&runId=4db70894-c403-48e9-a1cc-9a58431ce3f7&taskType=0&split=inference&inferenceName=inference_test&sortBy=confidence&groupedBy=pred&sortDirection=asc
🧹 Cleaning up
🧹 Cleaning up


'https://console.dev.rungalileo.io/insights?projectId=3594e1b6-4c58-4702-b3ed-e458018528ce&runId=4db70894-c403-48e9-a1cc-9a58431ce3f7&taskType=0&split=inference&inferenceName=inference_test&sortBy=confidence&groupedBy=pred&sortDirection=asc'