# Text Classification using SetFit and 🔭 Galileo

In this tutorial, we'll train a model with SetFit and explore the results in Galileo.

**Make sure to select GPU in your Runtime! (Runtime -> Change Runtime type)**

### Install Galileo 🔭🌕

Simply run ```pip install dataquality```

In [None]:
%%capture

%pip install dataquality
%pip install setfit transformers -q
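
To confirm that the runtime really has a GPU attached, a quick check like the one below can help. This is a minimal sketch that assumes PyTorch is the backend in use; the helper name `gpu_available` is hypothetical and not part of Galileo or SetFit.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # assumption: PyTorch is the deep learning backend
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if gpu_available():
    print("GPU detected.")
else:
    print("No GPU detected: select one via Runtime -> Change Runtime type.")
```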

Import dataquality and set the project name and run name.

In [None]:
#@title 1. Initialize Galileo
# 🔭🌕 Galileo setup
import dataquality as dq

project_name="text_classification_with_setfit"
run_name="set_fit_sst2"

Welcome to Galileo Cloud v0.8.40!


In [None]:
#@title 2. Data preparation
#@markdown Use Hugging Face 🤗 Datasets for training SetFit
from datasets import load_dataset
from setfit import sample_dataset

dataset_id = "sst2"
dataset = load_dataset(dataset_id)

train_dataset = sample_dataset(dataset["train"], num_samples=8)
eval_dataset = dataset["validation"] 

Downloading and preparing dataset sst2/default to /root/.cache/huggingface/datasets/sst2/default/2.0.0/9896208a8d85db057ac50c72282bcb8fe755accc671a57dd8059d4e130961ed5...

Generating train split: 67349 examples
Generating validation split: 872 examples
Generating test split: 1821 examples

Dataset sst2 downloaded and prepared to /root/.cache/huggingface/datasets/sst2/default/2.0.0/9896208a8d85db057ac50c72282bcb8fe755accc671a57dd8059d4e130961ed5. Subsequent calls will reuse this data.
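
As far as we understand it, `sample_dataset` draws `num_samples` examples per class, so the two SST-2 classes give a few-shot training set of 16 sentences. The idea can be sketched in plain Python as follows. This is an illustration on toy data, not SetFit's implementation; `sample_per_class` and `toy_rows` are hypothetical names.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Pick up to `num_samples` rows for each distinct label."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)  # shuffle within the class, then take the first few
        sampled.extend(group[:num_samples])
    return sampled


# Toy data standing in for SST-2: 50 rows with label 0 and 50 rows with label 1.
toy_rows = [{"sentence": f"example {i}", "label": i % 2} for i in range(100)]
few_shot = sample_per_class(toy_rows, num_samples=8)
print(len(few_shot))  # 16: 8 rows for each of the 2 labels
```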

# 3. Training the model with dataquality integration
Integrate dataquality with the SetFit Trainer:
```python
from dataquality.integrations.setfit import watch
watch(trainer, labels=labels, project_name=project_name, run_name=run_name)
```
Once the Trainer is initialized, dataquality's ```watch``` function hooks into the model.
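
To give an intuition for what "hooking into" a model can mean, here is a generic sketch of the wrapping pattern: a method is replaced by a wrapper that records each call before returning the original result. This is not how dataquality is implemented; `watch_method` and `ToyModel` are hypothetical names used only for illustration.

```python
import functools


def watch_method(obj, method_name, log):
    """Wrap `obj.method_name` so every call's inputs and output are appended to `log`."""
    original = getattr(obj, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        result = original(*args, **kwargs)
        log.append({"args": args, "kwargs": kwargs, "result": result})
        return result

    setattr(obj, method_name, wrapper)


class ToyModel:
    """Hypothetical stand-in for a model; not a SetFit class."""

    def predict(self, texts):
        # Dummy rule: the class id is the parity of the text length.
        return [len(t) % 2 for t in texts]


records = []
toy_model = ToyModel()
watch_method(toy_model, "predict", records)
toy_model.predict(["good movie", "bad"])
print(records[0]["result"])  # [0, 1]
```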


In [None]:
# 🔭🌕 Galileo logging
from dataquality.integrations.setfit import watch

from setfit import SetFitModel
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitTrainer

model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    column_mapping={"sentence": "text", "label": "label", "idx": "id"},
)
labels = dataset["train"].features["label"].names
# 🔭🌕 Galileo logging
watch(trainer, labels=labels,
      project_name=project_name, run_name=run_name)

trainer.train()

model.save_pretrained("./trained_model")

metrics = trainer.evaluate()
metrics
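
Assuming the trainer's default metric (accuracy) was not overridden, `metrics` is a dictionary with an `accuracy` entry. As a reminder of what that number means, here is a minimal sketch of the computation on toy predictions; the `accuracy` helper and the toy values are illustrative, not taken from this run.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy example: 3 of the 4 predictions match the reference labels.
toy_metrics = {"accuracy": accuracy([1, 0, 1, 1], [1, 0, 0, 1])}
print(toy_metrics)  # {'accuracy': 0.75}
```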

# 4. Model Inference
Next, we run inference with the trained model and log the predictions with dataquality. We first import the necessary modules and load the trained model.
Then we initialize Galileo:
```python
import dataquality as dq

dq.init(task_type="text_classification",
        project_name=project_name,
        run_name=run_name)
```
Make sure the project name and run name are identical to those of your training run. Set the labels with:
```python
dq.set_labels_for_run(labels)
```
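
The order of `labels` matters, because each integer class id is read as an index into that list. The sketch below shows the lookup on toy values, assuming the SST-2 label names are `negative` and `positive` in that order; `ids_to_names` and `toy_label_names` are hypothetical names.

```python
def ids_to_names(class_ids, label_names):
    """Translate integer class ids into their label names."""
    return [label_names[i] for i in class_ids]


# Assumed label names, listed in class-id order (0 -> negative, 1 -> positive).
toy_label_names = ["negative", "positive"]
print(ids_to_names([1, 0, 1], toy_label_names))  # ['positive', 'negative', 'positive']
```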
We use the ```watch``` function to monitor the model during inference:
```python
from dataquality.integrations.setfit import watch

# Wrap the trained model; the returned callable runs and logs predictions
dq_evaluate = watch(model)
preds = dq_evaluate(
    dataset["test"],
    split="inference",
    inference_name="inference_test",
    column_mapping={
        "sentence": "text",
        "label": "label",
        "idx": "id",
    },
)
# Finalizing the inference
dq.finish()
```
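
The `column_mapping` argument maps the dataset's own column names (`sentence`, `label`, `idx` for SST-2) to the names the integration expects (`text`, `label`, `id`). A minimal sketch of that renaming on a single row is shown below. It is a simplification that keeps only the mapped columns, and `apply_column_mapping` and the sample row are hypothetical.

```python
def apply_column_mapping(row, column_mapping):
    """Rename the keys of `row` according to `column_mapping`, keeping only mapped columns."""
    return {new: row[old] for old, new in column_mapping.items()}


column_mapping = {"sentence": "text", "label": "label", "idx": "id"}
# Made-up row in the SST-2 column layout.
sst2_row = {"idx": 0, "sentence": "a charming film", "label": 1}
print(apply_column_mapping(sst2_row, column_mapping))
# {'text': 'a charming film', 'label': 1, 'id': 0}
```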


In [None]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained("./trained_model")

# 🔭🌕 Galileo logging
dq.init(task_type="text_classification",
        project_name=project_name,
        run_name=run_name)

labels = dataset["train"].features["label"].names

# 🔭🌕 Galileo logging
dq.set_labels_for_run(labels)
dq_evaluate = watch(model)
preds = dq_evaluate(
    dataset["test"],
    split="inference",
    inference_name="inference_test",
    column_mapping={
        "sentence": "text",
        "label": "label",
        "idx": "id",
    },
)
dq.finish()

# Galileo with training, validation and test data
SetFit only runs with training and validation data. If you want to log test data as well, pass ```finish=False``` to the ```watch``` function and follow the steps below.

In [None]:
# 🔭🌕 Galileo logging

import dataquality as dq
from dataquality.integrations.setfit import watch

from setfit import SetFitModel
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitTrainer


model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    column_mapping={"sentence": "text", "label": "label", "idx": "id"},
)
labels = dataset["train"].features["label"].names

# 🔭🌕 Galileo logging
watch(trainer, labels=labels,
      project_name=project_name, run_name=run_name,
      # 🔭🌕 Set finish to False to add test
      finish=False
      )

trainer.train()

model.save_pretrained("./trained_model")

metrics = trainer.evaluate()
metrics

In [None]:
# 🔭🌕 Galileo logging for custom split (test)
dq_evaluate = watch(model)
preds = dq_evaluate(
    dataset["test"],
    split="test",
    column_mapping={
        "sentence": "text",
        "label": "label",
        "idx": "id",
    },
)

In [None]:
dq.finish()

☁️ Uploading Data
CuML libraries not found, running standard process. For faster Galileo processing, consider installing
`pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/`


Job default successfully submitted. Results will be available soon at https://console.dev.rungalileo.io/insights?projectId=96783a2e-a920-40bd-a676-be933f1b20e2&runId=e061c477-9382-4385-8ed9-85ba54757433&split=training&metric=f1&depHigh=1&depLow=0&taskType=0
Waiting for job (you can safely close this window)...
	Applying dimensionality reduction to embs
	Applying dimensionality reduction to data embs
	Calculating training data error potential
	Saving processed training data
	Finding semantic clusters for validation
	Uploading processed validation data
	Finding semantic clusters for test
	Saving processed test data
	Uploading processed test data
Done! Job finished with status completed
Click here to see your run! https://console.dev.rungalileo.io/insights?projectId=96783a2e-a920-40bd-a676-be933f1b20e2&runId=e061c477-9382-4385-8ed9-85ba54757433&split=training&metric=f1&depHigh=1&depLow=0&taskType=0
🧹 Cleaning up