# Text Classification using SetFit and 🔭 Galileo

In this tutorial, we'll train a model with SetFit and explore the results in Galileo.

**Make sure to select GPU in your Runtime! (Runtime -> Change Runtime type)**

### Install Galileo 🔭🌕

Simply run ```pip install dataquality```

In [None]:
%%capture

%pip install dataquality
%pip install setfit transformers -q

Import `dataquality` and set the project name and run name.

In [None]:
#@title 1. Initialize Galileo
# 🔭🌕 Galileo setup
import dataquality as dq

project_name="text_classification_with_setfit"
run_name="set_fit_sst2"



In [None]:
#@title 2. Data preparation
#@markdown Use Hugging Face 🤗 Datasets for training SetFit
from datasets import load_dataset
from setfit import sample_dataset

dataset_id = "sst2"
dataset = load_dataset(dataset_id)

train_dataset = sample_dataset(dataset["train"], num_samples=8)
eval_dataset = dataset["validation"] 
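
To make the few-shot setup concrete: `sample_dataset(..., num_samples=8)` draws a small, fixed number of examples per class. The following is a minimal plain-Python sketch of that idea on toy rows, not SetFit's actual implementation. The helper name `sample_per_class` and the toy data are made up for illustration.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Return up to `num_samples` rows per label (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the rows by their label value.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # Draw the same number of rows from every class (fewer if a class is small).
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        sampled.extend(rng.sample(group, min(num_samples, len(group))))

    rng.shuffle(sampled)  # mix the classes so they are not grouped together
    return sampled


# Toy stand-in for a binary sentiment dataset: 50 rows per class.
toy_rows = [{"sentence": f"example {i}", "label": i % 2, "idx": i} for i in range(100)]
few_shot = sample_per_class(toy_rows, num_samples=8)
print(len(few_shot))  # 2 classes x 8 samples = 16
```

With two classes and `num_samples=8`, the training set in this tutorial therefore contains only 16 sentences, while the full validation split is kept for evaluation.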


# 3. Training the model with dataquality integration
Integrate dataquality with the SetFit Trainer:
```python
from dataquality.integrations.setfit import watch
watch(trainer, labels=labels, project_name=project_name, run_name=run_name)
```
After the Trainer is created, the ```watch``` function of dataquality hooks into the model.

SetFit only runs with training and validation data. If you want to log test data, pass ```finish=False``` to the ```watch``` function and follow this guide. You can also pass inference data to the ```dq_evaluate``` function.


In [None]:
# 🔭🌕 Galileo logging
import dataquality as dq
from dataquality.integrations.setfit import watch

from setfit import SetFitModel
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitTrainer


model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)
# Map the SST-2 column names to the names expected by the trainer and Galileo
column_mapping = {"sentence": "text", "label": "label", "idx": "id"}
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    column_mapping=column_mapping,
)
labels = dataset["train"].features["label"].names

# 🔭🌕 Galileo logging
watch(trainer, labels=labels,
      project_name=project_name, run_name=run_name,
      # 🔭🌕 Set finish to False to add test
      finish=False
      )

trainer.train()

model.save_pretrained("./trained_model")

metrics = trainer.evaluate()
# 🔭🌕 Galileo logging for custom split (inference)
dq_evaluate = watch(model)
preds = dq_evaluate(
    dataset["test"],
    split="test",
    column_mapping=column_mapping
    # for inference set the split to inference
    # and pass an inference_name="inference_run_1"
    )
dq.finish()
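
The `column_mapping` used above renames the SST-2 columns (`sentence`, `label`, `idx`) to the names the trainer and the Galileo logging expect (`text`, `label`, `id`). The sketch below shows the effect on a single row in plain Python. The helper `apply_column_mapping` and the example row are hypothetical, and the real libraries may treat unmapped columns differently.

```python
def apply_column_mapping(row, column_mapping):
    """Rename the keys of one row (hypothetical helper, for illustration only).

    In this sketch, columns that are missing from the mapping are dropped.
    """
    return {new_name: row[old_name] for old_name, new_name in column_mapping.items()}


# Same mapping as in the training cell: dataset column -> expected column.
column_mapping = {"sentence": "text", "label": "label", "idx": "id"}

# A made-up row in the shape of an SST-2 record.
raw_row = {"idx": 0, "sentence": "a made-up review sentence", "label": 1}

mapped_row = apply_column_mapping(raw_row, column_mapping)
print(mapped_row)  # keys are now: text, label, id
```

If your own dataset already uses `text`, `label`, and `id` as column names, the mapping becomes an identity mapping.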

# 4. Model Inference from pretrained model
This section runs inference with the trained model and logs the predictions with dataquality. First import the necessary modules, then load the trained model.
Next, initialize Galileo. Make sure the project name and run name are identical to those of your training run:
```python
import dataquality as dq

dq.init(task_type="text_classification",
        project_name=project_name,
        run_name=run_name)
dq.set_labels_for_run(labels)
```
The ```watch``` function returns the evaluation function for the inference process, which takes the dataset as input.
```python
from dataquality.integrations.setfit import watch
dq_evaluate = watch(model)
preds = dq_evaluate(
    dataset["test"],
    split="inference",
    inference_name="inference_test",
    column_mapping=column_mapping)
# Finalizing the inference
dq.finish()
```


In [None]:
import dataquality as dq
from dataquality.integrations.setfit import watch
from setfit import SetFitModel

model = SetFitModel.from_pretrained("./trained_model")

# 🔭🌕 Galileo logging
dq.init(task_type="text_classification",
        project_name=project_name,
        run_name=run_name)

labels = dataset["train"].features["label"].names

# 🔭🌕 Galileo logging
dq.set_labels_for_run(labels)
dq_evaluate = watch(model)
preds = dq_evaluate(
    dataset["test"],
    split="inference",
    inference_name="inference_test",
    column_mapping={
        "sentence": "text",
        "label": "label",
        "idx": "id",
    },
)
# Finalizing the inference
dq.finish()