Another approach to this problem is zero-shot classification, which scores each input text against a set of pre-defined labels and picks the closest match, without training on any labelled examples. It can potentially outperform classical ML models because it leverages language models pre-trained on large and diverse datasets. It also saves the time otherwise spent on feature engineering, and we can still fine-tune these models on our own data to potentially improve performance further. This notebook demonstrates how it can work.

In [1]:
# Install additional packages
!pip install -q numpy==1.26.4 torch==2.2.0 transformers==4.57.1

In [2]:
# Define zero shot classifier class
from transformers import pipeline
from picnic_topic_prediction.config import LABEL_MAPPING

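class ZeroShotClassifier:
    """Zero-shot topic classifier built on a pre-trained NLI model."""

    def __init__(self):
        self.pipeline = pipeline(
            task="zero-shot-classification",
            model="sileod/deberta-v3-base-tasksource-nli",
        )
        # Map each candidate label ("<topic> News") back to its key in LABEL_MAPPING.
        self.reverse_map = {f"{v} News": k for k, v in LABEL_MAPPING.items()}

    def predict(self, X):
        """Return the predicted LABEL_MAPPING key for each text in the list X."""
        output = self.pipeline(
            X, candidate_labels=list(self.reverse_map), multi_label=False
        )
        # Labels come back sorted by score, so index 0 is the top prediction.
        return [self.reverse_map[item["labels"][0]] for item in output]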
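
To make the scoring step more concrete: the `zero-shot-classification` pipeline turns each candidate label into a hypothesis sentence (by default "This example is {}.") and asks the NLI model how strongly the input text entails it. With `multi_label=False`, the entailment logits are then normalised with a softmax across the labels. The sketch below reproduces only that last step in plain Python. The labels and logits are made-up numbers, and `build_hypotheses` and `rank_labels` are hypothetical helpers for illustration, not part of `transformers`.

```python
import math

# Default hypothesis template of the transformers zero-shot pipeline.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_hypotheses(labels):
    """Turn each candidate label into the sentence the NLI model scores."""
    return [HYPOTHESIS_TEMPLATE.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    max_logit = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical labels and made-up logits, for illustration only.
labels = ["World News", "Sports News", "Business News"]
logits = [0.3, 2.1, -0.5]

hypotheses = build_hypotheses(labels)
ranked = rank_labels(labels, logits)
print(hypotheses)
print(ranked[0][0])  # label with the highest score
```

Because the scores are normalised across the candidate labels, they always sum to 1, which fits a single-topic task like this one.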



To run this locally on CPU, I picked a small model (about 200M parameters). Even without any fine-tuning, it reaches an accuracy of 0.84 and a macro F1 of about 0.83 on the test set.

In [3]:
from picnic_topic_prediction.utils import load_data
from sklearn.metrics import accuracy_score, f1_score

data = load_data('test')
X = data['text']
y_true = data['label']
y_pred = ZeroShotClassifier().predict(X.to_list())

print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average='macro'))

Device set to use cpu
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


0.84
0.8274249943604781
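
The macro F1 is slightly lower than the accuracy, which may indicate that some classes are predicted less reliably than others: macro averaging gives every class the same weight regardless of how many examples it has. The sketch below shows the calculation on toy labels. The data is made up and does not come from the test set, and `accuracy` and `macro_f1` are simplified stand-ins for the scikit-learn functions used above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: class 0 is frequent and always found, classes 1 and 2 are rare
# and each loses one example to class 0.
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_macro_f1 = macro_f1(toy_true, toy_pred)
print(toy_accuracy)
print(toy_macro_f1)
```

Here the rare classes pull the macro F1 below the accuracy. A per-class breakdown on the real test set would show whether something similar is happening with this model.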
