#### Using a Zero-Shot Model to Classify Data

This code shows how to use a zero-shot classifier as a quick and easy way to assign labels of interest to data. The classifier treats text classification as a **textual entailment problem**: each candidate label is turned into a hypothesis, and the model estimates whether the text entails it. More information on zero-shot classifiers can be found in [Yin et al., 2019](https://arxiv.org/abs/1909.00161).
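
To make the entailment framing concrete, here is a minimal sketch of how a text and its candidate labels can be turned into premise/hypothesis pairs, and how an entailment probability can be derived from two logits. It assumes the hypothesis template `This example is {}.` and a softmax over the entailment and contradiction logits for each label; the helper names and the logit values are hypothetical and chosen only for illustration.

```python
import math


def build_nli_pairs(text, labels, template="This example is {}."):
    """Pair the text (premise) with one hypothesis per candidate label."""
    return [(text, template.format(label)) for label in labels]


def entailment_probability(entailment_logit, contradiction_logit):
    """Softmax over the entailment and contradiction logits, keeping entailment."""
    return 1.0 / (1.0 + math.exp(contradiction_logit - entailment_logit))


pairs = build_nli_pairs(
    "Forest fire near La Ronge Sask. Canada", ["disaster", "forest"]
)
for premise, hypothesis in pairs:
    print(premise, "->", hypothesis)

# Hypothetical logits, used only to illustrate the arithmetic
print(round(entailment_probability(2.0, -1.0), 3))
```

Because each label gets its own premise/hypothesis pair, the scores for different labels are independent and do not need to sum to 1.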

In this project, I use the [facebook/bart-large-mnli model](https://huggingface.co/tasks/zero-shot-classification) and 200 tweets from the [Kaggle](https://www.kaggle.com/competitions/nlp-getting-started/data) disaster dataset. The tweets are already labeled, so the labels serve as a reference point for evaluating model performance on unseen data.

I use this model to assign two labels simultaneously: whether the document is about a disaster and whether it is about a forest.

#### Part 1: Load Model, Data and Run Zero-Shot Classifier

In [1]:
import pandas as pd
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")



In [2]:
df = pd.read_csv("disaster_data_labeled.csv", index_col = "id", nrows=200)
df.head(2)

id,text,disaster_true
1,Our Deeds are the Reason of this #earthquake M...,1
4,Forest fire near La Ronge Sask. Canada,1


In [3]:
candidate_labels = ['disaster', 'forest']

In [4]:
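# Define a function that scores one row against every candidate label
def apply_classifier(row):
    result = classifier(row['text'], candidate_labels, multi_label=True)
    # The pipeline returns labels sorted by descending score, so look the
    # scores up by label to keep them in the order of candidate_labels
    score_by_label = dict(zip(result['labels'], result['scores']))
    return [score_by_label[label] for label in candidate_labels]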
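
The zero-shot pipeline returns its `labels` and `scores` sorted from highest to lowest score, not in the order of `candidate_labels`, so scores are safest to look up by label. A minimal sketch of that lookup, using a hand-written dictionary in place of a real pipeline result (the helper name, the text, and the numbers are hypothetical, not model output):

```python
def scores_in_label_order(result, labels):
    """Return the scores reordered to match the given label order."""
    score_by_label = dict(zip(result["labels"], result["scores"]))
    return [score_by_label[label] for label in labels]


# Hand-written stand-in for a pipeline result; "forest" comes first
# because it has the higher score
example_result = {
    "sequence": "Tall pines as far as the eye can see",
    "labels": ["forest", "disaster"],
    "scores": [0.91, 0.04],
}

aligned = scores_in_label_order(example_result, ["disaster", "forest"])
print(aligned)
```

Without the lookup, the first score would land in the disaster column whenever the forest score is the larger of the two.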

In [5]:
# Apply the classifier to every row and expand the two scores into new columns
df[['disaster_pred_prob', 'forest_pred_prob']] = df.apply(apply_classifier, axis=1, result_type='expand')
df.head(3)

id,text,disaster_true,disaster_pred_prob,forest_pred_prob
1,Our Deeds are the Reason of this #earthquake M...,1,0.98684,0.001377
4,Forest fire near La Ronge Sask. Canada,1,0.949915,0.557649
5,All residents asked to 'shelter in place' are ...,1,0.081987,0.033703


#### Part 2: Explore Misclassified Documents

In [6]:
df['disaster_pred_label'] = (df['disaster_pred_prob'] >= 0.5).astype(int)
df.head(3)

id,text,disaster_true,disaster_pred_prob,forest_pred_prob,disaster_pred_label
1,Our Deeds are the Reason of this #earthquake M...,1,0.98684,0.001377,1
4,Forest fire near La Ronge Sask. Canada,1,0.949915,0.557649,1
5,All residents asked to 'shelter in place' are ...,1,0.081987,0.033703,0


In [7]:
mismatch_count = (df['disaster_true'] != df['disaster_pred_label']).sum()
print("Number of mismatching rows:", mismatch_count)

Number of mismatching rows: 40


In [12]:
mismatched_rows = df[df['disaster_true'] != df['disaster_pred_label']]
print("Mismatching rows:")
mismatched_rows.head(5)

Mismatching rows:


id,text,disaster_true,disaster_pred_prob,forest_pred_prob,disaster_pred_label
5,All residents asked to 'shelter in place' are ...,1,0.081987,0.033703,0
8,#RockyFire Update => California Hwy. 20 closed...,1,0.379244,0.020025,0
28,What a goooooooaaaaaal!!!!!!,0,0.644796,0.261415,1
31,this is ridiculous....,0,0.981207,0.142747,1
37,No way...I can't eat that shit,0,0.983586,0.314536,1


Thus, the zero-shot classifier **correctly** identified the disaster label in **160 out of 200 cases** (80% accuracy), which is a strong result given that the model has never seen this data and no fine-tuning was performed!
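
The mismatched rows mix two kinds of error: real disasters scored below 0.5 (false negatives) and unrelated tweets scored above it (false positives). A minimal sketch of how the two could be counted separately, using a small hand-made list of labels rather than the notebook's data (the function names and toy values are hypothetical):

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision and recall from the confusion counts."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels for illustration only
toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1]

counts = confusion_counts(toy_true, toy_pred)
metrics = summarize(counts)
print(counts)
print(metrics)
```

With the notebook's DataFrame, the same function could be called as `confusion_counts(df['disaster_true'], df['disaster_pred_label'])`.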