Transformer-based zero-shot text classification model from Hugging Face for predicting NLP topic classes

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Resources

- [Blog post](https://medium.com/@AmyGrabNGoInfo/zero-shot-topic-modeling-with-deep-learning-using-python-a895d2d0c773) for this tutorial
- Video version of the tutorial on [YouTube](https://www.youtube.com/watch?v=XrDqMG7zouE&list=PLVppujud2yJpx5r8GFeJ81fyek8dEDMX-&index=9)
- More video tutorials on [NLP](https://www.youtube.com/playlist?list=PLVppujud2yJpx5r8GFeJ81fyek8dEDMX-)
- More blog posts on [NLP](https://medium.com/@AmyGrabNGoInfo/list/nlp-49340193610f)


For more information about data science and machine learning, please check out my [YouTube channel](https://www.youtube.com/@grabngoinfo), [Medium Page](https://medium.com/@AmyGrabNGoInfo) and [GrabNGoInfo.com](https://grabngoinfo.com/tutorials/), or follow GrabNGoInfo on [LinkedIn](https://www.linkedin.com/company/grabngoinfo/).

# Intro

Zero-shot learning (ZSL) refers to using a model to make predictions on tasks that it was not explicitly trained to do. For example, if we would like to classify millions of news articles into different topics, building a traditional multi-class classification model would be very costly because manually labeling the news topics takes a lot of time.

Zero-shot text classification is able to make class predictions without explicitly building a supervised classification model using a labeled dataset. This tutorial will use an Amazon review dataset to illustrate how to build a zero-shot topic model using Hugging Face's zero-shot text classification model. We will talk about:
* What's the algorithm behind the zero-shot text classification model?
* How to install and import Hugging Face libraries for the zero-shot text classification model?
* How to implement zero-shot topic modeling for single-topic and multiple-topic predictions?
* What to do if there is no list of topic labels for the prediction? 

Let's get started!

# Step 0: Zero-shot Topic Modeling Algorithm

In step 0, we will talk about the model algorithm behind the zero-shot topic model.

Zero-shot topic modeling is a use case of zero-shot text classification for topic prediction. Zero-shot text classification relies on a Natural Language Inference (NLI) model, which compares two sequences, a premise and a hypothesis, to decide whether the premise entails the hypothesis, contradicts it, or is neutral (neither contradicts nor entails).

When using zero-shot topic modeling, we will have the text as the premise and the pre-defined candidate labels as hypotheses.
If the model predicts that a text document, such as a review, entails the hypothesis built from a candidate label, then the document is likely to belong to that topic. Otherwise, the document is not likely to belong to the topic.
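To make the premise/hypothesis framing concrete, here is a minimal plain-Python sketch of how one document and a list of candidate labels become NLI input pairs. The helper name `build_nli_pairs`, the sample text, and the two labels are made up for illustration; the real pipeline performs an equivalent step internally before calling the model.

```python
def build_nli_pairs(text, candidate_labels, hypothesis_template="This example is {}."):
    """Pair one premise (the document) with one hypothesis per candidate label."""
    return [(text, hypothesis_template.format(label)) for label in candidate_labels]

# Illustrative text and labels (not taken from the dataset)
pairs = build_nli_pairs(
    "My trays hurt for the first two days.",
    ["physical pain", "financial concerns"],
)

for premise, hypothesis in pairs:
    print(f"premise: {premise!r} -> hypothesis: {hypothesis!r}")
```

Each pair is then scored by the NLI model, and the entailment score for a pair is what drives the topic prediction for that label.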

# Step 1: Install And Import Python Libraries

In step 1, we will install and import Python libraries.

Firstly, let's install `transformers`.

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


After installing the Python package, we will import the Python libraries.
* `pandas` is imported for data processing.
* Hugging Face `pipeline` is imported from `transformers` for the zero-shot classification model.
 * `task` describes the task for the pipeline. The task name we use is `zero-shot-classification`.
 * `model` is the model name for the prediction used in the pipeline. You can find the full list of available models for zero-shot classification on the [Hugging Face website](https://huggingface.co/models?pipeline_tag=zero-shot-classification). At the time this tutorial was created in January 2023, `bart-large-mnli` by Facebook (Meta) was the model with the highest number of downloads and likes, so we will use it for the pipeline.
 * `device` defines the device the pipeline runs on. `device=0` means that we are using the first GPU. On a machine without a GPU, use `device=-1` to run on the CPU.
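If you are not sure whether a GPU is available, the device index can be chosen at runtime. The snippet below is a small sketch that assumes the PyTorch backend; the variable name `device` is only a suggestion, and its value would be passed as `device=device` when creating the pipeline.

```python
# Pick a device index for pipeline(..., device=...):
# 0 = first CUDA GPU, -1 = CPU.
try:
    import torch
    device = 0 if torch.cuda.is_available() else -1
except ImportError:
    # PyTorch is not installed in this environment; fall back to the CPU index
    device = -1

print("device =", device)
```

Running on CPU works for small samples, but a model of this size is noticeably slower there, so a GPU runtime is preferable for the full dataset.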


In [None]:
# Data processing
import pandas as pd

# Modeling
from transformers import pipeline
classifier = pipeline(task="zero-shot-classification", 
                      model="facebook/bart-large-mnli",
                      device=0) 

# Step 2: Download And Read Data

The second step is to download and read the dataset. 

The UCI Machine Learning Repository has the review data from three websites: imdb.com, amazon.com, and yelp.com. We will use the review data from amazon.com for this tutorial. Please follow these steps to download the data.
1. Go to: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
2. Click "Data Folder"
3. Download "sentiment labeled sentences.zip"
4. Unzip "sentiment labeled sentences.zip"
5. Copy the file "amazon_cells_labelled.txt" to your project folder

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.
* `drive.mount` is used to mount Google Drive so the Colab notebook can access the data stored there.
* `os.chdir` is used to change the working directory. I set it to the Google Drive folder where the dataset is saved.
* `!pwd` is used to print the current working directory.

Please check out [Google Colab Tutorial for Beginners](https://medium.com/towards-artificial-intelligence/google-colab-tutorial-for-beginners-834595494d44) for details about using Google Colab for data science projects. 

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/nlp")

# Print out the current directory
!pwd

Now let's read the data into a `pandas` dataframe and see what the dataset looks like.

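The file read below is an Excel file of posts from the `invisalign` subreddit rather than the Amazon file, but the same steps apply to either dataset. It has three columns: `SubmissionNumber`, `Subreddit`, and `Content`. The text to classify is in the `Content` column. The Amazon file additionally has a sentiment label column, which is not needed for topic modeling and can be dropped.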

In [None]:
# Read in data
invisalign_posts = pd.read_excel('/content/drive/MyDrive/invisalignKeyword_raw_withcomments.xlsx')

# Drop the label column (only applies to the Amazon file, which has a `label` column)
#amz_review = amz_review.drop('label', axis=1)

# Take a look at the data
invisalign_posts.head()

Unnamed: 0,SubmissionNumber,Subreddit,Content
0,1,invisalign,Did any of you ever get used to the feeling of...
1,2,invisalign,Started with metal braces and after a month of...
2,3,invisalign,Invisalign vs ceramic braces... Which would yo...
3,4,invisalign,How long do you leave your braces off to eat?I...
4,5,invisalign,Im only wearing my Invisalign 10-14hrs a day. ...


`.info` helps us to get information about the dataset. 

From the output, we can see that this dataset has 461 records and no missing data. The `Content` column, which holds the text, is the `object` type.

In [None]:
# Get the dataset information
invisalign_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 461 entries, 0 to 460
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   SubmissionNumber  461 non-null    int64 
 1   Subreddit         461 non-null    object
 2   Content           461 non-null    object
dtypes: int64(1), object(2)
memory usage: 10.9+ KB


# Step 3: Zero-shot Topic Prediction of a Single Topic

In step 3, we will use the zero-shot topic model to predict one topic for each text document.
* Firstly, the documents in the `Content` column are put into a list for the pipeline.
* Then, the candidate labels are defined. We set six candidate labels: `physical pain and oral dysfunction`, `oral hygiene`, `tray time management`, `mealtime, weight and eating disorder concerns`, `appearance and self-consciousness`, and `financial concerns`.
* After that, the hypothesis template is defined. The default template used by the Hugging Face pipeline is `This example is {}.`, and the code below keeps essentially that wording. A template that is more specific to the task, such as `The topic of this review is {}.`, can help to improve the results.
* Finally, the text, the candidate labels, and the hypothesis template are passed into the zero-shot classification pipeline called `classifier`.

The output is in a list format and we converted it into a Pandas dataframe. 




In [None]:
# Put reviews in a list
sequences = invisalign_posts['Content'].to_list()

#sequences = sequences[:25]

# Define the candidate labels 
#content_labels = ['hurt', 'pain', 'painful', 'lisp','pretty', 'beautiful', 'looks', 'conscious','afford', 'expensive', 'worth it', 'insurance', '$','eat', 'drink', 'meal', 'eating', 'weight','traytime', 'wear', 'forget','brush', 'floss', 'cavity']

content_labels = ["physical pain and oral dysfunction", "oral hygiene", "tray time management", "mealtime, weight and eating disorder concerns","appearance and self-consciousness","financial concerns"]

# Set the hypothesis template
hypothesis_template = "This example is {}"

# Prediction results
single_topic_prediction = classifier(sequences, content_labels, hypothesis_template=hypothesis_template)


print(single_topic_prediction)
# Save the output as a dataframe
single_topic_prediction = pd.DataFrame(single_topic_prediction)

# Take a look at the data
single_topic_prediction.head()

Streaming output truncated to the last 5000 lines.




Unnamed: 0,sequence,labels,scores
0,Did any of you ever get used to the feeling of...,"[self-consciousness, facial appearance, tray t...","[0.22282733023166656, 0.2184673249721527, 0.18..."
1,Started with metal braces and after a month of...,"[tray time management, physical pain and oral ...","[0.22406096756458282, 0.19505280256271362, 0.1..."
2,Invisalign vs ceramic braces... Which would yo...,"[self-consciousness, facial appearance, physic...","[0.2595674991607666, 0.13902278244495392, 0.13..."
3,How long do you leave your braces off to eat?I...,"[self-consciousness, mealtime, weight and eati...","[0.28950509428977966, 0.14555425941944122, 0.1..."
4,Im only wearing my Invisalign 10-14hrs a day. ...,"[self-consciousness, mealtime, weight and eati...","[0.2018447071313858, 0.16703923046588898, 0.15..."


In [None]:
single_topic_prediction.to_excel('invisalignPosts_ZSTP_classification_2_6.xlsx',index=False)

It is not uncommon to get an out-of-memory error when running the zero-shot classification model. To resolve the error, we can process the documents in smaller batches. `batch_size = 4` means that the model will process 4 text documents at a time. Below is sample code for your reference.

In [None]:
# Tune the batch_size to fit in the memory
batch_size = 4 

# Put reviews in a list
sequences = invisalign_posts['Content'].to_list()

# Define the candidate labels 
candidate_labels = ["physical pain and oral dysfunction", "oral hygiene", "tray time management", "mealtime, weight and eating disorder concerns","appearance and self-consciousness","financial concerns"]

# Set the hypothesis template
hypothesis_template = "This example is {}"

# Create an empty list to save the prediction results
single_topic_prediction = []

# Loop through the batches
for i in range(0, len(sequences), batch_size):
    # Append the results 
    single_topic_prediction += classifier(sequences[i:i+batch_size], candidate_labels, hypothesis_template=hypothesis_template)





In [None]:
single_topic_prediction = pd.DataFrame(single_topic_prediction)
single_topic_prediction.to_excel('invisalignPosts_improvised_multipred_batchsize4.xlsx',index=False)

By default (single-label mode), the scores for each document sum to 1, so they represent the relative relevance of each topic.

The first label in the labels list is the predicted topic for each document, and the first score in the scores list is the corresponding score. For example, in the output above, the second post (`Started with metal braces and after a month of...`) has the predicted topic `tray time management` with a score of about `0.22`, only slightly ahead of the runner-up label at about `0.20`, so the model does not strongly prefer one topic over the others for this post. Note that the score values are not absolute predicted probabilities of the topics; they represent only the relative probability among the given candidate labels.
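The reason the scores are relative can be sketched in a few lines of plain Python. This sketch assumes that single-label mode applies a softmax across the per-label entailment logits, which is how the pipeline is usually described; the label names and logit values below are made up and do not come from the model.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical entailment logits, one per candidate label (made-up numbers)
entailment_logits = {"label A": 1.2, "label B": 0.9, "label C": -0.5}

scores = dict(zip(entailment_logits, softmax(list(entailment_logits.values()))))
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for label, score in ranked:
    print(f"{label}: {score:.3f}")
print("sum of scores:", round(sum(scores.values()), 6))
```

Because the softmax normalizes across the candidate labels, adding or removing a label changes every score, even when the text stays the same.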

To make the prediction results easy to read and process, two new columns are created, one for the predicted topic and the other for the score of the predicted topic.

In [None]:
# The column for the predicted topic
single_topic_prediction['predicted_topic'] = single_topic_prediction['labels'].apply(lambda x: x[0])

# The column for the score of the predicted topic
single_topic_prediction['predicted_topic_score'] = single_topic_prediction['scores'].apply(lambda x: x[0])

# Take a look at the data
single_topic_prediction.head()

Unnamed: 0,sequence,labels,scores,predicted_topic,predicted_topic_score
0,Did any of you ever get used to the feeling of...,"[self-consciousness, facial appearance, tray t...","[0.22282733023166656, 0.2184673249721527, 0.18...",self-consciousness,0.222827
1,Started with metal braces and after a month of...,"[tray time management, physical pain and oral ...","[0.22406096756458282, 0.19505280256271362, 0.1...",tray time management,0.224061
2,Invisalign vs ceramic braces... Which would yo...,"[self-consciousness, facial appearance, physic...","[0.2595674991607666, 0.13902278244495392, 0.13...",self-consciousness,0.259567
3,How long do you leave your braces off to eat?I...,"[self-consciousness, mealtime, weight and eati...","[0.28950509428977966, 0.14555425941944122, 0.1...",self-consciousness,0.289505
4,Im only wearing my Invisalign 10-14hrs a day. ...,"[self-consciousness, mealtime, weight and eati...","[0.2018447071313858, 0.16703923046588898, 0.15...",self-consciousness,0.201845


In [None]:
single_topic_prediction.to_excel('invisalignPosts_ZSTP_classification_4_6.xlsx',index=False)

# Step 4: Zero-shot Topic Prediction of Multiple Topics

In step 4, we will use the zero-shot topic model to predict multiple topics. This is useful when one text document belongs to multiple topics, and we would like to assign one or more topics to a document. 

The syntax for multiple-topic prediction is similar to the code for single-topic prediction. The only difference is that we set `multi_label=True` to allow multiple-label predictions.

In multiple-topic prediction, each candidate label is scored independently of the others, so the scores do not sum to one anymore. Each score is a value between 0 and 1 indicating the probability of the document belonging to the corresponding topic.
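The independence of the multi-label scores can be illustrated with a small plain-Python sketch. It assumes that, with `multi_label=True`, each label's score is a softmax over that label's own contradiction and entailment logits; the function name `entailment_probability`, the label names, and the logit values are all made up for illustration.

```python
import math

def entailment_probability(contradiction_logit, entailment_logit):
    """Softmax over [contradiction, entailment], returning the entailment share."""
    m = max(contradiction_logit, entailment_logit)
    e_con = math.exp(contradiction_logit - m)
    e_ent = math.exp(entailment_logit - m)
    return e_ent / (e_con + e_ent)

# Hypothetical (contradiction, entailment) logits per label (made-up numbers)
logits = {"label A": (-1.0, 2.0), "label B": (0.5, 1.5), "label C": (2.0, -1.0)}

multi_scores = {
    label: entailment_probability(con, ent) for label, (con, ent) in logits.items()
}

for label, score in multi_scores.items():
    print(f"{label}: {score:.3f}")
print("sum of scores:", round(sum(multi_scores.values()), 3))
```

Here two labels can both receive a high score for the same document, which is what makes a per-label threshold meaningful.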

In [None]:
# Put reviews in a list
sequences = invisalign_posts['Content'].to_list()

# Define the candidate labels 
candidate_labels = ["physical pain and oral dysfunction", "oral hygiene", "tray time management", "mealtime, weight and eating disorder concerns","physical appearance and self-consciousness","financial concerns"]

# Set the hypothesis template
hypothesis_template = "This example is {}"

# Prediction results
multi_topic_prediction = classifier(sequences, candidate_labels, hypothesis_template=hypothesis_template, multi_label=True)

# Save the output in a dataframe
multi_topic_prediction = pd.DataFrame(multi_topic_prediction)

# Take a look at the data
multi_topic_prediction.head()

Unnamed: 0,sequence,labels,scores
0,Did any of you ever get used to the feeling of...,"[appearance and self-consciousness, oral hygie...","[0.9874378442764282, 0.9465792775154114, 0.942..."
1,Started with metal braces and after a month of...,"[appearance and self-consciousness, tray time ...","[0.8558685779571533, 0.7364727258682251, 0.687..."
2,Invisalign vs ceramic braces... Which would yo...,"[physical pain and oral dysfunction, mealtime,...","[0.6140117645263672, 0.6016679406166077, 0.507..."
3,How long do you leave your braces off to eat?I...,"[tray time management, oral hygiene, financial...","[0.8309887051582336, 0.779344379901886, 0.7139..."
4,Im only wearing my Invisalign 10-14hrs a day. ...,"[physical pain and oral dysfunction, financial...","[0.8025760054588318, 0.7888554334640503, 0.739..."


In [None]:
multi_topic_prediction_save = multi_topic_prediction
multi_topic_prediction_save.to_excel('invisalignPosts_improvised_multiple_pred.xlsx',index=False)

To assign multiple labels to a document, a threshold probability for the topic predictions is needed. We set `threshold = 0.8`, meaning that the labels with a predicted probability greater than or equal to 0.8 are assigned to the document.

Before applying the threshold, we expanded the `labels` list and the `scores` list using `pd.Series.explode`.

After applying the threshold, all the scores in the dataframe are greater than or equal to 0.8. The documents with multiple topics have multiple rows, one row for each topic.

In [None]:
# Threshold probability
threshold = 0.8

# Expand the lists
multi_topic_prediction = multi_topic_prediction.set_index('sequence').apply(pd.Series.explode).reset_index()

# Filter by threshold
multi_topic_prediction = multi_topic_prediction[multi_topic_prediction['scores'] >= threshold]

# Take a look at the data
multi_topic_prediction.head()

Unnamed: 0,sequence,labels,scores
0,Did any of you ever get used to the feeling of...,appearance and self-consciousness,0.987438
1,Did any of you ever get used to the feeling of...,oral hygiene,0.946579
2,Did any of you ever get used to the feeling of...,"mealtime, weight and eating disorder concerns",0.942612
3,Did any of you ever get used to the feeling of...,tray time management,0.90316
4,Did any of you ever get used to the feeling of...,physical pain and oral dysfunction,0.874316


Some documents are not assigned to any topic because none of the candidate labels has a predicted score that reaches the threshold. For those records, we can examine if there are common topics missing from the candidate labels.
* If there is a common theme that is not listed in `candidate_labels`, we can add a new topic and rerun the model. 
* If there is not a common theme across the documents, we can create an `other topics` category.
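The fallback described above can be sketched on the raw pipeline output, which is a list of dictionaries with `sequence`, `labels`, and `scores` keys. The toy records, the score values, and the variable names `assigned` and `unassigned` are made up for illustration.

```python
threshold = 0.8

# Toy results in the same shape as the pipeline output (made-up values)
results = [
    {"sequence": "doc 1", "labels": ["oral hygiene", "financial concerns"], "scores": [0.91, 0.85]},
    {"sequence": "doc 2", "labels": ["financial concerns", "oral hygiene"], "scores": [0.55, 0.20]},
]

assigned = {}
for result in results:
    kept = [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]
    # Fall back to a catch-all category when no label reaches the threshold
    assigned[result["sequence"]] = kept or ["other topics"]

# Documents worth reviewing by hand for a missing common theme
unassigned = [seq for seq, labels in assigned.items() if labels == ["other topics"]]

print(assigned)
print(unassigned)
```

Reading through the `unassigned` documents is a quick way to decide whether a new candidate label is needed or whether the catch-all category is enough.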

In [None]:
multi_topic_prediction.to_excel('invisalignPosts_improvised_multiple_pred.xlsx',index=False)

# Step 5: Topic Model with Unknown Candidate Labels

You might have noticed that a pre-defined list of candidate labels is required for the Hugging Face zero-shot text classification model. These candidate labels are usually from business domain knowledge or past experiences. What if there is no prior knowledge about candidate labels?

In step 5, we will talk about how to build a deep-learning topic model with unknown candidate labels.

If there is no business domain knowledge about what are the typical topics for the corpus, we can train an unsupervised topic model and let the model find the topics for us automatically.

BERTopic is a topic modeling Python library that combines transformer embeddings and clustering algorithms to identify topics in NLP (Natural Language Processing). Please check out my previous tutorial [Topic Modeling with Deep Learning Using Python BERTopic](https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504) to learn how to build topic models when the topics are not pre-defined.

The topic predictions from BERTopic can be used in two ways:
* The first way is to use the topic predictions directly as the final topic assignment of the text documents.
* The second way is to extract the candidate labels based on the BERTopic predictions, and then apply the candidate labels in the zero-shot topic model to create the final topic prediction.
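For the second way, one simple option is to turn each topic's top keywords into a candidate label string. The sketch below does not call BERTopic; it assumes the keywords are available as `(word, weight)` pairs, which is the shape that BERTopic's `get_topic` returns, and the topic ids, words, weights, and the helper name `keywords_to_label` are all made up for illustration.

```python
# Hypothetical top keywords per topic as (word, weight) pairs (made-up values)
topic_keywords = {
    0: [("pain", 0.08), ("sore", 0.06), ("pressure", 0.05), ("ache", 0.03)],
    1: [("cost", 0.09), ("insurance", 0.07), ("payment", 0.04)],
}

def keywords_to_label(keywords, top_n=3):
    """Join the top_n highest-weighted words into one candidate label string."""
    ranked = sorted(keywords, key=lambda pair: pair[1], reverse=True)
    return ", ".join(word for word, _ in ranked[:top_n])

candidate_labels = [keywords_to_label(words) for words in topic_keywords.values()]
print(candidate_labels)
```

In practice, keyword strings like these are usually rewritten by hand into short descriptive phrases before they are passed to the zero-shot classifier.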


# Recommended Tutorials

- [GrabNGoInfo Machine Learning Tutorials Inventory](https://medium.com/grabngoinfo/grabngoinfo-machine-learning-tutorials-inventory-9b9d78ebdd67)
- [Topic Modeling with Deep Learning Using Python BERTopic](https://medium.com/p/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504)
- [Google Colab Tutorial for Beginners](https://medium.com/towards-artificial-intelligence/google-colab-tutorial-for-beginners-834595494d44)
- [Five Ways To Create Tables In Databricks](https://medium.com/grabngoinfo/five-ways-to-create-tables-in-databricks-cd3847cfc3aa)
- [Time Series Anomaly Detection Using Prophet in Python](https://medium.com/grabngoinfo/time-series-anomaly-detection-using-prophet-in-python-877d2b7b14b4)
- [Multivariate Time Series Forecasting with Seasonality and Holiday Effect Using Prophet in Python](https://medium.com/p/multivariate-time-series-forecasting-with-seasonality-and-holiday-effect-using-prophet-in-python-d5d4150eeb57)
- [Time Series Causal Impact Analysis in Python](https://medium.com/grabngoinfo/time-series-causal-impact-analysis-in-python-63eacb1df5cc)
- [3 Ways for Multiple Time Series Forecasting Using Prophet in Python](https://medium.com/p/3-ways-for-multiple-time-series-forecasting-using-prophet-in-python-7a0709a117f9)
- [Hierarchical Topic Model for Airbnb Reviews](https://medium.com/p/hierarchical-topic-model-for-airbnb-reviews-f772eaa30434)
- [Hyperparameter Tuning For XGBoost](https://medium.com/p/hyperparameter-tuning-for-xgboost-91449869c57e)
- [Four Oversampling And Under-Sampling Methods For Imbalanced Classification Using Python](https://medium.com/p/four-oversampling-and-under-sampling-methods-for-imbalanced-classification-using-python-7304aedf9037)
- [Explainable S-Learner Uplift Model Using Python Package CausalML](https://medium.com/grabngoinfo/explainable-s-learner-uplift-model-using-python-package-causalml-a3c2bed3497c)
- [One-Class SVM For Anomaly Detection](https://medium.com/p/one-class-svm-for-anomaly-detection-6c97fdd6d8af)
- [Recommendation System: Item-Based Collaborative Filtering](https://medium.com/grabngoinfo/recommendation-system-item-based-collaborative-filtering-f5078504996a)
- [Hyperparameter Tuning for Time Series Causal Impact Analysis in Python](https://medium.com/grabngoinfo/hyperparameter-tuning-for-time-series-causal-impact-analysis-in-python-c8f7246c4d22)
- [Hyperparameter Tuning and Regularization for Time Series Model Using Prophet in Python](https://medium.com/grabngoinfo/hyperparameter-tuning-and-regularization-for-time-series-model-using-prophet-in-python-9791370a07dc)
- [LASSO (L1) Vs Ridge (L2) Vs Elastic Net Regularization For Classification Model](https://medium.com/towards-artificial-intelligence/lasso-l1-vs-ridge-l2-vs-elastic-net-regularization-for-classification-model-409c3d86f6e9)
- [S Learner Uplift Model for Individual Treatment Effect and Customer Segmentation in Python](https://medium.com/grabngoinfo/s-learner-uplift-model-for-individual-treatment-effect-and-customer-segmentation-in-python-9d410746e122)
- [How to Use R with Google Colab Notebook](https://medium.com/p/how-to-use-r-with-google-colab-notebook-610c3a2f0eab)

# References

* [Hugging Face New pipeline for zero-shot text classification](https://discuss.huggingface.co/t/new-pipeline-for-zero-shot-text-classification/681)
* [Zero-shot Learning in Modern NLP](https://joeddav.github.io/blog/2020/05/29/ZSL.html)
* [Zero-shot Pipeline Notebook](https://colab.research.google.com/drive/1jocViLorbwWIkTXKwxCOV9HLTaDDgCaw?usp=sharing)
* [Using Huggingface zero-shot text classification with large data set](https://stackoverflow.com/questions/63953597/using-huggingface-zero-shot-text-classification-with-large-data-set)
* [Zero-shot classification NLI models](https://huggingface.co/models?pipeline_tag=zero-shot-classification)
* [Hugging Face bart-large-mnli model documentation](https://huggingface.co/facebook/bart-large-mnli)

In [None]:
# Define the candidate labels 
candidate_labels = [
    "sound quality", 
    "battery life", 
    "price point", 
    "comfort level"
]

# Define the subwords for each label
subwords = {
    "sound quality": ["audio quality", "sound clarity"],
    "battery life": ["battery duration", "battery performance"],
    "price point": ["price range", "cost"],
    "comfort level": ["comfortability", "fit"]
}

# Set the hypothesis template
hypothesis_template = "The topic of this review is {}."

# Update the candidate labels with subwords
for label, subword_list in subwords.items():
    for subword in subword_list:
        candidate_labels.append(f"{subword} ({label})")

print(candidate_labels)

['sound quality', 'battery life', 'price point', 'comfort level', 'audio quality (sound quality)', 'sound clarity (sound quality)', 'battery duration (battery life)', 'battery performance (battery life)', 'price range (price point)', 'cost (price point)', 'comfortability (comfort level)', 'fit (comfort level)']
