Zero-Shot Multilingual Sentiment Classification with PyTorch Lightning | NLP
---

### [Dataset Link](https://huggingface.co/datasets/yelp_polarity)

## In this exercise I will train a zero-shot multilingual sentiment classifier, using the Multilingual Universal Sentence Encoder (mUSE) to generate the features.


## Quick points about Multilingual Universal Sentence Encoder (mUSE)

* 🔬 The model is a dual encoder: one side encodes the query (e.g., the question in a QA task), and the other side encodes all possible candidates (e.g., all possible responses in that QA task).

* 🔬 The model computes a similarity metric between the query encoding and each response encoding. The output of the model is the response most similar to the query.

* 🔬 Two encoder architectures were tried: a Convolutional Neural Network, which is more parameter-efficient, and a Transformer encoder, which is more accurate but more resource-intensive.
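To make the dual-encoder idea concrete, here is a minimal sketch of the ranking step using NumPy. The vectors are toy values made up for illustration, and cosine similarity is an assumption standing in for whatever scoring function the paper's model actually uses; `rank_candidates` is a hypothetical helper, not part of mUSE.

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar to the query."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)  # highest similarity first
    return order, scores

# Toy 3-dimensional encodings (made up); real sentence encodings are much larger.
query = np.array([1.0, 0.0, 1.0])
candidates = np.array([
    [0.9, 0.1, 1.1],   # close to the query
    [-1.0, 0.5, 0.0],  # points away from the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
])

order, scores = rank_candidates(query, candidates)
best = int(order[0])  # index of the most similar candidate
```

In a real QA setup the query and candidate vectors would come from the two encoder sides rather than being written by hand.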

----------------



[Original Paper](https://arxiv.org/pdf/1907.04307.pdf)

----------------------



The goal is to illustrate the zero-shot classification abilities of multilingual models: the model is trained only on English
data and then used to predict on non-English data with no further training.

I use the binary sentiment classification Yelp Polarity dataset. 

The dataset consists of 560K highly polar Yelp reviews for training and 38K reviews for testing. Original Yelp reviews
carry a numerical score from 1 to 5 stars. This dataset is constructed by grouping the 1- and 2-star reviews into the negative sentiment class and
the 3- and 4-star reviews into the positive sentiment class.


mUSE is a Transformer-based encoder that maps text so that sentences in two different languages with similar meanings receive similar encodings.


This is analogous to the way two words with similar meaning (and usage) will have similar word embeddings. 

mUSE supports 16 languages: Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean,
Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian.

Here, I will use TensorFlow Hub to load the mUSE model, Hugging Face Datasets to load the Yelp Polarity dataset, and PyTorch Lightning for training.


In [None]:
# !pip install tensorflow_hub tensorflow_text pytorch_lightning datasets -q

In [2]:
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text

from torch.utils.data import DataLoader
from datasets import Dataset, load_dataset, load_metric
import numpy as np

from typing import List, Dict

In [None]:
import dataloading
import modeling

In [3]:
pl.seed_everything(445326, workers=True)

INFO:pytorch_lightning.utilities.seed:Global seed set to 445326


445326

# Pull `universal-sentence-encoder-multilingual-large` from TF-Hub

In [4]:
# model_URL = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3'

# encoder = hub.load(model_URL)

The embedding vectors should be NumPy arrays rather than TensorFlow tensors, because they will be used in PyTorch.
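The notebook's own `dataloading.embed_text` is not shown, so the following is only a rough sketch of what such a conversion could look like. `embed_as_numpy` and `fake_encoder` are hypothetical names; the stand-in encoder exists only so the snippet runs without TensorFlow, and in practice `encoder` would be the object returned by `hub.load(...)`.

```python
import numpy as np

def embed_as_numpy(texts, encoder):
    """Encode a list of strings and return a float32 NumPy array.

    `encoder` is assumed to be any callable that maps a list of strings
    to a 2-D array-like (for example a loaded TF-Hub sentence encoder).
    """
    vectors = encoder(texts)
    # TensorFlow eager tensors expose .numpy(); otherwise fall back to np.asarray.
    if hasattr(vectors, "numpy"):
        vectors = vectors.numpy()
    return np.asarray(vectors, dtype=np.float32)

# Stand-in encoder (not a real model): simple character statistics per text.
def fake_encoder(texts):
    return [[len(t), t.count(" "), sum(map(ord, t)) % 7] for t in texts]

embeddings = embed_as_numpy(["good food", "bad service here"], fake_encoder)
```

The resulting array can then be handed to PyTorch with `torch.from_numpy(embeddings)`.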

# Data

YelpDataLoader()

In [7]:
data = dataloading.YelpDataLoader()
data.prepare_data()



Downloading builder script:   0%|          | 0.00/6.35k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/3.47k [00:00<?, ?B/s]

Downloading and preparing dataset yelp_polarity/plain_text (download: 158.67 MiB, generated: 421.07 MiB, post-processed: Unknown size, total: 579.73 MiB) to /root/.cache/huggingface/datasets/yelp_polarity/plain_text/1.0.0/14f90415c754f47cf9087eadac25823a395fef4400c7903c5897f55cfaaa6f61...


Downloading data:   0%|          | 0.00/166M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/560000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/38000 [00:00<?, ? examples/s]

Dataset yelp_polarity downloaded and prepared to /root/.cache/huggingface/datasets/yelp_polarity/plain_text/1.0.0/14f90415c754f47cf9087eadac25823a395fef4400c7903c5897f55cfaaa6f61. Subsequent calls will reuse this data.




In [8]:
data.setup()
print(len(data.train))
print(len(data.val))
print(len(data.test))



  0%|          | 0/350 [00:00<?, ?ba/s]

  0%|          | 0/175 [00:00<?, ?ba/s]

  0%|          | 0/24 [00:00<?, ?ba/s]

11200
5600
760


# Model

## Multilingual binary classifier

A LightningModule organizes your PyTorch code into 6 sections:

* Computations (`__init__`)

* Train Loop (`training_step`)

* Validation Loop (`validation_step`)

* Test Loop (`test_step`)

* Prediction Loop (`predict_step`)

* Optimizers and LR Schedulers (`configure_optimizers`)

## Train

In [10]:
model = modeling.Model()

  import sys


Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

In [11]:
MAX_EPOCHS = 15

checkpoint_callback = pl.callbacks.ModelCheckpoint(
    monitor="val_loss",
    dirpath="model",
    filename="yelp-sentiment-multilingual-{epoch:02d}-{val_loss:.3f}",
    save_top_k=3,
    mode="min")

trainer = pl.Trainer(gpus=1, max_epochs=MAX_EPOCHS, 
                     callbacks=[checkpoint_callback])

  f"Setting `Trainer(gpus={gpus!r})` is deprecated in v1.7 and will be removed"
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


## Under the hood, the Lightning Trainer handles the training loop details for you. Some examples include:

* Automatically enabling/disabling grads

* Running the training, validation and test dataloaders

* Calling the Callbacks at the appropriate times

* Putting batches and computations on the correct devices
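
For comparison, here is roughly what those details look like when written by hand in plain PyTorch. This is a simplified sketch, not Lightning's actual implementation; the `fit` function, the toy data, and the tiny linear model are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def fit(model, train_loader, val_loader, epochs=1, lr=1e-3, device="cpu"):
    """Hand-written training loop; returns the mean validation loss per epoch."""
    model.to(device)  # put the model on the right device
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for epoch in range(epochs):
        model.train()  # training mode, gradients enabled
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)  # put the batch on the right device
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
        model.eval()  # evaluation mode
        val_losses = []
        with torch.no_grad():  # gradients disabled for validation
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                val_losses.append(F.cross_entropy(model(x), y).item())
        # A callback such as checkpointing would be invoked at this point.
        history.append(sum(val_losses) / len(val_losses))
    return history

torch.manual_seed(0)
features = torch.randn(64, 8)                 # toy inputs
labels = (features[:, 0] > 0).long()          # toy binary labels
loader = DataLoader(TensorDataset(features, labels), batch_size=16)
toy_model = nn.Linear(8, 2)
history = fit(toy_model, loader, loader, epochs=2)
```

With Lightning, all of this collapses into the single `trainer.fit(...)` call.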

In [None]:
trainer.fit(model, data.train_dataloader(), data.val_dataloader())

## Validate and test a model

During and after training we need a way to evaluate our models, to make sure they are not overfitting while training and that they generalize well to unseen or real-world data. There are generally two stages of evaluation: validation and testing. To some degree they serve the same purpose, making sure the model works on real data, but they have some practical differences.

Validation is usually done during training, traditionally after each training epoch. It can be used for hyperparameter optimization or tracking model performance during training. It’s a part of the training process.

Testing is usually done once we are satisfied with the training and only with the best model selected from the validation metrics.


---------------------------------

# Test

#### After running `.fit()`, we can run the test

In [13]:
trainer.test(dataloaders=data.test_dataloader())

  + f" You can pass `.{fn}(ckpt_path='best')` to use the best model or"
INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/model/yelp-sentiment-multilingual-epoch=10-val_loss=0.190.ckpt
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from checkpoint at /content/model/yelp-sentiment-multilingual-epoch=10-val_loss=0.190.ckpt


Testing: 0it [00:00, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      test_accuracy         0.8960526585578918
        test_loss           0.25015878677368164
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


  "Trying to infer the `batch_size` from an ambiguous collection. The batch size we"


[{'test_loss': 0.25015878677368164, 'test_accuracy': 0.8960526585578918}]

# Inference

In [14]:
best_model = modeling.Model.load_from_checkpoint(checkpoint_callback.best_model_path)

### Get predict() method

## Inference on non-English text

Since we used mUSE embeddings, we should be able to predict sentiment for non-English languages. Let's try it out. [mUSE supports 16 languages](https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3):

Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian

In [17]:
from pprint import PrettyPrinter
pp = PrettyPrinter()

In [None]:
def predict(text: List[str]):
    """
    This function predicts the sentiment of a list of sentences using a pre-trained model.
    
    The sentences are first converted into embeddings using a custom data loading and embedding function. 
    These embeddings are then passed through the model to generate logits. The model's predictions are 
    then converted into human-readable labels and scores.

    Parameters
    ----------
    text : List[str]
        The list of sentences to classify.

    Returns
    -------
    results : List[dict]
        The list of dictionaries with each dictionary containing the text, predicted label, and corresponding score.
    """
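    embeddings = torch.as_tensor(dataloading.embed_text(text), dtype=torch.float32)
    best_model.eval()  # inference mode for layers such as dropout
    with torch.no_grad():  # no gradients are needed for prediction
        logits = best_model(embeddings)
    preds = torch.argmax(logits, dim=1).cpu().numpy()
    scores = torch.softmax(logits, dim=1).cpu().numpy()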

    results = []
    for t, best_index, score_pair in zip(text, preds, scores):
        results.append({
            "text": t,
            "label": "positive" if best_index == 1 else "negative",
            "score": score_pair[best_index]
        })
    return results
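
To see the post-processing in isolation, the sketch below applies the same softmax and argmax logic to hand-written logits, without the encoder or the checkpoint. `logits_to_results` is a hypothetical helper and the logit values are made up; index 1 is treated as the positive class, as in `predict` above.

```python
import torch

def logits_to_results(texts, logits):
    """Turn a (batch, 2) logit tensor into label/score dictionaries."""
    probs = torch.softmax(logits, dim=1)   # each row sums to 1
    scores, preds = probs.max(dim=1)       # winning probability and its class index
    return [
        {"text": t, "label": "positive" if p == 1 else "negative", "score": s}
        for t, p, s in zip(texts, preds.tolist(), scores.tolist())
    ]

example_logits = torch.tensor([[0.0, 2.0], [3.0, -1.0]])  # made-up values
example_results = logits_to_results(["great", "awful"], example_logits)
```

The score is simply the softmax probability of the winning class, so it always lies between 0.5 and 1 for a two-class model.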

Compare predictions for English and German.

In [20]:
english_text = "Like any Barnes & Noble, it has a nice comfy cafe, and a large selection of books. The staff is very friendly and helpful. They stock a decent selection, and the prices are pretty reasonable."

german_translation = "Wie jedes Barnes & Noble hat es ein nettes, gemütliches Café und eine große Auswahl an Büchern. Das Personal ist sehr freundlich und hilfsbereit. Sie haben eine anständige Auswahl und die Preise sind ziemlich vernünftig."

pp.pprint(predict([english_text, german_translation]))


[{'label': 'positive',
  'score': 0.9985252,
  'text': 'Like any Barnes & Noble, it has a nice comfy cafe, and a large '
          'selection of books. The staff is very friendly and helpful. They '
          'stock a decent selection, and the prices are pretty reasonable.'},
 {'label': 'positive',
  'score': 0.835555,
  'text': 'Wie jedes Barnes & Noble hat es ein nettes, gemütliches Café und '
          'eine große Auswahl an Büchern. Das Personal ist sehr freundlich und '
          'hilfsbereit. Sie haben eine anständige Auswahl und die Preise sind '
          'ziemlich vernünftig.'}]


Compare predictions for English and Italian. For kicks, let's also see how it performs on Finnish, a European language that mUSE does not support.

In [30]:
english_text = "The inside of the Restaurant was not clean at all. And we also did not like their lighting arrangement. Too dark."

italian_translation = "L'interno del Ristorante non era affatto pulito. E non ci piaceva nemmeno la loro disposizione delle luci. Troppo scuro."

finnish_translation = "Ravintolan sisäpuoli ei ollut ollenkaan puhdas. Ja emme myöskään pitäneet heidän valaistusjärjestelystä. Liian pimeä."

pp.pprint(predict([english_text, italian_translation, finnish_translation]))

[{'label': 'negative',
  'score': 0.9895075,
  'text': 'The inside of the Restaurant was not clean at all. And we also did '
          'not like their lighting arrangement. Too dark.'},
 {'label': 'negative',
  'score': 0.7474784,
  'text': "L'interno del Ristorante non era affatto pulito. E non ci piaceva "
          'nemmeno la loro disposizione delle luci. Troppo scuro.'},
 {'label': 'negative',
  'score': 0.87850356,
  'text': 'Ravintolan sisäpuoli ei ollut ollenkaan puhdas. Ja emme myöskään '
          'pitäneet heidän valaistusjärjestelystä. Liian pimeä.'}]
