<a href="https://colab.research.google.com/github/rahiakela/mlops-research-and-practice/blob/main/practical-deep-learning-with-mlflow/1-sentiment-classifier/sentiment_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Sentiment classifier

Text classification is the task of assigning a piece of text (a word, sentence, or document) an appropriate class, or category. The categories depend on the chosen data set: they can be topics, for example, or, as in this notebook, sentiment labels.

Let’s train a model to classify text as expressing either positive or negative sentiment. We will use the IMDB data set, which provides `train.csv` and `valid.csv` files (plus the `test.csv` file loaded in Step-1).

Reference:

https://lightning-flash.readthedocs.io/en/stable/reference/text_classification.html

##Step-0: Setup

In [None]:
# the [text] extra installs lightning-flash together with its text dependencies
!pip install 'lightning-flash[text]'

In [2]:
import torch
import flash
from flash.core.data.utils import download_data
from flash.text import TextClassificationData, TextClassifier

##Step-1: Create the DataModule

In [None]:
# download IMDb data to local folder
download_data("https://pl-flash-data.s3.amazonaws.com/imdb.zip", "./data/")
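
Before building the DataModule, it can help to confirm that the downloaded CSV really has the `review` and `sentiment` columns used in the next cell, and to see how the labels are distributed. The following is a minimal sketch using only the standard library; `summarize_csv` is a hypothetical helper name, and the file path assumes the archive unpacks to `data/imdb/` as the later cells expect.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize_csv(path, label_column="sentiment"):
    """Return the column names and a count of each label in a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        counts = Counter(row[label_column] for row in reader)
        return reader.fieldnames, counts


# Assumed location of the training file after the download step above.
train_path = Path("data/imdb/train.csv")
if train_path.exists():
    columns, counts = summarize_csv(train_path)
    print("columns:", columns)
    print("label counts:", dict(counts))
```

If the printed columns differ from `review` and `sentiment`, the `input_field` and `target_fields` arguments in the next cell would need to change accordingly.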

In [None]:
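# Build the DataModule from the CSV files: the "review" column holds the
# input text and the "sentiment" column holds the target label.
dataset = TextClassificationData.from_csv(
    input_field="review",
    target_fields="sentiment",
    train_file="data/imdb/train.csv",
    val_file="data/imdb/valid.csv",
    test_file="data/imdb/test.csv",
    batch_size=4,
)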

##Step-2: Build the task

In [None]:
classifier_model = TextClassifier(backbone="prajjwal1/bert-medium", labels=dataset.labels)

##Step-3: Define the trainer

In [7]:
trainer = flash.Trainer(max_epochs=3, gpus=torch.cuda.device_count())

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
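
The cell above selects devices with `gpus=torch.cuda.device_count()`. Newer PyTorch Lightning releases appear to favor the `accelerator` and `devices` arguments over `gpus`, so the sketch below shows one way to build those arguments from a GPU count. `trainer_device_kwargs` is a hypothetical helper, and whether the installed `flash.Trainer` accepts these keywords depends on the Lightning version it wraps, so treat this as an assumption to verify rather than the notebook's method.

```python
def trainer_device_kwargs(num_gpus):
    """Map a GPU count to Lightning-style accelerator/devices keyword arguments."""
    if num_gpus > 0:
        # Use every visible GPU.
        return {"accelerator": "gpu", "devices": num_gpus}
    # No GPU visible: fall back to the CPU.
    return {"accelerator": "cpu"}


print(trainer_device_kwargs(1))
print(trainer_device_kwargs(0))

# Possible usage (assumes the installed version accepts these keywords):
# trainer = flash.Trainer(max_epochs=3, **trainer_device_kwargs(torch.cuda.device_count()))
```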


##Step-4: Finetune the model

In [8]:
# Fine-tune the pretrained model for sentiment classification.
# strategy="freeze" keeps the backbone frozen, so only the classification head is trained.
trainer.finetune(classifier_model, datamodule=dataset, strategy="freeze")

Missing logger folder: /content/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                          | Params
----------------------------------------------------------------
0 | train_metrics | ModuleDict                    | 0     
1 | val_metrics   | ModuleDict                    | 0     
2 | test_metrics  | ModuleDict                    | 0     
3 | model         | BertForSequenceClassification | 41.4 M
----------------------------------------------------------------
1.0 K     Trainable params
41.4 M    Non-trainable params
41.4 M    Total params
165.497   Total estimated model params size (MB)
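
The summary above reports only about 1.0 K trainable parameters out of 41.4 M, which is consistent with `strategy="freeze"` leaving just the classification head trainable. The arithmetic below is a rough sanity check, not something taken from the training log: it assumes the head is a single linear layer, that `prajjwal1/bert-medium` has a hidden size of 512, that weights are stored as 32-bit floats, and that 1 MB means 10^6 bytes.

```python
HIDDEN_SIZE = 512     # assumed hidden width of prajjwal1/bert-medium
NUM_CLASSES = 2       # positive / negative
BYTES_PER_PARAM = 4   # assumed float32 weights

# A linear head has one weight per (hidden unit, class) pair plus one bias per class.
head_params = HIDDEN_SIZE * NUM_CLASSES + NUM_CLASSES
print(head_params)  # 1026, which rounds to the reported 1.0 K

# Work backwards from the reported 165.497 MB to a parameter count.
reported_size_bytes = 165_497_000
implied_params = reported_size_bytes // BYTES_PER_PARAM
print(implied_params)  # 41374250, which rounds to the reported 41.4 M
```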



##Step-5: Make predictions

In [9]:
# Classify a few sentences! How was the movie?
new_dataset = TextClassificationData.from_lists(
  predict_data=[
    "Turgid dialogue, feeble characterization - Harvey Keitel a judge?.",
    "The worst movie in the history of cinema.",
    "I come from Bulgaria where it 's almost impossible to have a tornado."
  ],
  batch_size=4
)

predictions = trainer.predict(classifier_model, datamodule=new_dataset, output="labels")
print(predictions)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



[['negative', 'positive', 'positive']]
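
The printed result is a nested list, which suggests one inner list per prediction batch. To read the labels next to their sentences, the batches can be flattened and zipped with the inputs. This is a small sketch in plain Python; `pair_predictions` is a hypothetical helper, and the sample values are copied from the cell and output above.

```python
from itertools import chain


def pair_predictions(sentences, batched_labels):
    """Flatten per-batch label lists and pair each label with its input sentence."""
    flat_labels = list(chain.from_iterable(batched_labels))
    if len(flat_labels) != len(sentences):
        raise ValueError("number of labels does not match number of sentences")
    return list(zip(sentences, flat_labels))


sentences = [
    "Turgid dialogue, feeble characterization - Harvey Keitel a judge?.",
    "The worst movie in the history of cinema.",
    "I come from Bulgaria where it 's almost impossible to have a tornado.",
]
batched_labels = [["negative", "positive", "positive"]]  # output printed above

for sentence, label in pair_predictions(sentences, batched_labels):
    print(f"{label:>8}  {sentence}")
```

Laid out this way, it is easier to spot questionable predictions, such as the second sentence being labeled `positive`.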


In [12]:
new_dataset = TextClassificationData.from_lists(
  predict_data=[
    "Best movie I have seen.",
    "What a movie!"
  ],
  batch_size=4
)

predictions = trainer.predict(classifier_model, datamodule=new_dataset, output="labels")
print(predictions)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



[['positive', 'positive']]
