<a href="https://colab.research.google.com/github/rahiakela/mlops-research-and-practice/blob/main/practical-deep-learning-with-mlflow/2-mlflow-introduction/sentiment_classifier_with_mlflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Sentiment classifier

Text classification is the task of assigning a piece of text (a word, sentence, or document) to an appropriate class, or category. The categories depend on the chosen data set; they can be topics, for example, or, as in this notebook, sentiment labels.

Let’s train a model to classify text as expressing either positive or negative sentiment. We will use the IMDB data set, which provides `train.csv`, `valid.csv`, and `test.csv`.

Reference:

https://lightning-flash.readthedocs.io/en/stable/reference/text_classification.html

https://www.dagshub.com/Dean/mlflow-colab-example

##Step-0: Setup

In [None]:
!pip install lightning-flash
!pip install 'lightning-flash[text]'
!pip install mlflow

In [2]:
import torch
import flash
from flash.core.data.utils import download_data
from flash.text import TextClassificationData, TextClassifier

##Step-0: Set up MLflow

In [3]:
import mlflow
import os
from getpass import getpass

In [4]:
os.environ['MLFLOW_TRACKING_USERNAME'] = input('Enter your DAGsHub username: ')
os.environ['MLFLOW_TRACKING_PASSWORD'] = getpass('Enter your DAGsHub access token: ')
os.environ['MLFLOW_TRACKING_PROJECTNAME'] = input('Enter your DAGsHub project name: ')

Enter your DAGsHub username: rahiakela
Enter your DAGsHub access token: ··········
Enter your DAGsHub project name: mlflow_projects


In [10]:
# point MLflow at the DagsHub-hosted tracking server for this project
mlflow.set_tracking_uri(f"https://dagshub.com/{os.environ['MLFLOW_TRACKING_USERNAME']}/{os.environ['MLFLOW_TRACKING_PROJECTNAME']}.mlflow")
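
The tracking URI above is assembled from the username and project name entered earlier. As a minimal sketch, the same construction can be wrapped in a helper that fails early when a value is empty. `build_tracking_uri` is a hypothetical name, not part of MLflow or DagsHub, and it only mirrors the URL pattern used in this notebook.

```python
def build_tracking_uri(username: str, project: str) -> str:
    """Return the tracking URI in the pattern this notebook uses.

    Hypothetical helper for illustration; not an MLflow or DagsHub API.
    """
    if not username or not project:
        raise ValueError("both username and project name must be non-empty")
    return f"https://dagshub.com/{username}/{project}.mlflow"


# Example values taken from the prompts above; in the notebook they
# would be read from os.environ instead.
uri = build_tracking_uri("rahiakela", "mlflow_projects")
print(uri)
```

The returned string could then be passed to `mlflow.set_tracking_uri`, as in the cell above.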

In [None]:
# set up an active experiment using mlflow
EXPERIMENT_NAME = "basic_classifier_model"
mlflow.set_experiment(EXPERIMENT_NAME)
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
print(f"experiment_id: {experiment.experiment_id}")

# enable autologging in MLflow
mlflow.pytorch.autolog()

##Step-1: Create the DataModule

In [None]:
# download IMDb data to local folder
download_data("https://pl-flash-data.s3.amazonaws.com/imdb.zip", "./data/")

In [None]:
dataset = TextClassificationData.from_csv(
  input_field="review",
  target_fields="sentiment",
  train_file="data/imdb/train.csv",
  val_file="data/imdb/valid.csv",
  test_file="data/imdb/test.csv",
  batch_size=4
)
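
The `from_csv` call above assumes each file has a `review` column holding the text and a `sentiment` column holding the label. The sketch below checks for that layout and counts the labels using only the standard library. The three rows are hypothetical stand-ins, not taken from the IMDB files.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the layout the cell above assumes:
# a "review" text column and a "sentiment" label column.
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A moving, well acted film.",positive\n'
    '"Two hours I will never get back.",negative\n'
    '"Great soundtrack, great cast.",positive\n'
)

reader = csv.DictReader(sample_csv)
rows = list(reader)

# Fail early if either expected column is missing.
missing = {"review", "sentiment"} - set(reader.fieldnames)
assert not missing, f"missing columns: {missing}"

# Count how many examples carry each label.
label_counts = Counter(row["sentiment"] for row in rows)
print(label_counts)
```

To run the same check on the downloaded data, replace `sample_csv` with `open("data/imdb/train.csv", newline="")`.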

##Step-2: Build the task

In [None]:
classifier_model = TextClassifier(backbone="prajjwal1/bert-medium", labels=dataset.labels)

##Step-3: Define the trainer

In [8]:
trainer = flash.Trainer(max_epochs=3, gpus=torch.cuda.device_count())

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


##Step-4: Finetune the model

In [12]:
# start the experiment run
with mlflow.start_run(experiment_id=experiment.experiment_id, run_name="basic_classifier"):
  # fine tune the pretrained model to get a new model for sentiment classification
  trainer.finetune(classifier_model, datamodule=dataset, strategy="freeze")

Missing logger folder: /content/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                          | Params
----------------------------------------------------------------
0 | train_metrics | ModuleDict                    | 0     
1 | val_metrics   | ModuleDict                    | 0     
2 | test_metrics  | ModuleDict                    | 0     
3 | model         | BertForSequenceClassification | 41.4 M
----------------------------------------------------------------
1.0 K     Trainable params
41.4 M    Non-trainable params
41.4 M    Total params
165.497   Total estimated model params size (MB)
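
With `strategy="freeze"`, the summary reports about 1.0 K trainable parameters out of 41.4 M, which suggests only a small classification head is being updated. The arithmetic below is a rough consistency check, not a statement of how the library counts. It assumes a final hidden size of 512 for the backbone, a single linear head over two classes, and 4 bytes per parameter; none of these values is stated in the notebook.

```python
# Assumptions (not stated in the notebook): the backbone's final hidden
# size is 512 and the head is a single linear layer over 2 classes.
hidden_size = 512
num_classes = 2

# A linear layer has one weight per (input, output) pair plus one bias per output.
head_params = hidden_size * num_classes + num_classes
print(head_params)  # 1026, which the summary would round to "1.0 K"

# Dividing the reported size in MB by 4 bytes per float32 parameter gives
# the approximate total parameter count, in millions.
reported_size_mb = 165.497
approx_params_millions = reported_size_mb / 4
print(round(approx_params_millions, 1))  # 41.4
```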





In [13]:
# See your experiments table inside Colab!
from IPython.display import IFrame, display

experiments_url = f"https://dagshub.com/{os.environ['MLFLOW_TRACKING_USERNAME']}/{os.environ['MLFLOW_TRACKING_PROJECTNAME']}/experiments/#/"
display(IFrame(experiments_url, '100%', 600))

##Step-5: Make predictions

In [14]:
# Classify a few sentences! How was the movie?
new_dataset = TextClassificationData.from_lists(
  predict_data=[
    "Turgid dialogue, feeble characterization - Harvey Keitel a judge?.",
    "The worst movie in the history of cinema.",
    "I come from Bulgaria where it 's almost impossible to have a tornado."
  ],
  batch_size=4
)

predictions = trainer.predict(classifier_model, datamodule=new_dataset, output="labels")
print(predictions)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 5625it [00:00, ?it/s]

[['positive', 'positive', 'positive']]
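
The printed result is a nested list, which appears to hold one inner list per prediction batch. The sketch below flattens that structure and pairs each label with its input sentence. The `predictions` value is copied from the output above rather than produced by running the model.

```python
from itertools import chain

sentences = [
    "Turgid dialogue, feeble characterization - Harvey Keitel a judge?.",
    "The worst movie in the history of cinema.",
    "I come from Bulgaria where it 's almost impossible to have a tornado.",
]

# Copied from the cell output above; assumed to be one inner list per batch.
predictions = [["positive", "positive", "positive"]]

# Flatten the per-batch lists into a single list of labels.
flat_labels = list(chain.from_iterable(predictions))

# Pair each sentence with its predicted label.
for sentence, label in zip(sentences, flat_labels):
    print(f"{label:>8}  {sentence}")
```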


In [15]:
new_dataset = TextClassificationData.from_lists(
  predict_data=[
    "Best movie I have seen.",
    "What a movie!"
  ],
  batch_size=4
)

predictions = trainer.predict(classifier_model, datamodule=new_dataset, output="labels")
print(predictions)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 5625it [00:00, ?it/s]

[['positive', 'positive']]
