# IMDB Sentiment Classifier
### Using Hugging Face with the SageMaker SDK

# What We're Going To Do:

#### Installation
1. Install the SageMaker SDK and the Hugging Face libraries
1. Start a SageMaker session, including the default IAM role and S3 bucket
    
#### Data Preparation
1. Tokenization: Download and prepare our IMDB dataset for NLP model training
1. Upload our tokenized and split dataset to S3

#### Model Training
1. Set up an Estimator
1. Train a model

#### Real Time Inference
1. Prepare the model for deployment
1. Deploy the model and create a Predictor
1. Make inferences using a Predictor

#### Clean Up

---
# But What _Is_ Machine Learning?

For our purposes, we can think of machine learning as a method of using computers to learn the rules of computation. 

For example, in a traditional computation like adding two integers, we supply the input data, the integers 2 and 3, and wish to apply a rule, addition, to compute the output, 5. Computers are convenient for these types of operations for obvious historical reasons.

However, with machine learning, we supply the input and output data, but are interested in computing the unknown rules that generated our output from the input. This process is not magic. Behind the scenes, machine learning relies on statistical techniques and frames the problem as an optimization: find the rules that minimize the error between the predicted outputs and the actual outputs. Both how this optimization problem is framed and which mechanisms are used to fit optimized rules to the data are at the frontier of machine learning research. 
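
To make the contrast concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how SageMaker or Hugging Face train models: the example pairs, the learning rate, and the names `examples`, `w1`, and `w2` are all assumptions chosen for illustration. We hand the computer inputs and outputs that were produced by addition and let gradient descent recover the rule `output = w1 * a + w2 * b`.

```python
# Toy example: learn the "rule" of addition from input/output pairs.
# All names and values here are illustrative assumptions.

# Inputs (a, b) and the outputs that the unknown rule produced.
examples = [((1, 2), 3), ((2, 3), 5), ((4, 1), 5), ((0, 5), 5), ((3, 3), 6)]

# Candidate rule: output = w1 * a + w2 * b. Start with a bad guess.
w1, w2 = 0.0, 0.0
learning_rate = 0.01

for step in range(2000):
    # Gradient of the mean squared error with respect to w1 and w2.
    grad_w1 = grad_w2 = 0.0
    for (a, b), target in examples:
        error = (w1 * a + w2 * b) - target
        grad_w1 += 2 * error * a / len(examples)
        grad_w2 += 2 * error * b / len(examples)
    # Nudge the weights in the direction that reduces the error.
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

print(round(w1, 3), round(w2, 3))
```

Both weights end up very close to 1, which is the rule of addition recovered from data alone.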

Over the past decade, cloud computing has become more convenient and economical, which has made machine learning more accessible. However, to perform machine learning in a cloud environment, one is still responsible for the data preparation, training, and inference infrastructure. This is where Amazon SageMaker helps. It's a machine learning service that you can use to build, train, and deploy machine learning models for virtually any use case.

![diagram](assets/what-is-ml.svg)

---
# Installation
##### ⏰ About 1 minute

This section has nothing to do with machine learning, but sets up our development environment with the requisite SDKs and AWS constructs we'll need to perform machine learning. In particular, we'll fix specific versions of the SageMaker and Hugging Face SDKs, as well as direct our SageMaker Studio session to use a particular S3 bucket for staging our input and output data.

In [None]:
%%time
%%capture

import os

DATASETS_VERSION = "1.6.2"
TRANSFORMERS_VERSION = "4.5.0"
SAGEMAKER_VERSION = "2.40.0"

requirements_txt = f"""numpy
pandas
transformers=={TRANSFORMERS_VERSION}
datasets=={DATASETS_VERSION}
"""

with open(os.path.join(os.getcwd(), "scripts", "requirements.txt"), "w") as f:
    f.write(requirements_txt)

!pip install --upgrade "sagemaker==$SAGEMAKER_VERSION" "transformers==$TRANSFORMERS_VERSION" "datasets[s3]==$DATASETS_VERSION"
# !conda install -c conda-forge ipywidgets -y

In [None]:
if False:
    import IPython
    IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
%%capture

import boto3
import botocore
import sagemaker
import sagemaker.huggingface

session = sagemaker.Session()
bucket = session.default_bucket()
role = sagemaker.get_execution_role()

In [None]:
print(f"SageMaker role arn: {role}")
print(f"SageMaker bucket: {session.default_bucket()}")
print(f"SageMaker session region: {session.boto_region_name}")

---
# Data Preparation

### Download and Split the Dataset
##### ⏰ About 2 minutes

Machine learning datasets are often a mixture of labeled and unlabeled data. For this example, we'll only be using labeled data from the IMDB movie reviews dataset. 

When a model is trained, the process feeds labeled examples from our dataset into the training algorithm, and the resulting model is evaluated against other labeled examples that it was not trained on. If the model is doing well, then the error between its predictions and the test labels will be low. But first we need to decide how much of our dataset will be used for training and how much for evaluation. For our example, we'll use half of the labeled dataset for training and the other half for testing.
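
As a rough sketch of what an even split means, the snippet below shuffles a small list of labeled examples and cuts it in half. The `reviews` list, the seed, and the helper name `split_in_half` are hypothetical; the notebook itself relies on the `train` and `test` splits that the `datasets` library returns for IMDB.

```python
import random

def split_in_half(examples, seed=42):
    """Shuffle labeled examples and return (train, test) halves."""
    shuffled = list(examples)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

# Hypothetical labeled reviews: (text, label) where 1 = positive, 0 = negative.
reviews = [
    ("A delight from start to finish.", 1),
    ("I want those two hours back.", 0),
    ("Charming and funny.", 1),
    ("Flat characters, dull plot.", 0),
]

train, test = split_in_half(reviews)
print(len(train), len(test))
```

Shuffling before the cut matters: if the source data were sorted by label, an unshuffled split could put all positive reviews in one half.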

In [None]:
%%time

import pandas
import datasets
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)

train_dataset, test_dataset = datasets.load_dataset(
    "imdb", 
    ignore_verifications = False,
    split = ["train", "test"]
)

### Tokenization
##### ⏰ About 1 minute

NLP models are not trained directly against the natural languages they form predictions over. Generally speaking, machine learning models are trained with numerical inputs. Tokenization is the data preparation process by which we take our natural English language movie reviews and transform them into numbers the model training algorithm understands. There are many different ways to tokenize natural language data. In our case we will select the tokenizer that was originally used for training the pretrained [DistilBERT model Hugging Face provides](https://huggingface.co/distilbert-base-uncased).

In [None]:
%%time

tokenize = lambda batch: tokenizer(
    batch["text"], 
    padding = "max_length", 
    truncation = "longest_first"
)

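# batched = True passes lists of reviews to the tokenizer, which is much faster
train_ds = train_dataset.shuffle().map(tokenize, batched = True)
test_ds = test_dataset.shuffle().map(tokenize, batched = True)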

try:
    train_ds = train_ds.rename_column("label", "labels")
    test_ds = test_ds.rename_column("label", "labels")
except:
    pass

columns = ["input_ids", "attention_mask", "labels"]
train_ds.set_format("torch", columns = columns)
test_ds.set_format("torch", columns = columns)

### So What Does a Tokenized Natural Language Dataset Look Like?

In [None]:
train_ds.to_pandas().head(100)[["text", "labels", "input_ids", "attention_mask"]]

### WTF?

- `text` contains the raw English IMDB movie reviews 
- `labels` are the sentiment values for each review where `1` is positive and `0` is negative
- `input_ids` are the token IDs: each piece of text produced by the tokenizer is replaced by its integer index in the tokenizer's vocabulary. These integers are the numerical values that are fed into the model training loop.
- `attention_mask` marks which elements of the `input_ids` vector are actually processed in the training loop. Because each original `text` has a different length, we've chosen to pad the data to the same length. The attention mask makes sure the padding values are ignored in the training loop.
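
The toy tokenizer below is a sketch of how these columns relate to each other. It is not DistilBERT's actual tokenizer: the five-entry vocabulary, the `[PAD]` and `[UNK]` IDs, and the function name `toy_tokenize` are assumptions made up for illustration.

```python
# Hypothetical vocabulary; real tokenizers ship with tens of thousands of entries.
vocab = {"[PAD]": 0, "[UNK]": 1, "great": 2, "movie": 3, "terrible": 4}

def toy_tokenize(text, max_length=6):
    """Map words to vocabulary IDs, then pad or truncate to max_length."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]
    ids = ids[:max_length]             # truncation
    mask = [1] * len(ids)              # 1 = real token
    padding = max_length - len(ids)
    ids += [vocab["[PAD]"]] * padding  # pad to a fixed length
    mask += [0] * padding              # 0 = padding, ignored by the model
    return {"input_ids": ids, "attention_mask": mask}

encoded = toy_tokenize("Great movie")
print(encoded["input_ids"])       # [2, 3, 0, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 0, 0, 0, 0]
```

Every encoded review has the same length, and the mask records which positions hold real tokens.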

### Upload the Dataset to S3
##### ⏰ About 5 seconds

In [None]:
%%time

from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

s3_prefix = "datasets/imdb-binary-classification"
training_input_path = f"s3://{bucket}/{s3_prefix}/train"
test_input_path = f"s3://{bucket}/{s3_prefix}/test"

train_ds.save_to_disk(training_input_path, fs = s3)
test_ds.save_to_disk(test_input_path, fs = s3)

<span style="font-size: 16px;"><a href="https://s3.console.aws.amazon.com/s3/buckets/sagemaker-us-east-1-934284400219?region=us-east-1&prefix=datasets/imdb-binary-classification/&showversions=false">Prove it Landed in S3</a></span>

---
# Model Training

### Set Up an Estimator

Estimators are part of the SageMaker SDK and represent at a high-level the model training job, data access, and managed infrastructure required to produce the trained model artifact. Using the latest version of the SageMaker SDK, we can leverage its [Hugging Face integration](https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face) to simplify the training process.

How do we evaluate the model training performance as it's running? When we train a model using SageMaker, we can monitor several metrics in real time in AWS using Amazon CloudWatch. In particular, we'll look at two varieties of metrics: the EC2 training instance metrics and the training algorithm metrics. The EC2 training instance metrics will be supplied by SageMaker without needing to configure anything. But to capture the specific Hugging Face model training metrics, we need to tell the `HuggingFace` estimator that we're interested in specific ones, which we do by listing them in the `metric_definitions` list below. There are many more detailed metrics we can subscribe to, but for this example we will only pay attention to two: the epoch and the loss. 
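
To see how a metric regex pulls a number out of the training logs, the sketch below applies two patterns to a single log line using Python's `re` module. The log line is hypothetical (the real format depends on the training script), and the patterns are written as raw strings with the decimal point escaped, which is an assumption about the intended match rather than a copy of any official definition.

```python
import re

# Patterns in the style of metric definitions, written as raw strings
# with the decimal point escaped.
patterns = {
    "epoch": r"'epoch': ([0-9]+(\.|e\-)[0-9]+),?",
    "loss": r"'loss': ([0-9]+(\.|e\-)[0-9]+),?",
}

# Hypothetical log line; the real output depends on the training script.
log_line = "{'loss': 0.2534, 'learning_rate': 3e-05, 'epoch': 1.0}"

metrics = {}
for name, pattern in patterns.items():
    match = re.search(pattern, log_line)
    if match:
        # The first capture group holds the numeric value.
        metrics[name] = float(match.group(1))

print(metrics)  # {'epoch': 1.0, 'loss': 0.2534}
```

Testing a pattern locally like this is a cheap way to catch a regex that never matches before launching a training job.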

Loosely speaking, when we train a machine learning model over a dataset, one complete run through the dataset is called an _epoch_. Usually models are trained for more than one epoch, and in our case we will train for three epochs. The _loss_ is a generalized notion of the error associated with the model's performance against the test dataset we split from the training set at the beginning of this notebook. The lower the loss is, the better our model is at predicting correct sentiment labels on the test dataset, which it has never seen before.
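
For intuition about what a loss value measures, here is a minimal sketch of cross-entropy, a loss commonly used for classification. Whether the training script uses exactly this formulation is an assumption, and the logits and the function name `cross_entropy` are invented for illustration.

```python
import math

def cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to the true label."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    probabilities = [e / total for e in exps]
    return -math.log(probabilities[label])

# Hypothetical model scores for [negative, positive].
logits = [-1.0, 2.0]

confident_and_right = cross_entropy(logits, label=1)
confident_and_wrong = cross_entropy(logits, label=0)
print(round(confident_and_right, 4), round(confident_and_wrong, 4))  # 0.0486 3.0486
```

The same scores yield a small loss when the true label is the one the model favors and a large loss when it is not.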

In [None]:
from sagemaker.huggingface import HuggingFace

job_name = "imdb-huggingface"

estimator = HuggingFace(
    base_job_name = job_name,
    role = role,
    py_version = "py36",
    pytorch_version = "1.6.0",
    transformers_version = TRANSFORMERS_VERSION,
    entry_point = "trainer.py",
    instance_count = 1,
    instance_type = "ml.p3.16xlarge",
    source_dir = "./scripts",
    enable_sagemaker_metrics = True,
    metric_definitions = [
        { "Name": "epoch", "Regex": r"'epoch': ([0-9]+(\.|e\-)[0-9]+),?" },
        { "Name": "loss", "Regex": r"'loss': ([0-9]+(\.|e\-)[0-9]+),?" }
    ],
    hyperparameters = {
        "epochs": 3,
        "eval_batch_size": 32,
        "model_name": model_name,
        "train_batch_size": 32
    }
)

<span style="font-size: 16px;"><a href="https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs/">See Training Jobs in the SageMaker Console</a></span>

### Train a Model using the Estimator
##### ⏰ About 15 minutes

In [None]:
%%time

inputs = {
    "train": training_input_path, 
    "test": test_input_path
}
estimator.fit(inputs)

### How'd It Go?

In [None]:
from sagemaker import TrainingJobAnalytics
df = TrainingJobAnalytics(training_job_name = estimator.latest_training_job.name).dataframe()
df = df[["metric_name", "value"]]

summary = df.groupby("metric_name").describe()
summary.columns = summary.columns.droplevel(0)
summary = summary.reset_index().rename(columns = { 
    "metric_name": "Metric",
    "min": "Min", 
    "max": "Max", 
    "mean": "Average" 
}).set_index("Metric")
summary = summary.drop(["std", "count", "25%", "50%", "75%"], axis = 1).drop(["epoch"])
display(summary)

---
# Model Deployment

### Prepare the Model for Deployment

Here we use PyTorch for hosting the inference endpoint. The SageMaker SDK comes prebuilt with a PyTorch model class that lets us easily deploy the model to a real-time inference endpoint. Because Hugging Face models are compatible with PyTorch, we can simply pass along the reference to the trained model artifacts in S3 to the PyTorchModel object we create below.

When we set up this SageMaker model, we need to supply a script that is used when the inference endpoint is invoked. Some models do not need this level of customization, but we want to make sure that our model uses JSON as its input and output format and performs the low-level predictions in a particular way. That logic is coded in the `predictor.py` script included in this project and passed along to our PyTorchModel object below.
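
The SageMaker PyTorch serving stack looks for optional handler functions such as `input_fn` and `output_fn` in the entry point script. The sketch below shows how the JSON handling could look. It is not the project's actual `predictor.py`, and the response fields (`label`, `score`) are assumptions.

```python
import json

JSON_CONTENT_TYPE = "application/json"

def input_fn(request_body, content_type=JSON_CONTENT_TYPE):
    """Deserialize the request payload into a Python dict."""
    if content_type != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)

def output_fn(prediction, accept=JSON_CONTENT_TYPE):
    """Serialize the prediction back to JSON for the caller."""
    if accept != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction)

# Simulate one round trip without a model; the prediction dict is hypothetical.
request = input_fn('{"text": "Willow is the greatest movie."}')
response = output_fn({"label": "positive", "score": 0.98})
print(request["text"])
print(response)
```

A full script would typically also define `model_fn` to load the trained model and `predict_fn` to run it on the deserialized input.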

In [None]:
from sagemaker.utils import name_from_base
from sagemaker.pytorch import PyTorchModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

class SentimentAnalysis(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super().__init__(
            endpoint_name, 
            sagemaker_session = sagemaker_session, 
            serializer = JSONSerializer(), 
            deserializer = JSONDeserializer()
        )

name = name_from_base(job_name)

model = PyTorchModel(
    name = name,
    role = role, 
    model_data = estimator.model_data,
    source_dir = "./scripts",
    entry_point = "predictor.py",
    framework_version = "1.6.0",
    py_version = "py36",
    predictor_cls = SentimentAnalysis
)

<span style="font-size: 16px;"><a href="https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/endpoints">See Endpoints in the SageMaker Console</a></span>

### Deploy the Model
##### ⏰ About 5 minutes

Now that we've configured our model, all that is left is to deploy it. 

In [None]:
%%time

predictor = model.deploy(
    initial_instance_count = 1,
    instance_type = "ml.m5.large",
    endpoint_name = name,
    wait = True
)

### Make Inferences Using a SageMaker Predictor

In [None]:
import json

inputs = [
    "Willow is the greatest movie that ever lived.",
    "The Notebook is ironically depressing.",
    "It's annoying that I had to Google the capitalization of 'Back to the Future', but it is a gem of nostalgic wonder.",
    "Yikes! Weird Science did not age well for 2021.",
    "Love and Monsters made me cry happy tears."
]

results = []
for it in inputs:
    inp = {"text": it}
    prediction = predictor.predict(inp)
    results.append({
        **inp,
        **prediction
    })
    
df = pandas.DataFrame(results)
df.head()

---
# Clean Up

In [None]:
try:
    predictor.delete_endpoint()
    model.delete_model()
except:
    display("Already deleted")

In [None]:
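import os
import tarfile
from transformers import AutoTokenizer, AutoModelForSequenceClassification

sagemaker.s3.S3Downloader.download(estimator.model_data, "models")

# SageMaker stores the artifact as a tarball; unpack it before loading (assumes the model files sit at the archive root)
archive = os.path.join("models", os.path.basename(estimator.model_data))
with tarfile.open(archive) as tar:
    tar.extractall("models")

lt = AutoTokenizer.from_pretrained("distilbert-base-uncased")
lm = AutoModelForSequenceClassification.from_pretrained("./models")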

In [None]:
import torch

tokenized = lt(
    inputs[0],
    add_special_tokens = True,
    return_token_type_ids = False,
    return_attention_mask = True,
    padding = "max_length",
    truncation = True,
    return_tensors = "pt"
)
# Inference only, so skip gradient tracking
with torch.no_grad():
    prediction = lm(tokenized["input_ids"], tokenized["attention_mask"])

# Convert raw logits to probabilities and pick the most likely class
p = torch.softmax(prediction.logits, dim = 1)
values, indices = torch.max(p, dim = 1)

# Label 0 is negative and label 1 is positive, matching the training data
print(values.item(), ["negative", "positive"][indices.item()])