In [None]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Training PyTorch Model on Google Cloud AI Platform Training 
## Fine Tuning Pretrained [BERT](https://huggingface.co/bert-base-cased) Model for Sentiment Classification Task 

# Overview

This example is inspired by the Token-Classification [notebook](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb) and [run_glue.py](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py) from HuggingFace 🤗. 
We will be fine-tuning bert-base-cased (pre-trained) model.
You can find the details about this model at [🤗  Hub](https://huggingface.co/bert-base-cased).

For more state-of-the-art PyTorch/TensorFlow/JAX notebooks, you can explore [🤗 Notebooks](https://huggingface.co/transformers/notebooks.html).

### Dataset

We will be using the IMDB movie review dataset from Hugging Face Datasets.

### Objective

Get familiar with PyTorch on Cloud AI Platform notebooks instances.

### Costs 

This tutorial uses billable components of Google Cloud Platform (GCP):

* Cloud AI Platform Notebook
* Cloud AI Platform Training

Learn about [Cloud AI Platform
pricing](https://cloud.google.com/ml-engine/docs/pricing) and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Setting up Notebook Environment

This notebook assumes a [PyTorch 1.7 DLVM](https://cloud.google.com/ai-platform/notebooks/docs/images) development environment. You can create a Notebook instance using the [Google Cloud Console](https://cloud.google.com/ai-platform/notebooks/docs/create-new) or the [`gcloud` command](https://cloud.google.com/sdk/gcloud/reference/notebooks/instances/create).

```
gcloud notebooks instances create example-instance \
    --vm-image-project=deeplearning-platform-release \
    --vm-image-family=pytorch-1-7-cu110-notebooks \
    --machine-type=n1-standard-4 \
    --location=us-central1-a \
    --boot-disk-size=100 \
    --accelerator-core-count=1 \
    --accelerator-type=NVIDIA_TESLA_T4 \
    --install-gpu-driver \
    --network=default
```

### Python Dependencies

The Python dependencies required for this notebook are [Transformers](https://pypi.org/project/transformers/) and [Datasets](https://pypi.org/project/datasets/). They are installed in the notebook itself.

In [None]:
!pip -q install torch==1.7
!pip -q install transformers
!pip -q install datasets
!pip -q install tqdm

### Restart the Kernel

Once you've installed the packages, you need to restart the notebook kernel so it can find them.

In [None]:
# Automatically restart kernel after installs
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

### Python imports

In [None]:
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EvalPrediction, Trainer, TrainingArguments,
                          default_data_collator)

## Loading the dataset

We use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data. This can be easily done with the `load_dataset` function.

For this example we will use the IMDB movie review dataset for a sentiment classification task.

In [None]:
datasets = load_dataset("imdb")
batch_size = 16
max_seq_length = 128
model_name_or_path = "bert-base-cased"

In [None]:
datasets

The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split. For IMDB these are `train`, `test` and `unsupervised` (unlabeled reviews).

In [None]:
print(
    "Total # of rows in training dataset {} and size {:5.2f} MB".format(
        datasets["train"].shape[0], datasets["train"].size_in_bytes / (1024 * 1024)
    )
)
print(
    "Total # of rows in test dataset {} and size {:5.2f} MB".format(
        datasets["test"].shape[0], datasets["test"].size_in_bytes / (1024 * 1024)
    )
)

To access an actual element, you need to select a split first, then give an index:

In [None]:
datasets["train"][0]

We use the `unique` method to extract the list of labels. This allows us to experiment with other datasets without hard-coding labels.

In [None]:
label_list = datasets["train"].unique("label")

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).

In [None]:
import random

import pandas as pd
from datasets import ClassLabel, Sequence
from IPython.display import HTML, display


def show_random_elements(dataset, num_examples=2):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(
                lambda x: [typ.feature.names[i] for i in x]
            )
    display(HTML(df.to_html()))

In [None]:
show_random_elements(datasets["train"])

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    use_fast=True,
)
# 'use_fast' ensures that we use fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can check which type of models have a fast tokenizer available and which don't on the [big table of models](https://huggingface.co/transformers/index.html#bigtable).

You can directly call this tokenizer on one sentence:

In [None]:
tokenizer("Hello, this is one sentence!")

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Note: If your inputs have already been split into words (which is not the case for the raw IMDB reviews used here), you should pass the list of words to your tokenizer with the argument `is_split_into_words=True`:

In [None]:
example = datasets["train"][4]
print(example)

In [None]:
tokenizer(
    ["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."],
    is_split_into_words=True,
)

In [None]:
# Dataset loading repeated here to make this cell idempotent
# Since we are over-writing datasets variable
datasets = load_dataset("imdb")

# NOTE: We could extract this mapping automatically, but `unique("label")` on
# the train split does not report the label -1, which is used by the unlabeled
# "unsupervised" split and shows up when preprocessing all splits.
# Hence the additional -1 entry in the dictionary.
label_to_id = {1: 1, 0: 0, -1: 0}


def preprocess_function(examples):
    # Tokenize the texts
    args = (examples["text"],)
    result = tokenizer(
        *args, padding="max_length", max_length=max_seq_length, truncation=True
    )

    # Map labels to IDs (not necessary for GLUE tasks)
    if label_to_id is not None and "label" in examples:
        result["label"] = [label_to_id[example] for example in examples["label"]]

    return result


datasets = datasets.map(preprocess_function, batched=True, load_from_cache_file=True)
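
To make the effect of `padding="max_length"` and `truncation=True` more concrete, here is a minimal pure-Python sketch of what happens to a list of token IDs. The helper `pad_or_truncate`, the pad ID of 0 and the token IDs are illustrative assumptions and not part of the 🤗 Tokenizers API. The real tokenizer also takes care of keeping special tokens such as `[CLS]` and `[SEP]` when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    ids = list(token_ids)[:max_length]  # truncation: drop tokens past max_length
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad             # padding: fill up to max_length
    mask += [0] * n_pad                 # 0 marks padding the model should ignore
    return ids, mask


# Made-up token IDs: one short sequence and one that is too long
short_ids, short_mask = pad_or_truncate([101, 8667, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(range(10), max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length (`max_seq_length`), the batches can be stacked into tensors without any further padding at training time.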

Note that transformers are often pretrained with subword tokenizers, meaning that even if your inputs have been split into words already, each of those words could be split again by the tokenizer.
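
As a rough illustration of subword splitting, the sketch below implements a greedy longest-match-first split in the style of WordPiece. The function `wordpiece` and the tiny vocabulary are assumptions made up for this example; the real `bert-base-cased` tokenizer uses a much larger vocabulary and additional normalization steps.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for this example
toy_vocab = {"token", "##izer", "##s", "un", "##believ", "##able", "film"}
for word in ["tokenizers", "unbelievable", "film", "xyz"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

Words that are in the vocabulary stay whole, while rarer words are broken into several pieces, which is why the number of tokens is usually larger than the number of words.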

## Fine Tuning the Model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sequence classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (which we can get from `label_list`, extracted earlier):

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path, num_labels=len(label_list)
)

The warning is telling us we are throwing away some weights (for `bert-base-cased`, the pretraining heads under `cls.*`) and randomly initializing some others (the `classifier` layer). The exact layer names depend on the checkpoint. This is absolutely normal in this case, because we are removing the heads used to pretrain the model and replacing them with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
args = TrainingArguments(
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    output_dir="/tmp/cls",
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay.
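
To get a feel for what these settings mean in practice, the following back-of-the-envelope sketch estimates the number of optimization steps. It assumes a single device and no gradient accumulation; with several GPUs, the effective batch size is multiplied by the number of devices.

```python
import math

num_train_examples = 25_000  # size of the IMDB "train" split
batch_size = 16              # per_device_train_batch_size used above
num_devices = 1              # assumption: a single GPU (or CPU)
num_train_epochs = 1

# The last, smaller batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(num_train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * num_train_epochs
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```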

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. You can define your own custom `compute_metrics` function. It takes an `EvalPrediction` object (a namedtuple with a `predictions` and a `label_ids` field) and has to return a dictionary mapping strings to floats.

In [None]:
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
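
To see what this function computes, here is a small pure-Python sketch of the same idea on made-up logits: take the arg-max of each row as the predicted class and compare it with the label. The function name `accuracy_from_logits` and the toy values are illustrative only.

```python
def accuracy_from_logits(logits, label_ids):
    """Arg-max each row of logits and compare with the true labels."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, label_ids))
    return {"accuracy": correct / len(label_ids)}


# Four made-up examples with two classes (0 = negative, 1 = positive)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.7]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))
```

Three of the four toy predictions match their labels, so the reported accuracy is 0.75.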

Now we create the `Trainer` object and we are almost ready to train.

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
    data_collator=default_data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

In [None]:
trainer.save_model("./finetuned-bert-classifier")

The `evaluate` method allows you to evaluate again on the evaluation dataset or on another dataset:

In [None]:
trainer.evaluate()

Now that we have finished training, the `predict` method can be used to get raw predictions and labels for a dataset, from which per-class precision, recall and F1 can be computed.
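
As a sketch of how per-class precision, recall and F1 could be derived, the helper below works on plain lists of predicted and true class IDs. The function `precision_recall_f1` and the toy data are assumptions for illustration. In the notebook, the inputs would come from the arg-max of `trainer.predict(datasets["test"]).predictions` and from the matching `label_ids`.

```python
def precision_recall_f1(preds, labels, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up predictions and labels (0 = negative, 1 = positive)
toy_preds = [1, 1, 0, 1, 0, 0]
toy_labels = [1, 0, 0, 1, 1, 0]
for cls in (0, 1):
    p, r, f = precision_recall_f1(toy_preds, toy_labels, positive=cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```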

## Running Predictions with Sample Examples

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name_or_path = "bert-base-cased"
label_text = {0: "Negative", 1: "Positive"}
saved_model_path = "./finetuned-bert-classifier"


def predict(input_text, saved_model_path):
    # initialize tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

    # preprocess and encode input text
    predict_input = tokenizer.encode(
        input_text, truncation=True, max_length=128, return_tensors="pt"
    )

    # load trained model
    loaded_model = AutoModelForSequenceClassification.from_pretrained(saved_model_path)

    # get predictions (no gradients are needed for inference)
    with torch.no_grad():
        output = loaded_model(predict_input)

    # pick the label with the highest logit
    label_id = torch.argmax(output.logits, dim=1)

    print(f"Review text: {input_text}")
    print(f"Sentiment : {label_text[label_id.item()]}\n")

In [None]:
# example #1
review_text = (
    """Jaw dropping visual affects and action! One of the best I have seen to date."""
)
predict(review_text, saved_model_path)

In [None]:
# example #2
review_text = """Take away the CGI and the A-list cast and you end up with film with less punch."""
predict(review_text, saved_model_path)
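
The `predict` function only reports the most likely label. If a confidence score is also of interest, the logits can be turned into probabilities with a softmax. The sketch below does this in pure Python on made-up logits; the values are not actual model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_text = {0: "Negative", 1: "Positive"}
toy_logits = [-1.2, 2.3]  # made-up logits for one review
probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"Sentiment: {label_text[best]} (probability {probs[best]:.3f})")
```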

## Run Training Job on Cloud AI Platform (CAIP)

You can do local experimentation on your AI Platform Notebooks instance. However, larger datasets or models often require vertically scaled compute or horizontally distributed training. A cost-effective way to run such jobs is the [Cloud AI Platform Training](https://cloud.google.com/ai-platform/training/docs) service. AI Platform Training creates the designated compute resources, performs the training task, and deletes the compute resources once the training job is finished.

In this part of the notebook, we show how to scale your training job by packaging the code and submitting the training job to AI Platform Training.

### Packaging the Training Application

Before running the training application with AI Platform Training, the training application code and any dependencies must be uploaded to a Cloud Storage bucket that your Google Cloud project can access. This section shows how to package and stage your application in the cloud.

There are two ways to package your application and dependencies and run on AI Platform Training:

1. Package [application and Python dependencies](https://cloud.google.com/ai-platform/training/docs/packaging-trainer#working_with_dependencies) manually using setup tools
2. Use [custom containers](https://cloud.google.com/ai-platform/training/docs/custom-containers-training) to package dependencies using Docker containers

#### Recommended Training Application Structure

You can structure your training application in any way you like. However, the [following structure](https://cloud.google.com/ai-platform/training/docs/packaging-trainer#project-structure) is commonly used in AI Platform Training samples, and having your project's organization be similar to the samples can make it easier for you to follow the samples.

We have two directories, `python_package` and `custom_container`, showing both packaging approaches. The `README.md` file inside each directory has details on the directory structure and instructions on how to run the application locally and on the cloud.

```
.
├── custom_container
│   ├── Dockerfile
│   ├── README.md
│   ├── scripts
│   │   ├── train-cloud.sh
│   │   └── train-local.sh
│   └── trainer -> ../python_package/trainer/
├── python_package
│   ├── README.md
│   ├── scripts
│   │   ├── train-cloud.sh
│   │   └── train-local.sh
│   ├── setup.py
│   └── trainer
│       ├── __init__.py
│       ├── experiment.py
│       ├── metadata.py
│       ├── model.py
│       ├── task.py
│       └── utils.py
└── pytorch-text-classification-caip-training.ipynb    --> This notebook
```

1. The main project directory contains your `setup.py` file or `Dockerfile` with the dependencies.
2. Use a subdirectory named `trainer` to store your main application module and `scripts` to store the scripts that submit training jobs locally or to the cloud.
3. Inside the `trainer` directory:
    - `task.py` - Main application module: 1) initializes and parses task arguments (hyperparameters), and 2) serves as the entry point to the trainer
    - `model.py` - Includes a function to create a model with a sequence classification head from a pretrained model
    - `experiment.py` - Runs the model training and evaluation experiment, and exports the final model
    - `metadata.py` - Defines metadata for the classification task, such as the predefined model and dataset names and the target labels
    - `utils.py` - Includes utility functions such as data input functions to read data and a function to save the model to a GCS bucket
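
To illustrate the role of `task.py`, here is a minimal sketch of how such an entry point might parse its arguments with `argparse`. Only `--job-dir` and `--model-name` appear in the commands used in this notebook; the remaining flags, all defaults and the function name `get_args` are hypothetical and may differ from the actual `trainer/task.py`.

```python
import argparse


def get_args(argv=None):
    """Parse command-line arguments for a hypothetical trainer.task module."""
    parser = argparse.ArgumentParser(description="Sketch of a trainer entry point")
    # Flags that appear in the training commands of this notebook
    parser.add_argument("--job-dir", required=True,
                        help="Local or GCS directory for checkpoints and the exported model")
    parser.add_argument("--model-name", default="finetuned-bert-classifier",
                        help="Name under which the fine-tuned model is saved")
    # Hypothetical hyperparameter flags, added for illustration only
    parser.add_argument("--num-epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--learning-rate", type=float, default=2e-5)
    return parser.parse_args(argv)


# Passing a list makes the parser easy to try out inside a notebook
args = get_args(["--job-dir", "/tmp/cls", "--model-name", "finetuned-bert-classifier"])
print(args.job_dir, args.model_name, args.num_epochs, args.batch_size, args.learning_rate)
```

Everything after the bare `--` in a `gcloud ai-platform jobs submit training` command is forwarded to the module, so flags such as `--model-name` end up in a parser like this one.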


### Using Python Packaging to Build Manually

In this notebook, we use Hugging Face Datasets and fine-tune a transformer model from the Hugging Face Transformers library for a sentiment analysis task. We add the standard Python dependencies (`transformers`, `datasets` and `tqdm`) to the `setup.py` file. The `find_packages()` function inside `setup.py` picks up the `trainer` directory because it contains `__init__.py`, which tells [Python Setuptools](https://setuptools.readthedocs.io/en/latest/) to treat the directory as a Python package and include its modules in the distribution.

```
# ==========================================
# contents of setup.py file
# ==========================================

from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = [
    'torch==1.7',
    'transformers',
    'datasets',
    'tqdm'
]

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='AI Platform | Training | PyTorch | Text Classification | Python Package'
)
```

#### Running Training Job Locally

Before submitting the job to cloud, ensure the script runs locally. The script `./python_package/scripts/train-local.sh` runs training locally using `python -m trainer.task`

```
python -m trainer.task \
    --job-dir ${JOB_DIR} \
    --model-name="finetuned-bert-classifier"
```

In [None]:
!cd python_package && ./scripts/train-local.sh

#### Running Training Job on Cloud AI Platform

You submit the training job to Cloud AI Platform Training using `gcloud ai-platform jobs submit training`. The `gcloud` command stages your training application in a GCS bucket and submits the training job.


```
gcloud ai-platform jobs submit training ${JOB_NAME} \
    --region ${REGION} \
    --master-image-uri ${IMAGE_URI} \
    --scale-tier=CUSTOM \
    --master-machine-type=n1-standard-8 \
    --master-accelerator=type=nvidia-tesla-t4,count=2 \
    --job-dir ${JOB_DIR} \
    --module-name trainer.task \
    --package-path ${PACKAGE_PATH} \
    -- \
    --model-name="finetuned-bert-classifier"
```

- Set the `--master-image-uri` flag to `gcr.io/cloud-aiplatform/training/pytorch-gpu.1-7` to train on the pre-built PyTorch v1.7 image for GPU
- Set the `--package-path` flag to the path of your training application package
- Set the `--module-name` flag to `trainer.task`, which is the main module that starts your application
- Set the `--master-accelerator` and `--master-machine-type` flags to define the infrastructure that runs the application. Refer to the [documentation](https://cloud.google.com/ai-platform/training/docs/machine-types) to set machine types and scaling tiers
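
The shell variables in the command above (`JOB_NAME`, `REGION`, `IMAGE_URI`, `JOB_DIR`, `PACKAGE_PATH`) have to be defined before the job is submitted. The sketch below shows one way to assemble the same command in Python; it only builds and prints the command and does not run it. The bucket name, region and package path are placeholders, and `build_submit_command` is a hypothetical helper, not what `train-cloud.sh` necessarily does.

```python
import shlex
from datetime import datetime


def build_submit_command(job_name, region, image_uri, job_dir, package_path):
    """Return the gcloud command from above as a list of arguments."""
    return [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--region", region,
        "--master-image-uri", image_uri,
        "--scale-tier=CUSTOM",
        "--master-machine-type=n1-standard-8",
        "--master-accelerator=type=nvidia-tesla-t4,count=2",
        "--job-dir", job_dir,
        "--module-name", "trainer.task",
        "--package-path", package_path,
        "--",  # everything after this is passed to trainer.task
        "--model-name=finetuned-bert-classifier",
    ]


# A timestamp suffix keeps job names unique across submissions
job_name = "finetuned_bert_classifier_" + datetime.now().strftime("%Y%m%d_%H%M%S")
cmd = build_submit_command(
    job_name=job_name,
    region="us-central1",                              # placeholder region
    image_uri="gcr.io/cloud-aiplatform/training/pytorch-gpu.1-7",
    job_dir="gs://your-bucket-name/" + job_name,       # placeholder bucket
    package_path="./trainer",                          # placeholder path
)
print(" ".join(shlex.quote(part) for part in cmd))
```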

In [None]:
!cd python_package && ./scripts/train-cloud.sh

### Using Custom Containers

To create a training job with a custom container, you have to define a `Dockerfile` that installs the dependencies required for the training job. Then, you build and test your Docker image locally to verify it before using it with AI Platform Training.


```
# ==========================================
# contents of Dockerfile
# ==========================================

# Use the pre-built PyTorch 1.7 GPU training image as the base
FROM gcr.io/cloud-aiplatform/training/pytorch-gpu.1-7

WORKDIR /root

# Installs google-cloud-storage, transformers, datasets and tqdm.
RUN pip install google-cloud-storage transformers datasets tqdm

# Copies the trainer code to the docker image.
COPY ./trainer/__init__.py ./trainer/__init__.py
COPY ./trainer/experiment.py ./trainer/experiment.py
COPY ./trainer/utils.py ./trainer/utils.py
COPY ./trainer/metadata.py ./trainer/metadata.py
COPY ./trainer/model.py ./trainer/model.py
COPY ./trainer/task.py ./trainer/task.py

# Set up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]
```

#### Running Training Job Locally with Custom Container

Before submitting the job to the cloud, ensure the container runs locally. The script `./custom_container/scripts/train-local.sh` builds the Docker image and runs it locally; the image's entry point invokes `python -m trainer.task`.

```
# Build the docker image
docker build -f Dockerfile -t ${IMAGE_URI} ./

# Test your docker image locally
echo "Running the Docker Image"
docker run ${IMAGE_URI} \
    --job-dir ${JOB_DIR} \
    --model-name="finetuned-bert-classifier"
```

In [None]:
!cd custom_container && ./scripts/train-local.sh

#### Running Training Job on Cloud AI Platform with Custom Container

Before submitting the training job, you need to push the image to Google Cloud Container Registry. You can then submit the training job to Cloud AI Platform Training using `gcloud ai-platform jobs submit training`. 


```
# Deploy the docker image to Cloud Container Registry
docker push ${IMAGE_URI}

# Submit the training job
gcloud ai-platform jobs submit training ${JOB_NAME} \
    --region ${REGION} \
    --master-image-uri ${IMAGE_URI} \
    --scale-tier=CUSTOM \
    --master-machine-type=n1-standard-8 \
    --master-accelerator=type=nvidia-tesla-t4,count=2 \
    --job-dir ${JOB_DIR} \
    -- \
    --model-name="finetuned-bert-classifier"
```

- Set the `--master-image-uri` flag to the custom container image pushed to Google Cloud Container Registry
- Set the `--master-accelerator` and `--master-machine-type` flags to define the infrastructure that runs the application. Refer to the [documentation](https://cloud.google.com/ai-platform/training/docs/machine-types) to set machine types and scaling tiers

In [None]:
!cd custom_container && ./scripts/train-cloud.sh

## Monitoring Training Job on Cloud AI Platform (CAIP)

After you submit your job, you can monitor the job status using the `gcloud ai-platform jobs describe $JOB_NAME` command.

In [None]:
!gcloud ai-platform jobs describe $JOB_NAME

You can stream logs using `gcloud ai-platform jobs stream-logs $JOB_NAME`

In [None]:
!gcloud ai-platform jobs stream-logs $JOB_NAME
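
If you prefer to wait for the job from Python instead of watching the logs, a simple polling loop is one option. The sketch below is a hedged example: `wait_for_job` is a hypothetical helper, the set of terminal states is an assumption based on the job states reported by AI Platform Training, and the status lookup is replaced by a stand-in so the cell runs without any cloud access. In practice, `get_state` might wrap a call such as `gcloud ai-platform jobs describe $JOB_NAME --format="value(state)"`.

```python
import time

# Assumed terminal job states; any other state means the job is still in progress
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call get_state() until it returns a terminal state or max_polls is reached."""
    state = None
    for _ in range(max_polls):
        state = get_state()
        print("Job state:", state)
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    return state  # still not finished after max_polls lookups


# Stand-in for a real status lookup, so the example runs offline
fake_states = iter(["QUEUED", "PREPARING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job(lambda: next(fake_states), poll_seconds=0)
print("Final state:", final_state)
```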

## Cleaning up Notebook Environment

After you are done experimenting, you can either [STOP](https://cloud.google.com/ai-platform/notebooks/docs/shut-down) or DELETE the AI Platform Notebooks instance to prevent any charges. If you want to keep your work, stop the instance instead of deleting it.

```
# Stopping AI Platform Notebook instance
gcloud notebooks instances stop example-instance --location=us-central1-a


# Deleting AI Platform Notebook instance
gcloud notebooks instances delete example-instance --location=us-central1-a
```