<a href="https://colab.research.google.com/github/rastringer/ml_gke_notebooks/blob/main/swin_image_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### Training a larger Swin transformer model on T4 GPUs using GKE

In this example, we fine-tune a pretrained vision model (here, a Swin transformer) for image classification on a custom dataset. We add a randomly initialized classification head on top of the pretrained encoder, then fine-tune the model on a labeled dataset.



### Authenticate to GCP

In [None]:
from google.colab import auth
auth.authenticate_user()

Let's make sure we have the Google Cloud CLI installed.

In [None]:
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg |  gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg

!echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" |  tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

!apt-get update && apt-get install -y google-cloud-cli

!apt-get update && apt-get install -y google-cloud-cli-gke-gcloud-auth-plugin


Here's how we can install `kubectl` in the Colab environment.

In [None]:
# The legacy apt.kubernetes.io repository has been deprecated and frozen.
# kubectl is also packaged in the Google Cloud CLI apt repository added in the previous cell.
! apt-get update && apt-get install -y kubectl

### Set environment

In [None]:
import os

os.environ["PROJECT_ID"] = "<your-project-id>"
os.environ["REGION"] = "<your-region>"
os.environ["ZONE"] = "<your-zone>"
os.environ["BUCKET_NAME"] = "<your-bucket>"
os.environ["VPC_NAME"] = "default"
os.environ["VPC_SUBNET"] = "default"
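Before running the cells below, it can help to confirm that none of the placeholder values were left unedited. The following is a minimal sketch: the helper name `find_unset_placeholders` is hypothetical (it is not part of any Google Cloud library), and it only treats missing, empty, or `<...>`-style values as unset.

```python
import os

# Variables set in the cell above.
REQUIRED_VARS = ["PROJECT_ID", "REGION", "ZONE", "BUCKET_NAME", "VPC_NAME", "VPC_SUBNET"]


def find_unset_placeholders(env, names=REQUIRED_VARS):
    """Return the names whose value is missing, empty, or still a <placeholder>."""
    unset = []
    for name in names:
        value = env.get(name, "")
        if not value or (value.startswith("<") and value.endswith(">")):
            unset.append(name)
    return unset


missing = find_unset_placeholders(os.environ)
if missing:
    print("Please set real values for:", ", ".join(missing))
```

If the check prints nothing, every variable holds a non-placeholder value. It does not verify that the project, region, or bucket actually exist.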

In [None]:
!gcloud config set project $PROJECT_ID

In [None]:
# Enable APIs
! gcloud services enable iam.googleapis.com
! gcloud services enable container.googleapis.com
! gcloud services enable cloudbuild.googleapis.com

### Create a GKE cluster

There are many options when creating a GKE cluster. See the [`gcloud container clusters create` reference](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create) for the full list.



In [None]:
! gcloud container clusters create t4-demo --location ${REGION} \
  --workload-pool ${PROJECT_ID}.svc.id.goog \
  --enable-image-streaming \
  --enable-shielded-nodes \
  --shielded-secure-boot \
  --shielded-integrity-monitoring \
  --enable-ip-alias \
  --node-locations=${ZONE} \
  --network projects/$PROJECT_ID/global/networks/${VPC_NAME} \
  --subnetwork ${VPC_SUBNET} \
  --labels="ml-on-gke=t4-demo" \
  --addons GcsFuseCsiDriver

Now that we have a cluster, we need to add a [node pool](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) with T4 GPUs.

In [None]:
! gcloud container node-pools create t4-pool \
  --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
  --machine-type=n1-standard-2 \
  --num-nodes=3 \
  --region=${REGION} \
  --cluster=t4-demo \
  --node-locations ${ZONE}
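With `--num-nodes=3`, a single zone in `--node-locations`, and one T4 per node, the pool provides three GPUs in total, while the training job defined later in this notebook requests one. As a rough sketch of that arithmetic (the function name `total_gpus` is hypothetical; `--num-nodes` counts nodes per zone):

```python
def total_gpus(num_nodes_per_zone, num_zones, gpus_per_node):
    """Total GPUs in a node pool: nodes per zone x zones x GPUs per node."""
    return num_nodes_per_zone * num_zones * gpus_per_node


# Values from the node-pool command above: 3 nodes, 1 zone, 1 T4 per node.
print(total_gpus(3, 1, 1))  # 3
```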

### Writefile

We use `%%writefile`, a magic function for notebooks, to write our code to a file, `train.py` in this case.

In [None]:
%%writefile train.py

# Thanks to Hugging Face for their Swin example notebook here:
# https://colab.sandbox.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb

import os

from google.cloud import storage

# Set your Google Cloud Storage bucket name
bucket_name = "eurosat-data"

# Set the local directory where you want to copy the files
local_directory = "data"

# Initialize a client
client = storage.Client()

# Get the bucket
bucket = client.get_bucket(bucket_name)

# List the objects in the bucket
blobs = bucket.list_blobs()

# Loop through the objects and download each one
for blob in blobs:
    if blob.name.endswith("/"):
        continue  # skip folder placeholder objects
    destination_blob_path = os.path.join(local_directory, blob.name)
    # Create the parent directory before writing the file
    os.makedirs(os.path.dirname(destination_blob_path), exist_ok=True)
    blob.download_to_filename(destination_blob_path)
    print(f"Downloaded {blob.name} to {destination_blob_path}")


from datasets import load_dataset

# load a custom dataset from local/remote files or folders using the ImageFolder feature

# option 1: local/remote files (supporting the following formats: tar, gzip, zip, xz, rar, zstd)
# Assumes the bucket contains EuroSAT.zip, which the loop above downloads into local_directory
dataset = load_dataset("imagefolder", data_files=os.path.join(local_directory, "EuroSAT.zip"))

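import evaluate  # requires the `evaluate` and `scikit-learn` packages in the image

# `datasets.load_metric` has been removed from recent `datasets` releases;
# the `evaluate` library provides the same accuracy metric.
metric = evaluate.load("accuracy")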

labels = dataset["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label

print(id2label[2])  # a bare expression prints nothing when run as a script

from transformers import AutoImageProcessor

model_checkpoint = "microsoft/swin-tiny-patch4-window7-224" # pre-trained model from which to fine-tune
batch_size = 32 # batch size for training and evaluation

image_processor  = AutoImageProcessor.from_pretrained(model_checkpoint)
print(image_processor)

from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    RandomHorizontalFlip,
    RandomResizedCrop,
    Resize,
    ToTensor,
)

normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
if "height" in image_processor.size:
    size = (image_processor.size["height"], image_processor.size["width"])
    crop_size = size
    max_size = None
elif "shortest_edge" in image_processor.size:
    size = image_processor.size["shortest_edge"]
    crop_size = (size, size)
    max_size = image_processor.size.get("longest_edge")

train_transforms = Compose(
        [
            RandomResizedCrop(crop_size),
            RandomHorizontalFlip(),
            ToTensor(),
            normalize,
        ]
    )

val_transforms = Compose(
        [
            Resize(size),
            CenterCrop(crop_size),
            ToTensor(),
            normalize,
        ]
    )

def preprocess_train(example_batch):
    """Apply train_transforms across a batch."""
    example_batch["pixel_values"] = [
        train_transforms(image.convert("RGB")) for image in example_batch["image"]
    ]
    return example_batch

def preprocess_val(example_batch):
    """Apply val_transforms across a batch."""
    example_batch["pixel_values"] = [val_transforms(image.convert("RGB")) for image in example_batch["image"]]
    return example_batch

# split up training into training + validation
splits = dataset["train"].train_test_split(test_size=0.1)
train_ds = splits['train']
val_ds = splits['test']

train_ds.set_transform(preprocess_train)
val_ds.set_transform(preprocess_val)

from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes = True, # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)

model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-eurosat",
    remove_unused_columns=False,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,  # set to True only if a Hugging Face token is available in the container
)

import numpy as np

# the compute_metrics function takes a Named Tuple as input:
# predictions, which are the logits of the model as Numpy arrays,
# and label_ids, which are the ground-truth labels as Numpy arrays.
def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)

import torch

def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
)

train_results = trainer.train()
# rest is optional but nice to have
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

metrics = trainer.evaluate()
# some nice to haves:
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
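To make the metric concrete: `compute_metrics` in `train.py` takes the arg-max of each row of logits and compares it with the ground-truth label. The sketch below reproduces that calculation in plain Python on made-up logits. It is only an illustration, not the code path the `Trainer` uses.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, label_ids):
    """Fraction of rows whose arg-max matches the ground-truth label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, label_ids))
    return correct / len(label_ids)


# Made-up logits for four images and three classes.
logits = [
    [2.0, 0.1, -1.0],   # predicts class 0
    [0.3, 1.5, 0.2],    # predicts class 1
    [0.0, 0.2, 0.1],    # predicts class 1
    [-0.5, 0.0, 3.0],   # predicts class 2
]
label_ids = [0, 1, 2, 2]

print(accuracy(logits, label_ids))  # 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75.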

### Dockerfile

Kubernetes runs workloads packaged as containers. A `Dockerfile` is a script used to build a container image: a lightweight, portable unit that bundles an application with its dependencies so it can run in an isolated environment.

In our `Dockerfile`, we specify the base image, install dependencies, copy the files we need (e.g. `train.py`), and set environment variables.

In [None]:
%%writefile Dockerfile

FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

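RUN pip3 install --no-cache-dir torch torchvision transformers peft datasets evaluate scikit-learn accelerate google-cloud-storage google-cloud-logging

COPY train.py /train.py

# Unbuffered output so training logs appear immediately in `kubectl logs`
ENV PYTHONUNBUFFERED=1

CMD ["python3", "/train.py"]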

### Repository

We need a repository to host our Docker image. In this example, we will use GCP's [Artifact Registry](https://cloud.google.com/artifact-registry).

In [None]:
! gcloud artifacts repositories create pytorch-t4 \
    --project=$PROJECT_ID \
    --repository-format=docker \
    --location=us-central1 \
    --description="Docker repository"

In [None]:
! gcloud builds submit \
  --tag us-central1-docker.pkg.dev/$PROJECT_ID/pytorch-t4/swin-train .

The cluster needs read access to the repository; otherwise, pulling the image fails with [permission denied errors](https://cloud.google.com/kubernetes-engine/docs/troubleshooting#permission_denied_error).
Replace `<SERVICE_ACCOUNT_EMAIL>` in the command below with the email of the Compute Engine default service account.

In [None]:
! gcloud artifacts repositories add-iam-policy-binding pytorch-t4 \
    --location=us-central1 \
    --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
    --role="roles/artifactregistry.reader"

### YAML

YAML (YAML Ain't Markup Language) is a human-readable data serialization format commonly used for configuration files. Kubernetes uses YAML files to define and configure resources such as pods, deployments, services, and more.

A file typically contains a set of key-value pairs, where the keys represent the configuration options and the values represent their settings. Together they describe the desired state of the resources needed for a job.

A YAML file for a basic Pod might look like this:

```
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: nginx-container
    image: nginx:latest

```

We will see additional key-value pairs in the YAML necessary for our training job, including the job name, the docker image, and a graceful termination setting.

YAML files are typically run with a `kubectl` command such as:

`kubectl apply -f example-pod.yaml`


Remember to add your project id to the .yaml file below:

In [None]:
%%writefile train.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: swin-train-job-1
spec:
  backoffLimit: 5
  template:
    metadata:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: swin-train
        image: us-central1-docker.pkg.dev/<your_project_id>/pytorch-t4/swin-train:latest
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
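Rather than editing the image path by hand, the placeholder in `train.yaml` can be filled in from the `PROJECT_ID` environment variable set earlier. This is a sketch under that assumption: `fill_project_id` is a hypothetical helper, and it only replaces the literal `<your_project_id>` placeholder used in the manifest above.

```python
import os
from pathlib import Path

PLACEHOLDER = "<your_project_id>"  # the placeholder used in train.yaml above


def fill_project_id(manifest_text, project_id):
    """Return the manifest with the project-id placeholder replaced."""
    if not project_id or project_id.startswith("<"):
        raise ValueError("PROJECT_ID is not set to a real project id")
    return manifest_text.replace(PLACEHOLDER, project_id)


# Rewrite train.yaml in place, but only if it exists and PROJECT_ID looks real.
manifest_path = Path("train.yaml")
project_id = os.environ.get("PROJECT_ID", "")
if manifest_path.exists() and project_id and not project_id.startswith("<"):
    manifest_path.write_text(fill_project_id(manifest_path.read_text(), project_id))
```

Keeping the substitution in a small function makes it easy to check on a sample string before touching the file.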

### kubectl

`kubectl` is the command-line tool for interacting with Kubernetes clusters. It is the primary means for administrators, developers, and operators to manage clusters and work with the resources within them. The name is often read as "Kube Control."

Let's link `kubectl` to the cluster so we can start and check the training job.

In [None]:
!gcloud container clusters get-credentials t4-demo \
    --location ${REGION}

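To use `kubectl` commands on our cluster, we have to make sure it's talking to the correct one. The `get-credentials` command above already sets the current context to the new cluster, so we can confirm it with the `current-context` command:

In [None]:
!kubectl config current-context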

Get the basic info about the cluster

In [None]:
!kubectl cluster-info

Run, or 'apply' the training job

In [None]:
!kubectl apply -f train.yaml

See all jobs running

In [None]:
!kubectl get jobs

Check the nodes

In [None]:
!kubectl get nodes

Describe the training job

In [None]:
!kubectl describe job swin-train-job-1

To inspect the pods running the job, replace the value below with the `Created pod: xxxxx` value above.

In [None]:
!kubectl describe pod swin-train-job-1-xxxxx
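To check on the job from Python instead of reading the `describe` output, one option is to ask `kubectl` for JSON and inspect the Job's `status` field. The sketch below assumes the job name `swin-train-job-1` from `train.yaml`. `summarize_job_status` and `get_job` are hypothetical helpers, and the sample dictionary is made up for illustration.

```python
import json
import subprocess


def summarize_job_status(job):
    """Map a batch/v1 Job object (parsed JSON) to a short status string."""
    status = job.get("status", {})
    for condition in status.get("conditions", []):
        if condition.get("status") == "True" and condition.get("type") in ("Complete", "Failed"):
            return condition["type"]
    if status.get("active", 0) > 0:
        return "Running"
    return "Pending"


def get_job(name):
    """Fetch a Job as a dict via `kubectl get job NAME -o json`."""
    result = subprocess.run(
        ["kubectl", "get", "job", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Made-up example of a finished job's status.
sample = {"status": {"succeeded": 1, "conditions": [{"type": "Complete", "status": "True"}]}}
print(summarize_job_status(sample))  # Complete
```

In an environment where `kubectl` is configured for the cluster, `summarize_job_status(get_job("swin-train-job-1"))` would report the live state of the training job.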

----------------------------------