<a href="https://colab.research.google.com/github/rastringer/ml_gke_notebooks/blob/main/introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Training a simple MNIST model on T4 GPU using GKE

In this introductory notebook, we will create and run a simple model training job using a GPU on GKE.

The MNIST database (Modified National Institute of Standards and Technology database) is a popular dataset for training image recognition and other machine learning models. It comprises 60,000 training images and 10,000 test images of handwritten digits, 0-9.

As is common for container-based workloads, this code should be easy to adapt to run elsewhere, either on a local Kubernetes cluster or on another cloud.

### Authenticate to GCP

In [None]:
from google.colab import auth
auth.authenticate_user()

Let's make sure we have the Google Cloud CLI installed.

In [None]:
%%bash

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg

echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

# -y answers the install prompt, which would otherwise abort in a non-interactive cell
apt-get update && apt-get install -y google-cloud-cli google-cloud-cli-gke-gcloud-auth-plugin


Here's how we can install `kubectl` in the Colab environment.

In [None]:
%%bash

# The legacy apt.kubernetes.io repository has been retired, so install kubectl
# from the Google Cloud apt repository configured in the previous cell.
apt-get update && apt-get install -y kubectl

In [None]:
%%bash

gcloud config set project notebooks-370010

In [None]:
%%bash

gcloud projects describe notebooks-370010 --format="value(projectNumber)"

In [None]:
%%bash

gcloud services enable iam.googleapis.com

### Authenticate to Google Cloud

To let Terraform provision your GKE cluster and supporting infrastructure, we will use your [Application Default Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc) (ADC).

In [None]:
%%bash

gcloud auth application-default login

### Create a GKE cluster with Terraform

GKE clusters can be provisioned using Terraform, which allows us to adopt an Infrastructure as Code (IaC) approach. There are many options when creating a GKE cluster; please see [here](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create) for the full list.

Start by making a copy of the `terraform.tfvars.example` file and, at a minimum, update the `project_id` field with the ID of the Google Cloud project you are using.

Note that a new VPC will be provisioned to be used by the GKE cluster as part of the Terraform deployment.

Once complete, run the following commands to let Terraform provision a new cluster with a [node pool](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) that has T4 GPUs.

In [None]:
%%bash

terraform init
terraform apply -auto-approve

### Set up the convolutional neural net

Here's our simple convolutional neural network in PyTorch which we will use for some investigations on the MNIST dataset of hand-drawn numbers, 0-9. It consists of:

* Two convolutional layers (conv1 and conv2)
* Two dropout layers, used for regularization
* Two fully connected layers: once the output of the previous layers has been flattened, these reduce the dimensions to the final output of 10 classes (the digits 0-9)
* The `forward` method defines the forward pass of data through the various layers, with ReLU activation, max-pooling, and dropout to prevent overfitting.
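
The first fully connected layer in the network takes 9216 inputs. As a rough check of where that number comes from, the sketch below traces the image size through the two 3x3 convolutions and the 2x2 max-pool, assuming 28x28 MNIST inputs, no padding, and 64 output channels from the second convolution. The helper names `conv_out` and `flattened_features` are illustrative only and are not part of the training script.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(image_size: int = 28, channels: int = 64) -> int:
    """Number of values left after conv1, conv2, max-pool and flatten."""
    size = conv_out(image_size, kernel=3)      # conv1: 28 -> 26
    size = conv_out(size, kernel=3)            # conv2: 26 -> 24
    size = conv_out(size, kernel=2, stride=2)  # max_pool2d(x, 2): 24 -> 12
    return channels * size * size              # 64 * 12 * 12


print(flattened_features())
```

If the input resolution or kernel sizes change, the first argument of `nn.Linear` for the first fully connected layer would need to change to match.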

### Writefile

We use `%%writefile`, a magic function for notebooks, to write our code to a file, `train.py` in this case.

In [None]:
%%writefile train.py

"""
Thanks to Meta for the PyTorch example at
https://github.com/pytorch/examples/blob/main/mnist/main.py
"""

import os
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % LOG_INTERVAL == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

EPOCHS = 1
BATCH_SIZE = 64
LR = 0.001
GAMMA = 0.7
SEED = 1
LOG_INTERVAL = 100

use_cuda = torch.cuda.is_available()

torch.manual_seed(SEED)

if use_cuda:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

train_kwargs = {'batch_size': BATCH_SIZE}
test_kwargs = {'batch_size': BATCH_SIZE}
if use_cuda:
    cuda_kwargs = {'num_workers': 1,
                    'pin_memory': True,
                    'shuffle': True}
    train_kwargs.update(cuda_kwargs)
    test_kwargs.update(cuda_kwargs)

transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
    ])
dataset1 = datasets.MNIST('../data', train=True, download=True,
                    transform=transform)
dataset2 = datasets.MNIST('../data', train=False,
                    transform=transform)
train_loader = torch.utils.data.DataLoader(dataset1,**train_kwargs)
test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)

model = Net().to(device)
optimizer = optim.Adadelta(model.parameters(), lr=LR)

scheduler = StepLR(optimizer, step_size=1, gamma=GAMMA)
for epoch in range(1, EPOCHS + 1):
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader)
    scheduler.step()

torch.save(model, "mnist_cnn.pt")

# Upload the trained model to Cloud storage
from google.cloud import storage

storage_client = storage.Client()

BUCKET_NAME="genai-experiments"

destination_blob_name = "k8s-models/mnist_cnn.pt"

bucket = storage_client.bucket(BUCKET_NAME)
blob = bucket.blob(destination_blob_name)

blob.upload_from_filename("mnist_cnn.pt")

print(f"Model uploaded to: gs://{BUCKET_NAME}/{destination_blob_name}")

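# Log the job's resource usage to Cloud Logging
import resource
from google.cloud import logging as cloud_logging

logging_client = cloud_logging.Client()
logger = logging_client.logger('training_job_logger')

usage = resource.getrusage(resource.RUSAGE_SELF)
cpu_seconds = usage.ru_utime + usage.ru_stime  # user + system CPU time
memory_usage = usage.ru_maxrss / 1024  # peak resident memory; ru_maxrss is in KB on Linux

text = f"Resource usage: CPU time={cpu_seconds:.1f}s, Peak memory={memory_usage:.0f}MB"
logger.log_text(text, severity='INFO')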

In [None]:
%%writefile Dockerfile

FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir torch torchvision transformers peft datasets bitsandbytes protobuf scipy einops google-cloud-storage google-cloud-logging

COPY train.py /train.py

ENV PYTHONUNBUFFERED=1

CMD ["python3", "/train.py"]

In [None]:
%%bash

PROJECT_ID=$(gcloud config get-value project)  # not set by default in a fresh bash cell
gcloud artifacts repositories create pytorch-t4 \
    --project=$PROJECT_ID \
    --repository-format=docker \
    --location=us-central1 \
    --description="Docker repository"

In [None]:
%%bash

PROJECT_ID=$(gcloud config get-value project)  # not set by default in a fresh bash cell
gcloud builds submit \
  --tag us-central1-docker.pkg.dev/$PROJECT_ID/pytorch-t4/mnist-train .

### YAML

YAML (YAML Ain't Markup Language) is a human-readable data serialization format commonly used for configuration files. Kubernetes uses YAML files to define and configure resources such as pods, deployments, services, and more.

The file typically contains a set of key-value pairs, where the keys represent the configuration options and the values represent their respective settings. The structure describes the desired state of the resources needed for a job.

A YAML file for a basic Pod might look like this:

```
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: nginx-container
    image: nginx:latest

```
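
Because a manifest is just nested key-value pairs, the same Pod can be written as Python dictionaries and lists. The sketch below mirrors the example above and serializes it with the standard `json` module; JSON is a subset of YAML, and `kubectl apply -f` accepts JSON manifests as well. The variable names are illustrative.

```python
import json

# The same Pod as the YAML example, expressed as nested dictionaries.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-pod"},
    "spec": {
        # "containers" is a list: each "- name: ..." entry in YAML is one item.
        "containers": [
            {"name": "nginx-container", "image": "nginx:latest"},
        ]
    },
}

manifest = json.dumps(pod, indent=2)
print(manifest)
```

Writing `manifest` to a file such as `example-pod.json` would give an equivalent input for `kubectl apply -f`.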

We will see additional key-value pairs in the YAML necessary for our training job, including the job name, the Docker image, and a graceful termination setting.

YAML files are typically applied to a cluster with a `kubectl` command such as:

`kubectl apply -f example-pod.yaml`


Remember to add your project ID to the .yaml file below:

In [None]:
%%writefile train.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: train-job-1
spec:
  backoffLimit: 2
  template:
    metadata:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: mnist-train
        image: us-central1-docker.pkg.dev/<YOUR-PROJECT-ID>/pytorch-t4/mnist-train:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # request one GPU so the pod is scheduled on the T4 node pool
      restartPolicy: OnFailure
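
Rather than editing the file by hand, the `<YOUR-PROJECT-ID>` placeholder can be filled in from Python. This is a minimal sketch: the helper names `fill_project_id` and `fill_manifest_file` and the project ID `my-project` are hypothetical, and the placeholder string is assumed to appear exactly as written in `train.yaml`.

```python
from pathlib import Path

PLACEHOLDER = "<YOUR-PROJECT-ID>"


def fill_project_id(manifest_text: str, project_id: str) -> str:
    """Return the manifest text with the placeholder replaced by project_id."""
    if PLACEHOLDER not in manifest_text:
        raise ValueError(f"{PLACEHOLDER} not found in manifest")
    return manifest_text.replace(PLACEHOLDER, project_id)


def fill_manifest_file(path: str, project_id: str) -> None:
    """Rewrite the manifest file in place with the project ID filled in."""
    manifest = Path(path)
    manifest.write_text(fill_project_id(manifest.read_text(), project_id))


example = "image: us-central1-docker.pkg.dev/<YOUR-PROJECT-ID>/pytorch-t4/mnist-train:latest"
print(fill_project_id(example, "my-project"))
```

Calling `fill_manifest_file("train.yaml", "my-project")` with your own project ID would update the file before it is applied.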

### kubectl

`kubectl` is the command-line tool used for interacting with Kubernetes clusters. It serves as the primary means for administrators, developers, and operators to manage Kubernetes clusters and work with the resources within them. The name is commonly read as "Kube Control."

Let's link `kubectl` to the cluster so we can start and check the training job.

In [None]:
!gcloud container clusters get-credentials t4-demo \
    --location us-central1

In [None]:
!kubectl config current-context

In [None]:
!kubectl cluster-info

In [None]:
!kubectl apply -f train.yaml

In [None]:
!kubectl get jobs
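
To check the job's state programmatically rather than by reading the table, the Job object can be fetched as JSON (for example with `kubectl get job train-job-1 -o json`) and its `status` block inspected. The sketch below classifies an already parsed Job dictionary; the function name `job_state` and the `Running`/`Pending` labels are illustrative choices, while `Complete` and `Failed` are the condition types Kubernetes sets on finished Jobs.

```python
import json


def job_state(job: dict) -> str:
    """Classify a Job object parsed from `kubectl get job NAME -o json`."""
    status = job.get("status", {})
    # A finished Job carries a condition of type Complete or Failed with status "True".
    for condition in status.get("conditions", []):
        if condition.get("status") == "True" and condition.get("type") in ("Complete", "Failed"):
            return condition["type"]
    # Otherwise, a non-zero "active" count means at least one pod is still running.
    if status.get("active", 0) > 0:
        return "Running"
    return "Pending"


sample = json.loads(
    '{"status": {"succeeded": 1, "conditions": [{"type": "Complete", "status": "True"}]}}'
)
print(job_state(sample))
```

In the notebook, the dictionary could come from `json.loads` applied to the output of the `kubectl` command above, captured with `subprocess.run`.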

In [None]:
!kubectl get nodes

In [None]:
!kubectl describe job train-job-1

To inspect the pod running the job, replace `train-job-1-xxxxx` below with the pod name shown in the `Created pod: xxxxx` event above.

In [None]:
!kubectl describe pod train-job-1-xxxxx

Finally, check that the model file is in the storage bucket named in `train.py`.

In [None]:
!gsutil ls gs://genai-experiments/k8s-models/mnist_cnn.pt

----------------------------------