 Copyright 2024 Google LLC

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     https://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.

This section shows how to run model training using PyTorch with data read from different storage options.

Import the required Python libraries:

In [4]:
import torch
import time
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.utils.data import DataLoader

Declare the batch size for reading the dataset. Define the dataset [transformation parameters](https://pytorch.org/vision/stable/transforms.html): resize the image, apply random image augmentations, convert the image to a tensor, and normalize it so that all values fall into the [-1, 1] range.

In [5]:
BATCH_SIZE = 64

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.GaussianBlur(kernel_size=(5, 9), sigma=(0.1, 5)),
    transforms.RandomRotation(degrees=(30, 70)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.5, 0.5, 0.5],
        std=[0.5, 0.5, 0.5]
    )
])
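
To see why a mean of 0.5 and a standard deviation of 0.5 give the [-1, 1] range: `ToTensor` scales 8-bit pixel values to [0, 1], and `Normalize` then computes `(x - mean) / std` per channel. The following is a minimal plain-Python sketch of that arithmetic. The `normalize` helper is hypothetical and only mirrors the formula; it is not the torchvision implementation.

```python
def normalize(value, mean=0.5, std=0.5):
    """Apply the per-channel formula used by Normalize: (x - mean) / std."""
    return (value - mean) / std

# ToTensor maps 8-bit pixel values 0..255 to floats in [0, 1].
for pixel in (0, 128, 255):
    scaled = pixel / 255
    print(f"pixel={pixel:3d} scaled={scaled:.3f} normalized={normalize(scaled):+.3f}")
```

The darkest pixel maps to -1 and the brightest to +1, so the network input is centered around zero.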


Declare a custom neural network. Note that the nn.Linear parameters set the number of input features (the 256 channels produced by the last convolutional layer, after pooling) and the number of output features (one per image class).

In [6]:
import torch.nn as nn
import torch.nn.functional as F
class CNNModel(nn.Module):
    def __init__(self):
        super(CNNModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 5)
        self.conv2 = nn.Conv2d(32, 64, 5)
        self.conv3 = nn.Conv2d(64, 128, 3)
        self.conv4 = nn.Conv2d(128, 256, 5)
        
        self.fc1 = nn.Linear(256, 1000)
        
        self.pool = nn.MaxPool2d(2, 2)
        
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = self.pool(F.relu(self.conv4(x)))
        bs, _, _, _ = x.shape
        x = F.adaptive_avg_pool2d(x, 1).reshape(bs, -1)
        x = self.fc1(x)
        return x
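
The shape arithmetic behind this network can be checked without PyTorch. The sketch below assumes the standard output-size formula for convolution and pooling layers with dilation 1, `floor((size + 2 * padding - kernel) / stride) + 1`, and traces a 224x224 input through the four convolution and pooling stages. The `conv_out` helper is hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224  # input height/width after transforms.Resize((224, 224))
for name, kernel in (("conv1", 5), ("conv2", 5), ("conv3", 3), ("conv4", 5)):
    size = conv_out(size, kernel)              # nn.Conv2d with stride 1, no padding
    size = conv_out(size, kernel=2, stride=2)  # nn.MaxPool2d(2, 2)
    print(f"after {name} + pool: {size}x{size}")
```

The last stage leaves a 256x10x10 feature map. `F.adaptive_avg_pool2d(x, 1)` averages each channel down to a single value, which is why `fc1` expects 256 input features regardless of the input image size.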

In this cell you check whether CUDA is available and declare the two objects used for [optimization](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html): the optimizer (Adam with a learning rate of 0.001) and the criterion (the cross-entropy loss function).

In [7]:
import torch.nn as nn
import torch.optim as optim
from tqdm.auto import tqdm

device = ('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Computation device: {device}\n")
model = CNNModel().to(device)

optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

Computation device: cuda
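
The model returns raw scores (logits) without a softmax layer, which is the input that `nn.CrossEntropyLoss` expects. As a rough illustration of what the criterion computes for a single sample, here is a plain-Python sketch. The `cross_entropy` helper is hypothetical; the PyTorch version additionally works on batches and, by default, averages over them.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]

print(cross_entropy([2.0, 1.0, 0.1], target=0))  # highest score on the true class: small loss
print(cross_entropy([2.0, 1.0, 0.1], target=2))  # lowest score on the true class: larger loss
```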



Define a helper function, save_model, that saves the model and optimizer state.

In [8]:
def save_model(epochs, model, optimizer, criterion):
    model_path = "/local-ssd/outputs/model-" + time.strftime("%H-%M-%S", time.localtime()) + ".pth"
    torch.save({
                'epoch': epochs,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': criterion,
                }, model_path)
    print(f"Model was saved in {model_path}")
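
The file name above contains only the time of day, so two checkpoints written within the same second, or at the same clock time on different days, would get the same name. One possible alternative, shown here as a sketch with a hypothetical `checkpoint_path` helper, is to include the date and the epoch number in the name:

```python
import time
from pathlib import Path

def checkpoint_path(output_dir, epoch, now=None):
    """Build a checkpoint file name that includes the date and the epoch number."""
    # now is seconds since the epoch; None means the current time.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    return Path(output_dir) / f"model-{stamp}-epoch{epoch:03d}.pth"

print(checkpoint_path("/local-ssd/outputs", epoch=3))
```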

Define the model training function, which runs a very [common training loop](https://pytorch.org/tutorials/beginner/introyt/trainingyt.html).

In [9]:
def train(model, trainloader, optimizer, criterion):
    model.train()
    print('Training')
    train_running_loss = 0.0
    train_running_correct = 0
    counter = 0
    for i, data in tqdm(enumerate(trainloader), total=len(trainloader)):
        counter += 1
        image, labels = data
        image = image.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(image)
        loss = criterion(outputs, labels)
        train_running_loss += loss.item()
        _, preds = torch.max(outputs.data, 1)
        train_running_correct += (preds == labels).sum().item()
        loss.backward()
        optimizer.step()
    epoch_loss = train_running_loss / counter
    epoch_acc = 100. * (train_running_correct / len(trainloader.dataset))
    return epoch_loss, epoch_acc
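
The two values returned by `train` are aggregated differently: the loss is averaged over batches, while the accuracy is computed over all samples in the dataset. A small plain-Python sketch of that bookkeeping, using a hypothetical `epoch_metrics` helper and made-up numbers:

```python
def epoch_metrics(batch_losses, correct_per_batch, dataset_size):
    """Mean loss over batches and accuracy (in percent) over all samples."""
    epoch_loss = sum(batch_losses) / len(batch_losses)
    epoch_acc = 100.0 * sum(correct_per_batch) / dataset_size
    return epoch_loss, epoch_acc

# Three batches over 160 samples (batch size 64, so the last batch has 32 samples).
loss, acc = epoch_metrics([0.9, 0.7, 0.5], [40, 48, 24], dataset_size=160)
print(f"loss={loss:.3f} acc={acc:.1f}%")
```

Because the loss is a mean of per-batch means, a shorter final batch carries the same weight as a full one. For a benchmark that focuses on storage throughput this is usually an acceptable approximation.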

Declare the benchmark function, which measures the training time for a given storage location and number of epochs. The function clears the GPU memory cache and loads an untrained model into GPU memory before each run.

In [None]:
def train_benchmark(dataset_path, epochs):
    start = time.time()
    train_dataset = datasets.ImageFolder(
        root=dataset_path,
        transform=train_transform
    )
    train_loader = DataLoader(
        train_dataset, batch_size=BATCH_SIZE, shuffle=True,
        num_workers=2
    )
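    # Recreate the model and its optimizer before training, so that the
    # optimizer updates the parameters of the new model.
    torch.cuda.empty_cache()
    model = CNNModel().to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.001)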
    train_loss = []
    train_acc = []
    for epoch in range(epochs):
        print(f"[INFO]: Epoch {epoch+1} of {epochs}")
        train_epoch_loss, train_epoch_acc = train(model, train_loader, optimizer, criterion)
        train_loss.append(train_epoch_loss)
        train_acc.append(train_epoch_acc)
        print(f"Training loss: {train_epoch_loss:.3f}, training acc: {train_epoch_acc:.3f}")
        print('-'*50)
        # Store the number of completed epochs with each checkpoint.
        save_model(epoch + 1, model, optimizer, criterion)
    print('Training complete')

    end = time.time()
    print("Total training time: ", time.strftime("%H:%M:%S", time.gmtime(end-start)))
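
Note that formatting a duration with `time.gmtime` wraps around after 24 hours, so a 25-hour run would be reported as `01:00:00`. If long runs are expected, the duration can be formatted by hand. The sketch below uses a hypothetical `format_elapsed` helper and `time.perf_counter`, a monotonic clock that is not affected by system clock adjustments.

```python
import time

def format_elapsed(seconds):
    """Format a duration as HH:MM:SS without wrapping at 24 hours."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

start = time.perf_counter()
sum(range(100_000))  # placeholder for the training run
print("Total training time:", format_elapsed(time.perf_counter() - start))
```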

Run the benchmark using the RAM disk with 2, 5, and 10 epochs.

In [None]:
print("Ram disk - 2 epochs")
train_benchmark("/ram-disk/dataset", 2)
print("Ram disk - 5 epochs")
train_benchmark("/ram-disk/dataset", 5)
print("Ram disk - 10 epochs")
train_benchmark("/ram-disk/dataset", 10)

Run the benchmark using the Local SSD with 2, 5, and 10 epochs.

In [None]:
print("Local ssd - 2 epochs")
train_benchmark("/local-ssd/dataset", 2)
print("Local ssd - 5 epochs")
train_benchmark("/local-ssd/dataset", 5)
print("Local ssd - 10 epochs")
train_benchmark("/local-ssd/dataset", 10)

Run the benchmark using the Persistent disk with 2, 5, and 10 epochs.

In [None]:
print("Persistent disk - 2 epochs")
train_benchmark("/pd-ssd/dataset", 2)
print("Persistent disk - 5 epochs")
train_benchmark("/pd-ssd/dataset", 5)
print("Persistent disk - 10 epochs")
train_benchmark("/pd-ssd/dataset", 10)

Run the benchmark using the GCS bucket with 2, 5, and 10 epochs.

In [None]:
print("Bucket - 2 epochs")
train_benchmark("/bucket/dataset", 2)
print("Bucket - 5 epochs")
train_benchmark("/bucket/dataset", 5)
print("Bucket - 10 epochs")
train_benchmark("/bucket/dataset", 10)
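
The four benchmark cells follow the same pattern, so they could also be driven from a single table of storage locations. The following is a sketch only: `run_all`, `STORAGES`, and `EPOCH_COUNTS` are hypothetical names, the paths are the ones used in the cells above, and a stand-in function replaces `train_benchmark` so that the snippet runs on its own.

```python
STORAGES = {
    "Ram disk": "/ram-disk/dataset",
    "Local ssd": "/local-ssd/dataset",
    "Persistent disk": "/pd-ssd/dataset",
    "Bucket": "/bucket/dataset",
}
EPOCH_COUNTS = (2, 5, 10)

def run_all(benchmark, storages=STORAGES, epoch_counts=EPOCH_COUNTS):
    """Call benchmark(path, epochs) for every storage/epoch combination."""
    for label, path in storages.items():
        for epochs in epoch_counts:
            print(f"{label} - {epochs} epochs")
            benchmark(path, epochs)

# In the notebook this would be run_all(train_benchmark); a stand-in is used here.
run_all(lambda path, epochs: None)
```

Generating the label and the path from the same entry keeps the printed heading consistent with the dataset that is actually read.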