In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/community/sdk/pytorch_lightning_custom_container_training.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/community/sdk/pytorch_lightning_custom_container_training.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

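## Overview

This tutorial demonstrates how to use the Vertex AI SDK for Python to train a ResNet model with a custom container and PyTorch Lightning. The model training code is taken from the CIFAR-10 training example in the PyTorch Lightning documentation:
https://pytorch-lightning.readthedocs.io/en/stable/notebooks/lightning_examples/cifar10-baseline.html

Two training approaches are covered: 1) training on a single machine with multiple GPUs, and 2) training across multiple machines with a single GPU on each machine.

### Dataset

As described on the dataset website, the CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
https://www.cs.toronto.edu/~kriz/cifar.html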
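
As a quick sanity check of the figures quoted above, the counts can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the constant names are made up for this example, and the channels-first image shape is an assumption based on the usual PyTorch convention.

```python
# Figures quoted from the CIFAR-10 description above.
NUM_CLASSES = 10
IMAGES_PER_CLASS = 6000
TRAIN_IMAGES = 50000
TEST_IMAGES = 10000

# Assumption: RGB images stored channels-first, as PyTorch tensors usually are.
IMAGE_SHAPE = (3, 32, 32)  # channels, height, width

total_images = NUM_CLASSES * IMAGES_PER_CLASS
values_per_image = IMAGE_SHAPE[0] * IMAGE_SHAPE[1] * IMAGE_SHAPE[2]

print(total_images)      # 60000, which equals TRAIN_IMAGES + TEST_IMAGES
print(values_per_image)  # 3072 values per image
```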

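The dataset is loaded using the Lightning Bolts datamodules.

### Objective

In this notebook, you learn how to use Vertex AI to distribute the training of an existing PyTorch Lightning model example across GPUs and multiple machines:

    * Install and import libraries to test model training locally
    * Initialize the Vertex AI SDK
    * Create a custom container for training
    * Create a Vertex AI TensorBoard
    * Modify the code to accept arguments, log to TensorBoard, and save the model to Cloud Storage
    * Run a Vertex AI training job on a single machine with multiple GPUs
    * Run a Vertex AI training job on multiple machines with a single GPU attached to each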
    
    
### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

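### Set up your local development environment

**If you are using Colab or Google Cloud Notebooks**, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements.
You need the following:

* Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)
2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)
3. [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv) and create a virtual environment that uses Python 3. Activate the virtual environment.
4. To install Jupyter, run `pip3 install jupyter` on the command line in a terminal shell.
5. To launch Jupyter, run `jupyter notebook` on the command line in a terminal shell.
6. Open this notebook in the Jupyter Notebook dashboard.

### Install additional packages

Install additional package dependencies that are not already installed in your notebook environment.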

In [None]:
import os

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

In [None]:
! pip3 install {USER_FLAG} --upgrade "torch>=1.6, <1.9"
! pip3 install {USER_FLAG} --upgrade "lightning-bolts"
! pip3 install {USER_FLAG} --upgrade git+https://github.com/PyTorchLightning/pytorch-lightning
! pip3 install {USER_FLAG} --upgrade "torchmetrics>=0.3"
! pip3 install {USER_FLAG} --upgrade "torchvision"
! pip3 install {USER_FLAG} --upgrade google-cloud-aiplatform
! pip3 install {USER_FLAG} --upgrade ipywidgets

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so that it can find the packages.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

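## Before you begin

### Select a GPU runtime

**Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select "Runtime --> Change runtime type > GPU"**

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

1. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.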
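
To make the `$` interpolation less mysterious, here is a rough plain-Python analogy. It is not how Jupyter implements the feature; `expand_shell_line` and the `notebook_vars` dictionary are hypothetical names used only for this sketch, which relies on the standard library's `string.Template`.

```python
from string import Template


def expand_shell_line(line: str, namespace: dict) -> str:
    """Replace $NAME placeholders with values from namespace.

    Hypothetical helper for illustration only; unknown names are left untouched.
    """
    return Template(line).safe_substitute(namespace)


# Stand-in for the variables defined in notebook cells.
notebook_vars = {"PROJECT_ID": "my-project", "REGION": "us-central1"}

command = expand_shell_line("gcloud config set project $PROJECT_ID", notebook_vars)
print(command)  # gcloud config set project my-project
```

In the notebook itself you only write `! gcloud config set project $PROJECT_ID`; the substitution happens before the line is handed to the shell.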

#### Set your project ID

**If you don't know your project ID**, you may be able to get it using `gcloud`.

In [None]:
PROJECT_ID = ""

import os

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    shell_output = !gcloud auth list 2>/dev/null
    SERVICE_ACCOUNT = shell_output[2].strip().replace("*", "").replace(" ", "")
    print("Service Account:", SERVICE_ACCOUNT)

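Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between the resources that different users create, you create a timestamp for each session and append it to the names of the resources you create in this tutorial.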
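
The idea can be sketched as a small helper that appends a timestamp to a base name. `timestamped_name` is a hypothetical function written for this illustration; the notebook itself simply stores the timestamp in a `TIMESTAMP` variable and concatenates it where needed.

```python
from datetime import datetime


def timestamped_name(base, now=None):
    """Append a second-resolution timestamp to base (hypothetical helper)."""
    now = now or datetime.now()
    return f"{base}-{now.strftime('%Y%m%d%H%M%S')}"


# A fixed datetime is used here so the result is reproducible.
example = timestamped_name("resnet-job", datetime(2022, 1, 31, 9, 5, 7))
print(example)  # resnet-job-20220131090507
```

Two sessions started in the same second would still collide, so this only reduces the chance of a name clash.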

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

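### Authenticate your Google Cloud account

**If you are using Google Cloud Notebooks**, your environment is already authenticated. Skip this step.

If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

Otherwise, follow these steps:

1. In the Cloud Console, go to the [**Create service account key** page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI" into the filter box, and select **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your local environment.

6. Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.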

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

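### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**


When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. Vertex AI runs the code from this package. In this tutorial, Vertex AI also saves the trained model that results from your job in the same bucket. Using this model artifact, you can then create Vertex AI model and endpoint resources in order to serve online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.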
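
Before creating the bucket, it can help to check that the name at least has a valid shape. The sketch below is a simplified check, assuming the commonly documented rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). `looks_like_valid_bucket_name` is a hypothetical helper, it does not cover every rule (for example reserved prefixes or longer dotted names), and it cannot tell whether a name is already taken.

```python
import re

# Simplified pattern: 3-63 characters, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if name passes the simplified format check (hypothetical helper)."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


print(looks_like_valid_bucket_name("my-project-aip-20220131090507"))  # True
print(looks_like_valid_bucket_name("My_Bucket"))                      # False
```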

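You may also change the `REGION` variable, which is used for operations throughout the rest of this notebook. We recommend that you [choose a region where Vertex AI services are available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions).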

In [None]:
BUCKET_URI = "gs://[your-bucket-name]"  # @param {type:"string"}
REGION = "[your-region]"  # @param {type:"string"}

In [None]:
if BUCKET_URI == "" or BUCKET_URI is None or BUCKET_URI == "gs://[your-bucket-name]":
    BUCKET_URI = "gs://" + PROJECT_ID + "-aip-" + TIMESTAMP

if REGION == "[your-region]":
    REGION = "us-central1"

Run the following cell to create your Cloud Storage bucket only if the bucket doesn't already exist.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents.

In [None]:
! gsutil ls -al $BUCKET_URI

Import libraries and define constants

In [None]:
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torchmetrics.functional import accuracy

seed_everything(7)

PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
AVAIL_GPUS = min(1, torch.cuda.device_count())
BATCH_SIZE = 256 if AVAIL_GPUS else 64
NUM_WORKERS = int(os.cpu_count() / 2)

print(PATH_DATASETS)
print(AVAIL_GPUS)
print(BATCH_SIZE)
print(NUM_WORKERS)

Define the training function for local testing.
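
The `configure_optimizers` method in the cell below sizes its `OneCycleLR` schedule from a hard-coded sample count of 45000 and the batch size. As a rough check of that arithmetic, the following sketch computes the total number of scheduler steps; `total_scheduler_steps` is a hypothetical helper, not part of the training code.

```python
# Sample count hard-coded in configure_optimizers in this notebook.
TRAIN_SAMPLES = 45000


def total_scheduler_steps(batch_size: int, max_epochs: int) -> int:
    """Steps the one-cycle schedule spans: steps_per_epoch * epochs."""
    steps_per_epoch = TRAIN_SAMPLES // batch_size
    return steps_per_epoch * max_epochs


print(total_scheduler_steps(256, 5))  # 875  (175 steps per epoch with a GPU)
print(total_scheduler_steps(64, 5))   # 3515 (703 steps per epoch on CPU)
```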

In [None]:
train_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomCrop(32, padding=4),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

test_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

cifar10_dm = CIFAR10DataModule(
    data_dir=PATH_DATASETS,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    train_transforms=train_transforms,
    test_transforms=test_transforms,
    val_transforms=test_transforms,
)


def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(
        3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
    )
    model.maxpool = nn.Identity()
    return model


class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}

### Train the model locally

In [None]:
model = LitResnet(lr=0.05)
model.datamodule = cifar10_dm

trainer = Trainer(
    progress_bar_refresh_rate=10,
    max_epochs=5,
    gpus=AVAIL_GPUS,
    logger=TensorBoardLogger("lightning_logs/", name="resnet"),
    callbacks=[LearningRateMonitor(logging_interval="step")],
    strategy="dp",
)

trainer.fit(model, cifar10_dm)
trainer.test(model, datamodule=cifar10_dm)

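Vertex AI training using the Vertex AI SDK and a custom container

### Build the custom container

Run these steps once to set up Artifact Registry and authorize Docker to use it.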

In [None]:
! gcloud config set project $PROJECT_ID
! gcloud services enable artifactregistry.googleapis.com
! sudo usermod -a -G docker ${USER}
! gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet

In [None]:
REPOSITORY = "gpu-training-repository"

In [None]:
! gcloud artifacts repositories create $REPOSITORY --repository-format=docker \
--location=$REGION --description="Vertex GPU training repository"

Create a trainer directory

In [None]:
import os

os.mkdir("trainer")

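Build the container

This code extends the original example and adds argument parsing, TensorBoard logging, the ability to choose a training strategy, and saving the model to Cloud Storage.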
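
One detail of the script is worth isolating: it rewrites `gs://` URIs into `/gcs/` paths so that ordinary file operations such as `os.makedirs` and `torch.save` can write to the bucket through the Cloud Storage FUSE mount. The sketch below shows that mapping on its own; `to_fuse_path` is a hypothetical helper, and the script inlines the same logic with `str.replace`.

```python
GS_PREFIX = "gs://"
GCSFUSE_PREFIX = "/gcs/"


def to_fuse_path(uri: str) -> str:
    """Map a gs:// URI to its /gcs/ mount path; leave local paths unchanged."""
    if uri.startswith(GS_PREFIX):
        return GCSFUSE_PREFIX + uri[len(GS_PREFIX):]
    return uri


print(to_fuse_path("gs://my-bucket/job/model"))  # /gcs/my-bucket/job/model
print(to_fuse_path("./tmp/model"))               # ./tmp/model
```

Only the leading prefix is rewritten here, which avoids touching a `gs://` substring that might appear later in a path.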

In [None]:
%%writefile trainer/task.py
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, Trainer, seed_everything
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torch.optim.swa_utils import AveragedModel, update_bn
from torchmetrics.functional import accuracy

# Arg parsing and shutil for folder creation
import argparse
import shutil

seed_everything(7)

PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
AVAIL_GPUS = torch.cuda.device_count()  # use every visible GPU so "dp" can span multiple GPUs
BATCH_SIZE = 256 if AVAIL_GPUS else 64
NUM_WORKERS = int(os.cpu_count() / 2)

print (PATH_DATASETS)
print (AVAIL_GPUS)
print (BATCH_SIZE)
print (NUM_WORKERS)

train_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomCrop(32, padding=4),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

test_transforms = torchvision.transforms.Compose(
    [
        torchvision.transforms.ToTensor(),
        cifar10_normalization(),
    ]
)

cifar10_dm = CIFAR10DataModule(
    data_dir=PATH_DATASETS,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    train_transforms=train_transforms,
    test_transforms=test_transforms,
    val_transforms=test_transforms,
)

# Added code to read args
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', dest='epochs',
                        default=10, type=int,
                        help='Number of epochs.')
    parser.add_argument('--distribute', dest='distribute', type=str, default='dp',
                        help='Distributed training strategy.')
    parser.add_argument('--num-nodes', dest='num_nodes',
                        default=1, type=int,
                        help='Number of nodes')
    parser.add_argument(
          '--model-dir', dest='model_dir', default=os.getenv('AIP_MODEL_DIR'), type=str,
          help='a Cloud Storage URI of a directory intended for saving model artifacts')
    parser.add_argument(
          '--tensorboard-log-dir', dest='tensorboard_log_dir', default=os.getenv('AIP_TENSORBOARD_LOG_DIR'), type=str,
          help='a Cloud Storage URI of a directory intended for saving TensorBoard')
    parser.add_argument(
          '--checkpoint-dir', dest='checkpoint_dir', default=os.getenv('AIP_CHECKPOINT_DIR'), type=str,
          help='a Cloud Storage URI of a directory intended for saving checkpoints')
    args = parser.parse_args()
    return args

# Function to (re)create the model directory, removing any existing contents
def makedirs(model_dir):
    if os.path.exists(model_dir) and os.path.isdir(model_dir):
        shutil.rmtree(model_dir)
    os.makedirs(model_dir)
    return

def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model

class LitResnet(LightningModule):
    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    # TensorBoard logging at epoch end
    def training_epoch_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()

        # Log the epoch-average training loss so it reaches the TensorBoard logger
        self.log("avg_train_loss", avg_loss)

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("train_loss", loss)

        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f"{stage}_loss", loss, prog_bar=True)
            self.log(f"{stage}_acc", acc, prog_bar=True)


    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, "val")

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, "test")

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            "scheduler": OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            "interval": "step",
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}

def main():   

    # Parse args
    args = parse_args()
    print (f"Args={args}")
    print (f"epochs={args.epochs}")
    print (f"model directory={args.model_dir}")
    print (f"distribute strategy={args.distribute}")

    # Set the model, TensorBoard, and checkpoint directories
    local_model_dir = './tmp/model'
    local_tensorboard_log_dir = './tmp/logs'
    local_checkpoint_dir = './tmp/checkpoints'

    model_dir = args.model_dir or local_model_dir
    tensorboard_log_dir = args.tensorboard_log_dir or local_tensorboard_log_dir
    checkpoint_dir = args.checkpoint_dir or local_checkpoint_dir

    print ("Model directory: " + model_dir)
    print ("TensorBoard directory: " + tensorboard_log_dir)
    print ("Checkpoint directory: " + checkpoint_dir)

    gs_prefix = 'gs://'
    gcsfuse_prefix = '/gcs/'
    if model_dir and model_dir.startswith(gs_prefix):
        model_dir = model_dir.replace(gs_prefix, gcsfuse_prefix)
        if not os.path.isdir(model_dir):
            os.makedirs(model_dir)
    if tensorboard_log_dir and tensorboard_log_dir.startswith(gs_prefix):
        tensorboard_log_dir = tensorboard_log_dir.replace(gs_prefix, gcsfuse_prefix)
        if not os.path.isdir(tensorboard_log_dir):
            os.makedirs(tensorboard_log_dir)
    if checkpoint_dir and checkpoint_dir.startswith(gs_prefix):
        checkpoint_dir = checkpoint_dir.replace(gs_prefix, gcsfuse_prefix)
        if not os.path.isdir(checkpoint_dir):
            os.makedirs(checkpoint_dir)

    model = LitResnet(lr=0.05)
    model.datamodule = cifar10_dm

    trainer = Trainer(
        progress_bar_refresh_rate=10,
        gpus=AVAIL_GPUS, 
        logger=TensorBoardLogger(tensorboard_log_dir, "resnet"), 
        callbacks=[LearningRateMonitor(logging_interval="step")],
        # Changes to use args, change default checkpoint dir, and set number of nodes
        max_epochs=args.epochs,
        strategy=args.distribute,
        default_root_dir=checkpoint_dir,
        num_nodes=args.num_nodes,
    )

    trainer.fit(model, cifar10_dm)
    trainer.test(model, datamodule=cifar10_dm)

    # Save model step
    model_name = "pylightning_resnet_state_dict.pth"

    model_save_path = os.path.join(model_dir, model_name)
    if trainer.global_rank == 0:
        makedirs(model_dir)
        print("Saving model to {}".format(model_save_path))
        torch.save(model.state_dict(), model_save_path)


if __name__ == '__main__':
    main()

#### Configure the container name and Artifact Registry path
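
An Artifact Registry Docker image URI follows the pattern `LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG`. As a minimal sketch of how the pieces fit together, the hypothetical helper below assembles such a URI; the project ID used in the example is a placeholder.

```python
def artifact_registry_image_uri(region, project_id, repository, image, tag="latest"):
    """Assemble an Artifact Registry Docker image URI (hypothetical helper)."""
    hostname = f"{region}-docker.pkg.dev"
    return f"{hostname}/{project_id}/{repository}/{image}:{tag}"


uri = artifact_registry_image_uri(
    region="us-central1",
    project_id="my-project",  # placeholder project ID
    repository="gpu-training-repository",
    image="pytorch-lightning-gpu-training",
)
print(uri)
# us-central1-docker.pkg.dev/my-project/gpu-training-repository/pytorch-lightning-gpu-training:latest
```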

In [None]:
content_name = "pytorch-lightning-gpu-training"
hostname = f"{REGION}-docker.pkg.dev"
image_name_train = content_name
tag = "latest"

custom_container_image_uri_train = (
    f"{hostname}/{PROJECT_ID}/{REPOSITORY}/{image_name_train}:{tag}"
)

Create requirements.txt and Dockerfile

In [None]:
%%writefile trainer/requirements.txt
torch>=1.6, <1.9
lightning-bolts
pytorch-lightning>=1.3
torchmetrics>=0.3
torchvision

In [None]:
%%writefile trainer/Dockerfile
FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime

COPY . /trainer

WORKDIR /trainer

RUN pip install -r requirements.txt

ENTRYPOINT ["python", "task.py"]

Create an empty `__init__.py` file, which is required in the container.

In [None]:
import os

with open(os.path.join("trainer", "__init__.py"), "w") as fp:
    pass

#### Build the container, train the model locally in the container, and push to Artifact Registry

In [None]:
! cd trainer && docker build -t $custom_container_image_uri_train -f Dockerfile .

In [None]:
! docker run --rm $custom_container_image_uri_train

In [None]:
! docker push $custom_container_image_uri_train

In [None]:
! gcloud artifacts repositories describe $REPOSITORY --location=$REGION

Initialize the Vertex AI SDK

In [None]:
from google.cloud import aiplatform

aiplatform.init(
    project=PROJECT_ID,
    staging_bucket=BUCKET_URI,
    location=REGION,
)

Create a Vertex AI TensorBoard instance

In [None]:
tensorboard = aiplatform.Tensorboard.create(
    display_name=content_name,
)

Option: use a previously created Vertex AI TensorBoard instance

```
tensorboard_name = "Your TensorBoard resource name or TensorBoard ID"
tensorboard = aiplatform.Tensorboard(tensorboard_name=tensorboard_name)
```

Run a Vertex AI SDK custom container training job with multiple GPUs

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y-%m-%d-%H%M%S")
print(TIMESTAMP)

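Set the arguments for training. Any of the model, TensorBoard, and checkpoint directories that you don't pass explicitly fall back to the Vertex AI defaults; the cell below sets the model and checkpoint directories and leaves the TensorBoard directory at its default.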
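
The fallback works because `trainer/task.py` uses the `AIP_*` environment variables as `argparse` defaults. The sketch below reproduces that pattern in isolation; `parse_dirs` and the injected `env` mapping are hypothetical, added here so the behavior can be tried without a real training job, and the example URIs are placeholders.

```python
import argparse
import os


def parse_dirs(argv, env=os.environ):
    """Parse directory arguments, defaulting to AIP_* environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", dest="model_dir", type=str,
                        default=env.get("AIP_MODEL_DIR"))
    parser.add_argument("--checkpoint-dir", dest="checkpoint_dir", type=str,
                        default=env.get("AIP_CHECKPOINT_DIR"))
    return parser.parse_args(argv)


# Simulated environment: only the model directory default is present.
fake_env = {"AIP_MODEL_DIR": "gs://bucket/job/model"}
args = parse_dirs(["--checkpoint-dir=gs://bucket/custom/checkpoints"], env=fake_env)

print(args.model_dir)       # gs://bucket/job/model (from the environment)
print(args.checkpoint_dir)  # gs://bucket/custom/checkpoints (from the command line)
```

An explicit command-line value always wins over the environment default, and a directory that is set in neither place comes back as `None`, which is why the script also defines local `./tmp/...` fallbacks.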

In [None]:
gcs_output_uri_prefix = f"{BUCKET_URI}/{content_name}-{TIMESTAMP}"

In [None]:
EPOCHS = 30
TRAIN_STRATEGY = "dp"  # DataParallel: single machine, multiple GPUs
MODEL_DIR = f"{BUCKET_URI}/{content_name}/model"
TB_DIR = f"{BUCKET_URI}/{content_name}/logs"
CHKPT_DIR = f"{BUCKET_URI}/{content_name}/checkpoints"
NUM_NODES = 1

machine_type = "n1-standard-4"
accelerator_count = 2
accelerator_type = "NVIDIA_TESLA_V100"

CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--distribute=" + TRAIN_STRATEGY,
    "--num-nodes=" + str(NUM_NODES),
    "--model-dir=" + MODEL_DIR,
    "--checkpoint-dir=" + CHKPT_DIR,
]

In [None]:
custom_container_training_job = aiplatform.CustomContainerTrainingJob(
    display_name=content_name + "-MultiGPU-dp-" + TIMESTAMP,
    container_uri=custom_container_image_uri_train,
)

In [None]:
custom_container_training_job.run(
    args=CMDARGS,
    replica_count=NUM_NODES,
    base_output_dir=gcs_output_uri_prefix,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    service_account=SERVICE_ACCOUNT,
    tensorboard=tensorboard.resource_name,
    sync=False,
)

In [None]:
print(f"Custom Training Job Name: {custom_container_training_job.resource_name}")
print(f"GCS Output URI Prefix: {gcs_output_uri_prefix}")

Delete the training job

In [None]:
custom_container_training_job.delete()

Run training on multiple machines with 1 GPU each
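
The two strategies used in this notebook treat the batch size differently. With `dp` (DataParallel), a single process splits each batch across the GPUs of one machine; with `ddp` (DistributedDataParallel), each GPU gets its own process that loads a full batch. The sketch below works through that arithmetic under those general assumptions; `batch_layout` is a hypothetical helper, not something Lightning or Vertex AI provides.

```python
def batch_layout(strategy, batch_size, num_nodes, gpus_per_node):
    """Describe how a configured batch size maps onto devices (illustrative only)."""
    devices = num_nodes * gpus_per_node
    if strategy == "dp":
        # DataParallel: one process, each batch is scattered across local GPUs.
        return {"processes": 1,
                "per_device_batch": batch_size // gpus_per_node,
                "global_batch": batch_size}
    if strategy == "ddp":
        # DistributedDataParallel: one process per GPU, each loads its own batch.
        return {"processes": devices,
                "per_device_batch": batch_size,
                "global_batch": batch_size * devices}
    raise ValueError(f"unsupported strategy: {strategy}")


# Single machine with 2 GPUs, as in the earlier "dp" job.
print(batch_layout("dp", 256, num_nodes=1, gpus_per_node=2))
# Two machines with 1 GPU each, as in the "ddp" job in this section.
print(batch_layout("ddp", 256, num_nodes=2, gpus_per_node=1))
```

With a batch size of 256, the `dp` job processes 128 samples per GPU per step, while the `ddp` job processes 256 per GPU, for an effective global batch of 512.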

In [None]:
EPOCHS = 30
TRAIN_STRATEGY = "ddp"  # DistributedDataParallel: multiple machines, one GPU each
MODEL_DIR = f"{BUCKET_URI}/{content_name}-ddp/model"
TB_DIR = f"{BUCKET_URI}/{content_name}-ddp/logs"
CHKPT_DIR = f"{BUCKET_URI}/{content_name}-ddp/checkpoints"
NUM_NODES = 2

machine_type = "n1-standard-4"
accelerator_count = 1
accelerator_type = "NVIDIA_TESLA_V100"

CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--distribute=" + TRAIN_STRATEGY,
    "--num-nodes=" + str(NUM_NODES),
    "--model-dir=" + MODEL_DIR,
    "--checkpoint-dir=" + CHKPT_DIR,
]

In [None]:
custom_container_training_job_dist = aiplatform.CustomContainerTrainingJob(
    display_name=content_name + "-MultiNode-1GPU-ddp-" + TIMESTAMP,
    container_uri=custom_container_image_uri_train,
)

In [None]:
custom_container_training_job_dist.run(
    args=CMDARGS,
    replica_count=NUM_NODES,
    base_output_dir=gcs_output_uri_prefix,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    service_account=SERVICE_ACCOUNT,
    tensorboard=tensorboard.resource_name,
    sync=False,
)

In [None]:
print(f"Custom Training Job Name: {custom_container_training_job_dist.resource_name}")
print(f"GCS Output URI Prefix: {gcs_output_uri_prefix}")

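Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.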

In [None]:
# Warning: Setting this to true will delete everything in your bucket
delete_bucket = False

# Delete TensorBoard
TB_NAME = tensorboard.resource_name
! gcloud beta ai tensorboards delete $TB_NAME --quiet

# Delete the training job
custom_container_training_job_dist.delete()

CONTENT_DIR = f"{BUCKET_URI}/{content_name}*"
# Delete Cloud Storage objects that were created
! gsutil -m rm -r $CONTENT_DIR

if delete_bucket and "BUCKET_URI" in globals():
    ! gsutil -m rm -r $BUCKET_URI