# Create PyTorchJob using Kubeflow Training SDK

This notebook is a sample for the Kubeflow Training SDK, `kubeflow-training`.

It shows how to use the SDK to create a PyTorchJob, retrieve it, inspect its conditions, wait for it to finish, check whether it succeeded, read its logs, and delete it.

In [None]:
training_python_sdk='kubeflow-training'
namespace='kubeflow-user-example-com'

## Install Kubeflow Training Python SDKs

You need to install the Kubeflow Training SDK to run this notebook.

In [None]:
# TODO (andreyvelich): Change to release version when SDK with the new APIs is published.
# Install Kubeflow Python SDK
!pip install {training_python_sdk}
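
Before moving on, it can be useful to confirm that the install actually succeeded. The following is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper name introduced here for illustration and is not part of the SDK.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# `kubeflow-training` is the distribution name installed in the cell above.
version = installed_version("kubeflow-training")
print(version if version else "kubeflow-training is not installed")
```

If the helper reports that the package is missing, rerunning the install cell and restarting the kernel is usually the first thing to try.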

In [None]:
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container

from kubeflow.training import KubeflowOrgV1ReplicaSpec
from kubeflow.training import KubeflowOrgV1PyTorchJob
from kubeflow.training import KubeflowOrgV1PyTorchJobSpec
from kubeflow.training import KubeflowOrgV1RunPolicy
from kubeflow.training import TrainingClient

from kubeflow.training import constants

## Define PyTorchJob

The demo creates a PyTorchJob with one Master replica and one Worker replica to run the MNIST sample. Both replicas share the same replica spec.

In [None]:
name = "pytorch-dist-mnist-gloo"
container_name = "pytorch"

container = V1Container(
    name=container_name,
    image="ghcr.io/kubeflow/training-v1/pytorch-dist-mnist:latest",
    args=["--backend", "gloo"],
)

replica_spec = KubeflowOrgV1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(
            name=name,
            namespace=namespace,
            annotations={
                "sidecar.istio.io/inject": "false"
            }
        ),
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

pytorchjob = KubeflowOrgV1PyTorchJob(
    api_version=constants.API_VERSION,
    kind=constants.PYTORCHJOB_KIND,
    metadata=V1ObjectMeta(name=name, namespace=namespace),
    spec=KubeflowOrgV1PyTorchJobSpec(
        run_policy=KubeflowOrgV1RunPolicy(clean_pod_policy="None"),
        pytorch_replica_specs={
            "Master": replica_spec,
            "Worker": replica_spec
        },
    ),
)
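
For readers more used to Kubernetes manifests, the object built above corresponds roughly to the following structure. This is a sketch written as a plain Python dict, assuming that `constants.API_VERSION` resolves to `kubeflow.org/v1` and that the CRD uses the camelCase field names shown; `make_replica_spec` is a hypothetical helper introduced only for this illustration.

```python
import json

namespace = "kubeflow-user-example-com"
name = "pytorch-dist-mnist-gloo"


def make_replica_spec() -> dict:
    """Build one replica spec; called once per role so the dicts stay independent."""
    return {
        "replicas": 1,
        "restartPolicy": "OnFailure",
        "template": {
            "metadata": {
                "name": name,
                "namespace": namespace,
                # Disable Istio sidecar injection for the training pods.
                "annotations": {"sidecar.istio.io/inject": "false"},
            },
            "spec": {
                "containers": [
                    {
                        "name": "pytorch",
                        "image": "ghcr.io/kubeflow/training-v1/pytorch-dist-mnist:latest",
                        "args": ["--backend", "gloo"],
                    }
                ]
            },
        },
    }


manifest = {
    "apiVersion": "kubeflow.org/v1",  # assumed value of constants.API_VERSION
    "kind": "PyTorchJob",
    "metadata": {"name": name, "namespace": namespace},
    "spec": {
        "runPolicy": {"cleanPodPolicy": "None"},
        "pytorchReplicaSpecs": {
            "Master": make_replica_spec(),
            "Worker": make_replica_spec(),
        },
    },
}

print(json.dumps(manifest, indent=2))
```

Building a fresh dict per role keeps a later change to the Worker spec (for example, its replica count) from silently changing the Master spec as well.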

## Create PyTorchJob

You have to create a `TrainingClient` to deploy your PyTorchJob in your cluster.

In [None]:
# The namespace set here is reused by every subsequent API call.
training_client = TrainingClient(namespace=namespace)

# `job_kind` is set in `TrainingClient`
training_client.create_job(pytorchjob)

PyTorchJob kubeflow-user-example-com/pytorch-dist-mnist-gloo has been created


## Get the Created PyTorchJob

You can verify the name of the created PyTorchJob.

In [None]:
training_client.get_job(name).metadata.name

'pytorch-dist-mnist-gloo'

## Get the PyTorchJob Conditions

In [None]:
training_client.get_job_conditions(name=name)

[{'last_transition_time': datetime.datetime(2023, 9, 8, 21, 14, 59, tzinfo=tzutc()),
  'last_update_time': datetime.datetime(2023, 9, 8, 21, 14, 59, tzinfo=tzutc()),
  'message': 'PyTorchJob pytorch-dist-mnist-gloo is created.',
  'reason': 'PyTorchJobCreated',
  'status': 'True',
  'type': 'Created'},
 {'last_transition_time': datetime.datetime(2023, 9, 8, 21, 15, 45, tzinfo=tzutc()),
  'last_update_time': datetime.datetime(2023, 9, 8, 21, 15, 45, tzinfo=tzutc()),
  'message': 'PyTorchJob pytorch-dist-mnist-gloo is running.',
  'reason': 'JobRunning',
  'status': 'True',
  'type': 'Running'}]
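
The conditions list is ordered history, so the job's current state is the most recent condition whose status is `True`. The sketch below shows that lookup on plain dicts copied from the output above (trimmed to the fields it uses). It assumes dict-style access; if the SDK returns model objects instead, they would need converting first. `current_state` is a hypothetical helper, not an SDK function.

```python
from datetime import datetime, timezone
from typing import Optional

# Sample data mirroring the output above, as plain dicts.
conditions = [
    {
        "last_transition_time": datetime(2023, 9, 8, 21, 14, 59, tzinfo=timezone.utc),
        "reason": "PyTorchJobCreated",
        "status": "True",
        "type": "Created",
    },
    {
        "last_transition_time": datetime(2023, 9, 8, 21, 15, 45, tzinfo=timezone.utc),
        "reason": "JobRunning",
        "status": "True",
        "type": "Running",
    },
]


def current_state(conditions: list) -> Optional[str]:
    """Return the type of the most recent condition whose status is "True"."""
    active = [c for c in conditions if c["status"] == "True"]
    if not active:
        return None
    return max(active, key=lambda c: c["last_transition_time"])["type"]


print(current_state(conditions))  # Running
```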

## Wait Until PyTorchJob Finishes

In [None]:
pytorchjob = training_client.wait_for_job_conditions(name=name)

print(f"Succeeded number of replicas: {pytorchjob.status.replica_statuses['Master'].succeeded}")

NAME                           STATE                TIME
pytorch-dist-mnist-gloo        Running              2023-09-08 21:15:45+00:00
pytorch-dist-mnist-gloo        Running              2023-09-08 21:15:45+00:00
pytorch-dist-mnist-gloo        Succeeded            2023-09-08 21:26:44+00:00


Succeeded number of replicas: 1
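
Waiting on a job amounts to polling its state until an expected condition appears, a failure is seen, or a timeout expires. The following is a generic sketch of that pattern with a simulated state source. It is not the SDK's implementation; `wait_for_state`, its parameters, and the default values are hypothetical.

```python
import time
from typing import Callable, Iterable


def wait_for_state(
    get_state: Callable[[], str],
    expected: Iterable[str] = ("Succeeded",),
    failed: Iterable[str] = ("Failed",),
    timeout_s: float = 600.0,
    poll_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns an expected state, fails, or times out."""
    expected, failed = set(expected), set(failed)
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in expected:
            return state
        if state in failed:
            raise RuntimeError(f"job reached state {state!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state!r} after {timeout_s} seconds")
        sleep(poll_s)


# Simulated states standing in for repeated calls to the cluster.
states = iter(["Running", "Running", "Succeeded"])
print(wait_for_state(lambda: next(states), poll_s=0.0))  # Succeeded
```

Treating a failed state as an error, rather than polling until the timeout, lets a notebook stop early when the job cannot succeed.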


## Verify That PyTorchJob Has Succeeded

In [None]:
training_client.is_job_succeeded(name=name)

True

## Get the PyTorchJob Training Logs

In [None]:
training_client.get_job_logs(name=name)

The logs of pod pytorch-dist-mnist-gloo-master-0:
 Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!

accuracy=0.9669
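
The final metric can be pulled out of the log text for later comparison between runs. The sketch below assumes the logs are available as a single string and that the metric appears on its own line as `accuracy=<number>`, as in the output above; `last_accuracy` is a hypothetical helper, not part of the SDK.

```python
import re
from typing import Optional

# Excerpt of the log text shown above.
logs = """Using distributed PyTorch with gloo backend
Processing...
Done!

accuracy=0.9669
"""


def last_accuracy(log_text: str) -> Optional[float]:
    """Return the last `accuracy=<number>` value found in the logs, or None."""
    matches = re.findall(
        r"^accuracy=([0-9]*\.?[0-9]+)\s*$", log_text, flags=re.MULTILINE
    )
    return float(matches[-1]) if matches else None


print(last_accuracy(logs))  # 0.9669
```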




## Delete the PyTorchJob

In [None]:
training_client.delete_job(name)

PyTorchJob kubeflow-user-example-com/pytorch-dist-mnist-gloo has been deleted
