# Kubeflow Trainer: Container Backend (Single-Node) Training

This notebook demonstrates how to run single-node training using the **Container Backend** with Docker or Podman.

## Container Backend

- **Container Runtime**: Docker or Podman required
- **Use Case**: Testing container workflows, simulating production environments
- **Prerequisites**: Python 3.9+ and Docker Desktop/Engine OR Podman

This example trains a CNN on the classic [MNIST](http://yann.lecun.com/exdb/mnist/) handwritten digit dataset using PyTorch.

## Install the Kubeflow SDK

You need to install the Kubeflow SDK with container backend support:

In [None]:
# Uncomment to install
# %pip install -U kubeflow[docker]  # For Docker
# %pip install -U kubeflow[podman]  # For Podman

## Define the Training Function

The first step is to define a function that trains a CNN model on the MNIST data.

In [1]:
def train_mnist():
    import torch
    import torch.nn.functional as F
    from torch import nn, optim
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Define the PyTorch CNN model to be trained
    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 20, 5, 1)
            self.conv2 = nn.Conv2d(20, 50, 5, 1)
            self.fc1 = nn.Linear(4 * 4 * 50, 500)
            self.fc2 = nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # Create the model
    model = Net()
    
    # Load MNIST dataset
    dataset = datasets.MNIST(
        './data',
        train=True,
        download=True,
        transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
    )
    train_loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
    
    for epoch in range(1, 3):
        model.train()
        
        # Iterate over mini-batches from the training set
        for batch_idx, (data, target) in enumerate(train_loader):
            # Forward pass
            outputs = model(data)
            loss = F.nll_loss(outputs, target)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            if batch_idx % 100 == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )

    torch.save(model.state_dict(), "mnist_cnn.pt")
    print("Training is finished")
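
The `4 * 4 * 50` input size of `fc1` follows from the layer shapes in `Net`. The sketch below retraces that arithmetic in plain Python, assuming 28x28 MNIST inputs and the standard output-size formula for unpadded convolution and pooling layers. The helper name `conv_output_size` is hypothetical and not part of PyTorch or the Kubeflow SDK.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a convolution or pooling layer (square input, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                           # MNIST images are 28x28
size = conv_output_size(size, kernel=5)             # conv1: 28 -> 24
size = conv_output_size(size, kernel=2, stride=2)   # max_pool2d: 24 -> 12
size = conv_output_size(size, kernel=5)             # conv2: 12 -> 8
size = conv_output_size(size, kernel=2, stride=2)   # max_pool2d: 8 -> 4

flattened_features = size * size * 50               # conv2 has 50 output channels
print(flattened_features)                           # 800, i.e. 4 * 4 * 50
```

If you change a kernel size or the input resolution, rerun this calculation and update both `fc1` and the `x.view(...)` call to match.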

## Configure Container Backend

The container backend automatically detects and uses either Docker or Podman:

In [None]:
from kubeflow.trainer import TrainerClient, ContainerBackendConfig
import os

# Auto-detects Docker or Podman
backend_config = ContainerBackendConfig()

# Optional: Force specific runtime
# backend_config = ContainerBackendConfig(runtime="docker")  # Force Docker
# backend_config = ContainerBackendConfig(runtime="podman")  # Force Podman

# Optional: For Colima on macOS
# backend_config = ContainerBackendConfig(
#     container_host=f"unix://{os.path.expanduser('~')}/.colima/default/docker.sock"
# )

# Optional: For Podman Machine on macOS
# (the socket path below is an example; it varies by machine and user)
# backend_config = ContainerBackendConfig(
#     runtime="podman",
#     container_host="unix:///run/user/1000/podman/podman.sock"
# )
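
When you set `container_host` by hand, it can help to build the socket URL in one place and check that the socket exists before creating the client. The following is a minimal sketch, assuming Colima keeps each profile's socket at `~/.colima/<profile>/docker.sock`, as in the commented example above. The helpers `colima_docker_host` and `socket_exists` are hypothetical and not part of the Kubeflow SDK.

```python
from pathlib import Path


def colima_docker_host(profile: str = "default") -> str:
    """Build the unix:// URL for a Colima profile's Docker socket (assumed layout)."""
    socket_path = Path.home() / ".colima" / profile / "docker.sock"
    return f"unix://{socket_path}"


def socket_exists(container_host: str) -> bool:
    """Return True if a unix:// URL points at an existing filesystem path."""
    prefix = "unix://"
    if not container_host.startswith(prefix):
        return False
    return Path(container_host[len(prefix):]).exists()


host = colima_docker_host()
print(host, "(found)" if socket_exists(host) else "(not found)")
```

If the socket is found, `host` can be passed as `container_host` to `ContainerBackendConfig`, as in the commented Colima example.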

## Initialize Client

Initialize the TrainerClient with the Container Backend:

In [3]:
client = TrainerClient(backend_config=backend_config)

## List the Training Runtimes

List the available Training Runtimes and pick the one to use for your TrainJob.

In [4]:
for runtime in client.list_runtimes():
    print(runtime)
    # Keep a reference to the PyTorch runtime; it is passed to client.train() below
    if runtime.name == "torch-distributed":
        torch_runtime = runtime

Runtime(name='torch-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None, image='pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime')


## Run the TrainJob

Submit the training job to the Container Backend (single container):

In [5]:
from kubeflow.trainer import CustomTrainer

job_name = client.train(
    trainer=CustomTrainer(
        func=train_mnist,
        packages_to_install=["torchvision"],
    ),
    runtime=torch_runtime,
)

## Check the TrainJob Status

You can check the status of the TrainJob you just created.

In [6]:
job = client.get_job(job_name)
print(f"Job: {job.name}, Status: {job.status}")

Job: m299e1022d7a, Status: Running
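
A single status check only shows the state at that moment. To block until the job finishes, you could poll the status in a loop. The following is a rough sketch with a hypothetical helper, `wait_for_status`. The status strings `"Created"`, `"Complete"` and `"Failed"` are assumptions (only `"Running"` appears in the output above), and a fake status source stands in for `client.get_job(job_name).status` so the example runs without a container runtime.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    done: Iterable[str] = ("Complete", "Failed"),  # assumed terminal statuses
    interval: float = 2.0,
    timeout: float = 600.0,
) -> str:
    """Poll get_status() until it returns a status in `done` or the timeout expires."""
    done = set(done)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in status {status!r} after {timeout}s")
        time.sleep(interval)


# Fake status sequence standing in for `lambda: client.get_job(job_name).status`
fake_statuses = iter(["Created", "Running", "Running", "Complete"])
final_status = wait_for_status(lambda: next(fake_statuses), interval=0.0)
print(final_status)
```

With a real client, you would pass `lambda: client.get_job(job_name).status` as `get_status` and adjust `done` to the status values your SDK version reports.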


## Watch the TrainJob Logs

You can use the `get_job_logs()` API to get the TrainJob logs. With `follow=True`, log lines are streamed as the job produces them.

In [7]:
for logline in client.get_job_logs(job_name, follow=True):
    print(logline, end='')

100%|██████████| 9.91M/9.91M [00:01<00:00, 5.53MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 233kB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 2.52MB/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 3.37MB/s]
Training is finished
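
The log lines can also be processed rather than printed. As an illustration, the sketch below parses progress lines of the form that `train_mnist` prints (`Train Epoch: ... Loss: ...`) into dictionaries. The helper `parse_progress` is hypothetical, and the sample line is made up to match that format string; it is not taken from a real run.

```python
import re

# Matches the print() format used in train_mnist:
# "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}"
PROGRESS_PATTERN = re.compile(
    r"Train Epoch: (\d+) \[(\d+)/(\d+) \((\d+)%\)\]\s+Loss: ([\d.]+)"
)


def parse_progress(line: str):
    """Return epoch, samples seen, dataset size and loss, or None for other lines."""
    match = PROGRESS_PATTERN.search(line)
    if match is None:
        return None
    epoch, seen, total, _percent, loss = match.groups()
    return {
        "epoch": int(epoch),
        "seen": int(seen),
        "total": int(total),
        "loss": float(loss),
    }


# Made-up sample line in the train_mnist format (the loss value is illustrative)
sample = "Train Epoch: 1 [6400/60000 (11%)]\tLoss: 0.345678\n"
print(parse_progress(sample))
print(parse_progress("Training is finished\n"))
```

In the notebook, you would call `parse_progress(logline)` inside the `for logline in client.get_job_logs(...)` loop and collect the results that are not `None`.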


## Delete the TrainJob

When the TrainJob is finished, you can delete the resource.

In [None]:
# client.delete_job(job_name)