# Kubeflow Trainer: Local Training

This notebook demonstrates how to run single-node training using the **Local Process Backend**.

## Local Process Backend

- **Container Runtime**: None (native Python subprocess)
- **Use Case**: Quick testing, debugging, rapid iteration
- **Prerequisites**: Python 3.9+ only
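
To make "native Python subprocess" concrete, here is a minimal standard-library sketch of running a script in a separate Python process. It only illustrates the general idea and is not the backend's actual implementation; the `run_in_subprocess` helper and the one-line script are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_subprocess(script_source: str) -> str:
    """Write script_source to a temporary file and run it in a child Python process."""
    with tempfile.TemporaryDirectory() as workdir:
        script_path = Path(workdir) / "train_script.py"
        script_path.write_text(script_source)
        # sys.executable is the interpreter running the current process.
        result = subprocess.run(
            [sys.executable, str(script_path)],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError if the child exits non-zero
        )
    # The temporary directory is removed when the `with` block exits.
    return result.stdout


output = run_in_subprocess('print("hello from the child process")')
print(output, end="")
```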

This example trains a CNN on the classic [MNIST](http://yann.lecun.com/exdb/mnist/) handwritten digit dataset using PyTorch.

## Install the Kubeflow SDK

You need to install the Kubeflow SDK to interact with Kubeflow Trainer APIs:

In [None]:
# Uncomment to install
# %pip install -U kubeflow

## Define the Training Function

The first step is to create a function that trains a CNN model on the MNIST data.

In [1]:
def train_mnist():
    import torch
    import torch.nn.functional as F
    from torch import nn, optim
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Define the PyTorch CNN model to be trained
    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 20, 5, 1)
            self.conv2 = nn.Conv2d(20, 50, 5, 1)
            self.fc1 = nn.Linear(4 * 4 * 50, 500)
            self.fc2 = nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # Create the model
    model = Net()
    
    # Load MNIST dataset
    dataset = datasets.MNIST(
        './data',
        train=True,
        download=True,
        transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
    )
    train_loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
    
    for epoch in range(1, 3):
        model.train()
        
        # Iterate over mini-batches from the training set
        for batch_idx, (data, target) in enumerate(train_loader):
            # Forward pass
            outputs = model(data)
            loss = F.nll_loss(outputs, target)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            if batch_idx % 100 == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )

    torch.save(model.state_dict(), "mnist_cnn.pt")
    print("Training is finished")
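
The `4 * 4 * 50` input size of `fc1` follows from the layer shapes in `Net`. The sketch below reproduces that arithmetic in plain Python, using the standard output-size formula for a convolution or pooling layer without padding or dilation. The `out_size` helper is introduced here for illustration only.

```python
def out_size(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of a conv or pooling layer with no padding or dilation."""
    return (size - kernel) // stride + 1


size = 28                    # MNIST images are 28x28 pixels
size = out_size(size, 5, 1)  # conv1: 5x5 kernel, stride 1 -> 24
size = out_size(size, 2, 2)  # max_pool2d: 2x2 window, stride 2 -> 12
size = out_size(size, 5, 1)  # conv2: 5x5 kernel, stride 1 -> 8
size = out_size(size, 2, 2)  # max_pool2d: 2x2 window, stride 2 -> 4

channels = 50                # number of conv2 output channels
flattened = size * size * channels
print(size, flattened)       # 4 800
```

If the kernel sizes or the input resolution change, the first argument of `nn.Linear` in `fc1` and the `view` call in `forward` have to change with them.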

## Configure Local Process Backend

Initialize the Local Process Backend configuration:

In [2]:
from kubeflow.trainer import TrainerClient, LocalProcessBackendConfig

# Configure Local Process Backend
backend_config = LocalProcessBackendConfig(
    cleanup_venv=True  # Auto-cleanup virtual environments after job completes
)

## Initialize Client

Initialize the TrainerClient with the Local Process Backend:

In [3]:
client = TrainerClient(backend_config=backend_config)

## List the Training Runtimes

List the available Training Runtimes and select the one to use for your TrainJob.

In [4]:
for runtime in client.list_runtimes():
    print(runtime)
    if runtime.name == "torch-distributed":
        torch_runtime = runtime

Runtime(name='torch-distributed', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None, image=None)
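
If no runtime named `torch-distributed` is returned, the loop never assigns `torch_runtime`, and the later `train()` call fails with a `NameError`. A small lookup helper makes that failure explicit. This is a sketch: `find_runtime` is a hypothetical helper, and it assumes only that each runtime object exposes a `name` attribute, as in the output above. The `SimpleNamespace` objects stand in for real runtimes.

```python
from types import SimpleNamespace


def find_runtime(runtimes, name):
    """Return the first runtime whose .name equals name, or raise a clear error."""
    available = []
    for runtime in runtimes:
        if runtime.name == name:
            return runtime
        available.append(runtime.name)
    raise LookupError(f"Runtime {name!r} not found; available: {available}")


# Stand-in objects for illustration only. With a real client you would pass
# the result of client.list_runtimes() instead.
fake_runtimes = [SimpleNamespace(name="torch-distributed")]
selected = find_runtime(fake_runtimes, "torch-distributed")
print(selected.name)
```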


## Run the TrainJob

Submit the training job to the Local Process Backend:

In [5]:
from kubeflow.trainer import CustomTrainer

job_name = client.train(
    trainer=CustomTrainer(
        func=train_mnist,
        packages_to_install=["torch", "torchvision"],
    ),
    runtime=torch_runtime,
)

## Check the TrainJob Status

You can check the status of the TrainJob you just created.

In [6]:
job = client.get_job(job_name)
print(f"Job: {job.name}, Status: {job.status}")

Job: a2711556169f, Status: Running
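
The status above is a snapshot taken while the job is still `Running`. One way to wait for completion is to poll until a terminal status appears. The helper below is a generic sketch: `wait_for_status` is hypothetical, and the terminal status strings `"Complete"` and `"Failed"` are assumptions that should be checked against the statuses your SDK version reports. It accepts any zero-argument callable, so a stub replaces the client here.

```python
import time


def wait_for_status(get_status, terminal=("Complete", "Failed"),
                    timeout=600.0, interval=2.0):
    """Call get_status() repeatedly until it returns a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status is still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Stub that reports "Running" twice and then "Complete" (illustration only).
statuses = iter(["Running", "Running", "Complete"])
final_status = wait_for_status(lambda: next(statuses), interval=0.01)
print(final_status)
```

With the client from this notebook, the callable could be `lambda: client.get_job(job_name).status`.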


## Watch the TrainJob Logs

You can use the `get_job_logs()` API to get the TrainJob logs. Passing `follow=True` streams log lines as the job produces them.

In [7]:
for logline in client.get_job_logs(job_name, follow=True):
    print(logline, end='')

Operating inside /var/folders/r3/kwn1z7n15nq3rh54ykdsy73r0000gn/T/a2711556169f120m5qtg
Looking in links: /tmp/tmpdbsqoh3g
Processing /private/tmp/tmpdbsqoh3g/pip-23.2.1-py3-none-any.whl
Installing collected packages: pip
Successfully installed pip-23.2.1
Collecting torch
  Obtaining dependency information for torch from https://files.pythonhosted.org/packages/dd/5f/b85bd8c05312d71de9402bf5868d217c38827cfd09d8f8514e5be128a52b/torch-2.9.0-cp312-none-macosx_11_0_arm64.whl.metadata
  Using cached torch-2.9.0-cp312-none-macosx_11_0_arm64.whl.metadata (30 kB)
Collecting torchvision
  Obtaining dependency information for torchvision from https://files.pythonhosted.org/packages/47/ef/81e4e69e02e2c4650b30e8c11c8974f946682a30e0ab7e9803a831beff76/torchvision-0.24.0-cp312-cp312-macosx_11_0_arm64.whl.metadata
  Using cached torchvision-0.24.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (5.9 kB)
Collecting filelock (from torch)
  Obtaining dependency information for filelock from https://files.python

## Delete the TrainJob

When the TrainJob is finished, you can delete the resource.

In [None]:
client.delete_job(job_name)