# PyTorch DDP Speech Recognition Training Example

This example demonstrates how to train a transformer network to classify spoken words using Google's [Speech Commands](https://huggingface.co/datasets/google/speech_commands) dataset.

This notebook walks you through running the example locally and shows how to scale PyTorch DDP across multiple nodes with a Kubeflow TrainJob.


## Prepare the Kubernetes environment using Kind

If you already have your own Kubernetes cluster, you can skip this step.

For demo purposes, you can create a Kubernetes cluster with [Kind](https://kind.sigs.k8s.io/). The `kind-config.yaml` file in the same folder as this notebook defines a cluster with 3 workers and mounts `/data` from the host into the Kind cluster. Any data you download to `/data` on your local machine is therefore also accessible from inside the Kind cluster.

To create the Kind cluster, run the following command.

**Note:** This creates a Kind cluster named `ml`. You only need to run this command once.


In [None]:
# !kind create cluster --name ml --config kind-config.yaml
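Because the cluster only needs to be created once, it can help to check whether it already exists before re-running the cell. The sketch below parses the output of `kind get clusters`, which prints one cluster name per line. The helper names `cluster_exists` and `list_kind_clusters` are hypothetical and not part of this example's files.

```python
import subprocess


def cluster_exists(listing: str, name: str = "ml") -> bool:
    """Return True if `name` is one of the lines in `kind get clusters` output."""
    return name in (line.strip() for line in listing.splitlines())


def list_kind_clusters() -> str:
    """Run `kind get clusters` and return its stdout (requires kind on PATH)."""
    result = subprocess.run(
        ["kind", "get", "clusters"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

With Kind installed, `cluster_exists(list_kind_clusters())` tells you whether the `ml` cluster is already there.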

## Add CRD and Kubeflow Trainer operator to Kubernetes cluster

The full instructions are in the [installation guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/). In short, run the following commands:

In [None]:
# %env VERSION=v2.0.0
# !kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=${VERSION}"
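An alternative that avoids shell variables altogether is to build the kustomize URL in Python and pass it to `kubectl`. This is a minimal sketch: `manifest_url` and `apply_command` are hypothetical helpers, and the URL layout is copied from the command above.

```python
def manifest_url(version: str) -> str:
    """Build the kustomize URL for the Kubeflow Trainer manager overlay."""
    if not version.startswith("v"):
        raise ValueError(f"expected a tag like 'v2.0.0', got {version!r}")
    return (
        "https://github.com/kubeflow/trainer.git/manifests/overlays/manager"
        f"?ref={version}"
    )


def apply_command(version: str) -> list[str]:
    """Return the kubectl argv that installs the given Trainer version."""
    return ["kubectl", "apply", "--server-side", "-k", manifest_url(version)]
```

The resulting list can be handed to `subprocess.run(apply_command("v2.0.0"), check=True)`.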

## Prepare Docker Image

We need to build a Docker image that includes the dependencies listed in `requirements.txt`. The `Dockerfile` and `requirements.txt` are in the same folder as this Jupyter notebook.

To build the image:

In [None]:
# !docker build -t speech-recognition-image:0.1 -f Dockerfile .

### Make the image available to the cluster

#### Kind cluster

If you are using a local Kind cluster, run the following command to load the Docker image into it:

In [None]:
# !kind load docker-image speech-recognition-image:0.1 --name ml

#### Kubernetes cluster

If you are not using a local Kind cluster, push the image to your own Docker registry:

```bash
docker image push <your docker image name with registry info>
```
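The image name and tag appear in several places (build, load, push, and the runtime YAML), and they have to match. One way to keep them aligned is to define them once and derive every command from that single value. This is a sketch: `image_ref` is a hypothetical helper, and the tag value is an assumption taken from the build step.

```python
from typing import Optional

IMAGE_NAME = "speech-recognition-image"
IMAGE_TAG = "0.1"  # assumption: the tag used in the build step


def image_ref(name: str, tag: str, registry: Optional[str] = None) -> str:
    """Return `name:tag`, prefixed with `registry/` when a registry is given."""
    prefix = f"{registry.rstrip('/')}/" if registry else ""
    return f"{prefix}{name}:{tag}"


# Both commands are derived from the same reference, so they cannot drift apart.
local_ref = image_ref(IMAGE_NAME, IMAGE_TAG)
build_cmd = ["docker", "build", "-t", local_ref, "-f", "Dockerfile", "."]
kind_load_cmd = ["kind", "load", "docker-image", local_ref, "--name", "ml"]
```

For a remote cluster, pass your own registry host as the third argument and use the result in `docker image push`.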

## Add Runtime to K8s cluster

In the same folder as this Jupyter notebook, there is a `kubeflow-runtime-example.yaml` file.

**Before applying it, change the image in that file to the one you just built or pushed.**


In [None]:
# !kubectl apply -f kubeflow-runtime-example.yaml

## Install the Kubeflow SDK

You need to install the Kubeflow SDK to interact with Kubeflow Trainer APIs:

In [None]:
# !pip install git+https://github.com/kubeflow/sdk.git@main

In [2]:
from kubeflow.trainer import CustomTrainer, TrainerClient

client = TrainerClient()

In [3]:
from kubeflow.trainer import Runtime, RuntimeTrainer, TrainerType

# Define a Runtime object locally. It is not used by the cells below,
# which fetch the runtime from the cluster instead.
custom_runtime = Runtime(
    name="custom-pytorch-runtime",
    trainer=RuntimeTrainer(
        trainer_type=TrainerType.CUSTOM_TRAINER,
        framework="torch",
        num_nodes=2,
    ),
    pretrained_model=None
)



In [4]:
for runtime in client.list_runtimes():
    print(runtime)
    if runtime.name == "torch-distributed-speech-recognition":
        torch_runtime = runtime


Runtime(name='torch-distributed-speech-recognition', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None)
Runtime(name='torch-distributed-voice-recognition', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None)
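The loop above leaves `torch_runtime` undefined if no runtime has the expected name, which only surfaces later as a `NameError`. A small lookup helper can fail early with a clearer message. This is a sketch: `find_runtime` is hypothetical and relies only on the `name` attribute shown in the output above.

```python
from typing import Any, Iterable


def find_runtime(runtimes: Iterable[Any], name: str) -> Any:
    """Return the first runtime whose `.name` equals `name`, or raise LookupError."""
    available = []
    for runtime in runtimes:
        if runtime.name == name:
            return runtime
        available.append(runtime.name)
    raise LookupError(f"runtime {name!r} not found; available: {sorted(available)}")
```

In this notebook it could be used as `torch_runtime = find_runtime(client.list_runtimes(), "torch-distributed-speech-recognition")`.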


In [6]:
import train_with_kubeflow_trainer

job_name = client.train(
    trainer=CustomTrainer(
        func=train_with_kubeflow_trainer.run,
        # Set how many PyTorch nodes you want to use for distributed training.
        num_nodes=2,
        # Set the resources for each PyTorch node.
        resources_per_node={
            "cpu": 2,
            "memory": "8Gi",
            # Uncomment this to distribute the TrainJob using GPU nodes.
            # "nvidia.com/gpu": 1,
        },
    ),
    runtime=torch_runtime,
)
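The contents of `train_with_kubeflow_trainer.run` are not shown in this notebook. The following is only a sketch of the general shape a DDP entry point tends to have, not the module's actual code. It uses random tensors and a small MLP as placeholders for the Speech Commands data and the transformer model, and it assumes the launcher sets the usual `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, and `MASTER_PORT` environment variables, as `torchrun` does. Imports are kept inside the function on the assumption that the function is shipped to the training pods on its own.

```python
def run():
    """Hypothetical DDP entry point; a stand-in, not the example's real module."""
    import os

    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    # Pick the backend and device from what is available on this node.
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Placeholder data: random features and labels (assumed sizes).
    num_features, num_classes = 40, 10
    dataset = TensorDataset(
        torch.randn(512, num_features), torch.randint(0, num_classes, (512,))
    )
    # The sampler gives each rank a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Placeholder model wrapped in DDP so gradients are averaged across ranks.
    model = nn.Sequential(
        nn.Linear(num_features, 64), nn.ReLU(), nn.Linear(64, num_classes)
    ).to(device)
    model = DDP(model, device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for features, labels in loader:
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch}: last batch loss {loss.item():.4f}")

    dist.destroy_process_group()
```

A function of this shape could be passed as `func=run` to `CustomTrainer` in place of the imported module function.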

In [6]:
client.wait_for_job_status(name=job_name, status={"Running"})




TrainJob(name='m1b42f2aab8e', creation_timestamp=datetime.datetime(2025, 9, 9, 6, 7, 17, tzinfo=TzInfo(UTC)), runtime=Runtime(name='torch-distributed-voice-recognition', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None), steps=[Step(name='node-0', status='Running', pod_name='m1b42f2aab8e-node-0-0-hxsmv', device='cpu', device_count='4'), Step(name='node-1', status='Running', pod_name='m1b42f2aab8e-node-0-1-k869r', device='cpu', device_count='4'), Step(name='node-2', status='Running', pod_name='m1b42f2aab8e-node-0-2-xtz4g', device='cpu', device_count='4')], num_nodes=3, status='Running')
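
The returned `TrainJob` lists one step per node, each with its own status and pod name. A compact per-node view is easier to scan than the full repr. This is a sketch: `summarize_steps` and `all_running` are hypothetical helpers that read only the `steps`, `name`, `status`, and `pod_name` attributes visible in the output above.

```python
from typing import Any, Dict


def summarize_steps(job: Any) -> Dict[str, str]:
    """Map each step name to 'status (pod_name)' for a TrainJob-like object."""
    return {step.name: f"{step.status} ({step.pod_name})" for step in job.steps}


def all_running(job: Any) -> bool:
    """Return True when the job has steps and every one reports 'Running'."""
    return bool(job.steps) and all(step.status == "Running" for step in job.steps)
```

For example, assign the result of `client.wait_for_job_status(...)` to `job` and call `summarize_steps(job)`.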