# PyTorch DDP Speech Recognition Training Example

This example shows how to train a small transformer network to classify spoken words from Google's [Speech Commands](https://huggingface.co/datasets/google/speech_commands) dataset. The dataset is small (about 2.3 GB), so a small model trains quickly.

This notebook walks you through running the example locally and then scaling PyTorch DDP across multiple nodes with a Kubeflow TrainJob.


## Prepare the Kubernetes environment using Kind

If you already have your own Kubernetes cluster, you can skip this step.

For demo purposes, we will create a Kubernetes cluster with [Kind](https://kind.sigs.k8s.io/). The folder that contains this Jupyter notebook also contains a Kind configuration file, `kind-config.yaml`. It defines a cluster with 3 worker nodes and mounts the host's `/data` directory into the Kind nodes, so data you download to `/data` on your local machine is also accessible from inside the cluster.
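
The layout described above can be sketched as a Python dictionary that mirrors the Kind configuration schema. This is a hypothetical reconstruction for illustration only: the actual `kind-config.yaml` next to this notebook may differ, and mounting `/data` on the worker nodes only is an assumption.

```python
import json

# Hypothetical reconstruction of the layout described above; the real
# kind-config.yaml that ships with this notebook may differ.
data_mount = {"hostPath": "/data", "containerPath": "/data"}

kind_config = {
    "kind": "Cluster",
    "apiVersion": "kind.x-k8s.io/v1alpha4",
    # One control-plane node plus three workers. Assumption: only the
    # workers mount the host's /data folder.
    "nodes": [{"role": "control-plane"}]
    + [{"role": "worker", "extraMounts": [dict(data_mount)]} for _ in range(3)],
}

# Print the structure so it can be compared with the YAML file by eye.
print(json.dumps(kind_config, indent=2))
```

Comparing the printed structure with your own `kind-config.yaml` is a quick way to confirm that every worker node has the `/data` mount before you create the cluster.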

To create the Kind cluster, run the following command.

**Note:** This creates a Kind cluster named `ml`. You only need to run this command once.


In [7]:
!kind create cluster --name ml --config kind-config.yaml

Creating cluster "ml" ...
 ✓ Ensuring node image (kindest/node:v1.34.0) 🖼
 ✓ Preparing nodes 📦 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-ml"
You can now use your cluster with:

kubectl cluster-info --context kind-ml

Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂


## Add CRD and Kubeflow Trainer operator to Kubernetes cluster

Full installation instructions are in the [Kubeflow Trainer installation guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/). In short, run this command:

In [8]:
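# Each `!` command runs in its own subshell, so a variable set with `!export` is lost.
# %env sets it in the notebook's environment, where the next command can read it.
%env VERSION=v2.0.0
!kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=${VERSION}"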

namespace/kubeflow-system serverside-applied
customresourcedefinition.apiextensions.k8s.io/clustertrainingruntimes.trainer.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/jobsets.jobset.x-k8s.io serverside-applied
customresourcedefinition.apiextensions.k8s.io/trainingruntimes.trainer.kubeflow.org serverside-applied
customresourcedefinition.apiextensions.k8s.io/trainjobs.trainer.kubeflow.org serverside-applied
serviceaccount/jobset-controller-manager serverside-applied
serviceaccount/kubeflow-trainer-controller-manager serverside-applied
role.rbac.authorization.k8s.io/jobset-leader-election-role serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-manager-role serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-metrics-reader serverside-applied
clusterrole.rbac.authorization.k8s.io/jobset-proxy-role serverside-applied
clusterrole.rbac.authorization.k8s.io/kubeflow-trainer-controller-manager serverside-applied
rolebinding.rbac.authoriz

## Prepare Docker Image

We need a Docker image that has the packages from `requirements.txt` installed. The `Dockerfile` and `requirements.txt` are in the same folder as this Jupyter notebook.

To build:

In [9]:
!docker build -t speech-recognition-image:0.1 -f Dockerfile .

[+] Building 0.3s (1/2)                                          docker:default
 => [internal] load build definition from Dockerfile                       0.0s
 => => transferring dockerfile: 197B                                       0.0s
 => [internal] load metadata for docker.io/pytorch/pytorch:2.8.0-cuda12.8  0.3s

### Make the image available to the cluster

#### Kind cluster

If you are using a local Kind cluster, run the following command to load the Docker image into it:

In [25]:
!kind load docker-image speech-recognition-image:0.1 --name ml

Image: "speech-recognition-image:0.1" with ID "sha256:f98d06d275aa85a352ca3b4ee886fd7c10a052fef62e430c01853e6a1cffc689" not yet present on node "ml-worker2", loading...
Image: "speech-recognition-image:0.1" with ID "sha256:f98d06d275aa85a352ca3b4ee886fd7c10a052fef62e430c01853e6a1cffc689" not yet present on node "ml-worker", loading...
Image: "speech-recognition-image:0.1" with ID "sha256:f98d06d275aa85a352ca3b4ee886fd7c10a052fef62e430c01853e6a1cffc689" not yet present on node "ml-worker3", loading...
Image: "speech-recognition-image:0.1" with ID "sha256:f98d06d275aa85a352ca3b4ee886fd7c10a052fef62e430c01853e6a1cffc689" not yet present on node "ml-control-plane", loading...


#### Kubernetes cluster

If you are not using a local Kind cluster, push the image to your own Docker registry:

```bash
docker image push <your docker image name with registry info>
```

## Add Runtime to K8s cluster

In the same folder as this Jupyter notebook there is a runtime manifest, `kubeflow-runtime-example.yaml`.

**Before applying it, change the image in that file to the one you just built or pushed.**


In [19]:
!kubectl apply -f kubeflow-runtime-example.yaml

clustertrainingruntime.trainer.kubeflow.org/torch-distributed-speech-recognition created


## Install the Kubeflow SDK

You need to install the Kubeflow SDK to interact with Kubeflow Trainer APIs:

In [24]:
!pip install git+https://github.com/kubeflow/sdk.git@main

Collecting git+https://github.com/kubeflow/sdk.git@main
  Cloning https://github.com/kubeflow/sdk.git (to revision main) to /tmp/pip-req-build-dhm1y012
  Running command git clone --filter=blob:none --quiet https://github.com/kubeflow/sdk.git /tmp/pip-req-build-dhm1y012
  Resolved https://github.com/kubeflow/sdk.git to commit 6709dcff0f3e68d44b37531d3154829e626f4b62
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Prepare Speech Command Dataset

For demo purposes and to simplify the process, we download the data to `/data` on the host. The Kind cluster mounts the host's `/data` folder at `/data` on its nodes, and the Kubeflow runtime mounts that folder into the training pods at `/data` using a `hostPath` volume. As a result, every component reads the data from `/data`. **Please make sure the `/data` folder exists on the host.**

For other clusters, create a volume so that the data is accessible at `/data`.

The exact path of the Speech Commands dataset is `/data/SpeechCommands/speech_commands_v0.02`.

To download the data, run the code below.


In [15]:
import torchaudio

print("Downloading SpeechCommands dataset...")

# Downloads the data to /data/SpeechCommands if it is not already there.
# No subset is passed, so this loads the full dataset, not only the training split.
train_dataset = torchaudio.datasets.SPEECHCOMMANDS(root="/data", download=True)

print("Download complete!")
print(f"Number of training samples: {len(train_dataset)}")


Downloading SpeechCommands dataset...
Download complete!
Number of training samples: 105829
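
Once the download has finished, it can help to confirm that the dataset folder is where the training code expects it and to see which classes it contains. The sketch below is a hypothetical helper, not part of the notebook's training code. It assumes each word class is a subfolder of the dataset directory and that folders starting with an underscore, such as `_background_noise_`, are not classes.

```python
from pathlib import Path


def list_command_labels(dataset_dir):
    """Return the sorted class-folder names found under dataset_dir."""
    root = Path(dataset_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset folder not found: {root}")
    # Assumption: every class is a subfolder, and helper folders such as
    # "_background_noise_" start with an underscore and are not classes.
    return sorted(
        p.name for p in root.iterdir() if p.is_dir() and not p.name.startswith("_")
    )


def build_label_index(labels):
    """Map each label to a stable integer id, in the order given."""
    return {label: index for index, label in enumerate(labels)}


if __name__ == "__main__":
    dataset_dir = "/data/SpeechCommands/speech_commands_v0.02"
    try:
        labels = list_command_labels(dataset_dir)
    except FileNotFoundError as error:
        # The folder is missing, for example because the download was skipped.
        print(error)
    else:
        print(f"Found {len(labels)} classes")
        print(build_label_index(labels))
```

Sorting the folder names keeps the label ids identical on every node, which matters when several workers build the mapping independently.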


## Create TrainerClient with Kubeflow Trainer SDK

In [16]:
from kubeflow.trainer import CustomTrainer, TrainerClient

client = TrainerClient()

In [21]:
from kubeflow.trainer import Runtime, RuntimeTrainer, TrainerType

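# Build a Runtime object locally for illustration. The training cell below
# uses the runtime fetched from the cluster, not this object.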
custom_runtime = Runtime(
    name="custom-pytorch-runtime",
    trainer=RuntimeTrainer(
        trainer_type=TrainerType.CUSTOM_TRAINER,
        framework="torch",
        num_nodes=2,
    ),
    pretrained_model=None
)



## Get runtime from K8s cluster

After running the cell below, you should see output like the following. If the cell prints nothing, the most likely reason is that the custom Kubeflow runtime has not been created. In that case, go back to the "Add Runtime to K8s cluster" step and create it with `kubectl`.
```
Runtime(name='torch-distributed-speech-recognition', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None)
```



In [20]:
for runtime in client.list_runtimes():
    print(runtime)
    if runtime.name == "torch-distributed-speech-recognition":
        torch_runtime = runtime


Runtime(name='torch-distributed-speech-recognition', trainer=RuntimeTrainer(trainer_type=<TrainerType.CUSTOM_TRAINER: 'CustomTrainer'>, framework='torch', num_nodes=1, device='Unknown', device_count='Unknown'), pretrained_model=None)
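
If the runtime is missing, the loop above leaves `torch_runtime` undefined and the failure only shows up later as a `NameError`. A minimal sketch of a stricter lookup follows. `find_runtime` is a hypothetical helper, not part of the Kubeflow SDK, and it only assumes that each runtime object has a `name` attribute, as the printed output suggests. The demo uses stand-in objects so it runs without a cluster.

```python
from types import SimpleNamespace


def find_runtime(runtimes, name):
    """Return the runtime called `name`, or raise an error listing what exists."""
    runtimes = list(runtimes)  # allow any iterable, including generators
    for runtime in runtimes:
        if runtime.name == name:
            return runtime
    available = sorted(r.name for r in runtimes)
    raise LookupError(
        f"Runtime {name!r} not found. Available runtimes: {available}. "
        "Apply the runtime manifest with kubectl and try again."
    )


# Stand-in objects for demonstration. With a real cluster you would pass
# client.list_runtimes() instead.
demo_runtimes = [SimpleNamespace(name="torch-distributed-speech-recognition")]
demo = find_runtime(demo_runtimes, "torch-distributed-speech-recognition")
print(demo.name)
```

With a live cluster, the equivalent call would be `torch_runtime = find_runtime(client.list_runtimes(), "torch-distributed-speech-recognition")`.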


## Start training

The training code is in `train_with_kubeflow_trainer.py`, in the same folder as this Jupyter notebook.

In [22]:
import train_with_kubeflow_trainer

job_name = client.train(
    trainer=CustomTrainer(
        func=train_with_kubeflow_trainer.run,
        # Set how many PyTorch nodes you want to use for distributed training.
        num_nodes=2,
        # Set the resources for each PyTorch node.
        resources_per_node={
            "cpu": 2,
            "memory": "8Gi",
            # Uncomment this to distribute the TrainJob using GPU nodes.
            # "nvidia.com/gpu": 1,
        },
    ),
    runtime=torch_runtime,
)
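
The function passed as `func` lives in `train_with_kubeflow_trainer.py` and is not shown in this notebook. For orientation, here is a minimal sketch of the general shape a PyTorch DDP training function tends to have. It is not the notebook's actual `run` function: `run_sketch`, the toy linear model, and the random tensors are all placeholders, and the sketch assumes the launcher sets the usual `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` environment variables. Imports are kept inside the function on the assumption that the function body is shipped to the workers on its own.

```python
def run_sketch():
    """Hypothetical DDP training function; NOT train_with_kubeflow_trainer.run."""
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    # Assumption: the launcher exports RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Placeholder data: 256 random 40-dimensional features with 35 classes.
    dataset = TensorDataset(torch.randn(256, 40), torch.randint(0, 35, (256,)))
    # DistributedSampler gives each process a different slice of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Placeholder model; DDP averages gradients across processes in backward().
    model = torch.nn.Linear(40, 35).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for features, targets in loader:
            features, targets = features.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(features), targets)
            loss.backward()
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch} last-batch loss {loss.item():.4f}")

    dist.destroy_process_group()
```

A function of this shape could be passed as `CustomTrainer(func=run_sketch, num_nodes=2)` in the same way as the real training function above.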

In [23]:
client.wait_for_job_status(name=job_name, status={"Running"})

RuntimeError: Failed to get TrainJob: default/lfb5b5487cbe

In [None]:
# Alternatively, use kubectl to list the pods, then inspect their logs
!kubectl get pod