# Training the Fraud Detection model with the Kubeflow Training Operator

The example fraud detection model is small and trains quickly. Many larger models, however, require multiple GPUs, and often multiple machines, to train. In this notebook, you learn how to use the Kubeflow Training Operator on OpenShift AI to scale out model training. You use the Training Operator SDK to create a PyTorchJob that executes the provided model training script.

### Install the Training Operator SDK

The Training Operator SDK is not available by default in the TensorFlow workbench image. Run the following command to install it:

In [None]:
%pip install "kubeflow @ git+https://github.com/opendatahub-io/kubeflow-sdk.git@v0.2.1+rhai0"

In [None]:
from kubeflow.trainer import TrainerClient
from kubeflow.trainer.rhai import TransformersTrainer

In [None]:
import sys
import os
sys.path.append("./kfto-scripts")  # needed to make training function available in the notebook
from train_pytorch_cpu import train_func

In [None]:
from kubernetes import client
from kubeflow.common.types import KubernetesBackendConfig

api_server = "https://XXXX"
token = "sha256~XXXX"

configuration = client.Configuration()
configuration.host = api_server
configuration.api_key = {"authorization": f"Bearer {token}"}
# Un-comment if your cluster API server uses a self-signed certificate or an un-trusted CA
# configuration.verify_ssl = False

backend_config = KubernetesBackendConfig(
    client_configuration=configuration,
    namespace="fraud-detection",  # your namespace
)

### Create a PyTorchJob

Use the Training Operator SDK client to submit a PyTorchJob.

The model training script is imported from the `kfto-scripts` folder.

The model training script loads the training data set and distributes it among the nodes, performs distributed training, and evaluates the model against the test data set. It then exports the trained model to ONNX format and uploads it to the S3 bucket that is specified in the provided connection.
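
The way a data set can be distributed among nodes is sketched below. This is an illustration only, not the code in `kfto-scripts`: the function name `shard_indices` is hypothetical, and the strided, padded split is an assumption modeled on the behavior of PyTorch's `DistributedSampler` without shuffling.

```python
def shard_indices(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Return the sample indices that one worker (rank) would process.

    Hypothetical helper for illustration; the real training script may
    split the data differently.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")

    # Ceiling division: every rank gets the same number of samples.
    per_rank = -(-num_samples // world_size)
    total = per_rank * world_size

    # Pad by repeating indices from the start until the list has `total` entries.
    indices = list(range(num_samples))
    indices = (indices * (total // num_samples + 1))[:total]

    # Strided split: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]


# With two nodes and ten samples, each node sees half of the data.
node0 = shard_indices(10, world_size=2, rank=0)
node1 = shard_indices(10, world_size=2, rank=1)
```

Padding keeps the number of batches equal on every node, which matters because the nodes synchronize gradients at each step and a node with fewer batches would leave the others waiting.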

In [None]:
trainer = TransformersTrainer(
    func=train_func,
    num_nodes=2,
    resources_per_node={"nvidia.com/gpu": 0},
    packages_to_install=[
        "s3fs",
        "boto3",
        "scikit-learn",
        "onnx",
    ],
    env={
        "AWS_ACCESS_KEY_ID": os.environ.get("AWS_ACCESS_KEY_ID"),
        "AWS_S3_BUCKET": os.environ.get("AWS_S3_BUCKET"),
        "AWS_S3_ENDPOINT": os.environ.get("AWS_S3_ENDPOINT"),
        "AWS_SECRET_ACCESS_KEY": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    },
)
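
Because `os.environ.get` returns `None` for a variable that is not set, a missing S3 setting could reach the training job unnoticed and only fail at upload time. The following is a minimal sketch of a check you could run in the workbench first. The helper `collect_env` and the constant `REQUIRED_S3_VARS` are hypothetical names introduced for illustration and are not part of the SDK.

```python
import os
from collections.abc import Iterable, Mapping

# The S3 connection variables that the trainer configuration above forwards to the job.
REQUIRED_S3_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_BUCKET",
    "AWS_S3_ENDPOINT",
)


def collect_env(
    names: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> dict[str, str]:
    """Return {name: value} for all names, or raise if any is unset or empty.

    Hypothetical helper for illustration only.
    """
    names = list(names)
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}
```

Calling `collect_env(REQUIRED_S3_VARS)` before building the trainer surfaces a missing connection value immediately, and the returned dictionary could be passed as the `env` argument in place of the individual `os.environ.get` calls.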

In [None]:
trainer_client = TrainerClient(backend_config=backend_config)
runtime = trainer_client.get_runtime("torch-distributed")
job_name = trainer_client.train(
    trainer=trainer,
    runtime=runtime,  # use the runtime fetched above
)

In [None]:
trainer_client.wait_for_job_status(
    name=job_name,
    status={"Running"},
    timeout=3600,  # 1 hour
)

In [None]:
print("\n--- Streaming logs ---")
for log_line in trainer_client.get_job_logs(name=job_name, follow=True):
    print(log_line)