# Distributed training of a model

Training a model is often the most time- and resource-consuming part of the machine learning process. Large models can occupy multiple GPUs for days. Expect training this very simple model on a CPU to take a minute or more.

## Set up a Ray cluster for distributed training

A Ray cluster provides a distributed training environment consisting of multiple pods. The training job is distributed among the Ray pods based on available resources.
The Ray head pod serves as the main point of contact for the Ray API and provides a Dashboard UI for observing the Ray cluster status and the jobs being processed.

The CodeFlare SDK needs two pieces of authentication information to work properly: the OpenShift server URL and an authentication token.
If you are logged into the cluster, you can retrieve the authentication token by running `oc whoami -t`. The OpenShift server URL can be retrieved from `oc cluster-info`.
If your OpenShift cluster uses a self-signed certificate, set `skip_tls` in `TokenAuthentication` to `True`.
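
To avoid pasting the token directly into the notebook, the two values could be read from environment variables instead. The following is a minimal sketch, not part of the CodeFlare SDK: the helper `load_openshift_credentials` and the variable names `OPENSHIFT_TOKEN` and `OPENSHIFT_SERVER` are assumptions made for illustration.

```python
import os


def load_openshift_credentials(env=os.environ):
    """Return (token, server) read from environment variables.

    OPENSHIFT_TOKEN and OPENSHIFT_SERVER are hypothetical names chosen
    for this sketch; use whatever names your environment provides.
    """
    token = env.get("OPENSHIFT_TOKEN", "")
    server = env.get("OPENSHIFT_SERVER", "")

    # Fail early with a readable message instead of a later login error.
    missing = [
        name
        for name, value in (("OPENSHIFT_TOKEN", token), ("OPENSHIFT_SERVER", server))
        if not value
    ]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return token, server
```

The returned values can then be passed as the `token` and `server` arguments of `TokenAuthentication`.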

In [None]:
# Import codeflare-sdk dependencies
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication
import os

In [None]:
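# Create authentication object for user permissions used by CodeFlare SDK
auth = TokenAuthentication(
    token="",        # authentication token, e.g. the output of `oc whoami -t`
    server="",       # OpenShift server URL, e.g. from `oc cluster-info`
    skip_tls=False,  # set to True for clusters using a self-signed certificate
)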
auth.login()

In [None]:
# Create and configure Ray cluster
cluster = Cluster(ClusterConfiguration(
    name='fraud-detection',
    head_cpus=2,
    head_memory=6,
    # For GPU-enabled workloads, set head_extended_resource_requests
    # and worker_extended_resource_requests to the number of GPUs needed
    head_extended_resource_requests={'nvidia.com/gpu':0},
    worker_extended_resource_requests={'nvidia.com/gpu':0},
    num_workers=1,
    worker_cpu_requests=1,
    worker_cpu_limits=2,
    worker_memory_requests=4,
    worker_memory_limits=6,
))

In [None]:
# Bring up the cluster
cluster.up()
cluster.wait_ready()

In [None]:
# Show cluster details, including the Ray dashboard URL
cluster.details()

## Submit distributed training job

Once the Ray cluster is up and running, we can submit the training job itself.
Ray will download the dependencies defined in `requirements.txt` and execute the training job.

You can monitor the submitted training job either from the Ray dashboard (its URL is available above in the `cluster.details()` result) or by using the client functions below.

In [None]:
# Initialize the Job Submission Client.
# The CodeFlare SDK automatically gathers the dashboard address
# and authenticates using the Ray Job Submission Client.
client = cluster.job_client

In [None]:
# Submit a job that creates and trains the fraud detection model
submission_id = client.submit_job(
    entrypoint="python fraud_detection_sharded.py",
    runtime_env={
        "working_dir": "./ray",
        "pip": "./ray/requirements.txt",
        # Pass S3 connection settings from the notebook environment to the job
        "env_vars": {
            "AWS_ACCESS_KEY_ID": os.environ.get('AWS_ACCESS_KEY_ID'),
            "AWS_SECRET_ACCESS_KEY": os.environ.get('AWS_SECRET_ACCESS_KEY'),
            "AWS_S3_ENDPOINT": os.environ.get('AWS_S3_ENDPOINT'),
            "AWS_S3_BUCKET": os.environ.get('AWS_S3_BUCKET'),
        },
    },
)
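
Note that `os.environ.get` returns `None` when a variable is not set, and Ray's runtime environment is expected to accept only string values in `env_vars`. The sketch below shows one way to check the variables before submitting; the helper `collect_env_vars` and the constant `REQUIRED_ENV_VARS` are hypothetical names introduced here, not part of Ray or the CodeFlare SDK.

```python
import os

# The S3 variables that the submit_job call above forwards to the job.
REQUIRED_ENV_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_ENDPOINT",
    "AWS_S3_BUCKET",
)


def collect_env_vars(names=REQUIRED_ENV_VARS, env=os.environ):
    """Return a {name: value} dict, raising if any variable is unset or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

The result of `collect_env_vars()` could be used as the `env_vars` entry of `runtime_env`, so that a missing setting surfaces in the notebook rather than inside the training job.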

## Model training

The code below prints the log entries produced by the training job. Expect a delay before the first log entry appears: the job first downloads the resources it needs, which produces no log output.
Once the notebook cell finishes running, the training job has finished.

In [None]:
# Stream log output from the job until it finishes
async for lines in client.tail_job_logs(submission_id):
    print(lines, end="")

In [None]:
# Get job related info
client.get_job_info(submission_id)
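
If you prefer to wait for the job without streaming its logs, you can poll its status until it reaches a terminal state. The following is a minimal sketch under stated assumptions: `wait_for_job` is a hypothetical helper, `get_status` is any callable that returns the job status for a submission ID, and the terminal state names follow Ray's job statuses (`SUCCEEDED`, `FAILED`, `STOPPED`).

```python
import time

# Terminal job states, assumed to match Ray's JobStatus values.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(get_status, submission_id, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll get_status(submission_id) until a terminal state or a timeout.

    Returns the terminal state name as a string.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(submission_id)
        # Accept both plain strings and enum members carrying a .value.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {submission_id} still {name} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

Assuming the client exposes a `get_job_status` method, as the Ray Job Submission Client does, a call could look like `wait_for_job(client.get_job_status, submission_id)`.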

In [None]:
# Delete the Ray cluster when you have finished training
cluster.down()