# Training the fraud detection model with Ray by using CodeFlare

The example fraud detection model is very small and trains quickly. However, many large models require multiple GPUs, and often multiple machines, to train. In this notebook, you learn how to scale out model training by using Ray on OpenShift AI. You use the CodeFlare SDK to create the cluster and submit the job. For details, see the [CodeFlare SDK documentation](https://project-codeflare.github.io/codeflare-sdk/detailed-documentation/).

For this procedure, you need codeflare-sdk 0.19.1 or later. Begin by installing the SDK if it is not already installed or up to date. The following cell pins codeflare-sdk to 0.19.1 and also installs the lakeFS Python SDK (`lakefs` 0.7.1):

In [None]:
!pip install --upgrade codeflare-sdk==0.19.1 lakefs==0.7.1

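## Define the lakeFS repository and create a training branch
### Change the MinIO access and secret keys

In the following cell, replace the `MinIO Access Key` and `MinIO Secret Key` placeholder values with your own MinIO credentials before you run it.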

In [None]:
import os
import lakefs

repo_name = os.environ.get('LAKEFS_REPO_NAME')

mainBranch = "main"
trainingBranch = "train01"

os.environ["PIPELINE_ARTIFACTS_ENDPOINT_URL"] = "http://minio:9000"
os.environ["PIPELINE_ARTIFACTS_ACCESS_KEY_ID"] = "MinIO Access Key"
os.environ["PIPELINE_ARTIFACTS_SECRET_ACCESS_KEY"] = "MinIO Secret Key"
os.environ["PIPELINE_ARTIFACTS_S3_BUCKET"] = "pipeline-artifacts"

repo = lakefs.Repository(repo_name)
branchMain = repo.branch(mainBranch)
print(repo)

branchTraining = repo.branch(trainingBranch).create(source_reference=mainBranch, exist_ok=True)
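
The cell above reads `LAKEFS_REPO_NAME` from the environment without checking that it is set, so a missing value only surfaces later as a less obvious error. A minimal sketch of failing early, assuming you prefer an explicit check: `require_env` is a hypothetical helper (not part of lakeFS or the CodeFlare SDK), and the example values are made up.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for the given variables, or raise if any are unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing required environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# An explicit mapping stands in for os.environ; the values are illustrative only.
example_env = {
    "LAKEFS_REPO_NAME": "fraud-repo",
    "PIPELINE_ARTIFACTS_S3_BUCKET": "pipeline-artifacts",
}
settings = require_env(["LAKEFS_REPO_NAME", "PIPELINE_ARTIFACTS_S3_BUCKET"], example_env)
print(settings["LAKEFS_REPO_NAME"])
```

In the notebook, you could call `require_env(["LAKEFS_REPO_NAME"])` before constructing the repository object.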

### Preparing the data

Normally, the training data for your model would be available in a shared location. For this example, the data is local. You must upload it to your object storage so that you can see how data loading from a shared data source works. After you upload the data, you can work with it by using Ray Data so that it is properly shared across the worker machines.

In [None]:
import sys
sys.path.append('./utils')

import utils.s3

utils.s3.upload_directory_to_s3("data", f"{trainingBranch}/data")
print("---")
utils.s3.list_objects(f"{trainingBranch}/data")
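
The source of `utils.s3.upload_directory_to_s3` is not shown in this notebook. As a rough illustration of the idea only, the following sketch maps the files in a local directory to object keys under a branch prefix such as `train01/data`. `iter_object_keys` is a hypothetical function, and it performs no upload.

```python
import os
import tempfile


def iter_object_keys(local_dir, key_prefix):
    """Yield (local_path, object_key) pairs for every file under local_dir."""
    for root, _dirs, files in os.walk(local_dir):
        for filename in sorted(files):
            local_path = os.path.join(root, filename)
            # Object keys use forward slashes regardless of the local OS separator.
            relative = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            yield local_path, f"{key_prefix.rstrip('/')}/{relative}"


# Demonstrate with a throwaway directory instead of the real "data" folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train.csv", "validate.csv"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("placeholder\n")
    keys = sorted(key for _path, key in iter_object_keys(tmp, "train01/data"))

print(keys)
```

Because the branch name is the first path segment of each key, the same data on another branch would differ only in that prefix.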

### Authenticate to the cluster by using the OpenShift console login

You use the CodeFlare SDK to create the Kubernetes objects for Ray clusters. To do so, you need access permission for your own namespace. The easiest way to set up access is by using the OpenShift CLI (`oc`) client.

From the OpenShift web console, you can generate an `oc login` command that includes your token and server information. You can use the command to log in to the OpenShift CLI. 

1. To generate the command, select **Copy login command** from the username drop-down menu at the top right of the web console.

    <figure>
        <img src="./assets/copy-login.png"  alt="copy login"  >
    </figure>

2. Click **Display token**.

3. Below **Log in with this token**, take note of the parameters for token and server.
   For example:
    ```
    oc login --token=sha256~LongString --server=https://api.your-cluster.domain.com:6443
    ```    
    - token: `sha256~LongString`
    - server: `https://api.your-cluster.domain.com:6443`
    
4. In the following code cell, in the TokenAuthentication object, replace the token and server values with the values that you noted in Step 3.
   For example:
   ```
   auth = TokenAuthentication(
       token = "sha256~LongString",
       server = "https://api.your-cluster.domain.com:6443",
       skip_tls=False
   )
   auth.login()
   ```


In [None]:
from codeflare_sdk import TokenAuthentication
# Create an authentication object for user permissions.
# If unused, the SDK automatically checks for a default kubeconfig, then the in-cluster config.
# KubeConfigFileAuthentication can also be used to specify a kubeconfig path manually.
auth = TokenAuthentication(
    token = "sha256~XXXX",
    server = "https://XXXX",
    skip_tls=False
)
auth.login()
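
If you prefer not to copy the token and server values by hand, you can extract them from the copied `oc login` command. The following is a minimal sketch under the assumption that the command uses the `--token=...` and `--server=...` form shown in Step 3. `parse_oc_login` is a hypothetical helper, not part of the CodeFlare SDK.

```python
import shlex


def parse_oc_login(command):
    """Extract the --token and --server values from an `oc login` command string."""
    values = {}
    for part in shlex.split(command):
        for flag in ("--token=", "--server="):
            if part.startswith(flag):
                # "--token=" becomes the key "token"; the rest of the part is the value.
                values[flag.strip("-=")] = part[len(flag):]
    missing = {"token", "server"} - set(values)
    if missing:
        raise ValueError(f"Command is missing: {', '.join(sorted(missing))}")
    return values


creds = parse_oc_login(
    "oc login --token=sha256~LongString --server=https://api.your-cluster.domain.com:6443"
)
print(creds["server"])
```

The resulting `creds["token"]` and `creds["server"]` values can then be passed to `TokenAuthentication`. The sketch does not handle the space-separated `--token value` form.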

## Create a Ray cluster

### Configure a Ray cluster

CodeFlare allows you to specify parameters such as the number of workers, the image, and the Kueue local queue name. For a full list of parameters, see the [cluster configuration documentation](https://project-codeflare.github.io/codeflare-sdk/detailed-documentation/cluster/config.html).

In [None]:
from codeflare_sdk import Cluster, ClusterConfiguration

cluster = Cluster(ClusterConfiguration(
    name="raycluster-cpu",
    head_extended_resource_requests={'nvidia.com/gpu': 0},
    worker_extended_resource_requests={'nvidia.com/gpu': 0},
    num_workers=2,
    worker_cpu_requests=1,
    worker_cpu_limits=4,
    worker_memory_requests=2,
    worker_memory_limits=4,
    image="quay.io/modh/ray:2.35.0-py39-cu121"
))
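
Before you start the cluster, it can help to estimate the aggregate worker resources against your namespace quota. The following sketch does that arithmetic for the configuration above. It assumes that the integer memory values are interpreted as gigabytes, and it ignores the head node, whose resources are left at their defaults. `worker_totals` is a hypothetical helper.

```python
def worker_totals(num_workers, cpu_requests, cpu_limits, memory_requests, memory_limits):
    """Return aggregate worker requests and limits across all workers."""
    # Kubernetes rejects containers whose requests exceed their limits.
    if cpu_requests > cpu_limits or memory_requests > memory_limits:
        raise ValueError("requests must not exceed limits")
    return {
        "cpu_requests": num_workers * cpu_requests,
        "cpu_limits": num_workers * cpu_limits,
        "memory_requests": num_workers * memory_requests,
        "memory_limits": num_workers * memory_limits,
    }


# Values copied from the ClusterConfiguration in the previous cell.
totals = worker_totals(
    num_workers=2, cpu_requests=1, cpu_limits=4, memory_requests=2, memory_limits=4
)
print(totals)
```

With two workers, the cluster requests 2 CPUs and can grow to 8, and requests 4 memory units with a ceiling of 8.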


### Start the cluster

If you already have a running cluster that you want to connect to, skip the next cell and continue with **Connect to a running cluster**.

To start a cluster, run the following cell to create the necessary Kubernetes objects to run the Ray cluster. This step might take a few minutes to complete.

In [None]:
cluster.up()
cluster.wait_ready()

### Connect to a running cluster

If you've already created a cluster but have restarted the Python kernel, closed the notebook, or are working in a different notebook, run the following cell to connect to the existing cluster. Set `name` and `namespace` to match your cluster.

In [None]:
from codeflare_sdk import get_cluster
name="raycluster-cpu"
namespace="lakefs"
cluster = get_cluster(name, namespace=namespace)

You can view information about the cluster, including a link to the Ray dashboard. In the Ray dashboard, you can inspect the running jobs and logs, and see the resources being used.
<figure>
    <img src="./assets/codeflare-details.png"  alt="codeflare details" width="400">
</figure>



In [None]:
cluster.details()

The cluster details returned by the previous cell include a link to the Ray dashboard. The dashboard looks similar to the following:

<figure>
    <img src="./assets/ray-dashboard.png"  alt="ray dashboard" width="600">
</figure>


## Ray job submission

### Initialize the Job Submission Client

If you want to submit jobs, connect to the running Ray cluster by initializing the job client that has the proper authentication and connection information.


In [None]:
client = cluster.job_client

After you connect to the Ray cluster, you can query the cluster to determine whether there are any existing jobs:

In [None]:
client.list_jobs()

### Create a runtime environment

Now you can configure the [runtime environment](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) for the job. This step includes specifying the working directory, files to exclude, dependencies, and environment variables. For example:

```python
runtime_env={
    "working_dir": "./", # relative path to files uploaded to the job
    "excludes": ["local_data/"], # directories and files to exclude from being uploaded to the job
    "pip": ["boto3", "botocore"], # can also be a string path to a requirements.txt file
    "env_vars": {
        "MY_ENV_VAR": "MY_ENV_VAR_VALUE",
        "MY_ENV_VAR_2": os.environ.get("MY_ENV_VAR_2"),
    },
}
```

In [None]:
import os

# script = "test_data_loader.py"
script = "train_tf_cpu_lakefs.py"
runtime_env = {
    "working_dir": "./ray-scripts",
    "excludes": [],
    "pip": "./ray-scripts/requirements.txt",
    "env_vars": {
        "AWS_ACCESS_KEY_ID": os.environ.get("AWS_ACCESS_KEY_ID"),
        "AWS_SECRET_ACCESS_KEY": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        "AWS_S3_ENDPOINT": os.environ.get("AWS_S3_ENDPOINT"),
        "AWS_DEFAULT_REGION": os.environ.get("AWS_DEFAULT_REGION"),
        "AWS_S3_BUCKET": os.environ.get("AWS_S3_BUCKET"),
        "PIPELINE_ARTIFACTS_ENDPOINT_URL": os.environ.get("PIPELINE_ARTIFACTS_ENDPOINT_URL"),
        "PIPELINE_ARTIFACTS_ACCESS_KEY_ID": os.environ.get("PIPELINE_ARTIFACTS_ACCESS_KEY_ID"),
        "PIPELINE_ARTIFACTS_SECRET_ACCESS_KEY": os.environ.get("PIPELINE_ARTIFACTS_SECRET_ACCESS_KEY"),
        "PIPELINE_ARTIFACTS_S3_BUCKET": os.environ.get("PIPELINE_ARTIFACTS_S3_BUCKET"),
        "NUM_WORKERS": "1",
        "TRAIN_DATA": f"{trainingBranch}/data/train.csv",
        "VALIDATE_DATA": f"{trainingBranch}/data/validate.csv",
        "MODEL_OUTPUT_PREFIX": f"{trainingBranch}/models/fraud/1/",
    },
}
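
Note that `os.environ.get` returns `None` for any variable that is not set, and Ray expects the values in `env_vars` to be strings, so an unset variable is likely to cause the submission to fail. The following is a minimal sketch of building the dictionary defensively. `collect_env_vars` is a hypothetical helper, and the example values are made up.

```python
import os


def collect_env_vars(names, extra=None, environ=None):
    """Copy the named variables from the environment, skipping unset ones.

    Returns (env_vars, missing) so the caller can decide whether a gap is fatal.
    """
    environ = os.environ if environ is None else environ
    env_vars = {name: environ[name] for name in names if environ.get(name) is not None}
    missing = [name for name in names if name not in env_vars]
    # Literal settings are converted to strings so every value has the same type.
    env_vars.update({key: str(value) for key, value in (extra or {}).items()})
    return env_vars, missing


fake_environ = {"AWS_S3_BUCKET": "my-bucket"}  # stand-in for os.environ
env_vars, missing = collect_env_vars(
    ["AWS_S3_BUCKET", "AWS_DEFAULT_REGION"],
    extra={"NUM_WORKERS": 1},
    environ=fake_environ,
)
print(env_vars, missing)
```

Checking `missing` before you submit the job tells you which variables still need to be set in the notebook environment.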

### Submit the configured job

Now you can submit the job to the cluster. This step creates the necessary Kubernetes objects to run the job. The job runs the script with the specified runtime environment. The script for this example is located in [ray-scripts/train_tf_cpu_lakefs.py](./ray-scripts/train_tf_cpu_lakefs.py). The script closely follows the official [Ray TensorFlow example](https://docs.ray.io/en/latest/train/distributed-tensorflow-keras.html). This example uses TensorFlow, but the [Ray site](https://docs.ray.io/en/latest/train/train.html) also provides examples for PyTorch and other frameworks.

In [None]:
submission_id = client.submit_job(
    entrypoint=f"python {script}",
    runtime_env=runtime_env,
)

print(submission_id)

### Query important job information

In [None]:
# Get the job's status
print(client.get_job_status(submission_id), "\n")

# Get job related info
print(client.get_job_info(submission_id), "\n")

# Get the job's logs
print(client.get_job_logs(submission_id))
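
A single status query only shows a snapshot. To block until the job finishes, you can poll the status until it reaches a terminal state. The following sketch assumes the terminal states are `SUCCEEDED`, `FAILED`, and `STOPPED`. `wait_for_terminal` is a hypothetical helper, and a fake status source stands in for the real client so that the block runs on its own.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_terminal(get_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        # Accept either a plain string or an enum-like object with a .value attribute.
        name = getattr(status, "value", status)
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {name} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Stand-in status source that simulates a job finishing on the third poll.
fake_statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
final = wait_for_terminal(lambda: next(fake_statuses), poll_seconds=0, sleep=lambda _s: None)
print(final)
```

With the real client, passing `lambda: client.get_job_status(submission_id)` as `get_status` should work, provided the returned status is a string or exposes its name through `.value`.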

You can also tail the job logs to watch the progress of the job.

In [None]:
# Iterate through the logs of a job 
async for lines in client.tail_job_logs(submission_id):
    print(lines, end="")

### List jobs

In [None]:
client.list_jobs()

### Stop jobs

If you want to stop a job, call `stop_job` and specify the submission ID. The following cell lists all the jobs and stops each one.

In [None]:
for job_details in client.list_jobs():
    print(f"stopping {job_details.submission_id}")
    client.stop_job(job_details.submission_id)
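
The loop above calls `stop_job` for every job, including jobs that have already finished. The following sketch selects only the jobs that are still active. It assumes each job record exposes `submission_id` and `status` fields and that `PENDING` and `RUNNING` are the active states. `FakeJob` and `active_submission_ids` are hypothetical stand-ins, and the IDs are made up.

```python
from collections import namedtuple

ACTIVE_STATES = {"PENDING", "RUNNING"}

# Stand-in for the job records returned by list_jobs(); only the two fields used here.
FakeJob = namedtuple("FakeJob", ["submission_id", "status"])


def active_submission_ids(jobs):
    """Return the submission IDs of jobs that are still pending or running."""
    ids = []
    for job in jobs:
        # Accept either a plain string or an enum-like object with a .value attribute.
        state = getattr(job.status, "value", job.status)
        # Skip records without a submission ID, which cannot be stopped by ID.
        if job.submission_id and state in ACTIVE_STATES:
            ids.append(job.submission_id)
    return ids


jobs = [
    FakeJob("raysubmit_a", "RUNNING"),
    FakeJob("raysubmit_b", "SUCCEEDED"),
    FakeJob(None, "RUNNING"),
]
print(active_submission_ids(jobs))
```

In the notebook, you could pass `client.list_jobs()` to the function and call `client.stop_job` on each returned ID.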

### Delete jobs

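You can also delete jobs, which removes them and their associated data. Ray deletes only jobs that have reached a terminal state, so stop any running jobs first.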

In [None]:
for job_details in client.list_jobs():
    print(f"deleting {job_details.submission_id}")
    client.delete_job(job_details.submission_id)

client.list_jobs()

### Delete the cluster

After you complete training, you can delete the cluster. When you delete the cluster, you remove the Kubernetes objects and free up resources.

In [None]:
cluster.down()