# Training Fraud Detection using CodeFlare

The fraud detection model is very small and trains quickly. Many larger models, however, need multiple GPUs and often multiple machines to train. In this notebook, we'll demonstrate how to scale out training using Ray on OpenShift AI. We'll use the CodeFlare SDK to create the cluster and submit the job. Full documentation for the SDK is available [here](https://project-codeflare.github.io/codeflare-sdk/detailed-documentation/).

This demo requires `codeflare-sdk` 0.19.1 or later. Let's begin by installing the SDK if it's missing or out of date.

In [None]:
# Quote the requirement so the shell doesn't treat ">" as an output redirect
!pip install --upgrade "codeflare-sdk>=0.19.1"
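
To confirm the install picked up a new enough version, the check below compares the installed version against the minimum. It's a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helpers, and the comparison assumes plain dotted numeric versions such as `0.19.1`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.19.1' into (0, 19, 1). Assumes plain dotted numeric versions."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum="0.19.1"):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = version("codeflare-sdk")
except PackageNotFoundError:
    print("codeflare-sdk is not installed")
else:
    print(installed, "ok" if meets_minimum(installed) else "too old")
```

Comparing tuples of integers rather than raw strings avoids ranking `0.9.0` above `0.19.1`.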

### Preparing the data

Normally data would already be in a shared location, but since it's local, we'll upload it to our object storage so we can show sharded data loading from a shared data source. Once it's uploaded, we can work with it using Ray Data so it's properly sharded across the workers.

In [None]:
import sys
sys.path.append('./utils')

import utils.s3

utils.s3.upload_directory_to_s3("data", "data")
print("---")
utils.s3.list_objects("data")
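
The `utils.s3` helpers are specific to this repository, and their implementation isn't shown here. As a rough illustration of what uploading a directory under a prefix involves, the sketch below maps each local file to an object key. `object_keys_for_directory` is a hypothetical function, not part of `utils.s3`, and it performs no upload.

```python
from pathlib import Path


def object_keys_for_directory(local_dir, prefix):
    """Map every file under local_dir to an object key under prefix.

    Hypothetical helper for illustration only; nothing is uploaded.
    """
    root = Path(local_dir)
    keys = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            keys[str(path)] = f"{prefix.rstrip('/')}/{relative}"
    return keys


# Print the mapping for the local "data" directory, if it exists.
for local_path, key in object_keys_for_directory("data", "data").items():
    print(local_path, "->", key)
```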

### Authenticate to the cluster using the OpenShift console login

We'll be creating the Kubernetes objects for Ray clusters using the CodeFlare SDK. To do that, you need permission to create them in your own namespace. The easiest way to authenticate is with the `oc` client. You can copy the `oc login` command from the OpenShift console.

<figure>
    <img src="./assets/copy-login.png"  alt="copy login"  >
</figure>

As a helper, you can run the below cell and click the link to take you directly to the token request page.

In [None]:
# To launch a Ray cluster, you will need to authenticate yourself against the OpenShift cluster.
# Run this cell to get the full instructions.

import re
import os

NOTEBOOK_ARGS = os.environ.get('NOTEBOOK_ARGS', '')
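match = re.search(r'"hub_host":"https://.*?(apps\.[^"]+)"', NOTEBOOK_ARGS)
if match is None:
    # NOTEBOOK_ARGS is missing or has no hub_host entry
    raise RuntimeError("hub_host not found in NOTEBOOK_ARGS; copy the login command from the OpenShift console instead.")
hub_host_value = match.group(1)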

login_url = 'https://oauth-openshift.' + hub_host_value + "/oauth/token/request"

print('Open the following URL to get your authentication token.')
print('Authenticate, then click on "Display token", and copy the content of the line under "Log in with this token"')
print('You can then come back here and paste this content in the next cell.')
print('Login URL: ' + login_url)
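
To see what the regular expression extracts, the sketch below applies it to a made-up `NOTEBOOK_ARGS` value. The sample string and the `login_url_from_args` helper are hypothetical; the real variable is set by the workbench and may contain other fields.

```python
import re


def login_url_from_args(notebook_args):
    """Return the token request URL, or None if no hub_host is present."""
    match = re.search(r'"hub_host":"https://.*?(apps\.[^"]+)"', notebook_args)
    if match is None:
        return None
    return "https://oauth-openshift." + match.group(1) + "/oauth/token/request"


# Hypothetical value, for illustration only.
sample = '{"hub_host":"https://rhods-dashboard-redhat-ods-applications.apps.example.com"}'
print(login_url_from_args(sample))
print(login_url_from_args(""))
```

The lazy `.*?` skips the dashboard's own host prefix, so only the part starting at `apps.` is reused for the OAuth route.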

We can run the `oc login` command the same way we run any terminal command.  You can designate a cell as `%%bash` or you can paste a command after `!`.

Paste the `oc login --token=sha256~XXXX --server=https://XXXX` command in the following cell designated with `%%bash` and run it.

In [None]:
%%bash
oc login --token=sha256~XXXX --server=https://XXXX

## Create a Ray Cluster

### Configure our Ray cluster

CodeFlare lets you specify many parameters, such as the number of workers, the container image, and the Kueue local queue name. A full list is available [here](https://project-codeflare.github.io/codeflare-sdk/detailed-documentation/cluster/config.html).

In [None]:
from codeflare_sdk import Cluster, ClusterConfiguration

cluster = Cluster(ClusterConfiguration(
    name="raycluster-cpu",
    head_gpus=0,
    num_gpus=0,
    num_workers=2,
    min_cpus=1,
    max_cpus=4,
    min_memory=2,
    max_memory=4
))
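
Before starting the cluster, it can help to estimate the worst-case resources the workers may claim, for example to compare against a namespace quota. The sketch below multiplies the per-worker limits by the worker count. `worker_totals` is a hypothetical helper, the head node is not counted, and memory is assumed to be expressed in gigabytes.

```python
def worker_totals(num_workers, max_cpus, max_memory):
    """Upper bound on CPUs and memory claimed by all workers combined."""
    return {"cpus": num_workers * max_cpus, "memory": num_workers * max_memory}


# Values from the configuration above: 2 workers, each limited to 4 CPUs and 4 units of memory.
print(worker_totals(num_workers=2, max_cpus=4, max_memory=4))
```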


### Start the cluster

This will create the necessary Kubernetes objects to run the Ray cluster.  It may take a few minutes to complete.

In [None]:
cluster.up()
cluster.wait_ready()

### Alternatively, get a running cluster object

This is useful when you've already created a cluster, but you've restarted the Python kernel, closed the notebook, or are working in a different notebook.

Uncomment the lines below to connect to an existing cluster, replacing `name` and `namespace` with your own values.

In [None]:
# from codeflare_sdk import get_cluster

# cluster = get_cluster(name, namespace=namespace)

Now we can show information about the cluster, including a link to the dashboard.  There we can inspect the running jobs and logs, and see the resources being used.
<figure>
    <img src="./assets/codeflare-details.png"  alt="codeflare details" width="400">
</figure>



In [None]:
cluster.details()

The link to the Ray dashboard is available in the details above.  It should look something like this:

<figure>
    <img src="./assets/ray-dashboard.png"  alt="ray dashboard" width="600">
</figure>


## Ray Job Submission

### Initialize the Job Submission Client

To submit jobs, we need to connect to the running Ray cluster. We do this by initializing the job client, which carries the proper authentication and connection information.


In [None]:
client = cluster.job_client

Now that we're connected, we can query the cluster. Let's see if there are any existing jobs.

In [None]:
client.list_jobs()

### Create a Runtime Environment

Now we can set up the [runtime environment](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) for the job.  This includes the working directory, files to exclude, dependencies, and environment variables.

```python
runtime_env={
    "working_dir": "./", # relative path to files uploaded to the job
    "excludes": ["local_data/"], # directories and files to exclude from being uploaded to the job
    "pip": ["boto3", "botocore"], # can also be a string path to a requirements.txt file
    "env_vars": {
        "MY_ENV_VAR": "MY_ENV_VAR_VALUE",
        "MY_ENV_VAR_2": os.environ.get("MY_ENV_VAR_2"),
    },
}
```

In [None]:
import os

# script = "test_data_loader.py"
script = "train_tf_cpu.py"
runtime_env = {
    "working_dir": "./ray-scripts",
    "excludes": [],
    "pip": "./ray-scripts/requirements.txt",
    "env_vars": {
        "AWS_ACCESS_KEY_ID": os.environ.get("AWS_ACCESS_KEY_ID"),
        "AWS_SECRET_ACCESS_KEY": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        "AWS_S3_ENDPOINT": os.environ.get("AWS_S3_ENDPOINT"),
        "AWS_DEFAULT_REGION": os.environ.get("AWS_DEFAULT_REGION"),
        "AWS_S3_BUCKET": os.environ.get("AWS_S3_BUCKET"),
        "NUM_WORKERS": "1",
        "TRAIN_DATA": "data/train.csv",
        "VALIDATE_DATA": "data/validate.csv",
        "MODEL_OUTPUT_PREFIX": "models/fraud/1/",
    },
}
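
`os.environ.get` returns `None` for a variable that isn't set, and Ray expects every `env_vars` value to be a string, so a missing credential can make the submission fail. A small pre-flight check, sketched below, reports what's missing first. `missing_env_vars` is a hypothetical helper, not part of Ray or the SDK.

```python
import os


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


required = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_ENDPOINT",
    "AWS_DEFAULT_REGION",
    "AWS_S3_BUCKET",
]
missing = missing_env_vars(required)
if missing:
    print("Set these before submitting:", ", ".join(missing))
```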

### Submit the configured job

Now we can submit the job to the cluster. This will create the necessary Kubernetes objects to run the job. The job will run the script with the specified runtime environment. The script is located in [ray-scripts/train_tf_cpu.py](./ray-scripts/train_tf_cpu.py) and closely follows the official [Ray TensorFlow example](https://docs.ray.io/en/latest/train/distributed-tensorflow-keras.html). While we're using TensorFlow, there are examples for PyTorch and other frameworks as well.

In [None]:
submission_id = client.submit_job(
    entrypoint=f"python {script}",
    runtime_env=runtime_env,
)

print(submission_id)

### Query Important Job Information

In [None]:
# Get the job's status
print(client.get_job_status(submission_id), "\n")

# Get job related info
print(client.get_job_info(submission_id), "\n")

# Get the job's logs
print(client.get_job_logs(submission_id))

We can also tail the job logs to watch the progress of the job.

In [None]:
# Iterate through the logs of a job 
async for lines in client.tail_job_logs(submission_id):
    print(lines, end="")
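
When a notebook needs to block until the job finishes rather than tail its logs, one option is to poll the status until it reaches a terminal state. The sketch below takes any zero-argument callable that returns the status, so it isn't tied to a particular client. `wait_for_job` and its parameters are hypothetical names, and the default interval and timeout are arbitrary.

```python
import time

# Ray reports these job states as terminal.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(get_status, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll get_status() until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        # Accept either a plain string or an enum with a .value attribute.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {name} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_job(lambda: client.get_job_status(submission_id))`.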

### List Jobs

In [None]:
client.list_jobs()

### Stop jobs

To stop a job, call `stop_job` with its submission ID. Here we list all the jobs and stop each one.

In [None]:
for job_details in client.list_jobs():
    print(f"stopping {job_details.submission_id}")
    client.stop_job(job_details.submission_id)

### Delete jobs

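We can also delete the jobs. Ray only deletes jobs that have reached a terminal state, so a job that was just stopped may need a moment before it can be deleted.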

In [None]:
for job_details in client.list_jobs():
    print(f"deleting {job_details.submission_id}")
    client.delete_job(job_details.submission_id)

client.list_jobs()

### Delete the cluster

Once we're done training, we can delete the cluster.  This will remove the Kubernetes objects and free up resources.

In [None]:
cluster.down()