![](https://www.lemagit.fr/visuals/LeMagIT/hero_article/Societe-General-logo.jpg)

# Fight against financial crimes 

## Context

Banks invest heavily in preventing financial fraud, and credit card fraud is one of their main targets. To give a little context:

* Global losses to credit card fraud amount to approximately $35 billion annually.
* The amount of credit card data available on the dark web increased by 135% last year.
* 130,928 credit card fraud reports were recorded in the United States in 2018.

Needless to say, this is a huge issue. [Société Générale](https://www.societegenerale.com/en/ai-no-longer-just-option-finance), one of the biggest European banks, has led innovation initiatives that use AI to detect credit card fraud more efficiently.

They have provided you with an anonymized dataset and need you to build an ML algorithm that:

* Is able to accurately predict credit card fraud
* Is not a black box: they need to be able to trace how the algorithm arrived at each result

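As a rough illustration of the second requirement: tree ensembles such as random forests expose per-feature importances, which give a first, global view of what drives the predictions. The sketch below runs on synthetic data, and the feature names (`f0` to `f3`) are hypothetical; it is one possible starting point, not the bank's required method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data: only feature f0 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0.5).astype(int)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Impurity-based importances: one non-negative value per feature, summing to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

These importances describe the model as a whole. Explaining a single transaction would need additional tooling, which is outside the scope of this sketch.
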
## Dataset

The dataset is available here 👉 [CreditCardFraud.csv](https://lead-program-assets.s3.eu-west-3.amazonaws.com/M01-Distributed_machine_learning/datasets/creditcard.csv)

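Before modelling, it can help to check how imbalanced the target is. The helper below is a minimal sketch: `class_balance` is a hypothetical name, the toy frame stands in for the real CSV, and it assumes the target column is called `Class` with `1` marking fraud.

```python
import pandas as pd


def class_balance(frame: pd.DataFrame, target: str = "Class") -> dict:
    """Return the share of rows per target value, e.g. {0: 0.9, 1: 0.1}."""
    return frame[target].value_counts(normalize=True).to_dict()


# Toy frame standing in for the real CSV: 9 legitimate rows, 1 fraudulent row.
toy = pd.DataFrame({"Amount": range(10), "Class": [0] * 9 + [1]})
balance = class_balance(toy)
print(balance)
```

On the real data, the same call would be `class_balance(data)` once the CSV has been loaded into `data`.
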
## Exercise - Part I - Train a model locally using Ray

Let's train a classification model that will predict fraud on the transactions in the dataset.

1. Start by importing the needed dependencies:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

import joblib
from ray.util.joblib import register_ray
```

2. Load the dataset, isolate the predictors from the target variable, and split the dataset between a training set and a validation set.

```python
data = pd.read_csv('https://lead-program-assets.s3.eu-west-3.amazonaws.com/M01-Distributed_machine_learning/datasets/creditcard.csv')

X = data.drop("Class", axis=1)
y = data["Class"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
```

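Fraud datasets are usually heavily imbalanced, so a purely random split can leave the validation set with a different fraud ratio than the training set. The sketch below shows one common remedy, a stratified split, on toy data; the `toy_` variables are hypothetical, and the 2% fraud rate is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 1,000 rows, 2% labelled as fraud (1).
toy_X = np.arange(1000).reshape(-1, 1)
toy_y = np.array([1] * 20 + [0] * 980)

# stratify=toy_y keeps the fraud ratio the same in both splits;
# random_state makes the split reproducible.
toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

print(f"fraud ratio (train): {toy_y_train.mean():.3f}")
print(f"fraud ratio (val):   {toy_y_val.mean():.3f}")
```

Applied to the exercise, this would amount to passing `stratify=y` (and optionally a `random_state`) to the existing `train_test_split` call.
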
3. Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) with two steps: a standardization, then a random forest classifier.

```python
model = Pipeline(
    steps=[
        ("standard_scaler", StandardScaler()),
        ("classifier", RandomForestClassifier()),
    ],
    verbose=True,
)
```

4. Train the model with `joblib` using `ray` as the parallelization backend.

```python
register_ray()

with joblib.parallel_backend('ray'):
    model.fit(X_train, y_train)
```

```text
2025-06-20 11:36:27,779 INFO ray_backend.py:74 -- Starting local ray cluster
[Pipeline] ... (step 1 of 2) Processing standard_scaler, total=   0.0s
2025-06-20 11:36:29,530 INFO worker.py:1908 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8266
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  28.4s
```

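Once the model is trained, accuracy alone says little on imbalanced data, because a model that misses frauds can still score highly. The sketch below computes precision and recall next to accuracy on a small hand-made example; `fraud_metrics` is a hypothetical helper, and it assumes `1` marks a fraudulent transaction.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score


def fraud_metrics(y_true, y_pred) -> dict:
    """Accuracy plus precision and recall for the positive (fraud) class."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }


# Hand-made labels: 3 real frauds, of which 2 are caught, plus 1 false alarm.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 1, 1, 0]
metrics = fraud_metrics(y_true, y_pred)
print(metrics)
```

With the pipeline from this exercise, the equivalent call would be `fraud_metrics(y_val, model.predict(X_val))`.
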

In a production environment, you would most likely submit your parallel jobs to a remote cluster. A remote cluster in the cloud is not cheap, so for testing purposes, let's start a Ray cluster on Kubernetes on our local machine, using minikube.

## Exercise - Part II - Train a model on a Ray Cluster

If the local Ray cluster is still running, stop it first:

```shell
ray stop
```

As a reminder, here are the commands to start your cluster on minikube (adjust the resource settings to suit your machine):

```shell
minikube start --cpus=5 --memory=7995
```

```shell
minikube dashboard
```

```shell
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
```

```shell
helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0
```

Create a file called `ray-cluster.yaml` with the following content:
```yaml
head:
  enableInTreeAutoscaling: true
  resources:
    limits:
      cpu: "2"
      # To avoid out-of-memory issues, never allocate less than 2G memory for the Ray head.
      memory: "3G"
    requests:
      cpu: "2"
      memory: "3G"


worker:
  replicas: 1
  resources:
    limits:
      cpu: "2"
      memory: "3G"
    requests:
      cpu: "2"
      memory: "3G"
```

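A quick sanity check that the pods requested in `ray-cluster.yaml` fit inside the minikube VM started earlier can be sketched as follows. The numbers are copied from the commands and YAML above; `fits` is a hypothetical helper, and the sketch deliberately ignores the operator pod, system overhead, and the exact difference between memory units.

```python
# Values copied from the `minikube start` command.
MINIKUBE_CPUS = 5
MINIKUBE_MEMORY_MB = 7995

# (cpu, memory in GB) requested by each pod in ray-cluster.yaml.
pods = {"head": (2, 3), "worker": (2, 3)}


def fits(pods: dict, cpus: int, memory_mb: int) -> bool:
    """Return True if the summed pod requests stay within the VM's resources."""
    total_cpu = sum(cpu for cpu, _ in pods.values())
    total_memory_mb = sum(mem_gb * 1000 for _, mem_gb in pods.values())
    return total_cpu <= cpus and total_memory_mb <= memory_mb


print(fits(pods, MINIKUBE_CPUS, MINIKUBE_MEMORY_MB))
```

If you change `replicas` or the per-pod resources, rerunning this check gives a first hint of whether pods may stay in the `Pending` state.
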
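Install the cluster with this configuration. The `2.41.0-aarch64` image tag targets ARM machines (for example Apple Silicon); on x86-64 machines, use the tag without the `-aarch64` suffix.

```shell
helm install raycluster kuberay/ray-cluster --version 1.3.0 --set 'image.tag=2.41.0-aarch64' -f ray-cluster.yaml
```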

```shell
kubectl port-forward --address 0.0.0.0 service/raycluster-kuberay-head-svc 8265:8265
```

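Job submission only works while the port-forward is active. A minimal way to check that something is listening on the forwarded port, using only the standard library, might look like this; `is_port_open` is a hypothetical helper, and a successful TCP connection does not prove that the Ray dashboard itself is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# 8265 is the port forwarded by the kubectl command.
print(is_port_open("127.0.0.1", 8265))
```
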
```shell
ray job submit --working-dir=. --runtime-env=runtime-env.json --address="http://127.0.0.1:8265" -- python ray_train.py
```

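The submit command expects a `runtime-env.json` file in the working directory, which this document does not show. The sketch below generates a plausible one; the dependency list is an assumption (adjust it to whatever `ray_train.py` actually imports), and `write_runtime_env` is a hypothetical helper.

```python
import json
from pathlib import Path


def write_runtime_env(path: str = "runtime-env.json") -> dict:
    """Write a Ray runtime environment file listing pip dependencies."""
    # Assumed dependencies: the packages imported earlier in this exercise.
    runtime_env = {"pip": ["scikit-learn", "pandas"]}
    Path(path).write_text(json.dumps(runtime_env, indent=2))
    return runtime_env
```

Calling `write_runtime_env()` next to `ray_train.py` creates the file that `--runtime-env=runtime-env.json` points to.
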
8. Now that the cluster is up and running, write a script that runs a hyperparameter tuning job, and submit it to the cluster using the Ray CLI.

```python
# See ray_train.py
```
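The contents of `ray_train.py` are not shown here. As a hedged sketch of what such a script could contain, the code below grid-searches the same scaler and random forest pipeline through a joblib backend. The `tune` helper, the parameter grid, and the scoring choice are all assumptions, and the smoke test uses synthetic data with joblib's built-in `threading` backend so that it runs without a cluster.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune(X, y, backend: str = "ray") -> GridSearchCV:
    """Grid-search a scaler + random forest pipeline on the given joblib backend."""
    if backend == "ray":
        # Imported lazily so the sketch also runs where Ray is not installed.
        from ray.util.joblib import register_ray
        register_ray()

    pipeline = Pipeline(steps=[
        ("standard_scaler", StandardScaler()),
        ("classifier", RandomForestClassifier(random_state=0)),
    ])
    # Hypothetical grid: placeholder values, not tuned recommendations.
    param_grid = {
        "classifier__n_estimators": [10, 30],
        "classifier__max_depth": [3, None],
    }
    search = GridSearchCV(pipeline, param_grid, scoring="average_precision", cv=3)
    with joblib.parallel_backend(backend):
        search.fit(X, y)
    return search


# Local smoke test on synthetic, imbalanced data (no Ray cluster needed).
X_toy, y_toy = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
search = tune(X_toy, y_toy, backend="threading")
print(search.best_params_, round(search.best_score_, 3))
```

In an actual `ray_train.py`, the synthetic data would be replaced by the credit card CSV and the call would become `tune(X_train, y_train)`, which uses the `ray` backend by default.
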