# Training an XGBoost Classifier With and Without Dask

This test generates a large binary classification dataset and then trains a model on it using XGBoost, once without Dask and once with Dask. It then verifies that:

  * Running with Dask did not degrade the accuracy.
  * The Dask run was faster (only achievable on big data).

## General Configurations

In [1]:
# Image versions: scikit-learn~=1.0 xgboost~=1.1
# The test installs the latest versions so that it always runs against the newest releases.
!pip install plotly scikit-learn xgboost



In [2]:
# Path to store the generated data:
DATA_PATH = "./data"

# Number of samples of generated data (number of rows in the data table):
N_SAMPLES = 1000

# Number of features of the generated data (number of columns in the data table):
N_FEATURES = 10

# The fraction of the data to hold out as the testing set:
TRAIN_TEST_SPLIT = 0.33

# The number of parquet partitions to split the generated data into:
N_PARTITIONS = 10

## 1. Generate Data

1. Generate a binary classification dataset.
2. Turn the data into a `pandas.DataFrame`, naming the columns `feature-{i}` and adding the partitioning column (`year`).
3. Split the data into training and testing sets.

In [3]:
import os
import shutil

import numpy as np
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def generate_data(
    output_path: str,
    n_samples: int, 
    n_features: int, 
    test_size: float, 
    n_partitions: int,
):
    # Generate data:
    x, y = make_classification(
        n_samples=n_samples, 
        n_features=n_features, 
        n_informative=int(n_features / 4) + 1, 
        n_classes=2, 
        n_redundant=int(n_features / 10) + 1,
    )
    
    # Create a dataframe:
    data = pd.DataFrame(
        data=x, 
        columns=[f"feature-{i}" for i in range(n_features)]
    )
    data["year"] = np.random.randint(2000, 2000 + n_partitions, size=n_samples)
    data["label"] = y
    
    # Split into train and test sets:
    train_set, test_set = train_test_split(data, test_size=test_size)
    
    # Save to parquets:
    train_set.to_parquet(f"{output_path}/train", partition_cols=["year"])
    test_set.to_parquet(f"{output_path}/test", partition_cols=["year"])

Generate the data (requires write permission to the local directory).

In [4]:
# Delete past generated data (in case there was a failure):
if os.path.exists(DATA_PATH):
    shutil.rmtree(os.path.abspath(DATA_PATH))

# Generate new data:
generate_data(
    output_path=DATA_PATH,
    n_samples=N_SAMPLES, 
    n_features=N_FEATURES, 
    test_size=TRAIN_TEST_SPLIT,
    n_partitions=N_PARTITIONS,
)
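
With the configuration above, the values that `generate_data` derives from its arguments can be worked out by hand. The sketch below uses only the standard library and does not call scikit-learn. The helper name `describe_generated_data` is hypothetical, and the test-set size assumes that `train_test_split` rounds a fractional `test_size` up to whole rows.

```python
import math


def describe_generated_data(n_samples, n_features, test_size, n_partitions):
    """Work out the sizes that generate_data derives from its arguments."""
    n_informative = int(n_features / 4) + 1
    n_redundant = int(n_features / 10) + 1
    # Assumption: a fractional test_size is rounded up to whole rows.
    n_test = math.ceil(test_size * n_samples)
    return {
        "n_informative": n_informative,
        "n_redundant": n_redundant,
        "n_other": n_features - n_informative - n_redundant,
        "n_train": n_samples - n_test,
        "n_test": n_test,
        # Hive-style directory names, one per possible value of the `year` column.
        "partition_dirs": [f"year={2000 + i}" for i in range(n_partitions)],
    }


summary = describe_generated_data(
    n_samples=1000, n_features=10, test_size=0.33, n_partitions=10
)
print(summary["n_informative"], summary["n_redundant"], summary["n_other"])
print(summary["n_train"], summary["n_test"])
print(summary["partition_dirs"][0], summary["partition_dirs"][-1])
```

Because `year` is drawn at random, a run may in principle produce fewer directories than `N_PARTITIONS`, so the list is an upper bound. The partition column is restored as a regular column when the dataset is read back, which is why the training code drops `year` before fitting.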

## 2. Training Code

1. Read the data into a pandas (or Dask) `DataFrame`.
2. Split the data into `x` and `y`.
3. Initialize an `XGBClassifier` (or `DaskXGBClassifier`) model.
4. Run training on the training set.
5. Run evaluation on the testing set.

The run time and the accuracy score are logged as results.

In [5]:
# mlrun: start-code

In [6]:
import time
import pandas as pd
import dask.dataframe  # import the submodule explicitly so that dask.dataframe.read_parquet resolves
import xgboost as xgb
from sklearn.metrics import accuracy_score

import mlrun


@mlrun.handler(outputs=["time", "accuracy_score"])
def train(context: mlrun.MLClientCtx, data_path: str):
    # Start the timer:
    run_time = time.time()
    
    # Check for a dask client:
    dask_function = context.get_param("dask_function", None)
    dask_client = mlrun.import_function(dask_function).client if dask_function else None
    
    # Get the data:
    read_parquet_function = dask.dataframe.read_parquet if dask_client else pd.read_parquet
    train_set = read_parquet_function(f"{data_path}/train")
    train_set = train_set.drop('year', axis=1)
    if dask_client:
        train_set = dask_client.persist(train_set)
    test_set = read_parquet_function(f"{data_path}/test")
    test_set = test_set.drop('year', axis=1)
    if dask_client:
        test_set = dask_client.persist(test_set)
    
    
    # Split into x and y:
    y_train = train_set.label
    x_train = train_set.drop(columns=["label"])
    y_test = test_set.label
    x_test = test_set.drop(columns=["label"])
    
    # Initialize a model:
    if dask_client:
        model = xgb.dask.DaskXGBClassifier()
        model.client = dask_client
    else:
        model = xgb.XGBClassifier()
    
    # Train:
    model.fit(x_train, y_train)
    
    # Predict:
    y_pred = model.predict(x_test)
    
    # Evaluate:
    if dask_client:
        y_test = y_test.compute()
        y_pred = y_pred.compute()
    run_time = time.time() - run_time
    return run_time, accuracy_score(y_test, y_pred)



In [7]:
# mlrun: end-code
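
The handler above measures its run time with `time.time()`. For elapsed-time measurements, `time.perf_counter()` is generally the more suitable clock, since it is monotonic and has a higher resolution. The following is a minimal sketch of the same start/stop pattern wrapped in a helper. The name `timed` is hypothetical and is not part of MLRun.

```python
import time


def timed(function, *args, **kwargs):
    """Call `function` and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = function(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return elapsed, result


# Example: time a small computation.
elapsed, total = timed(sum, range(1_000_000))
print(f"{elapsed:.4f} seconds, result={total}")
```

Measuring the whole handler, as the test does, includes data loading and prediction as well as training, so the reported time reflects the end-to-end cost of each mode.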

## 3. Create a Project

1. Create the MLRun project.
2. Create an MLRun function of the training code.

In [8]:
import os
import shutil
import time

import mlrun

In [9]:
# Create the project:
project = mlrun.get_or_create_project(name="dask-xgboost-test", context="./", user_project=True)

> 2023-02-12 22:37:36,430 [info] loaded project dask-xgboost-test from MLRun DB


In [10]:
# Create the training function:
train_function = project.set_function(name="train", kind="job", image="mlrun/mlrun", handler="train")
train_function.apply(mlrun.auto_mount())

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f04fc91ef90>

## 4. Run Without Dask

Run the training without Dask and store the results.

In [11]:
# Run without Dask:
without_dask_run = train_function.run(
    name="without-dask",
    params={
        "data_path": os.path.abspath(DATA_PATH),
    },
)

# Store results:
without_dask_time = without_dask_run.status.results['time']
without_dask_score = without_dask_run.status.results['accuracy_score']

> 2023-02-12 22:37:49,002 [info] starting run without-dask uid=7e500ecb1d854388ad85c59bd1d293ca DB=http://mlrun-api:8080
> 2023-02-12 22:37:49,221 [info] Job is running in the background, pod: without-dask-6t78p
> 2023-02-12 22:37:58,508 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 7e500ecb1d854388ad85c59bd1d293ca -p dask-xgboost-test-guyl', 'logs_cmd': 'mlrun logs 7e500ecb1d854388ad85c59bd1d293ca -p dask-xgboost-test-guyl'}
> 2023-02-12 22:37:58,508 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.cto-office.iguazio-cd1.com/mlprojects/dask-xgboost-test-guyl/jobs/monitor/7e500ecb1d854388ad85c59bd1d293ca/overview'}
> 2023-02-12 22:37:58,508 [info] run executed, status=completed
final state: completed


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| dask-xgboost-test-guyl | ...d1d293ca | 0 | Feb 12 22:37:56 | completed | without-dask | v3io_user=guyl, kind=job, owner=guyl, mlrun/client_version=0.0.0+unstable, mlrun/client_python_version=3.7.6, host=without-dask-6t78p | | data_path=/User/demos/notebooks/data | time=0.3261575698852539, accuracy_score=0.8818181818181818 | |





> 2023-02-12 22:38:01,576 [info] run executed, status=completed


## 5. Run With Dask

1. Create the Dask function.
2. Configure it.
3. Run the training with Dask and store the results.

In [12]:
# Create the dask function:
dask_function = mlrun.new_function(name="my-dask", kind="dask", image="mlrun/mlrun")

# Configure the dask function specs:
dask_function.spec.remote = True
dask_function.spec.replicas = 5
dask_function.spec.service_type = 'NodePort'
dask_function.with_worker_limits(mem="6G")
dask_function.spec.nthreads = 5
dask_function.apply(mlrun.auto_mount())

# Assign the function to the project:
project.set_function(dask_function)

# Save:
dask_function.save()

'db://dask-xgboost-test-guyl/my-dask'
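
As a rough sanity check, the cluster capacity requested above can be tallied by hand. This sketch assumes that `nthreads` and the memory limit apply per worker and that `replicas` is the number of workers. The `cluster_capacity` helper is hypothetical and does not query MLRun or Dask.

```python
def cluster_capacity(replicas, nthreads_per_worker, mem_limit_gb_per_worker):
    """Total threads and total memory limit across all Dask workers."""
    return {
        "total_threads": replicas * nthreads_per_worker,
        "total_memory_gb": replicas * mem_limit_gb_per_worker,
    }


# Values taken from the function spec configured above.
capacity = cluster_capacity(
    replicas=5, nthreads_per_worker=5, mem_limit_gb_per_worker=6
)
print(capacity)
```

Under these assumptions the cluster is far larger than a 1000-row dataset needs, so scheduling and communication overhead is likely to dominate the run time at this data size.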

In [13]:
# Get the dask client:
dask_function.client

> 2023-02-12 22:38:16,863 [info] trying dask client at: tcp://mlrun-my-dask-3fd5be9a-2.default-tenant:8786
> 2023-02-12 22:38:16,892 [info] using remote dask scheduler (mlrun-my-dask-3fd5be9a-2) at: tcp://mlrun-my-dask-3fd5be9a-2.default-tenant:8786


Mismatched versions found

+-------------+--------+-----------+---------+
| Package     | client | scheduler | workers |
+-------------+--------+-----------+---------+
| blosc       | 1.7.0  | 1.10.6    | None    |
| cloudpickle | 2.0.0  | 2.2.1     | None    |
| lz4         | 3.1.0  | 3.1.10    | None    |
| msgpack     | 1.0.3  | 1.0.4     | None    |
| toolz       | 0.11.2 | 0.12.0    | None    |
| tornado     | 6.1    | 6.2       | None    |
+-------------+--------+-----------+---------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6


Connection method: Direct
Dashboard: http://mlrun-my-dask-3fd5be9a-2.default-tenant:8787/status

Comm: tcp://192.168.210.147:8786
Workers: 0
Dashboard: http://192.168.210.147:8787/status
Total threads: 0
Started: Just now
Total memory: 0 B
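
The mismatch table above is what Dask prints when the client and scheduler environments differ. The sketch below reproduces that comparison on the versions from the table using plain dictionaries. `find_mismatches` is a hypothetical helper, not a Dask API, and `example-pkg` is a made-up entry added only to show that equal versions are not reported.

```python
def find_mismatches(client_versions, scheduler_versions):
    """Return {package: (client, scheduler)} for packages whose versions differ."""
    return {
        package: (client_versions[package], scheduler_versions.get(package))
        for package in sorted(client_versions)
        if client_versions[package] != scheduler_versions.get(package)
    }


# Versions copied from the table above, plus the made-up matching package.
client = {
    "blosc": "1.7.0", "cloudpickle": "2.0.0", "lz4": "3.1.0",
    "msgpack": "1.0.3", "toolz": "0.11.2", "tornado": "6.1",
    "example-pkg": "1.0",
}
scheduler = {
    "blosc": "1.10.6", "cloudpickle": "2.2.1", "lz4": "3.1.10",
    "msgpack": "1.0.4", "toolz": "0.12.0", "tornado": "6.2",
    "example-pkg": "1.0",
}

mismatches = find_mismatches(client, scheduler)
for package, (client_version, scheduler_version) in mismatches.items():
    print(f"{package}: client={client_version} scheduler={scheduler_version}")
```

Mismatches in serialization-related packages such as `cloudpickle` can cause errors when objects are sent between the client and the cluster, so aligning the client and cluster images is usually worthwhile.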


In [14]:
# Run with Dask:
with_dask_run = train_function.run(
    name="with-dask",
    params={
        "data_path": os.path.abspath(DATA_PATH),
        "dask_function": "db://" + dask_function.uri,
    },
)

# Store results:
with_dask_time = with_dask_run.status.results['time']
with_dask_score = with_dask_run.status.results['accuracy_score']

> 2023-02-12 22:38:16,979 [info] starting run with-dask uid=bc295600cb734d71a69d406f87efb8e9 DB=http://mlrun-api:8080
> 2023-02-12 22:38:17,164 [info] Job is running in the background, pod: with-dask-fjrj9
> 2023-02-12 22:38:26,338 [info] trying dask client at: tcp://mlrun-my-dask-3fd5be9a-2.default-tenant:8786
> 2023-02-12 22:38:26,374 [info] using remote dask scheduler (mlrun-my-dask-3fd5be9a-2) at: tcp://mlrun-my-dask-3fd5be9a-2.default-tenant:8786
remote dashboard: default-tenant.app.cto-office.iguazio-cd1.com:32653
> 2023-02-12 22:38:29,248 [info] To track results use the CLI: {'info_cmd': 'mlrun get run bc295600cb734d71a69d406f87efb8e9 -p dask-xgboost-test-guyl', 'logs_cmd': 'mlrun logs bc295600cb734d71a69d406f87efb8e9 -p dask-xgboost-test-guyl'}
> 2023-02-12 22:38:29,248 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.cto-office.iguazio-cd1.com/mlprojects/dask-xgboost-test-guyl/jobs/monitor/bc295600cb734d71a69d406f87efb8e9/overview'}
> 2023-02-12 22:38:2

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| dask-xgboost-test-guyl | ...87efb8e9 | 0 | Feb 12 22:38:24 | completed | with-dask | v3io_user=guyl, kind=job, owner=guyl, mlrun/client_version=0.0.0+unstable, mlrun/client_python_version=3.7.6, host=with-dask-fjrj9 | | data_path=/User/demos/notebooks/data, dask_function=db://dask-xgboost-test-guyl/my-dask | time=2.873952627182007, accuracy_score=0.8878787878787879 | |





> 2023-02-12 22:38:29,539 [info] run executed, status=completed


## 6. Compare Runtimes

1. Print a summary message.
2. Verify that:
  * The Dask run took less time (only on stronger machines).
  * The accuracy scores of the two runs are almost equal.
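
The two checks can be sketched as a small helper that reports the outcome instead of asserting. This is a stdlib-only illustration: `compare_runs` is a hypothetical name, `math.isclose` with `abs_tol=0.1` stands in for an absolute-tolerance comparison, and the sample values are the ones reported by the two runs earlier in this notebook.

```python
import math


def compare_runs(without_time, with_time, without_score, with_score, atol=0.1):
    """Summarize whether Dask was faster and whether accuracy was preserved."""
    return {
        "dask_faster": with_time < without_time,
        "time_ratio": with_time / without_time,
        "scores_close": math.isclose(
            without_score, with_score, rel_tol=0.0, abs_tol=atol
        ),
    }


report = compare_runs(
    without_time=0.3261575698852539,
    with_time=2.873952627182007,
    without_score=0.8818181818181818,
    with_score=0.8878787878787879,
)
print(report["dask_faster"], report["scores_close"], round(report["time_ratio"], 1))
```

On this 1000-row dataset the accuracy check passes while the speed check does not, which matches the caveat that a speedup is only expected on big data.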

In [15]:
# Delete the generated data:
shutil.rmtree(os.path.abspath(DATA_PATH))

# Delete the MLRun project:
mlrun.get_run_db().delete_project(name=project.name, deletion_strategy="cascading")

In [16]:
# Print the test's collected results:
print(
    "Without dask:\n"
    f"\t{without_dask_time:.2f} Seconds\n"
    f"\tAccuracy: {without_dask_score}"
)
print(
    "With dask:\n"
    f"\t{with_dask_time:.2f} Seconds\n"
    f"\tAccuracy: {with_dask_score}\n"
)

# Verification (only possible on a stronger machine, as the test requires big data):
# assert with_dask_time < without_dask_time
# assert np.isclose(without_dask_score, with_dask_score, atol=0.1)

Without dask:
	0.33 Seconds
	Accuracy: 0.8818181818181818
With dask:
	2.87 Seconds
	Accuracy: 0.8878787878787879

