# Loading And Storing Data From And Into S3 With And Without Dask

This test generates a large random dataset, uploads it to S3, and then processes it with and without Dask. It then verifies that:

* The data was handled properly and results were equal.
* The stored dataset artifact in S3 is loadable and equal.
* The Dask run was faster (only observable on large data).

## General Configurations

In [1]:
import os
import shutil
import sys

sys.path.append(os.path.abspath("../"))

from utils import S3Client

# AWS Credentials:
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")
assert AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, "AWS credentials must be set in the environment."
os.environ["AWS_ACCESS_KEY_ID"] = AWS_ACCESS_KEY_ID
os.environ["AWS_SECRET_ACCESS_KEY"] = AWS_SECRET_ACCESS_KEY

# Path to store the generated data:
LOCAL_DATA_PATH = "./data"
S3_BUCKET = os.environ.get("S3_BUCKET", "testbucket-igz-temp")
S3_PROJECT_DIRECTORY = "test-dask-s3"
S3_DATA_PATH = f"{S3_PROJECT_DIRECTORY}/data"  # S3 keys always use forward slashes

# Number of samples of generated data (number of rows in the data table):
N_SAMPLES = 10_000

# Number of features of the generated data (number of columns in the data table):
N_FEATURES = 10

# Number of parquet partitions to split the generated data into:
N_PARTITIONS = 10

## 1. Generate Data

1. Generate random data.
2. Turn the data into a `pandas.DataFrame`, naming the columns `feature_{i}` and adding the partitioning column (`year`).
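As a rough illustration of what the partitioning column does: writing with `partition_cols=["year"]` produces one `year=<value>` subdirectory per distinct year, which is the layout visible in the upload log further down. The sketch below uses only the standard library to show how rows map to those directories. The `partition_directory` helper and the sample sizes are hypothetical, chosen for the example, and are not part of the test.

```python
import posixpath
import random
from collections import Counter


def partition_directory(root: str, column: str, value: int) -> str:
    # Hive-style layout: one "<column>=<value>" directory per distinct value.
    return posixpath.join(root, f"{column}={value}")


# Hypothetical small-scale stand-in for the generated table:
n_samples, n_partitions = 1_000, 10
rng = random.Random(0)
years = [rng.randrange(2000, 2000 + n_partitions) for _ in range(n_samples)]

# Count how many rows would land in each partition directory:
rows_per_directory = Counter(partition_directory("./data", "year", y) for y in years)
for directory in sorted(rows_per_directory):
    print(directory, rows_per_directory[directory])
```

Because the years are drawn uniformly, each directory receives roughly `n_samples / n_partitions` rows.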

In [2]:
import numpy as np
import pandas as pd


def generate_data(
    output_path: str,
    n_samples: int, 
    n_features: int, 
    n_partitions: int,
):
    # Generate data:
    data = np.random.random(size=(n_samples, n_features))
    
    # Create a dataframe:
    data = pd.DataFrame(
        data=data, 
        columns=[f"feature_{i}" for i in range(n_features)]
    )
    data["year"] = np.random.randint(2000, 2000 + n_partitions, size=n_samples)
    
    # Save to parquets:
    data.to_parquet(output_path, partition_cols=["year"])

Generate the data. This requires write permissions to the local directory and to the S3 bucket.

In [3]:
# Delete past generated data (in case there was a past failure):
if os.path.exists(LOCAL_DATA_PATH):
    shutil.rmtree(os.path.abspath(LOCAL_DATA_PATH))

# Generate new data:
generate_data(
    output_path=LOCAL_DATA_PATH,
    n_samples=N_SAMPLES, 
    n_features=N_FEATURES, 
    n_partitions=N_PARTITIONS,
)

In [4]:
# Create the S3 client:
s3_client = S3Client(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

# Delete the project and data in S3 (in case there was a past failure):
try:
    s3_client.delete(
        bucket=S3_BUCKET,
        s3_path=S3_PROJECT_DIRECTORY,
    )
except FileNotFoundError:
    pass

# Upload it to S3:
s3_client.upload(
    bucket=S3_BUCKET,
    local_path=LOCAL_DATA_PATH,
    s3_path=S3_DATA_PATH,
    replace=False,
)

Uploading:   0%|          | 0/10 [00:00<?, ?it/s]

Uploading './data/year=2000/c3d2bcb360f643fbb04d072c6116c477.parquet' to test-dask-s3/data/year=2000/c3d2bcb360f643fbb04d072c6116c477.parquet
Uploading './data/year=2001/b20e7a8fcb2849d98dca1770cbee75cc.parquet' to test-dask-s3/data/year=2001/b20e7a8fcb2849d98dca1770cbee75cc.parquet
Uploading './data/year=2002/99ebbe2e64fd4ba2b6f29ca6cb0117ed.parquet' to test-dask-s3/data/year=2002/99ebbe2e64fd4ba2b6f29ca6cb0117ed.parquet
Uploading './data/year=2003/fdab28f8921548a9aa25cea2b6b3599a.parquet' to test-dask-s3/data/year=2003/fdab28f8921548a9aa25cea2b6b3599a.parquet
Uploading './data/year=2004/539d16ae17904315a23f4794be3669a6.parquet' to test-dask-s3/data/year=2004/539d16ae17904315a23f4794be3669a6.parquet
Uploading './data/year=2005/f9f1ee96b1e04c78b6877bbd6f24f482.parquet' to test-dask-s3/data/year=2005/f9f1ee96b1e04c78b6877bbd6f24f482.parquet
Uploading './data/year=2006/7349725c0b114a6c8ba12b0c9399d52a.parquet' to test-dask-s3/data/year=2006/7349725c0b114a6c8ba12b0c9399d52a.parquet
Upload

In [5]:
# Delete new generated data (data will be loaded from S3):
shutil.rmtree(os.path.abspath(LOCAL_DATA_PATH))

## 2. Data Processing Code

1. Read the data into a pandas (or Dask) `DataFrame` using MLRun's `DataItem.as_df` method.
2. Do some calculations.

The calculations are accumulated into a single value that is logged as a result, alongside a single column of data (the means, in this case) that is stored in S3.
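To make the reduction concrete, here is a minimal plain-Python sketch of the same arithmetic on a tiny table: the per-column sums, means, and variances are each added up and then combined into one number. It assumes the sample variance (the pandas `var` default, `ddof=1`). The `columns` data and the `accumulate` helper are hypothetical and only mirror the structure of the handler.

```python
import statistics


def accumulate(columns: dict) -> float:
    # One sum, mean and sample variance per column, then everything added up.
    sum_value = sum(sum(values) for values in columns.values())
    mean_value = sum(statistics.mean(values) for values in columns.values())
    var_value = sum(statistics.variance(values) for values in columns.values())
    return sum_value + mean_value + var_value


# Hypothetical two-column table:
columns = {
    "feature_0": [1.0, 2.0, 3.0],
    "feature_1": [2.0, 4.0, 6.0],
}
result = accumulate(columns)
print(result)  # 18 (sums) + 6 (means) + 5 (variances) = 29.0
```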

In [6]:
# mlrun: start-code

In [7]:
import time
import pandas as pd
import dask
import dask.dataframe  # Imported explicitly so `dask.dataframe` is available below
import mlrun


@mlrun.handler(outputs=["time", "result", "means"])
def process_data(context: mlrun.MLClientCtx, data_path: mlrun.DataItem):
    # Start the timer:
    run_time = time.time()
    
    # Check for a dask client:
    dask_function = context.get_param("dask_function", None)
    dask_client = mlrun.import_function(dask_function).client if dask_function else None
    
    # Get the data:
    data = data_path.as_df(
        df_module=dask.dataframe if dask_client else pd,
        format="parquet"
    )
    
    # Do some random calculations:
    if dask_client:
        data = dask_client.persist(data)
    sum_value = data.sum()
    mean_value = data.mean()
    var_value = data.var()
    if dask_client:
        sum_value = dask.delayed(sum)(sum_value)
        mean_value = dask.delayed(sum)(mean_value)
        var_value = dask.delayed(sum)(var_value)
    else:
        sum_value = sum(sum_value)
        mean_value = sum(mean_value)
        var_value = sum(var_value)
    result = sum_value + mean_value + var_value
    means = data.mean()
    for column in data.columns:
        means = means + means * means
    if dask_client:
        result = result.compute()
        means = means.compute()
    
    # Stop the timer and return the values to be logged:
    run_time = time.time() - run_time
    return run_time, result, means

In [8]:
# mlrun: end-code

## 3. Create a Project

1. Create the MLRun project.
2. Create an MLRun function of the processing code.

In [9]:
import os
import shutil
import time

import mlrun

In [10]:
# Create the project:
project = mlrun.get_or_create_project(name=S3_PROJECT_DIRECTORY, context="./", user_project=False)

# Add the S3 credentials:
project.set_secrets(
    secrets={
        "AWS_ACCESS_KEY_ID": AWS_ACCESS_KEY_ID,
        "AWS_SECRET_ACCESS_KEY": AWS_SECRET_ACCESS_KEY,
    }
)

> 2022-12-23 15:11:19,767 [info] loaded project test-dask-s3 from MLRun DB


In [11]:
# Create the processing function:
process_data_function = project.set_function(name="process_data", kind="job", image="mlrun/mlrun", handler="process_data")

## 4. Run Without Dask

Run the processing without Dask and store the results.

In [12]:
# Run without dask:
without_dask_run = process_data_function.run(
    name="without_dask",
    inputs={
        "data_path": f"s3://{S3_BUCKET}/{S3_PROJECT_DIRECTORY}/data/",
    },
    artifact_path=f"s3://{S3_BUCKET}/{S3_PROJECT_DIRECTORY}",
)

# Store results:
without_dask_time = without_dask_run.status.results['time']
without_dask_result = without_dask_run.status.results['result']
without_dask_means = np.array(without_dask_run.artifact('means').as_df()["0"])

> 2022-12-23 15:11:26,731 [info] starting run without_dask uid=d46fc266409142bdbab51db7950e4538 DB=http://mlrun-api:8080
> 2022-12-23 15:11:26,939 [info] Job is running in the background, pod: without-dask-srhpm
> 2022-12-23 15:11:48,159 [info] To track results use the CLI: {'info_cmd': 'mlrun get run d46fc266409142bdbab51db7950e4538 -p test-dask-s3', 'logs_cmd': 'mlrun logs d46fc266409142bdbab51db7950e4538 -p test-dask-s3'}
> 2022-12-23 15:11:48,159 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.dev6.lab.iguazeng.com/mlprojects/test-dask-s3/jobs/monitor/d46fc266409142bdbab51db7950e4538/overview'}
> 2022-12-23 15:11:48,159 [info] run executed, status=completed
final state: completed


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| test-dask-s3 | ...950e4538 | 0 | Dec 23 15:11:31 | completed | without_dask | v3io_user=guyl; kind=job; owner=guyl; mlrun/client_version=1.2.1-rc4; host=without-dask-srhpm | data_path | dask_function=None | time=12.91029977798462; result=50062.27446583847 | means |





> 2022-12-23 15:11:56,617 [info] run executed, status=completed


## 5. Run With Dask

1. Create the Dask function.
2. Configure it.
3. Run the data processing with Dask and store the results.

In [13]:
# Create the dask function:
dask_function = mlrun.new_function(name="my_dask", kind="dask", image="mlrun/mlrun")

# Configure the dask function specs:
dask_function.spec.remote = True
dask_function.spec.replicas = 5
dask_function.spec.service_type = 'NodePort'
dask_function.with_limits(mem="6G")
dask_function.spec.nthreads = 5

# Assign the function to the project:
project.set_function(dask_function)

# Save:
dask_function.save()

'db://test-dask-s3/my-dask'

In [14]:
# Get the dask client:
dask_function.client

> 2022-12-23 15:12:09,450 [info] trying dask client at: tcp://mlrun-my-dask-bbd4cbcf-3.default-tenant:8786
> 2022-12-23 15:12:09,480 [info] using remote dask scheduler (mlrun-my-dask-bbd4cbcf-3) at: tcp://mlrun-my-dask-bbd4cbcf-3.default-tenant:8786


Mismatched versions found

+-------------+--------+-----------+---------+
| Package     | client | scheduler | workers |
+-------------+--------+-----------+---------+
| blosc       | 1.7.0  | None      | None    |
| cloudpickle | 2.0.0  | 2.2.0     | None    |
| lz4         | 3.1.0  | None      | None    |
| msgpack     | 1.0.3  | 1.0.4     | None    |
| toolz       | 0.11.2 | 0.12.0    | None    |
| tornado     | 6.1    | 6.2       | None    |
+-------------+--------+-----------+---------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6


Connection method: Direct
Dashboard: http://mlrun-my-dask-bbd4cbcf-3.default-tenant:8787/status

Comm: tcp://10.201.96.230:8786
Workers: 0
Dashboard: http://10.201.96.230:8787/status
Total threads: 0
Started: Just now
Total memory: 0 B


In [15]:
# Run with dask:
with_dask_run = process_data_function.run(
    name="with_dask",
    inputs={
        "data_path": f"s3://{S3_BUCKET}/{S3_PROJECT_DIRECTORY}/data/",
    },
    params={
        "dask_function": "db://" + dask_function.uri,
    },
    artifact_path=f"s3://{S3_BUCKET}/{S3_PROJECT_DIRECTORY}",
)

# Store results:
with_dask_time = with_dask_run.status.results['time']
with_dask_result = with_dask_run.status.results['result']
with_dask_means = np.array(with_dask_run.artifact('means').as_df()["0"])

> 2022-12-23 15:12:09,535 [info] starting run with_dask uid=d54bd286ea084c738263a8b0e32bb473 DB=http://mlrun-api:8080
> 2022-12-23 15:12:09,911 [info] Job is running in the background, pod: with-dask-b99wb
> 2022-12-23 15:12:16,631 [info] trying dask client at: tcp://mlrun-my-dask-bbd4cbcf-3.default-tenant:8786
> 2022-12-23 15:12:16,641 [info] using remote dask scheduler (mlrun-my-dask-bbd4cbcf-3) at: tcp://mlrun-my-dask-bbd4cbcf-3.default-tenant:8786
remote dashboard: default-tenant.app.dev6.lab.iguazeng.com:31651
> 2022-12-23 15:12:31,614 [info] To track results use the CLI: {'info_cmd': 'mlrun get run d54bd286ea084c738263a8b0e32bb473 -p test-dask-s3', 'logs_cmd': 'mlrun logs d54bd286ea084c738263a8b0e32bb473 -p test-dask-s3'}
> 2022-12-23 15:12:31,614 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.dev6.lab.iguazeng.com/mlprojects/test-dask-s3/jobs/monitor/d54bd286ea084c738263a8b0e32bb473/overview'}
> 2022-12-23 15:12:31,615 [info] run executed, status=comple

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| test-dask-s3 | ...e32bb473 | 0 | Dec 23 15:12:14 | completed | with_dask | v3io_user=guyl; kind=job; owner=guyl; mlrun/client_version=1.2.1-rc4; host=with-dask-b99wb | data_path | dask_function=db://test-dask-s3/my-dask | time=13.100464344024658; result=50062.27446583847 | means |





> 2022-12-23 15:12:36,147 [info] run executed, status=completed


## 6. Compare Runtimes

1. Delete the project and data from S3.
2. Print a summary message.
3. Verify that:
   * The Dask run took less time (only on stronger machines).
   * The result value is almost equal between the runs.
   * The means values are almost equal between the runs.
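For reference, "almost equal" here means equality within a tolerance. The sketch below re-implements the documented `numpy.isclose` criterion, `|a - b| <= atol + rtol * |b|` with the defaults `rtol=1e-05` and `atol=1e-08`, in plain Python. The `is_close` helper is hypothetical, written only for illustration, and ignores NaN and infinity handling. The perturbed values are likewise made up to show both outcomes.

```python
def is_close(a: float, b: float, rtol: float = 1e-05, atol: float = 1e-08) -> bool:
    # Same criterion numpy.isclose documents: |a - b| <= atol + rtol * |b|
    return abs(a - b) <= atol + rtol * abs(b)


reference = 50062.27446583847

# Hypothetical value differing only in the last digits (floating point noise):
print(is_close(reference, 50062.27446583912))  # True

# Hypothetical value off by a whole unit, well outside the tolerance:
print(is_close(reference, 50063.27446583847))  # False
```

The relative term dominates for values of this magnitude, so differences from a changed order of floating point summation pass while a genuinely different result does not.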

In [16]:
# Delete the project and data in S3:
s3_client.delete(
    bucket=S3_BUCKET,
    s3_path=S3_PROJECT_DIRECTORY,
)

Deleting:   0%|          | 0/12 [00:00<?, ?it/s]

Deleting 'test-dask-s3/data/year=2000/c3d2bcb360f643fbb04d072c6116c477.parquet'
Deleting 'test-dask-s3/data/year=2001/b20e7a8fcb2849d98dca1770cbee75cc.parquet'
Deleting 'test-dask-s3/data/year=2002/99ebbe2e64fd4ba2b6f29ca6cb0117ed.parquet'
Deleting 'test-dask-s3/data/year=2003/fdab28f8921548a9aa25cea2b6b3599a.parquet'
Deleting 'test-dask-s3/data/year=2004/539d16ae17904315a23f4794be3669a6.parquet'
Deleting 'test-dask-s3/data/year=2005/f9f1ee96b1e04c78b6877bbd6f24f482.parquet'
Deleting 'test-dask-s3/data/year=2006/7349725c0b114a6c8ba12b0c9399d52a.parquet'
Deleting 'test-dask-s3/data/year=2007/b1e1fd12bb6248c3b1ffa39af4e45237.parquet'
Deleting 'test-dask-s3/data/year=2008/29ff8d0aa94b40c28f3b61490d4a0526.parquet'
Deleting 'test-dask-s3/data/year=2009/0d4d67f1d4994a96ad8318263bef69c2.parquet'
Deleting 'test-dask-s3/with_dask/0/means.parquet'
Deleting 'test-dask-s3/without_dask/0/means.parquet'
Done!


In [17]:
# Print the test's collected results:
print(
    "Without dask:\n"
    f"\t{without_dask_time:.2f} Seconds\n"
    f"\tResult: {without_dask_result}"
)
print(
    "With dask:\n"
    f"\t{with_dask_time:.2f} Seconds\n"
    f"\tResult: {with_dask_result}\n"
)

# Verification:
# assert with_dask_time < without_dask_time  # Only possible to test on a stronger machine (requires big data)
assert np.isclose(without_dask_result, with_dask_result)
assert np.allclose(without_dask_means, with_dask_means)

Without dask:
	12.91 Seconds
	Result: 50062.27446583847
With dask:
	13.10 Seconds
	Result: 50062.27446583847

