
Machine Learning on Big Data with Dask

Hands-on workshop: data processing and model training at scale

In this hands-on workshop, you will be introduced to Dask, a Python-native parallel computing framework. Dask extends traditional Python tools such as pandas and scikit-learn to operate at scale across a cluster of machines, removing single-machine memory and compute limitations. The tutorial covers setting up a Dask cluster in Saturn Cloud, processing large datasets efficiently, and training machine learning models across the cluster.

After this workshop you will know:

  • What Dask is and how it fits in with the broader PyData ecosystem
  • When to use Dask to scale out machine learning workloads
  • How to use Dask DataFrames for loading and cleaning data
  • How to perform distributed model training with Dask

To get the full learning value from this workshop, attendees should have prior experience with machine learning in Python. Experience with parallel computing is not needed.
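
To give a sense of the API before diving in, here is a minimal sketch of the Dask DataFrame workflow the notebooks build on. The file path and column names are hypothetical, chosen only for illustration:

    import dask.dataframe as dd

    # Lazily read a directory of CSV files as one partitioned DataFrame
    # (the S3 path and columns here are made up for this example)
    df = dd.read_csv("s3://example-bucket/trips/*.csv")

    # Familiar pandas-style cleaning, applied per partition in parallel
    df = df[df["fare_amount"] > 0]
    df["tip_fraction"] = df["tip_amount"] / df["fare_amount"]

    # Nothing has executed yet; .compute() runs the task graph on the
    # cluster and returns an ordinary pandas object
    mean_tip = df.groupby("passenger_count")["tip_fraction"].mean().compute()
    print(mean_tip)

The key point is that the code mirrors pandas: Dask defers execution until .compute(), so the work can be scheduled across all the machines in the cluster.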

Getting started

Steps

  1. Create an account on Saturn Cloud Hosted or use your organization's existing Saturn Cloud Enterprise installation.
  2. Create a new project (keep defaults unless specified here)
    • Name: "workshop-dask-ml-big-data"
    • Image: saturncloud/saturn:2020.11.30
      (or latest available saturncloud/saturn:* image)
    • Workspace Settings
      • Size: XLarge - 4 cores - 32GB RAM
    • Start script:
    # upgrade to the latest xgboost release, which includes the
    # native xgboost.dask integration
    pip uninstall -y dask-xgboost xgboost || true
    # remove the conda-installed shared libraries so the pip wheel's
    # bundled libxgboost.so is the one that gets loaded
    rm -f /opt/conda/envs/saturn/lib/libxgboost.so
    rm -f /opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/lib/libxgboost.so
    pip install --upgrade 'xgboost>=1.3.0'
    • Click "Create"
  3. Attach a Dask Cluster to the project (see the connection sketch after these steps)
    • Worker Size: XLarge - 4 cores - 32GB RAM
    • Number of workers (n_workers): 5
    • Number of worker threads (nthreads): 4
    • Click "Create"
  4. Start both the Jupyter Server and Dask Cluster
  5. Open Jupyter Lab
  6. From Jupyter Lab, open a new Terminal window and clone the workshop-dask-ml-big-data repository:
    git clone https://github.com/saturncloud/workshop-dask-ml-big-data.git /tmp/workshop-dask-ml-big-data
    cp -r /tmp/workshop-dask-ml-big-data /home/jovyan/project
  7. Navigate to the "workshop-dask-ml-big-data" folder in the File browser and start from the 01-start.ipynb notebook.
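
Once the Jupyter server and Dask cluster are both running, you can sanity-check the setup from a notebook. The sketch below is illustrative rather than part of the workshop notebooks: it assumes the dask-saturn package that ships in the saturncloud/saturn image, and it fits a throwaway model on synthetic data via xgboost.dask, the native Dask integration included in the xgboost>=1.3.0 install from the start script:

    import xgboost as xgb
    import dask.array as da
    from dask.distributed import Client
    from dask_saturn import SaturnCluster

    # Attach to the Dask cluster defined in the Saturn project
    cluster = SaturnCluster()
    client = Client(cluster)
    client.wait_for_workers(n_workers=5)  # the 5 workers configured above

    # Synthetic data, purely to confirm distributed training works
    X = da.random.random((100_000, 10), chunks=(10_000, 10))
    y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

    # xgboost's Dask API shards training across the workers
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    result = xgb.dask.train(
        client,
        {"objective": "binary:logistic", "tree_method": "hist"},
        dtrain,
        num_boost_round=10,
    )
    print(result["booster"])

If this runs without errors, the cluster and the upgraded XGBoost are wired up correctly and you are ready to work through the notebooks.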

Screenshots

The project from the Saturn UI should look something like this:

[screenshot: project page in the Saturn Cloud UI]

JupyterLab should look like this:

[screenshot: JupyterLab with the workshop files]
