In this hands-on workshop, you will be introduced to Dask, a Python-native parallel computing framework. Dask extends familiar Python tools such as pandas and NumPy to operate at scale across a cluster of machines, easing the memory and compute limitations of a single machine. The tutorial covers setting up a Dask cluster in Saturn Cloud, processing large datasets efficiently, and training machine learning models across the cluster.
After this workshop you will know:
- What Dask is and how it fits in with the broader PyData ecosystem
- When to use Dask to scale out machine learning workloads
- How to use Dask DataFrames for loading and cleaning data
- How to perform distributed model training with Dask
To get the full learning value from this workshop, attendees should have prior experience with machine learning in Python. Experience with parallel computing is not needed.
- Create an account on Saturn Cloud Hosted or use your organization's existing Saturn Cloud Enterprise installation.
- Create a new project (keep defaults unless specified here)
  - Name: "workshop-dask-ml-big-data"
  - Image: `saturncloud/saturn:2020.11.30` (or the latest available `saturncloud/saturn:*` image)
  - Workspace Settings
    - Size: `XLarge - 4 cores - 32GB RAM`
  - Start script:

    ```sh
    # this is to utilize the latest release of xgboost
    pip uninstall -y dask-xgboost xgboost || true
    rm -f /opt/conda/envs/saturn/lib/libxgboost.so
    rm -f /opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/lib/libxgboost.so
    pip install --upgrade 'xgboost>=1.3.0'
    ```
- Click "Create"
- Attach a Dask Cluster to the project
  - Worker Size: `XLarge - 4 cores - 32GB RAM`
  - Number of workers (n_workers): 5
  - Number of worker threads (nthreads): 4
- Click "Create"
- Start both the Jupyter Server and Dask Cluster
- Open Jupyter Lab
- From Jupyter Lab, open a new Terminal window and clone the workshop-dask-ml-big-data repository:

  ```sh
  git clone https://github.com/saturncloud/workshop-dask-ml-big-data.git /tmp/workshop-dask-ml-big-data
  cp -r /tmp/workshop-dask-ml-big-data /home/jovyan/project
  ```
- Navigate to the "workshop-dask-ml-big-data" folder in the File browser and start from the 01-start.ipynb notebook.
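With everything running, the notebooks connect to the Dask cluster through a `Client`. On Saturn Cloud this is done with the `dask-saturn` package's `SaturnCluster`; the sketch below substitutes a `LocalCluster` so it runs anywhere (the worker counts here are illustrative, not the workshop's exact code):

```python
from dask.distributed import Client, LocalCluster

# Stand-in for the hosted cluster; on Saturn Cloud you would instead do:
#   from dask_saturn import SaturnCluster
#   cluster = SaturnCluster()
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

# Confirm the scheduler sees the workers before submitting any work.
n_workers = len(client.scheduler_info()["workers"])
print(n_workers)

client.close()
cluster.close()
```

Once the `Client` is created, any Dask collection (DataFrames, arrays, futures) computed in that session is executed on the cluster's workers rather than in the notebook process.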
The project from the Saturn UI should look something like this:
JupyterLab should look like this:
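The distributed-training notebooks follow a common pattern: submit many independent training tasks to the cluster and gather the results. A dependency-free sketch of that pattern, using a hypothetical `train_model` function in place of a real fit call (the function name, hyperparameter values, and scoring are assumptions for illustration):

```python
from dask.distributed import Client, LocalCluster

def train_model(learning_rate):
    # Hypothetical stand-in for fitting a real model; returns a score so
    # the client can compare hyperparameter settings after gathering.
    return {"learning_rate": learning_rate, "score": 1.0 - abs(learning_rate - 0.1)}

# Local stand-in for the Saturn Cloud cluster attached above.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

# One task per hyperparameter; tasks run concurrently on the workers,
# and client.gather() blocks until every future has finished.
futures = [client.submit(train_model, lr) for lr in (0.01, 0.1, 0.5)]
results = client.gather(futures)
best = max(results, key=lambda r: r["score"])
print(best["learning_rate"])

client.close()
cluster.close()
```

The workshop itself uses distributed XGBoost (hence the `xgboost>=1.3.0` start script), but the submit/gather pattern above is the general shape of scaling out model training with Dask.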