
Dask Parallel Computing

Parallel and distributed computing with Dask for scaling Pandas and NumPy operations to larger datasets and clusters.

Description

This project demonstrates Dask, a library for parallel computing in Python. It covers Dask arrays, DataFrames, delayed computations, distributed computing, and scaling workflows. Perfect for working with larger-than-memory datasets and parallel processing.

Features

Core Features

  • Parallel arrays and DataFrames
  • Delayed and bag computations
  • Distributed computing
  • Task scheduling
  • Memory-efficient operations

Advanced Features

  • Dask Bags for unstructured data (JSON, text, logs)
  • Advanced DataFrame operations (joins, window functions, time series)
  • Machine learning with parallel training
  • Performance profiling and optimization
  • Complex data transformations
  • Multi-file parallel processing
  • Time series resampling and rolling operations
  • Hyperparameter tuning with distributed computing
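The bag API for unstructured records (covered in the 06_dask_bags notebook) can be sketched as follows; the sample log records here are invented for illustration:

```python
import dask.bag as db

# A bag of JSON-like log records, split across two partitions
records = db.from_sequence(
    [{"level": "INFO", "msg": "start"},
     {"level": "ERROR", "msg": "disk full"},
     {"level": "INFO", "msg": "done"}],
    npartitions=2,
)

# Count records per log level in parallel
counts = dict(records.pluck("level").frequencies().compute())
print(counts)  # {'INFO': 2, 'ERROR': 1}
```

`pluck` extracts a field from each record and `frequencies` tallies distinct values, so the whole pipeline stays lazy until `compute()`.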

Technologies

  • Python
  • Dask
  • Pandas
  • NumPy
  • Jupyter Notebook

Difficulty Level

Intermediate

Installation

  1. Install the required packages:
pip install -r requirements.txt
  2. Launch Jupyter Notebook:
jupyter notebook
  3. Open the notebooks in the notebooks/ directory to explore the examples.

Project Structure

dask-parallel/
├── README.md
├── requirements.txt
├── .gitignore
├── notebooks/
│   ├── 01_dask_arrays.ipynb
│   ├── 02_dask_dataframes.ipynb
│   ├── 03_delayed_computations.ipynb
│   ├── 04_distributed_computing.ipynb
│   ├── 05_task_scheduling.ipynb
│   ├── 06_dask_bags.ipynb
│   ├── 07_advanced_dataframes.ipynb
│   └── 08_dask_ml.ipynb
├── scripts/
│   ├── parallel_processing.py
│   ├── memory_efficient_ops.py
│   ├── distributed_workflow.py
│   ├── performance_profiling.py
│   ├── advanced_data_processing.py
│   └── generate_advanced_data.py
└── data/
    └── (generated data files)

Usage Examples

Dask Arrays

import dask.array as da

# Create a large array
x = da.random.random((10000, 10000), chunks=(1000, 1000))
result = (x + 1).sum()
print(result.compute())

Dask DataFrames

import dask.dataframe as dd

# Read large CSV file
df = dd.read_csv('data/large_file.csv')
result = df.groupby('column').sum().compute()

Delayed Computations

from dask import delayed

@delayed
def process_data(x):
    return x * 2

results = [process_data(i) for i in range(10)]
final = sum(results)
print(final.compute())
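By default the delayed graph above runs on Dask's local threaded scheduler. To route the same work through the distributed scheduler instead (the subject of the 04_distributed_computing notebook), a minimal in-process sketch:

```python
from dask import delayed
from dask.distributed import Client

# In-process scheduler and workers; on a real cluster you would
# pass an address instead, e.g. Client("tcp://scheduler:8786")
client = Client(processes=False)

@delayed
def double(x):
    return x * 2

total = sum(double(i) for i in range(10))
print(total.compute())  # 90 -- the Client is now the default scheduler

client.close()
```

Once a `Client` exists, every `compute()` call is dispatched to its workers, and the dashboard it starts can be used for the profiling mentioned under Advanced Features.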

Advanced: Time Series Processing

import dask.dataframe as dd

# Read and resample time series data
df = dd.read_csv('data/timeseries_data.csv', parse_dates=['timestamp'])
df = df.set_index('timestamp')
daily = df.resample('1D').agg({'value': 'mean'}).compute()

Advanced: Machine Learning

from dask import delayed, compute
from sklearn.ensemble import RandomForestClassifier

@delayed
def train_model(X, y):
    model = RandomForestClassifier()
    model.fit(X, y)
    return model

# Train multiple models in parallel (X, y are pre-loaded features and labels)
models = [train_model(X, y) for _ in range(5)]
trained_models = compute(*models)
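The same pattern extends to the hyperparameter tuning listed under Advanced Features (see 08_dask_ml): vary a parameter per delayed task and compute all the scores together. The dataset here is synthetic:

```python
from dask import delayed, compute
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data for illustration
X, y = make_classification(n_samples=200, random_state=0)

@delayed
def score_for(n_estimators):
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    return model.fit(X, y).score(X, y)

# Each candidate value trains in its own parallel task
scores = compute(*[score_for(n) for n in (10, 50, 100)])
print(dict(zip((10, 50, 100), scores)))
```

For larger searches, `dask-ml` offers drop-in `GridSearchCV`-style estimators, but the delayed pattern above needs nothing beyond Dask and scikit-learn.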

Generating Advanced Data

To generate advanced sample datasets for testing:

python scripts/generate_advanced_data.py

This will create:

  • Large time series datasets
  • Transaction data
  • Machine learning datasets
  • JSON/nested data
  • Multiple batch files for parallel processing
  • Network/graph data

License

This project is provided for educational purposes only.

Contact

For questions or support, visit rskworld.in.
