Parallel and distributed computing with Dask for scaling Pandas and NumPy operations to larger datasets and clusters.
This project demonstrates Dask, a Python library for parallel and distributed computing. It covers Dask arrays, DataFrames, delayed computations, distributed execution, and scaling workflows, making it a good fit for larger-than-memory datasets and parallel processing.
- Parallel arrays and DataFrames
- Delayed and bag computations
- Distributed computing
- Task scheduling
- Memory-efficient operations
- Dask Bags for unstructured data (JSON, text, logs; a short sketch follows this list)
- Advanced DataFrame operations (joins, window functions, time series)
- Machine learning with parallel training
- Performance profiling and optimization
- Complex data transformations
- Multi-file parallel processing
- Time series resampling and rolling operations
- Hyperparameter tuning with distributed computing
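As a taste of the Bag API mentioned above, here is a minimal sketch of parsing newline-delimited JSON logs in parallel; the `data/logs/*.json` pattern and the `status` field are hypothetical placeholders, not files shipped with this project:

```python
import json
import dask.bag as db

# Hypothetical input: newline-delimited JSON log files under data/
lines = db.read_text('data/logs/*.json')
records = lines.map(json.loads)

# Count how often each status value appears, in parallel per partition
counts = records.pluck('status').frequencies()
print(dict(counts.compute()))
```

Bags trade the columnar layout of DataFrames for flexibility, which suits messy or nested records.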
Built with:

- Python
- Dask
- Pandas
- NumPy
- Jupyter Notebook
Difficulty: Intermediate
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Launch Jupyter Notebook:

  ```bash
  jupyter notebook
  ```

- Open the notebooks in the `notebooks/` directory to explore the examples.
```
dask-parallel/
├── README.md
├── requirements.txt
├── .gitignore
├── notebooks/
│   ├── 01_dask_arrays.ipynb
│   ├── 02_dask_dataframes.ipynb
│   ├── 03_delayed_computations.ipynb
│   ├── 04_distributed_computing.ipynb
│   ├── 05_task_scheduling.ipynb
│   ├── 06_dask_bags.ipynb
│   ├── 07_advanced_dataframes.ipynb
│   └── 08_dask_ml.ipynb
├── scripts/
│   ├── parallel_processing.py
│   ├── memory_efficient_ops.py
│   ├── distributed_workflow.py
│   ├── performance_profiling.py
│   ├── advanced_data_processing.py
│   └── generate_advanced_data.py
└── data/
    └── (generated data files)
```
```python
import dask.array as da

# Create a large array split into 1000x1000 chunks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
result = (x + 1).sum()
print(result.compute())
```
```python
import dask.dataframe as dd

# Read a large CSV file lazily
df = dd.read_csv('data/large_file.csv')
result = df.groupby('column').sum().compute()
```
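The advanced DataFrame notebook also covers joins and window functions. As a rough sketch of a join (the file names and columns here are made up for illustration), Dask mirrors the pandas `merge` API:

```python
import dask.dataframe as dd

# Hypothetical tables sharing a 'customer_id' key
orders = dd.read_csv('data/orders.csv')
customers = dd.read_csv('data/customers.csv')

# merge() follows the pandas API; Dask shuffles partitions on the join key
joined = orders.merge(customers, on='customer_id', how='left')
totals = joined.groupby('region')['amount'].sum().compute()
```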
```python
from dask import delayed

@delayed
def process_data(x):
    return x * 2

# Each call returns a lazy Delayed object; summing them builds one task graph
results = [process_data(i) for i in range(10)]
final = sum(results)
print(final.compute())
```
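For the task scheduling and profiling features listed earlier, here is a minimal sketch using Dask's built-in local diagnostics; the tiny delayed pipeline is only a stand-in for real work:

```python
import dask
from dask.diagnostics import ProgressBar, Profiler

# A small delayed pipeline, used here only as profiling fodder
tasks = [dask.delayed(lambda v: v * 2)(i) for i in range(10)]
total = dask.delayed(sum)(tasks)

# ProgressBar shows live task progress on the local scheduler
with ProgressBar():
    print(total.compute())

# Profiler records per-task timing for later inspection
with Profiler() as prof:
    total.compute()
print(prof.results[:3])

# The local scheduler can also be picked per call
total.compute(scheduler='threads')
```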
```python
import dask.dataframe as dd

# Read and resample time series data
df = dd.read_csv('data/timeseries_data.csv', parse_dates=['timestamp'])
df = df.set_index('timestamp')
daily = df.resample('1D').agg({'value': 'mean'}).compute()
```
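Rolling operations from the feature list follow the same pattern once the index is a sorted timestamp. A sketch on the same file:

```python
import dask.dataframe as dd

# Same time series file as above, indexed by timestamp
df = dd.read_csv('data/timeseries_data.csv', parse_dates=['timestamp'])
df = df.set_index('timestamp')

# 24-point rolling mean; Dask exchanges boundary rows between partitions
smoothed = df['value'].rolling(window=24).mean().compute()
```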
```python
from dask import delayed, compute
from sklearn.ensemble import RandomForestClassifier

@delayed
def train_model(X, y):
    model = RandomForestClassifier()
    model.fit(X, y)
    return model

# Train multiple models in parallel
models = [train_model(X, y) for _ in range(5)]
trained_models = compute(*models)
```
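The models above run on the default local scheduler. To combine the distributed scheduler with the hyperparameter tuning listed in the features, a rough sketch is shown below; the dataset and parameter grid are placeholders for illustration:

```python
from dask import delayed, compute
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Local cluster for illustration; Client also accepts a remote scheduler address.
# In a standalone script, create the Client under `if __name__ == '__main__':`.
client = Client(n_workers=4)

X, y = make_classification(n_samples=1000, n_features=20)

@delayed
def score_params(n_estimators, max_depth):
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    return (n_estimators, max_depth, cross_val_score(model, X, y, cv=3).mean())

# Evaluate a small grid in parallel; once a Client exists, delayed
# computations run on the distributed scheduler by default
grid = [score_params(n, d) for n in (50, 100) for d in (5, 10, None)]
results = compute(*grid)
print('best:', max(results, key=lambda r: r[2]))

client.close()
```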
To generate advanced sample datasets for testing:

```bash
python scripts/generate_advanced_data.py
```

This will create:
- Large time series datasets
- Transaction data
- Machine learning datasets
- JSON/nested data
- Multiple batch files for parallel processing (read back in the sketch below)
- Network/graph data
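To read the generated batch files back in as one parallel DataFrame, a glob pattern works; the `data/batch_*.csv` pattern below is an assumption about the generator's output, so adjust it to whatever files the script actually writes:

```python
import dask.dataframe as dd

# One logical DataFrame over many files; each file becomes one or more partitions
df = dd.read_csv('data/batch_*.csv')
print(df.npartitions)
print(df.describe().compute())  # summary statistics computed in parallel
```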
This project is provided for educational purposes only.
For questions or support, visit rskworld.in or contact:
- Email: help@rskworld.in
- Phone: +91 93305 39277