# Parallelization with Python (II):

## Demo for basic Dask.

### Online resources:

- https://tutorial.dask.org/

### Hungjui Yu – 20240223

***
## <font color='maroon'>What is Dask?</font>

- Dask is a parallel computing library that works alongside popular Python libraries such as NumPy, pandas, and scikit-learn. It enables parallel and distributed computing, including on datasets that are larger than memory.

## <font color='maroon'>What are Dask Arrays?</font>

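- Dask arrays provide parallelized, larger-than-memory computations on arrays. A Dask array is split into many smaller NumPy arrays (chunks) and implements a large subset of the NumPy API. Operations are lazy: they build a task graph that runs only when `.compute()` is called.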
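
The chunking and lazy evaluation described above can be sketched as follows. The array size and chunk shape are arbitrary values chosen for illustration, not taken from the demo below.

```python
import dask.array as da

# A 1000x1000 array of ones, split into 250x250 NumPy chunks
# (sizes are illustrative assumptions).
x = da.ones((1000, 1000), chunks=(250, 250))

print(x.numblocks)   # number of chunks along each axis
print(x.chunksize)   # shape of the largest chunk

# Building an expression is lazy: this only records a task graph.
total = (x + x.T).sum()
print(type(total))

# compute() runs the graph and returns a concrete value.
total_value = total.compute()
print(total_value)
```

Every element of `x + x.T` is 2, so the sum equals twice the number of elements.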


In [4]:
import dask
import dask.array as da
from dask import delayed
import numpy as np
import time

### Creating a Dask Array:

In [23]:
x = np.random.random((10000, 10000))

# chunks='auto' lets Dask pick the chunk shape; an explicit shape such as (500, 500) also works
x_dask_array = da.from_array(x, chunks='auto')

x_dask_array


| | Array | Chunk |
|---|---|---|
| Bytes | 762.94 MiB | 128.00 MiB |
| Shape | (10000, 10000) | (4096, 4096) |
| Dask graph | 9 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |


### Calculate as original numpy array:

In [24]:
start = time.time()

y = (x + x.T).mean()
print(y)

end = time.time()
print(f'Finished in {round(end-start, 3)} seconds.')

1.0001521412513117
Finished in 6.201 seconds.


### Calculate as Dask Array:

In [25]:
start = time.time()

y_dask_array = (x_dask_array + x_dask_array.T).mean()
result = y_dask_array.compute()
print(result)

end = time.time()
print(f'Finished in {round(end-start, 3)} seconds.')

1.0001521412513161
Finished in 1.283 seconds.


***
## <font color='maroon'>What is Dask Delayed?</font>

- Dask Delayed (`dask.delayed`) lets users parallelize custom Python code. Calling a delayed function does not run it; the call is recorded in a task graph, and execution is deferred until `.compute()` is called.


In [31]:
@delayed
def square(x):
    return x ** 2

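# add() is not decorated, but a + b still returns a lazy Delayed object
# because Delayed objects overload the arithmetic operators.
def add(a, b):
    return a + b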

# Delayed computation
a = square(2)
b = square(3)
c = add(a, b)

# Compute the result
result_delayed = c.compute()
print(result_delayed)

13
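
A minimal sketch of the deferred execution shown above, under the assumption that the default threaded scheduler is used. The `calls` list is a hypothetical helper added only to record when the function body actually runs.

```python
from dask import delayed

calls = []  # hypothetical helper: records each real execution of square()

@delayed
def square(x):
    calls.append(x)
    return x ** 2

# Calling the decorated function only builds Delayed objects.
lazy_values = [square(i) for i in range(4)]
ran_before_compute = list(calls)  # still empty at this point

# delayed(sum) accepts a list of Delayed objects as its argument.
total = delayed(sum)(lazy_values)
result = total.compute(scheduler="threads")

print(ran_before_compute, sorted(calls), result)
```

The function bodies run only inside `compute()`, and the scheduler is free to run the four independent `square` tasks concurrently.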


In [29]:
def simple_computation(a, b):
    result = a + b
    return result

# Timing the computation without Dask Delayed
start_time = time.time()
result_without_delayed = simple_computation(2, 3)
elapsed_time_without_delayed = time.time() - start_time

print(f"Without Dask Delayed - Result: {result_without_delayed}, Elapsed Time: {elapsed_time_without_delayed:.4f} seconds")


Without Dask Delayed - Result: 5, Elapsed Time: 0.0001 seconds


In [30]:
# Wrapping the same computation with Dask Delayed; with a single tiny task
# there is nothing to run in parallel, so the scheduling overhead dominates
@delayed
def simple_computation_delayed(a, b):
    result = a + b
    return result

# Timing the computation with Dask Delayed
start_time = time.time()
result_with_delayed = simple_computation_delayed(2, 3).compute()
elapsed_time_with_delayed = time.time() - start_time

print(f"With Dask Delayed - Result: {result_with_delayed}, Elapsed Time: {elapsed_time_with_delayed:.4f} seconds")


With Dask Delayed - Result: 5, Elapsed Time: 0.0029 seconds
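
The two timings above compare a single trivial task, so only the overhead is visible. As a hedged illustration of a case where Delayed can help, the sketch below runs several independent slow tasks. `slow_increment` and the 0.2 second sleep are hypothetical stand-ins for real work that releases the GIL (such as I/O); actual timings depend on the machine.

```python
import time
import dask
from dask import delayed

def slow_increment(x):
    time.sleep(0.2)  # stand-in for slow work that releases the GIL
    return x + 1

# Serial execution: the four calls run one after another.
start = time.perf_counter()
serial = [slow_increment(i) for i in range(4)]
serial_time = time.perf_counter() - start

# Delayed execution: four independent tasks, run on a thread pool.
lazy = [delayed(slow_increment)(i) for i in range(4)]
start = time.perf_counter()
parallel = dask.compute(*lazy, scheduler="threads", num_workers=4)
parallel_time = time.perf_counter() - start

print(f"Serial: {serial_time:.2f} s, Delayed: {parallel_time:.2f} s")
```

`dask.compute` returns a tuple with one result per Delayed object. With four workers the sleeps can overlap, so the delayed version would typically finish sooner than the serial loop.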


***
***
# <font color='teal'>**Supplement:**</font>
***
***

### Matrix Multiplication without Dask:

In [18]:
def matrix_multiply_np(size):
    A = np.random.random((size, size))
    B = np.random.random((size, size))
    start_time = time.time()
    result_np = np.dot(A, B)
    elapsed_time = time.time() - start_time
    return result_np, elapsed_time

result_np, time_np = matrix_multiply_np(10000)

print(f"Without Dask - Elapsed Time: {time_np:.4f} seconds")


Without Dask - Elapsed Time: 4.2804 seconds


### Matrix Multiplication with Dask:

In [19]:
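def matrix_multiply_dask(size):
    # A and B are lazy: no random numbers are generated until compute() runs.
    A = da.random.random((size, size), chunks='auto')
    B = da.random.random((size, size), chunks='auto')
    start_time = time.time()
    # The timed region therefore covers random generation as well as the product.
    result_dask = da.dot(A, B).compute()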
    elapsed_time = time.time() - start_time
    return result_dask, elapsed_time

result_dask, time_dask = matrix_multiply_dask(10000)

print(f"With Dask - Elapsed Time: {time_dask:.4f} seconds")


With Dask - Elapsed Time: 14.1037 seconds
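
Because Dask arrays are lazy, the Dask timing above includes generating `A` and `B`, while the NumPy timing does not. One possible way to time only the product is to materialize the inputs first with `persist()`. This is a sketch with smaller, illustrative sizes, and it is not a claim about how the numbers above would change.

```python
import time
import dask.array as da

size, chunk = 2000, 500  # illustrative sizes, smaller than the demo above

# persist() computes the chunks now and keeps them in memory,
# so random generation is excluded from the timed region.
A = da.random.random((size, size), chunks=(chunk, chunk)).persist()
B = da.random.random((size, size), chunks=(chunk, chunk)).persist()

start = time.perf_counter()
C = da.dot(A, B).compute()
elapsed = time.perf_counter() - start

print(f"Product only: {elapsed:.4f} seconds, result shape {C.shape}")
```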


In [1]:
# Importing Dask, NumPy, and other necessary libraries
import dask
import dask.array as da
import numpy as np
import time

# Function to perform matrix multiplication without Dask
def matrix_multiply_np(size):
    np.random.seed(42)
    A = np.random.random((size, size))
    B = np.random.random((size, size))
    
    start_time = time.time()
    result_np = np.dot(A, B)
    elapsed_time = time.time() - start_time
    
    return result_np, elapsed_time

# Function to perform matrix multiplication with Dask
def matrix_multiply_dask(size, chunk_size):
    da.random.seed(42)
    A = da.random.random((size, size), chunks=(chunk_size, chunk_size))
    B = da.random.random((size, size), chunks=(chunk_size, chunk_size))
    
    start_time = time.time()
    result_dask = da.dot(A, B).compute()
    elapsed_time = time.time() - start_time
    
    return result_dask, elapsed_time

# Parameters
matrix_size = 1000
chunk_size = 500

# Matrix multiplication without Dask
result_np, time_np = matrix_multiply_np(matrix_size)

# Matrix multiplication with Dask
result_dask, time_dask = matrix_multiply_dask(matrix_size, chunk_size)

# Display results and comparison
print(f"Matrix Size: {matrix_size}x{matrix_size}")
print("Without Dask:")
print(f"   Elapsed Time: {time_np:.4f} seconds")

print("\nWith Dask:")
print(f"   Elapsed Time: {time_dask:.4f} seconds")

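# Check whether the two products agree within a tolerance. Note that NumPy and Dask
# produce different random streams even with the same seed, so A and B differ between
# the two functions and the products are not expected to match.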
if np.allclose(result_np, result_dask):
    print("\nThe results are close.")
else:
    print("\nThe results differ.")



Matrix Size: 1000x1000
Without Dask:
   Elapsed Time: 0.1801 seconds

With Dask:
   Elapsed Time: 0.9378 seconds

The results differ.
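
The results differ because the two functions multiply different random matrices, not because of floating-point error. A minimal sketch of a like-for-like check, assuming the inputs are generated once with NumPy and then wrapped with `da.from_array`; the variable names are chosen for illustration.

```python
import numpy as np
import dask.array as da

# Generate the inputs once so both libraries multiply the same matrices.
rng = np.random.default_rng(42)
A_np = rng.random((1000, 1000))
B_np = rng.random((1000, 1000))

# Wrap the same data as chunked Dask arrays.
A_da = da.from_array(A_np, chunks=(500, 500))
B_da = da.from_array(B_np, chunks=(500, 500))

result_np = A_np @ B_np
result_dask = (A_da @ B_da).compute()

# Blocked multiplication sums in a different order, so compare with a tolerance.
same = np.allclose(result_np, result_dask)
print("The results are close." if same else "The results differ.")
```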
