## Quiz 05 - Parallel Computing, Reproducibility, and Containers

### Instructions

This quiz is based on the material covered in lectures 21 to 24. You may use
any resources available to you, including the lecture notes and the internet.

All the data required for this quiz can be found in the `data` folder within this repository. If you need to recreate the datasets, you can do so by running the Python script included in the `scripts-data-generation` folder. Please make sure that the following Python packages are installed:

```bash
pip install numpy pandas pyarrow dask dask-sql joblib SQLAlchemy
```

This notebook contains the questions you need to answer.
If possible, please submit your answers as an `.html` file on Canvas.

### Question 01 - Parallelising a Function with Joblib

Use `joblib` to parallelise the computation of squaring numbers in a large array. Import the required packages and write code that uses four cores to parallelise the computation.

```python
import numpy as np

def square(x):
    return x ** 2

numbers = np.arange(1000000)
```

In [1]:
# Please write your code here
import numpy as np
from joblib import Parallel, delayed

def square(x):
    return x**2

numbers = np.arange(1000000)

# Use joblib to parallelise the computation across four cores
squared_numbers = Parallel(n_jobs=4)(delayed(square)(i) for i in numbers)

# Optionally, convert the result back to a NumPy array
squared_numbers = np.array(squared_numbers)
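
Dispatching one task per element creates a million very small tasks, so scheduling overhead is likely to dominate the run time. A minimal alternative sketch, assuming one contiguous chunk per worker is acceptable, splits the array first and lets each worker square a whole chunk with vectorised NumPy arithmetic. The names `n_workers`, `chunks`, and `squared_chunks` are illustrative and are not part of the quiz.

```python
import numpy as np
from joblib import Parallel, delayed

def square(x):
    return x ** 2

# int64 is requested explicitly so the largest square fits on every platform
numbers = np.arange(1_000_000, dtype=np.int64)

# Assumption: one contiguous chunk per worker (four workers, as in the question)
n_workers = 4
chunks = np.array_split(numbers, n_workers)

# Each task squares an entire chunk instead of a single element
squared_chunks = Parallel(n_jobs=n_workers)(
    delayed(square)(chunk) for chunk in chunks
)

# joblib returns results in submission order, so concatenation restores the array
squared_numbers = np.concatenate(squared_chunks)
```

The result has the same values as the element-wise version, but only four tasks are sent to the workers.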

### Question 02 - Using Dask Arrays for Large Data

Using Dask's `array` module, create a Dask array of random numbers with 10,000 rows and 10,000 columns. The array should be divided into chunks of 1,000 rows by 1,000 columns to enable efficient parallel computation. Populate the array with random numbers drawn from a normal distribution, where the mean is 0 and the standard deviation is 1. After creating the array, compute the mean, standard deviation, maximum, and minimum of the array using Dask's parallel computation capabilities. Use the `.compute()` method to execute the computations and print the results.

In [2]:
# Please write your code here
import dask.array as da

# Define the shape and chunks of the array
shape = (10000, 10000)
chunks = (1000, 1000)

# Create a Dask array with the specified shape and chunks
dask_array = da.random.normal(0, 1, size=shape, chunks=chunks)

# Compute the mean, standard deviation, maximum, and minimum
mean = dask_array.mean().compute()
std_dev = dask_array.std().compute()
maximum = dask_array.max().compute()
minimum = dask_array.min().compute()

# Print the results
print(f"Mean: {mean}")
print(f"Standard Deviation: {std_dev}")
print(f"Maximum: {maximum}")
print(f"Minimum: {minimum}")


Mean: -0.00019316936338155356
Standard Deviation: 1.0000628347868312
Maximum: 5.474412750877314
Minimum: -5.786211933245348
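
Calling `.compute()` four times asks Dask to evaluate the random array once per statistic. As a hedged sketch, `dask.compute` can take all four lazy results in a single call so that the underlying chunks are shared between them. The smaller shape below is an assumption made only to keep the example quick to run.

```python
import dask
import dask.array as da

# Assumption: a 2,000 x 2,000 array stands in for the 10,000 x 10,000 one
x = da.random.normal(0, 1, size=(2000, 2000), chunks=(1000, 1000))

# Build the four lazy reductions, then evaluate them together in one call
mean, std_dev, maximum, minimum = dask.compute(
    x.mean(), x.std(), x.max(), x.min()
)

print(f"Mean: {mean}")
print(f"Standard Deviation: {std_dev}")
print(f"Maximum: {maximum}")
print(f"Minimum: {minimum}")
```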


### Question 03 - Dask DataFrame Operations with Parquet Files

The `data` folder contains datasets for four countries (Brazil, India, UK, and USA) covering the years 1945 to 2023. Each country's data is stored in a separate Parquet file named after the country (`Brazil.parquet`, `India.parquet`, `UK.parquet`, `USA.parquet`). Each file contains the following columns:

- `country` (string): The name of the country.
- `year` (integer): The year of the record.
- `gdp_per_capita` (float): The GDP per capita for that country and year.
- `population` (integer): The population for that country and year.

Using Dask's `dataframe` module, read _only the `country` and the `gdp_per_capita` columns_ from the Parquet files into a Dask DataFrame. Then, compute the mean and standard deviation of the GDP per capita for each country using Dask's parallel computation capabilities.


In [3]:
# Please write your code here
import dask.dataframe as dd

# Read only the two required columns from the Parquet files
# (relative path; assumes the notebook is run from the repository root)
df = dd.read_parquet('data/*.parquet', columns=['country', 'gdp_per_capita'])

# Group by 'country' and compute the mean and standard deviation
grouped = df.groupby('country')['gdp_per_capita']
mean_gdp = grouped.mean().compute()
std_gdp = grouped.std().compute()

# Print the results
print("Mean GDP per Capita by Country:")
print(mean_gdp)
print("\nStandard Deviation of GDP per Capita by Country:")
print(std_gdp)


Mean GDP per Capita by Country:
country
Brazil     5496.292031
India      1251.704443
UK        27496.851363
USA       40189.822290
Name: gdp_per_capita, dtype: float64

Standard Deviation of GDP per Capita by Country:
country
Brazil     2682.494158
India       456.525628
UK        10607.858036
USA       14892.455747
Name: gdp_per_capita, dtype: float64


### Question 04 - Dask and SQL Queries

Load the `data.csv` file into a Dask DataFrame and use the `dask_sql` package to run a SQL query that selects the `country` and `gdp_per_capita` columns, keeping only the rows for 2014 in which `gdp_per_capita` is greater than 20000. Display the results. Remember to register the Dask DataFrame as a SQL table with the `create_table` method.

In [4]:
# Please write your code here
import dask.dataframe as dd
from dask_sql import Context

# Load the data into a Dask DataFrame
# (relative path; assumes the notebook is run from the repository root)
df = dd.read_csv('data/data.csv')

# Create a Dask SQL context
c = Context()

# Register the DataFrame as a SQL table named 'my_table'
c.create_table('my_table', df)

# Define the SQL query
query = """
SELECT country, gdp_per_capita
FROM my_table
WHERE gdp_per_capita > 20000 AND year = 2014
"""

# Execute the query and compute the results
result = c.sql(query).compute()

# Display the results
print(result)


    country  gdp_per_capita
227      UK    40455.486012
306     USA    65386.141694


### Question 05 - Parallelising a Function with Dask Delayed

Suppose we need to compute the sum of squares of numbers for large ranges. The function below calculates the sum of squares from `0` up to `n-1`. Modify the given `sum_of_squares` function to use Dask's `@delayed` decorator and compute the sum of squares for each number in the `numbers` list in parallel. Measure and print the total execution time of the parallel computation, and print the result for each input number, following the format of the serial code below.

```python
import time

def sum_of_squares(n):
    """Compute the sum of squares from 0 to n-1."""
    return sum(i * i for i in range(n))

numbers = [100_000_000, 200_000_000, 300_000_000, 400_000_000]

# Measure the start time
start_time = time.time()

# Perform the computations serially
results_serial = []
for n in numbers:
    result = sum_of_squares(n)
    results_serial.append(result)
    print(f"Sum of squares up to {n}: {result}")

# Measure the end time
end_time = time.time()

# Calculate and print the total execution time
serial_execution_time = end_time - start_time
print(f"\nSerial execution time: {serial_execution_time:.2f} seconds")
```


In [5]:
# Please write your code here
import time
from dask import delayed, compute

# Use Dask's delayed decorator
@delayed
def sum_of_squares(n):
    """Compute the sum of squares from 0 to n-1."""
    return sum(i*i for i in range(n))

numbers = [100_000_000, 200_000_000, 300_000_000, 400_000_000]

# Measure the start time
start_time = time.time()

# Create a list of delayed computations
delayed_results = [sum_of_squares(n) for n in numbers]

# Compute the results in parallel
results_parallel = compute(*delayed_results)

# Measure the end time
end_time = time.time()

# Print the results
for n, result in zip(numbers, results_parallel):
    print(f"Sum of squares up to {n}: {result}")

# Calculate and print the total execution time
parallel_execution_time = end_time - start_time
print(f"\nParallel execution time: {parallel_execution_time:.2f} seconds")


Sum of squares up to 100000000: 333333328333333350000000
Sum of squares up to 200000000: 2666666646666666700000000
Sum of squares up to 300000000: 8999999955000000050000000
Sum of squares up to 400000000: 21333333253333333400000000

Parallel execution time: 44.93 seconds
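
The printed totals can be checked without rerunning the long loops, because the sum of squares from `0` to `n-1` has the closed form `(n - 1) * n * (2n - 1) / 6`. The sketch below compares that formula with the loop on small inputs and then evaluates it for the quiz inputs. The function name `sum_of_squares_closed_form` is illustrative and is not part of the quiz.

```python
def sum_of_squares_closed_form(n):
    """Closed form for 0**2 + 1**2 + ... + (n - 1)**2."""
    return (n - 1) * n * (2 * n - 1) // 6

def sum_of_squares_loop(n):
    """Reference implementation, identical to the quiz function."""
    return sum(i * i for i in range(n))

# Confirm that the formula agrees with the loop on small inputs
for n in [0, 1, 2, 10, 1_000]:
    assert sum_of_squares_closed_form(n) == sum_of_squares_loop(n)

# Evaluate the formula for the quiz inputs (exact, using Python integers)
numbers = [100_000_000, 200_000_000, 300_000_000, 400_000_000]
closed_form_results = {n: sum_of_squares_closed_form(n) for n in numbers}

for n, value in closed_form_results.items():
    print(f"Sum of squares up to {n}: {value}")
```

The loop in `sum_of_squares` is pure Python, and `dask.delayed` uses a thread-based scheduler by default, so the parallel run may not be much faster than the serial one. Passing `scheduler="processes"` to `compute` is one option worth trying.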


### Question 06 - Using `pip` and `requirements.txt` for Dependency Management

Explain how you can use `pip` to manage dependencies in a Python project. Describe the process of generating a `requirements.txt` file from your current environment and how to use this file to install the same packages in another environment or on a different machine. Please comment your code to explain each step.
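
The usual commands are `pip freeze > requirements.txt` to record the installed packages with pinned versions, and `pip install -r requirements.txt` to install them elsewhere. To illustrate what the generated file contains, the sketch below approximates the freeze step with the standard library instead of calling `pip`. The helper `freeze_lines` and the temporary output path are assumptions for this example, and its output may differ from `pip freeze` for editable or URL-based installs.

```python
import tempfile
from importlib.metadata import distributions
from pathlib import Path

def freeze_lines():
    """Return 'name==version' pins for the packages in this environment."""
    pins = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with incomplete metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)

# Write the pins to a requirements file in a temporary folder (hypothetical path)
out_path = Path(tempfile.mkdtemp()) / "requirements.txt"
lines = freeze_lines()
out_path.write_text("".join(line + "\n" for line in lines))

print(f"Wrote {len(lines)} pinned packages to {out_path}")
```

Each line of the file pins one package to an exact version, which is what allows `pip install -r requirements.txt` to rebuild the same set of packages on another machine.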

### Question 07 - Creating and Sharing a Conda Environment

Describe the steps to create a new Conda virtual environment named `qtm350` with Python 3.12 and install the packages `numpy`, `pandas`, and `matplotlib`. Explain how to export this environment to an `environment.yml` file and how someone else can recreate the same environment on their machine using this file. Please comment your code to explain each step. There is no need to run the code for this question, but you can do so if you wish.

### Question 08 - Writing a Simple Dockerfile

Write a simple `Dockerfile` that creates a Docker image for a Python application. The application consists of a single Python script named `app.py` that prints "Hello, World!" when executed. The `Dockerfile` should use the official Python image as the base image and copy the `app.py` script into the image. When the container is run, it should execute the `app.py` script and print "Hello, World!".

### Question 09 - Writing a Dockerfile to Install Software on a Base Image

Create a Dockerfile that starts from an Ubuntu 24.04 base image and installs the following software:

- Git version 2.43.0-1ubuntu7.1
- SQLite version 3.45.1-1ubuntu2

Ensure that you specify the exact versions of the packages. Include commands to clean up the package manager cache after installation to reduce the image size.

### Question 10 - Writing a Dockerfile to Install Python and Packages on Ubuntu

Write a `Dockerfile` that starts from an Ubuntu 24.04 base image, installs Python 3.12 and `pip`, and then uses `pip` to install specific versions of `numpy` (1.26.4), `pandas` (2.2.2), and `matplotlib` (3.9.2). Ensure you include commands to clean up the package manager cache after installation to reduce the image size. Set up a working directory named `app/` and configure the container to start an interactive Python shell `python3` by default.