## Quiz 05 - Parallel Computing, Reproducibility, and Containers

### Instructions

This quiz is based on the material covered in lectures 20 to 24. You may use
any resources available to you, including the lecture notes and the internet.

All the data required for this quiz can be found in the `data` folder within this repository. If you need to recreate the datasets, you can do so by running the Python script included in the `script-data-generation` folder.

**Important:** Please start by completing Question 01 to set up the correct Python environment before proceeding with the other questions.

This notebook contains the questions you need to answer.
If possible, please submit your answers as an `.html` file on Canvas.

### **Question 01: Setting up the Python Environment**

Before proceeding with the rest of the quiz, it is important to set up a Python environment with specific package versions to ensure compatibility and reproducibility. This quiz requires **Python 3.10** and the following packages with exact versions:
- `dask-sql=2024.5.0`
- `dask=2024.4.1`
- `ipykernel=6.29.3`
- `joblib=1.3.2`
- `numpy=1.26.4`
- `pandas=2.2.1`

You can use tools like `conda`, `pipenv`, or `uv` to manage your environment. If you use conda (recommended), please make sure you **create the environment and install all packages in the same command**. Also include `-c conda-forge` in your command. Make sure to change your current environment to the new environment after creation. 

Write the terminal commands in the code cell below:

In [9]:
# Please write your bash commands here. You can run them using the `!` operator or the `%%bash` magic.
#!/bin/bash
# Question 01: Setting up the Python Environment

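# Create the conda environment with Python 3.10 and all required packages in a single command.
# Afterwards, switch to it from a terminal with: conda activate qtm350-quiz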
!conda create -n qtm350-quiz python=3.10 dask-sql=2024.5.0 dask=2024.4.1 ipykernel=6.29.3 joblib=1.3.2 numpy=1.26.4 pandas=2.2.1 -c conda-forge -y

Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/sohanbellam/anaconda3/envs/qtm350-quiz

  added / updated specs:
    - dask-sql=2024.5.0
    - dask=2024.4.1
    - ipykernel=6.29.3
    - joblib=1.3.2
    - numpy=1.26.4
    - pandas=2.2.1
    - python=3.10


The following NEW packages will be INSTALLED:

  _openmp_mutex      conda-forge/osx-arm64::_openmp_mutex-4.5-7_kmp_llvm 
  annotated-doc      conda-forge/noarch::annotated-doc-0.0.4-pyhcf101f3_0 
  annotated-types    conda-forge/noarch::annotated-types-0.7.0-pyhd8ed1ab_1 
  anyio              conda-forge/noarch::anyio-4.12.0-pyhcf101f3_0 
  appnope            conda-forge/noarch::appnope-0.1.4-pyhd8ed1ab_1 
  asttokens          conda-forge/noarch::asttokens-3.0.1-pyhd8ed1ab_0 
  aws-c-auth         conda-forge/osx-arm64::aws-c-auth-0.9.1-h8818502_7 
  aws-c-cal          conda-forge/osx-arm64::aws-c-ca

### Question 02: Understanding the `map` Function and Parallelism

The built-in Python `map()` function applies a function to each element sequentially. Using `joblib`, rewrite the following serial code to run in parallel using **all available cores** (hint: use `n_jobs=-1`). Compare the results to verify correctness.

```python
import numpy as np

def cube_root(x):
    return x ** (1/3)

numbers = np.arange(1, 500001)

# Serial version using map
serial_result = list(map(cube_root, numbers))
print("First 5 serial results:", serial_result[:5])
```

Write the parallel version using `joblib.Parallel` and `delayed`.

In [4]:
# Please write your answer here.
from joblib import Parallel, delayed
import numpy as np

def cube_root(x):
    return x ** (1/3)

numbers = np.arange(1, 500001)

# Serial version using map
serial_result = list(map(cube_root, numbers))
print("First 5 serial results:", serial_result[:5])

# Parallel version
parallel_result = Parallel(n_jobs=-1)(delayed(cube_root)(x) for x in numbers)
print("First 5 parallel results:", parallel_result[:5])

First 5 serial results: [1.0, 1.2599210498948732, 1.4422495703074083, 1.5874010519681994, 1.7099759466766968]
First 5 parallel results: [1.0, 1.2599210498948732, 1.4422495703074083, 1.5874010519681994, 1.7099759466766968]
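
The prompt also asks to compare the results to verify correctness. Printing the first five values only spot-checks them, so a fuller check could look like the sketch below. It assumes a smaller input than the quiz's 500,000 values to keep the run short, and the name `results_match` is an illustrative choice.

```python
import numpy as np
from joblib import Parallel, delayed

def cube_root(x):
    return x ** (1 / 3)

# A smaller input than the quiz's 500,000 values keeps the check quick.
numbers = np.arange(1, 1001)

serial_result = list(map(cube_root, numbers))
parallel_result = Parallel(n_jobs=-1)(delayed(cube_root)(x) for x in numbers)

# Parallel returns results in input order, so the lists compare element-wise.
results_match = np.allclose(serial_result, parallel_result)
print("Serial and parallel results match:", results_match)
```

`np.allclose` compares within a floating-point tolerance, which is a safer habit than `==` whenever results come from numerical code.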


### Question 03: Measuring Parallel Speedup

Create a function called `simulate_computation` that generates 100,000 random numbers and calculates their variance. Using `%timeit`, measure and compare the execution time of:

1. Running the function **4 times sequentially** in a list comprehension (`[simulate_computation() for _ in range(4)]`)
2. Running the function **4 times in parallel** using `joblib` with 4 workers

Print and compare both timing results.

In [None]:
# Please write your answer here.
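def simulate_computation():
    # Draw 100,000 uniform random numbers and return their variance
    samples = np.random.rand(100000)
    return np.var(samples)

# Sequential: 4 calls in a list comprehension
%timeit [simulate_computation() for _ in range(4)]
# Parallel: 4 calls spread over 4 joblib workers
%timeit Parallel(n_jobs=4)(delayed(simulate_computation)() for _ in range(4))

# Comparison: the sequential version is faster here. Each call is so cheap that
# the cost of dispatching work to 4 workers outweighs any gain from parallelism.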

1.56 ms ± 32.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
12.5 ms ± 803 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
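
One way to get both timings as numbers that can be printed and compared side by side is the standard library's `timeit` module. The following is a minimal sketch of that idea rather than a replacement for `%timeit`. The helper names `run_sequential` and `run_parallel` and the repeat counts are illustrative assumptions.

```python
import timeit

import numpy as np
from joblib import Parallel, delayed

def simulate_computation():
    samples = np.random.rand(100_000)
    return np.var(samples)

def run_sequential():
    return [simulate_computation() for _ in range(4)]

def run_parallel():
    return Parallel(n_jobs=4)(delayed(simulate_computation)() for _ in range(4))

# Best of 3 measurements, each averaging 3 calls (illustrative counts).
seq_time = min(timeit.repeat(run_sequential, number=3, repeat=3)) / 3
par_time = min(timeit.repeat(run_parallel, number=3, repeat=3)) / 3

print(f"Sequential: {seq_time * 1e3:.2f} ms per run")
print(f"Parallel:   {par_time * 1e3:.2f} ms per run")
print(f"Parallel / sequential ratio: {par_time / seq_time:.1f}x")
```

A ratio above 1 means the parallel version was slower on that machine, which is the typical outcome when each task takes well under a millisecond.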


### Question 04: Dask Array with Custom Chunk Sizes

Create a Dask array of shape (5000, 2000) filled with random integers between 1 and 100. Use chunks of size (500, 500). Then:

1. Compute the sum of each row
2. Calculate the mean and standard deviation of the entire array
3. Print all three results

In [12]:
# Please write your answer here.
import dask.array as da

data = np.random.normal(size=10000000).reshape(5000,2000)
a = da.from_array(data, chunks=(500,500))

print(a.sum(axis=1).compute())
print(a.mean().compute())
print(a.std().compute())

[-106.94014744   22.15236202   94.29809625 ...   55.98980573   67.61376916
  -21.97533284]
-2.5124455113762817e-05
1.000115364869338
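
The prompt asks for random integers between 1 and 100, while the cell above draws from a standard normal distribution. That is why its mean is near 0 and its standard deviation near 1. A minimal sketch that follows the prompt could use `da.random.randint`, which builds the chunked array directly instead of wrapping a NumPy array. The variable names are illustrative.

```python
import dask.array as da

# randint's upper bound is exclusive, so high=101 yields integers from 1 to 100.
x = da.random.randint(1, 101, size=(5000, 2000), chunks=(500, 500))

row_sums = x.sum(axis=1).compute()  # one sum per row, shape (5000,)
mean = x.mean().compute()
std = x.std().compute()

print(row_sums)
print(mean)
print(std)
```

With chunks of (500, 500), the array is split into a 10 by 4 grid of blocks, and each reduction is computed per block before the partial results are combined.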


### Question 05: Optimising Chunk Size

The chunk size significantly affects Dask performance. Create a Dask array with 100,000 random numbers and test three different chunk sizes: 1,000 (many small chunks), 10,000 (medium chunks), and 50,000 (few large chunks).

For each configuration, measure the time to compute `mean(sin(x) + cos(x))`. Which chunk size performed best? Explain why in a comment.

In [15]:
# Please write your answer here.
data = np.random.random(size=100000)

# 1,000 chunk
a = da.from_array(data, chunks=(1000))
%timeit (da.sin(a) + da.cos(a)).mean().compute()

# 10,000 chunk
a = da.from_array(data, chunks=(10000))
%timeit (da.sin(a) + da.cos(a)).mean().compute()

# 50,000 chunk
a = da.from_array(data, chunks=(50000))
%timeit (da.sin(a) + da.cos(a)).mean().compute()

# The 50,000-element chunks performed best.
# With only two chunks (100,000 / 50,000 = 2), Dask builds and schedules far
# fewer tasks than it does for 100 or 10 chunks, so less time is lost to
# overhead on an array this small.

15 ms ± 266 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.71 ms ± 128 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.66 ms ± 51.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
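
The arithmetic behind that explanation can be sketched in plain Python. The chunk count is only a rough proxy for scheduling overhead, since the real task graph contains several tasks per chunk. The names used here are illustrative.

```python
import math

n_elements = 100_000
chunk_sizes = [1_000, 10_000, 50_000]

# Each chunk contributes at least one task per operation in the graph,
# so the chunk count is a rough proxy for scheduling overhead.
chunk_counts = {size: math.ceil(n_elements / size) for size in chunk_sizes}

for size, count in chunk_counts.items():
    print(f"chunk size {size:>6,}: {count:>3} chunks")
```

For an array of only 100,000 elements, the computation per chunk is tiny, so the configuration with the fewest chunks tends to win. Larger arrays shift the balance toward more, smaller chunks that fit in memory and keep all cores busy.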


### Question 06: Reading Parquet Files with Column Selection

The `data` folder contains Parquet files for multiple countries. Using Dask, read **all Parquet files at once** (`data/*.parquet`), but load only the `year` and `population` columns.

Calculate the total world population for each year across all countries and display the results sorted by year.

In [24]:
# Please write your answer here.
import dask.dataframe as dd
df = dd.read_parquet('data/*.parquet', columns = ['year', 'population'])

pop = df.groupby('year').population.sum().compute()
print(pop.sort_values(ascending=False).head(10))

year
2022    1985837056
2023    1980706538
2020    1959057915
2018    1940659483
2021    1928046753
2017    1906082434
2019    1893887207
2015    1872433473
2016    1847205000
2014    1824069730
Name: population, dtype: int64
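
The prompt asks for results sorted by year, whereas the cell above sorts by population and shows only the ten largest totals. The difference comes down to `sort_index()` versus `sort_values()`. The sketch below illustrates it with plain pandas, using only the ten totals printed above rather than the full dataset.

```python
import pandas as pd

# The ten yearly totals shown in the output above, in the order printed.
pop = pd.Series(
    [1985837056, 1980706538, 1959057915, 1940659483, 1928046753,
     1906082434, 1893887207, 1872433473, 1847205000, 1824069730],
    index=pd.Index(
        [2022, 2023, 2020, 2018, 2021, 2017, 2019, 2015, 2016, 2014],
        name="year",
    ),
    name="population",
)

# sort_index() orders by the year labels; sort_values() orders by the totals.
by_year = pop.sort_index()
print(by_year)
```

Applied to the computed result in the quiz, `pop.sort_index()` without `head(10)` would list every year in chronological order.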


### Question 07: Dask SQL with Multiple Conditions

Load the `data.csv` file into a Dask DataFrame and register it as a SQL table. Write a SQL query that:

1. Selects countries where `gdp_per_capita` was between 10000 and 50000
2. Filters for years between 2000 and 2020
3. Orders results by `gdp_per_capita` in descending order
4. Limits to the top 2 results

Execute the query and display the results.

In [32]:
# Please write your answer here.
from dask_sql import Context
df = dd.read_csv('data/data.csv')

c = Context()
c.create_table("dask", df)
print(c.sql("""
    SELECT *
    FROM dask
    WHERE gdp_per_capita BETWEEN 10000 AND 50000
        AND year BETWEEN 2000 AND 2020
    ORDER BY gdp_per_capita DESC
    LIMIT 2
""").compute()
)

    country  year  gdp_per_capita  population
294     USA  2002    48942.492140   284207485
295     USA  2003    47607.365171   277711486
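
`BETWEEN` is inclusive at both ends in standard SQL, so rows sitting exactly on 10000, 50000, 2000, or 2020 are kept. The sketch below checks that behaviour on a toy table. It swaps in the standard library's `sqlite3` for dask-sql purely to stay self-contained, and the rows are hypothetical values, not taken from `data.csv`.

```python
import sqlite3

# Hypothetical rows (country, year, gdp_per_capita); not taken from data.csv.
rows = [
    ("A", 1999, 20000.0),  # year below the range, dropped
    ("A", 2000, 10000.0),  # on both lower bounds, kept
    ("B", 2010, 50000.0),  # on the upper GDP bound, kept
    ("B", 2020, 30000.0),  # on the upper year bound, kept
    ("C", 2021, 40000.0),  # year above the range, dropped
    ("C", 2015, 50000.5),  # GDP above the range, dropped
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (country TEXT, year INTEGER, gdp_per_capita REAL)")
conn.executemany("INSERT INTO toy VALUES (?, ?, ?)", rows)

top_two = conn.execute("""
    SELECT country, year, gdp_per_capita
    FROM toy
    WHERE gdp_per_capita BETWEEN 10000 AND 50000
        AND year BETWEEN 2000 AND 2020
    ORDER BY gdp_per_capita DESC
    LIMIT 2
""").fetchall()
conn.close()

print(top_two)
```

Three rows pass the filter. `ORDER BY ... DESC` with `LIMIT 2` then keeps the two with the highest `gdp_per_capita`.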


### Question 08: Dask SQL with Aggregation

Using the same `data.csv` file, write a SQL query that calculates:

1. The average GDP per capita for each country
2. The minimum and maximum years in the dataset for each country

Group by country and display all results.

In [33]:
# Please write your answer here.
print(c.sql("""
    SELECT country, AVG(gdp_per_capita) AS avg_gdp_per_capita, MIN(year) AS min_year, MAX(year) AS max_year
    FROM dask
    GROUP BY country
""").compute()
)

  country  avg_gdp_per_capita  min_year  max_year
0  Brazil         5496.292031      1945      2023
1   India         1251.704443      1945      2023
2      UK        27496.851363      1945      2023
3     USA        40189.822290      1945      2023


### Question 09: Generating `requirements.txt` and `environment.yml` Files

Write the commands to:

1. Export your current environment's packages to a `requirements.txt` and an `environment.yml` file
2. Show how someone else would install these exact dependencies in these two cases

Explain each step with comments. It is not necessary to run the code.

In [None]:
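# Please write your answer here.
# Export the conda environment (packages and versions) to environment.yml
!conda env export --name qtm350-quiz --file ~/Desktop/environment.yml

# Export the pip-visible packages with pinned versions to requirements.txt
!pip freeze > requirements.txt

# To recreate the environment elsewhere:
# conda: build a new environment from the file, then activate it in a terminal
!conda env create --file ~/Desktop/environment.yml
# conda activate qtm350-quiz

# pip: install the pinned packages into an existing Python environment
!pip install -r requirements.txt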

### Question 10: Troubleshooting a Broken Dockerfile

The following Dockerfile has several errors. Identify and fix 5 issues, then explain what was wrong with each line:

```dockerfile
# Broken Dockerfile - Fix the errors
from ubuntu

RUN apt install python3 python3-pip
RUN pip install numpy pandas

COPY . .
EXPOSE 8888
RUN ["python3", "app.py"]
```

Write the corrected Dockerfile and list each error with its fix.

In [None]:
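# Corrected Dockerfile:
# 1: instruction keywords are uppercase by convention, so FROM instead of from
FROM ubuntu
# 2: run apt-get update before installing, and pass -y so the build does not stop at a prompt
RUN apt-get update && apt-get install -y python3 python3-pip
# 3: use pip3, the command provided by python3-pip
RUN pip3 install numpy pandas
# 4: set a working directory and copy the project into it
WORKDIR /app
COPY . .
EXPOSE 8888
# 5: use CMD instead of RUN so app.py starts when the container runs, not at build time
CMD ["python3", "app.py"]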

### Question 11: Writing a Dockerfile to Install Software on a Base Image

Create a Dockerfile that starts from an Ubuntu image and installs the following software:

- Git version 2.43.0-1ubuntu7.1
- SQLite version 3.45.1-1ubuntu2

Ensure that you specify the exact versions of the packages by checking their versions after installation. Include commands to clean up the package manager cache after installation to reduce the image size.

#### Please write your answer here. You can use ```dockerfile to format your code

In [35]:
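dockerfile_content = """
# Assumption: Ubuntu 24.04 is the release that ships these package versions.
FROM ubuntu:24.04
# Install the pinned versions and clear the apt cache in the same layer.
# Assumption: Ubuntu's git package carries a 1: epoch prefix; `apt-cache policy git` shows the full version string.
RUN apt-get update && apt-get install -y git=1:2.43.0-1ubuntu7.1 sqlite3=3.45.1-1ubuntu2 && apt-get clean && rm -rf /var/lib/apt/lists/*
# Confirm the installed versions
RUN git --version && sqlite3 --version
"""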

### Question 12: Dockerfile for a Jupyter Data Science Environment

Create a Dockerfile starting from Ubuntu that:

1. Installs Python 3.11 and pip
2. Installs `jupyterlab`, `numpy`, `pandas`, `matplotlib`, and `scikit-learn` with specific versions of your choice
3. Sets the working directory to `/home/analyst/notebooks`
4. Exposes port 8888
5. Starts JupyterLab with `--no-browser` and `--ip=0.0.0.0`

Clean up apt cache to reduce image size.

#### Please write your answer here. You can use ```dockerfile to format your code

In [36]:
# The pinned versions below are example choices, not requirements of the quiz
dockerfile_content = """
# Assumes the base image's repositories provide a python3.11 package
FROM ubuntu
RUN apt-get update && apt-get install -y python3.11 python3-pip && apt-get clean && rm -rf /var/lib/apt/lists/*
RUN pip3 install jupyterlab==4.1.5 numpy==1.26.4 pandas==2.2.1 matplotlib==3.8.3 scikit-learn==1.4.2
WORKDIR /home/analyst/notebooks
EXPOSE 8888
CMD ["jupyter", "lab", "--no-browser", "--ip=0.0.0.0"]
"""