## Quiz 05 - Parallel Computing, Reproducibility, and Containers

### Instructions

This quiz is based on the material covered in lectures 20 to 24. You may use
any resources available to you, including the lecture notes and the internet.

All the data required for this quiz can be found in the `data` folder within this repository. If you need to recreate the datasets, you can do so by running the Python script included in the `script-data-generation` folder.

**Important:** Please start by completing Question 01 to set up the correct Python environment before proceeding with the other questions.

This notebook contains the questions you need to answer.
If possible, please submit your answers as an `.html` file on Canvas.

### **Question 01: Setting up the Python Environment**

Before proceeding with the rest of the quiz, it is important to set up a Python environment with specific package versions to ensure compatibility and reproducibility. This quiz requires **Python 3.10** and the following packages with exact versions:
- `dask-sql=2024.5.0`
- `dask=2024.4.1`
- `ipykernel=6.29.3`
- `joblib=1.3.2`
- `numpy=1.26.4`
- `pandas=2.2.1`

You can use tools like `conda`, `pipenv`, or `uv` to manage your environment. If you use conda (recommended), please make sure you **create the environment and install all packages in the same command**. Also include `-c conda-forge` in your command. Make sure to change your current environment to the new environment after creation. 

Write the terminal commands in the code cell below:

In [4]:
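%%bash
# Please write your bash commands here. You can run them using the `!` operator or the `%%bash` magic.
# `%%bash` must be the first line of the cell; without it the commands are parsed as Python.

# Create the environment and install every pinned package in a single command.
conda create -n quiz05 python=3.10 \
  dask-sql=2024.5.0 \
  dask=2024.4.1 \
  ipykernel=6.29.3 \
  joblib=1.3.2 \
  numpy=1.26.4 \
  pandas=2.2.1 \
  -c conda-forge -y

# `conda activate` needs conda's shell hook in a non-interactive shell
# (assumption: a standard conda install that ships etc/profile.d/conda.sh).
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate quiz05

# Register the environment as a Jupyter kernel so the notebook can switch to it.
python -m ipykernel install --user --name quiz05 --display-name "Python (quiz05)"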

### Question 02: Understanding the `map` Function and Parallelism

The built-in Python `map()` function applies a function to each element sequentially. Using `joblib`, rewrite the following serial code to run in parallel using **all available cores** (hint: use `n_jobs=-1`). Compare the results to verify correctness.

```python
import numpy as np

def cube_root(x):
    return x ** (1/3)

numbers = np.arange(1, 500001)

# Serial version using map
serial_result = list(map(cube_root, numbers))
print("First 5 serial results:", serial_result[:5])
```

Write the parallel version using `joblib.Parallel` and `delayed`.

In [5]:
# Please write your answer here.
from joblib import Parallel, delayed
import numpy as np

def cube_root(x):
    return x ** (1/3)

numbers = np.arange(1, 500001)

parallel_result = Parallel(n_jobs=-1)(
    delayed(cube_root)(x) for x in numbers
)

print("First 5 parallel results:", parallel_result[:5])

First 5 parallel results: [1.0, 1.2599210498948732, 1.4422495703074083, 1.5874010519681994, 1.7099759466766968]
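
The question also asks for the serial and parallel results to be compared. A minimal sketch of such a check follows, assuming a reduced input size (10,000 values instead of 500,000) purely to keep it quick; the names `results_match` and `matches_numpy` are illustrative and not part of the quiz.

```python
import numpy as np
from joblib import Parallel, delayed

def cube_root(x):
    return x ** (1/3)

# Smaller input than in the question, purely to keep the check quick.
numbers = np.arange(1, 10001)

serial_result = list(map(cube_root, numbers))
parallel_result = Parallel(n_jobs=-1)(delayed(cube_root)(x) for x in numbers)

# joblib returns results in input order, so an element-wise comparison is valid.
results_match = np.allclose(serial_result, parallel_result)

# np.cbrt is NumPy's vectorized cube root, used here as an independent reference.
matches_numpy = np.allclose(parallel_result, np.cbrt(numbers))

print("Parallel matches serial:", results_match)
print("Parallel matches np.cbrt:", matches_numpy)
```

Because both versions call the same function on the same inputs, the results should agree element by element; `np.allclose` is used instead of `==` only to stay tolerant of floating-point rounding.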


### Question 03: Measuring Parallel Speedup

Create a function called `simulate_computation` that generates 100,000 random numbers and calculates their variance. Using `%timeit`, measure and compare the execution time of:

1. Running the function **4 times sequentially** in a list comprehension (`[simulate_computation() for _ in range(4)]`)
2. Running the function **4 times in parallel** using `joblib` with 4 workers

Print and compare both timing results.

In [33]:
# Please write your answer here.
import numpy as np
from joblib import Parallel, delayed

def simulate_computation():
    data = np.random.rand(100_000)
    return np.var(data)

%timeit [simulate_computation() for _ in range(4)]

%timeit Parallel(n_jobs=4)(delayed(simulate_computation)() for _ in range(4))

2.05 ms ± 32.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.8 ms ± 949 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
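
In the timings above, the parallel version is slower (11.8 ms against 2.05 ms, roughly six times). Each call takes only about half a millisecond, so the cost of dispatching work to separate worker processes and collecting the results most likely outweighs the computation itself. The sketch below prints that comparison explicitly. It swaps `%timeit` for a single `time.perf_counter` measurement so that it runs as plain Python; `time_once` and `speedup` are hypothetical helper names, and a single run is noisier than `%timeit`.

```python
import time

import numpy as np
from joblib import Parallel, delayed

def simulate_computation():
    data = np.random.rand(100_000)
    return np.var(data)

def time_once(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

serial_results, serial_seconds = time_once(
    lambda: [simulate_computation() for _ in range(4)]
)
parallel_results, parallel_seconds = time_once(
    lambda: Parallel(n_jobs=4)(delayed(simulate_computation)() for _ in range(4))
)

# A value below 1 means the parallel version was slower than the serial one.
speedup = serial_seconds / parallel_seconds

print(f"Sequential: {serial_seconds * 1000:.2f} ms")
print(f"Parallel:   {parallel_seconds * 1000:.2f} ms")
print(f"Speedup (sequential / parallel): {speedup:.2f}x")
```

Parallelism tends to pay off only when each task runs long enough to amortise this overhead, for example with a much larger array or many more repetitions per task.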


### Question 04: Dask Array with Custom Chunk Sizes

Create a Dask array of shape (5000, 2000) filled with random integers between 1 and 100. Use chunks of size (500, 500). Then:

1. Compute the sum of each row
2. Calculate the mean and standard deviation of the entire array
3. Print all three results

In [32]:
# Please write your answer here.
import dask.array as da

x = da.random.randint(1, 101, size=(5000, 2000), chunks=(500, 500))

row_sums = x.sum(axis=1).compute()

mean_val = x.mean().compute()
std_val = x.std().compute()

print(row_sums)
print(mean_val)
print(std_val)

[100263  99816 100242 ... 101373 101490 103403]
50.5113081
28.86255744259116
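
As a sanity check on the output above, the expected values for integers drawn uniformly from 1 to 100 can be worked out by hand. The sketch below uses the standard formulas for a discrete uniform distribution on {1, ..., n}, mean (n + 1) / 2 and variance (n² − 1) / 12, and assumes `da.random.randint(1, 101, ...)` excludes the upper bound, as NumPy's `randint` does. The observed values are copied from the output above.

```python
import math

n = 100          # values are drawn from 1..100 inclusive
n_columns = 2000 # each row sum adds 2000 values

expected_mean = (n + 1) / 2
expected_std = math.sqrt((n**2 - 1) / 12)
expected_row_sum = n_columns * expected_mean

# Values copied from the notebook output above.
observed_mean = 50.5113081
observed_std = 28.86255744259116

print(f"Expected mean:    {expected_mean}")
print(f"Expected std:     {expected_std:.4f}")
print(f"Expected row sum: {expected_row_sum:.0f}")
print(f"Mean difference:  {abs(observed_mean - expected_mean):.4f}")
print(f"Std difference:   {abs(observed_std - expected_std):.4f}")
```

The observed mean and standard deviation sit within about 0.01 of the theoretical values, and the printed row sums scatter around 101,000, which is what random sampling at this size should produce.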


### Question 05: Optimising Chunk Size

The chunk size significantly affects Dask performance. Create a Dask array with 100,000 random numbers and test three different chunk sizes: 1,000 (many small chunks), 10,000 (medium chunks), and 50,000 (few large chunks).

For each configuration, measure the time to compute `mean(sin(x) + cos(x))`. Which chunk size performed best? Explain why in a comment.

In [8]:
# Please write your answer here.
import dask.array as da
import numpy as np
import time

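chunk_sizes = [1000, 10000, 50000]

for cs in chunk_sizes:
    x = da.random.random(100000, chunks=cs)
    expr = da.mean(da.sin(x) + da.cos(x))  # lazy: this only builds the task graph

    start = time.perf_counter()  # monotonic clock, better suited to short timings
    expr.compute()
    end = time.perf_counter()

    print(f"Chunk size {cs}: {end - start:.4f} seconds")

# In the run recorded below, chunks of 50,000 were fastest and chunks of 1,000
# slowest. The array is small (100,000 float64 values, about 800 KB), so the
# arithmetic is cheap and scheduling overhead dominates: 100 chunks create far
# more tasks to schedule than 2 chunks. The first iteration may also include
# one-time warm-up cost, so repeating the measurement would make it more robust.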

Chunk size 1000: 0.0387 seconds
Chunk size 10000: 0.0065 seconds
Chunk size 50000: 0.0045 seconds
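
The timings above follow the number of chunks each configuration creates. The arithmetic below is a rough sketch: it assumes 8 bytes per element (`float64`, the dtype `da.random.random` produces) and counts chunks only, since the exact number of tasks Dask schedules per chunk depends on how it fuses the graph.

```python
import math

n_elements = 100_000
bytes_per_element = 8  # float64

chunk_counts = {}
for chunk_size in (1_000, 10_000, 50_000):
    n_chunks = math.ceil(n_elements / chunk_size)
    chunk_kb = chunk_size * bytes_per_element / 1000
    chunk_counts[chunk_size] = n_chunks
    print(f"chunk size {chunk_size:>6}: {n_chunks:>3} chunks of {chunk_kb:.0f} kB each")
```

With 1,000-element chunks, each of the 100 chunks holds only 8 kB of data, so the scheduler spends more time coordinating tasks than the tasks spend computing. For an array this small, fewer and larger chunks win; for arrays that do not fit in memory, the trade-off shifts back toward smaller chunks.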


### Question 06: Reading Parquet Files with Column Selection

The `data` folder contains Parquet files for multiple countries. Using Dask, read **all Parquet files at once** (`data/*.parquet`), but load only the `year` and `population` columns.

Calculate the total world population for each year across all countries and display the results sorted by year.

In [9]:
# Please write your answer here.
import dask.dataframe as dd

df = dd.read_parquet("data/*.parquet", columns=["year", "population"])

result = (
    df.groupby("year")["population"]
      .sum()
      .compute()
      .sort_index()
)

print(result)

year
1945     566994202
1946     596804909
1947     606569895
1948     637303888
1949     644613118
           ...    
2019    1893887207
2020    1959057915
2021    1928046753
2022    1985837056
2023    1980706538
Name: population, Length: 79, dtype: int64


### Question 07: Dask SQL with Multiple Conditions

Load the `data.csv` file into a Dask DataFrame and register it as a SQL table. Write a SQL query that:

1. Selects countries where `gdp_per_capita` was between 10000 and 50000
2. Filters for years between 2000 and 2020
3. Orders results by `gdp_per_capita` in descending order
4. Limits to the top 2 results

Execute the query and display the results.

In [20]:
import dask.dataframe as dd
from dask_sql import Context

df = dd.read_csv("data/data.csv")

c = Context()
c.create_table("gdp", df)  # "gdp" is the table name we'll use in SQL

query = """
SELECT *
FROM gdp
WHERE gdp_per_capita BETWEEN 10000 AND 50000
  AND year BETWEEN 2000 AND 2020
ORDER BY gdp_per_capita DESC
LIMIT 2
"""

result = c.sql(query).compute()
print(result)

    country  year  gdp_per_capita  population
294     USA  2002    48942.492140   284207485
295     USA  2003    47607.365171   277711486


### Question 08: Dask SQL with Aggregation

Using the same `data.csv` file, write a SQL query that calculates:

1. The average GDP per capita for each country
2. The minimum and maximum years in the dataset for each country

Group by country and display all results.

In [18]:
# Please write your answer here.
query = """
SELECT
  country,
  AVG(gdp_per_capita) AS avg_gdp,
  MIN(year) AS min_year,
  MAX(year) AS max_year
FROM gdp
GROUP BY country
"""

agg_result = c.sql(query).compute()
print(agg_result)

  country       avg_gdp  min_year  max_year
0  Brazil   5496.292031      1945      2023
1   India   1251.704443      1945      2023
2      UK  27496.851363      1945      2023
3     USA  40189.822290      1945      2023


### Question 09: Generating `requirements.txt` and `environment.yml` Files

Write the commands to:

1. Export your current environment's packages to a `requirements.txt` and an `environment.yml` file
2. Show how someone else would install these exact dependencies in these two cases

Explain each step with comments. It is not necessary to run the code.

In [None]:
%%bash
# Please write your answer here.
# These are shell commands, so the cell starts with `%%bash`; without it they are parsed as Python.

# 1. Export the current environment:

# Export all installed Python packages, with pinned versions, to requirements.txt.
# This is a snapshot of the environment for pip users.
# Note: inside a conda environment, `pip freeze` can emit local `@ file://` paths;
# `pip list --format=freeze` is a common workaround in that case.
pip freeze > requirements.txt

# Export the full conda environment, including package versions and channels,
# to environment.yml. This is the standard way to share conda environments.
conda env export > environment.yml

# 2. Someone else recreates this setup:

# Install everything from requirements.txt using pip.
# This reproduces the Python packages, but not Python itself or non-Python libraries.
pip install -r requirements.txt

# Create a conda environment from the YAML file.
# This reproduces the environment almost exactly: same packages, same versions.
conda env create -f environment.yml

### Question 10: Troubleshooting a Broken Dockerfile

The following Dockerfile has several errors. Identify and fix 5 issues, then explain what was wrong with each line:

INCORRECT DOCKERFILE:
```dockerfile
# Broken Dockerfile - Fix the errors
from ubuntu

RUN apt install python3 python3-pip
RUN pip install numpy pandas

COPY . .
EXPOSE 8888
RUN ["python3", "app.py"]
```
CORRECTED DOCKERFILE:
```dockerfile
# 1. Use a proper base image name and tag
FROM ubuntu:24.04

# 2. Install Python and pip in a single RUN, update first, and clean up
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# 3. Set a working directory for the app
WORKDIR /app

# 4. Copy project files into the image
COPY . .

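# 5. Install Python dependencies without keeping pip's cache.
# Ubuntu 24.04 marks the system Python as externally managed (PEP 668), so a
# system-wide pip install needs --break-system-packages (or a virtual environment).
RUN pip3 install --no-cache-dir --break-system-packages numpy pandas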

# 6. Expose the port the app will run on
EXPOSE 8888

# 7. Use CMD so this runs when the container starts
CMD ["python3", "app.py"]
```
Write the corrected Dockerfile and list each error with its fix.
1. `from ubuntu`
The instruction keyword should be uppercase (`FROM`, not `from`). Docker accepts lowercase keywords, but uppercase is the convention. More importantly, no tag is specified (`ubuntu` instead of something like `ubuntu:24.04`), which hurts reproducibility.

2. `RUN apt install python3 python3-pip`
Problems: It uses `apt` instead of `apt-get`, which is what the Docker docs and the course use. It does not run `apt-get update` first, so the package index may be missing or stale. It also lacks `-y`, so the build can fail or hang waiting for interactive confirmation.

3. `RUN pip install numpy pandas`
Problems: It uses bare `pip`, which might be linked to the wrong Python, and it does not disable pip's cache (`--no-cache-dir`), which makes the image larger. There is also no clear place where the app lives (no `WORKDIR` yet).

4. Missing `WORKDIR` before `COPY . .`
Problem: Without a `WORKDIR`, `COPY . .` puts the files in the root of the filesystem (`/`), which is bad practice and confusing. The app has no clear home directory.

5. `RUN ["python3", "app.py"]`
Problem: `RUN` is a build-time instruction; it runs while the image is being built and then finishes. We want `app.py` to run when the container starts, not during the image build. For that we should use `CMD` (or `ENTRYPOINT`).

There are two further issues that could be fixed:

6. The Dockerfile installs packages using `apt` but never clears the package lists or cache, which makes the final image larger.

7. The `RUN` instructions mix shell form and exec form, which can cause confusing differences in behavior; the lecture recommends using the shell form consistently.

### Question 11: Writing a Dockerfile to Install Software on a Base Image

Create a Dockerfile that starts from an Ubuntu image and installs the following software:

- Git version 2.43.0-1ubuntu7.1
- SQLite version 3.45.1-1ubuntu2

Ensure that you specify the exact versions of the packages by checking their versions after installation. Include commands to clean up the package manager cache after installation to reduce the image size.

#### Please write your answer here. You can use ```dockerfile to format your code

```dockerfile
FROM ubuntu:24.04

RUN apt-get update && \
    apt-get install -y \
        git=1:2.43.0-1ubuntu7.1 \
        sqlite3=3.45.1-1ubuntu2 \
        libsqlite3-0=3.45.1-1ubuntu2 && \
    git --version && \
    sqlite3 --version && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
```

### Question 12: Dockerfile for a Jupyter Data Science Environment

Create a Dockerfile starting from Ubuntu that:

1. Installs Python 3.11 and pip
2. Installs `jupyterlab`, `numpy`, `pandas`, `matplotlib`, and `scikit-learn` with specific versions of your choice
3. Sets the working directory to `/home/analyst/notebooks`
4. Exposes port 8888
5. Starts JupyterLab with `--no-browser` and `--ip=0.0.0.0`

Clean up apt cache to reduce image size.

#### Please write your answer here. You can use ```dockerfile to format your code

```dockerfile
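# Ubuntu 24.04 ships Python 3.12 and has no python3.11 package in its default
# repositories, so this uses 22.04, which provides python3.11 in "universe"
# (assumption: the package is still available when the image is built).
FROM ubuntu:22.04

# python3.11-venv provides the venv module; the virtual environment created
# below brings its own pip, which avoids touching the system Python.
RUN apt-get update && \
    apt-get install -y python3.11 python3.11-venv && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

WORKDIR /home/analyst/notebooks

# Pinned versions (one possible choice; any mutually compatible set works).
RUN pip install --no-cache-dir \
    jupyterlab==4.1.5 \
    numpy==1.26.4 \
    pandas==2.2.1 \
    matplotlib==3.8.3 \
    scikit-learn==1.4.1.post1

EXPOSE 8888

# --allow-root is needed because the container runs as root by default.
CMD ["jupyter", "lab", "--no-browser", "--ip=0.0.0.0", "--notebook-dir=/home/analyst/notebooks", "--allow-root"]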
```