# PDM Big Data course - final exam

**Before you start, please share this notebook with me (belhalfaoui@gmail.com).**

## 0. Setup
The following code downloads the dataset, preprocesses it, and writes it to the `data` folder. It is ready to run as is. You will need the data for the practical part (section 2).

Since it takes a couple of minutes to finish, run it now and start answering the preliminary questions while it runs.

In [None]:
!pip install pyspark

In [None]:
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [None]:
!wget https://github.com/CSSEGISandData/COVID-19/archive/master.zip

In [None]:
!unzip -o -q master.zip

In [None]:
!rm -rf data
!mkdir data

In [None]:
import shutil
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm

DATA_IN = Path("COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports")
DATA_OUT = Path("data")

for file_path in tqdm(list(DATA_IN.iterdir()), desc="Preprocessing"):
    # Non-CSV files are copied through unchanged.
    if file_path.suffix != '.csv':
        shutil.copy(file_path, DATA_OUT / file_path.name)
        continue
    df = pd.read_csv(file_path)
    # File names follow MM-DD-YYYY.csv; rebuild the date as YYYY-MM-DD
    # so that string comparison matches chronological order.
    month, day, year = file_path.stem.split('.')[0].split('-')
    date = f'{year}-{month}-{day}'
    # Skip reports dated after 2021-01-20.
    if date > '2021-01-20':
        continue
    # Harmonize column names that differ between report formats.
    df.rename(columns={'Lat': 'Lat_',
                       'Province/State': 'Province_State',
                       'Country/Region': 'Country_Region',
                       'Last Update': 'Last_Update'}, inplace=True)
    # Strip commas and double quotes from string values.
    df = df.replace(',', '', regex=True).replace('"', '', regex=True)
    df = df[['Province_State', 'Country_Region', 'Last_Update',
             'Confirmed', 'Deaths', 'Recovered']].fillna(0)

    df['Date'] = date
    df.to_csv(DATA_OUT / file_path.name, index=False)
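The date cutoff in the preprocessing loop compares strings, not date objects. This works because the `MM-DD-YYYY` file name is rearranged into `YYYY-MM-DD`, a format in which alphabetical order matches chronological order. Here is a minimal sketch of that step in plain Python, using a hypothetical helper `iso_date_from_stem` and made-up file names that follow the same pattern:

```python
from pathlib import Path

def iso_date_from_stem(file_path):
    """Turn a 'MM-DD-YYYY.csv' file name into a 'YYYY-MM-DD' string (hypothetical helper)."""
    month, day, year = Path(file_path).stem.split('-')
    return f'{year}-{month}-{day}'

# Illustrative file names, following the MM-DD-YYYY.csv pattern the loop assumes.
names = ['01-21-2021.csv', '12-31-2020.csv', '01-20-2021.csv']
dates = [iso_date_from_stem(n) for n in names]

# ISO-formatted strings sort chronologically, so a plain string
# comparison acts as a date filter.
kept = [d for d in dates if d <= '2021-01-20']
print(sorted(kept))
```

The same property is useful later on: any `max` or sort applied to the `Date` column as text should behave like a chronological one.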

## 1. Preliminary questions

---
**Note**

In this part, you are expected to write Spark code using only RDD methods (`filter`, `map`, `reduceByKey`, etc.).

However, if your Spark code uses other methods but still produces the expected results, you will get half of the points.

---

### 1.1. About _replication_ and _sharding_:
* What problem(s) does each of them try to solve?
* Give one drawback of each.

[YOUR ANSWER HERE]

### 1.2. Among the PySpark methods `filter`, `map`, `count` and `take`, which ones are _lazy_ and which ones are _actions_? Which ones transfer data onto the master, and which ones do not?

[YOUR ANSWER HERE]

### 1.3. Here are three separate Spark code snippets (where `sc` is a Spark context). For each of them, how many times will the `numbers.txt` file actually be read from disk? Please explain each answer in a few words.
(a)
```
sc.textFile("numbers.txt")
```
(b)
```
sc.textFile("numbers.txt").cache().max()
sc.textFile("numbers.txt").cache().count()
```
(c)
```
rdd = sc.textFile("numbers.txt")
rdd.max()
rdd.count()
```

[YOUR ANSWER HERE]

## 2. Practical example

### 2.1. Read all CSV files from the `data` folder into a single Spark RDD, and count the total number of rows.

In [None]:
# [YOUR CODE HERE]

### 2.2. Find the last (most recent) date in the data set.


In [None]:
# [YOUR CODE HERE]

### 2.3. Compute the number of deaths per day worldwide.

In [None]:
# [YOUR CODE HERE]

### 2.4. Compute the mortality rate per country (over the whole time period).
The mortality rate is defined as the number of deaths divided by the number of confirmed cases.

NB: The expected answer only reads the CSV files once. But if you come up with a two-pass solution, you will still get half of the points.
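As an illustration of the definition only (plain Python on made-up totals, not the expected Spark solution), the ratio can be sketched as follows. The helper name `mortality_rate` and the choice to return `None` when there are no confirmed cases are assumptions; the zero case can arise because missing values were filled with 0 during preprocessing.

```python
def mortality_rate(deaths, confirmed):
    """Deaths divided by confirmed cases; None when there are no confirmed cases."""
    if confirmed == 0:
        return None  # assumption: the rate is treated as undefined in this case
    return deaths / confirmed

# Made-up (deaths, confirmed) totals for two fictional countries.
totals = {'A': (50, 1000), 'B': (0, 0)}
rates = {country: mortality_rate(d, c) for country, (d, c) in totals.items()}
print(rates)
```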

In [None]:
# [YOUR CODE HERE]

## 3. (bonus points) Same as questions 2.1, 2.2, 2.3 and 2.4, but using DataFrames instead of RDDs.

In [None]:
# [YOUR CODE HERE]