# Distributed computation
## ESIPE — INFO 3 — Option Logiciel
<style type="text/css">
    .question {
        background-color: yellow;
    }
</style>

# Lab 3: Parking meter device analysis with Apache Spark

In this lab, we will analyse the parking meters of Paris for the year 2014. The dataset is composed of two parts:
* The parking meter devices
* The transactions

All the data needed for this evaluation are located in the directory `data`.

This notebook is divided into 5 parts:
* PART 1: Initiate the environment (/1)
* PART 2: Get and analyse the device dataset (/3)
* PART 3: Get and analyse the transaction dataset (/7)
* PART 4: Joining devices and transactions (/5)
* PART 5: Analytics (/4)

All questions are highlighted in <span style="background-color: yellow">yellow</span>. They have to be answered using Spark Core / SQL / ML features. An indicative mark is given for each question (total / 20).

During this evaluation, you may use any resource, including the internet, course lectures and labs. The use of online messaging and online drives is not permitted during this session.

<span style="background-color: #ffbbaa;">**Do not forget to save your whole notebook often.**</span>

In [None]:
import $ivy.`org.apache.spark::spark-sql:3.3.1`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

## PART 1: Initiate the environment (/1)
To do our analysis, we will use Spark SQL.

<span style="background-color: yellow;">Create a NotebookSparkSession and assign it to the variable `spark`.</span>

In [None]:
import org.apache.spark.sql._

val spark = ???
val sc = spark.sparkContext

println(spark.version)

If necessary, the Spark UI is available at http://localhost:4040/ or http://localhost:4041/.

We will also need several Spark SQL tools. Run the cell below.

In [None]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

## PART 2: Get and analyse the device dataset (/3)

### Read parking meter files (/0.5)
The parking meter devices are stored in a JSON file of 4.5MB.

<span style="background-color: yellow;">Read file `data/horodateurs-mobiliers.json` and store it in a variable named `raw_parkmeters`. Display its content by using the method `.show()`.</span>

In [None]:
// read "data/horodateurs-mobiliers.json"
val raw_parkmeters = ???


### Display schema (/0.5)
The file comes with nested records. We will need to simplify its structure.

To understand its structure, <span style="background-color: yellow;">display the schema of `raw_parkmeters`.</span>

In [None]:
raw_parkmeters.???

### Simplify dataframe (/2)
We are only interested in the following fields:
* `numhoro`: parking meter number (it must be renamed to `parkmeter_id`)
* `arrondt`: district number in Paris (it must be renamed to `district`)
* `regime`: pricing mode (it must be renamed to `type`); `MIX` = includes specific rules for inhabitants (_résident_), `ROT` = everyone follows the same rules
* `zoneres`: residential area (it must be renamed to `area`)

<span style="background-color: yellow;">Create a new dataframe from `raw_parkmeters` named `parkmeters`, that includes only the fields shown above.</span>

In [None]:
val parkmeters = raw_parkmeters.???

## PART 3: Get and analyse the transaction dataset (/7)

### Read all the files (/0.5)
<span style="background-color: yellow;">Read all the files in `data/horodateurs-transactions-de-paiement` directory in a single command to create a dataframe named `raw_transactions`.</span>

Note that the files have a header and that the semicolon (`;`) is used as the field delimiter. For the delimiter, use the option `.option("delimiter", ";")`.

In [None]:
val raw_transactions = spark.???

### Display the content (/0.5)
<span style="background-color: yellow;">Display its content by using the method `.show()`.</span>

In [None]:
raw_transactions.???

Note: here `usager` designates the users of the parking meters. They are either `Résident` (possibly displayed as `R�sident`) if they are inhabitants, or `Rotatif` if they are considered occasional visitors.

### The schema (/0.5)
<span style="background-color: yellow;">Now, display the schema of the dataset.</span>

In [None]:
raw_transactions.???

### Cleaning (/5)
The dataset has a few shortcomings:
* Every column is a string in this schema
* Some column names contain strange characters
* Numbers are in French format
* Timestamps are in French format too

To improve the dataset, we will write two functions:
* `toDouble`, which takes a column representing a number, replaces "," by "." and casts it to DoubleType (you will need the Spark SQL function `translate`)
* `toTimestamp`, which takes a column representing a timestamp with the format `"dd/MM/yyyy HH:mm:ss"` and converts it to a Unix timestamp (you will need the Spark SQL function `unix_timestamp` with two parameters). A Unix timestamp is in seconds.
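
To make the two input formats concrete, here is a minimal sketch in plain Scala (no Spark) of the same conversions applied to single strings. It is not the expected answer, which must use Spark SQL column functions. The object `FrenchFormats` and its method names are hypothetical, and reading the text as UTC is an assumption: it happens to match the expected values used in the unit tests below, whereas Spark's `unix_timestamp` relies on the session time zone.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper object, for illustration only (not part of the lab).
object FrenchFormats {
  // Same pattern as the one given for `toTimestamp`.
  private val formatter = DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss")

  // "3,4" -> 3.4: swap the decimal comma for a dot, then parse.
  def parseFrenchDouble(text: String): Double =
    text.replace(',', '.').toDouble

  // Seconds since 1970-01-01T00:00:00Z, reading the text as UTC (assumption).
  def parseFrenchTimestamp(text: String): Long =
    LocalDateTime.parse(text, formatter).toEpochSecond(ZoneOffset.UTC)
}
```

For instance, `FrenchFormats.parseFrenchTimestamp("31/01/2014 15:09:33")` gives `1391180973`, the first expected value of the `toTimestamp` unit test.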

But first, run the cell below, which creates a function that simplifies the writing of unit tests.

In [None]:
// Applies `function` to the first column of `df` and compares the result with the second column.
def test_function(function: Column => Column, df: DataFrame): Unit = {
  val test_df = df.toDF("data", "expected")
  val result = test_df
    .withColumn("result", function(col("data")))
    .withColumn("succeed", col("expected") === col("result"))
  result.show()
}

#### ToDouble function (/1)
<span style="background-color: yellow;">Complete the function `toDouble`.</span>

In [None]:
def toDouble(column: Column): Column = ???

// Unit test
val data_expected = Seq(
    ("1,0", 1.0),
    ("3,4", 3.4),
    ("0,65", 0.65)
).toDF()
test_function(toDouble, data_expected)

#### ToTimestamp function (/1)
<span style="background-color: yellow;">Complete the function `toTimestamp`.</span>

In [None]:
def toTimestamp(column: Column): Column = ???

// Unit tests
val data_expected = Seq(
    ("31/01/2014 15:09:33", 1391180973),
    ("24/01/2014 13:56:24", 1390571784),
    ("26/01/2014 19:21:09", 1390764069)
).toDF
test_function(toTimestamp, data_expected)

#### Cleaning process (/3)
Now do the cleaning:
* `horodateur` needs to be renamed to `parkmeter_id`
* `montant carte` needs to be converted into a number and renamed to `amount`
* `début stationnement` needs to be converted into a timestamp and renamed to `parking_start`
* `fin stationnement` needs to be converted into a timestamp and renamed to `parking_end`

You will also add a column `duration`, which is the difference between `parking_end` and `parking_start`. Make sure that `duration` is in hours, knowing that `parking_start` and `parking_end` are in seconds.
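
As a reminder of the unit conversion, here is a minimal plain-Scala sketch (no Spark) of the arithmetic behind `duration`. The object `DurationSketch` and the method `durationHours` are hypothetical names used only for illustration. Dividing by `3600.0` rather than `3600` avoids integer truncation in plain Scala.

```scala
// Hypothetical helper object, for illustration only (not part of the lab).
object DurationSketch {
  // One hour is 60 * 60 seconds.
  val SecondsPerHour: Double = 3600.0

  // Both arguments are Unix timestamps in seconds; the result is in hours.
  def durationHours(startSeconds: Long, endSeconds: Long): Double =
    (endSeconds - startSeconds) / SecondsPerHour
}
```

For example, a parking session that ends 5400 seconds after it starts lasts 1.5 hours.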

We only want transactions for users marked as `Rotatif`.

<span style="background-color: yellow;">Start from `raw_transactions` and apply all the cleaning rules seen above to create the dataframe `transactions`.</span>

In [None]:
val transactions = raw_transactions.???

<span style="background-color: yellow;">Use the `.show()` method to display the content of `transactions`.</span>

In [None]:
transactions.???

### Number of records (/0.5)

<span style="background-color: yellow;">Display the number of records in `transactions`.</span>

In [None]:
transactions.???

## PART 4: Joining devices and transactions (/5)
Now that we have the device locations and the transactions, we can merge these two datasets and run different analyses.

### Joining (/2)
<span style="background-color: yellow;">Create a dataframe named `parkmeter_transactions`, that joins the dataframes `parkmeters` and `transactions`.</span>

* Keep only the following columns: `"parkmeter_id", "district", "area", "duration", "parking_start", "parking_end", "amount"`
* Beware! Some columns are defined both in `transactions` and in `parkmeters`. Depending on the way you reference such a column, Spark may consider the reference ambiguous and fail.

In [None]:
val parkmeter_transactions = ???


### Save the join (/1)
Before going further, given the size of the data, the relative heaviness of the processing, and the limited resources of the machine you are working on, it is preferable to store the data in a Parquet file first.

Once written, this file will be used as a checkpoint. So, **if something goes wrong in your notebook, you can start again from the cell below that reads the Parquet file.**

<span style="background-color: yellow;">Store the `parkmeter_transactions` dataframe in the Parquet file `parkmeter_transactions.parquet`.</span>

In [None]:
parkmeter_transactions.???

<span style="background-color: yellow;">Now load the file in `parkmeter_transactions`.</span>

In [None]:
val parkmeter_transactions = ???

### First analysis of parkmeter_transactions (/2)

We will now analyse the dataframe `parkmeter_transactions`. For that, we will use the method `.describe()` available on dataframes. `.describe()` returns a dataframe with statistics on the different columns.

<span style="background-color: yellow;">Use `.describe()` on `parkmeter_transactions` and display its result.</span>

In [None]:
parkmeter_transactions.???

In [None]:
parkmeter_transactions.select($"parking_end").orderBy("parking_end").show

The `count` row shows the number of non-null elements for each column.

What can you identify from the result of `.describe()`?

<span style="background-color: yellow;">Remove the rows with undesirable values from `parkmeter_transactions` and store the result in `parkmeter_transactions_updated`.</span>

In [None]:
val parkmeter_transactions_updated = parkmeter_transactions.???

## PART 5: Analytics (/4)

### Number of transactions (/2)
<span style="background-color: yellow;">Find the number of transactions per district in Paris. The columns must be "district" and "count_transactions".</span>

In [None]:
val count_transactions = ???

count_transactions.orderBy("district").show()

<span style="background-color: yellow;">Find the number of transactions per area in Paris. The columns must be "area" and "count_transactions".</span>

In [None]:
val count_transactions = ???

count_transactions.show()

### Average transaction amount (/2)
<span style="background-color: yellow;">Find the average transaction amount per district in Paris. The columns must be "district" and "avg_amount".</span>
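
To clarify what "average per district" means, here is a minimal plain-Scala sketch (no Spark) of a grouped average over an in-memory collection. It is not the expected answer, which must use Spark SQL. The object `GroupingSketch`, the method `averageByKey` and the sample values are all hypothetical and invented for illustration; they do not come from the dataset.

```scala
// Hypothetical helper object, for illustration only (not part of the lab).
object GroupingSketch {
  // Invented (district, amount) pairs, not taken from the real data.
  val sample: Seq[(Int, Double)] = Seq((1, 2.0), (1, 4.0), (2, 3.0))

  // Group the pairs by key, then divide the sum of each group by its size.
  def averageByKey(rows: Seq[(Int, Double)]): Map[Int, Double] =
    rows.groupBy(_._1).map { case (key, group) =>
      key -> (group.map(_._2).sum / group.size)
    }
}
```

With the invented sample, district 1 averages `(2.0 + 4.0) / 2 = 3.0` and district 2 averages `3.0`.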

In [None]:
val avg_amount = parkmeter_transactions.???

avg_amount.orderBy("district").show()

<span style="background-color: yellow;">Find the average transaction amount per area in Paris. The columns must be "area" and "avg_amount".</span>

In [None]:
val avg_amount = parkmeter_transactions.???

avg_amount.show()

## End