# Raw data preprocessing notebook

Activate Spark for this Almond Session

In [None]:
import $ivy.`org.apache.spark::spark-sql:2.4.0`

Turn off Spark's log4j logging so that long log messages do not clutter the cell outputs

In [None]:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

Create the SparkSession

Note the master setting: `local[*]` runs Spark locally, using all available cores.

In [None]:
import org.apache.spark.sql._
// Needed later for col, collect_list and min
import org.apache.spark.sql.functions._

val sparkSession = {
  NotebookSparkSession.builder()
    .master("local[*]")
    .getOrCreate()
}

Create a variable storing the data directory

Create a variable alias for the SparkSession

Import the implicits needed to reference columns with the `$` prefix

In [None]:
val dataDir = System.getenv("HOME") + "/data/history"
val spark   = sparkSession

In [None]:
import spark.implicits._

#### Exercise:

Use a system call to look at the file content, as with `head <file>`
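
One possible approach is a minimal sketch with `scala.sys.process`. The temporary file and its rows are invented for illustration (the real files live in `dataDir`), and the sketch assumes a `head` command is available on the machine.

```scala
import scala.sys.process._
import java.nio.file.Files

// Hypothetical sample file standing in for one of the CSV files in dataDir.
// The three rows below are invented; the real column formats may differ.
val sampleFile = Files.createTempFile("sample-history", ".csv")
Files.write(sampleFile, "AAA,t1,1.0,1.2,0.9,1.1,100\nBBB,t1,2.0,2.2,1.9,2.1,200\nAAA,t2,1.1,1.3,1.0,1.2,150\n".getBytes("UTF-8"))

// Run `head -n 2 <file>` and capture its standard output as a String.
val firstLines = Seq("head", "-n", "2", sampleFile.toString).!!
println(firstLines)
```

For the notebook's own files, a shell is needed to expand the glob, for instance `Seq("sh", "-c", s"head -n 5 $dataDir/*.csv").!!`.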



### Read CSV files into a DataFrame

* No header
* Infer a schema
* Name the columns

In [None]:
val rawDF = spark.read.format("csv")
        .option("header", "false")
        .option("inferSchema", "true")
        .load(s"${dataDir}/*.csv")
        .toDF("instrument","timestamp","open","high","low","close","volume")

##### Count the number of records

In [None]:
rawDF.count

### Find distinct instruments

In [None]:
rawDF.select("instrument").distinct

### Count the number of distinct timestamps

Use the count aggregate function.

Columns are identified with one of:

* `col("<column name>")`
* `$"<column name>"`
* `"<column name>"`

In [None]:
// ...
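
As a hedged illustration on toy data (not the exercise's reference solution), the sketch below counts distinct timestamps in three equivalent ways. The `session` and `toyDF` names and the rows are invented; in the notebook, `$"timestamp"` can be used wherever `col("timestamp")` appears, since `spark.implicits._` is already imported.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct}

// Reuses the running session if there is one, otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").getOrCreate()

// Invented rows: two instruments, two distinct timestamps.
val toyDF = session.createDataFrame(Seq(
  ("AAA", "t1"), ("BBB", "t1"), ("AAA", "t2")
)).toDF("instrument", "timestamp")

// Column passed as a Column object.
val viaCol = toyDF.agg(countDistinct(col("timestamp"))).first.getLong(0)
// Column passed as a plain string name.
val viaName = toyDF.agg(countDistinct("timestamp")).first.getLong(0)
// Without an aggregate function: deduplicate the column, then count the rows.
val viaDistinct = toyDF.select("timestamp").distinct.count
```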

### Count the number of timestamps for each instrument

Group by instrument, then count

In [None]:
// ...
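
A minimal sketch of group-then-count on toy data; the rows and the `session`, `toyDF` and `perInstrument` names are assumptions made for illustration. `groupBy(...).count` adds a `count` column holding the number of rows in each group.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one, otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").getOrCreate()

// Invented rows: "AAA" has two timestamps, "BBB" has one.
val toyDF = session.createDataFrame(Seq(
  ("AAA", "t1"), ("BBB", "t1"), ("AAA", "t2")
)).toDF("instrument", "timestamp")

// One output row per instrument, with the number of rows in that group.
val perInstrument = toyDF.groupBy("instrument").count()

// Collect the small result to the driver as a Map for easy inspection.
val countsByInstrument = perInstrument.collect()
  .map(row => row.getString(0) -> row.getLong(1))
  .toMap
```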

### Count the number of occurrences of each timestamp

In [None]:
rawDF.groupBy("timestamp").count
     .toDF("ts", "ts_count")
     .groupBy("ts_count").count
     .orderBy($"ts_count".asc)

## Use of SQL queries

- Create a temporary view to associate a DataFrame with a table name, e.g.:

```
df.createOrReplaceTempView("people")
```

- Run SQL queries, e.g.:

```
spark.sql("SELECT count(*) FROM people")
```
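
To make these two steps concrete, here is a sketch that registers a toy DataFrame as a view and computes, in SQL, a histogram of timestamp occurrences like the one built earlier with the DataFrame API. The view name `toy_history`, the rows and the `session` name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one, otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").getOrCreate()

// Invented rows: "t1" appears twice, "t2" and "t3" once each.
val toyDF = session.createDataFrame(Seq(
  ("AAA", "t1"), ("BBB", "t1"), ("AAA", "t2"), ("BBB", "t3")
)).toDF("instrument", "timestamp")

// "toy_history" is a hypothetical view name used only in this sketch.
toyDF.createOrReplaceTempView("toy_history")

// Inner query: occurrences per timestamp. Outer query: how many timestamps share each count.
val histogram = session.sql("""
  SELECT ts_count, count(*) AS n
  FROM (
    SELECT `timestamp` AS ts, count(*) AS ts_count
    FROM toy_history
    GROUP BY `timestamp`
  ) AS per_ts
  GROUP BY ts_count
  ORDER BY ts_count
""")

val result = histogram.collect().map(r => (r.getLong(0), r.getLong(1))).toSeq
```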



### Data cleaning: only keep timestamps with the 5 instruments

These counts show that some `timestamps` have duplicate or missing `instruments`.

We need to remove duplicate rows, then identify the timestamps that have all 5 instruments.

TODO: Write the SQL equivalent...

In [None]:
(rawDF.count, rawDF.distinct.count)

In [None]:
rawDF.distinct
     .groupBy("timestamp").agg(collect_list($"instrument"))
     .toDF("ts", "instruments")
     .count

In [None]:
rawDF.distinct
     .groupBy("timestamp").agg(collect_list($"instrument"))
     .toDF("ts", "instruments")
     .filter("size(instruments) != 5")
     .count

In [None]:
val timestamps = rawDF.distinct
     .groupBy("timestamp").agg(collect_list($"instrument"))
     .toDF("ts", "instruments")
     .filter("size(instruments) == 5")
     .select("ts")

Keeping only complete data leaves 10067 `timestamps`.

### Inner join `timestamps` to filter data


In [None]:
val filtered = rawDF.distinct
                    .join(timestamps, $"timestamp" === $"ts")
                    .select("timestamp", "instrument", "close")
//.count
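
The cleaning and join steps can be sketched end to end on toy data. Everything named `toy...`, `session`, `deduped`, `completeTs` and `expectedInstruments` is invented for this illustration, and the toy set has 2 instruments where the notebook has 5.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, size}

// Reuses the running session if there is one, otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").getOrCreate()

// Invented rows: "t1" has both instruments (one row duplicated), "t2" has only one.
val toyRaw = session.createDataFrame(Seq(
  ("AAA", "t1", 1.0), ("AAA", "t1", 1.0), ("BBB", "t1", 2.0), ("AAA", "t2", 3.0)
)).toDF("instrument", "timestamp", "close")

val expectedInstruments = 2 // 5 in the notebook

// Drop exact duplicate rows first.
val deduped = toyRaw.distinct

// Keep the timestamps whose instrument list has the expected size.
val completeTs = deduped.groupBy("timestamp")
  .agg(collect_list(col("instrument")).as("instruments"))
  .filter(size(col("instruments")) === expectedInstruments)
  .select(col("timestamp").as("ts"))

// The inner join keeps only rows whose timestamp is in completeTs.
val toyFiltered = deduped.join(completeTs, col("timestamp") === col("ts"))
  .select("timestamp", "instrument", "close")
```

Here only the two deduplicated `t1` rows survive, because `t2` lacks one instrument.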

In [None]:
50385/5

In [None]:
filtered.distinct.count
     //.groupBy("timestamp").agg(collect_list($"instrument"))
     //.toDF("ts", "instruments")
     //.count

In [None]:
50335/5

### Pivot the table to get instruments as columns

* Group by timestamp to define the row keys
* Pivot around instruments to define the columns
* Keep the min (or max or avg; only one element is used anyway thanks to the previous filtering)
* Order by timestamp (descending in the cell below)

In [None]:
val data = filtered.groupBy($"timestamp")   
                   .pivot($"instrument")
                   .agg(min("close"))
                   .orderBy($"timestamp".desc)
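
To show what the pivot produces, here is a sketch on toy data. The rows and the `session`, `toyFiltered` and `toyWide` names are invented for illustration. Each distinct instrument becomes a column, and each timestamp becomes one row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, min}

// Reuses the running session if there is one, otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").getOrCreate()

// Invented long-format rows: one close price per (timestamp, instrument).
val toyFiltered = session.createDataFrame(Seq(
  ("t1", "AAA", 1.0), ("t1", "BBB", 2.0),
  ("t2", "AAA", 3.0), ("t2", "BBB", 4.0)
)).toDF("timestamp", "instrument", "close")

// Wide format: one row per timestamp, one column per instrument.
val toyWide = toyFiltered.groupBy(col("timestamp"))
  .pivot("instrument")
  .agg(min("close"))
  .orderBy(col("timestamp").desc)

toyWide.show()
```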

### Save as parquet file

See partitions on disk

In [None]:
// TODO ... too many partitions ...

In [None]:
val dataLocation = System.getenv("HOME") + "/data/cleaned-history.parquet"
// save() uses Spark's default data source, which is parquet unless configured otherwise
data.write.save(dataLocation)
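
One possible way to limit the number of files on disk is to reduce the number of partitions before writing. This is a sketch under assumptions, not the notebook's chosen fix: the toy rows, the temporary output location and the `session`, `toyWide`, `toyLocation` and `partFiles` names are all invented.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one, otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").getOrCreate()

// Invented wide-format rows standing in for the pivoted data.
val toyWide = session.createDataFrame(Seq(
  ("t1", 1.0, 2.0), ("t2", 3.0, 4.0)
)).toDF("timestamp", "AAA", "BBB")

// Hypothetical output location inside a fresh temporary directory.
val toyLocation = Files.createTempDirectory("toy-cleaned").resolve("toy.parquet").toString

// coalesce(1) merges all partitions into one, so a single part file is written.
toyWide.coalesce(1).write.mode("overwrite").parquet(toyLocation)

// List the data files Spark wrote (the equivalent of looking at `ls` output).
val partFiles = new File(toyLocation).listFiles.map(_.getName).filter(_.endsWith(".parquet"))
partFiles.foreach(println)
```

A single partition gives one file but removes write parallelism, so for larger data a small partition count greater than one may be a better trade-off.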

##### TODO:

`ls -l $dataLocation`
