# Raw data preprocessing notebook

Define the data directory and keep a reference to the Spark session

In [ ]:
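import org.apache.spark.sql.functions._   // collect_list, size, ... used in later cells
val dataDir = System.getenv("HOME") + "/data/history"
val spark   = sparkSession
import spark.implicits._                  // enables the $"column" syntax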

### Read the CSV files into a DataFrame

* No header row
* Infer the schema
* Name the columns


In [ ]:
val rawDF = spark.read.format("csv")
        .option("header", "false")
        .option("inferSchema", "true")
        .load(s"${dataDir}/*.csv")
        .toDF("instrument","timestamp","open","high","low","close","volume")

In [ ]:
rawDF.count

### Find distinct instruments

In [ ]:
rawDF.select("instrument").distinct

### Count the number of distinct timestamps

In [ ]:
// ...
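
The cell above is left as a placeholder. One possible way to approach it, as a minimal sketch rather than the notebook's own solution, is to select the `timestamp` column, drop repeats, and count. The `sketchSpark` session, `toyDF` and its rows are made up for illustration; in the notebook the same call would be applied to `rawDF`.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session for this sketch; the notebook already provides `spark`.
val sketchSpark = SparkSession.builder()
  .master("local[*]")
  .appName("distinct-ts-sketch")
  .getOrCreate()
import sketchSpark.implicits._

// Made-up rows: 2 instruments over 3 timestamps, with "B" missing at t3.
val toyDF = Seq(
  ("A", "t1"), ("B", "t1"),
  ("A", "t2"), ("B", "t2"),
  ("A", "t3")
).toDF("instrument", "timestamp")

// Number of distinct timestamps across all instruments.
val distinctTimestamps: Long = toyDF.select("timestamp").distinct.count
println(distinctTimestamps)
```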

### Count the number of timestamps for each instrument

In [ ]:
// ...
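
This cell is also a placeholder. A hedged sketch of one way to do it: group by `instrument` and aggregate. Because the heading could mean either raw rows or distinct timestamps, the sketch computes both. The session name, `toyDF`, and the column aliases `n_rows` and `distinct_ts` are illustrative assumptions, not names from the notebook.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, countDistinct}

// Hypothetical local session for this sketch; the notebook already provides `spark`.
val sketchSpark = SparkSession.builder()
  .master("local[*]")
  .appName("ts-per-instrument-sketch")
  .getOrCreate()
import sketchSpark.implicits._

// Made-up rows: instrument "A" has a repeated (instrument, timestamp) pair.
val toyDF = Seq(
  ("A", "t1"), ("A", "t1"),
  ("A", "t2"),
  ("B", "t1")
).toDF("instrument", "timestamp")

// Per instrument: total rows and number of distinct timestamps.
val perInstrument = toyDF
  .groupBy("instrument")
  .agg(
    count($"timestamp").as("n_rows"),
    countDistinct($"timestamp").as("distinct_ts"))
  .orderBy($"instrument".asc)

perInstrument.show()
```

A gap between `n_rows` and `distinct_ts` for an instrument is a first hint that some timestamps are repeated.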

### Count the occurrences of each timestamp and display the distribution

In [ ]:
rawDF.groupBy("timestamp").count
     .toDF("ts", "ts_count")
     .groupBy("ts_count").count
     .orderBy($"ts_count".asc)

### Data cleaning: keep only timestamps that have all 5 instruments

These counts show that some `timestamps` have duplicate or missing `instruments`.

We need to remove duplicate rows, then identify the timestamps that have all 5 instruments.

In [ ]:
(rawDF.count, rawDF.distinct.count)

In [ ]:
rawDF.distinct
     .groupBy("timestamp").count
     .toDF("ts", "ts_count")
     .groupBy("ts_count").count
     .orderBy($"ts_count".asc)

In [ ]:
rawDF.distinct
     .groupBy("timestamp").agg(collect_list($"instrument"))
     .toDF("ts", "instruments")
     .count

In [ ]:
rawDF.distinct
     .groupBy("timestamp").agg(collect_list($"instrument"))
     .toDF("ts", "instruments")
     .filter("size(instruments) != 5")
     .count

In [ ]:
val timestamps = rawDF.distinct
     .groupBy("timestamp").agg(collect_list($"instrument"))
     .toDF("ts", "instruments")
     .filter("size(instruments) == 5")
     .select("ts")
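
One caveat worth checking: `distinct` removes only rows that are identical in every column. If the same instrument appeared twice at a timestamp with different prices, `collect_list` could still reach size 5 with fewer than 5 different instruments. Whether this happens in the real files is not verified here. The sketch below, on made-up data with hypothetical instrument names `A` to `E`, compares the list-size filter with a `countDistinct` filter.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, countDistinct, size}

// Hypothetical local session for this sketch; the notebook already provides `spark`.
val sketchSpark = SparkSession.builder()
  .master("local[*]")
  .appName("complete-ts-sketch")
  .getOrCreate()
import sketchSpark.implicits._

// Made-up rows: t1 is complete, t2 has 5 rows but "A" twice with different
// closes (so only 4 distinct instruments), t3 has 3 instruments.
val toyDF = Seq(
  ("A", "t1", 1.0), ("B", "t1", 1.0), ("C", "t1", 1.0), ("D", "t1", 1.0), ("E", "t1", 1.0),
  ("A", "t2", 1.0), ("A", "t2", 1.1), ("B", "t2", 1.0), ("C", "t2", 1.0), ("D", "t2", 1.0),
  ("A", "t3", 1.0), ("B", "t3", 1.0), ("C", "t3", 1.0)
).toDF("instrument", "timestamp", "close")

// Same shape as the notebook's filter: size of the collected list.
val byList = toyDF.distinct
  .groupBy("timestamp").agg(collect_list($"instrument").as("instruments"))
  .filter(size($"instruments") === 5)
  .select("timestamp")

// Alternative: count distinct instruments per timestamp.
val byDistinct = toyDF.distinct
  .groupBy("timestamp").agg(countDistinct($"instrument").as("n_instruments"))
  .filter($"n_instruments" === 5)
  .select("timestamp")

println((byList.count, byDistinct.count))
```

If both counts agree on the real data, the list-based filter is sufficient as written.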

Keeping only complete data leaves 10067 `timestamps`

In [ ]:
timestamps.count
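
A natural next step is to restrict the de-duplicated rows to the retained timestamps. The notebook does not show this step, so the following is only a sketch under assumptions: it uses a `left_semi` join on made-up data, and the names `sketchSpark`, `toyDF`, `dedup`, `keep` and `cleaned` are hypothetical. In the notebook, `rawDF.distinct` and `timestamps` would play the roles of `dedup` and `keep`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, size}

// Hypothetical local session for this sketch; the notebook already provides `spark`.
val sketchSpark = SparkSession.builder()
  .master("local[*]")
  .appName("semi-join-sketch")
  .getOrCreate()
import sketchSpark.implicits._

// Made-up rows: t1 has all 5 instruments (plus one exact duplicate row),
// t2 has only 2 instruments.
val toyDF = Seq(
  ("A", "t1"), ("B", "t1"), ("C", "t1"), ("D", "t1"), ("E", "t1"),
  ("A", "t1"),
  ("A", "t2"), ("B", "t2")
).toDF("instrument", "timestamp")

val dedup = toyDF.distinct

// Timestamps with exactly 5 instruments, built as in the cells above.
val keep = dedup
  .groupBy("timestamp").agg(collect_list($"instrument"))
  .toDF("ts", "instruments")
  .filter(size($"instruments") === 5)
  .select("ts")

// A left semi join keeps the rows of `dedup` whose timestamp appears in `keep`
// and adds no columns from `keep`.
val cleaned = dedup.join(
  keep.withColumnRenamed("ts", "timestamp"),
  Seq("timestamp"),
  "left_semi")

println(cleaned.count)
```

A semi join is used here instead of an inner join so that the result has exactly the columns of the left side.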