In [ ]:
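import org.apache.spark.sql.SparkSession
// count and max are used by the aggregation cells further down
import org.apache.spark.sql.functions.{count, max}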

## Download CSV data for the Dow Jones

We download a CSV file containing end-of-day stock prices for 2017 and 2018.
It is a sample taken from the Quandl WIKIPRICES dataset.

In [ ]:
val remote = "https://s3-eu-west-1.amazonaws.com/kensuio-training/data/djia-2017-2018.csv"
val local = "djia-2017-2018.csv"

We set the target local directory and create it.

In [ ]:
val dataDir = sys.props("java.io.tmpdir") + "/data/linear_regression"
new java.io.File(dataDir).mkdirs()

Now we download the file and save its content in the local directory.

In [ ]:
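// Copy the remote file character by character into the local file,
// closing both handles even if the copy fails.
val source = scala.io.Source.fromURL(remote)
val f = new java.io.FileWriter(new java.io.File(s"${dataDir}/$local"), false)
try source.foreach(f.append(_))
finally { f.close(); source.close() }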

In [ ]:
:sh ls -lh ${dataDir}/$local
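
The download step above can also be written as a byte-level copy, which avoids decoding the file into characters. This is a minimal sketch using only the Java standard library; `fetchTo` is a hypothetical helper name, not something the notebook defines.

```scala
import java.io.InputStream
import java.net.URI
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper (an assumption, not part of the original notebook):
// streams the bytes at `url` into `target`, replacing any existing file,
// and returns the number of bytes copied.
def fetchTo(url: String, target: Path): Long = {
  val in: InputStream = URI.create(url).toURL.openStream()
  try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
  finally in.close()
}
```

In this notebook it could be called as `fetchTo(remote, java.nio.file.Paths.get(dataDir, local))`, assuming `remote`, `dataDir` and `local` are defined as in the cells above.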

## Read the CSV file as a DataFrame

We use the SparkSession object to read the CSV file.

See the DataFrameReader API documentation: https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/api/scala/index.html#org.apache.spark.sql.DataFrameReader

and the Spark SQL programming guide: https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/sql-programming-guide.html

In [ ]:
val csvDF = sparkSession.read.option("inferSchema", true)
                             .option("header", true)
                             .csv(s"${dataDir}/$local")
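
To see what the `header` and `inferSchema` options do, here is a small sketch on a toy file. The file content is invented for illustration and only borrows two column names from the real data. The local `spark` session is an assumption made so the sketch stands alone; in the notebook the provided `sparkSession` would be used instead.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for this sketch only (assumption).
val spark = SparkSession.builder().master("local[1]").appName("csv-sketch").getOrCreate()

// Toy CSV file, invented for illustration.
val csvPath = Files.createTempFile("csv-sketch", ".csv")
Files.write(csvPath, "ticker,adj_close\nA,10.5\nB,20.25\n".getBytes(StandardCharsets.UTF_8))

val toyDF = spark.read
  .option("header", true)       // the first line supplies the column names
  .option("inferSchema", true)  // an extra pass over the data guesses the column types
  .csv(csvPath.toString)

// Print the inferred column names and types.
toyDF.printSchema()
```

Without `inferSchema`, every column would be read as a string.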
                                           

The columns of interest are "ticker", "date", "adj_close" and "adj_volume".

In [ ]:
val flatDF = csvDF.select("ticker", "date", "adj_close", "adj_volume")

In [ ]:
// For each per-date row count, show how many dates have that many rows.
flatDF.groupBy("date").count.groupBy("count").agg(count("count"))

In [ ]:
// Keep only the dates that have exactly 32 rows.
val ts = flatDF.groupBy("date").count.where($"count" === 32).select("date")

In [ ]:
val cleanDF = flatDF.join(ts, ts("date") === flatDF("date")).select($"ticker", flatDF("date"), $"adj_close")
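
The three cells above count rows per date, keep the dates with a full set of rows, and join back to drop incomplete dates. The sketch below replays that idea on toy data invented for illustration. The local `spark` session, the names `prices`, `expected`, `completeDates` and `complete`, and the choice to derive the expected count from the data (the notebook hard-codes 32) are all assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Local session for this sketch only (assumption); the notebook provides `sparkSession`.
val spark = SparkSession.builder().master("local[1]").appName("complete-dates-sketch").getOrCreate()
import spark.implicits._

// Toy data, invented for illustration: ticker B has no row on 2017-01-04.
val prices = Seq(
  ("A", "2017-01-03", 10.0),
  ("B", "2017-01-03", 20.0),
  ("A", "2017-01-04", 11.0)
).toDF("ticker", "date", "adj_close")

// Number of rows expected on every date: one per distinct ticker (2 here).
val expected = prices.select("ticker").distinct.count

// Dates on which every ticker has a row.
val completeDates = prices.groupBy("date").count.where($"count" === expected).select("date")

// Joining on the column name keeps a single `date` column in the result.
val complete = prices.join(completeDates, Seq("date"))
```

Joining with `Seq("date")` rather than an equality expression is a design choice here: it avoids having two `date` columns to disambiguate afterwards.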

We "pivot" the table: grouping by date and pivoting on ticker gives one row per date and one column per ticker, and we retain the adjusted closing price as the value.

In [ ]:
val data = cleanDF.groupBy("date").pivot("ticker").agg(max("adj_close"))
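
To make the reshaping concrete, the following sketch applies the same `groupBy` / `pivot` / `agg` chain to a four-row toy table. The data, the local `spark` session and the names `longDF` and `wide` are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

// Local session for this sketch only (assumption); the notebook provides `sparkSession`.
val spark = SparkSession.builder().master("local[1]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

// Long format: one row per (ticker, date) pair. Toy values, invented for illustration.
val longDF = Seq(
  ("A", "2017-01-03", 10.0),
  ("B", "2017-01-03", 20.0),
  ("A", "2017-01-04", 11.0),
  ("B", "2017-01-04", 21.0)
).toDF("ticker", "date", "adj_close")

// Wide format: one row per date, one column per ticker.
// Each (date, ticker) cell holds a single price here, so max just returns it.
val wide = longDF.groupBy("date").pivot("ticker").agg(max("adj_close")).orderBy("date")

wide.show()
```

`pivot` needs an aggregate function even when each cell has exactly one value, which is why `max` appears although nothing is really being maximised.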

We save the data as a Parquet file (a compact format that stores the schema with the data).

In [ ]:
:sh rm -rf ${dataDir}/djia.parquet

In [ ]:
data.write.format("parquet").save(s"${dataDir}/djia.parquet")

In [ ]:
:sh ls -l ${dataDir}/djia.parquet
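
As a quick check that Parquet keeps the schema with the data, the sketch below writes a toy table and reads it back. It also uses the writer's `overwrite` save mode, which is one way to avoid deleting the previous output by hand. The data, the temporary path, the local `spark` session and the names `toy`, `outPath` and `reloaded` are assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for this sketch only (assumption); the notebook provides `sparkSession`.
val spark = SparkSession.builder().master("local[1]").appName("parquet-sketch").getOrCreate()
import spark.implicits._

// Temporary output location, used only by this sketch.
val outPath = Files.createTempDirectory("parquet-sketch").resolve("toy.parquet").toString

// Toy wide table, invented for illustration.
val toy = Seq(("2017-01-03", 10.0, 20.0)).toDF("date", "A", "B")

// "overwrite" replaces any previous output at outPath.
toy.write.mode("overwrite").parquet(outPath)

// Column names and types come back from the file itself; no options are needed.
val reloaded = spark.read.parquet(outPath)
reloaded.printSchema()
```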