![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png)

# Distributed computation
## ESIPE — INFO 3 — Option Logiciel

<p style="font-size:35px;font-weight: bold;">Lab 4 : Song release year prediction </p>

 This lab covers a common supervised learning pipeline, using a subset of the [Million Song Dataset](http://labrosa.ee.columbia.edu/millionsong/) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD). Our goal is to train a linear regression model to predict the release year of a song given a set of audio features.
 
 Note that, for reference, you can look up the details of the relevant Spark methods in [Spark's Scala API](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package).

<p style="font-size:20pt; font-weight: bold;color:blue"> How to complete this lab :</p>

This assignment is broken up into sections with bite-sized examples that demonstrate Spark functionality for data processing and machine learning. For each problem, start by thinking about the algorithm you will use to *efficiently* process the data in a parallel, distributed manner. This means using the various [RDD](https://spark.apache.org/docs/2.4.8/api/scala/index.html#org.apache.spark.rdd.RDD) operations along with [anonymous functions](https://docs.scala-lang.org/scala3/book/fun-anonymous-functions.html) that are applied at each worker.

 
This assignment consists of 3 parts:

- Part 1: Exploring the initial dataset with Spark Core.
- Part 2: Machine learning with Spark MLlib.
- Part 3: Beat the benchmark.

##  Prerequisites : Spark Context configuration 

In [None]:
//using spark 2.4.8 because vegas plot library only works with scala 2.11
import $ivy.`org.apache.spark::spark-sql:2.4.8`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

import org.apache.spark.sql._

val spark = SparkSession.builder
  .appName("lab4_linear_reg_text")
  .master("local[*]")
  //.config("spark.executor.memory", "8g")
  //.config("spark.driver.memory", "8g")
  .getOrCreate()
val sc = spark.sparkContext

# Part 1: Exploring the initial dataset with Spark Core

## 1.  Load and check the data

 The raw data is currently stored in a text file. We will start by loading this raw data as an RDD, with each element of the RDD representing a data point as a comma-delimited string. Each string starts with the label (a year) followed by numerical audio features. Load the data into an RDD with two partitions using the SparkContext. Then use the [count method](https://spark.apache.org/docs/2.4.8/api/scala/index.html#org.apache.spark.rdd.RDD@count():Long) to check how many data points we have. Finally, use the [take method](https://spark.apache.org/docs/2.4.8/api/scala/index.html#org.apache.spark.rdd.RDD@take(num:Int):Array[T]) to create and print out a list of the first 5 data points in their initial string format.

In [None]:
// declaring testing function
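def assertEquals[A](expected : A, answer : A, error : String): Unit = {
    if (expected equals answer) println("1 test passed")
    // print the message so that a failure is visible for every call,
    // not only when it is the last expression of the cell
    else println(error)
}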

In [None]:
val fileName = "../resources/tp2/millionsong.txt" //change path to where the file is if necessary

In [None]:
// TODO: Replace ??? with appropriate code
import org.apache.spark.rdd.RDD

val numPartitions = 2
val rawData: RDD[String] = ???

val numPoints: Long = ???
println( numPoints )

val samplePoints: Array[String] = ???
samplePoints.foreach(println) // println on the Array itself would only show its reference

In [None]:
// TEST Load and check the data (1.1)
assertEquals(numPoints, 6724, "incorrect value for numPoints")
assertEquals(samplePoints.length, 5, "incorrect length for samplePoints")
assertEquals(rawData.getNumPartitions, 2, "Incorrect number of partitions")

## 2. Parsing the rows 

As you can see using the `collect` or `take` methods, the `rawData` RDD only contains string elements, which cannot be used directly for data exploration or machine learning. Define a function `row_parser` that you can pass to `map` in order to parse the rows. Make sure every parsed element is converted to `Float`. Assign the parsed RDD to the variable `parsed_rdd`.
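As a rough illustration of the parsing step, here is a minimal sketch on a single made-up line, using only the Scala standard library. The helper name `parseLine` and the sample string are assumptions for this example and are not part of the lab; the same function could be passed to `map` on an RDD of strings.

```scala
// Hypothetical helper: split a comma-delimited line and convert each field to Float.
def parseLine(line: String): Array[Float] =
  line.split(",").map(_.trim.toFloat)

// Made-up line following the "label,feature,feature,..." layout described above.
val sampleLine = "2001.0,0.5,0.25"
val parsed: Array[Float] = parseLine(sampleLine)
println(parsed.mkString(","))
```

Note that `toFloat` throws a `NumberFormatException` on a malformed field, which is usually what you want while exploring a dataset that is supposed to be clean.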

In [None]:
// TODO: Replace ??? with appropriate code

???
      
val parsed_rdd: RDD[Array[Float]] = ???
val parsed_rdd_sample: Array[Array[Float]] = parsed_rdd.take(5)
parsed_rdd_sample.foreach(x => println(x.mkString(",")))

In [None]:
// TEST Parsing the rows (1.2)
val round_sample_0: List[Double] = parsed_rdd_sample(0).map(x => (x*10.0).round/10.0).toList
assertEquals(parsed_rdd_sample(0).length, 13, "incorrect number of elements")
assertEquals(round_sample_0, List(2001.0, 0.9, 0.6, 0.6, 0.5, 0.2, 0.4, 0.3, 0.3, 0.6, 0.4, 0.6, 0.4), "Incorrect parsing.")

## 3. Take a first look at the features

Let's take a first look at the dataset features. First we will look at the raw features for 50 data points by generating a heatmap that visualizes each feature on a grey scale and shows the variation of each feature across the 50 sample data points. The features are all between 0 and 1, with values closer to 1 represented by darker shades of grey. To achieve this, extract the features of the first 50 observations and assign the result to the variable `top_50_features`.

_Note_ : A more detailed description of the features will be given in Part 2 using Spark SQL.

In [None]:
// TODO: Replace ??? with appropriate code
val top_50_features =  parsed_rdd.???

In [None]:
// TEST Take a first look at the features (1.3)
assertEquals(top_50_features.length, 50, "incorrect number of rows")
assertEquals(top_50_features(0).map(x => (x*100.0).round/100.0).toList, List(0.88, 0.61, 0.6, 0.47, 0.25, 0.36, 0.34, 0.34, 0.6, 0.43, 0.6, 0.42), "Values don't match.")
assertEquals(top_50_features.map(_.length).toSet, Set(12), "incorrect number of features")

In [None]:
import $ivy.`org.vegas-viz::vegas:0.3.11`
import $ivy.`org.vegas-viz::vegas-spark:0.3.11`

In [None]:
import vegas._
import vegas.sparkExt._

Vegas("Year Releases").
  mark(Point).
  //withData(year_release_keys.zip(year_release_values).map(x => Map("Year"->x._1,"Count"->x._2))).
  withData(top_50_features
           .map(x => x.zipWithIndex).zipWithIndex
           .flatMap(y => y._1.map(z => (z._1,y._2,z._2)))
           .map(x => Map("Row"->x._2,"Column"->x._3,"Feature"->x._1))
           ).
  encodeX("Column", Ord, axis=Axis(title="Feature")).
  encodeY("Row", Ord, scale=Scale(bandSize = 8.0), axis=Axis(title="Observation")).
  encodeColor("Feature", Quant, scale=Scale(rangeNominals=List("#DDDDDD", "#000000"))).
  show


## 4. Describe the release year label with Spark Core

Get more information on the release year, the variable we want to predict. Extract the release years, compute their `min`, `max` and `mean`, and plot the resulting histogram. To count the occurrences of each year, don't forget to apply `reduceByKey` before calling `collect`. Assign the results to the following variables.
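To make the expected quantities concrete, here is a small sketch on a made-up local list of years, using plain Scala collections rather than Spark. The values are invented for illustration. The fold over `(year, 1)` pairs performs the same per-key summation that `reduceByKey(_ + _)` performs on an RDD of pairs.

```scala
// Made-up release years, not taken from the dataset.
val years: Seq[Float] = Seq(1999f, 2001f, 1999f, 2005f, 2001f, 1999f)

val minYear: Float = years.min
val maxYear: Float = years.max
// Sum in Double to limit rounding error before dividing.
val meanYear: Double = years.map(_.toDouble).sum / years.size

// Emit (year, 1) pairs, then add up the ones that share a key.
val counts: Map[Float, Int] =
  years.map(y => (y, 1)).foldLeft(Map.empty[Float, Int]) {
    case (acc, (year, n)) => acc.updated(year, acc.getOrElse(year, 0) + n)
  }

println(s"min=$minYear max=$maxYear mean=$meanYear")
println(counts)
```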

In [None]:
// TODO: Replace ??? with appropriate code
val year_release_rdd: RDD[Float] = parsed_rdd.???
val year_release_min: Float = ???
val year_release_max: Float = ???
val year_release_mean: Double = ???

// Count the number of occurrences of each year
val year_release_count: Array[(Float, Int)] = ???

// Year
val year_release_keys: Array[Float] = ???

// Number of occurrences
val year_release_values: Array[Int] = ???

In [None]:
// TEST Describe year release (1.4)
assertEquals(year_release_min, 1922.0, "incorrect minimum")
assertEquals(year_release_max, 2011.0, "incorrect maximum")
assertEquals((year_release_mean*10).round/10.0, 1975.8, "incorrect mean")
assertEquals(year_release_keys.toList.sorted, List(2008.0, 2002.0, 2004.0, 1992.0, 2000.0, 1996.0, 1998.0, 2006.0, 1930.0, 1990.0, 1994.0, 1974.0, 1976.0, 1970.0, 1972.0, 2010.0, 1988.0, 1980.0, 1986.0, 1958.0, 1978.0, 1968.0, 1962.0, 1982.0, 1984.0, 1966.0, 1964.0, 1960.0, 1942.0, 1926.0, 1956.0, 1954.0, 1928.0, 1948.0, 1922.0, 1952.0, 1944.0, 1946.0, 1950.0, 1932.0, 1938.0, 1936.0, 1940.0, 1934.0, 1924.0, 2001.0, 2007.0, 2003.0, 1999.0, 1997.0, 1987.0, 2005.0, 2009.0, 1993.0, 1991.0, 1933.0, 1935.0, 1995.0, 1941.0, 1943.0, 1975.0, 1971.0, 1981.0, 1989.0, 1969.0, 1973.0, 1983.0, 1985.0, 1979.0, 1967.0, 1961.0, 1965.0, 1963.0, 1977.0, 1945.0, 1955.0, 1927.0, 1957.0, 1959.0, 1953.0, 1949.0, 1939.0, 1937.0, 1951.0, 1929.0, 1947.0, 1931.0, 1925.0, 2011.0).sorted, "incorrect bar plot x-label")
assertEquals(year_release_values.toList.sorted, List(100, 100, 100, 100, 100, 100, 100, 100, 38, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 21, 19, 100, 100, 48, 38, 6, 65, 14, 29, 58, 11, 19, 22, 14, 28, 5, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 6, 24, 100, 31, 13, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 27, 100, 40, 100, 100, 100, 53, 35, 25, 62, 79, 55, 31, 7, 1).sorted, "incorrect bar plot y-label")

In [None]:
// plotting the histogram :
import vegas._
import vegas.sparkExt._

Vegas("Year Releases", width = 1000).
  mark(Bar).
  withData(year_release_keys.zip(year_release_values).map(x => Map("Year"->x._1,"Count"->x._2))).
  encodeX("Year", Nom).
  encodeY("Count", Quant).
  show

## 5. Shift the labels

As we just saw, the labels are years in the 1900s and 2000s. In learning problems, it is often natural to shift labels so that they start from zero. Starting with `parsed_rdd`, create a new RDD in which the labels are shifted so that the smallest label equals zero.
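The shift itself can be sketched on a few made-up rows with plain Scala collections. The rows below are invented, and the layout (label first, features after) follows the description in this lab. Only the first element of each row changes.

```scala
// Made-up rows: the first element is the label, the rest are features.
val rows: Seq[Array[Float]] = Seq(
  Array(1999f, 0.1f, 0.2f),
  Array(2005f, 0.3f, 0.4f),
  Array(1990f, 0.5f, 0.6f)
)

// Smallest label across all rows.
val minLabel: Float = rows.map(_.head).min

// Subtract the minimum from the label and keep the features untouched.
val shifted: Seq[Array[Float]] =
  rows.map(row => (row.head - minLabel) +: row.tail)

shifted.foreach(row => println(row.mkString(",")))
```

On an RDD, the minimum would be computed once with an action and then used inside the `map`, so that it is not recomputed for every row.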

In [None]:
// TODO: Replace ??? with appropriate code
val parsed_rdd_shift: RDD[Array[Float]] = parsed_rdd.???
//parsed_rdd_shift.cache()

In [None]:
// TEST Shift labels (1.5)
val old_features: Array[Array[Float]] = parsed_rdd.map(row => row.tail).take(5)
val new_features: Array[Array[Float]] = parsed_rdd_shift.map(row => row.tail).take(5)
val min_year_new: Float = parsed_rdd_shift.map(row => row(0)).min()
val max_year_new: Float = parsed_rdd_shift.map(row => row(0)).max()
assertEquals(old_features.map(_.toList).toList, new_features.map(_.toList).toList, "new features do not match old features")
assertEquals(min_year_new, 0.0, "incorrect min year in shifted data")
assertEquals(max_year_new, 89.0, "incorrect max year in shifted data")

# Part 2 : Machine learning with Spark MLlib

Now, we will perform machine learning with Spark's machine learning (ML) library, using the DataFrame-based API.

## 1. Create dataframe

Using the RDD `parsed_rdd_shift` from Part 1, create a Spark SQL dataframe called `ml_df`.
- Check that all columns are of type double (the test below expects `DoubleType`).
- Check that the label column is named 'year_release' and that the feature columns are named (feature_0, feature_1, ..., feature_11).
- Make sure to cache the dataframe in order to speed up the following steps.

In [None]:
import $ivy.`org.apache.spark::spark-mllib:2.4.8`

// TODO: Replace ??? with appropriate code
val label = "year_release"
val features = Range(0, 12).map(num => s"feature_${num}")

val spark = SparkSession.builder
  .appName("lab4_linear_reg_text")
  .master("local[*]")
  .config("spark.executor.memory", "8g")
  .config("spark.driver.memory", "8g")
  .getOrCreate()
val sc = spark.sparkContext

import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val schema: StructType = ???

val ml_df: DataFrame = ???
ml_df.printSchema()
ml_df.show(1)

In [None]:
// TEST Create dataframe (2.1)
assertEquals(ml_df.getClass.toString, "class org.apache.spark.sql.Dataset", "ml_df is not a Spark DataFrame.")
assertEquals(ml_df.storageLevel.useMemory, true, "dataframe has to be cached")
assertEquals(ml_df.dtypes.toList, List(("year_release", "DoubleType"), ("feature_0", "DoubleType"), ("feature_1", "DoubleType"), ("feature_2", "DoubleType"), ("feature_3", "DoubleType"), ("feature_4", "DoubleType"), ("feature_5", "DoubleType"), ("feature_6", "DoubleType"), ("feature_7", "DoubleType"), ("feature_8", "DoubleType"), ("feature_9", "DoubleType"), ("feature_10", "DoubleType"), ("feature_11", "DoubleType")), "Dataframe schema is not correct.")

## 2. Train, validation, test Split 

As before, look up the `randomSplit` method in the Spark SQL API and make a train-val-test split with proportions `[0.8, 0.1, 0.1]`.
- Assign the training dataframe to variable train_df.
- Assign the validation dataframe to variable val_df.
- Assign the test dataframe to variable test_df.
- Make sure to cache all these dataframes.

In [None]:
// TODO: Replace ??? with appropriate code
val weights: Array[Double] = Array(.8, .1, .1)
val seed: Int = 60
???

// Count the number of elements in each DataFrame
val n_train: Long = ???
val n_val: Long = ???
val n_test: Long = ???

In [None]:
// TEST Training, validation, and test sets (2.2)
assertEquals(n_train + n_val + n_test, 6724, "unexpected Train, Val, Test data set size")
assertEquals(n_train, 5381, "unexpected value for n_train")
assertEquals(n_val, 654, "unexpected value for n_val")
assertEquals(n_test, 689, "unexpected value for n_test")
assertEquals(train_df.storageLevel.useMemory, true, "dataframe has to be cached")
assertEquals(val_df.storageLevel.useMemory, true, "dataframe has to be cached")
assertEquals(test_df.storageLevel.useMemory, true, "dataframe has to be cached")

## 3. Correlation matrix Plot

The MLlib and Spark ML libraries provide several statistical tools, such as statistical tests and correlation matrices. Let's compute the correlation matrix and find out which columns are correlated with the 'year_release' label.
- Wrap all the `ml_df` columns into a single vector column using a `VectorAssembler` object (see the documentation).
- After wrapping, compute the Pearson correlation matrix and assign the resulting matrix to the variable `pearsonCorr` (see the documentation).
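For reference, each entry of a Pearson correlation matrix is the Pearson coefficient between two columns: their covariance divided by the product of their standard deviations. The sketch below computes that coefficient for two made-up local sequences with plain Scala. The function name `pearson` is an assumption for this example and is not a Spark API.

```scala
// Hypothetical helper: Pearson correlation coefficient of two equally long sequences.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.nonEmpty, "inputs must be non-empty and of equal length")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  // Sum of products of deviations (the 1/n factors cancel out in the ratio).
  val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val varX = xs.map(x => math.pow(x - meanX, 2)).sum
  val varY = ys.map(y => math.pow(y - meanY, 2)).sum
  cov / math.sqrt(varX * varY)
}

// Made-up data: ys grows linearly with xs, zs decreases linearly with xs.
val xs = Seq(1.0, 2.0, 3.0)
val ys = Seq(2.0, 4.0, 6.0)
val zs = Seq(3.0, 2.0, 1.0)
println(pearson(xs, ys))
println(pearson(xs, zs))
```

A value close to 1 or -1 in the label's row of the matrix therefore indicates a feature with a strong linear relationship to the release year. The helper returns `NaN` if one of the sequences is constant, since its variance is then zero.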

In [None]:
// TODO: Replace ??? with appropriate code
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.sql.DataFrame


val all_cols_names: Array[String] = ml_df.columns
val vect_assembler: VectorAssembler = ???
val all_wrapped_cols: DataFrame = vect_assembler.???
val Row(pearsonCorr: Matrix) = Correlation.???

In [None]:
// TEST Correlation matrix (2.3)
assertEquals(pearsonCorr.getClass.toString, "class org.apache.spark.ml.linalg.DenseMatrix", "pearsonCorr has to be type org.apache.spark.ml.linalg.DenseMatrix")
assertEquals((pearsonCorr.numCols,pearsonCorr.numRows), (13, 13), "incorrect correlation matrix shape.")

In [None]:
// Printing your correlation matrix : 
println("Pearson correlation matrix:\n" + pearsonCorr.toString(13,125))

## 4. Linear regression with Spark ML

This time, we will train a linear regression with the DataFrame-based Spark ML API. To achieve this, you will wrap the features into a single column, train a linear regression, make predictions and finally evaluate your RMSE scores using an evaluator. Feel free to refer to the [Spark Documentation](https://spark.apache.org/docs/2.4.8/api/scala/index.html#org.apache.spark.ml.package) if needed.

- Assign the list of all feature column names of the dataframe to the variable `features_cols`.
- Using a `VectorAssembler` object, wrap all the feature columns into a new column named 'all_features' (for the train, val and test datasets). You have to set the parameters `inputCols` and `outputCol`.
- Using a `LinearRegression` object, train a machine learning model on the train dataset. You have to set the parameters `featuresCol`, `labelCol` and `predictionCol`. Do not set any hyperparameter.
- Use the model to make predictions on the train, val and test datasets. Assign the resulting dataframes to the variables `train_pred`, `val_pred` and `test_pred`.
- Using a `RegressionEvaluator` object, compute the resulting RMSE scores. You have to set the parameters `predictionCol`, `labelCol` and `metricName`.
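As a reminder of what the RMSE measures, the sketch below computes it by hand on a few made-up (label, prediction) pairs: the square root of the mean squared difference between label and prediction. The function name `rmse` and the data are assumptions for this example; in the lab the score should come from the `RegressionEvaluator`.

```scala
// Hypothetical helper: root mean squared error over (label, prediction) pairs.
def rmse(pairs: Seq[(Double, Double)]): Double = {
  require(pairs.nonEmpty, "need at least one (label, prediction) pair")
  val squaredErrors = pairs.map { case (label, pred) => math.pow(label - pred, 2) }
  math.sqrt(squaredErrors.sum / pairs.size)
}

// Made-up values: one prediction is off by 2 years, the other is exact.
val examplePairs = Seq((3.0, 1.0), (5.0, 5.0))
println(rmse(examplePairs))
```

Because the labels are release years, the RMSE can be read directly as a typical error in years.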

In [None]:
// TODO: Replace ??? with appropriate code
import org.apache.spark.ml.regression._
import org.apache.spark.ml.evaluation._

// Wrap the features into a unique column
val features_cols: Array[String] = ???
val vect_assembler: VectorAssembler = new VectorAssembler()
  .???
val train_df_wrapped: DataFrame = vect_assembler.???
val val_df_wrapped: DataFrame = vect_assembler.???
val test_df_wrapped: DataFrame = vect_assembler.???

// Train and predict :
val lr: LinearRegression = ???
val model: LinearRegressionModel = lr.???
val train_pred: DataFrame = model.???
val val_pred: DataFrame = model.???
val test_pred: DataFrame = model.???
test_pred.show(5)

// Evaluation : 
val evaluator: RegressionEvaluator = ???
val rmse_train: Double = evaluator.???
val rmse_val: Double = evaluator.???
val rmse_test: Double = evaluator.???

println(s">>> Train RMSE Score : ${rmse_train}")
println(s">>> Val RMSE Score : ${rmse_val}")
println(s">>> Test RMSE Score : ${rmse_test}")

In [None]:
// TEST Linear Regression with Spark ML (2.4)
assertEquals(features_cols.length, 12, "features_cols length has to be 12.")
assertEquals(train_df_wrapped.columns.length, 14, "Incorrect number of columns in train_df_wrapped.")
assertEquals(test_df_wrapped.columns.length, 14, "Incorrect number of columns in test_df_wrapped.")
assertEquals(train_df_wrapped.columns.contains("all_features"), true, "Column 'all_features' does not exist in wrapped train dataset.")
assertEquals(test_df_wrapped.columns.contains("all_features"), true, "Column 'all_features' does not exist in wrapped test dataset.")
assertEquals((rmse_train*100.0).round/100.0, 15.53, "Incorrect train RMSE Score.")
assertEquals((rmse_val*100.0).round/100.0, 16.30, "Incorrect val RMSE Score.")
assertEquals((rmse_test*100.0).round/100.0, 15.32, "Incorrect test RMSE Score.")

## 5. Add interactions between features (Order 2)

So far, we've used the features as they were provided. Now, we will add features that capture the two-way interactions between our existing features. For every pair of features `x` and `y`, we will compute the feature `x * y` (including the case where column x = column y).

To achieve this, you will have to:
- Create a UDF based on the `compute_product` function that takes two feature columns as input and returns the product of the two columns. Make sure that the resulting column is named `feature_i_x_j`, where `i` and `j` are the feature numbers.
- Apply the UDF to the train, val and test datasets using loops and the `withColumn` method.

_Hint_ : If needed, refer to the UDF Spark documentation or classroom lecture 3.

_Hint_ : Because the dataset has 12 features, you are expected to add 78 new columns to it.

_Note_ : This task can also be achieved without a UDF. However, this is a good opportunity to practice with UDFs.
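The count of 78 comes from keeping only the pairs `(i, j)` with `j >= i`: for 12 features that is 12 × 13 / 2 = 78. The short sketch below enumerates those pairs and the corresponding column names in plain Scala, without Spark. The variable names are assumptions for this example.

```scala
val nFeatures = 12

// All index pairs (i, j) with j >= i, so that x*y and y*x are not both created.
val pairs: Seq[(Int, Int)] = for {
  i <- 0 until nFeatures
  j <- 0 until nFeatures
  if j >= i
} yield (i, j)

// Column names following the feature_i_x_j convention of this lab.
val names: Seq[String] = pairs.map { case (i, j) => s"feature_${i}_x_${j}" }

println(names.length)
println(names.take(3).mkString(", "))
```

Together with the label and the 12 original features, this gives the 91 columns that the test for this section checks.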

In [None]:
// TODO: Replace ??? with appropriate code
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.UserDefinedFunction

// Design the UDF :
val compute_product = (x:Double, y:Double) => {
      /*
      Returns x * y
      */
      ???
    }

val my_udf: UserDefinedFunction = ???

// Apply the UDF on train, val and test datasets
val features_cols: Array[String] = ???
val features_index: Seq[Int] = ???

val index: Seq[(Int, Int)] = for {
      i <- features_index
      j <- features_index
      if j >= i
    } yield (i,j)

val train_df_2 : DataFrame = index.foldLeft[DataFrame](train_df) {
      ???
    }.cache()

val val_df_2 : DataFrame = index.foldLeft[DataFrame](val_df) {
      ???
    }.cache()

val test_df_2 : DataFrame = index.foldLeft[DataFrame](test_df) {
      ???
    }.cache()
            
test_df_2.printSchema()

In [None]:
// TEST Add two-way interactions (2.5)
assertEquals(train_df_2.columns.length, 91, "Incorrect number of columns for train_df_2.")
assertEquals(val_df_2.columns.length, 91, "Incorrect number of columns for val_df_2.")
assertEquals(test_df_2.columns.length, 91, "Incorrect number of columns for test_df_2.")
assertEquals(train_df_2.dtypes.toList, List(("year_release", "DoubleType"), ("feature_0", "DoubleType"), ("feature_1", "DoubleType"), ("feature_2", "DoubleType"), ("feature_3", "DoubleType"), ("feature_4", "DoubleType"), ("feature_5", "DoubleType"), ("feature_6", "DoubleType"), ("feature_7", "DoubleType"), ("feature_8", "DoubleType"), ("feature_9", "DoubleType"), ("feature_10", "DoubleType"), ("feature_11", "DoubleType"), ("feature_0_x_0", "DoubleType"), ("feature_0_x_1", "DoubleType"), ("feature_0_x_2", "DoubleType"), ("feature_0_x_3", "DoubleType"), ("feature_0_x_4", "DoubleType"), ("feature_0_x_5", "DoubleType"), ("feature_0_x_6", "DoubleType"), ("feature_0_x_7", "DoubleType"), ("feature_0_x_8", "DoubleType"), ("feature_0_x_9", "DoubleType"), ("feature_0_x_10", "DoubleType"), ("feature_0_x_11", "DoubleType"), ("feature_1_x_1", "DoubleType"), ("feature_1_x_2", "DoubleType"), ("feature_1_x_3", "DoubleType"), ("feature_1_x_4", "DoubleType"), ("feature_1_x_5", "DoubleType"), ("feature_1_x_6", "DoubleType"), ("feature_1_x_7", "DoubleType"), ("feature_1_x_8", "DoubleType"), ("feature_1_x_9", "DoubleType"), ("feature_1_x_10", "DoubleType"), ("feature_1_x_11", "DoubleType"), ("feature_2_x_2", "DoubleType"), ("feature_2_x_3", "DoubleType"), ("feature_2_x_4", "DoubleType"), ("feature_2_x_5", "DoubleType"), ("feature_2_x_6", "DoubleType"), ("feature_2_x_7", "DoubleType"), ("feature_2_x_8", "DoubleType"), ("feature_2_x_9", "DoubleType"), ("feature_2_x_10", "DoubleType"), ("feature_2_x_11", "DoubleType"), ("feature_3_x_3", "DoubleType"), ("feature_3_x_4", "DoubleType"), ("feature_3_x_5", "DoubleType"), ("feature_3_x_6", "DoubleType"), ("feature_3_x_7", "DoubleType"), ("feature_3_x_8", "DoubleType"), ("feature_3_x_9", "DoubleType"), ("feature_3_x_10", "DoubleType"), ("feature_3_x_11", "DoubleType"), ("feature_4_x_4", "DoubleType"), ("feature_4_x_5", "DoubleType"), ("feature_4_x_6", "DoubleType"), ("feature_4_x_7", "DoubleType"), ("feature_4_x_8", "DoubleType"), 
("feature_4_x_9", "DoubleType"), ("feature_4_x_10", "DoubleType"), ("feature_4_x_11", "DoubleType"), ("feature_5_x_5", "DoubleType"), ("feature_5_x_6", "DoubleType"), ("feature_5_x_7", "DoubleType"), ("feature_5_x_8", "DoubleType"), ("feature_5_x_9", "DoubleType"), ("feature_5_x_10", "DoubleType"), ("feature_5_x_11", "DoubleType"), ("feature_6_x_6", "DoubleType"), ("feature_6_x_7", "DoubleType"), ("feature_6_x_8", "DoubleType"), ("feature_6_x_9", "DoubleType"), ("feature_6_x_10", "DoubleType"), ("feature_6_x_11", "DoubleType"), ("feature_7_x_7", "DoubleType"), ("feature_7_x_8", "DoubleType"), ("feature_7_x_9", "DoubleType"), ("feature_7_x_10", "DoubleType"), ("feature_7_x_11", "DoubleType"), ("feature_8_x_8", "DoubleType"), ("feature_8_x_9", "DoubleType"), ("feature_8_x_10", "DoubleType"), ("feature_8_x_11", "DoubleType"), ("feature_9_x_9", "DoubleType"), ("feature_9_x_10", "DoubleType"), ("feature_9_x_11", "DoubleType"), ("feature_10_x_10", "DoubleType"), ("feature_10_x_11", "DoubleType"), ("feature_11_x_11", "DoubleType")), "incorrect columns names or casting.")
assertEquals(val_df_2.dtypes.toList, List(("year_release", "DoubleType"), ("feature_0", "DoubleType"), ("feature_1", "DoubleType"), ("feature_2", "DoubleType"), ("feature_3", "DoubleType"), ("feature_4", "DoubleType"), ("feature_5", "DoubleType"), ("feature_6", "DoubleType"), ("feature_7", "DoubleType"), ("feature_8", "DoubleType"), ("feature_9", "DoubleType"), ("feature_10", "DoubleType"), ("feature_11", "DoubleType"), ("feature_0_x_0", "DoubleType"), ("feature_0_x_1", "DoubleType"), ("feature_0_x_2", "DoubleType"), ("feature_0_x_3", "DoubleType"), ("feature_0_x_4", "DoubleType"), ("feature_0_x_5", "DoubleType"), ("feature_0_x_6", "DoubleType"), ("feature_0_x_7", "DoubleType"), ("feature_0_x_8", "DoubleType"), ("feature_0_x_9", "DoubleType"), ("feature_0_x_10", "DoubleType"), ("feature_0_x_11", "DoubleType"), ("feature_1_x_1", "DoubleType"), ("feature_1_x_2", "DoubleType"), ("feature_1_x_3", "DoubleType"), ("feature_1_x_4", "DoubleType"), ("feature_1_x_5", "DoubleType"), ("feature_1_x_6", "DoubleType"), ("feature_1_x_7", "DoubleType"), ("feature_1_x_8", "DoubleType"), ("feature_1_x_9", "DoubleType"), ("feature_1_x_10", "DoubleType"), ("feature_1_x_11", "DoubleType"), ("feature_2_x_2", "DoubleType"), ("feature_2_x_3", "DoubleType"), ("feature_2_x_4", "DoubleType"), ("feature_2_x_5", "DoubleType"), ("feature_2_x_6", "DoubleType"), ("feature_2_x_7", "DoubleType"), ("feature_2_x_8", "DoubleType"), ("feature_2_x_9", "DoubleType"), ("feature_2_x_10", "DoubleType"), ("feature_2_x_11", "DoubleType"), ("feature_3_x_3", "DoubleType"), ("feature_3_x_4", "DoubleType"), ("feature_3_x_5", "DoubleType"), ("feature_3_x_6", "DoubleType"), ("feature_3_x_7", "DoubleType"), ("feature_3_x_8", "DoubleType"), ("feature_3_x_9", "DoubleType"), ("feature_3_x_10", "DoubleType"), ("feature_3_x_11", "DoubleType"), ("feature_4_x_4", "DoubleType"), ("feature_4_x_5", "DoubleType"), ("feature_4_x_6", "DoubleType"), ("feature_4_x_7", "DoubleType"), ("feature_4_x_8", "DoubleType"), 
("feature_4_x_9", "DoubleType"), ("feature_4_x_10", "DoubleType"), ("feature_4_x_11", "DoubleType"), ("feature_5_x_5", "DoubleType"), ("feature_5_x_6", "DoubleType"), ("feature_5_x_7", "DoubleType"), ("feature_5_x_8", "DoubleType"), ("feature_5_x_9", "DoubleType"), ("feature_5_x_10", "DoubleType"), ("feature_5_x_11", "DoubleType"), ("feature_6_x_6", "DoubleType"), ("feature_6_x_7", "DoubleType"), ("feature_6_x_8", "DoubleType"), ("feature_6_x_9", "DoubleType"), ("feature_6_x_10", "DoubleType"), ("feature_6_x_11", "DoubleType"), ("feature_7_x_7", "DoubleType"), ("feature_7_x_8", "DoubleType"), ("feature_7_x_9", "DoubleType"), ("feature_7_x_10", "DoubleType"), ("feature_7_x_11", "DoubleType"), ("feature_8_x_8", "DoubleType"), ("feature_8_x_9", "DoubleType"), ("feature_8_x_10", "DoubleType"), ("feature_8_x_11", "DoubleType"), ("feature_9_x_9", "DoubleType"), ("feature_9_x_10", "DoubleType"), ("feature_9_x_11", "DoubleType"), ("feature_10_x_10", "DoubleType"), ("feature_10_x_11", "DoubleType"), ("feature_11_x_11", "DoubleType")), "incorrect columns names or casting.")
assertEquals(test_df_2.dtypes.toList, List(("year_release", "DoubleType"), ("feature_0", "DoubleType"), ("feature_1", "DoubleType"), ("feature_2", "DoubleType"), ("feature_3", "DoubleType"), ("feature_4", "DoubleType"), ("feature_5", "DoubleType"), ("feature_6", "DoubleType"), ("feature_7", "DoubleType"), ("feature_8", "DoubleType"), ("feature_9", "DoubleType"), ("feature_10", "DoubleType"), ("feature_11", "DoubleType"), ("feature_0_x_0", "DoubleType"), ("feature_0_x_1", "DoubleType"), ("feature_0_x_2", "DoubleType"), ("feature_0_x_3", "DoubleType"), ("feature_0_x_4", "DoubleType"), ("feature_0_x_5", "DoubleType"), ("feature_0_x_6", "DoubleType"), ("feature_0_x_7", "DoubleType"), ("feature_0_x_8", "DoubleType"), ("feature_0_x_9", "DoubleType"), ("feature_0_x_10", "DoubleType"), ("feature_0_x_11", "DoubleType"), ("feature_1_x_1", "DoubleType"), ("feature_1_x_2", "DoubleType"), ("feature_1_x_3", "DoubleType"), ("feature_1_x_4", "DoubleType"), ("feature_1_x_5", "DoubleType"), ("feature_1_x_6", "DoubleType"), ("feature_1_x_7", "DoubleType"), ("feature_1_x_8", "DoubleType"), ("feature_1_x_9", "DoubleType"), ("feature_1_x_10", "DoubleType"), ("feature_1_x_11", "DoubleType"), ("feature_2_x_2", "DoubleType"), ("feature_2_x_3", "DoubleType"), ("feature_2_x_4", "DoubleType"), ("feature_2_x_5", "DoubleType"), ("feature_2_x_6", "DoubleType"), ("feature_2_x_7", "DoubleType"), ("feature_2_x_8", "DoubleType"), ("feature_2_x_9", "DoubleType"), ("feature_2_x_10", "DoubleType"), ("feature_2_x_11", "DoubleType"), ("feature_3_x_3", "DoubleType"), ("feature_3_x_4", "DoubleType"), ("feature_3_x_5", "DoubleType"), ("feature_3_x_6", "DoubleType"), ("feature_3_x_7", "DoubleType"), ("feature_3_x_8", "DoubleType"), ("feature_3_x_9", "DoubleType"), ("feature_3_x_10", "DoubleType"), ("feature_3_x_11", "DoubleType"), ("feature_4_x_4", "DoubleType"), ("feature_4_x_5", "DoubleType"), ("feature_4_x_6", "DoubleType"), ("feature_4_x_7", "DoubleType"), ("feature_4_x_8", "DoubleType"), 
("feature_4_x_9", "DoubleType"), ("feature_4_x_10", "DoubleType"), ("feature_4_x_11", "DoubleType"), ("feature_5_x_5", "DoubleType"), ("feature_5_x_6", "DoubleType"), ("feature_5_x_7", "DoubleType"), ("feature_5_x_8", "DoubleType"), ("feature_5_x_9", "DoubleType"), ("feature_5_x_10", "DoubleType"), ("feature_5_x_11", "DoubleType"), ("feature_6_x_6", "DoubleType"), ("feature_6_x_7", "DoubleType"), ("feature_6_x_8", "DoubleType"), ("feature_6_x_9", "DoubleType"), ("feature_6_x_10", "DoubleType"), ("feature_6_x_11", "DoubleType"), ("feature_7_x_7", "DoubleType"), ("feature_7_x_8", "DoubleType"), ("feature_7_x_9", "DoubleType"), ("feature_7_x_10", "DoubleType"), ("feature_7_x_11", "DoubleType"), ("feature_8_x_8", "DoubleType"), ("feature_8_x_9", "DoubleType"), ("feature_8_x_10", "DoubleType"), ("feature_8_x_11", "DoubleType"), ("feature_9_x_9", "DoubleType"), ("feature_9_x_10", "DoubleType"), ("feature_9_x_11", "DoubleType"), ("feature_10_x_10", "DoubleType"), ("feature_10_x_11", "DoubleType"), ("feature_11_x_11", "DoubleType")), "incorrect columns names or casting.")

## 6. Linear regression model with feature interactions (order 2)

Using the 78 new columns computed above, we can now capture interactions between variables in order to build a more robust linear regression model. As in section 4 (linear regression with Spark ML), train this new model and compute the new RMSE scores.

In [None]:
// TODO: Replace ??? with appropriate code

// Wrap the features into a unique column
val features_cols: Array[String] = ???
val vect_assembler: VectorAssembler = new VectorAssembler()
    .???
val train_df_wrapped: DataFrame = vect_assembler.???
val val_df_wrapped: DataFrame = vect_assembler.???
val test_df_wrapped: DataFrame = vect_assembler.???

// Train and predict :
val lr: LinearRegression = ???
val model: LinearRegressionModel = lr.???
val train_pred: DataFrame = model.???
val val_pred: DataFrame = model.???
val test_pred: DataFrame = model.???
test_pred.show(5)

// Evaluation : 
val evaluator: RegressionEvaluator = ???
val rmse_train: Double = evaluator.???
val rmse_val: Double = evaluator.???
val rmse_test: Double = evaluator.???

println(s">>> Train RMSE Score : ${rmse_train}")
println(s">>> Val RMSE Score : ${rmse_val}")
println(s">>> Test RMSE Score : ${rmse_test}")

In [None]:
// TEST New Linear Regression with features interactions (2.6)
assertEquals(features_cols.length, 90, "features_cols length has to be 90.")
assertEquals(train_df_wrapped.columns.length, 92, "Incorrect number of columns in train_df_wrapped.")
assertEquals(val_df_wrapped.columns.length, 92, "Incorrect number of columns in val_df_wrapped.")
assertEquals(test_df_wrapped.columns.length, 92, "Incorrect number of columns in test_df_wrapped.")
assertEquals(train_df_wrapped.columns.contains("all_features"), true, "Column 'all_features' does not exist in wrapped train dataset.")
assertEquals(test_df_wrapped.columns.contains("all_features"), true, "Column 'all_features' does not exist in wrapped test dataset.")
assertEquals((rmse_train*100.0).round/100.0, 14.50, "Incorrect train RMSE Score.")
assertEquals((rmse_val*100.0).round/100.0, 15.34, "Incorrect val RMSE Score.")
assertEquals((rmse_test*100.0).round/100.0, 14.52, "Incorrect test RMSE Score.")

## 7. An example of hyperparameter tuning with a cross-validator

Here, we build a simple example of a grid-search cross-validator in order to find the best regularization hyperparameters (`regParam` and `elasticNetParam`) for our regression model. Judging by our previous RMSE scores on the train, val and test datasets (which are very close to each other), our linear model currently seems to be under-fitting. Thus, tuning the regularization will not necessarily improve the score on the test dataset. However, in Part 3 you may look at order-3 feature interactions or tree-based machine learning algorithms. In those cases, a hyperparameter-tuning cross-validator can be really helpful.

We use 2 grids of parameters : <br>
ElasticNetParam = [0.0, 0.35, 0.65, 1.0] <br>
RegParam = [0.0, 0.001, 0.01, 1.0]

_Note_ : Don't forget to cache your datasets in memory.
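To get a feel for the cost of this grid search, the sketch below enumerates the combinations of the two grids listed above in plain Scala and multiplies by the number of folds used in the next cell. The variable names are assumptions for this example; the values are the ones given in this section.

```scala
// Grids as listed above.
val elasticNetValues = Array(0.0, 0.35, 0.65, 1.0)
val regValues = Array(0.0, 0.001, 0.01, 1.0)
// Number of folds set on the CrossValidator in the next cell.
val numFolds = 5

// Every (elasticNetParam, regParam) combination of the grid.
val combos: Array[(Double, Double)] = for {
  e <- elasticNetValues
  r <- regValues
} yield (e, r)

// One model is fitted per combination and per fold during cross-validation.
val fitsDuringCv = combos.length * numFolds
println(s"${combos.length} combinations, $fitsDuringCv model fits")
```

This is why caching matters here: every one of those fits rereads the training data, and adding a value to either grid adds another `numFolds` fits per value of the other grid.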

In [None]:
import org.apache.spark.ml.tuning._
import org.apache.spark.ml.param._

// Preprocessing the Datasets
val train_val_df_2: DataFrame = train_df_2.union(val_df_2).cache()
val features_cols: Array[String] = train_val_df_2.columns.filter(col => col != "year_release")
val vect_assembler: VectorAssembler = new VectorAssembler()
    .setInputCols(features_cols)
    .setOutputCol("all_features")
val train_df_wrapped: DataFrame = vect_assembler.transform(train_val_df_2)
val test_df_wrapped: DataFrame = vect_assembler.transform(test_df_2)

// Define the Grid
val lr: LinearRegression = new LinearRegression()
val grid: Array[ParamMap] = new ParamGridBuilder().baseOn({lr.labelCol -> "year_release"})
  .baseOn({lr.featuresCol -> "all_features"})
  .baseOn({lr.predictionCol -> "prediction"})
  .addGrid(lr.elasticNetParam, ???)
  .addGrid(lr.regParam, ???)
  .build()
        

val evaluator: RegressionEvaluator = ???
val cv: CrossValidator = new CrossValidator()
    .setEstimator(lr)
    .setEstimatorParamMaps(grid)
    .setEvaluator(evaluator)
    .setNumFolds(5) // use 3+ folds in practice

// Run cross-validation, and choose the best set of parameters.
val cvModel: CrossValidatorModel = cv.fit(train_df_wrapped.cache())

// Make predictions on test dataset. cvModel uses the best model found.
val rmse_train: Double = evaluator.evaluate(cvModel.transform(train_df_wrapped))
val rmse_test: Double = evaluator.evaluate(cvModel.transform(test_df_wrapped))
println(s">>> Train RMSE Score : ${rmse_train}")
println(s">>> Test RMSE Score : ${rmse_test}")

// Info display
println("\n===== Hypertuning cross validation results : =====")
cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics)
  .foreach(x => {
    println(s"Model ${cvModel.getEstimatorParamMaps.indexOf(x._1)} :" )
    println(s"Params : ${x._1.toSeq.map(y => (y.param.name,y.value)).mkString(",")}")
    println(s"Average validation score : ${x._2}\n")
  })

# Part 3 : Beat the benchmark

So far we have tried several linear regression models with Spark MLlib. Our best model reaches a test RMSE score of about `14.55`. Let us now use more powerful models, such as tree-based ones! Can you beat the benchmark of `14.25`?

Some ideas :
- You can try one or several algorithms, including `DecisionTreeRegressor`, `RandomForestRegressor` and `GBTRegressor` (gradient boosting). See the [Spark ML Documentation](https://spark.apache.org/docs/2.4.8/api/scala/index.html#org.apache.spark.ml.package) for more information on the algorithm settings.
- You can also try a hyperparameter-tuning cross-validator in order to reach the best test score.

Notes :
- Because tree-based techniques are non-linear algorithms, you should use the original dataframes with 12 features. Using `train_df_2` will massively increase computing time, with an increased risk of overfitting.
- For `RandomForestRegressor` and `GBTRegressor`, be careful with the number of trees. Because you are running the algorithm in a local Spark environment, you don't have a lot of computing resources.
- Make sure to cache your dataframes for better performance.
- A hyperparameter-tuning cross-validator is well suited to a decision tree but not to `RandomForest` and `GBT`. Because you are running the algorithm in a local Spark environment, you don't have enough computing resources for a heavy grid search.

## 1. Tree-based techniques

In [None]:
// TODO: Replace ??? with appropriate code
// Look for LinearRegression, DecisionTreeRegressor, RandomForestRegressor, GBTRegressor and RegressionEvaluator

???

## 2. Visualize your results

In [None]:
Vegas.layered("Tree Based techniques", width = 600, height = 300).
  withDataFrame(test_pred).
  withLayers(
    Layer().
      mark(Line).
      encodeX("year_release", Quant, axis=Axis(title="Predicted")).
      encodeY("year_release", Quant, axis=Axis(title="Actual")),
    Layer().
      mark(Point).
      encodeX("prediction", Quant).
      encodeY("year_release", Quant)
  ).
  show
