
# Scala basics

We will explore a few basic methods on Scala collections through a small use case; later on, we will see how the same operations translate to Spark.

## Use case: Housing market
For the housing market, we are interested in the price of each house, its size and its location.
To keep things simple, all prices are in euros, and the location is represented by the zip code and the country.
We'll be comparing a few countries in Europe.

In [ ]:
case class Home(m2: Int, price: Double, zipCode: String, country: String)

## Sample data

In [ ]:
val data = Seq(
  Home(100, 561000, "1024AA", "Netherlands"),
  Home(100, 525000, "3011CD", "Netherlands"),
  Home(100, 916000, "75001", "France"),
  Home(100, 598000, "69002", "France"),
  Home(100, 354000, "28014", "Spain"),
  Home(100, 200000, "36400", "Spain"),
  Home(100, 1180000, "EC1A", "UK"),
  Home(100, 263000, "CF10", "UK"),
  Home(100, 336600, "1000", "Belgium"),
  Home(100, 299000, "9000", "Belgium"),
  Home(100, 820000, "2450", "Luxembourg"),
  Home(100, 570000, "4238", "Luxembourg"),
  Home(100, 375000, "50667", "Germany"),
  Home(100, 218000, "10117", "Germany")
)

In [ ]:
// Set a "crazy" price threshold for a house, so we can use it in the examples below. Not crazy enough for you? You're lucky...

val crazyPrice = 300000

## How many entries do we have in our dataset?

In [ ]:
data.length
// data.size

## How many different entries do we have in our dataset?

In [ ]:
// distinct drops duplicate entries; length then counts what is left
data.distinct.length
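
`distinct` works here because case classes compare by field values, so two entries with the same size, price, zip code and country count as the same entry. Every entry in the sample data above is different, so a minimal sketch with a deliberately repeated entry may make the effect easier to see. The `SampleHome` class and the `sample` value are hypothetical stand-ins, introduced only to avoid shadowing `Home` and `data`.

```scala
// Hypothetical stand-in for Home, with the same fields.
case class SampleHome(m2: Int, price: Double, zipCode: String, country: String)

// Hypothetical sample containing one repeated listing.
val sample = Seq(
  SampleHome(100, 561000, "1024AA", "Netherlands"),
  SampleHome(100, 561000, "1024AA", "Netherlands"), // exact duplicate of the first entry
  SampleHome(100, 525000, "3011CD", "Netherlands")
)

val total = sample.length            // counts every entry, duplicates included
val unique = sample.distinct.length  // counts entries after dropping duplicates
```

With this sample, `total` is 3 and `unique` is 2.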

## Which countries have crazy prices in their Housing market?

In [ ]:
data.filter(home => home.price > crazyPrice).map(_.country).distinct

In [ ]:
// Let's sort our dataset by country (ascending), then by price (descending)

data.sortBy(home => (home.country, -home.price))
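
Sorting by a tuple compares the first element and only falls back to the second when the first ones are equal; negating a numeric key flips its direction. A small sketch of that idea, using hypothetical `(country, price)` pairs instead of `Home` entries:

```scala
// Hypothetical (country, price) pairs standing in for Home entries.
val listings = Seq(
  ("Spain", 200000.0),
  ("France", 598000.0),
  ("Spain", 354000.0),
  ("France", 916000.0)
)

// Country ascending; the negated price makes the second key descending.
val sorted = listings.sortBy { case (country, price) => (country, -price) }
```

Here `sorted` lists both French entries first (916000.0, then 598000.0), followed by the Spanish ones (354000.0, then 200000.0).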

## Aggregations

We want to compute the average price in the Housing market per country. Let's define how to calculate the average, and then apply it to our dataset.

In [ ]:
// Average of a sequence of numbers; returns 0.0 for an empty sequence.
def avg(seq: Seq[Double]): Double =
  if (seq.nonEmpty) {
    val sum = seq.foldLeft(0.0)(_ + _)
    sum / seq.length
  } else {
    0.0
  }


In [ ]:
data.groupBy(_.country).map {
  case (country, list) => (country, avg(list.map(_.price)))
}
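
The aggregation above happens in two steps: `groupBy` builds a `Map` from each key to the entries that share it, and `map` then reduces each group to a single value. The sketch below spells out the intermediate result, assuming hypothetical `(country, price)` pairs rather than `Home` entries; the names `prices`, `grouped` and `averages` are illustrative only.

```scala
// Hypothetical (country, price) pairs standing in for Home entries.
val prices = Seq(("Spain", 354000.0), ("Spain", 200000.0), ("UK", 1180000.0))

// Step 1: one map entry per country, holding all pairs for that country.
val grouped: Map[String, Seq[(String, Double)]] = prices.groupBy(_._1)

// Step 2: reduce each group to its average price.
val averages: Map[String, Double] = grouped.map { case (country, entries) =>
  val values = entries.map(_._2)
  (country, values.sum / values.length)
}
```

Note that the result is a `Map`, so the order in which the countries are displayed is not guaranteed.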

## TODO: Which locations in the UK have crazy prices?

## TODO: Which are the top 3 most expensive countries?

## TODO: What percentage of houses in France are over the crazy price?

## TODO: What percentage of houses in the dataset are over the crazy price?

## TODO: Which are the 3 cheapest locations (zipCode + country) in the dataset?