Merge pull request #194 from imarios/pivot-docs
Adding docs for pivot
imarios committed Oct 10, 2017
2 parents 14c0c0a + 30b950f commit f81be1d
Showing 1 changed file with 36 additions and 8 deletions.
44 changes: 36 additions & 8 deletions docs/src/main/tut/FeatureOverview.md
@@ -22,19 +22,19 @@ import spark.implicits._
We start by defining a case class:

```tut:silent
-case class Apartment(city: String, surface: Int, price: Double)
+case class Apartment(city: String, surface: Int, price: Double, bedrooms: Int)
```

And few `Apartment` instances:

```tut:silent
val apartments = Seq(
-  Apartment("Paris", 50, 300000.0),
-  Apartment("Paris", 100, 450000.0),
-  Apartment("Paris", 25, 250000.0),
-  Apartment("Lyon", 83, 200000.0),
-  Apartment("Lyon", 45, 133000.0),
-  Apartment("Nice", 74, 325000.0)
+  Apartment("Paris", 50, 300000.0, 2),
+  Apartment("Paris", 100, 450000.0, 3),
+  Apartment("Paris", 25, 250000.0, 1),
+  Apartment("Lyon", 83, 200000.0, 2),
+  Apartment("Lyon", 45, 133000.0, 1),
+  Apartment("Nice", 74, 325000.0, 3)
)
```

@@ -243,7 +243,7 @@ priceByCity.collect().run()
```
Again if we try to aggregate a column that can't be aggregated, we get a compilation error
```tut:book:fail
-aptTypedDs.groupBy(aptTypedDs('city)).agg(avg(aptTypedDs('city))) ^
+aptTypedDs.groupBy(aptTypedDs('city)).agg(avg(aptTypedDs('city)))
```

Next, we combine `select` and `groupBy` to calculate the average price/surface ratio per city:
@@ -256,6 +256,34 @@ val cityPriceRatio = aptds.select(aptds('city), aptds('price) / aptds('surface)
cityPriceRatio.groupBy(cityPriceRatio('_1)).agg(avg(cityPriceRatio('_2))).show().run()
```

+We can also use `pivot` to further group data on a secondary column.
+For example, we can compare the average price across cities by number of bedrooms.
+
+```tut:book
+case class BedroomStats(
+  city: String,
+  AvgPriceBeds1: Option[Double], // Pivot values may be missing, so we encode them using Options
+  AvgPriceBeds2: Option[Double],
+  AvgPriceBeds3: Option[Double],
+  AvgPriceBeds4: Option[Double])
+val bedroomStats = aptds.
+  groupBy(aptds('city)).
+  pivot(aptds('bedrooms)).
+  on(1,2,3,4). // We only care for up to 4 bedrooms
+  agg(avg(aptds('price))).
+  as[BedroomStats] // Typesafe casting
+bedroomStats.show().run()
+```
+
+With pivot, collecting data preserves typesafety by
+encoding potentially missing columns with `Option`.
+
+```tut:book
+bedroomStats.collect().run().foreach(println)
+```
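
To illustrate why the pivoted columns are typed as `Option[Double]`, here is a minimal sketch of the same computation over plain Scala collections. It uses neither Spark nor Frameless, and it is not how the library implements `pivot`; the names `PivotSketch`, `avgOpt`, `bedroomValues`, and `pivoted` are hypothetical and exist only for this illustration.

```scala
// Plain-Scala sketch of "group by city, pivot on bedrooms, average the price".
// Assumption: this only mirrors the shape of the result, not Frameless internals.
object PivotSketch {
  case class Apartment(city: String, surface: Int, price: Double, bedrooms: Int)

  val apartments: Seq[Apartment] = Seq(
    Apartment("Paris", 50, 300000.0, 2),
    Apartment("Paris", 100, 450000.0, 3),
    Apartment("Paris", 25, 250000.0, 1),
    Apartment("Lyon", 83, 200000.0, 2),
    Apartment("Lyon", 45, 133000.0, 1),
    Apartment("Nice", 74, 325000.0, 3)
  )

  // Hypothetical helper: the average of a sequence, or None when it is empty.
  def avgOpt(xs: Seq[Double]): Option[Double] =
    if (xs.isEmpty) None else Some(xs.sum / xs.size)

  // The pivot values, mirroring `on(1,2,3,4)` in the documented example.
  val bedroomValues: Seq[Int] = Seq(1, 2, 3, 4)

  // For each city, one Option[Double] per pivot value, in pivot-value order.
  val pivoted: Map[String, Seq[Option[Double]]] =
    apartments.groupBy(_.city).map { case (city, apts) =>
      city -> bedroomValues.map { b =>
        avgOpt(apts.filter(_.bedrooms == b).map(_.price))
      }
    }

  def main(args: Array[String]): Unit =
    pivoted.toSeq.sortBy(_._1).foreach { case (city, avgs) =>
      println(s"$city: $avgs")
    }
}
```

No city in the sample data has a four-bedroom apartment, so the last entry is `None` for every city. A non-optional `Double` field could not represent that missing value.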

### Entire TypedDataset Aggregation

We often want to aggregate the entire `TypedDataset` and skip the `groupBy()` clause.