Merge pull request #194 from imarios/pivot-docs

Adding docs for pivot
typelevel · Oct 10, 2017 · f81be1d
2 parents 14c0c0a + 30b950f
commit f81be1d
Showing 1 changed file with 36 additions and 8 deletions.
diff --git a/docs/src/main/tut/FeatureOverview.md b/docs/src/main/tut/FeatureOverview.md
@@ -22,19 +22,19 @@ import spark.implicits._
 We start by defining a case class:
 
 ```tut:silent
-case class Apartment(city: String, surface: Int, price: Double)
+case class Apartment(city: String, surface: Int, price: Double, bedrooms: Int)
 ```
 
 And a few `Apartment` instances:
 
 ```tut:silent
 val apartments = Seq(
-  Apartment("Paris", 50, 300000.0),
-  Apartment("Paris", 100, 450000.0),
-  Apartment("Paris", 25, 250000.0),
-  Apartment("Lyon", 83, 200000.0),
-  Apartment("Lyon", 45, 133000.0),
-  Apartment("Nice", 74, 325000.0)
+  Apartment("Paris", 50,  300000.0, 2),
+  Apartment("Paris", 100, 450000.0, 3),
+  Apartment("Paris", 25,  250000.0, 1),
+  Apartment("Lyon",  83,  200000.0, 2),
+  Apartment("Lyon",  45,  133000.0, 1),
+  Apartment("Nice",  74,  325000.0, 3)
 )
 ```
 
@@ -243,7 +243,7 @@ priceByCity.collect().run()
 ```
 Again if we try to aggregate a column that can't be aggregated, we get a compilation error
 ```tut:book:fail
-aptTypedDs.groupBy(aptTypedDs('city)).agg(avg(aptTypedDs('city)))                                                         ^
+aptTypedDs.groupBy(aptTypedDs('city)).agg(avg(aptTypedDs('city)))                                                        
 ```
 
 Next, we combine `select` and `groupBy` to calculate the average price/surface ratio per city:
@@ -256,6 +256,34 @@ val cityPriceRatio =  aptds.select(aptds('city), aptds('price) / aptds('surface)
 cityPriceRatio.groupBy(cityPriceRatio('_1)).agg(avg(cityPriceRatio('_2))).show().run()
 ```
 
+We can also use `pivot` to further group data on a secondary column. 
+For example, we can compare the average price across cities by number of bedrooms.  
+
+```tut:book
+case class BedroomStats(
+   city: String, 
+   AvgPriceBeds1: Option[Double], // Pivot values may be missing, so we encode them using Options
+   AvgPriceBeds2: Option[Double], 
+   AvgPriceBeds3: Option[Double], 
+   AvgPriceBeds4: Option[Double])
+   
+val bedroomStats = aptds.
+   groupBy(aptds('city)).
+   pivot(aptds('bedrooms)).
+   on(1,2,3,4). // We only care about up to 4 bedrooms
+   agg(avg(aptds('price))).
+   as[BedroomStats]  // Typesafe casting
+
+bedroomStats.show().run()
+```
+
+With `pivot`, collecting data preserves type safety by 
+encoding potentially missing columns with `Option`.
+
+```tut:book
+bedroomStats.collect().run().foreach(println)
+```
+
 ### Entire TypedDataset Aggregation
 
 We often want to aggregate the entire `TypedDataset` and skip the `groupBy()` clause.
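
As a note on the `pivot` example added in this diff: the sketch below reproduces the same result shape using only plain Scala collections, with no Spark or frameless involved. It is meant to illustrate what "group by city, then spread the bedroom counts into columns" computes, not how frameless implements it. The `PivotSketch` object and the `average` and `pivotByBedrooms` helpers are hypothetical names introduced for this illustration; the `Apartment` and `BedroomStats` case classes mirror the ones in the diff.

```scala
// Plain-collections illustration of the pivot in this diff.
// Assumption: this models only the result shape, not frameless internals.
final case class Apartment(city: String, surface: Int, price: Double, bedrooms: Int)

final case class BedroomStats(
  city: String,
  AvgPriceBeds1: Option[Double],
  AvgPriceBeds2: Option[Double],
  AvgPriceBeds3: Option[Double],
  AvgPriceBeds4: Option[Double])

object PivotSketch {
  val apartments: Seq[Apartment] = Seq(
    Apartment("Paris", 50,  300000.0, 2),
    Apartment("Paris", 100, 450000.0, 3),
    Apartment("Paris", 25,  250000.0, 1),
    Apartment("Lyon",  83,  200000.0, 2),
    Apartment("Lyon",  45,  133000.0, 1),
    Apartment("Nice",  74,  325000.0, 3)
  )

  // Hypothetical helper: mean of a sequence, None when there is nothing to average.
  def average(xs: Seq[Double]): Option[Double] =
    if (xs.isEmpty) None else Some(xs.sum / xs.size)

  // Hypothetical helper: one row per city, one optional column per pivot value (1 to 4).
  // Rows are sorted by city so the output order is deterministic.
  def pivotByBedrooms(apts: Seq[Apartment]): Seq[BedroomStats] =
    apts.groupBy(_.city).toSeq.sortBy(_._1).map { case (city, inCity) =>
      def avgFor(beds: Int): Option[Double] =
        average(inCity.filter(_.bedrooms == beds).map(_.price))
      BedroomStats(city, avgFor(1), avgFor(2), avgFor(3), avgFor(4))
    }

  def main(args: Array[String]): Unit =
    pivotByBedrooms(apartments).foreach(println)
}
```

A city with no apartments for a given bedroom count ends up with `None` in that column, which is why the target case class uses `Option[Double]` for every pivoted field.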