# Raster Statistics

In [2]:
import astraea.spark.rasterframes._
import geotrellis.raster._
import geotrellis.raster.render._
import geotrellis.raster.io.geotiff.SinglebandGeoTiff
import geotrellis.spark._
import geotrellis.spark.io._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

implicit val spark = SparkSession.builder().
   master("local[*]").appName("RasterFrames").getOrCreate().withRasterFrames
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val scene = SinglebandGeoTiff("../samples/L8-B8-Robinson-IL.tiff")
val rf = scene.projectedRaster.toRF(128, 128).cache()

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1ff12bec
import spark.implicits._
scene: geotrellis.raster.io.geotiff.SinglebandGeoTiff = SinglebandGeoTiff(geotrellis.raster.UShortConstantNoDataArrayTile@4119ea0a,Extent(431902.5, 4313647.5, 443512.5, 4321147.5),EPSG:32616,Tags(Map(AREA_OR_POINT -> POINT),List(Map())),GeoTiffOptions(geotrellis.raster.io.geotiff.Striped@a58661f,geotrellis.raster.io.geotiff.compression.DeflateCompression$@41f94e4,1,None))
rf: org.apache.spark.sql.DataFrame ...

RasterFrames has a number of extension methods and columnar functions for performing analysis on tiles.

## Tile Statistics 

### Tile Dimensions

Get the nominal tile dimensions. Depending on the tiling, some tiles along the edges may have different sizes.

In [3]:
rf.select(rf.spatialKeyColumn, tileDimensions($"tile")).show(3)

+-----------+---------------+
|spatial_key|dimension(tile)|
+-----------+---------------+
|      [6,3]|      [128,128]|
|      [4,0]|      [128,128]|
|      [2,1]|      [128,128]|
+-----------+---------------+
only showing top 3 rows



### Descriptive Statistics

#### NoData Counts

Count the number of `NoData` and non-`NoData` cells in each tile.

In [4]:
rf.select(rf.spatialKeyColumn, noDataCells($"tile"), dataCells($"tile")).show(3)

+-----------+-----------------+---------------+
|spatial_key|noDataCells(tile)|dataCells(tile)|
+-----------+-----------------+---------------+
|      [6,3]|            15688|            696|
|      [4,0]|                0|          16384|
|      [2,1]|                0|          16384|
+-----------+-----------------+---------------+
only showing top 3 rows
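
The first row above is consistent with an edge tile padded out to the nominal size: 15688 + 696 = 16384 = 128 × 128. The plain-Scala sketch below (no Spark required) reproduces that arithmetic. The helper `dataCellsInTile` is hypothetical, and the 774 × 500 raster size is an assumption rather than a value stated in this document; it was chosen because it matches the counts shown here and multiplies out to the 387000 data cells reported by `aggStats` later in this section. The sketch also assumes the raster itself contains no `NoData` cells.

```scala
// Hypothetical helper, not part of RasterFrames.
object EdgeTiles {
  /** Number of cells of the tile at layout position (col, row) that lie inside the raster. */
  def dataCellsInTile(rasterCols: Int, rasterRows: Int, tileSize: Int, col: Int, row: Int): Int = {
    // Clamp the covered width and height to the range [0, tileSize].
    val width  = math.max(0, math.min(tileSize, rasterCols - col * tileSize))
    val height = math.max(0, math.min(tileSize, rasterRows - row * tileSize))
    width * height
  }
}

// Assumed raster size: 774 columns by 500 rows, tiled at 128 x 128.
val edge     = EdgeTiles.dataCellsInTile(774, 500, 128, col = 6, row = 3)
val interior = EdgeTiles.dataCellsInTile(774, 500, 128, col = 4, row = 0)
println(s"edge tile: $edge data cells, ${128 * 128 - edge} padding cells")
println(s"interior tile: $interior data cells")
```

Under these assumptions the edge tile at `[6,3]` covers 6 × 116 raster cells, and the remainder of the 128 × 128 tile is padding.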



#### Tile Mean

Compute the mean value in each tile. Use `tileMean` for integral cell types, and `tileMeanDouble` for floating-point
cell types.

In [5]:
rf.select(rf.spatialKeyColumn, tileMean($"tile")).show(3)

+-----------+------------------+
|spatial_key|    tileMean(tile)|
+-----------+------------------+
|      [6,3]|10757.254310344828|
|      [4,0]| 9883.589050292969|
|      [2,1]| 9966.188293457031|
+-----------+------------------+
only showing top 3 rows



#### Tile Summary Statistics

Compute a suite of summary statistics for each tile. Use `tileStats` for integral cell types, and `tileStatsDouble`
for floating-point cell types.

In [6]:
rf.withColumn("stats", tileStats($"tile")).select(rf.spatialKeyColumn, $"stats.*").show(3)

+-----------+---------+------+-------+------------------+------------------+
|spatial_key|dataCells|   min|    max|              mean|          variance|
+-----------+---------+------+-------+------------------+------------------+
|      [6,3]|      696|7604.0|16143.0|10757.254310344822| 3271125.902280271|
|      [4,0]|    16384|7678.0|16464.0| 9883.589050292961|2163148.3790329304|
|      [2,1]|    16384|7209.0|31489.0| 9966.188293457051| 5606533.298102704|
+-----------+---------+------+-------+------------------+------------------+
only showing top 3 rows
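
To make the columns above concrete, here is a minimal plain-Scala sketch of the same five quantities computed over a flat sequence of cells, skipping `NoData`. Everything in it is illustrative: `TileSummary`, `TileSummarySketch`, and the `NoData` sentinel are hypothetical names, and the use of population variance (dividing by the number of data cells) is an assumption, since this document does not say which variance `tileStats` reports.

```scala
// Hypothetical container mirroring the columns in the table above.
final case class TileSummary(dataCells: Long, min: Double, max: Double, mean: Double, variance: Double)

object TileSummarySketch {
  // Assumed NoData sentinel, used for this sketch only.
  val NoData: Int = Int.MinValue

  /** Summarize the non-NoData cells; returns None when there are no data cells. */
  def summarize(cells: Seq[Int]): Option[TileSummary] = {
    val data = cells.filter(_ != NoData).map(_.toDouble)
    if (data.isEmpty) None
    else {
      val n = data.size
      val mean = data.sum / n
      // Population variance: mean squared deviation from the mean (an assumption).
      val variance = data.map(v => (v - mean) * (v - mean)).sum / n
      Some(TileSummary(n.toLong, data.min, data.max, mean, variance))
    }
  }
}

val summary = TileSummarySketch.summarize(Seq(2, 4, TileSummarySketch.NoData, 6))
println(summary)
```

Returning an `Option` keeps the all-`NoData` case explicit instead of producing a division by zero.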



### Histogram

The `tileHistogram` function computes a histogram over the data in each tile. See the
@scaladoc[GeoTrellis `Histogram`](geotrellis.raster.histogram.Histogram) documentation for details on what's
available in the resulting data structure. Use this version for integral cell types, and `tileHistogramDouble` for
floating-point cell types.

In this example we compute quantile breaks.

In [7]:
rf.select(tileHistogram($"tile")).map(_.quantileBreaks(5)).show(5, false)

+--------------------------------------------------------------------------------------------------+
|value                                                                                             |
+--------------------------------------------------------------------------------------------------+
|[8809.728925619835, 9867.17899408284, 10610.464285714286, 11537.7625, 12449.983431952664]         |
|[8092.536291685227, 8799.830256846086, 9883.927927555094, 10663.851206181313, 11410.889115337006] |
|[7968.26240877842, 8562.230214398447, 9197.04438927365, 10051.879083058071, 11392.528575516666]   |
|[7779.22924835041, 8571.834631513755, 10178.87875199173, 10846.649480361679, 11391.22742857556]   |
|[7873.758444506966, 8966.896173598834, 10637.314862591527, 11377.284237089707, 12150.871174122809]|
+--------------------------------------------------------------------------------------------------+
only showing top 5 rows
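
For intuition about what quantile breaks are, the sketch below computes them directly from a sorted sequence in plain Scala. It is not the GeoTrellis algorithm: `QuantileSketch` is a hypothetical name, and both the nearest-rank rule and the break positions `i / (n + 1)` are assumptions made for illustration. A histogram-based implementation works from bin counts rather than raw values, so its results will generally differ.

```scala
// Hypothetical helper, not part of GeoTrellis or RasterFrames.
object QuantileSketch {
  /** Return n break values at the assumed quantile positions 1/(n+1), ..., n/(n+1). */
  def quantileBreaks(values: Seq[Double], n: Int): Seq[Double] = {
    require(values.nonEmpty, "need at least one value")
    require(n > 0, "need at least one break")
    val sorted = values.sorted.toVector
    (1 to n).map { i =>
      val q = i.toDouble / (n + 1)
      // Nearest-rank: smallest value with at least a fraction q of the data at or below it.
      val rank = math.ceil(q * sorted.size).toInt
      sorted(math.min(sorted.size - 1, math.max(0, rank - 1)))
    }
  }
}

val breaks = QuantileSketch.quantileBreaks((1 to 11).map(_.toDouble), 5)
println(breaks.mkString("[", ", ", "]"))
```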



## Aggregate Statistics

The `aggStats` function computes the same summary statistics as `tileStats`, but aggregates them over the whole 
RasterFrame.

In [8]:
rf.select(aggStats($"tile")).show()

+---------+------+-------+-----------------+------------------+
|dataCells|   min|    max|             mean|          variance|
+---------+------+-------+-----------------+------------------+
|   387000|7209.0|39217.0|10160.48549870801|3315238.5311127007|
+---------+------+-------+-----------------+------------------+
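
One way to think about aggregation over many tiles is as a merge of per-tile summaries. The plain-Scala sketch below combines two partial results using the standard pairwise formula for merging means and variances. It is only an illustration under stated assumptions: `PartialStats` is a hypothetical type, population variance is assumed, and nothing here is claimed to be how `aggStats` is implemented.

```scala
// Hypothetical partial summary for one tile (or one group of tiles).
final case class PartialStats(count: Long, min: Double, max: Double, mean: Double, variance: Double) {
  /** Combine two summaries as if their underlying cells had been pooled. */
  def merge(other: PartialStats): PartialStats =
    if (count == 0) other
    else if (other.count == 0) this
    else {
      val n = count + other.count
      val delta = other.mean - mean
      val mergedMean = mean + delta * other.count / n
      // Sum of squared deviations of the pooled data (population variance assumed).
      val m2 = variance * count + other.variance * other.count +
        delta * delta * count * other.count / n
      PartialStats(n, math.min(min, other.min), math.max(max, other.max), mergedMean, m2 / n)
    }
}

// Summaries of the cell values {1, 2, 3} and {4, 5}.
val left  = PartialStats(3, 1.0, 3.0, 2.0, 2.0 / 3.0)
val right = PartialStats(2, 4.0, 5.0, 4.5, 0.25)
println(Seq(left, right).reduce(_ merge _))
```

Because `merge` is associative up to floating-point rounding, partial summaries can be combined in any grouping, which is what makes this style of aggregation suitable for distributed data.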



A more involved example: extract bin counts from a computed `Histogram`.

In [9]:
rf.select(aggHistogram($"tile")).
  map(h => for(v <- h.labels) yield(v, h.itemCount(v))).
  select(explode($"value") as "counts").
  select("counts._1", "counts._2").
  toDF("value", "count").
  orderBy(desc("count")).
  show(10)

+------------------+-----+
|             value|count|
+------------------+-----+
| 7912.160422482329|61257|
| 9693.363186541315|38399|
|10442.785040368732|32327|
| 8720.579789348172|30192|
| 10078.11093862406|26411|
| 11622.43037347561|26240|
| 11054.25987914447|25154|
| 8342.309765285427|20706|
|10768.925962006819|20530|
|11333.802580318747|19765|
+------------------+-----+
only showing top 10 rows
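
The shape of that result, one `(value, count)` pair per bin ordered by descending count, can be reproduced on a small in-memory collection. The plain-Scala sketch below is a simplification: `BinCounts` is a hypothetical helper, and it treats every distinct integer as its own bin, whereas the fractional labels in the table above suggest the histogram groups cells into computed bins.

```scala
// Hypothetical helper, not part of GeoTrellis or RasterFrames.
object BinCounts {
  /** The n most frequent values with their counts; ties are broken by ascending value. */
  def topCounts(values: Seq[Int], n: Int): Seq[(Int, Int)] =
    values
      .groupBy(identity)
      .map { case (value, occurrences) => (value, occurrences.size) }
      .toSeq
      .sortBy { case (value, count) => (-count, value) }
      .take(n)
}

BinCounts.topCounts(Seq(3, 1, 3, 2, 3, 1), 2).foreach { case (value, count) =>
  println(s"$value -> $count")
}
```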

