# Raster Statistics

In [1]:
from pyrasterframes import *
from pyrasterframes.rasterfunctions import *
import pyspark
from pyspark.sql import SparkSession
from pathlib import Path

resource_dir = Path('./samples').resolve()

spark = SparkSession.builder. \
    master("local[*]"). \
    appName("RasterFrames"). \
    config("spark.ui.enabled", "false"). \
    getOrCreate(). \
    withRasterFrames()
# spark.sparkContext.setLogLevel("ERROR")

rf = spark.read.geotiff(resource_dir.joinpath("L8-B8-Robinson-IL.tiff").as_uri())

RasterFrames has a number of extension methods and columnar functions for performing analysis on tiles.

## Tile Statistics 

### Tile Dimensions

Get the nominal tile dimensions. Depending on the tiling, tiles along the edges may have different sizes.

In [2]:
rf.select(rf.spatialKeyColumn(), tileDimensions("tile")).show()

+-----------+---------------+
|spatial_key|dimension(tile)|
+-----------+---------------+
|      [0,0]|      [250,250]|
|      [1,0]|      [250,250]|
|      [0,1]|      [250,250]|
|      [1,1]|      [250,250]|
+-----------+---------------+
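To see why edge tiles can differ from the nominal size, here is a minimal sketch in plain Python (not the RasterFrames API) that splits a raster into a grid of tiles. The helper name `tile_layout` and the 600 x 550 raster are hypothetical, chosen only for illustration, and the sketch assumes edge tiles are clipped to the raster extent rather than padded. Keys are written as (col, row) to mirror the `spatial_key` column above.

```python
def tile_layout(raster_cols, raster_rows, tile_cols, tile_rows):
    """Map each (col, row) tile key to its (cols, rows) size.

    Hypothetical helper for illustration; not part of RasterFrames.
    """
    layout = {}
    # Ceiling division: a partial tile still needs its own grid slot.
    grid_cols = -(-raster_cols // tile_cols)
    grid_rows = -(-raster_rows // tile_rows)
    for row in range(grid_rows):
        for col in range(grid_cols):
            # Tiles on the right and bottom edges are clipped to the raster extent.
            width = min(tile_cols, raster_cols - col * tile_cols)
            height = min(tile_rows, raster_rows - row * tile_rows)
            layout[(col, row)] = (width, height)
    return layout


# A 500 x 500 raster divides evenly into four 250 x 250 tiles.
even = tile_layout(500, 500, 250, 250)
# A 600 x 550 raster leaves smaller tiles along the right and bottom edges.
ragged = tile_layout(600, 550, 250, 250)
print(even)
print(ragged[(2, 2)])
```

In the even case every key maps to the nominal size; in the ragged case only the last column and last row of the grid deviate from it.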



### Descriptive Statistics

#### NoData Counts

Count the number of `NoData` and non-`NoData` cells in each tile.

In [3]:
rf.select(rf.spatialKeyColumn(), noDataCells("tile"), dataCells("tile")).show()

+-----------+-----------------+---------------+
|spatial_key|noDataCells(tile)|dataCells(tile)|
+-----------+-----------------+---------------+
|      [0,0]|                0|          62500|
|      [1,0]|                0|          62500|
|      [0,1]|                0|          62500|
|      [1,1]|                0|          62500|
+-----------+-----------------+---------------+
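Conceptually, the two counts partition the cells of a tile, so they always sum to the tile's cell count (62500 = 250 x 250 above). The following is a rough sketch in plain Python, not the RasterFrames implementation. The helper `count_cells` is hypothetical, and using 0 as the `NoData` sentinel is an assumption made for the example; the actual sentinel depends on the cell type.

```python
def count_cells(cells, nodata):
    """Return (no_data_count, data_count) for a flat sequence of cell values.

    Hypothetical helper for illustration; not part of RasterFrames.
    """
    no_data = sum(1 for value in cells if value == nodata)
    return no_data, len(cells) - no_data


# Hypothetical 3 x 3 tile, flattened row by row; 0 stands in for NoData here.
tile = [7254, 0, 8100,
        7900, 8050, 0,
        0, 7600, 7700]
no_data, data = count_cells(tile, nodata=0)
print(no_data, data)
```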



#### Tile Mean

Compute the mean value in each tile. Use `tileMean` for integral cell types and `tileMeanDouble` for floating-point
cell types.

In [4]:
rf.select(rf.spatialKeyColumn(), tileMean("tile")).show(3)

+-----------+--------------+
|spatial_key|tileMean(tile)|
+-----------+--------------+
|      [0,0]|  10083.360304|
|      [1,0]|   9860.046272|
|      [0,1]|   10106.30968|
+-----------+--------------+
only showing top 3 rows



#### Tile Summary Statistics

Compute a suite of summary statistics for each tile. Use `tileStats` for integral cell types and `tileStatsDouble`
for floating-point cell types.

In [5]:
rf.withColumn("stats", tileStats("tile")).select(rf.spatialKeyColumn(), "stats.*").show(3)

+-----------+---------+------+-------+------------------+------------------+
|spatial_key|dataCells|   min|    max|              mean|          variance|
+-----------+---------+------+-------+------------------+------------------+
|      [0,0]|    62500|7254.0|31887.0|10083.360304000025| 3232555.629237034|
|      [1,0]|    62500|7470.0|29722.0| 9860.046272000003|  2811384.49990689|
|      [0,1]|    62500|7249.0|35647.0|10106.309679999991|3441403.8590902924|
+-----------+---------+------+-------+------------------+------------------+
only showing top 3 rows
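The columns above can be reproduced for a small tile with the standard library. This is only a sketch under stated assumptions: the helper `tile_stats` is hypothetical, `NoData` cells are assumed to be excluded before any statistic is computed, and population variance is used because the source does not say whether `tileStats` reports population or sample variance.

```python
import statistics


def tile_stats(cells, nodata=None):
    """Summarize the data cells of a tile, skipping NoData.

    Hypothetical helper for illustration; assumes at least one data cell.
    """
    data = [float(value) for value in cells if value != nodata]
    return {
        "dataCells": len(data),
        "min": min(data),
        "max": max(data),
        "mean": statistics.fmean(data),
        # Assumption: population variance (divide by N, not N - 1).
        "variance": statistics.pvariance(data),
    }


# Hypothetical tile with -1 standing in for NoData.
stats = tile_stats([2, 4, 4, 4, 5, 5, 7, 9, -1], nodata=-1)
print(stats)
```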



### Histogram

The `tileHistogram` function computes a histogram over the data in each tile. See the
@scaladoc[GeoTrellis `Histogram`](geotrellis.raster.histogram.Histogram) documentation for details on what's
available in the resulting data structure. Use this version for integral cell types, and `tileHistogramDouble` for
floating-point cell types.

In this example we compute quantile breaks.

In [None]:
# Compute one histogram per tile and stream the results back to the driver.
hist = rf.select(tileHistogram("tile")).rdd.flatMap(lambda x: x).toLocalIterator()
hists = list(hist)

In [5]:
rf.withSpatialIndex().show()

+-----------+--------------------+--------------------+--------------------+-------------------+
|spatial_key|              bounds|            metadata|                tile|      spatial_index|
+-----------+--------------------+--------------------+--------------------+-------------------+
|      [0,1]|POLYGON ((431902....|Map(AREA_OR_POINT...|geotrellis.raster...|2777272716961181063|
|      [0,0]|POLYGON ((431902....|Map(AREA_OR_POINT...|geotrellis.raster...|2777272889250452463|
|      [1,1]|POLYGON ((437707....|Map(AREA_OR_POINT...|geotrellis.raster...|2777273490489613465|
|      [1,0]|POLYGON ((437707....|Map(AREA_OR_POINT...|geotrellis.raster...|2777273662401083501|
+-----------+--------------------+--------------------+--------------------+-------------------+



In [None]:
# Assumes each collected histogram exposes `quantileBreaks`, as the GeoTrellis `Histogram` does.
breaks = [h.quantileBreaks(5) for h in hists]
print(breaks[:5])
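To make the idea of quantile breaks concrete, here is a minimal sketch in plain Python that derives them from a value-to-count histogram. The helper `quantile_breaks` is hypothetical, and its definition (the k-th break is the smallest value at which the cumulative count reaches k/n of the total) is an assumption; GeoTrellis may place its breaks differently.

```python
def quantile_breaks(counts, num_breaks):
    """Return `num_breaks` break values from a {value: count} histogram.

    Hypothetical helper for illustration; not the GeoTrellis algorithm.
    """
    total = sum(counts.values())
    breaks = []
    cumulative = 0
    k = 1  # index of the next quantile to satisfy
    for value in sorted(counts):
        cumulative += counts[value]
        # Integer comparison avoids floating-point error in cumulative / total >= k / n.
        while k <= num_breaks and cumulative * num_breaks >= total * k:
            breaks.append(value)
            k += 1
    return breaks


uniform = quantile_breaks({10: 2, 20: 2, 30: 2, 40: 2, 50: 2}, 5)
skewed = quantile_breaks({1: 8, 2: 1, 3: 1}, 2)
print(uniform, skewed)
```

With a skewed histogram the lower breaks collapse onto the dominant value, which is why quantile breaks are often preferred over equal-width bins for color ramps.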

## Aggregate Statistics

The `aggStats` function computes the same summary statistics as `tileStats`, but aggregates them over the whole 
RasterFrame.

In [17]:
rf.select(aggStats("tile")).show()

+--------------------+
|      aggStats(tile)|
+--------------------+
|[250000,7249.0,39...|
+--------------------+
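One way to think about the aggregate is as a fold over per-tile summaries. The sketch below combines two summaries with the standard pairwise formula for merging means and variances. It is an illustration only, not a description of how `aggStats` is implemented. The helper `merge_stats` is hypothetical, and population variance is assumed.

```python
from functools import reduce


def merge_stats(a, b):
    """Combine two summaries of disjoint cell sets into one.

    Each summary is a dict with dataCells, min, max, mean and (population)
    variance. Hypothetical helper for illustration; not the RasterFrames API.
    """
    n_a, n_b = a["dataCells"], b["dataCells"]
    n = n_a + n_b
    delta = b["mean"] - a["mean"]
    # Squared deviations of each part, plus a term for the gap between the two means.
    m2 = a["variance"] * n_a + b["variance"] * n_b + delta ** 2 * n_a * n_b / n
    return {
        "dataCells": n,
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
        "mean": a["mean"] + delta * n_b / n,
        "variance": m2 / n,
    }


# Hypothetical summaries of two tiles holding [2, 4] and [6, 8].
per_tile = [
    {"dataCells": 2, "min": 2.0, "max": 4.0, "mean": 3.0, "variance": 1.0},
    {"dataCells": 2, "min": 6.0, "max": 8.0, "mean": 7.0, "variance": 1.0},
]
overall = reduce(merge_stats, per_tile)
print(overall)
```

The merged result matches the statistics of the combined cells [2, 4, 6, 8], and the same reasoning explains why the aggregate `dataCells` above (250000) is the sum of the four per-tile counts of 62500.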



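A more involved example: extract bin counts from a computed `Histogram`. The snippet below uses Scala syntax (it assumes `import spark.implicits._` for the `$"..."` column notation), so it will not run in this Python session as written.

In [None]:
// Scala: pair each histogram label with its item count, then list the most frequent values.
rf.select(aggHistogram($"tile")).
  map(h => for (v <- h.labels) yield (v, h.itemCount(v))).
  select(explode($"value") as "counts").
  select("counts._1", "counts._2").
  toDF("value", "count").
  orderBy(desc("count")).
  show(10)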
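The shape of the result, (value, count) pairs ordered by descending count, can be illustrated in plain Python with `collections.Counter`. This is a rough sketch on a hypothetical list of cell values and does not use the RasterFrames histogram type.

```python
from collections import Counter

# Hypothetical cell values pooled from all tiles.
cells = [7, 7, 7, 9, 9, 8]

# Count occurrences of each value, then keep the most frequent bins.
counts = Counter(cells)
top = counts.most_common(2)
print(top)
```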


In [None]:
spark.stop()
