# Aggregate functions
Aggregate functions operate on whole columns of tiles at once and return a single result for the column: a scalar for functions such as `aggMean`, or a tile for the local aggregate functions described further down. An example of this type of function is finding the mean cell value of a column of tiles. Other statistical and diagnostic tasks can likewise be performed on a whole column of tiles at a time.

In [1]:
from pyrasterframes import *
from pyrasterframes.rasterfunctions import *
import pyspark
from pyspark.sql import SparkSession
from pathlib import Path

spark = SparkSession.builder. \
    master("local[*]"). \
    appName("RasterFrames"). \
    config("spark.ui.enabled", "false"). \
    getOrCreate(). \
    withRasterFrames()

sc = spark.sparkContext

resource_dir = Path('../samples').resolve()
filenamePattern = "L8-B{}-Elkton-VA.tiff"
def readTiff(name):
    return resource_dir.joinpath(filenamePattern.format(name)).as_uri()

In [2]:
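# Read each band as a RasterFrame, copy its tile into col_0, col_1, or col_2
# (chosen by (band - 1) % 3), then spatially join the bands into one RasterFrame.
from functools import reduce

def bandsToJoin(bands):
    def readBand(band):
        rf = spark.read.geotiff(readTiff(band)).asRF()
        return rf.withColumn("col_" + str((band - 1) % 3), rf["tile"])
    # Use the `bands` argument (the original hard-coded (1,2,3) here).
    joined = reduce(lambda rf1, rf2: rf1.asRF().spatialJoin(rf2), map(readBand, bands))
    return joined.drop('bounds', 'metadata', 'tile')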

Create a contrived example of columns that each hold multiple tiles. Bands 1, 4, and 7 go in `col_0`; bands 2, 5, and 8 go in `col_1`; and bands 3, 6, and 9 go in `col_2`.

In [3]:
bands1 = bandsToJoin((1,2,3))
bands2 = bandsToJoin((4,5,6))
bands3 = bandsToJoin((7,8,9))

In [4]:
# show a RasterFrame with tile columns col_0, col_1, col_2
bands3.show()

+-----------+--------------------+--------------------+--------------------+
|spatial_key|               col_0|               col_1|               col_2|
+-----------+--------------------+--------------------+--------------------+
|      [0,0]|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
+-----------+--------------------+--------------------+--------------------+



In [5]:
grouped = bands1.union(bands2).union(bands3)

Compute statistics based on aggregates of tiles. These are all column-wise operations: `aggMean` computes the average of all cells of all tiles in a given column and returns a scalar. This contrasts with local aggregate functions like `localAggMean`, which return a tile where every cell is the mean of the corresponding input cells.

In [6]:
grouped.select(aggMean("col_0")).show()

+-----------------+
|  agg_mean(col_0)|
+-----------------+
|9834.874785264363|
+-----------------+
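
The difference between the two kinds of mean can be sketched without Spark. The following is a minimal illustration in plain Python, assuming tiles are represented as row-major lists; the helper names `agg_mean` and `local_agg_mean` are hypothetical stand-ins and not part of the RasterFrames API.

```python
# A "column" of three 2x2 tiles, each stored as a row-major list of cells.
tiles = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

def agg_mean(column):
    """Mean over every cell of every tile in the column: a single scalar."""
    cells = [c for tile in column for c in tile]
    return sum(cells) / len(cells)

def local_agg_mean(column):
    """Mean at each cell position across the tiles: a tile-shaped result."""
    return [sum(cells) / len(cells) for cells in zip(*column)]

print(agg_mean(tiles))        # 6.5
print(local_agg_mean(tiles))  # [5.0, 6.0, 7.0, 8.0]
```

The aggregate version collapses the whole column to one number, while the local version keeps the tile shape and reduces only across tiles.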



`aggDataCells` returns the number of cells in a column of tiles that are not NODATA. `aggNoDataCells` does the exact opposite, returning the number of NODATA cells.

In [7]:
grouped.select(aggDataCells("col_1")).show()

+---------------------+
|agg_data_cells(col_1)|
+---------------------+
|                94302|
+---------------------+
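
As a rough illustration of what the two counts mean, here is a plain-Python sketch. It assumes NODATA is represented by `None`, and the helper names are hypothetical rather than RasterFrames functions.

```python
NODATA = None  # assumed NODATA marker for this sketch only

# A column of two 2x2 tiles (row-major), with three NODATA cells in total.
tiles = [
    [1, NODATA, 3, 4],
    [NODATA, NODATA, 7, 8],
]

def agg_data_cells(column):
    """Count cells across the whole column that hold a data value."""
    return sum(1 for tile in column for c in tile if c is not NODATA)

def agg_no_data_cells(column):
    """Count cells across the whole column that are NODATA."""
    return sum(1 for tile in column for c in tile if c is NODATA)

print(agg_data_cells(tiles))     # 5
print(agg_no_data_cells(tiles))  # 3
```

The two counts always sum to the total number of cells in the column, here 8.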



## Local aggregate functions
Local aggregate functions operate cell by cell across the tiles in a column. For instance, `localAggMax` examines every tile in the column and sets each cell of the output tile to the maximum of the values found at that position (three values in this case, one per tile).

In [11]:
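# Three 3x3 tiles: ascending 1..9, descending 9..1, and a constant 5.
# NOTE: `generateTile` and `Tile` are not defined by the imports in this notebook,
# so this cell raises a NameError unless a tile constructor is supplied.
array1 = list(range(1, 10))
tile1 = generateTile('int', array1, 3, 3)
array2 = list(range(9, 0, -1))  # range(1, 10, -1) would be empty
tile2 = Tile(array2, 3, 3)
array3 = [5] * 9                # Python equivalent of Scala's Array.fill(9)(5)
tile3 = Tile(array3, 3, 3)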

NameError: name 'generateTile' is not defined

In [5]:
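# One tile per row in a single column named "tiles"; each row must be a 1-tuple.
df = spark.createDataFrame([(t,) for t in (tile1, tile2, tile3)], ["tiles"])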

df: org.apache.spark.sql.DataFrame = [tiles: rf_tile]


`localAggDataCells` counts, for each cell position, how many tiles in the column hold a data value there. Because every tile here is completely filled with data cells, each cell of the result contains three, one for each tile aggregated over. `localAggNoDataCells` works in the same way but counts NODATA cells.

In [75]:
df.select(tileHistogram(localAggDataCells("tiles"))).show(truncate=False)

+-------------------+---------+
|stats              |bins     |
+-------------------+---------+
|[9,3.0,3.0,3.0,0.0]|[[3.0,9]]|
+-------------------+---------+
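
The per-cell count can be sketched in plain Python to show why every cell comes out as three. This assumes the three tiles hold 1 to 9, 9 to 1, and a constant 5, with `None` standing in for NODATA; `local_agg_data_cells` is a hypothetical helper, not the RasterFrames function.

```python
NODATA = None  # assumed NODATA marker for this sketch only

# Three 3x3 tiles as row-major lists, all cells filled with data.
tiles = [
    list(range(1, 10)),
    list(range(9, 0, -1)),
    [5] * 9,
]

def local_agg_data_cells(column):
    """For each cell position, count the tiles that hold a data value there."""
    return [sum(1 for c in cells if c is not NODATA) for cells in zip(*column)]

print(local_agg_data_cells(tiles))  # [3, 3, 3, 3, 3, 3, 3, 3, 3]

# Punch one NODATA hole in the first tile: only that position drops to 2.
holed = [[NODATA] + tiles[0][1:], tiles[1], tiles[2]]
print(local_agg_data_cells(holed))  # [2, 3, 3, 3, 3, 3, 3, 3, 3]
```

Nine cells that all equal three is consistent with the single histogram bin shown above.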



Functions also exist for finding the per-cell max, mean, and min across a column of tiles, for example `localAggMax` and `localAggMean`.

In [None]:
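# Assumes the tile objects passed to the UDF expose `asciiDraw()`, as GeoTrellis tiles do.
from pyspark.sql.functions import udf
showTile = udf(lambda t: t.asciiDraw())
df.select(showTile(localAggMax("tiles"))).show(truncate=False)
df.select(showTile(localAggMean("tiles"))).show(truncate=False)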
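
Since the cell above has no recorded output, the following plain-Python sketch shows the values one would expect from a per-cell max and mean. It assumes the three tiles hold 1 to 9, 9 to 1, and a constant 5; `local_agg` and `draw` are hypothetical helpers, not RasterFrames or GeoTrellis calls.

```python
# Three 3x3 tiles as row-major lists.
tiles = [
    list(range(1, 10)),
    list(range(9, 0, -1)),
    [5] * 9,
]

def local_agg(column, fn):
    """Apply fn to the values found at each cell position across all tiles."""
    return [fn(cells) for cells in zip(*column)]

def draw(tile, cols=3):
    """Render a row-major tile as text, one tile row per line."""
    rows = [tile[i:i + cols] for i in range(0, len(tile), cols)]
    return "\n".join(" ".join(str(c) for c in row) for row in rows)

local_max = local_agg(tiles, max)
local_mean = local_agg(tiles, lambda cells: sum(cells) / len(cells))

print(draw(local_max))
# 9 8 7
# 6 5 6
# 7 8 9
print(draw(local_mean))
# 5.0 5.0 5.0
# 5.0 5.0 5.0
# 5.0 5.0 5.0
```

The mean is 5.0 everywhere because the ascending and descending tiles sum to 10 at every position.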

`localAggStats` returns a structure whose fields (count, min, max, mean, and variance) are each a tile holding that statistic computed cell by cell across the column.

In [6]:
df.select(localAggStats("tiles").alias("stats")).select("stats.*").show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|               count|                 min|                 max|                mean|            variance|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|IntUserDefinedNoD...|IntConstantNoData...|IntConstantNoData...|DoubleConstantNoD...|DoubleConstantNoD...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
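
Because the output above truncates each tile, here is a plain-Python sketch of the shape of the result: one tile per statistic. It assumes the same three tiles (1 to 9, 9 to 1, constant 5) and uses population variance, which is an assumption; the source does not say which variance `localAggStats` computes. `local_agg_stats` is a hypothetical helper.

```python
from statistics import mean, pvariance

# Three 3x3 tiles as row-major lists.
tiles = [
    list(range(1, 10)),
    list(range(9, 0, -1)),
    [5] * 9,
]

def local_agg_stats(column):
    """Return a dict of tiles, one per statistic, computed per cell position."""
    per_cell = list(zip(*column))
    return {
        "count": [len(cells) for cells in per_cell],
        "min": [min(cells) for cells in per_cell],
        "max": [max(cells) for cells in per_cell],
        "mean": [mean(cells) for cells in per_cell],
        # Population variance is assumed here for illustration.
        "variance": [pvariance(cells) for cells in per_cell],
    }

stats = local_agg_stats(tiles)
print(stats["count"])  # [3, 3, 3, 3, 3, 3, 3, 3, 3]
print(stats["min"])    # [1, 2, 3, 4, 5, 4, 3, 2, 1]
print(stats["max"])    # [9, 8, 7, 6, 5, 6, 7, 8, 9]
```

Each entry has the same nine-cell shape as the input tiles, matching the five tile-valued columns in the output above.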

