# Aggregate functions
Aggregate functions operate on whole columns of tiles at once, typically returning a tile that is the result of the operation . An example of this type of function is finding the mean cell values of a column of tiles. Other statistical and diagnostic tasks can be performed on columns of tiles at a time.

In [7]:
from pyrasterframes import *
from pyrasterframes.rasterfunctions import *
import pyspark
from pyspark.sql import SparkSession
from pathlib import Path

spark = SparkSession.builder. \
    master("local[*]"). \
    appName("RasterFrames"). \
    config("spark.ui.enabled", "false"). \
    getOrCreate(). \
    withRasterFrames()

sc = spark.sparkContext

resource_dir = Path('../samples').resolve()
filenamePattern = "L8-B{}-Elkton-VA.tiff"
def readTiff(name):
    return spark.read.geotiff(resource_dir.joinpath(filenamePattern.format(name)).as_uri())

In [5]:
from functools import reduce
# Start at the bottom
def bandsToJoin(bands):
    # Join all three columns together according to a spatial key
    return(reduce(lambda rf1, rf2: rf1.asRF().spatialJoin(rf2), 
    # create a new column with the value assigned in the previous step
    [x[1].withColumn(("col_" + str(x[0])), x[1]["tile"]) for x in
    # Read bands 1-3 of the tiff and also assign one of three values to it
    [((s - 1) % 3, readTiff(s)) for s in (1,2,3)]]) \
        # Get rid of columns we don't care about
        .drop('bounds', 'metadata', 'tile'))

Create a contrived example of columns with multiple tiles. Bands 1,4,7 are in col_0, 2,5,8 are in col_1, and 3,6,9 are in col_2.

In [11]:
bands1RF = bandsToJoin((1,2,3))
bands2RF = bandsToJoin((4,5,6))
bands3RF = bandsToJoin((7,8,9))

In [9]:
# get a rasterframe with band_0, band_1, band_2
bands3RF.show()

+-----------+--------------------+--------------------+--------------------+
|spatial_key|               col_0|               col_1|               col_2|
+-----------+--------------------+--------------------+--------------------+
|      [0,0]|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
+-----------+--------------------+--------------------+--------------------+



Join the three rasterframes together vertically, stacking them based on the four common columns

In [12]:
grouped = bands1RF.union(bands2RF).union(bands3RF)

Compute statistics based on aggregates of tiles. These are all column-wise operations, so `aggMean` computes the average of all cells of all tiles in a certain column, returning a scalar. This is in contrast to local functions like `localAggMean`, which would return a tile where every cell is the mean of the corresponding input cells.

In [13]:
grouped.select(aggMean("col_0")).show()

+-----------------+
|  agg_mean(col_0)|
+-----------------+
|9834.874785264363|
+-----------------+



`aggDataCells` returns the number of cells in a column of tiles that are not NODATA. `aggNoDataCells` does the exact opposite, returning the number of NODATA cells.

In [14]:
grouped.select(aggDataCells("col_1")).show()

+---------------------+
|agg_data_cells(col_1)|
+---------------------+
|                94302|
+---------------------+



## Local aggregate functions
Local functions operate on a cell-by-cell basis. For instance, `localAggMax` examines every tile in a column and the cells of the output tile are each the maximum of the number (in this case, three) of cell values that it sees.

In [15]:
from pyrasterframes.rasterfunctions import *

In [16]:
array1 = [list([1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0])]
array2 = [list([2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0])]
array3 = [list([-1.0,-2.0,-3.0,-4.0,-5.0,-6.0,-7.0,-8.0,-9.0])]

In [17]:
df = sc.parallelize((array1, array2, array3)).toDF()

In [18]:
df.show(truncate=False)
tileddf = df.withColumn("tiles", arrayToTile("_1", 3, 3))

+------------------------------------------------------+
|_1                                                    |
+------------------------------------------------------+
|[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]         |
|[2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]        |
|[-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0, -9.0]|
+------------------------------------------------------+



In [19]:
tileddf.show()

+--------------------+--------------------+
|                  _1|               tiles|
+--------------------+--------------------+
|[1.0, 2.0, 3.0, 4...|DoubleRawArrayTil...|
|[2.0, 3.0, 4.0, 5...|DoubleRawArrayTil...|
|[-1.0, -2.0, -3.0...|DoubleRawArrayTil...|
+--------------------+--------------------+



For instance, since each tile is filled with data cells, the result of aggregating a count of them is that each cell contains three, one for each tile aggregated over. `localAggNoDataCells` works in the same way.

In [20]:
tileddf.select(tileHistogram(localAggDataCells("tiles"))).show(truncate = False)

+----------------------------------------------+
|tileHistogram(localCount(tiles))              |
+----------------------------------------------+
|[[9,-1,3.0,3.0,3.0,0.0],WrappedArray([3.0,9])]|
+----------------------------------------------+



Functions also exist for finding the aggregated max, mean, and min

In [21]:
tileddf.select(tileStats(localAggMax("tiles"))).show()

+-----------------------------+
|tileStats(localAggMax(tiles))|
+-----------------------------+
|         [9,-1,2.0,10.0,6....|
+-----------------------------+



`localAggStats` will return a structure with columns representing a variety of different tile-wise aggregate statistics.

In [22]:
tileddf.select(tileStats(localAggMax("tiles"))).show()

+-----------------------------+
|tileStats(localAggMax(tiles))|
+-----------------------------+
|         [9,-1,2.0,10.0,6....|
+-----------------------------+

