# Getis-Ord Gi*
Getis and Ord's Gi and Gi* statistics are popular approaches for finding statistically significant hot and cold spots across space. They compare the value of a numerical variable for each spatial record with the values of its neighboring records. The user controls how these neighborhoods are defined.

In this example, we will use the Gi* statistic on the King County homes dataset we prepared last week to identify regions of high and low "density".

In [None]:
import wkls

region = wkls.us.wa.kirkland.wkt()
neighbor_search_radius_degrees = .01
h3_zoom_level = 10
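The search radius above is expressed in degrees, so its size on the ground depends on latitude. The following is a rough sketch of that conversion, assuming longitude/latitude coordinates and a spherical Earth. The helper `degrees_to_meters` and the latitude of 47.68 (an approximate value for Kirkland, WA) are illustrative assumptions and are not part of the notebook.

```python
import math

# Assumption: spherical Earth with the mean radius below (meters).
EARTH_RADIUS_M = 6_371_008.8


def degrees_to_meters(radius_degrees, latitude_degrees):
    """Approximate north-south and east-west extent of a radius given in degrees."""
    meters_per_degree = math.pi / 180 * EARTH_RADIUS_M
    north_south = radius_degrees * meters_per_degree
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_degrees))
    return north_south, east_west


# 47.68 is an assumed approximate latitude for Kirkland, WA.
ns, ew = degrees_to_meters(0.01, 47.68)
print(f"0.01 degrees is roughly {ns:.0f} m north-south and {ew:.0f} m east-west")
```

Under these assumptions the neighborhood is noticeably narrower east-west than north-south, which is worth keeping in mind when choosing the radius.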

## Spark Initialization
We will use Spark to run the Gi* algorithm. We initialize a Spark session with Sedona.

In [None]:
from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

In [None]:
database = 'gde_gold'

# Create the target database up front; this requires the `sedona` session created above.
sedona.sql(f'CREATE DATABASE IF NOT EXISTS org_catalog.{database}')
database

## Filtering and Aggregation
In this notebook we assign an H3 cell to each record and filter down to only the region of interest. We then aggregate the places data by cell identifier and count the number of places in each cell.


In [None]:
import pyspark.sql.functions as f
places_df = (
    sedona.table("org_catalog.gde_bronze.king_co_homes_conflated")
        .select(f.col("point"), f.col("sale_price"), f.col("sale_date"), f.col("sale_id"))
        .withColumn("h3Cell", ST_H3CellIDs(f.col("point"), h3_zoom_level, False)[0])
)

if region is not None:
    # The geometry column selected above is named "point"; places_df has no "geometry" column.
    places_df = places_df.filter(ST_Intersects(ST_GeomFromText(f.lit(region)), f.col("point"))).repartition(100)

places_df.count()

In [None]:
hexes_df = (
    places_df
        .groupBy(f.col("h3Cell"))
        .agg(f.count("*").alias("num_places")) # how many places in this cell
        .withColumn("geometry", ST_H3ToGeom(f.array(f.col("h3Cell")))[0])
)

## Sanity Check our Variable
We want to make sure the variable we will analyze has a good distribution of values. Specifically, we are checking that our cells are not too small, which would show up as very low place counts in every cell. We generate deciles here to confirm that the counts cover a reasonable range. An extreme negative example would be a variable whose values were all zero or one.


In [None]:
hexes_df.select(f.percentile_approx("num_places", [x / 10.0 for x in range(11)])).collect()[0][0]
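To make the check concrete, here is a small standalone sketch on toy data. It uses nearest-rank deciles in plain Python rather than Spark's `percentile_approx`, and the helper names and the `min_distinct` threshold are illustrative assumptions, not part of the notebook.

```python
def nearest_rank_deciles(values):
    """Return the 0th through 100th percentiles in steps of 10, using nearest rank."""
    ordered = sorted(values)
    n = len(ordered)
    deciles = []
    for k in range(11):
        rank = (n * k + 9) // 10  # integer ceiling of n * k / 10
        deciles.append(ordered[max(rank - 1, 0)])
    return deciles


def looks_degenerate(values, min_distinct=3):
    """Flag a variable whose deciles take fewer than `min_distinct` distinct values."""
    return len(set(nearest_rank_deciles(values))) < min_distinct


too_fine = [0, 1] * 50  # cells so small that every count is zero or one
healthy = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]  # counts spread over a wide range

print(nearest_rank_deciles(too_fine), looks_degenerate(too_fine))
print(nearest_rank_deciles(healthy), looks_degenerate(healthy))
```

If the real deciles resemble the first case, a coarser H3 resolution is probably a better choice than the current one.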

## Generate our Gi* statistic

Finally, we generate our statistic. There are many parameters to tune here; these are explained in the API documentation. We use the most typical settings, with the exception of the search radius, which is always domain specific.

The output includes, among other columns, a Z score and a P value for each cell. The Z score measures how many standard deviations the sum of values in a cell's neighborhood lies from the value expected if the variable were distributed randomly across space. The P value is the probability of seeing a Z score at least that extreme under that same assumption of spatial randomness, so small P values mark cells that are unlikely to be the product of random variation.
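For intuition, the sketch below computes Gi* Z scores on a tiny toy dataset in plain Python, using the standard Getis-Ord formula with binary distance-band weights. It is an illustration only and is not Sedona's implementation. The helper names are hypothetical, and the two-sided normal P value is an assumption about how the P value is derived.

```python
import math


def binary_distance_band(points, radius, include_self=True):
    """For each point, list the indices of points within `radius` (each has weight 1)."""
    neighbors = []
    for i, (xi, yi) in enumerate(points):
        neighbors.append([
            j for j, (xj, yj) in enumerate(points)
            if (include_self or j != i) and math.hypot(xi - xj, yi - yj) <= radius
        ])
    return neighbors


def gi_star(values, neighbors):
    """Return (z, p) per record; assumes binary weights and a non-constant variable."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    results = []
    for idx in neighbors:
        w_sum = len(idx)     # sum of weights, each equal to 1
        w_sq_sum = len(idx)  # sum of squared weights, since 1 ** 2 == 1
        local_sum = sum(values[j] for j in idx)
        numerator = local_sum - mean * w_sum
        # The denominator is zero when a neighborhood covers every record.
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        z = numerator / denominator
        p = math.erfc(abs(z) / math.sqrt(2))  # assumed two-sided normal P value
        results.append((z, p))
    return results


# Five cells on a line; the two on the right have much higher counts.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
values = [1, 1, 1, 10, 10]
results = gi_star(values, binary_distance_band(points, radius=1))
for value, (z, p) in zip(values, results):
    print(f"value={value:>2}  Z={z:+.2f}  P={p:.3f}")
```

In this toy data the rightmost cell gets a strongly positive Z score because both it and its neighbor hold high values, while the second cell from the left gets a strongly negative one.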



In [None]:
from sedona.spark import *


gi_df = g_local(
    add_binary_distance_band_column(
        hexes_df,
        neighbor_search_radius_degrees,
        include_self=True,
    ),
    "num_places",
    "weights",
    star=True
).cache()

gi_df.orderBy(f.col("P").asc()).show(5)

## Visualize
Now we plot our statistics in Kepler. Once Kepler is rendered, you can color the cells by Z score, set the number of bands to 10, and choose a color palette that runs from blue to red. The bluest cells are the cold spots and the reddest cells are the hot spots.

In [None]:
from sedona.spark import *

kmap = SedonaKepler.create_map(places_df, "places")

SedonaKepler.add_df(
    kmap,
    gi_df.drop("weights"),
    "cells"
)

kmap
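When reading the map, it can help to separate cells that are statistically significant from those that are merely high or low. The sketch below shows one way to label a cell from its Z score and P value. The function name and the 0.05 significance threshold are illustrative assumptions and should be chosen to suit the analysis.

```python
def classify_spot(z, p, alpha=0.05):
    """Label a cell from its Gi* Z score and P value; `alpha` is an assumed threshold."""
    if p >= alpha:
        return "not significant"
    return "hot spot" if z > 0 else "cold spot"


# Toy (Z, P) pairs standing in for rows of the statistics dataframe.
examples = [(2.6, 0.009), (-2.1, 0.036), (0.4, 0.689)]
labels = [classify_spot(z, p) for z, p in examples]
print(labels)
```

A cell with a large Z score but a P value above the threshold is best treated as inconclusive rather than as a hot or cold spot.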

## Save out the statistics dataframe to a Gold table

In [None]:
%%time

gi_df.writeTo(f"org_catalog.{database}.king_co_homes_hotspots").createOrReplace()