# Geospatial Analysis Work

<div class="alert alert-block alert-danger">
Warning: This notebook might take a while to run.
</div>

In this notebook, we perform some geospatial analysis as part of our initial analysis phase. We found nothing particularly insightful, other than that desert areas in Australia still have large revenue.

In [1]:
# import constants and libraries
import sys
sys.path.append('../scripts/utils')
from constants import *

import geopandas as gpd
import pandas as pd
import folium
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import numpy as np

In [2]:
# create a spark session 
spark = (
    SparkSession.builder.appName("MAST30034 Project 2")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config('spark.driver.memory', '4g')
    .config('spark.executor.memory', '2g')
    .getOrCreate()
)



### Calculate Revenue of each SA2 Area

In [3]:
# read data and aggregate
sf = gpd.read_file(f"{LANDING_DATA}SA2_2021_AUST_GDA2020.shp")
sa2 = pd.read_pickle(f"{RAW_DATA}SA2_code.pkl")
transaction_data = spark.read.parquet(f'{TRANSACTION_DATA}').groupBy("user_id").sum("dollar_value")
consumer_data = spark.createDataFrame(pd.read_pickle(f"{CURATED_DATA}tbl_consumer.pkl")[["user_id", "postcode"]])
transaction_consumer = transaction_data.join(consumer_data, on="user_id", how="inner").toPandas()

                                                                                

In [7]:
# drop duplicate rows from the SA2 lookup, then merge it in (with no `on` argument, merge joins on all shared column names)
sa2_no_dupe = sa2.drop_duplicates()
transaction_consumer = transaction_consumer.merge(sa2_no_dupe)
transactions_grouped = transaction_consumer.groupby("SA2_code")["sum(dollar_value)"].sum().reset_index()
transactions_grouped

Unnamed: 0,SA2_code,sum(dollar_value)
0,101021007.0,1.495891e+06
1,101021008.0,8.568414e+05
2,101021009.0,8.568414e+05
3,101021010.0,8.568414e+05
4,101021011.0,4.472325e+06
...,...,...
2216,801111141.0,1.401791e+06
2217,901011001.0,5.913394e+05
2218,901021002.0,8.224491e+05
2219,901031003.0,8.689848e+05
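The repeated totals in the output above (for example, three consecutive SA2 codes sharing 8.568414e+05) are what one would expect if a single postcode maps to several SA2 codes, so that the merge copies each consumer's total to every matching SA2 area. The toy example below illustrates that effect. The column names follow the notebook, but every value and the `lookup` frame are made up for illustration, and we assume the merge key is `postcode`.

```python
import pandas as pd

# Hypothetical per-consumer totals (made-up values and postcodes).
consumers = pd.DataFrame({
    "user_id": [1, 2],
    "postcode": [1111, 2222],
    "sum(dollar_value)": [100.0, 50.0],
})

# Hypothetical postcode -> SA2 lookup where postcode 1111 spans two SA2 areas.
lookup = pd.DataFrame({
    "postcode": [1111, 1111, 2222],
    "SA2_code": [11.0, 12.0, 13.0],
})

merged = consumers.merge(lookup, on="postcode")
per_sa2 = merged.groupby("SA2_code")["sum(dollar_value)"].sum().reset_index()

# The grand total grows because user 1's spend is counted once per SA2 area.
total_before = consumers["sum(dollar_value)"].sum()
total_after = per_sa2["sum(dollar_value)"].sum()
print(total_before, total_after)
```

If this is what happens in the real data, the per-SA2 figures are best read as "revenue from postcodes overlapping this SA2 area" rather than as an exact split.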


### Exploring SA2 areas with High Revenues

In [8]:
# sort all SA2 areas by total revenue, highest first
transactions_sorted = transactions_grouped.sort_values(by="sum(dollar_value)", ascending=False)
transactions_sorted

Unnamed: 0,SA2_code,sum(dollar_value)
342,117031337.0,1.180859e+08
1732,503021041.0,2.041840e+07
1659,406021141.0,2.034533e+07
685,206041122.0,1.426584e+07
597,202031033.0,1.254348e+07
...,...,...
1122,305031123.0,8.206280e+04
1126,305031127.0,8.206280e+04
1119,305031120.0,8.206280e+04
1098,304041098.0,8.206280e+04


We see that the SA2 area with the highest revenue is Haymarket, New South Wales, followed by Perth City and Outback.

## Geospatial Analysis

Now, we use the shapefile to plot each SA2 area's revenue.

In [9]:
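# reproject the shapefile geometry to longitude/latitude (WGS84)
sf['geometry'] = sf['geometry'].to_crs("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
# keep the first 2472 rows; .copy() avoids a possible SettingWithCopyWarning on the assignment below
sf = sf[:2472].copy()
# cast the SA2 code to float so it matches SA2_code in transactions_grouped
sf["SA2 code"] = sf["SA2_CODE21"].astype(float)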

# build the GeoJSON string used by the choropleth (one feature per SA2 code)
gdf = gpd.GeoDataFrame(sf)
geoJSON = gdf[['SA2 code', 'geometry']].drop_duplicates('SA2 code').to_json()

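The cast to float in the cell above matters because the choropleth joins the data to the GeoJSON by comparing key values, and a string code such as `"101021007"` never equals the float `101021007.0`. The standard-library sketch below mimics that lookup on a hand-written GeoJSON-like string. It is not folium's internal code, and the two features are stand-ins with no geometry.

```python
import json

# Revenue keyed by SA2 code as a float, as in transactions_grouped
# (the value is taken from the output shown earlier).
revenue = {101021007.0: 1.495891e06}

# Stand-in features: one keeps the code as a string, the other as a float.
features = json.loads(
    '[{"properties": {"SA2 code": "101021007"}},'
    ' {"properties": {"SA2 code": 101021007.0}}]'
)

for feature in features:
    key = feature["properties"]["SA2 code"]
    # dict.get returns None when no key of matching type and value exists
    print(type(key).__name__, revenue.get(key))
```

Areas whose key finds no match are the ones drawn with `nan_fill_color` on the map.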
In [12]:
m = folium.Map(location=[-38.043995, 145.264296], tiles="Stamen Terrain", zoom_start=10)
c = folium.Choropleth(
   geo_data=geoJSON, # geoJSON
   name='choropleth', # name of plot
   data=transactions_grouped.reset_index(), # data source
   columns=['SA2_code','sum(dollar_value)'], # the columns required
   key_on='properties.SA2 code', # this is from the geoJSON's properties
   fill_color='Reds', # color scheme
   nan_fill_color='black',
   legend_name='Sum Total Value'
)

c.add_to(m)

# Uncomment the line below to render the map. It is left commented out because rendering crashed my laptop.
#m

<folium.features.Choropleth at 0x12bfdc100>

## Geospatial Analysis based on Logarithmic Scale of Revenue

Because a few extreme revenue values skewed the plot, we plot the log of revenue instead.
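To get a feel for how much the log compresses the range, the snippet below compares the largest and smallest totals shown in the sorted output earlier (1.180859e+08 and 8.206280e+04). It is only an illustration and uses the standard library; `np.log` in the notebook computes the same natural logarithm.

```python
import math

highest = 1.180859e08  # largest total in the sorted output
lowest = 8.206280e04   # smallest total in the sorted output

ratio = highest / lowest                        # spread on the raw scale
log_gap = math.log(highest) - math.log(lowest)  # spread on the log scale

print(f"raw ratio: {ratio:.0f}, log gap: {log_gap:.2f}")
```

On the raw scale the top area is more than a thousand times the bottom one, so a linear color scale puts almost every area in the lowest bin. On the log scale the same gap is only a few units wide.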

In [13]:
transactions_grouped["log_dollar"] = np.log(transactions_grouped["sum(dollar_value)"])

m = folium.Map(location=[-38.043995, 145.264296], tiles="Stamen Terrain", zoom_start=10)
c = folium.Choropleth(
   geo_data=geoJSON, # geoJSON
   name='choropleth', # name of plot
   data=transactions_grouped.reset_index(), # data source
   columns=['SA2_code','log_dollar'], # the columns required
   key_on='properties.SA2 code', # this is from the geoJSON's properties
   fill_color='Reds', # color scheme
   nan_fill_color='black',
   legend_name='Log Sum Total Value'
)

c.add_to(m)

# Uncomment the line below to render the map. It is left commented out because rendering crashed my laptop.
#m

<folium.features.Choropleth at 0x12c02a470>