# This Notebook explores the SCAR GeoMAP dataset released in 2019
## Cox S.C., Smith Lyttle B. and the GeoMAP team (2019). Lower Hutt, New Zealand. GNS Science. Release v.201907.
### [Data Available Here](https://data.gns.cri.nz/ata_geomap/index.html?content=/mapservice/Content/antarctica/www/index.html)
### Notebook by Sam Elkind

First, I'll look at the data in terms of polygon counts. This section examines the data schema and the frequency of values within specific fields. The goal is to find inconsistencies in the data attribution, though it may also prompt some discussion about relationships between columns.

Next, I'll look at the data in terms of polygon area and data attribution. How much surface water has been mapped? How much till has been mapped? How much outcropping rock is of Jurassic age?

### Configure packages, paths, and load data

In [2]:
import os
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import pprint as pp
from tabulate import tabulate

In [3]:
def plot_value_counts(field_name, values_to_plot, counts, counts_norm):
    """Plot raw and normalized frequencies for the first `values_to_plot` values."""
    # Take the leading entries by position; this assumes the counts are already
    # sorted by frequency, as pandas' value_counts() returns them.
    top = counts.iloc[:values_to_plot]
    top_norm = counts_norm.iloc[:values_to_plot]

    fig, ax = plt.subplots(2, 1, figsize=(30, 15))
    fig.tight_layout(pad=2.0)
    fig.subplots_adjust(top=.94)
    fig.suptitle(f"Frequency of {field_name} values", size=18)

    ax[0].set_title(field_name)
    ax[1].set_title(f"{field_name} normalized")
    ax[0].bar(top.index, top)
    ax[1].bar(top_norm.index, top_norm)
    # Label each bar with its count (top panel) or percentage (bottom panel).
    for i, v in enumerate(top):
        ax[0].text(i - .5, v, str(v), color='black', fontweight='bold')
    for i, v in enumerate(top_norm):
        ax[1].text(i - .5, v, f"{v:.1%}", color='black', fontweight='bold')
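
The helper above takes a series of counts and a matching series of normalized counts. A minimal sketch of how these two inputs might be prepared with pandas `value_counts` follows; the `toy` series and its values are hypothetical stand-ins, not GeoMAP data.

```python
import pandas as pd

# Hypothetical stand-in for one attribute column of the GeoMAP table.
toy = pd.Series(["till", "granite", "till", "ice", "till", "granite"], name="DESCR")

# Raw counts, sorted from most to least frequent (the value_counts default).
counts = toy.value_counts()
# Same ordering, expressed as fractions of all non-null values.
counts_norm = toy.value_counts(normalize=True)

print(counts)
print(counts_norm.round(3))

# With matplotlib available, the helper could then be called as:
# plot_value_counts("DESCR", 3, counts, counts_norm)
```

Because both series come from the same column, their indexes line up, so the two panels show the same values in the same order.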

In [4]:
geol_path = os.path.join(os.getcwd(), "data", "ATA_SCAR_GeoMAP_geology.gdb")
print(geol_path)

/home/sam/geomap/data/ATA_SCAR_GeoMAP_geology.gdb


In [5]:
data = gpd.read_file(geol_path)

## Let's start by looking at the number of unique values in the NAME and DESCR fields

In [9]:
display(data[["NAME", "DESCR"]].nunique())

NAME     666
DESCR    757
dtype: int64

## There are more descriptions than names, which is unexpected: I would expect a one-to-one relationship between these fields. Perhaps complexes with varied lithology were given the same NAME value but different, more granular descriptions.

### Let's take a look at the unique pairs of values that occur.

In [7]:
unique_pairs = data[["NAME", "DESCR"]].drop_duplicates()
unique_pairs["pair_id"] = range(len(unique_pairs.index))

In [8]:
display(unique_pairs)

| | NAME | DESCR | pair_id |
|---|---|---|---|
| 0 | marine sedimentary and metasedimentary rocks (... | unfossiliferous low grade regional metamorphic... | 0 |
| 3 | intermediate intrusive rocks (early Jurassic t... | intermediate intrusive rocks (early Jurassic t... | 1 |
| 5 | Paleozoic-Triassic metamorphic rock | regionally metamorphosed rocks ranging from Pa... | 2 |
| 7 | sedimentary rocks (Paleozic to mid-Jurassic) | inferred sedimentary rocks and low-grade meta... | 3 |
| 10 | Antarctic Peninsula Volcanic Group | calc-alkaline volcanic suite, lava flows predo... | 4 |
| ... | ... | ... | ... |
| 94142 | Shaw-Clemence Complex | aluminous gneisses, quartz feldspathic gneisse... | 797 |
| 94939 | | younger till | 798 |
| 94940 | | older till | 799 |
| 95112 | | Orthopyroxene-biotite-quartz-plagioclase gneis... | 800 |
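
One way to quantify how far NAME and DESCR depart from a one-to-one relationship is to count the distinct descriptions attached to each name. The sketch below uses a small hypothetical `pairs` table in place of `unique_pairs`; the names and descriptions in it are illustrative only.

```python
import pandas as pd

# Toy stand-in for the notebook's `unique_pairs` table; values are illustrative only.
pairs = pd.DataFrame({
    "NAME": ["Complex A", "Complex A", "Group B", None, None],
    "DESCR": ["gneiss", "granite", "lava flows", "younger till", "older till"],
})

# Number of distinct descriptions attached to each name.
# groupby drops rows whose NAME is null by default, so unnamed pairs
# are excluded here and need to be examined separately.
descr_per_name = pairs.groupby("NAME")["DESCR"].nunique()

# Names that break a one-to-one NAME/DESCR relationship.
multi_descr = descr_per_name[descr_per_name > 1]
print(multi_descr)
```

Swapping the two column names in the `groupby` gives the reverse check: descriptions that are shared by more than one name.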


#### It looks like many names have more than one description. Let's see how many pairs have a null (None) value in the NAME column.

In [53]:
null_names = unique_pairs[unique_pairs["NAME"].isnull()]

In [55]:
display(null_names)

| | NAME | DESCR | pair_id |
|---|---|---|---|
| 222 | | regionally metamorphosed rocks ranging from Ar... | 20 |
| 61780 | | Gabbro-diorite and melamonzogranite, coeval wi... | 468 |
| 85559 | | Orthopyroxene-quartz-feldspar gneiss (tonaliti... | 670 |
| 85560 | | Layerd biotite-garnet-quartz-feldspar gneiss; ... | 671 |
| 85562 | | Hornblende-clinopyroxene-orthopyroxene quartz ... | 672 |
| ... | ... | ... | ... |
| 93976 | | Bt and Hb-Bt granite plutons | 795 |
| 94939 | | younger till | 798 |
| 94940 | | older till | 799 |
| 95112 | | Orthopyroxene-biotite-quartz-plagioclase gneis... | 800 |


In [61]:
n_null_pairs = null_names.shape[0]
print(f"{n_null_pairs} unique pairs have a description but no NAME value ({unique_pairs.shape[0] - n_null_pairs} pairs do have a NAME)")
print(f"{data[data['NAME'].isnull()].shape[0]} polygons have no NAME value. Let's get a list of the unique sources for these polygons so we can check them if needed")

63 unique pairs have a description but no NAME value (738 pairs do have a NAME)
5119 polygons have no NAME value. Let's get a list of the unique sources for these polygons so we can check them if needed


In [63]:
null_name_sources = data[data['NAME'].isnull()][["SOURCECODE", "MAPSYMBOL", "NAME", "SOURCE"]]

In [68]:
display(null_name_sources.drop_duplicates(["SOURCECODE","SOURCE"]))

| | SOURCECODE | MAPSYMBOL | NAME | SOURCE |
|---|---|---|---|---|
| 222 | m | ?n | | Thomson & Harris 1979_Southern Graham Land |
| 14398 | m | ?n | | Thomson et al. 1982 North Palmer Land |
| 20274 | m | ?n | | Burton-Johnson & Riley 2015 |
| 61780 | GHgra | EOd | | Pertusati et al. 2012 |
| 85559 | Pp | Rzn | | Sheraton 1985. Geology of Enderby Land and Wes... |
| ... | ... | ... | ... | ... |
| 93976 | AR-PPg1 | ALg | | Mikhalsky etal 2001, Prince Charles Mountains |
| 94939 | Ty | Czs | | Ishikawa et al. 2000. Geological map of Mount ... |
| 94940 | To | Czs | | Ishikawa et al. 2000. Geological map of Mount ... |
| 95112 | Ppp | Rzn | | Sheraton 1985. Geology of Enderby Land and Wes... |
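
To see which sources contribute the most unnamed polygons, and so which to check first, the null-name rows could be counted per SOURCE. This is only a sketch on a hypothetical `toy` table; the source strings below are made up for illustration and are not GeoMAP sources.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` table; values are illustrative only.
toy = pd.DataFrame({
    "NAME": [None, None, "Group B", None],
    "SOURCE": ["Survey X 1979", "Survey X 1979", "Survey Y 2012", "Survey Y 2012"],
})

# Keep polygons with a missing NAME, then count them per SOURCE.
# value_counts sorts the result from most to least frequent.
null_name_counts = toy.loc[toy["NAME"].isna(), "SOURCE"].value_counts()
print(null_name_counts)
```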


### A significant number of pol