# This Notebook explores the SCAR GeoMAP dataset released in 2019
## Cox S.C., Smith Lyttle B. and the GeoMAP team (2019). Lower Hutt, New Zealand. GNS Science. Release v.201907.
### [Data Available Here](https://data.gns.cri.nz/ata_geomap/index.html?content=/mapservice/Content/antarctica/www/index.html)
### Notebook by Sam Elkind

I'll start by looking at the data in terms of polygon counts, examining the data schema and how often each value occurs within specific fields. The main goal is to find inconsistencies in the data attribution, but the results may also prompt discussion about relationships between columns.
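One way to probe relationships between columns is to count how many distinct values of one field co-occur with each value of another. The following is a minimal sketch on a toy table: the helper name `one_to_many`, the sample rows, and the `Qt`/`Qt?` symbols are hypothetical, and only the column names come from the dataset.

```python
import pandas as pd

def one_to_many(df, key, other):
    """Return the values of `key` that map to more than one distinct value of `other`."""
    n_distinct = df.groupby(key)[other].nunique()
    return n_distinct[n_distinct > 1]

# Hypothetical rows, for illustration only (not taken from GeoMAP).
toy = pd.DataFrame({
    "MAPSYMBOL": ["JKg", "JKg", "Qt", "Qt"],
    "PLOTSYMBOL": ["JKg", "JKg", "Qt", "Qt?"],
})

# Symbols that are paired with more than one plot symbol.
conflicts = one_to_many(toy, "MAPSYMBOL", "PLOTSYMBOL")
print(conflicts)
```

A non-empty result does not necessarily indicate an error; it only flags pairs of fields worth a closer look.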

Next, I'll look at the data in terms of polygon area and data attribution. How much surface water has been mapped? How much till has been mapped? How much outcropping rock is of Jurassic age?
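The area questions above could be approached roughly as follows. This is a sketch under several assumptions: the rows are invented, the `till` and `water` category labels are hypothetical, `Shape_Area` is assumed to be in square metres, and the Jurassic bounds (145.0 to 201.3 Ma) are assumed values that should be checked against the timescale used by the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration; real values come from the geodatabase.
toy = pd.DataFrame({
    "POLYGTYPE": ["rock", "rock", "till", "water"],
    "ABSMIN_MA": [145.0, 400.0, 0.0, 0.0],
    "ABSMAX_MA": [201.3, 541.0, 0.1, 0.0],
    "Shape_Area": [2.0e6, 1.0e6, 5.0e5, 2.5e5],
})

# Total mapped area per polygon type, in km^2 (assuming Shape_Area is in m^2).
area_km2 = toy.groupby("POLYGTYPE")["Shape_Area"].sum() / 1e6

# Assumed Jurassic bounds in Ma; not taken from the dataset documentation.
JURASSIC_MIN_MA, JURASSIC_MAX_MA = 145.0, 201.3

# Keep rock polygons whose age range lies entirely within the Jurassic.
is_jurassic = (
    (toy["POLYGTYPE"] == "rock")
    & (toy["ABSMIN_MA"] >= JURASSIC_MIN_MA)
    & (toy["ABSMAX_MA"] <= JURASSIC_MAX_MA)
)
jurassic_km2 = toy.loc[is_jurassic, "Shape_Area"].sum() / 1e6
```

Requiring the whole age range to fall inside the Jurassic is the strictest choice; counting any overlap instead would include units such as those dated "early Jurassic to early Cretaceous".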

### Configure packages, paths, and load data

In [23]:
import os
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
import pprint as pp
from tabulate import tabulate

In [44]:
def plot_value_counts(field_name, values_to_plot, counts, counts_norm):
    """Bar-plot the top `values_to_plot` raw and normalized value counts."""
    top = counts.iloc[:values_to_plot]
    top_norm = counts_norm.iloc[:values_to_plot]

    fig, ax = plt.subplots(2, 1, figsize=(30, 15))
    fig.tight_layout(pad=2.0)
    fig.subplots_adjust(top=.94)
    fig.suptitle(f"Frequency of {field_name} values", size=18)

    ax[0].set_title(field_name)
    ax[1].set_title(f"{field_name} normalized")
    # Cast labels to str so numeric categories are drawn as discrete bars.
    ax[0].bar(top.index.astype(str), top)
    ax[1].bar(top_norm.index.astype(str), top_norm)
    # Annotate each bar with its count or percentage.
    for i, v in enumerate(top):
        ax[0].text(i, v, str(v), ha='center', va='bottom', color='black', fontweight='bold')
    for i, v in enumerate(top_norm):
        ax[1].text(i, v, f"{v:.1%}", ha='center', va='bottom', color='black', fontweight='bold')

def printmd(md):
    display(Markdown(md))

def print_field_summary_stats(data, field):
    """Print a Markdown summary of the value counts for one field."""
    try:
        # value_counts() excludes null values by default.
        val_counts = data[field].value_counts()
    except AttributeError:
        return
    printmd(f"# {field}")
    # The empty placeholders below are meant to be filled in by hand per field.
    printmd(f"## Source of values: {''}")
    printmd(f"## Value formatting: {''}")
    printmd(f"## Related fields: {''}")
    printmd(f"## [GeoSciML field description]({''})")
    printmd("## Descriptive Statistics:")
    printmd(f"- ### Unique Values: {len(val_counts)}")
    if len(val_counts) > 0:
        printmd(f"- ### Most frequently occurring value:<br/>{val_counts.index[0]}")
    printmd(f"- ### Number of values with a single occurrence: {(val_counts == 1).sum()}")
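
The statistics reported by the helpers above all derive from `Series.value_counts`. A small sketch on invented values (the series contents are hypothetical) shows what each number means and how the `counts` and `counts_norm` arguments of `plot_value_counts` could be produced:

```python
import pandas as pd

# Hypothetical values for illustration only.
s = pd.Series(["rock", "rock", "rock", "ice", "ice", "moraine", None])

# Absolute counts: nulls are dropped and results are sorted in descending order.
counts = s.value_counts()
# Fractions of the non-null values rather than absolute counts.
counts_norm = s.value_counts(normalize=True)

n_unique = len(counts)                    # number of distinct non-null values
most_common = counts.index[0]             # first index entry is the most frequent value
n_singletons = int((counts == 1).sum())   # values that occur exactly once
```

Because nulls are dropped by default, a field that is entirely null reports zero unique values.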

In [3]:
geol_path = os.path.join(os.getcwd(), "data", "ATA_SCAR_GeoMAP_geology.gdb")
print(geol_path)

/home/sam/geomap/data/ATA_SCAR_GeoMAP_geology.gdb


In [4]:
data = gpd.read_file(geol_path)

In [5]:
fields = " ".join(data.columns)

In [39]:
print(fields)

SOURCECODE MAPSYMBOL PLOTSYMBOL NAME DESCR POLYGTYPE MBREQUIV FMNEQUIV SBGRPEQUIV GRPEQUIV SPGRPEQUIV TERREQUIV STRATRANK TYPENAME TYPE_URI GEOLHIST REPAGE_URI YNGAGE_URI OLDAGE_URI ABSMIN_MA ABSMAX_MA AGECODE LITHCODE LITHOLOGY REPLITH_URI OBSMETHOD CONFIDENCE POSACC_M SOURCE METADATA RESSCALE CAPTSCALE CAPTDATE MODDATE FEATUREID SPEC_URI SYMBOL DATASET REGION Shape_Length Shape_Area geometry


# **SOURCECODE**
## Source of values: copied from publication referenced in **SOURCE** field
## Value formatting: whatever format the publication authors used
## Related fields: **MAPSYMBOL**, **NAME**, **DESCR**
## [GeoSciML field description]()
## Descriptive statistics:
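
Hand-written notes like the ones above could be kept in a dictionary and used to fill in the per-field template automatically. This is only a sketch: `FIELD_NOTES` and `field_header_lines` are hypothetical names, and only the SOURCECODE entry reflects notes written in this notebook.

```python
# Hand-written notes per field; only SOURCECODE is filled in here.
FIELD_NOTES = {
    "SOURCECODE": {
        "source": "copied from publication referenced in **SOURCE** field",
        "format": "whatever format the publication authors used",
        "related": ["MAPSYMBOL", "NAME", "DESCR"],
    },
}

def field_header_lines(field, notes=FIELD_NOTES):
    """Build the Markdown header lines for one field, leaving blanks where no notes exist."""
    info = notes.get(field, {})
    related = ", ".join(f"**{name}**" for name in info.get("related", []))
    return [
        f"# {field}",
        f"## Source of values: {info.get('source', '')}",
        f"## Value formatting: {info.get('format', '')}",
        f"## Related fields: {related}",
    ]

lines = field_header_lines("SOURCECODE")
print("\n".join(lines))
```

Fields without an entry still render, just with empty placeholders, so the notes can be filled in incrementally.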

In [46]:
for i in data.columns:
    print_field_summary_stats(data, i)

# SOURCECODE

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 801

- ### Most frequently occurring value:<br/>C-Tr

- ### Number of values with a single occurrence: 54

# MAPSYMBOL

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 173

- ### Most frequently occurring value:<br/>JKg

- ### Number of values with a single occurrence: 3

# PLOTSYMBOL

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 183

- ### Most frequently occurring value:<br/>JKg

- ### Number of values with a single occurrence: 3

# NAME

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 666

- ### Most frequently occurring value:<br/>marine sedimentary and metasedimentary rocks (Carboniferous to Triassic)

- ### Number of values with a single occurrence: 36

# DESCR

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 757

- ### Most frequently occurring value:<br/>unfossiliferous low grade regional metamorphic clastic sedimentary rocks; some basaltic to andesitic lavas

- ### Number of values with a single occurrence: 41

# POLYGTYPE

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 3

- ### Most frequently occurring value:<br/>rock

- ### Number of values with a single occurrence: 0

# MBREQUIV

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 19

- ### Most frequently occurring value:<br/> 

- ### Number of values with a single occurrence: 5

# FMNEQUIV

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 246

- ### Most frequently occurring value:<br/>LeMay Formation; Trinity Penninsula Formation

- ### Number of values with a single occurrence: 15

# SBGRPEQUIV

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 1

- ### Most frequently occurring value:<br/>Ross Sea Drift

- ### Number of values with a single occurrence: 0

# GRPEQUIV

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 59

- ### Most frequently occurring value:<br/> 

- ### Number of values with a single occurrence: 0

# SPGRPEQUIV

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 14

- ### Most frequently occurring value:<br/> 

- ### Number of values with a single occurrence: 0

# TERREQUIV

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 11

- ### Most frequently occurring value:<br/>Wilson Terrane

- ### Number of values with a single occurrence: 0

# STRATRANK

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 12

- ### Most frequently occurring value:<br/>rank not specified

- ### Number of values with a single occurrence: 0

# TYPENAME

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 6

- ### Most frequently occurring value:<br/>lithostratigraphic unit

- ### Number of values with a single occurrence: 0

# TYPE_URI

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 5

- ### Most frequently occurring value:<br/>http://resource.geosciml.org/classifier/cgi/geologicunittype/lithostratigraphic_unit

- ### Number of values with a single occurrence: 0

# GEOLHIST

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 113

- ### Most frequently occurring value:<br/>early Jurassic to early Cretaceous

- ### Number of values with a single occurrence: 2

# REPAGE_URI

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 41

- ### Most frequently occurring value:<br/>http://resource.geosciml.org/classifier/ics/ischart/Paleozoic

- ### Number of values with a single occurrence: 0

# YNGAGE_URI

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 51

- ### Most frequently occurring value:<br/>http://resource.geosciml.org/classifier/ics/ischart/Albian

- ### Number of values with a single occurrence: 0

# OLDAGE_URI

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 52

- ### Most frequently occurring value:<br/>http://resource.geosciml.org/classifier/ics/ischart/Cambrian

- ### Number of values with a single occurrence: 0

# ABSMIN_MA

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 121

- ### Most frequently occurring value:<br/>100.5

- ### Number of values with a single occurrence: 2

# ABSMAX_MA

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 130

- ### Most frequently occurring value:<br/>541.0

- ### Number of values with a single occurrence: 2

# AGECODE

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 46

- ### Most frequently occurring value:<br/>JK

- ### Number of values with a single occurrence: 0

# LITHCODE

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 25

- ### Most frequently occurring value:<br/>s

- ### Number of values with a single occurrence: 0

# LITHOLOGY

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 410

- ### Most frequently occurring value:<br/>unknown

- ### Number of values with a single occurrence: 11

# REPLITH_URI

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 80

- ### Most frequently occurring value:<br/>http://resource.geosciml.org/classifier/cgi/lithology/metamorphic_rock

- ### Number of values with a single occurrence: 3

# OBSMETHOD

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 5

- ### Most frequently occurring value:<br/>synthesis from multiple sources

- ### Number of values with a single occurrence: 0

# CONFIDENCE

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 162

- ### Most frequently occurring value:<br/>GEOLHIST uncertain

- ### Number of values with a single occurrence: 12

# POSACC_M

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 1

- ### Most frequently occurring value:<br/>250.0

- ### Number of values with a single occurrence: 0

# SOURCE

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 158

- ### Most frequently occurring value:<br/>Burton-Johnson & Riley 2015

- ### Number of values with a single occurrence: 8

# METADATA

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 1

- ### Most frequently occurring value:<br/>https://data.gns.cri.nz/metadata/srv/eng/catalog.search#/metadata/1482B48B-3E70-41AE-9BD0-672722A81EC7

- ### Number of values with a single occurrence: 0

# RESSCALE

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 3

- ### Most frequently occurring value:<br/>250000

- ### Number of values with a single occurrence: 0

# CAPTSCALE

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 1

- ### Most frequently occurring value:<br/>50000

- ### Number of values with a single occurrence: 0

# CAPTDATE

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 14

- ### Most frequently occurring value:<br/>2017-07-26T00:00:00

- ### Number of values with a single occurrence: 0

# MODDATE

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 15

- ### Most frequently occurring value:<br/>2018-06-06T00:00:00

- ### Number of values with a single occurrence: 0

# FEATUREID

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 95161

- ### Most frequently occurring value:<br/>ATA_geological_units_045255

- ### Number of values with a single occurrence: 95161

# SPEC_URI

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 1

- ### Most frequently occurring value:<br/>http://www.opengis.net/def/nil/OGC/0/missing

- ### Number of values with a single occurrence: 0

# SYMBOL

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 0

- ### Number of values with a single occurrence: 0

# DATASET

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 9

- ### Most frequently occurring value:<br/>ATA_PEN_geological_units

- ### Number of values with a single occurrence: 0

# REGION

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 2

- ### Most frequently occurring value:<br/>East Antarctica

- ### Number of values with a single occurrence: 0

# Shape_Length

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 95072

- ### Most frequently occurring value:<br/>830.2526232140143

- ### Number of values with a single occurrence: 94990

# Shape_Area

## Source of values: 

## Value formatting: 

## Related fields: 

## [GeoSciML field description]()

## Descriptive Statistics:

- ### Unique Values: 95057

- ### Most frequently occurring value:<br/>51115.04434995726

- ### Number of values with a single occurrence: 94960
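
Several fields above (MBREQUIV, GRPEQUIV, SPGRPEQUIV) report a blank string as their most frequent value, while SYMBOL reports zero unique values. This suggests that missing data is encoded in at least two ways: whitespace-only strings and true nulls. A sketch of how the two could be counted together, on an invented series (the values and their proportions are hypothetical):

```python
import pandas as pd

# Hypothetical column mixing a real value, blank strings, and nulls.
s = pd.Series(["Ross Sea Drift", " ", " ", " ", None, None])

# Nulls are dropped, but " " is counted as an ordinary value.
raw_counts = s.value_counts()

# Treat both nulls and whitespace-only strings as missing.
is_blank = s.fillna("").str.strip().eq("")
n_missing = int(is_blank.sum())

# Distinct values that remain once blanks are excluded.
n_real_unique = s[~is_blank].nunique()
```

Applying a mask like this before calling `value_counts` would keep blank strings from dominating the "most frequently occurring value" line.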