# This Notebook explores the SCAR GeoMAP dataset released in 2019
## Cox S.C., Smith Lyttle B. and the GeoMAP team (2019). Lower Hutt, New Zealand. GNS Science. Release v.201907.
### [Data Available Here](https://data.gns.cri.nz/ata_geomap/index.html?content=/mapservice/Content/antarctica/www/index.html)

### Notebook by Sam Elkind

In [1]:
import os
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
import pprint as pp
from tabulate import tabulate

In [2]:
def plot_value_counts(field_name, values_to_plot, counts, counts_norm):
    """Bar-plot the first `values_to_plot` entries of `counts` (raw) and `counts_norm` (fractions)."""
    fig, ax = plt.subplots(2, 1, figsize=(30, 15))
    fig.tight_layout(pad=2.0)
    fig.subplots_adjust(top=.94)
    fig.suptitle(f"Frequency of {field_name} values", size=18)

    ax[0].set_title(field_name)
    ax[1].set_title(f"{field_name} normalized")
    # Label each bar with its raw count (top panel) or its percentage (bottom panel).
    for i, v in enumerate(counts.iloc[:values_to_plot]):
        ax[0].text(i - .5, v, str(v), color='black', fontweight='bold')
    for i, v in enumerate(counts_norm.iloc[:values_to_plot]):
        ax[1].text(i - .5, v, f"{v:.1%}", color='black', fontweight='bold')
    ax[0].bar(counts.index[:values_to_plot], counts.iloc[:values_to_plot])
    ax[1].bar(counts_norm.index[:values_to_plot], counts_norm.iloc[:values_to_plot])
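
`plot_value_counts` expects two aligned pandas Series: raw counts and normalized counts. A minimal sketch of how such inputs could be built is shown here. It uses a small hypothetical Series (`toy_field`) rather than a GeoMAP column, and the final call is left as a comment because it depends on the function and matplotlib setup above.

```python
import pandas as pd

# Hypothetical values standing in for a categorical column of the dataset.
toy_field = pd.Series(["till", "till", "gneiss", "basalt", "till", "gneiss"])

# Raw counts, most frequent value first.
counts = toy_field.value_counts()
# Same ordering, expressed as fractions that sum to 1.
counts_norm = toy_field.value_counts(normalize=True)

print(counts)
print(counts_norm)

# In the notebook, these could then be passed on, for example:
# plot_value_counts("toy_field", 3, counts, counts_norm)
```

Because both Series come from `value_counts` on the same data, their indexes share the same order, so the two panels show the same values side by side.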

In [3]:
geol_path = os.path.join(os.getcwd(), "data", "ATA_SCAR_GeoMAP_geology.gdb")
print(geol_path)

/home/sam/geomap/data/ATA_SCAR_GeoMAP_geology.gdb


In [4]:
data = gpd.read_file(geol_path)

# Exploring the discrepancy between polygon names and source codes

In [5]:
display(data[["SOURCECODE","NAME"]].nunique())

SOURCECODE    801
NAME          666
dtype: int64

## Why are there more source codes than names?
### Let's make a set of source codes for each unique name.

In [10]:
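# For each unique NAME, keep one row per distinct SOURCECODE; sort the resulting frames by number of source codes, largest first.
cols = ["MAPSYMBOL", "SOURCECODE", "SOURCE", "NAME", "DESCR"]
name_src_code_vals = sorted([data[data["NAME"] == i][cols].drop_duplicates("SOURCECODE") for i in data["NAME"].drop_duplicates()], key=len, reverse=True)
name_src_code_counts = pd.DataFrame([(i["NAME"], len(i)) for i in name_src_code_vals if len(i) > 1])  # column 0: NAME values of the group, column 1: source code count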

In [12]:
display(name_src_code_counts)

   0                                                1
0  41773 Marie Byrd Land Volcanics: basalt 417...  17
1  41781 Marie Byrd Land Volcanics: trachyte 4...  11
2  60905 Melbourne volcanic province 61140 ...     11
3  60975 Hallett volcanic province 61016 Ha...     10
4  41775 Marie Byrd Land Volcanics: undifferen...   7
5  44248 late granitoid 55116 late granitoi...      5
6  40463 older ice sheet margin till 43686 ...      4
7  77020 Marble, calc-silicate and skarn 81058...   4
8  79090 Amphibolite 80168 Amphibolite 8149...      4
9  80654 Garnet-biotite gneiss 83956 Garnet...      4


#### There are 44 unit names associated with more than one source code. These names are either compositionally descriptive or belong to a complex that is further differentiated by source code. For the latter category, the sources could perhaps be reviewed to find more specific unit names for this field.
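
The count of names with several source codes can also be cross-checked with a `groupby`. This is a sketch of an alternative approach, not the method used in the cells above. It runs on a small hypothetical table (`toy`) with made-up values, assuming only that the real data has `NAME` and `SOURCECODE` columns as shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for the GeoMAP attribute table; values are illustrative only.
toy = pd.DataFrame({
    "NAME": ["unit A", "unit A", "unit A", "unit B", "unit C", "unit C"],
    "SOURCECODE": ["a1", "a2", "a2", "b1", "c1", "c2"],
})

# Number of distinct source codes recorded under each name.
codes_per_name = toy.groupby("NAME")["SOURCECODE"].nunique()

# Keep only the names that map to more than one source code, largest first.
multi_code_names = codes_per_name[codes_per_name > 1].sort_values(ascending=False)

print(multi_code_names)
print(f"{len(multi_code_names)} names have more than one source code")
```

Applied to the real table (replacing `toy` with `data`), `len(multi_code_names)` would be the figure to compare against the 44 reported above. Note that `groupby` drops rows whose `NAME` is missing by default.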