# Organize Variables
_Examine and modify datatypes - categorical, continuous, discrete. Organize dataset._ 

The `polygons` layer holds most of the data and serves as the focus of this EDA.

## Remove Non-Glacier Columns
Since the purpose is to find data stories about glaciers rather than about the glacier research process, variables pertaining to data submission and measurement uncertainty are identified with reference to the [User Guide](https://nsidc.org/sites/default/files/nsidc-0272-v001-userguide_1.pdf) and the [GLIMS Description of fields](http://www.glims.org/MapsAndDocs/downloaded_field_desc.html), verified, and removed.

In [1]:
# Import libraries
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns

# Check working directory
current_directory = os.getcwd()
print("Current Directory:", current_directory)

# Ask pandas to display all columns
pd.set_option('display.max_columns', None)

Current Directory: /Users/yun/Documents/GLIMS/GLIMS_20230716


In [None]:
# Load Esri shapefiles as geopandas dataframes
polygons = gpd.read_file("glims_download_13173/glims_polygons.shp")

In [None]:
polygons.dtypes

In [None]:
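# rec_status: all are "okay", as expected from documentation
# (print both results; a notebook cell only displays its last expression)
print(polygons.rec_status.value_counts())
# proc_desc: number of distinct processing descriptions
print(polygons.proc_desc.nunique())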

In [None]:
# Lots of "None" (print both results; a cell only displays its last expression)
print(polygons.wgms_id.value_counts(normalize=True))  # 91% "None"
print(polygons.local_id.value_counts(normalize=True))  # 65% "None"
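
The share of placeholder values can also be summarized for several columns at once. The following is a minimal sketch, not part of the original analysis: the helper `placeholder_share` and the toy dataframe are hypothetical, and it assumes that missing identifiers appear either as the string "None" or as a true null.

```python
import pandas as pd

def placeholder_share(df, columns, placeholders=("None",)):
    """Fraction of rows per column that are null or a placeholder string."""
    shares = {}
    for col in columns:
        is_placeholder = df[col].isna() | df[col].isin(list(placeholders))
        shares[col] = is_placeholder.mean()
    # Largest share first, so the least informative columns are listed on top
    return pd.Series(shares).sort_values(ascending=False)

# Hypothetical toy data standing in for `polygons`
toy = pd.DataFrame({
    "wgms_id": ["None", "None", "None", "A1"],
    "local_id": ["None", "x7", None, "x9"],
})
shares = placeholder_share(toy, ["wgms_id", "local_id"])
print(shares)
```

On the real data, the call would be `placeholder_share(polygons, ['wgms_id', 'local_id'])`.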

In [None]:
# Some of these fields may be related to each other: proc_desc and analysts; submitters, rc_id, geog_area, and chief_affl
polygons[[
    #'line_type', #'rec_status', #'glac_stat', 
    #'area', 'db_area','width', 'length', 'primeclass', 'min_elev', 'mean_elev', 'max_elev',
    'anlys_id', 'glac_id', 'glac_name', 'wgms_id', 'local_id',
    'anlys_time', 'src_date', 'subm_id', 'release_dt', 
    'proc_desc', 'rc_id', 'geog_area','chief_affl', 'submitters', 'analysts', 
    #'loc_unc_x', 'loc_unc_y', 'glob_unc_x', 'glob_unc_y', #'geometry'
]].nunique().sort_values(ascending=False)

In [None]:
# Most common combinations of these fields (value_counts sorts by count, descending)
polygons[[
    'proc_desc', 
    'rc_id', 
    'geog_area', # 'umbrella' and 'various' categories too large to be useful
    'chief_affl', 
    'submitters', 
    'analysts'
]].value_counts()
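
One way to test whether two of these fields are related is to check whether each value of one field maps to a single value of the other. The sketch below illustrates the idea on hypothetical toy data; the helper `max_values_per_key` is an assumption for illustration, not something from the GLIMS documentation.

```python
import pandas as pd

def max_values_per_key(df, key, other):
    """Largest number of distinct `other` values seen for any single `key` value.

    A result of 1 means `key` fully determines `other` in this data.
    Rows with a null `key` are ignored (pandas groupby default).
    """
    return int(df.groupby(key)[other].nunique().max())

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "rc_id": [1, 1, 2, 2, 3],
    "chief_affl": ["Univ A", "Univ A", "Univ B", "Univ B", "Univ C"],
    "analysts": ["Ann", "Bob", "Cy", "Cy", "Dee"],
})
affl_per_rc = max_values_per_key(toy, "rc_id", "chief_affl")
analysts_per_rc = max_values_per_key(toy, "rc_id", "analysts")
print(affl_per_rc, analysts_per_rc)
```

In the toy data, `rc_id` determines `chief_affl` but not `analysts`. The same call on `polygons` would show which of the suspected relationships actually hold.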

In [None]:
# Continuous temporal; not defined in the documentation (possibly the data release date?)
polygons.release_dt.hist();

In [None]:
# Positional Uncertainty, Discrete
polygons[['loc_unc_x', 'loc_unc_y', 'glob_unc_x', 'glob_unc_y']].hist();

In [None]:
# Remove submission-related columns
polygons1 = polygons.drop(labels=[
    'rec_status', 'wgms_id', 'local_id', 
    'subm_id', 'release_dt', 'proc_desc', 
    'rc_id', 'geog_area', 'chief_affl', 
    'loc_unc_x', 'loc_unc_y', 'glob_unc_x', 'glob_unc_y',
    'submitters', 'analysts'
], axis=1)
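
To keep a record of why each column is removed, the drop list can be grouped by reason. This is a rough sketch under stated assumptions: the `DROP_GROUPS` dictionary, the `drop_groups` helper, and the toy dataframe are hypothetical, and only a few of the columns above are shown.

```python
import pandas as pd

# Hypothetical grouping of a few of the dropped columns by reason
DROP_GROUPS = {
    "submission": ["subm_id", "release_dt"],
    "uncertainty": ["loc_unc_x", "loc_unc_y"],
}

def drop_groups(df, groups):
    """Return a copy of `df` without any column listed in `groups`."""
    labels = [col for cols in groups.values() for col in cols]
    # DataFrame.drop raises KeyError if a label is absent, which catches typos
    return df.drop(columns=labels)

# Hypothetical toy data standing in for `polygons`
toy = pd.DataFrame({
    "glac_id": ["G1", "G2"],
    "area": [1.5, 2.0],
    "subm_id": [10, 11],
    "release_dt": ["2020-01-01", "2021-01-01"],
    "loc_unc_x": [30, 15],
    "loc_unc_y": [30, 15],
})
trimmed = drop_groups(toy, DROP_GROUPS)
print(list(trimmed.columns))
```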

In [None]:
polygons1.dtypes

## Remove Non-Glacier Rows
Remove polygon features representing non-glacier boundaries.

In [None]:
# Most entities are glacier boundaries.
polygons1.line_type.value_counts()

In [None]:
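# Keep glacier boundaries only, then drop the now-constant line_type column.
# Chaining returns a new dataframe, which avoids pandas' SettingWithCopyWarning
# from an in-place drop on a filtered slice.
polygons2 = polygons1[polygons1.line_type == "glac_bound"].drop(columns='line_type')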
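
It can be useful to record how many rows the filter removes. The sketch below shows one way to do so on hypothetical toy data. The helper `keep_line_type` is an assumption, and apart from "glac_bound" the `line_type` values in the toy frame are illustrative only.

```python
import pandas as pd

def keep_line_type(df, line_type="glac_bound"):
    """Keep rows of one line_type; return the result and the number of rows removed."""
    mask = df["line_type"] == line_type
    # The column is constant after filtering, so it is dropped
    kept = df.loc[mask].drop(columns="line_type")
    return kept, int((~mask).sum())

# Hypothetical toy data standing in for `polygons1`
toy = pd.DataFrame({
    "glac_id": ["G1", "G1", "G2", "G3"],
    "line_type": ["glac_bound", "intrnl_rock", "glac_bound", "debris_cov"],
})
glaciers, n_removed = keep_line_type(toy)
print(len(glaciers), n_removed)
```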

In [None]:
polygons2.dtypes

In [None]:
# Few NaNs; address later
polygons2.isna().sum()

In [None]:
polygons2.sample(5)

In [None]:
polygons2.to_csv("polygons2.csv", index=False)