<a href="https://colab.research.google.com/github/rkbono/GLY4451/blob/main/GLY4451_mar29_v01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt
from IPython.display import Image

if 'google.colab' in str(get_ipython()):
    print('Running on CoLab')
    !pip install --no-binary shapely shapely --force
    !git clone https://github.com/rkbono/GLY4451.git
    !pip install cartopy
    import cartopy.crs as ccrs
    fpath = './GLY4451/'
else:
    import cartopy.crs as ccrs
    
    print('Not running on CoLab')
    fpath = './'

# Pickle

Pickle is more than a ... vegetable? It's a way to pack/unpack Python objects into a single binary file (not compressed by default). It's loosely similar to a MATLAB .mat file. You can turn Python variables into pickle files and vice versa. Just like with pickles in real life?

Formally (from website):

*The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.*

https://docs.python.org/3/library/pickle.html
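
To make the "object hierarchy to byte stream and back" idea concrete, here is a minimal in-memory round trip. The `record` dictionary is a made-up stand-in, not the course dataset, and `pickle.dumps`/`pickle.loads` are used instead of files so the sketch runs anywhere.

```python
import pickle

# Hypothetical nested object standing in for something like a dataset dictionary.
record = {"name": "demo", "values": [1.5, 2.5, 3.5], "meta": {"spacing_m": 500}}

# "Pickling": Python object -> byte stream.
payload = pickle.dumps(record)

# "Unpickling": byte stream -> a new Python object with equal contents.
restored = pickle.loads(payload)

print(type(payload).__name__)  # the serialized form is a bytes object
print(restored == record)      # contents survive the round trip
print(restored is record)      # but it is a separate copy, not the same object
```

`pickle.dump`/`pickle.load` do the same thing against an open binary file. One caution from the pickle documentation: only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.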

In [None]:
import pickle

In [None]:
# with open(fpath+"Datasets/BedMachine_Antarctica.pkl", "wb") as f:
#     pickle.dump(ant_data, f)

In [None]:
with open(fpath+"Datasets/BedMachine_Antarctica.pkl", "rb") as f:
    ant_data = pickle.load(f)

# NetCDF

From wiki:

*NetCDF (Network Common Data Form) is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.*

*The netCDF libraries support multiple different binary formats for netCDF files. All formats are "self-describing". This means that there is a header which describes the layout of the rest of the file, in particular the data arrays, as well as arbitrary file metadata in the form of name/value attributes. The format is platform independent. The data are stored in a fashion that allows efficient subsetting.*

https://en.wikipedia.org/wiki/NetCDF

Very common for sharing gridded data, datasets with very rigid requirements, etc., and found within most modeling communities. 

We aren't going to actually *use* the package since the datasets are typically too large to be practical.

In [None]:
# import netCDF4 as nc

In [None]:
# ant = nc.Dataset(antPath)

## Antarctica

This ice data looks fun -- ice surface and bedrock heights for Antarctica at 500 m spacing. See the details from the CDF file below. I've gone ahead and downsampled it by 20x to 10 km resolution so it doesn't melt your computer. If you'd like the full file, ask: the file is ~1 GB on disk but it uses ~6 GB or so in RAM from my testing.

`<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4_CLASSIC data model, file format HDF5):
    Conventions: CF-1.7
    Title: BedMachine Antarctica
    Author: Mathieu Morlighem
    version: 03-Jun-2022 (v3.4)
    nx: 13333.0
    ny: 13333.0
    Projection: Polar Stereographic South (71S,0E)
    proj4: +init=epsg:3031
    sea_water_density (kg m-3): 1027.0
    ice_density (kg m-3): 917.0
    xmin: -3333000
    ymax: 3333000
    spacing: 500
    no_data: -9999.0
    license: No restrictions on access or use
    Data_citation: Morlighem M. et al., (2019), Deep glacial troughs and stabilizing ridges unveiled beneath the margins of the Antarctic ice sheet, Nature Geoscience (accepted)
    Notes: Data processed at the Department of Earth System Science, University of California, Irvine
    dimensions(sizes): x(13333), y(13333)
    variables(dimensions): |S1 mapping(), int32 x(x), int32 y(y), int8 mask(y, x), float32 firn(y, x), float32 surface(y, x), float32 thickness(y, x), float32 bed(y, x), int16 errbed(y, x), int8 source(y, x), int8 dataid(y, x), int16 geoid(y, x)
    groups: `
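
As a rough sketch of what a 20x downsample does to this grid, the snippet below rebuilds the x coordinate from the header values (`xmin`, `spacing`, and the `x` dimension size) and keeps every 20th sample. Plain striding is an assumption for illustration; the actual downsampling used to build the course file may have been done differently (for example, by block averaging).

```python
import numpy as np

# Grid parameters copied from the file header above.
xmin, spacing, nx = -3333000, 500, 13333

# Full-resolution x coordinates in metres (1-D, so this is cheap to build).
x_full = xmin + spacing * np.arange(nx)

# Assumption: downsample by simple striding, keeping every 20th sample.
step = 20
x_coarse = x_full[::step]

print(x_full[-1])            # last x coordinate of the full grid
print(x_coarse.size)         # number of points kept along one axis
print(np.diff(x_coarse)[0])  # new grid spacing in metres
```

The same slice applied to both axes of a 2-D array, `grid[::step, ::step]`, cuts the number of cells by a factor of 400.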

In [None]:
ant_data

In [None]:
fig = plt.figure(figsize=(18,4))
ax = fig.add_subplot(1,3,1,projection=ccrs.SouthPolarStereo())

ax.coastlines()
ax.gridlines()

mline = int(ant_data['mx'].shape[0]/2)

ax.plot(ant_data['mx'][mline,:],ant_data['my'][mline,:],'-r')
ax.plot(ant_data['mx'][:,mline],ant_data['my'][:,mline],'-r')

ax = fig.add_subplot(1,3,2)
ax.plot(ant_data['mx'][mline,:],ant_data['bedrock'][mline,:],'-r',label='bedrock')
ax.plot(ant_data['mx'][mline,:],ant_data['icesurf'][mline,:],'-b',label='ice')
ax.set_title('East-West')

ax = fig.add_subplot(1,3,3)
ax.plot(ant_data['my'][:,mline],ant_data['bedrock'][:,mline],'-r',label='bedrock')
ax.plot(ant_data['my'][:,mline],ant_data['icesurf'][:,mline],'-b',label='ice')
ax.set_title('North-South')



# Challenge #1

Given the dataset above on Antarctic ice and bedrock heights, estimate the mass of the Antarctic ice sheet to a reasonable degree. Explain how you reached an answer and what assumptions were needed. If you did not have more data, but infinite time and perfect knowledge of how to calculate it, what are some further tweaks you could make to your estimate? I was able to get within 2% of the reported volume with one line of code. With more code, it could be closer.

For inspiration, here are some starter images that you can make as well with the gridded datasets.

In [None]:
Image(fpath+'Figures/antarctica_maps.png')

# Seaborn

In [None]:
import seaborn as sns

From Seaborn Introduction:

*Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures.*

*Seaborn helps you explore and understand your data. Its plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.*

## Sea ice data
Let's get some good-enough data -- seaborn has some built in datasets which we will use to our advantage. These are not rated for science - get real data for that.

In [None]:
dfIce = sns.load_dataset('seaice')
dfIce.head()

This dataset contains northern hemisphere sea ice extent (in millions of km²) recorded roughly every 2 days. Let's play around with it.

Let's use the Date column as an index. If we convert it to pandas DatetimeIndex type, that'll unlock some neat indexing abilities. https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html

In [None]:
dfIce['Date'] = pd.to_datetime(dfIce['Date'])
dfIce.set_index('Date',inplace=True)
dfIce

In [None]:
dfIce.loc['2005-03']

In [None]:
sns.lineplot(data=dfIce,x=dfIce.index.dayofyear,y='Extent',hue=dfIce.index.year)

Now let's add some "helper" columns for later. We'll extract those useful datetime attributes and store them as columns. It might be overkill, but it'll give us some freedom to explore.

In [None]:
dfIce['year'] = dfIce.index.year
dfIce['month'] = dfIce.index.month
dfIce['dayofyear'] = dfIce.index.dayofyear

### Pandas tangent - groupby

groupby is a really powerful tool that's also a pain to use. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

Pandas will split up a dataframe based on the values in a column (typically categorical data). It returns a GroupBy object that behaves a bit like a dictionary: iterating over it gives (key, dataframe) pairs, with each key referring to the category and the dataframe containing just those rows.
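
Here is a small sketch of that split-by-category behavior on a toy dataframe. The numbers are invented for illustration and are not from the sea ice dataset.

```python
import pandas as pd

# Hypothetical toy data: two measurements for each of three months.
df = pd.DataFrame({
    "month": [1, 1, 2, 2, 3, 3],
    "Extent": [14.0, 15.0, 15.0, 16.0, 15.5, 16.5],
})

grouped = df.groupby("month")

# Iterating yields (key, sub-dataframe) pairs, one per category.
for key, grp in grouped:
    print(key, len(grp), grp["Extent"].mean())

# A single group can be pulled out by its key...
feb = grouped.get_group(2)

# ...and an aggregation can skip the loop entirely.
monthly_mean = grouped["Extent"].mean()
print(monthly_mean)
```

The loop form is handy when each group needs custom handling; the aggregation form is usually shorter when a single statistic is all you need.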

In [None]:
# print out the mean ice extent by month

month_names = {1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'}
for idx,grp in dfIce.groupby('month'):
    print('%s: %.2fe6 km2'%(month_names[idx],grp['Extent'].mean()))

groupby can be paired with another pandas tool, cut, to group numerical data into bins. 

In [None]:
# define decade boundaries
age_edges = np.arange(1980,2020+1e-5,10)

# pd.cut will bin the given column, here "year", using the provided bins/edges, here "age_edges"
dfIce['decade'] = pd.cut(dfIce['year'],age_edges,right=False)
display(dfIce.head())

Note how the decade column is presented. pd.cut will define a new datatype, an "Interval". Parentheses indicate exclusive, brackets indicate inclusive. By default, pd.cut will INCLUDE the right edge and EXCLUDE the left edge. Here, since we generally count the 0th year as part of the decade (e.g., 1990 belongs to the 90s), I set the argument "right" to False, which reverses which edge is included/excluded.
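
To see the edge behavior in isolation, this sketch bins three made-up years around a decade boundary, once with the default and once with `right=False`. The `years` and `edges` values are illustrative only.

```python
import pandas as pd

years = pd.Series([1989, 1990, 1991])
edges = [1980, 1990, 2000]

# Default (right=True): bins are (1980, 1990] and (1990, 2000].
default_bins = pd.cut(years, edges)

# right=False: bins are [1980, 1990) and [1990, 2000).
left_closed_bins = pd.cut(years, edges, right=False)

print(default_bins.iloc[1])      # bin holding 1990 with the default
print(left_closed_bins.iloc[1])  # bin holding 1990 with right=False
```

With the default, 1990 lands in the same bin as 1989; with `right=False` it moves into the bin shared with 1991, which matches how we usually talk about decades.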

In [None]:
# now lets apply groupby by decade, and we'll skip straight to some aggregate statistic
dfIce[['Extent','decade']].groupby('decade').describe()

## Some plots

In [None]:
sns.lineplot(data=dfIce['Extent'])

In [None]:
sns.lineplot(data=dfIce.loc[dfIce.index.year>=2000,'Extent'])

In [None]:
sns.boxplot(data=dfIce,x='month',y='Extent')

In [None]:
sns.violinplot(data=dfIce,x='month',y='Extent')

In [None]:
sns.regplot(data=dfIce,x='year',y='Extent', x_estimator=np.mean)

In [None]:
fig,ax = plt.subplots(1,1)
sns.kdeplot(ax=ax,data=dfIce,x='Extent',hue='decade',fill=True,legend=True)

# Challenge #2

Using the sea ice dataset and seaborn plotting module, describe (with plots) how sea ice extent has changed on the following timescales and/or intervals:
1. Annual
2. Monthly
3. Weekly
4. Moving five-year average

Use a different seaborn plot type for each timescale/interval. For each timescale/interval, quantify and/or visualize the variation in the dataset. The exact feature you present is not important, I'd just like to see that you played around a little bit with seaborn. *Be creative!*

***These should be "presentation" or "publication" quality.*** More so than most of the other figures you have made in this class, I want you to also focus on the clarity, impact, and aesthetic quality of these figures. By now you should start getting a feel for how to style a figure. Use subplots and titles as appropriate, label axes, pick clear and logical colors. 


### Bonus challenge (extra credit)
Which day of the year has the most variation in sea ice extent?

(Can you answer the above question using only one line of code?)


## Seaborn and Maps

These can play nicely together but it can take a little diligence. Let's play with another database, this one of magnetic field strength data, www.PINTdb.org, which is maintained by some guy. This is a medium-sized Excel table with magnetic field strength and associated metadata reported at the site-mean level. Let's load it up and take a look.

In [None]:
dfPINT = pd.read_excel(fpath+'Datasets/PINTv811.xlsx')
dfPINT

In [None]:
dfPINT.columns

We can see column descriptors here: http://pintdb.org/data/PINT_column_headings_v800.pdf

Let's make a site map with symbols colored by age and size by QPI (a qualitative score on the presumed robustness of the data).

In [None]:
# since the dataset is dominated by Cenozoic data but extends into the Hadean
# let's look at log(age) to see more of a spread

dfPINT['logAGE'] = np.log10(dfPINT['AGE'])

fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(1,1,1,projection=ccrs.Robinson())

ax.set_global()
ax.stock_img()

hh = sns.scatterplot(ax=ax,data=dfPINT,x='SLONG',y='SLAT',
                     hue='logAGE',size='QPI',legend='brief',
                     palette='turbo',
                     transform = ccrs.PlateCarree()
                    )
ax.legend(loc='center left',bbox_to_anchor=(1.05,0.5))

What about shape of the field? We can get a sense of morphology based on the expected intensity-latitude relationship due to the dipole assumption.
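
For reference, a geocentric axial dipole predicts that intensity varies with latitude as B = B_eq * sqrt(1 + 3 sin²(lat)), so the field at the poles is twice the equatorial value. The sketch below evaluates that curve. The function name `dipole_intensity` and the equatorial value of 30 (in the same units as the intensity column, assumed here to be µT) are placeholders for illustration, not values fitted to PINT.

```python
import numpy as np

def dipole_intensity(lat_deg, b_equator=30.0):
    """Axial-dipole field intensity at latitude lat_deg (degrees).

    b_equator is the equatorial intensity; 30.0 is an illustrative
    placeholder, not a value derived from the database.
    """
    lat = np.radians(lat_deg)
    return b_equator * np.sqrt(1.0 + 3.0 * np.sin(lat) ** 2)

# Evaluate the expected intensity every 30 degrees from pole to pole.
lats = np.arange(-90, 91, 30)
for lat, b in zip(lats, dipole_intensity(lats)):
    print(f"{int(lat):+4d} deg: {float(b):5.1f}")
```

If the data follow a dipole-like pattern, binned intensities should be lowest near the equator and rise symmetrically toward both poles.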

In [None]:
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(1,1,1,projection=ccrs.Robinson())

ax.set_global()
ax.coastlines()

hh = sns.scatterplot(ax=ax,data=dfPINT.loc[dfPINT['AGE']<=10],x='SLONG',y='SLAT',
                     hue='B',legend='brief',s=300,
                     palette='turbo',
                     transform = ccrs.PlateCarree()
                    )
ax.legend(loc='center left',bbox_to_anchor=(1.05,0.5))

Hmm, not enough coverage and too much variability to read a pattern off the map. Let's plot field strength against latitude instead.

In [None]:
sns.scatterplot(data=dfPINT.loc[dfPINT['AGE']<=10],x='SLAT',y='B')

Lots of scatter, can we get a trend? Let's bin by latitude.

In [None]:
dfPINT['lbin'] = pd.cut(dfPINT['SLAT'],bins=np.arange(-90,90.1,10))

fig = plt.figure(figsize=(12,4))
ax = fig.subplots(1,1)

sns.barplot(ax=ax,data=dfPINT.loc[dfPINT['AGE']<=10],x='lbin',y='B')

ax.tick_params(axis='x',rotation=90)  # rotate the bin labels so they don't overlap
ax.set_xlabel('Latitude Bin');

Okay, clear trend in the Northern hemi. Southern hemi looks undersampled.

Why are we only looking at the most recent 10 million years?

# Challenge #3

Using the PINTdb, describe the distribution of field strength data from the Mesozoic. Consider sampling, strengths, quality, rock types, etc. What are some fundamental differences in the **geology** (think geography) between the Mesozoic and the recent Cenozoic?