# **ATLAS/ICESat-2 Land Ice Height [ATL06](https://nsidc.org/data/atl06/) Exploratory Data Analysis**

[Yet another](https://xkcd.com/927) take on playing with ICESat-2's Land Ice Height ATL06 data,
with a focus on analyzing ice elevation changes over Antarctica.
Specifically, this Jupyter notebook will cover:

- Downloading datasets from the web via [intake](https://intake.readthedocs.io)
- Performing [Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis)
  using the [PyData](https://pydata.org) stack (e.g. [xarray](http://xarray.pydata.org), [dask](https://dask.org))
- Plotting figures using [Hvplot](https://hvplot.holoviz.org) and [PyGMT](https://www.pygmt.org) (TODO)

This is in contrast with the [icepyx](https://github.com/icesat2py/icepyx) package
and 'official' 2019/2020 [ICESat-2 Hackweek tutorials](https://github.com/ICESAT-2HackWeek/ICESat2_hackweek_tutorials) (which are also awesome!),
which tend to use a slightly different approach (e.g. handcoded download scripts, [h5py](http://www.h5py.org) for data reading, etc.).
The core concept here is to run things in a more intuitive and scalable (parallelizable) manner on a continental scale (rather than just a specific region).

In [None]:
import glob
import json
import logging
import netrc
import os

import dask
import dask.dataframe  # used explicitly further down via dask.dataframe.concat
import dask.distributed
import hvplot.dask
import hvplot.pandas
import hvplot.xarray
import intake
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import tqdm
import xarray as xr

# %matplotlib inline

In [2]:
# Configure intake and set number of compute cores for data download
intake.config.conf["cache_dir"] = "catdir"  # saves data to current folder
intake.config.conf["download_progress"] = False  # disable automatic tqdm progress bars

logging.basicConfig(level=logging.WARNING)

# Limit compute to 10 single-threaded workers for download part using intake
# 10 was picked because there are possibly 10 DPs?
# See https://n5eil02u.ecs.nsidc.org/opendap/hyrax/catalog.xml
client = dask.distributed.Client(n_workers=10, threads_per_worker=1)
client

Client: Scheduler: tcp://127.0.0.1:39289, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 10, Cores: 10, Memory: 201.22 GB


## Quick view

Use our [intake catalog](https://intake.readthedocs.io/en/latest/catalog.html) to get some sample ATL06 data
(while making sure we have our Earthdata credentials set up properly),
and view it using [xarray](https://xarray.pydata.org) and [hvplot](https://hvplot.pyviz.org).
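
Before touching the catalog, it can help to confirm that the credentials file actually holds an Earthdata entry, not just that the file exists. The sketch below uses only the standard-library `netrc` module; the helper name `has_earthdata_login` and the throwaway demo file are assumptions made for illustration, and the host name is the one from the `echo` command shown in the next cell.

```python
import netrc
import os
import tempfile


def has_earthdata_login(netrc_path, host="urs.earthdata.nasa.gov"):
    """Return True if netrc_path holds a login and password for host."""
    try:
        auth = netrc.netrc(netrc_path).authenticators(host)
    except (FileNotFoundError, netrc.NetrcParseError):
        return False
    # authenticators() gives (login, account, password), or None if host is absent
    return auth is not None and bool(auth[0]) and bool(auth[2])


# Demonstrate on a throwaway file so that the real ~/.netrc is never read here
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.netrc")
    with open(demo_path, "w") as f:
        f.write("machine urs.earthdata.nasa.gov login demo_user password demo_pass\n")
    found = has_earthdata_login(demo_path)
    missing = has_earthdata_login(os.path.join(tmpdir, "does_not_exist.netrc"))

print(found, missing)
```

Pointing the helper at `os.path.expanduser("~/.netrc")` would run the same check against the real file.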

In [3]:
catalog = intake.open_catalog(uri="catalog.yaml")  # open the local catalog file containing ICESAT2 stuff

In [4]:
try:
    netrc.netrc()
except FileNotFoundError as error_msg:
    print(f"{error_msg}, please follow instructions to create one at "
          "https://nsidc.org/support/faq/what-options-are-available-bulk-downloading-data-https-earthdata-login-enabled "
          'basically using `echo "machine urs.earthdata.nasa.gov login <uid> password <password>" >> ~/.netrc`')
    raise

dataset = catalog.icesat2atl06.to_dask().unify_chunks()  # depends on .netrc file in home folder
dataset

xarray.Dataset preview (dask-backed arrays), summarised by array layout:

Shape,Chunk shape,Type,Bytes,Chunk bytes,Chunks
"(907899,)","(77857,)",float64,7.26 MB,622.86 kB,15
"(10, 907899)","(1, 77857)",float64,72.63 MB,622.86 kB,150
"(10, 907899)","(1, 77857)",float32,36.32 MB,311.43 kB,150

In [None]:
#dataset.hvplot.points(
#    x="longitude", y="latitude", datashade=True, width=800, height=500, hover=True,
#    #geo=True, coastline=True, crs=cartopy.crs.PlateCarree(), #projection=cartopy.crs.Stereographic(central_latitude=-71),
#)
catalog.icesat2atl06.hvplot.quickview()

## Data intake

Pull in all of the raw ATL06 data (HDF5 format) from the NSIDC servers via an intake catalog file.
Note that this will involve hundreds if not thousands of GB of data, so make sure there's enough storage!
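
A quick way to act on the storage warning is to look at the free space on the target drive before submitting any downloads. This is a minimal sketch using the standard-library `shutil.disk_usage`; the helper name and the 1000 GB threshold are placeholders chosen for illustration, not measured requirements.

```python
import shutil


def enough_free_space(path=".", required_gb=1000.0):
    """Return (ok, free_gb) for the filesystem that contains path."""
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes -> decimal gigabytes
    return free_gb >= required_gb, free_gb


# required_gb is a hypothetical budget; adjust it to the date range being pulled
ok, free_gb = enough_free_space(path=".", required_gb=1000.0)
print(f"{free_gb:.1f} GB free, enough for the download: {ok}")
```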

In [4]:
# Download all ICESAT2 ATLAS hdf files from start to end date
dates1 = pd.date_range(start="2018.10.14", end="2019.06.26")  # 1st batch
dates2 = pd.date_range(start="2019.07.26", end="2020.03.06")  # 2nd batch
dates = dates1.append(dates2)
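
As a sanity check on how many download jobs the two batches produce, the day counts can be worked out with the standard library alone. The sketch assumes daily frequency with both endpoints included, which is what `pd.date_range` does by default; `days_inclusive` is a helper name made up here.

```python
import datetime


def days_inclusive(start, end):
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


batch1 = days_inclusive(datetime.date(2018, 10, 14), datetime.date(2019, 6, 26))
batch2 = days_inclusive(datetime.date(2019, 7, 26), datetime.date(2020, 3, 6))
total_days = batch1 + batch2
print(batch1, batch2, total_days)  # 256 225 481
```

The total of 481 is the same number of jobs that the download progress bar reports further down.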

In [5]:
# Submit download jobs to Client
futures = []
for date in dates:
    source = catalog.icesat2atlasdownloader(date=date)
    future = client.submit(func=source.discover)  # triggers download of the file(s), or loads from cache
    futures.append(future)

In [None]:
# Check download progress here, https://stackoverflow.com/a/37901797/6611055
responses = []
for f in tqdm.tqdm(iterable=dask.distributed.as_completed(futures=futures), total=len(futures)):
    responses.append(f.result())

 25%|██▌       | 121/481 [4:33:33<20:49:30, 208.25s/it]
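
One thing to watch in a loop like the one above: `f.result()` re-raises whatever went wrong in the task, so a single failed day would stop the collection part-way. The sketch below shows the same submit / as-completed pattern with per-job error handling. It deliberately swaps in the standard-library `concurrent.futures` instead of `dask.distributed` so that it runs without a cluster, and `fake_download` is a hypothetical stand-in for `source.discover`.

```python
import concurrent.futures


def fake_download(day):
    """Hypothetical stand-in for source.discover(); fails for one day on purpose."""
    if day == "2019.07.01":
        raise ConnectionError(f"no data for {day}")
    return f"cached {day}"


days = ["2018.10.14", "2019.07.01", "2019.07.26"]
responses, failures = [], []
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # Remember which day each future belongs to, so failures can be retried later
    futures = {pool.submit(fake_download, day): day for day in days}
    for future in concurrent.futures.as_completed(futures):
        try:
            responses.append(future.result())
        except ConnectionError as error:
            failures.append((futures[future], str(error)))

print(sorted(responses), failures)
```

Keeping the failed days in a list makes it possible to resubmit only those, instead of rerunning the whole batch.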

## Exploratory data analysis on local files

Now that we've downloaded a good chunk of data and cached it locally,
we can have some fun with visualizing the point clouds!

In [7]:
dataset = catalog.icesat2atl06.to_dask()  # unfortunately, we have to load this in dask to get the path...
root_directory = os.path.dirname(os.path.dirname(dataset.encoding["source"]))

In [8]:
def get_crossing_dates(
    catalog_entry: intake.catalog.local.LocalCatalogEntry,
    root_directory: str,
    referencegroundtrack: str="????",
    datetime="*",
    cyclenumber="??",
    orbitalsegment="??",
    version="002",
    revision="01"
):
    """
    Given a 4-digit reference groundtrack (e.g. 1234),
    we output a dictionary where the
    key is the date in "YYYY.MM.DD" format when an ICESAT2 crossing was made and the
    value is the filepath to the HDF5 data file.
    """
    
    # Get a glob string that looks like "ATL06_??????????????_XXXX????_002_01.h5"
    globpath = catalog_entry.path_as_pattern
    if datetime == "*":
        globpath = globpath.replace("{datetime:%Y%m%d%H%M%S}", "??????????????")
    globpath = globpath.format(
        referencegroundtrack=referencegroundtrack, cyclenumber=cyclenumber, orbitalsegment=orbitalsegment,
        version=version, revision=revision
    )
    
    # Get list of filepaths (dates are contained in the filepath)
    globedpaths = glob.glob(os.path.join(root_directory, "??????????", globpath))
    
    # Pick out just the dates in "YYYY.MM.DD" format from the globedpaths
    # crossingdates = [os.path.basename(os.path.dirname(p=p)) for p in globedpaths]
    crossingdates = {os.path.basename(os.path.dirname(p=p)): p for p in sorted(globedpaths)}
    
    return crossingdates
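
To make the glob construction in `get_crossing_dates` easier to follow, here is a standalone sketch of the same two steps (swap the datetime field for 14 `?` wildcards, then fill in the remaining fields). The `template` string is an assumption reconstructed from the comment inside the function; the real value comes from `catalog_entry.path_as_pattern`, and the filenames are invented for the demonstration.

```python
import fnmatch

# Assumed pattern, reconstructed from the "ATL06_??????????????_XXXX????_002_01.h5" comment
template = (
    "ATL06_{datetime:%Y%m%d%H%M%S}_"
    "{referencegroundtrack}{cyclenumber}{orbitalsegment}_{version}_{revision}.h5"
)

# Step 1: the datetime field becomes 14 single-character wildcards
pattern = template.replace("{datetime:%Y%m%d%H%M%S}", "?" * 14)
# Step 2: the remaining fields are filled in, with "??" for the unknown ones
pattern = pattern.format(
    referencegroundtrack="0349",
    cyclenumber="??",
    orbitalsegment="??",
    version="002",
    revision="01",
)

# Hypothetical filenames, used only to show what the pattern accepts
hit = fnmatch.fnmatch("ATL06_20181021122038_03490111_002_01.h5", pattern)
miss = fnmatch.fnmatch("ATL06_20181021122038_03500111_002_01.h5", pattern)
print(pattern, hit, miss)
```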

In [9]:
crossing_dates_dict = {}
for rgt in range(0, 1388):   # ReferenceGroundTrack goes from 0001 to 1387; 0000 is kept so that list position == RGT number
    referencegroundtrack = f"{rgt:04d}"
    crossing_dates = dask.delayed(get_crossing_dates)(
        catalog_entry=catalog.icesat2atl06, root_directory=root_directory, referencegroundtrack=referencegroundtrack
    )
    crossing_dates_dict[referencegroundtrack] = crossing_dates
crossing_dates_dict = dask.compute(crossing_dates_dict)[0]

In [10]:
crossing_dates_dict["0349"].keys()

dict_keys(['2018.10.21', '2019.01.20', '2019.04.21', '2019.10.19'])
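
One caveat of keying the dictionary by date: if a ground track ever has more than one file under the same date folder (for example, several orbital segments), only the last path in sorted order survives. Whether that happens in the downloaded data is not checked here. The sketch below illustrates the effect with invented paths and shows a list-valued alternative.

```python
import collections
import os

# Hypothetical paths: two files share the 2018.10.21 folder
paths = [
    "ATLAS/2018.10.21/ATL06_20181021121500_03490110_002_01.h5",
    "ATLAS/2018.10.21/ATL06_20181021122038_03490111_002_01.h5",
    "ATLAS/2019.01.20/ATL06_20190120080000_03490211_002_01.h5",
]

# Same comprehension as in get_crossing_dates: later paths overwrite earlier ones
by_date_last = {os.path.basename(os.path.dirname(p)): p for p in sorted(paths)}

# Alternative that keeps every file for each date
by_date_all = collections.defaultdict(list)
for p in sorted(paths):
    by_date_all[os.path.basename(os.path.dirname(p))].append(p)

print(len(by_date_last), {k: len(v) for k, v in by_date_all.items()})
```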

![ICESat-2 Laser Beam Pattern](https://ars.els-cdn.com/content/image/1-s2.0-S0034425719303712-gr1.jpg)

In [11]:
def six_laser_beams(crossing_dates: dict):
    """
    For all 6 lasers along one reference ground track,
    concatenate all points from all crossing dates into one dask DataFrame.
    `crossing_dates` maps "YYYY.MM.DD" dates to HDF5 filepaths.
    """
    lasers = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]
    
    objs = [
        xr.open_mfdataset(
            paths=list(crossing_dates.values()),
            combine="nested",
            engine="h5netcdf",
            concat_dim="delta_time",
            group=f"{laser}/land_ice_segments",
            parallel=True,
        ).assign_coords(coords={"laser": laser})
        for laser in lasers
    ]
    
    try:
        da = xr.concat(objs=objs, dim="laser")  # dim=pd.Index(data=lasers, name="laser")
        df = da.unify_chunks().to_dask_dataframe()
    except ValueError:
        # ValueError: cannot reindex or align along dimension 'delta_time' because the index has duplicate values
        df = dask.dataframe.concat([obj.unify_chunks().to_dask_dataframe() for obj in objs])

    return df
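
The six beam names follow a simple scheme (three beam pairs, each with a left and a right beam), and each beam's data is read from its own HDF5 group. The short sketch below just spells out that naming; it only builds strings and does not open any file.

```python
# Three beam pairs (1, 2, 3), each with a left ("l") and right ("r") beam
lasers = [f"gt{pair}{side}" for pair in (1, 2, 3) for side in ("l", "r")]

# HDF5 group that six_laser_beams reads for each beam
groups = {laser: f"{laser}/land_ice_segments" for laser in lasers}

for laser, group in groups.items():
    print(laser, "->", group)
```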

In [12]:
dataset_dict = {}
#for referencegroundtrack in list(crossing_dates_dict)[349:350]:   # ReferenceGroundTrack goes from 0001 to 1387
for referencegroundtrack in list(crossing_dates_dict)[340:350]:   # ReferenceGroundTrack goes from 0001 to 1387
    # print(referencegroundtrack)
    if len(crossing_dates_dict[referencegroundtrack]) > 0:
        da = dask.delayed(six_laser_beams)(
            crossing_dates=crossing_dates_dict[referencegroundtrack]
        )
        # da = six_laser_beams(crossing_dates=crossing_dates_dict[referencegroundtrack])
        dataset_dict[referencegroundtrack] = da

In [13]:
df = dataset_dict["0349"].compute()  # loads into a dask dataframe (lazy)

In [14]:
df

Unnamed: 0_level_0,delta_time,laser,latitude,longitude,atl06_quality_summary,h_li,h_li_sigma,segment_id,sigma_geo_h
npartitions=115,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,datetime64[ns],object,float64,float64,float64,float32,float32,float64,float32
569904,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...
14522322,...,...,...,...,...,...,...,...,...
15181415,...,...,...,...,...,...,...,...,...


In [None]:
dataset_dict = dask.compute(dataset_dict)[0]  # compute every referencegroundtrack, slow... though somewhat parallelized

In [None]:
bdf = dask.dataframe.concat(dfs=list(dataset_dict.values()))

In [None]:
da.sel(crossingdates="2018.10.21").h_li.unify_chunks().drop(labels=["longitude", "datetime", "cyclenumber"]).hvplot(
    kind="scatter", x="latitude", by="crossingdates", datashade=True, dynspread=True,
    width=800, height=500, dynamic=True, flip_xaxis=True, hover=True
)

## Plot them points!

In [15]:
# convert dask.dataframe to pd.DataFrame
df = df.compute()

In [16]:
df = df.dropna(subset=["h_li"]).query(expr="atl06_quality_summary == 0").reset_index()

In [17]:
dfs = df.query(expr="0 <= segment_id - 1443620 < 900")
dfs

Unnamed: 0,index,delta_time,laser,latitude,longitude,atl06_quality_summary,h_li,h_li_sigma,segment_id,sigma_geo_h
6,40,2018-10-21 12:20:38.549732352,gt3l,-78.991839,-147.185583,0.0,572.686401,0.015605,1443620.0,0.301177
8,52,2018-10-21 12:20:38.552551576,gt3l,-78.992014,-147.185765,0.0,572.709351,0.019122,1443621.0,0.300144
10,64,2018-10-21 12:20:38.555369756,gt3l,-78.992189,-147.185947,0.0,572.764465,0.022926,1443622.0,0.300081
12,76,2018-10-21 12:20:38.558187524,gt3l,-78.992364,-147.186130,0.0,572.822083,0.014114,1443623.0,0.303498
14,88,2018-10-21 12:20:38.561005216,gt3l,-78.992538,-147.186313,0.0,572.834229,0.018818,1443624.0,0.300117
...,...,...,...,...,...,...,...,...,...,...
1801924,11259319,2019-10-19 18:59:56.522928840,gt1r,-79.163368,-146.952320,0.0,544.210999,0.011127,1444515.0,0.312816
1801930,11259355,2019-10-19 18:59:56.525754424,gt1r,-79.163543,-146.952507,0.0,544.146729,0.011416,1444516.0,0.338958
1801936,11259391,2019-10-19 18:59:56.528577288,gt1r,-79.163717,-146.952694,0.0,544.076782,0.010085,1444517.0,0.322600
1801942,11259427,2019-10-19 18:59:56.531397512,gt1r,-79.163892,-146.952881,0.0,543.966675,0.009702,1444518.0,0.322036


In [None]:
dfs.hvplot.scatter(
    x="longitude", y="latitude", by="laser", hover_cols=["delta_time", "segment_id"],
    #datashade=True, dynspread=True,
    #width=800, height=500, colorbar=True
)

In [19]:
import pyproj

In [20]:
transformer = pyproj.Transformer.from_crs(crs_from=pyproj.CRS.from_epsg(4326), crs_to=pyproj.CRS.from_epsg(3031), always_xy=True)
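
For intuition about what this transformer computes, the forward Antarctic Polar Stereographic projection can be written out by hand. The sketch below is a pure-Python rendering of the standard ellipsoidal polar stereographic formulas, assuming the usual EPSG:3031 parameters (WGS84 ellipsoid, true scale at 71°S, central meridian 0°). It is meant as a cross-check, not a replacement for pyproj, and the function names are made up here.

```python
import math

A = 6378137.0  # WGS84 semi-major axis in metres
F = 1 / 298.257223563  # WGS84 flattening
E = math.sqrt(F * (2 - F))  # first eccentricity
LAT_TS = 71.0  # latitude of true scale in degrees south (assumed EPSG:3031 value)


def _t(lat_rad):
    """Conformal 't' term of the projection for a positive latitude in radians."""
    es = E * math.sin(lat_rad)
    return math.tan(math.pi / 4 - lat_rad / 2) / ((1 - es) / (1 + es)) ** (E / 2)


def lonlat_to_epsg3031(longitude, latitude):
    """Project degrees longitude/latitude (southern hemisphere) to x, y in metres."""
    lat = math.radians(abs(latitude))  # south polar aspect works with |latitude|
    lon = math.radians(longitude)
    lat_ts = math.radians(LAT_TS)
    m_c = math.cos(lat_ts) / math.sqrt(1 - (E * math.sin(lat_ts)) ** 2)
    rho = A * m_c * _t(lat) / _t(lat_ts)  # distance from the South Pole on the map
    return rho * math.sin(lon), rho * math.cos(lon)


# First ATL06 point shown in the sample table of this notebook
x, y = lonlat_to_epsg3031(longitude=-147.185583, latitude=-78.991839)
print(round(x), round(y))
```

For that point the result should land within a couple of metres of the x and y values that the pyproj transformer produces in the table further down.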

In [21]:
# Use .assign so that x/y go onto a new DataFrame rather than a slice of df (avoids SettingWithCopyWarning)
x, y = transformer.transform(xx=dfs.longitude.values, yy=dfs.latitude.values)
dfs = dfs.assign(x=x, y=y)


In [22]:
dfs

Unnamed: 0,index,delta_time,laser,latitude,longitude,atl06_quality_summary,h_li,h_li_sigma,segment_id,sigma_geo_h,x,y
6,40,2018-10-21 12:20:38.549732352,gt3l,-78.991839,-147.185583,0.0,572.686401,0.015605,1443620.0,0.301177,-650091.025258,-1.008187e+06
8,52,2018-10-21 12:20:38.552551576,gt3l,-78.992014,-147.185765,0.0,572.709351,0.019122,1443621.0,0.300144,-650077.439154,-1.008173e+06
10,64,2018-10-21 12:20:38.555369756,gt3l,-78.992189,-147.185947,0.0,572.764465,0.022926,1443622.0,0.300081,-650063.849228,-1.008159e+06
12,76,2018-10-21 12:20:38.558187524,gt3l,-78.992364,-147.186130,0.0,572.822083,0.014114,1443623.0,0.303498,-650050.252182,-1.008144e+06
14,88,2018-10-21 12:20:38.561005216,gt3l,-78.992538,-147.186313,0.0,572.834229,0.018818,1443624.0,0.300117,-650036.644541,-1.008130e+06
...,...,...,...,...,...,...,...,...,...,...,...,...
1801924,11259319,2019-10-19 18:59:56.522928840,gt1r,-79.163368,-146.952320,0.0,544.210999,0.011127,1444515.0,0.312816,-643937.547615,-9.897727e+05
1801930,11259355,2019-10-19 18:59:56.525754424,gt1r,-79.163543,-146.952507,0.0,544.146729,0.011416,1444516.0,0.338958,-643923.872690,-9.897587e+05
1801936,11259391,2019-10-19 18:59:56.528577288,gt1r,-79.163717,-146.952694,0.0,544.076782,0.010085,1444517.0,0.322600,-643910.196765,-9.897448e+05
1801942,11259427,2019-10-19 18:59:56.531397512,gt1r,-79.163892,-146.952881,0.0,543.966675,0.009702,1444518.0,0.322036,-643896.524591,-9.897308e+05
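
With projected coordinates in metres, the along-track spacing between consecutive segments can be read straight off the table. The sketch below takes the first two rows (segment_id 1443620 and 1443621) as printed above; the y values are only shown to the nearest metre, so the result is approximate.

```python
import math

# (x, y) of the first two rows in the table above, in metres (EPSG:3031)
first = (-650091.025258, -1.008187e06)
second = (-650077.439154, -1.008173e06)

spacing = math.hypot(second[0] - first[0], second[1] - first[1])
print(f"{spacing:.1f} m between consecutive segments")
```

This comes out at roughly 20 m, which I believe matches the nominal ATL06 segment posting; at that spacing the 900-segment window selected earlier would cover on the order of 18 km of track.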


In [None]:
dfs.hvplot.scatter(
    x="x", y="y", by="laser", hover_cols=["delta_time", "segment_id", "h_li"],
    #datashade=True, dynspread=True,
    #width=800, height=500, colorbar=True
)

In [None]:
dfs.hvplot.scatter(x="x", y="h_li", by="laser")

In [None]:
dfs.to_pickle(path="icesat2_sample.pkl")

## Old: making a DEM grid surface from points

In [None]:
import scipy.interpolate  # the submodule must be imported explicitly for scipy.interpolate.griddata

In [None]:
# https://github.com/ICESAT-2HackWeek/gridding/blob/master/notebook/utils.py#L23
def make_grid(xmin, xmax, ymin, ymax, dx, dy):
    """Construct output grid-coordinates."""
    
    # Setup grid dimensions
    Nn = int((np.abs(ymax - ymin)) / dy) + 1
    Ne = int((np.abs(xmax - xmin)) / dx) + 1
    
    # Initiate x/y vectors for grid
    x_i = np.linspace(xmin, xmax, num=Ne)
    y_i = np.linspace(ymin, ymax, num=Nn)
    
    return np.meshgrid(x_i, y_i)
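
A small detail of `make_grid` worth knowing: the node count is derived from `dx`/`dy`, but `np.linspace` then spreads that many nodes over the full extent, so the actual spacing only equals the requested one when the extent is an exact multiple of it. The sketch below reproduces the arithmetic for one axis in plain Python; `grid_size_and_step` is a helper name invented for this illustration.

```python
def grid_size_and_step(vmin, vmax, dv):
    """Node count as in make_grid, plus the spacing that linspace would really give."""
    n = int(abs(vmax - vmin) / dv) + 1
    step = (vmax - vmin) / (n - 1) if n > 1 else 0.0
    return n, step


n_exact, step_exact = grid_size_and_step(0.0, 30.0, 10.0)  # extent is a multiple of dv
n_loose, step_loose = grid_size_and_step(0.0, 25.0, 10.0)  # extent is not a multiple of dv
print(n_exact, step_exact)  # 4 10.0
print(n_loose, step_loose)  # 3 12.5
```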

In [None]:
xi, yi = make_grid(xmin=dfs.x.min(), xmax=dfs.x.max(), ymin=dfs.y.max(), ymax=dfs.y.min(), dx=10, dy=10)

In [None]:
ar = scipy.interpolate.griddata(points=(dfs.x, dfs.y), values=dfs.h_li, xi=(xi, yi))

In [None]:
plt.imshow(ar, extent=(dfs.x.min(), dfs.x.max(), dfs.y.min(), dfs.y.max()))

In [25]:
import plotly.express as px

In [None]:
px.scatter_3d(data_frame=dfs, x="longitude", y="latitude", z="h_li", color="laser")

### Play using XrViz

Install the PyViz JupyterLab extension first using the [extension manager](https://jupyterlab.readthedocs.io/en/stable/user/extensions.html#using-the-extension-manager) or via the command below:

```bash
jupyter labextension install @pyviz/jupyterlab_pyviz@v0.8.0 --no-build
jupyter labextension list  # check to see that extension is installed
jupyter lab build --debug  # build extension ??? with debug messages printed
```

Note: Had to add `network-timeout 600000` to `.yarnrc` file to resolve university network issues.

In [None]:
import xrviz

In [None]:
xrviz.example()

In [None]:
# https://xrviz.readthedocs.io/en/latest/set_initial_parameters.html
initial_params={
    # Select variable to plot
    "Variables": "h_li",
    # Set coordinates
    "Set Coords": ["longitude", "latitude"],
    # Axes
    "x": "longitude",
    "y": "latitude",
    #"sigma": "animate",
    # Projection
    #"is_geo": True,
    #"basemap": True,
    #"crs": "PlateCarree"
}
dashboard = xrviz.dashboard.Dashboard(data=dataset) #, initial_params=initial_params)

In [None]:
dashboard.panel

In [None]:
dashboard.show()

## OpenAltimetry

In [None]:
"minx=-154.56678505984297&miny=-88.82881451427136&maxx=-125.17872921546498&maxy=-81.34051361301398&date=2019-05-02&trackId=516"

In [None]:
# Paste the OpenAltimetry selection parameters here
OA_REFERENCE_URL = 'minx=-177.64275595145213&miny=-88.12014866942751&maxx=-128.25920892322736&maxy=-85.52394234080862&date=2019-05-02&trackId=515'
# We populate a list with the photon data using the OpenAltimetry API, no HDF! 
OA_URL = 'https://openaltimetry.org/data/icesat2/getPhotonData?client=jupyter&' + OA_REFERENCE_URL
OA_PHOTONS = ['Noise', 'Low', 'Medium', 'High']
# OA_PLOTTED_BEAMS = [1,2,3,4,5,6]  # you can select up to 6 beams for each ground track.
# Some beams may not be usable due to cloud cover or QC issues.
OA_BEAMS = [3,4]

In [None]:
minx, miny, maxx, maxy = [-156, -88, -127, -84]
date = "2019-05-02" # UTC date?
track = 515  # reference ground track (trackId)
beam = 1 # 1 to 6
params = {"client": "jupyter", "minx": minx, "miny": miny, "maxx": maxx, "maxy": maxy, "date": date, "trackId": str(track), "beam": str(beam)}
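
Instead of retyping the bounding box by hand, the selection string copied from OpenAltimetry can be parsed into a parameter dictionary. This is a sketch using the standard-library `urllib.parse`; the selection string is the one pasted earlier in this section, and the `beam` value added at the end is just an example.

```python
from urllib.parse import parse_qs, urlencode

selection = (
    "minx=-177.64275595145213&miny=-88.12014866942751"
    "&maxx=-128.25920892322736&maxy=-85.52394234080862"
    "&date=2019-05-02&trackId=515"
)

# parse_qs returns a list per key; each key appears once here, so take the first item
oa_params = {key: values[0] for key, values in parse_qs(selection).items()}
oa_params.update({"client": "jupyter", "beam": "1"})  # beam 1 is an arbitrary example

oa_url = "https://openaltimetry.org/data/icesat2/getPhotonData?" + urlencode(oa_params)
print(oa_url)
```

The resulting dictionary has the same keys as `params` above, so it could be passed to `requests.get` in the same way.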

In [None]:
r = requests.get(url="https://openaltimetry.org/data/icesat2/getPhotonData", params=params)

In [None]:
# OpenAltimetry Data cleansing
df = pd.json_normalize(data=r.json()["series"], meta="name", record_path="data")  # pd.io.json.json_normalize is deprecated since pandas 1.0
df.name = df.name.str.split().str.get(0) # Get e.g. just "Low" instead of "Low [12345]"
df.query(expr="name in ('Low', 'Medium', 'High')", inplace=True) # filter out Noise and Buffer points

df.rename(columns={0: "latitude", 1: "elevation", 2: "longitude"}, inplace=True)
df = df.reindex(columns=["longitude", "latitude", "elevation", "name"]) # reorder columns
df.reset_index(inplace=True)
df
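
To show what the cleansing steps do without hitting the API, here is a plain-Python equivalent run on a tiny made-up payload. The payload layout (a `series` list whose items have a `name` like `"Low [12345]"` and `data` rows ordered latitude, elevation, longitude) is inferred from the pandas code above, and the numbers are invented.

```python
# Hypothetical payload, shaped like what the pandas code above expects
payload = {
    "series": [
        {"name": "Noise [2]", "data": [[-85.0, 1500.0, -140.0], [-85.1, 30.0, -140.1]]},
        {"name": "High [1]", "data": [[-85.2, 1201.5, -140.2]]},
        {"name": "Medium [1]", "data": [[-85.3, 1202.0, -140.3]]},
    ]
}

keep = {"Low", "Medium", "High"}  # drop Noise and Buffer points
rows = []
for series in payload["series"]:
    label = series["name"].split()[0]  # "High [1]" -> "High"
    if label not in keep:
        continue
    for latitude, elevation, longitude in series["data"]:
        rows.append(
            {"longitude": longitude, "latitude": latitude, "elevation": elevation, "name": label}
        )

print(rows)
```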

In [None]:
df.hvplot.scatter(x="latitude", y="elevation")