# **ATLAS/ICESat-2 Land Ice Height [ATL06](https://nsidc.org/data/atl06/) Exploratory Data Analysis**

[Yet another](https://xkcd.com/927) take on playing with ICESat-2's Land Ice Height ATL06 data,
with a focus on analyzing ice elevation changes over Antarctica.
Specifically, this Jupyter notebook will cover:

- Downloading datasets from the web via [intake](https://intake.readthedocs.io)
- Performing [Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis)
  using the [PyData](https://pydata.org) stack (e.g. [xarray](http://xarray.pydata.org), [dask](https://dask.org))
- Plotting figures using [Hvplot](https://hvplot.holoviz.org) and [PyGMT](https://www.pygmt.org)

This is in contrast with the [icepyx](https://github.com/icesat2py/icepyx) package
and the 'official' 2019/2020 [ICESat-2 Hackweek tutorials](https://github.com/ICESAT-2HackWeek/ICESat2_hackweek_tutorials) (which are also awesome!),
which tend to use a slightly different approach (e.g. handcoded download scripts, [h5py](http://www.h5py.org) for data reading, etc.).
The core concept here is to run things in a more intuitive and scalable (parallelizable) manner on a continent scale (rather than just a specific region).

In [None]:
import glob
import json
import logging
import netrc
import os

import cartopy
import dask
import dask.dataframe
import dask.distributed
import dvc.repo
import hvplot.dask
import hvplot.pandas
import hvplot.xarray
import icepyx as ipx
import intake
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyproj
import requests
import tqdm
import xarray as xr

import deepicedrain

In [2]:
# Limit compute to 8 cores for download part
client = dask.distributed.Client(n_workers=8, threads_per_worker=1)
client

Client: Scheduler: tcp://127.0.0.1:35061, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 8, Cores: 8, Memory: 187.40 GiB


## Quick view

Use our [intake catalog](https://intake.readthedocs.io/en/latest/catalog.html) to get some sample ATL06 data
(while making sure we have our Earthdata credentials set up properly),
and view it using [xarray](https://xarray.pydata.org) and [hvplot](https://hvplot.pyviz.org).
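
Before touching the catalog, it can help to confirm that a `.netrc` entry for Earthdata actually exists, not just that the file does. The helper below is a minimal sketch using only the standard library; the function name `has_earthdata_login` is made up for illustration, and the host name is taken from the `echo` command shown in the error message of this notebook.

```python
import netrc
import os
import tempfile


def has_earthdata_login(netrc_path=None, host="urs.earthdata.nasa.gov") -> bool:
    """Return True if the netrc file holds credentials for `host`.

    `netrc_path=None` falls back to the default ~/.netrc file.
    (Hypothetical helper, not part of deepicedrain or intake.)
    """
    try:
        auth = netrc.netrc(netrc_path).authenticators(host)
    except (FileNotFoundError, netrc.NetrcParseError):
        return False
    return auth is not None


# Demonstrate on a throwaway file instead of the real ~/.netrc
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo_netrc")
    with open(demo_path, "w") as f:
        f.write("machine urs.earthdata.nasa.gov login demo_user password demo_pass\n")
    demo_ok = has_earthdata_login(netrc_path=demo_path)
    demo_missing = has_earthdata_login(netrc_path=os.path.join(tmpdir, "nope"))

print(demo_ok, demo_missing)
```

Calling `has_earthdata_login()` with no arguments checks the real `~/.netrc` in the home folder.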

In [3]:
# Open the local intake data catalog file containing ICESat-2 stuff
catalog = intake.open_catalog("deepicedrain/atlas_catalog.yaml")
# or if the deepicedrain python package is installed, you can use either of the below:
# catalog = deepicedrain.catalog
# catalog = intake.cat.atlas_cat

In [4]:
try:
    netrc.netrc()
except FileNotFoundError as error_msg:
    print(
        f"{error_msg}, please follow instructions to create one at "
        "https://nsidc.org/support/faq/what-options-are-available-bulk-downloading-data-https-earthdata-login-enabled "
        'basically using `echo "machine urs.earthdata.nasa.gov login <uid> password <password>" >> ~/.netrc`'
    )
    raise

# data download will depend on having a .netrc file in home folder
dataset: xr.Dataset = catalog.icesat2atl06.to_dask().unify_chunks()
print(dataset)

<xarray.Dataset>
Dimensions:                (delta_time: 1696704)
Coordinates:
  * delta_time             (delta_time) datetime64[ns] 2020-11-11T00:23:22.61...
    latitude               (delta_time) float64 dask.array<chunksize=(50000,), meta=np.ndarray>
    longitude              (delta_time) float64 dask.array<chunksize=(50000,), meta=np.ndarray>
    datetime               (delta_time) datetime64[ns] 2020-11-11T00:23:23 .....
    referencegroundtrack   (delta_time) <U4 '0730' '0730' ... '0744' '0744'
    cyclenumber            <U2 '09'
    orbitalsegment         <U2 '11'
    version                <U3 '003'
    revision               <U2 '01'
Data variables:
    atl06_quality_summary  (delta_time) int8 dask.array<chunksize=(50000,), meta=np.ndarray>
    h_li                   (delta_time) float32 dask.array<chunksize=(50000,), meta=np.ndarray>
    h_li_sigma             (delta_time) float32 dask.array<chunksize=(50000,), meta=np.ndarray>
    segment_id             (delta_time) float

In [None]:
# dataset.hvplot.points(
#     x="longitude",
#     y="latitude",
#     c="h_li",
#     cmap="Blues",
#     rasterize=True,
#     hover=True,
#     width=800,
#     height=500,
#     geo=True,
#     coastline=True,
#     crs=cartopy.crs.PlateCarree(),
#     projection=cartopy.crs.Stereographic(central_latitude=-71),
# )
catalog.icesat2atl06.hvplot.quickview()

## Download ATL06 data using intake

Pull in all of the raw ATL06 data (HDF5 format) from the NSIDC servers via an intake catalog file.
Note that this involves hundreds, if not thousands, of gigabytes of data, so make sure there is enough storage!
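
Given the storage warning above, a quick free-space check before submitting any download jobs may save some grief. This is a rough sketch using the standard library's `shutil.disk_usage`; the helper name `enough_space` and the 1000 GB threshold are assumptions for illustration, not figures from this notebook.

```python
import shutil


def enough_space(path: str, required_gb: float) -> bool:
    """Check whether the filesystem holding `path` has `required_gb` free.

    Hypothetical helper; uses decimal gigabytes (1 GB = 1e9 bytes).
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# Assumed threshold of 1000 GB; adjust to the date range being downloaded
if not enough_space(path=".", required_gb=1000):
    print("Warning: less than 1000 GB free in the current directory")
```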

In [6]:
# Download all ICESAT2 ATLAS hdf files from start to end date
# dates0 = pd.date_range(start="2018.10.14", end="2018.12.08")  # 0th batch
# dates1 = pd.date_range(start="2018.12.10", end="2019.06.26")  # 1st batch

# Skip ICESat-2 cycles 1 to 2 from 2018.10.14 to 2019.03.28 above
dates1 = pd.date_range(start="2019.03.29", end="2019.06.26")  # 1st batch
dates2 = pd.date_range(start="2019.07.26", end="2021.07.15")  # 2nd batch
dates = dates1.append(other=dates2)
# dates = pd.date_range(start="2020.11.11", end="2021.07.15")  # custom batch

In [7]:
# Submit download jobs to Client
futures = []
for date in dates:
    # revision = 2 if date in pd.date_range(start="2020.04.22", end="2020.05.04") else 1
    source = catalog.icesat2atlasdownloader(date=date, revision=1)
    future = client.submit(
        func=source.discover, key=f"download-{date}"
    )  # triggers download of the file(s), or loads from cache
    futures.append(future)

## Download ATL06 data using icepyx and dvc's get-url function

An alternative way to obtain the raw ATL06 data (HDF5 format) from the NSIDC servers:
collect the download links with icepyx, then fetch each file with dvc's get-url.

In [8]:
@dask.delayed
def get_ATL06_links(referencegroundtrack: str = "0001") -> pd.Series:
    """Use icepyx to get ATL06 download URL links."""
    antarctic_region = ipx.Query(
        dataset="ATL06",
        date_range=["2019-03-29", "2021-07-15"],
        spatial_extent=[180, -89, -180, -60],
        version="4",
        tracks=[referencegroundtrack],
    )
    antarctic_region.avail_granules()
    df: pd.DataFrame = pd.json_normalize(
        antarctic_region.granules.avail, record_path="links"
    )
    links: pd.Series = df.query("type == 'application/x-hdfeos'").href
    os.makedirs(name=f"ATL06.00X/{referencegroundtrack}", exist_ok=True)
    links.to_csv(
        f"ATL06.00X/{referencegroundtrack}/ATL06_file_list.txt",
        index=False,
        header=None,
    )

    return links

In [9]:
# Build one delayed link-listing task per reference ground track (0001 to 1387)
links: list = []
for rgt in range(1, 1388):
    referencegroundtrack: str = str(rgt).zfill(4)
    _links = get_ATL06_links(referencegroundtrack=referencegroundtrack)
    links.append(_links)

In [10]:
_ = client.compute(links)

In [11]:
futures = []
repo = dvc.repo.Repo(root_dir=".")
for rgt in range(1, 1388):
    referencegroundtrack: str = str(rgt).zfill(4)
    with open(f"ATL06.00X/{referencegroundtrack}/ATL06_file_list.txt") as f:
        links = f.readlines()
        for url in links:
            filename = os.path.basename(url.strip())
            if not os.path.exists(f"ATL06.00X/{referencegroundtrack}/{filename}"):
                future = client.submit(
                    func=repo.get_url,
                    url=url.strip(),
                    out=f"ATL06.00X/{referencegroundtrack}",
                    jobs=1,
                    key=f"download-{referencegroundtrack}-{filename}",
                )
                futures.append(future)

In [12]:
# Check download progress here, https://stackoverflow.com/a/37901797/6611055
responses = []
for f in tqdm.tqdm(
    iterable=dask.distributed.as_completed(futures=futures), total=len(futures)
):
    responses.append(f.result())

100%|█████████████████████████████████████████████████████████████████████| 6546/6546 [3:20:40<00:00,  1.84s/it]


In [13]:
# In case of error, check which downloads are unfinished
# Manually delete those folders and retry
unfinished = []
for foo in futures:
    if foo.status != "finished":
        print(foo)
        unfinished.append(foo)
        if foo.status == "error":
            foo.retry()
            # pass

In [14]:
if unfinished:
    for task in unfinished:
        print(task)
    raise ValueError(
        f"{len(unfinished)} download tasks are unfinished,"
        " please delete those folders and retry!"
    )
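
When thousands of download tasks are in flight, a per-status tally is often easier to read than a printed list of futures. The sketch below counts statuses with `collections.Counter`; it uses stand-in objects with a `.status` attribute rather than real Dask futures, so the object list here is purely illustrative.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_statuses(futures) -> Counter:
    """Tally futures by their `.status` attribute (hypothetical helper)."""
    return Counter(future.status for future in futures)


# Stand-ins for dask.distributed futures, for illustration only
fake_futures = [
    SimpleNamespace(status="finished"),
    SimpleNamespace(status="finished"),
    SimpleNamespace(status="error"),
    SimpleNamespace(status="pending"),
]
summary = summarize_statuses(fake_futures)
print(dict(summary))
```

With the real `futures` list from the download cells, `summarize_statuses(futures)` should work the same way, since each future exposes a `status` attribute.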

## Exploratory data analysis on local files

Now that we've downloaded a good chunk of data and cached them locally,
we can have some fun with visualizing the point clouds!

In [11]:
root_directory = os.path.dirname(
    catalog.icesat2atl06.storage_options["simplecache"]["cache_storage"]
)

In [12]:
def get_crossing_dates(
    catalog_entry: intake.catalog.local.LocalCatalogEntry,
    root_directory: str,
    referencegroundtrack: str = "????",
    datetimestr: str = "*",
    cyclenumber: str = "??",
    orbitalsegment: str = "??",
    version: str = "003",
    revision: str = "01",
) -> dict:
    """
    Given a 4-digit reference groundtrack (e.g. 1234),
    we output a dictionary where the
    key is the date in "YYYY.MM.DD" format when an ICESAT2 crossing was made and the
    value is the filepath to the HDF5 data file.
    """

    # Get a glob string that looks like "ATL06_??????????????_XXXX????_003_01.h5"
    globpath: str = catalog_entry.path_as_pattern
    if datetimestr == "*":
        globpath: str = globpath.replace("{datetime:%Y%m%d%H%M%S}", "??????????????")
    globpath: str = globpath.format(
        referencegroundtrack=referencegroundtrack,
        cyclenumber=cyclenumber,
        orbitalsegment=orbitalsegment,
        version=version,
        revision=revision,
    )

    # Get list of filepaths (dates are contained in the filepath)
    globedpaths: list = glob.glob(os.path.join(root_directory, "??????????", globpath))

    # Pick out just the dates in "YYYY.MM.DD" format from the globedpaths
    # crossingdates = [os.path.basename(os.path.dirname(p=p)) for p in globedpaths]
    crossingdates: dict = {
        os.path.basename(os.path.dirname(p=p)): p for p in sorted(globedpaths)
    }

    return crossingdates
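
To see how the glob string in `get_crossing_dates` comes together, here is a standalone sketch. The `pattern` string is an assumption reconstructed from the glob comment inside the function (the real value lives in the catalog entry's `path_as_pattern`), and the two filenames are made up for the demonstration.

```python
import fnmatch

# Assumed filename pattern; the real one comes from catalog_entry.path_as_pattern
pattern = (
    "ATL06_{datetime:%Y%m%d%H%M%S}_"
    "{referencegroundtrack}{cyclenumber}{orbitalsegment}_{version}_{revision}.h5"
)

# Swap the datetime field for 14 single-character wildcards, then fill in the rest
globpath = pattern.replace("{datetime:%Y%m%d%H%M%S}", "?" * 14).format(
    referencegroundtrack="0349",
    cyclenumber="??",
    orbitalsegment="??",
    version="003",
    revision="01",
)
print(globpath)

# Hypothetical filenames: one on track 0349, one on track 0350
on_track = fnmatch.fnmatchcase("ATL06_20190421034648_03490311_003_01.h5", globpath)
off_track = fnmatch.fnmatchcase("ATL06_20190421034648_03500311_003_01.h5", globpath)
print(on_track, off_track)
```

Each `?` matches exactly one character, so the fixed-width fields (14-digit timestamp, 2-digit cycle, 2-digit segment) cannot bleed into one another.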

In [13]:
crossing_dates_dict = {}
for rgt in range(1, 1388):  # ReferenceGroundTrack goes from 0001 to 1387
    referencegroundtrack: str = f"{rgt}".zfill(4)
    crossing_dates: dict = dask.delayed(get_crossing_dates)(
        catalog_entry=catalog.icesat2atl06,
        root_directory=root_directory,
        referencegroundtrack=referencegroundtrack,
    )
    crossing_dates_dict[referencegroundtrack] = crossing_dates
crossing_dates_dict = dask.compute(crossing_dates_dict)[0]

In [14]:
crossing_dates_dict["0349"].keys()

dict_keys(['2018.10.21', '2019.01.20', '2019.04.21', '2019.10.19', '2020.01.18', '2020.04.18', '2020.07.18', '2020.10.17'])

![ICESat-2 Laser Beam Pattern](https://ars.els-cdn.com/content/image/1-s2.0-S0034425719303712-gr1.jpg)

In [15]:
def six_laser_beams(filepaths: list) -> dask.dataframe.DataFrame:
    """
    For all 6 lasers along one reference ground track,
    concatenate all points from all crossing dates into one Dask DataFrame

    E.g. if there are 5 crossing dates and 6 lasers,
    there would be data from 5 x 6 = 30 files being concatenated together.
    """
    lasers: list = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]

    objs: list = [
        xr.open_mfdataset(
            paths=filepaths,
            combine="by_coords",
            engine="h5netcdf",
            group=f"{laser}/land_ice_segments",
            parallel=True,
        ).assign_coords(coords={"laser": laser})
        for laser in lasers
    ]

    try:
        da: xr.Dataset = xr.concat(objs=objs, dim="laser")
        df: dask.dataframe.DataFrame = da.unify_chunks().to_dask_dataframe()
    except ValueError:
        # ValueError: cannot reindex or align along dimension 'delta_time'
        # because the index has duplicate values
        df: dask.dataframe.DataFrame = dask.dataframe.concat(
            [obj.unify_chunks().to_dask_dataframe() for obj in objs]
        )

    return df
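
The "5 crossing dates x 6 lasers = 30" arithmetic in the docstring can be made concrete by listing every (file, HDF5 group) pair that ends up being read. This is an illustrative sketch only: the file names are placeholders, and the group path follows the `f"{laser}/land_ice_segments"` string used in `six_laser_beams`.

```python
from itertools import product

lasers = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]
# Placeholder file names standing in for 5 crossing dates of one ground track
filepaths = [f"crossing_{i}.h5" for i in range(1, 6)]

# One (file, group) pair per laser per crossing date
pairs = [
    (filepath, f"{laser}/land_ice_segments")
    for laser, filepath in product(lasers, filepaths)
]
print(len(pairs))
print(pairs[0])
```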

In [16]:
dataset_dict = {}
# ReferenceGroundTrack goes from 0001 to 1387
for referencegroundtrack in list(crossing_dates_dict)[348:349]:
    # print(referencegroundtrack)
    filepaths = list(crossing_dates_dict[referencegroundtrack].values())
    if len(filepaths) > 0:
        dataset_dict[referencegroundtrack] = dask.delayed(obj=six_laser_beams)(
            filepaths=filepaths
        )
        # df = six_laser_beams(filepaths=filepaths)

In [17]:
df = dataset_dict["0349"].compute()  # loads into a dask dataframe (lazy)

In [18]:
df

| npartitions=265 | delta_time | laser | latitude | longitude | atl06_quality_summary | h_li | h_li_sigma | segment_id | sigma_geo_h |
|---|---|---|---|---|---|---|---|---|---|
| 0 | datetime64[ns] | object | float64 | float64 | float64 | float32 | float32 | float64 | float32 |
| 340326 | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15565554 | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15966059 | ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [19]:
# compute every referencegroundtrack, slow... though somewhat parallelized
# dataset_dict = dask.compute(dataset_dict)[0]

In [20]:
# big dataframe containing data across all 1387 reference ground tracks!
# bdf = dask.dataframe.concat(dfs=list(dataset_dict.values()))

## Plot ATL06 points!

In [21]:
# Convert dask.DataFrame to pd.DataFrame
df: pd.DataFrame = df.compute()

In [22]:
# Drop points with poor quality
df = df.dropna(subset=["h_li"]).query(expr="atl06_quality_summary == 0").reset_index()

In [23]:
# Get a small random sample of our data
dfs = df.sample(n=1_000, random_state=42)
dfs.head()

| | index | delta_time | laser | latitude | longitude | atl06_quality_summary | h_li | h_li_sigma | segment_id | sigma_geo_h |
|---|---|---|---|---|---|---|---|---|---|---|
| 819646 | 4944847 | 2019-04-21 03:46:48.291173520 | gt1r | -76.193339 | 49.741861 | 0.0 | 3457.071289 | 0.019371 | 1581386.0 | 0.307167 |
| 1232641 | 7617374 | 2020-01-18 14:46:13.000787432 | gt2l | -75.688184 | 49.531386 | 0.0 | 3404.925293 | 0.022874 | 1584229.0 | 0.302676 |
| 1192498 | 7376515 | 2020-01-18 14:45:54.142172976 | gt1r | -76.893664 | 50.236463 | 0.0 | 3472.934814 | 0.01083 | 1577405.0 | 0.350412 |
| 1311341 | 8164152 | 2020-01-18 14:47:45.788451720 | gt1l | -69.876752 | 46.670177 | 0.0 | 2393.138672 | 0.029221 | 1617142.0 | 0.33098 |
| 1325158 | 8247056 | 2020-01-18 14:47:52.279762744 | gt2l | -69.464079 | 46.61075 | 0.0 | 2231.435791 | 0.021212 | 1619451.0 | 0.307674 |


In [None]:
dfs.hvplot.scatter(
    x="longitude",
    y="latitude",
    by="laser",
    hover_cols=["delta_time", "segment_id"],
    # datashade=True, dynspread=True,
    # width=800, height=500, colorbar=True
)

### Transform from EPSG:4326 (lat/lon) to EPSG:3031 (Antarctic Polar Stereographic)

In [25]:
dfs["x"], dfs["y"] = deepicedrain.lonlat_to_xy(
    longitude=dfs.longitude, latitude=dfs.latitude
)

In [26]:
dfs.head()

| | index | delta_time | laser | latitude | longitude | atl06_quality_summary | h_li | h_li_sigma | segment_id | sigma_geo_h | x | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 819646 | 4944847 | 2019-04-21 03:46:48.291173520 | gt1r | -76.193339 | 49.741861 | 0.0 | 3457.071289 | 0.019371 | 1581386.0 | 0.307167 | 1150156.0 | 973959.7 |
| 1232641 | 7617374 | 2020-01-18 14:46:13.000787432 | gt2l | -75.688184 | 49.531386 | 0.0 | 3404.925293 | 0.022874 | 1584229.0 | 0.302676 | 1188936.0 | 1014321.0 |
| 1192498 | 7376515 | 2020-01-18 14:45:54.142172976 | gt1r | -76.893664 | 50.236463 | 0.0 | 3472.934814 | 0.01083 | 1577405.0 | 0.350412 | 1099248.0 | 914674.5 |
| 1311341 | 8164152 | 2020-01-18 14:47:45.788451720 | gt1l | -69.876752 | 46.670177 | 0.0 | 2393.138672 | 0.029221 | 1617142.0 | 0.33098 | 1606343.0 | 1515321.0 |
| 1325158 | 8247056 | 2020-01-18 14:47:52.279762744 | gt2l | -69.464079 | 46.61075 | 0.0 | 2231.435791 | 0.021212 | 1619451.0 | 0.307674 | 1638361.0 | 1548738.0 |
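
For readers curious about what the EPSG:4326 to EPSG:3031 step does numerically, here is a dependency-free sketch of the south polar stereographic forward formula (WGS84 ellipsoid, true scale at 71°S, central meridian 0°). It is not how `deepicedrain.lonlat_to_xy` is implemented; in practice a projection library such as pyproj would be the safer choice. The function name `lonlat_to_xy_3031` is made up for this example.

```python
import math

WGS84_A = 6378137.0  # semi-major axis in metres
WGS84_F = 1 / 298.257223563  # flattening


def lonlat_to_xy_3031(longitude: float, latitude: float) -> tuple:
    """Project lon/lat in degrees to Antarctic Polar Stereographic metres.

    Illustrative sketch of the ellipsoidal polar stereographic formula,
    assuming EPSG:3031 parameters (true scale at 71 deg S, meridian 0 deg).
    """
    e = math.sqrt(WGS84_F * (2 - WGS84_F))  # first eccentricity

    def _t(phi: float) -> float:
        # phi is the latitude magnitude (south positive) in radians
        s = math.sin(phi)
        return math.tan(math.pi / 4 - phi / 2) / ((1 - e * s) / (1 + e * s)) ** (e / 2)

    phi = math.radians(-latitude)  # flip sign for the southern hemisphere
    phi_c = math.radians(71.0)  # latitude of true scale
    m_c = math.cos(phi_c) / math.sqrt(1 - (e * math.sin(phi_c)) ** 2)
    rho = WGS84_A * m_c * _t(phi) / _t(phi_c)  # distance from the pole
    lam = math.radians(longitude)
    return rho * math.sin(lam), rho * math.cos(lam)


# First sampled point from the table above
x, y = lonlat_to_xy_3031(longitude=49.741861, latitude=-76.193339)
print(round(x), round(y))
```

The result can be compared against the `x` and `y` columns in the table above; small differences are expected because the table's latitude and longitude are rounded for display.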


In [None]:
dfs.hvplot.scatter(
    x="x",
    y="y",
    by="laser",
    hover_cols=["delta_time", "segment_id", "h_li"],
    # datashade=True, dynspread=True,
    # width=800, height=500, colorbar=True
)

In [None]:
# Plot cross section view
dfs.hvplot.scatter(x="x", y="h_li", by="laser")

## Experimental Work-in-Progress stuff below

### Play using XrViz

In [None]:
import xrviz

In [None]:
xrviz.example()

In [None]:
# https://xrviz.readthedocs.io/en/latest/set_initial_parameters.html
initial_params = {
    # Select variable to plot
    "Variables": "h_li",
    # Set coordinates
    "Set Coords": ["longitude", "latitude"],
    # Axes
    "x": "longitude",
    "y": "latitude",
    # "sigma": "animate",
    # Projection
    # "is_geo": True,
    # "basemap": True,
    # "crs": "PlateCarree"
}
dashboard = xrviz.dashboard.Dashboard(data=dataset)  # , initial_params=initial_params)

In [None]:
dashboard.panel

In [None]:
dashboard.show()

## OpenAltimetry

In [None]:
"minx=-154.56678505984297&miny=-88.82881451427136&maxx=-125.17872921546498&maxy=-81.34051361301398&date=2019-05-02&trackId=516"

In [None]:
# Paste the OpenAltimetry selection parameters here
OA_REFERENCE_URL = "minx=-177.64275595145213&miny=-88.12014866942751&maxx=-128.25920892322736&maxy=-85.52394234080862&date=2019-05-02&trackId=515"
# We populate a list with the photon data using the OpenAltimetry API, no HDF!
OA_URL = (
    "https://openaltimetry.org/data/icesat2/getPhotonData?client=jupyter&"
    + OA_REFERENCE_URL
)
OA_PHOTONS = ["Noise", "Low", "Medium", "High"]
# OA_PLOTTED_BEAMS = [1,2,3,4,5,6] you can select up to 6 beams for each ground track.
# Some beams may not be usable due to cloud cover or QC issues.
OA_BEAMS = [3, 4]
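
Rather than retyping the bounding box by hand, the pasted OpenAltimetry selection string can be parsed into a parameter dictionary with the standard library. This is a small sketch; the added `client` and `beam` keys mirror the `params` dictionary used elsewhere in this notebook, and the beam number is an arbitrary choice.

```python
from urllib.parse import parse_qsl, urlencode

OA_REFERENCE_URL = (
    "minx=-177.64275595145213&miny=-88.12014866942751"
    "&maxx=-128.25920892322736&maxy=-85.52394234080862"
    "&date=2019-05-02&trackId=515"
)

# Split "key=value&key=value" into a dict of strings
oa_params = dict(parse_qsl(OA_REFERENCE_URL))
oa_params["client"] = "jupyter"
oa_params["beam"] = "1"  # arbitrary beam chosen for illustration

# Re-encode into a query string, e.g. to inspect what would be sent
query_string = urlencode(oa_params)
print(oa_params["trackId"], oa_params["date"])
```

The resulting dictionary can be passed directly as the `params` argument of `requests.get`.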

In [None]:
minx, miny, maxx, maxy = [-156, -88, -127, -84]
date = "2019-05-02"  # UTC date?
track = 515  # reference ground track ID
beam = 1  # 1 to 6
params = {
    "client": "jupyter",
    "minx": minx,
    "miny": miny,
    "maxx": maxx,
    "maxy": maxy,
    "date": date,
    "trackId": str(track),
    "beam": str(beam),
}

In [None]:
r = requests.get(
    url="https://openaltimetry.org/data/icesat2/getPhotonData", params=params
)

In [None]:
# OpenAltimetry data cleansing
df = pd.json_normalize(data=r.json()["series"], meta="name", record_path="data")
# Get e.g. just "Low" instead of "Low [12345]"
df["name"] = df["name"].str.split().str.get(0)
# Filter out Noise and Buffer points
df = df.query(expr="name in ('Low', 'Medium', 'High')")

df = df.rename(columns={0: "latitude", 1: "elevation", 2: "longitude"})
df = df.reindex(
    columns=["longitude", "latitude", "elevation", "name"]
)  # reorder columns
df = df.reset_index()
df
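
The cleansing steps above can be traced on a tiny hand-made payload without hitting the API. The sketch below assumes the response shape implied by the code in this notebook (a `series` list whose items have a `name` like `"Low [12345]"` and `data` rows ordered latitude, elevation, longitude); the numbers are invented purely for illustration.

```python
# Made-up payload mimicking the assumed getPhotonData response shape
payload = {
    "series": [
        {"name": "Noise [2]", "data": [[-85.00, 90.0, -150.00], [-85.01, 95.0, -150.01]]},
        {"name": "Low [1]", "data": [[-85.02, 101.5, -150.02]]},
        {"name": "High [2]", "data": [[-85.03, 102.0, -150.03], [-85.04, 102.5, -150.04]]},
    ]
}

rows = []
for series in payload["series"]:
    confidence = series["name"].split()[0]  # "Low [1]" becomes "Low"
    if confidence not in {"Low", "Medium", "High"}:
        continue  # skip Noise and Buffer points
    for latitude, elevation, longitude in series["data"]:
        rows.append(
            {
                "longitude": longitude,
                "latitude": latitude,
                "elevation": elevation,
                "name": confidence,
            }
        )

print(len(rows), [row["name"] for row in rows])
```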

In [None]:
df.hvplot.scatter(x="latitude", y="elevation")