**CRIB SHEET RULES OF THE ROAD:**

This crib sheet is provided to support access, utilization, and plotting of UCalgary optical datasets. It is intended as a base set of code that a user may edit and manipulate to serve their own needs. Crib sheets contain UCalgary verified and validated procedures for plotting and manipulating UCalgary ASI data for common use cases. Use of this crib sheet does not require acknowledgment; it is freely distributed for personal scientific use. The crib sheet (or elements of the crib sheet) must not be ingested into third-party libraries without written consent of the UCalgary team. Please also remember to perform due diligence on all data use. We recommend comparison with verified data products on [data.phys.ucalgary.ca](https://data.phys.ucalgary.ca) to ensure that any user output does not contradict operational summary plots. Data use must be acknowledged according to the information available for each data set; please see [data.phys.ucalgary.ca](https://data.phys.ucalgary.ca). If you encounter any issues with the data or the crib sheet, please contact the UCalgary team for support (Emma Spanswick, elspansw@ucalgary.ca). Copyright © University of Calgary.

# Downloading data from our archive using the API

Data can be downloaded from our archive using an API, in addition to the conventional FTP, Rsync, and other HTTP methods. The API is currently under development and we will do our best to keep this crib sheet up-to-date with the latest changes. If you have any questions, please reach out to the UCalgary Team (Emma Spanswick, elspansw@ucalgary.ca).

## Install dependencies

In [1]:
!pip install requests joblib tqdm



In [2]:
import os
import pprint
import getpass
import requests
import joblib
from tqdm.notebook import tqdm

## List available datasets

In [3]:
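def list_datasets():
    # request the full dataset listing, and fail loudly on an HTTP error
    # instead of trying to parse an error response as JSON
    r = requests.get("https://api.phys.ucalgary.ca/api/v1/data_distribution/datasets")
    r.raise_for_status()
    return r.json()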

# extract specific information about the datasets that we want to show
datasets = list_datasets()
total = len(datasets)
file_listing_supported = [d for d in datasets if d["file_listing_supported"] is True]

# show a listing of the datasets
print("Found %d datasets, with file listing supported for %d\n" % (total, len(file_listing_supported)))
print("%-45s%s\n%s" % ("Dataset Name", "File Listing Supported?", '-'*75))
for d in datasets:
    print("%-45s%s" % (d["name"], d["file_listing_supported"]))

#--------------------------------
# You'll notice that not all of our datasets are currently available, and even fewer 
# support file listing. As we continue developing the API, more datasets will be 
# added and file listing will be enabled for more of them. The output of this cell 
# will change over the coming months.

Found 63 datasets, with file listing supported for 10

Dataset Name                                 File Listing Supported?
---------------------------------------------------------------------------
REGO_CALIBRATION_FLATFIELD_IDLSAV            False
REGO_CALIBRATION_RAYLEIGHS_IDLSAV            False
REGO_SKYMAP_IDLSAV                           False
REGO_STREAM0_RAW                             True
REGO_STREAM2_DAILY_KEOGRAM_JPG               False
REGO_STREAM2_DAILY_KEOGRAM_PGM               False
REGO_STREAM2_DAILY_KEOGRAM_PNG               False
REGO_STREAM2_DAILY_MONTAGE_JPG               False
REGO_STREAM2_DAILY_MONTAGE_PGM               False
REGO_STREAM2_DAILY_MONTAGE_PNG               False
REGO_STREAM2_HOURLY_KEOGRAM_JPG              False
REGO_STREAM2_HOURLY_KEOGRAM_PGM              False
REGO_STREAM2_HOURLY_KEOGRAM_PNG              False
REGO_STREAM2_HOURLY_MONTAGE_JPG              False
REGO_STREAM2_HOURLY_MONTAGE_PGM              False
REGO_STREAM2_HOURLY_MONTAGE_PNG     

In [4]:
# a 'dataset' also contains additional information including a
# description, link to the data tree, and citation details.
#
# let's view one of the records
pprint.pprint(datasets[3])

{'citation': 'Spanswick, E., & Donovan, E. (2014). Redline Geospace '
             'Observatory (REGO) dataset [Data set]. University of Calgary. '
             'https://doi.org/10.11575/Z7X6-5C42',
 'data_tree_url': 'https://data.phys.ucalgary.ca/sort_by_project/GO-Canada/REGO/stream0',
 'doi': 'https://doi.org/10.11575/Z7X6-5C42',
 'doi_details': 'https://commons.datacite.org/doi.org/10.11575/z7x6-5c42',
 'file_listing_supported': True,
 'long_description': 'Redline Geospace Observatory (REGO) All Sky Imager '
                     'array. More information can be found at '
                     'https://data.phys.ucalgary.ca',
 'name': 'REGO_STREAM0_RAW',
 'short_description': 'REGO All Sky Imagers 3-sec raw data'}
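
Since each record is a plain dictionary, looking one up by name and pulling out its DOI takes only a few lines. The following is a minimal sketch, assuming records carry the `name` and `doi` keys shown in the output above. `find_dataset` is a hypothetical helper and not part of the API, and `sample_datasets` is a trimmed stand-in for the list returned by `list_datasets()`.

```python
def find_dataset(datasets, name):
    """Return the first dataset record whose 'name' matches, or None."""
    for d in datasets:
        if d.get("name") == name:
            return d
    return None

# stand-in record, trimmed from the output above; in the notebook, pass
# the list returned by list_datasets() instead
sample_datasets = [
    {
        "name": "REGO_STREAM0_RAW",
        "file_listing_supported": True,
        "doi": "https://doi.org/10.11575/Z7X6-5C42",
    },
]

record = find_dataset(sample_datasets, "REGO_STREAM0_RAW")
if record is None:
    print("dataset not found")
else:
    print("%s (DOI: %s)" % (record["name"], record["doi"]))
```

Looking records up by name rather than by list position avoids depending on the order in which the API returns datasets.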


## Retrieve list of URLs of data files

Now we'll use the API to retrieve URLs for an hour of THEMIS ASI data at Gillam. Later on, we'll use these URLs to download the files quickly.
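
The request in the next cell passes its time range as strings such as `2022-01-01T06:00`. If you want to build those strings programmatically, for example when looping over several hours, something like the following may help. It is a sketch only: `hour_range` is a hypothetical helper, and the `YYYY-MM-DDTHH:MM` format is assumed from the values used in this crib sheet rather than taken from the API documentation.

```python
import datetime

def hour_range(year, month, day, hour):
    """Return (start, end) strings covering one UT hour, minute 00 through 59."""
    first = datetime.datetime(year, month, day, hour, 0)
    last = first + datetime.timedelta(minutes=59)
    fmt = "%Y-%m-%dT%H:%M"  # assumed format, matching the values used in this crib sheet
    return first.strftime(fmt), last.strftime(fmt)

hour_start, hour_end = hour_range(2022, 1, 1, 6)
print(hour_start, hour_end)
```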

In [5]:
# get dataset details we're interested in
dataset_name = "THEMIS_ASI_RAW"
dataset = None
for d in datasets:
    if (d["name"] == dataset_name):
        dataset = d
        break
if (dataset is None):
    raise ValueError("Dataset '%s' not found in the API listing" % (dataset_name))

def get_data_urls(dataset_name, start, end, site_uid):
    # ask the API for the URLs of all files in the time range at one site
    params = {"name": dataset_name, "start": start, "end": end, "site_uid": site_uid, "include_total_bytes": True}
    r = requests.get("https://api.phys.ucalgary.ca/api/v1/data_distribution/urls", params=params)
    r.raise_for_status()
    return r.json()

# set up API request
start = "2022-01-01T06:00"
end = "2022-01-01T06:59"
site_uid = "gill"
data = get_data_urls(dataset_name, start, end, site_uid)
print("Found %d URLs, showing first 10\n" % (len(data["urls"])))
pprint.pprint(data["urls"][0:10])
print("...")

Found 60 URLs, showing first 10

['https://data.phys.ucalgary.ca/sort_by_project/THEMIS/asi/stream0/2022/01/01/gill_themis19/ut06/20220101_0600_gill_themis19_full.pgm.gz',
 'https://data.phys.ucalgary.ca/sort_by_project/THEMIS/asi/stream0/2022/01/01/gill_themis19/ut06/20220101_0601_gill_themis19_full.pgm.gz',
 'https://data.phys.ucalgary.ca/sort_by_project/THEMIS/asi/stream0/2022/01/01/gill_themis19/ut06/20220101_0602_gill_themis19_full.pgm.gz',
 'https://data.phys.ucalgary.ca/sort_by_project/THEMIS/asi/stream0/2022/01/01/gill_themis19/ut06/20220101_0603_gill_themis19_full.pgm.gz',
 'https://data.phys.ucalgary.ca/sort_by_project/THEMIS/asi/stream0/2022/01/01/gill_themis19/ut06/20220101_0604_gill_themis19_full.pgm.gz',
 'https://data.phys.ucalgary.ca/sort_by_project/THEMIS/asi/stream0/2022/01/01/gill_themis19/ut06/20220101_0605_gill_themis19_full.pgm.gz',
 'https://data.phys.ucalgary.ca/sort_by_project/THEMIS/asi/stream0/2022/01/01/gill_themis19/ut06/20220101_0606_gill_themis19_full.pgm

## Download the data

Now that we have the URLs, we can download the files. We'll use joblib to download multiple files in parallel and tqdm to show a progress bar. Both are nice additions, but neither is necessary.

NOTE: it is good practice to preserve the data tree structure when saving files to your computer. This makes it easy to switch to the other data download methods (like Rsync or FTP), and it is also good data management. Taking the easy approach of placing all files into a single directory can quickly get out of control given the amount of data in our archive. We have over a billion files!
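
To make the idea concrete, here is a minimal sketch of mapping a file URL to a local path that mirrors the archive's directory layout. `local_path_for` is a hypothetical helper, the data tree root is assumed from the THEMIS URLs listed earlier, and the output directory is a placeholder.

```python
import posixpath

def local_path_for(url, data_tree_url, output_base_path):
    """Map a data file URL to a local path that mirrors the archive tree."""
    if not url.startswith(data_tree_url):
        raise ValueError("URL is not under the data tree: %s" % (url))
    # keep everything after the data tree root: year/month/day/site/hour/file
    relative = url[len(data_tree_url):].lstrip("/")
    return posixpath.join(output_base_path, relative)

# illustrative values only: the data tree root is assumed from the URLs
# listed earlier, and the output path is a placeholder
example_tree = "https://data.phys.ucalgary.ca/sort_by_project/THEMIS/asi/stream0"
example_url = example_tree + "/2022/01/01/gill_themis19/ut06/20220101_0600_gill_themis19_full.pgm.gz"
print(local_path_for(example_url, example_tree, "C:/ucalgary_data/THEMIS_ASI_RAW"))
```

Because the year, month, day, site, and hour directories are preserved, files fetched through the API land in the same relative locations that an Rsync or FTP mirror of the same tree would use.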

In [6]:
# set the top-level output path we want to save the files to
#
# NOTE: This crib sheet assumes you're using a Windows computer. Change as needed.
output_base_path = "C:/Users/%s/Desktop/ucalgary_data/%s" % (getpass.getuser(), dataset_name)

In [7]:
def download_url(url, prefix, output_base_path, pbar=None):
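    # set output filename, keeping the file's path relative to the data tree
    # (str.removeprefix requires Python 3.9+; the leading slash is stripped 
    # so that the joined path has no doubled separator)
    output_filename = "%s/%s" % (output_base_path, url.removeprefix(prefix).lstrip("/"))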

    # create destination directory
    try:
        os.makedirs(os.path.dirname(output_filename), exist_ok=True)
    except Exception:
        # NOTE: sometimes when making directories in parallel there are race conditions. We put 
        # in a catch here and carry on if there are ever issues. 
        pass

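    # retrieve file and save to disk, raising on an HTTP error so that an 
    # error page is never written out as if it were data
    r = requests.get(url)
    r.raise_for_status()
    with open(output_filename, 'wb') as fp:
        fp.write(r.content)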

    # advance progress bar
    if (pbar is not None):
        pbar.update()

def download_urls(dataset, urls, output_base_path, n_parallel=5):
    prefix_to_strip = dataset["data_tree_url"]
    with tqdm(total=len(urls), desc="Downloading and saving files to disk") as pbar:
        joblib.Parallel(n_jobs=n_parallel, prefer="threads")(
            joblib.delayed(download_url)(
                url,
                prefix_to_strip,
                output_base_path,
                pbar=pbar,
            ) for url in urls
        )
    print("\nData saved to %s" % (output_base_path))

# download the data
download_urls(dataset, data["urls"], output_base_path)

Downloading and saving files to disk:   0%|          | 0/60 [00:00<?, ?it/s]


Data saved to C:/Users/darrenc/Desktop/ucalgary_data/THEMIS_ASI_STREAM0_RAW
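
After a download finishes, it can be worth confirming that the number of files on disk matches the number of URLs requested. The following is a small sketch using only the standard library. `summarize_tree` is a hypothetical helper, and the demonstration runs on a throwaway directory rather than on real data.

```python
import os
import tempfile

def summarize_tree(base_path):
    """Count the files under base_path and add up their sizes in bytes."""
    n_files = 0
    n_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(base_path):
        for name in filenames:
            n_files += 1
            n_bytes += os.path.getsize(os.path.join(dirpath, name))
    return n_files, n_bytes

# demonstration on a throwaway directory holding a single 5-byte file
with tempfile.TemporaryDirectory() as tmp:
    day_dir = os.path.join(tmp, "2022", "01", "01")
    os.makedirs(day_dir)
    with open(os.path.join(day_dir, "example.pgm.gz"), "wb") as fp:
        fp.write(b"12345")
    demo_files, demo_bytes = summarize_tree(tmp)
    print("%d files, %d bytes" % (demo_files, demo_bytes))
```

In the notebook, calling `summarize_tree(output_base_path)` and comparing the file count against `len(data["urls"])` gives a quick sanity check, keeping in mind that the directory may also hold files from earlier downloads.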
