This notebook is a quick introduction to the main ideas in the T4 API. To learn more, check out the official documentation.

## Installation

To get started, you will first need to [install the `t4` Python client](https://github.com/quiltdata/t4/blob/master/UserDocs.md). Then import it into the environment:

In [1]:
import t4

Let's also set up some configuration:

In [2]:
# The name of the bucket you will run this demo against. This must be 
# an S3 bucket you have access to.
bucket_name   = "s3://alpha-quilt-storage"

# The subfolder inside of the bucket that this demo will be placed in.
bucket_folder = "hurdat-demo"

# The local folder that will act as scratch space for some files we
# will create in this notebook.
# This path must end in a forward slash ("/").
local_folder  = "./"

# A date timestamp that will be included in the output path for files
# pushed to S3 by this notebook. Helps ensure tidiness.
from datetime import datetime
bucket_subfolder = str(datetime.now())\
    .replace(" ", "-").replace(":", "_").replace(".", "_")

# Resulting path.
t4_path = f'{bucket_name}/{bucket_folder}/{bucket_subfolder}/'
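To make the path construction above concrete, here is a small sketch that applies the same string replacements to a fixed, made-up moment in time, so the result is reproducible. The helper name `timestamp_folder` and the example date are assumptions introduced for illustration only.

```python
from datetime import datetime


def timestamp_folder(moment):
    """Turn a datetime into a folder-safe name using the same replacements as the config cell."""
    return str(moment).replace(" ", "-").replace(":", "_").replace(".", "_")


# A fixed, made-up moment (not taken from the notebook) so the output is stable.
example_moment = datetime(2019, 1, 2, 3, 4, 5, 678901)
example_subfolder = timestamp_folder(example_moment)

# Same layout as t4_path: bucket, demo folder, timestamped subfolder.
example_path = f"s3://alpha-quilt-storage/hurdat-demo/{example_subfolder}/"
print(example_path)
```

Because the subfolder name changes on every run, repeated runs of the notebook write to separate locations instead of overwriting each other.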

We'll also need some data. Here's a script we've built that downloads and cleans up an NOAA hurricane dataset known as HURDAT. It is pretty typical of the sorts of clean-up scripts you'd be running when performing data science:

In [5]:
# %load build.py
import requests
import io
from collections import Counter
import pandas as pd
import numpy as np


atlantic_raw = requests.get(
    "https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2017-050118.txt"
)
atlantic_raw.raise_for_status()  # check that we actually got something back

c = Counter()
for line in io.StringIO(atlantic_raw.text).readlines():
    c[line[:2]] += 1

atlantic_storms_r = []
atlantic_storm_r = {'header': None, 'data': []}

for i, line in enumerate(io.StringIO(atlantic_raw.text).readlines()):
    if line[:2] == 'AL':
        atlantic_storms_r.append(atlantic_storm_r.copy())
        atlantic_storm_r['header'] = line
        atlantic_storm_r['data'] = []
    else:
        atlantic_storm_r['data'].append(line)

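# The loop above only stores a storm when it reaches the next header, so
# store the final storm here, then drop the empty placeholder stored first.
atlantic_storms_r.append(atlantic_storm_r.copy())
atlantic_storms_r = atlantic_storms_r[1:]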

atlantic_storm_dfs = []
for storm_dict in atlantic_storms_r:
    storm_id, storm_name, storm_entries_n = storm_dict['header'].split(",")[:3]
    data = [[entry.strip() for entry in datum[:-1].split(",")] for datum in storm_dict['data']]
    frame = pd.DataFrame(data)
    frame['id'] = storm_id
    frame['name'] = storm_name
    atlantic_storm_dfs.append(frame)

atlantic_storms = pd.concat(atlantic_storm_dfs)
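# Move the 'id' and 'name' columns to the front. Index.append keeps the
# column order, unlike the set-union `|` operator.
atlantic_storms = atlantic_storms.reindex(
    columns=atlantic_storms.columns[-2:].append(atlantic_storms.columns[:-2])
)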

# Assign columns from the metadata.
atlantic_storms.columns = [
        "id",
        "name",
        "date",
        "hours_minutes",
        "record_identifier",
        "status_of_system",
        "latitude",
        "longitude",
        "maximum_sustained_wind_knots",
        "maximum_pressure",
        "34_kt_ne",
        "34_kt_se",
        "34_kt_sw",
        "34_kt_nw",
        "50_kt_ne",
        "50_kt_se",
        "50_kt_sw",
        "50_kt_nw",
        "64_kt_ne",
        "64_kt_se",
        "64_kt_sw",
        "64_kt_nw",
        "na"
]

# Replace sentinel values with true NAs.
del atlantic_storms['na']
atlantic_storms = atlantic_storms.replace(to_replace='-999', value=np.nan)
atlantic_storms = atlantic_storms.replace(to_replace="", value=np.nan)

# Fix date and location columns.
# Coordinates are still strings here, so a sign is added by prepending "-".
atlantic_storms['latitude'] = atlantic_storms['latitude']\
    .map(lambda lat: lat[:-1] if lat[-1] == "N" else "-" + lat[:-1])
atlantic_storms['longitude'] = atlantic_storms['longitude']\
    .map(lambda long: long[:-1] if long[-1] == "E" else "-" + long[:-1])
atlantic_storms['date'] = pd.to_datetime(atlantic_storms['date'])
atlantic_storms['date'] = atlantic_storms\
    .apply(
        lambda srs: srs['date'].replace(hour=int(srs['hours_minutes'][:2]), minute=int(srs['hours_minutes'][2:])),
        axis='columns'
    )

# Remove unused column.
del atlantic_storms['hours_minutes']

# Strip out spaces padding out names.
atlantic_storms['name'] = atlantic_storms['name'].map(lambda n: n.strip())

# Reindex.
atlantic_storms.index = range(len(atlantic_storms.index))
atlantic_storms.index.name = "index"

This script generates a history of Atlantic hurricanes in a `pandas` `DataFrame`:

In [6]:
atlantic_storms.head()

index,id,name,date,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,34_kt_ne,...,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
0,AL011851,UNNAMED,1851-06-25 00:00:00,,HU,28.0,-94.8,80,,,...,,,,,,,,,,
1,AL011851,UNNAMED,1851-06-25 06:00:00,,HU,28.0,-95.4,80,,,...,,,,,,,,,,
2,AL011851,UNNAMED,1851-06-25 12:00:00,,HU,28.0,-96.0,80,,,...,,,,,,,,,,
3,AL011851,UNNAMED,1851-06-25 18:00:00,,HU,28.1,-96.5,80,,,...,,,,,,,,,,
4,AL011851,UNNAMED,1851-06-25 21:00:00,L,HU,28.2,-96.8,80,,,...,,,,,,,,,,


We'll also save this `DataFrame` to disk:

In [7]:
local_filepath = f"{local_folder}atlantic-storms.csv"
atlantic_storms.to_csv(local_filepath)
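The parsing step in the build script can be easier to follow on a tiny input. The sketch below groups header lines (those starting with `AL`) with the data lines that follow them, and converts a hemisphere-suffixed coordinate into a signed number. The sample text is abbreviated to a few columns, and the second storm (`AL991851`, `EXAMPLE`) is made up for illustration; `group_storms` and `signed` are hypothetical helper names, not part of the script above.

```python
# Abbreviated sample in the HURDAT layout; the second storm is made up.
sample = """\
AL011851,            UNNAMED,      2,
18510625, 0000,  , HU, 28.0N,  94.8W,  80
18510625, 0600,  , HU, 28.0N,  95.4W,  80
AL991851,            EXAMPLE,      1,
18510705, 1200,  , HU, 22.2N,  97.6W,  80
"""


def group_storms(text):
    """Group each header line with the data lines that follow it."""
    storms = []
    current = None
    for line in text.splitlines():
        if line.startswith("AL"):
            # A header starts a new storm; store it right away so the
            # last storm in the file is not lost.
            current = {"header": line, "data": []}
            storms.append(current)
        elif current is not None and line.strip():
            current["data"].append(line)
    return storms


def signed(coord):
    """Convert e.g. '94.8W' to -94.8 (south and west are negative)."""
    value = float(coord[:-1])
    return value if coord[-1] in "NE" else -value


storms = group_storms(sample)
first_row = [field.strip() for field in storms[0]["data"][0].split(",")]
print(len(storms), first_row, signed(first_row[5]))
```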

## Creating packages

The core construct in T4 is the **data package**. A data package is a collection of individual files which are meaningful when considered as a whole. A data package includes raw data files, metadata describing the raw data files, and anything else you think is meaningful.

Data packages make it easy to share data assets across the team. We'll use the HURDAT dataset to demonstrate how they work.

To initialize an in-memory data package:

In [8]:
p = t4.Package()

To add a file to a package, use `set`:

In [9]:
p.set('storms/atlantic-storms.csv', local_filepath)

<t4.packages.Package at 0x10db752b0>

To capture everything in a folder, use `set_dir`:

In [10]:
p.set_dir('resources/', './')

<t4.packages.Package at 0x10db752b0>

You can point a package key at any local file or S3 key.

Packages support metadata on data nodes (directories too):

In [11]:
p.set('storms/atlantic-storms.csv', local_filepath, meta={'side':'atlantic'})

<t4.packages.Package at 0x10db752b0>

Packages mimic `dict` objects in their behavior. So to introspect a package, key into it using a path fragment:

In [12]:
p['storms']

<t4.packages.Package at 0x11f851b70>
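One way to picture this `dict`-like behavior is as a tree of nested dictionaries, where keying by a path fragment walks down the tree. The sketch below is a toy model under that assumption; `ToyPackage` and `Entry` are hypothetical names and do not reflect how T4 is implemented.

```python
from collections import namedtuple

# A leaf in the toy tree: where the data lives plus its metadata.
Entry = namedtuple("Entry", ["source", "meta"])


class ToyPackage:
    """Hypothetical stand-in for a package, backed by nested dicts."""

    def __init__(self, tree=None):
        self._tree = {} if tree is None else tree

    def set(self, logical_key, source, meta=None):
        # Split 'storms/atlantic-storms.csv' into folders and a file name.
        *folders, name = logical_key.strip("/").split("/")
        node = self._tree
        for folder in folders:
            node = node.setdefault(folder, {})
        node[name] = Entry(source, meta or {})
        return self

    def __getitem__(self, path_fragment):
        node = self._tree
        for part in path_fragment.strip("/").split("/"):
            node = node[part]
        # Directories come back as packages, files as entries.
        return ToyPackage(node) if isinstance(node, dict) else node

    def keys(self):
        return list(self._tree)


toy = ToyPackage()
toy.set("storms/atlantic-storms.csv", "./atlantic-storms.csv", meta={"side": "atlantic"})
storms_dir = toy["storms"]
entry = toy["storms/atlantic-storms.csv"]
print(storms_dir.keys(), entry.meta)
```

In this model, keying into a directory returns another package-like object, which mirrors the `Package` instance returned by `p['storms']` above.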

You can interact with directories and files inside of a package once you're at their key. For example, use `get_meta` to get the metadata:

In [13]:
p['storms/atlantic-storms.csv'].get_meta()

{'side': 'atlantic'}

Use `fetch` to download the data to a file or a directory:

In [14]:
p['storms/atlantic-storms.csv'].fetch('storms.csv')

And finally, use `deserialize` to load a piece of data directly into memory as a Python object (this only works on a subset of objects and object types right now):

In [15]:
# b = t4.Bucket(bucket_name)
# b.put('atlantic_storms.parquet', atlantic_storms)
# d = t4.Package().set('atlantic_storms', f'{bucket_name}/atlantic_storms.parquet')['atlantic_storms']\
#         .deserialize()
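Deserialization that only supports some object types is often built as a lookup from file format to loader function. The sketch below illustrates that general pattern with the standard library only; `LOADERS` and `toy_deserialize` are hypothetical names, and nothing here is taken from T4's implementation.

```python
import csv
import io
import json

# Hypothetical registry: file extension -> function that turns bytes into an object.
LOADERS = {
    ".json": lambda raw: json.loads(raw.decode("utf-8")),
    ".csv": lambda raw: list(csv.DictReader(io.StringIO(raw.decode("utf-8")))),
}


def toy_deserialize(logical_key, raw_bytes):
    """Pick a loader based on the key's extension, or fail for unknown types."""
    for extension, loader in LOADERS.items():
        if logical_key.endswith(extension):
            return loader(raw_bytes)
    raise ValueError(f"no loader registered for {logical_key!r}")


rows = toy_deserialize("storms.csv", b"id,name\nAL011851,UNNAMED\n")
settings = toy_deserialize("config.json", b'{"side": "atlantic"}')
print(rows[0]["name"], settings["side"])
```

A format with no registered loader raises an error in this sketch, which is one plausible way a "subset of object types" limitation could surface to the caller.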

## Consuming packages

So far we've seen how to create packages and how to consume resources inside of packages. Now let's look at how to consume the packages themselves.

Suppose that you've created a package and want to share it with the rest of your team. T4 makes this easy by providing you with a **catalog**. A T4 catalog sits on top of an S3 bucket and allows anyone with access to that bucket to see, push, and download packages in that bucket.

To send a package to a catalog, use `push` (note: the large number of progress bars shown during a push is a bug that we are working to fix):

In [16]:
p.push('example/package', f'{bucket_name}')

<t4.packages.Package at 0x11f8516a0>

`push` sends your package and all of its data up to the catalog. Everyone with access to that catalog can now see and download the package and its data.

Alternatively, you may wish to save a package locally (we call this the local catalog). To do so, use `build`, which is a much faster operation because it doesn't move any data:

In [17]:
p.build('example/package')

'43f3816c2ac87ef3cf943c7bc5a6be69985fcac200cc47db4f9e3b741f9dbe24'
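The value returned by `build` is a 64-character hexadecimal string, which is the length of a SHA-256 digest. The sketch below assumes, without confirmation from the T4 documentation, that such an identifier is derived from the package contents, and shows one generic way to compute a content-derived hash. `toy_manifest_hash` and the entry fields are hypothetical.

```python
import hashlib
import json


def toy_manifest_hash(entries):
    """Hash a canonical JSON form of the entries so equal content gives equal hashes."""
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up entries: logical key -> size and metadata.
entries = {
    "storms/atlantic-storms.csv": {"size": 5000, "meta": {"side": "atlantic"}},
    "resources/build.py": {"size": 300, "meta": {}},
}
# The same entries inserted in a different order.
reordered = {key: entries[key] for key in reversed(list(entries))}
# A copy with one piece of metadata changed.
changed = dict(entries, **{"resources/build.py": {"size": 300, "meta": {"note": "edited"}}})

print(toy_manifest_hash(entries))
print(toy_manifest_hash(entries) == toy_manifest_hash(reordered))
print(toy_manifest_hash(entries) == toy_manifest_hash(changed))
```

An identifier of this kind changes whenever the contents or metadata change, which makes it usable as a version label for a package.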

To see a list of packages available locally or remotely, use `list_packages`:

In [18]:
t4.list_packages()

['example/package', 'foo/bar']

In [19]:
t4.list_packages(bucket_name)

['aics/pipeline',
 'akarve/test',
 'akave/t4test',
 'ay/lmao-redux',
 'dima/tmp2',
 'eode/testing_package',
 'example/package']

To download a package and all of its data from a remote catalog, `install` it.

In [20]:
# to a temporary folder for demo purposes
p = t4.Package.install('example/package', bucket_name, dest='temp/')
p

<t4.packages.Package at 0x117164f98>

In [21]:
!rm -rf temp/

You can also choose to download just the package **manifest** without downloading the data files it references. The manifest is a simple JSON file that is independent of the actual package data, but stores pointers to and metadata about it. To load only the manifest from a local or remote catalog, use the static `browse` method, which is extremely fast because it skips the data files:

In [22]:
p = t4.Package.browse('example/package', bucket_name)
p

<t4.packages.Package at 0x117168470>
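To make the idea of a manifest concrete, here is a sketch of what a JSON document holding pointers and metadata could look like. The field names (`physical_key`, `size`, `meta`), the bucket name, and the sizes are all made up for illustration; the actual T4 manifest format may differ.

```python
import json

# Hypothetical manifest: logical key -> pointer to the data plus metadata.
toy_manifest = {
    "storms/atlantic-storms.csv": {
        "physical_key": "s3://example-bucket/atlantic-storms.csv",  # made-up location
        "size": 5000,  # made-up size in bytes
        "meta": {"side": "atlantic"},
    },
}

# The manifest round-trips through JSON without touching the data it points to.
serialized = json.dumps(toy_manifest, indent=2)
restored = json.loads(serialized)
print(serialized)
```

The manifest stays small no matter how large the referenced files are, which is why loading it alone is quick.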

`browse` is particularly beneficial when you are working with large packages that you only need parts of at a time, and when working with packages containing many subpackages. In those cases you can `browse`, then `fetch` to get the data of interest:

In [23]:
p = t4.Package.browse('example/package', bucket_name)
p['resources'].fetch('temp/')

In [24]:
!rm -rf temp/
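The browse-then-fetch pattern amounts to reading the list of entries first and then transferring only the ones under a chosen key. The sketch below shows that selection step on a made-up manifest; `select_under`, the keys, and the sizes are hypothetical and only illustrate the idea.

```python
def select_under(manifest, prefix):
    """Return only the manifest entries whose logical key sits under prefix."""
    normalized = prefix.rstrip("/") + "/"
    return {key: entry for key, entry in manifest.items() if key.startswith(normalized)}


# Made-up manifest: logical key -> size in bytes.
toy_manifest = {
    "storms/atlantic-storms.csv": {"size": 5000},
    "resources/build.py": {"size": 300},
    "resources/README.md": {"size": 200},
}

wanted = select_under(toy_manifest, "resources")
bytes_to_fetch = sum(entry["size"] for entry in wanted.values())
print(sorted(wanted), bytes_to_fetch)
```

Only the selected entries would need to be downloaded, so the cost scales with the part of the package you ask for rather than with the whole package.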

## Buckets

Coming soon!

## Addendum: clean up

In [25]:
local_data_copy = local_folder + "atlantic-storms.csv"
other_local_data_copy = local_folder + "storms.csv"

In [26]:
!rm $local_data_copy
!rm $other_local_data_copy