## Installation
[Install the T4 Python client](https://github.com/quiltdata/t4/blob/master/UserDocs.md), `helium`.

## Intro
This notebook offers a five-minute tour of the T4 Python API, codename `helium`.

In [1]:
import helium as he



T4 lets you read data from and write data to S3. Every file in T4 is searchable, versioned, and secured according to your S3 policies.


![](./helium-api.png)


To start off, we'll need some data. Here's a script we've built that downloads and cleans up a NOAA hurricane dataset known as HURDAT. It is typical of the clean-up scripts you'd run when doing data science:

In [None]:
%load hurdat/build.py


This script generates a history of Atlantic hurricanes in a `pandas` `DataFrame`:

In [6]:
atlantic_storms.head()

| index | id | name | date | record_identifier | status_of_system | latitude | longitude | maximum_sustained_wind_knots | maximum_pressure | 34_kt_ne | ... | 34_kt_sw | 34_kt_nw | 50_kt_ne | 50_kt_se | 50_kt_sw | 50_kt_nw | 64_kt_ne | 64_kt_se | 64_kt_sw | 64_kt_nw |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AL011851 | UNNAMED | 1851-06-25 00:00:00 |  | HU | 28.0 | -94.8 | 80 |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | AL011851 | UNNAMED | 1851-06-25 06:00:00 |  | HU | 28.0 | -95.4 | 80 |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | AL011851 | UNNAMED | 1851-06-25 12:00:00 |  | HU | 28.0 | -96.0 | 80 |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | AL011851 | UNNAMED | 1851-06-25 18:00:00 |  | HU | 28.1 | -96.5 | 80 |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | AL011851 | UNNAMED | 1851-06-25 21:00:00 | L | HU | 28.2 | -96.8 | 80 |  |  | ... |  |  |  |  |  |  |  |  |  |  |


## Read and write objects

`helium` lets you write Python objects to S3 with `put()` and read them back with `get()`. `put()` accepts an optional `meta=` keyword, which you can use to annotate objects. T4 indexes all metadata so that you can find specific objects or files with `search()`.

In the example below we are working with an S3 bucket called `alpha-quilt-storage`. To write to your own bucket, substitute its name in the paths below.

In [12]:
he.put(atlantic_storms, "alpha-quilt-storage/~aleksey/hurdat/atlantic-storms-data.parquet",
       meta={'source': 'https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2017-050118.txt', 
             'ocean': 'atlantic'})

You can retrieve the object (along with its metadata) using `get`:

In [13]:
atlantic_storms, meta = he.get("alpha-quilt-storage/~aleksey/hurdat/atlantic-storms-data.parquet")

In [14]:
meta

{'ocean': 'atlantic',
 'source': 'https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2017-050118.txt'}

`put` transparently chooses file formats for common data structures. In the example above, that meant writing a `pandas.DataFrame` as a `.parquet` file.
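As a rough illustration of what choosing a format by type can look like, here is a minimal sketch using only the standard library. The `EXTENSION_BY_TYPE` table and the `choose_extension` and `serialize` helpers are hypothetical names made up for this example; they are not part of `helium`, and its real dispatch logic may differ.

```python
import json

# Hypothetical mapping from a type name to a file extension.
# This is an assumption for illustration, not helium's actual table.
EXTENSION_BY_TYPE = {
    "DataFrame": ".parquet",
    "dict": ".json",
    "list": ".json",
    "str": ".txt",
    "bytes": ".bin",
}


def choose_extension(obj):
    """Return the file extension associated with obj's type name."""
    type_name = type(obj).__name__
    if type_name not in EXTENSION_BY_TYPE:
        raise TypeError(f"no default format for {type_name}")
    return EXTENSION_BY_TYPE[type_name]


def serialize(obj):
    """Return (extension, bytes) for the types the standard library can handle."""
    ext = choose_extension(obj)
    if ext == ".json":
        return ext, json.dumps(obj).encode("utf-8")
    if ext == ".txt":
        return ext, obj.encode("utf-8")
    if ext == ".bin":
        return ext, obj
    # A DataFrame would need pandas plus a parquet engine, which this sketch omits.
    raise NotImplementedError(f"serializing to {ext} is outside this sketch")


ext, payload = serialize({"description": "A simple JSON file"})
```

The caller never names a format here; the type of the object decides it, which is the behavior the paragraph above describes.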

To move files to S3, use `put_file`:

In [15]:
fn = "~/Desktop/atlantic-storms.csv"
atlantic_storms.to_csv(fn)

In [16]:
%ls ~/Desktop | grep 'atlantic'

atlantic-storms.csv


In [10]:
he.put_file("/Users/alex/Desktop/atlantic-storms.csv", "alpha-quilt-storage/~aleksey/hurdat/atlantic-storms-data.csv")





## Object versions

It is recommended that you use T4 on an S3 bucket with [object versioning](https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectVersioning.html) enabled.

Every time you write to a versioned S3 bucket, including with `he.put*`, a new *object version* is born. With an object version, you can reconstruct the contents of an object at any point in time.
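To make the idea concrete, here is a toy in-memory model of a versioned bucket. It is only a sketch: `VersionedStore` is a hypothetical class written for this example, it does not talk to S3, and it does not mirror the `helium` API beyond the idea of an optional `version` argument.

```python
import uuid


class VersionedStore:
    """Toy in-memory stand-in for a versioned bucket (not the S3 or helium API)."""

    def __init__(self):
        # key -> list of (version_id, data) pairs, oldest first
        self._versions = {}

    def put(self, key, data):
        """Append a new version instead of overwriting, and return its ID."""
        version_id = uuid.uuid4().hex
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def get(self, key, version=None):
        """Return the latest data, or the data stored under an exact version ID."""
        history = self._versions[key]
        if version is None:
            return history[-1][1]
        for version_id, data in history:
            if version_id == version:
                return data
        raise KeyError(f"{key} has no version {version}")


store = VersionedStore()
first = store.put("hurdat/notes.txt", "first draft")
second = store.put("hurdat/notes.txt", "second draft")

latest = store.get("hurdat/notes.txt")
original = store.get("hurdat/notes.txt", version=first)
```

Because each write appends rather than overwrites, the first draft stays retrievable after the second write, as long as you hold on to its version ID.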

You can list object versions with the `ls` command. For example, here are the first three versions of some files in our HURDAT project:

In [17]:
he.ls("alpha-quilt-storage/~aleksey/hurdat")[1][:3]

[{'ETag': '"7d9faecef6a675b04246fda5d2747a7f"',
  'IsLatest': False,
  'Key': '~aleksey/hurdat/',
  'LastModified': datetime.datetime(2018, 10, 4, 21, 13, 14, tzinfo=tzutc()),
  'Owner': {'DisplayName': 'kmoore',
   'ID': '1e740c9f01d3eb40d580b51a943de9c75ba2af0c2f75e1ac7b021cd7afd1872a'},
  'Size': 40,
  'StorageClass': 'STANDARD',
  'VersionId': 'jwSyCWiv_zL5Lg.sOyN1RMMQCnGzk.0O'},
 {'ETag': '"7d9faecef6a675b04246fda5d2747a7f"',
  'IsLatest': False,
  'Key': '~aleksey/hurdat/',
  'LastModified': datetime.datetime(2018, 10, 4, 21, 11, 41, tzinfo=tzutc()),
  'Owner': {'DisplayName': 'kmoore',
   'ID': '1e740c9f01d3eb40d580b51a943de9c75ba2af0c2f75e1ac7b021cd7afd1872a'},
  'Size': 40,
  'StorageClass': 'STANDARD',
  'VersionId': 'HvmCd4AGwG4Og3mwGxQMfPDWiZhmtII3'},
 {'ETag': '"7d9faecef6a675b04246fda5d2747a7f"',
  'IsLatest': False,
  'Key': '~aleksey/hurdat/',
  'LastModified': datetime.datetime(2018, 10, 4, 21, 9, 53, tzinfo=tzutc()),
  'Owner': {'DisplayName': 'kmoore',
   'ID': '1e74

In the future, T4 will offer other ways of accessing version information more directly.

To grab a specific object version, use the optional `version=` keyword to `get()` or `get_file()`:
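The version records in the listing above are plain dictionaries, so ordinary Python is enough to pull out the version IDs for one key. The sketch below assumes records shaped like that output; `versions_of` is a hypothetical helper, not a `helium` function, and the two sample records are copied from the listing with only the fields used here.

```python
from datetime import datetime, timezone

# Two records copied from the listing above, reduced to the fields used here.
records = [
    {
        "Key": "~aleksey/hurdat/",
        "IsLatest": False,
        "LastModified": datetime(2018, 10, 4, 21, 11, 41, tzinfo=timezone.utc),
        "VersionId": "HvmCd4AGwG4Og3mwGxQMfPDWiZhmtII3",
    },
    {
        "Key": "~aleksey/hurdat/",
        "IsLatest": False,
        "LastModified": datetime(2018, 10, 4, 21, 13, 14, tzinfo=timezone.utc),
        "VersionId": "jwSyCWiv_zL5Lg.sOyN1RMMQCnGzk.0O",
    },
]


def versions_of(records, key):
    """Return (VersionId, LastModified) pairs for key, newest first."""
    matching = [r for r in records if r["Key"] == key]
    matching.sort(key=lambda r: r["LastModified"], reverse=True)
    return [(r["VersionId"], r["LastModified"]) for r in matching]


history = versions_of(records, "~aleksey/hurdat/")
```

Sorting by `LastModified` makes the order explicit instead of relying on the order in which the listing happens to return records.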

In [17]:
data, meta = he.get("alpha-quilt-storage/~aleksey/hurdat/atlantic-storms.parquet", 
                    version="mP4USSZF2mJSaKNvr7EjUldDQm3Sqb_b")

> You'll need to provide the full object version ID for this to work.

## Snapshot folders in S3

<!-- In the future this section should treat versions, not snapshots. -->

A T4 **snapshot** is an immutable picture of one or more objects in S3 at a specific moment in time. Whereas object versions are for single objects, snapshots are for one or more objects.

The `path` you pass to `snapshot` means "seal everything underneath this key in S3".

In [5]:
he.snapshot("alpha-quilt-storage/~aleksey/hurdat/", message="Third cut at cleaning up HURDAT")

'724cde9ad4688727ce886b5ece405103c3cb152d7ac076c88d2bf2cd254a1e66'
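One way to picture a snapshot is as a frozen manifest that maps each key under the path to a specific object version, with an identifier derived from the manifest's contents. The sketch below illustrates that idea only. `make_snapshot`, the manifest layout, the use of SHA-256, and the sample keys and version IDs are all assumptions made for this example; the source does not say how T4 builds its snapshot hashes.

```python
import hashlib
import json


def make_snapshot(objects, prefix, message):
    """Freeze the key -> version mapping under prefix and hash the result.

    objects maps S3-style keys to version IDs. Only keys that start with
    prefix are included, mirroring "everything underneath this key".
    """
    contents = {key: version for key, version in objects.items() if key.startswith(prefix)}
    manifest = {"path": prefix, "message": message, "contents": contents}
    # sort_keys makes the serialization, and therefore the hash, deterministic.
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest(), manifest


# Made-up keys and version IDs, for illustration only.
objects = {
    "~aleksey/hurdat/atlantic-storms-data.parquet": "v1",
    "~aleksey/hurdat/atlantic-storms-data.csv": "v2",
    "other/unrelated.csv": "v3",
}

snapshot_id, manifest = make_snapshot(
    objects, "~aleksey/hurdat/", "Third cut at cleaning up HURDAT"
)
```

In this sketch the identifier depends on the manifest contents, so changing any object version under the path produces a different identifier. That property is what lets an identifier stand in for an exact set of object versions.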

You can list snapshots of an S3 key using `list_snapshots`:

In [5]:
he.list_snapshots("alpha-quilt-storage/~aleksey/hurdat/")

| path | hash | timestamp | message |
| --- | --- | --- | --- |
| ~aleksey/hurdat/ | 724cde9ad4688727ce886b5ece405103c3cb152d7ac076... | 2018-10-09 23:09:19+00:00 | Third cut at cleaning up HURDAT |
|  | ad9f3e3d938da7fbc5624245fbcb72f5bc25c2dfe4f9af... | 2018-10-08 22:26:29+00:00 | foo2 |
| ~aleksey/ | 5460a76611597d3cf53ea4b0acb8d9695261523ac04de5... | 2018-10-08 22:26:16+00:00 | foo2 |
| ~aleksey/ | 9aa46097e10cb7b22a6667ac4bb7b6329b2411240936aa... | 2018-10-08 22:01:13+00:00 | foo |
|  | 435d7b954fe6dbd35cf51b311971fc49643d24af9f6f69... | 2018-10-08 21:21:01+00:00 | foo |
| ~aleksey/hurdat/ | 7b1e211f91ac3242748c1423525f7d6e846914c055a6c5... | 2018-10-05 00:36:20+00:00 | Temporary message. |


In [17]:
he.put({"description": "A simple JSON file"}, "alpha-quilt-storage/~aleksey/hurdat/simple.json")

You can diff overlapping snapshots to see what's changed. In this case `"latest"` represents what is currently in S3.

In [7]:
he.diff("alpha-quilt-storage", "724cde9ad46", "latest")

| status | Key | ETag |
| --- | --- | --- |
| Added | ~aleksey/hurdat/simple.json | "725f0cda0939ef902a1cb9bcb89923cd" |
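Conceptually, a diff like the one above compares two mappings from key to ETag. Here is a minimal sketch of that comparison. `diff_snapshots` is a hypothetical helper, not the implementation behind `he.diff`; only the `Added` status appears in the output above, so the `Deleted` and `Modified` labels are assumptions, as is the placeholder ETag for the parquet file.

```python
def diff_snapshots(old, new):
    """Compare two key -> ETag mappings and return (status, key) pairs sorted by key."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append(("Added", key))
        elif key not in new:
            changes.append(("Deleted", key))  # assumed label
        elif old[key] != new[key]:
            changes.append(("Modified", key))  # assumed label
    return changes


# State at snapshot time; the ETag here is a placeholder, not a real value.
snapshot_state = {
    "~aleksey/hurdat/atlantic-storms-data.parquet": "placeholder-etag",
}

# Current state: the same file plus the newly written JSON object.
latest_state = dict(snapshot_state)
latest_state["~aleksey/hurdat/simple.json"] = "725f0cda0939ef902a1cb9bcb89923cd"

changes = diff_snapshots(snapshot_state, latest_state)
```

Unchanged keys produce no row, which matches the output above: only the newly written `simple.json` is reported.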


Snapshots can be used to version anything with an S3 key, but are at their most useful when versioning **data packages**: groups of files which together represent the data component of a specific project you are working on.

You can think of a data project as having three components: code, environment, and data. Versioning code is obvious: just use `git`. Similarly, sophisticated tools exist for versioning environments: `conda` and Docker, for example.

But what about your data? Data can balloon to many terabytes in size, becoming too large for `git` or Docker to manage. At the same time, in data science, small changes in data can often have a disproportionate impact on your analysis and throw off your models. In a [seminal paper](https://ai.google/research/pubs/pub43146) on data systems, Google referred to this as the CACE principle: "Changing Anything Changes Everything".

Clearly, data needs its own native versioning tool. T4 snapshots provide just that!

To demonstrate, let's start by cloning a simple project using our storms data.

In [None]:
!cd ~/Desktop; git clone https://github.com/ResidentMario/hurdat-example-repo

This project contains an `environment.yml` file defining our code environment, a `notebooks` folder containing some Jupyter notebooks, and a `data` folder containing inputs and outputs.

Our objective: smartly manage our `data`. With T4 snapshots, this is easy:

In [None]:
# Note: replace this path with one that works on your local machine.
he.put_file("/Users/alex/Desktop/hurdat-example-repo/data/", 
            "alpha-quilt-storage/aleksey/hurdat-example-repo/data/")

In [None]:
he.snapshot("alpha-quilt-storage/aleksey/hurdat-example-repo/data/", message="Snap.")

In [None]:
he.list_snapshots("alpha-quilt-storage/aleksey/hurdat-example-repo/data/")

Now whenever we want to grab a file from a particular snapshot of this particular data project, we need only pass its hash to the `snapshot` parameter of `get_file`:

In [None]:
# Note: replace this path with one that works on your local machine.
he.get_file("alpha-quilt-storage/aleksey/hurdat-example-repo/data/atlantic.csv", 
            "/Users/alex/Desktop/hurdat-example-repo/data/atlantic.csv",
            snapshot="cb06134062b8b8")

Check this hash into your `README.md` and enjoy your newfound project reproducibility!
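The calls above pass abbreviated hashes such as `"724cde9ad46"` rather than the full 64-character value. The sketch below shows one plausible way to resolve such a prefix against a list of known hashes, similar to how `git` treats short commit IDs. `resolve_hash` is a hypothetical helper and this is an assumption about the behavior, not a description of how T4 resolves hashes; the second hash in the list is made up.

```python
def resolve_hash(prefix, known_hashes):
    """Return the single full hash that starts with prefix.

    Raises KeyError if nothing matches and ValueError if the prefix is ambiguous.
    """
    matches = [h for h in known_hashes if h.startswith(prefix)]
    if not matches:
        raise KeyError(f"no snapshot hash starts with {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"prefix {prefix!r} is ambiguous ({len(matches)} matches)")
    return matches[0]


known_hashes = [
    # Full hash returned by the snapshot call earlier in this notebook.
    "724cde9ad4688727ce886b5ece405103c3cb152d7ac076c88d2bf2cd254a1e66",
    # Made-up hash, included only so the list has more than one entry.
    "72" + "0" * 62,
]

full_hash = resolve_hash("724cde9ad46", known_hashes)
```

Recording the full hash in your `README.md` avoids any ambiguity if a later snapshot happens to share the same prefix.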

In summary, every data science product (be it an analysis, a model, or exposition) relies on a collection of data file **versions**, which a data scientist can logically organize into one or more **snapshots**. These snapshots are **immutable**, and, in conjunction with version control on the project code and the project environment, enable reproducible, distributable data science.

## Addendum: clean up

In [26]:
# Clean up
!rm -rf ~/Desktop/hurdat-example-repo
!rm ~/Desktop/atlantic-storms.csv