This notebook is a quick introduction to the Quilt 3 (formerly Quilt T4) Python API.

[You can run this notebook yourself using Binder](https://mybinder.org/v2/gh/quiltdata/hurdat/master?filepath=notebooks%2FQuickStart.ipynb). Alternatively you may [clone the git repo](https://github.com/quiltdata/hurdat).

## Installation

To get started, you will first need to [install the `quilt3` Python client](https://docs.quiltdata.com/installation#python-client). This is as easy as `pip install quilt3`.

If you're following along interactively, make sure you also have push access to an S3 bucket.

Then import `quilt3` into your environment:

In [1]:
import quilt3

In [None]:
!mkdir ../data/

## Data

We'll also need some data. For the purposes of this demo, I wrote a small script that builds a clean copy of an NOAA hurricane dataset known as [HURDAT](https://www.nhc.noaa.gov/data/). You can see the code for yourself by uncommenting the following code cell:

In [3]:
# %load ../scripts/build.py

If you are following along with the code, you can re-run this cell to generate this dataset yourself.

This script generates a history of Atlantic hurricanes in a `pandas` `DataFrame`:

In [4]:
import pandas as pd
atlantic_storms = pd.read_csv("../data/atlantic-storms.csv")
atlantic_storms.head()

Unnamed: 0,id,name,date,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,34_kt_ne,...,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
0,AL011851,UNNAMED,1851-06-25 00:00:00,,HU,28.0,-94.8,80,,,...,,,,,,,,,,
1,AL011851,UNNAMED,1851-06-25 06:00:00,,HU,28.0,-95.4,80,,,...,,,,,,,,,,
2,AL011851,UNNAMED,1851-06-25 12:00:00,,HU,28.0,-96.0,80,,,...,,,,,,,,,,
3,AL011851,UNNAMED,1851-06-25 18:00:00,,HU,28.1,-96.5,80,,,...,,,,,,,,,,
4,AL011851,UNNAMED,1851-06-25 21:00:00,L,HU,28.2,-96.8,80,,,...,,,,,,,,,,


## Data packages

The core construct in Quilt is the **data package**. A data package is a collection of individual files which are meaningful when considered as a whole. A data package includes raw data files, metadata describing the raw data files, and anything else you think is meaningful.

Data packages make it easy to share data assets across the team. We'll use the small HURDAT dataset we just built to demonstrate how they work.

To initialize an in-memory data package:

In [5]:
hurdat = quilt3.Package()

To add a file to a package, use `set`:

In [6]:
hurdat.set('data/atlantic-storms.csv', '../data/atlantic-storms.csv')
hurdat

(local Package)
 └─data/
   └─atlantic-storms.csv

To capture everything in a folder, use `set_dir`:

In [7]:
hurdat = quilt3.Package().set_dir('/', '../')
hurdat

(local Package)
 └─.gitignore
 └─.quiltignore
 └─data/
   └─atlantic-storms.csv
 └─notebooks/
   └─QuickStart.ipynb
 └─quilt_summarize.json
 └─requirements.txt
 └─scripts/
   └─build.py

For neatness, we recommend sorting the resources in your package into different directories by type, e.g. by using the [cookiecutter data science pattern](https://github.com/drivendata/cookiecutter-data-science).

In [8]:
hurdat = (quilt3.Package()
          .set('data/atlantic-storms.csv', '../data/atlantic-storms.csv')
          .set('scripts/build.py', '../scripts/build.py')
          # the following set operation may fail if you are on Binder; if it does, comment it out
          .set('notebooks/QuickStart.ipynb', '../notebooks/QuickStart.ipynb')
          .set('quilt_summarize.json', '../quilt_summarize.json')
         )
hurdat

(local Package)
 └─data/
   └─atlantic-storms.csv
 └─notebooks/
   └─QuickStart.ipynb
 └─quilt_summarize.json
 └─scripts/
   └─build.py

Packages, package directories, and package entries all support metadata. You can attach metadata by passing a `meta` parameter to `set` or `set_dir`, or by using the dedicated `set_meta` method.

In [9]:
# to set metadata on a package entry
hurdat = hurdat.set('data/atlantic-storms.csv', '../data/atlantic-storms.csv',
                    meta={'source': 'NOAA', 'homepage': 'https://www.nhc.noaa.gov/data/'})

# to set metadata on a package
hurdat = hurdat.set_meta({'author': 'aleksey@', 'resource-type': 'demo'})

hurdat

(local Package)
 └─data/
   └─atlantic-storms.csv
 └─notebooks/
   └─QuickStart.ipynb
 └─quilt_summarize.json
 └─scripts/
   └─build.py
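
Since metadata is passed as a plain dictionary, it can be computed rather than typed by hand. The sketch below shows one way to do that using only the standard library. `describe_csv` is a hypothetical helper (it is not part of `quilt3`), and the keys it produces are arbitrary choices made for illustration.

```python
import csv
import os


def describe_csv(path):
    """Return a small metadata dict for a CSV file (hypothetical helper)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])        # first row is assumed to be the header
        n_rows = sum(1 for _ in reader)  # remaining rows are data records
    return {
        "size_bytes": os.path.getsize(path),
        "n_columns": len(header),
        "n_rows": n_rows,
    }
```

The result could then be merged with hand-written fields and passed along as `meta`, for example `meta={'source': 'NOAA', **describe_csv('../data/atlantic-storms.csv')}`.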

## Publishing packages

Data is no use if it's hanging around on your machine. If you're building a data package, you're probably doing it because you want to share that data with the rest of your team!

The `push` command lets you take a package that you have locally and push it to your team's catalog. A Quilt 3 **catalog** sits on top of an S3 bucket and gives you features useful for data scientists in a web interface. If you're looking at this file on [open.quiltdata.com](https://open.quiltdata.com/), you're browsing a catalog right now!

**Note**: the following line of code will only work if you have push access to our demo catalog. You can replace `s3://quilt-example` with any bucket you have access to.

In [10]:
hurdat.push('examples/hurdat', 's3://quilt-example', message="Updated example")

Hashing: 100%|██████████| 3.62M/3.62M [00:00<00:00, 123MB/s]
Copying: 100%|██████████| 3.62M/3.62M [00:01<00:00, 3.21MB/s]


(remote Package)
 └─data/
   └─atlantic-storms.csv
 └─notebooks/
   └─QuickStart.ipynb
 └─quilt_summarize.json
 └─scripts/
   └─build.py

Other users will now be able to view the packages (and package versions) available on a catalog using `quilt3.list_packages`.

In [11]:
list(quilt3.list_packages('s3://quilt-example'))

['akarve/amazon-reviews',
 'akarve/cbre',
 'akarve/gpt-2-output-dataset',
 'akarve/heterogeneous',
 'akarve/many-revisions',
 'akarve/pytorch-intro',
 'akarve/reinforcement-learning',
 'akarve/s3-funk',
 'akarve/sample_jupyter_notebooks',
 'akarve/xgboost_abalone',
 'aleksey/fashion_mnist',
 'aleksey/file_previews',
 'aleksey/hurdat',
 'aleksey/yellowbrick_x_keras',
 'dima/q',
 'examples/hurdat',
 'quilt/altair',
 'quilt/hurdat',
 'quilt/open_fruit',
 'quilt/open_images',
 'robnewman/honey_bees',
 'robnewman/us_county_smoking_vs_poverty']
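
The names in this listing follow a `namespace/name` pattern. As a small illustration of working with them, the sketch below groups such names by namespace. `group_by_namespace` is a hypothetical helper that operates on plain strings; the sample names are copied from the listing above.

```python
from collections import defaultdict


def group_by_namespace(package_names):
    """Group 'namespace/name' strings into {namespace: [name, ...]}."""
    groups = defaultdict(list)
    for full_name in package_names:
        # Split on the first slash only: left part is the namespace
        namespace, _, name = full_name.partition("/")
        groups[namespace].append(name)
    return dict(groups)


# A few names copied from the listing above
names = ["akarve/cbre", "aleksey/hurdat", "examples/hurdat",
         "quilt/altair", "quilt/hurdat"]
print(group_by_namespace(names))
```

In practice you would presumably pass it the result of the `list_packages` call shown above instead of a hard-coded list.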

## Installing packages

Use `quilt3.Package.install` to download the latest version of a package from a catalog.

In [12]:
hurdat = quilt3.Package.install('examples/hurdat', 's3://quilt-example')
hurdat

Copying: 100%|██████████| 3.62M/3.62M [00:01<00:00, 3.00MB/s]


(local Package)
 └─data/
   └─atlantic-storms.csv
 └─notebooks/
   └─QuickStart.ipynb
 └─quilt_summarize.json
 └─scripts/
   └─build.py

To download a *specific* version of a package, provide the corresponding `top_hash`.

To specify a target directory for a package, provide a `dest`.

In [13]:
# to install a specific version of this package to a local directory
# (no registry is passed, so this relies on a configured default remote registry)
quilt3.Package.install(
    'quilt/hurdat',
    top_hash='d3541062d8f303644bfdc0052f515d25e74ef79fadbfbf02ec0ab9215af0891c',
    dest='/a/local/path'
)

Copying: 100%|██████████| 3.62M/3.62M [00:01<00:00, 3.55MB/s]


(local Package)
 └─data/
   └─atlantic-storms.csv
 └─notebooks/
   └─QuickStart.ipynb
 └─quilt_summarize.json
 └─scripts/
   └─build.py
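
The `top_hash` used above is a 64-character hexadecimal string, the same length as a SHA-256 hex digest. To illustrate why a content hash makes a good version identifier (identical contents give an identical hash, and any change gives a different one), here is a toy sketch. It is a model built on assumptions, not a description of how Quilt actually computes `top_hash`; `toy_top_hash` and the sample entries are hypothetical.

```python
import hashlib
import json


def toy_top_hash(entries, meta=None):
    """Hash a {logical_key: file_bytes} mapping plus package metadata.

    Toy model only: this is not Quilt's manifest format or hashing scheme.
    """
    digest = hashlib.sha256()
    digest.update(json.dumps(meta or {}, sort_keys=True).encode("utf-8"))
    for key in sorted(entries):  # sort so insertion order does not matter
        file_hash = hashlib.sha256(entries[key]).hexdigest()
        digest.update(f"{key}:{file_hash}\n".encode("utf-8"))
    return digest.hexdigest()


v1 = toy_top_hash({"data/a.csv": b"id,name\n1,ALPHA\n"})
v2 = toy_top_hash({"data/a.csv": b"id,name\n1,BETA\n"})
print(v1 == v2)  # False: changing one file changes the version hash
print(len(v1))   # 64
```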

Once you have a package localized, you can load it directly into memory in a Python program using an `import` statement:

In [15]:
from quilt3.data.examples import hurdat
hurdat

(local Package)
 └─data/
   └─atlantic-storms.csv
 └─notebooks/
   └─QuickStart.ipynb
 └─quilt_summarize.json
 └─scripts/
   └─build.py

To see the list of packages you have installed, run `quilt3.list_packages` without any parameters.

In [19]:
list(quilt3.list_packages())

['quilt/Empty',
 'quilt/altair',
 'quilt/hurdat',
 'quilt/Package',
 'examples/hurdat',
 'census/tracts_cartographic',
 'akarve/example',
 'akarve/cbre',
 'akarve/test3',
 'aleksey/hurdat']

## Consuming packages

Once you have the package localized, you can consume it. Packages behave like `dict` objects, so to inspect a package, key into it.

In [20]:
hurdat['data']

(local Package)
 └─atlantic-storms.csv

In [21]:
hurdat['data']['atlantic-storms.csv']

PackageEntry('file:///Users/karve/Desktop/outboy/data/atlantic-storms.csv')

Packages and parts of packages support a variety of operations. The most important ones are `fetch`, which will copy (or download) a file or directory to your local disk, and `deserialize`, which will read the file into memory.

In [22]:
# pass a parameter to copy/download to a specific location
hurdat['data']['atlantic-storms.csv'].fetch()

Copying: 100%|██████████| 3.58M/3.58M [00:00<00:00, 350MB/s]


PackageEntry('file:///Users/karve/Desktop/tmp-docs-fix/notebooks/atlantic-storms.csv')

In [23]:
hurdat['data']['atlantic-storms.csv'].deserialize()\
    .head()

Unnamed: 0,id,name,date,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,34_kt_ne,...,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
0,AL011851,UNNAMED,1851-06-25 00:00:00,,HU,28.0,-94.8,80,,,...,,,,,,,,,,
1,AL011851,UNNAMED,1851-06-25 06:00:00,,HU,28.0,-95.4,80,,,...,,,,,,,,,,
2,AL011851,UNNAMED,1851-06-25 12:00:00,,HU,28.0,-96.0,80,,,...,,,,,,,,,,
3,AL011851,UNNAMED,1851-06-25 18:00:00,,HU,28.1,-96.5,80,,,...,,,,,,,,,,
4,AL011851,UNNAMED,1851-06-25 21:00:00,L,HU,28.2,-96.8,80,,,...,,,,,,,,,,


`get()` and the `meta` attribute are also worth keeping in mind.

In [27]:
print(hurdat['data']['atlantic-storms.csv'].get())
print(hurdat['data']['atlantic-storms.csv'].meta)

file:///Users/karve/Desktop/outboy/data/atlantic-storms.csv
{'source': 'NOAA', 'homepage': 'https://www.nhc.noaa.gov/data/'}


## Browsing packages

`quilt3.Package.install` will download the entire contents of a package to your local machine. However, there are many cases when you do not actually want to download all of the data in the package. For example, the package may be very large, and you only want to work with a small part of it. Or perhaps you do not need the data at all; you just want to work with the metadata.

We support this workflow using the `quilt3.Package.browse` command.

In [28]:
quilt3.Package.browse('aleksey/hurdat', 's3://quilt-example')

(remote Package)
 └─.gitignore
 └─.quiltignore
 └─notebooks/
   └─QuickStart.ipynb
 └─quilt_summarize.json
 └─requirements.txt
 └─scripts/
   └─build.py
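
The selective workflow described above can be sketched as a small helper: browse the package, key into the one entry you need, and fetch only that. This is a hedged sketch, assuming `fetch` accepts a destination path as its first argument (as the earlier comment about passing a parameter suggests). `fetch_entry` and its `browse` parameter are hypothetical; the parameter exists only so the example can be exercised without network access.

```python
def fetch_entry(name, registry, logical_key, dest, browse=None):
    """Browse a package and fetch one entry without installing everything.

    `browse` defaults to quilt3.Package.browse; accepting it as a parameter
    is a choice of this sketch, not a quilt3 convention.
    """
    if browse is None:
        import quilt3  # imported lazily so the sketch loads without quilt3
        browse = quilt3.Package.browse

    node = browse(name, registry)
    # Key into the package one path component at a time, as shown earlier
    for part in logical_key.split("/"):
        node = node[part]
    return node.fetch(dest)
```

For example, `fetch_entry('aleksey/hurdat', 's3://quilt-example', 'scripts/build.py', './build.py')` should download only the build script, provided you have read access to that bucket.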

## Helpful tips

Here are some helpful tips for getting the most out of the Quilt 3 API.

* You can omit the `s3://` path argument in `browse` and `install` if you configure a default catalog. This saves on typing:

    ```python
    quilt3.config(default_remote_registry='s3://quilt-example')
    # these now "just work"
    quilt3.Package.browse('examples/hurdat')
    quilt3.Package.install('examples/hurdat')
    ```
    
    
* If you create a `quilt_summarize.json` file with a list of files at the top of your package, visitors to that package's landing page will be served previews of those files. Mixing data and metadata in this way is a great way of performing "literate data science"!


* `set_dir` will slurp up every file in the directory. But junk files are inevitable (looking at you, `.ipynb_checkpoints/`). You can configure which files are and aren't included by `set_dir` by creating a file named `.quiltignore`, which has the same syntax, and effect, as the familiar `.gitignore`.
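
The last two tips both come down to placing a small text file next to your data. Here is a minimal sketch, assuming `quilt_summarize.json` holds a plain JSON list of file paths (as described above) and `.quiltignore` holds one `.gitignore`-style pattern per line. `write_package_helpers` is a hypothetical helper, not part of `quilt3`, and the file and pattern choices are illustrative.

```python
import json
import os
import tempfile


def write_package_helpers(root, preview_files, ignore_patterns):
    """Write quilt_summarize.json and .quiltignore into the directory `root`."""
    summarize_path = os.path.join(root, "quilt_summarize.json")
    with open(summarize_path, "w") as f:
        json.dump(list(preview_files), f, indent=2)  # a JSON list of files

    ignore_path = os.path.join(root, ".quiltignore")
    with open(ignore_path, "w") as f:
        f.write("\n".join(ignore_patterns) + "\n")  # one pattern per line

    return summarize_path, ignore_path


# Demonstrated in a throwaway directory so nothing in your project is touched
with tempfile.TemporaryDirectory() as root:
    summarize_path, ignore_path = write_package_helpers(
        root,
        preview_files=["notebooks/QuickStart.ipynb"],
        ignore_patterns=[".ipynb_checkpoints/", "*.pyc"],
    )
    with open(summarize_path) as f:
        print(f.read())
    with open(ignore_path) as f:
        print(f.read())
```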

## Conclusion

That concludes this short demo!

Another great resource for getting started with the API is our [official documentation](https://docs.quiltdata.com/).

For help getting started with our web interface, the Quilt 3 Catalog, [check out our demo catalog](http://open.quiltdata.com/), and also take a look at the [corresponding section of our docs](https://docs.quiltdata.com/walkthrough/working-with-the-catalog).