# Acquire data
Only data fragments that the user does not have on disk are sent over the network. All data fragments are deduplicated at the data frame level.

Users/machines can also read data from a centrally mounted file store to further reduce copying, duplication, and network traffic.

User clients talk directly to blob storage, which is much faster than retrieving data from a database.
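The idea of sending only missing, deduplicated fragments can be sketched with content hashes. This is a minimal illustration, not Quilt's implementation: the use of SHA-256, the in-memory `local_store` dictionary, and the helper names `fragment_hash` and `fragments_to_fetch` are all assumptions made for the example.

```python
import hashlib


def fragment_hash(data):
    """Identify a fragment by the SHA-256 hex digest of its bytes (assumed scheme)."""
    return hashlib.sha256(data).hexdigest()


def fragments_to_fetch(manifest, local_store):
    """Return the manifest hashes that are not stored locally, each listed once."""
    missing = []
    seen = set()
    for h in manifest:
        if h not in local_store and h not in seen:
            missing.append(h)
            seen.add(h)
    return missing


# Hypothetical local store that already holds one fragment from an earlier install.
old_fragment = b"sepal_length,sepal_width\n5.1,3.5\n"
new_fragment = b"petal_length,petal_width\n1.4,0.2\n"
local_store = {fragment_hash(old_fragment): old_fragment}

# A new package version that reuses the old fragment and references the new one twice.
manifest = [
    fragment_hash(old_fragment),
    fragment_hash(new_fragment),
    fragment_hash(new_fragment),
]

to_fetch = fragments_to_fetch(manifest, local_store)
print(len(to_fetch))  # 1
```

Because a fragment's identity is derived from its content, a fragment that is already on disk or that appears several times in a package only needs to cross the network once.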

In [2]:
import quilt
quilt.install("uciml/iris", hash="da2b6f56f3")

Downloading package metadata...
Downloading 6 fragments...
100%|██████████| 6/6 [00:00<00:00,  9.78obj/s]


# Log
We can travel back in time to any of these hashes. Users can also tag and version package instances.

In [3]:
quilt.log("uciml/iris")

Hash                                                             Pushed              Author
da2b6f56f323b11f7ebe2e32fd3a920e82842aaf0d52fc48eeb1e20f470e66c7 2017-06-13 12:52:22 uciml
d65b9514da28398be09687b7960eb6f7ac388d24035e8de94fde54c61c9f4291 2017-06-13 12:44:25 uciml
5c382b9757487b57d2baf0a82962df308f4f1547ef82e3ca9794cd64f42f615b 2017-06-13 11:34:44 uciml
d79643ef31ffffbd6a0d80fe049c832fea8481a9068b17bdafee81670c066568 2017-06-13 11:23:14 uciml
08f66e8902a178d293c41f2045feff872b7fa63422efe287fa0f2bd650ba3aa9 2017-06-13 11:19:16 uciml
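The `install` call earlier passed the short hash `da2b6f56f3`, which is a prefix of the first hash in this log. How a short prefix can be matched to a full hash is sketched below; `resolve_prefix` is a hypothetical helper written for this example, not part of the Quilt API, and Quilt's own resolution rules may differ.

```python
# Full hashes copied from the log output above.
LOG_HASHES = [
    "da2b6f56f323b11f7ebe2e32fd3a920e82842aaf0d52fc48eeb1e20f470e66c7",
    "d65b9514da28398be09687b7960eb6f7ac388d24035e8de94fde54c61c9f4291",
    "5c382b9757487b57d2baf0a82962df308f4f1547ef82e3ca9794cd64f42f615b",
    "d79643ef31ffffbd6a0d80fe049c832fea8481a9068b17bdafee81670c066568",
    "08f66e8902a178d293c41f2045feff872b7fa63422efe287fa0f2bd650ba3aa9",
]


def resolve_prefix(prefix, hashes):
    """Return the single hash that starts with prefix, or raise ValueError."""
    matches = [h for h in hashes if h.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(
            "prefix %r matches %d hashes, expected exactly 1" % (prefix, len(matches))
        )
    return matches[0]


full_hash = resolve_prefix("da2b6f56f3", LOG_HASHES)
print(full_hash == LOG_HASHES[0])  # True
```

A prefix that is too short is ambiguous: `"d"` matches three of the five hashes above, so the helper raises an error instead of guessing.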


# Navigate data

In [4]:
from quilt.data.uciml import iris

In [5]:
iris

<PackageNode '/Users/karve/Library/Application Support/QuiltCli/quilt_packages/pkgs/Quilt/uciml/iris'>
raw/
tables/
README

In [6]:
iris.tables

<GroupNode>

bezdek_iris
iris

## Deserialize Parquet into a pandas dataframe
Reading Parquet is roughly 10X faster than parsing CSV files.

In [7]:
iris.tables.iris()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


## Dataframes are backed by NumPy
We could natively support NumPy arrays as well.

In [8]:
iris.tables.iris().values

array([[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
       [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
       [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
       [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
       [5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
       [5.4, 3.9, 1.7, 0.4, 'Iris-setosa'],
       [4.6, 3.4, 1.4, 0.3, 'Iris-setosa'],
       [5.0, 3.4, 1.5, 0.2, 'Iris-setosa'],
       [4.4, 2.9, 1.4, 0.2, 'Iris-setosa'],
       [4.9, 3.1, 1.5, 0.1, 'Iris-setosa'],
       [5.4, 3.7, 1.5, 0.2, 'Iris-setosa'],
       [4.8, 3.4, 1.6, 0.2, 'Iris-setosa'],
       [4.8, 3.0, 1.4, 0.1, 'Iris-setosa'],
       [4.3, 3.0, 1.1, 0.1, 'Iris-setosa'],
       [5.8, 4.0, 1.2, 0.2, 'Iris-setosa'],
       [5.7, 4.4, 1.5, 0.4, 'Iris-setosa'],
       [5.4, 3.9, 1.3, 0.4, 'Iris-setosa'],
       [5.1, 3.5, 1.4, 0.3, 'Iris-setosa'],
       [5.7, 3.8, 1.7, 0.3, 'Iris-setosa'],
       [5.1, 3.8, 1.5, 0.3, 'Iris-setosa'],
       [5.4, 3.4, 1.7, 0.2, 'Iris-setosa'],
       [5.1, 3.7, 1.5, 0.4, 'Iris-setosa'],
       [4.6, 3.6, 1.0, 0.2, 'Iri
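The array above mixes floats and strings, so NumPy stores it with the generic `object` dtype. A small sketch of the difference, using a hand-built frame with the first three rows of the table rather than the Quilt package itself, shows that dropping the `class` column yields a plain floating-point array:

```python
import numpy as np
import pandas as pd

# First three rows of the iris table, typed in by hand for this example.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
    "petal_length": [1.4, 1.4, 1.3],
    "petal_width": [0.2, 0.2, 0.2],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# Mixed float and string columns fall back to a generic object array.
mixed = df.values

# Numeric columns alone share a single float dtype.
features = df.drop(columns="class").values

print(mixed.dtype)     # object
print(features.dtype)  # float64
print(features.shape)  # (3, 4)
```

Numeric arrays of this kind are usually what downstream numerical code expects, so separating labels from features before calling `.values` is a common first step.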

# Users can also create and modify packages on the fly
See [this notebook](https://github.com/quiltdata/examples/blob/master/Examples.ipynb) for further details.

Quilt also supports data unit tests, run at packaging time, that check a package's data profile (e.g. column cardinality, standard deviation). These checks can be applied as part of a model drift detection system.
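As a rough illustration of what a data profile check might look like, the sketch below validates column cardinality and standard deviation in plain Python. It does not use Quilt's test syntax; `check_profile` and its thresholds are hypothetical, and the sample values are the first ten rows of the iris table shown earlier.

```python
import statistics


def check_profile(column, max_cardinality=None, max_stdev=None):
    """Return a list of failure messages; an empty list means the column passes."""
    failures = []
    if max_cardinality is not None:
        cardinality = len(set(column))
        if cardinality > max_cardinality:
            failures.append(
                "cardinality %d exceeds %d" % (cardinality, max_cardinality)
            )
    if max_stdev is not None:
        stdev = statistics.stdev(column)  # sample standard deviation
        if stdev > max_stdev:
            failures.append("stdev %.3f exceeds %.3f" % (stdev, max_stdev))
    return failures


# First ten rows of the iris table.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
classes = ["Iris-setosa"] * 10

print(check_profile(classes, max_cardinality=3))       # []
print(check_profile(sepal_length, max_stdev=1.0))      # []
print(len(check_profile(sepal_length, max_stdev=0.1)))  # 1
```

Running such checks whenever a package is built means that a shift in the data, such as an unexpected new class label or a much wider spread, is caught before a model consumes it.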

# Quilt for reproducible machine learning
Both models and data can be stored in Quilt packages; see [this article on the Domino Data Lab blog](https://blog.dominodatalab.com/reproducible-machine-learning-with-jupyter-and-quilt/).

# More features
Visit [docs.quiltdata.com](https://docs.quiltdata.com/) for more.