# Hangar Quick Start Tutorial

This tutorial walks you through the basics of working with Hangar.

## 0. Setup

You can install Hangar via `pip`:

```bash
pip install hangar
```

or via `conda`:

```bash
conda install -c conda-forge hangar
```

Other requirements:
* pillow
* tqdm

#### Then import `Repository` to get started

In [1]:
from hangar import Repository

## 1. Create and initialize a `Repository`

Create the folder where you want to store the Hangar `Repository`:

In [None]:
! mkdir /Volumes/Archivio/tensorwerk/hangar/imagenette

Then create the `Repository` object. Note that when you point Hangar at a new folder, Python emits a warning telling you to initialize the repository before you start working with it.

In [2]:
repo = Repository(path='/Volumes/Archivio/tensorwerk/hangar/imagenette')

In [3]:
repo.init(user_name='Alessia Marcolini', user_email='alessia@tensorwerk.com', remove_old=True)

Hangar Repo initialized at: /Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar


'/Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar'

## 2. Repository checkout

A `Repository` can be checked out in two modes: write-enabled and read-only. We need to check out the repo in write mode in order to initialize the columns and write data into them.

In [4]:
co = repo.checkout(write=True)

A checkout gives access to `columns`. The `columns` attribute of a checkout provides the interface for working with all of the data on disk!

In [5]:
co.columns

Hangar Columns                
    Writeable         : True                
    Number of Columns : 0                
    Column Names / Partial Remote References:                
      - 

## 3a. Download and prepare the data

To start playing with Hangar, let's get some data to work on. We'll be using the [Imagenette dataset](https://github.com/fastai/imagenette).

In [None]:
! wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz

In [None]:
! tar -xzf imagenette2-160.tgz

Download `words.txt` ... 

In [None]:
! wget http://image-net.org/archive/words.txt -P imagenette2-160

... and create a dictionary that stores the correspondence between each ImageNet synset name and a human-readable label.

In [6]:
from pathlib import Path
dataset_dir = Path('imagenette2-160')

synset_label = {}

with open(dataset_dir / 'words.txt', 'r') as f:
    for line in f.readlines():
        synset, label = line.split('\t')
        synset_label[synset] = label.rstrip()
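
The parsing step above can also be written as a small reusable function. The following is a minimal sketch, assuming each line has the layout `<synset>\t<label>`; the function name `parse_synset_labels` and the sample lines are illustrative and not part of Hangar or the dataset tooling. It uses `str.partition` so that only the first tab separates the synset from the label, and it skips blank lines.

```python
import io


def parse_synset_labels(lines):
    """Map each synset ID to its label from tab-separated lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # partition splits on the first tab only, so any later tabs stay in the label
        synset, _, label = line.partition("\t")
        mapping[synset] = label.strip()
    return mapping


# Illustrative sample in the same layout as words.txt (hypothetical content)
sample = io.StringIO("n01440764\ttench, Tinca tinca\n\nn02102040\tEnglish springer\n")
labels = parse_synset_labels(sample)
print(labels["n01440764"])
```

Any iterable of lines works as input, so an open file handle can be passed directly in place of the `io.StringIO` sample.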

In [7]:
import os
from tqdm import tqdm

import numpy as np
from numpy import asarray
from PIL import Image

Read training data (images and labels) from disk and store them in numpy arrays.

In [8]:
train_images = []
train_labels = []

for synset in tqdm(os.listdir(dataset_dir / 'train')):
    label = synset_label[synset]
    
    for image_filename in os.listdir(dataset_dir / 'train' / synset):
        image = Image.open(dataset_dir / 'train' / synset / image_filename)
        image = image.resize((163, 160))
        data = asarray(image)
        
        if len(data.shape) == 2:
            continue
        
        train_images.append(data)
        train_labels.append(label)
        
train_images = np.array(train_images)
train_labels = np.array(train_labels)

100%|██████████| 10/10 [01:49<00:00, 10.91s/it]


In [9]:
train_images.shape

(9296, 160, 163, 3)
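
The per-image shape `(160, 163, 3)` may look reversed relative to `resize((163, 160))`. Pillow sizes are given as `(width, height)`, while the NumPy array built from an image is indexed as `(height, width, channels)`. The helper below is a hypothetical illustration of that mapping in plain Python, not a Pillow or NumPy function.

```python
def array_shape(pil_size, channels=3):
    """Return the NumPy-style shape for an image of the given Pillow size."""
    width, height = pil_size  # Pillow order: (width, height)
    return (height, width, channels)  # array order: (rows, columns, channels)


shape = array_shape((163, 160))
print(shape)
```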

Read validation data (images and labels) from disk and store them in numpy arrays, same as before.

In [10]:
val_images = []
val_labels = []

for synset in tqdm(os.listdir(dataset_dir / 'val')):
    label = synset_label[synset]
    
    for image_filename in os.listdir(dataset_dir / 'val' / synset):
        image = Image.open(dataset_dir / 'val' / synset / image_filename)
        image = image.resize((163, 160))
        data = asarray(image)
        
        if len(data.shape) == 2:
            continue
            
        val_images.append(data)
        val_labels.append(label)
        
val_images = np.array(val_images)
val_labels = np.array(val_labels)

100%|██████████| 10/10 [00:34<00:00,  3.47s/it]


In [11]:
val_images.shape

(3856, 160, 163, 3)

## 3b. Column initialization

With a write-enabled checkout, we can now initialize a new column of the repository using the `add_ndarray_column()` method.

All samples within a column have the same data type, and number of dimensions. The size of each dimension can be either fixed (the default behavior) or variable per sample.

You will need to provide a column `name` and a `prototype`, so that Hangar can infer the shape and data type of the samples stored in the column.
`train_im_col` will become a column accessor object.
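
To make the fixed versus variable distinction concrete, here is a minimal sketch of a shape check written in plain Python. It is not Hangar's implementation: the function `fits_schema` is hypothetical, and it assumes that in variable-shape mode the schema shape acts as an upper bound on each dimension, which should be confirmed against the Hangar documentation.

```python
def fits_schema(sample_shape, schema_shape, variable_shape=False):
    """Check whether a sample's shape is compatible with a column schema."""
    # The number of dimensions must match in both modes
    if len(sample_shape) != len(schema_shape):
        return False
    if variable_shape:
        # Assumption: each dimension may be at most the schema's size
        return all(s <= m for s, m in zip(sample_shape, schema_shape))
    # Fixed-shape mode: every dimension must match exactly
    return tuple(sample_shape) == tuple(schema_shape)


print(fits_schema((160, 163, 3), (160, 163, 3)))
print(fits_schema((120, 163, 3), (160, 163, 3)))
print(fits_schema((120, 163, 3), (160, 163, 3), variable_shape=True))
```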

In [12]:
train_im_col = co.add_ndarray_column(name='imagenette_training_images', prototype=train_images[0])

In [13]:
train_im_col

Hangar FlatSampleWriter                 
    Column Name              : imagenette_training_images                
    Writeable                : True                
    Column Type              : ndarray                
    Column Layout            : flat                
    Schema Type              : fixed_shape                
    DType                    : uint8                
    Shape                    : (160, 163, 3)                
    Number of Samples        : 0                
    Partial Remote Data Refs : False


The same accessor is also available through the `columns` attribute of the checkout:

In [14]:
co.columns['imagenette_training_images']

Hangar FlatSampleWriter                 
    Column Name              : imagenette_training_images                
    Writeable                : True                
    Column Type              : ndarray                
    Column Layout            : flat                
    Schema Type              : fixed_shape                
    DType                    : uint8                
    Shape                    : (160, 163, 3)                
    Number of Samples        : 0                
    Partial Remote Data Refs : False


Since Hangar 0.5, it's possible to have a column with string datatype, and we will be using it to store the labels of our dataset.

In [15]:
train_lab_col = co.add_str_column(name='imagenette_training_labels')

## 4. Adding data

To add data to a named column, we can use dict-style access (the `__setitem__`, `__getitem__`, and `__delitem__` methods) or the `update()` method. Sample keys can be either `str` or `int`.

In [16]:
train_im_col[0] = train_images[0]
train_lab_col[0] = train_labels[0]

As we can see, `Number of Samples` is equal to 1 now.

In [18]:
co.columns['imagenette_training_labels']

Hangar FlatSampleWriter                 
    Column Name              : imagenette_training_labels                
    Writeable                : True                
    Column Type              : str                
    Column Layout            : flat                
    Schema Type              : variable_shape                
    DType                    : <class 'str'>                
    Shape                    : None                
    Number of Samples        : 1                
    Partial Remote Data Refs : False


In [19]:
data = {1: train_images[1], 2: train_images[2]}

In [20]:
train_im_col.update(data)

In [21]:
train_im_col

Hangar FlatSampleWriter                 
    Column Name              : imagenette_training_images                
    Writeable                : True                
    Column Type              : ndarray                
    Column Layout            : flat                
    Schema Type              : fixed_shape                
    DType                    : uint8                
    Shape                    : (160, 163, 3)                
    Number of Samples        : 3                
    Partial Remote Data Refs : False
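
Since `update()` takes a dictionary of key/sample pairs, a batch can be built from any sequence with `enumerate`. The sketch below uses a plain `dict` as a stand-in for a Hangar column and placeholder strings instead of image arrays; the helper name `keyed_batch` is hypothetical.

```python
def keyed_batch(samples, start=0):
    """Build a {key: sample} dict with consecutive integer keys."""
    return {key: sample for key, sample in enumerate(samples, start=start)}


# A plain dict stands in for the column; key 0 is already present
column = {0: "img0"}
column.update(keyed_batch(["img1", "img2"], start=1))
print(sorted(column))
```

Starting the keys at `start` keeps a new batch from overwriting samples that were stored earlier.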


Let's add the remaining training images:

In [22]:
with train_im_col:
    for i, img in tqdm(enumerate(train_images)):
        if i not in [0, 1, 2]:
            train_im_col[i] = img

9296it [00:17, 534.78it/s]


In [29]:
train_im_col

Hangar FlatSampleWriter                 
    Column Name              : imagenette_training_images                
    Writeable                : True                
    Column Type              : ndarray                
    Column Layout            : flat                
    Schema Type              : fixed_shape                
    DType                    : uint8                
    Shape                    : (160, 163, 3)                
    Number of Samples        : 9296                
    Partial Remote Data Refs : False


In [24]:
with train_lab_col:
    for i, label in tqdm(enumerate(train_labels)):
        if i != 1:
            train_lab_col[i] = label

9296it [00:00, 56786.52it/s]


In [25]:
train_lab_col

Hangar FlatSampleWriter                 
    Column Name              : imagenette_training_labels                
    Writeable                : True                
    Column Type              : str                
    Column Layout            : flat                
    Schema Type              : variable_shape                
    DType                    : <class 'str'>                
    Shape                    : None                
    Number of Samples        : 9295                
    Partial Remote Data Refs : False


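The `imagenette_training_images` column has 9296 samples, but `imagenette_training_labels` has only 9295: index 0 was already written, yet the loop above skips index 1, so the label for sample 1 is never stored. Changing the condition to `if i != 0:` stores all 9296 labels.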
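
A quick way to check that two columns hold the same sample keys is to compare their key sets. The following is a minimal sketch that uses plain dictionaries as stand-ins for the image and label columns; it assumes the columns expose a dict-like `keys()` method, and the helper name `key_mismatch` is hypothetical.

```python
def key_mismatch(first, second):
    """Return the keys found only in `first` and only in `second`."""
    first_keys = set(first.keys())
    second_keys = set(second.keys())
    return first_keys - second_keys, second_keys - first_keys


# Plain dicts stand in for the image and label columns
images = {0: "a", 1: "b", 2: "c"}
labels = {0: "x", 2: "z"}

only_images, only_labels = key_mismatch(images, labels)
print(only_images, only_labels)
```

Two empty sets mean every image has a matching label and vice versa.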

N.B.: for an overview of the different ways to add data to a Hangar repository, including their performance implications, please refer to the Performance section of the Hangar Tutorial Part 1.

## 5. Committing changes

Once you have made a set of changes you want to commit, simply call the `commit()` method and specify a message.

In [30]:
co.commit('Add Imagenette training images and labels')

'a=a4f98f501045d9db04eb5b6692bbc1aeb8fcefe8'

Let's add the validation data to the repository ...

In [31]:
val_im_col = co.add_ndarray_column(name='imagenette_validation_images', prototype=val_images[0])
val_lab_col = co.add_str_column(name='imagenette_validation_labels')

In [32]:
# enumerate supplies the sample key `i` shared by both columns
for i, (img, label) in tqdm(enumerate(zip(val_images, val_labels))):
    val_im_col[i] = img
    val_lab_col[i] = label

3856it [01:22, 47.01it/s]


... and commit!

In [34]:
co.commit('Add Imagenette validation images and labels')

'a=a06a7b83b3d96f91d8e27f621c130279de4d0bd7'

To view the **history** of your commits:

In [35]:
co.log()

* a=a06a7b83b3d96f91d8e27f621c130279de4d0bd7 (master) : Add Imagenette validation images and labels
* a=a4f98f501045d9db04eb5b6692bbc1aeb8fcefe8 : Add Imagenette training images and labels


### Do not forget to close the write-enabled checkout!

In [36]:
co.close()

Let's inspect the repository state! This will show disk usage information, the details of the last commit and all the information about the dataset columns.

In [37]:
repo.summary()

Summary of Contents Contained in Data Repository 
 
| Repository Info 
|----------------- 
|  Base Directory: /Volumes/Archivio/tensorwerk/hangar/imagenette 
|  Disk Usage: 860.53 MB 
 
| Commit Details 
------------------- 
|  Commit: a=a06a7b83b3d96f91d8e27f621c130279de4d0bd7 
|  Created: Wed Mar 11 14:03:42 2020 
|  By: Alessia Marcolini 
|  Email: alessia@tensorwerk.com 
|  Message: Add Imagenette validation images and labels 
 
| DataSets 
|----------------- 
|  Number of Named Columns: 4 
|
|  * Column Name: ColumnSchemaKey(column="imagenette_training_images", layout="flat") 
|    Num Data Pieces: 9296 
|    Details: 
|    - column_layout: flat 
|    - column_type: ndarray 
|    - schema_type: fixed_shape 
|    - shape: (160, 163, 3) 
|    - dtype: uint8 
|    - backend: 01 
|    - backend_options: {'complib': 'blosc:lz4hc', 'complevel': 5, 'shuffle': 'byte'} 
|
|  * Column Name: ColumnSchemaKey(column="imagenette_training_labels", layout="flat") 
|    Num Data Pieces: 9295 
|   