# Hangar Quick Start Tutorial

This tutorial walks you through the basics of working with Hangar.

## 0. Setup

You can install Hangar via `pip`:

```bash
pip install hangar
```

or via `conda`:

```bash
conda install -c conda-forge hangar
```

Other requirements for this tutorial:
* pillow
* tqdm

#### Then import `Repository` to get started

In [1]:
from hangar import Repository

## 1. Create and initialize a `Repository`

Create the folder where you want to store the Hangar `Repository`:

In [2]:
! mkdir /Volumes/Archivio/tensorwerk/hangar/imagenette

Then create the `Repository` object. When you point Hangar at a new folder, Python emits a warning telling you that the repo must be initialized before you can work with it. Note that `remove_old=True` in the `init` call below removes any existing Hangar repository at that path, so use it with care.

In [3]:
repo = Repository(path='/Volumes/Archivio/tensorwerk/hangar/imagenette')



In [4]:
repo.init(user_name='Alessia Marcolini', user_email='alessia@tensorwerk.com', remove_old=True)

Hangar Repo initialized at: /Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar


'/Volumes/Archivio/tensorwerk/hangar/imagenette/.hangar'
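The `! mkdir` cell at the start of this section assumes a Unix-like shell and fails if a parent folder is missing. As a portable alternative, here is a minimal sketch using only the standard library; `make_repo_dir` is a hypothetical helper and the temporary path is a placeholder, not the location used in this tutorial.

```python
import tempfile
from pathlib import Path


def make_repo_dir(path):
    """Create the folder (and any missing parents) that will hold the repo."""
    repo_dir = Path(path)
    # exist_ok=True makes the call safe to repeat when re-running the notebook
    repo_dir.mkdir(parents=True, exist_ok=True)
    return repo_dir


# Placeholder location for illustration only; substitute your own path.
repo_dir = make_repo_dir(Path(tempfile.gettempdir()) / 'hangar_demo' / 'imagenette')
print(repo_dir.is_dir())
```

The resulting path can then be passed to `Repository(path=...)` exactly as in the cells above.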

## 2. Repository checkout

A `Repository` can be checked out in two modes: write-enabled and read-only. We need to check out the repo in write mode in order to initialize the columns and write data into them.

In [5]:
co = repo.checkout(write=True)

A checkout exposes two attributes, `columns` and `metadata`, which together provide the interface for working with all of the data on disk.

In [6]:
co.columns

Hangar Columns                
    Writeable         : True                
    Number of Columns : 0                
    Column Names / Partial Remote References:                
      - 

In [7]:
co.metadata

Hangar Metadata                
    Writeable: True                
    Number of Keys: 0


## 3a. Download and prepare the data

To start playing with Hangar, let's get some data to work on. We'll be using the [Imagenette dataset](https://github.com/fastai/imagenette).

In [8]:
! wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz

--2020-03-09 15:11:17--  https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
Resolving s3.amazonaws.com... 52.217.42.134
Connecting to s3.amazonaws.com|52.217.42.134|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 98948031 (94M) [application/x-tar]
Saving to: ‘imagenette2-160.tgz’


2020-03-09 15:12:08 (1.92 MB/s) - ‘imagenette2-160.tgz’ saved [98948031/98948031]



In [9]:
! tar -xzf imagenette2-160.tgz
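If `wget` and `tar` are not available (for example on Windows), the same two steps can be sketched with the standard library. The helper names `download` and `extract` are hypothetical; the URL is the one used above. Only extract archives you trust.

```python
import tarfile
import urllib.request
from pathlib import Path

URL = 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz'


def download(url, dest):
    """Download url to dest, skipping the transfer if dest already exists."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract(archive, target_dir='.'):
    """Unpack a gzip-compressed tar archive into target_dir."""
    # Newer Python versions support extraction filters; use the safer
    # 'data' filter when the running interpreter provides it.
    kwargs = {'filter': 'data'} if hasattr(tarfile, 'data_filter') else {}
    with tarfile.open(archive, 'r:gz') as tar:
        tar.extractall(target_dir, **kwargs)


# Usage (triggers the ~94 MB download, so it is left commented out here):
# extract(download(URL, 'imagenette2-160.tgz'))
```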

Download `words.txt` to get the correspondence between ImageNet synset names and human-readable labels.

In [10]:
! wget http://image-net.org/archive/words.txt -P imagenette2-160

--2020-03-09 15:12:45--  http://image-net.org/archive/words.txt
Resolving image-net.org... 171.64.68.16
Connecting to image-net.org|171.64.68.16|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2655750 (2.5M) [text/plain]
Saving to: ‘imagenette2-160/words.txt’


2020-03-09 15:12:50 (728 KB/s) - ‘imagenette2-160/words.txt’ saved [2655750/2655750]



In [11]:
from pathlib import Path
dataset_dir = Path('imagenette2-160')

# Map each ImageNet synset ID to its human-readable label
synset_label = {}

with open(dataset_dir / 'words.txt', 'r') as f:
    for line in f:
        # Each line has the form '<synset>\t<label>'
        synset, label = line.split('\t', 1)
        synset_label[synset] = label.rstrip()
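To make the file format concrete, here is a small sketch that parses two sample lines in the same tab-separated layout without touching the disk. `parse_synset_labels` is a hypothetical helper, not part of Hangar or of this tutorial's code.

```python
import io


def parse_synset_labels(lines):
    """Build a {synset: label} dict from '<synset>\\t<label>' lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # tolerate blank lines
        # partition splits on the first tab only, so labels may contain commas
        synset, _, label = line.partition('\t')
        mapping[synset] = label.strip()
    return mapping


# Two sample lines in the words.txt layout
sample = io.StringIO(
    'n01440764\ttench, Tinca tinca\n'
    'n03028079\tchurch, church building\n'
)
labels = parse_synset_labels(sample)
print(labels['n01440764'])  # tench, Tinca tinca
```

The synset names double as the folder names under `train` and `val`, which is why the loading loops can look labels up by directory name.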

In [12]:
import os
from tqdm import tqdm

import numpy as np
from numpy import asarray
from PIL import Image

In [13]:
train_images = []
train_labels = []

for synset in tqdm(os.listdir(dataset_dir / 'train')):
    label = synset_label[synset]
    
    for image_filename in os.listdir(dataset_dir / 'train' / synset):
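        image = Image.open(dataset_dir / 'train' / synset / image_filename)
        # PIL's resize takes (width, height), so arrays come out as (160, 163, 3)
        image = image.resize((163, 160))
        data = asarray(image)
        
        # Skip grayscale images, which have no channel axis
        if len(data.shape) == 2:
            continue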
        
        train_images.append(data)
        train_labels.append(label)

100%|██████████| 10/10 [00:25<00:00,  2.60s/it]


In [14]:
train_images = np.array(train_images)
train_labels = np.array(train_labels)

In [15]:
train_images.shape

(9296, 160, 163, 3)
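The shape above lists 160 before 163 even though the loop calls `resize((163, 160))`. PIL sizes are given as (width, height), while NumPy arrays are indexed as (height, width, channels). The sketch below checks this on synthetic images standing in for dataset files, and also shows why grayscale images produce 2-D arrays. Converting them with `convert('RGB')` is shown as one possible alternative to skipping them; it is not what this tutorial does.

```python
import numpy as np
from PIL import Image

# Synthetic 200x100 (width x height) RGB image, used only for illustration
rgb = Image.new('RGB', (200, 100))
resized = rgb.resize((163, 160))      # PIL size order: (width, height)
print(resized.size)                   # (163, 160)
print(np.asarray(resized).shape)      # (160, 163, 3)

# A grayscale ('L' mode) image has no channel axis after conversion
gray = Image.new('L', (200, 100)).resize((163, 160))
print(np.asarray(gray).shape)         # (160, 163)

# Alternative to skipping: expand grayscale to three channels
print(np.asarray(gray.convert('RGB')).shape)  # (160, 163, 3)
```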

In [16]:
val_images = []
val_labels = []

for synset in tqdm(os.listdir(dataset_dir / 'val')):
    label = synset_label[synset]
    
    for image_filename in os.listdir(dataset_dir / 'val' / synset):
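        image = Image.open(dataset_dir / 'val' / synset / image_filename)
        # PIL's resize takes (width, height), so arrays come out as (160, 163, 3)
        image = image.resize((163, 160))
        data = asarray(image)
        
        # Skip grayscale images, which have no channel axis
        if len(data.shape) == 2:
            continue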
            
        val_images.append(data)
        val_labels.append(label)

100%|██████████| 10/10 [00:05<00:00,  1.80it/s]


In [17]:
val_images = np.array(val_images)
val_labels = np.array(val_labels)

In [18]:
val_images.shape

(3856, 160, 163, 3)
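The training and validation loops differ only in the folder they read, so they could be folded into one function. The following is a sketch under that assumption; `load_split` is a hypothetical helper, and unlike the cells above it sorts directory entries and ignores stray non-directory files at the top level.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def load_split(split_dir, synset_label, size=(163, 160)):
    """Load every RGB image below split_dir/<synset>/ into NumPy arrays."""
    images, labels = [], []
    for synset_dir in sorted(Path(split_dir).iterdir()):
        if not synset_dir.is_dir():
            continue  # ignore stray files such as .DS_Store
        label = synset_label[synset_dir.name]
        for image_path in sorted(synset_dir.iterdir()):
            with Image.open(image_path) as image:
                # size is (width, height); the array is (height, width, channels)
                data = np.asarray(image.resize(size))
            if data.ndim == 2:
                continue  # grayscale image, skipped as in the loops above
            images.append(data)
            labels.append(label)
    return np.array(images), np.array(labels)


# Usage, assuming dataset_dir and synset_label from the earlier cells:
# train_images, train_labels = load_split(dataset_dir / 'train', synset_label)
# val_images, val_labels = load_split(dataset_dir / 'val', synset_label)
```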