### Part 1: building the core image dataset

In [1]:
import pandas as pd
a = pd.read_csv("data/metadata/train-annotations-bbox.csv", index_col=0)
b = pd.read_csv("data/metadata/train-annotations-human-imagelabels-boxable.csv", index_col=0)
c = pd.read_csv("data/metadata/train-images-ids.csv", index_col=0)


In [6]:
imglist = a['ImageID'].unique()

In [7]:
len(imglist)

1743042

In [9]:
a = a.set_index("ImageID")
b = b.set_index("ImageID")
c = c.set_index("ImageID")

In [40]:
def defineAnnotationsByID(ImageID):
    # Select with a list so .loc always returns a DataFrame, even when the
    # image has only a single bounding box or label row.
    bboxes   = a.loc[[ImageID]]
    labels   = b.loc[[ImageID]]
    img_meta = c.loc[ImageID]

    out = dict(img_meta)
    # One dict per row, keyed by column name.
    out['bboxes'] = bboxes.to_dict('records')
    out['labels'] = labels.to_dict('records')

    return out

In [43]:
# tqdm_notebook is deprecated in recent tqdm releases; tqdm.notebook.tqdm replaces it.
from tqdm.notebook import tqdm

meta_entries = []
for ImageID in tqdm(imglist):
    meta_entries.append(defineAnnotationsByID(ImageID))

KeyboardInterrupt: 

This loop would generate the full metadata annotations for the images, which could then be written out as a JSON blob (though the image label IDs still need to be added to the mix). A full run over all 1,743,042 images would take about 3 hours, so I interrupted it.
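The loop above performs one lookup per image. A possible alternative, not benchmarked here, is to group the rows by `ImageID` in a single pass. The sketch below uses only the standard library rather than pandas, and a tiny made-up CSV; the `XMin`/`XMax` columns, the values, and the `group_rows_by_image` helper are illustrative assumptions, not part of the notebook.

```python
import csv
import io
import json
from collections import defaultdict

# Toy stand-in for the bounding-box CSV; values are made up for illustration.
BBOX_CSV = """ImageID,LabelName,XMin,XMax
img1,/m/01g317,0.1,0.5
img1,/m/09j2d,0.2,0.9
img2,/m/01g317,0.0,1.0
"""

def group_rows_by_image(csv_text):
    """Group CSV rows by ImageID in one pass over the file (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        image_id = row.pop("ImageID")
        grouped[image_id].append(row)
    return grouped

bboxes_by_image = group_rows_by_image(BBOX_CSV)

# One metadata entry per image, each carrying its list of bounding-box rows.
meta_entries = [
    {"ImageID": image_id, "bboxes": rows}
    for image_id, rows in bboxes_by_image.items()
]
print(json.dumps(meta_entries[0], indent=2))
```

Note that `csv.DictReader` yields every value as a string, so numeric columns would still need converting before use.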

The images can be fetched through the Flickr API, but the API is rate limited to 3600 requests per hour (documented [here](https://www.flickr.com/services/developer/api/)). Respecting this limit, downloading all 1,743,042 images from Flickr would take almost 500 hours, or approximately 20 days of continuous work. That is not practical.
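The estimate follows directly from the image count and the rate limit. A quick check, assuming exactly one API request per image (an assumption; the real number of calls per image may be higher):

```python
# Back-of-the-envelope check of the Flickr download time.
N_IMAGES = 1743042        # len(imglist) from the notebook
REQUESTS_PER_HOUR = 3600  # Flickr API rate limit quoted in the text

# Assumption: exactly one API request is needed per image.
hours = N_IMAGES / REQUESTS_PER_HOUR
days = hours / 24

print(f"{hours:.0f} hours, or about {days:.1f} days")
# prints "484 hours, or about 20.2 days"
```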

The images could be downloaded more quickly as a direct ZIP, but I don't have enough storage space on my local drive to do it, as the zipped file is over a terabyte in size.

The best way to populate the images would be to use an Amazon EC2 instance as an intermediary, per [this comment](https://datascience.stackexchange.com/questions/5589/downloading-a-large-dataset-on-the-web-directly-into-aws-s3).

In the meantime, I can push the raw CSV files right away, build some summaries around them, and write the `README.md`.

### Part 2: building data sources for the summary Vega charts

This part builds the summary datasets that feed the Vega charts.

#### Image labels

This segment generates the data for a summary of the most common image labels.

In [73]:
l = b['LabelName'].value_counts().head(20)
l = pd.DataFrame(l)
l.index.name = "label_id"
l.columns = ['label_count']

In [77]:
l.head()

label_id,label_count
/m/01g317,839436
/m/09j2d,675650
/m/04yx4,472414
/m/05s2s,436288
/m/07j7r,423757


In [152]:
labels = pd.read_csv("../data/metadata/image-class-names.csv", index_col=0)

In [153]:
labels.head()

Unnamed: 0,LabelID,LabelName
0,/m/011k07,Tortoise
1,/m/011q46kg,Container
2,/m/012074,Magpie
3,/m/0120dh,Sea turtle
4,/m/01226z,Football


In [155]:
labels.columns = ['label_id', 'label_name']
labels = labels.set_index('label_id')

In [164]:
l = b['LabelName'].value_counts()
l = pd.DataFrame(l)
l.index.name = "label_id"
l.columns = ['label_count']

In [165]:
l = l.join(labels)

In [49]:
mkdir -p ../data/summaries/

In [166]:
l.to_csv("../data/summaries/top-image-labels.csv")

#### Label hierarchy

This segment builds a second summary view: the overall label hierarchy.

In [94]:
import requests

r = requests.get(
    "https://storage.googleapis.com/openimages/2018_04/bbox_labels_600_hierarchy.json"
)
hierarchy = r.json()

In [167]:
l2 = b['LabelName'].value_counts()
l2 = pd.DataFrame(l2)
l2.index.name = "label_id"
l2.columns = ['label_count']

In [168]:
l2 = l2.join(labels)

In [222]:
# Reformat the provided hierarchy into a flat list of {id, parent, name} records,
# something immediately parsable by the Vega built-ins.
# TODO: rip out the ids_already_seen? after studying
def reify(node, parent=None, ids_already_seen=None):
    # The set is shared across recursive calls so each label ID is emitted only once.
    if ids_already_seen is None:
        ids_already_seen = set()

    out = []

    if node['LabelName'] not in ids_already_seen:
        ids_already_seen.add(node['LabelName'])
        entry = dict()
        entry['id'] = node['LabelName']

        if parent:
            entry['parent'] = parent

        try:
            entry['name'] = l.loc[entry['id']].label_name
        except KeyError:
            # IDs missing from the label counts table get a generic name.
            entry['name'] = 'Object'

        out.append(entry)

        if 'Subcategory' in node:
            for subnode in node['Subcategory']:
                out += reify(subnode, parent=node['LabelName'], ids_already_seen=ids_already_seen)

    return out


# TODO
def uniquify(l):
    pass

In [223]:
hierarchy_transform = reify(hierarchy)

In [224]:
import json
with open("../data/summaries/image-labels-transformed.json", "w") as fp:
    json.dump(hierarchy_transform, fp, indent=4)
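To make the output format concrete, here is a minimal, self-contained sketch of the same flattening applied to a made-up hierarchy. The `flatten` function, the `toy_hierarchy` data, and the `names` lookup are hypothetical stand-ins for `reify`, the downloaded JSON, and the label table. The flat `id`/`parent` layout is the shape that Vega's `stratify` transform consumes through its `key` and `parentKey` parameters.

```python
# Hypothetical toy hierarchy in the same nested shape as the downloaded JSON.
toy_hierarchy = {
    "LabelName": "/m/root",
    "Subcategory": [
        {"LabelName": "/m/a", "Subcategory": [{"LabelName": "/m/b"}]},
        {"LabelName": "/m/b"},  # duplicate id: kept only under its first parent
    ],
}
names = {"/m/a": "Animal", "/m/b": "Bird"}  # made-up display names

def flatten(node, parent=None, seen=None):
    """Depth-first flattening into a list of {id, name, parent} records."""
    if seen is None:
        seen = set()
    if node["LabelName"] in seen:
        return []
    seen.add(node["LabelName"])
    entry = {
        "id": node["LabelName"],
        "name": names.get(node["LabelName"], "Object"),
    }
    if parent is not None:
        entry["parent"] = parent
    out = [entry]
    for child in node.get("Subcategory", []):
        out += flatten(child, parent=node["LabelName"], seen=seen)
    return out

records = flatten(toy_hierarchy)
for record in records:
    print(record)
```

Because the traversal is depth-first, a label that appears under several parents is attached to whichever parent is visited first, which keeps every `id` unique in the output.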