## t4-summaries-push

Builds a few small dataset files that I later push to the `/summaries/` folder in the T4 repo. The Vega definition files in that repo read these datasets to render the interactive charts that are part of the overall bucket summary.

In [None]:
import pandas as pd

# Open Images metadata: a = bounding-box annotations, b = human-verified image-level labels, c = image IDs.
a = pd.read_csv("data/metadata/train-annotations-bbox.csv", index_col=0)
b = pd.read_csv("data/metadata/train-annotations-human-imagelabels-boxable.csv", index_col=0)
c = pd.read_csv("data/metadata/train-images-ids.csv", index_col=0)

### Image labels

This segment generates the data for a summary of the most common image labels.

In [None]:
l = b['LabelName'].value_counts().head(20)
l = pd.DataFrame(l)
l.index.name = "label_id"
l.columns = ['label_count']

In [None]:
l.head()

In [None]:
labels = pd.read_csv("../data/metadata/image-class-names.csv", index_col=0)

In [None]:
labels.head()

In [None]:
labels.columns = ['label_id', 'label_name']
labels = labels.set_index('label_id')

In [None]:
l = b['LabelName'].value_counts()
l = pd.DataFrame(l)
l.index.name = "label_id"
l.columns = ['label_count']
l = l.join(labels)
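
The count-and-join step above can be tried on toy data. This is a minimal sketch, not the notebook's actual data: the label IDs, names, and the `annotations`, `names`, and `summary` variables are all made up for illustration.

```python
import pandas as pd

# Toy stand-in for the annotations table; the label IDs are invented.
annotations = pd.DataFrame(
    {"LabelName": ["/m/a", "/m/b", "/m/a", "/m/c", "/m/a", "/m/b"]},
    index=["img1", "img1", "img2", "img2", "img3", "img3"],
)

# Toy stand-in for the class-name lookup, indexed by label ID.
names = pd.DataFrame(
    {"label_name": ["Apple", "Banana", "Cherry"]},
    index=pd.Index(["/m/a", "/m/b", "/m/c"], name="label_id"),
)

# Count occurrences per label, most frequent first.
counts = annotations["LabelName"].value_counts().to_frame()
counts.index.name = "label_id"
counts.columns = ["label_count"]

# Attach human-readable names with an index-on-index left join.
summary = counts.join(names)
print(summary)
```

Because `join` defaults to a left join, a label ID that has no entry in the name table keeps its count and gets a missing `label_name` rather than being dropped.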

In [None]:
# -p keeps this cell from failing when the folder already exists.
!mkdir -p ../data/summaries/

In [None]:
l.to_csv("../data/summaries/top-image-labels.csv")

### Label hierarchy

Another summary view, this one of the overall label hierarchy.

In [None]:
import requests

r = requests.get(
    "https://storage.googleapis.com/openimages/2018_04/bbox_labels_600_hierarchy.json"
)
r.raise_for_status()  # fail loudly instead of trying to parse an error page as JSON
hierarchy = r.json()

In [None]:
l2 = b['LabelName'].value_counts()
l2 = pd.DataFrame(l2)
l2.index.name = "label_id"
l2.columns = ['label_count']
l2 = l2.join(labels)

In [1]:
# Reformat the provided hierarchy into something immediately parsable by the Vega built-ins.
def reify(node, parent=None, ids_already_seen=None):
    """
    Reformats the nested node hierarchy published with Google's Open Images dataset
    into a flat list of nodes with parent IDs, which is more immediately parsable using Vega.

    Node IDs are tracked so that each node is added to the list only once, because
    the hierarchy provided by the dataset providers allows child nodes with multiple
    parents, so it is not a strict tree.
    """
    # Compare against None explicitly so the same set is shared across recursive calls.
    if ids_already_seen is None:
        ids_already_seen = set()

    out = []

    if node['LabelName'] not in ids_already_seen:
        ids_already_seen.add(node['LabelName'])
        entry = dict()
        entry['id'] = node['LabelName']

        if parent:
            entry['parent'] = parent

        try:
            # l2 is the label count table (with names) built for this section.
            entry['name'] = l2.loc[entry['id']].label_name
        except KeyError:
            # IDs missing from the count table (e.g. the root node) get a generic name.
            entry['name'] = 'Object'

        out.append(entry)

        if 'Subcategory' in node:
            for subnode in node['Subcategory']:
                out += reify(subnode, parent=node['LabelName'], ids_already_seen=ids_already_seen)

    return out
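
To make the flattening concrete, here is a self-contained sketch of the same idea on a toy hierarchy. `flatten_hierarchy`, `toy`, and `toy_names` are hypothetical names for illustration, and a plain dict stands in for the label table; the real input is the downloaded JSON.

```python
def flatten_hierarchy(node, names, parent=None, seen=None):
    """Depth-first flatten of a nested hierarchy into rows with parent IDs."""
    if seen is None:
        seen = set()
    rows = []
    label_id = node["LabelName"]
    if label_id in seen:
        return rows  # already emitted under an earlier parent
    seen.add(label_id)
    row = {"id": label_id, "name": names.get(label_id, "Object")}
    if parent is not None:
        row["parent"] = parent
    rows.append(row)
    for child in node.get("Subcategory", []):
        rows += flatten_hierarchy(child, names, parent=label_id, seen=seen)
    return rows


# "/m/apple" appears under two parents, mimicking the multi-parent case.
toy = {
    "LabelName": "/m/root",
    "Subcategory": [
        {"LabelName": "/m/fruit", "Subcategory": [{"LabelName": "/m/apple"}]},
        {"LabelName": "/m/food", "Subcategory": [{"LabelName": "/m/apple"}]},
    ],
}
toy_names = {"/m/fruit": "Fruit", "/m/food": "Food", "/m/apple": "Apple"}

rows = flatten_hierarchy(toy, toy_names)
for row in rows:
    print(row)
```

The traversal is depth-first, so a node with several parents is kept under whichever parent is visited first and skipped afterwards. Here `/m/apple` ends up under `/m/fruit` only.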

In [None]:
hierarchy_transform = reify(hierarchy)

In [None]:
import json
with open("../data/summaries/image-labels-transformed.json", "w") as fp:
    json.dump(hierarchy_transform, fp, indent=4)
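
Before pushing the file, it may be worth checking that the flat list really describes a single tree. The sketch below assumes the Vega spec builds its tree from the `id` and `parent` fields (for instance through a stratify-style transform), which generally needs unique IDs, exactly one root, and parents that exist. `check_stratifiable` is a hypothetical helper, not part of the repo.

```python
def check_stratifiable(rows):
    """Return a list of problems that would prevent building a single tree."""
    problems = []
    ids = [row["id"] for row in rows]
    known = set(ids)
    if len(known) != len(ids):
        problems.append("duplicate ids")
    # Rows without a "parent" key are treated as roots.
    roots = [row["id"] for row in rows if "parent" not in row]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    # Every parent reference must point at an ID present in the list.
    orphans = [
        row["id"] for row in rows
        if "parent" in row and row["parent"] not in known
    ]
    if orphans:
        problems.append(f"unknown parents for: {orphans}")
    return problems


good = [
    {"id": "root", "name": "Object"},
    {"id": "a", "name": "A", "parent": "root"},
    {"id": "b", "name": "B", "parent": "a"},
]
bad = good + [{"id": "a", "name": "A again", "parent": "missing"}]

print(check_stratifiable(good))
print(check_stratifiable(bad))
```

In the notebook this could be called as `check_stratifiable(hierarchy_transform)`, where an empty list means no problems were found.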