OpenImages V4
=============

https://storage.googleapis.com/openimages/web/download_v4.html

The full set consists of 9,178,275 image URLs. Most point to Flickr, and some are no longer available.

#### Target Set
Of the full dataset, 1,743,042 images are annotated and hosted by the [CVDF](https://github.com/cvdfoundation/open-images-dataset). They are labeled as "training data", but our encoder was not trained on these images; instead, we held them out to use as a target database for our simulations and wetlab experiments.

#### Validation Set
The CVDF also hosts a validation set consisting of 41,620 images, which we used to track the performance of the encoder during training.

#### Training Set
To build our encoder training set, we subtracted the target set from the full dataset, and downloaded whatever was available out of the first 1,200,000 remaining images.

#### Extended Target Set
The remainder of the full dataset (about 6 million image URLs) is used as an extended target database for performing scalability simulations.

### Download Target and Validation Sets

Run the following code to download the target set (513 gigabytes) and the validation set (12 gigabytes). It will take some time:

In [None]:
!aws s3 --no-sign-request sync \
    s3://open-images-dataset/tar/ \
    /tf/open_images/targets/images/ \
    --exclude "*" --include "train_*.tar.gz"
    
!aws s3 --no-sign-request sync \
    s3://open-images-dataset/tar/ \
    /tf/open_images/validation/images/ \
    --exclude "*" --include "validation.tar.gz"
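
Before starting a download of this size, it can help to confirm that the destination volume has enough free space. The following is a minimal sketch, not part of the original pipeline: the helper name `has_room` is hypothetical, the sizes are the ones quoted above, and decimal gigabytes are assumed.

```python
import shutil

# Sizes quoted in the text above, in gigabytes (decimal units are an assumption).
REQUIRED_GB = {"targets": 513, "validation": 12}

def has_room(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9

# "." is a placeholder; point it at the volume that will hold /tf/open_images.
for name, size_gb in REQUIRED_GB.items():
    print(name, "fits" if has_room(".", size_gb) else "does not fit")
```

The check does not account for any temporary space needed while archives are being unpacked and repacked.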

This code converts the `.tar.gz` files to `.zip` files. A zip archive lets you read individual images without unpacking the whole archive, which makes accessing them easier.

In [None]:
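import glob

# `!cd` only changes directory inside a throwaway subshell, so use the `%cd`
# magic to change the notebook's own working directory.
%cd /tf/open_images/targets/images/
for tgz in sorted(glob.glob('*.tar.gz')):
    dname = tgz.replace('.tar.gz', '')
    # Chain with && so the tarball is removed only if extraction and zipping succeed.
    !tar -xf {tgz} && zip -rq {dname}.zip {dname} && rm -rf {dname} {tgz}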
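
To illustrate why the zip format is convenient here, the sketch below reads a single member out of an archive with the standard `zipfile` module. It builds a tiny in-memory archive so that it runs anywhere; the member names and the helper names `list_images` and `read_member` are made up for the example and are not taken from the dataset.

```python
import io
import zipfile

def list_images(zip_path, suffix=".jpg"):
    """Return the names of all members ending in `suffix`."""
    with zipfile.ZipFile(zip_path) as zf:
        return [name for name in zf.namelist() if name.endswith(suffix)]

def read_member(zip_path, member):
    """Return the raw bytes of one member without extracting the others."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member)

# Build a small stand-in archive in memory (hypothetical file names).
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("train_0/0123456789abcdef.jpg", b"fake jpeg bytes")
    zf.writestr("train_0/fedcba9876543210.jpg", b"more fake bytes")

names = list_images(archive)
print(names)
print(read_member(archive, names[0]))
```

With the real data, `zip_path` would be a file path such as one of the `.zip` files produced above, and the returned bytes could be passed to `PIL.Image.open` through a `BytesIO` wrapper.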

### Download Metadata 
The images used to train the encoder were taken from the full, un-annotated dataset. Use this code to download the image IDs and URLs for the full dataset (3.1 gigabytes) and for the annotated dataset (609 megabytes).

In [None]:
!wget -c -P /tf/open_images/metadata/ 'https://storage.googleapis.com/openimages/2018_04/image_ids_and_rotation.csv'
!wget -c -P /tf/open_images/metadata/ 'https://storage.googleapis.com/openimages/2018_04/train/train-images-boxable-with-rotation.csv'

### Assemble Encoder Training Set and Extended Target Set
This code will get the URLs of images that are not in the original target set, to be used for training the encoder (and for additional targets):

In [None]:
import requests
import hashlib
import os, sys
import pandas as pd
from PIL import Image
from io import BytesIO
from multiprocessing import Pool

In [None]:
def mkdirs(subset):
    # Create /tf/open_images/<subset>/images/ and 256 shard directories (00 to ff),
    # one for each two-character hex prefix of an image ID.
    root = '/tf/open_images/%s/images' % subset
    shards = ['%s/%02x' % (root, i) for i in range(256)]
    for path in [os.path.dirname(root), root] + shards:
        if not os.path.exists(path):
            os.mkdir(path)
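
The directory layout created by `mkdirs` shards images by the first two characters of their ID. The sketch below shows the mapping from an ID to its file path. It assumes IDs are lowercase hex strings, as the `%02x` directory names imply; the helper name `shard_path` and the example ID are hypothetical.

```python
def shard_path(subset, img_id, root="/tf/open_images"):
    """Return the path where an image with this ID would be saved."""
    prefix = img_id[:2]  # two hex characters, so one of 16 * 16 = 256 shards
    return "%s/%s/images/%s/%s.jpg" % (root, subset, prefix, img_id)

# All possible shard names, in the same format mkdirs uses.
prefixes = ["%02x" % i for i in range(256)]

example_id = "0a1b2c3d4e5f6789"  # made-up ID for illustration
print(shard_path("train", example_id))
print(len(prefixes), prefixes[0], prefixes[-1])
```

Sharding keeps any single directory from holding millions of files, which many filesystems and tools handle poorly.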

In [None]:
full_set = pd.read_csv('/tf/open_images/metadata/image_ids_and_rotation.csv').set_index("ImageID")
target_set = pd.read_csv('/tf/open_images/metadata/train-images-boxable-with-rotation.csv').set_index("ImageID")
unused = full_set[(full_set.Subset == 'train') & ~full_set.index.isin(target_set.index)]
train_set = unused[:1200000]
extended_target_set = unused[1200000:]
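
The split sizes quoted in this document can be cross-checked against each other. The sketch below only does arithmetic on those figures. It assumes that every annotated (target) image carries `Subset == 'train'` in the full metadata file, so that the three splits together account for all `train` rows.

```python
# Figures quoted in this document.
FULL_SET = 9178275         # all image URLs
TARGET_SET = 1743042       # annotated images hosted by the CVDF
TRAIN_ATTEMPTED = 1200000  # first slice of the unused images
EXTENDED_TARGETS = 6068177 # remaining unused images

# Under the assumption above, these three splits partition the 'train' rows.
train_rows = TARGET_SET + TRAIN_ATTEMPTED + EXTENDED_TARGETS
unused_rows = TRAIN_ATTEMPTED + EXTENDED_TARGETS
non_train_rows = FULL_SET - train_rows

print("rows with Subset == 'train':", train_rows)
print("rows left after removing targets:", unused_rows)
print("rows in other subsets:", non_train_rows)
```

If `len(unused)` in the notebook differs from the second figure, the metadata files probably differ from the versions used here.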

### Download Encoder Training Set
This code will download, resize, and save images from the training set. It will attempt to download 1,200,000 images. The final number will be lower because some URLs point to images that are no longer available, and any image whose MD5 checksum does not match the metadata is skipped.

**Warning**: This will probably take at least a full day to complete.

In [None]:
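import base64

mkdirs('train')
subset = 'train'

def download(item):
    # `item` is one (ImageID, row) pair produced by DataFrame.iterrows().
    img_id, img_meta = item
    try:
        # The 60-second timeout is an arbitrary guard against stalled connections.
        resp = requests.get(img_meta.OriginalURL, timeout=60)
        img_data = resp.content

        # OriginalMD5 is compared against the base64-encoded MD5 digest of the file.
        md5 = base64.b64encode(hashlib.md5(img_data).digest()).decode('ascii')
        if md5 != img_meta.OriginalMD5:
            return False

        image = Image.open(BytesIO(img_data))
        # Shrink in place so neither side exceeds 1024 pixels, keeping the aspect ratio.
        image.thumbnail([1024, 1024])

        img_prefix = img_id[:2]
        filename = '/tf/open_images/%s/images/%s/%s.jpg' % (subset, img_prefix, img_id)
        # JPEG cannot store palette or alpha modes, so convert to RGB before saving.
        image.convert('RGB').save(filename)
    except Exception:
        # A dead URL or unreadable image should not abort the whole pool.map call.
        return False

    return True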
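
The integrity check in `download` compares a base64-encoded MD5 digest with the `OriginalMD5` column. The sketch below isolates that encoding step in Python 3 using only the standard library. The helper names `md5_base64` and `matches` are hypothetical.

```python
import base64
import hashlib

def md5_base64(data):
    """Return the MD5 digest of `data` as a base64 text string."""
    digest = hashlib.md5(data).digest()  # 16 raw bytes
    return base64.b64encode(digest).decode("ascii")

def matches(data, expected_md5):
    """Return True if `data` hashes to the expected base64 MD5 string."""
    return md5_base64(data) == expected_md5.strip()

empty_md5 = md5_base64(b"")
print(empty_md5)
print(matches(b"", empty_md5), matches(b"changed", empty_md5))
```

A 16-byte digest always encodes to 24 base64 characters, so a value of any other length in the metadata would indicate a different encoding.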

In [None]:
step = 100000
pool = Pool()
try:
    for start in range(0, len(train_set), step):
        print(start)  # progress marker: index of the first row in this batch
        download_set = train_set[["OriginalURL","OriginalMD5"]][start : start+step].iterrows()
        checks = pool.map(download, download_set)
finally:
    pool.close()
    pool.join()
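
The loop above overwrites `checks` on every batch, so the number of successful downloads is not recorded. One possible way to keep a running tally is sketched below. The function `run_in_batches` and the stand-in worker `fake_download` are hypothetical; the built-in `map` is used by default so the example runs without a process pool.

```python
def run_in_batches(items, worker, step, map_fn=map):
    """Apply `worker` to `items` in batches of `step`.

    Returns (succeeded, attempted), where a truthy result counts as a success.
    """
    succeeded = 0
    attempted = 0
    for start in range(0, len(items), step):
        batch = items[start:start + step]
        checks = list(map_fn(worker, batch))
        succeeded += sum(1 for ok in checks if ok)
        attempted += len(checks)
        print("%d of %d succeeded so far" % (succeeded, attempted))
    return succeeded, attempted

def fake_download(n):
    # Stand-in for the real download function: "fails" on multiples of 3.
    return n % 3 != 0

result = run_in_batches(list(range(10)), fake_download, step=4)
print(result)
```

In the notebook, `map_fn=pool.map` could be passed together with the real `download` function, provided the rows are first collected into a list so they can be sliced.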

### Download Extended Target Set
This code will download, resize, and save images from the extended target set. It will attempt to download 6,068,177 images. The final number will be lower because some URLs point to images that are no longer available, and any image whose MD5 checksum does not match the metadata is skipped.

**Warning**: This will probably take at least several days to complete.

In [None]:
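import base64

mkdirs('extended_targets')
subset = 'extended_targets'

def download(item):
    # `item` is one (ImageID, row) pair produced by DataFrame.iterrows().
    img_id, img_meta = item
    try:
        # The 60-second timeout is an arbitrary guard against stalled connections.
        resp = requests.get(img_meta.OriginalURL, timeout=60)
        img_data = resp.content

        # OriginalMD5 is compared against the base64-encoded MD5 digest of the file.
        md5 = base64.b64encode(hashlib.md5(img_data).digest()).decode('ascii')
        if md5 != img_meta.OriginalMD5:
            return False

        image = Image.open(BytesIO(img_data))
        # Shrink in place so neither side exceeds 1024 pixels, keeping the aspect ratio.
        image.thumbnail([1024, 1024])

        img_prefix = img_id[:2]
        filename = '/tf/open_images/%s/images/%s/%s.jpg' % (subset, img_prefix, img_id)
        # JPEG cannot store palette or alpha modes, so convert to RGB before saving.
        image.convert('RGB').save(filename)
    except Exception:
        # A dead URL or unreadable image should not abort the whole pool.map call.
        return False

    return True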

In [None]:
step = 100000
pool = Pool()
try:
    for start in range(0, len(extended_target_set), step):
        print(start)  # progress marker: index of the first row in this batch
        download_set = extended_target_set[["OriginalURL","OriginalMD5"]][start : start+step].iterrows()
        checks = pool.map(download, download_set)
finally:
    pool.close()
    pool.join()
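
Once a download run has finished, the number of images that actually arrived can be measured by counting files on disk. The sketch below walks a sharded image directory and counts `.jpg` files. It demonstrates the idea on a temporary directory with made-up file names; the helper name `count_images` is hypothetical.

```python
import os
import tempfile

def count_images(images_dir, suffix=".jpg"):
    """Count files ending in `suffix` anywhere under `images_dir`."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(images_dir):
        total += sum(1 for name in filenames if name.endswith(suffix))
    return total

# Build a tiny stand-in for the sharded layout (hypothetical IDs).
demo_dir = tempfile.mkdtemp()
for prefix, img_id in [("00", "00aa"), ("ff", "ffbb")]:
    os.makedirs(os.path.join(demo_dir, prefix), exist_ok=True)
    open(os.path.join(demo_dir, prefix, img_id + ".jpg"), "wb").close()

print(count_images(demo_dir))
```

For the real data, `images_dir` would be a path such as `/tf/open_images/train/images` or `/tf/open_images/extended_targets/images`.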