OpenImages V4
=============

https://storage.googleapis.com/openimages/web/download_v4.html

The full set consists of 9,178,275 image URLs. Most point to Flickr, and some are no longer available.

#### Target Set
Of the full dataset, 1,743,042 images are annotated and hosted by the [CVDF](https://github.com/cvdfoundation/open-images-dataset). They are labeled as "training data", but our encoder was not trained on these images; instead, we held them out to use as a target database for our simulations and wetlab experiments.

#### Validation Set
The CVDF also hosts a validation set consisting of 41,620 images, which we used to track the performance of the encoder during training.

#### Training Set
To build our encoder training set, we subtracted the target set from the full dataset, and downloaded whatever was available out of the first 1,200,000 remaining images.

### Download Target and Validation Sets

Run the following code to download the target set (513 gigabytes) and validation set (12 gigabytes). It will take some time:

In [None]:
!aws s3 --no-sign-request sync \
    s3://open-images-dataset/tar/ \
    /tf/open_images/targets/images/ \
    --exclude "*" --include "train_*.tar.gz"
    
!aws s3 --no-sign-request sync \
    s3://open-images-dataset/tar/ \
    /tf/open_images/validation/images/ \
    --exclude "*" --include "validation.tar.gz"

### Download Metadata 
The images used to train the encoder were taken from the full, un-annotated dataset. Use this code to download the image IDs and URLs for the full dataset (3.1 gigabytes), and the annotated dataset (609 megabytes).

In [None]:
!wget -c -P /tf/open_images/metadata/ 'https://storage.googleapis.com/openimages/2018_04/image_ids_and_rotation.csv'
!wget -c -P /tf/open_images/metadata/ 'https://storage.googleapis.com/openimages/2018_04/train/train-images-boxable-with-rotation.csv'

### Assemble Encoder Training Set
This code selects the metadata rows (including URLs) of images in the `train` subset that are not in the target set, to be used for training the encoder:

In [1]:
import pandas as pd

full_set = pd.read_csv('/tf/open_images/metadata/image_ids_and_rotation.csv').set_index("ImageID")
target_set = pd.read_csv('/tf/open_images/metadata/train-images-boxable-with-rotation.csv').set_index("ImageID")
train_set = full_set[(full_set.Subset == 'train') & ~full_set.index.isin(target_set.index)]
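
To make the filter above concrete, here is a minimal sketch of the same expression applied to a toy table. The IDs and URLs are made up for illustration; only the column names `ImageID`, `Subset`, and `OriginalURL` come from the real metadata files.

```python
import pandas as pd

# Hypothetical stand-in for image_ids_and_rotation.csv (four rows instead of millions).
full_set = pd.DataFrame({
    "ImageID": ["a1", "b2", "c3", "d4"],
    "Subset": ["train", "train", "validation", "train"],
    "OriginalURL": ["https://example.com/%s.jpg" % s for s in ["a1", "b2", "c3", "d4"]],
}).set_index("ImageID")

# Hypothetical stand-in for train-images-boxable-with-rotation.csv (the target set).
target_set = pd.DataFrame({"ImageID": ["b2"]}).set_index("ImageID")

# Keep rows labeled 'train' whose ID does not appear in the target set.
train_set = full_set[(full_set.Subset == "train") & ~full_set.index.isin(target_set.index)]

print(list(train_set.index))
```

In this toy case `b2` is dropped because it belongs to the target set, and `c3` is dropped because it is not in the `train` subset.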

### Download Encoder Training Set
This code will download, resize, and save images from the training set. It will attempt to download 1,200,000 images. The final count will be lower because some URLs point to images that are no longer available, and because any file whose MD5 checksum does not match the metadata is discarded.

In [2]:
import requests
import hashlib
import os
from PIL import Image
from io import BytesIO

import sys
from primo.tools.multiprogress import ProgressPool

In [3]:
# Create one subdirectory per two-character hex prefix (00 through ff).
# os.makedirs also creates any missing parent directories.
for i in range(256):
    path = '/tf/open_images/train/images/%02x' % i
    if not os.path.exists(path):
        os.makedirs(path)
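
The directory layout above spreads images across 256 subdirectories keyed by the first two characters of the image ID. The sketch below shows the idea in a temporary directory. It assumes image IDs are lowercase hexadecimal strings; `shard_path` and the example ID are hypothetical names used only for illustration.

```python
import os
import tempfile

def shard_path(root, img_id):
    """Return the path for an image, sharded by the first two characters of its ID."""
    return os.path.join(root, img_id[:2], img_id + ".jpg")

# Use a throwaway directory instead of /tf/open_images/train/images.
root = tempfile.mkdtemp()

# One subdirectory per two-character hex prefix: 00, 01, ..., ff.
for i in range(256):
    path = os.path.join(root, "%02x" % i)
    if not os.path.isdir(path):
        os.makedirs(path)

example_id = "0a1b2c3d4e5f6071"  # made-up ID, not a real image
target = shard_path(root, example_id)
print(target)
```

Sharding this way keeps any single directory from holding more than a small fraction of the files, which many filesystems and tools handle more comfortably than one very large directory.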

In [4]:
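import base64

def download(item):
    # Each item is an (image ID, metadata row) pair produced by iterrows().
    img_id, img_meta = item

    try:
        # The 30-second timeout is an arbitrary choice to avoid hanging on dead hosts.
        resp = requests.get(img_meta.OriginalURL, timeout=30)
        img_data = resp.content

        # Compare against OriginalMD5, the base64-encoded MD5 digest of the file.
        md5 = base64.b64encode(hashlib.md5(img_data).digest()).decode('ascii')
        if md5 != img_meta.OriginalMD5:
            return False

        # Shrink in place so neither side exceeds 1024 pixels, keeping the aspect ratio.
        image = Image.open(BytesIO(img_data))
        image.thumbnail([1024, 1024])

        # Shard by the first two characters of the image ID.
        img_prefix = img_id[:2]
        filename = '/tf/open_images/train/images/%s/%s.jpg' % (img_prefix, img_id)
        image.save(filename)
    except (requests.RequestException, IOError):
        # Treat unreachable URLs and unreadable images as failed downloads.
        return False

    return True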
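
The integrity check in `download` compares a base64-encoded MD5 digest of the downloaded bytes with the checksum from the metadata. A minimal, self-contained sketch of that comparison follows. The helper name `md5_base64` and the sample payload are illustrative assumptions, not part of the original notebook.

```python
import base64
import hashlib

def md5_base64(data):
    """Return the base64-encoded MD5 digest of a bytes object as text."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

# Pretend this checksum came from the metadata for a file containing b"hello world".
payload = b"hello world"
expected = md5_base64(payload)

# An intact download matches; altered content (for example, a placeholder
# image served in place of a deleted photo) does not.
intact_ok = md5_base64(payload) == expected
tampered_ok = md5_base64(b"image no longer available") == expected

print(expected, intact_ok, tampered_ok)
```

This is why a URL that still responds can nonetheless count as a failed download: the bytes it returns are no longer the original image.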

In [5]:
download_size = 1200000
download_set = list(train_set[["OriginalURL","OriginalMD5"]][:download_size].iterrows())
del full_set, target_set, train_set

In [None]:
pool = ProgressPool()
try:
    checks = pool.map(download, download_set)
finally:
    pool.close()
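
Since `download` returns `True` or `False` for each image, the list returned by `pool.map` can be summed to count successes. The sketch below illustrates this with the standard library's `ThreadPool` as a stand-in for `ProgressPool` (whose interface is not shown here) and a fake download function that performs no network access. Both `fake_download` and the example URLs are hypothetical.

```python
from multiprocessing.pool import ThreadPool

def fake_download(item):
    """Stand-in for download(): pretend only even-numbered items succeed."""
    index, _url = item
    return index % 2 == 0

# Ten made-up work items in place of the 1,200,000 (ID, metadata) pairs.
items = [(i, "https://example.com/%d.jpg" % i) for i in range(10)]

pool = ThreadPool(4)
try:
    checks = pool.map(fake_download, items)
finally:
    pool.close()
    pool.join()

# True counts as 1 and False as 0, so the sum is the number of successes.
succeeded = sum(checks)
print("%d of %d downloads succeeded" % (succeeded, len(checks)))
```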
