In this notebook, we load a small subset of the **Visual Genome** dataset. All **Visual Genome** datasets are sourced from [here](https://homes.cs.washington.edu/~ranjay/visualgenome/api.html). We pull the first 1% of the training split (around 1,000 images) using Hugging Face's `datasets` package, and download the `Objects` dataset from the [original source](https://homes.cs.washington.edu/~ranjay/visualgenome/api.html).

In [2]:
from datasets import load_dataset
import aiohttp
import time

start = time.time()

dataset = load_dataset("visual_genome", "region_descriptions_v1.2.0", split="train[:1%]",
    storage_options={'client_kwargs': {'timeout': aiohttp.ClientTimeout(total=900)}})

print(f"This took {(time.time() - start) / 60:.2f} minutes.")

This took 0.01 minutes.


In [3]:
len(dataset)

1081
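
Note that `split="train[:1%]"` takes the first 1% of the training split in its stored order, so these 1,081 rows are the earliest images rather than a random draw. If a random subset is preferred, one option is to draw row indices with a fixed seed and pass them to the dataset's `select` method. The sketch below only computes the indices; `random_indices`, `n_total`, and `seed` are illustrative names, and the total of 108,077 rows is an assumption, not a value read from the dataset.

```python
import random

def random_indices(n_total: int, sample_size: int, seed: int = 0) -> list[int]:
    """Return `sample_size` distinct row indices drawn uniformly from range(n_total)."""
    rng = random.Random(seed)  # a fixed seed keeps the sample reproducible
    return sorted(rng.sample(range(n_total), sample_size))

# Assumed total row count; in practice use len(full_dataset)
n_total = 108_077
indices = random_indices(n_total, sample_size=1_000, seed=42)

# With the full split loaded, the subset could then be taken with:
# sample = full_dataset.select(indices)
```

Loading the full split first is slower than slicing, which is a reasonable argument for keeping the simple `train[:1%]` slice when randomness does not matter.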

In [43]:
dataset[1].keys()

dict_keys(['image', 'image_id', 'url', 'width', 'height', 'coco_id', 'flickr_id', 'regions'])

The `url` column holds a link to the raw image for each row.

In [28]:
dataset[1]['url']

'https://cs.stanford.edu/people/rak248/VG_100K/2.jpg'
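
The `image` column already contains decoded images, so downloading is only needed when working from the URLs alone (for example, after exporting to CSV). As a minimal sketch, the helper below fetches one image by URL with the standard library; `image_filename`, `download_image`, and the `../data/images` directory are hypothetical names chosen for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

def image_filename(url: str) -> str:
    """Return the last path component of an image URL, e.g. '2.jpg'."""
    return Path(urlparse(url).path).name

def download_image(url: str, dest_dir: str = "../data/images") -> Path:
    """Download `url` into `dest_dir` unless the file already exists; return the local path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / image_filename(url)
    if not target.exists():
        urlretrieve(url, str(target))  # performs a network request
    return target

# Example (not executed here because it needs network access):
# download_image('https://cs.stanford.edu/people/rak248/VG_100K/2.jpg')
```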

In [29]:
import pandas as pd

# convert the Hugging Face dataset into a pandas DataFrame
df = pd.DataFrame(dataset)

In [30]:
df.head()

Unnamed: 0,image,image_id,url,width,height,coco_id,flickr_id,regions
0,<PIL.JpegImagePlugin.JpegImageFile image mode=...,1,https://cs.stanford.edu/people/rak248/VG_100K_...,800,600,,,"[{'region_id': 1382, 'image_id': 1, 'phrase': ..."
1,<PIL.JpegImagePlugin.JpegImageFile image mode=...,2,https://cs.stanford.edu/people/rak248/VG_100K/...,800,600,,,"[{'region_id': 1387, 'image_id': 2, 'phrase': ..."
2,<PIL.Image.Image image mode=RGB size=640x480 a...,3,https://cs.stanford.edu/people/rak248/VG_100K/...,640,480,,,"[{'region_id': 1392, 'image_id': 3, 'phrase': ..."
3,<PIL.JpegImagePlugin.JpegImageFile image mode=...,4,https://cs.stanford.edu/people/rak248/VG_100K/...,640,480,,,"[{'region_id': 1397, 'image_id': 4, 'phrase': ..."
4,<PIL.JpegImagePlugin.JpegImageFile image mode=...,5,https://cs.stanford.edu/people/rak248/VG_100K/...,800,600,,,"[{'region_id': 1402, 'image_id': 5, 'phrase': ..."


In [31]:
# we don't need any columns other than image_id and url
df = df[['image_id', 'url']]
df.head()

Unnamed: 0,image_id,url
0,1,https://cs.stanford.edu/people/rak248/VG_100K_...
1,2,https://cs.stanford.edu/people/rak248/VG_100K/...
2,3,https://cs.stanford.edu/people/rak248/VG_100K/...
3,4,https://cs.stanford.edu/people/rak248/VG_100K/...
4,5,https://cs.stanford.edu/people/rak248/VG_100K/...


The `objects` JSON file was downloaded from [here](https://homes.cs.washington.edu/~ranjay/visualgenome/api.html). It lists the objects annotated in each image, which we use as image tags.

In [32]:
import pandas as pd

df_obj = pd.read_json('../data/objects.json')
df_obj.head()

Unnamed: 0,image_id,objects,image_url,merged_object_ids
0,1,"[{'synsets': ['tree.n.01'], 'h': 557, 'object_...",https://cs.stanford.edu/people/rak248/VG_100K_...,
1,2,"[{'synsets': ['road.n.01'], 'h': 254, 'object_...",https://cs.stanford.edu/people/rak248/VG_100K_...,
2,3,"[{'synsets': ['booth.n.02'], 'h': 389, 'object...",https://cs.stanford.edu/people/rak248/VG_100K_...,
3,4,"[{'synsets': ['floor.n.01'], 'h': 168, 'object...",https://cs.stanford.edu/people/rak248/VG_100K_...,
4,5,"[{'synsets': ['room.n.01'], 'h': 599, 'object_...",https://cs.stanford.edu/people/rak248/VG_100K_...,


In [None]:
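# keep only the columns we need; .copy() avoids a SettingWithCopyWarning when adding columns later
df_obj = df_obj[['image_id', 'objects']].copy()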

In [41]:
# extract the list of image tags from the `objects` column,
# using the first entry of each object's `names` list as its tag
df_obj['tags'] = df_obj['objects'].apply(lambda objs: [obj['names'][0] for obj in objs])
df_obj.head()

Unnamed: 0,image_id,objects,image_url,merged_object_ids,tags
0,1,"[{'synsets': ['tree.n.01'], 'h': 557, 'object_...",https://cs.stanford.edu/people/rak248/VG_100K_...,,"[trees, sidewalk, building, street, wall, tree..."
1,2,"[{'synsets': ['road.n.01'], 'h': 254, 'object_...",https://cs.stanford.edu/people/rak248/VG_100K_...,,"[road, sidewalk, building, building, street li..."
2,3,"[{'synsets': ['booth.n.02'], 'h': 389, 'object...",https://cs.stanford.edu/people/rak248/VG_100K_...,,"[cubicles, table, desktop, desk, wall, dividin..."
3,4,"[{'synsets': ['floor.n.01'], 'h': 168, 'object...",https://cs.stanford.edu/people/rak248/VG_100K_...,,"[floor, carpet, curtains, patio, window, table..."
4,5,"[{'synsets': ['room.n.01'], 'h': 599, 'object_...",https://cs.stanford.edu/people/rak248/VG_100K_...,,"[room, area, wall, wall, shelf, books, chair, ..."
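
To make the extraction step concrete, here is a small sketch on a hand-written record. The record is hypothetical: it mimics the shape of one entry in the `objects` column but its values are made up, and the object with an empty `names` list is an assumed edge case that may never occur in the real file. The guard simply shows how such a case could be skipped instead of raising an `IndexError`.

```python
# Hypothetical record mimicking the shape of one entry in the `objects` column
objects = [
    {"synsets": ["tree.n.01"], "object_id": 1, "names": ["trees"]},
    {"synsets": ["sidewalk.n.01"], "object_id": 2, "names": ["sidewalk", "pavement"]},
    {"synsets": [], "object_id": 3, "names": []},  # assumed edge case: no name
]

def first_names(objs):
    """Return the first name of each object, skipping objects without names."""
    return [obj["names"][0] for obj in objs if obj.get("names")]

tags = first_names(objects)
print(tags)  # ['trees', 'sidewalk']
```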


In [35]:
# combine both datasets on image_id (inner join by default)
df = df.merge(df_obj[['image_id', 'tags']], on='image_id')
df.shape

(1081, 3)
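
The merged frame has the same number of rows as the sample (1,081), which suggests every image found a match. Because an inner join drops unmatched rows silently, a more explicit check can be useful. The sketch below uses toy frames (`left`, `right`, and their values are made up) with pandas' `validate` and `indicator` options to surface duplicate keys and unmatched rows.

```python
import pandas as pd

# Toy stand-ins for the image sample and the tags table
left = pd.DataFrame({"image_id": [1, 2, 3], "url": ["u1", "u2", "u3"]})
right = pd.DataFrame({"image_id": [1, 2, 4], "tags": [["tree"], ["road"], ["floor"]]})

# how="left" keeps every image; validate raises MergeError if image_id repeats on either side;
# indicator=True adds a `_merge` column recording where each row came from
merged = left.merge(right, on="image_id", how="left",
                    validate="one_to_one", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "image_id"].tolist()
print(unmatched)  # [3]
```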

In [36]:
df.head()

Unnamed: 0,image_id,url,tags
0,1,https://cs.stanford.edu/people/rak248/VG_100K_...,"[trees, sidewalk, building, street, wall, tree..."
1,2,https://cs.stanford.edu/people/rak248/VG_100K/...,"[road, sidewalk, building, building, street li..."
2,3,https://cs.stanford.edu/people/rak248/VG_100K/...,"[cubicles, table, desktop, desk, wall, dividin..."
3,4,https://cs.stanford.edu/people/rak248/VG_100K/...,"[floor, carpet, curtains, patio, window, table..."
4,5,https://cs.stanford.edu/people/rak248/VG_100K/...,"[room, area, wall, wall, shelf, books, chair, ..."


In [39]:
# drop duplicate tags within each image (a set does not preserve the original tag order)
df['tags'] = df['tags'].apply(lambda x: list(set(x)))
df.head()

Unnamed: 0,image_id,url,tags
0,1,https://cs.stanford.edu/people/rak248/VG_100K_...,"[windows, car, man, trees, back, tree trunk, a..."
1,2,https://cs.stanford.edu/people/rak248/VG_100K/...,"[sign, sidewalk, window, backpack, sneakers, t..."
2,3,https://cs.stanford.edu/people/rak248/VG_100K/...,"[photos, wireless phone, chain, computer tower..."
3,4,https://cs.stanford.edu/people/rak248/VG_100K/...,"[carpet, chair, cloths, colour, frame, drape, ..."
4,5,https://cs.stanford.edu/people/rak248/VG_100K/...,"[windows, chair, colour, paper, cables, color,..."
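
Deduplicating through `set` changes the order of the tags, as the output above shows, and that order can differ between Python sessions. If a stable order matters, one alternative is `dict.fromkeys`, which keeps the first occurrence of each tag. A small sketch, using the first few tags shown earlier for image 5 (`unique_tags` is an illustrative helper name):

```python
def unique_tags(tags):
    """Remove duplicate tags while keeping first-seen order."""
    return list(dict.fromkeys(tags))

tags = ["room", "area", "wall", "wall", "shelf", "books", "chair"]
print(unique_tags(tags))  # ['room', 'area', 'wall', 'shelf', 'books', 'chair']
```

It could be applied to the column with `df['tags'].apply(unique_tags)`.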


In [40]:
# export the dataset; we will use it later
df.to_csv('../data/vgenome_sample_1k.csv')
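
One thing to keep in mind when reloading this file: CSV stores each `tags` list as its string representation, and `to_csv` also writes the index as an unnamed first column. The sketch below shows the round trip on a toy frame using an in-memory buffer; `df_demo` and its values are made up, and parsing with `ast.literal_eval` is one possible approach rather than the only one.

```python
import ast
import io

import pandas as pd

df_demo = pd.DataFrame({
    "image_id": [1, 2],
    "url": ["u1", "u2"],
    "tags": [["tree", "car"], ["road"]],
})

buffer = io.StringIO()
df_demo.to_csv(buffer)  # same call as above, but into memory

# Naive reload: the tags come back as plain strings
buffer.seek(0)
naive = pd.read_csv(buffer, index_col=0)
print(type(naive.loc[0, "tags"]))  # <class 'str'>

# Reload with a converter that parses each string back into a list
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0,
                       converters={"tags": ast.literal_eval})
print(restored.loc[0, "tags"])  # ['tree', 'car']
```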