In this notebook, we load a small subset of the **Visual Genome** dataset. All **Visual Genome** datasets are available [here](https://homes.cs.washington.edu/~ranjay/visualgenome/api.html). We need only one of them, `Objects`, a `JSON` file that should be downloaded into the `data` directory (the code below reads it from `../data/objects.json`).

Once downloaded, we can read it into a `pandas` dataframe.

### Read data

In [1]:
import pandas as pd

df = pd.read_json('../data/objects.json')

# we will take a small random sample for this exercise
df = df.sample(n=1000)
df.head()

Unnamed: 0,image_id,objects,image_url,merged_object_ids
53559,2366867,"[{'synsets': ['bird.n.01'], 'h': 17, 'object_i...",http://crowdfile.blob.core.chinacloudapi.cn/46...,
41981,2378980,"[{'synsets': ['wheel.n.01'], 'h': 60, 'object_...",http://crowdfile.blob.core.chinacloudapi.cn/46...,
89244,2329504,"[{'synsets': [], 'h': 16, 'object_id': 4673734...",http://crowdfile.blob.core.chinacloudapi.cn/46...,
71785,2347788,"[{'synsets': ['apron.n.01'], 'h': 163, 'object...",http://crowdfile.blob.core.chinacloudapi.cn/46...,
100014,2318253,"[{'synsets': ['bear.n.01'], 'h': 204, 'object_...",http://crowdfile.blob.core.chinacloudapi.cn/46...,
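
Because `sample` draws a new random subset on every run, the rows shown above will differ between runs. A minimal sketch, assuming a reproducible sample is wanted, is to pass `random_state`; the seed value `42` and the toy frame below are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the full Visual Genome frame (hypothetical data)
toy = pd.DataFrame({"image_id": range(10)})

# The same seed yields the same rows on every call
a = toy.sample(n=3, random_state=42)
b = toy.sample(n=3, random_state=42)

same = a.index.equals(b.index)
print(same)
```

Applied to the notebook, this would be `df.sample(n=1000, random_state=42)`.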


### Clean data

In [2]:
# drop the last column, `merged_object_ids` (we don't need it)
df = df.drop(columns='merged_object_ids')

Let's look closely at the `objects` column. Each entry is a list of dictionaries, one per object, but all we need are the values stored under `names`.

In [3]:
df.iloc[1, 1][0]

{'synsets': ['wheel.n.01'],
 'h': 60,
 'object_id': 555958,
 'merged_object_ids': [],
 'names': ['back wheel'],
 'w': 39,
 'y': 245,
 'x': 368}

Let's extract the list of image tags from the `objects` column and store them in a new column.

In [4]:
# each cell holds a list of object dicts; keep the first name of every object
df['tags'] = df['objects'].apply(lambda objects: [obj['names'][0] for obj in objects])

df = df.drop(columns='objects')

df.head()

Unnamed: 0,image_id,image_url,tags
53559,2366867,http://crowdfile.blob.core.chinacloudapi.cn/46...,"[bird, bird, bird, bird, bird, bird, bird, bir..."
41981,2378980,http://crowdfile.blob.core.chinacloudapi.cn/46...,"[back wheel, bus, bus, destination window, dou..."
89244,2329504,http://crowdfile.blob.core.chinacloudapi.cn/46...,"[briefcase, building, building, bus, car, flag..."
71785,2347788,http://crowdfile.blob.core.chinacloudapi.cn/46...,"[apron, bench, bench, bench, bowl, box, child,..."
100014,2318253,http://crowdfile.blob.core.chinacloudapi.cn/46...,"[bear, claws, ear, ear, eye, face, grass, grou..."
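
The extraction above indexes `obj['names'][0]`, which raises an `IndexError` if any object has an empty `names` list. Whether such entries exist in the dataset is not established here, so the following is only a defensive sketch on hypothetical toy data; `first_names` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Toy data mimicking the structure of the `objects` column (hypothetical values)
toy = pd.DataFrame({
    "objects": [
        [{"names": ["back wheel"]}, {"names": ["bus", "coach"]}],
        [{"names": []}, {"names": ["bear"]}],  # first object has no name
    ]
})

def first_names(objects):
    """Return the first name of each object, skipping objects without names."""
    return [obj["names"][0] for obj in objects if obj.get("names")]

toy["tags"] = toy["objects"].apply(first_names)
print(toy["tags"].tolist())
```

Objects with several names contribute only their first name, which matches the behaviour of the original cell.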


In [5]:
# the tag lists contain duplicates (see the output above); keep each tag once
# note: set() does not preserve the original order of the tags
df['tags'] = df['tags'].apply(lambda x: list(set(x)))

df.head()

Unnamed: 0,image_id,image_url,tags
53559,2366867,http://crowdfile.blob.core.chinacloudapi.cn/46...,"[birds, camera, bird, sky, neck, peach sky, me..."
41981,2378980,http://crowdfile.blob.core.chinacloudapi.cn/46...,"[window, line, double door, front windshield, ..."
89244,2329504,http://crowdfile.blob.core.chinacloudapi.cn/46...,"[window, head, man, flag, pole, light, tie, bu..."
71785,2347788,http://crowdfile.blob.core.chinacloudapi.cn/46...,"[headband, kid, child, whisk, visor, container..."
100014,2318253,http://crowdfile.blob.core.chinacloudapi.cn/46...,"[bear, eye, claws, face, grass, mouth, these, ..."
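
Converting to a `set` removes duplicates but loses the original tag order, as the reshuffled lists above show. If the order matters, one possible alternative is `dict.fromkeys`, which keeps the first occurrence of each tag; the sample list below is illustrative.

```python
# Illustrative tag list with repeats
tags = ["bird", "bird", "sky", "bird", "neck", "sky"]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_tags = list(dict.fromkeys(tags))
print(unique_tags)
```

In the notebook this would read `df['tags'].apply(lambda x: list(dict.fromkeys(x)))`.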


### Export data

In [6]:
# reset the index and export the dataset; we will use it later
df.reset_index(drop=True).to_csv('../data/vgenome_sample_1k.csv')
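
One thing to keep in mind when reading this file back: CSV stores each `tags` list as its string representation, so `read_csv` returns strings rather than lists. The sketch below shows one way to restore the lists with `ast.literal_eval`; it uses a toy frame and an in-memory buffer instead of the real file, and is not necessarily how the later notebooks load the data.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column (hypothetical data)
toy = pd.DataFrame({"image_id": [1, 2], "tags": [["bird", "sky"], ["bear"]]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# A plain read returns the lists as strings such as "['bird', 'sky']"
raw = pd.read_csv(io.StringIO(csv_text))

# A converter parses each string back into a Python list
restored = pd.read_csv(io.StringIO(csv_text), converters={"tags": ast.literal_eval})
print(type(raw["tags"][0]), type(restored["tags"][0]))
```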

In the [next notebook](01_prepare_opensearch.ipynb), we will configure and prepare OpenSearch.