# Check Flickr8k

First, we inspect the Flickr8k metadata.

In [None]:
import json
from IPython.display import Image, Audio, display
from collections import Counter
import matplotlib.pyplot as plt

In [None]:
data_dn = '/Users/pbos/projects/spokenLanguage/flickr8k/'

In [None]:
with open(data_dn + 'dataset.json') as fh:
    metadata = json.load(fh)

In [None]:
metadata.keys(), metadata['dataset']

In [None]:
metadata['images']

In [None]:
len(metadata['images'])
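
Beyond the total number of images, it can help to check how many captions each image carries. The sketch below does that on a tiny stand-in dictionary that mirrors only the keys used in this notebook (`images`, `filename`, `sentences`, `raw`). The names `toy_metadata` and `captions_per_image` are hypothetical, and the real `dataset.json` may contain more fields than shown here.

```python
from collections import Counter

# Toy stand-in mirroring only the keys this notebook relies on (assumption:
# the real dataset.json has the same nesting, possibly with extra fields).
toy_metadata = {
    'dataset': 'toy',
    'images': [
        {'filename': 'a.jpg', 'sentences': [{'raw': 'A dog.'}, {'raw': 'A brown dog.'}]},
        {'filename': 'b.jpg', 'sentences': [{'raw': 'A cat.'}, {'raw': 'A black cat.'}]},
    ],
}

def captions_per_image(meta):
    """Return a Counter mapping 'number of captions' to 'number of images'."""
    return Counter(len(img['sentences']) for img in meta['images'])

caption_counts = captions_per_image(toy_metadata)
print(caption_counts)
```

Calling `captions_per_image(metadata)` on the loaded metadata should work the same way, provided the nesting matches the assumption above.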

Let's check whether the captions match the images:

In [None]:
def check_image(ix):
    print([i['raw'] for i in metadata['images'][ix]['sentences']])
    display(Image(data_dn + 'Flickr8k_Dataset/Flicker8k_Dataset/' + metadata['images'][ix]['filename']))

In [None]:
check_image(0)

In [None]:
check_image(300)

# flickr1d

Let's build the flickr1d (1 deca, i.e. 10) dataset out of this, using the first 10 images!

In [None]:
'cp ' + ' '.join([metadata['images'][ix]['filename'] for ix in range(10)]) + ' ' + data_dn + '../../../Flickr8k_Dataset/Flicker8k_Dataset/flickr1d/'
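
The cell above only builds a `cp` command string to run in a shell. As an alternative, the copy could be done directly from Python. This is a minimal sketch, not what was actually run: `copy_files` is a hypothetical helper, and the demo uses throwaway directories with dummy files rather than the real Flickr8k images.

```python
import shutil
import tempfile
from pathlib import Path

def copy_files(filenames, src_dir, dst_dir):
    """Copy each named file from src_dir to dst_dir, creating dst_dir if needed."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in filenames:
        shutil.copy2(src_dir / name, dst_dir / name)
        copied.append(name)
    return copied

# Demo on temporary directories with dummy files (not real Flickr8k images).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'src'
    src.mkdir()
    for name in ['img0.jpg', 'img1.jpg']:
        (src / name).write_bytes(b'dummy')
    copied = copy_files(['img0.jpg', 'img1.jpg'], src, Path(tmp) / 'dst')
    copied_exist = all((Path(tmp) / 'dst' / n).exists() for n in copied)

print(copied, copied_exist)
```

In the notebook, the file names would come from `metadata['images']`, and the source and destination directories would be whatever paths the images actually live in.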

In [None]:
data_dn_1d = '/Users/pbos/projects/spokenLanguage/flickr1d/'

In [None]:
flickr1d_meta = {'dataset': 'flickr1d', 'images': metadata['images'][:10]}

In [None]:
with open(data_dn_1d + 'dataset.json', 'w') as fh:
    json.dump(flickr1d_meta, fh)

## Wav files - wav2capt metadata

In [None]:
flickr_1d_img_filenames = [im['filename'] for im in flickr1d_meta['images']]
with open(data_dn + 'wav2capt.txt', 'r') as fh:
    wav2capt = [line.split() for line in fh if line.split()[1] in flickr_1d_img_filenames]

In [None]:
wav2capt[:5], len(wav2capt), Counter(list(zip(*wav2capt))[1])

Which wav files belong to which captions? Let's check.

In [None]:
Audio(data_dn + 'flickr_audio/wavs/' + wav2capt[0][0])

This says "a dog is carrying something pink in its mouth while walking through the snow" in a female voice.

In [None]:
Audio(data_dn + 'flickr_audio/wavs/' + wav2capt[1][0])

This says in a different female voice "a brown dog is holding a pink shirt in the snow".

Both are clearly about image 2. The first file is labeled "#3" and the second "#2". Read as zero-based indices into that image's sentence list, these labels point at `sentences[3]` and `sentences[2]`, which are the two captions printed here for comparison:

In [None]:
flickr1d_meta['images'][2]['sentences'][2]['raw'], flickr1d_meta['images'][2]['sentences'][3]['raw']
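
The manual comparison above can be automated. The sketch below assumes each `wav2capt` row has the form `[wav_name, image_name, '#k']`, with `k` a zero-based index into that image's `sentences` list. The cells above suggest this but do not prove it. `caption_for_wav`, `toy_meta` and `toy_row` are hypothetical names used for illustration only.

```python
def caption_for_wav(row, meta):
    """Look up the raw caption that a wav2capt row points at.

    Assumes row == [wav_name, image_name, '#k'] with k zero-based.
    """
    wav_name, image_name, label = row
    index = int(label.lstrip('#'))
    by_filename = {img['filename']: img for img in meta['images']}
    return by_filename[image_name]['sentences'][index]['raw']

# Toy data in the same shape as the notebook's metadata (not real captions).
toy_meta = {
    'images': [
        {'filename': 'x.jpg',
         'sentences': [{'raw': 'zero'}, {'raw': 'one'}, {'raw': 'two'}]},
    ]
}
toy_row = ['x_2.wav', 'x.jpg', '#2']

looked_up = caption_for_wav(toy_row, toy_meta)
print(looked_up)
```

With the real data, `caption_for_wav(wav2capt[0], flickr1d_meta)` would give the text to compare against what is heard in the corresponding recording.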

What about which speakers say which things?

In [None]:
Audio(data_dn + 'flickr_audio/wavs/' + wav2capt[4][0])

In [None]:
Audio(data_dn + 'flickr_audio/wavs/' + wav2capt[5][0])

There is no apparent correspondence: all four recordings seem to come from different speakers, so the wav2capt entries apparently carry no speaker information.

In [None]:
with open(data_dn_1d + 'wav2capt.txt', 'w') as fh:
    for line in wav2capt:
        fh.write(' '.join(line) + '\n')

The speaker info is stored separately in wav2spk.txt, so let's subset that file as well.

In [None]:
# wav2capt was already filtered down to the flickr1d images above
flickr_1d_wav_filenames = [wav[0] for wav in wav2capt]
with open(data_dn + 'wav2spk.txt', 'r') as fh:
    wav2spk = [line.split() for line in fh if line.split()[0] in flickr_1d_wav_filenames]

In [None]:
wav2spk[:5], len(wav2spk), Counter(list(zip(*wav2spk))[1])

So this subset is clearly not balanced across speakers. I'm not sure how important speaker info is, though. What does the full dataset look like?

In [None]:
with open(data_dn + 'wav2spk.txt', 'r') as fh:
    full_speaker_distribution = Counter([line.split()[1] for line in fh])

In [None]:
plt.bar(list(full_speaker_distribution.keys()), list(full_speaker_distribution.values()))

The full dataset is not balanced either, so it probably doesn't matter; in this sense the subset is actually representative of Flickr8k.
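
Raw counts are hard to compare between a 10-image subset and the full dataset, so relative frequencies may be more informative. Here is a minimal sketch with made-up counts: `relative_frequencies`, `toy_full` and `toy_subset` are hypothetical, and the numbers are not taken from Flickr8k.

```python
from collections import Counter

def relative_frequencies(counter):
    """Convert absolute counts into fractions that sum to 1."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}

# Made-up speaker counts, for illustration only.
toy_full = Counter({'1': 60, '2': 30, '3': 10})
toy_subset = Counter({'1': 3, '2': 2})

full_share = relative_frequencies(toy_full)
subset_share = relative_frequencies(toy_subset)

# Print both shares per speaker; speakers absent from the subset get 0.0.
for spk in sorted(full_share):
    print(spk, round(full_share[spk], 2), round(subset_share.get(spk, 0.0), 2))
```

In the notebook, the two inputs would be `full_speaker_distribution` and `Counter(list(zip(*wav2spk))[1])`.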

In [None]:
with open(data_dn_1d + 'wav2spk.txt', 'w') as fh:
    for line in wav2spk:
        fh.write(' '.join(line) + '\n')

Finally, print the shell command that copies over the actual wavs:

In [None]:
# data_dn and data_dn_1d already end in '/', so no extra separator is needed
print(f"mkdir -p {data_dn_1d}flickr_audio/wavs/; cp {data_dn}flickr_audio/wavs/{{{','.join([wv[0] for wv in wav2spk])}}} {data_dn_1d}flickr_audio/wavs/")
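
After running the printed command in a shell, it may be worth confirming that every referenced wav actually arrived. The sketch below shows one way to do that. `missing_files` is a hypothetical helper, and the demo uses a temporary directory with an empty dummy file instead of the real audio.

```python
import tempfile
from pathlib import Path

def missing_files(filenames, directory):
    """Return the names from filenames that are not regular files in directory."""
    directory = Path(directory)
    return [name for name in filenames if not (directory / name).is_file()]

# Demo: only 'a.wav' exists, so 'b.wav' should be reported as missing.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'a.wav').write_bytes(b'')
    missing = missing_files(['a.wav', 'b.wav'], tmp)

print(missing)
```

In the notebook this would be called with `[wv[0] for wv in wav2spk]` and the `flickr_audio/wavs/` directory under `data_dn_1d`. An empty list means the copy was complete.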