# Introduction
In this notebook I consume the `Flickr dataset` with `tf.data.Dataset`. The result can serve as a data pipeline on which others can build their models. The pipeline is built from scratch, starting from the raw CSV file and the image paths, so users have a lot of flexibility to adapt it.

Each element of the dataset has two components:
- Image - (height, width, channels)
- Comments - (5, seq_length)

# Imports
The global imports are as follows:
- tensorflow
- matplotlib
- pandas

In [None]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import matplotlib.pyplot as plt
import pandas as pd

# Utility functions
- `load_img`: Loads an image from the given path
- `configure_dataset`: Caches the dataset and prefetches its elements

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE

def load_img(image_path):
    '''
    Load an image from the given path.
    inputs:
    image_path = The path of the image
    outputs:
    the decoded image as a float32 tf.Tensor with values in [0, 1]
    '''
    # parse image
    image = tf.io.read_file(image_path)
    image = tf.image.decode_image(image)
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image

def configure_dataset(dataset):
    return dataset.cache().prefetch(buffer_size=AUTOTUNE)

# Data

In [None]:
data_dir = '../input/flickr-image-dataset/flickr30k_images'
image_dir = f'{data_dir}/flickr30k_images'
csv_file = f'{data_dir}/results.csv'

## DataFrame
Load `results.csv` as a dataframe.

While building the notebook I came across a problem with the `csv` file: the entry at index `19999` is malformed. This is why the values at that index are hard-coded below. Fixing the row here keeps the later code simpler.

In [None]:
df = pd.read_csv(csv_file, delimiter='|')
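# On inspection, the row at index 19999 turned out to be malformed.
# Use .loc to assign in place; chained indexing (df[col][row] = ...) may
# modify a copy and raises SettingWithCopyWarning.
df.loc[19999, ' comment_number'] = ' 4'
df.loc[19999, ' comment'] = ' A dog runs across the grass .'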
df['image_name'] = image_dir+'/'+df['image_name']
df.head(5)
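
A row like this could also be located programmatically instead of by inspection. The sketch below uses a tiny in-memory CSV with made-up rows. It assumes the malformed row is missing a `|` delimiter, so its last field comes back empty; the real file may be broken in a different way.

```python
import io
import pandas as pd

# Hypothetical miniature of results.csv; the third data row is missing a '|'.
sample_csv = (
    "image_name| comment_number| comment\n"
    "a.jpg| 0| A man rides a bike .\n"
    "a.jpg| 1| A cyclist on a road .\n"
    "b.jpg| 0   A dog runs across the grass .\n"
)

toy_df = pd.read_csv(io.StringIO(sample_csv), delimiter='|')

# Rows whose comment failed to parse are candidates for manual repair.
bad_rows = toy_df[toy_df[' comment'].isna()]
bad_index = bad_rows.index.tolist()
print(bad_index)
```

Running the same `isna()` filter on the real `df` before applying the fix would list every row that needs attention, not only the one found by hand.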

> Let us get some information from the dataset

In [None]:
print(f'[INFO] The shape of dataframe: {df.shape}')
print(f'[INFO] The columns in the dataframe: {df.columns}')
print(f'[INFO] Unique rows: {len(pd.unique(df["image_name"]))}')

> There are `31,783` unique images but `158,915` rows in the dataframe. A closer look shows that each image has 5 comments, which is why there are 5 times as many rows as unique images.

In [None]:
# A simple sanity check: count the rows that share the image name found at a given index
def duplicacy(index):
    print(f"There are `{len(df.loc[df['image_name'] == df['image_name'][index]])}` comments for image `{index}`")

# Change the index to see for yourself
duplicacy(index=200)
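
The same check can be run for every image at once rather than one index at a time. A minimal sketch, using a made-up three-image frame in place of `df`:

```python
import pandas as pd

# Hypothetical stand-in for df: three images; 'c.jpg' is short by one comment.
toy_df = pd.DataFrame({
    'image_name': ['a.jpg'] * 5 + ['b.jpg'] * 5 + ['c.jpg'] * 4,
    ' comment': [f'caption {i}' for i in range(14)],
})

# Number of comment rows per image
counts = toy_df.groupby('image_name').size()

# Images that do not have exactly 5 comments
irregular = counts[counts != 5]
print(irregular.to_dict())
```

An empty result on the real dataframe would confirm that every image has exactly 5 comments.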

## Dividing data
The goal of this section is to obtain a `tf.data.Dataset` whose elements have this format:
```python
{
    image,
    comment0,
    comment1,
    comment2,
    comment3,
    comment4
}
```

With that in mind, these are the steps I have taken to build the dataset:
- Make two dataframes, one for image names and the other for comments.
- Build two different `tf.data.Dataset` objects.
- Preprocess the comments dataset.
- Zip the two datasets together.
- Map a function that loads the image from each image name and keeps the comments as they are.

In [None]:
image_name = {
    'image_name':df[df[' comment_number'] == df[' comment_number'][0]]['image_name'].values,
}
comments = {
    'comment_0':df[df[' comment_number'] == df[' comment_number'][0]][' comment'].values,
    'comment_1':df[df[' comment_number'] == df[' comment_number'][1]][' comment'].values,
    'comment_2':df[df[' comment_number'] == df[' comment_number'][2]][' comment'].values,
    'comment_3':df[df[' comment_number'] == df[' comment_number'][3]][' comment'].values,
    'comment_4':df[df[' comment_number'] == df[' comment_number'][4]][' comment'].values,
}
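
The five filters above rely on the rows for each comment number appearing in the same image order. As an alternative sketch (not what this notebook does), `DataFrame.pivot` builds the same wide layout while aligning rows by `image_name` explicitly. The toy frame is made up, and note that `pivot` sorts the result by image name.

```python
import pandas as pd

# Hypothetical long-format frame with the same column names as results.csv
toy_df = pd.DataFrame({
    'image_name': ['b.jpg', 'b.jpg', 'a.jpg', 'a.jpg'],
    ' comment_number': [' 0', ' 1', ' 0', ' 1'],
    ' comment': [' b zero', ' b one', ' a zero', ' a one'],
})

# One row per image, one column per comment number
wide = toy_df.pivot(index='image_name',
                    columns=' comment_number',
                    values=' comment')

image_names = wide.index.to_numpy()            # shape (num_images,)
comment_matrix = wide.to_numpy().astype(str)   # shape (num_images, comments_per_image)
print(image_names)
print(comment_matrix)
```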

In [None]:
image_name_df = pd.DataFrame.from_dict(image_name)
comments_df = pd.DataFrame.from_dict(comments)

image_name_df_values = image_name_df[image_name_df.columns].astype(str).values
comments_df_values = comments_df[comments_df.columns].astype(str).values

image_name_ds = tf.data.Dataset.from_tensor_slices(image_name_df_values)
comments_ds = tf.data.Dataset.from_tensor_slices(comments_df_values)

## TextVectorization
Here we create a `TextVectorization` layer. According to the TensorFlow tutorial, this layer handles `Standardization`, `Tokenization` and `Vectorization` all at once.

In [None]:
VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 15

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
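
To make the three stages concrete, here is a plain-Python sketch of what they amount to conceptually. It is an illustration only, not the layer's actual implementation. The helper names are hypothetical, and it assumes the conventions of the layer's defaults: lowercasing with punctuation stripped, splitting on whitespace, index `0` reserved for padding, and index `1` for out-of-vocabulary tokens.

```python
import string
from collections import Counter

def standardize(text):
    # Standardization: lowercase and drop ASCII punctuation
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def tokenize(text):
    # Tokenization: split the standardized text on whitespace
    return standardize(text).split()

def build_vocab(corpus, max_tokens):
    # Most frequent tokens first; indices 0 and 1 are reserved
    counts = Counter(tok for line in corpus for tok in tokenize(line))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx for idx, tok in enumerate(most_common, start=2)}

def vectorize(text, vocab, seq_length):
    # Vectorization: map tokens to ids, truncate, then pad with zeros
    ids = [vocab.get(tok, 1) for tok in tokenize(text)][:seq_length]
    return ids + [0] * (seq_length - len(ids))

corpus = ['A dog runs .', 'A dog sleeps .']
vocab = build_vocab(corpus, max_tokens=10)
vector = vectorize('A cat runs !', vocab, seq_length=5)
print(vector)  # [2, 1, 4, 0, 0]
```

Here `cat` is not in the vocabulary, so it maps to `1`, and the two trailing zeros are padding up to `seq_length`. `VOCAB_SIZE` and `MAX_SEQUENCE_LENGTH` play the roles of `max_tokens` and `seq_length` in the real layer.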

In [None]:
# Adapt the state of the layer to the current data
int_vectorize_layer.adapt(comments_ds)

In [None]:
# Map a batch of raw text comments to integer token ids
def int_vectorize_text(text):
    text = tf.expand_dims(text, -1)
    return int_vectorize_layer(text)

> Sanity check with the captions of one image

In [None]:
text = next(iter(comments_ds))
print("[INFO] COMMENTS:",text)
print("[INFO] `int` VECTORIZED COMMENTS:",int_vectorize_text(text))

In [None]:
# Build the int comments dataset
int_comments_ds = comments_ds.map(int_vectorize_text)

In [None]:
# Join the two datasets
# Image name dataset + int vectorised comments dataset
name_comments_ds = tf.data.Dataset.zip((image_name_ds, int_comments_ds))

In [None]:
def process(image_name,comments):
    """
    Load the image for the given path and pass the comments
    through unchanged.

    Args:
        image_name (tensor): Shape-(1,) string tensor holding the image path
        comments (tensor): The comments, preferably int-vectorized

    Returns:
        tuple: The decoded image and the unchanged comments
    """
    img = load_img(image_name[0])
    return img, comments

# The Joint Dataset

In [None]:
train_ds = name_comments_ds.map(process)

In [None]:
for image, comments in train_ds.take(1):
    print(image.shape)
    print(comments.shape)
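
Because the images keep their original sizes, elements of `train_ds` cannot be stacked into a batch as they are. A possible next step, sketched here on synthetic tensors rather than the real images, is to resize every image to a common size, then batch and prefetch. `IMG_SIZE`, `BATCH_SIZE`, `fake_example`, and `resize` are hypothetical names, not part of the notebook. One caveat: `tf.image.resize` needs a tensor whose rank is known, which `tf.image.decode_image` may not provide by default; passing `expand_animations=False` or using `tf.image.decode_jpeg` is one way around that.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE
IMG_SIZE = (64, 64)   # hypothetical target size
BATCH_SIZE = 2        # hypothetical batch size

def fake_example(i):
    # Synthetic stand-in for (image, comments): the image height varies per element
    height = 32 + 8 * tf.cast(i, tf.int32)
    image = tf.zeros(tf.stack([height, 48, 3]), dtype=tf.float32)
    comments = tf.zeros((5, 15), dtype=tf.int64)
    return image, comments

def resize(image, comments):
    # Bring every image to the same spatial size so elements can be batched
    return tf.image.resize(image, IMG_SIZE), comments

toy_ds = tf.data.Dataset.range(4).map(fake_example)
batched_ds = (toy_ds
              .map(resize, num_parallel_calls=AUTOTUNE)
              .batch(BATCH_SIZE)
              .prefetch(buffer_size=AUTOTUNE))

images, comments = next(iter(batched_ds))
print(images.shape, comments.shape)
```

With the real data, `configure_dataset` from the utility section could take the place of the final `prefetch` call, provided the cached images fit in memory.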

In [None]:
plt.figure(figsize=(10, 10))
for image, comments in train_ds.take(2):
    plt.imshow(image)
    plt.show()
    print(comments)