# Develop Deep Learning Model, Part 1

## Loading the data

##### Train the model on all of the photos and captions in the training dataset.
##### Monitor the performance of the model on the development dataset, which serves as the testing dataset here.
##### The train and development (testing) datasets are predefined in the Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt files respectively. Each file contains the list of photo file names in the corresponding set.
##### So when building the training dataset, we refer to the photo file names listed in Flickr_8k.trainImages.txt, extract the photo identifiers (file names without the extension), and use them to select the photos and descriptions that go into the training dataset.
##### We follow the same procedure for the development (testing) dataset.

#### Load a document into memory (Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt)

In [None]:
# load doc into memory
def load_doc(filename):
    # open the file as read only; the with block closes it automatically
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

#### The function load_set() below loads a pre-defined set of identifiers, given the filename of the train or development set.

In [None]:
# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)
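
To make the identifier extraction concrete, here is a minimal sketch that applies the same line-by-line logic to an in-memory string instead of a file. The helper name `identifiers_from_text` and the second file name are made up for illustration; only the first identifier appears in this notebook.

```python
# Sketch: same parsing logic as load_set(), applied to a string.
# `identifiers_from_text` is a hypothetical helper, not part of the notebook.
def identifiers_from_text(doc):
    identifiers = set()
    for line in doc.split('\n'):
        # skip empty lines (e.g. after a trailing newline at the end of the file)
        if len(line) < 1:
            continue
        # drop the file extension: 'abc.jpg' -> 'abc'
        identifiers.add(line.split('.')[0])
    return identifiers

# illustrative file contents; the second file name is invented
sample_doc = "1280147517_98767ca3b3.jpg\nexample_photo_0001.jpg\n\n"
sample_ids = identifiers_from_text(sample_doc)
print(sorted(sample_ids))
```

Returning a set means that a file name listed twice still yields a single identifier, and membership tests such as `image_id in dataset` stay fast.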

##### Now we will load the photos and descriptions according to the pre-defined set of train or development identifiers.

##### The function load_clean_descriptions() loads the cleaned text descriptions from 'descriptions.txt' for a given set of identifiers and returns a dictionary that maps each identifier to a list of text descriptions.

##### The model we intend to develop will generate a caption given a photo, and the caption will be generated one word at a time. The sequence of previously generated words will be provided as input. Therefore, we need a 'first word' to kick off the generation process and a 'last word' to signal the end of the caption. We use the strings 'startseq' and 'endseq' for this purpose. These tokens are added to the descriptions as they are loaded. It is important to do this now, before we encode the text, so that the tokens are also encoded correctly.

##### For a single photo, the identifier and its wrapped descriptions will look something like this:
1280147517_98767ca3b3 startseq boat on lake endseq

1280147517_98767ca3b3 startseq wakeboarder flies sideways in air endseq

1280147517_98767ca3b3 startseq water skier performs tricks endseq

1280147517_98767ca3b3 startseq water skiing man does flip behind speedboat endseq


##### load_clean_descriptions() will be called once to form the training dataset and once to form the development dataset.

In [None]:
# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # skip empty lines, which would otherwise raise an IndexError below
        if not tokens:
            continue
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions
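
The returned structure is easier to see on a small input. The sketch below repeats the parsing steps on two of the example captions shown earlier, held in a string rather than a file. The variable names `sample_doc`, `wanted` and `sample_descriptions` are illustrative only.

```python
# Sketch: parse cleaned description lines held in a string instead of a file.
sample_doc = (
    "1280147517_98767ca3b3 boat on lake\n"
    "1280147517_98767ca3b3 water skier performs tricks\n"
)
# the set of identifiers we want to keep
wanted = {"1280147517_98767ca3b3"}

sample_descriptions = {}
for line in sample_doc.split('\n'):
    tokens = line.split()
    # skip the empty string left after the final newline
    if not tokens:
        continue
    image_id, image_desc = tokens[0], tokens[1:]
    if image_id in wanted:
        # wrap the caption in the start and end tokens
        desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
        sample_descriptions.setdefault(image_id, []).append(desc)

print(sample_descriptions)
```

Each identifier maps to a list because every photo has several captions.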

##### load_photo_features() does for photos what load_clean_descriptions() does for their descriptions.
##### We pass the pickle file containing the extracted features as filename, and the set of identifiers loaded from Flickr_8k.trainImages.txt as dataset.

In [1]:
from pickle import load

# load photo features
def load_photo_features(filename, dataset):
    # load all features; the with block closes the file afterwards
    with open(filename, 'rb') as file:
        all_features = load(file)
    # filter features
    features = {k: all_features[k] for k in dataset}
    return features
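
As a quick check of the filtering step, the following sketch pickles a tiny dictionary of made-up features to a temporary file and then keeps only the requested identifiers. The identifiers, the feature values and the temporary file are invented for illustration; the real pickle file holds the features extracted for each photo.

```python
import os
import pickle
import tempfile

# invented stand-in for the extracted photo features
fake_features = {
    'photo_a': [[0.1, 0.2, 0.3]],
    'photo_b': [[0.4, 0.5, 0.6]],
    'photo_c': [[0.7, 0.8, 0.9]],
}

# write the features to a temporary pickle file
with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as tmp:
    pickle.dump(fake_features, tmp)
    path = tmp.name

# load everything, then keep only the identifiers in the requested set
with open(path, 'rb') as f:
    all_features = pickle.load(f)
subset = {k: all_features[k] for k in {'photo_a', 'photo_c'}}

# clean up the temporary file
os.remove(path)

print(sorted(subset))
```

Note that the dictionary comprehension raises a `KeyError` if the set contains an identifier that has no extracted features, so the identifier file and the feature file need to agree.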

##### Calling the above functions one by one for the training dataset

In [None]:
# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))

In [None]:
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))

In [None]:
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))

##### Since models deal with numbers, we need to map every distinct word in the training descriptions to an integer value.
##### Keras provides the Tokenizer class, which can learn this mapping from the loaded description data.

-> The create_tokenizer() function fits a Tokenizer on the loaded photo description text, so we pass 'train_descriptions' to it.

-> create_tokenizer() calls the 'to_lines()' function to flatten the dictionary into a single list of all descriptions, so 'train_descriptions' is passed on to 'to_lines()' as well.

In [None]:
# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        # add every description of this image to the flat list
        all_desc.extend(descriptions[key])
    return all_desc

In [None]:
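# the import path below matches older Keras releases and may differ in your version
from keras.preprocessing.text import Tokenizer

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer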

###### Calling the above functions

In [None]:
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
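
The `+ 1` in `vocab_size` is there because the Keras `Tokenizer` assigns word indices starting at 1, which leaves 0 free for padding. The plain-Python sketch below mimics that idea on two captions. It illustrates the mapping and is not the Keras implementation; `build_word_index` is a hypothetical helper, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

# Sketch: map each distinct word to an integer, most frequent word first.
# `build_word_index` is a hypothetical helper, not a Keras function.
def build_word_index(lines):
    # count words across all captions
    counts = Counter(word for line in lines for word in line.split())
    # sort by descending frequency, then alphabetically (tie-break is an assumption)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    # indices start at 1; index 0 is left unused for padding
    return {word: i for i, word in enumerate(ranked, start=1)}

lines = ['startseq boat on lake endseq',
         'startseq water skier performs tricks endseq']
word_index = build_word_index(lines)
sample_vocab_size = len(word_index) + 1
print(word_index)
print('Vocabulary Size: %d' % sample_vocab_size)
```

With nine distinct words the indices run from 1 to 9, so an output layer needs ten positions to cover index 0 as well.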

#### Model will be trained this way:
The model is given one word and the photo, and generates the next word. Then the first two words of the description are given to the model as input, together with the image, to generate the next word, and so on until the end of the description.

###### For example:
The input sequence "little girl running in field" would be split into 6 input-output pairs to train the model:

| X1 (photo) | X2 (text sequence) | y (word) |
|---|---|---|
| photo | startseq | little |
| photo | startseq, little | girl |
| photo | startseq, little, girl | running |
| photo | startseq, little, girl, running | in |
| photo | startseq, little, girl, running, in | field |
| photo | startseq, little, girl, running, in, field | endseq |
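
The split shown above can be sketched in a few lines of plain Python, working on words rather than integers to keep it readable. The helper name `make_pairs` is hypothetical; the actual training code further down works on integer-encoded sequences.

```python
# Sketch: split one wrapped caption into (input words, next word) pairs.
# `make_pairs` is a hypothetical helper used only for illustration.
def make_pairs(caption):
    words = caption.split()
    pairs = []
    # every prefix of the caption predicts the word that follows it
    for i in range(1, len(words)):
        pairs.append((words[:i], words[i]))
    return pairs

pairs = make_pairs('startseq little girl running in field endseq')
for in_words, out_word in pairs:
    print(in_words, '->', out_word)
```

A wrapped caption of n words always yields n - 1 pairs, which is why the seven-word example above produces six.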

###### Later, when the model is used to generate descriptions, the generated words will be concatenated and recursively provided as input to generate a caption for an image.

#### create_sequences() function
The function below named create_sequences(), given the tokenizer, a maximum sequence length, and the dictionary of all descriptions and photos, will transform the data into input-output pairs of data for training the model. There are two input arrays to the model: one for photo features and one for the encoded text. There is one output for the model which is the encoded next word in the text sequence.

The input text is encoded as integers, which will be fed to a word embedding layer. The photo features will be fed directly to another part of the model. The model will output a prediction, which will be a probability distribution over all words in the vocabulary.

#### Output is one-hot encoded:
The output data will therefore be a one-hot encoded version of each word, representing an idealized probability distribution with 0 values at all word positions except the actual word position, which has a value of 1.
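
As a small illustration of that encoding, here is a plain-Python sketch with a made-up word index and vocabulary size. The helper `one_hot` is hypothetical; the notebook's own code relies on `to_categorical` instead.

```python
# Sketch: one-hot encode a word index as a list of floats.
# `one_hot` is a hypothetical helper; the values 3 and 6 are made up.
def one_hot(index, num_classes):
    # all positions are 0 except the position of the actual word
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

target = one_hot(3, 6)
print(target)
```

The vector sums to 1, so it can be compared directly with the probability distribution the model predicts.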

In [None]:
# the Keras import paths below match older releases and may differ in your version
from numpy import array
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# create sequences of images, input sequences and output words for an image
# note: relies on the global vocab_size computed from the tokenizer above
def create_sequences(tokenizer, max_length, descriptions, photos):
    X1, X2, y = list(), list(), list()
    # walk through each image identifier
    for key, desc_list in descriptions.items():
        # walk through each description for the image
        for desc in desc_list:
            # encode the sequence
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple X,y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                # store
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)
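
In the function above, `pad_sequences` is called with its default settings, which add zeros at the start of a sequence and, for sequences longer than `maxlen`, drop values from the start. A plain-Python sketch of that behaviour follows; `pad_pre` is a hypothetical helper, not the Keras function.

```python
# Sketch: left-pad (and, if needed, left-truncate) an integer sequence.
# `pad_pre` is a hypothetical helper that mimics the default behaviour described above.
def pad_pre(seq, maxlen, value=0):
    # keep only the last `maxlen` items if the sequence is too long
    seq = seq[-maxlen:]
    # fill the front with `value` until the sequence has length `maxlen`
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([5, 9], 5))
print(pad_pre([1, 2, 3, 4, 5, 6], 5))
```

Padding at the front keeps the most recent words at the end of every input, directly before the word to be predicted.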

We will need to calculate the number of words in the longest description. A short helper function named max_length() is defined below.
Its result is passed as the max_length parameter to the 'create_sequences()' function.

In [None]:
# calculate the length of the description with the most words
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)
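
For a concrete feel of what this returns, the sketch below computes the same quantity on a toy dictionary. The identifiers `photo_a` and `photo_b` are invented; the captions are taken from the examples earlier in this notebook.

```python
# Sketch: longest caption length on a toy dictionary of wrapped captions.
sample_descriptions = {
    'photo_a': ['startseq boat on lake endseq'],
    'photo_b': ['startseq water skiing man does flip behind speedboat endseq'],
}

# count the words of every caption of every photo and take the maximum
longest = max(
    len(d.split())
    for descs in sample_descriptions.values()
    for d in descs
)
print('Description Length: %d' % longest)
```

The count includes the 'startseq' and 'endseq' tokens, since they are part of every sequence the model sees.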