## Develop Deep Learning Model: Parts 2 & 3

### Part 2 - Defining the model

#### Feature Extractor Model
The input to this model is a 4096-element feature vector extracted from each photo by the VGG model. A 50% dropout is applied to the input to reduce overfitting. The output layer is a dense layer with 256 units and 'relu' activation.

#### Sequence Model
The input to this model is a word sequence whose length equals that of the longest description (`max_length`). This input is passed through an Embedding layer, followed by a 50% dropout. The output layer is an LSTM with 256 units.
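
As a rough illustration of the fixed-length input described above, the sketch below pads an integer-encoded description with zeros up to `max_length`. The helper name `pad_to_length`, the sample word indices, and the choice of padding on the left are assumptions made for illustration. The only detail taken from the notebook is that index 0 is treated as padding, since the Embedding layer in `define_model()` sets `mask_zero=True`.

```python
def pad_to_length(seq, max_length, pad_value=0):
    """Left-pad a list of word indices with pad_value up to max_length."""
    seq = list(seq)
    if len(seq) > max_length:
        raise ValueError("sequence is longer than max_length")
    return [pad_value] * (max_length - len(seq)) + seq

# hypothetical integer-encoded partial description (three words)
partial_description = [1, 42, 7]
padded = pad_to_length(partial_description, max_length=6)
print(padded)  # [0, 0, 0, 1, 42, 7]
```

Every sequence fed to the model then has the same length, and the zeros carry no word information.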

#### Decoder Model
The decoder model takes an input vector of length 256, obtained by adding the outputs of the two models above element-wise. This vector is fed to a dense layer with 256 units and 'relu' activation, whose output is fed to a second dense layer with 'softmax' activation. Because the target word is one-hot encoded, this final layer has as many units as the vocabulary size.
The output of the decoder model is therefore the output of the whole network.
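
To make the link between the softmax output and the one-hot target more concrete, here is a small pure-Python sketch on a toy vocabulary of four words. The function names, the scores, and the target index are all made up for illustration; Keras computes the same quantities internally with its own implementation.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, vocab_size):
    """Vector of zeros with a single 1 at the target word's index."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def categorical_crossentropy(target, probs):
    """Loss for one example: -sum(target * log(prob))."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# toy vocabulary of 4 words; the real output layer has vocab_size units
scores = [2.0, 1.0, 0.1, -1.0]       # hypothetical raw outputs of the last dense layer
probs = softmax(scores)
target = one_hot(0, vocab_size=4)    # suppose word 0 is the correct next word

predicted_word = probs.index(max(probs))
loss = categorical_crossentropy(target, probs)
print(predicted_word, loss)
```

Since the target has a single 1, the loss reduces to the negative log of the probability assigned to the correct word.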

#### Using Model Class API
The Model class API is used to instantiate a model from given input tensor(s) (in our case, the features extracted from the photos and the descriptions) and output tensor(s) (in our case, the output of the decoder).
The resulting model includes all layers required to compute the outputs from the inputs.
For multi-input or multi-output models, you can pass lists instead of single tensors, as we do for the inputs, since we have two of them:
model = Model(inputs=[a1, a2], outputs=[b1, b2, b3])

`define_model()` defines and returns the model, ready to be fit.

In [None]:
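# imports needed by define_model (Keras functional API)
from keras.models import Model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from keras.utils import plot_model

# define the captioning model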
def define_model(vocab_size, max_length):
    
    # feature extractor model
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    
    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    # summarize model (summary() prints the table itself and returns None)
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
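
As a sanity check on what `model.summary()` reports, the trainable parameter counts of each layer can be estimated by hand. The sketch below assumes the layer sizes used in `define_model()` (4096 input features, 256 units everywhere). The function name and the vocabulary size of 5000 are made up for illustration. Dropout and the addition layer have no parameters, and `max_length` does not affect the count.

```python
def caption_model_param_counts(vocab_size, feature_dim=4096, units=256):
    """Estimated trainable parameters per layer for the define_model() architecture."""
    return {
        "feature_dense": feature_dim * units + units,      # weights + biases
        "embedding": vocab_size * units,                   # one 256-d row per word
        "lstm": 4 * ((units + units) * units + units),     # 4 gates: input + recurrent weights + bias
        "decoder_dense": units * units + units,
        "output_dense": units * vocab_size + vocab_size,
    }

counts = caption_model_param_counts(vocab_size=5000)   # 5000 is a hypothetical vocabulary size
for name, n in counts.items():
    print(f"{name:>14}: {n:,}")
print(f"{'total':>14}: {sum(counts.values()):,}")
```

The embedding and output layers both scale with the vocabulary size, so they tend to dominate the total for large vocabularies.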

### Part 3 - Fitting the model

In [None]:
from keras.callbacks import ModelCheckpoint

# define the model
model = define_model(vocab_size, max_length)
# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# fit model
model.fit([X1train, X2train], ytrain,
          epochs=20, verbose=2, callbacks=[checkpoint],
          validation_data=([X1test, X2test], ytest))
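
The checkpoint settings above can be read as follows: with `save_best_only=True` and `mode='min'`, a file is written only when `val_loss` improves on the best value seen so far, and the file name is filled in from the epoch number and the logged losses. The sketch below imitates that behaviour in plain Python. The loss values are invented for illustration and are not training results, and the loop is a simplification rather than the actual Keras implementation.

```python
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'

# hypothetical (loss, val_loss) pairs for four epochs
history = [(4.512, 4.105), (3.871, 3.902), (3.560, 3.934), (3.342, 3.887)]

best = float('inf')
saved = []
for epoch, (loss, val_loss) in enumerate(history, start=1):
    if val_loss < best:                  # mode='min': lower val_loss is better
        best = val_loss
        saved.append(filepath.format(epoch=epoch, loss=loss, val_loss=val_loss))

for name in saved:
    print(name)
# model-ep001-loss4.512-val_loss4.105.h5
# model-ep002-loss3.871-val_loss3.902.h5
# model-ep004-loss3.342-val_loss3.887.h5
```

In this made-up run, epoch 3 is skipped because its validation loss is worse than that of epoch 2, even though the training loss kept falling.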