# Image Captioning

La idea de la práctica es desarrollar un modelo con redes neuronales que permita "subtitular" una imagen, describiendola.

Para ello usamos una arquitectura de encoder-decoder, es decir, cogemos una imagen, extraemos sus caracteristicas, y con la descripción que tenemos de la misma entrenamos una LSTM para extraer la descripción que tiene según las caracteristicas.

![imagen del modelo](https://raw.githubusercontent.com/yunjey/pytorch-tutorial/master/tutorials/03-advanced/image_captioning/png/model.png)


## Datos
Para entrenar la red neuronal usaremos los datos de [Flicker30K](http://shannon.cs.illinois.edu/DenotationGraph/data/index.html)

Para la ejecución de la practica este dataset es excesivo. De hecho, al intentar ejecutar los pasos con todo el dataset Google Colab ha llegado a pensar que estaba minando bitcoins y [ha límitado los recursos](https://research.google.com/colaboratory/faq.html#gpu-availability) (menos disco duro y sin GPU).

Así pues, considerando que el objetivo de la practica es más bien mostrar el camino y no conseguir unos resultados exactos, he preferido limitar el número de imagenes y de descripciones usadas para entrenar la red neuronal, así como disminuir el número de épocas usadas para entrenar la red.

Pese a todo he procurado dejar el código de tal manera que si tengo acceso a una máquina más potente (o estoy dispuesto a dejar mi ordenador entrenando la red varias semanas) poder generar un modelo más completo con el menor número posible de cambios.

Por ello, las partes computacionalmente ligeras las he realizado para las 30.000 imagenes del dataset, pero la parte del LSTM la he limitado.


In [0]:
#URLS de los datasets
ANNOTATIONS_URL = 'http://shannon.cs.illinois.edu/DenotationGraph/data/flickr30k.tar.gz'
IMAGES_URL_TRAIN = 'http://shannon.cs.illinois.edu/DenotationGraph/data/flickr30k-images.tar'

## Preparación de datos


### Descripciones

#### Descarga de descripciones

In [2]:
#descarga de descripciones
!wget http://shannon.cs.illinois.edu/DenotationGraph/data/flickr30k.tar.gz

--2019-08-29 01:00:54--  http://shannon.cs.illinois.edu/DenotationGraph/data/flickr30k.tar.gz
Resolving shannon.cs.illinois.edu (shannon.cs.illinois.edu)... 18.220.149.166
Connecting to shannon.cs.illinois.edu (shannon.cs.illinois.edu)|18.220.149.166|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3652513 (3.5M) [application/x-gzip]
Saving to: ‘flickr30k.tar.gz.1’


2019-08-29 01:00:54 (18.8 MB/s) - ‘flickr30k.tar.gz.1’ saved [3652513/3652513]



In [3]:
# Descomprimimos
!tar -xvzf flickr30k.tar.gz

readme.txt
results_20130124.token


#### Limpieza y preparación de descripciones 

Definimos dos funciones auxiliares que nos permiten leer y grabar documentos.

In [0]:
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# save descriptions to file, one per line
def save_doc(descriptions, filename):
	lines = list()
	for key, desc in descriptions.items():
		lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

In [0]:
filename = 'results_20130124.token'
# Cargamos las descripciones
doc = load_doc(filename)

En doc tenemos:

```NOMBRE_DE_ARCHIVO.JPG```**#**```ID_DESC```**\t**````DESCRIPCION```

(Un mismo archivo puede tener varias descripciones)

In [6]:
doc[0:150]

'1000092795.jpg#0\tTwo young guys with shaggy hair look at their hands while hanging out in the yard .\n1000092795.jpg#1\tTwo young , White males are outs'

In [7]:
# Funcion para cargar las descripciones
def load_descriptions(doc):
  # creamos un diccionario que tiene como clave el nombre de archivo y como valor
  # la descripcion de la imagen
  mapping = dict()
  # Para cada linea
  for line in doc.split('\n'):
    #separar la linea por espacios
    tokens = line.split()
    if len(line) < 2:
      continue
      # la primera palabra es el image ID, el resto la descripción
    image_id, image_desc = tokens[0], tokens[1:]
		# quitar la extension .jpg del imageID
    image_id = image_id.split('.')[0]
		# Convertir la descripcion (en este momento una cadesna de tokens en una cadenba
    image_desc = ' '.join(image_desc)
		# guardar ña primera descripcion de cada imagen
    if image_id not in mapping:
      mapping[image_id] = image_desc
      
  return mapping
 
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))

Loaded: 31783 


Para la realización de la practica, y teniendo en cuenta que sólo es una aproximación sólo me quedo con la `primera descripción. En el entrenamiento final, tendría que usar todas las descripciones, no sólo la primera.


In [8]:
import string

def clean_descriptions(descriptions):
  # palabras a minusculas, eliminamos palabras de una sola letra
  
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc in descriptions.items():
		# tokenize
		desc = desc.split()
		# convert to lower case
		desc = [word.lower() for word in desc]
		# remove punctuation from each token
		desc = [w.translate(table) for w in desc]
		# remove hanging 's' and 'a'
		desc = [word for word in desc if len(word)>1]
		# store as string
		descriptions[key] =  ' '.join(desc)

# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
all_tokens = ' '.join(descriptions.values()).split()
vocabulary = set(all_tokens)
print('Vocabulary Size: %d' % len(vocabulary))

Vocabulary Size: 13130


Una vez hecho todo esto grabamos las descripciones como un archivo de texto, para poder recurrir a él si reinicamos la máquina.
(Este archivo se incluye en el GIT)

In [0]:
save_doc(descriptions, 'descriptions.txt')  

En la variable descriptions nos ha quedado un diccionario con el nombre de archivo, y su descripción. Estos datos los usaremos para entrenar nuestro modelo LSTM

In [10]:
descriptions.keys()

dict_keys(['1000092795', '10002456', '1000268201', '1000344755', '1000366164', '1000523639', '1000919630', '10010052', '1001465944', '1001545525', '1001573224', '1001633352', '1001773457', '1001896054', '100197432', '100207720', '1002674143', '1003163366', '1003420127', '1003428081', '100444898', '1005216151', '100577935', '1006452823', '100652400', '1007129816', '100716317', '1007205537', '1007320043', '100759042', '10082347', '10082348', '100845130', '10090841', '1009434119', '1009692167', '101001624', '1010031975', '1010087179', '1010087623', '10101477', '1010470346', '1010673430', '101093029', '101093045', '1011572216', '1012150929', '1012212859', '1012328893', '101262930', '1013536888', '101362133', '101362650', '1014609273', '101471792', '1014785440', '1015118661', '1015584366', '101559400', '1015712668', '10160966', '101654506', '1016626169', '101669240', '1016887272', '1017675163', '1018057225', '1018148011', '101859883', '10188041', '1019077836', '101958970', '1019604187', '10

In [11]:
descriptions.values()



### Imagenes

#### Descarga de imagenes

In [12]:
!wget http://shannon.cs.illinois.edu/DenotationGraph/data/flickr30k-images.tar

--2019-08-29 01:01:01--  http://shannon.cs.illinois.edu/DenotationGraph/data/flickr30k-images.tar
Resolving shannon.cs.illinois.edu (shannon.cs.illinois.edu)... 18.220.149.166
Connecting to shannon.cs.illinois.edu (shannon.cs.illinois.edu)|18.220.149.166|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4440780800 (4.1G) [application/x-tar]
Saving to: ‘flickr30k-images.tar.1’


2019-08-29 01:01:59 (72.4 MB/s) - ‘flickr30k-images.tar.1’ saved [4440780800/4440780800]



In [13]:
# Descomprimimos
!tar -vxf flickr30k-images.tar

flickr30k-images/
flickr30k-images/1000092795.jpg
flickr30k-images/10002456.jpg
flickr30k-images/1000268201.jpg
flickr30k-images/1000344755.jpg
flickr30k-images/1000366164.jpg
flickr30k-images/1000523639.jpg
flickr30k-images/1000919630.jpg
flickr30k-images/10010052.jpg
flickr30k-images/1001465944.jpg
flickr30k-images/1001545525.jpg
flickr30k-images/1001573224.jpg
flickr30k-images/1001633352.jpg
flickr30k-images/1001773457.jpg
flickr30k-images/1001896054.jpg
flickr30k-images/100197432.jpg
flickr30k-images/100207720.jpg
flickr30k-images/1002674143.jpg
flickr30k-images/1003163366.jpg
flickr30k-images/1003420127.jpg
flickr30k-images/1003428081.jpg
flickr30k-images/100444898.jpg
flickr30k-images/1005216151.jpg
flickr30k-images/100577935.jpg
flickr30k-images/1006452823.jpg
flickr30k-images/100652400.jpg
flickr30k-images/1007129816.jpg
flickr30k-images/100716317.jpg
flickr30k-images/1007205537.jpg
flickr30k-images/1007320043.jpg
flickr30k-images/100759042.jpg
flickr30k-images/10082347.jpg
fli

In [0]:
# elimino readme.txt para evitar problemas a la hora entrenar
# Si, la solución correcta sería comprobar que es una imagen antes de entrenar la red
# pero por el momento vale.

!rm flickr30k-images/readme.txt

#### Limpieza y preparacion de las imagenes

Las imagenes ya están todas preparadas para su procesado, sólo será necesario cambiar sus dimensiones a 224x224, pero este paso se realiza mientras extraemos las caracteristicas.

## Modelo





En este momento nos vamos a centrar en el extractor de caracteristicas.


#### Extracción de caracteristicas 


Podriamos crear un modelo de de redes neuronales para extraer las caracteristicas del conjunto de imagenes que hemos bajado, pero cuando hay gente muy lista que ha creado unos modelos ya probados, es mejor apoyarse en el trabajo de otros y extraer las características de las imagenes con un modelo (y unos pesos) ya entrenado.

Vamos a usar el VGG16. Este modelo es lento de aplicar, así que vamos a obtener las caracteristicas que nos da el modelo y las guardaremos en un archivo.

De esta manera, a la hora de entrenar el LSTM, el trabajo de la extracción de caracteristicas estará ya realizado, y no tendremos que sacar las mismas de cada imagen justo antes de entrar en la red neuronal recurrente.



In [15]:
!pip install tqdm



In [16]:
from os import listdir
from pickle import load, dump
from tqdm import tqdm
from numpy import array
from numpy import argmax
from pandas import DataFrame
from nltk.translate.bleu_score import corpus_bleu
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img, img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from keras.layers import Input, Dense, Flatten, LSTM, RepeatVector, TimeDistributed, Embedding
from keras.layers.merge import concatenate
from keras.layers.pooling import GlobalMaxPooling2D

Using TensorFlow backend.


In [17]:
# extract features from each photo in the directory
def extract_features(directory):
  # load the model
  in_layer = Input(shape=(224, 224, 3))
  model = VGG16(include_top=False, input_tensor=in_layer)
  print(model.summary())
  # extract features from each photo
  features = dict()

  files_in_directory = listdir(directory)
  n_images = len(files_in_directory)
  for i, name in tqdm(enumerate(files_in_directory), total=n_images):
    # load an image from file
    filename = directory + '/' + name
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    # get image id
    image_id = name.split('.')[0]
    # store feature
    features[image_id] = feature
    # print('{} / {} > {}'.format(i, n_images, name))
  return features
# extract features from all images
directory = 'flickr30k-images'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))

W0829 01:02:45.460898 140502137550720 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0829 01:02:45.519878 140502137550720 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0829 01:02:45.540890 140502137550720 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0829 01:02:45.578577 140502137550720 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3976: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

W0829 01:02:45.979857 140502137550720 deprecation_wrapp

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

100%|██████████| 31783/31783 [15:17<00:00, 34.63it/s]


Extracted Features: 31783


Tenemos 31783 descripciones y 31783 imagenes con sus caracteristicas extraidas. Tratar de entrenar un modelo con todos estos datos llevaría días, asi que por el momento vamos a seleccionar 1000 imagenes con las que trabajar, y con ellas generaremos nuestros conjuntos de entrenamiento, test y validación.


In [0]:
TRAINING_SET = 1000

import random
# fix random seed for reproducibility
random.seed(42)

def select_files_for_training(directory, n=TRAINING_SET):
  
  files_in_directory = listdir(directory)
  index_de_ficheros = random.sample(range(1, len(files_in_directory)), n)
  l = []
  for i in index_de_ficheros:
    imagen = files_in_directory[i]
    imagen = imagen[0:-4] #quitar la extension
    l.append(imagen)
  return l

def save_training_file_list(l, filename):
	lines = list()
	for desc in l:
		lines.append(desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()
  
l = select_files_for_training('flickr30k-images')  
save_training_file_list(l, 'train-test-val-set.txt')

### Generar modelo LSTM base

A la hora de crear un modelo para luego mejorarlo podriamos intentar entrenar el modelo con todos los datos, pero es muy costoso computacioanlemte hablando, y a la hora de realizar mejoras, nos pasariamos días. 

Siguiendo las recomendaciones de [How to Use Small Experiments to Develop a Caption Generation Model in Keras](https://machinelearningmastery.com/develop-a-caption-generation-model-in-keras/) y tal y como he explicado previamente, voy a quedarme con solo unos pocos datos. Estudiar cómo se comporta el modelo con esos datos, es decir, ver que parametros podemos afinar en la red neuronal, y generalizar la respuesta para ellos.

Al realizar un entrenamiento tan escaso, los valores que vamos a tener de Accuracy van a a ser bajos, pero nos va a permitir realizar pequeños cambios en el mismo, y comprobar los efectos que tienen esos combios en nuestros resultados.

De nuevo, si tuviesemos recursos computacionales infinitos estos experimentos se realizarian con todo el dataset.


In [19]:
from pickle import load

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# split a dataset into train/test elements
def train_test_split(dataset,train,test,val):
  
  tr = int(len(dataset) * train / 100)
  te = int(len(dataset) * test / 100)
  va = int(len(dataset) * val / 100)
  
  # order keys so the split is consistent
  ordered = sorted(dataset)
	# return split dataset as two new sets
  return set(ordered[:tr]), set(ordered[tr:(tr+te)]), set(ordered[tr+te:tr+te+va])

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# store
			descriptions[image_id] = 'startseq ' + ' '.join(image_desc) + ' endseq'
	return descriptions

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

# load dev set
filename = 'train-test-val-set.txt'
dataset = load_set(filename)
print('Dataset: %d' % len(dataset))

# train-test split
train, test, val = train_test_split(dataset,10,10,5)
print (val)
print('Train=%d, Test=%d' % (len(train), len(test)))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
test_descriptions = load_clean_descriptions('descriptions.txt', test)
val_descriptions = load_clean_descriptions('descriptions.txt', val)

print('Descriptions: train=%d, test=%d, val=%d' % (len(train_descriptions), 
                                                    len(test_descriptions), 
                                                    len(val_descriptions)))
# photo features
train_features = load_photo_features('features.pkl', train)
test_features = load_photo_features('features.pkl', test)
val_features = load_photo_features('features.pkl', val)

print('Photos: train=%d, test=%d, val=%d' % (len(train_features), 
                                             len(test_features),
                                             len(val_features)))

Dataset: 1000
{'2602279427', '2515247156', '2600386812', '265528702', '2600229816', '26653224', '2624947061', '2559260825', '2522995166', '2598775005', '2667242810', '2545363449', '2518853257', '2536957963', '2584487952', '2546149002', '2644920808', '2572988623', '2629027962', '2507349105', '2638207727', '2539933563', '2504991916', '2574194729', '2640260087', '2491343114', '2594336381', '2605591891', '2647862489', '2670637584', '2649216969', '267836606', '260231029', '2483993772', '2599908425', '2525898604', '254730529', '2592711202', '2506703854', '248174959', '264928854', '2508512414', '2657643451', '2502248149', '2537702182', '2648165716', '259520146', '247617035', '2538588862', '2642336024'}
Train=100, Test=100
Descriptions: train=100, test=100, val=50
Photos: train=100, test=100, val=50


Como estamos trabajando con tan pocas muestras, y para evitar encontrar resultados maravillosos por azar, entrenamos el modelo varias veces distintas. Esto aumenta el tiempo en el que extraemos conclusiones del modelo, pero sigue siendo mucho más corto que entrenar el modelo con todo el dataset.


Del dataset reducido que tenemos (100 imagenes)preparamos las secuencias de entrenamiento.

In [20]:
# Ahora codificamos nuestras descripciones a números
from keras.preprocessing.text import Tokenizer

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = list(descriptions.values())
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer
 
# prepare tokenizer
tokenizer = create_tokenizer(descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 13131


In [0]:
# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, desc, image, max_length):
	Ximages, XSeq, y = list(), list(),list()
	vocab_size = len(tokenizer.word_index) + 1
	# integer encode the description
	seq = tokenizer.texts_to_sequences([desc])[0]
	# split one sequence into multiple X,y pairs
	for i in range(1, len(seq)):
		# select
		in_seq, out_seq = seq[:i], seq[i]
		# pad input sequence
		in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
		# encode output sequence
		out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
		# store
		Ximages.append(image)
		XSeq.append(in_seq)
		y.append(out_seq)
	# Ximages, XSeq, y = array(Ximages), array(XSeq), array(y)
	return [Ximages, XSeq, y]

Este es el modelo básico con el que vamos a trabajar.

In [0]:
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
	inputs1 = Input(shape=(7, 7, 512))
	fe1 = GlobalMaxPooling2D()(inputs1)
	fe2 = Dense(128, activation='relu')(fe1)
	fe3 = RepeatVector(max_length)(fe2)
	# embedding
	inputs2 = Input(shape=(max_length,))
	emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
	emb3 = LSTM(256, return_sequences=True)(emb2)
	emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
	# merge inputs
	merged = concatenate([fe3, emb4])
	# language model (decoder)
	lm2 = LSTM(500)(merged)
	lm3 = Dense(500, activation='relu')(lm2)
	outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	print(model.summary())
	return model

In [0]:
# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length, n_step):
	# loop until we finish training
	while 1:
		# loop over photo identifiers in the dataset
		keys = list(descriptions.keys())
		for i in range(0, len(keys), n_step):
			Ximages, XSeq, y = list(), list(),list()
			for j in range(i, min(len(keys), i+n_step)):
				image_id = keys[j]
				# retrieve photo feature input
				image = features[image_id][0]
				# retrieve text input
				desc = descriptions[image_id]
				# generate input-output pairs
				in_img, in_seq, out_word = create_sequences(tokenizer, desc, image, max_length)
				for k in range(len(in_img)):
					Ximages.append(in_img[k])
					XSeq.append(in_seq[k])
					y.append(out_word[k])
			# yield this batch of samples to the model
			yield [[array(Ximages), array(XSeq)], array(y)]

Aqui se genera la secuencia que va a servir de entrenamiento a la LSTM

In [0]:
# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None
 
# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
	# seed the generation process
	in_text = 'startseq'
	# iterate over the whole length of the sequence
	for i in range(max_length):
		# integer encode input sequence
		sequence = tokenizer.texts_to_sequences([in_text])[0]
		# pad input
		sequence = pad_sequences([sequence], maxlen=max_length)
		# predict next word
		yhat = model.predict([photo,sequence], verbose=0)
		# convert probability to integer
		yhat = argmax(yhat)
		# map integer to word
		word = word_for_id(yhat, tokenizer)
		# stop if we cannot map the word
		if word is None:
			break
		# append as input for generating the next word
		in_text += ' ' + word
		# stop if we predict the end of the sequence
		if word == 'endseq':
			break
	return in_text
 


Y con la funcion [Bleu](https://es.wikipedia.org/wiki/BLEU) evaluamos la calidad de la secuencia generada por nuestra red.

In [0]:
# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
	actual, predicted = list(), list()
	# step over the whole set
	for key, desc in descriptions.items():
		# generate description
		yhat = generate_desc(model, tokenizer, photos[key], max_length)
		# store actual and predicted
		actual.append([desc.split()])
		predicted.append(yhat.split())
	# calculate BLEU score
	bleu = corpus_bleu(actual, predicted)
	return bleu

Ejecutamos el primer modelo 3 veces y almacenamos los distintos resultados para su posterior analisis.

In [26]:
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
# define experiment
model_name = 'baseline1'
verbose = 1
n_epochs = 25
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Vocabulary Size: 591
Description Length: 35


W0829 01:18:36.919805 140502137550720 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2974: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0829 01:18:37.667130 140502137550720 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.



__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 35)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_1 (GlobalM (None, 512)          0           input_2[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 35, 50)       29550       input_3[0][0]                    
__________________________________________________________________________________________________
dense_1 (D

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.000346 test=0.000570
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, 35)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_2 (GlobalM (None, 512)          0           input_4[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 35, 50)       29550       input_5[0][0]                    
____________________________________________________________________________

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>2: train=0.000346 test=0.000570
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_6 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_7 (InputLayer)            (None, 35)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_3 (GlobalM (None, 512)          0           input_6[0][0]                    
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 35, 50)       29550       input_7[0][0]                    
____________________________________________________________________________

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>3: train=0.000346 test=0.000570
          train     test
count  3.000000  3.00000
mean   0.000346  0.00057
std    0.000000  0.00000
min    0.000346  0.00057
25%    0.000346  0.00057
50%    0.000346  0.00057
75%    0.000346  0.00057
max    0.000346  0.00057


Tenemos nuestro primer experimento, que nos servirá para establecer un benchmark para comparar como se comporta el modelo con distintos cambios en alguna de sus capas.


## Mejoras

Ahora vamos a ir cambiando el modelo para ver si obtenemos mejores resultados.

### Entrada del LSTM

En nuestro modelo inicial, las caracteristicas de la foto (128) y la secuencia de texto (128) se concatenan y se meten en la red LSTM. Probamos a ver como afecta el cambiar el tamaño de la las caracteristicas y la descripcion.

#### Vector de entrada de la LSTM a 128 (size_sm_fixed_vec)

Cambiamos de 128 cáracteristicas y 128 elementos en la descripción a 64 (que al concatenarse hacen un vector de entrada de 128).

In [27]:
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
	inputs1 = Input(shape=(7, 7, 512))
	fe1 = GlobalMaxPooling2D()(inputs1)
	fe2 = Dense(64, activation='relu')(fe1)
	fe3 = RepeatVector(max_length)(fe2)
	# embedding
	inputs2 = Input(shape=(max_length,))
	emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
	emb3 = LSTM(256, return_sequences=True)(emb2)
	emb4 = TimeDistributed(Dense(64, activation='relu'))(emb3) #cambio en esta (de 128 a 64)
	# merge inputs
	merged = concatenate([fe3, emb4])
	# language model (decoder)
	lm2 = LSTM(500)(merged)
	lm3 = Dense(500, activation='relu')(lm2)
	outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

	


# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
# define experiment
model_name = 'size_sm_fixed_vec'
verbose = 1
n_epochs = 25
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Vocabulary Size: 591
Description Length: 35
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.115510 test=0.073329
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>2: train=0.066704 test=0.075890
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>3: train=0.028092 test=0.024320
          train      test
count  3.000000  3.000000
mean   0.070102  0.057846
std    0.043808  0.029063
min    0.028092  0.024320
25%    0.047398  0.048825
50%    0.066704  0.073329
75%    0.091107  0.074610
max    0.115510  0.075890


#### Vector de entrada de la LSTM a 512 (size_lg_fixed_vec)

Y ahora duplicamos a 256.

In [28]:
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
	inputs1 = Input(shape=(7, 7, 512))
	fe1 = GlobalMaxPooling2D()(inputs1)
	fe2 = Dense(256, activation='relu')(fe1)
	fe3 = RepeatVector(max_length)(fe2)
	# embedding
	inputs2 = Input(shape=(max_length,))
	emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
	emb3 = LSTM(256, return_sequences=True)(emb2)
	emb4 = TimeDistributed(Dense(256, activation='relu'))(emb3)
	# merge inputs
	merged = concatenate([fe3, emb4])
	# language model (decoder)
	lm2 = LSTM(500)(merged)
	lm3 = Dense(500, activation='relu')(lm2)
	outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
# define experiment
model_name = 'size_lg_fixed_vec'
verbose = 1
n_epochs = 25
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Vocabulary Size: 591
Description Length: 35
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.003413 test=0.004455
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>2: train=0.003413 test=0.004309
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>3: train=0.000346 test=0.000570
          train      test
count  3.000000  3.000000
mean   0.002391  0.003111
std    0.001771  0.002202
min    0.000346  0.000570
25%    0.001880  0.002439
50%    0.003413  0.004309
75%    0.003413  0.004382
max    0.003413  0.004455


### Secuenciador

El secuenciador es la parte del modelo que genera las secuencias y que se entrena tambien con una LSTM. Podemos aumentar y disminur su 'memoria'

#### Modelo con poca memoria en el secuenciador(size_sm_seq_model)

In [29]:
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
	inputs1 = Input(shape=(7, 7, 512))
	fe1 = GlobalMaxPooling2D()(inputs1)
	fe2 = Dense(128, activation='relu')(fe1)
	fe3 = RepeatVector(max_length)(fe2)
	# embedding
	inputs2 = Input(shape=(max_length,))
	emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
	emb3 = LSTM(128, return_sequences=True)(emb2)
	emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
	# merge inputs
	merged = concatenate([fe3, emb4])
	# language model (decoder)
	lm2 = LSTM(500)(merged)
	lm3 = Dense(500, activation='relu')(lm2)
	outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model


# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
# define experiment
model_name = 'size_sm_seq_model'
verbose = 1
n_epochs = 25
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Vocabulary Size: 591
Description Length: 35
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.000346 test=0.000570
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>2: train=0.000346 test=0.000570
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>3: train=0.003289 test=0.004309
          train      test
count  3.000000  3.000000
mean   0.001327  0.001816
std    0.001699  0.002159
min    0.000346  0.000570
25%    0.000346  0.000570
50%    0.000346  0.000570
75%    0.001817  0.002439
max    0.003289  0.004309


#### Modelo con mucha memoria en el secuenciador (size_lg_seq_model)

Añadimos una nueva capa de 256 neuronas despues de la primera

In [43]:
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
	# embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2) 
    # se añade esta nueva capa 
    emb4 = LSTM(256, return_sequences=True)(emb3)
    emb5 = TimeDistributed(Dense(128, activation='relu'))(emb4)
	
    # merge inputs
    merged = concatenate([fe3, emb5])
	# language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model



# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
# define experiment
model_name = 'size_lg_seq_model'
verbose = 1
n_epochs = 25
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Vocabulary Size: 591
Description Length: 35
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.000346 test=0.000570
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>2: train=0.000361 test=0.000619
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>3: train=0.005917 test=0.009047
          train      test
count  3.000000  3.000000
mean   0.002208  0.003412
std    0.003212  0.004880
min    0.000346  0.000570
25%    0.000354  0.000594
50%    0.000361  0.000619
75%    0.003139  0.004833
max    0.005917  0.009047


#### Cambio en la dimensionalidad de la capa embedding del secuenciador (size_em_seq_model)

Intentamos incrementar la capacidad de vocabulario aumentando de 50 a 100 dimensiones

In [46]:
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
	# embedding
    inputs2 = Input(shape=(max_length,))
	
    #cambia esta
    emb2 = Embedding(vocab_size, 100, mask_zero=True)(inputs2)
	
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
	# merge inputs
    merged = concatenate([fe3, emb4])
	# language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	
    return model



# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
# define experiment
model_name = 'size_em_seq_model'
verbose = 1
n_epochs = 25
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Vocabulary Size: 591
Description Length: 35
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.006544 test=0.009212
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>2: train=0.000411 test=0.000570
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>3: train=0.003616 test=0.015353
          train      test
count  3.000000  3.000000
mean   0.003523  0.008378
std    0.003068  0.007427
min    0.000411  0.000570
25%    0.002013  0.004891
50%    0.003616  0.009212
75%    0.005080  0.012282
max    0.006544  0.015353


### Modelo de lenguaje

El modelo de lenguaje es la parte del modelo que aprende las palabras (decoder).

#### size_sm_lang_model

Vamos a ver que efecto tiene disminuir las neuronas  de la LSTM y la capa densa a la mitad.

In [48]:
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
	# embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
	# merge inputs
    merged = concatenate([fe3, emb4])
	# language model (decoder)
	
    # estas
    lm2 = LSTM(256)(merged)
    lm3 = Dense(256, activation='relu')(lm2)
	
  
    outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
 



# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
# define experiment
model_name = 'size_sm_lang_model'
verbose = 1
n_epochs = 25
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Vocabulary Size: 591
Description Length: 35
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.000361 test=0.000570
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>2: train=0.000346 test=0.000670
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>3: train=0.003413 test=0.004386
          train      test
count  3.000000  3.000000
mean   0.001373  0.001876
std    0.001767  0.002175
min    0.000346  0.000570
25%    0.000353  0.000620
50%    0.000361  0.000670
75%    0.001887  0.002528
max    0.003413  0.004386


#### size_lg_lang_model

Y ahora añadimos una segunda capa al LSTM

In [49]:
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalMaxPooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
	# embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
	# merge inputs
    merged = concatenate([fe3, emb4])
	# language model (decoder)
    lm2 = LSTM(500, return_sequences=True)(merged)
    lm3 = LSTM(500)(lm2)
    lm4 = Dense(500, activation='relu')(lm3)
    outputs = Dense(vocab_size, activation='softmax')(lm4)
	# tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
 
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
# define experiment
model_name = 'size_lg_lang_model'
verbose = 1
n_epochs = 25
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Vocabulary Size: 591
Description Length: 35
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.027494 test=0.032987
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>2: train=0.055735 test=0.047379
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>3: train=0.049760 test=0.042895
          train      test
count  3.000000  3.000000
mean   0.044330  0.041087
std    0.014883  0.007365
min    0.027494  0.032987
25%    0.038627  0.037941
50%    0.049760  0.042895
75%    0.052748  0.045137
max    0.055735  0.047379


### Extractor de caracteristicas

El modelo base elimina parte del modelo VGG (quedandose sólo con el principio del mismo, y añadiendo una capa de global max pooling que luego alimenta a nuestro modelo)

Vamos a realizar cambios en este paso

#### fe_avg_pool

Cambiando el Global Max Pooling a Global Average Pooling

In [50]:
from keras.layers.pooling import GlobalAveragePooling2D


# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalAveragePooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
	# embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
	# merge inputs
    merged = concatenate([fe3, emb4])
	# language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model




# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
# define experiment
model_name = 'fe_avg_pool'
verbose = 1
n_epochs = 25
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Vocabulary Size: 591
Description Length: 35
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.068819 test=0.139216
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>2: train=0.117968 test=0.037236
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
>3: train=0.079522 test=0.173758
          train      test
count  3.000000  3.000000
mean   0.088770  0.116737
std    0.025846  0.070983
min    0.068819  0.037236
25%    0.074171  0.088226
50%    0.079522  0.139216
75%    0.098745  0.156487
max    0.117968  0.173758


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


#### fe_flat

Y quitando el Pooling

In [51]:
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = Flatten()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
	# embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
	# merge inputs
    merged = concatenate([fe3, emb4])
	# language model (decoder)
    lm2 = LSTM(500)(merged)
    lm3 = Dense(500, activation='relu')(lm2)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
# define experiment
model_name = 'fe_flat'
verbose = 1
n_epochs = 25
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Vocabulary Size: 591
Description Length: 35
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.132634 test=0.082226
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>2: train=0.017271 test=0.104119
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>3: train=0.136263 test=0.101933
          train      test
count  3.000000  3.000000
mean   0.095389  0.096093
std    0.067677  0.012058
min    0.017271  0.082226
25%    0.074952  0.092080
50%    0.132634  0.101933
75%    0.134449  0.103026
max    0.136263  0.104119


## Evaluación

Tras todos estos experimentos vemos que:

In [0]:
files_in_directory = listdir('/content')

# Generamos una lista con todos los archivos CSV que hay en el directorio

csv_files = list()

for f in files_in_directory:
  try:
    csv_e = f.split('.')[1]
  except IndexError:
    csv_e = ''
  if csv_e == 'csv':
    csv_files.append(f)

In [0]:
import pandas as pd

In [0]:
# unimos todos los CSV y lo guardamos todo en models-results.csv

todos = pd.DataFrame(columns=['train','test','model'])
# agregamos todos los resultados en un dataframe con los resultados de train 
# y test y el nombre delmodelo

for f in csv_files:
    data = pd.read_csv(f, index_col=None, header=0)
    data['model'] = f
    todos = pd.concat([todos, data])
    
todos.reindex()
todos.to_csv('models-results.csv')

In [0]:
# Calculamos la media de cada experimento y lo guardamos todo en 
# models-summary.csv

li= []
med = []

# agregamos todos los resultados en un dataframe con los resultados de train 
# y test y el nombre del modelo

for f in csv_files:
    data = pd.read_csv(f, index_col=None, header=0)
    data['model'] = f
    medias = data.mean()
    linea = [medias[0],medias[1],f]
    
    li.append(data)
    med.append(linea)
  
  
modelos = pd.concat(li, axis=0, ignore_index=True)

medias = pd.DataFrame(med, columns=['train','test','model'])

In [0]:
medias.to_csv('models-summary.csv', index=False)

In [55]:
medias.sort_values(by=['train', 'test'])

Unnamed: 0,train,test,model
2,0.000346,0.00057,baseline1.csv
9,0.001327,0.001816,size_sm_seq_model.csv
7,0.001373,0.001876,size_sm_lang_model.csv
0,0.002208,0.003412,size_lg_seq_model.csv
5,0.002391,0.003111,size_lg_fixed_vec.csv
3,0.003523,0.008378,size_em_seq_model.csv
8,0.04433,0.041087,size_lg_lang_model.csv
4,0.070102,0.057846,size_sm_fixed_vec.csv
6,0.08877,0.116737,fe_avg_pool.csv
1,0.095389,0.096093,fe_flat.csv


Respecto a la linea base, vemos que las siguientes modificaciones han mejoran el rendimiento:

* size_lg_lang_model
* fe_avg_pool
* size_lg_seq_model
* size_sm_fixed_model

Así pues, definimos un modelo que incorpore las mejoras que hemos encontrado en los experimentos.


## Entrenamiento con más datos

![modelo](https://docs.google.com/drawings/d/e/2PACX-1vQend4dewDH4a9ZRlRVDjmbUokgp8COGpo__TN3_Ye4Wlf0bvJhN8REwjXeVFj_-lC2isskBFAnJTqm/pub?w=846&h=345)

In [0]:
from keras.layers.pooling import GlobalAveragePooling2D

# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
    inputs1 = Input(shape=(7, 7, 512))
    fe1 = GlobalAveragePooling2D()(inputs1)
    fe2 = Dense(128, activation='relu')(fe1)
    fe3 = RepeatVector(max_length)(fe2)
	# embedding
    inputs2 = Input(shape=(max_length,))
    emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
    emb3 = LSTM(256, return_sequences=True)(emb2)
    emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
    
    # merge inputs
    merged = concatenate([fe3, emb4])
	# language model (decoder)
    lm2 = LSTM(500, return_sequences=True)(merged)
    lm3 = LSTM(500)(lm2)
    lm4 = Dense(500, activation='relu')(lm3)
    outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [57]:
filename = 'train-test-val-set.txt'
dataset = load_set(filename)
print('Dataset: %d' % len(dataset))

# train-test split
train, test, val = train_test_split(dataset,50,10,5)
print (val)
print('Train=%d, Test=%d' % (len(train), len(test)))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
test_descriptions = load_clean_descriptions('descriptions.txt', test)
val_descriptions = load_clean_descriptions('descriptions.txt', val)

print('Descriptions: train=%d, test=%d, val=%d' % (len(train_descriptions), 
                                                    len(test_descriptions), 
                                                    len(val_descriptions)))
# photo features
train_features = load_photo_features('features.pkl', train)
test_features = load_photo_features('features.pkl', test)
val_features = load_photo_features('features.pkl', val)

print('Photos: train=%d, test=%d, val=%d' % (len(train_features), 
                                             len(test_features),
                                             len(val_features)))

Dataset: 1000
{'4438301773', '4412893298', '4521610802', '454333157', '4403629017', '4407494664', '442138526', '4431086360', '4552389008', '446902008', '4441843357', '441921713', '4481066026', '4458984161', '4544052868', '455914518', '4548074929', '4488485439', '451547131', '435739506', '4519904608', '4544246166', '4561823639', '4443934130', '4523882717', '4538960982', '43390068', '4473229124', '4463211241', '4496551407', '4545104306', '4433557295', '4477683613', '437057062', '4503067869', '4467161305', '4556114229', '439830368', '4351426633', '4475974227', '4363060125', '4463772847', '4369970740', '4558179720', '4533955542', '4450184997', '4493330346', '4497873178', '4459992117', '4493008657'}
Train=500, Test=100
Descriptions: train=500, test=100, val=50
Photos: train=500, test=100, val=50


In [58]:
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)
 
  
verbose = 1
n_epochs = 50
n_photos_per_update = 4
n_batches_per_epoch = int(len(train) / n_photos_per_update)


# run experiment
train_results, test_results = list(), list()

# define the model
model = define_model(vocab_size, max_length)

# fit model
model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)



Vocabulary Size: 1542
Description Length: 36
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7fc616e8e550>

## Evaluacion final

In [0]:
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
	actual, predicted = list(), list()
	# step over the whole set
	for key, desc in descriptions.items():
		# generate description
		yhat = generate_desc(model, tokenizer, photos[key], max_length)
		# store actual and predicted
		actual.append([desc.split()])
		predicted.append(yhat.split())
		print('Actual:    %s' % desc)
		print('Predicted: %s' % yhat)
		if len(actual) >= 5:
			break
	# calculate BLEU score
	bleu = corpus_bleu(actual, predicted)
	return bleu

In [65]:
# evaluate model on training data
train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)

# store
train_results.append(train_score)
test_results.append(test_score)


Actual:    startseq small girl in the grass plays with fingerpaints in front of white canvas with rainbow on it endseq
Predicted: startseq little boy in the brown tshirt is running on the snow endseq
Actual:    startseq the white out conditions of snow on the ground seem to almost obliverate the details of man dressed for the cold weather in heavy jacket and red hat riding bicycle in suburban neighborhood endseq
Predicted: startseq the white out conditions of snow on the ground seem to almost obliverate the details of man dressed for the cold weather in heavy jacket and red hat riding bicycle in suburban neighborhood endseq
Actual:    startseq person wearing skis looking at framed pictures set up in the snow endseq
Predicted: startseq person wearing skis looking on framed pictures set appears in the snow endseq
Actual:    startseq group of people stand in the back of truck filled with cotton endseq
Predicted: startseq group of people stand in the back of truck filled with cotton endseq

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


In [66]:
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())

df.to_csv('final_result.csv', index=False)

         train      test
count  2.00000  2.000000
mean   0.73463  0.267518
std    0.00000  0.000000
min    0.73463  0.267518
25%    0.73463  0.267518
50%    0.73463  0.267518
75%    0.73463  0.267518
max    0.73463  0.267518


## Conclusiones


Vemos unos resultados suficientemente buenos. Las frases estan bien construidas, pero claramente hay overfitting, ya que parece que nuestro modelo se ha aprendido demasiado bien el dataset. 

Aun así son resultados prometedores, y probablemente usandoo el dataset completo y ajustando algún parametro de entrenamiento de forma más fina, podremos tener un buen modelo. 

Para ello, necesitamos tiempo y máquina, pero para una primera aproximación está bien.
