We are going to use the Flickr8K dataset, which consists of about 8,000 photographs with their descriptions.

The first thing we'll do is download the dataset:

In [4]:
!wget http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_text.zip
!unzip -o Flickr8k_text.zip

--2018-07-01 21:11:02--  http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_text.zip
Resolving nlp.cs.illinois.edu (nlp.cs.illinois.edu)... 192.17.58.132
Connecting to nlp.cs.illinois.edu (nlp.cs.illinois.edu)|192.17.58.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2340801 (2.2M) [application/zip]
Saving to: ‘Flickr8k_text.zip.1’


2018-07-01 21:11:03 (4.72 MB/s) - ‘Flickr8k_text.zip.1’ saved [2340801/2340801]

Archive:  Flickr8k_text.zip
  inflating: CrowdFlowerAnnotations.txt  
  inflating: ExpertAnnotations.txt   
  inflating: Flickr8k.lemma.token.txt  
  inflating: __MACOSX/._Flickr8k.lemma.token.txt  
  inflating: Flickr8k.token.txt      
  inflating: Flickr_8k.devImages.txt  
  inflating: Flickr_8k.testImages.txt  
  inflating: Flickr_8k.trainImages.txt  
  inflating: readme.txt              


Now let's prepare our data. To do that, we'll create a few helper functions:

In [5]:
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text
 
filename = 'Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)

In [6]:
len(doc)

3395237

In [7]:
doc[:255]

'1000268201_693b08cb0e.jpg#0\tA child in a pink dress is climbing up a set of stairs in an entry way .\n1000268201_693b08cb0e.jpg#1\tA girl going into a wooden building .\n1000268201_693b08cb0e.jpg#2\tA little girl climbing into a wooden playhouse .\n1000268201_'

In [8]:
# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# store the first description for each image
		if image_id not in mapping:
			mapping[image_id] = image_desc
	return mapping
 
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))

Loaded: 8092 
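
To make the parsing step concrete, here is a minimal sketch of what happens to a single line of Flickr8k.token.txt. The sample line is copied from the doc[:255] output above; the variable names are only for illustration.

```python
# One line of Flickr8k.token.txt looks like "<file>#<caption number>\t<caption>"
sample = ('1000268201_693b08cb0e.jpg#0\t'
          'A child in a pink dress is climbing up a set of stairs in an entry way .')

tokens = sample.split()               # splits on the tab and on spaces
image_id = tokens[0].split('.')[0]    # drops ".jpg#0" and keeps the bare id
caption = ' '.join(tokens[1:])        # everything else is the caption

print(image_id)
print(caption)
```

Because the "#0" suffix is dropped, every caption of a photo maps to the same id, and load_descriptions only stores the first one it sees.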


Now let's clean the descriptions. We need to convert all words to lowercase, remove punctuation, and drop words that are only one character long, such as "a". Keep in mind that the dataset is in English.

Here we go:

In [9]:
import string

def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc in descriptions.items():
		# tokenize
		desc = desc.split()
		# convert to lower case
		desc = [word.lower() for word in desc]
		# remove punctuation from each token
		desc = [w.translate(table) for w in desc]
		# remove hanging 's' and 'a'
		desc = [word for word in desc if len(word)>1]
		# store as string
		descriptions[key] =  ' '.join(desc)

# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
all_tokens = ' '.join(descriptions.values()).split()
vocabulary = set(all_tokens)
print('Vocabulary Size: %d' % len(vocabulary))

Vocabulary Size: 4484
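
To see the effect of the three cleaning rules on one caption, here is a small sketch that applies them to a single string. The helper name clean_caption is hypothetical (it is not part of the notebook); the input is the first caption of the dataset.

```python
import string

def clean_caption(caption):
    # Same three rules as clean_descriptions, applied to one string
    table = str.maketrans('', '', string.punctuation)
    words = [word.lower().translate(table) for word in caption.split()]
    # Dropping tokens of length <= 1 also removes the now-empty "." token
    return ' '.join(word for word in words if len(word) > 1)

raw = 'A child in a pink dress is climbing up a set of stairs in an entry way .'
print(clean_caption(raw))
```

The result should match the first entry of descriptions.values() shown further down in the notebook.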


And we save the cleaned dataset so we can reuse it more conveniently later:

In [10]:
# save descriptions to file, one per line
def save_doc(descriptions, filename):
	lines = list()
	for key, desc in descriptions.items():
		lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# save descriptions
save_doc(descriptions, 'descriptions.txt')
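
The file format is simply "id description", one photo per line. As a quick sanity check, a round trip could be sketched like this. The helpers write_lines and read_lines are hypothetical stand-ins for save_doc and the loader used later, and the two entries are taken from the notebook's own output.

```python
import os
import tempfile

def write_lines(mapping, path):
    # One "id description" pair per line, no trailing newline
    with open(path, 'w') as f:
        f.write('\n'.join(key + ' ' + desc for key, desc in mapping.items()))

def read_lines(path):
    mapping = {}
    with open(path) as f:
        for line in f.read().split('\n'):
            if not line:
                continue
            # The id never contains a space, so split at the first one
            key, _, desc = line.partition(' ')
            mapping[key] = desc
    return mapping

toy = {'1015584366_dfcec3c85a': 'black dog leaps over log',
       '1007320043_627395c3d8': 'child playing on rope net'}
path = os.path.join(tempfile.mkdtemp(), 'descriptions.txt')
write_lines(toy, path)
restored = read_lines(path)
print(restored == toy)
```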

In [11]:
!ls -la

total 1104804
drwxr-xr-x 1 root root       4096 Jul  1 21:11 .
drwxr-xr-x 1 root root       4096 Jul  1 20:25 ..
drwx------ 4 root root       4096 Jul  1 20:34 .cache
drwxr-xr-x 3 root root       4096 Jul  1 20:34 .config
-rw-r--r-- 1 root root    2918552 Oct 14  2013 CrowdFlowerAnnotations.txt
drwxr-xr-x 3 root root       4096 Jun 28 16:55 datalab
-rw-r--r-- 1 root root     587144 Jul  1 21:11 descriptions.txt
-rw-r--r-- 1 root root     346674 Oct 14  2013 ExpertAnnotations.txt
-rw-r--r-- 1 root root          0 Jul  1 21:06 features.pkl
drwxr-xr-x 2 root root     425984 Oct  3  2012 Flicker8k_Dataset
-rw-r--r-- 1 root root 1115419746 Oct 24  2013 Flickr8k_Dataset.zip
-rw-r--r-- 1 root root      25801 Oct 10  2013 Flickr_8k.devImages.txt
-rw-r--r-- 1 root root    3244761 Feb 16  2012 Flickr8k.lemma.token.txt
-rw-r--r-- 1 root root      25775 Oct 10  2013 Flickr_8k.testImages.txt
-rw-r--r-- 1 root root    2340801 Oct 28  2013 Flickr8k_text.zip
-rw-r--r-- 1 root root    2

And if we look at our dataset now...

In [12]:
type(descriptions)

dict

In [13]:
descriptions.keys()

dict_keys(['1000268201_693b08cb0e', '1001773457_577c3a7d70', '1002674143_1b742ab4b8', '1003163366_44323f5815', '1007129816_e794419615', '1007320043_627395c3d8', '1009434119_febe49276a', '1012212859_01547e3f17', '1015118661_980735411b', '1015584366_dfcec3c85a', '101654506_8eb26cfb60', '101669240_b2d3e7f17b', '1016887272_03199f49c4', '1019077836_6fc9b15408', '1019604187_d087bf9a5f', '1020651753_06077ec457', '1022454332_6af2c1449a', '1022454428_b6b660a67b', '1022975728_75515238d8', '102351840_323e3de834', '1024138940_f1fefbdce1', '102455176_5f8ead62d5', '1026685415_0431cbf574', '1028205764_7e8df9a2ea', '1030985833_b0902ea560', '103106960_e8a41d64f8', '103195344_5d2dc613a3', '103205630_682ca7285b', '1032122270_ea6f0beedb', '1032460886_4a598ed535', '1034276567_49bb87c51c', '104136873_5b5d41be75', '1042020065_fb3d3ba5ba', '1042590306_95dea0916c', '1045521051_108ebc19be', '1048710776_bb5b0a5c7c', '1052358063_eae6744153', '105342180_4d4a40b47f', '1053804096_ad278b25f1', '1055623002_8195a43714'

In [14]:
descriptions.values()

dict_values(['child in pink dress is climbing up set of stairs in an entry way', 'black dog and spotted dog are fighting', 'little girl covered in paint sits in front of painted rainbow with her hands in bowl', 'man lays on bench while his dog sits by him', 'man in an orange hat starring at something', 'child playing on rope net', 'black and white dog is running in grassy garden surrounded by white fence', 'dog shakes its head near the shore red ball next to it', 'boy smiles in front of stony wall in city', 'black dog leaps over log', 'brown and white dog is running through the snow', 'man in hat is displaying pictures next to skier in blue hat', 'collage of one person climbing cliff', 'brown dog chases the water from sprinkler on lawn', 'dog prepares to catch thrown object in field with nearby cars', 'black and white dog jumping in the air to get toy', 'child and woman are at waters edge in big city', 'couple and an infant being held by the male sitting next to pond with near by strol

OK, that's it for preparing the text dataset! Now let's prepare the photos.

What we'll do is use VGG16 to extract the features of the images and save them. So we'll load the model without its top (classifier) layers, run each of our images through it, and store the resulting features in a file that we'll later use to feed our LSTM. We could do this "online", that is, predict and pass the result straight to the LSTM without saving it to disk, but to make training faster we'll first predict and save the features, and then train the LSTM.
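
The "predict once, save, reuse" idea can be sketched independently of Keras. In the sketch below, load_or_compute and toy_extract are hypothetical names, and toy_extract merely stands in for the VGG16 forward pass; the empty-file check is an extra precaution, not something the notebook does.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, image_ids, extract_fn):
    # Reuse the cache only if it exists and is not empty
    if os.path.exists(cache_path) and os.path.getsize(cache_path) > 0:
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    # Otherwise run the (expensive) extractor once per image and save the result
    features = {image_id: extract_fn(image_id) for image_id in image_ids}
    with open(cache_path, 'wb') as f:
        pickle.dump(features, f)
    return features

calls = []
def toy_extract(image_id):
    # Stand-in for the CNN: records the call and returns a fake feature
    calls.append(image_id)
    return [float(len(image_id))]

cache = os.path.join(tempfile.mkdtemp(), 'toy_features.pkl')
first = load_or_compute(cache, ['a', 'bb'], toy_extract)
second = load_or_compute(cache, ['a', 'bb'], toy_extract)
print(len(calls))
```

The second call reads the pickle instead of calling the extractor again, which is the saving we are after when training the LSTM for many epochs.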

To do that, let's download the image dataset:

In [15]:
rm -rf Flickr8k_Dataset.zip*

In [16]:
ls -la

total 15516
drwxr-xr-x 1 root root    4096 Jul  1 21:11 ./
drwxr-xr-x 1 root root    4096 Jul  1 20:25 ../
drwx------ 4 root root    4096 Jul  1 20:34 .cache/
drwxr-xr-x 3 root root    4096 Jul  1 20:34 .config/
-rw-r--r-- 1 root root 2918552 Oct 14  2013 CrowdFlowerAnnotations.txt
drwxr-xr-x 3 root root    4096 Jun 28 16:55 datalab/
-rw-r--r-- 1 root root  587144 Jul  1 21:11 descriptions.txt
-rw-r--r-- 1 root root  346674 Oct 14  2013 ExpertAnnotations.txt
-rw-r--r-- 1 root root       0 Jul  1 21:06 features.pkl
drwxr-xr-x 2 root root  425984 Oct  3  2012 Flicker8k_Dataset/
-rw-r--r-- 1 root root   25801 Oct 10  2013 Flickr_8k.devImages.txt
-rw-r--r-- 1 root root 3244761 Feb 16  2012 Flickr8k.lemma.token.txt
-rw-r--r-- 1 root root   25775 Oct 10  2013 Flickr_8k.testImages.txt
-rw-r--r-- 1 root root 2340801 Oct 28  2013 Flickr8k_text.zip
-rw-r--r-- 1 root root 2340801 Oct 28  2013 Flickr8k_text.

In [17]:
!wget http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_Dataset.zip
!unzip -q Flickr8k_Dataset.zip

--2018-07-01 21:12:05--  http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_Dataset.zip
Resolving nlp.cs.illinois.edu (nlp.cs.illinois.edu)... 192.17.58.132
Connecting to nlp.cs.illinois.edu (nlp.cs.illinois.edu)|192.17.58.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115419746 (1.0G) [application/zip]
Saving to: ‘Flickr8k_Dataset.zip’


2018-07-01 21:12:18 (79.1 MB/s) - ‘Flickr8k_Dataset.zip’ saved [1115419746/1115419746]

replace Flicker8k_Dataset/1000268201_693b08cb0e.jpg? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [19]:
# check whether the images were unzipped
!ls -ls

total 1104768
   2852 -rw-r--r-- 1 root root    2918552 Oct 14  2013 CrowdFlowerAnnotations.txt
      4 drwxr-xr-x 3 root root       4096 Jun 28 16:55 datalab
    576 -rw-r--r-- 1 root root     587144 Jul  1 21:11 descriptions.txt
    340 -rw-r--r-- 1 root root     346674 Oct 14  2013 ExpertAnnotations.txt
      0 -rw-r--r-- 1 root root          0 Jul  1 21:06 features.pkl
    420 drwxr-xr-x 2 root root     425984 Oct  3  2012 Flicker8k_Dataset
1089288 -rw-r--r-- 1 root root 1115419746 Oct 24  2013 Flickr8k_Dataset.zip
     28 -rw-r--r-- 1 root root      25801 Oct 10  2013 Flickr_8k.devImages.txt
   3172 -rw-r--r-- 1 root root    3244761 Feb 16  2012 Flickr8k.lemma.token.txt
     28 -rw-r--r-- 1 root root      25775 Oct 10  2013 Flickr_8k.testImages.txt
   2292 -rw-r--r-- 1 root root    2340801 Oct 28  2013 Flickr8k_text.zip
   2292 -rw-r--r-- 1 root root    2340801 Oct 28  2013 Flickr8k_text.zip.1
   3316 -rw-r--r-- 1 root root    3395237 Oct 14  2013 Flickr8k.token.txt


In [20]:
!pip install tqdm



In [21]:
from os import listdir
from pickle import load, dump
from tqdm import tqdm
from numpy import array
from numpy import argmax
from pandas import DataFrame
from nltk.translate.bleu_score import corpus_bleu
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img, img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from keras.layers import Input, Dense, Flatten, LSTM, RepeatVector, TimeDistributed, Embedding, Dropout
from keras.layers.merge import concatenate
from keras.layers.pooling import GlobalMaxPooling2D

In [0]:
# extract features from each photo in the directory
def extract_features(directory):
  # load the model
  in_layer = Input(shape=(224, 224, 3))
  model = VGG16(include_top=False, input_tensor=in_layer)
  print(model.summary())
  # extract features from each photo
  features = dict()

  files_in_directory = listdir(directory)
  n_images = len(files_in_directory)
  for i, name in tqdm(enumerate(files_in_directory)):
    # load an image from file
    filename = directory + '/' + name
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    # get image id
    image_id = name.split('.')[0]
    # store feature
    features[image_id] = feature
    # print('{} / {} > {}'.format(i, n_images, name))
  return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file (the with-block makes sure the file is flushed and closed)
with open('features.pkl', 'wb') as f:
    dump(features, f)

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5


0it [00:00, ?it/s]

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

7510it [04:23, 28.47it/s]
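
It helps to know what shape each saved feature has. VGG16 has five pooling blocks, and each one halves the spatial size, so a 224x224 input ends up as a 7x7 map with 512 channels. A minimal arithmetic sketch (plain Python, no Keras involved; the channel counts are those of the standard VGG16 architecture):

```python
size = 224
block_channels = [64, 128, 256, 512, 512]   # output channels of VGG16 blocks 1..5

for block, channels in enumerate(block_channels, start=1):
    size //= 2   # each 2x2 max-pooling layer halves height and width
    print('block%d_pool: (%d, %d, %d)' % (block, size, size, channels))

feature_shape = (size, size, block_channels[-1])
print(feature_shape)
```

The first two lines agree with the block1_pool and block2_pool rows of the summary above, and the final shape is the one the captioning model expects as its image input.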

In [26]:
from pickle import load

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# split a dataset into train/test elements
def train_test_split(dataset):
	# order keys so the split is consistent
	ordered = sorted(dataset)
	# return split dataset as two new sets
	return set(ordered[:100]), set(ordered[100:200])

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# store
			descriptions[image_id] = 'startseq ' + ' '.join(image_desc) + ' endseq'
	return descriptions

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

# load dev set
filename = 'Flickr_8k.devImages.txt'
dataset = load_set(filename)
print('Dataset: %d' % len(dataset))
# train-test split
train, test = train_test_split(dataset)
print('Train=%d, Test=%d' % (len(train), len(test)))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: train=%d, test=%d' % (len(train_descriptions), len(test_descriptions)))
# photo features
train_features = load_photo_features('features.pkl', train)
test_features = load_photo_features('features.pkl', test)
print('Photos: train=%d, test=%d' % (len(train_features), len(test_features)))

Dataset: 1000
Train=100, Test=100
Descriptions: train=100, test=100
Photos: train=100, test=100
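
Note that only 200 of the 1000 dev identifiers are used, and that sorting is what makes the split reproducible, since Python sets have no guaranteed order. A small sketch with made-up ids (split_first_200 is a hypothetical copy of train_test_split):

```python
def split_first_200(ids):
    # Sort first so the same ids always land in the same subset
    ordered = sorted(ids)
    return set(ordered[:100]), set(ordered[100:200])

ids = {'img_%04d' % i for i in range(1000)}          # made-up identifiers
train_a, test_a = split_first_200(ids)
train_b, test_b = split_first_200(set(reversed(sorted(ids))))

print(train_a == train_b, test_a == test_b)
print(len(train_a), len(test_a), len(train_a & test_a))
```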


In [27]:
# Now we encode our descriptions as numbers
from keras.preprocessing.text import Tokenizer

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = list(descriptions.values())
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer
 
# prepare tokenizer
tokenizer = create_tokenizer(descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 4485
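
Why 4485 when the cleaned vocabulary had 4484 words? The Keras Tokenizer numbers words starting at 1 (most frequent first) and leaves index 0 free for padding, hence the "+ 1". The sketch below is not the Keras implementation, just a plain-Python illustration of that idea; build_word_index is a hypothetical helper.

```python
from collections import Counter

def build_word_index(captions):
    # Most frequent word gets index 1; index 0 is never assigned
    counts = Counter(word for caption in captions for word in caption.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

captions = ['black dog leaps over log', 'black and white dog jumping', 'dog runs']
word_index = build_word_index(captions)
vocab_size = len(word_index) + 1   # +1 for the reserved padding index 0

print(word_index['dog'], vocab_size)
```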


For example, the input sequence “little girl running in field” would be split into 6 input-output pairs to train the model:


X1       X2 (text sequence)                            y (word)

photo    startseq                                      little
photo    startseq, little                              girl
photo    startseq, little, girl                        running
photo    startseq, little, girl, running               in
photo    startseq, little, girl, running, in           field
photo    startseq, little, girl, running, in, field    endseq


In [28]:
# And we create the sequences:

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, desc, image, max_length):
	Ximages, XSeq, y = list(), list(),list()
	vocab_size = len(tokenizer.word_index) + 1
	# integer encode the description
	seq = tokenizer.texts_to_sequences([desc])[0]
	# split one sequence into multiple X,y pairs
	for i in range(1, len(seq)):
		# select
		in_seq, out_seq = seq[:i], seq[i]
		# pad input sequence
		in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
		# encode output sequence
		out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
		# store
		Ximages.append(image)
		XSeq.append(in_seq)
		y.append(out_seq)
	# Ximages, XSeq, y = array(Ximages), array(XSeq), array(y)
	return [Ximages, XSeq, y]
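
The splitting into input-output pairs can be sketched with plain words instead of integers. The helpers make_pairs and left_pad are hypothetical; left_pad only imitates the default behaviour of pad_sequences (zeros added on the left, and the last max_length items kept if the sequence is too long).

```python
def make_pairs(caption):
    # Wrap the caption in the start/end markers, then cut it at every position
    words = ['startseq'] + caption.split() + ['endseq']
    return [(words[:i], words[i]) for i in range(1, len(words))]

def left_pad(ids, max_length, pad_value=0):
    # Keep the last max_length ids and fill the front with zeros
    ids = ids[-max_length:]
    return [pad_value] * (max_length - len(ids)) + ids

pairs = make_pairs('little girl running in field')
for in_words, out_word in pairs:
    print(', '.join(in_words), '->', out_word)

print(left_pad([5, 2, 7], 6))
```

A caption of 5 words therefore produces 6 training samples, all sharing the same photo.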

In [18]:
# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor (encoder)
	inputs1 = Input(shape=(7, 7, 512))
	fe1 = GlobalMaxPooling2D()(inputs1)
	fe2 = Dense(128, activation='relu')(fe1)
	fe3 = RepeatVector(max_length)(fe2)
	# embedding
	inputs2 = Input(shape=(max_length,))
	emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
	emb3 = LSTM(256, return_sequences=True)(emb2)
	emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
	# merge inputs
	merged = concatenate([fe3, emb4])
	# language model (decoder)
	lm2 = LSTM(500)(merged)
	lm3 = Dense(500, activation='relu')(lm2)
	outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	print(model.summary())
	return model
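
To follow how the two branches meet, it can help to trace the tensor shapes by hand (batch dimension omitted). This is only a bookkeeping sketch, not Keras code; the values 366 and 25 are the vocabulary size and description length this notebook prints further down.

```python
vocab_size, max_length = 366, 25   # values printed later in this notebook

image_branch = [
    ('input', (7, 7, 512)),
    ('GlobalMaxPooling2D', (512,)),
    ('Dense(128)', (128,)),
    ('RepeatVector(max_length)', (max_length, 128)),
]
text_branch = [
    ('input', (max_length,)),
    ('Embedding(vocab_size, 50)', (max_length, 50)),
    ('LSTM(256, return_sequences=True)', (max_length, 256)),
    ('TimeDistributed(Dense(128))', (max_length, 128)),
]

# concatenate joins the two branches along the last axis
merged_shape = (max_length, image_branch[-1][1][1] + text_branch[-1][1][1])
# an embedding layer stores one 50-dimensional vector per vocabulary entry
embedding_params = vocab_size * 50

print(merged_shape, embedding_params)
```

The embedding parameter count can be compared with the embedding row of the model summary in the training output.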

In [29]:
# data generator, intended to be used in a call to model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length, n_step):
	# loop until we finish training
	while 1:
		# loop over photo identifiers in the dataset
		keys = list(descriptions.keys())
		for i in range(0, len(keys), n_step):
			Ximages, XSeq, y = list(), list(),list()
			for j in range(i, min(len(keys), i+n_step)):
				image_id = keys[j]
				# retrieve photo feature input
				image = features[image_id][0]
				# retrieve text input
				desc = descriptions[image_id]
				# generate input-output pairs
				in_img, in_seq, out_word = create_sequences(tokenizer, desc, image, max_length)
				for k in range(len(in_img)):
					Ximages.append(in_img[k])
					XSeq.append(in_seq[k])
					y.append(out_word[k])
			# yield this batch of samples to the model
			yield [[array(Ximages), array(XSeq)], array(y)]
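
The generator walks over the photo ids in chunks of n_step. Stripped of the feature and sequence handling, the chunking could be sketched as follows (photo_batches and the ids are made up for illustration):

```python
def photo_batches(keys, n_step):
    # Yield consecutive groups of n_step photo ids; the last group may be shorter
    for i in range(0, len(keys), n_step):
        yield keys[i:i + n_step]

keys = ['photo_%03d' % i for i in range(100)]
batches = list(photo_batches(keys, n_step=2))

print(len(batches), batches[0])
```

With 100 training photos and 2 photos per update this gives 50 steps per epoch. Each step still contains many samples, because every caption expands into one input-output pair per word.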

In [30]:
# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None
 
# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
	# seed the generation process
	in_text = 'startseq'
	# iterate over the whole length of the sequence
	for i in range(max_length):
		# integer encode input sequence
		sequence = tokenizer.texts_to_sequences([in_text])[0]
		# pad input
		sequence = pad_sequences([sequence], maxlen=max_length)
		# predict next word
		yhat = model.predict([photo,sequence], verbose=0)
		# convert probability to integer
		yhat = argmax(yhat)
		# map integer to word
		word = word_for_id(yhat, tokenizer)
		# stop if we cannot map the word
		if word is None:
			break
		# append as input for generating the next word
		in_text += ' ' + word
		# stop if we predict the end of the sequence
		if word == 'endseq':
			break
	return in_text
 
# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
	actual, predicted = list(), list()
	# step over the whole set
	for key, desc in descriptions.items():
		# generate description
		yhat = generate_desc(model, tokenizer, photos[key], max_length)
		# store actual and predicted
		actual.append([desc.split()])
		predicted.append(yhat.split())
	# calculate BLEU score
	bleu = corpus_bleu(actual, predicted)
	return bleu
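
generate_desc is a greedy decoder: at each step it keeps the single most likely next word and feeds the growing sentence back in. The control flow can be sketched without a trained model. Here greedy_decode and toy_next_word are hypothetical, and toy_next_word just replays a canned caption where the real code would call model.predict and argmax.

```python
def greedy_decode(next_word_fn, max_length):
    words = ['startseq']
    for _ in range(max_length):
        word = next_word_fn(words)   # stand-in for predict + argmax + word lookup
        if word is None:             # no word could be mapped: stop
            break
        words.append(word)
        if word == 'endseq':         # the model says the caption is finished
            break
    return ' '.join(words)

canned = ['black', 'dog', 'leaps', 'over', 'log', 'endseq']

def toy_next_word(words):
    # Replays the canned caption one word at a time
    position = len(words) - 1
    return canned[position] if position < len(canned) else None

print(greedy_decode(toy_next_word, max_length=25))
print(greedy_decode(toy_next_word, max_length=3))
```

The second call shows the other stopping condition: if endseq is never predicted, generation is cut off after max_length words.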

Because we train on very small datasets, our results vary a lot from run to run. So we'll run 3 rounds of the experiment (n_repeats) and average the metrics to get a more reliable figure.

In [31]:
# load dev set
filename = 'Flickr_8k.devImages.txt'
dataset = load_set(filename)
print('Dataset: %d' % len(dataset))
# train-test split
train, test = train_test_split(dataset)
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: train=%d, test=%d' % (len(train_descriptions), len(test_descriptions)))
# photo features
train_features = load_photo_features('features.pkl', train)
test_features = load_photo_features('features.pkl', test)
print('Photos: train=%d, test=%d' % (len(train_features), len(test_features)))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max(len(s.split()) for s in list(train_descriptions.values()))
print('Description Length: %d' % max_length)

Dataset: 1000
Descriptions: train=100, test=100
Photos: train=100, test=100
Vocabulary Size: 366
Description Length: 25


In [28]:
# define experiment
model_name = 'baseline1'
verbose = 1
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

Dataset: 1000
Descriptions: train=100, test=100
Photos: train=100, test=100
Vocabulary Size: 366
Description Length: 25
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_1 (GlobalM (None, 512)          0           input_2[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 25, 50)       18300       input_3[0][0]          

Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50

Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
 7/50 [===>..........................] - ETA: 7s - loss: 4.8392 - acc: 0.0915

Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50

Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
 6/50 [==>...........................] - ETA: 8s - loss: 3.9791 - acc: 0.1472

Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50

Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50

Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
 3/50 [>.............................] - ETA: 9s - loss: 2.1616 - acc: 0.3707

Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50

Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
 5/50 [==>...........................] - ETA: 9s - loss: 1.3974 - acc: 0.5980



Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


>1: train=0.132057 test=0.055566
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_2 (GlobalM (None, 512)          0           input_4[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 25, 50)       18300       input_5[0][0]                    
____________________________________________________________________________

Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
 5/50 [==>...........................] - ETA: 8s - loss: 5.1012 - acc: 0.1044

Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
 1/50 [..............................] - ETA: 9s - loss: 4.7852 - acc: 0.1905

Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50

Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
 7/50 [===>..........................] - ETA: 8s - loss: 3.7332 - acc: 0.1429

Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50


Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
 1/50 [..............................] - ETA: 9s - loss: 3.1847 - acc: 0.1905

Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50


Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50

Epoch 48/50
Epoch 49/50
Epoch 50/50
>2: train=0.082915 test=0.030061
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_6 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_7 (InputLayer)            (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_3 (GlobalM (None, 512)          0           input_6[0][0]                    
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 25, 50)       18300       input_7[0][0]                    
________________________________________

Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
 2/50 [>.............................] - ETA: 9s - loss: 4.9782 - acc: 0.1032

Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50

Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
 5/50 [==>...........................] - ETA: 8s - loss: 4.3436 - acc: 0.1440

Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50

Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
 5/50 [==>...........................] - ETA: 9s - loss: 3.3518 - acc: 0.1897

Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50

Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
 3/50 [>.............................] - ETA: 9s - loss: 2.2837 - acc: 0.2784

Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50

Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

>3: train=0.054383 test=0.077934
          train      test
count  3.000000  3.000000
mean   0.089785  0.054520
std    0.039291  0.023954
min    0.054383  0.030061
25%    0.068649  0.042813
50%    0.082915  0.055566
75%    0.107486  0.066750
max    0.132057  0.077934
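
The summary table can be checked by hand from the three per-run scores printed above. A minimal sketch using only the standard library; stdev is the sample standard deviation, which is what DataFrame.describe() reports. Small differences in the last digit are possible because the scores are copied as printed, already rounded to six decimals.

```python
from statistics import mean, stdev

# BLEU scores of the three runs, as printed by the experiment loop
train_scores = [0.132057, 0.082915, 0.054383]
test_scores = [0.055566, 0.030061, 0.077934]

for name, scores in (('train', train_scores), ('test', test_scores)):
    print('%s: mean=%.6f std=%.6f' % (name, mean(scores), stdev(scores)))
```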


Notice that the mean BLEU scores are about 0.09 on train and 0.05 on test. Honestly, that doesn't tell us much, so let's run it again while looking at what our LSTM network actually predicts:

In [32]:
# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
	actual, predicted = list(), list()
	# step over the whole set
	for key, desc in descriptions.items():
		# generate description
		yhat = generate_desc(model, tokenizer, photos[key], max_length)
		# store actual and predicted
		actual.append([desc.split()])
		predicted.append(yhat.split())
		print('Actual:    %s' % desc)
		print('Predicted: %s' % yhat)
		if len(actual) >= 5:
			break
	# calculate BLEU score
	bleu = corpus_bleu(actual, predicted)
	return bleu
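
BLEU is built from n-gram precisions: for each n, the fraction of n-grams in the prediction that also appear in the reference. The sketch below computes that single ingredient for one sentence pair. It is not NLTK's implementation (no brevity penalty, no smoothing, one reference only), and the candidate sentence is invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

reference = 'startseq black dog leaps over log endseq'.split()
candidate = 'startseq black dog runs over grass endseq'.split()   # invented prediction

for n in range(1, 5):
    print(n, modified_precision(reference, candidate, n))
```

Here the 4-gram precision is zero even though most individual words are right. Since BLEU multiplies the precisions together, one zero drags the whole score down, which is the situation the "0 counts of 4-gram overlaps" warning in the training output refers to.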

In [29]:
# define experiment
model_name = 'baseline1'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_8 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_9 (InputLayer)            (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_4 (GlobalM (None, 512)          0           input_8[0][0]                    
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 25, 50)       18300       input_9[0][0]                    
__________________________________________________________________________________________________
dense_13 (

 - 10s - loss: 4.6583 - acc: 0.1406
Epoch 16/50
 - 10s - loss: 4.5955 - acc: 0.1493
Epoch 17/50
 - 10s - loss: 4.5429 - acc: 0.1479
Epoch 18/50
 - 10s - loss: 4.5016 - acc: 0.1503
Epoch 19/50
 - 9s - loss: 4.4499 - acc: 0.1535
Epoch 20/50
 - 9s - loss: 4.4276 - acc: 0.1488
Epoch 21/50
 - 10s - loss: 4.4440 - acc: 0.1461
Epoch 22/50
 - 9s - loss: 4.3534 - acc: 0.1507
Epoch 23/50
 - 9s - loss: 4.3137 - acc: 0.1585
Epoch 24/50
 - 9s - loss: 4.2789 - acc: 0.1538
Epoch 25/50
 - 10s - loss: 4.2188 - acc: 0.1530
Epoch 26/50
 - 9s - loss: 4.2190 - acc: 0.1511
Epoch 27/50
 - 9s - loss: 4.2599 - acc: 0.1511
Epoch 28/50
 - 9s - loss: 4.1890 - acc: 0.1510
Epoch 29/50
 - 9s - loss: 4.1021 - acc: 0.1663
Epoch 30/50
 - 9s - loss: 4.0885 - acc: 0.1768
Epoch 31/50
 - 9s - loss: 4.0311 - acc: 0.1702
Epoch 32/50
 - 9s - loss: 3.9595 - acc: 0.1755
Epoch 33/50
 - 9s - loss: 3.8800 - acc: 0.1892
Epoch 34/50
 - 9s - loss: 3.8090 - acc: 0.1848
Epoch 35/50
 - 9s - loss: 3.7734 - acc: 0.1861
Epoch 36/50
 - 9s -

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_10 (InputLayer)           (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_11 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_5 (GlobalM (None, 512)          0           input_10[0][0]                   
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 25, 50)       18300       input_11[0][0]                   
__________________________________________________________________________________________________
dense_17 (

 - 9s - loss: 4.6857 - acc: 0.1401
Epoch 16/50
 - 9s - loss: 4.6398 - acc: 0.1500
Epoch 17/50
 - 9s - loss: 4.5717 - acc: 0.1454
Epoch 18/50
 - 9s - loss: 4.5071 - acc: 0.1532
Epoch 19/50
 - 9s - loss: 4.4424 - acc: 0.1494
Epoch 20/50
 - 9s - loss: 4.4104 - acc: 0.1517
Epoch 21/50
 - 9s - loss: 4.3364 - acc: 0.1497
Epoch 22/50
 - 9s - loss: 4.3155 - acc: 0.1581
Epoch 23/50
 - 9s - loss: 4.2781 - acc: 0.1528
Epoch 24/50
 - 9s - loss: 4.2929 - acc: 0.1556
Epoch 25/50
 - 9s - loss: 4.2462 - acc: 0.1500
Epoch 26/50
 - 9s - loss: 4.1939 - acc: 0.1518
Epoch 27/50
 - 9s - loss: 4.1140 - acc: 0.1513
Epoch 28/50
 - 9s - loss: 4.0163 - acc: 0.1599
Epoch 29/50
 - 9s - loss: 3.9627 - acc: 0.1654
Epoch 30/50
 - 9s - loss: 3.9338 - acc: 0.1648
Epoch 31/50
 - 9s - loss: 3.8718 - acc: 0.1714
Epoch 32/50
 - 9s - loss: 3.7916 - acc: 0.1704
Epoch 33/50
 - 9s - loss: 3.7743 - acc: 0.1677
Epoch 34/50
 - 9s - loss: 3.7646 - acc: 0.1759
Epoch 35/50
 - 9s - loss: 3.7420 - acc: 0.1769
Epoch 36/50
 - 9s - loss:

Epoch 1/50
 - 12s - loss: 5.4504 - acc: 0.0909
Epoch 2/50
 - 9s - loss: 5.1907 - acc: 0.1012
Epoch 3/50
 - 9s - loss: 5.1564 - acc: 0.1012
Epoch 4/50
 - 9s - loss: 5.1153 - acc: 0.1012
Epoch 5/50
 - 9s - loss: 5.0951 - acc: 0.1012
Epoch 6/50
 - 10s - loss: 5.0886 - acc: 0.1012
Epoch 7/50
 - 9s - loss: 5.0916 - acc: 0.1012
Epoch 8/50
 - 9s - loss: 5.0768 - acc: 0.1012
Epoch 9/50
 - 9s - loss: 5.0591 - acc: 0.1012
Epoch 10/50
 - 10s - loss: 5.0416 - acc: 0.1012
Epoch 11/50
 - 9s - loss: 5.0274 - acc: 0.1012
Epoch 12/50
 - 9s - loss: 4.9790 - acc: 0.1034
Epoch 13/50
 - 9s - loss: 4.9589 - acc: 0.1032
Epoch 14/50
 - 9s - loss: 4.9351 - acc: 0.1043
Epoch 15/50
 - 9s - loss: 4.8789 - acc: 0.1054
Epoch 16/50
 - 9s - loss: 4.8729 - acc: 0.1058
Epoch 17/50
 - 9s - loss: 4.8433 - acc: 0.1102
Epoch 18/50
 - 9s - loss: 4.8245 - acc: 0.1114
Epoch 19/50
 - 9s - loss: 4.7782 - acc: 0.1200
Epoch 20/50
 - 9s - loss: 4.7344 - acc: 0.1251
Epoch 21/50
 - 9s - loss: 4.7618 - acc: 0.1133
Epoch 22/50
 - 9s -

Epoch 47/50
 - 9s - loss: 3.7036 - acc: 0.1921
Epoch 48/50
 - 10s - loss: 3.6664 - acc: 0.1942
Epoch 49/50
 - 9s - loss: 3.6335 - acc: 0.1873
Epoch 50/50
 - 9s - loss: 3.6085 - acc: 0.1971
Actual:    startseq child and woman are at waters edge in big city endseq
Predicted: startseq man in in in the the the the the the the the the the the the the the the the the the the the the
Actual:    startseq boy with stick kneeling in front of goalie net endseq
Predicted: startseq boy in in in front in front in front in front in front in front in the at endseq
Actual:    startseq woman crouches near three dogs in field endseq
Predicted: startseq black dog black dog to in to in to in to in to in endseq
Actual:    startseq boy bites hard into treat while he sits outside endseq
Predicted: startseq boy is is is is is and is is is is and is is is is and is is is is and is is is
Actual:    startseq person eats takeout while watching small television endseq
Predicted: startseq boy holds takeout while whi

As you can see, the descriptions we generate for our images are not the most accurate in the world. That is because this example uses very little data. In the SOURCE below, from which I took this example, you can see how they improve on what we did here.

SOURCE: https://machinelearningmastery.com/develop-a-caption-generation-model-in-keras/

I'm also leaving you several links you can consult to do even better:

* https://machinelearningmastery.com/how-to-caption-photos-with-deep-learning/
* https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
* https://daniel.lasiman.com/post/image-captioning/
* https://www.oreilly.com/learning/caption-this-with-tensorflow
* https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2
* https://deeplearningmania.quora.com/Keras-deep-learning-for-image-caption-retrieval

All of them are very instructive and will help you a lot with the assignment. I'm sorry I can't provide material in Spanish, but unfortunately this field moves very fast and the best material is always in English. All I can say is that we are working on fixing this, and in a few months we hope to have a blog offering high-level examples, didactic yet in-depth, in perfect La Mancha Castilian. If you're interested, let me know and we'll tell you as soon as we launch it!

Good luck, and remember, if you have any questions, message me on Slack! ;)

See you soon!

In [30]:
!ls -ls


total 1896144
      4 -rw-r--r-- 1 root root        127 Jun 29 20:31 baseline1.csv
   2852 -rw-r--r-- 1 root root    2918552 Oct 14  2013 CrowdFlowerAnnotations.txt
      4 drwxr-xr-x 3 root root       4096 Jun 28 16:55 datalab
    576 -rw-r--r-- 1 root root     587144 Jun 29 19:27 descriptions.txt
    340 -rw-r--r-- 1 root root     346674 Oct 14  2013 ExpertAnnotations.txt
 793680 -rw-r--r-- 1 root root  812722413 Jun 29 19:32 features.pkl
    404 drwxr-xr-x 2 root root     409600 Oct  3  2012 Flicker8k_Dataset
1089288 -rw-r--r-- 1 root root 1115419746 Oct 24  2013 Flickr8k_Dataset.zip
     28 -rw-r--r-- 1 root root      25801 Oct 10  2013 Flickr_8k.devImages.txt
   3172 -rw-r--r-- 1 root root    3244761 Feb 16  2012 Flickr8k.lemma.token.txt
     28 -rw-r--r-- 1 root root      25775 Oct 10  2013 Flickr_8k.testImages.txt
   2292 -rw-r--r-- 1 root root    2340801 Oct 28  2013 Flickr8k_text.zip
   3316 -rw-r--r-- 1 root root    3395237 Oct 14  2013 Flickr8k.token.txt
    15

We display the baseline generated with the initial parameters so that we can compare against it later.

In [31]:

!more baseline1.csv

train,test
0.07442134117103315,0.186617486185131
0.3494000958926732,0.2256908871471102
0.26360580829318675,0.19135406228852972
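
The scores above come from unsmoothed BLEU, and the training output includes NLTK's warning about zero n-gram overlaps. As a minimal sketch, assuming NLTK is installed, a smoothing function can be passed to `corpus_bleu`. The choice of `method1` and the caption pair (both strings appear in the logs, but are paired here only for illustration) are assumptions, not what this notebook ran.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

# One list of references per image, in the same shape evaluate_model builds
actual = [["startseq boy in grey pajamas is jumping on the couch endseq".split()]]
# One tokenized hypothesis per image
predicted = ["startseq boy in endseq".split()]

# method1 adds a small epsilon to n-gram orders that have zero matches
# (an assumed choice; NLTK offers several smoothing methods)
smoothing = SmoothingFunction().method1
smoothed_bleu = corpus_bleu(actual, predicted, smoothing_function=smoothing)
print("Smoothed BLEU: %f" % smoothed_bleu)
```

Smoothed and unsmoothed scores are not directly comparable, so every experiment would need to be re-evaluated with the same setting.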


First test: we change the size of the output vectors of the VGG feature extractor and of the word-to-number conversion (sequence encoder). Both produce 128-element vectors; we reduce them to 64.

In [33]:
# define the captioning model
def define_model_baseline2(vocab_size, max_length):
	# feature extractor (encoder)
	inputs1 = Input(shape=(7, 7, 512))
	fe1 = GlobalMaxPooling2D()(inputs1)
	fe2 = Dense(64, activation='relu')(fe1)
	fe3 = RepeatVector(max_length)(fe2)
	# embedding
	inputs2 = Input(shape=(max_length,))
	emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
	emb3 = LSTM(256, return_sequences=True)(emb2)
	emb4 = TimeDistributed(Dense(64, activation='relu'))(emb3)
	# merge inputs
	merged = concatenate([fe3, emb4])
	# language model (decoder)
	lm2 = LSTM(500)(merged)
	lm3 = Dense(500, activation='relu')(lm2)
	outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	print(model.summary())
	return model

In [34]:
# define experiment
model_name = 'baseline2'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model_baseline2(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training and test data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_14 (InputLayer)           (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_15 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_7 (GlobalM (None, 512)          0           input_14[0][0]                   
__________________________________________________________________________________________________
embedding_7 (Embedding)         (None, 25, 50)       18300       input_15[0][0]                   
__________________________________________________________________________________________________
dense_25 (

 - 9s - loss: 3.5229 - acc: 0.1830
Epoch 16/50
 - 9s - loss: 3.1225 - acc: 0.2165
Epoch 17/50
 - 9s - loss: 2.7949 - acc: 0.2347
Epoch 18/50
 - 10s - loss: 2.7094 - acc: 0.2478
Epoch 19/50
 - 9s - loss: 2.6893 - acc: 0.2392
Epoch 20/50
 - 9s - loss: 2.5587 - acc: 0.2362
Epoch 21/50
 - 9s - loss: 2.3683 - acc: 0.2517
Epoch 22/50
 - 10s - loss: 2.1100 - acc: 0.3170
Epoch 23/50
 - 9s - loss: 2.0957 - acc: 0.3275
Epoch 24/50
 - 9s - loss: 2.0327 - acc: 0.3315
Epoch 25/50
 - 9s - loss: 1.9757 - acc: 0.3463
Epoch 26/50
 - 9s - loss: 1.8770 - acc: 0.3635
Epoch 27/50
 - 9s - loss: 1.7910 - acc: 0.3737
Epoch 28/50
 - 9s - loss: 1.7694 - acc: 0.3836
Epoch 29/50
 - 9s - loss: 1.7092 - acc: 0.4032
Epoch 30/50
 - 9s - loss: 1.6369 - acc: 0.4288
Epoch 31/50
 - 9s - loss: 1.5811 - acc: 0.4258
Epoch 32/50
 - 9s - loss: 1.5368 - acc: 0.4413
Epoch 33/50
 - 9s - loss: 1.4420 - acc: 0.4715
Epoch 34/50
 - 9s - loss: 1.4631 - acc: 0.4667
Epoch 35/50
 - 9s - loss: 1.4621 - acc: 0.4678
Epoch 36/50
 - 10s - lo

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_16 (InputLayer)           (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_17 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_8 (GlobalM (None, 512)          0           input_16[0][0]                   
__________________________________________________________________________________________________
embedding_8 (Embedding)         (None, 25, 50)       18300       input_17[0][0]                   
__________________________________________________________________________________________________
dense_29 (

 - 9s - loss: 3.3347 - acc: 0.1875
Epoch 16/50
 - 9s - loss: 3.2422 - acc: 0.1968
Epoch 17/50
 - 9s - loss: 3.0931 - acc: 0.1886
Epoch 18/50
 - 9s - loss: 2.9913 - acc: 0.1951
Epoch 19/50
 - 9s - loss: 2.8921 - acc: 0.2188
Epoch 20/50
 - 9s - loss: 2.7413 - acc: 0.2343
Epoch 21/50
 - 9s - loss: 2.5305 - acc: 0.2739
Epoch 22/50
 - 9s - loss: 2.3479 - acc: 0.2997
Epoch 23/50
 - 10s - loss: 2.3676 - acc: 0.2688
Epoch 24/50
 - 10s - loss: 2.1732 - acc: 0.3190
Epoch 25/50
 - 10s - loss: 2.1084 - acc: 0.3082
Epoch 26/50
 - 9s - loss: 2.1140 - acc: 0.2953
Epoch 27/50
 - 10s - loss: 2.0997 - acc: 0.2953
Epoch 28/50
 - 10s - loss: 1.9110 - acc: 0.3490
Epoch 29/50
 - 9s - loss: 1.8175 - acc: 0.3635
Epoch 30/50
 - 10s - loss: 1.7371 - acc: 0.3683
Epoch 31/50
 - 9s - loss: 1.7176 - acc: 0.3975
Epoch 32/50
 - 9s - loss: 1.7073 - acc: 0.3916
Epoch 33/50
 - 9s - loss: 1.7069 - acc: 0.3991
Epoch 34/50
 - 9s - loss: 1.6664 - acc: 0.3939
Epoch 35/50
 - 9s - loss: 1.5740 - acc: 0.4218
Epoch 36/50
 - 9s -

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_18 (InputLayer)           (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_19 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_9 (GlobalM (None, 512)          0           input_18[0][0]                   
__________________________________________________________________________________________________
embedding_9 (Embedding)         (None, 25, 50)       18300       input_19[0][0]                   
__________________________________________________________________________________________________
dense_33 (

 - 9s - loss: 3.5006 - acc: 0.1863
Epoch 16/50
 - 9s - loss: 3.4332 - acc: 0.1819
Epoch 17/50
 - 9s - loss: 3.3589 - acc: 0.1884
Epoch 18/50
 - 9s - loss: 3.1414 - acc: 0.2095
Epoch 19/50
 - 9s - loss: 2.9594 - acc: 0.2172
Epoch 20/50
 - 9s - loss: 2.7497 - acc: 0.2366
Epoch 21/50
 - 9s - loss: 2.5462 - acc: 0.2773
Epoch 22/50
 - 9s - loss: 2.4595 - acc: 0.2655
Epoch 23/50
 - 9s - loss: 2.3890 - acc: 0.2872
Epoch 24/50
 - 9s - loss: 2.3173 - acc: 0.2921
Epoch 25/50
 - 9s - loss: 2.1821 - acc: 0.3162
Epoch 26/50
 - 9s - loss: 2.0449 - acc: 0.3342
Epoch 27/50
 - 9s - loss: 2.0255 - acc: 0.3528
Epoch 28/50
 - 9s - loss: 2.0438 - acc: 0.3341
Epoch 29/50
 - 9s - loss: 2.0160 - acc: 0.3380
Epoch 30/50
 - 9s - loss: 1.8744 - acc: 0.3745
Epoch 31/50
 - 9s - loss: 1.7619 - acc: 0.3809
Epoch 32/50
 - 9s - loss: 1.6880 - acc: 0.3995
Epoch 33/50
 - 9s - loss: 1.5724 - acc: 0.4550
Epoch 34/50
 - 9s - loss: 1.6019 - acc: 0.4352
Epoch 35/50
 - 9s - loss: 1.5646 - acc: 0.4327
Epoch 36/50
 - 9s - loss:

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


In [35]:
!more baseline2.csv

train,test
0.21027812894375322,0.13869466501960379
0.26901038664981225,0.234592094433893
0.2693906386516805,0.4400835078482482


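There are no large variations with respect to the baseline, although test BLEU varies widely between repeats (0.14 to 0.44), so three runs give only a rough comparison.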
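
One way to check this kind of comparison is to look at the mean and spread of each experiment rather than at individual runs. The following is a minimal sketch using only the standard library. The scores are copied from the `baseline1.csv` and `baseline2.csv` outputs above, and the `summarize` helper is a hypothetical name, not part of the notebook.

```python
import csv
import io
import statistics

# BLEU scores copied from the baseline1.csv and baseline2.csv outputs above
RESULTS = {
    "baseline1": """train,test
0.07442134117103315,0.186617486185131
0.3494000958926732,0.2256908871471102
0.26360580829318675,0.19135406228852972
""",
    "baseline2": """train,test
0.21027812894375322,0.13869466501960379
0.26901038664981225,0.234592094433893
0.2693906386516805,0.4400835078482482
""",
}


def summarize(csv_text):
    """Return {column: (mean, sample standard deviation)} for a results CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in ("train", "test"):
        values = [float(row[column]) for row in rows]
        summary[column] = (statistics.mean(values), statistics.stdev(values))
    return summary


summaries = {name: summarize(text) for name, text in RESULTS.items()}
for name, summary in summaries.items():
    for column, (mean, std) in summary.items():
        print("%s %-5s mean=%.4f std=%.4f" % (name, column, mean, std))
```

With only three repeats per experiment, and BLEU computed on a handful of captions, the standard deviation is itself a rough estimate.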

Second test: we again change the size of the output vectors of the VGG feature extractor and of the word-to-number conversion (sequence encoder). Both produce 128-element vectors; this time, instead of reducing them, we increase them to 256.


In [25]:
# define the captioning model
def define_model_vec_salidas_256(vocab_size, max_length):
	# feature extractor (encoder)
	inputs1 = Input(shape=(7, 7, 512))
	fe1 = GlobalMaxPooling2D()(inputs1)
	fe2 = Dense(256, activation='relu')(fe1)
	fe3 = RepeatVector(max_length)(fe2)
	# embedding
	inputs2 = Input(shape=(max_length,))
	emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
	emb3 = LSTM(256, return_sequences=True)(emb2)
	emb4 = TimeDistributed(Dense(256, activation='relu'))(emb3)
	# merge inputs
	merged = concatenate([fe3, emb4])
	# language model (decoder)
	lm2 = LSTM(500)(merged)
	lm3 = Dense(500, activation='relu')(lm2)
	outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	print(model.summary())
	return model

In [28]:
# define experiment
model_name = 'vec_salidas_256'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model_vec_salidas_256(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training and test data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_1 (GlobalM (None, 512)          0           input_2[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 25, 50)       18300       input_3[0][0]                    
__________________________________________________________________________________________________
dense_1 (D

 - 9s - loss: 4.9371 - acc: 0.1032
Epoch 16/50
 - 9s - loss: 4.8788 - acc: 0.1052
Epoch 17/50
 - 9s - loss: 4.8489 - acc: 0.1052
Epoch 18/50
 - 9s - loss: 4.8320 - acc: 0.1032
Epoch 19/50
 - 9s - loss: 4.8369 - acc: 0.1035
Epoch 20/50
 - 9s - loss: 4.8063 - acc: 0.1051
Epoch 21/50
 - 9s - loss: 4.8050 - acc: 0.1069
Epoch 22/50
 - 9s - loss: 4.8191 - acc: 0.1072
Epoch 23/50
 - 9s - loss: 4.8150 - acc: 0.1112
Epoch 24/50
 - 9s - loss: 4.7619 - acc: 0.1096
Epoch 25/50
 - 9s - loss: 4.7544 - acc: 0.1144
Epoch 26/50
 - 9s - loss: 4.7631 - acc: 0.1233
Epoch 27/50
 - 9s - loss: 4.7409 - acc: 0.1194
Epoch 28/50
 - 9s - loss: 4.7582 - acc: 0.1246
Epoch 29/50
 - 9s - loss: 4.7355 - acc: 0.1147
Epoch 30/50
 - 9s - loss: 4.7586 - acc: 0.1188
Epoch 31/50
 - 9s - loss: 4.7810 - acc: 0.1150
Epoch 32/50
 - 9s - loss: 4.7094 - acc: 0.1176
Epoch 33/50
 - 9s - loss: 4.7190 - acc: 0.1202
Epoch 34/50
 - 9s - loss: 4.6865 - acc: 0.1242
Epoch 35/50
 - 9s - loss: 4.6796 - acc: 0.1195
Epoch 36/50
 - 9s - loss:

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


Actual:    startseq adults and children stand and play in front of steps near wooded area endseq
Predicted: startseq boy endseq
Actual:    startseq boy in grey pajamas is jumping on the couch endseq
Predicted: startseq man endseq
Actual:    startseq boy holding kitchen utensils and making threatening face endseq
Predicted: startseq man endseq
Actual:    startseq man in green hat is someplace up high endseq
Predicted: startseq man endseq
>1: train=0.044164 test=0.027695
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, 25)           0                                            
_______________________________

 - 9s - loss: 5.0484 - acc: 0.1012
Epoch 14/50
 - 9s - loss: 5.0429 - acc: 0.1012
Epoch 15/50
 - 9s - loss: 5.0347 - acc: 0.1012
Epoch 16/50
 - 9s - loss: 5.0264 - acc: 0.1012
Epoch 17/50
 - 9s - loss: 5.0172 - acc: 0.1012
Epoch 18/50
 - 9s - loss: 5.0066 - acc: 0.1012
Epoch 19/50
 - 9s - loss: 4.9946 - acc: 0.1033
Epoch 20/50
 - 9s - loss: 4.9816 - acc: 0.1089
Epoch 21/50
 - 9s - loss: 4.9678 - acc: 0.1089
Epoch 22/50
 - 9s - loss: 4.9538 - acc: 0.1089
Epoch 23/50
 - 9s - loss: 4.9398 - acc: 0.1098
Epoch 24/50
 - 9s - loss: 4.9265 - acc: 0.1098
Epoch 25/50
 - 9s - loss: 4.9144 - acc: 0.1098
Epoch 26/50
 - 9s - loss: 4.9040 - acc: 0.1112
Epoch 27/50
 - 9s - loss: 4.8943 - acc: 0.1101
Epoch 28/50
 - 9s - loss: 4.8859 - acc: 0.1091
Epoch 29/50
 - 9s - loss: 4.8789 - acc: 0.1107
Epoch 30/50
 - 9s - loss: 4.8712 - acc: 0.1107
Epoch 31/50
 - 9s - loss: 4.8643 - acc: 0.1107
Epoch 32/50
 - 9s - loss: 4.8592 - acc: 0.1107
Epoch 33/50
 - 9s - loss: 4.8535 - acc: 0.1101
Epoch 34/50
 - 9s - loss:

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


Actual:    startseq adults and children stand and play in front of steps near wooded area endseq
Predicted: startseq man endseq
Actual:    startseq boy in grey pajamas is jumping on the couch endseq
Predicted: startseq man endseq
Actual:    startseq boy holding kitchen utensils and making threatening face endseq
Predicted: startseq man endseq
Actual:    startseq man in green hat is someplace up high endseq
Predicted: startseq man endseq
>2: train=0.076686 test=0.027695
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_6 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_7 (InputLayer)            (None, 25)           0                                            
_______________________________

 - 9s - loss: 5.0593 - acc: 0.1012
Epoch 14/50
 - 9s - loss: 5.0506 - acc: 0.1012
Epoch 15/50
 - 9s - loss: 5.0578 - acc: 0.1012
Epoch 16/50
 - 9s - loss: 5.0494 - acc: 0.1012
Epoch 17/50
 - 9s - loss: 5.0768 - acc: 0.1012
Epoch 18/50
 - 9s - loss: 5.0386 - acc: 0.1012
Epoch 19/50
 - 9s - loss: 5.0395 - acc: 0.1012
Epoch 20/50
 - 9s - loss: 5.0312 - acc: 0.1012
Epoch 21/50
 - 9s - loss: 5.0233 - acc: 0.1012
Epoch 22/50
 - 9s - loss: 5.0134 - acc: 0.1019
Epoch 23/50
 - 9s - loss: 5.0022 - acc: 0.1033
Epoch 24/50
 - 9s - loss: 4.9908 - acc: 0.1055
Epoch 25/50
 - 9s - loss: 4.9790 - acc: 0.1098
Epoch 26/50
 - 9s - loss: 4.9658 - acc: 0.1087
Epoch 27/50
 - 9s - loss: 4.9517 - acc: 0.1098
Epoch 28/50
 - 9s - loss: 4.9384 - acc: 0.1098
Epoch 29/50
 - 9s - loss: 4.9259 - acc: 0.1098
Epoch 30/50
 - 9s - loss: 4.9147 - acc: 0.1110
Epoch 31/50
 - 9s - loss: 4.9043 - acc: 0.1098
Epoch 32/50
 - 9s - loss: 4.8949 - acc: 0.1098
Epoch 33/50
 - 9s - loss: 4.8866 - acc: 0.1098
Epoch 34/50
 - 9s - loss:

In [29]:
!more vec_salidas_256.csv

train,test
0.04416371865034087,0.027694585930698876
0.07668624331822337,0.027694585930698876
0.05367478007323275,0.033658868738356


We see that it gets worse. Now we change the size of the language model (the decoder), which is responsible for generating the words; its inputs are the word sequences and the precomputed features. Here its LSTM and dense layers are reduced from 500 to 256 units.

In [30]:
# define the captioning model
def define_model_size_language(vocab_size, max_length):
	# feature extractor (encoder)
	inputs1 = Input(shape=(7, 7, 512))
	fe1 = GlobalMaxPooling2D()(inputs1)
	fe2 = Dense(128, activation='relu')(fe1)
	fe3 = RepeatVector(max_length)(fe2)
	# embedding
	inputs2 = Input(shape=(max_length,))
	emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
	emb3 = LSTM(256, return_sequences=True)(emb2)
	emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
	# merge inputs
	merged = concatenate([fe3, emb4])
	# language model (decoder)
	lm2 = LSTM(256)(merged)
	lm3 = Dense(256, activation='relu')(lm2)
	outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	print(model.summary())
	return model
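
To get a feel for how much this change shrinks the decoder, the parameter counts can be worked out by hand. This is a minimal sketch, assuming the standard parameter formulas for LSTM and dense layers with bias. The helper names `lstm_params` and `dense_params` are hypothetical and not part of the notebook. The 500-unit case assumes the same 256-wide merged input as the model above.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer with bias (4 gates)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Trainable parameters of a fully connected layer with bias."""
    return input_dim * units + units


# The decoder input concatenates two 128-wide tensors (fe3 and emb4)
merged_dim = 128 + 128

for units in (500, 256):
    # decoder LSTM followed by a Dense layer of the same width (lm2 and lm3)
    total = lstm_params(merged_dim, units) + dense_params(units, units)
    print("decoder units=%d -> LSTM + Dense parameters: %d" % (units, total))
```

The final softmax layer is left out because its size depends on `vocab_size`; it also shrinks, since its input width drops from 500 to 256.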

In [31]:
# define experiment
model_name = 'size_language'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model_size_language(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training and test data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_8 (InputLayer)            (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_9 (InputLayer)            (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_4 (GlobalM (None, 512)          0           input_8[0][0]                    
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 25, 50)       18300       input_9[0][0]                    
__________________________________________________________________________________________________
dense_13 (

 - 9s - loss: 4.8525 - acc: 0.1035
Epoch 16/50
 - 9s - loss: 4.8147 - acc: 0.1028
Epoch 17/50
 - 9s - loss: 4.7969 - acc: 0.1045
Epoch 18/50
 - 9s - loss: 4.7815 - acc: 0.1042
Epoch 19/50
 - 9s - loss: 4.7396 - acc: 0.1078
Epoch 20/50
 - 9s - loss: 4.7525 - acc: 0.1096
Epoch 21/50
 - 9s - loss: 4.7466 - acc: 0.1122
Epoch 22/50
 - 9s - loss: 4.7402 - acc: 0.1144
Epoch 23/50
 - 9s - loss: 4.7095 - acc: 0.1144
Epoch 24/50
 - 9s - loss: 4.6608 - acc: 0.1186
Epoch 25/50
 - 9s - loss: 4.6491 - acc: 0.1264
Epoch 26/50
 - 9s - loss: 4.6213 - acc: 0.1266
Epoch 27/50
 - 9s - loss: 4.6196 - acc: 0.1253
Epoch 28/50
 - 9s - loss: 4.5824 - acc: 0.1247
Epoch 29/50
 - 9s - loss: 4.5503 - acc: 0.1315
Epoch 30/50
 - 9s - loss: 4.5192 - acc: 0.1266
Epoch 31/50
 - 9s - loss: 4.4872 - acc: 0.1333
Epoch 32/50
 - 9s - loss: 4.4848 - acc: 0.1360
Epoch 33/50
 - 9s - loss: 4.4700 - acc: 0.1388
Epoch 34/50
 - 9s - loss: 4.4484 - acc: 0.1430
Epoch 35/50
 - 9s - loss: 4.3848 - acc: 0.1444
Epoch 36/50
 - 9s - loss:

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


Actual:    startseq couple with young child wrapped in blanket sitting on concrete step endseq
Predicted: startseq boy is in endseq
Actual:    startseq adults and children stand and play in front of steps near wooded area endseq
Predicted: startseq boy in endseq
Actual:    startseq boy in grey pajamas is jumping on the couch endseq
Predicted: startseq dog dog endseq
Actual:    startseq boy holding kitchen utensils and making threatening face endseq
Predicted: startseq boy is is endseq
Actual:    startseq man in green hat is someplace up high endseq
Predicted: startseq girl is the endseq
>1: train=0.193783 test=0.089643
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_10 (InputLayer)           (None, 7, 7, 512)    0                                            
____________________________________________________________________________

Epoch 13/50
 - 9s - loss: 4.7594 - acc: 0.1418
Epoch 14/50
 - 9s - loss: 4.7216 - acc: 0.1445
Epoch 15/50
 - 9s - loss: 4.7018 - acc: 0.1450
Epoch 16/50
 - 9s - loss: 4.6366 - acc: 0.1451
Epoch 17/50
 - 9s - loss: 4.5826 - acc: 0.1474
Epoch 18/50
 - 9s - loss: 4.5272 - acc: 0.1552
Epoch 19/50
 - 9s - loss: 4.4963 - acc: 0.1557
Epoch 20/50
 - 9s - loss: 4.4687 - acc: 0.1603
Epoch 21/50
 - 9s - loss: 4.4029 - acc: 0.1537
Epoch 22/50
 - 9s - loss: 4.3883 - acc: 0.1521
Epoch 23/50
 - 9s - loss: 4.4193 - acc: 0.1584
Epoch 24/50
 - 9s - loss: 4.3770 - acc: 0.1544
Epoch 25/50
 - 9s - loss: 4.3447 - acc: 0.1599
Epoch 26/50
 - 9s - loss: 4.2993 - acc: 0.1565
Epoch 27/50
 - 9s - loss: 4.3126 - acc: 0.1581
Epoch 28/50
 - 9s - loss: 4.2663 - acc: 0.1632
Epoch 29/50
 - 9s - loss: 4.2160 - acc: 0.1628
Epoch 30/50
 - 9s - loss: 4.2127 - acc: 0.1671
Epoch 31/50
 - 9s - loss: 4.1804 - acc: 0.1706
Epoch 32/50
 - 9s - loss: 4.1367 - acc: 0.1647
Epoch 33/50
 - 9s - loss: 4.1084 - acc: 0.1685
Epoch 34/50
 

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_12 (InputLayer)           (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_13 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_6 (GlobalM (None, 512)          0           input_12[0][0]                   
__________________________________________________________________________________________________
embedding_6 (Embedding)         (None, 25, 50)       18300       input_13[0][0]                   
__________________________________________________________________________________________________
dense_21 (

 - 9s - loss: 5.0487 - acc: 0.1012
Epoch 16/50
 - 9s - loss: 5.0422 - acc: 0.1012
Epoch 17/50
 - 9s - loss: 5.0313 - acc: 0.1012
Epoch 18/50
 - 9s - loss: 5.0235 - acc: 0.1012
Epoch 19/50
 - 9s - loss: 5.0298 - acc: 0.1012
Epoch 20/50
 - 9s - loss: 5.0240 - acc: 0.1012
Epoch 21/50
 - 9s - loss: 5.0148 - acc: 0.1012
Epoch 22/50
 - 9s - loss: 5.0104 - acc: 0.1033
Epoch 23/50
 - 9s - loss: 4.9986 - acc: 0.1033
Epoch 24/50
 - 9s - loss: 4.9876 - acc: 0.1047
Epoch 25/50
 - 9s - loss: 4.9761 - acc: 0.1033
Epoch 26/50
 - 9s - loss: 4.9646 - acc: 0.1089
Epoch 27/50
 - 9s - loss: 4.9530 - acc: 0.1098
Epoch 28/50
 - 9s - loss: 4.9414 - acc: 0.1098
Epoch 29/50
 - 9s - loss: 4.9298 - acc: 0.1098
Epoch 30/50
 - 9s - loss: 4.9187 - acc: 0.1110
Epoch 31/50
 - 9s - loss: 4.9080 - acc: 0.1118
Epoch 32/50
 - 9s - loss: 4.8979 - acc: 0.1118
Epoch 33/50
 - 9s - loss: 4.8885 - acc: 0.1118
Epoch 34/50
 - 9s - loss: 4.8799 - acc: 0.1127
Epoch 35/50
 - 9s - loss: 4.8715 - acc: 0.1118
Epoch 36/50
 - 9s - loss:

In [32]:
!more size_language.csv

train,test
0.1937829789834858,0.08964305477640969
0.316227766016838,0.09836481073949695
0.05367478007323275,0.033658868738356


Slightly better than the previous experiment but worse than the first one, so there is no improvement.

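In this experiment we replace GlobalMaxPooling2D with GlobalAveragePooling2D (the mean). Both collapse each 7x7 feature map to a single value per channel, which keeps the parameter count down and helps against overfitting; we will see whether switching between them has any effect.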
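
To make the difference between the two pooling modes concrete, here is a minimal pure-Python sketch on a toy 2x2 feature map with two channels. The `global_pool` helper and the toy values are illustrative assumptions, not part of the notebook; the Keras layers apply the same reduction to the (7, 7, 512) VGG16 features.

```python
# Toy feature map of shape (height=2, width=2, channels=2), as nested lists.
# The values are made up for illustration only.
feature_map = [
    [[1.0, 10.0], [2.0, 0.0]],
    [[3.0, 0.0], [6.0, 2.0]],
]

def global_pool(fmap, reduce_fn):
    """Collapse the two spatial axes, leaving one value per channel."""
    n_channels = len(fmap[0][0])
    pooled = []
    for c in range(n_channels):
        # gather every spatial position of channel c
        values = [pixel[c] for row in fmap for pixel in row]
        pooled.append(reduce_fn(values))
    return pooled

def mean(values):
    return sum(values) / len(values)

max_pooled = global_pool(feature_map, max)   # what global max pooling keeps
avg_pooled = global_pool(feature_map, mean)  # what global average pooling keeps
print(max_pooled)  # [6.0, 10.0]
print(avg_pooled)  # [3.0, 3.0]
```

Max pooling keeps only the strongest activation of each channel, while average pooling summarizes the whole map, so a single outlier (the 10.0 above) dominates the first and is diluted in the second.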

In [33]:
# define the captioning model
def define_model_cambio_pooling(vocab_size, max_length):
	# feature extractor (encoder)
	inputs1 = Input(shape=(7, 7, 512))
	fe1 = GlobalAveragePooling2D()(inputs1)
	fe2 = Dense(128, activation='relu')(fe1)
	fe3 = RepeatVector(max_length)(fe2)
	# embedding
	inputs2 = Input(shape=(max_length,))
	emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
	emb3 = LSTM(256, return_sequences=True)(emb2)
	emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
	# merge inputs
	merged = concatenate([fe3, emb4])
	# language model (decoder)
	lm2 = LSTM(500)(merged)
	lm3 = Dense(500, activation='relu')(lm2)
	outputs = Dense(vocab_size, activation='softmax')(lm3)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	print(model.summary())
	return model

In [34]:
# define experiment
model_name = 'cambio_pooling'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model_cambio_pooling(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_14 (InputLayer)           (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_15 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_7 (GlobalM (None, 512)          0           input_14[0][0]                   
__________________________________________________________________________________________________
embedding_7 (Embedding)         (None, 25, 50)       18300       input_15[0][0]                   
__________________________________________________________________________________________________
dense_25 (

 - 9s - loss: 4.8371 - acc: 0.1010
Epoch 16/50
 - 9s - loss: 4.7874 - acc: 0.1019
Epoch 17/50
 - 9s - loss: 4.7955 - acc: 0.1057
Epoch 18/50
 - 9s - loss: 4.7493 - acc: 0.1058
Epoch 19/50
 - 9s - loss: 4.7230 - acc: 0.1084
Epoch 20/50
 - 9s - loss: 4.7262 - acc: 0.1108
Epoch 21/50
 - 9s - loss: 4.6849 - acc: 0.1192
Epoch 22/50
 - 9s - loss: 4.6634 - acc: 0.1269
Epoch 23/50
 - 9s - loss: 4.6140 - acc: 0.1277
Epoch 24/50
 - 9s - loss: 4.6212 - acc: 0.1354
Epoch 25/50
 - 9s - loss: 4.5490 - acc: 0.1456
Epoch 26/50
 - 9s - loss: 4.5151 - acc: 0.1367
Epoch 27/50
 - 9s - loss: 4.4733 - acc: 0.1516
Epoch 28/50
 - 9s - loss: 4.4525 - acc: 0.1432
Epoch 29/50
 - 9s - loss: 4.3847 - acc: 0.1445
Epoch 30/50
 - 9s - loss: 4.3175 - acc: 0.1515
Epoch 31/50
 - 9s - loss: 4.2796 - acc: 0.1576
Epoch 32/50
 - 9s - loss: 4.2180 - acc: 0.1629
Epoch 33/50
 - 9s - loss: 4.1989 - acc: 0.1598
Epoch 34/50
 - 9s - loss: 4.1417 - acc: 0.1611
Epoch 35/50
 - 9s - loss: 4.1068 - acc: 0.1645
Epoch 36/50
 - 9s - loss:

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


Actual:    startseq couple with young child wrapped in blanket sitting on concrete step endseq
Predicted: startseq boy in in in in in endseq
Actual:    startseq adults and children stand and play in front of steps near wooded area endseq
Predicted: startseq boy in in in in walk in the the the the endseq
Actual:    startseq boy in grey pajamas is jumping on the couch endseq
Predicted: startseq child swims pool to endseq
Actual:    startseq boy holding kitchen utensils and making threatening face endseq
Predicted: startseq child game game game game child above above above above above head the endseq
Actual:    startseq man in green hat is someplace up high endseq
Predicted: startseq child laying laying with in the endseq
>1: train=0.204213 test=0.596445


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_16 (InputLayer)           (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
input_17 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
global_max_pooling2d_8 (GlobalM (None, 512)          0           input_16[0][0]                   
__________________________________________________________________________________________________
embedding_8 (Embedding)         (None, 25, 50)       18300       input_17[0][0]                   
__________________________________________________________________________________________________
dense_29 (

 - 9s - loss: 4.8597 - acc: 0.1048
Epoch 16/50
 - 9s - loss: 4.8514 - acc: 0.1044
Epoch 17/50
 - 9s - loss: 4.8552 - acc: 0.1035
Epoch 18/50
 - 9s - loss: 4.8187 - acc: 0.1069
Epoch 19/50
 - 9s - loss: 4.7896 - acc: 0.1060
Epoch 20/50
 - 9s - loss: 4.7952 - acc: 0.1035
Epoch 21/50
 - 9s - loss: 4.7712 - acc: 0.1115
Epoch 22/50
 - 9s - loss: 4.7805 - acc: 0.1070
Epoch 23/50
 - 9s - loss: 4.7410 - acc: 0.1122
Epoch 24/50
 - 9s - loss: 4.6980 - acc: 0.1187
Epoch 25/50
 - 9s - loss: 4.6689 - acc: 0.1216
Epoch 26/50
 - 9s - loss: 4.6413 - acc: 0.1322
Epoch 27/50
 - 9s - loss: 4.6189 - acc: 0.1381
Epoch 28/50
 - 9s - loss: 4.5601 - acc: 0.1407
Epoch 29/50
 - 9s - loss: 4.5742 - acc: 0.1441
Epoch 30/50
 - 9s - loss: 4.5198 - acc: 0.1358
Epoch 31/50
 - 9s - loss: 4.4687 - acc: 0.1402
Epoch 32/50
 - 9s - loss: 4.4238 - acc: 0.1521
Epoch 33/50
 - 9s - loss: 4.4303 - acc: 0.1472
Epoch 34/50
 - 9s - loss: 4.3407 - acc: 0.1511
Epoch 35/50
 - 9s - loss: 4.3046 - acc: 0.1481
Epoch 36/50
 - 9s - loss:

 - 13s - loss: 5.4187 - acc: 0.0905
Epoch 2/50
 - 9s - loss: 5.1520 - acc: 0.1001
Epoch 3/50
 - 9s - loss: 5.1692 - acc: 0.1012
Epoch 4/50
 - 9s - loss: 5.1516 - acc: 0.1012
Epoch 5/50
 - 9s - loss: 5.1172 - acc: 0.1012
Epoch 6/50
 - 9s - loss: 5.0733 - acc: 0.1012
Epoch 7/50
 - 9s - loss: 5.0317 - acc: 0.1000
Epoch 8/50
 - 9s - loss: 5.0141 - acc: 0.1012
Epoch 9/50
 - 9s - loss: 4.9891 - acc: 0.1037
Epoch 10/50
 - 9s - loss: 4.9690 - acc: 0.1023
Epoch 11/50
 - 9s - loss: 4.9434 - acc: 0.1032
Epoch 12/50
 - 9s - loss: 4.9010 - acc: 0.1043
Epoch 13/50
 - 9s - loss: 4.8685 - acc: 0.1051
Epoch 14/50
 - 9s - loss: 4.8461 - acc: 0.1044
Epoch 15/50
 - 9s - loss: 4.8184 - acc: 0.1068
Epoch 16/50
 - 9s - loss: 4.7932 - acc: 0.1046
Epoch 17/50
 - 9s - loss: 4.7665 - acc: 0.1146
Epoch 18/50
 - 9s - loss: 4.7476 - acc: 0.1148
Epoch 19/50
 - 9s - loss: 4.7132 - acc: 0.1198
Epoch 20/50
 - 9s - loss: 4.6756 - acc: 0.1263
Epoch 21/50
 - 9s - loss: 4.6289 - acc: 0.1309
Epoch 22/50
 - 9s - loss: 4.6532

Epoch 47/50
 - 9s - loss: 3.4644 - acc: 0.1836
Epoch 48/50
 - 9s - loss: 3.4994 - acc: 0.1867
Epoch 49/50
 - 9s - loss: 3.4171 - acc: 0.1891
Epoch 50/50
 - 9s - loss: 3.3089 - acc: 0.2047
Actual:    startseq child and woman are at waters edge in big city endseq
Predicted: startseq man and woman in endseq
Actual:    startseq boy with stick kneeling in front of goalie net endseq
Predicted: startseq boy in in in to endseq
Actual:    startseq woman crouches near three dogs in field endseq
Predicted: startseq black dog and dog in endseq
Actual:    startseq boy bites hard into treat while he sits outside endseq
Predicted: startseq boy swims swims swims swims endseq
Actual:    startseq person eats takeout while watching small television endseq
Predicted: startseq boy takeout with with with with endseq
Actual:    startseq couple with young child wrapped in blanket sitting on concrete step endseq
Predicted: startseq girl in in endseq
Actual:    startseq adults and children stand and play in fro

In this third experiment we introduce regularization (dropout) so that the network does not treat the training set as absolute truth and, when it meets data it has not seen, such as the test set, it generalizes better.


In [35]:
!more cambio_pooling.csv

train,test
0.2042128370387497,0.5964449011273305
0.176283828631542,0.23281052604553176
0.2929935643561398,0.40060195776406227


Changing the pooling type did improve the results. Now we repeat the same experiment, adding dropout to prevent overfitting and setting the learning rate explicitly to 0.001 (which is also the default for Adam in Keras).
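
As a rough illustration of what a `Dropout(0.5)` layer does during training, the following pure-Python sketch implements inverted dropout: each value is zeroed with probability `rate` and the survivors are rescaled so the expected activation is unchanged. The `dropout` function, the seed and the toy activations are assumptions made for this example, not code from the notebook.

```python
import random

def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        # at inference time the layer passes its input through untouched
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
activations = [1.0] * 10000     # toy activations, all equal to 1

dropped = dropout(activations, 0.5, rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
unchanged = dropout(activations, 0.5, rng, training=False)

print('fraction zeroed: %.3f' % zero_fraction)
print('mean activation after dropout: %.3f' % mean_after)
```

Roughly half of the units are silenced on every update, so the decoder cannot rely on any single image feature, while the rescaling keeps the average activation close to its original value.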

In [40]:
from keras.layers.pooling import GlobalAveragePooling2D
from keras.optimizers import Adam

def define_model_dropout_pooling(vocab_size, max_length):
  #feature extractor (encoder)
  inputs1 = Input(shape=(7,7,512))
  fe1 = GlobalAveragePooling2D()(inputs1)
  fe2 = Dropout(0.5)(fe1)
  fe3 = Dense(128, activation='relu')(fe2)
  fe4 = RepeatVector(max_length)(fe3)
  #embedding
  inputs2 = Input(shape=(max_length,))
  emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
  emb3 = LSTM(256, return_sequences=True)(emb2)
  emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
  #merge inputs
  merged = concatenate([fe4, emb4])
  #language model (decoder)
  lm2 = LSTM(500)(merged)
  lm3 = Dense(500, activation='relu')(lm2)
  outputs = Dense(vocab_size, activation='softmax')(lm3)
  # tie it together [image, seq] [word]
  model = Model(inputs=[inputs1, inputs2], outputs=outputs)
  model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])
  print(model.summary())
  return model

In [41]:
# define experiment
model_name = 'dropout_pooling'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model_dropout_pooling(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_23 (InputLayer)           (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
global_average_pooling2d_2 (Glo (None, 512)          0           input_23[0][0]                   
__________________________________________________________________________________________________
input_24 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 512)          0           global_average_pooling2d_2[0][0] 
__________________________________________________________________________________________________
embedding_

 - 9s - loss: 2.7261 - acc: 0.2660
Epoch 15/50
 - 9s - loss: 2.6741 - acc: 0.2659
Epoch 16/50
 - 9s - loss: 2.4553 - acc: 0.2970
Epoch 17/50
 - 9s - loss: 2.2267 - acc: 0.3440
Epoch 18/50
 - 9s - loss: 2.0700 - acc: 0.3549
Epoch 19/50
 - 9s - loss: 2.0405 - acc: 0.3599
Epoch 20/50
 - 9s - loss: 1.9411 - acc: 0.3825
Epoch 21/50
 - 9s - loss: 1.8858 - acc: 0.3884
Epoch 22/50
 - 9s - loss: 1.7020 - acc: 0.4644
Epoch 23/50
 - 9s - loss: 1.5802 - acc: 0.4767
Epoch 24/50
 - 9s - loss: 1.4314 - acc: 0.5060
Epoch 25/50
 - 9s - loss: 1.3533 - acc: 0.5506
Epoch 26/50
 - 9s - loss: 1.2940 - acc: 0.5563
Epoch 27/50
 - 9s - loss: 1.3041 - acc: 0.5568
Epoch 28/50
 - 9s - loss: 1.3390 - acc: 0.5467
Epoch 29/50
 - 9s - loss: 1.2667 - acc: 0.5634
Epoch 30/50
 - 9s - loss: 1.1804 - acc: 0.5858
Epoch 31/50
 - 9s - loss: 1.0292 - acc: 0.6328
Epoch 32/50
 - 9s - loss: 0.9584 - acc: 0.6601
Epoch 33/50
 - 9s - loss: 0.8893 - acc: 0.6880
Epoch 34/50
 - 9s - loss: 0.8287 - acc: 0.7042
Epoch 35/50
 - 9s - loss:

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_25 (InputLayer)           (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
global_average_pooling2d_3 (Glo (None, 512)          0           input_25[0][0]                   
__________________________________________________________________________________________________
input_26 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 512)          0           global_average_pooling2d_3[0][0] 
__________________________________________________________________________________________________
embedding_

 - 9s - loss: 2.6730 - acc: 0.2908
Epoch 15/50
 - 9s - loss: 2.4248 - acc: 0.3149
Epoch 16/50
 - 9s - loss: 2.2672 - acc: 0.3527
Epoch 17/50
 - 9s - loss: 2.1783 - acc: 0.3652
Epoch 18/50
 - 9s - loss: 2.1991 - acc: 0.3532
Epoch 19/50
 - 9s - loss: 2.1158 - acc: 0.3823
Epoch 20/50
 - 9s - loss: 1.8502 - acc: 0.4179
Epoch 21/50
 - 9s - loss: 1.6951 - acc: 0.4641
Epoch 22/50
 - 9s - loss: 1.5041 - acc: 0.4998
Epoch 23/50
 - 9s - loss: 1.4381 - acc: 0.5273
Epoch 24/50
 - 9s - loss: 1.3552 - acc: 0.5589
Epoch 25/50
 - 9s - loss: 1.2853 - acc: 0.5611
Epoch 26/50
 - 9s - loss: 1.2454 - acc: 0.5707
Epoch 27/50
 - 9s - loss: 1.2302 - acc: 0.5830
Epoch 28/50
 - 9s - loss: 1.1426 - acc: 0.6138
Epoch 29/50
 - 9s - loss: 1.0587 - acc: 0.6312
Epoch 30/50
 - 9s - loss: 0.9478 - acc: 0.6783
Epoch 31/50
 - 9s - loss: 0.9112 - acc: 0.6917
Epoch 32/50
 - 9s - loss: 0.8717 - acc: 0.7127
Epoch 33/50
 - 9s - loss: 0.8282 - acc: 0.7109
Epoch 34/50
 - 9s - loss: 0.7383 - acc: 0.7572
Epoch 35/50
 - 9s - loss:

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_27 (InputLayer)           (None, 7, 7, 512)    0                                            
__________________________________________________________________________________________________
global_average_pooling2d_4 (Glo (None, 512)          0           input_27[0][0]                   
__________________________________________________________________________________________________
input_28 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
dropout_4 (Dropout)             (None, 512)          0           global_average_pooling2d_4[0][0] 
__________________________________________________________________________________________________
embedding_

 - 9s - loss: 2.7778 - acc: 0.2475
Epoch 15/50
 - 9s - loss: 2.6044 - acc: 0.2610
Epoch 16/50
 - 9s - loss: 2.4760 - acc: 0.2887
Epoch 17/50
 - 9s - loss: 2.3425 - acc: 0.2889
Epoch 18/50
 - 9s - loss: 2.1301 - acc: 0.3347
Epoch 19/50
 - 9s - loss: 2.1060 - acc: 0.3361
Epoch 20/50
 - 9s - loss: 2.0510 - acc: 0.3474
Epoch 21/50
 - 9s - loss: 1.9604 - acc: 0.3497
Epoch 22/50
 - 9s - loss: 1.7273 - acc: 0.4003
Epoch 23/50
 - 9s - loss: 1.6572 - acc: 0.4452
Epoch 24/50
 - 9s - loss: 1.6407 - acc: 0.4526
Epoch 25/50
 - 9s - loss: 1.6692 - acc: 0.4278
Epoch 26/50
 - 9s - loss: 1.4825 - acc: 0.4724
Epoch 27/50
 - 9s - loss: 1.4168 - acc: 0.4799
Epoch 28/50
 - 9s - loss: 1.3724 - acc: 0.5112
Epoch 29/50
 - 9s - loss: 1.3706 - acc: 0.5069
Epoch 30/50
 - 9s - loss: 1.3329 - acc: 0.5235
Epoch 31/50
 - 9s - loss: 1.2766 - acc: 0.5521
Epoch 32/50
 - 9s - loss: 1.2388 - acc: 0.5669
Epoch 33/50
 - 9s - loss: 1.0973 - acc: 0.6122
Epoch 34/50
 - 9s - loss: 1.1065 - acc: 0.6087
Epoch 35/50
 - 9s - loss:

In [42]:
!more dropout_pooling.csv

train,test
0.8798640436132479,0.6509781938655957
1.0,0.24300481795982193
0.5609763895172714,0.7027856398148137


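Much better: the mean test score across the three runs rises from about 0.41 to about 0.53, although individual runs still vary widely (0.24 to 0.70).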
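
With only three repeats per experiment, it helps to compare means and standard deviations rather than single runs. The sketch below does that for the two result files exactly as printed above; the `summarize` helper is an assumption for illustration, and the CSV contents are pasted inline so that it runs without the files.

```python
import csv
import io
import statistics

# Contents of the two result files exactly as printed in this notebook.
RESULTS = {
    'cambio_pooling': """train,test
0.2042128370387497,0.5964449011273305
0.176283828631542,0.23281052604553176
0.2929935643561398,0.40060195776406227
""",
    'dropout_pooling': """train,test
0.8798640436132479,0.6509781938655957
1.0,0.24300481795982193
0.5609763895172714,0.7027856398148137
""",
}

def summarize(csv_text):
    """Return {column: (mean, sample standard deviation)} for one results file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in ('train', 'test'):
        scores = [float(row[column]) for row in rows]
        summary[column] = (statistics.mean(scores), statistics.stdev(scores))
    return summary

summaries = {name: summarize(text) for name, text in RESULTS.items()}
for name, summary in summaries.items():
    for column, (mean, std) in summary.items():
        print('%s %s: mean=%.3f std=%.3f' % (name, column, mean, std))
```

The spread between runs is large compared with the gap between the two means, so three repeats give only a rough indication of which configuration is better.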

Now let's swap VGG16 for InceptionV3.

In [0]:
# extract features from each photo in the directory
# InceptionV3 needs its own preprocessing (pixels scaled to [-1, 1]), not the VGG one
from keras.applications.inception_v3 import InceptionV3
from keras.applications.inception_v3 import preprocess_input as inception_preprocess_input

def extract_features(directory):
  # load the model
  in_layer = Input(shape=(299, 299, 3))
  model = InceptionV3(include_top=False, input_tensor=in_layer)
  print(model.summary())
  # extract features from each photo
  features = dict()

  files_in_directory = listdir(directory)
  n_images = len(files_in_directory)
  for i, name in tqdm(enumerate(files_in_directory)):
    # load an image from file
    filename = directory + '/' + name
    image = load_img(filename, target_size=(299, 299))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the InceptionV3 model
    image = inception_preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    # get image id
    image_id = name.split('.')[0]
    # store feature
    features[image_id] = feature
    # print('{} / {} > {}'.format(i, n_images, name))
  return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 299, 299, 3)  0                                            
__________________________________________________________________________________________________
conv2d_95 (Conv2D)              (None, 149, 149, 32) 864         input_2[0][0]                    
__________________________________________________________________________________________________
batch_normalization_95 (BatchNo (None, 149, 149, 32) 96          conv2d_95[0][0]                  
__________________________________________________________________________________________________
activation_95 (Activation)      (None, 149, 149, 32) 0           batch_normalization_95[0][0]     
__________________________________________________________________________________________________
conv2d_96 

0it [00:00, ?it/s]


                                                                 activation_179[0][0]             
__________________________________________________________________________________________________
conv2d_184 (Conv2D)             (None, 8, 8, 448)    917504      mixed9[0][0]                     
__________________________________________________________________________________________________
batch_normalization_184 (BatchN (None, 8, 8, 448)    1344        conv2d_184[0][0]                 
__________________________________________________________________________________________________
activation_184 (Activation)     (None, 8, 8, 448)    0           batch_normalization_184[0][0]    
__________________________________________________________________________________________________
conv2d_181 (Conv2D)             (None, 8, 8, 384)    786432      mixed9[0][0]                     
__________________________________________________________________________________________________
conv2d_18

8091it [07:13, 18.66it/s]

I could not get this to finish correctly. For reasons I do not understand, the environment kept kicking me out, and at other times it simply hung, so the feature extraction never completed and I could not run the next two blocks. I would have liked to see the result; I will keep trying later when I have more time.
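
One way to make a long extraction survive a disconnected session is to save partial results periodically and skip images that are already done when the cell is re-run. The following is only a sketch under that assumption: `extract_with_checkpoints`, `fake_feature` and the checkpoint file name are hypothetical, and in the notebook `compute_feature` would wrap the load, preprocess and `model.predict` steps for one image.

```python
import os
import pickle
import tempfile

def extract_with_checkpoints(names, compute_feature, checkpoint_path, save_every=500):
    """Compute one feature per file name, saving progress so a crash can be resumed."""
    features = {}
    if os.path.exists(checkpoint_path):
        # resume from whatever an earlier, interrupted run managed to save
        with open(checkpoint_path, 'rb') as handle:
            features = pickle.load(handle)
    pending = [name for name in names if name.split('.')[0] not in features]
    for count, name in enumerate(pending, start=1):
        image_id = name.split('.')[0]
        features[image_id] = compute_feature(name)
        if count % save_every == 0:
            with open(checkpoint_path, 'wb') as handle:
                pickle.dump(features, handle)
    # final save once everything has been processed
    with open(checkpoint_path, 'wb') as handle:
        pickle.dump(features, handle)
    return features

# Demo with a stand-in for the real per-image feature computation.
calls = []

def fake_feature(name):
    calls.append(name)
    return len(name)

checkpoint = os.path.join(tempfile.mkdtemp(), 'features_partial.pkl')
names = ['a.jpg', 'bb.jpg', 'ccc.jpg']
first = extract_with_checkpoints(names[:2], fake_feature, checkpoint, save_every=1)
second = extract_with_checkpoints(names, fake_feature, checkpoint, save_every=1)
print(calls)  # the second call only computes the image that was still missing
```

The second call reuses the two saved entries and computes only `ccc.jpg`, which is the behaviour one would want after a dropped session.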

In [1]:
from keras.layers.pooling import GlobalAveragePooling2D
from keras.optimizers import Adam

def define_model_inception3_dropout_pooling(vocab_size, max_length):
  #feature extractor (encoder)
  # InceptionV3 (include_top=False, 299x299 input) yields 8x8x2048 features, not VGG16's 7x7x512
  inputs1 = Input(shape=(8,8,2048))
  fe1 = GlobalAveragePooling2D()(inputs1)
  fe2 = Dropout(0.5)(fe1)
  fe3 = Dense(128, activation='relu')(fe2)
  fe4 = RepeatVector(max_length)(fe3)
  #embedding
  inputs2 = Input(shape=(max_length,))
  emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
  emb3 = LSTM(256, return_sequences=True)(emb2)
  emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
  #merge inputs
  merged = concatenate([fe4, emb4])
  #language model (decoder)
  lm2 = LSTM(500)(merged)
  lm3 = Dense(500, activation='relu')(lm2)
  outputs = Dense(vocab_size, activation='softmax')(lm3)
  # tie it together [image, seq] [word]
  model = Model(inputs=[inputs1, inputs2], outputs=outputs)
  model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])
  print(model.summary())
  return model

Using TensorFlow backend.


In [2]:
# define experiment
model_name = 'inception3_dropout_pooling'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model_inception3_dropout_pooling(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

NameError: ignored

Let's try ResNet50; hopefully Colab will let me finish with this network.

In [25]:
# extract features from each photo in the directory
from keras.applications import ResNet50 

def extract_features(directory):
  # load the model
  in_layer = Input(shape=(299, 299, 3))
  model = ResNet50(include_top=False, input_tensor=in_layer)
  print(model.summary())
  # extract features from each photo
  features = dict()

  files_in_directory = listdir(directory)
  n_images = len(files_in_directory)
  for i, name in tqdm(enumerate(files_in_directory)):
    # load an image from file
    filename = directory + '/' + name
    image = load_img(filename, target_size=(299, 299))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the ResNet50 model
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    # get image id
    image_id = name.split('.')[0]
    # store feature
    features[image_id] = feature
    # print('{} / {} > {}'.format(i, n_images, name))
  return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5


0it [00:00, ?it/s]

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 299, 299, 3)  0                                            
__________________________________________________________________________________________________
conv1_pad (ZeroPadding2D)       (None, 305, 305, 3)  0           input_3[0][0]                    
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 150, 150, 64) 9472        conv1_pad[0][0]                  
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 150, 150, 64) 256         conv1[0][0]                      
__________________________________________________________________________________________________
activation

8091it [07:41, 17.52it/s]


Extracted Features: 8091


In [35]:
from keras.layers.pooling import GlobalAveragePooling2D
from keras.optimizers import Adam

def define_model_resnet50_dropout_pooling(vocab_size, max_length):
  #feature extractor (encoder)
  inputs1 = Input(shape=(1,1,2048))
  fe1 = GlobalAveragePooling2D()(inputs1)
  fe2 = Dropout(0.5)(fe1)
  fe3 = Dense(128, activation='relu')(fe2)
  fe4 = RepeatVector(max_length)(fe3)
  #embedding
  inputs2 = Input(shape=(max_length,))
  emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
  emb3 = LSTM(256, return_sequences=True)(emb2)
  emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
  #merge inputs
  merged = concatenate([fe4, emb4])
  #language model (decoder)
  lm2 = LSTM(500)(merged)
  lm3 = Dense(500, activation='relu')(lm2)
  outputs = Dense(vocab_size, activation='softmax')(lm3)
  # tie it together [image, seq] [word]
  model = Model(inputs=[inputs1, inputs2], outputs=outputs)
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  print(model.summary())
  return model

In [36]:
# define experiment
model_name = 'resnet50_dropout_pooling'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model_resnet50_dropout_pooling(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_6 (InputLayer)            (None, 1, 1, 2048)   0                                            
__________________________________________________________________________________________________
global_average_pooling2d_2 (Glo (None, 2048)         0           input_6[0][0]                    
__________________________________________________________________________________________________
input_7 (InputLayer)            (None, 25)           0                                            
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 2048)         0           global_average_pooling2d_2[0][0] 
__________________________________________________________________________________________________
embedding_

 - 10s - loss: 3.2668 - acc: 0.2653
Epoch 15/50
 - 10s - loss: 3.1013 - acc: 0.2889
Epoch 16/50
 - 10s - loss: 2.9292 - acc: 0.3136
Epoch 17/50
 - 10s - loss: 2.8011 - acc: 0.3402
Epoch 18/50
 - 10s - loss: 2.5675 - acc: 0.3652
Epoch 19/50
 - 10s - loss: 2.2785 - acc: 0.4049
Epoch 20/50
 - 10s - loss: 2.1742 - acc: 0.4306
Epoch 21/50
 - 10s - loss: 2.2261 - acc: 0.4261
Epoch 22/50
 - 10s - loss: 1.9344 - acc: 0.4593
Epoch 23/50
 - 10s - loss: 1.6021 - acc: 0.5322
Epoch 24/50
 - 10s - loss: 1.4456 - acc: 0.5547
Epoch 25/50
 - 10s - loss: 1.2986 - acc: 0.6004
Epoch 26/50
 - 10s - loss: 1.0745 - acc: 0.6490
Epoch 27/50
 - 10s - loss: 0.9343 - acc: 0.6947
Epoch 28/50
 - 10s - loss: 0.8549 - acc: 0.7248
Epoch 29/50
 - 10s - loss: 0.8577 - acc: 0.7160
Epoch 30/50
 - 10s - loss: 0.7767 - acc: 0.7497
Epoch 31/50
 - 10s - loss: 0.6445 - acc: 0.7885
Epoch 32/50
 - 10s - loss: 0.5416 - acc: 0.8204
Epoch 33/50
 - 10s - loss: 0.4923 - acc: 0.8403
Epoch 34/50
 - 11s - loss: 0.4704 - acc: 0.8434
Epoc

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_8 (InputLayer)            (None, 1, 1, 2048)   0                                            
__________________________________________________________________________________________________
global_average_pooling2d_3 (Glo (None, 2048)         0           input_8[0][0]                    
__________________________________________________________________________________________________
input_9 (InputLayer)            (None, 25)           0                                            
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 2048)         0           global_average_pooling2d_3[0][0] 
__________________________________________________________________________________________________
embedding_

 - 11s - loss: 3.6261 - acc: 0.2163
Epoch 15/50
 - 11s - loss: 3.4492 - acc: 0.2458
Epoch 16/50
 - 11s - loss: 3.3201 - acc: 0.2427
Epoch 17/50
 - 11s - loss: 3.0852 - acc: 0.2733
Epoch 18/50
 - 11s - loss: 2.8354 - acc: 0.3086
Epoch 19/50
 - 11s - loss: 2.6654 - acc: 0.3041
Epoch 20/50
 - 11s - loss: 2.4532 - acc: 0.3626
Epoch 21/50
 - 10s - loss: 2.1827 - acc: 0.4129
Epoch 22/50
 - 10s - loss: 1.8660 - acc: 0.4687
Epoch 23/50
 - 10s - loss: 1.5720 - acc: 0.5570
Epoch 24/50
 - 10s - loss: 1.4527 - acc: 0.5861
Epoch 25/50
 - 10s - loss: 1.2482 - acc: 0.6254
Epoch 26/50
 - 10s - loss: 1.0193 - acc: 0.6909
Epoch 27/50
 - 10s - loss: 0.7687 - acc: 0.7831
Epoch 28/50
 - 10s - loss: 0.5662 - acc: 0.8347
Epoch 29/50
 - 10s - loss: 0.4572 - acc: 0.8720
Epoch 30/50
 - 11s - loss: 0.3579 - acc: 0.8995
Epoch 31/50
 - 11s - loss: 0.3090 - acc: 0.9111
Epoch 32/50
 - 11s - loss: 0.2732 - acc: 0.9238
Epoch 33/50
 - 11s - loss: 0.2165 - acc: 0.9326
Epoch 34/50
 - 11s - loss: 0.1693 - acc: 0.9575
Epoc

Epoch 1/50
 - 14s - loss: 5.4335 - acc: 0.0888
Epoch 2/50
 - 10s - loss: 5.0648 - acc: 0.1120
Epoch 3/50
 - 10s - loss: 4.8931 - acc: 0.1351
Epoch 4/50
 - 10s - loss: 4.7013 - acc: 0.1465
Epoch 5/50
 - 11s - loss: 4.4970 - acc: 0.1603
Epoch 6/50
 - 10s - loss: 4.3165 - acc: 0.1620
Epoch 7/50
 - 10s - loss: 4.1793 - acc: 0.1605
Epoch 8/50
 - 10s - loss: 4.0028 - acc: 0.1730
Epoch 9/50
 - 10s - loss: 3.9141 - acc: 0.1880
Epoch 10/50
 - 10s - loss: 3.8592 - acc: 0.1891
Epoch 11/50
 - 10s - loss: 3.6998 - acc: 0.1960
Epoch 12/50
 - 10s - loss: 3.8880 - acc: 0.2034
Epoch 13/50
 - 10s - loss: 3.9043 - acc: 0.2185
Epoch 14/50
 - 10s - loss: 3.8262 - acc: 0.2111
Epoch 15/50
 - 10s - loss: 3.4913 - acc: 0.2451
Epoch 16/50
 - 10s - loss: 3.1542 - acc: 0.2652
Epoch 17/50
 - 10s - loss: 2.8874 - acc: 0.2918
Epoch 18/50
 - 10s - loss: 2.6289 - acc: 0.3448
Epoch 19/50
 - 10s - loss: 2.3667 - acc: 0.3905
Epoch 20/50
 - 10s - loss: 2.2238 - acc: 0.4065
Epoch 21/50
 - 10s - loss: 1.9833 - acc: 0.4499
E

 - 10s - loss: 0.0106 - acc: 0.9988
Epoch 47/50
 - 10s - loss: 0.0092 - acc: 0.9995
Epoch 48/50
 - 10s - loss: 0.0066 - acc: 0.9995
Epoch 49/50
 - 10s - loss: 0.0058 - acc: 0.9995
Epoch 50/50
 - 10s - loss: 0.0054 - acc: 0.9995
Actual:    startseq child and woman are at waters edge in big city endseq
Predicted: startseq child and woman are at waters edge in big city endseq
Actual:    startseq boy with stick kneeling in front of goalie net endseq
Predicted: startseq boy with stick kneeling in front of goalie net endseq
Actual:    startseq woman crouches near three dogs in field endseq
Predicted: startseq woman crouches near three dogs in field endseq
Actual:    startseq boy bites hard into treat while he sits outside endseq
Predicted: startseq boy bites hard into treat while he sits outside endseq
Actual:    startseq person eats takeout while watching small television endseq
Predicted: startseq person eats takeout while watching small television endseq
Actual:    startseq couple with yo

In [37]:
!more resnet50_dropout_pooling.csv

train,test
1.0,0.299049310064541
1.0,0.3234616980350596
1.0,0.25516287578362484


This network, at least, did let me run the experiment.

The train score is perfect (1.0 in all three runs), but the model is clearly not generalizing: the test scores are far lower (roughly 0.26 to 0.32), so we are overfitting heavily.
Let's try to mitigate it by adding more Dropout layers and switching the global pooling from average to max.
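
One thing worth checking before relying on the pooling change: the model summaries above show the image input as `(None, 1, 1, 2048)`, that is, a 1x1 spatial map per channel. On a 1x1 map there is only one value per channel, so its mean and its maximum coincide. The sketch below illustrates this in plain Python rather than Keras; `global_pool` and `mean` are hypothetical helpers written only for this illustration, and the 2x2 map is made-up data for contrast.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def global_pool(feature_map, reduce_fn):
    """Reduce an (H, W, C) nested-list feature map to a list of C values."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = []
    for c in range(channels):
        # Collect every spatial position of channel c, then reduce it
        values = [feature_map[h][w][c] for h in range(height) for w in range(width)]
        pooled.append(reduce_fn(values))
    return pooled


random.seed(0)

# A 1x1 map with 2048 channels, same shape as the (1, 1, 2048) model input
fmap_1x1 = [[[random.random() for _ in range(2048)]]]
same_on_1x1 = global_pool(fmap_1x1, mean) == global_pool(fmap_1x1, max)

# A made-up 2x2 map with 4 channels, where the two poolings do differ
fmap_2x2 = [[[random.random() for _ in range(4)] for _ in range(2)] for _ in range(2)]
same_on_2x2 = global_pool(fmap_2x2, mean) == global_pool(fmap_2x2, max)

print("1x1 map, average == max:", same_on_1x1)
print("2x2 map, average == max:", same_on_2x2)
```

If the extracted features really are 1x1, any difference between the two experiments would then come from the extra Dropout layer and from run-to-run variation rather than from the pooling layer itself.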

In [38]:
from keras.optimizers import Adam

def define_model_resnet50_dropout_pooling_max_mas_dropout(vocab_size, max_length):
  #feature extractor (encoder)
  inputs1 = Input(shape=(1,1,2048))
  fe1 = GlobalMaxPooling2D()(inputs1)
  fe2 = Dropout(0.5)(fe1)
  fe3 = Dense(128, activation='relu')(fe2)
  fe4 = RepeatVector(max_length)(fe3)
  #embedding
  inputs2 = Input(shape=(max_length,))
  emb2 = Embedding(vocab_size, 50, mask_zero=True)(inputs2)
  emb3 = LSTM(256, return_sequences=True)(emb2)
  emb4 = TimeDistributed(Dense(128, activation='relu'))(emb3)
  #merge inputs
  merged = concatenate([fe4, emb4])
  #language model (decoder)
  lm2 = LSTM(500)(merged)
  lm3 = Dense(500, activation='relu')(lm2)
  lm4 = Dropout(0.5)(lm3)
  outputs = Dense(vocab_size, activation='softmax')(lm4)
  # tie it together [image, seq] [word]
  model = Model(inputs=[inputs1, inputs2], outputs=outputs)
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  print(model.summary())
  return model
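
As a reminder of what the extra `Dropout(0.5)` layer does, here is a minimal plain-Python sketch of inverted dropout, the scheme Keras uses: during training each unit is zeroed with probability `rate` and the surviving units are scaled by `1 / (1 - rate)`, while at inference time the layer passes its input through unchanged. The `dropout` function and the all-ones activations are assumptions for illustration, not code from this notebook.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training:
        # At inference time the layer is the identity
        return list(values)
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10000  # made-up activations, all equal to 1.0

train_out = dropout(activations, 0.5, True, rng)
test_out = dropout(activations, 0.5, False, rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)

print("fraction dropped during training:", dropped_fraction)
print("mean activation during training:", train_mean)
print("inference output unchanged:", test_out == activations)
```

Because of the rescaling, the expected activation stays the same between training and inference, which is why no adjustment is needed when calling `evaluate_model`.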

In [39]:
# define experiment
model_name = 'resnet50_dropout_pooling_max_mas_dropout'
verbose = 2
n_epochs = 50
n_photos_per_update = 2
n_batches_per_epoch = int(len(train) / n_photos_per_update)
n_repeats = 3
 
# run experiment
train_results, test_results = list(), list()
for i in range(n_repeats):
	# define the model
	model = define_model_resnet50_dropout_pooling_max_mas_dropout(vocab_size, max_length)
	# fit model
	model.fit_generator(data_generator(train_descriptions, train_features, tokenizer, max_length, n_photos_per_update), steps_per_epoch=n_batches_per_epoch, epochs=n_epochs, verbose=verbose)
	# evaluate model on training and test data
	train_score = evaluate_model(model, train_descriptions, train_features, tokenizer, max_length)
	test_score = evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)
	# store
	train_results.append(train_score)
	test_results.append(test_score)
	print('>%d: train=%f test=%f' % ((i+1), train_score, test_score))
# save results to file
df = DataFrame()
df['train'] = train_results
df['test'] = test_results
print(df.describe())
df.to_csv(model_name+'.csv', index=False)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_12 (InputLayer)           (None, 1, 1, 2048)   0                                            
__________________________________________________________________________________________________
global_max_pooling2d_1 (GlobalM (None, 2048)         0           input_12[0][0]                   
__________________________________________________________________________________________________
input_13 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
dropout_5 (Dropout)             (None, 2048)         0           global_max_pooling2d_1[0][0]     
__________________________________________________________________________________________________
embedding_

Epoch 14/50
 - 11s - loss: 3.5478 - acc: 0.2434
Epoch 15/50
 - 10s - loss: 3.3863 - acc: 0.2485
Epoch 16/50
 - 10s - loss: 3.2817 - acc: 0.2508
Epoch 17/50
 - 10s - loss: 2.9548 - acc: 0.3093
Epoch 18/50
 - 10s - loss: 2.8087 - acc: 0.3321
Epoch 19/50
 - 10s - loss: 2.5628 - acc: 0.3734
Epoch 20/50
 - 10s - loss: 2.3276 - acc: 0.3961
Epoch 21/50
 - 10s - loss: 2.0693 - acc: 0.4397
Epoch 22/50
 - 10s - loss: 1.8561 - acc: 0.4940
Epoch 23/50
 - 10s - loss: 1.6253 - acc: 0.5438
Epoch 24/50
 - 10s - loss: 1.4692 - acc: 0.5860
Epoch 25/50
 - 10s - loss: 1.3834 - acc: 0.5987
Epoch 26/50
 - 10s - loss: 1.1776 - acc: 0.6451
Epoch 27/50
 - 10s - loss: 1.1392 - acc: 0.6517
Epoch 28/50
 - 10s - loss: 1.0609 - acc: 0.7013
Epoch 29/50
 - 10s - loss: 0.9104 - acc: 0.7364
Epoch 30/50
 - 10s - loss: 0.8917 - acc: 0.7308
Epoch 31/50
 - 10s - loss: 0.7833 - acc: 0.7477
Epoch 32/50
 - 10s - loss: 0.7354 - acc: 0.7816
Epoch 33/50
 - 10s - loss: 0.6859 - acc: 0.7936
Epoch 34/50
 - 10s - loss: 0.5956 - acc:

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_14 (InputLayer)           (None, 1, 1, 2048)   0                                            
__________________________________________________________________________________________________
global_max_pooling2d_2 (GlobalM (None, 2048)         0           input_14[0][0]                   
__________________________________________________________________________________________________
input_15 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
dropout_7 (Dropout)             (None, 2048)         0           global_max_pooling2d_2[0][0]     
__________________________________________________________________________________________________
embedding_

Epoch 14/50
 - 10s - loss: 3.5529 - acc: 0.2298
Epoch 15/50
 - 10s - loss: 3.3930 - acc: 0.2458
Epoch 16/50
 - 10s - loss: 3.1748 - acc: 0.2831
Epoch 17/50
 - 10s - loss: 3.1212 - acc: 0.3075
Epoch 18/50
 - 10s - loss: 2.8619 - acc: 0.3266
Epoch 19/50
 - 10s - loss: 2.6570 - acc: 0.3548
Epoch 20/50
 - 10s - loss: 2.3957 - acc: 0.3804
Epoch 21/50
 - 10s - loss: 2.1981 - acc: 0.4249
Epoch 22/50
 - 10s - loss: 2.1108 - acc: 0.4264
Epoch 23/50
 - 10s - loss: 1.9441 - acc: 0.4777
Epoch 24/50
 - 11s - loss: 1.7366 - acc: 0.5190
Epoch 25/50
 - 11s - loss: 1.7254 - acc: 0.5161
Epoch 26/50
 - 11s - loss: 1.6028 - acc: 0.5329
Epoch 27/50
 - 11s - loss: 1.5673 - acc: 0.5673
Epoch 28/50
 - 11s - loss: 1.3844 - acc: 0.5813
Epoch 29/50
 - 11s - loss: 1.2265 - acc: 0.6234
Epoch 30/50
 - 11s - loss: 1.1472 - acc: 0.6320
Epoch 31/50
 - 11s - loss: 1.0440 - acc: 0.6718
Epoch 32/50
 - 11s - loss: 0.9423 - acc: 0.6927
Epoch 33/50
 - 11s - loss: 0.8296 - acc: 0.7272
Epoch 34/50
 - 11s - loss: 0.8066 - acc:

Epoch 1/50
 - 16s - loss: 5.4537 - acc: 0.0895
Epoch 2/50
 - 11s - loss: 5.1340 - acc: 0.1187
Epoch 3/50
 - 11s - loss: 4.9701 - acc: 0.1348
Epoch 4/50
 - 11s - loss: 4.8686 - acc: 0.1486
Epoch 5/50
 - 11s - loss: 4.6795 - acc: 0.1507
Epoch 6/50
 - 11s - loss: 4.5768 - acc: 0.1468
Epoch 7/50
 - 11s - loss: 4.4248 - acc: 0.1594
Epoch 8/50
 - 11s - loss: 4.2615 - acc: 0.1634
Epoch 9/50
 - 11s - loss: 4.1694 - acc: 0.1808
Epoch 10/50
 - 11s - loss: 4.1046 - acc: 0.1810
Epoch 11/50
 - 11s - loss: 4.0317 - acc: 0.1905
Epoch 12/50
 - 11s - loss: 3.9580 - acc: 0.1979
Epoch 13/50
 - 11s - loss: 3.7040 - acc: 0.2238
Epoch 14/50
 - 11s - loss: 3.3207 - acc: 0.2669
Epoch 15/50
 - 11s - loss: 3.0283 - acc: 0.2927
Epoch 16/50
 - 10s - loss: 2.7718 - acc: 0.3158
Epoch 17/50
 - 10s - loss: 2.5491 - acc: 0.3526
Epoch 18/50
 - 10s - loss: 2.2970 - acc: 0.3998
Epoch 19/50
 - 10s - loss: 2.2653 - acc: 0.4092
Epoch 20/50
 - 10s - loss: 2.1474 - acc: 0.4314
Epoch 21/50
 - 10s - loss: 1.9306 - acc: 0.4612
E

 - 11s - loss: 0.1626 - acc: 0.9509
Epoch 47/50
 - 10s - loss: 0.1305 - acc: 0.9650
Epoch 48/50
 - 10s - loss: 0.1026 - acc: 0.9792
Epoch 49/50
 - 10s - loss: 0.0877 - acc: 0.9789
Epoch 50/50
 - 10s - loss: 0.1010 - acc: 0.9726
Actual:    startseq child and woman are at waters edge in big city endseq
Predicted: startseq child and woman are at waters edge in big city endseq
Actual:    startseq boy with stick kneeling in front of goalie net endseq
Predicted: startseq boy with stick kneeling in front of goalie net endseq
Actual:    startseq woman crouches near three dogs in field endseq
Predicted: startseq woman crouches near three dogs in field endseq
Actual:    startseq boy bites hard into treat while he sits outside endseq
Predicted: startseq boy bites hard into treat while he sits outside endseq
Actual:    startseq person eats takeout while watching small television endseq
Predicted: startseq person eats takeout while watching small television endseq
Actual:    startseq couple with yo

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


In [40]:
!more resnet50_dropout_pooling_max_mas_dropout.csv

train,test
1.0,0.24565373354629153
1.0,0.30540016537570525
1.0,0.12490661420929208


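This has not helped on the test side: the test scores now range from roughly 0.12 to 0.31, no better than with the previous model, while train is still 1.0.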
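
To compare the two experiments side by side, the sketch below summarizes the test scores copied by hand from the two CSV outputs above, using only the standard library. The dictionary layout and the `summary` name are assumptions for illustration; with only three repeats per model, the spread is large and the comparison should be read as indicative at best.

```python
from statistics import mean, stdev

# Test scores copied from the two CSV outputs above (train was 1.0 in every run)
test_scores = {
    "resnet50_dropout_pooling": [
        0.299049310064541, 0.3234616980350596, 0.25516287578362484,
    ],
    "resnet50_dropout_pooling_max_mas_dropout": [
        0.24565373354629153, 0.30540016537570525, 0.12490661420929208,
    ],
}

summary = {}
for name, scores in test_scores.items():
    avg = mean(scores)
    summary[name] = {
        "mean": avg,
        "std": stdev(scores),   # sample standard deviation over the 3 repeats
        "gap": 1.0 - avg,       # distance between train (1.0) and mean test score
    }
    print("%s: mean=%.3f std=%.3f train-test gap=%.3f"
          % (name, summary[name]["mean"], summary[name]["std"], summary[name]["gap"]))
```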
This is as far as my experiments go. I have no more time, because otherwise I will not manage to hand in the assignment. I would have liked to run a few more with the v3 and with the hyperparameters, but between the times Colab kicked me out and the times it hung, I lost quite a lot of time. Nonetheless, congratulations, because the module has been really good and I think I have learned quite a lot.