<h1 align="center">First of all -- Checking Questions</h1> 

**Question 1**: Can convolutional networks be used for text classification? If not, explain why :D; if so, how? How do you handle inputs of arbitrary length?

Yes. Split the text into blocks with a fixed number of words. Feed them to the network as one-dimensional arrays whose elements are numeric word codes (as in this assignment) and apply one-dimensional convolutions over these arrays. This gives a classification for each block of text; the answer for the whole text can then be derived from these per-block predictions, for example by returning the most frequent category.
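The block-and-vote idea from the answer above can be sketched in a few lines. This is only an illustration: `split_into_blocks`, `majority_vote`, the `#PAD#` token and the per-block labels are hypothetical, and the convolutional classifier itself is left out.

```python
from collections import Counter

def split_into_blocks(tokens, block_len, pad_token="#PAD#"):
    """Cut a token list into consecutive blocks of exactly block_len tokens."""
    blocks = []
    for start in range(0, len(tokens), block_len):
        block = tokens[start:start + block_len]
        # pad the last block so that every block has the same length
        block += [pad_token] * (block_len - len(block))
        blocks.append(block)
    return blocks

def majority_vote(block_labels):
    """Return the most frequent label among the per-block predictions."""
    return Counter(block_labels).most_common(1)[0][0]

tokens = "a man riding a motor bike on a dirt road".split()
blocks = split_into_blocks(tokens, block_len=4)

# hypothetical per-block predictions of a 1D-conv classifier
text_label = majority_vote(["vehicles", "vehicles", "nature"])
print(len(blocks), text_label)
```

Padding the last block keeps the network input shape fixed; majority voting is only one of several ways to aggregate the per-block answers.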

**Question 2**: How is an LSTM better/worse than a plain RNN?

Plain RNNs effectively remember only a few recent states and cannot capture long-term dependencies. In the text-analysis example, they pick up the context of a phrase or a sentence but most likely cannot relate sentences to each other or use the context of a large body of text. LSTMs are designed to solve this problem: they can capture long-term dependencies. Their drawback is a more complex structure, which, among other things, makes them noticeably slower to train.

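**Question 3**: Write out the derivative $\frac{d c_{n+1}}{d c_{k}}$ for an LSTM (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and explain the formula: when does the derivative vanish, and when does it explode?

Take the cell update $c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t$ and follow only the direct path through the cell state, treating the gates as constants with respect to $c_{t-1}$:

$\frac{d c_{n+1}}{d c_{n}} = f_{n+1} = \sigma(W_f \cdot [h_{n}, x_{n+1}]+b_f)$

$\frac{d c_{n+1}}{d c_{k}} = f_{n+1} \cdot f_{n} \cdot \dots \cdot f_{k+1}$

The derivative would explode if the product grew above one, and it vanishes as the product shrinks towards zero. Along this direct path every $f_t$ is a sigmoid output in $(0, 1)$, so the product cannot exceed one: it vanishes when the forget gates stay well below one and is roughly preserved when they stay close to one. Explosion, when it happens, comes from the terms omitted here (the dependence of the gates on $h_t$, and hence on the recurrent weights).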
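A small numeric illustration of the product of forget gates may help. It is a sketch under simplifying assumptions: the gates are scalars with a constant value, and only the direct cell-state path is considered; `cell_state_gradient` is a hypothetical helper.

```python
import numpy as np

def cell_state_gradient(forget_gates):
    """Product of forget gates: the gradient along the direct cell-state path."""
    return float(np.prod(forget_gates))

# 50 time steps with forget gates close to 1: most of the gradient survives
almost_open = cell_state_gradient(np.full(50, 0.99))

# 50 time steps with forget gates at 0.5: the gradient vanishes
half_open = cell_state_gradient(np.full(50, 0.5))

print(almost_open, half_open)
```

With gates near one the gradient after 50 steps is still of order one, while gates at 0.5 shrink it by many orders of magnitude.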
 
**Question 4**: Why do we need TBPTT, and what is wrong with plain BPTT?

TBPTT (truncated backpropagation through time) is backpropagation with a cut-off: instead of propagating gradients through the whole time series back to the very first state, we stop after a fixed number of steps back. This speeds up training, helps avoid overfitting and exploding gradients, and avoids spending time on recomputing gradients that have already vanished and no longer contribute noticeable changes.
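One way to picture the truncation is to cut the time axis into windows and backpropagate only inside each window. The helper below is a hypothetical sketch of that bookkeeping, not a training loop; in practice the hidden state is usually carried from one window to the next but treated as a constant.

```python
def tbptt_windows(sequence_length, truncation_steps):
    """Yield (start, end) index pairs; gradients flow only inside one window."""
    for start in range(0, sequence_length, truncation_steps):
        yield start, min(start + truncation_steps, sequence_length)

windows = list(tbptt_windows(sequence_length=10, truncation_steps=4))
print(windows)
```

In this notebook a comparable cut-off would presumably be requested through the recurrent layer's own settings rather than by slicing the data by hand.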

**Question 5**: How can recurrent and convolutional networks be combined, and, more importantly, why? Give a few examples of real tasks.

For example, one can analyze visual sequences such as presentations and films. Suppose we want a neural network to come up with a title for a film or to determine its genre. We need to extract features from every frame, which convolutional networks are good at, and to model long-term dependencies, that is, relationships between frames, which calls for a recurrent network. In that case the LSTM is initialized from convolutional layers rather than from dense layers as in this assignment.

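**Question 6**: Explain the intuition behind choosing the size of the embedding layer. Why is this a dangerous spot?

The embedding layer turns raw input data (texts) into feature vectors. For text, its size is usually a few dozen and rarely more than 200. If the size is too small, information is lost; if it is too large, the layer has vocabulary size × embedding size parameters, which slows training and makes overfitting easier.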
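To get a feeling for the trade-off, it helps to count the weights of an embedding layer for a few sizes. The vocabulary size 10373 is the one reported later in this notebook; `embedding_param_count` and the candidate sizes are illustrative assumptions.

```python
def embedding_param_count(vocab_size, embed_size):
    """An embedding layer stores one embed_size-dimensional vector per token."""
    return vocab_size * embed_size

vocab_size_example = 10373  # vocabulary size reported later in this notebook

for embed_size in (16, 128, 1024):
    print(embed_size, embedding_param_count(vocab_size_example, embed_size))
```

The parameter count grows linearly with the embedding size, so a tenfold larger embedding means a tenfold larger lookup table to train.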

* Arseniy Ashuha, you can text me ```ars.ashuha@gmail.com```, Alexander Panin

<h1 align="center"> Image Captioning </h1> 

In this seminar you'll be going through the image captioning pipeline.

To begin with, let us download the dataset of image features from a pre-trained GoogleNet.

In [None]:
!wget https://www.dropbox.com/s/3hj16b0fj6yw7cc/data.tar.gz?dl=1 -O data.tar.gz
!tar -xvzf data.tar.gz

### Data preprocessing

In [1]:
%%time
# Read Dataset
import numpy as np
import pickle

img_codes = np.load("data/image_codes.npy")
captions = pickle.load(open('data/caption_tokens.pcl', 'rb'))

CPU times: user 1.26 s, sys: 788 ms, total: 2.05 s
Wall time: 9.55 s


In [2]:
print("each image code is a 1000-unit vector:", img_codes.shape)
print(img_codes[0,:10])
print('\n\n')
print("for each image there are 5-7 descriptions, e.g.:\n")
print('\n'.join(captions[0]))

each image code is a 1000-unit vector: (123287, 1000)
[ 1.38901556 -3.82951474 -1.94360816 -0.5317238  -0.03120959 -2.87483215
 -2.9554503   0.6960277  -0.68551242 -0.7855981 ]



for each image there are 5-7 descriptions, e.g.:

a man with a red helmet on a small moped on a dirt road
man riding a motor bike on a dirt road on the countryside
a man riding on the back of a motorcycle
a dirt path with a young person on a motor bike rests to the foreground of a verdant area with a bridge and a background of cloud wreathed mountains
a man in a red shirt and a red hat is on a motorcycle on a hill side


In [3]:
#split descriptions into tokens
for img_i in range(len(captions)):
    for caption_i in range(len(captions[img_i])):
        sentence = captions[img_i][caption_i] 
        captions[img_i][caption_i] = ["#START#"]+sentence.split(' ')+["#END#"]

In [4]:
# Build a Vocabulary
from collections import Counter
word_counts = Counter()

for image_captions in captions:
    for tokens in image_captions:
        # each caption is already a list of tokens, so update() counts every token in it
        word_counts.update(tokens)

In [5]:
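vocab  = ['#UNK#', '#START#', '#END#']
# note: '#START#' and '#END#' are also in word_counts, so they end up in vocab twice;
# word_to_index below keeps the later position of each
vocab += [k for k, v in word_counts.items() if v >= 5]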
n_tokens = len(vocab)

assert 10000 <= n_tokens <= 10500

word_to_index = {w: i for i, w in enumerate(vocab)}
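The same vocabulary construction can be tried on a toy corpus. The sketch below is a variation, not the notebook's exact code: it skips the special tokens when filtering by frequency so that each token gets a single index, and `toy_captions` and `min_count` are made-up values.

```python
from collections import Counter

toy_captions = [["#START#", "a", "dog", "#END#"],
                ["#START#", "a", "cat", "#END#"],
                ["#START#", "a", "dog", "runs", "#END#"]]

# count every token of every caption
toy_counts = Counter(token for caption in toy_captions for token in caption)

special = ["#UNK#", "#START#", "#END#"]
min_count = 2

# keep frequent tokens, skipping the special ones that are already listed
toy_vocab = special + sorted(w for w, c in toy_counts.items()
                             if c >= min_count and w not in special)
toy_word_to_index = {w: i for i, w in enumerate(toy_vocab)}
print(toy_vocab)
```

Rare tokens such as "cat" and "runs" fall out of the vocabulary and would be mapped to `#UNK#` at lookup time.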

Counting words and building the vocabulary takes a long time, so we save the data to files for efficient reuse.

In [6]:
import json

In [7]:
with open('word_counts.json', 'w') as wc_file:
    json.dump(word_counts, wc_file, indent=4, separators=(',', ': '))
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab, vocab_file, indent=4, separators=(',', ': '))
with open('word_to_index.json', 'w') as index_file:
    json.dump(word_to_index, index_file, indent=4, separators=(',', ': '))

In [None]:
with open('vocab.json', 'r') as vocab_file:
    vocab = json.load(vocab_file)
with open('word_to_index.json', 'r') as index_file:
    word_to_index = json.load(index_file)

In [None]:
n_tokens = len(vocab)

In [8]:
PAD_ix = -1
UNK_ix = vocab.index('#UNK#')

def as_matrix(sequences,max_len=None):
    max_len = max_len or max(map(len,sequences))
    
    matrix = np.zeros((len(sequences),max_len),dtype='int32')+PAD_ix
    for i,seq in enumerate(sequences):
        row_ix = [word_to_index.get(word,UNK_ix) for word in seq[:max_len]]
        matrix[i,:len(row_ix)] = row_ix
    
    return matrix

In [9]:
#try it out on several descriptions of a random image
as_matrix(captions[1337])

array([[7223, 9967, 5948, 4328, 5578, 1746,  122, 8535, 3409, 7932, 4522,
        9286,   92,   -1,   -1],
       [7223, 3409, 7932, 1656, 5384, 7113, 9286, 4522, 6881, 2603,  122,
          92,   -1,   -1,   -1],
       [7223, 5810, 4328, 5578, 1746,  122, 8535, 3409, 5964, 8264, 7968,
        7078, 4844, 3167,   92],
       [7223, 5810, 1745, 1618,  230, 5425, 5810, 1745, 3425,   92,   -1,
          -1,   -1,   -1,   -1],
       [7223, 9967, 5948, 4328, 5578, 1746,  122, 8535, 3409, 7932, 7968,
        7078, 8002,   92,   -1]], dtype=int32)
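The padding logic is easier to follow on tiny inputs. Below is a standalone sketch that works on ready-made id sequences; `pad_sequences` is a hypothetical name, and the pad value -1 mirrors `PAD_ix`.

```python
import numpy as np

def pad_sequences(id_sequences, pad_ix=-1, max_len=None):
    """Stack variable-length id lists into one matrix, filling gaps with pad_ix."""
    max_len = max_len or max(len(seq) for seq in id_sequences)
    matrix = np.full((len(id_sequences), max_len), pad_ix, dtype="int32")
    for i, seq in enumerate(id_sequences):
        seq = seq[:max_len]              # crop sequences that are too long
        matrix[i, :len(seq)] = seq
    return matrix

toy_matrix = pad_sequences([[1, 5, 7, 2], [1, 5, 2]])
toy_mask = toy_matrix != -1              # True for real tokens, False for padding
print(toy_matrix)
print(toy_mask)
```

The boolean mask derived from the pad value is what lets later steps ignore the padded positions.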

### Mah Neural Network

In [27]:
# network shapes. 
CNN_FEATURE_SIZE = img_codes.shape[1]
EMBED_SIZE = 128 #pls change me if u want
LSTM_UNITS = 256 #pls change me if u want

In [28]:
import theano
import lasagne
import theano.tensor as T
from lasagne.layers import *

In [29]:
# Input Variable
sentences = T.imatrix()# [batch_size x time] of word ids
image_vectors = T.matrix() # [batch size x unit] of CNN image features
sentence_mask = T.neq(sentences, PAD_ix)

In [30]:
#network inputs
l_words = InputLayer((None, None), sentences)
l_mask = InputLayer((None, None), sentence_mask)

#embeddings for words 
############# TO CODE IT BY YOURSELF ##################
l_word_embeddings = EmbeddingLayer(l_words, n_tokens, EMBED_SIZE)

In [31]:
# input layer for image features
l_image_features = InputLayer((None, CNN_FEATURE_SIZE), image_vectors)


#convert 1000 image features from googlenet to whatever LSTM_UNITS you have set
#it's also a good idea to add some dropout here and there
l_image_features_small = lasagne.layers.DropoutLayer(l_image_features, p=0.2)
l_image_features_small = lasagne.layers.DenseLayer(l_image_features_small, num_units=LSTM_UNITS)
assert l_image_features_small.output_shape == (None, LSTM_UNITS)

In [32]:
############# TO CODE IT BY YOURSELF ##################
# Decoder LSTM over word embeddings; the image features initialize its cell state
decoder = LSTMLayer(l_word_embeddings,
                    num_units=LSTM_UNITS,
                    cell_init=l_image_features_small,
                    mask_input=l_mask,
                    grad_clipping=10)

In [33]:
# Decoding of RNN hidden states
from broadcast import BroadcastLayer,UnbroadcastLayer

#apply whatever comes next to each tick of each example in a batch. Equivalent to 2 reshapes
broadcast_decoder_ticks = BroadcastLayer(decoder, (0, 1))
print("broadcasted decoder shape = ",broadcast_decoder_ticks.output_shape)

predicted_probabilities_each_tick = DenseLayer(
    broadcast_decoder_ticks,n_tokens, nonlinearity=lasagne.nonlinearities.softmax)

#un-broadcast back into (batch,tick,probabilities)
predicted_probabilities = UnbroadcastLayer(
    predicted_probabilities_each_tick, broadcast_layer=broadcast_decoder_ticks)

print("output shape = ", predicted_probabilities.output_shape)

#remove if you know what you're doing (e.g. 1d convolutions or fixed shape)
assert predicted_probabilities.output_shape == (None, None, n_tokens)

broadcasted decoder shape =  (None, 256)
output shape =  (None, None, 10373)


In [34]:
next_word_probas = get_output(predicted_probabilities)

reference_answers = sentences[:,1:]
output_mask = sentence_mask[:,1:]

#write symbolic loss function to train NN for
loss = lasagne.objectives.categorical_crossentropy(
    next_word_probas[:, :-1].reshape((-1, n_tokens)),
    reference_answers.reshape((-1,))
).reshape(reference_answers.shape)

############# TO CODE IT BY YOURSELF ##################
loss = (loss * output_mask).sum() / output_mask.sum()
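The masked average in the last line can be checked with plain NumPy on a toy batch. This is a sketch with made-up shapes and values; `masked_mean_crossentropy` is a hypothetical helper and not part of lasagne.

```python
import numpy as np

def masked_mean_crossentropy(probs, targets, mask):
    """probs: [batch, time, vocab]; targets and mask: [batch, time]."""
    batch_ix, time_ix = np.indices(targets.shape)
    # padded targets are -1: clip them to a valid index, the mask removes them anyway
    safe_targets = np.clip(targets, 0, probs.shape[-1] - 1)
    token_loss = -np.log(probs[batch_ix, time_ix, safe_targets])
    # average over real tokens only
    return (token_loss * mask).sum() / mask.sum()

probs = np.full((1, 3, 4), 0.25)     # uniform prediction over 4 tokens
targets = np.array([[2, 3, -1]])     # the last position is padding
mask = targets != -1
loss_value = masked_mean_crossentropy(probs, targets, mask)
print(loss_value)
```

Dividing by `mask.sum()` rather than by the full matrix size keeps the loss comparable between batches with different amounts of padding.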

In [35]:
#trainable NN weights
############# TO CODE IT BY YOURSELF ##################
weights = get_all_params(predicted_probabilities,trainable=True)
updates = lasagne.updates.adam(loss, weights)

In [36]:
#compile a function that takes image features and input sentences, outputs loss and updates weights
#please note that your functions must accept image features as FIRST param and sentences as second one
############# TO CODE IT BY YOURSELF ##################
train_step = theano.function([image_vectors, sentences], loss, updates=updates)
val_step   = theano.function([image_vectors, sentences], loss)

  "flatten outdim parameter is deprecated, use ndim instead.")


# Training

* You first have to implement a batch generator
* Then the network will get trained the usual way

In [37]:
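#captions are ragged (5-7 per image, varying lengths), so store them as an object array
captions = np.array(captions, dtype=object)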

In [38]:
from random import choice

def generate_batch(images,captions,batch_size,max_caption_len=None):
    #sample random numbers for image/caption indices
    random_image_ix = np.random.randint(0, len(images), size=batch_size)
    
    #get images
    batch_images = images[random_image_ix]
    
    #5-7 captions for each image
    captions_for_batch_images = captions[random_image_ix]
    
    #pick 1 from 5-7 captions for each image
    batch_captions = list(map(choice, captions_for_batch_images))
    
    #convert to matrix
    batch_captions_ix = as_matrix(batch_captions,max_len=max_caption_len)
    
    return batch_images, batch_captions_ix

In [39]:
generate_batch(img_codes,captions, 3)

(array([[ 3.20294285,  4.12841511, -3.54088283, ...,  2.17112184,
          8.34316349,  5.65327692],
        [ 2.27223778,  0.48416096, -1.36779344, ..., -0.06882471,
          4.65919065,  0.15165681],
        [-3.69761467, -4.84340429, -0.59039426, ..., -0.82694697,
          3.54036236, -0.3912183 ]], dtype=float32),
 array([[7223, 5810, 1745, 3425, 9627, 5810,  359, 4522, 2686, 8535, 5810,
         2808, 8535, 1906,   92],
        [7223, 5810, 1745,  269, 7590, 4378, 8535, 6856, 9627, 9967,   52,
         2615,   92,   -1,   -1],
        [7223, 7932, 1652, 8875, 5578, 6514,  212, 4522, 2686, 8535, 5810,
         6382,   92,   -1,   -1]], dtype=int32))

### Main loop
* We recommend periodically evaluating the network using the "apply trained model" block below
 *  it's safe to interrupt training, run a few examples and start training again

In [40]:
batch_size = 50 #adjust me
n_epochs   = 100 #adjust me
n_batches_per_epoch = 50 #adjust me
n_validation_batches = 5 #how many batches are used for validation after each epoch
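The training loop draws validation batches from the same images as training batches. One possible variation, which is not part of the original notebook, is to hold out a fixed set of image indices for validation; `split_indices`, `val_fraction` and `seed` are hypothetical names.

```python
import numpy as np

def split_indices(n_items, val_fraction=0.1, seed=0):
    """Shuffle item indices once and cut them into train and validation parts."""
    rng = np.random.RandomState(seed)
    permutation = rng.permutation(n_items)
    n_val = int(n_items * val_fraction)
    return permutation[n_val:], permutation[:n_val]

train_ix, val_ix = split_indices(1000, val_fraction=0.1)
print(len(train_ix), len(val_ix))
```

Under this assumption, training batches would be generated from `img_codes[train_ix]` and `captions[train_ix]`, and validation batches from the `val_ix` part.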

In [41]:
from tqdm import tqdm

for epoch in range(n_epochs):
    train_loss=0
    for _ in tqdm(range(n_batches_per_epoch)):
        train_loss += train_step(*generate_batch(img_codes,captions,batch_size))
    train_loss /= n_batches_per_epoch
    
    val_loss=0
    for _ in range(n_validation_batches):
        val_loss += val_step(*generate_batch(img_codes,captions,batch_size))
    val_loss /= n_validation_batches
    
    print('\nEpoch: {}, train loss: {}, val loss: {}'.format(epoch, train_loss, val_loss))

print("Finish :)")


100%|██████████| 50/50 [01:58<00:00,  1.90s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 0, train loss: 6.525006889514105, val loss: 5.557623076098362


100%|██████████| 50/50 [02:08<00:00,  2.34s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 1, train loss: 5.281102177096264, val loss: 5.168582545305071


100%|██████████| 50/50 [02:13<00:00,  2.32s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 2, train loss: 5.003108089608322, val loss: 4.746789260056002


100%|██████████| 50/50 [02:06<00:00,  2.22s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 3, train loss: 4.76282122976219, val loss: 4.597395886360425


100%|██████████| 50/50 [02:01<00:00,  2.37s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 4, train loss: 4.5163989745913735, val loss: 4.382462936260396


100%|██████████| 50/50 [01:58<00:00,  2.36s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 5, train loss: 4.318876153953291, val loss: 4.265879598990135


100%|██████████| 50/50 [01:54<00:00,  3.09s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 6, train loss: 4.188464107855843, val loss: 4.167415834400251


100%|██████████| 50/50 [01:56<00:00,  2.08s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 7, train loss: 4.104867399135802, val loss: 4.052211029853474


100%|██████████| 50/50 [02:03<00:00,  2.48s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 8, train loss: 3.990547402812579, val loss: 4.007367464025952


100%|██████████| 50/50 [02:01<00:00,  2.31s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 9, train loss: 3.884576091515461, val loss: 3.898897541036611


100%|██████████| 50/50 [01:52<00:00,  2.17s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 10, train loss: 3.822786705081664, val loss: 3.859639224621927


100%|██████████| 50/50 [02:02<00:00,  2.60s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 11, train loss: 3.814835610951282, val loss: 3.797094875961426


100%|██████████| 50/50 [02:05<00:00,  2.77s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 12, train loss: 3.7156257526660488, val loss: 3.704461160025415


100%|██████████| 50/50 [01:57<00:00,  2.56s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 13, train loss: 3.6582272753762597, val loss: 3.6898198995827824


100%|██████████| 50/50 [02:10<00:00,  3.53s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 14, train loss: 3.6458998874504465, val loss: 3.553656284895086


100%|██████████| 50/50 [01:57<00:00,  3.03s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 15, train loss: 3.5827244985967526, val loss: 3.711288695765981


100%|██████████| 50/50 [02:04<00:00,  2.55s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 16, train loss: 3.5730024200081925, val loss: 3.486008552592339


100%|██████████| 50/50 [01:57<00:00,  2.35s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 17, train loss: 3.531043023235368, val loss: 3.533681844934615


100%|██████████| 50/50 [02:19<00:00,  2.82s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 18, train loss: 3.4696212459216764, val loss: 3.363203458362905


100%|██████████| 50/50 [02:10<00:00,  2.60s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 19, train loss: 3.424090032441603, val loss: 3.2803603468799545


100%|██████████| 50/50 [02:06<00:00,  2.43s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 20, train loss: 3.465025973076889, val loss: 3.422946978237035


100%|██████████| 50/50 [02:07<00:00,  2.35s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 21, train loss: 3.4138587125461286, val loss: 3.3521167969386787


100%|██████████| 50/50 [02:01<00:00,  2.25s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 22, train loss: 3.388732134808196, val loss: 3.3817342697865116


100%|██████████| 50/50 [02:01<00:00,  2.44s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 23, train loss: 3.3648728213191004, val loss: 3.3248800596420125


  4%|▍         | 2/50 [00:05<01:50,  2.31s/it]


KeyboardInterrupt: 

In [42]:
train_step = theano.function([image_vectors, sentences], loss, updates=lasagne.updates.adam(loss, weights, learning_rate=0.003))

for epoch in range(n_epochs):
    train_loss=0
    for _ in tqdm(range(n_batches_per_epoch)):
        train_loss += train_step(*generate_batch(img_codes,captions,batch_size))
    train_loss /= n_batches_per_epoch
    
    val_loss=0
    for _ in range(n_validation_batches):
        val_loss += val_step(*generate_batch(img_codes,captions,batch_size))
    val_loss /= n_validation_batches
    
    print('\nEpoch: {}, train loss: {}, val loss: {}'.format(epoch, train_loss, val_loss))

print("Finish :)")

  "flatten outdim parameter is deprecated, use ndim instead.")
100%|██████████| 50/50 [01:55<00:00,  2.74s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 0, train loss: 3.478004370563096, val loss: 3.512606213607389


100%|██████████| 50/50 [02:00<00:00,  2.24s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 1, train loss: 3.4620304367759243, val loss: 3.462019469970015


100%|██████████| 50/50 [02:05<00:00,  2.29s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 2, train loss: 3.437556825250525, val loss: 3.3229880407591232


100%|██████████| 50/50 [02:02<00:00,  2.49s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 3, train loss: 3.370811669800373, val loss: 3.3224626801114736


100%|██████████| 50/50 [02:01<00:00,  2.34s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 4, train loss: 3.2970280844356123, val loss: 3.3918963066451284


 16%|█▌        | 8/50 [00:17<01:36,  2.31s/it]

KeyboardInterrupt: 

In [None]:
train_step = theano.function([image_vectors, sentences], loss, updates=lasagne.updates.adam(loss, weights, learning_rate=0.005))

for epoch in range(n_epochs):
    train_loss=0
    for _ in tqdm(range(n_batches_per_epoch)):
        train_loss += train_step(*generate_batch(img_codes,captions,batch_size))
    train_loss /= n_batches_per_epoch
    
    val_loss=0
    for _ in range(n_validation_batches):
        val_loss += val_step(*generate_batch(img_codes,captions,batch_size))
    val_loss /= n_validation_batches
    
    print('\nEpoch: {}, train loss: {}, val loss: {}'.format(epoch, train_loss, val_loss))

print("Finish :)")

  "flatten outdim parameter is deprecated, use ndim instead.")

100%|██████████| 50/50 [01:48<00:00,  2.23s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 0, train loss: 3.411034830124682, val loss: 3.3548591496983122


100%|██████████| 50/50 [01:50<00:00,  2.23s/it]
  0%|          | 0/50 [00:00<?, ?it/s]


Epoch: 1, train loss: 3.3734598216295733, val loss: 3.293693737175338


 96%|█████████▌| 48/50 [01:54<00:04,  2.42s/it]

### apply trained model

In [None]:
#the same kind you did last week, but a bit smaller
from pretrained_lenet import build_model,preprocess,MEAN_VALUES

# build googlenet
lenet = build_model()

#load weights
lenet_weights = pickle.load(open('data/blvc_googlenet.pkl', 'rb'), encoding='latin1')['param values']
set_all_param_values(lenet["prob"], lenet_weights)

#compile get_features
cnn_input_var = lenet['input'].input_var
cnn_feature_layer = lenet['loss3/classifier']
get_cnn_features = theano.function([cnn_input_var], lasagne.layers.get_output(cnn_feature_layer))

In [None]:
import skimage.transform
import numpy as np
MEAN_VALUES = np.array([104, 117, 123]).reshape((3,1,1))
def preprocess(im):
    if len(im.shape) == 2:
        im = im[:, :, np.newaxis]
        im = np.repeat(im, 3, axis=2)
    # Resize so smallest dim = 224, preserving aspect ratio
    h, w, _ = im.shape
    if h < w:
        im = skimage.transform.resize(im, (224, w*224//h), preserve_range=True)
    else:
        im = skimage.transform.resize(im, (h*224//w, 224), preserve_range=True)

    # Central crop to 224x224
    h, w, _ = im.shape
    im = im[h//2-112:h//2+112, w//2-112:w//2+112]
    
    rawim = np.copy(im).astype('uint8')
    
    # Shuffle axes to c01
    im = np.swapaxes(np.swapaxes(im, 1, 2), 0, 1)
    
    # Convert to BGR
    im = im[::-1, :, :]

    im = im - MEAN_VALUES
    return im[np.newaxis].astype('float32')


In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

#sample image
img = plt.imread('data/Dog-and-Cat.jpg')
img = preprocess(img)

In [None]:
#deprocess and show, one line :)
from pretrained_lenet import MEAN_VALUES
plt.imshow(np.transpose((img[0] + MEAN_VALUES)[::-1],[1,2,0]).astype('uint8'))

## Generate caption

In [None]:
#deterministic=True switches dropout off at generation time
last_word_probas_det = get_output(predicted_probabilities,deterministic=True)[:,-1]

get_probs = theano.function([image_vectors,sentences], last_word_probas_det)

#this is exactly the generation function from week5 classwork,
#except now we condition on image features instead of words
def generate_caption(image,caption_prefix = ("#START#",),t=1,sample=True,max_len=100):
    image_features = get_cnn_features(image)
    caption = list(caption_prefix)
    for _ in range(max_len):
        
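        #cast to float64 so that the renormalized probabilities pass np.random.choice's sum-to-one check
        next_word_probs = get_probs(image_features,as_matrix([caption]) ).ravel().astype('float64')
        #apply temperature (t is used as an exponent: t>1 sharpens the distribution, t<1 flattens it)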
        next_word_probs = next_word_probs**t / np.sum(next_word_probs**t)

        if sample:
            next_word = np.random.choice(vocab,p=next_word_probs) 
        else:
            next_word = vocab[np.argmax(next_word_probs)]

        caption.append(next_word)

        if next_word=="#END#":
            break
            
    return caption
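The effect of the temperature step inside `generate_caption` can be seen on a three-word distribution. This is a standalone sketch; `apply_temperature` is a hypothetical helper that repeats the same power-and-renormalize formula.

```python
import numpy as np

def apply_temperature(probs, t):
    """Raise probabilities to the power t and renormalize them to sum to 1."""
    probs = np.asarray(probs, dtype="float64")
    powered = probs ** t
    return powered / powered.sum()

base = [0.7, 0.2, 0.1]
sharper = apply_temperature(base, t=2.0)   # more mass on the likeliest word
flatter = apply_temperature(base, t=0.5)   # closer to uniform
print(sharper, flatter)
```

Larger `t` pushes sampling towards the argmax choice, while smaller `t` gives more varied but noisier captions.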

In [None]:
for i in range(10):
    print(generate_caption(img,t=1.)[1:-1])

# Bonus Part
- Use ResNet Instead of GoogLeNet
- Use W2V as embedding
- Use Attention :) 

# Pass Assignment https://goo.gl/forms/2qqVtfepn0t1aDgh1 

In [None]:
vocab