# Seq2Seq
### Neural Machine Translation using word level language model and embeddings in Keras

- https://medium.com/@dev.elect.iitd/neural-machine-translation-using-word-level-seq2seq-model-47538cba8cd7
- https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
- https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/
- https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/

---

In [55]:
import pandas as pd
import numpy as np

import string
from itertools import chain
import pickle
import re

from string import digits

import matplotlib.pyplot as plt

# Load Data

In [56]:
# load
with open('df_story_desc_final.pickle', 'rb') as f:
    df_story_and_desc = pickle.load(f)

print ("before: ", df_story_and_desc.shape)
df_story_and_desc = df_story_and_desc[:30000]

df_story_and_desc_id = df_story_and_desc['Story_Photo_id']
df_story_and_desc_text = df_story_and_desc[['Desc', 'Story']]

print ("after: ", df_story_and_desc_text.shape)
df_story_and_desc_text.head(15)

before:  (145116, 4)
after:  (30000, 2)


| | Desc | Story |
|---|---|---|
| 0 | big old tree being photographed on a sunny day | and its magnificent trunk , larger than life i... |
| 1 | a old curvy tree in the sun light . | and its magnificent trunk , larger than life i... |
| 2 | a person is taking a picture of a large tree a... | and its magnificent trunk , larger than life i... |
| 3 | large tree with many outstretching branches an... | we found this tree when we were walking in a n... |
| 4 | a green sign is describing a historic tree and... | it turns out it is a popular attraction here . |
| 5 | a large tree with roots that look like crocodi... | the tree is very unusual , with its roots expo... |
| 6 | big old tree being photographed on a sunny day | the trunk was really wide , as much as 12 feet ! |
| 7 | huge brown tree roots rose above the ground . | you can see how big these roots are - pretty a... |
| 8 | a large tree with many branches coming out | we found this tree when we were walking in a n... |
| 9 | a plaque describes an historical tree and advi... | it turns out it is a popular attraction here . |


==> Every block of 15 rows describes the same photo set

In [57]:
df_story_and_desc_text[df_story_and_desc_text.Desc=='big old tree being photographed on a sunny day']

| | Desc | Story |
|---|---|---|
| 0 | big old tree being photographed on a sunny day | and its magnificent trunk , larger than life i... |
| 6 | big old tree being photographed on a sunny day | the trunk was really wide , as much as 12 feet ! |
| 18 | big old tree being photographed on a sunny day | some more different parts of the tree . |
| 24 | big old tree being photographed on a sunny day | the trunk was incredibly thick and rigid . |
| 39 | big old tree being photographed on a sunny day | i was dwarfed by the tree 's size . |


In [58]:
for i,row in df_story_and_desc_text[df_story_and_desc_text.Desc=='big old tree being photographed on a sunny day'].iterrows():
    print("pair",i,":",row['Desc']+" ==> "+row['Story'])

pair 0 : big old tree being photographed on a sunny day ==> and its magnificent trunk , larger than life itself .
pair 6 : big old tree being photographed on a sunny day ==> the trunk was really wide , as much as 12 feet !
pair 18 : big old tree being photographed on a sunny day ==> some more different parts of the tree .
pair 24 : big old tree being photographed on a sunny day ==> the trunk was incredibly thick and rigid .
pair 39 : big old tree being photographed on a sunny day ==> i was dwarfed by the tree 's size .


==> A single description text maps to 5 different stories
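The one-to-many relationship can be checked by grouping stories under their description. The sketch below uses the five pairs printed above and plain Python rather than the notebook's DataFrame; `pairs` and `stories_by_desc` are illustrative names that do not appear in the original notebook.

```python
from collections import defaultdict

# The five (description, story) pairs printed by the cell above.
desc = "big old tree being photographed on a sunny day"
pairs = [
    (desc, "and its magnificent trunk , larger than life itself ."),
    (desc, "the trunk was really wide , as much as 12 feet !"),
    (desc, "some more different parts of the tree ."),
    (desc, "the trunk was incredibly thick and rigid ."),
    (desc, "i was dwarfed by the tree 's size ."),
]

# Collect every story that shares the same description.
stories_by_desc = defaultdict(list)
for d, story in pairs:
    stories_by_desc[d].append(story)

for d, stories in stories_by_desc.items():
    print(len(stories), "stories for:", d)
```

On the full DataFrame, something like `df_story_and_desc_text.groupby('Desc')['Story'].nunique()` should give the same count per description. This matters for training: the model sees identical inputs paired with different targets.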

## cleaning

In [61]:
def re_sub(item):
    re_sentence = []
    for sentence in item:
        sentence = re.sub('[^a-z0-9A-Z]+', ' ', sentence)
        re_sentence.append(sentence)
    return re_sentence

In [94]:
clean_data = df_story_and_desc_text.apply(lambda x: re_sub(x))
clean_data.columns = ['in_desc','out_story']
clean_data['out_story'] = clean_data['out_story'].apply(lambda x : '<sos> '+ x + ' <eos>')
clean_data

| | in_desc | out_story |
|---|---|---|
| 0 | big old tree being photographed on a sunny day | <sos> and its magnificent trunk larger than li... |
| 1 | a old curvy tree in the sun light | <sos> and its magnificent trunk larger than li... |
| 2 | a person is taking a picture of a large tree a... | <sos> and its magnificent trunk larger than li... |
| 3 | large tree with many outstretching branches an... | <sos> we found this tree when we were walking ... |
| 4 | a green sign is describing a historic tree and... | <sos> it turns out it is a popular attraction ... |
| ... | ... | ... |
| 29995 | a crowd of people in a village square three of... | <sos> we went to organization last summer for ... |
| 29996 | people on a safari truck watching as they expl... | <sos> we got to ride so many different rides ... |
| 29997 | children pose for a photograph on steps in a p... | <sos> the kids were having so much fun <eos> |
| 29998 | kids wearing pirate hats are brandishing toy s... | <sos> they loved going around to all the diffe... |
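To see what the cleaning step does to one sentence, here is a minimal sketch that applies the same regular expression as `re_sub` to a story from the data. `clean_sentence` is an illustrative single-string helper, not a function from the notebook. It also shows a side effect: trailing punctuation leaves a trailing space, so wrapping the text with `<sos>` / `<eos>` produces a double space unless the text is stripped first.

```python
import re

def clean_sentence(sentence):
    # Same pattern as re_sub above: each run of non-alphanumerics becomes one space.
    return re.sub('[^a-z0-9A-Z]+', ' ', sentence)

raw = "the trunk was really wide , as much as 12 feet !"
cleaned = clean_sentence(raw)

wrapped = '<sos> ' + cleaned + ' <eos>'                    # as in the notebook
wrapped_stripped = '<sos> ' + cleaned.strip() + ' <eos>'   # strip first

print(repr(cleaned))   # 'the trunk was really wide as much as 12 feet '
print(len(wrapped.split(" ")), len(wrapped.split()))   # 13 12
```

`split(" ")` counts the empty string between the two spaces as a token, while `split()` does not. A length computed with `split(" ")` can therefore be slightly larger than the number of real tokens.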


# Vectorize the data

--- 

0907 DONE

---

In [96]:
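# NOTE: the cell that defined `desc_words` and `story_words` is missing from this export.
# Assumption: they are the sets of whitespace-separated tokens in each column.
desc_words = set(chain.from_iterable(txt.split() for txt in clean_data['in_desc']))
story_words = set(chain.from_iterable(txt.split() for txt in clean_data['out_story']))

input_words = sorted(desc_words)
target_words = sorted(story_words)

num_encoder_tokens = len(desc_words)
num_decoder_tokens = len(story_words)

max_encoder_seq_length = max([len(txt.split(" ")) for txt in clean_data['in_desc']])
max_decoder_seq_length = max([len(txt.split(" ")) for txt in clean_data['out_story']])

# Samples are the rows of clean_data, not the vocabulary entries.
print('Number of samples:', len(clean_data))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Number of samples: 30000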
Number of unique input tokens: 5690
Number of unique output tokens: 6524
Max sequence length for inputs: 54
Max sequence length for outputs: 82


In [97]:
encoder_input_data = np.zeros(
    (len(clean_data.in_desc), max_encoder_seq_length),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(clean_data.out_story), max_decoder_seq_length),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(clean_data.out_story), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

In [98]:
print(encoder_input_data.shape)
print(decoder_input_data.shape)
print(decoder_target_data.shape)

(30000, 54)
(30000, 82)
(30000, 82, 6524)


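**Why is decoder_target_data 3-D?**

- Because the decoder has a softmax layer that predicts the next word from the previous word at every timestep. Each target timestep is therefore a one-hot vector over the output vocabulary, which gives the shape (samples, timesteps, num_decoder_tokens).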
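One-hot targets come at a memory cost, which a quick calculation from the printed shapes makes visible. The sketch below is plain arithmetic rather than notebook code. As an alternative, not what this notebook does, Keras also provides the `sparse_categorical_crossentropy` loss, which accepts integer class indices instead of one-hot vectors.

```python
# Shape of decoder_target_data printed above: (30000, 82, 6524), dtype float32.
samples, timesteps, vocab = 30000, 82, 6524
bytes_per_float32 = 4

one_hot_bytes = samples * timesteps * vocab * bytes_per_float32
index_bytes = samples * timesteps * bytes_per_float32  # one index per timestep

print(f"one-hot targets: {one_hot_bytes / 1e9:.1f} GB")   # 64.2 GB
print(f"integer targets: {index_bytes / 1e6:.1f} MB")     # 9.8 MB
```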

--- 

0908 DONE

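- Why does it start at 1?

    - ==> Nothing is initialized to 1. The `if t > 0` check simply skips t=0, so the targets are shifted one step ahead and do not include the start token.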

---

In [99]:
input_token_index = dict([(word, i) for i, word in enumerate(input_words)])
target_token_index = dict([(word, i) for i, word in enumerate(target_words)])
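One thing to watch with these mappings: `enumerate` starts at 0 and the data arrays are zero-initialized, so the word that receives index 0 is indistinguishable from padding. A minimal sketch of one common workaround is to reserve index 0 for padding. `build_token_index` is a hypothetical helper, not part of the notebook.

```python
def build_token_index(words, first_index=1):
    """Map each word to an integer, leaving indices below first_index unused."""
    return {word: i for i, word in enumerate(sorted(words), start=first_index)}

toy_words = {'tree', 'a', 'big'}
token_index = build_token_index(toy_words)
reverse_index = {i: word for word, i in token_index.items()}

print(token_index)   # {'a': 1, 'big': 2, 'tree': 3}
```

With this scheme the vocabulary size passed to an `Embedding` layer would need to be `len(words) + 1`, and the layer's `mask_zero=True` option could then be used to ignore padded positions.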

In [100]:
for i, (input_text, target_text) in enumerate(zip(clean_data['in_desc'], clean_data['out_story'])):

    # encoder
    for t, word in enumerate(input_text.split()):
        encoder_input_data[i, t] = input_token_index[word]
        
    # decoder
    for t, word in enumerate(target_text.split()):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t] = target_token_index[word]  
        if t > 0: 
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[word]] = 1.
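The one-step shift between decoder input and decoder target is easier to see on a toy sentence. This sketch uses plain lists instead of the notebook's arrays, and the sentence is made up for illustration.

```python
tokens = '<sos> the trunk was wide <eos>'.split()

decoder_input = tokens        # what the decoder reads at each timestep
decoder_target = tokens[1:]   # what it should predict: the same sequence, one step ahead

for t, target_word in enumerate(decoder_target):
    print(f"t={t}: input={decoder_input[t]!r} -> target={target_word!r}")
```

The target never contains `<sos>`, and its last element is `<eos>`. That is the effect of writing to `decoder_target_data[i, t - 1, ...]` only when `t > 0`.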

In [101]:
print (decoder_input_data.shape)
decoder_input_data

(30000, 82)


array([[  46.,  234., 3129., ...,    0.,    0.,    0.],
       [  46.,  234., 3129., ...,    0.,    0.,    0.],
       [  46.,  234., 3129., ...,    0.,    0.,    0.],
       ...,
       [  46., 5832., 3224., ...,    0.,    0.,    0.],
       [  46., 5844., 3458., ...,    0.,    0.,    0.],
       [  46., 6331., 2708., ...,    0.,    0.,    0.]], dtype=float32)

In [102]:
print(decoder_target_data.shape)
decoder_target_data

(30000, 82, 6524)


array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       ...,

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]], dtype=float32)

--- 

0909 DONE

---

# Build keras encoder-decoder model

- http://incredible.ai/nlp/2020/02/20/Sequence-To-Sequence-with-Attention/
- https://docs.chainer.org/en/stable/examples/seq2seq.html

Study the theory first..

In [104]:
import tensorflow as tf
from tensorflow import keras

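from keras.models import Model
# Embedding, Bidirectional and Concatenate are used by the BiLSTM model further down.
from keras.layers import Input, LSTM, Dense, Embedding, Bidirectional, Concatenate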

In [106]:
batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.

In [116]:
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
print('encoder_inputs: ', encoder_inputs.shape)

encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
print('encoder_outputs: ', encoder_outputs.shape)
print('state_h: ', state_h.shape)
print('state_c: ', state_c.shape)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

encoder_inputs:  (None, None, 5690)
encoder_outputs:  (None, 256)
state_h:  (None, 256)
state_c:  (None, 256)


In [117]:
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
print('decoder_inputs: ', decoder_inputs.shape)

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)

print('decoder_outputs: ', decoder_outputs.shape)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
print('decoder_outputs: ', decoder_outputs.shape)

decoder_inputs:  (None, None, 6524)
decoder_outputs:  (None, None, 256)
decoder_outputs:  (None, None, 6524)


In [29]:
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints


class AttentionL(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(AttentionL, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        # Pass the shape by keyword; a positional tuple is read as `name` in newer Keras.
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zeros',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                        K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        if mask is not None:
            a *= K.cast(mask, K.floatx())

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0],  self.features_dim

    def get_config(self):
        config={'step_dim':self.step_dim}
        base_config = super(AttentionL, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
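To make the `call` method easier to follow, here is a NumPy sketch of the same computation: a score per timestep, normalization into weights, then a weighted sum over time. It is an illustration under assumed toy shapes, not a replacement for the layer, and `attention_pool` is a made-up name. In the cells shown here the layer is defined but not wired into the models.

```python
import numpy as np

def attention_pool(x, w, b, eps=1e-7):
    """x: (batch, steps, features), w: (features,), b: (steps,)."""
    e = np.tanh(x @ w + b)                         # one score per timestep: (batch, steps)
    a = np.exp(e)
    a = a / (a.sum(axis=1, keepdims=True) + eps)   # normalize scores into weights
    return (x * a[..., None]).sum(axis=1)          # weighted sum over time: (batch, features)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))
out = attention_pool(x, w=np.zeros(4), b=np.zeros(3))
print(out.shape)   # (2, 4)
```

With all-zero weights every timestep gets the same score, so the result reduces to the mean over timesteps. Trained weights let some timesteps contribute more than others.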

Using TensorFlow backend.


In [31]:
EMBEDDING_DIM = 150
vocab_size_input = num_encoder_tokens
vocab_size_output = num_decoder_tokens
# Use the maximum lengths computed in the vectorization step above.
MAX_LENGTH_INPUT = max_encoder_seq_length
MAX_LENGTH_OUTPUT = max_decoder_seq_length
units = 100

# Model test

In [32]:
# 11-02  decoder 1 layer
encoder_inputs = Input(shape=(None,))

display('encoder_inputs: ', encoder_inputs.shape)

en_x=  Embedding(num_encoder_tokens, EMBEDDING_DIM)(encoder_inputs)

encoder = Bidirectional(LSTM(units, return_state=True,
                             dropout = 0.5, recurrent_dropout = 0.5))

encoder_outputs, forward_h, forward_c, backward_h, backward_c = encoder(en_x)

state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

### decoder

decoder_inputs = Input(shape=(None,))

dex=  Embedding(num_decoder_tokens, EMBEDDING_DIM)

final_dex= dex(decoder_inputs)

decoder_lstm = LSTM(units * 2, return_sequences=True, return_state=True,
                    dropout = 0.5, recurrent_dropout = 0.5)

print (decoder_lstm(final_dex, initial_state = encoder_states))
decoder_outputs, _, _ = decoder_lstm(final_dex, initial_state = encoder_states)

decoder_dense = Dense(num_decoder_tokens, activation='softmax')

decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['acc'])

model.summary()

'encoder_inputs: '

TensorShape([Dimension(None), Dimension(None)])

[<tf.Tensor 'lstm_2/transpose_1:0' shape=(?, ?, 200) dtype=float32>, <tf.Tensor 'lstm_2/while/Exit_2:0' shape=(?, 200) dtype=float32>, <tf.Tensor 'lstm_2/while/Exit_3:0' shape=(?, 200) dtype=float32>]
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 150)    1179450     input_1[0][0]                    
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
bidirec

In [33]:
# `tf` is imported above; tf.device pins the training ops to the first GPU.
with tf.device('/gpu:0'):
    model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
              batch_size = 128,
              epochs = 50)

Epoch 1/50
...
Epoch 50/50


In [77]:
# from keras.models import load_model
# model.save("1124_bilstm_emb150_model22.h5")



#### Create sampling model

In [34]:
encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 150)    1179450     input_1[0][0]                    
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, 200), (None, 200800      embedding_1[0][0]                
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 200)          0           bidirectional_1[0][1]            
                                                                 bidirectional_1[0][3]            
__________
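The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below uses the standard LSTM parameter formula; the helper names are illustrative and not from the notebook.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)

EMBEDDING_DIM, units = 150, 100

bilstm = 2 * lstm_params(EMBEDDING_DIM, units)   # forward + backward LSTM
print(bilstm)                                    # 200800, as in the summary
print(1179450 // EMBEDDING_DIM)                  # 7863
```

The embedding count of 1179450 corresponds to 7863 input tokens at 150 dimensions. This suggests the summary was produced in a run with a larger vocabulary than the 5690 tokens printed earlier. The concatenated encoder state has size `2 * units = 200`, which is why the decoder LSTM is built with `units * 2`.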

In [35]:
decoder_state_input_h = Input(shape=(units * 2,))  # The encoder is a BiLSTM, so each state has size units * 2 (200), not units (100).
decoder_state_input_c = Input(shape=(units * 2,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

final_dex2= dex(decoder_inputs)
print (final_dex2)

decoder_outputs2, state_h2, state_c2 = decoder_lstm(final_dex2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())


Tensor("embedding_2_1/embedding_lookup:0", shape=(?, ?, 150), dtype=float32)


In [36]:
# # save
# with open('bilstm150_encoder_model_1124.json', 'w', encoding='utf8') as f:
#     f.write(encoder_model.to_json())
# encoder_model.save_weights('bilstm150_encoder_model_1124.h5')

# with open('bilstm150_decoder_model_1124.json', 'w', encoding='utf8') as f:
#     f.write(decoder_model.to_json())
# decoder_model.save_weights('bilstm150_decoder_model_1124.h5')



In [63]:
# model load
from keras.models import model_from_json
def load_model(model_filename, model_weights_filename):
    with open(model_filename, 'r', encoding='utf8') as f:
        model = model_from_json(f.read())
    model.load_weights(model_weights_filename)
    return model

# Bind the loaded models to the names that decode_sequence uses below.
encoder_model = load_model('test_bilstm150_encoder_model.json', 'test_bilstm150_encoder_model.h5')
decoder_model = load_model('test_bilstm150_decoder_model.json', 'test_bilstm150_decoder_model.h5')

In [65]:
reverse_input_char_index

{0: '#',
 1: '&',
 2: "'",
 3: '(',
 4: ')',
 5: '..',
 6: '1',
 7: '10',
 8: '101',
 9: '10:05',
 10: '12',
 11: '14,',
 12: '1942',
 13: "1950's",
 14: '1958',
 15: '1998.',
 16: '2',
 17: '200',
 18: '2003.',
 19: '2007-2008',
 20: '2010',
 21: '2012',
 22: '2012.',
 23: '20rh',
 24: '20th',
 25: '24',
 26: '25',
 27: '3',
 28: '3-legged',
 29: '34',
 30: '4',
 31: '4-lane',
 32: '429',
 33: '429-8044,',
 34: '4th',
 35: '4th.',
 36: '5',
 37: '50',
 38: '51',
 39: '5k',
 40: '6',
 41: '6,',
 42: '66',
 43: '70',
 44: '703',
 45: '73',
 46: '800',
 47: '8044',
 48: '80th',
 49: ':',
 50: ';',
 51: '?',
 52: '[',
 53: '[female',
 54: '[female]',
 55: "[female]'s",
 56: '[female].',
 57: '[location',
 58: '[location]',
 59: "[location]''",
 60: "[location]''.",
 61: "[location]'s",
 62: '[location],',
 63: '[location].',
 64: '[male',
 65: '[male]',
 66: "[male]'s",
 67: '[male].',
 68: '[organization',
 69: '[organization]',
 70: "[organization]''.",
 71: "[organization]'s",
 72: '[o

In [66]:
reverse_target_char_index

{0: '!',
 1: '#',
 2: '$',
 3: '&',
 4: "'",
 5: "''",
 6: '(',
 7: ')',
 8: '-',
 9: '--',
 10: '-free',
 11: '.',
 12: '..',
 13: '...',
 14: '.finally!',
 15: '0.0',
 16: '000',
 17: '1',
 18: '10',
 19: '100',
 20: "100's",
 21: '100,000',
 22: '1000',
 23: '11:30pm,',
 24: '12',
 25: '12.',
 26: '13',
 27: '1300',
 28: '1500',
 29: '1700',
 30: "1700's.",
 31: "1800's.",
 32: '1900',
 33: '1940',
 34: '1940s',
 35: '1942',
 36: '1994.',
 37: '1995.',
 38: '1998',
 39: '1st',
 40: '2',
 41: '20',
 42: '20,000',
 43: '2nd',
 44: '3',
 45: '30',
 46: '30,',
 47: '30th',
 48: '37',
 49: '3d',
 50: '3rd',
 51: '4',
 52: "40's.",
 53: '45',
 54: '4h',
 55: '4t',
 56: '4th',
 57: '4th!',
 58: '4th.',
 59: '5',
 60: '5.',
 61: '56',
 62: '5th',
 63: '66',
 64: '7th',
 65: '80',
 66: "80's",
 67: '9:20',
 68: ':',
 69: ';',
 70: '?',
 71: "?'",
 72: 'START_',
 73: '[female',
 74: '[female]',
 75: '[female]!',
 76: "[female]'s",
 77: '[female],',
 78: '[female].',
 79: '[male',
 80: '[male]

In [67]:
target_token_index['START_']

72

#### Function to generate sequences

In [68]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    #print ("target_seq: ", target_seq)
    # Populate the first position of the target sequence with the start token.
    target_seq[0, 0] = target_token_index['START_']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        #print ("output_tokens: ", output_tokens)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        #print ("sampled_token_index: ", sampled_token_index)
        #print ("")
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' ' + sampled_char

        # Exit condition: stop on the end token, or once the decoded string
        # exceeds 150 characters (a character count, not a word count).
        if (sampled_char == '_END' or
           len(decoded_sentence) > 150): # 52
            # print ("Stop_condition = TRUE")
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence
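The control flow of greedy decoding can be tried without a trained model. In the sketch below a hand-written bigram table stands in for `decoder_model.predict`, and the loop is capped by word count rather than by character count. `greedy_decode`, `toy_step` and `BIGRAMS` are all hypothetical stand-ins, and the scores are made up.

```python
# Toy "model": next-word scores given the previous word (made-up numbers).
BIGRAMS = {
    '<sos>': {'the': 0.9, 'a': 0.1},
    'the': {'tree': 0.8, '<eos>': 0.2},
    'tree': {'is': 0.7, '<eos>': 0.3},
    'is': {'big': 0.6, '<eos>': 0.4},
    'big': {'<eos>': 1.0},
}

def toy_step(token, state):
    # Stand-in for decoder_model.predict: returns scores and an updated state.
    return BIGRAMS[token], state + 1

def greedy_decode(step_fn, start_token, end_token, initial_state, max_words=20):
    token, state, words = start_token, initial_state, []
    while len(words) < max_words:
        scores, state = step_fn(token, state)
        token = max(scores, key=scores.get)   # argmax over the vocabulary
        if token == end_token:
            break
        words.append(token)
    return ' '.join(words)

result = greedy_decode(toy_step, '<sos>', '<eos>', initial_state=0)
print(result)   # the tree is big
```

As in `decode_sequence`, each step feeds the previously chosen token and the updated state back in. Breaking before appending keeps the end token out of the result, so no later `re.sub` is needed to strip it.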

# test data

In [71]:
clean_df_all

| | description_text | story_text |
|---|---|---|
| 0 | big old tree being photographed on a sunny da | START_ and its magnificent trunk, larger than ... |
| 1 | a old curvy tree in the sun light. | START_ and its magnificent trunk, larger than ... |
| 2 | a person is taking a picture of a large tree a... | START_ and its magnificent trunk, larger than ... |
| 3 | large tree with many outstretching branches an... | START_ we found this tree when we were walking... |
| 4 | a green sign is describing a historic tree and... | START_ it turns out it is a popular attraction... |
| 5 | a large tree with roots that look like crocodi... | START_ the tree is very unusual, with its root... |
| 6 | big old tree being photographed on a sunny da | START_ the trunk was really wide, as much as 1... |
| 7 | huge brown tree roots rose above the ground. | START_ you can see how big these roots are - p... |
| 8 | a large tree with many branches coming ou | START_ we found this tree when we were walking... |
| 9 | a plaque describes an historical tree and advi... | START_ it turns out it is a popular attraction... |


In [44]:
df_input5000_0 = clean_df_all.description_text[0:5000]
df_input5000_1 = clean_df_all.description_text[5000:10000]
df_input5000_2 = clean_df_all.description_text[10000:15000]
df_input5000_3 = clean_df_all.description_text[15000:20000]
df_input5000_4 = clean_df_all.description_text[20000:25000]
df_input5000_5 = clean_df_all.description_text[25000:30000]
df_input5000_5

25000    red and blue dominate the colorful archways in...
25001       large colorful building lit up in the evening.
25002      a projection of an image made out of binary cod
25003    work of modern art being displayed and lit in ...
25004    adult man sitting on commode using laptop in w...
25005    a view of the building inside of a well-lit sc...
25006    a nighttime photograph of people outside of a ...
25007    artistic mural projections in green rendering ...
25008      a sculpture made of metal on display at a museu
25009    man sits on commode in water closet while work...
25010    the inside of the church is illuminated with d...
25011    a group of spectators is watching a a building...
25012    a red wall with a green digital display showin...
25013    an object that has been crafted by a creative ...
25014    a man sits on a toilet while using a laptop co...
25015    red and blue dominate the colorful archways in...
25016       large colorful building lit up in the evenin

In [45]:
df_story_and_desc_id

0        2626983575
1        2626983575
2        2626983575
3        2701863545
4        2626977325
5        2627795780
6        2626983575
7        2626982337
8        2701863545
9        2626977325
10       2627795780
11       2626983575
12       2626982337
13       2701863545
14       2626977325
15       2627795780
16       2626983575
17       2626982337
18       2626983575
19       2626983575
20       2626983575
21       2701863545
22       2626977325
23       2627795780
24       2626983575
25       2626982337
26       2701863545
27       2626977325
28       2627795780
29       2626983575
            ...    
29970    4794142562
29971    4794143044
29972    4793510447
29973    4793511023
29974    4793511455
29975    4794142562
29976    4794143044
29977    4793510447
29978    4793511023
29979    4793511455
29980    4794142562
29981    4794143044
29982    4793510447
29983    4793511023
29984    4793511455
29985    4794177464
29986    4793545161
29987    4794180770
29988    4793552387


In [46]:
def return_input_length(start_num, data):
    # Row indices covered by this chunk: start_num .. start_num + len(data) - 1
    return list(range(start_num, start_num + len(data)))

df_input5000_0_length = return_input_length(0, df_input5000_0)
df_input5000_1_length = return_input_length(5000, df_input5000_1)
df_input5000_2_length = return_input_length(10000, df_input5000_2)
df_input5000_3_length = return_input_length(15000, df_input5000_3)
df_input5000_4_length = return_input_length(20000, df_input5000_4)
df_input5000_5_length = return_input_length(25000, df_input5000_5)

df_input5000_5_length

[25000,
 25001,
 25002,
 ...]
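The six hand-written slices and index lists follow one pattern, so they could also be generated. A minimal sketch, assuming 30000 rows split into chunks of 5000; `chunk_ranges` is an illustrative helper that the notebook does not define.

```python
def chunk_ranges(total, chunk_size):
    """Split range(total) into consecutive ranges of at most chunk_size items."""
    return [range(start, min(start + chunk_size, total))
            for start in range(0, total, chunk_size)]

chunks = chunk_ranges(30000, 5000)
print(len(chunks), chunks[-1])   # 6 range(25000, 30000)
```

Each `range` can be iterated directly in the decoding loop in place of lists such as `df_input5000_5_length`.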


In [74]:
import re
from tqdm import tqdm
# clean_df_all.description_text, clean_df_all.story_text
desc_paragraph = []
story_paragraph = []
#df_input_length_list= [df_input5000_0_length, df_input5000_1_length, df_input5000_2_length, df_input5000_3_length, df_input5000_4_length, df_input5000_5_length]
df_input_length_list = [df_input5000_0_length, df_input5000_1_length, df_input5000_2_length]

for item in tqdm(df_input_length_list): 
    for seq_index in item:
        input_seq = encoder_input_data[seq_index: seq_index + 1]
        decoded_sentence = decode_sequence(input_seq)
        #print('-')
        #print('Input sentence:', clean_df_all.description_text[seq_index: seq_index + 1])
        #print (type(clean_df_all.description_text[seq_index: seq_index + 1]))
        # `df_test_input3` is not defined in this notebook; use clean_df_all, as in the commented prints.
        desc_paragraph.append(list(clean_df_all.description_text[seq_index: seq_index + 1]))
        #print('decoded sentence: ', decoded_sentence)
        re_decoded_sentence = re.sub('_END', '', decoded_sentence).strip()
        #print('Re decoded sentence:', re_decoded_sentence)
        story_paragraph.append(re_decoded_sentence)


100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 21.64it/s]
