# <u>Neural Date Translation



**Date translation from Human readable format to machine readable format(YYYY-MM-DD) using Recurrent Neural Network.**
For example a date in human readable format can be : 'sunday 15 september 2013', '29-oct-1997' or '30 august 1985' etc.
<br>The task is to convert this to a more normalized format that is **YYYY-MM-DD**.

For this task a **sequence to sequence encoder-decoder** network with **LSTM units** have been used along with **Attention Mechanism**. 

In [20]:
from keras.layers import RepeatVector, Dense, Activation
from keras.layers import Bidirectional, Concatenate, Dot, LSTM, Input
from keras.optimizers import Adam
from keras.models import Model

import os.path
import numpy as np
import random
from utility import *

## Data Preprocessing and Loading
We will train the neural network on a dataset of 10000 human readable dates(e.g. "the 19th of July 2016", "23/02/2017") and their equivalent machine readable dates that will be in format **DD-MM-YYYY**.

For generating the training examples we will be using Faker.
Code for generating the training examples and for preprocessing is there in the utility file.
1. First we generate the training examples
2. We make mappings from characters used in the dates to numerical indices and vice-versa.
3. We make one hot encodings for the training data.

In [2]:
import numpy as np
import random
from faker import Faker
from babel.dates import format_date

# for creating a single fake date
def create_date(fake_obj, DATE_FORMATS):
    
    try:
        # create a date object
        dt = fake_obj.date_object()
        # create machine readable dates
        machine_read_dates = dt.isoformat()
        # get human readable dates
        human_read_dates = format_date(dt, format=random.choice(DATE_FORMATS),
                            locale='en_IN')
        # remove punctuations
        human_read_dates = human_read_dates.replace(',', '')
        # change to lower case
        human_read_dates = human_read_dates.lower()
    
    except AttributeError as e:
        return None, None
    
    return human_read_dates, machine_read_dates


# for creating 'm' training dataset
def create_dataset(m):
    '''
    Arg:
        m: no. of training examples
    Returns:
    dataset --list: for saving list of tuples of date pairs
    human_vocab --set: for saving human readable dates vocabulary
    machine_vocab -- set: for saving machine readable dates vocabulary
    
    '''
    # for generating fake training data
    fake_obj = Faker()
    
    # date formats for generating date
    # one of them is selected randomly each time that is why full is mentioned
    # many times to increase its chances
    DATE_FORMATS = ['short',
           'medium',
           'long',
           'full',
           'full',
           'full',
           'full',
           'full',
           'full',
           'full',
           'full',
           'full',
           'full',
           'd MMM YYY', 
           'd MMMM YYY',
           'dd MMM YYY',
           'd MMM, YYY',
           'd MMMM, YYY',
           'dd, MMM YYY',
           'd MM YY',
           'd MMMM YYY',
           'MMMM d YYY',
           'MMMM d, YYY',
           'dd.MM.YY']
    # for saving the dataset
    dataset = []
    # for saving human readable dates vocabulary
    human_vocab = set()
    # for saving machine readable dates vocabulary
    machine_vocab = set()

    for i in range(m):
        human_date, machine_date = create_date(fake_obj, DATE_FORMATS)
        if human_date is not None:
            # add date to dataset
            dataset.append((human_date, machine_date))
            # add new vocabulary entry
            for char in human_date:
                if char not in human_vocab:
                    human_vocab.add(char)
            for char in machine_date:
                if char not in machine_vocab:
                    machine_vocab.add(char)
            
    human_vocab = sorted(human_vocab)
    machine_vocab = sorted(machine_vocab)
    
    return dataset, human_vocab, machine_vocab

In [3]:
def preprocess_data(m, dataset, human_char_idx, machine_char_idx, Tx, Ty):
    # separate the tuples
    X, Y = zip(*dataset)
    
    # make numpy arrays to store the X and Y data
    X_ohe = np.zeros((m, Tx, len(human_char_idx)), dtype = 'float32')
    Y_ohe = np.zeros((m, Ty, len(machine_char_idx)), dtype = 'float32')
    
    
    # truncate the length of date if it exceeds Tx
    for i, date in enumerate(X):
        if len(date) > Tx:
            X[i] = X[:Tx]
            
    # now do one hot encoding
    for i in range(m):
        for timestep, char in enumerate(X[i]):
            X_ohe[i, timestep, human_char_idx[char]] = 1
        for timestep, char in enumerate(Y[i]):
            Y_ohe[i, timestep, machine_char_idx[char]] = 1
            
    return X, Y, X_ohe, Y_ohe

In [4]:
def create_training_data(m, Tx, Ty):
    dataset, human_vocab, machine_vocab = create_dataset(m)
    # add the unknown and pad characters
    #human_vocab += ['<UNK>', '<PAD>']
    
    # now we will create a dictionary for mapping the vocabulary tokens to numerical indices
    human_char_idx = dict((token, i) for i, token in enumerate(human_vocab) )
    # reverse mapping from indices to tokens for machine readable dates
    machine_idx_char = dict(enumerate(machine_vocab))
    # mapping from char to indices for machine readable dates
    machine_char_idx = dict((token, i) for i, token in enumerate(machine_vocab) )

    X, Y, X_ohe, Y_ohe = preprocess_data(m, dataset, human_char_idx, machine_char_idx, Tx, Ty)

    return dataset, X, Y, X_ohe, Y_ohe, human_vocab, human_char_idx, machine_vocab, machine_char_idx, machine_idx_char    

In [5]:
# no. of training examples
m = 10000
Tx = 30 # time steps for input
Ty = 10 # time steps for output

dataset, X, Y, X_ohe, Y_ohe, human_vocab, human_char_idx, machine_vocab, machine_char_idx, machine_idx_char = create_training_data(m, Tx, Ty)

In [6]:
dataset[:10]

[('monday 8 september 1997', '1997-09-08'),
 ('november 8 2017', '2017-11-08'),
 ('10 february 2010', '2010-02-10'),
 ('monday 5 july 1982', '1982-07-05'),
 ('30.06.02', '2002-06-30'),
 ('wednesday 6 august 1975', '1975-08-06'),
 ('16 december 2008', '2008-12-16'),
 ('saturday 16 january 1993', '1993-01-16'),
 ('wednesday 15 april 1981', '1981-04-15'),
 ('30 dec 2012', '2012-12-30')]

In [7]:
print(X_ohe.shape)
print(Y_ohe.shape)
print( human_char_idx)
print()
print( machine_char_idx)

(10000, 30, 36)
(10000, 10, 11)
{' ': 0, '-': 1, '.': 2, '/': 3, '0': 4, '1': 5, '2': 6, '3': 7, '4': 8, '5': 9, '6': 10, '7': 11, '8': 12, '9': 13, 'a': 14, 'b': 15, 'c': 16, 'd': 17, 'e': 18, 'f': 19, 'g': 20, 'h': 21, 'i': 22, 'j': 23, 'l': 24, 'm': 25, 'n': 26, 'o': 27, 'p': 28, 'r': 29, 's': 30, 't': 31, 'u': 32, 'v': 33, 'w': 34, 'y': 35}

{'-': 0, '0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10}


# <u>Model


## Model Architecture
We will use an **encoder-decoder** network for for this task.<br>
        Attention mechanism has also been incorporated to make the decoding better. The network uses **LSTM** cells
in both decoder and encoder networks. The pre attention encoder network is actually a **bidirectional LSTM** and the post attention decoder is an **unidirectional LSTM** network.  <br>    
        Post attention LSTM cells get the input from the context calculated from the attention weights. The output from each cell is not passed onto the next since it doesn't matter much what the previous chracter is in **YYYY-MM-DD** sequence.
        
To make sure that the layers share the same weight values throughout the different timesteps we will make layers defined as global variables.

In [8]:
# shared layers
# Following layers are mainly for the neural network for finding the attention weights
concatenator = Concatenate(axis = -1)
repeator = RepeatVector(Tx)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation('softmax')
dotor = Dot(axes = 1)

In [9]:
# shared layers
# below layers are for the post LSTM network
n_acti_pre = 32 # hidden activation units in pre attention LSTM network
n_acti_post = 64 # hidden activation units in post attention LSTM network

# post activation LSTM cell
post_acti_LSTM = LSTM(n_acti_post, return_state = True)
output_layer = Dense(len(machine_vocab), activation = 'softmax')

# initial input states for post LSTM network
ini_acti_post = np.zeros((m, n_acti_post))
ini_mem_post = np.zeros((m, n_acti_post))

In [10]:
# for getting the context value using attention weights for each timestep of output
def get_context(acti_pre, acti_post_prev):
    """
    Finds the context value using attention weights for each timestep of output
    Arguments:
        acti_pre -- numpy-array(m, Tx, 2*n_acti_pre): hidden state values of  Bidirectional-LSTM network 
        acti_post_prev -- numpy array(m, n_acti_post): previous hidden state of the (post-attention) LSTM

    Returns:
        context -- vector, input of the next post-attetion LSTM cell
    """
    # repeat the previous state of the post -attention LSTM cell
    acti_post_prev = repeator(acti_post_prev)
    # concatenate the output with the activations from the different pre attetntion LSTM cells
    concat_vals = concatenator([acti_post_prev, acti_pre])
    # pass the concatenated values through a dense layer
    inter_vals = densor1(concat_vals)
    inter_vals = densor2(inter_vals)
    # pass the intermediate value through a softmax layer to get the alpha values
    alpha_vals = activator(inter_vals)
    # after getting the alpha values find the sum of weighted product of pre- activations values with their context weights(alpha values)
    context = dotor([alpha_vals, acti_pre])
    
    return context

In [11]:
# for creating a Keras model Instance
def create_model(human_vocab_len, machine_vocab_len, Tx, Ty, n_acti_pre, n_acti_post):
    """
    Arguments:
        human_vocab_len - length of the human_char_idx dictionary
        machine_vocab_len - length of the machine_char_idx dictionary
        Tx - length of the input sequence
        Ty - length of the output sequence
        n_acti_pre - no. of hidden state units of the Bi-LSTM
        n_acti_post - no. of hidden state units of the post-attention LSTM

    Returns:
        model -- Keras model instance
    """
    # for storing the outputs
    outputs = []
    
    # define input for the model
    X = Input(shape=(Tx, human_vocab_len))
    # for the decoder LSTM i.e post attention network
    # initial values
    ini_acti_post = Input(shape=(n_acti_post,), name='ini_acti_post')
    ini_mem_post = Input(shape=(n_acti_post,), name='ini_mem_post')
   
    
    # current timestep values
    mem_post = ini_mem_post
    acti_post = ini_acti_post
    
    # make the encoder Bidirectional LSTM network
    acti_pre = Bidirectional(LSTM(n_acti_pre, return_sequences=True))(X)
    
    # loop over each output timestep
    for timestep in range(Ty):
        # get the current context value
        context = get_context(acti_pre, acti_post)
        # feed the context value to the decoder LSTM network post attention
        acti_post, _, mem_post = post_acti_LSTM (context, initial_state=[acti_post, mem_post])
        # get the softmax output 
        target_output = output_layer(acti_post)
        
        # add the output target value
        outputs.append(target_output)
   
    # make the model and return model instance
    model = Model(inputs=[X, ini_acti_post, ini_mem_post], output=outputs)
    
    return model

In [12]:
model = create_model(len(human_vocab), len(machine_vocab), Tx, Ty, n_acti_pre, n_acti_post)



In [13]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
ini_acti_post (InputLayer)      (None, 64)           0                                            
__________________________________________________________________________________________________
input_1 (InputLayer)            (None, 30, 36)       0                                            
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector)  (None, 30, 64)       0           ini_acti_post[0][0]              
                                                                 lstm_1[0][0]                     
                                                                 lstm_1[1][0]                     
                                                                 lstm_1[2][0]                     
          

In [21]:
# load weights from any previously saved model
model_path = r'models/weights_55.h5'
if os.path.exists(model_path):
    model.load_weights(model_path)

In [47]:
# define optimizer and compile the model
opt = Adam(lr=0.005, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics = ['accuracy'])

In [51]:
model.fit([X_ohe, ini_acti_post, ini_mem_post], list(Y_ohe.swapaxes(0,1)), epochs=20, batch_size=100, verbose = 2)

Epoch 1/20
 - 22s - loss: 0.4388 - dense_4_loss_1: 0.0135 - dense_4_loss_2: 0.0116 - dense_4_loss_3: 0.0811 - dense_4_loss_4: 0.0902 - dense_4_loss_5: 0.0014 - dense_4_loss_6: 0.0205 - dense_4_loss_7: 0.1163 - dense_4_loss_8: 0.0020 - dense_4_loss_9: 0.0344 - dense_4_loss_10: 0.0680 - dense_4_acc_1: 0.9954 - dense_4_acc_2: 0.9955 - dense_4_acc_3: 0.9770 - dense_4_acc_4: 0.9814 - dense_4_acc_5: 1.0000 - dense_4_acc_6: 0.9936 - dense_4_acc_7: 0.9707 - dense_4_acc_8: 1.0000 - dense_4_acc_9: 0.9978 - dense_4_acc_10: 0.9897
Epoch 2/20
 - 22s - loss: 0.4207 - dense_4_loss_1: 0.0130 - dense_4_loss_2: 0.0109 - dense_4_loss_3: 0.0769 - dense_4_loss_4: 0.0858 - dense_4_loss_5: 0.0014 - dense_4_loss_6: 0.0199 - dense_4_loss_7: 0.1132 - dense_4_loss_8: 0.0020 - dense_4_loss_9: 0.0322 - dense_4_loss_10: 0.0654 - dense_4_acc_1: 0.9954 - dense_4_acc_2: 0.9953 - dense_4_acc_3: 0.9788 - dense_4_acc_4: 0.9840 - dense_4_acc_5: 1.0000 - dense_4_acc_6: 0.9933 - dense_4_acc_7: 0.9717 - dense_4_acc_8: 1.0000

Epoch 17/20
 - 21s - loss: 0.2521 - dense_4_loss_1: 0.0075 - dense_4_loss_2: 0.0055 - dense_4_loss_3: 0.0403 - dense_4_loss_4: 0.0425 - dense_4_loss_5: 9.2088e-04 - dense_4_loss_6: 0.0137 - dense_4_loss_7: 0.0822 - dense_4_loss_8: 0.0013 - dense_4_loss_9: 0.0183 - dense_4_loss_10: 0.0399 - dense_4_acc_1: 0.9980 - dense_4_acc_2: 0.9982 - dense_4_acc_3: 0.9919 - dense_4_acc_4: 0.9963 - dense_4_acc_5: 1.0000 - dense_4_acc_6: 0.9961 - dense_4_acc_7: 0.9812 - dense_4_acc_8: 1.0000 - dense_4_acc_9: 0.9994 - dense_4_acc_10: 0.9961
Epoch 18/20
 - 22s - loss: 0.2473 - dense_4_loss_1: 0.0073 - dense_4_loss_2: 0.0054 - dense_4_loss_3: 0.0391 - dense_4_loss_4: 0.0411 - dense_4_loss_5: 9.5703e-04 - dense_4_loss_6: 0.0136 - dense_4_loss_7: 0.0809 - dense_4_loss_8: 0.0012 - dense_4_loss_9: 0.0181 - dense_4_loss_10: 0.0395 - dense_4_acc_1: 0.9981 - dense_4_acc_2: 0.9981 - dense_4_acc_3: 0.9918 - dense_4_acc_4: 0.9965 - dense_4_acc_5: 1.0000 - dense_4_acc_6: 0.9960 - dense_4_acc_7: 0.9814 - dense_4_acc

<keras.callbacks.History at 0x1a585aa9ac8>

In [53]:
# save model weights
model.save_weights(r'models/weights_55.h5')

## Predictions
Once the model has been trained it is time to check its performance on new data and see how well it performs.

In [24]:
# generate new date samples
dates, _, _ = create_dataset(10)
vocab_len = len(human_vocab)

for date, machine in dates:
    # truncate the length of date if it exceeds Tx
    if len(date) > Tx:
        date = date[:Tx]
        
    # some preprocessing
    date = date.lower().replace(',', '')        
    source = np.zeros((1, Tx, vocab_len))
    
    # make OHE of date
    for t, char in enumerate(date):
        source[0, t, human_char_idx[char]] = 1
    
    prediction = model.predict([source, ini_acti_post, ini_mem_post])
    prediction = np.argmax(prediction, axis = -1)
    output = [machine_idx_char[int(i)] for i in prediction]
    
    print("Input: ", date)
    print("Output: ", ''.join(output))
    print('Target Output: ', machine)
    print()

Input:  may 13 2010
Output:  2010-05-13
Target Output:  2010-05-13

Input:  2 november 2007
Output:  2007-11-02
Target Output:  2007-11-02

Input:  12 september 2016
Output:  2016-09-12
Target Output:  2016-09-12

Input:  9 12 86
Output:  1986-12-09
Target Output:  1986-12-09

Input:  f1984llf1984llf1984ll
Output:  1883333333
Target Output:  1984-10-10

Input:  8 apr 1999
Output:  1999-04-08
Target Output:  1999-04-08

Input:  5 july 1992
Output:  1992-07-05
Target Output:  1992-07-05

Input:  04.07.05
Output:  2005-07-04
Target Output:  2005-07-04

Input:  14 nov 2010
Output:  2010-11-14
Target Output:  2010-11-14

Input:  10 september 1996
Output:  1996-09-10
Target Output:  1996-09-10



### <u>Credits:
This project is based on the assignment from Sequence Models Specialization by Deeplearning.ai on Coursera. <br>https://www.coursera.org/learn/nlp-sequence-models/home/welcome