## **Encoder-Decoder Project**


## **Context**
The Encoder-Decoder architecture with recurrent neural networks has become an effective and standard approach for both neural machine translation (NMT) and **sequence-to-sequence (seq2seq)** prediction in general.

The key benefits of the approach are the ability to train a single end-to-end model directly on source and target sentences, and the ability to handle variable-length input and output sequences of text.
## **Content**
Train an Encoder–Decoder model that converts a date string from one format to another (e.g., from "April 22, 2019" to "2019-04-22").



### **Loading data and preparing dataset**
- import libraries
- create dataset

In [None]:
# import libraries
import os
import sys
import numpy as np
import sklearn
import tensorflow as tf
from tensorflow import keras

# plot figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

### **Preprocess data**
- Create a dataset of random dates between 1000-01-01 and 9999-12-31
- Print random dates in both the input format and the target format
- INPUT_CHARS: list of all possible input characters
- OUTPUT_CHARS: list of all possible output characters
- Create a function to convert a string to a list of character IDs

Let's start by creating the dataset. We will use random days between 1000-01-01 and 9999-12-31:

In [None]:
# create dataset
from datetime import date

# cannot use strftime()'s %B format since it depends on the locale
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def random_dates(n_dates):
    min_date = date(1000, 1, 1).toordinal()
    max_date = date(9999, 12, 31).toordinal()

    ordinals = np.random.randint(max_date - min_date, size=n_dates) + min_date
    dates = [date.fromordinal(ordinal) for ordinal in ordinals]

    x = [MONTHS[dt.month - 1] + " " + dt.strftime("%d, %Y") for dt in dates]
    y = [dt.isoformat() for dt in dates]
    return x, y
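
To make the two formats concrete, the sketch below applies the same formatting expressions as `random_dates` to fixed dates, including the example from the introduction. `format_pair` is a hypothetical helper name introduced only for this illustration; it is not part of the notebook.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def format_pair(dt):
    """Return (input_format, target_format) for one date, as random_dates does."""
    x = MONTHS[dt.month - 1] + " " + dt.strftime("%d, %Y")
    y = dt.isoformat()
    return x, y

x, y = format_pair(date(2019, 4, 22))
print(x)  # April 22, 2019
print(y)  # 2019-04-22

# %d zero-pads the day, so single-digit days keep two characters
print(format_pair(date(2009, 9, 7))[0])  # September 07, 2009
```

Because the day is always two digits and the year always four, the only source of variable input length is the month name.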

Here are a few random dates, displayed in both the input format and the target format:

In [None]:
# random dates
np.random.seed(42)

n_dates = 3
x_example, y_example = random_dates(n_dates)
print("{:25s}{:25s}".format("Input", "Target"))
print("-" * 50)
for idx in range(n_dates):
    print("{:25s}{:25s}".format(x_example[idx], y_example[idx]))

Input                    Target                   
--------------------------------------------------
September 20, 7075       7075-09-20               
May 15, 8579             8579-05-15               
January 11, 7103         7103-01-11               


Let's get the list of all possible characters in the inputs:

In [None]:
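# char input list
# note: the digit string repeats "0" at its end; str.index() returns the first
# match, so the digit IDs are unaffected, but one vocabulary slot goes unused
INPUT_CHARS = "".join(sorted(set("".join(MONTHS)))) + "01234567890, "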
INPUT_CHARS

'ADFJMNOSabceghilmnoprstuvy01234567890, '

And here's the list of possible characters in the outputs:

In [None]:
# char outputs
OUTPUT_CHARS = "0123456789-"

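Let's write the functions that convert strings to lists of character IDs:
- date_str_to_ids: converts one date string to character IDs, using either the input or the output character set
- prepare_date_strs: converts a list of date strings to a zero-padded tensor of IDs
- create_dataset: generates random dates and prepares both the inputs and the targets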

In [None]:
def date_str_to_ids(date_str, chars=INPUT_CHARS):
    return [chars.index(c) for c in date_str]

In [None]:
date_str_to_ids(x_example[0], INPUT_CHARS)

[7, 11, 19, 22, 11, 16, 9, 11, 20, 38, 28, 26, 37, 38, 33, 26, 33, 31]

In [None]:
date_str_to_ids(y_example[0], OUTPUT_CHARS)

[7, 0, 7, 5, 10, 0, 9, 10, 2, 0]
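
The mapping is a plain position lookup in the vocabulary string, so it can be reversed by indexing back into the same string. The following self-contained sketch repeats the lookup without TensorFlow; the `INPUT_CHARS` literal is copied from the value printed above, and the variable `decoded` is introduced only for this illustration.

```python
INPUT_CHARS = "ADFJMNOSabceghilmnoprstuvy01234567890, "
OUTPUT_CHARS = "0123456789-"

def date_str_to_ids(date_str, chars=INPUT_CHARS):
    # str.index returns the position of the first matching character
    return [chars.index(c) for c in date_str]

ids = date_str_to_ids("7075-09-20", OUTPUT_CHARS)
print(ids)  # [7, 0, 7, 5, 10, 0, 9, 10, 2, 0]

# Decoding is the reverse lookup: index into the same vocabulary string.
decoded = "".join(OUTPUT_CHARS[i] for i in ids)
print(decoded)  # 7075-09-20
```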

In [None]:
def prepare_date_strs(date_strs, chars=INPUT_CHARS):
    X_ids = [date_str_to_ids(dt, chars) for dt in date_strs]
    X = tf.ragged.constant(X_ids, ragged_rank=1)
    return (X + 1).to_tensor() # using 0 as the padding token ID

def create_dataset(n_dates):
    x, y = random_dates(n_dates)
    return prepare_date_strs(x, INPUT_CHARS), prepare_date_strs(y, OUTPUT_CHARS)
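
`prepare_date_strs` shifts every ID by 1 so that 0 is free to serve as the padding ID, and `to_tensor()` then fills the shorter rows with zeros. A rough plain-Python sketch of that idea follows; `pad_shifted_ids` is a hypothetical helper written for this illustration, not the TensorFlow implementation.

```python
def pad_shifted_ids(sequences, pad_id=0):
    """Shift every ID by +1, then right-pad with pad_id to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [[i + 1 for i in seq] + [pad_id] * (max_len - len(seq))
            for seq in sequences]

batch = pad_shifted_ids([[7, 0, 7], [4, 8]])
print(batch)  # [[8, 1, 8], [5, 9, 0]]
```

After the shift, an ID of 0 can only mean padding, which is why the vocabulary sizes passed to the model are `len(INPUT_CHARS) + 1` and `len(OUTPUT_CHARS) + 1`.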

### **Create training set**
- train dataset: 10000
- valid dataset: 2000
- test dataset: 2000


In [None]:
# train, validate, test
np.random.seed(42)

X_train, Y_train = create_dataset(10000)
X_valid, Y_valid = create_dataset(2000)
X_test, Y_test = create_dataset(2000)

In [None]:
Y_train[0]

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([ 9,  6,  8, 10, 11,  1,  6, 11,  2,  6], dtype=int32)>

### **seq2seq model**
We feed in the input sequence, which first goes through the encoder (an embedding layer followed by a single LSTM layer) and comes out as a single vector. That vector then goes through the decoder (a single LSTM layer followed by a dense output layer), which outputs a sequence of vectors, each representing the estimated probabilities for all possible output characters.

Since the decoder expects a sequence as input, we repeat the vector (which is output by the encoder) as many times as the longest possible output sequence.
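
The repetition step only changes shapes: a batch of vectors becomes a batch of sequences in which every time step holds the same vector. The sketch below mimics this with nested lists; `repeat_vector` is a hypothetical stand-in for what a repeat layer does, and the 3-feature vector is a toy size (the encoder in this notebook outputs 128 features).

```python
def repeat_vector(batch, n):
    """Turn a batch of vectors [batch, features] into [batch, n, features]."""
    return [[list(vector) for _ in range(n)] for vector in batch]

encoder_output = [[0.1, -0.4, 0.7]]   # one sample, 3 features (toy size)
decoder_input = repeat_vector(encoder_output, 10)  # 10 = len("yyyy-mm-dd")

# batch size, time steps, features
print(len(decoder_input), len(decoder_input[0]), len(decoder_input[0][0]))  # 1 10 3
```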


**LSTM** (Long Short-Term Memory): In Keras you can use the LSTM layer. Compared with a simple recurrent layer, it generally performs much better: training converges faster, and it can detect long-term dependencies in the data.




### **Build, compile, and train the model with history**
- embedding size: 32; epochs: 20
- encoder: embedding layer + single LSTM layer
- decoder: single LSTM layer + dense output layer
- Keras LSTM layers: 128 units each
- dense layer activation: softmax
- optimizer: Nadam; loss: sparse categorical crossentropy


**Embedding layer**: It is used here as the first layer of the network. Its main arguments are:
- **input_dim**: The size of the vocabulary. For example, if your data is integer-encoded to values between 0 and 10, the vocabulary size is 11. In this notebook the tokens are characters rather than words, and one extra ID (0) is reserved for padding.
- **output_dim**: The size of the vector space in which tokens are embedded, i.e. the length of the output vector for each token. For example, it could be 32 or 100 or even larger. Test different values for your problem.
- **input_length**: The length of the input sequences, when it is fixed. For example, if all of your input documents consist of 1000 words, this would be 1000. It is optional; the code below passes input_shape=[None] instead, so sequences of any length are accepted.
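
Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns. The sketch below builds such a table in plain Python with random values standing in for the weights that Keras would learn; `table` and `embed` are hypothetical names used only here, and the `INPUT_CHARS` literal is copied from the value printed earlier.

```python
import random

INPUT_CHARS = "ADFJMNOSabceghilmnoprstuvy01234567890, "
input_dim = len(INPUT_CHARS) + 1   # +1 because ID 0 is reserved for padding
output_dim = 32                    # embedding size used in this notebook

random.seed(42)
# Toy embedding table: one row of output_dim floats per possible ID.
table = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
         for _ in range(input_dim)]

def embed(ids):
    """Look up one vector per ID; the result has shape [len(ids), output_dim]."""
    return [table[i] for i in ids]

vectors = embed([8, 12, 0])        # two character IDs followed by padding
print(input_dim, len(vectors), len(vectors[0]))  # 40 3 32
```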


In [None]:
# encoder (embedding layer + single LSTM layer)
embedding_size = 32
max_output_length = Y_train.shape[1]

np.random.seed(42)
tf.random.set_seed(42)

encoder = keras.models.Sequential([
    keras.layers.Embedding(input_dim=len(INPUT_CHARS) + 1,
                           output_dim=embedding_size,
                           input_shape=[None]),
    keras.layers.LSTM(128)
])

In [None]:
# decoder (single LSTM layer + dense output layer)
decoder = keras.models.Sequential([
    keras.layers.LSTM(128, return_sequences=True),
    keras.layers.Dense(len(OUTPUT_CHARS) + 1, activation="softmax")
])

model = keras.models.Sequential([
    encoder,
    keras.layers.RepeatVector(max_output_length),
    decoder
])

#### **Splitting Dataset into Training, Validation and Test Sets**

---


**Training dataset**
- The actual dataset that we use to train the model (its **weights and biases**, in the case of a neural network).

**Validation dataset (also known as the development set, or dev set)**
- The sample of data used to evaluate the *model fit* during training; it helps during the 'development' stage of the model.

**Test dataset**
- The dataset used to evaluate the final model. It is used only once the model is completely trained (using the training and validation sets).

In [None]:
# compile and train model with history
optimizer = keras.optimizers.Nadam()
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(X_train, Y_train, epochs=20,
                    validation_data=(X_valid, Y_valid))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
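
The loss passed to `model.compile` compares integer target IDs with the softmax probabilities predicted at each time step. A minimal plain-Python sketch of that computation is given below, under the assumption that the loss is the mean of the negative log-probability assigned to each correct class; the numbers are toy values, not results from this model.

```python
import math

def sparse_categorical_crossentropy(probs, targets):
    """Mean of -log(p[target]) over time steps; targets are integer class IDs."""
    losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)

# Toy example: 2 time steps, 3 classes (the real model has 12 output classes).
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8]]
targets = [0, 2]

loss = sparse_categorical_crossentropy(probs, targets)
# Accuracy here counts the time steps whose most probable class is the target.
accuracy = sum(max(range(len(p)), key=p.__getitem__) == t
               for p, t in zip(probs, targets)) / len(targets)
print(round(loss, 4), accuracy)  # 0.2899 1.0
```

Because the targets stay as integer IDs, there is no need to one-hot encode `Y_train` before training.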


Looks great, we reach 100% validation accuracy! Let's use the model to make some predictions. We will need to be able to convert a sequence of character IDs to a readable string:
- create the function ids_to_date_strs
- X_new: prepare two new dates for the model to convert ("September 17, 2009" and "July 14, 1789")
- ids: iterate over the predicted IDs to print both dates

In [None]:
# create function ids_to_date_strs
def ids_to_date_strs(ids, chars=OUTPUT_CHARS):
    return ["".join([("?" + chars)[index] for index in sequence])
            for sequence in ids]
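
The `"?" + chars` prefix undoes the +1 shift applied during preparation: index 0 (padding) decodes to "?", and every real ID lines up with its character again. The self-contained sketch below repeats the function and decodes the `Y_train[0]` row shown earlier; the second call uses a made-up padded row to show how padding is rendered.

```python
OUTPUT_CHARS = "0123456789-"

def ids_to_date_strs(ids, chars=OUTPUT_CHARS):
    # "?" occupies index 0, so padding decodes to "?" and real IDs line up
    # with the +1 shift applied in prepare_date_strs
    return ["".join([("?" + chars)[index] for index in sequence])
            for sequence in ids]

print(ids_to_date_strs([[9, 6, 8, 10, 11, 1, 6, 11, 2, 6]]))  # ['8579-05-15']
print(ids_to_date_strs([[1, 2, 0]]))                           # ['01?']
```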

In [None]:
# convert dates
X_new = prepare_date_strs(["September 17, 2009", "July 14, 1789"])

In [None]:
# iterate over the predicted IDs to produce dates
# predict_classes() is deprecated; take the argmax of the softmax output instead
ids = np.argmax(model.predict(X_new), axis=-1)
for date_str in ids_to_date_strs(ids):
    print(date_str)

2009-09-17
1789-07-14
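
Turning the softmax output into class IDs amounts to picking, at every time step, the index with the highest probability (in NumPy terms, `np.argmax(model.predict(x), axis=-1)`). A plain-Python sketch of that step on toy probabilities follows; `argmax_last_axis` is a hypothetical helper written for this illustration.

```python
def argmax_last_axis(probabilities):
    """For [batch, time, classes] nested lists, return [batch, time] class IDs."""
    return [[max(range(len(step)), key=step.__getitem__) for step in sample]
            for sample in probabilities]

# Toy probabilities: 1 sample, 2 time steps, 4 classes.
toy = [[[0.1, 0.6, 0.2, 0.1],
        [0.05, 0.05, 0.1, 0.8]]]
print(argmax_last_axis(toy))  # [[1, 3]]
```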


### **Summary**
The model worked: the Encoder–Decoder model converted a date string from one format ("September 17, 2009") to another ("2009-09-17").