## What is sequence-to-sequence learning? 
Sequence-to-sequence (Seq2Seq) learning is about training models to convert sequences from one domain (e.g. sentences in English) to sequences in another domain (e.g. the same sentences in French). When both input sequences and output sequences have the same length, you can implement such models simply with a Keras LSTM or GRU layer (or a stack thereof).

Process
----------------------------------------
1) Turn the sentences into three NumPy arrays: encoder_input_data, decoder_input_data, and decoder_target_data:

* encoder_input_data is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) containing a one-hot vectorization of the English sentences.
* decoder_input_data is a 3D array of shape (num_pairs, max_french_sentence_length, num_french_characters) containing a one-hot vectorization of the French sentences.
* decoder_target_data is the same as decoder_input_data but offset by one timestep. decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :].
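
The one-timestep offset between the decoder input and the decoder target can be checked on a toy example. The snippet below is only a minimal sketch: the sentence "Hi" and the names decoder_input, decoder_target and offset_holds are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np

# Toy target sentence (made up); "\t" marks the start and "\n" the end.
target_text = "\tHi\n"
chars = sorted(set(target_text))
index = {c: i for i, c in enumerate(chars)}

steps, vocab = len(target_text), len(chars)
decoder_input = np.zeros((1, steps, vocab), dtype="float32")
decoder_target = np.zeros((1, steps, vocab), dtype="float32")

for t, c in enumerate(target_text):
    decoder_input[0, t, index[c]] = 1.0
    if t > 0:
        # The target at step t - 1 is the input character at step t,
        # so the target never contains the start character.
        decoder_target[0, t - 1, index[c]] = 1.0

# decoder_target[:, t, :] equals decoder_input[:, t + 1, :] for every valid t.
offset_holds = np.array_equal(decoder_target[:, :-1, :], decoder_input[:, 1:, :])
print(offset_holds)
```

The last timestep of the target stays all zeros, because nothing follows the end character.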

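2) Train a basic LSTM-based Seq2Seq model to predict decoder_target_data given encoder_input_data and decoder_input_data. The model is trained with teacher forcing: at each timestep the decoder is fed the true previous target character rather than its own prediction.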

3) Decode some sentences to check that the model is working (i.e. turn samples from encoder_input_data into corresponding samples from decoder_target_data).
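
Checking decoded output involves mapping one-hot (or probability) vectors back to characters. The helper below is a hypothetical sketch of that final step only, not the inference loop itself; one_hot_to_text and the toy vocabulary are assumptions made for illustration.

```python
import numpy as np

def one_hot_to_text(one_hot, characters):
    """Turn a (timesteps, vocab) array into a string via argmax per timestep."""
    indices = np.argmax(one_hot, axis=-1)
    return "".join(characters[i] for i in indices)

# Toy vocabulary and sample (made up); unused timesteps are padded with " ".
characters = sorted(set("Go. "))
sample = np.zeros((5, len(characters)), dtype="float32")
for t, c in enumerate("Go."):
    sample[t, characters.index(c)] = 1.0
sample[3:, characters.index(" ")] = 1.0

# Strip the trailing space padding after decoding.
decoded = one_hot_to_text(sample, characters).rstrip(" ")
print(decoded)
```

The same helper would also work on softmax outputs, since argmax picks the most probable character at each timestep.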

In [2]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [6]:
!!curl -O http://www.manythings.org/anki/fra-eng.zip
!!unzip fra-eng.zip

['Archive:  fra-eng.zip',
 'replace _about.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename:  NULL',
 '(EOF or read error, treating as "[N]one" ...)']

In [7]:
# configuration
batch_size = 64
epochs = 10
latent_dim = 256
num_samples = 10000  # number of training samples
data_path = 'fra.txt'

In [9]:
input_texts = []
target_texts = []
input_chars = set()
target_chars = set()

with open(data_path,'r',encoding = 'utf-8') as f:
    lines = f.read().split('\n')

In [14]:
for line in lines[:min(num_samples,len(lines)-1)]:
    input_text, target_text, _ = line.split('\t')
    
    # Use "\t" as the start-of-sequence character and "\n" as the end-of-sequence character.
    target_text = '\t'+target_text+'\n'
    
    input_texts.append(input_text)
    target_texts.append(target_text)
    
    for char in input_text:
        if char not in input_chars:
            input_chars.add(char)
    for char in target_text:
        if char not in target_chars:
            target_chars.add(char)
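
The parsing loop above assumes each line holds tab-separated fields with the English text first and the French text second. The sketch below shows the same split on a single line; parse_pair is a hypothetical helper and the sample line (including its attribution text) is made up for illustration. Taking only the first two fields is a design choice that tolerates files with or without a third column.

```python
def parse_pair(line):
    """Split one tab-separated line into (input_text, target_text).

    The target is wrapped in "\t" (start) and "\n" (end) markers.
    """
    fields = line.split("\t")
    input_text, target_text = fields[0], fields[1]
    return input_text, "\t" + target_text + "\n"

# Made-up sample line in the assumed "English<TAB>French<TAB>attribution" layout.
sample_line = "Go.\tVa !\tsample attribution text"
input_text, target_text = parse_pair(sample_line)

print(repr(input_text))
print(repr(target_text))
```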

In [20]:
# Compute vocabulary sizes and maximum sequence lengths; these set the array dimensions.
input_characters = sorted(list(input_chars))
target_characters = sorted(list(target_chars))

num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters) 

max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

In [21]:
# Map each character to its integer index in the one-hot vectors.
input_token_index = dict([(char,i) for i, char in enumerate(input_characters)])

In [25]:
encoder_input_data = np.zeros(
    (len(input_texts),max_encoder_seq_length,num_encoder_tokens),dtype = "float32"
)
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length,num_decoder_tokens),dtype = "float32"
)

decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length,num_decoder_tokens),dtype = "float32"
)

In [None]:
for i,(input_text,target_text) in enumerate(zip(input_texts,target_texts)):
    for t,char in enumerate(input_text):
        encoder_input_data[i,t,input_token_index[char]] = 1.0
    # Pad the remaining timesteps with the space character.
    encoder_input_data[i,t+1:, input_token_index[" "]] = 1.0
    