<a href="https://colab.research.google.com/github/rahiakela/nlp-use-case-study-applications/blob/master/neural_machine_translation_with_attention_for_english_spanish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural machine translation with attention for English-Spanish

This notebook trains a sequence-to-sequence (seq2seq) model for Spanish-to-English translation. This is an advanced example that assumes some knowledge of sequence-to-sequence models.

After training the model in this notebook, you will be able to input a Spanish sentence, such as *"¿todavia estan en casa?"*, and return the English translation: *"are you still at home?"*

The translation quality is reasonable for a toy example, but the generated attention plot is perhaps more interesting. It shows which parts of the input sentence have the model's attention while translating:

<img src="https://tensorflow.org/images/spanish-english.png" alt="spanish-english attention plot">

Note: This example takes approximately 10 minutes to run on a single P100 GPU.

## Setup

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
import time

TensorFlow 2.x selected.


## Download and prepare the dataset

We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:

```
May I borrow this book?	¿Puedo tomar prestado este libro?
```
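
Each line of the file holds one English sentence and its Spanish translation, separated by a tab character. A minimal sketch of splitting such a line, using the example above, could look like this. The helper name `split_pair` is hypothetical and not part of the notebook, and keeping only the first two fields is an assumption that guards against lines with extra columns:

```python
def split_pair(line):
  """Split one tab-separated line into an (english, spanish) tuple."""
  # Drop the trailing newline, then keep only the first two tab-separated fields
  english, spanish = line.rstrip('\n').split('\t')[:2]
  return english, spanish

line = 'May I borrow this book?\t¿Puedo tomar prestado este libro?\n'
english, spanish = split_pair(line)
print(english)
print(spanish)
```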

There are a variety of languages available, but we'll use the English-Spanish dataset. For convenience, we've hosted a copy of this dataset on Google Cloud, but you can also download your own copy. After downloading the dataset, here are the steps we'll take to prepare the data:

1. Add a *start* and *end* token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a word index and reverse word index (dictionaries mapping from word → id and id → word).
4. Pad each sentence to a maximum length.

### 0. Download the Spanish-English dataset

In [2]:
# Download the file spa-eng.zip for Spanish - English
path_to_zip = tf.keras.utils.get_file('spa-eng.zip',
                                      origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
                                      extract=True)
path_to_file = os.path.dirname(path_to_zip) + '/spa-eng/spa.txt'

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip


### 1. Add a start and end token to each sentence.

In [0]:
# Strips accents: decomposes each character (NFD) and drops the combining marks (category 'Mn')
def unicode_to_ascii(sentence):
  return ''.join(c for c in unicodedata.normalize('NFD', sentence) if unicodedata.category(c) != 'Mn')

def preprocess_sentence(sent):
  sent = unicode_to_ascii(sent.lower().strip())

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  # Reference: https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
  sent = re.sub(r'([?.!,¿])', r' \1 ', sent)
  # collapsing runs of spaces (and double quotes) into a single space
  sent = re.sub(r'[" "]+', " ", sent)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",", "¿")
  sent = re.sub(r'[^a-zA-Z?.!,¿]+', ' ', sent)

  sent = sent.strip()

  # adding a start and an end token to the sentence so that the model knows when to start and stop predicting.
  sent = '<start> ' + sent + ' <end>'

  return sent

In [8]:
en_sentence = u'May I borrow this book?'
sp_sentence = u'¿Puedo tomar prestado este libro?'

print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))

<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'
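
The accent removal in `unicode_to_ascii` relies on Unicode normalization. The sketch below isolates that step under a hypothetical name, `strip_accents`, to show what NFD decomposition does to an accented letter. Note that the result is not strictly ASCII: the inverted question mark has no decomposition, which is why it survives as `\xc2\xbf` in the output above.

```python
import unicodedata

def strip_accents(text):
  # NFD splits each accented letter into a base letter plus combining marks
  decomposed = unicodedata.normalize('NFD', text)
  # Category 'Mn' (nonspacing mark) covers the combining accents; drop them
  return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

word = 'a\u00f1o'  # "año", written with an escape so the string is precomposed
# The decomposed form is one code point longer: 'n' plus a combining tilde
print(len(word), len(unicodedata.normalize('NFD', word)))
print(strip_accents(word))
print(strip_accents('¿todavía están en casa?'))
```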


### 2. Clean the sentences by removing special characters.

We will do the following things in this section:

1. Remove the accents
2. Clean the sentences
3. Return word pairs in the format: [ENGLISH, SPANISH]

In [0]:
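def create_dataset(path, num_examples):
  # read the whole file and split it into lines; the with-block closes the file
  with io.open(path, encoding='UTF-8') as f:
    lines = f.read().strip().split('\n')

  # each line holds one tab-separated pair; preprocess both sentences
  word_pairs = [[preprocess_sentence(sentence) for sentence in line.split('\t')] for line in lines[:num_examples]]

  # transpose the list of pairs into (english sentences, spanish sentences)
  return zip(*word_pairs)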

In [11]:
en_sent, sp_sent = create_dataset(path_to_file, None)

# check last sentence
print(en_sent[-1])
print(sp_sent[-1])

<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>
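
`create_dataset` returns `zip(*word_pairs)`, which transposes a list of `[ENGLISH, SPANISH]` pairs into one tuple of English sentences and one tuple of Spanish sentences. A small sketch of that idiom on made-up toy pairs (the sentences here are illustrative, not taken from the dataset):

```python
# Two already-preprocessed [ENGLISH, SPANISH] pairs (toy data)
pairs = [
  ['<start> go . <end>', '<start> ve . <end>'],
  ['<start> hi . <end>', '<start> hola . <end>'],
]

# Unpacking the zip object yields one tuple per "column"
english, spanish = zip(*pairs)
print(english)
print(spanish)
```

Because `zip` yields one tuple per column, the unpacking into two names only works when every line splits into exactly two fields.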


In [0]:
def max_length(tensor):
  return max(len(t) for t in tensor)

In [0]:
def tokenize(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')

  lang_tokenizer.fit_on_texts(lang)
  tensor = lang_tokenizer.texts_to_sequences(lang)
  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

  return tensor, lang_tokenizer
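
To make steps 3 and 4 concrete, here is a rough pure-Python approximation of what `tokenize` produces: a word index ordered by descending frequency with ids starting at 1, and sequences padded at the end with 0. The helper names `build_word_index` and `encode_and_pad` are hypothetical, and the sketch ignores details of the Keras `Tokenizer` such as lowercasing, so treat it as an illustration rather than a replacement.

```python
from collections import Counter

def build_word_index(sentences):
  # Count whitespace-separated tokens across all sentences
  counts = Counter(word for s in sentences for word in s.split())
  # Most frequent word gets id 1; id 0 is left free for padding
  return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(sentences, word_index):
  # Map each token to its id
  seqs = [[word_index[w] for w in s.split()] for s in sentences]
  # Pad every sequence at the end ("post") up to the longest one
  longest = max(len(seq) for seq in seqs)
  return [seq + [0] * (longest - len(seq)) for seq in seqs]

sentences = ['<start> go . <end>', '<start> i want to go . <end>']
word_index = build_word_index(sentences)
index_word = {i: w for w, i in word_index.items()}  # reverse word index
padded = encode_and_pad(sentences, word_index)
for row in padded:
  print(row)
```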

In [0]:
def load_dataset(path, num_examples=None):
  # creating cleaned input, output pairs
  target_lang, input_lang = create_dataset(path, num_examples)

  input_tensor, input_lang_tokenizer = tokenize(input_lang)
  target_tensor, target_lang_tokenizer = tokenize(target_lang)

  return input_tensor, target_tensor, input_lang_tokenizer, target_lang_tokenizer

### Limit the size of the dataset to experiment faster (optional)

Training on the complete dataset of >100,000 sentences will take a long time. To train faster, we can limit the size of the dataset to 30,000 sentences (of course, translation quality degrades with less data):

In [0]:
# Try experimenting with the size of the dataset
num_examples = 30000
input_tensor, target_tensor, input_lang, target_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target and input tensors
max_length_target, max_length_input = max_length(target_tensor), max_length(input_tensor)

In [19]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor,
                                                                                                target_tensor,
                                                                                                test_size=0.2)
# Show length
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

24000 24000 6000 6000
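
The key property of the split is that inputs and targets are shuffled together, so each input stays aligned with its translation. A simplified stand-in for `train_test_split`, written with the standard library only, might look like the following. The function `split_pairs` and its `seed` parameter are hypothetical, and the rounding rule is a simplification of what scikit-learn does:

```python
import random

def split_pairs(inputs, targets, test_size=0.2, seed=0):
  # Shuffle a shared list of indices so inputs and targets stay aligned
  indices = list(range(len(inputs)))
  random.Random(seed).shuffle(indices)
  n_val = round(len(indices) * test_size)
  val_idx, train_idx = indices[:n_val], indices[n_val:]
  pick = lambda seq, idx: [seq[i] for i in idx]
  return (pick(inputs, train_idx), pick(inputs, val_idx),
          pick(targets, train_idx), pick(targets, val_idx))

# Toy data: each target is ten times its input, which makes alignment easy to check
inputs = list(range(10))
targets = [x * 10 for x in inputs]
in_train, in_val, tg_train, tg_val = split_pairs(inputs, targets)
print(len(in_train), len(tg_train), len(in_val), len(tg_val))
```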


In [0]:
def convert(lang, tensor):
  # print each token id with its word, skipping the padding id 0
  for t in tensor:
    if t != 0:
      print('%d ----> %s' % (t, lang.index_word[t]))

In [20]:
print('Input Language; index to word mapping')
convert(input_lang, input_tensor_train[0])
print()
print('Target Language; index to word mapping')
convert(target_lang, target_tensor_train[0])

Input Language; index to word mapping
1 ----> <start>
53 ----> quiero
1041 ----> regresar
3 ----> .
2 ----> <end>

Target Language; index to word mapping
1 ----> <start>
4 ----> i
47 ----> want
15 ----> to
36 ----> go
91 ----> back
3 ----> .
2 ----> <end>
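
The `t != 0` check in `convert` matters because 0 is the padding value and has no entry in `index_word`. A small sketch of the same idea that returns a string instead of printing, using the ids from the input example above (the helper `decode` is hypothetical, and the number of trailing zeros is arbitrary):

```python
def decode(index_word, ids):
  # Look up every non-padding id and join the words with spaces
  return ' '.join(index_word[i] for i in ids if i != 0)

# Reverse word index restricted to the ids printed above
index_word = {1: '<start>', 53: 'quiero', 1041: 'regresar', 3: '.', 2: '<end>'}
padded_ids = [1, 53, 1041, 3, 2, 0, 0, 0]
print(decode(index_word, padded_ids))
```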


### Create a tf.data dataset

In [0]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train) // BATCH_SIZE
embedding_dim = 256
units = 1024

# +1 because id 0 is reserved for padding and does not appear in word_index
vocab_input_size = len(input_lang.word_index) + 1
vocab_target_size = len(target_lang.word_index) + 1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [22]:
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([64, 16]), TensorShape([64, 11]))

In [23]:
print(example_input_batch.shape, example_target_batch.shape)

(64, 16) (64, 11)
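
Each batch holds 64 sentences, and the second dimension is the padded sentence length (16 for the input, 11 for the target). With 24,000 training examples and a batch size of 64, `steps_per_epoch` is 24000 // 64 = 375, and since 64 divides 24,000 exactly, `drop_remainder=True` discards nothing here. The following standard-library sketch mimics the shuffle-then-batch behavior under those numbers. The generator `shuffled_batches` is a hypothetical stand-in, not the `tf.data` implementation:

```python
import random

def shuffled_batches(examples, batch_size, seed=None):
  # Shuffle all examples, as a shuffle buffer covering the whole dataset would
  order = list(examples)
  random.Random(seed).shuffle(order)
  # Emit only full batches, like drop_remainder=True
  for b in range(len(order) // batch_size):
    yield order[b * batch_size:(b + 1) * batch_size]

batches = list(shuffled_batches(range(24000), 64, seed=0))
print(len(batches), len(batches[0]))

# With 100 examples, the 36 left over after one full batch are dropped
print(len(list(shuffled_batches(range(100), 64, seed=0))))
```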


## Write the encoder and decoder model