<a href="https://colab.research.google.com/github/rahiakela/data-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/11-deep-learning-for-text/04_machine_translation_sequence_to_sequence_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Machine translation: sequence-to-sequence learning

In this notebook, you’ll deepen your expertise by learning about
sequence-to-sequence models.

A sequence-to-sequence model takes a sequence as input (often a sentence or
paragraph) and translates it into a different sequence. This is the task at the heart of many of the most successful applications of NLP:
- **Machine translation**—Convert a paragraph in a source language to its equivalent in a target language.
- **Text summarization**—Convert a long document to a shorter version that retains the most important information.
- **Question answering**—Convert an input question into its answer.
- **Chatbots**—Convert a dialogue prompt into a reply to this prompt, or convert the history of a conversation into the next reply in the conversation.
- **Text generation**—Convert a text prompt into a paragraph that completes the prompt.

The general template behind sequence-to-sequence models is shown in the figure below.

<img src='https://github.com/rahiakela/data-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/11-deep-learning-for-text/images/3.png?raw=1' width='800'/>

During training:
- An `encoder` model turns the source sequence into an intermediate representation.
- A `decoder` is trained to predict the next token `i` in the target sequence by looking at both the previous tokens (`0` to `i - 1`) and the encoded source sequence.
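
To make the training setup concrete, here is a minimal sketch in plain Python of how one target sentence yields several (context, next token) training pairs. The Spanish sentence and the variable names are illustrative assumptions, not taken from the dataset or from the model code in this notebook.

```python
# Hypothetical tokenized target sentence, wrapped in start/end markers.
target_tokens = ["[start]", "el", "gato", "duerme", "[end]"]

# For each position i, the decoder sees tokens 0..i-1 and must predict token i.
# (The encoded source sequence is a second input, omitted here for brevity.)
training_pairs = [
    (target_tokens[:i], target_tokens[i])
    for i in range(1, len(target_tokens))
]

for context, next_token in training_pairs:
    print(f"given {context} predict {next_token!r}")
```

In practice all of these predictions are usually computed in a single pass by feeding the decoder the target sequence offset by one step, rather than by materializing every prefix as done here.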

**During inference, we don’t have access to the target sequence**—we’re trying to predict it from scratch. We’ll have to generate it one token at a time:

- We obtain the encoded source sequence from the encoder.
- The decoder starts by looking at the encoded source sequence as well as an initial “seed” token (such as the string `[start]`), and uses them to predict the first real token in the sequence.
- The predicted sequence so far is fed back into the decoder, which generates the next token, and so on, until it generates a stop token (such as the string `[end]`).
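
The inference steps above can be sketched as a simple loop. This is only an illustration under stated assumptions: `greedy_decode`, `toy_encode`, and `toy_predict_next_token` are hypothetical names, the toy functions stand in for a trained encoder and decoder, and the `max_length` cap is an added safeguard in case the stop token is never produced.

```python
def greedy_decode(encode, predict_next_token, source, max_length=20):
    """Generate a target sequence one token at a time."""
    encoded_source = encode(source)   # encode the source sequence once
    decoded = ["[start]"]             # begin with the seed token
    for _ in range(max_length):       # assumed cap on the output length
        next_token = predict_next_token(encoded_source, decoded)
        decoded.append(next_token)    # feed the prediction back in
        if next_token == "[end]":     # stop token reached
            break
    return decoded


# Toy stand-ins (assumptions) so the loop runs without a trained model:
# the "decoder" just upper-cases the source words one by one.
def toy_encode(source):
    return source.split()


def toy_predict_next_token(encoded_source, decoded_so_far):
    position = len(decoded_so_far) - 1  # real tokens generated so far
    if position < len(encoded_source):
        return encoded_source[position].upper()
    return "[end]"


result = greedy_decode(toy_encode, toy_predict_next_token, "hello world")
print(result)  # ['[start]', 'HELLO', 'WORLD', '[end]']
```

A real decoder would return the most probable token from its output distribution at each step; the loop structure stays the same.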

Everything you’ve learned so far can be repurposed to build this new kind of model.

Let’s dive in.


##Setup

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import random
import string
import re

import numpy as np

We’ll be working with an English-to-Spanish translation dataset available at
www.manythings.org/anki/. 

Let’s download it:

In [None]:
!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
!unzip -q spa-eng.zip

##Data preparation

The text file contains one example per line: an English sentence, followed by a tab character, followed by the corresponding Spanish sentence. 

Let’s parse this file.