# Machine Translation Project
A deep neural network that functions as part of an end-to-end machine translation pipeline


In [8]:
%load_ext autoreload
%aimport helper, tests
%autoreload 1

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [9]:
import collections 

import helper
import numpy as np

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import LSTM, GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout
from keras.layers import Embedding
from keras.optimizers import Adam 
from keras.losses import sparse_categorical_crossentropy

from sklearn.model_selection import train_test_split

## Dataset 
This dataset is used to train and evaluate the pipeline. Its vocabulary is small, so the model can be trained in a reasonable amount of time.
### Load Data
The data is located in `data/small_vocab_en` and `data/small_vocab_fr`. The `small_vocab_en` file contains English sentences, and the `small_vocab_fr` file contains their French translations. Load the English and French data from these files by running the cell below.

In [10]:
# Loading english data
english_sentences = helper.load_data('data/small_vocab_en')
# Loading french data
french_sentences = helper.load_data('data/small_vocab_fr')

print('Dataset Loaded')
print(type(english_sentences))


Dataset Loaded
<class 'list'>
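
The source of `helper.load_data` is not shown in this notebook, so the following is only a minimal sketch of what such a loader might look like, assuming the file is UTF-8 text with one sentence per line. The name `load_lines` and the demo file are hypothetical and exist only for illustration.

```python
import os
import tempfile


def load_lines(path):
    """Hypothetical loader: read a UTF-8 text file and return its lines as a list."""
    # An explicit encoding avoids garbled accented characters on platforms
    # whose default encoding is not UTF-8.
    with open(path, "r", encoding="utf-8") as f:
        return f.read().strip().split("\n")


# Demo on a throwaway file so the sketch runs without the project data.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_fr")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("les états-unis est généralement froid en juillet .\n"
                "new jersey est parfois calme pendant l' automne .\n")
    demo_sentences = load_lines(demo_path)

print(type(demo_sentences), len(demo_sentences))
```

Returning a plain list of strings keeps the English and French sentences aligned by index, which is what the rest of the notebook relies on.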


## Files 
Each line in `small_vocab_en` contains an English sentence, and the same line in `small_vocab_fr` contains its French translation. View the first two lines from each file.

In [11]:
for sample_i in range(2):
    print('small_vocab_en Line{}: {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line{}: {}'.format(sample_i + 1, french_sentences[sample_i]))

small_vocab_en Line1: new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr Line1: new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
small_vocab_en Line2: the united states is usually chilly during july , and it is usually freezing in november .
small_vocab_fr Line2: les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .


Looking at the sentences, one can see they have already been preprocessed: the punctuation has been delimited with spaces, and all the text has been converted to lowercase. This saves some time, but the text still requires more preprocessing.
### Vocabulary 
The complexity of the problem is determined by the complexity of the vocabulary: a larger, more varied vocabulary makes for a harder translation problem. Let's look at the vocabulary of the dataset we'll be working with.

In [12]:
# Flatten each corpus into one list of words, then count occurrences per word
english_words = [word for sentence in english_sentences for word in sentence.split()]
french_words = [word for sentence in french_sentences for word in sentence.split()]
english_words_counter = collections.Counter(english_words)
french_words_counter = collections.Counter(french_words)

print('{} English words.'.format(len(english_words)))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len(french_words)))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"


For comparison, _Alice's Adventures in Wonderland_ contains 2,766 unique words out of a total of 15,500 words.
## Preprocess
For this project, we won't use raw text as input to the model. Instead, we'll convert the text into sequences of integers using the following preprocessing steps:
1. Tokenize the words into ids.
2. Add padding to make all the sequences the same length.
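
The two steps above can be illustrated with a small, library-free sketch. The notebook imports Keras' `Tokenizer` and `pad_sequences`, which serve this role; the helper names below (`build_vocab`, `tokenize`, `pad`) are hypothetical, and reserving id 0 for padding and padding at the end of each sequence are assumptions made for this illustration.

```python
from collections import Counter


def build_vocab(sentences):
    """Map each word to an integer id, most frequent first; id 0 is left for padding."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def tokenize(sentences, vocab):
    """Replace every word in every sentence with its integer id."""
    return [[vocab[word] for word in s.split()] for s in sentences]


def pad(sequences, length=None, pad_id=0):
    """Right-pad (or truncate) each sequence so all have the same length."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    return [seq[:length] + [pad_id] * (length - len(seq)) for seq in sequences]


# Toy sentences, not taken from the dataset.
sentences = [
    "the quick brown fox",
    "the lazy dog jumps over the fox",
]
vocab = build_vocab(sentences)
token_ids = tokenize(sentences, vocab)
padded = pad(token_ids)

for sentence, ids in zip(sentences, padded):
    print(sentence, "->", ids)
```

After padding, every sequence has the length of the longest sentence, so the batch can be stored as a rectangular array.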

Time to start preprocessing the data...