# Machine translation with TensorFlow

## Introduction

This is a project dedicated to Machine Translation (MT) based on this [TensorFlow](https://www.tensorflow.org/text/tutorials/nmt_with_attention) tutorial.

The MT history summary is aided by this [article](https://www.freecodecamp.org/news/a-history-of-machine-translation-from-the-cold-war-to-deep-learning-f1d335ce8b5/) from FreeCodeCamp.

| ![History of MT image](https://miro.medium.com/v2/resize:fit:1400/1*XuR_iuPOuY-8i5A3cGmcBw.png) |
|:--:|
| <b>History of Machine Translation</b>|

My name is Ivaylo Radev and my involvement in the field of Natural Language Processing (NLP) began in the summer of 2011, right after I finished my bachelor's degree in Bulgarian Philology. Until January 2022 I worked purely as a linguist, creating various language resources, and since 2014 mainly the WordNet for Bulgarian. In February 2022 I started learning Python in order to become a "true" computational linguist. Computational linguistics is a coin whose two sides are linguistics and computer science. For 10 years I looked only at one side, and I decided to peek at the other. I do not know much about Machine Learning and word embeddings, and my math knowledge is limited, but I want to give it a try and see what I can learn.


#### Statistical Machine Translation

At that time (2011) Statistical Machine Translation (SMT) was the state of the art, and my first task was to align Bulgarian-English sentence pairs from the [SETIMES corpus](https://opus.nlpl.eu/SETIMES.php), which was extracted from news articles written in both languages.

| ![Word Alignment image](https://cdn-media-1.freecodecamp.org/images/jG95Sgc2W4VJbwi4LFlJeMHnjLZbdGydCCzI) |
|:--:|
| <b>Word Alignment</b>|

Statistical Machine Translation is based on the idea that a machine can infer patterns if a pair of equivalent sentences in two languages is split into words. Each word is then linked to the words of the sentence in the other language, and every link is given a weight. Repeating this over millions or even billions of sentence pairs yields counts of how many times a word from language A is translated as word1, word2 or word3 in language B. These counts are then normalized into weights with values between 0 and 1. For example, the English word *bank* can have the weights {банка : 0.91, бряг : 0.65, боб : 0.01} for Bulgarian.
The main problems of SMT are that it needs a large number of sentence pairs and that it does not understand case, gender, homonymy or word order. Most importantly, it uses only the most common translation, which leads to errors in examples like *He was fishing from the bank* - *Той ловеше риба от __банката__*, where the river bank (*бряг*) is translated as the financial institution (*банка*).
To combat these shortcomings, Phrase-based SMT (splitting the sentences into phrases, i.e. n-grams or n words in a row) and Syntax-based SMT (converting each sentence into a syntax tree and then translating one tree into another) were developed.
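
The counting and normalizing step of word-based SMT can be sketched in a few lines of Python. This is only a toy illustration: the `aligned_pairs` data and the `translation_weights` helper are made up for the example, and the counts are normalized into relative frequencies (so the weights of one word sum to 1), which is one possible normalization and an assumption on my part.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (English word, Bulgarian word) links
# collected from already aligned sentence pairs.
aligned_pairs = [
    ("bank", "банка"), ("bank", "банка"), ("bank", "банка"),
    ("bank", "бряг"),
    ("river", "река"), ("river", "река"),
]

def translation_weights(pairs):
    """Turn raw link counts into relative frequencies per source word."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1

    weights = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        weights[source] = {t: c / total for t, c in target_counts.items()}
    return weights

weights = translation_weights(aligned_pairs)
print(weights["bank"])  # {'банка': 0.75, 'бряг': 0.25}

# Always picking the most frequent translation ignores the context,
# which is exactly the "fishing from the bank" problem.
best = max(weights["bank"], key=weights["bank"].get)
print(best)  # банка
```

With such a table the system will choose *банка* for every occurrence of *bank*, no matter what the rest of the sentence says.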

| ![sentence tree image](https://cdn-media-1.freecodecamp.org/images/JKfKjepj-r-NgsmX7A1qipPF7Jb1LEJYghAQ) |
|:--:|
| <b>Translating trees</b>|


#### Neural Machine Translation

The Syntax-based SMT was considered the best approach to MT before the emergence of machine translation with Neural Networks (NNs) or Neural Machine Translation (NMT). The first time I was introduced to NMT was sometime in 2017.

| ![NMT image](https://cdn-media-1.freecodecamp.org/images/2TRCJS9nG0g1YVZPzbeg3DKvZLgsMEEiBXRs) |
|:--:|
| <b>Neural Network Translation</b>|

Neural Machine Translation uses a method similar to Syntax-based SMT, but it converts sentences or texts into numerical representations instead of syntax trees. NMT works by encoding (with an Encoder NN) a text in language A into features (vectors) and decoding (with a Decoder NN) these features into language B.

Machines are not capable of processing strings or plain text in their raw form and require numbers as input to perform any sort of task, such as classification, regression or clustering. Vectorization or word embedding is the process of converting text data to numerical vectors. The simplest way to represent the word *apple* from the sentence *The green apple is tasty.* is to set the vector dimension equal to the length of the sentence and mark the position of the target word (a one-hot vector): apple = [0, 0, 1, 0, 0, 0]. Keep in mind that sentences are split into string tokens that separate words and punctuation, so the final period counts as a token too. This example is very basic, as large NNs use vectors with 300, 1000 or more dimensions (or matrices of such vectors) and encode the whole vocabulary of a given language, or a large percentage of it.
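
The *apple* example can be reproduced with a few lines of plain Python. This is a minimal sketch: `tokenize` and `one_hot` are hypothetical helper names, and the regular expression is a deliberately simple tokenization rule, not what a real NLP library does.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def one_hot(token, tokens):
    """Return a vector with 1 at the first position of `token`, 0 elsewhere."""
    vector = [0] * len(tokens)
    vector[tokens.index(token)] = 1
    return vector

tokens = tokenize("The green apple is tasty.")
print(tokens)                    # ['The', 'green', 'apple', 'is', 'tasty', '.']
print(one_hot("apple", tokens))  # [0, 0, 1, 0, 0, 0]
```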

The Bag-of-Words approach sets the vector size for each document equal to the number of unique words in the corpus of all documents. Each entry of the vector is then filled with the frequency of the corresponding word in that document, which results in a sparse matrix.
 
| ![Matrix image](https://editor.analyticsvidhya.com/uploads/12860Screenshot%202021-06-15%20at%205.16.36%20PM.png) |
|:--:|
| <b>Matrix with Text Representations</b>|




In [6]:
# Let's see it in code
from sklearn.feature_extraction.text import CountVectorizer
# corpus = document1, document2, ..., documentN
corpus = ["this pasta is very tasty and affordable", "this pasta is not tasty and is affordable", "this pasta is very very delicious"]
count_vectors = CountVectorizer()
# Learn the vocabulary and count the words of every document
result = count_vectors.fit_transform(corpus)
# Convert the sparse result to a dense array: one row per document,
# one column per unique word (columns are in alphabetical order)
result_matrix = result.toarray()
print(result_matrix)


[[1 1 0 1 0 1 1 1 1]
 [1 1 0 2 1 1 1 1 0]
 [0 0 1 1 0 1 0 1 2]]


Once we have representations of the vocabularies (our embeddings are populated with vectors) for more than one language, we can translate by taking the vector of a given word in language A and finding the closest vector in language B. A commonly used metric for comparing two vectors is Cosine Similarity.
Cosine Similarity is the cosine of the angle θ between two vectors. If θ = 0°, cos(θ) = 1: the vectors point in the same direction and are similar. If θ = 90°, cos(θ) = 0: the vectors are orthogonal and not similar. If θ = 180°, cos(θ) = -1: the vectors point in opposite directions and are entirely dissimilar. Hence, when 0 <= cos(θ) <= 1 the vectors are similar to some degree, and the closer the value is to 1 (that is, the closer θ is to 0), the more similar they are. Since distance and similarity are not the same, a distance metric can be defined from the cosine similarity as $d(u, v) = 1 - \cos(u, v)$. This results in a function that accepts a word vector $v$ and returns the closest vector $c$ ( $f(v) = c$ ), where $v$ represents the source word and $c$ the translated word.
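
Here is a small sketch of the function $f(v) = c$ in plain Python. The two-dimensional vectors are invented for the example, and the sketch assumes that the vectors of both languages already live in one shared space, which in practice requires an extra alignment step that is not shown here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between the vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """d(u, v) = 1 - cos(u, v)"""
    return 1.0 - cosine_similarity(u, v)

def translate(vector, target_vectors):
    """f(v) = c: return the target word whose vector is closest to `vector`."""
    return min(target_vectors,
               key=lambda word: cosine_distance(vector, target_vectors[word]))

# Hypothetical embeddings, assumed to be in the same vector space.
bulgarian_vectors = {"банка": [0.9, 0.1], "бряг": [0.1, 0.9]}
english_bank = [1.0, 0.2]  # made-up vector for the financial sense of "bank"

print(translate(english_bank, bulgarian_vectors))  # банка
```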

#### Seq2seq

Seq2seq is a family of machine learning approaches used for natural language processing. It is typically built with Recurrent Neural Networks (RNNs), often of the Long Short-Term Memory (LSTM) type. RNNs are a class of artificial neural networks in which connections between nodes can form a cycle, allowing the output of some nodes to affect subsequent input to the same nodes. They suit NLP tasks because they can carry context forward, in this case the result for the previous word in the text. LSTMs are a type of RNN that can remember or forget previous results over long intervals by means of gates, such as the forget gate.
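
To make the "cycle" more concrete, here is a toy recurrent step in plain Python. It is a sketch with a single scalar hidden state and hand-picked weights (`w_x`, `w_h` and `b` are arbitrary values chosen for the illustration); a real RNN learns weight matrices and works on vectors.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit RNN over a sequence and return every hidden state.

    Each state depends on the current input and on the previous state:
    h_t = tanh(w_x * x_t + w_h * h_(t-1) + b)
    """
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# The same values in a different order give different final states,
# because the network carries earlier inputs forward as context.
states_a = rnn_forward([1.0, 0.0, 0.0])
states_b = rnn_forward([0.0, 0.0, 1.0])
print(states_a)
print(states_b)
```

In `states_a` the first input is still visible in the last state, only weaker, which is the fading memory that LSTM gates are designed to control.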

The Seq2seq can be optimized with:

* Attention: Without attention, the input to the decoder is a single vector that stores the entire context. Attention allows the decoder to look at the input sequence selectively: a Softmax over the set of attention scores gives a distribution, and the encoder states are averaged with this distribution as weights.
    
* Beam Search: Instead of picking a single output (word) at each step, multiple highly probable choices are retained, structured as a tree.

* Bucketing: Variable-length sequences are possible because of padding with 0s, which may be applied to both input and output. However, if the sequence length is 100 and the input is just 3 items long, memory and computation are wasted on the padding. Sequences are therefore grouped into buckets of varying sizes, each specifying both an input and an output length.
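
The attention idea (scores, Softmax, weighted average of the encoder states) can be sketched in plain Python as follows. The dot product is used as the scoring function only because it is the simplest choice; this is an assumption for the illustration, and the TensorFlow tutorial may compute its scores differently. All vectors are made up.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Return the attention weights and the weighted average of encoder states."""
    # One score per input position (dot product: an assumed scoring function).
    scores = [sum(d * e for d, e in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Hypothetical encoder states for a three-word input sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.0, 3.0]

weights, context = attention_context(decoder_state, encoder_states)
print(weights)  # the second input word receives the largest weight
print(context)
```

The decoder no longer depends on one fixed vector: at every step it gets a fresh `context` that is dominated by the input words most relevant to the word being produced.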



## Machine Translation with TensorFlow

#### TensorFlow
TensorFlow is an open-source software library for machine learning and artificial intelligence. The tutorial linked in the introduction builds a sequence-to-sequence (seq2seq) model with an attention mechanism for translating between a pair of languages. Let's try to implement it here with English and Bulgarian.

In [None]:
# First install the libraries
#!pip install einops
#!pip install tensorflow
#!pip install tensorflow-text

In [6]:
import numpy as np

import typing
from typing import Any, Tuple

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import einops

import tensorflow as tf
import tensorflow_text as tf_text

import pathlib


In [37]:
# Load the data
#link - http://www.manythings.org/anki/bul-eng.zip


path_to_zip = tf.keras.utils.get_file(
    'bul-eng.zip', origin='http://www.manythings.org/anki/bul-eng.zip',
    extract=True)

path_to_file = pathlib.Path(path_to_zip).parent/'bul-eng/bul.txt'

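def load_data(path):
  """Read tab-separated sentence pairs and return (target, context) arrays."""
  text = path.read_text(encoding='utf-8')

  lines = text.splitlines()
  # Each line holds three fields: English sentence, Bulgarian sentence, metadata
  pairs = [line.split('\t') for line in lines]

  # context = Bulgarian sentences, target = English sentences
  context = np.array([context for target, context, _meta in pairs])
  target = np.array([target for target, context, _meta in pairs])

  return target, context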

In [38]:
target_raw, context_raw = load_data(path_to_file)
print(context_raw[-1])
print(target_raw[-1])


Понеже обикновено могат да се намерят много уебсайтове на дадена тема, обикновено кликвам бутона "Назад", когато попадна на някоя уебстраница с изскачащи реклами. Просто отивам на следващата страница, която ми предлага Google, и се надявам тя да дразни по-малко.
Since there are usually multiple websites on any given topic, I usually just click the back button when I arrive on any webpage that has pop-up advertising. I just go to the next page found by Google and hope for something less irritating.


##### Other References:

[Text vectorization](https://www.analyticsvidhya.com/blog/2021/06/part-5-step-by-step-guide-to-master-nlp-text-vectorization-approaches/)

[Machine Translation](https://medium.com/mlearning-ai/na%C3%AFve-machine-translation-in-nlp-13cf02b9400)