Imports

In [1]:
import xml.etree.ElementTree as ET
import spacy
import random
import string
import math

nlp_models = {
    'en': spacy.load('en_core_web_md'),
    'it': spacy.load('it_core_news_md')
}

from collections import defaultdict
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tree import Tree
from nltk.translate import bleu_score as bleu
from nltk.translate import IBMModel1, AlignedSent
from spacy import displacy
from tqdm.notebook import tqdm
from zipfile import ZipFile
from operator import itemgetter

# NLP Final Project - Rule Based Part

## Exploration of Machine Translation Techniques using Movie Subtitles dataset

Arnaud Ruymaekers, S5298338

---

Description: 

I would like to explore developing 3 different techniques to perform Machine Translation. 
I would like to implement and compare a Statistical, a Rule-Based and a Neural Machine Translation system.
I will attempt to implement these techniques from scratch (not using libraries to do the whole thing) to understand how they work on a deeper level.
I plan to implement this in Python and to use as a dataset EN <-> IT sentence correspondences from movie subtitles, taken from opensubtitles.org.

Feedback:

If you will develop 3 different techniques, the project will be for sure hard. As a B-plan, you might downgrade to developing 2 techniques only, 
to make sure to stay in about 7 to 10 days of work

---

### Introduction (TODO)



## Datasets Prep

In [3]:
line_count_total = 35_216_229
file_name = 'OpenSubtitles.en-it.'
languages = ['en', 'it']

### Text Subtitles

In [4]:
def extract_file(file_name, lang='en', line_count=None, from_line=0, tokenize=True) -> list:
    """Read `line_count` lines of the `lang` side of the corpus, starting at `from_line`."""
    
    if line_count is None:
        line_count = line_count_total
        
    assert (from_line+line_count <= line_count_total), f'line_count + from_line should be under {line_count_total} (it is currently {line_count+from_line})'
    
    file_lines = []
    
    with ZipFile('en-it.txt.zip') as zf:
        with zf.open(file_name + lang, 'r') as f:

            for i, line in tqdm(enumerate(f), total=from_line+line_count, desc=f'Reading {lang.upper()} language file'):
                if i < from_line:
                    continue
                elif i < from_line+line_count:
                    decoded_line = line.decode("utf-8").replace('\n', '')
                    file_lines.append(word_tokenize(decoded_line) if tokenize else decoded_line)
                else:
                    break

    return file_lines
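The line-window logic of `extract_file` (skip `from_line` lines, then keep `line_count` lines) can also be expressed with `itertools.islice`. The following is a minimal sketch rather than the notebook's implementation: `read_line_window` is a hypothetical helper, it leaves out tokenization and progress reporting, and the tiny in-memory archive only stands in for `en-it.txt.zip`.

```python
import io
from itertools import islice
from zipfile import ZipFile


def read_line_window(zip_source, member, from_line=0, line_count=None):
    """Return decoded lines of `member`, starting at `from_line`.

    Hypothetical helper: if `line_count` is None, read to the end of the file.
    """
    stop = None if line_count is None else from_line + line_count
    with ZipFile(zip_source) as zf:
        with zf.open(member, 'r') as f:
            # islice skips the first `from_line` lines and stops before `stop`
            return [raw.decode('utf-8').rstrip('\n') for raw in islice(f, from_line, stop)]


# Tiny in-memory archive (assumed data, three lines taken from the samples).
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as zf:
    zf.writestr('toy.en', 'care for the earth\ncare for people\nshare the surplus\n')

window = read_line_window(buffer, 'toy.en', from_line=1, line_count=2)
print(window)
```

Because `islice` stops consuming the file once the window is filled, no explicit `break` is needed.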

In [5]:
# Extracting 100k sentences for now
sentences = {}
for lang in languages:
    sentences[lang] = extract_file(file_name, lang, 100_000, tokenize=False)


In [6]:
# Printing some samples
for i in range(5):
    print(f'Sample {i}:')
    print('\t' + sentences['en'][i])
    print('\t\t=> ' + sentences['it'][i])

Sample 0:
	Permaculture is a design science based on three simple ethics:
		=> La permacultura è un metodo di progettazione basato su tre semplici principi etici:
Sample 1:
	care for the earth
		=> cura della terra
Sample 2:
	care for people
		=> cura delle persone
Sample 3:
	share the surplus
		=> Condividi il superfluo
Sample 4:
	Permaculture also has core principles They guide us in creating sustainable abundance
		=> La permacultura ha anche principi cardine le linee guida per la creazione di abbondanza sostenibile


## Evaluation Strategy

In [7]:
def evaluate_translation(ref_translations:list, translations:list, max_n=None):
    assert len(ref_translations) == len(translations), 'There should be as many reference translations as translations'
    
    if not max_n:
        max_n = 4
    
    weights = [1/(max_n) for i in range(max_n)]
    
    n = len(ref_translations)
    scores = []
    
    for i, ref_trans in tqdm(enumerate(ref_translations), total=n, desc='Evaluation'):
        trans = translations[i]
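        # sentence_bleu expects a list of references per hypothesis;
        # this assumes each ref_trans is a single tokenized reference, so it is wrapped in a list
        scores.append(bleu.sentence_bleu([ref_trans], trans, weights=weights))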
        
    return sum(scores)/n
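To make explicit what the BLEU score above measures, here is a simplified, single-reference sketch in plain Python: clipped n-gram precision for each order, combined by a geometric mean and multiplied by a brevity penalty. It is an illustration under stated assumptions (uniform weights, no smoothing, one reference), not a replacement for `nltk.translate.bleu_score`; the names `ngrams`, `modified_precision` and `simple_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # An n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


def simple_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed BLEU with uniform weights against a single reference."""
    precisions = [modified_precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any order without a match zeroes the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: hypotheses no longer than the reference are penalised
    if len(hypothesis) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(log_avg)


reference = ['cura', 'della', 'terra']
print(simple_bleu(reference, ['cura', 'della', 'terra'], max_n=2))
print(simple_bleu(reference, ['cura', 'della', 'terra'], max_n=4))
```

In this sketch the second call returns 0 even for a perfect match, because a three-token sentence contains no 4-grams. Many subtitle lines are this short, which is worth keeping in mind when choosing `max_n`.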

## Rule-Based Machine Translation Intro (RBMT)

3 Options: Direct, Transfer, Interlingual (insert image)

Steps from Source Language (SL) to Target Language (TL)

---

Sentences in SL

-> Tokenization: The segments are further divided into individual words or tokens, which can then be processed individually. (We can stop here for direct translation.)

    -> Part-of-speech tagging: Each token is labeled with its part of speech, such as noun, verb, adjective, etc.

    -> Morphological analysis: The inflectional and derivational forms of the words are identified and analyzed.

        -> Syntactic parsing: The grammatical structure of the sentences is analyzed and represented in a tree-like structure. (Syntactic Transfer)

            -> Semantic analysis: The meanings of the words and phrases in the sentence are extracted and represented in a formal representation, such as semantic frames or predicate-argument structures. (Semantic Transfer)

                -> Interlingual generation: The extracted meanings are used to generate a target language text that conveys the same information as the source text.

            <- Language-specific generation: The text generated in the previous step is transformed into a grammatically and idiomatically correct target language text.
            
Sentences in TL
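
The shallowest path through the steps above, tokenization followed by a word-for-word lookup, can be sketched as follows. The lexicon is a toy assumption written by hand for illustration, not something extracted from the dataset, and `direct_translate` is a hypothetical helper.

```python
# Toy bilingual lexicon: an assumption for illustration only.
TOY_LEXICON = {
    'care': 'cura',
    'for': 'per',
    'the': 'la',
    'earth': 'terra',
    'people': 'persone',
}


def direct_translate(sentence, lexicon):
    """Word-for-word translation: split on whitespace and look up each token.

    Tokens missing from the lexicon are copied through unchanged.
    """
    tokens = sentence.lower().split()
    return ' '.join(lexicon.get(token, token) for token in tokens)


print(direct_translate('care for the earth', TOY_LEXICON))  # cura per la terra
print(direct_translate('care for people', TOY_LEXICON))     # cura per persone
```

The dataset's reference translations for these two sentences are "cura della terra" and "cura delle persone". The word-for-word output misses the contracted prepositions, which is the kind of gap the deeper analysis steps are meant to close.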

---

Here we will only attempt to build from scratch up to a simple version of Syntactic Transfer. Interlingual translation is too complicated to be built here.

A fully built RBMT system will, however, be used for comparison.
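
As a rough idea of what one syntactic transfer rule could look like, the sketch below reorders adjective-noun pairs on a POS-tagged token list, since Italian often places the adjective after the noun. This is a single hand-written rule assumed for illustration; it works on flat tags rather than on a parse tree, and `reorder_adj_noun` is a hypothetical name.

```python
def reorder_adj_noun(tagged):
    """Swap every adjacent (ADJ, NOUN) pair into (NOUN, ADJ).

    `tagged` is a list of (word, POS) tuples, as produced by a POS tagger.
    """
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == 'ADJ' and out[i + 1][1] == 'NOUN':
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


# Hand-tagged fragment of sample 4 (tags assumed, not tagger output).
tagged = [('creating', 'VERB'), ('sustainable', 'ADJ'), ('abundance', 'NOUN')]
print(' '.join(word for word, _ in reorder_adj_noun(tagged)))
```

The reordered fragment matches the word order of the reference "abbondanza sostenibile". The rule is not general, though: sample 0 has "semplici principi etici", with one adjective before the noun and one after it.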

## Direct Translation (TODO)

In [8]:
### Building Translation layer

# from pytranslate import Translator

# # Create an English parse tree
# english_tree = Tree.fromstring("(S (NP (DT the) (JJ quick) (JJ brown) (NN fox)) (VP (VBD ran) (RB quickly) (PP (IN in) (NP (DT the) (NN forest)))))")

# # Flatten the tree to make it a sentence
# leaves = english_tree.flatten()
# text = " ".join(leaves)

# # Initialize the translator
# translator = Translator(from_lang='en', to_lang="it")

# # Translate the text
# it_text = translator.translate(text)

# # Print the translation
# print(it_text)

## Semantic Transfer Translation

### Loading sentences in Spacy

This extracts the POS tag of each token and performs the morphological analysis.

In [9]:
docs = {}
for lang in languages:
    print(f'Loading {lang.upper()} sentences as Spacy docs:')
    docs[lang] = [nlp_models[lang](sent) for sent in tqdm(sentences[lang])]

Loading EN sentences as Spacy docs:



Loading IT sentences as Spacy docs:



### POS tagging

In [10]:
def retrieve_pos_tags(doc):
    output = []
    for sentence in tqdm(doc):
        output.append([(token.text, token.pos_) for token in sentence])
    return output

In [11]:
pos_tags = {}
for lang in languages:
    print(f'Retrieving POS tag for {lang.upper()} doc:')
    pos_tags[lang] = retrieve_pos_tags(docs[lang])

Retrieving POS tag for EN doc:



Retrieving POS tag for IT doc:



In [12]:
print(pos_tags['en'][0])
print(' =>')
print(pos_tags['it'][0])

[('Permaculture', 'NOUN'), ('is', 'AUX'), ('a', 'DET'), ('design', 'NOUN'), ('science', 'NOUN'), ('based', 'VERB'), ('on', 'ADP'), ('three', 'NUM'), ('simple', 'ADJ'), ('ethics', 'NOUN'), (':', 'PUNCT')]
 =>
[('La', 'DET'), ('permacultura', 'NOUN'), ('è', 'AUX'), ('un', 'DET'), ('metodo', 'NOUN'), ('di', 'ADP'), ('progettazione', 'NOUN'), ('basato', 'ADJ'), ('su', 'ADP'), ('tre', 'NUM'), ('semplici', 'ADJ'), ('principi', 'NOUN'), ('etici', 'ADJ'), (':', 'PUNCT')]


### Morphological analysis

In [13]:
def morfo_analysis(doc):
    output = []
    for sentence in tqdm(doc):
        output.append([(token.text, token.pos_, token.lemma_) for token in sentence])
    return output

In [14]:
lem_pos_tags = {}
for lang in languages:
    print(f'Retrieving Lemmas and POS tag for {lang.upper()} doc:')
    lem_pos_tags[lang] = morfo_analysis(docs[lang])

Retrieving Lemmas and POS tag for EN doc:



Retrieving Lemmas and POS tag for IT doc:



In [15]:
print(lem_pos_tags['en'][0])
print(' =>')
print(lem_pos_tags['it'][0])

[('Permaculture', 'NOUN', 'permaculture'), ('is', 'AUX', 'be'), ('a', 'DET', 'a'), ('design', 'NOUN', 'design'), ('science', 'NOUN', 'science'), ('based', 'VERB', 'base'), ('on', 'ADP', 'on'), ('three', 'NUM', 'three'), ('simple', 'ADJ', 'simple'), ('ethics', 'NOUN', 'ethic'), (':', 'PUNCT', ':')]
 =>
[('La', 'DET', 'il'), ('permacultura', 'NOUN', 'permacultura'), ('è', 'AUX', 'essere'), ('un', 'DET', 'uno'), ('metodo', 'NOUN', 'metodo'), ('di', 'ADP', 'di'), ('progettazione', 'NOUN', 'progettazione'), ('basato', 'ADJ', 'basato'), ('su', 'ADP', 'su'), ('tre', 'NUM', 'tre'), ('semplici', 'ADJ', 'semplice'), ('principi', 'NOUN', 'principio'), ('etici', 'ADJ', 'etico'), (':', 'PUNCT', ':')]


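In the first example, note how the lemmatizer reduces inflected forms to their base form: the English "is" becomes "be", "based" becomes "base" and "ethics" becomes "ethic", while the Italian "è" becomes "essere" and "principi" becomes "principio". Words that are already in their base form, such as "simple", are left unchanged.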
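
The triples returned by `morfo_analysis` can be filtered to list only the tokens that the lemmatizer actually changed. A minimal sketch, using the English triples printed above as hard-coded input; `inflected_tokens` is a hypothetical helper, not part of the notebook.

```python
# (text, POS, lemma) triples copied from the English output above.
analysed = [
    ('Permaculture', 'NOUN', 'permaculture'), ('is', 'AUX', 'be'), ('a', 'DET', 'a'),
    ('design', 'NOUN', 'design'), ('science', 'NOUN', 'science'), ('based', 'VERB', 'base'),
    ('on', 'ADP', 'on'), ('three', 'NUM', 'three'), ('simple', 'ADJ', 'simple'),
    ('ethics', 'NOUN', 'ethic'), (':', 'PUNCT', ':'),
]


def inflected_tokens(triples):
    """Keep (text, lemma) pairs whose lemma differs from the surface form, ignoring case."""
    return [(text, lemma) for text, _pos, lemma in triples if text.lower() != lemma.lower()]


print(inflected_tokens(analysed))
# [('is', 'be'), ('based', 'base'), ('ethics', 'ethic')]
```

Comparing case-insensitively keeps "Permaculture" out of the result, since its lemma differs only in capitalisation.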

### Syntactic parsing

In [16]:
displacy.render(docs['en'][0], style='dep', jupyter=True, options={'compact': True, 'distance':100})
displacy.render(docs['it'][0], style='dep', jupyter=True, options={'compact': True, 'distance':100})

In [17]:
def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.pos_, [Tree(node.orth_, [to_nltk_tree(child) for child in node.children])])
    else:
        return Tree(node.pos_, [node.orth_])

def syntactic_parsing(doc):
    return [to_nltk_tree(list(sentence.sents)[0].root) for sentence in tqdm(doc)]
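
The recursion in `to_nltk_tree` can be followed without spaCy or NLTK by running the same logic on a stand-in node type and emitting a bracketed string. This is only a sketch: `Node` is a hypothetical substitute for a spaCy token, and the dependency structure below is built by hand, not produced by a parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a spaCy token: word, POS tag and dependents."""
    orth_: str
    pos_: str
    children: list = field(default_factory=list)


def to_bracketed(node):
    """Same recursion as to_nltk_tree, but returning a bracketed string."""
    if node.children:
        inner = ' '.join(to_bracketed(child) for child in node.children)
        # The POS tag wraps a subtree labelled with the head word itself
        return f'({node.pos_} ({node.orth_} {inner}))'
    return f'({node.pos_} {node.orth_})'


# Hand-built structure for "care for people" (assumed parse, not spaCy output).
people = Node('people', 'NOUN')
prep = Node('for', 'ADP', [people])
root = Node('care', 'NOUN', [prep])

print(to_bracketed(root))
# (NOUN (care (ADP (for (NOUN people)))))
```

Note that a head word with dependents becomes an inner label rather than a leaf, so only words without dependents appear as leaves of the resulting tree.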

In [18]:
synt_pars_tags = {}
for lang in languages:
    print(f'Retrieving Syntactic Parsing for {lang.upper()} doc:')
    synt_pars_tags[lang] = syntactic_parsing(docs[lang])

Retrieving Syntactic Parsing for EN doc:



Retrieving Syntactic Parsing for IT doc:



In [19]:
synt_pars_tags['en'][0].pretty_print()
print('=>')
synt_pars_tags['it'][0].pretty_print()

                   AUX                            
                    |                              
                    is                            
      ______________|__________________________    
     |             NOUN                        |  
     |              |                          |   
     |           science                       |  
     |         _____|____________              |   
     |        |     |           VERB           |  
     |        |     |            |             |   
     |        |     |          based           |  
     |        |     |            |             |   
     |        |     |           ADP            |  
     |        |     |            |             |   
     |        |     |            on            |  
     |        |     |            |             |   
     |        |     |           NOUN           |  
     |        |     |            |             |   
     |        |     |          ethics          |  
     |        |     | 

### Syntactic Transfer (TODO)

## RBMT Model (TODO)

## Sources


DS:
- https://opus.nlpl.eu/OpenSubtitles.php
- http://www.opensubtitles.org/

General:
- https://machinetranslate.org/
- https://towardsdatascience.com/machine-translation-b0f0dbcef47c
- https://towardsdatascience.com/data-preprocessing-for-machine-translation-fcbedef0e26a

Evaluation:
- https://towardsdatascience.com/bleu-bilingual-evaluation-understudy-2b4eab9bcfd1

Rule Based model: 
- https://link.springer.com/article/10.1007/s10590-021-09260-6