<a href="https://colab.research.google.com/github/souvorinkg/Eng2Kin/blob/main/tutorial/EnKinDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to translate into Low-Resource languages:
The purpose of this tutorial is to demonstrate how to build a parallel corpus for training a neural machine translation model. In the [next tutorial](https://github.com/souvorinkg/Eng2Kin/blob/main/tutorial/training_NLLB_en_kin.ipynb), I will walk through how to train the model with the dataset gathered here. I will be translating between English and Kinyarwanda; however, with a few modifications to the code, any two languages could be used. The first section covers an introduction to machine translation, parallel corpora, and backtranslation. If you are familiar with these concepts, you can safely proceed to the code. You will then put these ideas into practice by building a parallel corpus between two languages of your choice. Finally, you will publish your results on HuggingFace for you or others to use in future projects.

For this project I have selected English and Kinyarwanda. Kinyarwanda is a language spoken by roughly 15 million people in the nation of Rwanda, where it is universally spoken as a first language. I would like to thank [Gaelle Agahozo](https://github.com/GaelleAgahozo), whose initial translations and feedback were crucial to this project's success. I am grateful to [David Dale](https://cointegrated.medium.com/how-to-fine-tune-a-nllb-200-model-for-translating-a-new-language-a37fc706b865), whose article greatly aided me in this project. Finally, I would like to thank my advisor, [Dr. Ferrer](https://github.com/gjf2a/), who has taught me AI and guided this project.

# The Machine Translation Problem and Solutions:
[Machine Translation](https://en.wikipedia.org/wiki/Machine_translation) (MT) is the use of algorithms, software, and artificial intelligence to translate between two languages, as opposed to [translation](https://en.wikipedia.org/wiki/Translation) by hand, which is done by a bilingual speaker. MT takes a sequence of characters in the source language, then predicts the corresponding sequence in the target language. MT has a [long history](https://www.researchgate.net/publication/380264043_Survey_of_Popular_Machine_Translation_Systems), starting with Cold War-era rule-based algorithms. Current MT is done with [neural networks](https://en.wikipedia.org/wiki/Neural_network_(machine_learning)), which learn to [associate](https://www.3blue1brown.com/lessons/neural-networks) inputs with outputs. For example, [Google Translate](https://translate.google.com/) uses [Neural MT](https://en.wikipedia.org/wiki/Neural_machine_translation#cite_note-5) (NMT) for its 600 million daily users.

### Seq2Seq Models
Translation can be modeled as a [Seq2Seq](https://en.wikipedia.org/wiki/Seq2seq) problem within the domain of machine learning. In Seq2Seq models, a neural network is given an input sentence, called a sequence, and it must predict an output sequence that is read back to the user. In machine translation, the input sequence is the sentence you want translated, and the output sequence is the translation. Seq2Seq models were first developed for machine translation, with Google Translate among the earliest large-scale deployments, though now a variety of problems are modeled in this way.

### Generative Models
This problem can also be solved using a [generative](https://en.wikipedia.org/wiki/Generative_model) model. You have likely already used the popular generative model ChatGPT, where the G stands for generative. A generative model is designed to mimic various human behaviors such as writing or pattern recognition. Given a sentence as input in the prompt, [ChatGPT](https://) can output a likely translation based on everything it knows about the source language (the language being translated from) and the target language (the language you are translating to).

# The Parallel Corpus
Neural network models need data to train on. For the translation task, the most prized form of structured data is the parallel corpus. A [corpus](https://en.wikipedia.org/wiki/Text_corpus) is a body of text in a given language, larger than a single document. It is written in sentence form, unlike a dictionary. Some examples of corpora include [Project Gutenberg](https://www.gutenberg.org/), the [Norton Anthology of English Literature](https://archive.org/details/nortonanthologyo0002unse_x7c8), or New York Times articles [published in 2008](https://www.nytimes.com/search?dropmab=false&endDate=2008-12-31&query=&sort=best&startDate=2008-01-01). When a corpus is in a single language, it is said to be monolingual.

Instead of a list of sentences, a parallel corpus consists of a list of pairs of sentences. Each pair contains a source-language sentence and a target-language sentence with an equivalent meaning. Importantly, each word is not guaranteed to be an exact translation; rather, the sentences as a whole should have the same meaning. A famous example of a [parallel corpus](https://en.wikipedia.org/wiki/Parallel_text) is the [Rosetta Stone](https://en.wikipedia.org/wiki/Rosetta_Stone), which allowed the first translation of Egyptian hieroglyphics, because the stone was also written in the known language of Greek.
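
To make the data structure concrete, here is a minimal sketch of a parallel corpus as a Python list of (source, target) pairs. The two pairs are copied from the machine-translated output shown later in this tutorial, so treat them as unverified examples and not as vetted translations. The variable names are my own.

```python
# Each entry pairs a source sentence with a target sentence of equivalent meaning.
# These pairs come from the machine-translated output later in this tutorial,
# so no human has verified them.
parallel_corpus = [
    ("What do you know", "Uzi iki"),
    ("How about a beer", "Bite ho byeri?"),
]

# A monolingual corpus, by contrast, is just a list of sentences in one language.
english_corpus = [source for source, _ in parallel_corpus]

for source, target in parallel_corpus:
    print(f"{source} -> {target}")
```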

Parallel corpora already exist for languages with a long history of contact, such as English and French. In the European Union, the [largest ever parallel corpus](https://github.com/souvorinkg/Internet_Languages/blob/main/Souvorin_Nicolai_Final_Draft.pdf) has been created, containing millions of sentences in the 24 languages used by European states. As a result, it is easy to get highly accurate translations between these languages, which has contributed to European language dominance on the [internet](https://github.com/souvorinkg/Internet_Languages/blob/main/Souvorin_Nicolai_Final_Draft.pdf).

# Relationships between Languages
There are roughly 7,000 currently known languages, which gives about 7,000 squared, or roughly 49 million, possible translation directions. Of that large number, under a hundred pairs have pre-existing, organically constructed parallel corpora. Organically constructed parallel corpora are often built between neighboring, similar languages, like Portuguese and Spanish. Hence, translating between French and English is a comparatively easy problem, but translating between French and Indonesian is a formidable task. Where a parallel corpus doesn't naturally exist, we can artificially create one through the process of backtranslation, discussed in a later section.
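
As a quick check on the scale involved, the short calculation below counts language pairs, assuming the rough figure of 7,000 languages. It distinguishes translation directions (English to Kinyarwanda is a different direction from Kinyarwanda to English) from unordered pairs.

```python
n_languages = 7_000  # rough figure used in the text

# Ordered pairs: every language as source, every *other* language as target.
translation_directions = n_languages * (n_languages - 1)

# Unordered pairs: ignore which language is the source.
language_pairs = translation_directions // 2

print(f"{translation_directions:,} translation directions")  # 48,993,000
print(f"{language_pairs:,} language pairs")                  # 24,496,500
```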

To create a parallel corpus for a language pair, it is useful to know some information about the two languages. Languages can be either high-resource or low-resource, or they can share a [language family](https://en.wikipedia.org/wiki/Language_family) or be of different families.

### High and Low resource languages
High-resource languages have lots of text available online, because there are [lots of speakers](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers) of the language, because people [use the internet](https://en.wikipedia.org/wiki/Languages_used_on_the_Internet) in these languages, or because language users have used [violence](https://en.wikipedia.org/wiki/Linguistic_imperialism#History) to suppress other languages. For example, English, Spanish, and Portuguese are high-resource languages. Each has many speakers and many internet users, and each has been used in [colonization](https://en.wikipedia.org/wiki/Colonialism#Modernity). Large, high-quality monolingual corpora are widely available for these languages. We will talk about a high-quality English monolingual corpus in a later section.

On the other hand, low-resource languages have little text available, because there are few speakers of the language, the language doesn't have an extensive written tradition, it is not used on the internet, or the language has been suppressed by a dominant language. An example of a low-resource language is [Cherokee](https://en.wikipedia.org/wiki/Cherokee_language) (ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ), whose use has been [suppressed](https://en.wikipedia.org/wiki/American_Indian_boarding_schools) in the United States. Small monolingual corpora can be built in these languages using [creative methods](https://).

### Language Families
Languages in the same family have a common ancestor language that both are derived from. Grammar and vocabulary change slowly in a language, so languages that share a family will have similar structure and words. If you know any Spanish, you may have noticed it shares many traits with the other [Romance](https://en.wikipedia.org/wiki/Romance_languages) languages, such as Portuguese and French. Spanish and English share fewer traits, as they are related only more distantly: both are [Indo-European](https://en.wikipedia.org/wiki/Indo-European_languages) languages. Spanish and Chinese, however, are in separate language families, and any grammar or vocabulary similarities are purely coincidental.

# Constructing Artificial Parallel Corpora through Backtranslation
[Backtranslation](https://www.semanticscholar.org/paper/Unsupervised-Neural-Machine-Translation-Artetxe-Labaka/c2a7afbb5609a723f8eea91bfde4b02579b048d6) is the process of turning a monolingual corpus into a parallel corpus, and is one strategy for translating languages. Backtranslation has two steps.

First, a monolingual sentence is translated into a second language using a machine translator that is not guaranteed to produce high-quality translations. Then, the source sentence and the weakly translated sentence are put in a pair. This process is repeated for the entire monolingual corpus to create an "artificial" parallel corpus. The corpus is artificial because no human has verified that the translations are of good quality.

The second step of backtranslation is using the artificial parallel corpus to [train](https://www.3blue1brown.com/lessons/gradient-descent) the translation model you are building. After training, the model you created can then create an improved artificial parallel corpus by returning to step 1, so that the model is trained on its own output. This can be done repeatedly until the model stops showing improvement. For our purposes, we will conduct a single round of backtranslation.
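
The first step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_artificial_corpus` and `toy_translate` are hypothetical names, and `toy_translate` is a placeholder that tags the sentence instead of translating it, so that the sketch runs offline. A real pipeline would call a machine translation system there.

```python
from typing import Callable, List, Tuple

def build_artificial_corpus(
    monolingual: List[str],
    weak_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Step 1: pair every source sentence with its unverified machine translation."""
    return [(sentence, weak_translate(sentence)) for sentence in monolingual]

def toy_translate(sentence: str) -> str:
    """Placeholder translator (an assumption for this sketch, not a real MT system)."""
    return f"<rw> {sentence}"

corpus = build_artificial_corpus(["How about a beer", "What do you know"], toy_translate)
print(corpus[0])

# Step 2 (not shown): train your model on `corpus`. For further rounds, pass the
# trained model's translate function as `weak_translate` and rebuild the corpus.
```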

Once you know the resource level and family relationship of your source and target language, there are three cases to consider with distinct strategies:

1. If the two languages are in the same language family, then a translation model can be trained on two monolingual corpora, and the model will inductively align words with similar meanings. This is because these words will have similar locations in a [vector space representation](https://). Backtranslation is not needed.
2. Otherwise, if your source language is high-resource and your target language is low-resource, you can backtranslate from the high-resource language to the low-resource language.
3. Otherwise, if both languages are high-resource, you can backtranslate in both directions, then combine the two artificial parallel corpora.

In our case, we are interested in translating between a high-resource language, English, and a low-resource language, Kinyarwanda, that do not share a family. This is case 2. Therefore, we need a large monolingual corpus in English.
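
The decision between the strategies listed above can be written as a small function. This is a sketch of the reasoning in this section only; the function name, its boolean parameters, and the returned descriptions are my own, and a low-resource source language falls outside the cases covered here.

```python
def choose_strategy(same_family: bool, source_high: bool, target_high: bool) -> str:
    """Map the language relationship to one of the strategies described above."""
    if same_family:
        return "train on two monolingual corpora; no backtranslation needed"
    if source_high and target_high:
        return "backtranslate in both directions and combine the two corpora"
    if source_high:
        return "backtranslate from the high-resource source into the low-resource target"
    return "not covered by the cases in this tutorial"

# English (high-resource) to Kinyarwanda (low-resource), different families:
print(choose_strategy(same_family=False, source_high=True, target_high=False))
```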

# COCA, a monolingual corpus:
The [Corpus of Contemporary American English](https://www.english-corpora.org/coca/) (COCA) contains 1 billion words of English text. It draws on a balanced variety of sources: spoken conversation, fiction, magazines, newspapers, academic journals, and TV and movie scripts. The project was created by Mark Davies, retired professor of Corpus Linguistics at BYU. It is the most widely used public corpus of English. While the entire corpus isn't free, subsections are available for download without charge.

# Step 1: Get COCA
Download the [COCA](https://huggingface.co/datasets/souvorinkg/COCA/tree/main) .txt files. I have gathered the free subsections of the COCA database and collected them here. The easiest way to use them in this Colab notebook is to click the "files" tab on the left sidebar, then click the "file upload" button. Alternatively, you can load the files from your downloads folder or Google Drive.

# Step 2: Clean the text
This function takes in a filename and returns a list of cleaned sentences. I split each line into sentences at sentence-ending punctuation, then keep sentences that have more than 4 and fewer than 10 words, start with a capital letter, and are longer than 10 characters. Then, I remove non-alphanumeric characters, extra whitespace, and leading numbers. The decision to keep medium-length sentences is arbitrary; I encourage you to use longer or shorter sentences depending on the goals of your dataset.

In [None]:
import re

def fileToSent(filename):
    """Read a text file and return its cleaned sentences as a list."""
    with open(filename, 'r', encoding='utf-8') as file:
        sentences = cleanFile(file)
    return sentences

def cleanFile(file):
    """Split each line into sentences, filter them, and strip punctuation."""
    sentences = []
    for line in file:
        # Split the line into sentences after '.', '!' or '?' followed by whitespace
        line_sentences = re.split(r'(?<=[.!?])\s+', line.strip())
        for sent in line_sentences:
            n_words = len(sent.split())
            # Keep sentences longer than 10 characters, with 5 to 9 words, that start with a capital letter
            if len(sent) > 10 and 4 < n_words < 10 and sent[0].isupper():
                # Remove non-alphanumeric characters
                sent = re.sub(r'[^a-zA-Z0-9\s]', '', sent)
                # Collapse extra whitespace within the sentence
                sent = re.sub(r'\s+', ' ', sent).strip()
                # Remove a leading number, if any, so the sentence begins with a letter
                sent = re.sub(r'^\d+\s', '', sent)
                sentences.append(sent)
    return sentences
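
To see what the split and filter do, here is a small self-contained trace on one line of text. The sample line is invented for illustration (it imitates COCA's habit of separating contractions, as in "I 'm"), and the loop repeats the same regular expressions and thresholds used in the cleaning function above.

```python
import re

# Invented sample line; COCA text separates contractions such as "I 'm".
line = "Wow! You drive that fast? I always say I 'm still in training."

# Split into candidate sentences after '.', '!' or '?' followed by whitespace.
candidates = re.split(r'(?<=[.!?])\s+', line.strip())

kept = []
for sent in candidates:
    n_words = len(sent.split())
    # Same filter as above: more than 10 characters, 5 to 9 words, capitalized.
    if len(sent) > 10 and 4 < n_words < 10 and sent[0].isupper():
        sent = re.sub(r'[^a-zA-Z0-9\s]', '', sent)   # drop punctuation
        sent = re.sub(r'\s+', ' ', sent).strip()     # collapse whitespace
        kept.append(sent)

print(candidates)  # three candidate sentences
print(kept)        # only the 8-word sentence survives the filter
```

The first two candidates are dropped because they have four words or fewer, which is why short exclamations and questions are underrepresented in the cleaned corpus.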


I uploaded my files from the .zip package, then I called the function on every file within. Change the file names as necessary. Finally, I collected the smaller corpora into a list of corpora.

In [None]:
mag_sentences = fileToSent('text_mag.txt')
fic_sentences = fileToSent('text_fic.txt')
news_sentences = fileToSent('text_news.txt')
spok_sentences = fileToSent('text_spok.txt')
acad_sentences = fileToSent('text_acad.txt')
web_sentences = fileToSent('text_web.txt')
tvm_sentences = fileToSent('text_tvm.txt')
# collect the per-genre sentence lists into one list
corpura = [mag_sentences, fic_sentences, news_sentences, spok_sentences, acad_sentences, web_sentences, tvm_sentences] # list of each corpus


Let's check our output. Using a list slice, this selects the first 4 sentences from the magazine corpus.

In [None]:
print(corpura[0][:4])

['Wow You drive that fast', 'I always say I m still in training', 'How do you spot rookie drivers', 'But I like to go around twice']


Punctuation such as commas and apostrophes was removed in our cleaning step, so some meaning has been lost. However, for the purpose of this dataset, these sentences are of high quality.

# Step 3: Use DeepTranslator
The [deep-translator](https://github.com/nidhaloff/deep-translator) library allows us to easily and quickly translate within Python. First, uncomment the first line of the code to install deep_translator using pip. I used the Google Translate backend within deep-translator for its speed and accuracy; however, a number of other translators are available in the project, depending on your requirements. Here is a list of languages supported by [Google Translate](https://en.wikipedia.org/wiki/Google_Translate#Supported_languages). To use a different language, replace "en" or "rw" with your language's two-letter [ISO 639 code](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). German would be "de", and Indonesian would be "id".

In [None]:

# !pip install deep_translator
# use pip for installation
from deep_translator import GoogleTranslator # change to a different translator if needed
def BatchTranslate(batch, source, target):
    my_translator = GoogleTranslator(source=source, target=target) # change to a different translator if needed
    result = my_translator.translate_batch(batch)
    return result

Test out the BatchTranslate() function and see our list of translations!

In [None]:
translated = BatchTranslate(tvm_sentences[:10], 'en', 'rw') # change the language codes to use different languages.
print(tvm_sentences[:10]) #original sentences
print(translated)


['How you doing Norm', 'What do you know', 'How about a beer', 'Missed a digit in the debit column today', 'I ve been hearing good things about it', 'Norman you are looking especially spry today', 'How s life treating you', 'This bozo could probably be a better Congressman', 'You re out of work too', 'Do nt be ridiculous']
['Nigute ukora Norm', 'Uzi iki', 'Bite ho byeri?', 'Yabuze imibare mumurongo wo kubikuza uyumunsi', 'Nagiye numva ibintu byiza kubyerekeye', 'Norman urimo kureba cyane cyane kuneka uyumunsi', 'Nigute ubuzima bugufata', 'Iyi bozo birashoboka ko ishobora kuba Umudepite mwiza', 'Nawe ntushobora kuva ku kazi', 'Ntugasekeje']


We will need to write our translations to a CSV file as we go, so that if we are interrupted we can resume from our last checkpoint. WriteTranslation() takes in the name of a CSV file to append to, along with the parallel lists of translated and source sentences.

In [None]:
import csv
def WriteTranslation(csv_filename, translated_sentences, source_sentences):
    # Append (translation, source) rows so that earlier batches are preserved
    with open(csv_filename, 'a', newline='', encoding='utf-8') as csv_file:
        csv_writer = csv.writer(csv_file)
        for translated, source in zip(translated_sentences, source_sentences):
            csv_writer.writerow([translated, source])
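
Because the file is a checkpoint, it helps to be able to read it back and see how many usable pairs it holds. The sketch below is one way to do that with the standard library. `count_valid_rows` is a hypothetical helper, not part of the notebook, and the demo writes a throwaway file using the same column order as WriteTranslation() (translation first, source second).

```python
import csv
import os
import tempfile

def count_valid_rows(csv_filename):
    """Count rows that hold exactly two non-empty fields: (translation, source)."""
    valid = 0
    with open(csv_filename, newline='', encoding='utf-8') as csv_file:
        for row in csv.reader(csv_file):
            if len(row) == 2 and all(field.strip() for field in row):
                valid += 1
    return valid

# Demo on a temporary file so nothing in your working directory is touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(demo_path, 'a', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Uzi iki', 'What do you know'])
    writer.writerow(['', 'How about a beer'])  # simulates a missing translation

rows_done = count_valid_rows(demo_path)
print(rows_done)  # 1
```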


The API is not instantaneous; you should expect to spend several minutes translating 1000 sentences. To ensure fault tolerance, only translate a subset of the sentences at a time. BatchTranslate works on a large list of sentences, but in my experience it often crashes with more than 3000 sentences. Adjust the distance, offset, and corpus number until you have translated as many sentences as you would like. For the Kinyarwanda example, I translated 50,000 sentences. It may take a few seconds for your CSV file to appear in your Google Colab notebook.

In [None]:
distance = 10 # pick the number of sentences to translate
offset = 4050 # pick the sentence number to start at
corpus = corpura[0] # [mag_sentences, fic_sentences, news_sentences, spok_sentences, acad_sentences, web_sentences, tvm_sentences]
source_sentences = corpus[offset:offset+distance]

translated_sentences = BatchTranslate(source_sentences, 'en', 'rw')

csv_filename = 'trans2.csv'
WriteTranslation(csv_filename, translated_sentences, source_sentences)
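
If you would rather not adjust the offset by hand, the same idea can be wrapped in a loop that translates one chunk at a time and appends each chunk to the CSV before moving on. This is a sketch under stated assumptions: `translate_in_chunks` and `fake_batch` are hypothetical names, the chunk size is arbitrary, and `fake_batch` is an offline stand-in that uppercases text instead of translating it.

```python
import csv
import os
import tempfile

def translate_in_chunks(sentences, translate_batch, csv_filename, start=0, chunk_size=1000):
    """Translate sentences[start:] chunk by chunk, appending each chunk to the CSV."""
    for offset in range(start, len(sentences), chunk_size):
        chunk = sentences[offset:offset + chunk_size]
        translated = translate_batch(chunk)
        # Write after every chunk so a crash loses at most one chunk of work.
        with open(csv_filename, 'a', newline='', encoding='utf-8') as csv_file:
            csv.writer(csv_file).writerows(zip(translated, chunk))
        print(f"wrote sentences {offset} to {offset + len(chunk) - 1}")

def fake_batch(batch):
    """Offline stand-in for a real translator (assumption for this demo)."""
    return [sentence.upper() for sentence in batch]

demo_sentences = [f"Sentence number {i}" for i in range(5)]
demo_csv = os.path.join(tempfile.mkdtemp(), 'chunks.csv')
translate_in_chunks(demo_sentences, fake_batch, demo_csv, start=1, chunk_size=2)

with open(demo_csv, newline='', encoding='utf-8') as csv_file:
    rows = list(csv.reader(csv_file))
print(len(rows))  # 4
```

In the notebook, you could pass `lambda batch: BatchTranslate(batch, 'en', 'rw')` as `translate_batch`, and set `start` to the last offset that was printed before an interruption.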

# Step 4: Upload results to HuggingFace

HuggingFace is a repository for open-source AI projects. There, you can download popular datasets for a number of natural language processing tasks. You can clone and interact with popular machine learning models. Finally, you can host your datasets and models, and even deploy a GUI for your app!

For the next step in this tutorial, you will need to [create a HuggingFace](https://huggingface.co/join) account. After creating your account, click your profile image, then select "New Dataset."

You should arrive at a screen like this:

![image.png](https://drive.google.com/uc?export=view&id=13tbcNvKrep61QeKMtyyrgNDqojBqI_2u)

For now, don't worry about licenses or the README. Then, from the dataset's files tab, upload the CSV file you created.

Congrats! You took unprocessed, monolingual text, then you cleaned it and put it into a list of sentences. Then, you translated those sentences to create a parallel pair of sentences. Finally, you published your parallel translations on the internet for others to use!

# Conclusion
In this tutorial, you learned how neural networks are used in machine translation. Then, you learned how to create the parallel corpus necessary to train an NMT model with backtranslation. Using the COCA dataset, you built a cleaned monolingual corpus. On that corpus, you performed a series of translations with the deep-translator library until you had a parallel corpus. Finally, you published the corpus to the HuggingFace Hub.

In the [next tutorial](https://github.com/souvorinkg/Eng2Kin/blob/main/tutorial/training_NLLB_en_kin.ipynb), we will use your HuggingFace dataset to train an NLLB translation model.
