# ACTIVITY: LOST IN TRANSLATION

**ACTIVITY DESCRIPTION: At the end of this activity, the learner will be able to compare different methodologies for classification of non-English texts**

**WHAT YOU SHOULD FOCUS ON: Use and compare different methodologies for a similar task**

One (very big) problem in natural language processing is that most data *corpora* are in English, whereas most people in the world are not English speakers. Some languages have few speakers, or few documents available for training. This is a particular problem for large-scale models, which are known for being *data hungry*: they require *a lot* of data to be adequately trained.

There are essentially two methods to deal with non-English texts:

1. Translate the text to English and then act on the translation
1. Train language models in different languages

Nowadays, translation is relatively easy to perform using the `argostranslate` package in Python:

In [None]:
import os

os.environ["ARGOS_DEVICE_TYPE"] = "cuda" # or "cpu"

import argostranslate.package
import argostranslate.translate

# First, we need to configure the translator:
def _configure_argos(from_code="pt", to_code="en"):
    # Refresh the package index and pick the package for this language pair
    argostranslate.package.update_package_index()
    available_packages = argostranslate.package.get_available_packages()
    available_package = next(
        p for p in available_packages
        if p.from_code == from_code and p.to_code == to_code)
    # Download and install the translation model
    argostranslate.package.install_from_path(available_package.download())
    # Look up the installed languages and build the translation object
    installed_languages = argostranslate.translate.get_installed_languages()
    from_lang = next(l for l in installed_languages if l.code == from_code)
    to_lang = next(l for l in installed_languages if l.code == to_code)
    return from_lang.get_translation(to_lang)

translator = _configure_argos('pt', 'en')

In [None]:
translator.translate("Olá, como você está?")  # "Hello, how are you?"

The second method relies on a model trained on many languages at once, such as multilingual BERT, which can be loaded with the `transformers` package:

In [None]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-multilingual-uncased')


In [None]:
print(unmasker("I walk this lonely [MASK], the only one that I have ever known!")[0]['sequence'])
print(unmasker("Eu caminhei sozinho pela [MASK], falei com as estrelas e com a Lua")[0]['sequence'])

# Exercise

Compare three different approaches to the classification problem for texts in Portuguese. For each approach, build a learning curve: draw random samples of different sizes from the training dataset, train on each sample, and evaluate the resulting model.

1. Perform classification directly in Portuguese, using a Bag-of-Words approach
1. Classify texts directly in Portuguese using multilingual BERT
1. Translate texts to English and then use the usual (English) BERT
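
The learning-curve loop is the same for all three approaches, so it may help to write it once. Below is a minimal sketch, not a reference solution. The names `learning_curve`, `fit_and_score` and `majority_baseline` are hypothetical, and the dataset is assumed to be a list of `(text, label)` pairs. The toy majority-class baseline only stands in for a real classifier so that the loop can be run as is.

```python
import random


def learning_curve(train, test, fit_and_score, sizes, seed=0):
    """Return [(size, score), ...] for random training subsets of each size.

    fit_and_score(train_subset, test) is a hypothetical callback: it should
    train one classifier on train_subset and return its score on test.
    """
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    curve = []
    for size in sizes:
        if size > len(train):
            raise ValueError(f"size {size} exceeds training set ({len(train)})")
        subset = rng.sample(train, size)  # sample without replacement
        curve.append((size, fit_and_score(subset, test)))
    return curve


# Toy stand-in for a real classifier: always predict the majority training label.
def majority_baseline(train_subset, test):
    labels = [label for _, label in train_subset]
    majority = max(set(labels), key=labels.count)
    return sum(label == majority for _, label in test) / len(test)


# Made-up data, only to exercise the loop: one third of the labels are True.
toy_train = [(f"texto {i}", i % 3 == 0) for i in range(90)]
toy_test = [(f"teste {i}", i % 3 == 0) for i in range(30)]
curve = learning_curve(toy_train, toy_test, majority_baseline, sizes=[10, 30, 90])
```

To compare the three approaches, replace `majority_baseline` with one function per approach (Bag-of-Words, multilingual BERT, translation followed by BERT) and plot the three resulting curves on the same axes. Keeping the test set fixed across sizes makes the points on each curve comparable.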
