# Language Identification

Whenever we work with textual data, we need to know the language it is written in. Whether we are spellchecking a text, building a search index, or searching for the names of people, each of these tasks is language dependent. Sometimes the texts come with metadata that states the language, but such metadata is not always available and is sometimes wrong. In those cases it is often easier to detect the language directly from the text.



## 1 Problem Definition

What exactly is the task?


### 1.1 Number of Languages

The task is easier if we only have to choose between two languages instead of 3000. In many practical cases we only have to choose from a handful of languages. For example, if we collect social media texts about a "German topic" or from a German site, we can be fairly certain that the language is either German or English. The same applies to abstracts collected from the catalogue of a German library.

In the following section we want to select from a small number of Western European languages.

### 1.2 Text encoding

Worst-case scenario: we don't even know the encoding of a file. We are reading raw bytes and don't know whether to interpret them as ASCII, UTF-8 or Latin-2. In such cases we also have to predict the encoding, but we neglect this problem for this exercise and assume that all texts we read are encoded in UTF-8.
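
As a rough sketch of what the UTF-8 assumption means in practice, the helper below tries to decode raw bytes as UTF-8 and falls back to Latin-1 if that fails. The function name `decode_text` and the choice of Latin-1 as a fallback are assumptions made for this illustration; this is not real encoding detection.

```python
def decode_text(raw: bytes) -> str:
    """Decode bytes as UTF-8; fall back to Latin-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises,
        # but the result is wrong if the real encoding was something else.
        return raw.decode("latin-1")


print(decode_text("erzählen".encode("utf-8")))    # valid UTF-8
print(decode_text("erzählen".encode("latin-1")))  # invalid UTF-8, uses the fallback
```

A fallback like this only hides the problem: bytes in any other encoding are decoded without an error but may come out as the wrong characters.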

## 2 Algorithm

A simple algorithm would be to count how many typical or frequent German words occur in a given text. For this to give precise predictions the text has to be sufficiently long, as some words belong to several languages.
Another frequently suggested approach is to count the occurrences of stopwords, as stopwords are frequent and stopword lists are easily accessible. This approach does not work well for short texts. Words like 'de' and 'en', for example, are on both the French and the Dutch stopword lists. Such overlaps are common, as stopwords are usually just one or two syllables long and the number of possible syllables is limited.

A book title such as:

*De kleine prins en de grote drakejacht*

could be French or Dutch, if you only look at the stopwords. But someone who is familiar with both languages knows that it is a Dutch phrase.
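
The ambiguity can be reproduced with a small sketch. The two stopword lists below are tiny, hand-picked stand-ins (an assumption for this illustration, not real stopword lists), and `stopword_hits` is a hypothetical helper name.

```python
# Tiny illustrative stopword lists (assumptions, far from complete).
STOPWORDS = {
    "french": {"de", "en", "le", "la", "les", "et", "un", "une"},
    "dutch": {"de", "en", "het", "een", "van", "op"},
}


def stopword_hits(text):
    """Count, per language, how many tokens of the text are stopwords."""
    tokens = text.lower().split()
    return {lang: sum(1 for t in tokens if t in words)
            for lang, words in STOPWORDS.items()}


hits = stopword_hits("De kleine prins en de grote drakejacht")
print(hits)  # both languages get the same count
```

With these lists, 'de' (twice) and 'en' count for both languages, so the title produces a tie and the stopwords alone cannot decide.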

The best results are achieved when we use character distributions as features rather than word occurrences.
In English, the letter y is much more frequent than in German. On the other hand, German has characters (umlauts) that normally do not appear in English at all. Combinations of two or three letters, so-called bigrams and trigrams, work even better. We can also combine single characters, bigrams and trigrams, as described in this paper:

Cavnar, W.B., Trenkle, J.M.: *N-gram-based text categorization*. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. pp. 161-175 (1994) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.3248&rep=rep1&type=pdf
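
The observation about single letters can be checked with a rough sketch. The helper `char_frequency` and the English sample sentence are made up for this illustration; the German sentence is the example sentence used in the cells of this notebook. Two sentences are of course far too little data for reliable frequencies.

```python
def char_frequency(text, chars):
    """Relative frequency of the given characters among all letters in text."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if c in chars) / len(letters)


german = ("Die Verfasserin unternimmt es in diesem Buche, die Geschichte "
          "des Kautschuks in Menschenschicksalen zu erzählen.")
# Made-up English sample sentence (assumption for this sketch).
english = "They say you may try any key you carry in your bag today."

for name, sample in [("german", german), ("english", english)]:
    print(name,
          "y:", round(char_frequency(sample, "y"), 3),
          "umlauts:", round(char_frequency(sample, "äöü"), 3))
```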

### 2.1 Extraction of n-grams

The following function extracts all overlapping n-grams from a string:


In [1]:
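def ngram(string,n):
    """Return all overlapping character n-grams of string."""
    liste = []
    # <= so that a string of exactly n characters still yields one n-gram
    if n <= len(string):
        for p in range(len(string) - n + 1) :
            tg = string[p:p+n]
            liste.append(tg)
    return liste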

Testing the function:


In [2]:
text = "Die Verfasserin unternimmt es in diesem Buche, die Geschichte des Kautschuks in Menschenschicksalen zu erzählen."
trigramme = ngram(text,3)
print(trigramme)

['Die', 'ie ', 'e V', ' Ve', 'Ver', 'erf', 'rfa', 'fas', 'ass', 'sse', 'ser', 'eri', 'rin', 'in ', 'n u', ' un', 'unt', 'nte', 'ter', 'ern', 'rni', 'nim', 'imm', 'mmt', 'mt ', 't e', ' es', 'es ', 's i', ' in', 'in ', 'n d', ' di', 'die', 'ies', 'ese', 'sem', 'em ', 'm B', ' Bu', 'Buc', 'uch', 'che', 'he,', 'e, ', ', d', ' di', 'die', 'ie ', 'e G', ' Ge', 'Ges', 'esc', 'sch', 'chi', 'hic', 'ich', 'cht', 'hte', 'te ', 'e d', ' de', 'des', 'es ', 's K', ' Ka', 'Kau', 'aut', 'uts', 'tsc', 'sch', 'chu', 'huk', 'uks', 'ks ', 's i', ' in', 'in ', 'n M', ' Me', 'Men', 'ens', 'nsc', 'sch', 'che', 'hen', 'ens', 'nsc', 'sch', 'chi', 'hic', 'ick', 'cks', 'ksa', 'sal', 'ale', 'len', 'en ', 'n z', ' zu', 'zu ', 'u e', ' er', 'erz', 'rzä', 'zäh', 'ähl', 'hle', 'len', 'en.']


According to Cavnar and Trenkle the best results are achieved by combining mono-, bi- and trigrams, so we write a function that does exactly this (for 1 ≤ n < 4).


In [3]:
def xgram(string):
    return [w for n in range(1,4) for w in ngram(string.lower(),n)]

In [4]:
xgramme = xgram(text)
print(xgramme)

['d', 'i', 'e', ' ', 'v', 'e', 'r', 'f', 'a', 's', 's', 'e', 'r', 'i', 'n', ' ', 'u', 'n', 't', 'e', 'r', 'n', 'i', 'm', 'm', 't', ' ', 'e', 's', ' ', 'i', 'n', ' ', 'd', 'i', 'e', 's', 'e', 'm', ' ', 'b', 'u', 'c', 'h', 'e', ',', ' ', 'd', 'i', 'e', ' ', 'g', 'e', 's', 'c', 'h', 'i', 'c', 'h', 't', 'e', ' ', 'd', 'e', 's', ' ', 'k', 'a', 'u', 't', 's', 'c', 'h', 'u', 'k', 's', ' ', 'i', 'n', ' ', 'm', 'e', 'n', 's', 'c', 'h', 'e', 'n', 's', 'c', 'h', 'i', 'c', 'k', 's', 'a', 'l', 'e', 'n', ' ', 'z', 'u', ' ', 'e', 'r', 'z', 'ä', 'h', 'l', 'e', 'n', '.', 'di', 'ie', 'e ', ' v', 've', 'er', 'rf', 'fa', 'as', 'ss', 'se', 'er', 'ri', 'in', 'n ', ' u', 'un', 'nt', 'te', 'er', 'rn', 'ni', 'im', 'mm', 'mt', 't ', ' e', 'es', 's ', ' i', 'in', 'n ', ' d', 'di', 'ie', 'es', 'se', 'em', 'm ', ' b', 'bu', 'uc', 'ch', 'he', 'e,', ', ', ' d', 'di', 'ie', 'e ', ' g', 'ge', 'es', 'sc', 'ch', 'hi', 'ic', 'ch', 'ht', 'te', 'e ', ' d', 'de', 'es', 's ', ' k', 'ka', 'au', 'ut', 'ts', 'sc', 'ch', 'hu', '

### 2.2 Language model

A **model** for a language is a set of such n-grams together with their relative frequencies, which serve as probability estimates. In the following, a model is represented as a Python dictionary.


In [5]:
def buildmodel(text):
    model = {}

    xgramme = xgram(text)
    nr_of_ngs = len(xgramme)

    for w in xgramme:
        f = 1 + model.get(w,0)
        model[w] = f
    
    for w in model:
        model[w] = float(model[w]) / float(nr_of_ngs)

    return model

Testing the function and printing the result:


In [6]:
model = buildmodel(text)
print(model)

{'d': 0.012012012012012012, 'i': 0.02702702702702703, 'e': 0.05105105105105105, ' ': 0.042042042042042045, 'v': 0.003003003003003003, 'r': 0.012012012012012012, 'f': 0.003003003003003003, 'a': 0.009009009009009009, 's': 0.03303303303303303, 'n': 0.02702702702702703, 'u': 0.015015015015015015, 't': 0.012012012012012012, 'm': 0.012012012012012012, 'b': 0.003003003003003003, 'c': 0.021021021021021023, 'h': 0.021021021021021023, ',': 0.003003003003003003, 'g': 0.003003003003003003, 'k': 0.009009009009009009, 'l': 0.006006006006006006, 'z': 0.006006006006006006, 'ä': 0.003003003003003003, '.': 0.003003003003003003, 'di': 0.009009009009009009, 'ie': 0.009009009009009009, 'e ': 0.009009009009009009, ' v': 0.003003003003003003, 've': 0.003003003003003003, 'er': 0.012012012012012012, 'rf': 0.003003003003003003, 'fa': 0.003003003003003003, 'as': 0.003003003003003003, 'ss': 0.003003003003003003, 'se': 0.006006006006006006, 'ri': 0.003003003003003003, 'in': 0.009009009009009009, 'n ': 0.0120120120

We could also use the Counter class from the standard library's collections module:


In [7]:
import collections

def buildmodel(text):
    model = collections.Counter(xgram(text))  
    nr_of_ngs = sum(model.values())

    for w in model:
        model[w] = float(model[w]) / float(nr_of_ngs)

    return model

In [8]:
model = buildmodel(text)
print(model)

Counter({'e': 0.05105105105105105, ' ': 0.042042042042042045, 's': 0.03303303303303303, 'i': 0.02702702702702703, 'n': 0.02702702702702703, 'c': 0.021021021021021023, 'h': 0.021021021021021023, 'ch': 0.018018018018018018, 'u': 0.015015015015015015, 'd': 0.012012012012012012, 'r': 0.012012012012012012, 't': 0.012012012012012012, 'm': 0.012012012012012012, 'er': 0.012012012012012012, 'n ': 0.012012012012012012, 'es': 0.012012012012012012, 'sc': 0.012012012012012012, 'en': 0.012012012012012012, 'sch': 0.012012012012012012, 'a': 0.009009009009009009, 'k': 0.009009009009009009, 'di': 0.009009009009009009, 'ie': 0.009009009009009009, 'e ': 0.009009009009009009, 'in': 0.009009009009009009, 's ': 0.009009009009009009, ' d': 0.009009009009009009, 'die': 0.009009009009009009, 'in ': 0.009009009009009009, 'l': 0.006006006006006006, 'z': 0.006006006006006006, 'se': 0.006006006006006006, 'te': 0.006006006006006006, ' e': 0.006006006006006006, ' i': 0.006006006006006006, 'he': 0.006006006006006006, 

Given a sufficient amount of text, we get values that are typical for a language. We can then compare these values with the values of an unknown text.
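
To see which n-grams dominate such a model, we can list the most frequent entries. The sketch below is self-contained and therefore repeats compact versions of `xgram` and `buildmodel`; the compact formulation is an assumption of this sketch and should behave like the functions defined in the cells above for this example.

```python
from collections import Counter


def xgram(string):
    """All 1-, 2- and 3-grams of the lowercased string."""
    s = string.lower()
    return [s[p:p + n] for n in range(1, 4) for p in range(len(s) - n + 1)]


def buildmodel(text):
    """Map each n-gram to its relative frequency."""
    counts = Counter(xgram(text))
    total = sum(counts.values())
    return Counter({gram: count / total for gram, count in counts.items()})


text = ("Die Verfasserin unternimmt es in diesem Buche, die Geschichte "
        "des Kautschuks in Menschenschicksalen zu erzählen.")
model = buildmodel(text)

# most_common sorts by value, which here is the relative frequency.
for gram, freq in model.most_common(5):
    print(repr(gram), round(freq, 4))
```

Because every n-gram count is divided by the total number of n-grams, the frequencies of one model add up to 1.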

**NLTK** offers the Universal Declaration of Human Rights in over 300 languages. We will use those texts to build some language models, though the amount of text is in fact too small to get reliable models.

You may first have to download the NLTK resources. This only has to be done once. The next cell downloads just the 'udhr' package. If you call nltk.download() without an argument instead, an interactive downloader opens; select download and then the option 'book' (or 'all' or just 'udhr').


In [9]:
import nltk
nltk.download('udhr')

[nltk_data] Downloading package udhr to /root/nltk_data...
[nltk_data]   Package udhr is already up-to-date!


True

In [10]:
from nltk.corpus import udhr

# print(udhr.fileids())  # uncomment to list all available files

languages = ['english','german','dutch','french','italian','spanish']

english_udhr = udhr.raw('English-Latin1')
german_udhr = udhr.raw('German_Deutsch-Latin1')
dutch_udhr = udhr.raw('Dutch_Nederlands-Latin1')
french_udhr = udhr.raw('French_Francais-Latin1')
italian_udhr = udhr.raw('Italian_Italiano-Latin1')
spanish_udhr = udhr.raw('Spanish_Espanol-Latin1')

texts = {'english':english_udhr,'german':german_udhr,'dutch':dutch_udhr,'french':french_udhr,'italian':italian_udhr,'spanish':spanish_udhr}
models = {lang:buildmodel(texts[lang]) for lang in languages}

These texts are very short and contain only a fraction of the vocabulary of a language. The German version is only about 10,000 characters long:



In [11]:
print(len(german_udhr))

9999


## 3 Determine the Language

We now have to compare the n-gram frequencies of a given text with the frequencies of each model. We will use cosine similarity for this:

![alt text](https://wikimedia.org/api/rest_v1/media/math/render/svg/a71c4add4abded66efd42b202c76f6a59944a587)
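
Written out for two models $a$ and $b$, viewed as sparse vectors indexed by n-gram $k$, the standard definition of cosine similarity is:

$$
\cos(a, b) = \frac{\sum_{k} a_k \, b_k}{\sqrt{\sum_{k} a_k^2} \; \sqrt{\sum_{k} b_k^2}}
$$

An n-gram that is missing from a model has frequency 0 there, so only n-grams that occur in both models contribute to the numerator, while every n-gram of a model contributes to its norm in the denominator.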


In [12]:
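import math

def cosine(a,b):
    # dot product over shared n-grams, divided by the product of both norms
    dot = sum(a[k]*b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(a[k]**2 for k in a))
    norm_b = math.sqrt(sum(b[k]**2 for k in b))
    return dot / (norm_a * norm_b)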
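
A worked toy example may help to get a feeling for the values. The sketch below contains its own copy of the cosine function, written with intermediate variables, and applies it to hypothetical models with made-up frequencies.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v ** 2 for v in a.values()))
    norm_b = math.sqrt(sum(v ** 2 for v in b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical toy models with made-up frequencies.
a = {"th": 0.6, "he": 0.8}
b = {"th": 0.8, "ch": 0.6}

print(cosine(a, a))            # identical models: approximately 1
print(cosine(a, b))            # only "th" is shared: approximately 0.48
print(cosine(a, {"zz": 1.0}))  # no shared n-grams: 0
```

Both toy vectors have length 1, so the similarity of `a` and `b` is just the product of their shared entry, 0.6 · 0.8. Note that the function divides by zero if one of the models is empty.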

In [13]:
print(text)
textmodel = buildmodel(text)
for m in models:
    print(m, cosine(models[m],textmodel))

Die Verfasserin unternimmt es in diesem Buche, die Geschichte des Kautschuks in Menschenschicksalen zu erzählen.
english 0.7411160989023053
german 0.8522092567280237
dutch 0.7916873428505355
french 0.7659701598838251
italian 0.7148098723418794
spanish 0.7444444064683035


We already get decent results. The accuracy could of course be increased by building the models from longer texts that cover a wider range of genres.

## 4 Beautify it

We need a function that predicts the language a text is written in:

In [14]:
def guess_language(text):
    textmodel = buildmodel(text)
    lang = "english"
    best = 0
    for m in models:
        c = cosine(models[m],textmodel)
        if c > best:
            best = c
            lang = m
    return lang

In [15]:
t = "Wie zijn leven voltooid vindt en met een consulent in gesprek gaat over zelfdoding, stelt zelfeuthanasie vaak uit of ziet ervan af"  
print(guess_language(t))

t = u"L’ancien candidat écologiste à la primaire de la gauche s’était engagé à soutenir le vainqueur de ce scrutin à la fin janvier, en l’occurrence Benoît Hamon."  
print(guess_language(t))


t = "Una capra al posto del giardiniere"
print(guess_language(t))

dutch
french
spanish


The last result is wrong; Italian would be correct. The text is very short and the model is based on too small a corpus.
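
One way to notice such shaky predictions is to look at all similarity scores instead of only the winner. The following is a sketch under stated assumptions: `rank_languages` is a hypothetical helper, and the two stand-in models are built from a single sample sentence each so that the example runs without NLTK. With the `models` dictionary built from the UDHR texts the scores will differ.

```python
import math
from collections import Counter


def buildmodel(text):
    """Relative frequencies of all 1-, 2- and 3-grams of the lowercased text."""
    s = text.lower()
    counts = Counter(s[p:p + n] for n in range(1, 4)
                     for p in range(len(s) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))


def rank_languages(text, models):
    """Return (language, similarity) pairs, best match first."""
    textmodel = buildmodel(text)
    scores = [(lang, cosine(model, textmodel)) for lang, model in models.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Stand-in models from one sample sentence per language (assumption).
toy_models = {
    "italian": buildmodel("Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti."),
    "spanish": buildmodel("Todos los seres humanos nacen libres e iguales en dignidad y derechos."),
}

ranking = rank_languages("Una capra al posto del giardiniere", toy_models)
for lang, score in ranking:
    print(lang, round(score, 3))
print("margin:", round(ranking[0][1] - ranking[1][1], 3))
```

A small margin between the first and the second candidate suggests that the text is too short, or the models too weak, for a trustworthy decision.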

Also, to avoid recomputing the models every time, we can pickle them and save them to Google Drive.

In [16]:
import pickle

#from pydrive.auth import GoogleAuth
#from pydrive.drive import GoogleDrive
#from google.colab import auth
#from oauth2client.client import GoogleCredentials

# Save model with pickle; the with-statement makes sure the file is closed
with open('langidmodels.pkl', 'wb') as f:
    pickle.dump(models, f)


# 1. Authenticate and create the PyDrive client.
#auth.authenticate_user()
#gauth = GoogleAuth()
#gauth.credentials = GoogleCredentials.get_application_default()
#drive = GoogleDrive(gauth)  

# get the folder id where you want to save your file
#file = drive.CreateFile({'parents':[{u'id': folder_id}]})
#file.SetContentFile('langidmodels.pkl')
#file.Upload() 

In [17]:
# Load the pickled models again
with open('langidmodels.pkl', 'rb') as f:
    models = pickle.load(f)