# Language Identification Problem

In order to extract any kind of information from text, the first thing we have to know is what language the text is in. In this assignment you are going to use **character N-gram grammars** to solve the problem of language identification.

Given a document, your goal is to say what language it is written in. We will give you a set of training documents (one in each of 6 languages) and a set of development test documents. You will be graded on an unseen set of 6 test documents. To make the problem tractable, we guarantee that each test document will be in one of the 6 languages you have seen in the training set.

The data you will use is 6 translations of part of the Universal Declaration of Human Rights, which has been translated into many languages. The data for the 6 languages is in the Language Identification folder in the Week 04 folder.

The algorithm you will use requires that you **build 6 separate character bigram grammars**, one for each language, on the training data. In lecture we mostly talked about word bigrams. A character bigram is computed over pairs of adjacent characters instead of pairs of adjacent words. You should use the **simple Bayesian Unigram Prior smoothing method**.
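
As a minimal sketch of what a character bigram is, the snippet below counts adjacent character pairs in a short string. The helper name `char_bigrams` and the sample word are illustrative assumptions, not part of the assignment.

```python
from collections import Counter

def char_bigrams(text):
    """Return the list of adjacent character pairs in text."""
    return list(zip(text, text[1:]))

# "banana" has 5 adjacent pairs: ba, an, na, an, na
bigram_counts = Counter(char_bigrams("banana"))
```

A word bigram model would apply the same idea to a list of tokens rather than to a string of characters.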

For each test document in the dev subfolder, for each of your 6 bigram grammars, you **compute the log-likelihood of the test document given the bigram grammar** (use the log-likelihood instead of the likelihood since it's less likely to underflow). Then you choose as your answer for that document the language that gave the highest log-likelihood.
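
To illustrate why the log-likelihood is preferred, the sketch below multiplies many small probabilities and compares the result with the sum of their logs. The probability value and document length are made-up numbers chosen only to trigger underflow.

```python
import math

p = 0.05   # hypothetical per-character probability
n = 2000   # hypothetical document length in characters

# Multiplying n small probabilities underflows to exactly 0.0 in floating point.
product = 1.0
for _ in range(n):
    product *= p

# Summing logs keeps the same information as a finite negative number.
log_sum = sum(math.log(p) for _ in range(n))
```

Once the product has collapsed to zero, every language would tie, so comparisons have to be made on `log_sum` instead.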

Here's the formal description of the equations you should be computing. First, you want to pick the language, out of the 6 languages, which assigns the highest log probability to the document:

$$\hat{L} = \operatorname*{argmax}_{L} \log P_L(\text{Document})$$

To compute the log probability for each language, you make the Markov (N-gram) assumption, and use a bigram grammar that has been trained on that language: 

$$\log P_L(\text{Document}) = \log P_{\text{smooth}}(\text{char}_1^n) \approx \sum_{i} \log P_{\text{smooth}}(\text{char}_i \mid \text{char}_{i-1})$$

(That was the equation in log-space; in non-log space it would be:)

$$P_L(\text{Document}) = P_{\text{smooth}}(\text{char}_1^n) \approx \prod_{i} P_{\text{smooth}}(\text{char}_i \mid \text{char}_{i-1})$$
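
The argmax over log probabilities can be sketched as follows. The function names `log_likelihood` and `identify`, and the two toy "languages" over the alphabet {a, b}, are assumptions made up for illustration; a real model would come from training data.

```python
import math

def log_likelihood(text, cond_prob):
    """Sum log P(cur | prev) over every adjacent character pair in text."""
    return sum(math.log(cond_prob(prev, cur)) for prev, cur in zip(text, text[1:]))

def identify(text, models):
    """Return the best-scoring language name and the per-language scores."""
    scores = {lang: log_likelihood(text, p) for lang, p in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy model: characters tend to alternate.
def alternating(prev, cur):
    return 0.9 if prev != cur else 0.1

# Hypothetical toy model: characters tend to repeat.
def repeating(prev, cur):
    return 0.9 if prev == cur else 0.1

best, scores = identify("ababab", {"alternating": alternating, "repeating": repeating})
```

Here `best` is the name of the model whose summed log probability is largest, which mirrors the argmax in the first equation.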

Don't forget to add some sort of special START and END characters at the beginning and end of the file.
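
One possible way to add the boundary markers is sketched below. The choice of the control characters `"\x02"` and `"\x03"` as START and END is an assumption; any two symbols that cannot occur in the documents would do.

```python
# Assumed sentinels: control characters that should not appear in the text itself.
START, END = "\x02", "\x03"

def padded_bigrams(text):
    """Wrap text in START/END markers, then return its adjacent character pairs."""
    padded = START + text + END
    return list(zip(padded, padded[1:]))

pairs = padded_bigrams("hi")
```

With the padding in place, the first real character is conditioned on START and the model also scores how likely the document is to end after its last character.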

To train your bigram grammars, use Bayesian Unigram Prior smoothing:

$$P_{\text{smooth}}(\text{char}_i \mid \text{char}_{i-1}) = \frac{C(\text{char}_{i-1}, \text{char}_i) + P(\text{char}_i)}{C(\text{char}_{i-1}) + 1}$$
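
A minimal sketch of this smoothing formula, assuming the training text is a plain Python string and that P(char) is the relative unigram frequency in that text. The names `train` and `p_smooth` are hypothetical. Note that a character never seen in training gets probability 0 under this sketch, so it would need separate handling before taking logs.

```python
from collections import Counter

def train(text):
    """Return a function computing P_smooth(cur | prev) with a unigram prior."""
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    # C(prev): how often prev occurs as the first element of a bigram.
    context = Counter(text[:-1])
    total = sum(unigrams.values())

    def p_smooth(prev, cur):
        p_unigram = unigrams[cur] / total
        return (bigrams[(prev, cur)] + p_unigram) / (context[prev] + 1)

    return p_smooth

p = train("abab")
# Smoothed probabilities of each possible next character after "a".
after_a = {c: p("a", c) for c in "ab"}
```

Because the unigram probabilities sum to 1, the smoothed conditional probabilities for any seen context also sum to 1, which is why the denominator adds exactly 1.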

Please develop your solution in an iPython notebook using the text in the train subfolder. Then test your models on data in the dev subfolder.

The data is in UTF-8 format.

In [1]:
import os
from itertools import chain, islice
from glob import glob
from collections import Counter
import numpy as np
import pandas as pd

In [2]:
# Sliding window over a sequence: yields consecutive n-tuples (n=2 gives the character pairs for the bigram model).
def window(seq, n=2):
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

In [20]:
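# Build and save a smoothed character-bigram matrix for each training file.
for filename in os.listdir("./train"):
    if filename.endswith(".txt"):
        # The data is UTF-8, so set the encoding explicitly; the with-block closes the file.
        with open(os.path.join("./train", filename), 'r', encoding='utf-8') as f:
            lines = f.read()
        #print (filename, lines)

        # Count each (previous char, next char) pair; rows are state1, columns are state2.
        pairs = pd.DataFrame(window(lines), columns=['state1', 'state2'])
        counts = pairs.groupby('state1')['state2'].value_counts()
        counts = counts.unstack().fillna(0)
        # Note: this is add-one smoothing normalized over the whole matrix (a joint
        # distribution over pairs), not the Bayesian Unigram Prior formula given earlier.
        counts = counts + 1
        probs = (counts / counts.sum().sum())
        # The first three letters of the filename identify the language.
        probs.to_pickle(filename[0:3]+"_transition_matrix.pkl")
        print ("For",filename,probs)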

spn.txt Declaración Universal de Derechos Humanos
Adoptada y proclamada por la Asamblea General en su resolución 217 A (III), de
10 de diciembre de 1948
Preámbulo
Considerando que la libertad, la justicia y la paz en el mundo tienen por base el
reconocimiento de la dignidad intrínseca y de los derechos iguales e inalienables
de todos los miembros de la familia humana,
Considerando que el desconocimiento y el menosprecio de los derechos
humanos han originado actos de barbarie ultrajantes para la conciencia de la
humanidad; y que se ha proclamado, como la aspiración más elevada del
hombre, el advenimiento de un mundo en que los seres humanos, liberados del
temor y de la miseria, disfruten de la libertad de palabra y de la libertad de
creencias,
Considerando esencial que los derechos humanos sean protegidos por un
régimen de Derecho, a fin de que el hombre no se vea compelido al supremo
recurso de la rebelión contra la tiranía y la opresión,
Considerando también esencial promover el desar

In [4]:
# Load in matrix and take log of the matrix
eng_matrix = np.log(pd.read_pickle("eng_transition_matrix.pkl"))
esp_matrix = np.log(pd.read_pickle("esp_transition_matrix.pkl"))
dut_matrix = np.log(pd.read_pickle("dut_transition_matrix.pkl"))
frn_matrix = np.log(pd.read_pickle("frn_transition_matrix.pkl"))
ger_matrix = np.log(pd.read_pickle("ger_transition_matrix.pkl"))
spn_matrix = np.log(pd.read_pickle("spn_transition_matrix.pkl"))

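# Smallest log probability in any matrix; used as the fallback score for character pairs missing from a matrix.
absolute_min = min(eng_matrix.min().min(), esp_matrix.min().min(), dut_matrix.min().min(),
                   ger_matrix.min().min(), frn_matrix.min().min(), spn_matrix.min().min())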

In [6]:
# Return the count-weighted log probability of one character pair, falling back to absolute_min for pairs not in the matrix.
def get_prob(transition_matrix, row):
    try:
        get_matrix = transition_matrix.at[row.state1,row.state2]
        return row.counts * get_matrix
    except KeyError:
        return row.counts * absolute_min 

In [54]:
for filename in os.listdir("./dev"):
    mylist=[]
    if filename.endswith(".txt"):
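        # The data is UTF-8, so set the encoding explicitly; the with-block closes the file.
        with open(os.path.join("./dev", filename), 'r', encoding='utf-8') as f:
            text = f.read()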
        
        text = text.splitlines()
        #print(filename, text)
        for line in text:
            lines = line.strip()
            #print(lines)
            list1 = list(lines)
            mylist += list1
                
        pairs = pd.DataFrame(window(mylist), columns=['state1', 'state2'])
        # create freq counts of each (chari-1, chari) pair
        counts = pairs.groupby('state1')['state2'].value_counts().to_frame("counts")
        counts = counts.reset_index()
        
        # get the count-weighted log probability of each pair under each language
        counts['english'] = counts.apply(lambda row: get_prob(eng_matrix, row), axis=1)
        counts['esperanto'] = counts.apply(lambda row: get_prob(esp_matrix, row), axis=1)
        counts['french'] = counts.apply(lambda row: get_prob(frn_matrix, row), axis=1)
        counts['dutch'] = counts.apply(lambda row: get_prob(dut_matrix, row), axis=1)
        counts['germany'] = counts.apply(lambda row: get_prob(ger_matrix, row), axis=1)
        counts['spanish'] = counts.apply(lambda row: get_prob(spn_matrix, row), axis=1)
            
        probs = counts[['english', 'esperanto', 'french', 'dutch', 'germany', 'spanish']].sum(axis=0)
        # subtract the max log-likelihood before exponentiating so the best language maps to exp(0) = 1
        probability = np.exp(probs - probs.max())
        probability = probability / probability.sum()
        print (filename, "is most likely in", probs.idxmax(), "the probability is \n", probability)


spn.txt is most likely in spanish the probability is 
 english      2.136511e-226
esperanto     0.000000e+00
french       3.766188e-215
dutch         0.000000e+00
germany       0.000000e+00
spanish       1.000000e+00
dtype: float64
ger.txt is most likely in dutch the probability is 
 english      4.457761e-113
esperanto     0.000000e+00
french        0.000000e+00
dutch         1.000000e+00
germany       3.756524e-58
spanish       0.000000e+00
dtype: float64
dut.txt is most likely in dutch the probability is 
 english      0.0
esperanto    0.0
french       0.0
dutch        1.0
germany      0.0
spanish      0.0
dtype: float64
esper.txt is most likely in esperanto the probability is 
 english      0.0
esperanto    1.0
french       0.0
dutch        0.0
germany      0.0
spanish      0.0
dtype: float64
frn.txt is most likely in french the probability is 
 english      0.0
esperanto    0.0
french       1.0
dutch        0.0
germany      0.0
spanish      0.0
dtype: float64
eng.txt is most likel