# A Language Model for Cuneiform Texts 

## Introduction
In this notebook I show how to construct a simple language model to identify the language or dialect in which cuneiform texts are written. Generally speaking, a [language model](https://en.wikipedia.org/wiki/Language_model) is a probability distribution over sequences of words. Within this broad definition, we find different types of models. Some of the most common use the concept of the _n-gram_, i.e. a sequence of n consecutive items (here, characters) into which we split the text. N-gram models are common in the NLP literature. Their applications include speech recognition, machine translation, and handwriting recognition. Not surprisingly, there are some well-known packages in Python, such as [NLTK](http://www.nltk.org/), that make constructing an N-gram model a relatively easy task. Here, I use an N-gram model to estimate the probability that a sequence of cuneiform n-grams belongs to a particular Ancient Mesopotamian language or dialect.

Now, as you may have already guessed, I wouldn't be writing this post if I had followed that easy path for this challenge. On the contrary, this notebook is about how to construct an N-gram model from scratch. In fact, the only libraries the model itself uses are `numpy` and `pandas` (`matplotlib` and `scikit-learn` appear only for plotting and evaluation). There are two reasons why I decided to do it this way: first of all, standard NLP libraries are designed to work with Latin characters, so adapting them to cuneiform text is a bit of a pain. Second, it's fun! I am a firm believer that the best way of testing your knowledge of a subject is to go back to basics and see if you can code it with as few specialised libraries as possible. That is what I do in this notebook.

The N-gram model implementation that I have come up with is inspired by the type of N-gram models that use Markov chains and the Markov assumption. Simply put, this assumption states that the probability of an n-gram in position $t$ depends only on the previous $k$ n-grams (where $k$ is usually 1). Hence, the probability of a sequence can be computed by taking the product of the conditional probabilities of its n-grams.
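
Written out, with $g_t$ denoting the n-gram in position $t$ of a sequence of length $T$ (this notation is introduced here only for illustration, and $g_0$ is taken to be a beginning-of-sequence marker), the chain rule followed by the Markov assumption with $k = 1$ gives:

$$
P(g_1, \dots, g_T) = \prod_{t=1}^{T} P(g_t \mid g_1, \dots, g_{t-1}) \approx \prod_{t=1}^{T} P(g_t \mid g_{t-1})
$$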

My code does not do _exactly_ that, but something simpler. It computes the probability of each bigram ($n$ = 2) in a training set for each of the languages, and uses these probabilities to estimate how likely it is that an observed sequence belongs to each language. The language with the highest probability is assigned as the prediction.
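
To make the idea concrete before the full implementation, here is a minimal, self-contained sketch of the same approach on toy data. It is not the notebook's code: Latin letters stand in for cuneiform signs, the language names `lang_a` and `lang_b` are made up, and the names `bigrams`, `fit`, `score`, `predict` and `UNSEEN_PROB` are chosen here purely for illustration.

```python
import math
from collections import Counter

# Toy training data: Latin letters stand in for cuneiform signs (illustrative only).
train = {
    "lang_a": ["abab", "abba", "aabb"],
    "lang_b": ["cdcd", "dccd", "cddc"],
}

UNSEEN_PROB = 1e-5  # floor for bigrams never seen in a language's training set


def bigrams(text):
    """Return the character bigrams of text, with B/E boundary markers added."""
    tokens = ["B"] + list(text) + ["E"]
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]


def fit(train):
    """Estimate, per language, the relative frequency of each bigram."""
    probs = {}
    for lang, docs in train.items():
        counts = Counter(bg for doc in docs for bg in bigrams(doc))
        total = sum(counts.values())
        probs[lang] = {bg: c / total for bg, c in counts.items()}
    return probs


def score(text, lang_probs):
    """Sum the log relative frequencies of the bigrams in text."""
    return sum(math.log(lang_probs.get(bg, UNSEEN_PROB)) for bg in bigrams(text))


def predict(text, probs):
    """Return the language whose bigram table gives text the highest score."""
    return max(probs, key=lambda lang: score(text, probs[lang]))


probs = fit(train)
print(predict("abab", probs))  # lang_a
print(predict("dcdc", probs))  # lang_b
```

Each language gets its own table of bigram relative frequencies, and a sequence is assigned to the language under which its bigrams are, taken together, most frequent.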

Obviously, more complex methods can be developed. However, I have found that this very simple model already performs really well on a test set, with a weighted F1 score of ~0.8 (depending on how the data is split into training and test sets).

Hope you find it interesting!

## Libraries and data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
# Read data in
df = pd.read_csv('../input/cuneiform-language-identification/train.csv')
print('Shape:', df.shape)
df.sample(10)

## Implementation
The cells below show the helper functions and class that make up the implementation of the simple model explained in the introduction. The class `CuneiPy` follows the classic Scikit-Learn logic of `fit()` and `predict()`.

This implementation has its limitations, of course. Perhaps the most important is that the prediction part is slow, so if your test set is large, hit the button and go grab a coffee!

In [None]:
# Helper functions
## Preprocess cuneiform data
def preprocess(input):
    '''
    Preprocess lines in cuneiform script. Performs two steps: (1) Adds beginning- and end-of-sentence tokens
    (B and E, respectively) (2) Adds a space between characters. Parameters:
    :input: Pandas Series, each line being a document in cuneiform alphabet with no separation between characters.
    '''
    # Add beginning- and end-of-sentence characters to each line - just a B and an E
    input_mod = 'B' + input + 'E'

    # Split each line by character, separating characters with a space
    # (element-wise, so it works regardless of the index of the Series)
    input_mod = input_mod.apply(' '.join)

    return input_mod

## Count ngrams and put them in a dictionary
## Elaborated from https://stackoverflow.com/questions/13423919/computing-n-grams-using-python
def ngrams(input, n):
    '''
    Create a dictionary with each ngram and the number of occurrences
    :input: string of characters separated by a white space (' ')
    :n: int. Length of ngram
    '''
    input = input.split(' ')
    output = {}
   
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
       
    for key in list(output):
        if key[0] == 'E': # Drops 'E B' bigram
            del output[key]

    return output

## Get a list of ngrams - similar to ngrams function, but this one doesn't return a dictionary with frequencies
def ngram_list(input, n):
    '''
    Return a list of ngrams within sentence. Params:
    :input: String. Sequence of cuneiform characters
    :n: Int. length of ngram. 
    '''
    input = input.split(' ')
    _ngrams = []
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        _ngrams.append(g)
    return _ngrams

## Calculate log probability of a sequence
def logprob(input, lang, probs):
    '''
    Calculate the log probability of a sequence of ngrams. Params:
    :input: list of ngrams
    :lang: language or dialect code out of ['NEA', 'LTB', 'MPB', 'OLB', 'NEB', 'SUX', 'STB']
    :probs: dictionary of dataframes (one per language) with columns 'Bigram' and 'Probability'
    '''
    # Build a lookup table once per call instead of scanning the dataframe for every ngram
    lookup = dict(zip(probs[lang]['Bigram'], probs[lang]['Probability']))

    _logprob = 0.0
    for _ngram in input:  # Iterate over the ngrams of the sequence
        # Add the log probability of the ngram if it was seen in training;
        # otherwise update by the log of a very low probability (almost zero)
        _logprob += np.log(lookup.get(_ngram, 0.00001))

    return float(_logprob)
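
A short aside on why `logprob` sums logarithms instead of multiplying the probabilities directly. The sketch below uses made-up values (one hundred bigrams, each with probability 0.00001) and only the standard library; it is an illustration of the numerical issue, not part of the model.

```python
import math

# Hypothetical per-bigram probabilities for a long sequence (illustrative values).
probs = [1e-5] * 100

# Multiplying raw probabilities underflows to 0.0 in double precision.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the same information in a representable range.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # a large negative but finite number
```

Because the logarithm is monotonic, the language with the highest summed log probability is also the one with the highest product, so the ranking of languages is unchanged.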

In [None]:
# Define language model class
class CuneiPy:

    def __init__(self):
        self.languages = []
        self.train_data = pd.DataFrame()
        self._dfs = {}
        self._freqs = {}
        self._probs = {}
        self.n = None
        self.predictions = pd.DataFrame()

    def fit(self, _df, input, target, n):
        '''
        Fits the model; i.e. calculates probability matrices with ngrams of length n for each language.
        Parameters:
        :_df: Pandas dataframe with training data.
        :input: String. Name of column containing text in cuneiform alphabet. 
        :target: String. Name of column containing the target values of our model. Values should be strings
            such as "SUX" for Sumerian, "LTB" for Late Babylonian, etc.
        :n: Integer. Length of n-grams.
        '''
        self.languages = list(set(_df[target]))
        self.train_data = _df
        self.n = n

        # Preprocess training data
        self.train_data['cuneiform_mod'] = preprocess(_df[input])

        # Split dataframes by language
        for lang in self.languages:
            self._dfs[lang] = self.train_data[self.train_data[target] == lang].reset_index(drop = True)

        # Put frequencies in dictionaries by language
        for lang in self.languages:
            self._freqs[lang] = ngrams(" ".join(self._dfs[lang]['cuneiform_mod']), n)

        # Put probabilities in a dictionary of dataframes - These are our probability matrices
        for lang in self.languages:
            self._probs[lang] = pd.DataFrame(list(self._freqs[lang].items()),columns = ['Bigram','Frequency']).sort_values(by = 'Frequency', ascending=False)
            self._probs[lang]['Probability'] = self._probs[lang]['Frequency'] / sum(self._probs[lang]['Frequency'])

    def predict(self, input):
        '''
        Predict language of a given sequence in cuneiform script. Parameters:
        :input: Pandas series containing text in cuneiform alphabet. Each observation will get assigned a 
            predicted language. 
        '''
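        # Start from a fresh dataframe with a default index, so that repeated calls
        # with inputs of different length (or a shuffled index) do not misalign
        self.predictions = pd.DataFrame({'cuneiform': input.reset_index(drop = True)})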

        # Preprocess text, just like in training
        self.predictions['cuneiform_mod'] = preprocess(self.predictions['cuneiform']) 

        # Get all n-grams from the text
        self.predictions['ngrams'] = [ngram_list(self.predictions['cuneiform_mod'][i], self.n) for i in range(len(self.predictions))]

        # Return a column with the probability of the sequence belonging to each language
        for lang in self.languages:
            self.predictions[lang] = self.predictions.apply(lambda x: logprob(x['ngrams'], lang, self._probs), axis = 1)

        # Predict label - Name of the column for which log probability is maximised
        self.predictions['lang_pred'] = self.predictions[self.languages].idxmax(axis = 1)

        return self.predictions['lang_pred']


## Construct the model

In [None]:
# Split data into training and testing sets
df_test = df.sample(n = 1000, random_state = 10)
df_train = df.drop(df_test.index)

df_test = df_test.reset_index(drop = True)
df_train = df_train.reset_index(drop = True)

In [None]:
# Initialise model
cp = CuneiPy()

# Fit
cp.fit(_df = df_train, 
       input = 'cuneiform',
       target = 'lang',
       n = 2)

## Results

In [None]:
# Predict on test set
df_test['lang_pred'] = cp.predict(df_test['cuneiform']).values

In [None]:
# Classification report
print(classification_report(df_test['lang'], df_test['lang_pred']))

In [None]:
# Plot confusion matrix
cm = confusion_matrix(df_test['lang'], df_test['lang_pred'])
fig, ax = plt.subplots(figsize=(8, 8))
ax.matshow(cm, cmap=plt.cm.Reds, alpha=0.5)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i,s=cm[i, j], va='center', ha='center', size='large')

classNames = ['LTB','MPB', 'NEA', 'NEB', 'OLB', 'STB', 'SUX']
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, fontsize = 10)
plt.yticks(tick_marks, classNames, fontsize = 10)
plt.xlabel('Predictions', fontsize=12)
plt.ylabel('Actuals', fontsize=12)
plt.title('Classification of cuneiform texts by language',fontsize = 18)

plt.show()

Not too bad! We get a weighted F1 score of 0.81. Not surprisingly, the languages with less support in the training sample, such as Middle Babylonian Peripheral (MPB) or Neo-Babylonian (NEB), get the lowest scores, but overall the model seems to perform relatively well for its level of complexity.
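
For readers unfamiliar with the weighted F1 score, here is a minimal sketch of how it can be computed by hand. The labels below are invented for illustration and are not the notebook's results, and the helper names `f1_per_class` and `weighted_f1` are my own. The idea mirrors the `weighted avg` row of `classification_report`: each class's F1 score is weighted by its support, i.e. the number of true examples of that class.

```python
from collections import Counter

# Hypothetical true and predicted labels (illustrative, not the notebook's results).
y_true = ["SUX", "SUX", "SUX", "SUX", "LTB", "LTB", "LTB", "NEB"]
y_pred = ["SUX", "SUX", "SUX", "LTB", "LTB", "LTB", "NEB", "LTB"]


def f1_per_class(y_true, y_pred):
    """Return {label: F1} from true positives, false positives and false negatives."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    scores = f1_per_class(y_true, y_pred)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


print(f1_per_class(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

In this toy example the class with a single true example scores an F1 of zero, yet it barely moves the weighted average, which is why a good overall score can coexist with weak results on the rarest languages.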

## Parting thoughts
Thanks for reading! This is my first Kaggle notebook, so if you have any feedback, it's more than welcome!

If you found the notebook interesting and you want to reach out to me, feel free to contact me on [LinkedIn](https://www.linkedin.com/in/alvaro-corrales-cano/) or via email.

You can read more about my approach in [this blog post](https://towardsdatascience.com/assyrian-or-babylonian-language-identification-in-cuneiform-texts-4f15a14a5d70) that I published in Towards Data Science.