# Assignment 1
In this assignment you will create tools for building and evaluating language models.
The corpora you will work with are lists of tweets in 8 different languages that use the Latin script. For your convenience, each file is provided in both CSV and JSON format. The end goal is to write a set of tools that can detect the language of a given tweet.


As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [None]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 53, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 53 (delta 23), reused 32 (delta 8), pack-reused 0
Unpacking objects: 100% (53/53), done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [None]:
!ls nlp-course/lm-languages-data-new

en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

In [None]:
import pandas as pd
import numpy as np
import os

In [None]:
def preprocess_tweet(tweet):
    tokens = []
    for c in tweet:
        tokens.append(c)
    return tokens

def unique_values_from_list(lst):
    unique_set = set(lst)
    unique_lst = list(unique_set)
    return unique_lst

def add_unique_symbols(text, n):
    prefix = 'א' * (n-1) 
    suffix = 'ת' * (n-1)
    return prefix + text + suffix

def preprocess_file(file_name):
    data_file_pd = pd.read_csv(f'/content/nlp-course/lm-languages-data-new/{file_name}')
    tweets = data_file_pd['tweet_text']
    tweets_tokens = []
    for tweet in tweets:
        tweet = add_unique_symbols(tweet, 2)
        tweet_tokens = preprocess_tweet(tweet)
        tweets_tokens = tweets_tokens + unique_values_from_list(tweet_tokens)
    return unique_values_from_list(tweets_tokens)

In [None]:
directory = os.fsencode('/content/nlp-course/lm-languages-data-new')
files_tokens = []
for file in os.listdir(directory):
     filename = os.fsdecode(file)
     if filename.endswith(".csv") and (filename not in ['test.csv', 'tests.csv']): 
         print(filename)
         files_tokens = files_tokens + preprocess_file(filename)
vocabulary = unique_values_from_list(files_tokens)

es.csv
in.csv
en.csv
fr.csv
nl.csv
it.csv
pt.csv
tl.csv
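
For comparison, the same idea can be expressed with a single set that is updated tweet by tweet. This is only a minimal sketch on made-up strings, not the course data; the names `build_char_vocabulary` and `sample_tweets` are hypothetical, and the padding symbols used by the cell above are left out.

```python
from typing import Iterable, List


def build_char_vocabulary(tweets: Iterable[str]) -> List[str]:
    """Return a sorted list of every distinct character seen in `tweets`."""
    vocab = set()
    for tweet in tweets:
        vocab.update(tweet)  # iterating a str yields one character at a time
    return sorted(vocab)


# Made-up examples for illustration only (not taken from the data files).
sample_tweets = ["hola", "hello", "ol\u00e1"]
sample_vocabulary = build_char_vocabulary(sample_tweets)
print(sample_vocabulary)
```

Updating one set avoids repeatedly concatenating lists, which can become slow on large files.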


**Part 2**

Write a function *lm* that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant contexts of n-1 tokens, and the values are dictionaries that map each possible n-th token to its probability of occurring after that context. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.
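
To make the structure concrete, here is a small sketch of how such a dictionary might be queried. The helper names `next_char_probability` and `most_likely_next` are hypothetical and only serve this illustration; the model is the example above.

```python
example_model = {
    "ab": {"c": 0.5, "b": 0.25, "d": 0.25},
    "ca": {"a": 0.2, "b": 0.7, "d": 0.1},
}


def next_char_probability(model, context, char):
    """Look up P(char | context); unseen contexts or characters get 0.0."""
    return model.get(context, {}).get(char, 0.0)


def most_likely_next(model, context):
    """Return the most probable next character, or None for an unseen context."""
    distribution = model.get(context)
    if not distribution:
        return None
    return max(distribution, key=distribution.get)


print(next_char_probability(example_model, "ab", "c"))
print(most_likely_next(example_model, "ca"))
```

Without smoothing, the probabilities stored under each context should sum to 1.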

Note: think about how to represent the add_one smoothing information in the dictionary, and implement it.
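
As a reminder of what add-one (Laplace) smoothing computes, the sketch below smooths the counts of a single context: every character in the vocabulary gets its count plus 1, divided by the context total plus the vocabulary size. The function name, the toy vocabulary and the toy counts are assumptions made for illustration.

```python
def add_one_probabilities(counts, vocabulary):
    """Return add-one smoothed probabilities for one context.

    counts: dict mapping each seen next-character to its raw count.
    vocabulary: list of all distinct characters in the corpus.
    """
    denominator = sum(counts.values()) + len(vocabulary)
    return {char: (counts.get(char, 0) + 1) / denominator for char in vocabulary}


# Toy data, chosen only to keep the arithmetic easy to follow.
toy_vocabulary = ["a", "b", "c", "d", "e"]
toy_counts = {"c": 2, "b": 1, "d": 1}
smoothed = add_one_probabilities(toy_counts, toy_vocabulary)
print(smoothed)
```

Storing an entry for every vocabulary character under every context is wasteful, since all unseen characters of a context share the same probability. One compact alternative is to keep a single fallback entry per context for them.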

In [None]:
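def calculate_probas(model, vocabulary, add_one):
    # Convert the counts stored under every context into probabilities, in place.
    for counts in model.values():
        total_counts = sum(counts.values())
        if add_one:
            # add_one_func already added 1 to each seen token and stored a
            # 'default' pseudo-count of 1, so subtract those len(counts) extras
            # to recover the raw total before adding |V| to the denominator.
            total_counts = total_counts - len(counts) + len(vocabulary)
        for token, count in counts.items():
            counts[token] = count / total_counts
    return model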

def add_default_zero(model):
    for _, counts in model.items():
        counts.update({'default' : 0})
    return model

def add_one_func(model):
    for _, counts in model.items():
        for token in list(counts.items()):
            counts.update({token[0] : token[1] + 1})
        counts.update({'default' : 1})
    return model

In [None]:
def lm(n, vocabulary, data_file_path, add_one):
    # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
    # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
    # data_file_path - the data_file from which we record probabilities for our model
    # add_one - True/False (use add_one smoothing or not)
    data = pd.read_csv(data_file_path)
    model = {}
    for tweet in data['tweet_text']:
        tweet = add_unique_symbols(tweet, n)
        for i in range(len(tweet) - n + 1):
            ngram = tweet[i:i+n]
            if ngram[0:n-1] not in model.keys():
                model.update({ngram[0:n-1] : {}})
            counts = model[ngram[0:n-1]]
            if ngram[n-1:n] in counts.keys():
                current_val = counts.get(ngram[n-1:n]) + 1
            else:
                current_val = 1
            counts.update({ngram[n-1:n] : current_val})
    if add_one:
        model = add_one_func(model)
    else:
        model = add_default_zero(model)
    model = calculate_probas(model, vocabulary, add_one)
    return model

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.

In [None]:
def evaluate(n, model, data_file):
    # n - the n-gram that you used to build your model (must be the same number)
    # model - the dictionary (model) to use for calculating perplexity
    # data_file - the tweets DataFrame that you wish to calculate a perplexity score for
    tweets = data_file['tweet_text']
    total_entropy = []
    for tweet in tweets:
        tweet = add_unique_symbols(tweet, n)
        entropy = []
        for i in range(len(tweet) - n + 1):
            ngram = tweet[i:i+n]
            proba = model.get(ngram[0:n-1],{}).get(ngram[n-1:], model.get(ngram[0:n-1],{}).get('default',0))
            if proba != 0:
                entropy.append(-1 * np.log2(proba))
        tweet_entropy = sum(entropy)
        total_entropy.append(tweet_entropy / (len(tweet) - n + 1))
    H = sum(total_entropy) / len(tweets)
    perplexity = 2 ** H
    return perplexity
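
As a sanity check on the definition, perplexity is 2 raised to the average negative log2-probability per token. The sketch below applies this to a plain list of per-token probabilities; the function name `perplexity_from_probabilities` is hypothetical and the inputs are toy values.

```python
import math


def perplexity_from_probabilities(probabilities):
    """Perplexity of a sequence, given the model probability of each token."""
    entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** entropy


# A model that is uniformly unsure between 4 characters has perplexity 4.
uniform = [0.25] * 8
print(perplexity_from_probabilities(uniform))

# Mixed confidence: the log-probabilities average to -2, so perplexity is again 4.
mixed = [0.5, 0.125]
print(perplexity_from_probabilities(mixed))
```

Note that *evaluate* above skips n-grams whose probability is 0 instead of treating them as infinitely costly, so perplexities of unsmoothed models are optimistic.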

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es, ...). The function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], and every row should be labeled with one of the languages. The values are the corresponding perplexity scores.

In [None]:
def match(n, add_one):
    # n - the n-gram to use for creating n-gram models
    # add_one - use add_one smoothing or not
    languages = ['en','es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
    file_path = '/content/nlp-course/lm-languages-data-new/'
    perplexity = []
    for model_lang in languages:
        model = lm(n, vocabulary, f'{file_path}{model_lang}.csv', add_one)
        values = []
        for value_lang in languages:
            data_file = pd.read_csv(f'{file_path}{value_lang}.csv')
            values.append(float('%.2f' % evaluate(n, model, data_file)))
        perplexity.append(values)
    return pd.DataFrame(perplexity, index = languages, columns = languages)

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [None]:
for n in range(1,5):
    for add_one in [True, False]:
        print(f'n = {n}, add_one = {add_one}:')
        print(match(n, add_one))

n = 1, add_one = True:
       en     es     fr     in     it     nl     pt     tl
en  38.18  39.84  41.66  41.16  40.50  39.71  41.68  43.31
es  41.73  35.42  40.07  43.51  40.08  41.74  38.06  45.93
fr  41.34  38.81  36.90  44.36  40.05  41.23  39.53  48.02
in  42.08  42.19  45.77  37.12  43.11  41.78  43.68  40.92
it  41.15  38.05  39.70  43.36  37.69  41.27  39.62  45.22
nl  40.45  39.77  40.95  41.32  40.88  37.78  41.12  44.95
pt  42.17  36.62  39.78  42.84  40.44  41.78  35.50  45.70
tl  41.58  41.58  46.20  38.60  42.28  42.47  43.16  39.11
n = 1, add_one = False:
       en     es     fr     in     it     nl     pt     tl
en  38.10  37.46  38.45  40.33  38.34  39.20  36.92  42.59
es  41.11  35.34  38.04  42.54  38.50  41.01  34.57  45.06
fr  40.56  38.14  36.82  43.41  39.32  40.63  36.71  47.15
in  41.45  36.98  43.56  37.02  41.11  41.16  37.25  40.31
it  40.27  37.52  38.18  42.32  37.60  40.64  35.99  44.34
nl  39.56  38.54  40.15  40.36  39.79  37.70  38.19  44.06
pt  41.33
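
One way to read these tables: rows are models and columns are data files, so for each column the lowest value should sit on the diagonal if the matching model fits its own language best. The sketch below checks this on the first three rows and columns of the n = 1, add_one = True table above, using plain nested dictionaries; the helper name `best_model_per_data_file` is hypothetical.

```python
# Rows are model languages, inner keys are data-file languages
# (values copied from the n = 1, add_one = True table).
table = {
    "en": {"en": 38.18, "es": 39.84, "fr": 41.66},
    "es": {"en": 41.73, "es": 35.42, "fr": 40.07},
    "fr": {"en": 41.34, "es": 38.81, "fr": 36.90},
}


def best_model_per_data_file(table):
    """For every data-file language, return the model with the lowest perplexity."""
    data_langs = list(next(iter(table.values())))
    return {
        data_lang: min(table, key=lambda model_lang: table[model_lang][data_lang])
        for data_lang in data_langs
    }


best = best_model_per_data_file(table)
print(best)
```

The same column-wise argmin is the basis of the classifier in Part 6, applied to one tweet at a time.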

**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g., returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.

In [None]:
def evaluate_tweet(n, model, tweet):
    # n - the n-gram that you used to build your model (must be the same number)
    # model - the dictionary (model) to use for calculating perplexity
    # tweet - the tweet (a DataFrame row) that you wish to calculate a perplexity score for
    tweet_text = tweet['tweet_text']
    tweet_text = add_unique_symbols(tweet_text, n)
    entropy = []
    for i in range(len(tweet_text) - n + 1):
        ngram = tweet_text[i:i+n]
        proba = model.get(ngram[0:n-1],{}).get(ngram[n-1:], model.get(ngram[0:n-1],{}).get('default',0))
        if proba != 0:
            entropy.append(-1 * np.log2(proba))
    tweet_entropy = sum(entropy) / (len(tweet_text) - n + 1)
    tweet_perplexity = 2 ** tweet_entropy
    return tweet_perplexity

In [None]:
# create all models for classification

languages = ['en','es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
file_path = '/content/nlp-course/lm-languages-data-new/'
models = {}

for lang in languages:
    for n in range(1,6):
        for add_one in [True, False]:
            model = lm(n, vocabulary, f'{file_path}{lang}.csv', add_one)
            if lang in models:
                if n in models[lang]:
                    models[lang][n].update({add_one : model})
                else:
                    models[lang].update({n : {add_one : model}})
            else:
                models.update({lang: {n : {add_one:model}}})

### Classification using majority voting

In [None]:
def classify(models):
    languages = ['en','es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
    file_path = '/content/nlp-course/lm-languages-data-new/'
    data_file  = pd.read_csv(f'{file_path}test.csv')

    final_predictions = []
    for tweet_idx in range(0, len(data_file)):
        tweet_predictions = []
        tweet = data_file.iloc[tweet_idx]
        for n in range(1, 5):
            for add_one in [True, False]:  # [True, False]
                perplexity = {}
                for model_lang in languages:
                    model = models[model_lang][n][add_one]
                    perplexity.update({model_lang : evaluate_tweet(n, model, tweet)})
                tweet_predictions.append(min(perplexity, key=perplexity.get))
        final_predictions.append((tweet['tweet_id'],max(tweet_predictions,key=tweet_predictions.count)))
    return final_predictions
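
The voting step itself can be isolated into a small helper. This sketch uses `collections.Counter`, which counts all votes in one pass instead of calling `list.count` once per element; the function name `majority_vote` and the vote list are made up for illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label; ties go to the label encountered first."""
    return Counter(votes).most_common(1)[0][0]


# Eight hypothetical votes, one per (n, add_one) combination.
votes = ["en", "en", "pt", "en", "it", "en", "en", "pt"]
print(majority_vote(votes))
```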

In [None]:
final_predictions = classify(models)
predictions_df = pd.DataFrame(final_predictions)

In [None]:
labels_df = pd.read_csv('/content/nlp-course/lm-languages-data-new/test.csv')

In [None]:
# used macro average for balanced dataset
from sklearn.metrics import f1_score
f1_score(labels_df['label'],predictions_df[1], average='macro')

0.873523695552922

### Classification with predictive model (n-gram predictions as features)

In [None]:
def create_dataset(language, models, data_file):
    # data file is a file of specific language
    languages = ['en','es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
    final_features = []
    for tweet_idx in range(0, len(data_file)):
        tweet_predictions = []
        tweet = data_file.iloc[tweet_idx]
        tweet_predictions.append(tweet['tweet_id'])
        for n in range(1, 6):
            for add_one in [True, False]:  # [True, False]
                perplexity = {}
                for model_lang in languages:
                    model = models[model_lang][n][add_one]
                    perplexity.update({model_lang : evaluate_tweet(n, model, tweet)})
                tweet_predictions.append(min(perplexity, key=perplexity.get))
        tweet_predictions.append(language)
        final_features.append(tweet_predictions)
    df = pd.DataFrame(final_features, columns = ['tweet_id'] + ["feature_"+str(x) for x in range(1, (len(tweet_predictions) - 1))] + ['label'])
    # tweet_id stays a regular column here; callers set it as the index later.
    return df

In [None]:
! pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/47/80/8e9c57ec32dfed6ba2922bc5c96462cbf8596ce1a6f5de532ad1e43e53fe/catboost-0.25.1-cp37-none-manylinux1_x86_64.whl (67.3MB)
     |████████████████████████████████| 67.3MB 80kB/s 
Installing collected packages: catboost
Successfully installed catboost-0.25.1


In [None]:
file_path = '/content/nlp-course/lm-languages-data-new/'
languages = ['en','es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']

feature_matrix = None
for lang in languages:
    data_file  = pd.read_csv(f'{file_path}{lang}.csv')
    df = create_dataset(lang, models, data_file)
    feature_matrix = df if feature_matrix is None else pd.concat([feature_matrix, df]) 
feature_matrix

,tweet_id,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,label
0,845395018743459840,en,en,en,en,en,en,en,en,es,pt,en
1,845395017917173760,en,en,en,en,en,en,en,pt,it,pt,en
2,845395018760306693,en,en,en,en,en,en,en,pt,it,pt,en
3,845395018336649216,it,it,en,en,en,tl,tl,tl,tl,nl,en
4,845395018751856642,en,en,en,en,en,en,en,pt,pt,fr,en
...,...,...,...,...,...,...,...,...,...,...,...,...
8995,829190064584392704,tl,tl,tl,tl,tl,pt,pt,es,es,es,tl
8996,829190068803866625,tl,tl,tl,tl,tl,tl,tl,fr,fr,fr,tl
8997,829190072998232065,pt,pt,tl,tl,tl,pt,nl,pt,pt,fr,tl
8998,829190135883395072,tl,tl,tl,tl,tl,tl,tl,it,en,it,tl


In [None]:
feature_matrix2 = feature_matrix.set_index(feature_matrix.columns[0])
feature_matrix2

tweet_id,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,label
845395018743459840,en,en,en,en,en,en,en,en,es,pt,en
845395017917173760,en,en,en,en,en,en,en,pt,it,pt,en
845395018760306693,en,en,en,en,en,en,en,pt,it,pt,en
845395018336649216,it,it,en,en,en,tl,tl,tl,tl,nl,en
845395018751856642,en,en,en,en,en,en,en,pt,pt,fr,en
...,...,...,...,...,...,...,...,...,...,...,...
829190064584392704,tl,tl,tl,tl,tl,pt,pt,es,es,es,tl
829190068803866625,tl,tl,tl,tl,tl,tl,tl,fr,fr,fr,tl
829190072998232065,pt,pt,tl,tl,tl,pt,nl,pt,pt,fr,tl
829190135883395072,tl,tl,tl,tl,tl,tl,tl,it,en,it,tl


In [None]:
file_path = '/content/nlp-course/lm-languages-data-new/'

data_file  = pd.read_csv(f'{file_path}test.csv')
test_matrix = create_dataset('test', models, data_file)
# The true labels are merged in a later cell; for now every row carries the placeholder label 'test'.
test_matrix

,tweet_id,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,label
0,845394879479996416,en,en,en,en,en,en,en,pt,pt,it,test
1,836313846675619841,it,it,it,it,it,it,it,nl,nl,nl,test
2,836259442328940544,es,in,tl,tl,tl,tl,tl,fr,in,es,test
3,847729104472358912,nl,nl,nl,nl,nl,nl,nl,it,it,in,test
4,836491739699412992,tl,tl,tl,tl,tl,tl,tl,pt,pt,fr,test
...,...,...,...,...,...,...,...,...,...,...,...,...
7994,836250659464761344,en,en,es,fr,es,pt,es,pt,pt,in,test
7995,847676283089637380,in,in,in,in,in,in,in,it,es,en,test
7996,836319299279138816,tl,tl,it,it,it,it,it,en,en,en,test
7997,836258179847716865,pt,pt,pt,pt,pt,es,pt,es,es,tl,test


In [None]:
test_matrix2 = test_matrix.merge(data_file, on='tweet_id', how='inner').drop(['label_x', 'tweet_text'], axis=1).rename(columns = {'label_y' : 'label'})
test_matrix2 = test_matrix2.set_index(test_matrix2.columns[0])
test_matrix2

tweet_id,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,label
845394879479996416,en,en,en,en,en,en,en,pt,pt,it,en
836313846675619841,it,it,it,it,it,it,it,nl,nl,nl,it
836259442328940544,es,in,tl,tl,tl,tl,tl,fr,in,es,tl
847729104472358912,nl,nl,nl,nl,nl,nl,nl,it,it,in,nl
836491739699412992,tl,tl,tl,tl,tl,tl,tl,pt,pt,fr,tl
...,...,...,...,...,...,...,...,...,...,...,...
836250659464761344,en,en,es,fr,es,pt,es,pt,pt,in,es
847676283089637380,in,in,in,in,in,in,in,it,es,en,in
836319299279138816,tl,tl,it,it,it,it,it,en,en,en,it
836258179847716865,pt,pt,pt,pt,pt,es,pt,es,es,tl,pt


In [None]:
from catboost import Pool, CatBoostClassifier

def catboost_classification(feature_matrix, test_matrix):
    train = feature_matrix[feature_matrix.columns[:-1]]
    label = feature_matrix[feature_matrix.columns[-1:]]
    cat_features = range(0,10)
    train_dataset = Pool(data = train, label = label, cat_features=cat_features)

    model = CatBoostClassifier(iterations=100,
                           learning_rate=0.05,
                           loss_function='MultiClass')
    model.fit(train_dataset)

    test = test_matrix[feature_matrix.columns[:-1]]
    label = test_matrix[feature_matrix.columns[-1:]]
    cat_features = range(0,10)
    test_dataset = Pool(data = test, label = label, cat_features=cat_features)
    preds_class = model.predict(test_dataset)
    return preds_class

In [None]:
preds = catboost_classification(feature_matrix2, test_matrix2)

0:	learn: 1.8063775	total: 1.58s	remaining: 2m 36s
1:	learn: 1.6052523	total: 3s	remaining: 2m 27s
2:	learn: 1.4610137	total: 4.38s	remaining: 2m 21s
3:	learn: 1.3393740	total: 5.75s	remaining: 2m 17s
4:	learn: 1.2439115	total: 7.1s	remaining: 2m 14s
5:	learn: 1.1583623	total: 8.52s	remaining: 2m 13s
6:	learn: 1.0849130	total: 9.87s	remaining: 2m 11s
7:	learn: 1.0205702	total: 11.2s	remaining: 2m 9s
8:	learn: 0.9644316	total: 12.6s	remaining: 2m 7s
9:	learn: 0.9174021	total: 14s	remaining: 2m 5s
10:	learn: 0.8750675	total: 15.3s	remaining: 2m 3s
11:	learn: 0.8343682	total: 16.7s	remaining: 2m 2s
12:	learn: 0.7999147	total: 18.1s	remaining: 2m 1s
13:	learn: 0.7665839	total: 19.5s	remaining: 1m 59s
14:	learn: 0.7365786	total: 20.9s	remaining: 1m 58s
15:	learn: 0.7091301	total: 22.2s	remaining: 1m 56s
16:	learn: 0.6838994	total: 23.7s	remaining: 1m 55s
17:	learn: 0.6609531	total: 25.1s	remaining: 1m 54s
18:	learn: 0.6398980	total: 26.5s	remaining: 1m 52s
19:	learn: 0.6213680	total: 27.9s	

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 


In [None]:
# used macro average for balanced dataset
from sklearn.metrics import f1_score
f1_score(test_matrix2['label'], preds, average='macro')

0.9155830867335729
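
To see what the macro average does, the score can also be computed by hand: take the F1 of each label separately, then average them with equal weight. The sketch below is a plain-Python illustration on made-up labels; the function name `macro_f1` is hypothetical, and on the same inputs it is expected to agree with `f1_score(..., average='macro')`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: per-label F1 is 0.5 (en), 0.8 (es) and 2/3 (fr).
toy_true = ["en", "en", "es", "es", "fr", "fr"]
toy_pred = ["en", "es", "es", "es", "fr", "en"]
print(macro_f1(toy_true, toy_pred))
```

Because every language counts equally regardless of how many tweets it has, the macro average is a natural choice when the test set is balanced.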

# **Good luck!**