# NLP - Assignment 1
By Millis Sahar

In this assignment you will be creating tools for learning and testing language models.  
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script.   
The data is provided either formatted as CSV or as JSON, for your convenience.  
The end goal is to write a set of tools that can detect the language of a given tweet.

#### disable warnings

In [None]:
import warnings
warnings.filterwarnings('ignore')

#### disable autoscrolling

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

#### Google Drive

In [None]:
from google.colab import drive  
drive.mount(r'/content/drive/',force_remount=True) 

data_dir = 'drive/My Drive/Colab Notebooks/NLP/HW1'
# os.listdir(data_dir)

Mounted at /content/drive/


#### imports

In [None]:
import os,io
import numpy as np
import pandas as pd

from IPython.display import display, HTML

#### tokens

In [None]:
start_token = '<start>'
end_token = '<end>'
unknown_token = 'UNK'
unknown_token_value = 1e-8

*As preparation for this task, place the data files somewhere in your drive so that you can access them from this notebook. The files are available for download from the Moodle assignment activity.*

The relevant files are:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)





**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

Note:  
My approach reads each file as raw text, so it also scans the "id" of each row, but it is faster.   
The downside is that the digits of the ids are added to our vocabulary.  
Since there are relatively few digits, this is acceptable.  
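
To illustrate the token definition, here is a minimal sketch of a character-level vocabulary built from two made-up tweets. The names `sample_tweets` and `vocab_list` are hypothetical and only serve this example; the real data comes from the CSV files.

```python
# Hypothetical sample tweets; the real corpus is read from the data files.
sample_tweets = ["hola señor", "hello"]

# Each token is a single character, so the vocabulary is simply
# the set of all characters that appear at least once.
vocab = set()
for tweet in sample_tweets:
    vocab.update(tweet)  # updating a set with a string adds its characters

vocab_list = sorted(vocab)
print(vocab_list)
```

Non-ASCII characters such as `ñ` are ordinary tokens here, because a Python `str` is a sequence of Unicode characters.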

In [None]:
def preprocess():
    # get all csv files in 'data' folder
    files = []
    for filename in os.listdir(os.getcwd() + '/' + data_dir + '/data'):
        if filename.endswith('csv'):
            files.append(filename)
            
#     unique_chars = set()
#     for filename in files:
#         df = pd.read_csv(filepath_or_buffer='data/' + filename,encoding='utf-8')
#         unique_chars.update(set(list(''.join(df.iloc[:,1]))))

    # get all unique characters
    unique_chars = set()
    for filename in files:
        with open(file=os.getcwd() + '/' + data_dir + '/data/'+filename, mode='r',encoding='utf-8') as f: 
            unique_chars.update(set(f.read()))
    
    # Sign for a start & end of a row
    unique_chars.update(set([start_token,end_token]))


#     print("Number of unique characters :",len(unique_chars))
    return list(unique_chars)

# vocabulary = preprocess()
# print(len(vocabulary))

**Part 2**

Write a function `lm` that generates a language model from a textual corpus.  
The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur.  
For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - You should think about how to add the add_one smoothing information to the dictionary and implement it.
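
As a small illustration of the dictionary format, the sketch below counts character trigrams in a toy string and normalizes them. The helper `toy_ngram_model` and the toy text are hypothetical. The smoothing shown is textbook add-one (Laplace) smoothing over the whole vocabulary, which may differ from how the `lm` cell below stores its smoothing information.

```python
from collections import Counter, defaultdict

def toy_ngram_model(text, n, vocabulary, add_one):
    """Hypothetical helper: P(next char | previous n-1 chars) as nested dicts."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1

    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        if add_one:
            # Laplace smoothing: every vocabulary token gets one extra count.
            denom = total + len(vocabulary)
            model[context] = {tok: (counter[tok] + 1) / denom for tok in vocabulary}
        else:
            model[context] = {tok: c / total for tok, c in counter.items()}
    return model

toy_vocab = ["a", "b", "c", "d"]
toy_text = "abcabbabcabd"  # "ab" is followed by c, b, c, d

unsmoothed = toy_ngram_model(toy_text, n=3, vocabulary=toy_vocab, add_one=False)
smoothed = toy_ngram_model(toy_text, n=3, vocabulary=toy_vocab, add_one=True)
print(unsmoothed["ab"])
print(smoothed["ab"])
```

Without smoothing, the entry for `"ab"` reproduces the 0.5 / 0.25 / 0.25 split from the example above. With smoothing, the unseen token `"a"` receives a non-zero probability and the row still sums to 1.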

In [None]:
def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)

  # TODO

#     start_token = '<start>'
#     end_token = '<end>'
#     unknown_token = 'UNK'
#     unknown_token_value = 1e-8

    df = pd.read_csv(filepath_or_buffer=data_file_path,encoding='utf-8')
    tweets = df.tweet_text

    model = {}
    
    # Creating model
    for t in tweets:

        # start of sentence
        for i in range(0,n-1,1):
            c_from = start_token * (n-i-1) + t[0:i]
            c_to = t[i]
            if c_from not in model.keys():
                model[c_from] = {}
            if c_to not in model[c_from].keys():
                model[c_from][c_to] = 1
            else:
                model[c_from][c_to] += 1 

        # n-grams of sentence
        for i in range(len(t)-n+1):
            c_from = t[i:i+n-1]
            c_to = t[i+n-1]
            if c_from not in model.keys():
                model[c_from] = {}
            if c_to not in model[c_from].keys():
                model[c_from][c_to] = 1
            else:
                model[c_from][c_to] += 1 

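        # end of sentence: the context is the last n-1 characters of the tweet
        # (computed from len(t) instead of reusing the leftover loop index i)
        c_from = t[len(t)-n+1:]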
        c_to = end_token
        if c_from not in model.keys():
            model[c_from] = {}
        if c_to not in model[c_from].keys():
            model[c_from][c_to] = 1
        else:
            model[c_from][c_to] += 1 
    
    # add_one=True: add an UNK bucket, add 1 to every count, then normalize
    if add_one:
        for k in model.keys():
            model[k][unknown_token] = 0

            for kk in model[k].keys():
                model[k][kk] += 1

            sum_k = sum(model[k].values()) #+ len(vocabulary)
            for kk in model[k].keys():
                model[k][kk] /= sum_k

    # add_one=False: normalize the raw counts, then give UNK a tiny fixed probability
    else:
        for k in model.keys():
            sum_k = sum(model[k].values())
            for kk in model[k].keys():
                model[k][kk] /= sum_k
            # set UNK
            model[k][unknown_token] = unknown_token_value

    # fallback probability for contexts that never appeared in the training data
    model[unknown_token] = unknown_token_value
    
    
    return model

# model = lm(n=4, vocabulary=vocabulary, data_file_path='data/en.csv', add_one=False)
# model

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
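
For reference, perplexity here is 2 raised to the negative mean of the log2 probabilities that the model assigns to each token. A minimal sketch, assuming the per-token probabilities have already been collected into a list (the helper name `perplexity` is hypothetical):

```python
import math

def perplexity(probabilities):
    """Hypothetical helper: 2 ** (-(1/N) * sum(log2(p_i)))."""
    mean_log2 = sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** (-mean_log2)

# A model that is always uniform over 4 tokens has perplexity 4.
uniform_pp = perplexity([0.25, 0.25, 0.25, 0.25])
# A more confident model has lower perplexity.
confident_pp = perplexity([0.5, 0.5, 0.5, 0.5])
print(uniform_pp, confident_pp)
```

Lower perplexity means the model finds the text less surprising, which is what makes it usable as a language-matching score.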

In [None]:
def get_proba_from_to(model,c_from,c_to):

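    # context (c_from) never seen in training: use the global fallback probability
    if c_from not in model.keys() or c_from == unknown_token:
        return model[unknown_token]

    # next token (c_to) never seen after this context: use the context's UNK probability
    if c_to not in model[c_from].keys():
        return model[c_from][unknown_token]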

    return model[c_from][c_to]


def eval(n, model, data_file):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to calculate a perplexity score for

  # TODO
    
    df = pd.read_csv(filepath_or_buffer=data_file,encoding='utf-8')
    tweets = df.tweet_text
    
    proba_tweets = []

    for t in tweets:
        # print(t)
        # start of sentence
        for i in range(0,n-1,1):
            c_from = start_token * (n-i-1) + t[0:i]
            c_to = t[i]
            proba_tweets.append(get_proba_from_to(model,c_from,c_to))
#             print(c_from,c_to)
            
        # n-grams of sentence
        for i in range(len(t)-n+1):
            c_from = t[i:i+n-1]
            c_to = t[i+n-1]
            proba_tweets.append(get_proba_from_to(model,c_from,c_to))
#             print(c_from,c_to)
            
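        # end of sentence: the context is the last n-1 characters of the tweet
        # (computed from len(t) instead of reusing the leftover loop index i)
        c_from = t[len(t)-n+1:]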
        c_to = end_token
        proba_tweets.append(get_proba_from_to(model,c_from,c_to))
#         print(c_from,c_to)

    # print('counts=',len(proba_tweets))
    # print('proba_tweets=',proba_tweets)
    # print('proba_tweets log2=',np.log2(proba_tweets).sum())    
    return np.power(2,- np.log2(proba_tweets).mean())

# model = lm(n=2, vocabulary=vocabulary, data_file_path='data/en.csv', add_one=False)
# eval(n=2, model=model, data_file='data/fr.csv')

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*.   
Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es...).   
This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], and every row should be labeled with one of the languages.   
Then, the values are the relevant perplexity values.

In [None]:
def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not

  #TODO

    files = {}
    for filename in os.listdir(os.getcwd() + '/' + data_dir + '/data'):
        if filename.endswith('csv'):
            path = os.getcwd() + '/' + data_dir + '/data/' + filename
            name = os.path.splitext(filename)[0]
            files[name] = path    

    # get vocabulary
    vocabulary = preprocess()    

    # create models
    models = {}
    for f in files.keys():
        models[f] = lm(n=n, vocabulary=vocabulary, data_file_path=files[f], add_one=add_one)
    
    # evaluate for all languages
    df = pd.DataFrame(columns=files.keys())
    for lang_model in files.keys():
        for lang_test in files.keys():
            df.loc[lang_model,lang_test] = eval(n=n, model=models[lang_model], data_file=files[lang_test])
#             print(lang_model,lang_test,eval(n=n, model=models[lang_model], data_file=files[lang_test])
    return df

# match(n=2,add_one=False)

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [None]:
# TODO
from IPython.display import Javascript
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 5000})'''))


n_s = [1,2,3,4]
add_on_s = [False,True]

for n in n_s:
    for ao in add_on_s:
        display(HTML('<H2>N={} &nbsp;&nbsp;&nbsp;&nbsp; add_one={}</H2>'.format(n,ao)))
        df = match(n=n,add_one=ao)
        display(df)
        display(HTML('<br><br>'))
        
    display(HTML('<br><br><br><br><br>'))

**N=1, add_one=False** (rows: model language, columns: data file language)
model,en,es,fr,nl,it,in,tl,pt
en,37.8155,40.9155,43.1323,38.9283,40.4245,40.8883,44.1225,44.3082
es,41.391,35.4511,40.8046,40.7529,39.6356,43.0116,46.6349,40.1468
fr,41.0064,39.4063,36.7572,40.1658,39.1325,43.8168,48.5308,41.4918
nl,40.1927,40.3302,41.3244,36.7356,40.3324,40.9722,45.8411,43.0708
it,40.8766,38.2787,40.1407,40.2984,36.7651,42.8081,45.8272,41.7757
in,41.8838,44.5735,47.3579,41.1173,43.0328,36.5903,42.0254,47.3855
tl,41.5914,42.2969,47.752,42.0241,42.4906,38.5119,39.9033,45.0447
pt,41.9283,36.8475,40.2943,40.8969,40.177,42.3834,46.6028,36.2269


**N=1, add_one=True** (rows: model language, columns: data file language)
model,en,es,fr,nl,it,in,tl,pt
en,37.8203,39.9793,41.8955,38.8167,39.6781,40.6506,43.9153,42.3912
es,41.1939,35.4551,40.0523,40.5695,39.1362,42.7328,46.388,38.8119
fr,40.7925,38.9097,36.7619,40.0317,38.9404,43.5558,48.3081,40.2653
nl,39.9387,39.8757,41.0174,36.7395,39.982,40.6854,45.5792,41.9338
it,40.6162,38.1029,39.6361,40.1342,36.7687,42.5195,45.5673,40.3411
in,41.7138,42.4792,46.1324,40.9645,42.32,36.5965,41.8524,44.679
tl,41.3345,41.9436,46.6649,41.8778,41.7826,38.2605,39.9093,44.1298
pt,41.6645,36.705,39.7803,40.6847,39.5534,42.0698,46.2885,36.2314


**N=2, add_one=False** (rows: model language, columns: data file language)
model,en,es,fr,nl,it,in,tl,pt
en,18.2886,30.2626,33.4835,25.3983,27.6,27.5662,27.0262,37.8465
es,29.1031,16.2653,28.9506,30.0798,23.4261,30.9691,30.5896,26.2961
fr,26.1245,27.6946,17.1228,27.1802,24.6027,30.1329,31.4539,31.9095
nl,25.054,31.7616,29.6374,17.9473,28.8831,28.1347,29.2788,37.0871
it,28.7876,23.751,30.4275,30.1302,16.7155,30.4894,29.8152,29.7926
in,26.6907,34.0945,41.14,27.0107,28.3149,18.1508,24.0628,41.3162
tl,24.9965,28.4714,41.4229,28.1747,27.2717,23.4979,17.9873,35.6522
pt,29.8409,21.4945,28.7374,31.2407,24.7088,32.3514,31.7168,16.596


**N=2, add_one=True** (rows: model language, columns: data file language)
model,en,es,fr,nl,it,in,tl,pt
en,18.3078,24.7993,26.5817,24.2862,25.5332,26.2739,25.5807,29.5319
es,27.4436,16.2835,25.2734,28.6231,21.7368,29.3634,28.8652,23.6449
fr,24.8762,21.6914,17.1399,26.0119,23.2035,28.5776,29.5968,25.9813
nl,23.9753,26.6772,25.9539,17.9633,27.0327,26.8773,27.5775,30.3803
it,27.4216,20.653,25.674,28.7272,16.733,28.8438,28.0637,25.0662
in,25.4171,29.0935,28.7854,25.7575,25.6584,18.1736,22.7798,32.5799
tl,23.9007,23.8286,28.8175,26.7458,24.7028,22.5064,18.0114,27.0469
pt,27.8921,19.827,24.9772,29.5116,22.6496,30.25,29.4441,16.6174


**N=3, add_one=False** (rows: model language, columns: data file language)
model,en,es,fr,nl,it,in,tl,pt
en,8.963,76.5862,106.826,86.6282,64.0316,91.3391,74.4461,104.014
es,77.2421,8.60731,99.3324,140.851,51.9139,131.914,98.8187,56.7719
fr,57.2887,64.9911,8.56596,98.5673,55.8074,104.05,96.5427,84.6945
nl,48.4263,85.159,93.6485,9.2365,75.1864,86.9358,77.4014,107.673
it,71.7443,58.693,93.783,141.81,8.63636,140.475,91.0406,73.0836
in,56.6487,85.2715,142.239,88.6925,73.0497,9.90744,58.7731,105.931
tl,49.698,73.0234,150.954,110.855,68.2762,62.9563,8.5301,96.227
pt,95.6233,54.2748,121.914,181.817,68.0819,174.123,114.883,8.11238


**N=3, add_one=True** (rows: model language, columns: data file language)
model,en,es,fr,nl,it,in,tl,pt
en,9.06244,24.1597,24.3727,18.4607,21.0262,23.0989,20.1294,30.7987
es,19.0636,8.70134,20.646,21.4272,16.8378,22.9641,21.3338,18.3378
fr,17.4877,21.2423,8.65696,19.8976,17.9588,23.4055,22.0566,25.6135
nl,16.5023,24.4282,20.2428,9.33837,22.1134,22.8311,20.842,29.3829
it,18.5326,17.1375,21.4033,22.3473,8.72892,23.4926,19.9566,22.0277
in,17.2169,26.1334,30.619,19.5154,21.293,10.0254,16.443,31.5402
tl,15.1641,20.8507,29.478,19.5414,20.0587,17.5209,8.63902,26.7967
pt,19.17,14.3313,20.4693,22.5556,18.045,23.6194,20.3879,8.20855


**N=4, add_one=False** (rows: model language, columns: data file language)
model,en,es,fr,nl,it,in,tl,pt
en,4.49421,1661.27,1755.15,2581.05,1734.02,7052.03,2402.12,3353.76
es,1695.71,4.74224,1990.0,7500.16,661.1,10910.5,3637.16,586.05
fr,827.058,980.148,4.4945,4225.86,1109.38,8397.27,4789.24,1943.34
nl,475.16,2405.2,1797.33,4.6455,2454.97,5500.8,3119.01,4319.9
it,1200.04,520.573,1733.52,8365.81,4.67216,10826.0,2934.97,784.216
in,617.205,2163.13,3360.03,3505.95,1993.78,5.11889,820.692,3398.2
tl,330.607,1071.25,3260.97,4341.15,1152.65,1242.05,4.32313,1880.49
pt,2623.97,583.377,3495.93,14581.0,1205.54,19489.7,5152.69,4.39733


**N=4, add_one=True** (rows: model language, columns: data file language)
model,en,es,fr,nl,it,in,tl,pt
en,4.72863,65.4103,90.5955,65.0879,48.8198,63.9575,49.7092,83.1827
es,52.6957,4.9751,78.8692,96.1009,41.2666,93.8644,65.7443,42.5415
fr,40.0582,54.7031,4.71416,70.745,43.9101,72.5205,63.2528,64.9806
nl,33.9982,64.1111,72.6843,4.90316,52.6378,60.0581,52.1838,78.1723
it,46.829,46.6833,73.5084,91.2207,4.89654,97.6623,60.8942,55.9814
in,37.9837,70.1486,113.886,63.3818,57.4739,5.39629,43.9632,82.5814
tl,33.1246,57.343,120.605,78.7805,52.2283,47.7098,4.55242,72.0619
pt,64.799,42.7035,98.4822,119.148,54.537,119.161,77.5407,4.61385
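
One way to read these tables for the end goal of language detection: for a given data file (column), the model (row) with the lowest perplexity is the predicted language. The sketch below uses the `en` column of the fourth table printed above; the helper `detect_language` and the variable names are hypothetical.

```python
# Perplexity of each language model on the en data file
# (values copied from the en column of the fourth table above).
perplexity_on_en = {
    "en": 18.3078,
    "es": 27.4436,
    "fr": 24.8762,
    "nl": 23.9753,
    "it": 27.4216,
    "in": 25.4171,
    "tl": 23.9007,
    "pt": 27.8921,
}

def detect_language(perplexity_by_model):
    """Hypothetical helper: return the model language with the lowest perplexity."""
    return min(perplexity_by_model, key=perplexity_by_model.get)

predicted = detect_language(perplexity_on_en)
print(predicted)
```

In every table the diagonal entry is the smallest value in its column, so this rule assigns each data file to its own language.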


# **Good luck!**