<a href="https://colab.research.google.com/github/ronenbendavid/IDC_NLP/blob/master/Asi_Assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


*As a preparation for this task, place the data files somewhere in your drive so that you can access the files from this notebook. The files are available to download from the Moodle assignment activity*

The relevant files are:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)





In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

In [0]:
import json
from pathlib import Path
from glob import glob
import pandas as pd
import os

os.chdir('/content/drive/My Drive/idc/nlp/ex1/')

def preprocess():
  vocabulary = set()
  pathlist = glob('*.csv')
  for data_file_path in pathlist:
      data_file = pd.read_csv(data_file_path)
      for data in data_file['tweet_text'].values:
          vocabulary.update(list(data))

  vocabulary.add('<s>')
  vocabulary.add('</s>')

  # vocabulary.discard(' ')
  # vocabulary.discard('\t')
  # vocabulary.discard('\r')
  # vocabulary.discard('\n')

  return sorted(list(vocabulary))

  

In [3]:
%%time
vocabulary = preprocess()
print(len(vocabulary))
print(vocabulary)

1861
['\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '</s>', '<s>', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x7f', '\x80', '\x91', '\x92', '\x97', '\x9d', '¡', '£', '¤', '¥', '§', '¨', '©', 'ª', '«', '\xad', '®', '¯', '°', '²', '³', '´', '¶', '·', '¸', 'º', '»', '½', '¾', '¿', 'À', 'Á', 'Â', 'Ã', 'Å', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', '×', 'Ù', 'Ú', 'Ü', 'à', 'á', 'â', 'ã', 'ä', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ė', 'Ğ', 'ğ', 'İ', 'ı', 'ń', 'ō', 'Œ', 'œ', 'Ş', 'ş', 'Š', 'Ÿ', 'ƒ', 'ʔ', 'ʕ', '

**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - You should think how to add the add_one smoothing information to the dictionary and implement it.

In [0]:
import numpy as np
from collections import defaultdict
def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)

  voc_size = len(vocabulary)
  model = defaultdict(lambda: defaultdict(lambda: 0))
  data_file = pd.read_csv(data_file_path)
  for data in data_file['tweet_text'].values:
      data = ["<s>"] + list(data) + ["</s>"]
      for i in range(len(data) - n + 1):
          word, char = ''.join(data[i:i + n - 1]), data[i + n - 1]
          model[word][char] += 1

  if add_one:
      pmodel = defaultdict(lambda: defaultdict(lambda: 1/voc_size))
  else:
      pmodel = defaultdict(lambda: defaultdict(lambda: 1e-08))
  for word, counts in model.items():
      if add_one:
          total_count = sum(counts.values()) + voc_size
          pmodel[word] = defaultdict(lambda: 1/total_count)
          for char in counts:
              pmodel[word][char] = (counts[char] + 1) / total_count
      else:
          total_count = sum(counts.values())
          for char in counts:
              pmodel[word][char] = counts[char] / total_count

  return pmodel



In [99]:
%%time
vocabulary = preprocess()
model = lm(3, vocabulary, 'en.csv', False)
print(model['Ab'])
print(sum(v for v in model['Ab'].values()))
print(sum(model['Ab'][v] for v in vocabulary))

model = lm(3, vocabulary, 'en.csv', True)
print(model['Ab'])
print(sum(v for v in model['Ab'].values()))
print(sum(model['Ab'][v] for v in vocabulary))


defaultdict(<function lm.<locals>.<lambda>.<locals>.<lambda> at 0x7fe40841ce18>, {'o': 0.2571428571428571, 's': 0.1, 'e': 0.05714285714285714, 'U': 0.04285714285714286, 'u': 0.08571428571428572, 'D': 0.014285714285714285, 'd': 0.1, 'a': 0.014285714285714285, 'b': 0.1, '7': 0.014285714285714285, 'y': 0.014285714285714285, 'l': 0.02857142857142857, 'j': 0.014285714285714285, 'q': 0.014285714285714285, '2': 0.014285714285714285, 'i': 0.02857142857142857, '</s>': 0.014285714285714285, 'r': 0.05714285714285714, 'n': 0.014285714285714285, 'Z': 0.014285714285714285})
0.9999999999999994
1.0000184099998928
defaultdict(<function lm.<locals>.<lambda>.<locals>.<lambda> at 0x7fe4088c8488>, {'o': 18, 's': 7, 'e': 4, 'U': 3, 'u': 6, 'D': 1, 'd': 7, 'a': 1, 'b': 7, '7': 1, 'y': 1, 'l': 2, 'j': 1, 'q': 1, '2': 1, 'i': 2, '</s>': 1, 'r': 4, 'n': 1, 'Z': 1})
defaultdict(<function lm.<locals>.<lambda> at 0x7fe407809e18>, {'<s>': 0.0005178663904712584, 'o': 0.00983946141895391, 's': 0.004142931123770067, '

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.

In [0]:
import math

def eval(n, model, data_file_path):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to claculate a perplexity score for

  data_file = pd.read_csv(data_file_path)
  count = 0
  total = 0.0
  for data in data_file['tweet_text'].values:
      data = ['<s>'] + list(data) + ['</s>']
      for i in range(len(data) - n + 1):
          count += 1
          word, char = ''.join(data[i:i + n - 1]), data[i + n - 1]
          total += math.log2(model[word][char])

  ent = -1 / count * total
  per = 2 ** ent

  return per

In [17]:
%%time
model = lm(3, vocabulary, 'en.csv', False)
print(eval(3, model, 'en.csv'))

8.965438001439026
CPU times: user 1.21 s, sys: 7.97 ms, total: 1.22 s
Wall time: 1.23 s


In [18]:
%%time
model = lm(3, vocabulary, 'en.csv', True)
print(eval(3, model, 'en.csv'))

26.318871163182674
CPU times: user 1.14 s, sys: 2.99 ms, total: 1.15 s
Wall time: 1.15 s


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en ,es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. Then, the values are the relevant perplexity values.

In [0]:
def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not

  file_path = '{}.csv'

  lang = ['en' ,'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl']
  df = pd.DataFrame(columns=lang, index=lang)
  for l1 in lang:
    l1_model = lm(n, vocabulary, file_path.format(l1), add_one)
    for l2 in lang:
      df.at[l1, l2] = eval(n, l1_model, file_path.format(l2))

  return df


In [20]:
%%time
pd.options.display.max_columns = None
print(match(3, False))

         en       es       fr       in       it       nl       pt       tl
en  8.96544  78.0531  109.272  93.1641  65.1218  87.6969   106.41  76.1182
es  78.8428   8.5912  101.603  134.955  52.6994  143.156  57.7033  101.265
fr  58.3061  66.1363  8.56056  106.203  56.6941  99.7183  86.5802  98.8763
in  57.6622  86.8521  145.839  9.87273  74.3799  89.9082    108.3   59.971
it  73.2088  59.7344  95.8546  144.067  8.60361  144.459  74.6154  93.2924
nl  49.1501   86.677  95.5061  88.6947  76.5365  9.16546  110.144  79.1393
pt  97.6712  55.0793  124.818  178.722  69.2821  185.273  8.05634  117.905
tl  50.5246  74.3102   154.86  64.0306    69.47  112.556  98.4439  8.50025
CPU times: user 39.6 s, sys: 106 ms, total: 39.7 s
Wall time: 39.8 s


In [21]:
%%time
pd.options.display.max_columns = None
print(match(3, True))

         en       es       fr       in       it       nl       pt       tl
en  26.3189  56.1303  56.5069  75.7786   58.678  59.4674  65.8466  69.4008
es  70.2417  24.2023  58.3679  88.1664   47.765  81.1119  45.7008  86.5626
fr   57.368  47.4652   24.232  84.2093  52.9249  67.8738   58.516  86.7851
in  61.7705  65.3927  75.3658  30.4108  66.3883  70.5887  74.1277  56.7802
it  68.2684  45.9564  59.4491  90.6819  24.9038  84.7244  52.3253  79.6418
nl  53.0435  67.0519  63.0558  77.1002  70.7973  27.1713  76.7761  76.1951
pt   79.311  44.7339  66.8505  101.437  56.6742  95.7371  25.4544  92.4071
tl  53.6397  62.1633  75.2583  58.1438  63.0929  75.8147  71.3543  28.5867
CPU times: user 41 s, sys: 92 ms, total: 41.1 s
Wall time: 41.3 s


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [0]:
%%time
pd.options.display.max_columns = None
vocabulary = preprocess()
for n in range(0, 4):
  print("\n\nn: {}. add_one: False".format(n + 1))
  print(match(n + 1, False))
  print("\n\nn: {}. add_one: True".format(n + 1))
  print(match(n + 1, True))



n: 1. add_one: False
         en       es       fr       in       it       nl       pt       tl
en  38.5819  41.7284  43.9312  41.7317  41.2459  39.7133  45.2394  45.0282
es  42.1904  36.2134  41.5859  43.8699  40.4491  41.5543  41.0399  47.5528
fr   41.802   40.208  37.5004  44.6866  39.9451  40.9627  42.4084   49.471
in  42.6906  45.4138  48.1969  37.3912  43.8748  41.9243  48.3191  42.9053
it  41.6723  39.0687  40.9179  43.6638  37.5525  41.0965  42.6782  46.7371
nl  40.9808  41.1384  42.1104   41.815  41.1531  37.4997  43.9915  46.7566
pt  42.7496  37.6343  41.0913  43.2408  41.0032  41.7144  37.0763   47.514
tl  42.4023  43.1284   48.603  39.3321  43.3317  42.8451   45.963  40.7645


n: 1. add_one: True
         en       es       fr       in       it       nl       pt       tl
en  38.6302  39.7953  41.1775  41.2137  39.5433  39.4878  41.0819  44.5881
es  41.7911  36.2646  39.9556   43.276  39.3732  41.1705  38.0581  47.0439
fr   41.334   39.384   37.546  44.1159  39.5241  40.674

# **Good luck!**