# Students:
Ofir Nesher - 204502926

Yuval Katz - 311132468

https://colab.research.google.com/drive/17uCnxPYkmmM1z79Husux1lh61Br31bjK

# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [None]:
!git clone https://github.com/kfirbar/nlp-course.git

fatal: destination path 'nlp-course' already exists and is not an empty directory.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [None]:
!ls nlp-course/lm-languages-data-new

en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


### IMPORTS

In [None]:
import numpy as np # used for scientific computing
import pandas as pd # used for data analysis and manipulation
import matplotlib.pyplot as plt # used for visualization and plotting
import os
from google.colab import drive
from collections import defaultdict
from math import log2

In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### CONSTANTS

In [None]:
DIRECTORY = r'/content/nlp-course/lm-languages-data-new'
START_TOKEN = '↠'
END_TOKEN = '↞'

# These were used between parts (leaving here for possible future use)
# EXAMPLE = pd.DataFrame(['abcee', 'abc', 'abd', 'abb', 'z', 'caa', 'caa', 'cab', 'cab', 'cab', 'cab', 'cab', 'cab', 'cab', 'cad'], columns = ['tweet_text'])
# EXAMPLE2 = {'%%': {'i': 0.109}, '%i': {'w': 0.144}, 'iw': {'l': 0.489}, 'wl': {'t': 0.905}, 'lt': {'c': 0.002}, 'tc': {'t': 0.472}, 'ct': {'r': 0.147}, 'tr': {'o': 0.056}, 'ro': {'h': 0.194}, 'oh': {'w': 0.089}, 'hw': {'.': 0.290}, 'w.': {'': 0.99999}}
# EXAMPLE3 = {'': {'a': 0.333}, 'a': {'b': 1}, 'b': {'c': 0.5}, 'c': {'d': 0.5}, 'dw': {'e': 1}}

---

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
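
As a rough illustration of the token definition above, the sketch below builds a character vocabulary from a small in-memory list. The `tweets` list and the `build_vocabulary` name are hypothetical stand-ins for the CSV rows and for *preprocess*; iterating over a Python `str` yields one Unicode character at a time, so accented letters and emoji each count as a single token.

```python
# Hypothetical in-memory tweets standing in for the 'tweet_text' column.
tweets = ["hola", "olá", "hello"]

def build_vocabulary(texts):
    """Return a sorted list of the distinct characters seen in texts."""
    seen = set()
    for text in texts:
        # str() guards against non-string cells (for example NaN) in real data.
        seen.update(str(text))
    return sorted(seen)

vocab = build_vocabulary(tweets)
print(vocab)
```

Sorting is optional; it only makes the vocabulary order reproducible between runs, which a plain `list(set(...))` does not guarantee.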

In [None]:
# Part 1 helper functions

def remove_non_alphabet_words(inputlist):
  return [w for w in inputlist if w.isalpha()]


def get_data_file_path(data_file):
  return os.path.join(DIRECTORY, data_file)

In [None]:
df = pd.read_csv(get_data_file_path('en.csv'))  
df

Unnamed: 0,tweet_id,tweet_text
0,845395018743459840,RT @ONHERPERlOD: Boyfriends that take pictures...
1,845395017917173760,He got his surgery done today but he's happy w...
2,845395018760306693,@levi_a1998 @mcluber29 I'm doing so much winni...
3,845395018336649216,RT @Rt_YourFavBands: #BandsTournament2017 Roun...
4,845395018751856642,#Merlin oh no she wanted to enchant him my bad
...,...,...
8990,845396513572433921,RT @michaelknaepen: losing is an alternative f...
8991,845396513572368384,@BobbyScott Thank you for opposing #Trumpcare...
8992,845396513572433923,RT @forbid: i planted a little surprise on a p...
8993,845396514012827648,RT @justin_halpern: ooof the kicker on this Ne...


In [None]:
def preprocess():
  tokens = []

  for data_file in os.listdir(DIRECTORY):
    if data_file.endswith('.csv'):
      df = pd.read_csv(get_data_file_path(data_file))

      for tweet in df['tweet_text'].values:
        tokens.extend(list(tweet))
      
  return list(set(tokens))

In [None]:
vocabulary = preprocess()

In [None]:
print(list(vocabulary)[:5]) # first n elements from vocabulary

['🤢', '🙉', '서', '⚘', '🥒']


**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - You should think how to add the add_one smoothing information to the dictionary and implement it.
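
One possible way to carry the add_one information, sketched here under our own assumptions, is to keep raw counts and compute probabilities on demand, so that unseen characters and unseen contexts also receive a smoothed value. The names `train_counts` and `probability` and the toy string are hypothetical and are not part of the assignment's required interface.

```python
from collections import Counter, defaultdict

def train_counts(text, n):
    """Count how often each character follows each (n-1)-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1
    return counts

def probability(counts, context, nxt, vocab_size, add_one=False):
    """Return P(nxt | context); with add_one, unseen events get a nonzero share."""
    seen = counts.get(context, {})
    total = sum(seen.values())
    k = 1 if add_one else 0
    denom = total + k * vocab_size
    return (seen.get(nxt, 0) + k) / denom if denom else 0.0

# Toy corpus over the vocabulary {a, b, c, d}; trigram model (n = 3).
counts = train_counts("abcabdabc", n=3)
print(probability(counts, "ab", "c", vocab_size=4))                # unsmoothed
print(probability(counts, "ab", "c", vocab_size=4, add_one=True))  # add-one
print(probability(counts, "ab", "a", vocab_size=4, add_one=True))  # unseen character
```

With add_one enabled, the probabilities for a given context sum to 1 over the whole vocabulary, which is not the case when only the observed characters are smoothed.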

In [None]:
# Part 2 helper functions

def add_symbols(tweet):
  return START_TOKEN + str(tweet) + END_TOKEN


def concat_strings(df, n):
  list_of_processed_tweets = df['tweet_text'].apply(add_symbols).values
  concatenated_string = ''.join(map(str, list_of_processed_tweets))
  concatenated_string = START_TOKEN * (n-1) + concatenated_string + END_TOKEN * (n-1)
  return concatenated_string


def get_ngram(given_string, n, v, add_one = False):
  # Count, for every (n-1)-character context, how often each next character follows it
  my_dict = dict()

  # The bound is len - 1 rather than len - n + 1: for n >= 3 the last positions yield
  # a truncated context or an empty next_char, and for n = 1 the final character is not counted
  for sub in range(len(given_string) - 1):
    substring = given_string[sub : sub + n - 1]
    next_char = given_string[sub + n - 1 : sub + n]

    # Skip contexts consisting only of START_TOKEN or END_TOKEN
    # (disabled; redundant since these symbols are not in the real vocabulary)
    # if substring == START_TOKEN * (n - 1) or substring == END_TOKEN * (n - 1):
    #   continue

    if substring not in my_dict:
      my_dict[substring] = dict()
    my_dict[substring][next_char] = my_dict[substring].get(next_char, 0) + 1

  # Turn counts into probabilities; add_one adds 1 to each seen count and v to the denominator
  for a in my_dict:
    s = sum(my_dict[a].values())
    for b in my_dict[a]:
      my_dict[a][b] = (my_dict[a][b] + (1 * add_one)) / (s + (v * add_one))

  return my_dict

In [None]:
def lm(n, vocabulary, data_file_path, add_one = False):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)
  
  v = len(vocabulary)
  df = pd.read_csv(data_file_path)
  my_string = concat_strings(df, n-1)
  n_gram = get_ngram(my_string, n, v, add_one)

  return n_gram

In [None]:
n = 3 # used throughout testing the functions
add_one = False # used throughout testing the functions
data_file = 'en.csv' # used throughout testing the functions
data_file_path = get_data_file_path(data_file)

n_gram = lm(n, vocabulary, data_file_path, add_one)

In [None]:
n_gram

{'↠↠': {'R': 1.0},
 '↠R': {'.': 0.00020990764063811922,
  'E': 0.0006297229219143577,
  'T': 0.9884550797649034,
  'a': 0.0016792611251049538,
  'e': 0.005877413937867338,
  'i': 0.0010495382031905961,
  'l': 0.00020990764063811922,
  'o': 0.0010495382031905961,
  'u': 0.0006297229219143577,
  'y': 0.00020990764063811922},
 'RT': {' ': 0.9801378192136198,
  '!': 0.00040535062829347385,
  '"': 0.00020267531414673692,
  "'": 0.004053506282934738,
  '*': 0.00040535062829347385,
  '+': 0.00020267531414673692,
  '.': 0.0006080259424402108,
  '2': 0.0006080259424402108,
  '4': 0.00020267531414673692,
  ':': 0.0006080259424402108,
  '?': 0.00020267531414673692,
  'A': 0.00040535062829347385,
  'B': 0.00020267531414673692,
  'C': 0.00020267531414673692,
  'E': 0.0012160518848804217,
  'H': 0.00040535062829347385,
  'I': 0.00040535062829347385,
  'K': 0.00020267531414673692,
  'L': 0.0006080259424402108,
  'M': 0.00020267531414673692,
  'O': 0.00040535062829347385,
  'R': 0.00060802594244

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
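
For reference, the textbook definition of perplexity is 2 raised to the negative mean of the log2 probabilities, that is, the inverse geometric mean of the per-token probabilities. The sketch below shows that definition on hand-picked numbers; the function name `perplexity` and the inputs are hypothetical. Taking the log of the arithmetic mean of the probabilities instead gives a different, smaller value, as the last line illustrates.

```python
from math import log2

def perplexity(probabilities):
    """2 ** (-(1/N) * sum(log2 p_i)): the inverse geometric mean of the p_i."""
    if not probabilities:
        raise ValueError("need at least one probability")
    avg_log = sum(log2(p) for p in probabilities) / len(probabilities)
    return 2 ** (-avg_log)

# A model that is uniform over 4 outcomes has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0

# Geometric versus arithmetic averaging on the same probabilities.
probs = [0.5, 0.125]
geometric = perplexity(probs)                   # averages log-probabilities
arithmetic = 1 / (sum(probs) / len(probs))      # inverse of the mean probability
print(geometric, arithmetic)
```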

In [None]:
def eval(n, model, data_file):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to calculate a perplexity score for
  #             (a file name under DIRECTORY or a full path)

  df = pd.read_csv(get_data_file_path(data_file))
  missing_value = 1e-8 # probability assigned to n-grams the model has never seen
  probabilities = []

  for tweet in df['tweet_text'].values:
    N = len(tweet)

    # range(N - n) leaves out the final n-gram of each tweet
    for idx in range(N - n):
      substring = tweet[idx: idx + n]
      key = substring[:-1]
      value = substring[-1]

      # look up P(value | key), falling back to missing_value for unseen events
      probabilities.append(model.get(key, {}).get(value, missing_value))

  # the score is based on the arithmetic mean of the probabilities over the whole file
  entropy = -log2(np.mean(probabilities))
  return 2 ** entropy

In [None]:
eval(n, n_gram, data_file)

3.868582333260864

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en ,es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. Then, the values are the relevant perplexity values.

In [None]:
def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not
  vocabulary = preprocess()
  results = defaultdict(lambda: defaultdict(float))

  for data_file in os.listdir(DIRECTORY):
    if data_file.endswith('.csv'): # and data_file in ['en.csv', 'test.csv']:
      data_file_path = get_data_file_path(data_file)
      model = lm(n, vocabulary, data_file_path, add_one)
      model_name = data_file.split('.')[0]

      for eval_name in os.listdir(DIRECTORY):
        if eval_name.endswith('.csv'):
          data_file_path = get_data_file_path(eval_name)
          eval_model_name = eval_name.split('.')[0]
          results[model_name][eval_model_name] = eval(n, model, data_file_path)
          # print(model_name, eval_name, results[model_name][eval_model_name])

  return pd.DataFrame(results)

In [None]:
match_result = match(n, add_one)

In [None]:
print(f' n={n}, add_one={add_one}')
match_result

 n=3, add_one=False


Unnamed: 0,en,tl,test,it,fr,in,nl,es,tests,pt
en,3.868582,5.092793,4.912901,5.427989,5.238332,5.520707,5.188779,5.524892,4.05205,5.475605
tl,5.888946,3.726786,4.972173,5.728226,6.406026,5.576016,5.98267,6.237512,6.029867,6.039567
test,5.381259,5.274486,4.639759,5.310175,5.20458,5.56184,5.408556,5.221767,5.297774,5.294316
it,5.925279,5.756749,5.232727,3.94039,5.594397,6.012816,6.251241,5.266592,6.015708,5.477189
fr,5.395616,5.849288,4.72834,5.375366,3.749857,6.028281,5.310292,5.185904,5.476888,5.466138
in,6.316324,5.542667,5.517652,6.311426,6.344912,4.337786,6.192636,6.266922,6.413979,6.3785
nl,5.303823,5.62211,4.981235,5.810636,5.264324,5.600395,3.91956,5.565661,5.393037,5.766294
es,5.744939,5.675912,4.825695,5.154467,5.179027,6.000852,5.757078,3.877825,5.883526,4.817513
tests,3.944145,4.971205,4.543883,5.265193,5.083286,5.371371,5.060044,5.379246,3.369462,5.345671
pt,5.87103,5.759647,4.742679,5.196993,5.556734,6.109246,6.040977,4.827614,6.002873,3.672958


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [None]:
for n_value in range(1, 5):
    for add_one_value in [False, True]:
        result = match(n_value, add_one_value)
        print(f'n: {n_value}, add_one: {add_one_value}')
        display(result)
        print()
        print()

n: 1, add_one: False


Unnamed: 0,en,tl,test,it,fr,in,nl,es,tests,pt
en,21.735327,23.658466,22.011249,22.229448,21.563887,22.971424,22.200633,21.350971,21.769265,21.510154
tl,23.582284,22.233837,22.889273,23.167409,23.625355,21.737391,24.255681,22.375312,23.749603,22.381045
test,22.00777,22.959469,21.762702,21.910631,21.504137,22.168157,22.180131,21.018197,22.074073,21.166313
it,22.225637,23.236542,21.90778,21.591859,21.705381,22.580134,22.468113,21.102571,22.28637,21.259053
fr,21.62757,23.765095,21.571212,21.773484,20.702224,22.772819,21.650985,20.690501,21.664838,20.944829
in,22.90598,21.742178,22.105356,22.523374,22.641898,20.742356,23.251064,21.692191,23.071647,21.700606
nl,22.187508,24.320309,22.172849,22.465416,21.573589,23.311412,21.543143,21.505544,22.193978,21.867901
es,21.372633,22.468334,21.040644,21.125543,20.647043,21.781872,21.533888,20.027848,21.432171,20.209183
tests,21.768432,23.821989,22.076928,22.289346,21.601367,23.133643,22.210034,21.410982,21.788877,21.573815
pt,21.380431,22.333852,21.046255,21.141391,20.753492,21.652508,21.745337,20.073414,21.444014,20.019279




n: 1, add_one: True


Unnamed: 0,en,tl,test,it,fr,in,nl,es,tests,pt
en,21.781515,23.720816,22.06732,22.281279,21.608252,23.027122,22.24896,21.39883,21.821944,21.570145
tl,23.632346,22.292477,22.947552,23.221399,23.673907,21.790132,24.308423,22.42544,23.807013,22.443435
test,22.054531,23.02,21.818146,21.961727,21.548381,22.221932,22.228414,21.065319,22.127484,21.225356
it,22.272857,23.297796,21.963589,21.64222,21.750034,22.634896,22.517016,21.149881,22.340288,21.318352
fr,21.673535,23.827725,21.626175,21.824265,20.744836,22.828043,21.69813,20.736898,21.717271,21.003262
in,22.954625,21.799539,22.161663,22.575882,22.688454,20.792709,23.301649,21.740807,23.127438,21.761122
nl,22.234645,24.384381,22.229327,22.517789,21.617975,23.367923,21.590055,21.553746,22.247672,21.928877
es,21.418061,22.527587,21.09427,21.174829,20.689544,21.834725,21.580781,20.072773,21.484049,20.265584
tests,21.814689,23.884765,22.133163,22.341314,21.645808,23.189728,22.258381,21.458974,21.841602,21.633981
pt,21.425879,22.392756,21.099896,21.190718,20.796212,21.705052,21.792687,20.118444,21.495923,20.075156




n: 2, add_one: False


Unnamed: 0,en,tl,test,it,fr,in,nl,es,tests,pt
en,8.637126,10.163464,9.726121,10.369298,9.559288,10.772592,9.703378,9.867879,8.692206,10.337764
tl,10.619687,8.223286,9.999525,10.543567,11.217245,9.684862,11.091937,10.422469,10.620226,10.789639
test,9.930542,9.873443,9.279939,9.502329,9.350708,10.158688,9.866035,9.141136,9.88808,9.50402
it,10.609274,10.176837,9.452839,8.000771,9.770345,10.490112,10.780573,9.050744,10.593282,9.311915
fr,9.669012,10.759289,9.006588,9.359368,7.679209,10.840544,9.513371,8.704005,9.638314,9.293542
in,10.878915,9.456566,10.029417,10.653719,10.963166,8.521556,10.625457,10.546688,10.870479,11.032246
nl,9.811989,10.796609,9.757941,10.58282,9.512642,10.4985,8.189938,9.989005,9.724959,10.758908
es,9.882477,9.876488,8.89291,8.72914,8.799512,10.507458,10.123334,7.565262,10.078554,8.417099
tests,8.595123,10.068435,9.475566,10.196666,9.399281,10.66313,9.521853,9.700297,8.347624,10.205621
pt,10.473019,9.936775,8.865229,8.806157,9.304085,10.840781,10.708983,8.33754,10.532952,7.478979




n: 2, add_one: True


Unnamed: 0,en,tl,test,it,fr,in,nl,es,tests,pt
en,10.10487,12.302149,11.610722,12.352278,11.163191,12.931925,11.365782,11.694971,10.205686,12.659928
tl,12.498152,9.949073,12.103984,12.73307,13.461486,11.520648,13.234338,12.533268,12.691886,13.353159
test,11.599785,11.998505,11.291723,11.397357,10.957104,12.083072,11.618292,10.806357,11.691929,11.559586
it,12.377828,12.278418,11.440867,9.605428,11.435529,12.503604,12.762288,10.644137,12.500664,11.207455
fr,11.267658,13.143278,11.034881,11.429551,9.050434,13.162029,11.376795,10.411651,11.349235,11.326714
in,12.796102,11.298748,12.002059,12.744748,13.050573,10.092681,12.49271,12.628691,12.971781,13.530139
nl,11.282983,12.926941,11.473908,12.50888,10.999175,12.349179,9.407203,11.699938,11.317794,13.031698
es,11.728171,12.194437,10.690592,10.463438,10.237804,12.446215,11.906095,8.865069,11.8354,10.040363
tests,10.021388,12.238978,11.47377,12.203656,11.02278,12.846525,11.181259,11.561075,10.018177,12.53239
pt,12.290743,12.592114,11.089539,10.760086,10.865239,12.983317,12.840783,9.775519,12.438537,9.263267




n: 3, add_one: False


Unnamed: 0,en,tl,test,it,fr,in,nl,es,tests,pt
en,3.868582,5.092793,4.912901,5.427989,5.238332,5.520707,5.188779,5.524892,4.05205,5.475605
tl,5.888946,3.726786,4.972173,5.728226,6.406026,5.576016,5.98267,6.237512,6.029867,6.039567
test,5.381259,5.274486,4.639759,5.310175,5.20458,5.56184,5.408556,5.221767,5.297774,5.294316
it,5.925279,5.756749,5.232727,3.94039,5.594397,6.012816,6.251241,5.266592,6.015708,5.477189
fr,5.395616,5.849288,4.72834,5.375366,3.749857,6.028281,5.310292,5.185904,5.476888,5.466138
in,6.316324,5.542667,5.517652,6.311426,6.344912,4.337786,6.192636,6.266922,6.413979,6.3785
nl,5.303823,5.62211,4.981235,5.810636,5.264324,5.600395,3.91956,5.565661,5.393037,5.766294
es,5.744939,5.675912,4.825695,5.154467,5.179027,6.000852,5.757078,3.877825,5.883526,4.817513
tests,3.944145,4.971205,4.543883,5.265193,5.083286,5.371371,5.060044,5.379246,3.369462,5.345671
pt,5.87103,5.759647,4.742679,5.196993,5.556734,6.109246,6.040977,4.827614,6.002873,3.672958




n: 3, add_one: True


Unnamed: 0,en,tl,test,it,fr,in,nl,es,tests,pt
en,7.537928,11.504073,10.203061,11.28459,9.982791,11.626888,10.090897,11.125508,7.847404,12.243618
tl,11.655076,8.483058,11.423269,12.861464,13.10616,11.187248,12.419081,13.140622,12.162548,14.406129
test,10.134289,11.58241,10.146076,10.627993,9.639006,11.22416,10.362577,9.894632,10.467921,11.118186
it,11.488699,13.014923,10.961147,8.169166,10.593531,12.625456,12.503801,10.214907,11.926718,11.639981
fr,10.135409,12.917763,9.849928,10.611301,7.129011,12.318488,10.375197,9.498017,10.474007,11.074717
in,11.933016,11.263595,11.354073,12.76138,12.374998,8.888208,11.767791,12.423963,12.388526,13.856999
nl,9.400886,11.667258,9.682424,11.251745,9.373686,10.791721,7.132564,10.227413,9.687248,11.754564
es,11.118867,12.858327,9.756448,10.01236,9.141679,12.257377,11.208748,7.34635,11.563028,9.519284
tests,7.331814,11.088472,9.781081,10.830642,9.568495,11.153041,9.661138,10.695548,7.441549,11.758533
pt,11.564336,13.400466,10.31824,10.542542,10.150861,12.775858,12.097829,8.963469,12.034125,8.131328




n: 4, add_one: False


Unnamed: 0,en,tl,test,it,fr,in,nl,es,tests,pt
en,2.334446,3.976621,3.770115,4.57134,4.462747,4.557662,4.338399,4.839688,3.115465,4.854172
tl,4.991997,2.265769,3.7119,4.822834,5.358464,4.736043,5.347458,5.195801,5.946489,5.01579
test,4.726029,4.534497,2.782605,4.619866,4.501219,4.930491,4.757222,4.579634,4.694311,4.683523
it,5.436205,5.163074,4.162115,2.443522,5.071275,5.790956,5.678146,4.675382,6.174152,4.937887
fr,4.981069,5.498715,3.671191,4.952465,2.340186,5.692374,4.925041,4.795798,5.539546,5.222165
in,6.091167,5.107319,4.329874,6.203006,6.12988,2.576102,5.990103,6.246622,6.803349,6.493825
nl,4.790704,5.227448,3.798866,5.339864,4.880031,5.257512,2.359574,5.198362,5.199082,5.480613
es,5.503923,5.252551,3.914146,4.601606,4.803015,5.903603,5.495202,2.483969,6.246298,4.251625
tests,2.838525,3.897258,2.68669,4.464343,4.338238,4.440543,4.221852,4.709904,1.968508,4.740599
pt,5.629461,5.276578,3.727592,4.621041,5.167705,6.028083,5.891214,4.170498,6.587733,2.339832




n: 4, add_one: True


Unnamed: 0,en,tl,test,it,fr,in,nl,es,tests,pt
en,10.364121,17.14737,15.077729,16.663808,14.95115,17.044317,14.539881,16.520506,10.921529,18.305397
tl,18.437838,12.280979,18.006469,19.96898,20.133683,18.523025,18.856803,20.688791,19.1617,22.387275
test,15.118963,17.651905,15.324027,16.026661,14.375379,17.159313,14.946985,14.891708,15.489133,17.016712
it,18.014943,20.865045,17.330716,11.533814,16.921639,20.104484,18.324137,16.403345,18.617114,19.113436
fr,15.715408,19.882755,15.073522,16.561374,10.052543,18.781744,15.34211,14.613291,16.223283,17.368103
in,17.94254,18.536062,17.728484,19.302343,18.459523,13.540457,17.676624,19.050139,18.4415,21.216722
nl,13.145074,16.232286,13.570127,15.085396,13.253259,15.183298,9.490229,14.255263,13.451047,16.288596
es,17.692244,20.76061,15.406072,16.070416,14.359363,19.73843,16.682661,10.755557,18.272276,15.30526
tests,9.948074,16.100635,14.022959,15.595925,13.993586,15.910085,13.538139,15.452474,9.945519,17.103979
pt,18.353073,21.238537,16.485196,17.208863,16.148151,20.477097,18.234098,14.121232,18.956099,12.496041






**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.

In [None]:
def classify_language(n = 3, add_one = False):
  LIMIT = 100 # limit the running time (~3 secs per tweet x 8000 tweets = ~400 mins for the whole df)
  print(f'Read first {LIMIT} rows (tweets) from file')

  data_file = 'test.csv'
  model_name = 'test'
  df = pd.read_csv(get_data_file_path(data_file))[:LIMIT]
  v = len(vocabulary)
  accuracy = 0
  predicted_value=[]
  
  for row in df.itertuples():
      model = get_ngram(row[2], n, v, add_one)
      results = defaultdict(lambda: defaultdict(float))

      for eval_name in os.listdir(DIRECTORY):
          if eval_name.endswith('.csv') and eval_name.find('test') == -1: # not 'test.csv' or 'tests.csv'
              data_file_path = get_data_file_path(eval_name)
              eval_model_name = eval_name.split('.')[0]
              results[model_name][eval_model_name] = eval(n, model, data_file_path)
              #print(eval_model_name, results[model_name][eval_model_name])
      
      # get key of min value
      df.at[row[0], 'predicted_label'] = min(results[model_name], key=results[model_name].get)  

  # number of "hits" between predicted label and actual label per row in df
  accuracy = sum(df['predicted_label'] == df['label']) / df.shape[0]
  
  return df, accuracy

Our `classify_language` function takes a long time to run, so we report results only for the first 100 tweets in 'test.csv'.

**ANOTHER APPROACH** to speed up classification is to read only the first X tokens of each tweet (characters, words, or a percentage of the tweet's length) instead of scanning the whole tweet and computing every substring of length n.

This is especially beneficial when a tweet is very long:
a subset of the text is usually sufficient to classify its language, provided X is not too small.
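
The prefix idea can be sketched as follows, under several assumptions of ours: one model per language is trained once, each tweet is scored only on its first `max_chars` characters, and the language with the highest mean add-one-smoothed log-probability wins. The helper names (`train`, `avg_log_prob`, `classify`), the `max_chars` parameter, and the two toy corpora are hypothetical; this is not the code that produced the results reported in this notebook.

```python
from collections import Counter, defaultdict
from math import log2

def train(text, n):
    """Count next-character frequencies for every (n-1)-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        counts[text[i:i + n - 1]][text[i + n - 1]] += 1
    return counts

def avg_log_prob(counts, text, n, vocab_size):
    """Mean add-one-smoothed log2 probability of the n-grams in text."""
    grams = [(text[i:i + n - 1], text[i + n - 1]) for i in range(len(text) - n + 1)]
    if not grams:
        return float("-inf")
    total = 0.0
    for context, nxt in grams:
        seen = counts.get(context, {})
        total += log2((seen.get(nxt, 0) + 1) / (sum(seen.values()) + vocab_size))
    return total / len(grams)

def classify(tweet, models, n, vocab_size, max_chars=50):
    """Pick the language whose model scores the tweet's prefix highest."""
    prefix = tweet[:max_chars]
    return max(models, key=lambda lang: avg_log_prob(models[lang], prefix, n, vocab_size))

# Toy corpora standing in for the per-language CSV files.
corpora = {
    "en": "the cat sat on the mat and the dog ate the hat",
    "es": "el gato se sienta en la alfombra y el perro come",
}
vocab_size = len(set("".join(corpora.values())))
models = {lang: train(text, 2) for lang, text in corpora.items()}

print(classify("the hat", models, 2, vocab_size))
print(classify("el perro", models, 2, vocab_size))
```

Because the models are built once and each tweet only costs a pass over its prefix, the per-tweet work no longer depends on the size of the training files.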

In [None]:
classify_result, classify_accuracy = classify_language(n) # remember that n is set above (n = 3 when this was written)

Read first 100 rows (tweets) from file


In [None]:
print(f'Accuracy = {classify_accuracy * 100}%')

classify_result

Accuracy = 79.0%


Unnamed: 0,tweet_id,tweet_text,label,predicted_label
0,845394879479996416,RT @jarsofshine: In 08 I had a volunteer who h...,en,en
1,836313846675619841,IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa...,it,it
2,836259442328940544,@jaynaldmase @acobasilianne @dingDANGdantes @d...,tl,tl
3,847729104472358912,"Daags voor @RondeVlaanderen, @VoltaClassic als...",nl,nl
4,836491739699412992,RT @ertsul20: Susuportahan kita hanggang sa du...,tl,tl
...,...,...,...,...
95,847719861459402753,"@jorisvdberg zie ook: Skinny Love, van Bon Iver",nl,nl
96,836489139214147586,RT @sinful_rider: FLASHING: TWINK EDITION. Cur...,tl,tl
97,836479412602208256,@georgeroldaniii hahaha mabunog napod laptop a...,tl,tl
98,836319232157704193,Comunque sotto shock quando Negan ha buttato i...,it,es


**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
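
To make the weighted average concrete, here is a minimal pure-Python sketch of a support-weighted F1 score: the per-class F1 values are averaged with weights equal to the number of true examples of each class. The function name `weighted_f1` and the four-item label lists are hypothetical; this is intended to mirror what `average='weighted'` computes, not to replace scikit-learn.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_true
    return total / len(y_true)

y_true = ["en", "en", "it", "it"]
y_pred = ["en", "it", "it", "it"]
score = weighted_f1(y_true, y_pred)
print(score)
```

In this toy case the 'en' class has F1 = 2/3 and the 'it' class has F1 = 4/5; both have support 2, so the weighted score is their plain average, 11/15.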


In [None]:
from sklearn.metrics import f1_score
# F1 = 2 * (precision * recall) / (precision + recall)
# store the result under a different name so the imported f1_score function is not overwritten
f1 = f1_score(classify_result['label'].tolist(), classify_result['predicted_label'].tolist(), average = 'weighted')
print(f'F1 score = {f1}')

F1 score = 0.803574687555008


# **Good luck!**

In [None]:
# Toda Raba!