# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.

Do make sure all results are uploaded to CSVs (as well as printed to console) for your assignment to be fully graded.

As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)
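
The loading code used in this notebook reads `data['tweet_text'].values()`, which suggests that each JSON file is column-oriented (one object per column, keyed by row index). The sketch below assumes that layout; the inline string and its values are made up for illustration and are not course data.

```python
import json

# Hypothetical miniature of one data file, assuming a column-oriented layout
# (column name -> {row index -> value}).
toy_json = '{"tweet_id": {"0": 1, "1": 2}, "tweet_text": {"0": "hello world", "1": "hola mundo"}}'

data = json.loads(toy_json)
# The tweets are the values of the "tweet_text" column.
tweets = list(data["tweet_text"].values())
print(tweets)
```

If the real files turn out to use a different layout, only the line that extracts `tweets` needs to change.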





In [None]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 71 (delta 29), reused 40 (delta 11), pack-reused 0
Unpacking objects: 100% (71/71), 11.28 MiB | 6.04 MiB/s, done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [None]:

!ls nlp-course/lm-languages-data-new


en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


#### Imports and Global Variables:

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
import math 
import os 
import json
import csv
from collections import defaultdict, Counter
from google.colab import files

start_token = '↠'
end_token = '↞'
languages = ["en","es","in","it","pt","fr","nl","tl"]

base_path = "nlp-course/lm-languages-data-new/"

student_id_1 = '011996279'
student_id_2 = '322404252'

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
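
As a small illustration of the token definition, the sketch below builds a character vocabulary from a few made-up strings. The helper name `build_char_vocab` and the toy tweets are hypothetical; iterating over a Python `str` yields Unicode code points, which is taken here as the "character" unit.

```python
def build_char_vocab(texts):
    """Return a sorted list of every distinct character in `texts`."""
    vocab = set()
    for text in texts:
        # Iterating over a str yields one character (code point) at a time.
        vocab.update(text)
    return sorted(vocab)

toy_tweets = ["hola", "olá", "hi!"]  # made-up examples, not course data
toy_vocab = build_char_vocab(toy_tweets)
print(toy_vocab)
```

Accented letters such as `á` count as their own tokens, which is part of what makes character models useful for telling languages apart.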

In [None]:
def preprocess():
  # Read every CSV file in the data directory and stack them into one DataFrame.
  frames = []
  for file_name in os.listdir(base_path):
    if file_name.endswith('.csv'):
      frames.append(pd.read_csv(base_path + file_name, encoding='utf-8'))
  df = pd.concat(frames, axis=0)
  # Split each tweet into its characters; the set keeps each distinct one once.
  return list(set(df['tweet_text'].str.split('').explode()))

In [None]:
vocab = preprocess()
print(f'{len(vocab)=}\n')

# double-check that the start & end tokens are not in vocab:
print('start or end token already in vocab?')
start_token in vocab or end_token in vocab


len(vocab)=1860

start or end token already in vocab?


False

**Part 2**

Write a function `lm` that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note: think about how to incorporate the add_one smoothing information into the dictionary, and implement it.
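
One common reading of add-one smoothing is P(char | history) = (count + 1) / (total + |V|), where |V| is the vocabulary size. The sketch below applies that formula to a single toy string; the helper `toy_ngram_probs` and the three-letter vocabulary are assumptions for illustration, not the required design.

```python
from collections import Counter, defaultdict

def toy_ngram_probs(text, n, vocabulary, add_one):
    """Estimate P(char | previous n-1 chars) from one string."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        history, char = text[i:i + n - 1], text[i + n - 1]
        counts[history][char] += 1

    probs = {}
    for history, followers in counts.items():
        total = sum(followers.values())
        if add_one:
            # (count + 1) / (total + |V|) for every vocabulary symbol
            probs[history] = {c: (followers[c] + 1) / (total + len(vocabulary))
                              for c in vocabulary}
        else:
            probs[history] = {c: k / total for c, k in followers.items()}
    return probs

toy_vocab = ["a", "b", "c"]
unsmoothed = toy_ngram_probs("abab", 2, toy_vocab, add_one=False)
smoothed = toy_ngram_probs("abab", 2, toy_vocab, add_one=True)
print(unsmoothed["a"])
print(smoothed["a"])
```

Without smoothing, "a" is always followed by "b"; with smoothing, "a" and "c" each receive a small share of the probability mass and the distribution still sums to 1.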

In [None]:

def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)

  # model[history][char] first holds raw counts, then probabilities
  model = defaultdict(Counter)

  with open(data_file_path, encoding='utf-8') as f:
      data = json.load(f)

  for tweet in data['tweet_text'].values():
      # pad with n-1 start tokens and a single end token
      tweet = start_token * (n - 1) + tweet + end_token
      for i in range(len(tweet) - n + 1):
          history = tweet[i:i+n-1]
          char = tweet[i+n-1]
          model[history][char] += 1

  if add_one:
      # add one to the count of every vocabulary character, for each
      # history seen in training (unseen histories get no entry)
      for history in model:
          for char in vocabulary:
              model[history][char] += 1

  # normalize the counts of each history into probabilities
  for history, chars in model.items():
      total = sum(chars.values())
      for char in chars:
          chars[char] /= total

  return model


In [None]:
lm_check = lm(3, vocab, base_path + 'en.json', add_one=False)
lm_check

defaultdict(collections.Counter,
            {'↠↠': Counter({'R': 0.5296275708727071,
                      'H': 0.012673707615341857,
                      '@': 0.145414118954975,
                      '#': 0.017231795441912175,
                      'B': 0.008449138410227904,
                      'w': 0.0024458032240133407,
                      'E': 0.004669260700389105,
                      'T': 0.02745969983324069,
                      'M': 0.010894941634241245,
                      'G': 0.0062256809338521405,
                      'I': 0.03646470261256254,
                      'N': 0.010227904391328516,
                      'Y': 0.008893829905503057,
                      'P': 0.007115063924402446,
                      'A': 0.01500833796553641,
                      '.': 0.0017787659811006114,
                      'S': 0.019455252918287938,
                      'i': 0.006670372429127293,
                      '＠': 0.0008893829905503057,
                      'W': 0.01556

In [None]:
print(lm_check)

defaultdict(<class 'collections.Counter'>, {'↠↠': Counter({'R': 0.5296275708727071, '@': 0.145414118954975, 'I': 0.03646470261256254, 'T': 0.02745969983324069, 'S': 0.019455252918287938, '#': 0.017231795441912175, 'W': 0.01556420233463035, 'A': 0.01500833796553641, 'H': 0.012673707615341857, 'C': 0.012340188993885491, 'M': 0.010894941634241245, 'L': 0.010894941634241245, 'N': 0.010227904391328516, 'Y': 0.008893829905503057, 'D': 0.008782657031684269, 'B': 0.008449138410227904, 'F': 0.007782101167315175, 'O': 0.007226236798221234, 'P': 0.007115063924402446, 'i': 0.006670372429127293, 'J': 0.006336853807670928, 'G': 0.0062256809338521405, '"': 0.005002779321845469, 'E': 0.004669260700389105, 'h': 0.004002223457476375, 'o': 0.0033351862145636463, '1': 0.003224013340744858, 't': 0.0031128404669260703, 'w': 0.0024458032240133407, 'U': 0.0023346303501945525, '2': 0.0022234574763757642, 's': 0.0020011117287381877, 'm': 0.0020011117287381877, 'V': 0.0018899388549193997, '.': 0.0017787659811006

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
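
Perplexity over N predicted tokens can be written as 2 raised to the negative average log2-probability. The sketch below computes it for a hand-picked list of per-token probabilities; the helper name `perplexity` and the numbers are illustrative only.

```python
import math

def perplexity(token_probs):
    """Perplexity = 2 ** (-(1/N) * sum(log2 p_i)) over N predicted tokens."""
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log_sum / len(token_probs))

toy_probs = [0.5, 0.25, 0.25]  # made-up per-token probabilities
toy_ppl = perplexity(toy_probs)
print(round(toy_ppl, 3))  # 2 ** (5/3), about 3.175
```

A model that assigns probability 1/4 to every token has perplexity 4, which matches the intuition of "choosing uniformly among 4 options".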

In [None]:
def eval(n, model, data_file, sentence=False):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to calculate a perplexity score for
  # sentence - boolean, if True data_file is a sentence, otherwise it's a file path

  if sentence:
    tweets = [data_file]
  else:
    with open(data_file, encoding='utf-8') as f:
        tweets = json.load(f)['tweet_text'].values()

  total_log_prob = 0
  total_chars = 0

  for tweet in tweets:
    # same padding as in lm: n-1 start tokens and a single end token
    tweet = start_token * (n - 1) + tweet + end_token
    for i in range(len(tweet) - n + 1):
        history = tweet[i:i+n-1]
        char = tweet[i+n-1]
        prob = model.get(history, {}).get(char, 0)
        # NOTE: a zero-probability character adds nothing to the log sum but
        # is still counted, which lowers the perplexity of unsmoothed models
        total_log_prob += np.log2(prob) if prob > 0 else 0
        total_chars += 1

  return 2 ** (-total_log_prob / total_chars)

In [None]:
eval_check = eval(3, lm_check, base_path +'en.json')
print(f'eval check - perplexity of lm check running over en.json: {eval_check:.2f}')

eval check - perplexity of lm check running over en.json: 8.90


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en, es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. Then, the values are the relevant perplexity values.

Save the dataframe to a CSV with the name format: {student_id_1}\_...\_{student_id_n}\_part4.csv

In [None]:
def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not

  # every language JSON file in the data directory, excluding the test files
  data_files = [file_name for file_name in os.listdir(base_path)
                if file_name.endswith(".json") and file_name not in ("test.json", "tests.json")]
  languages = [file_name[:-5] for file_name in data_files]
  results = pd.DataFrame(index=languages, columns=languages)

  vocabulary = preprocess()

  for lang in languages:
      # rows: the language the model was trained on
      model = lm(n, vocabulary, os.path.join(base_path, f"{lang}.json"), add_one)
      for test_lang in languages:
          # columns: the language of the data the model is evaluated on
          test_data_file = os.path.join(base_path, f"{test_lang}.json")
          perplexity = eval(n, model, test_data_file)
          results.at[lang, test_lang] = round(float(perplexity), 2)

  return results

result_part4 = match(3, False)
print('Resulting match dataframe - lang model in rows and lang tested on in columns \n')
file_prefix = f"{student_id_1}_{student_id_2}"
file_name = f"{file_prefix}_part4.csv"
result_part4.to_csv(file_name)
files.download(file_name)
display(result_part4)

Resulting match dataframe - lang model in rows and lang tested on in columns 




model,fr,es,nl,pt,en,in,tl,it
fr,8.52,9.77,10.94,10.81,10.9,12.88,12.19,11.12
es,9.23,8.55,10.69,8.92,11.12,11.6,11.52,9.9
nl,9.65,11.34,9.16,11.87,10.61,13.09,11.79,12.59
pt,9.24,8.51,10.97,8.06,10.52,11.14,10.34,9.73
en,8.91,10.55,10.36,10.85,8.9,13.04,11.76,11.67
in,9.78,10.87,11.41,10.93,10.92,9.82,9.98,11.26
tl,9.16,10.44,10.76,10.35,9.76,10.88,8.54,10.88
it,9.54,9.33,11.47,9.43,11.16,11.63,10.83,8.52


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

Load each result to a dataframe and save it to a CSV with the name format:

For cases with add_one: {student_id_1}\_...\_{student_id_n}\_n1\_part5.csv

For cases without add_one:
{student_id_1}\_...\_{student_id_n}\_n1\_wo\_addone\_part5.csv

Follow the same format for n2, n3, and n4.
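
The naming scheme above can be captured in a small helper. This is only a sketch: the function name `part5_file_name` and the placeholder IDs are hypothetical.

```python
def part5_file_name(student_ids, n, add_one):
    """Build the Part 5 CSV name for a given n and smoothing setting."""
    prefix = "_".join(student_ids)
    # Files without add-one smoothing carry an extra "_wo_addone" marker.
    suffix = "" if add_one else "_wo_addone"
    return f"{prefix}_n{n}{suffix}_part5.csv"

example_ids = ["111111111", "222222222"]  # placeholder IDs
print(part5_file_name(example_ids, 1, True))
print(part5_file_name(example_ids, 3, False))
```

Keeping the name construction in one place makes it harder for the eight files to drift out of the required format.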

<font color='red'> Runtime warning: takes several minutes to run (5-7 mins on VS Code, 15-20 mins on Colab) and generate all 8 dataframes and csv's </font>

In [None]:
def run_match():
  for n in range(1, 5):
    for add_one in [True, False]:
        print(f"Results for n = {n}, add_one = {add_one}")
        results = match(n, add_one)
        display(results)
        file_prefix = f"{student_id_1}_{student_id_2}"
        if add_one:
            file_name = f"{file_prefix}_n{n}_part5.csv"
        else:
            file_name = f"{file_prefix}_n{n}_wo_addone_part5.csv"
        results.to_csv(file_name)
        files.download(file_name)

run_match()

Results for n = 1, add_one = True


model,fr,es,nl,pt,en,in,tl,it
fr,36.84,38.9,40.12,40.33,40.84,43.72,48.33,39.13
es,40.12,35.49,40.69,38.89,41.27,42.92,46.42,39.32
nl,41.09,39.89,36.85,42.01,40.01,40.83,45.59,40.18
pt,39.89,36.75,40.81,36.37,41.73,42.25,46.3,39.74
en,41.95,40.0,38.94,42.45,37.87,40.79,43.95,39.85
in,46.16,42.46,41.04,44.7,41.74,36.75,41.87,42.49
tl,46.67,41.9,41.96,44.11,41.38,38.41,39.97,41.91
it,39.75,38.13,40.29,40.41,40.71,42.72,45.59,36.94



Results for n = 1, add_one = False


model,fr,es,nl,pt,en,in,tl,it
fr,36.79,38.27,39.58,37.29,40.09,42.86,47.56,38.47
es,38.1,35.43,40.01,35.22,40.62,41.99,45.55,37.76
nl,40.31,38.74,36.79,38.98,39.16,39.91,44.74,39.15
pt,38.47,36.3,40.06,36.29,40.92,41.29,45.31,38.01
en,38.63,37.53,38.51,37.53,37.81,39.99,43.23,37.67
in,43.98,37.08,40.49,38.21,41.14,36.69,41.27,40.51
tl,44.2,41.15,41.4,42.26,40.52,37.59,39.9,39.92
it,38.26,37.67,39.69,36.6,39.85,41.71,44.72,36.87



Results for n = 2, add_one = True


model,fr,es,nl,pt,en,in,tl,it
fr,20.1,25.73,30.14,27.84,28.77,33.33,35.44,26.67
es,28.78,19.18,33.68,24.55,32.6,34.77,34.83,24.75
nl,31.11,31.19,20.93,33.43,27.59,30.79,32.53,31.12
pt,29.72,23.91,35.52,20.28,33.95,36.54,36.47,26.2
en,28.73,27.4,28.08,30.14,21.38,30.08,29.67,27.55
in,34.54,28.96,30.34,32.06,30.14,21.59,27.22,29.15
tl,34.09,29.03,31.62,32.36,27.88,26.1,21.77,27.85
it,29.58,24.93,33.89,26.49,32.09,34.11,33.72,19.69



Results for n = 2, add_one = False


model,fr,es,nl,pt,en,in,tl,it
fr,17.15,18.72,24.98,20.62,23.7,27.07,28.04,21.87
es,21.17,16.26,27.35,18.99,26.2,27.82,27.37,19.65
nl,23.33,22.9,17.95,23.94,22.88,25.48,26.1,24.96
pt,21.95,18.83,28.04,16.61,26.28,28.44,27.59,20.01
en,19.99,19.86,23.49,20.95,18.28,24.97,24.18,22.1
in,22.12,21.21,24.78,22.09,24.46,18.15,21.85,22.52
tl,21.92,21.04,25.68,21.31,22.71,21.45,17.98,21.58
it,21.84,18.9,27.73,19.31,26.2,27.38,26.61,16.66



Results for n = 3, add_one = True


model,fr,es,nl,pt,en,in,tl,it
fr,25.2,41.82,67.72,50.33,56.61,82.68,85.23,52.19
es,53.03,25.07,80.26,41.17,68.81,86.63,84.8,45.88
nl,59.39,61.1,28.38,66.82,52.81,76.4,74.86,69.53
pt,61.65,43.71,94.06,26.59,76.86,98.48,89.38,53.77
en,47.26,48.86,59.51,53.56,27.24,74.87,67.76,55.42
in,61.02,56.02,70.32,59.67,61.54,31.7,55.94,62.97
tl,60.12,56.7,74.88,59.63,52.94,57.31,30.09,59.15
it,53.87,43.26,84.51,45.25,67.67,89.55,78.1,25.78



Results for n = 3, add_one = False


model,fr,es,nl,pt,en,in,tl,it
fr,8.52,9.77,10.94,10.81,10.9,12.88,12.19,11.12
es,9.23,8.55,10.69,8.92,11.12,11.6,11.52,9.9
nl,9.65,11.34,9.16,11.87,10.61,13.09,11.79,12.59
pt,9.24,8.51,10.97,8.06,10.52,11.14,10.34,9.73
en,8.91,10.55,10.36,10.85,8.9,13.04,11.76,11.67
in,9.78,10.87,11.41,10.93,10.92,9.82,9.98,11.26
tl,9.16,10.44,10.76,10.35,9.76,10.88,8.54,10.88
it,9.54,9.33,11.47,9.43,11.16,11.63,10.83,8.52



Results for n = 4, add_one = True


model,fr,es,nl,pt,en,in,tl,it
fr,54.87,87.88,108.03,101.1,110.98,158.65,149.22,116.87
es,81.27,56.65,106.39,70.72,122.64,136.24,137.89,93.48
nl,93.66,124.97,64.7,132.82,107.52,159.3,140.23,148.56
pt,92.1,74.8,121.69,59.45,125.93,137.94,130.89,105.22
en,79.05,112.37,98.91,112.77,61.65,156.18,128.71,130.18
in,106.15,130.23,131.51,127.98,123.79,79.04,103.72,139.35
tl,99.08,122.99,124.18,118.33,100.61,121.62,68.82,130.86
it,90.3,80.59,116.35,82.23,121.34,133.7,123.91,58.55



Results for n = 4, add_one = False


model,fr,es,nl,pt,en,in,tl,it
fr,4.45,4.48,3.61,4.18,4.26,3.59,3.49,4.51
es,3.82,4.7,3.28,4.05,3.96,3.4,3.57,4.45
nl,3.94,4.1,4.59,3.82,4.49,3.82,3.73,4.11
pt,3.65,4.06,3.03,4.36,3.57,3.05,3.27,4.08
en,4.07,4.56,3.85,3.97,4.45,3.7,3.83,4.29
in,3.87,4.43,3.87,4.02,4.28,5.04,4.31,4.34
tl,3.76,4.53,3.64,4.05,4.15,4.45,4.3,4.47
it,3.89,4.61,3.22,4.31,3.9,3.34,3.51,4.59



**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.
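
One straightforward (though not the only) way to classify is to score the sentence under every language model and pick the language with the lowest perplexity. The sketch below shows only the selection step; the helper `pick_language` and the scores are made up for illustration.

```python
def pick_language(perplexities):
    """Return the language whose model gives the lowest perplexity."""
    # min over the dict keys, ordered by their perplexity value
    return min(perplexities, key=perplexities.get)

toy_scores = {"en": 9.1, "es": 12.4, "nl": 10.7}  # made-up perplexities
print(pick_language(toy_scores))  # en
```

On a tie, `min` returns the first language in dictionary order, so the order of the models can matter for very short sentences.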


<font color='red'> Runtime warning: model with n=4, add_one=True takes 20-25 minutes to run! </font>

In [None]:
def classify(n, add_one):
  vocabulary = preprocess()

  # one model per language, all with the same n and smoothing setting
  language_models = {}
  for lang in ["en", "es", "fr", "in", "it", "nl", "pt", "tl"]:
      data_file_path = os.path.join(base_path, f"{lang}.json")
      language_models[lang] = lm(n, vocabulary, data_file_path, add_one)

  with open(os.path.join(base_path, "test.json"), encoding='utf-8') as f:
      test_data = json.load(f)

  classifications = []
  for tweet in test_data['tweet_text'].values():
      # predict the language whose model gives the tweet the lowest perplexity
      min_perplexity = float('inf')
      predicted_lang = None
      for lang, model in language_models.items():
          perplexity = eval(n, model, tweet, sentence=True)
          if perplexity < min_perplexity:
              min_perplexity = perplexity
              predicted_lang = lang
      classifications.append(predicted_lang)

  return classifications

clasification_result = classify(n=3, add_one=False)
clasification_result_1 = classify(n=2, add_one=False)
clasification_result_2 = classify(n=2, add_one=True)

In [None]:
print('First classification, n=3, add_one=False \n')
test_df = pd.read_csv(base_path+'test.csv')
test_df['pred'] = clasification_result
display(test_df)
accuracy = np.mean(test_df['pred'] == test_df['label'])
print(f'classification accuracy: {accuracy:.2%}\n')

print('Second classification, n=2, add_one=False \n')
test_df_1 = pd.read_csv(base_path+'test.csv')
test_df_1['pred'] = clasification_result_1
display(test_df_1)
accuracy_1 = np.mean(test_df_1['pred'] == test_df_1['label'])
print(f'classification accuracy: {accuracy_1:.2%}\n')

print('Third classification, n=2, add_one=True \n')
test_df_2 = pd.read_csv(base_path+'test.csv')
test_df_2['pred'] = clasification_result_2
display(test_df_2)
accuracy_2 = np.mean(test_df_2['pred'] == test_df_2['label'])
print(f'classification accuracy: {accuracy_2:.2%}\n')


First classification, n=3, add_one=False 



Unnamed: 0,tweet_id,tweet_text,label,pred
0,845394879479996416,RT @jarsofshine: In 08 I had a volunteer who h...,en,en
1,836313846675619841,IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa...,it,it
2,836259442328940544,@jaynaldmase @acobasilianne @dingDANGdantes @d...,tl,tl
3,847729104472358912,"Daags voor @RondeVlaanderen, @VoltaClassic als...",nl,nl
4,836491739699412992,RT @ertsul20: Susuportahan kita hanggang sa du...,tl,tl
...,...,...,...,...
7994,836250659464761344,"La triste historia que inspiró ""Tu falta de qu...",es,pt
7995,847676283089637380,RT @ShahwalAdli_: Aku tak bersuara tak bermakn...,in,in
7996,836319299279138816,@Benji_Mascolo DEVI TAGLIARE QUEI CAPELLI 😠😡😠😂❤,it,it
7997,836258179847716865,Assistimos de camarote varias brigas ontem!,pt,es


classification accuracy: 71.40%

Second classification, n=2, add_one=False 



Unnamed: 0,tweet_id,tweet_text,label,pred
0,845394879479996416,RT @jarsofshine: In 08 I had a volunteer who h...,en,en
1,836313846675619841,IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa...,it,it
2,836259442328940544,@jaynaldmase @acobasilianne @dingDANGdantes @d...,tl,tl
3,847729104472358912,"Daags voor @RondeVlaanderen, @VoltaClassic als...",nl,nl
4,836491739699412992,RT @ertsul20: Susuportahan kita hanggang sa du...,tl,tl
...,...,...,...,...
7994,836250659464761344,"La triste historia que inspiró ""Tu falta de qu...",es,fr
7995,847676283089637380,RT @ShahwalAdli_: Aku tak bersuara tak bermakn...,in,in
7996,836319299279138816,@Benji_Mascolo DEVI TAGLIARE QUEI CAPELLI 😠😡😠😂❤,it,it
7997,836258179847716865,Assistimos de camarote varias brigas ontem!,pt,pt


classification accuracy: 78.68%

Third classification, n=2, add_one=True 



Unnamed: 0,tweet_id,tweet_text,label,pred
0,845394879479996416,RT @jarsofshine: In 08 I had a volunteer who h...,en,en
1,836313846675619841,IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa...,it,it
2,836259442328940544,@jaynaldmase @acobasilianne @dingDANGdantes @d...,tl,tl
3,847729104472358912,"Daags voor @RondeVlaanderen, @VoltaClassic als...",nl,nl
4,836491739699412992,RT @ertsul20: Susuportahan kita hanggang sa du...,tl,tl
...,...,...,...,...
7994,836250659464761344,"La triste historia que inspiró ""Tu falta de qu...",es,es
7995,847676283089637380,RT @ShahwalAdli_: Aku tak bersuara tak bermakn...,in,in
7996,836319299279138816,@Benji_Mascolo DEVI TAGLIARE QUEI CAPELLI 😠😡😠😂❤,it,it
7997,836258179847716865,Assistimos de camarote varias brigas ontem!,pt,pt


classification accuracy: 86.71%



**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 

Load the results to a CSV (using a DataFrame) with the columns model_name and f1_score. Name it {student_id_1}\_...\_{student_id_n}\_part7.csv



```
  model_name  f1_score
0    Model A      0.85
1    Model B      0.92
2    Model C      0.87
3    Model D      0.90
```
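
To make the metric concrete, the sketch below computes a support-weighted F1 by hand: the F1 of each label, averaged with weights equal to the number of true examples of that label. This is meant to mirror what `average='weighted'` stands for; the helper `weighted_f1` and the toy labels are illustrative, and the hint's `f1_score` remains the practical choice.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-label F1, averaged with weights equal to each label's true count."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        total += support[label] * f1
    return total / len(y_true)

toy_true = ["en", "en", "es", "es", "it"]  # made-up labels
toy_pred = ["en", "es", "es", "es", "en"]
toy_f1 = weighted_f1(toy_true, toy_pred)
print(round(toy_f1, 2))  # 0.52
```

Here "en" scores 0.5, "es" scores 0.8 and "it" scores 0, so the weighted average is (2 * 0.5 + 2 * 0.8 + 1 * 0) / 5 = 0.52.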



In [None]:
def calc_f1(result):
    # weighted F1: per-language F1, averaged by the number of true tweets per language
    y_true = result['label']
    y_pred = result['pred']
    return f1_score(y_true, y_pred, average='weighted')

f1_scores = [calc_f1(classification) for classification in [test_df, test_df_1, test_df_2]]

# Model A: n=3, add_one=False; Model B: n=2, add_one=False; Model C: n=2, add_one=True
f1_df = pd.DataFrame({'model_name': ['Model A', 'Model B', 'Model C'], 'f1_score': f1_scores}).round(2)

# Styler.hide(axis="index") replaces the deprecated Styler.hide_index()
display(f1_df.style.hide(axis="index"))

file_prefix = f"{student_id_1}_{student_id_2}"
file_name = f"{file_prefix}_part7.csv"
f1_df.to_csv(file_name)
files.download(file_name)


model_name,f1_score
Model A,0.72
Model B,0.79
Model C,0.87



<br><br><br><br>
**Part 8**  
Let's use your Language model (dictionary) for generation (NLG).

When it comes to sampling from a language model decoder during text generation, there are several different methods that can be used to control the randomness and diversity of the generated text. 

Some of the most commonly used methods include:

> `Greedy sampling`
In this method, the model simply selects the word with the highest probability as the next word at each time step. This method can produce fluent text, but it can also lead to repetitive or predictable output.

> `Temperature scaling`  
Temperature scaling divides the logits of the language model by a temperature parameter before softmax normalization. A temperature above 1 flattens the distribution and raises the probability of lower-probability words, which can lead to more diverse and creative output; a temperature below 1 sharpens the distribution toward the most likely words.

> `Top-K sampling`  
In this method, the model restricts the sampling to the top-K most likely words at each time step, where K is a predefined hyperparameter. This can generate more diverse output than greedy sampling, while limiting the number of low-probability words that are sampled.

> `Nucleus sampling` (also known as top-p sampling)  
This method restricts the sampling to the smallest possible set of words whose cumulative probability exceeds a certain threshold, defined by a hyperparameter p. Like top-K sampling, this can generate more diverse output than greedy sampling, while avoiding sampling extremely low probability words.

> `Beam search`  
Beam search involves maintaining a fixed number k of candidate output sequences at each time step, and then selecting the k most likely sequences based on their probabilities. This can improve the fluency and coherence of the output, but may not produce as much diversity as sampling methods.

The choice of sampling method depends on the specific application and desired balance between fluency, diversity, and randomness. Hyperparameters such as temperature, K, p, and beam size can also be tuned to adjust the behavior of the language model during sampling.
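
The three probability-reshaping methods described above (temperature, top-K, top-p) can be sketched as plain functions that turn one distribution into another, from which a token would then be sampled. The helper names and the toy distribution are hypothetical, and the temperature step is applied to probabilities as p ** (1 / T), which is equivalent to dividing log-probabilities by T.

```python
def apply_temperature(probs, temperature):
    """Rescale probabilities as p ** (1 / T), then renormalize."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

toy = [0.5, 0.3, 0.15, 0.05]  # made-up next-token distribution
print(apply_temperature(toy, 2.0))
print(top_k_filter(toy, 2))
print(top_p_filter(toy, 0.9))
```

With this toy distribution, top-K with K=2 keeps the first two tokens, while top-p with p=0.9 needs three tokens to reach the threshold; a token can then be drawn with, for example, `random.choices(range(len(toy)), weights=filtered)`.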


You may read more about this concept in <a href='https://huggingface.co/blog/how-to-generate#:~:text=pad_token_id%3Dtokenizer.eos_token_id)-,Greedy%20Search,-Greedy%20search%20simply'>this</a> blog post.
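
Beam search, as described above, can be illustrated on a tiny hand-made bigram model. The sketch below scores sequences by cumulative log-probability; the function `beam_search` and the toy model are assumptions for illustration, not the required implementation.

```python
import math

def beam_search(model, start, steps, k):
    """Keep the k highest-scoring sequences after every step (bigram model)."""
    beams = [(start, 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            followers = model.get(seq[-1], {})
            if not followers:
                candidates.append((seq, score))  # dead end: carry over unchanged
                continue
            for char, prob in followers.items():
                candidates.append((seq + char, score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

toy_model = {  # made-up bigram probabilities
    "a": {"b": 0.6, "c": 0.4},
    "b": {"a": 0.5, "c": 0.5},
    "c": {"a": 1.0},
}
toy_beams = beam_search(toy_model, "a", steps=2, k=2)
print(toy_beams[0][0])  # aca
```

Greedy decoding would commit to "ab" after the first step and end with a sequence of probability 0.3, whereas the beam keeps "ac" alive and finds "aca" with probability 0.4.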


**Please add the needed code for each sampling method:**

In [None]:
def sample_greedy(probabilities):
    # index of the most likely token
    return [np.argmax(probabilities)]


def sample_temperature(probabilities, temperature=1.0):
    # p ** (1 / T), renormalized: T > 1 flattens the distribution, T < 1 sharpens it
    adjusted_probabilities = np.exp(np.log(probabilities) / temperature)
    adjusted_probabilities /= np.sum(adjusted_probabilities)
    return np.random.choice(len(probabilities), size=1, p=adjusted_probabilities)


def sample_topK(probabilities, k=1):
    # sample among the k most likely tokens (or all of them if there are fewer than k)
    k = min(k, len(probabilities))
    topK_indices = np.argpartition(probabilities, -k)[-k:]
    topK_probabilities = probabilities[topK_indices]
    topK_probabilities /= np.sum(topK_probabilities)
    return np.random.choice(topK_indices, size=1, p=topK_probabilities)


def sample_topP(probabilities, p=0.9):
    # sample from the smallest set of most likely tokens whose total probability reaches p
    sorted_indices = np.argsort(probabilities)[::-1]
    cumulative_probabilities = np.cumsum(probabilities[sorted_indices])
    # include the token that crosses the threshold, so the set is never empty
    cutoff = np.searchsorted(cumulative_probabilities, p) + 1
    indices = sorted_indices[:cutoff]
    topP_probabilities = probabilities[indices]
    topP_probabilities /= np.sum(topP_probabilities)
    return np.random.choice(indices, size=1, p=topP_probabilities)


def sample_beam(model, gen_length, start_tokens, stop_token, n=2, k=3):
    # each beam is (sequence, log-probability of the characters generated so far)
    beams = [(start_tokens, 0.0)]

    # runs gen_length steps, so the result can hold gen_length characters beyond start_tokens
    for _ in range(gen_length):
        all_candidates = []
        for seq, score in beams:
            # last n-1 characters, left-padded with start tokens for short sequences
            padded = start_token * (n - 1) + seq
            context = padded[len(padded) - (n - 1):]
            next_probs = model.get(context, {})
            if seq.endswith(stop_token) or not next_probs:
                # finished (or dead-end) sequences are carried over unchanged
                all_candidates.append((seq, score))
                continue
            for next_token, next_prob in next_probs.items():
                if next_prob > 0:
                    all_candidates.append((seq + next_token, score + np.log(next_prob)))

        # keep only the k most likely sequences
        all_candidates.sort(key=lambda candidate: candidate[1], reverse=True)
        beams = all_candidates[:k]

    return beams[0][0]

Use your Language Model to generate each one of the following examples with the corresponding params.  
Notice the 4 core issues: 
- Starting tokens
- Length of the generation
- Sampling method (use all)
- Stop Token (if this token is sampled, stop generating)
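
The four issues listed above come together in the generation loop. The sketch below uses greedy selection on a hand-made bigram model and assumes that the length counts generated characters only (the assignment text leaves this open); the function `generate_greedy` and the toy model are hypothetical.

```python
def generate_greedy(model, start_tokens, gen_length, stop_token, n):
    """Append the most likely next character up to gen_length times."""
    text = start_tokens
    for _ in range(gen_length):
        # The context is the last n-1 characters generated so far.
        context = text[-(n - 1):] if n > 1 else ""
        followers = model.get(context)
        if not followers:
            break  # unseen context: nothing to choose from
        text += max(followers, key=followers.get)
        if text.endswith(stop_token):
            break  # the stop token was produced, so generation ends
    return text

toy_model = {  # made-up bigram probabilities
    "H": {"i": 0.9, "a": 0.1},
    "i": {"!": 0.6, "i": 0.4},
    "!": {"\n": 1.0},
}
toy_text = generate_greedy(toy_model, "H", 10, "\n", n=2)
print(repr(toy_text))  # 'Hi!\n'
```

Generation stops after three steps here because the stop token is produced, well before the length limit of 10 is reached.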

In [None]:
test_ = {
    'example1' : {
        'start_tokens' : "H",
        'sampling_method' : ['greedy','beam'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example2' : {
        'start_tokens' : "H",
        'sampling_method' : ['temperature','topK','topP'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example3' : {
        'start_tokens' : "He",
        'sampling_method' : ['greedy','beam','temperature','topK','topP'],
        'gen_length' : "20",
        'stop_token' : "me",
        'generation' : []
    }
}

Use your LM to generate a string based on the parameters of each example, and store the generated sequences in the generation list.

In [None]:
### your code here ###

def generate_text(start_tokens, sampling_method, gen_length, stop_token, language_model, k=3, temperature=1.0):
    text = start_tokens
    # infer n from the length of a history key (n-1 characters)
    n = len(next(iter(language_model))) + 1

    if sampling_method == 'beam':
        return sample_beam(language_model, gen_length, start_tokens, stop_token, n=n, k=k)

    # for the sampling methods, gen_length is the total length, start tokens included
    while len(text) < gen_length:
        # last n-1 characters, left-padded with start tokens if the text is shorter
        padded = start_token * (n - 1) + text
        context = padded[len(padded) - (n - 1):]

        probabilities = language_model.get(context)
        if not probabilities:
            break
        tokens = list(probabilities.keys())
        prob_values = np.array(list(probabilities.values()))

        if sampling_method == 'greedy':
            next_token_idx = sample_greedy(prob_values)
        elif sampling_method == 'temperature':
            next_token_idx = sample_temperature(prob_values, temperature=temperature)
        elif sampling_method == 'topK':
            next_token_idx = sample_topK(prob_values, k=k)
        elif sampling_method == 'topP':
            next_token_idx = sample_topP(prob_values, p=0.9)
        else:
            raise ValueError(f"Invalid sampling method: {sampling_method}")

        # map the sampled index back to its character
        text += str(tokens[next_token_idx[0]])

        if stop_token in text:
            break

    return text

#####################

In [None]:
vocab = preprocess()
lang_model = lm(2, vocab, base_path+'en.json', False)
for example_name, example in test_.items():
    start_tokens = example['start_tokens']
    gen_length = int(example['gen_length'])
    stop_token = example['stop_token']
    example['generation'] = []
    print(f'\n{example_name=}')

    for sampling_method in example['sampling_method']:
        # print(f'{sampling_method=}')
        generated_text = generate_text(start_tokens, sampling_method, gen_length, stop_token, lang_model, k=3, temperature=1.0)
        print(f'{generated_text=}')
        example['generation'].append(generated_text)


example_name='example1'
generated_text='Hon t t t '
generated_text='HERT @ONHER'

example_name='example2'
generated_text='Ho alpe at'
generated_text='Hend he th'
generated_text='Ht urema p'

example_name='example3'
generated_text='He t t t t t t t t t'
generated_text='Hends @ONHERT @ONHERT '
generated_text='Hex2s gis won #was J'
generated_text='He attthe httt t he '
generated_text='He eacon it Sueeaiga'


In [None]:
### do not change ###
print('-------- NLG --------')

for k,v in test_.items():
  l = ''.join([f'\t{sm} >> {v["start_tokens"]}{g}\n' for sm,g in zip(v['sampling_method'],v['generation'])])
  print(f'{k}:')
  print(l)

-------- NLG --------
example1:
	greedy >> HHon t t t 
	beam >> HHERT @ONHER

example2:
	temperature >> HHoulam5S: 
	topK >> HHerin that
	topP >> HHo alt.cas

example3:
	greedy >> HeHe t t t t t t t t t
	beam >> HeHends @ONHERT @ONHERT 
	temperature >> HeHe ke MiqLico//We tt
	topK >> HeHend t han thtthe he
	topP >> HeHeldand Schaberrumy 



<br><br><br>
# **Good luck!**