# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.

Do make sure all results are uploaded to CSVs (as well as printed to console) for your assignment to be fully graded.

As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [None]:
import os
import math
import re
import pandas as pd
from collections import defaultdict
import json
import sys
import numpy as np
from sklearn.metrics import f1_score
import random

In [None]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 71 (delta 29), reused 40 (delta 11), pack-reused 0
Unpacking objects: 100% (71/71), 11.28 MiB | 3.65 MiB/s, done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [None]:

!ls nlp-course/lm-languages-data-new


en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

In [None]:
def preprocess(data_dir):
    vocab = set()
    for filename in os.listdir(data_dir):
        if 'csv' in filename and 'test' not in filename:
            # Read CSV file with two columns
            df = pd.read_csv(data_dir + "/" + filename, encoding="utf-8")
            for tweet in df['tweet_text']:
                for char in tweet:
                    if char not in vocab:
                        vocab.add(char)
    vocab.add('<S>')
    vocab.add('<E>')
    return list(vocab)

**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note: think about how to represent the add_one smoothing information in the dictionary, and implement it.
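
As a rough illustration of the dictionary format and of add-one smoothing, the sketch below builds a character bigram model from a toy string. The helper name `toy_bigram_lm` and the toy corpus are assumptions made for illustration only, and start/end padding is omitted for brevity; it is not the required implementation.

```python
from collections import Counter, defaultdict

def toy_bigram_lm(text, vocabulary, add_one):
    # Count how often each character follows each one-character context
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1

    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        if add_one:
            # Every vocabulary token gets one extra count, so the
            # denominator grows by |V| exactly once per context
            model[context] = {tok: (followers[tok] + 1) / (total + len(vocabulary))
                              for tok in vocabulary}
        else:
            model[context] = {tok: c / total for tok, c in followers.items()}
    return model

vocab = ["a", "b", "c"]
unsmoothed = toy_bigram_lm("abacab", vocab, add_one=False)
smoothed = toy_bigram_lm("abacab", vocab, add_one=True)
print(unsmoothed["a"])  # only the followers that were actually seen
print(smoothed["a"])    # every vocabulary token gets a non-zero probability
```

With smoothing, each context's distribution still sums to 1, but unseen continuations no longer have zero probability.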

In [None]:
def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)
    # 
    if add_one:
        model = defaultdict(lambda: defaultdict(lambda: 1.0 / len(vocabulary)))
        model["unk"]["unk"] = 1.0 / len(vocabulary)
    else:
        model = defaultdict(lambda: defaultdict(int))
        model["unk"]["unk"] = 0
    total_counts = defaultdict(int)
    total_counts["unk"] = 1
    if 'csv' in data_file_path:
        # Read CSV file with two columns
        df = pd.read_csv(data_file_path, encoding="utf-8")
    for tweet in df['tweet_text']:
          tweet = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', "", tweet)
          tweet = re.sub(r'@\w+\s*', '', tweet)              
          tweet = re.sub(r'[^\w\s]+', '', tweet)
          # Pad the text with special start and end tokens
          tweet = '<S>'*(n) + tweet + '<E>'
          j = 0
          # Handle the leading <S> tokens, each of which spans 3 characters
          for i in range(n):
              # Increment the count for this n-gram and token
              model[tweet[i+(j):i+3*n]][tweet[i+3*n]] += 1
              # Increment the total count for this n-gram
              total_counts[tweet[i+(j):i+3*n]] += 1
              j += 2
          for i in range(3*n,len(tweet)-3-n):
              # Increment the count for this n-gram and token
              model[tweet[i:i+n]][tweet[i+n]] += 1
              # Increment the total count for this n-gram
              total_counts[tweet[i:i+n]] += 1
          # Handle <E>, which must be treated as one token spanning 3 characters
          # Increment the count for this n-gram and token
          model[tweet[len(tweet)-n-3:len(tweet)-3]][tweet[len(tweet)-3:]] += 1
          # Increment the total count for this n-gram
          total_counts[tweet[len(tweet)-n-3:len(tweet)-3]] += 1
    # Convert counts to probabilities
    for ngram in model.keys():
        total = total_counts[ngram]
        if add_one:
            # Add |V| to the denominator once per context, not once per token
            total += len(vocabulary)
        for token in model[ngram].keys():
            count = model[ngram][token]
            if add_one:
                count += 1
            model[ngram][token] = count / total
    return model


**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
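
For reference, perplexity over $N$ predicted tokens is commonly computed as $2^{-\frac{1}{N}\sum_i \log_2 p_i}$. Below is a minimal sketch, assuming the per-token probabilities have already been looked up in the model; the function name `perplexity` and the sample values are illustrative only.

```python
import math

def perplexity(token_probs):
    # Average negative log2-probability (cross-entropy in bits per token)
    entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** entropy

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
print(perplexity([0.5, 0.125]))              # 4.0
```

A probability of zero makes the logarithm undefined, which is one reason unseen events need smoothing or a fallback probability.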

In [None]:
def eval(n, model, data_file):
    # n - the n-gram that you used to build your model (must be the same number)
    # model - the dictionary (model) to use for calculating perplexity
    # data_file - the tweets file that you wish to calculate a perplexity score for
    if 'csv' in data_file:
        # Read CSV file with two columns
        df = pd.read_csv(data_file, encoding="utf-8")
    probs = []
    for tweet in df['tweet_text']:
          tweet = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', "", tweet)
          tweet = re.sub(r'@\w+\s*', '', tweet) 
          tweet = re.sub(r'[^\w\s]+', '', tweet)
          # Pad the text with special start and end tokens
          tweet = '<S>'*(n) + tweet + '<E>'
          j = 0
          #deal with <S> as it is of size 3
          for i in range(n):
              temp = 0
              ngram = tweet[i+(j):i+3*n]
              if ngram in model.keys():
                  if tweet[i+3*n] in model[ngram].keys():
                      temp = model[ngram][tweet[i+3*n]]
                      probs.append(temp)
              if temp == 0:
                  temp = model["unk"]["unk"]
                  if temp > 0:
                      probs.append(temp)
                  # else:
                  #     temp = 1.e-17
                  #     probs.append(temp)
              j += 2
          for i in range(3*n,len(tweet)-n-3):
              temp = 0
              ngram = tweet[i:i+n]
              if ngram in model.keys():
                  if tweet[i+n] in model[ngram].keys():   
                      temp = model[ngram][tweet[i+n]]
                      if temp > 0: 
                          probs.append(temp)
              if temp == 0:
                  temp = model["unk"]["unk"]
                  if temp > 0:
                      probs.append(temp)
                  # else:
                  #     temp = 1.e-17
                  #     probs.append(temp)

          temp = 0
          ngram = tweet[len(tweet)-n-3:len(tweet)-3]
          if ngram in model.keys():
              if tweet[len(tweet)-3:] in model[ngram].keys():   
                  temp = model[ngram][tweet[len(tweet)-3:]]
                  if temp > 0:
                          probs.append(temp)
          if temp == 0:
              temp = model["unk"]["unk"]
              if temp > 0:
                  probs.append(temp)
              # else:
              #     temp = 1.e-17
              #     probs.append(temp)
    
    entropy = -np.mean(np.log2(probs))
    P = 2** entropy
    return P

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es, ...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], where every row is labeled with one of the languages and the values are the corresponding perplexity scores.

Save the dataframe to a CSV with the name format: {student_id_1}\_...\_{student_id_n}\_part4.csv

In [None]:
def match(n, add_one):
    # n - the n-gram to use for creating n-gram models
    # add_one - use add_one smoothing or not
    df = pd.DataFrame(columns=['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl'], index=['en', 'es', 'fr', 'in', 'it', 'nl', 'pt', 'tl'])
    vocabulary = preprocess('nlp-course/lm-languages-data-new/')  
    for modelfile in ['en.csv', 'in.csv', 'pt.csv', 'tl.csv', 'fr.csv', 'nl.csv', 'es.csv', 'it.csv']: 
        model = lm(n, vocabulary, 'nlp-course/lm-languages-data-new/' + modelfile, add_one)
        for vocabfile in ['en.csv', 'in.csv', 'pt.csv', 'tl.csv', 'fr.csv', 'nl.csv', 'es.csv', 'it.csv']:
            P = eval(n, model, 'nlp-course/lm-languages-data-new/' + vocabfile)
            
            df.loc[modelfile[:2], vocabfile[:2]] = P

    df.to_csv(f"{208542944}_{318339041}_part4.csv")
    return df
# match(4, False)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

Load each result to a dataframe and save to a CSV with the name format: 

For cases with add_one: {student_id_1}\_...\_{student_id_n}\_n1\_part5.csv

For cases without add_one:
{student_id_1}\_...\_{student_id_n}\_n1\_wo\_addone\_part5.csv

Follow the same format for n2, n3, and n4.


In [None]:
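def run_match():
    for n in range(1, 5):
        # One table with add_one smoothing and one without, for each n
        with_add_one = match(n, True)
        print(with_add_one)
        with_add_one.to_csv(f"{208542944}_{318339041}_n{n}_part5.csv")
        without_add_one = match(n, False)
        print(without_add_one)
        without_add_one.to_csv(f"{208542944}_{318339041}_n{n}_wo_addone_part5.csv")
run_match()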

**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.
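
One possible approach, sketched here under simplifying assumptions, is to score the sentence with every language model and pick the language with the lowest perplexity. The toy unigram tables, the `floor` fallback probability, and the function names are hypothetical and only illustrate the decision rule; the sentence is assumed to be non-empty.

```python
import math

def sentence_perplexity(sentence, char_probs, floor=1e-6):
    # Unseen characters fall back to a small floor probability (an assumption)
    logs = [math.log2(char_probs.get(ch, floor)) for ch in sentence]
    return 2 ** (-sum(logs) / len(logs))

def pick_language(sentence, models):
    # The predicted language is the one whose model is least "surprised"
    scores = {lang: sentence_perplexity(sentence, probs)
              for lang, probs in models.items()}
    return min(scores, key=scores.get)

toy_models = {
    "en": {"t": 0.4, "h": 0.3, "e": 0.3},
    "es": {"q": 0.3, "u": 0.3, "e": 0.4},
}
print(pick_language("the", toy_models))  # en
print(pick_language("que", toy_models))  # es
```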

In [None]:
def classify(n,add_one):
      
      # Read CSV file with two columns
      df = pd.read_csv('nlp-course/lm-languages-data-new/test.csv', encoding="utf-8")
      ans = defaultdict(lambda: defaultdict(int))
      labels = np.empty((len(df), 2), dtype=object)
      vocabulary = preprocess('nlp-course/lm-languages-data-new')
      eval = 0
      for modelfile in ['en.csv', 'in.csv', 'pt.csv', 'tl.csv', 'fr.csv', 'nl.csv', 'es.csv', 'it.csv']:   
          model = lm(n, vocabulary, 'nlp-course/lm-languages-data-new/' + modelfile, add_one)
          for iter,tweet in df.iterrows():
              if iter > 0:      
                    labels[iter][0] = tweet['label']
                    vals = []
                    tweet = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', "", tweet["tweet_text"])
                    tweet = re.sub(r'@\w+\s*', '', tweet)   
                    tweet = re.sub(r'[^\w\s]+', '', tweet)
                    # Pad the text with special start and end tokens
                    tweet = '<S>'*(n) + tweet + '<E>'
                    j = 0
                    for i in range(n):
                        temp = 0
                        ngram = tweet[i+(j):i+3*n]
                        if ngram in model.keys():
                            if tweet[i+3*n] in model[ngram].keys():   
                                temp = model[ngram][tweet[i+3*n]]
                                if temp > 0:
                                    vals.append(temp)
                        if temp == 0:
                            temp = model["unk"]["unk"]
                            if temp > 0:
                                vals.append(temp)
                            # else:
                            #     vals.append(1.e-17)

                        j += 2
                    for i in range(3*n,len(tweet)-n-3):
                        temp = 0
                        ngram = tweet[i:i+n]
                        if ngram in model.keys():
                            if tweet[i+n] in model[ngram].keys():   
                                temp = model[ngram][tweet[i+n]]
                                if temp > 0:
                                    vals.append(temp)

                        if temp == 0:
                            temp = model["unk"]["unk"]
                            if temp > 0:
                                vals.append(temp)
                            # else:
                            #     vals.append(1.e-17)
                    # Handle <E>, which must be treated as one token spanning 3 characters
                    temp = 0
                    ngram = tweet[len(tweet)-n-3:len(tweet)-3]
                    if ngram in model.keys():
                        if tweet[len(tweet)-3:] in model[ngram].keys():   
                            temp = model[ngram][tweet[len(tweet)-3:]]
                        if temp > 0:
                                vals.append(temp)
                    if temp == 0:
                        temp = model["unk"]["unk"]
                        if temp > 0:
                            vals.append(temp)
                        # else:
                        #     vals.append(1.e-17)
                    entropy = 0
                    # for ngram in H.keys():
                    #     entropy -= math.log2(H[ngram])
                    if vals:   
                        entropy = -np.mean(np.log2(vals))
                        P = 2** entropy

                    ans[iter][modelfile] = P
      for i in range(1,len(df)):
          guess = min(ans[i],key=ans[i].get)
          labels[i][1] = guess.split(".")[0]
      return labels



In [None]:
classification_result = classify(2, True)

In [None]:
classification_result

array([[None, None],
       ['it', 'it'],
       ['tl', 'in'],
       ...,
       ['it', 'it'],
       ['pt', 'pt'],
       ['en', 'en']], dtype=object)

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 

Load the results to a CSV (using a DataFrame), where the row indicates the F1 results, and the columns indicate the model used. Name it {student_id_1}\_...\_{student_id_n}\_part7.csv
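
To make the metric concrete, here is a plain-Python sketch of a macro-averaged F1 score: the unweighted mean of the per-class F1 values. It is intended to mirror what `f1_score(..., average='macro')` computes, and the function name `macro_f1` and the toy labels are assumptions for illustration.

```python
def macro_f1(y_true, y_pred):
    # Consider every label that appears in either the gold or predicted labels
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["en", "en", "es", "es"]
y_pred = ["en", "es", "es", "es"]
print(macro_f1(y_true, y_pred))
```

In this toy case the per-class scores are 2/3 for `en` and 4/5 for `es`, so the macro average weights both languages equally regardless of how many test sentences each has.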

In [None]:
def calc_f1(result):
  return f1_score(result[1:, 0], result[1:, 1], average='macro')
arg_max = 0
current_n = 0 
is_True = -1
df = pd.DataFrame(columns=['model', 'f1_score'], index=[1, 2, 3, 4, 5, 6, 7, 8])
for i in range(1,4):
    c1  = calc_f1(classify(i,True))
    print(f"the score for model_{i}_True is:",c1,"where n =",i ,"and add one is:", "True")
    if c1 > arg_max:
        arg_max = c1
        current_n = i
        is_True = 1
    c2  = calc_f1(classify(i,False))
    print(f"the score for model_{i}_False is:",c2,"where n =",i ,"and add one is:", "False")
    if c2 > arg_max:
        arg_max = c2
        current_n = i
        is_True = 0
    # Use a single .loc[row, column] call; chained indexing may assign to a copy
    df.loc[i, 'model'] = f"model_{i}_True"
    df.loc[i+4, 'model'] = f"model_{i}_False"
    df.loc[i, 'f1_score'] = c1
    df.loc[i+4, 'f1_score'] = c2
df.to_csv(f"{208542944}_{318339041}_part7.csv")
print("max prob:",arg_max,"where n =",current_n,"and add one is:", is_True)


the score for model_1_True is: 0.8730725237412568 where n = 1 and add one is: True
the score for model_1_False is: 0.8639926511708674 where n = 1 and add one is: False
the score for model_2_True is: 0.9250610494684947 where n = 2 and add one is: True
the score for model_2_False is: 0.9112248198769488 where n = 2 and add one is: False
the score for model_3_True is: 0.9232825156550549 where n = 3 and add one is: True
the score for model_3_False is: 0.8476553626536604 where n = 3 and add one is: False
max prob: 0.9250610494684947 where n = 2 and add one is: 1


<br><br><br><br>
**Part 8**  
Let's use your Language model (dictionary) for generation (NLG).

When it comes to sampling from a language model decoder during text generation, there are several different methods that can be used to control the randomness and diversity of the generated text. 

Some of the most commonly used methods include:

> `Greedy sampling`
In this method, the model simply selects the word with the highest probability as the next word at each time step. This method can produce fluent text, but it can also lead to repetitive or predictable output.

> `Temperature scaling`  
Temperature scaling involves dividing the logits output of the language model by a temperature parameter before softmax normalization. A temperature above 1 flattens the probability distribution and raises the probability of lower-probability words, which can lead to more diverse and creative output; a temperature below 1 sharpens the distribution and moves the behavior toward greedy sampling.

> `Top-K sampling`  
In this method, the model restricts the sampling to the top-K most likely words at each time step, where K is a predefined hyperparameter. This can generate more diverse output than greedy sampling, while limiting the number of low-probability words that are sampled.

> `Nucleus sampling` (also known as top-p sampling)  
This method restricts the sampling to the smallest possible set of words whose cumulative probability exceeds a certain threshold, defined by a hyperparameter p. Like top-K sampling, this can generate more diverse output than greedy sampling, while avoiding sampling extremely low probability words.

> `Beam search`  
Beam search involves maintaining a fixed number k of candidate output sequences at each time step, and then selecting the k most likely sequences based on their probabilities. This can improve the fluency and coherence of the output, but may not produce as much diversity as sampling methods.

The choice of sampling method depends on the specific application and desired balance between fluency, diversity, and randomness. Hyperparameters such as temperature, K, p, and beam size can also be tuned to adjust the behavior of the language model during sampling.
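
The effect of these hyperparameters can be seen on a toy next-token distribution. The sketch below is illustrative only: the helper names and the distribution `dist` are assumptions, and each helper returns a renormalized distribution rather than drawing a sample.

```python
import math

def apply_temperature(probs, temperature):
    # Divide log-probabilities by the temperature, then renormalize (softmax)
    scaled = {tok: math.exp(math.log(p) / temperature)
              for tok, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {tok: v / total for tok, v in scaled.items()}

def top_k_filter(probs, k):
    # Keep only the k most likely tokens and renormalize
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    # Keep the most likely tokens until their cumulative probability reaches p
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

dist = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(apply_temperature(dist, 2.0))  # flatter than dist
print(apply_temperature(dist, 0.5))  # sharper than dist
print(top_k_filter(dist, 2))         # keeps a and b
print(top_p_filter(dist, 0.9))       # keeps a, b and c
```

Sampling from the filtered distribution, for example with `random.choices`, then yields the next token.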


You may read more about this concept in <a href='https://huggingface.co/blog/how-to-generate#:~:text=pad_token_id%3Dtokenizer.eos_token_id)-,Greedy%20Search,-Greedy%20search%20simply'>this</a> blog post.


**Please add the needed code for each sampling method:**

In [None]:
def sample_greedy(probabilities, k=1):
    return max(probabilities,key=probabilities.get)


def sample_temperature(probabilities, temperature=1.0, k=1):
    probs = np.array(list(probabilities.values()))
    scaled_logits = np.log(probs) / temperature
    scaled_probs = np.exp(scaled_logits) / np.sum(np.exp(scaled_logits))
    return np.random.choice(list(probabilities.keys()), p=scaled_probs)


def sample_topK(probabilities, k=2,is_char=False):
    # Sort by probability in descending order and keep the top k (token, probability) pairs
    top_k = sorted(probabilities.items(), key=lambda x: x[1], reverse=True)[:k]
    if is_char:
        return top_k
    # Sample among the top k in proportion to their probabilities
    tokens, weights = zip(*top_k)
    return random.choices(tokens, weights=weights)[0]


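def sample_topP(probabilities, p=0.9):
    sorted_probs = sorted(probabilities.items(), key=lambda x: x[1], reverse=True)
    cumulative_sum = 0
    tokens, weights = [], []
    # Keep the most likely tokens until their cumulative probability reaches p
    for (key, value) in sorted_probs:
        if cumulative_sum >= p:
            break
        cumulative_sum += value
        tokens.append(key)
        weights.append(value)
    # Sample within the nucleus in proportion to the original probabilities
    return random.choices(tokens, weights=weights)[0]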


def sample_beam(n,start_token, model, k=3, target_len=10):
    # Each beam is (generated_text, log_prob, finished). Summing log-probabilities
    # scores a sequence by the product of its token probabilities.
    beams = [("", 0.0, False)]
    for _ in range(int(target_len) - 1):
        all_candidates = []
        for text, score, finished in beams:
            history = start_token + text
            # Context is the last n characters, left-padded with <S> tokens
            context = '<S>' * max(n - len(history), 0) + history[-n:]
            options = model.get(context, {})
            if finished or not options:
                # Carry finished (or dead-end) beams forward unchanged
                all_candidates.append((text, score, True))
                continue
            for char, prob in sample_topK(options, k=k, is_char=True):
                if prob <= 0:
                    continue
                if char == '<E>':
                    all_candidates.append((text, score + math.log(prob), True))
                else:
                    all_candidates.append((text + char, score + math.log(prob), False))
        if not all_candidates:
            break
        # Keep only the k best candidates for the next step
        beams = sorted(all_candidates, key=lambda tup: tup[1], reverse=True)[:k]
    return [(text, score) for text, score, _ in beams]


Use your language model to generate text for each of the following examples with the corresponding parameters.    
Notice the 4 core issues: 
- Starting tokens
- Length of the generation
- Sampling method (use all)
- Stop token (if this token is sampled, stop generating)
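
The four issues listed above can be sketched as a single generation loop. This is a minimal illustration, not the required solution: `generate`, `toy_model`, and the `seed` parameter are hypothetical, and `toy_model` stands in for a lookup in a real language model dictionary.

```python
import random

def generate(next_token_probs, start_tokens, gen_length, stop_token, seed=0):
    rng = random.Random(seed)
    generated = ""
    # Length of the generation: never produce more than gen_length characters
    while len(generated) < gen_length:
        # Starting tokens: the context always includes the given prefix
        probs = next_token_probs(start_tokens + generated)
        tokens, weights = zip(*probs.items())
        # Sampling method: plain sampling in proportion to the probabilities
        generated += rng.choices(tokens, weights=weights)[0]
        # Stop token: end generation as soon as it has been produced
        if generated.endswith(stop_token):
            break
    return generated

def toy_model(history):
    # Hypothetical stand-in for a model lookup: alternate vowels and consonants
    if history[-1] in "ae":
        return {"l": 0.6, "m": 0.4}
    return {"a": 0.5, "e": 0.5}

text = generate(toy_model, "H", gen_length=10, stop_token="me")
print("H" + text)
```

Checking `endswith` rather than comparing single characters lets the same loop handle multi-character stop tokens such as "me".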

In [None]:
test_ = {
    'example1' : {
        'start_tokens' : "H",
        'sampling_method' : ['greedy','beam'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example2' : {
        'start_tokens' : "H",
        'sampling_method' : ['temperature','topK','topP'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example3' : {
        'start_tokens' : "He",
        'sampling_method' : ['greedy','beam','temperature','topK','topP'],
        'gen_length' : "20",
        'stop_token' : "me",
        'generation' : []
    }
}

Use your LM to generate a string based on the parameters of each example, and store the generated sequence in the generation list.

In [None]:
### your code here ###
n = 4
add_one = False
vocabulary = preprocess('nlp-course/lm-languages-data-new/') 
model = lm(n, vocabulary, 'nlp-course/lm-languages-data-new/en.csv', add_one)
for exmp in test_.values():
    gen_length = int(exmp["gen_length"])
    stop_token = exmp["stop_token"]
    for func in exmp["sampling_method"]:
        start_token = exmp["start_tokens"]
        start_len = len(start_token)
        generated_text = ""
        if func == 'beam':
            ans = sample_beam(n, start_token, model, k=3, target_len=gen_length)
            generated_text += ans[0][0]
        else:
            # Left-pad the start tokens with <S> so the first context has n tokens
            start_token = '<S>'*(n-start_len) + start_token
            for index in range(gen_length-1):
                if index < n+1-start_len:
                    # Early contexts still contain 3-character <S> tokens
                    ngram = start_token[index+2*index:index+3*n]
                else :
                    ngram = start_token[-n:]
                probabilities = model[ngram]
                if func == 'greedy':
                    next_char = sample_greedy(probabilities)
                elif func == 'temperature':
                    next_char = sample_temperature(probabilities)
                elif func == 'topK':
                    next_char = sample_topK(probabilities)
                elif func == 'topP':
                    next_char = sample_topP(probabilities)
                else : 
                    break
                if next_char == "<E>":
                    break
                generated_text += next_char
                start_token += next_char
                # Stop generating once the stop token has been sampled
                if generated_text.endswith(stop_token):
                    break
        exmp["generation"].append(generated_text)



#####################

In [None]:
### do not change ###
print('-------- NLG --------')

for k,v in test_.items():
  l = ''.join([f'\t{sm} >> {v["start_tokens"]}{g}\n' for sm,g in zip(v['sampling_method'],v['generation'])])
  print(f'{k}:')
  print(l)

-------- NLG --------
example1:
	greedy >> Hello Bad This is a great a big book at the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the same to the s

<br><br><br>
# **Good luck!**