# Language modeling
## 0: Install dependencies

**You can only use the libraries imported for you in this assignment**

In [3]:
!pip install transformers==4.24.0 datasets==2.7.0 tqdm==4.64.1 sentencepiece==0.1.97 gensim==4.2.0 apache-beam==2.42.0 sentence-transformers==2.2.2 googledrivedownloader

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/



## 1: Introduction
This section gives you a quick introduction/refresher to language models

### What is a language model?
A language model is a statistical model that gives you the probability of a given piece of text

### What is a token?
You can't estimate the probability of most sequences longer than a few words directly: even counting only the lowercase letters of the English alphabet, there are $26^N$ possible sequences of length N, a number that quickly becomes astronomically large.

Solution: break the text up into small units (tokens)

Each token is typically a word or punctuation mark (but it can be another short sequence of characters)
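
To get a feel for how fast the number of character sequences grows, the short sketch below counts the possible lowercase-letter sequences for a few lengths. The helper name `count_sequences` is made up for this illustration and is not part of the assignment.

```python
def count_sequences(length: int, alphabet_size: int = 26) -> int:
    """Number of distinct sequences of `length` symbols from the alphabet."""
    return alphabet_size ** length

# Print the count for a few sequence lengths
for length in (1, 2, 5, 10, 20):
    print(length, count_sequences(length))
```

Already at length 10 there are more than $10^{14}$ possibilities, far more than any corpus could cover.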

**Question:** a) Finish the implementation exercises
#### Your first exercise is to create a tokenizer which takes some text as input and outputs a list of tokens

To make things a little easier you can assume that all tokens are separated by " " or "-"

You may use the `re` module, but there are simpler solutions that do not need it

In [4]:
import re

# Basic tokenizer function
def tokenize(text: str) -> list:
  """
  Input: text, the string to be tokenized
  Output: tokens, a list of token strings

  Turns text into a list of tokens
  """
  # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!! 
  tokens = re.split(r"-| ", text)
  # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 
  return tokens

example_text = "Is water a Non-Newtonian fluid ?"
tokenized_example_text = tokenize(example_text)
print(tokenized_example_text)
# Expected output: ['Is', 'water', 'a', 'Non', 'Newtonian', 'fluid', '?']

['Is', 'water', 'a', 'Non', 'Newtonian', 'fluid', '?']


One of the simplest language models is the unigram model. It stores the probability of encountering each token, ignoring the surrounding tokens (it does not use conditional probability):

$P(sentence)=P(token_1)P(token_2)...P(token_N)$
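
In practice this product is usually computed in log space, which is why the implementation below stores log probabilities: taking the logarithm turns the product into a sum and avoids numerical underflow when many small probabilities are multiplied.

$\log P(sentence)=\log P(token_1)+\log P(token_2)+...+\log P(token_N)$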

In [44]:
import numpy as np

class Unigram:

  def __init__(self):
    """
    Initializes log probabilities
    """
    self.log_probabilities = {}
    self.unknown_log_probability = 0.0

  def train(self, sentences: list)->None:
    """
    Input: sentences, list of already tokenized sentences 
    Ex. [['Hello','my','name','is','HAL'],['Hi','HAL']]

    Save log probability of seeing each token using `np.log` to obtain the log probabilities

    """
    # Add a single unknown token
    sentences.append(['<unknown token>'])
    # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!! 
    total = []
    for sentence in sentences:
      total = total + sentence
    token_num = len(total)
    unique = set(total)
    probabilities = {}
    for token in unique:
      probabilities[token] = 0
    for token in total:
      probabilities[token] += 1/token_num
    for token in unique:
      self.log_probabilities[token] = np.log(probabilities[token])
    # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 
    # Assign probability for unseen tokens

    self.unknown_log_probability = self.log_probabilities.pop('<unknown token>')

  def token_log_prob(self, token:str) -> float:
    """
    Get the log probability of a single token, falling back to
    self.unknown_log_probability if the token was not seen during training
    """
    # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!! 
    if token not in self.log_probabilities:
      return self.unknown_log_probability
    return self.log_probabilities[token]
    # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 

  def sentence_log_prob(self, sentence:list) -> float:
    """
    Get the log probability of an already tokenized sentence
    """
    # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!! 
    s_prob = 0
    for token in sentence:
      s_prob += self.token_log_prob(token)
    return s_prob 
    # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 

model = Unigram()
model.train([['Hello','my','name','is','HAL'],['Hi','HAL']])
print('"Hello" log prob:',model.token_log_prob('Hello'))
print('"Hi my name is HAL" log prob:',model.sentence_log_prob(tokenize("Hi my name is HAL")))

"Hello" log prob: -2.0794415416798357
"Hi my name is HAL" log prob: -9.704060527839234

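As a sanity check on the two numbers printed above, the same values can be reproduced by hand with only the standard library. This is an independent sketch, not the assignment's implementation: the names `counts`, `total` and `log_prob` are introduced here, and it assumes (as `train` does) that one `<unknown token>` entry is added to the corpus before counting.

```python
import math
from collections import Counter

# Toy corpus from the cell above, plus the single unknown-token entry
sentences = [['Hello', 'my', 'name', 'is', 'HAL'], ['Hi', 'HAL'], ['<unknown token>']]
counts = Counter(token for sentence in sentences for token in sentence)
total = sum(counts.values())  # 8 tokens in total

def log_prob(token: str) -> float:
    # Unseen tokens fall back to the count of the unknown token
    count = counts.get(token, counts['<unknown token>'])
    return math.log(count / total)

hello = log_prob('Hello')  # log(1/8)
sentence = sum(log_prob(t) for t in ['Hi', 'my', 'name', 'is', 'HAL'])  # 4*log(1/8) + log(2/8)
print(hello, sentence)
```

Each of `Hi`, `my`, `name` and `is` occurs once in 8 tokens and `HAL` occurs twice, so the sentence log probability is $4\log(1/8)+\log(2/8)$.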

We can use the Unigram model to classify text (though it may not achieve the highest accuracy)
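
The classifier built in the following cells trains one unigram model per label and picks the label whose log prior plus sentence log probability is largest. Written out as a decision rule (the symbols $\hat{y}$ and $label$ are notation introduced here, not taken from the assignment):

$\hat{y}=\arg\max_{label}\left[\log P(label)+\sum_{i=1}^{N}\log P(token_i|label)\right]$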

In [45]:
# from datasets import load_dataset

import pandas as pd

from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(
    file_id='1hyziAb67N3RuhhBSCtBhpQEPgr6ZIpJP',
    dest_path='/tmp/emotion.csv'
)
df_full = pd.read_csv('/tmp/emotion.csv')
df_train = df_full[df_full['subset']=='train'].copy()
df_test = df_full[df_full['subset']=='test'].copy()
label_key = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Init models
total_count = len(df_train)
label_counts = df_train['label'].value_counts().sort_index()
models = [{
     'index': i,
     'label': label,
     'log_prior': np.log(label_counts.iloc[i]/total_count),
     'unigram_model': Unigram(),
} for i, label in enumerate(label_key)]

# Train models
for model in models:
  df_train_matching_label = df_train[df_train['label']==model['index']]
  tokenized_sentences = df_train_matching_label['text'].apply(tokenize).tolist()
  model['unigram_model'].train(tokenized_sentences)

In [46]:
# Predict classes
def predict(sentence:str)->int:
  tokenized_sentence = tokenize(sentence)
  highest_log_prob = float('-inf')
  highest_log_prob_index = 0
  for model in models:
    # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!! 
    # Compute log prob of the sentence using the unigram model + the log prior of the label
    log_prob = model['unigram_model'].sentence_log_prob(tokenized_sentence) + model['log_prior']
    # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 
    if log_prob > highest_log_prob:
      highest_log_prob = log_prob
      highest_log_prob_index = model['index']
  return highest_log_prob_index

df_test['predicted_label'] = df_test['text'].apply(predict)

tp_count = sum(df_test['predicted_label']==df_test['label'])
accuracy = tp_count/len(df_test)
print(f'Accuracy: {accuracy*100}%')

Accuracy: 62.3%



## 2: Types of Language Models
This section explains different types of language models. We will go over three of the most widely used language model types:
1. Causal
2. Masked
3. Sequence to sequence



### 2.1: Causal language model

A causal language model provides the probability of a token given the tokens before it

$P(token_T|token_1,token_2,...,token_{T-1})$

It is useful for a variety of NLP tasks, including sequence generation and sequence classification
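
Multiplying the conditional probabilities above over all positions gives the probability of a whole sequence (the chain rule of probability), which is how a causal model can score or generate text one token at a time:

$P(token_1,...,token_N)=\prod_{T=1}^{N}P(token_T|token_1,...,token_{T-1})$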

Example:
Hello, my name is ...

Output:
Hello, my name is HAL

In [47]:
from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel, set_seed
from tqdm import tqdm

import torch
import torch.nn as nn

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.
Some weights of OpenAIGPTLMHeadModel were not initialized from the model checkpoint at openai-gpt and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Here is an example

In [48]:
def generate_gpt_text_greedy(input_text,sequence_max_length = 25):
  """
  This uses greedy decoding which is not optimal for most tasks,
  but requires little system resources and is simple to implement
  
  Recommended for deterministic tasks: Beam search
  Recommended for creative tasks: Nucleus sampling
  Recommended for low latency/real time tasks: Greedy decoding or Nucleus sampling

  Note: Optimal decoding/generation algorithm depends on the task
  """
  generated_text = input_text

  # Generate a sequence
  for i in tqdm(range(sequence_max_length)):
    with torch.no_grad(): # Better performance
      inputs = tokenizer(generated_text, return_tensors="pt")
      outputs = model(**inputs)
      next_token_logits = outputs.logits[0, -1, :]
      next_token_index = torch.argmax(next_token_logits)
      generated_text = tokenizer.decode(
          torch.cat((inputs['input_ids'][0],torch.tensor([next_token_index]))) # generated_text = generated_text + new_token
      )
  
  return generated_text

input_text = "Hello, my name is John Smith. I am the"

print(f'Generated: {generate_gpt_text_greedy(input_text)}')

100%|██████████| 25/25 [00:04<00:00,  5.07it/s]

Generated: hello, my name is john smith. i am the head of the department of defense. "





In [49]:
alt_text = "John said it again three times: \"No No No"
print(f'Generated: {generate_gpt_text_greedy(alt_text)}')

100%|██████████| 25/25 [00:05<00:00,  4.68it/s]

Generated: john said it again three times : " no no no no no no no no no no no no no no no no no no no no no no no no no no no no





**Question:** b) 

i) What do you notice about the generated text?



ii) How can this be avoided?


**Question:** c) (*CMPUT 566 Students Only*)
Implement the Nucleus Sampling method described in Section 3.1 of https://arxiv.org/pdf/1904.09751.pdf. You can use any `torch` or `torch.nn` (`nn`) functions

Hint: Use `torch.multinomial` for sampling
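
For orientation, here is a minimal sketch of the top-p idea on a plain list of logits, written in pure Python so it runs without a model. It is an illustration under stated assumptions rather than the expected solution: the function name `nucleus_sample` and the `rng` parameter are made up here, and in the notebook the same steps would presumably be expressed with tensor operations such as `torch.softmax`, `torch.sort`, `torch.cumsum` and `torch.multinomial`.

```python
import math
import random

def nucleus_sample(logits, p=0.9, rng=random):
    """Sample one index from `logits` using top-p (nucleus) filtering."""
    # Softmax, subtracting the max logit for numerical stability
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    normalizer = sum(exps)
    probs = [e / normalizer for e in exps]

    # Token indices sorted by probability, highest first
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Keep the smallest prefix whose cumulative probability reaches p
    nucleus, cumulative = [], 0.0
    for index in order:
        nucleus.append(index)
        cumulative += probs[index]
        if cumulative >= p:
            break

    # Sample inside the nucleus; random.choices rescales the weights itself
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Usage: with probabilities 0.5, 0.3, 0.15, 0.05 and p=0.9,
# the nucleus is the first three tokens, so index 3 is never drawn
toy_logits = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
print(nucleus_sample(toy_logits, p=0.9, rng=random.Random(0)))
```

Truncating the low-probability tail before sampling keeps the randomness that greedy decoding lacks while avoiding very unlikely tokens.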



In [11]:
softmax = nn.Softmax(dim=0)

def generate_gpt_text_nucleus_sampling(input_text, sequence_max_length = 25, p=0.9):
  """
  This uses nucleus (top-p) sampling, which samples the next token from the
  smallest set of most likely tokens whose cumulative probability is at least `p`
  
  Recommended for deterministic tasks: Beam search
  Recommended for creative tasks: Nucleus sampling
  Recommended for low latency/real time tasks: Greedy decoding or Nucleus sampling

  Note: Optimal decoding/generation algorithm depends on the task
  """
  generated_text = input_text

  # Generate a sequence
  for i in tqdm(range(sequence_max_length)):
    with torch.no_grad(): # Better performance
      inputs = tokenizer(generated_text, return_tensors="pt")
      outputs = model(**inputs)
      # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!! 

      # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 
      generated_text = tokenizer.decode(
          torch.cat((inputs['input_ids'][0],torch.tensor([next_token_index]))) # generated_text = generated_text + new_token
      )
  
  return generated_text


torch.manual_seed(314159)
input_text = "Hello, my name is John Smith. I am the"

print(f'Generated: {generate_gpt_text_nucleus_sampling(input_text)}')

  0%|          | 0/25 [00:00<?, ?it/s]


NameError: ignored

In [None]:
inputs = tokenizer("Hello, my name is John Smith. I am the", return_tensors="pt")
torch.cat((inputs['input_ids'][0],torch.tensor([1000])))

In [None]:
inputs['input_ids']

### 2.2: Masked language models
A masked language model provides the probability of a token given the tokens before it and after it (fill in the blanks)

$P(token_T|token_1,...,token_{T-1},token_{T+1},...,token_{N})$

It is useful for a variety of NLP tasks, including sequence classification and grammar correction

Example: Hello, my name is ...

Output: Hello, my name is HAL

In [50]:
from transformers import BertTokenizer, BertForMaskedLM

import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [51]:
example_text = "The capital of Alberta is [MASK]."

def predict_mask(input_text):

  with torch.no_grad():
    inputs = tokenizer(input_text, return_tensors="pt")
    logits = model(**inputs).logits
    mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
    predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
  return tokenizer.decode(predicted_token_id)

print('Input: ',example_text)
print('Mask prediction: ',predict_mask(example_text))

Input:  The capital of Alberta is [MASK].
Mask prediction:  edmonton


**Question:** d) Use the `predict_mask` function and the `[MASK]` token to extract a fact from the language model (similar to the example above). Include your input and the model's prediction in your PDF report

In [52]:
your_prompt = "The winner of World Cup 2022 is [MASK]."

print('Input: ',your_prompt)
print('Mask prediction: ',predict_mask(your_prompt))

Input:  The winner of World Cup 2022 is [MASK].
Mask prediction:  azerbaijan


### 2.3: Sequence to sequence models
A sequence-to-sequence model provides the probability of a token given the tokens before it and all tokens in another related sequence
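
In the notation of the previous sections this could be written as follows, where $source_1,...,source_M$ is a name introduced here for the tokens of the related (input) sequence:

$P(token_T|token_1,...,token_{T-1},source_1,...,source_M)$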

It is useful for a variety of NLP tasks, including translation and summarization (primarily used for text generation)

Example: Bonjour, je m'appelle HAL (French)

Output: Hello, my name is HAL (English)

In [53]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
from tqdm import tqdm

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")


In [54]:
def t5_summarize(text, max_length=20):
  # inference
  input_ids = tokenizer(
      f"summarize: {text}", return_tensors="pt"
  ).input_ids
  outputs = model.generate(
      input_ids,
      max_length=max_length,
  )
  return tokenizer.decode(outputs[0], skip_special_tokens=True)
# The Road Not Taken
# Poem by Robert Frost (1916)
# (Public Domain)
poem = """Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference."""

print('Summary: ',t5_summarize(poem))

Summary:  two roads diverged in a yellow wood, and I took the one less traveled by


The metaphors are lost on the model, but it still does a fairly good job summarizing the literal meaning of the poem

**Question:** e)

i) Find a short piece of text (article, poem, section of a paper) and get the model to summarize it. Include the summary in your report

In [60]:
short_text = """
Despite considerable advances in neural language modeling, it remains an open
question what the best decoding strategy is for text generation from a language
model (e.g. to generate a story). The counter-intuitive empirical observation is
that even though the use of likelihood as training objective leads to high quality
models for a broad range of language understanding tasks, maximization-based
decoding methods such as beam search lead to degeneration — output text that is
bland, incoherent, or gets stuck in repetitive loops.
"""

print('Summary: ',t5_summarize(short_text,max_length=35)) # You can change `max_length` if summary seems truncated

Summary:  decoding is a strategy that can be used to generate a story. the decoding method is a good example of a language model.


ii) Is the summary accurate? If yes, explain why. If not, explain how the summary could be improved