# Machine Exercise 7
### By: Jeryl Salas

We are tasked to train a transformer-based decoder only language model using 1000 documents. I specifically used the stack v2 from HuggingFace as the dataset of my choice. My transformer model is then compared to my trigram model using the same validation set.

## 1. Processing Stage

### 1.A Import Libraries

In [28]:
# For loading and preprocessing of data
import os  # For handling file paths
from dotenv import load_dotenv  # Loads environment variables from .env file
from huggingface_hub import login  # Used to log into the Hugging Face Hub
import boto3  # Used for interacting with AWS services.
from smart_open import open  # Used for opening files
from datasets import load_dataset  # Used to load Hugging Face's Datasets
import gc  # Clearing unused memory
import shutil  # Copying and removing directories or files.
import torch  

# For use in transformer model
from transformers import BertTokenizer, GPT2LMHeadModel, AdamW  # Transformer library
from sklearn.model_selection import train_test_split  # Used to split training and testing sets
from torch.utils.data import DataLoader  # For batching datasets

# For use in trigram model with EM algorithm
from nltk.tokenize import RegexpTokenizer, sent_tokenize  # Tokenizer
from typing import Iterator  # Faster iteration
import numpy as np  


### 1.B Define Settings
We define hyperparameters and file paths for the transformer model like the maximum number of documents, max token length, number of epochs, batch size, learning rate and the hugging face & AWS credential paths for login authentication

In [29]:
MAX_DOCUMENTS = 1000
MAX_LENGTH = 512
EPOCHS = 3
BATCH_SIZE = 8
l_r=5e-5
huggingface_token_file_path = r'C:\Users\Jeryl Salas\Documents\AI 351\MEx 7 Transformers\huggingface_token.txt'
AWS_credential_path = r'C:\Users\Jeryl Salas\Documents\AI 351\MEx 7 Transformers\credentials.env'

### 1.C Loading Data
The dataset used in this code is "bigcode/the-stack-v2", a collection of publicly available source code. The dataset is designed for tasks like code understanding, code generation, and code-related machine learning. It is stored in an S3 bucket, and we'd have to access it using our Hugging Face token to login into the Hugging Face Hub and we use our AWS credentials to get the S3 bucket which we can download the dataset's contents. We're only loading 1000 documents and we allow to do streaming so I can avoid experiencing memory issues

In [30]:
# Get Hugging Face token and login
with open(huggingface_token_file_path, "r") as token_file:
    huggingface_token = token_file.read().strip()  # Read and strip any surrounding whitespace or newlines
login(huggingface_token)

# Loading environment variables which contains AWS credentials and creating an AWS session
load_dotenv(dotenv_path=AWS_credential_path)
session = boto3.Session(
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"]
)
s3 = session.client("s3")

# Download from s3 bucket
def download_cont(blob_id, src_encoding):
    s3_url = f"s3://softwareheritage/content/{blob_id}"
    
    with open(s3_url, "rb", compression=".gz", transport_params={"client": s3}) as fin:
        content = fin.read().decode(src_encoding)
    
    return {"content": content}


# Load dataset 
ds = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)


# Limit to 1000 samples
sampled_data = []
for idx, row in enumerate(ds):
    if idx >= MAX_DOCUMENTS:  
        break
    content = download_cont(row["blob_id"], row["src_encoding"])
    sampled_data.append(content)


# Print a sample content
for row in sampled_data:
    print(row["content"])
    break  


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\Jeryl Salas\.cache\huggingface\token
Login successful


Resolving data files:   0%|          | 0/917 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/917 [00:00<?, ?it/s]

///////////////////////////////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2019, ООО 1С-Софт
// Все права защищены. Эта программа и сопроводительные материалы предоставляются 
// в соответствии с условиями лицензии Attribution 4.0 International (CC BY 4.0)
// Текст лицензии доступен по ссылке:
// https://creativecommons.org/licenses/by/4.0/legalcode
///////////////////////////////////////////////////////////////////////////////////////////////////////

#Если Сервер Или ТолстыйКлиентОбычноеПриложение Или ВнешнееСоединение Тогда

#Область ОбработчикиСобытий

Процедура ПриКомпоновкеРезультата(ДокументРезультат, ДанныеРасшифровки, СтандартнаяОбработка)
	
	ТаблицаРезультата = ЗарегистрированныеОбъекты();
	
	СтандартнаяОбработка = Ложь;
	НастройкиКД = КомпоновщикНастроек.ПолучитьНастройки();
	ВнешниеНаборыДанных = Новый Структура("ТаблицаРезультата", ТаблицаРезультата);
	
	КомпоновщикМакетаКД = Новый КомпоновщикМакетаКомпоновкиДанных;
	МакетКД = 

In [31]:
# Print 20 samples
i = 0
for ds in sampled_data:
    print(ds)
    if i == 20:
        break
    i+=1

{'content': '\ufeff///////////////////////////////////////////////////////////////////////////////////////////////////////\n// Copyright (c) 2019, ООО 1С-Софт\n// Все права защищены. Эта программа и сопроводительные материалы предоставляются \n// в соответствии с условиями лицензии Attribution 4.0 International (CC BY 4.0)\n// Текст лицензии доступен по ссылке:\n// https://creativecommons.org/licenses/by/4.0/legalcode\n///////////////////////////////////////////////////////////////////////////////////////////////////////\n\n#Если Сервер Или ТолстыйКлиентОбычноеПриложение Или ВнешнееСоединение Тогда\n\n#Область ОбработчикиСобытий\n\nПроцедура ПриКомпоновкеРезультата(ДокументРезультат, ДанныеРасшифровки, СтандартнаяОбработка)\n\t\n\tТаблицаРезультата = ЗарегистрированныеОбъекты();\n\t\n\tСтандартнаяОбработка = Ложь;\n\tНастройкиКД = КомпоновщикНастроек.ПолучитьНастройки();\n\tВнешниеНаборыДанных = Новый Структура("ТаблицаРезультата", ТаблицаРезультата);\n\t\n\tКомпоновщикМакетаКД = Новый

## 2. Transformer Model

### 2.A Tokenization and Splitting of Data for Transformer
Our transformer model uses GPT-2 model using tokenized data which was tokenized by BERT's tokenizer which is based on Wordpiece tokenization. The sample data is tokenized and padded. After tokenization, the data is split into training and validation sets (90-10 split).

In [32]:

# Load GPT-2 model and tokenizer. Check if CUDA is available
model = GPT2LMHeadModel.from_pretrained("gpt2")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Will use {device} as our device")
model.to(device)

# Tokenization and padding
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

tokenized_docs = [
    tokenizer(
        sample['content'],
        truncation=True,
        padding="max_length",
        max_length=MAX_LENGTH,
        return_tensors="pt"
    ) for sample in sampled_data
]

# Preprocess datasets into batches
input_ids = torch.cat([doc['input_ids'] for doc in tokenized_docs])
attention_mask = torch.cat([doc['attention_mask'] for doc in tokenized_docs])

# Split data into training and validation
train_input_ids, val_input_ids, train_attention_mask, val_attention_mask = train_test_split(
    input_ids, attention_mask, test_size=0.1  # 10% for validation
)

# Move tensors to the device (In our case GPU)
train_input_ids = train_input_ids.to(device)
train_attention_mask = train_attention_mask.to(device)
val_input_ids = val_input_ids.to(device)
val_attention_mask = val_attention_mask.to(device)

# Optimizer
optimizer = AdamW(model.parameters(), lr=l_r)

# Training Set
train_data = torch.utils.data.TensorDataset(train_input_ids, train_attention_mask)
train_dataloader = torch.utils.data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)

# Validation Set
val_data = torch.utils.data.TensorDataset(val_input_ids, val_attention_mask)
val_dataloader = torch.utils.data.DataLoader(val_data, batch_size=BATCH_SIZE)


Will use cuda as our device




### 2.B Training Model
The model undergoes training over epochs (3). The loss is computed and printed for each batch. After calculating the loss, backpropagation and optimization are performed to update the model’s parameters. Once all batches for an epoch are processed, the model continues to the next epoch. 

In [33]:
model.train()

for epoch in range(EPOCHS):
    print(f"Epoch {epoch+1}/{EPOCHS}")
    
    for batch in train_dataloader:
        input_ids_batch, attention_mask_batch = batch
        optimizer.zero_grad()  
        outputs = model(input_ids=input_ids_batch, attention_mask=attention_mask_batch, labels=input_ids_batch)

        # Calculate loss
        loss = outputs.loss
        print(f"Batch loss: {loss.item()}")

        # Backpropagation and optimization
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1} finished.")


Epoch 1/3
Batch loss: 8.149853706359863
Batch loss: 6.809586524963379
Batch loss: 6.7234954833984375
Batch loss: 6.283030986785889
Batch loss: 5.931887626647949
Batch loss: 5.593898296356201
Batch loss: 5.234626770019531
Batch loss: 5.377565860748291
Batch loss: 4.878478527069092
Batch loss: 4.263491153717041
Batch loss: 5.057415962219238
Batch loss: 5.144780158996582
Batch loss: 4.354034900665283
Batch loss: 4.300561904907227
Batch loss: 4.2146220207214355
Batch loss: 4.6291680335998535
Batch loss: 4.132321834564209
Batch loss: 4.359503269195557
Batch loss: 3.3162498474121094
Batch loss: 3.341586112976074
Batch loss: 3.561171054840088
Batch loss: 4.211730003356934
Batch loss: 4.060637474060059
Batch loss: 4.751778602600098
Batch loss: 3.6513125896453857
Batch loss: 3.5888383388519287
Batch loss: 3.6306259632110596
Batch loss: 3.4563140869140625
Batch loss: 3.4942500591278076
Batch loss: 4.044579982757568
Batch loss: 3.615466356277466
Batch loss: 3.7369492053985596
Batch loss: 3.373864

### 2.C Evaluating Model
The model undergoes evaluation using the validation set. The performance is recorded by the model's perplexity. For each batch, the loss is computed using the input IDs as both the input and the target labels. The total loss is accumulated across all tokens in the batch. The total loss is divided by the total number of tokens, and the perplexity is calculated as the exponential of this value

### Accumulated Loss Formula:
$$
\text{Total Loss} = \sum_{i=1}^{N} \left( \text{Batch Loss}_i \times \text{Number of Tokens}_i \right)
$$

### Perplexity Formula:
$$
\text{Perplexity} = \exp\left(\frac{\text{Total Loss}}{\text{Total Tokens}}\right)
$$


In [34]:

model.eval()

def calculate_perplexity(model, dataloader):
    total_loss = 0.0
    total_tokens = 0

    with torch.no_grad():
        for batch in dataloader:
            input_ids_batch, attention_mask_batch = batch
            outputs = model(input_ids=input_ids_batch, attention_mask=attention_mask_batch, labels=input_ids_batch)
            total_loss += outputs.loss.item() * input_ids_batch.size(1)  # Multiply by the number of tokens
            total_tokens += input_ids_batch.size(1)
    
    return torch.exp(torch.tensor(total_loss / total_tokens))


# Evaluate model
perplexity = calculate_perplexity(model, val_dataloader)
print(f"Model perplexity: {perplexity.item()}")

Model perplexity: 3.5677289962768555


## 3. Trigram Language Model

### 3.A Preprocess data
We preprocess the data the same way that we did in the previous Machine Exercise #3 by using the Regexp tokenizer. We store the tokenized source codes in an array both for training and validation sets

In [35]:

content_array = []
train_array = []
val_array = []

# We iterate over each sample in sampled_data and extract the content
for sample in sampled_data:
    content_array.append(sample['content'])

# Splitting content_array into 90-10 training validation
train_data, val_data = train_test_split(content_array, test_size=0.1, random_state=42)

def generate_tokens(paragraph: str) -> Iterator[str]:
    """
    Tokenize sentences using RegexpTokenizer and appending '[END]' token on each setence
    """
    word_tokenizer = RegexpTokenizer(r'[-\'\w]+')

    for sentence in sent_tokenize(paragraph):
        tokenized_sentence = word_tokenizer.tokenize(sentence)
        if tokenized_sentence:
            tokenized_sentence.append('[END]')
            yield tokenized_sentence
        

for text in train_data:
    for tokenized_text in generate_tokens(text):
        train_array.append(tokenized_text)

for text in val_data:
    for tokenized_text in generate_tokens(text):
        val_array.append(tokenized_text)


### 3.B Generate N grams
With this class, we can count the unigrams, bigrams, and trigrams. This is where they can also compute the respective n gram probabilities. These functions will be accessed in both the EM Algorithm and the generation of texts.

In [36]:

class generate_Ngrams:
    """
    Main class for dealing with N grams involving functions such as counting and calculating probabilities (unigram, bigram, trigram)
    """
    def __init__(self, train_data, test_data):
        self.sentences = train_data
        self.test_sentences = test_data
        self.unigram_counts = {}
        self.total_unigrams = sum(self.unigram_counts.values())
        self.bigram_counts = {}
        self.trigram_counts = {}
        self.unigram_prob = {}
        self.bigram_prob = {}
        self.trigram_prob = {}
        self.a = 1
        self.count()
    
    def count_ngrams(self, tokens, n):
        """
        Used for countring bigrams and trigrams
        """
        ngram_counts = {}
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            if ngram in ngram_counts:
                ngram_counts[ngram] += 1
            else:
                ngram_counts[ngram] = 1
        return ngram_counts
    
    def count(self):
        """
        Main function for storing unigram, bigram, and trigram counts on a dict
        """
        for tokens in self.sentences:
            for unigram in tokens:
                self.unigram_counts[unigram] = self.unigram_counts.get(unigram, 0) + 1  
            
            bigrams = self.count_ngrams(tokens, 2)
            for bigram, count in bigrams.items():
                if bigram in self.bigram_counts:
                    self.bigram_counts[bigram] += count
                else:
                    self.bigram_counts[bigram] = count
                 
            trigrams = self.count_ngrams(tokens, 3)
            for trigram, count in trigrams.items():
                if trigram in self.trigram_counts:
                    self.trigram_counts[trigram] += count
                else:
                    self.trigram_counts[trigram] = count

    def calc_unigram_prob(self, word):
        """
        Function for calculating unigram probability. Used Laplace smoothing with a = 1
        """
        uni_count = self.unigram_counts.get(word, 0)
        prob = (uni_count + self.a) / (self.total_unigrams + self.a * len(self.unigram_counts))
        #print(f"unigram prob = {prob}")
        return prob
    
    def calc_bigram_prob(self, w1, w2):
        """
        Function for calculating bigram probability. Used Laplace smoothing with a = 1
        """
        uni_count = self.unigram_counts.get(w1, 0)
        bi_count = self.bigram_counts.get((w1, w2), 0)
        vocab_size = len(self.unigram_counts)
        prob = (bi_count + self.a) / (uni_count + self.a * vocab_size)
        #print(f"bigram prob = {prob}")
        return prob
    
    def calc_trigram_prob(self, w1, w2, w3):
        """
        Function for calculating trigram probability. Used Laplace smoothing with a = 1
        """
        bi_count = self.bigram_counts.get((w1, w2), 0)
        tri_count = self.trigram_counts.get((w1, w2, w3), 0)
        vocab_size = len(self.unigram_counts)

        prob = (tri_count + self.a) / (bi_count + self.a * vocab_size)
        #print(f"trigram prob = {prob}")
        return prob
    
        

### 3.C Perplexity Computation and Lambda Optimizer
With this step, we compute interpolation probabilities for each token in training sentences and we use n gram probabilities as our basis in updating our lambdas via EM algorithm. The changes in perplexity will also be observed each iteration. For now, we use max iterations of 20 with stopping critera. 

In [37]:
class EM_Algorithm:
    """
    Main class for the training model. Functions include interpolation, updating of expectations and lambdas, and the main optimization function that facilitates the EM Algorithm
    """
    def __init__(self, gen_data, lambdas, eps):
        self.data = gen_data
        self.lambdas = lambdas
        self.eps = eps
        self.total_unigrams = sum(self.data.unigram_counts.values())

    def compute_interpolated_probability(self, w1, w2, w3, lambdas):
        """
        Computes interpolated probability which inclues the unigram, bigram, and trigram probabilities
        """
        p_tri = self.data.calc_trigram_prob(w1, w2, w3)
        p_bi = self.data.calc_bigram_prob(w2, w3)
        p_uni = self.data.calc_unigram_prob(w3)
        inter_p = lambdas[0] * p_tri + lambdas[1] * p_bi + lambdas[2] * p_uni
        inter_p = max(inter_p, 1e-12)
        #print(f"inter prob = {inter_p}")
        log_inter_p = np.log(inter_p)
        #print(f"log inter prob = {log_inter_p}")
        return log_inter_p, inter_p, p_tri, p_bi, p_uni

    def expectation_step(self, sentences, lambdas):
        """
        Function used to update expectations of counts per sentence iteration and computes total log probability for perplexity computation
        """
        total_log_prob = 0
        expected_counts = np.zeros(3) 
        #print(f"initial exp count = {expected_counts}")

        for tokens in sentences:
            #print(f"for sentence = {s}")
            #print("_____________________")
            for i in range(2, len(tokens)):
                w1, w2, w3 = tokens[i-2], tokens[i-1], tokens[i]
                #print(f"for iter {i}: w1={w1}, w2={w2}, w3={w3}, lambdas={lambdas}, tokens={tokens}")
                log_inter_p, total_p, p_tri, p_bi, p_uni =  self.compute_interpolated_probability(w1, w2, w3, lambdas)
                total_log_prob += log_inter_p
                #print(f"cumulative log prob = {total_log_prob}")

                if total_p > 0:
                    expected_counts[0] += (lambdas[0] * p_tri) / total_p
                    expected_counts[1] += (lambdas[1] * p_bi) / total_p
                    expected_counts[2] += (lambdas[2] * p_uni) / total_p
                #print(f"updt. expected counts = {expected_counts}")

        #print(f"after exp. step: total log prob = {total_log_prob}, exp counts = {expected_counts}")
        #print("___________________________________________________________________________________")
        return total_log_prob, expected_counts

    def update_lambdas(self, exp_counts):
        """
        Function used to update lambdas
        """
        total_counts = np.sum(exp_counts)
        new_lambdas = exp_counts / total_counts
        print(f"updt. lambdas = {new_lambdas}")
        return new_lambdas


    def optimize(self):
        """
        Main function that facilitates the update of lambdas and computation of perplexity
        """
        prev_perplexity = float('inf')
        for iteration in range(20):
            total_log_prob, expected_counts = self.expectation_step(generated_data.sentences, self.lambdas)
            average_log_prob = total_log_prob / self.total_unigrams
            print(f"avg log prob = {total_log_prob} / {self.total_unigrams} = {average_log_prob}")
            current_perplexity = np.exp(-average_log_prob)
            print(f"updt. perplexity = {current_perplexity}")
            self.lambdas = self.update_lambdas(expected_counts)

            if abs(current_perplexity - prev_perplexity) < self.eps:
                print(f"Converged after {iteration} iterations.")
                break

            print(f"Iteration {iteration}, Perplexity: {current_perplexity}")
            prev_perplexity = current_perplexity
        

        return self.lambdas, current_perplexity
    
    def generate_sentence(self, max_length=20):
        """
        Function that generates sentences using the model's hyperparameters
        """
        sentence = ['<s>', '<s>'] 
        while len(sentence) < max_length:
            w1, w2 = sentence[-2], sentence[-1]
            possible_words = list(self.data.unigram_counts.keys())
            probabilities = []

            for w3 in possible_words:
                _, inter_p, _, _, _ = self.compute_interpolated_probability(w1, w2, w3, self.lambdas)
                probabilities.append(inter_p)
            
            probabilities = np.array(probabilities) / sum(probabilities)

            next_word = np.random.choice(possible_words, p=probabilities)
            if next_word == '</s>':
                break
            sentence.append(next_word)
        
        return ' '.join(sentence[2:])


generated_data = generate_Ngrams(train_data, val_data)
lambdas = np.array([0.9, 0.9, 0.9])
eps = 1e-10
model = EM_Algorithm(generated_data, lambdas, eps)
lambdas, perp = model.optimize()


avg log prob = 146827144.9771011 / 21818015 = 6.729628931738341
updt. perplexity = 0.001194976293658231
updt. lambdas = [6.21662349e-04 2.40432716e-04 9.99137905e-01]
Iteration 0, Perplexity: 0.001194976293658231
avg log prob = 149087878.0164507 / 21818015 = 6.833246654952373
updt. perplexity = 0.0010773546346206269
updt. lambdas = [3.96431643e-07 5.97311593e-08 9.99999544e-01]
Iteration 1, Perplexity: 0.0010773546346206269
avg log prob = 149106673.87517267 / 21818015 = 6.834108138397222
updt. perplexity = 0.0010764269111052708
updt. lambdas = [2.52589719e-10 1.48267184e-11 1.00000000e+00]
Iteration 2, Perplexity: 0.0010764269111052708
avg log prob = 149106683.810402 / 21818015 = 6.834108593765381
updt. perplexity = 0.0010764264209348414
updt. lambdas = [1.60939569e-13 3.68034846e-15 1.00000000e+00]
Iteration 3, Perplexity: 0.0010764264209348414
avg log prob = 149106683.82102886 / 21818015 = 6.8341085942524495
updt. perplexity = 0.0010764264204105483
updt. lambdas = [1.02543940e-16 9.1

### 3.D Testing N-gram model
We test the N gram model's performance with computation of average perplexity using the validation set

In [38]:
def calculate_perplexity(model, test_sentences):
    """
    Function that calculates perplexity of the LM using the testing set
    """
    total_perplexity = 0
    num_sentences = len(test_sentences)

    total_log_prob = 0
    for tokens in test_sentences:
        for i in range(2, len(tokens)):
            w1, w2, w3 = tokens[i-2], tokens[i-1], tokens[i]
            log_inter_p, _, _, _, _ =  model.compute_interpolated_probability(w1, w2, w3, lambdas)
            total_log_prob += log_inter_p
    
    average_log_prob = total_log_prob / model.total_unigrams
    return np.exp(-average_log_prob)

# Printing of results
avg_perp = calculate_perplexity(model, generated_data.test_sentences)
print(f"Optimized lambdas: {lambdas}")  
print(f"Training Perplexity: {perp}") 
print(f"Test Perplexity on Unseen Data: {avg_perp}") 

Optimized lambdas: [1.02543940e-16 9.13551096e-19 1.00000000e+00]
Training Perplexity: 0.0010764264204105483
Test Perplexity on Unseen Data: 0.4942775464555673


## 4. Clean Memory

In [39]:

# Get the cache directory
cache_dir = os.path.join(os.path.expanduser("~"), ".cache", "huggingface", "datasets")

# Clear all datasets in the cache directory
if os.path.exists(cache_dir):
    shutil.rmtree(cache_dir)
    print("Cleared the datasets cache.")
else:
    print("No cache found.")

# Delete specific variables
gc.collect()
print("Collected garbage")


Cleared the datasets cache.
Collected garbage


## 5. Results and Discussion
As shown by the results, the transformer model was able to have a perplexity of 3.5677 while the trigram language model was able to have a perplexity of 0.4943. It seems like for this dataset, the loss in the trigram model doesn't seem to improve and hence it was converged after 4 iterations despite having very small tolerance as stopping criteria. Meanwhile in the transformer model, the batch loss is seen to be improving slowly as it goes through the dataset. It's possible that the transformer model can handle the complex dataset of Stack V2 with its self attention mechanisms that is able to capture long range dependencies which may be important in capturing the complex structures in programming languages while trigram models only look at the neighboring words which limits it's capacity. It was able to learn slowly in the coleridge dataset but not necessarily on this Stack V2.