**CS Interview QA Chatbot Usage**

This notebook contains the code for training and running a DistilRoberta-LSTM model that answers CS interview questions.

Run the preprocessing.ipynb notebook before this one, and store its output in your Google Drive.
- Ensure that combined_data.csv is stored in your project directory (see FOLDER_PATH below)

To train from scratch:
1. [Import libraries](#scrollTo=6R9dFf2BZBUt&line=1&uniqifier=1)
2. [Load the preprocessed dataset](#scrollTo=7_4lFKnZZH7e&line=1&uniqifier=1)
3. [DistilRoberta Tokenization](#scrollTo=sW5r1HbZv2HL)
4. [Define attention](#scrollTo=GzVX8jhbwQZw)
5. [Define the decoder architecture](#scrollTo=pGzpY6f20quk)
6. [Define loss function](#scrollTo=zAfJAgKFDKgN)
7. [Define training functions](#scrollTo=f848ae33&line=9&uniqifier=1)
8. [Define and finetune DistilRoberta encoder](#scrollTo=Urkzy-wexPqO)
9. [Initialise and train model](#scrollTo=845aa1a3&line=5&uniqifier=1) (Set the loadFilename variable to None)
10. [Run evaluation functions](#scrollTo=484e6fdc&line=6&uniqifier=1)
11. [Evaluate with BERT, ROUGE, BLEU Score (quantitative)](#scrollTo=QiNsbbyrW-xR)
12. [Host on Telegram](#scrollTo=icPY3cbo3ksU&line=1&uniqifier=1)

To run saved model:
1. [Import libraries](#scrollTo=6R9dFf2BZBUt&line=1&uniqifier=1)
2. [Load the preprocessed dataset](#scrollTo=7_4lFKnZZH7e&line=1&uniqifier=1)
3. [DistilRoberta Tokenization](#scrollTo=sW5r1HbZv2HL)
4. [Define attention](#scrollTo=GzVX8jhbwQZw)
5. [Define the decoder architecture](#scrollTo=pGzpY6f20quk)
6. [Define loss function](#scrollTo=zAfJAgKFDKgN)
7. [Define training functions](#scrollTo=f848ae33&line=9&uniqifier=1)
8. [Define and finetune DistilRoberta encoder](#scrollTo=Urkzy-wexPqO)
9. [Initialise and train model](#scrollTo=845aa1a3&line=5&uniqifier=1) (Set the loadFilename variable to your saved checkpoint file)
10. [Run evaluation functions](#scrollTo=484e6fdc&line=6&uniqifier=1)
11. [Evaluate with BERT, ROUGE, BLEU Score (quantitative)](#scrollTo=QiNsbbyrW-xR)
12. [Host on Telegram](#scrollTo=icPY3cbo3ksU&line=1&uniqifier=1)

### Imports

In [1]:
# random, itertools, collections and os are part of Python's standard library and need no install
# The Hugging Face package is named "transformers" (plural)
!pip install torch pandas transformers

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


Change FOLDER_PATH to the directory with combined_data.csv

In [3]:
FOLDER_PATH = '/content/drive/MyDrive/project'
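
Before loading the data, it can help to confirm that the CSV really sits where FOLDER_PATH points. The helper below is a minimal sketch; `dataset_path` is a hypothetical name that does not appear in the notebook, and the file name is taken from the loading cell.

```python
import os

def dataset_path(folder_path, filename="combined_data.csv"):
    """Return the full path to the dataset, or raise if the file is missing.

    Hypothetical helper for illustration; the notebook reads the file directly.
    """
    path = os.path.join(folder_path, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{filename} not found in {folder_path}; run preprocessing.ipynb first"
        )
    return path
```

Calling `dataset_path(FOLDER_PATH)` right after mounting the drive would surface a wrong folder before any model code runs.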

In [5]:
import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
from io import open
import itertools
import math
import json
import pandas as pd



USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

Loading Saved Dataset

In [7]:
# read from csv
df = pd.read_csv(f'{FOLDER_PATH}/combined_data.csv')
print(df.head())

                                 Question  \
0      how does randomised algorithm work   
1    what do you mean by bestfirst search   
2             how do you explain a daemon   
3              what is phonetic algorithm   
4  what do you mean by uniform costsearch   

                                              Answer  
0  the algorithm typically uses uniformly random ...  
1  bestfirst search is a search algorithm which e...  
2  daemon disk and execution monitor is a process...  
3  a phonetic algorithm is an algorithm for index...  
4  a tree search that finds the lowestcost route ...  


In [6]:
df

Unnamed: 0,Question,Answer
0,how does randomised algorithm work,the algorithm typically uses uniformly random ...
1,what do you mean by bestfirst search,bestfirst search is a search algorithm which e...
2,how do you explain a daemon,daemon disk and execution monitor is a process...
3,what is phonetic algorithm,a phonetic algorithm is an algorithm for index...
4,what do you mean by uniform costsearch,a tree search that finds the lowestcost route ...
...,...,...
3770,explain biasvariance tradeoff,biasvariance tradeoff is a concept in machine ...
3771,what is stochastic gradient descent sgd in mac...,stochastic gradient descent sgd is an optimiza...
3772,explain stochastic gradient descent,stochastic gradient descent sgd is an optimiza...
3773,what is the backpropagation algorithm in machi...,the backpropagation algorithm is a widely used...


Check max length for Question Answer pair

In [7]:
import pandas as pd

# Calculate the length of each answer
df['answer_length'] = df['Answer'].apply(lambda x: len(x.split()))

# Find the maximum length
max_answer_length = df['answer_length'].max()

print(f"The maximum length of an answer in terms of tokens is: {max_answer_length}")

The maximum length of an answer in terms of tokens is: 177


In [8]:
df['question_length'] = df['Question'].apply(lambda x: len(x.split()))

max_question_length = df['question_length'].max()

print(f"The maximum length of a question in terms of tokens is: {max_question_length}")


The maximum length of a question in terms of tokens is: 105
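
The lengths above come from `str.split()`, so they count whitespace-separated words rather than subword tokens. A subword tokenizer emits at least one token per word plus its special tokens, so these word counts are a lower bound on the encoded length. The sketch below summarises word counts with a maximum and a nearest-rank percentile; `length_summary` and the sample strings are made up for illustration.

```python
import math

def length_summary(texts, percentile=95):
    """Summarise whitespace word counts: maximum and a nearest-rank percentile."""
    lengths = sorted(len(t.split()) for t in texts)
    # Nearest-rank percentile: smallest length that covers `percentile` % of the texts
    rank = max(1, math.ceil(percentile / 100 * len(lengths)))
    return {"max": lengths[-1], f"p{percentile}": lengths[rank - 1]}

sample = [
    "what is a daemon",
    "explain stochastic gradient descent",
    "how does a randomised algorithm work",
]
print(length_summary(sample))  # {'max': 6, 'p95': 6}
```

Comparing the percentile with the maximum shows how much of the padding budget is spent on a handful of very long examples.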


**Tokenization using DistilRoberta Tokenizer**

In [9]:
from transformers import AutoTokenizer
import pandas as pd
from collections import Counter

# Load the DistilRoBERTa tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')

questions = df['Question'].tolist()
answers = df['Answer'].tolist()

MAX_LENGTH = 180

# Tokenize, pad, and truncate the questions
question_encodings = tokenizer(
    questions,
    padding=True,  # pad to the longest sequence in the batch
    truncation=True,  # Truncate sequences to the max_length
    max_length=MAX_LENGTH,
    return_tensors='pt'
)

answer_encodings = tokenizer(
    answers,
    padding=True,
    truncation=True,
    max_length=MAX_LENGTH,
    return_tensors='pt'
)

all_tokens = []
for answer in answers:
    tokens = tokenizer.tokenize(answer)
    all_tokens.extend(tokens)

for qn in questions:
    tokens = tokenizer.tokenize(qn)
    all_tokens.extend(tokens)

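vocab_counter = Counter(all_tokens)
# Number of distinct subword tokens that occur in the dataset
total_tokens = len(vocab_counter)

# Informational only: the decoder below is sized with tokenizer.vocab_size instead
print(f"Target vocabulary size (output_size): {total_tokens}")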

# Display the tokenized and padded data
print(question_encodings['input_ids'])
print(answer_encodings['input_ids'])

Target vocabulary size (output_size): 6433
tensor([[    0,  9178,   473,  ...,     1,     1,     1],
        [    0, 12196,   109,  ...,     1,     1,     1],
        [    0,  9178,   109,  ...,     1,     1,     1],
        ...,
        [    0, 23242,  1851,  ...,     1,     1,     1],
        [    0, 12196,    16,  ...,     1,     1,     1],
        [    0, 23242,  1851,  ...,     1,     1,     1]])
tensor([[    0,   627, 17194,  ...,     1,     1,     1],
        [    0,  7885,  9502,  ...,     1,     1,     1],
        [    0,  6106, 34344,  ...,     1,     1,     1],
        ...,
        [    0,   620,  4306,  ...,     1,     1,     1],
        [    0,   627,   124,  ...,     1,     1,     1],
        [    0,   627,   124,  ...,     1,     1,     1]])
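
The tensors above start with 0 (`<s>`) and end in runs of 1 (`<pad>`). The sketch below imitates, on plain Python lists, what `padding=True` together with `truncation=True` and `max_length` does for a batch of single sequences. It is a simplified stand-in and not the tokenizer's implementation; `pad_and_truncate` is a hypothetical name, and some of the example ids are arbitrary.

```python
def pad_and_truncate(batch, max_length, bos_id=0, eos_id=2, pad_id=1):
    """Mimic padding=True, truncation=True for one batch of raw token-id lists.

    The default ids are those of the RoBERTa vocabulary (<s>=0, <pad>=1, </s>=2).
    """
    # Truncate the content so that <s> and </s> still fit inside max_length
    wrapped = [[bos_id] + ids[: max_length - 2] + [eos_id] for ids in batch]
    longest = max(len(ids) for ids in wrapped)
    input_ids, attention_mask = [], []
    for ids in wrapped:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_and_truncate([[9178, 473], [12196, 109, 47, 1266, 30]], max_length=6)
print(ids)   # [[0, 9178, 473, 2, 1, 1], [0, 12196, 109, 47, 1266, 2]]
print(mask)  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

The attention mask is what later lets the loss ignore padded positions.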


In [10]:
import torch

X = question_encodings['input_ids']
attention_mask_X = question_encodings['attention_mask']
y = answer_encodings['input_ids']
attention_mask_y = answer_encodings['attention_mask']

X_train = X
attention_mask_X_train = attention_mask_X
y_train = y
attention_mask_y_train = attention_mask_y

In [11]:
from torch.utils.data import DataLoader, Dataset

class Seq2SeqDataset(Dataset):
    def __init__(self, input_data, target_data):
        self.input_data = input_data
        self.target_data = target_data

    def __len__(self):
        return len(self.input_data['input_ids'])

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_data['input_ids'][idx],
            'attention_mask': self.input_data['attention_mask'][idx],
            'target_ids': self.target_data['input_ids'][idx],
            'target_attention_mask': self.target_data['attention_mask'][idx]
        }

train_dataset = Seq2SeqDataset(
    {'input_ids': X_train, 'attention_mask': attention_mask_X_train},
    {'input_ids': y_train, 'attention_mask': attention_mask_y_train}
)

# Create a DataLoader for the full dataset
data_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
print(f"size: {len(data_loader)}")

size: 59
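
The reported size is the number of batches, not the number of examples. As a quick check, assuming the dataset has 3774 rows (inferred from the last index, 3773, in the DataFrame display above):

```python
import math

n_examples = 3774   # assumption: rows 0..3773 in the DataFrame shown earlier
batch_size = 64

n_batches = math.ceil(n_examples / batch_size)
last_batch = n_examples - (n_batches - 1) * batch_size
print(n_batches, last_batch)  # 59 62
```

Under that assumption the final batch holds 62 examples rather than 64, which is why the training loop reads the batch size from each batch instead of relying on the configured value.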


**Luong Attention**

In [12]:
# Luong attention layer
class Attn(torch.nn.Module):
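    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        # Fail early on an unknown method instead of erroring later in forward()
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(f"{self.method} is not an appropriate attention method.")
        self.hidden_size = hidden_size
        if self.method == 'general':
            self.attn = torch.nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':
            self.attn = torch.nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = torch.nn.Parameter(torch.FloatTensor(hidden_size))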

    def dot_score(self, hidden, encoder_output):
        return torch.sum(hidden * encoder_output, dim=2)

    def general_score(self, hidden, encoder_output):
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    def concat_score(self, hidden, encoder_output):
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)

        # Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()

        # Return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1) # (batch_size,1,max_length)
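
To make the arithmetic behind the `dot` method visible, here is a minimal sketch in plain Python (deliberately without PyTorch) for a single decoder state. The function names and the toy vectors are assumptions for illustration. In the notebook the context vector is computed later, inside the decoder; it is included here so the example is complete.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(hidden, encoder_outputs):
    """Luong 'dot' attention for one decoder state.

    hidden: list of floats (decoder state at the current step)
    encoder_outputs: one list of floats per source position
    Returns the attention weights and the weighted-sum context vector.
    """
    # Score each source position by its dot product with the decoder state
    scores = [sum(h * e for h, e in zip(hidden, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of the encoder outputs
    context = [
        sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
        for d in range(len(hidden))
    ]
    return weights, context

weights, context = dot_attention([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
print([round(w, 3) for w in weights])
```

The first source position is the most similar to the decoder state, so it receives the largest weight and dominates the context vector.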

**LSTM Decoder**

In [13]:
import torch.nn as nn


class LuongAttnDecoderLSTM(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderLSTM, self).__init__()
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)

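    def forward(self, input_seq, last_hidden, encoder_outputs):
        # input_seq: (1, batch_size); encoder_outputs: (max_length, batch_size, hidden_size)
        embedded = self.embedding(input_seq)
        embedded = self.embedding_dropout(embedded)
        # Run one time step through the LSTM; hidden is an (h, c) tuple
        rnn_output, hidden = self.lstm(embedded, last_hidden)

        # Calculate attention weights from the current LSTM output
        attn_weights = self.attn(rnn_output, encoder_outputs)

        # Weighted sum of the encoder outputs gives the context vector
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate LSTM output and context vector (Luong eq. 5)
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        output = self.out(concat_output)
        # Return probabilities (not logits); maskNLLLoss takes their log
        output = F.softmax(output, dim=1)
        return output, hidden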

<b>Masked NLL Loss</b>
- employed to focus loss calculation on relevant parts of output sequence, ignoring padding tokens

In [14]:
def maskNLLLoss(inp, target, mask):

    nTotal = mask.sum()  # Total number of valid elements

    if nTotal == 0:
        return torch.tensor(0.0, device=device), 0  # Handle case with no valid elements

    # Add epsilon to avoid log(0)
    epsilon = 1e-10
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)) + epsilon)

    # Apply mask and compute mean loss
    loss = crossEntropy.masked_select(mask).mean()

    return loss.to(device), nTotal.item()
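
The same computation can be followed by hand on a tiny batch. The sketch below mirrors `maskNLLLoss` in plain Python for one time step; `masked_nll` and the probabilities are made up for illustration.

```python
import math

def masked_nll(probs, targets, mask, eps=1e-10):
    """Mean negative log-likelihood over the positions where mask is True.

    probs: one probability distribution (list of floats) per batch element
    targets: the correct token id for each batch element
    mask: True for real tokens, False for padding
    """
    losses = [
        -math.log(p[t] + eps)
        for p, t, keep in zip(probs, targets, mask)
        if keep
    ]
    if not losses:
        return 0.0, 0
    return sum(losses) / len(losses), len(losses)

probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
loss, n = masked_nll(probs, targets=[0, 1, 2], mask=[True, True, False])
print(round(loss, 4), n)  # 0.2899 2
```

The third element is padding, so its probability never enters the average.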

**Teacher Forcing**
- feeds the ground-truth previous token to the decoder at each step instead of the decoder's own prediction, so that early mistakes do not compound during training

In [15]:
def train(input_variable, attention_mask, target_variable, target_attention_mask, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_target_len=MAX_LENGTH):

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_last_hidden = encoder(input_ids=input_variable, attention_mask=attention_mask)
    decoder_hidden = encoder_last_hidden

    SOS_token = "<s>"
    SOS_token_id = tokenizer.convert_tokens_to_ids(SOS_token)

    decoder_input = torch.LongTensor([[SOS_token_id] * batch_size]).to(device)

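    # Never step past the padded target length, which can be shorter than MAX_LENGTH
    max_target_len = min(max_target_len, target_variable.size(0))

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = random.random() < teacher_forcing_ratio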

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )

            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], target_attention_mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], target_attention_mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = torch.nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = torch.nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals
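
The difference between the two branches is only which token becomes the next decoder input. The toy sketch below isolates that choice; `decode_inputs` and the constant "decoder" are hypothetical stand-ins, not part of the notebook.

```python
import random

def decode_inputs(targets, predict, sos_id, teacher_forcing_ratio, rng=random):
    """Return the sequence of inputs the decoder would see for one target sequence.

    predict: stand-in for the decoder; maps the current input id to a predicted id
    """
    # As in train(), the choice is made once for the whole sequence
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    decoder_input = sos_id
    seen = []
    for target in targets:
        seen.append(decoder_input)
        prediction = predict(decoder_input)
        # Teacher forcing feeds the ground truth; otherwise feed the model's own guess
        decoder_input = target if use_teacher_forcing else prediction
    return seen

def always_seven(token_id):
    """Toy 'decoder' that always predicts id 7."""
    return 7

print(decode_inputs([5, 6, 8], always_seven, sos_id=0, teacher_forcing_ratio=1.0))  # [0, 5, 6]
print(decode_inputs([5, 6, 8], always_seven, sos_id=0, teacher_forcing_ratio=0.0))  # [0, 7, 7]
```

With a ratio of 1.0 every batch uses the ground truth as input, and with 0.0 the decoder always consumes its own predictions.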

In [16]:
def trainIters(model_name, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every, clip, loadFilename):

    print('Initializing ...')
    start_iteration = 1
    print_loss = 0
    if loadFilename:
        start_iteration = checkpoint['iteration'] + 1

    # Training loop
    print("Training...")
    vocab_size = tokenizer.vocab_size
    # The decoder must emit one probability per vocabulary entry
    assert decoder.output_size == vocab_size, "Model output size does not match vocabulary size!"

    iteration = start_iteration

    while iteration <= n_iteration:
        for batch in data_loader:
            if iteration > n_iteration:
                break
            input_variable = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            target_variable = batch['target_ids'].to(device)
            target_attention_mask = batch['target_attention_mask'].to(device)
            target_variable = target_variable.transpose(0, 1)
            target_attention_mask = target_attention_mask.transpose(0, 1)
            target_attention_mask = target_attention_mask.bool()
            batch_size = input_variable.size(0)

            # Run a training iteration with batch
            loss = train(input_variable, attention_mask, target_variable, target_attention_mask, encoder, decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
            print_loss += loss

            # Print progress
            if iteration % print_every == 0:
                print_loss_avg = print_loss / print_every
                print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(iteration, iteration / n_iteration * 100, print_loss_avg))
                print_loss = 0

            # Save checkpoint
            if iteration % save_every == 0:
                directory = os.path.join(save_dir, model_name, '{}_{}'.format(decoder_n_layers, hidden_size))
                if not os.path.exists(directory):
                    os.makedirs(directory)
                torch.save({
                    'iteration': iteration,
                    'en': encoder.state_dict(),
                    'de': decoder.state_dict(),
                    'en_opt': encoder_optimizer.state_dict(),
                    'de_opt': decoder_optimizer.state_dict(),
                    'loss': loss,
                    'embedding': embedding.state_dict()
                }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))

            iteration += 1  # Increment iteration counter
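
When resuming from a checkpoint, the file path has to match the layout written above. A small sketch of that layout, with a hypothetical helper name and placeholder arguments:

```python
import os

def checkpoint_path(save_dir, model_name, decoder_n_layers, hidden_size, iteration):
    """Rebuild the path that trainIters writes for a given iteration."""
    directory = os.path.join(save_dir, model_name, f"{decoder_n_layers}_{hidden_size}")
    return os.path.join(directory, f"{iteration}_checkpoint.tar")

# "checkpoints" and "my_model" are placeholder values for illustration
print(checkpoint_path("checkpoints", "my_model", 2, 768, 500))
```

Only iterations that are multiples of `save_every` have a file on disk, so the iteration passed here should be one of those.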

<b>DistilRoberta Encoder</b>
- DistilRoberta is fine-tuned before joint training to adapt it to our CS interview QA task

In [23]:
import torch
import torch.nn as nn
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments
)

class EncoderDistilRoberta(nn.Module):
    def __init__(self, hidden_size, n_layers=2, dropout=0.1, freeze_layers=None):
        super(EncoderDistilRoberta, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        # Load the DistilRoBERTa model
        self.distilroberta = AutoModelForMaskedLM.from_pretrained('distilroberta-base')

        # Freeze specified layers if needed
        if freeze_layers is not None:
            for name, param in self.distilroberta.named_parameters():
                # Match the full "layer.<index>." segment so a digit elsewhere in a name cannot match
                if any(f'layer.{layer}.' in name for layer in freeze_layers):
                    param.requires_grad = False

        self.dropout = nn.Dropout(dropout)
        self.hidden_transform = nn.Linear(768, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids, attention_mask):
        outputs = self.distilroberta.roberta(input_ids=input_ids,
                                           attention_mask=attention_mask)

        hidden_states = outputs.last_hidden_state
        hidden_states = self.dropout(hidden_states)
        # Transform hidden states to make it suitable for input to LSTM decoder
        encoder_output = hidden_states.permute(1, 0, 2)
        encoder_output = self.hidden_transform(encoder_output)
        encoder_output = self.layer_norm(encoder_output)

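        # Use the final position as the summary vector. With right-padded batches this
        # position holds a padding token for every sequence shorter than the longest one
        last_hidden_state = hidden_states[:, -1, :]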
        transformed_last_hidden_state = self.hidden_transform(last_hidden_state)
        decoder_hidden = transformed_last_hidden_state.unsqueeze(0).repeat(self.n_layers, 1, 1)
        # Initialize cell state as zeros as DistilRoberta does not have an internal cell state
        cell_state = torch.zeros_like(decoder_hidden)

        return encoder_output, (decoder_hidden, cell_state)


Convert the dataset to masked language modelling (MLM) format for DistilRoberta fine-tuning
- In MLM, a fraction of the input tokens is hidden and the model is trained to predict them from the surrounding context, which helps the encoder adapt to the vocabulary and phrasing of the CS interview data

In [24]:
# convert Seq2SeqDataset to MLM format
def create_mlm_dataset(dataset):
    max_length = MAX_LENGTH
    mlm_inputs = []
    mlm_masks = []
    tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
    pad_token_id = tokenizer.pad_token_id

    for i in range(len(dataset)):
        item = dataset[i]
        q_input = item['input_ids']
        q_mask = item['attention_mask']
        q_pad_length = max_length - q_input.size(0)
        if q_pad_length > 0:
            # Pad with matching dtypes so torch.cat does not promote the mask to float
            q_input = torch.cat([q_input, torch.full((q_pad_length,), pad_token_id, dtype=q_input.dtype)])
            q_mask = torch.cat([q_mask, torch.zeros(q_pad_length, dtype=q_mask.dtype)])

        a_input = item['target_ids']
        a_mask = item['target_attention_mask']
        a_pad_length = max_length - a_input.size(0)
        if a_pad_length > 0:
            a_input = torch.cat([a_input, torch.full((a_pad_length,), pad_token_id, dtype=a_input.dtype)])
            a_mask = torch.cat([a_mask, torch.zeros(a_pad_length, dtype=a_mask.dtype)])

        mlm_inputs.extend([q_input, a_input])
        mlm_masks.extend([q_mask, a_mask])

    return {
        'input_ids': torch.stack(mlm_inputs),
        'attention_mask': torch.stack(mlm_masks),
        'labels': torch.stack(mlm_inputs).clone()
    }
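
The masking itself happens later, in the data collator. The sketch below shows the core idea on a plain list of ids. It is a simplification: every selected token is replaced by `<mask>`, whereas the Hugging Face collator by default also leaves some selected tokens unchanged or swaps in random ones. `mask_tokens` is a hypothetical name; the special ids are those of the RoBERTa vocabulary.

```python
import random

MASK_ID = 50264           # <mask> in the RoBERTa vocabulary
SPECIAL_IDS = {0, 1, 2}   # <s>, <pad>, </s> are never masked
IGNORE = -100             # label value that the loss skips

def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """Simplified MLM masking: hide some tokens and keep labels only for those."""
    masked, labels = [], []
    for token_id in input_ids:
        if token_id not in SPECIAL_IDS and rng.random() < mlm_probability:
            masked.append(MASK_ID)
            labels.append(token_id)   # the model must recover the original id
        else:
            masked.append(token_id)
            labels.append(IGNORE)     # no loss on positions left visible
    return masked, labels

example = [0, 9178, 473, 12196, 109, 2, 1, 1]
masked, labels = mask_tokens(example, rng=random.Random(0))
print(masked)
print(labels)
```

Because masking is random, a new set of positions is hidden each time the same sequence is drawn.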

DistilRoberta Fine-Tuning

In [27]:
# Define the directory path in Google Drive
output_dir = f'{FOLDER_PATH}/distilroberta_fine_tuned_results'

# Check if the directory exists
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Directory '{output_dir}' created.")
else:
    print(f"Directory '{output_dir}' already exists.")

def fine_tune_encoder_with_existing_data(
    dataset,
    hidden_size=768,
    output_dir=output_dir + '/results',
    freeze_layers=None,
    num_epochs=3,
    batch_size=16,
    learning_rate=2e-5
):
    # Initialize DistilRoberta model
    encoder = EncoderDistilRoberta(
        hidden_size=hidden_size,
        freeze_layers=freeze_layers
    )

    # Convert dataset to MLM format
    print("Creating MLM dataset...")
    mlm_data = create_mlm_dataset(dataset)
    print("MLM dataset created successfully")

    # Create a custom Dataset class for the MLM data
    class MLMDataset(Dataset):
        def __init__(self, data):
            self.data = data

        def __len__(self):
            return len(self.data['input_ids'])

        def __getitem__(self, idx):
            return {
                'input_ids': self.data['input_ids'][idx],
                'attention_mask': self.data['attention_mask'][idx],
                'labels': self.data['labels'][idx]
            }

    # Convert to proper dataset
    mlm_dataset = MLMDataset(mlm_data)

    # Create training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=learning_rate,
        warmup_ratio=0.1,
        weight_decay=0.01,
        gradient_accumulation_steps=4,
        logging_dir=output_dir + '/logs',
        logging_steps=100,
        save_strategy='steps',
        save_steps=500,
        report_to=[],
        fp16=torch.cuda.is_available()  # mixed precision requires a GPU
    )

    # Create data collator for dynamic masking
    tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=True,
        mlm_probability=0.15
    )

    # Initialize trainer
    trainer = Trainer(
        model=encoder.distilroberta,
        args=training_args,
        train_dataset=mlm_dataset,
        data_collator=data_collator
    )

    # Train the model
    trainer.train()

    return encoder

def fine_tune_and_prepare_encoder(train_dataset, hidden_size=768):
    # Fine-tune the encoder
    encoder = fine_tune_encoder_with_existing_data(
        dataset=train_dataset,
        # val_dataset=val_dataset,
        hidden_size=hidden_size,
        output_dir=output_dir + '/results',
        freeze_layers=[0, 1]  # Optionally freeze first two layers
    )

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    encoder = encoder.to(device)

    return encoder

Directory '/content/drive/MyDrive/project/distilroberta_fine_tuned_results' created.


In [28]:
encoder = fine_tune_and_prepare_encoder(
    train_dataset=train_dataset,
    hidden_size=768
)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Creating MLM dataset...
MLM dataset created successfully


Step,Training Loss
100,3.1644
200,2.7315
300,2.6965
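
The step numbers in this log follow from the training arguments. As a rough check, assuming 3774 question-answer pairs (inferred from the DataFrame index) and a single device:

```python
import math

n_pairs = 3774                 # assumption: inferred from the DataFrame index
n_sequences = 2 * n_pairs      # create_mlm_dataset adds each question and each answer
per_device_batch = 16
grad_accum = 4
epochs = 3

effective_batch = per_device_batch * grad_accum
batches_per_epoch = math.ceil(n_sequences / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = updates_per_epoch * epochs
print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)  # 64 472 118 354
```

Under these assumptions training ends after roughly 354 optimizer steps, which fits a log that stops at step 300 with `logging_steps=100`. It also suggests that `save_steps=500` would never be reached in a run of this length.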


In [29]:
from transformers import AutoModel

# Configure models
model_name = 'cb_latestBESTESTfullset_model'
attn_model = 'dot'
hidden_size = 768
decoder_n_layers = 2
dropout = 0.1
batch_size = 64

# Set checkpoint to load from; set to None if starting from scratch
loadFilename = None
checkpoint_iter = 4000
# The path mirrors the layout written by trainIters; save_dir must be defined first
# loadFilename = os.path.join(save_dir, model_name,
#                             '{}_{}'.format(decoder_n_layers, hidden_size),
#                             '{}_checkpoint.tar'.format(checkpoint_iter))


# Load model if a loadFilename is provided
if loadFilename:
    # If loading on same machine the model was trained on
    checkpoint = torch.load(loadFilename)
    # If loading a model trained on GPU to CPU
    #checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
#     voc.__dict__ = checkpoint['voc_dict']


print('Building encoder and decoder ...')
# Initialize word embeddings
embedding = nn.Embedding(tokenizer.vocab_size, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# Initialize encoder & decoder models
decoder = LuongAttnDecoderLSTM(attn_model, embedding, hidden_size, tokenizer.vocab_size, decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
# Use appropriate device
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')

Building encoder and decoder ...
Models built and ready to go!


**Start training the DistilRoberta-LSTM Model**

In [32]:
from torch import optim
import os


# Configure training/optimization
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 1000
print_every = 1
save_every = 500

# Ensure dropout layers are in train mode
encoder.train()
decoder.train()

# Initialize optimizers
print('Building optimizers ...')
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
if loadFilename:
    encoder_optimizer.load_state_dict(encoder_optimizer_sd)
    decoder_optimizer.load_state_dict(decoder_optimizer_sd)

# Run training iterations
print("Starting Training!")

# Define the directory path in Google Drive
save_dir = f'{FOLDER_PATH}/distilroberta_checkpoints'

# Check if the directory exists
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
    print(f"Directory '{save_dir}' created.")
else:
    print(f"Directory '{save_dir}' already exists.")

trainIters(model_name, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, decoder_n_layers, save_dir, n_iteration, batch_size,
           print_every, save_every, clip, loadFilename)


Building optimizers ...
Starting Training!
Directory '/content/drive/MyDrive/project/distilroberta_checkpoints' already exists.
Initializing ...
Training...
Iteration: 1; Percent complete: 5.0%; Average loss: 4.3275
Iteration: 2; Percent complete: 10.0%; Average loss: 4.0685
Iteration: 3; Percent complete: 15.0%; Average loss: 4.0079
Iteration: 4; Percent complete: 20.0%; Average loss: 4.0082
Iteration: 5; Percent complete: 25.0%; Average loss: 4.0911
Iteration: 6; Percent complete: 30.0%; Average loss: 4.0113
Iteration: 7; Percent complete: 35.0%; Average loss: 4.1782
Iteration: 8; Percent complete: 40.0%; Average loss: 3.8778
Iteration: 9; Percent complete: 45.0%; Average loss: 4.3307
Iteration: 10; Percent complete: 50.0%; Average loss: 4.0300
Iteration: 11; Percent complete: 55.0%; Average loss: 4.0757
Iteration: 12; Percent complete: 60.0%; Average loss: 3.7002
Iteration: 13; Percent complete: 65.0%; Average loss: 3.5559
Iteration: 14; Percent complete: 70.0%; Average loss: 4.1236

**Initialization of Evaluation Function & GreedySearchDecoder**

In [None]:
from transformers import AutoTokenizer
import torch
import string
import torch.nn as nn

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')

def preprocess_sentence(sentence):
    sentence = sentence.lower()
    sentence = sentence.translate(str.maketrans('', '', string.punctuation)).strip()
    sentence = ' '.join(sentence.split())
    return sentence

def clean_output(decoded_words):
    response = tokenizer.convert_tokens_to_string(decoded_words)
    # Clean response
    response = response.replace('Ġ', ' ').strip()
    response = ' '.join(response.split())
    return response

def evaluate(encoder, decoder, searcher, sentence, max_length=MAX_LENGTH):
    # Preprocess the input sentence
    sentence = preprocess_sentence(sentence)

    # Tokenize and encode the input sentence
    inputs = tokenizer(sentence, return_tensors='pt', max_length=max_length, truncation=True, padding=True)
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    # Decode sentence with searcher
    tokens, scores = searcher(input_ids, attention_mask, max_length)
    tokens = tokens.view(-1)

    # Convert token IDs to words
    decoded_words = tokenizer.convert_ids_to_tokens(tokens.tolist(), skip_special_tokens=True)
    cleaned_response = clean_output(decoded_words)
    return cleaned_response

def evaluateInput(encoder, decoder, searcher):
    input_sentence = ''
    while True:
        try:
            # Get input sentence
            input_sentence = input('> ')
            # Check if it is quit case
            if input_sentence.lower() in ['q', 'quit']:
                break

            # Evaluate sentence
            output_words = evaluate(encoder, decoder, searcher, input_sentence)

            # Print response sentence
            print('Bot:', output_words)

        except KeyError:
            print("Error: Encountered unknown word.")

class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder, device, tokenizer):
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self._device = device
        self.tokenizer = tokenizer

    def forward(self, input_variable, attention_mask, max_length=MAX_LENGTH):
        # Forward input through encoder model
        encoder_outputs, encoder_last_hidden = self.encoder(input_ids=input_variable, attention_mask=attention_mask)

        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        decoder_hidden = encoder_last_hidden

        batch_size = input_variable.size(0)
        SOS_token = "<s>"
        SOS_token_id = self.tokenizer.convert_tokens_to_ids(SOS_token)

        # Initialize decoder input with SOS token
        decoder_input = torch.tensor([[SOS_token_id]] * batch_size).to(self._device)

        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=self._device, dtype=torch.long)
        all_scores = torch.zeros([0], device=self._device)

        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)

            # Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)

            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input.unsqueeze(0)), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores.unsqueeze(0)), dim=0)

            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = decoder_input.unsqueeze(0)

        # Return collections of word tokens and scores
        return all_tokens, all_scores
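
The greedy loop above always decodes `max_length` tokens. A common variation stops as soon as the end-of-sequence token is predicted. The sketch below shows that idea in plain Python, independent of the model: `greedy_decode`, `step_fn`, and the toy score table are hypothetical stand-ins for the decoder call, and the token IDs are toy values rather than the real tokenizer's.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_decode(
    step_fn: Callable[[int], Sequence[float]],
    sos_id: int,
    eos_id: int,
    max_length: int,
) -> Tuple[List[int], List[float]]:
    """Greedy search: pick the highest-scoring token at every step.

    step_fn is a hypothetical stand-in for one decoder step; it maps the
    previous token ID to a score for every token in the vocabulary.
    """
    tokens: List[int] = []
    scores: List[float] = []
    prev = sos_id
    for _ in range(max_length):
        step_scores = step_fn(prev)
        # Argmax over the vocabulary (the role torch.max plays above)
        best = max(range(len(step_scores)), key=lambda i: step_scores[i])
        if best == eos_id:
            break  # stop early instead of decoding up to max_length
        tokens.append(best)
        scores.append(step_scores[best])
        prev = best
    return tokens, scores


# Toy 4-token vocabulary: 0 = start, 1 = end, 2 and 3 are ordinary tokens.
# Each row holds the scores for the next token given the previous one.
TOY_SCORES = {
    0: [0.0, 0.1, 0.7, 0.2],  # after start, token 2 scores highest
    2: [0.0, 0.2, 0.1, 0.7],  # after 2, token 3 scores highest
    3: [0.0, 0.8, 0.1, 0.1],  # after 3, the end token scores highest
}

decoded, decoded_scores = greedy_decode(
    TOY_SCORES.__getitem__, sos_id=0, eos_id=1, max_length=10
)
print(decoded)  # [2, 3]
```

In the `GreedySearchDecoder` above, the analogous check would compare the predicted ID against the tokenizer's end-of-sequence ID inside the `for` loop.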

**Begin Chatting with the Chatbot here!**

In [None]:
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder, device, tokenizer)

# Begin chatting
evaluateInput(encoder, decoder, searcher)

> what is recursion
Bot: recursion is a programming technique where a function calls itself in its own definition it allows for solving complex problems by breaking them down into smaller simpler subproblems that are solved recursively recursion can be used to solve problems that exhibit a divide and conquer or topdown approach where a problem is divided into smaller subproblems until a base case is reached recursion can be powerful but should be used with caution to prevent infinite loops or stack overflow errors
> Explain the concept of recursion to me
Bot: recursion is a programming technique where a function calls itself in its own definition it allows for solving complex problems by breaking them down into smaller simpler subproblems that are solved recursively recursion can be used to solve problems that exhibit a divide and conquer or topdown approach where a problem is divided into smaller subproblems until a base case is reached recursion can be powerful but should be used wit

KeyboardInterrupt: Interrupted by user

### BERTScore evaluation

In [None]:
!pip install bert_score

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13


In [None]:
from bert_score import score

# Generate responses for all questions
encoder.eval()
decoder.eval()
generated_responses = []

with torch.no_grad():
    for question in questions:
        response = evaluate(encoder, decoder, searcher, question)
        generated_responses.append(response)  # Store the generated response

# Compute BERTScore (candidates first, then references)
P, R, F1 = score(generated_responses, answers, model_type='distilroberta-base', verbose=True)

# Prepare data for CSV
results = []
for idx, (p, r, f1) in enumerate(zip(P, R, F1)):
    results.append({
        "Question": questions[idx],
        "Ground Truth": answers[idx],
        "Generated Response": generated_responses[idx],
        "Precision": p.item(),
        "Recall": r.item(),
        "F1": f1.item()
    })

# Create a DataFrame
results_df = pd.DataFrame(results)

# Save the DataFrame to a CSV file
results_df.to_csv(f'{FOLDER_PATH}/DistilRoberta_bert_scores.csv', index=False)

# Calculate average BERTScores
avg_precision = results_df["Precision"].mean()
avg_recall = results_df["Recall"].mean()
avg_f1 = results_df["F1"].mean()

# Print averages
print(f"Average Precision: {avg_precision:.4f}")
print(f"Average Recall: {avg_recall:.4f}")
print(f"Average F1 Score: {avg_f1:.4f}")

calculating scores...
computing bert embedding.


  0%|          | 0/40 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/59 [00:00<?, ?it/s]

done in 5.21 seconds, 725.08 sentences/sec
Average Precision: 0.9705
Average Recall: 0.9705
Average F1 Score: 0.9704
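
For intuition about the precision, recall, and F1 values reported above: BERTScore greedily matches each token to its most similar token on the other side, using cosine similarity between contextual embeddings. The sketch below reproduces that matching on hand-written 2-D vectors. `greedy_match_scores` and the toy vectors are illustrative assumptions, and the sketch leaves out the library's optional IDF weighting and baseline rescaling.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def greedy_match_scores(
    candidate: List[Vector], reference: List[Vector]
) -> Tuple[float, float, float]:
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hand-written toy "embeddings" (real ones come from the language model)
candidate_vecs = [(1.0, 0.0), (0.0, 1.0)]
reference_vecs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

p, r, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Here every candidate vector has an exact counterpart in the reference, so precision is 1, while the third reference vector is only partially covered, which pulls recall below 1.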


In [None]:
!pip install nltk
!pip install bert-score
!pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=6396f6a8ab8e3c5d1c8d9820a807e16208b4f3f2b433431d687b1e724912a141
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from bert_score import score
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# Generate responses for all questions
encoder.eval()
decoder.eval()
generated_responses = []

with torch.no_grad():
    for question in questions:
        response = evaluate(encoder, decoder, searcher, question)
        generated_responses.append(response)  # Store the generated response

# Compute BERTScore
P, R, F1 = score(
    generated_responses,
    answers,
    model_type='distilroberta-base',
    verbose=True
)

# Initialize BLEU smoothing function and ROUGE scorer
smoothie = SmoothingFunction().method4
rouge = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

# Prepare data for CSV
results = []
for question, reference, candidate, p, r, f1 in zip(questions, answers, generated_responses, P, R, F1):
    # Tokenize using transformers' tokenizer
    reference_tokens = tokenizer.tokenize(reference.lower())
    candidate_tokens = tokenizer.tokenize(candidate.lower())

    # Compute BLEU score
    bleu = sentence_bleu([reference_tokens], candidate_tokens, smoothing_function=smoothie)

    # Compute ROUGE scores (ROUGE expects strings, no tokenization needed)
    scores = rouge.score(reference.lower(), candidate.lower())
    rouge1_f1 = scores['rouge1'].fmeasure
    rougeL_f1 = scores['rougeL'].fmeasure

    # Store the results
    results.append({
        "Question": question,
        "Ground Truth": reference,
        "Generated Response": candidate,
        "Precision": p.item(),
        "Recall": r.item(),
        "F1": f1.item(),
        "BLEU": bleu,
        "ROUGE-1 F1": rouge1_f1,
        "ROUGE-L F1": rougeL_f1
    })

# Create a DataFrame
results_df = pd.DataFrame(results)

# Save the DataFrame to a CSV file
results_df.to_csv(f'{FOLDER_PATH}/bryan_model/DistilRoberta_bert_scores.csv', index=False)

# Calculate average BERTScores
avg_precision = results_df["Precision"].mean()
avg_recall = results_df["Recall"].mean()
avg_f1 = results_df["F1"].mean()

# Calculate average BLEU and ROUGE scores
avg_bleu = results_df["BLEU"].mean()
avg_rouge1_f1 = results_df["ROUGE-1 F1"].mean()
avg_rougeL_f1 = results_df["ROUGE-L F1"].mean()

# Print averages
print(f"Average BERTScore Precision: {avg_precision:.4f}")
print(f"Average BERTScore Recall: {avg_recall:.4f}")
print(f"Average BERTScore F1 Score: {avg_f1:.4f}")
print(f"Average BLEU Score: {avg_bleu:.4f}")
print(f"Average ROUGE-1 F1 Score: {avg_rouge1_f1:.4f}")
print(f"Average ROUGE-L F1 Score: {avg_rougeL_f1:.4f}")

calculating scores...
computing bert embedding.


  0%|          | 0/40 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/59 [00:00<?, ?it/s]

done in 5.48 seconds, 688.60 sentences/sec
Average BERTScore Precision: 0.9705
Average BERTScore Recall: 0.9705
Average BERTScore F1 Score: 0.9704
Average BLEU Score: 0.7080
Average ROUGE-1 F1 Score: 0.7943
Average ROUGE-L F1 Score: 0.7731
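
To make the ROUGE-1 figure above more concrete: ROUGE-1 F1 is the harmonic mean of unigram precision and recall between the candidate and the reference. The sketch below computes it on whitespace tokens. `rouge1_f1` is a hypothetical helper; the `rouge_score` package applies its own tokenization and, with `use_stemmer=True`, stemming, so its numbers will not match this sketch exactly.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it appears on both sides
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "recursion is a function that calls itself"
candidate = "recursion is when a function calls itself"
print(round(rouge1_f1(reference, candidate), 4))  # 0.8571
```

Six of the seven words overlap on each side, so precision and recall are both 6/7.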


### Telegram Implementation
- Replace the placeholder token with your own bot API token, obtained from BotFather on Telegram
- application = Application.builder().token('**Replace token here**').build()
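
One way to avoid pasting the token directly into the notebook is to read it from an environment variable. This is a minimal sketch: the variable name `TELEGRAM_BOT_TOKEN` and the helper `load_bot_token` are assumptions for illustration, not something this notebook defines.

```python
import os


def load_bot_token(env_var: str = "TELEGRAM_BOT_TOKEN") -> str:
    """Return the bot token from the environment, failing loudly if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your BotFather token"
        )
    return token


# Usage (with the Application import from the cell below):
# application = Application.builder().token(load_bot_token()).build()
```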


In [None]:
!pip install python-telegram-bot --upgrade
!pip install nest_asyncio

Collecting python-telegram-bot
  Downloading python_telegram_bot-21.7-py3-none-any.whl.metadata (17 kB)
Downloading python_telegram_bot-21.7-py3-none-any.whl (654 kB)
Installing collected packages: python-telegram-bot
Successfully installed python-telegram-bot-21.7


In [None]:
import logging
import asyncio
import nest_asyncio
from telegram import Update
from telegram.ext import Application, MessageHandler, filters, ContextTypes

# Apply the nest_asyncio patch for Jupyter environments
nest_asyncio.apply()

# Configure logging
logging.basicConfig(
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    level=logging.INFO
)

# Define the function to process user input and generate responses using your model
async def respond(update: Update, context: ContextTypes.DEFAULT_TYPE):
    user_message = update.message.text
    # evaluate() already normalizes the text with preprocess_sentence(),
    # so the raw message can be passed straight through
    input_sentence = user_message

    # Generate a response using the evaluate function defined earlier
    response = evaluate(encoder, decoder, searcher, input_sentence)

    # Send the response back to the user
    await update.message.reply_text(response)

# Main function to set up and run the bot
async def main():
    # Replace the placeholder string below with your bot's API token
    application = Application.builder().token('Replace token here').build()

    # Add handler for text messages
    application.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, respond))

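    # Initialize, start, and poll the bot
    await application.initialize()
    await application.start()
    await application.updater.start_polling()
    try:
        # Keep the bot running until the cell is interrupted
        await asyncio.Event().wait()
    finally:
        # Stop polling and release resources so the cell can be re-run cleanly
        await application.updater.stop()
        await application.stop()
        await application.shutdown()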

# Run the bot using asyncio
asyncio.run(main())

KeyboardInterrupt: 
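
`evaluate` is a synchronous call, so while the model is generating a reply, the handler above cannot process other messages. One possible remedy is to move the call onto a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch below demonstrates the pattern with `slow_evaluate`, a hypothetical stand-in for the model call. It does not check whether the model itself is safe to use from several threads at once.

```python
import asyncio
import time


def slow_evaluate(sentence: str) -> str:
    """Hypothetical stand-in for evaluate(encoder, decoder, searcher, sentence)."""
    time.sleep(0.1)  # simulate blocking model inference
    return f"echo: {sentence}"


async def respond_async(sentence: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free
    return await asyncio.to_thread(slow_evaluate, sentence)


async def demo() -> list:
    # The two requests overlap instead of running back to back
    return await asyncio.gather(
        respond_async("what is recursion"),
        respond_async("what is a heap"),
    )


replies = asyncio.run(demo())
print(replies)
```

Inside `respond`, the equivalent change would be `response = await asyncio.to_thread(evaluate, encoder, decoder, searcher, input_sentence)`.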