<a href="https://www.kaggle.com/code/rafsunsheikh/q-a-transformer-model-from-scratch?scriptVersionId=140018350" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [54]:
# Load the required Libraries
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy
import json
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split

**If you want to use GPU**

In [55]:
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# print("Device:", device)

### **Load the Dataset**

In [56]:
# Load the JSON dataset
with open("/kaggle/input/qa-dataset/qa_dataset.json", "r") as f:
    qa_data = json.load(f)

### **Split the dataset into question, context and answer**

In [57]:
# Extract questions, contexts, and answers
questions = [qa["question"] for qa in qa_data]
contexts = [qa["context"] for qa in qa_data]
answers = [qa["answer"] for qa in qa_data]

# **Split the Dataset into train, test and Validation**

In [58]:
# Split the dataset into train, test, and validation sets
train_questions, test_questions, train_contexts, test_contexts, train_answers, test_answers = train_test_split(
    questions, contexts, answers, test_size=0.2, random_state=42
)
train_questions, val_questions, train_contexts, val_contexts, train_answers, val_answers = train_test_split(
    train_questions, train_contexts, train_answers, test_size=0.2, random_state=42
)

# Print the sizes of the datasets
print("Train dataset size:", len(train_questions))
print("Validation dataset size:", len(val_questions))
print("Test dataset size:", len(test_questions))


Train dataset size: 137
Validation dataset size: 35
Test dataset size: 43


In [59]:
questions

['What is a modem?',
 'What is the purpose of a router?',
 "What does 'bandwidth' refer to in telecommunication?",
 'Explain the concept of latency in telecommunication.',
 'What is VoIP?',
 'Describe the role of a network firewall.',
 'What is the difference between 4G and 5G?',
 'What is a fiber optic cable?',
 'How does Wi-Fi work?',
 'What is latency in online gaming?',
 'What is a satellite communication system?',
 'Explain the concept of VoLTE.',
 'What is a modem?',
 'What is latency in telecommunication?',
 'Explain the concept of bandwidth throttling.',
 'What is the OSI model?',
 'What is fiber-to-the-home (FTTH) technology?',
 'What are the benefits of using SIP in telephony?',
 'How does a wireless router work?',
 'What is a DNS server and what role does it play in networking?',
 'Explain the difference between 4G and 5G networks.',
 'What is a VPN and why is it used in telecommunications?',
 'Describe the role of a network switch in data communication.',
 'What is VoLTE an

In [60]:
contexts

['A modem is a device that modulates and demodulates digital data to allow communication between computers over telephone lines.',
 'A router is used to direct data packets between different computer networks. It helps manage the flow of data and ensure efficient communication.',
 'Bandwidth refers to the maximum data transfer rate of a network connection. It measures how much data can be transmitted in a given time period.',
 'Latency is the delay between sending and receiving data in a network. It can impact real-time communication and online gaming experiences.',
 'VoIP stands for Voice over Internet Protocol. It is a technology that allows voice communication and multimedia sessions over the Internet, enabling phone calls through data networks.',
 'A network firewall is a security device that monitors and controls incoming and outgoing network traffic. It acts as a barrier between a trusted internal network and untrusted external networks.',
 'The main difference between 4G and 5G 

In [61]:
answers

['A modem is a communication device that converts digital data to analog and vice versa, enabling computers to communicate over telephone lines.',
 'The purpose of a router is to forward data packets between different computer networks. It plays a key role in managing data flow and ensuring efficient communication.',
 'Bandwidth in telecommunication refers to the maximum rate at which data can be transmitted over a network connection. It indicates the data capacity within a specific time frame.',
 'Latency in telecommunication refers to the time delay between data transmission and reception within a network. It can have effects on real-time communication applications and online gaming experiences.',
 'VoIP, or Voice over Internet Protocol, is a technology enabling voice communication and multimedia sessions over the Internet. It enables phone calls and other forms of communication using data networks.',
 'The role of a network firewall is to provide security by monitoring and regulatin

In [62]:
# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

### **Tokenize the Train Question, Context and Answer**

In [63]:
# Tokenize questions and contexts
tokenized_questions = tokenizer(
    train_questions,
    train_contexts,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_attention_mask=True  # Add this line to get attention masks
)

# Tokenize answers
tokenized_answers = tokenizer(
    train_answers,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_attention_mask=True  # Add this line to get attention masks
)

# **Multi Head Attention**

![Multi Head Attention Layer](https://miro.medium.com/v2/resize:fit:720/format:webp/0*--TCGWYxwASbv2ra.png)

> #### **Each pair of sets in a sequence have their attention computed by the multi-head attention mechanism. It is made up of various "attention heads" that focus on various facets of the input sequence.**

> #### **Initializing the module with input parameters and linear transformation layers is done by the MultiHeadAttention function. The attention outputs from all the heads are combined, the input tensor is split up into several heads, and attention scores are calculated. The multi-head self-attention is computed using the forward technique, which enables the model to concentrate on various facets of the input sequence.**

In [64]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V, mask = None):
#         print("Mask shape:", mask.shape)
        attn_scores = torch.matmul(Q, K.transpose(-2,-1)) / math.sqrt(self.d_k)
        if mask is not None:
#             print("Attn Scores shape:", attn_scores.shape)
#             print("Mask shape:", mask.shape)
#             print("Mask == 0 shape:", (mask == 0).shape)
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output
    
    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1,2)
    
    def combine_heads(self, x):
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1,2).contiguous().view(batch_size, seq_length, self.d_model)
    
    def forward(self, Q, K, V, mask = None):
#         print("Q shape:", Q.shape)
#         print("K shape:", K.shape)
#         print("V shape:", V.shape)
        
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output


# **Position-wise Feed-forward Network**

> #### **Position-wise feed-forward networks are implemented via the PositionWiseFeedForward class, which extends PyTorch's nn.Module. Two linear transformation layers and a ReLU activation function are the class's initial configurations. The forward method computes the output by consecutively applying these transformations and the activation function. This procedure enables the model to make predictions while taking the position of input elements into account.**

In [65]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()
        
    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

# **Positional Encoding**

> #### **Positional Encoding is used to inject the position information of each token in the input sequence. It uses sine and cosine functions of different frequencies to generate the positional encoding.**

> #### **With the input parameters d_model and max_seq_length, the PositionalEncoding class creates a tensor to hold the positional encoding data. Based on the scaling factor div_term, the class computes sine and cosine values for even and odd indices, respectively. The model is able to capture the positional information of the input sequence thanks to the forward technique, which computes the positional encoding by adding the stored positional encoding values to the input tensor.**

In [66]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype = torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

# **Encoding Layer**

![Encoding Layer](https://miro.medium.com/v2/resize:fit:552/format:webp/0*bPKV4ekQr9ZjYkWJ.png)

> #### **An Encoder layer consists of a Multi-Head Attention layer, a Position-wise Feed-Forward layer, and two Layer Normalization layers.**

> #### **Initialization parameters and components for the MultiHeadAttention module, PositionWiseFeedForward module, two layer normalization modules, and dropout layer are provided to the EncoderLayer class. By using self-attention, adding the attention output to the input tensor, and normalizing the result, the forward method computes the encoder layer output. Prior to providing the processed tensor, it computes the position-wise feed-forward output, mixes it with the normalized self-attention output, and then normalizes the final result.**

In [67]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
#         print("Input x shape:", x.shape)
        
        # For the self-attention mechanism, queries, keys, and values all come from the same input x
        attn_output = self.self_attn(x, x, x, attention_mask)  # Use the modified attention mask
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

# **Decoding Layer**

![Decoder Layer](https://miro.medium.com/v2/resize:fit:552/format:webp/0*SPZgT4k8GQi37H__.png)

> #### **Two Multi-Head Attention layers, a Position-wise Feed-Forward layer, and three Layer Normalization levels make up a decoder layer.**

> #### **The PositionWiseFeedForward module, three layer normalization modules, a dropout layer, and MultiHeadAttention modules for masked self- and cross-attention are some of the input parameters and components that the DecoderLayer initializes with.**

In [68]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        # For self-attention in the decoder, queries, keys, and values all come from the same input x
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # For cross-attention in the decoder, queries come from x, and keys and values come from the encoder output enc_output
        cross_attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))
        
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

# **Q&A Transformer Model**

![Q&A Transformer Model](https://miro.medium.com/v2/resize:fit:640/format:webp/0*ljYs7oOlKC71SzSr.png)

1. `nn.Module` Inheritance:
   The `QATransformer` class inherits from `nn.Module`, which is the base class for all PyTorch modules.

2. Initialization (`__init__`):
   The constructor initializes the various components of the transformer model.
   
   - `self.embedding`: An embedding layer that converts input token IDs into dense vectors of size `d_model`.
   - `self.positional_encoding`: A positional encoding layer that adds positional information to the input embeddings.
   - `self.encoder_layers`: A list of `num_layers` `EncoderLayer` instances, which make up the encoder stack.
   - `self.fc`: A linear layer that projects the final encoder output to the vocabulary size, making predictions for each token in the vocabulary.
   - `self.dropout`: A dropout layer for regularization.

3. Forward Pass (`forward`):
   The `forward` method defines the forward pass through the transformer model.
   
   - `input_ids`: Token IDs of the input sequence.
   - `attention_mask`: A mask to specify which positions should receive attention and which shouldn't. This mask helps the model ignore padding tokens.
   
   Inside the forward pass:
   
   - The input token IDs are first embedded using the embedding layer and then passed through the positional encoding layer.
   - The resulting embeddings are then passed through the stack of `num_layers` `EncoderLayer` instances, using the provided `attention_mask`.
   - The final output of the encoder stack is passed through the linear layer (`self.fc`) to get the predicted token scores for each position in the sequence.
   
4. `EncoderLayer`:
   The `EncoderLayer` class is expected to be defined elsewhere in your code and is used within the `QATransformer` class. It likely includes sublayers such as multi-head self-attention and feedforward neural networks.

5. `PositionalEncoding`:
   The `PositionalEncoding` class is expected to be defined elsewhere in your code and is used to add positional information to the input embeddings. It helps the model understand the order of the tokens in the sequence.

> #### **Overall, the code defines the structure of a transformer-based QA model and encapsulates the core logic needed for processing input sequences, applying attention mechanisms, and generating predictions.**

In [69]:
class QATransformer(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(QATransformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
        
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, vocab_size)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input_ids, attention_mask=None):  # Add attention_mask parameter
        input_embedded = self.dropout(self.positional_encoding(self.embedding(input_ids)))
        enc_output = input_embedded
        for enc_layer in self.encoder_layers:
            enc_output  = enc_layer(enc_output, attention_mask)  # Use the provided attention_mask

        output = self.fc(enc_output)
        return output

# **Hyperparameters**

- Source (input) vocabulary size: 5000
- Target (output) vocabulary size: 5000
- Model dimension (d_model): 512
- Number of attention heads (num_heads): 8
- Number of encoder layers (num_layers): 6
- Feedforward dimension (d_ff): 2048
- Maximum sequence length (max_seq_length): 500
- Dropout rate: 0.1

> #### **With these hyperparameters, you can instantiate the `QATransformer` class and create your QA model.**

In [70]:
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length  = 500
dropout = 0.1 

# **Create a Q&A Transformer Model**

> #### **creating an instance of the QATransformer model using the hyperparameters we've defined**

In [71]:
# Create a QATransformer model
# qa_transformer = QATransformer(tokenizer.vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

# qa_transformer.to(device)

In [72]:
model_path = "/kaggle/working/model_final_1"
qa_transformer = QATransformer(tokenizer.vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)
qa_transformer.load_state_dict(torch.load(model_path))
qa_transformer.eval()

QATransformer(
  (embedding): Embedding(30522, 512)
  (positional_encoding): PositionalEncoding()
  (encoder_layers): ModuleList(
    (0-5): 6 x EncoderLayer(
      (self_attn): MultiHeadAttention(
        (W_q): Linear(in_features=512, out_features=512, bias=True)
        (W_k): Linear(in_features=512, out_features=512, bias=True)
        (W_v): Linear(in_features=512, out_features=512, bias=True)
        (W_o): Linear(in_features=512, out_features=512, bias=True)
      )
      (feed_forward): PositionWiseFeedForward(
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (relu): ReLU()
      )
      (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (fc): Linear(in_features=512, out_features=30522, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

> **We'll set up the loss criterion and optimizer for training your QA Transformer model:**

1. **Loss Criterion (`nn.CrossEntropyLoss`)**
   - We're using the `nn.CrossEntropyLoss` function as our loss criterion. This loss function combines a softmax operation with the negative log likelihood loss. It's commonly used for multi-class classification tasks where each example can belong to only one class.
   - The `ignore_index` parameter is set to 0. This parameter is used to ignore padding tokens during loss calculation. In this case, it seems that token ID 0 represents padding tokens, so they will be ignored when computing the loss.

2. **Optimizer (`optim.Adam`)**
   - We're using the Adam optimizer for training your model's parameters. Adam is an adaptive learning rate optimization algorithm that adjusts the learning rate for each parameter.
   - The `qa_transformer.parameters()` call provides the list of trainable parameters in our`qa_transformer` model to the optimizer.
   - `lr` sets the initial learning rate for the optimizer.
   - `betas` are the coefficients used for computing running averages of gradient and its square.
   - `eps` is a small value added to the denominator for numerical stability.

> **These configurations are appropriate for training a transformer model for a QA task. After setting up the loss criterion and optimizer, we can proceed with the training loop where we'll iterate through your training data, compute gradients, and update model parameters to minimize the loss.**

In [73]:
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(qa_transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

# **Q&A Training**

In [74]:
qa_transformer.train()

QATransformer(
  (embedding): Embedding(30522, 512)
  (positional_encoding): PositionalEncoding()
  (encoder_layers): ModuleList(
    (0-5): 6 x EncoderLayer(
      (self_attn): MultiHeadAttention(
        (W_q): Linear(in_features=512, out_features=512, bias=True)
        (W_k): Linear(in_features=512, out_features=512, bias=True)
        (W_v): Linear(in_features=512, out_features=512, bias=True)
        (W_o): Linear(in_features=512, out_features=512, bias=True)
      )
      (feed_forward): PositionWiseFeedForward(
        (fc1): Linear(in_features=512, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=512, bias=True)
        (relu): ReLU()
      )
      (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (fc): Linear(in_features=512, out_features=30522, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [75]:
# input_ids = tokenized_questions["input_ids"].to(device)
# attention_mask = tokenized_questions["attention_mask"].unsqueeze(1).unsqueeze(2).to(device)

# These input_ids and attention_mask are for the Train Dataset
# input_ids = tokenized_questions["input_ids"]
# attention_mask = tokenized_questions["attention_mask"].unsqueeze(1).unsqueeze(2)

In [76]:
print("Input IDs shape:", tokenized_questions["input_ids"].shape)
print("Attention Mask shape:", tokenized_questions["attention_mask"].shape)

Input IDs shape: torch.Size([137, 386])
Attention Mask shape: torch.Size([137, 386])


In [77]:
# Save the trained model
model_save_path = "/kaggle/working/model"

## **Start the Training**

In [78]:
for epoch in range(1):
    optimizer.zero_grad()

    # Modify the attention mask right before passing it to the model
    attention_mask = tokenized_questions["attention_mask"].unsqueeze(1).unsqueeze(2)
    attention_mask = attention_mask.to(dtype=next(qa_transformer.parameters()).dtype)

    output = qa_transformer(input_ids=tokenized_questions["input_ids"], attention_mask=attention_mask)
    
    # Flatten the output and target tensors, excluding the padded tokens
    output_flat = output.view(-1, tokenizer.vocab_size)
    target_flat = tokenized_answers["input_ids"].contiguous().view(-1)
    mask = (target_flat != tokenizer.pad_token_id)
    non_padded_indices = mask.nonzero().squeeze()  # Get indices of non-padded tokens
    
#     print("Output flat shape:", output_flat.shape)
#     print("Target flat shape:", target_flat.shape)
#     print("Mask shape:", mask.shape)
    
    loss = criterion(output_flat[non_padded_indices], target_flat[non_padded_indices])
    
    # Calculate the loss using only the non-padded tokens
#     loss = criterion(output_flat[mask], target_flat[mask])
    loss.backward()
    optimizer.step()
    torch.save(qa_transformer.state_dict(), model_save_path)
    print(f"Trained model saved to {model_save_path}")
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")


Trained model saved to /kaggle/working/model
Epoch: 1, Loss: 0.2824430763721466


# **Save the Model**

In [79]:
# Save the trained model
model_save_path_final = "/kaggle/working/model_final_1"
torch.save(qa_transformer.state_dict(), model_save_path_final)
print(f"Trained model saved to {model_save_path_final}")

Trained model saved to /kaggle/working/model_final_1


In [80]:
# # Inference using the model
# with torch.no_grad():
#     output = qa_transformer(input_ids=tokenized_questions["input_ids"], attention_mask=attention_mask)

## **Inference Using the Model (Test Dataset)**

In [81]:
# Tokenize questions and contexts for Test Dataset
tokenized_questions_test = tokenizer(
    test_questions,
    test_contexts,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_attention_mask=True  # Add this line to get attention masks
)

# Tokenize answers for Test Dataset
tokenized_answers_test = tokenizer(
    test_answers,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_attention_mask=True  # Add this line to get attention masks
)

In [82]:
print("Input IDs shape:", tokenized_questions_test["input_ids"].shape)
print("Attention Mask shape:", tokenized_questions_test["attention_mask"].shape)

Input IDs shape: torch.Size([43, 257])
Attention Mask shape: torch.Size([43, 257])


In [83]:
# These input_ids and attention_mask are for the Test Dataset
# input_ids_test = tokenized_questions_test["input_ids"]
# attention_mask_test = tokenized_questions_test["attention_mask"].unsqueeze(1).unsqueeze(2)

> **We're performing inference using the trained QA Transformer model. Let's break down the code snippet:**

1. **Inference Block (`with torch.no_grad():`)**
   - This context manager is used to ensure that no gradients are computed during inference, which helps save memory and processing time since gradients aren't needed for inference.

2. **Model Inference**
   - We're passing `input_ids` and `attention_mask` to the `qa_transformer` model to generate predictions.
   - `input_ids` likely contains the token IDs of the input sequence you want to perform QA on.
   - `attention_mask` is used to mask out padding tokens so that the model doesn't attend to them during inference.
   - The `qa_transformer` model computes predictions for each token in the input sequence.

> **After this code block, the `output` tensor contains the model's predictions for each position in the sequence.**

In [84]:
# Modify the attention mask right before passing it to the model
attention_mask = tokenized_questions_test["attention_mask"].unsqueeze(1).unsqueeze(2)
attention_mask = attention_mask.to(dtype=next(qa_transformer.parameters()).dtype)
with torch.no_grad():
    test_output = qa_transformer(input_ids=tokenized_questions_test["input_ids"], attention_mask=attention_mask)

In [85]:
# Convert model output to answer tokens
predicted_answer_tokens_test = torch.argmax(test_output, dim=-1)

In [86]:
# Decode answer tokens
predicted_answers_test = []

In [87]:
for answer_tokens in predicted_answer_tokens_test:
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    predicted_answers_test.append(answer)

In [88]:
# Print predicted answers
for question, predicted_answer in zip(train_questions, predicted_answers_test):
    print(f"Question: {question}")
    print(f"Predicted Answer: {predicted_answer}")
    print("=" * 50)

Question: How does the Lempel-Ziv algorithm improve compression efficiency?
Predicted Answer: it was data and, algorithms radio like of in allowing,, it call adjusting, communication communication information digital, voice different digital in, a it for communication and communication about and and ar and management it,,,. structured,,, dh.,.,, congestion, by telecommunication.. communication.....,.,,. routing routing,, and and..ging uses,,, ) communication..... over,, while.. - telephone, long. digital, computers 4 and,. horns,,. providers providers... path,,.,,.,. it it.,,, and and,.. service..,,
Question: What is the alternative method of dividing a communications medium into channels?
Predicted Answer: over amplitude it. multiple. they to data. communication in a, or they analog three by it adjusting,, or over'sequences'wave, modulation used. databand, other or communication and on or they over technology of and, for digital a telecommunication conventions,. and and.,,, congestion

# **Code to load the model and answer a single question**

In [89]:
# Code to load the model and answer a single question
# Load the trained model and tokenizer
model_path = "/kaggle/working/model_final_1"
qa_transformer = QATransformer(tokenizer.vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)
qa_transformer.load_state_dict(torch.load(model_path))
qa_transformer.eval()

# Write a new question
new_question = "What is a modem?"

# Tokenize the new question
tokenized_new_question = tokenizer(
    new_question,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_attention_mask=True
)

# Modify the attention mask right before passing it to the model
attention_mask = tokenized_new_question["attention_mask"].unsqueeze(1).unsqueeze(2)
attention_mask = attention_mask.to(dtype=next(qa_transformer.parameters()).dtype)
# Pass the tokenized question through the model
with torch.no_grad():
    output = qa_transformer(input_ids=tokenized_new_question["input_ids"], attention_mask=attention_mask)
    
# Convert model output to answer tokens
predicted_answer_tokens = torch.argmax(output, dim=-1)

# Decode answer tokens
predicted_answer = tokenizer.decode(predicted_answer_tokens[0], skip_special_tokens=True)

# Print the predicted answer
print(f"Question: {new_question}")
print(f"Predicted Answer: {predicted_answer}")


Question: What is a modem?
Predicted Answer: . over of between, it parisg
