# Empathetic Chatbot - Transformer from Scratch

This notebook implements a Transformer encoder-decoder model from scratch for generating empathetic responses.

**Dataset**: Empathetic Dialogues (Facebook AI)

**Architecture**: Transformer encoder-decoder (no pretrained weights)

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

# For text processing
import re
import string
from collections import Counter, defaultdict
import json

# For evaluation metrics
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# For data splitting
from sklearn.model_selection import train_test_split

# Utils
import math
import random
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 2. Configuration and Hyperparameters

### Important: Update DATA_PATH

**For Kaggle:** Set `DATA_PATH = '/kaggle/input/your-dataset-name/demo.csv'`

**For Local:** Set `DATA_PATH = 'demo.csv'` or full path like `'d:/BSSE Notes/Generative AI/Assignment2/demo.csv'`
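
If you switch between Kaggle and a local machine often, one option is to probe a list of candidate locations and use the first one that exists. The snippet below is a minimal sketch, not part of the original notebook: `resolve_data_path` and `CANDIDATE_PATHS` are hypothetical names, and the Kaggle path simply mirrors the default used in `Config`.

```python
import os

def resolve_data_path(candidates):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None

# Hypothetical candidate list; adjust it to your own environment.
CANDIDATE_PATHS = [
    '/kaggle/input/empathetic-dialogues-facebook-ai/demo.csv',  # Kaggle
    'demo.csv',                                                 # local working directory
]

data_path = resolve_data_path(CANDIDATE_PATHS)
print(f"Resolved DATA_PATH: {data_path}")
```

The result could then be assigned to `Config.DATA_PATH`; a value of `None` means none of the candidates were found.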

In [None]:
class Config:
    # Data paths (update this to your data file path)
    DATA_PATH = '/kaggle/input/empathetic-dialogues-facebook-ai/demo.csv'  # Single CSV file
    # For local testing, use: 'demo.csv' or full path
    
    # Model hyperparameters
    EMBEDDING_DIM = 256
    NUM_HEADS = 4
    NUM_ENCODER_LAYERS = 2
    NUM_DECODER_LAYERS = 2
    FFN_DIM = 512
    DROPOUT = 0.1
    MAX_SEQ_LEN = 128
    
    # Training hyperparameters
    BATCH_SIZE = 32
    LEARNING_RATE = 3e-4
    NUM_EPOCHS = 20
    GRAD_CLIP = 1.0
    
    # Special tokens
    PAD_TOKEN = '<pad>'
    BOS_TOKEN = '<bos>'
    EOS_TOKEN = '<eos>'
    UNK_TOKEN = '<unk>'
    
    # Vocabulary
    MIN_FREQ = 2
    MAX_VOCAB_SIZE = 10000
    
    # Data split ratios
    TRAIN_RATIO = 0.8
    VAL_RATIO = 0.1
    TEST_RATIO = 0.1

config = Config()
print("Configuration loaded successfully!")

## 3. Data Preprocessing

This section handles:
- Loading the dataset
- Text normalization (lowercasing and whitespace collapsing; punctuation is kept and split into separate tokens)
- Building vocabulary from training data
- Train/Val/Test split (80/10/10)

### Dataset Format

Your data has the following structure:
- **Situation**: Background context (e.g., "I remember going to the fireworks with my best friend")
- **emotion**: Emotion label (e.g., "sentimental", "afraid", "proud", "faithful")
- **empathetic_dialogues**: Contains conversation in format: `"Customer :{text}\nAgent :"`
- **labels**: The agent's response (ground truth)

The code will:
1. Load a single CSV file
2. Split into 80% train / 10% validation / 10% test
3. Extract customer utterance from `empathetic_dialogues` column
4. Format as: `"Emotion: {emotion} | Situation: {situation} | Customer: {customer} Agent:"`
5. Use `labels` as target output
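
To make the input format concrete, here is a small standalone sketch of steps 3 to 5 applied to one row. It uses only the standard library; `extract_customer` is a hypothetical stand-in for the notebook's own extraction helper, and the row values are the sample values quoted in this notebook.

```python
def extract_customer(dialogue):
    """Return the text between 'Customer :' and 'Agent :', or '' if either tag is missing."""
    start_tag, end_tag = "Customer :", "Agent :"
    start = dialogue.find(start_tag)
    if start == -1:
        return ""
    start += len(start_tag)
    end = dialogue.find(end_tag, start)
    if end == -1:
        return ""
    return dialogue[start:end].strip()

# One illustrative row with the four columns described above.
row = {
    "Situation": "I remember going to the fireworks with my best friend",
    "emotion": "sentimental",
    "empathetic_dialogues": "Customer :This was a best friend. I miss her.\nAgent :",
    "labels": "Where has she gone?",
}

customer = extract_customer(row["empathetic_dialogues"])
x_text = (f"Emotion: {row['emotion']} | Situation: {row['Situation']} | "
          f"Customer: {customer} Agent:")
y_text = row["labels"]

print(x_text)
print(y_text)
```

The model input `x_text` packs emotion, situation, and customer utterance into one string, while `y_text` is the response the decoder learns to produce.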

In [None]:
def normalize_text(text):
    """Normalize text: lowercase and collapse whitespace (punctuation is kept as-is)"""
    if pd.isna(text):
        return ""
    
    text = str(text).lower()
    # Collapse runs of whitespace (including newlines) into a single space
    text = re.sub(r'\s+', ' ', text)
    # Punctuation is left untouched here; simple_tokenize splits it into separate tokens
    text = text.strip()
    
    return text

def simple_tokenize(text):
    """Simple word tokenization"""
    # Split on whitespace and punctuation but keep punctuation
    tokens = re.findall(r'\w+|[^\w\s]', text)
    return tokens

print("Text preprocessing functions defined!")

In [None]:
def load_dataset(data_path):
    """Load the empathetic dialogues dataset from a single CSV file"""
    try:
        # Try loading the dataset
        df = pd.read_csv(data_path)
        print(f"Loaded dataset: {len(df)} samples")
        print(f"Columns: {df.columns.tolist()}")
        
        # Display sample
        print(f"\nSample row:")
        print(df.head(1))
        
        # Split into train/val/test (80/10/10)
        
        # First split: 80% train, 20% temp
        train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)
        
        # Second split: split temp into 50/50 for val and test (each 10% of total)
        val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, shuffle=True)
        
        print(f"\nSplit sizes:")
        print(f"  Train: {len(train_df)} samples ({len(train_df)/len(df)*100:.1f}%)")
        print(f"  Validation: {len(val_df)} samples ({len(val_df)/len(df)*100:.1f}%)")
        print(f"  Test: {len(test_df)} samples ({len(test_df)/len(df)*100:.1f}%)")
        
        return train_df, val_df, test_df
        
    except FileNotFoundError:
        print(f"Error: File not found at {data_path}")
        print("Falling back to demo.csv in the current directory...")
        try:
            df = pd.read_csv('demo.csv')
            print(f"Loaded demo.csv: {len(df)} samples")
            train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)
            val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, shuffle=True)
            return train_df, val_df, test_df
        except Exception as exc:
            # Catch Exception instead of a bare except so KeyboardInterrupt still propagates
            print(f"Could not load demo.csv ({exc})")
            print("Creating minimal sample data for demonstration...")
            sample_data = {
                'Situation': ['I remember going to the fireworks with my best friend'] * 10,
                'emotion': ['sentimental'] * 10,
                'empathetic_dialogues': ['Customer :This was a best friend. I miss her.\nAgent :'] * 10,
                'labels': ['Where has she gone?'] * 10
            }
            df = pd.DataFrame(sample_data)
            # Smoke-test fallback only: the three splits are identical copies,
            # so validation and test metrics on this data are not meaningful
            train_df = df.copy()
            val_df = df.copy()
            test_df = df.copy()
            return train_df, val_df, test_df

# Load dataset
train_df, val_df, test_df = load_dataset(config.DATA_PATH)
print(f"\nDataset ready for processing!")

In [None]:
class Vocabulary:
    """Vocabulary class for token to index mapping"""
    
    def __init__(self, pad_token='<pad>', bos_token='<bos>', 
                 eos_token='<eos>', unk_token='<unk>'):
        self.pad_token = pad_token
        self.bos_token = bos_token
        self.eos_token = eos_token
        self.unk_token = unk_token
        
        # Initialize with special tokens
        self.token2idx = {
            pad_token: 0,
            bos_token: 1,
            eos_token: 2,
            unk_token: 3
        }
        self.idx2token = {v: k for k, v in self.token2idx.items()}
        self.token_counts = Counter()
        
    def build_vocabulary(self, texts, min_freq=2, max_vocab_size=10000):
        """Build vocabulary from texts"""
        # Count all tokens
        for text in texts:
            tokens = simple_tokenize(normalize_text(text))
            self.token_counts.update(tokens)
        
        # Get most common tokens
        most_common = self.token_counts.most_common(max_vocab_size)
        
        # Add tokens that meet minimum frequency
        for token, count in most_common:
            if count >= min_freq and token not in self.token2idx:
                idx = len(self.token2idx)
                self.token2idx[token] = idx
                self.idx2token[idx] = token
        
        print(f"Vocabulary size: {len(self.token2idx)}")
        print(f"Most common tokens: {most_common[:10]}")
        
    def encode(self, text, add_special_tokens=True):
        """Convert text to token indices"""
        tokens = simple_tokenize(normalize_text(text))
        indices = []
        
        if add_special_tokens:
            indices.append(self.token2idx[self.bos_token])
        
        for token in tokens:
            indices.append(self.token2idx.get(token, self.token2idx[self.unk_token]))
        
        if add_special_tokens:
            indices.append(self.token2idx[self.eos_token])
        
        return indices
    
    def decode(self, indices, skip_special_tokens=True):
        """Convert token indices back to text"""
        tokens = []
        special_tokens = {self.token2idx[self.pad_token], 
                         self.token2idx[self.bos_token], 
                         self.token2idx[self.eos_token]}
        
        for idx in indices:
            if skip_special_tokens and idx in special_tokens:
                continue
            tokens.append(self.idx2token.get(idx, self.unk_token))
        
        return ' '.join(tokens)
    
    def __len__(self):
        return len(self.token2idx)

print("Vocabulary class defined!")

In [None]:
def prepare_data(train_df, val_df, test_df):
    """
    Prepare X and Y pairs in format:
    X: "Emotion: {emotion} | Situation: {situation} | Customer: {customer_utterance} Agent:"
    Y: "{agent_reply}"
    
    Data format:
    - Situation: background context
    - emotion: the emotion label
    - empathetic_dialogues: contains "Customer :{text}\nAgent :"
    - labels: the agent's response
    """
    
    def extract_customer_utterance(dialogue_text):
        """Extract customer utterance from empathetic_dialogues column"""
        if pd.isna(dialogue_text):
            return ""
        
        # Format: "Customer :{text}\nAgent :"
        text = str(dialogue_text)
        
        # Extract text between "Customer :" and "Agent :"
        if "Customer :" in text and "Agent :" in text:
            start = text.index("Customer :") + len("Customer :")
            end = text.index("Agent :")
            customer_text = text[start:end].strip()
            return customer_text
        
        return ""
    
    def create_xy_pairs(df):
        X_list = []
        Y_list = []
        
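        def clean(value, default=''):
            # Guard against NaN cells, which str() would turn into the literal 'nan'
            return default if pd.isna(value) else str(value).strip()
        
        for _, row in df.iterrows():
            # Extract fields from the CSV columns
            situation = clean(row.get('Situation', ''))
            emotion = clean(row.get('emotion', 'neutral'), default='neutral')
            customer = extract_customer_utterance(row.get('empathetic_dialogues', ''))
            agent = clean(row.get('labels', ''))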
            
            # Skip if any essential field is missing
            if not customer or not agent:
                continue
            
            # Format input according to requirements
            x_text = f"Emotion: {emotion} | Situation: {situation} | Customer: {customer} Agent:"
            y_text = agent
            
            X_list.append(x_text)
            Y_list.append(y_text)
        
        return X_list, Y_list
    
    train_X, train_Y = create_xy_pairs(train_df)
    val_X, val_Y = create_xy_pairs(val_df)
    test_X, test_Y = create_xy_pairs(test_df)
    
    print(f"Training pairs: {len(train_X)}")
    print(f"Validation pairs: {len(val_X)}")
    print(f"Test pairs: {len(test_X)}")
    print(f"\nExample input X:")
    print(f"{train_X[0][:200]}...")
    print(f"\nExample output Y:")
    print(f"{train_Y[0]}")
    
    return train_X, train_Y, val_X, val_Y, test_X, test_Y

# Prepare data
train_X, train_Y, val_X, val_Y, test_X, test_Y = prepare_data(train_df, val_df, test_df)

In [None]:
# Build vocabulary from training data only
vocab = Vocabulary(
    pad_token=config.PAD_TOKEN,
    bos_token=config.BOS_TOKEN,
    eos_token=config.EOS_TOKEN,
    unk_token=config.UNK_TOKEN
)

# Combine all training texts to build vocabulary
all_train_texts = train_X + train_Y
vocab.build_vocabulary(all_train_texts, min_freq=config.MIN_FREQ, max_vocab_size=config.MAX_VOCAB_SIZE)

# Test encoding/decoding
sample_text = train_Y[0]
encoded = vocab.encode(sample_text)
decoded = vocab.decode(encoded)
print(f"\nOriginal: {sample_text}")
print(f"Encoded: {encoded[:20]}...")
print(f"Decoded: {decoded}")

In [None]:
# Verify data format - show a few examples
print("="*80)
print("DATA VERIFICATION - Sample Input/Output Pairs")
print("="*80)

for i in range(min(3, len(train_X))):
    print(f"\n--- Example {i+1} ---")
    print(f"INPUT (X):\n{train_X[i]}")
    print(f"\nOUTPUT (Y):\n{train_Y[i]}")
    print("-"*80)

## 4. PyTorch Dataset and DataLoader

In [None]:
class EmpatheticDataset(Dataset):
    """Dataset for empathetic dialogues"""
    
    def __init__(self, X_texts, Y_texts, vocab, max_len=128):
        self.X_texts = X_texts
        self.Y_texts = Y_texts
        self.vocab = vocab
        self.max_len = max_len
        
    def __len__(self):
        return len(self.X_texts)
    
    def __getitem__(self, idx):
        x_text = self.X_texts[idx]
        y_text = self.Y_texts[idx]
        
        # Encode texts
        x_encoded = self.vocab.encode(x_text, add_special_tokens=True)
        y_encoded = self.vocab.encode(y_text, add_special_tokens=True)
        
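        # Truncate if needed, keeping EOS as the final token so the model still sees a stop signal
        eos_idx = self.vocab.token2idx[self.vocab.eos_token]
        if len(x_encoded) > self.max_len:
            x_encoded = x_encoded[:self.max_len - 1] + [eos_idx]
        if len(y_encoded) > self.max_len:
            y_encoded = y_encoded[:self.max_len - 1] + [eos_idx]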
        
        return {
            'encoder_input': torch.tensor(x_encoded, dtype=torch.long),
            'decoder_input': torch.tensor(y_encoded[:-1], dtype=torch.long),  # Remove last token
            'decoder_target': torch.tensor(y_encoded[1:], dtype=torch.long)   # Remove first token (BOS)
        }

print("EmpatheticDataset class defined!")

In [None]:
def collate_fn(batch):
    """Custom collate function to pad sequences in a batch"""
    encoder_inputs = [item['encoder_input'] for item in batch]
    decoder_inputs = [item['decoder_input'] for item in batch]
    decoder_targets = [item['decoder_target'] for item in batch]
    
    # Pad sequences
    encoder_inputs_padded = pad_sequence(encoder_inputs, batch_first=True, padding_value=0)
    decoder_inputs_padded = pad_sequence(decoder_inputs, batch_first=True, padding_value=0)
    decoder_targets_padded = pad_sequence(decoder_targets, batch_first=True, padding_value=0)
    
    return {
        'encoder_input': encoder_inputs_padded,
        'decoder_input': decoder_inputs_padded,
        'decoder_target': decoder_targets_padded
    }

# Create datasets
train_dataset = EmpatheticDataset(train_X, train_Y, vocab, max_len=config.MAX_SEQ_LEN)
val_dataset = EmpatheticDataset(val_X, val_Y, vocab, max_len=config.MAX_SEQ_LEN)
test_dataset = EmpatheticDataset(test_X, test_Y, vocab, max_len=config.MAX_SEQ_LEN)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=config.BATCH_SIZE, 
                         shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=config.BATCH_SIZE, 
                       shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=config.BATCH_SIZE, 
                        shuffle=False, collate_fn=collate_fn)

print(f"Train batches: {len(train_loader)}")
print(f"Val batches: {len(val_loader)}")
print(f"Test batches: {len(test_loader)}")

# Test a batch
sample_batch = next(iter(train_loader))
print(f"\nSample batch shapes:")
print(f"  Encoder input: {sample_batch['encoder_input'].shape}")
print(f"  Decoder input: {sample_batch['decoder_input'].shape}")
print(f"  Decoder target: {sample_batch['decoder_target'].shape}")

## 5. Transformer Architecture from Scratch

We'll implement all components:
- Positional Encoding
- Multi-Head Attention
- Feed-Forward Network
- Encoder Layer
- Decoder Layer
- Full Transformer Model
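
Two of these building blocks are easy to inspect by hand before looking at the PyTorch classes: the sinusoidal positional encoding and the causal (look-ahead) mask. The sketch below recomputes both with plain Python lists so the values can be printed and checked. The function names `sinusoidal_encoding` and `causal_mask` are illustrative and do not appear in the notebook, which builds the same quantities as tensors.

```python
import math

def sinusoidal_encoding(max_len, d_model):
    """Return a max_len x d_model table: sin on even columns, cos on odd columns."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency as exp(i * (-log(10000) / d_model)) in the tensor version
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table

def causal_mask(seq_len):
    """Lower-triangular mask: query position q may attend to key positions k <= q."""
    return [[1 if k <= q else 0 for k in range(seq_len)] for q in range(seq_len)]

pe = sinusoidal_encoding(max_len=4, d_model=6)
mask = causal_mask(3)

print([round(v, 3) for v in pe[1]])  # encoding of position 1
for row in mask:
    print(row)
```

Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0), and each row of the mask exposes one more token than the row above it, which is what keeps the decoder from seeing future tokens during training.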

In [None]:
class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding"""
    
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)  # Add batch dimension
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, d_model)
        """
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

print("PositionalEncoding defined!")

In [None]:
class MultiHeadAttention(nn.Module):
    """Multi-Head Attention mechanism"""
    
    def __init__(self, d_model, num_heads, dropout=0.1):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(dropout)
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """
        Args:
            Q, K, V: (batch_size, num_heads, seq_len, d_k)
            mask: (batch_size, 1, seq_len, seq_len) or (batch_size, 1, 1, seq_len)
        """
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        output = torch.matmul(attention_weights, V)
        return output, attention_weights
    
    def forward(self, query, key, value, mask=None):
        """
        Args:
            query, key, value: (batch_size, seq_len, d_model)
            mask: broadcastable to (batch_size, num_heads, q_len, k_len), e.g. (batch_size, 1, tgt_len, tgt_len) for the decoder or (batch_size, 1, 1, src_len) for padding
        """
        batch_size = query.size(0)
        
        # Linear projections and split into heads
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Apply attention
        x, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        
        # Final linear projection
        output = self.W_o(x)
        
        return output, attention_weights

print("MultiHeadAttention defined!")

In [None]:
class FeedForward(nn.Module):
    """Position-wise Feed-Forward Network"""
    
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        """
        Args:
            x: (batch_size, seq_len, d_model)
        """
        x = F.relu(self.linear1(x))
        x = self.dropout(x)
        x = self.linear2(x)
        return x

print("FeedForward defined!")

In [None]:
class EncoderLayer(nn.Module):
    """Transformer Encoder Layer"""
    
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        """
        Args:
            x: (batch_size, seq_len, d_model)
            mask: padding mask of shape (batch_size, 1, 1, seq_len)
        """
        # Self-attention with residual connection
        attn_output, _ = self.self_attention(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        
        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout2(ff_output))
        
        return x

print("EncoderLayer defined!")

In [None]:
class DecoderLayer(nn.Module):
    """Transformer Decoder Layer"""
    
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.cross_attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)
        
    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        """
        Args:
            x: (batch_size, tgt_seq_len, d_model)
            encoder_output: (batch_size, src_seq_len, d_model)
            src_mask: padding mask for encoder output (batch_size, 1, 1, src_seq_len)
            tgt_mask: combined padding and causal mask for decoder (batch_size, 1, tgt_seq_len, tgt_seq_len)
        """
        # Masked self-attention with residual connection
        attn_output, _ = self.self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout1(attn_output))
        
        # Cross-attention with encoder output and residual connection
        attn_output, _ = self.cross_attention(x, encoder_output, encoder_output, src_mask)
        x = self.norm2(x + self.dropout2(attn_output))
        
        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout3(ff_output))
        
        return x

print("DecoderLayer defined!")

In [None]:
class Transformer(nn.Module):
    """Complete Transformer Encoder-Decoder Model"""
    
    def __init__(self, vocab_size, d_model=256, num_heads=4, 
                 num_encoder_layers=2, num_decoder_layers=2, 
                 d_ff=512, dropout=0.1, max_seq_len=128, pad_idx=0):
        super(Transformer, self).__init__()
        
        self.d_model = d_model
        self.pad_idx = pad_idx
        
        # Embeddings
        self.encoder_embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_idx)
        self.decoder_embedding = nn.Embedding(vocab_size, d_model, padding_idx=pad_idx)
        
        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len, dropout)
        
        # Encoder layers
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_encoder_layers)
        ])
        
        # Decoder layers
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_decoder_layers)
        ])
        
        # Output projection
        self.fc_out = nn.Linear(d_model, vocab_size)
        
        self.dropout = nn.Dropout(dropout)
        
        # Initialize weights
        self._init_weights()
        
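    def _init_weights(self):
        """Initialize weight matrices with Xavier uniform initialization"""
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
        # Xavier init also overwrites the embedding rows of the padding token,
        # so reset them to zero as nn.Embedding(padding_idx=...) intends
        with torch.no_grad():
            self.encoder_embedding.weight[self.pad_idx].zero_()
            self.decoder_embedding.weight[self.pad_idx].zero_()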
    
    def make_src_mask(self, src):
        """Create mask for source sequence (padding mask)"""
        # src: (batch_size, src_len)
        src_mask = (src != self.pad_idx).unsqueeze(1).unsqueeze(2)
        # (batch_size, 1, 1, src_len)
        return src_mask
    
    def make_tgt_mask(self, tgt):
        """Create combined padding and causal mask for target sequence"""
        # tgt: (batch_size, tgt_len)
        batch_size, tgt_len = tgt.size()
        
        # Padding mask
        tgt_pad_mask = (tgt != self.pad_idx).unsqueeze(1).unsqueeze(2)
        # (batch_size, 1, 1, tgt_len)
        
        # Causal mask (no future information)
        tgt_sub_mask = torch.tril(torch.ones((tgt_len, tgt_len), device=tgt.device)).bool()
        # (tgt_len, tgt_len)
        
        tgt_mask = tgt_pad_mask & tgt_sub_mask
        # (batch_size, 1, tgt_len, tgt_len)
        
        return tgt_mask
    
    def encode(self, src, src_mask):
        """Encoder forward pass"""
        # Embedding + positional encoding
        x = self.encoder_embedding(src) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        
        # Pass through encoder layers
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        
        return x
    
    def decode(self, tgt, encoder_output, src_mask, tgt_mask):
        """Decoder forward pass"""
        # Embedding + positional encoding
        x = self.decoder_embedding(tgt) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        
        # Pass through decoder layers
        for layer in self.decoder_layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        
        return x
    
    def forward(self, src, tgt):
        """
        Args:
            src: (batch_size, src_len)
            tgt: (batch_size, tgt_len)
        Returns:
            output: (batch_size, tgt_len, vocab_size)
        """
        src_mask = self.make_src_mask(src)
        tgt_mask = self.make_tgt_mask(tgt)
        
        encoder_output = self.encode(src, src_mask)
        decoder_output = self.decode(tgt, encoder_output, src_mask, tgt_mask)
        
        output = self.fc_out(decoder_output)
        
        return output

print("Transformer model defined!")

## 6. Model Instantiation and Training Setup

In [None]:
# Initialize model
model = Transformer(
    vocab_size=len(vocab),
    d_model=config.EMBEDDING_DIM,
    num_heads=config.NUM_HEADS,
    num_encoder_layers=config.NUM_ENCODER_LAYERS,
    num_decoder_layers=config.NUM_DECODER_LAYERS,
    d_ff=config.FFN_DIM,
    dropout=config.DROPOUT,
    max_seq_len=config.MAX_SEQ_LEN,
    pad_idx=vocab.token2idx[config.PAD_TOKEN]
).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Optimizer - Adam with betas=(0.9, 0.98)
optimizer = torch.optim.Adam(model.parameters(), lr=config.LEARNING_RATE, betas=(0.9, 0.98), eps=1e-9)

# Loss function - CrossEntropyLoss with padding ignored
criterion = nn.CrossEntropyLoss(ignore_index=vocab.token2idx[config.PAD_TOKEN])

print(f"\nModel initialized on {device}")
print(f"Optimizer: Adam (lr={config.LEARNING_RATE}, betas=(0.9, 0.98))")
print(f"Loss function: CrossEntropyLoss (ignore padding)")

## 7. Training with Teacher Forcing

Training loop with:
- Teacher forcing (use ground truth as decoder input)
- Gradient clipping
- Validation after each epoch
- Save best model based on validation BLEU score
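
The first two items can be illustrated without a model. The sketch below shows, under simplified assumptions, how one encoded target sequence is split into decoder input and decoder target for teacher forcing, and how clipping by global norm rescales gradients. The token ids are made up, and `teacher_forcing_pair` and `clip_by_global_norm` are hypothetical helper names; the notebook itself does the shift with slicing and the clipping with `torch.nn.utils.clip_grad_norm_`.

```python
import math

BOS, EOS = 1, 2  # same indices the notebook's Vocabulary assigns to <bos> and <eos>

def teacher_forcing_pair(token_ids):
    """Split [BOS, w1, ..., wn, EOS] into decoder input and decoder target."""
    decoder_input = token_ids[:-1]   # [BOS, w1, ..., wn]
    decoder_target = token_ids[1:]   # [w1, ..., wn, EOS]
    return decoder_input, decoder_target

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

encoded = [BOS, 17, 42, 8, EOS]  # made-up token ids for illustration
dec_in, dec_tgt = teacher_forcing_pair(encoded)
for step, target in enumerate(dec_tgt):
    # At each step the decoder sees the ground-truth prefix, not its own predictions
    print(f"step {step}: sees {dec_in[:step + 1]} -> should predict {target}")

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(f"norm before: {norm_before}, clipped: {clipped}")
```

Because every step conditions on the ground-truth prefix, all positions can be trained in parallel with a causal mask; at inference time the model has to feed back its own predictions instead.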

In [None]:
def train_epoch(model, dataloader, optimizer, criterion, device, clip_grad=1.0):
    """Train for one epoch"""
    model.train()
    total_loss = 0
    
    for batch in dataloader:
        src = batch['encoder_input'].to(device)
        tgt_input = batch['decoder_input'].to(device)
        tgt_output = batch['decoder_target'].to(device)
        
        # Forward pass with teacher forcing
        optimizer.zero_grad()
        output = model(src, tgt_input)
        
        # Reshape for loss calculation
        output = output.reshape(-1, output.shape[-1])
        tgt_output = tgt_output.reshape(-1)
        
        # Calculate loss
        loss = criterion(output, tgt_output)
        
        # Backward pass
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
        
        optimizer.step()
        
        total_loss += loss.item()
    
    avg_loss = total_loss / len(dataloader)
    return avg_loss

def evaluate_loss(model, dataloader, criterion, device):
    """Evaluate loss on validation/test set"""
    model.eval()
    total_loss = 0
    
    with torch.no_grad():
        for batch in dataloader:
            src = batch['encoder_input'].to(device)
            tgt_input = batch['decoder_input'].to(device)
            tgt_output = batch['decoder_target'].to(device)
            
            output = model(src, tgt_input)
            
            output = output.reshape(-1, output.shape[-1])
            tgt_output = tgt_output.reshape(-1)
            
            loss = criterion(output, tgt_output)
            total_loss += loss.item()
    
    avg_loss = total_loss / len(dataloader)
    return avg_loss

print("Training functions defined!")

## 8. Evaluation Metrics

Implementation of:
- BLEU score
- ROUGE-L
- chrF (character n-gram F-score)
- Perplexity
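
Perplexity is the least intuitive of these, so here is a small worked sketch. It assumes we already have the probability the model assigned to each correct target token (the values below are invented); `perplexity_from_token_probs` is a hypothetical helper, whereas the notebook derives perplexity from the averaged cross-entropy loss.

```python
import math

def perplexity_from_token_probs(probs):
    """Perplexity = exp(mean negative log-likelihood of the target tokens)."""
    nll = [-math.log(p) for p in probs]
    mean_nll = sum(nll) / len(nll)
    return math.exp(mean_nll)

# A model that spreads probability uniformly over 4 choices at every step
uniform = [0.25, 0.25, 0.25, 0.25]
# A model that is fairly confident about each correct token (invented values)
confident = [0.9, 0.8, 0.95, 0.85]

print(perplexity_from_token_probs(uniform))
print(perplexity_from_token_probs(confident))
```

A perplexity of about 4 for the uniform case reads as "the model is as uncertain as if it were choosing among 4 equally likely tokens"; lower values mean the model concentrates more probability on the reference tokens.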

In [None]:
def calculate_bleu(references, hypotheses):
    """Calculate the mean of smoothed sentence-level BLEU scores (not corpus-level BLEU)"""
    smooth_fn = SmoothingFunction().method1
    scores = []
    
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        score = sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smooth_fn)
        scores.append(score)
    
    return sum(scores) / len(scores) if scores else 0.0

def calculate_rouge_l(references, hypotheses):
    """Calculate ROUGE-L score"""
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    scores = []
    
    for ref, hyp in zip(references, hypotheses):
        score = scorer.score(ref, hyp)
        scores.append(score['rougeL'].fmeasure)
    
    return sum(scores) / len(scores) if scores else 0.0

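def calculate_chrf(references, hypotheses):
    """Calculate a simplified chrF-style score: F1 over sets of character trigrams.

    Note: this is not the standard chrF metric, which averages several n-gram
    orders and weights recall more heavily than precision.
    """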
    def char_ngrams(text, n=3):
        """Get character n-grams"""
        chars = list(text.replace(' ', ''))
        return [tuple(chars[i:i+n]) for i in range(len(chars)-n+1)]
    
    scores = []
    for ref, hyp in zip(references, hypotheses):
        ref_ngrams = set(char_ngrams(ref))
        hyp_ngrams = set(char_ngrams(hyp))
        
        if not hyp_ngrams:
            scores.append(0.0)
            continue
        
        intersection = len(ref_ngrams & hyp_ngrams)
        precision = intersection / len(hyp_ngrams) if hyp_ngrams else 0
        recall = intersection / len(ref_ngrams) if ref_ngrams else 0
        
        if precision + recall > 0:
            f_score = 2 * precision * recall / (precision + recall)
        else:
            f_score = 0.0
        
        scores.append(f_score)
    
    return sum(scores) / len(scores) if scores else 0.0

def calculate_perplexity(loss):
    """Calculate perplexity as exp(loss), where loss is the mean cross-entropy in nats"""
    return math.exp(loss)

print("Evaluation metrics defined!")

## 9. Inference Functions

Implementation of:
- Greedy decoding
- Beam search decoding
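
The difference between the two strategies is easiest to see on a toy example where the locally best first token leads to a worse overall sequence. The sketch below is a self-contained illustration, not the notebook's implementation: `TOY_MODEL` is a hypothetical lookup table standing in for the Transformer, with invented probabilities, and the beam search expands every next token rather than only the top `beam_width` per beam.

```python
import math

# Hypothetical toy "model": maps a prefix to next-token probabilities.
TOY_MODEL = {
    ("<bos>",): {"i": 0.6, "that": 0.4},
    ("<bos>", "i"): {"see": 0.5, "know": 0.5},
    ("<bos>", "that"): {"hurts": 0.9, "is": 0.1},
}

def next_log_probs(prefix):
    """Return {token: log-prob}; any prefix not in the table ends the sequence."""
    probs = TOY_MODEL.get(tuple(prefix), {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def greedy(max_len=5):
    """Pick the single most probable token at every step."""
    seq, score = ["<bos>"], 0.0
    for _ in range(max_len):
        log_probs = next_log_probs(seq)
        tok = max(log_probs, key=log_probs.get)
        seq.append(tok)
        score += log_probs[tok]
        if tok == "<eos>":
            break
    return seq, score

def beam_search(beam_width=2, max_len=5):
    """Keep the beam_width best partial sequences by summed log-probability."""
    beams = [(["<bos>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished beams carry over unchanged
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams[0]

greedy_seq, greedy_score = greedy()
beam_seq, beam_score = beam_search(beam_width=2)
print("greedy:", greedy_seq, round(math.exp(greedy_score), 2))
print("beam:  ", beam_seq, round(math.exp(beam_score), 2))
```

Greedy decoding commits to "i" because it is the most likely first token, while beam search keeps "that" alive long enough to discover that the continuation "that hurts" has a higher total probability.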

In [None]:
def greedy_decode(model, src, vocab, max_len=50, device='cpu'):
    """Greedy decoding for inference"""
    model.eval()
    
    with torch.no_grad():
        # Encode source
        src = src.to(device)
        src_mask = model.make_src_mask(src)
        encoder_output = model.encode(src, src_mask)
        
        # Start with BOS token
        tgt_indices = [vocab.token2idx[vocab.bos_token]]
        
        for _ in range(max_len):
            tgt = torch.tensor([tgt_indices], dtype=torch.long, device=device)
            tgt_mask = model.make_tgt_mask(tgt)
            
            # Decode
            decoder_output = model.decode(tgt, encoder_output, src_mask, tgt_mask)
            output = model.fc_out(decoder_output)
            
            # Get next token (greedy)
            next_token = output[0, -1, :].argmax().item()
            tgt_indices.append(next_token)
            
            # Stop if EOS token
            if next_token == vocab.token2idx[vocab.eos_token]:
                break
        
        return tgt_indices

def beam_search_decode(model, src, vocab, beam_width=3, max_len=50, device='cpu'):
    """Beam search decoding for inference"""
    model.eval()
    
    with torch.no_grad():
        # Encode source
        src = src.to(device)
        src_mask = model.make_src_mask(src)
        encoder_output = model.encode(src, src_mask)
        
        # Initialize beams: [(sequence, score)]
        beams = [([vocab.token2idx[vocab.bos_token]], 0.0)]
        
        for _ in range(max_len):
            candidates = []
            
            for seq, score in beams:
                # Skip if sequence already ended
                if seq[-1] == vocab.token2idx[vocab.eos_token]:
                    candidates.append((seq, score))
                    continue
                
                tgt = torch.LongTensor([seq]).to(device)
                tgt_mask = model.make_tgt_mask(tgt)
                
                # Decode
                decoder_output = model.decode(tgt, encoder_output, src_mask, tgt_mask)
                output = model.fc_out(decoder_output)
                
                # Get top-k tokens
                log_probs = F.log_softmax(output[0, -1, :], dim=0)
                top_k_probs, top_k_indices = torch.topk(log_probs, beam_width)
                
                # Create new candidates
                for prob, idx in zip(top_k_probs, top_k_indices):
                    new_seq = seq + [idx.item()]
                    new_score = score + prob.item()
                    candidates.append((new_seq, new_score))
            
            # Keep top beam_width candidates
            beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
            
            # Check if all beams ended
            if all(seq[-1] == vocab.token2idx[vocab.eos_token] for seq, _ in beams):
                break
        
        # Return best sequence
        best_seq, best_score = beams[0]
        return best_seq

def generate_response(model, input_text, vocab, device='cpu', method='greedy', beam_width=3):
    """Generate response for given input text"""
    model.eval()
    
    # Encode input
    src_indices = vocab.encode(input_text, add_special_tokens=True)
    src = torch.LongTensor([src_indices])
    
    # Decode
    if method == 'greedy':
        output_indices = greedy_decode(model, src, vocab, device=device)
    elif method == 'beam':
        output_indices = beam_search_decode(model, src, vocab, beam_width=beam_width, device=device)
    else:
        raise ValueError(f"Unknown method: {method}")
    
    # Decode to text
    output_text = vocab.decode(output_indices, skip_special_tokens=True)
    
    return output_text

print("Inference functions defined!")
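
To see why the two decoders above can disagree, here is a minimal sketch on a hand-written toy distribution rather than the trained model. `TOY_MODEL`, `next_log_probs`, `toy_greedy`, and `toy_beam` are hypothetical names for illustration only. Unlike `beam_search_decode`, the toy beam expands every token instead of the top `beam_width`, which is equivalent here because the toy vocabulary is tiny.

```python
import math

BOS, EOS = "<bos>", "<eos>"

# Hypothetical next-token probabilities keyed by the prefix generated so far.
TOY_MODEL = {
    (BOS,): {"a": 0.6, "b": 0.4},
    (BOS, "a"): {"c": 0.55, "d": 0.45},
    (BOS, "b"): {"c": 0.9, "d": 0.1},
}

def next_log_probs(prefix):
    """Return {token: log-probability}; unlisted prefixes always emit EOS."""
    probs = TOY_MODEL.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def toy_greedy(max_len=10):
    """Pick the single most likely token at every step."""
    seq, score = [BOS], 0.0
    for _ in range(max_len):
        log_probs = next_log_probs(seq)
        tok = max(log_probs, key=log_probs.get)
        seq.append(tok)
        score += log_probs[tok]
        if tok == EOS:
            break
    return seq, score

def toy_beam(beam_width=2, max_len=10):
    """Keep the beam_width best partial sequences by summed log-probability."""
    beams = [([BOS], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == EOS:
                candidates.append((seq, score))  # finished beams carry over
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == EOS for seq, _ in beams):
            break
    return beams[0]

greedy_seq, greedy_score = toy_greedy()
beam_seq, beam_score = toy_beam()
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

Greedy commits to `a` because it is the likeliest first token and ends with a sequence probability of 0.33, while the beam keeps `b` alive and finds `b c` at 0.36.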

## 10. Main Training Loop

Train the model, validate after every epoch, and save the best checkpoint based on the BLEU score computed on 50 validation samples.

In [None]:
def evaluate_model(model, dataloader, vocab, device, num_samples=100):
    """Evaluate model and calculate metrics"""
    model.eval()
    
    references = []
    hypotheses = []
    
    count = 0
    for batch in dataloader:
        if count >= num_samples:
            break
        
        src = batch['encoder_input'].to(device)
        tgt = batch['decoder_target'].to(device)
        
        batch_size = src.size(0)
        
        for i in range(batch_size):
            if count >= num_samples:
                break
            
            # Get reference
            ref_indices = tgt[i].cpu().tolist()
            ref_text = vocab.decode(ref_indices, skip_special_tokens=True)
            
            # Generate hypothesis
            src_i = src[i:i+1]
            hyp_indices = greedy_decode(model, src_i, vocab, device=device)
            hyp_text = vocab.decode(hyp_indices, skip_special_tokens=True)
            
            references.append(ref_text)
            hypotheses.append(hyp_text)
            count += 1
    
    # Calculate metrics
    bleu = calculate_bleu(references, hypotheses)
    rouge_l = calculate_rouge_l(references, hypotheses)
    chrf = calculate_chrf(references, hypotheses)
    
    return {
        'bleu': bleu,
        'rouge_l': rouge_l,
        'chrf': chrf,
        'references': references,
        'hypotheses': hypotheses
    }

# Training history
history = {
    'train_loss': [],
    'val_loss': [],
    'val_bleu': [],
    'val_perplexity': []
}

best_bleu = float('-inf')  # guarantees the first epoch writes a checkpoint, even if BLEU is 0
best_model_path = 'best_transformer_model.pt'

print("Starting training...")
print("="*60)

for epoch in range(config.NUM_EPOCHS):
    # Train
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device, config.GRAD_CLIP)
    
    # Validate
    val_loss = evaluate_loss(model, val_loader, criterion, device)
    val_perplexity = calculate_perplexity(val_loss)
    
    # Evaluate on validation set (sample)
    val_metrics = evaluate_model(model, val_loader, vocab, device, num_samples=50)
    val_bleu = val_metrics['bleu']
    
    # Update history
    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    history['val_bleu'].append(val_bleu)
    history['val_perplexity'].append(val_perplexity)
    
    # Print progress
    print(f"Epoch {epoch+1}/{config.NUM_EPOCHS}")
    print(f"  Train Loss: {train_loss:.4f}")
    print(f"  Val Loss: {val_loss:.4f} | Val Perplexity: {val_perplexity:.4f}")
    print(f"  Val BLEU: {val_bleu:.4f}")
    
    # Save best model based on BLEU score
    if val_bleu > best_bleu:
        best_bleu = val_bleu
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'val_bleu': val_bleu,
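            # Store plain values instead of the Config object so the checkpoint
            # can be loaded without the class definition
            'config': {k: getattr(config, k) for k in dir(config) if k.isupper()}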
        }, best_model_path)
        print(f"  ✓ Best model saved! (BLEU: {val_bleu:.4f})")
    
    print("-"*60)

print("\nTraining completed!")
print(f"Best validation BLEU: {best_bleu:.4f}")

## 11. Test Set Evaluation

Evaluate the best checkpoint on the test set: loss and perplexity over the test loader, plus BLEU, ROUGE-L, and chrF on 200 test samples.
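
As a reminder of what the perplexity figure means, the sketch below computes it directly from per-token probabilities. It assumes the loss is the mean negative log-likelihood over reference tokens in nats, which is what `calculate_perplexity` needs in order to be meaningful; whether `criterion` ignores padding or applies label smoothing is set outside this section. The function name `perplexity_from_token_probs` is hypothetical.

```python
import math

def perplexity_from_token_probs(token_probs):
    """Perplexity = exp(mean negative log-probability of the reference tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns 0.25 to every reference token is as uncertain
# as a uniform choice among 4 tokens.
uniform = perplexity_from_token_probs([0.25, 0.25, 0.25])

# Higher probabilities on the reference tokens give a lower perplexity.
confident = perplexity_from_token_probs([0.9, 0.8, 0.95])

print(uniform, confident)
```

Lower is better, and a perplexity close to the vocabulary size would mean the model is doing little better than guessing uniformly.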

In [None]:
# Load best model
checkpoint = torch.load(best_model_path, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
print(f"Loaded best model from epoch {checkpoint['epoch']+1}")

# Evaluate on test set
print("\nEvaluating on test set...")
test_loss = evaluate_loss(model, test_loader, criterion, device)
test_perplexity = calculate_perplexity(test_loss)

test_metrics = evaluate_model(model, test_loader, vocab, device, num_samples=200)

print("\n" + "="*60)
print("TEST SET RESULTS")
print("="*60)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Perplexity: {test_perplexity:.4f}")
print(f"Test BLEU: {test_metrics['bleu']:.4f}")
print(f"Test ROUGE-L: {test_metrics['rouge_l']:.4f}")
print(f"Test chrF: {test_metrics['chrf']:.4f}")
print("="*60)

## 12. Qualitative Examples

Compare greedy and beam-search outputs with the ground-truth responses for the first 10 test examples.

In [None]:
# Show qualitative examples
print("\n" + "="*80)
print("QUALITATIVE EXAMPLES - Model vs Ground Truth")
print("="*80)

num_examples = 10
for i in range(min(num_examples, len(test_X))):
    input_text = test_X[i]
    ground_truth = test_Y[i]
    
    # Generate with greedy decoding
    model_output_greedy = generate_response(model, input_text, vocab, device, method='greedy')
    
    # Generate with beam search
    model_output_beam = generate_response(model, input_text, vocab, device, method='beam', beam_width=3)
    
    print(f"\nExample {i+1}:")
    print(f"Input: {input_text[:150]}...")
    print(f"\nGround Truth: {ground_truth}")
    print(f"Model (Greedy): {model_output_greedy}")
    print(f"Model (Beam-3): {model_output_beam}")
    print("-"*80)

## 13. Interactive Testing

Test the model with custom emotion, situation, and utterance inputs.

In [None]:
# Test with custom inputs
def test_chatbot(emotion, situation, customer_utterance, method='greedy'):
    """Test the chatbot with custom input"""
    input_text = f"Emotion: {emotion} | Situation: {situation} | Customer: {customer_utterance} Agent:"
    response = generate_response(model, input_text, vocab, device, method=method)
    return response

# Example 1: Sentimental
print("="*80)
print("Example 1: Sentimental")
emotion = "sentimental"
situation = "I remember going to the fireworks with my best friend"
customer = "This was a best friend. I miss her."
response = test_chatbot(emotion, situation, customer, method='beam')
print(f"Emotion: {emotion}")
print(f"Situation: {situation}")
print(f"Customer: {customer}")
print(f"Agent: {response}")

print("\n" + "="*80)
print("Example 2: Afraid")
emotion = "afraid"
situation = "I used to scare for darkness"
customer = "it feels like hitting to blank wall when I see the darkness"
response = test_chatbot(emotion, situation, customer, method='beam')
print(f"Emotion: {emotion}")
print(f"Situation: {situation}")
print(f"Customer: {customer}")
print(f"Agent: {response}")

print("\n" + "="*80)
print("Example 3: Joyful")
emotion = "joyful"
situation = "I got promoted at work today"
customer = "I am so happy about this news!"
response = test_chatbot(emotion, situation, customer, method='beam')
print(f"Emotion: {emotion}")
print(f"Situation: {situation}")
print(f"Customer: {customer}")
print(f"Agent: {response}")
print("="*80)

## 14. Save Model and Vocabulary

Save the vocabulary and the final model weights for deployment.

In [None]:
# Save vocabulary
vocab_data = {
    'token2idx': vocab.token2idx,
    'idx2token': vocab.idx2token,
    'pad_token': vocab.pad_token,
    'bos_token': vocab.bos_token,
    'eos_token': vocab.eos_token,
    'unk_token': vocab.unk_token
}

import pickle
with open('vocabulary.pkl', 'wb') as f:
    pickle.dump(vocab_data, f)

print("Vocabulary saved to vocabulary.pkl")

# Save final model
torch.save({
    'model_state_dict': model.state_dict(),
    'config': {
        'vocab_size': len(vocab),
        'd_model': config.EMBEDDING_DIM,
        'num_heads': config.NUM_HEADS,
        'num_encoder_layers': config.NUM_ENCODER_LAYERS,
        'num_decoder_layers': config.NUM_DECODER_LAYERS,
        'd_ff': config.FFN_DIM,
        'dropout': config.DROPOUT,
        'max_seq_len': config.MAX_SEQ_LEN,
        'pad_idx': vocab.token2idx[config.PAD_TOKEN]
    },
    'test_metrics': {
        'bleu': test_metrics['bleu'],
        'rouge_l': test_metrics['rouge_l'],
        'chrf': test_metrics['chrf'],
        'perplexity': test_perplexity
    }
}, 'final_transformer_model.pt')

print("Final model saved to final_transformer_model.pt")
print("\nAll files ready for deployment!")
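
On the deployment side, the vocabulary has to be restored before any text can be encoded. The following is a minimal sketch of that round trip, assuming the dictionary layout saved above. The miniature `vocab_data`, the helpers `load_vocab`, `encode`, and `decode`, and the lowercase-and-split tokenization are all hypothetical stand-ins; the notebook's own `vocab.encode` and `vocab.decode` should be reused if their preprocessing differs.

```python
import os
import pickle
import tempfile

# Hypothetical miniature of the dictionary written to vocabulary.pkl.
vocab_data = {
    "token2idx": {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3, "hello": 4},
    "idx2token": {0: "<pad>", 1: "<bos>", 2: "<eos>", 3: "<unk>", 4: "hello"},
    "pad_token": "<pad>",
    "bos_token": "<bos>",
    "eos_token": "<eos>",
    "unk_token": "<unk>",
}

def load_vocab(path):
    """Load the pickled vocabulary dictionary (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)

def encode(text, vocab):
    """Assumed tokenization: lowercase + whitespace split, wrapped in BOS/EOS."""
    t2i = vocab["token2idx"]
    unk = t2i[vocab["unk_token"]]
    ids = [t2i.get(tok, unk) for tok in text.lower().split()]
    return [t2i[vocab["bos_token"]]] + ids + [t2i[vocab["eos_token"]]]

def decode(ids, vocab):
    """Map indices back to tokens, dropping pad/BOS/EOS markers."""
    special = {vocab[k] for k in ("pad_token", "bos_token", "eos_token")}
    tokens = [vocab["idx2token"][i] for i in ids]
    return " ".join(t for t in tokens if t not in special)

# Round trip through a temporary file instead of the real vocabulary.pkl.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocabulary.pkl")
    with open(path, "wb") as f:
        pickle.dump(vocab_data, f)
    restored = load_vocab(path)

ids = encode("Hello there", restored)
print(ids, decode(ids, restored))
```

Pickle keeps the integer keys of `idx2token` intact. If the vocabulary were stored as JSON instead, those keys would come back as strings and would need converting with `int` after loading.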

## 15. Summary and Next Steps

### What We Built:
✅ **Transformer encoder-decoder from scratch** (no pretrained weights)
- Multi-head attention
- Positional encoding
- Feed-forward networks
- Layer normalization
- Residual connections

✅ **Complete preprocessing pipeline**
- Text normalization (lowercase, whitespace, punctuation)
- Custom vocabulary built from training data only
- Special tokens: `<pad>`, `<bos>`, `<eos>`, `<unk>`
- Train/Val/Test split (80/10/10)

✅ **Training with teacher forcing**
- Adam optimizer (betas=0.9, 0.98)
- Gradient clipping
- Best model saved based on validation BLEU

✅ **Comprehensive evaluation**
- BLEU score
- ROUGE-L
- chrF (simplified: F1 over character 3-grams)
- Perplexity

✅ **Inference strategies**
- Greedy decoding
- Beam search decoding

✅ **Causal masking in decoder** (no future token access)
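
The effect of causal masking can be shown with plain Python lists. This is an illustrative sketch only: the notebook's `make_tgt_mask` is defined outside this section and works on tensors, and the names `causal_mask` and `masked_softmax` below are hypothetical.

```python
import math

def causal_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]

def masked_softmax(scores, mask):
    """Row-wise softmax after setting disallowed positions to -inf."""
    weights = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        peak = max(masked)  # always finite: the diagonal is never masked
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Equal raw scores, so each row spreads attention evenly over allowed positions.
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print(row)
```

Each row gives zero weight to every later position, so the prediction at step `i` depends only on tokens `0..i`.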

### Model Architecture:
- Embedding dimension: 256
- Attention heads: 4
- Encoder layers: 2
- Decoder layers: 2
- Feed-forward dimension: 512
- Dropout: 0.1
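
For a rough sense of scale, the sketch below estimates the parameter count implied by these settings. It assumes the textbook layer layout (four biased `d_model x d_model` projections per attention block, a two-layer feed-forward block, and one LayerNorm per sub-layer) and leaves out the embeddings and the output projection, so the notebook's actual count may differ. All function names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def attention_params(d_model):
    """Query, key, value, and output projections."""
    return 4 * linear_params(d_model, d_model)

def ffn_params(d_model, d_ff):
    """Two linear layers: d_model -> d_ff -> d_model."""
    return linear_params(d_model, d_ff) + linear_params(d_ff, d_model)

def layer_norm_params(d_model):
    """Scale and shift vectors."""
    return 2 * d_model

def encoder_layer_params(d_model, d_ff):
    # self-attention + feed-forward, each followed by a LayerNorm
    return attention_params(d_model) + ffn_params(d_model, d_ff) + 2 * layer_norm_params(d_model)

def decoder_layer_params(d_model, d_ff):
    # self-attention + cross-attention + feed-forward, three LayerNorms
    return 2 * attention_params(d_model) + ffn_params(d_model, d_ff) + 3 * layer_norm_params(d_model)

d_model, num_heads, d_ff = 256, 4, 512
num_encoder_layers = num_decoder_layers = 2

head_dim = d_model // num_heads  # dimension handled by each attention head
total = (num_encoder_layers * encoder_layer_params(d_model, d_ff)
         + num_decoder_layers * decoder_layer_params(d_model, d_ff))
print(head_dim, total)
```

Under these assumptions each head works on 64 dimensions and the four layers hold about 2.6 million parameters before the vocabulary-dependent embedding and output matrices are added.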

### Next Steps for Deployment:

1. **Streamlit/Gradio UI**: Create interactive chatbot interface
2. **Deploy**: Host on Streamlit Cloud or Gradio public link
3. **Evaluation Report**: Document metrics and qualitative examples
4. **Blog Post**: Write Medium article explaining the approach

### Files Generated:
- `best_transformer_model.pt` - Best model checkpoint (based on validation BLEU)
- `final_transformer_model.pt` - Weights of the best checkpoint (reloaded in Section 11), saved with the architecture config and test metrics
- `vocabulary.pkl` - Vocabulary for tokenization

---

**Note**: This notebook implements all requirements from scratch. When running on Kaggle:
- Check that the dataset path is correct (`/kaggle/input/empathetic-dialogues-facebook-ai/`)
- Enable the GPU for faster training
- Adjust `NUM_EPOCHS` and `BATCH_SIZE` to fit the available resources