# Improved DistilBERT + TextRCNN Pipeline
## Advanced Document Classification with Optimized Parameters

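This notebook implements a DistilBERT + TextRCNN architecture for document authenticity classification: given two versions of a text, it predicts which one is real. Compared with earlier iterations, it uses a deeper TextRCNN head (3 LSTM layers, 12 attention heads), a cosine learning-rate schedule with warmup, and progressive unfreezing of the encoder.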

# Model Development Journey: From Traditional ML to Advanced Neural Architectures

## Initial Approach: Traditional Machine Learning
Started with TF-IDF + XGBoost as a baseline. This traditional pipeline used:
- TF-IDF vectorization for text feature extraction
- XGBoost classifier for document authenticity classification
- Basic cross-validation and hyperparameter tuning
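
To make the TF-IDF step concrete, here is a minimal plain-Python sketch of the weighting on a toy corpus. It uses the textbook form tf × log(N / df); libraries such as scikit-learn apply a smoothed variant by default. The function name `tfidf_vectors` and the sample sentences are illustrative assumptions, not part of the original baseline.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus (hypothetical sentences, for illustration only)
docs = [
    "the telescope observed the galaxy",
    "the telescope was invented by pirates",
]
vectors = tfidf_vectors(docs)
```

Terms that occur in every document (here "the" and "telescope") get weight zero, while terms unique to one document carry the signal a downstream classifier can use.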

## First Neural Approach: BERT

Moved to BERT (Bidirectional Encoder Representations from Transformers) as the primary model:
- Standard BERT for contextual text understanding
- Fine-tuning on the document authenticity task
- Pooled output classification with custom classification heads

## Domain-Specific Exploration: SciBERT

Experimented with SciBERT (scientific BERT) to leverage domain-specific knowledge:
- Scientific vocabulary pre-training for better understanding of technical documents
- Specialized tokenization for scientific and engineering terminology
- Full fine-tuning approach with unfrozen BERT layers

## Ensemble and Pseudo-Labeling Breakthrough

Implemented pseudo-labeling techniques that significantly improved results:
- Random Forest pseudo-labeling for fast, reliable expansion of training data
- BERT training on the expanded dataset using high-confidence predictions
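
The selection step of pseudo-labeling can be sketched in plain Python: keep only the unlabeled examples whose predicted class probability clears a confidence threshold, and use the predicted class as the label. The 0.9 threshold, the function name `select_pseudo_labels`, and the sample probabilities are assumptions for illustration; the values used in the original experiments are not recorded in this notebook.

```python
def select_pseudo_labels(texts, real_probs, threshold=0.9):
    """Keep texts whose predicted probability is confidently real or fake.

    real_probs[i] is a model's estimated probability that texts[i] is real.
    """
    selected = []
    for text, p_real in zip(texts, real_probs):
        confidence = max(p_real, 1.0 - p_real)
        if confidence >= threshold:
            selected.append({
                'text': text,
                'label': 1 if p_real >= 0.5 else 0,
                'confidence': confidence,
            })
    return selected


# Hypothetical predictions from a first-stage classifier
texts = ["doc a", "doc b", "doc c"]
real_probs = [0.97, 0.55, 0.04]
pseudo_labeled = select_pseudo_labels(texts, real_probs)
```

Only the confident predictions ("doc a" as real, "doc c" as fake) would be added to the training set; the uncertain "doc b" is left out.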

## Advanced Architecture: TextRCNN

Developed TextRCNN (Text RNN + CNN) architecture combining:
- Bidirectional LSTM for sequential text processing
- Stacked 1D convolutions (kernel sizes 3, 5, and 7) for local feature extraction
- Multi-head attention for capturing complex relationships
- Residual connections with layer normalization
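
A residual connection only works if each convolution preserves the sequence length. The sketch below applies the standard 1D-convolution output-length formula to the kernel/padding pairs used by the `Conv1d` layers in this notebook, and checks that 768 hidden units divide evenly across 12 attention heads. The helper name `conv1d_out_len` is illustrative.

```python
def conv1d_out_len(length, kernel_size, padding, stride=1, dilation=1):
    """Output length of a 1D convolution (the formula documented for Conv1d)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


seq_len = 512  # matches max_length in this notebook
lengths = {
    kernel: conv1d_out_len(seq_len, kernel, padding)
    for kernel, padding in [(3, 1), (5, 2), (7, 3)]
}

# Multi-head attention needs embed_dim to be divisible by num_heads
hidden_size, num_heads = 768, 12
head_dim, remainder = divmod(hidden_size, num_heads)
```

With padding set to `(kernel_size - 1) // 2`, every layer maps a length-512 input to a length-512 output, so the convolution output can be added back to the LSTM output element-wise.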

## Final Optimization: DistilBERT + TextRCNN

Settled on DistilBERT + TextRCNN as the final configuration:
- DistilBERT: a lightweight, distilled model that retains about 97% of BERT's language-understanding performance, as reported by its authors
- Enhanced TextRCNN: 3 LSTM layers, 12 attention heads, improved CNN architecture
- Progressive unfreezing: DistilBERT frozen initially, then unfrozen after 4 epochs



## Setup and Imports

In [32]:
import os
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizer, DistilBertModel, get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
from torch.optim import AdamW
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Delete old model file if it exists
if os.path.exists('best_distilbert_textrcnn_model.pth'):
    os.remove('best_distilbert_textrcnn_model.pth')
    print("Deleted old model file")

# Set device
if torch.backends.mps.is_available():
    device = torch.device('mps')
    print('Using MPS (Metal GPU) device')
elif torch.cuda.is_available():
    device = torch.device('cuda')
    print('Using CUDA device')
else:
    device = torch.device('cpu')
    print('Using CPU device')

print(f'Device: {device}')

Deleted old model file
Using MPS (Metal GPU) device
Device: mps


## Improved TextRCNN Architecture

In [33]:
class ImprovedTextRCNN(nn.Module):
    """TextRCNN head: BiLSTM, stacked Conv1d layers, self-attention, and an MLP classifier."""
    
    def __init__(self, hidden_size=768, num_layers=3, num_classes=2, dropout=0.3):
        super(ImprovedTextRCNN, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_classes = num_classes
        
        # Bidirectional LSTM with more layers
        self.lstm = nn.LSTM(
            input_size=hidden_size,
            hidden_size=hidden_size // 2,  # Bidirectional will double this
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # Stacked CNN layers (kernel sizes 3, 5, 7); matching padding and channel count
        # keep the output shape equal to the LSTM output for the residual connection
        self.conv1 = nn.Conv1d(hidden_size, hidden_size, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_size, hidden_size, kernel_size=5, padding=2)  # Keep hidden_size
        self.conv3 = nn.Conv1d(hidden_size, hidden_size, kernel_size=7, padding=3)  # Keep hidden_size
        
        # Enhanced attention mechanism
        self.attention = nn.MultiheadAttention(
            embed_dim=hidden_size,
            num_heads=12,  # More attention heads
            dropout=dropout,
            batch_first=True
        )
        
        # Improved classification head
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size // 2, num_classes)
        )
        
        # Layer normalization
        self.layer_norm1 = nn.LayerNorm(hidden_size)
        self.layer_norm2 = nn.LayerNorm(hidden_size)
        self.pre_attention_norm = nn.LayerNorm(hidden_size)  # Pre-attention norm
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, bert_outputs, attention_mask=None):
        # bert_outputs: [batch_size, seq_len, hidden_size]
        batch_size, seq_len, hidden_size = bert_outputs.shape
        
        # Bidirectional LSTM
        lstm_out, _ = self.lstm(bert_outputs)
        
        # CNN feature extraction: Conv1d expects [batch_size, channels, seq_len]
        cnn_input = lstm_out.transpose(1, 2)
        
        # Apply the three conv layers in sequence (the residual is added after the last one)
        conv1_out = F.relu(self.conv1(cnn_input))
        conv2_out = F.relu(self.conv2(conv1_out))
        conv3_out = F.relu(self.conv3(conv2_out))
        
        # Transpose back - now cnn_out has same dimensions as lstm_out
        cnn_out = conv3_out.transpose(1, 2)
        
        # Add residual connection - now dimensions match perfectly
        cnn_out = self.layer_norm1(cnn_out + lstm_out)
        
        # Pre-attention normalization
        cnn_out = self.pre_attention_norm(cnn_out)
        
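        # Self-attention; padding positions are masked out as keys when a mask is given
        key_padding_mask = (attention_mask == 0) if attention_mask is not None else None
        attn_out, _ = self.attention(
            cnn_out, cnn_out, cnn_out, key_padding_mask=key_padding_mask
        )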
        
        # Add residual connection
        attn_out = self.layer_norm2(attn_out + cnn_out)
        
        # Masked mean pooling over real (non-padding) tokens
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(-1).to(attn_out.dtype)
            masked_output = attn_out * mask
            # clamp guards against division by zero for an all-padding sequence
            pooled_output = masked_output.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        else:
            pooled_output = attn_out.mean(dim=1)
        
        # Classification
        logits = self.classifier(pooled_output)
        
        return logits
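
The pooling step at the end of `forward` averages only over real tokens, so padding does not dilute the document representation. The sketch below shows the same arithmetic in plain Python for a single sequence, to make the effect of the mask visible; the model itself does this batched on tensors. The function name `masked_mean_pool` and the toy vectors are illustrative.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where attention_mask is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            n_real += 1
            for j, value in enumerate(vector):
                total[j] += value
    n_real = max(n_real, 1)  # guard against an all-padding sequence
    return [value / n_real for value in total]


# One sequence: two real tokens followed by two padding positions
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 0, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
plain_mean = [sum(col) / len(col) for col in zip(*token_vectors)]
```

Here `pooled` reflects only the first two vectors, while `plain_mean` is pulled toward the padding values. With `max_length=512` and short documents, most positions are padding, so the difference can be large.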

## Improved DistilBERT + TextRCNN Classifier

In [34]:
class ImprovedDistilBertTextRCnnClassifier(nn.Module):
    """Improved DistilBERT + TextRCNN classifier."""
    
    def __init__(self, model_name='distilbert-base-uncased', num_classes=2, dropout=0.3):
        super(ImprovedDistilBertTextRCnnClassifier, self).__init__()
        
        # DistilBERT encoder
        self.bert = DistilBertModel.from_pretrained(model_name)
        
        # Improved TextRCNN classifier
        self.textrcnn = ImprovedTextRCNN(
            hidden_size=768,  
            num_layers=3,     
            num_classes=num_classes,
            dropout=dropout
        )
        
        # Freeze BERT initially
        self.freeze_bert()
        
    def freeze_bert(self):
        """Freeze BERT parameters."""
        for param in self.bert.parameters():
            param.requires_grad = False
        print("BERT parameters frozen")
    
    def unfreeze_bert(self):
        """Unfreeze BERT parameters for finetuning."""
        for param in self.bert.parameters():
            param.requires_grad = True
        print("BERT parameters unfrozen for finetuning")
    
    def forward(self, input_ids, attention_mask):
        # Get BERT outputs
        bert_outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        ).last_hidden_state
        
        # Pass through improved TextRCNN
        logits = self.textrcnn(bert_outputs, attention_mask)
        
        return logits

## Custom Dataset

In [35]:
class DocumentDataset(Dataset):
    """Custom dataset for document classification."""
    
    def __init__(self, data, tokenizer, max_length=512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        text = item['text']
        label = item['label']
        
        # Tokenize text
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }
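
With `truncation=True` and `padding='max_length'`, every example comes out exactly `max_length` tokens long, together with an attention mask that marks the real tokens. The sketch below imitates that behavior on plain lists. It is a simplification: the real tokenizer also inserts special tokens and handles them when truncating. The function name `pad_or_truncate` and `pad_id=0` are assumptions for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token ids; real ids come from the tokenizer's vocabulary
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_length=5)
```

Short inputs are padded on the right and long inputs lose their tail, which means any text beyond the first 512 tokens of a document never reaches the model.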

## Data Loading Functions

In [36]:
import requests
from io import StringIO

def load_document_pairs(data_dir):
    """Load document pairs directly from GitHub repository."""
    pairs = []
    
    base_url = "https://raw.githubusercontent.com/v-gapsys/Fake-or-Real-The-Impostor-Hunt-in-Texts/main"
    
    print(f"Loading data from GitHub: {base_url}/{data_dir}")
    
    try:
        
        if data_dir == 'train':
            # Training article ids come from the labels file
            labels_df = pd.read_csv(f"{base_url}/train.csv")
            article_ids = labels_df['id'].tolist()
        else:
            # The test set has no labels file; ids 0..1067 are hard-coded
            article_ids = list(range(1068))
        
        for article_id in article_ids:
            article_folder = f"article_{str(article_id).zfill(4)}"
            
            # Try to load file_1.txt and file_2.txt
            file1_url = f"{base_url}/{data_dir}/{article_folder}/file_1.txt"
            file2_url = f"{base_url}/{data_dir}/{article_folder}/file_2.txt"
            
            try:
                # Load file 1 (the timeout keeps a stalled request from hanging the loop)
                response1 = requests.get(file1_url, timeout=30)
                if response1.status_code != 200:
                    continue
                content1 = response1.text.strip()
                
                # Load file 2
                response2 = requests.get(file2_url, timeout=30)
                if response2.status_code != 200:
                    continue
                content2 = response2.text.strip()
                
                pairs.append({
                    'article_id': article_folder,
                    'file1': 'file_1.txt',
                    'file2': 'file_2.txt',
                    'content1': content1,
                    'content2': content2,
                    'file1_path': file1_url,
                    'file2_path': file2_url
                })
                
            except Exception as e:
                print(f"Error loading article {article_folder}: {e}")
                continue
        
        print(f"Successfully loaded {len(pairs)} document pairs from GitHub '{data_dir}'")
        
    except Exception as e:
        print(f"Error accessing GitHub repository: {e}")
      
    
    return pairs

def create_training_data(pairs, labels_df):
    """Create training data from document pairs and labels."""
    
    if labels_df is None:
        print("Error: No labels dataframe provided")
        return []
    
    print(f"Creating training data with {len(labels_df)} articles...")
    
    training_data = []
    
    for _, row in labels_df.iterrows():
        article_id = row['id']
        real_text_id = row['real_text_id']
        
        article_folder = f"article_{str(article_id).zfill(4)}"
        pair = None
        
        for p in pairs:
            if p['article_id'] == article_folder:
                pair = p
                break
        
        if pair is None:
            print(f"Warning: Could not find pair for article {article_id}")
            continue
        
        if real_text_id == 1:
            real_content = pair['content1']
            fake_content = pair['content2']
        else:
            real_content = pair['content2']
            fake_content = pair['content1']
        
        training_data.append({
            'text': real_content,
            'label': 1,
            'article_id': article_id,
            'text_type': 'real'
        })
        
        training_data.append({
            'text': fake_content,
            'label': 0,
            'article_id': article_id,
            'text_type': 'fake'
        })
    
    print(f"Created {len(training_data)} training examples")
    print(f"    Real documents: {len([x for x in training_data if x['label'] == 1])}")
    print(f"    Fake documents: {len([x for x in training_data if x['label'] == 0])}")
    
    return training_data

## Training Function

In [None]:
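def train_model_improved(model, train_loader, val_loader, num_epochs=15, learning_rate=1e-5):
    """Train with AdamW, a cosine schedule with warmup, gradient clipping, and early stopping.
    
    Saves the best checkpoint by validation accuracy and returns that accuracy.
    """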
    
    print("Training DistilBERT + TextRCNN model...")
    
    # Loss function and optimizer with weight decay
    criterion = nn.CrossEntropyLoss()
    optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8, weight_decay=0.01)
    
    # Improved learning rate scheduler with cosine annealing
    total_steps = len(train_loader) * num_epochs
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.15 * total_steps),  # Increased warmup
        num_training_steps=total_steps
    )
    
    # Training loop
    best_val_acc = 0
    patience = 7  # Increased patience
    patience_counter = 0
    
    for epoch in range(num_epochs):
        # Training phase
        model.train()
        total_loss = 0
        correct = 0
        total = 0
        
        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            optimizer.zero_grad()
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs, labels)
            loss.backward()
            
            # Improved gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            
            optimizer.step()
            scheduler.step()
            
            total_loss += loss.item()
            predicted = outputs.argmax(dim=1)  # predicted class per example
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        
        train_acc = correct / total
        avg_loss = total_loss / len(train_loader)
        
        # Validation phase
        model.eval()
        val_correct = 0
        val_total = 0
        val_loss = 0
        
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)
                
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                loss = criterion(outputs, labels)
                val_loss += loss.item()
                
                predicted = outputs.argmax(dim=1)  # predicted class per example
                val_total += labels.size(0)
                val_correct += (predicted == labels).sum().item()
        
        val_acc = val_correct / val_total if val_total > 0 else 0
        avg_val_loss = val_loss / len(val_loader) if len(val_loader) > 0 else 0
        
        print(f"Epoch {epoch+1}/{num_epochs}:")
        print(f"  Training Loss: {avg_loss:.4f}, Training Acc: {train_acc:.4f}")
        print(f"  Validation Loss: {avg_val_loss:.4f}, Validation Acc: {val_acc:.4f}")
        
        # Early stopping
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            patience_counter = 0
            torch.save(model.state_dict(), 'best_distilbert_textrcnn_model.pth')
            print(f"   New best validation accuracy: {best_val_acc:.4f}")
        else:
            patience_counter += 1
        
        if patience_counter >= patience:
            print(f"    Early stopping after {patience} epochs without improvement")
            break
        
        # Unfreeze BERT after first few epochs for finetuning
        if epoch == 3:  # Increased from 2
            model.unfreeze_bert()
            print("   Unfrozen BERT for finetuning")
    
    print(f"\nBest validation accuracy: {best_val_acc:.4f}")
    print("Training completed!")
    
    return best_val_acc
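
To see what the scheduler in `train_model_improved` does to the learning rate, the sketch below re-implements the multiplier in plain Python: linear warmup followed by a half-cosine decay to zero. To the best of my understanding this mirrors the default behavior of `get_cosine_schedule_with_warmup`, but treat it as an illustration rather than the library's code. The function name `cosine_warmup_multiplier` is hypothetical, and the 10 steps per epoch come from the batch count reported later in this notebook.

```python
import math


def cosine_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warmup, then half-cosine decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


steps_per_epoch, num_epochs = 10, 15  # 10 training batches per epoch
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(0.15 * total_steps)

base_lr = 1e-5
# Learning rate at the start of each epoch
lr_by_epoch = [
    base_lr * cosine_warmup_multiplier(epoch * steps_per_epoch, warmup_steps, total_steps)
    for epoch in range(num_epochs)
]
```

Under these assumptions the warmup ends partway through the third epoch, so by the time the encoder is unfrozen after epoch 4 the learning rate is already past its peak and decaying. This is one reason the encoder sees only small updates during fine-tuning.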

## Prediction Function

In [21]:
def predict_test_set(model, tokenizer, test_pairs, max_length=512):
    """Generate predictions on test set."""
    
    print("Generating test predictions...")
    
    model.eval()
    predictions = []
    
    for i, pair in enumerate(test_pairs):
        article_id = pair['article_id']
        try:
            numeric_id = int(article_id.split('_')[1])
            solution_id = numeric_id
        except (IndexError, ValueError):
            solution_id = i
        
        text1 = pair['content1']
        text2 = pair['content2']
        
        # Tokenize texts
        encoding1 = tokenizer(
            text1,
            truncation=True,
            padding='max_length',
            max_length=max_length,
            return_tensors='pt'
        )
        
        encoding2 = tokenizer(
            text2,
            truncation=True,
            padding='max_length',
            max_length=max_length,
            return_tensors='pt'
        )
        
        # Move to device
        input_ids1 = encoding1['input_ids'].to(device)
        attention_mask1 = encoding1['attention_mask'].to(device)
        input_ids2 = encoding2['input_ids'].to(device)
        attention_mask2 = encoding2['attention_mask'].to(device)
        
        # Get predictions
        with torch.no_grad():
            outputs1 = model(input_ids=input_ids1, attention_mask=attention_mask1)
            outputs2 = model(input_ids=input_ids2, attention_mask=attention_mask2)
            
            probs1 = F.softmax(outputs1, dim=1)
            probs2 = F.softmax(outputs2, dim=1)
            
            pred1 = torch.argmax(outputs1, dim=1).item()
            pred2 = torch.argmax(outputs2, dim=1).item()
            
            real_prob1 = probs1[0][1].item()
            real_prob2 = probs2[0][1].item()
        
        # Determine which file is real
        if pred1 == 1 and pred2 == 0:
            real_text_id = 1
        elif pred1 == 0 and pred2 == 1:
            real_text_id = 2
        else:
            # Use probability
            real_text_id = 1 if real_prob1 > real_prob2 else 2
        
        predictions.append({
            'id': solution_id,
            'real_text_id': real_text_id,
            'text1_pred': pred1,
            'text2_pred': pred2,
            'text1_real_prob': real_prob1,
            'text2_real_prob': real_prob2
        })
        
        if (i + 1) % 100 == 0:
            print(f"Processed {i + 1}/{len(test_pairs)} pairs...")
    
    return predictions
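
The three-branch decision rule in `predict_test_set` reduces to a single comparison of the two "real" probabilities, because a two-class argmax predicts class 1 exactly when that probability exceeds 0.5. The sketch below checks this on a grid of probability pairs; the function names are hypothetical and the grid is only a spot check, not a proof.

```python
def pick_real_text(real_prob1, real_prob2):
    """Return 1 if text 1 is judged more likely real, else 2."""
    return 1 if real_prob1 > real_prob2 else 2


def pick_real_text_verbose(real_prob1, real_prob2):
    """Same decision written like the three-branch rule in predict_test_set."""
    pred1 = 1 if real_prob1 > 0.5 else 0
    pred2 = 1 if real_prob2 > 0.5 else 0
    if pred1 == 1 and pred2 == 0:
        return 1
    if pred1 == 0 and pred2 == 1:
        return 2
    return 1 if real_prob1 > real_prob2 else 2


# Compare both rules on every pair of probabilities in steps of 0.05
grid = [i / 20 for i in range(21)]
rules_agree = all(
    pick_real_text(p1, p2) == pick_real_text_verbose(p1, p2)
    for p1 in grid
    for p2 in grid
)
```

Note that ties go to text 2 in both versions, since the comparison is a strict `>`.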

## Main Pipeline Execution

In [22]:
# 1. Load data
print("Step 1: Loading data...")
labels_df = pd.read_csv('train.csv')
train_pairs = load_document_pairs('train')
test_pairs = load_document_pairs('test')

print(f"Loaded {len(labels_df)} training articles")
print(f"Loaded {len(train_pairs)} training pairs")
print(f"Loaded {len(test_pairs)} test pairs")

Step 1: Loading data...
Loaded 95 training articles
Loaded 95 training pairs
Loaded 1068 test pairs


In [23]:
# 2. Create training data
print("Step 2: Creating training data...")
train_data = create_training_data(train_pairs, labels_df)

Step 2: Creating training data...
Creating training data with 95 articles...
Created 190 training examples
    Real documents: 95
    Fake documents: 95


In [24]:
# 3. Initialize tokenizer and improved model
print("Step 3: Initializing improved DistilBERT + TextRCNN...")
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = ImprovedDistilBertTextRCnnClassifier(
    model_name='distilbert-base-uncased',
    num_classes=2,
    dropout=0.3
).to(device)

print(f"Improved model initialized on {device}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

Step 3: Initializing improved DistilBERT + TextRCNN...
BERT parameters frozen
Improved model initialized on mps
Total parameters: 89,101,442
Trainable parameters: 22,738,562
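
The trainable-parameter count printed above can be reproduced by hand from the layer shapes in `ImprovedTextRCNN`, which is a useful sanity check that only the head is training while the encoder is frozen. The helper names below are hypothetical; the formulas are the usual weight-plus-bias counts for each layer type.

```python
def lstm_params(input_size, hidden_size, num_layers, bidirectional=True):
    """Parameter count of an nn.LSTM-style layer stack (two bias vectors per direction)."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        layer_input = input_size if layer == 0 else hidden_size * directions
        per_direction = 4 * hidden_size * (layer_input + hidden_size) + 2 * 4 * hidden_size
        total += directions * per_direction
    return total


def conv1d_params(in_channels, out_channels, kernel_size):
    return in_channels * out_channels * kernel_size + out_channels


def linear_params(in_features, out_features):
    return in_features * out_features + out_features


h = 768
counts = {
    'lstm': lstm_params(h, h // 2, num_layers=3),
    'conv': sum(conv1d_params(h, h, k) for k in (3, 5, 7)),
    # Multi-head attention: fused q/k/v projection plus output projection
    'attention': linear_params(h, 3 * h) + linear_params(h, h),
    'classifier': linear_params(h, h) + linear_params(h, h // 2) + linear_params(h // 2, 2),
    'layer_norm': 3 * 2 * h,  # weight and bias for each of the three LayerNorms
}
head_total = sum(counts.values())
encoder_total = 89_101_442 - head_total  # remainder belongs to the frozen encoder
```

The head accounts for all 22,738,562 trainable parameters, with the LSTM (about 10.6M) and the convolutions (about 8.8M) making up most of it. That is a large head to fit on 190 training examples.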


In [25]:
# 4. Create datasets and data loaders
print("Step 4: Creating datasets...")

# Wrap all labeled examples in a single dataset
full_dataset = DocumentDataset(train_data, tokenizer)

# 80/20 random split at the document level; the two texts of one article
# can land in different splits
train_size = int(0.8 * len(full_dataset))
val_size = len(full_dataset) - train_size

train_dataset, val_dataset = torch.utils.data.random_split(
    full_dataset, [train_size, val_size]
)

# Create data loaders with larger batch size
batch_size = 16  # Increased from 8
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

print(f"Training batches: {len(train_loader)}")
print(f"Validation batches: {len(val_loader)}")

Step 4: Creating datasets...
Training batches: 10
Validation batches: 3
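
Because each article contributes one real and one fake text, a per-document random split can place the two halves of a pair in different splits. One alternative, sketched here as an assumption rather than what this notebook does, is to split by `article_id` so that both texts of an article stay together. The function name `split_by_article`, the seed, and the toy data are hypothetical.

```python
import random


def split_by_article(examples, val_fraction=0.2, seed=42):
    """Split examples so that all texts sharing an article_id land in the same split."""
    article_ids = sorted({ex['article_id'] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(article_ids)
    n_val = int(len(article_ids) * val_fraction)
    val_ids = set(article_ids[:n_val])
    train = [ex for ex in examples if ex['article_id'] not in val_ids]
    val = [ex for ex in examples if ex['article_id'] in val_ids]
    return train, val


# Toy data shaped like the output of create_training_data (95 articles, 2 texts each)
examples = [
    {'article_id': i, 'label': label, 'text': f"text {i}-{label}"}
    for i in range(95)
    for label in (1, 0)
]
train_examples, val_examples = split_by_article(examples)
```

The resulting lists could be passed to `DocumentDataset` in place of the `random_split` subsets. The split sizes stay the same (152 and 38 examples), so the batch counts above would not change.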


In [26]:
# 5. Train the improved model
print("Step 5: Training improved model...")
try:
    best_val_acc = train_model_improved(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        num_epochs=15,
        learning_rate=1e-5  # Lower learning rate for better stability
    )
    print(f"Training completed successfully! Best validation accuracy: {best_val_acc:.4f}")
except Exception as e:
    print(f"Training failed: {e}")
    best_val_acc = 0.0

Step 5: Training improved model...
Training improved DistilBERT + TextRCNN model...
Epoch 1/15:
  Training Loss: 0.6974, Training Acc: 0.5197
  Validation Loss: 0.7043, Validation Acc: 0.3158
   New best validation accuracy: 0.3158
Epoch 2/15:
  Training Loss: 0.6922, Training Acc: 0.5592
  Validation Loss: 0.6902, Validation Acc: 0.5263
   New best validation accuracy: 0.5263
Epoch 3/15:
  Training Loss: 0.6610, Training Acc: 0.6711
  Validation Loss: 0.6570, Validation Acc: 0.7632
   New best validation accuracy: 0.7632
Epoch 4/15:
  Training Loss: 0.6236, Training Acc: 0.7500
  Validation Loss: 0.6456, Validation Acc: 0.7368
BERT parameters unfrozen for finetuning
   Unfrozen BERT for finetuning
Epoch 5/15:
  Training Loss: 0.6085, Training Acc: 0.6908
  Validation Loss: 0.5871, Validation Acc: 0.7632
Epoch 6/15:
  Training Loss: 0.5256, Training Acc: 0.7829
  Validation Loss: 0.5338, Validation Acc: 0.7895
   New best validation accuracy: 0.7895
Epoch 7/15:
  Training Loss: 0.4281,

In [27]:
# 6. Load best model
print("Step 6: Loading best model")
if os.path.exists('best_distilbert_textrcnn_model.pth'):
    # map_location ensures the checkpoint loads onto the current device
    model.load_state_dict(torch.load('best_distilbert_textrcnn_model.pth', map_location=device))
    print("Best model loaded")
else:
    print("No saved model found, using current model")

Step 6: Loading best model
Best model loaded


In [28]:
# 7. Generate predictions
print("Step 7: Generating test predictions...")
predictions = predict_test_set(model, tokenizer, test_pairs)

Step 7: Generating test predictions...
Generating test predictions...
Processed 100/1068 pairs...
Processed 200/1068 pairs...
Processed 300/1068 pairs...
Processed 400/1068 pairs...
Processed 500/1068 pairs...
Processed 600/1068 pairs...
Processed 700/1068 pairs...
Processed 800/1068 pairs...
Processed 900/1068 pairs...
Processed 1000/1068 pairs...


In [29]:
# 8. Create solution file
solution_df = pd.DataFrame(predictions)
submission_df = solution_df[['id', 'real_text_id']].copy()

submission_df = submission_df.sort_values('id').reset_index(drop=True)
submission_df['id'] = submission_df['id'].astype(int)
submission_df['real_text_id'] = submission_df['real_text_id'].astype(int)

solution_file = 'Hunt_In_Text_Solution_Improved.csv'
submission_df.to_csv(solution_file, index=False)

print(f"Improved solution file saved as: {solution_file}")
print("Pipeline completed successfully!")

Improved solution file saved as: Hunt_In_Text_Solution_Improved.csv
Pipeline completed successfully!
