# Training an Informal-to-Formal Text Transformer

This notebook fine-tunes a T5/FLAN-T5 model on the task of converting informal text to formal text.

## Dataset
This notebook builds a small synthetic dataset of informal-to-formal pairs for demonstration. For real training, the GYAFC (Grammarly's Yahoo Answers Formality Corpus) dataset is a larger alternative.

## Model
- Base model: `google/flan-t5-base` (or `t5-base`)
- Task: Sequence-to-sequence text generation
- Training: 100 epochs with early stopping
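
Early stopping ends training once the validation score stops improving for a set number of evaluations. The following is a minimal, library-free sketch of that rule, not the Trainer's own implementation; the function name `early_stopping_epoch`, the `patience` default, and the BLEU values are all hypothetical and chosen only for illustration.

```python
def early_stopping_epoch(scores, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    `scores` holds one validation score per epoch (higher is better).
    Training stops after `patience` consecutive epochs without an
    improvement larger than `min_delta` over the best score so far.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best + min_delta:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical per-epoch validation BLEU scores, for illustration only
bleu_per_epoch = [10.0, 14.0, 15.5, 15.2, 15.4, 15.1, 15.0]
stop_epoch = early_stopping_epoch(bleu_per_epoch, patience=3)
print(f"Training would stop after epoch {stop_epoch}")
```

With a small dataset, the epoch limit acts as an upper bound and a rule like this usually decides when training actually ends.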

## Metrics
- BLEU score
- ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L)
- Perplexity
- Training/Validation loss
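
Perplexity can be derived from the validation loss, because the loss is the mean per-token cross-entropy (in nats). A minimal sketch, assuming the loss value is read from the `eval_loss` entry that `trainer.evaluate()` returns; the helper name `perplexity_from_loss` and the example loss are hypothetical.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (natural log) into perplexity."""
    return math.exp(mean_cross_entropy)


# Hypothetical loss value, for illustration only
example_loss = 2.0
print(f"Perplexity: {perplexity_from_loss(example_loss):.2f}")
```

A loss of 0 corresponds to a perplexity of 1, the lowest possible value.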

In [None]:
# Install required packages if not already installed
!pip install transformers datasets evaluate torch accelerate tensorboard sentencepiece sacrebleu rouge-score

In [None]:
import os
import json
import numpy as np
import torch
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)
import evaluate
from pathlib import Path

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

## Step 1: Prepare Dataset

We'll create a synthetic dataset of informal-to-formal text pairs. For a real project, you can use:
- GYAFC dataset from: https://github.com/raosudha89/GYAFC-corpus
- Or load from Hugging Face: `jxm/informal_to_formal`
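
If a real corpus is available as two line-aligned text files (one informal sentence per line in one file, the matching formal sentence on the same line of the other), it can be turned into the same list-of-tuples format used below. This is a sketch under that assumption; the file layout, the file names, and the helper `load_parallel_pairs` are hypothetical, so check the actual layout of the corpus you download.

```python
import tempfile
from pathlib import Path


def load_parallel_pairs(informal_path, formal_path):
    """Read two line-aligned files into a list of (informal, formal) tuples."""
    informal_lines = Path(informal_path).read_text(encoding="utf-8").splitlines()
    formal_lines = Path(formal_path).read_text(encoding="utf-8").splitlines()
    if len(informal_lines) != len(formal_lines):
        raise ValueError("Files are not line-aligned: line counts differ")

    pairs = []
    for informal, formal in zip(informal_lines, formal_lines):
        informal, formal = informal.strip(), formal.strip()
        # Skip pairs where either side is empty
        if informal and formal:
            pairs.append((informal, formal))
    return pairs


# Demonstration with two small temporary files (hypothetical content)
with tempfile.TemporaryDirectory() as tmp_dir:
    informal_file = Path(tmp_dir) / "informal.txt"
    formal_file = Path(tmp_dir) / "formal.txt"
    informal_file.write_text("gimme a sec\nthx for the help\n", encoding="utf-8")
    formal_file.write_text(
        "Please allow me a moment.\nThank you for your assistance.\n",
        encoding="utf-8",
    )
    demo_pairs = load_parallel_pairs(informal_file, formal_file)

print(demo_pairs)
```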

In [None]:
# Synthetic dataset for demonstration
# For better results, replace with a real dataset

informal_formal_pairs = [
    # Requests
    ("send me the file now", "Could you please send me the file at your earliest convenience?"),
    ("need this asap", "I would appreciate receiving this as soon as possible."),
    ("gimme a sec", "Please allow me a moment."),
    ("hey can u help?", "Hello, would you be able to assist me?"),
    ("get back to me", "Please respond at your convenience."),
    
    # Questions
    ("why didn't you finish the report?", "I noticed the report is incomplete. Could you please provide an update on its status?"),
    ("what's taking so long?", "May I inquire about the current progress?"),
    ("where's the data?", "Could you please direct me to the location of the data?"),
    ("did u see my email?", "Have you had the opportunity to review my email?"),
    ("when can we meet?", "When would be a convenient time for us to schedule a meeting?"),
    
    # Statements
    ("this is wrong", "I believe there may be an error in this."),
    ("that won't work", "I'm concerned that approach may not be effective."),
    ("u made a mistake", "It appears there may have been an oversight."),
    ("i'm busy right now", "I am currently occupied with other tasks."),
    ("can't do it today", "Unfortunately, I will not be able to complete this today."),
    
    # Feedback
    ("good job", "Excellent work on this project."),
    ("looks ok to me", "This appears to be satisfactory."),
    ("not bad", "This is quite good."),
    ("yeah that's fine", "Yes, that would be acceptable."),
    ("nah i don't think so", "I respectfully disagree with that assessment."),
    
    # More examples for better training
    ("fix it now", "Please address this issue at your earliest convenience."),
    ("what's up with that?", "Could you please clarify the situation?"),
    ("i dunno", "I am uncertain about that."),
    ("nope", "No, thank you."),
    ("yup", "Yes, certainly."),
    ("call me", "Please contact me at your convenience."),
    ("let's do this", "Shall we proceed with this plan?"),
    ("i'm not gonna do that", "I will not be able to proceed with that request."),
    ("whatever", "I understand your perspective."),
    ("who cares", "That may not be a primary concern."),
    
    # Extended set for better training
    ("got it", "I understand completely."),
    ("my bad", "I apologize for that error."),
    ("no way", "I find that difficult to believe."),
    ("for sure", "Absolutely, I agree."),
    ("hang on", "Please wait a moment."),
    ("lemme know", "Please inform me when you have an update."),
    ("gonna be late", "I will be arriving later than expected."),
    ("can't make it", "Unfortunately, I will be unable to attend."),
    ("what's the deal?", "What is the current situation?"),
    ("cut it out", "Please discontinue that behavior."),
    ("knock it off", "Please cease that action."),
    ("chill out", "Please remain calm."),
    ("no biggie", "That is not a significant concern."),
    ("kinda busy", "I am somewhat occupied at the moment."),
    ("sorta like that", "It is somewhat similar to that."),
    ("tons of work", "I have a substantial amount of work."),
    ("super important", "This is extremely important."),
    ("really cool", "This is quite impressive."),
    ("that sucks", "That is unfortunate."),
    ("awesome job", "You have done exceptional work."),
]

# Extend the base list with extra pairs (abbreviations and chat shorthand)
extended_pairs = informal_formal_pairs.copy()

# Add some variations
variations = [
    ("pls send the file", "Please send the file at your convenience."),
    ("need help with this", "I would appreciate assistance with this matter."),
    ("can u check this?", "Could you please review this?"),
    ("talk later", "Let us continue this conversation at a later time."),
    ("brb", "I will return shortly."),
    ("idk what to do", "I am uncertain about how to proceed."),
    ("thx for the help", "Thank you for your assistance."),
    ("np", "You are welcome."),
    ("gtg", "I must leave now."),
    ("omg that's bad", "That is quite concerning."),
]

extended_pairs.extend(variations)

# Split into train/validation/test
np.random.seed(42)
indices = np.random.permutation(len(extended_pairs))
train_size = int(0.7 * len(extended_pairs))
val_size = int(0.15 * len(extended_pairs))

train_pairs = [extended_pairs[i] for i in indices[:train_size]]
val_pairs = [extended_pairs[i] for i in indices[train_size:train_size+val_size]]
test_pairs = [extended_pairs[i] for i in indices[train_size+val_size:]]

print(f"Training samples: {len(train_pairs)}")
print(f"Validation samples: {len(val_pairs)}")
print(f"Test samples: {len(test_pairs)}")

# Example pairs
print("\nExample training pairs:")
for i in range(3):
    print(f"Informal: {train_pairs[i][0]}")
    print(f"Formal: {train_pairs[i][1]}")
    print()

In [None]:
# Create Hugging Face datasets
def create_dataset(pairs):
    return Dataset.from_dict({
        "informal": [p[0] for p in pairs],
        "formal": [p[1] for p in pairs]
    })

dataset = DatasetDict({
    "train": create_dataset(train_pairs),
    "validation": create_dataset(val_pairs),
    "test": create_dataset(test_pairs)
})

print(dataset)

## Step 2: Load Model and Tokenizer

In [None]:
# Model configuration
MODEL_NAME = "google/flan-t5-base"  # or "t5-base" or "google/flan-t5-small" for faster training
OUTPUT_DIR = "./models/informal-to-formal-t5"
MAX_INPUT_LENGTH = 128
MAX_TARGET_LENGTH = 128

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

print(f"Model: {MODEL_NAME}")
print(f"Model parameters: {model.num_parameters():,}")

## Step 3: Preprocessing
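
Label padding affects the loss. PyTorch's cross-entropy loss ignores positions whose label is -100, and `DataCollatorForSeq2Seq` pads labels with -100 by default, so padded positions should carry that value rather than the tokenizer's pad token ID. The sketch below shows the idea with NumPy only; the helper `pad_labels` and the token IDs are hypothetical.

```python
import numpy as np

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def pad_labels(label_ids):
    """Right-pad variable-length label sequences with IGNORE_INDEX."""
    max_len = max(len(ids) for ids in label_ids)
    padded = np.full((len(label_ids), max_len), IGNORE_INDEX, dtype=np.int64)
    for row, ids in enumerate(label_ids):
        padded[row, : len(ids)] = ids
    return padded


# Hypothetical token IDs for two target sentences of different lengths
batch = pad_labels([[71, 12, 1], [55, 1]])
print(batch)

# Fraction of positions that actually contribute to the loss
print((batch != IGNORE_INDEX).mean())
```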

In [None]:
# Preprocessing function
def preprocess_function(examples):
    # Add task prefix for T5
    inputs = [f"Convert informal to formal: {text}" for text in examples["informal"]]
    targets = examples["formal"]
    
    # Tokenize inputs. No padding here: DataCollatorForSeq2Seq pads each batch
    # dynamically, which avoids padding every example to MAX_INPUT_LENGTH.
    model_inputs = tokenizer(
        inputs,
        max_length=MAX_INPUT_LENGTH,
        truncation=True,
    )
    
    # Tokenize targets. Leaving them unpadded lets the collator pad labels with
    # -100, so padding positions are ignored by the loss.
    labels = tokenizer(
        targets,
        max_length=MAX_TARGET_LENGTH,
        truncation=True,
    )
    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply preprocessing
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["informal", "formal"]
)

print("Tokenized dataset:")
print(tokenized_dataset)

## Step 4: Setup Training
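
Before choosing warmup and logging intervals, it helps to estimate how many optimizer steps the run will take. The sketch below approximates the count for single-device training with gradient accumulation. The exact rounding inside the Trainer may differ between library versions, and the helper `optimizer_steps` and the sample count of 42 are assumptions for illustration.

```python
import math


def optimizer_steps(num_samples, batch_size, grad_accum, epochs):
    """Estimate optimizer steps per epoch and in total (single device)."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    # One optimizer step happens every `grad_accum` batches
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch, steps_per_epoch * epochs


# Assumed values: 42 training pairs, batch size 8, 2 accumulation steps, 100 epochs
steps_per_epoch, total_steps = optimizer_steps(42, 8, 2, 100)
print(f"Steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

With so few steps per epoch, a warmup of 100 steps covers a large share of the run, so the warmup length is worth revisiting when the dataset is this small.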

In [None]:
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

# Load metrics
bleu_metric = evaluate.load("sacrebleu")
rouge_metric = evaluate.load("rouge")

def compute_metrics(eval_preds):
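    predictions, labels = eval_preds
    # Some models return a tuple; the generated token IDs come first
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    
    # Depending on the transformers version, predictions may be padded with -100,
    # which is not a valid token ID, so replace it before decoding
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in labels (used for padding)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)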
    
    # Clean up text
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    
    # Compute BLEU
    bleu_result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)
    
    # Compute ROUGE
    decoded_labels_flat = [label[0] for label in decoded_labels]
    rouge_result = rouge_metric.compute(
        predictions=decoded_preds,
        references=decoded_labels_flat
    )
    
    return {
        "bleu": bleu_result["score"],
        "rouge1": rouge_result["rouge1"],
        "rouge2": rouge_result["rouge2"],
        "rougeL": rouge_result["rougeL"],
    }

In [None]:
# Training arguments: up to 100 epochs
training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    evaluation_strategy="epoch",  # renamed to eval_strategy in recent transformers releases
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=100,  # upper bound on the number of epochs
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),  # Mixed precision on GPU; T5 models can overflow in fp16, so try bf16=True instead if the loss becomes NaN
    logging_dir=f"{OUTPUT_DIR}/logs",
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="bleu",
    greater_is_better=True,
    warmup_steps=100,
    gradient_accumulation_steps=2,
    report_to="tensorboard",
)

print("Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  FP16: {training_args.fp16}")

In [None]:
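# Initialize trainer
from transformers import EarlyStoppingCallback

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    # Stop when validation BLEU has not improved for 5 consecutive evaluations.
    # The patience value is an assumption; tune it for your dataset.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)

print("Trainer initialized successfully!")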

## Step 5: Train Model (100 Epochs)

In [None]:
# Train the model
print("Starting training...")
train_result = trainer.train()

# Save the final model
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print("\nTraining completed!")
print(f"Final training loss: {train_result.training_loss:.4f}")
print(f"Model saved to: {OUTPUT_DIR}")

## Step 6: Evaluate Model

In [None]:
# Evaluate on test set
print("Evaluating on test set...")
test_results = trainer.evaluate(tokenized_dataset["test"])

print("\nTest Results:")
for key, value in test_results.items():
    print(f"  {key}: {value:.4f}")

# Save results
with open(f"{OUTPUT_DIR}/test_results.json", "w") as f:
    json.dump(test_results, f, indent=2)

print(f"\nResults saved to {OUTPUT_DIR}/test_results.json")

## Step 7: Test Inference

In [None]:
# Load the trained model for inference
from transformers import pipeline

# Create text generation pipeline
generator = pipeline(
    "text2text-generation",
    model=OUTPUT_DIR,
    tokenizer=OUTPUT_DIR,
    device=0 if torch.cuda.is_available() else -1
)

# Test examples
test_examples = [
    "send me the file now",
    "why didn't you finish the report?",
    "need this asap",
    "what's taking so long?",
    "good job on this",
]

print("\n" + "="*60)
print("INFERENCE EXAMPLES")
print("="*60)

for informal_text in test_examples:
    prompt = f"Convert informal to formal: {informal_text}"
    result = generator(prompt, max_length=128, num_beams=4)
    formal_text = result[0]['generated_text']
    
    print(f"\nInformal: {informal_text}")
    print(f"Formal:   {formal_text}")
    print("-" * 60)

## Step 8: Save Training Metadata

In [None]:
# Save training metadata
metadata = {
    "model_name": MODEL_NAME,
    "output_dir": OUTPUT_DIR,
    "training_samples": len(train_pairs),
    "validation_samples": len(val_pairs),
    "test_samples": len(test_pairs),
    "epochs": training_args.num_train_epochs,
    "batch_size": training_args.per_device_train_batch_size,
    "learning_rate": training_args.learning_rate,
    "final_train_loss": float(train_result.training_loss),
    "test_results": test_results,
    "timestamp": str(np.datetime64('now')),
}

with open(f"{OUTPUT_DIR}/training_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

print("\nTraining metadata saved!")
print(json.dumps(metadata, indent=2))

## Optional: View TensorBoard Logs

To view training progress in TensorBoard:

```bash
tensorboard --logdir ./models/informal-to-formal-t5/logs
```

In [None]:
print("\n" + "="*60)
print("TRAINING COMPLETE!")
print("="*60)
print(f"\nModel saved to: {OUTPUT_DIR}")
print("To use in production, update backend/model.py:")
print(f"DEFAULT_MODEL = '{OUTPUT_DIR}'")
print("\nTo view training logs:")
print(f"tensorboard --logdir {OUTPUT_DIR}/logs")