# dLNk GPT - Enhanced Training Workflow

This notebook implements a comprehensive training workflow with:
- ✅ **Early Stopping** - Prevents overfitting
- ✅ **Learning Rate Scheduling** - Optimizes training
- ✅ **Real-time Monitoring** - TensorBoard integration
- ✅ **Quality Assurance** - Automated testing each epoch
- ✅ **Resource Monitoring** - GPU/Memory tracking

**Requirements:**
- GPU Runtime (T4 recommended, A100 for faster training)
- Hugging Face Token
- 12-16 hours training time (T4)

**Steps:**
1. Enable GPU: Runtime → Change runtime type → GPU
2. Run all cells in order
3. Monitor with TensorBoard

## 1. Environment Setup

In [None]:
# Check GPU availability
!nvidia-smi
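
The same check can be done from Python, which is handy for aborting early when no GPU is attached. The sketch below uses only the standard library and assumes `nvidia-smi` is on the `PATH`; the helper name `gpu_summary` is hypothetical and not part of the repository.

```python
import shutil
import subprocess

def gpu_summary():
    """Return one 'name, total memory, used memory' string per GPU; empty list if none found."""
    if shutil.which("nvidia-smi") is None:
        return []  # NVIDIA driver tools are not installed or not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,memory.used", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # the tool is present but could not talk to a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = gpu_summary()
print(gpus if gpus else "No GPU detected; check the runtime type")
```

Re-running this cell during training gives a rough view of memory use over time.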

In [None]:
# Install required packages
# Version specifiers are quoted so the shell does not treat ">=" as an output redirect
!pip install -q "transformers>=4.30.0" "datasets>=2.12.0" "accelerate>=0.20.0" "peft>=0.4.0" bitsandbytes tensorboard
print("✅ All packages installed")
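
Because the success message prints even when `pip` fails, it may help to confirm what actually got installed. This is a minimal sketch using `importlib.metadata` from the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = ["transformers", "datasets", "accelerate", "peft", "bitsandbytes", "tensorboard"]

def installed_versions(packages):
    """Map each package name to its installed version string, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```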

## 2. Login to Hugging Face

In [None]:
from huggingface_hub import login

# Enter your Hugging Face token here
HF_TOKEN = ""  # Paste your token between the quotes

if not HF_TOKEN:
    print("⚠️  Please enter your Hugging Face token above")
else:
    login(token=HF_TOKEN)
    print("✅ Logged in to Hugging Face")
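
Pasting a token into a cell leaves it in the saved notebook. One alternative, sketched here under the assumption that the token may be set in an environment variable, is to read it from the environment and otherwise prompt without echoing. The helper `resolve_token` and its parameters are hypothetical.

```python
import os
from getpass import getpass

def resolve_token(env_var="HF_TOKEN", prompt_fn=getpass):
    """Return the token from the environment, or prompt for it without echoing input."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        token = prompt_fn("Hugging Face token: ").strip()
    return token

# Usage in place of the hard-coded string above:
# HF_TOKEN = resolve_token()
```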

## 3. Clone Repository and Setup Files

In [None]:
# Clone the repository
!git clone https://github.com/traingptproject/gptprojecttrain.git
%cd gptprojecttrain
!ls -la

## 4. Configure Training (Optional)

In [None]:
# View the first 50 lines of the current configuration
!head -n 50 training_config.py

print("\n" + "="*80)
print("You can modify training_config.py to adjust hyperparameters")
print("="*80)

## 5. Launch TensorBoard

In [None]:
# Load TensorBoard extension
%load_ext tensorboard

# Launch TensorBoard
%tensorboard --logdir ./logs --port 6006

print("\n✅ TensorBoard launched!")
print("📊 You can view training metrics in real-time above")

## 6. Start Training

**This will take 12-16 hours on a T4 GPU.**

The training includes:
- Early stopping (stops if validation loss doesn't improve for 3 epochs)
- Learning rate scheduling (cosine schedule with warmup)
- Quality assurance tests after each epoch
- Automatic checkpoint management
- Real-time metrics logging to TensorBoard
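
To make the first two items concrete, here is a minimal sketch of early stopping with a patience of 3 epochs and a cosine learning-rate schedule with linear warmup. It illustrates the general techniques only; `train_enhanced.py` is not shown here, so the class name, function name, and all numeric values are assumptions rather than the script's actual implementation.

```python
import math

class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def cosine_lr_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr, then cosine decay from base_lr to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Illustrative losses only, not taken from a real run
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate([1.20, 1.05, 1.07, 1.06, 1.08], start=1):
    if stopper.step(loss):
        print(f"Stopping after epoch {epoch}; best validation loss {stopper.best_loss:.2f}")
        break
```

The warmup phase avoids large updates while optimizer statistics are still noisy, and the cosine decay lowers the rate smoothly toward the end of training.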

In [None]:
# Start training
!python train_enhanced.py

## 7. View Training Results

In [None]:
# View metrics history
import json

with open('./training_output/metrics_history.json', 'r') as f:
    metrics = json.load(f)

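print(f"Total training steps: {len(metrics)}")
# Guard against an empty history, e.g. when training stopped before the first log
if metrics:
    print("\nFinal metrics:")
    print(json.dumps(metrics[-1], indent=2))
else:
    print("\nNo metrics were recorded")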
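
The final entry is not necessarily the best one when early stopping is in play. The sketch below picks the entry with the lowest validation loss. It assumes the history is a list of dicts and that validation entries carry an `eval_loss` field; that key name is an assumption (it is the name the Hugging Face `Trainer` uses), and the real file may use a different one.

```python
def best_entry(history, key="eval_loss"):
    """Return the entry with the lowest numeric value for `key`, or None if no entry has it."""
    candidates = [
        entry for entry in history
        if isinstance(entry, dict) and isinstance(entry.get(key), (int, float))
    ]
    return min(candidates, key=lambda entry: entry[key]) if candidates else None

# Synthetic records for illustration; the real file's fields may differ
sample = [
    {"step": 100, "loss": 1.9},
    {"step": 200, "eval_loss": 1.42},
    {"step": 400, "eval_loss": 1.31},
    {"step": 600, "eval_loss": 1.35},
]
print(best_entry(sample))
```

With the real data, call `best_entry(metrics)` after loading the JSON file.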

In [None]:
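# View QA test results from the last epoch
import os
import json

qa_dir = './training_output/qa_results'
# Pick the newest file by modification time; a plain name sort can misorder
# epochs when the numbers in the file names are not zero-padded
qa_files = [os.path.join(qa_dir, name) for name in os.listdir(qa_dir)]
latest_qa = max(qa_files, key=os.path.getmtime)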

with open(latest_qa, 'r') as f:
    qa_results = json.load(f)

print(f"QA Results from Epoch {qa_results['epoch']}")
print("="*80)

for i, test in enumerate(qa_results['tests'], 1):
    print(f"\n[Test {i}]")
    print(f"Prompt: {test['prompt']}")
    print(f"Response: {test['response'][:200]}...")
    print(f"Time: {test['generation_time']:.2f}s")
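
Beyond reading individual responses, a short aggregate can show whether generation slowed down or produced empty output. This sketch relies on the `tests`, `response`, and `generation_time` fields used in the cell above; the helper name `summarize_qa` and the sample data are hypothetical.

```python
def summarize_qa(qa_results):
    """Aggregate generation times and list the 1-based indices of empty responses."""
    tests = qa_results.get("tests", [])
    times = [test["generation_time"] for test in tests]
    empty = [i for i, test in enumerate(tests, 1) if not test["response"].strip()]
    return {
        "num_tests": len(tests),
        "mean_time_s": sum(times) / len(times) if times else 0.0,
        "max_time_s": max(times, default=0.0),
        "empty_responses": empty,
    }

# Synthetic example in the same shape as the QA result files
sample_qa = {
    "epoch": 2,
    "tests": [
        {"prompt": "a", "response": "hello", "generation_time": 1.0},
        {"prompt": "b", "response": "  ", "generation_time": 3.0},
    ],
}
print(summarize_qa(sample_qa))
```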

## 8. Test the Trained Model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

print("Loading trained model...")

model_path = "./training_output/final_model"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("✅ Model loaded!\n")

def generate_response(prompt, max_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test
test_prompt = "Write a Python function to calculate fibonacci numbers:"
print(f"Prompt: {test_prompt}")
print("="*80)
response = generate_response(test_prompt)
print(response)
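
For a causal language model, the decoded sequence typically begins with the prompt itself, so the printed response repeats it. If only the completion is wanted, a small helper along these lines can trim it. The name `completion_only` is hypothetical, and the prefix check is a simplification: it leaves the text untouched when decoding has changed the prompt's exact characters.

```python
def completion_only(full_text, prompt):
    """Drop the echoed prompt when it appears verbatim at the start of the decoded text."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].lstrip()
    return full_text  # prompt not found verbatim; return the text unchanged

echoed = "Write a haiku: Autumn leaves falling"
print(completion_only(echoed, "Write a haiku:"))
```

In the cell above this would be used as `completion_only(generate_response(test_prompt), test_prompt)`.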

## 9. Push to Hugging Face Hub (Optional)

If you didn't enable `push_to_hub` in the config, you can manually push the model here.

In [None]:
# Push model to Hub
from huggingface_hub import HfApi

model_path = "./training_output/final_model"
repo_id = "dlnkgpt/dlnkgpt-uncensored"  # Change to your repo

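api = HfApi()
# Create the repo if it does not exist yet; upload_folder needs an existing repo
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_folder(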
    folder_path=model_path,
    repo_id=repo_id,
    repo_type="model",
)

print(f"✅ Model pushed to: https://huggingface.co/{repo_id}")

## Summary

### Training Complete! 🎉

**What was accomplished:**
- ✅ Model trained with early stopping
- ✅ Validation loss monitored throughout
- ✅ Quality assurance tests run each epoch
- ✅ Best checkpoint automatically selected
- ✅ Comprehensive metrics logged

**Next Steps:**
1. Review TensorBoard metrics
2. Check QA test results
3. Test the model with your own prompts
4. Deploy to production or push to Hub

**Files Created:**
- `./training_output/final_model/` - Best model checkpoint
- `./training_output/metrics_history.json` - All training metrics
- `./training_output/qa_results/` - Quality assurance test results
- `./logs/` - TensorBoard logs
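
A quick way to confirm these outputs exist before deploying or pushing is sketched below, using the paths listed above. The helper name `missing_outputs` is hypothetical, and the check assumes the notebook's working directory is still the cloned repository.

```python
from pathlib import Path

EXPECTED_OUTPUTS = [
    "./training_output/final_model",
    "./training_output/metrics_history.json",
    "./training_output/qa_results",
    "./logs",
]

def missing_outputs(paths, root="."):
    """Return the subset of `paths` that does not exist under `root`."""
    return [p for p in paths if not (Path(root) / p).exists()]

missing = missing_outputs(EXPECTED_OUTPUTS)
print("All expected outputs found" if not missing else f"Missing: {missing}")
```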