# Deep Debugging: Why is CNN Training Failing?

## The Problem

**Observations:**
- Simple linear regression: **R² = +0.20** ✅
- CNN with ColorJitter: **R² = -1.25** ❌
- CNN without ColorJitter: **R² = -1.99** ❌ (WORSE!)

**Critical insight:** Removing ColorJitter made it WORSE, not better!

This suggests the problem is NOT ColorJitter. Something more fundamental is broken.
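
To make the negative scores concrete: R² compares a model's squared error against that of a constant predictor that always outputs the mean target. The sketch below uses made-up target values and a hypothetical `r2` helper (neither is part of this notebook's pipeline). It shows that the mean predictor scores exactly 0 and that predictions stuck near zero score far below it.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # model's squared error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

# Made-up biomass values in grams (illustrative only).
y = np.array([10.0, 20.0, 30.0, 40.0])

r2_mean = r2(y, np.full_like(y, y.mean()))  # always predict the mean
r2_zero = r2(y, np.zeros_like(y))           # predictions stuck at zero

print(f"Mean baseline R2: {r2_mean:+.2f}")
print(f"All-zero predictions R2: {r2_zero:+.2f}")
```

For a single target, R² = -1.25 would mean a squared error 2.25 times that of the mean baseline.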

## Investigation Plan

1. **Data Loading**: Are images and targets loading correctly?
2. **Normalization**: Is ImageNet normalization appropriate?
3. **Model Forward Pass**: Are predictions in the right range?
4. **Loss Calculation**: Is the loss computed correctly?
5. **Gradient Flow**: Are gradients flowing through the network?
6. **Weight Updates**: Do the weights change after an optimizer step?
7. **Overfitting Test**: Can the model memorize a single batch?
8. **Target Scale**: Are target values on the right scale?

Let's systematically check each component.

---
## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
import torchvision.models as models
from PIL import Image

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
from tqdm.auto import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

np.random.seed(42)
torch.manual_seed(42)

print("✓ Imports complete")

In [None]:
# Load data
train_enriched = pd.read_csv('competition/train_enriched.csv')
train_enriched['Sampling_Date'] = pd.to_datetime(train_enriched['Sampling_Date'])
train_enriched['full_image_path'] = train_enriched['image_path'].apply(lambda x: f'competition/{x}')

target_cols = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']
competition_weights = [0.1, 0.1, 0.1, 0.2, 0.5]

train_data, val_data = train_test_split(train_enriched, test_size=0.2, random_state=42)

print(f"Data loaded: {len(train_data)} train, {len(val_data)} val")
print(f"Targets: {target_cols}")

---
## Test 1: Inspect Single Batch - Data Loading

In [None]:
# Create simple dataset (no normalization first)
class DebugDataset(Dataset):
    def __init__(self, dataframe, normalize=False):
        self.df = dataframe.reset_index(drop=True)
        self.normalize = normalize
        
        if normalize:
            # ImageNet normalization
            self.transform = transforms.Compose([
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
            ])
        else:
            # No normalization - just convert to tensor
            self.transform = transforms.Compose([
                transforms.Resize((224, 224)),
                transforms.ToTensor()
            ])
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(row['full_image_path']).convert('RGB')
        img = self.transform(img)
        targets = torch.tensor(row[target_cols].values.astype('float32'), dtype=torch.float32)
        return {'image': img, 'targets': targets, 'path': row['full_image_path']}

# Create dataset WITHOUT normalization
debug_dataset = DebugDataset(train_data, normalize=False)
debug_loader = DataLoader(debug_dataset, batch_size=16, shuffle=False)

# Get first batch
batch = next(iter(debug_loader))
images = batch['image']
targets = batch['targets']

print("="*80)
print("BATCH INSPECTION (No Normalization)")
print("="*80)

print(f"\nBatch shape: {images.shape}")
print(f"Targets shape: {targets.shape}")

print(f"\nImage statistics (should be [0, 1] after ToTensor):")
print(f"  Min: {images.min().item():.4f}")
print(f"  Max: {images.max().item():.4f}")
print(f"  Mean: {images.mean().item():.4f}")
print(f"  Std: {images.std().item():.4f}")

print(f"\nTarget statistics (biomass in grams):")
for i, col in enumerate(target_cols):
    print(f"  {col}:")
    print(f"    Min: {targets[:, i].min().item():.2f}g")
    print(f"    Max: {targets[:, i].max().item():.2f}g")
    print(f"    Mean: {targets[:, i].mean().item():.2f}g")
    print(f"    Std: {targets[:, i].std().item():.2f}g")

In [None]:
# Check for NaN or Inf in images and targets
print("\nData Quality Checks:")
print(f"  NaN in images: {torch.isnan(images).any().item()}")
print(f"  Inf in images: {torch.isinf(images).any().item()}")
print(f"  NaN in targets: {torch.isnan(targets).any().item()}")
print(f"  Inf in targets: {torch.isinf(targets).any().item()}")

# Flag the batch if either tensor contains a non-finite value
if not (torch.isfinite(images).all() and torch.isfinite(targets).all()):
    print("\n⚠️  WARNING: Found NaN or Inf in images or targets!")
else:
    print("\n✓ No NaN or Inf values detected")

In [None]:
# Visualize first image from batch
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Show first image
img_to_show = images[0].permute(1, 2, 0).numpy()
axes[0].imshow(img_to_show)
axes[0].set_title('First Image (No Normalization)', fontweight='bold')
axes[0].axis('off')

# Show target values
target_vals = targets[0].numpy()
axes[1].bar(range(5), target_vals, color=['green', 'brown', 'lightgreen', 'blue', 'purple'])
axes[1].set_xticks(range(5))
axes[1].set_xticklabels(['Green', 'Dead', 'Clover', 'GDM', 'Total'], rotation=45)
axes[1].set_ylabel('Biomass (g)')
axes[1].set_title('Target Values for First Image', fontweight='bold')
axes[1].grid(alpha=0.3)

# Show RGB histogram
for c, color in enumerate(['red', 'green', 'blue']):
    hist = images[0, c].flatten().numpy()
    axes[2].hist(hist, bins=50, alpha=0.5, label=color.upper(), color=color)
axes[2].set_xlabel('Pixel Value')
axes[2].set_ylabel('Count')
axes[2].set_title('RGB Histogram (First Image)', fontweight='bold')
axes[2].legend()
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("✓ Batch visualization complete")

---
## Test 2: Compare ImageNet vs Custom Normalization

In [None]:
# Calculate actual mean/std of training images
print("Calculating actual image statistics from training set...")
print("This may take 2-3 minutes...\n")

# Sample 100 images to calculate stats
sample_size = min(100, len(train_data))
sample_indices = np.random.choice(len(train_data), sample_size, replace=False)

all_pixels = []
for idx in tqdm(sample_indices):
    img_path = train_data.iloc[idx]['full_image_path']
    img = Image.open(img_path).convert('RGB')
    img = img.resize((224, 224))
    img_array = np.array(img) / 255.0  # Normalize to [0, 1]
    all_pixels.append(img_array.reshape(-1, 3))

all_pixels = np.vstack(all_pixels)
actual_mean = all_pixels.mean(axis=0)
actual_std = all_pixels.std(axis=0)

print("\n" + "="*80)
print("NORMALIZATION COMPARISON")
print("="*80)

print(f"\nImageNet normalization (what we're using):")
print(f"  Mean: [0.485, 0.456, 0.406]")
print(f"  Std:  [0.229, 0.224, 0.225]")

print(f"\nActual training data statistics:")
print(f"  Mean: [{actual_mean[0]:.3f}, {actual_mean[1]:.3f}, {actual_mean[2]:.3f}]")
print(f"  Std:  [{actual_std[0]:.3f}, {actual_std[1]:.3f}, {actual_std[2]:.3f}]")

# Calculate difference
imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_std = np.array([0.229, 0.224, 0.225])

mean_diff = np.abs(actual_mean - imagenet_mean)
std_diff = np.abs(actual_std - imagenet_std)

print(f"\nDifference:")
print(f"  Mean diff: [{mean_diff[0]:.3f}, {mean_diff[1]:.3f}, {mean_diff[2]:.3f}]")
print(f"  Std diff:  [{std_diff[0]:.3f}, {std_diff[1]:.3f}, {std_diff[2]:.3f}]")

if np.max(mean_diff) > 0.1 or np.max(std_diff) > 0.05:
    print("\n⚠️  WARNING: Large difference between ImageNet and actual statistics!")
    print("   ImageNet normalization might be inappropriate for this data.")
    print("   Recommendation: Use custom normalization based on actual data.")
else:
    print("\n✓ ImageNet normalization appears reasonable for this data.")
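
If the comparison above does favor custom statistics, per-channel normalization is just `(x - mean) / std` applied channel by channel. The sketch below is a minimal NumPy illustration on made-up random images (the array `fake_images` is an assumption, not the competition data). It checks that the normalized channels end up with mean 0 and standard deviation 1.

```python
import numpy as np

# Made-up stand-in for a stack of RGB images scaled to [0, 1], shape (N, H, W, 3).
rng = np.random.default_rng(42)
fake_images = rng.uniform(0.0, 1.0, size=(8, 32, 32, 3))

# Per-channel statistics over all images and all pixels.
channel_mean = fake_images.mean(axis=(0, 1, 2))
channel_std = fake_images.std(axis=(0, 1, 2))

# Broadcasting over the last axis applies (x - mean) / std to each channel.
normalized = (fake_images - channel_mean) / channel_std

check_mean = normalized.mean(axis=(0, 1, 2))
check_std = normalized.std(axis=(0, 1, 2))
print("Normalized channel means:", np.round(check_mean, 6))
print("Normalized channel stds: ", np.round(check_std, 6))
```

In this notebook the equivalent step would presumably be passing `actual_mean.tolist()` and `actual_std.tolist()` to `transforms.Normalize` in place of the ImageNet constants. Note that the pretrained ResNet weights were fitted with ImageNet statistics, so switching is a judgment call rather than an automatic win.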

---
## Test 3: Model Forward Pass - Prediction Range

In [None]:
# Create simple model
class SimpleModel(nn.Module):
    def __init__(self, num_outputs=5):
        super().__init__()
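        # `pretrained=True` is deprecated in torchvision >= 0.13; this requests the same ImageNet weights
        self.resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)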
        num_features = self.resnet.fc.in_features
        self.resnet.fc = nn.Sequential(
            nn.Linear(num_features, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Dropout(0.2),
            nn.Linear(256, num_outputs)
        )
    
    def forward(self, x):
        return self.resnet(x)

model = SimpleModel(num_outputs=5).to(device)
model.eval()

print("✓ Model created")

In [None]:
# Test with normalized images
debug_dataset_norm = DebugDataset(train_data, normalize=True)
debug_loader_norm = DataLoader(debug_dataset_norm, batch_size=16, shuffle=False)
batch_norm = next(iter(debug_loader_norm))

images_norm = batch_norm['image'].to(device)
targets_batch = batch_norm['targets']

# Forward pass
with torch.no_grad():
    predictions = model(images_norm)

predictions_cpu = predictions.cpu()

print("="*80)
print("MODEL FORWARD PASS ANALYSIS")
print("="*80)

print(f"\nPrediction statistics (BEFORE training):")
print(f"  Shape: {predictions_cpu.shape}")
print(f"  Min: {predictions_cpu.min().item():.4f}")
print(f"  Max: {predictions_cpu.max().item():.4f}")
print(f"  Mean: {predictions_cpu.mean().item():.4f}")
print(f"  Std: {predictions_cpu.std().item():.4f}")

print(f"\nPer-target predictions:")
for i, col in enumerate(target_cols):
    pred_mean = predictions_cpu[:, i].mean().item()
    pred_std = predictions_cpu[:, i].std().item()
    target_mean = targets_batch[:, i].mean().item()
    target_std = targets_batch[:, i].std().item()
    
    print(f"\n  {col}:")
    print(f"    Pred:   {pred_mean:8.2f} ± {pred_std:6.2f}")
    print(f"    Target: {target_mean:8.2f} ± {target_std:6.2f}")
    print(f"    Diff:   {abs(pred_mean - target_mean):8.2f}")

print("\n" + "="*80)
print("INTERPRETATION")
print("="*80)

pred_range = predictions_cpu.max().item() - predictions_cpu.min().item()
target_range = targets_batch.max().item() - targets_batch.min().item()

print(f"\nPrediction range: {pred_range:.2f}")
print(f"Target range: {target_range:.2f}")

if abs(predictions_cpu.mean().item()) < 1.0:
    print("\n⚠️  NOTE: Predictions are very close to zero.")
    print("   This is expected for a freshly initialized output layer.")
    print("   Targets are on a much larger scale (roughly 0-200 g), so the")
    print("   initial loss will be large; that alone is not a bug.")
elif pred_range < 10:
    print("\n⚠️  WARNING: Predictions have very small range!")
    print("   Model is outputting nearly identical values for all samples.")
    print("   This is common before training; after training it would mean")
    print("   the model has not learned to differentiate samples.")
else:
    print("\n✓ Prediction range looks reasonable for untrained model.")
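
The gap between near-zero predictions and gram-scale targets also explains why the initial loss is large. For all-zero predictions the MSE equals the target variance plus the squared target mean. The sketch below checks that identity on made-up target values (an assumption for illustration, not the competition data).

```python
import numpy as np

# Made-up biomass targets in grams (illustrative only).
y = np.array([20.0, 35.0, 50.0, 80.0, 115.0])

zero_pred = np.zeros_like(y)                 # roughly what a fresh output layer produces
mse_zero = np.mean((y - zero_pred) ** 2)     # error of all-zero predictions
mse_mean = np.mean((y - y.mean()) ** 2)      # error of predicting the mean (the variance)

# Identity: MSE of all-zero predictions = variance + mean^2
decomposed = y.var() + y.mean() ** 2

print(f"MSE of zero predictions: {mse_zero:.1f}")
print(f"MSE of mean predictions: {mse_mean:.1f}")
print(f"variance + mean^2:       {decomposed:.1f}")
```

With targets averaging tens of grams, the squared-mean term alone puts the starting MSE in the thousands, so a large first loss value is not by itself evidence of a bug.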

---
## Test 4: Loss Calculation - Manual Verification

In [None]:
# Test loss function
class CompetitionLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = torch.tensor([0.1, 0.1, 0.1, 0.2, 0.5]).to(device)
    
    def forward(self, pred, target):
        mse = F.mse_loss(pred, target, reduction='none')
        weighted_mse = (mse * self.weights).mean()
        return weighted_mse

criterion = CompetitionLoss()

# Calculate loss
loss = criterion(predictions, targets_batch.to(device))

print("="*80)
print("LOSS CALCULATION ANALYSIS")
print("="*80)

print(f"\nCompetition loss: {loss.item():.4f}")

# Manual calculation per target
print(f"\nPer-target losses:")
for i, col in enumerate(target_cols):
    pred_i = predictions_cpu[:, i]
    target_i = targets_batch[:, i]
    
    # Manual MSE
    mse_i = ((pred_i - target_i) ** 2).mean().item()
    weighted_mse_i = mse_i * competition_weights[i]
    
    print(f"\n  {col}:")
    print(f"    MSE: {mse_i:.2f}")
    print(f"    Weighted MSE: {weighted_mse_i:.2f} (weight: {competition_weights[i]})")

# Compare plain MSE
plain_mse = F.mse_loss(predictions, targets_batch.to(device))
print(f"\nComparison:")
print(f"  Competition loss (weighted): {loss.item():.4f}")
print(f"  Plain MSE (unweighted): {plain_mse.item():.4f}")

print("\n" + "="*80)
print("INTERPRETATION")
print("="*80)

if loss.item() > 10000:
    print("\n⚠️  WARNING: Loss is extremely high!")
    print("   This suggests predictions are very far from targets.")
    print("   Expected loss for random initialization: 1000-5000")
elif loss.item() < 100:
    print("\n⚠️  WARNING: Loss is very low for untrained model!")
    print("   This is suspicious. Check if loss is calculating correctly.")
else:
    print(f"\n✓ Loss value ({loss.item():.2f}) is in expected range for untrained model.")
    print("  Expected: 1000-5000 for random predictions on biomass data.")
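
One detail to keep in mind when comparing the numbers above: `CompetitionLoss` calls `.mean()` over both the batch and the five targets, so the reported loss is one fifth of the sum of the per-target weighted MSEs printed by the manual check. The sketch below reproduces both reductions in NumPy on made-up random data (the arrays are assumptions, not notebook outputs).

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.normal(0.0, 1.0, size=(16, 5))        # made-up predictions
target = rng.uniform(0.0, 100.0, size=(16, 5))   # made-up targets
weights = np.array([0.1, 0.1, 0.1, 0.2, 0.5])

sq_err = (pred - target) ** 2

# Same reduction as CompetitionLoss: mean over batch AND targets.
loss_mean_all = (sq_err * weights).mean()

# Sum of the per-target weighted MSEs, as in the manual per-target loop.
loss_weighted_sum = (sq_err.mean(axis=0) * weights).sum()

ratio = loss_weighted_sum / loss_mean_all
print(f"mean-over-everything loss: {loss_mean_all:.2f}")
print(f"sum of weighted MSEs:      {loss_weighted_sum:.2f}")
print(f"ratio:                     {ratio:.2f}")
```

The constant factor only rescales the loss (and so the effective step size for plain SGD), but it matters when checking the loss value against hand-computed per-target numbers.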

---
## Test 5: Gradient Flow Check

In [None]:
# Test gradient flow
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Forward pass
predictions = model(images_norm)
loss = criterion(predictions, targets_batch.to(device))

# Backward pass
optimizer.zero_grad()
loss.backward()

print("="*80)
print("GRADIENT FLOW ANALYSIS")
print("="*80)

# Check gradients
print(f"\nGradient statistics for each layer:")
total_grad_norm = 0
zero_grad_layers = 0

for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        total_grad_norm += grad_norm
        
        # Only show FC layers (ResNet conv layers would be too many)
        if 'fc' in name:
            print(f"  {name:40s}: {grad_norm:.6f}")
        
        if grad_norm < 1e-8:
            zero_grad_layers += 1
    else:
        print(f"  {name:40s}: No gradient!")

print(f"\nSum of per-parameter gradient norms: {total_grad_norm:.4f}")
print(f"Parameters with (near-)zero gradient: {zero_grad_layers}")

print("\n" + "="*80)
print("INTERPRETATION")
print("="*80)

if total_grad_norm < 1e-6:
    print("\n⚠️  WARNING: Gradients are extremely small or zero!")
    print("   Possible causes:")
    print("     - Vanishing gradient problem")
    print("     - Loss not connected to model parameters")
    print("     - Saturated or dead activations (e.g. ReLU units stuck at zero)")
elif total_grad_norm > 1000:
    print("\n⚠️  WARNING: Gradients are very large!")
    print("   Possible causes:")
    print("     - Large loss values from unscaled targets")
    print("     - Unstable loss function")
    print("   Recommendation: Use gradient clipping, scale targets, or lower learning rate")
else:
    print(f"\n✓ Gradient norms are in reasonable range ({total_grad_norm:.4f}).")
    print("  Gradients are flowing through the network.")
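
Note that the cell above adds up the per-parameter gradient norms, which is not the same as the L2 norm of all gradients taken together (the quantity that `torch.nn.utils.clip_grad_norm_` works with). The sketch below contrasts the two on made-up gradient values in plain Python. The lists stand in for flattened `param.grad` tensors and are assumptions for illustration.

```python
import math

# Made-up per-parameter gradients (flattened), standing in for param.grad tensors.
grads = [[3.0, 4.0], [0.0, 12.0]]

# L2 norm of each parameter's gradient.
per_param_norms = [math.sqrt(sum(g * g for g in grad)) for grad in grads]

# Quantity accumulated in the gradient-flow cell: a plain sum of norms.
sum_of_norms = sum(per_param_norms)

# Global L2 norm: square root of the sum of squared per-parameter norms.
global_norm = math.sqrt(sum(n * n for n in per_param_norms))

print(f"Per-parameter norms: {per_param_norms}")
print(f"Sum of norms:        {sum_of_norms}")
print(f"Global L2 norm:      {global_norm}")
```

The sum of norms is always at least as large as the global norm and grows with the number of parameter tensors, so thresholds tuned for one quantity do not transfer directly to the other.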

---
## Test 6: Weight Update Check

In [None]:
# Store weight before update
fc_weight_before = model.resnet.fc[0].weight.data.clone()

# Take optimizer step
optimizer.step()

# Check weight after update
fc_weight_after = model.resnet.fc[0].weight.data
weight_change = (fc_weight_after - fc_weight_before).abs().mean().item()

print("="*80)
print("WEIGHT UPDATE ANALYSIS")
print("="*80)

print(f"\nFC layer (first layer):")
print(f"  Weight before update: mean={fc_weight_before.mean().item():.6f}, std={fc_weight_before.std().item():.6f}")
print(f"  Weight after update:  mean={fc_weight_after.mean().item():.6f}, std={fc_weight_after.std().item():.6f}")
print(f"  Mean absolute change: {weight_change:.8f}")

print("\n" + "="*80)
print("INTERPRETATION")
print("="*80)

if weight_change < 1e-8:
    print("\n⚠️  WARNING: Weights are not changing!")
    print("   Possible causes:")
    print("     - Learning rate too low (current: 1e-4)")
    print("     - Gradients too small")
    print("     - Optimizer issue")
    print("   Recommendation: Increase learning rate to 1e-3 or 3e-4")
elif weight_change > 0.1:
    print("\n⚠️  WARNING: Weights changing too much in one step!")
    print("   Learning rate might be too high.")
    print("   Recommendation: Lower learning rate")
else:
    print(f"\n✓ Weights are updating normally ({weight_change:.8f} change per step).")
    print("  Optimizer is working correctly.")

---
## Test 7: Overfit Single Batch (Critical Test)

In [None]:
print("="*80)
print("OVERFITTING TEST: Can model memorize a single batch?")
print("="*80)
print("\nThis test trains on just 1 batch for 100 steps.")
print("A working model should achieve near-zero loss.")
print("If it can't, the model architecture is broken.\n")

# Create fresh model
test_model = SimpleModel(num_outputs=5).to(device)
test_optimizer = torch.optim.AdamW(test_model.parameters(), lr=1e-3)  # Higher LR for faster overfitting
test_criterion = CompetitionLoss()

# Get one batch
test_batch = next(iter(debug_loader_norm))
test_images = test_batch['image'].to(device)
test_targets = test_batch['targets'].to(device)

# Train on this batch for 100 steps
losses = []
r2_scores_over_time = []

test_model.train()
for step in range(100):
    # Forward
    pred = test_model(test_images)
    loss = test_criterion(pred, test_targets)
    
    # Backward
    test_optimizer.zero_grad()
    loss.backward()
    test_optimizer.step()
    
    # Track
    losses.append(loss.item())
    
    # Calculate R² every 10 steps
    if step % 10 == 0:
        # pred was created with autograd enabled, so detach it before converting to NumPy
        pred_np = pred.detach().cpu().numpy()
        target_np = test_targets.cpu().numpy()
        r2_total = sum(competition_weights[i] * r2_score(target_np[:, i], pred_np[:, i]) for i in range(5))
        r2_scores_over_time.append(r2_total)
        print(f"Step {step:3d}: Loss = {loss.item():.4f}, R² = {r2_total:+.4f}")

# Final evaluation
test_model.eval()
with torch.no_grad():
    final_pred = test_model(test_images)
    final_loss = test_criterion(final_pred, test_targets)
    
    pred_np = final_pred.cpu().numpy()
    target_np = test_targets.cpu().numpy()
    final_r2 = sum([competition_weights[i] * r2_score(target_np[:, i], pred_np[:, i]) for i in range(5)])

print(f"\nFinal results after 100 steps:")
print(f"  Loss: {final_loss.item():.4f}")
print(f"  R²: {final_r2:+.4f}")

print("\n" + "="*80)
print("INTERPRETATION")
print("="*80)

if final_r2 > 0.9:
    print("\n✅ SUCCESS: Model can overfit a single batch!")
    print("   This proves the model architecture is working.")
    print("   The problem is likely:")
    print("     - Learning rate too low for full training")
    print("     - Need more epochs")
    print("     - Regularization too strong (dropout, weight decay)")
elif final_r2 > 0.5:
    print("\n⚠️  PARTIAL SUCCESS: Model is learning but slowly")
    print("   Model architecture works but might not be optimal.")
    print("   Recommendations:")
    print("     - Increase learning rate")
    print("     - Simplify architecture (remove BatchNorm/Dropout)")
elif final_r2 > 0.0:
    print("\n⚠️  POOR PERFORMANCE: Model learning very slowly")
    print("   Issues:")
    print("     - Learning rate too low")
    print("     - Model architecture might be problematic")
    print("     - Check normalization")
else:
    print("\n❌ FAILURE: Model CANNOT overfit a single batch!")
    print("   This is a critical failure. The model architecture is broken.")
    print("   Possible causes:")
    print("     - Wrong input/output shapes")
    print("     - Loss function not connected properly")
    print("     - Severe numerical issues (NaN/Inf)")
    print("     - Model too simple for the task")

In [None]:
# Plot overfitting progress
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
ax = axes[0]
ax.plot(losses, linewidth=2)
ax.set_xlabel('Step', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Loss During Overfitting Test (100 steps)', fontsize=14, fontweight='bold')
ax.grid(alpha=0.3)

# R² curve
ax = axes[1]
ax.plot(range(0, 100, 10), r2_scores_over_time, 'o-', linewidth=2, markersize=8)
ax.axhline(y=0.0, color='gray', linestyle='--', label='Baseline')
ax.axhline(y=0.9, color='green', linestyle='--', label='Target (R²=0.9)')
ax.set_xlabel('Step', fontsize=12)
ax.set_ylabel('R² Score', fontsize=12)
ax.set_title('R² During Overfitting Test', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.savefig('overfit_test.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Overfitting test complete")

---
## Test 8: Target Scale Analysis

In [None]:
# Analyze target value scales across dataset
print("="*80)
print("TARGET SCALE ANALYSIS")
print("="*80)

for col in target_cols:
    values = train_data[col].values
    print(f"\n{col}:")
    print(f"  Min: {values.min():.2f}g")
    print(f"  Max: {values.max():.2f}g")
    print(f"  Range: {values.max() - values.min():.2f}g")
    print(f"  Mean: {values.mean():.2f}g")
    print(f"  Std: {values.std():.2f}g")
    print(f"  Coefficient of variation: {values.std() / values.mean():.2f}")

print("\n" + "="*80)
print("INTERPRETATION")
print("="*80)

print("\nDifferent targets have very different scales:")
print("  - Dry_Clover_g often ~0-20g (low)")
print("  - Dry_Total_g typically 20-150g (high)")
print("\nThis scale difference might make training harder.")
print("\nOptions:")
print("  1. Normalize targets (StandardScaler) - Model predicts normalized, denormalize for eval")
print("  2. Use per-target learning rates (not easy in PyTorch)")
print("  3. Use different loss weights (already doing this)")
print("  4. Keep as-is but use longer training")
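
Option 1 can be sketched in a few lines. The example below is a minimal NumPy version under stated assumptions: the target array is made up, and `standardize` / `destandardize` are hypothetical helper names. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses. The statistics come from the training split only and are reused to map predictions back to grams.

```python
import numpy as np

# Made-up training targets, shape (n_samples, n_targets); columns on different scales.
train_targets = np.array([
    [2.0, 40.0],
    [5.0, 90.0],
    [0.0, 25.0],
    [12.0, 130.0],
])

# Fit the statistics on the training split only.
t_mean = train_targets.mean(axis=0)
t_std = train_targets.std(axis=0)

def standardize(t):
    """Map targets in grams to zero-mean, unit-variance values."""
    return (t - t_mean) / t_std

def destandardize(z):
    """Map standardized predictions back to grams for evaluation."""
    return z * t_std + t_mean

z = standardize(train_targets)
restored = destandardize(z)

print("Standardized column means:", np.round(z.mean(axis=0), 6))
print("Standardized column stds: ", np.round(z.std(axis=0), 6))
```

In training, the model would then regress `z` instead of raw grams, and `destandardize` would be applied to its outputs before computing metrics in the original units.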

---
## Summary: Root Cause Analysis

In [None]:
print("="*80)
print("DEBUGGING SUMMARY")
print("="*80)

print("\n📋 TESTS COMPLETED:")
print("  1. ✓ Data loading inspection")
print("  2. ✓ ImageNet vs custom normalization")
print("  3. ✓ Model forward pass")
print("  4. ✓ Loss calculation")
print("  5. ✓ Gradient flow")
print("  6. ✓ Weight updates")
print("  7. ✓ Overfit single batch test (CRITICAL)")
print("  8. ✓ Target scale analysis")

print("\n" + "="*80)
print("KEY FINDINGS")
print("="*80)

print("\nReview the results above to identify:")
print("\n1. Can model overfit single batch?")
print("   - If YES: Architecture works, need better training setup")
print("   - If NO: Architecture broken, need to fix model")

print("\n2. Are gradients flowing?")
print("   - Check gradient norm (should be 1-100 range)")
print("   - If too small: Vanishing gradients or loss not connected to parameters")
print("   - If too large: Exploding gradients; consider clipping or scaling targets")

print("\n3. Is ImageNet normalization appropriate?")
print("   - Check difference between ImageNet and actual data stats")
print("   - If large difference: Use custom normalization")

print("\n4. Are predictions in right range?")
print("   - Should be roughly 0-200g like targets")
print("   - If very different scale: Initialization or architecture issue")

print("\n" + "="*80)
print("RECOMMENDED NEXT STEPS")
print("="*80)

print("\nBased on the findings above, prioritize:")
print("\n1. If model CAN overfit single batch:")
print("   → Increase learning rate to 3e-4 or 5e-4")
print("   → Train for 15-20 epochs instead of 5")
print("   → Remove or reduce dropout/weight decay")

print("\n2. If gradients are very small:")
print("   → Increase learning rate by 10×")
print("   → Check if BatchNorm is causing issues")
print("   → Try simpler architecture without BatchNorm")

print("\n3. If ImageNet normalization is very different:")
print("   → Use custom normalization based on actual data")
print("   → Retrain with correct normalization")

print("\n4. If model CANNOT overfit single batch:")
print("   → Major architecture problem")
print("   → Try even simpler model (linear layer on features)")
print("   → Check for implementation bugs")

print("\n" + "="*80)
print("✓ Debugging complete!")
print("="*80)