# TabNet Embedding Dimension Test (FIXED)

**Goal:** Verify that TabNet is producing proper compressed embeddings

**Expected:** 24 dimensions (sweet spot for learned representations)

**NOT:** 150 dimensions (raw features) or 1 dimension (predictions only)

**Runtime:** ~1-2 minutes on Colab GPU

---

## What We Fixed:
- ❌ Old method: got 150-dim output from `forward_masks` (one value per input feature, so no compression)
- ✅ New method: get 24-48 dim output from `encoder` (compressed learned embeddings)

## Step 1: Install Dependencies

In [None]:
!pip install -q pytorch-tabnet lightgbm scikit-learn pandas numpy torch
print("[OK] All dependencies installed")

## Step 2: Check GPU Availability

In [None]:
import torch

if torch.cuda.is_available():
    print(f"[OK] GPU available: {torch.cuda.get_device_name(0)}")
    USE_GPU = True
else:
    print("[WARNING] No GPU found, using CPU (will be slower)")
    USE_GPU = False

## Step 3: Upload FIXED neural_hybrid.py

**IMPORTANT:** Make sure you upload the LATEST version with the fixed `_get_embeddings()` method!

**Click the folder icon on the left → Upload → Select `neural_hybrid.py`**

Then run this cell to verify:

In [None]:
import os

if os.path.exists('neural_hybrid.py'):
    print("[OK] neural_hybrid.py found")
    
    # Check if it has the fix
    with open('neural_hybrid.py', 'r') as f:
        content = f.read()
        if 'steps_output[:, -1, :].cpu().numpy()' in content:
            print("[OK] File contains the FIXED embedding extraction method")
        else:
            print("[WARNING] File might be the OLD version - make sure you uploaded the latest!")
else:
    print("[ERROR] neural_hybrid.py NOT found")
    print("   Please upload it using the file browser on the left")
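
The same check can be wrapped in a small reusable function, which is convenient if the file gets re-uploaded several times. This is a minimal sketch: the name `has_fixed_extraction` is hypothetical, and the marker string is simply the one the cell above searches for. A match only shows that the string is present in the file, not that the method behaves correctly.

```python
from pathlib import Path

# Marker string taken from the verification cell above.
FIX_MARKER = "steps_output[:, -1, :].cpu().numpy()"


def has_fixed_extraction(path="neural_hybrid.py", marker=FIX_MARKER):
    """Return True/False if the file exists, or None if it is missing."""
    p = Path(path)
    if not p.is_file():
        return None
    return marker in p.read_text(encoding="utf-8")
```

Returning `None` for a missing file keeps "not uploaded" distinct from "uploaded, but the old version".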

## Step 4: Create Test Data

In [None]:
import numpy as np
import pandas as pd
from neural_hybrid import NeuralHybridPredictor

print("=" * 70)
print("TESTING TABNET EMBEDDING EXTRACTION")
print("=" * 70)

# Create dummy data (similar to your NBA stats)
np.random.seed(42)  # For reproducibility
n_samples = 500
n_features = 150  # Similar to your 150+ features

X_train = pd.DataFrame(np.random.randn(n_samples, n_features))
y_train = pd.Series(np.random.randn(n_samples) * 5 + 15)  # Simulate points/assists

X_val = pd.DataFrame(np.random.randn(100, n_features))
y_val = pd.Series(np.random.randn(100) * 5 + 15)

X_test = pd.DataFrame(np.random.randn(20, n_features))

print(f"\n[OK] Created test data:")
print(f"  - Train: {n_samples} samples, {n_features} features")
print(f"  - Val: {len(X_val)} samples")
print(f"  - Test: {len(X_test)} samples")

## Step 5: Initialize Model

This shows the TabNet configuration:
- **n_d = 24** (dimension for each decision step)
- **n_a = 24** (attention dimension)
- **n_steps = 4** (number of sequential decision steps)

Expected embedding dimension: **24-48** (compressed learned representation)

In [None]:
# Initialize model
model = NeuralHybridPredictor(prop_name='test_points', use_gpu=USE_GPU)

print(f"\n[OK] Initialized NeuralHybridPredictor")
print(f"  - n_d (dimension): {model.tabnet_params['n_d']}")
print(f"  - n_a (attention): {model.tabnet_params['n_a']}")
print(f"  - n_steps: {model.tabnet_params['n_steps']}")
print(f"  - Expected embedding size: {model.tabnet_params['n_d']} to {model.tabnet_params['n_d'] + model.tabnet_params['n_a']}")
print(f"  - Device: {'GPU' if USE_GPU else 'CPU'}")
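
The expected range printed above follows directly from the configuration: the lower bound is `n_d` and the upper bound is `n_d + n_a`. A small sketch of that arithmetic, using a hypothetical helper name `expected_embedding_range`:

```python
def expected_embedding_range(n_d, n_a):
    """Return (low, high) expected embedding widths for a TabNet config."""
    if n_d <= 0 or n_a <= 0:
        raise ValueError("n_d and n_a must be positive")
    # Low: decision output only (n_d). High: decision + attention (n_d + n_a).
    return n_d, n_d + n_a


low, high = expected_embedding_range(n_d=24, n_a=24)
print(f"Expected embedding size: {low} to {high}")  # Expected embedding size: 24 to 48
```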

## Step 6: Train TabNet (1 Epoch Only)

**Watch for the line:**
```
  - Embedding dimension: XX
```

This line is printed during Step 2 of the training output (the model's own training stage, not Step 2 of this notebook) and shows the actual dimension of the embeddings.

In [None]:
# Train with 1 epoch for quick test
print(f"\n{'='*70}")
print("TRAINING TABNET (1 EPOCH)")
print(f"{'='*70}")
print(f"This will take ~30-60 seconds on GPU, ~2-3 minutes on CPU\n")

model.fit(X_train, y_train, X_val, y_val, epochs=1, batch_size=256)

print(f"\n[OK] Model trained successfully")

## Step 7: Extract Embeddings and Check Dimension

**👀 WATCH FOR:**

The line that says:
```
[INFO] Extracted XX-dim embeddings from encoder (last step)
```

**What the dimension means:**
- **24 dimensions** ✅ Perfect! Compressed learned embeddings (n_d)
- **48 dimensions** ✅ Good! Full encoder output (n_d + n_a)
- **8-48 dimensions** ✅ Acceptable range for embeddings
- **150 dimensions** ❌ BAD - Same width as the input features, no compression
- **1-2 dimensions** ❌ BAD - Fallback to predictions only

In [None]:
print(f"\n{'='*70}")
print("EXTRACTING EMBEDDINGS FROM TEST DATA")
print(f"{'='*70}")
print("\n👀 Watch for [INFO] or [WARNING] messages\n")

try:
    embeddings = model._get_embeddings(X_test.values.astype(np.float32))
    
    print(f"\n{'='*70}")
    print(f"RESULTS:")
    print(f"{'='*70}")
    print(f"Embedding shape: {embeddings.shape}")
    print(f"Expected shape: (20, 24-48)")
    print(f"\n⭐ ACTUAL DIMENSION: {embeddings.shape[1]}")
    
    # Analyze results
    dim = embeddings.shape[1]
    
    if 8 <= dim <= 48:
        print(f"\n✅ SUCCESS! Embeddings are {dim}-dimensional!")
        print(f"\nThis is in the optimal range (8-48 dimensions).")
        print(f"Your TabNet is producing compressed learned representations.")
        
        if dim == 24:
            print(f"\n🎯 PERFECT! 24 dimensions is the sweet spot!")
        elif dim == 48:
            print(f"\n👍 GOOD! 48 dimensions (n_d + n_a) captures full encoder output.")
        
        print(f"\nSample embedding (first 5 values of row 0):")
        print(embeddings[0, :5])
        print(f"\nEmbedding statistics:")
        print(f"  - Mean: {embeddings.mean():.4f}")
        print(f"  - Std: {embeddings.std():.4f}")
        print(f"  - Min: {embeddings.min():.4f}")
        print(f"  - Max: {embeddings.max():.4f}")
        
    elif dim == n_features:  # n_features (150) was defined in Step 4
        print(f"\n❌ PROBLEM! Embeddings are {dim}-dimensional")
        print(f"\nThis equals your input feature count.")
        print(f"TabNet is NOT compressing features into learned embeddings.")
        print(f"The extracted output is as wide as the raw input.")
        print(f"\n🔧 The _get_embeddings() method needs more fixes.")
        
    elif dim in (1, 2):
        print(f"\n❌ FAILED! Embeddings are only {dim}-dimensional")
        print(f"\nFallback code was used (predictions only).")
        print(f"TabNet's internal encoder is not accessible.")
        print(f"\n🔧 Check the error messages above.")
        
    else:
        print(f"\n⚠️ UNEXPECTED! Embeddings are {dim}-dimensional")
        print(f"\nExpected 8-48, got {dim}.")
        print(f"This might still work, but it's not the expected range.")

except Exception as e:
    print(f"\n❌ ERROR extracting embeddings:")
    print(f"   {e}")
    import traceback
    traceback.print_exc()
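
The same decision logic can be factored into a function that takes the input feature count as a parameter, so it also works for datasets of a different width. This is a sketch only: the function name and the returned labels are hypothetical, and the 8-48 bounds are the ones used in the cell above.

```python
def classify_embedding_dim(dim, n_features, low=8, high=48):
    """Map an embedding width to one of the outcomes described in Step 7."""
    if dim in (1, 2):
        return "fallback"      # predictions only, encoder was not reached
    if dim == n_features:
        return "uncompressed"  # same width as the raw inputs
    if low <= dim <= high:
        return "ok"            # compressed learned embeddings
    return "unexpected"


for d in (24, 48, 150, 1, 64):
    print(d, classify_embedding_dim(d, n_features=150))
```

Unlike the cell above, this version tests for the fallback and input-width cases before the range check, so a dataset that happens to have between 8 and 48 features is still flagged when no compression took place.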

## Summary & Next Steps

### If you got 24-48 dimensions:
✅ **PERFECT!** Your TabNet is working correctly!

- TabNet is producing compressed learned embeddings
- On real data, these can capture complex non-linear relationships among your 150 features
- Your hybrid ensemble (TabNet embeddings + LightGBM) is receiving embeddings of the intended size

This test uses random data and a single epoch, so it confirms the extraction path and the embedding size, not model quality.

**Next:** Your system is ready. You can now optionally add H2O AutoML to discover additional feature interactions.

---

### If you got 150 dimensions:
❌ **PROBLEM:** TabNet is not compressing features

The code is most likely still extracting from `forward_masks`, whose output has one column per input feature, instead of from the encoder.

**Next:** Debug the encoder access in the `_get_embeddings()` method, and re-run Step 3 to confirm that the latest `neural_hybrid.py` was uploaded.

---

### If you got 1-2 dimensions:
❌ **FAILED:** Fallback code was used

The encoder extraction failed and fell back to using predictions only.

**Next:** Check the error messages and debug why the encoder is not accessible.
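
One way to start debugging is to walk the attribute chain that should lead to the encoder and report the first link that is missing. The sketch below is generic Python and runs on a stand-in object. The path `network.tabnet.encoder` is an assumption about how the fitted pytorch-tabnet model is laid out in the installed version, and the helper name `first_missing_attr` is hypothetical.

```python
from types import SimpleNamespace


def first_missing_attr(obj, dotted_path):
    """Follow dotted_path on obj; return the first missing prefix, or None."""
    current = obj
    walked = []
    for name in dotted_path.split("."):
        walked.append(name)
        if not hasattr(current, name):
            return ".".join(walked)
        current = getattr(current, name)
    return None


# Stand-in object so the sketch runs without a trained model.
fake_model = SimpleNamespace(network=SimpleNamespace(tabnet=SimpleNamespace()))
print(first_missing_attr(fake_model, "network.tabnet.encoder"))  # network.tabnet.encoder
```

To use it on the real model, pass whatever object `NeuralHybridPredictor` stores its fitted TabNet in. That attribute name is not shown in this notebook, so check `neural_hybrid.py` for it.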

---

## Final Answer

**Report back with:**
```
⭐ ACTUAL DIMENSION: XX
```

And I'll either:
1. ✅ Confirm your system is perfect and explain next steps for H2O AutoML
2. 🔧 Give you the exact fix to get proper compressed embeddings