# 🌀 Typhoon Prediction Model v4.2 - Anti-Overfitting Edition

## Major Changes from v4.1:

### Problems Identified in v4.1:
- **Japan Sea**: Train R²=1.0, Test R²=-2.36 → **Severe Overfitting**
- **South China Sea & Yellow Sea**: Variance Ratio ≈ 0 → **Still Flat**
- **All regions**: Negative Test R² → **Worse than mean prediction**
- **SVR** keeps winning but performs poorly

### v4.2 Fixes:
1. **Remove SVR** - Not suitable for this small, noisy dataset
2. **Stronger regularization** - Prevent overfitting
3. **Simpler models** - Linear models + KNN only
4. **Fewer features** - Reduce to 2-3 most important
5. **Add baseline comparison** - Compare against simple mean prediction
6. **Ensemble averaging** - Combine multiple models
7. **Constraint: Only use models with Train R² < 0.5** - Prevent overfitting
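
A negative test R² can look like a bug, so a small worked example may help. The sketch below uses invented counts (not the typhoon data) to show that predicting the test-set mean scores exactly 0, that predicting the training mean scores below 0 whenever the two means differ, and that a model chasing noise can fall far lower.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented yearly counts for illustration only (not from typhoon_count.csv).
y_train = np.array([4.0, 6.0, 5.0, 7.0, 3.0])
y_test = np.array([2.0, 4.0, 3.0, 5.0])

# Predicting the test-set mean scores exactly 0 by construction.
r2_test_mean = r2_score(y_test, np.full(len(y_test), y_test.mean()))

# Predicting the training mean (the only mean available in practice)
# scores below 0 whenever the two means differ.
r2_train_mean = r2_score(y_test, np.full(len(y_test), y_train.mean()))

# A model that chases noise can score far below 0.
y_noisy_pred = np.array([6.0, 1.0, 7.0, 2.0])
r2_noisy = r2_score(y_test, y_noisy_pred)

print(f"test-mean R2:   {r2_test_mean:.2f}")
print(f"train-mean R2:  {r2_train_mean:.2f}")
print(f"noisy-model R2: {r2_noisy:.2f}")
```

Because the mean baseline itself can score below 0 on a held-out period, comparing RMSE against that baseline is a more direct check than looking at the sign of R² alone.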

---
⏱️ **Estimated Runtime**: 3-5 minutes

## 1️⃣ Environment Setup

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
import seaborn as sns
import os
from itertools import product

# Simpler models only - NO SVR
from sklearn.linear_model import Ridge, Lasso, ElasticNet, BayesianRidge, LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

# For correlation analysis
from scipy import stats

print("✅ Environment setup complete!")
print("\n🔑 Key Changes in v4.2:")
print("   • REMOVED SVR (causes overfitting/flat predictions)")
print("   • Added LinearRegression as baseline")
print("   • Stronger regularization")
print("   • Maximum 2-3 features per region")
print("   • Anti-overfitting constraint (Train R² < 0.5)")

## 2️⃣ Upload Data Files

In [None]:
from google.colab import files
print("Please upload: typhoon_count.csv and MEI/PDO/IOD/QBO NC files")
uploaded = files.upload()
print(f"\n✅ Uploaded {len(uploaded)} files")

## 3️⃣ Check NC Files

In [None]:
for f in [f for f in os.listdir('.') if f.endswith('.nc')]:
    ds = xr.open_dataset(f)
    print(f"{f}: variables={list(ds.data_vars)}")
    ds.close()

## 4️⃣ Configuration v4.2

### Critical Changes:
- **NO SVR** - Removed entirely
- **MAX_FEATURES = 2** - Reduced from 4
- **Strong regularization** - Higher alpha values
- **Anti-overfitting check** - Reject models with Train R² > 0.5
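
To make "higher alpha values" concrete, the following sketch fits `Ridge` on synthetic data (a stand-in of 27 rows and 2 standardized features, not the real indices) across the same alpha grid used in the configuration and prints how the coefficient norm shrinks. The data-generating numbers are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for ~27 training years and 2 selected features.
X = StandardScaler().fit_transform(rng.normal(size=(27, 2)))
y = 5.0 + 0.8 * X[:, 0] + rng.normal(scale=2.0, size=27)

coef_norms = []
for alpha in [10.0, 50.0, 100.0, 500.0, 1000.0]:  # same grid as PARAM_GRIDS['Ridge']
    model = Ridge(alpha=alpha).fit(X, y)
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:>6.0f}  ||coef||={coef_norms[-1]:.3f}  "
          f"intercept={model.intercept_:.2f}")
```

With standardized features and 27 rows, each diagonal entry of XᵀX equals 27, so an alpha of 500 or 1000 dominates it and pushes predictions toward the training mean. Very large alphas therefore trade overfitting for flatter predictions, which is worth watching alongside the variance ratio.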

In [None]:
class Config:
    # Data files
    TYPHOON_DATA_PATH = 'typhoon_count.csv'
    MEI_NC_PATH = 'mei_v2.nc'
    PDO_NC_PATH = 'pdo_ersst_v5.nc'
    IOD_NC_PATH = 'iod_ersst_v5.nc'
    QBO_NC_PATH = 'qbo.nc'

    # Variable names
    MEI_VAR_NAME = 'mei'
    PDO_VAR_NAME = 'pdo'
    IOD_VAR_NAME = 'iod'
    QBO_VAR_NAME = 'value'
    TIME_VAR_NAME = 'time'

    # Time range
    START_YEAR = 1980
    END_YEAR = 2024
    PREDICT_YEAR = 2025
    TEST_SPLIT_YEAR = 2015

    # ========== v4.2 CRITICAL CHANGES ==========
    
    # Feature selection - MORE AGGRESSIVE
    USE_FEATURE_SELECTION = True
    MAX_FEATURES = 2  # Reduced from 4 to prevent overfitting!
    
    # Simpler feature set - NO lagged features
    USE_LAGGED_FEATURES = False  # Disabled - adds noise
    USE_TREND_FEATURES = True
    
    # Anti-overfitting
    MAX_TRAIN_R2 = 0.5  # Reject models with Train R² > 0.5
    
    # REMOVED SVR - use simpler models only
    MODELS_TO_USE = [
        'LinearRegression',  # Baseline
        'Ridge',
        'Lasso', 
        'ElasticNet',
        'BayesianRidge',
        'KNN'
    ]
    
    # STRONGER regularization
    PARAM_GRIDS = {
        'LinearRegression': {},  # No params
        'Ridge': {
            'alpha': [10.0, 50.0, 100.0, 500.0, 1000.0],  # Much higher!
        },
        'Lasso': {
            'alpha': [0.1, 0.5, 1.0, 2.0, 5.0],
        },
        'ElasticNet': {
            'alpha': [0.1, 0.5, 1.0, 2.0],
            'l1_ratio': [0.3, 0.5, 0.7],
        },
        'BayesianRidge': {
            'alpha_1': [1e-5, 1e-4, 1e-3],
            'lambda_1': [1e-5, 1e-4, 1e-3],
        },
        'KNN': {
            'n_neighbors': [5, 7, 9, 11],  # More neighbors = smoother
            'weights': ['uniform', 'distance'],
        },
    }

    REGIONS = ['South China Sea', 'Eastern China Sea', 'Japan Sea', 'Yellow Sea']
    OUTPUT_DIR = 'model_outputs_v42'

os.makedirs(Config.OUTPUT_DIR, exist_ok=True)
print("✅ Configuration v4.2 complete")
print(f"   Models: {Config.MODELS_TO_USE}")
print(f"   Max features: {Config.MAX_FEATURES}")
print(f"   Anti-overfit threshold: Train R² < {Config.MAX_TRAIN_R2}")
print(f"   Lagged features: {Config.USE_LAGGED_FEATURES}")

## 5️⃣ Simplified Data Loader

In [None]:
class DataLoader:
    def __init__(self, config):
        self.config = config
        self.typhoon_data = None
        self.climate_indices = None

    def load_typhoon_data(self):
        print("Loading typhoon data...")
        self.typhoon_data = pd.read_csv(self.config.TYPHOON_DATA_PATH)
        print(f"  ✓ {len(self.typhoon_data)} records")
        return self.typhoon_data

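    def load_climate_index(self, path, var_name, name):
        """Load one monthly index from a NetCDF file as (year, month, value) rows."""
        try:
            # The context manager closes the file even if parsing fails
            with xr.open_dataset(path) as ds:
                if var_name not in ds.data_vars:
                    # Fall back to the first variable in the file
                    var_name = list(ds.data_vars)[0]
                df = ds[var_name].to_dataframe().reset_index()
            # Use the configured time coordinate instead of a hard-coded name
            time_col = self.config.TIME_VAR_NAME
            df['year'] = pd.to_datetime(df[time_col]).dt.year
            df['month'] = pd.to_datetime(df[time_col]).dt.month
            df['value'] = df[var_name]
            return df[['year', 'month', 'value']]
        except Exception as e:
            print(f"  ! {name} load failed: {e}")
            return None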

    def load_all_climate_indices(self):
        print("Loading climate indices...")
        self.climate_indices = {}
        for name, path, var in [
            ('MEI', self.config.MEI_NC_PATH, self.config.MEI_VAR_NAME),
            ('PDO', self.config.PDO_NC_PATH, self.config.PDO_VAR_NAME),
            ('IOD', self.config.IOD_NC_PATH, self.config.IOD_VAR_NAME),
            ('QBO', self.config.QBO_NC_PATH, self.config.QBO_VAR_NAME)
        ]:
            data = self.load_climate_index(path, var, name)
            if data is not None:
                self.climate_indices[name] = data
                print(f"  ✓ {name}")
        return self.climate_indices

    def calc_avg(self, data, year, months):
        mask = (data['year'] == year) & (data['month'].isin(months))
        vals = data.loc[mask, 'value']
        return vals.mean() if len(vals) >= 2 else np.nan

    def build_feature_matrix(self):
        """Build SIMPLIFIED feature matrix - fewer features to prevent overfitting"""
        print("Building simplified feature matrix...")
        years = [y for y in self.typhoon_data['Year'].unique()
                 if self.config.START_YEAR <= y <= self.config.END_YEAR]

        records = []
        for year in sorted(years):
            rec = {'Year': year}
            prev = year - 1

            for idx, data in self.climate_indices.items():
                # Primary feature: Oct-Nov-Dec average
                ond = self.calc_avg(data, prev, [10, 11, 12])
                rec[f'{idx}_OND'] = ond

                # Trend feature (optional)
                if self.config.USE_TREND_FEATURES:
                    jas = self.calc_avg(data, prev, [7, 8, 9])
                    rec[f'{idx}_TREND'] = ond - jas if not np.isnan(ond) and not np.isnan(jas) else np.nan

            # Current year MEI for reference (not used in main prediction)
            if 'MEI' in self.climate_indices:
                rec['MEI_Current_JASO'] = self.calc_avg(
                    self.climate_indices['MEI'], year, [7, 8, 9, 10]
                )

            records.append(rec)

        df = pd.DataFrame(records)

        # Pivot typhoon counts
        pivot = self.typhoon_data.pivot_table(
            index='Year', columns='Region',
            values='Typhoon_Count', aggfunc='sum'
        ).reset_index()

        for col in pivot.columns:
            if col != 'Year':
                pivot = pivot.rename(columns={col: f'Target_{col}'})

        result = df.merge(pivot, on='Year').dropna()
        feature_count = len([c for c in result.columns if c not in ['Year'] and not c.startswith('Target')])
        print(f"  ✓ {len(result)} samples with {feature_count} features")

        return result

print("✅ Data loader defined")
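
The seasonal features are easier to follow on a toy example. The sketch below re-implements the averaging logic of `calc_avg` as a standalone function (`seasonal_mean` is a hypothetical name, and the monthly values are invented) and derives an OND level and an OND minus JAS trend for one year.

```python
import numpy as np
import pandas as pd

# Toy monthly index for a single year (values are invented).
toy = pd.DataFrame({
    "year": [1999] * 12,
    "month": list(range(1, 13)),
    "value": [0.0] * 6 + [0.2, 0.4, 0.6, 1.0, 1.2, 1.4],
})

def seasonal_mean(data, year, months, min_months=2):
    """Mean over the given months; NaN if fewer than min_months are present."""
    mask = (data["year"] == year) & (data["month"].isin(months))
    vals = data.loc[mask, "value"]
    return vals.mean() if len(vals) >= min_months else np.nan

ond = seasonal_mean(toy, 1999, [10, 11, 12])      # Oct-Dec level
jas = seasonal_mean(toy, 1999, [7, 8, 9])         # Jul-Sep level
trend = ond - jas                                 # change between the two seasons
missing = seasonal_mean(toy, 2000, [10, 11, 12])  # no rows for 2000

print(f"OND={ond:.2f}, JAS={jas:.2f}, TREND={trend:.2f}, missing={missing}")
```

A year with fewer than two available months yields NaN, and `build_feature_matrix` later drops any row containing NaN, so gaps in one index reduce the sample count for every region.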

## 6️⃣ Model Factory (NO SVR)

In [None]:
class ModelFactory:
    @staticmethod
    def create(name, params):
        if name == 'LinearRegression':
            return LinearRegression()
        elif name == 'Ridge':
            return Ridge(alpha=params.get('alpha', 100.0))
        elif name == 'Lasso':
            return Lasso(alpha=params.get('alpha', 1.0), max_iter=5000)
        elif name == 'ElasticNet':
            return ElasticNet(
                alpha=params.get('alpha', 1.0),
                l1_ratio=params.get('l1_ratio', 0.5),
                max_iter=5000
            )
        elif name == 'BayesianRidge':
            return BayesianRidge(
                alpha_1=params.get('alpha_1', 1e-4),
                lambda_1=params.get('lambda_1', 1e-4),
            )
        elif name == 'KNN':
            return KNeighborsRegressor(
                n_neighbors=params.get('n_neighbors', 7),
                weights=params.get('weights', 'uniform')
            )

    @staticmethod
    def get_param_combos(grid):
        if not grid:
            return [{}]
        keys, vals = list(grid.keys()), list(grid.values())
        return [dict(zip(keys, v)) for v in product(*vals)]

print("✅ Model factory defined (NO SVR)")
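
As a quick illustration of what `get_param_combos` produces, here is a standalone sketch of the same grid expansion (the function name `param_combos` is hypothetical) applied to the ElasticNet grid from the configuration.

```python
from itertools import product

def param_combos(grid):
    """Expand a dict of lists into a list of dicts, one per combination."""
    if not grid:
        return [{}]  # a model with no hyperparameters still gets one run
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]

elastic_grid = {"alpha": [0.1, 0.5, 1.0, 2.0], "l1_ratio": [0.3, 0.5, 0.7]}
combos = param_combos(elastic_grid)

print(len(combos))       # 4 alphas x 3 ratios
print(combos[0])
print(param_combos({}))  # LinearRegression case
```

Each combination is later evaluated with leave-one-out cross-validation, so the number of model fits is the number of combinations times the number of training rows.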

## 7️⃣ Anti-Overfitting Model System

### Key Features:
- **Rejects models with Train R² > 0.5** (overfitting)
- **Baseline comparison** against mean prediction
- **Ensemble option** for combining models

In [None]:
class AntiOverfitModelSystem:
    def __init__(self, config):
        self.config = config
        self.results = {}
        self.best_models = {}
        self.trained = {}
        self.scalers = {}
        self.feature_selectors = {}
        self.selected_features = {}
        self.feature_cols = None
        self.baseline_performance = {}

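    def compute_baseline(self, y_train, y_test):
        """Compute baseline: predicting the training mean for all test samples"""
        # Build a float array explicitly: np.full_like would inherit an integer
        # dtype from y_test and silently truncate the mean
        mean_pred = np.full(len(y_test), y_train.mean(), dtype=float)
        rmse = np.sqrt(mean_squared_error(y_test, mean_pred))
        # R² is 0 only when predicting the test mean; with the training mean it is <= 0
        r2 = r2_score(y_test, mean_pred)
        return rmse, r2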

    def select_features_by_correlation(self, X, y, feature_names, max_features=2):
        """Select features by absolute correlation with target"""
        correlations = []
        for i, fname in enumerate(feature_names):
            corr, _ = stats.pearsonr(X[:, i], y)
            correlations.append((fname, i, abs(corr), corr))
        
        # Sort by absolute correlation
        correlations.sort(key=lambda x: x[2], reverse=True)
        
        # Select top features
        selected_idx = [c[1] for c in correlations[:max_features]]
        selected_names = [c[0] for c in correlations[:max_features]]
        selected_corrs = {c[0]: c[3] for c in correlations[:max_features]}
        
        return selected_idx, selected_names, selected_corrs, correlations

    def cv_with_loo(self, X, y, model_name, params):
        """Leave-One-Out cross-validation"""
        loo = LeaveOneOut()
        predictions = []
        actuals = []

        for train_idx, test_idx in loo.split(X):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]

            scaler = StandardScaler()
            X_train_s = scaler.fit_transform(X_train)
            X_test_s = scaler.transform(X_test)

            try:
                model = ModelFactory.create(model_name, params)
                model.fit(X_train_s, y_train)
                pred = model.predict(X_test_s)[0]
                predictions.append(max(0, pred))
                actuals.append(y_test[0])
            except Exception:
                # A failed fit disqualifies this parameter set (sentinel scores)
                return 999, 999

        predictions = np.array(predictions)
        actuals = np.array(actuals)
        rmse = np.sqrt(mean_squared_error(actuals, predictions))
        r2 = r2_score(actuals, predictions)

        return rmse, r2

    def search_best_params(self, X, y, model_name, grid):
        """Search for best hyperparameters with anti-overfitting constraint"""
        combos = ModelFactory.get_param_combos(grid)
        best_rmse = float('inf')
        best_params = {}
        best_cv_r2 = -999

        for params in combos:
            rmse, cv_r2 = self.cv_with_loo(X, y, model_name, params)

            if rmse < best_rmse:
                best_rmse = rmse
                best_params = params
                best_cv_r2 = cv_r2

        return best_params, best_rmse, best_cv_r2

    def train_region(self, data, region, feat_cols):
        """Train models with anti-overfitting measures"""
        target = f'Target_{region}'
        if target not in data.columns:
            print(f"  ! Target column {target} not found")
            return None

        X = data[feat_cols].values
        y = data[target].values
        years = data['Year'].values

        train_mask = years < self.config.TEST_SPLIT_YEAR
        test_mask = years >= self.config.TEST_SPLIT_YEAR

        X_train_full, X_test_full = X[train_mask], X[test_mask]
        y_train, y_test = y[train_mask], y[test_mask]

        print(f"\n  === {region} ===")
        print(f"      Training: {len(y_train)} samples, Test: {len(y_test)} samples")
        print(f"      Target - mean: {y_train.mean():.1f}, std: {y_train.std():.1f}, range: [{y_train.min():.0f}, {y_train.max():.0f}]")

        # Compute baseline (mean prediction)
        baseline_rmse, baseline_r2 = self.compute_baseline(y_train, y_test)
        self.baseline_performance[region] = {'rmse': baseline_rmse, 'r2': baseline_r2}
        print(f"      Baseline (mean): RMSE={baseline_rmse:.2f}")

        # Feature selection by correlation
        selected_idx, selected_names, selected_corrs, all_corrs = self.select_features_by_correlation(
            X_train_full, y_train, feat_cols, self.config.MAX_FEATURES
        )
        self.selected_features[region] = selected_names
        
        print(f"      Feature correlations with target:")
        for fname, _, abs_corr, corr in all_corrs[:5]:  # Show top 5
            marker = "✓" if fname in selected_names else " "
            print(f"        {marker} {fname}: r={corr:.3f}")
        
        X_train = X_train_full[:, selected_idx]
        X_test = X_test_full[:, selected_idx]

        # Scale
        scaler = StandardScaler()
        X_train_s = scaler.fit_transform(X_train)
        X_test_s = scaler.transform(X_test)
        self.scalers[region] = scaler

        # Search for best model
        model_results = {}
        valid_models = []
        
        for model_name in self.config.MODELS_TO_USE:
            print(f"    {model_name}...", end=" ")
            grid = self.config.PARAM_GRIDS.get(model_name, {})

            best_params, cv_rmse, cv_r2 = self.search_best_params(X_train, y_train, model_name, grid)
            
            # Train to get train R² for overfitting check
            model = ModelFactory.create(model_name, best_params)
            model.fit(X_train_s, y_train)
            y_train_pred = np.maximum(model.predict(X_train_s), 0)
            train_r2 = r2_score(y_train, y_train_pred)
            
            # Check for overfitting
            is_overfit = train_r2 > self.config.MAX_TRAIN_R2
            
            if is_overfit:
                print(f"CV_RMSE={cv_rmse:.2f}, Train_R²={train_r2:.2f} ⚠️ OVERFIT - REJECTED")
            else:
                print(f"CV_RMSE={cv_rmse:.2f}, Train_R²={train_r2:.2f} ✓")
                valid_models.append(model_name)

            model_results[model_name] = {
                'params': best_params,
                'cv_rmse': cv_rmse,
                'cv_r2': cv_r2,
                'train_r2': train_r2,
                'is_overfit': is_overfit,
                'model': model,
                'y_train_pred': y_train_pred,
            }

        # Select best non-overfitting model
        if valid_models:
            best_model_name = min(valid_models, key=lambda m: model_results[m]['cv_rmse'])
        else:
            # If all models overfit, pick the one with lowest CV RMSE anyway
            print("      ⚠️ All models overfit! Selecting least bad option...")
            best_model_name = min(model_results, key=lambda m: model_results[m]['cv_rmse'])
        
        best_params = model_results[best_model_name]['params']
        final_model = model_results[best_model_name]['model']
        y_train_pred = model_results[best_model_name]['y_train_pred']
        
        print(f"  🏆 Best model: {best_model_name}")

        # Test predictions
        y_test_pred = np.maximum(final_model.predict(X_test_s), 0)

        # Metrics
        test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
        test_r2 = r2_score(y_test, y_test_pred)
        train_r2 = model_results[best_model_name]['train_r2']

        # Variance analysis
        pred_std = np.std(y_test_pred)
        actual_std = np.std(y_test)
        variance_ratio = pred_std / actual_std if actual_std > 0 else 0

        print(f"     Train R²={train_r2:.3f}, Test R²={test_r2:.3f}, Test RMSE={test_rmse:.2f}")
        print(f"     Pred std={pred_std:.2f}, Actual std={actual_std:.2f}, Var.Ratio={variance_ratio:.2f}")

        # Compare to baseline
        improvement = (baseline_rmse - test_rmse) / baseline_rmse * 100 if baseline_rmse > 0 else 0
        if test_rmse < baseline_rmse:
            print(f"     ✓ {improvement:.1f}% better than baseline")
        else:
            print(f"     ⚠️ {-improvement:.1f}% worse than baseline (mean prediction)")

        # Store results
        self.best_models[region] = best_model_name
        self.trained[region] = final_model

        self.results[region] = {
            'models': model_results,
            'best': best_model_name,
            'best_params': best_params,
            'test_rmse': test_rmse,
            'test_r2': test_r2,
            'train_r2': train_r2,
            'baseline_rmse': baseline_rmse,
            'improvement': improvement,
            'y_train': y_train,
            'y_train_pred': y_train_pred,
            'y_test': y_test,
            'y_test_pred': y_test_pred,
            'years_train': years[train_mask],
            'years_test': years[test_mask],
            'pred_std': pred_std,
            'actual_std': actual_std,
            'variance_ratio': variance_ratio,
            'selected_features': selected_names,
            'feature_correlations': selected_corrs,
        }

    def train_all(self, data):
        """Train models for all regions"""
        # Exclude MEI_Current_JASO from features (it's for current year)
        self.feature_cols = [c for c in data.columns
                           if c not in ['Year', 'MEI_Current_JASO'] and not c.startswith('Target')]

        print(f"\nAvailable features: {len(self.feature_cols)}")
        print(f"Features: {self.feature_cols}")
        print(f"Max features to select: {self.config.MAX_FEATURES}")

        for region in self.config.REGIONS:
            self.train_region(data, region, self.feature_cols)

    def comparison_table(self):
        """Generate comparison table"""
        rows = []
        for region, res in self.results.items():
            for mn, mr in res['models'].items():
                rows.append({
                    'Region': region,
                    'Model': mn,
                    'CV_RMSE': mr['cv_rmse'],
                    'Train_R2': mr['train_r2'],
                    'Overfit': '⚠️' if mr['is_overfit'] else '✓',
                    'Best': '★' if mn == res['best'] else ''
                })
        return pd.DataFrame(rows)

print("✅ Anti-overfit model system defined")
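
The manual loop in `cv_with_loo` can also be expressed with a scikit-learn pipeline. The sketch below is an alternative formulation on synthetic data, not the notebook's own code path: it uses `make_pipeline` and `cross_val_predict` so that the scaler is refit inside every fold, and checks that the result matches an explicit loop.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(27, 2))                        # synthetic features
y = 5.0 + X[:, 0] + rng.normal(scale=2.0, size=27)  # synthetic target

# The pipeline refits the scaler inside every fold, so the held-out
# year never influences the scaling statistics.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=100.0))
loo_pred = cross_val_predict(pipe, X, y, cv=LeaveOneOut())
loo_pred = np.maximum(loo_pred, 0)                  # counts cannot be negative

# Explicit loop for comparison, mirroring the structure of cv_with_loo.
manual = []
for train_idx, test_idx in LeaveOneOut().split(X):
    scaler = StandardScaler().fit(X[train_idx])
    model = Ridge(alpha=100.0).fit(scaler.transform(X[train_idx]), y[train_idx])
    manual.append(max(0.0, model.predict(scaler.transform(X[test_idx]))[0]))
manual = np.array(manual)

loo_rmse = float(np.sqrt(mean_squared_error(y, loo_pred)))
print(f"LOO RMSE: {loo_rmse:.3f}")
```

Either way, the held-out predictions are pooled before scoring because R² is undefined on a single sample. Note that in `train_region` the features are chosen once on the full training set before this loop runs, so the cross-validated scores are likely somewhat optimistic.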

## 8️⃣ Load Data

In [None]:
# Load data
loader = DataLoader(Config)
loader.load_typhoon_data()
loader.load_all_climate_indices()
data = loader.build_feature_matrix()

print("\n" + "="*60)
print("Data preview:")
display(data.head())

print(f"\n📊 Dataset Summary:")
print(f"   Total samples: {len(data)}")
print(f"   Training (< {Config.TEST_SPLIT_YEAR}): {len(data[data['Year'] < Config.TEST_SPLIT_YEAR])}")
print(f"   Test (>= {Config.TEST_SPLIT_YEAR}): {len(data[data['Year'] >= Config.TEST_SPLIT_YEAR])}")

## 9️⃣ Train Models

In [None]:
print("⏳ Training with anti-overfitting measures...")
print("   • Models with Train R² > 0.5 will be rejected")
print("   • Using only 2 best features per region")
print("   • Comparing against baseline (mean prediction)\n")

system = AntiOverfitModelSystem(Config)
system.train_all(data)

print("\n" + "="*60)
print("✅ Training complete!")

## 🔟 Results Analysis

In [None]:
# Model comparison table
comparison = system.comparison_table()
print("\n📊 Model Comparison (with overfitting check):")
display(comparison)

# Save
comparison.to_csv(f'{Config.OUTPUT_DIR}/model_comparison_v42.csv', index=False)

## 1️⃣1️⃣ Visualization

In [None]:
# Plot actual vs predicted
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, (region, res) in enumerate(system.results.items()):
    ax = axes[idx]

    # Combine train and test
    all_years = np.concatenate([res['years_train'], res['years_test']])
    all_actual = np.concatenate([res['y_train'], res['y_test']])
    all_pred = np.concatenate([res['y_train_pred'], res['y_test_pred']])

    # Plot actual
    ax.plot(all_years, all_actual, 'b-o', label='Actual', linewidth=2, markersize=4)

    # Plot predictions
    ax.plot(res['years_train'], res['y_train_pred'], 'g--s',
           label='Train Pred', linewidth=1.5, markersize=3, alpha=0.7)
    ax.plot(res['years_test'], res['y_test_pred'], 'r--^',
           label='Test Pred', linewidth=2, markersize=5)

    # Train/test split line
    ax.axvline(x=Config.TEST_SPLIT_YEAR - 0.5, color='gray',
              linestyle=':', linewidth=2, label='Split')

    # Mean line (baseline)
    mean_val = np.mean(res['y_train'])
    ax.axhline(y=mean_val, color='orange', linestyle='--',
              alpha=0.5, label=f'Baseline (mean={mean_val:.1f})')

    ax.set_xlabel('Year')
    ax.set_ylabel('Typhoon Count')
    
    # Title with key metrics
    title = f'{region}\n'
    title += f'Best: {res["best"]} | Test R²={res["test_r2"]:.2f} | '
    if res['improvement'] > 0:
        title += f'✓ {res["improvement"]:.0f}% better'
    else:
        title += f'⚠️ {-res["improvement"]:.0f}% worse'
    ax.set_title(title)
    
    ax.legend(loc='upper right', fontsize=8)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(f'{Config.OUTPUT_DIR}/predictions_v42.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\n✓ Plot saved to {Config.OUTPUT_DIR}/predictions_v42.png")

## 1️⃣2️⃣ Diagnostic Summary

In [None]:
print("="*70)
print("📊 DIAGNOSTIC SUMMARY v4.2")
print("="*70)

overall_better = 0
overall_worse = 0

for region, res in system.results.items():
    print(f"\n🌊 {region}:")
    print(f"   Best Model: {res['best']}")
    print(f"   Parameters: {res['best_params']}")
    print(f"   Selected Features: {res['selected_features']}")
    print(f"   Feature Correlations: {res['feature_correlations']}")
    print(f"   ─────────────────────────────────")
    print(f"   Train R²: {res['train_r2']:.3f}")
    print(f"   Test R²: {res['test_r2']:.3f}")
    print(f"   Test RMSE: {res['test_rmse']:.2f}")
    print(f"   Baseline RMSE: {res['baseline_rmse']:.2f}")
    print(f"   Variance Ratio: {res['variance_ratio']:.2f}")
    print(f"   ─────────────────────────────────")

    if res['improvement'] > 0:
        print(f"   ✓ {res['improvement']:.1f}% BETTER than baseline")
        overall_better += 1
    else:
        print(f"   ⚠️ {-res['improvement']:.1f}% worse than baseline")
        overall_worse += 1

    if res['variance_ratio'] < 0.3:
        print(f"   ⚠️ Predictions still too flat")

print("\n" + "="*70)
print("OVERALL ASSESSMENT")
print("="*70)
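# Use the number of regions actually trained rather than a hard-coded 4
print(f"   Regions better than baseline: {overall_better}/{len(system.results)}")
print(f"   Regions worse than baseline: {overall_worse}/{len(system.results)}")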

print("\n" + "="*70)
print("KEY INSIGHTS")
print("="*70)
print("""
The fundamental challenge:
• Only 37 samples (27 train, 10 test)
• Climate indices have WEAK correlations with typhoon counts
• High inter-annual variability in typhoon counts

What v4.2 addressed:
• Removed SVR (was overfitting or producing flat predictions)
• Reduced features from 4 to 2 (better sample/feature ratio)
• Added anti-overfitting constraint (Train R² < 0.5)
• Strong regularization in linear models

Realistic expectations:
• With 37 samples, even small improvements over baseline are meaningful
• Consider reporting predictions as ranges, not point estimates
• The "baseline" (predicting mean) is often hard to beat significantly
""")

## 1️⃣3️⃣ Feature Importance Analysis

In [None]:
print("\n📊 FEATURE ANALYSIS BY REGION")
print("="*70)

all_features_used = {}

for region, res in system.results.items():
    print(f"\n{region}:")
    for feat, corr in res['feature_correlations'].items():
        direction = "↑" if corr > 0 else "↓"
        strength = "strong" if abs(corr) > 0.3 else "moderate" if abs(corr) > 0.15 else "weak"
        print(f"  • {feat}: r={corr:+.3f} {direction} ({strength})")
        
        if feat not in all_features_used:
            all_features_used[feat] = []
        all_features_used[feat].append((region, corr))

print("\n" + "="*70)
print("FEATURES USED ACROSS REGIONS")
print("="*70)
for feat, usages in sorted(all_features_used.items(), key=lambda x: len(x[1]), reverse=True):
    regions = [u[0].split()[0] for u in usages]  # First word of region
    avg_corr = np.mean([abs(u[1]) for u in usages])
    print(f"  {feat}: used in {len(usages)} regions, avg |r|={avg_corr:.3f}")
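
The "strong / moderate / weak" labels above are relative to this dataset, so it may help to see how large a correlation must be before it is statistically distinguishable from zero. The sketch below computes the two-sided critical |r| from the t distribution for the sample sizes quoted in this notebook (27 training rows, 37 in total); `critical_r` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

def critical_r(n, alpha=0.05):
    """Smallest |r| that is significant in a two-sided test with n samples."""
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / np.sqrt(t_crit**2 + df)

for n in (27, 37):  # training size and total size quoted in this notebook
    print(f"n={n}: |r| must exceed {critical_r(n):.3f} for p < 0.05")
```

Under these assumptions, a feature with |r| just above 0.3 on 27 training rows would not pass a conventional significance test, and picking the top features out of several candidates makes chance correlations even more likely.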

## 1️⃣4️⃣ Alternative: Ensemble with Uncertainty

In [None]:
print("\n📊 ENSEMBLE PREDICTION WITH UNCERTAINTY")
print("="*70)
print("\nCombining predictions from multiple models for better stability...\n")

for region, res in system.results.items():
    print(f"\n{region}:")
    
    # Collect predictions from all non-overfitting models
    valid_preds = []
    for mn, mr in res['models'].items():
        if not mr['is_overfit']:
            # Get test predictions for this model
            model = mr['model']
            X_test_s = system.scalers[region].transform(
                data[data['Year'] >= Config.TEST_SPLIT_YEAR][res['selected_features']].values
            )
            pred = np.maximum(model.predict(X_test_s), 0)
            valid_preds.append(pred)
    
    if valid_preds:
        # Ensemble: average of all valid models
        ensemble_pred = np.mean(valid_preds, axis=0)
        ensemble_std = np.std(valid_preds, axis=0)
        
        # Calculate metrics
        ensemble_rmse = np.sqrt(mean_squared_error(res['y_test'], ensemble_pred))
        ensemble_r2 = r2_score(res['y_test'], ensemble_pred)
        
        print(f"   Single best model RMSE: {res['test_rmse']:.2f}")
        print(f"   Ensemble ({len(valid_preds)} models) RMSE: {ensemble_rmse:.2f}")
        
        if ensemble_rmse < res['test_rmse']:
            print(f"   ✓ Ensemble is better!")
        else:
            print(f"   Single model is better")

        print(f"\n   Test year predictions (mean ± std):")
        test_years = res['years_test']
        # zip is enough here; the loop index was unused
        for yr, actual, pred, std in zip(test_years, res['y_test'], ensemble_pred, ensemble_std):
            print(f"     {int(yr)}: Actual={actual:.0f}, Pred={pred:.1f}±{std:.1f}")
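
The ± value printed above is the spread across models, which measures how much the models disagree rather than how far a forecast may be from the truth. One possible way to report a range instead of a point estimate is to use quantiles of held-out residuals. The sketch below is a minimal illustration with invented numbers; the arrays and the `point_forecast` value are assumptions, not outputs of this notebook.

```python
import numpy as np

# Invented stand-ins for pooled leave-one-out results on training years.
loo_actual = np.array([3, 5, 4, 7, 6, 2, 5, 8, 4, 5], dtype=float)
loo_pred = np.array([4.5, 4.8, 5.0, 5.5, 5.2, 4.0, 5.0, 6.0, 4.6, 5.1])

# Held-out errors: positive means the model under-predicted.
residuals = loo_actual - loo_pred
lo_q, hi_q = np.quantile(residuals, [0.1, 0.9])  # empirical 80% band

point_forecast = 5.2                             # hypothetical forecast for one year
lower = max(0.0, point_forecast + lo_q)          # counts cannot be negative
upper = point_forecast + hi_q

print(f"forecast {point_forecast:.1f}, 80% range [{lower:.2f}, {upper:.2f}]")
```

With only a few dozen residuals the quantiles are themselves noisy, so a band like this is best read as a rough indication of typical error.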

## 1️⃣5️⃣ Download Results

In [None]:
from google.colab import files

files.download(f'{Config.OUTPUT_DIR}/model_comparison_v42.csv')
files.download(f'{Config.OUTPUT_DIR}/predictions_v42.png')

print("✅ Files downloaded!")

---

## 📝 Summary: v4.1 → v4.2 Changes

| Issue in v4.1 | Fix in v4.2 |
|---------------|-------------|
| SVR overfitting (Japan Sea Train R²=1.0) | Removed SVR entirely |
| Flat predictions (Var.Ratio ≈ 0) | Simpler linear models with strong regularization |
| Too many features (4) | Reduced to 2 features |
| No baseline comparison | Added mean prediction baseline |
| No overfitting detection | Added Train R² < 0.5 constraint |

### Reality Check:
With only **37 samples** and **weak feature-target correlations** (|r| < 0.3), 
accurate prediction is fundamentally limited. The best we can do is:

1. **Avoid overfitting** (don't fool ourselves with good training metrics)
2. **Beat the baseline** (mean prediction) even slightly
3. **Report uncertainty** (predictions are approximate)