# Topic 1: Regression vs ARIMA - When to Choose Which?

This notebook makes a **fair** comparison between two PM2.5 forecasting methods:
1. **Baseline Regression** - forecasts from lag features, time features, and weather variables
2. **ARIMA** - a univariate time series model

## Conditions for a fair comparison:
- ✅ Same station: **Aotizhongxin**
- ✅ Same CUTOFF: **2017-01-01**
- ✅ Same HORIZON: **1 hour**
- ✅ Same metrics: MAE, RMSE, R²
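
For reference, the three shared metrics can be computed by hand as in the minimal sketch below. The helper name `regression_metrics` and the sample values are illustrative assumptions, not part of the notebook, which relies on the scikit-learn implementations.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equal-length 1-D arrays (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                       # average absolute miss
    rmse = np.sqrt(np.mean(err ** 2))                # penalizes large misses more
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variance around the mean
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Made-up values, only to show the call
metrics = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])
print(metrics)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions, which is why the notebook later reads the RMSE/MAE ratio as a sign of occasional large misses.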

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

In [None]:
# Parameters
RAW_ZIP_PATH = '../data/raw/PRSA2017_Data_20130301-20170228.zip'
STATION = 'Aotizhongxin'
CUTOFF = '2017-01-01'
HORIZON = 1
LAG_HOURS = [1, 3, 24]

# Add parent directory to path for imports
import sys
from pathlib import Path
if str(Path('..').resolve()) not in sys.path:
    sys.path.insert(0, str(Path('..').resolve()))

## 1. Shared configuration

In [None]:
print("=" * 70)
print("COMPARISON CONFIGURATION")
print("=" * 70)
print(f"Station: {STATION}")
print(f"Train/Test cutoff: {CUTOFF}")
print(f"Horizon: {HORIZON} hour(s)")
print(f"Lag features: {LAG_HOURS}")

## 2. Load and Prepare the Data

In [None]:
from src.classification_library import load_beijing_air_quality, clean_air_quality_df

# Load data
df_raw = load_beijing_air_quality(use_ucimlrepo=False, raw_zip_path=RAW_ZIP_PATH)
df = clean_air_quality_df(df_raw)

# Filter station
df_station = df[df['station'] == STATION].sort_values('datetime').reset_index(drop=True)
print(f"Data shape: {df_station.shape}")
print(f"Date range: {df_station['datetime'].min()} to {df_station['datetime'].max()}")

## 3. Model 1: Baseline Regression

In [None]:
# Build features for regression
df_reg = df_station[['datetime', 'PM2.5', 'TEMP', 'PRES', 'DEWP', 'WSPM']].copy()

# Lag features
for lag in LAG_HOURS:
    df_reg[f'PM25_lag{lag}'] = df_reg['PM2.5'].shift(lag)

# Time features
df_reg['hour'] = df_reg['datetime'].dt.hour
df_reg['dayofweek'] = df_reg['datetime'].dt.dayofweek
df_reg['month'] = df_reg['datetime'].dt.month

# Target: PM2.5 HORIZON hours ahead
df_reg['target'] = df_reg['PM2.5'].shift(-HORIZON)

# Drop missing
df_reg = df_reg.dropna()

print(f"Regression data shape: {df_reg.shape}")
df_reg.head()
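
The alignment produced by `shift` is easy to get backwards, so here is a small sketch of the same lag/target construction on a toy series. The values and the `toy`/`aligned` names are made up for illustration; only the `shift` pattern mirrors the cell above.

```python
import pandas as pd

HORIZON = 1  # hours ahead, matching the notebook's setting

# Toy hourly readings (made-up values)
toy = pd.DataFrame({"PM2.5": [10.0, 12.0, 15.0, 11.0, 9.0, 14.0]})

toy["PM25_lag1"] = toy["PM2.5"].shift(1)       # value one row (hour) before row t
toy["target"] = toy["PM2.5"].shift(-HORIZON)   # value HORIZON rows after row t

# The first row has no lag and the last row has no target, so both are dropped
aligned = toy.dropna().reset_index(drop=True)
print(aligned)
```

Each remaining row pairs values known at hour t with the reading at hour t + HORIZON. The shifts are positional, so they only correspond to hours when the rows are evenly spaced.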

In [None]:
# Split train/test by cutoff
cutoff_date = pd.Timestamp(CUTOFF)
train_mask = df_reg['datetime'] < cutoff_date
test_mask = df_reg['datetime'] >= cutoff_date

# Current-hour PM2.5 is excluded, so PM25_lag1 is the most recent PM2.5 value the model sees
feature_cols = [col for col in df_reg.columns if col not in ['datetime', 'PM2.5', 'target']]

X_train_reg = df_reg.loc[train_mask, feature_cols]
y_train_reg = df_reg.loc[train_mask, 'target']
X_test_reg = df_reg.loc[test_mask, feature_cols]
y_test_reg = df_reg.loc[test_mask, 'target']
# Timestamp each test prediction at the hour being forecast (row time + HORIZON)
test_dates_reg = df_reg.loc[test_mask, 'datetime'] + pd.Timedelta(hours=HORIZON)

print(f"Train: {len(X_train_reg)} samples")
print(f"Test:  {len(X_test_reg)} samples")
print(f"Features: {feature_cols}")

In [None]:
# Train regression model
print("Training Random Forest Regressor...")
model_reg = RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1)
model_reg.fit(X_train_reg, y_train_reg)

# Predictions
y_pred_reg = model_reg.predict(X_test_reg)

# Metrics
rmse_reg = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
mae_reg = mean_absolute_error(y_test_reg, y_pred_reg)
r2_reg = r2_score(y_test_reg, y_pred_reg)

print("\n" + "=" * 70)
print("REGRESSION MODEL PERFORMANCE")
print("=" * 70)
print(f"RMSE: {rmse_reg:.2f}")
print(f"MAE:  {mae_reg:.2f}")
print(f"R²:   {r2_reg:.4f}")

In [None]:
# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': model_reg.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTOP 10 IMPORTANT FEATURES:")
print(feature_importance.head(10))

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
feature_importance.head(10).plot(x='feature', y='importance', kind='barh', ax=ax, color='steelblue')
ax.set_xlabel('Importance', fontsize=11)
ax.set_title('Top 10 Feature Importance - Regression Model', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

## 4. Model 2: ARIMA

In [None]:
# Prepare data for ARIMA
df_arima = df_station[['datetime', 'PM2.5']].dropna().copy()
df_arima = df_arima.set_index('datetime')

# Split train/test
train_arima = df_arima[df_arima.index < CUTOFF]['PM2.5']
test_arima = df_arima[df_arima.index >= CUTOFF]['PM2.5']

print(f"Train: {len(train_arima)} samples")
print(f"Test:  {len(test_arima)} samples")

In [None]:
# Grid search ARIMA (simplified)
print("Grid searching ARIMA parameters...")
best_aic = np.inf
best_order = None

p_range = range(0, 4)
d_range = [1]  # d=1 is a common choice for PM2.5
q_range = range(0, 4)

results_arima = []
for p in p_range:
    for d in d_range:
        for q in q_range:
            try:
                model = ARIMA(train_arima, order=(p, d, q))
                fitted = model.fit()
                if fitted.aic < best_aic:
                    best_aic = fitted.aic
                    best_order = (p, d, q)
                results_arima.append({'p': p, 'd': d, 'q': q, 'aic': fitted.aic})
            except Exception:  # skip orders that fail to fit
                continue

print(f"\nBest ARIMA order: {best_order}")
print(f"Best AIC: {best_aic:.2f}")

In [None]:
# Train best ARIMA model
print(f"\nTraining ARIMA{best_order}...")
model_arima = ARIMA(train_arima, order=best_order)
fitted_arima = model_arima.fit()

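# Forecast: rolling one-step-ahead predictions, to match HORIZON = 1.
# apply() reuses the fitted parameters on the full series without refitting;
# fittedvalues are the one-step-ahead predictions, so each test point is
# predicted from the observations up to the previous row.
full_arima = pd.concat([train_arima, test_arima])
forecast_arima = fitted_arima.apply(full_arima).fittedvalues.iloc[-len(test_arima):]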

# Metrics
rmse_arima = np.sqrt(mean_squared_error(test_arima, forecast_arima))
mae_arima = mean_absolute_error(test_arima, forecast_arima)
r2_arima = r2_score(test_arima, forecast_arima)

print("\n" + "=" * 70)
print(f"ARIMA{best_order} MODEL PERFORMANCE")
print("=" * 70)
print(f"RMSE: {rmse_arima:.2f}")
print(f"MAE:  {mae_arima:.2f}")
print(f"R²:   {r2_arima:.4f}")
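
Whether ARIMA is asked for one forecast per hour or for one long multi-step forecast from the cutoff matters a great deal in a HORIZON = 1 comparison. The sketch below illustrates the difference on a simulated AR(1) series in plain NumPy. The coefficient `phi`, the simulated series, and all variable names are assumptions for illustration, not the notebook's data or fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.9                     # assumed AR(1) coefficient, chosen for illustration
n_train, n_test = 200, 50

# Simulate y[t] = phi * y[t-1] + noise
y = np.zeros(n_train + n_test)
for t in range(1, len(y)):
    y[t] = phi * y[t - 1] + rng.normal()

train, test = y[:n_train], y[n_train:]

# Multi-step: every test point is forecast from the last training value only,
# so the forecast decays geometrically toward the series mean (zero here)
multi_step = train[-1] * phi ** np.arange(1, n_test + 1)

# Rolling one-step: each forecast uses the actual value one step earlier
previous_actual = np.concatenate(([train[-1]], test[:-1]))
one_step = phi * previous_actual

rmse_multi = np.sqrt(np.mean((test - multi_step) ** 2))
rmse_one = np.sqrt(np.mean((test - one_step) ** 2))
print(f"multi-step RMSE: {rmse_multi:.2f}, one-step RMSE: {rmse_one:.2f}")
```

The two forecasts agree on the first test point and diverge afterwards. On persistent series the multi-step error is typically much larger, so scoring a multi-step forecast against a regression that sees fresh lags each hour would not be a like-for-like comparison.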

## 5. Overall Comparison

In [None]:
# Comparison table
comparison = pd.DataFrame({
    'Model': ['Regression (Random Forest)', f'ARIMA{best_order}'],
    'RMSE': [rmse_reg, rmse_arima],
    'MAE': [mae_reg, mae_arima],
    'R²': [r2_reg, r2_arima],
    'RMSE/MAE': [rmse_reg/mae_reg, rmse_arima/mae_arima]
})

print("\n" + "=" * 70)
print("OVERALL COMPARISON")
print("=" * 70)
print(comparison.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

metrics = ['RMSE', 'MAE', 'R²']
for idx, metric in enumerate(metrics):
    ax = axes[idx]
    comparison.plot(x='Model', y=metric, kind='bar', ax=ax, legend=False, color=['steelblue', 'coral'])
    ax.set_title(metric, fontsize=12, fontweight='bold')
    ax.set_ylabel(metric)
    ax.set_xlabel('')
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

print("\n📊 QUICK TAKE:")
if rmse_reg < rmse_arima:
    print(f"✅ Regression is BETTER on RMSE ({rmse_reg:.2f} < {rmse_arima:.2f})")
else:
    print(f"✅ ARIMA is BETTER on RMSE ({rmse_arima:.2f} <= {rmse_reg:.2f})")

if mae_reg < mae_arima:
    print(f"✅ Regression is BETTER on MAE ({mae_reg:.2f} < {mae_arima:.2f})")
else:
    print(f"✅ ARIMA is BETTER on MAE ({mae_arima:.2f} <= {mae_reg:.2f})")

## QUESTION 1: Which model is better for horizon=1?

### Detailed analysis

In [None]:
print("=" * 70)
print("QUESTION 1: WHICH MODEL IS BETTER FOR HORIZON=1?")
print("=" * 70)

# Detailed comparison
print(f"\n1. PERFORMANCE METRICS:")
print(f"   Regression: RMSE={rmse_reg:.2f}, MAE={mae_reg:.2f}, R²={r2_reg:.4f}")
print(f"   ARIMA:      RMSE={rmse_arima:.2f}, MAE={mae_arima:.2f}, R²={r2_arima:.4f}")

winner = "Regression" if rmse_reg < rmse_arima else "ARIMA"
print(f"\n   → Winner: {winner} (lower RMSE)")

# Inspect the regression feature importances
top_feature = feature_importance.iloc[0]
print(f"\n2. MOST IMPORTANT FEATURE (Regression):")
print(f"   {top_feature['feature']}: {top_feature['importance']:.4f}")

if top_feature['feature'] == 'PM25_lag1':
    print("   → The 1-hour lag is the most important feature!")
    print("   → Consistent with theory: short-term forecasts depend strongly on the most recent value")

print(f"\n3. WHY IS {winner} BETTER?")
if winner == "Regression":
    print("   ✅ Regression has an edge at horizon=1 because it:")
    print("      • Uses PM25_lag1 directly (highly correlated with the target)")
    print("      • Combines several feature groups: lag + time + weather")
    print("      • Captures non-linear relationships through the Random Forest")
    print("      • Needs no stationarity assumption or differencing")
else:
    print("   ✅ ARIMA is better because it:")
    print("      • Captures autocorrelation structure through (p,d,q)")
    print("      • Models the time series dynamics directly")
    print("      • Is less prone to overfitting on noisy data")

## QUESTION 2: Which model is more stable when spikes occur?

### Find and analyze a segment with a spike

In [None]:
# Find spikes in the test set (top 5% of values)
threshold_spike = test_arima.quantile(0.95)
spike_dates = test_arima[test_arima > threshold_spike].index

print("=" * 70)
print("QUESTION 2: WHICH MODEL IS MORE STABLE WHEN SPIKES OCCUR?")
print("=" * 70)
print(f"\nSpike threshold (95th percentile): {threshold_spike:.2f}")
print(f"Number of spike points in the test set: {len(spike_dates)}")

if len(spike_dates) > 0:
    # Pick one spike to analyze (the first one)
    spike_start = spike_dates[0] - pd.Timedelta(days=1)
    spike_end = spike_dates[0] + pd.Timedelta(days=2)

    print(f"\nAnalyzing the spike at: {spike_dates[0]}")
    print(f"Zooming in on: {spike_start} to {spike_end}")

In [None]:
# Visualize spike region
if len(spike_dates) > 0:
    fig, ax = plt.subplots(figsize=(14, 6))
    
    # Filter data in spike region
    mask_spike = (test_arima.index >= spike_start) & (test_arima.index <= spike_end)
    actual_spike = test_arima[mask_spike]
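    # Select by position and re-index by the test timestamps, so the forecast
    # plots on the same time axis even if its own index is not datetime-based
    forecast_arima_spike = pd.Series(np.asarray(forecast_arima)[mask_spike], index=actual_spike.index)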
    
    # Get regression predictions for same period
    mask_reg_spike = (test_dates_reg >= spike_start) & (test_dates_reg <= spike_end)
    reg_spike_dates = test_dates_reg[mask_reg_spike]
    reg_spike_actual = y_test_reg[mask_reg_spike]
    reg_spike_pred = y_pred_reg[mask_reg_spike.values]
    
    # Plot
    ax.plot(actual_spike.index, actual_spike.values, 'o-', linewidth=2, markersize=4, 
            label='Actual', color='black', alpha=0.8)
    ax.plot(forecast_arima_spike.index, forecast_arima_spike.values, 's--', linewidth=2, markersize=3,
            label=f'ARIMA{best_order}', color='coral', alpha=0.8)
    ax.plot(reg_spike_dates.values, reg_spike_pred, '^--', linewidth=2, markersize=3,
            label='Regression', color='steelblue', alpha=0.8)
    
    ax.axhline(y=threshold_spike, color='red', linestyle=':', linewidth=2, 
               label=f'Spike threshold ({threshold_spike:.0f})')
    
    ax.set_xlabel('Time', fontsize=11)
    ax.set_ylabel('PM2.5 (μg/m³)', fontsize=11)
    ax.set_title('Forecast vs Actual - Spike Region', fontsize=13, fontweight='bold')
    ax.legend(loc='best')
    ax.grid(True, alpha=0.3)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # Calculate errors for spike region
    mae_arima_spike = mean_absolute_error(actual_spike, forecast_arima_spike)
    rmse_arima_spike = np.sqrt(mean_squared_error(actual_spike, forecast_arima_spike))

    mae_reg_spike = mean_absolute_error(reg_spike_actual, reg_spike_pred)
    rmse_reg_spike = np.sqrt(mean_squared_error(reg_spike_actual, reg_spike_pred))

    print("\nPERFORMANCE IN THE SPIKE REGION:")
    print(f"ARIMA:      MAE={mae_arima_spike:.2f}, RMSE={rmse_arima_spike:.2f}, RMSE/MAE={rmse_arima_spike/mae_arima_spike:.3f}")
    print(f"Regression: MAE={mae_reg_spike:.2f}, RMSE={rmse_reg_spike:.2f}, RMSE/MAE={rmse_reg_spike/mae_reg_spike:.3f}")

    print("\n📊 ANALYSIS:")
    if rmse_arima_spike/mae_arima_spike > rmse_reg_spike/mae_reg_spike:
        print("❌ ARIMA has the higher RMSE/MAE ratio → it is penalized more heavily at the spike")
        print("   → ARIMA tends to smooth and reacts slowly to spikes")
    else:
        print("❌ Regression has the higher RMSE/MAE ratio → it is penalized more heavily at the spike")
        print("   → Regression may overreact to or underpredict the spike")

    if mae_reg_spike < mae_arima_spike:
        print("\n✅ Regression is MORE STABLE around the spike (lower MAE)")
        print("   → Recent lag features (PM25_lag1) let it track the spike")
    else:
        print("\n✅ ARIMA is MORE STABLE around the spike (lower MAE)")
        print("   → ARIMA captures the pattern of variation better")

## QUESTION 3: For a real deployment, which would you choose and why?

### Multi-criteria analysis

In [None]:
print("=" * 70)
print("QUESTION 3: FOR A REAL DEPLOYMENT, WHICH TO CHOOSE AND WHY?")
print("=" * 70)

print("\n📋 MULTI-CRITERIA ANALYSIS TABLE:")
print("\n" + "-" * 70)
print(f"{'Criterion':<30} {'Regression':<20} {'ARIMA':<20}")
print("-" * 70)
print(f"{'1. Accuracy (RMSE)':<30} {rmse_reg:<20.2f} {rmse_arima:<20.2f}")
print(f"{'2. Stability (MAE)':<30} {mae_reg:<20.2f} {mae_arima:<20.2f}")
print(f"{'3. Spike handling':<30} {'Good (lag features)':<20} {'Smoothing':<20}")
print(f"{'4. Interpretability':<30} {'Feature importance':<20} {'(p,d,q) + CI':<20}")
print(f"{'5. Extensibility':<30} {'Easy to add features':<20} {'Hard (univariate)':<20}")
print(f"{'6. Training time':<30} {'Moderate':<20} {'Slow (grid search)':<20}")
print(f"{'7. Prediction time':<30} {'Fast':<20} {'Fast':<20}")
print(f"{'8. Update frequency':<30} {'Easy (retrain)':<20} {'Hard (re-fit)':<20}")
print(f"{'9. Confidence interval':<30} {'Not built in':<20} {'Built in':<20}")
print("-" * 70)

In [None]:
print("\n💡 RECOMMENDATIONS BY CONTEXT:")
print("\n" + "=" * 70)

print("\n🎯 CONTEXT 1: EARLY WARNING SYSTEM")
print("-" * 70)
print("Goal: ACCURATE forecasts that trigger alerts in time")
print("\n→ CHOOSE: REGRESSION")
print("   Reasons:")
print("   ✅ Lower RMSE and MAE (when the comparison above favors it) → fewer false alarms")
print("   ✅ Weather forecasts can be added as features")
print("   ✅ Reacts quickly to spikes (thanks to lag1)")
print("   ✅ Easy to update the model when new data arrives")

print("\n🎯 CONTEXT 2: RESEARCH & TREND ANALYSIS")
print("-" * 70)
print("Goal: Understand long-term STRUCTURE and TRENDS")
print("\n→ CHOOSE: ARIMA")
print("   Reasons:")
print("   ✅ (p,d,q) describes the structure of the time series")
print("   ✅ Provides confidence intervals for uncertainty")
print("   ✅ Suited to longer-range (multi-step) forecasting")
print("   ✅ A standard tool in econometrics and environmental science")

print("\n🎯 CONTEXT 3: PRODUCTION SYSTEM (REAL-TIME)")
print("-" * 70)
print("Goal: 24/7 forecasting that is reliable and easy to maintain")
print("\n→ CHOOSE: REGRESSION (or HYBRID)")
print("   Reasons:")
print("   ✅ Fast prediction")
print("   ✅ Easy to monitor (feature drift, model drift)")
print("   ✅ Easy to A/B test against other models")
print("   ✅ Can be ensembled with ARIMA if needed")

print("\n🎯 CONTEXT 4: GOVERNMENT / POLICY REPORTING")
print("-" * 70)
print("Goal: Results that can be EXPLAINED to policy makers")
print("\n→ CHOOSE: ARIMA (or both)")
print("   Reasons:")
print("   ✅ A traditional statistical model that is readily accepted")
print("   ✅ Confidence intervals matter for risk assessment")
print("   ✅ Can be explained in terms of 'stationarity', 'trend', 'seasonality'")

In [None]:
print("\n" + "=" * 70)
print("📌 FINAL CONCLUSION")
print("=" * 70)

if rmse_reg < rmse_arima:
    print("\n✅ RECOMMENDATION: REGRESSION (Random Forest)")
    print("\nMain reasons:")
    print(f"  1. Better performance: RMSE={rmse_reg:.2f} < {rmse_arima:.2f}")
    print(f"  2. Well suited to horizon=1 (short-term forecast)")
    print(f"  3. Easy to extend: weather forecasts, traffic, etc. can be added")
    print(f"  4. Easy to deploy and maintain in production")
else:
    print("\n✅ RECOMMENDATION: ARIMA")
    print("\nMain reasons:")
    print(f"  1. Better performance: RMSE={rmse_arima:.2f} <= {rmse_reg:.2f}")
    print(f"  2. Models the time series structure directly")
    print(f"  3. Provides confidence intervals for uncertainty quantification")
    print(f"  4. Suited to multi-step forecasting")

print("\n⚠️ NOTES:")
print("  • Consider an ENSEMBLE (combining both) if resources allow")
print("  • Monitor performance continuously and retrain periodically")
print("  • Add external features (weather, traffic) to improve accuracy")
print("  • Consider deep learning (LSTM, Transformer) for more complex problems")

## SUMMARY OF THE 3 QUESTIONS

### QUESTION 1: Which model is better for horizon=1?
**Conclusion:** Usually **Regression**, because:
- It can use lag features directly (lag1 in particular is very important)
- It combines several sources of information (lag + time + weather)
- Short-term forecasts depend strongly on the most recent value
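
Since short-term forecasts lean so heavily on the most recent value, a naive persistence forecast (predict the next hour as the current hour) is a useful yardstick for both models. The sketch below is an illustrative assumption rather than part of the notebook: the helper name `persistence_forecast` and the readings are made up.

```python
import numpy as np

def persistence_forecast(series, horizon=1):
    """Pair each value with the observation `horizon` steps earlier (naive baseline)."""
    series = np.asarray(series, dtype=float)
    actual = series[horizon:]        # values to be predicted
    predicted = series[:-horizon]    # last known value used as the forecast
    return actual, predicted

pm25 = [35.0, 40.0, 38.0, 90.0, 85.0, 50.0]   # made-up hourly readings
actual, predicted = persistence_forecast(pm25, horizon=1)
mae = np.mean(np.abs(actual - predicted))
print(f"Persistence MAE: {mae:.2f}")
```

A model that cannot beat this baseline at horizon=1 is adding little beyond the latest reading.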

### QUESTION 2: Which model is more stable when spikes occur?
**Conclusion:** It depends:
- **Regression** usually reacts faster (thanks to lag1)
- **ARIMA** tends to smooth and reacts slowly
- Check the RMSE/MAE ratio: a high value means the model is penalized heavily at spikes
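
To see why the RMSE/MAE ratio flags spike errors, compare two error patterns with the same MAE, as in the sketch below. The helper name `rmse_mae_ratio` and the error values are made up for illustration.

```python
import numpy as np

def rmse_mae_ratio(errors):
    """RMSE divided by MAE for a 1-D array of forecast errors."""
    errors = np.asarray(errors, dtype=float)
    rmse = np.sqrt(np.mean(errors ** 2))
    mae = np.mean(np.abs(errors))
    return rmse / mae

uniform_errors = [5.0, 5.0, 5.0, 5.0]    # the same miss every hour
spiky_errors = [1.0, 1.0, 1.0, 17.0]     # same MAE, but one large miss

ratio_uniform = rmse_mae_ratio(uniform_errors)
ratio_spiky = rmse_mae_ratio(spiky_errors)
print(f"uniform: {ratio_uniform:.2f}, spiky: {ratio_spiky:.2f}")
```

Both patterns have an MAE of 5, yet the ratio is exactly 1 for the uniform errors and about 1.71 when a single large miss dominates.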

### QUESTION 3: Which to choose for a real deployment?
**Conclusion:** It depends on the context:
- **Early warning** → Regression (accurate, easy to update)
- **Research** → ARIMA (interpretable, confidence intervals)
- **Production** → Regression or Hybrid
- **Government reporting** → ARIMA (established standard)

**Best practice:** Ensemble both if resources allow!
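
One simple way to ensemble the two models is a weighted average of their forecasts. The sketch below is a minimal illustration under stated assumptions: the helper `blend_forecasts`, the fixed weight, and the forecast values are hypothetical, and both inputs must already be aligned on the same timestamps.

```python
import numpy as np

def blend_forecasts(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two aligned forecast arrays; weight_a is pred_a's share."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    if pred_a.shape != pred_b.shape:
        raise ValueError("forecasts must have the same shape")
    return weight_a * pred_a + (1.0 - weight_a) * pred_b

reg_pred = [50.0, 60.0, 80.0]      # hypothetical regression forecasts
arima_pred = [54.0, 56.0, 70.0]    # hypothetical ARIMA forecasts
blended = blend_forecasts(reg_pred, arima_pred, weight_a=0.5)
print(blended)
```

In practice the weight would be chosen on a validation period before the cutoff rather than on the test set, so the comparison stays fair.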