<a href="https://colab.research.google.com/github/thisisreallife/Medium/blob/master/Bias_Variance_Decoposition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [17]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb

In [13]:
from sklearn.model_selection import train_test_split
from mlxtend.evaluate import bias_variance_decomp
from mlxtend.data import boston_housing_data

In [51]:
import warnings
# Suppress scikit-learn deprecation warnings
warnings.filterwarnings("ignore", category=FutureWarning)

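This document describes the bias-variance decomposition. Suppose $\theta$ is the parameter to be estimated and $\hat{\theta}$ is our estimator, with $Bias(\hat\theta) = E(\hat\theta) - \theta$. If accuracy is measured by mean squared error (MSE), the MSE splits into a variance part and a squared-bias part.
$$ MSE(\hat{\theta};\theta) = E[(\hat{\theta} - \theta)^2]\\
= E[(\hat{\theta} - E(\hat\theta) + E(\hat\theta) - \theta)^2]\\
= E[(\hat{\theta} - E(\hat\theta))^2 + (E(\hat\theta) - \theta)^2 + 2(\hat{\theta} - E(\hat\theta))(E(\hat\theta) - \theta)]\\
= Var(\hat\theta) + Bias(\hat\theta)^2 $$
The cross term drops out because $E(\hat\theta) - \theta$ is a constant and $E[\hat{\theta} - E(\hat\theta)] = 0$.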
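
As a quick numerical sanity check of this decomposition, the sketch below simulates a deliberately biased estimator and compares its empirical MSE with its variance plus squared bias. The estimator (a sample mean shrunk by 0.8), the true value 2.0, and the sample size are illustrative assumptions, not part of the notebook. It uses only the standard library.

```python
import random
import statistics

random.seed(0)

theta = 2.0      # true parameter (illustrative assumption)
n = 5            # observations per simulated dataset
trials = 20000   # number of simulated datasets

# Estimator: a shrunken sample mean, which is biased on purpose.
estimates = []
for _ in range(trials):
    sample = [random.gauss(theta, 1.0) for _ in range(n)]
    estimates.append(0.8 * statistics.fmean(sample))

mean_est = statistics.fmean(estimates)
mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
variance = statistics.fmean([(e - mean_est) ** 2 for e in estimates])
bias_sq = (mean_est - theta) ** 2

print(f"MSE          = {mse:.4f}")
print(f"Var + Bias^2 = {variance + bias_sq:.4f}")
```

For this setup the theoretical values are $Var = 0.8^2 / 5 = 0.128$ and $Bias^2 = (0.8 \cdot 2 - 2)^2 = 0.16$, so both printed numbers should be close to 0.288.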

# Synthetic Data

In [58]:
# Set seed for reproducibility
np.random.seed(42)

# Generate 10,000 samples with 5 numeric features
n_samples = 10000
X = pd.DataFrame(np.random.randn(n_samples, 5), columns=['x1', 'x2', 'x3', 'x4', 'x5'])

# Define Y as a non-linear function of the features + noise
Y = (
    10 * X['x1'] +
    5 * X['x2']**2 +
    3 * X['x3'] * X['x4'] +
    np.exp(X['x5']) +
    np.random.normal(0, 10, n_samples)  # Noise
)
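
Because the noise term in `Y` is drawn with standard deviation 10, its variance (100) is the irreducible error for this synthetic dataset: even a model that knew the true function would have an MSE near 100. A minimal sketch of that check follows. It re-creates the same formula with the standard library only, so its random draws differ from the NumPy cell above, and the helper name `true_signal` is hypothetical.

```python
import math
import random

random.seed(42)

def true_signal(x1, x2, x3, x4, x5):
    """Noise-free part of Y, copied from the formula above."""
    return 10 * x1 + 5 * x2 ** 2 + 3 * x3 * x4 + math.exp(x5)

n_samples = 10000
noise_sd = 10.0

squared_errors = []
for _ in range(n_samples):
    features = [random.gauss(0.0, 1.0) for _ in range(5)]
    signal = true_signal(*features)
    y = signal + random.gauss(0.0, noise_sd)
    # An "oracle" that predicts the noise-free signal still misses by the noise.
    squared_errors.append((y - signal) ** 2)

oracle_mse = sum(squared_errors) / n_samples
print(f"Oracle MSE: {oracle_mse:.1f} (noise variance = {noise_sd ** 2:.0f})")
```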

# Bias Variance Decomposition

Boosting Reduces Bias, Bagging Reduces Variance: Explanation

1. Bagging Reduces Variance
Mechanism:

Trains multiple models (e.g., decision trees) independently on bootstrapped subsets of the data.

Predictions are averaged (regression) or majority-voted (classification).

Why Variance Decreases:

High-variance models (e.g., deep trees) overfit to noise in individual subsets.

Averaging predictions cancels out "noise" across models, stabilizing the final prediction.

Example:
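Random Forest averages predictions from many decorrelated trees (each tree sees a bootstrap sample and a random subset of features, so the trees are only weakly correlated rather than fully independent).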
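
The bagging mechanism described in this section can be sketched by hand. The example below bags a deliberately high-variance base learner and measures how much the prediction at a single point varies across freshly drawn training sets. The base learner is a 1-nearest-neighbour regressor on one feature, chosen here only for brevity; a Random Forest bags decision trees instead. All names and constants are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(0)

def make_data(n=40):
    """Noisy samples of y = sin(x) on [0, 6]."""
    xs = [random.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0.0, 0.5)) for x in xs]

def predict_1nn(data, x0):
    """Return the target of the training point closest to x0."""
    return min(data, key=lambda p: abs(p[0] - x0))[1]

def predict_bagged(data, x0, n_models=25):
    """Average 1-NN predictions over bootstrap resamples of the data."""
    preds = []
    for _ in range(n_models):
        boot = random.choices(data, k=len(data))  # sample with replacement
        preds.append(predict_1nn(boot, x0))
    return statistics.fmean(preds)

x0 = 3.0
single, bagged = [], []
for _ in range(300):          # 300 independent training sets
    data = make_data()
    single.append(predict_1nn(data, x0))
    bagged.append(predict_bagged(data, x0))

var_single = statistics.pvariance(single)
var_bagged = statistics.pvariance(bagged)
print(f"variance of single 1-NN: {var_single:.3f}")
print(f"variance of bagged 1-NN: {var_bagged:.3f}")
```

The bagged predictor should show a clearly smaller variance than the single model, although not 25 times smaller, because bootstrap resamples of the same data are correlated.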

2. Boosting Reduces Bias
Mechanism:

Trains models sequentially, where each new model corrects errors of previous ones.

In AdaBoost-style boosting, misclassified instances are reweighted so that later models focus on hard-to-predict cases.

Why Bias Decreases:

Weak learners (e.g., shallow trees) start with high bias (underfitting).

Sequential correction improves approximation of the true relationship in the data.

Example:
Gradient Boosting iteratively fits residuals (errors) of prior models.
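
A minimal hand-rolled sketch of that residual-fitting loop, using decision stumps and squared loss on a one-dimensional toy problem. It is an illustration under assumed settings (learning rate 0.3, 50 rounds, hypothetical helper names), not how scikit-learn or LightGBM implement boosting internally.

```python
import math
import random

random.seed(0)

# Toy data: y = sin(x) plus a little noise, on an evenly spaced grid.
xs = [i * 6.0 / 99 for i in range(100)]
ys = [math.sin(x) + random.gauss(0.0, 0.1) for x in xs]

def fit_stump(xs, residuals):
    """Find the single split on x that minimises squared error of the residuals."""
    best = None
    for threshold in xs[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    """Predict with a fitted stump: one constant on each side of the split."""
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

learning_rate = 0.3
preds = [sum(ys) / len(ys)] * len(ys)   # round 0: predict the mean
train_mse = []
for _ in range(50):
    residuals = [y - p for y, p in zip(ys, preds)]
    train_mse.append(sum(r * r for r in residuals) / len(ys))
    stump = fit_stump(xs, residuals)    # weak learner fitted to current errors
    preds = [p + learning_rate * stump_predict(stump, x)
             for p, x in zip(preds, xs)]

print(f"training MSE after  0 rounds: {train_mse[0]:.4f}")
print(f"training MSE after 49 rounds: {train_mse[-1]:.4f}")
```

Each stump on its own underfits badly, but the training error never increases from one round to the next, which is the bias-reduction effect described above.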

3. Key Differences

| Aspect | Bagging | Boosting |
| --- | --- | --- |
| Training | Parallel (models independent) | Sequential (models dependent) |
| Focus | Reduce overfitting (variance) | Reduce underfitting (bias) |
| Base Models | High-variance (e.g., deep trees) | High-bias (e.g., stumps) |
| Prediction | Average/majority vote | Weighted sum of model outputs |

4. Mathematical Intuition
Bias-Variance Decomposition:

$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

Bagging:

Reduces variance through averaging:
$$Var\left(\frac{1}{N}\sum_{i=1}^{N} f_i(x)\right) \approx \frac{1}{N} Var(f(x))$$


Boosting:

Reduces $\text{Bias}^2$ by iteratively minimizing residuals (e.g., gradient descent in function space).
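
The $1/N$ approximation for bagging holds when the individual models' errors are roughly uncorrelated. For predictors with common variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + (1-\rho)\sigma^2/N$, so correlated models benefit much less from averaging. A small simulation sketch of both cases; the values of `N` and `rho` and the shared-noise construction are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(0)

def variance_of_average(n_models, rho, trials=20000):
    """Simulate the variance of the mean of n_models unit-variance predictions
    whose pairwise correlation is rho (built from one shared noise term)."""
    averages = []
    for _ in range(trials):
        shared = random.gauss(0.0, 1.0)
        preds = [math.sqrt(rho) * shared
                 + math.sqrt(1.0 - rho) * random.gauss(0.0, 1.0)
                 for _ in range(n_models)]
        averages.append(statistics.fmean(preds))
    return statistics.pvariance(averages)

N = 10
var_independent = variance_of_average(N, rho=0.0)
var_correlated = variance_of_average(N, rho=0.5)

print(f"rho = 0.0: simulated {var_independent:.3f}, formula {1.0 / N:.3f}")
print(f"rho = 0.5: simulated {var_correlated:.3f}, formula {0.5 + 0.5 / N:.3f}")
```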

5. Practical Tradeoffs
Boosting Risks:

Overfitting if iterations are excessive (increasing variance).

Bagging Limits:

Fails to reduce bias if base models are too simple.

6. Example Workflow
```python
# Bagging (Random Forest)
from sklearn.ensemble import RandomForestRegressor
bagging_model = RandomForestRegressor(n_estimators=100, max_depth=10)  # Reduces variance

# Boosting (Gradient Boosting)
from sklearn.ensemble import GradientBoostingRegressor
boosting_model = GradientBoostingRegressor(n_estimators=100, max_depth=3)  # Reduces bias
```

7. When to Use Which
Use bagging if:

Your model overfits (low training error but high validation error).

Base learners are high-variance (e.g., deep decision trees).

Use boosting if:

Your model underfits (high training error).

Base learners are high-bias (e.g., shallow trees).

By leveraging these mechanisms, you can systematically address bias or variance issues in your models.

In [None]:
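# bootstrapping rounds
M = 100
for i in range(M):
    # Draw one bootstrap sample: row indices chosen with replacement
    idx = np.random.choice(n_samples, size=n_samples, replace=True)
    X_boot, Y_boot = X.iloc[idx], Y.iloc[idx]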


In [59]:
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model_bagging = RandomForestRegressor(n_estimators = 1)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model_bagging, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 36.068
Average bias: 15.440
Average variance: 20.627


In [71]:
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model_bagging = RandomForestRegressor(n_estimators = 10,n_jobs = -1)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model_bagging, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 19.848
Average bias: 15.158
Average variance: 4.690


In [78]:
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model_bagging = RandomForestRegressor(n_estimators = 100, n_jobs = -1)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model_bagging, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 18.718
Average bias: 15.527
Average variance: 3.191


In [77]:
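# bias_variance_decomp with loss='mse' already returns the squared-bias term as
# avg_bias, so the quantities expected to match are avg_bias + avg_var and
# avg_expected_loss; squaring each term, as done here, does not give an equality.
avg_bias**2 + avg_var**2 , avg_expected_loss**2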

(np.float64(248.18593771160886), np.float64(350.74220778291203))
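
The printed outputs of the three Random Forest runs above are consistent with `avg_bias` already being the squared-bias term, so the identity to check is expected loss ≈ bias + variance, with no extra squaring. A short check that uses only those printed numbers (the variable names are illustrative):

```python
# (n_estimators, expected loss, bias, variance) copied from the outputs above
rf_runs = [
    (1, 36.068, 15.440, 20.627),
    (10, 19.848, 15.158, 4.690),
    (100, 18.718, 15.527, 3.191),
]

gaps = []
for n_estimators, loss, bias, var in rf_runs:
    gap = loss - (bias + var)
    gaps.append(gap)
    print(f"n_estimators={n_estimators:>3}: bias + var = {bias + var:.3f}, "
          f"loss = {loss:.3f}, gap = {gap:+.3f}")
```

The gaps stay at the level of rounding in the printed values, while the variance term falls from 20.627 to 3.191 as trees are added and the bias term stays near 15.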

In [80]:
np.array([[1,2,3], [4,5,6]]).shape

(2, 3)

In [62]:
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model_boosting = lgb.LGBMRegressor(n_estimators = 1, learning_rate = 0.01)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model_boosting, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 80.506
Average bias: 80.272
Average variance: 0.235


In [63]:
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model_boosting = lgb.LGBMRegressor(n_estimators = 10, learning_rate = 0.01)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model_boosting, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 71.001
Average bias: 70.750
Average variance: 0.251


In [64]:
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model_boosting = lgb.LGBMRegressor(n_estimators = 100, learning_rate = 0.01)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model_boosting, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 30.290
Average bias: 29.166
Average variance: 1.123


In [65]:
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model_boosting = lgb.LGBMRegressor(n_estimators = 500, learning_rate = 0.01)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model_boosting, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 19.407
Average bias: 17.115
Average variance: 2.292


# Overfitting and Underfitting

In [66]:
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model_boosting = lgb.LGBMRegressor(n_estimators = 1, learning_rate = 0.1)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model_boosting, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 70.558
Average bias: 70.295
Average variance: 0.263


In [67]:
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model_boosting = lgb.LGBMRegressor(n_estimators = 500, learning_rate = 0.1)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model_boosting, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 18.317
Average bias: 14.507
Average variance: 3.809


In [69]:
X, y = boston_housing_data()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True)

model_boosting = lgb.LGBMRegressor(n_estimators = 5000, learning_rate = 0.1)

avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        model_boosting, X_train, y_train, X_test, y_test,
        loss='mse',
        random_seed=123)

print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

Average expected loss: 18.359
Average bias: 14.357
Average variance: 4.002
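
Reading the three `learning_rate = 0.1` runs above side by side: going from 500 to 5000 trees barely lowers the bias term while the variance term keeps growing, so the expected loss stops improving. A small sketch of that arithmetic, using only the printed numbers (the variable names are illustrative):

```python
# (n_estimators, expected loss, bias, variance) copied from the outputs above
lgb_runs = [
    (1, 70.558, 70.295, 0.263),
    (500, 18.317, 14.507, 3.809),
    (5000, 18.359, 14.357, 4.002),
]

changes = []
for prev, curr in zip(lgb_runs, lgb_runs[1:]):
    d_loss = curr[1] - prev[1]
    d_bias = curr[2] - prev[2]
    d_var = curr[3] - prev[3]
    changes.append((d_loss, d_bias, d_var))
    print(f"{prev[0]:>4} -> {curr[0]:>4} trees: "
          f"loss {d_loss:+.3f}, bias {d_bias:+.3f}, variance {d_var:+.3f}")
```

In the last step the small gain in bias is outweighed by the rise in variance, which is the early sign of overfitting from excessive boosting rounds.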
