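### Linear Regression - Model Specification
This notebook demonstrates:

0. A correctly specified model (baseline)
1. An underspecified model
     - Uncorrelated predictors
     - Correlated predictors
2. An overspecified model
     - Perfect multicollinearity
     - Varying degrees of multicollinearity
     - Including an irrelevant variable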

In [1]:
import numpy as np
import statsmodels.api as sm

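### 0. Correctly specified model (baseline)

- We randomly generate 100 observations of uncorrelated X1 and X2 from normal distributions (each with mean 2 and unit variance).

- We specify the true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We correctly include X1 and X2 in the regression model: y ~ X1 + X2 (with intercept)

- We fit 1000 such regression models and examine the bias and precision of the estimates.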

In [2]:
estimates = []
for i in range(1000):
    X1 = np.random.randn(100) + 2
    X2 = np.random.randn(100) + 2
    
    e = np.random.randn(100)
    y = 3 + 2*X1 + 4*X2 + e
    X = sm.add_constant(np.column_stack([X1,X2]))
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)

In [3]:
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))

Average of estimates: [3.0037811  2.00347394 3.99617821]
SE of estimates: [0.30086517 0.09971629 0.10145132]
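
To judge whether the small gaps between the averages and the true values are just simulation noise, one option is to compare each gap with its Monte Carlo standard error (the SD of the estimates divided by the square root of the number of replications). The sketch below is only an illustration: the numbers are copied from the output above, and the variable names are ours.

```python
import numpy as np

# True parameters of the data-generating line y = 3 + 2*X1 + 4*X2 + e
true_params = np.array([3.0, 2.0, 4.0])

# Values copied from the printed output above (1000 replications)
avg_estimates = np.array([3.0037811, 2.00347394, 3.99617821])
sd_estimates = np.array([0.30086517, 0.09971629, 0.10145132])
n_reps = 1000

# Observed deviation of the average estimate from the truth
bias = avg_estimates - true_params

# Monte Carlo standard error of each average
mc_error = sd_estimates / np.sqrt(n_reps)

# Deviation measured in Monte Carlo standard errors
z = bias / mc_error
print("Bias:", bias)
print("Bias in Monte Carlo SE units:", z)
```

All three deviations are well within two Monte Carlo standard errors, which is consistent with the estimates being unbiased.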


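### 1. Underspecified model

#### Uncorrelated predictors (#1)
- We randomly generate 100 observations of uncorrelated X1 and X2 from normal distributions.

- We specify the true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We include only X1 in the regression model: y ~ X1 (with intercept)

- We fit 1000 such regression models and examine the bias.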

In [4]:
estimates = []
for i in range(1000):
    means = [2 , 2]
    cov = [[1,0], [0,1]]
    
    X = np.random.multivariate_normal(means,cov,100)

    e = np.random.randn(100)
    
    y = 3 + 2*X[:,0] + 4*X[:,1] + e
    
    X = sm.add_constant(X[:,0])
    model = sm.OLS(y, X).fit()
    
    estimates.append(model.params)

In [5]:
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))

print("SE of estimates:", np.std(np.array(estimates),axis=0))

Average of estimates: [11.01614906  1.99128492]
SE of estimates: [0.95786376 0.42621356]


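We can see that the intercept estimate is **biased**: it averages about 11.02 against a true value of 3 (it would be unbiased only if the omitted variable had mean zero). The coefficient b1 remains **unbiased**. The SEs of the estimates are much higher than in the correctly specified model.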
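
The size of both effects can be worked out by hand. With X2 omitted and uncorrelated with X1, the term 4*X2 splits into its mean, which is absorbed by the intercept, and its deviation from the mean, which is absorbed by the error term. The sketch below does this arithmetic under the simulation settings above; the SE formulas are large-sample approximations, and the variable names are ours.

```python
import numpy as np

# Settings taken from the simulation: y = 3 + 2*X1 + 4*X2 + e, with X2 omitted
beta0, beta2 = 3.0, 4.0
mean_x1, mean_x2 = 2.0, 2.0
var_x1, var_x2, var_e = 1.0, 1.0, 1.0
n = 100

# The mean of the omitted term shifts the intercept
expected_intercept = beta0 + beta2 * mean_x2

# The deviation of the omitted term inflates the error variance
error_var = beta2**2 * var_x2 + var_e

# Large-sample approximations to the SD of the slope and intercept estimates
approx_se_slope = np.sqrt(error_var / (n * var_x1))
approx_se_intercept = np.sqrt(error_var / n * (1 + mean_x1**2 / var_x1))

print("Expected intercept:", expected_intercept)
print("Error variance:", error_var)
print("Approximate SE (intercept, slope):", approx_se_intercept, approx_se_slope)
```

This gives an expected intercept of 11 and approximate SEs of about 0.92 and 0.41, close to the simulated 0.96 and 0.43.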

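#### Correlated predictors (#2)
- We randomly generate 100 observations of correlated X1 and X2 from a bivariate normal distribution (with unit variances and a covariance of 0.4, so the correlation between X1 and X2 is 0.4).

- We specify the true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We include only X1 in the regression model: y ~ X1 (with intercept)

- We fit 1000 such regression models and examine the bias.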

In [6]:
estimates = []
for i in range(1000):
    means = [2, 2]
    cov = [[1,0.4], [0.4,1]]
    X = np.random.multivariate_normal(means,cov,100)

    e = np.random.randn(100)
    y = 3 + 2*X[:,0] + 4*X[:,1] + e
    X = sm.add_constant(X[:,0])
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)

In [7]:
#The averages of estimates across the 1000 replications are:
np.mean(np.array(estimates),axis=0)

array([7.77702431, 3.60628815])

In [8]:
#The standard deviations of estimates across the 1000 replications are:
np.std(np.array(estimates),axis=0)

array([0.85773046, 0.38963273])

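We can see that the intercept estimate is **biased** (it would be unbiased if, for example, both variables had mean zero), and the coefficient b1 is also **biased** relative to its true value of 2.

Conclusion: when predictors are correlated, omitting one of them biases the estimates for the others.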
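
The averages above agree with the standard omitted-variable-bias formula: the slope on X1 picks up the omitted coefficient times cov(X1, X2) / var(X1), and the intercept adjusts so that the fitted line still passes through the means. A minimal sketch follows; `omitted_variable_limits` is a hypothetical helper name, not part of any library.

```python
def omitted_variable_limits(beta0, beta1, beta2,
                            mean_x1, mean_x2, var_x1, cov_x1_x2):
    """Large-sample values of the intercept and slope when y ~ X1 is fit
    but the true model is y = beta0 + beta1*X1 + beta2*X2 + e."""
    # Slope absorbs the part of X2 that is linearly predictable from X1
    slope = beta1 + beta2 * cov_x1_x2 / var_x1
    # Intercept makes the fitted line pass through (E[X1], E[y])
    mean_y = beta0 + beta1 * mean_x1 + beta2 * mean_x2
    intercept = mean_y - slope * mean_x1
    return intercept, slope


# Correlated case from this section (covariance 0.4)
print(omitted_variable_limits(3, 2, 4, 2, 2, 1, 0.4))

# Uncorrelated case from the previous section (covariance 0)
print(omitted_variable_limits(3, 2, 4, 2, 2, 1, 0.0))
```

For the correlated case this gives roughly (7.8, 3.6), in line with the simulated averages of 7.78 and 3.61; for the uncorrelated case it gives (11, 2).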

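### 2. Overspecified model

#### Perfect multicollinearity (#1)
- We randomly generate 100 data points for X1 from a normal distribution.
- We set X2 = 1 - X1

- We specify the true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We fit a single regression model: y ~ X1 + X2 (with intercept)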

In [9]:
X1 = np.random.randn(100)
X2 = 1 - X1
e = np.random.randn(100)

y = 3 + 2*X1 + 4*X2 + e

X = sm.add_constant(np.column_stack([X1,X2]))

model = sm.OLS(y,X).fit()

In [10]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.826 |
| Model: | OLS | Adj. R-squared: | 0.825 |
| Method: | Least Squares | F-statistic: | 466.2 |
| Date: | Tue, 13 Feb 2024 | Prob (F-statistic): | 4.98e-39 |
| Time: | 09:50:43 | Log-Likelihood: | -141.9 |
| No. Observations: | 100 | AIC: | 287.8 |
| Df Residuals: | 98 | BIC: | 293.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.0724 | 0.075 | 54.100 | 0.000 | 3.923 | 4.222 |
| x1 | 1.0145 | 0.072 | 14.005 | 0.000 | 0.871 | 1.158 |
| x2 | 3.0579 | 0.045 | 67.296 | 0.000 | 2.968 | 3.148 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.402 | Durbin-Watson: | 2.156 |
| Prob(Omnibus): | 0.496 | Jarque-Bera (JB): | 1.313 |
| Skew: | 0.152 | Prob(JB): | 0.519 |
| Kurtosis: | 2.527 | Cond. No. | 1.32e+16 |
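
The very large condition number and `Df Model: 1` in the summary hint that the design matrix is rank deficient. The sketch below, using NumPy only and a seed of our choosing, checks this directly and shows that two different coefficient vectors produce identical fitted values, so the individual coefficients are not identified.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
X1 = rng.standard_normal(100)
X2 = 1 - X1

# Design matrix with a constant column, as built by sm.add_constant
X = np.column_stack([np.ones(100), X1, X2])

# X1 + X2 equals the constant column, so only 2 of the 3 columns are independent
rank = np.linalg.matrix_rank(X)
print("Columns:", X.shape[1], "Rank:", rank)

# Two different coefficient vectors that imply the same regression line:
# 3 + 2*X1 + 4*(1 - X1) = 7 - 2*X1
fitted_a = X @ np.array([3.0, 2.0, 4.0])
fitted_b = X @ np.array([7.0, -2.0, 0.0])
print("Same fitted values:", np.allclose(fitted_a, fitted_b))
```

statsmodels' `OLS.fit` defaults to a pseudoinverse solver (`method="pinv"`), which is presumably why the fit above returns finite coefficients instead of failing. Only combinations of them are meaningful: in the summary, const + x2 is about 7.13 and x1 - x2 is about -2.04, matching the identified line 7 - 2*X1.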


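#### Multicollinearity (#2)
- We randomly generate 100 observations of correlated X1 and X2 from a bivariate normal distribution, with varying degrees of correlation between X1 and X2 (0, 0.3, 0.6, 0.9, 0.95, and 0.99).

- We specify the true regression line: $y$ = 2*$X_1$ + 4*$X_2$ + e

- We include X1 and X2 in the regression model: y ~ X1 + X2 (without an intercept, for simplicity)

- For each preset correlation value, we repeat the simulation 1000 times to generate the sampling distribution of the regression estimates.
- Plot the sampling distributions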

In [18]:
#Write a function to return the sampling distribution of regression coefficients 
# based on different degree of correlation between X1 and X2
def simulation_multi_colinearity(cor):
    params = []
    for i in range(1000):
        means = [2,2]
        cov = [[1,cor], [cor,1]]
        X = np.random.multivariate_normal(means,cov,100)
        e = np.random.randn(100)*2
        y = 2*X[:,0] + 4*X[:,1] + e
        model = sm.OLS(y,X).fit()
        params.append(model.params)
    return params

In [19]:
sampling_dist_0 = simulation_multi_colinearity(0)
sampling_dist_0_3 = simulation_multi_colinearity(0.3)
sampling_dist_0_6 = simulation_multi_colinearity(0.6)
sampling_dist_0_9 = simulation_multi_colinearity(0.9)
sampling_dist_0_95 = simulation_multi_colinearity(0.95)
sampling_dist_0_99 = simulation_multi_colinearity(0.99)

In [20]:
print("SE of estimates when X1 and X2 have a cor of 0:", np.std(sampling_dist_0,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.3:", np.std(sampling_dist_0_3,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.6:", np.std(sampling_dist_0_6,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.9:", np.std(sampling_dist_0_9,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.95:", np.std(sampling_dist_0_95,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.99:", np.std(sampling_dist_0_99,axis=0))

SE of estimates when X1 and X2 have a cor of 0: [0.14924138 0.15362695]
SE of estimates when X1 and X2 have a cor of 0.3: [0.18184739 0.18149639]
SE of estimates when X1 and X2 have a cor of 0.6: [0.23542159 0.23302496]
SE of estimates when X1 and X2 have a cor of 0.9: [0.46005811 0.46522827]
SE of estimates when X1 and X2 have a cor of 0.95: [0.62053723 0.61689469]
SE of estimates when X1 and X2 have a cor of 0.99: [1.5058649  1.50165901]


In [22]:
print("Mean of estimates when X1 and X2 have a cor of 0:", np.mean(sampling_dist_0,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.3:", np.mean(sampling_dist_0_3,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.6:", np.mean(sampling_dist_0_6,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.9:", np.mean(sampling_dist_0_9,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.95:", np.mean(sampling_dist_0_95,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.99:", np.mean(sampling_dist_0_99,axis=0))

Mean of estimates when X1 and X2 have a cor of 0: [1.99528853 4.00220908]
Mean of estimates when X1 and X2 have a cor of 0.3: [1.99374527 4.00339456]
Mean of estimates when X1 and X2 have a cor of 0.6: [2.01461324 3.98911849]
Mean of estimates when X1 and X2 have a cor of 0.9: [2.02487827 3.97569048]
Mean of estimates when X1 and X2 have a cor of 0.95: [2.04199266 3.95822687]
Mean of estimates when X1 and X2 have a cor of 0.99: [1.96516269 4.03487359]


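We can observe that, whatever the degree of correlation between the two predictors in the model, the regression coefficients remain **unbiased**, but the standard errors of the estimates are **inflated** as the correlation increases.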
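
The inflation can be cross-checked against theory. In large samples, the covariance of the OLS estimates is approximately sigma^2 / n times the inverse of E[x x'], the second-moment matrix of the predictors (uncentered here, because the model has no intercept). The sketch below assumes the simulation settings above (means of 2, unit variances, noise SD of 2, n = 100); `approx_se` is a hypothetical helper name.

```python
import numpy as np

def approx_se(cor, n=100, sigma=2.0, mean=2.0):
    """Large-sample SD of the two slope estimates in y ~ X1 + X2 (no intercept)."""
    cov = np.array([[1.0, cor], [cor, 1.0]])
    mu = np.array([mean, mean])
    # Uncentered second moments: E[x x'] = Cov(x) + mu mu'
    second_moment = cov + np.outer(mu, mu)
    cov_beta = sigma**2 / n * np.linalg.inv(second_moment)
    return np.sqrt(np.diag(cov_beta))

for cor in [0, 0.3, 0.6, 0.9, 0.95, 0.99]:
    print("cor =", cor, "approximate SE:", approx_se(cor))
```

This yields about 0.149 at a correlation of 0 and about 1.41 at 0.99, against simulated values of roughly 0.15 and 1.51. The approximation ignores finite-sample variation in the predictors, so it runs slightly below the simulation.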

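#### Adding an irrelevant predictor (#3)
- We randomly generate 100 data points each for X1, X2, and X3 from normal distributions.

- We specify the true regression line: $y$ = 3 + 2*$X_1$ + 4*$X_2$ + e

- We include X1, X2, and X3 in the regression model: y ~ X1 + X2 + X3 (with intercept). Here X3 does not belong in the model but is included anyway.

- We fit 1000 such regression models and examine the bias.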

In [15]:
estimates = []
for i in range(1000):
    X1 = np.random.randn(100)
    X2 = np.random.randn(100)
    X3 = np.random.randn(100)
    e = np.random.randn(100)

    y = 3 + 2*X1 + 4*X2 + e

    X = sm.add_constant(np.column_stack([X1,X2,X3]))
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)

In [16]:
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))

Average of estimates: [ 3.00132214e+00  1.99576168e+00  3.99886082e+00 -2.36659646e-03]
SE of estimates: [0.10146959 0.09980508 0.10321398 0.1018764 ]


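We find that the estimates for X1 and X2 are **unbiased**, and their standard errors are also small. The estimate for X3 is almost zero on average, and if we inspect one of the 1000 models, we find that this estimate is not statistically significant. In this sense, including irrelevant random data in the model does little harm.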

In [17]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.962 |
| Model: | OLS | Adj. R-squared: | 0.961 |
| Method: | Least Squares | F-statistic: | 806.9 |
| Date: | Tue, 13 Feb 2024 | Prob (F-statistic): | 6.27e-68 |
| Time: | 09:50:44 | Log-Likelihood: | -138.6 |
| No. Observations: | 100 | AIC: | 285.2 |
| Df Residuals: | 96 | BIC: | 295.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.0338 | 0.099 | 30.704 | 0.000 | 2.838 | 3.230 |
| x1 | 1.9591 | 0.094 | 20.781 | 0.000 | 1.772 | 2.146 |
| x2 | 3.9472 | 0.090 | 43.918 | 0.000 | 3.769 | 4.126 |
| x3 | 0.1564 | 0.099 | 1.580 | 0.117 | -0.040 | 0.353 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.692 | Durbin-Watson: | 2.382 |
| Prob(Omnibus): | 0.26 | Jarque-Bera (JB): | 2.525 |
| Skew: | -0.387 | Prob(JB): | 0.283 |
| Kurtosis: | 2.92 | Cond. No. | 1.2 |
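
A single summary shows only one draw. To see how often the irrelevant X3 would look significant across replications, the t-test can be repeated in a loop. The sketch below is an illustration using NumPy only: it computes the OLS t-statistic by hand, uses a seed of our choosing, and hardcodes 1.985 as an approximate two-sided 5% critical value for a t distribution with 96 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
n, reps = 100, 1000
t_crit = 1.985  # approximate two-sided 5% critical value, t with 96 df

rejections = 0
for _ in range(reps):
    X1, X2, X3 = rng.standard_normal((3, n))
    e = rng.standard_normal(n)
    y = 3 + 2*X1 + 4*X2 + e  # X3 plays no role in the true model

    X = np.column_stack([np.ones(n), X1, X2, X3])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Classical (nonrobust) covariance of the estimates
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)

    # t-statistic for the X3 coefficient
    t_x3 = beta[3] / np.sqrt(cov_beta[3, 3])
    rejections += int(abs(t_x3) > t_crit)

rejection_rate = rejections / reps
print("Share of replications where X3 appears significant:", rejection_rate)
```

Because the true coefficient on X3 is zero, the rejection rate should sit near the nominal 5% level, meaning the irrelevant predictor is flagged as significant only about as often as chance allows.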
