# Basic Theory of Simple Linear Regression Analysis
## Simple Linear Regression Model

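Relationships between variables fall into two kinds:
- Deterministic (functional) relationships: relationships between non-random variables that describe deterministic phenomena.
  $$ \text{area of a circle} = \pi \times \text{radius}^{2} $$
- Statistical dependence (correlation): relationships between random variables that describe non-deterministic phenomena.
  $$ \text{crop yield} = f(\text{temperature}, \text{rainfall}, \text{sunshine}, \text{fertilizer}) $$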

Regression analysis is the body of methods and theory for studying how one variable depends on one or more other variables.

The simple linear regression model is the most basic regression model. Its form is
$$
Y=\beta_{0}+\beta_{1} X
$$
where $Y$ is the explained variable, also called the dependent variable, and $X$ is the explanatory variable, also called the independent variable. The model is a single-equation linear function of $X$.

$\beta_{0}$ is the constant or intercept term; it is the value of $Y$ when $X$ is 0. $\beta_{1}$ is the slope coefficient; it is the amount by which $Y$ increases when $X$ increases by one unit.

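Besides the variation in the dependent variable $Y$ that comes from the independent variable $X$, there is also variation that comes from other factors. The random error term accounts for all the variation in $Y$ that $X$ cannot explain. It is usually denoted $\varepsilon$, and sometimes by another symbol such as $u$.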

Adding a random error term to the equation turns it into a typical regression equation:
$$
Y=\beta_{0}+\beta_{1} X+\varepsilon
$$

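$\beta_{0}+\beta_{1} X$ is the deterministic part of the regression equation: it gives the value of $Y$ determined by a given non-random value of $X$. This deterministic part is the expected value of $Y$ for a given $X$, that is, the mean of all $Y$ values that correspond to that particular $X$. It can be written as: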
$$
E(Y \mid X)=\beta_{0}+\beta_{1} X
$$
Unfortunately, in practice the observed values of $Y$ will not exactly equal the expected value $E(Y \mid X)$, so a random error term is included in the equation:
$$
Y=E(Y \mid X)+\varepsilon=\beta_{0}+\beta_{1} X+\varepsilon
$$
If we index the individual observations by $i$, the single-equation linear regression model can be written as:
$$
Y_{i}=\beta_{0}+\beta_{1} X_{i}+\varepsilon_{i} \quad(i=1, \cdots, N)
$$
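This equation is called the theoretical regression equation, where $Y_{i}$ is the $i$-th observation of the dependent variable $Y$, $X_{i}$ is the $i$-th observation of the explanatory variable $X$, $\varepsilon_{i}$ is the $i$-th value of the random error term, $\beta_{0}, \beta_{1}$ are the regression coefficients, and $N$ is the number of observations.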
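
To make the notation concrete, the following minimal sketch simulates data from the theoretical regression equation. The parameter values, the sample size, and the uniform distribution of $X$ are assumptions chosen purely for illustration; they do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, used only for reproducibility

# Illustrative (assumed) population parameters
beta0, beta1, sigma = 1.0, 2.0, 0.5
N = 200

X = rng.uniform(0.0, 10.0, size=N)    # explanatory variable
eps = rng.normal(0.0, sigma, size=N)  # random error with E(eps) = 0
Y = beta0 + beta1 * X + eps           # Y_i = beta0 + beta1 * X_i + eps_i

cond_mean = beta0 + beta1 * X         # deterministic part E(Y | X)
deviation = Y - cond_mean             # equals eps by construction
```

Each observed `Y` differs from its conditional mean by exactly the simulated error, which is the decomposition $Y = E(Y \mid X) + \varepsilon$ written out numerically.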

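## Estimating the Coefficients

Assume that the error term has zero mean:
$$
\mathrm{E}(\varepsilon)=0
$$

Given $m$ observations $(x_i, y_i)$, the residual sum of squares is: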
$$
S(\beta)=\sum_{i=1}^{m}\left(y_{i}-x_{i} \beta_1-\beta_{0}\right)^{2}
$$
We want the $\beta_1$ and $\beta_{0}$ that minimize this objective function. Taking the partial derivative with respect to each of them gives:
$$
\begin{aligned}
\frac{\partial S(\beta)}{\partial \beta_1} &=\sum_{i=1}^{m} 2\left(y_{i}-x_{i} \beta_1-\beta_{0}\right)\left(-x_{i}\right) \\
&=\sum_{i=1}^{m}(-2)\left(x_{i} y_{i}-x_{i}^{2} \beta_1 - x_{i} \beta_{0} \right) \\
&=2 \sum_{i=1}^{m}\left(x_{i}^{2} \beta_1+x_{i}\beta_{0} -x_{i} y_{i}\right) \\
\frac{\partial S(\beta)}{\partial \beta_{0}} &=\sum_{i=1}^{m} 2\left(y_{i}-x_{i} \beta_1-\beta_{0}\right)(-1) \\
&=\sum_{i=1}^{m}(-2)\left(y_{i}-x_{i} \beta_1-\beta_{0}\right) \\
&=2 \sum_{i=1}^{m}\left(x_{i} \beta_1+\beta_{0}-y_{i}\right) \\
&=2\left(m \beta_1 \frac{\sum_{i=1}^{m}\left(x_{i}\right)}{m}+m \beta_{0}-m \frac{\sum_{i=1}^{m} y_{i}}{m}\right)
\end{aligned}
$$

Let $\bar{x}=\frac{\sum_{i=1}^{m} x_{i}}{m}$ and $\bar{y}=\frac{\sum_{i=1}^{m} y_{i}}{m}$.

Then the second partial derivative above becomes:
$$
\frac{\partial S(\beta)}{\partial \beta_{0}}=2 m\left(\beta_1 \bar{x}+\beta_{0}-\bar{y}\right)
$$
Setting this second partial derivative to zero:
$$
\begin{gathered}
2 m\left(\beta_1 \bar{x}+\beta_{0}-\bar{y}\right)=0 \\
\beta_{0}=\bar{y}-\beta_1 \bar{x}
\end{gathered}
$$
Setting the first partial derivative to zero and substituting the expression for $\beta_{0}$ above gives:

$$
\begin{gathered}
\frac{\partial S(\beta)}{\partial \beta_1}=0\\
2 \sum_{i=1}^{m}\left[x_{i}^{2} \beta_1+(\bar{y}-\beta_1 \bar{x}) x_{i}-x_{i} y_{i}\right]=0\\
\beta_1\left(\sum_{i=1}^{m} x_{i}^{2}-\bar{x} \sum_{i=1}^{m} x_{i}\right)=\sum_{i=1}^{m} x_{i} y_{i}-\bar{y} \sum_{i=1}^{m} x_{i}\\
\beta_1=\frac{\sum_{i=1}^{m} x_{i} y_{i}-\bar{y} \sum_{i=1}^{m} x_{i}}{\sum_{i=1}^{m} x_{i}^{2}-\bar{x} \sum_{i=1}^{m}x_i}\\
\beta_1=\frac{\sum_{i=1}^{m} x_{i} y_{i}-\bar{y} \sum_{i=1}^{m} x_{i}-m \bar{y} \bar{x}+m \bar{y} \bar{x}}{\sum_{i=1}^{m} x_{i}^{2}-2 \bar{x} \sum_{i=1}^{m}x_i + \bar{x} \sum_{i=1}^{m}x_i}\\
\beta_1=\frac{\sum_{i=1}^{m} x_{i} y_{i}-\bar{y} \sum_{i=1}^{m} x_{i}-\sum_{i=1}^{m} y_{i} \bar{x}+m \bar{y} \bar{x}}{\sum_{i=1}^{m} x_{i}^{2}-2 \bar{x} \sum_{i=1}^{m} x_{i}+m \bar{x}^{2}}\\
\beta_1=\frac{\sum_{i=1}^{m}\left(x_{i} y_{i}-\bar{y} x_{i}-y_{i} \bar{x}+\bar{y} \bar{x}\right)}{\sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)^{2}}\\
\beta_1=\frac{\sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{m}\left(x_{i}-\bar{x}\right)^{2}}
\end{gathered}
$$
This gives both $\beta_1$ and $\beta_{0}$.
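
The closed-form result can be checked numerically. The sketch below implements the two formulas with NumPy and compares them with `np.polyfit` as an independent reference; the function name `ols_simple` and the toy data are hypothetical and serve only as an illustration.

```python
import numpy as np

def ols_simple(x, y):
    """Closed-form OLS estimates (beta0, beta1) for y = beta0 + beta1 * x + e."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # beta1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # beta0 = y_bar - beta1 * x_bar
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b0, b1 = ols_simple(x, y)
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)  # independent check
print(b0, b1)
```

At the estimates, both first-order conditions hold: the residuals sum to zero and are orthogonal to `x`.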

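## Estimating the Variance of the Coefficient

First, we rewrite the estimator above (here $n$ denotes the sample size, written $m$ in the previous section):

- From the population regression line, $y_{i}=\beta_{0}+\beta_{1} x_{i}+\epsilon_{i}$. Summing both sides over $i=1, \ldots, n$ and dividing by $n$ gives $\bar{y}=\beta_{0}+\beta_{1}\bar{x}+\bar{\epsilon}$. Because the errors have mean zero by assumption, $\bar{\epsilon}$ is dropped and we use $\bar{y}=\beta_{0}+\beta_{1}\bar{x}$. This is harmless in what follows, since $\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\bar{\epsilon}=0$ exactly.

- Rewriting $\hat{\beta}_{1}$: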

$$
\hat{\beta}_{1}=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(\beta_{0}+\beta_{1} x_{i}+\epsilon_{i}-\beta_{0}-\beta_{1} \bar{x}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}=\beta_{1}+\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right) \epsilon_{i}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}
$$

We will also use two properties of the variance:

$$
\begin{gathered}
\operatorname{Var}(x+y)=\operatorname{Var}(x)+\operatorname{Var}(y)+2 \operatorname{Cov}(x, y)\\
\operatorname{Var}(a x)=a^{2} \operatorname{Var}(x), \quad a \in \mathbb{R}
\end{gathered}
$$

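Taking the variance of both sides:
$$
\operatorname{Var}\left(\hat{\beta}_{1}\right)=\operatorname{Var}\left(\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right) \epsilon_{i}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}\right)
$$

The $\beta_{1}$ term drops out because it is a constant (albeit an unknown one), so it contributes nothing to the variance.

Next,
$$
\operatorname{Var}\left(\hat{\beta}_{1}\right)=\operatorname{Var}\left(\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right) \epsilon_{i}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}\right)=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \operatorname{Var}\left(\epsilon_{i}\right)}{\left(\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}\right)^{2}}
$$
Note this step: all the $x_{i}$ and $\bar{x}$ are treated as known numbers, so only the numerator matters. By assumption the $\epsilon_{i}$ are uncorrelated, so their pairwise covariances are zero and the $\operatorname{Cov}$ terms in the variance-of-a-sum property vanish.


Finally, if the errors share a common variance, $\operatorname{Var}\left(\epsilon_{i}\right)=\sigma^{2}$ for all $i$, this simplifies to
$$
\operatorname{Var}\left(\hat{\beta}_{1}\right)=\frac{\sigma^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}
$$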
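
The variance formula can be checked with a small Monte Carlo experiment. The sketch below holds the regressor fixed, redraws the errors many times, and compares the empirical variance of the slope estimates with $\sigma^{2} / \sum (x_i - \bar{x})^2$. All parameter values, the grid of `x` values, and the number of simulations are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

beta0, beta1, sigma = 1.0, 0.5, 2.0  # assumed values for the simulation
x = np.linspace(0.0, 10.0, 30)       # fixed (non-random) regressor
sxx = np.sum((x - x.mean()) ** 2)
theoretical_var = sigma ** 2 / sxx

n_sims = 20000
# Uncorrelated errors with common variance sigma^2; each row is one sample
eps = rng.normal(0.0, sigma, size=(n_sims, x.size))
y = beta0 + beta1 * x + eps
y_bar = y.mean(axis=1, keepdims=True)

# Slope estimate for every simulated sample
beta1_hat = ((x - x.mean()) * (y - y_bar)).sum(axis=1) / sxx

empirical_var = beta1_hat.var()
print(theoretical_var, empirical_var)
```

The two printed numbers should be close, and the average of `beta1_hat` should be close to `beta1`, which reflects the unbiasedness visible in the rewritten form of $\hat{\beta}_{1}$.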


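## A Simple Univariate OLS Regression


The momentum effect is also commonly called the "inertia effect". It was proposed by Jegadeesh and Titman (1993) and refers to the tendency of stock returns to continue in their previous direction: stocks with higher returns over a past period go on to earn higher returns than stocks whose past returns were lower.

In the model below, we consider the time-series momentum effect in Chinese stock market returns.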

In [1]:
import numpy as np  # core numerical library
import pandas as pd  # core data-handling library
import scipy.stats as stats  # statistics
import scipy
# import pymysql  # database access

from datetime import datetime  # date and time handling
import statsmodels.formula.api as smf  # OLS regression

import pyreadr  # read RDS files

from matplotlib import style
import matplotlib.pyplot as plt  # plotting
import matplotlib.dates as mdates


from matplotlib.font_manager import FontProperties  # fonts for Chinese text in plots
from pylab import mpl
#mpl.rcParams['font.sans-serif'] = ['SimHei']
#plt.rcParams['font.family'] = 'Times New Roman'


# Render figures inline as vector graphics (SVG)
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

from IPython.core.interactiveshell import InteractiveShell  # controls Jupyter cell output
# Display the result of every expression in a cell, not only the last one
InteractiveShell.ast_node_interactivity = 'all'

# Do not limit the number of displayed rows
#pd.set_option('display.max_rows',None)

# Do not limit the number of displayed columns
pd.set_option('display.max_columns', None)


In [3]:
# Load and preprocess the data
data = pd.read_csv('datasets/000001.csv')
data['Day'] = pd.to_datetime(data['Day'],format = '%Y/%m/%d')
data.set_index('Day',inplace = True)
data['Close'] = pd.to_numeric(data['Close'],errors = 'coerce')
data['Preclose'] = data['Close'].shift(1)
data['Return'] = (data['Close'] - data['Preclose'])/data['Preclose']
data

Day,Preclose,Open,Highest,Lowest,Close,Volume,Money,Return
1990-12-19,,96.050,99.980,95.790,99.980,126000.00,4.940000e+05,
1990-12-20,99.980,104.300,104.390,99.980,104.390,19700.00,8.400000e+04,0.044109
1990-12-21,104.390,109.070,109.130,103.730,109.130,2800.00,1.600000e+04,0.045407
1990-12-24,109.130,113.570,114.550,109.130,114.550,3200.00,3.100000e+04,0.049666
1990-12-25,114.550,120.090,120.250,114.550,120.250,1500.00,6.000000e+03,0.049760
...,...,...,...,...,...,...,...,...
2024-09-24,2748.918,2770.754,2863.152,2761.372,2863.126,4776195.45,4.427953e+07,0.041547
2024-09-25,2863.126,2901.419,2952.451,2889.048,2896.306,5682598.16,5.166981e+07,0.011589
2024-09-26,2896.306,2893.745,3000.953,2889.014,3000.953,5763192.61,5.246691e+07,0.036131
2024-09-27,3000.953,3049.103,3087.529,3017.445,3087.529,4922871.63,4.806126e+07,0.028850


In [4]:
data_new = data['1995-01-01':'2024-12-31'].copy()
Month_data = data_new.resample('ME')['Return'].apply(lambda x:(1+x).prod()-1).to_frame()
Month_data

Day,Return
1995-01-31,-0.131631
1995-02-28,-0.023694
1995-03-31,0.177803
1995-04-30,-0.103552
1995-05-31,0.207922
...,...
2024-05-31,-0.005801
2024-06-30,-0.038684
2024-07-31,-0.009656
2024-08-31,-0.032849
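
The aggregation `(1+x).prod()-1` compounds simple daily returns into a period return. The toy sketch below shows the arithmetic on four hypothetical daily returns. It groups by calendar month with `to_period('M')`, used here as a stand-in for the `resample('ME')` call in the cell above so that the example does not depend on the pandas version; the dates and return values are made up.

```python
import pandas as pd

# Hypothetical daily simple returns spanning two months
idx = pd.to_datetime(['2024-01-30', '2024-01-31', '2024-02-01', '2024-02-02'])
daily = pd.Series([0.01, -0.02, 0.03, 0.01], index=idx)

# Group by calendar month and compound: (1 + r_1)(1 + r_2)... - 1
monthly = daily.groupby(daily.index.to_period('M')).apply(
    lambda r: (1 + r).prod() - 1
)
print(monthly)
# January:  1.01 * 0.98 - 1 = -0.0102
# February: 1.03 * 1.01 - 1 =  0.0403
```

Simply summing the daily returns would give -0.01 and 0.04 instead, which ignores compounding.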


In [5]:
Quarter_data = data_new.resample('QE')['Return'].apply(lambda x:(1+x).prod()-1).to_frame()
Quarter_data

Day,Return
1995-03-31,-0.001466
1995-06-30,-0.025258
1995-09-30,0.145660
1995-12-31,-0.231358
1996-03-31,0.001981
...,...
2023-09-30,-0.028603
2023-12-31,-0.043575
2024-03-31,0.022263
2024-06-30,-0.024255


In [6]:
Year_data = data_new.resample('YE')['Return'].apply(lambda x:(1+x).prod()-1).to_frame()
Year_data

Day,Return
1995-12-31,-0.142899
1996-12-31,0.651425
1997-12-31,0.302153
1998-12-31,-0.039695
1999-12-31,0.19175
2000-12-31,0.517277
2001-12-31,-0.20618
2002-12-31,-0.175167
2003-12-31,0.10267
2004-12-31,-0.153997


In [8]:
Month_data['lag_Return'] = Month_data['Return'].shift(1)
Month_data

Day,Return,lag_Return
1995-01-31,-0.131631,
1995-02-28,-0.023694,-0.131631
1995-03-31,0.177803,-0.023694
1995-04-30,-0.103552,0.177803
1995-05-31,0.207922,-0.103552
...,...,...
2024-05-31,-0.005801,0.020932
2024-06-30,-0.038684,-0.005801
2024-07-31,-0.009656,-0.038684
2024-08-31,-0.032849,-0.009656


Model:
$$
r_{t} = \alpha + \beta * r_{t-1} + \epsilon_{t}
$$

where $r_t$ is the raw return of the stock market in month $t$.

* H0: $\beta = 0$
* H1: $\beta \ne 0$
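
To show what testing $H_0$ involves, the sketch below estimates the model on synthetic returns and computes the test statistic for $\beta = 0$ by hand. The data-generating values (`alpha`, `beta`, the noise scale, the sample size) are assumptions, and the p-value uses a large-sample normal approximation rather than the exact $t$ distribution.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Synthetic AR(1)-style returns; all parameter values are assumed
alpha, beta, n = 0.005, 0.1, 500
r = np.empty(n)
r[0] = alpha
for t in range(1, n):
    r[t] = alpha + beta * r[t - 1] + rng.normal(0.0, 0.05)

y, x = r[1:], r[:-1]  # r_t and its one-period lag r_{t-1}
x_c = x - x.mean()
beta_hat = np.sum(x_c * (y - y.mean())) / np.sum(x_c ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

resid = y - alpha_hat - beta_hat * x
s2 = np.sum(resid ** 2) / (len(y) - 2)  # residual variance, two estimated parameters
se_beta = math.sqrt(s2 / np.sum(x_c ** 2))

t_stat = beta_hat / se_beta  # test statistic for H0: beta = 0
p_value = math.erfc(abs(t_stat) / math.sqrt(2))  # two-sided, normal approximation
print(beta_hat, se_beta, t_stat, p_value)
```

$H_0$ is rejected at the 5% level when `p_value` falls below 0.05. These are classical (non-robust) standard errors; they assume uncorrelated errors with constant variance.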

In [12]:
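# Use the return in month t-1 to predict the return in month t
# Newey-West (HAC) standard errors account for autocorrelation
# Rule of thumb used here: maxlags=12 for daily data, 6 for monthly, 2 for quarterly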

model1_mom = smf.ols('Return ~ lag_Return',
                 data=Month_data['2000-01':'2024-09']).fit()
#  cov_type='HAC', cov_kwds={'maxlags': 6}
print(model1_mom.summary())

                            OLS Regression Results                            
Dep. Variable:                 Return   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.010
Method:                 Least Squares   F-statistic:                     3.946
Date:                Mon, 14 Oct 2024   Prob (F-statistic):             0.0479
Time:                        13:41:02   Log-Likelihood:                 367.33
No. Observations:                 297   AIC:                            -730.7
Df Residuals:                     295   BIC:                            -723.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0050      0.004      1.214      0.2

The fitted model is therefore:

$$r_{t} = 0.0050 + 0.1159 * r_{t-1} + \epsilon_{t} $$
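
The commented-out option `cov_type='HAC'` in the cell above requests Newey-West standard errors. As a rough sketch of what such an estimator computes, the code below builds a HAC covariance matrix with Bartlett weights in plain NumPy. The helper name `newey_west_cov` and the synthetic data are hypothetical, no small-sample correction is applied, and the sketch is not guaranteed to reproduce the statsmodels numbers digit for digit.

```python
import numpy as np

def newey_west_cov(X, resid, maxlags):
    """HAC (Newey-West, Bartlett kernel) covariance of OLS coefficients.

    X: (n, k) design matrix including the constant column.
    resid: (n,) OLS residuals.
    """
    X = np.asarray(X, dtype=float)
    u = np.asarray(resid, dtype=float)
    xu = X * u[:, None]                  # rows are x_t * e_t
    S = xu.T @ xu                        # lag-0 term (White estimator)
    for lag in range(1, maxlags + 1):
        w = 1.0 - lag / (maxlags + 1.0)  # Bartlett weight
        gamma = xu[lag:].T @ xu[:-lag]   # cross-products at this lag
        S += w * (gamma + gamma.T)
    bread = np.linalg.inv(X.T @ X)
    return bread @ S @ bread

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
e = rng.normal(size=n)
y = 0.01 + 0.1 * x + e                   # synthetic data, assumed coefficients

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

cov_hac = newey_west_cov(X, resid, maxlags=6)
se_hac = np.sqrt(np.diag(cov_hac))
print(coef, se_hac)
```

With `maxlags=0` the loop is skipped and the result reduces to the heteroskedasticity-robust (White) covariance; larger values add down-weighted autocovariance terms.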


In [13]:
Quarter_data['lag_Return'] = Quarter_data['Return'].shift(1)
Quarter_data

Day,Return,lag_Return
1995-03-31,-0.001466,
1995-06-30,-0.025258,-0.001466
1995-09-30,0.145660,-0.025258
1995-12-31,-0.231358,0.145660
1996-03-31,0.001981,-0.231358
...,...,...
2023-09-30,-0.028603,-0.021632
2023-12-31,-0.043575,-0.028603
2024-03-31,0.022263,-0.043575
2024-06-30,-0.024255,0.022263


In [15]:
model2_mom = smf.ols('Return ~ lag_Return',
                 data=Quarter_data['2000-01':'2024-09']).fit(
                     cov_type='HAC', cov_kwds={'maxlags': 2})
print(model2_mom.summary())

                            OLS Regression Results                            
Dep. Variable:                 Return   R-squared:                       0.025
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     2.089
Date:                Mon, 14 Oct 2024   Prob (F-statistic):              0.152
Time:                        13:42:14   Log-Likelihood:                 53.697
No. Observations:                  99   AIC:                            -103.4
Df Residuals:                      97   BIC:                            -98.20
Df Model:                           1                                         
Covariance Type:                  HAC                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0162      0.014      1.130      0.2

In [17]:
data_new['lag_Return'] = data_new['Return'].shift(1)
data_new

Day,Preclose,Open,Highest,Lowest,Close,Volume,Money,Return,lag_Return
1995-01-03,647.870,637.720,647.710,630.530,639.880,23451800.00,1.806930e+08,-0.012333,
1995-01-04,639.880,641.900,655.510,638.860,653.810,42222000.00,3.069230e+08,0.021770,-0.012333
1995-01-05,653.810,655.380,657.520,645.810,646.890,43012300.00,3.015330e+08,-0.010584,0.021770
1995-01-06,646.890,642.750,643.890,636.330,640.760,48748200.00,3.537580e+08,-0.009476,-0.010584
1995-01-09,640.760,637.520,637.550,625.040,626.000,50985100.00,3.985190e+08,-0.023035,-0.009476
...,...,...,...,...,...,...,...,...,...
2024-09-24,2748.918,2770.754,2863.152,2761.372,2863.126,4776195.45,4.427953e+07,0.041547,0.004423
2024-09-25,2863.126,2901.419,2952.451,2889.048,2896.306,5682598.16,5.166981e+07,0.011589,0.041547
2024-09-26,2896.306,2893.745,3000.953,2889.014,3000.953,5763192.61,5.246691e+07,0.036131,0.011589
2024-09-27,3000.953,3049.103,3087.529,3017.445,3087.529,4922871.63,4.806126e+07,0.028850,0.036131


In [20]:
model3_mom = smf.ols('Return ~ lag_Return',
                 data=data_new['2000-01':'2024-09']).fit(
                     cov_type='HAC', cov_kwds={'maxlags': 12})
print(model3_mom.summary())

                            OLS Regression Results                            
Dep. Variable:                 Return   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.931
Date:                Mon, 14 Oct 2024   Prob (F-statistic):              0.165
Time:                        13:43:03   Log-Likelihood:                 16786.
No. Observations:                5997   AIC:                        -3.357e+04
Df Residuals:                    5995   BIC:                        -3.355e+04
Df Model:                           1                                         
Covariance Type:                  HAC                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0003      0.000      1.298      0.1

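# Calendar Effects
## Daily Data for the Chinese Stock Market
Define one dummy variable for each trading day of the week:

- Monday:
  $$D_1 = \begin{cases}
  1, & \text{if the day is a Monday} \\
  0, & \text{otherwise}
  \end{cases}$$
- Tuesday:
  $$D_2 = \begin{cases}
  1, & \text{if the day is a Tuesday} \\
  0, & \text{otherwise}
  \end{cases}$$
- Wednesday:
  $$D_3 = \begin{cases}
  1, & \text{if the day is a Wednesday} \\
  0, & \text{otherwise}
  \end{cases}$$
- Thursday:
  $$D_4 = \begin{cases}
  1, & \text{if the day is a Thursday} \\
  0, & \text{otherwise}
  \end{cases}$$
- Friday:
  $$D_5 = \begin{cases}
  1, & \text{if the day is a Friday} \\
  0, & \text{otherwise}
  \end{cases}$$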
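
When a regression includes an intercept, one of the five dummies has to be left out; otherwise the dummies sum to the constant column. The intercept then equals the mean return on the omitted day, and each coefficient is the difference between that day's mean and the omitted day's mean. The sketch below illustrates this with hypothetical returns and Thursday as the omitted baseline.

```python
import numpy as np

# Hypothetical returns with their weekday codes (1 = Monday, ..., 5 = Friday)
weekday = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
ret = np.array([0.010, -0.004, 0.002, -0.006, 0.003,
                0.006, 0.000, 0.004, -0.002, 0.001])

# D[:, k-1] = 1 if the observation falls on weekday k, else 0
D = (weekday[:, None] == np.arange(1, 6)[None, :]).astype(float)

# Constant plus D_1, D_2, D_3, D_5; Thursday (D_4) is the omitted baseline
X = np.column_stack([np.ones(len(ret)), D[:, [0, 1, 2, 4]]])
coef, *_ = np.linalg.lstsq(X, ret, rcond=None)

thursday_mean = ret[weekday == 4].mean()
monday_mean = ret[weekday == 1].mean()
print(coef[0], thursday_mean)                # intercept = Thursday mean
print(coef[1], monday_mean - thursday_mean)  # Monday coefficient = difference in means
```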


In [14]:
daily_data = data['1995-01':'2023-06'].copy()
daily_data['Close'] = pd.to_numeric(daily_data['Close'])
daily_data['Preclose'] = pd.to_numeric(daily_data['Preclose'])
# Compute the daily return of the SSE Composite Index (000001) in two ways: simple and log
daily_data['Raw_return'] = daily_data['Close'] / daily_data['Preclose'] - 1
daily_data['Log_return'] = np.log(daily_data['Close']) - np.log(daily_data['Preclose'])
daily_data

Day,Preclose,Open,Highest,Lowest,Close,Volume,Money,Raw_return,Log_return
1995-01-03,647.8700,637.7200,647.7100,630.5300,639.8800,23451800,1.806930e+08,-0.012333,-0.012409
1995-01-04,639.8800,641.9000,655.5100,638.8600,653.8100,42222000,3.069230e+08,0.021770,0.021536
1995-01-05,653.8100,655.3800,657.5200,645.8100,646.8900,43012300,3.015330e+08,-0.010584,-0.010641
1995-01-06,646.8900,642.7500,643.8900,636.3300,640.7600,48748200,3.537580e+08,-0.009476,-0.009521
1995-01-09,640.7600,637.5200,637.5500,625.0400,626.0000,50985100,3.985190e+08,-0.023035,-0.023305
...,...,...,...,...,...,...,...,...,...
2023-06-26,3197.9011,3177.2293,3181.0758,3144.2484,3150.6189,30812981100,3.988600e+11,-0.014785,-0.014896
2023-06-27,3150.6189,3153.3132,3194.4086,3148.2657,3189.4427,28760432700,3.544880e+11,0.012323,0.012247
2023-06-28,3189.4427,3183.4865,3192.6589,3157.1229,3189.3758,27623194800,3.601180e+11,-0.000021,-0.000021
2023-06-29,3189.3758,3185.4242,3196.5025,3179.5251,3182.3812,25034007400,3.406370e+11,-0.002193,-0.002196
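
The two return definitions are linked by $\text{Log\_return} = \ln(1 + \text{Raw\_return})$. Log returns add up over time, while simple returns compound. The sketch below checks both facts on a few hypothetical closing prices.

```python
import numpy as np

close = np.array([100.0, 102.0, 99.96, 104.958])  # hypothetical closing prices

raw = close[1:] / close[:-1] - 1               # simple returns: 0.02, -0.02, 0.05
log = np.log(close[1:]) - np.log(close[:-1])   # log returns

# Total return over the whole window, computed two ways
total_from_log = np.exp(log.sum()) - 1         # sum log returns, then convert back
total_from_raw = np.prod(1 + raw) - 1          # compound simple returns
print(total_from_log, total_from_raw)          # both equal 104.958 / 100 - 1
```

For small daily moves the two series are almost identical, which is why the `Raw_return` and `Log_return` columns in the output above differ only in the later decimals.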


In [15]:
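# How can we identify the day of the week for each date in Python?


# daily_data is assumed to be a DataFrame with a date index
# (the index is normally of datetime type)

# Extract the weekday (1 = Monday, ..., 5 = Friday for trading days) into a new column
daily_data['Weekday'] = daily_data.index.weekday + 1

# Create dummy variables with get_dummies(); dtype=int gives 0/1 columns
# (recent pandas versions return booleans by default)
dummy_variable = pd.get_dummies(daily_data['Weekday'], prefix='Weekday', dtype=int)

# Append the dummy variables to the original DataFrame
daily_data = pd.concat([daily_data, dummy_variable], axis=1)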

daily_data


Day,Preclose,Open,Highest,Lowest,Close,Volume,Money,Raw_return,Log_return,Weekday,Weekday_1,Weekday_2,Weekday_3,Weekday_4,Weekday_5
1995-01-03,647.8700,637.7200,647.7100,630.5300,639.8800,23451800,1.806930e+08,-0.012333,-0.012409,2,0,1,0,0,0
1995-01-04,639.8800,641.9000,655.5100,638.8600,653.8100,42222000,3.069230e+08,0.021770,0.021536,3,0,0,1,0,0
1995-01-05,653.8100,655.3800,657.5200,645.8100,646.8900,43012300,3.015330e+08,-0.010584,-0.010641,4,0,0,0,1,0
1995-01-06,646.8900,642.7500,643.8900,636.3300,640.7600,48748200,3.537580e+08,-0.009476,-0.009521,5,0,0,0,0,1
1995-01-09,640.7600,637.5200,637.5500,625.0400,626.0000,50985100,3.985190e+08,-0.023035,-0.023305,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-06-26,3197.9011,3177.2293,3181.0758,3144.2484,3150.6189,30812981100,3.988600e+11,-0.014785,-0.014896,1,1,0,0,0,0
2023-06-27,3150.6189,3153.3132,3194.4086,3148.2657,3189.4427,28760432700,3.544880e+11,0.012323,0.012247,2,0,1,0,0,0
2023-06-28,3189.4427,3183.4865,3192.6589,3157.1229,3189.3758,27623194800,3.601180e+11,-0.000021,-0.000021,3,0,0,1,0,0
2023-06-29,3189.3758,3185.4242,3196.5025,3179.5251,3182.3812,25034007400,3.406370e+11,-0.002193,-0.002196,4,0,0,0,1,0


In [16]:
model_week = smf.ols('Raw_return ~ Weekday_1 + Weekday_2 + Weekday_3 + Weekday_5',
                 data=daily_data['2010-01':'2023-06']).fit(
                     cov_type='HAC', cov_kwds={'maxlags': 12})
print(model_week.summary())

                            OLS Regression Results                            
Dep. Variable:             Raw_return   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     2.313
Date:                Mon, 23 Oct 2023   Prob (F-statistic):             0.0553
Time:                        07:52:42   Log-Likelihood:                 9618.2
No. Observations:                3277   AIC:                        -1.923e+04
Df Residuals:                    3272   BIC:                        -1.920e+04
Df Model:                           4                                         
Covariance Type:                  HAC                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0011      0.000     -2.399      0.0

In [17]:
model2_week = smf.ols('Raw_return ~ Weekday_4',
                 data=daily_data['2000-01':'2023-06']).fit(
                     cov_type='HAC', cov_kwds={'maxlags': 12})
print(model2_week.summary())

                            OLS Regression Results                            
Dep. Variable:             Raw_return   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     11.01
Date:                Mon, 23 Oct 2023   Prob (F-statistic):           0.000911
Time:                        07:52:42   Log-Likelihood:                 15854.
No. Observations:                5692   AIC:                        -3.170e+04
Df Residuals:                    5690   BIC:                        -3.169e+04
Df Model:                           1                                         
Covariance Type:                  HAC                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0006      0.000      2.472      0.0

### Combining the Results

In [18]:
from statsmodels.iolib.summary2 import summary_col

info_dict = {'No. observations': lambda x: f"{int(x.nobs):d}"}

results_table = summary_col(results=[model3_mom, model1_mom, model2_mom],
                            float_format='%0.3f', # number format (the default shows four decimals)
                            stars=True, # whether to print significance stars
                            model_names=['Daily MOM', 'Month MOM', 'Quarter MOM'],
                            info_dict=info_dict,
                            regressor_order=['Intercept', 'lag_Return'])  # regressor name used in the three models above

results_table.add_title(
    'Table - OLS Regressions: Forecast Stock Market Return')

print(results_table)

Table - OLS Regressions: Forecast Stock Market Return
                 Daily MOM Month MOM Quarter MOM
------------------------------------------------
Intercept        0.000     0.005     0.016      
                 (0.000)   (0.005)   (0.015)    
lag_Raw_return   0.021     0.124*    0.157      
                 (0.018)   (0.071)   (0.108)    
R-squared        0.000     0.015     0.025      
R-squared Adj.   0.000     0.012     0.014      
No. observations 5692      282       94         
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01


## Calendar Effects in Monthly Data

In [19]:
Month_data = daily_data.resample('ME')['Log_return'].sum().to_frame()  # 'ME' = month end, as in the earlier cells
Month_data['Raw_return'] = np.exp(Month_data['Log_return']) - 1

# Add month dummies to Month_data: Month_1 is 1 in January and 0 otherwise, and so on, giving 12 dummy variables
Month_data['Month'] = Month_data.index.month
dummy_variable = pd.get_dummies(Month_data['Month'], prefix='Month', dtype=int) # prefix='Month' sets the prefix of the dummy column names
Month_data = pd.concat([Month_data, dummy_variable], axis=1) # axis=1 concatenates column-wise
Month_data

Day,Log_return,Raw_return,Month,Month_1,Month_2,Month_3,Month_4,Month_5,Month_6,Month_7,Month_8,Month_9,Month_10,Month_11,Month_12
1995-01-31,-0.141139,-0.131631,1,1,0,0,0,0,0,0,0,0,0,0,0
1995-02-28,-0.023979,-0.023694,2,0,1,0,0,0,0,0,0,0,0,0,0
1995-03-31,0.163651,0.177803,3,0,0,1,0,0,0,0,0,0,0,0,0
1995-04-30,-0.109315,-0.103552,4,0,0,0,1,0,0,0,0,0,0,0,0
1995-05-31,0.188901,0.207922,5,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-02-28,0.007325,0.007352,2,0,1,0,0,0,0,0,0,0,0,0,0
2023-03-31,-0.002059,-0.002057,3,0,0,1,0,0,0,0,0,0,0,0,0
2023-04-30,0.015286,0.015404,4,0,0,0,1,0,0,0,0,0,0,0,0
2023-05-31,-0.036374,-0.035721,5,0,0,0,0,1,0,0,0,0,0,0,0


In [20]:
# Regress Raw_return on each of the 12 month dummies in turn
model1_mon = smf.ols('Raw_return ~ Month_1',
                 data=Month_data['1995-01':'2023-06']).fit(
                     cov_type='HAC', cov_kwds={'maxlags': 6})
model2_mon = smf.ols('Raw_return ~ Month_2',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model3_mon = smf.ols('Raw_return ~ Month_3',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model4_mon = smf.ols('Raw_return ~ Month_4',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model5_mon = smf.ols('Raw_return ~ Month_5',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model6_mon = smf.ols('Raw_return ~ Month_6',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model7_mon = smf.ols('Raw_return ~ Month_7',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model8_mon = smf.ols('Raw_return ~ Month_8',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model9_mon = smf.ols('Raw_return ~ Month_9',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model10_mon = smf.ols('Raw_return ~ Month_10',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model11_mon = smf.ols('Raw_return ~ Month_11',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model12_mon = smf.ols('Raw_return ~ Month_12',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})
model_mon_allmonth = smf.ols('Raw_return ~ Month_1 + Month_2 + Month_3 + Month_4 + Month_5 + Month_6 + Month_7 + Month_8 + Month_9 + Month_10 + Month_11',
                    data=Month_data['1995-01':'2023-06']).fit(
                        cov_type='HAC', cov_kwds={'maxlags': 6})

info_dict = {'No. observations': lambda x: f"{int(x.nobs):d}"}

results_table = summary_col(results=[model1_mon, model2_mon, model3_mon, model4_mon, model5_mon, model6_mon, model7_mon, model8_mon, model9_mon, model10_mon, model11_mon, model12_mon,model_mon_allmonth],
                            float_format='%0.3f', # number format (the default shows four decimals)
                            stars=True, # whether to print significance stars
                            model_names=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec','ALL'],
                            info_dict=info_dict,
                            regressor_order=['Intercept', 'Month_1', 'Month_2', 'Month_3', 'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9', 'Month_10', 'Month_11', 'Month_12'])

results_table.add_title(
    'Table - OLS Regressions: Forecast Stock Market Return')

print(results_table)

                                 Table - OLS Regressions: Forecast Stock Market Return
                   Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec     ALL  
------------------------------------------------------------------------------------------------------------------------
Intercept        0.009*  0.006   0.006   0.006   0.007   0.008*  0.008   0.009*  0.009*  0.008   0.007   0.007   0.009  
                 (0.005) (0.005) (0.005) (0.005) (0.005) (0.005) (0.005) (0.005) (0.005) (0.005) (0.005) (0.005) (0.017)
Month_1          -0.013                                                                                          -0.014 
                 (0.015)                                                                                         (0.023)
Month_2                  0.015                                                                                   0.012  
                         (0.010)                                  