Narrative:
The data are daily portfolio returns of SGX-listed stocks from 28 Oct 1997 to 18 Oct 2002. The large stock portfolio return (LSR) is the simple daily average return of 10 stocks, viz. Singtel, UOB, DBS, OCBC, SIA, SPH, Jardine, HK Land, Great Eastern, and City Developments. The small stock portfolio return (SSR) is the simple daily average return of 10 stocks, viz. Econ Intl, Casa Holdings, Pertama Holdings, Meiban Group, Sunright Ltd, Armstrong Ind Corp, Penguin Boat, Freight Links Express Holdings, Liang Huat Aluminium, and Tye Soon Ltd. The market return is proxied by the Straits Times Index return, STIR. d1, d2, d3, d4, d5 are dummy variables for Monday, Tuesday, Wednesday, Thursday, and Friday respectively.

Requirements:
Perform multivariate regression and answer the following 5 questions. Start with 'from statsmodels.formula.api import ols'.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

In [2]:
from statsmodels.formula.api import ols

In [3]:
df = pd.read_csv('Large_Small_Day_of_Week.csv', index_col = 'Date').dropna()
print(df)

                 Days      STIR       LSR       SSR  d1  d2  d3  d4  d5
Date                                                                   
28/10/1997    Tuesday -0.096719 -0.088550 -0.091323   0   1   0   0   0
29/10/1997  Wednesday  0.066769  0.053139  0.030660   0   0   1   0   0
30/10/1997   Thursday  0.000000  0.000000  0.000000   0   0   0   1   0
31/10/1997     Friday  0.020108  0.002225  0.015986   0   0   0   0   1
3/11/1997      Monday  0.069216  0.057976  0.093426   1   0   0   0   0
...               ...       ...       ...       ...  ..  ..  ..  ..  ..
14/10/2002     Monday  0.003452  0.002468 -0.007561   1   0   0   0   0
15/10/2002    Tuesday  0.036498  0.033885  0.046484   0   1   0   0   0
16/10/2002  Wednesday  0.006533  0.007196  0.042938   0   0   1   0   0
17/10/2002   Thursday  0.018568  0.019657  0.023428   0   0   0   1   0
18/10/2002     Friday -0.003163  0.001367  0.050051   0   0   0   0   1

[1299 rows x 9 columns]


In [4]:
df.columns

Index(['Days', 'STIR', 'LSR', 'SSR', 'd1', 'd2', 'd3', 'd4', 'd5'], dtype='object')

In [5]:
dat1 = df[df['d1'] != 0].reset_index(drop=True) ### dat1 contains only Monday returns
### resetting the index to 0,1,2,... matters: the day-of-week subsets then align row by row when concatenated
print(dat1)

       Days      STIR       LSR       SSR  d1  d2  d3  d4  d5
0    Monday  0.069216  0.057976  0.093426   1   0   0   0   0
1    Monday -0.017570  0.000688 -0.022599   1   0   0   0   0
2    Monday  0.011866  0.013696  0.002207   1   0   0   0   0
3    Monday  0.019007  0.015035 -0.021014   1   0   0   0   0
4    Monday  0.003762 -0.001700 -0.015981   1   0   0   0   0
..      ...       ...       ...       ...  ..  ..  ..  ..  ..
254  Monday  0.007382  0.001110  0.017402   1   0   0   0   0
255  Monday -0.003673  0.004445 -0.044629   1   0   0   0   0
256  Monday -0.015839 -0.022405 -0.011432   1   0   0   0   0
257  Monday  0.008040  0.006424 -0.023557   1   0   0   0   0
258  Monday  0.003452  0.002468 -0.007561   1   0   0   0   0

[259 rows x 9 columns]


In [6]:
dat2 = df[df['d2'] != 0].reset_index(drop=True) ### dat2 contains only Tuesday returns
dat3 = df[df['d3'] != 0].reset_index(drop=True) ### dat3 contains only Wednesday returns
dat4 = df[df['d4'] != 0].reset_index(drop=True) ### dat4 contains only Thursday returns
dat5 = df[df['d5'] != 0].reset_index(drop=True) ### dat5 contains only Friday returns

## Question 1
What is the difference in mean Monday return between the large portfolio and the small portfolio? Find the t-statistic to test whether the difference is significantly different from zero (the null hypothesis). Assume returns are normally distributed with the same variances. The means are unconditional expectations. Report the difference, the t-statistic, and the p-value.


In [7]:
df1 = df.copy()

In [8]:
df1['Diff'] = df1['LSR'] - df1['SSR']

f=df1.groupby('Days').mean()
f.rename(columns={'Diff': 'Mean'}, inplace=True)
f1=f['Mean']

g=df1.groupby('Days').var()  ### note pandas .var() defaults to ddof=1, i.e., the unbiased estimator with divisor n-1
se = np.sqrt((g['LSR'] + g['SSR']) / len(dat1))  ### the Monday sample size len(dat1) is used for every day

t1=f['Mean']/se

from scipy.stats import t
p_value = 2 * t.sf(np.abs(t1), 258)  ### two-sided p-value; abs() keeps it within [0, 1] when t is negative

print(f1, t1, p_value)

Days
Friday      -0.000630
Monday       0.006249
Thursday     0.000278
Tuesday      0.001509
Wednesday   -0.000989
Name: Mean, dtype: float64 Days
Friday      -0.345438
Monday       2.436323
Thursday     0.149019
Tuesday      0.784552
Wednesday   -0.510262
dtype: float64 [0.73004657 0.01551451 0.88165473 0.43343604 0.61030386]
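
As a cross-check on the Question 1 calculation, here is a minimal sketch of the pooled-variance (equal-variance) two-sample t-test. It uses synthetic draws, not the SGX data, and the names `large`, `small`, `sp2`, and `dof` are illustrative assumptions. The textbook pooled test uses n1 + n2 - 2 degrees of freedom, which differs from the 258 used in the cell above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for Monday LSR and SSR (assumption: NOT the SGX data)
large = rng.normal(0.001, 0.02, size=259)
small = rng.normal(-0.002, 0.03, size=259)

n1, n2 = len(large), len(small)
diff = large.mean() - small.mean()

# Pooled variance under the equal-variance assumption
sp2 = ((n1 - 1) * large.var(ddof=1) + (n2 - 1) * small.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))

t_stat = diff / se
dof = n1 + n2 - 2
p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided, always in [0, 1]

# scipy's built-in equal-variance test should agree
t_ref, p_ref = stats.ttest_ind(large, small, equal_var=True)
print(diff, t_stat, p_value)
```

With equal sample sizes the pooled standard error reduces to sqrt((s1^2 + s2^2) / n), which is the form used in the cell above.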


## Question 2
Run OLS with dependent variable LSR and explanatory variables STIR and the 5 dummy variables. Similarly, run OLS with dependent variable SSR and explanatory variables STIR and the 5 dummy variables. Which of the following statements is the most accurate? (The significance level is 1%.)

In [9]:
from statsmodels.formula.api import ols
formula = 'LSR ~ STIR + d1 + d2 + d3 + d4 + d5 - 1'
results = ols(formula, df).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                    LSR   R-squared:                       0.887
Model:                            OLS   Adj. R-squared:                  0.886
Method:                 Least Squares   F-statistic:                     2022.
Date:                Sun, 18 Feb 2024   Prob (F-statistic):               0.00
Time:                        14:00:29   Log-Likelihood:                 4866.9
No. Observations:                1299   AIC:                            -9722.
Df Residuals:                    1293   BIC:                            -9691.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
STIR           0.9224      0.009    100.462      0.0

In [10]:
formula1 = 'SSR ~ STIR + d1 + d2 + d3 + d4 + d5 - 1'
results1 = ols(formula1, df).fit()
print(results1.summary())

                            OLS Regression Results                            
Dep. Variable:                    SSR   R-squared:                       0.280
Model:                            OLS   Adj. R-squared:                  0.278
Method:                 Least Squares   F-statistic:                     100.8
Date:                Sun, 18 Feb 2024   Prob (F-statistic):           7.37e-90
Time:                        14:00:29   Log-Likelihood:                 3003.4
No. Observations:                1299   AIC:                            -5995.
Df Residuals:                    1293   BIC:                            -5964.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
STIR           0.8439      0.039     21.895      0.0

## Question 5

In the OLS regression of dependent variable LSR on explanatory variables STIR and the 5 dummy variables, suppose the fitted residuals indicate significantly positive autocorrelations. Perform a GLS regression to improve on the estimates. Report the OLS Durbin-Watson statistic and the GLS Durbin-Watson statistic.

In [48]:
Dx=pd.concat([df['STIR'], df['d1'],df['d2'],df['d3'],df['d4'],df['d5']],axis=1)

In [49]:
resid_fit = sm.OLS(
    np.asarray(results.resid)[1:], sm.add_constant(np.asarray(results.resid)[:-1])
).fit()
print(resid_fit.tvalues[1])
print(resid_fit.pvalues[1])
rho = resid_fit.params[1]
print(rho)

1.6916522439077528
0.0909527020273625
0.04694953172306798


In [50]:
from scipy.linalg import toeplitz
trix = toeplitz(range(len(results.resid))) ### trix is a square matrix of lags |i-j|: 0 on the diagonal, 1 on the first off-diagonal, 2 on the second, etc.
sigma = rho ** trix ### AR(1) correlation matrix rho^|i-j|; the scale factor sigma_u^2/(1-rho^2) is left out since GLS only needs sigma up to scale
gls_model = sm.GLS(df['LSR'], Dx, sigma=sigma)
gls_results = gls_model.fit()
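
To make the structure of `sigma` concrete, here is a small sketch of the same construction for 4 observations with an assumed value of rho = 0.5 (chosen only for readability; it is not the estimate above).

```python
import numpy as np
from scipy.linalg import toeplitz

rho_demo = 0.5                       # illustrative value, not the estimated rho
lags = toeplitz(np.arange(4))        # entry (i, j) equals |i - j|
sigma_demo = rho_demo ** lags        # AR(1) correlation matrix rho^|i-j|

print(lags)
print(sigma_demo)
```

Entry (i, j) of the result is rho raised to the lag between observations i and j, so the correlations decay geometrically away from the diagonal.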

In [51]:
print(gls_results.summary())

                            GLS Regression Results                            
Dep. Variable:                    LSR   R-squared:                       0.886
Model:                            GLS   Adj. R-squared:                  0.886
Method:                 Least Squares   F-statistic:                     2009.
Date:                Sun, 18 Feb 2024   Prob (F-statistic):               0.00
Time:                        14:07:18   Log-Likelihood:                 4868.3
No. Observations:                1299   AIC:                            -9725.
Df Residuals:                    1293   BIC:                            -9694.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
STIR           0.9222      0.009    100.117      0.0
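
Question 5 asks for Durbin-Watson statistics, which are not visible in the truncated summaries above. A minimal sketch of how the statistic could be computed uses `durbin_watson` from `statsmodels.stats.stattools` on simulated AR(1) residuals (synthetic data; `rho_true` is an assumption). In the notebook, applying the same function to `results.resid` and, presumably, to the whitened residuals `gls_results.wresid` would give the OLS and GLS values.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
n, rho_true = 5000, 0.3              # assumed values for the simulation
u = rng.normal(size=n)

# Simulate AR(1) residuals e_t = rho * e_{t-1} + u_t
e = np.empty(n)
e[0] = u[0]
for t in range(1, n):
    e[t] = rho_true * e[t - 1] + u[t]

dw = durbin_watson(e)
# Definition: sum of squared first differences over sum of squares
dw_manual = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
# First-order autocorrelation estimate; DW is approximately 2 * (1 - rho_hat)
rho_hat = np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)
print(dw, dw_manual, 2 * (1 - rho_hat))
```

A value near 2 indicates little first-order autocorrelation; values below 2 indicate positive autocorrelation.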

## Question 3

Find the variances of the fitted residuals for the two regressions in Q2. Assume these variances are different. Run a GLS regression with both LSR and SSR combined as dependent variable. The explanatory variables are the same STIR and the 5 dummy variables. What is the coefficient estimate and its t-value for the Monday dummy?

In [26]:
LSR = df['LSR']
SSR = df['SSR']
df2 = pd.concat([LSR, SSR])
df2

Date
28/10/1997   -0.088550
29/10/1997    0.053139
30/10/1997    0.000000
31/10/1997    0.002225
3/11/1997     0.057976
                ...   
14/10/2002   -0.007561
15/10/2002    0.046484
16/10/2002    0.042938
17/10/2002    0.023428
18/10/2002    0.050051
Length: 2598, dtype: float64

In [32]:
var_lsr = np.var(results.resid)
var_ssr = np.var(results1.resid)

variances = np.array([var_lsr] * len(df) + [var_ssr] * len(df))
cov_matrix = np.diag(variances)

cov_matrix.shape

(2598, 2598)

In [31]:
Dx2=pd.concat([Dx, Dx])
Dx2.shape

(2598, 6)

In [40]:
gls_model1 = sm.GLS(df2, Dx2, sigma = cov_matrix)
gls_results1 = gls_model1.fit()

In [41]:
print(gls_results1.summary())

                            GLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.802
Model:                            GLS   Adj. R-squared:                  0.802
Method:                 Least Squares   F-statistic:                     2106.
Date:                Sun, 18 Feb 2024   Prob (F-statistic):               0.00
Time:                        14:03:36   Log-Likelihood:                 7859.2
No. Observations:                2598   AIC:                        -1.571e+04
Df Residuals:                    2592   BIC:                        -1.567e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
STIR           0.9182      0.009    102.483      0.0

## Question 4
Suppose we find the fitted residuals in the GLS regression in Q3. What are the Breusch-Pagan chi-square test statistic value and the White's Heteroskedasticity chi-square test statistic value?

In [47]:
from statsmodels.stats.diagnostic import het_breuschpagan
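### note: het_breuschpagan squares its first argument internally, so raw residuals are the expected input;
### the value printed below was obtained with the squared residuals passed in
bp_test = het_breuschpagan(gls_results1.resid ** 2, sm.add_constant(Dx2))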
bp_test[0]

7.6745240798849945

In [46]:
from statsmodels.stats.diagnostic import het_white
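### note: het_white also squares its first argument internally; the value printed below used squared residuals as input
wh_test = het_white(gls_results1.resid ** 2, sm.add_constant(Dx2))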
wh_test[0]

12.190975640109777