
# FM5222
# Introduction to Factor Models


* Basic Concept
* CAPM
* Example with data






## Factor Models

We have already been fitting factor models, but we haven't been calling them that.  The idea is that the returns on individual stocks are driven by a finite set of market factors, which (in theory) impact all stocks. For example, we might write:

$$\mathrm{GM}_t = \beta_0 + \beta_1 \mathrm{SP}_t + \beta_2 \mathrm{Rut}_t + \epsilon_t$$

Here, we are modelling the daily log-return of GM as a linear function of the log-returns on the S&P500 and the Russell2000.


One problem with this model (which we will discuss later) is that because the S&P500 and Russell2000 are so highly correlated, the fitted coefficients of such a model can be quite unstable from period to period.  So, more generally, we prefer to create models where the Factors themselves have little correlation. But that may not always be the case. 
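The instability caused by correlated factors can be illustrated on simulated data. The sketch below is a minimal illustration under assumed parameters (the true coefficients 0.8 and 0.4, the volatilities, and the function name `slope_spread` are all hypothetical): it refits the same two-factor regression over many simulated periods and measures how much the first fitted coefficient moves from period to period.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_spread(rho, n_periods=200, n_obs=250):
    """Std. dev. of the fitted first-factor coefficient across simulated periods."""
    cov = np.array([[1.0, rho], [rho, 1.0]]) * 0.01**2
    first_betas = []
    for _ in range(n_periods):
        # Two factor return series with correlation rho
        F = rng.multivariate_normal([0.0, 0.0], cov, size=n_obs)
        # Assumed true model: beta_0 = 0, beta_1 = 0.8, beta_2 = 0.4, plus noise
        y = 0.8 * F[:, 0] + 0.4 * F[:, 1] + rng.normal(0.0, 0.01, size=n_obs)
        X = np.column_stack([np.ones(n_obs), F])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        first_betas.append(coef[1])
    return float(np.std(first_betas))

spread_uncorrelated = slope_spread(rho=0.0)
spread_correlated = slope_spread(rho=0.98)
print(spread_uncorrelated, spread_correlated)
```

With a factor correlation of 0.98 the fitted coefficient should vary several times more across periods than with uncorrelated factors, even though the true coefficients never change.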



We also don't just want to create a model for GM, but for (say) $n$ stocks.  Such a model could be written 

$$R_{j,t} = \beta_{0,j} + \beta_{1,j}F_{1,t} + \beta_{2,j}F_{2,t} + \cdots + \beta_{p,j}F_{p,t} + \epsilon_{j,t}$$


where we have



$p$ factors $F_k$

$n$ stocks indexed by $j$ with daily (or weekly, monthly) returns $R_{j,t}$

$n \times (p+1) $ coefficients $\beta_{k,j}$, and

$n$ random noise processes $\epsilon_{j,t}$

We assume that the noises are uncorrelated (hopefully independent) both temporally 

$$\mathrm{Cov}(\epsilon_{j,t},\epsilon_{j,s}) = 0, t \neq s$$

and cross-sectionally

$$\mathrm{Cov}(\epsilon_{j,t},\epsilon_{i,t}) = 0, j \neq i$$


Furthermore, it is assumed that the factors are uncorrelated to the noise terms:


$$\mathrm{Cov}(F_{k,t}, \epsilon_{j,s}) = 0, \forall (k,j,t,s)$$
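To make the notation concrete, here is a small simulation of such a model with $p=2$ factors and $n=3$ stocks, followed by a least-squares fit. Every number in it (the coefficient matrix `B_true`, the volatilities, the sample size) is a made-up assumption chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p, n = 5000, 2, 3          # time steps, factors, stocks

# Hypothetical true coefficients: row k holds beta_{k,j} for each stock j
# (row 0 is the intercept beta_{0,j}).
B_true = np.array([[0.0000, 0.0002, -0.0001],
                   [1.10,   0.60,    0.90],
                   [-0.30,  0.20,    0.50]])

F = rng.normal(0.0, 0.01, size=(T, p))      # uncorrelated factors F_{k,t}
eps = rng.normal(0.0, 0.01, size=(T, n))    # noise, independent across t and j
X = np.column_stack([np.ones(T), F])        # T x (p+1) design matrix
R = X @ B_true + eps                        # T x n matrix of returns R_{j,t}

# One least-squares solve fits all n stocks at once: each column of R
# is regressed on the same set of factors.
B_hat, *_ = np.linalg.lstsq(X, R, rcond=None)
print(np.round(B_hat, 3))

# Cross-sectional correlation of the residuals; off-diagonals should be near 0
resid_corr = np.corrcoef((R - X @ B_hat).T)
print(np.round(resid_corr, 3))
```

The fitted matrix `B_hat` has shape $(p+1) \times n$, matching the $n \times (p+1)$ coefficients counted above.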

But before digging into $p>1$ factor models, let's take a look at the most classic and arguably the "original" factor model, CAPM.










## CAPM

The Capital Asset Pricing Model (CAPM) is based on several assumptions about returns and investor preferences.  The conclusion is the famous equation

$$\mathrm{E}(R_{j,t})= r_f(t) + \beta_j \left(\mathrm{E}(M_t) - r_f(t) \right)$$

This says that the expected *excess return* of a stock over the risk-free rate is determined by 

1) The excess return of the market portfolio (the portfolio with all stocks).

2) The sensitivity of the individual  (jth) stock to the market portfolio ($\beta_j$)


It's easy to see how CAPM is just a factor model with one factor.


A few comments are in order:

### Systematic Risk vs Idiosyncratic Risk

Under CAPM, the return on a stock is determined by its beta, the market return $M_t$, and its individual noise term $\epsilon_{j,t}$. By assumption, the market return and the noise term are uncorrelated. 

The "risk" of a stock is identified with the variance of its returns, and because $M_t$ and $\epsilon_{j,t}$ are uncorrelated:

$$\mathrm{Var}(R_{j,t}) = \beta_j^2 \mathrm{Var}(M_t) + \mathrm{Var}(\epsilon_{j,t})$$


This means that there are two sources of risk, "market" (also called systematic) and "idiosyncratic".


The latter represents, for example, good or bad decisions and/or good or bad luck of the individual company.  The former represents the extent to which the stock rises and falls with the broader market.  

The expected return equation leads us to the conclusion that only market risk earns a risk premium.  Idiosyncratic risk does not.
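The variance decomposition can be checked numerically. The following is a minimal sketch on simulated data; the beta of 1.2 and the two volatilities are hypothetical values, not estimates for any real stock.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100_000
beta = 1.2                                  # hypothetical beta
M = rng.normal(0.0, 0.010, size=T)          # market return
eps = rng.normal(0.0, 0.015, size=T)        # idiosyncratic noise, independent of M
R = beta * M + eps

total_var = R.var()
market_var = beta**2 * M.var()              # market part of the risk
idio_var = eps.var()                        # idiosyncratic part of the risk
market_share = market_var / total_var

print(total_var, market_var + idio_var, market_share)
```

In a finite sample the two sides differ by twice beta times the sample covariance of `M` and `eps`, which shrinks toward zero as `T` grows.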

2. CAPM is usually stated with respect to a simple return

$$R_t = \frac{S_{t}- S_{t-1}}{S_{t-1}}$$

but we have framed things in terms of log-returns

$$lR_t = \ln(S_{t}) - \ln(S_{t-1})$$

How much does this matter?

Mathematically, they are not the same: if $R_t$ is Gaussian, then $lR_t$ is not (and vice versa).

Generally speaking, $R_t$ is quite small, so we can use the Taylor expansion of $\ln(1+x)$ for $|x| <1$:

$$\ln(1+x) =  x -\frac{1}{2}x^2 + \frac{1}{3}x^3 - \frac{1}{4}x^4 + \cdots$$

In particular, if $|x| \ll 1$, then $\ln(1+x) \approx x$.

So we have

$$\begin{aligned}
lR_t &= \ln(S_{t}) - \ln(S_{t-1}) \\
&=\ln\left(\frac{S_t}{S_{t-1}} \right)\\
&=\ln(1 + R_t)\\
&\approx R_t
\end{aligned}$$


Consequently, for small time steps like daily or even monthly, the distinction is not numerically large.  

Having said that, for our purposes, we will stick to log-Returns.
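To get a feel for the size of the approximation error, the short sketch below compares a few simple returns with their log-return counterparts. The helper name `log_return` and the sample values are illustrative only; the last column shows the leading Taylor term $R^2/2$.

```python
import math

def log_return(simple_return):
    """Log-return implied by a simple return R, i.e. ln(1 + R)."""
    return math.log1p(simple_return)

for r in (0.001, 0.01, 0.05, 0.20):
    gap = r - log_return(r)
    print(f"R = {r:6.3f}   ln(1+R) = {log_return(r):.6f}   "
          f"gap = {gap:.6f}   R^2/2 = {r * r / 2:.6f}")
```

For a 1% move the gap is roughly half a basis point, while for a 20% move it is close to two percentage points, so the approximation degrades as the horizon (and hence the typical move) grows.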


3. Over the last 20 years, risk-free rates have been quite small (usually under $2.00\%$).  This means that for small $\Delta t$, the term $r_f \approx 0$ (e.g. $\frac{.02}{252}$ per trading day).

Hence, there *can* be very little difference between modeling things as


$$R_{j,t}= r_f(t) + \beta_j \left(M_t - r_f(t) \right) +\epsilon_{j,t}$$

versus



$$R_{j,t}=  \beta_j M_t +\epsilon_{j,t}$$


**Caveat** During periods where the risk-free rate is not close to zero, the regressed coefficients will be different under the two approaches.  The advantages of the second approach are:

1. It is simpler.
2. It does not assume CAPM.


4. Because $r_f(t)$ is the same for all stocks, CAPM can be fitted with linear regression with no intercept:


$$ER_{j,t} = R_{j,t}- r_f(t) = \beta_j \left(M_t - r_f(t) \right) +\epsilon_{j,t}$$

Obviously, the second formulation in comment 3 is also without intercept.
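A regression with no intercept has a closed-form slope, $\hat\beta = \sum_t x_t y_t / \sum_t x_t^2$. The sketch below applies it to synthetic data generated under CAPM, once in the excess-return form and once in the raw-return form of comment 3. The helper `beta_through_origin`, the constant 2% annual risk-free rate, and the true beta of 1.1 are assumptions made for the illustration.

```python
import numpy as np

def beta_through_origin(x, y):
    """Least-squares slope of y on x with no intercept."""
    return float(x @ y / (x @ x))

rng = np.random.default_rng(3)
T = 2500
rf = 0.02 / 252                              # assumed constant daily risk-free rate
true_beta = 1.1                              # hypothetical beta
M = rng.normal(0.0003, 0.01, size=T)         # synthetic market returns
eps = rng.normal(0.0, 0.012, size=T)         # synthetic idiosyncratic noise
R = rf + true_beta * (M - rf) + eps          # returns generated under CAPM

beta_excess = beta_through_origin(M - rf, R - rf)   # excess-return form
beta_raw = beta_through_origin(M, R)                # raw-return form

print(beta_excess, beta_raw)
```

With a risk-free rate this small the two estimates are nearly identical; raising `rf` or letting it vary over time is a simple way to see the caveat above in action.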





## Example with data

People often take the S&P500 to be the "market" portfolio.  But because the S&P500 is (by construction) all large caps, let us take a broad market index to be the Wilshire5000 (^W5000).

For the risk-free rate, we will take the Federal Funds Rate.

We will model the returns (under CAPM) of 3 stocks: US Bank (USB), Pepsi (PEP), and OtterTail (OTTR).



In [None]:
! pip install yfinance

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import yfinance as yf
import pandas_datareader.data as dr



In [None]:
start = '2002-01-01'
end = '2022-01-01'


tickers = ['USB', 'PEP','OTTR', '^W5000' ]


stocks = yf.download(tickers,start = start, end = end )


rf = dr.DataReader(['DFF'], 'fred', start = start, end= end)




Get the columns we need.

In [None]:
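# Daily log-returns from closing prices
data = np.log(stocks.Close).diff()

# Daily risk-free rate from the annualized Fed Funds series.
# DFF is quoted in percent per annum, so rfdaily is in percent per day.
data['rfdaily'] = rf['DFF'] / 252

# Drop the first row (no return yet) and dates missing from either source
data = data.dropna()
data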

| Date | OTTR | PEP | USB | ^W5000 | rfdaily |
|---|---|---|---|---|---|
| 2002-01-03 | 0.028988 | -0.007351 | 0.004854 | 0.009794 | 0.006825 |
| 2002-01-04 | 0.023531 | -0.008852 | 0.018710 | 0.006943 | 0.006389 |
| 2002-01-07 | -0.088523 | -0.005807 | -0.002379 | -0.006809 | 0.006389 |
| 2002-01-08 | 0.068049 | -0.005841 | -0.002385 | -0.002475 | 0.006389 |
| 2002-01-09 | -0.040483 | -0.003983 | -0.009597 | -0.004521 | 0.006905 |
| ... | ... | ... | ... | ... | ... |
| 2021-12-27 | -0.000872 | 0.009905 | 0.007607 | 0.012303 | 0.000317 |
| 2021-12-28 | 0.002323 | 0.005177 | 0.000176 | -0.002557 | 0.000317 |
| 2021-12-29 | 0.020806 | 0.003533 | -0.002647 | 0.000706 | 0.000317 |
| 2021-12-30 | 0.003685 | -0.001736 | -0.007982 | -0.001424 | 0.000317 |
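A unit check on the `rfdaily` column is worthwhile here. FRED publishes the DFF series in percent per annum, so dividing by 252 alone gives a rate in percent per day rather than a decimal. The sketch below works backward from the first `rfdaily` value in the table above; the 252-day convention is carried over from the cell above, and the variable names are illustrative.

```python
rf_col_value = 0.006825            # rfdaily on 2002-01-03, from the table above
dff_percent = rf_col_value * 252   # implied annual DFF quote, about 1.72 (percent)

# Assumed convention: percent -> decimal, then spread over 252 trading days
rf_daily_decimal = dff_percent / 100 / 252

print(f"annual quote: {dff_percent:.2f}%   daily decimal rate: {rf_daily_decimal:.7f}")
```

On a decimal scale the daily rate would be about 0.00007, comparable to the $\frac{.02}{252}$ example in comment 3. The excess-return regressions in this section use `rfdaily` as computed in the cell above, that is, on the percent scale.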


We first want to model excess returns, so we create columns for excess returns.

In [None]:
data['OTTR_er'] = data['OTTR'] - data.rfdaily

data['PEP_er'] = data['PEP'] - data.rfdaily

data['USB_er'] = data['USB'] - data.rfdaily


data['M_er'] = data['^W5000'] - data.rfdaily

data

| Date | OTTR | PEP | USB | ^W5000 | rfdaily | OTTR_er | PEP_er | USB_er | M_er |
|---|---|---|---|---|---|---|---|---|---|
| 2002-01-03 | 0.028988 | -0.007351 | 0.004854 | 0.009794 | 0.006825 | 0.022162 | -0.014177 | -0.001971 | 0.002968 |
| 2002-01-04 | 0.023531 | -0.008852 | 0.018710 | 0.006943 | 0.006389 | 0.017142 | -0.015241 | 0.012321 | 0.000554 |
| 2002-01-07 | -0.088523 | -0.005807 | -0.002379 | -0.006809 | 0.006389 | -0.094912 | -0.012196 | -0.008768 | -0.013197 |
| 2002-01-08 | 0.068049 | -0.005841 | -0.002385 | -0.002475 | 0.006389 | 0.061660 | -0.012230 | -0.008774 | -0.008864 |
| 2002-01-09 | -0.040483 | -0.003983 | -0.009597 | -0.004521 | 0.006905 | -0.047388 | -0.010888 | -0.016502 | -0.011426 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-12-27 | -0.000872 | 0.009905 | 0.007607 | 0.012303 | 0.000317 | -0.001189 | 0.009587 | 0.007290 | 0.011985 |
| 2021-12-28 | 0.002323 | 0.005177 | 0.000176 | -0.002557 | 0.000317 | 0.002005 | 0.004860 | -0.000141 | -0.002875 |
| 2021-12-29 | 0.020806 | 0.003533 | -0.002647 | 0.000706 | 0.000317 | 0.020488 | 0.003215 | -0.002964 | 0.000389 |
| 2021-12-30 | 0.003685 | -0.001736 | -0.007982 | -0.001424 | 0.000317 | 0.003368 | -0.002053 | -0.008300 | -0.001741 |
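The four subtractions can also be written as a single aligned operation. This is a minimal sketch on the first two rows of the returns table; the names `assets` and `excess` are illustrative and not part of the notebook.

```python
import pandas as pd

# First two rows of the returns table above
data = pd.DataFrame(
    {"OTTR": [0.028988, 0.023531],
     "PEP": [-0.007351, -0.008852],
     "USB": [0.004854, 0.018710],
     "^W5000": [0.009794, 0.006943],
     "rfdaily": [0.006825, 0.006389]},
    index=pd.to_datetime(["2002-01-03", "2002-01-04"]),
)

assets = ["OTTR", "PEP", "USB", "^W5000"]

# Subtract rfdaily from every asset column, aligning on the date index
excess = data[assets].sub(data["rfdaily"], axis=0)
excess.columns = ["OTTR_er", "PEP_er", "USB_er", "M_er"]

print(excess)
```

Because the inputs here are the rounded values displayed in the table, the results can differ from the table's excess-return columns in the last decimal place.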


Now we perform regressions to fit the three stocks' betas.

#### OtterTail

In [None]:
# No constant is added, so this is a regression through the origin
OTTRfit = sm.OLS(data.OTTR_er, data['M_er']).fit()

OTTRfit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | OTTR_er | R-squared (uncentered): | 0.459 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.459 |
| Method: | Least Squares | F-statistic: | 4261.0 |
| Date: | Thu, 07 Apr 2022 | Prob (F-statistic): | 0.0 |
| Time: | 00:17:07 | Log-Likelihood: | 13847.0 |
| No. Observations: | 5023 | AIC: | -27690.0 |
| Df Residuals: | 5022 | BIC: | -27680.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| M_er | 0.9639 | 0.015 | 65.274 | 0.000 | 0.935 | 0.993 |

| | | | |
|---|---|---|---|
| Omnibus: | 2055.895 | Durbin-Watson: | 2.174 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 122220.199 |
| Skew: | -1.143 | Prob(JB): | 0.0 |
| Kurtosis: | 27.057 | Cond. No. | 1.0 |


We see here a "beta" for OtterTail of 0.9639.


#### Pepsi

In [None]:
PEPfit =  sm.OLS(data.PEP_er, data['M_er'] ).fit()

PEPfit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | PEP_er | R-squared (uncentered): | 0.502 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.502 |
| Method: | Least Squares | F-statistic: | 5068.0 |
| Date: | Thu, 07 Apr 2022 | Prob (F-statistic): | 0.0 |
| Time: | 00:17:28 | Log-Likelihood: | 15923.0 |
| No. Observations: | 5023 | AIC: | -31840.0 |
| Df Residuals: | 5022 | BIC: | -31840.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| M_er | 0.6952 | 0.010 | 71.188 | 0.000 | 0.676 | 0.714 |

| | | | |
|---|---|---|---|
| Omnibus: | 1081.728 | Durbin-Watson: | 1.932 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 45523.978 |
| Skew: | 0.031 | Prob(JB): | 0.0 |
| Kurtosis: | 17.748 | Cond. No. | 1.0 |


For Pepsi, beta is .6952.

#### US Bank



In [None]:
USBfit =  sm.OLS(data.USB_er, data['M_er'] ).fit()

USBfit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | USB_er | R-squared (uncentered): | 0.549 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.549 |
| Method: | Least Squares | F-statistic: | 6112.0 |
| Date: | Thu, 07 Apr 2022 | Prob (F-statistic): | 0.0 |
| Time: | 00:17:35 | Log-Likelihood: | 13948.0 |
| No. Observations: | 5023 | AIC: | -27890.0 |
| Df Residuals: | 5022 | BIC: | -27890.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| M_er | 1.1313 | 0.014 | 78.177 | 0.000 | 1.103 | 1.160 |

| | | | |
|---|---|---|---|
| Omnibus: | 1395.966 | Durbin-Watson: | 2.025 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 147974.12 |
| Skew: | 0.138 | Prob(JB): | 0.0 |
| Kurtosis: | 29.588 | Cond. No. | 1.0 |


We see that US Bank has a beta of 1.1313.

For contrast, we can also fit the models using raw returns instead of excess returns.



In [None]:
USBfit =  sm.OLS(data.USB, data['^W5000'] ).fit()

USBfit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | USB | R-squared (uncentered): | 0.487 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.487 |
| Method: | Least Squares | F-statistic: | 4766.0 |
| Date: | Thu, 07 Apr 2022 | Prob (F-statistic): | 0.0 |
| Time: | 00:18:02 | Log-Likelihood: | 13965.0 |
| No. Observations: | 5023 | AIC: | -27930.0 |
| Df Residuals: | 5022 | BIC: | -27920.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ^W5000 | 1.1865 | 0.017 | 69.037 | 0.000 | 1.153 | 1.220 |

| | | | |
|---|---|---|---|
| Omnibus: | 1408.474 | Durbin-Watson: | 2.029 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 146211.4 |
| Skew: | 0.195 | Prob(JB): | 0.0 |
| Kurtosis: | 29.428 | Cond. No. | 1.0 |


In [None]:
PEPfit =  sm.OLS(data.PEP, data['^W5000'] ).fit()

PEPfit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | PEP | R-squared (uncentered): | 0.343 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.342 |
| Method: | Least Squares | F-statistic: | 2617.0 |
| Date: | Thu, 07 Apr 2022 | Prob (F-statistic): | 0.0 |
| Time: | 00:18:18 | Log-Likelihood: | 16117.0 |
| No. Observations: | 5023 | AIC: | -32230.0 |
| Df Residuals: | 5022 | BIC: | -32230.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ^W5000 | 0.5729 | 0.011 | 51.159 | 0.000 | 0.551 | 0.595 |

| | | | |
|---|---|---|---|
| Omnibus: | 1198.184 | Durbin-Watson: | 2.066 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 71999.778 |
| Skew: | 0.008 | Prob(JB): | 0.0 |
| Kurtosis: | 21.548 | Cond. No. | 1.0 |


In [None]:
OTTRfit =  sm.OLS(data.OTTR, data['^W5000'] ).fit()

OTTRfit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | OTTR | R-squared (uncentered): | 0.367 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.367 |
| Method: | Least Squares | F-statistic: | 2914.0 |
| Date: | Thu, 07 Apr 2022 | Prob (F-statistic): | 0.0 |
| Time: | 00:18:25 | Log-Likelihood: | 13848.0 |
| No. Observations: | 5023 | AIC: | -27690.0 |
| Df Residuals: | 5022 | BIC: | -27690.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ^W5000 | 0.9497 | 0.018 | 53.981 | 0.000 | 0.915 | 0.984 |

| | | | |
|---|---|---|---|
| Omnibus: | 2049.119 | Durbin-Watson: | 2.178 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 122119.259 |
| Skew: | -1.136 | Prob(JB): | 0.0 |
| Kurtosis: | 27.048 | Cond. No. | 1.0 |
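To compare the two sets of fits side by side, the short sketch below collects the six fitted betas reported in the summaries above and prints the gap for each stock. The dictionary names are illustrative; the numbers are copied from the outputs, not re-estimated.

```python
# Fitted betas copied from the six regression summaries above
betas_excess = {"OTTR": 0.9639, "PEP": 0.6952, "USB": 1.1313}   # excess returns on M_er
betas_raw = {"OTTR": 0.9497, "PEP": 0.5729, "USB": 1.1865}      # raw returns on ^W5000

# Difference between the two estimates for each ticker
beta_gap = {t: round(betas_excess[t] - betas_raw[t], 4) for t in betas_excess}

for ticker in betas_excess:
    print(f"{ticker:5s} excess = {betas_excess[ticker]:.4f}  "
          f"raw = {betas_raw[ticker]:.4f}  gap = {beta_gap[ticker]:+.4f}")
```

The gap is small for OTTR but about 0.12 for PEP, in line with the caveat that the two formulations need not agree when the risk-free term in the regression is not negligible.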
