/
replicating-fama-and-french-factors.qmd
381 lines (307 loc) · 17.2 KB
/
replicating-fama-and-french-factors.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
---
title: Replicating Fama-French Factors
aliases:
- ../replicating-fama-and-french-factors.html
metadata:
pagetitle: Replicating Fama-French Factors with R
description-meta: Use the programming language R to replicate the famous Fama-French three- and five-factor asset pricing models.
---
::: callout-note
You are reading **Tidy Finance with R**. You can find the equivalent chapter for the sibling **Tidy Finance with Python** [here](../python/replicating-fama-and-french-factors.qmd).
:::
In this chapter, we provide a replication of the famous Fama-French factor portfolios. The Fama-French factor models are a cornerstone of empirical asset pricing [see @Fama1992 and @FamaFrench2015]. On top of the market factor represented by the traditional CAPM beta, the three-factor model includes the size and value factors to explain the cross section of returns. Its successor, the five-factor model, additionally includes profitability and investment as explanatory factors.
We start with the three-factor model. We already introduced the size and value factors in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd), and their definition remains the same: size is the SMB factor (small-minus-big) that is long small firms and short large firms. The value factor is HML (high-minus-low) and is long in high book-to-market firms and short in low book-to-market counterparts.
After the replication of the three-factor model, we move to the five factors by constructing the profitability factor RMW (robust-minus-weak) as the difference between the returns of firms with high and low operating profitability and the investment factor CMA (conservative-minus-aggressive) as the difference between firms with high versus low investment rates.
The current chapter relies on this set of R packages.
```{r}
#| message: false
library(tidyverse)
library(RSQLite)
```
## Data Preparation
We use CRSP and Compustat as data sources, as we need the same variables to compute the factors in the way Fama and French do it. Hence, there is nothing new below, and we only load data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).^[Note that @Fama1992 claim to exclude financial firms. To a large extent this happens through using industry format "INDL", as we do in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). Neither the original paper, nor Ken French's website, or the WRDS replication contains any indication that financial companies are excluded using additional filters such as industry codes.] \index{Data!CRSP}\index{Data!Compustat}
```{r}
tidy_finance <- dbConnect(
SQLite(),
"data/tidy_finance_r.sqlite",
extended_types = TRUE
)
crsp_monthly <- tbl(tidy_finance, "crsp_monthly") |>
select(
permno, gvkey, month, ret_excess,
mktcap, mktcap_lag, exchange
) |>
collect()
compustat <- tbl(tidy_finance, "compustat") |>
select(gvkey, datadate, be, op, inv) |>
collect()
factors_ff3_monthly <- tbl(tidy_finance, "factors_ff3_monthly") |>
select(month, smb, hml) |>
collect()
factors_ff5_monthly <- tbl(tidy_finance, "factors_ff5_monthly") |>
select(month, smb, hml, rmw, cma) |>
collect()
```
Yet when we start merging our dataset for computing the premiums, there are a few differences to [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd). First, Fama and French form their portfolios in June of year $t$, whereby the returns of July are the first monthly return for the respective portfolio. For firm size, they consequently use the market capitalization recorded for June. It is then held constant until June of year $t+1$.
Second, Fama and French also have a different protocol for computing the book-to-market ratio.\index{Book-to-market ratio} They use market equity as of the end of year $t - 1$ and the book equity reported in year $t-1$, i.e., the `datadate` is within the last year.\index{Book equity} Hence, the book-to-market ratio can be based on accounting information that is up to 18 months old. Market equity also does not necessarily reflect the same time point as book equity. The other sorting variables are analogously to book equity taken from year $t-1$.
To implement all these time lags, we again employ the temporary `sorting_date`-column. Notice that when we combine the information, we want to have a single observation per year and stock since we are only interested in computing the breakpoints held constant for the entire year. We ensure this by a call of `distinct()` at the end of the chunk below.
```{r}
size <- crsp_monthly |>
filter(month(month) == 6) |>
mutate(sorting_date = month %m+% months(1)) |>
select(permno, exchange, sorting_date, size = mktcap)
market_equity <- crsp_monthly |>
filter(month(month) == 12) |>
mutate(sorting_date = ymd(str_c(year(month) + 1, "0701)"))) |>
select(permno, gvkey, sorting_date, me = mktcap)
book_to_market <- compustat |>
mutate(sorting_date = ymd(str_c(year(datadate) + 1, "0701"))) |>
select(gvkey, sorting_date, be) |>
inner_join(market_equity, join_by(gvkey, sorting_date)) |>
mutate(bm = be / me) |>
select(permno, sorting_date, me, bm)
sorting_variables <- size |>
inner_join(
book_to_market, join_by(permno, sorting_date)
) |>
drop_na() |>
distinct(permno, sorting_date, .keep_all = TRUE)
```
## Portfolio Sorts
Next, we construct our portfolios with an adjusted `assign_portfolio()` function.\index{Portfolio sorts} Fama and French rely on NYSE-specific breakpoints, they form two portfolios in the size dimension at the median and three portfolios in the dimension of each other sorting variable at the 30- and 70-percentiles, and they use dependent sorts. The sorts for book-to-market require an adjustment to the function in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd) because the `seq()` we would produce does not produce the right breakpoints. Instead of `n_portfolios`, we now specify `percentiles`, which take the breakpoint-sequence as an object specified in the function's call. Specifically, we give `percentiles = c(0, 0.3, 0.7, 1)` to the function. Additionally, we perform an `inner_join()` with our return data to ensure that we only use traded stocks when computing the breakpoints as a first step.\index{Breakpoints}
```{r}
assign_portfolio <- function(data,
sorting_variable,
percentiles) {
breakpoints <- data |>
filter(exchange == "NYSE") |>
pull({{ sorting_variable }}) |>
quantile(
probs = percentiles,
na.rm = TRUE,
names = FALSE
)
assigned_portfolios <- data |>
mutate(portfolio = findInterval(
pick(everything()) |>
pull({{ sorting_variable }}),
breakpoints,
all.inside = TRUE
)) |>
pull(portfolio)
return(assigned_portfolios)
}
portfolios <- sorting_variables |>
group_by(sorting_date) |>
mutate(
portfolio_size = assign_portfolio(
data = pick(everything()),
sorting_variable = size,
percentiles = c(0, 0.5, 1)
),
portfolio_bm = assign_portfolio(
data = pick(everything()),
sorting_variable = bm,
percentiles = c(0, 0.3, 0.7, 1)
)
) |>
ungroup() |>
select(permno, sorting_date,
portfolio_size, portfolio_bm)
```
Next, we merge the portfolios to the return data for the rest of the year. To implement this step, we create a new column `sorting_date` in our return data by setting the date to sort on to July of $t-1$ if the month is June (of year $t$) or earlier or to July of year $t$ if the month is July or later.
```{r}
portfolios <- crsp_monthly |>
mutate(sorting_date = case_when(
month(month) <= 6 ~ ymd(str_c(year(month) - 1, "0701")),
month(month) >= 7 ~ ymd(str_c(year(month), "0701"))
)) |>
inner_join(portfolios, join_by(permno, sorting_date))
```
## Fama-French Three-Factor Model
Equipped with the return data and the assigned portfolios, we can now compute the value-weighted average return for each of the six portfolios. Then, we form the Fama-French factors. For the size factor (i.e., SMB), we go long in the three small portfolios and short the three large portfolios by taking an average across either group. For the value factor (i.e., HML), we go long in the two high book-to-market portfolios and short the two low book-to-market portfolios, again weighting them equally.\index{Factor!Size}\index{Factor!Value}
```{r}
factors_replicated <- portfolios |>
group_by(portfolio_size, portfolio_bm, month) |>
summarize(
ret = weighted.mean(ret_excess, mktcap_lag), .groups = "drop"
) |>
group_by(month) |>
summarize(
smb_replicated = mean(ret[portfolio_size == 1]) -
mean(ret[portfolio_size == 2]),
hml_replicated = mean(ret[portfolio_bm == 3]) -
mean(ret[portfolio_bm == 1])
)
```
## Replication Evaluation
In the previous section, we replicated the size and value premiums following the procedure outlined by Fama and French.\index{Size!Size premium}\index{Value premium} The final question is then: how close did we get? We answer this question by looking at the two time-series estimates in a regression analysis using `lm()`. If we did a good job, then we should see a non-significant intercept (rejecting the notion of systematic error), a coefficient close to 1 (indicating a high correlation), and an adjusted R-squared close to 1 (indicating a high proportion of explained variance).
```{r}
test <- factors_ff3_monthly |>
inner_join(factors_replicated, join_by(month)) |>
mutate(
across(c(smb_replicated, hml_replicated), ~round(., 4))
)
```
To test the success of the SMB factor, we hence run the following regression:
```{r}
model_smb <- lm(smb ~ smb_replicated, data = test)
summary(model_smb)
```
The results for the SMB factor are really convincing, as all three criteria outlined above are met and the coefficient is `r round(model_smb$coefficients[2], 2)` and the R-squared is at `r round(summary(model_smb)$adj.r.squared, 2) * 100` percent.
```{r}
model_hml <- lm(hml ~ hml_replicated, data = test)
summary(model_hml)
```
The replication of the HML factor is also a success, although at a slightly lower coefficient of `r round(model_hml$coefficients[2], 2)` and an R-squared around `r round(summary(model_hml)$adj.r.squared, 2) * 100` percent.
The evidence hence allows us to conclude that we did a relatively good job in replicating the original Fama-French size and value premiums, although we do not know their underlying code. From our perspective, a perfect match is only possible with additional information from the maintainers of the original data.
## Fama-French Five-Factor Model
Now, let us move to the replication of the five-factor model. We extend the `other_sorting_variables` table from above with the additional characteristics operating profitability `op` and investment `inv`. Note that the `drop_na()` statement yields different sample sizes as some firms with `be` values might not have `op` or `inv` values.
```{r}
other_sorting_variables <- compustat |>
mutate(sorting_date = ymd(str_c(year(datadate) + 1, "0701"))) |>
select(gvkey, sorting_date, be, op, inv) |>
inner_join(market_equity,
join_by(gvkey, sorting_date)) |>
mutate(bm = be / me) |>
select(permno, sorting_date, me, be, bm, op, inv)
sorting_variables <- size |>
inner_join(
other_sorting_variables,
join_by(permno, sorting_date)
) |>
drop_na() |>
distinct(permno, sorting_date, .keep_all = TRUE)
```
In each month, we independently sort all stocks into the two size portfolios. The value, profitability, and investment portfolios, on the other hand, are the results of dependent sorts based on the size portfolios. We then merge the portfolios to the return data for the rest of the year just as above.
```{r}
portfolios <- sorting_variables |>
group_by(sorting_date) |>
mutate(
portfolio_size = assign_portfolio(
data = pick(everything()),
sorting_variable = size,
percentiles = c(0, 0.5, 1)
)) |>
group_by(sorting_date, portfolio_size) |>
mutate(
across(c(bm, op, inv), ~assign_portfolio(
data = pick(everything()),
sorting_variable = .,
percentiles = c(0, 0.3, 0.7, 1)),
.names = "portfolio_{.col}"
)
) |>
ungroup() |>
select(permno, sorting_date,
portfolio_size, portfolio_bm,
portfolio_op, portfolio_inv)
portfolios <- crsp_monthly |>
mutate(sorting_date = case_when(
month(month) <= 6 ~ ymd(str_c(year(month) - 1, "0701")),
month(month) >= 7 ~ ymd(str_c(year(month), "0701"))
)) |>
inner_join(portfolios, join_by(permno, sorting_date))
```
Now, we want to construct each of the factors, but this time the size factor actually comes last because it is the result of averaging across all other factor portfolios. This dependency is the reason why we keep the table with value-weighted portfolio returns as a separate object that we reuse later. We construct the value factor, HML, as above by going long the two portfolios with high book-to-market ratios and shorting the two portfolios with low book-to-market.
```{r}
portfolios_value <- portfolios |>
group_by(portfolio_size, portfolio_bm, month) |>
summarize(
ret = weighted.mean(ret_excess, mktcap_lag),
.groups = "drop"
)
factors_value <- portfolios_value |>
group_by(month) |>
summarize(
hml_replicated = mean(ret[portfolio_bm == 3]) -
mean(ret[portfolio_bm == 1])
)
```
For the profitability factor, RMW, we take a long position in the two high profitability portfolios and a short position in the two low profitability portfolios.\index{Factor!Profitability}
```{r}
portfolios_profitability <- portfolios |>
group_by(portfolio_size, portfolio_op, month) |>
summarize(
ret = weighted.mean(ret_excess, mktcap_lag),
.groups = "drop"
)
factors_profitability <- portfolios_profitability |>
group_by(month) |>
summarize(
rmw_replicated = mean(ret[portfolio_op == 3]) -
mean(ret[portfolio_op == 1])
)
```
For the investment factor, CMA, we go long the two low investment portfolios and short the two high investment portfolios.\index{Factor!Investment}
```{r}
portfolios_investment <- portfolios |>
group_by(portfolio_size, portfolio_inv, month) |>
summarize(
ret = weighted.mean(ret_excess, mktcap_lag),
.groups = "drop"
)
factors_investment <- portfolios_investment |>
group_by(month) |>
summarize(
cma_replicated = mean(ret[portfolio_inv == 1]) -
mean(ret[portfolio_inv == 3])
)
```
Finally, the size factor, SMB, is constructed by going long the six small portfolios and short the six large portfolios.
```{r}
factors_size <- bind_rows(
portfolios_value,
portfolios_profitability,
portfolios_investment
) |>
group_by(month) |>
summarize(
smb_replicated = mean(ret[portfolio_size == 1]) -
mean(ret[portfolio_size == 2])
)
```
We then join all factors together into one dataframe and construct again a suitable table to run tests for evaluating our replication.
```{r}
factors_replicated <- factors_size |>
full_join(
factors_value, join_by(month)
) |>
full_join(
factors_profitability, join_by(month)
) |>
full_join(
factors_investment, join_by(month)
)
test <- factors_ff5_monthly |>
inner_join(factors_replicated, join_by(month)) |>
mutate(
across(c(smb_replicated, hml_replicated,
rmw_replicated, cma_replicated), ~round(., 4))
)
```
Let us start the replication evaluation again with the size factor:
```{r}
model_smb <- lm(smb ~ smb_replicated, data = test)
summary(model_smb)
```
The results for the SMB factor are quite convincing as all three criteria outlined above are met and the coefficient is `r round(model_smb$coefficients[2], 2)` and the R-squared is at `r round(summary(model_smb)$adj.r.squared, 2) * 100` percent.
```{r}
model_hml <- lm(hml ~ hml_replicated, data = test)
summary(model_hml)
```
The replication of the HML factor is also a success, although at a slightly higher coefficient of `r round(model_hml$coefficients[2], 2)` and an R-squared around `r round(summary(model_hml)$adj.r.squared, 2) * 100` percent.
```{r}
model_rmw <- lm(rmw ~ rmw_replicated, data = test)
summary(model_rmw)
```
We are also able to replicate the RMW factor quite well with a coefficient of `r round(model_rmw$coefficients[2], 2)` and an R-squared around `r round(summary(model_rmw)$adj.r.squared, 2) * 100` percent.
```{r}
model_cma <- lm(cma ~ cma_replicated, data = test)
summary(model_cma)
```
Finally, the CMA factor also replicates well with a coefficient of `r round(model_cma$coefficients[2], 2)` and an R-squared around `r round(summary(model_cma)$adj.r.squared, 2) * 100` percent.
Overall, our approach seems to replicate the Fama-French five-factor models just as well as the three factors.
## Exercises
1. @Fama1993 claim that their sample excludes firms until they have appeared in Compustat for two years. Implement this additional filter and compare the improvements of your replication effort.
2. On his homepage, [Kenneth French](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/variable_definitions.html) provides instructions on how to construct the most common variables used for portfolio sorts. Try to replicate the univariate portfolio sort return time series for `E/P` (earnings/price) provided on his homepage and evaluate your replication effort using regressions.