diff --git a/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json b/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json index 4fa6646a..f4920b89 100644 --- a/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json +++ b/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "a7081974449810c632425178ddd3ae0a", "result": { "engine": "knitr", - "markdown": "---\ntitle: Accessing and Managing Financial Data\naliases: \n - ../accessing-and-managing-financial-data.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Accessing and Managing Financial Data with R\n description-meta: Download and organize open-source financial data using the programming language R. \n---\n\n\nIn this chapter, we suggest a way to organize your financial data. Everybody, who has experience with data, is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome in the case of using different data formats, both across different projects and across different programming languages. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages. \n\nThis chapter shows how to import different open source data sets. Specifically, our data comes from the application programming interface (API) of Yahoo!Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series.\\index{API} We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the global R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(scales)\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2022-12-31\")\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. Fortunately, there is a neat package by [Nelson Areal](https://github.com/nareal/frenchdata/) that allows us to access the data easily: the `frenchdata` package provides functions to download and read data sets from [Prof. Kenneth French finance data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) [@frenchdata].\\index{Data!Fama-French factors} \\index{Kenneth French homepage}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(frenchdata)\n```\n:::\n\n\nWe can use the `download_french_data()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market `mkt_excess`, size `smb` and value `hml` alongside the risk-free rates `rf`. 
Note that we have to do some manual work to correctly parse all the columns and scale them appropriately, as the raw Fama-French data comes in a very unpractical data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to `frenchdata`.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_raw <- download_french_data(\"Fama/French 3 Factors\")\nfactors_ff3_monthly <- factors_ff3_monthly_raw$subsets$data[[1]] |>\n mutate(\n month = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability `rmw` and investment `cma` factors. We demonstrate how the monthly factors are constructed in the chapter [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n \n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff5_monthly_raw <- download_french_data(\"Fama/French 5 Factors (2x3)\")\n\nfactors_ff5_monthly <- factors_ff5_monthly_raw$subsets$data[[1]] |>\n mutate(\n month = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML, RMW, CMA), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily_raw <- download_french_data(\"Fama/French 3 Factors [Daily]\")\n\nfactors_ff3_daily <- factors_ff3_daily_raw$subsets$data[[1]] |>\n mutate(\n date = ymd(date),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIn a subsequent chapter, we also use the 10 monthly industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindustries_ff_monthly_raw <- download_french_data(\"10 Industry Portfolios\")\n\nindustries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>\n mutate(month = floor_date(ymd(str_c(date, \"01\")), \"month\")) |>\n mutate(across(where(is.numeric), ~ . / 100)) |>\n select(month, everything(), -date) |>\n filter(month >= start_date & month <= end_date) |> \n rename_with(str_to_lower)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `get_french_data_list()`. For an alternative to download Fama-French data, check out the `FFdownload` package by [Sebastian Stöckl](https://github.com/sstoeckl/ffdownload).\n\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. 
We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q* factors can be downloaded directly from the authors' homepage from within `read_csv()`.\\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we rename the columns with the \"R_\"-prescript using regular expressions and write all column names in lowercase. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_q_monthly_link <-\n \"https://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2022.csv\"\n\nfactors_q_monthly <- read_csv(factors_q_monthly_link) |>\n mutate(month = ymd(str_c(year, month, \"01\", sep = \"-\"))) |>\n select(-R_F, -R_MKT, -year) |>\n rename_with(~ str_remove(., \"R_\")) |>\n rename_with(~ str_to_lower(.)) |>\n mutate(across(-month, ~ . / 100)) |>\n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data updated to 2022 on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) Since the data is an XLSX-file stored on a public Google drive location, we need additional packages to access the data directly from our R session. Therefore, we load `readxl` to read the XLSX-file [@readxl] and `googledrive` for the Google drive connection [@googledrive].\\index{Data!Macro predictors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(readxl)\nlibrary(googledrive)\n```\n:::\n\n\nUsually, you need to authenticate if you interact with Google drive directly in R. Since the data is stored via a public link, we can proceed without any authentication.\\index{Google Drive}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndrive_deauth()\n```\n:::\n\n\nThe `drive_download()` function from the `googledrive` package allows us to download the data and store it locally.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors_link <-\n \"https://docs.google.com/spreadsheets/d/1g4LOaRj4TvwJr9RIaA_nwrXXWTOy46bP\"\n\ndrive_download(\n macro_predictors_link,\n path = \"macro_predictors.xlsx\"\n)\n```\n:::\n\n\nNext, we read in the new data and transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006]. \n1. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978]. \n1. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988]. \n1. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998]. \n1. 
Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n1. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997] \n1. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n1. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n1. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n1. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989]. \n1. Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\t\t\t\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website](https://sites.google.com/view/agoyal145).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors <- read_xlsx(\n \"macro_predictors.xlsx\",\n sheet = \"Monthly\"\n) |>\n mutate(month = ym(yyyymm)) |>\n mutate(across(where(is.character), as.numeric)) |>\n mutate(\n IndexDiv = Index + D12,\n logret = log(IndexDiv) - log(lag(IndexDiv)),\n Rfree = log(Rfree + 1),\n rp_div = lead(logret - Rfree, 1), # Future excess market return\n dp = log(D12) - log(Index), # Dividend Price ratio\n dy = log(D12) - log(lag(Index)), # Dividend yield\n ep = log(E12) - log(Index), # Earnings price ratio\n de = log(D12) - log(E12), # Dividend payout ratio\n tms = lty - tbl, # Term spread\n dfy = BAA - AAA # Default yield spread\n ) |>\n select(month, rp_div, dp, dy, ep, de, svar,\n bm = `b/m`, ntis, tbl, lty, ltr,\n tms, dfy, infl\n ) |>\n filter(month >= start_date & month <= end_date) |>\n drop_na()\n```\n:::\n\n\nFinally, after reading in the macro predictors to our memory, we remove the raw data file from our temporary storage. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(\"macro_predictors.xlsx\")\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. As an illustration, we use the already familiar `tidyquant` package to fetch consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS) key.\\index{Data!FRED}\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyquant)\n\ncpi_monthly <- tq_get(\"CPIAUCNS\",\n get = \"economic.data\",\n from = start_date,\n to = end_date\n) |>\n mutate(\n month = floor_date(date, \"month\"),\n cpi = price / price[month == max(month)],\n .keep = \"none\"\n )\n```\n:::\n\n\nTo download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. 
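As a quick illustration, downloading this series simply reuses the `tq_get()` call from the CPI example above; the object name `ppi_gold` is arbitrary, and the series is not used further in this book.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Illustration only: producer price index for gold ores from FRED,\n# following the same tq_get() pattern as the CPI series above\nppi_gold <- tq_get(\"PCU2122212122210\",\n get = \"economic.data\",\n from = start_date,\n to = end_date\n)\n```\n:::\n\n\n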
The `tidyquant` package provides access to around 10,000 time series of the FRED database. If your desired time series is not included, we recommend working with the `fredr` package [@fredr]. Note that you need to get an API key to use its functionality. We refer to the package documentation for details. \n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our R session let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code. \n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://www.sqlite.org/index.html) database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases and heavily inspired the `dplyr` functions. We refer to [this tutorial](https://www.w3schools.com/sql/sql_intro.asp) for more information on SQL.\\index{Database!SQLite}\n\nThere are two packages that make working with SQLite in R very simple: `RSQLite` [@RSQLite] embeds the SQLite database engine in R, and `dbplyr` [@dbplyr] is the database back-end for `dplyr`. These packages allow to set up a database to remotely store tables and use these remote database tables as if they are in-memory data frames by automatically converting `dplyr` into SQL. Check out the [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html) and [`dbplyr`](https://db.rstudio.com/databases/sqlite/) vignettes for more information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nAn SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that we use the `extended_types=TRUE` option to enable date types when storing and fetching data. Otherwise, date columns are stored and retrieved as integers.\\index{Database!Creation} We will use the resulting file `tidy_finance_r.sqlite` in the subfolder `data` for all subsequent chapters to retrieve our data. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\nNext, we create a remote table with the monthly Fama-French factor data. We do so with the function `dbWriteTable()`, which copies the data to our SQLite-database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"factors_ff3_monthly\",\n value = factors_ff3_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nWe can use the remote table as an in-memory data frame by building a connection via `tbl()`.\\index{Database!Remote connection}\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db <- tbl(tidy_finance, \"factors_ff3_monthly\")\n```\n:::\n\n\nAll `dplyr` calls are evaluated lazily, i.e., the data is not in our R session's memory, and the database does most of the work. You can see that by noticing that the output below does not show the number of rows. In fact, the following code chunk only fetches the top 10 rows from the database for printing. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(month, rf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Source: SQL [?? x 2]\n# Database: sqlite 3.41.2 [data/tidy_finance_r.sqlite]\n month rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ more rows\n```\n\n\n:::\n:::\n\n\nIf we want to have the whole table in memory, we need to `collect()` it. You will see that we regularly load the data into the memory in the next chapters.\\index{Database!Fetch}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(month, rf) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 756 × 2\n month rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ 751 more rows\n```\n\n\n:::\n:::\n\n\nThe last couple of code chunks is really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages. \n\nBefore we move on to the next data source, let us also store the other five tables in our new SQLite database. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"factors_ff5_monthly\",\n value = factors_ff5_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"factors_ff3_daily\",\n value = factors_ff3_daily,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"industries_ff_monthly\",\n value = industries_ff_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"factors_q_monthly\",\n value = factors_q_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"macro_predictors\",\n value = macro_predictors,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"cpi_monthly\",\n value = cpi_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow three steps: (i) Establish the connection to the SQLite database, (ii) call the table you want to extract, and (iii) collect the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\")\nfactors_q_monthly <- factors_q_monthly |> collect()\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command in the database, which rebuilds the database and frees up unused space. You can execute the command in the database using the `dbSendQuery()` function. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- dbSendQuery(tidy_finance, \"VACUUM\")\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n SQL VACUUM\n ROWS Fetched: 0 [complete]\n Changed: 0\n```\n\n\n:::\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read up in [this tutorial.](https://www.sqlitetutorial.net/sqlite-vacuum/) \\index{Database!Cleaning}\n\nWe store the result of the above query in `res` because the database keeps the result set open. To close open results and avoid warnings going forward, we can use `dbClearResult()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbClearResult(res)\n```\n:::\n\n\nApart from cleaning up, you might be interested in listing all the tables that are currently in your database. You can do this via the `dbListTables()` function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListTables(tidy_finance)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"beta\" \"compustat\" \n [3] \"cpi_monthly\" \"crsp_daily\" \n [5] \"crsp_monthly\" \"factors_ff3_daily\" \n [7] \"factors_ff3_monthly\" \"factors_ff5_monthly\" \n [9] \"factors_q_monthly\" \"industries_ff_monthly\"\n[11] \"macro_predictors\" \"mergent\" \n[13] \"trace_enhanced\" \n```\n\n\n:::\n:::\n\n\nThis function comes in handy if you are unsure about the correct naming of the tables in your database. \n\n## Exercises\n\n1. Download the monthly Fama-French factors manually from [Ken French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `read_csv()`. Validate that you get the same data as via the `frenchdata` package. \n1. Download the daily Fama-French 5 factors using the `frenchdata` package. Use `get_french_data_list()` to find the corresponding table name. After the successful download and conversion to the column format that we used above, compare the `rf`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find. \n", + "markdown": "---\ntitle: Accessing and Managing Financial Data\naliases: \n - ../accessing-and-managing-financial-data.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Accessing and Managing Financial Data with R\n description-meta: Download and organize open-source financial data using the programming language R. \n---\n\n\nIn this chapter, we suggest a way to organize your financial data. Everybody, who has experience with data, is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome in the case of using different data formats, both across different projects and across different programming languages. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages. \n\nThis chapter shows how to import different open source data sets. Specifically, our data comes from the application programming interface (API) of Yahoo!Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series.\\index{API} We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. 
We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the global R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(scales)\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2022-12-31\")\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. Fortunately, there is a neat package by [Nelson Areal](https://github.com/nareal/frenchdata/) that allows us to access the data easily: the `frenchdata` package provides functions to download and read data sets from [Prof. Kenneth French finance data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) [@frenchdata].\\index{Data!Fama-French factors} \\index{Kenneth French homepage}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(frenchdata)\n```\n:::\n\n\nWe can use the `download_french_data()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market `mkt_excess`, size `smb` and value `hml` alongside the risk-free rates `rf`. Note that we have to do some manual work to correctly parse all the columns and scale them appropriately, as the raw Fama-French data comes in a very unpractical data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to `frenchdata`.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_raw <- download_french_data(\"Fama/French 3 Factors\")\nfactors_ff3_monthly <- factors_ff3_monthly_raw$subsets$data[[1]] |>\n mutate(\n month = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability `rmw` and investment `cma` factors. We demonstrate how the monthly factors are constructed in the chapter [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n \n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff5_monthly_raw <- download_french_data(\"Fama/French 5 Factors (2x3)\")\n\nfactors_ff5_monthly <- factors_ff5_monthly_raw$subsets$data[[1]] |>\n mutate(\n month = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML, RMW, CMA), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily_raw <- download_french_data(\"Fama/French 3 Factors [Daily]\")\n\nfactors_ff3_daily <- factors_ff3_daily_raw$subsets$data[[1]] |>\n mutate(\n date = ymd(date),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIn a subsequent chapter, we also use the 10 monthly industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindustries_ff_monthly_raw <- download_french_data(\"10 Industry Portfolios\")\n\nindustries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>\n mutate(month = floor_date(ymd(str_c(date, \"01\")), \"month\")) |>\n mutate(across(where(is.numeric), ~ . / 100)) |>\n select(month, everything(), -date) |>\n filter(month >= start_date & month <= end_date) |> \n rename_with(str_to_lower)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `get_french_data_list()`. For an alternative to download Fama-French data, check out the `FFdownload` package by [Sebastian Stöckl](https://github.com/sstoeckl/ffdownload).\n\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q* factors can be downloaded directly from the authors' homepage from within `read_csv()`.\\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we rename the columns with the \"R_\"-prescript using regular expressions and write all column names in lowercase. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_q_monthly_link <-\n \"https://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2022.csv\"\n\nfactors_q_monthly <- read_csv(factors_q_monthly_link) |>\n mutate(month = ymd(str_c(year, month, \"01\", sep = \"-\"))) |>\n select(-R_F, -R_MKT, -year) |>\n rename_with(~ str_remove(., \"R_\")) |>\n rename_with(~ str_to_lower(.)) |>\n mutate(across(-month, ~ . / 100)) |>\n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data updated to 2022 on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) Since the data is an XLSX-file stored on a public Google drive location, we need additional packages to access the data directly from our R session. 
Therefore, we load `readxl` to read the XLSX-file [@readxl] and `googledrive` for the Google drive connection [@googledrive].\\index{Data!Macro predictors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(readxl)\nlibrary(googledrive)\n```\n:::\n\n\nUsually, you need to authenticate if you interact with Google drive directly in R. Since the data is stored via a public link, we can proceed without any authentication.\\index{Google Drive}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndrive_deauth()\n```\n:::\n\n\nThe `drive_download()` function from the `googledrive` package allows us to download the data and store it locally.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors_link <-\n \"https://docs.google.com/spreadsheets/d/1g4LOaRj4TvwJr9RIaA_nwrXXWTOy46bP\"\n\ndrive_download(\n macro_predictors_link,\n path = \"macro_predictors.xlsx\"\n)\n```\n:::\n\n\nNext, we read in the new data and transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006]. \n1. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978]. \n1. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988]. \n1. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998]. \n1. Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n1. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997] \n1. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n1. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n1. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n1. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989]. \n1. 
Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\t\t\t\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website](https://sites.google.com/view/agoyal145).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors <- read_xlsx(\n \"macro_predictors.xlsx\",\n sheet = \"Monthly\"\n) |>\n mutate(month = ym(yyyymm)) |>\n mutate(across(where(is.character), as.numeric)) |>\n mutate(\n IndexDiv = Index + D12,\n logret = log(IndexDiv) - log(lag(IndexDiv)),\n Rfree = log(Rfree + 1),\n rp_div = lead(logret - Rfree, 1), # Future excess market return\n dp = log(D12) - log(Index), # Dividend Price ratio\n dy = log(D12) - log(lag(Index)), # Dividend yield\n ep = log(E12) - log(Index), # Earnings price ratio\n de = log(D12) - log(E12), # Dividend payout ratio\n tms = lty - tbl, # Term spread\n dfy = BAA - AAA # Default yield spread\n ) |>\n select(month, rp_div, dp, dy, ep, de, svar,\n bm = `b/m`, ntis, tbl, lty, ltr,\n tms, dfy, infl\n ) |>\n filter(month >= start_date & month <= end_date) |>\n drop_na()\n```\n:::\n\n\nFinally, after reading in the macro predictors to our memory, we remove the raw data file from our temporary storage. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(\"macro_predictors.xlsx\")\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. As an illustration, we use the already familiar `tidyquant` package to fetch consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS) key.\\index{Data!FRED}\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyquant)\n\ncpi_monthly <- tq_get(\"CPIAUCNS\",\n get = \"economic.data\",\n from = start_date,\n to = end_date\n) |>\n mutate(\n month = floor_date(date, \"month\"),\n cpi = price / price[month == max(month)],\n .keep = \"none\"\n )\n```\n:::\n\n\nTo download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. The `tidyquant` package provides access to around 10,000 time series of the FRED database. If your desired time series is not included, we recommend working with the `fredr` package [@fredr]. Note that you need to get an API key to use its functionality. We refer to the package documentation for details. \n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our R session let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code. \n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://www.sqlite.org/index.html) database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. 
Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases and heavily inspired the `dplyr` functions. We refer to [this tutorial](https://www.w3schools.com/sql/sql_intro.asp) for more information on SQL.\\index{Database!SQLite}\n\nThere are two packages that make working with SQLite in R very simple: `RSQLite` [@RSQLite] embeds the SQLite database engine in R, and `dbplyr` [@dbplyr] is the database back-end for `dplyr`. These packages allow to set up a database to remotely store tables and use these remote database tables as if they are in-memory data frames by automatically converting `dplyr` into SQL. Check out the [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html) and [`dbplyr`](https://db.rstudio.com/databases/sqlite/) vignettes for more information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nAn SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that we use the `extended_types=TRUE` option to enable date types when storing and fetching data. Otherwise, date columns are stored and retrieved as integers.\\index{Database!Creation} We will use the resulting file `tidy_finance_r.sqlite` in the subfolder `data` for all subsequent chapters to retrieve our data. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\nNext, we create a remote table with the monthly Fama-French factor data. We do so with the function `dbWriteTable()`, which copies the data to our SQLite-database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"factors_ff3_monthly\",\n value = factors_ff3_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nWe can use the remote table as an in-memory data frame by building a connection via `tbl()`.\\index{Database!Remote connection}\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db <- tbl(tidy_finance, \"factors_ff3_monthly\")\n```\n:::\n\n\nAll `dplyr` calls are evaluated lazily, i.e., the data is not in our R session's memory, and the database does most of the work. You can see that by noticing that the output below does not show the number of rows. In fact, the following code chunk only fetches the top 10 rows from the database for printing. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(month, rf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Source: SQL [?? x 2]\n# Database: sqlite 3.41.2 [data/tidy_finance_r.sqlite]\n month rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ more rows\n```\n\n\n:::\n:::\n\n\nIf we want to have the whole table in memory, we need to `collect()` it. You will see that we regularly load the data into the memory in the next chapters.\\index{Database!Fetch}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(month, rf) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 756 × 2\n month rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ 751 more rows\n```\n\n\n:::\n:::\n\n\nThe last couple of code chunks is really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages. 
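\n\nIf you are curious about the SQL that `dbplyr` generates from a pipeline before you collect it, you can call `show_query()` on the lazy table; the following small illustration uses the remote table from above:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Inspect the SQL translation of a lazy dplyr pipeline (no data is fetched)\nfactors_ff3_monthly_db |>\n select(month, rf) |>\n show_query()\n```\n:::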
\n\nBefore we move on to the next data source, let us also store the other five tables in our new SQLite database. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"factors_ff5_monthly\",\n value = factors_ff5_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"factors_ff3_daily\",\n value = factors_ff3_daily,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"industries_ff_monthly\",\n value = industries_ff_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"factors_q_monthly\",\n value = factors_q_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"macro_predictors\",\n value = macro_predictors,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"cpi_monthly\",\n value = cpi_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow three steps: (i) Establish the connection to the SQLite database, (ii) call the table you want to extract, and (iii) collect the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\")\nfactors_q_monthly <- factors_q_monthly |> collect()\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command in the database, which rebuilds the database and frees up unused space. You can execute the command in the database using the `dbSendQuery()` function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- dbSendQuery(tidy_finance, \"VACUUM\")\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n SQL VACUUM\n ROWS Fetched: 0 [complete]\n Changed: 0\n```\n\n\n:::\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read up in [this tutorial.](https://www.sqlitetutorial.net/sqlite-vacuum/) \\index{Database!Cleaning}\n\nWe store the result of the above query in `res` because the database keeps the result set open. To close open results and avoid warnings going forward, we can use `dbClearResult()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbClearResult(res)\n```\n:::\n\n\nApart from cleaning up, you might be interested in listing all the tables that are currently in your database. You can do this via the `dbListTables()` function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListTables(tidy_finance)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"beta\" \"compustat\" \n [3] \"cpi_monthly\" \"crsp_daily\" \n [5] \"crsp_monthly\" \"factors_ff3_daily\" \n [7] \"factors_ff3_monthly\" \"factors_ff5_monthly\" \n [9] \"factors_q_monthly\" \"fisd\" \n[11] \"industries_ff_monthly\" \"macro_predictors\" \n[13] \"trace_enhanced\" \n```\n\n\n:::\n:::\n\n\nThis function comes in handy if you are unsure about the correct naming of the tables in your database. \n\n## Exercises\n\n1. 
Download the monthly Fama-French factors manually from [Ken French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `read_csv()`. Validate that you get the same data as via the `frenchdata` package. \n1. Download the daily Fama-French 5 factors using the `frenchdata` package. Use `get_french_data_list()` to find the corresponding table name. After the successful download and conversion to the column format that we used above, compare the `rf`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find. \n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/r/beta-estimation/execute-results/html.json b/_freeze/r/beta-estimation/execute-results/html.json index 58a18cd4..46a37b9e 100644 --- a/_freeze/r/beta-estimation/execute-results/html.json +++ b/_freeze/r/beta-estimation/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "b38cf1c3a936b0371fd75eef648b83e7", + "hash": "fdf890ff102507ae5bd5d91ee49171e5", "result": { "engine": "knitr", - "markdown": "---\ntitle: Beta Estimation\naliases:\n - ../beta-estimation.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Beta Estimation with R\n description-meta: Estimate CAPM betas using monthly or daily CRSP data and the programming language R. \n---\n\n\nIn this chapter, we introduce an important concept in financial economics: the exposure of an individual stock to changes in the market portfolio. According to the Capital Asset Pricing Model (CAPM) of @Sharpe1964, @Lintner1965, and @Mossin1966, cross-sectional variation in expected asset returns should be a function of the covariance between the excess return of the asset and the excess return on the market portfolio.\\index{CAPM} The regression coefficient of excess market returns on excess stock returns is usually called the market beta. We show an estimation procedure for the market betas.\\index{Beta} We do not go into details about the foundations of market beta but simply refer to any treatment of the [CAPM](https://en.wikipedia.org/wiki/Capital_asset_pricing_model) for further information. Instead, we provide details about all the functions that we use to compute the results. In particular, we leverage useful computational concepts: rolling-window estimation and parallelization.\n\nWe use the following R packages throughout this chapter:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(slider)\nlibrary(furrr)\n```\n:::\n\n\nCompared to previous chapters, we introduce `slider` [@slider] for sliding window functions, and `furrr` [@furrr] to apply mapping functions in parallel.\n\n## Estimating Beta using Monthly Returns\n\nThe estimation procedure is based on a rolling-window estimation, where we may use either monthly or daily returns and different window lengths. 
First, let us start with loading the monthly CRSP data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\\index{Data!Fama-French factors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, industry, ret_excess) |>\n collect()\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n\ncrsp_monthly <- crsp_monthly |>\n left_join(factors_ff3_monthly, by = \"month\")\n```\n:::\n\n\nTo estimate the CAPM regression coefficients \n$$\nr_{i, t} - r_{f, t} = \\alpha_i + \\beta_i(r_{m, t}-r_{f,t})+\\varepsilon_{i, t}\n$$\nwe regress stock excess returns `ret_excess` on excess returns of the market portfolio `mkt_excess`. \nR provides a simple solution to estimate (linear) models with the function `lm()`. `lm()` requires a formula as input that is specified in a compact symbolic form. An expression of the form `y ~ model` is interpreted as a specification that the response `y` is modeled by a linear predictor specified symbolically by `model`. Such a model consists of a series of terms separated by `+` operators. In addition to standard linear models, `lm()` provides a lot of flexibility. You should check out the documentation for more information. To start, we restrict the data only to the time series of observations in CRSP that correspond to Apple’s stock (i.e., to `permno` 14593 for Apple) and compute $\\hat\\alpha_i$ as well as $\\hat\\beta_i$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit <- lm(ret_excess ~ mkt_excess,\n data = crsp_monthly |>\n filter(permno == \"14593\")\n)\n\nsummary(fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = ret_excess ~ mkt_excess, data = filter(crsp_monthly, \n permno == \"14593\"))\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.5170 -0.0589 0.0001 0.0610 0.3947 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.01019 0.00508 2.0 0.046 * \nmkt_excess 1.38889 0.11141 12.5 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.113 on 502 degrees of freedom\nMultiple R-squared: 0.236,\tAdjusted R-squared: 0.235 \nF-statistic: 155 on 1 and 502 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\n`lm()` returns an object of class `lm` which contains all information we usually care about with linear models. `summary()` returns an overview of the estimated parameters. `coefficients(fit)` would return only the estimated coefficients. The output above indicates that Apple moves excessively with the market as the estimated $\\hat\\beta_i$ is above one ($\\hat\\beta_i \\approx 1.4$). \n\n## Rolling-Window Estimation\n\nAfter we estimated the regression coefficients on an example, we scale the estimation of $\\beta_i$ to a whole different level and perform rolling-window estimations for the entire CRSP sample.\\index{Rolling-window estimation} The following function implements the CAPM regression for a data frame (or a part thereof) containing at least `min_obs` observations to avoid huge fluctuations if the time series is too short. If the condition is violated, that is, the time series is too short, the function returns a missing value. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nestimate_capm <- function(data, min_obs = 1) {\n if (nrow(data) < min_obs) {\n beta <- as.numeric(NA)\n } else {\n fit <- lm(ret_excess ~ mkt_excess, data = data)\n beta <- as.numeric(coefficients(fit)[2])\n }\n return(beta)\n}\n```\n:::\n\n\nNext, we define a function that does the rolling estimation. The `slide_period` function is able to handle months in its window input in a straightforward manner. We thus avoid using any time-series package (e.g., `zoo`) and converting the data to fit the package functions, but rather stay in the world of the `tidyverse`.\n\nThe following function takes input data and slides across the `month` vector, considering only a total of `months` months. The function essentially performs three steps: (i) arrange all rows, (ii) compute betas by sliding across months, and (iii) return a tibble with months and corresponding beta estimates (again particularly useful in the case of daily data).\nAs we demonstrate further below, we can also apply the same function to daily returns data. \n\n::: {.cell}\n\n```{.r .cell-code}\nroll_capm_estimation <- function(data, months, min_obs) {\n data <- data |>\n arrange(month)\n\n betas <- slide_period_vec(\n .x = data,\n .i = data$month,\n .period = \"month\",\n .f = ~ estimate_capm(., min_obs),\n .before = months - 1,\n .complete = FALSE\n )\n\n return(tibble(\n month = unique(data$month),\n beta = betas\n ))\n}\n```\n:::\n\n\nBefore we attack the whole CRSP sample, let us focus on a couple of examples for well-known firms.\n\n::: {.cell}\n\n```{.r .cell-code}\nexamples <- tribble(\n ~permno, ~company,\n 14593, \"Apple\",\n 10107, \"Microsoft\",\n 93436, \"Tesla\",\n 17778, \"Berkshire Hathaway\"\n)\n```\n:::\n\nIf we want to estimate rolling betas for Apple, we can use `mutate()`. \nWe take a total of 5 years of data and require at least 48 months with return data to compute our betas. \nCheck out the exercises if you want to compute beta for different time periods. \n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_example <- crsp_monthly |>\n filter(permno == examples$permno[1]) |>\n mutate(roll_capm_estimation(pick(everything()), months = 60, min_obs = 48)) |>\n drop_na()\nbeta_example\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 457 × 6\n permno month industry ret_excess mkt_excess beta\n \n1 14593 1984-12-01 Manufacturing 0.170 0.0184 2.05\n2 14593 1985-01-01 Manufacturing -0.0108 0.0799 1.90\n3 14593 1985-02-01 Manufacturing -0.152 0.0122 1.88\n4 14593 1985-03-01 Manufacturing -0.112 -0.0084 1.89\n5 14593 1985-04-01 Manufacturing -0.0467 -0.0096 1.90\n# ℹ 452 more rows\n```\n\n\n:::\n:::\n\nIt is actually quite simple to perform the rolling-window estimation for an arbitrary number of stocks, which we visualize in the following code chunk and the resulting @fig-601. 
\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_examples <- crsp_monthly |>\n inner_join(examples, by = \"permno\") |>\n group_by(permno) |>\n mutate(roll_capm_estimation(pick(everything()), months = 60, min_obs = 48)) |>\n ungroup() |>\n select(permno, company, month, beta) |>\n drop_na()\n\nbeta_examples |>\n ggplot(aes(\n x = month, \n y = beta, \n color = company,\n linetype = company)) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly beta estimates for example stocks using 5 years of data\"\n )\n```\n\n::: {.cell-output-display}\n![The CAPM betas are estimated with monthly data and a rolling window of length 5 years based on adjusted excess returns from CRSP. We use market excess returns from Kenneth French data library.](beta-estimation_files/figure-html/fig-601-1.png){#fig-601 fig-alt='Title: Monthly beta estimates for example stocks using 5 years of data. The figure shows a time series of beta estimates based on 5 years of monthly data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimated betas vary over time and across varies but are always positive for each stock.' width=2100}\n:::\n:::\n\n\n## Parallelized Rolling-Window Estimation\n\nEven though we could now just apply the function using `group_by()` on the whole CRSP sample, we advise against doing it as it is computationally quite expensive. \nRemember that we have to perform rolling-window estimations across all stocks and time periods. \nHowever, this estimation problem is an ideal scenario to employ the power of parallelization. \nParallelization means that we split the tasks which perform rolling-window estimations across different workers (or cores on your local machine). \n\nFirst, we `nest()` the data by `permno`. Nested data means we now have a list of `permno` with corresponding time series data and an `industry` label. We get one row of output for each unique combination of non-nested variables which are `permno` and `industry`.\\index{Data!Nested}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested <- crsp_monthly |>\n nest(data = c(month, ret_excess, mkt_excess))\ncrsp_monthly_nested\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30,384 × 3\n permno industry data \n \n1 10000 Manufacturing \n2 10001 Utilities \n3 10057 Manufacturing \n4 10042 Mining \n5 10043 Services \n# ℹ 30,379 more rows\n```\n\n\n:::\n:::\n\n\nAlternatively, we could have created the same nested data by *excluding* the variables that we *do not* want to nest, as in the following code chunk. However, for many applications it is desirable to explicitly state the variables that are nested into the `data` list-column, so that the reader can track what ends up in there.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested <- crsp_monthly |>\n nest(data = -c(permno, industry))\n```\n:::\n\n\nNext, we want to apply the `roll_capm_estimation()` function to each stock. This situation is an ideal use case for `map()`, which takes a list or vector as input and returns an object of the same length as the input. In our case, `map()` returns a single data frame with a time series of beta estimates for each stock. Therefore, we use `unnest()` to transform the list of outputs to a tidy data frame. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested |>\n inner_join(examples, by = \"permno\") |>\n mutate(beta = map(\n data,\n ~ roll_capm_estimation(., months = 60, min_obs = 48)\n )) |>\n unnest(beta) |>\n select(permno, month, beta_monthly = beta) |>\n drop_na()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1,458 × 3\n permno month beta_monthly\n \n1 10107 1990-03-01 1.39\n2 10107 1990-04-01 1.38\n3 10107 1990-05-01 1.43\n4 10107 1990-06-01 1.43\n5 10107 1990-07-01 1.45\n# ℹ 1,453 more rows\n```\n\n\n:::\n:::\n\n\nInstead, we want to perform the estimation of rolling betas for different stocks in parallel. If you have a Windows or Mac machine, it makes the most sense to define `multisession`, which means that separate R processes are running in the background on the same machine to perform the individual jobs. If you check out the documentation of `plan()`, you can also see other ways to resolve the parallelization in different environments. Note that we use `availableCores()` to determine the number of cores available for parallelization, but keep one core free for other tasks. Some machines might freeze if all cores are busy with R jobs. \\index{Parallelization}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_cores = availableCores() - 1\nplan(multisession, workers = n_cores)\n```\n:::\n\n\nUsing eight cores, the estimation for our sample of around 25k stocks takes around 20 minutes. Of course, you can speed things up considerably by having more cores available to share the workload or by having more powerful cores. Notice the difference in the code below? All you need to do is replace `map()` with `future_map()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_monthly <- crsp_monthly_nested |>\n mutate(beta = future_map(\n data, ~ roll_capm_estimation(., months = 60, min_obs = 48)\n )) |>\n unnest(c(beta)) |>\n select(permno, month, beta_monthly = beta) |>\n drop_na()\n```\n:::\n\n\n## Estimating Beta using Daily Returns\n\nBefore we provide some descriptive statistics of our beta estimates, we implement the estimation for the daily CRSP sample as well. \nDepending on the application, you might either use longer-horizon beta estimates based on monthly data or shorter-horizon estimates based on daily returns. \n\nFirst, we load daily CRSP data. \nNote that the sample is large compared to the monthly data, so make sure to have enough memory available.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily <- tbl(tidy_finance, \"crsp_daily\") |>\n select(permno, month, date, ret_excess) |>\n collect()\n```\n:::\n\n\nWe also need the daily Fama-French market excess returns.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily <- tbl(tidy_finance, \"factors_ff3_daily\") |>\n select(date, mkt_excess) |>\n collect()\n```\n:::\n\n\nWe make sure to keep only relevant data to save memory space. \nHowever, note that your machine might not have enough memory to read the whole daily CRSP sample. In this case, we refer you to the exercises and suggest working with loops as in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily <- crsp_daily |>\n inner_join(factors_ff3_daily, by = \"date\") |>\n select(permno, month, ret_excess, mkt_excess)\n```\n:::\n\n\nJust like above, we nest the data by `permno` for parallelization.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily_nested <- crsp_daily |>\n nest(data = c(month, ret_excess, mkt_excess))\n```\n:::\n\n\nThis is what the estimation looks like for a couple of examples using `map()`. \nFor the daily data, we use the same function as above but only take 3 months of data and require at least 50 daily return observations in these months. \nThese restrictions help us to retrieve somewhat smooth coefficient estimates.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily_nested |>\n inner_join(examples, by = \"permno\") |>\n mutate(beta_daily = map(\n data,\n ~ roll_capm_estimation(., months = 3, min_obs = 50)\n )) |>\n unnest(c(beta_daily)) |>\n select(permno, month, beta_daily = beta) |>\n drop_na()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1,639 × 3\n permno month beta_daily\n \n1 10107 1986-05-01 0.898\n2 10107 1986-06-01 0.906\n3 10107 1986-07-01 0.822\n4 10107 1986-08-01 0.900\n5 10107 1986-09-01 1.01 \n# ℹ 1,634 more rows\n```\n\n\n:::\n:::\n\n\nFor the sake of completeness, we tell our session again to use multiple workers for parallelization.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplan(multisession, workers = n_cores)\n```\n:::\n\n\nThe code chunk for beta estimation using daily returns now looks very similar to the one for monthly data. The whole estimation takes around 30 minutes using eight cores and 16gb memory. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_daily <- crsp_daily_nested |>\n mutate(beta_daily = future_map(\n data, ~ roll_capm_estimation(., months = 3, min_obs = 50)\n )) |>\n unnest(c(beta_daily)) |>\n select(permno, month, beta_daily = beta) |>\n drop_na()\n```\n:::\n\n\n## Comparing Beta Estimates\n\nWhat is a typical value for stock betas? To get some feeling, we illustrate the dispersion of the estimated $\\hat\\beta_i$ across different industries and across time below. @fig-602 shows that typical business models across industries imply different exposure to the general market economy. However, there are barely any firms that exhibit a negative exposure to the market factor.\\index{Graph!Box plot}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n left_join(beta_monthly, by = c(\"permno\", \"month\")) |>\n drop_na(beta_monthly) |>\n group_by(industry, permno) |>\n summarize(beta = mean(beta_monthly), \n .groups = \"drop\") |>\n ggplot(aes(x = reorder(industry, beta, FUN = median), y = beta)) +\n geom_boxplot() +\n coord_flip() +\n labs(\n x = NULL, y = NULL,\n title = \"Firm-specific beta distributions by industry\"\n )\n```\n\n::: {.cell-output-display}\n![The box plots show the average firm-specific beta estimates by industry.](beta-estimation_files/figure-html/fig-602-1.png){#fig-602 fig-alt='Title: Firm-specific beta distributions by industry. The figure shows box plots for each industry. Firms with the highest average CAPM beta belong to the public administration industry. Firms from the utility sector have the lowest average CAPM beta. The figure indicates very few outliers with negative CAPM betas. The large majority of all stocks has CAPM betas between 0.5 and 1.5.' width=2100}\n:::\n:::\n\n\nNext, we illustrate the time-variation in the cross-section of estimated betas. 
@fig-603 shows the monthly deciles of estimated betas (based on monthly data) and indicates an interesting pattern: First, betas seem to vary over time in the sense that during some periods, there is a clear trend across all deciles. Second, the sample exhibits periods where the dispersion across stocks increases in the sense that the lower decile decreases and the upper decile increases, which indicates that for some stocks the correlation with the market increases while for others it decreases. Note also here: stocks with negative betas are a rare exception.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_monthly |>\n drop_na(beta_monthly) |>\n group_by(month) |>\n reframe(\n x = quantile(beta_monthly, seq(0.1, 0.9, 0.1)),\n quantile = 100 * seq(0.1, 0.9, 0.1)\n ) |>\n ggplot(aes(\n x = month, \n y = x, \n color = as_factor(quantile),\n linetype = as_factor(quantile)\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly deciles of estimated betas\",\n )\n```\n\n::: {.cell-output-display}\n![Each line corresponds to the monthly cross-sectional quantile of the estimated CAPM beta.](beta-estimation_files/figure-html/fig-603-1.png){#fig-603 fig-alt='Title: Monthly deciles of estimated betas. The figure shows time series of deciles of estimated betas to illustrate the distribution of betas over time. The top 10 percent quantile on average is around 2 but varies substantially over time. The lowest 10 percent quantile is around 0.4 on average but is highly correlated with the top quantile such that in general CAPM market betas seem to go up and down jointly.' width=2100}\n:::\n:::\n\n\nTo compare the difference between daily and monthly data, we combine beta estimates to a single table. Then, we use the table to plot a comparison of beta estimates for our example stocks in @fig-604. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta <- beta_monthly |>\n full_join(beta_daily, by = c(\"permno\", \"month\")) |>\n arrange(permno, month)\n\nbeta |>\n inner_join(examples, by = \"permno\") |>\n pivot_longer(cols = c(beta_monthly, beta_daily)) |>\n drop_na() |>\n ggplot(aes(\n x = month, \n y = value, \n color = name, \n linetype = name\n )) +\n geom_line() +\n facet_wrap(~company, ncol = 1) +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL, \n title = \"Comparison of beta estimates using monthly and daily data\"\n )\n```\n\n::: {.cell-output-display}\n![CAPM betas are computed using 5 years of monthly or 3 months of daily data. The two lines show the monthly estimates based on a rolling window for few exemplary stocks.](beta-estimation_files/figure-html/fig-604-1.png){#fig-604 fig-alt='Title: Comparison of beta estimates using monthly and daily data. The figure shows a time series of beta estimates using 5 years of monthly versus 3 years of daily data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimates based on longer periods of monthly data are smooth relative to the estimates based on daily data. However, the general trend and level is similar, irrespective of the choice of frequency.' width=2100}\n:::\n:::\n\n\nThe estimates in @fig-604 look as expected. As you can see, it really depends on the estimation window and data frequency how your beta estimates turn out. \n\nFinally, we write the estimates to our database such that we can use them in later chapters. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"beta\",\n value = beta,\n overwrite = TRUE\n)\n```\n:::\n\n\nWhenever you perform some kind of estimation, it also makes sense to do rough plausibility tests. A possible check is to plot the share of stocks with beta estimates over time. \nThis descriptive helps us discover potential errors in our data preparation or estimation procedure. \nFor instance, suppose there was a gap in our output where we do not have any betas. \nIn this case, we would have to go back and check all previous steps to find out what went wrong. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_long <- crsp_monthly |>\n left_join(beta, by = c(\"permno\", \"month\")) |>\n pivot_longer(cols = c(beta_monthly, beta_daily))\n\nbeta_long |>\n group_by(month, name) |>\n summarize(share = sum(!is.na(value)) / n(), \n .groups = \"drop\") |>\n ggplot(aes(\n x = month, \n y = share, \n color = name,\n linetype = name\n )) +\n geom_line() +\n scale_y_continuous(labels = percent) +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"End-of-month share of securities with beta estimates\"\n ) +\n coord_cartesian(ylim = c(0, 1))\n```\n\n::: {.cell-output-display}\n![The two lines show the share of securities with beta estimates using 5 years of monthly or 3 months of daily data.](beta-estimation_files/figure-html/fig-605-1.png){#fig-605 fig-alt='Title: End-of-month share of securities with beta estimates. The figure shows two time series with end-of-year shares of securities with beta estimates using 5 years of monthly or 3 months of daily data. There is almost no missing data for the estimates based on daily data. For the beta estimates based on monthly data, around 75 percent of all stock-month combinations provide sufficient long historical periods to estimate the beta.' width=2100}\n:::\n:::\n\n\n@fig-605 does not indicate any troubles, so let us move on to the next check. \n\nWe also encourage everyone to always look at the distributional summary statistics of variables. You can easily spot outliers or weird distributions when looking at such tables.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_long |>\n select(name, value) |>\n drop_na() |>\n group_by(name) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n()\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 9\n name mean sd min q05 q50 q95 max n\n \n1 beta_daily 0.752 0.924 -43.7 -0.440 0.690 2.23 56.6 3286117\n2 beta_monthly 1.10 0.713 -13.0 0.127 1.04 2.32 11.8 2135342\n```\n\n\n:::\n:::\n\n\nThe summary statistics also look plausible for the two estimation procedures. \n\nFinally, since we have two different estimators for the same theoretical object, we expect the estimators should be at least positively correlated (although not perfectly as the estimators are based on different sample periods and frequencies).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta |>\n select(beta_daily, beta_monthly) |>\n cor(use = \"complete.obs\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n beta_daily beta_monthly\nbeta_daily 1.000 0.324\nbeta_monthly 0.324 1.000\n```\n\n\n:::\n:::\n\n\nIndeed, we find a positive correlation between our beta estimates. 
In the subsequent chapters, we mainly use the estimates based on monthly data as most readers should be able to replicate them due to potential memory limitations that might arise with the daily data. \n\n## Exercises\n\n1. Compute beta estimates based on monthly data using 1, 3, and 5 years of data and impose a minimum number of observations of 10, 28, and 48 months with return data, respectively. How strongly correlated are the estimated betas?\n1. Compute beta estimates based on monthly data using 5 years of data and impose different numbers of minimum observations. How does the share of permno-month observations with successful beta estimates vary across the different requirements? Do you find a high correlation across the estimated betas? \n1. Instead of using `future_map()`, perform the beta estimation in a loop (using either monthly or daily data) for a subset of 100 permnos of your choice. Verify that you get the same results as with the parallelized code from above.\n1. Filter out the stocks with negative betas. Do these stocks frequently exhibit negative betas, or do they resemble estimation errors? \n1. Compute beta estimates for multi-factor models such as the Fama-French 3 factor model. For that purpose, you extend your regression to \n$$\nr_{i, t} - r_{f, t} = \\alpha_i + \\sum\\limits_{j=1}^k\\beta_{i,k}(r_{j, t}-r_{f,t})+\\varepsilon_{i, t}\n$$\nwhere $r_{j, t}$ are the $k$ factor returns. Thus, you estimate 4 parameters ($\\alpha_i$ and the slope coefficients). Provide some summary statistics of the cross-section of firms and their exposure to the different factors.", + "markdown": "---\ntitle: Beta Estimation\naliases:\n - ../beta-estimation.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Beta Estimation with R\n description-meta: Estimate CAPM betas using monthly or daily CRSP data and the programming language R. \n---\n\n\nIn this chapter, we introduce an important concept in financial economics: the exposure of an individual stock to changes in the market portfolio. According to the Capital Asset Pricing Model (CAPM) of @Sharpe1964, @Lintner1965, and @Mossin1966, cross-sectional variation in expected asset returns should be a function of the covariance between the excess return of the asset and the excess return on the market portfolio.\\index{CAPM} The regression coefficient of excess market returns on excess stock returns is usually called the market beta. We show an estimation procedure for the market betas.\\index{Beta} We do not go into details about the foundations of market beta but simply refer to any treatment of the [CAPM](https://en.wikipedia.org/wiki/Capital_asset_pricing_model) for further information. Instead, we provide details about all the functions that we use to compute the results. In particular, we leverage useful computational concepts: rolling-window estimation and parallelization.\n\nWe use the following R packages throughout this chapter:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(slider)\nlibrary(furrr)\n```\n:::\n\n\nCompared to previous chapters, we introduce `slider` [@slider] for sliding window functions, and `furrr` [@furrr] to apply mapping functions in parallel.\n\n## Estimating Beta using Monthly Returns\n\nThe estimation procedure is based on a rolling-window estimation, where we may use either monthly or daily returns and different window lengths. 
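\n\nBefore we load any data, a small toy example may help to build intuition for how such month-based rolling windows work with `slider`; the tibble below and its values are made up purely for illustration.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy data: one made-up observation per month\ntoy_data <- tibble(\n  month = seq(ymd(\"2020-01-01\"), ymd(\"2020-06-01\"), by = \"1 month\"),\n  x = 1:6\n)\n\n# Rolling three-month means: each window contains the current month and the\n# two preceding months (windows are shorter at the start of the sample)\nslide_period_vec(\n  .x = toy_data$x,\n  .i = toy_data$month,\n  .period = \"month\",\n  .f = mean,\n  .before = 2\n)\n```\n:::\n\nThe rolling CAPM estimation below relies on the same mechanics, except that the function applied to each window is a regression instead of a mean.\n\n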
First, let us start with loading the monthly CRSP data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\\index{Data!Fama-French factors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, industry, ret_excess) |>\n collect()\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n\ncrsp_monthly <- crsp_monthly |>\n left_join(factors_ff3_monthly, join_by(month))\n```\n:::\n\n\nTo estimate the CAPM regression coefficients \n$$\nr_{i, t} - r_{f, t} = \\alpha_i + \\beta_i(r_{m, t}-r_{f,t})+\\varepsilon_{i, t}\n$$\nwe regress stock excess returns `ret_excess` on excess returns of the market portfolio `mkt_excess`. \nR provides a simple solution to estimate (linear) models with the function `lm()`. `lm()` requires a formula as input that is specified in a compact symbolic form. An expression of the form `y ~ model` is interpreted as a specification that the response `y` is modeled by a linear predictor specified symbolically by `model`. Such a model consists of a series of terms separated by `+` operators. In addition to standard linear models, `lm()` provides a lot of flexibility. You should check out the documentation for more information. To start, we restrict the data only to the time series of observations in CRSP that correspond to Apple’s stock (i.e., to `permno` 14593 for Apple) and compute $\\hat\\alpha_i$ as well as $\\hat\\beta_i$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit <- lm(ret_excess ~ mkt_excess,\n data = crsp_monthly |>\n filter(permno == \"14593\")\n)\n\nsummary(fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = ret_excess ~ mkt_excess, data = filter(crsp_monthly, \n permno == \"14593\"))\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.5170 -0.0589 0.0001 0.0610 0.3947 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.01019 0.00508 2.0 0.046 * \nmkt_excess 1.38889 0.11141 12.5 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.113 on 502 degrees of freedom\nMultiple R-squared: 0.236,\tAdjusted R-squared: 0.235 \nF-statistic: 155 on 1 and 502 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\n`lm()` returns an object of class `lm` which contains all information we usually care about with linear models. `summary()` returns an overview of the estimated parameters. `coefficients(fit)` would return only the estimated coefficients. The output above indicates that Apple moves excessively with the market as the estimated $\\hat\\beta_i$ is above one ($\\hat\\beta_i \\approx 1.4$). \n\n## Rolling-Window Estimation\n\nAfter we estimated the regression coefficients on an example, we scale the estimation of $\\beta_i$ to a whole different level and perform rolling-window estimations for the entire CRSP sample.\\index{Rolling-window estimation} The following function implements the CAPM regression for a data frame (or a part thereof) containing at least `min_obs` observations to avoid huge fluctuations if the time series is too short. If the condition is violated, that is, the time series is too short, the function returns a missing value. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nestimate_capm <- function(data, min_obs = 1) {\n if (nrow(data) < min_obs) {\n beta <- as.numeric(NA)\n } else {\n fit <- lm(ret_excess ~ mkt_excess, data = data)\n beta <- as.numeric(coefficients(fit)[2])\n }\n return(beta)\n}\n```\n:::\n\n\nNext, we define a function that does the rolling estimation. The `slide_period` function is able to handle months in its window input in a straightforward manner. We thus avoid using any time-series package (e.g., `zoo`) and converting the data to fit the package functions, but rather stay in the world of the `tidyverse`.\n\nThe following function takes input data and slides across the `month` vector, considering only a total of `months` months. The function essentially performs three steps: (i) arrange all rows, (ii) compute betas by sliding across months, and (iii) return a tibble with months and corresponding beta estimates (again particularly useful in the case of daily data).\nAs we demonstrate further below, we can also apply the same function to daily returns data. \n\n::: {.cell}\n\n```{.r .cell-code}\nroll_capm_estimation <- function(data, months, min_obs) {\n data <- data |>\n arrange(month)\n\n betas <- slide_period_vec(\n .x = data,\n .i = data$month,\n .period = \"month\",\n .f = ~ estimate_capm(., min_obs),\n .before = months - 1,\n .complete = FALSE\n )\n\n return(tibble(\n month = unique(data$month),\n beta = betas\n ))\n}\n```\n:::\n\n\nBefore we attack the whole CRSP sample, let us focus on a couple of examples for well-known firms.\n\n::: {.cell}\n\n```{.r .cell-code}\nexamples <- tribble(\n ~permno, ~company,\n 14593, \"Apple\",\n 10107, \"Microsoft\",\n 93436, \"Tesla\",\n 17778, \"Berkshire Hathaway\"\n)\n```\n:::\n\nIf we want to estimate rolling betas for Apple, we can use `mutate()`. \nWe take a total of 5 years of data and require at least 48 months with return data to compute our betas. \nCheck out the exercises if you want to compute beta for different time periods. \n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_example <- crsp_monthly |>\n filter(permno == examples$permno[1]) |>\n mutate(roll_capm_estimation(pick(everything()), months = 60, min_obs = 48)) |>\n drop_na()\nbeta_example\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 457 × 6\n permno month industry ret_excess mkt_excess beta\n \n1 14593 1984-12-01 Manufacturing 0.170 0.0184 2.05\n2 14593 1985-01-01 Manufacturing -0.0108 0.0799 1.90\n3 14593 1985-02-01 Manufacturing -0.152 0.0122 1.88\n4 14593 1985-03-01 Manufacturing -0.112 -0.0084 1.89\n5 14593 1985-04-01 Manufacturing -0.0467 -0.0096 1.90\n# ℹ 452 more rows\n```\n\n\n:::\n:::\n\nIt is actually quite simple to perform the rolling-window estimation for an arbitrary number of stocks, which we visualize in the following code chunk and the resulting @fig-601. 
\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_examples <- crsp_monthly |>\n inner_join(examples, join_by(permno)) |>\n group_by(permno) |>\n mutate(roll_capm_estimation(pick(everything()), months = 60, min_obs = 48)) |>\n ungroup() |>\n select(permno, company, month, beta) |>\n drop_na()\n\nbeta_examples |>\n ggplot(aes(\n x = month, \n y = beta, \n color = company,\n linetype = company)) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly beta estimates for example stocks using 5 years of data\"\n )\n```\n\n::: {.cell-output-display}\n![The CAPM betas are estimated with monthly data and a rolling window of length 5 years based on adjusted excess returns from CRSP. We use market excess returns from Kenneth French data library.](beta-estimation_files/figure-html/fig-601-1.png){#fig-601 fig-alt='Title: Monthly beta estimates for example stocks using 5 years of data. The figure shows a time series of beta estimates based on 5 years of monthly data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimated betas vary over time and across varies but are always positive for each stock.' width=2100}\n:::\n:::\n\n\n## Parallelized Rolling-Window Estimation\n\nEven though we could now just apply the function using `group_by()` on the whole CRSP sample, we advise against doing it as it is computationally quite expensive. \nRemember that we have to perform rolling-window estimations across all stocks and time periods. \nHowever, this estimation problem is an ideal scenario to employ the power of parallelization. \nParallelization means that we split the tasks which perform rolling-window estimations across different workers (or cores on your local machine). \n\nFirst, we `nest()` the data by `permno`. Nested data means we now have a list of `permno` with corresponding time series data and an `industry` label. We get one row of output for each unique combination of non-nested variables which are `permno` and `industry`.\\index{Data!Nested}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested <- crsp_monthly |>\n nest(data = c(month, ret_excess, mkt_excess))\ncrsp_monthly_nested\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30,384 × 3\n permno industry data \n \n1 10066 Services \n2 10067 Manufacturing \n3 10068 Finance \n4 10069 Retail \n5 10070 Manufacturing \n# ℹ 30,379 more rows\n```\n\n\n:::\n:::\n\n\nAlternatively, we could have created the same nested data by *excluding* the variables that we *do not* want to nest, as in the following code chunk. However, for many applications it is desirable to explicitly state the variables that are nested into the `data` list-column, so that the reader can track what ends up in there.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested <- crsp_monthly |>\n nest(data = -c(permno, industry))\n```\n:::\n\n\nNext, we want to apply the `roll_capm_estimation()` function to each stock. This situation is an ideal use case for `map()`, which takes a list or vector as input and returns an object of the same length as the input. In our case, `map()` returns a single data frame with a time series of beta estimates for each stock. Therefore, we use `unnest()` to transform the list of outputs to a tidy data frame. 
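\n\nBefore we do so, it can be instructive to peek at a single element of the `data` list-column; this inspection is purely illustrative and not required for the estimation itself.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Inspect the nested monthly time series of the first stock (illustration only)\ncrsp_monthly_nested |>\n  slice(1) |>\n  pull(data) |>\n  pluck(1)\n```\n:::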
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested |>\n inner_join(examples, join_by(permno)) |>\n mutate(beta = map(\n data,\n ~ roll_capm_estimation(., months = 60, min_obs = 48)\n )) |>\n unnest(beta) |>\n select(permno, month, beta_monthly = beta) |>\n drop_na()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1,458 × 3\n permno month beta_monthly\n \n1 10107 1990-03-01 1.39\n2 10107 1990-04-01 1.38\n3 10107 1990-05-01 1.43\n4 10107 1990-06-01 1.43\n5 10107 1990-07-01 1.45\n# ℹ 1,453 more rows\n```\n\n\n:::\n:::\n\n\nHowever, instead, we want to perform the estimations of rolling betas for different stocks in parallel. If you have a Windows or Mac machine, it makes most sense to define `multisession`, which means that separate R processes are running in the background on the same machine to perform the individual jobs. If you check out the documentation of `plan()`, you can also see other ways to resolve the parallelization in different environments. Note that we use `availableCores()` to determine the number of cores available for parallelization, but keep one core free for other tasks. Some machines might freeze if all cores are busy with Python jobs. \\index{Parallelization}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_cores = availableCores() - 1\nplan(multisession, workers = n_cores)\n```\n:::\n\n\nUsing eight cores, the estimation for our sample of around 25k stocks takes around 20 minutes. Of course, you can speed up things considerably by having more cores available to share the workload or by having more powerful cores. Notice the difference in the code below? All you need to do is to replace `map()` with `future_map()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_monthly <- crsp_monthly_nested |>\n mutate(beta = future_map(\n data, ~ roll_capm_estimation(., months = 60, min_obs = 48)\n )) |>\n unnest(c(beta)) |>\n select(permno, month, beta_monthly = beta) |>\n drop_na()\n```\n:::\n\n\n## Estimating Beta using Daily Returns\n\nBefore we provide some descriptive statistics of our beta estimates, we implement the estimation for the daily CRSP sample as well. \nDepending on the application, you might either use longer horizon beta estimates based on monthly data or shorter horizon estimates based on daily returns. \n\nFirst, we load daily CRSP data. \nNote that the sample is large compared to the monthly data, so make sure to have enough memory available.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily <- tbl(tidy_finance, \"crsp_daily\") |>\n select(permno, month, date, ret_excess) |>\n collect()\n```\n:::\n\n\nWe also need the daily Fama-French market excess returns.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily <- tbl(tidy_finance, \"factors_ff3_daily\") |>\n select(date, mkt_excess) |>\n collect()\n```\n:::\n\n\nWe make sure to keep only relevant data to save memory space. \nHowever, note that your machine might not have enough memory to read the whole daily CRSP sample. In this case, we refer you to the exercises and try working with loops as in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). 
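\n\nA rough sketch of such a loop-based approach could look as follows; we do not execute it here, and both the number of batches and the object names are arbitrary choices for illustration.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: process the daily CRSP data in batches of permnos so that only a\n# slice of the full sample has to sit in memory at any point in time\npermnos <- tbl(tidy_finance, \"crsp_daily\") |>\n  distinct(permno) |>\n  pull(permno)\n\nbatches <- split(permnos, cut(seq_along(permnos), breaks = 20, labels = FALSE))\n\nbeta_daily_batches <- list()\nfor (j in seq_along(batches)) {\n  permno_batch <- batches[[j]]\n  crsp_daily_batch <- tbl(tidy_finance, \"crsp_daily\") |>\n    filter(permno %in% permno_batch) |>\n    select(permno, month, date, ret_excess) |>\n    collect()\n  # The estimation for this batch would follow exactly the steps shown below\n  # (join market excess returns, nest, estimate betas); we store the raw batch\n  # here only as a placeholder for the batch-level result\n  beta_daily_batches[[j]] <- crsp_daily_batch\n}\nbeta_daily_sketch <- bind_rows(beta_daily_batches)\n```\n:::\n\nFor the rest of this chapter, we stick with the in-memory approach and combine the daily returns with the market excess returns.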
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily <- crsp_daily |>\n inner_join(factors_ff3_daily, join_by(date)) |>\n select(permno, month, ret_excess, mkt_excess)\n```\n:::\n\n\nJust like above, we nest the data by `permno` for parallelization.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily_nested <- crsp_daily |>\n nest(data = c(month, ret_excess, mkt_excess))\n```\n:::\n\n\nThis is what the estimation looks like for a couple of examples using `map()`. \nFor the daily data, we use the same function as above but only take 3 months of data and require at least 50 daily return observations in these months. \nThese restrictions help us to retrieve somewhat smooth coefficient estimates.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily_nested |>\n inner_join(examples, join_by(permno)) |>\n mutate(beta_daily = map(\n data,\n ~ roll_capm_estimation(., months = 3, min_obs = 50)\n )) |>\n unnest(c(beta_daily)) |>\n select(permno, month, beta_daily = beta) |>\n drop_na()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1,639 × 3\n permno month beta_daily\n \n1 10107 1986-05-01 0.898\n2 10107 1986-06-01 0.906\n3 10107 1986-07-01 0.822\n4 10107 1986-08-01 0.900\n5 10107 1986-09-01 1.01 \n# ℹ 1,634 more rows\n```\n\n\n:::\n:::\n\n\nFor the sake of completeness, we tell our session again to use multiple workers for parallelization.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplan(multisession, workers = n_cores)\n```\n:::\n\n\nThe code chunk for beta estimation using daily returns now looks very similar to the one for monthly data. The whole estimation takes around 30 minutes using eight cores and 16gb memory. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_daily <- crsp_daily_nested |>\n mutate(beta_daily = future_map(\n data, ~ roll_capm_estimation(., months = 3, min_obs = 50)\n )) |>\n unnest(c(beta_daily)) |>\n select(permno, month, beta_daily = beta) |>\n drop_na()\n```\n:::\n\n\n## Comparing Beta Estimates\n\nWhat is a typical value for stock betas? To get some feeling, we illustrate the dispersion of the estimated $\\hat\\beta_i$ across different industries and across time below. @fig-602 shows that typical business models across industries imply different exposure to the general market economy. However, there are barely any firms that exhibit a negative exposure to the market factor.\\index{Graph!Box plot}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n left_join(beta_monthly, join_by(permno, month)) |>\n drop_na(beta_monthly) |>\n group_by(industry, permno) |>\n summarize(beta = mean(beta_monthly), \n .groups = \"drop\") |>\n ggplot(aes(x = reorder(industry, beta, FUN = median), y = beta)) +\n geom_boxplot() +\n coord_flip() +\n labs(\n x = NULL, y = NULL,\n title = \"Firm-specific beta distributions by industry\"\n )\n```\n\n::: {.cell-output-display}\n![The box plots show the average firm-specific beta estimates by industry.](beta-estimation_files/figure-html/fig-602-1.png){#fig-602 fig-alt='Title: Firm-specific beta distributions by industry. The figure shows box plots for each industry. Firms with the highest average CAPM beta belong to the public administration industry. Firms from the utility sector have the lowest average CAPM beta. The figure indicates very few outliers with negative CAPM betas. The large majority of all stocks has CAPM betas between 0.5 and 1.5.' width=2100}\n:::\n:::\n\n\nNext, we illustrate the time-variation in the cross-section of estimated betas. 
@fig-603 shows the monthly deciles of estimated betas (based on monthly data) and indicates an interesting pattern: First, betas seem to vary over time in the sense that during some periods, there is a clear trend across all deciles. Second, the sample exhibits periods where the dispersion across stocks increases in the sense that the lower decile decreases and the upper decile increases, which indicates that for some stocks the correlation with the market increases while for others it decreases. Note also here: stocks with negative betas are a rare exception.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_monthly |>\n drop_na(beta_monthly) |>\n group_by(month) |>\n reframe(\n x = quantile(beta_monthly, seq(0.1, 0.9, 0.1)),\n quantile = 100 * seq(0.1, 0.9, 0.1)\n ) |>\n ggplot(aes(\n x = month, \n y = x, \n color = as_factor(quantile),\n linetype = as_factor(quantile)\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly deciles of estimated betas\",\n )\n```\n\n::: {.cell-output-display}\n![Each line corresponds to the monthly cross-sectional quantile of the estimated CAPM beta.](beta-estimation_files/figure-html/fig-603-1.png){#fig-603 fig-alt='Title: Monthly deciles of estimated betas. The figure shows time series of deciles of estimated betas to illustrate the distribution of betas over time. The top 10 percent quantile on average is around 2 but varies substantially over time. The lowest 10 percent quantile is around 0.4 on average but is highly correlated with the top quantile such that in general CAPM market betas seem to go up and down jointly.' width=2100}\n:::\n:::\n\n\nTo compare the difference between daily and monthly data, we combine beta estimates to a single table. Then, we use the table to plot a comparison of beta estimates for our example stocks in @fig-604. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta <- beta_monthly |>\n full_join(beta_daily, join_by(permno, month)) |>\n arrange(permno, month)\n\nbeta |>\n inner_join(examples, join_by(permno)) |>\n pivot_longer(cols = c(beta_monthly, beta_daily)) |>\n drop_na() |>\n ggplot(aes(\n x = month, \n y = value, \n color = name, \n linetype = name\n )) +\n geom_line() +\n facet_wrap(~company, ncol = 1) +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL, \n title = \"Comparison of beta estimates using monthly and daily data\"\n )\n```\n\n::: {.cell-output-display}\n![CAPM betas are computed using 5 years of monthly or 3 months of daily data. The two lines show the monthly estimates based on a rolling window for few exemplary stocks.](beta-estimation_files/figure-html/fig-604-1.png){#fig-604 fig-alt='Title: Comparison of beta estimates using monthly and daily data. The figure shows a time series of beta estimates using 5 years of monthly versus 3 years of daily data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimates based on longer periods of monthly data are smooth relative to the estimates based on daily data. However, the general trend and level is similar, irrespective of the choice of frequency.' width=2100}\n:::\n:::\n\n\nThe estimates in @fig-604 look as expected. As you can see, it really depends on the estimation window and data frequency how your beta estimates turn out. \n\nFinally, we write the estimates to our database such that we can use them in later chapters. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"beta\",\n value = beta,\n overwrite = TRUE\n)\n```\n:::\n\n\nWhenever you perform some kind of estimation, it also makes sense to do rough plausibility tests. A possible check is to plot the share of stocks with beta estimates over time. \nThis descriptive helps us discover potential errors in our data preparation or estimation procedure. \nFor instance, suppose there was a gap in our output where we do not have any betas. \nIn this case, we would have to go back and check all previous steps to find out what went wrong. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_long <- crsp_monthly |>\n left_join(beta, join_by(permno, month)) |>\n pivot_longer(cols = c(beta_monthly, beta_daily))\n\nbeta_long |>\n group_by(month, name) |>\n summarize(share = sum(!is.na(value)) / n(), \n .groups = \"drop\") |>\n ggplot(aes(\n x = month, \n y = share, \n color = name,\n linetype = name\n )) +\n geom_line() +\n scale_y_continuous(labels = percent) +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"End-of-month share of securities with beta estimates\"\n ) +\n coord_cartesian(ylim = c(0, 1))\n```\n\n::: {.cell-output-display}\n![The two lines show the share of securities with beta estimates using 5 years of monthly or 3 months of daily data.](beta-estimation_files/figure-html/fig-605-1.png){#fig-605 fig-alt='Title: End-of-month share of securities with beta estimates. The figure shows two time series with end-of-year shares of securities with beta estimates using 5 years of monthly or 3 months of daily data. There is almost no missing data for the estimates based on daily data. For the beta estimates based on monthly data, around 75 percent of all stock-month combinations provide sufficient long historical periods to estimate the beta.' width=2100}\n:::\n:::\n\n\n@fig-605 does not indicate any troubles, so let us move on to the next check. \n\nWe also encourage everyone to always look at the distributional summary statistics of variables. You can easily spot outliers or weird distributions when looking at such tables.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_long |>\n select(name, value) |>\n drop_na() |>\n group_by(name) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n()\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 9\n name mean sd min q05 q50 q95 max n\n \n1 beta_daily 0.752 0.924 -43.7 -0.440 0.690 2.23 56.6 3286117\n2 beta_monthly 1.10 0.713 -13.0 0.127 1.04 2.32 11.8 2135342\n```\n\n\n:::\n:::\n\n\nThe summary statistics also look plausible for the two estimation procedures. \n\nFinally, since we have two different estimators for the same theoretical object, we expect the estimators should be at least positively correlated (although not perfectly as the estimators are based on different sample periods and frequencies).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta |>\n select(beta_daily, beta_monthly) |>\n cor(use = \"complete.obs\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n beta_daily beta_monthly\nbeta_daily 1.000 0.324\nbeta_monthly 0.324 1.000\n```\n\n\n:::\n:::\n\n\nIndeed, we find a positive correlation between our beta estimates. 
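\n\nAs an additional check that is not part of our main analysis, one could also compute the rank correlation, which is less sensitive to the extreme estimates visible in the summary statistics above.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Spearman rank correlation as a quick robustness check (illustration only)\nbeta |>\n  select(beta_daily, beta_monthly) |>\n  cor(use = \"complete.obs\", method = \"spearman\")\n```\n:::\n\nIf the rank correlation turns out to be higher than the Pearson correlation above, the extreme estimates likely account for part of the gap between the two estimators.\n\n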
In the subsequent chapters, we mainly use the estimates based on monthly data as most readers should be able to replicate them due to potential memory limitations that might arise with the daily data. \n\n## Exercises\n\n1. Compute beta estimates based on monthly data using 1, 3, and 5 years of data and impose a minimum number of observations of 10, 28, and 48 months with return data, respectively. How strongly correlated are the estimated betas?\n1. Compute beta estimates based on monthly data using 5 years of data and impose different numbers of minimum observations. How does the share of permno-month observations with successful beta estimates vary across the different requirements? Do you find a high correlation across the estimated betas? \n1. Instead of using `future_map()`, perform the beta estimation in a loop (using either monthly or daily data) for a subset of 100 permnos of your choice. Verify that you get the same results as with the parallelized code from above.\n1. Filter out the stocks with negative betas. Do these stocks frequently exhibit negative betas, or do they resemble estimation errors? \n1. Compute beta estimates for multi-factor models such as the Fama-French 3 factor model. For that purpose, you extend your regression to \n$$\nr_{i, t} - r_{f, t} = \\alpha_i + \\sum\\limits_{j=1}^k\\beta_{i,k}(r_{j, t}-r_{f,t})+\\varepsilon_{i, t}\n$$\nwhere $r_{j, t}$ are the $k$ factor returns. Thus, you estimate 4 parameters ($\\alpha_i$ and the slope coefficients). Provide some summary statistics of the cross-section of firms and their exposure to the different factors.", "supporting": [ "beta-estimation_files" ], diff --git a/_freeze/r/clean-enhanced-trace-with-r/execute-results/html.json b/_freeze/r/clean-enhanced-trace-with-r/execute-results/html.json index d72e3554..dcdc734e 100644 --- a/_freeze/r/clean-enhanced-trace-with-r/execute-results/html.json +++ b/_freeze/r/clean-enhanced-trace-with-r/execute-results/html.json @@ -1,11 +1,9 @@ { - "hash": "1e2484c3cced12486f09d0650b05db55", + "hash": "06984c67c24538430ebea23c84e143f4", "result": { "engine": "knitr", - "markdown": "---\ntitle: Clean Enhanced TRACE with R\naliases:\n - ../clean-enhanced-trace-with-r.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Clean Enhanced TRACE with R\n description-meta: Copy the code to clean enhanced TRACE bond transaction data using the programming language R. \n---\n\n\nThis appendix contains code to clean enhanced TRACE with R. It is also available via the following Github [gist](https://gist.github.com/patrick-weiss/3a05b3ab281563b2e94858451c2eb3a4).\\index{Github!Gist} Hence, you could also source the function with `devtools::source_gist(\"3a05b3ab281563b2e94858451c2eb3a4\")`. We need this function in Chapter [TRACE and FISD](trace-and-fisd.qmd) to download and clean enhanced TRACE trade messages following @Dick2009 and @Dick2014 for enhanced TRACE specifically. Related, WRDS provides SAS code and there is Python code available by the project [Open Source Bond Asset Pricing](https://openbondassetpricing.com/).\n\nThe function takes a vector of CUSIPs (in `cusips`), a connection to WRDS (`connection`) explained in Chapter 3, and a start and end date (`start_date` and `end_date`, respectively). Specifying too many CUSIPs will result in very slow downloads and a potential failure due to the size of the request to WRDS. The dates should be within the coverage of TRACE itself, i.e., starting after 2002, and the dates should be supplied using the class date. 
The output of the function contains all valid trade messages for the selected CUSIPs over the specified period.\\index{CUSIP}\\index{Dick-Nielsen cleaning} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nclean_enhanced_trace <- function(cusips,\n connection,\n start_date = as.Date(\"2002-01-01\"),\n end_date = today()) {\n\n # Packages (required)\n library(tidyverse)\n library(dbplyr)\n library(RPostgres)\n\n # Function checks ---------------------------------------------------------\n # Input parameters\n ## Cusips\n if (length(cusips) == 0 | any(is.na(cusips))) stop(\"Check cusips.\")\n\n ## Dates\n if (!is.Date(start_date) | !is.Date(end_date)) stop(\"Dates needed\")\n if (start_date < as.Date(\"2002-01-01\")) stop(\"TRACE starts later.\")\n if (end_date > today()) stop(\"TRACE does not predict the future.\")\n if (start_date >= end_date) stop(\"Date conflict.\")\n\n ## Connection\n if (!dbIsValid(connection)) stop(\"Connection issue.\")\n\n # Enhanced Trace ----------------------------------------------------------\n # Main file\n trace_all <- tbl(\n connection,\n in_schema(\"trace\", \"trace_enhanced\")\n ) |>\n filter(cusip_id %in% cusips) |>\n filter(trd_exctn_dt >= start_date & trd_exctn_dt <= end_date) |>\n select(\n cusip_id, msg_seq_nb, orig_msg_seq_nb,\n entrd_vol_qt, rptd_pr, yld_pt, rpt_side_cd, cntra_mp_id,\n trd_exctn_dt, trd_exctn_tm, trd_rpt_dt, trd_rpt_tm,\n pr_trd_dt, trc_st, asof_cd, wis_fl,\n days_to_sttl_ct, stlmnt_dt, spcl_trd_fl\n ) |>\n collect()\n\n # Enhanced Trace: Post 06-02-2012 -----------------------------------------\n # Trades (trc_st = T) and correction (trc_st = R)\n trace_post_TR <- trace_all |>\n filter(\n (trc_st == \"T\" | trc_st == \"R\"),\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Cancellations (trc_st = X) and correction cancellations (trc_st = C)\n trace_post_XC <- trace_all |>\n filter(\n (trc_st == \"X\" | trc_st == \"C\"),\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Cleaning corrected and cancelled trades\n trace_post_TR <- trace_post_TR |>\n anti_join(trace_post_XC,\n by = c(\n \"cusip_id\", \"msg_seq_nb\", \"entrd_vol_qt\",\n \"rptd_pr\", \"rpt_side_cd\", \"cntra_mp_id\",\n \"trd_exctn_dt\", \"trd_exctn_tm\"\n )\n )\n\n # Reversals (trc_st = Y)\n trace_post_Y <- trace_all |>\n filter(\n trc_st == \"Y\",\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Clean reversals\n ## match the orig_msg_seq_nb of the Y-message to\n ## the msg_seq_nb of the main message\n trace_post <- trace_post_TR |>\n anti_join(trace_post_Y,\n by = c(\"cusip_id\",\n \"msg_seq_nb\" = \"orig_msg_seq_nb\",\n \"entrd_vol_qt\", \"rptd_pr\", \"rpt_side_cd\",\n \"cntra_mp_id\", \"trd_exctn_dt\", \"trd_exctn_tm\"\n )\n )\n\n\n # Enhanced TRACE: Pre 06-02-2012 ------------------------------------------\n # Cancelations (trc_st = C)\n trace_pre_C <- trace_all |>\n filter(\n trc_st == \"C\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n )\n\n # Trades w/o cancellations\n ## match the orig_msg_seq_nb of the C-message\n ## to the msg_seq_nb of the main message\n trace_pre_T <- trace_all |>\n filter(\n trc_st == \"T\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n ) |>\n anti_join(trace_pre_C,\n by = c(\"cusip_id\",\n \"msg_seq_nb\" = \"orig_msg_seq_nb\",\n \"entrd_vol_qt\", \"rptd_pr\", \"rpt_side_cd\",\n \"cntra_mp_id\", \"trd_exctn_dt\", \"trd_exctn_tm\"\n )\n )\n\n # Corrections (trc_st = W) - W can also correct a previous W\n trace_pre_W <- trace_all |>\n filter(\n trc_st == \"W\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n )\n\n # Implement corrections in a loop\n ## Correction 
control\n correction_control <- nrow(trace_pre_W)\n correction_control_last <- nrow(trace_pre_W)\n\n ## Correction loop\n while (correction_control > 0) {\n # Corrections that correct some msg\n trace_pre_W_correcting <- trace_pre_W |>\n semi_join(trace_pre_T,\n by = c(\"cusip_id\", \"trd_exctn_dt\",\n \"orig_msg_seq_nb\" = \"msg_seq_nb\"\n )\n )\n\n # Corrections that do not correct some msg\n trace_pre_W <- trace_pre_W |>\n anti_join(trace_pre_T,\n by = c(\"cusip_id\", \"trd_exctn_dt\",\n \"orig_msg_seq_nb\" = \"msg_seq_nb\"\n )\n )\n\n # Delete msgs that are corrected and add correction msgs\n trace_pre_T <- trace_pre_T |>\n anti_join(trace_pre_W_correcting,\n by = c(\"cusip_id\", \"trd_exctn_dt\",\n \"msg_seq_nb\" = \"orig_msg_seq_nb\"\n )\n ) |>\n union_all(trace_pre_W_correcting)\n\n # Escape if no corrections remain or they cannot be matched\n correction_control <- nrow(trace_pre_W)\n if (correction_control == correction_control_last) {\n correction_control <- 0\n }\n correction_control_last <- nrow(trace_pre_W)\n }\n\n\n # Clean reversals\n ## Record reversals\n trace_pre_R <- trace_pre_T |>\n filter(asof_cd == \"R\") |>\n group_by(\n cusip_id, trd_exctn_dt, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id\n ) |>\n arrange(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>\n mutate(seq = row_number()) |>\n ungroup()\n\n ## Remove reversals and the reversed trade\n trace_pre <- trace_pre_T |>\n filter(is.na(asof_cd) | !(asof_cd %in% c(\"R\", \"X\", \"D\"))) |>\n group_by(\n cusip_id, trd_exctn_dt, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id\n ) |>\n arrange(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>\n mutate(seq = row_number()) |>\n ungroup() |>\n anti_join(trace_pre_R,\n by = c(\n \"cusip_id\", \"trd_exctn_dt\", \"entrd_vol_qt\",\n \"rptd_pr\", \"rpt_side_cd\", \"cntra_mp_id\", \"seq\"\n )\n ) |>\n select(-seq)\n\n\n # Agency trades -----------------------------------------------------------\n # Combine pre and post trades\n trace_clean <- trace_post |>\n union_all(trace_pre)\n\n # Keep angency sells and unmatched agency buys\n ## Agency sells\n trace_agency_sells <- trace_clean |>\n filter(\n cntra_mp_id == \"D\",\n rpt_side_cd == \"S\"\n )\n\n # Agency buys that are unmatched\n trace_agency_buys_filtered <- trace_clean |>\n filter(\n cntra_mp_id == \"D\",\n rpt_side_cd == \"B\"\n ) |>\n anti_join(trace_agency_sells,\n by = c(\n \"cusip_id\", \"trd_exctn_dt\",\n \"entrd_vol_qt\", \"rptd_pr\"\n )\n )\n\n # Agency clean\n trace_clean <- trace_clean |>\n filter(cntra_mp_id == \"C\") |>\n union_all(trace_agency_sells) |>\n union_all(trace_agency_buys_filtered)\n\n\n # Additional Filters ------------------------------------------------------\n trace_add_filters <- trace_clean |>\n mutate(days_to_sttl_ct2 = stlmnt_dt - trd_exctn_dt) |>\n filter(\n is.na(days_to_sttl_ct) | as.numeric(days_to_sttl_ct) <= 7,\n is.na(days_to_sttl_ct2) | as.numeric(days_to_sttl_ct2) <= 7,\n wis_fl == \"N\",\n is.na(spcl_trd_fl) | spcl_trd_fl == \"\",\n is.na(asof_cd) | asof_cd == \"\"\n )\n\n\n # Output ------------------------------------------------------------------\n # Only keep necessary columns\n trace_final <- trace_add_filters |>\n arrange(cusip_id, trd_exctn_dt, trd_exctn_tm) |>\n select(\n cusip_id, trd_exctn_dt, trd_exctn_tm,\n rptd_pr, entrd_vol_qt, yld_pt, rpt_side_cd, cntra_mp_id\n ) |>\n mutate(trd_exctn_tm = format(as_datetime(trd_exctn_tm), \"%H:%M:%S\"))\n\n # Return\n return(trace_final)\n}\n```\n:::", - "supporting": [ - "clean-enhanced-trace-with-r_files" - ], + "markdown": "---\ntitle: Clean 
Enhanced TRACE with R\naliases:\n - ../clean-enhanced-trace-with-r.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Clean Enhanced TRACE with R\n description-meta: Copy the code to clean enhanced TRACE bond transaction data using the programming language R. \n---\n\n\nThis appendix contains code to clean enhanced TRACE with R. It is also available via the following Github [gist](https://gist.github.com/patrick-weiss/3a05b3ab281563b2e94858451c2eb3a4).\\index{Github!Gist} Hence, you could also source the function with `devtools::source_gist(\"3a05b3ab281563b2e94858451c2eb3a4\")`. We need this function in Chapter [TRACE and FISD](trace-and-fisd.qmd) to download and clean enhanced TRACE trade messages following @Dick2009 and @Dick2014 for enhanced TRACE specifically. Related, WRDS provides SAS code and there is Python code available by the project [Open Source Bond Asset Pricing](https://openbondassetpricing.com/).\n\nThe function takes a vector of CUSIPs (in `cusips`), a connection to WRDS (`connection`) explained in Chapter 3, and a start and end date (`start_date` and `end_date`, respectively). Specifying too many CUSIPs will result in very slow downloads and a potential failure due to the size of the request to WRDS. The dates should be within the coverage of TRACE itself, i.e., starting after 2002, and the dates should be supplied using the class date. The output of the function contains all valid trade messages for the selected CUSIPs over the specified period.\\index{CUSIP}\\index{Dick-Nielsen cleaning} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nclean_enhanced_trace <- function(cusips,\n connection,\n start_date = as.Date(\"2002-01-01\"),\n end_date = today()) {\n\n # Packages (required)\n library(tidyverse)\n library(dbplyr)\n library(RPostgres)\n\n # Function checks ---------------------------------------------------------\n # Input parameters\n ## Cusips\n if (length(cusips) == 0 | any(is.na(cusips))) stop(\"Check cusips.\")\n\n ## Dates\n if (!is.Date(start_date) | !is.Date(end_date)) stop(\"Dates needed\")\n if (start_date < as.Date(\"2002-01-01\")) stop(\"TRACE starts later.\")\n if (end_date > today()) stop(\"TRACE does not predict the future.\")\n if (start_date >= end_date) stop(\"Date conflict.\")\n\n ## Connection\n if (!dbIsValid(connection)) stop(\"Connection issue.\")\n\n # Enhanced Trace ----------------------------------------------------------\n # Main file\n trace_all <- tbl(\n connection,\n in_schema(\"trace\", \"trace_enhanced\")\n ) |>\n filter(cusip_id %in% cusips) |>\n filter(trd_exctn_dt >= start_date & trd_exctn_dt <= end_date) |>\n select(\n cusip_id, msg_seq_nb, orig_msg_seq_nb,\n entrd_vol_qt, rptd_pr, yld_pt, rpt_side_cd, cntra_mp_id,\n trd_exctn_dt, trd_exctn_tm, trd_rpt_dt, trd_rpt_tm,\n pr_trd_dt, trc_st, asof_cd, wis_fl,\n days_to_sttl_ct, stlmnt_dt, spcl_trd_fl\n ) |>\n collect()\n\n # Enhanced Trace: Post 06-02-2012 -----------------------------------------\n # Trades (trc_st = T) and correction (trc_st = R)\n trace_post_TR <- trace_all |>\n filter(\n (trc_st == \"T\" | trc_st == \"R\"),\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Cancellations (trc_st = X) and correction cancellations (trc_st = C)\n trace_post_XC <- trace_all |>\n filter(\n (trc_st == \"X\" | trc_st == \"C\"),\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Cleaning corrected and cancelled trades\n trace_post_TR <- trace_post_TR |>\n anti_join(trace_post_XC,\n join_by(\n cusip_id, msg_seq_nb, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id,\n trd_exctn_dt, 
trd_exctn_tm\n )\n )\n\n # Reversals (trc_st = Y)\n trace_post_Y <- trace_all |>\n filter(\n trc_st == \"Y\",\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Clean reversals\n ## match the orig_msg_seq_nb of the Y-message to\n ## the msg_seq_nb of the main message\n trace_post <- trace_post_TR |>\n anti_join(trace_post_Y,\n join_by(\n cusip_id,\n msg_seq_nb == orig_msg_seq_nb,\n entrd_vol_qt, rptd_pr, rpt_side_cd,\n cntra_mp_id, trd_exctn_dt, trd_exctn_tm\n )\n )\n\n\n # Enhanced TRACE: Pre 06-02-2012 ------------------------------------------\n # Cancelations (trc_st = C)\n trace_pre_C <- trace_all |>\n filter(\n trc_st == \"C\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n )\n\n # Trades w/o cancellations\n ## match the orig_msg_seq_nb of the C-message\n ## to the msg_seq_nb of the main message\n trace_pre_T <- trace_all |>\n filter(\n trc_st == \"T\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n ) |>\n anti_join(trace_pre_C,\n join_by(\n cusip_id,\n msg_seq_nb == orig_msg_seq_nb,\n entrd_vol_qt, rptd_pr, rpt_side_cd,\n cntra_mp_id, trd_exctn_dt, trd_exctn_tm\n )\n )\n\n # Corrections (trc_st = W) - W can also correct a previous W\n trace_pre_W <- trace_all |>\n filter(\n trc_st == \"W\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n )\n\n # Implement corrections in a loop\n ## Correction control\n correction_control <- nrow(trace_pre_W)\n correction_control_last <- nrow(trace_pre_W)\n\n ## Correction loop\n while (correction_control > 0) {\n # Corrections that correct some msg\n trace_pre_W_correcting <- trace_pre_W |>\n semi_join(trace_pre_T,\n join_by(\n cusip_id, trd_exctn_dt,\n orig_msg_seq_nb == msg_seq_nb\n )\n )\n\n # Corrections that do not correct some msg\n trace_pre_W <- trace_pre_W |>\n anti_join(trace_pre_T,\n join_by(cusip_id, trd_exctn_dt,\n orig_msg_seq_nb == msg_seq_nb\n )\n )\n\n # Delete msgs that are corrected and add correction msgs\n trace_pre_T <- trace_pre_T |>\n anti_join(trace_pre_W_correcting,\n join_by(\n cusip_id, trd_exctn_dt,\n msg_seq_nb == orig_msg_seq_nb\n )\n ) |>\n union_all(trace_pre_W_correcting)\n\n # Escape if no corrections remain or they cannot be matched\n correction_control <- nrow(trace_pre_W)\n if (correction_control == correction_control_last) {\n correction_control <- 0\n }\n correction_control_last <- nrow(trace_pre_W)\n }\n\n\n # Clean reversals\n ## Record reversals\n trace_pre_R <- trace_pre_T |>\n filter(asof_cd == \"R\") |>\n group_by(\n cusip_id, trd_exctn_dt, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id\n ) |>\n arrange(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>\n mutate(seq = row_number()) |>\n ungroup()\n\n ## Remove reversals and the reversed trade\n trace_pre <- trace_pre_T |>\n filter(is.na(asof_cd) | !(asof_cd %in% c(\"R\", \"X\", \"D\"))) |>\n group_by(\n cusip_id, trd_exctn_dt, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id\n ) |>\n arrange(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>\n mutate(seq = row_number()) |>\n ungroup() |>\n anti_join(trace_pre_R,\n join_by(\n cusip_id, trd_exctn_dt, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id, seq\n )\n ) |>\n select(-seq)\n\n\n # Agency trades -----------------------------------------------------------\n # Combine pre and post trades\n trace_clean <- trace_post |>\n union_all(trace_pre)\n\n # Keep angency sells and unmatched agency buys\n ## Agency sells\n trace_agency_sells <- trace_clean |>\n filter(\n cntra_mp_id == \"D\",\n rpt_side_cd == \"S\"\n )\n\n # Agency buys that are unmatched\n trace_agency_buys_filtered <- trace_clean |>\n filter(\n cntra_mp_id == \"D\",\n rpt_side_cd 
== \"B\"\n ) |>\n anti_join(trace_agency_sells,\n join_by(\n cusip_id, trd_exctn_dt,\n entrd_vol_qt, rptd_pr\n )\n )\n\n # Agency clean\n trace_clean <- trace_clean |>\n filter(cntra_mp_id == \"C\") |>\n union_all(trace_agency_sells) |>\n union_all(trace_agency_buys_filtered)\n\n\n # Additional Filters ------------------------------------------------------\n trace_add_filters <- trace_clean |>\n mutate(days_to_sttl_ct2 = stlmnt_dt - trd_exctn_dt) |>\n filter(\n is.na(days_to_sttl_ct) | as.numeric(days_to_sttl_ct) <= 7,\n is.na(days_to_sttl_ct2) | as.numeric(days_to_sttl_ct2) <= 7,\n wis_fl == \"N\",\n is.na(spcl_trd_fl) | spcl_trd_fl == \"\",\n is.na(asof_cd) | asof_cd == \"\"\n )\n\n\n # Output ------------------------------------------------------------------\n # Only keep necessary columns\n trace_final <- trace_add_filters |>\n arrange(cusip_id, trd_exctn_dt, trd_exctn_tm) |>\n select(\n cusip_id, trd_exctn_dt, trd_exctn_tm,\n rptd_pr, entrd_vol_qt, yld_pt, rpt_side_cd, cntra_mp_id\n ) |>\n mutate(trd_exctn_tm = format(as_datetime(trd_exctn_tm), \"%H:%M:%S\"))\n\n # Return\n return(trace_final)\n}\n```\n:::", + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/r/constrained-optimization-and-backtesting/execute-results/html.json b/_freeze/r/constrained-optimization-and-backtesting/execute-results/html.json index e66542d3..60c83592 100644 --- a/_freeze/r/constrained-optimization-and-backtesting/execute-results/html.json +++ b/_freeze/r/constrained-optimization-and-backtesting/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "1c2684b4fabe1e11d16fb9d3e16a8a9f", "result": { "engine": "knitr", - "markdown": "---\ntitle: Constrained Optimization and Backtesting\naliases:\n - ../constrained-optimization-and-backtesting.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Constrained Optimization and Backtesting with R\n description-meta: Conduct portfolio backtesting with transaction costs and no-shortselling constraints using the programming language R. \n---\n\n\n\\index{Backtesting} In this chapter, we conduct portfolio backtesting in a realistic setting by including transaction costs and investment constraints such as no-short-selling rules. \\index{Short-selling}\nWe start with standard mean-variance efficient portfolios and introduce constraints in a step-by-step manner. To do so, we rely on numerical optimization procedures in R.\\index{Efficient portfolio} We conclude the chapter by providing an out-of-sample backtesting procedure for the different strategies that we introduce in this chapter. \n\nThroughout this chapter, we use the following R packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(nloptr)\n```\n:::\n\n\nCompared to previous chapters, we introduce the `nloptr` package [@nloptr] to perform numerical constrained optimization for portfolio choice problems.\\index{Optimization}\\index{Portfolio choice}\n\n## Data Preparation\n\nWe start by loading the required data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd). For simplicity, we restrict our investment universe to the monthly Fama-French industry portfolio returns in the following application. 
\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nindustry_returns <- tbl(tidy_finance, \"industries_ff_monthly\") |>\n select(-month) |>\n collect() |> \n drop_na()\n```\n:::\n\n\n## Recap of Portfolio Choice \n\nA common objective for portfolio optimization is to find mean-variance efficient portfolio weights, i.e., the allocation which delivers the lowest possible return variance for a given minimum level of expected returns. \nIn the most extreme case, where the investor is only concerned about portfolio variance, she may choose to implement the minimum variance portfolio (MVP) weights which are given by the solution to \n$$\\omega_\\text{mvp} = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1$$\nwhere $\\Sigma$ is the $(N \\times N)$ covariance matrix of the returns. The optimal weights $\\omega_\\text{mvp}$ can be found analytically and are $\\omega_\\text{mvp} = \\frac{\\Sigma^{-1}\\iota}{\\iota'\\Sigma^{-1}\\iota}$. In terms of code, the math is equivalent to the following chunk. \\index{Minimum variance portfolio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_industries <- ncol(industry_returns)\n\nSigma <- cov(industry_returns)\nw_mvp <- solve(Sigma) %*% rep(1, n_industries)\nw_mvp <- as.vector(w_mvp / sum(w_mvp))\n\nmu <- colMeans(industry_returns)\n```\n:::\n\n\nNext, consider an investor who aims to achieve minimum variance *given a required expected portfolio return* $\\bar{\\mu}$ such that she chooses\n$$\\omega_\\text{eff}({\\bar{\\mu}}) =\\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1 \\text{ and } \\omega'\\mu \\geq \\bar{\\mu}.$$\nWe leave it as an exercise below to show that the portfolio choice problem can equivalently be formulated for an investor with mean-variance preferences and risk aversion factor $\\gamma$. That means the investor aims to choose portfolio weights as the solution to\n$$ \\omega^*_\\gamma = \\arg\\max \\omega' \\mu - \\frac{\\gamma}{2}\\omega'\\Sigma \\omega\\quad \\text{ s.t. } \\omega'\\iota = 1.$$\nThe solution to the optimal portfolio choice problem is:\n$$\\omega^*_{\\gamma} = \\frac{1}{\\gamma}\\left(\\Sigma^{-1} - \\frac{1}{\\iota' \\Sigma^{-1}\\iota }\\Sigma^{-1}\\iota\\iota' \\Sigma^{-1} \\right) \\mu + \\frac{1}{\\iota' \\Sigma^{-1} \\iota }\\Sigma^{-1} \\iota.$$\nEmpirically, this classical solution imposes many problems. \nIn particular, the estimates of $\\mu$ are noisy over short horizons, the ($N \\times N$) matrix $\\Sigma$ contains $N(N-1)/2$ distinct elements and thus, estimation error is huge. \nSeminal papers on the effect of ignoring estimation uncertainty, among others, are @Brown1976, @Jobson1980, @Jorion1986, and @Chopra1993.\n\nEven worse, if the asset universe contains more assets than available time periods $(N > T)$, the sample covariance matrix is no longer positive definite such that the inverse $\\Sigma^{-1}$ does not exist anymore. \nTo address estimation issues for vast-dimensional covariance matrices, regularization techniques are a popular tool [see, e.g., @Ledoit2003; @Ledoit2004; @Ledoit2012; @Fan2008].\\index{Covariance}\n\nWhile the uncertainty associated with estimated parameters is challenging, the data-generating process is also unknown to the investor. 
In other words, model uncertainty reflects that it is ex-ante not even clear which parameters require estimation (for instance, if returns are driven by a factor model, selecting the universe of relevant factors imposes model uncertainty). @Wang2005 and @Garlappi2007 provide theoretical analysis on optimal portfolio choice under model *and* estimation uncertainty. In the most extreme case, @Pflug2012 shows that the naive portfolio which allocates equal wealth to all assets is the optimal choice for an investor averse to model uncertainty.\\index{Model uncertainty}\n\nOn top of the estimation uncertainty, *transaction costs* are a major concern.\\index{Transaction cost} \nRebalancing portfolios is costly, and, therefore, the optimal choice should depend on the investor's current holdings. In the presence of transaction costs, the benefits of reallocating wealth may be smaller than the costs associated with turnover. This aspect has been investigated theoretically, among others, for one risky asset by @Magill1976 and @Davis1990. Subsequent extensions to the case with multiple assets have been proposed by @Balduzzi1999 and @Balduzzi2000. More recent papers on empirical approaches which explicitly account for transaction costs include @Garleanu2013, and @DeMiguel2014, and @DeMiguel2015. \n\n## Estimation Uncertainty and Transaction Costs\n\nThe empirical evidence regarding the performance of a mean-variance optimization procedure in which you simply plug in some sample estimates $\\hat \\mu$ and $\\hat \\Sigma$ can be summarized rather briefly: mean-variance optimization performs poorly! The literature discusses many proposals to overcome these empirical issues. For instance, one may impose some form of regularization of $\\Sigma$, rely on Bayesian priors inspired by theoretical asset pricing models [@Kan2007] or use high-frequency data to improve forecasting [@Hautsch2015].\nOne unifying framework that works easily, effectively (even for large dimensions), and is purely inspired by economic arguments is an ex-ante adjustment for transaction costs [@Hautsch2019]. \n\nAssume that returns are from a multivariate normal distribution with mean $\\mu$ and variance-covariance matrix $\\Sigma$, $N(\\mu,\\Sigma)$. Additionally, we assume quadratic transaction costs which penalize rebalancing such that $$\n\\begin{aligned}\n\\nu\\left(\\omega_{t+1},\\omega_{t^+}, \\beta\\right) = \\frac{\\beta}{2} \\left(\\omega_{t+1} - \\omega_{t^+}\\right)'\\left(\\omega_{t+1}- \\omega_{t^+}\\right),\\end{aligned}$$\nwith cost parameter $\\beta>0$ and $\\omega_{t^+} = {\\omega_t \\circ (1 +r_{t})}/{\\iota' (\\omega_t \\circ (1 + r_{t}))}$. $\\omega_{t^+}$ denotes the portfolio weights just before rebalancing. Note that $\\omega_{t^+}$ differs mechanically from $\\omega_t$ due to the returns in the past period.\nIntuitively, transaction costs penalize portfolio performance when the portfolio is shifted from the current holdings $\\omega_{t^+}$ to a new allocation $\\omega_{t+1}$. \nIn this setup, transaction costs do not increase linearly. Instead, larger rebalancing is penalized more heavily than small adjustments. \nThen, the optimal portfolio choice for an investor with mean variance preferences is\n$$\\begin{aligned}\\omega_{t+1} ^* &= \\arg\\max \\omega'\\mu - \\nu_t (\\omega,\\omega_{t^+}, \\beta) - \\frac{\\gamma}{2}\\omega'\\Sigma\\omega \\text{ s.t. 
} \\iota'\\omega = 1\\\\\n&=\\arg\\max\n\\omega'\\mu^* - \\frac{\\gamma}{2}\\omega'\\Sigma^* \\omega \\text{ s.t.} \\iota'\\omega=1,\\end{aligned}$$\nwhere\n$$\\mu^*=\\mu+\\beta \\omega_{t^+} \\quad \\text{and} \\quad \\Sigma^*=\\Sigma + \\frac{\\beta}{\\gamma} I_N.$$\nAs a result, adjusting for transaction costs implies a standard mean-variance optimal portfolio choice with adjusted return parameters $\\Sigma^*$ and $\\mu^*$: $$\\omega^*_{t+1} = \\frac{1}{\\gamma}\\left(\\Sigma^{*-1} - \\frac{1}{\\iota' \\Sigma^{*-1}\\iota }\\Sigma^{*-1}\\iota\\iota' \\Sigma^{*-1} \\right) \\mu^* + \\frac{1}{\\iota' \\Sigma^{*-1} \\iota }\\Sigma^{*-1} \\iota.$$\n\nAn alternative formulation of the optimal portfolio can be derived as follows: \n$$\\omega_{t+1} ^*=\\arg\\max\n\\omega'\\left(\\mu+\\beta\\left(\\omega_{t^+} - \\frac{1}{N}\\iota\\right)\\right) - \\frac{\\gamma}{2}\\omega'\\Sigma^* \\omega \\text{ s.t. } \\iota'\\omega=1.$$\nThe optimal weights correspond to a mean-variance portfolio, where the vector of expected returns is such that assets that currently exhibit a higher weight are considered as delivering a higher expected return. \n\n## Optimal Portfolio Choice\n\nThe function below implements the efficient portfolio weight in its general form, allowing for transaction costs (conditional on the holdings *before* reallocation). \nFor $\\beta=0$, the computation resembles the standard mean-variance efficient framework. `gamma` denotes the coefficient of risk aversion $\\gamma$, `beta` is the transaction cost parameter $\\beta$ and `w_prev` are the weights before rebalancing $\\omega_{t^+}$. \n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_efficient_weight <- function(Sigma,\n mu,\n gamma = 2,\n beta = 0, # transaction costs\n w_prev = rep(\n 1 / ncol(Sigma),\n ncol(Sigma)\n )) {\n iota <- rep(1, ncol(Sigma))\n Sigma_processed <- Sigma + beta / gamma * diag(ncol(Sigma))\n mu_processed <- mu + beta * w_prev\n\n Sigma_inverse <- solve(Sigma_processed)\n\n w_mvp <- Sigma_inverse %*% iota\n w_mvp <- as.vector(w_mvp / sum(w_mvp))\n w_opt <- w_mvp + 1 / gamma *\n (Sigma_inverse - w_mvp %*% t(iota) %*% Sigma_inverse) %*%\n mu_processed\n \n return(as.vector(w_opt))\n}\n\nw_efficient <- compute_efficient_weight(Sigma, mu)\nround(w_efficient, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1.630 0.090 -1.355 0.687 0.333 -0.412 0.504 0.402 -0.219\n[10] -0.661\n```\n\n\n:::\n:::\n\n\nThe portfolio weights above indicate the efficient portfolio for an investor with risk aversion coefficient $\\gamma=2$ in absence of transaction costs. Some of the positions are negative which implies short-selling, most of the positions are rather extreme. For instance, a position of $-1$ implies that the investor takes a short position worth her entire wealth to lever long positions in other assets. \\index{Short-selling}\nWhat is the effect of transaction costs or different levels of risk aversion on the optimal portfolio choice? 
The following few lines of code analyze the distance between the minimum variance portfolio and the portfolio implemented by the investor for different values of the transaction cost parameter $\\beta$ and risk aversion $\\gamma$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngammas <- c(2, 4, 8, 20)\nbetas <- 20 * qexp((1:99) / 100)\n \ntransaction_costs <- expand_grid(\n gamma = gammas,\n beta = betas\n) |>\n mutate(\n weights = map2(\n .x = gamma,\n .y = beta,\n ~ compute_efficient_weight(Sigma,\n mu,\n gamma = .x,\n beta = .y / 10000,\n w_prev = w_mvp\n )\n ),\n concentration = map_dbl(weights, ~ sum(abs(. - w_mvp)))\n )\n```\n:::\n\n\nThe code chunk above computes the optimal weight in presence of transaction cost for different values of $\\beta$ and $\\gamma$ but with the same initial allocation, the theoretical optimal minimum variance portfolio. \nStarting from the initial allocation, the investor chooses her optimal allocation along the efficient frontier to reflect her own risk preferences. \nIf transaction costs would be absent, the investor would simply implement the mean-variance efficient allocation. If transaction costs make it costly to rebalance, her optimal portfolio choice reflects a shift toward the efficient portfolio, whereas her current portfolio anchors her investment.\\index{Graph!Comparative statics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntransaction_costs |>\n mutate(risk_aversion = as_factor(gamma)) |>\n ggplot(aes(\n x = beta,\n y = concentration,\n color = risk_aversion,\n linetype = risk_aversion\n )) +\n geom_line() +\n guides(linetype = \"none\") + \n labs(\n x = \"Transaction cost parameter\",\n y = \"Distance from MVP\",\n color = \"Risk aversion\",\n title = \"Portfolio weights for different risk aversion and transaction cost\"\n )\n```\n\n::: {.cell-output-display}\n![The horizontal axis indicates the distance from the empirical minimum variance portfolio weight, measured by the sum of the absolute deviations of the chosen portfolio from the benchmark.](constrained-optimization-and-backtesting_files/figure-html/fig-1701-1.png){#fig-1701 fig-alt='Title: Portfolio weights for different risk aversion and transaction cost. The figure shows four lines that indicate that rebalancing from the initial portfolio decreases for higher transaction costs and for higher risk aversion.' width=2100}\n:::\n:::\n\n\n@fig-1701 shows rebalancing from the initial portfolio (which we always set to the minimum variance portfolio weights in this example). The higher the transaction costs parameter $\\beta$, the smaller is the rebalancing from the initial portfolio. In addition, if risk aversion $\\gamma$ increases, the efficient portfolio is closer to the minimum variance portfolio weights such that the investor desires less rebalancing from the initial holdings. \n\n## Constrained Optimization\n\nNext, we introduce constraints to the above optimization procedure. Very often, typical constraints such as short-selling restrictions prevent analytical solutions for optimal portfolio weights (short-selling restrictions simply imply that negative weights are not allowed such that we require that $w_i \\geq 0\\quad \\forall i$). However, numerical optimization allows computing the solutions to such constrained problems. \n\nWe rely on the powerful `nloptr` package, which provides a common interface to a number of different optimization routines. 
In particular, we employ the Sequential Least-Squares Quadratic Programming (SLSQP) algorithm of @Kraft1994 because it is able to hand multiple equality and inequality constraints at the same time and typically used for problems where the objective function and the constraints are twice continuously differentiable. We hence have to provide the algorithm with the objective function and its gradient, as well as the constraints and their Jacobian. \n\nWe illustrate the use of the `nloptr()` function by replicating the analytical solutions for the minimum variance and efficient portfolio weights from above. Note that the equality constraint for both solutions is given by the requirement that the weights must sum up to one. In addition, we supply a vector of equal weights as an initial value for the algorithm in all applications. We verify that the output is equal to the above solution. Note that `near()` is a safe way to compare two vectors for pairwise equality. The alternative `==` is sensitive to small differences that may occur due to the representation of floating points on a computer, while `near()` has a built-in tolerance. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nw_initial <- rep(1 / n_industries, n_industries)\n\nobjective_mvp <- function(w) {\n 0.5 * t(w) %*% Sigma %*% w\n}\n\ngradient_mvp <- function(w) {\n Sigma %*% w\n}\n\nequality_constraint <- function(w) {\n sum(w) - 1\n}\n\njacobian_equality <- function(w) {\n rep(1, n_industries)\n}\n\noptions <- list(\n \"xtol_rel\"=1e-20, \n \"algorithm\" = \"NLOPT_LD_SLSQP\", \n \"maxeval\" = 10000\n)\n\nw_mvp_numerical <- nloptr(\n x0 = w_initial, \n eval_f = objective_mvp, \n eval_grad_f = gradient_mvp,\n eval_g_eq = equality_constraint,\n eval_jac_g_eq = jacobian_equality,\n opts = options\n)\n\nall(near(w_mvp, w_mvp_numerical$solution))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nobjective_efficient <- function(w) {\n 2 * 0.5 * t(w) %*% Sigma %*% w - sum((1 + mu) * w)\n}\n\ngradient_efficient <- function(w) {\n 2 * Sigma %*% w - (1 + mu)\n}\n\nw_efficient_numerical <- nloptr(\n x0 = w_initial, \n eval_f = objective_efficient, \n eval_grad_f = gradient_efficient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality,\n opts = options\n)\n\nall(near(w_efficient, w_efficient_numerical$solution))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\nThe result above shows that indeed the numerical procedure recovered the optimal weights for a scenario, where we already know the analytic solution. For more complex optimization routines, [R's optimization task view](https://cran.r-project.org/web/views/Optimization.html) provides an overview of the vast optimization landscape. \\index{Optimization}\n\nNext, we approach problems where no analytical solutions exist. First, we additionally impose short-sale constraints, which implies $N$ inequality constraints of the form $\\omega_i >=0$. 
We can implement the short sale constraints by imposing a vector of lower bounds `lb = rep(0, n_industries)`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nw_no_short_sale <- nloptr(\n x0 = w_initial, \n eval_f = objective_efficient, \n eval_grad_f = gradient_efficient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality,\n lb = rep(0, n_industries),\n opts = options\n)\n\nround(w_no_short_sale$solution, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 0.610 0.000 0.000 0.211 0.000 0.000 0.000 0.179 0.000 0.000\n```\n\n\n:::\n:::\n\n\nAs expected, the resulting portfolio weights are all positive (up to numerical precision). Typically, the holdings in the presence of short-sale constraints are concentrated among way fewer assets than for the unrestricted case. \nYou can verify that `sum(w_no_short_sale$solution)` returns 1. In other words: `nloptr()` provides the numerical solution to a portfolio choice problem for a mean-variance investor with risk aversion `gamma = 2`, where negative holdings are forbidden. \n\n`nloptr()` can also handle more complex problems. As an example, we show how to compute optimal weights, subject to the so-called [Regulation T-constraint,](https://en.wikipedia.org/wiki/Regulation_T) which requires that the sum of all absolute portfolio weights is smaller than 1.5, that is $\\sum_{i=1}^N |\\omega_i| \\leq 1.5$. \nThe constraint enforces that a maximum of 50 percent of the allocated wealth can be allocated to short positions, thus implying an initial margin requirement of 50 percent. Imposing such a margin requirement reduces portfolio risks because extreme portfolio weights are not attainable anymore. The implementation of Regulation-T rules is numerically interesting because the margin constraints imply a non-linear constraint on the portfolio weights. 
\\index{Regulation T}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreg_t <- 1.5\n\ninequality_constraint <- function(w) {\n sum(abs(w)) - reg_t\n}\n\njacobian_inequality <- function(w) {\n sign(w)\n}\n\nobjective_reg_t <- function(w) {\n - t(w) %*% (1 + mu) +\n 2 * 0.5 * t(w) %*% Sigma %*% w\n}\n\ngradient_reg_t <- function(w) {\n - (1 + mu) + 2 * Sigma %*% w\n}\n\nw_reg_t <- nloptr(\n x0 = w_initial,\n eval_f = objective_reg_t, \n eval_grad_f = gradient_reg_t,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality, \n eval_g_ineq = inequality_constraint, \n eval_jac_g_ineq = jacobian_inequality,\n opts = options\n)\n\nround(w_reg_t$solution, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 0.736 0.000 -0.134 0.264 0.000 -0.019 0.028 0.223 0.000\n[10] -0.097\n```\n\n\n:::\n:::\n\n\n@fig-1702 shows the optimal allocation weights across all 10 industries for the four different strategies considered so far: minimum variance, efficient portfolio with $\\gamma$ = 2, efficient portfolio with short-sale constraints, and the Regulation-T constrained portfolio.\\index{Graph!Bar chart}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n `No short-sale` = w_no_short_sale$solution,\n `Minimum Variance` = w_mvp,\n `Efficient portfolio` = compute_efficient_weight(Sigma, mu),\n `Regulation-T` = w_reg_t$solution,\n Industry = colnames(industry_returns)\n) |>\n pivot_longer(-Industry,\n names_to = \"Strategy\",\n values_to = \"weights\"\n ) |>\n ggplot(aes(\n fill = Strategy,\n y = weights,\n x = Industry\n )) +\n geom_bar(position = \"dodge\", stat = \"identity\") +\n coord_flip() +\n labs(\n y = \"Allocation weight\", fill = NULL,\n title = \"Optimal allocations for different strategies\"\n ) +\n scale_y_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![Optimal allocation weights for the 10 industry portfolios and the 4 different allocation strategies.](constrained-optimization-and-backtesting_files/figure-html/fig-1702-1.png){#fig-1702 fig-alt='Title: Optimal allocations for different strategies. The figure shows the portfolio weights for the four different strategies across the 10 different industries. The figures indicate extreme long and short positions for the efficient portfolio.' width=2100}\n:::\n:::\n\n\nThe results clearly indicate the effect of imposing additional constraints: the extreme holdings the investor implements if she follows the (theoretically optimal) efficient portfolio vanish under, e.g., the Regulation-T constraint.\nYou may wonder why an investor would deviate from what is theoretically the optimal portfolio by imposing potentially arbitrary constraints. \nThe short answer is: the *efficient portfolio* is only efficient if the true parameters of the data generating process correspond to the estimated parameters $\\hat\\Sigma$ and $\\hat\\mu$. \nEstimation uncertainty may thus lead to inefficient allocations. By imposing restrictions, we implicitly shrink the set of possible weights and prevent extreme allocations, which could result from *error-maximization* due to estimation uncertainty [@Jagannathan2003].\n\nBefore we move on, we want to propose a final allocation strategy, which reflects a somewhat more realistic structure of transaction costs instead of the quadratic specification used above. The function below computes efficient portfolio weights while adjusting for transaction costs of the form $\\beta\\sum_{i=1}^N |(\\omega_{i, t+1} - \\omega_{i, t^+})|$. 
No closed-form solution exists, and we rely on non-linear optimization procedures.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_efficient_weight_L1_TC <- function(mu,\n Sigma,\n gamma,\n beta,\n initial_weights) {\n objective <- function(w) {\n -t(w) %*% mu +\n gamma / 2 * t(w) %*% Sigma %*% w +\n (beta / 10000) / 2 * sum(abs(w - initial_weights))\n }\n \n gradient <- function(w) {\n -mu + gamma * Sigma %*% w + \n (beta / 10000) * 0.5 * sign(w - initial_weights)\n }\n\n w_optimal <- nloptr(\n x0 = initial_weights,\n eval_f = objective, \n eval_grad_f = gradient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality, \n opts = options\n )\n\n return(w_optimal$solution)\n}\n```\n:::\n\n\n## Out-of-Sample Backtesting\n\nFor the sake of simplicity, we committed one fundamental error in computing portfolio weights above: we used the full sample of the data to determine the optimal allocation [@Harvey2019]. To implement this strategy at the beginning of the 2000s, you will need to know how the returns will evolve until 2021. \\index{Backtesting} \\index{Performance evaluation}\\index{Out-of-sample}\nWhile interesting from a methodological point of view, we cannot evaluate the performance of the portfolios in a reasonable out-of-sample fashion. We do so next in a backtesting application for three strategies. For the backtest, we recompute optimal weights just based on past available data. \n\nThe few lines below define the general setup. We consider 120 periods from the past to update the parameter estimates before recomputing portfolio weights. Then, we update portfolio weights which is costly and affects the performance. The portfolio weights determine the portfolio return. A period later, the current portfolio weights have changed and form the foundation for transaction costs incurred in the next period. We consider three different competing strategies: the mean-variance efficient portfolio, the mean-variance efficient portfolio with ex-ante adjustment for transaction costs, and the naive portfolio, which allocates wealth equally across the different assets.\\index{Transaction cost}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwindow_length <- 120\nperiods <- nrow(industry_returns) - window_length\n\nbeta <- 50\ngamma <- 2\n\nperformance_values <- matrix(NA,\n nrow = periods,\n ncol = 3\n)\ncolnames(performance_values) <- c(\"raw_return\", \"turnover\", \"net_return\")\n\nperformance_values <- list(\n \"MV (TC)\" = performance_values,\n \"Naive\" = performance_values,\n \"MV\" = performance_values\n)\n\nw_prev_1 <- w_prev_2 <- w_prev_3 <- rep(\n 1 / n_industries,\n n_industries\n)\n```\n:::\n\n\nWe also define two helper functions: one to adjust the weights due to returns and one for performance evaluation, where we compute realized returns net of transaction costs. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nadjust_weights <- function(w, next_return) {\n w_prev <- 1 + w * next_return\n as.numeric(w_prev / sum(as.vector(w_prev)))\n}\n\nevaluate_performance <- function(w, w_previous, next_return, beta = 50) {\n raw_return <- as.matrix(next_return) %*% w\n turnover <- sum(abs(w - w_previous))\n net_return <- raw_return - beta / 10000 * turnover\n c(raw_return, turnover, net_return)\n}\n```\n:::\n\n\nThe following code chunk performs a rolling-window estimation, which we implement in a loop. In each period, the estimation window contains the returns available up to the current period. 
\nNote that we use the sample variance-covariance matrix and ignore the estimation of $\\hat\\mu$ entirely, but you might use more advanced estimators in practice. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (p in 1:periods) {\n returns_window <- industry_returns[p:(p + window_length - 1), ]\n next_return <- industry_returns[p + window_length, ] |> as.matrix()\n\n Sigma <- cov(returns_window)\n mu <- 0 * colMeans(returns_window)\n\n # Transaction-cost adjusted portfolio\n w_1 <- compute_efficient_weight_L1_TC(\n mu = mu,\n Sigma = Sigma,\n beta = beta,\n gamma = gamma,\n initial_weights = w_prev_1\n )\n\n performance_values[[1]][p, ] <- evaluate_performance(w_1,\n w_prev_1,\n next_return,\n beta = beta\n )\n\n w_prev_1 <- adjust_weights(w_1, next_return)\n\n # Naive portfolio\n w_2 <- rep(1 / n_industries, n_industries)\n\n performance_values[[2]][p, ] <- evaluate_performance(\n w_2,\n w_prev_2,\n next_return\n )\n\n w_prev_2 <- adjust_weights(w_2, next_return)\n\n # Mean-variance efficient portfolio (w/o transaction costs)\n w_3 <- compute_efficient_weight(\n Sigma = Sigma,\n mu = mu,\n gamma = gamma\n )\n\n performance_values[[3]][p, ] <- evaluate_performance(\n w_3,\n w_prev_3,\n next_return\n )\n\n w_prev_3 <- adjust_weights(w_3, next_return)\n}\n```\n:::\n\n\nFinally, we get to the evaluation of the portfolio strategies *net-of-transaction costs*. Note that we compute annualized returns and standard deviations. \\index{Sharpe Ratio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nperformance <- lapply(\n performance_values,\n as_tibble\n) |>\n bind_rows(.id = \"strategy\")\n\nlength_year <- 12\n\nperformance_table <- performance |>\n group_by(Strategy = strategy) |>\n summarize(\n Mean = length_year * mean(100 * net_return),\n SD = sqrt(length_year) * sd(100 * net_return),\n `Sharpe ratio` = if_else(Mean > 0,\n Mean / SD,\n NA_real_\n ),\n Turnover = 100 * mean(turnover)\n )\n\nperformance_table |> \n mutate(across(-Strategy, ~round(., 4)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 5\n Strategy Mean SD `Sharpe ratio` Turnover\n \n1 MV -0.897 12.6 NA 211. \n2 MV (TC) 11.9 15.2 0.780 0 \n3 Naive 11.8 15.2 0.779 0.234\n```\n\n\n:::\n:::\n\n\nThe results clearly speak against mean-variance optimization. Turnover is huge when the investor only considers her portfolio's expected return and variance. Effectively, the mean-variance portfolio generates a *negative* annualized return after adjusting for transaction costs. At the same time, the naive portfolio turns out to perform very well. In fact, the performance gains of the transaction-cost adjusted mean-variance portfolio are small. The out-of-sample Sharpe ratio is slightly higher than for the naive portfolio. Note the extreme effect of turnover penalization on turnover: *MV (TC)* effectively resembles a buy-and-hold strategy which only updates the portfolio once the estimated parameters $\\hat\\mu_t$ and $\\hat\\Sigma_t$indicate that the current allocation is too far away from the optimal theoretical portfolio. \n\n## Exercises\n\n1. Consider the portfolio choice problem for transaction-cost adjusted certainty equivalent maximization with risk aversion parameter $\\gamma$ \n$$\\omega_{t+1} ^* = \\arg\\max_{\\omega \\in \\mathbb{R}^N, \\iota'\\omega = 1} \\omega'\\mu - \\nu_t (\\omega, \\beta) - \\frac{\\gamma}{2}\\omega'\\Sigma\\omega$$\nwhere $\\Sigma$ and $\\mu$ are (estimators of) the variance-covariance matrix of the returns and the vector of expected returns. 
Assume for now that transaction costs are quadratic in rebalancing *and* proportional to stock illiquidity such that \n$$\\nu_t\\left(\\omega, B\\right) = \\frac{\\beta}{2} \\left(\\omega - \\omega_{t^+}\\right)'B\\left(\\omega - \\omega_{t^+}\\right)$$ where $B = \\text{diag}(ill_1, \\ldots, ill_N)$ is a diagonal matrix, where $ill_1, \\ldots, ill_N$. Derive a closed-form solution for the mean-variance efficient portfolio $\\omega_{t+1} ^*$ based on the transaction cost specification above. Discuss the effect of illiquidity $ill_i$ on the individual portfolio weights relative to an investor that myopically ignores transaction costs in her decision. \n1. Use the solution from the previous exercise to update the function `compute_efficient_weight()` such that you can compute optimal weights conditional on a matrix $B$ with illiquidity measures. \n1. Illustrate the evolution of the *optimal* weights from the naive portfolio to the efficient portfolio in the mean-standard deviation diagram.\n1. Is it always optimal to choose the same $\\beta$ in the optimization problem than the value used in evaluating the portfolio performance? In other words: can it be optimal to choose theoretically sub-optimal portfolios based on transaction cost considerations that do not reflect the actual incurred costs? Evaluate the out-of-sample Sharpe ratio after transaction costs for a range of different values of imposed $\\beta$ values.", + "markdown": "---\ntitle: Constrained Optimization and Backtesting\naliases:\n - ../constrained-optimization-and-backtesting.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Constrained Optimization and Backtesting with R\n description-meta: Conduct portfolio backtesting with transaction costs and no-shortselling constraints using the programming language R. \n---\n\n\n\\index{Backtesting} In this chapter, we conduct portfolio backtesting in a realistic setting by including transaction costs and investment constraints such as no-short-selling rules. \\index{Short-selling}\nWe start with standard mean-variance efficient portfolios and introduce constraints in a step-by-step manner. To do so, we rely on numerical optimization procedures in R.\\index{Efficient portfolio} We conclude the chapter by providing an out-of-sample backtesting procedure for the different strategies that we introduce in this chapter. \n\nThroughout this chapter, we use the following R packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(nloptr)\n```\n:::\n\n\nCompared to previous chapters, we introduce the `nloptr` package [@nloptr] to perform numerical constrained optimization for portfolio choice problems.\\index{Optimization}\\index{Portfolio choice}\n\n## Data Preparation\n\nWe start by loading the required data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd). For simplicity, we restrict our investment universe to the monthly Fama-French industry portfolio returns in the following application. 
\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nindustry_returns <- tbl(tidy_finance, \"industries_ff_monthly\") |>\n select(-month) |>\n collect() |> \n drop_na()\n```\n:::\n\n\n## Recap of Portfolio Choice \n\nA common objective for portfolio optimization is to find mean-variance efficient portfolio weights, i.e., the allocation which delivers the lowest possible return variance for a given minimum level of expected returns. \nIn the most extreme case, where the investor is only concerned about portfolio variance, she may choose to implement the minimum variance portfolio (MVP) weights which are given by the solution to \n$$\\omega_\\text{mvp} = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1$$\nwhere $\\Sigma$ is the $(N \\times N)$ covariance matrix of the returns. The optimal weights $\\omega_\\text{mvp}$ can be found analytically and are $\\omega_\\text{mvp} = \\frac{\\Sigma^{-1}\\iota}{\\iota'\\Sigma^{-1}\\iota}$. In terms of code, the math is equivalent to the following chunk. \\index{Minimum variance portfolio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_industries <- ncol(industry_returns)\n\nSigma <- cov(industry_returns)\nw_mvp <- solve(Sigma) %*% rep(1, n_industries)\nw_mvp <- as.vector(w_mvp / sum(w_mvp))\n\nmu <- colMeans(industry_returns)\n```\n:::\n\n\nNext, consider an investor who aims to achieve minimum variance *given a required expected portfolio return* $\\bar{\\mu}$ such that she chooses\n$$\\omega_\\text{eff}({\\bar{\\mu}}) =\\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1 \\text{ and } \\omega'\\mu \\geq \\bar{\\mu}.$$\nWe leave it as an exercise below to show that the portfolio choice problem can equivalently be formulated for an investor with mean-variance preferences and risk aversion factor $\\gamma$. That means the investor aims to choose portfolio weights as the solution to\n$$ \\omega^*_\\gamma = \\arg\\max \\omega' \\mu - \\frac{\\gamma}{2}\\omega'\\Sigma \\omega\\quad \\text{ s.t. } \\omega'\\iota = 1.$$\nThe solution to the optimal portfolio choice problem is:\n$$\\omega^*_{\\gamma} = \\frac{1}{\\gamma}\\left(\\Sigma^{-1} - \\frac{1}{\\iota' \\Sigma^{-1}\\iota }\\Sigma^{-1}\\iota\\iota' \\Sigma^{-1} \\right) \\mu + \\frac{1}{\\iota' \\Sigma^{-1} \\iota }\\Sigma^{-1} \\iota.$$\nEmpirically, this classical solution imposes many problems. \nIn particular, the estimates of $\\mu$ are noisy over short horizons, the ($N \\times N$) matrix $\\Sigma$ contains $N(N-1)/2$ distinct elements and thus, estimation error is huge. \nSeminal papers on the effect of ignoring estimation uncertainty, among others, are @Brown1976, @Jobson1980, @Jorion1986, and @Chopra1993.\n\nEven worse, if the asset universe contains more assets than available time periods $(N > T)$, the sample covariance matrix is no longer positive definite such that the inverse $\\Sigma^{-1}$ does not exist anymore. \nTo address estimation issues for vast-dimensional covariance matrices, regularization techniques are a popular tool [see, e.g., @Ledoit2003; @Ledoit2004; @Ledoit2012; @Fan2008].\\index{Covariance}\n\nWhile the uncertainty associated with estimated parameters is challenging, the data-generating process is also unknown to the investor. 
In other words, model uncertainty reflects that it is ex-ante not even clear which parameters require estimation (for instance, if returns are driven by a factor model, selecting the universe of relevant factors imposes model uncertainty). @Wang2005 and @Garlappi2007 provide theoretical analysis on optimal portfolio choice under model *and* estimation uncertainty. In the most extreme case, @Pflug2012 shows that the naive portfolio which allocates equal wealth to all assets is the optimal choice for an investor averse to model uncertainty.\\index{Model uncertainty}\n\nOn top of the estimation uncertainty, *transaction costs* are a major concern.\\index{Transaction cost} \nRebalancing portfolios is costly, and, therefore, the optimal choice should depend on the investor's current holdings. In the presence of transaction costs, the benefits of reallocating wealth may be smaller than the costs associated with turnover. This aspect has been investigated theoretically, among others, for one risky asset by @Magill1976 and @Davis1990. Subsequent extensions to the case with multiple assets have been proposed by @Balduzzi1999 and @Balduzzi2000. More recent papers on empirical approaches which explicitly account for transaction costs include @Garleanu2013, and @DeMiguel2014, and @DeMiguel2015. \n\n## Estimation Uncertainty and Transaction Costs\n\nThe empirical evidence regarding the performance of a mean-variance optimization procedure in which you simply plug in some sample estimates $\\hat \\mu$ and $\\hat \\Sigma$ can be summarized rather briefly: mean-variance optimization performs poorly! The literature discusses many proposals to overcome these empirical issues. For instance, one may impose some form of regularization of $\\Sigma$, rely on Bayesian priors inspired by theoretical asset pricing models [@Kan2007] or use high-frequency data to improve forecasting [@Hautsch2015].\nOne unifying framework that works easily, effectively (even for large dimensions), and is purely inspired by economic arguments is an ex-ante adjustment for transaction costs [@Hautsch2019]. \n\nAssume that returns are from a multivariate normal distribution with mean $\\mu$ and variance-covariance matrix $\\Sigma$, $N(\\mu,\\Sigma)$. Additionally, we assume quadratic transaction costs which penalize rebalancing such that $$\n\\begin{aligned}\n\\nu\\left(\\omega_{t+1},\\omega_{t^+}, \\beta\\right) = \\frac{\\beta}{2} \\left(\\omega_{t+1} - \\omega_{t^+}\\right)'\\left(\\omega_{t+1}- \\omega_{t^+}\\right),\\end{aligned}$$\nwith cost parameter $\\beta>0$ and $\\omega_{t^+} = {\\omega_t \\circ (1 +r_{t})}/{\\iota' (\\omega_t \\circ (1 + r_{t}))}$. $\\omega_{t^+}$ denotes the portfolio weights just before rebalancing. Note that $\\omega_{t^+}$ differs mechanically from $\\omega_t$ due to the returns in the past period.\nIntuitively, transaction costs penalize portfolio performance when the portfolio is shifted from the current holdings $\\omega_{t^+}$ to a new allocation $\\omega_{t+1}$. \nIn this setup, transaction costs do not increase linearly. Instead, larger rebalancing is penalized more heavily than small adjustments. \nThen, the optimal portfolio choice for an investor with mean variance preferences is\n$$\\begin{aligned}\\omega_{t+1} ^* &= \\arg\\max \\omega'\\mu - \\nu_t (\\omega,\\omega_{t^+}, \\beta) - \\frac{\\gamma}{2}\\omega'\\Sigma\\omega \\text{ s.t. 
} \\iota'\\omega = 1\\\\\n&=\\arg\\max\n\\omega'\\mu^* - \\frac{\\gamma}{2}\\omega'\\Sigma^* \\omega \\text{ s.t.} \\iota'\\omega=1,\\end{aligned}$$\nwhere\n$$\\mu^*=\\mu+\\beta \\omega_{t^+} \\quad \\text{and} \\quad \\Sigma^*=\\Sigma + \\frac{\\beta}{\\gamma} I_N.$$\nAs a result, adjusting for transaction costs implies a standard mean-variance optimal portfolio choice with adjusted return parameters $\\Sigma^*$ and $\\mu^*$: $$\\omega^*_{t+1} = \\frac{1}{\\gamma}\\left(\\Sigma^{*-1} - \\frac{1}{\\iota' \\Sigma^{*-1}\\iota }\\Sigma^{*-1}\\iota\\iota' \\Sigma^{*-1} \\right) \\mu^* + \\frac{1}{\\iota' \\Sigma^{*-1} \\iota }\\Sigma^{*-1} \\iota.$$\n\nAn alternative formulation of the optimal portfolio can be derived as follows: \n$$\\omega_{t+1} ^*=\\arg\\max\n\\omega'\\left(\\mu+\\beta\\left(\\omega_{t^+} - \\frac{1}{N}\\iota\\right)\\right) - \\frac{\\gamma}{2}\\omega'\\Sigma^* \\omega \\text{ s.t. } \\iota'\\omega=1.$$\nThe optimal weights correspond to a mean-variance portfolio, where the vector of expected returns is such that assets that currently exhibit a higher weight are considered as delivering a higher expected return. \n\n## Optimal Portfolio Choice\n\nThe function below implements the efficient portfolio weight in its general form, allowing for transaction costs (conditional on the holdings *before* reallocation). \nFor $\\beta=0$, the computation resembles the standard mean-variance efficient framework. `gamma` denotes the coefficient of risk aversion $\\gamma$, `beta` is the transaction cost parameter $\\beta$ and `w_prev` are the weights before rebalancing $\\omega_{t^+}$. \n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_efficient_weight <- function(Sigma,\n mu,\n gamma = 2,\n beta = 0, # transaction costs\n w_prev = rep(\n 1 / ncol(Sigma),\n ncol(Sigma)\n )) {\n iota <- rep(1, ncol(Sigma))\n Sigma_processed <- Sigma + beta / gamma * diag(ncol(Sigma))\n mu_processed <- mu + beta * w_prev\n\n Sigma_inverse <- solve(Sigma_processed)\n\n w_mvp <- Sigma_inverse %*% iota\n w_mvp <- as.vector(w_mvp / sum(w_mvp))\n w_opt <- w_mvp + 1 / gamma *\n (Sigma_inverse - w_mvp %*% t(iota) %*% Sigma_inverse) %*%\n mu_processed\n \n return(as.vector(w_opt))\n}\n\nw_efficient <- compute_efficient_weight(Sigma, mu)\nround(w_efficient, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1.630 0.090 -1.356 0.687 0.333 -0.412 0.504 0.402 -0.219\n[10] -0.659\n```\n\n\n:::\n:::\n\n\nThe portfolio weights above indicate the efficient portfolio for an investor with risk aversion coefficient $\\gamma=2$ in absence of transaction costs. Some of the positions are negative which implies short-selling, most of the positions are rather extreme. For instance, a position of $-1$ implies that the investor takes a short position worth her entire wealth to lever long positions in other assets. \\index{Short-selling}\nWhat is the effect of transaction costs or different levels of risk aversion on the optimal portfolio choice? 
The following few lines of code analyze the distance between the minimum variance portfolio and the portfolio implemented by the investor for different values of the transaction cost parameter $\\beta$ and risk aversion $\\gamma$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngammas <- c(2, 4, 8, 20)\nbetas <- 20 * qexp((1:99) / 100)\n \ntransaction_costs <- expand_grid(\n gamma = gammas,\n beta = betas\n) |>\n mutate(\n weights = map2(\n .x = gamma,\n .y = beta,\n ~ compute_efficient_weight(Sigma,\n mu,\n gamma = .x,\n beta = .y / 10000,\n w_prev = w_mvp\n )\n ),\n concentration = map_dbl(weights, ~ sum(abs(. - w_mvp)))\n )\n```\n:::\n\n\nThe code chunk above computes the optimal weights in the presence of transaction costs for different values of $\\beta$ and $\\gamma$ but with the same initial allocation, the theoretically optimal minimum variance portfolio. \nStarting from the initial allocation, the investor chooses her optimal allocation along the efficient frontier to reflect her own risk preferences. \nIf transaction costs were absent, the investor would simply implement the mean-variance efficient allocation. If transaction costs make it costly to rebalance, her optimal portfolio choice reflects a shift toward the efficient portfolio, whereas her current portfolio anchors her investment.\\index{Graph!Comparative statics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntransaction_costs |>\n mutate(risk_aversion = as_factor(gamma)) |>\n ggplot(aes(\n x = beta,\n y = concentration,\n color = risk_aversion,\n linetype = risk_aversion\n )) +\n geom_line() +\n guides(linetype = \"none\") + \n labs(\n x = \"Transaction cost parameter\",\n y = \"Distance from MVP\",\n color = \"Risk aversion\",\n title = \"Portfolio weights for different risk aversion and transaction cost\"\n )\n```\n\n::: {.cell-output-display}\n![The vertical axis indicates the distance from the empirical minimum variance portfolio weight, measured by the sum of the absolute deviations of the chosen portfolio from the benchmark.](constrained-optimization-and-backtesting_files/figure-html/fig-1701-1.png){#fig-1701 fig-alt='Title: Portfolio weights for different risk aversion and transaction cost. The figure shows four lines that indicate that rebalancing from the initial portfolio decreases for higher transaction costs and for higher risk aversion.' width=2100}\n:::\n:::\n\n\n@fig-1701 shows rebalancing from the initial portfolio (which we always set to the minimum variance portfolio weights in this example). The higher the transaction cost parameter $\\beta$, the smaller the rebalancing from the initial portfolio. In addition, if risk aversion $\\gamma$ increases, the efficient portfolio is closer to the minimum variance portfolio weights such that the investor desires less rebalancing from the initial holdings. \n\n## Constrained Optimization\n\nNext, we introduce constraints to the above optimization procedure. Very often, typical constraints such as short-selling restrictions prevent analytical solutions for optimal portfolio weights (short-selling restrictions simply imply that negative weights are not allowed such that we require that $w_i \\geq 0\\quad \\forall i$). However, numerical optimization allows computing the solutions to such constrained problems. \n\nWe rely on the powerful `nloptr` package, which provides a common interface to a number of different optimization routines. 
In particular, we employ the Sequential Least-Squares Quadratic Programming (SLSQP) algorithm of @Kraft1994 because it is able to handle multiple equality and inequality constraints at the same time and is typically used for problems where the objective function and the constraints are twice continuously differentiable. We hence have to provide the algorithm with the objective function and its gradient, as well as the constraints and their Jacobian. \n\nWe illustrate the use of the `nloptr()` function by replicating the analytical solutions for the minimum variance and efficient portfolio weights from above. Note that the equality constraint for both solutions is given by the requirement that the weights must sum up to one. In addition, we supply a vector of equal weights as an initial value for the algorithm in all applications. We verify that the output is equal to the above solution. Note that `near()` is a safe way to compare two vectors for pairwise equality. The alternative `==` is sensitive to small differences that may occur due to the representation of floating points on a computer, while `near()` has a built-in tolerance. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nw_initial <- rep(1 / n_industries, n_industries)\n\nobjective_mvp <- function(w) {\n 0.5 * t(w) %*% Sigma %*% w\n}\n\ngradient_mvp <- function(w) {\n Sigma %*% w\n}\n\nequality_constraint <- function(w) {\n sum(w) - 1\n}\n\njacobian_equality <- function(w) {\n rep(1, n_industries)\n}\n\noptions <- list(\n \"xtol_rel\"=1e-20, \n \"algorithm\" = \"NLOPT_LD_SLSQP\", \n \"maxeval\" = 10000\n)\n\nw_mvp_numerical <- nloptr(\n x0 = w_initial, \n eval_f = objective_mvp, \n eval_grad_f = gradient_mvp,\n eval_g_eq = equality_constraint,\n eval_jac_g_eq = jacobian_equality,\n opts = options\n)\n\nall(near(w_mvp, w_mvp_numerical$solution))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nobjective_efficient <- function(w) {\n 2 * 0.5 * t(w) %*% Sigma %*% w - sum((1 + mu) * w)\n}\n\ngradient_efficient <- function(w) {\n 2 * Sigma %*% w - (1 + mu)\n}\n\nw_efficient_numerical <- nloptr(\n x0 = w_initial, \n eval_f = objective_efficient, \n eval_grad_f = gradient_efficient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality,\n opts = options\n)\n\nall(near(w_efficient, w_efficient_numerical$solution))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\nThe result above shows that the numerical procedure indeed recovered the optimal weights for a scenario where we already know the analytic solution. For more complex optimization routines, [R's optimization task view](https://cran.r-project.org/web/views/Optimization.html) provides an overview of the vast optimization landscape. \\index{Optimization}\n\nNext, we approach problems where no analytical solutions exist. First, we additionally impose short-sale constraints, which implies $N$ inequality constraints of the form $\\omega_i \\geq 0$. 
We can implement the short sale constraints by imposing a vector of lower bounds `lb = rep(0, n_industries)`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nw_no_short_sale <- nloptr(\n x0 = w_initial, \n eval_f = objective_efficient, \n eval_grad_f = gradient_efficient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality,\n lb = rep(0, n_industries),\n opts = options\n)\n\nround(w_no_short_sale$solution, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 0.610 0.000 0.000 0.211 0.000 0.000 0.000 0.179 0.000 0.000\n```\n\n\n:::\n:::\n\n\nAs expected, the resulting portfolio weights are all positive (up to numerical precision). Typically, the holdings in the presence of short-sale constraints are concentrated among way fewer assets than for the unrestricted case. \nYou can verify that `sum(w_no_short_sale$solution)` returns 1. In other words: `nloptr()` provides the numerical solution to a portfolio choice problem for a mean-variance investor with risk aversion `gamma = 2`, where negative holdings are forbidden. \n\n`nloptr()` can also handle more complex problems. As an example, we show how to compute optimal weights, subject to the so-called [Regulation T-constraint,](https://en.wikipedia.org/wiki/Regulation_T) which requires that the sum of all absolute portfolio weights is smaller than 1.5, that is $\\sum_{i=1}^N |\\omega_i| \\leq 1.5$. \nThe constraint enforces that a maximum of 50 percent of the allocated wealth can be allocated to short positions, thus implying an initial margin requirement of 50 percent. Imposing such a margin requirement reduces portfolio risks because extreme portfolio weights are not attainable anymore. The implementation of Regulation-T rules is numerically interesting because the margin constraints imply a non-linear constraint on the portfolio weights. 
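Specifically, the constraint $\\sum_{i=1}^N |\\omega_i| \\leq 1.5$ is not differentiable at $\\omega_i = 0$; away from zero, its Jacobian is simply the vector of signs of the portfolio weights, which is what we supply to the optimizer below. As a minimal sanity check, the following sketch compares this analytical Jacobian with a central finite-difference approximation, evaluated for illustration at the minimum variance weights `w_mvp`; the step size `eps` is an arbitrary choice.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Minimal sanity check: finite-difference Jacobian of sum(abs(w))\n# Assumes all entries of w_mvp are bounded away from zero (|w_i| >> eps)\neps <- 1e-6\n\nfinite_difference_jacobian <- map_dbl(\n seq_along(w_mvp),\n function(i) {\n w_up <- w_down <- w_mvp\n w_up[i] <- w_up[i] + eps\n w_down[i] <- w_down[i] - eps\n (sum(abs(w_up)) - sum(abs(w_down))) / (2 * eps)\n }\n)\n\nall(near(finite_difference_jacobian, sign(w_mvp), tol = 1e-4))\n```\n:::\n\n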
\\index{Regulation T}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreg_t <- 1.5\n\ninequality_constraint <- function(w) {\n sum(abs(w)) - reg_t\n}\n\njacobian_inequality <- function(w) {\n sign(w)\n}\n\nobjective_reg_t <- function(w) {\n - t(w) %*% (1 + mu) +\n 2 * 0.5 * t(w) %*% Sigma %*% w\n}\n\ngradient_reg_t <- function(w) {\n - (1 + mu) + 2 * Sigma %*% w\n}\n\nw_reg_t <- nloptr(\n x0 = w_initial,\n eval_f = objective_reg_t, \n eval_grad_f = gradient_reg_t,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality, \n eval_g_ineq = inequality_constraint, \n eval_jac_g_ineq = jacobian_inequality,\n opts = options\n)\n\nround(w_reg_t$solution, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 0.736 0.000 -0.135 0.264 0.000 -0.019 0.028 0.223 0.000\n[10] -0.096\n```\n\n\n:::\n:::\n\n\n@fig-1702 shows the optimal allocation weights across all 10 industries for the four different strategies considered so far: minimum variance, efficient portfolio with $\\gamma$ = 2, efficient portfolio with short-sale constraints, and the Regulation-T constrained portfolio.\\index{Graph!Bar chart}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n `No short-sale` = w_no_short_sale$solution,\n `Minimum Variance` = w_mvp,\n `Efficient portfolio` = compute_efficient_weight(Sigma, mu),\n `Regulation-T` = w_reg_t$solution,\n Industry = colnames(industry_returns)\n) |>\n pivot_longer(-Industry,\n names_to = \"Strategy\",\n values_to = \"weights\"\n ) |>\n ggplot(aes(\n fill = Strategy,\n y = weights,\n x = Industry\n )) +\n geom_bar(position = \"dodge\", stat = \"identity\") +\n coord_flip() +\n labs(\n y = \"Allocation weight\", fill = NULL,\n title = \"Optimal allocations for different strategies\"\n ) +\n scale_y_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![Optimal allocation weights for the 10 industry portfolios and the 4 different allocation strategies.](constrained-optimization-and-backtesting_files/figure-html/fig-1702-1.png){#fig-1702 fig-alt='Title: Optimal allocations for different strategies. The figure shows the portfolio weights for the four different strategies across the 10 different industries. The figures indicate extreme long and short positions for the efficient portfolio.' width=2100}\n:::\n:::\n\n\nThe results clearly indicate the effect of imposing additional constraints: the extreme holdings the investor implements if she follows the (theoretically optimal) efficient portfolio vanish under, e.g., the Regulation-T constraint.\nYou may wonder why an investor would deviate from what is theoretically the optimal portfolio by imposing potentially arbitrary constraints. \nThe short answer is: the *efficient portfolio* is only efficient if the true parameters of the data generating process correspond to the estimated parameters $\\hat\\Sigma$ and $\\hat\\mu$. \nEstimation uncertainty may thus lead to inefficient allocations. By imposing restrictions, we implicitly shrink the set of possible weights and prevent extreme allocations, which could result from *error-maximization* due to estimation uncertainty [@Jagannathan2003].\n\nBefore we move on, we want to propose a final allocation strategy, which reflects a somewhat more realistic structure of transaction costs instead of the quadratic specification used above. The function below computes efficient portfolio weights while adjusting for transaction costs of the form $\\beta\\sum_{i=1}^N |(\\omega_{i, t+1} - \\omega_{i, t^+})|$. 
No closed-form solution exists, and we rely on non-linear optimization procedures.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_efficient_weight_L1_TC <- function(mu,\n Sigma,\n gamma,\n beta,\n initial_weights) {\n objective <- function(w) {\n -t(w) %*% mu +\n gamma / 2 * t(w) %*% Sigma %*% w +\n (beta / 10000) / 2 * sum(abs(w - initial_weights))\n }\n \n gradient <- function(w) {\n -mu + gamma * Sigma %*% w + \n (beta / 10000) * 0.5 * sign(w - initial_weights)\n }\n\n w_optimal <- nloptr(\n x0 = initial_weights,\n eval_f = objective, \n eval_grad_f = gradient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality, \n opts = options\n )\n\n return(w_optimal$solution)\n}\n```\n:::\n\n\n## Out-of-Sample Backtesting\n\nFor the sake of simplicity, we committed one fundamental error in computing portfolio weights above: we used the full sample of the data to determine the optimal allocation [@Harvey2019]. To implement this strategy at the beginning of the 2000s, you will need to know how the returns will evolve until 2021. \\index{Backtesting} \\index{Performance evaluation}\\index{Out-of-sample}\nWhile interesting from a methodological point of view, we cannot evaluate the performance of the portfolios in a reasonable out-of-sample fashion. We do so next in a backtesting application for three strategies. For the backtest, we recompute optimal weights just based on past available data. \n\nThe few lines below define the general setup. We consider 120 periods from the past to update the parameter estimates before recomputing portfolio weights. Then, we update portfolio weights which is costly and affects the performance. The portfolio weights determine the portfolio return. A period later, the current portfolio weights have changed and form the foundation for transaction costs incurred in the next period. We consider three different competing strategies: the mean-variance efficient portfolio, the mean-variance efficient portfolio with ex-ante adjustment for transaction costs, and the naive portfolio, which allocates wealth equally across the different assets.\\index{Transaction cost}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwindow_length <- 120\nperiods <- nrow(industry_returns) - window_length\n\nbeta <- 50\ngamma <- 2\n\nperformance_values <- matrix(NA,\n nrow = periods,\n ncol = 3\n)\ncolnames(performance_values) <- c(\"raw_return\", \"turnover\", \"net_return\")\n\nperformance_values <- list(\n \"MV (TC)\" = performance_values,\n \"Naive\" = performance_values,\n \"MV\" = performance_values\n)\n\nw_prev_1 <- w_prev_2 <- w_prev_3 <- rep(\n 1 / n_industries,\n n_industries\n)\n```\n:::\n\n\nWe also define two helper functions: one to adjust the weights due to returns and one for performance evaluation, where we compute realized returns net of transaction costs. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nadjust_weights <- function(w, next_return) {\n w_prev <- 1 + w * next_return\n as.numeric(w_prev / sum(as.vector(w_prev)))\n}\n\nevaluate_performance <- function(w, w_previous, next_return, beta = 50) {\n raw_return <- as.matrix(next_return) %*% w\n turnover <- sum(abs(w - w_previous))\n net_return <- raw_return - beta / 10000 * turnover\n c(raw_return, turnover, net_return)\n}\n```\n:::\n\n\nThe following code chunk performs a rolling-window estimation, which we implement in a loop. In each period, the estimation window contains the returns available up to the current period. 
\nNote that we use the sample variance-covariance matrix and ignore the estimation of $\\hat\\mu$ entirely, but you might use more advanced estimators in practice. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (p in 1:periods) {\n returns_window <- industry_returns[p:(p + window_length - 1), ]\n next_return <- industry_returns[p + window_length, ] |> as.matrix()\n\n Sigma <- cov(returns_window)\n mu <- 0 * colMeans(returns_window)\n\n # Transaction-cost adjusted portfolio\n w_1 <- compute_efficient_weight_L1_TC(\n mu = mu,\n Sigma = Sigma,\n beta = beta,\n gamma = gamma,\n initial_weights = w_prev_1\n )\n\n performance_values[[1]][p, ] <- evaluate_performance(w_1,\n w_prev_1,\n next_return,\n beta = beta\n )\n\n w_prev_1 <- adjust_weights(w_1, next_return)\n\n # Naive portfolio\n w_2 <- rep(1 / n_industries, n_industries)\n\n performance_values[[2]][p, ] <- evaluate_performance(\n w_2,\n w_prev_2,\n next_return\n )\n\n w_prev_2 <- adjust_weights(w_2, next_return)\n\n # Mean-variance efficient portfolio (w/o transaction costs)\n w_3 <- compute_efficient_weight(\n Sigma = Sigma,\n mu = mu,\n gamma = gamma\n )\n\n performance_values[[3]][p, ] <- evaluate_performance(\n w_3,\n w_prev_3,\n next_return\n )\n\n w_prev_3 <- adjust_weights(w_3, next_return)\n}\n```\n:::\n\n\nFinally, we get to the evaluation of the portfolio strategies *net-of-transaction costs*. Note that we compute annualized returns and standard deviations. \\index{Sharpe Ratio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nperformance <- lapply(\n performance_values,\n as_tibble\n) |>\n bind_rows(.id = \"strategy\")\n\nlength_year <- 12\n\nperformance_table <- performance |>\n group_by(Strategy = strategy) |>\n summarize(\n Mean = length_year * mean(100 * net_return),\n SD = sqrt(length_year) * sd(100 * net_return),\n `Sharpe ratio` = if_else(Mean > 0,\n Mean / SD,\n NA_real_\n ),\n Turnover = 100 * mean(turnover)\n )\n\nperformance_table |> \n mutate(across(-Strategy, ~round(., 4)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 5\n Strategy Mean SD `Sharpe ratio` Turnover\n \n1 MV -0.899 12.6 NA 211. \n2 MV (TC) 11.9 15.2 0.780 0 \n3 Naive 11.8 15.2 0.779 0.234\n```\n\n\n:::\n:::\n\n\nThe results clearly speak against mean-variance optimization. Turnover is huge when the investor only considers her portfolio's expected return and variance. Effectively, the mean-variance portfolio generates a *negative* annualized return after adjusting for transaction costs. At the same time, the naive portfolio turns out to perform very well. In fact, the performance gains of the transaction-cost adjusted mean-variance portfolio are small. The out-of-sample Sharpe ratio is slightly higher than for the naive portfolio. Note the extreme effect of turnover penalization on turnover: *MV (TC)* effectively resembles a buy-and-hold strategy which only updates the portfolio once the estimated parameters $\\hat\\mu_t$ and $\\hat\\Sigma_t$indicate that the current allocation is too far away from the optimal theoretical portfolio. \n\n## Exercises\n\n1. Consider the portfolio choice problem for transaction-cost adjusted certainty equivalent maximization with risk aversion parameter $\\gamma$ \n$$\\omega_{t+1} ^* = \\arg\\max_{\\omega \\in \\mathbb{R}^N, \\iota'\\omega = 1} \\omega'\\mu - \\nu_t (\\omega, \\beta) - \\frac{\\gamma}{2}\\omega'\\Sigma\\omega$$\nwhere $\\Sigma$ and $\\mu$ are (estimators of) the variance-covariance matrix of the returns and the vector of expected returns. 
Assume for now that transaction costs are quadratic in rebalancing *and* proportional to stock illiquidity such that \n$$\\nu_t\\left(\\omega, B\\right) = \\frac{\\beta}{2} \\left(\\omega - \\omega_{t^+}\\right)'B\\left(\\omega - \\omega_{t^+}\\right)$$ where $B = \\text{diag}(ill_1, \\ldots, ill_N)$ is a diagonal matrix, where $ill_1, \\ldots, ill_N$. Derive a closed-form solution for the mean-variance efficient portfolio $\\omega_{t+1} ^*$ based on the transaction cost specification above. Discuss the effect of illiquidity $ill_i$ on the individual portfolio weights relative to an investor that myopically ignores transaction costs in her decision. \n1. Use the solution from the previous exercise to update the function `compute_efficient_weight()` such that you can compute optimal weights conditional on a matrix $B$ with illiquidity measures. \n1. Illustrate the evolution of the *optimal* weights from the naive portfolio to the efficient portfolio in the mean-standard deviation diagram.\n1. Is it always optimal to choose the same $\\beta$ in the optimization problem than the value used in evaluating the portfolio performance? In other words: can it be optimal to choose theoretically sub-optimal portfolios based on transaction cost considerations that do not reflect the actual incurred costs? Evaluate the out-of-sample Sharpe ratio after transaction costs for a range of different values of imposed $\\beta$ values.", "supporting": [ "constrained-optimization-and-backtesting_files" ], diff --git a/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1701-1.png b/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1701-1.png index c55fe9bf..abfb4985 100644 Binary files a/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1701-1.png and b/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1701-1.png differ diff --git a/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1702-1.png b/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1702-1.png index 29394693..dfc14b7b 100644 Binary files a/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1702-1.png and b/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1702-1.png differ diff --git a/_freeze/r/cover-and-logo-design/execute-results/html.json b/_freeze/r/cover-and-logo-design/execute-results/html.json index 44be99cf..b691ff40 100644 --- a/_freeze/r/cover-and-logo-design/execute-results/html.json +++ b/_freeze/r/cover-and-logo-design/execute-results/html.json @@ -3,9 +3,7 @@ "result": { "engine": "knitr", "markdown": "---\ntitle: Cover and Logo Design\naliases:\n - ../cover-and-logo-design.html\npre-render:\n - pre_render_script.R\n---\n\n\nThe cover of the book is inspired by the fast growing generative art community in R.\\index{Generative art}\nGenerative art refers to art that in whole or in part has been created with the use of an autonomous system. \nInstead of creating random dynamics we rely on what is core to the book: The evolution of financial markets. \nEach circle in the cover figure corresponds to daily market return within one year of our sample. Deviations from the circle line indicate positive or negative returns. \nThe colors are determined by the standard deviation of market returns during the particular year. \nThe few lines of code below replicate the entire figure. 
\nWe use the Wes Andersen color palette (also throughout the entire book), provided by the package `wesanderson` [@wesanderson]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(wesanderson)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_ff3_daily <- tbl(\n tidy_finance,\n \"factors_ff3_daily\"\n) |>\n collect()\n\ndata_plot <- factors_ff3_daily |>\n select(date, mkt_excess) |>\n group_by(year = floor_date(date, \"year\")) |>\n mutate(group_id = cur_group_id())\n\ndata_plot <- data_plot |>\n group_by(group_id) |>\n mutate(\n day = 2 * pi * (1:n()) / 252,\n ymin = pmin(1 + mkt_excess, 1),\n ymax = pmax(1 + mkt_excess, 1),\n vola = sd(mkt_excess)\n ) |>\n filter(year >= \"1962-01-01\" & year <= \"2021-12-31\")\n\nlevels <- data_plot |>\n distinct(group_id, vola) |>\n arrange(vola) |>\n pull(vola)\n\ncp <- coord_polar(\n direction = -1,\n clip = \"on\"\n)\n\ncp$is_free <- function() TRUE\ncolors <- wes_palette(\"Zissou1\",\n n_groups(data_plot),\n type = \"continuous\"\n)\n\ncover <- data_plot |>\n mutate(vola = factor(vola, levels = levels)) |>\n ggplot(aes(\n x = day,\n y = mkt_excess,\n group = group_id,\n fill = vola\n )) +\n cp +\n geom_ribbon(aes(\n ymin = ymin,\n ymax = ymax,\n fill = vola\n ), alpha = 0.90) +\n theme_void() +\n facet_wrap(~group_id,\n ncol = 10,\n scales = \"free\"\n ) +\n theme(\n strip.text.x = element_blank(),\n legend.position = \"None\",\n panel.spacing = unit(-5, \"lines\")\n ) +\n scale_fill_manual(values = colors)\n\nggsave(\n plot = cover,\n width = 10,\n height = 6,\n filename = \"images/cover.png\",\n bg = \"white\"\n)\n```\n:::\n\n\nTo generate our logo, we focus on year 2021 - the end of the sample period at the time we published tidy-finance.org for the first time. 
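\n\nBecause the plotting data only carries a running `group_id`, it can be helpful to check which identifier corresponds to which calendar year before filtering. The lines below are a small, purely illustrative sketch of such a lookup (they are not required to reproduce any of the figures):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Map the running group identifier to the calendar years used below\ndata_plot |>\n  ungroup() |>\n  distinct(year, group_id) |>\n  filter(year(year) %in% c(1991, 2000, 2021))\n```\n:::\n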
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogo <- data_plot |>\n ungroup() |> \n filter(year == \"2021-01-01\") |> \n mutate(vola = factor(vola, levels = levels)) |>\n ggplot(aes(\n x = day,\n y = mkt_excess,\n fill = vola\n )) +\n cp +\n geom_ribbon(aes(\n ymin = ymin,\n ymax = ymax,\n fill = vola\n ), alpha = 0.90) +\n theme_void() +\n theme(\n strip.text.x = element_blank(),\n legend.position = \"None\",\n plot.margin = unit(c(-0.15,-0.15,-0.15,-0.15), \"null\")\n ) +\n scale_fill_manual(values = \"white\") \n\nggsave(\n plot = logo,\n width = 840,\n height = 840,\n units = \"px\",\n filename = \"images/logo-website-white.png\",\n)\n\nggsave(\n plot = logo +\n scale_fill_manual(values = wes_palette(\"Zissou1\")[1]), \n width = 840,\n height = 840,\n units = \"px\",\n filename = \"images/logo-website.png\",\n)\n```\n:::\n\n\nHere is the code to generate the vector graphics for our buttons.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbutton_r <- data_plot |>\n ungroup() |> \n filter(year == \"2000-01-01\") |> \n mutate(vola = factor(vola, levels = levels)) |>\n ggplot(aes(\n x = day,\n y = mkt_excess,\n fill = vola\n )) +\n cp +\n geom_ribbon(aes(\n ymin = ymin,\n ymax = ymax,\n fill = vola\n ), alpha = 0.90) +\n theme_void() +\n theme(\n strip.text.x = element_blank(),\n legend.position = \"None\",\n plot.margin = unit(c(-0.15,-0.15,-0.15,-0.15), \"null\")\n ) \n\nggsave(\n plot = button_r +\n scale_fill_manual(values = wes_palette(\"Zissou1\")[1]), \n width = 100,\n height = 100,\n units = \"px\",\n filename = \"images/button-r-blue.svg\",\n)\n\nggsave(\n plot = button_r +\n scale_fill_manual(values = wes_palette(\"Zissou1\")[4]), \n width = 100,\n height = 100,\n units = \"px\",\n filename = \"images/button-r-orange.svg\",\n)\n\nbutton_python <- data_plot |>\n ungroup() |> \n filter(year == \"1991-01-01\") |> \n mutate(vola = factor(vola, levels = levels)) |>\n ggplot(aes(\n x = day,\n y = mkt_excess,\n fill = vola\n )) +\n cp +\n geom_ribbon(aes(\n ymin = ymin,\n ymax = ymax,\n fill = vola\n ), alpha = 0.90) +\n theme_void() +\n theme(\n strip.text.x = element_blank(),\n legend.position = \"None\",\n plot.margin = unit(c(-0.15,-0.15,-0.15,-0.15), \"null\")\n ) \n\nggsave(\n plot = button_python +\n scale_fill_manual(values = wes_palette(\"Zissou1\")[1]), \n width = 100,\n height = 100,\n units = \"px\",\n filename = \"images/button-python-blue.svg\",\n)\n\nggsave(\n plot = button_python +\n scale_fill_manual(values = wes_palette(\"Zissou1\")[4]), \n width = 100,\n height = 100,\n units = \"px\",\n filename = \"images/button-python-orange.svg\",\n)\n```\n:::\n", - "supporting": [ - "cover-and-logo-design_files" - ], + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/r/difference-in-differences/execute-results/html.json b/_freeze/r/difference-in-differences/execute-results/html.json index 5508df1f..789d17a6 100644 --- a/_freeze/r/difference-in-differences/execute-results/html.json +++ b/_freeze/r/difference-in-differences/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "ba151815ca57c6906c62e796919e5f8e", + "hash": "b0d9fb5aefc9d29ad7b7a9f2db4f05d1", "result": { "engine": "knitr", - "markdown": "---\ntitle: Difference in Differences\naliases:\n - ../difference-in-differences.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Difference in Differences with R\n description-meta: Perform difference-in-difference estimation and analyze parallel trends using the programming language R. 
\n---\n\n\nIn this chapter, we illustrate the concept of *difference in differences* (DD) estimators by evaluating the effects of climate change regulation on the pricing of bonds across firms. DD estimators are typically used to recover the treatment effects of natural or quasi-natural experiments that trigger sharp changes in the environment of a specific group. Instead of looking at differences in just one group (e.g., the effect in the treated group), DD investigates the treatment effects by looking at the difference between differences in two groups. Such experiments are usually exploited to address endogeneity concerns [e.g., @RobertsWhited2013]. The identifying assumption is that the outcome variable would change equally in both groups without the treatment. This assumption is also often referred to as the assumption of parallel trends. Moreover, we would ideally also want a random assignment to the treatment and control groups. Due to lobbying or other activities, this randomness is often violated in (financial) economics.\\index{Difference in differences}\n\nIn the context of our setting, we investigate the impact of the Paris Agreement (PA), signed on December 12, 2015, on the bond yields of polluting firms. We first estimate the treatment effect of the agreement using panel regression techniques that we discuss in [Fixed Effects and Clustered Standard Errors](fixed-effects-and-clustered-standard-errors.qmd). We then present two methods to illustrate the treatment effect over time graphically. Although we demonstrate that the treatment effect of the agreement is anticipated by bond market participants well in advance, the techniques we present below can also be applied to many other settings.\\index{Paris (Climate) Agreement}\n\nThe approach we use here replicates the results of @Seltzer2022 partly. Specifically, we borrow their industry definitions for grouping firms into green and brown types. Overall, the literature on ESG effects in corporate bond markets is already large but continues to grow (for recent examples, see, e.g., @Halling2021, @Handler2022, @Huynh2021, among many others).\n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(fixest)\nlibrary(broom)\n```\n:::\n\n\n## Data Preparation\n\nWe use TRACE and Mergent FISD as data sources from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [TRACE and FISD](trace-and-fisd.qmd). \\index{Data!TRACE}\\index{Data!FISD}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfisd <- tbl(tidy_finance, \"fisd\") |>\n select(complete_cusip, maturity, offering_amt, sic_code) |>\n collect() |> \n drop_na()\n\ntrace_enhanced <- tbl(tidy_finance, \"trace_enhanced\") |>\n select(cusip_id, trd_exctn_dt, rptd_pr, entrd_vol_qt, yld_pt)|>\n collect() |> \n drop_na()\n```\n:::\n\n\nWe start our analysis by preparing the sample of bonds. We only consider bonds with a time to maturity of more than one year to the signing of the PA, so that we have sufficient data to analyze the yield behavior after the treatment date. This restriction also excludes all bonds issued after the agreement. 
We also consider only the first two digits of the SIC industry code to identify the polluting industries [in line with @Seltzer2022].\\index{Time to maturity}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntreatment_date <- ymd(\"2015-12-12\")\n\npolluting_industries <- c(\n 49, 13, 45, 29, 28, 33, 40, 20,\n 26, 42, 10, 53, 32, 99, 37\n)\n\nbonds <- fisd |>\n filter(offering_amt > 0) |> \n mutate(\n time_to_maturity = as.numeric(maturity - treatment_date) / 365,\n sic_code = as.integer(substr(sic_code, 1, 2)),\n log_offering_amt = log(offering_amt)\n ) |>\n filter(time_to_maturity >= 1) |>\n select(\n cusip_id = complete_cusip,\n time_to_maturity, log_offering_amt, sic_code\n ) |>\n mutate(polluter = sic_code %in% polluting_industries)\n```\n:::\n\n\nNext, we aggregate the individual transactions as reported in TRACE to a monthly panel of bond yields. We consider bond yields for a bond's last trading day in a month. Therefore, we first aggregate bond data to daily frequency and apply common restrictions from the literature [see, e.g., @Bessembinder2008]. We weigh each transaction by volume to reflect a trade's relative importance and avoid emphasizing small trades. Moreover, we only consider transactions with reported prices `rptd_pr` larger than 25 (to exclude bonds that are close to default) and only bond-day observations with more than five trades on a corresponding day (to exclude prices based on too few, potentially non-representative transactions).\\index{Yield aggregation} \\index{Returns!Bonds}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_aggregated <- trace_enhanced |>\n filter(rptd_pr > 25) |>\n group_by(cusip_id, trd_exctn_dt) |>\n summarize(\n avg_yield = weighted.mean(yld_pt, entrd_vol_qt * rptd_pr),\n trades = n(),\n .groups = \"drop\"\n ) |>\n drop_na(avg_yield) |>\n filter(trades >= 5) |>\n mutate(month = floor_date(trd_exctn_dt, \"months\")) |>\n group_by(cusip_id, month) |>\n slice_max(trd_exctn_dt) |>\n ungroup() |>\n select(cusip_id, month, avg_yield)\n```\n:::\n\n\nBy combining the bond-specific information from Mergent FISD for our bond sample with the aggregated TRACE data, we arrive at the main sample for our analysis.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel <- bonds |>\n inner_join(trace_aggregated, by = \"cusip_id\", multiple = \"all\") |>\n drop_na()\n```\n:::\n\n\nBefore we can run the first regression, we need to define the `treated` indicator, which is the product of the `post_period` (i.e., all months after the signing of the PA) and the `polluter` indicator defined above.\\index{Regression!Fixed effects} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel <- bonds_panel |>\n mutate(post_period = month >= floor_date(treatment_date, \"months\")) |>\n mutate(treated = polluter & post_period)\n```\n:::\n\n\nAs usual, we tabulate summary statistics of the variables that enter the regression to check the validity of our variable definitions.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel |>\n pivot_longer(\n cols = c(avg_yield, time_to_maturity, log_offering_amt),\n names_to = \"measure\"\n ) |>\n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n(),\n .groups = \"drop\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 9\n measure mean sd min q05 q50 q95 max n\n \n1 avg_yield 4.08 4.21 0.0595 1.27 3.38 8.10 128. 
127530\n2 log_offering_amt 13.3 0.823 4.64 12.2 13.2 14.5 16.5 127530\n3 time_to_maturity 8.55 8.41 1.01 1.50 5.81 27.4 101. 127530\n```\n\n\n:::\n:::\n\n\n## Panel Regressions\n\nThe PA is a legally binding international treaty on climate change. It was adopted by 196 Parties at COP 21 in Paris on 12 December 2015 and entered into force on 4 November 2016. The PA obliges developed countries to support efforts to build clean, climate-resilient futures. One may thus hypothesize that adopting climate-related policies may affect financial markets. To measure the magnitude of this effect, we first run an OLS regression without fixed effects where we include the `treated`, `post_period`, and `polluter` dummies, as well as the bond-specific characteristics `log_offering_amt` and `time_to_maturity`. This simple model assumes that there are essentially two periods (before and after the PA) and two groups (polluters and non-polluters). Nonetheless, it should indicate whether polluters have higher yields following the PA compared to non-polluters.\\index{Regression!Yields}\\index{Regression!Panel}\n\nThe second model follows the typical DD regression approach by including individual (`cusip_id`) and time (`month`) fixed effects. In this model, we do not include any other variables from the simple model because the fixed effects subsume them, and we observe the coefficient of our main variable of interest: `treated`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_without_fe <- feols(\n fml = avg_yield ~ treated + post_period + polluter +\n log_offering_amt + time_to_maturity,\n vcov = \"iid\",\n data = bonds_panel\n)\n\nmodel_with_fe <- feols(\n fml = avg_yield ~ treated | cusip_id + month,\n vcov = \"iid\",\n data = bonds_panel\n)\n\netable(\n model_without_fe, model_with_fe, \n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_without_fe model_with_fe\nDependent Var.: avg_yield avg_yield\n \nConstant 10.7*** (57.0) \ntreatedTRUE 0.462*** (9.31) 0.983*** (29.5)\npost_periodTRUE -0.174*** (-5.92) \npolluterTRUE 0.481*** (15.3) \nlog_offering_amt -0.551*** (-39.0) \ntime_to_maturity 0.058*** (41.6) \nFixed-Effects: ----------------- ---------------\ncusip_id No Yes\nmonth No Yes\n________________ _________________ _______________\nVCOV type IID IID\nObservations 127,530 127,530\nR2 0.032 0.647\nWithin R2 -- 0.007\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nBoth models indicate that polluters have significantly higher yields after the PA than non-polluting firms. Note that the magnitude of the `treated` coefficient varies considerably across models. \n\n## Visualizing Parallel Trends\n\nEven though the regressions above indicate that there is an impact of the PA on bond yields of polluters, the tables do not tell us anything about the dynamics of the treatment effect. In particular, the models provide no indication about whether the crucial *parallel trends* assumption is valid. This assumption requires that in the absence of treatment, the difference between the two groups is constant over time. Although there is no well-defined statistical test for this assumption, visual inspection typically provides a good indication.\\index{Parallel trends assumption}\n\nTo provide such visual evidence, we revisit the simple OLS model and replace the `treated` and `post_period` indicators with month dummies for each group. 
This approach estimates the average yield change of both groups for each period and provides corresponding confidence intervals. Plotting the coefficient estimates for both groups around the treatment date shows us the dynamics of our panel data.\\index{Graph!Diff-in-diff graph} \n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_without_fe_time <- feols(\n fml = avg_yield ~ polluter + month:polluter +\n time_to_maturity + log_offering_amt,\n vcov = \"iid\",\n data = bonds_panel |>\n mutate(month = factor(month))\n)\n\nmodel_without_fe_coefs <- tidy(model_without_fe_time) |>\n filter(str_detect(term, \"month\")) |>\n mutate(\n month = ymd(substr(term, nchar(term) - 9, nchar(term))),\n treatment = str_detect(term, \"TRUE\"),\n ci_up = estimate + qnorm(0.975) * std.error,\n ci_low = estimate + qnorm(0.025) * std.error\n )\n\nmodel_without_fe_coefs |>\n ggplot(aes(\n month, \n color = treatment,\n linetype = treatment,\n shape = treatment\n )) +\n geom_vline(aes(xintercept = floor_date(treatment_date, \"month\")),\n linetype = \"dashed\"\n ) +\n geom_hline(aes(yintercept = 0),\n linetype = \"dashed\"\n ) +\n geom_errorbar(aes(ymin = ci_low, ymax = ci_up),\n alpha = 0.5\n ) +\n guides(linetype = \"none\") + \n geom_point(aes(y = estimate)) +\n labs(\n x = NULL,\n y = \"Yield\",\n shape = \"Polluter?\",\n color = \"Polluter?\",\n title = \"Polluters respond stronger to Paris Agreement than green firms\"\n )\n```\n\n::: {.cell-output-display}\n![The figure shows the coefficient estimates and 95 percent confidence intervals for OLS regressions estimating the treatment effect of the Paris Climate Agreement on bond yields (in percent) for polluters and non-polluters. The horizontal line represents the benchmark yield of polluters before the Paris Agreement. The vertical line indicates the date of the agreement (December 12, 2015).](difference-in-differences_files/figure-html/fig-1301-1.png){#fig-1301 fig-alt='Title: Polluters respond stronger to Paris Agreement than green firms. The figure shows a sequence of monthly dots for two groups. Before the agreement, the dots mainly overlap. Ahead of the agreement, yields start to increase. Then, after the agreement, there is a strong divergence in yields. Polluters have significantly higher yields than non-polluters in the months before and after the signing of the Paris Agreement. However, this yield difference vanishes again towards the end of 2016.' width=2100}\n:::\n:::\n\n@fig-1301 shows that throughout most of 2014, the yields of the two groups changed in unison. However, starting at the end of 2014, the yields start to diverge, reaching the highest difference around the signing of the PA. Afterward, the yields for both groups fall again, and the polluters arrive at the same level as at the beginning of 2014. The non-polluters, on the other hand, even experience significantly lower yields than polluters after the signing of the agreement. \n\nInstead of plotting both groups using the simple model approach, we can also use the fixed-effects model and focus on the polluter's yield response to the signing relative to the non-polluters. To perform this estimation, we need to replace the `treated` indicator with separate time dummies for the polluters, each marking a one-month period relative to the treatment date. 
We then regress the monthly yields on the set of time dummies and `cusip_id` and `month` fixed effects.\\index{Graph!Diff-in-diff graph}\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel_alt <- bonds_panel |>\n mutate(\n diff_to_treatment = interval(\n floor_date(treatment_date, \"month\"), month\n ) %/% months(1)\n )\n\nvariables <- bonds_panel_alt |>\n distinct(diff_to_treatment, month) |>\n arrange(month) |>\n mutate(variable_name = as.character(NA))\n\nformula <- \"avg_yield ~ \"\n\nfor (j in 1:nrow(variables)) {\n if (variables$diff_to_treatment[j] != 0) {\n old_names <- names(bonds_panel_alt)\n bonds_panel_alt <- bonds_panel_alt |>\n mutate(new_var = diff_to_treatment == variables$diff_to_treatment[j] & \n polluter)\n new_var_name <- ifelse(variables$diff_to_treatment[j] < 0,\n str_c(\"lag\", abs(variables$diff_to_treatment[j])),\n str_c(\"lead\", variables$diff_to_treatment[j])\n )\n variables$variable_name[j] <- new_var_name\n names(bonds_panel_alt) <- c(old_names, new_var_name)\n formula <- str_c(\n formula,\n ifelse(j == 1,\n new_var_name,\n str_c(\"+\", new_var_name)\n )\n )\n }\n}\nformula <- str_c(formula, \"| cusip_id + month\")\n\nmodel_with_fe_time <- feols(\n fml = as.formula(formula),\n vcov = \"iid\",\n data = bonds_panel_alt\n)\n\nmodel_with_fe_time_coefs <- tidy(model_with_fe_time) |>\n mutate(\n term = str_remove(term, \"TRUE\"),\n ci_up = estimate + qnorm(0.975) * std.error,\n ci_low = estimate + qnorm(0.025) * std.error\n ) |>\n left_join(\n variables,\n by = c(\"term\" = \"variable_name\")\n ) |>\n bind_rows(tibble(\n term = \"lag0\",\n estimate = 0,\n ci_up = 0,\n ci_low = 0,\n month = floor_date(treatment_date, \"month\")\n ))\n\nmodel_with_fe_time_coefs |>\n ggplot(aes(x = month, y = estimate)) +\n geom_vline(aes(xintercept = floor_date(treatment_date, \"month\")),\n linetype = \"dashed\"\n ) +\n geom_hline(aes(yintercept = 0),\n linetype = \"dashed\"\n ) +\n geom_errorbar(aes(ymin = ci_low, ymax = ci_up),\n alpha = 0.5\n ) +\n geom_point(aes(y = estimate)) +\n labs(\n x = NULL,\n y = \"Yield\",\n title = \"Polluters' yield patterns around Paris Agreement signing\"\n )\n```\n\n::: {.cell-output-display}\n![The figure shows the coefficient estimates and 95 percent confidence intervals for OLS regressions estimating the treatment effect of the Paris Climate Agreement on bond yields (in percent) for polluters. The horizontal line represents the benchmark yield of polluters before the Paris Agreement. The vertical line indicates the date of the agreement (December 12, 2015).](difference-in-differences_files/figure-html/fig-1402-1.png){#fig-1402 fig-alt='Title: Polluters\\' yield patterns around Paris Agreement signing. The figure shows a sequence of monthly dots for the treated group. Ahead of the agreement, yields of polluters start to increase. Then, after the agreement, there is a small reversal and yields drop again.' width=2100}\n:::\n:::\n\n\n\\index{Robustness tests}\nThe resulting graph shown in @fig-1402 confirms the main conclusion of the previous image: polluters' yield patterns show a considerable anticipation effect starting toward the end of 2014. Yields only marginally increase after the signing of the agreement. However, as opposed to the simple model, we do not see a complete reversal back to the pre-agreement level. Yields of polluters stay at a significantly higher level even one year after the signing.\n\nNotice that during the year after the PA was signed, the 45th President of the United States was elected on November 8, 2016. 
During his campaign there were some indications of intentions to withdraw the US from the PA, which ultimately happened on November 4, 2020. Hence, reversal effects are potentially driven by these actions.\n\n## Exercises\n\n1. The 46th President of the US rejoined the Paris Agreement in February 2021. Repeat the difference in differences analysis for the day of his election victory. Note that you will also have to download new TRACE data. How did polluters' yields react to this action?\n2. Based on the exercise on ratings in [TRACE and FISD](trace-and-fisd.qmd), include ratings as a control variable in the analysis above. Do the results change?", + "markdown": "---\ntitle: Difference in Differences\naliases:\n - ../difference-in-differences.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Difference in Differences with R\n description-meta: Perform difference-in-difference estimation and analyze parallel trends using the programming language R. \n---\n\n\nIn this chapter, we illustrate the concept of *difference in differences* (DD) estimators by evaluating the effects of climate change regulation on the pricing of bonds across firms. DD estimators are typically used to recover the treatment effects of natural or quasi-natural experiments that trigger sharp changes in the environment of a specific group. Instead of looking at differences in just one group (e.g., the effect in the treated group), DD investigates the treatment effects by looking at the difference between differences in two groups. Such experiments are usually exploited to address endogeneity concerns [e.g., @RobertsWhited2013]. The identifying assumption is that the outcome variable would change equally in both groups without the treatment. This assumption is also often referred to as the assumption of parallel trends. Moreover, we would ideally also want a random assignment to the treatment and control groups. Due to lobbying or other activities, this randomness is often violated in (financial) economics.\\index{Difference in differences}\n\nIn the context of our setting, we investigate the impact of the Paris Agreement (PA), signed on December 12, 2015, on the bond yields of polluting firms. We first estimate the treatment effect of the agreement using panel regression techniques that we discuss in [Fixed Effects and Clustered Standard Errors](fixed-effects-and-clustered-standard-errors.qmd). We then present two methods to illustrate the treatment effect over time graphically. Although we demonstrate that the treatment effect of the agreement is anticipated by bond market participants well in advance, the techniques we present below can also be applied to many other settings.\\index{Paris (Climate) Agreement}\n\nThe approach we use here replicates the results of @Seltzer2022 partly. Specifically, we borrow their industry definitions for grouping firms into green and brown types. Overall, the literature on ESG effects in corporate bond markets is already large but continues to grow (for recent examples, see, e.g., @Halling2021, @Handler2022, @Huynh2021, among many others).\n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(fixest)\nlibrary(broom)\n```\n:::\n\n\n## Data Preparation\n\nWe use TRACE and Mergent FISD as data sources from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [TRACE and FISD](trace-and-fisd.qmd). 
\\index{Data!TRACE}\\index{Data!FISD}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfisd <- tbl(tidy_finance, \"fisd\") |>\n select(complete_cusip, maturity, offering_amt, sic_code) |>\n collect() |> \n drop_na()\n\ntrace_enhanced <- tbl(tidy_finance, \"trace_enhanced\") |>\n select(cusip_id, trd_exctn_dt, rptd_pr, entrd_vol_qt, yld_pt)|>\n collect() |> \n drop_na()\n```\n:::\n\n\nWe start our analysis by preparing the sample of bonds. We only consider bonds with a time to maturity of more than one year to the signing of the PA, so that we have sufficient data to analyze the yield behavior after the treatment date. This restriction also excludes all bonds issued after the agreement. We also consider only the first two digits of the SIC industry code to identify the polluting industries [in line with @Seltzer2022].\\index{Time to maturity}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntreatment_date <- ymd(\"2015-12-12\")\n\npolluting_industries <- c(\n 49, 13, 45, 29, 28, 33, 40, 20,\n 26, 42, 10, 53, 32, 99, 37\n)\n\nbonds <- fisd |>\n filter(offering_amt > 0) |> \n mutate(\n time_to_maturity = as.numeric(maturity - treatment_date) / 365,\n sic_code = as.integer(substr(sic_code, 1, 2)),\n log_offering_amt = log(offering_amt)\n ) |>\n filter(time_to_maturity >= 1) |>\n select(\n cusip_id = complete_cusip,\n time_to_maturity, log_offering_amt, sic_code\n ) |>\n mutate(polluter = sic_code %in% polluting_industries)\n```\n:::\n\n\nNext, we aggregate the individual transactions as reported in TRACE to a monthly panel of bond yields. We consider bond yields for a bond's last trading day in a month. Therefore, we first aggregate bond data to daily frequency and apply common restrictions from the literature [see, e.g., @Bessembinder2008]. We weigh each transaction by volume to reflect a trade's relative importance and avoid emphasizing small trades. 
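\n\nMore explicitly, the daily yield that we record for bond $i$ on day $d$ is the value-weighted average over that day's transactions $j$ (written in our notation):\n$$\\bar{y}_{i,d} = \\frac{\\sum_j \\text{yld\\_pt}_j \\times \\text{entrd\\_vol\\_qt}_j \\times \\text{rptd\\_pr}_j}{\\sum_j \\text{entrd\\_vol\\_qt}_j \\times \\text{rptd\\_pr}_j},$$\nwhich is what `weighted.mean()` with weights `entrd_vol_qt * rptd_pr` computes below.\n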
Moreover, we only consider transactions with reported prices `rptd_pr` larger than 25 (to exclude bonds that are close to default) and only bond-day observations with more than five trades on a corresponding day (to exclude prices based on too few, potentially non-representative transactions).\\index{Yield aggregation} \\index{Returns!Bonds}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_aggregated <- trace_enhanced |>\n filter(rptd_pr > 25) |>\n group_by(cusip_id, trd_exctn_dt) |>\n summarize(\n avg_yield = weighted.mean(yld_pt, entrd_vol_qt * rptd_pr),\n trades = n(),\n .groups = \"drop\"\n ) |>\n drop_na(avg_yield) |>\n filter(trades >= 5) |>\n mutate(month = floor_date(trd_exctn_dt, \"months\")) |>\n group_by(cusip_id, month) |>\n slice_max(trd_exctn_dt) |>\n ungroup() |>\n select(cusip_id, month, avg_yield)\n```\n:::\n\n\nBy combining the bond-specific information from Mergent FISD for our bond sample with the aggregated TRACE data, we arrive at the main sample for our analysis.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel <- bonds |>\n inner_join(trace_aggregated, join_by(cusip_id), multiple = \"all\") |>\n drop_na()\n```\n:::\n\n\nBefore we can run the first regression, we need to define the `treated` indicator, which is the product of the `post_period` (i.e., all months after the signing of the PA) and the `polluter` indicator defined above.\\index{Regression!Fixed effects} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel <- bonds_panel |>\n mutate(post_period = month >= floor_date(treatment_date, \"months\")) |>\n mutate(treated = polluter & post_period)\n```\n:::\n\n\nAs usual, we tabulate summary statistics of the variables that enter the regression to check the validity of our variable definitions.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel |>\n pivot_longer(\n cols = c(avg_yield, time_to_maturity, log_offering_amt),\n names_to = \"measure\"\n ) |>\n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n(),\n .groups = \"drop\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 9\n measure mean sd min q05 q50 q95 max n\n \n1 avg_yield 4.08 4.21 0.0595 1.27 3.38 8.10 128. 127530\n2 log_offering_amt 13.3 0.823 4.64 12.2 13.2 14.5 16.5 127530\n3 time_to_maturity 8.55 8.41 1.01 1.50 5.81 27.4 101. 127530\n```\n\n\n:::\n:::\n\n\n## Panel Regressions\n\nThe PA is a legally binding international treaty on climate change. It was adopted by 196 Parties at COP 21 in Paris on 12 December 2015 and entered into force on 4 November 2016. The PA obliges developed countries to support efforts to build clean, climate-resilient futures. One may thus hypothesize that adopting climate-related policies may affect financial markets. To measure the magnitude of this effect, we first run an OLS regression without fixed effects where we include the `treated`, `post_period`, and `polluter` dummies, as well as the bond-specific characteristics `log_offering_amt` and `time_to_maturity`. This simple model assumes that there are essentially two periods (before and after the PA) and two groups (polluters and non-polluters). 
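\n\nIn equation form, this first specification corresponds to the textbook two-group, two-period difference-in-differences regression (written in our notation):\n$$\\text{avg\\_yield}_{i,t} = \\beta_0 + \\beta_1 \\text{polluter}_i + \\beta_2 \\text{post\\_period}_t + \\delta \\cdot \\text{treated}_{i,t} + \\gamma_1 \\text{log\\_offering\\_amt}_i + \\gamma_2 \\text{time\\_to\\_maturity}_i + \\varepsilon_{i,t},$$\nwhere $\\text{treated}_{i,t} = \\text{polluter}_i \\times \\text{post\\_period}_t$ and $\\delta$ is the difference-in-differences estimate of interest.\n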
Nonetheless, it should indicate whether polluters have higher yields following the PA compared to non-polluters.\\index{Regression!Yields}\\index{Regression!Panel}\n\nThe second model follows the typical DD regression approach by including individual (`cusip_id`) and time (`month`) fixed effects. In this model, we do not include any other variables from the simple model because the fixed effects subsume them, and we observe the coefficient of our main variable of interest: `treated`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_without_fe <- feols(\n fml = avg_yield ~ treated + post_period + polluter +\n log_offering_amt + time_to_maturity,\n vcov = \"iid\",\n data = bonds_panel\n)\n\nmodel_with_fe <- feols(\n fml = avg_yield ~ treated | cusip_id + month,\n vcov = \"iid\",\n data = bonds_panel\n)\n\netable(\n model_without_fe, model_with_fe, \n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_without_fe model_with_fe\nDependent Var.: avg_yield avg_yield\n \nConstant 10.7*** (57.0) \ntreatedTRUE 0.462*** (9.31) 0.983*** (29.5)\npost_periodTRUE -0.174*** (-5.92) \npolluterTRUE 0.481*** (15.3) \nlog_offering_amt -0.551*** (-39.0) \ntime_to_maturity 0.058*** (41.6) \nFixed-Effects: ----------------- ---------------\ncusip_id No Yes\nmonth No Yes\n________________ _________________ _______________\nVCOV type IID IID\nObservations 127,530 127,530\nR2 0.032 0.647\nWithin R2 -- 0.007\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nBoth models indicate that polluters have significantly higher yields after the PA than non-polluting firms. Note that the magnitude of the `treated` coefficient varies considerably across models. \n\n## Visualizing Parallel Trends\n\nEven though the regressions above indicate that there is an impact of the PA on bond yields of polluters, the tables do not tell us anything about the dynamics of the treatment effect. In particular, the models provide no indication about whether the crucial *parallel trends* assumption is valid. This assumption requires that in the absence of treatment, the difference between the two groups is constant over time. Although there is no well-defined statistical test for this assumption, visual inspection typically provides a good indication.\\index{Parallel trends assumption}\n\nTo provide such visual evidence, we revisit the simple OLS model and replace the `treated` and `post_period` indicators with month dummies for each group. This approach estimates the average yield change of both groups for each period and provides corresponding confidence intervals. 
Plotting the coefficient estimates for both groups around the treatment date shows us the dynamics of our panel data.\\index{Graph!Diff-in-diff graph} \n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_without_fe_time <- feols(\n fml = avg_yield ~ polluter + month:polluter +\n time_to_maturity + log_offering_amt,\n vcov = \"iid\",\n data = bonds_panel |>\n mutate(month = factor(month))\n)\n\nmodel_without_fe_coefs <- tidy(model_without_fe_time) |>\n filter(str_detect(term, \"month\")) |>\n mutate(\n month = ymd(substr(term, nchar(term) - 9, nchar(term))),\n treatment = str_detect(term, \"TRUE\"),\n ci_up = estimate + qnorm(0.975) * std.error,\n ci_low = estimate + qnorm(0.025) * std.error\n )\n\nmodel_without_fe_coefs |>\n ggplot(aes(\n month, \n color = treatment,\n linetype = treatment,\n shape = treatment\n )) +\n geom_vline(aes(xintercept = floor_date(treatment_date, \"month\")),\n linetype = \"dashed\"\n ) +\n geom_hline(aes(yintercept = 0),\n linetype = \"dashed\"\n ) +\n geom_errorbar(aes(ymin = ci_low, ymax = ci_up),\n alpha = 0.5\n ) +\n guides(linetype = \"none\") + \n geom_point(aes(y = estimate)) +\n labs(\n x = NULL,\n y = \"Yield\",\n shape = \"Polluter?\",\n color = \"Polluter?\",\n title = \"Polluters respond stronger to Paris Agreement than green firms\"\n )\n```\n\n::: {.cell-output-display}\n![The figure shows the coefficient estimates and 95 percent confidence intervals for OLS regressions estimating the treatment effect of the Paris Climate Agreement on bond yields (in percent) for polluters and non-polluters. The horizontal line represents the benchmark yield of polluters before the Paris Agreement. The vertical line indicates the date of the agreement (December 12, 2015).](difference-in-differences_files/figure-html/fig-1301-1.png){#fig-1301 fig-alt='Title: Polluters respond stronger to Paris Agreement than green firms. The figure shows a sequence of monthly dots for two groups. Before the agreement, the dots mainly overlap. Ahead of the agreement, yields start to increase. Then, after the agreement, there is a strong divergence in yields. Polluters have significantly higher yields than non-polluters in the months before and after the signing of the Paris Agreement. However, this yield difference vanishes again towards the end of 2016.' width=2100}\n:::\n:::\n\n@fig-1301 shows that throughout most of 2014, the yields of the two groups changed in unison. However, starting at the end of 2014, the yields start to diverge, reaching the highest difference around the signing of the PA. Afterward, the yields for both groups fall again, and the polluters arrive at the same level as at the beginning of 2014. The non-polluters, on the other hand, even experience significantly lower yields than polluters after the signing of the agreement. \n\nInstead of plotting both groups using the simple model approach, we can also use the fixed-effects model and focus on the polluter's yield response to the signing relative to the non-polluters. To perform this estimation, we need to replace the `treated` indicator with separate time dummies for the polluters, each marking a one-month period relative to the treatment date. 
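\n\nLoosely speaking, this second specification takes the form (our notation, with the signing month as the omitted baseline):\n$$\\text{avg\\_yield}_{i,t} = \\sum_{k \\neq 0} \\delta_k \\cdot \\text{polluter}_i \\cdot \\mathbb{1}\\left\\{t - t_{\\text{PA}} = k\\right\\} + \\alpha_i + \\lambda_t + \\varepsilon_{i,t},$$\nwhere $\\alpha_i$ and $\\lambda_t$ are bond and month fixed effects, $k$ counts months relative to the treatment date, and the coefficients $\\delta_k$ trace out the polluters' yield response around the signing.\n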
We then regress the monthly yields on the set of time dummies and `cusip_id` and `month` fixed effects.\\index{Graph!Diff-in-diff graph}\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel_alt <- bonds_panel |>\n mutate(\n diff_to_treatment = interval(\n floor_date(treatment_date, \"month\"), month\n ) %/% months(1)\n )\n\nvariables <- bonds_panel_alt |>\n distinct(diff_to_treatment, month) |>\n arrange(month) |>\n mutate(variable_name = as.character(NA))\n\nformula <- \"avg_yield ~ \"\n\nfor (j in 1:nrow(variables)) {\n if (variables$diff_to_treatment[j] != 0) {\n old_names <- names(bonds_panel_alt)\n bonds_panel_alt <- bonds_panel_alt |>\n mutate(new_var = diff_to_treatment == variables$diff_to_treatment[j] & \n polluter)\n new_var_name <- ifelse(variables$diff_to_treatment[j] < 0,\n str_c(\"lag\", abs(variables$diff_to_treatment[j])),\n str_c(\"lead\", variables$diff_to_treatment[j])\n )\n variables$variable_name[j] <- new_var_name\n names(bonds_panel_alt) <- c(old_names, new_var_name)\n formula <- str_c(\n formula,\n ifelse(j == 1,\n new_var_name,\n str_c(\"+\", new_var_name)\n )\n )\n }\n}\nformula <- str_c(formula, \"| cusip_id + month\")\n\nmodel_with_fe_time <- feols(\n fml = as.formula(formula),\n vcov = \"iid\",\n data = bonds_panel_alt\n)\n\nmodel_with_fe_time_coefs <- tidy(model_with_fe_time) |>\n mutate(\n term = str_remove(term, \"TRUE\"),\n ci_up = estimate + qnorm(0.975) * std.error,\n ci_low = estimate + qnorm(0.025) * std.error\n ) |>\n left_join(\n variables,\n join_by(term == variable_name)\n ) |>\n bind_rows(tibble(\n term = \"lag0\",\n estimate = 0,\n ci_up = 0,\n ci_low = 0,\n month = floor_date(treatment_date, \"month\")\n ))\n\nmodel_with_fe_time_coefs |>\n ggplot(aes(x = month, y = estimate)) +\n geom_vline(aes(xintercept = floor_date(treatment_date, \"month\")),\n linetype = \"dashed\"\n ) +\n geom_hline(aes(yintercept = 0),\n linetype = \"dashed\"\n ) +\n geom_errorbar(aes(ymin = ci_low, ymax = ci_up),\n alpha = 0.5\n ) +\n geom_point(aes(y = estimate)) +\n labs(\n x = NULL,\n y = \"Yield\",\n title = \"Polluters' yield patterns around Paris Agreement signing\"\n )\n```\n\n::: {.cell-output-display}\n![The figure shows the coefficient estimates and 95 percent confidence intervals for OLS regressions estimating the treatment effect of the Paris Climate Agreement on bond yields (in percent) for polluters. The horizontal line represents the benchmark yield of polluters before the Paris Agreement. The vertical line indicates the date of the agreement (December 12, 2015).](difference-in-differences_files/figure-html/fig-1402-1.png){#fig-1402 fig-alt='Title: Polluters\\' yield patterns around Paris Agreement signing. The figure shows a sequence of monthly dots for the treated group. Ahead of the agreement, yields of polluters start to increase. Then, after the agreement, there is a small reversal and yields drop again.' width=2100}\n:::\n:::\n\n\n\\index{Robustness tests}\nThe resulting graph shown in @fig-1402 confirms the main conclusion of the previous image: polluters' yield patterns show a considerable anticipation effect starting toward the end of 2014. Yields only marginally increase after the signing of the agreement. However, as opposed to the simple model, we do not see a complete reversal back to the pre-agreement level. Yields of polluters stay at a significantly higher level even one year after the signing.\n\nNotice that during the year after the PA was signed, the 45th President of the United States was elected on November 8, 2016. 
During his campaign there were some indications of intentions to withdraw the US from the PA, which ultimately happened on November 4, 2020. Hence, reversal effects are potentially driven by these actions.\n\n## Exercises\n\n1. The 46th President of the US rejoined the Paris Agreement in February 2021. Repeat the difference in differences analysis for the day of his election victory. Note that you will also have to download new TRACE data. How did polluters' yields react to this action?\n2. Based on the exercise on ratings in [TRACE and FISD](trace-and-fisd.qmd), include ratings as a control variable in the analysis above. Do the results change?", "supporting": [ "difference-in-differences_files" ], diff --git a/_freeze/r/factor-selection-via-machine-learning/execute-results/html.json b/_freeze/r/factor-selection-via-machine-learning/execute-results/html.json index bf2d639c..c3de96de 100644 --- a/_freeze/r/factor-selection-via-machine-learning/execute-results/html.json +++ b/_freeze/r/factor-selection-via-machine-learning/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "737369369e1c6b54de47c8f395406a42", + "hash": "cfae0fe7dccd12dadb62ebd01257070e", "result": { "engine": "knitr", - "markdown": "---\ntitle: Factor Selection via Machine Learning\naliases:\n - ../factor-selection-via-machine-learning.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Factor Selection via Machine Learning with R\n description-meta: Use machine learning tools such as Lasso and Ridge regressions to identify asset pricing factors using the programming language R.\n---\n\n\nThe aim of this chapter is twofold. From a data science perspective, we introduce `tidymodels`, a collection of packages for modeling and machine learning (ML) using `tidyverse` principles. `tidymodels` comes with a handy workflow for all sorts of typical prediction tasks. From a finance perspective, we address the notion of *factor zoo* [@Cochrane2011] using ML methods. We introduce Lasso and Ridge regression as a special case of penalized regression models. Then, we explain the concept of cross-validation for model *tuning* with Elastic Net regularization as a popular example. We implement and showcase the entire cycle from model specification, training, and forecast evaluation within the `tidymodels` universe. While the tools can generally be applied to an abundance of interesting asset pricing problems, we apply penalized regressions for identifying macroeconomic variables and asset pricing factors that help explain a cross-section of industry portfolios.\n\nIn previous chapters, we illustrate that stock characteristics such as size provide valuable pricing information in addition to the market beta.\\index{Factor zoo}\\index{Factor model}\\index{CAPM} \nSuch findings question the usefulness of the Capital Asset Pricing Model. \nIn fact, during the last decades, financial economists discovered a plethora of additional factors which may be correlated with the marginal utility of consumption (and would thus deserve a prominent role in pricing applications). The search for factors that explain the cross-section of expected stock returns has produced hundreds of potential candidates, as noted more recently by @Harvey2016, @Mclean2016, and @Hou2020.\nTherefore, given the multitude of proposed risk factors, the challenge these days rather is: *do we believe in the relevance of 300+ risk factors?* During recent years, promising methods from the field of ML got applied to common finance applications. 
We refer to @Mullainathan2017 for a treatment of ML from the perspective of an econometrician, @Nagel2021 for an excellent review of ML practices in asset pricing, @Easley2021 for ML applications in (high-frequency) market microstructure, and @Dixon2020 for a detailed treatment of all methodological aspects. \n\n## Brief Theoretical Background\n\nThis is a book about *doing* empirical work in a tidy manner, and we refer to any of the many excellent textbook treatments of ML methods and especially penalized regressions for some deeper discussion. Excellent material is provided, for instance, by @Hastie2009, @Hastie2013, and @DePrado2018. Instead, we briefly summarize the idea of Lasso and Ridge regressions as well as the more general Elastic Net. Then, we turn to the fascinating question on *how* to implement, tune, and use such models with the `tidymodels` workflow.\n\nTo set the stage, we start with the definition of a linear model: suppose we have data $(y_t, x_t), t = 1,\\ldots, T$, where $x_t$ is a $(K \\times 1)$ vector of regressors and $y_t$ is the response for observation $t$. \nThe linear model takes the form $y_t = \\beta' x_t + \\varepsilon_t$ with some error term $\\varepsilon_t$ and has been studied in abundance. The well-known ordinary-least square (OLS) estimator for the $(K \\times 1)$ vector $\\beta$ minimizes the sum of squared residuals and is then $$\\hat{\\beta}^\\text{ols} = \\left(\\sum\\limits_{t=1}^T x_t'x_t\\right)^{-1} \\sum\\limits_{t=1}^T x_t'y_t.$$ \n\\index{Regression!OLS}\n\nWhile we are often interested in the estimated coefficient vector $\\hat\\beta^\\text{ols}$, ML is about the predictive performance most of the time. For a new observation $\\tilde{x}_t$, the linear model generates predictions such that $$\\hat y_t = E\\left(y|x_t = \\tilde x_t\\right) = \\hat\\beta^\\text{ols}{}' \\tilde x_t.$$ \nIs this the best we can do? \nNot really: instead of minimizing the sum of squared residuals, penalized linear models can improve predictive performance by choosing other estimators $\\hat{\\beta}$ with lower variance than the estimator $\\hat\\beta^\\text{ols}$. \nAt the same time, it seems appealing to restrict the set of regressors to a few meaningful ones if possible. In other words, if $K$ is large (such as for the number of proposed factors in the asset pricing literature), it may be a desirable feature to *select* reasonable factors and set $\\hat\\beta^{\\text{ols}}_k = 0$ for some redundant factors. \n\nIt should be clear that the promised benefits of penalized regressions, i.e., reducing the mean squared error (MSE), come at a cost. In most cases, reducing the variance of the estimator introduces a bias such that $E\\left(\\hat\\beta\\right) \\neq \\beta$. What is the effect of such a bias-variance trade-off? To understand the implications, assume the following data-generating process for $y$: $$y = f(x) + \\varepsilon, \\quad \\varepsilon \\sim (0, \\sigma_\\varepsilon^2)$$ We want to recover $f(x)$, which denotes some unknown functional which maps the relationship between $x$ and $y$. While the properties of $\\hat\\beta^\\text{ols}$ as an unbiased estimator may be desirable under some circumstances, they are certainly not if we consider predictive accuracy. 
Alternative predictors $\\hat{f}(x)$ could be more desirable: For instance, the MSE depends on our model choice as follows: \\index{MSE} $$\\begin{aligned}\nMSE &=E((y-\\hat{f}(x))^2)=E((f(x)+\\epsilon-\\hat{f}(x))^2)\\\\\n&= \\underbrace{E((f(x)-\\hat{f}(x))^2)}_{\\text{total quadratic error}}+\\underbrace{E(\\epsilon^2)}_{\\text{irreducible error}} \\\\\n&= E\\left(\\hat{f}(x)^2\\right)+E\\left(f(x)^2\\right)-2E\\left(f(x)\\hat{f}(x)\\right)+\\sigma_\\varepsilon^2\\\\\n&=E\\left(\\hat{f}(x)^2\\right)+f(x)^2-2f(x)E\\left(\\hat{f}(x)\\right)+\\sigma_\\varepsilon^2\\\\\n&=\\underbrace{\\text{Var}\\left(\\hat{f}(x)\\right)}_{\\text{variance of model}}+ \\underbrace{E\\left((f(x)-\\hat{f}(x))\\right)^2}_{\\text{squared bias}} +\\sigma_\\varepsilon^2. \n\\end{aligned}$$ While no model can reduce $\\sigma_\\varepsilon^2$, a biased estimator with small variance may have a lower MSE than an unbiased estimator.\n\n### Ridge regression\n\n\\index{Regression!Ridge}\n\nOne biased estimator is known as Ridge regression. @Hoerl1970 propose to minimize the sum of squared errors *while simultaneously imposing a penalty on the $L_2$ norm of the parameters* $\\hat\\beta$. Formally, this means that for a penalty factor $\\lambda\\geq 0$ the minimization problem takes the form $\\min_\\beta \\left(y - X\\beta\\right)'\\left(y - X\\beta\\right)\\text{ s.t. } \\beta'\\beta \\leq c$. Here $c\\geq 0$ is a constant that depends on the choice of $\\lambda$. The larger $\\lambda$, the smaller $c$ (technically speaking, there is a one-to-one relationship between $\\lambda$, which corresponds to the Lagrangian of the minimization problem above and $c$). Here, $X = \\left(x_1 \\ldots x_T\\right)'$ and $y = \\left(y_1, \\ldots, y_T\\right)'$. A closed-form solution for the resulting regression coefficient vector $\\beta^\\text{ridge}$ exists: $$\\hat{\\beta}^\\text{ridge} = \\left(X'X + \\lambda I\\right)^{-1}X'y.$$ A couple of observations are worth noting: $\\hat\\beta^\\text{ridge} = \\hat\\beta^\\text{ols}$ for $\\lambda = 0$ and $\\hat\\beta^\\text{ridge} \\rightarrow 0$ for $\\lambda\\rightarrow \\infty$. Also for $\\lambda > 0$, $\\left(X'X + \\lambda I\\right)$ is non-singular even if $X'X$ is which means that $\\hat\\beta^\\text{ridge}$ exists even if $\\hat\\beta$ is not defined. However, note also that the Ridge estimator requires careful choice of the hyperparameter $\\lambda$ which controls the *amount of regularization*: a larger value of $\\lambda$ implies *shrinkage* of the regression coefficient toward 0, a smaller value of $\\lambda$ reduces the bias of the resulting estimator. \n\n::: {.calloutnote}\nNote, that $X$ usually contains an intercept column with ones. As a general rule, the associated intercept coefficient is not penalized. In practice, this often implies that $y$ is simply demeaned before computing $\\hat\\beta^\\text{ridge}$.\n::: \n\nWhat about the statistical properties of the Ridge estimator? First, the bad news is that $\\hat\\beta^\\text{ridge}$ is a biased estimator of $\\beta$. However, the good news is that (under homoscedastic error terms) the variance of the Ridge estimator is guaranteed to be *smaller* than the variance of the ordinary least square estimator. We encourage you to verify these two statements in the exercises. 
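\n\nTo build intuition for the shrinkage effect, the closed form above is straightforward to implement directly. The chunk below is a minimal sketch on simulated data; the sample size, the coefficient vector, and the choice of $\\lambda$ are purely illustrative and unrelated to the empirical application later in this chapter.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Simulate a small design with five regressors and a demeaned response\nset.seed(1234)\nn_obs <- 200\nX <- matrix(rnorm(n_obs * 5), ncol = 5)\nbeta_true <- c(1, 0.5, 0, 0, -0.5)\ny <- X %*% beta_true + rnorm(n_obs)\ny <- y - mean(y)\n\n# Closed-form Ridge estimator; lambda = 0 recovers the OLS solution\nestimate_ridge <- function(X, y, lambda) {\n  drop(solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y))\n}\n\ncbind(\n  ols = estimate_ridge(X, y, lambda = 0),\n  ridge = estimate_ridge(X, y, lambda = 50)\n)\n```\n:::\n\nThe Ridge coefficient vector is visibly shrunk toward zero relative to its OLS counterpart, which illustrates the effect of the penalty.\n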
As a result, we face a trade-off: The Ridge regression sacrifices some unbiasedness to achieve a smaller variance than the OLS estimator.\n\n### Lasso\n\n\\index{Regression!Lasso}\n\nAn alternative to Ridge regression is the Lasso (*l*east *a*bsolute *s*hrinkage and *s*election *o*perator). Similar to Ridge regression, the Lasso [@Tibshirani1996] is a penalized and biased estimator. \nThe main difference to Ridge regression is that Lasso does not only *shrink* coefficients but effectively selects variables by setting coefficients for *irrelevant* variables to zero. Lasso implements a $L_1$ penalization on the parameters such that: $$\\hat\\beta^\\text{Lasso} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right)\\text{ s.t. } \\sum\\limits_{k=1}^K|\\beta_k| < c(\\lambda).$$ There is no closed form solution for $\\hat\\beta^\\text{Lasso}$ in the above maximization problem but efficient algorithms exist (e.g., the R package `glmnet`). Like for Ridge regression, the hyperparameter $\\lambda$ has to be specified beforehand.\n\n### Elastic Net\n\nThe Elastic Net [@Zou2005] combines $L_1$ with $L_2$ penalization and encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. This more general framework considers the following optimization problem: $$\\hat\\beta^\\text{EN} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right) + \\lambda(1-\\rho)\\sum\\limits_{k=1}^K|\\beta_k| +\\frac{1}{2}\\lambda\\rho\\sum\\limits_{k=1}^K\\beta_k^2$$ Now, we have to chose two hyperparameters: the *shrinkage* factor $\\lambda$ and the *weighting parameter* $\\rho$. The Elastic Net resembles Lasso for $\\rho = 0$ and Ridge regression for $\\rho = 1$. While the R package `glmnet` provides efficient algorithms to compute the coefficients of penalized regressions, it is a good exercise to implement Ridge and Lasso estimation on your own before you use the `glmnet` package or the `tidymodels` back-end.\n\n## Data Preparation\n\nTo get started, we load the required R packages and data. The main focus is on the workflow behind the `tidymodels` package collection [@tidymodels]. \n@Kuhn2022 provide a thorough introduction into all `tidymodels` components. `glmnet` [@glmnet] was developed and released in sync with @Tibshirani1996 and provides an R implementation of Elastic Net estimation. The package `timetk` [@timetk] provides useful tools for time series data wrangling. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(scales)\nlibrary(furrr)\nlibrary(glmnet)\nlibrary(timetk)\n```\n:::\n\n\nIn this analysis, we use four different data sources that we load from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). We start with two different sets of factor portfolio returns which have been suggested as representing practical risk factor exposure and thus should be relevant when it comes to asset pricing applications. \n\n- The standard workhorse: monthly Fama-French 3 factor returns (market, small-minus-big, and high-minus-low book-to-market valuation sorts) defined in @Fama1992 and @Fama1993\n- Monthly q-factor returns from @Hou2015. The factors contain the size factor, the investment factor, the return-on-equity factor, and the expected growth factor\n\nNext, we include macroeconomic predictors which may predict the general stock market economy. 
Macroeconomic variables effectively serve as conditioning information such that their inclusion hints at the relevance of conditional models instead of unconditional asset pricing. We refer the interested reader to @Cochrane2009 on the role of conditioning information.\n\n- Our set of macroeconomic predictors comes from @Goyal2008. The data has been updated by the authors until 2021 and contains monthly variables that have been suggested as good predictors for the equity premium. Some of the variables are the dividend price ratio, earnings price ratio, stock variance, net equity expansion, treasury bill rate, and inflation\n\nFinally, we need a set of *test assets*. The aim is to understand which of the plenty factors and macroeconomic variable combinations prove helpful in explaining our test assets' cross-section of returns. \nIn line with many existing papers, we use monthly portfolio returns from 10 different industries according to the definition from [Kenneth French's homepage](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_10_ind_port.html) as test assets.\\index{Data!Fama-French factors}\\index{Data!q-factors}\\index{Data!Macro predictors}\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n collect() |>\n rename_with(~ str_c(\"factor_ff_\", .), -month)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\") |>\n collect() |>\n rename_with(~ str_c(\"factor_q_\", .), -month)\n\nmacro_predictors <- tbl(tidy_finance, \"macro_predictors\") |>\n collect() |>\n rename_with(~ str_c(\"macro_\", .), -month) |>\n select(-macro_rp_div)\n\nindustries_ff_monthly <- tbl(tidy_finance, \"industries_ff_monthly\") |>\n collect() |>\n pivot_longer(-month,\n names_to = \"industry\", values_to = \"ret\"\n ) |>\n arrange(desc(industry)) |> \n mutate(industry = as_factor(industry))\n```\n:::\n\n\nWe combine all the monthly observations into one data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- industries_ff_monthly |>\n left_join(factors_ff3_monthly, by = \"month\") |>\n left_join(factors_q_monthly, by = \"month\") |>\n left_join(macro_predictors, by = \"month\") |>\n mutate(\n ret = ret - factor_ff_rf\n ) |>\n select(month, industry, ret_excess = ret, everything()) |>\n drop_na()\n```\n:::\n\n\nOur data contains 22 columns of regressors with the 13 macro variables and 8 factor returns for each month. \n@fig-1401 provides summary statistics for the 10 monthly industry excess returns in percent.\\index{Graph!Box plot}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata |>\n group_by(industry) |>\n ggplot(aes(x = industry, y = ret_excess)) +\n geom_boxplot() +\n coord_flip() +\n labs(\n x = NULL, y = NULL,\n title = \"Excess return distributions by industry in percent\"\n ) +\n scale_y_continuous(\n labels = percent\n )\n```\n\n::: {.cell-output-display}\n![The box plots show the monthly dispersion of returns for 10 different industries.](factor-selection-via-machine-learning_files/figure-html/fig-1401-1.png){#fig-1401 fig-alt='Title: Excess return distributions by industry in percent. The figure shows boxplots that visualize the industry\\'s excess return distribution. All industry returns are centered around zero and exhibit substantial outliers in the magnitude of 20 percent on a monthly basis.' 
width=2100}\n:::\n:::\n\n\n## The tidymodels Workflow\n\nTo illustrate penalized linear regressions, we employ the `tidymodels` collection of packages for modeling and ML using `tidyverse` principles. You can simply use `install.packages(\"tidymodels\")` to get access to all the related packages. We recommend checking out the work of @Kuhn2022: They continuously write on their great book ['Tidy Modeling with R'](https://www.tmwr.org/) using tidy principles.\n\nThe `tidymodels` workflow encompasses the main stages of the modeling process: pre-processing of data, model fitting, and post-processing of results. As we demonstrate below, `tidymodels` provides efficient workflows that you can update with low effort.\n\nUsing the ideas of Ridge and Lasso regressions, the following example guides you through (i) pre-processing the data (data split and variable mutation), (ii) building models, (iii) fitting models, and (iv) tuning models to create the \"best\" possible predictions.\n\nTo start, we restrict our analysis to just one industry: Manufacturing. We first split the sample into a *training* and a *test* set. \nFor that purpose, `tidymodels` provides the function `initial_time_split()` from the `rsample` package [@rsample]. \nThe split takes the last 20% of the data as a test set, which is not used for any model tuning. \nWe use this test set to evaluate the predictive accuracy in an out-of-sample scenario.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsplit <- initial_time_split(\n data |>\n filter(industry == \"manuf\") |>\n select(-industry),\n prop = 4 / 5\n)\nsplit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n<536/135/671>\n```\n\n\n:::\n:::\n\n\nThe object `split` simply keeps track of the observations of the training and the test set. \nWe can call the training set with `training(split)`, while we can extract the test set with `testing(split)`.\n\n### Pre-process data\n\nRecipes help you pre-process your data before training your model. Recipes are a series of pre-processing steps such as variable selection, transformation, or conversion of qualitative predictors to indicator variables. Each recipe starts with a `formula` that defines the general structure of the dataset and the role of each variable (regressor or dependent variable). For our dataset, our recipe contains the following steps before we fit any model:\n\n- Our formula defines that we want to explain excess returns with all available predictors. The regression equation thus takes the form \n$$r_{t} = \\alpha_0 + \\left(\\tilde f_t \\otimes \\tilde z_t\\right)B + \\varepsilon_t $$ where $r_t$ is the vector of industry excess returns at time $t$ and $\\tilde f_t$ and $\\tilde z_t$ are the (standardized) vectors of factor portfolio returns and macroeconomic variables\n- We exclude the column *month* from the analysis\n- We include all interaction terms between factors and macroeconomic predictors\n- We demean and scale each regressor such that the standard deviation is one\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrec <- recipe(ret_excess ~ ., data = training(split)) |>\n step_rm(month) |>\n step_interact(terms = ~ contains(\"factor\"):contains(\"macro\")) |>\n step_normalize(all_predictors())\n```\n:::\n\n\nA table of all available recipe steps can be found [in the `tidymodels` documentation.](https://www.tidymodels.org/find/recipes/) As of 2023, more than 150 different processing steps are available! 
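\n\nIf you are unsure what a recipe will eventually do to your data, you can inspect it before fitting anything: `summary()` lists the variables and their roles, and `tidy()` lists the declared steps (both methods are provided by the `recipes` package).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Inspect variable roles and the declared pre-processing steps of the recipe\nsummary(rec)\ntidy(rec)\n```\n:::\n\n\n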
One important point: The definition of a recipe does not trigger any calculations yet but rather provides a *description* of the tasks to be applied. As a result, it is very easy to *reuse* recipes for different models and thus make sure that the outcomes are comparable as they are based on the same input. \nIn the example above, it does not make a difference whether you use the input `data = training(split)` or `data = testing(split)`. \nAll that matters at this early stage are the column names and types.\n\nWe can apply the recipe to any data with a suitable structure. The code below combines two different functions: `prep()` estimates the required parameters from a training set that can be applied to other data sets later. `bake()` applies the processed computations to new data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_prep <- prep(rec, training(split))\n```\n:::\n\nThe object `data_prep` contains information related to the different preprocessing steps applied to the training data: E.g., it is necessary to compute sample means and standard deviations to center and scale the variables. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_bake <- bake(data_prep,\n new_data = testing(split)\n)\ndata_bake\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 135 × 126\n factor_ff_mkt_excess factor_ff_smb factor_ff_hml factor_ff_rf\n \n1 -1.73 -1.10 -0.711 -1.78\n2 2.35 0.949 -0.0937 -1.78\n3 -0.156 -0.122 -0.284 -1.78\n4 0.0646 -0.256 0.425 -1.78\n5 0.995 0.563 -0.456 -1.78\n# ℹ 130 more rows\n# ℹ 122 more variables: factor_q_me , factor_q_ia ,\n# factor_q_roe , factor_q_eg , macro_dp ,\n# macro_dy , macro_ep , macro_de ,\n# macro_svar , macro_bm , macro_ntis ,\n# macro_tbl , macro_lty , macro_ltr ,\n# macro_tms , macro_dfy , macro_infl , …\n```\n\n\n:::\n:::\n\n\nNote that the resulting data contains the 132 observations from the test set and 126 columns. Why so many? Recall that the recipe states to compute every possible interaction term between the factors and predictors, which increases the dimension of the data matrix substantially. \n\nYou may ask at this stage: why should I use a recipe instead of simply using the data wrangling commands such as `mutate()` or `select()`? `tidymodels` beauty is that a lot is happening under the hood. Recall, that for the simple scaling step, you actually have to compute the standard deviation of each column, then *store* this value, and apply the identical transformation to a different dataset, e.g., `testing(split)`. A prepped `recipe` stores these values and hands them on once you `bake()` a novel dataset. Easy as pie with `tidymodels`, isn't it?\n\n### Build a model\n\n\\index{Regression}\nNext, we can build an actual model based on our pre-processed data. In line with the definition above, we estimate regression coefficients of a Lasso regression such that we get \n$$\\begin{aligned}\\hat\\beta_\\lambda^\\text{Lasso} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right) + \\lambda\\sum\\limits_{k=1}^K|\\beta_k|.\\end{aligned}$$ We want to emphasize that the `tidymodels` workflow for *any* model is very similar, irrespective of the specific model. As you will see further below, it is straightforward to fit Ridge regression coefficients and - later - Neural networks or Random forests with basically the same code. The structure is always as follows: create a so-called `workflow()` and use the `fit()` function. 
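\n\nIn case you wonder which computational engines are available for a given model type, `show_engines()` from `parsnip` provides a quick overview; here, for the linear regression specification that we use throughout this chapter.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# List the engines that can fit a linear regression specification\nshow_engines(\"linear_reg\")\n```\n:::\n\n\n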
A table with all available model APIs is available [here.](https://www.tidymodels.org/find/parsnip/)\nFor now, we start with the linear regression model with a given value for the penalty factor $\\lambda$. In the setup below, `mixture` denotes the value of $\\rho$, hence setting `mixture = 1` implies the Lasso.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_model <- linear_reg(\n penalty = 0.0001,\n mixture = 1\n) |>\n set_engine(\"glmnet\", intercept = FALSE)\n```\n:::\n\n\nThat's it - we are done! The object `lm_model` contains the definition of our model with all required information. Note that `set_engine(\"glmnet\")` indicates the API character of the `tidymodels` workflow: Under the hood, the package `glmnet` is doing the heavy lifting, while `linear_reg()` provides a unified framework to collect the inputs. The `workflow` ends with combining everything necessary for the serious data science workflow, namely, a recipe and a model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_fit <- workflow() |>\n add_recipe(rec) |>\n add_model(lm_model)\nlm_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n══ Workflow ═════════════════════════════════════════════════════════\nPreprocessor: Recipe\nModel: linear_reg()\n\n── Preprocessor ─────────────────────────────────────────────────────\n3 Recipe Steps\n\n• step_rm()\n• step_interact()\n• step_normalize()\n\n── Model ────────────────────────────────────────────────────────────\nLinear Regression Model Specification (regression)\n\nMain Arguments:\n penalty = 1e-04\n mixture = 1\n\nEngine-Specific Arguments:\n intercept = FALSE\n\nComputational engine: glmnet \n```\n\n\n:::\n:::\n\n\n### Fit a model\n\nWith the `workflow` from above, we are ready to use `fit()`. Typically, we use training data to fit the model. \nThe training data is pre-processed according to our recipe steps, and the Lasso regression coefficients are computed. \nFirst, we focus on the predicted values $\\hat{y}_t = x_t\\hat\\beta^\\text{Lasso}.$ @fig-1402 illustrates the projections for the *entire* time series of the manufacturing industry portfolio returns. 
The grey area indicates the out-of-sample period, which we did not use to fit the model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npredicted_values <- lm_fit |>\n fit(data = training(split)) |>\n predict(data |> filter(industry == \"manuf\")) |>\n bind_cols(data |> filter(industry == \"manuf\")) |>\n select(month,\n \"Fitted value\" = .pred,\n \"Realization\" = ret_excess\n ) |>\n pivot_longer(-month, names_to = \"Variable\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npredicted_values |>\n ggplot(aes(\n x = month, \n y = value, \n color = Variable,\n linetype = Variable\n )) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n color = NULL,\n linetype = NULL,\n title = \"Monthly realized and fitted manufacturing industry risk premia\"\n ) +\n scale_x_date(\n breaks = function(x) {\n seq.Date(\n from = min(x),\n to = max(x),\n by = \"5 years\"\n )\n },\n minor_breaks = function(x) {\n seq.Date(\n from = min(x),\n to = max(x),\n by = \"1 years\"\n )\n },\n expand = c(0, 0),\n labels = date_format(\"%Y\")\n ) +\n scale_y_continuous(\n labels = percent\n ) +\n annotate(\"rect\",\n xmin = testing(split) |> pull(month) |> min(),\n xmax = testing(split) |> pull(month) |> max(),\n ymin = -Inf, ymax = Inf,\n alpha = 0.5, fill = \"grey70\"\n )\n```\n\n::: {.cell-output-display}\n![The grey area corresponds to the out of sample period.](factor-selection-via-machine-learning_files/figure-html/fig-1402-1.png){#fig-1402 fig-alt='Title: Monthly realized and fitted manufacturing industry risk premium. The figure shows the time series of realized and predicted manufacturing industry risk premiums. The figure seems to indicate that the predictions capture most of the return dynamics.' width=2100}\n:::\n:::\n\n\nWhat do the estimated coefficients look like? To analyze these values and to illustrate the difference between the `tidymodels` workflow and the underlying `glmnet` package, it is worth computing the coefficients $\\hat\\beta^\\text{Lasso}$ directly. The code below estimates the coefficients for the Lasso and Ridge regression for the processed training data sample. Note that `glmnet` actually takes a vector `y` and the matrix of regressors $X$ as input. Moreover, `glmnet` requires choosing the penalty parameter $\\alpha$, which corresponds to $\\rho$ in the notation above. When using the `tidymodels` model API, such details do not need consideration.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- data_bake |>\n select(-ret_excess) |>\n as.matrix()\ny <- data_bake |> pull(ret_excess)\n\nfit_lasso <- glmnet(\n x = x,\n y = y,\n alpha = 1,\n intercept = FALSE,\n standardize = FALSE,\n lambda.min.ratio = 0\n)\n\nfit_ridge <- glmnet(\n x = x,\n y = y,\n alpha = 0,\n intercept = FALSE,\n standardize = FALSE,\n lambda.min.ratio = 0\n)\n```\n:::\n\n\nThe objects `fit_lasso` and `fit_ridge` contain an entire sequence of estimated coefficients for multiple values of the penalty factor $\\lambda$. @fig-1403 illustrates the trajectories of the regression coefficients as a function of the penalty factor. 
Both Lasso and Ridge coefficients converge to zero as the penalty factor increases.\\index{Graph!ML prediction path}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbind_rows(\n tidy(fit_lasso) |> mutate(Model = \"Lasso\"),\n tidy(fit_ridge) |> mutate(Model = \"Ridge\")\n) |>\n rename(\"Variable\" = term) |>\n ggplot(aes(x = lambda, y = estimate, color = Variable)) +\n geom_line() +\n scale_x_log10() +\n facet_wrap(~Model, scales = \"free_x\") +\n labs(\n x = \"Penalty factor (lambda)\", y = NULL,\n title = \"Estimated coefficient paths for different penalty factors\"\n ) +\n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![The penalty parameters are chosen iteratively to resemble the path from no penalization to a model that excludes all variables.](factor-selection-via-machine-learning_files/figure-html/fig-1403-1.png){#fig-1403 fig-alt='Title: Estimated coefficient paths for different penalty factors. The figure shows how estimated lasso and ridge coefficients tend to zero for a higher penalty parameter. Ridge trace is smooth, and Lasso exhibits non-linear behavior.' width=2100}\n:::\n:::\n\n\n::: {.rmdnote}\nOne word of caution: The package `glmnet` computes estimates of the coefficients $\\hat\\beta$ based on numerical optimization procedures. \nAs a result, the estimated coefficients for the [special case](https://parsnip.tidymodels.org/reference/glmnet-details.html) with no regularization ($\\lambda = 0$) can deviate from the standard OLS estimates.\n:::\n\n### Tune a model\n\nTo compute $\\hat\\beta_\\lambda^\\text{Lasso}$ , we simply imposed a value for the penalty hyperparameter $\\lambda$. Model tuning is the process of optimally selecting such hyperparameters. `tidymodels` provides extensive tuning options based on so-called *cross-validation*. Again, we refer to any treatment of cross-validation to get a more detailed discussion of the statistical underpinnings. Here we focus on the general idea and the implementation with `tidymodels`. \n\nThe goal for choosing $\\lambda$ (or any other hyperparameter, e.g., $\\rho$ for the Elastic Net) is to find a way to produce predictors $\\hat{Y}$ for an outcome $Y$ that minimizes the mean squared prediction error $\\text{MSPE} = E\\left( \\frac{1}{T}\\sum_{t=1}^T (\\hat{y}_t - y_t)^2 \\right)$. Unfortunately, the MSPE is not directly observable. We can only compute an estimate because our data is random and because we do not observe the entire population.\n\nObviously, if we train an algorithm on the same data that we use to compute the error, our estimate $\\hat{\\text{MSPE}}$ would indicate way better predictive accuracy than what we can expect in real out-of-sample data. The result is called overfitting.\n\nCross-validation is a technique that allows us to alleviate this problem. We approximate the true MSPE as the average of many MSPE obtained by creating predictions for $K$ new random samples of the data, none of them used to train the algorithm $\\frac{1}{K} \\sum_{k=1}^K \\frac{1}{T}\\sum_{t=1}^T \\left(\\hat{y}_t^k - y_t^k\\right)^2$. In practice, this is done by carving out a piece of our data and pretending it is an independent sample. We again divide the data into a training set and a test set. The MSPE on the test set is our measure for actual predictive ability, while we use the training set to fit models with the aim to find the *optimal* hyperparameter values. 
To do so, we further divide our training sample into (several) subsets, fit our model for a grid of potential hyperparameter values (e.g., $\\lambda$), and evaluate the predictive accuracy on an *independent* sample. This works as follows:\n\n1. Specify a grid of hyperparameters\n2. Obtain predictors $\\hat{y}_i(\\lambda)$ to denote the predictors for the used parameters $\\lambda$\n3. Compute $$\n \\text{MSPE}(\\lambda) = \\frac{1}{K} \\sum_{k=1}^K \\frac{1}{T}\\sum_{t=1}^T \\left(\\hat{y}_t^k(\\lambda) - y_t^k\\right)^2 \n $$ With K-fold cross-validation, we do this computation $K$ times. Simply pick a validation set with $M=T/K$ observations at random and think of these as random samples $y_1^k, \\dots, y_{\\tilde{T}}^k$, with $k=1$\n\nHow should you pick $K$? Large values of $K$ are preferable because the training data better imitates the original data. However, larger values of $K$ will have much higher computation time.\n`tidymodels` provides all required tools to conduct $K$-fold cross-validation. We just have to update our model specification and let `tidymodels` know which parameters to tune. In our case, we specify the penalty factor $\\lambda$ as well as the mixing factor $\\rho$ as *free* parameters. Note that it is simple to change an existing `workflow` with `update_model()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_model <- linear_reg(\n penalty = tune(),\n mixture = tune()\n) |>\n set_engine(\"glmnet\")\n\nlm_fit <- lm_fit |>\n update_model(lm_model)\n```\n:::\n\n\nFor our sample, we consider a time-series cross-validation sample. This means that we tune our models with 20 random samples of length five years with a validation period of four years. For a grid of possible hyperparameters, we then fit the model for each fold and evaluate $\\hat{\\text{MSPE}}$ in the corresponding validation set. Finally, we select the model specification with the lowest MSPE in the validation set. First, we define the cross-validation folds based on our training data only.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_folds <- time_series_cv(\n data = training(split),\n date_var = month,\n initial = \"5 years\",\n assess = \"48 months\",\n cumulative = FALSE,\n slice_limit = 20\n)\ndata_folds\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Time Series Cross Validation Plan \n# A tibble: 20 × 2\n splits id \n \n1 Slice01\n2 Slice02\n3 Slice03\n4 Slice04\n5 Slice05\n# ℹ 15 more rows\n```\n\n\n:::\n:::\n\n\nThen, we evaluate the performance for a grid of different penalty values. `tidymodels` provides functionalities to construct a suitable grid of hyperparameters with `grid_regular`. The code chunk below creates a $10 \\times 3$ hyperparameters grid. Then, the function `tune_grid()` evaluates all the models for each fold.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_tune <- lm_fit |>\n tune_grid(\n resample = data_folds,\n grid = grid_regular(penalty(), mixture(), levels = c(20, 3)),\n metrics = metric_set(rmse)\n )\n```\n:::\n\n\nAfter the tuning process, we collect the evaluation metrics (the root mean-squared error in our example) to identify the *optimal* model. 
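\n\nBefore plotting the results, it can be instructive to look at the tuning output in tabular form: `collect_metrics()` returns the cross-validated metric for every candidate combination of $\\lambda$ and $\\rho$, and `show_best()` ranks the candidates (both functions come with the `tune` package).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Cross-validated RMSE for each candidate combination of penalty and mixture\ncollect_metrics(lm_tune)\n\n# The five best-performing hyperparameter combinations\nshow_best(lm_tune, metric = \"rmse\", n = 5)\n```\n:::\n\n\n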
@fig-1404 illustrates the average validation set's root mean-squared error for each value of $\\lambda$ and $\\rho$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(lm_tune) + \n aes(linetype = `Proportion of Lasso Penalty`) + \n guides(linetype = \"none\") +\n labs(\n x = \"Penalty factor (lambda)\",\n y = \"Root MSPE\",\n title = \"Root MSPE for different penalty factors\"\n ) + \n scale_x_log10()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nScale for x is already present.\nAdding another scale for x, which will replace the existing scale.\n```\n\n\n:::\n\n::: {.cell-output-display}\n![Evaluation of manufacturing excess returns for different penalty factors (lambda) and proportions of Lasso penalty (rho). 1.0 indicates Lasso, 0.5 indicates Elastic Net, and 0.0 indicates Ridge.](factor-selection-via-machine-learning_files/figure-html/fig-1404-1.png){#fig-1404 fig-alt='Title: Root MSPE for different penalty factors. The figure shows that more regularization does not affect the selected models in a meaningful fashion. At some point, the Elastic Net prediction error drops, which indicates the selected model. MSPE increases again for high penalization values.' width=2100}\n:::\n:::\n\n\n@fig-1404 shows that the cross-validated MSPE drops for Lasso and Elastic Net and spikes afterward. For Ridge regression, the MSPE increases above a certain threshold. Recall that the larger the regularization, the more restricted the model becomes. Thus, we would choose the model with the lowest MSPE.\n\n### Parallelized workflow\n\nOur starting point was the question: Which factors determine industry returns? While @Avramov2022b provide a Bayesian analysis related to the research question above, we choose a simplified approach: To illustrate the entire workflow, we now run the penalized regressions for all ten industries. \nWe want to identify relevant variables by fitting Lasso models for each industry returns time series. More specifically, we perform cross-validation for each industry to identify the optimal penalty factor $\\lambda$. \nThen, we use the set of `finalize_*()`-functions that take a list or tibble of tuning parameter values and update objects with those values. After determining the best model, we compute the final fit on the entire training set and analyze the estimated coefficients. \n\nFirst, we define the Lasso model with one tuning parameter.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlasso_model <- linear_reg(\n penalty = tune(),\n mixture = 1\n) |>\n set_engine(\"glmnet\")\n\nlm_fit <- lm_fit |>\n update_model(lasso_model)\n```\n:::\n\n\nThe following task can be easily parallelized to reduce computing time substantially. We use the parallelization capabilities of `furrr`. 
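\n\nIf you have not worked with `furrr` before, the basic pattern is to declare a parallel `plan()` and then replace `purrr::map()` with `future_map()`; the stylized toy example below (unrelated to our data) illustrates the mechanics.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy example of the furrr workflow: set a plan, map in parallel, reset the plan\nplan(multisession, workers = 2)\nfuture_map(1:4, ~ .x^2)\nplan(sequential)\n```\n:::\n\n\n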
Note that we can also just recycle all the steps from above and collect them in a function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect_variables <- function(input) {\n # Split into training and testing data\n split <- initial_time_split(input, prop = 4 / 5)\n\n # Data folds for cross-validation\n data_folds <- time_series_cv(\n data = training(split),\n date_var = month,\n initial = \"5 years\",\n assess = \"48 months\",\n cumulative = FALSE,\n slice_limit = 20\n )\n\n # Model tuning with the Lasso model\n lm_tune <- lm_fit |>\n tune_grid(\n resample = data_folds,\n grid = grid_regular(penalty(), levels = c(10)),\n metrics = metric_set(rmse)\n )\n\n # Identify the best model and fit with the training data\n lasso_lowest_rmse <- lm_tune |> select_by_one_std_err(\"rmse\")\n lasso_final <- finalize_workflow(lm_fit, lasso_lowest_rmse)\n lasso_final_fit <- last_fit(lasso_final, split, metrics = metric_set(rmse))\n\n # Extract the estimated coefficients\n estimated_coefficients <- lasso_final_fit |>\n extract_fit_parsnip() |>\n tidy() |>\n mutate(\n term = str_remove_all(term, \"factor_|macro_|industry_\")\n )\n\n return(estimated_coefficients)\n}\n\n# Parallelization\nplan(multisession, workers = availableCores())\n\n# Computation by industry\nselected_factors <- data |>\n nest(data = -industry) |>\n mutate(selected_variables = future_map(\n data, select_variables,\n .options = furrr_options(seed = TRUE)\n ))\n```\n:::\n\n\nWhat has just happened? In principle, exactly the same as before but instead of computing the Lasso coefficients for one industry, we did it for ten in parallel. The final option `seed = TRUE` is required to make the cross-validation process reproducible. \nNow, we just have to do some housekeeping and keep only variables that Lasso does *not* set to zero. We illustrate the results in a heat map in @fig-1405.\\index{Graph!Heat map}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselected_factors |>\n unnest(selected_variables) |>\n filter(\n term != \"(Intercept)\",\n estimate != 0\n ) |>\n add_count(term) |>\n mutate(\n term = str_remove_all(term, \"NA|ff_|q_\"),\n term = str_replace_all(term, \"_x_\", \" \"),\n term = fct_reorder(as_factor(term), n),\n term = fct_lump_min(term, min = 2),\n selected = 1\n ) |>\n filter(term != \"Other\") |>\n mutate(term = fct_drop(term)) |>\n complete(industry, term, fill = list(selected = 0)) |>\n ggplot(aes(industry,\n term,\n fill = as_factor(selected)\n )) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 70)) +\n scale_fill_manual(values = c(\"white\", \"grey30\")) +\n theme(legend.position = \"None\") +\n labs(\n x = NULL, y = NULL,\n title = \"Selected variables for different industries\"\n )\n```\n\n::: {.cell-output-display}\n![Grey areas indicate that the estimated Lasso regression coefficient is not set to zero. White fields show which variables get assigned a value of exactly zero.](factor-selection-via-machine-learning_files/figure-html/fig-1405-1.png){#fig-1405 fig-alt='Title: Selected variables for different industries. The figure shows which factors and macroeconomic predictors the Lasso model selected for the different industries. In general, there are not many selected variables. The market excess return is selected across all industries except for utilities.' width=2100}\n:::\n:::\n\n\nThe heat map in @fig-1405 conveys two main insights. 
\nFirst, we see a lot of white, which means that many factors, macroeconomic variables, and interaction terms are not relevant for explaining the cross-section of returns across the industry portfolios. In fact, only the market factor and the return-on-equity factor play a role for several industries. Second, there seems to be quite some heterogeneity across different industries. While barely any variable is selected by Lasso for Utilities, many factors are selected for, e.g., High-Tech and Durable, but they do not coincide at all. \nIn other words, there seems to be a clear picture that we do not need many factors, but Lasso does not provide a factor that consistently provides pricing abilities across industries.\n\n## Exercises \n\n1. Write a function that requires three inputs, namely, `y` (a $T$ vector), `X` (a $(T \\times K)$ matrix), and `lambda` and then returns the Ridge estimator (a $K$ vector) for a given penalization parameter $\\lambda$. Recall that the intercept should not be penalized. Therefore, your function should indicate whether $X$ contains a vector of ones as the first column, which should be exempt from the $L_2$ penalty.\n1. Compute the $L_2$ norm ($\\beta'\\beta$) for the regression coefficients based on the predictive regression from the previous exercise for a range of $\\lambda$'s and illustrate the effect of penalization in a suitable figure.\n1. Now, write a function that requires three inputs, namely,`y` (a $T$ vector), `X` (a $(T \\times K)$ matrix), and 'lambda` and then returns the Lasso estimator (a $K$ vector) for a given penalization parameter $\\lambda$. Recall that the intercept should not be penalized. Therefore, your function should indicate whether $X$ contains a vector of ones as the first column, which should be exempt from the $L_1$ penalty.\n1. After you understand what Ridge and Lasso regressions are doing, familiarize yourself with the `glmnet()` package's documentation. It is a thoroughly tested and well-established package that provides efficient code to compute the penalized regression coefficients for Ridge and Lasso and for combinations, commonly called *Elastic Nets*. ", + "markdown": "---\ntitle: Factor Selection via Machine Learning\naliases:\n - ../factor-selection-via-machine-learning.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Factor Selection via Machine Learning with R\n description-meta: Use machine learning tools such as Lasso and Ridge regressions to identify asset pricing factors using the programming language R.\n---\n\n\nThe aim of this chapter is twofold. From a data science perspective, we introduce `tidymodels`, a collection of packages for modeling and machine learning (ML) using `tidyverse` principles. `tidymodels` comes with a handy workflow for all sorts of typical prediction tasks. From a finance perspective, we address the notion of *factor zoo* [@Cochrane2011] using ML methods. We introduce Lasso and Ridge regression as a special case of penalized regression models. Then, we explain the concept of cross-validation for model *tuning* with Elastic Net regularization as a popular example. We implement and showcase the entire cycle from model specification, training, and forecast evaluation within the `tidymodels` universe. 
While the tools can generally be applied to an abundance of interesting asset pricing problems, we apply penalized regressions for identifying macroeconomic variables and asset pricing factors that help explain a cross-section of industry portfolios.\n\nIn previous chapters, we illustrate that stock characteristics such as size provide valuable pricing information in addition to the market beta.\\index{Factor zoo}\\index{Factor model}\\index{CAPM} \nSuch findings question the usefulness of the Capital Asset Pricing Model. \nIn fact, during the last decades, financial economists have discovered a plethora of additional factors which may be correlated with the marginal utility of consumption (and would thus deserve a prominent role in pricing applications). The search for factors that explain the cross-section of expected stock returns has produced hundreds of potential candidates, as noted more recently by @Harvey2016, @Mclean2016, and @Hou2020.\nTherefore, given the multitude of proposed risk factors, the challenge these days is rather: *do we believe in the relevance of 300+ risk factors?* In recent years, promising methods from the field of ML have been applied to common finance applications. We refer to @Mullainathan2017 for a treatment of ML from the perspective of an econometrician, @Nagel2021 for an excellent review of ML practices in asset pricing, @Easley2021 for ML applications in (high-frequency) market microstructure, and @Dixon2020 for a detailed treatment of all methodological aspects. \n\n## Brief Theoretical Background\n\nThis is a book about *doing* empirical work in a tidy manner, and we refer to any of the many excellent textbook treatments of ML methods and especially penalized regressions for some deeper discussion. Excellent material is provided, for instance, by @Hastie2009, @Hastie2013, and @DePrado2018. Instead, we briefly summarize the idea of Lasso and Ridge regressions as well as the more general Elastic Net. Then, we turn to the fascinating question of *how* to implement, tune, and use such models with the `tidymodels` workflow.\n\nTo set the stage, we start with the definition of a linear model: suppose we have data $(y_t, x_t), t = 1,\\ldots, T$, where $x_t$ is a $(K \\times 1)$ vector of regressors and $y_t$ is the response for observation $t$. \nThe linear model takes the form $y_t = \\beta' x_t + \\varepsilon_t$ with some error term $\\varepsilon_t$ and has been studied in abundance. The well-known ordinary least squares (OLS) estimator for the $(K \\times 1)$ vector $\\beta$ minimizes the sum of squared residuals and is then $$\\hat{\\beta}^\\text{ols} = \\left(\\sum\\limits_{t=1}^T x_t x_t'\\right)^{-1} \\sum\\limits_{t=1}^T x_t y_t.$$ \n\\index{Regression!OLS}\n\nWhile we are often interested in the estimated coefficient vector $\\hat\\beta^\\text{ols}$, ML is about the predictive performance most of the time. For a new observation $\\tilde{x}_t$, the linear model generates predictions such that $$\\hat y_t = E\\left(y|x_t = \\tilde x_t\\right) = \\hat\\beta^\\text{ols}{}' \\tilde x_t.$$ \nIs this the best we can do? \nNot really: instead of minimizing the sum of squared residuals, penalized linear models can improve predictive performance by choosing other estimators $\\hat{\\beta}$ with lower variance than the estimator $\\hat\\beta^\\text{ols}$. \nAt the same time, it seems appealing to restrict the set of regressors to a few meaningful ones if possible. 
In other words, if $K$ is large (such as for the number of proposed factors in the asset pricing literature), it may be a desirable feature to *select* reasonable factors and set $\\hat\\beta^{\\text{ols}}_k = 0$ for some redundant factors. \n\nIt should be clear that the promised benefits of penalized regressions, i.e., reducing the mean squared error (MSE), come at a cost. In most cases, reducing the variance of the estimator introduces a bias such that $E\\left(\\hat\\beta\\right) \\neq \\beta$. What is the effect of such a bias-variance trade-off? To understand the implications, assume the following data-generating process for $y$: $$y = f(x) + \\varepsilon, \\quad \\varepsilon \\sim (0, \\sigma_\\varepsilon^2)$$ We want to recover $f(x)$, which denotes some unknown functional which maps the relationship between $x$ and $y$. While the properties of $\\hat\\beta^\\text{ols}$ as an unbiased estimator may be desirable under some circumstances, they are certainly not if we consider predictive accuracy. Alternative predictors $\\hat{f}(x)$ could be more desirable: For instance, the MSE depends on our model choice as follows: \\index{MSE} $$\\begin{aligned}\nMSE &=E((y-\\hat{f}(x))^2)=E((f(x)+\\epsilon-\\hat{f}(x))^2)\\\\\n&= \\underbrace{E((f(x)-\\hat{f}(x))^2)}_{\\text{total quadratic error}}+\\underbrace{E(\\epsilon^2)}_{\\text{irreducible error}} \\\\\n&= E\\left(\\hat{f}(x)^2\\right)+E\\left(f(x)^2\\right)-2E\\left(f(x)\\hat{f}(x)\\right)+\\sigma_\\varepsilon^2\\\\\n&=E\\left(\\hat{f}(x)^2\\right)+f(x)^2-2f(x)E\\left(\\hat{f}(x)\\right)+\\sigma_\\varepsilon^2\\\\\n&=\\underbrace{\\text{Var}\\left(\\hat{f}(x)\\right)}_{\\text{variance of model}}+ \\underbrace{E\\left((f(x)-\\hat{f}(x))\\right)^2}_{\\text{squared bias}} +\\sigma_\\varepsilon^2. \n\\end{aligned}$$ While no model can reduce $\\sigma_\\varepsilon^2$, a biased estimator with small variance may have a lower MSE than an unbiased estimator.\n\n### Ridge regression\n\n\\index{Regression!Ridge}\n\nOne biased estimator is known as Ridge regression. @Hoerl1970 propose to minimize the sum of squared errors *while simultaneously imposing a penalty on the $L_2$ norm of the parameters* $\\hat\\beta$. Formally, this means that for a penalty factor $\\lambda\\geq 0$ the minimization problem takes the form $\\min_\\beta \\left(y - X\\beta\\right)'\\left(y - X\\beta\\right)\\text{ s.t. } \\beta'\\beta \\leq c$. Here $c\\geq 0$ is a constant that depends on the choice of $\\lambda$. The larger $\\lambda$, the smaller $c$ (technically speaking, there is a one-to-one relationship between $\\lambda$, which corresponds to the Lagrangian of the minimization problem above and $c$). Here, $X = \\left(x_1 \\ldots x_T\\right)'$ and $y = \\left(y_1, \\ldots, y_T\\right)'$. A closed-form solution for the resulting regression coefficient vector $\\beta^\\text{ridge}$ exists: $$\\hat{\\beta}^\\text{ridge} = \\left(X'X + \\lambda I\\right)^{-1}X'y.$$ A couple of observations are worth noting: $\\hat\\beta^\\text{ridge} = \\hat\\beta^\\text{ols}$ for $\\lambda = 0$ and $\\hat\\beta^\\text{ridge} \\rightarrow 0$ for $\\lambda\\rightarrow \\infty$. Also for $\\lambda > 0$, $\\left(X'X + \\lambda I\\right)$ is non-singular even if $X'X$ is which means that $\\hat\\beta^\\text{ridge}$ exists even if $\\hat\\beta$ is not defined. 
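\n\nThe closed-form solution is easy to verify with a few lines of base R. The sketch below uses simulated data and an arbitrary value of $\\lambda$; it also confirms that the Ridge estimator collapses to the OLS estimator for $\\lambda = 0$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Closed-form Ridge estimator on simulated data\nset.seed(1234)\nX <- matrix(rnorm(100 * 3), ncol = 3)\ny <- X %*% c(1, 0.5, 0) + rnorm(100)\n\nridge_estimator <- function(X, y, lambda = 0) {\n  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)\n}\n\ncbind(\n  ols = as.numeric(ridge_estimator(X, y, lambda = 0)),\n  ridge = as.numeric(ridge_estimator(X, y, lambda = 10))\n)\n```\n:::\n\n\n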
However, note also that the Ridge estimator requires careful choice of the hyperparameter $\\lambda$ which controls the *amount of regularization*: a larger value of $\\lambda$ implies more *shrinkage* of the regression coefficients toward 0, while a smaller value of $\\lambda$ reduces the bias of the resulting estimator. \n\n::: {.calloutnote}\nNote that $X$ usually contains an intercept column with ones. As a general rule, the associated intercept coefficient is not penalized. In practice, this often implies that $y$ is simply demeaned before computing $\\hat\\beta^\\text{ridge}$.\n::: \n\nWhat about the statistical properties of the Ridge estimator? First, the bad news is that $\\hat\\beta^\\text{ridge}$ is a biased estimator of $\\beta$. However, the good news is that (under homoscedastic error terms) the variance of the Ridge estimator is guaranteed to be *smaller* than the variance of the ordinary least squares estimator. We encourage you to verify these two statements in the exercises. As a result, we face a trade-off: The Ridge regression sacrifices some unbiasedness to achieve a smaller variance than the OLS estimator.\n\n### Lasso\n\n\\index{Regression!Lasso}\n\nAn alternative to Ridge regression is the Lasso (*l*east *a*bsolute *s*hrinkage and *s*election *o*perator). Similar to Ridge regression, the Lasso [@Tibshirani1996] is a penalized and biased estimator. \nThe main difference from Ridge regression is that Lasso does not only *shrink* coefficients but effectively selects variables by setting coefficients for *irrelevant* variables to zero. Lasso implements an $L_1$ penalization on the parameters such that: $$\\hat\\beta^\\text{Lasso} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right)\\text{ s.t. } \\sum\\limits_{k=1}^K|\\beta_k| < c(\\lambda).$$ There is no closed-form solution for $\\hat\\beta^\\text{Lasso}$ in the above minimization problem, but efficient algorithms exist (e.g., the R package `glmnet`). Like for Ridge regression, the hyperparameter $\\lambda$ has to be specified beforehand.\n\n### Elastic Net\n\nThe Elastic Net [@Zou2005] combines $L_1$ with $L_2$ penalization and encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. This more general framework considers the following optimization problem: $$\\hat\\beta^\\text{EN} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right) + \\lambda\\rho\\sum\\limits_{k=1}^K|\\beta_k| +\\frac{1}{2}\\lambda(1-\\rho)\\sum\\limits_{k=1}^K\\beta_k^2.$$ Now, we have to choose two hyperparameters: the *shrinkage* factor $\\lambda$ and the *weighting parameter* $\\rho$. The Elastic Net resembles the Lasso for $\\rho = 1$ and Ridge regression for $\\rho = 0$, which mirrors the parametrization of `glmnet` and of the `mixture` argument that we use below. While the R package `glmnet` provides efficient algorithms to compute the coefficients of penalized regressions, it is a good exercise to implement Ridge and Lasso estimation on your own before you use the `glmnet` package or the `tidymodels` back-end.\n\n## Data Preparation\n\nTo get started, we load the required R packages and data. The main focus is on the workflow behind the `tidymodels` package collection [@tidymodels]. \n@Kuhn2022 provide a thorough introduction to all `tidymodels` components. `glmnet` [@glmnet], co-developed by the author of @Tibshirani1996, provides an R implementation of Elastic Net estimation. The package `timetk` [@timetk] provides useful tools for time series data wrangling. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(scales)\nlibrary(furrr)\nlibrary(glmnet)\nlibrary(timetk)\n```\n:::\n\n\nIn this analysis, we use four different data sources that we load from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). We start with two different sets of factor portfolio returns which have been suggested as representing practical risk factor exposure and thus should be relevant when it comes to asset pricing applications. \n\n- The standard workhorse: monthly Fama-French 3 factor returns (market, small-minus-big, and high-minus-low book-to-market valuation sorts) defined in @Fama1992 and @Fama1993\n- Monthly q-factor returns from @Hou2015. The factors contain the size factor, the investment factor, the return-on-equity factor, and the expected growth factor\n\nNext, we include macroeconomic predictors which may predict the general stock market economy. Macroeconomic variables effectively serve as conditioning information such that their inclusion hints at the relevance of conditional models instead of unconditional asset pricing. We refer the interested reader to @Cochrane2009 on the role of conditioning information.\n\n- Our set of macroeconomic predictors comes from @Goyal2008. The data has been updated by the authors until 2021 and contains monthly variables that have been suggested as good predictors for the equity premium. Some of the variables are the dividend price ratio, earnings price ratio, stock variance, net equity expansion, treasury bill rate, and inflation\n\nFinally, we need a set of *test assets*. The aim is to understand which of the plenty factors and macroeconomic variable combinations prove helpful in explaining our test assets' cross-section of returns. 
\nIn line with many existing papers, we use monthly portfolio returns from 10 different industries according to the definition from [Kenneth French's homepage](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_10_ind_port.html) as test assets.\\index{Data!Fama-French factors}\\index{Data!q-factors}\\index{Data!Macro predictors}\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n collect() |>\n rename_with(~ str_c(\"factor_ff_\", .), -month)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\") |>\n collect() |>\n rename_with(~ str_c(\"factor_q_\", .), -month)\n\nmacro_predictors <- tbl(tidy_finance, \"macro_predictors\") |>\n collect() |>\n rename_with(~ str_c(\"macro_\", .), -month) |>\n select(-macro_rp_div)\n\nindustries_ff_monthly <- tbl(tidy_finance, \"industries_ff_monthly\") |>\n collect() |>\n pivot_longer(-month,\n names_to = \"industry\", values_to = \"ret\"\n ) |>\n arrange(desc(industry)) |> \n mutate(industry = as_factor(industry))\n```\n:::\n\n\nWe combine all the monthly observations into one data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- industries_ff_monthly |>\n left_join(factors_ff3_monthly, join_by(month)) |>\n left_join(factors_q_monthly, join_by(month)) |>\n left_join(macro_predictors, join_by(month)) |>\n mutate(\n ret = ret - factor_ff_rf\n ) |>\n select(month, industry, ret_excess = ret, everything()) |>\n drop_na()\n```\n:::\n\n\nOur data contains 22 columns of regressors with the 13 macro variables and 8 factor returns for each month. \n@fig-1401 provides summary statistics for the 10 monthly industry excess returns in percent.\\index{Graph!Box plot}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata |>\n group_by(industry) |>\n ggplot(aes(x = industry, y = ret_excess)) +\n geom_boxplot() +\n coord_flip() +\n labs(\n x = NULL, y = NULL,\n title = \"Excess return distributions by industry in percent\"\n ) +\n scale_y_continuous(\n labels = percent\n )\n```\n\n::: {.cell-output-display}\n![The box plots show the monthly dispersion of returns for 10 different industries.](factor-selection-via-machine-learning_files/figure-html/fig-1401-1.png){#fig-1401 fig-alt='Title: Excess return distributions by industry in percent. The figure shows boxplots that visualize the industry\\'s excess return distribution. All industry returns are centered around zero and exhibit substantial outliers in the magnitude of 20 percent on a monthly basis.' width=2100}\n:::\n:::\n\n\n## The tidymodels Workflow\n\nTo illustrate penalized linear regressions, we employ the `tidymodels` collection of packages for modeling and ML using `tidyverse` principles. You can simply use `install.packages(\"tidymodels\")` to get access to all the related packages. We recommend checking out the work of @Kuhn2022: They continuously write on their great book ['Tidy Modeling with R'](https://www.tmwr.org/) using tidy principles.\n\nThe `tidymodels` workflow encompasses the main stages of the modeling process: pre-processing of data, model fitting, and post-processing of results. 
As we demonstrate below, `tidymodels` provides efficient workflows that you can update with low effort.\n\nUsing the ideas of Ridge and Lasso regressions, the following example guides you through (i) pre-processing the data (data split and variable mutation), (ii) building models, (iii) fitting models, and (iv) tuning models to create the \"best\" possible predictions.\n\nTo start, we restrict our analysis to just one industry: Manufacturing. We first split the sample into a *training* and a *test* set. \nFor that purpose, `tidymodels` provides the function `initial_time_split()` from the `rsample` package [@rsample]. \nThe split takes the last 20% of the data as a test set, which is not used for any model tuning. \nWe use this test set to evaluate the predictive accuracy in an out-of-sample scenario.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsplit <- initial_time_split(\n data |>\n filter(industry == \"manuf\") |>\n select(-industry),\n prop = 4 / 5\n)\nsplit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n<536/135/671>\n```\n\n\n:::\n:::\n\n\nThe object `split` simply keeps track of the observations of the training and the test set. \nWe can call the training set with `training(split)`, while we can extract the test set with `testing(split)`.\n\n### Pre-process data\n\nRecipes help you pre-process your data before training your model. Recipes are a series of pre-processing steps such as variable selection, transformation, or conversion of qualitative predictors to indicator variables. Each recipe starts with a `formula` that defines the general structure of the dataset and the role of each variable (regressor or dependent variable). For our dataset, our recipe contains the following steps before we fit any model:\n\n- Our formula defines that we want to explain excess returns with all available predictors. The regression equation thus takes the form \n$$r_{t} = \\alpha_0 + \\left(\\tilde f_t \\otimes \\tilde z_t\\right)B + \\varepsilon_t $$ where $r_t$ is the vector of industry excess returns at time $t$ and $\\tilde f_t$ and $\\tilde z_t$ are the (standardized) vectors of factor portfolio returns and macroeconomic variables\n- We exclude the column *month* from the analysis\n- We include all interaction terms between factors and macroeconomic predictors\n- We demean and scale each regressor such that the standard deviation is one\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrec <- recipe(ret_excess ~ ., data = training(split)) |>\n step_rm(month) |>\n step_interact(terms = ~ contains(\"factor\"):contains(\"macro\")) |>\n step_normalize(all_predictors())\n```\n:::\n\n\nA table of all available recipe steps can be found [in the `tidymodels` documentation.](https://www.tidymodels.org/find/recipes/) As of 2024, more than 150 different processing steps are available! One important point: The definition of a recipe does not trigger any calculations yet but rather provides a *description* of the tasks to be applied. As a result, it is very easy to *reuse* recipes for different models and thus make sure that the outcomes are comparable as they are based on the same input. \nIn the example above, it does not make a difference whether you use the input `data = training(split)` or `data = testing(split)`. \nAll that matters at this early stage are the column names and types.\n\nWe can apply the recipe to any data with a suitable structure. The code below combines two different functions: `prep()` estimates the required parameters from a training set that can be applied to other data sets later. 
`bake()` applies the processed computations to new data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_prep <- prep(rec, training(split))\n```\n:::\n\nThe object `data_prep` contains information related to the different preprocessing steps applied to the training data: E.g., it is necessary to compute sample means and standard deviations to center and scale the variables. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_bake <- bake(data_prep,\n  new_data = testing(split)\n)\ndata_bake\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 135 × 126\n  factor_ff_mkt_excess factor_ff_smb factor_ff_hml factor_ff_rf\n                 <dbl>         <dbl>         <dbl>        <dbl>\n1               -1.73         -1.10         -0.711        -1.78\n2                2.35          0.949        -0.0937       -1.78\n3               -0.156        -0.122        -0.284        -1.78\n4                0.0646       -0.256         0.425        -1.78\n5                0.995         0.563        -0.456        -1.78\n# ℹ 130 more rows\n# ℹ 122 more variables: factor_q_me <dbl>, factor_q_ia <dbl>,\n#   factor_q_roe <dbl>, factor_q_eg <dbl>, macro_dp <dbl>,\n#   macro_dy <dbl>, macro_ep <dbl>, macro_de <dbl>,\n#   macro_svar <dbl>, macro_bm <dbl>, macro_ntis <dbl>,\n#   macro_tbl <dbl>, macro_lty <dbl>, macro_ltr <dbl>,\n#   macro_tms <dbl>, macro_dfy <dbl>, macro_infl <dbl>, …\n```\n\n\n:::\n:::\n\n\nNote that the resulting data contains the 135 observations from the test set and 126 columns. Why so many? Recall that the recipe instructs us to compute every possible interaction term between the factors and predictors, which increases the dimension of the data matrix substantially. \n\nYou may ask at this stage: why should I use a recipe instead of simply using data wrangling commands such as `mutate()` or `select()`? The beauty of `tidymodels` is that a lot is happening under the hood. Recall that for the simple scaling step, you actually have to compute the standard deviation of each column, then *store* this value, and apply the identical transformation to a different dataset, e.g., `testing(split)`. A prepped `recipe` stores these values and hands them over once you `bake()` a novel dataset. Easy as pie with `tidymodels`, isn't it?\n\n### Build a model\n\n\\index{Regression}\nNext, we can build an actual model based on our pre-processed data. In line with the definition above, we estimate regression coefficients of a Lasso regression such that we get \n$$\\begin{aligned}\\hat\\beta_\\lambda^\\text{Lasso} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right) + \\lambda\\sum\\limits_{k=1}^K|\\beta_k|.\\end{aligned}$$ We want to emphasize that the `tidymodels` workflow for *any* model is very similar, irrespective of the specific model. As you will see further below, it is straightforward to fit Ridge regression coefficients and - later - Neural networks or Random forests with basically the same code. The structure is always as follows: create a so-called `workflow()` and use the `fit()` function. A table with all available model APIs is available [here.](https://www.tidymodels.org/find/parsnip/)\nFor now, we start with the linear regression model with a given value for the penalty factor $\\lambda$. In the setup below, `mixture` denotes the value of $\\rho$, hence setting `mixture = 1` implies the Lasso.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_model <- linear_reg(\n  penalty = 0.0001,\n  mixture = 1\n) |>\n  set_engine(\"glmnet\", intercept = FALSE)\n```\n:::\n\n\nThat's it - we are done! The object `lm_model` contains the definition of our model with all required information. Note that `set_engine(\"glmnet\")` indicates the API character of the `tidymodels` workflow: Under the hood, the package `glmnet` is doing the heavy lifting, while `linear_reg()` provides a unified framework to collect the inputs. 
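\n\nIf you are curious how this unified specification maps to the engine, `translate()` from `parsnip` shows the template of the `glmnet::glmnet()` call that is eventually executed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Show how the parsnip specification translates into the underlying glmnet call\nlm_model |>\n  translate()\n```\n:::\n\n\n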
The `workflow` ends with combining everything necessary for the serious data science workflow, namely, a recipe and a model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_fit <- workflow() |>\n add_recipe(rec) |>\n add_model(lm_model)\nlm_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n══ Workflow ═════════════════════════════════════════════════════════\nPreprocessor: Recipe\nModel: linear_reg()\n\n── Preprocessor ─────────────────────────────────────────────────────\n3 Recipe Steps\n\n• step_rm()\n• step_interact()\n• step_normalize()\n\n── Model ────────────────────────────────────────────────────────────\nLinear Regression Model Specification (regression)\n\nMain Arguments:\n penalty = 1e-04\n mixture = 1\n\nEngine-Specific Arguments:\n intercept = FALSE\n\nComputational engine: glmnet \n```\n\n\n:::\n:::\n\n\n### Fit a model\n\nWith the `workflow` from above, we are ready to use `fit()`. Typically, we use training data to fit the model. \nThe training data is pre-processed according to our recipe steps, and the Lasso regression coefficients are computed. \nFirst, we focus on the predicted values $\\hat{y}_t = x_t\\hat\\beta^\\text{Lasso}.$ @fig-1402 illustrates the projections for the *entire* time series of the manufacturing industry portfolio returns. The grey area indicates the out-of-sample period, which we did not use to fit the model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npredicted_values <- lm_fit |>\n fit(data = training(split)) |>\n predict(data |> filter(industry == \"manuf\")) |>\n bind_cols(data |> filter(industry == \"manuf\")) |>\n select(month,\n \"Fitted value\" = .pred,\n \"Realization\" = ret_excess\n ) |>\n pivot_longer(-month, names_to = \"Variable\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npredicted_values |>\n ggplot(aes(\n x = month, \n y = value, \n color = Variable,\n linetype = Variable\n )) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n color = NULL,\n linetype = NULL,\n title = \"Monthly realized and fitted manufacturing industry risk premia\"\n ) +\n scale_x_date(\n breaks = function(x) {\n seq.Date(\n from = min(x),\n to = max(x),\n by = \"5 years\"\n )\n },\n minor_breaks = function(x) {\n seq.Date(\n from = min(x),\n to = max(x),\n by = \"1 years\"\n )\n },\n expand = c(0, 0),\n labels = date_format(\"%Y\")\n ) +\n scale_y_continuous(\n labels = percent\n ) +\n annotate(\"rect\",\n xmin = testing(split) |> pull(month) |> min(),\n xmax = testing(split) |> pull(month) |> max(),\n ymin = -Inf, ymax = Inf,\n alpha = 0.5, fill = \"grey70\"\n )\n```\n\n::: {.cell-output-display}\n![The grey area corresponds to the out of sample period.](factor-selection-via-machine-learning_files/figure-html/fig-1402-1.png){#fig-1402 fig-alt='Title: Monthly realized and fitted manufacturing industry risk premium. The figure shows the time series of realized and predicted manufacturing industry risk premiums. The figure seems to indicate that the predictions capture most of the return dynamics.' width=2100}\n:::\n:::\n\n\nWhat do the estimated coefficients look like? To analyze these values and to illustrate the difference between the `tidymodels` workflow and the underlying `glmnet` package, it is worth computing the coefficients $\\hat\\beta^\\text{Lasso}$ directly. The code below estimates the coefficients for the Lasso and Ridge regression for the processed training data sample. Note that `glmnet` actually takes a vector `y` and the matrix of regressors $X$ as input. 
Moreover, `glmnet` requires choosing the penalty parameter $\\alpha$, which corresponds to $\\rho$ in the notation above. When using the `tidymodels` model API, such details do not need consideration.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- data_bake |>\n select(-ret_excess) |>\n as.matrix()\ny <- data_bake |> pull(ret_excess)\n\nfit_lasso <- glmnet(\n x = x,\n y = y,\n alpha = 1,\n intercept = FALSE,\n standardize = FALSE,\n lambda.min.ratio = 0\n)\n\nfit_ridge <- glmnet(\n x = x,\n y = y,\n alpha = 0,\n intercept = FALSE,\n standardize = FALSE,\n lambda.min.ratio = 0\n)\n```\n:::\n\n\nThe objects `fit_lasso` and `fit_ridge` contain an entire sequence of estimated coefficients for multiple values of the penalty factor $\\lambda$. @fig-1403 illustrates the trajectories of the regression coefficients as a function of the penalty factor. Both Lasso and Ridge coefficients converge to zero as the penalty factor increases.\\index{Graph!ML prediction path}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbind_rows(\n tidy(fit_lasso) |> mutate(Model = \"Lasso\"),\n tidy(fit_ridge) |> mutate(Model = \"Ridge\")\n) |>\n rename(\"Variable\" = term) |>\n ggplot(aes(x = lambda, y = estimate, color = Variable)) +\n geom_line() +\n scale_x_log10() +\n facet_wrap(~Model, scales = \"free_x\") +\n labs(\n x = \"Penalty factor (lambda)\", y = NULL,\n title = \"Estimated coefficient paths for different penalty factors\"\n ) +\n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![The penalty parameters are chosen iteratively to resemble the path from no penalization to a model that excludes all variables.](factor-selection-via-machine-learning_files/figure-html/fig-1403-1.png){#fig-1403 fig-alt='Title: Estimated coefficient paths for different penalty factors. The figure shows how estimated lasso and ridge coefficients tend to zero for a higher penalty parameter. Ridge trace is smooth, and Lasso exhibits non-linear behavior.' width=2100}\n:::\n:::\n\n\n::: {.rmdnote}\nOne word of caution: The package `glmnet` computes estimates of the coefficients $\\hat\\beta$ based on numerical optimization procedures. \nAs a result, the estimated coefficients for the [special case](https://parsnip.tidymodels.org/reference/glmnet-details.html) with no regularization ($\\lambda = 0$) can deviate from the standard OLS estimates.\n:::\n\n### Tune a model\n\nTo compute $\\hat\\beta_\\lambda^\\text{Lasso}$ , we simply imposed a value for the penalty hyperparameter $\\lambda$. Model tuning is the process of optimally selecting such hyperparameters. `tidymodels` provides extensive tuning options based on so-called *cross-validation*. Again, we refer to any treatment of cross-validation to get a more detailed discussion of the statistical underpinnings. Here we focus on the general idea and the implementation with `tidymodels`. \n\nThe goal for choosing $\\lambda$ (or any other hyperparameter, e.g., $\\rho$ for the Elastic Net) is to find a way to produce predictors $\\hat{Y}$ for an outcome $Y$ that minimizes the mean squared prediction error $\\text{MSPE} = E\\left( \\frac{1}{T}\\sum_{t=1}^T (\\hat{y}_t - y_t)^2 \\right)$. Unfortunately, the MSPE is not directly observable. We can only compute an estimate because our data is random and because we do not observe the entire population.\n\nObviously, if we train an algorithm on the same data that we use to compute the error, our estimate $\\hat{\\text{MSPE}}$ would indicate way better predictive accuracy than what we can expect in real out-of-sample data. 
The result is called overfitting.\n\nCross-validation is a technique that allows us to alleviate this problem. We approximate the true MSPE as the average of many MSPE obtained by creating predictions for $K$ new random samples of the data, none of them used to train the algorithm $\\frac{1}{K} \\sum_{k=1}^K \\frac{1}{T}\\sum_{t=1}^T \\left(\\hat{y}_t^k - y_t^k\\right)^2$. In practice, this is done by carving out a piece of our data and pretending it is an independent sample. We again divide the data into a training set and a test set. The MSPE on the test set is our measure for actual predictive ability, while we use the training set to fit models with the aim to find the *optimal* hyperparameter values. To do so, we further divide our training sample into (several) subsets, fit our model for a grid of potential hyperparameter values (e.g., $\\lambda$), and evaluate the predictive accuracy on an *independent* sample. This works as follows:\n\n1. Specify a grid of hyperparameters\n2. Obtain predictors $\\hat{y}_i(\\lambda)$ to denote the predictors for the used parameters $\\lambda$\n3. Compute $$\n \\text{MSPE}(\\lambda) = \\frac{1}{K} \\sum_{k=1}^K \\frac{1}{T}\\sum_{t=1}^T \\left(\\hat{y}_t^k(\\lambda) - y_t^k\\right)^2 \n $$ With K-fold cross-validation, we do this computation $K$ times. Simply pick a validation set with $M=T/K$ observations at random and think of these as random samples $y_1^k, \\dots, y_{\\tilde{T}}^k$, with $k=1$\n\nHow should you pick $K$? Large values of $K$ are preferable because the training data better imitates the original data. However, larger values of $K$ will have much higher computation time.\n`tidymodels` provides all required tools to conduct $K$-fold cross-validation. We just have to update our model specification and let `tidymodels` know which parameters to tune. In our case, we specify the penalty factor $\\lambda$ as well as the mixing factor $\\rho$ as *free* parameters. Note that it is simple to change an existing `workflow` with `update_model()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_model <- linear_reg(\n penalty = tune(),\n mixture = tune()\n) |>\n set_engine(\"glmnet\")\n\nlm_fit <- lm_fit |>\n update_model(lm_model)\n```\n:::\n\n\nFor our sample, we consider a time-series cross-validation sample. This means that we tune our models with 20 random samples of length five years with a validation period of four years. For a grid of possible hyperparameters, we then fit the model for each fold and evaluate $\\hat{\\text{MSPE}}$ in the corresponding validation set. Finally, we select the model specification with the lowest MSPE in the validation set. First, we define the cross-validation folds based on our training data only.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_folds <- time_series_cv(\n data = training(split),\n date_var = month,\n initial = \"5 years\",\n assess = \"48 months\",\n cumulative = FALSE,\n slice_limit = 20\n)\ndata_folds\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Time Series Cross Validation Plan \n# A tibble: 20 × 2\n splits id \n \n1 Slice01\n2 Slice02\n3 Slice03\n4 Slice04\n5 Slice05\n# ℹ 15 more rows\n```\n\n\n:::\n:::\n\n\nThen, we evaluate the performance for a grid of different penalty values. `tidymodels` provides functionalities to construct a suitable grid of hyperparameters with `grid_regular`. The code chunk below creates a $10 \\times 3$ hyperparameters grid. 
Then, the function `tune_grid()` evaluates all the models for each fold.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_tune <- lm_fit |>\n tune_grid(\n resample = data_folds,\n grid = grid_regular(penalty(), mixture(), levels = c(20, 3)),\n metrics = metric_set(rmse)\n )\n```\n:::\n\n\nAfter the tuning process, we collect the evaluation metrics (the root mean-squared error in our example) to identify the *optimal* model. @fig-1404 illustrates the average validation set's root mean-squared error for each value of $\\lambda$ and $\\rho$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(lm_tune) + \n aes(linetype = `Proportion of Lasso Penalty`) + \n guides(linetype = \"none\") +\n labs(\n x = \"Penalty factor (lambda)\",\n y = \"Root MSPE\",\n title = \"Root MSPE for different penalty factors\"\n ) + \n scale_x_log10()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nScale for x is already present.\nAdding another scale for x, which will replace the existing scale.\n```\n\n\n:::\n\n::: {.cell-output-display}\n![Evaluation of manufacturing excess returns for different penalty factors (lambda) and proportions of Lasso penalty (rho). 1.0 indicates Lasso, 0.5 indicates Elastic Net, and 0.0 indicates Ridge.](factor-selection-via-machine-learning_files/figure-html/fig-1404-1.png){#fig-1404 fig-alt='Title: Root MSPE for different penalty factors. The figure shows that more regularization does not affect the selected models in a meaningful fashion. At some point, the Elastic Net prediction error drops, which indicates the selected model. MSPE increases again for high penalization values.' width=2100}\n:::\n:::\n\n\n@fig-1404 shows that the cross-validated MSPE drops for Lasso and Elastic Net and spikes afterward. For Ridge regression, the MSPE increases above a certain threshold. Recall that the larger the regularization, the more restricted the model becomes. Thus, we would choose the model with the lowest MSPE.\n\n### Parallelized workflow\n\nOur starting point was the question: Which factors determine industry returns? While @Avramov2022b provide a Bayesian analysis related to the research question above, we choose a simplified approach: To illustrate the entire workflow, we now run the penalized regressions for all ten industries. \nWe want to identify relevant variables by fitting Lasso models for each industry returns time series. More specifically, we perform cross-validation for each industry to identify the optimal penalty factor $\\lambda$. \nThen, we use the set of `finalize_*()`-functions that take a list or tibble of tuning parameter values and update objects with those values. After determining the best model, we compute the final fit on the entire training set and analyze the estimated coefficients. \n\nFirst, we define the Lasso model with one tuning parameter.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlasso_model <- linear_reg(\n penalty = tune(),\n mixture = 1\n) |>\n set_engine(\"glmnet\")\n\nlm_fit <- lm_fit |>\n update_model(lasso_model)\n```\n:::\n\n\nThe following task can be easily parallelized to reduce computing time substantially. We use the parallelization capabilities of `furrr`. 
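In case you have not worked with `furrr` before, the idea is that `future_map()` and its variants mirror the familiar `purrr` interface but distribute the calls across the workers registered via `plan()`; the toy sketch below (with made-up inputs and two workers) illustrates the mechanic before we apply it to the ten industries.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy illustration: future_map_dbl() behaves like purrr::map_dbl(), but the\n# elements are processed in parallel on the workers registered via plan()\nlibrary(future)\nlibrary(furrr)\n\nplan(multisession, workers = 2)\nfuture_map_dbl(c(4, 9, 16), sqrt)\nplan(sequential)\n```\n:::\n\n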
Note that we can also just recycle all the steps from above and collect them in a function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect_variables <- function(input) {\n # Split into training and testing data\n split <- initial_time_split(input, prop = 4 / 5)\n\n # Data folds for cross-validation\n data_folds <- time_series_cv(\n data = training(split),\n date_var = month,\n initial = \"5 years\",\n assess = \"48 months\",\n cumulative = FALSE,\n slice_limit = 20\n )\n\n # Model tuning with the Lasso model\n lm_tune <- lm_fit |>\n tune_grid(\n resample = data_folds,\n grid = grid_regular(penalty(), levels = c(10)),\n metrics = metric_set(rmse)\n )\n\n # Identify the best model and fit with the training data\n lasso_lowest_rmse <- lm_tune |> select_by_one_std_err(\"rmse\")\n lasso_final <- finalize_workflow(lm_fit, lasso_lowest_rmse)\n lasso_final_fit <- last_fit(lasso_final, split, metrics = metric_set(rmse))\n\n # Extract the estimated coefficients\n estimated_coefficients <- lasso_final_fit |>\n extract_fit_parsnip() |>\n tidy() |>\n mutate(\n term = str_remove_all(term, \"factor_|macro_|industry_\")\n )\n\n return(estimated_coefficients)\n}\n\n# Parallelization\nplan(multisession, workers = availableCores())\n\n# Computation by industry\nselected_factors <- data |>\n nest(data = -industry) |>\n mutate(selected_variables = future_map(\n data, select_variables,\n .options = furrr_options(seed = TRUE)\n ))\n```\n:::\n\n\nWhat has just happened? In principle, exactly the same as before but instead of computing the Lasso coefficients for one industry, we did it for ten in parallel. The final option `seed = TRUE` is required to make the cross-validation process reproducible. \nNow, we just have to do some housekeeping and keep only variables that Lasso does *not* set to zero. We illustrate the results in a heat map in @fig-1405.\\index{Graph!Heat map}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselected_factors |>\n unnest(selected_variables) |>\n filter(\n term != \"(Intercept)\",\n estimate != 0\n ) |>\n add_count(term) |>\n mutate(\n term = str_remove_all(term, \"NA|ff_|q_\"),\n term = str_replace_all(term, \"_x_\", \" \"),\n term = fct_reorder(as_factor(term), n),\n term = fct_lump_min(term, min = 2),\n selected = 1\n ) |>\n filter(term != \"Other\") |>\n mutate(term = fct_drop(term)) |>\n complete(industry, term, fill = list(selected = 0)) |>\n ggplot(aes(industry,\n term,\n fill = as_factor(selected)\n )) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 70)) +\n scale_fill_manual(values = c(\"white\", \"grey30\")) +\n theme(legend.position = \"None\") +\n labs(\n x = NULL, y = NULL,\n title = \"Selected variables for different industries\"\n )\n```\n\n::: {.cell-output-display}\n![Grey areas indicate that the estimated Lasso regression coefficient is not set to zero. White fields show which variables get assigned a value of exactly zero.](factor-selection-via-machine-learning_files/figure-html/fig-1405-1.png){#fig-1405 fig-alt='Title: Selected variables for different industries. The figure shows which factors and macroeconomic predictors the Lasso model selected for the different industries. In general, there are not many selected variables. The market excess return is selected across all industries except for utilities.' width=2100}\n:::\n:::\n\n\nThe heat map in @fig-1405 conveys two main insights. 
\nFirst, we see a lot of white, which means that many factors, macroeconomic variables, and interaction terms are not relevant for explaining the cross-section of returns across the industry portfolios. In fact, only the market factor and the return-on-equity factor play a role for several industries. Second, there seems to be quite some heterogeneity across different industries. While barely any variable is selected by Lasso for Utilities, many factors are selected for, e.g., High-Tech and Durable, but they do not coincide at all. \nIn other words, there seems to be a clear picture that we do not need many factors, but Lasso does not provide a factor that consistently provides pricing abilities across industries.\n\n## Exercises \n\n1. Write a function that requires three inputs, namely, `y` (a $T$ vector), `X` (a $(T \\times K)$ matrix), and `lambda` and then returns the Ridge estimator (a $K$ vector) for a given penalization parameter $\\lambda$. Recall that the intercept should not be penalized. Therefore, your function should indicate whether $X$ contains a vector of ones as the first column, which should be exempt from the $L_2$ penalty.\n1. Compute the $L_2$ norm ($\\beta'\\beta$) for the regression coefficients based on the predictive regression from the previous exercise for a range of $\\lambda$'s and illustrate the effect of penalization in a suitable figure.\n1. Now, write a function that requires three inputs, namely,`y` (a $T$ vector), `X` (a $(T \\times K)$ matrix), and 'lambda` and then returns the Lasso estimator (a $K$ vector) for a given penalization parameter $\\lambda$. Recall that the intercept should not be penalized. Therefore, your function should indicate whether $X$ contains a vector of ones as the first column, which should be exempt from the $L_1$ penalty.\n1. After you understand what Ridge and Lasso regressions are doing, familiarize yourself with the `glmnet()` package's documentation. It is a thoroughly tested and well-established package that provides efficient code to compute the penalized regression coefficients for Ridge and Lasso and for combinations, commonly called *Elastic Nets*. ", "supporting": [ "factor-selection-via-machine-learning_files" ], diff --git a/_freeze/r/fama-macbeth-regressions/execute-results/html.json b/_freeze/r/fama-macbeth-regressions/execute-results/html.json index 0f1e7e23..f1c279be 100644 --- a/_freeze/r/fama-macbeth-regressions/execute-results/html.json +++ b/_freeze/r/fama-macbeth-regressions/execute-results/html.json @@ -3,9 +3,7 @@ "result": { "engine": "knitr", "markdown": "---\ntitle: Fama-MacBeth Regressions\naliases:\n - ../fama-macbeth-regressions.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Fama-MacBeth Regressions with R\n description-meta: Estimate risk premiums via Fama-MacBeth regressions using the programming language R.\n---\n\n\nIn this chapter, we present a simple implementation of @Fama1973, a regression approach commonly called Fama-MacBeth regressions. Fama-MacBeth regressions are widely used in empirical asset pricing studies. We use individual stocks as test assets to estimate the risk premium associated with the three factors included in @Fama1993.\n\nResearchers use the two-stage regression approach to estimate risk premiums in various markets, but predominately in the stock market. \nEssentially, the two-step Fama-MacBeth regressions exploit a linear relationship between expected returns and exposure to (priced) risk factors. 
\nThe basic idea of the regression approach is to project asset returns on factor exposures or characteristics that resemble exposure to a risk factor in the cross-section in each time period. \nThen, in the second step, the estimates are aggregated across time to test if a risk factor is priced. \nIn principle, Fama-MacBeth regressions can be used in the same way as portfolio sorts introduced in previous chapters.\n\n\\index{Regression!Fama-MacBeth}\\index{Fama-MacBeth} The Fama-MacBeth procedure is a simple two-step approach: \nThe first step uses the exposures (characteristics) as explanatory variables in $T$ cross-sectional regressions. For example, if $r_{i,t+1}$ denote the excess returns of asset $i$ in month $t+1$, then the famous Fama-French three factor model implies the following return generating process [see also @Campbell1998]:\n$$\\begin{aligned}r_{i,t+1} = \\alpha_i + \\lambda^{M}_t \\beta^M_{i,t} + \\lambda^{SMB}_t \\beta^{SMB}_{i,t} + \\lambda^{HML}_t \\beta^{HML}_{i,t} + \\epsilon_{i,t}.\\end{aligned}$$ \nHere, we are interested in the compensation $\\lambda^{f}_t$ for the exposure to each risk factor $\\beta^{f}_{i,t}$ at each time point, i.e., the risk premium. Note the terminology: $\\beta^{f}_{i,t}$ is a asset-specific characteristic, e.g., a factor exposure or an accounting variable. *If* there is a linear relationship between expected returns and the characteristic in a given month, we expect the regression coefficient to reflect the relationship, i.e., $\\lambda_t^{f}\\neq0$. \n\nIn the second step, the time-series average $\\frac{1}{T}\\sum_{t=1}^T \\hat\\lambda^{f}_t$ of the estimates $\\hat\\lambda^{f}_t$ can then be interpreted as the risk premium for the specific risk factor $f$. We follow @Zaffaroni2022 and consider the standard cross-sectional regression to predict future returns. If the characteristics are replaced with time $t+1$ variables, then the regression approach captures risk attributes rather than risk premiums. \n\nBefore we move to the implementation, we want to highlight that the characteristics, e.g., $\\hat\\beta^{f}_{i}$, are often estimated in a separate step before applying the actual Fama-MacBeth methodology. You can think of this as a *step 0*. You might thus worry that the errors of $\\hat\\beta^{f}_{i}$ impact the risk premiums' standard errors. Measurement error in $\\hat\\beta^{f}_{i}$ indeed affects the risk premium estimates, i.e., they lead to biased estimates. The literature provides adjustments for this bias [see, e.g., @Shanken1992; @Kim1995; @Chen2015, among others] but also shows that the bias goes to zero as $T \\to \\infty$. We refer to @Gagliardini2016 for an in-depth discussion also covering the case of time-varying betas. Moreover, if you plan to use Fama-MacBeth regressions with individual stocks: @Hou2020 advocates using weighed-least squares to estimate the coefficients such that they are not biased toward small firms. Without this adjustment, the high number of small firms would drive the coefficient estimates.\n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(sandwich)\nlibrary(broom)\n```\n:::\n\n\n## Data Preparation\n\nWe illustrate @Fama1973 with the monthly CRSP sample and use three characteristics to explain the cross-section of returns: market capitalization, the book-to-market ratio, and the CAPM beta (i.e., the covariance of the excess stock returns with the market excess returns). 
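Before we turn to the actual data, it may help to see the two-step logic in miniature: the sketch below simulates a toy panel with a single characteristic and a known premium of 0.5 (purely made-up numbers, not the CRSP sample) and recovers the premium with the same nest-and-regress pattern that we apply to the real data later in this chapter.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy Fama-MacBeth: run T cross-sectional regressions, then average the slopes\nset.seed(42)\n\ntoy_panel <- expand_grid(month = 1:120, permno = 1:50) |>\n  mutate(\n    characteristic = rnorm(n()),\n    ret_excess = 0.5 * characteristic + rnorm(n())\n  )\n\ntoy_premiums <- toy_panel |>\n  nest(data = -month) |>\n  mutate(estimates = map(\n    data,\n    ~ tidy(lm(ret_excess ~ characteristic, data = .x))\n  )) |>\n  unnest(estimates) |>\n  filter(term == \"characteristic\")\n\ntoy_premiums |>\n  summarize(\n    risk_premium = mean(estimate),\n    t_statistic = mean(estimate) / sd(estimate) * sqrt(n())\n  )\n```\n:::\n\n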
We collect the data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\\index{Data!Compustat}\\index{Beta}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, gvkey, month, ret_excess, mktcap) |>\n collect()\n\ncompustat <- tbl(tidy_finance, \"compustat\") |>\n select(datadate, gvkey, be) |>\n collect()\n\nbeta <- tbl(tidy_finance, \"beta\") |>\n select(month, permno, beta_monthly) |>\n collect()\n```\n:::\n\n\nWe use the Compustat and CRSP data to compute the book-to-market ratio and the (log) market capitalization.\\index{Book-to-market ratio}\\index{Market capitalization} \nFurthermore, we also use the CAPM betas based on monthly returns we computed in the previous chapters.\\index{Beta}\\index{CAPM}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncharacteristics <- compustat |>\n mutate(month = floor_date(ymd(datadate), \"month\")) |>\n left_join(crsp_monthly, by = c(\"gvkey\", \"month\")) |>\n left_join(beta, by = c(\"permno\", \"month\")) |>\n transmute(gvkey,\n bm = be / mktcap,\n log_mktcap = log(mktcap),\n beta = beta_monthly,\n sorting_date = month %m+% months(6)\n )\n\ndata_fama_macbeth <- crsp_monthly |>\n left_join(characteristics, by = c(\"gvkey\", \"month\" = \"sorting_date\")) |>\n group_by(permno) |>\n arrange(month) |>\n fill(c(beta, bm, log_mktcap), .direction = \"down\") |>\n ungroup() |>\n left_join(crsp_monthly |>\n select(permno, month, ret_excess_lead = ret_excess) |>\n mutate(month = month %m-% months(1)),\n by = c(\"permno\", \"month\")\n ) |>\n select(permno, month, ret_excess_lead, beta, log_mktcap, bm) |>\n drop_na()\n```\n:::\n\n\n## Cross-sectional Regression\n\nNext, we run the cross-sectional regressions with the characteristics as explanatory variables for each month. We regress the returns of the test assets at a particular time point on the characteristics of each asset. By doing so, we get an estimate of the risk premiums $\\hat\\lambda^{f}_t$ for each point in time. \\index{Regression!Cross-section}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrisk_premiums <- data_fama_macbeth |>\n nest(data = c(ret_excess_lead, beta, log_mktcap, bm, permno)) |>\n mutate(estimates = map(\n data,\n ~ tidy(lm(ret_excess_lead ~ beta + log_mktcap + bm, data = .x))\n )) |>\n unnest(estimates)\n```\n:::\n\n\n## Time-Series Aggregation\n\nNow that we have the risk premiums' estimates for each period, we can average across the time-series dimension to get the expected risk premium for each characteristic. Similarly, we manually create the $t$-test statistics for each regressor, which we can then compare to usual critical values of 1.96 or 2.576 for two-tailed significance tests. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nprice_of_risk <- risk_premiums |>\n group_by(factor = term) |>\n summarize(\n risk_premium = mean(estimate) * 100,\n t_statistic = mean(estimate) / sd(estimate) * sqrt(n())\n )\n```\n:::\n\n\nIt is common to adjust for autocorrelation when reporting standard errors of risk premiums. As in [Univariate Portfolio Sorts](univariate-portfolio-sorts.qmd), the typical procedure for this is computing @Newey1987 standard errors. 
We again recommend the data-driven approach of @Newey1994 using the `NeweyWest()` function, but note that you can enforce the typical 6 lag settings via `NeweyWest(., lag = 6, prewhite = FALSE)`.\\index{Standard errors!Newey-West}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nregressions_for_newey_west <- risk_premiums |>\n select(month, factor = term, estimate) |>\n nest(data = c(month, estimate)) |>\n mutate(\n model = map(data, ~ lm(estimate ~ 1, .)),\n mean = map(model, tidy)\n )\n\nprice_of_risk_newey_west <- regressions_for_newey_west |>\n mutate(newey_west_se = map_dbl(model, ~ sqrt(NeweyWest(.)))) |>\n unnest(mean) |>\n mutate(t_statistic_newey_west = estimate / newey_west_se) |>\n select(factor,\n risk_premium = estimate,\n t_statistic_newey_west\n )\n\nleft_join(price_of_risk,\n price_of_risk_newey_west |>\n select(factor, t_statistic_newey_west),\n by = \"factor\"\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 4 × 4\n factor risk_premium t_statistic t_statistic_newey_west\n \n1 (Intercept) 1.22 4.77 3.98 \n2 beta 0.00515 0.0499 0.0446\n3 bm 0.151 3.22 2.75 \n4 log_mktcap -0.104 -2.94 -2.60 \n```\n\n\n:::\n:::\n\n\nFinally, let us interpret the results. Stocks with higher book-to-market ratios earn higher expected future returns, which is in line with the value premium. The negative value for log market capitalization reflects the size premium for smaller stocks. Consistent with results from earlier chapters, we detect no relation between beta and future stock returns.\n\n## Exercises\n\n1. Download a sample of test assets from Kenneth French's homepage and reevaluate the risk premiums for industry portfolios instead of individual stocks.\n1. Use individual stocks with weighted-least squares based on a firm's size as suggested by @Hou2020. Then, repeat the Fama-MacBeth regressions without the weighting scheme adjustment but drop the smallest 20 percent of firms each month. Compare the results of the three approaches. ", - "supporting": [ - "fama-macbeth-regressions_files" - ], + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/r/fixed-effects-and-clustered-standard-errors/execute-results/html.json b/_freeze/r/fixed-effects-and-clustered-standard-errors/execute-results/html.json index 1f39dcd2..cc3f76ec 100644 --- a/_freeze/r/fixed-effects-and-clustered-standard-errors/execute-results/html.json +++ b/_freeze/r/fixed-effects-and-clustered-standard-errors/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "1f5671dc77c104a1b302a07db6d86106", + "hash": "4a0ef56e54495717886fe255ef8103ad", "result": { "engine": "knitr", - "markdown": "---\ntitle: Fixed Effects and Clustered Standard Errors\naliases:\n - ../fixed-effects-and-clustered-standard-errors\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Fixed Effects and Clustered Standard Errors with R\n description-meta: Dive into the implementation of fixed effects regressions and clustered standard errors in finance using the programming language R. \n---\n\n\nIn this chapter, we provide an intuitive introduction to the two popular concepts of *fixed effects regressions* and *clustered standard errors*. When working with regressions in empirical finance, you will sooner or later be confronted with discussions around how you deal with omitted variables bias and dependence in your residuals. 
The concepts we introduce in this chapter are designed to address such concerns.\n\nWe focus on a classical panel regression common to the corporate finance literature [e.g., @Fazzari1988;@Erickson2012;@Gulen2015]: firm investment modeled as a function that increases in firm cash flow and firm investment opportunities. \n\nTypically, this investment regression uses quarterly balance sheet data provided via Compustat because it allows for richer dynamics in the regressors and more opportunities to construct variables. As we focus on the implementation of fixed effects and clustered standard errors, we use the annual Compustat data from our previous chapters and leave the estimation using quarterly data as an exercise. We demonstrate below that the regression based on annual data yields qualitatively similar results to estimations based on quarterly data from the literature, namely confirming the positive relationships between investment and the two regressors. \n\nThe current chapter relies on the following set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(fixest)\n```\n:::\n\n\nCompared to previous chapters, we introduce `fixest` [@fixest] for the fixed effects regressions, the implementation of standard error clusters, and tidy estimation output.\n\n## Data Preparation\n\nWe use CRSP and annual Compustat as data sources from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). In particular, Compustat provides balance sheet and income statement data on a firm level, while CRSP provides market valuations. \\index{Data!CRSP}\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(gvkey, month, mktcap) |>\n collect()\n\ncompustat <- tbl(tidy_finance, \"compustat\") |>\n select(datadate, gvkey, year, at, be, capx, oancf, txdb) |>\n collect()\n```\n:::\n\n\nThe classical investment regressions model the capital investment of a firm as a function of operating cash flows and Tobin's q, a measure of a firm's investment opportunities.\\index{Cash flows}\\index{Regression!Investment} We start by constructing investment and cash flows which are usually normalized by lagged total assets of a firm. In the following code chunk, we construct a *panel* of firm-year observations, so we have both cross-sectional information on firms as well as time-series information for each firm.\\index{Regression!Panel}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment <- compustat |>\n mutate(month = floor_date(datadate, \"month\")) |>\n left_join(compustat |>\n select(gvkey, year, at_lag = at) |>\n mutate(year = year + 1),\n by = c(\"gvkey\", \"year\")\n ) |>\n filter(at > 0, at_lag > 0) |>\n mutate(\n investment = capx / at_lag,\n cash_flows = oancf / at_lag\n )\n\ndata_investment <- data_investment |>\n left_join(data_investment |>\n select(gvkey, year, investment_lead = investment) |>\n mutate(year = year - 1),\n by = c(\"gvkey\", \"year\")\n )\n```\n:::\n\n\nTobin's q is the ratio of the market value of capital to its replacement costs.\\index{Tobin's q} It is one of the most common regressors in corporate finance applications [e.g., @Fazzari1988; @Erickson2012]. 
We follow the implementation of @Gulen2015 and compute Tobin's q as the market value of equity (`mktcap`) plus the book value of assets (`at`) minus book value of equity (`be`) plus deferred taxes (`txdb`), all divided by book value of assets (`at`). Finally, we only keep observations where all variables of interest are non-missing, and the reported book value of assets is strictly positive.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment <- data_investment |>\n left_join(crsp_monthly,\n by = c(\"gvkey\", \"month\")\n ) |>\n mutate(tobins_q = (mktcap + at - be + txdb) / at) |>\n select(gvkey, year, investment_lead, cash_flows, tobins_q) |>\n drop_na()\n```\n:::\n\n\nAs the variable construction typically leads to extreme values that are most likely related to data issues (e.g., reporting errors), many papers include winsorization of the variables of interest. Winsorization involves replacing values of extreme outliers with quantiles on the respective end. The following function implements the winsorization for any percentage cut that should be applied on either end of the distributions.\\index{Winsorization} In the specific example, we winsorize the main variables (`investment`, `cash_flows`, and `tobins_q`) at the 1 percent level. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nwinsorize <- function(x, cut) {\n x <- replace(\n x,\n x > quantile(x, 1 - cut, na.rm = T),\n quantile(x, 1 - cut, na.rm = T)\n )\n x <- replace(\n x,\n x < quantile(x, cut, na.rm = T),\n quantile(x, cut, na.rm = T)\n )\n return(x)\n}\n\ndata_investment <- data_investment |>\n mutate(across(\n c(investment_lead, cash_flows, tobins_q),\n ~ winsorize(., 0.01)\n ))\n```\n:::\n\n\nBefore proceeding to any estimations, we highly recommend tabulating summary statistics of the variables that enter the regression. These simple tables allow you to check the plausibility of your numerical variables, as well as spot any obvious errors or outliers. Additionally, for panel data, plotting the time series of the variable's mean and the number of observations is a useful exercise to spot potential problems.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment |>\n pivot_longer(\n cols = c(investment_lead, cash_flows, tobins_q),\n names_to = \"measure\"\n ) |>\n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n(),\n .groups = \"drop\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 9\n measure mean sd min q05 q50 q95 max n\n \n1 cash_flows 0.0111 0.274 -1.56 -4.72e-1 0.0641 0.272 0.479 127446\n2 investmen… 0.0577 0.0772 0 6.75e-4 0.0327 0.206 0.464 127446\n3 tobins_q 2.00 1.70 0.573 7.94e-1 1.39 5.37 10.9 127446\n```\n\n\n:::\n:::\n\n\n## Fixed Effects \n\nTo illustrate fixed effects regressions, we use the `fixest` package, which is both computationally powerful and flexible with respect to model specifications.\\index{Regression!Fixed effects} We start out with the basic investment regression using the simple model\n$$ \\text{Investment}_{i,t+1} = \\alpha + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\varepsilon_t$ is i.i.d. normally distributed across time and firms. We use the `feols()`-function to estimate the simple model so that the output has the same structure as the other regressions below, but you could also use `lm()`. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_ols <- feols(\n fml = investment_lead ~ cash_flows + tobins_q,\n se = \"iid\",\n data = data_investment\n)\nmodel_ols\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,446 \nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.04254 0.000334 127.5 < 2.2e-16 ***\ncash_flows 0.04902 0.000793 61.8 < 2.2e-16 ***\ntobins_q 0.00731 0.000128 57.0 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.075503 Adj. R2: 0.043012\n```\n\n\n:::\n:::\n\n\nAs expected, the regression output shows significant coefficients for both variables. Higher cash flows and investment opportunities are associated with higher investment. However, the simple model actually may have a lot of omitted variables, so our coefficients are most likely biased. As there is a lot of unexplained variation in our simple model (indicated by the rather low adjusted R-squared), the bias in our coefficients is potentially severe, and the true values could be above or below zero. Note that there are no clear cutoffs to decide when an R-squared is high or low, but it depends on the context of your application and on the comparison of different models for the same data. \n\nOne way to tackle the issue of omitted variable bias is to get rid of as much unexplained variation as possible by including *fixed effects* - i.e., model parameters that are fixed for specific groups [e.g., @Wooldridge2010]. In essence, each group has its own mean in fixed effects regressions. The simplest group that we can form in the investment regression is the firm level. The firm fixed effects regression is then\n$$ \\text{Investment}_{i,t+1} = \\alpha_i + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\alpha_i$ is the firm fixed effect and captures the firm-specific mean investment across all years. In fact, you could also compute firms' investments as deviations from the firms' average investments and estimate the model without the fixed effects. The idea of the firm fixed effect is to remove the firm's average investment, which might be affected by firm-specific variables that you do not observe. For example, firms in a specific industry might invest more on average. Or you observe a young firm with large investments but only small concurrent cash flows, which will only happen in a few years. This sort of variation is unwanted because it is related to unobserved variables that can bias your estimates in any direction.\n\nTo include the firm fixed effect, we use `gvkey` (Compustat's firm identifier) as follows:\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fe_firm <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey,\n se = \"iid\",\n data = data_investment\n)\nmodel_fe_firm\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,446 \nFixed-effects: gvkey: 14,348\nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \ncash_flows 0.0141 0.000923 15.3 < 2.2e-16 ***\ntobins_q 0.0109 0.000133 82.0 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.049705 Adj. R2: 0.532649\n Within R2: 0.057237\n```\n\n\n:::\n:::\n\n\nThe regression output shows a lot of unexplained variation at the firm level that is taken care of by including the firm fixed effect as the adjusted R-squared rises above 50%. 
In fact, it is more interesting to look at the within R-squared that shows the explanatory power of a firm's cash flow and Tobin's q *on top* of the average investment of each firm. We can also see that the coefficients changed slightly in magnitude but not in sign.\n\nThere is another source of variation that we can get rid of in our setting: average investment across firms might vary over time due to macroeconomic factors that affect all firms, such as economic crises. By including year fixed effects, we can take out the effect of unobservables that vary over time. The two-way fixed effects regression is then\n$$ \\text{Investment}_{i,t+1} = \\alpha_i + \\alpha_t + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\alpha_t$ is the time fixed effect. Here you can think of higher investments during an economic expansion with simultaneously high cash flows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fe_firmyear <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n se = \"iid\",\n data = data_investment\n)\nmodel_fe_firmyear\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,446 \nFixed-effects: gvkey: 14,348, year: 35\nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \ncash_flows 0.01750 0.000902 19.4 < 2.2e-16 ***\ntobins_q 0.00987 0.000131 75.3 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.048415 Adj. R2: 0.556452\n Within R2: 0.050071\n```\n\n\n:::\n:::\n\n\nThe inclusion of time fixed effects did only marginally affect the R-squared and the coefficients, which we can interpret as a good thing as it indicates that the coefficients are not driven by an omitted variable that varies over time. \n\nHow can we further improve the robustness of our regression results? Ideally, we want to get rid of unexplained variation at the firm-year level, which means we need to include more variables that vary across firm *and* time and are likely correlated with investment. Note that we cannot include firm-year fixed effects in our setting because then cash flows and Tobin's q are colinear with the fixed effects, and the estimation becomes void. \n\nBefore we discuss the properties of our estimation errors, we want to point out that regression tables are at the heart of every empirical analysis, where you compare multiple models. Fortunately, the `etable()` function provides a convenient way to tabulate the regression output (with many parameters to customize and even print the output in LaTeX). We recommend printing $t$-statistics rather than standard errors in regression tables because the latter are typically very hard to interpret across coefficients that vary in size. We also do not print p-values because they are sometimes misinterpreted to signal the importance of observed effects [@Wasserstein2016]. The $t$-statistics provide a consistent way to interpret changes in estimation uncertainty across different model specifications. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\netable(\n model_ols, model_fe_firm, model_fe_firmyear,\n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_ols model_fe_firm model_fe_firm..\nDependent Var.: investment_lead investment_lead investment_lead\n \nConstant 0.042*** (127.5) \ncash_flows 0.049*** (61.8) 0.014*** (15.3) 0.018*** (19.4)\ntobins_q 0.007*** (57.0) 0.011*** (82.0) 0.010*** (75.3)\nFixed-Effects: ---------------- --------------- ---------------\ngvkey No Yes Yes\nyear No No Yes\n_______________ ________________ _______________ _______________\nVCOV type IID IID IID\nObservations 127,446 127,446 127,446\nR2 0.043 0.585 0.607\nWithin R2 -- 0.057 0.050\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\n## Clustering Standard Errors\n\nApart from biased estimators, we usually have to deal with potentially complex dependencies of our residuals with each other. Such dependencies in the residuals invalidate the i.i.d. assumption of OLS and lead to biased standard errors. With biased OLS standard errors, we cannot reliably interpret the statistical significance of our estimated coefficients. \n\nIn our setting, the residuals may be correlated across years for a given firm (time-series dependence), or, alternatively, the residuals may be correlated across different firms (cross-section dependence). One of the most common approaches to dealing with such dependence is the use of *clustered standard errors* [@Petersen2008].\\index{Standard errors!Clustered} The idea behind clustering is that the correlation of residuals *within* a cluster can be of any form. As the number of clusters grows, the cluster-robust standard errors become consistent [@Lang2007;@Wooldridge2010]. A natural requirement for clustering standard errors in practice is hence a sufficiently large number of clusters. Typically, around at least 30 to 50 clusters are seen as sufficient [@Cameron2011].\n\nInstead of relying on the iid assumption, we can use the cluster option in the `feols`-function as above. The code chunk below applies both one-way clustering by firm as well as two-way clustering by firm and year.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_cluster_firm <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n cluster = \"gvkey\",\n data = data_investment\n)\n\nmodel_cluster_firmyear <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n cluster = c(\"gvkey\", \"year\"),\n data = data_investment\n)\n```\n:::\n\n\n\\index{Robustness tests}\nThe table below shows the comparison of the different assumptions behind the standard errors. In the first column, we can see highly significant coefficients on both cash flows and Tobin's q. By clustering the standard errors on the firm level, the $t$-statistics of both coefficients drop in half, indicating a high correlation of residuals within firms. If we additionally cluster by year, we see a drop, particularly for Tobin's q, again. Even after relaxing the assumptions behind our standard errors, both coefficients are still comfortably significant as the $t$ statistics are well above the usual critical values of 1.96 or 2.576 for two-tailed significance tests.\n\n\n::: {.cell}\n\n```{.r .cell-code}\netable(\n model_fe_firmyear, model_cluster_firm, model_cluster_firmyear,\n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_fe_firm.. model_cluster.. 
model_cluster...1\nDependent Var.: investment_lead investment_lead investment_lead\n \ncash_flows 0.018*** (19.4) 0.018*** (11.3) 0.018*** (9.57)\ntobins_q 0.010*** (75.3) 0.010*** (35.4) 0.010*** (15.4)\nFixed-Effects: --------------- --------------- ---------------\ngvkey Yes Yes Yes\nyear Yes Yes Yes\n_______________ _______________ _______________ _______________\nVCOV type IID by: gvkey by: gvkey & year\nObservations 127,446 127,446 127,446\nR2 0.607 0.607 0.607\nWithin R2 0.050 0.050 0.050\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nInspired by @AbadieEtAl2017, we want to close this chapter by highlighting that choosing the right dimensions for clustering is a design problem. Even if the data is informative about whether clustering matters for standard errors, they do not tell you whether you should adjust the standard errors for clustering. Clustering at too aggregate levels can hence lead to unnecessarily inflated standard errors. \n\n## Exercises\n\n1. Estimate the two-way fixed effects model with two-way clustered standard errors using quarterly Compustat data from WRDS. Note that you can access quarterly data via `tbl(wrds, in_schema(\"comp\", \"fundq\"))`.\n1. Following @Peters2017, compute Tobin's q as the market value of outstanding equity `mktcap` plus the book value of debt (`dltt` + `dlc`) minus the current assets `atc` and everything divided by the book value of property, plant and equipment `ppegt`. What is the correlation between the measures of Tobin's q? What is the impact on the two-way fixed effects regressions?\n", + "markdown": "---\ntitle: Fixed Effects and Clustered Standard Errors\naliases:\n - ../fixed-effects-and-clustered-standard-errors\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Fixed Effects and Clustered Standard Errors with R\n description-meta: Dive into the implementation of fixed effects regressions and clustered standard errors in finance using the programming language R. \n---\n\n\nIn this chapter, we provide an intuitive introduction to the two popular concepts of *fixed effects regressions* and *clustered standard errors*. When working with regressions in empirical finance, you will sooner or later be confronted with discussions around how you deal with omitted variables bias and dependence in your residuals. The concepts we introduce in this chapter are designed to address such concerns.\n\nWe focus on a classical panel regression common to the corporate finance literature [e.g., @Fazzari1988;@Erickson2012;@Gulen2015]: firm investment modeled as a function that increases in firm cash flow and firm investment opportunities. \n\nTypically, this investment regression uses quarterly balance sheet data provided via Compustat because it allows for richer dynamics in the regressors and more opportunities to construct variables. As we focus on the implementation of fixed effects and clustered standard errors, we use the annual Compustat data from our previous chapters and leave the estimation using quarterly data as an exercise. We demonstrate below that the regression based on annual data yields qualitatively similar results to estimations based on quarterly data from the literature, namely confirming the positive relationships between investment and the two regressors. \n\nThe current chapter relies on the following set of R packages. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(fixest)\n```\n:::\n\n\nCompared to previous chapters, we introduce `fixest` [@fixest] for the fixed effects regressions, the implementation of standard error clusters, and tidy estimation output.\n\n## Data Preparation\n\nWe use CRSP and annual Compustat as data sources from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). In particular, Compustat provides balance sheet and income statement data on a firm level, while CRSP provides market valuations. \\index{Data!CRSP}\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(gvkey, month, mktcap) |>\n collect()\n\ncompustat <- tbl(tidy_finance, \"compustat\") |>\n select(datadate, gvkey, year, at, be, capx, oancf, txdb) |>\n collect()\n```\n:::\n\n\nThe classical investment regressions model the capital investment of a firm as a function of operating cash flows and Tobin's q, a measure of a firm's investment opportunities.\\index{Cash flows}\\index{Regression!Investment} We start by constructing investment and cash flows which are usually normalized by lagged total assets of a firm. In the following code chunk, we construct a *panel* of firm-year observations, so we have both cross-sectional information on firms as well as time-series information for each firm.\\index{Regression!Panel}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment <- compustat |>\n mutate(month = floor_date(datadate, \"month\")) |>\n left_join(compustat |>\n select(gvkey, year, at_lag = at) |>\n mutate(year = year + 1),\n join_by(gvkey, year)\n ) |>\n filter(at > 0, at_lag > 0) |>\n mutate(\n investment = capx / at_lag,\n cash_flows = oancf / at_lag\n )\n\ndata_investment <- data_investment |>\n left_join(data_investment |>\n select(gvkey, year, investment_lead = investment) |>\n mutate(year = year - 1),\n join_by(gvkey, year)\n )\n```\n:::\n\n\nTobin's q is the ratio of the market value of capital to its replacement costs.\\index{Tobin's q} It is one of the most common regressors in corporate finance applications [e.g., @Fazzari1988; @Erickson2012]. We follow the implementation of @Gulen2015 and compute Tobin's q as the market value of equity (`mktcap`) plus the book value of assets (`at`) minus book value of equity (`be`) plus deferred taxes (`txdb`), all divided by book value of assets (`at`). Finally, we only keep observations where all variables of interest are non-missing, and the reported book value of assets is strictly positive.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment <- data_investment |>\n left_join(crsp_monthly,\n join_by(gvkey, month)) |>\n mutate(tobins_q = (mktcap + at - be + txdb) / at) |>\n select(gvkey, year, investment_lead, cash_flows, tobins_q) |>\n drop_na()\n```\n:::\n\n\nAs the variable construction typically leads to extreme values that are most likely related to data issues (e.g., reporting errors), many papers include winsorization of the variables of interest. Winsorization involves replacing values of extreme outliers with quantiles on the respective end. 
The following function implements the winsorization for any percentage cut that should be applied on either end of the distributions.\\index{Winsorization} In the specific example, we winsorize the main variables (`investment`, `cash_flows`, and `tobins_q`) at the 1 percent level. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nwinsorize <- function(x, cut) {\n x <- replace(\n x,\n x > quantile(x, 1 - cut, na.rm = T),\n quantile(x, 1 - cut, na.rm = T)\n )\n x <- replace(\n x,\n x < quantile(x, cut, na.rm = T),\n quantile(x, cut, na.rm = T)\n )\n return(x)\n}\n\ndata_investment <- data_investment |>\n mutate(across(\n c(investment_lead, cash_flows, tobins_q),\n ~ winsorize(., 0.01)\n ))\n```\n:::\n\n\nBefore proceeding to any estimations, we highly recommend tabulating summary statistics of the variables that enter the regression. These simple tables allow you to check the plausibility of your numerical variables, as well as spot any obvious errors or outliers. Additionally, for panel data, plotting the time series of the variable's mean and the number of observations is a useful exercise to spot potential problems.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment |>\n pivot_longer(\n cols = c(investment_lead, cash_flows, tobins_q),\n names_to = \"measure\"\n ) |>\n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n(),\n .groups = \"drop\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 9\n measure mean sd min q05 q50 q95 max n\n \n1 cash_flows 0.0111 0.274 -1.56 -4.72e-1 0.0641 0.272 0.479 127450\n2 investmen… 0.0577 0.0772 0 6.75e-4 0.0327 0.206 0.464 127450\n3 tobins_q 2.00 1.70 0.573 7.94e-1 1.39 5.37 10.9 127450\n```\n\n\n:::\n:::\n\n\n## Fixed Effects \n\nTo illustrate fixed effects regressions, we use the `fixest` package, which is both computationally powerful and flexible with respect to model specifications.\\index{Regression!Fixed effects} We start out with the basic investment regression using the simple model\n$$ \\text{Investment}_{i,t+1} = \\alpha + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\varepsilon_t$ is i.i.d. normally distributed across time and firms. We use the `feols()`-function to estimate the simple model so that the output has the same structure as the other regressions below, but you could also use `lm()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_ols <- feols(\n fml = investment_lead ~ cash_flows + tobins_q,\n se = \"iid\",\n data = data_investment\n)\nmodel_ols\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,450 \nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.04254 0.000334 127.5 < 2.2e-16 ***\ncash_flows 0.04901 0.000793 61.8 < 2.2e-16 ***\ntobins_q 0.00731 0.000128 57.0 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.075502 Adj. R2: 0.043014\n```\n\n\n:::\n:::\n\n\nAs expected, the regression output shows significant coefficients for both variables. Higher cash flows and investment opportunities are associated with higher investment. However, the simple model actually may have a lot of omitted variables, so our coefficients are most likely biased. 
As there is a lot of unexplained variation in our simple model (indicated by the rather low adjusted R-squared), the bias in our coefficients is potentially severe, and the true values could be above or below zero. Note that there are no clear cutoffs to decide when an R-squared is high or low, but it depends on the context of your application and on the comparison of different models for the same data. \n\nOne way to tackle the issue of omitted variable bias is to get rid of as much unexplained variation as possible by including *fixed effects* - i.e., model parameters that are fixed for specific groups [e.g., @Wooldridge2010]. In essence, each group has its own mean in fixed effects regressions. The simplest group that we can form in the investment regression is the firm level. The firm fixed effects regression is then\n$$ \\text{Investment}_{i,t+1} = \\alpha_i + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\alpha_i$ is the firm fixed effect and captures the firm-specific mean investment across all years. In fact, you could also compute firms' investments as deviations from the firms' average investments and estimate the model without the fixed effects. The idea of the firm fixed effect is to remove the firm's average investment, which might be affected by firm-specific variables that you do not observe. For example, firms in a specific industry might invest more on average. Or you observe a young firm with large investments but only small concurrent cash flows, which will only happen in a few years. This sort of variation is unwanted because it is related to unobserved variables that can bias your estimates in any direction.\n\nTo include the firm fixed effect, we use `gvkey` (Compustat's firm identifier) as follows:\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fe_firm <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey,\n se = \"iid\",\n data = data_investment\n)\nmodel_fe_firm\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,450 \nFixed-effects: gvkey: 14,352\nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \ncash_flows 0.0141 0.000923 15.3 < 2.2e-16 ***\ntobins_q 0.0109 0.000133 82.0 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.049704 Adj. R2: 0.532639\n Within R2: 0.057238\n```\n\n\n:::\n:::\n\n\nThe regression output shows a lot of unexplained variation at the firm level that is taken care of by including the firm fixed effect as the adjusted R-squared rises above 50%. In fact, it is more interesting to look at the within R-squared that shows the explanatory power of a firm's cash flow and Tobin's q *on top* of the average investment of each firm. We can also see that the coefficients changed slightly in magnitude but not in sign.\n\nThere is another source of variation that we can get rid of in our setting: average investment across firms might vary over time due to macroeconomic factors that affect all firms, such as economic crises. By including year fixed effects, we can take out the effect of unobservables that vary over time. The two-way fixed effects regression is then\n$$ \\text{Investment}_{i,t+1} = \\alpha_i + \\alpha_t + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\alpha_t$ is the time fixed effect. 
Here you can think of higher investments during an economic expansion with simultaneously high cash flows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fe_firmyear <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n se = \"iid\",\n data = data_investment\n)\nmodel_fe_firmyear\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,450 \nFixed-effects: gvkey: 14,352, year: 35\nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \ncash_flows 0.01750 0.000902 19.4 < 2.2e-16 ***\ntobins_q 0.00987 0.000131 75.3 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.048414 Adj. R2: 0.556443\n Within R2: 0.050072\n```\n\n\n:::\n:::\n\n\nThe inclusion of time fixed effects did only marginally affect the R-squared and the coefficients, which we can interpret as a good thing as it indicates that the coefficients are not driven by an omitted variable that varies over time. \n\nHow can we further improve the robustness of our regression results? Ideally, we want to get rid of unexplained variation at the firm-year level, which means we need to include more variables that vary across firm *and* time and are likely correlated with investment. Note that we cannot include firm-year fixed effects in our setting because then cash flows and Tobin's q are colinear with the fixed effects, and the estimation becomes void. \n\nBefore we discuss the properties of our estimation errors, we want to point out that regression tables are at the heart of every empirical analysis, where you compare multiple models. Fortunately, the `etable()` function provides a convenient way to tabulate the regression output (with many parameters to customize and even print the output in LaTeX). We recommend printing $t$-statistics rather than standard errors in regression tables because the latter are typically very hard to interpret across coefficients that vary in size. We also do not print p-values because they are sometimes misinterpreted to signal the importance of observed effects [@Wasserstein2016]. The $t$-statistics provide a consistent way to interpret changes in estimation uncertainty across different model specifications. \n\n\n::: {.cell}\n\n```{.r .cell-code}\netable(\n model_ols, model_fe_firm, model_fe_firmyear,\n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_ols model_fe_firm model_fe_firm..\nDependent Var.: investment_lead investment_lead investment_lead\n \nConstant 0.043*** (127.5) \ncash_flows 0.049*** (61.8) 0.014*** (15.3) 0.018*** (19.4)\ntobins_q 0.007*** (57.0) 0.011*** (82.0) 0.010*** (75.3)\nFixed-Effects: ---------------- --------------- ---------------\ngvkey No Yes Yes\nyear No No Yes\n_______________ ________________ _______________ _______________\nVCOV type IID IID IID\nObservations 127,450 127,450 127,450\nR2 0.043 0.585 0.607\nWithin R2 -- 0.057 0.050\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\n## Clustering Standard Errors\n\nApart from biased estimators, we usually have to deal with potentially complex dependencies of our residuals with each other. Such dependencies in the residuals invalidate the i.i.d. assumption of OLS and lead to biased standard errors. With biased OLS standard errors, we cannot reliably interpret the statistical significance of our estimated coefficients. 
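\n\nOne simple way to gauge how far the residuals are from the i.i.d. benchmark is to inspect their correlation structure directly. The sketch below assumes that `feols()` kept every row of `data_investment` (the observation counts above suggest it did), so that the residual vector lines up with the panel, and computes the average first-order autocorrelation of the two-way fixed effects residuals within firms.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: within-firm autocorrelation of the two-way fixed effects residuals.\n# Assumes feols() did not drop observations, so resid() aligns with the rows.\ndata_investment |>\n  mutate(resid_firmyear = resid(model_fe_firmyear)) |>\n  group_by(gvkey) |>\n  filter(n() >= 5) |>\n  arrange(year, .by_group = TRUE) |>\n  summarize(\n    rho = cor(resid_firmyear, lag(resid_firmyear),\n      use = \"pairwise.complete.obs\"\n    ),\n    .groups = \"drop\"\n  ) |>\n  summarize(mean_rho = mean(rho, na.rm = TRUE))\n```\n:::\n\n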
\n\nIn our setting, the residuals may be correlated across years for a given firm (time-series dependence), or, alternatively, the residuals may be correlated across different firms (cross-section dependence). One of the most common approaches to dealing with such dependence is the use of *clustered standard errors* [@Petersen2008].\\index{Standard errors!Clustered} The idea behind clustering is that the correlation of residuals *within* a cluster can be of any form. As the number of clusters grows, the cluster-robust standard errors become consistent [@Lang2007;@Wooldridge2010]. A natural requirement for clustering standard errors in practice is hence a sufficiently large number of clusters. Typically, around at least 30 to 50 clusters are seen as sufficient [@Cameron2011].\n\nInstead of relying on the iid assumption, we can use the cluster option in the `feols`-function as above. The code chunk below applies both one-way clustering by firm as well as two-way clustering by firm and year.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_cluster_firm <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n cluster = \"gvkey\",\n data = data_investment\n)\n\nmodel_cluster_firmyear <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n cluster = c(\"gvkey\", \"year\"),\n data = data_investment\n)\n```\n:::\n\n\n\\index{Robustness tests}\nThe table below shows the comparison of the different assumptions behind the standard errors. In the first column, we can see highly significant coefficients on both cash flows and Tobin's q. By clustering the standard errors on the firm level, the $t$-statistics of both coefficients drop in half, indicating a high correlation of residuals within firms. If we additionally cluster by year, we see a drop, particularly for Tobin's q, again. Even after relaxing the assumptions behind our standard errors, both coefficients are still comfortably significant as the $t$ statistics are well above the usual critical values of 1.96 or 2.576 for two-tailed significance tests.\n\n\n::: {.cell}\n\n```{.r .cell-code}\netable(\n model_fe_firmyear, model_cluster_firm, model_cluster_firmyear,\n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_fe_firm.. model_cluster.. model_cluster...1\nDependent Var.: investment_lead investment_lead investment_lead\n \ncash_flows 0.018*** (19.4) 0.018*** (11.3) 0.018*** (9.57)\ntobins_q 0.010*** (75.3) 0.010*** (35.4) 0.010*** (15.4)\nFixed-Effects: --------------- --------------- ---------------\ngvkey Yes Yes Yes\nyear Yes Yes Yes\n_______________ _______________ _______________ _______________\nVCOV type IID by: gvkey by: gvkey & year\nObservations 127,450 127,450 127,450\nR2 0.607 0.607 0.607\nWithin R2 0.050 0.050 0.050\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nInspired by @AbadieEtAl2017, we want to close this chapter by highlighting that choosing the right dimensions for clustering is a design problem. Even if the data is informative about whether clustering matters for standard errors, they do not tell you whether you should adjust the standard errors for clustering. Clustering at too aggregate levels can hence lead to unnecessarily inflated standard errors. \n\n## Exercises\n\n1. Estimate the two-way fixed effects model with two-way clustered standard errors using quarterly Compustat data from WRDS. Note that you can access quarterly data via `tbl(wrds, in_schema(\"comp\", \"fundq\"))`.\n1. 
Following @Peters2017, compute Tobin's q as the market value of outstanding equity `mktcap` plus the book value of debt (`dltt` + `dlc`) minus the current assets `atc` and everything divided by the book value of property, plant and equipment `ppegt`. What is the correlation between the measures of Tobin's q? What is the impact on the two-way fixed effects regressions?\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/r/introduction-to-tidy-finance/execute-results/html.json b/_freeze/r/introduction-to-tidy-finance/execute-results/html.json index 16d9b1f7..049e909b 100644 --- a/_freeze/r/introduction-to-tidy-finance/execute-results/html.json +++ b/_freeze/r/introduction-to-tidy-finance/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "739476056eb443fe25639888af8b6c10", "result": { "engine": "knitr", - "markdown": "---\ntitle: Introduction to Tidy Finance\naliases:\n - ../introduction-to-tidy-finance.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Introduction to Tidy Finance with R\n description-meta: Learn how to use the programming language R for downloading and analyzing stock market data.\n---\n\n\nThe main aim of this chapter is to familiarize yourself with the tidyverse. We start by downloading and visualizing stock data from Yahoo!Finance. Then we move to a simple portfolio choice problem and construct the efficient frontier. These examples introduce you to our approach of *Tidy Finance*.\n\n## Working with Stock Market Data\n\nAt the start of each session, we load the required R packages. Throughout the entire book, we always use the `tidyverse` [@Wickham2019]. In this chapter, we also load the convenient `tidyquant` package [@tidyquant] to download price data. This package provides a convenient wrapper for various quantitative functions compatible with the `tidyverse`.\\index{tidyverse}\nFinally, the package `scales` [@scales] provides useful scale functions for visualizations.\n\nYou typically have to install a package once before you can load it. In case you have not done this yet, call `install.packages(\"tidyquant\")`. \\index{tidyquant} If you have trouble using `tidyquant`, check out the corresponding [documentation.](https://cran.r-project.org/web/packages/tidyquant/vignettes/TQ00-introduction-to-tidyquant.html)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tidyquant)\nlibrary(scales)\n```\n:::\n\n\nWe first download daily prices for one stock symbol, e.g., the Apple stock, *AAPL*, directly from the data provider Yahoo!Finance. To download the data, you can use the command `tq_get`. If you do not know how to use it, make sure you read the help file by calling `?tq_get`. We especially recommend taking a look at the examples section of the documentation. 
We request daily data for a period of more than 20 years.\\index{Stock prices}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprices <- tq_get(\"AAPL\",\n get = \"stock.prices\",\n from = \"2000-01-01\",\n to = \"2022-12-31\"\n)\nprices\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5,787 × 8\n symbol date open high low close volume adjusted\n \n1 AAPL 2000-01-03 0.936 1.00 0.908 0.999 535796800 0.848\n2 AAPL 2000-01-04 0.967 0.988 0.903 0.915 512377600 0.777\n3 AAPL 2000-01-05 0.926 0.987 0.920 0.929 778321600 0.788\n4 AAPL 2000-01-06 0.948 0.955 0.848 0.848 767972800 0.720\n5 AAPL 2000-01-07 0.862 0.902 0.853 0.888 460734400 0.754\n# ℹ 5,782 more rows\n```\n\n\n:::\n:::\n\n\n\\index{Data!YahooFinance} `tq_get` downloads stock market data from Yahoo!Finance if you do not specify another data source. The function returns a tibble with eight quite self-explanatory columns: `symbol`, `date`, the market prices at the `open`, `high`, `low`, and `close`, the daily `volume` (in the number of traded shares), and the `adjusted` price in USD. The adjusted prices are corrected for anything that might affect the stock price after the market closes, e.g., stock splits and dividends. These actions affect the quoted prices, but they have no direct impact on the investors who hold the stock. Therefore, we often rely on adjusted prices when it comes to analyzing the returns an investor would have earned by holding the stock continuously.\\index{Stock price adjustments}\n\nNext, we use the `ggplot2` package [@ggplot2] to visualize the time series of adjusted prices in @fig-100 . This package takes care of visualization tasks based on the principles of the grammar of graphics [@Wilkinson2012].\\index{Graph!Time series}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprices |>\n ggplot(aes(x = date, y = adjusted)) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n title = \"Apple stock prices between beginning of 2000 and end of 2022\"\n )\n```\n\n::: {.cell-output-display}\n![Prices are in USD, adjusted for dividend payments and stock splits.](introduction-to-tidy-finance_files/figure-html/fig-100-1.png){#fig-100 fig-alt='Title: Apple stock prices between the beginning of 2000 and the end of 2022. The figure shows that the stock price of Apple increased dramatically from about 1 USD to around 125 USD.' width=2100}\n:::\n:::\n\n\n\\index{Returns} Instead of analyzing prices, we compute daily net returns defined as $r_t = p_t / p_{t-1} - 1$, where $p_t$ is the adjusted day $t$ price. In that context, the function `lag()` is helpful, which returns the previous value in a vector.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns <- prices |>\n arrange(date) |>\n mutate(ret = adjusted / lag(adjusted) - 1) |>\n select(symbol, date, ret)\nreturns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5,787 × 3\n symbol date ret\n \n1 AAPL 2000-01-03 NA \n2 AAPL 2000-01-04 -0.0843\n3 AAPL 2000-01-05 0.0146\n4 AAPL 2000-01-06 -0.0865\n5 AAPL 2000-01-07 0.0474\n# ℹ 5,782 more rows\n```\n\n\n:::\n:::\n\n\nThe resulting tibble contains three columns, where the last contains the daily returns (`ret`). Note that the first entry naturally contains a missing value (`NA`) because there is no previous price.\\index{Missing value} Obviously, the use of `lag()` would be meaningless if the time series is not ordered by ascending dates.\\index{Lag observations} The command `arrange()` provides a convenient way to order observations in the correct way for our application. 
In case you want to order observations by descending dates, you can use `arrange(desc(date))`.\n\nFor the upcoming examples, we remove missing values as these would require separate treatment when computing, e.g., sample averages. In general, however, make sure you understand why `NA` values occur and carefully examine if you can simply get rid of these observations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns <- returns |>\n drop_na(ret)\n```\n:::\n\n\nNext, we visualize the distribution of daily returns in a histogram in @fig-101. \\index{Graph!Histogram} Additionally, we add a dashed line that indicates the 5 percent quantile of the daily returns to the histogram, which is a (crude) proxy for the worst return of the stock with a probability of at most 5 percent. The 5 percent quantile is closely connected to the (historical) value-at-risk, a risk measure commonly monitored by regulators. \\index{Value-at-risk} We refer to @Tsay2010 for a more thorough introduction to stylized facts of returns.\\index{Returns}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquantile_05 <- quantile(returns |> pull(ret), probs = 0.05)\nreturns |>\n ggplot(aes(x = ret)) +\n geom_histogram(bins = 100) +\n geom_vline(aes(xintercept = quantile_05),\n linetype = \"dashed\"\n ) +\n labs(\n x = NULL,\n y = NULL,\n title = \"Distribution of daily Apple stock returns\"\n ) +\n scale_x_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![The dotted vertical line indicates the historical 5 percent quantile.](introduction-to-tidy-finance_files/figure-html/fig-101-1.png){#fig-101 fig-alt='Title: Distribution of daily Apple stock returns in percent. The figure shows a histogram of daily returns. The range indicates a few large negative values, while the remaining returns are distributed around 0. The vertical line indicates that the historical 5 percent quantile of daily returns was around negative 3 percent.' width=2100}\n:::\n:::\n\n\nHere, `bins = 100` determines the number of bins used in the illustration and hence implicitly the width of the bins. Before proceeding, make sure you understand how to use the geom `geom_vline()` to add a dashed line that indicates the 5 percent quantile of the daily returns. A typical task before proceeding with *any* data is to compute summary statistics for the main variables of interest.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n )\n ))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 4\n ret_daily_mean ret_daily_sd ret_daily_min ret_daily_max\n \n1 0.00120 0.0251 -0.519 0.139\n```\n\n\n:::\n:::\n\n\nWe see that the maximum *daily* return was 13.905 percent. Perhaps not surprisingly, the average daily return is close to but slightly above 0. In line with the illustration above, the large losses on the day with the minimum returns indicate a strong asymmetry in the distribution of returns.\\\nYou can also compute these summary statistics for each year individually by imposing `group_by(year = year(date))`, where the call `year(date)` returns the year. More specifically, the few lines of code below compute the summary statistics from above for individual groups of data defined by year. 
The summary statistics, therefore, allow an eyeball analysis of the time-series dynamics of the return distribution.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns |>\n group_by(year = year(date)) |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n ),\n .names = \"{.fn}\"\n )) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 23 × 5\n year daily_mean daily_sd daily_min daily_max\n \n 1 2000 -0.00346 0.0549 -0.519 0.137 \n 2 2001 0.00233 0.0393 -0.172 0.129 \n 3 2002 -0.00121 0.0305 -0.150 0.0846\n 4 2003 0.00186 0.0234 -0.0814 0.113 \n 5 2004 0.00470 0.0255 -0.0558 0.132 \n 6 2005 0.00349 0.0245 -0.0921 0.0912\n 7 2006 0.000949 0.0243 -0.0633 0.118 \n 8 2007 0.00366 0.0238 -0.0702 0.105 \n 9 2008 -0.00265 0.0367 -0.179 0.139 \n10 2009 0.00382 0.0214 -0.0502 0.0676\n11 2010 0.00183 0.0169 -0.0496 0.0769\n12 2011 0.00104 0.0165 -0.0559 0.0589\n13 2012 0.00130 0.0186 -0.0644 0.0887\n14 2013 0.000472 0.0180 -0.124 0.0514\n15 2014 0.00145 0.0136 -0.0799 0.0820\n16 2015 0.0000199 0.0168 -0.0612 0.0574\n17 2016 0.000575 0.0147 -0.0657 0.0650\n18 2017 0.00164 0.0111 -0.0388 0.0610\n19 2018 -0.0000573 0.0181 -0.0663 0.0704\n20 2019 0.00266 0.0165 -0.0996 0.0683\n21 2020 0.00281 0.0294 -0.129 0.120 \n22 2021 0.00131 0.0158 -0.0417 0.0539\n23 2022 -0.000970 0.0225 -0.0587 0.0890\n```\n\n\n:::\n:::\n\n\n\\index{Summary statistics}\n\nIn case you wonder: the additional argument `.names = \"{.fn}\"` in `across()` determines how to name the output columns. The specification is rather flexible and allows almost arbitrary column names, which can be useful for reporting. The `print()` function simply controls the output options for the R console.\n\n## Scaling Up the Analysis\n\nAs a next step, we generalize the code from before such that all the computations can handle an arbitrary vector of symbols (e.g., all constituents of an index). Following tidy principles, it is quite easy to download the data, plot the price time series, and tabulate the summary statistics for an arbitrary number of assets.\n\nThis is where the `tidyverse` magic starts: tidy data makes it extremely easy to generalize the computations from before to as many assets as you like. The following code takes any vector of symbols, e.g., `symbol <- c(\"AAPL\", \"MMM\", \"BA\")`, and automates the download as well as the plot of the price time series. In the end, we create the table of summary statistics for an arbitrary number of assets. We perform the analysis with data from all current constituents of the [Dow Jones Industrial Average index.](https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average) \\index{Data!Dow Jones Index}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsymbols <- tq_index(\"DOW\") |> \n filter(company != \"US DOLLAR\")\nsymbols\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30 × 8\n symbol company identifier sedol weight sector shares_held\n \n1 UNH UNITEDHEALTH GRO… 91324P102 2917… 0.106 - 5487702\n2 MSFT MICROSOFT CORP 594918104 2588… 0.0641 - 5487702\n3 GS GOLDMAN SACHS GR… 38141G104 2407… 0.0605 - 5487702\n4 HD HOME DEPOT INC 437076102 2434… 0.0571 - 5487702\n5 AMGN AMGEN INC 031162100 2023… 0.0556 - 5487702\n# ℹ 25 more rows\n# ℹ 1 more variable: local_currency \n```\n\n\n:::\n:::\n\n\nConveniently, `tidyquant` provides a function to get all stocks in a stock index with a single call (similarly, `tq_exchange(\"NASDAQ\")` delivers all stocks currently listed on the NASDAQ exchange). 
\\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices <- tq_get(symbols,\n get = \"stock.prices\",\n from = \"2000-01-01\",\n to = \"2022-12-31\"\n)\n```\n:::\n\n\nThe resulting tibble contains 165593 daily observations for 30 different corporations. @fig-103 illustrates the time series of downloaded *adjusted* prices for each of the constituents of the Dow Jones index. Make sure you understand every single line of code! What are the arguments of `aes()`? Which alternative `geoms` could you use to visualize the time series? Hint: if you do not know the answers try to change the code to see what difference your intervention causes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices |>\n ggplot(aes(\n x = date,\n y = adjusted,\n color = symbol\n )) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n color = NULL,\n title = \"Stock prices of DOW index constituents\"\n ) +\n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![Prices in USD, adjusted for dividend payments and stock splits.](introduction-to-tidy-finance_files/figure-html/fig-103-1.png){#fig-103 fig-alt='Title: Stock prices of DOW index constituents. The figure shows many time series with daily prices. The general trend seems positive for most stocks in the DOW index.' width=2100}\n:::\n:::\n\n\nDo you notice the small differences relative to the code we used before? `tq_get(symbols)` returns a tibble for several symbols as well. All we need to do to illustrate all stock symbols simultaneously is to include `color = symbol` in the `ggplot` aesthetics. In this way, we generate a separate line for each symbol. Of course, there are simply too many lines on this graph to identify the individual stocks properly, but it illustrates the point well.\n\nThe same holds for stock returns. Before computing the returns, we use `group_by(symbol)` such that the `mutate()` command is performed for each symbol individually. 
The same logic also applies to the computation of summary statistics: `group_by(symbol)` is the key to aggregating the time series into symbol-specific variables of interest.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nall_returns <- index_prices |>\n group_by(symbol) |>\n mutate(ret = adjusted / lag(adjusted) - 1) |>\n select(symbol, date, ret) |>\n drop_na(ret)\n\nall_returns |>\n group_by(symbol) |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n ),\n .names = \"{.fn}\"\n )) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30 × 5\n symbol daily_mean daily_sd daily_min daily_max\n \n 1 AAPL 0.00120 0.0251 -0.519 0.139\n 2 AMGN 0.000489 0.0197 -0.134 0.151\n 3 AXP 0.000518 0.0229 -0.176 0.219\n 4 BA 0.000595 0.0224 -0.238 0.243\n 5 CAT 0.000709 0.0204 -0.145 0.147\n 6 CRM 0.00110 0.0270 -0.271 0.260\n 7 CSCO 0.000317 0.0237 -0.162 0.244\n 8 CVX 0.000553 0.0176 -0.221 0.227\n 9 DIS 0.000418 0.0195 -0.184 0.160\n10 DOW 0.000562 0.0260 -0.217 0.209\n11 GS 0.000550 0.0231 -0.190 0.265\n12 HD 0.000543 0.0194 -0.287 0.141\n13 HON 0.000515 0.0194 -0.174 0.282\n14 IBM 0.000273 0.0165 -0.155 0.120\n15 INTC 0.000285 0.0236 -0.220 0.201\n16 JNJ 0.000408 0.0122 -0.158 0.122\n17 JPM 0.000582 0.0242 -0.207 0.251\n18 KO 0.000337 0.0132 -0.101 0.139\n19 MCD 0.000533 0.0147 -0.159 0.181\n20 MMM 0.000378 0.0150 -0.129 0.126\n21 MRK 0.000383 0.0168 -0.268 0.130\n22 MSFT 0.000513 0.0194 -0.156 0.196\n23 NKE 0.000743 0.0194 -0.198 0.155\n24 PG 0.000377 0.0134 -0.302 0.120\n25 TRV 0.000569 0.0183 -0.208 0.256\n26 UNH 0.000984 0.0198 -0.186 0.348\n27 V 0.000929 0.0190 -0.136 0.150\n28 VZ 0.000239 0.0151 -0.118 0.146\n29 WBA 0.000284 0.0182 -0.150 0.166\n30 WMT 0.000314 0.0150 -0.114 0.117\n```\n\n\n:::\n:::\n\n\n\\index{Summary statistics}\n\nNote that you are now also equipped with all tools to download price data for *each* symbol listed in the S&P 500 index with the same number of lines of code. Just use `symbol <- tq_index(\"SP500\")`, which provides you with a tibble that contains each symbol that is (currently) part of the S&P 500.\\index{Data!SP 500} However, don't try this if you are not prepared to wait for a couple of minutes because this is quite some data to download!\n\n## Other Forms of Data Aggregation\n\nOf course, aggregation across variables other than `symbol` can also make sense. For instance, suppose you are interested in answering the question: Are days with high aggregate trading volume likely followed by days with high aggregate trading volume? To provide some initial analysis on this question, we take the downloaded data and compute aggregate daily trading volume for all Dow Jones constituents in USD. Recall that the column `volume` is denoted in the number of traded shares.\\index{Trading volume} Thus, we multiply the trading volume with the daily closing price to get a proxy for the aggregate trading volume in USD. 
Scaling by `1e9` (R can handle scientific notation) denotes daily trading volume in billion USD.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrading_volume <- index_prices |>\n group_by(date) |>\n summarize(trading_volume = sum(volume * adjusted))\n\ntrading_volume |>\n ggplot(aes(x = date, y = trading_volume)) +\n geom_line() +\n labs(\n x = NULL, y = NULL,\n title = \"Aggregate daily trading volume of DOW index constitutens\"\n ) +\n scale_y_continuous(labels = unit_format(unit = \"B\", scale = 1e-9))\n```\n\n::: {.cell-output-display}\n![Total daily trading volume in billion USD.](introduction-to-tidy-finance_files/figure-html/fig-104-1.png){#fig-104 fig-alt='Title: Aggregate daily trading volume. The figure shows a volatile time series of daily trading volume, ranging from 15 in 2000 to 20.5 in 2022, with a maximum of more than 100.' width=2100}\n:::\n:::\n\n\n@fig-104 indicates a clear upward trend in aggregated daily trading volume. In particular, since the outbreak of the COVID-19 pandemic, markets have processed substantial trading volumes, as analyzed, for instance, by @Goldstein2021.\\index{Covid 19} One way to illustrate the persistence of trading volume would be to plot volume on day $t$ against volume on day $t-1$ as in the example below. In @fig-105, we add a dotted 45°-line to indicate a hypothetical one-to-one relation by `geom_abline()`, addressing potential differences in the axes' scales.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrading_volume |>\n ggplot(aes(x = lag(trading_volume), y = trading_volume)) +\n geom_point() +\n geom_abline(aes(intercept = 0, slope = 1),\n linetype = \"dashed\"\n ) +\n labs(\n x = \"Previous day aggregate trading volume\",\n y = \"Aggregate trading volume\",\n title = \"Persistence in daily trading volume of DOW index constituents\"\n ) + \n scale_x_continuous(labels = unit_format(unit = \"B\", scale = 1e-9)) +\n scale_y_continuous(labels = unit_format(unit = \"B\", scale = 1e-9))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Removed 1 rows containing missing values (`geom_point()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n![Total daily trading volume in billion USD.](introduction-to-tidy-finance_files/figure-html/fig-105-1.png){#fig-105 fig-alt='Title: Persistence in daily trading volume of DOW index constituents. The figure shows a scatterplot where aggregate trading volume and previous-day aggregate trading volume neatly line up along a 45-degree line.' width=2100}\n:::\n:::\n\n\nDo you understand where the warning `## Warning: Removed 1 rows containing missing values (geom_point).` comes from and what it means? Purely eye-balling reveals that days with high trading volume are often followed by similarly high trading volume days.\\index{Error message}\n\n## Portfolio Choice Problems\n\nIn the previous part, we show how to download stock market data and inspect it with graphs and summary statistics. 
Now, we move to a typical question in Finance: how to allocate wealth across different assets optimally.\\index{Portfolio choice} The standard framework for optimal portfolio selection considers investors that prefer higher future returns but dislike future return volatility (defined as the square root of the return variance): the *mean-variance investor* [@Markowitz1952].\\index{Markowitz optimization}\n\n\\index{Efficient frontier} An essential tool to evaluate portfolios in the mean-variance context is the *efficient frontier*, the set of portfolios which satisfies the condition that no other portfolio exists with a higher expected return but with the same volatility (the square root of the variance, i.e., the risk), see, e.g., @Merton1972.\\index{Return volatility} We compute and visualize the efficient frontier for several stocks. First, we extract each asset's *monthly* returns. In order to keep things simple, we work with a balanced panel and exclude DOW constituents for which we do not observe a price on every single trading day since the year 2000.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices <- index_prices |>\n group_by(symbol) |>\n mutate(n = n()) |>\n ungroup() |>\n filter(n == max(n)) |>\n select(-n)\nreturns <- index_prices |>\n mutate(month = floor_date(date, \"month\")) |>\n group_by(symbol, month) |>\n summarize(price = last(adjusted), .groups = \"drop_last\") |>\n mutate(ret = price / lag(price) - 1) |>\n drop_na(ret) |>\n select(-price)\n```\n:::\n\n\nHere, `floor_date()` is a function from the `lubridate` package [@lubridate], which provides useful functions to work with dates and times.\n\nNext, we transform the returns from a tidy tibble into a $(T \\times N)$ matrix with one column for each of the $N$ symbols and one row for each of the $T$ trading days to compute the sample average return vector $$\\hat\\mu = \\frac{1}{T}\\sum\\limits_{t=1}^T r_t$$ where $r_t$ is the $N$ vector of returns on date $t$ and the sample covariance matrix $$\\hat\\Sigma = \\frac{1}{T-1}\\sum\\limits_{t=1}^T (r_t - \\hat\\mu)(r_t - \\hat\\mu)'.$$ We achieve this by using `pivot_wider()` with the new column names from the column `symbol` and setting the values to `ret`. We compute the vector of sample average returns and the sample variance-covariance matrix, which we consider as proxies for the parameters of the distribution of future stock returns. Thus, for simplicity, we refer to $\\Sigma$ and $\\mu$ instead of explicitly highlighting that the sample moments are estimates. \\index{Covariance} In later chapters, we discuss the issues that arise once we take estimation uncertainty into account.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns_matrix <- returns |>\n pivot_wider(\n names_from = symbol,\n values_from = ret\n ) |>\n select(-month)\nsigma <- cov(returns_matrix)\nmu <- colMeans(returns_matrix)\n```\n:::\n\n\nThen, we compute the minimum variance portfolio weights $\\omega_\\text{mvp}$ as well as the expected portfolio return $\\omega_\\text{mvp}'\\mu$ and volatility $\\sqrt{\\omega_\\text{mvp}'\\Sigma\\omega_\\text{mvp}}$ of this portfolio. \\index{Minimum variance portfolio} Recall that the minimum variance portfolio is the vector of portfolio weights that are the solution to $$\\omega_\\text{mvp} = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\sum\\limits_{i=1}^N\\omega_i = 1.$$ The constraint that weights sum up to one simply implies that all funds are distributed across the available asset universe, i.e., there is no possibility to retain cash. 
It is easy to show analytically that $\\omega_\\text{mvp} = \\frac{\\Sigma^{-1}\\iota}{\\iota'\\Sigma^{-1}\\iota}$, where $\\iota$ is a vector of ones and $\\Sigma^{-1}$ is the inverse of $\\Sigma$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nN <- ncol(returns_matrix)\niota <- rep(1, N)\nsigma_inv <- solve(sigma)\nmvp_weights <- sigma_inv %*% iota\nmvp_weights <- mvp_weights / sum(mvp_weights)\ntibble(\n average_ret = as.numeric(t(mvp_weights) %*% mu),\n volatility = as.numeric(sqrt(t(mvp_weights) %*% sigma %*% mvp_weights))\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 2\n average_ret volatility\n \n1 0.00792 0.0321\n```\n\n\n:::\n:::\n\n\nThe command `solve(A, b)` returns the solution of a system of equations $Ax = b$. If `b` is not provided, as in the example above, it defaults to the identity matrix such that `solve(sigma)` delivers $\\Sigma^{-1}$ (if a unique solution exists).\\\nNote that the *monthly* volatility of the minimum variance portfolio is of the same order of magnitude as the *daily* standard deviation of the individual components. Thus, the diversification benefits in terms of risk reduction are tremendous!\\index{Diversification}\n\nNext, we set out to find the weights for a portfolio that achieves, as an example, three times the expected return of the minimum variance portfolio. However, mean-variance investors are not interested in any portfolio that achieves the required return but rather in the efficient portfolio, i.e., the portfolio with the lowest standard deviation. If you wonder where the solution $\\omega_\\text{eff}$ comes from: \\index{Efficient portfolio} The efficient portfolio is chosen by an investor who aims to achieve minimum variance *given a minimum acceptable expected return* $\\bar{\\mu}$. Hence, their objective function is to choose $\\omega_\\text{eff}$ as the solution to $$\\omega_\\text{eff}(\\bar{\\mu}) = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1 \\text{ and } \\omega'\\mu \\geq \\bar{\\mu}.$$\n\nThe code below implements the analytic solution to this optimization problem for a benchmark return $\\bar\\mu$, which we set to 3 times the expected return of the minimum variance portfolio. We encourage you to verify that it is correct.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmark_multiple <- 3\nmu_bar <- benchmark_multiple * t(mvp_weights) %*% mu\nC <- as.numeric(t(iota) %*% sigma_inv %*% iota)\nD <- as.numeric(t(iota) %*% sigma_inv %*% mu)\nE <- as.numeric(t(mu) %*% sigma_inv %*% mu)\nlambda_tilde <- as.numeric(2 * (mu_bar - D / C) / (E - D^2 / C))\nefp_weights <- mvp_weights +\n lambda_tilde / 2 * (sigma_inv %*% mu - D * mvp_weights)\n```\n:::\n\n\n## The Efficient Frontier\n\n\\index{Separation theorem} The mutual fund separation theorem states that as soon as we have two efficient portfolios (such as the minimum variance portfolio $\\omega_\\text{mvp}$ and the efficient portfolio for a higher required level of expected returns $\\omega_\\text{eff}(\\bar{\\mu})$, we can characterize the entire efficient frontier by combining these two portfolios. That is, any linear combination of the two portfolio weights will again represent an efficient portfolio. \\index{Efficient frontier} The code below implements the construction of the *efficient frontier*, which characterizes the highest expected return achievable at each level of risk. 
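\n\nBefore turning to that code, a quick sanity check on the efficient portfolio weights computed above does not hurt (a small sketch): by construction, the weights should sum to one and deliver exactly the benchmark return $\\bar\\mu$, up to numerical noise.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: the efficient portfolio weights should sum to one and\n# meet the targeted expected return mu_bar (both up to numerical noise)\nsum(efp_weights)\nas.numeric(t(efp_weights) %*% mu - mu_bar)\n```\n:::\n\n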
To understand the code better, make sure to familiarize yourself with the inner workings of the `for` loop.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlength_year <- 12\na <- seq(from = -0.4, to = 1.9, by = 0.01)\nres <- tibble(\n a = a,\n mu = NA,\n sd = NA\n)\nfor (i in seq_along(a)) {\n w <- (1 - a[i]) * mvp_weights + (a[i]) * efp_weights\n res$mu[i] <- length_year * t(w) %*% mu \n res$sd[i] <- sqrt(length_year) * sqrt(t(w) %*% sigma %*% w)\n}\n```\n:::\n\n\nThe code above proceeds in two steps: First, we compute a vector of combination weights $a$ and then we evaluate the resulting linear combination with $a\\in\\mathbb{R}$:\\\n$$\\omega^* = a\\omega_\\text{eff}(\\bar\\mu) + (1-a)\\omega_\\text{mvp} = \\omega_\\text{mvp} + \\frac{\\lambda^*}{2}\\left(\\Sigma^{-1}\\mu -\\frac{D}{C}\\Sigma^{-1}\\iota \\right)$$ with $\\lambda^* = 2\\frac{a\\bar\\mu + (1-a)\\tilde\\mu - D/C}{E-D^2/C}$ where $C = \\iota'\\Sigma^{-1}\\iota$, $D=\\iota'\\Sigma^{-1}\\mu$, and $E=\\mu'\\Sigma^{-1}\\mu$. Finally, it is simple to visualize the efficient frontier alongside the two efficient portfolios within one powerful figure using `ggplot` (see @fig-106). We also add the individual stocks in the same call. We compute annualized returns based on the simple assumption that monthly returns are independent and identically distributed. Thus, the average annualized return is just 12 times the expected monthly return.\\index{Graph!Efficient frontier}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres |>\n ggplot(aes(x = sd, y = mu)) +\n geom_point() +\n geom_point(\n data = res |> filter(a %in% c(0, 1)),\n size = 4\n ) +\n geom_point(\n data = tibble(\n mu = length_year * mu, \n sd = sqrt(length_year) * sqrt(diag(sigma))\n ),\n aes(y = mu, x = sd), size = 1\n ) +\n labs(\n x = \"Annualized standard deviation\",\n y = \"Annualized expected return\",\n title = \"Efficient frontier for DOW index constituents\"\n ) +\n scale_x_continuous(labels = percent) +\n scale_y_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![The big dots indicate the location of the minimum variance and the efficient portfolio that delivers 3 times the expected return of the minimum variance portfolio, respectively. The small dots indicate the location of the individual constituents.](introduction-to-tidy-finance_files/figure-html/fig-106-1.png){#fig-106 fig-alt='Title: Efficient frontier for DOW index constituents. The figure shows DOW index constituents in a mean-variance diagram. A hyperbola indicates the efficient frontier of portfolios that dominate the individual holdings in the sense that they deliver higher expected returns for the same level of volatility.' width=2100}\n:::\n:::\n\n\nThe line in @fig-106 indicates the efficient frontier: the set of portfolios a mean-variance efficient investor would choose from. Compare the performance relative to the individual assets (the dots) - it should become clear that diversifying yields massive performance gains (at least as long as we take the parameters $\\Sigma$ and $\\mu$ as given).\n\n## Exercises\n\n1. Download daily prices for another stock market symbol of your choice from Yahoo!Finance with `tq_get()` from the `tidyquant` package. Plot two time series of the ticker’s un-adjusted and adjusted closing prices. Explain the differences.\n1. Compute daily net returns for an asset of your choice and visualize the distribution of daily returns in a histogram using 100 bins. 
Also, use `geom_vline()` to add a dashed red vertical line that indicates the 5 percent quantile of the daily returns. Compute summary statistics (mean, standard deviation, minimum and maximum) for the daily returns.\n1. Take your code from before and generalize it such that you can perform all the computations for an arbitrary vector of tickers (e.g., `ticker <- c(\"AAPL\", \"MMM\", \"BA\")`). Automate the download, the plot of the price time series, and create a table of return summary statistics for this arbitrary number of assets.\n1. Are days with high aggregate trading volume often also days with large absolute returns? Find an appropriate visualization to analyze the question using the ticker `AAPL`.\n1.Compute monthly returns from the downloaded stock market prices. Compute the vector of historical average returns and the sample variance-covariance matrix. Compute the minimum variance portfolio weights and the portfolio volatility and average returns. Visualize the mean-variance efficient frontier. Choose one of your assets and identify the portfolio which yields the same historical volatility but achieves the highest possible average return.\n1. In the portfolio choice analysis, we restricted our sample to all assets trading every day since 2000. How is such a decision a problem when you want to infer future expected portfolio performance from the results?\n1. The efficient frontier characterizes the portfolios with the highest expected return for different levels of risk. Identify the portfolio with the highest expected return per standard deviation. Which famous performance measure is close to the ratio of average returns to the standard deviation of returns?\n", + "markdown": "---\ntitle: Introduction to Tidy Finance\naliases:\n - ../introduction-to-tidy-finance.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Introduction to Tidy Finance with R\n description-meta: Learn how to use the programming language R for downloading and analyzing stock market data.\n---\n\n\nThe main aim of this chapter is to familiarize yourself with the tidyverse. We start by downloading and visualizing stock data from Yahoo!Finance. Then we move to a simple portfolio choice problem and construct the efficient frontier. These examples introduce you to our approach of *Tidy Finance*.\n\n## Working with Stock Market Data\n\nAt the start of each session, we load the required R packages. Throughout the entire book, we always use the `tidyverse` [@Wickham2019]. In this chapter, we also load the convenient `tidyquant` package [@tidyquant] to download price data. This package provides a convenient wrapper for various quantitative functions compatible with the `tidyverse`.\\index{tidyverse}\nFinally, the package `scales` [@scales] provides useful scale functions for visualizations.\n\nYou typically have to install a package once before you can load it. In case you have not done this yet, call `install.packages(\"tidyquant\")`. \\index{tidyquant} If you have trouble using `tidyquant`, check out the corresponding [documentation.](https://cran.r-project.org/web/packages/tidyquant/vignettes/TQ00-introduction-to-tidyquant.html)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tidyquant)\nlibrary(scales)\n```\n:::\n\n\nWe first download daily prices for one stock symbol, e.g., the Apple stock, *AAPL*, directly from the data provider Yahoo!Finance. To download the data, you can use the command `tq_get`. 
If you do not know how to use it, make sure you read the help file by calling `?tq_get`. We especially recommend taking a look at the examples section of the documentation. We request daily data for a period of more than 20 years.\\index{Stock prices}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprices <- tq_get(\"AAPL\",\n get = \"stock.prices\",\n from = \"2000-01-01\",\n to = \"2022-12-31\"\n)\nprices\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5,787 × 8\n symbol date open high low close volume adjusted\n \n1 AAPL 2000-01-03 0.936 1.00 0.908 0.999 535796800 0.847\n2 AAPL 2000-01-04 0.967 0.988 0.903 0.915 512377600 0.776\n3 AAPL 2000-01-05 0.926 0.987 0.920 0.929 778321600 0.787\n4 AAPL 2000-01-06 0.948 0.955 0.848 0.848 767972800 0.719\n5 AAPL 2000-01-07 0.862 0.902 0.853 0.888 460734400 0.753\n# ℹ 5,782 more rows\n```\n\n\n:::\n:::\n\n\n\\index{Data!YahooFinance} `tq_get` downloads stock market data from Yahoo!Finance if you do not specify another data source. The function returns a tibble with eight quite self-explanatory columns: `symbol`, `date`, the market prices at the `open`, `high`, `low`, and `close`, the daily `volume` (in the number of traded shares), and the `adjusted` price in USD. The adjusted prices are corrected for anything that might affect the stock price after the market closes, e.g., stock splits and dividends. These actions affect the quoted prices, but they have no direct impact on the investors who hold the stock. Therefore, we often rely on adjusted prices when it comes to analyzing the returns an investor would have earned by holding the stock continuously.\\index{Stock price adjustments}\n\nNext, we use the `ggplot2` package [@ggplot2] to visualize the time series of adjusted prices in @fig-100 . This package takes care of visualization tasks based on the principles of the grammar of graphics [@Wilkinson2012].\\index{Graph!Time series}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprices |>\n ggplot(aes(x = date, y = adjusted)) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n title = \"Apple stock prices between beginning of 2000 and end of 2022\"\n )\n```\n\n::: {.cell-output-display}\n![Prices are in USD, adjusted for dividend payments and stock splits.](introduction-to-tidy-finance_files/figure-html/fig-100-1.png){#fig-100 fig-alt='Title: Apple stock prices between the beginning of 2000 and the end of 2022. The figure shows that the stock price of Apple increased dramatically from about 1 USD to around 125 USD.' width=2100}\n:::\n:::\n\n\n\\index{Returns} Instead of analyzing prices, we compute daily net returns defined as $r_t = p_t / p_{t-1} - 1$, where $p_t$ is the adjusted day $t$ price. In that context, the function `lag()` is helpful, which returns the previous value in a vector.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns <- prices |>\n arrange(date) |>\n mutate(ret = adjusted / lag(adjusted) - 1) |>\n select(symbol, date, ret)\nreturns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5,787 × 3\n symbol date ret\n \n1 AAPL 2000-01-03 NA \n2 AAPL 2000-01-04 -0.0843\n3 AAPL 2000-01-05 0.0146\n4 AAPL 2000-01-06 -0.0865\n5 AAPL 2000-01-07 0.0474\n# ℹ 5,782 more rows\n```\n\n\n:::\n:::\n\n\nThe resulting tibble contains three columns, where the last contains the daily returns (`ret`). 
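\n\nIf you have not worked with `lag()` before, a quick look at its behavior on a plain numeric vector clarifies where the missing value in the first row comes from.\n\n::: {.cell}\n\n```{.r .cell-code}\n# lag() shifts a vector by one position and pads the start with NA\nlag(c(1, 5, 10)) # returns NA 1 5\n```\n:::\n\n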
Note that the first entry naturally contains a missing value (`NA`) because there is no previous price.\\index{Missing value} Obviously, the use of `lag()` would be meaningless if the time series is not ordered by ascending dates.\\index{Lag observations} The command `arrange()` provides a convenient way to order observations in the correct way for our application. In case you want to order observations by descending dates, you can use `arrange(desc(date))`.\n\nFor the upcoming examples, we remove missing values as these would require separate treatment when computing, e.g., sample averages. In general, however, make sure you understand why `NA` values occur and carefully examine if you can simply get rid of these observations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns <- returns |>\n drop_na(ret)\n```\n:::\n\n\nNext, we visualize the distribution of daily returns in a histogram in @fig-101. \\index{Graph!Histogram} Additionally, we add a dashed line that indicates the 5 percent quantile of the daily returns to the histogram, which is a (crude) proxy for the worst return of the stock with a probability of at most 5 percent. The 5 percent quantile is closely connected to the (historical) value-at-risk, a risk measure commonly monitored by regulators. \\index{Value-at-risk} We refer to @Tsay2010 for a more thorough introduction to stylized facts of returns.\\index{Returns}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquantile_05 <- quantile(returns |> pull(ret), probs = 0.05)\nreturns |>\n ggplot(aes(x = ret)) +\n geom_histogram(bins = 100) +\n geom_vline(aes(xintercept = quantile_05),\n linetype = \"dashed\"\n ) +\n labs(\n x = NULL,\n y = NULL,\n title = \"Distribution of daily Apple stock returns\"\n ) +\n scale_x_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![The dotted vertical line indicates the historical 5 percent quantile.](introduction-to-tidy-finance_files/figure-html/fig-101-1.png){#fig-101 fig-alt='Title: Distribution of daily Apple stock returns in percent. The figure shows a histogram of daily returns. The range indicates a few large negative values, while the remaining returns are distributed around 0. The vertical line indicates that the historical 5 percent quantile of daily returns was around negative 3 percent.' width=2100}\n:::\n:::\n\n\nHere, `bins = 100` determines the number of bins used in the illustration and hence implicitly the width of the bins. Before proceeding, make sure you understand how to use the geom `geom_vline()` to add a dashed line that indicates the 5 percent quantile of the daily returns. A typical task before proceeding with *any* data is to compute summary statistics for the main variables of interest.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n )\n ))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 4\n ret_daily_mean ret_daily_sd ret_daily_min ret_daily_max\n \n1 0.00120 0.0251 -0.519 0.139\n```\n\n\n:::\n:::\n\n\nWe see that the maximum *daily* return was 13.905 percent. Perhaps not surprisingly, the average daily return is close to but slightly above 0. In line with the illustration above, the large losses on the day with the minimum returns indicate a strong asymmetry in the distribution of returns.\\\nYou can also compute these summary statistics for each year individually by imposing `group_by(year = year(date))`, where the call `year(date)` returns the year. 
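\n\nApplied to a single date, the function simply extracts the calendar year, as the following one-line example illustrates.\n\n::: {.cell}\n\n```{.r .cell-code}\n# year() extracts the calendar year from a date\nyear(ymd(\"2022-12-31\")) # returns 2022\n```\n:::\n\n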
More specifically, the few lines of code below compute the summary statistics from above for individual groups of data defined by year. The summary statistics, therefore, allow an eyeball analysis of the time-series dynamics of the return distribution.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns |>\n group_by(year = year(date)) |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n ),\n .names = \"{.fn}\"\n )) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 23 × 5\n year daily_mean daily_sd daily_min daily_max\n \n 1 2000 -0.00346 0.0549 -0.519 0.137 \n 2 2001 0.00233 0.0393 -0.172 0.129 \n 3 2002 -0.00121 0.0305 -0.150 0.0846\n 4 2003 0.00186 0.0234 -0.0814 0.113 \n 5 2004 0.00470 0.0255 -0.0558 0.132 \n 6 2005 0.00349 0.0245 -0.0921 0.0912\n 7 2006 0.000949 0.0243 -0.0633 0.118 \n 8 2007 0.00366 0.0238 -0.0702 0.105 \n 9 2008 -0.00265 0.0367 -0.179 0.139 \n10 2009 0.00382 0.0214 -0.0502 0.0676\n11 2010 0.00183 0.0169 -0.0496 0.0769\n12 2011 0.00104 0.0165 -0.0559 0.0589\n13 2012 0.00130 0.0186 -0.0644 0.0887\n14 2013 0.000472 0.0180 -0.124 0.0514\n15 2014 0.00145 0.0136 -0.0799 0.0820\n16 2015 0.0000199 0.0168 -0.0612 0.0574\n17 2016 0.000575 0.0147 -0.0657 0.0650\n18 2017 0.00164 0.0111 -0.0388 0.0610\n19 2018 -0.0000573 0.0181 -0.0663 0.0704\n20 2019 0.00266 0.0165 -0.0996 0.0683\n21 2020 0.00281 0.0294 -0.129 0.120 \n22 2021 0.00131 0.0158 -0.0417 0.0539\n23 2022 -0.000970 0.0225 -0.0587 0.0890\n```\n\n\n:::\n:::\n\n\n\\index{Summary statistics}\n\nIn case you wonder: the additional argument `.names = \"{.fn}\"` in `across()` determines how to name the output columns. The specification is rather flexible and allows almost arbitrary column names, which can be useful for reporting. The `print()` function simply controls the output options for the R console.\n\n## Scaling Up the Analysis\n\nAs a next step, we generalize the code from before such that all the computations can handle an arbitrary vector of symbols (e.g., all constituents of an index). Following tidy principles, it is quite easy to download the data, plot the price time series, and tabulate the summary statistics for an arbitrary number of assets.\n\nThis is where the `tidyverse` magic starts: tidy data makes it extremely easy to generalize the computations from before to as many assets as you like. The following code takes any vector of symbols, e.g., `symbol <- c(\"AAPL\", \"MMM\", \"BA\")`, and automates the download as well as the plot of the price time series. In the end, we create the table of summary statistics for an arbitrary number of assets. 
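\n\nAs a warm-up, you can pass a short, hand-picked vector of symbols directly to `tq_get()`; the three tickers below are merely examples, and the resulting tibble simply stacks the daily prices of all requested symbols.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Warm-up sketch with a hand-picked vector of symbols (examples only)\nsymbols_small <- c(\"AAPL\", \"MMM\", \"BA\")\nprices_small <- tq_get(symbols_small,\n  get = \"stock.prices\",\n  from = \"2000-01-01\",\n  to = \"2022-12-31\"\n)\nprices_small |>\n  count(symbol)\n```\n:::\n\n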
We perform the analysis with data from all current constituents of the [Dow Jones Industrial Average index.](https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average) \\index{Data!Dow Jones Index}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsymbols <- tq_index(\"DOW\") |> \n filter(company != \"US DOLLAR\")\nsymbols\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30 × 8\n symbol company identifier sedol weight sector shares_held\n \n1 UNH UNITEDHEALTH GRO… 91324P102 2917… 0.0958 - 5747928\n2 GS GOLDMAN SACHS GR… 38141G104 2407… 0.0672 - 5747928\n3 MSFT MICROSOFT CORP 594918104 2588… 0.0646 - 5747928\n4 HD HOME DEPOT INC 437076102 2434… 0.0594 - 5747928\n5 AMGN AMGEN INC 031162100 2023… 0.0532 - 5747928\n# ℹ 25 more rows\n# ℹ 1 more variable: local_currency \n```\n\n\n:::\n:::\n\n\nConveniently, `tidyquant` provides a function to get all stocks in a stock index with a single call (similarly, `tq_exchange(\"NASDAQ\")` delivers all stocks currently listed on the NASDAQ exchange). \\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices <- tq_get(symbols,\n get = \"stock.prices\",\n from = \"2000-01-01\",\n to = \"2022-12-31\"\n)\n```\n:::\n\n\nThe resulting tibble contains 165593 daily observations for 30 different corporations. @fig-103 illustrates the time series of downloaded *adjusted* prices for each of the constituents of the Dow Jones index. Make sure you understand every single line of code! What are the arguments of `aes()`? Which alternative `geoms` could you use to visualize the time series? Hint: if you do not know the answers try to change the code to see what difference your intervention causes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices |>\n ggplot(aes(\n x = date,\n y = adjusted,\n color = symbol\n )) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n color = NULL,\n title = \"Stock prices of DOW index constituents\"\n ) +\n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![Prices in USD, adjusted for dividend payments and stock splits.](introduction-to-tidy-finance_files/figure-html/fig-103-1.png){#fig-103 fig-alt='Title: Stock prices of DOW index constituents. The figure shows many time series with daily prices. The general trend seems positive for most stocks in the DOW index.' width=2100}\n:::\n:::\n\n\nDo you notice the small differences relative to the code we used before? `tq_get(symbols)` returns a tibble for several symbols as well. All we need to do to illustrate all stock symbols simultaneously is to include `color = symbol` in the `ggplot` aesthetics. In this way, we generate a separate line for each symbol. Of course, there are simply too many lines on this graph to identify the individual stocks properly, but it illustrates the point well.\n\nThe same holds for stock returns. Before computing the returns, we use `group_by(symbol)` such that the `mutate()` command is performed for each symbol individually. 
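\n\nTo see why this grouping step matters, consider a tiny example with made-up prices: without `group_by()`, `lag()` shifts prices across the boundary between two symbols, so the first return of the second symbol would mix prices of two different stocks.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy example with made-up prices: grouping prevents lag() from crossing symbols\ntoy_prices <- tibble(\n  symbol = c(\"A\", \"A\", \"B\", \"B\"),\n  adjusted = c(10, 11, 100, 110)\n)\n\ntoy_prices |>\n  mutate(ret_ungrouped = adjusted / lag(adjusted) - 1) |>\n  group_by(symbol) |>\n  mutate(ret_grouped = adjusted / lag(adjusted) - 1)\n```\n:::\n\n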
The same logic also applies to the computation of summary statistics: `group_by(symbol)` is the key to aggregating the time series into symbol-specific variables of interest.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nall_returns <- index_prices |>\n group_by(symbol) |>\n mutate(ret = adjusted / lag(adjusted) - 1) |>\n select(symbol, date, ret) |>\n drop_na(ret)\n\nall_returns |>\n group_by(symbol) |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n ),\n .names = \"{.fn}\"\n )) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30 × 5\n symbol daily_mean daily_sd daily_min daily_max\n \n 1 AAPL 0.00120 0.0251 -0.519 0.139\n 2 AMGN 0.000489 0.0197 -0.134 0.151\n 3 AXP 0.000518 0.0229 -0.176 0.219\n 4 BA 0.000595 0.0224 -0.238 0.243\n 5 CAT 0.000709 0.0204 -0.145 0.147\n 6 CRM 0.00110 0.0270 -0.271 0.260\n 7 CSCO 0.000317 0.0237 -0.162 0.244\n 8 CVX 0.000553 0.0176 -0.221 0.227\n 9 DIS 0.000418 0.0195 -0.184 0.160\n10 DOW 0.000562 0.0260 -0.217 0.209\n11 GS 0.000550 0.0231 -0.190 0.265\n12 HD 0.000543 0.0194 -0.287 0.141\n13 HON 0.000515 0.0194 -0.174 0.282\n14 IBM 0.000273 0.0165 -0.155 0.120\n15 INTC 0.000285 0.0236 -0.220 0.201\n16 JNJ 0.000408 0.0122 -0.158 0.122\n17 JPM 0.000582 0.0242 -0.207 0.251\n18 KO 0.000337 0.0132 -0.101 0.139\n19 MCD 0.000533 0.0147 -0.159 0.181\n20 MMM 0.000378 0.0150 -0.129 0.126\n21 MRK 0.000383 0.0168 -0.268 0.130\n22 MSFT 0.000513 0.0194 -0.156 0.196\n23 NKE 0.000743 0.0194 -0.198 0.155\n24 PG 0.000377 0.0134 -0.302 0.120\n25 TRV 0.000569 0.0183 -0.208 0.256\n26 UNH 0.000984 0.0198 -0.186 0.348\n27 V 0.000929 0.0190 -0.136 0.150\n28 VZ 0.000239 0.0151 -0.118 0.146\n29 WBA 0.000284 0.0182 -0.150 0.166\n30 WMT 0.000314 0.0150 -0.114 0.117\n```\n\n\n:::\n:::\n\n\n\\index{Summary statistics}\n\nNote that you are now also equipped with all tools to download price data for *each* symbol listed in the S&P 500 index with the same number of lines of code. Just use `symbol <- tq_index(\"SP500\")`, which provides you with a tibble that contains each symbol that is (currently) part of the S&P 500.\\index{Data!SP 500} However, don't try this if you are not prepared to wait for a couple of minutes because this is quite some data to download!\n\n## Other Forms of Data Aggregation\n\nOf course, aggregation across variables other than `symbol` can also make sense. For instance, suppose you are interested in answering the question: Are days with high aggregate trading volume likely followed by days with high aggregate trading volume? To provide some initial analysis on this question, we take the downloaded data and compute aggregate daily trading volume for all Dow Jones constituents in USD. Recall that the column `volume` is denoted in the number of traded shares.\\index{Trading volume} Thus, we multiply the trading volume with the daily closing price to get a proxy for the aggregate trading volume in USD. 
Scaling the axis labels by `1e-9` (R can handle scientific notation) expresses the daily trading volume in billion USD.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrading_volume <- index_prices |>\n group_by(date) |>\n summarize(trading_volume = sum(volume * adjusted))\n\ntrading_volume |>\n ggplot(aes(x = date, y = trading_volume)) +\n geom_line() +\n labs(\n x = NULL, y = NULL,\n title = \"Aggregate daily trading volume of DOW index constituents\"\n ) +\n scale_y_continuous(labels = unit_format(unit = \"B\", scale = 1e-9))\n```\n\n::: {.cell-output-display}\n![Total daily trading volume in billion USD.](introduction-to-tidy-finance_files/figure-html/fig-104-1.png){#fig-104 fig-alt='Title: Aggregate daily trading volume. The figure shows a volatile time series of daily trading volume, ranging from 15 in 2000 to 20.5 in 2022, with a maximum of more than 100.' width=2100}\n:::\n:::\n\n\n@fig-104 indicates a clear upward trend in aggregated daily trading volume. In particular, since the outbreak of the COVID-19 pandemic, markets have processed substantial trading volumes, as analyzed, for instance, by @Goldstein2021.\\index{Covid 19} One way to illustrate the persistence of trading volume would be to plot volume on day $t$ against volume on day $t-1$ as in the example below. In @fig-105, we add a dashed 45°-line to indicate a hypothetical one-to-one relation by `geom_abline()`, addressing potential differences in the axes' scales.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrading_volume |>\n ggplot(aes(x = lag(trading_volume), y = trading_volume)) +\n geom_point() +\n geom_abline(aes(intercept = 0, slope = 1),\n linetype = \"dashed\"\n ) +\n labs(\n x = \"Previous day aggregate trading volume\",\n y = \"Aggregate trading volume\",\n title = \"Persistence in daily trading volume of DOW index constituents\"\n ) + \n scale_x_continuous(labels = unit_format(unit = \"B\", scale = 1e-9)) +\n scale_y_continuous(labels = unit_format(unit = \"B\", scale = 1e-9))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Removed 1 rows containing missing values (`geom_point()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n![Total daily trading volume in billion USD.](introduction-to-tidy-finance_files/figure-html/fig-105-1.png){#fig-105 fig-alt='Title: Persistence in daily trading volume of DOW index constituents. The figure shows a scatterplot where aggregate trading volume and previous-day aggregate trading volume neatly line up along a 45-degree line.' width=2100}\n:::\n:::\n\n\nDo you understand where the warning `## Warning: Removed 1 rows containing missing values (geom_point).` comes from and what it means? Pure eyeballing reveals that days with high trading volume are often followed by similarly high trading volume days.\\index{Error message}\n\n## Portfolio Choice Problems\n\nIn the previous part, we show how to download stock market data and inspect it with graphs and summary statistics. 
Now, we move to a typical question in finance: how to allocate wealth across different assets optimally.\\index{Portfolio choice} The standard framework for optimal portfolio selection considers investors who prefer higher future returns but dislike future return volatility (defined as the square root of the return variance): the *mean-variance investor* [@Markowitz1952].\\index{Markowitz optimization}\n\n\\index{Efficient frontier} An essential tool to evaluate portfolios in the mean-variance context is the *efficient frontier*, the set of portfolios for which no other portfolio exists with a higher expected return at the same volatility (the square root of the variance, i.e., the risk); see, e.g., @Merton1972.\\index{Return volatility} We compute and visualize the efficient frontier for several stocks. First, we extract each asset's *monthly* returns. In order to keep things simple, we work with a balanced panel and exclude DOW constituents for which we do not observe a price on every single trading day since the year 2000.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices <- index_prices |>\n group_by(symbol) |>\n mutate(n = n()) |>\n ungroup() |>\n filter(n == max(n)) |>\n select(-n)\nreturns <- index_prices |>\n mutate(month = floor_date(date, \"month\")) |>\n group_by(symbol, month) |>\n summarize(price = last(adjusted), .groups = \"drop_last\") |>\n mutate(ret = price / lag(price) - 1) |>\n drop_na(ret) |>\n select(-price)\n```\n:::\n\n\nHere, `floor_date()` is a function from the `lubridate` package [@lubridate], which provides useful functions to work with dates and times.\n\nNext, we transform the returns from a tidy tibble into a $(T \\times N)$ matrix with one column for each of the $N$ symbols and one row for each of the $T$ months to compute the sample average return vector $$\\hat\\mu = \\frac{1}{T}\\sum\\limits_{t=1}^T r_t$$ where $r_t$ is the $N$ vector of returns in month $t$ and the sample covariance matrix $$\\hat\\Sigma = \\frac{1}{T-1}\\sum\\limits_{t=1}^T (r_t - \\hat\\mu)(r_t - \\hat\\mu)'.$$ We achieve this by using `pivot_wider()` with the new column names from the column `symbol` and setting the values to `ret`. We compute the vector of sample average returns and the sample variance-covariance matrix, which we consider as proxies for the parameters of the distribution of future stock returns. Thus, for simplicity, we refer to $\\Sigma$ and $\\mu$ instead of explicitly highlighting that the sample moments are estimates. \\index{Covariance} In later chapters, we discuss the issues that arise once we take estimation uncertainty into account.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns_matrix <- returns |>\n pivot_wider(\n names_from = symbol,\n values_from = ret\n ) |>\n select(-month)\nsigma <- cov(returns_matrix)\nmu <- colMeans(returns_matrix)\n```\n:::\n\n\nThen, we compute the minimum variance portfolio weights $\\omega_\\text{mvp}$ as well as the expected portfolio return $\\omega_\\text{mvp}'\\mu$ and volatility $\\sqrt{\\omega_\\text{mvp}'\\Sigma\\omega_\\text{mvp}}$ of this portfolio. \\index{Minimum variance portfolio} Recall that the minimum variance portfolio is the vector of portfolio weights that are the solution to $$\\omega_\\text{mvp} = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\sum\\limits_{i=1}^N\\omega_i = 1.$$ The constraint that weights sum up to one simply implies that all funds are distributed across the available asset universe, i.e., there is no possibility to retain cash. 
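\n\nIn case you wonder where the closed-form solution stated next comes from, here is a brief sketch of the standard Lagrangian argument:\n$$\\mathcal{L}(\\omega, \\lambda) = \\omega'\\Sigma\\omega + \\lambda\\left(1 - \\iota'\\omega\\right) \\Rightarrow \\frac{\\partial\\mathcal{L}}{\\partial\\omega} = 2\\Sigma\\omega - \\lambda\\iota = 0 \\Rightarrow \\omega = \\frac{\\lambda}{2}\\Sigma^{-1}\\iota,$$\nand imposing the constraint $\\iota'\\omega = 1$ pins down the constant $\\frac{\\lambda}{2} = \\frac{1}{\\iota'\\Sigma^{-1}\\iota}$, which yields the expression in the next sentence.\n\n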
It is easy to show analytically that $\\omega_\\text{mvp} = \\frac{\\Sigma^{-1}\\iota}{\\iota'\\Sigma^{-1}\\iota}$, where $\\iota$ is a vector of ones and $\\Sigma^{-1}$ is the inverse of $\\Sigma$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nN <- ncol(returns_matrix)\niota <- rep(1, N)\nsigma_inv <- solve(sigma)\nmvp_weights <- sigma_inv %*% iota\nmvp_weights <- mvp_weights / sum(mvp_weights)\ntibble(\n average_ret = as.numeric(t(mvp_weights) %*% mu),\n volatility = as.numeric(sqrt(t(mvp_weights) %*% sigma %*% mvp_weights))\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 2\n average_ret volatility\n \n1 0.00792 0.0321\n```\n\n\n:::\n:::\n\n\nThe command `solve(A, b)` returns the solution of a system of equations $Ax = b$. If `b` is not provided, as in the example above, it defaults to the identity matrix such that `solve(sigma)` delivers $\\Sigma^{-1}$ (if a unique solution exists).\\\nNote that the *monthly* volatility of the minimum variance portfolio is of the same order of magnitude as the *daily* standard deviation of the individual components. Thus, the diversification benefits in terms of risk reduction are tremendous!\\index{Diversification}\n\nNext, we set out to find the weights for a portfolio that achieves, as an example, three times the expected return of the minimum variance portfolio. However, mean-variance investors are not interested in any portfolio that achieves the required return but rather in the efficient portfolio, i.e., the portfolio with the lowest standard deviation. If you wonder where the solution $\\omega_\\text{eff}$ comes from: \\index{Efficient portfolio} The efficient portfolio is chosen by an investor who aims to achieve minimum variance *given a minimum acceptable expected return* $\\bar{\\mu}$. Hence, their objective function is to choose $\\omega_\\text{eff}$ as the solution to $$\\omega_\\text{eff}(\\bar{\\mu}) = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1 \\text{ and } \\omega'\\mu \\geq \\bar{\\mu}.$$\n\nThe code below implements the analytic solution to this optimization problem for a benchmark return $\\bar\\mu$, which we set to 3 times the expected return of the minimum variance portfolio. We encourage you to verify that it is correct.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmark_multiple <- 3\nmu_bar <- benchmark_multiple * t(mvp_weights) %*% mu\nC <- as.numeric(t(iota) %*% sigma_inv %*% iota)\nD <- as.numeric(t(iota) %*% sigma_inv %*% mu)\nE <- as.numeric(t(mu) %*% sigma_inv %*% mu)\nlambda_tilde <- as.numeric(2 * (mu_bar - D / C) / (E - D^2 / C))\nefp_weights <- mvp_weights +\n lambda_tilde / 2 * (sigma_inv %*% mu - D * mvp_weights)\n```\n:::\n\n\n## The Efficient Frontier\n\n\\index{Separation theorem} The mutual fund separation theorem states that as soon as we have two efficient portfolios (such as the minimum variance portfolio $\\omega_\\text{mvp}$ and the efficient portfolio for a higher required level of expected returns $\\omega_\\text{eff}(\\bar{\\mu})$, we can characterize the entire efficient frontier by combining these two portfolios. That is, any linear combination of the two portfolio weights will again represent an efficient portfolio. \\index{Efficient frontier} The code below implements the construction of the *efficient frontier*, which characterizes the highest expected return achievable at each level of risk. 
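\n\nWhile reading the loop, keep in mind that every candidate portfolio is an affine combination of the two portfolios we just computed,\n$$\\omega(a) = (1 - a)\\,\\omega_\\text{mvp} + a\\,\\omega_\\text{eff}(\\bar\\mu),$$\nso all the loop has to do is evaluate the annualized expected return and volatility of $\\omega(a)$ over a grid of values for $a$.\n\n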
To understand the code better, make sure to familiarize yourself with the inner workings of the `for` loop.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlength_year <- 12\na <- seq(from = -0.4, to = 1.9, by = 0.01)\nres <- tibble(\n a = a,\n mu = NA,\n sd = NA\n)\nfor (i in seq_along(a)) {\n w <- (1 - a[i]) * mvp_weights + (a[i]) * efp_weights\n res$mu[i] <- length_year * t(w) %*% mu \n res$sd[i] <- sqrt(length_year) * sqrt(t(w) %*% sigma %*% w)\n}\n```\n:::\n\n\nThe code above proceeds in two steps: First, we compute a vector of combination weights $a$ and then we evaluate the resulting linear combination with $a\\in\\mathbb{R}$:\\\n$$\\omega^* = a\\omega_\\text{eff}(\\bar\\mu) + (1-a)\\omega_\\text{mvp} = \\omega_\\text{mvp} + \\frac{\\lambda^*}{2}\\left(\\Sigma^{-1}\\mu -\\frac{D}{C}\\Sigma^{-1}\\iota \\right)$$ with $\\lambda^* = 2\\frac{a\\bar\\mu + (1-a)\\tilde\\mu - D/C}{E-D^2/C}$ where $C = \\iota'\\Sigma^{-1}\\iota$, $D=\\iota'\\Sigma^{-1}\\mu$, and $E=\\mu'\\Sigma^{-1}\\mu$. Finally, it is simple to visualize the efficient frontier alongside the two efficient portfolios within one powerful figure using `ggplot` (see @fig-106). We also add the individual stocks in the same call. We compute annualized returns based on the simple assumption that monthly returns are independent and identically distributed. Thus, the average annualized return is just 12 times the expected monthly return.\\index{Graph!Efficient frontier}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres |>\n ggplot(aes(x = sd, y = mu)) +\n geom_point() +\n geom_point(\n data = res |> filter(a %in% c(0, 1)),\n size = 4\n ) +\n geom_point(\n data = tibble(\n mu = length_year * mu, \n sd = sqrt(length_year) * sqrt(diag(sigma))\n ),\n aes(y = mu, x = sd), size = 1\n ) +\n labs(\n x = \"Annualized standard deviation\",\n y = \"Annualized expected return\",\n title = \"Efficient frontier for DOW index constituents\"\n ) +\n scale_x_continuous(labels = percent) +\n scale_y_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![The big dots indicate the location of the minimum variance and the efficient portfolio that delivers 3 times the expected return of the minimum variance portfolio, respectively. The small dots indicate the location of the individual constituents.](introduction-to-tidy-finance_files/figure-html/fig-106-1.png){#fig-106 fig-alt='Title: Efficient frontier for DOW index constituents. The figure shows DOW index constituents in a mean-variance diagram. A hyperbola indicates the efficient frontier of portfolios that dominate the individual holdings in the sense that they deliver higher expected returns for the same level of volatility.' width=2100}\n:::\n:::\n\n\nThe line in @fig-106 indicates the efficient frontier: the set of portfolios a mean-variance efficient investor would choose from. Compare the performance relative to the individual assets (the dots) - it should become clear that diversifying yields massive performance gains (at least as long as we take the parameters $\\Sigma$ and $\\mu$ as given).\n\n## Exercises\n\n1. Download daily prices for another stock market symbol of your choice from Yahoo!Finance with `tq_get()` from the `tidyquant` package. Plot two time series of the ticker’s un-adjusted and adjusted closing prices. Explain the differences.\n1. Compute daily net returns for an asset of your choice and visualize the distribution of daily returns in a histogram using 100 bins. 
Also, use `geom_vline()` to add a dashed red vertical line that indicates the 5 percent quantile of the daily returns. Compute summary statistics (mean, standard deviation, minimum, and maximum) for the daily returns.\n1. Take your code from before and generalize it such that you can perform all the computations for an arbitrary vector of tickers (e.g., `ticker <- c(\"AAPL\", \"MMM\", \"BA\")`). Automate the download, the plot of the price time series, and create a table of return summary statistics for this arbitrary number of assets.\n1. Are days with high aggregate trading volume often also days with large absolute returns? Find an appropriate visualization to analyze the question using the ticker `AAPL`.\n1. Compute monthly returns from the downloaded stock market prices. Compute the vector of historical average returns and the sample variance-covariance matrix. Compute the minimum variance portfolio weights and the portfolio volatility and average returns. Visualize the mean-variance efficient frontier. Choose one of your assets and identify the portfolio which yields the same historical volatility but achieves the highest possible average return.\n1. In the portfolio choice analysis, we restricted our sample to all assets trading every day since 2000. How is such a decision a problem when you want to infer future expected portfolio performance from the results?\n1. The efficient frontier characterizes the portfolios with the highest expected return for different levels of risk. Identify the portfolio with the highest expected return per standard deviation. Which famous performance measure is close to the ratio of average returns to the standard deviation of returns?\n", "supporting": [ "introduction-to-tidy-finance_files" ], diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-100-1.png b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-100-1.png index 0ac1fcb7..47838342 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-100-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-100-1.png differ diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-101-1.png b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-101-1.png index f51c1495..6b77e32b 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-101-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-101-1.png differ diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-103-1.png b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-103-1.png index 554fa966..88f0b0c9 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-103-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-103-1.png differ diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-104-1.png b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-104-1.png index e35424b3..5931c6ff 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-104-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-104-1.png differ diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-105-1.png b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-105-1.png index 93bfd0c2..27a55bcc 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-105-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-105-1.png differ diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-106-1.png 
b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-106-1.png index c5fe952d..bbff2aaf 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-106-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-106-1.png differ diff --git a/_freeze/r/parametric-portfolio-policies/execute-results/html.json b/_freeze/r/parametric-portfolio-policies/execute-results/html.json index 24315a3a..d0872af3 100644 --- a/_freeze/r/parametric-portfolio-policies/execute-results/html.json +++ b/_freeze/r/parametric-portfolio-policies/execute-results/html.json @@ -1,11 +1,9 @@ { - "hash": "70da5db0be11b30e9837c466c16ecf0b", + "hash": "3c0c52e64199495858258579baf547c7", "result": { "engine": "knitr", - "markdown": "---\ntitle: Parametric Portfolio Policies\naliases:\n - ../parametric-portfolio-policies.html\nmetadata:\n pagetitle: Parametric Portfolio Policies with R\n description-meta: Evaluate portfolio allocation strategies based on Brandt, Santa-Clara, and Valkanov (2009) using the programming language R.\n---\n\n\nIn this chapter, we apply different portfolio performance measures to evaluate and compare portfolio allocation strategies. \nFor this purpose, we introduce a direct way to estimate optimal portfolio weights for large-scale cross-sectional applications. More precisely, the approach of @Brandt2009 proposes to parametrize the optimal portfolio weights as a function of stock characteristics instead of estimating the stock's expected return, variance, and covariances with other stocks in a prior step. \nWe choose weights as a function of the characteristics, which maximize the expected utility of the investor. This approach is feasible for large portfolio dimensions (such as the entire CRSP universe) and has been proposed by @Brandt2009. See the review paper @Brandt2010 for an excellent treatment of related portfolio choice methods. \n\nThe current chapter relies on the following set of R packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nTo get started, we load the monthly CRSP file, which forms our investment universe. We load the data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(), \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, ret_excess, mktcap, mktcap_lag) |>\n collect()\n```\n:::\n\n\nTo evaluate the performance of portfolios, we further use monthly market returns as a benchmark to compute CAPM alphas.\\index{Data!Fama-French factors} \n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n```\n:::\n\n\nNext, we retrieve some stock characteristics that have been shown to have an effect on the expected returns or expected variances (or even higher moments) of the return distribution. \\index{Momentum} In particular, we record the lagged one-year return momentum (`momentum_lag`), defined as the compounded return between months $t-13$ and $t-2$ for each firm. In finance, momentum is the empirically observed tendency for rising asset prices to rise further, and falling prices to keep falling [@Jegadeesh1993]. 
\\index{Size!Size effect} The second characteristic is the firm's market equity (`size_lag`), defined as the log of the price per share times the number of shares outstanding [@Banz1981]. \nTo construct the correct lagged values, we use the approach introduced in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_lags <- crsp_monthly |>\n transmute(permno,\n month_13 = month %m+% months(13),\n mktcap\n )\n\ncrsp_monthly <- crsp_monthly |>\n inner_join(crsp_monthly_lags,\n by = c(\"permno\", \"month\" = \"month_13\"),\n suffix = c(\"\", \"_13\")\n )\n\ndata_portfolios <- crsp_monthly |>\n mutate(\n momentum_lag = mktcap_lag / mktcap_13,\n size_lag = log(mktcap_lag)\n ) |>\n drop_na(contains(\"lag\"))\n```\n:::\n\n\n## Parametric Portfolio Policies\n\nThe basic idea of parametric portfolio weights is as follows. Suppose that at each date $t$ we have $N_t$ stocks in the investment universe, where each stock $i$ has a return of $r_{i, t+1}$ and is associated with a vector of firm characteristics $x_{i, t}$ such as time-series momentum or the market capitalization. The investor's problem is to choose portfolio weights $w_{i,t}$ to maximize the expected utility of the portfolio return:\n$$\\begin{aligned}\n\\max_{\\omega} E_t\\left(u(r_{p, t+1})\\right) = E_t\\left[u\\left(\\sum\\limits_{i=1}^{N_t}\\omega_{i,t}r_{i,t+1}\\right)\\right]\n\\end{aligned}$$\nwhere $u(\\cdot)$ denotes the utility function.\n\nWhere do the stock characteristics show up? We parameterize the optimal portfolio weights as a function of the stock characteristic $x_{i,t}$ with the following linear specification for the portfolio weights: \n$$\\omega_{i,t} = \\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t},$$\nwhere $\\bar{\\omega}_{i,t}$ is a stock's weight in a benchmark portfolio (we use the value-weighted or naive portfolio in the application below), $\\theta$ is a vector of coefficients which we are going to estimate, and $\\hat{x}_{i,t}$ are the characteristics of stock $i$, cross-sectionally standardized to have zero mean and unit standard deviation. \n\nIntuitively, the portfolio strategy is a form of active portfolio management relative to a performance benchmark. Deviations from the benchmark portfolio are derived from the individual stock characteristics. Note that by construction the weights sum up to one as $\\sum_{i=1}^{N_t}\\hat{x}_{i,t} = 0$ due to the standardization. Moreover, the coefficients are constant across assets and over time. The implicit assumption is that the characteristics fully capture all aspects of the joint distribution of returns that are relevant for forming optimal portfolios. \n\nWe first implement cross-sectional standardization for the entire CRSP universe. We also keep track of (lagged) relative market capitalization `relative_mktcap`, which will represent the value-weighted benchmark portfolio, while `n` denotes the number of traded assets $N_t$, which we use to construct the naive portfolio benchmark.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_portfolios <- data_portfolios |>\n group_by(month) |>\n mutate(\n n = n(),\n relative_mktcap = mktcap_lag / sum(mktcap_lag),\n across(contains(\"lag\"), ~ (. - mean(.)) / sd(.)),\n ) |>\n ungroup() |>\n select(-mktcap_lag)\n```\n:::\n\n\n## Computing Portfolio Weights\n\nNext, we move on to identify optimal choices of $\\theta$. 
We rewrite the optimization problem together with the weight parametrization and can then estimate $\\theta$ to maximize the objective function based on our sample \n$$\\begin{aligned}\nE_t\\left(u(r_{p, t+1})\\right) = \\frac{1}{T}\\sum\\limits_{t=0}^{T-1}u\\left(\\sum\\limits_{i=1}^{N_t}\\left(\\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t}\\right)r_{i,t+1}\\right).\n\\end{aligned}$$\nThe allocation strategy is straightforward because the number of parameters to estimate is small. Instead of a tedious specification of the $N_t$ dimensional vector of expected returns and the $N_t(N_t+1)/2$ free elements of the covariance matrix, all we need to focus on in our application is the vector $\\theta$. $\\theta$ contains only two elements in our application - the relative deviation from the benchmark due to *size* and *momentum*. \n\nTo get a feeling for the performance of such an allocation strategy, we start with an arbitrary initial vector $\\theta_0$. The next step is to choose $\\theta$ optimally to maximize the objective function. We automatically detect the number of parameters by counting the number of columns with lagged values.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_parameters <- sum(str_detect(\n colnames(data_portfolios), \"lag\"\n))\n\ntheta <- rep(1.5, n_parameters)\n\nnames(theta) <- colnames(data_portfolios)[str_detect(\n colnames(data_portfolios), \"lag\"\n)]\n```\n:::\n\n\nThe function `compute_portfolio_weights()` below computes the portfolio weights $\\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t}$ according to our parametrization for a given value $\\theta_0$. Everything happens within a single pipeline. Hence, we provide a short walk-through.\n\nWe first compute `characteristic_tilt`, the tilting values $\\frac{1}{N_t}\\theta'\\hat{x}_{i, t}$ which resemble the deviation from the benchmark portfolio. Next, we compute the benchmark portfolio `weight_benchmark`, which can be any reasonable set of portfolio weights. In our case, we choose either the value or equal-weighted allocation. \n`weight_tilt` completes the picture and contains the final portfolio weights `weight_tilt = weight_benchmark + characteristic_tilt` which deviate from the benchmark portfolio depending on the stock characteristics.\n\nThe final few lines go a bit further and implement a simple version of a no-short sale constraint. While it is generally not straightforward to ensure portfolio weight constraints via parameterization, we simply normalize the portfolio weights such that they are enforced to be positive. Finally, we make sure that the normalized weights sum up to one again:\n$$\\omega_{i,t}^+ = \\frac{\\max(0, \\omega_{i,t})}{\\sum_{j=1}^{N_t}\\max(0, \\omega_{i,t})}.$$\n\n\nThe following function computes the optimal portfolio weights in the way just described. \n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_portfolio_weights <- function(theta,\n data,\n value_weighting = TRUE,\n allow_short_selling = TRUE) {\n data |>\n group_by(month) |>\n bind_cols(\n characteristic_tilt = data |>\n transmute(across(contains(\"lag\"), ~ . 
/ n)) |>\n as.matrix() %*% theta |> as.numeric()\n ) |>\n mutate(\n # Definition of benchmark weight\n weight_benchmark = case_when(\n value_weighting == TRUE ~ relative_mktcap,\n value_weighting == FALSE ~ 1 / n\n ),\n # Parametric portfolio weights\n weight_tilt = weight_benchmark + characteristic_tilt,\n # Short-sell constraint\n weight_tilt = case_when(\n allow_short_selling == TRUE ~ weight_tilt,\n allow_short_selling == FALSE ~ pmax(0, weight_tilt)\n ),\n # Weights sum up to 1\n weight_tilt = weight_tilt / sum(weight_tilt)\n ) |>\n ungroup()\n}\n```\n:::\n\n\nIn the next step, we compute the portfolio weights for the arbitrary vector $\\theta_0$. In the example below, we use the value-weighted portfolio as a benchmark and allow negative portfolio weights.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweights_crsp <- compute_portfolio_weights(\n theta,\n data_portfolios,\n value_weighting = TRUE,\n allow_short_selling = TRUE\n)\n```\n:::\n\n\n## Portfolio Performance\n\n\\index{Performance evaluation}\nAre the computed weights optimal in any way? Most likely not, as we picked $\\theta_0$ arbitrarily. To evaluate the performance of an allocation strategy, one can think of many different approaches. In their original paper, @Brandt2009 focus on a simple evaluation of the hypothetical utility of an agent equipped with a power utility function $u_\\gamma(r) = \\frac{(1 + r)^{(1-\\gamma)}}{1-\\gamma}$, where $\\gamma$ is the risk aversion factor.\\index{Power utility}\n\n\n::: {.cell}\n\n```{.r .cell-code}\npower_utility <- function(r, gamma = 5) {\n (1 + r)^(1 - gamma) / (1 - gamma)\n}\n```\n:::\n\n\nWe want to note that @Gehrig2020 warn that, in the leading case of constant relative risk aversion (CRRA), strong assumptions on the properties of the returns, the variables used to implement the parametric portfolio policy, and the parameter space are necessary to obtain a well-defined optimization problem.\n\nNo doubt, there are many other ways to evaluate a portfolio. The function below provides a summary of all kinds of interesting measures that can be considered relevant. Do we need all these evaluation measures? It depends: the original paper @Brandt2009 only cares about the expected utility to choose $\\theta$. 
However, if you want to choose optimal values that achieve the highest performance while putting some constraints on your portfolio weights, it is helpful to have everything in one function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_portfolio <- function(weights_crsp,\n capm_evaluation = TRUE,\n full_evaluation = TRUE,\n length_year = 12) {\n \n evaluation <- weights_crsp |>\n group_by(month) |>\n summarize(\n tilt = weighted.mean(ret_excess, weight_tilt),\n benchmark = weighted.mean(ret_excess, weight_benchmark)\n ) |>\n pivot_longer(-month,\n values_to = \"portfolio_return\",\n names_to = \"model\"\n ) \n \n evaluation_stats <- evaluation |>\n group_by(model) |>\n left_join(factors_ff3_monthly, by = \"month\") |>\n summarize(tibble(\n \"Expected utility\" = mean(power_utility(portfolio_return)),\n \"Average return\" = 100 * mean(length_year * portfolio_return),\n \"SD return\" = 100 * sqrt(length_year) * sd(portfolio_return),\n \"Sharpe ratio\" = sqrt(length_year) * mean(portfolio_return) / sd(portfolio_return),\n\n )) |>\n mutate(model = str_remove(model, \"return_\")) \n \n if (capm_evaluation) {\n evaluation_capm <- evaluation |> \n left_join(factors_ff3_monthly, by = \"month\") |>\n group_by(model) |>\n summarize(\n \"CAPM alpha\" = coefficients(lm(portfolio_return ~ mkt_excess))[1],\n \"Market beta\" = coefficients(lm(portfolio_return ~ mkt_excess))[2]\n )\n \n evaluation_stats <- evaluation_stats |> \n left_join(evaluation_capm, by = \"model\")\n }\n\n if (full_evaluation) {\n evaluation_weights <- weights_crsp |>\n select(month, contains(\"weight\")) |>\n pivot_longer(-month, values_to = \"weight\", names_to = \"model\") |>\n group_by(model, month) |>\n mutate(\n \"Absolute weight\" = abs(weight),\n \"Max. weight\" = max(weight),\n \"Min. weight\" = min(weight),\n \"Avg. sum of negative weights\" = -sum(weight[weight < 0]),\n \"Avg. fraction of negative weights\" = sum(weight < 0) / n(),\n .keep = \"none\"\n ) |>\n group_by(model) |>\n summarize(across(-month, ~ 100 * mean(.))) |>\n mutate(model = str_remove(model, \"weight_\")) \n \n evaluation_stats <- evaluation_stats |> \n left_join(evaluation_weights, by = \"model\")\n }\n \n evaluation_output <- evaluation_stats |> \n pivot_longer(cols = -model, names_to = \"measure\") |> \n pivot_wider(names_from = model)\n \n return(evaluation_output)\n}\n```\n:::\n\n\n\\index{Sharpe Ratio}\nLet us take a look at the different portfolio strategies and evaluation measures.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_portfolio(weights_crsp) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 11 × 3\n measure benchmark tilt\n \n 1 Expected utility -0.250 -0.261 \n 2 Average return 6.64 0.259 \n 3 SD return 15.5 21.1 \n 4 Sharpe ratio 0.430 0.0122 \n 5 CAPM alpha 0.000138 -0.00496\n 6 Market beta 0.992 0.952 \n 7 Absolute weight 0.0249 0.0638 \n 8 Max. weight 3.59 3.72 \n 9 Min. weight 0.0000274 -0.144 \n10 Avg. sum of negative weights 0 78.0 \n11 Avg. fraction of negative weights 0 49.5 \n```\n\n\n:::\n:::\n\n\nThe value-weighted portfolio delivers an annualized return of more than 6 percent and clearly outperforms the tilted portfolio, irrespective of whether we evaluate expected utility, the Sharpe ratio, or the CAPM alpha. We can conclude the market beta is close to one for both strategies (naturally almost identically 1 for the value-weighted benchmark portfolio). 
When it comes to the distribution of the portfolio weights, we see that the benchmark portfolio weight takes less extreme positions (lower average absolute weights and lower maximum weight). By definition, the value-weighted benchmark does not take any negative positions, while the tilted portfolio also takes short positions.\n\n## Optimal Parameter Choice\n\nNext, we move to a choice of $\\theta$ that actually aims to improve some (or all) of the performance measures. We first define a helper function `compute_objective_function()`, which we then pass to an optimizer.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_objective_function <- function(theta,\n data,\n objective_measure = \"Expected utility\",\n value_weighting = TRUE,\n allow_short_selling = TRUE) {\n processed_data <- compute_portfolio_weights(\n theta,\n data,\n value_weighting,\n allow_short_selling\n )\n\n objective_function <- evaluate_portfolio(\n processed_data,\n capm_evaluation = FALSE,\n full_evaluation = FALSE\n ) |>\n filter(measure == objective_measure) |>\n pull(tilt)\n\n return(-objective_function)\n}\n```\n:::\n\n\nYou may wonder why we return the negative value of the objective function. This is simply due to the common convention for optimization procedures to search for minima as a default. By minimizing the negative value of the objective function, we get the maximum value as a result.\nIn its most basic form, R optimization relies on the function `optim()`. As main inputs, the function requires an initial guess of the parameters and the objective function to minimize. Now, we are fully equipped to compute the optimal values of $\\hat\\theta$, which maximize the hypothetical expected utility of the investor. \n\n\n::: {.cell}\n\n```{.r .cell-code}\noptimal_theta <- optim(\n par = theta,\n fn = compute_objective_function,\n objective_measure = \"Expected utility\",\n data = data_portfolios,\n value_weighting = TRUE,\n allow_short_selling = TRUE,\n method = \"Nelder-Mead\"\n)\n\noptimal_theta$par\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nmomentum_lag size_lag \n 0.361 -1.820 \n```\n\n\n:::\n:::\n\n\nThe resulting values of $\\hat\\theta$ are easy to interpret: intuitively, expected utility increases by tilting weights from the value-weighted portfolio toward smaller stocks (negative coefficient for size) and toward past winners (positive value for momentum). Both findings are in line with the well-documented size effect [@Banz1981] and the momentum anomaly [@Jegadeesh1993].\n\n## More Model Specifications\n\nHow does the portfolio perform for different model specifications? For this purpose, we compute the performance of a number of different modeling choices based on the entire CRSP sample. 
The next code chunk performs all the heavy lifting.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_optimal_performance <- function(data, \n objective_measure,\n value_weighting, \n allow_short_selling) {\n optimal_theta <- optim(\n par = theta,\n fn = compute_objective_function,\n data = data,\n objective_measure = \"Expected utility\",\n value_weighting = TRUE,\n allow_short_selling = TRUE,\n method = \"Nelder-Mead\"\n )\n\n processed_data = compute_portfolio_weights(\n optimal_theta$par, \n data,\n value_weighting,\n allow_short_selling\n )\n \n portfolio_evaluation = evaluate_portfolio(\n processed_data,\n capm_evaluation = TRUE,\n full_evaluation = TRUE\n )\n \n return(portfolio_evaluation) \n}\n\nspecifications <- expand_grid(\n data = list(data_portfolios),\n objective_measure = \"Expected utility\",\n value_weighting = c(TRUE, FALSE),\n allow_short_selling = c(TRUE, FALSE)\n) |> \n mutate(\n portfolio_evaluation = pmap(\n .l = list(data, objective_measure, value_weighting, allow_short_selling),\n .f = evaluate_optimal_performance\n )\n)\n```\n:::\n\n\nFinally, we can compare the results. The table below shows summary statistics for all possible combinations: equal- or value-weighted benchmark portfolio, with or without short-selling constraints, and tilted toward maximizing expected utility. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nperformance_table <- specifications |>\n select(\n value_weighting,\n allow_short_selling,\n portfolio_evaluation\n ) |>\n unnest(portfolio_evaluation)\n\nperformance_table |>\n rename(\n \" \" = benchmark,\n Optimal = tilt\n ) |>\n mutate(\n value_weighting = case_when(\n value_weighting == TRUE ~ \"VW\",\n value_weighting == FALSE ~ \"EW\"\n ),\n allow_short_selling = case_when(\n allow_short_selling == TRUE ~ \"\",\n allow_short_selling == FALSE ~ \"(no s.)\"\n )\n ) |>\n pivot_wider(\n names_from = value_weighting:allow_short_selling,\n values_from = \" \":Optimal,\n names_glue = \"{value_weighting} {allow_short_selling} {.value} \"\n ) |>\n select(\n measure,\n `EW `,\n `VW `,\n sort(contains(\"Optimal\"))\n ) |>\n print(n = 11)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 11 × 7\n measure `EW ` `VW ` `VW Optimal ` `VW (no s.) Optimal `\n \n 1 Expected u… -0.251 -2.50e-1 -0.247 -0.248 \n 2 Average re… 9.99 6.64e+0 13.3 12.3 \n 3 SD return 20.4 1.55e+1 19.9 19.1 \n 4 Sharpe rat… 0.491 4.30e-1 0.670 0.641 \n 5 CAPM alpha 0.00222 1.38e-4 0.00564 0.00462\n 6 Market beta 1.12 9.92e-1 1.00 1.03 \n 7 Absolute w… 0.0249 2.49e-2 0.0360 0.0249 \n 8 Max. weight 0.0249 3.59e+0 3.42 2.80 \n 9 Min. weight 0.0249 2.74e-5 -0.0324 0 \n10 Avg. sum o… 0 0 23.1 0 \n11 Avg. fract… 0 0 37.5 0 \n# ℹ 2 more variables: `EW Optimal ` ,\n# `EW (no s.) Optimal ` \n```\n\n\n:::\n:::\n\n\nThe results indicate that the average annualized Sharpe ratio of the equal-weighted portfolio exceeds the Sharpe ratio of the value-weighted benchmark portfolio. Nevertheless, starting with the weighted value portfolio as a benchmark and tilting optimally with respect to momentum and small stocks yields the highest Sharpe ratio across all specifications. Finally, imposing no short-sale constraints does not improve the performance of the portfolios in our application.\n\n## Exercises\n\n1. How do the estimated parameters $\\hat\\theta$ and the portfolio performance change if your objective is to maximize the Sharpe ratio instead of the hypothetical expected utility?\n1. The code above is very flexible in the sense that you can easily add new firm characteristics. 
Construct a new characteristic of your choice and evaluate the corresponding coefficient $\\hat\\theta_i$. \n1. Tweak the function `optimal_theta()` such that you can impose additional performance constraints in order to determine $\\hat\\theta$, which maximizes expected utility under the constraint that the market beta is below 1.\n1. Does the portfolio performance resemble a realistic out-of-sample backtesting procedure? Verify the robustness of the results by first estimating $\\hat\\theta$ based on *past data* only. Then, use more recent periods to evaluate the actual portfolio performance. \n1. By formulating the portfolio problem as a statistical estimation problem, you can easily obtain standard errors for the coefficients of the weight function. @Brandt2009 provide the relevant derivations in their paper in Equation (10). Implement a small function that computes standard errors for $\\hat\\theta$.", - "supporting": [ - "parametric-portfolio-policies_files" - ], + "markdown": "---\ntitle: Parametric Portfolio Policies\naliases:\n - ../parametric-portfolio-policies.html\nmetadata:\n pagetitle: Parametric Portfolio Policies with R\n description-meta: Evaluate portfolio allocation strategies based on Brandt, Santa-Clara, and Valkanov (2009) using the programming language R.\n---\n\n\nIn this chapter, we apply different portfolio performance measures to evaluate and compare portfolio allocation strategies. \nFor this purpose, we introduce a direct way to estimate optimal portfolio weights for large-scale cross-sectional applications. More precisely, the approach of @Brandt2009 proposes to parametrize the optimal portfolio weights as a function of stock characteristics instead of estimating the stock's expected return, variance, and covariances with other stocks in a prior step. \nWe choose weights as a function of the characteristics, which maximize the expected utility of the investor. This approach is feasible for large portfolio dimensions (such as the entire CRSP universe) and has been proposed by @Brandt2009. See the review paper @Brandt2010 for an excellent treatment of related portfolio choice methods. \n\nThe current chapter relies on the following set of R packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nTo get started, we load the monthly CRSP file, which forms our investment universe. We load the data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(), \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, ret_excess, mktcap, mktcap_lag) |>\n collect()\n```\n:::\n\n\nTo evaluate the performance of portfolios, we further use monthly market returns as a benchmark to compute CAPM alphas.\\index{Data!Fama-French factors} \n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n```\n:::\n\n\nNext, we retrieve some stock characteristics that have been shown to have an effect on the expected returns or expected variances (or even higher moments) of the return distribution. 
\\index{Momentum} In particular, we record the lagged one-year return momentum (`momentum_lag`), defined as the compounded return between months $t-13$ and $t-2$ for each firm. In finance, momentum is the empirically observed tendency for rising asset prices to rise further, and falling prices to keep falling [@Jegadeesh1993]. \\index{Size!Size effect} The second characteristic is the firm's market equity (`size_lag`), defined as the log of the price per share times the number of shares outstanding [@Banz1981]. \nTo construct the correct lagged values, we use the approach introduced in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_lags <- crsp_monthly |>\n transmute(permno,\n month_13 = month %m+% months(13),\n mktcap\n )\n\ncrsp_monthly <- crsp_monthly |>\n inner_join(crsp_monthly_lags,\n join_by(permno, month == month_13),\n suffix = c(\"\", \"_13\")\n )\n\ndata_portfolios <- crsp_monthly |>\n mutate(\n momentum_lag = mktcap_lag / mktcap_13,\n size_lag = log(mktcap_lag)\n ) |>\n drop_na(contains(\"lag\"))\n```\n:::\n\n\n## Parametric Portfolio Policies\n\nThe basic idea of parametric portfolio weights is as follows. Suppose that at each date $t$ we have $N_t$ stocks in the investment universe, where each stock $i$ has a return of $r_{i, t+1}$ and is associated with a vector of firm characteristics $x_{i, t}$ such as time-series momentum or the market capitalization. The investor's problem is to choose portfolio weights $w_{i,t}$ to maximize the expected utility of the portfolio return:\n$$\\begin{aligned}\n\\max_{\\omega} E_t\\left(u(r_{p, t+1})\\right) = E_t\\left[u\\left(\\sum\\limits_{i=1}^{N_t}\\omega_{i,t}r_{i,t+1}\\right)\\right]\n\\end{aligned}$$\nwhere $u(\\cdot)$ denotes the utility function.\n\nWhere do the stock characteristics show up? We parameterize the optimal portfolio weights as a function of the stock characteristic $x_{i,t}$ with the following linear specification for the portfolio weights: \n$$\\omega_{i,t} = \\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t},$$\nwhere $\\bar{\\omega}_{i,t}$ is a stock's weight in a benchmark portfolio (we use the value-weighted or naive portfolio in the application below), $\\theta$ is a vector of coefficients which we are going to estimate, and $\\hat{x}_{i,t}$ are the characteristics of stock $i$, cross-sectionally standardized to have zero mean and unit standard deviation. \n\nIntuitively, the portfolio strategy is a form of active portfolio management relative to a performance benchmark. Deviations from the benchmark portfolio are derived from the individual stock characteristics. Note that by construction the weights sum up to one as $\\sum_{i=1}^{N_t}\\hat{x}_{i,t} = 0$ due to the standardization. Moreover, the coefficients are constant across assets and over time. The implicit assumption is that the characteristics fully capture all aspects of the joint distribution of returns that are relevant for forming optimal portfolios. \n\nWe first implement cross-sectional standardization for the entire CRSP universe. 
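\n\nConcretely, within each month $t$ we transform every characteristic $x_{i,t}$ into\n$$\\hat{x}_{i,t} = \\frac{x_{i,t} - \\frac{1}{N_t}\\sum_{j=1}^{N_t} x_{j,t}}{\\sigma_t(x)},$$\nwhere $\\sigma_t(x)$ denotes the cross-sectional standard deviation of the characteristic in month $t$; this is what the `across(contains(\"lag\"), ~ (. - mean(.)) / sd(.))` line in the code below implements.\n\n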
We also keep track of (lagged) relative market capitalization `relative_mktcap`, which will represent the value-weighted benchmark portfolio, while `n` denotes the number of traded assets $N_t$, which we use to construct the naive portfolio benchmark.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_portfolios <- data_portfolios |>\n group_by(month) |>\n mutate(\n n = n(),\n relative_mktcap = mktcap_lag / sum(mktcap_lag),\n across(contains(\"lag\"), ~ (. - mean(.)) / sd(.)),\n ) |>\n ungroup() |>\n select(-mktcap_lag)\n```\n:::\n\n\n## Computing Portfolio Weights\n\nNext, we move on to identify optimal choices of $\\theta$. We rewrite the optimization problem together with the weight parametrization and can then estimate $\\theta$ to maximize the objective function based on our sample \n$$\\begin{aligned}\nE_t\\left(u(r_{p, t+1})\\right) = \\frac{1}{T}\\sum\\limits_{t=0}^{T-1}u\\left(\\sum\\limits_{i=1}^{N_t}\\left(\\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t}\\right)r_{i,t+1}\\right).\n\\end{aligned}$$\nThe allocation strategy is straightforward because the number of parameters to estimate is small. Instead of a tedious specification of the $N_t$ dimensional vector of expected returns and the $N_t(N_t+1)/2$ free elements of the covariance matrix, all we need to focus on in our application is the vector $\\theta$. $\\theta$ contains only two elements in our application - the relative deviation from the benchmark due to *size* and *momentum*. \n\nTo get a feeling for the performance of such an allocation strategy, we start with an arbitrary initial vector $\\theta_0$. The next step is to choose $\\theta$ optimally to maximize the objective function. We automatically detect the number of parameters by counting the number of columns with lagged values.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_parameters <- sum(str_detect(\n colnames(data_portfolios), \"lag\"\n))\n\ntheta <- rep(1.5, n_parameters)\n\nnames(theta) <- colnames(data_portfolios)[str_detect(\n colnames(data_portfolios), \"lag\"\n)]\n```\n:::\n\n\nThe function `compute_portfolio_weights()` below computes the portfolio weights $\\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t}$ according to our parametrization for a given value $\\theta_0$. Everything happens within a single pipeline. Hence, we provide a short walk-through.\n\nWe first compute `characteristic_tilt`, the tilting values $\\frac{1}{N_t}\\theta'\\hat{x}_{i, t}$ which resemble the deviation from the benchmark portfolio. Next, we compute the benchmark portfolio `weight_benchmark`, which can be any reasonable set of portfolio weights. In our case, we choose either the value or equal-weighted allocation. \n`weight_tilt` completes the picture and contains the final portfolio weights `weight_tilt = weight_benchmark + characteristic_tilt` which deviate from the benchmark portfolio depending on the stock characteristics.\n\nThe final few lines go a bit further and implement a simple version of a no-short sale constraint. While it is generally not straightforward to ensure portfolio weight constraints via parameterization, we simply normalize the portfolio weights such that they are enforced to be positive. Finally, we make sure that the normalized weights sum up to one again:\n$$\\omega_{i,t}^+ = \\frac{\\max(0, \\omega_{i,t})}{\\sum_{j=1}^{N_t}\\max(0, \\omega_{i,t})}.$$\n\n\nThe following function computes the optimal portfolio weights in the way just described. 
\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_portfolio_weights <- function(theta,\n data,\n value_weighting = TRUE,\n allow_short_selling = TRUE) {\n data |>\n group_by(month) |>\n bind_cols(\n characteristic_tilt = data |>\n transmute(across(contains(\"lag\"), ~ . / n)) |>\n as.matrix() %*% theta |> as.numeric()\n ) |>\n mutate(\n # Definition of benchmark weight\n weight_benchmark = case_when(\n value_weighting == TRUE ~ relative_mktcap,\n value_weighting == FALSE ~ 1 / n\n ),\n # Parametric portfolio weights\n weight_tilt = weight_benchmark + characteristic_tilt,\n # Short-sell constraint\n weight_tilt = case_when(\n allow_short_selling == TRUE ~ weight_tilt,\n allow_short_selling == FALSE ~ pmax(0, weight_tilt)\n ),\n # Weights sum up to 1\n weight_tilt = weight_tilt / sum(weight_tilt)\n ) |>\n ungroup()\n}\n```\n:::\n\n\nIn the next step, we compute the portfolio weights for the arbitrary vector $\\theta_0$. In the example below, we use the value-weighted portfolio as a benchmark and allow negative portfolio weights.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweights_crsp <- compute_portfolio_weights(\n theta,\n data_portfolios,\n value_weighting = TRUE,\n allow_short_selling = TRUE\n)\n```\n:::\n\n\n## Portfolio Performance\n\n\\index{Performance evaluation}\nAre the computed weights optimal in any way? Most likely not, as we picked $\\theta_0$ arbitrarily. To evaluate the performance of an allocation strategy, one can think of many different approaches. In their original paper, @Brandt2009 focus on a simple evaluation of the hypothetical utility of an agent equipped with a power utility function $u_\\gamma(r) = \\frac{(1 + r)^{(1-\\gamma)}}{1-\\gamma}$, where $\\gamma$ is the risk aversion factor.\\index{Power utility}\n\n\n::: {.cell}\n\n```{.r .cell-code}\npower_utility <- function(r, gamma = 5) {\n (1 + r)^(1 - gamma) / (1 - gamma)\n}\n```\n:::\n\n\nWe want to note that @Gehrig2020 warn that, in the leading case of constant relative risk aversion (CRRA), strong assumptions on the properties of the returns, the variables used to implement the parametric portfolio policy, and the parameter space are necessary to obtain a well-defined optimization problem.\n\nNo doubt, there are many other ways to evaluate a portfolio. The function below provides a summary of all kinds of interesting measures that can be considered relevant. Do we need all these evaluation measures? It depends: the original paper @Brandt2009 only cares about the expected utility to choose $\\theta$. 
However, if you want to choose optimal values that achieve the highest performance while putting some constraints on your portfolio weights, it is helpful to have everything in one function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_portfolio <- function(weights_crsp,\n capm_evaluation = TRUE,\n full_evaluation = TRUE,\n length_year = 12) {\n \n evaluation <- weights_crsp |>\n group_by(month) |>\n summarize(\n tilt = weighted.mean(ret_excess, weight_tilt),\n benchmark = weighted.mean(ret_excess, weight_benchmark)\n ) |>\n pivot_longer(-month,\n values_to = \"portfolio_return\",\n names_to = \"model\"\n ) \n \n evaluation_stats <- evaluation |>\n group_by(model) |>\n left_join(factors_ff3_monthly, \n join_by(month)) |>\n summarize(tibble(\n \"Expected utility\" = mean(power_utility(portfolio_return)),\n \"Average return\" = 100 * mean(length_year * portfolio_return),\n \"SD return\" = 100 * sqrt(length_year) * sd(portfolio_return),\n \"Sharpe ratio\" = sqrt(length_year) * mean(portfolio_return) / sd(portfolio_return),\n\n )) |>\n mutate(model = str_remove(model, \"return_\")) \n \n if (capm_evaluation) {\n evaluation_capm <- evaluation |> \n left_join(factors_ff3_monthly, \n join_by(month)) |>\n group_by(model) |>\n summarize(\n \"CAPM alpha\" = coefficients(lm(portfolio_return ~ mkt_excess))[1],\n \"Market beta\" = coefficients(lm(portfolio_return ~ mkt_excess))[2]\n )\n \n evaluation_stats <- evaluation_stats |> \n left_join(evaluation_capm, join_by(model))\n }\n\n if (full_evaluation) {\n evaluation_weights <- weights_crsp |>\n select(month, contains(\"weight\")) |>\n pivot_longer(-month, values_to = \"weight\", names_to = \"model\") |>\n group_by(model, month) |>\n mutate(\n \"Absolute weight\" = abs(weight),\n \"Max. weight\" = max(weight),\n \"Min. weight\" = min(weight),\n \"Avg. sum of negative weights\" = -sum(weight[weight < 0]),\n \"Avg. fraction of negative weights\" = sum(weight < 0) / n(),\n .keep = \"none\"\n ) |>\n group_by(model) |>\n summarize(across(-month, ~ 100 * mean(.))) |>\n mutate(model = str_remove(model, \"weight_\")) \n \n evaluation_stats <- evaluation_stats |> \n left_join(evaluation_weights, join_by(model))\n }\n \n evaluation_output <- evaluation_stats |> \n pivot_longer(cols = -model, names_to = \"measure\") |> \n pivot_wider(names_from = model)\n \n return(evaluation_output)\n}\n```\n:::\n\n\n\\index{Sharpe Ratio}\nLet us take a look at the different portfolio strategies and evaluation measures.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_portfolio(weights_crsp) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 11 × 3\n measure benchmark tilt\n \n 1 Expected utility -0.250 -0.261 \n 2 Average return 6.64 0.259 \n 3 SD return 15.5 21.1 \n 4 Sharpe ratio 0.430 0.0122 \n 5 CAPM alpha 0.000138 -0.00496\n 6 Market beta 0.992 0.952 \n 7 Absolute weight 0.0249 0.0638 \n 8 Max. weight 3.59 3.72 \n 9 Min. weight 0.0000274 -0.144 \n10 Avg. sum of negative weights 0 78.0 \n11 Avg. fraction of negative weights 0 49.5 \n```\n\n\n:::\n:::\n\n\nThe value-weighted portfolio delivers an annualized return of more than 6 percent and clearly outperforms the tilted portfolio, irrespective of whether we evaluate expected utility, the Sharpe ratio, or the CAPM alpha. We can conclude the market beta is close to one for both strategies (naturally almost identically 1 for the value-weighted benchmark portfolio). 
When it comes to the distribution of the portfolio weights, we see that the benchmark portfolio weight takes less extreme positions (lower average absolute weights and lower maximum weight). By definition, the value-weighted benchmark does not take any negative positions, while the tilted portfolio also takes short positions.\n\n## Optimal Parameter Choice\n\nNext, we move to a choice of $\\theta$ that actually aims to improve some (or all) of the performance measures. We first define a helper function `compute_objective_function()`, which we then pass to an optimizer.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_objective_function <- function(theta,\n data,\n objective_measure = \"Expected utility\",\n value_weighting = TRUE,\n allow_short_selling = TRUE) {\n processed_data <- compute_portfolio_weights(\n theta,\n data,\n value_weighting,\n allow_short_selling\n )\n\n objective_function <- evaluate_portfolio(\n processed_data,\n capm_evaluation = FALSE,\n full_evaluation = FALSE\n ) |>\n filter(measure == objective_measure) |>\n pull(tilt)\n\n return(-objective_function)\n}\n```\n:::\n\n\nYou may wonder why we return the negative value of the objective function. This is simply due to the common convention for optimization procedures to search for minima as a default. By minimizing the negative value of the objective function, we get the maximum value as a result.\nIn its most basic form, R optimization relies on the function `optim()`. As main inputs, the function requires an initial guess of the parameters and the objective function to minimize. Now, we are fully equipped to compute the optimal values of $\\hat\\theta$, which maximize the hypothetical expected utility of the investor. \n\n\n::: {.cell}\n\n```{.r .cell-code}\noptimal_theta <- optim(\n par = theta,\n fn = compute_objective_function,\n objective_measure = \"Expected utility\",\n data = data_portfolios,\n value_weighting = TRUE,\n allow_short_selling = TRUE,\n method = \"Nelder-Mead\"\n)\n\noptimal_theta$par\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nmomentum_lag size_lag \n 0.361 -1.820 \n```\n\n\n:::\n:::\n\n\nThe resulting values of $\\hat\\theta$ are easy to interpret: intuitively, expected utility increases by tilting weights from the value-weighted portfolio toward smaller stocks (negative coefficient for size) and toward past winners (positive value for momentum). Both findings are in line with the well-documented size effect [@Banz1981] and the momentum anomaly [@Jegadeesh1993].\n\n## More Model Specifications\n\nHow does the portfolio perform for different model specifications? For this purpose, we compute the performance of a number of different modeling choices based on the entire CRSP sample. 
The next code chunk performs all the heavy lifting.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_optimal_performance <- function(data, \n objective_measure,\n value_weighting, \n allow_short_selling) {\n optimal_theta <- optim(\n par = theta,\n fn = compute_objective_function,\n data = data,\n objective_measure = \"Expected utility\",\n value_weighting = TRUE,\n allow_short_selling = TRUE,\n method = \"Nelder-Mead\"\n )\n\n processed_data = compute_portfolio_weights(\n optimal_theta$par, \n data,\n value_weighting,\n allow_short_selling\n )\n \n portfolio_evaluation = evaluate_portfolio(\n processed_data,\n capm_evaluation = TRUE,\n full_evaluation = TRUE\n )\n \n return(portfolio_evaluation) \n}\n\nspecifications <- expand_grid(\n data = list(data_portfolios),\n objective_measure = \"Expected utility\",\n value_weighting = c(TRUE, FALSE),\n allow_short_selling = c(TRUE, FALSE)\n) |> \n mutate(\n portfolio_evaluation = pmap(\n .l = list(data, objective_measure, value_weighting, allow_short_selling),\n .f = evaluate_optimal_performance\n )\n)\n```\n:::\n\n\nFinally, we can compare the results. The table below shows summary statistics for all possible combinations: equal- or value-weighted benchmark portfolio, with or without short-selling constraints, and tilted toward maximizing expected utility. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nperformance_table <- specifications |>\n select(\n value_weighting,\n allow_short_selling,\n portfolio_evaluation\n ) |>\n unnest(portfolio_evaluation)\n\nperformance_table |>\n rename(\n \" \" = benchmark,\n Optimal = tilt\n ) |>\n mutate(\n value_weighting = case_when(\n value_weighting == TRUE ~ \"VW\",\n value_weighting == FALSE ~ \"EW\"\n ),\n allow_short_selling = case_when(\n allow_short_selling == TRUE ~ \"\",\n allow_short_selling == FALSE ~ \"(no s.)\"\n )\n ) |>\n pivot_wider(\n names_from = value_weighting:allow_short_selling,\n values_from = \" \":Optimal,\n names_glue = \"{value_weighting} {allow_short_selling} {.value} \"\n ) |>\n select(\n measure,\n `EW `,\n `VW `,\n sort(contains(\"Optimal\"))\n ) |>\n print(n = 11)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 11 × 7\n measure `EW ` `VW ` `VW Optimal ` `VW (no s.) Optimal `\n \n 1 Expected u… -0.251 -2.50e-1 -0.247 -0.248 \n 2 Average re… 9.99 6.64e+0 13.3 12.3 \n 3 SD return 20.4 1.55e+1 19.9 19.1 \n 4 Sharpe rat… 0.491 4.30e-1 0.670 0.641 \n 5 CAPM alpha 0.00222 1.38e-4 0.00564 0.00462\n 6 Market beta 1.12 9.92e-1 1.00 1.03 \n 7 Absolute w… 0.0249 2.49e-2 0.0360 0.0249 \n 8 Max. weight 0.0249 3.59e+0 3.42 2.80 \n 9 Min. weight 0.0249 2.74e-5 -0.0324 0 \n10 Avg. sum o… 0 0 23.1 0 \n11 Avg. fract… 0 0 37.5 0 \n# ℹ 2 more variables: `EW Optimal ` ,\n# `EW (no s.) Optimal ` \n```\n\n\n:::\n:::\n\n\nThe results indicate that the average annualized Sharpe ratio of the equal-weighted portfolio exceeds the Sharpe ratio of the value-weighted benchmark portfolio. Nevertheless, starting with the weighted value portfolio as a benchmark and tilting optimally with respect to momentum and small stocks yields the highest Sharpe ratio across all specifications. Finally, imposing no short-sale constraints does not improve the performance of the portfolios in our application.\n\n## Exercises\n\n1. How do the estimated parameters $\\hat\\theta$ and the portfolio performance change if your objective is to maximize the Sharpe ratio instead of the hypothetical expected utility?\n1. The code above is very flexible in the sense that you can easily add new firm characteristics. 
Construct a new characteristic of your choice and evaluate the corresponding coefficient $\\hat\\theta_i$. \n1. Tweak the function `optimal_theta()` such that you can impose additional performance constraints in order to determine $\\hat\\theta$, which maximizes expected utility under the constraint that the market beta is below 1.\n1. Does the portfolio performance resemble a realistic out-of-sample backtesting procedure? Verify the robustness of the results by first estimating $\\hat\\theta$ based on *past data* only. Then, use more recent periods to evaluate the actual portfolio performance. \n1. By formulating the portfolio problem as a statistical estimation problem, you can easily obtain standard errors for the coefficients of the weight function. @Brandt2009 provide the relevant derivations in their paper in Equation (10). Implement a small function that computes standard errors for $\\hat\\theta$.", + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/r/replicating-fama-and-french-factors/execute-results/html.json b/_freeze/r/replicating-fama-and-french-factors/execute-results/html.json index 85b02eff..921fcba4 100644 --- a/_freeze/r/replicating-fama-and-french-factors/execute-results/html.json +++ b/_freeze/r/replicating-fama-and-french-factors/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "0dc3115bcf5adcc95bfb3e71a6e4e5e6", + "hash": "a902f8812d8309d205a7198613b759c9", "result": { "engine": "knitr", - "markdown": "---\ntitle: Replicating Fama and French Factors\naliases:\n - ../replicating-fama-and-french-factors.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Replicating Fama and French Factors with R\n description-meta: Use the programming language R to replicate the famous Fama-French three and five factor asset pricing models.\n---\n\n\nIn this chapter, we provide a replication of the famous Fama and French factor portfolios. The Fama and French factor models are a cornerstone of empirical asset pricing [see @Fama1992 and @FamaFrench2015]. On top of the market factor represented by the traditional CAPM beta, the three factor model includes the size and value factors to explain the cross section of returns. Its successor, the five factor model, additionally includes profitability and investment as explanatory factors. \n\nWe start with the three factor model. We already introduced the size and value factors in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd), and their definition remains the same: size is the SMB factor (small-minus-big) that is long small firms and short large firms. The value factor is HML (high-minus-low) and is long in high book-to-market firms and short in low book-to-market counterparts. \n\nAfter the replication of the three factor model, we move to the five factors by constructing the profitability factor RMW (robust-minus-weak) as the difference between the returns of firms with high and low operating profitability and the investment factor CMA (conservative-minus-aggressive) as the difference between firms with high versus low investment rates. \n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nWe use CRSP and Compustat as data sources, as we need the same variables to compute the factors in the way Fama and French do it. 
Hence, there is nothing new below and we only load data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap, mktcap_lag, exchange\n ) |>\n collect()\n\ncompustat <- tbl(tidy_finance, \"compustat\") |>\n select(gvkey, datadate, be, op, inv) |>\n collect() \n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, smb, hml) |>\n collect()\n\nfactors_ff5_monthly <- tbl(tidy_finance, \"factors_ff5_monthly\") |>\n select(month, smb, hml, rmw, cma) |>\n collect()\n```\n:::\n\n\nYet when we start merging our data set for computing the premiums, there are a few differences to [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd). First, Fama and French form their portfolios in June of year $t$, whereby the returns of July are the first monthly return for the respective portfolio. For firm size, they consequently use the market capitalization recorded for June. It is then held constant until June of year $t+1$.\n\nSecond, Fama and French also have a different protocol for computing the book-to-market ratio.\\index{Book-to-market ratio} They use market equity as of the end of year $t - 1$ and the book equity reported in year $t-1$, i.e., the `datadate` is within the last year.\\index{Book equity} Hence, the book-to-market ratio can be based on accounting information that is up to 18 months old. Market equity also does not necessarily reflect the same time point as book equity. The other sorting variables are analogously to book equity taken from year $t-1$.\n\nTo implement all these time lags, we again employ the temporary `sorting_date`-column. Notice that when we combine the information, we want to have a single observation per year and stock since we are only interested in computing the breakpoints held constant for the entire year. We ensure this by a call of `distinct()` at the end of the chunk below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsize <- crsp_monthly |>\n filter(month(month) == 6) |>\n mutate(sorting_date = month %m+% months(1)) |>\n select(permno, exchange, sorting_date, size = mktcap)\n\nmarket_equity <- crsp_monthly |>\n filter(month(month) == 12) |>\n mutate(sorting_date = ymd(str_c(year(month) + 1, \"0701)\"))) |>\n select(permno, gvkey, sorting_date, me = mktcap)\n\nbook_to_market <- compustat |>\n mutate(sorting_date = ymd(str_c(year(datadate) + 1, \"0701\"))) |>\n select(gvkey, sorting_date, be) |>\n inner_join(market_equity, by = c(\"gvkey\", \"sorting_date\")) |>\n mutate(bm = be / me) |>\n select(permno, sorting_date, me, bm)\n\nsorting_variables <- size |>\n inner_join(\n book_to_market, by = c(\"permno\", \"sorting_date\")\n ) |>\n drop_na() |>\n distinct(permno, sorting_date, .keep_all = TRUE)\n```\n:::\n\n\n## Portfolio Sorts\n\nNext, we construct our portfolios with an adjusted `assign_portfolio()` function.\\index{Portfolio sorts} Fama and French rely on NYSE-specific breakpoints, they form two portfolios in the size dimension at the median and three portfolios in the dimension of each other sorting variable at the 30%- and 70%-percentiles, and they use dependent sorts. 
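To get a sense of how restrictive the NYSE filter is for the breakpoint computation, we can, for instance, count the stocks per exchange in the June cross-sections (a sketch; the exact counts depend on your CRSP sample).\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: number of stock-month observations per exchange, pooled across all\n# June cross-sections that enter the breakpoint computation\ncrsp_monthly |>\n  filter(month(month) == 6) |>\n  count(exchange)\n```\n:::\n\n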
The sorts for book-to-market require an adjustment to the function in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd) because the `seq()` we would produce does not produce the right breakpoints. Instead of `n_portfolios`, we now specify `percentiles`, which take the breakpoint-sequence as an object specified in the function's call. Specifically, we give `percentiles = c(0, 0.3, 0.7, 1)` to the function. Additionally, we perform an `inner_join()` with our return data to ensure that we only use traded stocks when computing the breakpoints as a first step.\\index{Breakpoints}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n percentiles) {\n breakpoints <- data |>\n filter(exchange == \"NYSE\") |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = percentiles,\n na.rm = TRUE,\n names = FALSE\n )\n\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n return(assigned_portfolios)\n}\n\nportfolios <- sorting_variables |>\n group_by(sorting_date) |>\n mutate(\n portfolio_size = assign_portfolio(\n data = pick(everything()),\n sorting_variable = size,\n percentiles = c(0, 0.5, 1)\n ),\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = bm,\n percentiles = c(0, 0.3, 0.7, 1)\n )\n ) |>\n ungroup() |> \n select(permno, sorting_date, \n portfolio_size, portfolio_bm)\n```\n:::\n\n\nNext, we merge the portfolios to the return data for the rest of the year. To implement this step, we create a new column `sorting_date` in our return data by setting the date to sort on to July of $t-1$ if the month is June (of year $t$) or earlier or to July of year $t$ if the month is July or later.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios <- crsp_monthly |>\n mutate(sorting_date = case_when(\n month(month) <= 6 ~ ymd(str_c(year(month) - 1, \"0701\")),\n month(month) >= 7 ~ ymd(str_c(year(month), \"0701\"))\n )) |>\n inner_join(portfolios, by = c(\"permno\", \"sorting_date\"))\n```\n:::\n\n\n## Fama and French Three Factor Model\n\nEquipped with the return data and the assigned portfolios, we can now compute the value-weighted average return for each of the six portfolios. Then, we form the Fama and French factors. For the size factor (i.e., SMB), we go long in the three small portfolios and short the three large portfolios by taking an average across either group. For the value factor (i.e., HML), we go long in the two high book-to-market portfolios and short the two low book-to-market portfolios, again weighting them equally.\\index{Factor!Size}\\index{Factor!Value}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_replicated <- portfolios |>\n group_by(portfolio_size, portfolio_bm, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), .groups = \"drop\"\n ) |>\n group_by(month) |>\n summarize(\n smb_replicated = mean(ret[portfolio_size == 1]) -\n mean(ret[portfolio_size == 2]),\n hml_replicated = mean(ret[portfolio_bm == 3]) -\n mean(ret[portfolio_bm == 1])\n )\n```\n:::\n\n\n## Replication Evaluation\n\nIn the previous section, we replicated the size and value premiums following the procedure outlined by Fama and French.\\index{Size!Size premium}\\index{Value premium} The final question is then: how close did we get? We answer this question by looking at the two time-series estimates in a regression analysis using `lm()`. 
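Before we run the regressions, simple time-series correlations between the original factors and our replicated series already provide a quick impression (a sketch; the regression evidence below is more informative).\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: correlations between the original Fama-French factors and our\n# replicated versions\nfactors_ff3_monthly |>\n  inner_join(factors_replicated, by = \"month\") |>\n  summarize(\n    cor_smb = cor(smb, smb_replicated),\n    cor_hml = cor(hml, hml_replicated)\n  )\n```\n:::\n\n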
If we did a good job, then we should see a non-significant intercept (rejecting the notion of systematic error), a coefficient close to 1 (indicating a high correlation), and an adjusted R-squared close to 1 (indicating a high proportion of explained variance).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest <- factors_ff3_monthly |>\n inner_join(factors_replicated, by = \"month\") |>\n mutate(\n across(c(smb_replicated, hml_replicated), ~round(., 4))\n )\n```\n:::\n\n\nTo test the success of the SMB factor, we hence run the following regression:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_smb <- lm(smb ~ smb_replicated, data = test)\nsummary(model_smb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = smb ~ smb_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.020294 -0.001541 -0.000055 0.001489 0.015482 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.000130 0.000131 -1 0.32 \nsmb_replicated 0.993003 0.004335 229 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00355 on 736 degrees of freedom\nMultiple R-squared: 0.986,\tAdjusted R-squared: 0.986 \nF-statistic: 5.25e+04 on 1 and 736 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe results for the SMB factor are really convincing as all three criteria outlined above are met and the coefficient is 0.99 and the R-squared is at 99%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_hml <- lm(hml ~ hml_replicated, data = test)\nsummary(model_hml)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = hml ~ hml_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.02324 -0.00287 -0.00010 0.00230 0.03407 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000291 0.000219 1.33 0.18 \nhml_replicated 0.963431 0.007280 132.35 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00591 on 736 degrees of freedom\nMultiple R-squared: 0.96,\tAdjusted R-squared: 0.96 \nF-statistic: 1.75e+04 on 1 and 736 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe replication of the HML factor is also a success, although at a slightly lower coefficient of 0.96 and an R-squared around 96%. \n\nThe evidence hence allows us to conclude that we did a relatively good job in replicating the original Fama-French size and value premiums, although we do not know their underlying code. From our perspective, a perfect match is only possible with additional information from the maintainers of the original data.\n\n## Fama and French Five Factor Model\n\nNow, let us move to the replication of the five factor model. We extend the `other_sorting_variables` table from above with the additional characteristics operating profitability `op` and investment `inv`. Note that the `drop_na()` statement yields different sample sizes as some firms with `be` values might not have `op` or `inv` values. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nother_sorting_variables <- compustat |>\n mutate(sorting_date = ymd(str_c(year(datadate) + 1, \"0701\"))) |>\n select(gvkey, sorting_date, be, op, inv) |>\n inner_join(market_equity, by = c(\"gvkey\", \"sorting_date\")) |>\n mutate(bm = be / me) |>\n select(permno, sorting_date, me, be, bm, op, inv)\n\nsorting_variables <- size |>\n inner_join(\n other_sorting_variables, by = c(\"permno\", \"sorting_date\")\n ) |>\n drop_na() |>\n distinct(permno, sorting_date, .keep_all = TRUE)\n```\n:::\n\n\nIn each month, we independently sort all stocks into the two size portfolios. The value, profitability, and investment portfolios, on the other hand, are the results of dependent sorts based on the size portfolios. We then merge the portfolios to the return data for the rest of the year just as above. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios <- sorting_variables |>\n group_by(sorting_date) |>\n mutate(\n portfolio_size = assign_portfolio(\n data = pick(everything()),\n sorting_variable = size,\n percentiles = c(0, 0.5, 1)\n )) |> \n group_by(sorting_date, portfolio_size) |> \n mutate(\n across(c(bm, op, inv), ~assign_portfolio(\n data = pick(everything()), \n sorting_variable = ., \n percentiles = c(0, 0.3, 0.7, 1)),\n .names = \"portfolio_{.col}\"\n )\n ) |>\n ungroup() |> \n select(permno, sorting_date, \n portfolio_size, portfolio_bm,\n portfolio_op, portfolio_inv)\n\nportfolios <- crsp_monthly |>\n mutate(sorting_date = case_when(\n month(month) <= 6 ~ ymd(str_c(year(month) - 1, \"0701\")),\n month(month) >= 7 ~ ymd(str_c(year(month), \"0701\"))\n )) |>\n inner_join(portfolios, by = c(\"permno\", \"sorting_date\"))\n```\n:::\n\n\nNow, we want to construct each of the factors, but this time the size factor actually comes last because it is the result of averaging across all other factor portfolios. This dependency is the reason why we keep the table with value-weighted portfolio returns as a separate object that we reuse later. 
We construct the value factor, HML, as above by going long the two portfolios with high book-to-market ratios and shorting the two portfolios with low book-to-market.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_value <- portfolios |>\n group_by(portfolio_size, portfolio_bm, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n )\n\nfactors_value <- portfolios_value |>\n group_by(month) |>\n summarize(\n hml_replicated = mean(ret[portfolio_bm == 3]) -\n mean(ret[portfolio_bm == 1])\n )\n```\n:::\n\n\nFor the profitability factor, RMW, we take a long position in the two high profitability portfolios and a short position in the two low profitability portfolios.\\index{Factor!Profitability}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_profitability <- portfolios |>\n group_by(portfolio_size, portfolio_op, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n ) \n\nfactors_profitability <- portfolios_profitability |>\n group_by(month) |>\n summarize(\n rmw_replicated = mean(ret[portfolio_op == 3]) -\n mean(ret[portfolio_op == 1])\n )\n```\n:::\n\n\nFor the investment factor, CMA, we go long the two low investment portfolios and short the two high investment portfolios.\\index{Factor!Investment}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_investment <- portfolios |>\n group_by(portfolio_size, portfolio_inv, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n )\n\nfactors_investment <- portfolios_investment |>\n group_by(month) |>\n summarize(\n cma_replicated = mean(ret[portfolio_inv == 1]) -\n mean(ret[portfolio_inv == 3])\n )\n```\n:::\n\n\nFinally, the size factor, SMB, is constructed by going long the six small portfolios and short the six large portfolios. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_size <- bind_rows(\n portfolios_value,\n portfolios_profitability,\n portfolios_investment\n) |> \n group_by(month) |>\n summarize(\n smb_replicated = mean(ret[portfolio_size == 1]) -\n mean(ret[portfolio_size == 2])\n )\n```\n:::\n\n\nWe then join all factors together into one data frame and construct again a suitable table to run tests for evaluating our replication.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_replicated <- factors_size |>\n full_join(\n factors_value, by = \"month\"\n ) |>\n full_join(\n factors_profitability, by = \"month\"\n ) |>\n full_join(\n factors_investment, by = \"month\"\n )\n\ntest <- factors_ff5_monthly |>\n inner_join(factors_replicated, by = \"month\") |>\n mutate(\n across(c(smb_replicated, hml_replicated, \n rmw_replicated, cma_replicated), ~round(., 4))\n )\n```\n:::\n\n\nLet us start the replication evaluation again with the size factor:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_smb <- lm(smb ~ smb_replicated, data = test)\nsummary(model_smb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = smb ~ smb_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.018146 -0.001865 0.000181 0.001980 0.014417 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.000203 0.000136 -1.5 0.14 \nsmb_replicated 0.969574 0.004375 221.6 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.00361 on 712 degrees of freedom\nMultiple R-squared: 0.986,\tAdjusted R-squared: 0.986 \nF-statistic: 4.91e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe results for the SMB factor are quite convincing as all three criteria outlined above are met and the coefficient is 0.97 and the R-squared is at 99%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_hml <- lm(hml ~ hml_replicated, data = test)\nsummary(model_hml)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = hml ~ hml_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.04464 -0.00414 -0.00035 0.00410 0.03667 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000473 0.000298 1.59 0.11 \nhml_replicated 0.991570 0.010271 96.54 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00792 on 712 degrees of freedom\nMultiple R-squared: 0.929,\tAdjusted R-squared: 0.929 \nF-statistic: 9.32e+03 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe replication of the HML factor is also a success, although at a slightly higher coefficient of 0.99 and an R-squared around 93%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_rmw <- lm(rmw ~ rmw_replicated, data = test)\nsummary(model_rmw)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = rmw ~ rmw_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.019810 -0.003051 0.000054 0.003280 0.018801 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 5.59e-05 2.02e-04 0.28 0.78 \nrmw_replicated 9.55e-01 8.88e-03 107.52 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00536 on 712 degrees of freedom\nMultiple R-squared: 0.942,\tAdjusted R-squared: 0.942 \nF-statistic: 1.16e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nWe are also able to replicate the RMW factor quite well with a coefficient of 0.95 and an R-squared around 94%.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_cma <- lm(cma ~ cma_replicated, data = test)\nsummary(model_cma)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = cma ~ cma_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.015071 -0.002720 -0.000225 0.002331 0.021404 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000673 0.000171 3.93 9.4e-05 ***\ncma_replicated 0.965219 0.008195 117.78 < 2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00455 on 712 degrees of freedom\nMultiple R-squared: 0.951,\tAdjusted R-squared: 0.951 \nF-statistic: 1.39e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nFinally, the CMA factor also replicates well with a coefficient of 0.97 and an R-squared around 95%.\n\nOverall, our approach seems to replicate the Fama-French three and five factor models just as well as the three factors. \n\n## Exercises\n\n1. @Fama1993 claim that their sample excludes firms until they have appeared in Compustat for two years. Implement this additional filter and compare the improvements of your replication effort. \n2. On his homepage, [Kenneth French](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/variable_definitions.html) provides instructions on how to construct the most common variables used for portfolio sorts. 
Try to replicate the univariate portfolio sort return time series for `E/P` (earnings / price) provided on his homepage and evaluate your replication effort using regressions.\n", + "markdown": "---\ntitle: Replicating Fama and French Factors\naliases:\n - ../replicating-fama-and-french-factors.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Replicating Fama and French Factors with R\n description-meta: Use the programming language R to replicate the famous Fama-French three and five factor asset pricing models.\n---\n\n\nIn this chapter, we provide a replication of the famous Fama and French factor portfolios. The Fama and French factor models are a cornerstone of empirical asset pricing [see @Fama1992 and @FamaFrench2015]. On top of the market factor represented by the traditional CAPM beta, the three factor model includes the size and value factors to explain the cross section of returns. Its successor, the five factor model, additionally includes profitability and investment as explanatory factors. \n\nWe start with the three factor model. We already introduced the size and value factors in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd), and their definition remains the same: size is the SMB factor (small-minus-big) that is long small firms and short large firms. The value factor is HML (high-minus-low) and is long in high book-to-market firms and short in low book-to-market counterparts. \n\nAfter the replication of the three factor model, we move to the five factors by constructing the profitability factor RMW (robust-minus-weak) as the difference between the returns of firms with high and low operating profitability and the investment factor CMA (conservative-minus-aggressive) as the difference between firms with high versus low investment rates. \n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nWe use CRSP and Compustat as data sources, as we need the same variables to compute the factors in the way Fama and French do it. Hence, there is nothing new below and we only load data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap, mktcap_lag, exchange\n ) |>\n collect()\n\ncompustat <- tbl(tidy_finance, \"compustat\") |>\n select(gvkey, datadate, be, op, inv) |>\n collect() \n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, smb, hml) |>\n collect()\n\nfactors_ff5_monthly <- tbl(tidy_finance, \"factors_ff5_monthly\") |>\n select(month, smb, hml, rmw, cma) |>\n collect()\n```\n:::\n\n\nYet when we start merging our data set for computing the premiums, there are a few differences to [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd). First, Fama and French form their portfolios in June of year $t$, whereby the returns of July are the first monthly return for the respective portfolio. For firm size, they consequently use the market capitalization recorded for June. 
It is then held constant until June of year $t+1$.\n\nSecond, Fama and French also have a different protocol for computing the book-to-market ratio.\\index{Book-to-market ratio} They use market equity as of the end of year $t - 1$ and the book equity reported in year $t-1$, i.e., the `datadate` is within the last year.\\index{Book equity} Hence, the book-to-market ratio can be based on accounting information that is up to 18 months old. Market equity also does not necessarily reflect the same time point as book equity. The other sorting variables are analogously to book equity taken from year $t-1$.\n\nTo implement all these time lags, we again employ the temporary `sorting_date`-column. Notice that when we combine the information, we want to have a single observation per year and stock since we are only interested in computing the breakpoints held constant for the entire year. We ensure this by a call of `distinct()` at the end of the chunk below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsize <- crsp_monthly |>\n filter(month(month) == 6) |>\n mutate(sorting_date = month %m+% months(1)) |>\n select(permno, exchange, sorting_date, size = mktcap)\n\nmarket_equity <- crsp_monthly |>\n filter(month(month) == 12) |>\n mutate(sorting_date = ymd(str_c(year(month) + 1, \"0701)\"))) |>\n select(permno, gvkey, sorting_date, me = mktcap)\n\nbook_to_market <- compustat |>\n mutate(sorting_date = ymd(str_c(year(datadate) + 1, \"0701\"))) |>\n select(gvkey, sorting_date, be) |>\n inner_join(market_equity, join_by(gvkey, sorting_date)) |>\n mutate(bm = be / me) |>\n select(permno, sorting_date, me, bm)\n\nsorting_variables <- size |>\n inner_join(\n book_to_market, join_by(permno, sorting_date)\n ) |>\n drop_na() |>\n distinct(permno, sorting_date, .keep_all = TRUE)\n```\n:::\n\n\n## Portfolio Sorts\n\nNext, we construct our portfolios with an adjusted `assign_portfolio()` function.\\index{Portfolio sorts} Fama and French rely on NYSE-specific breakpoints, they form two portfolios in the size dimension at the median and three portfolios in the dimension of each other sorting variable at the 30%- and 70%-percentiles, and they use dependent sorts. The sorts for book-to-market require an adjustment to the function in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd) because the `seq()` we would produce does not produce the right breakpoints. Instead of `n_portfolios`, we now specify `percentiles`, which take the breakpoint-sequence as an object specified in the function's call. Specifically, we give `percentiles = c(0, 0.3, 0.7, 1)` to the function. 
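The mechanics behind `percentiles` are easy to see in isolation; the following toy sketch uses made-up book-to-market values and is not part of the replication itself.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy sketch: quantile() computes the 30% and 70% breakpoints, findInterval()\n# then maps each observation into portfolio 1, 2, or 3\nbm_example <- c(0.2, 0.5, 0.8, 1.1, 1.9, 3.0)\n\nbreakpoints_example <- quantile(\n  bm_example,\n  probs = c(0, 0.3, 0.7, 1),\n  names = FALSE\n)\n\nfindInterval(bm_example, breakpoints_example, all.inside = TRUE)\n```\n:::\n\n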
Additionally, we perform an `inner_join()` with our return data to ensure that we only use traded stocks when computing the breakpoints as a first step.\\index{Breakpoints}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n percentiles) {\n breakpoints <- data |>\n filter(exchange == \"NYSE\") |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = percentiles,\n na.rm = TRUE,\n names = FALSE\n )\n\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n return(assigned_portfolios)\n}\n\nportfolios <- sorting_variables |>\n group_by(sorting_date) |>\n mutate(\n portfolio_size = assign_portfolio(\n data = pick(everything()),\n sorting_variable = size,\n percentiles = c(0, 0.5, 1)\n ),\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = bm,\n percentiles = c(0, 0.3, 0.7, 1)\n )\n ) |>\n ungroup() |> \n select(permno, sorting_date, \n portfolio_size, portfolio_bm)\n```\n:::\n\n\nNext, we merge the portfolios to the return data for the rest of the year. To implement this step, we create a new column `sorting_date` in our return data by setting the date to sort on to July of $t-1$ if the month is June (of year $t$) or earlier or to July of year $t$ if the month is July or later.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios <- crsp_monthly |>\n mutate(sorting_date = case_when(\n month(month) <= 6 ~ ymd(str_c(year(month) - 1, \"0701\")),\n month(month) >= 7 ~ ymd(str_c(year(month), \"0701\"))\n )) |>\n inner_join(portfolios, join_by(permno, sorting_date))\n```\n:::\n\n\n## Fama and French Three Factor Model\n\nEquipped with the return data and the assigned portfolios, we can now compute the value-weighted average return for each of the six portfolios. Then, we form the Fama and French factors. For the size factor (i.e., SMB), we go long in the three small portfolios and short the three large portfolios by taking an average across either group. For the value factor (i.e., HML), we go long in the two high book-to-market portfolios and short the two low book-to-market portfolios, again weighting them equally.\\index{Factor!Size}\\index{Factor!Value}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_replicated <- portfolios |>\n group_by(portfolio_size, portfolio_bm, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), .groups = \"drop\"\n ) |>\n group_by(month) |>\n summarize(\n smb_replicated = mean(ret[portfolio_size == 1]) -\n mean(ret[portfolio_size == 2]),\n hml_replicated = mean(ret[portfolio_bm == 3]) -\n mean(ret[portfolio_bm == 1])\n )\n```\n:::\n\n\n## Replication Evaluation\n\nIn the previous section, we replicated the size and value premiums following the procedure outlined by Fama and French.\\index{Size!Size premium}\\index{Value premium} The final question is then: how close did we get? We answer this question by looking at the two time-series estimates in a regression analysis using `lm()`. 
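In regression form, we estimate for each factor (here for SMB)\n\n$$\\text{smb}_t = \\alpha + \\beta \\, \\widehat{\\text{smb}}_t + \\varepsilon_t,$$\n\nwhere $\\widehat{\\text{smb}}_t$ denotes our replicated series.\n\n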
If we did a good job, then we should see a non-significant intercept (rejecting the notion of systematic error), a coefficient close to 1 (indicating a high correlation), and an adjusted R-squared close to 1 (indicating a high proportion of explained variance).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest <- factors_ff3_monthly |>\n inner_join(factors_replicated, join_by(month)) |>\n mutate(\n across(c(smb_replicated, hml_replicated), ~round(., 4))\n )\n```\n:::\n\n\nTo test the success of the SMB factor, we hence run the following regression:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_smb <- lm(smb ~ smb_replicated, data = test)\nsummary(model_smb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = smb ~ smb_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.020294 -0.001541 -0.000055 0.001489 0.015482 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.000130 0.000131 -1 0.32 \nsmb_replicated 0.993003 0.004335 229 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00355 on 736 degrees of freedom\nMultiple R-squared: 0.986,\tAdjusted R-squared: 0.986 \nF-statistic: 5.25e+04 on 1 and 736 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe results for the SMB factor are really convincing as all three criteria outlined above are met and the coefficient is 0.99 and the R-squared is at 99%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_hml <- lm(hml ~ hml_replicated, data = test)\nsummary(model_hml)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = hml ~ hml_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.02324 -0.00287 -0.00010 0.00230 0.03407 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000291 0.000219 1.33 0.18 \nhml_replicated 0.963431 0.007280 132.35 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00591 on 736 degrees of freedom\nMultiple R-squared: 0.96,\tAdjusted R-squared: 0.96 \nF-statistic: 1.75e+04 on 1 and 736 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe replication of the HML factor is also a success, although at a slightly lower coefficient of 0.96 and an R-squared around 96%. \n\nThe evidence hence allows us to conclude that we did a relatively good job in replicating the original Fama-French size and value premiums, although we do not know their underlying code. From our perspective, a perfect match is only possible with additional information from the maintainers of the original data.\n\n## Fama and French Five Factor Model\n\nNow, let us move to the replication of the five factor model. We extend the `other_sorting_variables` table from above with the additional characteristics operating profitability `op` and investment `inv`. Note that the `drop_na()` statement yields different sample sizes as some firms with `be` values might not have `op` or `inv` values. 
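The missing values originate in Compustat, so a quick tally of firm-years with book equity but missing `op` or `inv` gives a sense of the magnitude (a sketch; the exact counts depend on your data vintage).\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: Compustat firm-years with book equity but missing operating\n# profitability or investment\ncompustat |>\n  summarize(\n    n_be = sum(!is.na(be)),\n    n_missing_op = sum(!is.na(be) & is.na(op)),\n    n_missing_inv = sum(!is.na(be) & is.na(inv))\n  )\n```\n:::\n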
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nother_sorting_variables <- compustat |>\n mutate(sorting_date = ymd(str_c(year(datadate) + 1, \"0701\"))) |>\n select(gvkey, sorting_date, be, op, inv) |>\n inner_join(market_equity, \n join_by(gvkey, sorting_date)) |>\n mutate(bm = be / me) |>\n select(permno, sorting_date, me, be, bm, op, inv)\n\nsorting_variables <- size |>\n inner_join(\n other_sorting_variables, \n join_by(permno, sorting_date)\n ) |>\n drop_na() |>\n distinct(permno, sorting_date, .keep_all = TRUE)\n```\n:::\n\n\nIn each month, we independently sort all stocks into the two size portfolios. The value, profitability, and investment portfolios, on the other hand, are the results of dependent sorts based on the size portfolios. We then merge the portfolios to the return data for the rest of the year just as above. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios <- sorting_variables |>\n group_by(sorting_date) |>\n mutate(\n portfolio_size = assign_portfolio(\n data = pick(everything()),\n sorting_variable = size,\n percentiles = c(0, 0.5, 1)\n )) |> \n group_by(sorting_date, portfolio_size) |> \n mutate(\n across(c(bm, op, inv), ~assign_portfolio(\n data = pick(everything()), \n sorting_variable = ., \n percentiles = c(0, 0.3, 0.7, 1)),\n .names = \"portfolio_{.col}\"\n )\n ) |>\n ungroup() |> \n select(permno, sorting_date, \n portfolio_size, portfolio_bm,\n portfolio_op, portfolio_inv)\n\nportfolios <- crsp_monthly |>\n mutate(sorting_date = case_when(\n month(month) <= 6 ~ ymd(str_c(year(month) - 1, \"0701\")),\n month(month) >= 7 ~ ymd(str_c(year(month), \"0701\"))\n )) |>\n inner_join(portfolios, join_by(permno, sorting_date))\n```\n:::\n\n\nNow, we want to construct each of the factors, but this time the size factor actually comes last because it is the result of averaging across all other factor portfolios. This dependency is the reason why we keep the table with value-weighted portfolio returns as a separate object that we reuse later. 
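Written out, the constructions implemented in the following chunks correspond to\n\n$$\\text{HML} = \\frac{1}{2}\\left(\\text{Small.High} + \\text{Big.High}\\right) - \\frac{1}{2}\\left(\\text{Small.Low} + \\text{Big.Low}\\right),$$\n\nwith RMW and CMA defined analogously (CMA is long the low-investment portfolios), and, once all three 2x3 sorts are available,\n\n$$\\text{SMB} = \\frac{1}{3}\\left(\\text{SMB}_{B/M} + \\text{SMB}_{OP} + \\text{SMB}_{INV}\\right).$$\n\n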
We construct the value factor, HML, as above by going long the two portfolios with high book-to-market ratios and shorting the two portfolios with low book-to-market.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_value <- portfolios |>\n group_by(portfolio_size, portfolio_bm, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n )\n\nfactors_value <- portfolios_value |>\n group_by(month) |>\n summarize(\n hml_replicated = mean(ret[portfolio_bm == 3]) -\n mean(ret[portfolio_bm == 1])\n )\n```\n:::\n\n\nFor the profitability factor, RMW, we take a long position in the two high profitability portfolios and a short position in the two low profitability portfolios.\\index{Factor!Profitability}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_profitability <- portfolios |>\n group_by(portfolio_size, portfolio_op, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n ) \n\nfactors_profitability <- portfolios_profitability |>\n group_by(month) |>\n summarize(\n rmw_replicated = mean(ret[portfolio_op == 3]) -\n mean(ret[portfolio_op == 1])\n )\n```\n:::\n\n\nFor the investment factor, CMA, we go long the two low investment portfolios and short the two high investment portfolios.\\index{Factor!Investment}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_investment <- portfolios |>\n group_by(portfolio_size, portfolio_inv, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n )\n\nfactors_investment <- portfolios_investment |>\n group_by(month) |>\n summarize(\n cma_replicated = mean(ret[portfolio_inv == 1]) -\n mean(ret[portfolio_inv == 3])\n )\n```\n:::\n\n\nFinally, the size factor, SMB, is constructed by going long the six small portfolios and short the six large portfolios. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_size <- bind_rows(\n portfolios_value,\n portfolios_profitability,\n portfolios_investment\n) |> \n group_by(month) |>\n summarize(\n smb_replicated = mean(ret[portfolio_size == 1]) -\n mean(ret[portfolio_size == 2])\n )\n```\n:::\n\n\nWe then join all factors together into one data frame and construct again a suitable table to run tests for evaluating our replication.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_replicated <- factors_size |>\n full_join(\n factors_value, join_by(month)\n ) |>\n full_join(\n factors_profitability, join_by(month)\n ) |>\n full_join(\n factors_investment, join_by(month)\n )\n\ntest <- factors_ff5_monthly |>\n inner_join(factors_replicated, join_by(month)) |>\n mutate(\n across(c(smb_replicated, hml_replicated, \n rmw_replicated, cma_replicated), ~round(., 4))\n )\n```\n:::\n\n\nLet us start the replication evaluation again with the size factor:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_smb <- lm(smb ~ smb_replicated, data = test)\nsummary(model_smb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = smb ~ smb_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.018146 -0.001865 0.000181 0.001980 0.014417 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.000203 0.000136 -1.5 0.14 \nsmb_replicated 0.969574 0.004375 221.6 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.00361 on 712 degrees of freedom\nMultiple R-squared: 0.986,\tAdjusted R-squared: 0.986 \nF-statistic: 4.91e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe results for the SMB factor are quite convincing as all three criteria outlined above are met and the coefficient is 0.97 and the R-squared is at 99%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_hml <- lm(hml ~ hml_replicated, data = test)\nsummary(model_hml)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = hml ~ hml_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.04464 -0.00414 -0.00035 0.00410 0.03667 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000473 0.000298 1.59 0.11 \nhml_replicated 0.991570 0.010271 96.54 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00792 on 712 degrees of freedom\nMultiple R-squared: 0.929,\tAdjusted R-squared: 0.929 \nF-statistic: 9.32e+03 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe replication of the HML factor is also a success, although at a slightly higher coefficient of 0.99 and an R-squared around 93%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_rmw <- lm(rmw ~ rmw_replicated, data = test)\nsummary(model_rmw)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = rmw ~ rmw_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.019810 -0.003051 0.000054 0.003280 0.018801 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 5.59e-05 2.02e-04 0.28 0.78 \nrmw_replicated 9.55e-01 8.88e-03 107.52 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00536 on 712 degrees of freedom\nMultiple R-squared: 0.942,\tAdjusted R-squared: 0.942 \nF-statistic: 1.16e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nWe are also able to replicate the RMW factor quite well with a coefficient of 0.95 and an R-squared around 94%.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_cma <- lm(cma ~ cma_replicated, data = test)\nsummary(model_cma)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = cma ~ cma_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.015071 -0.002720 -0.000225 0.002331 0.021404 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000673 0.000171 3.93 9.4e-05 ***\ncma_replicated 0.965219 0.008195 117.78 < 2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00455 on 712 degrees of freedom\nMultiple R-squared: 0.951,\tAdjusted R-squared: 0.951 \nF-statistic: 1.39e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nFinally, the CMA factor also replicates well with a coefficient of 0.97 and an R-squared around 95%.\n\nOverall, our approach seems to replicate the Fama-French three and five factor models just as well as the three factors. \n\n## Exercises\n\n1. @Fama1993 claim that their sample excludes firms until they have appeared in Compustat for two years. Implement this additional filter and compare the improvements of your replication effort. \n2. On his homepage, [Kenneth French](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/variable_definitions.html) provides instructions on how to construct the most common variables used for portfolio sorts. 
Try to replicate the univariate portfolio sort return time series for `E/P` (earnings / price) provided on his homepage and evaluate your replication effort using regressions.\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/r/setting-up-your-environment/execute-results/html.json b/_freeze/r/setting-up-your-environment/execute-results/html.json index a2b2bc74..68facf3a 100644 --- a/_freeze/r/setting-up-your-environment/execute-results/html.json +++ b/_freeze/r/setting-up-your-environment/execute-results/html.json @@ -1,11 +1,9 @@ { - "hash": "98e050968d54d756e6a65fdb0f8da089", + "hash": "81bf1ebef16bf2a9d7f3a09520917963", "result": { "engine": "knitr", - "markdown": "---\ntitle: Setting Up Your Environment\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Setting Up Your R Environment\n description-meta: How to set up your development environment to program with R. \n---\n\n\nWe aim to lower the bar for starting empirical research in financial economics. We want that using R is easy for you. However, given that Tidy Finance is a platform that supports multiple programming languages, we also consider the possibility that you are not familiar with R at all. Hence, we provide you with a simple guide to get started with R and RStudio. If you were not using R before, you will be able to use it after reading this chapter.\n\n## The R language\n\nSome good news first: The software you need is free and easy to download. We will start with downloading and installing R and follow up with doing the same for RStudio.\n\nR is provided via [The Comprehensive R Archive Network](https://cran.r-project.org/) (or short CRAN). CRAN does not only provide the main software but also nearly all extensions that you need. We will cover these extensions or packages later, as we usually visit the CRAN website only to download the base version. Now, go ahead and visit CRAN. On the landing page, you can choose your operating systems (i.e., Linux, macOS, and Windows). Click the respective link that fits your system:\\index{CRAN}\n\n- R comes as a part of many Linux distributions. If it does not, CRAN provides installation guides for individual Linux distributions.\n- For macOS, the choice currently depends on some hardware specifications, but the right version for your system is clearly indicated.\n- For Windows, you want to use the base version provided.\n\nAfter downloading and installing the software to your system, you are nearly ready to go. In fact, you could just use R now. Unfortunately for many users, R is not a program but a programming language and comes with an interpreter that you would use like a command line. While using R like this might make you feel like a hacker (not that we do not endorse any criminal activity), it is in your best interest to combine R with RStudio.\n\nR is constantly being updated, with new versions being released multiple times a year. This means that you might want to return to CRAN in the future to fetch yourself an update. You know it is time for an update if packages remind you that you are using an outdated version of R. \n\n## RStudio\n\nAssuming you are looking for a more comfortable way of using R, you will get RStudio next. You can download it for free from [Posit](https://posit.co/download/rstudio-desktop/) (i.e., the company that created RStudio, which was previously called RStudio itself). When you follow the instructions, you will see that Posit asks you to install R. 
However, you should have done that already and can move straight to downloading and installing RStudio.\\index{RStudio}\n\nRStudio is a program similar to other programs you most likely use, like a browser, text editor, or anything else. It comes with many advantages, including a project manager, Github integration, and much more. Unfortunately, Tidy Finance is not the right scope to elaborate more on these possibilities or introduce the basics of programming, but we point you to some excellent resources below. For the purposes of this book, you have completed your excursions to websites that provide you with the necessary software installers.\n\n## R Packages and Environments\n\nFollowing your read of the preface to this book, you might now wonder why we did not download the `tidyverse` yet. Therefore, you must understand one more concept, namely packages in R.\\index{Packages} You can think of them as extensions that you use for specific purposes, whereas R itself is the core pillar upon which everything rests. Comfortably, you can install packages within R with the following code. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\nSimply specify the package you want where we placed `tidyverse`. You typically only need to install packages once - except for updates or project-specific R environments. Once installed, you can then load a package with a call to `library(\"tidyverse\")` to use it.\n\nTo keep track of the packages' versions and make our results replicatable, we rely on the package `renv`. It creates a project-specific installation of R packages and you can find the full list of packages used here in the colophon below. The recorded package versions can also be shared with collaborators to ensure consistency. Our use of `renv` also makes it easier for you to install the exact package versions we were using (if you want that) by initializing `renv` with our [renv.lock](https://github.com/tidy-finance/website/blob/main/renv.lock)-file from Github.\\index{Packages:renv} \\index{Github}\n\nOne more piece of advice is the use of RStudio projects. They are a powerful tool to save you some time and make working with R more fun. Without going into more detail here, we refer you to @Wickham2023's chapter on [Workflow: scripts and projects](https://r4ds.hadley.nz/workflow-scripts.html).\\index{RStudio!Projects}\n\n## Your First Steps with R\n\nWhile we believe that downloading and installing R and RStudio is sufficiently easy, you might find help from @Grolemund2014 on [R and RStudio](https://rstudio-education.github.io/hopr/starting.html), [packages](https://rstudio-education.github.io/hopr/packages2.html), as well as [updating](https://rstudio-education.github.io/hopr/updating.html) the software.\n\nThis book's scope cannot be to give you an introduction to R itself. It is not our comparative advantage. However, we can point you to a possible path that you could follow to familiarize yourself with R. Therefore, we make the following suggestion:\n\n- If you are new to R itself, a very gentle and good introduction to the workings of R can be found in @Grolemund2014. He provides a wonderful example in the form of the [weighted dice project](https://rstudio-education.github.io/hopr/project-1-weighted-dice.html). 
Once you are done setting up R on your machine, try to follow the instructions in this project.\n- The main book on the `tidyverse`, @Wickham2023, is available online and for free: [R for Data Science](https://r4ds.had.co.nz/introduction.html) explains the majority of the tools we use in our book. Working through this text is an eye-opening experience and really useful.\\index{tidyverse}\n\nAdditional resources we can encourage you to use are the following:\n\n- If you are an instructor searching to effectively teach R and data science methods, we recommend taking a look at the excellent [data science toolbox](https://datasciencebox.org/) by [Mine Cetinkaya-Rundel.](https://mine-cr.com/about/)\n- RStudio provides a range of excellent [cheat sheets](https://www.rstudio.com/resources/cheatsheets/) with extensive information on how to use the `tidyverse` packages.\\index{tidyverse} \\index{RStudio:Cheat sheets}\n\n## Colophon\n\nThis book was written in RStudio using `bookdown` [@Xie2016]. The website was rendered using `quarto` [@quarto] and it is hosted via GitHub Pages.\nThe complete source is [available from GitHub](https://github.com/tidy-finance/website).\\index{Bookdown} \\index{Github} \\index{Quarto} We generated all plots in this book using `ggplot2` and its classic dark-on-light theme (`theme_bw()`).\\index{ggplot2 theme}\n\nThis version of the book was built with R [@R-base] version 4.3.1 (2023-06-16, Beagle Scouts) and the following packages: \\index{Colophon}\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| Package | Version |
|:---|:---|
| RPostgres | 1.4.5 |
| RSQLite | 2.3.1 |
| broom | 1.0.5 |
| brulee | 0.2.0 |
| dbplyr | 2.3.2 |
| devtools | 2.4.5 |
| dplyr | 1.1.3 |
| fixest | 0.11.1 |
| forcats | 1.0.0 |
| frenchdata | 0.2.0 |
| furrr | 0.3.1 |
| ggplot2 | 3.4.3 |
| glmnet | 4.1-8 |
| googledrive | 2.1.0 |
| hardhat | 1.3.0 |
| hexSticker | 0.4.9 |
| jsonlite | 1.8.4 |
| kableExtra | 1.3.4 |
| lmtest | 0.9-40 |
| lubridate | 1.9.2 |
| nloptr | 2.0.3 |
| purrr | 1.0.1 |
| ranger | 0.15.1 |
| readr | 2.1.4 |
| readxl | 1.4.2 |
| renv | 1.0.3 |
| rlang | 1.1.1 |
| rmarkdown | 2.21 |
| sandwich | 3.0-2 |
| scales | 1.2.1 |
| slider | 0.3.0 |
| stringr | 1.5.0 |
| tibble | 3.2.1 |
| tidymodels | 1.1.0 |
| tidyquant | 1.0.7 |
| tidyr | 1.3.0 |
| tidyverse | 2.0.0 |
| timetk | 2.8.3 |
| torch | 0.11.0 |
| wesanderson | 0.3.6 |
\n\n`````\n:::\n:::\n", - "supporting": [ - "setting-up-your-environment_files" - ], + "markdown": "---\ntitle: Setting Up Your Environment\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Setting Up Your R Environment\n description-meta: How to set up your development environment to program with R. \n---\n\n\nWe aim to lower the bar for starting empirical research in financial economics. We want that using R is easy for you. However, given that Tidy Finance is a platform that supports multiple programming languages, we also consider the possibility that you are not familiar with R at all. Hence, we provide you with a simple guide to get started with R and RStudio. If you were not using R before, you will be able to use it after reading this chapter.\n\n## The R language\n\nSome good news first: The software you need is free and easy to download. We will start with downloading and installing R and follow up with doing the same for RStudio.\n\nR is provided via [The Comprehensive R Archive Network](https://cran.r-project.org/) (or short CRAN). CRAN does not only provide the main software but also nearly all extensions that you need. We will cover these extensions or packages later, as we usually visit the CRAN website only to download the base version. Now, go ahead and visit CRAN. On the landing page, you can choose your operating systems (i.e., Linux, macOS, and Windows). Click the respective link that fits your system:\\index{CRAN}\n\n- R comes as a part of many Linux distributions. If it does not, CRAN provides installation guides for individual Linux distributions.\n- For macOS, the choice currently depends on some hardware specifications, but the right version for your system is clearly indicated.\n- For Windows, you want to use the base version provided.\n\nAfter downloading and installing the software to your system, you are nearly ready to go. In fact, you could just use R now. Unfortunately for many users, R is not a program but a programming language and comes with an interpreter that you would use like a command line. While using R like this might make you feel like a hacker (not that we do not endorse any criminal activity), it is in your best interest to combine R with RStudio.\n\nR is constantly being updated, with new versions being released multiple times a year. This means that you might want to return to CRAN in the future to fetch yourself an update. You know it is time for an update if packages remind you that you are using an outdated version of R. \n\n## RStudio\n\nAssuming you are looking for a more comfortable way of using R, you will get RStudio next. You can download it for free from [Posit](https://posit.co/download/rstudio-desktop/) (i.e., the company that created RStudio, which was previously called RStudio itself). When you follow the instructions, you will see that Posit asks you to install R. However, you should have done that already and can move straight to downloading and installing RStudio.\\index{RStudio}\n\nRStudio is a program similar to other programs you most likely use, like a browser, text editor, or anything else. It comes with many advantages, including a project manager, Github integration, and much more. Unfortunately, Tidy Finance is not the right scope to elaborate more on these possibilities or introduce the basics of programming, but we point you to some excellent resources below. 
For the purposes of this book, you have completed your excursions to websites that provide you with the necessary software installers.\n\n## R Packages and Environments\n\nFollowing your read of the preface to this book, you might now wonder why we did not download the `tidyverse` yet. Therefore, you must understand one more concept, namely packages in R.\\index{Packages} You can think of them as extensions that you use for specific purposes, whereas R itself is the core pillar upon which everything rests. Comfortably, you can install packages within R with the following code. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\nSimply specify the package you want where we placed `tidyverse`. You typically only need to install packages once - except for updates or project-specific R environments. Once installed, you can then load a package with a call to `library(\"tidyverse\")` to use it.\n\nTo keep track of the packages' versions and make our results replicatable, we rely on the package `renv`. It creates a project-specific installation of R packages and you can find the full list of packages used here in the colophon below. The recorded package versions can also be shared with collaborators to ensure consistency. Our use of `renv` also makes it easier for you to install the exact package versions we were using (if you want that) by initializing `renv` with our [renv.lock](https://github.com/tidy-finance/website/blob/main/renv.lock)-file from Github.\\index{Packages:renv} \\index{Github}\n\nOne more piece of advice is the use of RStudio projects. They are a powerful tool to save you some time and make working with R more fun. Without going into more detail here, we refer you to @Wickham2023's chapter on [Workflow: scripts and projects](https://r4ds.hadley.nz/workflow-scripts.html).\\index{RStudio!Projects}\n\n## Your First Steps with R\n\nWhile we believe that downloading and installing R and RStudio is sufficiently easy, you might find help from @Grolemund2014 on [R and RStudio](https://rstudio-education.github.io/hopr/starting.html), [packages](https://rstudio-education.github.io/hopr/packages2.html), as well as [updating](https://rstudio-education.github.io/hopr/updating.html) the software.\n\nThis book's scope cannot be to give you an introduction to R itself. It is not our comparative advantage. However, we can point you to a possible path that you could follow to familiarize yourself with R. Therefore, we make the following suggestion:\n\n- If you are new to R itself, a very gentle and good introduction to the workings of R can be found in @Grolemund2014. He provides a wonderful example in the form of the [weighted dice project](https://rstudio-education.github.io/hopr/project-1-weighted-dice.html). Once you are done setting up R on your machine, try to follow the instructions in this project.\n- The main book on the `tidyverse`, @Wickham2023, is available online and for free: [R for Data Science](https://r4ds.had.co.nz/introduction.html) explains the majority of the tools we use in our book. 
Working through this text is an eye-opening experience and really useful.\\index{tidyverse}\n\nAdditional resources we can encourage you to use are the following:\n\n- If you are an instructor searching to effectively teach R and data science methods, we recommend taking a look at the excellent [data science toolbox](https://datasciencebox.org/) by [Mine Cetinkaya-Rundel.](https://mine-cr.com/about/)\n- RStudio provides a range of excellent [cheat sheets](https://www.rstudio.com/resources/cheatsheets/) with extensive information on how to use the `tidyverse` packages.\\index{tidyverse} \\index{RStudio:Cheat sheets}\n\n## Colophon\n\nThis book was written in RStudio using `bookdown` [@Xie2016]. The website was rendered using `quarto` [@quarto] and it is hosted via GitHub Pages.\nThe complete source is [available from GitHub](https://github.com/tidy-finance/website).\\index{Bookdown} \\index{Github} \\index{Quarto} We generated all plots in this book using `ggplot2` and its classic dark-on-light theme (`theme_bw()`).\\index{ggplot2 theme}\n\nThis version of the book was built with R [@R-base] version 4.3.2 (2023-10-31, Eye Holes) and the following packages: \\index{Colophon}\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Package Version
RPostgres 1.4.5
RSQLite 2.3.1
broom 1.0.5
brulee 0.2.0
dbplyr 2.3.2
devtools 2.4.5
dplyr 1.1.3
fixest 0.11.1
forcats 1.0.0
frenchdata 0.2.0
furrr 0.3.1
ggplot2 3.4.3
glmnet 4.1-8
googledrive 2.1.0
hardhat 1.3.0
hexSticker 0.4.9
jsonlite 1.8.4
kableExtra 1.3.4
lmtest 0.9-40
lubridate 1.9.2
nloptr 2.0.3
purrr 1.0.1
ranger 0.15.1
readr 2.1.4
readxl 1.4.2
renv 1.0.3
rlang 1.1.1
rmarkdown 2.21
sandwich 3.0-2
scales 1.2.1
slider 0.3.0
stringr 1.5.0
tibble 3.2.1
tidymodels 1.1.0
tidyquant 1.0.7
tidyr 1.3.0
tidyverse 2.0.0
timetk 2.8.3
torch 0.11.0
wesanderson 0.3.6
\n\n`````\n:::\n:::\n", + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/r/trace-and-fisd/execute-results/html.json b/_freeze/r/trace-and-fisd/execute-results/html.json index 25cef0d5..35532f0f 100644 --- a/_freeze/r/trace-and-fisd/execute-results/html.json +++ b/_freeze/r/trace-and-fisd/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "716a9d11a8ce7edf5b5f4ee59c7883d1", + "hash": "f094010288c587f828fd06baab89551d", "result": { "engine": "knitr", - "markdown": "---\ntitle: TRACE and FISD\naliases:\n - ../trace-and-fisd.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: TRACE and FISD with R\n description-meta: Download and prepare corporate bond data such as TRACE and FISD from WRDS using the programming language R. \n---\n\n\nIn this chapter, we dive into the US corporate bond market. Bond markets are far more diverse than stock markets, as most issuers have multiple bonds outstanding simultaneously with potentially very different indentures. This market segment is exciting due to its size (roughly 10 trillion USD outstanding), heterogeneity of issuers (as opposed to government bonds), market structure (mostly over-the-counter trades), and data availability. We introduce how to use bond characteristics from FISD and trade reports from TRACE and provide code to download and clean TRACE in R. \n\nMany researchers study liquidity in the US corporate bond market [see, e.g., @bessembinder2006, @Edwards2007, and @Ohara2021, among many others]. We do not cover bond returns here, but you can compute them from TRACE data. Instead, we refer to studies on the topic such as @Bessembinder2008, @bai2019, and @kelly2020 and a survey by @Huang2021. Moreover, WRDS includes bond returns computed from TRACE data at a monthly frequency.\\index{Corporate bonds}\n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(dbplyr)\nlibrary(RSQLite)\nlibrary(RPostgres)\nlibrary(devtools)\n```\n:::\n\n\nCompared to previous chapters, we load the `devtools` package [@devtools] to source code that we provided to the public via [gist.](https://docs.github.com/en/get-started/writing-on-github/editing-and-sharing-content-with-gists/creating-gists)\n\n## Bond Data from WRDS \n\nBoth bond databases we need are available on [WRDS](https://wrds-www.wharton.upenn.edu/) to which we establish the `RPostgres` connection described in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). Additionally, we connect to our local `SQLite`-database to store the data we download.\\index{WRDS}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrds <- dbConnect(\n Postgres(),\n host = \"wrds-pgdata.wharton.upenn.edu\",\n dbname = \"wrds\",\n port = 9737,\n sslmode = \"require\",\n user = Sys.getenv(\"WRDS_USER\"),\n password = Sys.getenv(\"WRDS_PASSWORD\")\n)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\n## Mergent FISD\n\nFor research on US corporate bonds, the Mergent Fixed Income Securities Database (FISD) is the primary resource for bond characteristics.\\index{Data!FISD} There is a [detailed manual](https://wrds-www.wharton.upenn.edu/documents/1364/FixedIncome_Securities_Master_Database_User_Guide_v4.pdf) on WRDS, so we only cover the necessary subjects here. FISD data comes in two main variants, namely, centered on issuers or issues. 
In either case, the most useful identifiers are [CUSIPs.](https://www.cusip.com/index.html) 9-digit CUSIPs identify securities issued by issuers. The issuers can be identified from the first six digits of a security CUSIP, which is also called 6-digit CUSIP. Both stocks and bonds have CUSIPs.\\index{CUSIP} This connection would, in principle, allow matching them easily, but due to changing issuer details, this approach only yields small coverage.\n\nWe use the issue-centered version of FISD to identify the subset of US corporate bonds that meet the standard criteria [@bessembinder2006]. The WRDS table `fisd_mergedissue` contains most of the information we need on a 9-digit CUSIP level. \nDue to the diversity of corporate bonds, details in the indenture vary significantly. We focus on common bonds that make up the majority of trading volume in this market without diverging too much in indentures. \n\nThe following chunk connects to the data and selects the bond sample to remove certain bond types that are less commonly used [see, e.g., @Dick2012; @Ohara2021, among many others].\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_mergedissue_db <- tbl(wrds, in_schema(\"fisd\", \"fisd_mergedissue\"))\n\nfisd <- fisd_mergedissue_db |>\n filter(\n security_level == \"SEN\", # senior bonds\n slob == \"N\" | is.na(slob), # secured lease obligation\n is.na(security_pledge), # unsecured bonds\n asset_backed == \"N\" | is.na(asset_backed), # not asset backed\n defeased == \"N\" | is.na(defeased), # not defeased\n is.na(defeased_date),\n bond_type %in% c(\n \"CDEB\", # US Corporate Debentures\n \"CMTN\", # US Corporate MTN (Medium Term Note)\n \"CMTZ\", # US Corporate MTN Zero\n \"CZ\", # US Corporate Zero,\n \"USBN\" # US Corporate Bank Note\n ), \n pay_in_kind != \"Y\" | is.na(pay_in_kind), # not payable in kind\n is.na(pay_in_kind_exp_date),\n yankee == \"N\" | is.na(yankee), # no foreign issuer\n canadian == \"N\" | is.na(canadian), # not Canadian\n foreign_currency == \"N\", # USD\n coupon_type %in% c(\n \"F\", # fixed coupon\n \"Z\" # zero coupon\n ), \n is.na(fix_frequency),\n coupon_change_indicator == \"N\",\n interest_frequency %in% c(\n \"0\", # per year\n \"1\",\n \"2\",\n \"4\",\n \"12\"\n ),\n rule_144a == \"N\", # publicly traded\n private_placement == \"N\" | is.na(private_placement),\n defaulted == \"N\", # not defaulted\n is.na(filing_date),\n is.na(settlement),\n convertible == \"N\", # not convertible\n is.na(exchange),\n putable == \"N\" | is.na(putable), # not putable\n unit_deal == \"N\" | is.na(unit_deal), # not issued with another security\n exchangeable == \"N\" | is.na(exchangeable), # not exchangeable\n perpetual == \"N\", # not perpetual\n preferred_security == \"N\" | is.na(preferred_security) # not preferred\n ) |> \n select(\n complete_cusip, maturity,\n offering_amt, offering_date,\n dated_date, \n interest_frequency, coupon,\n last_interest_date, \n issue_id, issuer_id\n ) |>\n collect()\n```\n:::\n\n\nWe also pull issuer information from `fisd_mergedissuer` regarding the industry and country of the firm that issued a particular bond. Then, we filter to include only US-domiciled firms' bonds. 
We match the data by `issuer_id`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_mergedissuer_db <- tbl(wrds, in_schema(\"fisd\", \"fisd_mergedissuer\")) \n\nfisd_issuer <- fisd_mergedissuer_db |>\n select(issuer_id, sic_code, country_domicile) |>\n collect()\n\nfisd <- fisd |>\n inner_join(fisd_issuer, by = \"issuer_id\") |>\n filter(country_domicile == \"USA\") |>\n select(-country_domicile)\n```\n:::\n\n\nFinally, we save the bond characteristics to our local database. This selection of bonds also constitutes the sample for which we will collect trade reports from TRACE below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n conn = tidy_finance,\n name = \"fisd\",\n value = fisd,\n overwrite = TRUE\n)\n```\n:::\n\n\nThe FISD database also contains other data. The issue-based file contains information on covenants, i.e., restrictions included in bond indentures to limit specific actions by firms [e.g., @handler2021]. Moreover, FISD also provides information on bond ratings. We do not need either here.\n\n## TRACE\n\nThe Financial Industry Regulatory Authority (FINRA) provides the Trade Reporting and Compliance Engine (TRACE).\\index{Data!TRACE} In TRACE, dealers that trade corporate bonds must report such trades individually. Hence, we observe trade messages in TRACE that contain information on the bond traded, the trade time, price, and volume. TRACE comes in two variants; standard and enhanced TRACE. We show how to download and clean enhanced TRACE as it contains uncapped volume, a crucial quantity missing in the standard distribution. Moreover, enhanced TRACE also provides information on the respective parties' roles and the direction of the trade report. These items become essential in cleaning the messages.\n\nWhy do we repeatedly talk about cleaning TRACE? Trade messages are submitted within a short time window after a trade is executed (less than 15 minutes). These messages can contain errors, and the reporters subsequently correct them or they cancel a trade altogether. The cleaning needs are described by @Dick2009 in detail, and @Dick2014 shows how to clean the enhanced TRACE data using SAS. We do not go into the cleaning steps here, since the code is lengthy and serves no educational purpose. However, downloading and cleaning enhanced TRACE data is straightforward with our setup.\n\nWe store code for cleaning enhanced TRACE with R on the following Github [gist.](https://gist.github.com/patrick-weiss/3a05b3ab281563b2e94858451c2eb3a4) \\index{Github!Gist} as a function. The appendix also contains the code for reference. We only need to source the code from the gist, which we can do with `source_gist()`. Alternatively, you can also go to the gist, download it, and `source()` the respective R-file. The `clean_enhanced_trace()` function takes a vector of CUSIPs, a connection to WRDS explained in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd), and a start and end date, respectively.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource_gist(\"3a05b3ab281563b2e94858451c2eb3a4\")\n```\n:::\n\n\nThe TRACE database is considerably large. Therefore, we only download subsets of data at once. Specifying too many CUSIPs over a long time horizon will result in very long download times and a potential failure due to the size of the request to WRDS. The size limit depends on many parameters, and we cannot give you a guideline here. If we were working with the complete TRACE data for all CUSIPs above, splitting the data into 100 parts takes roughly two hours using our setup. 
For the applications in this book, we need data around the Paris Agreement in December 2015 and download the data in ten sets, which we define below.\\index{Paris (Climate) Agreement}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_cusips <- fisd |>\n pull(complete_cusip)\n\nfisd_parts <- split(\n fisd_cusips,\n rep(1:10, \n length.out = length(fisd_cusips))\n)\n```\n:::\n\n\nFinally, we run a loop in the same style as in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd) where we download daily returns from CRSP. For each of the CUSIP sets defined above, we call the cleaning function and save the resulting output. We add new data to the existing table for batch two and all following batches.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbatches <- length(fisd_parts)\n\nfor (j in 1:batches) {\n trace_enhanced <- clean_enhanced_trace(\n cusips = fisd_parts[[j]],\n connection = wrds,\n start_date = ymd(\"2014-01-01\"),\n end_date = ymd(\"2016-11-30\")\n )\n\n dbWriteTable(\n conn = tidy_finance,\n name = \"trace_enhanced\",\n value = trace_enhanced,\n overwrite = ifelse(j == 1, TRUE, FALSE),\n append = ifelse(j != 1, TRUE, FALSE)\n )\n \n cat(\"Batch\", j, \"out of\", batches, \"done (\", \n round(j / batches, 2) * 100, \"%)\\n\")\n}\n```\n:::\n\n\n## Insights into Corporate Bonds\n\nWhile many news outlets readily provide information on stocks and the underlying firms, corporate bonds are not covered frequently. Additionally, the TRACE database contains trade-level information, potentially new to students. Therefore, we provide you with some insights by showing some summary statistics.\\index{Summary statistics}\n\nWe start by looking into the number of bonds outstanding over time and compare it to the number of bonds traded in our sample. First, we compute the number of bonds outstanding for each quarter around the Paris Agreement from 2014 to 2016. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_outstanding <- expand_grid(\"date\" = seq(ymd(\"2014-01-01\"),\n ymd(\"2016-11-30\"), \n by = \"quarter\"), \n \"complete_cusip\" = fisd$complete_cusip) |> \n left_join(fisd |> select(complete_cusip, \n offering_date,\n maturity), \n by = \"complete_cusip\") |> \n mutate(offering_date = floor_date(offering_date),\n maturity = floor_date(maturity)) |> \n filter(date >= offering_date & date <= maturity) |> \n count(date) |> \n mutate(type = \"Outstanding\")\n```\n:::\n\n\nNext, we look at the bonds traded each quarter in the same period. 
Notice that we load the complete trace table from our database, as we only have a single part of it in the environment from the download loop from above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_enhanced <- tbl(tidy_finance, \"trace_enhanced\") |>\n collect()\n\nbonds_traded <- trace_enhanced |> \n mutate(date = floor_date(trd_exctn_dt, \"quarters\")) |> \n group_by(date) |> \n summarize(n = length(unique(cusip_id)),\n type = \"Traded\",\n .groups = \"drop\") \n```\n:::\n\n\nFinally, we plot the two time series in @fig-401.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_outstanding |> \n bind_rows(bonds_traded) |> \n ggplot(aes(\n x = date, \n y = n, \n color = type, \n linetype = type\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Number of bonds outstanding and traded each quarter\"\n )\n```\n\n::: {.cell-output-display}\n![The number of corporate bonds outstanding each quarter as reported by Mergent FISD and the number of traded bonds from enhanced TRACE between 2014 and end of 2016.](trace-and-fisd_files/figure-html/fig-401-1.png){#fig-401 fig-alt='Title: Number of bonds outstanding and traded each quarter. The figure shows a time series of outstanding bonds and bonds traded. The amount outstanding increases monotonically between 2014 and 2016. The number of bonds traded represents only a fraction of roughly 60 percent, which peaks around the third quarter of 2016.' width=2100}\n:::\n:::\n\n\nWe see that the number of bonds outstanding increases steadily between 2014 and 2016. During our sample period of trade data, we see that the fraction of bonds trading each quarter is roughly 60%. The relatively small number of traded bonds means that many bonds do not trade through an entire quarter. This lack of trading activity illustrates the generally low level of liquidity in the corporate bond market, where it can be hard to trade specific bonds. \nDoes this lack of liquidity mean that corporate bond markets are irrelevant in terms of their size? With over 7,500 traded bonds each quarter, it is hard to say that the market is small. However, let us also investigate the characteristics of issued corporate bonds. In particular, we consider maturity (in years), coupon, and offering amount (in million USD).\\index{Liquidity}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd |>\n mutate(maturity = as.numeric(maturity - offering_date) / 365,\n offering_amt = offering_amt / 10^3) |> \n pivot_longer(cols = c(maturity, coupon, offering_amt),\n names_to = \"measure\") |>\n drop_na() |> \n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 8\n measure mean sd min q05 q50 q95 max\n \n1 coupon 5.94 2.67 0 1.90 6.06 10 39 \n2 maturity 9.74 9.21 -6.24 1.03 7.18 30.0 101.\n3 offering_amt 379. 570. 0.001 0.669 200 1400 15000 \n```\n\n\n:::\n:::\n\n\nWe see that the average bond in our sample period has an offering amount of over 357 million USD with a median of 200 million USD, which both cannot be considered small. The average bond has a maturity of 10 years and pays around 6% in coupons.\n\nFinally, let us compute some summary statistics for the trades in this market. To this end, we show a summary based on aggregate information daily. 
In particular, we consider the trade size (in million USD) and the number of trades.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_enhanced |> \n group_by(trd_exctn_dt) |> \n summarize(trade_size = sum(entrd_vol_qt * rptd_pr / 100) / 10^6,\n trade_number = n(),\n .groups = \"drop\") |> \n pivot_longer(cols = c(trade_size, trade_number),\n names_to = \"measure\") |> \n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 8\n measure mean sd min q05 q50 q95 max\n \n1 trade_number 25914. 5444. 439 17844. 26051 34383. 40839 \n2 trade_size 12968. 3577. 17.2 6128. 13421. 17850. 21312.\n```\n\n\n:::\n:::\n\n\nOn average, nearly 26 billion USD of corporate bonds are traded daily in nearly 13,000 transactions. We can hence conclude that the corporate bond market is indeed significant in terms of trading volume and activity.\n\n## Exercises\n\n1. Compute the amount outstanding across all bonds over time. Make sure to subtract all matured bonds. How would you describe the resulting plot?\n1. Compute the number of days each bond is traded (accounting for the bonds' maturities and issuances). Start by looking at the number of bonds traded each day in a graph similar to the one above. How many bonds trade on more than 75% of trading days? \n1. WRDS provides more information from Mergent FISD such as ratings in the table `fisd_ratings`. Download the ratings table and plot the distribution of ratings for the different rating providers. How would you map the different providers to a common numeric rating scale? \\index{Rating}\n", + "markdown": "---\ntitle: TRACE and FISD\naliases:\n - ../trace-and-fisd.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: TRACE and FISD with R\n description-meta: Download and prepare corporate bond data such as TRACE and FISD from WRDS using the programming language R. \n---\n\n\nIn this chapter, we dive into the US corporate bond market. Bond markets are far more diverse than stock markets, as most issuers have multiple bonds outstanding simultaneously with potentially very different indentures. This market segment is exciting due to its size (roughly 10 trillion USD outstanding), heterogeneity of issuers (as opposed to government bonds), market structure (mostly over-the-counter trades), and data availability. We introduce how to use bond characteristics from FISD and trade reports from TRACE and provide code to download and clean TRACE in R. \n\nMany researchers study liquidity in the US corporate bond market [see, e.g., @bessembinder2006, @Edwards2007, and @Ohara2021, among many others]. We do not cover bond returns here, but you can compute them from TRACE data. Instead, we refer to studies on the topic such as @Bessembinder2008, @bai2019, and @kelly2020 and a survey by @Huang2021. Moreover, WRDS includes bond returns computed from TRACE data at a monthly frequency.\\index{Corporate bonds}\n\nThe current chapter relies on this set of R packages. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(dbplyr)\nlibrary(RSQLite)\nlibrary(RPostgres)\nlibrary(devtools)\n```\n:::\n\n\nCompared to previous chapters, we load the `devtools` package [@devtools] to source code that we provided to the public via [gist.](https://docs.github.com/en/get-started/writing-on-github/editing-and-sharing-content-with-gists/creating-gists)\n\n## Bond Data from WRDS \n\nBoth bond databases we need are available on [WRDS](https://wrds-www.wharton.upenn.edu/) to which we establish the `RPostgres` connection described in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). Additionally, we connect to our local `SQLite`-database to store the data we download.\\index{WRDS}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrds <- dbConnect(\n Postgres(),\n host = \"wrds-pgdata.wharton.upenn.edu\",\n dbname = \"wrds\",\n port = 9737,\n sslmode = \"require\",\n user = Sys.getenv(\"WRDS_USER\"),\n password = Sys.getenv(\"WRDS_PASSWORD\")\n)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\n## Mergent FISD\n\nFor research on US corporate bonds, the Mergent Fixed Income Securities Database (FISD) is the primary resource for bond characteristics.\\index{Data!FISD} There is a [detailed manual](https://wrds-www.wharton.upenn.edu/documents/1364/FixedIncome_Securities_Master_Database_User_Guide_v4.pdf) on WRDS, so we only cover the necessary subjects here. FISD data comes in two main variants, namely, centered on issuers or issues. In either case, the most useful identifiers are [CUSIPs.](https://www.cusip.com/index.html) 9-digit CUSIPs identify securities issued by issuers. The issuers can be identified from the first six digits of a security CUSIP, which is also called 6-digit CUSIP. Both stocks and bonds have CUSIPs.\\index{CUSIP} This connection would, in principle, allow matching them easily, but due to changing issuer details, this approach only yields small coverage.\n\nWe use the issue-centered version of FISD to identify the subset of US corporate bonds that meet the standard criteria [@bessembinder2006]. The WRDS table `fisd_mergedissue` contains most of the information we need on a 9-digit CUSIP level. \nDue to the diversity of corporate bonds, details in the indenture vary significantly. We focus on common bonds that make up the majority of trading volume in this market without diverging too much in indentures. 
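\n\nAs an aside, because the sample below is identified via 9-digit CUSIPs, it is sometimes handy to recover the 6-digit issuer CUSIP when linking bonds to other data sources. The following minimal sketch is purely illustrative and uses a made-up CUSIP-like string rather than an actual FISD identifier (`str_sub()` comes with the `stringr` package, which is part of the `tidyverse` loaded above):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Hypothetical example: the issuer (6-digit) CUSIP consists of the first six\n# characters of a security's 9-digit CUSIP\nexample_cusip <- \"123456AB7\"\nstr_sub(example_cusip, 1, 6)\n```\n:::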
\n\nThe following chunk connects to the data and selects the bond sample to remove certain bond types that are less commonly used [see, e.g., @Dick2012; @Ohara2021, among many others].\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_mergedissue_db <- tbl(wrds, in_schema(\"fisd\", \"fisd_mergedissue\"))\n\nfisd <- fisd_mergedissue_db |>\n filter(\n security_level == \"SEN\", # senior bonds\n slob == \"N\" | is.na(slob), # secured lease obligation\n is.na(security_pledge), # unsecured bonds\n asset_backed == \"N\" | is.na(asset_backed), # not asset backed\n defeased == \"N\" | is.na(defeased), # not defeased\n is.na(defeased_date),\n bond_type %in% c(\n \"CDEB\", # US Corporate Debentures\n \"CMTN\", # US Corporate MTN (Medium Term Note)\n \"CMTZ\", # US Corporate MTN Zero\n \"CZ\", # US Corporate Zero,\n \"USBN\" # US Corporate Bank Note\n ), \n pay_in_kind != \"Y\" | is.na(pay_in_kind), # not payable in kind\n is.na(pay_in_kind_exp_date),\n yankee == \"N\" | is.na(yankee), # no foreign issuer\n canadian == \"N\" | is.na(canadian), # not Canadian\n foreign_currency == \"N\", # USD\n coupon_type %in% c(\n \"F\", # fixed coupon\n \"Z\" # zero coupon\n ), \n is.na(fix_frequency),\n coupon_change_indicator == \"N\",\n interest_frequency %in% c(\n \"0\", # per year\n \"1\",\n \"2\",\n \"4\",\n \"12\"\n ),\n rule_144a == \"N\", # publicly traded\n private_placement == \"N\" | is.na(private_placement),\n defaulted == \"N\", # not defaulted\n is.na(filing_date),\n is.na(settlement),\n convertible == \"N\", # not convertible\n is.na(exchange),\n putable == \"N\" | is.na(putable), # not putable\n unit_deal == \"N\" | is.na(unit_deal), # not issued with another security\n exchangeable == \"N\" | is.na(exchangeable), # not exchangeable\n perpetual == \"N\", # not perpetual\n preferred_security == \"N\" | is.na(preferred_security) # not preferred\n ) |> \n select(\n complete_cusip, maturity,\n offering_amt, offering_date,\n dated_date, \n interest_frequency, coupon,\n last_interest_date, \n issue_id, issuer_id\n ) |>\n collect()\n```\n:::\n\n\nWe also pull issuer information from `fisd_mergedissuer` regarding the industry and country of the firm that issued a particular bond. Then, we filter to include only US-domiciled firms' bonds. We match the data by `issuer_id`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_mergedissuer_db <- tbl(wrds, in_schema(\"fisd\", \"fisd_mergedissuer\")) \n\nfisd_issuer <- fisd_mergedissuer_db |>\n select(issuer_id, sic_code, country_domicile) |>\n collect()\n\nfisd <- fisd |>\n inner_join(fisd_issuer, join_by(issuer_id)) |>\n filter(country_domicile == \"USA\") |>\n select(-country_domicile)\n```\n:::\n\n\nFinally, we save the bond characteristics to our local database. This selection of bonds also constitutes the sample for which we will collect trade reports from TRACE below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n conn = tidy_finance,\n name = \"fisd\",\n value = fisd,\n overwrite = TRUE\n)\n```\n:::\n\n\nThe FISD database also contains other data. The issue-based file contains information on covenants, i.e., restrictions included in bond indentures to limit specific actions by firms [e.g., @handler2021]. Moreover, FISD also provides information on bond ratings. We do not need either here.\n\n## TRACE\n\nThe Financial Industry Regulatory Authority (FINRA) provides the Trade Reporting and Compliance Engine (TRACE).\\index{Data!TRACE} In TRACE, dealers that trade corporate bonds must report such trades individually. 
Hence, we observe trade messages in TRACE that contain information on the bond traded, the trade time, price, and volume. TRACE comes in two variants; standard and enhanced TRACE. We show how to download and clean enhanced TRACE as it contains uncapped volume, a crucial quantity missing in the standard distribution. Moreover, enhanced TRACE also provides information on the respective parties' roles and the direction of the trade report. These items become essential in cleaning the messages.\n\nWhy do we repeatedly talk about cleaning TRACE? Trade messages are submitted within a short time window after a trade is executed (less than 15 minutes). These messages can contain errors, and the reporters subsequently correct them or they cancel a trade altogether. The cleaning needs are described by @Dick2009 in detail, and @Dick2014 shows how to clean the enhanced TRACE data using SAS. We do not go into the cleaning steps here, since the code is lengthy and serves no educational purpose. However, downloading and cleaning enhanced TRACE data is straightforward with our setup.\n\nWe store code for cleaning enhanced TRACE with R on the following Github [gist.](https://gist.github.com/patrick-weiss/3a05b3ab281563b2e94858451c2eb3a4) \\index{Github!Gist} as a function. The appendix also contains the code for reference. We only need to source the code from the gist, which we can do with `source_gist()`. Alternatively, you can also go to the gist, download it, and `source()` the respective R-file. The `clean_enhanced_trace()` function takes a vector of CUSIPs, a connection to WRDS explained in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd), and a start and end date, respectively.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource_gist(\"3a05b3ab281563b2e94858451c2eb3a4\")\n```\n:::\n\n\nThe TRACE database is considerably large. Therefore, we only download subsets of data at once. Specifying too many CUSIPs over a long time horizon will result in very long download times and a potential failure due to the size of the request to WRDS. The size limit depends on many parameters, and we cannot give you a guideline here. If we were working with the complete TRACE data for all CUSIPs above, splitting the data into 100 parts takes roughly two hours using our setup. For the applications in this book, we need data around the Paris Agreement in December 2015 and download the data in ten sets, which we define below.\\index{Paris (Climate) Agreement}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_cusips <- fisd |>\n pull(complete_cusip)\n\nfisd_parts <- split(\n fisd_cusips,\n rep(1:10, \n length.out = length(fisd_cusips))\n)\n```\n:::\n\n\nFinally, we run a loop in the same style as in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd) where we download daily returns from CRSP. For each of the CUSIP sets defined above, we call the cleaning function and save the resulting output. 
We add new data to the existing table for batch two and all following batches.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbatches <- length(fisd_parts)\n\nfor (j in 1:batches) {\n trace_enhanced <- clean_enhanced_trace(\n cusips = fisd_parts[[j]],\n connection = wrds,\n start_date = ymd(\"2014-01-01\"),\n end_date = ymd(\"2016-11-30\")\n )\n\n dbWriteTable(\n conn = tidy_finance,\n name = \"trace_enhanced\",\n value = trace_enhanced,\n overwrite = ifelse(j == 1, TRUE, FALSE),\n append = ifelse(j != 1, TRUE, FALSE)\n )\n \n cat(\"Batch\", j, \"out of\", batches, \"done (\", \n round(j / batches, 2) * 100, \"%)\\n\")\n}\n```\n:::\n\n\n## Insights into Corporate Bonds\n\nWhile many news outlets readily provide information on stocks and the underlying firms, corporate bonds are not covered frequently. Additionally, the TRACE database contains trade-level information, potentially new to students. Therefore, we provide you with some insights by showing some summary statistics.\\index{Summary statistics}\n\nWe start by looking into the number of bonds outstanding over time and compare it to the number of bonds traded in our sample. First, we compute the number of bonds outstanding for each quarter around the Paris Agreement from 2014 to 2016. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_outstanding <- expand_grid(\"date\" = seq(ymd(\"2014-01-01\"),\n ymd(\"2016-11-30\"), \n by = \"quarter\"), \n \"complete_cusip\" = fisd$complete_cusip) |> \n left_join(fisd |> select(complete_cusip, \n offering_date,\n maturity), \n join_by(complete_cusip)) |> \n mutate(offering_date = floor_date(offering_date),\n maturity = floor_date(maturity)) |> \n filter(date >= offering_date & date <= maturity) |> \n count(date) |> \n mutate(type = \"Outstanding\")\n```\n:::\n\n\nNext, we look at the bonds traded each quarter in the same period. Notice that we load the complete trace table from our database, as we only have a single part of it in the environment from the download loop from above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_enhanced <- tbl(tidy_finance, \"trace_enhanced\") |>\n collect()\n\nbonds_traded <- trace_enhanced |> \n mutate(date = floor_date(trd_exctn_dt, \"quarters\")) |> \n group_by(date) |> \n summarize(n = length(unique(cusip_id)),\n type = \"Traded\",\n .groups = \"drop\") \n```\n:::\n\n\nFinally, we plot the two time series in @fig-401.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_outstanding |> \n bind_rows(bonds_traded) |> \n ggplot(aes(\n x = date, \n y = n, \n color = type, \n linetype = type\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Number of bonds outstanding and traded each quarter\"\n )\n```\n\n::: {.cell-output-display}\n![The number of corporate bonds outstanding each quarter as reported by Mergent FISD and the number of traded bonds from enhanced TRACE between 2014 and end of 2016.](trace-and-fisd_files/figure-html/fig-401-1.png){#fig-401 fig-alt='Title: Number of bonds outstanding and traded each quarter. The figure shows a time series of outstanding bonds and bonds traded. The amount outstanding increases monotonically between 2014 and 2016. The number of bonds traded represents only a fraction of roughly 60 percent, which peaks around the third quarter of 2016.' width=2100}\n:::\n:::\n\n\nWe see that the number of bonds outstanding increases steadily between 2014 and 2016. During our sample period of trade data, we see that the fraction of bonds trading each quarter is roughly 60%. 
The relatively small number of traded bonds means that many bonds do not trade through an entire quarter. This lack of trading activity illustrates the generally low level of liquidity in the corporate bond market, where it can be hard to trade specific bonds. \nDoes this lack of liquidity mean that corporate bond markets are irrelevant in terms of their size? With over 7,500 traded bonds each quarter, it is hard to say that the market is small. However, let us also investigate the characteristics of issued corporate bonds. In particular, we consider maturity (in years), coupon, and offering amount (in million USD).\\index{Liquidity}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd |>\n mutate(maturity = as.numeric(maturity - offering_date) / 365,\n offering_amt = offering_amt / 10^3) |> \n pivot_longer(cols = c(maturity, coupon, offering_amt),\n names_to = \"measure\") |>\n drop_na() |> \n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 8\n measure mean sd min q05 q50 q95 max\n \n1 coupon 5.94 2.67 0 1.90 6.06 10 39 \n2 maturity 9.74 9.21 -6.24 1.03 7.18 30.0 101.\n3 offering_amt 379. 570. 0.001 0.669 200 1400 15000 \n```\n\n\n:::\n:::\n\n\nWe see that the average bond in our sample period has an offering amount of over 357 million USD with a median of 200 million USD, which both cannot be considered small. The average bond has a maturity of 10 years and pays around 6% in coupons.\n\nFinally, let us compute some summary statistics for the trades in this market. To this end, we show a summary based on aggregate information daily. In particular, we consider the trade size (in million USD) and the number of trades.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_enhanced |> \n group_by(trd_exctn_dt) |> \n summarize(trade_size = sum(entrd_vol_qt * rptd_pr / 100) / 10^6,\n trade_number = n(),\n .groups = \"drop\") |> \n pivot_longer(cols = c(trade_size, trade_number),\n names_to = \"measure\") |> \n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 8\n measure mean sd min q05 q50 q95 max\n \n1 trade_number 25914. 5444. 439 17844. 26051 34383. 40839 \n2 trade_size 12968. 3577. 17.2 6128. 13421. 17850. 21312.\n```\n\n\n:::\n:::\n\n\nOn average, nearly 26 billion USD of corporate bonds are traded daily in nearly 13,000 transactions. We can hence conclude that the corporate bond market is indeed significant in terms of trading volume and activity.\n\n## Exercises\n\n1. Compute the amount outstanding across all bonds over time. Make sure to subtract all matured bonds. How would you describe the resulting plot?\n1. Compute the number of days each bond is traded (accounting for the bonds' maturities and issuances). Start by looking at the number of bonds traded each day in a graph similar to the one above. How many bonds trade on more than 75% of trading days? \n1. WRDS provides more information from Mergent FISD such as ratings in the table `fisd_ratings`. Download the ratings table and plot the distribution of ratings for the different rating providers. How would you map the different providers to a common numeric rating scale? 
\\index{Rating}\n", "supporting": [ "trace-and-fisd_files" ], diff --git a/_freeze/r/univariate-portfolio-sorts/execute-results/html.json b/_freeze/r/univariate-portfolio-sorts/execute-results/html.json index 8f48ff3e..f5ebffca 100644 --- a/_freeze/r/univariate-portfolio-sorts/execute-results/html.json +++ b/_freeze/r/univariate-portfolio-sorts/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "3127fe8ff1b60b520605091103823265", + "hash": "b8792df00aa6d328f4831f07414ec669", "result": { "engine": "knitr", - "markdown": "---\ntitle: Univariate Portfolio Sorts\naliases:\n - ../univariate-portfolio-sorts.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Univariate Portfolio Sorts with R\n description-meta: Perform univariate portfolio sorts to test for return predictability using the programming language R. \n---\n\n\nIn this chapter, we dive into portfolio sorts, one of the most widely used statistical methodologies in empirical asset pricing [e.g., @BaliEngleMurray2016]. The key application of portfolio sorts is to examine whether one or more variables can predict future excess returns. In general, the idea is to sort individual stocks into portfolios, where the stocks within each portfolio are similar with respect to a sorting variable, such as firm size. The different portfolios then represent well-diversified investments that differ in the level of the sorting variable. You can then attribute the differences in the return distribution to the impact of the sorting variable. \nWe start by introducing univariate portfolio sorts (which sort based on only one characteristic) and tackle bivariate sorting in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd). \n\nA univariate portfolio sort considers only one sorting variable $x_{t-1,i}$.\\index{Portfolio sorts!Univariate}\nHere, $i$ denotes the stock and $t-1$ indicates that the characteristic is observable by investors at time $t$. \nThe objective is to assess the cross-sectional relation between $x_{t-1,i}$ and, typically, stock excess returns $r_{t,i}$ at time $t$ as the outcome variable. \nTo illustrate how portfolio sorts work, we use estimates for market betas from the previous chapter as our sorting variable.\n\nThe current chapter relies on the following set of R packages.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(lmtest)\nlibrary(broom)\nlibrary(sandwich)\n```\n:::\n\n\nCompared to previous chapters, we introduce `lmtest` [@lmtest] for inference for estimated coefficients, `broom` package [@broom] to tidy the estimation output of many estimated linear models, and `sandwich` [@sandwich] for different covariance matrix estimators\n\n## Data Preparation\n\nWe start with loading the required data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). 
In particular, we use the monthly CRSP sample as our asset universe.\\index{Data!CRSP}\nOnce we form our portfolios, we use the Fama-French market factor returns to compute the risk-adjusted performance (i.e., alpha).\\index{Data!Fama-French factors}\n`beta` is the tibble with market betas computed in the previous chapter.\\index{Beta}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, ret_excess, mktcap_lag) |>\n collect()\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n\nbeta <- tbl(tidy_finance, \"beta\") |>\n select(permno, month, beta_monthly) |>\n collect()\n```\n:::\n\n\n## Sorting by Market Beta\n\nNext, we merge our sorting variable with the return data. We use the one-month *lagged* betas as a sorting variable to ensure that the sorts rely only on information available when we create the portfolios. \nTo lag stock beta by one month, we add one month to the current date and join the resulting information with our return data. \nThis procedure ensures that month $t$ information is available in month $t+1$. \nYou may be tempted to simply use a call such as `crsp_monthly |> group_by(permno) |> mutate(beta_lag = lag(beta)))` instead. \nThis procedure, however, does not work correctly if there are non-explicit missing values in the time series.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_lag <- beta |>\n mutate(month = month %m+% months(1)) |>\n select(permno, month, beta_lag = beta_monthly) |>\n drop_na()\n\ndata_for_sorts <- crsp_monthly |>\n inner_join(beta_lag, by = c(\"permno\", \"month\"))\n```\n:::\n\n\nThe first step to conduct portfolio sorts is to calculate periodic breakpoints that you can use to group the stocks into portfolios.\\index{Breakpoints} \nFor simplicity, we start with the median lagged market beta as the single breakpoint. \nWe then compute the value-weighted returns for each of the two resulting portfolios, which means that the lagged market capitalization determines the weight in `weighted.mean()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n breakpoint = median(beta_lag),\n portfolio = case_when(\n beta_lag <= breakpoint ~ \"low\",\n beta_lag > breakpoint ~ \"high\"\n )\n ) |>\n group_by(month, portfolio) |>\n summarize(ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\")\n```\n:::\n\n\n## Performance Evaluation\n\nWe can construct a long-short strategy based on the two portfolios: buy the high-beta portfolio and, at the same time, short the low-beta portfolio. Thereby, the overall position in the market is net-zero, i.e., you do not need to invest money to realize this strategy in the absence of frictions.\\index{Long-short}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort <- beta_portfolios |>\n pivot_wider(id_cols = month, names_from = portfolio, values_from = ret) |>\n mutate(long_short = high - low)\n```\n:::\n\n\nWe compute the average return and the corresponding standard error to test whether the long-short portfolio yields on average positive or negative excess returns. 
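\n\nTo see why a regression appears in this context, note that regressing the long-short return on a constant recovers its sample mean, and the plain OLS standard error of the intercept is the usual standard error of the mean, which still ignores autocorrelation. Here is a minimal sketch that reuses the `beta_longshort` tibble from above (the object name `model_naive` is ours):\n\n::: {.cell}\n\n```{.r .cell-code}\n# The intercept of a constant-only regression equals the sample mean return;\n# its plain OLS standard error is the standard error of the mean\nmodel_naive <- lm(long_short ~ 1, data = beta_longshort)\nall.equal(unname(coef(model_naive)), mean(beta_longshort$long_short))\nsummary(model_naive)$coefficients\n```\n:::\n\n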
In the asset pricing literature, one typically adjusts for autocorrelation by using @Newey1987 $t$-statistics to test the null hypothesis that average portfolio excess returns are equal to zero.\\index{Standard errors!Newey-West} One necessary input for Newey-West standard errors is a chosen bandwidth based on the number of lags employed for the estimation. While it seems that researchers often default on choosing a pre-specified lag length of 6 months, we instead recommend a data-driven approach. This automatic selection is advocated by @Newey1994 and available in the `sandwich` package. To implement this test, we compute the average return via `lm()` and then employ the `coeftest()` function. If you want to implement the typical 6-lag default setting, you can enforce it by passing the arguments `lag = 6, prewhite = FALSE` to the `coeftest()` function in the code below and it passes them on to `NeweyWest()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fit <- lm(long_short ~ 1, data = beta_longshort)\ncoeftest(model_fit, vcov = NeweyWest)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 5.76e-05 1.30e-03 0.04 0.96\n```\n\n\n:::\n:::\n\n\nThe results indicate that we cannot reject the null hypothesis of average returns being equal to zero. Our portfolio strategy using the median as a breakpoint hence does not yield any abnormal returns. Is this finding surprising if you reconsider the CAPM? It certainly is. The CAPM yields that the high beta stocks should yield higher expected returns. Our portfolio sort implicitly mimics an investment strategy that finances high beta stocks by shorting low beta stocks. Therefore, one should expect that the average excess returns yield a return that is above the risk-free rate.\n\n## Functional Programming for Portfolio Sorts\n\nNow we take portfolio sorts to the next level. We want to be able to sort stocks into an arbitrary number of portfolios. For this case, functional programming is very handy: we employ the [curly-curly](https://www.tidyverse.org/blog/2019/06/rlang-0-4-0/#a-simpler-interpolation-pattern-with-)-operator to give us flexibility concerning which variable to use for the sorting, denoted by `sorting_variable`.\\index{Curly-curly} We use `quantile()` to compute breakpoints for `n_portfolios`. Then, we assign portfolios to stocks using the `findInterval()` function. The output of the following function is a new column that contains the number of the portfolio to which a stock belongs.\\index{Functional programming} \n\nIn some applications, the variable used for the sorting might be clustered (e.g., at a lower bound of 0). Then, multiple breakpoints may be identical, leading to empty portfolios. Similarly, some portfolios might have a very small number of stocks at the beginning of the sample. 
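\n\nTo make the breakpoint issue concrete, consider a small, purely hypothetical illustration: when the sorting variable is clustered at its lower bound of zero, several decile breakpoints coincide, and `findInterval()` then leaves some portfolios empty (all object names below are ours):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Purely hypothetical data: a sorting variable clustered at its lower bound of zero\nset.seed(1234)\nsorting_variable_toy <- c(rep(0, 80), runif(20))\n\n# Several decile breakpoints coincide at zero ...\nbreakpoints_toy <- quantile(\n  sorting_variable_toy,\n  probs = seq(0, 1, length.out = 11),\n  names = FALSE\n)\nbreakpoints_toy\n\n# ... so only a few of the ten portfolios end up populated\ntable(findInterval(sorting_variable_toy, breakpoints_toy, all.inside = TRUE))\n```\n:::\n\n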
Cases, where the number of portfolio constituents differs substantially due to the distribution of the characteristics, require careful consideration and, depending on the application, might require customized sorting approaches.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n n_portfolios) {\n # Compute breakpoints\n breakpoints <- data |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = seq(0, 1, length.out = n_portfolios + 1),\n na.rm = TRUE,\n names = FALSE\n )\n\n # Assign portfolios\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n # Output\n return(assigned_portfolios)\n}\n```\n:::\n\n\nWe can use the above function to sort stocks into ten portfolios each month using lagged betas and compute value-weighted returns for each portfolio. Note that we transform the portfolio column to a factor variable because it provides more convenience for the figure construction below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n portfolio = assign_portfolio(\n data = pick(everything()),\n sorting_variable = beta_lag,\n n_portfolios = 10\n ),\n portfolio = as.factor(portfolio)\n ) |>\n group_by(portfolio, month) |>\n summarize(\n ret_excess = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )|>\n left_join(factors_ff3_monthly, by = \"month\")\n```\n:::\n\n\n## More Performance Evaluation\n\nIn the next step, we compute summary statistics for each beta portfolio. Namely, we compute CAPM-adjusted alphas, the beta of each beta portfolio, and average returns.\\index{Performance evaluation}\\index{Alpha}\\index{CAPM}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios_summary <- beta_portfolios |>\n nest(data = c(month, ret_excess, mkt_excess)) |>\n mutate(estimates = map(\n data, ~ tidy(lm(ret_excess ~ 1 + mkt_excess, data = .x))\n )) |>\n unnest(estimates) |> \n select(portfolio, term, estimate) |> \n pivot_wider(names_from = term, values_from = estimate) |> \n rename(alpha = `(Intercept)`, beta = mkt_excess) |> \n left_join(\n beta_portfolios |> \n group_by(portfolio) |> \n summarize(ret_excess = mean(ret_excess),\n .groups = \"drop\"), by = \"portfolio\"\n )\n```\n:::\n\n\n@fig-701 illustrates the CAPM alphas of beta-sorted portfolios. It shows that low beta portfolios tend to exhibit positive alphas, while high beta portfolios exhibit negative alphas.\\index{Graph!Bar chart}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios_summary |>\n ggplot(aes(x = portfolio, y = alpha, fill = portfolio)) +\n geom_bar(stat = \"identity\") +\n labs(\n title = \"CAPM alphas of beta-sorted portfolios\",\n x = \"Portfolio\",\n y = \"CAPM alpha\",\n fill = \"Portfolio\"\n ) +\n scale_y_continuous(labels = percent) +\n theme(legend.position = \"None\")\n```\n\n::: {.cell-output-display}\n![Portfolios are sorted into deciles each month based on their estimated CAPM beta. The bar charts indicate the CAPM alpha of the resulting portfolio returns during the entire CRSP period.](univariate-portfolio-sorts_files/figure-html/fig-701-1.png){#fig-701 fig-alt='Title: CAPM alphas of beta-sorted portfolios. The figure shows bar charts of alphas of beta-sorted portfolios with the decile portfolio on the horizontal axis and the corresponding CAPM alpha on the vertical axis. 
Alphas for low beta portfolios are positive, while high beta portfolios show negative alphas.' width=2100}\n:::\n:::\n\n\nThese results suggest a negative relation between beta and future stock returns, which contradicts the predictions of the CAPM. According to the CAPM, returns should increase with beta across the portfolios and risk-adjusted returns should be statistically indistinguishable from zero.\n\n## The Security Market Line and Beta Portfolios\n\nThe CAPM predicts that our portfolios should lie on the security market line (SML). The slope of the SML is equal to the market risk premium and reflects the risk-return trade-off at any given time.\\index{Security market line} @fig-702 illustrates the security market line: We see that (not surprisingly) the high beta portfolio returns have a high correlation with the market returns. However, it seems like the average excess returns for high beta stocks are lower than what the security market line implies would be an \"appropriate\" compensation for the high market risk. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsml_capm <- lm(ret_excess ~ 1 + beta, data = beta_portfolios_summary)$coefficients\n\nbeta_portfolios_summary |>\n ggplot(aes(\n x = beta, \n y = ret_excess, \n color = portfolio\n )) +\n geom_point() +\n geom_abline(\n intercept = 0,\n slope = mean(factors_ff3_monthly$mkt_excess),\n linetype = \"solid\"\n ) +\n geom_abline(\n intercept = sml_capm[1],\n slope = sml_capm[2],\n linetype = \"dashed\"\n ) +\n scale_y_continuous(\n labels = percent,\n limit = c(0, mean(factors_ff3_monthly$mkt_excess) * 2)\n ) +\n scale_x_continuous(limits = c(0, 2)) +\n labs(\n x = \"Beta\", y = \"Excess return\", color = \"Portfolio\",\n title = \"Average portfolio excess returns and average beta estimates\"\n )\n```\n\n::: {.cell-output-display}\n![Excess returns are computed as CAPM alphas of the beta-sorted portfolios. The horizontal axis indicates the CAPM beta of the resulting beta-sorted portfolio return time series. The dashed line indicates the slope coefficient of a linear regression of excess returns on portfolio betas.](univariate-portfolio-sorts_files/figure-html/fig-702-1.png){#fig-702 fig-alt='Title: Average portfolio excess returns and average beta estimates. The figure shows a scatter plot of the average excess returns per beta portfolio with average beta estimates per portfolio on the horizontal axis and average excess returns on the vertical axis. An increasing solid line indicates the security market line. A dashed increasing line with lower slope than the security market line indicates that the CAPM prediction is not valid for CRSP data.' width=2100}\n:::\n:::\n\n\nTo provide more evidence against the CAPM predictions, we again form a long-short strategy that buys the high-beta portfolio and shorts the low-beta portfolio.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort <- beta_portfolios |>\n mutate(portfolio = case_when(\n portfolio == max(as.numeric(portfolio)) ~ \"high\",\n portfolio == min(as.numeric(portfolio)) ~ \"low\"\n )) |>\n filter(portfolio %in% c(\"low\", \"high\")) |>\n pivot_wider(id_cols = month, \n names_from = portfolio, \n values_from = ret_excess) |>\n mutate(long_short = high - low) |>\n left_join(factors_ff3_monthly, by = \"month\")\n```\n:::\n\n\nAgain, the resulting long-short strategy does not exhibit statistically significant returns. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoeftest(lm(long_short ~ 1, data = beta_longshort),\n vcov = NeweyWest\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 0.00147 0.00331 0.44 0.66\n```\n\n\n:::\n:::\n\n\nHowever, the long-short portfolio yields a statistically significant negative CAPM-adjusted alpha, although, controlling for the effect of beta, the average excess stock returns should be zero according to the CAPM. The results thus provide no evidence in support of the CAPM. The negative value has been documented as the so-called betting against beta factor [@Frazzini2014]. Betting against beta corresponds to a strategy that shorts high beta stocks and takes a (levered) long position in low beta stocks. If borrowing constraints prevent investors from taking positions on the SML they are instead incentivized to buy high beta stocks, which leads to a relatively higher price (and therefore lower expected returns than implied by the CAPM) for such high beta stocks. As a result, the betting-against-beta strategy earns from providing liquidity to capital constraint investors with lower risk aversion.\\index{Risk aversion}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoeftest(lm(long_short ~ 1 + mkt_excess, data = beta_longshort),\n vcov = NeweyWest\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.00479 0.00249 -1.92 0.055 . \nmkt_excess 1.15351 0.08960 12.87 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\n@fig-703 shows the annual returns of the extreme beta portfolios we are mainly interested in. The figure illustrates no consistent striking patterns over the last years - each portfolio exhibits periods with positive and negative annual returns. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort |>\n group_by(year = year(month)) |>\n summarize(\n low = prod(1 + low),\n high = prod(1 + high),\n long_short = prod(1 + long_short)\n ) |>\n pivot_longer(cols = -year) |>\n ggplot(aes(x = year, y = 1 - value, fill = name)) +\n geom_col(position = \"dodge\") +\n facet_wrap(~name, ncol = 1) +\n theme(legend.position = \"none\") +\n scale_y_continuous(labels = percent) +\n labs(\n title = \"Annual returns of beta portfolios\",\n x = NULL, y = NULL\n )\n```\n\n::: {.cell-output-display}\n![We construct portfolios by sorting stocks into high and low based on their estimated CAPM beta. Long short indicates a strategy that goes long into high beta stocks and short low beta stocks.](univariate-portfolio-sorts_files/figure-html/fig-703-1.png){#fig-703 fig-alt='Title: Annual returns of beta portfolios. The figure shows bar charts of annual returns of long, short, and long-short beta portfolios with years on the horizontal axis and returns on the vertical axis. Each portfolio is plotted in its own facet. The long-short portfolio strategy delivers very high losses during some periods.' width=2100}\n:::\n:::\n\n\nOverall, this chapter shows how functional programming can be leveraged to form an arbitrary number of portfolios using any sorting variable and how to evaluate the performance of the resulting portfolios. In the next chapter, we dive deeper into the many degrees of freedom that arise in the context of portfolio analysis. \n\n## Exercises\n\n1. Take the two long-short beta strategies based on different numbers of portfolios and compare the returns. 
Is there a significant difference in returns? How do the Sharpe ratios compare between the strategies? Find one additional portfolio evaluation statistic and compute it.\n1. We plotted the alphas of the ten beta portfolios above. Write a function that tests these estimates for significance. Which portfolios have significant alphas?\n1. The analysis here is based on betas from monthly returns. However, we also computed betas from daily returns. Re-run the analysis and point out differences in the results.\n1. Given the results in this chapter, can you define a long-short strategy that yields positive abnormal returns (i.e., alphas)? Plot the cumulative excess return of your strategy and the market excess return for comparison.", + "markdown": "---\ntitle: Univariate Portfolio Sorts\naliases:\n - ../univariate-portfolio-sorts.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Univariate Portfolio Sorts with R\n description-meta: Perform univariate portfolio sorts to test for return predictability using the programming language R. \n---\n\n\nIn this chapter, we dive into portfolio sorts, one of the most widely used statistical methodologies in empirical asset pricing [e.g., @BaliEngleMurray2016]. The key application of portfolio sorts is to examine whether one or more variables can predict future excess returns. In general, the idea is to sort individual stocks into portfolios, where the stocks within each portfolio are similar with respect to a sorting variable, such as firm size. The different portfolios then represent well-diversified investments that differ in the level of the sorting variable. You can then attribute the differences in the return distribution to the impact of the sorting variable. \nWe start by introducing univariate portfolio sorts (which sort based on only one characteristic) and tackle bivariate sorting in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd). \n\nA univariate portfolio sort considers only one sorting variable $x_{t-1,i}$.\\index{Portfolio sorts!Univariate}\nHere, $i$ denotes the stock and $t-1$ indicates that the characteristic is observable by investors at time $t$. \nThe objective is to assess the cross-sectional relation between $x_{t-1,i}$ and, typically, stock excess returns $r_{t,i}$ at time $t$ as the outcome variable. \nTo illustrate how portfolio sorts work, we use estimates for market betas from the previous chapter as our sorting variable.\n\nThe current chapter relies on the following set of R packages.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(lmtest)\nlibrary(broom)\nlibrary(sandwich)\n```\n:::\n\n\nCompared to previous chapters, we introduce `lmtest` [@lmtest] for inference for estimated coefficients, `broom` package [@broom] to tidy the estimation output of many estimated linear models, and `sandwich` [@sandwich] for different covariance matrix estimators\n\n## Data Preparation\n\nWe start with loading the required data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). 
In particular, we use the monthly CRSP sample as our asset universe.\\index{Data!CRSP}\nOnce we form our portfolios, we use the Fama-French market factor returns to compute the risk-adjusted performance (i.e., alpha).\\index{Data!Fama-French factors}\n`beta` is the tibble with market betas computed in the previous chapter.\\index{Beta}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, ret_excess, mktcap_lag) |>\n collect()\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n\nbeta <- tbl(tidy_finance, \"beta\") |>\n select(permno, month, beta_monthly) |>\n collect()\n```\n:::\n\n\n## Sorting by Market Beta\n\nNext, we merge our sorting variable with the return data. We use the one-month *lagged* betas as a sorting variable to ensure that the sorts rely only on information available when we create the portfolios. \nTo lag stock beta by one month, we add one month to the current date and join the resulting information with our return data. \nThis procedure ensures that month $t$ information is available in month $t+1$. \nYou may be tempted to simply use a call such as `crsp_monthly |> group_by(permno) |> mutate(beta_lag = lag(beta)))` instead. \nThis procedure, however, does not work correctly if there are non-explicit missing values in the time series.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_lag <- beta |>\n mutate(month = month %m+% months(1)) |>\n select(permno, month, beta_lag = beta_monthly) |>\n drop_na()\n\ndata_for_sorts <- crsp_monthly |>\n inner_join(beta_lag, join_by(permno, month))\n```\n:::\n\n\nThe first step to conduct portfolio sorts is to calculate periodic breakpoints that you can use to group the stocks into portfolios.\\index{Breakpoints} \nFor simplicity, we start with the median lagged market beta as the single breakpoint. \nWe then compute the value-weighted returns for each of the two resulting portfolios, which means that the lagged market capitalization determines the weight in `weighted.mean()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n breakpoint = median(beta_lag),\n portfolio = case_when(\n beta_lag <= breakpoint ~ \"low\",\n beta_lag > breakpoint ~ \"high\"\n )\n ) |>\n group_by(month, portfolio) |>\n summarize(ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\")\n```\n:::\n\n\n## Performance Evaluation\n\nWe can construct a long-short strategy based on the two portfolios: buy the high-beta portfolio and, at the same time, short the low-beta portfolio. Thereby, the overall position in the market is net-zero, i.e., you do not need to invest money to realize this strategy in the absence of frictions.\\index{Long-short}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort <- beta_portfolios |>\n pivot_wider(id_cols = month, names_from = portfolio, values_from = ret) |>\n mutate(long_short = high - low)\n```\n:::\n\n\nWe compute the average return and the corresponding standard error to test whether the long-short portfolio yields on average positive or negative excess returns. 
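As a naive benchmark, a plain one-sample $t$-test of the mean long-short return conveys the idea; the following is only a sketch that ignores autocorrelation in the return series and is not the approach we rely on below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Naive benchmark (sketch only): a plain one-sample t-test of the mean\n# long-short return, ignoring autocorrelation in the monthly series.\nt.test(beta_longshort$long_short)\n```\n:::\n\n\n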
In the asset pricing literature, one typically adjusts for autocorrelation by using @Newey1987 $t$-statistics to test the null hypothesis that average portfolio excess returns are equal to zero.\\index{Standard errors!Newey-West} One necessary input for Newey-West standard errors is a chosen bandwidth based on the number of lags employed for the estimation. While it seems that researchers often default on choosing a pre-specified lag length of 6 months, we instead recommend a data-driven approach. This automatic selection is advocated by @Newey1994 and available in the `sandwich` package. To implement this test, we compute the average return via `lm()` and then employ the `coeftest()` function. If you want to implement the typical 6-lag default setting, you can enforce it by passing the arguments `lag = 6, prewhite = FALSE` to the `coeftest()` function in the code below and it passes them on to `NeweyWest()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fit <- lm(long_short ~ 1, data = beta_longshort)\ncoeftest(model_fit, vcov = NeweyWest)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 5.76e-05 1.30e-03 0.04 0.96\n```\n\n\n:::\n:::\n\n\nThe results indicate that we cannot reject the null hypothesis of average returns being equal to zero. Our portfolio strategy using the median as a breakpoint hence does not yield any abnormal returns. Is this finding surprising if you reconsider the CAPM? It certainly is. The CAPM yields that the high beta stocks should yield higher expected returns. Our portfolio sort implicitly mimics an investment strategy that finances high beta stocks by shorting low beta stocks. Therefore, one should expect that the average excess returns yield a return that is above the risk-free rate.\n\n## Functional Programming for Portfolio Sorts\n\nNow we take portfolio sorts to the next level. We want to be able to sort stocks into an arbitrary number of portfolios. For this case, functional programming is very handy: we employ the [curly-curly](https://www.tidyverse.org/blog/2019/06/rlang-0-4-0/#a-simpler-interpolation-pattern-with-)-operator to give us flexibility concerning which variable to use for the sorting, denoted by `sorting_variable`.\\index{Curly-curly} We use `quantile()` to compute breakpoints for `n_portfolios`. Then, we assign portfolios to stocks using the `findInterval()` function. The output of the following function is a new column that contains the number of the portfolio to which a stock belongs.\\index{Functional programming} \n\nIn some applications, the variable used for the sorting might be clustered (e.g., at a lower bound of 0). Then, multiple breakpoints may be identical, leading to empty portfolios. Similarly, some portfolios might have a very small number of stocks at the beginning of the sample. 
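The following toy example with hypothetical data clustered at zero illustrates the issue: the duplicated breakpoints imply that two of the five resulting portfolios receive no stocks at all.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy example (hypothetical data): a sorting variable clustered at zero yields\n# duplicated breakpoints, so findInterval() leaves some portfolios empty.\nsorting_variable_example <- c(rep(0, 6), 0.2, 0.5, 0.9, 1.4)\nbreakpoints_example <- quantile(\n  sorting_variable_example,\n  probs = seq(0, 1, length.out = 6),\n  names = FALSE\n)\nbreakpoints_example\ntable(findInterval(sorting_variable_example, breakpoints_example, all.inside = TRUE))\n```\n:::\n\n\n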
Cases, where the number of portfolio constituents differs substantially due to the distribution of the characteristics, require careful consideration and, depending on the application, might require customized sorting approaches.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n n_portfolios) {\n # Compute breakpoints\n breakpoints <- data |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = seq(0, 1, length.out = n_portfolios + 1),\n na.rm = TRUE,\n names = FALSE\n )\n\n # Assign portfolios\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n # Output\n return(assigned_portfolios)\n}\n```\n:::\n\n\nWe can use the above function to sort stocks into ten portfolios each month using lagged betas and compute value-weighted returns for each portfolio. Note that we transform the portfolio column to a factor variable because it provides more convenience for the figure construction below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n portfolio = assign_portfolio(\n data = pick(everything()),\n sorting_variable = beta_lag,\n n_portfolios = 10\n ),\n portfolio = as.factor(portfolio)\n ) |>\n group_by(portfolio, month) |>\n summarize(\n ret_excess = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )|>\n left_join(factors_ff3_monthly, join_by(month))\n```\n:::\n\n\n## More Performance Evaluation\n\nIn the next step, we compute summary statistics for each beta portfolio. Namely, we compute CAPM-adjusted alphas, the beta of each beta portfolio, and average returns.\\index{Performance evaluation}\\index{Alpha}\\index{CAPM}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios_summary <- beta_portfolios |>\n nest(data = c(month, ret_excess, mkt_excess)) |>\n mutate(estimates = map(\n data, ~ tidy(lm(ret_excess ~ 1 + mkt_excess, data = .x))\n )) |>\n unnest(estimates) |> \n select(portfolio, term, estimate) |> \n pivot_wider(names_from = term, values_from = estimate) |> \n rename(alpha = `(Intercept)`, beta = mkt_excess) |> \n left_join(\n beta_portfolios |> \n group_by(portfolio) |> \n summarize(ret_excess = mean(ret_excess),\n .groups = \"drop\"), join_by(portfolio)\n )\n```\n:::\n\n\n@fig-701 illustrates the CAPM alphas of beta-sorted portfolios. It shows that low beta portfolios tend to exhibit positive alphas, while high beta portfolios exhibit negative alphas.\\index{Graph!Bar chart}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios_summary |>\n ggplot(aes(x = portfolio, y = alpha, fill = portfolio)) +\n geom_bar(stat = \"identity\") +\n labs(\n title = \"CAPM alphas of beta-sorted portfolios\",\n x = \"Portfolio\",\n y = \"CAPM alpha\",\n fill = \"Portfolio\"\n ) +\n scale_y_continuous(labels = percent) +\n theme(legend.position = \"None\")\n```\n\n::: {.cell-output-display}\n![Portfolios are sorted into deciles each month based on their estimated CAPM beta. The bar charts indicate the CAPM alpha of the resulting portfolio returns during the entire CRSP period.](univariate-portfolio-sorts_files/figure-html/fig-701-1.png){#fig-701 fig-alt='Title: CAPM alphas of beta-sorted portfolios. The figure shows bar charts of alphas of beta-sorted portfolios with the decile portfolio on the horizontal axis and the corresponding CAPM alpha on the vertical axis. 
Alphas for low beta portfolios are positive, while high beta portfolios show negative alphas.' width=2100}\n:::\n:::\n\n\nThese results suggest a negative relation between beta and future stock returns, which contradicts the predictions of the CAPM. According to the CAPM, returns should increase with beta across the portfolios and risk-adjusted returns should be statistically indistinguishable from zero.\n\n## The Security Market Line and Beta Portfolios\n\nThe CAPM predicts that our portfolios should lie on the security market line (SML). The slope of the SML is equal to the market risk premium and reflects the risk-return trade-off at any given time.\\index{Security market line} @fig-702 illustrates the security market line: We see that (not surprisingly) the high beta portfolio returns have a high correlation with the market returns. However, it seems like the average excess returns for high beta stocks are lower than what the security market line implies would be an \"appropriate\" compensation for the high market risk. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsml_capm <- lm(ret_excess ~ 1 + beta, data = beta_portfolios_summary)$coefficients\n\nbeta_portfolios_summary |>\n ggplot(aes(\n x = beta, \n y = ret_excess, \n color = portfolio\n )) +\n geom_point() +\n geom_abline(\n intercept = 0,\n slope = mean(factors_ff3_monthly$mkt_excess),\n linetype = \"solid\"\n ) +\n geom_abline(\n intercept = sml_capm[1],\n slope = sml_capm[2],\n linetype = \"dashed\"\n ) +\n scale_y_continuous(\n labels = percent,\n limit = c(0, mean(factors_ff3_monthly$mkt_excess) * 2)\n ) +\n scale_x_continuous(limits = c(0, 2)) +\n labs(\n x = \"Beta\", y = \"Excess return\", color = \"Portfolio\",\n title = \"Average portfolio excess returns and average beta estimates\"\n )\n```\n\n::: {.cell-output-display}\n![Excess returns are computed as CAPM alphas of the beta-sorted portfolios. The horizontal axis indicates the CAPM beta of the resulting beta-sorted portfolio return time series. The dashed line indicates the slope coefficient of a linear regression of excess returns on portfolio betas.](univariate-portfolio-sorts_files/figure-html/fig-702-1.png){#fig-702 fig-alt='Title: Average portfolio excess returns and average beta estimates. The figure shows a scatter plot of the average excess returns per beta portfolio with average beta estimates per portfolio on the horizontal axis and average excess returns on the vertical axis. An increasing solid line indicates the security market line. A dashed increasing line with lower slope than the security market line indicates that the CAPM prediction is not valid for CRSP data.' width=2100}\n:::\n:::\n\n\nTo provide more evidence against the CAPM predictions, we again form a long-short strategy that buys the high-beta portfolio and shorts the low-beta portfolio.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort <- beta_portfolios |>\n mutate(portfolio = case_when(\n portfolio == max(as.numeric(portfolio)) ~ \"high\",\n portfolio == min(as.numeric(portfolio)) ~ \"low\"\n )) |>\n filter(portfolio %in% c(\"low\", \"high\")) |>\n pivot_wider(id_cols = month, \n names_from = portfolio, \n values_from = ret_excess) |>\n mutate(long_short = high - low) |>\n left_join(factors_ff3_monthly, join_by(month))\n```\n:::\n\n\nAgain, the resulting long-short strategy does not exhibit statistically significant returns. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoeftest(lm(long_short ~ 1, data = beta_longshort),\n vcov = NeweyWest\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 0.00147 0.00331 0.44 0.66\n```\n\n\n:::\n:::\n\n\nHowever, the long-short portfolio yields a statistically significant negative CAPM-adjusted alpha, although, controlling for the effect of beta, the average excess stock returns should be zero according to the CAPM. The results thus provide no evidence in support of the CAPM. The negative value has been documented as the so-called betting against beta factor [@Frazzini2014]. Betting against beta corresponds to a strategy that shorts high beta stocks and takes a (levered) long position in low beta stocks. If borrowing constraints prevent investors from taking positions on the SML they are instead incentivized to buy high beta stocks, which leads to a relatively higher price (and therefore lower expected returns than implied by the CAPM) for such high beta stocks. As a result, the betting-against-beta strategy earns from providing liquidity to capital constraint investors with lower risk aversion.\\index{Risk aversion}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoeftest(lm(long_short ~ 1 + mkt_excess, data = beta_longshort),\n vcov = NeweyWest\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.00479 0.00249 -1.92 0.055 . \nmkt_excess 1.15351 0.08960 12.87 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\n@fig-703 shows the annual returns of the extreme beta portfolios we are mainly interested in. The figure illustrates no consistent striking patterns over the last years - each portfolio exhibits periods with positive and negative annual returns. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort |>\n group_by(year = year(month)) |>\n summarize(\n low = prod(1 + low),\n high = prod(1 + high),\n long_short = prod(1 + long_short)\n ) |>\n pivot_longer(cols = -year) |>\n ggplot(aes(x = year, y = 1 - value, fill = name)) +\n geom_col(position = \"dodge\") +\n facet_wrap(~name, ncol = 1) +\n theme(legend.position = \"none\") +\n scale_y_continuous(labels = percent) +\n labs(\n title = \"Annual returns of beta portfolios\",\n x = NULL, y = NULL\n )\n```\n\n::: {.cell-output-display}\n![We construct portfolios by sorting stocks into high and low based on their estimated CAPM beta. Long short indicates a strategy that goes long into high beta stocks and short low beta stocks.](univariate-portfolio-sorts_files/figure-html/fig-703-1.png){#fig-703 fig-alt='Title: Annual returns of beta portfolios. The figure shows bar charts of annual returns of long, short, and long-short beta portfolios with years on the horizontal axis and returns on the vertical axis. Each portfolio is plotted in its own facet. The long-short portfolio strategy delivers very high losses during some periods.' width=2100}\n:::\n:::\n\n\nOverall, this chapter shows how functional programming can be leveraged to form an arbitrary number of portfolios using any sorting variable and how to evaluate the performance of the resulting portfolios. In the next chapter, we dive deeper into the many degrees of freedom that arise in the context of portfolio analysis. \n\n## Exercises\n\n1. Take the two long-short beta strategies based on different numbers of portfolios and compare the returns. 
Is there a significant difference in returns? How do the Sharpe ratios compare between the strategies? Find one additional portfolio evaluation statistic and compute it.\n1. We plotted the alphas of the ten beta portfolios above. Write a function that tests these estimates for significance. Which portfolios have significant alphas?\n1. The analysis here is based on betas from monthly returns. However, we also computed betas from daily returns. Re-run the analysis and point out differences in the results.\n1. Given the results in this chapter, can you define a long-short strategy that yields positive abnormal returns (i.e., alphas)? Plot the cumulative excess return of your strategy and the market excess return for comparison.", "supporting": [ "univariate-portfolio-sorts_files" ], diff --git a/_freeze/r/value-and-bivariate-sorts/execute-results/html.json b/_freeze/r/value-and-bivariate-sorts/execute-results/html.json index ac8a9963..c8afb56b 100644 --- a/_freeze/r/value-and-bivariate-sorts/execute-results/html.json +++ b/_freeze/r/value-and-bivariate-sorts/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "3c5e4ac4d7a93ee4d408412a1586fe76", + "hash": "1cf73a739b64fd3f3fb69d768ecdb45c", "result": { "engine": "knitr", - "markdown": "---\ntitle: Value and Bivariate Sorts\naliases:\n - ../value-and-bivariate-sorts.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Value and Bivariate Sorts with R\n description-meta: Perform bivariate portfolio sorts to test for return predictability in asset pricing applications using the programming language R. \n---\n\n\nIn this chapter, we extend univariate portfolio analysis to bivariate sorts, which means we assign stocks to portfolios based on two characteristics. Bivariate sorts are regularly used in the academic asset pricing literature and are the basis for the Fama and French three factors. However, some scholars also use sorts with three grouping variables. Conceptually, portfolio sorts are easily applicable in higher dimensions.\n\nWe form portfolios on firm size and the book-to-market ratio. To calculate book-to-market ratios, accounting data is required, which necessitates additional steps during portfolio formation. In the end, we demonstrate how to form portfolios on two sorting variables using so-called independent and dependent portfolio sorts.\n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nFirst, we load the necessary data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd). We conduct portfolio sorts based on the CRSP sample but keep only the necessary columns in our memory. We use the same data sources for firm size as in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd).\\index{Data!CRSP}\\index{Data!Fama-French factors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap, mktcap_lag, exchange\n ) |> \n collect() |>\n drop_na()\n```\n:::\n\n\nFurther, we utilize accounting data. The most common source of accounting data is Compustat. We only need book equity data in this application, which we select from our database. 
Additionally, we convert the variable `datadate` to its monthly value, as we only consider monthly returns here and do not need to account for the exact date. To achieve this, we use the function `floor_date()`.\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbook_equity <- tbl(tidy_finance, \"compustat\") |>\n select(gvkey, datadate, be) |>\n collect() |>\n drop_na() |>\n mutate(month = floor_date(ymd(datadate), \"month\"))\n```\n:::\n\n\n## Book-to-Market Ratio\n\nA fundamental problem in handling accounting data is the *look-ahead bias* - we must not include data in forming a portfolio that is not public knowledge at the time. Of course, researchers have more information when looking into the past than agents had at that moment. However, abnormal excess returns from a trading strategy should not rely on an information advantage because the differential cannot be the result of informed agents' trades. Hence, we have to lag accounting information.\n\nWe continue to lag market capitalization and firm size by one month.\\index{Market capitalization}\\index{Firm size} Then, we compute the book-to-market ratio, which relates a firm's book equity to its market equity.\\index{Book equity}\\index{Book-to-market ratio} Firms with high (low) book-to-market ratio are called value (growth) firms. After matching the accounting and market equity information from the same month, we lag book-to-market by six months. This is a sufficiently conservative approach because accounting information is usually released well before six months pass. However, in the asset pricing literature, even longer lags are used as well.^[The definition of a time lag is another choice a researcher has to make, similar to breakpoint choices as we describe in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd).]\n\nHaving both variables, i.e., firm size lagged by one month and book-to-market lagged by six months, we merge these sorting variables to our returns using the `sorting_date`-column created for this purpose. The final step in our data preparation deals with differences in the frequency of our variables. Returns and firm size are recorded monthly. Yet the accounting information is only released on an annual basis. Hence, we only match book-to-market to one month per year and have eleven empty observations. To solve this frequency issue, we carry the latest book-to-market ratio of each firm to the subsequent months, i.e., we fill the missing observations with the most current report. This is done via the `fill()`-function after sorting by date and firm (which we identify by `permno` and `gvkey`) and on a firm basis (which we do by `group_by()` as usual). We filter out all observations with accounting data that is older than a year. 
As the last step, we remove all rows with missing entries because the returns cannot be matched to any annual report.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nme <- crsp_monthly |>\n mutate(sorting_date = month %m+% months(1)) |>\n select(permno, sorting_date, me = mktcap)\n\nbm <- book_equity |>\n inner_join(crsp_monthly, by = c(\"gvkey\", \"month\")) |>\n mutate(\n bm = be / mktcap,\n sorting_date = month %m+% months(6),\n comp_date = sorting_date\n ) |>\n select(permno, gvkey, sorting_date, comp_date, bm)\n\ndata_for_sorts <- crsp_monthly |>\n left_join(\n bm, by = c(\"permno\", \"gvkey\", \"month\" = \"sorting_date\")\n ) |>\n left_join(\n me, by = c(\"permno\", \"month\" = \"sorting_date\")\n ) |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap_lag, me, bm, exchange, comp_date\n )\n\ndata_for_sorts <- data_for_sorts |>\n arrange(permno, gvkey, month) |>\n group_by(permno, gvkey) |>\n fill(bm, comp_date) |>\n ungroup() |> \n filter(comp_date > month %m-% months(12)) |>\n select(-comp_date) |>\n drop_na()\n```\n:::\n\n\nThe last step of preparation for the portfolio sorts is the computation of breakpoints. We continue to use the same function allowing for the specification of exchanges to use for the breakpoints. Additionally, we reintroduce the argument `sorting_variable` into the function for defining different sorting variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n n_portfolios, \n exchanges) {\n breakpoints <- data |>\n filter(exchange %in% exchanges) |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = seq(0, 1, length.out = n_portfolios + 1),\n na.rm = TRUE,\n names = FALSE\n )\n\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n return(assigned_portfolios)\n}\n```\n:::\n\n\nAfter these data preparation steps, we present bivariate portfolio sorts on an independent and dependent basis.\n\n## Independent Sorts\n\nBivariate sorts create portfolios within a two-dimensional space spanned by two sorting variables. It is then possible to assess the return impact of either sorting variable by the return differential from a trading strategy that invests in the portfolios at either end of the respective variables spectrum. We create a five-by-five matrix using book-to-market and firm size as sorting variables in our example below. We end up with 25 portfolios. Since we are interested in the *value premium* (i.e., the return differential between high and low book-to-market firms), we go long the five portfolios of the highest book-to-market firms and short the five portfolios of the lowest book-to-market firms.\\index{Value premium} The five portfolios at each end are due to the size splits we employed alongside the book-to-market splits.\n\nTo implement the independent bivariate portfolio sort, we assign monthly portfolios for each of our sorting variables separately to create the variables `portfolio_bm` and `portfolio_me`, respectively.\\index{Portfolio sorts!Independent bivariate} Then, these separate portfolios are combined to the final sort stored in `portfolio_combined`. After assigning the portfolios, we compute the average return within each portfolio for each month. Additionally, we keep the book-to-market portfolio as it makes the computation of the value premium easier. The alternative would be to disaggregate the combined portfolio in a separate step. 
Notice that we weigh the stocks within each portfolio by their market capitalization, i.e., we decide to value-weight our returns.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"bm\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n ),\n portfolio_me = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"me\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_bm, portfolio_me) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )\n```\n:::\n\n\nEquipped with our monthly portfolio returns, we are ready to compute the value premium. However, we still have to decide how to invest in the five high and the five low book-to-market portfolios. The most common approach is to weigh these portfolios equally, but this is yet another researcher's choice. Then, we compute the return differential between the high and low book-to-market portfolios and show the average value premium.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_premium <- value_portfolios |>\n group_by(month, portfolio_bm) |>\n summarize(ret = mean(ret), .groups = \"drop_last\") |>\n summarize(\n value_premium = ret[portfolio_bm == max(portfolio_bm)] -\n ret[portfolio_bm == min(portfolio_bm)]\n ) |> \n summarize(\n value_premium = mean(value_premium)\n )\n```\n:::\n\n\nThe resulting monthly value premium is 0.43 percent with an annualized return of 5.3 percent.\n\n## Dependent Sorts\n\nIn the previous exercise, we assigned the portfolios without considering the second variable in the assignment. This protocol is called independent portfolio sorts. The alternative, i.e., dependent sorts, creates portfolios for the second sorting variable within each bucket of the first sorting variable.\\index{Portfolio sorts!Dependent bivariate} In our example below, we sort firms into five size buckets, and within each of those buckets, we assign firms to five book-to-market portfolios. Hence, we have monthly breakpoints that are specific to each size group. The decision between independent and dependent portfolio sorts is another choice for the researcher. Notice that dependent sorts ensure an equal amount of stocks within each portfolio.\n\nTo implement the dependent sorts, we first create the size portfolios by calling `assign_portfolio()` with `sorting_variable = \"me\"`. Then, we group our data again by month and by the size portfolio before assigning the book-to-market portfolio. The rest of the implementation is the same as before. 
Finally, we compute the value premium.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(portfolio_me = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"me\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_me) |>\n mutate(\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"bm\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_me, portfolio_bm) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )\n\nvalue_premium <- value_portfolios |>\n group_by(month, portfolio_bm) |>\n summarize(ret = mean(ret), .groups = \"drop_last\") |>\n summarize(\n value_premium = ret[portfolio_bm == max(portfolio_bm)] -\n ret[portfolio_bm == min(portfolio_bm)]\n ) |> \n summarize(\n value_premium = mean(value_premium)\n )\n```\n:::\n\n\nThe monthly value premium from dependent sorts is 0.38 percent, which translates to an annualized premium of 4.6 percent per year.\n\nOverall, we show how to conduct bivariate portfolio sorts in this chapter. In one case, we sort the portfolios independently of each other. Yet we also discuss how to create dependent portfolio sorts. Along the lines of [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd), we see how many choices a researcher has to make to implement portfolio sorts, and bivariate sorts increase the number of choices.\n\n## Exercises\n\n1. In [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd), we examine the distribution of market equity. Repeat this analysis for book equity and the book-to-market ratio (alongside a plot of the breakpoints, i.e., deciles).\n1. When we investigate the portfolios, we focus on the returns exclusively. However, it is also of interest to understand the characteristics of the portfolios. Write a function to compute the average characteristics for size and book-to-market across the 25 independently and dependently sorted portfolios.\n1. As for the size premium, also the value premium constructed here does not follow @Fama1993. Implement a p-hacking setup as in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd) to find a premium that comes closest to their HML premium.", + "markdown": "---\ntitle: Value and Bivariate Sorts\naliases:\n - ../value-and-bivariate-sorts.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Value and Bivariate Sorts with R\n description-meta: Perform bivariate portfolio sorts to test for return predictability in asset pricing applications using the programming language R. \n---\n\n\nIn this chapter, we extend univariate portfolio analysis to bivariate sorts, which means we assign stocks to portfolios based on two characteristics. Bivariate sorts are regularly used in the academic asset pricing literature and are the basis for the Fama and French three factors. However, some scholars also use sorts with three grouping variables. Conceptually, portfolio sorts are easily applicable in higher dimensions.\n\nWe form portfolios on firm size and the book-to-market ratio. To calculate book-to-market ratios, accounting data is required, which necessitates additional steps during portfolio formation. In the end, we demonstrate how to form portfolios on two sorting variables using so-called independent and dependent portfolio sorts.\n\nThe current chapter relies on this set of R packages. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nFirst, we load the necessary data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd). We conduct portfolio sorts based on the CRSP sample but keep only the necessary columns in our memory. We use the same data sources for firm size as in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd).\\index{Data!CRSP}\\index{Data!Fama-French factors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap, mktcap_lag, exchange\n ) |> \n collect() |>\n drop_na()\n```\n:::\n\n\nFurther, we utilize accounting data. The most common source of accounting data is Compustat. We only need book equity data in this application, which we select from our database. Additionally, we convert the variable `datadate` to its monthly value, as we only consider monthly returns here and do not need to account for the exact date. To achieve this, we use the function `floor_date()`.\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbook_equity <- tbl(tidy_finance, \"compustat\") |>\n select(gvkey, datadate, be) |>\n collect() |>\n drop_na() |>\n mutate(month = floor_date(ymd(datadate), \"month\"))\n```\n:::\n\n\n## Book-to-Market Ratio\n\nA fundamental problem in handling accounting data is the *look-ahead bias* - we must not include data in forming a portfolio that is not public knowledge at the time. Of course, researchers have more information when looking into the past than agents had at that moment. However, abnormal excess returns from a trading strategy should not rely on an information advantage because the differential cannot be the result of informed agents' trades. Hence, we have to lag accounting information.\n\nWe continue to lag market capitalization and firm size by one month.\\index{Market capitalization}\\index{Firm size} Then, we compute the book-to-market ratio, which relates a firm's book equity to its market equity.\\index{Book equity}\\index{Book-to-market ratio} Firms with high (low) book-to-market ratio are called value (growth) firms. After matching the accounting and market equity information from the same month, we lag book-to-market by six months. This is a sufficiently conservative approach because accounting information is usually released well before six months pass. However, in the asset pricing literature, even longer lags are used as well.^[The definition of a time lag is another choice a researcher has to make, similar to breakpoint choices as we describe in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd).]\n\nHaving both variables, i.e., firm size lagged by one month and book-to-market lagged by six months, we merge these sorting variables to our returns using the `sorting_date`-column created for this purpose. The final step in our data preparation deals with differences in the frequency of our variables. Returns and firm size are recorded monthly. Yet the accounting information is only released on an annual basis. Hence, we only match book-to-market to one month per year and have eleven empty observations. 
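To make this frequency mismatch concrete, the following minimal sketch with hypothetical values shows the situation for a single firm: the ratio is observed in one month and missing in the months that follow.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Hypothetical one-firm example: an annual book-to-market ratio is observed in\n# a single month, while the following months remain empty until the next report.\ntibble(\n  month = seq(ymd(\"2021-06-01\"), ymd(\"2021-11-01\"), by = \"1 month\"),\n  bm = c(0.80, NA, NA, NA, NA, NA)\n)\n```\n:::\n\n\n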
To solve this frequency issue, we carry the latest book-to-market ratio of each firm to the subsequent months, i.e., we fill the missing observations with the most current report. This is done via the `fill()`-function after sorting by date and firm (which we identify by `permno` and `gvkey`) and on a firm basis (which we do by `group_by()` as usual). We filter out all observations with accounting data that is older than a year. As the last step, we remove all rows with missing entries because the returns cannot be matched to any annual report.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nme <- crsp_monthly |>\n mutate(sorting_date = month %m+% months(1)) |>\n select(permno, sorting_date, me = mktcap)\n\nbm <- book_equity |>\n inner_join(crsp_monthly, join_by(gvkey, month)) |>\n mutate(\n bm = be / mktcap,\n sorting_date = month %m+% months(6),\n comp_date = sorting_date\n ) |>\n select(permno, gvkey, sorting_date, comp_date, bm)\n\ndata_for_sorts <- crsp_monthly |>\n left_join(\n bm, join_by(permno, gvkey, month == sorting_date)\n ) |>\n left_join(\n me, join_by(permno, month == sorting_date)\n ) |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap_lag, me, bm, exchange, comp_date\n )\n\ndata_for_sorts <- data_for_sorts |>\n arrange(permno, gvkey, month) |>\n group_by(permno, gvkey) |>\n fill(bm, comp_date) |>\n ungroup() |> \n filter(comp_date > month %m-% months(12)) |>\n select(-comp_date) |>\n drop_na()\n```\n:::\n\n\nThe last step of preparation for the portfolio sorts is the computation of breakpoints. We continue to use the same function allowing for the specification of exchanges to use for the breakpoints. Additionally, we reintroduce the argument `sorting_variable` into the function for defining different sorting variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n n_portfolios, \n exchanges) {\n breakpoints <- data |>\n filter(exchange %in% exchanges) |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = seq(0, 1, length.out = n_portfolios + 1),\n na.rm = TRUE,\n names = FALSE\n )\n\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n return(assigned_portfolios)\n}\n```\n:::\n\n\nAfter these data preparation steps, we present bivariate portfolio sorts on an independent and dependent basis.\n\n## Independent Sorts\n\nBivariate sorts create portfolios within a two-dimensional space spanned by two sorting variables. It is then possible to assess the return impact of either sorting variable by the return differential from a trading strategy that invests in the portfolios at either end of the respective variables spectrum. We create a five-by-five matrix using book-to-market and firm size as sorting variables in our example below. We end up with 25 portfolios. 
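To fix ideas, the following sketch lays out these 25 cells with hypothetical labels and marks which cells enter the long and short legs of the value strategy described next.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch with hypothetical labels: enumerate the 25 size/book-to-market cells\n# and mark the cells that form the long and short legs of the value strategy.\nexpand_grid(portfolio_me = 1:5, portfolio_bm = 1:5) |>\n  mutate(leg = case_when(\n    portfolio_bm == 5 ~ \"long\",\n    portfolio_bm == 1 ~ \"short\",\n    .default = \"not used\"\n  ))\n```\n:::\n\n\n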
Since we are interested in the *value premium* (i.e., the return differential between high and low book-to-market firms), we go long the five portfolios of the highest book-to-market firms and short the five portfolios of the lowest book-to-market firms.\\index{Value premium} The five portfolios at each end are due to the size splits we employed alongside the book-to-market splits.\n\nTo implement the independent bivariate portfolio sort, we assign monthly portfolios for each of our sorting variables separately to create the variables `portfolio_bm` and `portfolio_me`, respectively.\\index{Portfolio sorts!Independent bivariate} Then, these separate portfolios are combined to the final sort stored in `portfolio_combined`. After assigning the portfolios, we compute the average return within each portfolio for each month. Additionally, we keep the book-to-market portfolio as it makes the computation of the value premium easier. The alternative would be to disaggregate the combined portfolio in a separate step. Notice that we weigh the stocks within each portfolio by their market capitalization, i.e., we decide to value-weight our returns.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"bm\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n ),\n portfolio_me = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"me\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_bm, portfolio_me) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )\n```\n:::\n\n\nEquipped with our monthly portfolio returns, we are ready to compute the value premium. However, we still have to decide how to invest in the five high and the five low book-to-market portfolios. The most common approach is to weigh these portfolios equally, but this is yet another researcher's choice. Then, we compute the return differential between the high and low book-to-market portfolios and show the average value premium.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_premium <- value_portfolios |>\n group_by(month, portfolio_bm) |>\n summarize(ret = mean(ret), .groups = \"drop_last\") |>\n summarize(\n value_premium = ret[portfolio_bm == max(portfolio_bm)] -\n ret[portfolio_bm == min(portfolio_bm)]\n ) |> \n summarize(\n value_premium = mean(value_premium)\n )\n```\n:::\n\n\nThe resulting monthly value premium is 0.43 percent with an annualized return of 5.3 percent.\n\n## Dependent Sorts\n\nIn the previous exercise, we assigned the portfolios without considering the second variable in the assignment. This protocol is called independent portfolio sorts. The alternative, i.e., dependent sorts, creates portfolios for the second sorting variable within each bucket of the first sorting variable.\\index{Portfolio sorts!Dependent bivariate} In our example below, we sort firms into five size buckets, and within each of those buckets, we assign firms to five book-to-market portfolios. Hence, we have monthly breakpoints that are specific to each size group. The decision between independent and dependent portfolio sorts is another choice for the researcher. Notice that dependent sorts ensure an equal amount of stocks within each portfolio.\n\nTo implement the dependent sorts, we first create the size portfolios by calling `assign_portfolio()` with `sorting_variable = \"me\"`. 
Then, we group our data again by month and by the size portfolio before assigning the book-to-market portfolio. The rest of the implementation is the same as before. Finally, we compute the value premium.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(portfolio_me = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"me\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_me) |>\n mutate(\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"bm\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_me, portfolio_bm) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )\n\nvalue_premium <- value_portfolios |>\n group_by(month, portfolio_bm) |>\n summarize(ret = mean(ret), .groups = \"drop_last\") |>\n summarize(\n value_premium = ret[portfolio_bm == max(portfolio_bm)] -\n ret[portfolio_bm == min(portfolio_bm)]\n ) |> \n summarize(\n value_premium = mean(value_premium)\n )\n```\n:::\n\n\nThe monthly value premium from dependent sorts is 0.38 percent, which translates to an annualized premium of 4.6 percent per year.\n\nOverall, we show how to conduct bivariate portfolio sorts in this chapter. In one case, we sort the portfolios independently of each other. Yet we also discuss how to create dependent portfolio sorts. Along the lines of [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd), we see how many choices a researcher has to make to implement portfolio sorts, and bivariate sorts increase the number of choices.\n\n## Exercises\n\n1. In [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd), we examine the distribution of market equity. Repeat this analysis for book equity and the book-to-market ratio (alongside a plot of the breakpoints, i.e., deciles).\n1. When we investigate the portfolios, we focus on the returns exclusively. However, it is also of interest to understand the characteristics of the portfolios. Write a function to compute the average characteristics for size and book-to-market across the 25 independently and dependently sorted portfolios.\n1. As for the size premium, also the value premium constructed here does not follow @Fama1993. Implement a p-hacking setup as in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd) to find a premium that comes closest to their HML premium.", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/r/wrds-crsp-and-compustat/execute-results/html.json b/_freeze/r/wrds-crsp-and-compustat/execute-results/html.json index 4525d06a..d2a6969d 100644 --- a/_freeze/r/wrds-crsp-and-compustat/execute-results/html.json +++ b/_freeze/r/wrds-crsp-and-compustat/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "8473e5ae4cd380c44e664a41545391d8", + "hash": "18c85d354f045833d161cea91bd91e35", "result": { "engine": "knitr", - "markdown": "---\ntitle: WRDS, CRSP, and Compustat\naliases:\n - ../wrds-crsp-and-compustat.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: WRDS, CRSP, and Compustat with R\n description-meta: Use the programming language R to download stock and firm data such as CRSP and Compustat from WRDS. \n---\n\n\nThis chapter shows how to connect to [Wharton Research Data Services (WRDS)](https://wrds-www.wharton.upenn.edu/), a popular provider of financial and economic data for research applications. 
We use this connection to download the most commonly used data for stock and firm characteristics, CRSP and Compustat. Unfortunately, this data is not freely available, but most students and researchers typically have access to WRDS through their university libraries. Assuming that you have access to WRDS, we show you how to prepare and merge the databases and store them in the `SQLite`-database introduced in the previous chapter. We conclude this chapter by providing some tips for working with the WRDS database.\\index{WRDS}\n\nFirst, we load the R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(scales)\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nWe use the same date range as in the previous chapter to ensure consistency.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2022-12-31\")\n```\n:::\n\n\n## Accessing WRDS\n\nWRDS is the most widely used source for asset and firm-specific financial data used in academic settings. WRDS is a data platform that provides data validation, flexible delivery options, and access to many different data sources. The data at WRDS is also organized in an SQL database, although they use the [PostgreSQL](https://www.postgresql.org/) engine. This database engine is just as easy to handle with R as SQLite. We use the `RPostgres` package to establish a connection to the WRDS database [@RPostgres]. Note that you could also use the `odbc` package to connect to a PostgreSQL database, but then you need to install the appropriate drivers yourself. `RPostgres` already contains a suitable driver.\\index{Database!PostgreSQL}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RPostgres)\n```\n:::\n\n\nTo establish a connection, you use the function `dbConnect()` with the following arguments. Note that you need to replace the `WRDS_USER` and `WRDS_PASSWORD` arguments with your own credentials. We defined environment variables for the purpose of this book because we obviously do not want (and are not allowed) to share our credentials with the rest of the world (these environment variables are stored in an `.Renviron`-file in our project directory and loaded with the `Sys.getenv()` function).\n\nAdditionally, you have to use multi-factor (i.e., two-factor) authentication since May 2023 when establishing a `PostgreSQL` or other remote connections. You have two choices to provide the additional identification. First, if you have Duo Push enabled for your WRDS account, you will receive a push notification on your mobile phone when trying to establish a connection with the code below. Upon accepting the notification, you can continue your work. Second, you can log in to a WRDS website that requires multi-factor authentication with your username and the same IP address. Once you have successfully identified yourself on the website, your username-IP combination will be remembered for 30 days, and you can comfortably use the remote connection below.\\index{Two-factor authentication}\\index{WRDS:Two-factor authentication}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrds <- dbConnect(\n Postgres(),\n host = \"wrds-pgdata.wharton.upenn.edu\",\n dbname = \"wrds\",\n port = 9737,\n sslmode = \"require\",\n user = Sys.getenv(\"WRDS_USER\"),\n password = Sys.getenv(\"WRDS_PASSWORD\")\n)\n```\n:::\n\n\nThe remote connection to WRDS is very useful. Yet, the database itself contains many different tables. 
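As a preview of the query-based approach mentioned below, the following sketch uses the existing `wrds` connection and PostgreSQL's standard information schema to list the tables in the `crsp` schema; treat it as an illustration rather than part of our pipeline.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch (assuming the wrds connection from above): list tables in the CRSP\n# schema via PostgreSQL's standard information schema.\nres <- dbSendQuery(wrds, \"\n  SELECT table_name\n  FROM information_schema.tables\n  WHERE table_schema = 'crsp'\n\")\ndbFetch(res, n = 10)\ndbClearResult(res)\n```\n:::\n\n\n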
You can check the WRDS homepage to identify the table's name you are looking for (if you go beyond our exposition). Alternatively, you can also query the data structure with the function `dbSendQuery()`. If you are interested, there is an exercise below that is based on WRDS' tutorial on [\"Querying WRDS Data using R\".](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/) Furthermore, the penultimate section of this chapter shows how to investigate the structure of databases.\n\n## Downloading and Preparing CRSP\n\n\\index{Data!CRSP}[The Center for Research in Security Prices (CRSP)](https://crsp.org/) provides the most widely used data for US stocks. We use the `wrds` connection object that we just created to first access monthly CRSP return data. Actually, we need three tables to get the desired data: (i) the CRSP monthly security file,\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsf_db <- tbl(wrds, in_schema(\"crsp\", \"msf\"))\n```\n:::\n\n\n(ii) the identifying information,\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsenames_db <- tbl(wrds, in_schema(\"crsp\", \"msenames\"))\n```\n:::\n\n\nand (iii) the delisting information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsedelist_db <- tbl(wrds, in_schema(\"crsp\", \"msedelist\"))\n```\n:::\n\n\nWe use the three remote tables to fetch the data we want to put into our local database. Just as above, the idea is that we let the WRDS database do all the work and just download the data that we actually need. We apply common filters and data selection criteria to narrow down our data of interest: (i) we keep only data in the time windows of interest, (ii) we keep only US-listed stocks as identified via share codes `shrcd` 10 and 11, and (iii) we keep only months within permno-specific start dates `namedt` and end dates `nameendt`. In addition, we add delisting codes and returns. You can read up in the great textbook of @BaliEngleMurray2016 for an extensive discussion on the filters we apply in the code below.\\index{Permno}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- msf_db |>\n filter(date >= start_date & date <= end_date) |>\n inner_join(\n msenames_db |>\n filter(shrcd %in% c(10, 11)) |>\n select(permno, exchcd, siccd, namedt, nameendt),\n by = c(\"permno\")\n ) |>\n filter(date >= namedt & date <= nameendt) |>\n mutate(month = floor_date(date, \"month\")) |>\n left_join(\n msedelist_db |>\n select(permno, dlstdt, dlret, dlstcd) |>\n mutate(month = floor_date(dlstdt, \"month\")),\n by = c(\"permno\", \"month\")\n ) |>\n select(\n permno, # Security identifier\n date, # Date of the observation\n month, # Month of the observation\n ret, # Return\n shrout, # Shares outstanding (in thousands)\n altprc, # Last traded price in a month\n exchcd, # Exchange code\n siccd, # Industry code\n dlret, # Delisting return\n dlstcd # Delisting code\n ) |>\n collect() |>\n mutate(\n month = ymd(month),\n shrout = shrout * 1000\n )\n```\n:::\n\n\nNow, we have all the relevant monthly return data in memory and proceed with preparing the data for future analyses. We perform the preparation step at the current stage since we want to avoid executing the same mutations every time we use the data in subsequent chapters. 
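If you want to convince yourself of what exactly arrived in memory, an optional quick check is the following sketch (not part of the preparation steps).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Optional sanity check: inspect the downloaded columns and a few example values.\nglimpse(crsp_monthly)\n```\n:::\n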
\n\nThe first additional variable we create is market capitalization (`mktcap`), which is the product of the number of outstanding shares `shrout` and the last traded price in a month `altprc`.\\index{Market capitalization} Note that in contrast to returns `ret`, these two variables are not adjusted ex-post for any corporate actions like stock splits. Moreover, the `altprc` is negative whenever the last traded price does not exist, and CRSP decides to report the mid-quote of the last available order book instead. Hence, we take the absolute value of the market cap. We also keep the market cap in millions of USD just for convenience as we do not want to print huge numbers in our figures and tables. In addition, we set zero market cap to missing as it makes conceptually little sense (i.e., the firm would be bankrupt).\\index{Stock price}\\index{Returns}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(\n mktcap = abs(shrout * altprc) / 10^6,\n mktcap = na_if(mktcap, 0)\n )\n```\n:::\n\n\nThe next variable we frequently use is the one-month *lagged* market capitalization. Lagged market capitalization is typically used to compute value-weighted portfolio returns, as we demonstrate in a later chapter. The most simple and consistent way to add a column with lagged market cap values is to add one month to each observation and then join the information to our monthly CRSP data.\\index{Weighting!Value}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmktcap_lag <- crsp_monthly |>\n mutate(month = month %m+% months(1)) |>\n select(permno, month, mktcap_lag = mktcap)\n\ncrsp_monthly <- crsp_monthly |>\n left_join(mktcap_lag, by = c(\"permno\", \"month\"))\n```\n:::\n\n\nIf you wonder why we do not use the `lag()` function, e.g., via `crsp_monthly |> group_by(permno) |> mutate(mktcap_lag = lag(mktcap))`, take a look at the exercises.\n\nNext, we follow @BaliEngleMurray2016 in transforming listing exchange codes to explicit exchange names. \\index{Exchange!Exchange codes}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(exchange = case_when(\n exchcd %in% c(1, 31) ~ \"NYSE\",\n exchcd %in% c(2, 32) ~ \"AMEX\",\n exchcd %in% c(3, 33) ~ \"NASDAQ\",\n .default = \"Other\"\n ))\n```\n:::\n\n\nSimilarly, we transform industry codes to industry descriptions following @BaliEngleMurray2016.\\index{Industry codes} Notice that there are also other categorizations of industries [e.g., @FamaFrench1997] that are commonly used.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(industry = case_when(\n siccd >= 1 & siccd <= 999 ~ \"Agriculture\",\n siccd >= 1000 & siccd <= 1499 ~ \"Mining\",\n siccd >= 1500 & siccd <= 1799 ~ \"Construction\",\n siccd >= 2000 & siccd <= 3999 ~ \"Manufacturing\",\n siccd >= 4000 & siccd <= 4899 ~ \"Transportation\",\n siccd >= 4900 & siccd <= 4999 ~ \"Utilities\",\n siccd >= 5000 & siccd <= 5199 ~ \"Wholesale\",\n siccd >= 5200 & siccd <= 5999 ~ \"Retail\",\n siccd >= 6000 & siccd <= 6799 ~ \"Finance\",\n siccd >= 7000 & siccd <= 8999 ~ \"Services\",\n siccd >= 9000 & siccd <= 9999 ~ \"Public\",\n TRUE ~ \"Missing\"\n ))\n```\n:::\n\n\nWe also construct returns adjusted for delistings as described by @BaliEngleMurray2016. The delisting of a security usually results when a company ceases operations, declares bankruptcy, merges, does not meet listing requirements, or seeks to become private. 
The adjustment tries to reflect the returns of investors who bought the stock in the month before the delisting and held it until the delisting date. After this transformation, we can drop the delisting returns and codes.\\index{Returns!Delisting}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(ret_adj = case_when(\n is.na(dlstcd) ~ ret,\n !is.na(dlstcd) & !is.na(dlret) ~ dlret,\n dlstcd %in% c(500, 520, 580, 584) |\n (dlstcd >= 551 & dlstcd <= 574) ~ -0.30,\n dlstcd == 100 ~ ret,\n TRUE ~ -1\n )) |>\n select(-c(dlret, dlstcd))\n```\n:::\n\n\nNext, we compute excess returns by subtracting the monthly risk-free rate provided by our Fama-French data.\\index{Returns!Excess}\\index{Risk-free rate} As we base all our analyses on the excess returns, we can drop adjusted returns and the risk-free rate from our tibble. Note that we ensure excess returns are bounded by -1 from below as a return less than -100% makes no sense conceptually. Before we can adjust the returns, we have to connect to our database and load the table `factors_ff3_monthly`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, rf) |>\n collect()\n\ncrsp_monthly <- crsp_monthly |>\n left_join(factors_ff3_monthly,\n by = \"month\"\n ) |>\n mutate(\n ret_excess = ret_adj - rf,\n ret_excess = pmax(ret_excess, -1)\n ) |>\n select(-ret_adj, -rf)\n```\n:::\n\n\nSince excess returns and market capitalization are crucial for all our analyses, we can safely exclude all observations with missing returns or market capitalization. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n drop_na(ret_excess, mktcap, mktcap_lag)\n```\n:::\n\n\nFinally, we store the monthly CRSP file in our database. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"crsp_monthly\",\n value = crsp_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\n## First Glimpse of the CRSP Sample\n\nBefore we move on to other data sources, let us look at some descriptive statistics of the CRSP sample, which is our main source for stock returns. \n\n@fig-211 shows the monthly number of securities by listing exchange over time. NYSE has the longest history in the data, but NASDAQ lists a considerably large number of stocks. The number of stocks listed on AMEX decreased steadily over the last couple of decades. By the end of 2022, there were 2,778 stocks with a primary listing on NASDAQ, 1,358 on NYSE, 162 on AMEX, and only one belonged to the other category. \\index{Exchange!NYSE}\\index{Exchange!AMEX}\\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n count(exchange, date) |>\n ggplot(aes(x = date, y = n, color = exchange, linetype = exchange)) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly number of securities by listing exchange\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Number of stocks in the CRSP sample listed at each of the US exchanges.](wrds-crsp-and-compustat_files/figure-html/fig-211-1.png){#fig-211 fig-alt='Title: Monthly number of securities by listing exchange. The figure shows a line chart with the number of securities by listing exchange from 1960 to 2022. In the earlier period, NYSE dominated as a listing exchange. 
There is a strong upwards trend for NASDAQ. Other listing exchanges do only play a minor role.' width=2100}\n:::\n:::\n\n\nNext, we look at the aggregate market capitalization grouped by the respective listing exchanges in @fig-212. To ensure that we look at meaningful data which is comparable over time, we adjust the nominal values for inflation. In fact, we can use the tables that are already in our database to calculate aggregate market caps by listing exchange and plotting it just as if they were in memory. All values in @fig-212 are at the end of 2022 USD to ensure intertemporal comparability. NYSE-listed stocks have by far the largest market capitalization, followed by NASDAQ-listed stocks.\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntbl(tidy_finance, \"crsp_monthly\") |>\n left_join(tbl(tidy_finance, \"cpi_monthly\"), by = \"month\") |>\n group_by(month, exchange) |>\n summarize(\n mktcap = sum(mktcap, na.rm = TRUE) / cpi,\n .groups = \"drop\"\n ) |>\n collect() |>\n mutate(month = ymd(month)) |>\n ggplot(aes(\n x = month, y = mktcap / 1000,\n color = exchange, linetype = exchange\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly market cap by listing exchange in billions of Dec 2022 USD\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Market capitalization is measured in billion USD, adjusted for consumer price index changes such that the values on the horizontal axis reflect the buying power of billion USD in December 2022.](wrds-crsp-and-compustat_files/figure-html/fig-212-1.png){#fig-212 fig-alt='Title: Monthly market cap by listing exchange in billion USD as of Dec 2022. The figure shows a line chart of the total market capitalization of all stocks aggregated by the listing exchange from 1960 to 2022, with years on the horizontal axis and the corresponding market capitalization on the vertical axis. Historically, NYSE listed stocks had the highest market capitalization. In the more recent past, the valuation of NASDAQ listed stocks exceeded that of NYSE listed stocks.' width=2100}\n:::\n:::\n\n\nOf course, performing the computation in the database is not really meaningful because we can easily pull all the required data into our memory. The code chunk above is slower than performing the same steps on tables that are already in memory. However, we just want to illustrate that you can perform many things in the database before loading the data into your memory. Before we proceed, we load the monthly CPI data.\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncpi_monthly <- tbl(tidy_finance, \"cpi_monthly\") |>\n collect()\n```\n:::\n\n\nNext, we look at the same descriptive statistics by industry. @fig-213 plots the number of stocks in the sample for each of the SIC industry classifiers. For most of the sample period, the largest share of stocks is in manufacturing, albeit the number peaked somewhere in the 90s. 
The number of firms associated with public administration seems to be the only category on the rise in recent years, even surpassing manufacturing at the end of our sample period.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_industry <- crsp_monthly |>\n left_join(cpi_monthly, by = \"month\") |>\n group_by(month, industry) |>\n summarize(\n securities = n_distinct(permno),\n mktcap = sum(mktcap) / mean(cpi),\n .groups = \"drop\"\n )\n\ncrsp_monthly_industry |>\n ggplot(aes(\n x = month,\n y = securities,\n color = industry,\n linetype = industry\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly number of securities by industry\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Number of stocks in the CRSP sample associated with different industries.](wrds-crsp-and-compustat_files/figure-html/fig-213-1.png){#fig-213 fig-alt='Title: Monthly number of securities by industry. The figure shows a line chart of the number of securities by industry from 1960 to 2022 with years on the horizontal axis and the corresponding number on the vertical axis. Except for stocks that are assigned to the industry public administration, the number of listed stocks decreased steadily at least since 1996. As of 2022, the segment of firms within public administration is the largest in terms of the number of listed stocks.' width=2100}\n:::\n:::\n\n\nWe also compute the market cap of all stocks belonging to the respective industries and show the evolution over time in @fig-214. All values are again in terms of billions of end of 2022 USD. At all points in time, manufacturing firms comprise of the largest portion of market capitalization. Toward the end of the sample, however, financial firms and services begin to make up a substantial portion of the market cap.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_industry |>\n ggplot(aes(\n x = month,\n y = mktcap / 1000,\n color = industry,\n linetype = industry\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly total market cap by industry in billions as of Dec 2022 USD\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Market capitalization is measured in billion USD, adjusted for consumer price index changes such that the values on the y-axis reflect the buying power of billion USD in December 2022.](wrds-crsp-and-compustat_files/figure-html/fig-214-1.png){#fig-214 fig-alt='Title: Monthly total market cap by industry in billions as of Dec 2022 USD. The figure shows a line chart of total market capitalization of all stocks in the CRSP sample aggregated by industry from 1960 to 2022 with years on the horizontal axis and the corresponding market capitalization on the vertical axis. Stocks in the manufacturing sector have always had the highest market valuation. The figure shows a general upwards trend during the most recent past. ' width=2100}\n:::\n:::\n\n\n## Daily CRSP Data\n\nBefore we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. 
This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session. \n\nThere is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easier to handle.\\index{Big data} That is, instead of downloading data about all stocks at once, download the data in small batches of stocks consecutively. Such operations can be implemented in `for()`-loops,\\index{For-loops} where we download, prepare, and store the data for a small number of stocks in each iteration. This operation might nonetheless take around 5 minutes, depending on your internet connection. To keep track of the progress, we create ad-hoc progress updates using `cat()`. Notice that we also use the function `dbWriteTable()` here with the option to append the new data to an existing table, when we process the second and all following batches. \n\nAlso notice that we adjust the returns for delistings in a different manner than for monthly returns: \n\n1. If a return is recorded in CRSP daily for a delisting date `dlstdt`, then we replace it with the delisting return `dlret`. \n2. If no return is recorded in CRSP daily for an existing delisting date, then we add a new row with the delisting return. \n3. We ensure that no returns after the delisting date `dlstdt` remain in the final table. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndsf_db <- tbl(wrds, in_schema(\"crsp\", \"dsf\"))\n\nfactors_ff3_daily <- tbl(tidy_finance, \"factors_ff3_daily\") |>\n collect()\n\npermnos <- tbl(tidy_finance, \"crsp_monthly\") |>\n distinct(permno) |>\n pull()\n\nbatch_size <- 500\nbatches <- ceiling(length(permnos) / batch_size)\n\nfor (j in 1:batches) {\n \n permno_batch <- permnos[\n ((j - 1) * batch_size + 1):min(j * batch_size, length(permnos))\n ]\n\n crsp_daily_sub <- dsf_db |>\n filter(permno %in% permno_batch &\n date >= start_date & date <= end_date) |>\n select(permno, date, ret) |>\n collect() |>\n drop_na()\n\n if (nrow(crsp_daily_sub) > 0) {\n \n msedelist_sub <- msedelist_db |> \n filter(permno %in% permno_batch) |> \n select(permno, dlstdt, dlret) |> \n collect() |>\n drop_na()\n \n crsp_daily_sub <- crsp_daily_sub |> \n left_join(msedelist_sub, by = c(\"permno\", \"date\"=\"dlstdt\")) |> \n bind_rows(msedelist_sub |> \n anti_join(crsp_daily_sub, \n by = c(\"permno\", \"dlstdt\" = \"date\"))) |> \n mutate(ret = if_else(!is.na(dlret), dlret, ret),\n date = if_else(!is.na(dlstdt), dlstdt, date)) |> \n select(-c(dlret, dlstdt)) |> \n left_join(msedelist_sub |> \n select(permno, dlstdt), by = \"permno\") |> \n mutate(dlstdt = replace_na(dlstdt, end_date)) |> \n filter(date <= dlstdt) |> \n select(-dlstdt)\n \n crsp_daily_sub <- crsp_daily_sub |>\n mutate(month = floor_date(date, \"month\")) |>\n left_join(factors_ff3_daily |>\n select(date, rf), by = \"date\") |>\n mutate(\n ret_excess = ret - rf,\n ret_excess = pmax(ret_excess, -1)\n ) |>\n select(permno, date, month, ret, ret_excess)\n\n dbWriteTable(tidy_finance,\n \"crsp_daily\",\n value = crsp_daily_sub,\n overwrite = ifelse(j == 1, TRUE, FALSE),\n append = ifelse(j != 1, TRUE, FALSE)\n )\n }\n\n cat(\"Batch\", j, \"out of\", batches, \"done (\", percent(j / batches), \")\\n\")\n}\n```\n:::\n\n\nEventually, we end up with more than 71 million rows of daily return 
data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data that we actually use. \n\n## Preparing Compustat data\n\nFirm accounting data are an important source of information that we use in portfolio analyses in subsequent chapters. The commonly used source for firm financial information is Compustat provided by [S&P Global Market Intelligence,](https://www.spglobal.com/marketintelligence/en/) which is a global data vendor that provides financial, statistical, and market information on active and inactive companies throughout the world.\\index{Data!Compustat} For US and Canadian companies, annual history is available back to 1950 and quarterly as well as monthly histories date back to 1962.\n\nTo access Compustat data, we can again tap WRDS, which hosts the `funda` table that contains annual firm-level information on North American companies.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfunda_db <- tbl(wrds, in_schema(\"comp\", \"funda\"))\n```\n:::\n\n\nWe follow the typical filter conventions and pull only data that we actually need: (i) we get only records in industrial data format, (ii) in the standard format (i.e., consolidated information in standard presentation), and (iii) only data in the desired time window.\\index{Gvkey}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- funda_db |>\n filter(\n indfmt == \"INDL\" &\n datafmt == \"STD\" &\n consol == \"C\" &\n datadate >= start_date & datadate <= end_date\n ) |>\n select(\n gvkey, # Firm identifier\n datadate, # Date of the accounting data\n seq, # Stockholders' equity\n ceq, # Total common/ordinary equity\n at, # Total assets\n lt, # Total liabilities\n txditc, # Deferred taxes and investment tax credit\n txdb, # Deferred taxes\n itcb, # Investment tax credit\n pstkrv, # Preferred stock redemption value\n pstkl, # Preferred stock liquidating value\n pstk, # Preferred stock par value\n capx, # Capital investment\n oancf, # Operating cash flow\n sale, # Revenue\n cogs, # Costs of goods sold\n xint, # Interest expense\n xsga # Selling, general, and administrative expenses\n ) |>\n collect()\n```\n:::\n\n\nNext, we calculate the book value of preferred stock and equity `be` and the operating profitability `op` inspired by the [variable definitions in Ken French's data library.](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/variable_definitions.html) Note that we set negative or zero equity to missing which is a common practice when working with book-to-market ratios [see @Fama1992 for details].\\index{Book equity}\\index{Preferred stock}\\index{Operating profitability}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |>\n mutate(\n be = coalesce(seq, ceq + pstk, at - lt) +\n coalesce(txditc, txdb + itcb, 0) -\n coalesce(pstkrv, pstkl, pstk, 0),\n be = if_else(be <= 0, as.numeric(NA), be),\n op = (sale - coalesce(cogs, 0) - \n coalesce(xsga, 0) - coalesce(xint, 0)) / be,\n )\n```\n:::\n\n\nWe keep only the last available information for each firm-year group. Note that `datadate` defines the time the corresponding financial data refers to (e.g., annual report as of December 31, 2022). Therefore, `datadate` is not the date when data was made available to the public. 
Check out the exercises for more insights into the peculiarities of `datadate`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |>\n mutate(year = year(datadate)) |>\n group_by(gvkey, year) |>\n filter(datadate == max(datadate)) |>\n ungroup()\n```\n:::\n\n\nWe also compute the investment ratio `inv` according to Ken French's variable definitions as the change in total assets from one fiscal year to another. Note that we again use the approach using joins as introduced with the CRSP data above to construct lagged assets.\\index{Investment ratio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |> \n left_join(\n compustat |> \n select(gvkey, year, at_lag = at) |> \n mutate(year = year + 1), by = c(\"gvkey\", \"year\")\n ) |> \n mutate(\n inv = at / at_lag - 1,\n inv = if_else(at_lag <= 0, as.numeric(NA), inv)\n )\n```\n:::\n\n\nWith the last step, we are already done preparing the firm fundamentals. Thus, we can store them in our local database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"compustat\",\n value = compustat,\n overwrite = TRUE\n)\n```\n:::\n\n\n## Merging CRSP with Compustat\n\nUnfortunately, CRSP and Compustat use different keys to identify stocks and firms. CRSP uses `permno` for stocks, while Compustat uses `gvkey` to identify firms. Fortunately, a curated matching table on WRDS allows us to merge CRSP and Compustat, so we create a connection to the *CRSP-Compustat Merged* table (provided by CRSP).\\index{Data!Crsp-Compustat Merged}\\index{Permno}\\index{Gvkey}\\index{Data!Linking table}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nccmxpf_linktable_db <- tbl(\n wrds,\n in_schema(\"crsp\", \"ccmxpf_linktable\")\n)\n```\n:::\n\n\nThe linking table contains links between CRSP and Compustat identifiers from various approaches. However, we need to make sure that we keep only relevant and correct links, again following the description outlined in @BaliEngleMurray2016. Note also that currently active links have no end date, so we just enter the current date via `today()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nccmxpf_linktable <- ccmxpf_linktable_db |>\n filter(linktype %in% c(\"LU\", \"LC\") &\n linkprim %in% c(\"P\", \"C\") &\n usedflag == 1) |>\n select(permno = lpermno, gvkey, linkdt, linkenddt) |>\n collect() |>\n mutate(linkenddt = replace_na(linkenddt, today()))\n```\n:::\n\n\nWe use these links to create a new table with a mapping between stock identifier, firm identifier, and month. We then add these links to the Compustat `gvkey` to our monthly stock data. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nccm_links <- crsp_monthly |>\n inner_join(ccmxpf_linktable, \n by = \"permno\", relationship = \"many-to-many\") |>\n filter(!is.na(gvkey) & \n (date >= linkdt & date <= linkenddt)) |>\n select(permno, gvkey, date)\n\ncrsp_monthly <- crsp_monthly |>\n left_join(ccm_links, by = c(\"permno\", \"date\"))\n```\n:::\n\n\nAs the last step, we update the previously prepared monthly CRSP file with the linking information in our local database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"crsp_monthly\",\n value = crsp_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nBefore we close this chapter, let us look at an interesting descriptive statistic of our data. As the book value of equity plays a crucial role in many asset pricing applications, it is interesting to know for how many of our stocks this information is available. Hence, @fig-215 plots the share of securities with book equity values for each exchange. 
It turns out that the coverage is pretty bad for AMEX- and NYSE-listed stocks in the 60s but hovers around 80% for all periods thereafter. We can ignore the erratic coverage of securities that belong to the other category since there is only a handful of them anyway in our sample.\\index{Exchange!NYSE}\\index{Exchange!AMEX}\\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n group_by(permno, year = year(month)) |>\n filter(date == max(date)) |>\n ungroup() |>\n left_join(compustat, by = c(\"gvkey\", \"year\")) |>\n group_by(exchange, year) |>\n summarize(\n share = n_distinct(permno[!is.na(be)]) / n_distinct(permno),\n .groups = \"drop\"\n ) |>\n ggplot(aes(\n x = year, \n y = share, \n color = exchange,\n linetype = exchange\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Share of securities with book equity values by exchange\"\n ) +\n scale_y_continuous(labels = percent) +\n coord_cartesian(ylim = c(0, 1))\n```\n\n::: {.cell-output-display}\n![End-of-year share of securities with book equity values by listing exchange.](wrds-crsp-and-compustat_files/figure-html/fig-215-1.png){#fig-215 fig-alt='Title: Share of securities with book equity values by exchange. The figure shows a line chart of end-of-year shares of securities with book equity values by exchange from 1960 to 2022 with years on the horizontal axis and the corresponding share on the vertical axis. After an initial period with lower coverage in the early 1960s, typically, more than 80 percent of the entries in the CRSP sample have information about book equity values from Compustat.' width=2100}\n:::\n:::\n\n\n## Some Tricks for PostgreSQL Databases\n\nAs we mentioned above, the WRDS database runs on PostgreSQL rather than SQLite. Finding the right tables for your data needs can be tricky in the WRDS PostgreSQL instance, as the tables are organized in schemas.\\index{Database!Schema} If you wonder what the purpose of schemas is, check out [this documetation.](https://www.postgresql.org/docs/9.1/ddl-schemas.html) For instance, if you want to find all tables that live in the `crsp` schema, you run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds, Id(schema = \"crsp\"))\n```\n:::\n\n\nThis operation returns a list of all tables that belong to the `crsp` family on WRDS, e.g., ` schema = crsp, table = msenames`. Similarly, you can fetch a list of all tables that belong to the `comp` family via\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds, Id(schema = \"comp\"))\n```\n:::\n\n\nIf you want to get all schemas, then run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds)\n```\n:::\n\n\n## Exercises\n\n1. Check out the structure of the WRDS database by sending queries in the spirit of [\"Querying WRDS Data using R\"](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/) and verify the output with `dbListObjects()`. How many tables are associated with CRSP? Can you identify what is stored within `msp500`?\n1. Compute `mkt_cap_lag` using `lag(mktcap)` rather than using joins as above. Filter out all the rows where the lag-based market capitalization measure is different from the one we computed above. Why are the two measures they different?\n1. Plot the average market capitalization of firms for each exchange and industry, respectively, over time. What do you find?\n1. In the `compustat` table, `datadate` refers to the date to which the fiscal year of a corresponding firm refers to. 
Count the number of observations in Compustat by `month` of this date variable. What do you find? What does the finding suggest about pooling observations with the same fiscal year?\n1. Go back to the original Compustat data in `funda_db` and extract rows where the same firm has multiple rows for the same fiscal year. What is the reason for these observations?\n1. Keep the last observation of `crsp_monthly` by year and join it with the `compustat` table. Create the following plots: (i) aggregate book equity by exchange over time and (ii) aggregate annual book equity by industry over time. Do you notice any different patterns to the corresponding plots based on market capitalization?\n1. Repeat the analysis of market capitalization for book equity, which we computed from the Compustat data. Then, use the matched sample to plot book equity against market capitalization. How are these two variables related?", + "markdown": "---\ntitle: WRDS, CRSP, and Compustat\naliases:\n - ../wrds-crsp-and-compustat.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: WRDS, CRSP, and Compustat with R\n description-meta: Use the programming language R to download stock and firm data such as CRSP and Compustat from WRDS. \n---\n\n\nThis chapter shows how to connect to [Wharton Research Data Services (WRDS)](https://wrds-www.wharton.upenn.edu/), a popular provider of financial and economic data for research applications. We use this connection to download the most commonly used data for stock and firm characteristics, CRSP and Compustat. Unfortunately, this data is not freely available, but most students and researchers typically have access to WRDS through their university libraries. Assuming that you have access to WRDS, we show you how to prepare and merge the databases and store them in the `SQLite`-database introduced in the previous chapter. We conclude this chapter by providing some tips for working with the WRDS database.\\index{WRDS}\n\nFirst, we load the R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(scales)\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nWe use the same date range as in the previous chapter to ensure consistency.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2022-12-31\")\n```\n:::\n\n\n## Accessing WRDS\n\nWRDS is the most widely used source for asset and firm-specific financial data used in academic settings. WRDS is a data platform that provides data validation, flexible delivery options, and access to many different data sources. The data at WRDS is also organized in an SQL database, although they use the [PostgreSQL](https://www.postgresql.org/) engine. This database engine is just as easy to handle with R as SQLite. We use the `RPostgres` package to establish a connection to the WRDS database [@RPostgres]. Note that you could also use the `odbc` package to connect to a PostgreSQL database, but then you need to install the appropriate drivers yourself. `RPostgres` already contains a suitable driver.\\index{Database!PostgreSQL}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RPostgres)\n```\n:::\n\n\nTo establish a connection, you use the function `dbConnect()` with the following arguments. Note that you need to replace the `WRDS_USER` and `WRDS_PASSWORD` arguments with your own credentials. 
We defined environment variables for the purpose of this book because we obviously do not want (and are not allowed) to share our credentials with the rest of the world (these environment variables are stored in an `.Renviron`-file in our project directory and loaded with the `Sys.getenv()` function).\n\nAdditionally, you have to use multi-factor (i.e., two-factor) authentication since May 2023 when establishing a `PostgreSQL` or other remote connections. You have two choices to provide the additional identification. First, if you have Duo Push enabled for your WRDS account, you will receive a push notification on your mobile phone when trying to establish a connection with the code below. Upon accepting the notification, you can continue your work. Second, you can log in to a WRDS website that requires multi-factor authentication with your username and the same IP address. Once you have successfully identified yourself on the website, your username-IP combination will be remembered for 30 days, and you can comfortably use the remote connection below.\\index{Two-factor authentication}\\index{WRDS:Two-factor authentication}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrds <- dbConnect(\n Postgres(),\n host = \"wrds-pgdata.wharton.upenn.edu\",\n dbname = \"wrds\",\n port = 9737,\n sslmode = \"require\",\n user = Sys.getenv(\"WRDS_USER\"),\n password = Sys.getenv(\"WRDS_PASSWORD\")\n)\n```\n:::\n\n\nThe remote connection to WRDS is very useful. Yet, the database itself contains many different tables. You can check the WRDS homepage to identify the table's name you are looking for (if you go beyond our exposition). Alternatively, you can also query the data structure with the function `dbSendQuery()`. If you are interested, there is an exercise below that is based on WRDS' tutorial on [\"Querying WRDS Data using R\".](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/) Furthermore, the penultimate section of this chapter shows how to investigate the structure of databases.\n\n## Downloading and Preparing CRSP\n\n\\index{Data!CRSP}[The Center for Research in Security Prices (CRSP)](https://crsp.org/) provides the most widely used data for US stocks. We use the `wrds` connection object that we just created to first access monthly CRSP return data. Actually, we need three tables to get the desired data: (i) the CRSP monthly security file,\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsf_db <- tbl(wrds, in_schema(\"crsp\", \"msf\"))\n```\n:::\n\n\n(ii) the identifying information,\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsenames_db <- tbl(wrds, in_schema(\"crsp\", \"msenames\"))\n```\n:::\n\n\nand (iii) the delisting information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsedelist_db <- tbl(wrds, in_schema(\"crsp\", \"msedelist\"))\n```\n:::\n\n\nWe use the three remote tables to fetch the data we want to put into our local database. Just as above, the idea is that we let the WRDS database do all the work and just download the data that we actually need. We apply common filters and data selection criteria to narrow down our data of interest: (i) we keep only data in the time windows of interest, (ii) we keep only US-listed stocks as identified via share codes `shrcd` 10 and 11, and (iii) we keep only months within permno-specific start dates `namedt` and end dates `nameendt`. In addition, we add delisting codes and returns. 
You can read up in the great textbook of @BaliEngleMurray2016 for an extensive discussion on the filters we apply in the code below.\\index{Permno}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- msf_db |>\n filter(date >= start_date & date <= end_date) |>\n inner_join(\n msenames_db |>\n filter(shrcd %in% c(10, 11)) |>\n select(permno, exchcd, siccd, namedt, nameendt),\n join_by(permno)\n ) |>\n filter(date >= namedt & date <= nameendt) |>\n mutate(month = floor_date(date, \"month\")) |>\n left_join(\n msedelist_db |>\n select(permno, dlstdt, dlret, dlstcd) |>\n mutate(month = floor_date(dlstdt, \"month\")),\n join_by(permno, month)\n ) |>\n select(\n permno, # Security identifier\n date, # Date of the observation\n month, # Month of the observation\n ret, # Return\n shrout, # Shares outstanding (in thousands)\n altprc, # Last traded price in a month\n exchcd, # Exchange code\n siccd, # Industry code\n dlret, # Delisting return\n dlstcd # Delisting code\n ) |>\n collect() |>\n mutate(\n month = ymd(month),\n shrout = shrout * 1000\n )\n```\n:::\n\n\nNow, we have all the relevant monthly return data in memory and proceed with preparing the data for future analyses. We perform the preparation step at the current stage since we want to avoid executing the same mutations every time we use the data in subsequent chapters. \n\nThe first additional variable we create is market capitalization (`mktcap`), which is the product of the number of outstanding shares `shrout` and the last traded price in a month `altprc`.\\index{Market capitalization} Note that in contrast to returns `ret`, these two variables are not adjusted ex-post for any corporate actions like stock splits. Moreover, the `altprc` is negative whenever the last traded price does not exist, and CRSP decides to report the mid-quote of the last available order book instead. Hence, we take the absolute value of the market cap. We also keep the market cap in millions of USD just for convenience as we do not want to print huge numbers in our figures and tables. In addition, we set zero market cap to missing as it makes conceptually little sense (i.e., the firm would be bankrupt).\\index{Stock price}\\index{Returns}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(\n mktcap = abs(shrout * altprc) / 10^6,\n mktcap = na_if(mktcap, 0)\n )\n```\n:::\n\n\nThe next variable we frequently use is the one-month *lagged* market capitalization. Lagged market capitalization is typically used to compute value-weighted portfolio returns, as we demonstrate in a later chapter. The most simple and consistent way to add a column with lagged market cap values is to add one month to each observation and then join the information to our monthly CRSP data.\\index{Weighting!Value}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmktcap_lag <- crsp_monthly |>\n mutate(month = month %m+% months(1)) |>\n select(permno, month, mktcap_lag = mktcap)\n\ncrsp_monthly <- crsp_monthly |>\n left_join(mktcap_lag, join_by(permno, month))\n```\n:::\n\n\nIf you wonder why we do not use the `lag()` function, e.g., via `crsp_monthly |> group_by(permno) |> mutate(mktcap_lag = lag(mktcap))`, take a look at the exercises.\n\nNext, we follow @BaliEngleMurray2016 in transforming listing exchange codes to explicit exchange names. 
\\index{Exchange!Exchange codes}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(exchange = case_when(\n exchcd %in% c(1, 31) ~ \"NYSE\",\n exchcd %in% c(2, 32) ~ \"AMEX\",\n exchcd %in% c(3, 33) ~ \"NASDAQ\",\n .default = \"Other\"\n ))\n```\n:::\n\n\nSimilarly, we transform industry codes to industry descriptions following @BaliEngleMurray2016.\\index{Industry codes} Notice that there are also other categorizations of industries [e.g., @FamaFrench1997] that are commonly used.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(industry = case_when(\n siccd >= 1 & siccd <= 999 ~ \"Agriculture\",\n siccd >= 1000 & siccd <= 1499 ~ \"Mining\",\n siccd >= 1500 & siccd <= 1799 ~ \"Construction\",\n siccd >= 2000 & siccd <= 3999 ~ \"Manufacturing\",\n siccd >= 4000 & siccd <= 4899 ~ \"Transportation\",\n siccd >= 4900 & siccd <= 4999 ~ \"Utilities\",\n siccd >= 5000 & siccd <= 5199 ~ \"Wholesale\",\n siccd >= 5200 & siccd <= 5999 ~ \"Retail\",\n siccd >= 6000 & siccd <= 6799 ~ \"Finance\",\n siccd >= 7000 & siccd <= 8999 ~ \"Services\",\n siccd >= 9000 & siccd <= 9999 ~ \"Public\",\n TRUE ~ \"Missing\"\n ))\n```\n:::\n\n\nWe also construct returns adjusted for delistings as described by @BaliEngleMurray2016. The delisting of a security usually results when a company ceases operations, declares bankruptcy, merges, does not meet listing requirements, or seeks to become private. The adjustment tries to reflect the returns of investors who bought the stock in the month before the delisting and held it until the delisting date. After this transformation, we can drop the delisting returns and codes.\\index{Returns!Delisting}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(ret_adj = case_when(\n is.na(dlstcd) ~ ret,\n !is.na(dlstcd) & !is.na(dlret) ~ dlret,\n dlstcd %in% c(500, 520, 580, 584) |\n (dlstcd >= 551 & dlstcd <= 574) ~ -0.30,\n dlstcd == 100 ~ ret,\n TRUE ~ -1\n )) |>\n select(-c(dlret, dlstcd))\n```\n:::\n\n\nNext, we compute excess returns by subtracting the monthly risk-free rate provided by our Fama-French data.\\index{Returns!Excess}\\index{Risk-free rate} As we base all our analyses on the excess returns, we can drop adjusted returns and the risk-free rate from our tibble. Note that we ensure excess returns are bounded by -1 from below as a return less than -100% makes no sense conceptually. Before we can adjust the returns, we have to connect to our database and load the table `factors_ff3_monthly`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, rf) |>\n collect()\n\ncrsp_monthly <- crsp_monthly |>\n left_join(factors_ff3_monthly,\n join_by(month)\n ) |>\n mutate(\n ret_excess = ret_adj - rf,\n ret_excess = pmax(ret_excess, -1)\n ) |>\n select(-ret_adj, -rf)\n```\n:::\n\n\nSince excess returns and market capitalization are crucial for all our analyses, we can safely exclude all observations with missing returns or market capitalization. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n drop_na(ret_excess, mktcap, mktcap_lag)\n```\n:::\n\n\nFinally, we store the monthly CRSP file in our database. 
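Before writing, it can be reassuring to glance at what we are about to store. The following sketch is only illustrative and uses columns created above; it is not part of the book's pipeline.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Illustrative sanity check of the prepared monthly CRSP data\ncrsp_monthly |>\n  summarize(\n    securities = n_distinct(permno),\n    first_month = min(month),\n    last_month = max(month)\n  )\n```\n:::\n\n\nThe following call then writes the table to our SQLite database.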
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"crsp_monthly\",\n value = crsp_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\n## First Glimpse of the CRSP Sample\n\nBefore we move on to other data sources, let us look at some descriptive statistics of the CRSP sample, which is our main source for stock returns. \n\n@fig-211 shows the monthly number of securities by listing exchange over time. NYSE has the longest history in the data, but NASDAQ lists a considerably large number of stocks. The number of stocks listed on AMEX decreased steadily over the last couple of decades. By the end of 2022, there were 2,778 stocks with a primary listing on NASDAQ, 1,358 on NYSE, 162 on AMEX, and only one belonged to the other category. \\index{Exchange!NYSE}\\index{Exchange!AMEX}\\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n count(exchange, date) |>\n ggplot(aes(x = date, y = n, color = exchange, linetype = exchange)) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly number of securities by listing exchange\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Number of stocks in the CRSP sample listed at each of the US exchanges.](wrds-crsp-and-compustat_files/figure-html/fig-211-1.png){#fig-211 fig-alt='Title: Monthly number of securities by listing exchange. The figure shows a line chart with the number of securities by listing exchange from 1960 to 2022. In the earlier period, NYSE dominated as a listing exchange. There is a strong upwards trend for NASDAQ. Other listing exchanges do only play a minor role.' width=2100}\n:::\n:::\n\n\nNext, we look at the aggregate market capitalization grouped by the respective listing exchanges in @fig-212. To ensure that we look at meaningful data which is comparable over time, we adjust the nominal values for inflation. In fact, we can use the tables that are already in our database to calculate aggregate market caps by listing exchange and plotting it just as if they were in memory. All values in @fig-212 are at the end of 2022 USD to ensure intertemporal comparability. NYSE-listed stocks have by far the largest market capitalization, followed by NASDAQ-listed stocks.\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntbl(tidy_finance, \"crsp_monthly\") |>\n left_join(tbl(tidy_finance, \"cpi_monthly\"), join_by(month)) |>\n group_by(month, exchange) |>\n summarize(\n mktcap = sum(mktcap, na.rm = TRUE) / cpi,\n .groups = \"drop\"\n ) |>\n collect() |>\n mutate(month = ymd(month)) |>\n ggplot(aes(\n x = month, y = mktcap / 1000,\n color = exchange, linetype = exchange\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly market cap by listing exchange in billions of Dec 2022 USD\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Market capitalization is measured in billion USD, adjusted for consumer price index changes such that the values on the horizontal axis reflect the buying power of billion USD in December 2022.](wrds-crsp-and-compustat_files/figure-html/fig-212-1.png){#fig-212 fig-alt='Title: Monthly market cap by listing exchange in billion USD as of Dec 2022. 
The figure shows a line chart of the total market capitalization of all stocks aggregated by the listing exchange from 1960 to 2022, with years on the horizontal axis and the corresponding market capitalization on the vertical axis. Historically, NYSE listed stocks had the highest market capitalization. In the more recent past, the valuation of NASDAQ listed stocks exceeded that of NYSE listed stocks.' width=2100}\n:::\n:::\n\n\nOf course, performing the computation in the database is not really meaningful because we can easily pull all the required data into our memory. The code chunk above is slower than performing the same steps on tables that are already in memory. However, we just want to illustrate that you can perform many things in the database before loading the data into your memory. Before we proceed, we load the monthly CPI data.\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncpi_monthly <- tbl(tidy_finance, "cpi_monthly") |>\n  collect()\n```\n:::\n\n\nNext, we look at the same descriptive statistics by industry. @fig-213 plots the number of stocks in the sample for each of the SIC industry classifiers. For most of the sample period, the largest share of stocks is in manufacturing, although the number peaked in the 1990s. The number of firms associated with public administration seems to be the only category on the rise in recent years, even surpassing manufacturing at the end of our sample period.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_industry <- crsp_monthly |>\n  left_join(cpi_monthly, join_by(month)) |>\n  group_by(month, industry) |>\n  summarize(\n    securities = n_distinct(permno),\n    mktcap = sum(mktcap) / mean(cpi),\n    .groups = "drop"\n  )\n\ncrsp_monthly_industry |>\n  ggplot(aes(\n    x = month,\n    y = securities,\n    color = industry,\n    linetype = industry\n  )) +\n  geom_line() +\n  labs(\n    x = NULL, y = NULL, color = NULL, linetype = NULL,\n    title = "Monthly number of securities by industry"\n  ) +\n  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +\n  scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Number of stocks in the CRSP sample associated with different industries.](wrds-crsp-and-compustat_files/figure-html/fig-213-1.png){#fig-213 fig-alt='Title: Monthly number of securities by industry. The figure shows a line chart of the number of securities by industry from 1960 to 2022 with years on the horizontal axis and the corresponding number on the vertical axis. Except for stocks that are assigned to the industry public administration, the number of listed stocks decreased steadily at least since 1996. As of 2022, the segment of firms within public administration is the largest in terms of the number of listed stocks.' width=2100}\n:::\n:::\n\n\nWe also compute the market cap of all stocks belonging to the respective industries and show the evolution over time in @fig-214. All values are again in billions of end-of-2022 USD. At all points in time, manufacturing firms comprise the largest portion of market capitalization. 
Toward the end of the sample, however, financial firms and services begin to make up a substantial portion of the market cap.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_industry |>\n ggplot(aes(\n x = month,\n y = mktcap / 1000,\n color = industry,\n linetype = industry\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly total market cap by industry in billions as of Dec 2022 USD\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Market capitalization is measured in billion USD, adjusted for consumer price index changes such that the values on the y-axis reflect the buying power of billion USD in December 2022.](wrds-crsp-and-compustat_files/figure-html/fig-214-1.png){#fig-214 fig-alt='Title: Monthly total market cap by industry in billions as of Dec 2022 USD. The figure shows a line chart of total market capitalization of all stocks in the CRSP sample aggregated by industry from 1960 to 2022 with years on the horizontal axis and the corresponding market capitalization on the vertical axis. Stocks in the manufacturing sector have always had the highest market valuation. The figure shows a general upwards trend during the most recent past. ' width=2100}\n:::\n:::\n\n\n## Daily CRSP Data\n\nBefore we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session. \n\nThere is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easier to handle.\\index{Big data} That is, instead of downloading data about all stocks at once, download the data in small batches of stocks consecutively. Such operations can be implemented in `for()`-loops,\\index{For-loops} where we download, prepare, and store the data for a small number of stocks in each iteration. This operation might nonetheless take around 5 minutes, depending on your internet connection. To keep track of the progress, we create ad-hoc progress updates using `cat()`. Notice that we also use the function `dbWriteTable()` here with the option to append the new data to an existing table, when we process the second and all following batches. \n\nAlso notice that we adjust the returns for delistings in a different manner than for monthly returns: \n\n1. If a return is recorded in CRSP daily for a delisting date `dlstdt`, then we replace it with the delisting return `dlret`. \n2. If no return is recorded in CRSP daily for an existing delisting date, then we add a new row with the delisting return. \n3. We ensure that no returns after the delisting date `dlstdt` remain in the final table. 
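\n\nBefore running the full loop, the following toy example illustrates the index arithmetic that splits the vector of permnos into batches. The numbers are made up and the snippet is not part of the book's pipeline; it only mirrors the indexing used in the loop below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy illustration of the batching logic (hypothetical vector of 1,050 IDs)\npermnos_toy <- 1:1050\nbatch_size <- 500\nbatches <- ceiling(length(permnos_toy) / batch_size)\n\n# First and last index covered by each batch j\nsapply(seq_len(batches), function(j) {\n  idx <- ((j - 1) * batch_size + 1):min(j * batch_size, length(permnos_toy))\n  range(idx)\n})\n```\n:::\n\n\nThe actual loop below applies the same indexing to the full vector of permnos from our monthly CRSP data.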
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndsf_db <- tbl(wrds, in_schema(\"crsp\", \"dsf\"))\n\nfactors_ff3_daily <- tbl(tidy_finance, \"factors_ff3_daily\") |>\n collect()\n\npermnos <- tbl(tidy_finance, \"crsp_monthly\") |>\n distinct(permno) |>\n pull()\n\nbatch_size <- 500\nbatches <- ceiling(length(permnos) / batch_size)\n\nfor (j in 1:batches) {\n \n permno_batch <- permnos[\n ((j - 1) * batch_size + 1):min(j * batch_size, length(permnos))\n ]\n\n crsp_daily_sub <- dsf_db |>\n filter(permno %in% permno_batch &\n date >= start_date & date <= end_date) |>\n select(permno, date, ret) |>\n collect() |>\n drop_na()\n\n if (nrow(crsp_daily_sub) > 0) {\n \n msedelist_sub <- msedelist_db |> \n filter(permno %in% permno_batch) |> \n select(permno, dlstdt, dlret) |> \n collect() |>\n drop_na()\n \n crsp_daily_sub <- crsp_daily_sub |> \n left_join(msedelist_sub, join_by(permno, date == dlstdt)) |> \n bind_rows(msedelist_sub |> \n anti_join(crsp_daily_sub, \n join_by(permno, dlstdt == date))) |> \n mutate(ret = if_else(!is.na(dlret), dlret, ret),\n date = if_else(!is.na(dlstdt), dlstdt, date)) |> \n select(-c(dlret, dlstdt)) |> \n left_join(msedelist_sub |> \n select(permno, dlstdt), join_by(permno)) |> \n mutate(dlstdt = replace_na(dlstdt, end_date)) |> \n filter(date <= dlstdt) |> \n select(-dlstdt)\n \n crsp_daily_sub <- crsp_daily_sub |>\n mutate(month = floor_date(date, \"month\")) |>\n left_join(factors_ff3_daily |>\n select(date, rf), join_by(date)) |>\n mutate(\n ret_excess = ret - rf,\n ret_excess = pmax(ret_excess, -1)\n ) |>\n select(permno, date, month, ret, ret_excess)\n\n dbWriteTable(tidy_finance,\n \"crsp_daily\",\n value = crsp_daily_sub,\n overwrite = ifelse(j == 1, TRUE, FALSE),\n append = ifelse(j != 1, TRUE, FALSE)\n )\n }\n\n cat(\"Batch\", j, \"out of\", batches, \"done (\", percent(j / batches), \")\\n\")\n}\n```\n:::\n\n\nEventually, we end up with more than 71 million rows of daily return data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data that we actually use. \n\n## Preparing Compustat data\n\nFirm accounting data are an important source of information that we use in portfolio analyses in subsequent chapters. 
The commonly used source for firm financial information is Compustat provided by [S&P Global Market Intelligence,](https://www.spglobal.com/marketintelligence/en/) which is a global data vendor that provides financial, statistical, and market information on active and inactive companies throughout the world.\\index{Data!Compustat} For US and Canadian companies, annual history is available back to 1950 and quarterly as well as monthly histories date back to 1962.\n\nTo access Compustat data, we can again tap WRDS, which hosts the `funda` table that contains annual firm-level information on North American companies.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfunda_db <- tbl(wrds, in_schema(\"comp\", \"funda\"))\n```\n:::\n\n\nWe follow the typical filter conventions and pull only data that we actually need: (i) we get only records in industrial data format, (ii) in the standard format (i.e., consolidated information in standard presentation), and (iii) only data in the desired time window.\\index{Gvkey}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- funda_db |>\n filter(\n indfmt == \"INDL\" &\n datafmt == \"STD\" &\n consol == \"C\" &\n datadate >= start_date & datadate <= end_date\n ) |>\n select(\n gvkey, # Firm identifier\n datadate, # Date of the accounting data\n seq, # Stockholders' equity\n ceq, # Total common/ordinary equity\n at, # Total assets\n lt, # Total liabilities\n txditc, # Deferred taxes and investment tax credit\n txdb, # Deferred taxes\n itcb, # Investment tax credit\n pstkrv, # Preferred stock redemption value\n pstkl, # Preferred stock liquidating value\n pstk, # Preferred stock par value\n capx, # Capital investment\n oancf, # Operating cash flow\n sale, # Revenue\n cogs, # Costs of goods sold\n xint, # Interest expense\n xsga # Selling, general, and administrative expenses\n ) |>\n collect()\n```\n:::\n\n\nNext, we calculate the book value of preferred stock and equity `be` and the operating profitability `op` inspired by the [variable definitions in Ken French's data library.](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/variable_definitions.html) Note that we set negative or zero equity to missing which is a common practice when working with book-to-market ratios [see @Fama1992 for details].\\index{Book equity}\\index{Preferred stock}\\index{Operating profitability}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |>\n mutate(\n be = coalesce(seq, ceq + pstk, at - lt) +\n coalesce(txditc, txdb + itcb, 0) -\n coalesce(pstkrv, pstkl, pstk, 0),\n be = if_else(be <= 0, NA, be),\n op = (sale - coalesce(cogs, 0) - \n coalesce(xsga, 0) - coalesce(xint, 0)) / be,\n )\n```\n:::\n\n\nWe keep only the last available information for each firm-year group. Note that `datadate` defines the time the corresponding financial data refers to (e.g., annual report as of December 31, 2022). Therefore, `datadate` is not the date when data was made available to the public. Check out the exercises for more insights into the peculiarities of `datadate`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |>\n mutate(year = year(datadate)) |>\n group_by(gvkey, year) |>\n filter(datadate == max(datadate)) |>\n ungroup()\n```\n:::\n\n\nWe also compute the investment ratio `inv` according to Ken French's variable definitions as the change in total assets from one fiscal year to another. 
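In symbols, writing $AT_t$ for total assets in fiscal year $t$ (our shorthand, not a Compustat mnemonic), this is\n\n$$\n\\text{inv}_t = \\frac{AT_t}{AT_{t-1}} - 1.\n$$\n\n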
Note that we again use the join-based approach introduced with the CRSP data above to construct lagged assets.\\index{Investment ratio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |> \n  left_join(\n    compustat |> \n      select(gvkey, year, at_lag = at) |> \n      mutate(year = year + 1), \n    join_by(gvkey, year)\n  ) |> \n  mutate(\n    inv = at / at_lag - 1,\n    inv = if_else(at_lag <= 0, NA, inv)\n  )\n```\n:::\n\n\nWith the last step, we are already done preparing the firm fundamentals. Thus, we can store them in our local database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n  "compustat",\n  value = compustat,\n  overwrite = TRUE\n)\n```\n:::\n\n\n## Merging CRSP with Compustat\n\nUnfortunately, CRSP and Compustat use different keys to identify stocks and firms. CRSP uses `permno` for stocks, while Compustat uses `gvkey` to identify firms. Fortunately, a curated matching table on WRDS allows us to merge CRSP and Compustat, so we create a connection to the *CRSP-Compustat Merged* table (provided by CRSP).\\index{Data!Crsp-Compustat Merged}\\index{Permno}\\index{Gvkey}\\index{Data!Linking table}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nccmxpf_linktable_db <- tbl(\n  wrds,\n  in_schema("crsp", "ccmxpf_linktable")\n)\n```\n:::\n\n\nThe linking table contains links between CRSP and Compustat identifiers from various approaches. However, we need to make sure that we keep only relevant and correct links, again following the description outlined in @BaliEngleMurray2016. Note also that currently active links have no end date, so we just enter the current date via `today()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nccmxpf_linktable <- ccmxpf_linktable_db |>\n  filter(linktype %in% c("LU", "LC") &\n    linkprim %in% c("P", "C") &\n    usedflag == 1) |>\n  select(permno = lpermno, gvkey, linkdt, linkenddt) |>\n  collect() |>\n  mutate(linkenddt = replace_na(linkenddt, today()))\n```\n:::\n\n\nWe use these links to create a new table with a mapping between stock identifier, firm identifier, and month. We then use this mapping to add the Compustat `gvkey` to our monthly stock data. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nccm_links <- crsp_monthly |>\n  inner_join(ccmxpf_linktable, \n    join_by(permno), relationship = "many-to-many") |>\n  filter(!is.na(gvkey) & \n    (date >= linkdt & date <= linkenddt)) |>\n  select(permno, gvkey, date)\n\ncrsp_monthly <- crsp_monthly |>\n  left_join(ccm_links, join_by(permno, date))\n```\n:::\n\n\nAs the last step, we update the previously prepared monthly CRSP file with the linking information in our local database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n  "crsp_monthly",\n  value = crsp_monthly,\n  overwrite = TRUE\n)\n```\n:::\n\n\nBefore we close this chapter, let us look at an interesting descriptive statistic of our data. As the book value of equity plays a crucial role in many asset pricing applications, it is interesting to know for how many of our stocks this information is available. Hence, @fig-215 plots the share of securities with book equity values for each exchange. It turns out that the coverage is pretty bad for AMEX- and NYSE-listed stocks in the 1960s but hovers around 80% for all periods thereafter. 
We can ignore the erratic coverage of securities that belong to the other category since there are only a handful of them in our sample anyway.\\index{Exchange!NYSE}\\index{Exchange!AMEX}\\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n  group_by(permno, year = year(month)) |>\n  filter(date == max(date)) |>\n  ungroup() |>\n  left_join(compustat, join_by(gvkey, year)) |>\n  group_by(exchange, year) |>\n  summarize(\n    share = n_distinct(permno[!is.na(be)]) / n_distinct(permno),\n    .groups = "drop"\n  ) |>\n  ggplot(aes(\n    x = year, \n    y = share, \n    color = exchange,\n    linetype = exchange\n  )) +\n  geom_line() +\n  labs(\n    x = NULL, y = NULL, color = NULL, linetype = NULL,\n    title = "Share of securities with book equity values by exchange"\n  ) +\n  scale_y_continuous(labels = percent) +\n  coord_cartesian(ylim = c(0, 1))\n```\n\n::: {.cell-output-display}\n![End-of-year share of securities with book equity values by listing exchange.](wrds-crsp-and-compustat_files/figure-html/fig-215-1.png){#fig-215 fig-alt='Title: Share of securities with book equity values by exchange. The figure shows a line chart of end-of-year shares of securities with book equity values by exchange from 1960 to 2022 with years on the horizontal axis and the corresponding share on the vertical axis. After an initial period with lower coverage in the early 1960s, typically, more than 80 percent of the entries in the CRSP sample have information about book equity values from Compustat.' width=2100}\n:::\n:::\n\n\n## Some Tricks for PostgreSQL Databases\n\nAs we mentioned above, the WRDS database runs on PostgreSQL rather than SQLite. Finding the right tables for your data needs can be tricky in the WRDS PostgreSQL instance, as the tables are organized in schemas.\\index{Database!Schema} If you wonder what the purpose of schemas is, check out [this documentation.](https://www.postgresql.org/docs/9.1/ddl-schemas.html) For instance, if you want to find all tables that live in the `crsp` schema, you run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds, Id(schema = "crsp"))\n```\n:::\n\n\nThis operation returns a list of all tables that belong to the `crsp` family on WRDS, e.g., ` schema = crsp, table = msenames`. Similarly, you can fetch a list of all tables that belong to the `comp` family via\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds, Id(schema = "comp"))\n```\n:::\n\n\nIf you want to get all schemas, then run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds)\n```\n:::\n\n\n## Exercises\n\n1. Check out the structure of the WRDS database by sending queries in the spirit of ["Querying WRDS Data using R"](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/) and verify the output with `dbListObjects()`. How many tables are associated with CRSP? Can you identify what is stored within `msp500`?\n1. Compute `mkt_cap_lag` using `lag(mktcap)` rather than using joins as above. Filter out all the rows where the lag-based market capitalization measure is different from the one we computed above. Why are the two measures different?\n1. Plot the average market capitalization of firms for each exchange and industry, respectively, over time. What do you find?\n1. In the `compustat` table, `datadate` refers to the date to which a firm's fiscal year refers. Count the number of observations in Compustat by `month` of this date variable. What do you find? 
## Exercises\n\n1. Check out the structure of the WRDS database by sending queries in the spirit of [\"Querying WRDS Data using R\"](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/) and verify the output with `dbListObjects()`. How many tables are associated with CRSP? Can you identify what is stored within `msp500`?\n1. Compute `mkt_cap_lag` using `lag(mktcap)` rather than using joins as above. Filter out all the rows where the lag-based market capitalization measure is different from the one we computed above. Why are the two measures different?\n1. Plot the average market capitalization of firms for each exchange and industry, respectively, over time. What do you find?\n1. In the `compustat` table, `datadate` refers to the date to which the fiscal year of the corresponding firm refers. Count the number of observations in Compustat by `month` of this date variable. What do you find? What does the finding suggest about pooling observations with the same fiscal year?\n1. Go back to the original Compustat data in `funda_db` and extract the observations where the same firm has multiple rows for the same fiscal year. What is the reason for these observations?\n1. Keep the last observation of `crsp_monthly` by year and join it with the `compustat` table. Create the following plots: (i) aggregate book equity by exchange over time and (ii) aggregate annual book equity by industry over time. Do you notice any patterns that differ from the corresponding plots based on market capitalization?\n1. Repeat the analysis of market capitalization for book equity, which we computed from the Compustat data. Then, use the matched sample to plot book equity against market capitalization. How are these two variables related?", "supporting": [ "wrds-crsp-and-compustat_files" ], diff --git a/_freeze/r/wrds-crsp-and-compustat/figure-html/fig-215-1.png b/_freeze/r/wrds-crsp-and-compustat/figure-html/fig-215-1.png index 793e23fd..4de5676f 100644 Binary files a/_freeze/r/wrds-crsp-and-compustat/figure-html/fig-215-1.png and b/_freeze/r/wrds-crsp-and-compustat/figure-html/fig-215-1.png differ diff --git a/docs/r/accessing-and-managing-financial-data.html b/docs/r/accessing-and-managing-financial-data.html index 492dee80..c43999eb 100644 --- a/docs/r/accessing-and-managing-financial-data.html +++ b/docs/r/accessing-and-managing-financial-data.html @@ -830,8 +830,8 @@

Managing SQLite [3] "cpi_monthly" "crsp_daily" [5] "crsp_monthly" "factors_ff3_daily" [7] "factors_ff3_monthly" "factors_ff5_monthly" - [9] "factors_q_monthly" "industries_ff_monthly" -[11] "macro_predictors" "mergent" + [9] "factors_q_monthly" "fisd" +[11] "industries_ff_monthly" "macro_predictors" [13] "trace_enhanced" diff --git a/docs/r/beta-estimation.html b/docs/r/beta-estimation.html index cdb0cca9..b3d19a29 100644 --- a/docs/r/beta-estimation.html +++ b/docs/r/beta-estimation.html @@ -569,7 +569,7 @@

Esti collect() crsp_monthly <- crsp_monthly |> - left_join(factors_ff3_monthly, by = "month") + left_join(factors_ff3_monthly, join_by(month))

To estimate the CAPM regression coefficients
\[ @@ -674,7 +674,7 @@

Rolling-Window E

It is actually quite simple to perform the rolling-window estimation for an arbitrary number of stocks, which we visualize in the following code chunk and the resulting Figure 1.

beta_examples <- crsp_monthly |>
-  inner_join(examples, by = "permno") |>
+  inner_join(examples, join_by(permno)) |>
   group_by(permno) |>
   mutate(roll_capm_estimation(pick(everything()), months = 60, min_obs = 48)) |>
   ungroup() |>
@@ -693,7 +693,7 @@ 

Rolling-Window E title = "Monthly beta estimates for example stocks using 5 years of data" )

-
+
Title: Monthly beta estimates for example stocks using 5 years of data. The figure shows a time series of beta estimates based on 5 years of monthly data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimated betas vary over time and across varies but are always positive for each stock.Figure 1: The CAPM betas are estimated with monthly data and a rolling window of length 5 years based on adjusted excess returns from CRSP. We use market excess returns from Kenneth French data library. @@ -716,13 +716,13 @@

Par crsp_monthly_nested

# A tibble: 30,384 × 3
-  permno industry      data              
-   <dbl> <chr>         <list>            
-1  10000 Manufacturing <tibble [16 × 3]> 
-2  10001 Utilities     <tibble [378 × 3]>
-3  10057 Manufacturing <tibble [437 × 3]>
-4  10042 Mining        <tibble [264 × 3]>
-5  10043 Services      <tibble [159 × 3]>
+  permno industry      data             
+   <dbl> <chr>         <list>           
+1  10066 Services      <tibble [35 × 3]>
+2  10067 Manufacturing <tibble [12 × 3]>
+3  10068 Finance       <tibble [8 × 3]> 
+4  10069 Retail        <tibble [10 × 3]>
+5  10070 Manufacturing <tibble [56 × 3]>
 # ℹ 30,379 more rows
@@ -734,7 +734,7 @@

Par

Next, we want to apply the roll_capm_estimation() function to each stock. This situation is an ideal use case for map(), which takes a list or vector as input and returns an object of the same length as the input. In our case, map() returns a single data frame with a time series of beta estimates for each stock. Therefore, we use unnest() to transform the list of outputs to a tidy data frame.

crsp_monthly_nested |>
-  inner_join(examples, by = "permno") |>
+  inner_join(examples, join_by(permno)) |>
   mutate(beta = map(
     data,
     ~ roll_capm_estimation(., months = 60, min_obs = 48)
@@ -788,7 +788,7 @@ 

Estima

We make sure to keep only relevant data to save memory space. However, note that your machine might not have enough memory to read the whole daily CRSP sample. In this case, we refer you to the exercises and try working with loops as in WRDS, CRSP, and Compustat.

crsp_daily <- crsp_daily |>
-  inner_join(factors_ff3_daily, by = "date") |>
+  inner_join(factors_ff3_daily, join_by(date)) |>
   select(permno, month, ret_excess, mkt_excess)

Just like above, we nest the data by permno for parallelization.

@@ -799,7 +799,7 @@

Estima

This is what the estimation looks like for a couple of examples using map(). For the daily data, we use the same function as above but only take 3 months of data and require at least 50 daily return observations in these months. These restrictions help us to retrieve somewhat smooth coefficient estimates.

crsp_daily_nested |>
-  inner_join(examples, by = "permno") |>
+  inner_join(examples, join_by(permno)) |>
   mutate(beta_daily = map(
     data,
     ~ roll_capm_estimation(., months = 3, min_obs = 50)
@@ -839,7 +839,7 @@ 

Comparing Beta Es

What is a typical value for stock betas? To get some feeling, we illustrate the dispersion of the estimated \(\hat\beta_i\) across different industries and across time below. Figure 2 shows that typical business models across industries imply different exposure to the general market economy. However, there are barely any firms that exhibit a negative exposure to the market factor.

crsp_monthly |>
-  left_join(beta_monthly, by = c("permno", "month")) |>
+  left_join(beta_monthly, join_by(permno, month)) |>
   drop_na(beta_monthly) |>
   group_by(industry, permno) |>
   summarize(beta = mean(beta_monthly), 
@@ -852,7 +852,7 @@ 

Comparing Beta Es title = "Firm-specific beta distributions by industry" )

-
+
Title: Firm-specific beta distributions by industry. The figure shows box plots for each industry. Firms with the highest average CAPM beta belong to the public administration industry. Firms from the utility sector have the lowest average CAPM beta. The figure indicates very few outliers with negative CAPM betas. The large majority of all stocks has CAPM betas between 0.5 and 1.5.Figure 2: The box plots show the average firm-specific beta estimates by industry. @@ -885,7 +885,7 @@

Comparing Beta Es title = "Monthly deciles of estimated betas", )

-
+
Title: Monthly deciles of estimated betas. The figure shows time series of deciles of estimated betas to illustrate the distribution of betas over time. The top 10 percent quantile on average is around 2 but varies substantially over time. The lowest 10 percent quantile is around 0.4 on average but is highly correlated with the top quantile such that in general CAPM market betas seem to go up and down jointly.Figure 3: Each line corresponds to the monthly cross-sectional quantile of the estimated CAPM beta. @@ -900,11 +900,11 @@

Comparing Beta Es

To compare the difference between daily and monthly data, we combine beta estimates to a single table. Then, we use the table to plot a comparison of beta estimates for our example stocks in Figure 4.

beta <- beta_monthly |>
-  full_join(beta_daily, by = c("permno", "month")) |>
+  full_join(beta_daily, join_by(permno, month)) |>
   arrange(permno, month)
 
 beta |>
-  inner_join(examples, by = "permno") |>
+  inner_join(examples, join_by(permno)) |>
   pivot_longer(cols = c(beta_monthly, beta_daily)) |>
   drop_na() |>
   ggplot(aes(
@@ -920,7 +920,7 @@ 

Comparing Beta Es title = "Comparison of beta estimates using monthly and daily data" )

-
+
Title: Comparison of beta estimates using monthly and daily data. The figure shows a time series of beta estimates using 5 years of monthly versus 3 years of daily data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimates based on longer periods of monthly data are smooth relative to the estimates based on daily data. However, the general trend and level is similar, irrespective of the choice of frequency.Figure 4: CAPM betas are computed using 5 years of monthly or 3 months of daily data. The two lines show the monthly estimates based on a rolling window for few exemplary stocks. @@ -944,7 +944,7 @@

Comparing Beta Es

Whenever you perform some kind of estimation, it also makes sense to do rough plausibility tests. A possible check is to plot the share of stocks with beta estimates over time. This descriptive helps us discover potential errors in our data preparation or estimation procedure. For instance, suppose there was a gap in our output where we do not have any betas. In this case, we would have to go back and check all previous steps to find out what went wrong.

beta_long <- crsp_monthly |>
-  left_join(beta, by = c("permno", "month")) |>
+  left_join(beta, join_by(permno, month)) |>
   pivot_longer(cols = c(beta_monthly, beta_daily))
 
 beta_long |>
@@ -965,7 +965,7 @@ 

Comparing Beta Es ) + coord_cartesian(ylim = c(0, 1))

-
+
Title: End-of-month share of securities with beta estimates. The figure shows two time series with end-of-year shares of securities with beta estimates using 5 years of monthly or 3 months of daily data. There is almost no missing data for the estimates based on daily data. For the beta estimates based on monthly data, around 75 percent of all stock-month combinations provide sufficient long historical periods to estimate the  beta.Figure 5: The two lines show the share of securities with beta estimates using 5 years of monthly or 3 months of daily data. @@ -1479,7 +1479,7 @@

Exercises

-