diff --git a/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json b/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json index 4fa6646a..f4920b89 100644 --- a/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json +++ b/_freeze/r/accessing-and-managing-financial-data/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "a7081974449810c632425178ddd3ae0a", "result": { "engine": "knitr", - "markdown": "---\ntitle: Accessing and Managing Financial Data\naliases: \n - ../accessing-and-managing-financial-data.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Accessing and Managing Financial Data with R\n description-meta: Download and organize open-source financial data using the programming language R. \n---\n\n\nIn this chapter, we suggest a way to organize your financial data. Everybody, who has experience with data, is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome in the case of using different data formats, both across different projects and across different programming languages. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages. \n\nThis chapter shows how to import different open source data sets. Specifically, our data comes from the application programming interface (API) of Yahoo!Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series.\\index{API} We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the global R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(scales)\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2022-12-31\")\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. Fortunately, there is a neat package by [Nelson Areal](https://github.com/nareal/frenchdata/) that allows us to access the data easily: the `frenchdata` package provides functions to download and read data sets from [Prof. Kenneth French finance data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) [@frenchdata].\\index{Data!Fama-French factors} \\index{Kenneth French homepage}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(frenchdata)\n```\n:::\n\n\nWe can use the `download_french_data()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market `mkt_excess`, size `smb` and value `hml` alongside the risk-free rates `rf`. 
Note that we have to do some manual work to correctly parse all the columns and scale them appropriately, as the raw Fama-French data comes in a very unpractical data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to `frenchdata`.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_raw <- download_french_data(\"Fama/French 3 Factors\")\nfactors_ff3_monthly <- factors_ff3_monthly_raw$subsets$data[[1]] |>\n mutate(\n month = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability `rmw` and investment `cma` factors. We demonstrate how the monthly factors are constructed in the chapter [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n \n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff5_monthly_raw <- download_french_data(\"Fama/French 5 Factors (2x3)\")\n\nfactors_ff5_monthly <- factors_ff5_monthly_raw$subsets$data[[1]] |>\n mutate(\n month = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML, RMW, CMA), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily_raw <- download_french_data(\"Fama/French 3 Factors [Daily]\")\n\nfactors_ff3_daily <- factors_ff3_daily_raw$subsets$data[[1]] |>\n mutate(\n date = ymd(date),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIn a subsequent chapter, we also use the 10 monthly industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindustries_ff_monthly_raw <- download_french_data(\"10 Industry Portfolios\")\n\nindustries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>\n mutate(month = floor_date(ymd(str_c(date, \"01\")), \"month\")) |>\n mutate(across(where(is.numeric), ~ . / 100)) |>\n select(month, everything(), -date) |>\n filter(month >= start_date & month <= end_date) |> \n rename_with(str_to_lower)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `get_french_data_list()`. For an alternative to download Fama-French data, check out the `FFdownload` package by [Sebastian Stöckl](https://github.com/sstoeckl/ffdownload).\n\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. 
We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q* factors can be downloaded directly from the authors' homepage from within `read_csv()`.\\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we rename the columns with the \"R_\"-prescript using regular expressions and write all column names in lowercase. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_q_monthly_link <-\n \"https://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2022.csv\"\n\nfactors_q_monthly <- read_csv(factors_q_monthly_link) |>\n mutate(month = ymd(str_c(year, month, \"01\", sep = \"-\"))) |>\n select(-R_F, -R_MKT, -year) |>\n rename_with(~ str_remove(., \"R_\")) |>\n rename_with(~ str_to_lower(.)) |>\n mutate(across(-month, ~ . / 100)) |>\n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data updated to 2022 on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) Since the data is an XLSX-file stored on a public Google drive location, we need additional packages to access the data directly from our R session. Therefore, we load `readxl` to read the XLSX-file [@readxl] and `googledrive` for the Google drive connection [@googledrive].\\index{Data!Macro predictors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(readxl)\nlibrary(googledrive)\n```\n:::\n\n\nUsually, you need to authenticate if you interact with Google drive directly in R. Since the data is stored via a public link, we can proceed without any authentication.\\index{Google Drive}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndrive_deauth()\n```\n:::\n\n\nThe `drive_download()` function from the `googledrive` package allows us to download the data and store it locally.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors_link <-\n \"https://docs.google.com/spreadsheets/d/1g4LOaRj4TvwJr9RIaA_nwrXXWTOy46bP\"\n\ndrive_download(\n macro_predictors_link,\n path = \"macro_predictors.xlsx\"\n)\n```\n:::\n\n\nNext, we read in the new data and transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006]. \n1. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978]. \n1. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988]. \n1. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998]. \n1. 
Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n1. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997] \n1. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n1. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n1. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n1. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989]. \n1. Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\t\t\t\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website](https://sites.google.com/view/agoyal145).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors <- read_xlsx(\n \"macro_predictors.xlsx\",\n sheet = \"Monthly\"\n) |>\n mutate(month = ym(yyyymm)) |>\n mutate(across(where(is.character), as.numeric)) |>\n mutate(\n IndexDiv = Index + D12,\n logret = log(IndexDiv) - log(lag(IndexDiv)),\n Rfree = log(Rfree + 1),\n rp_div = lead(logret - Rfree, 1), # Future excess market return\n dp = log(D12) - log(Index), # Dividend Price ratio\n dy = log(D12) - log(lag(Index)), # Dividend yield\n ep = log(E12) - log(Index), # Earnings price ratio\n de = log(D12) - log(E12), # Dividend payout ratio\n tms = lty - tbl, # Term spread\n dfy = BAA - AAA # Default yield spread\n ) |>\n select(month, rp_div, dp, dy, ep, de, svar,\n bm = `b/m`, ntis, tbl, lty, ltr,\n tms, dfy, infl\n ) |>\n filter(month >= start_date & month <= end_date) |>\n drop_na()\n```\n:::\n\n\nFinally, after reading in the macro predictors to our memory, we remove the raw data file from our temporary storage. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(\"macro_predictors.xlsx\")\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. As an illustration, we use the already familiar `tidyquant` package to fetch consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS) key.\\index{Data!FRED}\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyquant)\n\ncpi_monthly <- tq_get(\"CPIAUCNS\",\n get = \"economic.data\",\n from = start_date,\n to = end_date\n) |>\n mutate(\n month = floor_date(date, \"month\"),\n cpi = price / price[month == max(month)],\n .keep = \"none\"\n )\n```\n:::\n\n\nTo download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. 
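As a quick illustration, downloading this series simply reuses the `tq_get()` call from the CPI example above; the object name `ppi_gold` is arbitrary, and the series is not used further in this book.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Illustration only: producer price index for gold ores from FRED,\n# following the same tq_get() pattern as the CPI series above\nppi_gold <- tq_get(\"PCU2122212122210\",\n get = \"economic.data\",\n from = start_date,\n to = end_date\n)\n```\n:::\n\n\n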
The `tidyquant` package provides access to around 10,000 time series of the FRED database. If your desired time series is not included, we recommend working with the `fredr` package [@fredr]. Note that you need to get an API key to use its functionality. We refer to the package documentation for details. \n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our R session let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code. \n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://www.sqlite.org/index.html) database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases and heavily inspired the `dplyr` functions. We refer to [this tutorial](https://www.w3schools.com/sql/sql_intro.asp) for more information on SQL.\\index{Database!SQLite}\n\nThere are two packages that make working with SQLite in R very simple: `RSQLite` [@RSQLite] embeds the SQLite database engine in R, and `dbplyr` [@dbplyr] is the database back-end for `dplyr`. These packages allow to set up a database to remotely store tables and use these remote database tables as if they are in-memory data frames by automatically converting `dplyr` into SQL. Check out the [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html) and [`dbplyr`](https://db.rstudio.com/databases/sqlite/) vignettes for more information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nAn SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that we use the `extended_types=TRUE` option to enable date types when storing and fetching data. Otherwise, date columns are stored and retrieved as integers.\\index{Database!Creation} We will use the resulting file `tidy_finance_r.sqlite` in the subfolder `data` for all subsequent chapters to retrieve our data. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\nNext, we create a remote table with the monthly Fama-French factor data. We do so with the function `dbWriteTable()`, which copies the data to our SQLite-database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"factors_ff3_monthly\",\n value = factors_ff3_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nWe can use the remote table as an in-memory data frame by building a connection via `tbl()`.\\index{Database!Remote connection}\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db <- tbl(tidy_finance, \"factors_ff3_monthly\")\n```\n:::\n\n\nAll `dplyr` calls are evaluated lazily, i.e., the data is not in our R session's memory, and the database does most of the work. You can see that by noticing that the output below does not show the number of rows. In fact, the following code chunk only fetches the top 10 rows from the database for printing. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(month, rf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Source: SQL [?? x 2]\n# Database: sqlite 3.41.2 [data/tidy_finance_r.sqlite]\n month rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ more rows\n```\n\n\n:::\n:::\n\n\nIf we want to have the whole table in memory, we need to `collect()` it. You will see that we regularly load the data into the memory in the next chapters.\\index{Database!Fetch}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(month, rf) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 756 × 2\n month rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ 751 more rows\n```\n\n\n:::\n:::\n\n\nThe last couple of code chunks is really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages. \n\nBefore we move on to the next data source, let us also store the other five tables in our new SQLite database. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"factors_ff5_monthly\",\n value = factors_ff5_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"factors_ff3_daily\",\n value = factors_ff3_daily,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"industries_ff_monthly\",\n value = industries_ff_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"factors_q_monthly\",\n value = factors_q_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"macro_predictors\",\n value = macro_predictors,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"cpi_monthly\",\n value = cpi_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow three steps: (i) Establish the connection to the SQLite database, (ii) call the table you want to extract, and (iii) collect the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\")\nfactors_q_monthly <- factors_q_monthly |> collect()\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command in the database, which rebuilds the database and frees up unused space. You can execute the command in the database using the `dbSendQuery()` function. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- dbSendQuery(tidy_finance, \"VACUUM\")\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n SQL VACUUM\n ROWS Fetched: 0 [complete]\n Changed: 0\n```\n\n\n:::\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read up in [this tutorial.](https://www.sqlitetutorial.net/sqlite-vacuum/) \\index{Database!Cleaning}\n\nWe store the result of the above query in `res` because the database keeps the result set open. To close open results and avoid warnings going forward, we can use `dbClearResult()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbClearResult(res)\n```\n:::\n\n\nApart from cleaning up, you might be interested in listing all the tables that are currently in your database. You can do this via the `dbListTables()` function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListTables(tidy_finance)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"beta\" \"compustat\" \n [3] \"cpi_monthly\" \"crsp_daily\" \n [5] \"crsp_monthly\" \"factors_ff3_daily\" \n [7] \"factors_ff3_monthly\" \"factors_ff5_monthly\" \n [9] \"factors_q_monthly\" \"industries_ff_monthly\"\n[11] \"macro_predictors\" \"mergent\" \n[13] \"trace_enhanced\" \n```\n\n\n:::\n:::\n\n\nThis function comes in handy if you are unsure about the correct naming of the tables in your database. \n\n## Exercises\n\n1. Download the monthly Fama-French factors manually from [Ken French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `read_csv()`. Validate that you get the same data as via the `frenchdata` package. \n1. Download the daily Fama-French 5 factors using the `frenchdata` package. Use `get_french_data_list()` to find the corresponding table name. After the successful download and conversion to the column format that we used above, compare the `rf`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find. \n", + "markdown": "---\ntitle: Accessing and Managing Financial Data\naliases: \n - ../accessing-and-managing-financial-data.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Accessing and Managing Financial Data with R\n description-meta: Download and organize open-source financial data using the programming language R. \n---\n\n\nIn this chapter, we suggest a way to organize your financial data. Everybody, who has experience with data, is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome in the case of using different data formats, both across different projects and across different programming languages. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages. \n\nThis chapter shows how to import different open source data sets. Specifically, our data comes from the application programming interface (API) of Yahoo!Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series.\\index{API} We store all the data in a *single* database, which serves as the only source of data in subsequent chapters. 
We conclude the chapter by providing some tips on managing databases.\\index{Database}\n\nFirst, we load the global R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(scales)\n```\n:::\n\n\nMoreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2022-12-31\")\n```\n:::\n\n\n## Fama-French Data\n\nWe start by downloading some famous Fama-French factors [e.g., @Fama1993] and portfolio returns commonly used in empirical asset pricing. Fortunately, there is a neat package by [Nelson Areal](https://github.com/nareal/frenchdata/) that allows us to access the data easily: the `frenchdata` package provides functions to download and read data sets from [Prof. Kenneth French finance data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) [@frenchdata].\\index{Data!Fama-French factors} \\index{Kenneth French homepage}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(frenchdata)\n```\n:::\n\n\nWe can use the `download_french_data()` function of the package to download monthly Fama-French factors. The set *Fama/French 3 Factors* contains the return time series of the market `mkt_excess`, size `smb` and value `hml` alongside the risk-free rates `rf`. Note that we have to do some manual work to correctly parse all the columns and scale them appropriately, as the raw Fama-French data comes in a very unpractical data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French's finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to `frenchdata`.\\index{Factor!Market}\\index{Factor!Size}\\index{Factor!Value}\\index{Factor!Profitability}\\index{Factor!Investment}\\index{Risk-free rate}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_raw <- download_french_data(\"Fama/French 3 Factors\")\nfactors_ff3_monthly <- factors_ff3_monthly_raw$subsets$data[[1]] |>\n mutate(\n month = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\nWe also download the set *5 Factors (2x3)*, which additionally includes the return time series of the profitability `rmw` and investment `cma` factors. We demonstrate how the monthly factors are constructed in the chapter [Replicating Fama and French Factors](replicating-fama-and-french-factors.qmd).\n \n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff5_monthly_raw <- download_french_data(\"Fama/French 5 Factors (2x3)\")\n\nfactors_ff5_monthly <- factors_ff5_monthly_raw$subsets$data[[1]] |>\n mutate(\n month = floor_date(ymd(str_c(date, \"01\")), \"month\"),\n across(c(RF, `Mkt-RF`, SMB, HML, RMW, CMA), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |> \n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\nIt is straightforward to download the corresponding *daily* Fama-French factors with the same function. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily_raw <- download_french_data(\"Fama/French 3 Factors [Daily]\")\n\nfactors_ff3_daily <- factors_ff3_daily_raw$subsets$data[[1]] |>\n mutate(\n date = ymd(date),\n across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),\n .keep = \"none\"\n ) |>\n rename_with(str_to_lower) |>\n rename(mkt_excess = `mkt-rf`) |>\n filter(date >= start_date & date <= end_date)\n```\n:::\n\n\nIn a subsequent chapter, we also use the 10 monthly industry portfolios, so let us fetch that data, too.\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindustries_ff_monthly_raw <- download_french_data(\"10 Industry Portfolios\")\n\nindustries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>\n mutate(month = floor_date(ymd(str_c(date, \"01\")), \"month\")) |>\n mutate(across(where(is.numeric), ~ . / 100)) |>\n select(month, everything(), -date) |>\n filter(month >= start_date & month <= end_date) |> \n rename_with(str_to_lower)\n```\n:::\n\n\nIt is worth taking a look at all available portfolio return time series from Kenneth French's homepage. You should check out the other sets by calling `get_french_data_list()`. For an alternative to download Fama-French data, check out the `FFdownload` package by [Sebastian Stöckl](https://github.com/sstoeckl/ffdownload).\n\n## q-Factors\n\nIn recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the @Hou2015 *q*-factor model. We refer to the [extended background](http://global-q.org/background.html) information provided by the original authors for further information. The *q* factors can be downloaded directly from the authors' homepage from within `read_csv()`.\\index{Data!q-factors}\\index{Factor!q-factors}\n\nWe also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we rename the columns with the \"R_\"-prescript using regular expressions and write all column names in lowercase. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on *try*. You can check out style guides available online, e.g., [Hadley Wickham's `tidyverse` style guide.](https://style.tidyverse.org/index.html)\\index{Style guide}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_q_monthly_link <-\n \"https://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2022.csv\"\n\nfactors_q_monthly <- read_csv(factors_q_monthly_link) |>\n mutate(month = ymd(str_c(year, month, \"01\", sep = \"-\"))) |>\n select(-R_F, -R_MKT, -year) |>\n rename_with(~ str_remove(., \"R_\")) |>\n rename_with(~ str_to_lower(.)) |>\n mutate(across(-month, ~ . / 100)) |>\n filter(month >= start_date & month <= end_date)\n```\n:::\n\n\n## Macroeconomic Predictors\n\nOur next data source is a set of macroeconomic variables often used as predictors for the equity premium. @Goyal2008 comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data updated to 2022 on [Amit Goyal's website.](https://sites.google.com/view/agoyal145) Since the data is an XLSX-file stored on a public Google drive location, we need additional packages to access the data directly from our R session. 
Therefore, we load `readxl` to read the XLSX-file [@readxl] and `googledrive` for the Google drive connection [@googledrive].\\index{Data!Macro predictors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(readxl)\nlibrary(googledrive)\n```\n:::\n\n\nUsually, you need to authenticate if you interact with Google drive directly in R. Since the data is stored via a public link, we can proceed without any authentication.\\index{Google Drive}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndrive_deauth()\n```\n:::\n\n\nThe `drive_download()` function from the `googledrive` package allows us to download the data and store it locally.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors_link <-\n \"https://docs.google.com/spreadsheets/d/1g4LOaRj4TvwJr9RIaA_nwrXXWTOy46bP\"\n\ndrive_download(\n macro_predictors_link,\n path = \"macro_predictors.xlsx\"\n)\n```\n:::\n\n\nNext, we read in the new data and transform the columns into the variables that we later use:\n\n1. The dividend price ratio (`dp`), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices [@Campbell1988; @Campbell2006]. \n1. Dividend yield (`dy`), the difference between the log of dividends and the log of lagged prices [@Ball1978]. \n1. Earnings price ratio (`ep`), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index [@Campbell1988]. \n1. Dividend payout ratio (`de`), the difference between the log of dividends and the log of earnings [@Lamont1998]. \n1. Stock variance (`svar`), the sum of squared daily returns on the S&P 500 index [@Guo2006].\n1. Book-to-market ratio (`bm`), the ratio of book value to market value for the Dow Jones Industrial Average [@Kothari1997] \n1. Net equity expansion (`ntis`), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks [@Campbell2008].\n1. Treasury bills (`tbl`), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis [@Campbell1987].\n1. Long-term yield (`lty`), the long-term government bond yield from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Long-term rate of returns (`ltr`), the long-term government bond returns from Ibbotson's Stocks, Bonds, Bills, and Inflation Yearbook [@Goyal2008].\n1. Term spread (`tms`), the difference between the long-term yield on government bonds and the Treasury bill [@Campbell1987].\n1. Default yield spread (`dfy`), the difference between BAA and AAA-rated corporate bond yields [@Fama1989]. \n1. 
Inflation (`infl`), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics [@Campbell2004].\n\t\t\t\nFor variable definitions and the required data transformations, you can consult the material on [Amit Goyal's website](https://sites.google.com/view/agoyal145).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmacro_predictors <- read_xlsx(\n \"macro_predictors.xlsx\",\n sheet = \"Monthly\"\n) |>\n mutate(month = ym(yyyymm)) |>\n mutate(across(where(is.character), as.numeric)) |>\n mutate(\n IndexDiv = Index + D12,\n logret = log(IndexDiv) - log(lag(IndexDiv)),\n Rfree = log(Rfree + 1),\n rp_div = lead(logret - Rfree, 1), # Future excess market return\n dp = log(D12) - log(Index), # Dividend Price ratio\n dy = log(D12) - log(lag(Index)), # Dividend yield\n ep = log(E12) - log(Index), # Earnings price ratio\n de = log(D12) - log(E12), # Dividend payout ratio\n tms = lty - tbl, # Term spread\n dfy = BAA - AAA # Default yield spread\n ) |>\n select(month, rp_div, dp, dy, ep, de, svar,\n bm = `b/m`, ntis, tbl, lty, ltr,\n tms, dfy, infl\n ) |>\n filter(month >= start_date & month <= end_date) |>\n drop_na()\n```\n:::\n\n\nFinally, after reading in the macro predictors to our memory, we remove the raw data file from our temporary storage. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile.remove(\"macro_predictors.xlsx\")\n```\n:::\n\n\n## Other Macroeconomic Data\n\nThe Federal Reserve bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. As an illustration, we use the already familiar `tidyquant` package to fetch consumer price index (CPI) data that can be found under the [CPIAUCNS](https://fred.stlouisfed.org/series/CPIAUCNS) key.\\index{Data!FRED}\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyquant)\n\ncpi_monthly <- tq_get(\"CPIAUCNS\",\n get = \"economic.data\",\n from = start_date,\n to = end_date\n) |>\n mutate(\n month = floor_date(date, \"month\"),\n cpi = price / price[month == max(month)],\n .keep = \"none\"\n )\n```\n:::\n\n\nTo download other time series, we just have to look it up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the [PCU2122212122210](https://fred.stlouisfed.org/series/PCU2122212122210) key. The `tidyquant` package provides access to around 10,000 time series of the FRED database. If your desired time series is not included, we recommend working with the `fredr` package [@fredr]. Note that you need to get an API key to use its functionality. We refer to the package documentation for details. \n\n## Setting Up a Database\n\nNow that we have downloaded some (freely available) data from the web into the memory of our R session let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code. \n\nThere are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an [SQLite](https://www.sqlite.org/index.html) database, which is the C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. 
Note that [SQL](https://en.wikipedia.org/wiki/SQL) (Structured Query Language) is a standard language for accessing and manipulating databases and heavily inspired the `dplyr` functions. We refer to [this tutorial](https://www.w3schools.com/sql/sql_intro.asp) for more information on SQL.\\index{Database!SQLite}\n\nThere are two packages that make working with SQLite in R very simple: `RSQLite` [@RSQLite] embeds the SQLite database engine in R, and `dbplyr` [@dbplyr] is the database back-end for `dplyr`. These packages allow to set up a database to remotely store tables and use these remote database tables as if they are in-memory data frames by automatically converting `dplyr` into SQL. Check out the [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html) and [`dbplyr`](https://db.rstudio.com/databases/sqlite/) vignettes for more information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nAn SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that we use the `extended_types=TRUE` option to enable date types when storing and fetching data. Otherwise, date columns are stored and retrieved as integers.\\index{Database!Creation} We will use the resulting file `tidy_finance_r.sqlite` in the subfolder `data` for all subsequent chapters to retrieve our data. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\nNext, we create a remote table with the monthly Fama-French factor data. We do so with the function `dbWriteTable()`, which copies the data to our SQLite-database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"factors_ff3_monthly\",\n value = factors_ff3_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nWe can use the remote table as an in-memory data frame by building a connection via `tbl()`.\\index{Database!Remote connection}\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db <- tbl(tidy_finance, \"factors_ff3_monthly\")\n```\n:::\n\n\nAll `dplyr` calls are evaluated lazily, i.e., the data is not in our R session's memory, and the database does most of the work. You can see that by noticing that the output below does not show the number of rows. In fact, the following code chunk only fetches the top 10 rows from the database for printing. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(month, rf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Source: SQL [?? x 2]\n# Database: sqlite 3.41.2 [data/tidy_finance_r.sqlite]\n month rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ more rows\n```\n\n\n:::\n:::\n\n\nIf we want to have the whole table in memory, we need to `collect()` it. You will see that we regularly load the data into the memory in the next chapters.\\index{Database!Fetch}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly_db |>\n select(month, rf) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 756 × 2\n month rf\n \n1 1960-01-01 0.0033\n2 1960-02-01 0.0029\n3 1960-03-01 0.0035\n4 1960-04-01 0.0019\n5 1960-05-01 0.0027\n# ℹ 751 more rows\n```\n\n\n:::\n:::\n\n\nThe last couple of code chunks is really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages. 
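\n\nIf you are curious about the SQL that `dbplyr` generates from a pipeline before you collect it, you can call `show_query()` on the lazy table; the following small illustration uses the remote table from above:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Inspect the SQL translation of a lazy dplyr pipeline (no data is fetched)\nfactors_ff3_monthly_db |>\n select(month, rf) |>\n show_query()\n```\n:::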
\n\nBefore we move on to the next data source, let us also store the other five tables in our new SQLite database. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"factors_ff5_monthly\",\n value = factors_ff5_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"factors_ff3_daily\",\n value = factors_ff3_daily,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"industries_ff_monthly\",\n value = industries_ff_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"factors_q_monthly\",\n value = factors_q_monthly,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"macro_predictors\",\n value = macro_predictors,\n overwrite = TRUE\n)\n\ndbWriteTable(tidy_finance,\n \"cpi_monthly\",\n value = cpi_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nFrom now on, all you need to do to access data that is stored in the database is to follow three steps: (i) Establish the connection to the SQLite database, (ii) call the table you want to extract, and (iii) collect the data. For your convenience, the following steps show all you need in a compact fashion.\\index{Database!Connection}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\")\nfactors_q_monthly <- factors_q_monthly |> collect()\n```\n:::\n\n\n## Managing SQLite Databases\n\nFinally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file always grows in size.\\index{Database!Management}\n\nTo optimize the database file, you can run the `VACUUM` command in the database, which rebuilds the database and frees up unused space. You can execute the command in the database using the `dbSendQuery()` function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- dbSendQuery(tidy_finance, \"VACUUM\")\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n SQL VACUUM\n ROWS Fetched: 0 [complete]\n Changed: 0\n```\n\n\n:::\n:::\n\n\nThe `VACUUM` command actually performs a couple of additional cleaning steps, which you can read up in [this tutorial.](https://www.sqlitetutorial.net/sqlite-vacuum/) \\index{Database!Cleaning}\n\nWe store the result of the above query in `res` because the database keeps the result set open. To close open results and avoid warnings going forward, we can use `dbClearResult()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbClearResult(res)\n```\n:::\n\n\nApart from cleaning up, you might be interested in listing all the tables that are currently in your database. You can do this via the `dbListTables()` function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListTables(tidy_finance)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"beta\" \"compustat\" \n [3] \"cpi_monthly\" \"crsp_daily\" \n [5] \"crsp_monthly\" \"factors_ff3_daily\" \n [7] \"factors_ff3_monthly\" \"factors_ff5_monthly\" \n [9] \"factors_q_monthly\" \"fisd\" \n[11] \"industries_ff_monthly\" \"macro_predictors\" \n[13] \"trace_enhanced\" \n```\n\n\n:::\n:::\n\n\nThis function comes in handy if you are unsure about the correct naming of the tables in your database. \n\n## Exercises\n\n1. 
Download the monthly Fama-French factors manually from [Ken French's data library](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and read them in via `read_csv()`. Validate that you get the same data as via the `frenchdata` package. \n1. Download the daily Fama-French 5 factors using the `frenchdata` package. Use `get_french_data_list()` to find the corresponding table name. After the successful download and conversion to the column format that we used above, compare the `rf`, `mkt_excess`, `smb`, and `hml` columns of `factors_ff3_daily` to `factors_ff5_daily`. Discuss any differences you might find. \n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/r/beta-estimation/execute-results/html.json b/_freeze/r/beta-estimation/execute-results/html.json index 58a18cd4..46a37b9e 100644 --- a/_freeze/r/beta-estimation/execute-results/html.json +++ b/_freeze/r/beta-estimation/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "b38cf1c3a936b0371fd75eef648b83e7", + "hash": "fdf890ff102507ae5bd5d91ee49171e5", "result": { "engine": "knitr", - "markdown": "---\ntitle: Beta Estimation\naliases:\n - ../beta-estimation.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Beta Estimation with R\n description-meta: Estimate CAPM betas using monthly or daily CRSP data and the programming language R. \n---\n\n\nIn this chapter, we introduce an important concept in financial economics: the exposure of an individual stock to changes in the market portfolio. According to the Capital Asset Pricing Model (CAPM) of @Sharpe1964, @Lintner1965, and @Mossin1966, cross-sectional variation in expected asset returns should be a function of the covariance between the excess return of the asset and the excess return on the market portfolio.\\index{CAPM} The regression coefficient of excess market returns on excess stock returns is usually called the market beta. We show an estimation procedure for the market betas.\\index{Beta} We do not go into details about the foundations of market beta but simply refer to any treatment of the [CAPM](https://en.wikipedia.org/wiki/Capital_asset_pricing_model) for further information. Instead, we provide details about all the functions that we use to compute the results. In particular, we leverage useful computational concepts: rolling-window estimation and parallelization.\n\nWe use the following R packages throughout this chapter:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(slider)\nlibrary(furrr)\n```\n:::\n\n\nCompared to previous chapters, we introduce `slider` [@slider] for sliding window functions, and `furrr` [@furrr] to apply mapping functions in parallel.\n\n## Estimating Beta using Monthly Returns\n\nThe estimation procedure is based on a rolling-window estimation, where we may use either monthly or daily returns and different window lengths. 
First, let us start with loading the monthly CRSP data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\\index{Data!Fama-French factors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, industry, ret_excess) |>\n collect()\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n\ncrsp_monthly <- crsp_monthly |>\n left_join(factors_ff3_monthly, by = \"month\")\n```\n:::\n\n\nTo estimate the CAPM regression coefficients \n$$\nr_{i, t} - r_{f, t} = \\alpha_i + \\beta_i(r_{m, t}-r_{f,t})+\\varepsilon_{i, t}\n$$\nwe regress stock excess returns `ret_excess` on excess returns of the market portfolio `mkt_excess`. \nR provides a simple solution to estimate (linear) models with the function `lm()`. `lm()` requires a formula as input that is specified in a compact symbolic form. An expression of the form `y ~ model` is interpreted as a specification that the response `y` is modeled by a linear predictor specified symbolically by `model`. Such a model consists of a series of terms separated by `+` operators. In addition to standard linear models, `lm()` provides a lot of flexibility. You should check out the documentation for more information. To start, we restrict the data only to the time series of observations in CRSP that correspond to Apple’s stock (i.e., to `permno` 14593 for Apple) and compute $\\hat\\alpha_i$ as well as $\\hat\\beta_i$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit <- lm(ret_excess ~ mkt_excess,\n data = crsp_monthly |>\n filter(permno == \"14593\")\n)\n\nsummary(fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = ret_excess ~ mkt_excess, data = filter(crsp_monthly, \n permno == \"14593\"))\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.5170 -0.0589 0.0001 0.0610 0.3947 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.01019 0.00508 2.0 0.046 * \nmkt_excess 1.38889 0.11141 12.5 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.113 on 502 degrees of freedom\nMultiple R-squared: 0.236,\tAdjusted R-squared: 0.235 \nF-statistic: 155 on 1 and 502 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\n`lm()` returns an object of class `lm` which contains all information we usually care about with linear models. `summary()` returns an overview of the estimated parameters. `coefficients(fit)` would return only the estimated coefficients. The output above indicates that Apple moves excessively with the market as the estimated $\\hat\\beta_i$ is above one ($\\hat\\beta_i \\approx 1.4$). \n\n## Rolling-Window Estimation\n\nAfter we estimated the regression coefficients on an example, we scale the estimation of $\\beta_i$ to a whole different level and perform rolling-window estimations for the entire CRSP sample.\\index{Rolling-window estimation} The following function implements the CAPM regression for a data frame (or a part thereof) containing at least `min_obs` observations to avoid huge fluctuations if the time series is too short. If the condition is violated, that is, the time series is too short, the function returns a missing value. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nestimate_capm <- function(data, min_obs = 1) {\n if (nrow(data) < min_obs) {\n beta <- as.numeric(NA)\n } else {\n fit <- lm(ret_excess ~ mkt_excess, data = data)\n beta <- as.numeric(coefficients(fit)[2])\n }\n return(beta)\n}\n```\n:::\n\n\nNext, we define a function that does the rolling estimation. The `slide_period` function is able to handle months in its window input in a straightforward manner. We thus avoid using any time-series package (e.g., `zoo`) and converting the data to fit the package functions, but rather stay in the world of the `tidyverse`.\n\nThe following function takes input data and slides across the `month` vector, considering only a total of `months` months. The function essentially performs three steps: (i) arrange all rows, (ii) compute betas by sliding across months, and (iii) return a tibble with months and corresponding beta estimates (again particularly useful in the case of daily data).\nAs we demonstrate further below, we can also apply the same function to daily returns data. \n\n::: {.cell}\n\n```{.r .cell-code}\nroll_capm_estimation <- function(data, months, min_obs) {\n data <- data |>\n arrange(month)\n\n betas <- slide_period_vec(\n .x = data,\n .i = data$month,\n .period = \"month\",\n .f = ~ estimate_capm(., min_obs),\n .before = months - 1,\n .complete = FALSE\n )\n\n return(tibble(\n month = unique(data$month),\n beta = betas\n ))\n}\n```\n:::\n\n\nBefore we attack the whole CRSP sample, let us focus on a couple of examples for well-known firms.\n\n::: {.cell}\n\n```{.r .cell-code}\nexamples <- tribble(\n ~permno, ~company,\n 14593, \"Apple\",\n 10107, \"Microsoft\",\n 93436, \"Tesla\",\n 17778, \"Berkshire Hathaway\"\n)\n```\n:::\n\nIf we want to estimate rolling betas for Apple, we can use `mutate()`. \nWe take a total of 5 years of data and require at least 48 months with return data to compute our betas. \nCheck out the exercises if you want to compute beta for different time periods. \n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_example <- crsp_monthly |>\n filter(permno == examples$permno[1]) |>\n mutate(roll_capm_estimation(pick(everything()), months = 60, min_obs = 48)) |>\n drop_na()\nbeta_example\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 457 × 6\n permno month industry ret_excess mkt_excess beta\n \n1 14593 1984-12-01 Manufacturing 0.170 0.0184 2.05\n2 14593 1985-01-01 Manufacturing -0.0108 0.0799 1.90\n3 14593 1985-02-01 Manufacturing -0.152 0.0122 1.88\n4 14593 1985-03-01 Manufacturing -0.112 -0.0084 1.89\n5 14593 1985-04-01 Manufacturing -0.0467 -0.0096 1.90\n# ℹ 452 more rows\n```\n\n\n:::\n:::\n\nIt is actually quite simple to perform the rolling-window estimation for an arbitrary number of stocks, which we visualize in the following code chunk and the resulting @fig-601. 
\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_examples <- crsp_monthly |>\n inner_join(examples, by = \"permno\") |>\n group_by(permno) |>\n mutate(roll_capm_estimation(pick(everything()), months = 60, min_obs = 48)) |>\n ungroup() |>\n select(permno, company, month, beta) |>\n drop_na()\n\nbeta_examples |>\n ggplot(aes(\n x = month, \n y = beta, \n color = company,\n linetype = company)) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly beta estimates for example stocks using 5 years of data\"\n )\n```\n\n::: {.cell-output-display}\n![The CAPM betas are estimated with monthly data and a rolling window of length 5 years based on adjusted excess returns from CRSP. We use market excess returns from Kenneth French data library.](beta-estimation_files/figure-html/fig-601-1.png){#fig-601 fig-alt='Title: Monthly beta estimates for example stocks using 5 years of data. The figure shows a time series of beta estimates based on 5 years of monthly data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimated betas vary over time and across varies but are always positive for each stock.' width=2100}\n:::\n:::\n\n\n## Parallelized Rolling-Window Estimation\n\nEven though we could now just apply the function using `group_by()` on the whole CRSP sample, we advise against doing it as it is computationally quite expensive. \nRemember that we have to perform rolling-window estimations across all stocks and time periods. \nHowever, this estimation problem is an ideal scenario to employ the power of parallelization. \nParallelization means that we split the tasks which perform rolling-window estimations across different workers (or cores on your local machine). \n\nFirst, we `nest()` the data by `permno`. Nested data means we now have a list of `permno` with corresponding time series data and an `industry` label. We get one row of output for each unique combination of non-nested variables which are `permno` and `industry`.\\index{Data!Nested}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested <- crsp_monthly |>\n nest(data = c(month, ret_excess, mkt_excess))\ncrsp_monthly_nested\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30,384 × 3\n permno industry data \n \n1 10000 Manufacturing \n2 10001 Utilities \n3 10057 Manufacturing \n4 10042 Mining \n5 10043 Services \n# ℹ 30,379 more rows\n```\n\n\n:::\n:::\n\n\nAlternatively, we could have created the same nested data by *excluding* the variables that we *do not* want to nest, as in the following code chunk. However, for many applications it is desirable to explicitly state the variables that are nested into the `data` list-column, so that the reader can track what ends up in there.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested <- crsp_monthly |>\n nest(data = -c(permno, industry))\n```\n:::\n\n\nNext, we want to apply the `roll_capm_estimation()` function to each stock. This situation is an ideal use case for `map()`, which takes a list or vector as input and returns an object of the same length as the input. In our case, `map()` returns a single data frame with a time series of beta estimates for each stock. Therefore, we use `unnest()` to transform the list of outputs to a tidy data frame. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested |>\n inner_join(examples, by = \"permno\") |>\n mutate(beta = map(\n data,\n ~ roll_capm_estimation(., months = 60, min_obs = 48)\n )) |>\n unnest(beta) |>\n select(permno, month, beta_monthly = beta) |>\n drop_na()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1,458 × 3\n permno month beta_monthly\n \n1 10107 1990-03-01 1.39\n2 10107 1990-04-01 1.38\n3 10107 1990-05-01 1.43\n4 10107 1990-06-01 1.43\n5 10107 1990-07-01 1.45\n# ℹ 1,453 more rows\n```\n\n\n:::\n:::\n\n\nInstead, we want to perform the estimation of rolling betas for different stocks in parallel. If you have a Windows or Mac machine, it makes the most sense to define `multisession`, which means that separate R processes are running in the background on the same machine to perform the individual jobs. If you check out the documentation of `plan()`, you can also see other ways to resolve the parallelization in different environments. Note that we use `availableCores()` to determine the number of cores available for parallelization, but keep one core free for other tasks. Some machines might freeze if all cores are busy with R jobs. \\index{Parallelization}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_cores = availableCores() - 1\nplan(multisession, workers = n_cores)\n```\n:::\n\n\nUsing eight cores, the estimation for our sample of around 25k stocks takes around 20 minutes. Of course, you can speed things up considerably by having more cores available to share the workload or by having more powerful cores. Notice the difference in the code below? All you need to do is replace `map()` with `future_map()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_monthly <- crsp_monthly_nested |>\n mutate(beta = future_map(\n data, ~ roll_capm_estimation(., months = 60, min_obs = 48)\n )) |>\n unnest(c(beta)) |>\n select(permno, month, beta_monthly = beta) |>\n drop_na()\n```\n:::\n\n\n## Estimating Beta using Daily Returns\n\nBefore we provide some descriptive statistics of our beta estimates, we implement the estimation for the daily CRSP sample as well. \nDepending on the application, you might either use longer-horizon beta estimates based on monthly data or shorter-horizon estimates based on daily returns. \n\nFirst, we load daily CRSP data. \nNote that the sample is large compared to the monthly data, so make sure to have enough memory available.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily <- tbl(tidy_finance, \"crsp_daily\") |>\n select(permno, month, date, ret_excess) |>\n collect()\n```\n:::\n\n\nWe also need the daily Fama-French market excess returns.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily <- tbl(tidy_finance, \"factors_ff3_daily\") |>\n select(date, mkt_excess) |>\n collect()\n```\n:::\n\n\nWe make sure to keep only relevant data to save memory space. \nHowever, note that your machine might not have enough memory to read the whole daily CRSP sample. In this case, we refer you to the exercises and suggest working with loops as in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily <- crsp_daily |>\n inner_join(factors_ff3_daily, by = \"date\") |>\n select(permno, month, ret_excess, mkt_excess)\n```\n:::\n\n\nJust like above, we nest the data by `permno` for parallelization.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily_nested <- crsp_daily |>\n nest(data = c(month, ret_excess, mkt_excess))\n```\n:::\n\n\nThis is what the estimation looks like for a couple of examples using `map()`. \nFor the daily data, we use the same function as above but only take 3 months of data and require at least 50 daily return observations in these months. \nThese restrictions help us to retrieve somewhat smooth coefficient estimates.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily_nested |>\n inner_join(examples, by = \"permno\") |>\n mutate(beta_daily = map(\n data,\n ~ roll_capm_estimation(., months = 3, min_obs = 50)\n )) |>\n unnest(c(beta_daily)) |>\n select(permno, month, beta_daily = beta) |>\n drop_na()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1,639 × 3\n permno month beta_daily\n \n1 10107 1986-05-01 0.898\n2 10107 1986-06-01 0.906\n3 10107 1986-07-01 0.822\n4 10107 1986-08-01 0.900\n5 10107 1986-09-01 1.01 \n# ℹ 1,634 more rows\n```\n\n\n:::\n:::\n\n\nFor the sake of completeness, we tell our session again to use multiple workers for parallelization.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplan(multisession, workers = n_cores)\n```\n:::\n\n\nThe code chunk for beta estimation using daily returns now looks very similar to the one for monthly data. The whole estimation takes around 30 minutes using eight cores and 16gb memory. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_daily <- crsp_daily_nested |>\n mutate(beta_daily = future_map(\n data, ~ roll_capm_estimation(., months = 3, min_obs = 50)\n )) |>\n unnest(c(beta_daily)) |>\n select(permno, month, beta_daily = beta) |>\n drop_na()\n```\n:::\n\n\n## Comparing Beta Estimates\n\nWhat is a typical value for stock betas? To get some feeling, we illustrate the dispersion of the estimated $\\hat\\beta_i$ across different industries and across time below. @fig-602 shows that typical business models across industries imply different exposure to the general market economy. However, there are barely any firms that exhibit a negative exposure to the market factor.\\index{Graph!Box plot}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n left_join(beta_monthly, by = c(\"permno\", \"month\")) |>\n drop_na(beta_monthly) |>\n group_by(industry, permno) |>\n summarize(beta = mean(beta_monthly), \n .groups = \"drop\") |>\n ggplot(aes(x = reorder(industry, beta, FUN = median), y = beta)) +\n geom_boxplot() +\n coord_flip() +\n labs(\n x = NULL, y = NULL,\n title = \"Firm-specific beta distributions by industry\"\n )\n```\n\n::: {.cell-output-display}\n![The box plots show the average firm-specific beta estimates by industry.](beta-estimation_files/figure-html/fig-602-1.png){#fig-602 fig-alt='Title: Firm-specific beta distributions by industry. The figure shows box plots for each industry. Firms with the highest average CAPM beta belong to the public administration industry. Firms from the utility sector have the lowest average CAPM beta. The figure indicates very few outliers with negative CAPM betas. The large majority of all stocks has CAPM betas between 0.5 and 1.5.' width=2100}\n:::\n:::\n\n\nNext, we illustrate the time-variation in the cross-section of estimated betas. 
@fig-603 shows the monthly deciles of estimated betas (based on monthly data) and indicates an interesting pattern: First, betas seem to vary over time in the sense that during some periods, there is a clear trend across all deciles. Second, the sample exhibits periods where the dispersion across stocks increases in the sense that the lower decile decreases and the upper decile increases, which indicates that for some stocks the correlation with the market increases while for others it decreases. Note also here: stocks with negative betas are a rare exception.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_monthly |>\n drop_na(beta_monthly) |>\n group_by(month) |>\n reframe(\n x = quantile(beta_monthly, seq(0.1, 0.9, 0.1)),\n quantile = 100 * seq(0.1, 0.9, 0.1)\n ) |>\n ggplot(aes(\n x = month, \n y = x, \n color = as_factor(quantile),\n linetype = as_factor(quantile)\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly deciles of estimated betas\",\n )\n```\n\n::: {.cell-output-display}\n![Each line corresponds to the monthly cross-sectional quantile of the estimated CAPM beta.](beta-estimation_files/figure-html/fig-603-1.png){#fig-603 fig-alt='Title: Monthly deciles of estimated betas. The figure shows time series of deciles of estimated betas to illustrate the distribution of betas over time. The top 10 percent quantile on average is around 2 but varies substantially over time. The lowest 10 percent quantile is around 0.4 on average but is highly correlated with the top quantile such that in general CAPM market betas seem to go up and down jointly.' width=2100}\n:::\n:::\n\n\nTo compare the difference between daily and monthly data, we combine beta estimates to a single table. Then, we use the table to plot a comparison of beta estimates for our example stocks in @fig-604. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta <- beta_monthly |>\n full_join(beta_daily, by = c(\"permno\", \"month\")) |>\n arrange(permno, month)\n\nbeta |>\n inner_join(examples, by = \"permno\") |>\n pivot_longer(cols = c(beta_monthly, beta_daily)) |>\n drop_na() |>\n ggplot(aes(\n x = month, \n y = value, \n color = name, \n linetype = name\n )) +\n geom_line() +\n facet_wrap(~company, ncol = 1) +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL, \n title = \"Comparison of beta estimates using monthly and daily data\"\n )\n```\n\n::: {.cell-output-display}\n![CAPM betas are computed using 5 years of monthly or 3 months of daily data. The two lines show the monthly estimates based on a rolling window for few exemplary stocks.](beta-estimation_files/figure-html/fig-604-1.png){#fig-604 fig-alt='Title: Comparison of beta estimates using monthly and daily data. The figure shows a time series of beta estimates using 5 years of monthly versus 3 years of daily data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimates based on longer periods of monthly data are smooth relative to the estimates based on daily data. However, the general trend and level is similar, irrespective of the choice of frequency.' width=2100}\n:::\n:::\n\n\nThe estimates in @fig-604 look as expected. As you can see, it really depends on the estimation window and data frequency how your beta estimates turn out. \n\nFinally, we write the estimates to our database such that we can use them in later chapters. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"beta\",\n value = beta,\n overwrite = TRUE\n)\n```\n:::\n\n\nWhenever you perform some kind of estimation, it also makes sense to do rough plausibility tests. A possible check is to plot the share of stocks with beta estimates over time. \nThis descriptive helps us discover potential errors in our data preparation or estimation procedure. \nFor instance, suppose there was a gap in our output where we do not have any betas. \nIn this case, we would have to go back and check all previous steps to find out what went wrong. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_long <- crsp_monthly |>\n left_join(beta, by = c(\"permno\", \"month\")) |>\n pivot_longer(cols = c(beta_monthly, beta_daily))\n\nbeta_long |>\n group_by(month, name) |>\n summarize(share = sum(!is.na(value)) / n(), \n .groups = \"drop\") |>\n ggplot(aes(\n x = month, \n y = share, \n color = name,\n linetype = name\n )) +\n geom_line() +\n scale_y_continuous(labels = percent) +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"End-of-month share of securities with beta estimates\"\n ) +\n coord_cartesian(ylim = c(0, 1))\n```\n\n::: {.cell-output-display}\n![The two lines show the share of securities with beta estimates using 5 years of monthly or 3 months of daily data.](beta-estimation_files/figure-html/fig-605-1.png){#fig-605 fig-alt='Title: End-of-month share of securities with beta estimates. The figure shows two time series with end-of-year shares of securities with beta estimates using 5 years of monthly or 3 months of daily data. There is almost no missing data for the estimates based on daily data. For the beta estimates based on monthly data, around 75 percent of all stock-month combinations provide sufficient long historical periods to estimate the beta.' width=2100}\n:::\n:::\n\n\n@fig-605 does not indicate any troubles, so let us move on to the next check. \n\nWe also encourage everyone to always look at the distributional summary statistics of variables. You can easily spot outliers or weird distributions when looking at such tables.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_long |>\n select(name, value) |>\n drop_na() |>\n group_by(name) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n()\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 9\n name mean sd min q05 q50 q95 max n\n \n1 beta_daily 0.752 0.924 -43.7 -0.440 0.690 2.23 56.6 3286117\n2 beta_monthly 1.10 0.713 -13.0 0.127 1.04 2.32 11.8 2135342\n```\n\n\n:::\n:::\n\n\nThe summary statistics also look plausible for the two estimation procedures. \n\nFinally, since we have two different estimators for the same theoretical object, we expect the estimators should be at least positively correlated (although not perfectly as the estimators are based on different sample periods and frequencies).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta |>\n select(beta_daily, beta_monthly) |>\n cor(use = \"complete.obs\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n beta_daily beta_monthly\nbeta_daily 1.000 0.324\nbeta_monthly 0.324 1.000\n```\n\n\n:::\n:::\n\n\nIndeed, we find a positive correlation between our beta estimates. 
In the subsequent chapters, we mainly use the estimates based on monthly data as most readers should be able to replicate them due to potential memory limitations that might arise with the daily data. \n\n## Exercises\n\n1. Compute beta estimates based on monthly data using 1, 3, and 5 years of data and impose a minimum number of observations of 10, 28, and 48 months with return data, respectively. How strongly correlated are the estimated betas?\n1. Compute beta estimates based on monthly data using 5 years of data and impose different numbers of minimum observations. How does the share of permno-month observations with successful beta estimates vary across the different requirements? Do you find a high correlation across the estimated betas? \n1. Instead of using `future_map()`, perform the beta estimation in a loop (using either monthly or daily data) for a subset of 100 permnos of your choice. Verify that you get the same results as with the parallelized code from above.\n1. Filter out the stocks with negative betas. Do these stocks frequently exhibit negative betas, or do they resemble estimation errors? \n1. Compute beta estimates for multi-factor models such as the Fama-French 3 factor model. For that purpose, you extend your regression to \n$$\nr_{i, t} - r_{f, t} = \\alpha_i + \\sum\\limits_{j=1}^k\\beta_{i,k}(r_{j, t}-r_{f,t})+\\varepsilon_{i, t}\n$$\nwhere $r_{j, t}$ are the $k$ factor returns. Thus, you estimate 4 parameters ($\\alpha_i$ and the slope coefficients). Provide some summary statistics of the cross-section of firms and their exposure to the different factors.", + "markdown": "---\ntitle: Beta Estimation\naliases:\n - ../beta-estimation.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Beta Estimation with R\n description-meta: Estimate CAPM betas using monthly or daily CRSP data and the programming language R. \n---\n\n\nIn this chapter, we introduce an important concept in financial economics: the exposure of an individual stock to changes in the market portfolio. According to the Capital Asset Pricing Model (CAPM) of @Sharpe1964, @Lintner1965, and @Mossin1966, cross-sectional variation in expected asset returns should be a function of the covariance between the excess return of the asset and the excess return on the market portfolio.\\index{CAPM} The regression coefficient of excess market returns on excess stock returns is usually called the market beta. We show an estimation procedure for the market betas.\\index{Beta} We do not go into details about the foundations of market beta but simply refer to any treatment of the [CAPM](https://en.wikipedia.org/wiki/Capital_asset_pricing_model) for further information. Instead, we provide details about all the functions that we use to compute the results. In particular, we leverage useful computational concepts: rolling-window estimation and parallelization.\n\nWe use the following R packages throughout this chapter:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(slider)\nlibrary(furrr)\n```\n:::\n\n\nCompared to previous chapters, we introduce `slider` [@slider] for sliding window functions, and `furrr` [@furrr] to apply mapping functions in parallel.\n\n## Estimating Beta using Monthly Returns\n\nThe estimation procedure is based on a rolling-window estimation, where we may use either monthly or daily returns and different window lengths. 
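\n\nBefore we load any data, a small toy example may help to build intuition for how such month-based rolling windows work with `slider`; the tibble below and its values are made up purely for illustration.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy data: one made-up observation per month\ntoy_data <- tibble(\n  month = seq(ymd(\"2020-01-01\"), ymd(\"2020-06-01\"), by = \"1 month\"),\n  x = 1:6\n)\n\n# Rolling three-month means: each window contains the current month and the\n# two preceding months (windows are shorter at the start of the sample)\nslide_period_vec(\n  .x = toy_data$x,\n  .i = toy_data$month,\n  .period = \"month\",\n  .f = mean,\n  .before = 2\n)\n```\n:::\n\nThe rolling CAPM estimation below relies on the same mechanics, except that the function applied to each window is a regression instead of a mean.\n\n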
First, let us start with loading the monthly CRSP data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\\index{Data!Fama-French factors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, industry, ret_excess) |>\n collect()\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n\ncrsp_monthly <- crsp_monthly |>\n left_join(factors_ff3_monthly, join_by(month))\n```\n:::\n\n\nTo estimate the CAPM regression coefficients \n$$\nr_{i, t} - r_{f, t} = \\alpha_i + \\beta_i(r_{m, t}-r_{f,t})+\\varepsilon_{i, t}\n$$\nwe regress stock excess returns `ret_excess` on excess returns of the market portfolio `mkt_excess`. \nR provides a simple solution to estimate (linear) models with the function `lm()`. `lm()` requires a formula as input that is specified in a compact symbolic form. An expression of the form `y ~ model` is interpreted as a specification that the response `y` is modeled by a linear predictor specified symbolically by `model`. Such a model consists of a series of terms separated by `+` operators. In addition to standard linear models, `lm()` provides a lot of flexibility. You should check out the documentation for more information. To start, we restrict the data only to the time series of observations in CRSP that correspond to Apple’s stock (i.e., to `permno` 14593 for Apple) and compute $\\hat\\alpha_i$ as well as $\\hat\\beta_i$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit <- lm(ret_excess ~ mkt_excess,\n data = crsp_monthly |>\n filter(permno == \"14593\")\n)\n\nsummary(fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = ret_excess ~ mkt_excess, data = filter(crsp_monthly, \n permno == \"14593\"))\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.5170 -0.0589 0.0001 0.0610 0.3947 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.01019 0.00508 2.0 0.046 * \nmkt_excess 1.38889 0.11141 12.5 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.113 on 502 degrees of freedom\nMultiple R-squared: 0.236,\tAdjusted R-squared: 0.235 \nF-statistic: 155 on 1 and 502 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\n`lm()` returns an object of class `lm` which contains all information we usually care about with linear models. `summary()` returns an overview of the estimated parameters. `coefficients(fit)` would return only the estimated coefficients. The output above indicates that Apple moves excessively with the market as the estimated $\\hat\\beta_i$ is above one ($\\hat\\beta_i \\approx 1.4$). \n\n## Rolling-Window Estimation\n\nAfter we estimated the regression coefficients on an example, we scale the estimation of $\\beta_i$ to a whole different level and perform rolling-window estimations for the entire CRSP sample.\\index{Rolling-window estimation} The following function implements the CAPM regression for a data frame (or a part thereof) containing at least `min_obs` observations to avoid huge fluctuations if the time series is too short. If the condition is violated, that is, the time series is too short, the function returns a missing value. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nestimate_capm <- function(data, min_obs = 1) {\n if (nrow(data) < min_obs) {\n beta <- as.numeric(NA)\n } else {\n fit <- lm(ret_excess ~ mkt_excess, data = data)\n beta <- as.numeric(coefficients(fit)[2])\n }\n return(beta)\n}\n```\n:::\n\n\nNext, we define a function that does the rolling estimation. The `slide_period` function is able to handle months in its window input in a straightforward manner. We thus avoid using any time-series package (e.g., `zoo`) and converting the data to fit the package functions, but rather stay in the world of the `tidyverse`.\n\nThe following function takes input data and slides across the `month` vector, considering only a total of `months` months. The function essentially performs three steps: (i) arrange all rows, (ii) compute betas by sliding across months, and (iii) return a tibble with months and corresponding beta estimates (again particularly useful in the case of daily data).\nAs we demonstrate further below, we can also apply the same function to daily returns data. \n\n::: {.cell}\n\n```{.r .cell-code}\nroll_capm_estimation <- function(data, months, min_obs) {\n data <- data |>\n arrange(month)\n\n betas <- slide_period_vec(\n .x = data,\n .i = data$month,\n .period = \"month\",\n .f = ~ estimate_capm(., min_obs),\n .before = months - 1,\n .complete = FALSE\n )\n\n return(tibble(\n month = unique(data$month),\n beta = betas\n ))\n}\n```\n:::\n\n\nBefore we attack the whole CRSP sample, let us focus on a couple of examples for well-known firms.\n\n::: {.cell}\n\n```{.r .cell-code}\nexamples <- tribble(\n ~permno, ~company,\n 14593, \"Apple\",\n 10107, \"Microsoft\",\n 93436, \"Tesla\",\n 17778, \"Berkshire Hathaway\"\n)\n```\n:::\n\nIf we want to estimate rolling betas for Apple, we can use `mutate()`. \nWe take a total of 5 years of data and require at least 48 months with return data to compute our betas. \nCheck out the exercises if you want to compute beta for different time periods. \n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_example <- crsp_monthly |>\n filter(permno == examples$permno[1]) |>\n mutate(roll_capm_estimation(pick(everything()), months = 60, min_obs = 48)) |>\n drop_na()\nbeta_example\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 457 × 6\n permno month industry ret_excess mkt_excess beta\n \n1 14593 1984-12-01 Manufacturing 0.170 0.0184 2.05\n2 14593 1985-01-01 Manufacturing -0.0108 0.0799 1.90\n3 14593 1985-02-01 Manufacturing -0.152 0.0122 1.88\n4 14593 1985-03-01 Manufacturing -0.112 -0.0084 1.89\n5 14593 1985-04-01 Manufacturing -0.0467 -0.0096 1.90\n# ℹ 452 more rows\n```\n\n\n:::\n:::\n\nIt is actually quite simple to perform the rolling-window estimation for an arbitrary number of stocks, which we visualize in the following code chunk and the resulting @fig-601. 
\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_examples <- crsp_monthly |>\n inner_join(examples, join_by(permno)) |>\n group_by(permno) |>\n mutate(roll_capm_estimation(pick(everything()), months = 60, min_obs = 48)) |>\n ungroup() |>\n select(permno, company, month, beta) |>\n drop_na()\n\nbeta_examples |>\n ggplot(aes(\n x = month, \n y = beta, \n color = company,\n linetype = company)) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly beta estimates for example stocks using 5 years of data\"\n )\n```\n\n::: {.cell-output-display}\n![The CAPM betas are estimated with monthly data and a rolling window of length 5 years based on adjusted excess returns from CRSP. We use market excess returns from Kenneth French data library.](beta-estimation_files/figure-html/fig-601-1.png){#fig-601 fig-alt='Title: Monthly beta estimates for example stocks using 5 years of data. The figure shows a time series of beta estimates based on 5 years of monthly data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimated betas vary over time and across varies but are always positive for each stock.' width=2100}\n:::\n:::\n\n\n## Parallelized Rolling-Window Estimation\n\nEven though we could now just apply the function using `group_by()` on the whole CRSP sample, we advise against doing it as it is computationally quite expensive. \nRemember that we have to perform rolling-window estimations across all stocks and time periods. \nHowever, this estimation problem is an ideal scenario to employ the power of parallelization. \nParallelization means that we split the tasks which perform rolling-window estimations across different workers (or cores on your local machine). \n\nFirst, we `nest()` the data by `permno`. Nested data means we now have a list of `permno` with corresponding time series data and an `industry` label. We get one row of output for each unique combination of non-nested variables which are `permno` and `industry`.\\index{Data!Nested}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested <- crsp_monthly |>\n nest(data = c(month, ret_excess, mkt_excess))\ncrsp_monthly_nested\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30,384 × 3\n permno industry data \n \n1 10066 Services \n2 10067 Manufacturing \n3 10068 Finance \n4 10069 Retail \n5 10070 Manufacturing \n# ℹ 30,379 more rows\n```\n\n\n:::\n:::\n\n\nAlternatively, we could have created the same nested data by *excluding* the variables that we *do not* want to nest, as in the following code chunk. However, for many applications it is desirable to explicitly state the variables that are nested into the `data` list-column, so that the reader can track what ends up in there.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested <- crsp_monthly |>\n nest(data = -c(permno, industry))\n```\n:::\n\n\nNext, we want to apply the `roll_capm_estimation()` function to each stock. This situation is an ideal use case for `map()`, which takes a list or vector as input and returns an object of the same length as the input. In our case, `map()` returns a single data frame with a time series of beta estimates for each stock. Therefore, we use `unnest()` to transform the list of outputs to a tidy data frame. 
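\n\nBefore we do so, it can be instructive to peek at a single element of the `data` list-column; this inspection is purely illustrative and not required for the estimation itself.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Inspect the nested monthly time series of the first stock (illustration only)\ncrsp_monthly_nested |>\n  slice(1) |>\n  pull(data) |>\n  pluck(1)\n```\n:::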
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_nested |>\n inner_join(examples, join_by(permno)) |>\n mutate(beta = map(\n data,\n ~ roll_capm_estimation(., months = 60, min_obs = 48)\n )) |>\n unnest(beta) |>\n select(permno, month, beta_monthly = beta) |>\n drop_na()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1,458 × 3\n permno month beta_monthly\n \n1 10107 1990-03-01 1.39\n2 10107 1990-04-01 1.38\n3 10107 1990-05-01 1.43\n4 10107 1990-06-01 1.43\n5 10107 1990-07-01 1.45\n# ℹ 1,453 more rows\n```\n\n\n:::\n:::\n\n\nHowever, instead, we want to perform the estimations of rolling betas for different stocks in parallel. If you have a Windows or Mac machine, it makes most sense to define `multisession`, which means that separate R processes are running in the background on the same machine to perform the individual jobs. If you check out the documentation of `plan()`, you can also see other ways to resolve the parallelization in different environments. Note that we use `availableCores()` to determine the number of cores available for parallelization, but keep one core free for other tasks. Some machines might freeze if all cores are busy with Python jobs. \\index{Parallelization}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_cores = availableCores() - 1\nplan(multisession, workers = n_cores)\n```\n:::\n\n\nUsing eight cores, the estimation for our sample of around 25k stocks takes around 20 minutes. Of course, you can speed up things considerably by having more cores available to share the workload or by having more powerful cores. Notice the difference in the code below? All you need to do is to replace `map()` with `future_map()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_monthly <- crsp_monthly_nested |>\n mutate(beta = future_map(\n data, ~ roll_capm_estimation(., months = 60, min_obs = 48)\n )) |>\n unnest(c(beta)) |>\n select(permno, month, beta_monthly = beta) |>\n drop_na()\n```\n:::\n\n\n## Estimating Beta using Daily Returns\n\nBefore we provide some descriptive statistics of our beta estimates, we implement the estimation for the daily CRSP sample as well. \nDepending on the application, you might either use longer horizon beta estimates based on monthly data or shorter horizon estimates based on daily returns. \n\nFirst, we load daily CRSP data. \nNote that the sample is large compared to the monthly data, so make sure to have enough memory available.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily <- tbl(tidy_finance, \"crsp_daily\") |>\n select(permno, month, date, ret_excess) |>\n collect()\n```\n:::\n\n\nWe also need the daily Fama-French market excess returns.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_daily <- tbl(tidy_finance, \"factors_ff3_daily\") |>\n select(date, mkt_excess) |>\n collect()\n```\n:::\n\n\nWe make sure to keep only relevant data to save memory space. \nHowever, note that your machine might not have enough memory to read the whole daily CRSP sample. In this case, we refer you to the exercises and try working with loops as in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). 
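\n\nA rough sketch of such a loop-based approach could look as follows; we do not execute it here, and both the number of batches and the object names are arbitrary choices for illustration.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: process the daily CRSP data in batches of permnos so that only a\n# slice of the full sample has to sit in memory at any point in time\npermnos <- tbl(tidy_finance, \"crsp_daily\") |>\n  distinct(permno) |>\n  pull(permno)\n\nbatches <- split(permnos, cut(seq_along(permnos), breaks = 20, labels = FALSE))\n\nbeta_daily_batches <- list()\nfor (j in seq_along(batches)) {\n  permno_batch <- batches[[j]]\n  crsp_daily_batch <- tbl(tidy_finance, \"crsp_daily\") |>\n    filter(permno %in% permno_batch) |>\n    select(permno, month, date, ret_excess) |>\n    collect()\n  # The estimation for this batch would follow exactly the steps shown below\n  # (join market excess returns, nest, estimate betas); we store the raw batch\n  # here only as a placeholder for the batch-level result\n  beta_daily_batches[[j]] <- crsp_daily_batch\n}\nbeta_daily_sketch <- bind_rows(beta_daily_batches)\n```\n:::\n\nFor the rest of this chapter, we stick with the in-memory approach and combine the daily returns with the market excess returns.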
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily <- crsp_daily |>\n inner_join(factors_ff3_daily, join_by(date)) |>\n select(permno, month, ret_excess, mkt_excess)\n```\n:::\n\n\nJust like above, we nest the data by `permno` for parallelization.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily_nested <- crsp_daily |>\n nest(data = c(month, ret_excess, mkt_excess))\n```\n:::\n\n\nThis is what the estimation looks like for a couple of examples using `map()`. \nFor the daily data, we use the same function as above but only take 3 months of data and require at least 50 daily return observations in these months. \nThese restrictions help us to retrieve somewhat smooth coefficient estimates.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_daily_nested |>\n inner_join(examples, join_by(permno)) |>\n mutate(beta_daily = map(\n data,\n ~ roll_capm_estimation(., months = 3, min_obs = 50)\n )) |>\n unnest(c(beta_daily)) |>\n select(permno, month, beta_daily = beta) |>\n drop_na()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1,639 × 3\n permno month beta_daily\n \n1 10107 1986-05-01 0.898\n2 10107 1986-06-01 0.906\n3 10107 1986-07-01 0.822\n4 10107 1986-08-01 0.900\n5 10107 1986-09-01 1.01 \n# ℹ 1,634 more rows\n```\n\n\n:::\n:::\n\n\nFor the sake of completeness, we tell our session again to use multiple workers for parallelization.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplan(multisession, workers = n_cores)\n```\n:::\n\n\nThe code chunk for beta estimation using daily returns now looks very similar to the one for monthly data. The whole estimation takes around 30 minutes using eight cores and 16gb memory. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_daily <- crsp_daily_nested |>\n mutate(beta_daily = future_map(\n data, ~ roll_capm_estimation(., months = 3, min_obs = 50)\n )) |>\n unnest(c(beta_daily)) |>\n select(permno, month, beta_daily = beta) |>\n drop_na()\n```\n:::\n\n\n## Comparing Beta Estimates\n\nWhat is a typical value for stock betas? To get some feeling, we illustrate the dispersion of the estimated $\\hat\\beta_i$ across different industries and across time below. @fig-602 shows that typical business models across industries imply different exposure to the general market economy. However, there are barely any firms that exhibit a negative exposure to the market factor.\\index{Graph!Box plot}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n left_join(beta_monthly, join_by(permno, month)) |>\n drop_na(beta_monthly) |>\n group_by(industry, permno) |>\n summarize(beta = mean(beta_monthly), \n .groups = \"drop\") |>\n ggplot(aes(x = reorder(industry, beta, FUN = median), y = beta)) +\n geom_boxplot() +\n coord_flip() +\n labs(\n x = NULL, y = NULL,\n title = \"Firm-specific beta distributions by industry\"\n )\n```\n\n::: {.cell-output-display}\n![The box plots show the average firm-specific beta estimates by industry.](beta-estimation_files/figure-html/fig-602-1.png){#fig-602 fig-alt='Title: Firm-specific beta distributions by industry. The figure shows box plots for each industry. Firms with the highest average CAPM beta belong to the public administration industry. Firms from the utility sector have the lowest average CAPM beta. The figure indicates very few outliers with negative CAPM betas. The large majority of all stocks has CAPM betas between 0.5 and 1.5.' width=2100}\n:::\n:::\n\n\nNext, we illustrate the time-variation in the cross-section of estimated betas. 
@fig-603 shows the monthly deciles of estimated betas (based on monthly data) and indicates an interesting pattern: First, betas seem to vary over time in the sense that during some periods, there is a clear trend across all deciles. Second, the sample exhibits periods where the dispersion across stocks increases in the sense that the lower decile decreases and the upper decile increases, which indicates that for some stocks the correlation with the market increases while for others it decreases. Note also here: stocks with negative betas are a rare exception.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_monthly |>\n drop_na(beta_monthly) |>\n group_by(month) |>\n reframe(\n x = quantile(beta_monthly, seq(0.1, 0.9, 0.1)),\n quantile = 100 * seq(0.1, 0.9, 0.1)\n ) |>\n ggplot(aes(\n x = month, \n y = x, \n color = as_factor(quantile),\n linetype = as_factor(quantile)\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly deciles of estimated betas\",\n )\n```\n\n::: {.cell-output-display}\n![Each line corresponds to the monthly cross-sectional quantile of the estimated CAPM beta.](beta-estimation_files/figure-html/fig-603-1.png){#fig-603 fig-alt='Title: Monthly deciles of estimated betas. The figure shows time series of deciles of estimated betas to illustrate the distribution of betas over time. The top 10 percent quantile on average is around 2 but varies substantially over time. The lowest 10 percent quantile is around 0.4 on average but is highly correlated with the top quantile such that in general CAPM market betas seem to go up and down jointly.' width=2100}\n:::\n:::\n\n\nTo compare the difference between daily and monthly data, we combine beta estimates to a single table. Then, we use the table to plot a comparison of beta estimates for our example stocks in @fig-604. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta <- beta_monthly |>\n full_join(beta_daily, join_by(permno, month)) |>\n arrange(permno, month)\n\nbeta |>\n inner_join(examples, join_by(permno)) |>\n pivot_longer(cols = c(beta_monthly, beta_daily)) |>\n drop_na() |>\n ggplot(aes(\n x = month, \n y = value, \n color = name, \n linetype = name\n )) +\n geom_line() +\n facet_wrap(~company, ncol = 1) +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL, \n title = \"Comparison of beta estimates using monthly and daily data\"\n )\n```\n\n::: {.cell-output-display}\n![CAPM betas are computed using 5 years of monthly or 3 months of daily data. The two lines show the monthly estimates based on a rolling window for few exemplary stocks.](beta-estimation_files/figure-html/fig-604-1.png){#fig-604 fig-alt='Title: Comparison of beta estimates using monthly and daily data. The figure shows a time series of beta estimates using 5 years of monthly versus 3 years of daily data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimates based on longer periods of monthly data are smooth relative to the estimates based on daily data. However, the general trend and level is similar, irrespective of the choice of frequency.' width=2100}\n:::\n:::\n\n\nThe estimates in @fig-604 look as expected. As you can see, it really depends on the estimation window and data frequency how your beta estimates turn out. \n\nFinally, we write the estimates to our database such that we can use them in later chapters. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"beta\",\n value = beta,\n overwrite = TRUE\n)\n```\n:::\n\n\nWhenever you perform some kind of estimation, it also makes sense to do rough plausibility tests. A possible check is to plot the share of stocks with beta estimates over time. \nThis descriptive helps us discover potential errors in our data preparation or estimation procedure. \nFor instance, suppose there was a gap in our output where we do not have any betas. \nIn this case, we would have to go back and check all previous steps to find out what went wrong. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_long <- crsp_monthly |>\n left_join(beta, join_by(permno, month)) |>\n pivot_longer(cols = c(beta_monthly, beta_daily))\n\nbeta_long |>\n group_by(month, name) |>\n summarize(share = sum(!is.na(value)) / n(), \n .groups = \"drop\") |>\n ggplot(aes(\n x = month, \n y = share, \n color = name,\n linetype = name\n )) +\n geom_line() +\n scale_y_continuous(labels = percent) +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"End-of-month share of securities with beta estimates\"\n ) +\n coord_cartesian(ylim = c(0, 1))\n```\n\n::: {.cell-output-display}\n![The two lines show the share of securities with beta estimates using 5 years of monthly or 3 months of daily data.](beta-estimation_files/figure-html/fig-605-1.png){#fig-605 fig-alt='Title: End-of-month share of securities with beta estimates. The figure shows two time series with end-of-year shares of securities with beta estimates using 5 years of monthly or 3 months of daily data. There is almost no missing data for the estimates based on daily data. For the beta estimates based on monthly data, around 75 percent of all stock-month combinations provide sufficient long historical periods to estimate the beta.' width=2100}\n:::\n:::\n\n\n@fig-605 does not indicate any troubles, so let us move on to the next check. \n\nWe also encourage everyone to always look at the distributional summary statistics of variables. You can easily spot outliers or weird distributions when looking at such tables.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_long |>\n select(name, value) |>\n drop_na() |>\n group_by(name) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n()\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 9\n name mean sd min q05 q50 q95 max n\n \n1 beta_daily 0.752 0.924 -43.7 -0.440 0.690 2.23 56.6 3286117\n2 beta_monthly 1.10 0.713 -13.0 0.127 1.04 2.32 11.8 2135342\n```\n\n\n:::\n:::\n\n\nThe summary statistics also look plausible for the two estimation procedures. \n\nFinally, since we have two different estimators for the same theoretical object, we expect the estimators should be at least positively correlated (although not perfectly as the estimators are based on different sample periods and frequencies).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta |>\n select(beta_daily, beta_monthly) |>\n cor(use = \"complete.obs\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n beta_daily beta_monthly\nbeta_daily 1.000 0.324\nbeta_monthly 0.324 1.000\n```\n\n\n:::\n:::\n\n\nIndeed, we find a positive correlation between our beta estimates. 
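\n\nAs an additional check that is not part of our main analysis, one could also compute the rank correlation, which is less sensitive to the extreme estimates visible in the summary statistics above.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Spearman rank correlation as a quick robustness check (illustration only)\nbeta |>\n  select(beta_daily, beta_monthly) |>\n  cor(use = \"complete.obs\", method = \"spearman\")\n```\n:::\n\nIf the rank correlation turns out to be higher than the Pearson correlation above, the extreme estimates likely account for part of the gap between the two estimators.\n\n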
In the subsequent chapters, we mainly use the estimates based on monthly data as most readers should be able to replicate them due to potential memory limitations that might arise with the daily data. \n\n## Exercises\n\n1. Compute beta estimates based on monthly data using 1, 3, and 5 years of data and impose a minimum number of observations of 10, 28, and 48 months with return data, respectively. How strongly correlated are the estimated betas?\n1. Compute beta estimates based on monthly data using 5 years of data and impose different numbers of minimum observations. How does the share of permno-month observations with successful beta estimates vary across the different requirements? Do you find a high correlation across the estimated betas? \n1. Instead of using `future_map()`, perform the beta estimation in a loop (using either monthly or daily data) for a subset of 100 permnos of your choice. Verify that you get the same results as with the parallelized code from above.\n1. Filter out the stocks with negative betas. Do these stocks frequently exhibit negative betas, or do they resemble estimation errors? \n1. Compute beta estimates for multi-factor models such as the Fama-French 3 factor model. For that purpose, you extend your regression to \n$$\nr_{i, t} - r_{f, t} = \\alpha_i + \\sum\\limits_{j=1}^k\\beta_{i,k}(r_{j, t}-r_{f,t})+\\varepsilon_{i, t}\n$$\nwhere $r_{j, t}$ are the $k$ factor returns. Thus, you estimate 4 parameters ($\\alpha_i$ and the slope coefficients). Provide some summary statistics of the cross-section of firms and their exposure to the different factors.", "supporting": [ "beta-estimation_files" ], diff --git a/_freeze/r/clean-enhanced-trace-with-r/execute-results/html.json b/_freeze/r/clean-enhanced-trace-with-r/execute-results/html.json index d72e3554..dcdc734e 100644 --- a/_freeze/r/clean-enhanced-trace-with-r/execute-results/html.json +++ b/_freeze/r/clean-enhanced-trace-with-r/execute-results/html.json @@ -1,11 +1,9 @@ { - "hash": "1e2484c3cced12486f09d0650b05db55", + "hash": "06984c67c24538430ebea23c84e143f4", "result": { "engine": "knitr", - "markdown": "---\ntitle: Clean Enhanced TRACE with R\naliases:\n - ../clean-enhanced-trace-with-r.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Clean Enhanced TRACE with R\n description-meta: Copy the code to clean enhanced TRACE bond transaction data using the programming language R. \n---\n\n\nThis appendix contains code to clean enhanced TRACE with R. It is also available via the following Github [gist](https://gist.github.com/patrick-weiss/3a05b3ab281563b2e94858451c2eb3a4).\\index{Github!Gist} Hence, you could also source the function with `devtools::source_gist(\"3a05b3ab281563b2e94858451c2eb3a4\")`. We need this function in Chapter [TRACE and FISD](trace-and-fisd.qmd) to download and clean enhanced TRACE trade messages following @Dick2009 and @Dick2014 for enhanced TRACE specifically. Related, WRDS provides SAS code and there is Python code available by the project [Open Source Bond Asset Pricing](https://openbondassetpricing.com/).\n\nThe function takes a vector of CUSIPs (in `cusips`), a connection to WRDS (`connection`) explained in Chapter 3, and a start and end date (`start_date` and `end_date`, respectively). Specifying too many CUSIPs will result in very slow downloads and a potential failure due to the size of the request to WRDS. The dates should be within the coverage of TRACE itself, i.e., starting after 2002, and the dates should be supplied using the class date. 
The output of the function contains all valid trade messages for the selected CUSIPs over the specified period.\\index{CUSIP}\\index{Dick-Nielsen cleaning} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nclean_enhanced_trace <- function(cusips,\n connection,\n start_date = as.Date(\"2002-01-01\"),\n end_date = today()) {\n\n # Packages (required)\n library(tidyverse)\n library(dbplyr)\n library(RPostgres)\n\n # Function checks ---------------------------------------------------------\n # Input parameters\n ## Cusips\n if (length(cusips) == 0 | any(is.na(cusips))) stop(\"Check cusips.\")\n\n ## Dates\n if (!is.Date(start_date) | !is.Date(end_date)) stop(\"Dates needed\")\n if (start_date < as.Date(\"2002-01-01\")) stop(\"TRACE starts later.\")\n if (end_date > today()) stop(\"TRACE does not predict the future.\")\n if (start_date >= end_date) stop(\"Date conflict.\")\n\n ## Connection\n if (!dbIsValid(connection)) stop(\"Connection issue.\")\n\n # Enhanced Trace ----------------------------------------------------------\n # Main file\n trace_all <- tbl(\n connection,\n in_schema(\"trace\", \"trace_enhanced\")\n ) |>\n filter(cusip_id %in% cusips) |>\n filter(trd_exctn_dt >= start_date & trd_exctn_dt <= end_date) |>\n select(\n cusip_id, msg_seq_nb, orig_msg_seq_nb,\n entrd_vol_qt, rptd_pr, yld_pt, rpt_side_cd, cntra_mp_id,\n trd_exctn_dt, trd_exctn_tm, trd_rpt_dt, trd_rpt_tm,\n pr_trd_dt, trc_st, asof_cd, wis_fl,\n days_to_sttl_ct, stlmnt_dt, spcl_trd_fl\n ) |>\n collect()\n\n # Enhanced Trace: Post 06-02-2012 -----------------------------------------\n # Trades (trc_st = T) and correction (trc_st = R)\n trace_post_TR <- trace_all |>\n filter(\n (trc_st == \"T\" | trc_st == \"R\"),\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Cancellations (trc_st = X) and correction cancellations (trc_st = C)\n trace_post_XC <- trace_all |>\n filter(\n (trc_st == \"X\" | trc_st == \"C\"),\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Cleaning corrected and cancelled trades\n trace_post_TR <- trace_post_TR |>\n anti_join(trace_post_XC,\n by = c(\n \"cusip_id\", \"msg_seq_nb\", \"entrd_vol_qt\",\n \"rptd_pr\", \"rpt_side_cd\", \"cntra_mp_id\",\n \"trd_exctn_dt\", \"trd_exctn_tm\"\n )\n )\n\n # Reversals (trc_st = Y)\n trace_post_Y <- trace_all |>\n filter(\n trc_st == \"Y\",\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Clean reversals\n ## match the orig_msg_seq_nb of the Y-message to\n ## the msg_seq_nb of the main message\n trace_post <- trace_post_TR |>\n anti_join(trace_post_Y,\n by = c(\"cusip_id\",\n \"msg_seq_nb\" = \"orig_msg_seq_nb\",\n \"entrd_vol_qt\", \"rptd_pr\", \"rpt_side_cd\",\n \"cntra_mp_id\", \"trd_exctn_dt\", \"trd_exctn_tm\"\n )\n )\n\n\n # Enhanced TRACE: Pre 06-02-2012 ------------------------------------------\n # Cancelations (trc_st = C)\n trace_pre_C <- trace_all |>\n filter(\n trc_st == \"C\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n )\n\n # Trades w/o cancellations\n ## match the orig_msg_seq_nb of the C-message\n ## to the msg_seq_nb of the main message\n trace_pre_T <- trace_all |>\n filter(\n trc_st == \"T\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n ) |>\n anti_join(trace_pre_C,\n by = c(\"cusip_id\",\n \"msg_seq_nb\" = \"orig_msg_seq_nb\",\n \"entrd_vol_qt\", \"rptd_pr\", \"rpt_side_cd\",\n \"cntra_mp_id\", \"trd_exctn_dt\", \"trd_exctn_tm\"\n )\n )\n\n # Corrections (trc_st = W) - W can also correct a previous W\n trace_pre_W <- trace_all |>\n filter(\n trc_st == \"W\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n )\n\n # Implement corrections in a loop\n ## Correction 
control\n correction_control <- nrow(trace_pre_W)\n correction_control_last <- nrow(trace_pre_W)\n\n ## Correction loop\n while (correction_control > 0) {\n # Corrections that correct some msg\n trace_pre_W_correcting <- trace_pre_W |>\n semi_join(trace_pre_T,\n by = c(\"cusip_id\", \"trd_exctn_dt\",\n \"orig_msg_seq_nb\" = \"msg_seq_nb\"\n )\n )\n\n # Corrections that do not correct some msg\n trace_pre_W <- trace_pre_W |>\n anti_join(trace_pre_T,\n by = c(\"cusip_id\", \"trd_exctn_dt\",\n \"orig_msg_seq_nb\" = \"msg_seq_nb\"\n )\n )\n\n # Delete msgs that are corrected and add correction msgs\n trace_pre_T <- trace_pre_T |>\n anti_join(trace_pre_W_correcting,\n by = c(\"cusip_id\", \"trd_exctn_dt\",\n \"msg_seq_nb\" = \"orig_msg_seq_nb\"\n )\n ) |>\n union_all(trace_pre_W_correcting)\n\n # Escape if no corrections remain or they cannot be matched\n correction_control <- nrow(trace_pre_W)\n if (correction_control == correction_control_last) {\n correction_control <- 0\n }\n correction_control_last <- nrow(trace_pre_W)\n }\n\n\n # Clean reversals\n ## Record reversals\n trace_pre_R <- trace_pre_T |>\n filter(asof_cd == \"R\") |>\n group_by(\n cusip_id, trd_exctn_dt, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id\n ) |>\n arrange(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>\n mutate(seq = row_number()) |>\n ungroup()\n\n ## Remove reversals and the reversed trade\n trace_pre <- trace_pre_T |>\n filter(is.na(asof_cd) | !(asof_cd %in% c(\"R\", \"X\", \"D\"))) |>\n group_by(\n cusip_id, trd_exctn_dt, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id\n ) |>\n arrange(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>\n mutate(seq = row_number()) |>\n ungroup() |>\n anti_join(trace_pre_R,\n by = c(\n \"cusip_id\", \"trd_exctn_dt\", \"entrd_vol_qt\",\n \"rptd_pr\", \"rpt_side_cd\", \"cntra_mp_id\", \"seq\"\n )\n ) |>\n select(-seq)\n\n\n # Agency trades -----------------------------------------------------------\n # Combine pre and post trades\n trace_clean <- trace_post |>\n union_all(trace_pre)\n\n # Keep angency sells and unmatched agency buys\n ## Agency sells\n trace_agency_sells <- trace_clean |>\n filter(\n cntra_mp_id == \"D\",\n rpt_side_cd == \"S\"\n )\n\n # Agency buys that are unmatched\n trace_agency_buys_filtered <- trace_clean |>\n filter(\n cntra_mp_id == \"D\",\n rpt_side_cd == \"B\"\n ) |>\n anti_join(trace_agency_sells,\n by = c(\n \"cusip_id\", \"trd_exctn_dt\",\n \"entrd_vol_qt\", \"rptd_pr\"\n )\n )\n\n # Agency clean\n trace_clean <- trace_clean |>\n filter(cntra_mp_id == \"C\") |>\n union_all(trace_agency_sells) |>\n union_all(trace_agency_buys_filtered)\n\n\n # Additional Filters ------------------------------------------------------\n trace_add_filters <- trace_clean |>\n mutate(days_to_sttl_ct2 = stlmnt_dt - trd_exctn_dt) |>\n filter(\n is.na(days_to_sttl_ct) | as.numeric(days_to_sttl_ct) <= 7,\n is.na(days_to_sttl_ct2) | as.numeric(days_to_sttl_ct2) <= 7,\n wis_fl == \"N\",\n is.na(spcl_trd_fl) | spcl_trd_fl == \"\",\n is.na(asof_cd) | asof_cd == \"\"\n )\n\n\n # Output ------------------------------------------------------------------\n # Only keep necessary columns\n trace_final <- trace_add_filters |>\n arrange(cusip_id, trd_exctn_dt, trd_exctn_tm) |>\n select(\n cusip_id, trd_exctn_dt, trd_exctn_tm,\n rptd_pr, entrd_vol_qt, yld_pt, rpt_side_cd, cntra_mp_id\n ) |>\n mutate(trd_exctn_tm = format(as_datetime(trd_exctn_tm), \"%H:%M:%S\"))\n\n # Return\n return(trace_final)\n}\n```\n:::", - "supporting": [ - "clean-enhanced-trace-with-r_files" - ], + "markdown": "---\ntitle: Clean 
Enhanced TRACE with R\naliases:\n - ../clean-enhanced-trace-with-r.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Clean Enhanced TRACE with R\n description-meta: Copy the code to clean enhanced TRACE bond transaction data using the programming language R. \n---\n\n\nThis appendix contains code to clean enhanced TRACE with R. It is also available via the following Github [gist](https://gist.github.com/patrick-weiss/3a05b3ab281563b2e94858451c2eb3a4).\\index{Github!Gist} Hence, you could also source the function with `devtools::source_gist(\"3a05b3ab281563b2e94858451c2eb3a4\")`. We need this function in Chapter [TRACE and FISD](trace-and-fisd.qmd) to download and clean enhanced TRACE trade messages following @Dick2009 and @Dick2014 for enhanced TRACE specifically. Related, WRDS provides SAS code and there is Python code available by the project [Open Source Bond Asset Pricing](https://openbondassetpricing.com/).\n\nThe function takes a vector of CUSIPs (in `cusips`), a connection to WRDS (`connection`) explained in Chapter 3, and a start and end date (`start_date` and `end_date`, respectively). Specifying too many CUSIPs will result in very slow downloads and a potential failure due to the size of the request to WRDS. The dates should be within the coverage of TRACE itself, i.e., starting after 2002, and the dates should be supplied using the class date. The output of the function contains all valid trade messages for the selected CUSIPs over the specified period.\\index{CUSIP}\\index{Dick-Nielsen cleaning} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nclean_enhanced_trace <- function(cusips,\n connection,\n start_date = as.Date(\"2002-01-01\"),\n end_date = today()) {\n\n # Packages (required)\n library(tidyverse)\n library(dbplyr)\n library(RPostgres)\n\n # Function checks ---------------------------------------------------------\n # Input parameters\n ## Cusips\n if (length(cusips) == 0 | any(is.na(cusips))) stop(\"Check cusips.\")\n\n ## Dates\n if (!is.Date(start_date) | !is.Date(end_date)) stop(\"Dates needed\")\n if (start_date < as.Date(\"2002-01-01\")) stop(\"TRACE starts later.\")\n if (end_date > today()) stop(\"TRACE does not predict the future.\")\n if (start_date >= end_date) stop(\"Date conflict.\")\n\n ## Connection\n if (!dbIsValid(connection)) stop(\"Connection issue.\")\n\n # Enhanced Trace ----------------------------------------------------------\n # Main file\n trace_all <- tbl(\n connection,\n in_schema(\"trace\", \"trace_enhanced\")\n ) |>\n filter(cusip_id %in% cusips) |>\n filter(trd_exctn_dt >= start_date & trd_exctn_dt <= end_date) |>\n select(\n cusip_id, msg_seq_nb, orig_msg_seq_nb,\n entrd_vol_qt, rptd_pr, yld_pt, rpt_side_cd, cntra_mp_id,\n trd_exctn_dt, trd_exctn_tm, trd_rpt_dt, trd_rpt_tm,\n pr_trd_dt, trc_st, asof_cd, wis_fl,\n days_to_sttl_ct, stlmnt_dt, spcl_trd_fl\n ) |>\n collect()\n\n # Enhanced Trace: Post 06-02-2012 -----------------------------------------\n # Trades (trc_st = T) and correction (trc_st = R)\n trace_post_TR <- trace_all |>\n filter(\n (trc_st == \"T\" | trc_st == \"R\"),\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Cancellations (trc_st = X) and correction cancellations (trc_st = C)\n trace_post_XC <- trace_all |>\n filter(\n (trc_st == \"X\" | trc_st == \"C\"),\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Cleaning corrected and cancelled trades\n trace_post_TR <- trace_post_TR |>\n anti_join(trace_post_XC,\n join_by(\n cusip_id, msg_seq_nb, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id,\n trd_exctn_dt, 
trd_exctn_tm\n )\n )\n\n # Reversals (trc_st = Y)\n trace_post_Y <- trace_all |>\n filter(\n trc_st == \"Y\",\n trd_rpt_dt >= as.Date(\"2012-02-06\")\n )\n\n # Clean reversals\n ## match the orig_msg_seq_nb of the Y-message to\n ## the msg_seq_nb of the main message\n trace_post <- trace_post_TR |>\n anti_join(trace_post_Y,\n join_by(\n cusip_id,\n msg_seq_nb == orig_msg_seq_nb,\n entrd_vol_qt, rptd_pr, rpt_side_cd,\n cntra_mp_id, trd_exctn_dt, trd_exctn_tm\n )\n )\n\n\n # Enhanced TRACE: Pre 06-02-2012 ------------------------------------------\n # Cancelations (trc_st = C)\n trace_pre_C <- trace_all |>\n filter(\n trc_st == \"C\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n )\n\n # Trades w/o cancellations\n ## match the orig_msg_seq_nb of the C-message\n ## to the msg_seq_nb of the main message\n trace_pre_T <- trace_all |>\n filter(\n trc_st == \"T\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n ) |>\n anti_join(trace_pre_C,\n join_by(\n cusip_id,\n msg_seq_nb == orig_msg_seq_nb,\n entrd_vol_qt, rptd_pr, rpt_side_cd,\n cntra_mp_id, trd_exctn_dt, trd_exctn_tm\n )\n )\n\n # Corrections (trc_st = W) - W can also correct a previous W\n trace_pre_W <- trace_all |>\n filter(\n trc_st == \"W\",\n trd_rpt_dt < as.Date(\"2012-02-06\")\n )\n\n # Implement corrections in a loop\n ## Correction control\n correction_control <- nrow(trace_pre_W)\n correction_control_last <- nrow(trace_pre_W)\n\n ## Correction loop\n while (correction_control > 0) {\n # Corrections that correct some msg\n trace_pre_W_correcting <- trace_pre_W |>\n semi_join(trace_pre_T,\n join_by(\n cusip_id, trd_exctn_dt,\n orig_msg_seq_nb == msg_seq_nb\n )\n )\n\n # Corrections that do not correct some msg\n trace_pre_W <- trace_pre_W |>\n anti_join(trace_pre_T,\n join_by(cusip_id, trd_exctn_dt,\n orig_msg_seq_nb == msg_seq_nb\n )\n )\n\n # Delete msgs that are corrected and add correction msgs\n trace_pre_T <- trace_pre_T |>\n anti_join(trace_pre_W_correcting,\n join_by(\n cusip_id, trd_exctn_dt,\n msg_seq_nb == orig_msg_seq_nb\n )\n ) |>\n union_all(trace_pre_W_correcting)\n\n # Escape if no corrections remain or they cannot be matched\n correction_control <- nrow(trace_pre_W)\n if (correction_control == correction_control_last) {\n correction_control <- 0\n }\n correction_control_last <- nrow(trace_pre_W)\n }\n\n\n # Clean reversals\n ## Record reversals\n trace_pre_R <- trace_pre_T |>\n filter(asof_cd == \"R\") |>\n group_by(\n cusip_id, trd_exctn_dt, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id\n ) |>\n arrange(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>\n mutate(seq = row_number()) |>\n ungroup()\n\n ## Remove reversals and the reversed trade\n trace_pre <- trace_pre_T |>\n filter(is.na(asof_cd) | !(asof_cd %in% c(\"R\", \"X\", \"D\"))) |>\n group_by(\n cusip_id, trd_exctn_dt, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id\n ) |>\n arrange(trd_exctn_tm, trd_rpt_dt, trd_rpt_tm) |>\n mutate(seq = row_number()) |>\n ungroup() |>\n anti_join(trace_pre_R,\n join_by(\n cusip_id, trd_exctn_dt, entrd_vol_qt,\n rptd_pr, rpt_side_cd, cntra_mp_id, seq\n )\n ) |>\n select(-seq)\n\n\n # Agency trades -----------------------------------------------------------\n # Combine pre and post trades\n trace_clean <- trace_post |>\n union_all(trace_pre)\n\n # Keep angency sells and unmatched agency buys\n ## Agency sells\n trace_agency_sells <- trace_clean |>\n filter(\n cntra_mp_id == \"D\",\n rpt_side_cd == \"S\"\n )\n\n # Agency buys that are unmatched\n trace_agency_buys_filtered <- trace_clean |>\n filter(\n cntra_mp_id == \"D\",\n rpt_side_cd 
== \"B\"\n ) |>\n anti_join(trace_agency_sells,\n join_by(\n cusip_id, trd_exctn_dt,\n entrd_vol_qt, rptd_pr\n )\n )\n\n # Agency clean\n trace_clean <- trace_clean |>\n filter(cntra_mp_id == \"C\") |>\n union_all(trace_agency_sells) |>\n union_all(trace_agency_buys_filtered)\n\n\n # Additional Filters ------------------------------------------------------\n trace_add_filters <- trace_clean |>\n mutate(days_to_sttl_ct2 = stlmnt_dt - trd_exctn_dt) |>\n filter(\n is.na(days_to_sttl_ct) | as.numeric(days_to_sttl_ct) <= 7,\n is.na(days_to_sttl_ct2) | as.numeric(days_to_sttl_ct2) <= 7,\n wis_fl == \"N\",\n is.na(spcl_trd_fl) | spcl_trd_fl == \"\",\n is.na(asof_cd) | asof_cd == \"\"\n )\n\n\n # Output ------------------------------------------------------------------\n # Only keep necessary columns\n trace_final <- trace_add_filters |>\n arrange(cusip_id, trd_exctn_dt, trd_exctn_tm) |>\n select(\n cusip_id, trd_exctn_dt, trd_exctn_tm,\n rptd_pr, entrd_vol_qt, yld_pt, rpt_side_cd, cntra_mp_id\n ) |>\n mutate(trd_exctn_tm = format(as_datetime(trd_exctn_tm), \"%H:%M:%S\"))\n\n # Return\n return(trace_final)\n}\n```\n:::", + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/r/constrained-optimization-and-backtesting/execute-results/html.json b/_freeze/r/constrained-optimization-and-backtesting/execute-results/html.json index e66542d3..60c83592 100644 --- a/_freeze/r/constrained-optimization-and-backtesting/execute-results/html.json +++ b/_freeze/r/constrained-optimization-and-backtesting/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "1c2684b4fabe1e11d16fb9d3e16a8a9f", "result": { "engine": "knitr", - "markdown": "---\ntitle: Constrained Optimization and Backtesting\naliases:\n - ../constrained-optimization-and-backtesting.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Constrained Optimization and Backtesting with R\n description-meta: Conduct portfolio backtesting with transaction costs and no-shortselling constraints using the programming language R. \n---\n\n\n\\index{Backtesting} In this chapter, we conduct portfolio backtesting in a realistic setting by including transaction costs and investment constraints such as no-short-selling rules. \\index{Short-selling}\nWe start with standard mean-variance efficient portfolios and introduce constraints in a step-by-step manner. To do so, we rely on numerical optimization procedures in R.\\index{Efficient portfolio} We conclude the chapter by providing an out-of-sample backtesting procedure for the different strategies that we introduce in this chapter. \n\nThroughout this chapter, we use the following R packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(nloptr)\n```\n:::\n\n\nCompared to previous chapters, we introduce the `nloptr` package [@nloptr] to perform numerical constrained optimization for portfolio choice problems.\\index{Optimization}\\index{Portfolio choice}\n\n## Data Preparation\n\nWe start by loading the required data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd). For simplicity, we restrict our investment universe to the monthly Fama-French industry portfolio returns in the following application. 
\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nindustry_returns <- tbl(tidy_finance, \"industries_ff_monthly\") |>\n select(-month) |>\n collect() |> \n drop_na()\n```\n:::\n\n\n## Recap of Portfolio Choice \n\nA common objective for portfolio optimization is to find mean-variance efficient portfolio weights, i.e., the allocation which delivers the lowest possible return variance for a given minimum level of expected returns. \nIn the most extreme case, where the investor is only concerned about portfolio variance, she may choose to implement the minimum variance portfolio (MVP) weights which are given by the solution to \n$$\\omega_\\text{mvp} = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1$$\nwhere $\\Sigma$ is the $(N \\times N)$ covariance matrix of the returns. The optimal weights $\\omega_\\text{mvp}$ can be found analytically and are $\\omega_\\text{mvp} = \\frac{\\Sigma^{-1}\\iota}{\\iota'\\Sigma^{-1}\\iota}$. In terms of code, the math is equivalent to the following chunk. \\index{Minimum variance portfolio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_industries <- ncol(industry_returns)\n\nSigma <- cov(industry_returns)\nw_mvp <- solve(Sigma) %*% rep(1, n_industries)\nw_mvp <- as.vector(w_mvp / sum(w_mvp))\n\nmu <- colMeans(industry_returns)\n```\n:::\n\n\nNext, consider an investor who aims to achieve minimum variance *given a required expected portfolio return* $\\bar{\\mu}$ such that she chooses\n$$\\omega_\\text{eff}({\\bar{\\mu}}) =\\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1 \\text{ and } \\omega'\\mu \\geq \\bar{\\mu}.$$\nWe leave it as an exercise below to show that the portfolio choice problem can equivalently be formulated for an investor with mean-variance preferences and risk aversion factor $\\gamma$. That means the investor aims to choose portfolio weights as the solution to\n$$ \\omega^*_\\gamma = \\arg\\max \\omega' \\mu - \\frac{\\gamma}{2}\\omega'\\Sigma \\omega\\quad \\text{ s.t. } \\omega'\\iota = 1.$$\nThe solution to the optimal portfolio choice problem is:\n$$\\omega^*_{\\gamma} = \\frac{1}{\\gamma}\\left(\\Sigma^{-1} - \\frac{1}{\\iota' \\Sigma^{-1}\\iota }\\Sigma^{-1}\\iota\\iota' \\Sigma^{-1} \\right) \\mu + \\frac{1}{\\iota' \\Sigma^{-1} \\iota }\\Sigma^{-1} \\iota.$$\nEmpirically, this classical solution imposes many problems. \nIn particular, the estimates of $\\mu$ are noisy over short horizons, the ($N \\times N$) matrix $\\Sigma$ contains $N(N-1)/2$ distinct elements and thus, estimation error is huge. \nSeminal papers on the effect of ignoring estimation uncertainty, among others, are @Brown1976, @Jobson1980, @Jorion1986, and @Chopra1993.\n\nEven worse, if the asset universe contains more assets than available time periods $(N > T)$, the sample covariance matrix is no longer positive definite such that the inverse $\\Sigma^{-1}$ does not exist anymore. \nTo address estimation issues for vast-dimensional covariance matrices, regularization techniques are a popular tool [see, e.g., @Ledoit2003; @Ledoit2004; @Ledoit2012; @Fan2008].\\index{Covariance}\n\nWhile the uncertainty associated with estimated parameters is challenging, the data-generating process is also unknown to the investor. 
In other words, model uncertainty reflects that it is ex-ante not even clear which parameters require estimation (for instance, if returns are driven by a factor model, selecting the universe of relevant factors imposes model uncertainty). @Wang2005 and @Garlappi2007 provide theoretical analysis on optimal portfolio choice under model *and* estimation uncertainty. In the most extreme case, @Pflug2012 shows that the naive portfolio which allocates equal wealth to all assets is the optimal choice for an investor averse to model uncertainty.\\index{Model uncertainty}\n\nOn top of the estimation uncertainty, *transaction costs* are a major concern.\\index{Transaction cost} \nRebalancing portfolios is costly, and, therefore, the optimal choice should depend on the investor's current holdings. In the presence of transaction costs, the benefits of reallocating wealth may be smaller than the costs associated with turnover. This aspect has been investigated theoretically, among others, for one risky asset by @Magill1976 and @Davis1990. Subsequent extensions to the case with multiple assets have been proposed by @Balduzzi1999 and @Balduzzi2000. More recent papers on empirical approaches which explicitly account for transaction costs include @Garleanu2013, and @DeMiguel2014, and @DeMiguel2015. \n\n## Estimation Uncertainty and Transaction Costs\n\nThe empirical evidence regarding the performance of a mean-variance optimization procedure in which you simply plug in some sample estimates $\\hat \\mu$ and $\\hat \\Sigma$ can be summarized rather briefly: mean-variance optimization performs poorly! The literature discusses many proposals to overcome these empirical issues. For instance, one may impose some form of regularization of $\\Sigma$, rely on Bayesian priors inspired by theoretical asset pricing models [@Kan2007] or use high-frequency data to improve forecasting [@Hautsch2015].\nOne unifying framework that works easily, effectively (even for large dimensions), and is purely inspired by economic arguments is an ex-ante adjustment for transaction costs [@Hautsch2019]. \n\nAssume that returns are from a multivariate normal distribution with mean $\\mu$ and variance-covariance matrix $\\Sigma$, $N(\\mu,\\Sigma)$. Additionally, we assume quadratic transaction costs which penalize rebalancing such that $$\n\\begin{aligned}\n\\nu\\left(\\omega_{t+1},\\omega_{t^+}, \\beta\\right) = \\frac{\\beta}{2} \\left(\\omega_{t+1} - \\omega_{t^+}\\right)'\\left(\\omega_{t+1}- \\omega_{t^+}\\right),\\end{aligned}$$\nwith cost parameter $\\beta>0$ and $\\omega_{t^+} = {\\omega_t \\circ (1 +r_{t})}/{\\iota' (\\omega_t \\circ (1 + r_{t}))}$. $\\omega_{t^+}$ denotes the portfolio weights just before rebalancing. Note that $\\omega_{t^+}$ differs mechanically from $\\omega_t$ due to the returns in the past period.\nIntuitively, transaction costs penalize portfolio performance when the portfolio is shifted from the current holdings $\\omega_{t^+}$ to a new allocation $\\omega_{t+1}$. \nIn this setup, transaction costs do not increase linearly. Instead, larger rebalancing is penalized more heavily than small adjustments. \nThen, the optimal portfolio choice for an investor with mean variance preferences is\n$$\\begin{aligned}\\omega_{t+1} ^* &= \\arg\\max \\omega'\\mu - \\nu_t (\\omega,\\omega_{t^+}, \\beta) - \\frac{\\gamma}{2}\\omega'\\Sigma\\omega \\text{ s.t. 
} \\iota'\\omega = 1\\\\\n&=\\arg\\max\n\\omega'\\mu^* - \\frac{\\gamma}{2}\\omega'\\Sigma^* \\omega \\text{ s.t.} \\iota'\\omega=1,\\end{aligned}$$\nwhere\n$$\\mu^*=\\mu+\\beta \\omega_{t^+} \\quad \\text{and} \\quad \\Sigma^*=\\Sigma + \\frac{\\beta}{\\gamma} I_N.$$\nAs a result, adjusting for transaction costs implies a standard mean-variance optimal portfolio choice with adjusted return parameters $\\Sigma^*$ and $\\mu^*$: $$\\omega^*_{t+1} = \\frac{1}{\\gamma}\\left(\\Sigma^{*-1} - \\frac{1}{\\iota' \\Sigma^{*-1}\\iota }\\Sigma^{*-1}\\iota\\iota' \\Sigma^{*-1} \\right) \\mu^* + \\frac{1}{\\iota' \\Sigma^{*-1} \\iota }\\Sigma^{*-1} \\iota.$$\n\nAn alternative formulation of the optimal portfolio can be derived as follows: \n$$\\omega_{t+1} ^*=\\arg\\max\n\\omega'\\left(\\mu+\\beta\\left(\\omega_{t^+} - \\frac{1}{N}\\iota\\right)\\right) - \\frac{\\gamma}{2}\\omega'\\Sigma^* \\omega \\text{ s.t. } \\iota'\\omega=1.$$\nThe optimal weights correspond to a mean-variance portfolio, where the vector of expected returns is such that assets that currently exhibit a higher weight are considered as delivering a higher expected return. \n\n## Optimal Portfolio Choice\n\nThe function below implements the efficient portfolio weight in its general form, allowing for transaction costs (conditional on the holdings *before* reallocation). \nFor $\\beta=0$, the computation resembles the standard mean-variance efficient framework. `gamma` denotes the coefficient of risk aversion $\\gamma$, `beta` is the transaction cost parameter $\\beta$ and `w_prev` are the weights before rebalancing $\\omega_{t^+}$. \n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_efficient_weight <- function(Sigma,\n mu,\n gamma = 2,\n beta = 0, # transaction costs\n w_prev = rep(\n 1 / ncol(Sigma),\n ncol(Sigma)\n )) {\n iota <- rep(1, ncol(Sigma))\n Sigma_processed <- Sigma + beta / gamma * diag(ncol(Sigma))\n mu_processed <- mu + beta * w_prev\n\n Sigma_inverse <- solve(Sigma_processed)\n\n w_mvp <- Sigma_inverse %*% iota\n w_mvp <- as.vector(w_mvp / sum(w_mvp))\n w_opt <- w_mvp + 1 / gamma *\n (Sigma_inverse - w_mvp %*% t(iota) %*% Sigma_inverse) %*%\n mu_processed\n \n return(as.vector(w_opt))\n}\n\nw_efficient <- compute_efficient_weight(Sigma, mu)\nround(w_efficient, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1.630 0.090 -1.355 0.687 0.333 -0.412 0.504 0.402 -0.219\n[10] -0.661\n```\n\n\n:::\n:::\n\n\nThe portfolio weights above indicate the efficient portfolio for an investor with risk aversion coefficient $\\gamma=2$ in absence of transaction costs. Some of the positions are negative which implies short-selling, most of the positions are rather extreme. For instance, a position of $-1$ implies that the investor takes a short position worth her entire wealth to lever long positions in other assets. \\index{Short-selling}\nWhat is the effect of transaction costs or different levels of risk aversion on the optimal portfolio choice? 
The following few lines of code analyze the distance between the minimum variance portfolio and the portfolio implemented by the investor for different values of the transaction cost parameter $\\beta$ and risk aversion $\\gamma$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngammas <- c(2, 4, 8, 20)\nbetas <- 20 * qexp((1:99) / 100)\n \ntransaction_costs <- expand_grid(\n gamma = gammas,\n beta = betas\n) |>\n mutate(\n weights = map2(\n .x = gamma,\n .y = beta,\n ~ compute_efficient_weight(Sigma,\n mu,\n gamma = .x,\n beta = .y / 10000,\n w_prev = w_mvp\n )\n ),\n concentration = map_dbl(weights, ~ sum(abs(. - w_mvp)))\n )\n```\n:::\n\n\nThe code chunk above computes the optimal weight in presence of transaction cost for different values of $\\beta$ and $\\gamma$ but with the same initial allocation, the theoretical optimal minimum variance portfolio. \nStarting from the initial allocation, the investor chooses her optimal allocation along the efficient frontier to reflect her own risk preferences. \nIf transaction costs would be absent, the investor would simply implement the mean-variance efficient allocation. If transaction costs make it costly to rebalance, her optimal portfolio choice reflects a shift toward the efficient portfolio, whereas her current portfolio anchors her investment.\\index{Graph!Comparative statics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntransaction_costs |>\n mutate(risk_aversion = as_factor(gamma)) |>\n ggplot(aes(\n x = beta,\n y = concentration,\n color = risk_aversion,\n linetype = risk_aversion\n )) +\n geom_line() +\n guides(linetype = \"none\") + \n labs(\n x = \"Transaction cost parameter\",\n y = \"Distance from MVP\",\n color = \"Risk aversion\",\n title = \"Portfolio weights for different risk aversion and transaction cost\"\n )\n```\n\n::: {.cell-output-display}\n![The horizontal axis indicates the distance from the empirical minimum variance portfolio weight, measured by the sum of the absolute deviations of the chosen portfolio from the benchmark.](constrained-optimization-and-backtesting_files/figure-html/fig-1701-1.png){#fig-1701 fig-alt='Title: Portfolio weights for different risk aversion and transaction cost. The figure shows four lines that indicate that rebalancing from the initial portfolio decreases for higher transaction costs and for higher risk aversion.' width=2100}\n:::\n:::\n\n\n@fig-1701 shows rebalancing from the initial portfolio (which we always set to the minimum variance portfolio weights in this example). The higher the transaction costs parameter $\\beta$, the smaller is the rebalancing from the initial portfolio. In addition, if risk aversion $\\gamma$ increases, the efficient portfolio is closer to the minimum variance portfolio weights such that the investor desires less rebalancing from the initial holdings. \n\n## Constrained Optimization\n\nNext, we introduce constraints to the above optimization procedure. Very often, typical constraints such as short-selling restrictions prevent analytical solutions for optimal portfolio weights (short-selling restrictions simply imply that negative weights are not allowed such that we require that $w_i \\geq 0\\quad \\forall i$). However, numerical optimization allows computing the solutions to such constrained problems. \n\nWe rely on the powerful `nloptr` package, which provides a common interface to a number of different optimization routines. 
In particular, we employ the Sequential Least-Squares Quadratic Programming (SLSQP) algorithm of @Kraft1994 because it is able to hand multiple equality and inequality constraints at the same time and typically used for problems where the objective function and the constraints are twice continuously differentiable. We hence have to provide the algorithm with the objective function and its gradient, as well as the constraints and their Jacobian. \n\nWe illustrate the use of the `nloptr()` function by replicating the analytical solutions for the minimum variance and efficient portfolio weights from above. Note that the equality constraint for both solutions is given by the requirement that the weights must sum up to one. In addition, we supply a vector of equal weights as an initial value for the algorithm in all applications. We verify that the output is equal to the above solution. Note that `near()` is a safe way to compare two vectors for pairwise equality. The alternative `==` is sensitive to small differences that may occur due to the representation of floating points on a computer, while `near()` has a built-in tolerance. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nw_initial <- rep(1 / n_industries, n_industries)\n\nobjective_mvp <- function(w) {\n 0.5 * t(w) %*% Sigma %*% w\n}\n\ngradient_mvp <- function(w) {\n Sigma %*% w\n}\n\nequality_constraint <- function(w) {\n sum(w) - 1\n}\n\njacobian_equality <- function(w) {\n rep(1, n_industries)\n}\n\noptions <- list(\n \"xtol_rel\"=1e-20, \n \"algorithm\" = \"NLOPT_LD_SLSQP\", \n \"maxeval\" = 10000\n)\n\nw_mvp_numerical <- nloptr(\n x0 = w_initial, \n eval_f = objective_mvp, \n eval_grad_f = gradient_mvp,\n eval_g_eq = equality_constraint,\n eval_jac_g_eq = jacobian_equality,\n opts = options\n)\n\nall(near(w_mvp, w_mvp_numerical$solution))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nobjective_efficient <- function(w) {\n 2 * 0.5 * t(w) %*% Sigma %*% w - sum((1 + mu) * w)\n}\n\ngradient_efficient <- function(w) {\n 2 * Sigma %*% w - (1 + mu)\n}\n\nw_efficient_numerical <- nloptr(\n x0 = w_initial, \n eval_f = objective_efficient, \n eval_grad_f = gradient_efficient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality,\n opts = options\n)\n\nall(near(w_efficient, w_efficient_numerical$solution))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\nThe result above shows that indeed the numerical procedure recovered the optimal weights for a scenario, where we already know the analytic solution. For more complex optimization routines, [R's optimization task view](https://cran.r-project.org/web/views/Optimization.html) provides an overview of the vast optimization landscape. \\index{Optimization}\n\nNext, we approach problems where no analytical solutions exist. First, we additionally impose short-sale constraints, which implies $N$ inequality constraints of the form $\\omega_i >=0$. 
We can implement the short sale constraints by imposing a vector of lower bounds `lb = rep(0, n_industries)`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nw_no_short_sale <- nloptr(\n x0 = w_initial, \n eval_f = objective_efficient, \n eval_grad_f = gradient_efficient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality,\n lb = rep(0, n_industries),\n opts = options\n)\n\nround(w_no_short_sale$solution, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 0.610 0.000 0.000 0.211 0.000 0.000 0.000 0.179 0.000 0.000\n```\n\n\n:::\n:::\n\n\nAs expected, the resulting portfolio weights are all positive (up to numerical precision). Typically, the holdings in the presence of short-sale constraints are concentrated among way fewer assets than for the unrestricted case. \nYou can verify that `sum(w_no_short_sale$solution)` returns 1. In other words: `nloptr()` provides the numerical solution to a portfolio choice problem for a mean-variance investor with risk aversion `gamma = 2`, where negative holdings are forbidden. \n\n`nloptr()` can also handle more complex problems. As an example, we show how to compute optimal weights, subject to the so-called [Regulation T-constraint,](https://en.wikipedia.org/wiki/Regulation_T) which requires that the sum of all absolute portfolio weights is smaller than 1.5, that is $\\sum_{i=1}^N |\\omega_i| \\leq 1.5$. \nThe constraint enforces that a maximum of 50 percent of the allocated wealth can be allocated to short positions, thus implying an initial margin requirement of 50 percent. Imposing such a margin requirement reduces portfolio risks because extreme portfolio weights are not attainable anymore. The implementation of Regulation-T rules is numerically interesting because the margin constraints imply a non-linear constraint on the portfolio weights. 
\\index{Regulation T}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreg_t <- 1.5\n\ninequality_constraint <- function(w) {\n sum(abs(w)) - reg_t\n}\n\njacobian_inequality <- function(w) {\n sign(w)\n}\n\nobjective_reg_t <- function(w) {\n - t(w) %*% (1 + mu) +\n 2 * 0.5 * t(w) %*% Sigma %*% w\n}\n\ngradient_reg_t <- function(w) {\n - (1 + mu) + 2 * Sigma %*% w\n}\n\nw_reg_t <- nloptr(\n x0 = w_initial,\n eval_f = objective_reg_t, \n eval_grad_f = gradient_reg_t,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality, \n eval_g_ineq = inequality_constraint, \n eval_jac_g_ineq = jacobian_inequality,\n opts = options\n)\n\nround(w_reg_t$solution, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 0.736 0.000 -0.134 0.264 0.000 -0.019 0.028 0.223 0.000\n[10] -0.097\n```\n\n\n:::\n:::\n\n\n@fig-1702 shows the optimal allocation weights across all 10 industries for the four different strategies considered so far: minimum variance, efficient portfolio with $\\gamma$ = 2, efficient portfolio with short-sale constraints, and the Regulation-T constrained portfolio.\\index{Graph!Bar chart}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n `No short-sale` = w_no_short_sale$solution,\n `Minimum Variance` = w_mvp,\n `Efficient portfolio` = compute_efficient_weight(Sigma, mu),\n `Regulation-T` = w_reg_t$solution,\n Industry = colnames(industry_returns)\n) |>\n pivot_longer(-Industry,\n names_to = \"Strategy\",\n values_to = \"weights\"\n ) |>\n ggplot(aes(\n fill = Strategy,\n y = weights,\n x = Industry\n )) +\n geom_bar(position = \"dodge\", stat = \"identity\") +\n coord_flip() +\n labs(\n y = \"Allocation weight\", fill = NULL,\n title = \"Optimal allocations for different strategies\"\n ) +\n scale_y_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![Optimal allocation weights for the 10 industry portfolios and the 4 different allocation strategies.](constrained-optimization-and-backtesting_files/figure-html/fig-1702-1.png){#fig-1702 fig-alt='Title: Optimal allocations for different strategies. The figure shows the portfolio weights for the four different strategies across the 10 different industries. The figures indicate extreme long and short positions for the efficient portfolio.' width=2100}\n:::\n:::\n\n\nThe results clearly indicate the effect of imposing additional constraints: the extreme holdings the investor implements if she follows the (theoretically optimal) efficient portfolio vanish under, e.g., the Regulation-T constraint.\nYou may wonder why an investor would deviate from what is theoretically the optimal portfolio by imposing potentially arbitrary constraints. \nThe short answer is: the *efficient portfolio* is only efficient if the true parameters of the data generating process correspond to the estimated parameters $\\hat\\Sigma$ and $\\hat\\mu$. \nEstimation uncertainty may thus lead to inefficient allocations. By imposing restrictions, we implicitly shrink the set of possible weights and prevent extreme allocations, which could result from *error-maximization* due to estimation uncertainty [@Jagannathan2003].\n\nBefore we move on, we want to propose a final allocation strategy, which reflects a somewhat more realistic structure of transaction costs instead of the quadratic specification used above. The function below computes efficient portfolio weights while adjusting for transaction costs of the form $\\beta\\sum_{i=1}^N |(\\omega_{i, t+1} - \\omega_{i, t^+})|$. 
No closed-form solution exists, and we rely on non-linear optimization procedures.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_efficient_weight_L1_TC <- function(mu,\n Sigma,\n gamma,\n beta,\n initial_weights) {\n objective <- function(w) {\n -t(w) %*% mu +\n gamma / 2 * t(w) %*% Sigma %*% w +\n (beta / 10000) / 2 * sum(abs(w - initial_weights))\n }\n \n gradient <- function(w) {\n -mu + gamma * Sigma %*% w + \n (beta / 10000) * 0.5 * sign(w - initial_weights)\n }\n\n w_optimal <- nloptr(\n x0 = initial_weights,\n eval_f = objective, \n eval_grad_f = gradient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality, \n opts = options\n )\n\n return(w_optimal$solution)\n}\n```\n:::\n\n\n## Out-of-Sample Backtesting\n\nFor the sake of simplicity, we committed one fundamental error in computing portfolio weights above: we used the full sample of the data to determine the optimal allocation [@Harvey2019]. To implement this strategy at the beginning of the 2000s, you will need to know how the returns will evolve until 2021. \\index{Backtesting} \\index{Performance evaluation}\\index{Out-of-sample}\nWhile interesting from a methodological point of view, we cannot evaluate the performance of the portfolios in a reasonable out-of-sample fashion. We do so next in a backtesting application for three strategies. For the backtest, we recompute optimal weights just based on past available data. \n\nThe few lines below define the general setup. We consider 120 periods from the past to update the parameter estimates before recomputing portfolio weights. Then, we update portfolio weights which is costly and affects the performance. The portfolio weights determine the portfolio return. A period later, the current portfolio weights have changed and form the foundation for transaction costs incurred in the next period. We consider three different competing strategies: the mean-variance efficient portfolio, the mean-variance efficient portfolio with ex-ante adjustment for transaction costs, and the naive portfolio, which allocates wealth equally across the different assets.\\index{Transaction cost}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwindow_length <- 120\nperiods <- nrow(industry_returns) - window_length\n\nbeta <- 50\ngamma <- 2\n\nperformance_values <- matrix(NA,\n nrow = periods,\n ncol = 3\n)\ncolnames(performance_values) <- c(\"raw_return\", \"turnover\", \"net_return\")\n\nperformance_values <- list(\n \"MV (TC)\" = performance_values,\n \"Naive\" = performance_values,\n \"MV\" = performance_values\n)\n\nw_prev_1 <- w_prev_2 <- w_prev_3 <- rep(\n 1 / n_industries,\n n_industries\n)\n```\n:::\n\n\nWe also define two helper functions: one to adjust the weights due to returns and one for performance evaluation, where we compute realized returns net of transaction costs. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nadjust_weights <- function(w, next_return) {\n w_prev <- 1 + w * next_return\n as.numeric(w_prev / sum(as.vector(w_prev)))\n}\n\nevaluate_performance <- function(w, w_previous, next_return, beta = 50) {\n raw_return <- as.matrix(next_return) %*% w\n turnover <- sum(abs(w - w_previous))\n net_return <- raw_return - beta / 10000 * turnover\n c(raw_return, turnover, net_return)\n}\n```\n:::\n\n\nThe following code chunk performs a rolling-window estimation, which we implement in a loop. In each period, the estimation window contains the returns available up to the current period. 
\nNote that we use the sample variance-covariance matrix and ignore the estimation of $\\hat\\mu$ entirely, but you might use more advanced estimators in practice. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (p in 1:periods) {\n returns_window <- industry_returns[p:(p + window_length - 1), ]\n next_return <- industry_returns[p + window_length, ] |> as.matrix()\n\n Sigma <- cov(returns_window)\n mu <- 0 * colMeans(returns_window)\n\n # Transaction-cost adjusted portfolio\n w_1 <- compute_efficient_weight_L1_TC(\n mu = mu,\n Sigma = Sigma,\n beta = beta,\n gamma = gamma,\n initial_weights = w_prev_1\n )\n\n performance_values[[1]][p, ] <- evaluate_performance(w_1,\n w_prev_1,\n next_return,\n beta = beta\n )\n\n w_prev_1 <- adjust_weights(w_1, next_return)\n\n # Naive portfolio\n w_2 <- rep(1 / n_industries, n_industries)\n\n performance_values[[2]][p, ] <- evaluate_performance(\n w_2,\n w_prev_2,\n next_return\n )\n\n w_prev_2 <- adjust_weights(w_2, next_return)\n\n # Mean-variance efficient portfolio (w/o transaction costs)\n w_3 <- compute_efficient_weight(\n Sigma = Sigma,\n mu = mu,\n gamma = gamma\n )\n\n performance_values[[3]][p, ] <- evaluate_performance(\n w_3,\n w_prev_3,\n next_return\n )\n\n w_prev_3 <- adjust_weights(w_3, next_return)\n}\n```\n:::\n\n\nFinally, we get to the evaluation of the portfolio strategies *net-of-transaction costs*. Note that we compute annualized returns and standard deviations. \\index{Sharpe Ratio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nperformance <- lapply(\n performance_values,\n as_tibble\n) |>\n bind_rows(.id = \"strategy\")\n\nlength_year <- 12\n\nperformance_table <- performance |>\n group_by(Strategy = strategy) |>\n summarize(\n Mean = length_year * mean(100 * net_return),\n SD = sqrt(length_year) * sd(100 * net_return),\n `Sharpe ratio` = if_else(Mean > 0,\n Mean / SD,\n NA_real_\n ),\n Turnover = 100 * mean(turnover)\n )\n\nperformance_table |> \n mutate(across(-Strategy, ~round(., 4)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 5\n Strategy Mean SD `Sharpe ratio` Turnover\n \n1 MV -0.897 12.6 NA 211. \n2 MV (TC) 11.9 15.2 0.780 0 \n3 Naive 11.8 15.2 0.779 0.234\n```\n\n\n:::\n:::\n\n\nThe results clearly speak against mean-variance optimization. Turnover is huge when the investor only considers her portfolio's expected return and variance. Effectively, the mean-variance portfolio generates a *negative* annualized return after adjusting for transaction costs. At the same time, the naive portfolio turns out to perform very well. In fact, the performance gains of the transaction-cost adjusted mean-variance portfolio are small. The out-of-sample Sharpe ratio is slightly higher than for the naive portfolio. Note the extreme effect of turnover penalization on turnover: *MV (TC)* effectively resembles a buy-and-hold strategy which only updates the portfolio once the estimated parameters $\\hat\\mu_t$ and $\\hat\\Sigma_t$indicate that the current allocation is too far away from the optimal theoretical portfolio. \n\n## Exercises\n\n1. Consider the portfolio choice problem for transaction-cost adjusted certainty equivalent maximization with risk aversion parameter $\\gamma$ \n$$\\omega_{t+1} ^* = \\arg\\max_{\\omega \\in \\mathbb{R}^N, \\iota'\\omega = 1} \\omega'\\mu - \\nu_t (\\omega, \\beta) - \\frac{\\gamma}{2}\\omega'\\Sigma\\omega$$\nwhere $\\Sigma$ and $\\mu$ are (estimators of) the variance-covariance matrix of the returns and the vector of expected returns. 
Assume for now that transaction costs are quadratic in rebalancing *and* proportional to stock illiquidity such that \n$$\\nu_t\\left(\\omega, B\\right) = \\frac{\\beta}{2} \\left(\\omega - \\omega_{t^+}\\right)'B\\left(\\omega - \\omega_{t^+}\\right)$$ where $B = \\text{diag}(ill_1, \\ldots, ill_N)$ is a diagonal matrix, where $ill_1, \\ldots, ill_N$. Derive a closed-form solution for the mean-variance efficient portfolio $\\omega_{t+1} ^*$ based on the transaction cost specification above. Discuss the effect of illiquidity $ill_i$ on the individual portfolio weights relative to an investor that myopically ignores transaction costs in her decision. \n1. Use the solution from the previous exercise to update the function `compute_efficient_weight()` such that you can compute optimal weights conditional on a matrix $B$ with illiquidity measures. \n1. Illustrate the evolution of the *optimal* weights from the naive portfolio to the efficient portfolio in the mean-standard deviation diagram.\n1. Is it always optimal to choose the same $\\beta$ in the optimization problem than the value used in evaluating the portfolio performance? In other words: can it be optimal to choose theoretically sub-optimal portfolios based on transaction cost considerations that do not reflect the actual incurred costs? Evaluate the out-of-sample Sharpe ratio after transaction costs for a range of different values of imposed $\\beta$ values.", + "markdown": "---\ntitle: Constrained Optimization and Backtesting\naliases:\n - ../constrained-optimization-and-backtesting.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Constrained Optimization and Backtesting with R\n description-meta: Conduct portfolio backtesting with transaction costs and no-shortselling constraints using the programming language R. \n---\n\n\n\\index{Backtesting} In this chapter, we conduct portfolio backtesting in a realistic setting by including transaction costs and investment constraints such as no-short-selling rules. \\index{Short-selling}\nWe start with standard mean-variance efficient portfolios and introduce constraints in a step-by-step manner. To do so, we rely on numerical optimization procedures in R.\\index{Efficient portfolio} We conclude the chapter by providing an out-of-sample backtesting procedure for the different strategies that we introduce in this chapter. \n\nThroughout this chapter, we use the following R packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(nloptr)\n```\n:::\n\n\nCompared to previous chapters, we introduce the `nloptr` package [@nloptr] to perform numerical constrained optimization for portfolio choice problems.\\index{Optimization}\\index{Portfolio choice}\n\n## Data Preparation\n\nWe start by loading the required data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd). For simplicity, we restrict our investment universe to the monthly Fama-French industry portfolio returns in the following application. 
\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nindustry_returns <- tbl(tidy_finance, \"industries_ff_monthly\") |>\n select(-month) |>\n collect() |> \n drop_na()\n```\n:::\n\n\n## Recap of Portfolio Choice \n\nA common objective for portfolio optimization is to find mean-variance efficient portfolio weights, i.e., the allocation which delivers the lowest possible return variance for a given minimum level of expected returns. \nIn the most extreme case, where the investor is only concerned about portfolio variance, she may choose to implement the minimum variance portfolio (MVP) weights which are given by the solution to \n$$\\omega_\\text{mvp} = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1$$\nwhere $\\Sigma$ is the $(N \\times N)$ covariance matrix of the returns. The optimal weights $\\omega_\\text{mvp}$ can be found analytically and are $\\omega_\\text{mvp} = \\frac{\\Sigma^{-1}\\iota}{\\iota'\\Sigma^{-1}\\iota}$. In terms of code, the math is equivalent to the following chunk. \\index{Minimum variance portfolio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_industries <- ncol(industry_returns)\n\nSigma <- cov(industry_returns)\nw_mvp <- solve(Sigma) %*% rep(1, n_industries)\nw_mvp <- as.vector(w_mvp / sum(w_mvp))\n\nmu <- colMeans(industry_returns)\n```\n:::\n\n\nNext, consider an investor who aims to achieve minimum variance *given a required expected portfolio return* $\\bar{\\mu}$ such that she chooses\n$$\\omega_\\text{eff}({\\bar{\\mu}}) =\\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1 \\text{ and } \\omega'\\mu \\geq \\bar{\\mu}.$$\nWe leave it as an exercise below to show that the portfolio choice problem can equivalently be formulated for an investor with mean-variance preferences and risk aversion factor $\\gamma$. That means the investor aims to choose portfolio weights as the solution to\n$$ \\omega^*_\\gamma = \\arg\\max \\omega' \\mu - \\frac{\\gamma}{2}\\omega'\\Sigma \\omega\\quad \\text{ s.t. } \\omega'\\iota = 1.$$\nThe solution to the optimal portfolio choice problem is:\n$$\\omega^*_{\\gamma} = \\frac{1}{\\gamma}\\left(\\Sigma^{-1} - \\frac{1}{\\iota' \\Sigma^{-1}\\iota }\\Sigma^{-1}\\iota\\iota' \\Sigma^{-1} \\right) \\mu + \\frac{1}{\\iota' \\Sigma^{-1} \\iota }\\Sigma^{-1} \\iota.$$\nEmpirically, this classical solution imposes many problems. \nIn particular, the estimates of $\\mu$ are noisy over short horizons, the ($N \\times N$) matrix $\\Sigma$ contains $N(N-1)/2$ distinct elements and thus, estimation error is huge. \nSeminal papers on the effect of ignoring estimation uncertainty, among others, are @Brown1976, @Jobson1980, @Jorion1986, and @Chopra1993.\n\nEven worse, if the asset universe contains more assets than available time periods $(N > T)$, the sample covariance matrix is no longer positive definite such that the inverse $\\Sigma^{-1}$ does not exist anymore. \nTo address estimation issues for vast-dimensional covariance matrices, regularization techniques are a popular tool [see, e.g., @Ledoit2003; @Ledoit2004; @Ledoit2012; @Fan2008].\\index{Covariance}\n\nWhile the uncertainty associated with estimated parameters is challenging, the data-generating process is also unknown to the investor. 
In other words, model uncertainty reflects that it is ex-ante not even clear which parameters require estimation (for instance, if returns are driven by a factor model, selecting the universe of relevant factors imposes model uncertainty). @Wang2005 and @Garlappi2007 provide theoretical analysis on optimal portfolio choice under model *and* estimation uncertainty. In the most extreme case, @Pflug2012 shows that the naive portfolio which allocates equal wealth to all assets is the optimal choice for an investor averse to model uncertainty.\\index{Model uncertainty}\n\nOn top of the estimation uncertainty, *transaction costs* are a major concern.\\index{Transaction cost} \nRebalancing portfolios is costly, and, therefore, the optimal choice should depend on the investor's current holdings. In the presence of transaction costs, the benefits of reallocating wealth may be smaller than the costs associated with turnover. This aspect has been investigated theoretically, among others, for one risky asset by @Magill1976 and @Davis1990. Subsequent extensions to the case with multiple assets have been proposed by @Balduzzi1999 and @Balduzzi2000. More recent papers on empirical approaches which explicitly account for transaction costs include @Garleanu2013, and @DeMiguel2014, and @DeMiguel2015. \n\n## Estimation Uncertainty and Transaction Costs\n\nThe empirical evidence regarding the performance of a mean-variance optimization procedure in which you simply plug in some sample estimates $\\hat \\mu$ and $\\hat \\Sigma$ can be summarized rather briefly: mean-variance optimization performs poorly! The literature discusses many proposals to overcome these empirical issues. For instance, one may impose some form of regularization of $\\Sigma$, rely on Bayesian priors inspired by theoretical asset pricing models [@Kan2007] or use high-frequency data to improve forecasting [@Hautsch2015].\nOne unifying framework that works easily, effectively (even for large dimensions), and is purely inspired by economic arguments is an ex-ante adjustment for transaction costs [@Hautsch2019]. \n\nAssume that returns are from a multivariate normal distribution with mean $\\mu$ and variance-covariance matrix $\\Sigma$, $N(\\mu,\\Sigma)$. Additionally, we assume quadratic transaction costs which penalize rebalancing such that $$\n\\begin{aligned}\n\\nu\\left(\\omega_{t+1},\\omega_{t^+}, \\beta\\right) = \\frac{\\beta}{2} \\left(\\omega_{t+1} - \\omega_{t^+}\\right)'\\left(\\omega_{t+1}- \\omega_{t^+}\\right),\\end{aligned}$$\nwith cost parameter $\\beta>0$ and $\\omega_{t^+} = {\\omega_t \\circ (1 +r_{t})}/{\\iota' (\\omega_t \\circ (1 + r_{t}))}$. $\\omega_{t^+}$ denotes the portfolio weights just before rebalancing. Note that $\\omega_{t^+}$ differs mechanically from $\\omega_t$ due to the returns in the past period.\nIntuitively, transaction costs penalize portfolio performance when the portfolio is shifted from the current holdings $\\omega_{t^+}$ to a new allocation $\\omega_{t+1}$. \nIn this setup, transaction costs do not increase linearly. Instead, larger rebalancing is penalized more heavily than small adjustments. \nThen, the optimal portfolio choice for an investor with mean variance preferences is\n$$\\begin{aligned}\\omega_{t+1} ^* &= \\arg\\max \\omega'\\mu - \\nu_t (\\omega,\\omega_{t^+}, \\beta) - \\frac{\\gamma}{2}\\omega'\\Sigma\\omega \\text{ s.t. 
} \\iota'\\omega = 1\\\\\n&=\\arg\\max\n\\omega'\\mu^* - \\frac{\\gamma}{2}\\omega'\\Sigma^* \\omega \\text{ s.t.} \\iota'\\omega=1,\\end{aligned}$$\nwhere\n$$\\mu^*=\\mu+\\beta \\omega_{t^+} \\quad \\text{and} \\quad \\Sigma^*=\\Sigma + \\frac{\\beta}{\\gamma} I_N.$$\nAs a result, adjusting for transaction costs implies a standard mean-variance optimal portfolio choice with adjusted return parameters $\\Sigma^*$ and $\\mu^*$: $$\\omega^*_{t+1} = \\frac{1}{\\gamma}\\left(\\Sigma^{*-1} - \\frac{1}{\\iota' \\Sigma^{*-1}\\iota }\\Sigma^{*-1}\\iota\\iota' \\Sigma^{*-1} \\right) \\mu^* + \\frac{1}{\\iota' \\Sigma^{*-1} \\iota }\\Sigma^{*-1} \\iota.$$\n\nAn alternative formulation of the optimal portfolio can be derived as follows: \n$$\\omega_{t+1} ^*=\\arg\\max\n\\omega'\\left(\\mu+\\beta\\left(\\omega_{t^+} - \\frac{1}{N}\\iota\\right)\\right) - \\frac{\\gamma}{2}\\omega'\\Sigma^* \\omega \\text{ s.t. } \\iota'\\omega=1.$$\nThe optimal weights correspond to a mean-variance portfolio, where the vector of expected returns is such that assets that currently exhibit a higher weight are considered as delivering a higher expected return. \n\n## Optimal Portfolio Choice\n\nThe function below implements the efficient portfolio weight in its general form, allowing for transaction costs (conditional on the holdings *before* reallocation). \nFor $\\beta=0$, the computation resembles the standard mean-variance efficient framework. `gamma` denotes the coefficient of risk aversion $\\gamma$, `beta` is the transaction cost parameter $\\beta$ and `w_prev` are the weights before rebalancing $\\omega_{t^+}$. \n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_efficient_weight <- function(Sigma,\n mu,\n gamma = 2,\n beta = 0, # transaction costs\n w_prev = rep(\n 1 / ncol(Sigma),\n ncol(Sigma)\n )) {\n iota <- rep(1, ncol(Sigma))\n Sigma_processed <- Sigma + beta / gamma * diag(ncol(Sigma))\n mu_processed <- mu + beta * w_prev\n\n Sigma_inverse <- solve(Sigma_processed)\n\n w_mvp <- Sigma_inverse %*% iota\n w_mvp <- as.vector(w_mvp / sum(w_mvp))\n w_opt <- w_mvp + 1 / gamma *\n (Sigma_inverse - w_mvp %*% t(iota) %*% Sigma_inverse) %*%\n mu_processed\n \n return(as.vector(w_opt))\n}\n\nw_efficient <- compute_efficient_weight(Sigma, mu)\nround(w_efficient, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1.630 0.090 -1.356 0.687 0.333 -0.412 0.504 0.402 -0.219\n[10] -0.659\n```\n\n\n:::\n:::\n\n\nThe portfolio weights above indicate the efficient portfolio for an investor with risk aversion coefficient $\\gamma=2$ in absence of transaction costs. Some of the positions are negative which implies short-selling, most of the positions are rather extreme. For instance, a position of $-1$ implies that the investor takes a short position worth her entire wealth to lever long positions in other assets. \\index{Short-selling}\nWhat is the effect of transaction costs or different levels of risk aversion on the optimal portfolio choice? 
The following few lines of code analyze the distance between the minimum variance portfolio and the portfolio implemented by the investor for different values of the transaction cost parameter $\\beta$ and risk aversion $\\gamma$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngammas <- c(2, 4, 8, 20)\nbetas <- 20 * qexp((1:99) / 100)\n \ntransaction_costs <- expand_grid(\n gamma = gammas,\n beta = betas\n) |>\n mutate(\n weights = map2(\n .x = gamma,\n .y = beta,\n ~ compute_efficient_weight(Sigma,\n mu,\n gamma = .x,\n beta = .y / 10000,\n w_prev = w_mvp\n )\n ),\n concentration = map_dbl(weights, ~ sum(abs(. - w_mvp)))\n )\n```\n:::\n\n\nThe code chunk above computes the optimal weights in the presence of transaction costs for different values of $\\beta$ and $\\gamma$ but with the same initial allocation, the theoretically optimal minimum variance portfolio. \nStarting from the initial allocation, the investor chooses her optimal allocation along the efficient frontier to reflect her own risk preferences. \nIf transaction costs were absent, the investor would simply implement the mean-variance efficient allocation. If transaction costs make it costly to rebalance, her optimal portfolio choice reflects a shift toward the efficient portfolio, whereas her current portfolio anchors her investment.\\index{Graph!Comparative statics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntransaction_costs |>\n mutate(risk_aversion = as_factor(gamma)) |>\n ggplot(aes(\n x = beta,\n y = concentration,\n color = risk_aversion,\n linetype = risk_aversion\n )) +\n geom_line() +\n guides(linetype = \"none\") + \n labs(\n x = \"Transaction cost parameter\",\n y = \"Distance from MVP\",\n color = \"Risk aversion\",\n title = \"Portfolio weights for different risk aversion and transaction cost\"\n )\n```\n\n::: {.cell-output-display}\n![The vertical axis indicates the distance from the empirical minimum variance portfolio weight, measured by the sum of the absolute deviations of the chosen portfolio from the benchmark.](constrained-optimization-and-backtesting_files/figure-html/fig-1701-1.png){#fig-1701 fig-alt='Title: Portfolio weights for different risk aversion and transaction cost. The figure shows four lines that indicate that rebalancing from the initial portfolio decreases for higher transaction costs and for higher risk aversion.' width=2100}\n:::\n:::\n\n\n@fig-1701 shows rebalancing from the initial portfolio (which we always set to the minimum variance portfolio weights in this example). The higher the transaction cost parameter $\\beta$, the smaller the rebalancing from the initial portfolio. In addition, if risk aversion $\\gamma$ increases, the efficient portfolio is closer to the minimum variance portfolio weights such that the investor desires less rebalancing from the initial holdings. \n\n## Constrained Optimization\n\nNext, we introduce constraints to the above optimization procedure. Very often, typical constraints such as short-selling restrictions prevent analytical solutions for optimal portfolio weights (short-selling restrictions simply imply that negative weights are not allowed such that we require that $w_i \\geq 0\\quad \\forall i$). However, numerical optimization allows computing the solutions to such constrained problems. \n\nWe rely on the powerful `nloptr` package, which provides a common interface to a number of different optimization routines. 
In particular, we employ the Sequential Least-Squares Quadratic Programming (SLSQP) algorithm of @Kraft1994 because it is able to handle multiple equality and inequality constraints at the same time and is typically used for problems where the objective function and the constraints are twice continuously differentiable. We hence have to provide the algorithm with the objective function and its gradient, as well as the constraints and their Jacobian. \n\nWe illustrate the use of the `nloptr()` function by replicating the analytical solutions for the minimum variance and efficient portfolio weights from above. Note that the equality constraint for both solutions is given by the requirement that the weights must sum up to one. In addition, we supply a vector of equal weights as an initial value for the algorithm in all applications. We verify that the output is equal to the above solution. Note that `near()` is a safe way to compare two vectors for pairwise equality. The alternative `==` is sensitive to small differences that may occur due to the representation of floating points on a computer, while `near()` has a built-in tolerance. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nw_initial <- rep(1 / n_industries, n_industries)\n\nobjective_mvp <- function(w) {\n 0.5 * t(w) %*% Sigma %*% w\n}\n\ngradient_mvp <- function(w) {\n Sigma %*% w\n}\n\nequality_constraint <- function(w) {\n sum(w) - 1\n}\n\njacobian_equality <- function(w) {\n rep(1, n_industries)\n}\n\noptions <- list(\n \"xtol_rel\"=1e-20, \n \"algorithm\" = \"NLOPT_LD_SLSQP\", \n \"maxeval\" = 10000\n)\n\nw_mvp_numerical <- nloptr(\n x0 = w_initial, \n eval_f = objective_mvp, \n eval_grad_f = gradient_mvp,\n eval_g_eq = equality_constraint,\n eval_jac_g_eq = jacobian_equality,\n opts = options\n)\n\nall(near(w_mvp, w_mvp_numerical$solution))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nobjective_efficient <- function(w) {\n 2 * 0.5 * t(w) %*% Sigma %*% w - sum((1 + mu) * w)\n}\n\ngradient_efficient <- function(w) {\n 2 * Sigma %*% w - (1 + mu)\n}\n\nw_efficient_numerical <- nloptr(\n x0 = w_initial, \n eval_f = objective_efficient, \n eval_grad_f = gradient_efficient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality,\n opts = options\n)\n\nall(near(w_efficient, w_efficient_numerical$solution))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\nThe result above shows that the numerical procedure indeed recovered the optimal weights for a scenario where we already know the analytic solution. For more complex optimization routines, [R's optimization task view](https://cran.r-project.org/web/views/Optimization.html) provides an overview of the vast optimization landscape. \\index{Optimization}\n\nNext, we approach problems where no analytical solutions exist. First, we additionally impose short-sale constraints, which implies $N$ inequality constraints of the form $\\omega_i \\geq 0$. 
We can implement the short sale constraints by imposing a vector of lower bounds `lb = rep(0, n_industries)`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nw_no_short_sale <- nloptr(\n x0 = w_initial, \n eval_f = objective_efficient, \n eval_grad_f = gradient_efficient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality,\n lb = rep(0, n_industries),\n opts = options\n)\n\nround(w_no_short_sale$solution, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 0.610 0.000 0.000 0.211 0.000 0.000 0.000 0.179 0.000 0.000\n```\n\n\n:::\n:::\n\n\nAs expected, the resulting portfolio weights are all positive (up to numerical precision). Typically, the holdings in the presence of short-sale constraints are concentrated among way fewer assets than for the unrestricted case. \nYou can verify that `sum(w_no_short_sale$solution)` returns 1. In other words: `nloptr()` provides the numerical solution to a portfolio choice problem for a mean-variance investor with risk aversion `gamma = 2`, where negative holdings are forbidden. \n\n`nloptr()` can also handle more complex problems. As an example, we show how to compute optimal weights, subject to the so-called [Regulation T-constraint,](https://en.wikipedia.org/wiki/Regulation_T) which requires that the sum of all absolute portfolio weights is smaller than 1.5, that is $\\sum_{i=1}^N |\\omega_i| \\leq 1.5$. \nThe constraint enforces that a maximum of 50 percent of the allocated wealth can be allocated to short positions, thus implying an initial margin requirement of 50 percent. Imposing such a margin requirement reduces portfolio risks because extreme portfolio weights are not attainable anymore. The implementation of Regulation-T rules is numerically interesting because the margin constraints imply a non-linear constraint on the portfolio weights. 
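Specifically, the constraint $\\sum_{i=1}^N |\\omega_i| \\leq 1.5$ is not differentiable at $\\omega_i = 0$; away from zero, its Jacobian is simply the vector of signs of the portfolio weights, which is what we supply to the optimizer below. As a minimal sanity check, the following sketch compares this analytical Jacobian with a central finite-difference approximation, evaluated for illustration at the minimum variance weights `w_mvp`; the step size `eps` is an arbitrary choice.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Minimal sanity check: finite-difference Jacobian of sum(abs(w))\n# Assumes all entries of w_mvp are bounded away from zero (|w_i| >> eps)\neps <- 1e-6\n\nfinite_difference_jacobian <- map_dbl(\n seq_along(w_mvp),\n function(i) {\n w_up <- w_down <- w_mvp\n w_up[i] <- w_up[i] + eps\n w_down[i] <- w_down[i] - eps\n (sum(abs(w_up)) - sum(abs(w_down))) / (2 * eps)\n }\n)\n\nall(near(finite_difference_jacobian, sign(w_mvp), tol = 1e-4))\n```\n:::\n\n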
\\index{Regulation T}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreg_t <- 1.5\n\ninequality_constraint <- function(w) {\n sum(abs(w)) - reg_t\n}\n\njacobian_inequality <- function(w) {\n sign(w)\n}\n\nobjective_reg_t <- function(w) {\n - t(w) %*% (1 + mu) +\n 2 * 0.5 * t(w) %*% Sigma %*% w\n}\n\ngradient_reg_t <- function(w) {\n - (1 + mu) + 2 * Sigma %*% w\n}\n\nw_reg_t <- nloptr(\n x0 = w_initial,\n eval_f = objective_reg_t, \n eval_grad_f = gradient_reg_t,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality, \n eval_g_ineq = inequality_constraint, \n eval_jac_g_ineq = jacobian_inequality,\n opts = options\n)\n\nround(w_reg_t$solution, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 0.736 0.000 -0.135 0.264 0.000 -0.019 0.028 0.223 0.000\n[10] -0.096\n```\n\n\n:::\n:::\n\n\n@fig-1702 shows the optimal allocation weights across all 10 industries for the four different strategies considered so far: minimum variance, efficient portfolio with $\\gamma$ = 2, efficient portfolio with short-sale constraints, and the Regulation-T constrained portfolio.\\index{Graph!Bar chart}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n `No short-sale` = w_no_short_sale$solution,\n `Minimum Variance` = w_mvp,\n `Efficient portfolio` = compute_efficient_weight(Sigma, mu),\n `Regulation-T` = w_reg_t$solution,\n Industry = colnames(industry_returns)\n) |>\n pivot_longer(-Industry,\n names_to = \"Strategy\",\n values_to = \"weights\"\n ) |>\n ggplot(aes(\n fill = Strategy,\n y = weights,\n x = Industry\n )) +\n geom_bar(position = \"dodge\", stat = \"identity\") +\n coord_flip() +\n labs(\n y = \"Allocation weight\", fill = NULL,\n title = \"Optimal allocations for different strategies\"\n ) +\n scale_y_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![Optimal allocation weights for the 10 industry portfolios and the 4 different allocation strategies.](constrained-optimization-and-backtesting_files/figure-html/fig-1702-1.png){#fig-1702 fig-alt='Title: Optimal allocations for different strategies. The figure shows the portfolio weights for the four different strategies across the 10 different industries. The figures indicate extreme long and short positions for the efficient portfolio.' width=2100}\n:::\n:::\n\n\nThe results clearly indicate the effect of imposing additional constraints: the extreme holdings the investor implements if she follows the (theoretically optimal) efficient portfolio vanish under, e.g., the Regulation-T constraint.\nYou may wonder why an investor would deviate from what is theoretically the optimal portfolio by imposing potentially arbitrary constraints. \nThe short answer is: the *efficient portfolio* is only efficient if the true parameters of the data generating process correspond to the estimated parameters $\\hat\\Sigma$ and $\\hat\\mu$. \nEstimation uncertainty may thus lead to inefficient allocations. By imposing restrictions, we implicitly shrink the set of possible weights and prevent extreme allocations, which could result from *error-maximization* due to estimation uncertainty [@Jagannathan2003].\n\nBefore we move on, we want to propose a final allocation strategy, which reflects a somewhat more realistic structure of transaction costs instead of the quadratic specification used above. The function below computes efficient portfolio weights while adjusting for transaction costs of the form $\\beta\\sum_{i=1}^N |(\\omega_{i, t+1} - \\omega_{i, t^+})|$. 
No closed-form solution exists, and we rely on non-linear optimization procedures.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_efficient_weight_L1_TC <- function(mu,\n Sigma,\n gamma,\n beta,\n initial_weights) {\n objective <- function(w) {\n -t(w) %*% mu +\n gamma / 2 * t(w) %*% Sigma %*% w +\n (beta / 10000) / 2 * sum(abs(w - initial_weights))\n }\n \n gradient <- function(w) {\n -mu + gamma * Sigma %*% w + \n (beta / 10000) * 0.5 * sign(w - initial_weights)\n }\n\n w_optimal <- nloptr(\n x0 = initial_weights,\n eval_f = objective, \n eval_grad_f = gradient,\n eval_g_eq = equality_constraint, \n eval_jac_g_eq = jacobian_equality, \n opts = options\n )\n\n return(w_optimal$solution)\n}\n```\n:::\n\n\n## Out-of-Sample Backtesting\n\nFor the sake of simplicity, we committed one fundamental error in computing portfolio weights above: we used the full sample of the data to determine the optimal allocation [@Harvey2019]. To implement this strategy at the beginning of the 2000s, you will need to know how the returns will evolve until 2021. \\index{Backtesting} \\index{Performance evaluation}\\index{Out-of-sample}\nWhile interesting from a methodological point of view, we cannot evaluate the performance of the portfolios in a reasonable out-of-sample fashion. We do so next in a backtesting application for three strategies. For the backtest, we recompute optimal weights just based on past available data. \n\nThe few lines below define the general setup. We consider 120 periods from the past to update the parameter estimates before recomputing portfolio weights. Then, we update portfolio weights which is costly and affects the performance. The portfolio weights determine the portfolio return. A period later, the current portfolio weights have changed and form the foundation for transaction costs incurred in the next period. We consider three different competing strategies: the mean-variance efficient portfolio, the mean-variance efficient portfolio with ex-ante adjustment for transaction costs, and the naive portfolio, which allocates wealth equally across the different assets.\\index{Transaction cost}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwindow_length <- 120\nperiods <- nrow(industry_returns) - window_length\n\nbeta <- 50\ngamma <- 2\n\nperformance_values <- matrix(NA,\n nrow = periods,\n ncol = 3\n)\ncolnames(performance_values) <- c(\"raw_return\", \"turnover\", \"net_return\")\n\nperformance_values <- list(\n \"MV (TC)\" = performance_values,\n \"Naive\" = performance_values,\n \"MV\" = performance_values\n)\n\nw_prev_1 <- w_prev_2 <- w_prev_3 <- rep(\n 1 / n_industries,\n n_industries\n)\n```\n:::\n\n\nWe also define two helper functions: one to adjust the weights due to returns and one for performance evaluation, where we compute realized returns net of transaction costs. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nadjust_weights <- function(w, next_return) {\n w_prev <- 1 + w * next_return\n as.numeric(w_prev / sum(as.vector(w_prev)))\n}\n\nevaluate_performance <- function(w, w_previous, next_return, beta = 50) {\n raw_return <- as.matrix(next_return) %*% w\n turnover <- sum(abs(w - w_previous))\n net_return <- raw_return - beta / 10000 * turnover\n c(raw_return, turnover, net_return)\n}\n```\n:::\n\n\nThe following code chunk performs a rolling-window estimation, which we implement in a loop. In each period, the estimation window contains the returns available up to the current period. 
\nNote that we use the sample variance-covariance matrix and ignore the estimation of $\\hat\\mu$ entirely, but you might use more advanced estimators in practice. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (p in 1:periods) {\n returns_window <- industry_returns[p:(p + window_length - 1), ]\n next_return <- industry_returns[p + window_length, ] |> as.matrix()\n\n Sigma <- cov(returns_window)\n mu <- 0 * colMeans(returns_window)\n\n # Transaction-cost adjusted portfolio\n w_1 <- compute_efficient_weight_L1_TC(\n mu = mu,\n Sigma = Sigma,\n beta = beta,\n gamma = gamma,\n initial_weights = w_prev_1\n )\n\n performance_values[[1]][p, ] <- evaluate_performance(w_1,\n w_prev_1,\n next_return,\n beta = beta\n )\n\n w_prev_1 <- adjust_weights(w_1, next_return)\n\n # Naive portfolio\n w_2 <- rep(1 / n_industries, n_industries)\n\n performance_values[[2]][p, ] <- evaluate_performance(\n w_2,\n w_prev_2,\n next_return\n )\n\n w_prev_2 <- adjust_weights(w_2, next_return)\n\n # Mean-variance efficient portfolio (w/o transaction costs)\n w_3 <- compute_efficient_weight(\n Sigma = Sigma,\n mu = mu,\n gamma = gamma\n )\n\n performance_values[[3]][p, ] <- evaluate_performance(\n w_3,\n w_prev_3,\n next_return\n )\n\n w_prev_3 <- adjust_weights(w_3, next_return)\n}\n```\n:::\n\n\nFinally, we get to the evaluation of the portfolio strategies *net-of-transaction costs*. Note that we compute annualized returns and standard deviations. \\index{Sharpe Ratio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nperformance <- lapply(\n performance_values,\n as_tibble\n) |>\n bind_rows(.id = \"strategy\")\n\nlength_year <- 12\n\nperformance_table <- performance |>\n group_by(Strategy = strategy) |>\n summarize(\n Mean = length_year * mean(100 * net_return),\n SD = sqrt(length_year) * sd(100 * net_return),\n `Sharpe ratio` = if_else(Mean > 0,\n Mean / SD,\n NA_real_\n ),\n Turnover = 100 * mean(turnover)\n )\n\nperformance_table |> \n mutate(across(-Strategy, ~round(., 4)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 5\n Strategy Mean SD `Sharpe ratio` Turnover\n \n1 MV -0.899 12.6 NA 211. \n2 MV (TC) 11.9 15.2 0.780 0 \n3 Naive 11.8 15.2 0.779 0.234\n```\n\n\n:::\n:::\n\n\nThe results clearly speak against mean-variance optimization. Turnover is huge when the investor only considers her portfolio's expected return and variance. Effectively, the mean-variance portfolio generates a *negative* annualized return after adjusting for transaction costs. At the same time, the naive portfolio turns out to perform very well. In fact, the performance gains of the transaction-cost adjusted mean-variance portfolio are small. The out-of-sample Sharpe ratio is slightly higher than for the naive portfolio. Note the extreme effect of turnover penalization on turnover: *MV (TC)* effectively resembles a buy-and-hold strategy which only updates the portfolio once the estimated parameters $\\hat\\mu_t$ and $\\hat\\Sigma_t$indicate that the current allocation is too far away from the optimal theoretical portfolio. \n\n## Exercises\n\n1. Consider the portfolio choice problem for transaction-cost adjusted certainty equivalent maximization with risk aversion parameter $\\gamma$ \n$$\\omega_{t+1} ^* = \\arg\\max_{\\omega \\in \\mathbb{R}^N, \\iota'\\omega = 1} \\omega'\\mu - \\nu_t (\\omega, \\beta) - \\frac{\\gamma}{2}\\omega'\\Sigma\\omega$$\nwhere $\\Sigma$ and $\\mu$ are (estimators of) the variance-covariance matrix of the returns and the vector of expected returns. 
Assume for now that transaction costs are quadratic in rebalancing *and* proportional to stock illiquidity such that \n$$\\nu_t\\left(\\omega, B\\right) = \\frac{\\beta}{2} \\left(\\omega - \\omega_{t^+}\\right)'B\\left(\\omega - \\omega_{t^+}\\right)$$ where $B = \\text{diag}(ill_1, \\ldots, ill_N)$ is a diagonal matrix, where $ill_1, \\ldots, ill_N$. Derive a closed-form solution for the mean-variance efficient portfolio $\\omega_{t+1} ^*$ based on the transaction cost specification above. Discuss the effect of illiquidity $ill_i$ on the individual portfolio weights relative to an investor that myopically ignores transaction costs in her decision. \n1. Use the solution from the previous exercise to update the function `compute_efficient_weight()` such that you can compute optimal weights conditional on a matrix $B$ with illiquidity measures. \n1. Illustrate the evolution of the *optimal* weights from the naive portfolio to the efficient portfolio in the mean-standard deviation diagram.\n1. Is it always optimal to choose the same $\\beta$ in the optimization problem than the value used in evaluating the portfolio performance? In other words: can it be optimal to choose theoretically sub-optimal portfolios based on transaction cost considerations that do not reflect the actual incurred costs? Evaluate the out-of-sample Sharpe ratio after transaction costs for a range of different values of imposed $\\beta$ values.", "supporting": [ "constrained-optimization-and-backtesting_files" ], diff --git a/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1701-1.png b/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1701-1.png index c55fe9bf..abfb4985 100644 Binary files a/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1701-1.png and b/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1701-1.png differ diff --git a/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1702-1.png b/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1702-1.png index 29394693..dfc14b7b 100644 Binary files a/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1702-1.png and b/_freeze/r/constrained-optimization-and-backtesting/figure-html/fig-1702-1.png differ diff --git a/_freeze/r/cover-and-logo-design/execute-results/html.json b/_freeze/r/cover-and-logo-design/execute-results/html.json index 44be99cf..b691ff40 100644 --- a/_freeze/r/cover-and-logo-design/execute-results/html.json +++ b/_freeze/r/cover-and-logo-design/execute-results/html.json @@ -3,9 +3,7 @@ "result": { "engine": "knitr", "markdown": "---\ntitle: Cover and Logo Design\naliases:\n - ../cover-and-logo-design.html\npre-render:\n - pre_render_script.R\n---\n\n\nThe cover of the book is inspired by the fast growing generative art community in R.\\index{Generative art}\nGenerative art refers to art that in whole or in part has been created with the use of an autonomous system. \nInstead of creating random dynamics we rely on what is core to the book: The evolution of financial markets. \nEach circle in the cover figure corresponds to daily market return within one year of our sample. Deviations from the circle line indicate positive or negative returns. \nThe colors are determined by the standard deviation of market returns during the particular year. \nThe few lines of code below replicate the entire figure. 
\nWe use the Wes Andersen color palette (also throughout the entire book), provided by the package `wesanderson` [@wesanderson]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(wesanderson)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_ff3_daily <- tbl(\n tidy_finance,\n \"factors_ff3_daily\"\n) |>\n collect()\n\ndata_plot <- factors_ff3_daily |>\n select(date, mkt_excess) |>\n group_by(year = floor_date(date, \"year\")) |>\n mutate(group_id = cur_group_id())\n\ndata_plot <- data_plot |>\n group_by(group_id) |>\n mutate(\n day = 2 * pi * (1:n()) / 252,\n ymin = pmin(1 + mkt_excess, 1),\n ymax = pmax(1 + mkt_excess, 1),\n vola = sd(mkt_excess)\n ) |>\n filter(year >= \"1962-01-01\" & year <= \"2021-12-31\")\n\nlevels <- data_plot |>\n distinct(group_id, vola) |>\n arrange(vola) |>\n pull(vola)\n\ncp <- coord_polar(\n direction = -1,\n clip = \"on\"\n)\n\ncp$is_free <- function() TRUE\ncolors <- wes_palette(\"Zissou1\",\n n_groups(data_plot),\n type = \"continuous\"\n)\n\ncover <- data_plot |>\n mutate(vola = factor(vola, levels = levels)) |>\n ggplot(aes(\n x = day,\n y = mkt_excess,\n group = group_id,\n fill = vola\n )) +\n cp +\n geom_ribbon(aes(\n ymin = ymin,\n ymax = ymax,\n fill = vola\n ), alpha = 0.90) +\n theme_void() +\n facet_wrap(~group_id,\n ncol = 10,\n scales = \"free\"\n ) +\n theme(\n strip.text.x = element_blank(),\n legend.position = \"None\",\n panel.spacing = unit(-5, \"lines\")\n ) +\n scale_fill_manual(values = colors)\n\nggsave(\n plot = cover,\n width = 10,\n height = 6,\n filename = \"images/cover.png\",\n bg = \"white\"\n)\n```\n:::\n\n\nTo generate our logo, we focus on year 2021 - the end of the sample period at the time we published tidy-finance.org for the first time. 
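\n\nBecause the plotting data only carries a running `group_id`, it can be helpful to check which identifier corresponds to which calendar year before filtering. The lines below are a small, purely illustrative sketch of such a lookup (they are not required to reproduce any of the figures):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Map the running group identifier to the calendar years used below\ndata_plot |>\n  ungroup() |>\n  distinct(year, group_id) |>\n  filter(year(year) %in% c(1991, 2000, 2021))\n```\n:::\n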
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogo <- data_plot |>\n ungroup() |> \n filter(year == \"2021-01-01\") |> \n mutate(vola = factor(vola, levels = levels)) |>\n ggplot(aes(\n x = day,\n y = mkt_excess,\n fill = vola\n )) +\n cp +\n geom_ribbon(aes(\n ymin = ymin,\n ymax = ymax,\n fill = vola\n ), alpha = 0.90) +\n theme_void() +\n theme(\n strip.text.x = element_blank(),\n legend.position = \"None\",\n plot.margin = unit(c(-0.15,-0.15,-0.15,-0.15), \"null\")\n ) +\n scale_fill_manual(values = \"white\") \n\nggsave(\n plot = logo,\n width = 840,\n height = 840,\n units = \"px\",\n filename = \"images/logo-website-white.png\",\n)\n\nggsave(\n plot = logo +\n scale_fill_manual(values = wes_palette(\"Zissou1\")[1]), \n width = 840,\n height = 840,\n units = \"px\",\n filename = \"images/logo-website.png\",\n)\n```\n:::\n\n\nHere is the code to generate the vector graphics for our buttons.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbutton_r <- data_plot |>\n ungroup() |> \n filter(year == \"2000-01-01\") |> \n mutate(vola = factor(vola, levels = levels)) |>\n ggplot(aes(\n x = day,\n y = mkt_excess,\n fill = vola\n )) +\n cp +\n geom_ribbon(aes(\n ymin = ymin,\n ymax = ymax,\n fill = vola\n ), alpha = 0.90) +\n theme_void() +\n theme(\n strip.text.x = element_blank(),\n legend.position = \"None\",\n plot.margin = unit(c(-0.15,-0.15,-0.15,-0.15), \"null\")\n ) \n\nggsave(\n plot = button_r +\n scale_fill_manual(values = wes_palette(\"Zissou1\")[1]), \n width = 100,\n height = 100,\n units = \"px\",\n filename = \"images/button-r-blue.svg\",\n)\n\nggsave(\n plot = button_r +\n scale_fill_manual(values = wes_palette(\"Zissou1\")[4]), \n width = 100,\n height = 100,\n units = \"px\",\n filename = \"images/button-r-orange.svg\",\n)\n\nbutton_python <- data_plot |>\n ungroup() |> \n filter(year == \"1991-01-01\") |> \n mutate(vola = factor(vola, levels = levels)) |>\n ggplot(aes(\n x = day,\n y = mkt_excess,\n fill = vola\n )) +\n cp +\n geom_ribbon(aes(\n ymin = ymin,\n ymax = ymax,\n fill = vola\n ), alpha = 0.90) +\n theme_void() +\n theme(\n strip.text.x = element_blank(),\n legend.position = \"None\",\n plot.margin = unit(c(-0.15,-0.15,-0.15,-0.15), \"null\")\n ) \n\nggsave(\n plot = button_python +\n scale_fill_manual(values = wes_palette(\"Zissou1\")[1]), \n width = 100,\n height = 100,\n units = \"px\",\n filename = \"images/button-python-blue.svg\",\n)\n\nggsave(\n plot = button_python +\n scale_fill_manual(values = wes_palette(\"Zissou1\")[4]), \n width = 100,\n height = 100,\n units = \"px\",\n filename = \"images/button-python-orange.svg\",\n)\n```\n:::\n", - "supporting": [ - "cover-and-logo-design_files" - ], + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/r/difference-in-differences/execute-results/html.json b/_freeze/r/difference-in-differences/execute-results/html.json index 5508df1f..789d17a6 100644 --- a/_freeze/r/difference-in-differences/execute-results/html.json +++ b/_freeze/r/difference-in-differences/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "ba151815ca57c6906c62e796919e5f8e", + "hash": "b0d9fb5aefc9d29ad7b7a9f2db4f05d1", "result": { "engine": "knitr", - "markdown": "---\ntitle: Difference in Differences\naliases:\n - ../difference-in-differences.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Difference in Differences with R\n description-meta: Perform difference-in-difference estimation and analyze parallel trends using the programming language R. 
\n---\n\n\nIn this chapter, we illustrate the concept of *difference in differences* (DD) estimators by evaluating the effects of climate change regulation on the pricing of bonds across firms. DD estimators are typically used to recover the treatment effects of natural or quasi-natural experiments that trigger sharp changes in the environment of a specific group. Instead of looking at differences in just one group (e.g., the effect in the treated group), DD investigates the treatment effects by looking at the difference between differences in two groups. Such experiments are usually exploited to address endogeneity concerns [e.g., @RobertsWhited2013]. The identifying assumption is that the outcome variable would change equally in both groups without the treatment. This assumption is also often referred to as the assumption of parallel trends. Moreover, we would ideally also want a random assignment to the treatment and control groups. Due to lobbying or other activities, this randomness is often violated in (financial) economics.\\index{Difference in differences}\n\nIn the context of our setting, we investigate the impact of the Paris Agreement (PA), signed on December 12, 2015, on the bond yields of polluting firms. We first estimate the treatment effect of the agreement using panel regression techniques that we discuss in [Fixed Effects and Clustered Standard Errors](fixed-effects-and-clustered-standard-errors.qmd). We then present two methods to illustrate the treatment effect over time graphically. Although we demonstrate that the treatment effect of the agreement is anticipated by bond market participants well in advance, the techniques we present below can also be applied to many other settings.\\index{Paris (Climate) Agreement}\n\nThe approach we use here replicates the results of @Seltzer2022 partly. Specifically, we borrow their industry definitions for grouping firms into green and brown types. Overall, the literature on ESG effects in corporate bond markets is already large but continues to grow (for recent examples, see, e.g., @Halling2021, @Handler2022, @Huynh2021, among many others).\n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(fixest)\nlibrary(broom)\n```\n:::\n\n\n## Data Preparation\n\nWe use TRACE and Mergent FISD as data sources from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [TRACE and FISD](trace-and-fisd.qmd). \\index{Data!TRACE}\\index{Data!FISD}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfisd <- tbl(tidy_finance, \"fisd\") |>\n select(complete_cusip, maturity, offering_amt, sic_code) |>\n collect() |> \n drop_na()\n\ntrace_enhanced <- tbl(tidy_finance, \"trace_enhanced\") |>\n select(cusip_id, trd_exctn_dt, rptd_pr, entrd_vol_qt, yld_pt)|>\n collect() |> \n drop_na()\n```\n:::\n\n\nWe start our analysis by preparing the sample of bonds. We only consider bonds with a time to maturity of more than one year to the signing of the PA, so that we have sufficient data to analyze the yield behavior after the treatment date. This restriction also excludes all bonds issued after the agreement. 
We also consider only the first two digits of the SIC industry code to identify the polluting industries [in line with @Seltzer2022].\\index{Time to maturity}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntreatment_date <- ymd(\"2015-12-12\")\n\npolluting_industries <- c(\n 49, 13, 45, 29, 28, 33, 40, 20,\n 26, 42, 10, 53, 32, 99, 37\n)\n\nbonds <- fisd |>\n filter(offering_amt > 0) |> \n mutate(\n time_to_maturity = as.numeric(maturity - treatment_date) / 365,\n sic_code = as.integer(substr(sic_code, 1, 2)),\n log_offering_amt = log(offering_amt)\n ) |>\n filter(time_to_maturity >= 1) |>\n select(\n cusip_id = complete_cusip,\n time_to_maturity, log_offering_amt, sic_code\n ) |>\n mutate(polluter = sic_code %in% polluting_industries)\n```\n:::\n\n\nNext, we aggregate the individual transactions as reported in TRACE to a monthly panel of bond yields. We consider bond yields for a bond's last trading day in a month. Therefore, we first aggregate bond data to daily frequency and apply common restrictions from the literature [see, e.g., @Bessembinder2008]. We weigh each transaction by volume to reflect a trade's relative importance and avoid emphasizing small trades. Moreover, we only consider transactions with reported prices `rptd_pr` larger than 25 (to exclude bonds that are close to default) and only bond-day observations with more than five trades on a corresponding day (to exclude prices based on too few, potentially non-representative transactions).\\index{Yield aggregation} \\index{Returns!Bonds}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_aggregated <- trace_enhanced |>\n filter(rptd_pr > 25) |>\n group_by(cusip_id, trd_exctn_dt) |>\n summarize(\n avg_yield = weighted.mean(yld_pt, entrd_vol_qt * rptd_pr),\n trades = n(),\n .groups = \"drop\"\n ) |>\n drop_na(avg_yield) |>\n filter(trades >= 5) |>\n mutate(month = floor_date(trd_exctn_dt, \"months\")) |>\n group_by(cusip_id, month) |>\n slice_max(trd_exctn_dt) |>\n ungroup() |>\n select(cusip_id, month, avg_yield)\n```\n:::\n\n\nBy combining the bond-specific information from Mergent FISD for our bond sample with the aggregated TRACE data, we arrive at the main sample for our analysis.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel <- bonds |>\n inner_join(trace_aggregated, by = \"cusip_id\", multiple = \"all\") |>\n drop_na()\n```\n:::\n\n\nBefore we can run the first regression, we need to define the `treated` indicator, which is the product of the `post_period` (i.e., all months after the signing of the PA) and the `polluter` indicator defined above.\\index{Regression!Fixed effects} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel <- bonds_panel |>\n mutate(post_period = month >= floor_date(treatment_date, \"months\")) |>\n mutate(treated = polluter & post_period)\n```\n:::\n\n\nAs usual, we tabulate summary statistics of the variables that enter the regression to check the validity of our variable definitions.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel |>\n pivot_longer(\n cols = c(avg_yield, time_to_maturity, log_offering_amt),\n names_to = \"measure\"\n ) |>\n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n(),\n .groups = \"drop\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 9\n measure mean sd min q05 q50 q95 max n\n \n1 avg_yield 4.08 4.21 0.0595 1.27 3.38 8.10 128. 
127530\n2 log_offering_amt 13.3 0.823 4.64 12.2 13.2 14.5 16.5 127530\n3 time_to_maturity 8.55 8.41 1.01 1.50 5.81 27.4 101. 127530\n```\n\n\n:::\n:::\n\n\n## Panel Regressions\n\nThe PA is a legally binding international treaty on climate change. It was adopted by 196 Parties at COP 21 in Paris on 12 December 2015 and entered into force on 4 November 2016. The PA obliges developed countries to support efforts to build clean, climate-resilient futures. One may thus hypothesize that adopting climate-related policies may affect financial markets. To measure the magnitude of this effect, we first run an OLS regression without fixed effects where we include the `treated`, `post_period`, and `polluter` dummies, as well as the bond-specific characteristics `log_offering_amt` and `time_to_maturity`. This simple model assumes that there are essentially two periods (before and after the PA) and two groups (polluters and non-polluters). Nonetheless, it should indicate whether polluters have higher yields following the PA compared to non-polluters.\\index{Regression!Yields}\\index{Regression!Panel}\n\nThe second model follows the typical DD regression approach by including individual (`cusip_id`) and time (`month`) fixed effects. In this model, we do not include any other variables from the simple model because the fixed effects subsume them, and we observe the coefficient of our main variable of interest: `treated`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_without_fe <- feols(\n fml = avg_yield ~ treated + post_period + polluter +\n log_offering_amt + time_to_maturity,\n vcov = \"iid\",\n data = bonds_panel\n)\n\nmodel_with_fe <- feols(\n fml = avg_yield ~ treated | cusip_id + month,\n vcov = \"iid\",\n data = bonds_panel\n)\n\netable(\n model_without_fe, model_with_fe, \n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_without_fe model_with_fe\nDependent Var.: avg_yield avg_yield\n \nConstant 10.7*** (57.0) \ntreatedTRUE 0.462*** (9.31) 0.983*** (29.5)\npost_periodTRUE -0.174*** (-5.92) \npolluterTRUE 0.481*** (15.3) \nlog_offering_amt -0.551*** (-39.0) \ntime_to_maturity 0.058*** (41.6) \nFixed-Effects: ----------------- ---------------\ncusip_id No Yes\nmonth No Yes\n________________ _________________ _______________\nVCOV type IID IID\nObservations 127,530 127,530\nR2 0.032 0.647\nWithin R2 -- 0.007\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nBoth models indicate that polluters have significantly higher yields after the PA than non-polluting firms. Note that the magnitude of the `treated` coefficient varies considerably across models. \n\n## Visualizing Parallel Trends\n\nEven though the regressions above indicate that there is an impact of the PA on bond yields of polluters, the tables do not tell us anything about the dynamics of the treatment effect. In particular, the models provide no indication about whether the crucial *parallel trends* assumption is valid. This assumption requires that in the absence of treatment, the difference between the two groups is constant over time. Although there is no well-defined statistical test for this assumption, visual inspection typically provides a good indication.\\index{Parallel trends assumption}\n\nTo provide such visual evidence, we revisit the simple OLS model and replace the `treated` and `post_period` indicators with month dummies for each group. 
This approach estimates the average yield change of both groups for each period and provides corresponding confidence intervals. Plotting the coefficient estimates for both groups around the treatment date shows us the dynamics of our panel data.\\index{Graph!Diff-in-diff graph} \n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_without_fe_time <- feols(\n fml = avg_yield ~ polluter + month:polluter +\n time_to_maturity + log_offering_amt,\n vcov = \"iid\",\n data = bonds_panel |>\n mutate(month = factor(month))\n)\n\nmodel_without_fe_coefs <- tidy(model_without_fe_time) |>\n filter(str_detect(term, \"month\")) |>\n mutate(\n month = ymd(substr(term, nchar(term) - 9, nchar(term))),\n treatment = str_detect(term, \"TRUE\"),\n ci_up = estimate + qnorm(0.975) * std.error,\n ci_low = estimate + qnorm(0.025) * std.error\n )\n\nmodel_without_fe_coefs |>\n ggplot(aes(\n month, \n color = treatment,\n linetype = treatment,\n shape = treatment\n )) +\n geom_vline(aes(xintercept = floor_date(treatment_date, \"month\")),\n linetype = \"dashed\"\n ) +\n geom_hline(aes(yintercept = 0),\n linetype = \"dashed\"\n ) +\n geom_errorbar(aes(ymin = ci_low, ymax = ci_up),\n alpha = 0.5\n ) +\n guides(linetype = \"none\") + \n geom_point(aes(y = estimate)) +\n labs(\n x = NULL,\n y = \"Yield\",\n shape = \"Polluter?\",\n color = \"Polluter?\",\n title = \"Polluters respond stronger to Paris Agreement than green firms\"\n )\n```\n\n::: {.cell-output-display}\n![The figure shows the coefficient estimates and 95 percent confidence intervals for OLS regressions estimating the treatment effect of the Paris Climate Agreement on bond yields (in percent) for polluters and non-polluters. The horizontal line represents the benchmark yield of polluters before the Paris Agreement. The vertical line indicates the date of the agreement (December 12, 2015).](difference-in-differences_files/figure-html/fig-1301-1.png){#fig-1301 fig-alt='Title: Polluters respond stronger to Paris Agreement than green firms. The figure shows a sequence of monthly dots for two groups. Before the agreement, the dots mainly overlap. Ahead of the agreement, yields start to increase. Then, after the agreement, there is a strong divergence in yields. Polluters have significantly higher yields than non-polluters in the months before and after the signing of the Paris Agreement. However, this yield difference vanishes again towards the end of 2016.' width=2100}\n:::\n:::\n\n@fig-1301 shows that throughout most of 2014, the yields of the two groups changed in unison. However, starting at the end of 2014, the yields start to diverge, reaching the highest difference around the signing of the PA. Afterward, the yields for both groups fall again, and the polluters arrive at the same level as at the beginning of 2014. The non-polluters, on the other hand, even experience significantly lower yields than polluters after the signing of the agreement. \n\nInstead of plotting both groups using the simple model approach, we can also use the fixed-effects model and focus on the polluter's yield response to the signing relative to the non-polluters. To perform this estimation, we need to replace the `treated` indicator with separate time dummies for the polluters, each marking a one-month period relative to the treatment date. 
We then regress the monthly yields on the set of time dummies and `cusip_id` and `month` fixed effects.\\index{Graph!Diff-in-diff graph}\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel_alt <- bonds_panel |>\n mutate(\n diff_to_treatment = interval(\n floor_date(treatment_date, \"month\"), month\n ) %/% months(1)\n )\n\nvariables <- bonds_panel_alt |>\n distinct(diff_to_treatment, month) |>\n arrange(month) |>\n mutate(variable_name = as.character(NA))\n\nformula <- \"avg_yield ~ \"\n\nfor (j in 1:nrow(variables)) {\n if (variables$diff_to_treatment[j] != 0) {\n old_names <- names(bonds_panel_alt)\n bonds_panel_alt <- bonds_panel_alt |>\n mutate(new_var = diff_to_treatment == variables$diff_to_treatment[j] & \n polluter)\n new_var_name <- ifelse(variables$diff_to_treatment[j] < 0,\n str_c(\"lag\", abs(variables$diff_to_treatment[j])),\n str_c(\"lead\", variables$diff_to_treatment[j])\n )\n variables$variable_name[j] <- new_var_name\n names(bonds_panel_alt) <- c(old_names, new_var_name)\n formula <- str_c(\n formula,\n ifelse(j == 1,\n new_var_name,\n str_c(\"+\", new_var_name)\n )\n )\n }\n}\nformula <- str_c(formula, \"| cusip_id + month\")\n\nmodel_with_fe_time <- feols(\n fml = as.formula(formula),\n vcov = \"iid\",\n data = bonds_panel_alt\n)\n\nmodel_with_fe_time_coefs <- tidy(model_with_fe_time) |>\n mutate(\n term = str_remove(term, \"TRUE\"),\n ci_up = estimate + qnorm(0.975) * std.error,\n ci_low = estimate + qnorm(0.025) * std.error\n ) |>\n left_join(\n variables,\n by = c(\"term\" = \"variable_name\")\n ) |>\n bind_rows(tibble(\n term = \"lag0\",\n estimate = 0,\n ci_up = 0,\n ci_low = 0,\n month = floor_date(treatment_date, \"month\")\n ))\n\nmodel_with_fe_time_coefs |>\n ggplot(aes(x = month, y = estimate)) +\n geom_vline(aes(xintercept = floor_date(treatment_date, \"month\")),\n linetype = \"dashed\"\n ) +\n geom_hline(aes(yintercept = 0),\n linetype = \"dashed\"\n ) +\n geom_errorbar(aes(ymin = ci_low, ymax = ci_up),\n alpha = 0.5\n ) +\n geom_point(aes(y = estimate)) +\n labs(\n x = NULL,\n y = \"Yield\",\n title = \"Polluters' yield patterns around Paris Agreement signing\"\n )\n```\n\n::: {.cell-output-display}\n![The figure shows the coefficient estimates and 95 percent confidence intervals for OLS regressions estimating the treatment effect of the Paris Climate Agreement on bond yields (in percent) for polluters. The horizontal line represents the benchmark yield of polluters before the Paris Agreement. The vertical line indicates the date of the agreement (December 12, 2015).](difference-in-differences_files/figure-html/fig-1402-1.png){#fig-1402 fig-alt='Title: Polluters\\' yield patterns around Paris Agreement signing. The figure shows a sequence of monthly dots for the treated group. Ahead of the agreement, yields of polluters start to increase. Then, after the agreement, there is a small reversal and yields drop again.' width=2100}\n:::\n:::\n\n\n\\index{Robustness tests}\nThe resulting graph shown in @fig-1402 confirms the main conclusion of the previous image: polluters' yield patterns show a considerable anticipation effect starting toward the end of 2014. Yields only marginally increase after the signing of the agreement. However, as opposed to the simple model, we do not see a complete reversal back to the pre-agreement level. Yields of polluters stay at a significantly higher level even one year after the signing.\n\nNotice that during the year after the PA was signed, the 45th President of the United States was elected on November 8, 2016. 
During his campaign there were some indications of intentions to withdraw the US from the PA, which ultimately happened on November 4, 2020. Hence, reversal effects are potentially driven by these actions.\n\n## Exercises\n\n1. The 46th President of the US rejoined the Paris Agreement in February 2021. Repeat the difference in differences analysis for the day of his election victory. Note that you will also have to download new TRACE data. How did polluters' yields react to this action?\n2. Based on the exercise on ratings in [TRACE and FISD](trace-and-fisd.qmd), include ratings as a control variable in the analysis above. Do the results change?", + "markdown": "---\ntitle: Difference in Differences\naliases:\n - ../difference-in-differences.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Difference in Differences with R\n description-meta: Perform difference-in-difference estimation and analyze parallel trends using the programming language R. \n---\n\n\nIn this chapter, we illustrate the concept of *difference in differences* (DD) estimators by evaluating the effects of climate change regulation on the pricing of bonds across firms. DD estimators are typically used to recover the treatment effects of natural or quasi-natural experiments that trigger sharp changes in the environment of a specific group. Instead of looking at differences in just one group (e.g., the effect in the treated group), DD investigates the treatment effects by looking at the difference between differences in two groups. Such experiments are usually exploited to address endogeneity concerns [e.g., @RobertsWhited2013]. The identifying assumption is that the outcome variable would change equally in both groups without the treatment. This assumption is also often referred to as the assumption of parallel trends. Moreover, we would ideally also want a random assignment to the treatment and control groups. Due to lobbying or other activities, this randomness is often violated in (financial) economics.\\index{Difference in differences}\n\nIn the context of our setting, we investigate the impact of the Paris Agreement (PA), signed on December 12, 2015, on the bond yields of polluting firms. We first estimate the treatment effect of the agreement using panel regression techniques that we discuss in [Fixed Effects and Clustered Standard Errors](fixed-effects-and-clustered-standard-errors.qmd). We then present two methods to illustrate the treatment effect over time graphically. Although we demonstrate that the treatment effect of the agreement is anticipated by bond market participants well in advance, the techniques we present below can also be applied to many other settings.\\index{Paris (Climate) Agreement}\n\nThe approach we use here replicates the results of @Seltzer2022 partly. Specifically, we borrow their industry definitions for grouping firms into green and brown types. Overall, the literature on ESG effects in corporate bond markets is already large but continues to grow (for recent examples, see, e.g., @Halling2021, @Handler2022, @Huynh2021, among many others).\n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(fixest)\nlibrary(broom)\n```\n:::\n\n\n## Data Preparation\n\nWe use TRACE and Mergent FISD as data sources from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [TRACE and FISD](trace-and-fisd.qmd). 
\\index{Data!TRACE}\\index{Data!FISD}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfisd <- tbl(tidy_finance, \"fisd\") |>\n select(complete_cusip, maturity, offering_amt, sic_code) |>\n collect() |> \n drop_na()\n\ntrace_enhanced <- tbl(tidy_finance, \"trace_enhanced\") |>\n select(cusip_id, trd_exctn_dt, rptd_pr, entrd_vol_qt, yld_pt)|>\n collect() |> \n drop_na()\n```\n:::\n\n\nWe start our analysis by preparing the sample of bonds. We only consider bonds with a time to maturity of more than one year to the signing of the PA, so that we have sufficient data to analyze the yield behavior after the treatment date. This restriction also excludes all bonds issued after the agreement. We also consider only the first two digits of the SIC industry code to identify the polluting industries [in line with @Seltzer2022].\\index{Time to maturity}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntreatment_date <- ymd(\"2015-12-12\")\n\npolluting_industries <- c(\n 49, 13, 45, 29, 28, 33, 40, 20,\n 26, 42, 10, 53, 32, 99, 37\n)\n\nbonds <- fisd |>\n filter(offering_amt > 0) |> \n mutate(\n time_to_maturity = as.numeric(maturity - treatment_date) / 365,\n sic_code = as.integer(substr(sic_code, 1, 2)),\n log_offering_amt = log(offering_amt)\n ) |>\n filter(time_to_maturity >= 1) |>\n select(\n cusip_id = complete_cusip,\n time_to_maturity, log_offering_amt, sic_code\n ) |>\n mutate(polluter = sic_code %in% polluting_industries)\n```\n:::\n\n\nNext, we aggregate the individual transactions as reported in TRACE to a monthly panel of bond yields. We consider bond yields for a bond's last trading day in a month. Therefore, we first aggregate bond data to daily frequency and apply common restrictions from the literature [see, e.g., @Bessembinder2008]. We weigh each transaction by volume to reflect a trade's relative importance and avoid emphasizing small trades. 
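\n\nMore explicitly, the daily yield that we record for bond $i$ on day $d$ is the value-weighted average over that day's transactions $j$ (written in our notation):\n$$\\bar{y}_{i,d} = \\frac{\\sum_j \\text{yld\\_pt}_j \\times \\text{entrd\\_vol\\_qt}_j \\times \\text{rptd\\_pr}_j}{\\sum_j \\text{entrd\\_vol\\_qt}_j \\times \\text{rptd\\_pr}_j},$$\nwhich is what `weighted.mean()` with weights `entrd_vol_qt * rptd_pr` computes below.\n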
Moreover, we only consider transactions with reported prices `rptd_pr` larger than 25 (to exclude bonds that are close to default) and only bond-day observations with more than five trades on a corresponding day (to exclude prices based on too few, potentially non-representative transactions).\\index{Yield aggregation} \\index{Returns!Bonds}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_aggregated <- trace_enhanced |>\n filter(rptd_pr > 25) |>\n group_by(cusip_id, trd_exctn_dt) |>\n summarize(\n avg_yield = weighted.mean(yld_pt, entrd_vol_qt * rptd_pr),\n trades = n(),\n .groups = \"drop\"\n ) |>\n drop_na(avg_yield) |>\n filter(trades >= 5) |>\n mutate(month = floor_date(trd_exctn_dt, \"months\")) |>\n group_by(cusip_id, month) |>\n slice_max(trd_exctn_dt) |>\n ungroup() |>\n select(cusip_id, month, avg_yield)\n```\n:::\n\n\nBy combining the bond-specific information from Mergent FISD for our bond sample with the aggregated TRACE data, we arrive at the main sample for our analysis.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel <- bonds |>\n inner_join(trace_aggregated, join_by(cusip_id), multiple = \"all\") |>\n drop_na()\n```\n:::\n\n\nBefore we can run the first regression, we need to define the `treated` indicator, which is the product of the `post_period` (i.e., all months after the signing of the PA) and the `polluter` indicator defined above.\\index{Regression!Fixed effects} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel <- bonds_panel |>\n mutate(post_period = month >= floor_date(treatment_date, \"months\")) |>\n mutate(treated = polluter & post_period)\n```\n:::\n\n\nAs usual, we tabulate summary statistics of the variables that enter the regression to check the validity of our variable definitions.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel |>\n pivot_longer(\n cols = c(avg_yield, time_to_maturity, log_offering_amt),\n names_to = \"measure\"\n ) |>\n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n(),\n .groups = \"drop\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 9\n measure mean sd min q05 q50 q95 max n\n \n1 avg_yield 4.08 4.21 0.0595 1.27 3.38 8.10 128. 127530\n2 log_offering_amt 13.3 0.823 4.64 12.2 13.2 14.5 16.5 127530\n3 time_to_maturity 8.55 8.41 1.01 1.50 5.81 27.4 101. 127530\n```\n\n\n:::\n:::\n\n\n## Panel Regressions\n\nThe PA is a legally binding international treaty on climate change. It was adopted by 196 Parties at COP 21 in Paris on 12 December 2015 and entered into force on 4 November 2016. The PA obliges developed countries to support efforts to build clean, climate-resilient futures. One may thus hypothesize that adopting climate-related policies may affect financial markets. To measure the magnitude of this effect, we first run an OLS regression without fixed effects where we include the `treated`, `post_period`, and `polluter` dummies, as well as the bond-specific characteristics `log_offering_amt` and `time_to_maturity`. This simple model assumes that there are essentially two periods (before and after the PA) and two groups (polluters and non-polluters). 
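\n\nIn equation form, this first specification corresponds to the textbook two-group, two-period difference-in-differences regression (written in our notation):\n$$\\text{avg\\_yield}_{i,t} = \\beta_0 + \\beta_1 \\text{polluter}_i + \\beta_2 \\text{post\\_period}_t + \\delta \\cdot \\text{treated}_{i,t} + \\gamma_1 \\text{log\\_offering\\_amt}_i + \\gamma_2 \\text{time\\_to\\_maturity}_i + \\varepsilon_{i,t},$$\nwhere $\\text{treated}_{i,t} = \\text{polluter}_i \\times \\text{post\\_period}_t$ and $\\delta$ is the difference-in-differences estimate of interest.\n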
Nonetheless, it should indicate whether polluters have higher yields following the PA compared to non-polluters.\\index{Regression!Yields}\\index{Regression!Panel}\n\nThe second model follows the typical DD regression approach by including individual (`cusip_id`) and time (`month`) fixed effects. In this model, we do not include any other variables from the simple model because the fixed effects subsume them, and we observe the coefficient of our main variable of interest: `treated`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_without_fe <- feols(\n fml = avg_yield ~ treated + post_period + polluter +\n log_offering_amt + time_to_maturity,\n vcov = \"iid\",\n data = bonds_panel\n)\n\nmodel_with_fe <- feols(\n fml = avg_yield ~ treated | cusip_id + month,\n vcov = \"iid\",\n data = bonds_panel\n)\n\netable(\n model_without_fe, model_with_fe, \n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_without_fe model_with_fe\nDependent Var.: avg_yield avg_yield\n \nConstant 10.7*** (57.0) \ntreatedTRUE 0.462*** (9.31) 0.983*** (29.5)\npost_periodTRUE -0.174*** (-5.92) \npolluterTRUE 0.481*** (15.3) \nlog_offering_amt -0.551*** (-39.0) \ntime_to_maturity 0.058*** (41.6) \nFixed-Effects: ----------------- ---------------\ncusip_id No Yes\nmonth No Yes\n________________ _________________ _______________\nVCOV type IID IID\nObservations 127,530 127,530\nR2 0.032 0.647\nWithin R2 -- 0.007\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nBoth models indicate that polluters have significantly higher yields after the PA than non-polluting firms. Note that the magnitude of the `treated` coefficient varies considerably across models. \n\n## Visualizing Parallel Trends\n\nEven though the regressions above indicate that there is an impact of the PA on bond yields of polluters, the tables do not tell us anything about the dynamics of the treatment effect. In particular, the models provide no indication about whether the crucial *parallel trends* assumption is valid. This assumption requires that in the absence of treatment, the difference between the two groups is constant over time. Although there is no well-defined statistical test for this assumption, visual inspection typically provides a good indication.\\index{Parallel trends assumption}\n\nTo provide such visual evidence, we revisit the simple OLS model and replace the `treated` and `post_period` indicators with month dummies for each group. This approach estimates the average yield change of both groups for each period and provides corresponding confidence intervals. 
Plotting the coefficient estimates for both groups around the treatment date shows us the dynamics of our panel data.\\index{Graph!Diff-in-diff graph} \n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_without_fe_time <- feols(\n fml = avg_yield ~ polluter + month:polluter +\n time_to_maturity + log_offering_amt,\n vcov = \"iid\",\n data = bonds_panel |>\n mutate(month = factor(month))\n)\n\nmodel_without_fe_coefs <- tidy(model_without_fe_time) |>\n filter(str_detect(term, \"month\")) |>\n mutate(\n month = ymd(substr(term, nchar(term) - 9, nchar(term))),\n treatment = str_detect(term, \"TRUE\"),\n ci_up = estimate + qnorm(0.975) * std.error,\n ci_low = estimate + qnorm(0.025) * std.error\n )\n\nmodel_without_fe_coefs |>\n ggplot(aes(\n month, \n color = treatment,\n linetype = treatment,\n shape = treatment\n )) +\n geom_vline(aes(xintercept = floor_date(treatment_date, \"month\")),\n linetype = \"dashed\"\n ) +\n geom_hline(aes(yintercept = 0),\n linetype = \"dashed\"\n ) +\n geom_errorbar(aes(ymin = ci_low, ymax = ci_up),\n alpha = 0.5\n ) +\n guides(linetype = \"none\") + \n geom_point(aes(y = estimate)) +\n labs(\n x = NULL,\n y = \"Yield\",\n shape = \"Polluter?\",\n color = \"Polluter?\",\n title = \"Polluters respond stronger to Paris Agreement than green firms\"\n )\n```\n\n::: {.cell-output-display}\n![The figure shows the coefficient estimates and 95 percent confidence intervals for OLS regressions estimating the treatment effect of the Paris Climate Agreement on bond yields (in percent) for polluters and non-polluters. The horizontal line represents the benchmark yield of polluters before the Paris Agreement. The vertical line indicates the date of the agreement (December 12, 2015).](difference-in-differences_files/figure-html/fig-1301-1.png){#fig-1301 fig-alt='Title: Polluters respond stronger to Paris Agreement than green firms. The figure shows a sequence of monthly dots for two groups. Before the agreement, the dots mainly overlap. Ahead of the agreement, yields start to increase. Then, after the agreement, there is a strong divergence in yields. Polluters have significantly higher yields than non-polluters in the months before and after the signing of the Paris Agreement. However, this yield difference vanishes again towards the end of 2016.' width=2100}\n:::\n:::\n\n@fig-1301 shows that throughout most of 2014, the yields of the two groups changed in unison. However, starting at the end of 2014, the yields start to diverge, reaching the highest difference around the signing of the PA. Afterward, the yields for both groups fall again, and the polluters arrive at the same level as at the beginning of 2014. The non-polluters, on the other hand, even experience significantly lower yields than polluters after the signing of the agreement. \n\nInstead of plotting both groups using the simple model approach, we can also use the fixed-effects model and focus on the polluter's yield response to the signing relative to the non-polluters. To perform this estimation, we need to replace the `treated` indicator with separate time dummies for the polluters, each marking a one-month period relative to the treatment date. 
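\n\nLoosely speaking, this second specification takes the form (our notation, with the signing month as the omitted baseline):\n$$\\text{avg\\_yield}_{i,t} = \\sum_{k \\neq 0} \\delta_k \\cdot \\text{polluter}_i \\cdot \\mathbb{1}\\left\\{t - t_{\\text{PA}} = k\\right\\} + \\alpha_i + \\lambda_t + \\varepsilon_{i,t},$$\nwhere $\\alpha_i$ and $\\lambda_t$ are bond and month fixed effects, $k$ counts months relative to the treatment date, and the coefficients $\\delta_k$ trace out the polluters' yield response around the signing.\n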
We then regress the monthly yields on the set of time dummies and `cusip_id` and `month` fixed effects.\\index{Graph!Diff-in-diff graph}\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_panel_alt <- bonds_panel |>\n mutate(\n diff_to_treatment = interval(\n floor_date(treatment_date, \"month\"), month\n ) %/% months(1)\n )\n\nvariables <- bonds_panel_alt |>\n distinct(diff_to_treatment, month) |>\n arrange(month) |>\n mutate(variable_name = as.character(NA))\n\nformula <- \"avg_yield ~ \"\n\nfor (j in 1:nrow(variables)) {\n if (variables$diff_to_treatment[j] != 0) {\n old_names <- names(bonds_panel_alt)\n bonds_panel_alt <- bonds_panel_alt |>\n mutate(new_var = diff_to_treatment == variables$diff_to_treatment[j] & \n polluter)\n new_var_name <- ifelse(variables$diff_to_treatment[j] < 0,\n str_c(\"lag\", abs(variables$diff_to_treatment[j])),\n str_c(\"lead\", variables$diff_to_treatment[j])\n )\n variables$variable_name[j] <- new_var_name\n names(bonds_panel_alt) <- c(old_names, new_var_name)\n formula <- str_c(\n formula,\n ifelse(j == 1,\n new_var_name,\n str_c(\"+\", new_var_name)\n )\n )\n }\n}\nformula <- str_c(formula, \"| cusip_id + month\")\n\nmodel_with_fe_time <- feols(\n fml = as.formula(formula),\n vcov = \"iid\",\n data = bonds_panel_alt\n)\n\nmodel_with_fe_time_coefs <- tidy(model_with_fe_time) |>\n mutate(\n term = str_remove(term, \"TRUE\"),\n ci_up = estimate + qnorm(0.975) * std.error,\n ci_low = estimate + qnorm(0.025) * std.error\n ) |>\n left_join(\n variables,\n join_by(term == variable_name)\n ) |>\n bind_rows(tibble(\n term = \"lag0\",\n estimate = 0,\n ci_up = 0,\n ci_low = 0,\n month = floor_date(treatment_date, \"month\")\n ))\n\nmodel_with_fe_time_coefs |>\n ggplot(aes(x = month, y = estimate)) +\n geom_vline(aes(xintercept = floor_date(treatment_date, \"month\")),\n linetype = \"dashed\"\n ) +\n geom_hline(aes(yintercept = 0),\n linetype = \"dashed\"\n ) +\n geom_errorbar(aes(ymin = ci_low, ymax = ci_up),\n alpha = 0.5\n ) +\n geom_point(aes(y = estimate)) +\n labs(\n x = NULL,\n y = \"Yield\",\n title = \"Polluters' yield patterns around Paris Agreement signing\"\n )\n```\n\n::: {.cell-output-display}\n![The figure shows the coefficient estimates and 95 percent confidence intervals for OLS regressions estimating the treatment effect of the Paris Climate Agreement on bond yields (in percent) for polluters. The horizontal line represents the benchmark yield of polluters before the Paris Agreement. The vertical line indicates the date of the agreement (December 12, 2015).](difference-in-differences_files/figure-html/fig-1402-1.png){#fig-1402 fig-alt='Title: Polluters\\' yield patterns around Paris Agreement signing. The figure shows a sequence of monthly dots for the treated group. Ahead of the agreement, yields of polluters start to increase. Then, after the agreement, there is a small reversal and yields drop again.' width=2100}\n:::\n:::\n\n\n\\index{Robustness tests}\nThe resulting graph shown in @fig-1402 confirms the main conclusion of the previous image: polluters' yield patterns show a considerable anticipation effect starting toward the end of 2014. Yields only marginally increase after the signing of the agreement. However, as opposed to the simple model, we do not see a complete reversal back to the pre-agreement level. Yields of polluters stay at a significantly higher level even one year after the signing.\n\nNotice that during the year after the PA was signed, the 45th President of the United States was elected on November 8, 2016. 
During his campaign there were some indications of intentions to withdraw the US from the PA, which ultimately happened on November 4, 2020. Hence, reversal effects are potentially driven by these actions.\n\n## Exercises\n\n1. The 46th President of the US rejoined the Paris Agreement in February 2021. Repeat the difference in differences analysis for the day of his election victory. Note that you will also have to download new TRACE data. How did polluters' yields react to this action?\n2. Based on the exercise on ratings in [TRACE and FISD](trace-and-fisd.qmd), include ratings as a control variable in the analysis above. Do the results change?", "supporting": [ "difference-in-differences_files" ], diff --git a/_freeze/r/factor-selection-via-machine-learning/execute-results/html.json b/_freeze/r/factor-selection-via-machine-learning/execute-results/html.json index bf2d639c..c3de96de 100644 --- a/_freeze/r/factor-selection-via-machine-learning/execute-results/html.json +++ b/_freeze/r/factor-selection-via-machine-learning/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "737369369e1c6b54de47c8f395406a42", + "hash": "cfae0fe7dccd12dadb62ebd01257070e", "result": { "engine": "knitr", - "markdown": "---\ntitle: Factor Selection via Machine Learning\naliases:\n - ../factor-selection-via-machine-learning.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Factor Selection via Machine Learning with R\n description-meta: Use machine learning tools such as Lasso and Ridge regressions to identify asset pricing factors using the programming language R.\n---\n\n\nThe aim of this chapter is twofold. From a data science perspective, we introduce `tidymodels`, a collection of packages for modeling and machine learning (ML) using `tidyverse` principles. `tidymodels` comes with a handy workflow for all sorts of typical prediction tasks. From a finance perspective, we address the notion of *factor zoo* [@Cochrane2011] using ML methods. We introduce Lasso and Ridge regression as a special case of penalized regression models. Then, we explain the concept of cross-validation for model *tuning* with Elastic Net regularization as a popular example. We implement and showcase the entire cycle from model specification, training, and forecast evaluation within the `tidymodels` universe. While the tools can generally be applied to an abundance of interesting asset pricing problems, we apply penalized regressions for identifying macroeconomic variables and asset pricing factors that help explain a cross-section of industry portfolios.\n\nIn previous chapters, we illustrate that stock characteristics such as size provide valuable pricing information in addition to the market beta.\\index{Factor zoo}\\index{Factor model}\\index{CAPM} \nSuch findings question the usefulness of the Capital Asset Pricing Model. \nIn fact, during the last decades, financial economists discovered a plethora of additional factors which may be correlated with the marginal utility of consumption (and would thus deserve a prominent role in pricing applications). The search for factors that explain the cross-section of expected stock returns has produced hundreds of potential candidates, as noted more recently by @Harvey2016, @Mclean2016, and @Hou2020.\nTherefore, given the multitude of proposed risk factors, the challenge these days rather is: *do we believe in the relevance of 300+ risk factors?* During recent years, promising methods from the field of ML got applied to common finance applications. 
We refer to @Mullainathan2017 for a treatment of ML from the perspective of an econometrician, @Nagel2021 for an excellent review of ML practices in asset pricing, @Easley2021 for ML applications in (high-frequency) market microstructure, and @Dixon2020 for a detailed treatment of all methodological aspects. \n\n## Brief Theoretical Background\n\nThis is a book about *doing* empirical work in a tidy manner, and we refer to any of the many excellent textbook treatments of ML methods and especially penalized regressions for some deeper discussion. Excellent material is provided, for instance, by @Hastie2009, @Hastie2013, and @DePrado2018. Instead, we briefly summarize the idea of Lasso and Ridge regressions as well as the more general Elastic Net. Then, we turn to the fascinating question on *how* to implement, tune, and use such models with the `tidymodels` workflow.\n\nTo set the stage, we start with the definition of a linear model: suppose we have data $(y_t, x_t), t = 1,\\ldots, T$, where $x_t$ is a $(K \\times 1)$ vector of regressors and $y_t$ is the response for observation $t$. \nThe linear model takes the form $y_t = \\beta' x_t + \\varepsilon_t$ with some error term $\\varepsilon_t$ and has been studied in abundance. The well-known ordinary-least square (OLS) estimator for the $(K \\times 1)$ vector $\\beta$ minimizes the sum of squared residuals and is then $$\\hat{\\beta}^\\text{ols} = \\left(\\sum\\limits_{t=1}^T x_t'x_t\\right)^{-1} \\sum\\limits_{t=1}^T x_t'y_t.$$ \n\\index{Regression!OLS}\n\nWhile we are often interested in the estimated coefficient vector $\\hat\\beta^\\text{ols}$, ML is about the predictive performance most of the time. For a new observation $\\tilde{x}_t$, the linear model generates predictions such that $$\\hat y_t = E\\left(y|x_t = \\tilde x_t\\right) = \\hat\\beta^\\text{ols}{}' \\tilde x_t.$$ \nIs this the best we can do? \nNot really: instead of minimizing the sum of squared residuals, penalized linear models can improve predictive performance by choosing other estimators $\\hat{\\beta}$ with lower variance than the estimator $\\hat\\beta^\\text{ols}$. \nAt the same time, it seems appealing to restrict the set of regressors to a few meaningful ones if possible. In other words, if $K$ is large (such as for the number of proposed factors in the asset pricing literature), it may be a desirable feature to *select* reasonable factors and set $\\hat\\beta^{\\text{ols}}_k = 0$ for some redundant factors. \n\nIt should be clear that the promised benefits of penalized regressions, i.e., reducing the mean squared error (MSE), come at a cost. In most cases, reducing the variance of the estimator introduces a bias such that $E\\left(\\hat\\beta\\right) \\neq \\beta$. What is the effect of such a bias-variance trade-off? To understand the implications, assume the following data-generating process for $y$: $$y = f(x) + \\varepsilon, \\quad \\varepsilon \\sim (0, \\sigma_\\varepsilon^2)$$ We want to recover $f(x)$, which denotes some unknown functional which maps the relationship between $x$ and $y$. While the properties of $\\hat\\beta^\\text{ols}$ as an unbiased estimator may be desirable under some circumstances, they are certainly not if we consider predictive accuracy. 
Alternative predictors $\\hat{f}(x)$ could be more desirable: For instance, the MSE depends on our model choice as follows: \\index{MSE} $$\\begin{aligned}\nMSE &=E((y-\\hat{f}(x))^2)=E((f(x)+\\epsilon-\\hat{f}(x))^2)\\\\\n&= \\underbrace{E((f(x)-\\hat{f}(x))^2)}_{\\text{total quadratic error}}+\\underbrace{E(\\epsilon^2)}_{\\text{irreducible error}} \\\\\n&= E\\left(\\hat{f}(x)^2\\right)+E\\left(f(x)^2\\right)-2E\\left(f(x)\\hat{f}(x)\\right)+\\sigma_\\varepsilon^2\\\\\n&=E\\left(\\hat{f}(x)^2\\right)+f(x)^2-2f(x)E\\left(\\hat{f}(x)\\right)+\\sigma_\\varepsilon^2\\\\\n&=\\underbrace{\\text{Var}\\left(\\hat{f}(x)\\right)}_{\\text{variance of model}}+ \\underbrace{E\\left((f(x)-\\hat{f}(x))\\right)^2}_{\\text{squared bias}} +\\sigma_\\varepsilon^2. \n\\end{aligned}$$ While no model can reduce $\\sigma_\\varepsilon^2$, a biased estimator with small variance may have a lower MSE than an unbiased estimator.\n\n### Ridge regression\n\n\\index{Regression!Ridge}\n\nOne biased estimator is known as Ridge regression. @Hoerl1970 propose to minimize the sum of squared errors *while simultaneously imposing a penalty on the $L_2$ norm of the parameters* $\\hat\\beta$. Formally, this means that for a penalty factor $\\lambda\\geq 0$ the minimization problem takes the form $\\min_\\beta \\left(y - X\\beta\\right)'\\left(y - X\\beta\\right)\\text{ s.t. } \\beta'\\beta \\leq c$. Here $c\\geq 0$ is a constant that depends on the choice of $\\lambda$. The larger $\\lambda$, the smaller $c$ (technically speaking, there is a one-to-one relationship between $\\lambda$, which corresponds to the Lagrangian of the minimization problem above and $c$). Here, $X = \\left(x_1 \\ldots x_T\\right)'$ and $y = \\left(y_1, \\ldots, y_T\\right)'$. A closed-form solution for the resulting regression coefficient vector $\\beta^\\text{ridge}$ exists: $$\\hat{\\beta}^\\text{ridge} = \\left(X'X + \\lambda I\\right)^{-1}X'y.$$ A couple of observations are worth noting: $\\hat\\beta^\\text{ridge} = \\hat\\beta^\\text{ols}$ for $\\lambda = 0$ and $\\hat\\beta^\\text{ridge} \\rightarrow 0$ for $\\lambda\\rightarrow \\infty$. Also for $\\lambda > 0$, $\\left(X'X + \\lambda I\\right)$ is non-singular even if $X'X$ is which means that $\\hat\\beta^\\text{ridge}$ exists even if $\\hat\\beta$ is not defined. However, note also that the Ridge estimator requires careful choice of the hyperparameter $\\lambda$ which controls the *amount of regularization*: a larger value of $\\lambda$ implies *shrinkage* of the regression coefficient toward 0, a smaller value of $\\lambda$ reduces the bias of the resulting estimator. \n\n::: {.calloutnote}\nNote, that $X$ usually contains an intercept column with ones. As a general rule, the associated intercept coefficient is not penalized. In practice, this often implies that $y$ is simply demeaned before computing $\\hat\\beta^\\text{ridge}$.\n::: \n\nWhat about the statistical properties of the Ridge estimator? First, the bad news is that $\\hat\\beta^\\text{ridge}$ is a biased estimator of $\\beta$. However, the good news is that (under homoscedastic error terms) the variance of the Ridge estimator is guaranteed to be *smaller* than the variance of the ordinary least square estimator. We encourage you to verify these two statements in the exercises. 
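\n\nTo build intuition for the shrinkage effect, the closed form above is straightforward to implement directly. The chunk below is a minimal sketch on simulated data; the sample size, the coefficient vector, and the choice of $\\lambda$ are purely illustrative and unrelated to the empirical application later in this chapter.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Simulate a small design with five regressors and a demeaned response\nset.seed(1234)\nn_obs <- 200\nX <- matrix(rnorm(n_obs * 5), ncol = 5)\nbeta_true <- c(1, 0.5, 0, 0, -0.5)\ny <- X %*% beta_true + rnorm(n_obs)\ny <- y - mean(y)\n\n# Closed-form Ridge estimator; lambda = 0 recovers the OLS solution\nestimate_ridge <- function(X, y, lambda) {\n  drop(solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y))\n}\n\ncbind(\n  ols = estimate_ridge(X, y, lambda = 0),\n  ridge = estimate_ridge(X, y, lambda = 50)\n)\n```\n:::\n\nThe Ridge coefficient vector is visibly shrunk toward zero relative to its OLS counterpart, which illustrates the effect of the penalty.\n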
As a result, we face a trade-off: The Ridge regression sacrifices some unbiasedness to achieve a smaller variance than the OLS estimator.\n\n### Lasso\n\n\\index{Regression!Lasso}\n\nAn alternative to Ridge regression is the Lasso (*l*east *a*bsolute *s*hrinkage and *s*election *o*perator). Similar to Ridge regression, the Lasso [@Tibshirani1996] is a penalized and biased estimator. \nThe main difference to Ridge regression is that Lasso does not only *shrink* coefficients but effectively selects variables by setting coefficients for *irrelevant* variables to zero. Lasso implements a $L_1$ penalization on the parameters such that: $$\\hat\\beta^\\text{Lasso} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right)\\text{ s.t. } \\sum\\limits_{k=1}^K|\\beta_k| < c(\\lambda).$$ There is no closed form solution for $\\hat\\beta^\\text{Lasso}$ in the above maximization problem but efficient algorithms exist (e.g., the R package `glmnet`). Like for Ridge regression, the hyperparameter $\\lambda$ has to be specified beforehand.\n\n### Elastic Net\n\nThe Elastic Net [@Zou2005] combines $L_1$ with $L_2$ penalization and encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. This more general framework considers the following optimization problem: $$\\hat\\beta^\\text{EN} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right) + \\lambda(1-\\rho)\\sum\\limits_{k=1}^K|\\beta_k| +\\frac{1}{2}\\lambda\\rho\\sum\\limits_{k=1}^K\\beta_k^2$$ Now, we have to chose two hyperparameters: the *shrinkage* factor $\\lambda$ and the *weighting parameter* $\\rho$. The Elastic Net resembles Lasso for $\\rho = 0$ and Ridge regression for $\\rho = 1$. While the R package `glmnet` provides efficient algorithms to compute the coefficients of penalized regressions, it is a good exercise to implement Ridge and Lasso estimation on your own before you use the `glmnet` package or the `tidymodels` back-end.\n\n## Data Preparation\n\nTo get started, we load the required R packages and data. The main focus is on the workflow behind the `tidymodels` package collection [@tidymodels]. \n@Kuhn2022 provide a thorough introduction into all `tidymodels` components. `glmnet` [@glmnet] was developed and released in sync with @Tibshirani1996 and provides an R implementation of Elastic Net estimation. The package `timetk` [@timetk] provides useful tools for time series data wrangling. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(scales)\nlibrary(furrr)\nlibrary(glmnet)\nlibrary(timetk)\n```\n:::\n\n\nIn this analysis, we use four different data sources that we load from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). We start with two different sets of factor portfolio returns which have been suggested as representing practical risk factor exposure and thus should be relevant when it comes to asset pricing applications. \n\n- The standard workhorse: monthly Fama-French 3 factor returns (market, small-minus-big, and high-minus-low book-to-market valuation sorts) defined in @Fama1992 and @Fama1993\n- Monthly q-factor returns from @Hou2015. The factors contain the size factor, the investment factor, the return-on-equity factor, and the expected growth factor\n\nNext, we include macroeconomic predictors which may predict the general stock market economy. 
Macroeconomic variables effectively serve as conditioning information such that their inclusion hints at the relevance of conditional models instead of unconditional asset pricing. We refer the interested reader to @Cochrane2009 on the role of conditioning information.\n\n- Our set of macroeconomic predictors comes from @Goyal2008. The data has been updated by the authors until 2021 and contains monthly variables that have been suggested as good predictors for the equity premium. Some of the variables are the dividend price ratio, earnings price ratio, stock variance, net equity expansion, treasury bill rate, and inflation\n\nFinally, we need a set of *test assets*. The aim is to understand which of the plenty factors and macroeconomic variable combinations prove helpful in explaining our test assets' cross-section of returns. \nIn line with many existing papers, we use monthly portfolio returns from 10 different industries according to the definition from [Kenneth French's homepage](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_10_ind_port.html) as test assets.\\index{Data!Fama-French factors}\\index{Data!q-factors}\\index{Data!Macro predictors}\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n collect() |>\n rename_with(~ str_c(\"factor_ff_\", .), -month)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\") |>\n collect() |>\n rename_with(~ str_c(\"factor_q_\", .), -month)\n\nmacro_predictors <- tbl(tidy_finance, \"macro_predictors\") |>\n collect() |>\n rename_with(~ str_c(\"macro_\", .), -month) |>\n select(-macro_rp_div)\n\nindustries_ff_monthly <- tbl(tidy_finance, \"industries_ff_monthly\") |>\n collect() |>\n pivot_longer(-month,\n names_to = \"industry\", values_to = \"ret\"\n ) |>\n arrange(desc(industry)) |> \n mutate(industry = as_factor(industry))\n```\n:::\n\n\nWe combine all the monthly observations into one data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- industries_ff_monthly |>\n left_join(factors_ff3_monthly, by = \"month\") |>\n left_join(factors_q_monthly, by = \"month\") |>\n left_join(macro_predictors, by = \"month\") |>\n mutate(\n ret = ret - factor_ff_rf\n ) |>\n select(month, industry, ret_excess = ret, everything()) |>\n drop_na()\n```\n:::\n\n\nOur data contains 22 columns of regressors with the 13 macro variables and 8 factor returns for each month. \n@fig-1401 provides summary statistics for the 10 monthly industry excess returns in percent.\\index{Graph!Box plot}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata |>\n group_by(industry) |>\n ggplot(aes(x = industry, y = ret_excess)) +\n geom_boxplot() +\n coord_flip() +\n labs(\n x = NULL, y = NULL,\n title = \"Excess return distributions by industry in percent\"\n ) +\n scale_y_continuous(\n labels = percent\n )\n```\n\n::: {.cell-output-display}\n![The box plots show the monthly dispersion of returns for 10 different industries.](factor-selection-via-machine-learning_files/figure-html/fig-1401-1.png){#fig-1401 fig-alt='Title: Excess return distributions by industry in percent. The figure shows boxplots that visualize the industry\\'s excess return distribution. All industry returns are centered around zero and exhibit substantial outliers in the magnitude of 20 percent on a monthly basis.' 
width=2100}\n:::\n:::\n\n\n## The tidymodels Workflow\n\nTo illustrate penalized linear regressions, we employ the `tidymodels` collection of packages for modeling and ML using `tidyverse` principles. You can simply use `install.packages(\"tidymodels\")` to get access to all the related packages. We recommend checking out the work of @Kuhn2022: They continuously write on their great book ['Tidy Modeling with R'](https://www.tmwr.org/) using tidy principles.\n\nThe `tidymodels` workflow encompasses the main stages of the modeling process: pre-processing of data, model fitting, and post-processing of results. As we demonstrate below, `tidymodels` provides efficient workflows that you can update with low effort.\n\nUsing the ideas of Ridge and Lasso regressions, the following example guides you through (i) pre-processing the data (data split and variable mutation), (ii) building models, (iii) fitting models, and (iv) tuning models to create the \"best\" possible predictions.\n\nTo start, we restrict our analysis to just one industry: Manufacturing. We first split the sample into a *training* and a *test* set. \nFor that purpose, `tidymodels` provides the function `initial_time_split()` from the `rsample` package [@rsample]. \nThe split takes the last 20% of the data as a test set, which is not used for any model tuning. \nWe use this test set to evaluate the predictive accuracy in an out-of-sample scenario.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsplit <- initial_time_split(\n data |>\n filter(industry == \"manuf\") |>\n select(-industry),\n prop = 4 / 5\n)\nsplit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n<536/135/671>\n```\n\n\n:::\n:::\n\n\nThe object `split` simply keeps track of the observations of the training and the test set. \nWe can call the training set with `training(split)`, while we can extract the test set with `testing(split)`.\n\n### Pre-process data\n\nRecipes help you pre-process your data before training your model. Recipes are a series of pre-processing steps such as variable selection, transformation, or conversion of qualitative predictors to indicator variables. Each recipe starts with a `formula` that defines the general structure of the dataset and the role of each variable (regressor or dependent variable). For our dataset, our recipe contains the following steps before we fit any model:\n\n- Our formula defines that we want to explain excess returns with all available predictors. The regression equation thus takes the form \n$$r_{t} = \\alpha_0 + \\left(\\tilde f_t \\otimes \\tilde z_t\\right)B + \\varepsilon_t $$ where $r_t$ is the vector of industry excess returns at time $t$ and $\\tilde f_t$ and $\\tilde z_t$ are the (standardized) vectors of factor portfolio returns and macroeconomic variables\n- We exclude the column *month* from the analysis\n- We include all interaction terms between factors and macroeconomic predictors\n- We demean and scale each regressor such that the standard deviation is one\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrec <- recipe(ret_excess ~ ., data = training(split)) |>\n step_rm(month) |>\n step_interact(terms = ~ contains(\"factor\"):contains(\"macro\")) |>\n step_normalize(all_predictors())\n```\n:::\n\n\nA table of all available recipe steps can be found [in the `tidymodels` documentation.](https://www.tidymodels.org/find/recipes/) As of 2023, more than 150 different processing steps are available! 
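\n\nIf you are unsure what a recipe will eventually do to your data, you can inspect it before fitting anything: `summary()` lists the variables and their roles, and `tidy()` lists the declared steps (both methods are provided by the `recipes` package).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Inspect variable roles and the declared pre-processing steps of the recipe\nsummary(rec)\ntidy(rec)\n```\n:::\n\n\n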
One important point: The definition of a recipe does not trigger any calculations yet but rather provides a *description* of the tasks to be applied. As a result, it is very easy to *reuse* recipes for different models and thus make sure that the outcomes are comparable as they are based on the same input. \nIn the example above, it does not make a difference whether you use the input `data = training(split)` or `data = testing(split)`. \nAll that matters at this early stage are the column names and types.\n\nWe can apply the recipe to any data with a suitable structure. The code below combines two different functions: `prep()` estimates the required parameters from a training set that can be applied to other data sets later. `bake()` applies the processed computations to new data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_prep <- prep(rec, training(split))\n```\n:::\n\nThe object `data_prep` contains information related to the different preprocessing steps applied to the training data: E.g., it is necessary to compute sample means and standard deviations to center and scale the variables. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_bake <- bake(data_prep,\n new_data = testing(split)\n)\ndata_bake\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 135 × 126\n factor_ff_mkt_excess factor_ff_smb factor_ff_hml factor_ff_rf\n \n1 -1.73 -1.10 -0.711 -1.78\n2 2.35 0.949 -0.0937 -1.78\n3 -0.156 -0.122 -0.284 -1.78\n4 0.0646 -0.256 0.425 -1.78\n5 0.995 0.563 -0.456 -1.78\n# ℹ 130 more rows\n# ℹ 122 more variables: factor_q_me , factor_q_ia ,\n# factor_q_roe , factor_q_eg , macro_dp ,\n# macro_dy , macro_ep , macro_de ,\n# macro_svar , macro_bm , macro_ntis ,\n# macro_tbl , macro_lty , macro_ltr ,\n# macro_tms , macro_dfy , macro_infl , …\n```\n\n\n:::\n:::\n\n\nNote that the resulting data contains the 132 observations from the test set and 126 columns. Why so many? Recall that the recipe states to compute every possible interaction term between the factors and predictors, which increases the dimension of the data matrix substantially. \n\nYou may ask at this stage: why should I use a recipe instead of simply using the data wrangling commands such as `mutate()` or `select()`? `tidymodels` beauty is that a lot is happening under the hood. Recall, that for the simple scaling step, you actually have to compute the standard deviation of each column, then *store* this value, and apply the identical transformation to a different dataset, e.g., `testing(split)`. A prepped `recipe` stores these values and hands them on once you `bake()` a novel dataset. Easy as pie with `tidymodels`, isn't it?\n\n### Build a model\n\n\\index{Regression}\nNext, we can build an actual model based on our pre-processed data. In line with the definition above, we estimate regression coefficients of a Lasso regression such that we get \n$$\\begin{aligned}\\hat\\beta_\\lambda^\\text{Lasso} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right) + \\lambda\\sum\\limits_{k=1}^K|\\beta_k|.\\end{aligned}$$ We want to emphasize that the `tidymodels` workflow for *any* model is very similar, irrespective of the specific model. As you will see further below, it is straightforward to fit Ridge regression coefficients and - later - Neural networks or Random forests with basically the same code. The structure is always as follows: create a so-called `workflow()` and use the `fit()` function. 
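\n\nIn case you wonder which computational engines are available for a given model type, `show_engines()` from `parsnip` provides a quick overview; here, for the linear regression specification that we use throughout this chapter.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# List the engines that can fit a linear regression specification\nshow_engines(\"linear_reg\")\n```\n:::\n\n\n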
A table with all available model APIs is available [here.](https://www.tidymodels.org/find/parsnip/)\nFor now, we start with the linear regression model with a given value for the penalty factor $\\lambda$. In the setup below, `mixture` denotes the value of $\\rho$, hence setting `mixture = 1` implies the Lasso.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_model <- linear_reg(\n penalty = 0.0001,\n mixture = 1\n) |>\n set_engine(\"glmnet\", intercept = FALSE)\n```\n:::\n\n\nThat's it - we are done! The object `lm_model` contains the definition of our model with all required information. Note that `set_engine(\"glmnet\")` indicates the API character of the `tidymodels` workflow: Under the hood, the package `glmnet` is doing the heavy lifting, while `linear_reg()` provides a unified framework to collect the inputs. The `workflow` ends with combining everything necessary for the serious data science workflow, namely, a recipe and a model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_fit <- workflow() |>\n add_recipe(rec) |>\n add_model(lm_model)\nlm_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n══ Workflow ═════════════════════════════════════════════════════════\nPreprocessor: Recipe\nModel: linear_reg()\n\n── Preprocessor ─────────────────────────────────────────────────────\n3 Recipe Steps\n\n• step_rm()\n• step_interact()\n• step_normalize()\n\n── Model ────────────────────────────────────────────────────────────\nLinear Regression Model Specification (regression)\n\nMain Arguments:\n penalty = 1e-04\n mixture = 1\n\nEngine-Specific Arguments:\n intercept = FALSE\n\nComputational engine: glmnet \n```\n\n\n:::\n:::\n\n\n### Fit a model\n\nWith the `workflow` from above, we are ready to use `fit()`. Typically, we use training data to fit the model. \nThe training data is pre-processed according to our recipe steps, and the Lasso regression coefficients are computed. \nFirst, we focus on the predicted values $\\hat{y}_t = x_t\\hat\\beta^\\text{Lasso}.$ @fig-1402 illustrates the projections for the *entire* time series of the manufacturing industry portfolio returns. 
The grey area indicates the out-of-sample period, which we did not use to fit the model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npredicted_values <- lm_fit |>\n fit(data = training(split)) |>\n predict(data |> filter(industry == \"manuf\")) |>\n bind_cols(data |> filter(industry == \"manuf\")) |>\n select(month,\n \"Fitted value\" = .pred,\n \"Realization\" = ret_excess\n ) |>\n pivot_longer(-month, names_to = \"Variable\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npredicted_values |>\n ggplot(aes(\n x = month, \n y = value, \n color = Variable,\n linetype = Variable\n )) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n color = NULL,\n linetype = NULL,\n title = \"Monthly realized and fitted manufacturing industry risk premia\"\n ) +\n scale_x_date(\n breaks = function(x) {\n seq.Date(\n from = min(x),\n to = max(x),\n by = \"5 years\"\n )\n },\n minor_breaks = function(x) {\n seq.Date(\n from = min(x),\n to = max(x),\n by = \"1 years\"\n )\n },\n expand = c(0, 0),\n labels = date_format(\"%Y\")\n ) +\n scale_y_continuous(\n labels = percent\n ) +\n annotate(\"rect\",\n xmin = testing(split) |> pull(month) |> min(),\n xmax = testing(split) |> pull(month) |> max(),\n ymin = -Inf, ymax = Inf,\n alpha = 0.5, fill = \"grey70\"\n )\n```\n\n::: {.cell-output-display}\n![The grey area corresponds to the out of sample period.](factor-selection-via-machine-learning_files/figure-html/fig-1402-1.png){#fig-1402 fig-alt='Title: Monthly realized and fitted manufacturing industry risk premium. The figure shows the time series of realized and predicted manufacturing industry risk premiums. The figure seems to indicate that the predictions capture most of the return dynamics.' width=2100}\n:::\n:::\n\n\nWhat do the estimated coefficients look like? To analyze these values and to illustrate the difference between the `tidymodels` workflow and the underlying `glmnet` package, it is worth computing the coefficients $\\hat\\beta^\\text{Lasso}$ directly. The code below estimates the coefficients for the Lasso and Ridge regression for the processed training data sample. Note that `glmnet` actually takes a vector `y` and the matrix of regressors $X$ as input. Moreover, `glmnet` requires choosing the penalty parameter $\\alpha$, which corresponds to $\\rho$ in the notation above. When using the `tidymodels` model API, such details do not need consideration.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- data_bake |>\n select(-ret_excess) |>\n as.matrix()\ny <- data_bake |> pull(ret_excess)\n\nfit_lasso <- glmnet(\n x = x,\n y = y,\n alpha = 1,\n intercept = FALSE,\n standardize = FALSE,\n lambda.min.ratio = 0\n)\n\nfit_ridge <- glmnet(\n x = x,\n y = y,\n alpha = 0,\n intercept = FALSE,\n standardize = FALSE,\n lambda.min.ratio = 0\n)\n```\n:::\n\n\nThe objects `fit_lasso` and `fit_ridge` contain an entire sequence of estimated coefficients for multiple values of the penalty factor $\\lambda$. @fig-1403 illustrates the trajectories of the regression coefficients as a function of the penalty factor. 
Both Lasso and Ridge coefficients converge to zero as the penalty factor increases.\\index{Graph!ML prediction path}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbind_rows(\n tidy(fit_lasso) |> mutate(Model = \"Lasso\"),\n tidy(fit_ridge) |> mutate(Model = \"Ridge\")\n) |>\n rename(\"Variable\" = term) |>\n ggplot(aes(x = lambda, y = estimate, color = Variable)) +\n geom_line() +\n scale_x_log10() +\n facet_wrap(~Model, scales = \"free_x\") +\n labs(\n x = \"Penalty factor (lambda)\", y = NULL,\n title = \"Estimated coefficient paths for different penalty factors\"\n ) +\n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![The penalty parameters are chosen iteratively to resemble the path from no penalization to a model that excludes all variables.](factor-selection-via-machine-learning_files/figure-html/fig-1403-1.png){#fig-1403 fig-alt='Title: Estimated coefficient paths for different penalty factors. The figure shows how estimated lasso and ridge coefficients tend to zero for a higher penalty parameter. Ridge trace is smooth, and Lasso exhibits non-linear behavior.' width=2100}\n:::\n:::\n\n\n::: {.rmdnote}\nOne word of caution: The package `glmnet` computes estimates of the coefficients $\\hat\\beta$ based on numerical optimization procedures. \nAs a result, the estimated coefficients for the [special case](https://parsnip.tidymodels.org/reference/glmnet-details.html) with no regularization ($\\lambda = 0$) can deviate from the standard OLS estimates.\n:::\n\n### Tune a model\n\nTo compute $\\hat\\beta_\\lambda^\\text{Lasso}$ , we simply imposed a value for the penalty hyperparameter $\\lambda$. Model tuning is the process of optimally selecting such hyperparameters. `tidymodels` provides extensive tuning options based on so-called *cross-validation*. Again, we refer to any treatment of cross-validation to get a more detailed discussion of the statistical underpinnings. Here we focus on the general idea and the implementation with `tidymodels`. \n\nThe goal for choosing $\\lambda$ (or any other hyperparameter, e.g., $\\rho$ for the Elastic Net) is to find a way to produce predictors $\\hat{Y}$ for an outcome $Y$ that minimizes the mean squared prediction error $\\text{MSPE} = E\\left( \\frac{1}{T}\\sum_{t=1}^T (\\hat{y}_t - y_t)^2 \\right)$. Unfortunately, the MSPE is not directly observable. We can only compute an estimate because our data is random and because we do not observe the entire population.\n\nObviously, if we train an algorithm on the same data that we use to compute the error, our estimate $\\hat{\\text{MSPE}}$ would indicate way better predictive accuracy than what we can expect in real out-of-sample data. The result is called overfitting.\n\nCross-validation is a technique that allows us to alleviate this problem. We approximate the true MSPE as the average of many MSPE obtained by creating predictions for $K$ new random samples of the data, none of them used to train the algorithm $\\frac{1}{K} \\sum_{k=1}^K \\frac{1}{T}\\sum_{t=1}^T \\left(\\hat{y}_t^k - y_t^k\\right)^2$. In practice, this is done by carving out a piece of our data and pretending it is an independent sample. We again divide the data into a training set and a test set. The MSPE on the test set is our measure for actual predictive ability, while we use the training set to fit models with the aim to find the *optimal* hyperparameter values. 
To do so, we further divide our training sample into (several) subsets, fit our model for a grid of potential hyperparameter values (e.g., $\\lambda$), and evaluate the predictive accuracy on an *independent* sample. This works as follows:\n\n1. Specify a grid of hyperparameters\n2. Obtain predictors $\\hat{y}_i(\\lambda)$ to denote the predictors for the used parameters $\\lambda$\n3. Compute $$\n \\text{MSPE}(\\lambda) = \\frac{1}{K} \\sum_{k=1}^K \\frac{1}{T}\\sum_{t=1}^T \\left(\\hat{y}_t^k(\\lambda) - y_t^k\\right)^2 \n $$ With K-fold cross-validation, we do this computation $K$ times. Simply pick a validation set with $M=T/K$ observations at random and think of these as random samples $y_1^k, \\dots, y_{\\tilde{T}}^k$, with $k=1$\n\nHow should you pick $K$? Large values of $K$ are preferable because the training data better imitates the original data. However, larger values of $K$ will have much higher computation time.\n`tidymodels` provides all required tools to conduct $K$-fold cross-validation. We just have to update our model specification and let `tidymodels` know which parameters to tune. In our case, we specify the penalty factor $\\lambda$ as well as the mixing factor $\\rho$ as *free* parameters. Note that it is simple to change an existing `workflow` with `update_model()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_model <- linear_reg(\n penalty = tune(),\n mixture = tune()\n) |>\n set_engine(\"glmnet\")\n\nlm_fit <- lm_fit |>\n update_model(lm_model)\n```\n:::\n\n\nFor our sample, we consider a time-series cross-validation sample. This means that we tune our models with 20 random samples of length five years with a validation period of four years. For a grid of possible hyperparameters, we then fit the model for each fold and evaluate $\\hat{\\text{MSPE}}$ in the corresponding validation set. Finally, we select the model specification with the lowest MSPE in the validation set. First, we define the cross-validation folds based on our training data only.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_folds <- time_series_cv(\n data = training(split),\n date_var = month,\n initial = \"5 years\",\n assess = \"48 months\",\n cumulative = FALSE,\n slice_limit = 20\n)\ndata_folds\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Time Series Cross Validation Plan \n# A tibble: 20 × 2\n splits id \n \n1 Slice01\n2 Slice02\n3 Slice03\n4 Slice04\n5 Slice05\n# ℹ 15 more rows\n```\n\n\n:::\n:::\n\n\nThen, we evaluate the performance for a grid of different penalty values. `tidymodels` provides functionalities to construct a suitable grid of hyperparameters with `grid_regular`. The code chunk below creates a $10 \\times 3$ hyperparameters grid. Then, the function `tune_grid()` evaluates all the models for each fold.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_tune <- lm_fit |>\n tune_grid(\n resample = data_folds,\n grid = grid_regular(penalty(), mixture(), levels = c(20, 3)),\n metrics = metric_set(rmse)\n )\n```\n:::\n\n\nAfter the tuning process, we collect the evaluation metrics (the root mean-squared error in our example) to identify the *optimal* model. 
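\n\nBefore plotting the results, it can be instructive to look at the tuning output in tabular form: `collect_metrics()` returns the cross-validated metric for every candidate combination of $\\lambda$ and $\\rho$, and `show_best()` ranks the candidates (both functions come with the `tune` package).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Cross-validated RMSE for each candidate combination of penalty and mixture\ncollect_metrics(lm_tune)\n\n# The five best-performing hyperparameter combinations\nshow_best(lm_tune, metric = \"rmse\", n = 5)\n```\n:::\n\n\n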
@fig-1404 illustrates the average validation set's root mean-squared error for each value of $\\lambda$ and $\\rho$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(lm_tune) + \n aes(linetype = `Proportion of Lasso Penalty`) + \n guides(linetype = \"none\") +\n labs(\n x = \"Penalty factor (lambda)\",\n y = \"Root MSPE\",\n title = \"Root MSPE for different penalty factors\"\n ) + \n scale_x_log10()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nScale for x is already present.\nAdding another scale for x, which will replace the existing scale.\n```\n\n\n:::\n\n::: {.cell-output-display}\n![Evaluation of manufacturing excess returns for different penalty factors (lambda) and proportions of Lasso penalty (rho). 1.0 indicates Lasso, 0.5 indicates Elastic Net, and 0.0 indicates Ridge.](factor-selection-via-machine-learning_files/figure-html/fig-1404-1.png){#fig-1404 fig-alt='Title: Root MSPE for different penalty factors. The figure shows that more regularization does not affect the selected models in a meaningful fashion. At some point, the Elastic Net prediction error drops, which indicates the selected model. MSPE increases again for high penalization values.' width=2100}\n:::\n:::\n\n\n@fig-1404 shows that the cross-validated MSPE drops for Lasso and Elastic Net and spikes afterward. For Ridge regression, the MSPE increases above a certain threshold. Recall that the larger the regularization, the more restricted the model becomes. Thus, we would choose the model with the lowest MSPE.\n\n### Parallelized workflow\n\nOur starting point was the question: Which factors determine industry returns? While @Avramov2022b provide a Bayesian analysis related to the research question above, we choose a simplified approach: To illustrate the entire workflow, we now run the penalized regressions for all ten industries. \nWe want to identify relevant variables by fitting Lasso models for each industry returns time series. More specifically, we perform cross-validation for each industry to identify the optimal penalty factor $\\lambda$. \nThen, we use the set of `finalize_*()`-functions that take a list or tibble of tuning parameter values and update objects with those values. After determining the best model, we compute the final fit on the entire training set and analyze the estimated coefficients. \n\nFirst, we define the Lasso model with one tuning parameter.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlasso_model <- linear_reg(\n penalty = tune(),\n mixture = 1\n) |>\n set_engine(\"glmnet\")\n\nlm_fit <- lm_fit |>\n update_model(lasso_model)\n```\n:::\n\n\nThe following task can be easily parallelized to reduce computing time substantially. We use the parallelization capabilities of `furrr`. 
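\n\nIf you have not worked with `furrr` before, the basic pattern is to declare a parallel `plan()` and then replace `purrr::map()` with `future_map()`; the stylized toy example below (unrelated to our data) illustrates the mechanics.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy example of the furrr workflow: set a plan, map in parallel, reset the plan\nplan(multisession, workers = 2)\nfuture_map(1:4, ~ .x^2)\nplan(sequential)\n```\n:::\n\n\n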
Note that we can also just recycle all the steps from above and collect them in a function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect_variables <- function(input) {\n # Split into training and testing data\n split <- initial_time_split(input, prop = 4 / 5)\n\n # Data folds for cross-validation\n data_folds <- time_series_cv(\n data = training(split),\n date_var = month,\n initial = \"5 years\",\n assess = \"48 months\",\n cumulative = FALSE,\n slice_limit = 20\n )\n\n # Model tuning with the Lasso model\n lm_tune <- lm_fit |>\n tune_grid(\n resample = data_folds,\n grid = grid_regular(penalty(), levels = c(10)),\n metrics = metric_set(rmse)\n )\n\n # Identify the best model and fit with the training data\n lasso_lowest_rmse <- lm_tune |> select_by_one_std_err(\"rmse\")\n lasso_final <- finalize_workflow(lm_fit, lasso_lowest_rmse)\n lasso_final_fit <- last_fit(lasso_final, split, metrics = metric_set(rmse))\n\n # Extract the estimated coefficients\n estimated_coefficients <- lasso_final_fit |>\n extract_fit_parsnip() |>\n tidy() |>\n mutate(\n term = str_remove_all(term, \"factor_|macro_|industry_\")\n )\n\n return(estimated_coefficients)\n}\n\n# Parallelization\nplan(multisession, workers = availableCores())\n\n# Computation by industry\nselected_factors <- data |>\n nest(data = -industry) |>\n mutate(selected_variables = future_map(\n data, select_variables,\n .options = furrr_options(seed = TRUE)\n ))\n```\n:::\n\n\nWhat has just happened? In principle, exactly the same as before but instead of computing the Lasso coefficients for one industry, we did it for ten in parallel. The final option `seed = TRUE` is required to make the cross-validation process reproducible. \nNow, we just have to do some housekeeping and keep only variables that Lasso does *not* set to zero. We illustrate the results in a heat map in @fig-1405.\\index{Graph!Heat map}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselected_factors |>\n unnest(selected_variables) |>\n filter(\n term != \"(Intercept)\",\n estimate != 0\n ) |>\n add_count(term) |>\n mutate(\n term = str_remove_all(term, \"NA|ff_|q_\"),\n term = str_replace_all(term, \"_x_\", \" \"),\n term = fct_reorder(as_factor(term), n),\n term = fct_lump_min(term, min = 2),\n selected = 1\n ) |>\n filter(term != \"Other\") |>\n mutate(term = fct_drop(term)) |>\n complete(industry, term, fill = list(selected = 0)) |>\n ggplot(aes(industry,\n term,\n fill = as_factor(selected)\n )) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 70)) +\n scale_fill_manual(values = c(\"white\", \"grey30\")) +\n theme(legend.position = \"None\") +\n labs(\n x = NULL, y = NULL,\n title = \"Selected variables for different industries\"\n )\n```\n\n::: {.cell-output-display}\n![Grey areas indicate that the estimated Lasso regression coefficient is not set to zero. White fields show which variables get assigned a value of exactly zero.](factor-selection-via-machine-learning_files/figure-html/fig-1405-1.png){#fig-1405 fig-alt='Title: Selected variables for different industries. The figure shows which factors and macroeconomic predictors the Lasso model selected for the different industries. In general, there are not many selected variables. The market excess return is selected across all industries except for utilities.' width=2100}\n:::\n:::\n\n\nThe heat map in @fig-1405 conveys two main insights. 
\nFirst, we see a lot of white, which means that many factors, macroeconomic variables, and interaction terms are not relevant for explaining the cross-section of returns across the industry portfolios. In fact, only the market factor and the return-on-equity factor play a role for several industries. Second, there seems to be quite some heterogeneity across different industries. While barely any variable is selected by Lasso for Utilities, many factors are selected for, e.g., High-Tech and Durable, but they do not coincide at all. \nIn other words, there seems to be a clear picture that we do not need many factors, but Lasso does not provide a factor that consistently provides pricing abilities across industries.\n\n## Exercises \n\n1. Write a function that requires three inputs, namely, `y` (a $T$ vector), `X` (a $(T \\times K)$ matrix), and `lambda` and then returns the Ridge estimator (a $K$ vector) for a given penalization parameter $\\lambda$. Recall that the intercept should not be penalized. Therefore, your function should indicate whether $X$ contains a vector of ones as the first column, which should be exempt from the $L_2$ penalty.\n1. Compute the $L_2$ norm ($\\beta'\\beta$) for the regression coefficients based on the predictive regression from the previous exercise for a range of $\\lambda$'s and illustrate the effect of penalization in a suitable figure.\n1. Now, write a function that requires three inputs, namely,`y` (a $T$ vector), `X` (a $(T \\times K)$ matrix), and 'lambda` and then returns the Lasso estimator (a $K$ vector) for a given penalization parameter $\\lambda$. Recall that the intercept should not be penalized. Therefore, your function should indicate whether $X$ contains a vector of ones as the first column, which should be exempt from the $L_1$ penalty.\n1. After you understand what Ridge and Lasso regressions are doing, familiarize yourself with the `glmnet()` package's documentation. It is a thoroughly tested and well-established package that provides efficient code to compute the penalized regression coefficients for Ridge and Lasso and for combinations, commonly called *Elastic Nets*. ", + "markdown": "---\ntitle: Factor Selection via Machine Learning\naliases:\n - ../factor-selection-via-machine-learning.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Factor Selection via Machine Learning with R\n description-meta: Use machine learning tools such as Lasso and Ridge regressions to identify asset pricing factors using the programming language R.\n---\n\n\nThe aim of this chapter is twofold. From a data science perspective, we introduce `tidymodels`, a collection of packages for modeling and machine learning (ML) using `tidyverse` principles. `tidymodels` comes with a handy workflow for all sorts of typical prediction tasks. From a finance perspective, we address the notion of *factor zoo* [@Cochrane2011] using ML methods. We introduce Lasso and Ridge regression as a special case of penalized regression models. Then, we explain the concept of cross-validation for model *tuning* with Elastic Net regularization as a popular example. We implement and showcase the entire cycle from model specification, training, and forecast evaluation within the `tidymodels` universe. 
While the tools can generally be applied to an abundance of interesting asset pricing problems, we apply penalized regressions for identifying macroeconomic variables and asset pricing factors that help explain a cross-section of industry portfolios.\n\nIn previous chapters, we illustrate that stock characteristics such as size provide valuable pricing information in addition to the market beta.\\index{Factor zoo}\\index{Factor model}\\index{CAPM} \nSuch findings question the usefulness of the Capital Asset Pricing Model. \nIn fact, during the last decades, financial economists have discovered a plethora of additional factors which may be correlated with the marginal utility of consumption (and would thus deserve a prominent role in pricing applications). The search for factors that explain the cross-section of expected stock returns has produced hundreds of potential candidates, as noted more recently by @Harvey2016, @Mclean2016, and @Hou2020.\nTherefore, given the multitude of proposed risk factors, the challenge these days is rather: *do we believe in the relevance of 300+ risk factors?* In recent years, promising methods from the field of ML have been applied to common finance applications. We refer to @Mullainathan2017 for a treatment of ML from the perspective of an econometrician, @Nagel2021 for an excellent review of ML practices in asset pricing, @Easley2021 for ML applications in (high-frequency) market microstructure, and @Dixon2020 for a detailed treatment of all methodological aspects. \n\n## Brief Theoretical Background\n\nThis is a book about *doing* empirical work in a tidy manner, and we refer to any of the many excellent textbook treatments of ML methods and especially penalized regressions for some deeper discussion. Excellent material is provided, for instance, by @Hastie2009, @Hastie2013, and @DePrado2018. Instead, we briefly summarize the idea of Lasso and Ridge regressions as well as the more general Elastic Net. Then, we turn to the fascinating question of *how* to implement, tune, and use such models with the `tidymodels` workflow.\n\nTo set the stage, we start with the definition of a linear model: suppose we have data $(y_t, x_t), t = 1,\\ldots, T$, where $x_t$ is a $(K \\times 1)$ vector of regressors and $y_t$ is the response for observation $t$. \nThe linear model takes the form $y_t = \\beta' x_t + \\varepsilon_t$ with some error term $\\varepsilon_t$ and has been studied in abundance. The well-known ordinary least squares (OLS) estimator for the $(K \\times 1)$ vector $\\beta$ minimizes the sum of squared residuals and is then $$\\hat{\\beta}^\\text{ols} = \\left(\\sum\\limits_{t=1}^T x_t x_t'\\right)^{-1} \\sum\\limits_{t=1}^T x_t y_t.$$ \n\\index{Regression!OLS}\n\nWhile we are often interested in the estimated coefficient vector $\\hat\\beta^\\text{ols}$, ML is about the predictive performance most of the time. For a new observation $\\tilde{x}_t$, the linear model generates predictions such that $$\\hat y_t = E\\left(y|x_t = \\tilde x_t\\right) = \\hat\\beta^\\text{ols}{}' \\tilde x_t.$$ \nIs this the best we can do? \nNot really: instead of minimizing the sum of squared residuals, penalized linear models can improve predictive performance by choosing other estimators $\\hat{\\beta}$ with lower variance than the estimator $\\hat\\beta^\\text{ols}$. \nAt the same time, it seems appealing to restrict the set of regressors to a few meaningful ones if possible. 
In other words, if $K$ is large (such as for the number of proposed factors in the asset pricing literature), it may be a desirable feature to *select* reasonable factors and set $\\hat\\beta^{\\text{ols}}_k = 0$ for some redundant factors. \n\nIt should be clear that the promised benefits of penalized regressions, i.e., reducing the mean squared error (MSE), come at a cost. In most cases, reducing the variance of the estimator introduces a bias such that $E\\left(\\hat\\beta\\right) \\neq \\beta$. What is the effect of such a bias-variance trade-off? To understand the implications, assume the following data-generating process for $y$: $$y = f(x) + \\varepsilon, \\quad \\varepsilon \\sim (0, \\sigma_\\varepsilon^2)$$ We want to recover $f(x)$, which denotes some unknown functional which maps the relationship between $x$ and $y$. While the properties of $\\hat\\beta^\\text{ols}$ as an unbiased estimator may be desirable under some circumstances, they are certainly not if we consider predictive accuracy. Alternative predictors $\\hat{f}(x)$ could be more desirable: For instance, the MSE depends on our model choice as follows: \\index{MSE} $$\\begin{aligned}\nMSE &=E((y-\\hat{f}(x))^2)=E((f(x)+\\epsilon-\\hat{f}(x))^2)\\\\\n&= \\underbrace{E((f(x)-\\hat{f}(x))^2)}_{\\text{total quadratic error}}+\\underbrace{E(\\epsilon^2)}_{\\text{irreducible error}} \\\\\n&= E\\left(\\hat{f}(x)^2\\right)+E\\left(f(x)^2\\right)-2E\\left(f(x)\\hat{f}(x)\\right)+\\sigma_\\varepsilon^2\\\\\n&=E\\left(\\hat{f}(x)^2\\right)+f(x)^2-2f(x)E\\left(\\hat{f}(x)\\right)+\\sigma_\\varepsilon^2\\\\\n&=\\underbrace{\\text{Var}\\left(\\hat{f}(x)\\right)}_{\\text{variance of model}}+ \\underbrace{E\\left((f(x)-\\hat{f}(x))\\right)^2}_{\\text{squared bias}} +\\sigma_\\varepsilon^2. \n\\end{aligned}$$ While no model can reduce $\\sigma_\\varepsilon^2$, a biased estimator with small variance may have a lower MSE than an unbiased estimator.\n\n### Ridge regression\n\n\\index{Regression!Ridge}\n\nOne biased estimator is known as Ridge regression. @Hoerl1970 propose to minimize the sum of squared errors *while simultaneously imposing a penalty on the $L_2$ norm of the parameters* $\\hat\\beta$. Formally, this means that for a penalty factor $\\lambda\\geq 0$ the minimization problem takes the form $\\min_\\beta \\left(y - X\\beta\\right)'\\left(y - X\\beta\\right)\\text{ s.t. } \\beta'\\beta \\leq c$. Here $c\\geq 0$ is a constant that depends on the choice of $\\lambda$. The larger $\\lambda$, the smaller $c$ (technically speaking, there is a one-to-one relationship between $\\lambda$, which corresponds to the Lagrangian of the minimization problem above and $c$). Here, $X = \\left(x_1 \\ldots x_T\\right)'$ and $y = \\left(y_1, \\ldots, y_T\\right)'$. A closed-form solution for the resulting regression coefficient vector $\\beta^\\text{ridge}$ exists: $$\\hat{\\beta}^\\text{ridge} = \\left(X'X + \\lambda I\\right)^{-1}X'y.$$ A couple of observations are worth noting: $\\hat\\beta^\\text{ridge} = \\hat\\beta^\\text{ols}$ for $\\lambda = 0$ and $\\hat\\beta^\\text{ridge} \\rightarrow 0$ for $\\lambda\\rightarrow \\infty$. Also for $\\lambda > 0$, $\\left(X'X + \\lambda I\\right)$ is non-singular even if $X'X$ is which means that $\\hat\\beta^\\text{ridge}$ exists even if $\\hat\\beta$ is not defined. 
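\n\nThe closed-form solution is easy to verify with a few lines of base R. The sketch below uses simulated data and an arbitrary value of $\\lambda$; it also confirms that the Ridge estimator collapses to the OLS estimator for $\\lambda = 0$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Closed-form Ridge estimator on simulated data\nset.seed(1234)\nX <- matrix(rnorm(100 * 3), ncol = 3)\ny <- X %*% c(1, 0.5, 0) + rnorm(100)\n\nridge_estimator <- function(X, y, lambda = 0) {\n  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)\n}\n\ncbind(\n  ols = as.numeric(ridge_estimator(X, y, lambda = 0)),\n  ridge = as.numeric(ridge_estimator(X, y, lambda = 10))\n)\n```\n:::\n\n\n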
However, note also that the Ridge estimator requires careful choice of the hyperparameter $\\lambda$ which controls the *amount of regularization*: a larger value of $\\lambda$ implies more *shrinkage* of the regression coefficients toward 0, while a smaller value of $\\lambda$ reduces the bias of the resulting estimator. \n\n::: {.calloutnote}\nNote that $X$ usually contains an intercept column with ones. As a general rule, the associated intercept coefficient is not penalized. In practice, this often implies that $y$ is simply demeaned before computing $\\hat\\beta^\\text{ridge}$.\n::: \n\nWhat about the statistical properties of the Ridge estimator? First, the bad news is that $\\hat\\beta^\\text{ridge}$ is a biased estimator of $\\beta$. However, the good news is that (under homoscedastic error terms) the variance of the Ridge estimator is guaranteed to be *smaller* than the variance of the ordinary least squares estimator. We encourage you to verify these two statements in the exercises. As a result, we face a trade-off: The Ridge regression sacrifices some unbiasedness to achieve a smaller variance than the OLS estimator.\n\n### Lasso\n\n\\index{Regression!Lasso}\n\nAn alternative to Ridge regression is the Lasso (*l*east *a*bsolute *s*hrinkage and *s*election *o*perator). Similar to Ridge regression, the Lasso [@Tibshirani1996] is a penalized and biased estimator. \nThe main difference from Ridge regression is that Lasso does not only *shrink* coefficients but effectively selects variables by setting coefficients for *irrelevant* variables to zero. Lasso implements an $L_1$ penalization on the parameters such that: $$\\hat\\beta^\\text{Lasso} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right)\\text{ s.t. } \\sum\\limits_{k=1}^K|\\beta_k| < c(\\lambda).$$ There is no closed-form solution for $\\hat\\beta^\\text{Lasso}$ in the above minimization problem, but efficient algorithms exist (e.g., the R package `glmnet`). Like for Ridge regression, the hyperparameter $\\lambda$ has to be specified beforehand.\n\n### Elastic Net\n\nThe Elastic Net [@Zou2005] combines $L_1$ with $L_2$ penalization and encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. This more general framework considers the following optimization problem: $$\\hat\\beta^\\text{EN} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right) + \\lambda\\rho\\sum\\limits_{k=1}^K|\\beta_k| +\\frac{1}{2}\\lambda(1-\\rho)\\sum\\limits_{k=1}^K\\beta_k^2.$$ Now, we have to choose two hyperparameters: the *shrinkage* factor $\\lambda$ and the *weighting parameter* $\\rho$. The Elastic Net resembles the Lasso for $\\rho = 1$ and Ridge regression for $\\rho = 0$, which mirrors the parametrization of `glmnet` and of the `mixture` argument that we use below. While the R package `glmnet` provides efficient algorithms to compute the coefficients of penalized regressions, it is a good exercise to implement Ridge and Lasso estimation on your own before you use the `glmnet` package or the `tidymodels` back-end.\n\n## Data Preparation\n\nTo get started, we load the required R packages and data. The main focus is on the workflow behind the `tidymodels` package collection [@tidymodels]. \n@Kuhn2022 provide a thorough introduction to all `tidymodels` components. `glmnet` [@glmnet], co-developed by the author of @Tibshirani1996, provides an R implementation of Elastic Net estimation. The package `timetk` [@timetk] provides useful tools for time series data wrangling. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RSQLite)\nlibrary(tidyverse)\nlibrary(tidymodels)\nlibrary(scales)\nlibrary(furrr)\nlibrary(glmnet)\nlibrary(timetk)\n```\n:::\n\n\nIn this analysis, we use four different data sources that we load from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). We start with two different sets of factor portfolio returns which have been suggested as representing practical risk factor exposure and thus should be relevant when it comes to asset pricing applications. \n\n- The standard workhorse: monthly Fama-French 3 factor returns (market, small-minus-big, and high-minus-low book-to-market valuation sorts) defined in @Fama1992 and @Fama1993\n- Monthly q-factor returns from @Hou2015. The factors contain the size factor, the investment factor, the return-on-equity factor, and the expected growth factor\n\nNext, we include macroeconomic predictors which may predict the general stock market economy. Macroeconomic variables effectively serve as conditioning information such that their inclusion hints at the relevance of conditional models instead of unconditional asset pricing. We refer the interested reader to @Cochrane2009 on the role of conditioning information.\n\n- Our set of macroeconomic predictors comes from @Goyal2008. The data has been updated by the authors until 2021 and contains monthly variables that have been suggested as good predictors for the equity premium. Some of the variables are the dividend price ratio, earnings price ratio, stock variance, net equity expansion, treasury bill rate, and inflation\n\nFinally, we need a set of *test assets*. The aim is to understand which of the plenty factors and macroeconomic variable combinations prove helpful in explaining our test assets' cross-section of returns. 
\nIn line with many existing papers, we use monthly portfolio returns from 10 different industries according to the definition from [Kenneth French's homepage](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_10_ind_port.html) as test assets.\\index{Data!Fama-French factors}\\index{Data!q-factors}\\index{Data!Macro predictors}\\index{Data!Industry portfolios}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n collect() |>\n rename_with(~ str_c(\"factor_ff_\", .), -month)\n\nfactors_q_monthly <- tbl(tidy_finance, \"factors_q_monthly\") |>\n collect() |>\n rename_with(~ str_c(\"factor_q_\", .), -month)\n\nmacro_predictors <- tbl(tidy_finance, \"macro_predictors\") |>\n collect() |>\n rename_with(~ str_c(\"macro_\", .), -month) |>\n select(-macro_rp_div)\n\nindustries_ff_monthly <- tbl(tidy_finance, \"industries_ff_monthly\") |>\n collect() |>\n pivot_longer(-month,\n names_to = \"industry\", values_to = \"ret\"\n ) |>\n arrange(desc(industry)) |> \n mutate(industry = as_factor(industry))\n```\n:::\n\n\nWe combine all the monthly observations into one data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- industries_ff_monthly |>\n left_join(factors_ff3_monthly, join_by(month)) |>\n left_join(factors_q_monthly, join_by(month)) |>\n left_join(macro_predictors, join_by(month)) |>\n mutate(\n ret = ret - factor_ff_rf\n ) |>\n select(month, industry, ret_excess = ret, everything()) |>\n drop_na()\n```\n:::\n\n\nOur data contains 22 columns of regressors with the 13 macro variables and 8 factor returns for each month. \n@fig-1401 provides summary statistics for the 10 monthly industry excess returns in percent.\\index{Graph!Box plot}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata |>\n group_by(industry) |>\n ggplot(aes(x = industry, y = ret_excess)) +\n geom_boxplot() +\n coord_flip() +\n labs(\n x = NULL, y = NULL,\n title = \"Excess return distributions by industry in percent\"\n ) +\n scale_y_continuous(\n labels = percent\n )\n```\n\n::: {.cell-output-display}\n![The box plots show the monthly dispersion of returns for 10 different industries.](factor-selection-via-machine-learning_files/figure-html/fig-1401-1.png){#fig-1401 fig-alt='Title: Excess return distributions by industry in percent. The figure shows boxplots that visualize the industry\\'s excess return distribution. All industry returns are centered around zero and exhibit substantial outliers in the magnitude of 20 percent on a monthly basis.' width=2100}\n:::\n:::\n\n\n## The tidymodels Workflow\n\nTo illustrate penalized linear regressions, we employ the `tidymodels` collection of packages for modeling and ML using `tidyverse` principles. You can simply use `install.packages(\"tidymodels\")` to get access to all the related packages. We recommend checking out the work of @Kuhn2022: They continuously write on their great book ['Tidy Modeling with R'](https://www.tmwr.org/) using tidy principles.\n\nThe `tidymodels` workflow encompasses the main stages of the modeling process: pre-processing of data, model fitting, and post-processing of results. 
As we demonstrate below, `tidymodels` provides efficient workflows that you can update with low effort.\n\nUsing the ideas of Ridge and Lasso regressions, the following example guides you through (i) pre-processing the data (data split and variable mutation), (ii) building models, (iii) fitting models, and (iv) tuning models to create the \"best\" possible predictions.\n\nTo start, we restrict our analysis to just one industry: Manufacturing. We first split the sample into a *training* and a *test* set. \nFor that purpose, `tidymodels` provides the function `initial_time_split()` from the `rsample` package [@rsample]. \nThe split takes the last 20% of the data as a test set, which is not used for any model tuning. \nWe use this test set to evaluate the predictive accuracy in an out-of-sample scenario.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsplit <- initial_time_split(\n data |>\n filter(industry == \"manuf\") |>\n select(-industry),\n prop = 4 / 5\n)\nsplit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n<536/135/671>\n```\n\n\n:::\n:::\n\n\nThe object `split` simply keeps track of the observations of the training and the test set. \nWe can call the training set with `training(split)`, while we can extract the test set with `testing(split)`.\n\n### Pre-process data\n\nRecipes help you pre-process your data before training your model. Recipes are a series of pre-processing steps such as variable selection, transformation, or conversion of qualitative predictors to indicator variables. Each recipe starts with a `formula` that defines the general structure of the dataset and the role of each variable (regressor or dependent variable). For our dataset, our recipe contains the following steps before we fit any model:\n\n- Our formula defines that we want to explain excess returns with all available predictors. The regression equation thus takes the form \n$$r_{t} = \\alpha_0 + \\left(\\tilde f_t \\otimes \\tilde z_t\\right)B + \\varepsilon_t $$ where $r_t$ is the vector of industry excess returns at time $t$ and $\\tilde f_t$ and $\\tilde z_t$ are the (standardized) vectors of factor portfolio returns and macroeconomic variables\n- We exclude the column *month* from the analysis\n- We include all interaction terms between factors and macroeconomic predictors\n- We demean and scale each regressor such that the standard deviation is one\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrec <- recipe(ret_excess ~ ., data = training(split)) |>\n step_rm(month) |>\n step_interact(terms = ~ contains(\"factor\"):contains(\"macro\")) |>\n step_normalize(all_predictors())\n```\n:::\n\n\nA table of all available recipe steps can be found [in the `tidymodels` documentation.](https://www.tidymodels.org/find/recipes/) As of 2024, more than 150 different processing steps are available! One important point: The definition of a recipe does not trigger any calculations yet but rather provides a *description* of the tasks to be applied. As a result, it is very easy to *reuse* recipes for different models and thus make sure that the outcomes are comparable as they are based on the same input. \nIn the example above, it does not make a difference whether you use the input `data = training(split)` or `data = testing(split)`. \nAll that matters at this early stage are the column names and types.\n\nWe can apply the recipe to any data with a suitable structure. The code below combines two different functions: `prep()` estimates the required parameters from a training set that can be applied to other data sets later. 
`bake()` applies the processed computations to new data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_prep <- prep(rec, training(split))\n```\n:::\n\nThe object `data_prep` contains information related to the different preprocessing steps applied to the training data: E.g., it is necessary to compute sample means and standard deviations to center and scale the variables. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_bake <- bake(data_prep,\n  new_data = testing(split)\n)\ndata_bake\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 135 × 126\n  factor_ff_mkt_excess factor_ff_smb factor_ff_hml factor_ff_rf\n                 <dbl>         <dbl>         <dbl>        <dbl>\n1               -1.73         -1.10         -0.711        -1.78\n2                2.35          0.949        -0.0937       -1.78\n3               -0.156        -0.122        -0.284        -1.78\n4                0.0646       -0.256         0.425        -1.78\n5                0.995         0.563        -0.456        -1.78\n# ℹ 130 more rows\n# ℹ 122 more variables: factor_q_me <dbl>, factor_q_ia <dbl>,\n#   factor_q_roe <dbl>, factor_q_eg <dbl>, macro_dp <dbl>,\n#   macro_dy <dbl>, macro_ep <dbl>, macro_de <dbl>,\n#   macro_svar <dbl>, macro_bm <dbl>, macro_ntis <dbl>,\n#   macro_tbl <dbl>, macro_lty <dbl>, macro_ltr <dbl>,\n#   macro_tms <dbl>, macro_dfy <dbl>, macro_infl <dbl>, …\n```\n\n\n:::\n:::\n\n\nNote that the resulting data contains the 135 observations from the test set and 126 columns. Why so many? Recall that the recipe instructs us to compute every possible interaction term between the factors and predictors, which increases the dimension of the data matrix substantially. \n\nYou may ask at this stage: why should I use a recipe instead of simply using data wrangling commands such as `mutate()` or `select()`? The beauty of `tidymodels` is that a lot is happening under the hood. Recall that for the simple scaling step, you actually have to compute the standard deviation of each column, then *store* this value, and apply the identical transformation to a different dataset, e.g., `testing(split)`. A prepped `recipe` stores these values and hands them over once you `bake()` a novel dataset. Easy as pie with `tidymodels`, isn't it?\n\n### Build a model\n\n\\index{Regression}\nNext, we can build an actual model based on our pre-processed data. In line with the definition above, we estimate regression coefficients of a Lasso regression such that we get \n$$\\begin{aligned}\\hat\\beta_\\lambda^\\text{Lasso} = \\arg\\min_\\beta \\left(Y - X\\beta\\right)'\\left(Y - X\\beta\\right) + \\lambda\\sum\\limits_{k=1}^K|\\beta_k|.\\end{aligned}$$ We want to emphasize that the `tidymodels` workflow for *any* model is very similar, irrespective of the specific model. As you will see further below, it is straightforward to fit Ridge regression coefficients and - later - Neural networks or Random forests with basically the same code. The structure is always as follows: create a so-called `workflow()` and use the `fit()` function. A table with all available model APIs is available [here.](https://www.tidymodels.org/find/parsnip/)\nFor now, we start with the linear regression model with a given value for the penalty factor $\\lambda$. In the setup below, `mixture` denotes the value of $\\rho$, hence setting `mixture = 1` implies the Lasso.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_model <- linear_reg(\n  penalty = 0.0001,\n  mixture = 1\n) |>\n  set_engine(\"glmnet\", intercept = FALSE)\n```\n:::\n\n\nThat's it - we are done! The object `lm_model` contains the definition of our model with all required information. Note that `set_engine(\"glmnet\")` indicates the API character of the `tidymodels` workflow: Under the hood, the package `glmnet` is doing the heavy lifting, while `linear_reg()` provides a unified framework to collect the inputs. 
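\n\nIf you are curious how this unified specification maps to the engine, `translate()` from `parsnip` shows the template of the `glmnet::glmnet()` call that is eventually executed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Show how the parsnip specification translates into the underlying glmnet call\nlm_model |>\n  translate()\n```\n:::\n\n\n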
The `workflow` ends with combining everything necessary for the serious data science workflow, namely, a recipe and a model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_fit <- workflow() |>\n add_recipe(rec) |>\n add_model(lm_model)\nlm_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n══ Workflow ═════════════════════════════════════════════════════════\nPreprocessor: Recipe\nModel: linear_reg()\n\n── Preprocessor ─────────────────────────────────────────────────────\n3 Recipe Steps\n\n• step_rm()\n• step_interact()\n• step_normalize()\n\n── Model ────────────────────────────────────────────────────────────\nLinear Regression Model Specification (regression)\n\nMain Arguments:\n penalty = 1e-04\n mixture = 1\n\nEngine-Specific Arguments:\n intercept = FALSE\n\nComputational engine: glmnet \n```\n\n\n:::\n:::\n\n\n### Fit a model\n\nWith the `workflow` from above, we are ready to use `fit()`. Typically, we use training data to fit the model. \nThe training data is pre-processed according to our recipe steps, and the Lasso regression coefficients are computed. \nFirst, we focus on the predicted values $\\hat{y}_t = x_t\\hat\\beta^\\text{Lasso}.$ @fig-1402 illustrates the projections for the *entire* time series of the manufacturing industry portfolio returns. The grey area indicates the out-of-sample period, which we did not use to fit the model.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npredicted_values <- lm_fit |>\n fit(data = training(split)) |>\n predict(data |> filter(industry == \"manuf\")) |>\n bind_cols(data |> filter(industry == \"manuf\")) |>\n select(month,\n \"Fitted value\" = .pred,\n \"Realization\" = ret_excess\n ) |>\n pivot_longer(-month, names_to = \"Variable\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npredicted_values |>\n ggplot(aes(\n x = month, \n y = value, \n color = Variable,\n linetype = Variable\n )) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n color = NULL,\n linetype = NULL,\n title = \"Monthly realized and fitted manufacturing industry risk premia\"\n ) +\n scale_x_date(\n breaks = function(x) {\n seq.Date(\n from = min(x),\n to = max(x),\n by = \"5 years\"\n )\n },\n minor_breaks = function(x) {\n seq.Date(\n from = min(x),\n to = max(x),\n by = \"1 years\"\n )\n },\n expand = c(0, 0),\n labels = date_format(\"%Y\")\n ) +\n scale_y_continuous(\n labels = percent\n ) +\n annotate(\"rect\",\n xmin = testing(split) |> pull(month) |> min(),\n xmax = testing(split) |> pull(month) |> max(),\n ymin = -Inf, ymax = Inf,\n alpha = 0.5, fill = \"grey70\"\n )\n```\n\n::: {.cell-output-display}\n![The grey area corresponds to the out of sample period.](factor-selection-via-machine-learning_files/figure-html/fig-1402-1.png){#fig-1402 fig-alt='Title: Monthly realized and fitted manufacturing industry risk premium. The figure shows the time series of realized and predicted manufacturing industry risk premiums. The figure seems to indicate that the predictions capture most of the return dynamics.' width=2100}\n:::\n:::\n\n\nWhat do the estimated coefficients look like? To analyze these values and to illustrate the difference between the `tidymodels` workflow and the underlying `glmnet` package, it is worth computing the coefficients $\\hat\\beta^\\text{Lasso}$ directly. The code below estimates the coefficients for the Lasso and Ridge regression for the processed training data sample. Note that `glmnet` actually takes a vector `y` and the matrix of regressors $X$ as input. 
Moreover, `glmnet` requires choosing the penalty parameter $\\alpha$, which corresponds to $\\rho$ in the notation above. When using the `tidymodels` model API, such details do not need consideration.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- data_bake |>\n select(-ret_excess) |>\n as.matrix()\ny <- data_bake |> pull(ret_excess)\n\nfit_lasso <- glmnet(\n x = x,\n y = y,\n alpha = 1,\n intercept = FALSE,\n standardize = FALSE,\n lambda.min.ratio = 0\n)\n\nfit_ridge <- glmnet(\n x = x,\n y = y,\n alpha = 0,\n intercept = FALSE,\n standardize = FALSE,\n lambda.min.ratio = 0\n)\n```\n:::\n\n\nThe objects `fit_lasso` and `fit_ridge` contain an entire sequence of estimated coefficients for multiple values of the penalty factor $\\lambda$. @fig-1403 illustrates the trajectories of the regression coefficients as a function of the penalty factor. Both Lasso and Ridge coefficients converge to zero as the penalty factor increases.\\index{Graph!ML prediction path}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbind_rows(\n tidy(fit_lasso) |> mutate(Model = \"Lasso\"),\n tidy(fit_ridge) |> mutate(Model = \"Ridge\")\n) |>\n rename(\"Variable\" = term) |>\n ggplot(aes(x = lambda, y = estimate, color = Variable)) +\n geom_line() +\n scale_x_log10() +\n facet_wrap(~Model, scales = \"free_x\") +\n labs(\n x = \"Penalty factor (lambda)\", y = NULL,\n title = \"Estimated coefficient paths for different penalty factors\"\n ) +\n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![The penalty parameters are chosen iteratively to resemble the path from no penalization to a model that excludes all variables.](factor-selection-via-machine-learning_files/figure-html/fig-1403-1.png){#fig-1403 fig-alt='Title: Estimated coefficient paths for different penalty factors. The figure shows how estimated lasso and ridge coefficients tend to zero for a higher penalty parameter. Ridge trace is smooth, and Lasso exhibits non-linear behavior.' width=2100}\n:::\n:::\n\n\n::: {.rmdnote}\nOne word of caution: The package `glmnet` computes estimates of the coefficients $\\hat\\beta$ based on numerical optimization procedures. \nAs a result, the estimated coefficients for the [special case](https://parsnip.tidymodels.org/reference/glmnet-details.html) with no regularization ($\\lambda = 0$) can deviate from the standard OLS estimates.\n:::\n\n### Tune a model\n\nTo compute $\\hat\\beta_\\lambda^\\text{Lasso}$ , we simply imposed a value for the penalty hyperparameter $\\lambda$. Model tuning is the process of optimally selecting such hyperparameters. `tidymodels` provides extensive tuning options based on so-called *cross-validation*. Again, we refer to any treatment of cross-validation to get a more detailed discussion of the statistical underpinnings. Here we focus on the general idea and the implementation with `tidymodels`. \n\nThe goal for choosing $\\lambda$ (or any other hyperparameter, e.g., $\\rho$ for the Elastic Net) is to find a way to produce predictors $\\hat{Y}$ for an outcome $Y$ that minimizes the mean squared prediction error $\\text{MSPE} = E\\left( \\frac{1}{T}\\sum_{t=1}^T (\\hat{y}_t - y_t)^2 \\right)$. Unfortunately, the MSPE is not directly observable. We can only compute an estimate because our data is random and because we do not observe the entire population.\n\nObviously, if we train an algorithm on the same data that we use to compute the error, our estimate $\\hat{\\text{MSPE}}$ would indicate way better predictive accuracy than what we can expect in real out-of-sample data. 
The result is called overfitting.\n\nCross-validation is a technique that allows us to alleviate this problem. We approximate the true MSPE as the average of many MSPE obtained by creating predictions for $K$ new random samples of the data, none of them used to train the algorithm $\\frac{1}{K} \\sum_{k=1}^K \\frac{1}{T}\\sum_{t=1}^T \\left(\\hat{y}_t^k - y_t^k\\right)^2$. In practice, this is done by carving out a piece of our data and pretending it is an independent sample. We again divide the data into a training set and a test set. The MSPE on the test set is our measure for actual predictive ability, while we use the training set to fit models with the aim to find the *optimal* hyperparameter values. To do so, we further divide our training sample into (several) subsets, fit our model for a grid of potential hyperparameter values (e.g., $\\lambda$), and evaluate the predictive accuracy on an *independent* sample. This works as follows:\n\n1. Specify a grid of hyperparameters\n2. Obtain predictors $\\hat{y}_i(\\lambda)$ to denote the predictors for the used parameters $\\lambda$\n3. Compute $$\n \\text{MSPE}(\\lambda) = \\frac{1}{K} \\sum_{k=1}^K \\frac{1}{T}\\sum_{t=1}^T \\left(\\hat{y}_t^k(\\lambda) - y_t^k\\right)^2 \n $$ With K-fold cross-validation, we do this computation $K$ times. Simply pick a validation set with $M=T/K$ observations at random and think of these as random samples $y_1^k, \\dots, y_{\\tilde{T}}^k$, with $k=1$\n\nHow should you pick $K$? Large values of $K$ are preferable because the training data better imitates the original data. However, larger values of $K$ will have much higher computation time.\n`tidymodels` provides all required tools to conduct $K$-fold cross-validation. We just have to update our model specification and let `tidymodels` know which parameters to tune. In our case, we specify the penalty factor $\\lambda$ as well as the mixing factor $\\rho$ as *free* parameters. Note that it is simple to change an existing `workflow` with `update_model()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_model <- linear_reg(\n penalty = tune(),\n mixture = tune()\n) |>\n set_engine(\"glmnet\")\n\nlm_fit <- lm_fit |>\n update_model(lm_model)\n```\n:::\n\n\nFor our sample, we consider a time-series cross-validation sample. This means that we tune our models with 20 random samples of length five years with a validation period of four years. For a grid of possible hyperparameters, we then fit the model for each fold and evaluate $\\hat{\\text{MSPE}}$ in the corresponding validation set. Finally, we select the model specification with the lowest MSPE in the validation set. First, we define the cross-validation folds based on our training data only.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_folds <- time_series_cv(\n data = training(split),\n date_var = month,\n initial = \"5 years\",\n assess = \"48 months\",\n cumulative = FALSE,\n slice_limit = 20\n)\ndata_folds\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# Time Series Cross Validation Plan \n# A tibble: 20 × 2\n splits id \n \n1 Slice01\n2 Slice02\n3 Slice03\n4 Slice04\n5 Slice05\n# ℹ 15 more rows\n```\n\n\n:::\n:::\n\n\nThen, we evaluate the performance for a grid of different penalty values. `tidymodels` provides functionalities to construct a suitable grid of hyperparameters with `grid_regular`. The code chunk below creates a $10 \\times 3$ hyperparameters grid. 
Then, the function `tune_grid()` evaluates all the models for each fold.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlm_tune <- lm_fit |>\n tune_grid(\n resample = data_folds,\n grid = grid_regular(penalty(), mixture(), levels = c(20, 3)),\n metrics = metric_set(rmse)\n )\n```\n:::\n\n\nAfter the tuning process, we collect the evaluation metrics (the root mean-squared error in our example) to identify the *optimal* model. @fig-1404 illustrates the average validation set's root mean-squared error for each value of $\\lambda$ and $\\rho$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(lm_tune) + \n aes(linetype = `Proportion of Lasso Penalty`) + \n guides(linetype = \"none\") +\n labs(\n x = \"Penalty factor (lambda)\",\n y = \"Root MSPE\",\n title = \"Root MSPE for different penalty factors\"\n ) + \n scale_x_log10()\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nScale for x is already present.\nAdding another scale for x, which will replace the existing scale.\n```\n\n\n:::\n\n::: {.cell-output-display}\n![Evaluation of manufacturing excess returns for different penalty factors (lambda) and proportions of Lasso penalty (rho). 1.0 indicates Lasso, 0.5 indicates Elastic Net, and 0.0 indicates Ridge.](factor-selection-via-machine-learning_files/figure-html/fig-1404-1.png){#fig-1404 fig-alt='Title: Root MSPE for different penalty factors. The figure shows that more regularization does not affect the selected models in a meaningful fashion. At some point, the Elastic Net prediction error drops, which indicates the selected model. MSPE increases again for high penalization values.' width=2100}\n:::\n:::\n\n\n@fig-1404 shows that the cross-validated MSPE drops for Lasso and Elastic Net and spikes afterward. For Ridge regression, the MSPE increases above a certain threshold. Recall that the larger the regularization, the more restricted the model becomes. Thus, we would choose the model with the lowest MSPE.\n\n### Parallelized workflow\n\nOur starting point was the question: Which factors determine industry returns? While @Avramov2022b provide a Bayesian analysis related to the research question above, we choose a simplified approach: To illustrate the entire workflow, we now run the penalized regressions for all ten industries. \nWe want to identify relevant variables by fitting Lasso models for each industry returns time series. More specifically, we perform cross-validation for each industry to identify the optimal penalty factor $\\lambda$. \nThen, we use the set of `finalize_*()`-functions that take a list or tibble of tuning parameter values and update objects with those values. After determining the best model, we compute the final fit on the entire training set and analyze the estimated coefficients. \n\nFirst, we define the Lasso model with one tuning parameter.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlasso_model <- linear_reg(\n penalty = tune(),\n mixture = 1\n) |>\n set_engine(\"glmnet\")\n\nlm_fit <- lm_fit |>\n update_model(lasso_model)\n```\n:::\n\n\nThe following task can be easily parallelized to reduce computing time substantially. We use the parallelization capabilities of `furrr`. 
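In case you have not worked with `furrr` before, the idea is that `future_map()` and its variants mirror the familiar `purrr` interface but distribute the calls across the workers registered via `plan()`; the toy sketch below (with made-up inputs and two workers) illustrates the mechanic before we apply it to the ten industries.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy illustration: future_map_dbl() behaves like purrr::map_dbl(), but the\n# elements are processed in parallel on the workers registered via plan()\nlibrary(future)\nlibrary(furrr)\n\nplan(multisession, workers = 2)\nfuture_map_dbl(c(4, 9, 16), sqrt)\nplan(sequential)\n```\n:::\n\n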
Note that we can also just recycle all the steps from above and collect them in a function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselect_variables <- function(input) {\n # Split into training and testing data\n split <- initial_time_split(input, prop = 4 / 5)\n\n # Data folds for cross-validation\n data_folds <- time_series_cv(\n data = training(split),\n date_var = month,\n initial = \"5 years\",\n assess = \"48 months\",\n cumulative = FALSE,\n slice_limit = 20\n )\n\n # Model tuning with the Lasso model\n lm_tune <- lm_fit |>\n tune_grid(\n resample = data_folds,\n grid = grid_regular(penalty(), levels = c(10)),\n metrics = metric_set(rmse)\n )\n\n # Identify the best model and fit with the training data\n lasso_lowest_rmse <- lm_tune |> select_by_one_std_err(\"rmse\")\n lasso_final <- finalize_workflow(lm_fit, lasso_lowest_rmse)\n lasso_final_fit <- last_fit(lasso_final, split, metrics = metric_set(rmse))\n\n # Extract the estimated coefficients\n estimated_coefficients <- lasso_final_fit |>\n extract_fit_parsnip() |>\n tidy() |>\n mutate(\n term = str_remove_all(term, \"factor_|macro_|industry_\")\n )\n\n return(estimated_coefficients)\n}\n\n# Parallelization\nplan(multisession, workers = availableCores())\n\n# Computation by industry\nselected_factors <- data |>\n nest(data = -industry) |>\n mutate(selected_variables = future_map(\n data, select_variables,\n .options = furrr_options(seed = TRUE)\n ))\n```\n:::\n\n\nWhat has just happened? In principle, exactly the same as before but instead of computing the Lasso coefficients for one industry, we did it for ten in parallel. The final option `seed = TRUE` is required to make the cross-validation process reproducible. \nNow, we just have to do some housekeeping and keep only variables that Lasso does *not* set to zero. We illustrate the results in a heat map in @fig-1405.\\index{Graph!Heat map}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nselected_factors |>\n unnest(selected_variables) |>\n filter(\n term != \"(Intercept)\",\n estimate != 0\n ) |>\n add_count(term) |>\n mutate(\n term = str_remove_all(term, \"NA|ff_|q_\"),\n term = str_replace_all(term, \"_x_\", \" \"),\n term = fct_reorder(as_factor(term), n),\n term = fct_lump_min(term, min = 2),\n selected = 1\n ) |>\n filter(term != \"Other\") |>\n mutate(term = fct_drop(term)) |>\n complete(industry, term, fill = list(selected = 0)) |>\n ggplot(aes(industry,\n term,\n fill = as_factor(selected)\n )) +\n geom_tile() +\n scale_x_discrete(guide = guide_axis(angle = 70)) +\n scale_fill_manual(values = c(\"white\", \"grey30\")) +\n theme(legend.position = \"None\") +\n labs(\n x = NULL, y = NULL,\n title = \"Selected variables for different industries\"\n )\n```\n\n::: {.cell-output-display}\n![Grey areas indicate that the estimated Lasso regression coefficient is not set to zero. White fields show which variables get assigned a value of exactly zero.](factor-selection-via-machine-learning_files/figure-html/fig-1405-1.png){#fig-1405 fig-alt='Title: Selected variables for different industries. The figure shows which factors and macroeconomic predictors the Lasso model selected for the different industries. In general, there are not many selected variables. The market excess return is selected across all industries except for utilities.' width=2100}\n:::\n:::\n\n\nThe heat map in @fig-1405 conveys two main insights. 
\nFirst, we see a lot of white, which means that many factors, macroeconomic variables, and interaction terms are not relevant for explaining the cross-section of returns across the industry portfolios. In fact, only the market factor and the return-on-equity factor play a role for several industries. Second, there seems to be quite some heterogeneity across different industries. While barely any variable is selected by Lasso for Utilities, many factors are selected for, e.g., High-Tech and Durable, but they do not coincide at all. \nIn other words, there seems to be a clear picture that we do not need many factors, but Lasso does not provide a factor that consistently provides pricing abilities across industries.\n\n## Exercises \n\n1. Write a function that requires three inputs, namely, `y` (a $T$ vector), `X` (a $(T \\times K)$ matrix), and `lambda` and then returns the Ridge estimator (a $K$ vector) for a given penalization parameter $\\lambda$. Recall that the intercept should not be penalized. Therefore, your function should indicate whether $X$ contains a vector of ones as the first column, which should be exempt from the $L_2$ penalty.\n1. Compute the $L_2$ norm ($\\beta'\\beta$) for the regression coefficients based on the predictive regression from the previous exercise for a range of $\\lambda$'s and illustrate the effect of penalization in a suitable figure.\n1. Now, write a function that requires three inputs, namely,`y` (a $T$ vector), `X` (a $(T \\times K)$ matrix), and 'lambda` and then returns the Lasso estimator (a $K$ vector) for a given penalization parameter $\\lambda$. Recall that the intercept should not be penalized. Therefore, your function should indicate whether $X$ contains a vector of ones as the first column, which should be exempt from the $L_1$ penalty.\n1. After you understand what Ridge and Lasso regressions are doing, familiarize yourself with the `glmnet()` package's documentation. It is a thoroughly tested and well-established package that provides efficient code to compute the penalized regression coefficients for Ridge and Lasso and for combinations, commonly called *Elastic Nets*. ", "supporting": [ "factor-selection-via-machine-learning_files" ], diff --git a/_freeze/r/fama-macbeth-regressions/execute-results/html.json b/_freeze/r/fama-macbeth-regressions/execute-results/html.json index 0f1e7e23..f1c279be 100644 --- a/_freeze/r/fama-macbeth-regressions/execute-results/html.json +++ b/_freeze/r/fama-macbeth-regressions/execute-results/html.json @@ -3,9 +3,7 @@ "result": { "engine": "knitr", "markdown": "---\ntitle: Fama-MacBeth Regressions\naliases:\n - ../fama-macbeth-regressions.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Fama-MacBeth Regressions with R\n description-meta: Estimate risk premiums via Fama-MacBeth regressions using the programming language R.\n---\n\n\nIn this chapter, we present a simple implementation of @Fama1973, a regression approach commonly called Fama-MacBeth regressions. Fama-MacBeth regressions are widely used in empirical asset pricing studies. We use individual stocks as test assets to estimate the risk premium associated with the three factors included in @Fama1993.\n\nResearchers use the two-stage regression approach to estimate risk premiums in various markets, but predominately in the stock market. \nEssentially, the two-step Fama-MacBeth regressions exploit a linear relationship between expected returns and exposure to (priced) risk factors. 
\nThe basic idea of the regression approach is to project asset returns on factor exposures or characteristics that resemble exposure to a risk factor in the cross-section in each time period. \nThen, in the second step, the estimates are aggregated across time to test if a risk factor is priced. \nIn principle, Fama-MacBeth regressions can be used in the same way as portfolio sorts introduced in previous chapters.\n\n\\index{Regression!Fama-MacBeth}\\index{Fama-MacBeth} The Fama-MacBeth procedure is a simple two-step approach: \nThe first step uses the exposures (characteristics) as explanatory variables in $T$ cross-sectional regressions. For example, if $r_{i,t+1}$ denote the excess returns of asset $i$ in month $t+1$, then the famous Fama-French three factor model implies the following return generating process [see also @Campbell1998]:\n$$\\begin{aligned}r_{i,t+1} = \\alpha_i + \\lambda^{M}_t \\beta^M_{i,t} + \\lambda^{SMB}_t \\beta^{SMB}_{i,t} + \\lambda^{HML}_t \\beta^{HML}_{i,t} + \\epsilon_{i,t}.\\end{aligned}$$ \nHere, we are interested in the compensation $\\lambda^{f}_t$ for the exposure to each risk factor $\\beta^{f}_{i,t}$ at each time point, i.e., the risk premium. Note the terminology: $\\beta^{f}_{i,t}$ is a asset-specific characteristic, e.g., a factor exposure or an accounting variable. *If* there is a linear relationship between expected returns and the characteristic in a given month, we expect the regression coefficient to reflect the relationship, i.e., $\\lambda_t^{f}\\neq0$. \n\nIn the second step, the time-series average $\\frac{1}{T}\\sum_{t=1}^T \\hat\\lambda^{f}_t$ of the estimates $\\hat\\lambda^{f}_t$ can then be interpreted as the risk premium for the specific risk factor $f$. We follow @Zaffaroni2022 and consider the standard cross-sectional regression to predict future returns. If the characteristics are replaced with time $t+1$ variables, then the regression approach captures risk attributes rather than risk premiums. \n\nBefore we move to the implementation, we want to highlight that the characteristics, e.g., $\\hat\\beta^{f}_{i}$, are often estimated in a separate step before applying the actual Fama-MacBeth methodology. You can think of this as a *step 0*. You might thus worry that the errors of $\\hat\\beta^{f}_{i}$ impact the risk premiums' standard errors. Measurement error in $\\hat\\beta^{f}_{i}$ indeed affects the risk premium estimates, i.e., they lead to biased estimates. The literature provides adjustments for this bias [see, e.g., @Shanken1992; @Kim1995; @Chen2015, among others] but also shows that the bias goes to zero as $T \\to \\infty$. We refer to @Gagliardini2016 for an in-depth discussion also covering the case of time-varying betas. Moreover, if you plan to use Fama-MacBeth regressions with individual stocks: @Hou2020 advocates using weighed-least squares to estimate the coefficients such that they are not biased toward small firms. Without this adjustment, the high number of small firms would drive the coefficient estimates.\n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(sandwich)\nlibrary(broom)\n```\n:::\n\n\n## Data Preparation\n\nWe illustrate @Fama1973 with the monthly CRSP sample and use three characteristics to explain the cross-section of returns: market capitalization, the book-to-market ratio, and the CAPM beta (i.e., the covariance of the excess stock returns with the market excess returns). 
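Before we turn to the actual data, it may help to see the two-step logic in miniature: the sketch below simulates a toy panel with a single characteristic and a known premium of 0.5 (purely made-up numbers, not the CRSP sample) and recovers the premium with the same nest-and-regress pattern that we apply to the real data later in this chapter.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy Fama-MacBeth: run T cross-sectional regressions, then average the slopes\nset.seed(42)\n\ntoy_panel <- expand_grid(month = 1:120, permno = 1:50) |>\n  mutate(\n    characteristic = rnorm(n()),\n    ret_excess = 0.5 * characteristic + rnorm(n())\n  )\n\ntoy_premiums <- toy_panel |>\n  nest(data = -month) |>\n  mutate(estimates = map(\n    data,\n    ~ tidy(lm(ret_excess ~ characteristic, data = .x))\n  )) |>\n  unnest(estimates) |>\n  filter(term == \"characteristic\")\n\ntoy_premiums |>\n  summarize(\n    risk_premium = mean(estimate),\n    t_statistic = mean(estimate) / sd(estimate) * sqrt(n())\n  )\n```\n:::\n\n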
We collect the data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\\index{Data!Compustat}\\index{Beta}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, gvkey, month, ret_excess, mktcap) |>\n collect()\n\ncompustat <- tbl(tidy_finance, \"compustat\") |>\n select(datadate, gvkey, be) |>\n collect()\n\nbeta <- tbl(tidy_finance, \"beta\") |>\n select(month, permno, beta_monthly) |>\n collect()\n```\n:::\n\n\nWe use the Compustat and CRSP data to compute the book-to-market ratio and the (log) market capitalization.\\index{Book-to-market ratio}\\index{Market capitalization} \nFurthermore, we also use the CAPM betas based on monthly returns we computed in the previous chapters.\\index{Beta}\\index{CAPM}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncharacteristics <- compustat |>\n mutate(month = floor_date(ymd(datadate), \"month\")) |>\n left_join(crsp_monthly, by = c(\"gvkey\", \"month\")) |>\n left_join(beta, by = c(\"permno\", \"month\")) |>\n transmute(gvkey,\n bm = be / mktcap,\n log_mktcap = log(mktcap),\n beta = beta_monthly,\n sorting_date = month %m+% months(6)\n )\n\ndata_fama_macbeth <- crsp_monthly |>\n left_join(characteristics, by = c(\"gvkey\", \"month\" = \"sorting_date\")) |>\n group_by(permno) |>\n arrange(month) |>\n fill(c(beta, bm, log_mktcap), .direction = \"down\") |>\n ungroup() |>\n left_join(crsp_monthly |>\n select(permno, month, ret_excess_lead = ret_excess) |>\n mutate(month = month %m-% months(1)),\n by = c(\"permno\", \"month\")\n ) |>\n select(permno, month, ret_excess_lead, beta, log_mktcap, bm) |>\n drop_na()\n```\n:::\n\n\n## Cross-sectional Regression\n\nNext, we run the cross-sectional regressions with the characteristics as explanatory variables for each month. We regress the returns of the test assets at a particular time point on the characteristics of each asset. By doing so, we get an estimate of the risk premiums $\\hat\\lambda^{f}_t$ for each point in time. \\index{Regression!Cross-section}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrisk_premiums <- data_fama_macbeth |>\n nest(data = c(ret_excess_lead, beta, log_mktcap, bm, permno)) |>\n mutate(estimates = map(\n data,\n ~ tidy(lm(ret_excess_lead ~ beta + log_mktcap + bm, data = .x))\n )) |>\n unnest(estimates)\n```\n:::\n\n\n## Time-Series Aggregation\n\nNow that we have the risk premiums' estimates for each period, we can average across the time-series dimension to get the expected risk premium for each characteristic. Similarly, we manually create the $t$-test statistics for each regressor, which we can then compare to usual critical values of 1.96 or 2.576 for two-tailed significance tests. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nprice_of_risk <- risk_premiums |>\n group_by(factor = term) |>\n summarize(\n risk_premium = mean(estimate) * 100,\n t_statistic = mean(estimate) / sd(estimate) * sqrt(n())\n )\n```\n:::\n\n\nIt is common to adjust for autocorrelation when reporting standard errors of risk premiums. As in [Univariate Portfolio Sorts](univariate-portfolio-sorts.qmd), the typical procedure for this is computing @Newey1987 standard errors. 
We again recommend the data-driven approach of @Newey1994 using the `NeweyWest()` function, but note that you can enforce the typical 6 lag settings via `NeweyWest(., lag = 6, prewhite = FALSE)`.\\index{Standard errors!Newey-West}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nregressions_for_newey_west <- risk_premiums |>\n select(month, factor = term, estimate) |>\n nest(data = c(month, estimate)) |>\n mutate(\n model = map(data, ~ lm(estimate ~ 1, .)),\n mean = map(model, tidy)\n )\n\nprice_of_risk_newey_west <- regressions_for_newey_west |>\n mutate(newey_west_se = map_dbl(model, ~ sqrt(NeweyWest(.)))) |>\n unnest(mean) |>\n mutate(t_statistic_newey_west = estimate / newey_west_se) |>\n select(factor,\n risk_premium = estimate,\n t_statistic_newey_west\n )\n\nleft_join(price_of_risk,\n price_of_risk_newey_west |>\n select(factor, t_statistic_newey_west),\n by = \"factor\"\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 4 × 4\n factor risk_premium t_statistic t_statistic_newey_west\n \n1 (Intercept) 1.22 4.77 3.98 \n2 beta 0.00515 0.0499 0.0446\n3 bm 0.151 3.22 2.75 \n4 log_mktcap -0.104 -2.94 -2.60 \n```\n\n\n:::\n:::\n\n\nFinally, let us interpret the results. Stocks with higher book-to-market ratios earn higher expected future returns, which is in line with the value premium. The negative value for log market capitalization reflects the size premium for smaller stocks. Consistent with results from earlier chapters, we detect no relation between beta and future stock returns.\n\n## Exercises\n\n1. Download a sample of test assets from Kenneth French's homepage and reevaluate the risk premiums for industry portfolios instead of individual stocks.\n1. Use individual stocks with weighted-least squares based on a firm's size as suggested by @Hou2020. Then, repeat the Fama-MacBeth regressions without the weighting scheme adjustment but drop the smallest 20 percent of firms each month. Compare the results of the three approaches. ", - "supporting": [ - "fama-macbeth-regressions_files" - ], + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/r/fixed-effects-and-clustered-standard-errors/execute-results/html.json b/_freeze/r/fixed-effects-and-clustered-standard-errors/execute-results/html.json index 1f39dcd2..cc3f76ec 100644 --- a/_freeze/r/fixed-effects-and-clustered-standard-errors/execute-results/html.json +++ b/_freeze/r/fixed-effects-and-clustered-standard-errors/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "1f5671dc77c104a1b302a07db6d86106", + "hash": "4a0ef56e54495717886fe255ef8103ad", "result": { "engine": "knitr", - "markdown": "---\ntitle: Fixed Effects and Clustered Standard Errors\naliases:\n - ../fixed-effects-and-clustered-standard-errors\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Fixed Effects and Clustered Standard Errors with R\n description-meta: Dive into the implementation of fixed effects regressions and clustered standard errors in finance using the programming language R. \n---\n\n\nIn this chapter, we provide an intuitive introduction to the two popular concepts of *fixed effects regressions* and *clustered standard errors*. When working with regressions in empirical finance, you will sooner or later be confronted with discussions around how you deal with omitted variables bias and dependence in your residuals. 
The concepts we introduce in this chapter are designed to address such concerns.\n\nWe focus on a classical panel regression common to the corporate finance literature [e.g., @Fazzari1988;@Erickson2012;@Gulen2015]: firm investment modeled as a function that increases in firm cash flow and firm investment opportunities. \n\nTypically, this investment regression uses quarterly balance sheet data provided via Compustat because it allows for richer dynamics in the regressors and more opportunities to construct variables. As we focus on the implementation of fixed effects and clustered standard errors, we use the annual Compustat data from our previous chapters and leave the estimation using quarterly data as an exercise. We demonstrate below that the regression based on annual data yields qualitatively similar results to estimations based on quarterly data from the literature, namely confirming the positive relationships between investment and the two regressors. \n\nThe current chapter relies on the following set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(fixest)\n```\n:::\n\n\nCompared to previous chapters, we introduce `fixest` [@fixest] for the fixed effects regressions, the implementation of standard error clusters, and tidy estimation output.\n\n## Data Preparation\n\nWe use CRSP and annual Compustat as data sources from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). In particular, Compustat provides balance sheet and income statement data on a firm level, while CRSP provides market valuations. \\index{Data!CRSP}\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(gvkey, month, mktcap) |>\n collect()\n\ncompustat <- tbl(tidy_finance, \"compustat\") |>\n select(datadate, gvkey, year, at, be, capx, oancf, txdb) |>\n collect()\n```\n:::\n\n\nThe classical investment regressions model the capital investment of a firm as a function of operating cash flows and Tobin's q, a measure of a firm's investment opportunities.\\index{Cash flows}\\index{Regression!Investment} We start by constructing investment and cash flows which are usually normalized by lagged total assets of a firm. In the following code chunk, we construct a *panel* of firm-year observations, so we have both cross-sectional information on firms as well as time-series information for each firm.\\index{Regression!Panel}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment <- compustat |>\n mutate(month = floor_date(datadate, \"month\")) |>\n left_join(compustat |>\n select(gvkey, year, at_lag = at) |>\n mutate(year = year + 1),\n by = c(\"gvkey\", \"year\")\n ) |>\n filter(at > 0, at_lag > 0) |>\n mutate(\n investment = capx / at_lag,\n cash_flows = oancf / at_lag\n )\n\ndata_investment <- data_investment |>\n left_join(data_investment |>\n select(gvkey, year, investment_lead = investment) |>\n mutate(year = year - 1),\n by = c(\"gvkey\", \"year\")\n )\n```\n:::\n\n\nTobin's q is the ratio of the market value of capital to its replacement costs.\\index{Tobin's q} It is one of the most common regressors in corporate finance applications [e.g., @Fazzari1988; @Erickson2012]. 
We follow the implementation of @Gulen2015 and compute Tobin's q as the market value of equity (`mktcap`) plus the book value of assets (`at`) minus book value of equity (`be`) plus deferred taxes (`txdb`), all divided by book value of assets (`at`). Finally, we only keep observations where all variables of interest are non-missing, and the reported book value of assets is strictly positive.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment <- data_investment |>\n left_join(crsp_monthly,\n by = c(\"gvkey\", \"month\")\n ) |>\n mutate(tobins_q = (mktcap + at - be + txdb) / at) |>\n select(gvkey, year, investment_lead, cash_flows, tobins_q) |>\n drop_na()\n```\n:::\n\n\nAs the variable construction typically leads to extreme values that are most likely related to data issues (e.g., reporting errors), many papers include winsorization of the variables of interest. Winsorization involves replacing values of extreme outliers with quantiles on the respective end. The following function implements the winsorization for any percentage cut that should be applied on either end of the distributions.\\index{Winsorization} In the specific example, we winsorize the main variables (`investment`, `cash_flows`, and `tobins_q`) at the 1 percent level. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nwinsorize <- function(x, cut) {\n x <- replace(\n x,\n x > quantile(x, 1 - cut, na.rm = T),\n quantile(x, 1 - cut, na.rm = T)\n )\n x <- replace(\n x,\n x < quantile(x, cut, na.rm = T),\n quantile(x, cut, na.rm = T)\n )\n return(x)\n}\n\ndata_investment <- data_investment |>\n mutate(across(\n c(investment_lead, cash_flows, tobins_q),\n ~ winsorize(., 0.01)\n ))\n```\n:::\n\n\nBefore proceeding to any estimations, we highly recommend tabulating summary statistics of the variables that enter the regression. These simple tables allow you to check the plausibility of your numerical variables, as well as spot any obvious errors or outliers. Additionally, for panel data, plotting the time series of the variable's mean and the number of observations is a useful exercise to spot potential problems.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment |>\n pivot_longer(\n cols = c(investment_lead, cash_flows, tobins_q),\n names_to = \"measure\"\n ) |>\n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n(),\n .groups = \"drop\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 9\n measure mean sd min q05 q50 q95 max n\n \n1 cash_flows 0.0111 0.274 -1.56 -4.72e-1 0.0641 0.272 0.479 127446\n2 investmen… 0.0577 0.0772 0 6.75e-4 0.0327 0.206 0.464 127446\n3 tobins_q 2.00 1.70 0.573 7.94e-1 1.39 5.37 10.9 127446\n```\n\n\n:::\n:::\n\n\n## Fixed Effects \n\nTo illustrate fixed effects regressions, we use the `fixest` package, which is both computationally powerful and flexible with respect to model specifications.\\index{Regression!Fixed effects} We start out with the basic investment regression using the simple model\n$$ \\text{Investment}_{i,t+1} = \\alpha + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\varepsilon_t$ is i.i.d. normally distributed across time and firms. We use the `feols()`-function to estimate the simple model so that the output has the same structure as the other regressions below, but you could also use `lm()`. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_ols <- feols(\n fml = investment_lead ~ cash_flows + tobins_q,\n se = \"iid\",\n data = data_investment\n)\nmodel_ols\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,446 \nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.04254 0.000334 127.5 < 2.2e-16 ***\ncash_flows 0.04902 0.000793 61.8 < 2.2e-16 ***\ntobins_q 0.00731 0.000128 57.0 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.075503 Adj. R2: 0.043012\n```\n\n\n:::\n:::\n\n\nAs expected, the regression output shows significant coefficients for both variables. Higher cash flows and investment opportunities are associated with higher investment. However, the simple model actually may have a lot of omitted variables, so our coefficients are most likely biased. As there is a lot of unexplained variation in our simple model (indicated by the rather low adjusted R-squared), the bias in our coefficients is potentially severe, and the true values could be above or below zero. Note that there are no clear cutoffs to decide when an R-squared is high or low, but it depends on the context of your application and on the comparison of different models for the same data. \n\nOne way to tackle the issue of omitted variable bias is to get rid of as much unexplained variation as possible by including *fixed effects* - i.e., model parameters that are fixed for specific groups [e.g., @Wooldridge2010]. In essence, each group has its own mean in fixed effects regressions. The simplest group that we can form in the investment regression is the firm level. The firm fixed effects regression is then\n$$ \\text{Investment}_{i,t+1} = \\alpha_i + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\alpha_i$ is the firm fixed effect and captures the firm-specific mean investment across all years. In fact, you could also compute firms' investments as deviations from the firms' average investments and estimate the model without the fixed effects. The idea of the firm fixed effect is to remove the firm's average investment, which might be affected by firm-specific variables that you do not observe. For example, firms in a specific industry might invest more on average. Or you observe a young firm with large investments but only small concurrent cash flows, which will only happen in a few years. This sort of variation is unwanted because it is related to unobserved variables that can bias your estimates in any direction.\n\nTo include the firm fixed effect, we use `gvkey` (Compustat's firm identifier) as follows:\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fe_firm <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey,\n se = \"iid\",\n data = data_investment\n)\nmodel_fe_firm\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,446 \nFixed-effects: gvkey: 14,348\nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \ncash_flows 0.0141 0.000923 15.3 < 2.2e-16 ***\ntobins_q 0.0109 0.000133 82.0 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.049705 Adj. R2: 0.532649\n Within R2: 0.057237\n```\n\n\n:::\n:::\n\n\nThe regression output shows a lot of unexplained variation at the firm level that is taken care of by including the firm fixed effect as the adjusted R-squared rises above 50%. 
In fact, it is more interesting to look at the within R-squared that shows the explanatory power of a firm's cash flow and Tobin's q *on top* of the average investment of each firm. We can also see that the coefficients changed slightly in magnitude but not in sign.\n\nThere is another source of variation that we can get rid of in our setting: average investment across firms might vary over time due to macroeconomic factors that affect all firms, such as economic crises. By including year fixed effects, we can take out the effect of unobservables that vary over time. The two-way fixed effects regression is then\n$$ \\text{Investment}_{i,t+1} = \\alpha_i + \\alpha_t + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\alpha_t$ is the time fixed effect. Here you can think of higher investments during an economic expansion with simultaneously high cash flows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fe_firmyear <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n se = \"iid\",\n data = data_investment\n)\nmodel_fe_firmyear\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,446 \nFixed-effects: gvkey: 14,348, year: 35\nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \ncash_flows 0.01750 0.000902 19.4 < 2.2e-16 ***\ntobins_q 0.00987 0.000131 75.3 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.048415 Adj. R2: 0.556452\n Within R2: 0.050071\n```\n\n\n:::\n:::\n\n\nThe inclusion of time fixed effects did only marginally affect the R-squared and the coefficients, which we can interpret as a good thing as it indicates that the coefficients are not driven by an omitted variable that varies over time. \n\nHow can we further improve the robustness of our regression results? Ideally, we want to get rid of unexplained variation at the firm-year level, which means we need to include more variables that vary across firm *and* time and are likely correlated with investment. Note that we cannot include firm-year fixed effects in our setting because then cash flows and Tobin's q are colinear with the fixed effects, and the estimation becomes void. \n\nBefore we discuss the properties of our estimation errors, we want to point out that regression tables are at the heart of every empirical analysis, where you compare multiple models. Fortunately, the `etable()` function provides a convenient way to tabulate the regression output (with many parameters to customize and even print the output in LaTeX). We recommend printing $t$-statistics rather than standard errors in regression tables because the latter are typically very hard to interpret across coefficients that vary in size. We also do not print p-values because they are sometimes misinterpreted to signal the importance of observed effects [@Wasserstein2016]. The $t$-statistics provide a consistent way to interpret changes in estimation uncertainty across different model specifications. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\netable(\n model_ols, model_fe_firm, model_fe_firmyear,\n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_ols model_fe_firm model_fe_firm..\nDependent Var.: investment_lead investment_lead investment_lead\n \nConstant 0.042*** (127.5) \ncash_flows 0.049*** (61.8) 0.014*** (15.3) 0.018*** (19.4)\ntobins_q 0.007*** (57.0) 0.011*** (82.0) 0.010*** (75.3)\nFixed-Effects: ---------------- --------------- ---------------\ngvkey No Yes Yes\nyear No No Yes\n_______________ ________________ _______________ _______________\nVCOV type IID IID IID\nObservations 127,446 127,446 127,446\nR2 0.043 0.585 0.607\nWithin R2 -- 0.057 0.050\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\n## Clustering Standard Errors\n\nApart from biased estimators, we usually have to deal with potentially complex dependencies of our residuals with each other. Such dependencies in the residuals invalidate the i.i.d. assumption of OLS and lead to biased standard errors. With biased OLS standard errors, we cannot reliably interpret the statistical significance of our estimated coefficients. \n\nIn our setting, the residuals may be correlated across years for a given firm (time-series dependence), or, alternatively, the residuals may be correlated across different firms (cross-section dependence). One of the most common approaches to dealing with such dependence is the use of *clustered standard errors* [@Petersen2008].\\index{Standard errors!Clustered} The idea behind clustering is that the correlation of residuals *within* a cluster can be of any form. As the number of clusters grows, the cluster-robust standard errors become consistent [@Lang2007;@Wooldridge2010]. A natural requirement for clustering standard errors in practice is hence a sufficiently large number of clusters. Typically, around at least 30 to 50 clusters are seen as sufficient [@Cameron2011].\n\nInstead of relying on the iid assumption, we can use the cluster option in the `feols`-function as above. The code chunk below applies both one-way clustering by firm as well as two-way clustering by firm and year.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_cluster_firm <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n cluster = \"gvkey\",\n data = data_investment\n)\n\nmodel_cluster_firmyear <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n cluster = c(\"gvkey\", \"year\"),\n data = data_investment\n)\n```\n:::\n\n\n\\index{Robustness tests}\nThe table below shows the comparison of the different assumptions behind the standard errors. In the first column, we can see highly significant coefficients on both cash flows and Tobin's q. By clustering the standard errors on the firm level, the $t$-statistics of both coefficients drop in half, indicating a high correlation of residuals within firms. If we additionally cluster by year, we see a drop, particularly for Tobin's q, again. Even after relaxing the assumptions behind our standard errors, both coefficients are still comfortably significant as the $t$ statistics are well above the usual critical values of 1.96 or 2.576 for two-tailed significance tests.\n\n\n::: {.cell}\n\n```{.r .cell-code}\netable(\n model_fe_firmyear, model_cluster_firm, model_cluster_firmyear,\n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_fe_firm.. model_cluster.. 
model_cluster...1\nDependent Var.: investment_lead investment_lead investment_lead\n \ncash_flows 0.018*** (19.4) 0.018*** (11.3) 0.018*** (9.57)\ntobins_q 0.010*** (75.3) 0.010*** (35.4) 0.010*** (15.4)\nFixed-Effects: --------------- --------------- ---------------\ngvkey Yes Yes Yes\nyear Yes Yes Yes\n_______________ _______________ _______________ _______________\nVCOV type IID by: gvkey by: gvkey & year\nObservations 127,446 127,446 127,446\nR2 0.607 0.607 0.607\nWithin R2 0.050 0.050 0.050\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nInspired by @AbadieEtAl2017, we want to close this chapter by highlighting that choosing the right dimensions for clustering is a design problem. Even if the data is informative about whether clustering matters for standard errors, they do not tell you whether you should adjust the standard errors for clustering. Clustering at too aggregate levels can hence lead to unnecessarily inflated standard errors. \n\n## Exercises\n\n1. Estimate the two-way fixed effects model with two-way clustered standard errors using quarterly Compustat data from WRDS. Note that you can access quarterly data via `tbl(wrds, in_schema(\"comp\", \"fundq\"))`.\n1. Following @Peters2017, compute Tobin's q as the market value of outstanding equity `mktcap` plus the book value of debt (`dltt` + `dlc`) minus the current assets `atc` and everything divided by the book value of property, plant and equipment `ppegt`. What is the correlation between the measures of Tobin's q? What is the impact on the two-way fixed effects regressions?\n", + "markdown": "---\ntitle: Fixed Effects and Clustered Standard Errors\naliases:\n - ../fixed-effects-and-clustered-standard-errors\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Fixed Effects and Clustered Standard Errors with R\n description-meta: Dive into the implementation of fixed effects regressions and clustered standard errors in finance using the programming language R. \n---\n\n\nIn this chapter, we provide an intuitive introduction to the two popular concepts of *fixed effects regressions* and *clustered standard errors*. When working with regressions in empirical finance, you will sooner or later be confronted with discussions around how you deal with omitted variables bias and dependence in your residuals. The concepts we introduce in this chapter are designed to address such concerns.\n\nWe focus on a classical panel regression common to the corporate finance literature [e.g., @Fazzari1988;@Erickson2012;@Gulen2015]: firm investment modeled as a function that increases in firm cash flow and firm investment opportunities. \n\nTypically, this investment regression uses quarterly balance sheet data provided via Compustat because it allows for richer dynamics in the regressors and more opportunities to construct variables. As we focus on the implementation of fixed effects and clustered standard errors, we use the annual Compustat data from our previous chapters and leave the estimation using quarterly data as an exercise. We demonstrate below that the regression based on annual data yields qualitatively similar results to estimations based on quarterly data from the literature, namely confirming the positive relationships between investment and the two regressors. \n\nThe current chapter relies on the following set of R packages. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(fixest)\n```\n:::\n\n\nCompared to previous chapters, we introduce `fixest` [@fixest] for the fixed effects regressions, the implementation of standard error clusters, and tidy estimation output.\n\n## Data Preparation\n\nWe use CRSP and annual Compustat as data sources from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). In particular, Compustat provides balance sheet and income statement data on a firm level, while CRSP provides market valuations. \\index{Data!CRSP}\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(gvkey, month, mktcap) |>\n collect()\n\ncompustat <- tbl(tidy_finance, \"compustat\") |>\n select(datadate, gvkey, year, at, be, capx, oancf, txdb) |>\n collect()\n```\n:::\n\n\nThe classical investment regressions model the capital investment of a firm as a function of operating cash flows and Tobin's q, a measure of a firm's investment opportunities.\\index{Cash flows}\\index{Regression!Investment} We start by constructing investment and cash flows which are usually normalized by lagged total assets of a firm. In the following code chunk, we construct a *panel* of firm-year observations, so we have both cross-sectional information on firms as well as time-series information for each firm.\\index{Regression!Panel}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment <- compustat |>\n mutate(month = floor_date(datadate, \"month\")) |>\n left_join(compustat |>\n select(gvkey, year, at_lag = at) |>\n mutate(year = year + 1),\n join_by(gvkey, year)\n ) |>\n filter(at > 0, at_lag > 0) |>\n mutate(\n investment = capx / at_lag,\n cash_flows = oancf / at_lag\n )\n\ndata_investment <- data_investment |>\n left_join(data_investment |>\n select(gvkey, year, investment_lead = investment) |>\n mutate(year = year - 1),\n join_by(gvkey, year)\n )\n```\n:::\n\n\nTobin's q is the ratio of the market value of capital to its replacement costs.\\index{Tobin's q} It is one of the most common regressors in corporate finance applications [e.g., @Fazzari1988; @Erickson2012]. We follow the implementation of @Gulen2015 and compute Tobin's q as the market value of equity (`mktcap`) plus the book value of assets (`at`) minus book value of equity (`be`) plus deferred taxes (`txdb`), all divided by book value of assets (`at`). Finally, we only keep observations where all variables of interest are non-missing, and the reported book value of assets is strictly positive.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment <- data_investment |>\n left_join(crsp_monthly,\n join_by(gvkey, month)) |>\n mutate(tobins_q = (mktcap + at - be + txdb) / at) |>\n select(gvkey, year, investment_lead, cash_flows, tobins_q) |>\n drop_na()\n```\n:::\n\n\nAs the variable construction typically leads to extreme values that are most likely related to data issues (e.g., reporting errors), many papers include winsorization of the variables of interest. Winsorization involves replacing values of extreme outliers with quantiles on the respective end. 
The following function implements the winsorization for any percentage cut that should be applied on either end of the distributions.\\index{Winsorization} In the specific example, we winsorize the main variables (`investment`, `cash_flows`, and `tobins_q`) at the 1 percent level. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nwinsorize <- function(x, cut) {\n x <- replace(\n x,\n x > quantile(x, 1 - cut, na.rm = T),\n quantile(x, 1 - cut, na.rm = T)\n )\n x <- replace(\n x,\n x < quantile(x, cut, na.rm = T),\n quantile(x, cut, na.rm = T)\n )\n return(x)\n}\n\ndata_investment <- data_investment |>\n mutate(across(\n c(investment_lead, cash_flows, tobins_q),\n ~ winsorize(., 0.01)\n ))\n```\n:::\n\n\nBefore proceeding to any estimations, we highly recommend tabulating summary statistics of the variables that enter the regression. These simple tables allow you to check the plausibility of your numerical variables, as well as spot any obvious errors or outliers. Additionally, for panel data, plotting the time series of the variable's mean and the number of observations is a useful exercise to spot potential problems.\\index{Summary statistics}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_investment |>\n pivot_longer(\n cols = c(investment_lead, cash_flows, tobins_q),\n names_to = \"measure\"\n ) |>\n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value),\n n = n(),\n .groups = \"drop\"\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 9\n measure mean sd min q05 q50 q95 max n\n \n1 cash_flows 0.0111 0.274 -1.56 -4.72e-1 0.0641 0.272 0.479 127450\n2 investmen… 0.0577 0.0772 0 6.75e-4 0.0327 0.206 0.464 127450\n3 tobins_q 2.00 1.70 0.573 7.94e-1 1.39 5.37 10.9 127450\n```\n\n\n:::\n:::\n\n\n## Fixed Effects \n\nTo illustrate fixed effects regressions, we use the `fixest` package, which is both computationally powerful and flexible with respect to model specifications.\\index{Regression!Fixed effects} We start out with the basic investment regression using the simple model\n$$ \\text{Investment}_{i,t+1} = \\alpha + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\varepsilon_t$ is i.i.d. normally distributed across time and firms. We use the `feols()`-function to estimate the simple model so that the output has the same structure as the other regressions below, but you could also use `lm()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_ols <- feols(\n fml = investment_lead ~ cash_flows + tobins_q,\n se = \"iid\",\n data = data_investment\n)\nmodel_ols\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,450 \nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.04254 0.000334 127.5 < 2.2e-16 ***\ncash_flows 0.04901 0.000793 61.8 < 2.2e-16 ***\ntobins_q 0.00731 0.000128 57.0 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.075502 Adj. R2: 0.043014\n```\n\n\n:::\n:::\n\n\nAs expected, the regression output shows significant coefficients for both variables. Higher cash flows and investment opportunities are associated with higher investment. However, the simple model actually may have a lot of omitted variables, so our coefficients are most likely biased. 
As there is a lot of unexplained variation in our simple model (indicated by the rather low adjusted R-squared), the bias in our coefficients is potentially severe, and the true values could be above or below zero. Note that there are no clear cutoffs to decide when an R-squared is high or low, but it depends on the context of your application and on the comparison of different models for the same data. \n\nOne way to tackle the issue of omitted variable bias is to get rid of as much unexplained variation as possible by including *fixed effects* - i.e., model parameters that are fixed for specific groups [e.g., @Wooldridge2010]. In essence, each group has its own mean in fixed effects regressions. The simplest group that we can form in the investment regression is the firm level. The firm fixed effects regression is then\n$$ \\text{Investment}_{i,t+1} = \\alpha_i + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\alpha_i$ is the firm fixed effect and captures the firm-specific mean investment across all years. In fact, you could also compute firms' investments as deviations from the firms' average investments and estimate the model without the fixed effects. The idea of the firm fixed effect is to remove the firm's average investment, which might be affected by firm-specific variables that you do not observe. For example, firms in a specific industry might invest more on average. Or you observe a young firm with large investments but only small concurrent cash flows, which will only happen in a few years. This sort of variation is unwanted because it is related to unobserved variables that can bias your estimates in any direction.\n\nTo include the firm fixed effect, we use `gvkey` (Compustat's firm identifier) as follows:\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fe_firm <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey,\n se = \"iid\",\n data = data_investment\n)\nmodel_fe_firm\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,450 \nFixed-effects: gvkey: 14,352\nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \ncash_flows 0.0141 0.000923 15.3 < 2.2e-16 ***\ntobins_q 0.0109 0.000133 82.0 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.049704 Adj. R2: 0.532639\n Within R2: 0.057238\n```\n\n\n:::\n:::\n\n\nThe regression output shows a lot of unexplained variation at the firm level that is taken care of by including the firm fixed effect as the adjusted R-squared rises above 50%. In fact, it is more interesting to look at the within R-squared that shows the explanatory power of a firm's cash flow and Tobin's q *on top* of the average investment of each firm. We can also see that the coefficients changed slightly in magnitude but not in sign.\n\nThere is another source of variation that we can get rid of in our setting: average investment across firms might vary over time due to macroeconomic factors that affect all firms, such as economic crises. By including year fixed effects, we can take out the effect of unobservables that vary over time. The two-way fixed effects regression is then\n$$ \\text{Investment}_{i,t+1} = \\alpha_i + \\alpha_t + \\beta_1\\text{Cash Flows}_{i,t}+\\beta_2\\text{Tobin's q}_{i,t}+\\varepsilon_{i,t},$$\nwhere $\\alpha_t$ is the time fixed effect. 
Here you can think of higher investments during an economic expansion with simultaneously high cash flows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fe_firmyear <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n se = \"iid\",\n data = data_investment\n)\nmodel_fe_firmyear\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nOLS estimation, Dep. Var.: investment_lead\nObservations: 127,450 \nFixed-effects: gvkey: 14,352, year: 35\nStandard-errors: IID \n Estimate Std. Error t value Pr(>|t|) \ncash_flows 0.01750 0.000902 19.4 < 2.2e-16 ***\ntobins_q 0.00987 0.000131 75.3 < 2.2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\nRMSE: 0.048414 Adj. R2: 0.556443\n Within R2: 0.050072\n```\n\n\n:::\n:::\n\n\nThe inclusion of time fixed effects did only marginally affect the R-squared and the coefficients, which we can interpret as a good thing as it indicates that the coefficients are not driven by an omitted variable that varies over time. \n\nHow can we further improve the robustness of our regression results? Ideally, we want to get rid of unexplained variation at the firm-year level, which means we need to include more variables that vary across firm *and* time and are likely correlated with investment. Note that we cannot include firm-year fixed effects in our setting because then cash flows and Tobin's q are colinear with the fixed effects, and the estimation becomes void. \n\nBefore we discuss the properties of our estimation errors, we want to point out that regression tables are at the heart of every empirical analysis, where you compare multiple models. Fortunately, the `etable()` function provides a convenient way to tabulate the regression output (with many parameters to customize and even print the output in LaTeX). We recommend printing $t$-statistics rather than standard errors in regression tables because the latter are typically very hard to interpret across coefficients that vary in size. We also do not print p-values because they are sometimes misinterpreted to signal the importance of observed effects [@Wasserstein2016]. The $t$-statistics provide a consistent way to interpret changes in estimation uncertainty across different model specifications. \n\n\n::: {.cell}\n\n```{.r .cell-code}\netable(\n model_ols, model_fe_firm, model_fe_firmyear,\n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_ols model_fe_firm model_fe_firm..\nDependent Var.: investment_lead investment_lead investment_lead\n \nConstant 0.043*** (127.5) \ncash_flows 0.049*** (61.8) 0.014*** (15.3) 0.018*** (19.4)\ntobins_q 0.007*** (57.0) 0.011*** (82.0) 0.010*** (75.3)\nFixed-Effects: ---------------- --------------- ---------------\ngvkey No Yes Yes\nyear No No Yes\n_______________ ________________ _______________ _______________\nVCOV type IID IID IID\nObservations 127,450 127,450 127,450\nR2 0.043 0.585 0.607\nWithin R2 -- 0.057 0.050\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\n## Clustering Standard Errors\n\nApart from biased estimators, we usually have to deal with potentially complex dependencies of our residuals with each other. Such dependencies in the residuals invalidate the i.i.d. assumption of OLS and lead to biased standard errors. With biased OLS standard errors, we cannot reliably interpret the statistical significance of our estimated coefficients. 
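\n\nOne simple way to gauge how far the residuals are from the i.i.d. benchmark is to inspect their correlation structure directly. The sketch below assumes that `feols()` kept every row of `data_investment` (the observation counts above suggest it did), so that the residual vector lines up with the panel, and computes the average first-order autocorrelation of the two-way fixed effects residuals within firms.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: within-firm autocorrelation of the two-way fixed effects residuals.\n# Assumes feols() did not drop observations, so resid() aligns with the rows.\ndata_investment |>\n  mutate(resid_firmyear = resid(model_fe_firmyear)) |>\n  group_by(gvkey) |>\n  filter(n() >= 5) |>\n  arrange(year, .by_group = TRUE) |>\n  summarize(\n    rho = cor(resid_firmyear, lag(resid_firmyear),\n      use = \"pairwise.complete.obs\"\n    ),\n    .groups = \"drop\"\n  ) |>\n  summarize(mean_rho = mean(rho, na.rm = TRUE))\n```\n:::\n\n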
\n\nIn our setting, the residuals may be correlated across years for a given firm (time-series dependence), or, alternatively, the residuals may be correlated across different firms (cross-section dependence). One of the most common approaches to dealing with such dependence is the use of *clustered standard errors* [@Petersen2008].\\index{Standard errors!Clustered} The idea behind clustering is that the correlation of residuals *within* a cluster can be of any form. As the number of clusters grows, the cluster-robust standard errors become consistent [@Lang2007;@Wooldridge2010]. A natural requirement for clustering standard errors in practice is hence a sufficiently large number of clusters. Typically, around at least 30 to 50 clusters are seen as sufficient [@Cameron2011].\n\nInstead of relying on the iid assumption, we can use the cluster option in the `feols`-function as above. The code chunk below applies both one-way clustering by firm as well as two-way clustering by firm and year.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_cluster_firm <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n cluster = \"gvkey\",\n data = data_investment\n)\n\nmodel_cluster_firmyear <- feols(\n investment_lead ~ cash_flows + tobins_q | gvkey + year,\n cluster = c(\"gvkey\", \"year\"),\n data = data_investment\n)\n```\n:::\n\n\n\\index{Robustness tests}\nThe table below shows the comparison of the different assumptions behind the standard errors. In the first column, we can see highly significant coefficients on both cash flows and Tobin's q. By clustering the standard errors on the firm level, the $t$-statistics of both coefficients drop in half, indicating a high correlation of residuals within firms. If we additionally cluster by year, we see a drop, particularly for Tobin's q, again. Even after relaxing the assumptions behind our standard errors, both coefficients are still comfortably significant as the $t$ statistics are well above the usual critical values of 1.96 or 2.576 for two-tailed significance tests.\n\n\n::: {.cell}\n\n```{.r .cell-code}\netable(\n model_fe_firmyear, model_cluster_firm, model_cluster_firmyear,\n coefstat = \"tstat\", digits = 3, digits.stats = 3\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n model_fe_firm.. model_cluster.. model_cluster...1\nDependent Var.: investment_lead investment_lead investment_lead\n \ncash_flows 0.018*** (19.4) 0.018*** (11.3) 0.018*** (9.57)\ntobins_q 0.010*** (75.3) 0.010*** (35.4) 0.010*** (15.4)\nFixed-Effects: --------------- --------------- ---------------\ngvkey Yes Yes Yes\nyear Yes Yes Yes\n_______________ _______________ _______________ _______________\nVCOV type IID by: gvkey by: gvkey & year\nObservations 127,450 127,450 127,450\nR2 0.607 0.607 0.607\nWithin R2 0.050 0.050 0.050\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\nInspired by @AbadieEtAl2017, we want to close this chapter by highlighting that choosing the right dimensions for clustering is a design problem. Even if the data is informative about whether clustering matters for standard errors, they do not tell you whether you should adjust the standard errors for clustering. Clustering at too aggregate levels can hence lead to unnecessarily inflated standard errors. \n\n## Exercises\n\n1. Estimate the two-way fixed effects model with two-way clustered standard errors using quarterly Compustat data from WRDS. Note that you can access quarterly data via `tbl(wrds, in_schema(\"comp\", \"fundq\"))`.\n1. 
Following @Peters2017, compute Tobin's q as the market value of outstanding equity `mktcap` plus the book value of debt (`dltt` + `dlc`) minus the current assets `atc` and everything divided by the book value of property, plant and equipment `ppegt`. What is the correlation between the measures of Tobin's q? What is the impact on the two-way fixed effects regressions?\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/r/introduction-to-tidy-finance/execute-results/html.json b/_freeze/r/introduction-to-tidy-finance/execute-results/html.json index 16d9b1f7..049e909b 100644 --- a/_freeze/r/introduction-to-tidy-finance/execute-results/html.json +++ b/_freeze/r/introduction-to-tidy-finance/execute-results/html.json @@ -2,7 +2,7 @@ "hash": "739476056eb443fe25639888af8b6c10", "result": { "engine": "knitr", - "markdown": "---\ntitle: Introduction to Tidy Finance\naliases:\n - ../introduction-to-tidy-finance.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Introduction to Tidy Finance with R\n description-meta: Learn how to use the programming language R for downloading and analyzing stock market data.\n---\n\n\nThe main aim of this chapter is to familiarize yourself with the tidyverse. We start by downloading and visualizing stock data from Yahoo!Finance. Then we move to a simple portfolio choice problem and construct the efficient frontier. These examples introduce you to our approach of *Tidy Finance*.\n\n## Working with Stock Market Data\n\nAt the start of each session, we load the required R packages. Throughout the entire book, we always use the `tidyverse` [@Wickham2019]. In this chapter, we also load the convenient `tidyquant` package [@tidyquant] to download price data. This package provides a convenient wrapper for various quantitative functions compatible with the `tidyverse`.\\index{tidyverse}\nFinally, the package `scales` [@scales] provides useful scale functions for visualizations.\n\nYou typically have to install a package once before you can load it. In case you have not done this yet, call `install.packages(\"tidyquant\")`. \\index{tidyquant} If you have trouble using `tidyquant`, check out the corresponding [documentation.](https://cran.r-project.org/web/packages/tidyquant/vignettes/TQ00-introduction-to-tidyquant.html)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tidyquant)\nlibrary(scales)\n```\n:::\n\n\nWe first download daily prices for one stock symbol, e.g., the Apple stock, *AAPL*, directly from the data provider Yahoo!Finance. To download the data, you can use the command `tq_get`. If you do not know how to use it, make sure you read the help file by calling `?tq_get`. We especially recommend taking a look at the examples section of the documentation. 
We request daily data for a period of more than 20 years.\\index{Stock prices}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprices <- tq_get(\"AAPL\",\n get = \"stock.prices\",\n from = \"2000-01-01\",\n to = \"2022-12-31\"\n)\nprices\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5,787 × 8\n symbol date open high low close volume adjusted\n \n1 AAPL 2000-01-03 0.936 1.00 0.908 0.999 535796800 0.848\n2 AAPL 2000-01-04 0.967 0.988 0.903 0.915 512377600 0.777\n3 AAPL 2000-01-05 0.926 0.987 0.920 0.929 778321600 0.788\n4 AAPL 2000-01-06 0.948 0.955 0.848 0.848 767972800 0.720\n5 AAPL 2000-01-07 0.862 0.902 0.853 0.888 460734400 0.754\n# ℹ 5,782 more rows\n```\n\n\n:::\n:::\n\n\n\\index{Data!YahooFinance} `tq_get` downloads stock market data from Yahoo!Finance if you do not specify another data source. The function returns a tibble with eight quite self-explanatory columns: `symbol`, `date`, the market prices at the `open`, `high`, `low`, and `close`, the daily `volume` (in the number of traded shares), and the `adjusted` price in USD. The adjusted prices are corrected for anything that might affect the stock price after the market closes, e.g., stock splits and dividends. These actions affect the quoted prices, but they have no direct impact on the investors who hold the stock. Therefore, we often rely on adjusted prices when it comes to analyzing the returns an investor would have earned by holding the stock continuously.\\index{Stock price adjustments}\n\nNext, we use the `ggplot2` package [@ggplot2] to visualize the time series of adjusted prices in @fig-100 . This package takes care of visualization tasks based on the principles of the grammar of graphics [@Wilkinson2012].\\index{Graph!Time series}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprices |>\n ggplot(aes(x = date, y = adjusted)) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n title = \"Apple stock prices between beginning of 2000 and end of 2022\"\n )\n```\n\n::: {.cell-output-display}\n![Prices are in USD, adjusted for dividend payments and stock splits.](introduction-to-tidy-finance_files/figure-html/fig-100-1.png){#fig-100 fig-alt='Title: Apple stock prices between the beginning of 2000 and the end of 2022. The figure shows that the stock price of Apple increased dramatically from about 1 USD to around 125 USD.' width=2100}\n:::\n:::\n\n\n\\index{Returns} Instead of analyzing prices, we compute daily net returns defined as $r_t = p_t / p_{t-1} - 1$, where $p_t$ is the adjusted day $t$ price. In that context, the function `lag()` is helpful, which returns the previous value in a vector.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns <- prices |>\n arrange(date) |>\n mutate(ret = adjusted / lag(adjusted) - 1) |>\n select(symbol, date, ret)\nreturns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5,787 × 3\n symbol date ret\n \n1 AAPL 2000-01-03 NA \n2 AAPL 2000-01-04 -0.0843\n3 AAPL 2000-01-05 0.0146\n4 AAPL 2000-01-06 -0.0865\n5 AAPL 2000-01-07 0.0474\n# ℹ 5,782 more rows\n```\n\n\n:::\n:::\n\n\nThe resulting tibble contains three columns, where the last contains the daily returns (`ret`). Note that the first entry naturally contains a missing value (`NA`) because there is no previous price.\\index{Missing value} Obviously, the use of `lag()` would be meaningless if the time series is not ordered by ascending dates.\\index{Lag observations} The command `arrange()` provides a convenient way to order observations in the correct way for our application. 
In case you want to order observations by descending dates, you can use `arrange(desc(date))`.\n\nFor the upcoming examples, we remove missing values as these would require separate treatment when computing, e.g., sample averages. In general, however, make sure you understand why `NA` values occur and carefully examine if you can simply get rid of these observations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns <- returns |>\n drop_na(ret)\n```\n:::\n\n\nNext, we visualize the distribution of daily returns in a histogram in @fig-101. \\index{Graph!Histogram} Additionally, we add a dashed line that indicates the 5 percent quantile of the daily returns to the histogram, which is a (crude) proxy for the worst return of the stock with a probability of at most 5 percent. The 5 percent quantile is closely connected to the (historical) value-at-risk, a risk measure commonly monitored by regulators. \\index{Value-at-risk} We refer to @Tsay2010 for a more thorough introduction to stylized facts of returns.\\index{Returns}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquantile_05 <- quantile(returns |> pull(ret), probs = 0.05)\nreturns |>\n ggplot(aes(x = ret)) +\n geom_histogram(bins = 100) +\n geom_vline(aes(xintercept = quantile_05),\n linetype = \"dashed\"\n ) +\n labs(\n x = NULL,\n y = NULL,\n title = \"Distribution of daily Apple stock returns\"\n ) +\n scale_x_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![The dotted vertical line indicates the historical 5 percent quantile.](introduction-to-tidy-finance_files/figure-html/fig-101-1.png){#fig-101 fig-alt='Title: Distribution of daily Apple stock returns in percent. The figure shows a histogram of daily returns. The range indicates a few large negative values, while the remaining returns are distributed around 0. The vertical line indicates that the historical 5 percent quantile of daily returns was around negative 3 percent.' width=2100}\n:::\n:::\n\n\nHere, `bins = 100` determines the number of bins used in the illustration and hence implicitly the width of the bins. Before proceeding, make sure you understand how to use the geom `geom_vline()` to add a dashed line that indicates the 5 percent quantile of the daily returns. A typical task before proceeding with *any* data is to compute summary statistics for the main variables of interest.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n )\n ))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 4\n ret_daily_mean ret_daily_sd ret_daily_min ret_daily_max\n \n1 0.00120 0.0251 -0.519 0.139\n```\n\n\n:::\n:::\n\n\nWe see that the maximum *daily* return was 13.905 percent. Perhaps not surprisingly, the average daily return is close to but slightly above 0. In line with the illustration above, the large losses on the day with the minimum returns indicate a strong asymmetry in the distribution of returns.\\\nYou can also compute these summary statistics for each year individually by imposing `group_by(year = year(date))`, where the call `year(date)` returns the year. More specifically, the few lines of code below compute the summary statistics from above for individual groups of data defined by year. 
The summary statistics, therefore, allow an eyeball analysis of the time-series dynamics of the return distribution.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns |>\n group_by(year = year(date)) |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n ),\n .names = \"{.fn}\"\n )) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 23 × 5\n year daily_mean daily_sd daily_min daily_max\n \n 1 2000 -0.00346 0.0549 -0.519 0.137 \n 2 2001 0.00233 0.0393 -0.172 0.129 \n 3 2002 -0.00121 0.0305 -0.150 0.0846\n 4 2003 0.00186 0.0234 -0.0814 0.113 \n 5 2004 0.00470 0.0255 -0.0558 0.132 \n 6 2005 0.00349 0.0245 -0.0921 0.0912\n 7 2006 0.000949 0.0243 -0.0633 0.118 \n 8 2007 0.00366 0.0238 -0.0702 0.105 \n 9 2008 -0.00265 0.0367 -0.179 0.139 \n10 2009 0.00382 0.0214 -0.0502 0.0676\n11 2010 0.00183 0.0169 -0.0496 0.0769\n12 2011 0.00104 0.0165 -0.0559 0.0589\n13 2012 0.00130 0.0186 -0.0644 0.0887\n14 2013 0.000472 0.0180 -0.124 0.0514\n15 2014 0.00145 0.0136 -0.0799 0.0820\n16 2015 0.0000199 0.0168 -0.0612 0.0574\n17 2016 0.000575 0.0147 -0.0657 0.0650\n18 2017 0.00164 0.0111 -0.0388 0.0610\n19 2018 -0.0000573 0.0181 -0.0663 0.0704\n20 2019 0.00266 0.0165 -0.0996 0.0683\n21 2020 0.00281 0.0294 -0.129 0.120 \n22 2021 0.00131 0.0158 -0.0417 0.0539\n23 2022 -0.000970 0.0225 -0.0587 0.0890\n```\n\n\n:::\n:::\n\n\n\\index{Summary statistics}\n\nIn case you wonder: the additional argument `.names = \"{.fn}\"` in `across()` determines how to name the output columns. The specification is rather flexible and allows almost arbitrary column names, which can be useful for reporting. The `print()` function simply controls the output options for the R console.\n\n## Scaling Up the Analysis\n\nAs a next step, we generalize the code from before such that all the computations can handle an arbitrary vector of symbols (e.g., all constituents of an index). Following tidy principles, it is quite easy to download the data, plot the price time series, and tabulate the summary statistics for an arbitrary number of assets.\n\nThis is where the `tidyverse` magic starts: tidy data makes it extremely easy to generalize the computations from before to as many assets as you like. The following code takes any vector of symbols, e.g., `symbol <- c(\"AAPL\", \"MMM\", \"BA\")`, and automates the download as well as the plot of the price time series. In the end, we create the table of summary statistics for an arbitrary number of assets. We perform the analysis with data from all current constituents of the [Dow Jones Industrial Average index.](https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average) \\index{Data!Dow Jones Index}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsymbols <- tq_index(\"DOW\") |> \n filter(company != \"US DOLLAR\")\nsymbols\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30 × 8\n symbol company identifier sedol weight sector shares_held\n \n1 UNH UNITEDHEALTH GRO… 91324P102 2917… 0.106 - 5487702\n2 MSFT MICROSOFT CORP 594918104 2588… 0.0641 - 5487702\n3 GS GOLDMAN SACHS GR… 38141G104 2407… 0.0605 - 5487702\n4 HD HOME DEPOT INC 437076102 2434… 0.0571 - 5487702\n5 AMGN AMGEN INC 031162100 2023… 0.0556 - 5487702\n# ℹ 25 more rows\n# ℹ 1 more variable: local_currency \n```\n\n\n:::\n:::\n\n\nConveniently, `tidyquant` provides a function to get all stocks in a stock index with a single call (similarly, `tq_exchange(\"NASDAQ\")` delivers all stocks currently listed on the NASDAQ exchange). 
\\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices <- tq_get(symbols,\n get = \"stock.prices\",\n from = \"2000-01-01\",\n to = \"2022-12-31\"\n)\n```\n:::\n\n\nThe resulting tibble contains 165593 daily observations for 30 different corporations. @fig-103 illustrates the time series of downloaded *adjusted* prices for each of the constituents of the Dow Jones index. Make sure you understand every single line of code! What are the arguments of `aes()`? Which alternative `geoms` could you use to visualize the time series? Hint: if you do not know the answers try to change the code to see what difference your intervention causes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices |>\n ggplot(aes(\n x = date,\n y = adjusted,\n color = symbol\n )) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n color = NULL,\n title = \"Stock prices of DOW index constituents\"\n ) +\n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![Prices in USD, adjusted for dividend payments and stock splits.](introduction-to-tidy-finance_files/figure-html/fig-103-1.png){#fig-103 fig-alt='Title: Stock prices of DOW index constituents. The figure shows many time series with daily prices. The general trend seems positive for most stocks in the DOW index.' width=2100}\n:::\n:::\n\n\nDo you notice the small differences relative to the code we used before? `tq_get(symbols)` returns a tibble for several symbols as well. All we need to do to illustrate all stock symbols simultaneously is to include `color = symbol` in the `ggplot` aesthetics. In this way, we generate a separate line for each symbol. Of course, there are simply too many lines on this graph to identify the individual stocks properly, but it illustrates the point well.\n\nThe same holds for stock returns. Before computing the returns, we use `group_by(symbol)` such that the `mutate()` command is performed for each symbol individually. 
The same logic also applies to the computation of summary statistics: `group_by(symbol)` is the key to aggregating the time series into symbol-specific variables of interest.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nall_returns <- index_prices |>\n group_by(symbol) |>\n mutate(ret = adjusted / lag(adjusted) - 1) |>\n select(symbol, date, ret) |>\n drop_na(ret)\n\nall_returns |>\n group_by(symbol) |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n ),\n .names = \"{.fn}\"\n )) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30 × 5\n symbol daily_mean daily_sd daily_min daily_max\n \n 1 AAPL 0.00120 0.0251 -0.519 0.139\n 2 AMGN 0.000489 0.0197 -0.134 0.151\n 3 AXP 0.000518 0.0229 -0.176 0.219\n 4 BA 0.000595 0.0224 -0.238 0.243\n 5 CAT 0.000709 0.0204 -0.145 0.147\n 6 CRM 0.00110 0.0270 -0.271 0.260\n 7 CSCO 0.000317 0.0237 -0.162 0.244\n 8 CVX 0.000553 0.0176 -0.221 0.227\n 9 DIS 0.000418 0.0195 -0.184 0.160\n10 DOW 0.000562 0.0260 -0.217 0.209\n11 GS 0.000550 0.0231 -0.190 0.265\n12 HD 0.000543 0.0194 -0.287 0.141\n13 HON 0.000515 0.0194 -0.174 0.282\n14 IBM 0.000273 0.0165 -0.155 0.120\n15 INTC 0.000285 0.0236 -0.220 0.201\n16 JNJ 0.000408 0.0122 -0.158 0.122\n17 JPM 0.000582 0.0242 -0.207 0.251\n18 KO 0.000337 0.0132 -0.101 0.139\n19 MCD 0.000533 0.0147 -0.159 0.181\n20 MMM 0.000378 0.0150 -0.129 0.126\n21 MRK 0.000383 0.0168 -0.268 0.130\n22 MSFT 0.000513 0.0194 -0.156 0.196\n23 NKE 0.000743 0.0194 -0.198 0.155\n24 PG 0.000377 0.0134 -0.302 0.120\n25 TRV 0.000569 0.0183 -0.208 0.256\n26 UNH 0.000984 0.0198 -0.186 0.348\n27 V 0.000929 0.0190 -0.136 0.150\n28 VZ 0.000239 0.0151 -0.118 0.146\n29 WBA 0.000284 0.0182 -0.150 0.166\n30 WMT 0.000314 0.0150 -0.114 0.117\n```\n\n\n:::\n:::\n\n\n\\index{Summary statistics}\n\nNote that you are now also equipped with all tools to download price data for *each* symbol listed in the S&P 500 index with the same number of lines of code. Just use `symbol <- tq_index(\"SP500\")`, which provides you with a tibble that contains each symbol that is (currently) part of the S&P 500.\\index{Data!SP 500} However, don't try this if you are not prepared to wait for a couple of minutes because this is quite some data to download!\n\n## Other Forms of Data Aggregation\n\nOf course, aggregation across variables other than `symbol` can also make sense. For instance, suppose you are interested in answering the question: Are days with high aggregate trading volume likely followed by days with high aggregate trading volume? To provide some initial analysis on this question, we take the downloaded data and compute aggregate daily trading volume for all Dow Jones constituents in USD. Recall that the column `volume` is denoted in the number of traded shares.\\index{Trading volume} Thus, we multiply the trading volume with the daily closing price to get a proxy for the aggregate trading volume in USD. 
Scaling by `1e9` (R can handle scientific notation) denotes daily trading volume in billion USD.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrading_volume <- index_prices |>\n group_by(date) |>\n summarize(trading_volume = sum(volume * adjusted))\n\ntrading_volume |>\n ggplot(aes(x = date, y = trading_volume)) +\n geom_line() +\n labs(\n x = NULL, y = NULL,\n title = \"Aggregate daily trading volume of DOW index constitutens\"\n ) +\n scale_y_continuous(labels = unit_format(unit = \"B\", scale = 1e-9))\n```\n\n::: {.cell-output-display}\n![Total daily trading volume in billion USD.](introduction-to-tidy-finance_files/figure-html/fig-104-1.png){#fig-104 fig-alt='Title: Aggregate daily trading volume. The figure shows a volatile time series of daily trading volume, ranging from 15 in 2000 to 20.5 in 2022, with a maximum of more than 100.' width=2100}\n:::\n:::\n\n\n@fig-104 indicates a clear upward trend in aggregated daily trading volume. In particular, since the outbreak of the COVID-19 pandemic, markets have processed substantial trading volumes, as analyzed, for instance, by @Goldstein2021.\\index{Covid 19} One way to illustrate the persistence of trading volume would be to plot volume on day $t$ against volume on day $t-1$ as in the example below. In @fig-105, we add a dotted 45°-line to indicate a hypothetical one-to-one relation by `geom_abline()`, addressing potential differences in the axes' scales.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrading_volume |>\n ggplot(aes(x = lag(trading_volume), y = trading_volume)) +\n geom_point() +\n geom_abline(aes(intercept = 0, slope = 1),\n linetype = \"dashed\"\n ) +\n labs(\n x = \"Previous day aggregate trading volume\",\n y = \"Aggregate trading volume\",\n title = \"Persistence in daily trading volume of DOW index constituents\"\n ) + \n scale_x_continuous(labels = unit_format(unit = \"B\", scale = 1e-9)) +\n scale_y_continuous(labels = unit_format(unit = \"B\", scale = 1e-9))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Removed 1 rows containing missing values (`geom_point()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n![Total daily trading volume in billion USD.](introduction-to-tidy-finance_files/figure-html/fig-105-1.png){#fig-105 fig-alt='Title: Persistence in daily trading volume of DOW index constituents. The figure shows a scatterplot where aggregate trading volume and previous-day aggregate trading volume neatly line up along a 45-degree line.' width=2100}\n:::\n:::\n\n\nDo you understand where the warning `## Warning: Removed 1 rows containing missing values (geom_point).` comes from and what it means? Purely eye-balling reveals that days with high trading volume are often followed by similarly high trading volume days.\\index{Error message}\n\n## Portfolio Choice Problems\n\nIn the previous part, we show how to download stock market data and inspect it with graphs and summary statistics. 
Now, we move to a typical question in Finance: how to allocate wealth across different assets optimally.\\index{Portfolio choice} The standard framework for optimal portfolio selection considers investors that prefer higher future returns but dislike future return volatility (defined as the square root of the return variance): the *mean-variance investor* [@Markowitz1952].\\index{Markowitz optimization}\n\n\\index{Efficient frontier} An essential tool to evaluate portfolios in the mean-variance context is the *efficient frontier*, the set of portfolios which satisfies the condition that no other portfolio exists with a higher expected return but with the same volatility (the square root of the variance, i.e., the risk), see, e.g., @Merton1972.\\index{Return volatility} We compute and visualize the efficient frontier for several stocks. First, we extract each asset's *monthly* returns. In order to keep things simple, we work with a balanced panel and exclude DOW constituents for which we do not observe a price on every single trading day since the year 2000.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices <- index_prices |>\n group_by(symbol) |>\n mutate(n = n()) |>\n ungroup() |>\n filter(n == max(n)) |>\n select(-n)\nreturns <- index_prices |>\n mutate(month = floor_date(date, \"month\")) |>\n group_by(symbol, month) |>\n summarize(price = last(adjusted), .groups = \"drop_last\") |>\n mutate(ret = price / lag(price) - 1) |>\n drop_na(ret) |>\n select(-price)\n```\n:::\n\n\nHere, `floor_date()` is a function from the `lubridate` package [@lubridate], which provides useful functions to work with dates and times.\n\nNext, we transform the returns from a tidy tibble into a $(T \\times N)$ matrix with one column for each of the $N$ symbols and one row for each of the $T$ trading days to compute the sample average return vector $$\\hat\\mu = \\frac{1}{T}\\sum\\limits_{t=1}^T r_t$$ where $r_t$ is the $N$ vector of returns on date $t$ and the sample covariance matrix $$\\hat\\Sigma = \\frac{1}{T-1}\\sum\\limits_{t=1}^T (r_t - \\hat\\mu)(r_t - \\hat\\mu)'.$$ We achieve this by using `pivot_wider()` with the new column names from the column `symbol` and setting the values to `ret`. We compute the vector of sample average returns and the sample variance-covariance matrix, which we consider as proxies for the parameters of the distribution of future stock returns. Thus, for simplicity, we refer to $\\Sigma$ and $\\mu$ instead of explicitly highlighting that the sample moments are estimates. \\index{Covariance} In later chapters, we discuss the issues that arise once we take estimation uncertainty into account.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns_matrix <- returns |>\n pivot_wider(\n names_from = symbol,\n values_from = ret\n ) |>\n select(-month)\nsigma <- cov(returns_matrix)\nmu <- colMeans(returns_matrix)\n```\n:::\n\n\nThen, we compute the minimum variance portfolio weights $\\omega_\\text{mvp}$ as well as the expected portfolio return $\\omega_\\text{mvp}'\\mu$ and volatility $\\sqrt{\\omega_\\text{mvp}'\\Sigma\\omega_\\text{mvp}}$ of this portfolio. \\index{Minimum variance portfolio} Recall that the minimum variance portfolio is the vector of portfolio weights that are the solution to $$\\omega_\\text{mvp} = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\sum\\limits_{i=1}^N\\omega_i = 1.$$ The constraint that weights sum up to one simply implies that all funds are distributed across the available asset universe, i.e., there is no possibility to retain cash. 
It is easy to show analytically that $\\omega_\\text{mvp} = \\frac{\\Sigma^{-1}\\iota}{\\iota'\\Sigma^{-1}\\iota}$, where $\\iota$ is a vector of ones and $\\Sigma^{-1}$ is the inverse of $\\Sigma$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nN <- ncol(returns_matrix)\niota <- rep(1, N)\nsigma_inv <- solve(sigma)\nmvp_weights <- sigma_inv %*% iota\nmvp_weights <- mvp_weights / sum(mvp_weights)\ntibble(\n average_ret = as.numeric(t(mvp_weights) %*% mu),\n volatility = as.numeric(sqrt(t(mvp_weights) %*% sigma %*% mvp_weights))\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 2\n average_ret volatility\n \n1 0.00792 0.0321\n```\n\n\n:::\n:::\n\n\nThe command `solve(A, b)` returns the solution of a system of equations $Ax = b$. If `b` is not provided, as in the example above, it defaults to the identity matrix such that `solve(sigma)` delivers $\\Sigma^{-1}$ (if a unique solution exists).\\\nNote that the *monthly* volatility of the minimum variance portfolio is of the same order of magnitude as the *daily* standard deviation of the individual components. Thus, the diversification benefits in terms of risk reduction are tremendous!\\index{Diversification}\n\nNext, we set out to find the weights for a portfolio that achieves, as an example, three times the expected return of the minimum variance portfolio. However, mean-variance investors are not interested in any portfolio that achieves the required return but rather in the efficient portfolio, i.e., the portfolio with the lowest standard deviation. If you wonder where the solution $\\omega_\\text{eff}$ comes from: \\index{Efficient portfolio} The efficient portfolio is chosen by an investor who aims to achieve minimum variance *given a minimum acceptable expected return* $\\bar{\\mu}$. Hence, their objective function is to choose $\\omega_\\text{eff}$ as the solution to $$\\omega_\\text{eff}(\\bar{\\mu}) = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1 \\text{ and } \\omega'\\mu \\geq \\bar{\\mu}.$$\n\nThe code below implements the analytic solution to this optimization problem for a benchmark return $\\bar\\mu$, which we set to 3 times the expected return of the minimum variance portfolio. We encourage you to verify that it is correct.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmark_multiple <- 3\nmu_bar <- benchmark_multiple * t(mvp_weights) %*% mu\nC <- as.numeric(t(iota) %*% sigma_inv %*% iota)\nD <- as.numeric(t(iota) %*% sigma_inv %*% mu)\nE <- as.numeric(t(mu) %*% sigma_inv %*% mu)\nlambda_tilde <- as.numeric(2 * (mu_bar - D / C) / (E - D^2 / C))\nefp_weights <- mvp_weights +\n lambda_tilde / 2 * (sigma_inv %*% mu - D * mvp_weights)\n```\n:::\n\n\n## The Efficient Frontier\n\n\\index{Separation theorem} The mutual fund separation theorem states that as soon as we have two efficient portfolios (such as the minimum variance portfolio $\\omega_\\text{mvp}$ and the efficient portfolio for a higher required level of expected returns $\\omega_\\text{eff}(\\bar{\\mu})$, we can characterize the entire efficient frontier by combining these two portfolios. That is, any linear combination of the two portfolio weights will again represent an efficient portfolio. \\index{Efficient frontier} The code below implements the construction of the *efficient frontier*, which characterizes the highest expected return achievable at each level of risk. 
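\n\nBefore turning to that code, a quick sanity check on the efficient portfolio weights computed above does not hurt (a small sketch): by construction, the weights should sum to one and deliver exactly the benchmark return $\\bar\\mu$, up to numerical noise.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: the efficient portfolio weights should sum to one and\n# meet the targeted expected return mu_bar (both up to numerical noise)\nsum(efp_weights)\nas.numeric(t(efp_weights) %*% mu - mu_bar)\n```\n:::\n\n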
To understand the code better, make sure to familiarize yourself with the inner workings of the `for` loop.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlength_year <- 12\na <- seq(from = -0.4, to = 1.9, by = 0.01)\nres <- tibble(\n a = a,\n mu = NA,\n sd = NA\n)\nfor (i in seq_along(a)) {\n w <- (1 - a[i]) * mvp_weights + (a[i]) * efp_weights\n res$mu[i] <- length_year * t(w) %*% mu \n res$sd[i] <- sqrt(length_year) * sqrt(t(w) %*% sigma %*% w)\n}\n```\n:::\n\n\nThe code above proceeds in two steps: First, we compute a vector of combination weights $a$ and then we evaluate the resulting linear combination with $a\\in\\mathbb{R}$:\\\n$$\\omega^* = a\\omega_\\text{eff}(\\bar\\mu) + (1-a)\\omega_\\text{mvp} = \\omega_\\text{mvp} + \\frac{\\lambda^*}{2}\\left(\\Sigma^{-1}\\mu -\\frac{D}{C}\\Sigma^{-1}\\iota \\right)$$ with $\\lambda^* = 2\\frac{a\\bar\\mu + (1-a)\\tilde\\mu - D/C}{E-D^2/C}$ where $C = \\iota'\\Sigma^{-1}\\iota$, $D=\\iota'\\Sigma^{-1}\\mu$, and $E=\\mu'\\Sigma^{-1}\\mu$. Finally, it is simple to visualize the efficient frontier alongside the two efficient portfolios within one powerful figure using `ggplot` (see @fig-106). We also add the individual stocks in the same call. We compute annualized returns based on the simple assumption that monthly returns are independent and identically distributed. Thus, the average annualized return is just 12 times the expected monthly return.\\index{Graph!Efficient frontier}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres |>\n ggplot(aes(x = sd, y = mu)) +\n geom_point() +\n geom_point(\n data = res |> filter(a %in% c(0, 1)),\n size = 4\n ) +\n geom_point(\n data = tibble(\n mu = length_year * mu, \n sd = sqrt(length_year) * sqrt(diag(sigma))\n ),\n aes(y = mu, x = sd), size = 1\n ) +\n labs(\n x = \"Annualized standard deviation\",\n y = \"Annualized expected return\",\n title = \"Efficient frontier for DOW index constituents\"\n ) +\n scale_x_continuous(labels = percent) +\n scale_y_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![The big dots indicate the location of the minimum variance and the efficient portfolio that delivers 3 times the expected return of the minimum variance portfolio, respectively. The small dots indicate the location of the individual constituents.](introduction-to-tidy-finance_files/figure-html/fig-106-1.png){#fig-106 fig-alt='Title: Efficient frontier for DOW index constituents. The figure shows DOW index constituents in a mean-variance diagram. A hyperbola indicates the efficient frontier of portfolios that dominate the individual holdings in the sense that they deliver higher expected returns for the same level of volatility.' width=2100}\n:::\n:::\n\n\nThe line in @fig-106 indicates the efficient frontier: the set of portfolios a mean-variance efficient investor would choose from. Compare the performance relative to the individual assets (the dots) - it should become clear that diversifying yields massive performance gains (at least as long as we take the parameters $\\Sigma$ and $\\mu$ as given).\n\n## Exercises\n\n1. Download daily prices for another stock market symbol of your choice from Yahoo!Finance with `tq_get()` from the `tidyquant` package. Plot two time series of the ticker’s un-adjusted and adjusted closing prices. Explain the differences.\n1. Compute daily net returns for an asset of your choice and visualize the distribution of daily returns in a histogram using 100 bins. 
Also, use `geom_vline()` to add a dashed red vertical line that indicates the 5 percent quantile of the daily returns. Compute summary statistics (mean, standard deviation, minimum and maximum) for the daily returns.\n1. Take your code from before and generalize it such that you can perform all the computations for an arbitrary vector of tickers (e.g., `ticker <- c(\"AAPL\", \"MMM\", \"BA\")`). Automate the download, the plot of the price time series, and create a table of return summary statistics for this arbitrary number of assets.\n1. Are days with high aggregate trading volume often also days with large absolute returns? Find an appropriate visualization to analyze the question using the ticker `AAPL`.\n1.Compute monthly returns from the downloaded stock market prices. Compute the vector of historical average returns and the sample variance-covariance matrix. Compute the minimum variance portfolio weights and the portfolio volatility and average returns. Visualize the mean-variance efficient frontier. Choose one of your assets and identify the portfolio which yields the same historical volatility but achieves the highest possible average return.\n1. In the portfolio choice analysis, we restricted our sample to all assets trading every day since 2000. How is such a decision a problem when you want to infer future expected portfolio performance from the results?\n1. The efficient frontier characterizes the portfolios with the highest expected return for different levels of risk. Identify the portfolio with the highest expected return per standard deviation. Which famous performance measure is close to the ratio of average returns to the standard deviation of returns?\n", + "markdown": "---\ntitle: Introduction to Tidy Finance\naliases:\n - ../introduction-to-tidy-finance.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Introduction to Tidy Finance with R\n description-meta: Learn how to use the programming language R for downloading and analyzing stock market data.\n---\n\n\nThe main aim of this chapter is to familiarize yourself with the tidyverse. We start by downloading and visualizing stock data from Yahoo!Finance. Then we move to a simple portfolio choice problem and construct the efficient frontier. These examples introduce you to our approach of *Tidy Finance*.\n\n## Working with Stock Market Data\n\nAt the start of each session, we load the required R packages. Throughout the entire book, we always use the `tidyverse` [@Wickham2019]. In this chapter, we also load the convenient `tidyquant` package [@tidyquant] to download price data. This package provides a convenient wrapper for various quantitative functions compatible with the `tidyverse`.\\index{tidyverse}\nFinally, the package `scales` [@scales] provides useful scale functions for visualizations.\n\nYou typically have to install a package once before you can load it. In case you have not done this yet, call `install.packages(\"tidyquant\")`. \\index{tidyquant} If you have trouble using `tidyquant`, check out the corresponding [documentation.](https://cran.r-project.org/web/packages/tidyquant/vignettes/TQ00-introduction-to-tidyquant.html)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(tidyquant)\nlibrary(scales)\n```\n:::\n\n\nWe first download daily prices for one stock symbol, e.g., the Apple stock, *AAPL*, directly from the data provider Yahoo!Finance. To download the data, you can use the command `tq_get`. 
If you do not know how to use it, make sure you read the help file by calling `?tq_get`. We especially recommend taking a look at the examples section of the documentation. We request daily data for a period of more than 20 years.\\index{Stock prices}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprices <- tq_get(\"AAPL\",\n get = \"stock.prices\",\n from = \"2000-01-01\",\n to = \"2022-12-31\"\n)\nprices\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5,787 × 8\n symbol date open high low close volume adjusted\n \n1 AAPL 2000-01-03 0.936 1.00 0.908 0.999 535796800 0.847\n2 AAPL 2000-01-04 0.967 0.988 0.903 0.915 512377600 0.776\n3 AAPL 2000-01-05 0.926 0.987 0.920 0.929 778321600 0.787\n4 AAPL 2000-01-06 0.948 0.955 0.848 0.848 767972800 0.719\n5 AAPL 2000-01-07 0.862 0.902 0.853 0.888 460734400 0.753\n# ℹ 5,782 more rows\n```\n\n\n:::\n:::\n\n\n\\index{Data!YahooFinance} `tq_get` downloads stock market data from Yahoo!Finance if you do not specify another data source. The function returns a tibble with eight quite self-explanatory columns: `symbol`, `date`, the market prices at the `open`, `high`, `low`, and `close`, the daily `volume` (in the number of traded shares), and the `adjusted` price in USD. The adjusted prices are corrected for anything that might affect the stock price after the market closes, e.g., stock splits and dividends. These actions affect the quoted prices, but they have no direct impact on the investors who hold the stock. Therefore, we often rely on adjusted prices when it comes to analyzing the returns an investor would have earned by holding the stock continuously.\\index{Stock price adjustments}\n\nNext, we use the `ggplot2` package [@ggplot2] to visualize the time series of adjusted prices in @fig-100 . This package takes care of visualization tasks based on the principles of the grammar of graphics [@Wilkinson2012].\\index{Graph!Time series}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprices |>\n ggplot(aes(x = date, y = adjusted)) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n title = \"Apple stock prices between beginning of 2000 and end of 2022\"\n )\n```\n\n::: {.cell-output-display}\n![Prices are in USD, adjusted for dividend payments and stock splits.](introduction-to-tidy-finance_files/figure-html/fig-100-1.png){#fig-100 fig-alt='Title: Apple stock prices between the beginning of 2000 and the end of 2022. The figure shows that the stock price of Apple increased dramatically from about 1 USD to around 125 USD.' width=2100}\n:::\n:::\n\n\n\\index{Returns} Instead of analyzing prices, we compute daily net returns defined as $r_t = p_t / p_{t-1} - 1$, where $p_t$ is the adjusted day $t$ price. In that context, the function `lag()` is helpful, which returns the previous value in a vector.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns <- prices |>\n arrange(date) |>\n mutate(ret = adjusted / lag(adjusted) - 1) |>\n select(symbol, date, ret)\nreturns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5,787 × 3\n symbol date ret\n \n1 AAPL 2000-01-03 NA \n2 AAPL 2000-01-04 -0.0843\n3 AAPL 2000-01-05 0.0146\n4 AAPL 2000-01-06 -0.0865\n5 AAPL 2000-01-07 0.0474\n# ℹ 5,782 more rows\n```\n\n\n:::\n:::\n\n\nThe resulting tibble contains three columns, where the last contains the daily returns (`ret`). 
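\n\nIf you have not worked with `lag()` before, a quick look at its behavior on a plain numeric vector clarifies where the missing value in the first row comes from.\n\n::: {.cell}\n\n```{.r .cell-code}\n# lag() shifts a vector by one position and pads the start with NA\nlag(c(1, 5, 10)) # returns NA 1 5\n```\n:::\n\n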
Note that the first entry naturally contains a missing value (`NA`) because there is no previous price.\\index{Missing value} Obviously, the use of `lag()` would be meaningless if the time series is not ordered by ascending dates.\\index{Lag observations} The command `arrange()` provides a convenient way to order observations in the correct way for our application. In case you want to order observations by descending dates, you can use `arrange(desc(date))`.\n\nFor the upcoming examples, we remove missing values as these would require separate treatment when computing, e.g., sample averages. In general, however, make sure you understand why `NA` values occur and carefully examine if you can simply get rid of these observations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns <- returns |>\n drop_na(ret)\n```\n:::\n\n\nNext, we visualize the distribution of daily returns in a histogram in @fig-101. \\index{Graph!Histogram} Additionally, we add a dashed line that indicates the 5 percent quantile of the daily returns to the histogram, which is a (crude) proxy for the worst return of the stock with a probability of at most 5 percent. The 5 percent quantile is closely connected to the (historical) value-at-risk, a risk measure commonly monitored by regulators. \\index{Value-at-risk} We refer to @Tsay2010 for a more thorough introduction to stylized facts of returns.\\index{Returns}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquantile_05 <- quantile(returns |> pull(ret), probs = 0.05)\nreturns |>\n ggplot(aes(x = ret)) +\n geom_histogram(bins = 100) +\n geom_vline(aes(xintercept = quantile_05),\n linetype = \"dashed\"\n ) +\n labs(\n x = NULL,\n y = NULL,\n title = \"Distribution of daily Apple stock returns\"\n ) +\n scale_x_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![The dotted vertical line indicates the historical 5 percent quantile.](introduction-to-tidy-finance_files/figure-html/fig-101-1.png){#fig-101 fig-alt='Title: Distribution of daily Apple stock returns in percent. The figure shows a histogram of daily returns. The range indicates a few large negative values, while the remaining returns are distributed around 0. The vertical line indicates that the historical 5 percent quantile of daily returns was around negative 3 percent.' width=2100}\n:::\n:::\n\n\nHere, `bins = 100` determines the number of bins used in the illustration and hence implicitly the width of the bins. Before proceeding, make sure you understand how to use the geom `geom_vline()` to add a dashed line that indicates the 5 percent quantile of the daily returns. A typical task before proceeding with *any* data is to compute summary statistics for the main variables of interest.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n )\n ))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 4\n ret_daily_mean ret_daily_sd ret_daily_min ret_daily_max\n \n1 0.00120 0.0251 -0.519 0.139\n```\n\n\n:::\n:::\n\n\nWe see that the maximum *daily* return was 13.905 percent. Perhaps not surprisingly, the average daily return is close to but slightly above 0. In line with the illustration above, the large losses on the day with the minimum returns indicate a strong asymmetry in the distribution of returns.\\\nYou can also compute these summary statistics for each year individually by imposing `group_by(year = year(date))`, where the call `year(date)` returns the year. 
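\n\nApplied to a single date, the function simply extracts the calendar year, as the following one-line example illustrates.\n\n::: {.cell}\n\n```{.r .cell-code}\n# year() extracts the calendar year from a date\nyear(ymd(\"2022-12-31\")) # returns 2022\n```\n:::\n\n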
More specifically, the few lines of code below compute the summary statistics from above for individual groups of data defined by year. The summary statistics, therefore, allow an eyeball analysis of the time-series dynamics of the return distribution.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns |>\n group_by(year = year(date)) |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n ),\n .names = \"{.fn}\"\n )) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 23 × 5\n year daily_mean daily_sd daily_min daily_max\n \n 1 2000 -0.00346 0.0549 -0.519 0.137 \n 2 2001 0.00233 0.0393 -0.172 0.129 \n 3 2002 -0.00121 0.0305 -0.150 0.0846\n 4 2003 0.00186 0.0234 -0.0814 0.113 \n 5 2004 0.00470 0.0255 -0.0558 0.132 \n 6 2005 0.00349 0.0245 -0.0921 0.0912\n 7 2006 0.000949 0.0243 -0.0633 0.118 \n 8 2007 0.00366 0.0238 -0.0702 0.105 \n 9 2008 -0.00265 0.0367 -0.179 0.139 \n10 2009 0.00382 0.0214 -0.0502 0.0676\n11 2010 0.00183 0.0169 -0.0496 0.0769\n12 2011 0.00104 0.0165 -0.0559 0.0589\n13 2012 0.00130 0.0186 -0.0644 0.0887\n14 2013 0.000472 0.0180 -0.124 0.0514\n15 2014 0.00145 0.0136 -0.0799 0.0820\n16 2015 0.0000199 0.0168 -0.0612 0.0574\n17 2016 0.000575 0.0147 -0.0657 0.0650\n18 2017 0.00164 0.0111 -0.0388 0.0610\n19 2018 -0.0000573 0.0181 -0.0663 0.0704\n20 2019 0.00266 0.0165 -0.0996 0.0683\n21 2020 0.00281 0.0294 -0.129 0.120 \n22 2021 0.00131 0.0158 -0.0417 0.0539\n23 2022 -0.000970 0.0225 -0.0587 0.0890\n```\n\n\n:::\n:::\n\n\n\\index{Summary statistics}\n\nIn case you wonder: the additional argument `.names = \"{.fn}\"` in `across()` determines how to name the output columns. The specification is rather flexible and allows almost arbitrary column names, which can be useful for reporting. The `print()` function simply controls the output options for the R console.\n\n## Scaling Up the Analysis\n\nAs a next step, we generalize the code from before such that all the computations can handle an arbitrary vector of symbols (e.g., all constituents of an index). Following tidy principles, it is quite easy to download the data, plot the price time series, and tabulate the summary statistics for an arbitrary number of assets.\n\nThis is where the `tidyverse` magic starts: tidy data makes it extremely easy to generalize the computations from before to as many assets as you like. The following code takes any vector of symbols, e.g., `symbol <- c(\"AAPL\", \"MMM\", \"BA\")`, and automates the download as well as the plot of the price time series. In the end, we create the table of summary statistics for an arbitrary number of assets. 
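\n\nAs a warm-up, you can pass a short, hand-picked vector of symbols directly to `tq_get()`; the three tickers below are merely examples, and the resulting tibble simply stacks the daily prices of all requested symbols.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Warm-up sketch with a hand-picked vector of symbols (examples only)\nsymbols_small <- c(\"AAPL\", \"MMM\", \"BA\")\nprices_small <- tq_get(symbols_small,\n  get = \"stock.prices\",\n  from = \"2000-01-01\",\n  to = \"2022-12-31\"\n)\nprices_small |>\n  count(symbol)\n```\n:::\n\n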
We perform the analysis with data from all current constituents of the [Dow Jones Industrial Average index.](https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average) \\index{Data!Dow Jones Index}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsymbols <- tq_index(\"DOW\") |> \n filter(company != \"US DOLLAR\")\nsymbols\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30 × 8\n symbol company identifier sedol weight sector shares_held\n \n1 UNH UNITEDHEALTH GRO… 91324P102 2917… 0.0958 - 5747928\n2 GS GOLDMAN SACHS GR… 38141G104 2407… 0.0672 - 5747928\n3 MSFT MICROSOFT CORP 594918104 2588… 0.0646 - 5747928\n4 HD HOME DEPOT INC 437076102 2434… 0.0594 - 5747928\n5 AMGN AMGEN INC 031162100 2023… 0.0532 - 5747928\n# ℹ 25 more rows\n# ℹ 1 more variable: local_currency \n```\n\n\n:::\n:::\n\n\nConveniently, `tidyquant` provides a function to get all stocks in a stock index with a single call (similarly, `tq_exchange(\"NASDAQ\")` delivers all stocks currently listed on the NASDAQ exchange). \\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices <- tq_get(symbols,\n get = \"stock.prices\",\n from = \"2000-01-01\",\n to = \"2022-12-31\"\n)\n```\n:::\n\n\nThe resulting tibble contains 165593 daily observations for 30 different corporations. @fig-103 illustrates the time series of downloaded *adjusted* prices for each of the constituents of the Dow Jones index. Make sure you understand every single line of code! What are the arguments of `aes()`? Which alternative `geoms` could you use to visualize the time series? Hint: if you do not know the answers try to change the code to see what difference your intervention causes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices |>\n ggplot(aes(\n x = date,\n y = adjusted,\n color = symbol\n )) +\n geom_line() +\n labs(\n x = NULL,\n y = NULL,\n color = NULL,\n title = \"Stock prices of DOW index constituents\"\n ) +\n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![Prices in USD, adjusted for dividend payments and stock splits.](introduction-to-tidy-finance_files/figure-html/fig-103-1.png){#fig-103 fig-alt='Title: Stock prices of DOW index constituents. The figure shows many time series with daily prices. The general trend seems positive for most stocks in the DOW index.' width=2100}\n:::\n:::\n\n\nDo you notice the small differences relative to the code we used before? `tq_get(symbols)` returns a tibble for several symbols as well. All we need to do to illustrate all stock symbols simultaneously is to include `color = symbol` in the `ggplot` aesthetics. In this way, we generate a separate line for each symbol. Of course, there are simply too many lines on this graph to identify the individual stocks properly, but it illustrates the point well.\n\nThe same holds for stock returns. Before computing the returns, we use `group_by(symbol)` such that the `mutate()` command is performed for each symbol individually. 
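\n\nTo see why this grouping step matters, consider a tiny example with made-up prices: without `group_by()`, `lag()` shifts prices across the boundary between two symbols, so the first return of the second symbol would mix prices of two different stocks.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy example with made-up prices: grouping prevents lag() from crossing symbols\ntoy_prices <- tibble(\n  symbol = c(\"A\", \"A\", \"B\", \"B\"),\n  adjusted = c(10, 11, 100, 110)\n)\n\ntoy_prices |>\n  mutate(ret_ungrouped = adjusted / lag(adjusted) - 1) |>\n  group_by(symbol) |>\n  mutate(ret_grouped = adjusted / lag(adjusted) - 1)\n```\n:::\n\n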
The same logic also applies to the computation of summary statistics: `group_by(symbol)` is the key to aggregating the time series into symbol-specific variables of interest.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nall_returns <- index_prices |>\n group_by(symbol) |>\n mutate(ret = adjusted / lag(adjusted) - 1) |>\n select(symbol, date, ret) |>\n drop_na(ret)\n\nall_returns |>\n group_by(symbol) |>\n summarize(across(\n ret,\n list(\n daily_mean = mean,\n daily_sd = sd,\n daily_min = min,\n daily_max = max\n ),\n .names = \"{.fn}\"\n )) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 30 × 5\n symbol daily_mean daily_sd daily_min daily_max\n \n 1 AAPL 0.00120 0.0251 -0.519 0.139\n 2 AMGN 0.000489 0.0197 -0.134 0.151\n 3 AXP 0.000518 0.0229 -0.176 0.219\n 4 BA 0.000595 0.0224 -0.238 0.243\n 5 CAT 0.000709 0.0204 -0.145 0.147\n 6 CRM 0.00110 0.0270 -0.271 0.260\n 7 CSCO 0.000317 0.0237 -0.162 0.244\n 8 CVX 0.000553 0.0176 -0.221 0.227\n 9 DIS 0.000418 0.0195 -0.184 0.160\n10 DOW 0.000562 0.0260 -0.217 0.209\n11 GS 0.000550 0.0231 -0.190 0.265\n12 HD 0.000543 0.0194 -0.287 0.141\n13 HON 0.000515 0.0194 -0.174 0.282\n14 IBM 0.000273 0.0165 -0.155 0.120\n15 INTC 0.000285 0.0236 -0.220 0.201\n16 JNJ 0.000408 0.0122 -0.158 0.122\n17 JPM 0.000582 0.0242 -0.207 0.251\n18 KO 0.000337 0.0132 -0.101 0.139\n19 MCD 0.000533 0.0147 -0.159 0.181\n20 MMM 0.000378 0.0150 -0.129 0.126\n21 MRK 0.000383 0.0168 -0.268 0.130\n22 MSFT 0.000513 0.0194 -0.156 0.196\n23 NKE 0.000743 0.0194 -0.198 0.155\n24 PG 0.000377 0.0134 -0.302 0.120\n25 TRV 0.000569 0.0183 -0.208 0.256\n26 UNH 0.000984 0.0198 -0.186 0.348\n27 V 0.000929 0.0190 -0.136 0.150\n28 VZ 0.000239 0.0151 -0.118 0.146\n29 WBA 0.000284 0.0182 -0.150 0.166\n30 WMT 0.000314 0.0150 -0.114 0.117\n```\n\n\n:::\n:::\n\n\n\\index{Summary statistics}\n\nNote that you are now also equipped with all tools to download price data for *each* symbol listed in the S&P 500 index with the same number of lines of code. Just use `symbol <- tq_index(\"SP500\")`, which provides you with a tibble that contains each symbol that is (currently) part of the S&P 500.\\index{Data!SP 500} However, don't try this if you are not prepared to wait for a couple of minutes because this is quite some data to download!\n\n## Other Forms of Data Aggregation\n\nOf course, aggregation across variables other than `symbol` can also make sense. For instance, suppose you are interested in answering the question: Are days with high aggregate trading volume likely followed by days with high aggregate trading volume? To provide some initial analysis on this question, we take the downloaded data and compute aggregate daily trading volume for all Dow Jones constituents in USD. Recall that the column `volume` is denoted in the number of traded shares.\\index{Trading volume} Thus, we multiply the trading volume with the daily closing price to get a proxy for the aggregate trading volume in USD. 
Scaling the axis labels by `1e-9` (R can handle scientific notation) expresses the daily trading volume in billion USD.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrading_volume <- index_prices |>\n group_by(date) |>\n summarize(trading_volume = sum(volume * adjusted))\n\ntrading_volume |>\n ggplot(aes(x = date, y = trading_volume)) +\n geom_line() +\n labs(\n x = NULL, y = NULL,\n title = \"Aggregate daily trading volume of DOW index constituents\"\n ) +\n scale_y_continuous(labels = unit_format(unit = \"B\", scale = 1e-9))\n```\n\n::: {.cell-output-display}\n![Total daily trading volume in billion USD.](introduction-to-tidy-finance_files/figure-html/fig-104-1.png){#fig-104 fig-alt='Title: Aggregate daily trading volume. The figure shows a volatile time series of daily trading volume, ranging from 15 in 2000 to 20.5 in 2022, with a maximum of more than 100.' width=2100}\n:::\n:::\n\n\n@fig-104 indicates a clear upward trend in aggregated daily trading volume. In particular, since the outbreak of the COVID-19 pandemic, markets have processed substantial trading volumes, as analyzed, for instance, by @Goldstein2021.\\index{Covid 19} One way to illustrate the persistence of trading volume would be to plot volume on day $t$ against volume on day $t-1$ as in the example below. In @fig-105, we add a dashed 45°-line to indicate a hypothetical one-to-one relation by `geom_abline()`, addressing potential differences in the axes' scales.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrading_volume |>\n ggplot(aes(x = lag(trading_volume), y = trading_volume)) +\n geom_point() +\n geom_abline(aes(intercept = 0, slope = 1),\n linetype = \"dashed\"\n ) +\n labs(\n x = \"Previous day aggregate trading volume\",\n y = \"Aggregate trading volume\",\n title = \"Persistence in daily trading volume of DOW index constituents\"\n ) + \n scale_x_continuous(labels = unit_format(unit = \"B\", scale = 1e-9)) +\n scale_y_continuous(labels = unit_format(unit = \"B\", scale = 1e-9))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Removed 1 rows containing missing values (`geom_point()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n![Total daily trading volume in billion USD.](introduction-to-tidy-finance_files/figure-html/fig-105-1.png){#fig-105 fig-alt='Title: Persistence in daily trading volume of DOW index constituents. The figure shows a scatterplot where aggregate trading volume and previous-day aggregate trading volume neatly line up along a 45-degree line.' width=2100}\n:::\n:::\n\n\nDo you understand where the warning `## Warning: Removed 1 rows containing missing values (geom_point).` comes from and what it means? Pure eyeballing reveals that days with high trading volume are often followed by similarly high trading volume days.\\index{Error message}\n\n## Portfolio Choice Problems\n\nIn the previous part, we show how to download stock market data and inspect it with graphs and summary statistics. 
Now, we move to a typical question in finance: how to allocate wealth across different assets optimally.\\index{Portfolio choice} The standard framework for optimal portfolio selection considers investors who prefer higher future returns but dislike future return volatility (defined as the square root of the return variance): the *mean-variance investor* [@Markowitz1952].\\index{Markowitz optimization}\n\n\\index{Efficient frontier} An essential tool to evaluate portfolios in the mean-variance context is the *efficient frontier*, the set of portfolios for which no other portfolio exists with a higher expected return at the same volatility (the square root of the variance, i.e., the risk); see, e.g., @Merton1972.\\index{Return volatility} We compute and visualize the efficient frontier for several stocks. First, we extract each asset's *monthly* returns. In order to keep things simple, we work with a balanced panel and exclude DOW constituents for which we do not observe a price on every single trading day since the year 2000.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex_prices <- index_prices |>\n group_by(symbol) |>\n mutate(n = n()) |>\n ungroup() |>\n filter(n == max(n)) |>\n select(-n)\nreturns <- index_prices |>\n mutate(month = floor_date(date, \"month\")) |>\n group_by(symbol, month) |>\n summarize(price = last(adjusted), .groups = \"drop_last\") |>\n mutate(ret = price / lag(price) - 1) |>\n drop_na(ret) |>\n select(-price)\n```\n:::\n\n\nHere, `floor_date()` is a function from the `lubridate` package [@lubridate], which provides useful functions to work with dates and times.\n\nNext, we transform the returns from a tidy tibble into a $(T \\times N)$ matrix with one column for each of the $N$ symbols and one row for each of the $T$ months to compute the sample average return vector $$\\hat\\mu = \\frac{1}{T}\\sum\\limits_{t=1}^T r_t$$ where $r_t$ is the $N$ vector of returns in month $t$ and the sample covariance matrix $$\\hat\\Sigma = \\frac{1}{T-1}\\sum\\limits_{t=1}^T (r_t - \\hat\\mu)(r_t - \\hat\\mu)'.$$ We achieve this by using `pivot_wider()` with the new column names from the column `symbol` and setting the values to `ret`. We compute the vector of sample average returns and the sample variance-covariance matrix, which we consider as proxies for the parameters of the distribution of future stock returns. Thus, for simplicity, we refer to $\\Sigma$ and $\\mu$ instead of explicitly highlighting that the sample moments are estimates. \\index{Covariance} In later chapters, we discuss the issues that arise once we take estimation uncertainty into account.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreturns_matrix <- returns |>\n pivot_wider(\n names_from = symbol,\n values_from = ret\n ) |>\n select(-month)\nsigma <- cov(returns_matrix)\nmu <- colMeans(returns_matrix)\n```\n:::\n\n\nThen, we compute the minimum variance portfolio weights $\\omega_\\text{mvp}$ as well as the expected portfolio return $\\omega_\\text{mvp}'\\mu$ and volatility $\\sqrt{\\omega_\\text{mvp}'\\Sigma\\omega_\\text{mvp}}$ of this portfolio. \\index{Minimum variance portfolio} Recall that the minimum variance portfolio is the vector of portfolio weights that are the solution to $$\\omega_\\text{mvp} = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\sum\\limits_{i=1}^N\\omega_i = 1.$$ The constraint that weights sum up to one simply implies that all funds are distributed across the available asset universe, i.e., there is no possibility to retain cash. 
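\n\nIn case you wonder where the closed-form solution stated next comes from, here is a brief sketch of the standard Lagrangian argument:\n$$\\mathcal{L}(\\omega, \\lambda) = \\omega'\\Sigma\\omega + \\lambda\\left(1 - \\iota'\\omega\\right) \\Rightarrow \\frac{\\partial\\mathcal{L}}{\\partial\\omega} = 2\\Sigma\\omega - \\lambda\\iota = 0 \\Rightarrow \\omega = \\frac{\\lambda}{2}\\Sigma^{-1}\\iota,$$\nand imposing the constraint $\\iota'\\omega = 1$ pins down the constant $\\frac{\\lambda}{2} = \\frac{1}{\\iota'\\Sigma^{-1}\\iota}$, which yields the expression in the next sentence.\n\n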
It is easy to show analytically that $\\omega_\\text{mvp} = \\frac{\\Sigma^{-1}\\iota}{\\iota'\\Sigma^{-1}\\iota}$, where $\\iota$ is a vector of ones and $\\Sigma^{-1}$ is the inverse of $\\Sigma$.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nN <- ncol(returns_matrix)\niota <- rep(1, N)\nsigma_inv <- solve(sigma)\nmvp_weights <- sigma_inv %*% iota\nmvp_weights <- mvp_weights / sum(mvp_weights)\ntibble(\n average_ret = as.numeric(t(mvp_weights) %*% mu),\n volatility = as.numeric(sqrt(t(mvp_weights) %*% sigma %*% mvp_weights))\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 1 × 2\n average_ret volatility\n \n1 0.00792 0.0321\n```\n\n\n:::\n:::\n\n\nThe command `solve(A, b)` returns the solution of a system of equations $Ax = b$. If `b` is not provided, as in the example above, it defaults to the identity matrix such that `solve(sigma)` delivers $\\Sigma^{-1}$ (if a unique solution exists).\\\nNote that the *monthly* volatility of the minimum variance portfolio is of the same order of magnitude as the *daily* standard deviation of the individual components. Thus, the diversification benefits in terms of risk reduction are tremendous!\\index{Diversification}\n\nNext, we set out to find the weights for a portfolio that achieves, as an example, three times the expected return of the minimum variance portfolio. However, mean-variance investors are not interested in any portfolio that achieves the required return but rather in the efficient portfolio, i.e., the portfolio with the lowest standard deviation. If you wonder where the solution $\\omega_\\text{eff}$ comes from: \\index{Efficient portfolio} The efficient portfolio is chosen by an investor who aims to achieve minimum variance *given a minimum acceptable expected return* $\\bar{\\mu}$. Hence, their objective function is to choose $\\omega_\\text{eff}$ as the solution to $$\\omega_\\text{eff}(\\bar{\\mu}) = \\arg\\min \\omega'\\Sigma \\omega \\text{ s.t. } \\omega'\\iota = 1 \\text{ and } \\omega'\\mu \\geq \\bar{\\mu}.$$\n\nThe code below implements the analytic solution to this optimization problem for a benchmark return $\\bar\\mu$, which we set to 3 times the expected return of the minimum variance portfolio. We encourage you to verify that it is correct.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmark_multiple <- 3\nmu_bar <- benchmark_multiple * t(mvp_weights) %*% mu\nC <- as.numeric(t(iota) %*% sigma_inv %*% iota)\nD <- as.numeric(t(iota) %*% sigma_inv %*% mu)\nE <- as.numeric(t(mu) %*% sigma_inv %*% mu)\nlambda_tilde <- as.numeric(2 * (mu_bar - D / C) / (E - D^2 / C))\nefp_weights <- mvp_weights +\n lambda_tilde / 2 * (sigma_inv %*% mu - D * mvp_weights)\n```\n:::\n\n\n## The Efficient Frontier\n\n\\index{Separation theorem} The mutual fund separation theorem states that as soon as we have two efficient portfolios (such as the minimum variance portfolio $\\omega_\\text{mvp}$ and the efficient portfolio for a higher required level of expected returns $\\omega_\\text{eff}(\\bar{\\mu})$, we can characterize the entire efficient frontier by combining these two portfolios. That is, any linear combination of the two portfolio weights will again represent an efficient portfolio. \\index{Efficient frontier} The code below implements the construction of the *efficient frontier*, which characterizes the highest expected return achievable at each level of risk. 
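\n\nWhile reading the loop, keep in mind that every candidate portfolio is an affine combination of the two portfolios we just computed,\n$$\\omega(a) = (1 - a)\\,\\omega_\\text{mvp} + a\\,\\omega_\\text{eff}(\\bar\\mu),$$\nso all the loop has to do is evaluate the annualized expected return and volatility of $\\omega(a)$ over a grid of values for $a$.\n\n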
To understand the code better, make sure to familiarize yourself with the inner workings of the `for` loop.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlength_year <- 12\na <- seq(from = -0.4, to = 1.9, by = 0.01)\nres <- tibble(\n a = a,\n mu = NA,\n sd = NA\n)\nfor (i in seq_along(a)) {\n w <- (1 - a[i]) * mvp_weights + (a[i]) * efp_weights\n res$mu[i] <- length_year * t(w) %*% mu \n res$sd[i] <- sqrt(length_year) * sqrt(t(w) %*% sigma %*% w)\n}\n```\n:::\n\n\nThe code above proceeds in two steps: First, we compute a vector of combination weights $a$ and then we evaluate the resulting linear combination with $a\\in\\mathbb{R}$:\\\n$$\\omega^* = a\\omega_\\text{eff}(\\bar\\mu) + (1-a)\\omega_\\text{mvp} = \\omega_\\text{mvp} + \\frac{\\lambda^*}{2}\\left(\\Sigma^{-1}\\mu -\\frac{D}{C}\\Sigma^{-1}\\iota \\right)$$ with $\\lambda^* = 2\\frac{a\\bar\\mu + (1-a)\\tilde\\mu - D/C}{E-D^2/C}$ where $C = \\iota'\\Sigma^{-1}\\iota$, $D=\\iota'\\Sigma^{-1}\\mu$, and $E=\\mu'\\Sigma^{-1}\\mu$. Finally, it is simple to visualize the efficient frontier alongside the two efficient portfolios within one powerful figure using `ggplot` (see @fig-106). We also add the individual stocks in the same call. We compute annualized returns based on the simple assumption that monthly returns are independent and identically distributed. Thus, the average annualized return is just 12 times the expected monthly return.\\index{Graph!Efficient frontier}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres |>\n ggplot(aes(x = sd, y = mu)) +\n geom_point() +\n geom_point(\n data = res |> filter(a %in% c(0, 1)),\n size = 4\n ) +\n geom_point(\n data = tibble(\n mu = length_year * mu, \n sd = sqrt(length_year) * sqrt(diag(sigma))\n ),\n aes(y = mu, x = sd), size = 1\n ) +\n labs(\n x = \"Annualized standard deviation\",\n y = \"Annualized expected return\",\n title = \"Efficient frontier for DOW index constituents\"\n ) +\n scale_x_continuous(labels = percent) +\n scale_y_continuous(labels = percent)\n```\n\n::: {.cell-output-display}\n![The big dots indicate the location of the minimum variance and the efficient portfolio that delivers 3 times the expected return of the minimum variance portfolio, respectively. The small dots indicate the location of the individual constituents.](introduction-to-tidy-finance_files/figure-html/fig-106-1.png){#fig-106 fig-alt='Title: Efficient frontier for DOW index constituents. The figure shows DOW index constituents in a mean-variance diagram. A hyperbola indicates the efficient frontier of portfolios that dominate the individual holdings in the sense that they deliver higher expected returns for the same level of volatility.' width=2100}\n:::\n:::\n\n\nThe line in @fig-106 indicates the efficient frontier: the set of portfolios a mean-variance efficient investor would choose from. Compare the performance relative to the individual assets (the dots) - it should become clear that diversifying yields massive performance gains (at least as long as we take the parameters $\\Sigma$ and $\\mu$ as given).\n\n## Exercises\n\n1. Download daily prices for another stock market symbol of your choice from Yahoo!Finance with `tq_get()` from the `tidyquant` package. Plot two time series of the ticker’s un-adjusted and adjusted closing prices. Explain the differences.\n1. Compute daily net returns for an asset of your choice and visualize the distribution of daily returns in a histogram using 100 bins. 
Also, use `geom_vline()` to add a dashed red vertical line that indicates the 5 percent quantile of the daily returns. Compute summary statistics (mean, standard deviation, minimum, and maximum) for the daily returns.\n1. Take your code from before and generalize it such that you can perform all the computations for an arbitrary vector of tickers (e.g., `ticker <- c(\"AAPL\", \"MMM\", \"BA\")`). Automate the download, the plot of the price time series, and create a table of return summary statistics for this arbitrary number of assets.\n1. Are days with high aggregate trading volume often also days with large absolute returns? Find an appropriate visualization to analyze the question using the ticker `AAPL`.\n1. Compute monthly returns from the downloaded stock market prices. Compute the vector of historical average returns and the sample variance-covariance matrix. Compute the minimum variance portfolio weights and the portfolio volatility and average returns. Visualize the mean-variance efficient frontier. Choose one of your assets and identify the portfolio which yields the same historical volatility but achieves the highest possible average return.\n1. In the portfolio choice analysis, we restricted our sample to all assets trading every day since 2000. How is such a decision a problem when you want to infer future expected portfolio performance from the results?\n1. The efficient frontier characterizes the portfolios with the highest expected return for different levels of risk. Identify the portfolio with the highest expected return per standard deviation. Which famous performance measure is close to the ratio of average returns to the standard deviation of returns?\n", "supporting": [ "introduction-to-tidy-finance_files" ], diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-100-1.png b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-100-1.png index 0ac1fcb7..47838342 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-100-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-100-1.png differ diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-101-1.png b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-101-1.png index f51c1495..6b77e32b 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-101-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-101-1.png differ diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-103-1.png b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-103-1.png index 554fa966..88f0b0c9 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-103-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-103-1.png differ diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-104-1.png b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-104-1.png index e35424b3..5931c6ff 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-104-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-104-1.png differ diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-105-1.png b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-105-1.png index 93bfd0c2..27a55bcc 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-105-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-105-1.png differ diff --git a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-106-1.png 
b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-106-1.png index c5fe952d..bbff2aaf 100644 Binary files a/_freeze/r/introduction-to-tidy-finance/figure-html/fig-106-1.png and b/_freeze/r/introduction-to-tidy-finance/figure-html/fig-106-1.png differ diff --git a/_freeze/r/parametric-portfolio-policies/execute-results/html.json b/_freeze/r/parametric-portfolio-policies/execute-results/html.json index 24315a3a..d0872af3 100644 --- a/_freeze/r/parametric-portfolio-policies/execute-results/html.json +++ b/_freeze/r/parametric-portfolio-policies/execute-results/html.json @@ -1,11 +1,9 @@ { - "hash": "70da5db0be11b30e9837c466c16ecf0b", + "hash": "3c0c52e64199495858258579baf547c7", "result": { "engine": "knitr", - "markdown": "---\ntitle: Parametric Portfolio Policies\naliases:\n - ../parametric-portfolio-policies.html\nmetadata:\n pagetitle: Parametric Portfolio Policies with R\n description-meta: Evaluate portfolio allocation strategies based on Brandt, Santa-Clara, and Valkanov (2009) using the programming language R.\n---\n\n\nIn this chapter, we apply different portfolio performance measures to evaluate and compare portfolio allocation strategies. \nFor this purpose, we introduce a direct way to estimate optimal portfolio weights for large-scale cross-sectional applications. More precisely, the approach of @Brandt2009 proposes to parametrize the optimal portfolio weights as a function of stock characteristics instead of estimating the stock's expected return, variance, and covariances with other stocks in a prior step. \nWe choose weights as a function of the characteristics, which maximize the expected utility of the investor. This approach is feasible for large portfolio dimensions (such as the entire CRSP universe) and has been proposed by @Brandt2009. See the review paper @Brandt2010 for an excellent treatment of related portfolio choice methods. \n\nThe current chapter relies on the following set of R packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nTo get started, we load the monthly CRSP file, which forms our investment universe. We load the data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(), \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, ret_excess, mktcap, mktcap_lag) |>\n collect()\n```\n:::\n\n\nTo evaluate the performance of portfolios, we further use monthly market returns as a benchmark to compute CAPM alphas.\\index{Data!Fama-French factors} \n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n```\n:::\n\n\nNext, we retrieve some stock characteristics that have been shown to have an effect on the expected returns or expected variances (or even higher moments) of the return distribution. \\index{Momentum} In particular, we record the lagged one-year return momentum (`momentum_lag`), defined as the compounded return between months $t-13$ and $t-2$ for each firm. In finance, momentum is the empirically observed tendency for rising asset prices to rise further, and falling prices to keep falling [@Jegadeesh1993]. 
\\index{Size!Size effect} The second characteristic is the firm's market equity (`size_lag`), defined as the log of the price per share times the number of shares outstanding [@Banz1981]. \nTo construct the correct lagged values, we use the approach introduced in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_lags <- crsp_monthly |>\n transmute(permno,\n month_13 = month %m+% months(13),\n mktcap\n )\n\ncrsp_monthly <- crsp_monthly |>\n inner_join(crsp_monthly_lags,\n by = c(\"permno\", \"month\" = \"month_13\"),\n suffix = c(\"\", \"_13\")\n )\n\ndata_portfolios <- crsp_monthly |>\n mutate(\n momentum_lag = mktcap_lag / mktcap_13,\n size_lag = log(mktcap_lag)\n ) |>\n drop_na(contains(\"lag\"))\n```\n:::\n\n\n## Parametric Portfolio Policies\n\nThe basic idea of parametric portfolio weights is as follows. Suppose that at each date $t$ we have $N_t$ stocks in the investment universe, where each stock $i$ has a return of $r_{i, t+1}$ and is associated with a vector of firm characteristics $x_{i, t}$ such as time-series momentum or the market capitalization. The investor's problem is to choose portfolio weights $w_{i,t}$ to maximize the expected utility of the portfolio return:\n$$\\begin{aligned}\n\\max_{\\omega} E_t\\left(u(r_{p, t+1})\\right) = E_t\\left[u\\left(\\sum\\limits_{i=1}^{N_t}\\omega_{i,t}r_{i,t+1}\\right)\\right]\n\\end{aligned}$$\nwhere $u(\\cdot)$ denotes the utility function.\n\nWhere do the stock characteristics show up? We parameterize the optimal portfolio weights as a function of the stock characteristic $x_{i,t}$ with the following linear specification for the portfolio weights: \n$$\\omega_{i,t} = \\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t},$$\nwhere $\\bar{\\omega}_{i,t}$ is a stock's weight in a benchmark portfolio (we use the value-weighted or naive portfolio in the application below), $\\theta$ is a vector of coefficients which we are going to estimate, and $\\hat{x}_{i,t}$ are the characteristics of stock $i$, cross-sectionally standardized to have zero mean and unit standard deviation. \n\nIntuitively, the portfolio strategy is a form of active portfolio management relative to a performance benchmark. Deviations from the benchmark portfolio are derived from the individual stock characteristics. Note that by construction the weights sum up to one as $\\sum_{i=1}^{N_t}\\hat{x}_{i,t} = 0$ due to the standardization. Moreover, the coefficients are constant across assets and over time. The implicit assumption is that the characteristics fully capture all aspects of the joint distribution of returns that are relevant for forming optimal portfolios. \n\nWe first implement cross-sectional standardization for the entire CRSP universe. We also keep track of (lagged) relative market capitalization `relative_mktcap`, which will represent the value-weighted benchmark portfolio, while `n` denotes the number of traded assets $N_t$, which we use to construct the naive portfolio benchmark.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_portfolios <- data_portfolios |>\n group_by(month) |>\n mutate(\n n = n(),\n relative_mktcap = mktcap_lag / sum(mktcap_lag),\n across(contains(\"lag\"), ~ (. - mean(.)) / sd(.)),\n ) |>\n ungroup() |>\n select(-mktcap_lag)\n```\n:::\n\n\n## Computing Portfolio Weights\n\nNext, we move on to identify optimal choices of $\\theta$. 
We rewrite the optimization problem together with the weight parametrization and can then estimate $\\theta$ to maximize the objective function based on our sample \n$$\\begin{aligned}\nE_t\\left(u(r_{p, t+1})\\right) = \\frac{1}{T}\\sum\\limits_{t=0}^{T-1}u\\left(\\sum\\limits_{i=1}^{N_t}\\left(\\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t}\\right)r_{i,t+1}\\right).\n\\end{aligned}$$\nThe allocation strategy is straightforward because the number of parameters to estimate is small. Instead of a tedious specification of the $N_t$ dimensional vector of expected returns and the $N_t(N_t+1)/2$ free elements of the covariance matrix, all we need to focus on in our application is the vector $\\theta$. $\\theta$ contains only two elements in our application - the relative deviation from the benchmark due to *size* and *momentum*. \n\nTo get a feeling for the performance of such an allocation strategy, we start with an arbitrary initial vector $\\theta_0$. The next step is to choose $\\theta$ optimally to maximize the objective function. We automatically detect the number of parameters by counting the number of columns with lagged values.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_parameters <- sum(str_detect(\n colnames(data_portfolios), \"lag\"\n))\n\ntheta <- rep(1.5, n_parameters)\n\nnames(theta) <- colnames(data_portfolios)[str_detect(\n colnames(data_portfolios), \"lag\"\n)]\n```\n:::\n\n\nThe function `compute_portfolio_weights()` below computes the portfolio weights $\\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t}$ according to our parametrization for a given value $\\theta_0$. Everything happens within a single pipeline. Hence, we provide a short walk-through.\n\nWe first compute `characteristic_tilt`, the tilting values $\\frac{1}{N_t}\\theta'\\hat{x}_{i, t}$ which resemble the deviation from the benchmark portfolio. Next, we compute the benchmark portfolio `weight_benchmark`, which can be any reasonable set of portfolio weights. In our case, we choose either the value or equal-weighted allocation. \n`weight_tilt` completes the picture and contains the final portfolio weights `weight_tilt = weight_benchmark + characteristic_tilt` which deviate from the benchmark portfolio depending on the stock characteristics.\n\nThe final few lines go a bit further and implement a simple version of a no-short sale constraint. While it is generally not straightforward to ensure portfolio weight constraints via parameterization, we simply normalize the portfolio weights such that they are enforced to be positive. Finally, we make sure that the normalized weights sum up to one again:\n$$\\omega_{i,t}^+ = \\frac{\\max(0, \\omega_{i,t})}{\\sum_{j=1}^{N_t}\\max(0, \\omega_{i,t})}.$$\n\n\nThe following function computes the optimal portfolio weights in the way just described. \n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_portfolio_weights <- function(theta,\n data,\n value_weighting = TRUE,\n allow_short_selling = TRUE) {\n data |>\n group_by(month) |>\n bind_cols(\n characteristic_tilt = data |>\n transmute(across(contains(\"lag\"), ~ . 
/ n)) |>\n as.matrix() %*% theta |> as.numeric()\n ) |>\n mutate(\n # Definition of benchmark weight\n weight_benchmark = case_when(\n value_weighting == TRUE ~ relative_mktcap,\n value_weighting == FALSE ~ 1 / n\n ),\n # Parametric portfolio weights\n weight_tilt = weight_benchmark + characteristic_tilt,\n # Short-sell constraint\n weight_tilt = case_when(\n allow_short_selling == TRUE ~ weight_tilt,\n allow_short_selling == FALSE ~ pmax(0, weight_tilt)\n ),\n # Weights sum up to 1\n weight_tilt = weight_tilt / sum(weight_tilt)\n ) |>\n ungroup()\n}\n```\n:::\n\n\nIn the next step, we compute the portfolio weights for the arbitrary vector $\\theta_0$. In the example below, we use the value-weighted portfolio as a benchmark and allow negative portfolio weights.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweights_crsp <- compute_portfolio_weights(\n theta,\n data_portfolios,\n value_weighting = TRUE,\n allow_short_selling = TRUE\n)\n```\n:::\n\n\n## Portfolio Performance\n\n\\index{Performance evaluation}\nAre the computed weights optimal in any way? Most likely not, as we picked $\\theta_0$ arbitrarily. To evaluate the performance of an allocation strategy, one can think of many different approaches. In their original paper, @Brandt2009 focus on a simple evaluation of the hypothetical utility of an agent equipped with a power utility function $u_\\gamma(r) = \\frac{(1 + r)^{(1-\\gamma)}}{1-\\gamma}$, where $\\gamma$ is the risk aversion factor.\\index{Power utility}\n\n\n::: {.cell}\n\n```{.r .cell-code}\npower_utility <- function(r, gamma = 5) {\n (1 + r)^(1 - gamma) / (1 - gamma)\n}\n```\n:::\n\n\nWe want to note that @Gehrig2020 warn that, in the leading case of constant relative risk aversion (CRRA), strong assumptions on the properties of the returns, the variables used to implement the parametric portfolio policy, and the parameter space are necessary to obtain a well-defined optimization problem.\n\nNo doubt, there are many other ways to evaluate a portfolio. The function below provides a summary of all kinds of interesting measures that can be considered relevant. Do we need all these evaluation measures? It depends: the original paper @Brandt2009 only cares about the expected utility to choose $\\theta$. 
However, if you want to choose optimal values that achieve the highest performance while putting some constraints on your portfolio weights, it is helpful to have everything in one function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_portfolio <- function(weights_crsp,\n capm_evaluation = TRUE,\n full_evaluation = TRUE,\n length_year = 12) {\n \n evaluation <- weights_crsp |>\n group_by(month) |>\n summarize(\n tilt = weighted.mean(ret_excess, weight_tilt),\n benchmark = weighted.mean(ret_excess, weight_benchmark)\n ) |>\n pivot_longer(-month,\n values_to = \"portfolio_return\",\n names_to = \"model\"\n ) \n \n evaluation_stats <- evaluation |>\n group_by(model) |>\n left_join(factors_ff3_monthly, by = \"month\") |>\n summarize(tibble(\n \"Expected utility\" = mean(power_utility(portfolio_return)),\n \"Average return\" = 100 * mean(length_year * portfolio_return),\n \"SD return\" = 100 * sqrt(length_year) * sd(portfolio_return),\n \"Sharpe ratio\" = sqrt(length_year) * mean(portfolio_return) / sd(portfolio_return),\n\n )) |>\n mutate(model = str_remove(model, \"return_\")) \n \n if (capm_evaluation) {\n evaluation_capm <- evaluation |> \n left_join(factors_ff3_monthly, by = \"month\") |>\n group_by(model) |>\n summarize(\n \"CAPM alpha\" = coefficients(lm(portfolio_return ~ mkt_excess))[1],\n \"Market beta\" = coefficients(lm(portfolio_return ~ mkt_excess))[2]\n )\n \n evaluation_stats <- evaluation_stats |> \n left_join(evaluation_capm, by = \"model\")\n }\n\n if (full_evaluation) {\n evaluation_weights <- weights_crsp |>\n select(month, contains(\"weight\")) |>\n pivot_longer(-month, values_to = \"weight\", names_to = \"model\") |>\n group_by(model, month) |>\n mutate(\n \"Absolute weight\" = abs(weight),\n \"Max. weight\" = max(weight),\n \"Min. weight\" = min(weight),\n \"Avg. sum of negative weights\" = -sum(weight[weight < 0]),\n \"Avg. fraction of negative weights\" = sum(weight < 0) / n(),\n .keep = \"none\"\n ) |>\n group_by(model) |>\n summarize(across(-month, ~ 100 * mean(.))) |>\n mutate(model = str_remove(model, \"weight_\")) \n \n evaluation_stats <- evaluation_stats |> \n left_join(evaluation_weights, by = \"model\")\n }\n \n evaluation_output <- evaluation_stats |> \n pivot_longer(cols = -model, names_to = \"measure\") |> \n pivot_wider(names_from = model)\n \n return(evaluation_output)\n}\n```\n:::\n\n\n\\index{Sharpe Ratio}\nLet us take a look at the different portfolio strategies and evaluation measures.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_portfolio(weights_crsp) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 11 × 3\n measure benchmark tilt\n \n 1 Expected utility -0.250 -0.261 \n 2 Average return 6.64 0.259 \n 3 SD return 15.5 21.1 \n 4 Sharpe ratio 0.430 0.0122 \n 5 CAPM alpha 0.000138 -0.00496\n 6 Market beta 0.992 0.952 \n 7 Absolute weight 0.0249 0.0638 \n 8 Max. weight 3.59 3.72 \n 9 Min. weight 0.0000274 -0.144 \n10 Avg. sum of negative weights 0 78.0 \n11 Avg. fraction of negative weights 0 49.5 \n```\n\n\n:::\n:::\n\n\nThe value-weighted portfolio delivers an annualized return of more than 6 percent and clearly outperforms the tilted portfolio, irrespective of whether we evaluate expected utility, the Sharpe ratio, or the CAPM alpha. We can conclude the market beta is close to one for both strategies (naturally almost identically 1 for the value-weighted benchmark portfolio). 
When it comes to the distribution of the portfolio weights, we see that the benchmark portfolio weight takes less extreme positions (lower average absolute weights and lower maximum weight). By definition, the value-weighted benchmark does not take any negative positions, while the tilted portfolio also takes short positions.\n\n## Optimal Parameter Choice\n\nNext, we move to a choice of $\\theta$ that actually aims to improve some (or all) of the performance measures. We first define a helper function `compute_objective_function()`, which we then pass to an optimizer.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_objective_function <- function(theta,\n data,\n objective_measure = \"Expected utility\",\n value_weighting = TRUE,\n allow_short_selling = TRUE) {\n processed_data <- compute_portfolio_weights(\n theta,\n data,\n value_weighting,\n allow_short_selling\n )\n\n objective_function <- evaluate_portfolio(\n processed_data,\n capm_evaluation = FALSE,\n full_evaluation = FALSE\n ) |>\n filter(measure == objective_measure) |>\n pull(tilt)\n\n return(-objective_function)\n}\n```\n:::\n\n\nYou may wonder why we return the negative value of the objective function. This is simply due to the common convention for optimization procedures to search for minima as a default. By minimizing the negative value of the objective function, we get the maximum value as a result.\nIn its most basic form, R optimization relies on the function `optim()`. As main inputs, the function requires an initial guess of the parameters and the objective function to minimize. Now, we are fully equipped to compute the optimal values of $\\hat\\theta$, which maximize the hypothetical expected utility of the investor. \n\n\n::: {.cell}\n\n```{.r .cell-code}\noptimal_theta <- optim(\n par = theta,\n fn = compute_objective_function,\n objective_measure = \"Expected utility\",\n data = data_portfolios,\n value_weighting = TRUE,\n allow_short_selling = TRUE,\n method = \"Nelder-Mead\"\n)\n\noptimal_theta$par\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nmomentum_lag size_lag \n 0.361 -1.820 \n```\n\n\n:::\n:::\n\n\nThe resulting values of $\\hat\\theta$ are easy to interpret: intuitively, expected utility increases by tilting weights from the value-weighted portfolio toward smaller stocks (negative coefficient for size) and toward past winners (positive value for momentum). Both findings are in line with the well-documented size effect [@Banz1981] and the momentum anomaly [@Jegadeesh1993].\n\n## More Model Specifications\n\nHow does the portfolio perform for different model specifications? For this purpose, we compute the performance of a number of different modeling choices based on the entire CRSP sample. 
The next code chunk performs all the heavy lifting.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_optimal_performance <- function(data, \n objective_measure,\n value_weighting, \n allow_short_selling) {\n optimal_theta <- optim(\n par = theta,\n fn = compute_objective_function,\n data = data,\n objective_measure = \"Expected utility\",\n value_weighting = TRUE,\n allow_short_selling = TRUE,\n method = \"Nelder-Mead\"\n )\n\n processed_data = compute_portfolio_weights(\n optimal_theta$par, \n data,\n value_weighting,\n allow_short_selling\n )\n \n portfolio_evaluation = evaluate_portfolio(\n processed_data,\n capm_evaluation = TRUE,\n full_evaluation = TRUE\n )\n \n return(portfolio_evaluation) \n}\n\nspecifications <- expand_grid(\n data = list(data_portfolios),\n objective_measure = \"Expected utility\",\n value_weighting = c(TRUE, FALSE),\n allow_short_selling = c(TRUE, FALSE)\n) |> \n mutate(\n portfolio_evaluation = pmap(\n .l = list(data, objective_measure, value_weighting, allow_short_selling),\n .f = evaluate_optimal_performance\n )\n)\n```\n:::\n\n\nFinally, we can compare the results. The table below shows summary statistics for all possible combinations: equal- or value-weighted benchmark portfolio, with or without short-selling constraints, and tilted toward maximizing expected utility. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nperformance_table <- specifications |>\n select(\n value_weighting,\n allow_short_selling,\n portfolio_evaluation\n ) |>\n unnest(portfolio_evaluation)\n\nperformance_table |>\n rename(\n \" \" = benchmark,\n Optimal = tilt\n ) |>\n mutate(\n value_weighting = case_when(\n value_weighting == TRUE ~ \"VW\",\n value_weighting == FALSE ~ \"EW\"\n ),\n allow_short_selling = case_when(\n allow_short_selling == TRUE ~ \"\",\n allow_short_selling == FALSE ~ \"(no s.)\"\n )\n ) |>\n pivot_wider(\n names_from = value_weighting:allow_short_selling,\n values_from = \" \":Optimal,\n names_glue = \"{value_weighting} {allow_short_selling} {.value} \"\n ) |>\n select(\n measure,\n `EW `,\n `VW `,\n sort(contains(\"Optimal\"))\n ) |>\n print(n = 11)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 11 × 7\n measure `EW ` `VW ` `VW Optimal ` `VW (no s.) Optimal `\n \n 1 Expected u… -0.251 -2.50e-1 -0.247 -0.248 \n 2 Average re… 9.99 6.64e+0 13.3 12.3 \n 3 SD return 20.4 1.55e+1 19.9 19.1 \n 4 Sharpe rat… 0.491 4.30e-1 0.670 0.641 \n 5 CAPM alpha 0.00222 1.38e-4 0.00564 0.00462\n 6 Market beta 1.12 9.92e-1 1.00 1.03 \n 7 Absolute w… 0.0249 2.49e-2 0.0360 0.0249 \n 8 Max. weight 0.0249 3.59e+0 3.42 2.80 \n 9 Min. weight 0.0249 2.74e-5 -0.0324 0 \n10 Avg. sum o… 0 0 23.1 0 \n11 Avg. fract… 0 0 37.5 0 \n# ℹ 2 more variables: `EW Optimal ` ,\n# `EW (no s.) Optimal ` \n```\n\n\n:::\n:::\n\n\nThe results indicate that the average annualized Sharpe ratio of the equal-weighted portfolio exceeds the Sharpe ratio of the value-weighted benchmark portfolio. Nevertheless, starting with the weighted value portfolio as a benchmark and tilting optimally with respect to momentum and small stocks yields the highest Sharpe ratio across all specifications. Finally, imposing no short-sale constraints does not improve the performance of the portfolios in our application.\n\n## Exercises\n\n1. How do the estimated parameters $\\hat\\theta$ and the portfolio performance change if your objective is to maximize the Sharpe ratio instead of the hypothetical expected utility?\n1. The code above is very flexible in the sense that you can easily add new firm characteristics. 
Construct a new characteristic of your choice and evaluate the corresponding coefficient $\\hat\\theta_i$. \n1. Tweak the function `optimal_theta()` such that you can impose additional performance constraints in order to determine $\\hat\\theta$, which maximizes expected utility under the constraint that the market beta is below 1.\n1. Does the portfolio performance resemble a realistic out-of-sample backtesting procedure? Verify the robustness of the results by first estimating $\\hat\\theta$ based on *past data* only. Then, use more recent periods to evaluate the actual portfolio performance. \n1. By formulating the portfolio problem as a statistical estimation problem, you can easily obtain standard errors for the coefficients of the weight function. @Brandt2009 provide the relevant derivations in their paper in Equation (10). Implement a small function that computes standard errors for $\\hat\\theta$.", - "supporting": [ - "parametric-portfolio-policies_files" - ], + "markdown": "---\ntitle: Parametric Portfolio Policies\naliases:\n - ../parametric-portfolio-policies.html\nmetadata:\n pagetitle: Parametric Portfolio Policies with R\n description-meta: Evaluate portfolio allocation strategies based on Brandt, Santa-Clara, and Valkanov (2009) using the programming language R.\n---\n\n\nIn this chapter, we apply different portfolio performance measures to evaluate and compare portfolio allocation strategies. \nFor this purpose, we introduce a direct way to estimate optimal portfolio weights for large-scale cross-sectional applications. More precisely, the approach of @Brandt2009 proposes to parametrize the optimal portfolio weights as a function of stock characteristics instead of estimating the stock's expected return, variance, and covariances with other stocks in a prior step. \nWe choose weights as a function of the characteristics, which maximize the expected utility of the investor. This approach is feasible for large portfolio dimensions (such as the entire CRSP universe) and has been proposed by @Brandt2009. See the review paper @Brandt2010 for an excellent treatment of related portfolio choice methods. \n\nThe current chapter relies on the following set of R packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nTo get started, we load the monthly CRSP file, which forms our investment universe. We load the data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(), \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, ret_excess, mktcap, mktcap_lag) |>\n collect()\n```\n:::\n\n\nTo evaluate the performance of portfolios, we further use monthly market returns as a benchmark to compute CAPM alphas.\\index{Data!Fama-French factors} \n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n```\n:::\n\n\nNext, we retrieve some stock characteristics that have been shown to have an effect on the expected returns or expected variances (or even higher moments) of the return distribution. 
\\index{Momentum} In particular, we record the lagged one-year return momentum (`momentum_lag`), defined as the compounded return between months $t-13$ and $t-2$ for each firm. In finance, momentum is the empirically observed tendency for rising asset prices to rise further, and falling prices to keep falling [@Jegadeesh1993]. \\index{Size!Size effect} The second characteristic is the firm's market equity (`size_lag`), defined as the log of the price per share times the number of shares outstanding [@Banz1981]. \nTo construct the correct lagged values, we use the approach introduced in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_lags <- crsp_monthly |>\n transmute(permno,\n month_13 = month %m+% months(13),\n mktcap\n )\n\ncrsp_monthly <- crsp_monthly |>\n inner_join(crsp_monthly_lags,\n join_by(permno, month == month_13),\n suffix = c(\"\", \"_13\")\n )\n\ndata_portfolios <- crsp_monthly |>\n mutate(\n momentum_lag = mktcap_lag / mktcap_13,\n size_lag = log(mktcap_lag)\n ) |>\n drop_na(contains(\"lag\"))\n```\n:::\n\n\n## Parametric Portfolio Policies\n\nThe basic idea of parametric portfolio weights is as follows. Suppose that at each date $t$ we have $N_t$ stocks in the investment universe, where each stock $i$ has a return of $r_{i, t+1}$ and is associated with a vector of firm characteristics $x_{i, t}$ such as time-series momentum or the market capitalization. The investor's problem is to choose portfolio weights $w_{i,t}$ to maximize the expected utility of the portfolio return:\n$$\\begin{aligned}\n\\max_{\\omega} E_t\\left(u(r_{p, t+1})\\right) = E_t\\left[u\\left(\\sum\\limits_{i=1}^{N_t}\\omega_{i,t}r_{i,t+1}\\right)\\right]\n\\end{aligned}$$\nwhere $u(\\cdot)$ denotes the utility function.\n\nWhere do the stock characteristics show up? We parameterize the optimal portfolio weights as a function of the stock characteristic $x_{i,t}$ with the following linear specification for the portfolio weights: \n$$\\omega_{i,t} = \\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t},$$\nwhere $\\bar{\\omega}_{i,t}$ is a stock's weight in a benchmark portfolio (we use the value-weighted or naive portfolio in the application below), $\\theta$ is a vector of coefficients which we are going to estimate, and $\\hat{x}_{i,t}$ are the characteristics of stock $i$, cross-sectionally standardized to have zero mean and unit standard deviation. \n\nIntuitively, the portfolio strategy is a form of active portfolio management relative to a performance benchmark. Deviations from the benchmark portfolio are derived from the individual stock characteristics. Note that by construction the weights sum up to one as $\\sum_{i=1}^{N_t}\\hat{x}_{i,t} = 0$ due to the standardization. Moreover, the coefficients are constant across assets and over time. The implicit assumption is that the characteristics fully capture all aspects of the joint distribution of returns that are relevant for forming optimal portfolios. \n\nWe first implement cross-sectional standardization for the entire CRSP universe. 
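\n\nConcretely, within each month $t$ we transform every characteristic $x_{i,t}$ into\n$$\\hat{x}_{i,t} = \\frac{x_{i,t} - \\frac{1}{N_t}\\sum_{j=1}^{N_t} x_{j,t}}{\\sigma_t(x)},$$\nwhere $\\sigma_t(x)$ denotes the cross-sectional standard deviation of the characteristic in month $t$; this is what the `across(contains(\"lag\"), ~ (. - mean(.)) / sd(.))` line in the code below implements.\n\n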
We also keep track of (lagged) relative market capitalization `relative_mktcap`, which will represent the value-weighted benchmark portfolio, while `n` denotes the number of traded assets $N_t$, which we use to construct the naive portfolio benchmark.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata_portfolios <- data_portfolios |>\n group_by(month) |>\n mutate(\n n = n(),\n relative_mktcap = mktcap_lag / sum(mktcap_lag),\n across(contains(\"lag\"), ~ (. - mean(.)) / sd(.)),\n ) |>\n ungroup() |>\n select(-mktcap_lag)\n```\n:::\n\n\n## Computing Portfolio Weights\n\nNext, we move on to identify optimal choices of $\\theta$. We rewrite the optimization problem together with the weight parametrization and can then estimate $\\theta$ to maximize the objective function based on our sample \n$$\\begin{aligned}\nE_t\\left(u(r_{p, t+1})\\right) = \\frac{1}{T}\\sum\\limits_{t=0}^{T-1}u\\left(\\sum\\limits_{i=1}^{N_t}\\left(\\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t}\\right)r_{i,t+1}\\right).\n\\end{aligned}$$\nThe allocation strategy is straightforward because the number of parameters to estimate is small. Instead of a tedious specification of the $N_t$ dimensional vector of expected returns and the $N_t(N_t+1)/2$ free elements of the covariance matrix, all we need to focus on in our application is the vector $\\theta$. $\\theta$ contains only two elements in our application - the relative deviation from the benchmark due to *size* and *momentum*. \n\nTo get a feeling for the performance of such an allocation strategy, we start with an arbitrary initial vector $\\theta_0$. The next step is to choose $\\theta$ optimally to maximize the objective function. We automatically detect the number of parameters by counting the number of columns with lagged values.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn_parameters <- sum(str_detect(\n colnames(data_portfolios), \"lag\"\n))\n\ntheta <- rep(1.5, n_parameters)\n\nnames(theta) <- colnames(data_portfolios)[str_detect(\n colnames(data_portfolios), \"lag\"\n)]\n```\n:::\n\n\nThe function `compute_portfolio_weights()` below computes the portfolio weights $\\bar{\\omega}_{i,t} + \\frac{1}{N_t}\\theta'\\hat{x}_{i,t}$ according to our parametrization for a given value $\\theta_0$. Everything happens within a single pipeline. Hence, we provide a short walk-through.\n\nWe first compute `characteristic_tilt`, the tilting values $\\frac{1}{N_t}\\theta'\\hat{x}_{i, t}$ which resemble the deviation from the benchmark portfolio. Next, we compute the benchmark portfolio `weight_benchmark`, which can be any reasonable set of portfolio weights. In our case, we choose either the value or equal-weighted allocation. \n`weight_tilt` completes the picture and contains the final portfolio weights `weight_tilt = weight_benchmark + characteristic_tilt` which deviate from the benchmark portfolio depending on the stock characteristics.\n\nThe final few lines go a bit further and implement a simple version of a no-short sale constraint. While it is generally not straightforward to ensure portfolio weight constraints via parameterization, we simply normalize the portfolio weights such that they are enforced to be positive. Finally, we make sure that the normalized weights sum up to one again:\n$$\\omega_{i,t}^+ = \\frac{\\max(0, \\omega_{i,t})}{\\sum_{j=1}^{N_t}\\max(0, \\omega_{i,t})}.$$\n\n\nThe following function computes the optimal portfolio weights in the way just described. 
\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_portfolio_weights <- function(theta,\n data,\n value_weighting = TRUE,\n allow_short_selling = TRUE) {\n data |>\n group_by(month) |>\n bind_cols(\n characteristic_tilt = data |>\n transmute(across(contains(\"lag\"), ~ . / n)) |>\n as.matrix() %*% theta |> as.numeric()\n ) |>\n mutate(\n # Definition of benchmark weight\n weight_benchmark = case_when(\n value_weighting == TRUE ~ relative_mktcap,\n value_weighting == FALSE ~ 1 / n\n ),\n # Parametric portfolio weights\n weight_tilt = weight_benchmark + characteristic_tilt,\n # Short-sell constraint\n weight_tilt = case_when(\n allow_short_selling == TRUE ~ weight_tilt,\n allow_short_selling == FALSE ~ pmax(0, weight_tilt)\n ),\n # Weights sum up to 1\n weight_tilt = weight_tilt / sum(weight_tilt)\n ) |>\n ungroup()\n}\n```\n:::\n\n\nIn the next step, we compute the portfolio weights for the arbitrary vector $\\theta_0$. In the example below, we use the value-weighted portfolio as a benchmark and allow negative portfolio weights.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nweights_crsp <- compute_portfolio_weights(\n theta,\n data_portfolios,\n value_weighting = TRUE,\n allow_short_selling = TRUE\n)\n```\n:::\n\n\n## Portfolio Performance\n\n\\index{Performance evaluation}\nAre the computed weights optimal in any way? Most likely not, as we picked $\\theta_0$ arbitrarily. To evaluate the performance of an allocation strategy, one can think of many different approaches. In their original paper, @Brandt2009 focus on a simple evaluation of the hypothetical utility of an agent equipped with a power utility function $u_\\gamma(r) = \\frac{(1 + r)^{(1-\\gamma)}}{1-\\gamma}$, where $\\gamma$ is the risk aversion factor.\\index{Power utility}\n\n\n::: {.cell}\n\n```{.r .cell-code}\npower_utility <- function(r, gamma = 5) {\n (1 + r)^(1 - gamma) / (1 - gamma)\n}\n```\n:::\n\n\nWe want to note that @Gehrig2020 warn that, in the leading case of constant relative risk aversion (CRRA), strong assumptions on the properties of the returns, the variables used to implement the parametric portfolio policy, and the parameter space are necessary to obtain a well-defined optimization problem.\n\nNo doubt, there are many other ways to evaluate a portfolio. The function below provides a summary of all kinds of interesting measures that can be considered relevant. Do we need all these evaluation measures? It depends: the original paper @Brandt2009 only cares about the expected utility to choose $\\theta$. 
However, if you want to choose optimal values that achieve the highest performance while putting some constraints on your portfolio weights, it is helpful to have everything in one function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_portfolio <- function(weights_crsp,\n capm_evaluation = TRUE,\n full_evaluation = TRUE,\n length_year = 12) {\n \n evaluation <- weights_crsp |>\n group_by(month) |>\n summarize(\n tilt = weighted.mean(ret_excess, weight_tilt),\n benchmark = weighted.mean(ret_excess, weight_benchmark)\n ) |>\n pivot_longer(-month,\n values_to = \"portfolio_return\",\n names_to = \"model\"\n ) \n \n evaluation_stats <- evaluation |>\n group_by(model) |>\n left_join(factors_ff3_monthly, \n join_by(month)) |>\n summarize(tibble(\n \"Expected utility\" = mean(power_utility(portfolio_return)),\n \"Average return\" = 100 * mean(length_year * portfolio_return),\n \"SD return\" = 100 * sqrt(length_year) * sd(portfolio_return),\n \"Sharpe ratio\" = sqrt(length_year) * mean(portfolio_return) / sd(portfolio_return),\n\n )) |>\n mutate(model = str_remove(model, \"return_\")) \n \n if (capm_evaluation) {\n evaluation_capm <- evaluation |> \n left_join(factors_ff3_monthly, \n join_by(month)) |>\n group_by(model) |>\n summarize(\n \"CAPM alpha\" = coefficients(lm(portfolio_return ~ mkt_excess))[1],\n \"Market beta\" = coefficients(lm(portfolio_return ~ mkt_excess))[2]\n )\n \n evaluation_stats <- evaluation_stats |> \n left_join(evaluation_capm, join_by(model))\n }\n\n if (full_evaluation) {\n evaluation_weights <- weights_crsp |>\n select(month, contains(\"weight\")) |>\n pivot_longer(-month, values_to = \"weight\", names_to = \"model\") |>\n group_by(model, month) |>\n mutate(\n \"Absolute weight\" = abs(weight),\n \"Max. weight\" = max(weight),\n \"Min. weight\" = min(weight),\n \"Avg. sum of negative weights\" = -sum(weight[weight < 0]),\n \"Avg. fraction of negative weights\" = sum(weight < 0) / n(),\n .keep = \"none\"\n ) |>\n group_by(model) |>\n summarize(across(-month, ~ 100 * mean(.))) |>\n mutate(model = str_remove(model, \"weight_\")) \n \n evaluation_stats <- evaluation_stats |> \n left_join(evaluation_weights, join_by(model))\n }\n \n evaluation_output <- evaluation_stats |> \n pivot_longer(cols = -model, names_to = \"measure\") |> \n pivot_wider(names_from = model)\n \n return(evaluation_output)\n}\n```\n:::\n\n\n\\index{Sharpe Ratio}\nLet us take a look at the different portfolio strategies and evaluation measures.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_portfolio(weights_crsp) |>\n print(n = Inf)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 11 × 3\n measure benchmark tilt\n \n 1 Expected utility -0.250 -0.261 \n 2 Average return 6.64 0.259 \n 3 SD return 15.5 21.1 \n 4 Sharpe ratio 0.430 0.0122 \n 5 CAPM alpha 0.000138 -0.00496\n 6 Market beta 0.992 0.952 \n 7 Absolute weight 0.0249 0.0638 \n 8 Max. weight 3.59 3.72 \n 9 Min. weight 0.0000274 -0.144 \n10 Avg. sum of negative weights 0 78.0 \n11 Avg. fraction of negative weights 0 49.5 \n```\n\n\n:::\n:::\n\n\nThe value-weighted portfolio delivers an annualized return of more than 6 percent and clearly outperforms the tilted portfolio, irrespective of whether we evaluate expected utility, the Sharpe ratio, or the CAPM alpha. We can conclude the market beta is close to one for both strategies (naturally almost identically 1 for the value-weighted benchmark portfolio). 
When it comes to the distribution of the portfolio weights, we see that the benchmark portfolio weight takes less extreme positions (lower average absolute weights and lower maximum weight). By definition, the value-weighted benchmark does not take any negative positions, while the tilted portfolio also takes short positions.\n\n## Optimal Parameter Choice\n\nNext, we move to a choice of $\\theta$ that actually aims to improve some (or all) of the performance measures. We first define a helper function `compute_objective_function()`, which we then pass to an optimizer.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute_objective_function <- function(theta,\n data,\n objective_measure = \"Expected utility\",\n value_weighting = TRUE,\n allow_short_selling = TRUE) {\n processed_data <- compute_portfolio_weights(\n theta,\n data,\n value_weighting,\n allow_short_selling\n )\n\n objective_function <- evaluate_portfolio(\n processed_data,\n capm_evaluation = FALSE,\n full_evaluation = FALSE\n ) |>\n filter(measure == objective_measure) |>\n pull(tilt)\n\n return(-objective_function)\n}\n```\n:::\n\n\nYou may wonder why we return the negative value of the objective function. This is simply due to the common convention for optimization procedures to search for minima as a default. By minimizing the negative value of the objective function, we get the maximum value as a result.\nIn its most basic form, R optimization relies on the function `optim()`. As main inputs, the function requires an initial guess of the parameters and the objective function to minimize. Now, we are fully equipped to compute the optimal values of $\\hat\\theta$, which maximize the hypothetical expected utility of the investor. \n\n\n::: {.cell}\n\n```{.r .cell-code}\noptimal_theta <- optim(\n par = theta,\n fn = compute_objective_function,\n objective_measure = \"Expected utility\",\n data = data_portfolios,\n value_weighting = TRUE,\n allow_short_selling = TRUE,\n method = \"Nelder-Mead\"\n)\n\noptimal_theta$par\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nmomentum_lag size_lag \n 0.361 -1.820 \n```\n\n\n:::\n:::\n\n\nThe resulting values of $\\hat\\theta$ are easy to interpret: intuitively, expected utility increases by tilting weights from the value-weighted portfolio toward smaller stocks (negative coefficient for size) and toward past winners (positive value for momentum). Both findings are in line with the well-documented size effect [@Banz1981] and the momentum anomaly [@Jegadeesh1993].\n\n## More Model Specifications\n\nHow does the portfolio perform for different model specifications? For this purpose, we compute the performance of a number of different modeling choices based on the entire CRSP sample. 
The next code chunk performs all the heavy lifting.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nevaluate_optimal_performance <- function(data, \n objective_measure,\n value_weighting, \n allow_short_selling) {\n optimal_theta <- optim(\n par = theta,\n fn = compute_objective_function,\n data = data,\n objective_measure = \"Expected utility\",\n value_weighting = TRUE,\n allow_short_selling = TRUE,\n method = \"Nelder-Mead\"\n )\n\n processed_data = compute_portfolio_weights(\n optimal_theta$par, \n data,\n value_weighting,\n allow_short_selling\n )\n \n portfolio_evaluation = evaluate_portfolio(\n processed_data,\n capm_evaluation = TRUE,\n full_evaluation = TRUE\n )\n \n return(portfolio_evaluation) \n}\n\nspecifications <- expand_grid(\n data = list(data_portfolios),\n objective_measure = \"Expected utility\",\n value_weighting = c(TRUE, FALSE),\n allow_short_selling = c(TRUE, FALSE)\n) |> \n mutate(\n portfolio_evaluation = pmap(\n .l = list(data, objective_measure, value_weighting, allow_short_selling),\n .f = evaluate_optimal_performance\n )\n)\n```\n:::\n\n\nFinally, we can compare the results. The table below shows summary statistics for all possible combinations: equal- or value-weighted benchmark portfolio, with or without short-selling constraints, and tilted toward maximizing expected utility. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nperformance_table <- specifications |>\n select(\n value_weighting,\n allow_short_selling,\n portfolio_evaluation\n ) |>\n unnest(portfolio_evaluation)\n\nperformance_table |>\n rename(\n \" \" = benchmark,\n Optimal = tilt\n ) |>\n mutate(\n value_weighting = case_when(\n value_weighting == TRUE ~ \"VW\",\n value_weighting == FALSE ~ \"EW\"\n ),\n allow_short_selling = case_when(\n allow_short_selling == TRUE ~ \"\",\n allow_short_selling == FALSE ~ \"(no s.)\"\n )\n ) |>\n pivot_wider(\n names_from = value_weighting:allow_short_selling,\n values_from = \" \":Optimal,\n names_glue = \"{value_weighting} {allow_short_selling} {.value} \"\n ) |>\n select(\n measure,\n `EW `,\n `VW `,\n sort(contains(\"Optimal\"))\n ) |>\n print(n = 11)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 11 × 7\n measure `EW ` `VW ` `VW Optimal ` `VW (no s.) Optimal `\n \n 1 Expected u… -0.251 -2.50e-1 -0.247 -0.248 \n 2 Average re… 9.99 6.64e+0 13.3 12.3 \n 3 SD return 20.4 1.55e+1 19.9 19.1 \n 4 Sharpe rat… 0.491 4.30e-1 0.670 0.641 \n 5 CAPM alpha 0.00222 1.38e-4 0.00564 0.00462\n 6 Market beta 1.12 9.92e-1 1.00 1.03 \n 7 Absolute w… 0.0249 2.49e-2 0.0360 0.0249 \n 8 Max. weight 0.0249 3.59e+0 3.42 2.80 \n 9 Min. weight 0.0249 2.74e-5 -0.0324 0 \n10 Avg. sum o… 0 0 23.1 0 \n11 Avg. fract… 0 0 37.5 0 \n# ℹ 2 more variables: `EW Optimal ` ,\n# `EW (no s.) Optimal ` \n```\n\n\n:::\n:::\n\n\nThe results indicate that the average annualized Sharpe ratio of the equal-weighted portfolio exceeds the Sharpe ratio of the value-weighted benchmark portfolio. Nevertheless, starting with the weighted value portfolio as a benchmark and tilting optimally with respect to momentum and small stocks yields the highest Sharpe ratio across all specifications. Finally, imposing no short-sale constraints does not improve the performance of the portfolios in our application.\n\n## Exercises\n\n1. How do the estimated parameters $\\hat\\theta$ and the portfolio performance change if your objective is to maximize the Sharpe ratio instead of the hypothetical expected utility?\n1. The code above is very flexible in the sense that you can easily add new firm characteristics. 
Construct a new characteristic of your choice and evaluate the corresponding coefficient $\\hat\\theta_i$. \n1. Tweak the function `optimal_theta()` such that you can impose additional performance constraints in order to determine $\\hat\\theta$, which maximizes expected utility under the constraint that the market beta is below 1.\n1. Does the portfolio performance resemble a realistic out-of-sample backtesting procedure? Verify the robustness of the results by first estimating $\\hat\\theta$ based on *past data* only. Then, use more recent periods to evaluate the actual portfolio performance. \n1. By formulating the portfolio problem as a statistical estimation problem, you can easily obtain standard errors for the coefficients of the weight function. @Brandt2009 provide the relevant derivations in their paper in Equation (10). Implement a small function that computes standard errors for $\\hat\\theta$.", + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/r/replicating-fama-and-french-factors/execute-results/html.json b/_freeze/r/replicating-fama-and-french-factors/execute-results/html.json index 85b02eff..921fcba4 100644 --- a/_freeze/r/replicating-fama-and-french-factors/execute-results/html.json +++ b/_freeze/r/replicating-fama-and-french-factors/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "0dc3115bcf5adcc95bfb3e71a6e4e5e6", + "hash": "a902f8812d8309d205a7198613b759c9", "result": { "engine": "knitr", - "markdown": "---\ntitle: Replicating Fama and French Factors\naliases:\n - ../replicating-fama-and-french-factors.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Replicating Fama and French Factors with R\n description-meta: Use the programming language R to replicate the famous Fama-French three and five factor asset pricing models.\n---\n\n\nIn this chapter, we provide a replication of the famous Fama and French factor portfolios. The Fama and French factor models are a cornerstone of empirical asset pricing [see @Fama1992 and @FamaFrench2015]. On top of the market factor represented by the traditional CAPM beta, the three factor model includes the size and value factors to explain the cross section of returns. Its successor, the five factor model, additionally includes profitability and investment as explanatory factors. \n\nWe start with the three factor model. We already introduced the size and value factors in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd), and their definition remains the same: size is the SMB factor (small-minus-big) that is long small firms and short large firms. The value factor is HML (high-minus-low) and is long in high book-to-market firms and short in low book-to-market counterparts. \n\nAfter the replication of the three factor model, we move to the five factors by constructing the profitability factor RMW (robust-minus-weak) as the difference between the returns of firms with high and low operating profitability and the investment factor CMA (conservative-minus-aggressive) as the difference between firms with high versus low investment rates. \n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nWe use CRSP and Compustat as data sources, as we need the same variables to compute the factors in the way Fama and French do it. 
Hence, there is nothing new below and we only load data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap, mktcap_lag, exchange\n ) |>\n collect()\n\ncompustat <- tbl(tidy_finance, \"compustat\") |>\n select(gvkey, datadate, be, op, inv) |>\n collect() \n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, smb, hml) |>\n collect()\n\nfactors_ff5_monthly <- tbl(tidy_finance, \"factors_ff5_monthly\") |>\n select(month, smb, hml, rmw, cma) |>\n collect()\n```\n:::\n\n\nYet when we start merging our data set for computing the premiums, there are a few differences to [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd). First, Fama and French form their portfolios in June of year $t$, whereby the returns of July are the first monthly return for the respective portfolio. For firm size, they consequently use the market capitalization recorded for June. It is then held constant until June of year $t+1$.\n\nSecond, Fama and French also have a different protocol for computing the book-to-market ratio.\\index{Book-to-market ratio} They use market equity as of the end of year $t - 1$ and the book equity reported in year $t-1$, i.e., the `datadate` is within the last year.\\index{Book equity} Hence, the book-to-market ratio can be based on accounting information that is up to 18 months old. Market equity also does not necessarily reflect the same time point as book equity. The other sorting variables are analogously to book equity taken from year $t-1$.\n\nTo implement all these time lags, we again employ the temporary `sorting_date`-column. Notice that when we combine the information, we want to have a single observation per year and stock since we are only interested in computing the breakpoints held constant for the entire year. We ensure this by a call of `distinct()` at the end of the chunk below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsize <- crsp_monthly |>\n filter(month(month) == 6) |>\n mutate(sorting_date = month %m+% months(1)) |>\n select(permno, exchange, sorting_date, size = mktcap)\n\nmarket_equity <- crsp_monthly |>\n filter(month(month) == 12) |>\n mutate(sorting_date = ymd(str_c(year(month) + 1, \"0701)\"))) |>\n select(permno, gvkey, sorting_date, me = mktcap)\n\nbook_to_market <- compustat |>\n mutate(sorting_date = ymd(str_c(year(datadate) + 1, \"0701\"))) |>\n select(gvkey, sorting_date, be) |>\n inner_join(market_equity, by = c(\"gvkey\", \"sorting_date\")) |>\n mutate(bm = be / me) |>\n select(permno, sorting_date, me, bm)\n\nsorting_variables <- size |>\n inner_join(\n book_to_market, by = c(\"permno\", \"sorting_date\")\n ) |>\n drop_na() |>\n distinct(permno, sorting_date, .keep_all = TRUE)\n```\n:::\n\n\n## Portfolio Sorts\n\nNext, we construct our portfolios with an adjusted `assign_portfolio()` function.\\index{Portfolio sorts} Fama and French rely on NYSE-specific breakpoints, they form two portfolios in the size dimension at the median and three portfolios in the dimension of each other sorting variable at the 30%- and 70%-percentiles, and they use dependent sorts. 
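To get a sense of how restrictive the NYSE filter is for the breakpoint computation, we can, for instance, count the stocks per exchange in the June cross-sections (a sketch; the exact counts depend on your CRSP sample).\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: number of stock-month observations per exchange, pooled across all\n# June cross-sections that enter the breakpoint computation\ncrsp_monthly |>\n  filter(month(month) == 6) |>\n  count(exchange)\n```\n:::\n\n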
The sorts for book-to-market require an adjustment to the function in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd) because the `seq()` we would produce does not produce the right breakpoints. Instead of `n_portfolios`, we now specify `percentiles`, which take the breakpoint-sequence as an object specified in the function's call. Specifically, we give `percentiles = c(0, 0.3, 0.7, 1)` to the function. Additionally, we perform an `inner_join()` with our return data to ensure that we only use traded stocks when computing the breakpoints as a first step.\\index{Breakpoints}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n percentiles) {\n breakpoints <- data |>\n filter(exchange == \"NYSE\") |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = percentiles,\n na.rm = TRUE,\n names = FALSE\n )\n\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n return(assigned_portfolios)\n}\n\nportfolios <- sorting_variables |>\n group_by(sorting_date) |>\n mutate(\n portfolio_size = assign_portfolio(\n data = pick(everything()),\n sorting_variable = size,\n percentiles = c(0, 0.5, 1)\n ),\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = bm,\n percentiles = c(0, 0.3, 0.7, 1)\n )\n ) |>\n ungroup() |> \n select(permno, sorting_date, \n portfolio_size, portfolio_bm)\n```\n:::\n\n\nNext, we merge the portfolios to the return data for the rest of the year. To implement this step, we create a new column `sorting_date` in our return data by setting the date to sort on to July of $t-1$ if the month is June (of year $t$) or earlier or to July of year $t$ if the month is July or later.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios <- crsp_monthly |>\n mutate(sorting_date = case_when(\n month(month) <= 6 ~ ymd(str_c(year(month) - 1, \"0701\")),\n month(month) >= 7 ~ ymd(str_c(year(month), \"0701\"))\n )) |>\n inner_join(portfolios, by = c(\"permno\", \"sorting_date\"))\n```\n:::\n\n\n## Fama and French Three Factor Model\n\nEquipped with the return data and the assigned portfolios, we can now compute the value-weighted average return for each of the six portfolios. Then, we form the Fama and French factors. For the size factor (i.e., SMB), we go long in the three small portfolios and short the three large portfolios by taking an average across either group. For the value factor (i.e., HML), we go long in the two high book-to-market portfolios and short the two low book-to-market portfolios, again weighting them equally.\\index{Factor!Size}\\index{Factor!Value}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_replicated <- portfolios |>\n group_by(portfolio_size, portfolio_bm, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), .groups = \"drop\"\n ) |>\n group_by(month) |>\n summarize(\n smb_replicated = mean(ret[portfolio_size == 1]) -\n mean(ret[portfolio_size == 2]),\n hml_replicated = mean(ret[portfolio_bm == 3]) -\n mean(ret[portfolio_bm == 1])\n )\n```\n:::\n\n\n## Replication Evaluation\n\nIn the previous section, we replicated the size and value premiums following the procedure outlined by Fama and French.\\index{Size!Size premium}\\index{Value premium} The final question is then: how close did we get? We answer this question by looking at the two time-series estimates in a regression analysis using `lm()`. 
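Before we run the regressions, simple time-series correlations between the original factors and our replicated series already provide a quick impression (a sketch; the regression evidence below is more informative).\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: correlations between the original Fama-French factors and our\n# replicated versions\nfactors_ff3_monthly |>\n  inner_join(factors_replicated, by = \"month\") |>\n  summarize(\n    cor_smb = cor(smb, smb_replicated),\n    cor_hml = cor(hml, hml_replicated)\n  )\n```\n:::\n\n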
If we did a good job, then we should see a non-significant intercept (rejecting the notion of systematic error), a coefficient close to 1 (indicating a high correlation), and an adjusted R-squared close to 1 (indicating a high proportion of explained variance).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest <- factors_ff3_monthly |>\n inner_join(factors_replicated, by = \"month\") |>\n mutate(\n across(c(smb_replicated, hml_replicated), ~round(., 4))\n )\n```\n:::\n\n\nTo test the success of the SMB factor, we hence run the following regression:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_smb <- lm(smb ~ smb_replicated, data = test)\nsummary(model_smb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = smb ~ smb_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.020294 -0.001541 -0.000055 0.001489 0.015482 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.000130 0.000131 -1 0.32 \nsmb_replicated 0.993003 0.004335 229 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00355 on 736 degrees of freedom\nMultiple R-squared: 0.986,\tAdjusted R-squared: 0.986 \nF-statistic: 5.25e+04 on 1 and 736 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe results for the SMB factor are really convincing as all three criteria outlined above are met and the coefficient is 0.99 and the R-squared is at 99%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_hml <- lm(hml ~ hml_replicated, data = test)\nsummary(model_hml)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = hml ~ hml_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.02324 -0.00287 -0.00010 0.00230 0.03407 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000291 0.000219 1.33 0.18 \nhml_replicated 0.963431 0.007280 132.35 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00591 on 736 degrees of freedom\nMultiple R-squared: 0.96,\tAdjusted R-squared: 0.96 \nF-statistic: 1.75e+04 on 1 and 736 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe replication of the HML factor is also a success, although at a slightly lower coefficient of 0.96 and an R-squared around 96%. \n\nThe evidence hence allows us to conclude that we did a relatively good job in replicating the original Fama-French size and value premiums, although we do not know their underlying code. From our perspective, a perfect match is only possible with additional information from the maintainers of the original data.\n\n## Fama and French Five Factor Model\n\nNow, let us move to the replication of the five factor model. We extend the `other_sorting_variables` table from above with the additional characteristics operating profitability `op` and investment `inv`. Note that the `drop_na()` statement yields different sample sizes as some firms with `be` values might not have `op` or `inv` values. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nother_sorting_variables <- compustat |>\n mutate(sorting_date = ymd(str_c(year(datadate) + 1, \"0701\"))) |>\n select(gvkey, sorting_date, be, op, inv) |>\n inner_join(market_equity, by = c(\"gvkey\", \"sorting_date\")) |>\n mutate(bm = be / me) |>\n select(permno, sorting_date, me, be, bm, op, inv)\n\nsorting_variables <- size |>\n inner_join(\n other_sorting_variables, by = c(\"permno\", \"sorting_date\")\n ) |>\n drop_na() |>\n distinct(permno, sorting_date, .keep_all = TRUE)\n```\n:::\n\n\nIn each month, we independently sort all stocks into the two size portfolios. The value, profitability, and investment portfolios, on the other hand, are the results of dependent sorts based on the size portfolios. We then merge the portfolios to the return data for the rest of the year just as above. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios <- sorting_variables |>\n group_by(sorting_date) |>\n mutate(\n portfolio_size = assign_portfolio(\n data = pick(everything()),\n sorting_variable = size,\n percentiles = c(0, 0.5, 1)\n )) |> \n group_by(sorting_date, portfolio_size) |> \n mutate(\n across(c(bm, op, inv), ~assign_portfolio(\n data = pick(everything()), \n sorting_variable = ., \n percentiles = c(0, 0.3, 0.7, 1)),\n .names = \"portfolio_{.col}\"\n )\n ) |>\n ungroup() |> \n select(permno, sorting_date, \n portfolio_size, portfolio_bm,\n portfolio_op, portfolio_inv)\n\nportfolios <- crsp_monthly |>\n mutate(sorting_date = case_when(\n month(month) <= 6 ~ ymd(str_c(year(month) - 1, \"0701\")),\n month(month) >= 7 ~ ymd(str_c(year(month), \"0701\"))\n )) |>\n inner_join(portfolios, by = c(\"permno\", \"sorting_date\"))\n```\n:::\n\n\nNow, we want to construct each of the factors, but this time the size factor actually comes last because it is the result of averaging across all other factor portfolios. This dependency is the reason why we keep the table with value-weighted portfolio returns as a separate object that we reuse later. 
We construct the value factor, HML, as above by going long the two portfolios with high book-to-market ratios and shorting the two portfolios with low book-to-market.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_value <- portfolios |>\n group_by(portfolio_size, portfolio_bm, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n )\n\nfactors_value <- portfolios_value |>\n group_by(month) |>\n summarize(\n hml_replicated = mean(ret[portfolio_bm == 3]) -\n mean(ret[portfolio_bm == 1])\n )\n```\n:::\n\n\nFor the profitability factor, RMW, we take a long position in the two high profitability portfolios and a short position in the two low profitability portfolios.\\index{Factor!Profitability}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_profitability <- portfolios |>\n group_by(portfolio_size, portfolio_op, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n ) \n\nfactors_profitability <- portfolios_profitability |>\n group_by(month) |>\n summarize(\n rmw_replicated = mean(ret[portfolio_op == 3]) -\n mean(ret[portfolio_op == 1])\n )\n```\n:::\n\n\nFor the investment factor, CMA, we go long the two low investment portfolios and short the two high investment portfolios.\\index{Factor!Investment}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_investment <- portfolios |>\n group_by(portfolio_size, portfolio_inv, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n )\n\nfactors_investment <- portfolios_investment |>\n group_by(month) |>\n summarize(\n cma_replicated = mean(ret[portfolio_inv == 1]) -\n mean(ret[portfolio_inv == 3])\n )\n```\n:::\n\n\nFinally, the size factor, SMB, is constructed by going long the six small portfolios and short the six large portfolios. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_size <- bind_rows(\n portfolios_value,\n portfolios_profitability,\n portfolios_investment\n) |> \n group_by(month) |>\n summarize(\n smb_replicated = mean(ret[portfolio_size == 1]) -\n mean(ret[portfolio_size == 2])\n )\n```\n:::\n\n\nWe then join all factors together into one data frame and construct again a suitable table to run tests for evaluating our replication.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_replicated <- factors_size |>\n full_join(\n factors_value, by = \"month\"\n ) |>\n full_join(\n factors_profitability, by = \"month\"\n ) |>\n full_join(\n factors_investment, by = \"month\"\n )\n\ntest <- factors_ff5_monthly |>\n inner_join(factors_replicated, by = \"month\") |>\n mutate(\n across(c(smb_replicated, hml_replicated, \n rmw_replicated, cma_replicated), ~round(., 4))\n )\n```\n:::\n\n\nLet us start the replication evaluation again with the size factor:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_smb <- lm(smb ~ smb_replicated, data = test)\nsummary(model_smb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = smb ~ smb_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.018146 -0.001865 0.000181 0.001980 0.014417 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.000203 0.000136 -1.5 0.14 \nsmb_replicated 0.969574 0.004375 221.6 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.00361 on 712 degrees of freedom\nMultiple R-squared: 0.986,\tAdjusted R-squared: 0.986 \nF-statistic: 4.91e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe results for the SMB factor are quite convincing as all three criteria outlined above are met and the coefficient is 0.97 and the R-squared is at 99%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_hml <- lm(hml ~ hml_replicated, data = test)\nsummary(model_hml)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = hml ~ hml_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.04464 -0.00414 -0.00035 0.00410 0.03667 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000473 0.000298 1.59 0.11 \nhml_replicated 0.991570 0.010271 96.54 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00792 on 712 degrees of freedom\nMultiple R-squared: 0.929,\tAdjusted R-squared: 0.929 \nF-statistic: 9.32e+03 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe replication of the HML factor is also a success, although at a slightly higher coefficient of 0.99 and an R-squared around 93%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_rmw <- lm(rmw ~ rmw_replicated, data = test)\nsummary(model_rmw)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = rmw ~ rmw_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.019810 -0.003051 0.000054 0.003280 0.018801 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 5.59e-05 2.02e-04 0.28 0.78 \nrmw_replicated 9.55e-01 8.88e-03 107.52 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00536 on 712 degrees of freedom\nMultiple R-squared: 0.942,\tAdjusted R-squared: 0.942 \nF-statistic: 1.16e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nWe are also able to replicate the RMW factor quite well with a coefficient of 0.95 and an R-squared around 94%.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_cma <- lm(cma ~ cma_replicated, data = test)\nsummary(model_cma)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = cma ~ cma_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.015071 -0.002720 -0.000225 0.002331 0.021404 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000673 0.000171 3.93 9.4e-05 ***\ncma_replicated 0.965219 0.008195 117.78 < 2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00455 on 712 degrees of freedom\nMultiple R-squared: 0.951,\tAdjusted R-squared: 0.951 \nF-statistic: 1.39e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nFinally, the CMA factor also replicates well with a coefficient of 0.97 and an R-squared around 95%.\n\nOverall, our approach seems to replicate the Fama-French three and five factor models just as well as the three factors. \n\n## Exercises\n\n1. @Fama1993 claim that their sample excludes firms until they have appeared in Compustat for two years. Implement this additional filter and compare the improvements of your replication effort. \n2. On his homepage, [Kenneth French](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/variable_definitions.html) provides instructions on how to construct the most common variables used for portfolio sorts. 
Try to replicate the univariate portfolio sort return time series for `E/P` (earnings / price) provided on his homepage and evaluate your replication effort using regressions.\n", + "markdown": "---\ntitle: Replicating Fama and French Factors\naliases:\n - ../replicating-fama-and-french-factors.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Replicating Fama and French Factors with R\n description-meta: Use the programming language R to replicate the famous Fama-French three and five factor asset pricing models.\n---\n\n\nIn this chapter, we provide a replication of the famous Fama and French factor portfolios. The Fama and French factor models are a cornerstone of empirical asset pricing [see @Fama1992 and @FamaFrench2015]. On top of the market factor represented by the traditional CAPM beta, the three factor model includes the size and value factors to explain the cross section of returns. Its successor, the five factor model, additionally includes profitability and investment as explanatory factors. \n\nWe start with the three factor model. We already introduced the size and value factors in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd), and their definition remains the same: size is the SMB factor (small-minus-big) that is long small firms and short large firms. The value factor is HML (high-minus-low) and is long in high book-to-market firms and short in low book-to-market counterparts. \n\nAfter the replication of the three factor model, we move to the five factors by constructing the profitability factor RMW (robust-minus-weak) as the difference between the returns of firms with high and low operating profitability and the investment factor CMA (conservative-minus-aggressive) as the difference between firms with high versus low investment rates. \n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nWe use CRSP and Compustat as data sources, as we need the same variables to compute the factors in the way Fama and French do it. Hence, there is nothing new below and we only load data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd).\\index{Data!CRSP}\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap, mktcap_lag, exchange\n ) |>\n collect()\n\ncompustat <- tbl(tidy_finance, \"compustat\") |>\n select(gvkey, datadate, be, op, inv) |>\n collect() \n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, smb, hml) |>\n collect()\n\nfactors_ff5_monthly <- tbl(tidy_finance, \"factors_ff5_monthly\") |>\n select(month, smb, hml, rmw, cma) |>\n collect()\n```\n:::\n\n\nYet when we start merging our data set for computing the premiums, there are a few differences to [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd). First, Fama and French form their portfolios in June of year $t$, whereby the returns of July are the first monthly return for the respective portfolio. For firm size, they consequently use the market capitalization recorded for June. 
It is then held constant until June of year $t+1$.\n\nSecond, Fama and French also have a different protocol for computing the book-to-market ratio.\\index{Book-to-market ratio} They use market equity as of the end of year $t - 1$ and the book equity reported in year $t-1$, i.e., the `datadate` is within the last year.\\index{Book equity} Hence, the book-to-market ratio can be based on accounting information that is up to 18 months old. Market equity also does not necessarily reflect the same time point as book equity. The other sorting variables are analogously to book equity taken from year $t-1$.\n\nTo implement all these time lags, we again employ the temporary `sorting_date`-column. Notice that when we combine the information, we want to have a single observation per year and stock since we are only interested in computing the breakpoints held constant for the entire year. We ensure this by a call of `distinct()` at the end of the chunk below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsize <- crsp_monthly |>\n filter(month(month) == 6) |>\n mutate(sorting_date = month %m+% months(1)) |>\n select(permno, exchange, sorting_date, size = mktcap)\n\nmarket_equity <- crsp_monthly |>\n filter(month(month) == 12) |>\n mutate(sorting_date = ymd(str_c(year(month) + 1, \"0701)\"))) |>\n select(permno, gvkey, sorting_date, me = mktcap)\n\nbook_to_market <- compustat |>\n mutate(sorting_date = ymd(str_c(year(datadate) + 1, \"0701\"))) |>\n select(gvkey, sorting_date, be) |>\n inner_join(market_equity, join_by(gvkey, sorting_date)) |>\n mutate(bm = be / me) |>\n select(permno, sorting_date, me, bm)\n\nsorting_variables <- size |>\n inner_join(\n book_to_market, join_by(permno, sorting_date)\n ) |>\n drop_na() |>\n distinct(permno, sorting_date, .keep_all = TRUE)\n```\n:::\n\n\n## Portfolio Sorts\n\nNext, we construct our portfolios with an adjusted `assign_portfolio()` function.\\index{Portfolio sorts} Fama and French rely on NYSE-specific breakpoints, they form two portfolios in the size dimension at the median and three portfolios in the dimension of each other sorting variable at the 30%- and 70%-percentiles, and they use dependent sorts. The sorts for book-to-market require an adjustment to the function in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd) because the `seq()` we would produce does not produce the right breakpoints. Instead of `n_portfolios`, we now specify `percentiles`, which take the breakpoint-sequence as an object specified in the function's call. Specifically, we give `percentiles = c(0, 0.3, 0.7, 1)` to the function. 
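The mechanics behind `percentiles` are easy to see in isolation; the following toy sketch uses made-up book-to-market values and is not part of the replication itself.\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy sketch: quantile() computes the 30% and 70% breakpoints, findInterval()\n# then maps each observation into portfolio 1, 2, or 3\nbm_example <- c(0.2, 0.5, 0.8, 1.1, 1.9, 3.0)\n\nbreakpoints_example <- quantile(\n  bm_example,\n  probs = c(0, 0.3, 0.7, 1),\n  names = FALSE\n)\n\nfindInterval(bm_example, breakpoints_example, all.inside = TRUE)\n```\n:::\n\n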
Additionally, we perform an `inner_join()` with our return data to ensure that we only use traded stocks when computing the breakpoints as a first step.\\index{Breakpoints}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n percentiles) {\n breakpoints <- data |>\n filter(exchange == \"NYSE\") |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = percentiles,\n na.rm = TRUE,\n names = FALSE\n )\n\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n return(assigned_portfolios)\n}\n\nportfolios <- sorting_variables |>\n group_by(sorting_date) |>\n mutate(\n portfolio_size = assign_portfolio(\n data = pick(everything()),\n sorting_variable = size,\n percentiles = c(0, 0.5, 1)\n ),\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = bm,\n percentiles = c(0, 0.3, 0.7, 1)\n )\n ) |>\n ungroup() |> \n select(permno, sorting_date, \n portfolio_size, portfolio_bm)\n```\n:::\n\n\nNext, we merge the portfolios to the return data for the rest of the year. To implement this step, we create a new column `sorting_date` in our return data by setting the date to sort on to July of $t-1$ if the month is June (of year $t$) or earlier or to July of year $t$ if the month is July or later.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios <- crsp_monthly |>\n mutate(sorting_date = case_when(\n month(month) <= 6 ~ ymd(str_c(year(month) - 1, \"0701\")),\n month(month) >= 7 ~ ymd(str_c(year(month), \"0701\"))\n )) |>\n inner_join(portfolios, join_by(permno, sorting_date))\n```\n:::\n\n\n## Fama and French Three Factor Model\n\nEquipped with the return data and the assigned portfolios, we can now compute the value-weighted average return for each of the six portfolios. Then, we form the Fama and French factors. For the size factor (i.e., SMB), we go long in the three small portfolios and short the three large portfolios by taking an average across either group. For the value factor (i.e., HML), we go long in the two high book-to-market portfolios and short the two low book-to-market portfolios, again weighting them equally.\\index{Factor!Size}\\index{Factor!Value}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_replicated <- portfolios |>\n group_by(portfolio_size, portfolio_bm, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), .groups = \"drop\"\n ) |>\n group_by(month) |>\n summarize(\n smb_replicated = mean(ret[portfolio_size == 1]) -\n mean(ret[portfolio_size == 2]),\n hml_replicated = mean(ret[portfolio_bm == 3]) -\n mean(ret[portfolio_bm == 1])\n )\n```\n:::\n\n\n## Replication Evaluation\n\nIn the previous section, we replicated the size and value premiums following the procedure outlined by Fama and French.\\index{Size!Size premium}\\index{Value premium} The final question is then: how close did we get? We answer this question by looking at the two time-series estimates in a regression analysis using `lm()`. 
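In regression form, we estimate for each factor (here for SMB)\n\n$$\\text{smb}_t = \\alpha + \\beta \\, \\widehat{\\text{smb}}_t + \\varepsilon_t,$$\n\nwhere $\\widehat{\\text{smb}}_t$ denotes our replicated series.\n\n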
If we did a good job, then we should see a non-significant intercept (rejecting the notion of systematic error), a coefficient close to 1 (indicating a high correlation), and an adjusted R-squared close to 1 (indicating a high proportion of explained variance).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest <- factors_ff3_monthly |>\n inner_join(factors_replicated, join_by(month)) |>\n mutate(\n across(c(smb_replicated, hml_replicated), ~round(., 4))\n )\n```\n:::\n\n\nTo test the success of the SMB factor, we hence run the following regression:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_smb <- lm(smb ~ smb_replicated, data = test)\nsummary(model_smb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = smb ~ smb_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.020294 -0.001541 -0.000055 0.001489 0.015482 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.000130 0.000131 -1 0.32 \nsmb_replicated 0.993003 0.004335 229 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00355 on 736 degrees of freedom\nMultiple R-squared: 0.986,\tAdjusted R-squared: 0.986 \nF-statistic: 5.25e+04 on 1 and 736 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe results for the SMB factor are really convincing as all three criteria outlined above are met and the coefficient is 0.99 and the R-squared is at 99%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_hml <- lm(hml ~ hml_replicated, data = test)\nsummary(model_hml)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = hml ~ hml_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.02324 -0.00287 -0.00010 0.00230 0.03407 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000291 0.000219 1.33 0.18 \nhml_replicated 0.963431 0.007280 132.35 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00591 on 736 degrees of freedom\nMultiple R-squared: 0.96,\tAdjusted R-squared: 0.96 \nF-statistic: 1.75e+04 on 1 and 736 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe replication of the HML factor is also a success, although at a slightly lower coefficient of 0.96 and an R-squared around 96%. \n\nThe evidence hence allows us to conclude that we did a relatively good job in replicating the original Fama-French size and value premiums, although we do not know their underlying code. From our perspective, a perfect match is only possible with additional information from the maintainers of the original data.\n\n## Fama and French Five Factor Model\n\nNow, let us move to the replication of the five factor model. We extend the `other_sorting_variables` table from above with the additional characteristics operating profitability `op` and investment `inv`. Note that the `drop_na()` statement yields different sample sizes as some firms with `be` values might not have `op` or `inv` values. 
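The missing values originate in Compustat, so a quick tally of firm-years with book equity but missing `op` or `inv` gives a sense of the magnitude (a sketch; the exact counts depend on your data vintage).\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: Compustat firm-years with book equity but missing operating\n# profitability or investment\ncompustat |>\n  summarize(\n    n_be = sum(!is.na(be)),\n    n_missing_op = sum(!is.na(be) & is.na(op)),\n    n_missing_inv = sum(!is.na(be) & is.na(inv))\n  )\n```\n:::\n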
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nother_sorting_variables <- compustat |>\n mutate(sorting_date = ymd(str_c(year(datadate) + 1, \"0701\"))) |>\n select(gvkey, sorting_date, be, op, inv) |>\n inner_join(market_equity, \n join_by(gvkey, sorting_date)) |>\n mutate(bm = be / me) |>\n select(permno, sorting_date, me, be, bm, op, inv)\n\nsorting_variables <- size |>\n inner_join(\n other_sorting_variables, \n join_by(permno, sorting_date)\n ) |>\n drop_na() |>\n distinct(permno, sorting_date, .keep_all = TRUE)\n```\n:::\n\n\nIn each month, we independently sort all stocks into the two size portfolios. The value, profitability, and investment portfolios, on the other hand, are the results of dependent sorts based on the size portfolios. We then merge the portfolios to the return data for the rest of the year just as above. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios <- sorting_variables |>\n group_by(sorting_date) |>\n mutate(\n portfolio_size = assign_portfolio(\n data = pick(everything()),\n sorting_variable = size,\n percentiles = c(0, 0.5, 1)\n )) |> \n group_by(sorting_date, portfolio_size) |> \n mutate(\n across(c(bm, op, inv), ~assign_portfolio(\n data = pick(everything()), \n sorting_variable = ., \n percentiles = c(0, 0.3, 0.7, 1)),\n .names = \"portfolio_{.col}\"\n )\n ) |>\n ungroup() |> \n select(permno, sorting_date, \n portfolio_size, portfolio_bm,\n portfolio_op, portfolio_inv)\n\nportfolios <- crsp_monthly |>\n mutate(sorting_date = case_when(\n month(month) <= 6 ~ ymd(str_c(year(month) - 1, \"0701\")),\n month(month) >= 7 ~ ymd(str_c(year(month), \"0701\"))\n )) |>\n inner_join(portfolios, join_by(permno, sorting_date))\n```\n:::\n\n\nNow, we want to construct each of the factors, but this time the size factor actually comes last because it is the result of averaging across all other factor portfolios. This dependency is the reason why we keep the table with value-weighted portfolio returns as a separate object that we reuse later. 
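Written out, the constructions implemented in the following chunks correspond to\n\n$$\\text{HML} = \\frac{1}{2}\\left(\\text{Small.High} + \\text{Big.High}\\right) - \\frac{1}{2}\\left(\\text{Small.Low} + \\text{Big.Low}\\right),$$\n\nwith RMW and CMA defined analogously (CMA is long the low-investment portfolios), and, once all three 2x3 sorts are available,\n\n$$\\text{SMB} = \\frac{1}{3}\\left(\\text{SMB}_{B/M} + \\text{SMB}_{OP} + \\text{SMB}_{INV}\\right).$$\n\n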
We construct the value factor, HML, as above by going long the two portfolios with high book-to-market ratios and shorting the two portfolios with low book-to-market.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_value <- portfolios |>\n group_by(portfolio_size, portfolio_bm, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n )\n\nfactors_value <- portfolios_value |>\n group_by(month) |>\n summarize(\n hml_replicated = mean(ret[portfolio_bm == 3]) -\n mean(ret[portfolio_bm == 1])\n )\n```\n:::\n\n\nFor the profitability factor, RMW, we take a long position in the two high profitability portfolios and a short position in the two low profitability portfolios.\\index{Factor!Profitability}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_profitability <- portfolios |>\n group_by(portfolio_size, portfolio_op, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n ) \n\nfactors_profitability <- portfolios_profitability |>\n group_by(month) |>\n summarize(\n rmw_replicated = mean(ret[portfolio_op == 3]) -\n mean(ret[portfolio_op == 1])\n )\n```\n:::\n\n\nFor the investment factor, CMA, we go long the two low investment portfolios and short the two high investment portfolios.\\index{Factor!Investment}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nportfolios_investment <- portfolios |>\n group_by(portfolio_size, portfolio_inv, month) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\"\n )\n\nfactors_investment <- portfolios_investment |>\n group_by(month) |>\n summarize(\n cma_replicated = mean(ret[portfolio_inv == 1]) -\n mean(ret[portfolio_inv == 3])\n )\n```\n:::\n\n\nFinally, the size factor, SMB, is constructed by going long the six small portfolios and short the six large portfolios. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_size <- bind_rows(\n portfolios_value,\n portfolios_profitability,\n portfolios_investment\n) |> \n group_by(month) |>\n summarize(\n smb_replicated = mean(ret[portfolio_size == 1]) -\n mean(ret[portfolio_size == 2])\n )\n```\n:::\n\n\nWe then join all factors together into one data frame and construct again a suitable table to run tests for evaluating our replication.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfactors_replicated <- factors_size |>\n full_join(\n factors_value, join_by(month)\n ) |>\n full_join(\n factors_profitability, join_by(month)\n ) |>\n full_join(\n factors_investment, join_by(month)\n )\n\ntest <- factors_ff5_monthly |>\n inner_join(factors_replicated, join_by(month)) |>\n mutate(\n across(c(smb_replicated, hml_replicated, \n rmw_replicated, cma_replicated), ~round(., 4))\n )\n```\n:::\n\n\nLet us start the replication evaluation again with the size factor:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_smb <- lm(smb ~ smb_replicated, data = test)\nsummary(model_smb)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = smb ~ smb_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.018146 -0.001865 0.000181 0.001980 0.014417 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.000203 0.000136 -1.5 0.14 \nsmb_replicated 0.969574 0.004375 221.6 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.00361 on 712 degrees of freedom\nMultiple R-squared: 0.986,\tAdjusted R-squared: 0.986 \nF-statistic: 4.91e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe results for the SMB factor are quite convincing as all three criteria outlined above are met and the coefficient is 0.97 and the R-squared is at 99%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_hml <- lm(hml ~ hml_replicated, data = test)\nsummary(model_hml)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = hml ~ hml_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.04464 -0.00414 -0.00035 0.00410 0.03667 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000473 0.000298 1.59 0.11 \nhml_replicated 0.991570 0.010271 96.54 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00792 on 712 degrees of freedom\nMultiple R-squared: 0.929,\tAdjusted R-squared: 0.929 \nF-statistic: 9.32e+03 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nThe replication of the HML factor is also a success, although at a slightly higher coefficient of 0.99 and an R-squared around 93%. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_rmw <- lm(rmw ~ rmw_replicated, data = test)\nsummary(model_rmw)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = rmw ~ rmw_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.019810 -0.003051 0.000054 0.003280 0.018801 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 5.59e-05 2.02e-04 0.28 0.78 \nrmw_replicated 9.55e-01 8.88e-03 107.52 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00536 on 712 degrees of freedom\nMultiple R-squared: 0.942,\tAdjusted R-squared: 0.942 \nF-statistic: 1.16e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nWe are also able to replicate the RMW factor quite well with a coefficient of 0.95 and an R-squared around 94%.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_cma <- lm(cma ~ cma_replicated, data = test)\nsummary(model_cma)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nlm(formula = cma ~ cma_replicated, data = test)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.015071 -0.002720 -0.000225 0.002331 0.021404 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.000673 0.000171 3.93 9.4e-05 ***\ncma_replicated 0.965219 0.008195 117.78 < 2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.00455 on 712 degrees of freedom\nMultiple R-squared: 0.951,\tAdjusted R-squared: 0.951 \nF-statistic: 1.39e+04 on 1 and 712 DF, p-value: <2e-16\n```\n\n\n:::\n:::\n\n\nFinally, the CMA factor also replicates well with a coefficient of 0.97 and an R-squared around 95%.\n\nOverall, our approach seems to replicate the Fama-French three and five factor models just as well as the three factors. \n\n## Exercises\n\n1. @Fama1993 claim that their sample excludes firms until they have appeared in Compustat for two years. Implement this additional filter and compare the improvements of your replication effort. \n2. On his homepage, [Kenneth French](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/variable_definitions.html) provides instructions on how to construct the most common variables used for portfolio sorts. 
Try to replicate the univariate portfolio sort return time series for `E/P` (earnings / price) provided on his homepage and evaluate your replication effort using regressions.\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/r/setting-up-your-environment/execute-results/html.json b/_freeze/r/setting-up-your-environment/execute-results/html.json index a2b2bc74..68facf3a 100644 --- a/_freeze/r/setting-up-your-environment/execute-results/html.json +++ b/_freeze/r/setting-up-your-environment/execute-results/html.json @@ -1,11 +1,9 @@ { - "hash": "98e050968d54d756e6a65fdb0f8da089", + "hash": "81bf1ebef16bf2a9d7f3a09520917963", "result": { "engine": "knitr", - "markdown": "---\ntitle: Setting Up Your Environment\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Setting Up Your R Environment\n description-meta: How to set up your development environment to program with R. \n---\n\n\nWe aim to lower the bar for starting empirical research in financial economics. We want that using R is easy for you. However, given that Tidy Finance is a platform that supports multiple programming languages, we also consider the possibility that you are not familiar with R at all. Hence, we provide you with a simple guide to get started with R and RStudio. If you were not using R before, you will be able to use it after reading this chapter.\n\n## The R language\n\nSome good news first: The software you need is free and easy to download. We will start with downloading and installing R and follow up with doing the same for RStudio.\n\nR is provided via [The Comprehensive R Archive Network](https://cran.r-project.org/) (or short CRAN). CRAN does not only provide the main software but also nearly all extensions that you need. We will cover these extensions or packages later, as we usually visit the CRAN website only to download the base version. Now, go ahead and visit CRAN. On the landing page, you can choose your operating systems (i.e., Linux, macOS, and Windows). Click the respective link that fits your system:\\index{CRAN}\n\n- R comes as a part of many Linux distributions. If it does not, CRAN provides installation guides for individual Linux distributions.\n- For macOS, the choice currently depends on some hardware specifications, but the right version for your system is clearly indicated.\n- For Windows, you want to use the base version provided.\n\nAfter downloading and installing the software to your system, you are nearly ready to go. In fact, you could just use R now. Unfortunately for many users, R is not a program but a programming language and comes with an interpreter that you would use like a command line. While using R like this might make you feel like a hacker (not that we do not endorse any criminal activity), it is in your best interest to combine R with RStudio.\n\nR is constantly being updated, with new versions being released multiple times a year. This means that you might want to return to CRAN in the future to fetch yourself an update. You know it is time for an update if packages remind you that you are using an outdated version of R. \n\n## RStudio\n\nAssuming you are looking for a more comfortable way of using R, you will get RStudio next. You can download it for free from [Posit](https://posit.co/download/rstudio-desktop/) (i.e., the company that created RStudio, which was previously called RStudio itself). When you follow the instructions, you will see that Posit asks you to install R. 
However, you should have done that already and can move straight to downloading and installing RStudio.\\index{RStudio}\n\nRStudio is a program similar to other programs you most likely use, like a browser, text editor, or anything else. It comes with many advantages, including a project manager, Github integration, and much more. Unfortunately, Tidy Finance is not the right scope to elaborate more on these possibilities or introduce the basics of programming, but we point you to some excellent resources below. For the purposes of this book, you have completed your excursions to websites that provide you with the necessary software installers.\n\n## R Packages and Environments\n\nFollowing your read of the preface to this book, you might now wonder why we did not download the `tidyverse` yet. Therefore, you must understand one more concept, namely packages in R.\\index{Packages} You can think of them as extensions that you use for specific purposes, whereas R itself is the core pillar upon which everything rests. Comfortably, you can install packages within R with the following code. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\nSimply specify the package you want where we placed `tidyverse`. You typically only need to install packages once - except for updates or project-specific R environments. Once installed, you can then load a package with a call to `library(\"tidyverse\")` to use it.\n\nTo keep track of the packages' versions and make our results replicatable, we rely on the package `renv`. It creates a project-specific installation of R packages and you can find the full list of packages used here in the colophon below. The recorded package versions can also be shared with collaborators to ensure consistency. Our use of `renv` also makes it easier for you to install the exact package versions we were using (if you want that) by initializing `renv` with our [renv.lock](https://github.com/tidy-finance/website/blob/main/renv.lock)-file from Github.\\index{Packages:renv} \\index{Github}\n\nOne more piece of advice is the use of RStudio projects. They are a powerful tool to save you some time and make working with R more fun. Without going into more detail here, we refer you to @Wickham2023's chapter on [Workflow: scripts and projects](https://r4ds.hadley.nz/workflow-scripts.html).\\index{RStudio!Projects}\n\n## Your First Steps with R\n\nWhile we believe that downloading and installing R and RStudio is sufficiently easy, you might find help from @Grolemund2014 on [R and RStudio](https://rstudio-education.github.io/hopr/starting.html), [packages](https://rstudio-education.github.io/hopr/packages2.html), as well as [updating](https://rstudio-education.github.io/hopr/updating.html) the software.\n\nThis book's scope cannot be to give you an introduction to R itself. It is not our comparative advantage. However, we can point you to a possible path that you could follow to familiarize yourself with R. Therefore, we make the following suggestion:\n\n- If you are new to R itself, a very gentle and good introduction to the workings of R can be found in @Grolemund2014. He provides a wonderful example in the form of the [weighted dice project](https://rstudio-education.github.io/hopr/project-1-weighted-dice.html). 
Once you are done setting up R on your machine, try to follow the instructions in this project.\n- The main book on the `tidyverse`, @Wickham2023, is available online and for free: [R for Data Science](https://r4ds.had.co.nz/introduction.html) explains the majority of the tools we use in our book. Working through this text is an eye-opening experience and really useful.\\index{tidyverse}\n\nAdditional resources we can encourage you to use are the following:\n\n- If you are an instructor searching to effectively teach R and data science methods, we recommend taking a look at the excellent [data science toolbox](https://datasciencebox.org/) by [Mine Cetinkaya-Rundel.](https://mine-cr.com/about/)\n- RStudio provides a range of excellent [cheat sheets](https://www.rstudio.com/resources/cheatsheets/) with extensive information on how to use the `tidyverse` packages.\\index{tidyverse} \\index{RStudio:Cheat sheets}\n\n## Colophon\n\nThis book was written in RStudio using `bookdown` [@Xie2016]. The website was rendered using `quarto` [@quarto] and it is hosted via GitHub Pages.\nThe complete source is [available from GitHub](https://github.com/tidy-finance/website).\\index{Bookdown} \\index{Github} \\index{Quarto} We generated all plots in this book using `ggplot2` and its classic dark-on-light theme (`theme_bw()`).\\index{ggplot2 theme}\n\nThis version of the book was built with R [@R-base] version 4.3.1 (2023-06-16, Beagle Scouts) and the following packages: \\index{Colophon}\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
| Package | Version |
|:---|:---|
| RPostgres | 1.4.5 |
| RSQLite | 2.3.1 |
| broom | 1.0.5 |
| brulee | 0.2.0 |
| dbplyr | 2.3.2 |
| devtools | 2.4.5 |
| dplyr | 1.1.3 |
| fixest | 0.11.1 |
| forcats | 1.0.0 |
| frenchdata | 0.2.0 |
| furrr | 0.3.1 |
| ggplot2 | 3.4.3 |
| glmnet | 4.1-8 |
| googledrive | 2.1.0 |
| hardhat | 1.3.0 |
| hexSticker | 0.4.9 |
| jsonlite | 1.8.4 |
| kableExtra | 1.3.4 |
| lmtest | 0.9-40 |
| lubridate | 1.9.2 |
| nloptr | 2.0.3 |
| purrr | 1.0.1 |
| ranger | 0.15.1 |
| readr | 2.1.4 |
| readxl | 1.4.2 |
| renv | 1.0.3 |
| rlang | 1.1.1 |
| rmarkdown | 2.21 |
| sandwich | 3.0-2 |
| scales | 1.2.1 |
| slider | 0.3.0 |
| stringr | 1.5.0 |
| tibble | 3.2.1 |
| tidymodels | 1.1.0 |
| tidyquant | 1.0.7 |
| tidyr | 1.3.0 |
| tidyverse | 2.0.0 |
| timetk | 2.8.3 |
| torch | 0.11.0 |
| wesanderson | 0.3.6 |
\n\n`````\n:::\n:::\n", - "supporting": [ - "setting-up-your-environment_files" - ], + "markdown": "---\ntitle: Setting Up Your Environment\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Setting Up Your R Environment\n description-meta: How to set up your development environment to program with R. \n---\n\n\nWe aim to lower the bar for starting empirical research in financial economics. We want that using R is easy for you. However, given that Tidy Finance is a platform that supports multiple programming languages, we also consider the possibility that you are not familiar with R at all. Hence, we provide you with a simple guide to get started with R and RStudio. If you were not using R before, you will be able to use it after reading this chapter.\n\n## The R language\n\nSome good news first: The software you need is free and easy to download. We will start with downloading and installing R and follow up with doing the same for RStudio.\n\nR is provided via [The Comprehensive R Archive Network](https://cran.r-project.org/) (or short CRAN). CRAN does not only provide the main software but also nearly all extensions that you need. We will cover these extensions or packages later, as we usually visit the CRAN website only to download the base version. Now, go ahead and visit CRAN. On the landing page, you can choose your operating systems (i.e., Linux, macOS, and Windows). Click the respective link that fits your system:\\index{CRAN}\n\n- R comes as a part of many Linux distributions. If it does not, CRAN provides installation guides for individual Linux distributions.\n- For macOS, the choice currently depends on some hardware specifications, but the right version for your system is clearly indicated.\n- For Windows, you want to use the base version provided.\n\nAfter downloading and installing the software to your system, you are nearly ready to go. In fact, you could just use R now. Unfortunately for many users, R is not a program but a programming language and comes with an interpreter that you would use like a command line. While using R like this might make you feel like a hacker (not that we do not endorse any criminal activity), it is in your best interest to combine R with RStudio.\n\nR is constantly being updated, with new versions being released multiple times a year. This means that you might want to return to CRAN in the future to fetch yourself an update. You know it is time for an update if packages remind you that you are using an outdated version of R. \n\n## RStudio\n\nAssuming you are looking for a more comfortable way of using R, you will get RStudio next. You can download it for free from [Posit](https://posit.co/download/rstudio-desktop/) (i.e., the company that created RStudio, which was previously called RStudio itself). When you follow the instructions, you will see that Posit asks you to install R. However, you should have done that already and can move straight to downloading and installing RStudio.\\index{RStudio}\n\nRStudio is a program similar to other programs you most likely use, like a browser, text editor, or anything else. It comes with many advantages, including a project manager, Github integration, and much more. Unfortunately, Tidy Finance is not the right scope to elaborate more on these possibilities or introduce the basics of programming, but we point you to some excellent resources below. 
For the purposes of this book, you have completed your excursions to websites that provide you with the necessary software installers.\n\n## R Packages and Environments\n\nFollowing your read of the preface to this book, you might now wonder why we did not download the `tidyverse` yet. Therefore, you must understand one more concept, namely packages in R.\\index{Packages} You can think of them as extensions that you use for specific purposes, whereas R itself is the core pillar upon which everything rests. Comfortably, you can install packages within R with the following code. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\nSimply specify the package you want where we placed `tidyverse`. You typically only need to install packages once - except for updates or project-specific R environments. Once installed, you can then load a package with a call to `library(\"tidyverse\")` to use it.\n\nTo keep track of the packages' versions and make our results replicatable, we rely on the package `renv`. It creates a project-specific installation of R packages and you can find the full list of packages used here in the colophon below. The recorded package versions can also be shared with collaborators to ensure consistency. Our use of `renv` also makes it easier for you to install the exact package versions we were using (if you want that) by initializing `renv` with our [renv.lock](https://github.com/tidy-finance/website/blob/main/renv.lock)-file from Github.\\index{Packages:renv} \\index{Github}\n\nOne more piece of advice is the use of RStudio projects. They are a powerful tool to save you some time and make working with R more fun. Without going into more detail here, we refer you to @Wickham2023's chapter on [Workflow: scripts and projects](https://r4ds.hadley.nz/workflow-scripts.html).\\index{RStudio!Projects}\n\n## Your First Steps with R\n\nWhile we believe that downloading and installing R and RStudio is sufficiently easy, you might find help from @Grolemund2014 on [R and RStudio](https://rstudio-education.github.io/hopr/starting.html), [packages](https://rstudio-education.github.io/hopr/packages2.html), as well as [updating](https://rstudio-education.github.io/hopr/updating.html) the software.\n\nThis book's scope cannot be to give you an introduction to R itself. It is not our comparative advantage. However, we can point you to a possible path that you could follow to familiarize yourself with R. Therefore, we make the following suggestion:\n\n- If you are new to R itself, a very gentle and good introduction to the workings of R can be found in @Grolemund2014. He provides a wonderful example in the form of the [weighted dice project](https://rstudio-education.github.io/hopr/project-1-weighted-dice.html). Once you are done setting up R on your machine, try to follow the instructions in this project.\n- The main book on the `tidyverse`, @Wickham2023, is available online and for free: [R for Data Science](https://r4ds.had.co.nz/introduction.html) explains the majority of the tools we use in our book. 
Working through this text is an eye-opening experience and really useful.\\index{tidyverse}\n\nAdditional resources we can encourage you to use are the following:\n\n- If you are an instructor searching to effectively teach R and data science methods, we recommend taking a look at the excellent [data science toolbox](https://datasciencebox.org/) by [Mine Cetinkaya-Rundel.](https://mine-cr.com/about/)\n- RStudio provides a range of excellent [cheat sheets](https://www.rstudio.com/resources/cheatsheets/) with extensive information on how to use the `tidyverse` packages.\\index{tidyverse} \\index{RStudio:Cheat sheets}\n\n## Colophon\n\nThis book was written in RStudio using `bookdown` [@Xie2016]. The website was rendered using `quarto` [@quarto] and it is hosted via GitHub Pages.\nThe complete source is [available from GitHub](https://github.com/tidy-finance/website).\\index{Bookdown} \\index{Github} \\index{Quarto} We generated all plots in this book using `ggplot2` and its classic dark-on-light theme (`theme_bw()`).\\index{ggplot2 theme}\n\nThis version of the book was built with R [@R-base] version 4.3.2 (2023-10-31, Eye Holes) and the following packages: \\index{Colophon}\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n \n \n \n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n
Package Version
RPostgres 1.4.5
RSQLite 2.3.1
broom 1.0.5
brulee 0.2.0
dbplyr 2.3.2
devtools 2.4.5
dplyr 1.1.3
fixest 0.11.1
forcats 1.0.0
frenchdata 0.2.0
furrr 0.3.1
ggplot2 3.4.3
glmnet 4.1-8
googledrive 2.1.0
hardhat 1.3.0
hexSticker 0.4.9
jsonlite 1.8.4
kableExtra 1.3.4
lmtest 0.9-40
lubridate 1.9.2
nloptr 2.0.3
purrr 1.0.1
ranger 0.15.1
readr 2.1.4
readxl 1.4.2
renv 1.0.3
rlang 1.1.1
rmarkdown 2.21
sandwich 3.0-2
scales 1.2.1
slider 0.3.0
stringr 1.5.0
tibble 3.2.1
tidymodels 1.1.0
tidyquant 1.0.7
tidyr 1.3.0
tidyverse 2.0.0
timetk 2.8.3
torch 0.11.0
wesanderson 0.3.6
\n\n`````\n:::\n:::\n", + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/r/trace-and-fisd/execute-results/html.json b/_freeze/r/trace-and-fisd/execute-results/html.json index 25cef0d5..35532f0f 100644 --- a/_freeze/r/trace-and-fisd/execute-results/html.json +++ b/_freeze/r/trace-and-fisd/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "716a9d11a8ce7edf5b5f4ee59c7883d1", + "hash": "f094010288c587f828fd06baab89551d", "result": { "engine": "knitr", - "markdown": "---\ntitle: TRACE and FISD\naliases:\n - ../trace-and-fisd.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: TRACE and FISD with R\n description-meta: Download and prepare corporate bond data such as TRACE and FISD from WRDS using the programming language R. \n---\n\n\nIn this chapter, we dive into the US corporate bond market. Bond markets are far more diverse than stock markets, as most issuers have multiple bonds outstanding simultaneously with potentially very different indentures. This market segment is exciting due to its size (roughly 10 trillion USD outstanding), heterogeneity of issuers (as opposed to government bonds), market structure (mostly over-the-counter trades), and data availability. We introduce how to use bond characteristics from FISD and trade reports from TRACE and provide code to download and clean TRACE in R. \n\nMany researchers study liquidity in the US corporate bond market [see, e.g., @bessembinder2006, @Edwards2007, and @Ohara2021, among many others]. We do not cover bond returns here, but you can compute them from TRACE data. Instead, we refer to studies on the topic such as @Bessembinder2008, @bai2019, and @kelly2020 and a survey by @Huang2021. Moreover, WRDS includes bond returns computed from TRACE data at a monthly frequency.\\index{Corporate bonds}\n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(dbplyr)\nlibrary(RSQLite)\nlibrary(RPostgres)\nlibrary(devtools)\n```\n:::\n\n\nCompared to previous chapters, we load the `devtools` package [@devtools] to source code that we provided to the public via [gist.](https://docs.github.com/en/get-started/writing-on-github/editing-and-sharing-content-with-gists/creating-gists)\n\n## Bond Data from WRDS \n\nBoth bond databases we need are available on [WRDS](https://wrds-www.wharton.upenn.edu/) to which we establish the `RPostgres` connection described in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). Additionally, we connect to our local `SQLite`-database to store the data we download.\\index{WRDS}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrds <- dbConnect(\n Postgres(),\n host = \"wrds-pgdata.wharton.upenn.edu\",\n dbname = \"wrds\",\n port = 9737,\n sslmode = \"require\",\n user = Sys.getenv(\"WRDS_USER\"),\n password = Sys.getenv(\"WRDS_PASSWORD\")\n)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\n## Mergent FISD\n\nFor research on US corporate bonds, the Mergent Fixed Income Securities Database (FISD) is the primary resource for bond characteristics.\\index{Data!FISD} There is a [detailed manual](https://wrds-www.wharton.upenn.edu/documents/1364/FixedIncome_Securities_Master_Database_User_Guide_v4.pdf) on WRDS, so we only cover the necessary subjects here. FISD data comes in two main variants, namely, centered on issuers or issues. 
In either case, the most useful identifiers are [CUSIPs.](https://www.cusip.com/index.html) 9-digit CUSIPs identify securities issued by issuers. The issuers can be identified from the first six digits of a security CUSIP, which is also called 6-digit CUSIP. Both stocks and bonds have CUSIPs.\\index{CUSIP} This connection would, in principle, allow matching them easily, but due to changing issuer details, this approach only yields small coverage.\n\nWe use the issue-centered version of FISD to identify the subset of US corporate bonds that meet the standard criteria [@bessembinder2006]. The WRDS table `fisd_mergedissue` contains most of the information we need on a 9-digit CUSIP level. \nDue to the diversity of corporate bonds, details in the indenture vary significantly. We focus on common bonds that make up the majority of trading volume in this market without diverging too much in indentures. \n\nThe following chunk connects to the data and selects the bond sample to remove certain bond types that are less commonly used [see, e.g., @Dick2012; @Ohara2021, among many others].\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_mergedissue_db <- tbl(wrds, in_schema(\"fisd\", \"fisd_mergedissue\"))\n\nfisd <- fisd_mergedissue_db |>\n filter(\n security_level == \"SEN\", # senior bonds\n slob == \"N\" | is.na(slob), # secured lease obligation\n is.na(security_pledge), # unsecured bonds\n asset_backed == \"N\" | is.na(asset_backed), # not asset backed\n defeased == \"N\" | is.na(defeased), # not defeased\n is.na(defeased_date),\n bond_type %in% c(\n \"CDEB\", # US Corporate Debentures\n \"CMTN\", # US Corporate MTN (Medium Term Note)\n \"CMTZ\", # US Corporate MTN Zero\n \"CZ\", # US Corporate Zero,\n \"USBN\" # US Corporate Bank Note\n ), \n pay_in_kind != \"Y\" | is.na(pay_in_kind), # not payable in kind\n is.na(pay_in_kind_exp_date),\n yankee == \"N\" | is.na(yankee), # no foreign issuer\n canadian == \"N\" | is.na(canadian), # not Canadian\n foreign_currency == \"N\", # USD\n coupon_type %in% c(\n \"F\", # fixed coupon\n \"Z\" # zero coupon\n ), \n is.na(fix_frequency),\n coupon_change_indicator == \"N\",\n interest_frequency %in% c(\n \"0\", # per year\n \"1\",\n \"2\",\n \"4\",\n \"12\"\n ),\n rule_144a == \"N\", # publicly traded\n private_placement == \"N\" | is.na(private_placement),\n defaulted == \"N\", # not defaulted\n is.na(filing_date),\n is.na(settlement),\n convertible == \"N\", # not convertible\n is.na(exchange),\n putable == \"N\" | is.na(putable), # not putable\n unit_deal == \"N\" | is.na(unit_deal), # not issued with another security\n exchangeable == \"N\" | is.na(exchangeable), # not exchangeable\n perpetual == \"N\", # not perpetual\n preferred_security == \"N\" | is.na(preferred_security) # not preferred\n ) |> \n select(\n complete_cusip, maturity,\n offering_amt, offering_date,\n dated_date, \n interest_frequency, coupon,\n last_interest_date, \n issue_id, issuer_id\n ) |>\n collect()\n```\n:::\n\n\nWe also pull issuer information from `fisd_mergedissuer` regarding the industry and country of the firm that issued a particular bond. Then, we filter to include only US-domiciled firms' bonds. 
We match the data by `issuer_id`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_mergedissuer_db <- tbl(wrds, in_schema(\"fisd\", \"fisd_mergedissuer\")) \n\nfisd_issuer <- fisd_mergedissuer_db |>\n select(issuer_id, sic_code, country_domicile) |>\n collect()\n\nfisd <- fisd |>\n inner_join(fisd_issuer, by = \"issuer_id\") |>\n filter(country_domicile == \"USA\") |>\n select(-country_domicile)\n```\n:::\n\n\nFinally, we save the bond characteristics to our local database. This selection of bonds also constitutes the sample for which we will collect trade reports from TRACE below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n conn = tidy_finance,\n name = \"fisd\",\n value = fisd,\n overwrite = TRUE\n)\n```\n:::\n\n\nThe FISD database also contains other data. The issue-based file contains information on covenants, i.e., restrictions included in bond indentures to limit specific actions by firms [e.g., @handler2021]. Moreover, FISD also provides information on bond ratings. We do not need either here.\n\n## TRACE\n\nThe Financial Industry Regulatory Authority (FINRA) provides the Trade Reporting and Compliance Engine (TRACE).\\index{Data!TRACE} In TRACE, dealers that trade corporate bonds must report such trades individually. Hence, we observe trade messages in TRACE that contain information on the bond traded, the trade time, price, and volume. TRACE comes in two variants; standard and enhanced TRACE. We show how to download and clean enhanced TRACE as it contains uncapped volume, a crucial quantity missing in the standard distribution. Moreover, enhanced TRACE also provides information on the respective parties' roles and the direction of the trade report. These items become essential in cleaning the messages.\n\nWhy do we repeatedly talk about cleaning TRACE? Trade messages are submitted within a short time window after a trade is executed (less than 15 minutes). These messages can contain errors, and the reporters subsequently correct them or they cancel a trade altogether. The cleaning needs are described by @Dick2009 in detail, and @Dick2014 shows how to clean the enhanced TRACE data using SAS. We do not go into the cleaning steps here, since the code is lengthy and serves no educational purpose. However, downloading and cleaning enhanced TRACE data is straightforward with our setup.\n\nWe store code for cleaning enhanced TRACE with R on the following Github [gist.](https://gist.github.com/patrick-weiss/3a05b3ab281563b2e94858451c2eb3a4) \\index{Github!Gist} as a function. The appendix also contains the code for reference. We only need to source the code from the gist, which we can do with `source_gist()`. Alternatively, you can also go to the gist, download it, and `source()` the respective R-file. The `clean_enhanced_trace()` function takes a vector of CUSIPs, a connection to WRDS explained in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd), and a start and end date, respectively.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource_gist(\"3a05b3ab281563b2e94858451c2eb3a4\")\n```\n:::\n\n\nThe TRACE database is considerably large. Therefore, we only download subsets of data at once. Specifying too many CUSIPs over a long time horizon will result in very long download times and a potential failure due to the size of the request to WRDS. The size limit depends on many parameters, and we cannot give you a guideline here. If we were working with the complete TRACE data for all CUSIPs above, splitting the data into 100 parts takes roughly two hours using our setup. 
For the applications in this book, we need data around the Paris Agreement in December 2015 and download the data in ten sets, which we define below.\\index{Paris (Climate) Agreement}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_cusips <- fisd |>\n pull(complete_cusip)\n\nfisd_parts <- split(\n fisd_cusips,\n rep(1:10, \n length.out = length(fisd_cusips))\n)\n```\n:::\n\n\nFinally, we run a loop in the same style as in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd) where we download daily returns from CRSP. For each of the CUSIP sets defined above, we call the cleaning function and save the resulting output. We add new data to the existing table for batch two and all following batches.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbatches <- length(fisd_parts)\n\nfor (j in 1:batches) {\n trace_enhanced <- clean_enhanced_trace(\n cusips = fisd_parts[[j]],\n connection = wrds,\n start_date = ymd(\"2014-01-01\"),\n end_date = ymd(\"2016-11-30\")\n )\n\n dbWriteTable(\n conn = tidy_finance,\n name = \"trace_enhanced\",\n value = trace_enhanced,\n overwrite = ifelse(j == 1, TRUE, FALSE),\n append = ifelse(j != 1, TRUE, FALSE)\n )\n \n cat(\"Batch\", j, \"out of\", batches, \"done (\", \n round(j / batches, 2) * 100, \"%)\\n\")\n}\n```\n:::\n\n\n## Insights into Corporate Bonds\n\nWhile many news outlets readily provide information on stocks and the underlying firms, corporate bonds are not covered frequently. Additionally, the TRACE database contains trade-level information, potentially new to students. Therefore, we provide you with some insights by showing some summary statistics.\\index{Summary statistics}\n\nWe start by looking into the number of bonds outstanding over time and compare it to the number of bonds traded in our sample. First, we compute the number of bonds outstanding for each quarter around the Paris Agreement from 2014 to 2016. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_outstanding <- expand_grid(\"date\" = seq(ymd(\"2014-01-01\"),\n ymd(\"2016-11-30\"), \n by = \"quarter\"), \n \"complete_cusip\" = fisd$complete_cusip) |> \n left_join(fisd |> select(complete_cusip, \n offering_date,\n maturity), \n by = \"complete_cusip\") |> \n mutate(offering_date = floor_date(offering_date),\n maturity = floor_date(maturity)) |> \n filter(date >= offering_date & date <= maturity) |> \n count(date) |> \n mutate(type = \"Outstanding\")\n```\n:::\n\n\nNext, we look at the bonds traded each quarter in the same period. 
Notice that we load the complete trace table from our database, as we only have a single part of it in the environment from the download loop from above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_enhanced <- tbl(tidy_finance, \"trace_enhanced\") |>\n collect()\n\nbonds_traded <- trace_enhanced |> \n mutate(date = floor_date(trd_exctn_dt, \"quarters\")) |> \n group_by(date) |> \n summarize(n = length(unique(cusip_id)),\n type = \"Traded\",\n .groups = \"drop\") \n```\n:::\n\n\nFinally, we plot the two time series in @fig-401.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_outstanding |> \n bind_rows(bonds_traded) |> \n ggplot(aes(\n x = date, \n y = n, \n color = type, \n linetype = type\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Number of bonds outstanding and traded each quarter\"\n )\n```\n\n::: {.cell-output-display}\n![The number of corporate bonds outstanding each quarter as reported by Mergent FISD and the number of traded bonds from enhanced TRACE between 2014 and end of 2016.](trace-and-fisd_files/figure-html/fig-401-1.png){#fig-401 fig-alt='Title: Number of bonds outstanding and traded each quarter. The figure shows a time series of outstanding bonds and bonds traded. The amount outstanding increases monotonically between 2014 and 2016. The number of bonds traded represents only a fraction of roughly 60 percent, which peaks around the third quarter of 2016.' width=2100}\n:::\n:::\n\n\nWe see that the number of bonds outstanding increases steadily between 2014 and 2016. During our sample period of trade data, we see that the fraction of bonds trading each quarter is roughly 60%. The relatively small number of traded bonds means that many bonds do not trade through an entire quarter. This lack of trading activity illustrates the generally low level of liquidity in the corporate bond market, where it can be hard to trade specific bonds. \nDoes this lack of liquidity mean that corporate bond markets are irrelevant in terms of their size? With over 7,500 traded bonds each quarter, it is hard to say that the market is small. However, let us also investigate the characteristics of issued corporate bonds. In particular, we consider maturity (in years), coupon, and offering amount (in million USD).\\index{Liquidity}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd |>\n mutate(maturity = as.numeric(maturity - offering_date) / 365,\n offering_amt = offering_amt / 10^3) |> \n pivot_longer(cols = c(maturity, coupon, offering_amt),\n names_to = \"measure\") |>\n drop_na() |> \n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 8\n measure mean sd min q05 q50 q95 max\n \n1 coupon 5.94 2.67 0 1.90 6.06 10 39 \n2 maturity 9.74 9.21 -6.24 1.03 7.18 30.0 101.\n3 offering_amt 379. 570. 0.001 0.669 200 1400 15000 \n```\n\n\n:::\n:::\n\n\nWe see that the average bond in our sample period has an offering amount of over 357 million USD with a median of 200 million USD, which both cannot be considered small. The average bond has a maturity of 10 years and pays around 6% in coupons.\n\nFinally, let us compute some summary statistics for the trades in this market. To this end, we show a summary based on aggregate information daily. 
In particular, we consider the trade size (in million USD) and the number of trades.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_enhanced |> \n group_by(trd_exctn_dt) |> \n summarize(trade_size = sum(entrd_vol_qt * rptd_pr / 100) / 10^6,\n trade_number = n(),\n .groups = \"drop\") |> \n pivot_longer(cols = c(trade_size, trade_number),\n names_to = \"measure\") |> \n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 8\n measure mean sd min q05 q50 q95 max\n \n1 trade_number 25914. 5444. 439 17844. 26051 34383. 40839 \n2 trade_size 12968. 3577. 17.2 6128. 13421. 17850. 21312.\n```\n\n\n:::\n:::\n\n\nOn average, nearly 26 billion USD of corporate bonds are traded daily in nearly 13,000 transactions. We can hence conclude that the corporate bond market is indeed significant in terms of trading volume and activity.\n\n## Exercises\n\n1. Compute the amount outstanding across all bonds over time. Make sure to subtract all matured bonds. How would you describe the resulting plot?\n1. Compute the number of days each bond is traded (accounting for the bonds' maturities and issuances). Start by looking at the number of bonds traded each day in a graph similar to the one above. How many bonds trade on more than 75% of trading days? \n1. WRDS provides more information from Mergent FISD such as ratings in the table `fisd_ratings`. Download the ratings table and plot the distribution of ratings for the different rating providers. How would you map the different providers to a common numeric rating scale? \\index{Rating}\n", + "markdown": "---\ntitle: TRACE and FISD\naliases:\n - ../trace-and-fisd.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: TRACE and FISD with R\n description-meta: Download and prepare corporate bond data such as TRACE and FISD from WRDS using the programming language R. \n---\n\n\nIn this chapter, we dive into the US corporate bond market. Bond markets are far more diverse than stock markets, as most issuers have multiple bonds outstanding simultaneously with potentially very different indentures. This market segment is exciting due to its size (roughly 10 trillion USD outstanding), heterogeneity of issuers (as opposed to government bonds), market structure (mostly over-the-counter trades), and data availability. We introduce how to use bond characteristics from FISD and trade reports from TRACE and provide code to download and clean TRACE in R. \n\nMany researchers study liquidity in the US corporate bond market [see, e.g., @bessembinder2006, @Edwards2007, and @Ohara2021, among many others]. We do not cover bond returns here, but you can compute them from TRACE data. Instead, we refer to studies on the topic such as @Bessembinder2008, @bai2019, and @kelly2020 and a survey by @Huang2021. Moreover, WRDS includes bond returns computed from TRACE data at a monthly frequency.\\index{Corporate bonds}\n\nThe current chapter relies on this set of R packages. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(dbplyr)\nlibrary(RSQLite)\nlibrary(RPostgres)\nlibrary(devtools)\n```\n:::\n\n\nCompared to previous chapters, we load the `devtools` package [@devtools] to source code that we provided to the public via [gist.](https://docs.github.com/en/get-started/writing-on-github/editing-and-sharing-content-with-gists/creating-gists)\n\n## Bond Data from WRDS \n\nBoth bond databases we need are available on [WRDS](https://wrds-www.wharton.upenn.edu/) to which we establish the `RPostgres` connection described in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). Additionally, we connect to our local `SQLite`-database to store the data we download.\\index{WRDS}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrds <- dbConnect(\n Postgres(),\n host = \"wrds-pgdata.wharton.upenn.edu\",\n dbname = \"wrds\",\n port = 9737,\n sslmode = \"require\",\n user = Sys.getenv(\"WRDS_USER\"),\n password = Sys.getenv(\"WRDS_PASSWORD\")\n)\n\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n```\n:::\n\n\n## Mergent FISD\n\nFor research on US corporate bonds, the Mergent Fixed Income Securities Database (FISD) is the primary resource for bond characteristics.\\index{Data!FISD} There is a [detailed manual](https://wrds-www.wharton.upenn.edu/documents/1364/FixedIncome_Securities_Master_Database_User_Guide_v4.pdf) on WRDS, so we only cover the necessary subjects here. FISD data comes in two main variants, namely, centered on issuers or issues. In either case, the most useful identifiers are [CUSIPs.](https://www.cusip.com/index.html) 9-digit CUSIPs identify securities issued by issuers. The issuers can be identified from the first six digits of a security CUSIP, which is also called 6-digit CUSIP. Both stocks and bonds have CUSIPs.\\index{CUSIP} This connection would, in principle, allow matching them easily, but due to changing issuer details, this approach only yields small coverage.\n\nWe use the issue-centered version of FISD to identify the subset of US corporate bonds that meet the standard criteria [@bessembinder2006]. The WRDS table `fisd_mergedissue` contains most of the information we need on a 9-digit CUSIP level. \nDue to the diversity of corporate bonds, details in the indenture vary significantly. We focus on common bonds that make up the majority of trading volume in this market without diverging too much in indentures. 
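\n\nAs an aside, because the sample below is identified via 9-digit CUSIPs, it is sometimes handy to recover the 6-digit issuer CUSIP when linking bonds to other data sources. The following minimal sketch is purely illustrative and uses a made-up CUSIP-like string rather than an actual FISD identifier (`str_sub()` comes with the `stringr` package, which is part of the `tidyverse` loaded above):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Hypothetical example: the issuer (6-digit) CUSIP consists of the first six\n# characters of a security's 9-digit CUSIP\nexample_cusip <- \"123456AB7\"\nstr_sub(example_cusip, 1, 6)\n```\n:::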
\n\nThe following chunk connects to the data and selects the bond sample to remove certain bond types that are less commonly used [see, e.g., @Dick2012; @Ohara2021, among many others].\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_mergedissue_db <- tbl(wrds, in_schema(\"fisd\", \"fisd_mergedissue\"))\n\nfisd <- fisd_mergedissue_db |>\n filter(\n security_level == \"SEN\", # senior bonds\n slob == \"N\" | is.na(slob), # secured lease obligation\n is.na(security_pledge), # unsecured bonds\n asset_backed == \"N\" | is.na(asset_backed), # not asset backed\n defeased == \"N\" | is.na(defeased), # not defeased\n is.na(defeased_date),\n bond_type %in% c(\n \"CDEB\", # US Corporate Debentures\n \"CMTN\", # US Corporate MTN (Medium Term Note)\n \"CMTZ\", # US Corporate MTN Zero\n \"CZ\", # US Corporate Zero,\n \"USBN\" # US Corporate Bank Note\n ), \n pay_in_kind != \"Y\" | is.na(pay_in_kind), # not payable in kind\n is.na(pay_in_kind_exp_date),\n yankee == \"N\" | is.na(yankee), # no foreign issuer\n canadian == \"N\" | is.na(canadian), # not Canadian\n foreign_currency == \"N\", # USD\n coupon_type %in% c(\n \"F\", # fixed coupon\n \"Z\" # zero coupon\n ), \n is.na(fix_frequency),\n coupon_change_indicator == \"N\",\n interest_frequency %in% c(\n \"0\", # per year\n \"1\",\n \"2\",\n \"4\",\n \"12\"\n ),\n rule_144a == \"N\", # publicly traded\n private_placement == \"N\" | is.na(private_placement),\n defaulted == \"N\", # not defaulted\n is.na(filing_date),\n is.na(settlement),\n convertible == \"N\", # not convertible\n is.na(exchange),\n putable == \"N\" | is.na(putable), # not putable\n unit_deal == \"N\" | is.na(unit_deal), # not issued with another security\n exchangeable == \"N\" | is.na(exchangeable), # not exchangeable\n perpetual == \"N\", # not perpetual\n preferred_security == \"N\" | is.na(preferred_security) # not preferred\n ) |> \n select(\n complete_cusip, maturity,\n offering_amt, offering_date,\n dated_date, \n interest_frequency, coupon,\n last_interest_date, \n issue_id, issuer_id\n ) |>\n collect()\n```\n:::\n\n\nWe also pull issuer information from `fisd_mergedissuer` regarding the industry and country of the firm that issued a particular bond. Then, we filter to include only US-domiciled firms' bonds. We match the data by `issuer_id`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_mergedissuer_db <- tbl(wrds, in_schema(\"fisd\", \"fisd_mergedissuer\")) \n\nfisd_issuer <- fisd_mergedissuer_db |>\n select(issuer_id, sic_code, country_domicile) |>\n collect()\n\nfisd <- fisd |>\n inner_join(fisd_issuer, join_by(issuer_id)) |>\n filter(country_domicile == \"USA\") |>\n select(-country_domicile)\n```\n:::\n\n\nFinally, we save the bond characteristics to our local database. This selection of bonds also constitutes the sample for which we will collect trade reports from TRACE below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(\n conn = tidy_finance,\n name = \"fisd\",\n value = fisd,\n overwrite = TRUE\n)\n```\n:::\n\n\nThe FISD database also contains other data. The issue-based file contains information on covenants, i.e., restrictions included in bond indentures to limit specific actions by firms [e.g., @handler2021]. Moreover, FISD also provides information on bond ratings. We do not need either here.\n\n## TRACE\n\nThe Financial Industry Regulatory Authority (FINRA) provides the Trade Reporting and Compliance Engine (TRACE).\\index{Data!TRACE} In TRACE, dealers that trade corporate bonds must report such trades individually. 
Hence, we observe trade messages in TRACE that contain information on the bond traded, the trade time, price, and volume. TRACE comes in two variants; standard and enhanced TRACE. We show how to download and clean enhanced TRACE as it contains uncapped volume, a crucial quantity missing in the standard distribution. Moreover, enhanced TRACE also provides information on the respective parties' roles and the direction of the trade report. These items become essential in cleaning the messages.\n\nWhy do we repeatedly talk about cleaning TRACE? Trade messages are submitted within a short time window after a trade is executed (less than 15 minutes). These messages can contain errors, and the reporters subsequently correct them or they cancel a trade altogether. The cleaning needs are described by @Dick2009 in detail, and @Dick2014 shows how to clean the enhanced TRACE data using SAS. We do not go into the cleaning steps here, since the code is lengthy and serves no educational purpose. However, downloading and cleaning enhanced TRACE data is straightforward with our setup.\n\nWe store code for cleaning enhanced TRACE with R on the following Github [gist.](https://gist.github.com/patrick-weiss/3a05b3ab281563b2e94858451c2eb3a4) \\index{Github!Gist} as a function. The appendix also contains the code for reference. We only need to source the code from the gist, which we can do with `source_gist()`. Alternatively, you can also go to the gist, download it, and `source()` the respective R-file. The `clean_enhanced_trace()` function takes a vector of CUSIPs, a connection to WRDS explained in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd), and a start and end date, respectively.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource_gist(\"3a05b3ab281563b2e94858451c2eb3a4\")\n```\n:::\n\n\nThe TRACE database is considerably large. Therefore, we only download subsets of data at once. Specifying too many CUSIPs over a long time horizon will result in very long download times and a potential failure due to the size of the request to WRDS. The size limit depends on many parameters, and we cannot give you a guideline here. If we were working with the complete TRACE data for all CUSIPs above, splitting the data into 100 parts takes roughly two hours using our setup. For the applications in this book, we need data around the Paris Agreement in December 2015 and download the data in ten sets, which we define below.\\index{Paris (Climate) Agreement}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd_cusips <- fisd |>\n pull(complete_cusip)\n\nfisd_parts <- split(\n fisd_cusips,\n rep(1:10, \n length.out = length(fisd_cusips))\n)\n```\n:::\n\n\nFinally, we run a loop in the same style as in [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd) where we download daily returns from CRSP. For each of the CUSIP sets defined above, we call the cleaning function and save the resulting output. 
We add new data to the existing table for batch two and all following batches.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbatches <- length(fisd_parts)\n\nfor (j in 1:batches) {\n trace_enhanced <- clean_enhanced_trace(\n cusips = fisd_parts[[j]],\n connection = wrds,\n start_date = ymd(\"2014-01-01\"),\n end_date = ymd(\"2016-11-30\")\n )\n\n dbWriteTable(\n conn = tidy_finance,\n name = \"trace_enhanced\",\n value = trace_enhanced,\n overwrite = ifelse(j == 1, TRUE, FALSE),\n append = ifelse(j != 1, TRUE, FALSE)\n )\n \n cat(\"Batch\", j, \"out of\", batches, \"done (\", \n round(j / batches, 2) * 100, \"%)\\n\")\n}\n```\n:::\n\n\n## Insights into Corporate Bonds\n\nWhile many news outlets readily provide information on stocks and the underlying firms, corporate bonds are not covered frequently. Additionally, the TRACE database contains trade-level information, potentially new to students. Therefore, we provide you with some insights by showing some summary statistics.\\index{Summary statistics}\n\nWe start by looking into the number of bonds outstanding over time and compare it to the number of bonds traded in our sample. First, we compute the number of bonds outstanding for each quarter around the Paris Agreement from 2014 to 2016. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_outstanding <- expand_grid(\"date\" = seq(ymd(\"2014-01-01\"),\n ymd(\"2016-11-30\"), \n by = \"quarter\"), \n \"complete_cusip\" = fisd$complete_cusip) |> \n left_join(fisd |> select(complete_cusip, \n offering_date,\n maturity), \n join_by(complete_cusip)) |> \n mutate(offering_date = floor_date(offering_date),\n maturity = floor_date(maturity)) |> \n filter(date >= offering_date & date <= maturity) |> \n count(date) |> \n mutate(type = \"Outstanding\")\n```\n:::\n\n\nNext, we look at the bonds traded each quarter in the same period. Notice that we load the complete trace table from our database, as we only have a single part of it in the environment from the download loop from above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_enhanced <- tbl(tidy_finance, \"trace_enhanced\") |>\n collect()\n\nbonds_traded <- trace_enhanced |> \n mutate(date = floor_date(trd_exctn_dt, \"quarters\")) |> \n group_by(date) |> \n summarize(n = length(unique(cusip_id)),\n type = \"Traded\",\n .groups = \"drop\") \n```\n:::\n\n\nFinally, we plot the two time series in @fig-401.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbonds_outstanding |> \n bind_rows(bonds_traded) |> \n ggplot(aes(\n x = date, \n y = n, \n color = type, \n linetype = type\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Number of bonds outstanding and traded each quarter\"\n )\n```\n\n::: {.cell-output-display}\n![The number of corporate bonds outstanding each quarter as reported by Mergent FISD and the number of traded bonds from enhanced TRACE between 2014 and end of 2016.](trace-and-fisd_files/figure-html/fig-401-1.png){#fig-401 fig-alt='Title: Number of bonds outstanding and traded each quarter. The figure shows a time series of outstanding bonds and bonds traded. The amount outstanding increases monotonically between 2014 and 2016. The number of bonds traded represents only a fraction of roughly 60 percent, which peaks around the third quarter of 2016.' width=2100}\n:::\n:::\n\n\nWe see that the number of bonds outstanding increases steadily between 2014 and 2016. During our sample period of trade data, we see that the fraction of bonds trading each quarter is roughly 60%. 
The relatively small number of traded bonds means that many bonds do not trade through an entire quarter. This lack of trading activity illustrates the generally low level of liquidity in the corporate bond market, where it can be hard to trade specific bonds. \nDoes this lack of liquidity mean that corporate bond markets are irrelevant in terms of their size? With over 7,500 traded bonds each quarter, it is hard to say that the market is small. However, let us also investigate the characteristics of issued corporate bonds. In particular, we consider maturity (in years), coupon, and offering amount (in million USD).\\index{Liquidity}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfisd |>\n mutate(maturity = as.numeric(maturity - offering_date) / 365,\n offering_amt = offering_amt / 10^3) |> \n pivot_longer(cols = c(maturity, coupon, offering_amt),\n names_to = \"measure\") |>\n drop_na() |> \n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 8\n measure mean sd min q05 q50 q95 max\n \n1 coupon 5.94 2.67 0 1.90 6.06 10 39 \n2 maturity 9.74 9.21 -6.24 1.03 7.18 30.0 101.\n3 offering_amt 379. 570. 0.001 0.669 200 1400 15000 \n```\n\n\n:::\n:::\n\n\nWe see that the average bond in our sample period has an offering amount of over 357 million USD with a median of 200 million USD, which both cannot be considered small. The average bond has a maturity of 10 years and pays around 6% in coupons.\n\nFinally, let us compute some summary statistics for the trades in this market. To this end, we show a summary based on aggregate information daily. In particular, we consider the trade size (in million USD) and the number of trades.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntrace_enhanced |> \n group_by(trd_exctn_dt) |> \n summarize(trade_size = sum(entrd_vol_qt * rptd_pr / 100) / 10^6,\n trade_number = n(),\n .groups = \"drop\") |> \n pivot_longer(cols = c(trade_size, trade_number),\n names_to = \"measure\") |> \n group_by(measure) |>\n summarize(\n mean = mean(value),\n sd = sd(value),\n min = min(value),\n q05 = quantile(value, 0.05),\n q50 = quantile(value, 0.50),\n q95 = quantile(value, 0.95),\n max = max(value)\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 2 × 8\n measure mean sd min q05 q50 q95 max\n \n1 trade_number 25914. 5444. 439 17844. 26051 34383. 40839 \n2 trade_size 12968. 3577. 17.2 6128. 13421. 17850. 21312.\n```\n\n\n:::\n:::\n\n\nOn average, nearly 26 billion USD of corporate bonds are traded daily in nearly 13,000 transactions. We can hence conclude that the corporate bond market is indeed significant in terms of trading volume and activity.\n\n## Exercises\n\n1. Compute the amount outstanding across all bonds over time. Make sure to subtract all matured bonds. How would you describe the resulting plot?\n1. Compute the number of days each bond is traded (accounting for the bonds' maturities and issuances). Start by looking at the number of bonds traded each day in a graph similar to the one above. How many bonds trade on more than 75% of trading days? \n1. WRDS provides more information from Mergent FISD such as ratings in the table `fisd_ratings`. Download the ratings table and plot the distribution of ratings for the different rating providers. How would you map the different providers to a common numeric rating scale? 
\\index{Rating}\n", "supporting": [ "trace-and-fisd_files" ], diff --git a/_freeze/r/univariate-portfolio-sorts/execute-results/html.json b/_freeze/r/univariate-portfolio-sorts/execute-results/html.json index 8f48ff3e..f5ebffca 100644 --- a/_freeze/r/univariate-portfolio-sorts/execute-results/html.json +++ b/_freeze/r/univariate-portfolio-sorts/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "3127fe8ff1b60b520605091103823265", + "hash": "b8792df00aa6d328f4831f07414ec669", "result": { "engine": "knitr", - "markdown": "---\ntitle: Univariate Portfolio Sorts\naliases:\n - ../univariate-portfolio-sorts.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Univariate Portfolio Sorts with R\n description-meta: Perform univariate portfolio sorts to test for return predictability using the programming language R. \n---\n\n\nIn this chapter, we dive into portfolio sorts, one of the most widely used statistical methodologies in empirical asset pricing [e.g., @BaliEngleMurray2016]. The key application of portfolio sorts is to examine whether one or more variables can predict future excess returns. In general, the idea is to sort individual stocks into portfolios, where the stocks within each portfolio are similar with respect to a sorting variable, such as firm size. The different portfolios then represent well-diversified investments that differ in the level of the sorting variable. You can then attribute the differences in the return distribution to the impact of the sorting variable. \nWe start by introducing univariate portfolio sorts (which sort based on only one characteristic) and tackle bivariate sorting in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd). \n\nA univariate portfolio sort considers only one sorting variable $x_{t-1,i}$.\\index{Portfolio sorts!Univariate}\nHere, $i$ denotes the stock and $t-1$ indicates that the characteristic is observable by investors at time $t$. \nThe objective is to assess the cross-sectional relation between $x_{t-1,i}$ and, typically, stock excess returns $r_{t,i}$ at time $t$ as the outcome variable. \nTo illustrate how portfolio sorts work, we use estimates for market betas from the previous chapter as our sorting variable.\n\nThe current chapter relies on the following set of R packages.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(lmtest)\nlibrary(broom)\nlibrary(sandwich)\n```\n:::\n\n\nCompared to previous chapters, we introduce `lmtest` [@lmtest] for inference for estimated coefficients, `broom` package [@broom] to tidy the estimation output of many estimated linear models, and `sandwich` [@sandwich] for different covariance matrix estimators\n\n## Data Preparation\n\nWe start with loading the required data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). 
In particular, we use the monthly CRSP sample as our asset universe.\\index{Data!CRSP}\nOnce we form our portfolios, we use the Fama-French market factor returns to compute the risk-adjusted performance (i.e., alpha).\\index{Data!Fama-French factors}\n`beta` is the tibble with market betas computed in the previous chapter.\\index{Beta}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, ret_excess, mktcap_lag) |>\n collect()\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n\nbeta <- tbl(tidy_finance, \"beta\") |>\n select(permno, month, beta_monthly) |>\n collect()\n```\n:::\n\n\n## Sorting by Market Beta\n\nNext, we merge our sorting variable with the return data. We use the one-month *lagged* betas as a sorting variable to ensure that the sorts rely only on information available when we create the portfolios. \nTo lag stock beta by one month, we add one month to the current date and join the resulting information with our return data. \nThis procedure ensures that month $t$ information is available in month $t+1$. \nYou may be tempted to simply use a call such as `crsp_monthly |> group_by(permno) |> mutate(beta_lag = lag(beta)))` instead. \nThis procedure, however, does not work correctly if there are non-explicit missing values in the time series.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_lag <- beta |>\n mutate(month = month %m+% months(1)) |>\n select(permno, month, beta_lag = beta_monthly) |>\n drop_na()\n\ndata_for_sorts <- crsp_monthly |>\n inner_join(beta_lag, by = c(\"permno\", \"month\"))\n```\n:::\n\n\nThe first step to conduct portfolio sorts is to calculate periodic breakpoints that you can use to group the stocks into portfolios.\\index{Breakpoints} \nFor simplicity, we start with the median lagged market beta as the single breakpoint. \nWe then compute the value-weighted returns for each of the two resulting portfolios, which means that the lagged market capitalization determines the weight in `weighted.mean()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n breakpoint = median(beta_lag),\n portfolio = case_when(\n beta_lag <= breakpoint ~ \"low\",\n beta_lag > breakpoint ~ \"high\"\n )\n ) |>\n group_by(month, portfolio) |>\n summarize(ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\")\n```\n:::\n\n\n## Performance Evaluation\n\nWe can construct a long-short strategy based on the two portfolios: buy the high-beta portfolio and, at the same time, short the low-beta portfolio. Thereby, the overall position in the market is net-zero, i.e., you do not need to invest money to realize this strategy in the absence of frictions.\\index{Long-short}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort <- beta_portfolios |>\n pivot_wider(id_cols = month, names_from = portfolio, values_from = ret) |>\n mutate(long_short = high - low)\n```\n:::\n\n\nWe compute the average return and the corresponding standard error to test whether the long-short portfolio yields on average positive or negative excess returns. 
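\n\nTo see why a regression appears in this context, note that regressing the long-short return on a constant recovers its sample mean, and the plain OLS standard error of the intercept is the usual standard error of the mean, which still ignores autocorrelation. Here is a minimal sketch that reuses the `beta_longshort` tibble from above (the object name `model_naive` is ours):\n\n::: {.cell}\n\n```{.r .cell-code}\n# The intercept of a constant-only regression equals the sample mean return;\n# its plain OLS standard error is the standard error of the mean\nmodel_naive <- lm(long_short ~ 1, data = beta_longshort)\nall.equal(unname(coef(model_naive)), mean(beta_longshort$long_short))\nsummary(model_naive)$coefficients\n```\n:::\n\n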
In the asset pricing literature, one typically adjusts for autocorrelation by using @Newey1987 $t$-statistics to test the null hypothesis that average portfolio excess returns are equal to zero.\\index{Standard errors!Newey-West} One necessary input for Newey-West standard errors is a chosen bandwidth based on the number of lags employed for the estimation. While it seems that researchers often default on choosing a pre-specified lag length of 6 months, we instead recommend a data-driven approach. This automatic selection is advocated by @Newey1994 and available in the `sandwich` package. To implement this test, we compute the average return via `lm()` and then employ the `coeftest()` function. If you want to implement the typical 6-lag default setting, you can enforce it by passing the arguments `lag = 6, prewhite = FALSE` to the `coeftest()` function in the code below and it passes them on to `NeweyWest()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fit <- lm(long_short ~ 1, data = beta_longshort)\ncoeftest(model_fit, vcov = NeweyWest)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 5.76e-05 1.30e-03 0.04 0.96\n```\n\n\n:::\n:::\n\n\nThe results indicate that we cannot reject the null hypothesis of average returns being equal to zero. Our portfolio strategy using the median as a breakpoint hence does not yield any abnormal returns. Is this finding surprising if you reconsider the CAPM? It certainly is. The CAPM yields that the high beta stocks should yield higher expected returns. Our portfolio sort implicitly mimics an investment strategy that finances high beta stocks by shorting low beta stocks. Therefore, one should expect that the average excess returns yield a return that is above the risk-free rate.\n\n## Functional Programming for Portfolio Sorts\n\nNow we take portfolio sorts to the next level. We want to be able to sort stocks into an arbitrary number of portfolios. For this case, functional programming is very handy: we employ the [curly-curly](https://www.tidyverse.org/blog/2019/06/rlang-0-4-0/#a-simpler-interpolation-pattern-with-)-operator to give us flexibility concerning which variable to use for the sorting, denoted by `sorting_variable`.\\index{Curly-curly} We use `quantile()` to compute breakpoints for `n_portfolios`. Then, we assign portfolios to stocks using the `findInterval()` function. The output of the following function is a new column that contains the number of the portfolio to which a stock belongs.\\index{Functional programming} \n\nIn some applications, the variable used for the sorting might be clustered (e.g., at a lower bound of 0). Then, multiple breakpoints may be identical, leading to empty portfolios. Similarly, some portfolios might have a very small number of stocks at the beginning of the sample. 
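\n\nTo make the breakpoint issue concrete, consider a small, purely hypothetical illustration: when the sorting variable is clustered at its lower bound of zero, several decile breakpoints coincide, and `findInterval()` then leaves some portfolios empty (all object names below are ours):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Purely hypothetical data: a sorting variable clustered at its lower bound of zero\nset.seed(1234)\nsorting_variable_toy <- c(rep(0, 80), runif(20))\n\n# Several decile breakpoints coincide at zero ...\nbreakpoints_toy <- quantile(\n  sorting_variable_toy,\n  probs = seq(0, 1, length.out = 11),\n  names = FALSE\n)\nbreakpoints_toy\n\n# ... so only a few of the ten portfolios end up populated\ntable(findInterval(sorting_variable_toy, breakpoints_toy, all.inside = TRUE))\n```\n:::\n\n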
Cases, where the number of portfolio constituents differs substantially due to the distribution of the characteristics, require careful consideration and, depending on the application, might require customized sorting approaches.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n n_portfolios) {\n # Compute breakpoints\n breakpoints <- data |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = seq(0, 1, length.out = n_portfolios + 1),\n na.rm = TRUE,\n names = FALSE\n )\n\n # Assign portfolios\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n # Output\n return(assigned_portfolios)\n}\n```\n:::\n\n\nWe can use the above function to sort stocks into ten portfolios each month using lagged betas and compute value-weighted returns for each portfolio. Note that we transform the portfolio column to a factor variable because it provides more convenience for the figure construction below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n portfolio = assign_portfolio(\n data = pick(everything()),\n sorting_variable = beta_lag,\n n_portfolios = 10\n ),\n portfolio = as.factor(portfolio)\n ) |>\n group_by(portfolio, month) |>\n summarize(\n ret_excess = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )|>\n left_join(factors_ff3_monthly, by = \"month\")\n```\n:::\n\n\n## More Performance Evaluation\n\nIn the next step, we compute summary statistics for each beta portfolio. Namely, we compute CAPM-adjusted alphas, the beta of each beta portfolio, and average returns.\\index{Performance evaluation}\\index{Alpha}\\index{CAPM}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios_summary <- beta_portfolios |>\n nest(data = c(month, ret_excess, mkt_excess)) |>\n mutate(estimates = map(\n data, ~ tidy(lm(ret_excess ~ 1 + mkt_excess, data = .x))\n )) |>\n unnest(estimates) |> \n select(portfolio, term, estimate) |> \n pivot_wider(names_from = term, values_from = estimate) |> \n rename(alpha = `(Intercept)`, beta = mkt_excess) |> \n left_join(\n beta_portfolios |> \n group_by(portfolio) |> \n summarize(ret_excess = mean(ret_excess),\n .groups = \"drop\"), by = \"portfolio\"\n )\n```\n:::\n\n\n@fig-701 illustrates the CAPM alphas of beta-sorted portfolios. It shows that low beta portfolios tend to exhibit positive alphas, while high beta portfolios exhibit negative alphas.\\index{Graph!Bar chart}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios_summary |>\n ggplot(aes(x = portfolio, y = alpha, fill = portfolio)) +\n geom_bar(stat = \"identity\") +\n labs(\n title = \"CAPM alphas of beta-sorted portfolios\",\n x = \"Portfolio\",\n y = \"CAPM alpha\",\n fill = \"Portfolio\"\n ) +\n scale_y_continuous(labels = percent) +\n theme(legend.position = \"None\")\n```\n\n::: {.cell-output-display}\n![Portfolios are sorted into deciles each month based on their estimated CAPM beta. The bar charts indicate the CAPM alpha of the resulting portfolio returns during the entire CRSP period.](univariate-portfolio-sorts_files/figure-html/fig-701-1.png){#fig-701 fig-alt='Title: CAPM alphas of beta-sorted portfolios. The figure shows bar charts of alphas of beta-sorted portfolios with the decile portfolio on the horizontal axis and the corresponding CAPM alpha on the vertical axis. 
Alphas for low beta portfolios are positive, while high beta portfolios show negative alphas.' width=2100}\n:::\n:::\n\n\nThese results suggest a negative relation between beta and future stock returns, which contradicts the predictions of the CAPM. According to the CAPM, returns should increase with beta across the portfolios and risk-adjusted returns should be statistically indistinguishable from zero.\n\n## The Security Market Line and Beta Portfolios\n\nThe CAPM predicts that our portfolios should lie on the security market line (SML). The slope of the SML is equal to the market risk premium and reflects the risk-return trade-off at any given time.\\index{Security market line} @fig-702 illustrates the security market line: We see that (not surprisingly) the high beta portfolio returns have a high correlation with the market returns. However, it seems like the average excess returns for high beta stocks are lower than what the security market line implies would be an \"appropriate\" compensation for the high market risk. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsml_capm <- lm(ret_excess ~ 1 + beta, data = beta_portfolios_summary)$coefficients\n\nbeta_portfolios_summary |>\n ggplot(aes(\n x = beta, \n y = ret_excess, \n color = portfolio\n )) +\n geom_point() +\n geom_abline(\n intercept = 0,\n slope = mean(factors_ff3_monthly$mkt_excess),\n linetype = \"solid\"\n ) +\n geom_abline(\n intercept = sml_capm[1],\n slope = sml_capm[2],\n linetype = \"dashed\"\n ) +\n scale_y_continuous(\n labels = percent,\n limit = c(0, mean(factors_ff3_monthly$mkt_excess) * 2)\n ) +\n scale_x_continuous(limits = c(0, 2)) +\n labs(\n x = \"Beta\", y = \"Excess return\", color = \"Portfolio\",\n title = \"Average portfolio excess returns and average beta estimates\"\n )\n```\n\n::: {.cell-output-display}\n![Excess returns are computed as CAPM alphas of the beta-sorted portfolios. The horizontal axis indicates the CAPM beta of the resulting beta-sorted portfolio return time series. The dashed line indicates the slope coefficient of a linear regression of excess returns on portfolio betas.](univariate-portfolio-sorts_files/figure-html/fig-702-1.png){#fig-702 fig-alt='Title: Average portfolio excess returns and average beta estimates. The figure shows a scatter plot of the average excess returns per beta portfolio with average beta estimates per portfolio on the horizontal axis and average excess returns on the vertical axis. An increasing solid line indicates the security market line. A dashed increasing line with lower slope than the security market line indicates that the CAPM prediction is not valid for CRSP data.' width=2100}\n:::\n:::\n\n\nTo provide more evidence against the CAPM predictions, we again form a long-short strategy that buys the high-beta portfolio and shorts the low-beta portfolio.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort <- beta_portfolios |>\n mutate(portfolio = case_when(\n portfolio == max(as.numeric(portfolio)) ~ \"high\",\n portfolio == min(as.numeric(portfolio)) ~ \"low\"\n )) |>\n filter(portfolio %in% c(\"low\", \"high\")) |>\n pivot_wider(id_cols = month, \n names_from = portfolio, \n values_from = ret_excess) |>\n mutate(long_short = high - low) |>\n left_join(factors_ff3_monthly, by = \"month\")\n```\n:::\n\n\nAgain, the resulting long-short strategy does not exhibit statistically significant returns. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoeftest(lm(long_short ~ 1, data = beta_longshort),\n vcov = NeweyWest\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 0.00147 0.00331 0.44 0.66\n```\n\n\n:::\n:::\n\n\nHowever, the long-short portfolio yields a statistically significant negative CAPM-adjusted alpha, although, controlling for the effect of beta, the average excess stock returns should be zero according to the CAPM. The results thus provide no evidence in support of the CAPM. The negative value has been documented as the so-called betting against beta factor [@Frazzini2014]. Betting against beta corresponds to a strategy that shorts high beta stocks and takes a (levered) long position in low beta stocks. If borrowing constraints prevent investors from taking positions on the SML they are instead incentivized to buy high beta stocks, which leads to a relatively higher price (and therefore lower expected returns than implied by the CAPM) for such high beta stocks. As a result, the betting-against-beta strategy earns from providing liquidity to capital constraint investors with lower risk aversion.\\index{Risk aversion}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoeftest(lm(long_short ~ 1 + mkt_excess, data = beta_longshort),\n vcov = NeweyWest\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.00479 0.00249 -1.92 0.055 . \nmkt_excess 1.15351 0.08960 12.87 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\n@fig-703 shows the annual returns of the extreme beta portfolios we are mainly interested in. The figure illustrates no consistent striking patterns over the last years - each portfolio exhibits periods with positive and negative annual returns. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort |>\n group_by(year = year(month)) |>\n summarize(\n low = prod(1 + low),\n high = prod(1 + high),\n long_short = prod(1 + long_short)\n ) |>\n pivot_longer(cols = -year) |>\n ggplot(aes(x = year, y = 1 - value, fill = name)) +\n geom_col(position = \"dodge\") +\n facet_wrap(~name, ncol = 1) +\n theme(legend.position = \"none\") +\n scale_y_continuous(labels = percent) +\n labs(\n title = \"Annual returns of beta portfolios\",\n x = NULL, y = NULL\n )\n```\n\n::: {.cell-output-display}\n![We construct portfolios by sorting stocks into high and low based on their estimated CAPM beta. Long short indicates a strategy that goes long into high beta stocks and short low beta stocks.](univariate-portfolio-sorts_files/figure-html/fig-703-1.png){#fig-703 fig-alt='Title: Annual returns of beta portfolios. The figure shows bar charts of annual returns of long, short, and long-short beta portfolios with years on the horizontal axis and returns on the vertical axis. Each portfolio is plotted in its own facet. The long-short portfolio strategy delivers very high losses during some periods.' width=2100}\n:::\n:::\n\n\nOverall, this chapter shows how functional programming can be leveraged to form an arbitrary number of portfolios using any sorting variable and how to evaluate the performance of the resulting portfolios. In the next chapter, we dive deeper into the many degrees of freedom that arise in the context of portfolio analysis. \n\n## Exercises\n\n1. Take the two long-short beta strategies based on different numbers of portfolios and compare the returns. 
Is there a significant difference in returns? How do the Sharpe ratios compare between the strategies? Find one additional portfolio evaluation statistic and compute it.\n1. We plotted the alphas of the ten beta portfolios above. Write a function that tests these estimates for significance. Which portfolios have significant alphas?\n1. The analysis here is based on betas from monthly returns. However, we also computed betas from daily returns. Re-run the analysis and point out differences in the results.\n1. Given the results in this chapter, can you define a long-short strategy that yields positive abnormal returns (i.e., alphas)? Plot the cumulative excess return of your strategy and the market excess return for comparison.", + "markdown": "---\ntitle: Univariate Portfolio Sorts\naliases:\n - ../univariate-portfolio-sorts.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Univariate Portfolio Sorts with R\n description-meta: Perform univariate portfolio sorts to test for return predictability using the programming language R. \n---\n\n\nIn this chapter, we dive into portfolio sorts, one of the most widely used statistical methodologies in empirical asset pricing [e.g., @BaliEngleMurray2016]. The key application of portfolio sorts is to examine whether one or more variables can predict future excess returns. In general, the idea is to sort individual stocks into portfolios, where the stocks within each portfolio are similar with respect to a sorting variable, such as firm size. The different portfolios then represent well-diversified investments that differ in the level of the sorting variable. You can then attribute the differences in the return distribution to the impact of the sorting variable. \nWe start by introducing univariate portfolio sorts (which sort based on only one characteristic) and tackle bivariate sorting in [Value and Bivariate Sorts](value-and-bivariate-sorts.qmd). \n\nA univariate portfolio sort considers only one sorting variable $x_{t-1,i}$.\\index{Portfolio sorts!Univariate}\nHere, $i$ denotes the stock and $t-1$ indicates that the characteristic is observable by investors at time $t$. \nThe objective is to assess the cross-sectional relation between $x_{t-1,i}$ and, typically, stock excess returns $r_{t,i}$ at time $t$ as the outcome variable. \nTo illustrate how portfolio sorts work, we use estimates for market betas from the previous chapter as our sorting variable.\n\nThe current chapter relies on the following set of R packages.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\nlibrary(scales)\nlibrary(lmtest)\nlibrary(broom)\nlibrary(sandwich)\n```\n:::\n\n\nCompared to previous chapters, we introduce `lmtest` [@lmtest] for inference for estimated coefficients, `broom` package [@broom] to tidy the estimation output of many estimated linear models, and `sandwich` [@sandwich] for different covariance matrix estimators\n\n## Data Preparation\n\nWe start with loading the required data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd) and [WRDS, CRSP, and Compustat](wrds-crsp-and-compustat.qmd). 
In particular, we use the monthly CRSP sample as our asset universe.\\index{Data!CRSP}\nOnce we form our portfolios, we use the Fama-French market factor returns to compute the risk-adjusted performance (i.e., alpha).\\index{Data!Fama-French factors}\n`beta` is the tibble with market betas computed in the previous chapter.\\index{Beta}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(permno, month, ret_excess, mktcap_lag) |>\n collect()\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, mkt_excess) |>\n collect()\n\nbeta <- tbl(tidy_finance, \"beta\") |>\n select(permno, month, beta_monthly) |>\n collect()\n```\n:::\n\n\n## Sorting by Market Beta\n\nNext, we merge our sorting variable with the return data. We use the one-month *lagged* betas as a sorting variable to ensure that the sorts rely only on information available when we create the portfolios. \nTo lag stock beta by one month, we add one month to the current date and join the resulting information with our return data. \nThis procedure ensures that month $t$ information is available in month $t+1$. \nYou may be tempted to simply use a call such as `crsp_monthly |> group_by(permno) |> mutate(beta_lag = lag(beta)))` instead. \nThis procedure, however, does not work correctly if there are non-explicit missing values in the time series.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_lag <- beta |>\n mutate(month = month %m+% months(1)) |>\n select(permno, month, beta_lag = beta_monthly) |>\n drop_na()\n\ndata_for_sorts <- crsp_monthly |>\n inner_join(beta_lag, join_by(permno, month))\n```\n:::\n\n\nThe first step to conduct portfolio sorts is to calculate periodic breakpoints that you can use to group the stocks into portfolios.\\index{Breakpoints} \nFor simplicity, we start with the median lagged market beta as the single breakpoint. \nWe then compute the value-weighted returns for each of the two resulting portfolios, which means that the lagged market capitalization determines the weight in `weighted.mean()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n breakpoint = median(beta_lag),\n portfolio = case_when(\n beta_lag <= breakpoint ~ \"low\",\n beta_lag > breakpoint ~ \"high\"\n )\n ) |>\n group_by(month, portfolio) |>\n summarize(ret = weighted.mean(ret_excess, mktcap_lag), \n .groups = \"drop\")\n```\n:::\n\n\n## Performance Evaluation\n\nWe can construct a long-short strategy based on the two portfolios: buy the high-beta portfolio and, at the same time, short the low-beta portfolio. Thereby, the overall position in the market is net-zero, i.e., you do not need to invest money to realize this strategy in the absence of frictions.\\index{Long-short}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort <- beta_portfolios |>\n pivot_wider(id_cols = month, names_from = portfolio, values_from = ret) |>\n mutate(long_short = high - low)\n```\n:::\n\n\nWe compute the average return and the corresponding standard error to test whether the long-short portfolio yields on average positive or negative excess returns. 
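As a naive benchmark, a plain one-sample $t$-test of the mean long-short return conveys the idea; the following is only a sketch that ignores autocorrelation in the return series and is not the approach we rely on below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Naive benchmark (sketch only): a plain one-sample t-test of the mean\n# long-short return, ignoring autocorrelation in the monthly series.\nt.test(beta_longshort$long_short)\n```\n:::\n\n\n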
In the asset pricing literature, one typically adjusts for autocorrelation by using @Newey1987 $t$-statistics to test the null hypothesis that average portfolio excess returns are equal to zero.\\index{Standard errors!Newey-West} One necessary input for Newey-West standard errors is a chosen bandwidth based on the number of lags employed for the estimation. While it seems that researchers often default on choosing a pre-specified lag length of 6 months, we instead recommend a data-driven approach. This automatic selection is advocated by @Newey1994 and available in the `sandwich` package. To implement this test, we compute the average return via `lm()` and then employ the `coeftest()` function. If you want to implement the typical 6-lag default setting, you can enforce it by passing the arguments `lag = 6, prewhite = FALSE` to the `coeftest()` function in the code below and it passes them on to `NeweyWest()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmodel_fit <- lm(long_short ~ 1, data = beta_longshort)\ncoeftest(model_fit, vcov = NeweyWest)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 5.76e-05 1.30e-03 0.04 0.96\n```\n\n\n:::\n:::\n\n\nThe results indicate that we cannot reject the null hypothesis of average returns being equal to zero. Our portfolio strategy using the median as a breakpoint hence does not yield any abnormal returns. Is this finding surprising if you reconsider the CAPM? It certainly is. The CAPM yields that the high beta stocks should yield higher expected returns. Our portfolio sort implicitly mimics an investment strategy that finances high beta stocks by shorting low beta stocks. Therefore, one should expect that the average excess returns yield a return that is above the risk-free rate.\n\n## Functional Programming for Portfolio Sorts\n\nNow we take portfolio sorts to the next level. We want to be able to sort stocks into an arbitrary number of portfolios. For this case, functional programming is very handy: we employ the [curly-curly](https://www.tidyverse.org/blog/2019/06/rlang-0-4-0/#a-simpler-interpolation-pattern-with-)-operator to give us flexibility concerning which variable to use for the sorting, denoted by `sorting_variable`.\\index{Curly-curly} We use `quantile()` to compute breakpoints for `n_portfolios`. Then, we assign portfolios to stocks using the `findInterval()` function. The output of the following function is a new column that contains the number of the portfolio to which a stock belongs.\\index{Functional programming} \n\nIn some applications, the variable used for the sorting might be clustered (e.g., at a lower bound of 0). Then, multiple breakpoints may be identical, leading to empty portfolios. Similarly, some portfolios might have a very small number of stocks at the beginning of the sample. 
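The following toy example with hypothetical data clustered at zero illustrates the issue: the duplicated breakpoints imply that two of the five resulting portfolios receive no stocks at all.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy example (hypothetical data): a sorting variable clustered at zero yields\n# duplicated breakpoints, so findInterval() leaves some portfolios empty.\nsorting_variable_example <- c(rep(0, 6), 0.2, 0.5, 0.9, 1.4)\nbreakpoints_example <- quantile(\n  sorting_variable_example,\n  probs = seq(0, 1, length.out = 6),\n  names = FALSE\n)\nbreakpoints_example\ntable(findInterval(sorting_variable_example, breakpoints_example, all.inside = TRUE))\n```\n:::\n\n\n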
Cases, where the number of portfolio constituents differs substantially due to the distribution of the characteristics, require careful consideration and, depending on the application, might require customized sorting approaches.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n n_portfolios) {\n # Compute breakpoints\n breakpoints <- data |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = seq(0, 1, length.out = n_portfolios + 1),\n na.rm = TRUE,\n names = FALSE\n )\n\n # Assign portfolios\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n # Output\n return(assigned_portfolios)\n}\n```\n:::\n\n\nWe can use the above function to sort stocks into ten portfolios each month using lagged betas and compute value-weighted returns for each portfolio. Note that we transform the portfolio column to a factor variable because it provides more convenience for the figure construction below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n portfolio = assign_portfolio(\n data = pick(everything()),\n sorting_variable = beta_lag,\n n_portfolios = 10\n ),\n portfolio = as.factor(portfolio)\n ) |>\n group_by(portfolio, month) |>\n summarize(\n ret_excess = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )|>\n left_join(factors_ff3_monthly, join_by(month))\n```\n:::\n\n\n## More Performance Evaluation\n\nIn the next step, we compute summary statistics for each beta portfolio. Namely, we compute CAPM-adjusted alphas, the beta of each beta portfolio, and average returns.\\index{Performance evaluation}\\index{Alpha}\\index{CAPM}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios_summary <- beta_portfolios |>\n nest(data = c(month, ret_excess, mkt_excess)) |>\n mutate(estimates = map(\n data, ~ tidy(lm(ret_excess ~ 1 + mkt_excess, data = .x))\n )) |>\n unnest(estimates) |> \n select(portfolio, term, estimate) |> \n pivot_wider(names_from = term, values_from = estimate) |> \n rename(alpha = `(Intercept)`, beta = mkt_excess) |> \n left_join(\n beta_portfolios |> \n group_by(portfolio) |> \n summarize(ret_excess = mean(ret_excess),\n .groups = \"drop\"), join_by(portfolio)\n )\n```\n:::\n\n\n@fig-701 illustrates the CAPM alphas of beta-sorted portfolios. It shows that low beta portfolios tend to exhibit positive alphas, while high beta portfolios exhibit negative alphas.\\index{Graph!Bar chart}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_portfolios_summary |>\n ggplot(aes(x = portfolio, y = alpha, fill = portfolio)) +\n geom_bar(stat = \"identity\") +\n labs(\n title = \"CAPM alphas of beta-sorted portfolios\",\n x = \"Portfolio\",\n y = \"CAPM alpha\",\n fill = \"Portfolio\"\n ) +\n scale_y_continuous(labels = percent) +\n theme(legend.position = \"None\")\n```\n\n::: {.cell-output-display}\n![Portfolios are sorted into deciles each month based on their estimated CAPM beta. The bar charts indicate the CAPM alpha of the resulting portfolio returns during the entire CRSP period.](univariate-portfolio-sorts_files/figure-html/fig-701-1.png){#fig-701 fig-alt='Title: CAPM alphas of beta-sorted portfolios. The figure shows bar charts of alphas of beta-sorted portfolios with the decile portfolio on the horizontal axis and the corresponding CAPM alpha on the vertical axis. 
Alphas for low beta portfolios are positive, while high beta portfolios show negative alphas.' width=2100}\n:::\n:::\n\n\nThese results suggest a negative relation between beta and future stock returns, which contradicts the predictions of the CAPM. According to the CAPM, returns should increase with beta across the portfolios and risk-adjusted returns should be statistically indistinguishable from zero.\n\n## The Security Market Line and Beta Portfolios\n\nThe CAPM predicts that our portfolios should lie on the security market line (SML). The slope of the SML is equal to the market risk premium and reflects the risk-return trade-off at any given time.\\index{Security market line} @fig-702 illustrates the security market line: We see that (not surprisingly) the high beta portfolio returns have a high correlation with the market returns. However, it seems like the average excess returns for high beta stocks are lower than what the security market line implies would be an \"appropriate\" compensation for the high market risk. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsml_capm <- lm(ret_excess ~ 1 + beta, data = beta_portfolios_summary)$coefficients\n\nbeta_portfolios_summary |>\n ggplot(aes(\n x = beta, \n y = ret_excess, \n color = portfolio\n )) +\n geom_point() +\n geom_abline(\n intercept = 0,\n slope = mean(factors_ff3_monthly$mkt_excess),\n linetype = \"solid\"\n ) +\n geom_abline(\n intercept = sml_capm[1],\n slope = sml_capm[2],\n linetype = \"dashed\"\n ) +\n scale_y_continuous(\n labels = percent,\n limit = c(0, mean(factors_ff3_monthly$mkt_excess) * 2)\n ) +\n scale_x_continuous(limits = c(0, 2)) +\n labs(\n x = \"Beta\", y = \"Excess return\", color = \"Portfolio\",\n title = \"Average portfolio excess returns and average beta estimates\"\n )\n```\n\n::: {.cell-output-display}\n![Excess returns are computed as CAPM alphas of the beta-sorted portfolios. The horizontal axis indicates the CAPM beta of the resulting beta-sorted portfolio return time series. The dashed line indicates the slope coefficient of a linear regression of excess returns on portfolio betas.](univariate-portfolio-sorts_files/figure-html/fig-702-1.png){#fig-702 fig-alt='Title: Average portfolio excess returns and average beta estimates. The figure shows a scatter plot of the average excess returns per beta portfolio with average beta estimates per portfolio on the horizontal axis and average excess returns on the vertical axis. An increasing solid line indicates the security market line. A dashed increasing line with lower slope than the security market line indicates that the CAPM prediction is not valid for CRSP data.' width=2100}\n:::\n:::\n\n\nTo provide more evidence against the CAPM predictions, we again form a long-short strategy that buys the high-beta portfolio and shorts the low-beta portfolio.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort <- beta_portfolios |>\n mutate(portfolio = case_when(\n portfolio == max(as.numeric(portfolio)) ~ \"high\",\n portfolio == min(as.numeric(portfolio)) ~ \"low\"\n )) |>\n filter(portfolio %in% c(\"low\", \"high\")) |>\n pivot_wider(id_cols = month, \n names_from = portfolio, \n values_from = ret_excess) |>\n mutate(long_short = high - low) |>\n left_join(factors_ff3_monthly, join_by(month))\n```\n:::\n\n\nAgain, the resulting long-short strategy does not exhibit statistically significant returns. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoeftest(lm(long_short ~ 1, data = beta_longshort),\n vcov = NeweyWest\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|)\n(Intercept) 0.00147 0.00331 0.44 0.66\n```\n\n\n:::\n:::\n\n\nHowever, the long-short portfolio yields a statistically significant negative CAPM-adjusted alpha, although, controlling for the effect of beta, the average excess stock returns should be zero according to the CAPM. The results thus provide no evidence in support of the CAPM. The negative value has been documented as the so-called betting against beta factor [@Frazzini2014]. Betting against beta corresponds to a strategy that shorts high beta stocks and takes a (levered) long position in low beta stocks. If borrowing constraints prevent investors from taking positions on the SML they are instead incentivized to buy high beta stocks, which leads to a relatively higher price (and therefore lower expected returns than implied by the CAPM) for such high beta stocks. As a result, the betting-against-beta strategy earns from providing liquidity to capital constraint investors with lower risk aversion.\\index{Risk aversion}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncoeftest(lm(long_short ~ 1 + mkt_excess, data = beta_longshort),\n vcov = NeweyWest\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nt test of coefficients:\n\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -0.00479 0.00249 -1.92 0.055 . \nmkt_excess 1.15351 0.08960 12.87 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n```\n\n\n:::\n:::\n\n\n@fig-703 shows the annual returns of the extreme beta portfolios we are mainly interested in. The figure illustrates no consistent striking patterns over the last years - each portfolio exhibits periods with positive and negative annual returns. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbeta_longshort |>\n group_by(year = year(month)) |>\n summarize(\n low = prod(1 + low),\n high = prod(1 + high),\n long_short = prod(1 + long_short)\n ) |>\n pivot_longer(cols = -year) |>\n ggplot(aes(x = year, y = 1 - value, fill = name)) +\n geom_col(position = \"dodge\") +\n facet_wrap(~name, ncol = 1) +\n theme(legend.position = \"none\") +\n scale_y_continuous(labels = percent) +\n labs(\n title = \"Annual returns of beta portfolios\",\n x = NULL, y = NULL\n )\n```\n\n::: {.cell-output-display}\n![We construct portfolios by sorting stocks into high and low based on their estimated CAPM beta. Long short indicates a strategy that goes long into high beta stocks and short low beta stocks.](univariate-portfolio-sorts_files/figure-html/fig-703-1.png){#fig-703 fig-alt='Title: Annual returns of beta portfolios. The figure shows bar charts of annual returns of long, short, and long-short beta portfolios with years on the horizontal axis and returns on the vertical axis. Each portfolio is plotted in its own facet. The long-short portfolio strategy delivers very high losses during some periods.' width=2100}\n:::\n:::\n\n\nOverall, this chapter shows how functional programming can be leveraged to form an arbitrary number of portfolios using any sorting variable and how to evaluate the performance of the resulting portfolios. In the next chapter, we dive deeper into the many degrees of freedom that arise in the context of portfolio analysis. \n\n## Exercises\n\n1. Take the two long-short beta strategies based on different numbers of portfolios and compare the returns. 
Is there a significant difference in returns? How do the Sharpe ratios compare between the strategies? Find one additional portfolio evaluation statistic and compute it.\n1. We plotted the alphas of the ten beta portfolios above. Write a function that tests these estimates for significance. Which portfolios have significant alphas?\n1. The analysis here is based on betas from monthly returns. However, we also computed betas from daily returns. Re-run the analysis and point out differences in the results.\n1. Given the results in this chapter, can you define a long-short strategy that yields positive abnormal returns (i.e., alphas)? Plot the cumulative excess return of your strategy and the market excess return for comparison.", "supporting": [ "univariate-portfolio-sorts_files" ], diff --git a/_freeze/r/value-and-bivariate-sorts/execute-results/html.json b/_freeze/r/value-and-bivariate-sorts/execute-results/html.json index ac8a9963..c8afb56b 100644 --- a/_freeze/r/value-and-bivariate-sorts/execute-results/html.json +++ b/_freeze/r/value-and-bivariate-sorts/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "3c5e4ac4d7a93ee4d408412a1586fe76", + "hash": "1cf73a739b64fd3f3fb69d768ecdb45c", "result": { "engine": "knitr", - "markdown": "---\ntitle: Value and Bivariate Sorts\naliases:\n - ../value-and-bivariate-sorts.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Value and Bivariate Sorts with R\n description-meta: Perform bivariate portfolio sorts to test for return predictability in asset pricing applications using the programming language R. \n---\n\n\nIn this chapter, we extend univariate portfolio analysis to bivariate sorts, which means we assign stocks to portfolios based on two characteristics. Bivariate sorts are regularly used in the academic asset pricing literature and are the basis for the Fama and French three factors. However, some scholars also use sorts with three grouping variables. Conceptually, portfolio sorts are easily applicable in higher dimensions.\n\nWe form portfolios on firm size and the book-to-market ratio. To calculate book-to-market ratios, accounting data is required, which necessitates additional steps during portfolio formation. In the end, we demonstrate how to form portfolios on two sorting variables using so-called independent and dependent portfolio sorts.\n\nThe current chapter relies on this set of R packages. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nFirst, we load the necessary data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd). We conduct portfolio sorts based on the CRSP sample but keep only the necessary columns in our memory. We use the same data sources for firm size as in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd).\\index{Data!CRSP}\\index{Data!Fama-French factors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap, mktcap_lag, exchange\n ) |> \n collect() |>\n drop_na()\n```\n:::\n\n\nFurther, we utilize accounting data. The most common source of accounting data is Compustat. We only need book equity data in this application, which we select from our database. 
Additionally, we convert the variable `datadate` to its monthly value, as we only consider monthly returns here and do not need to account for the exact date. To achieve this, we use the function `floor_date()`.\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbook_equity <- tbl(tidy_finance, \"compustat\") |>\n select(gvkey, datadate, be) |>\n collect() |>\n drop_na() |>\n mutate(month = floor_date(ymd(datadate), \"month\"))\n```\n:::\n\n\n## Book-to-Market Ratio\n\nA fundamental problem in handling accounting data is the *look-ahead bias* - we must not include data in forming a portfolio that is not public knowledge at the time. Of course, researchers have more information when looking into the past than agents had at that moment. However, abnormal excess returns from a trading strategy should not rely on an information advantage because the differential cannot be the result of informed agents' trades. Hence, we have to lag accounting information.\n\nWe continue to lag market capitalization and firm size by one month.\\index{Market capitalization}\\index{Firm size} Then, we compute the book-to-market ratio, which relates a firm's book equity to its market equity.\\index{Book equity}\\index{Book-to-market ratio} Firms with high (low) book-to-market ratio are called value (growth) firms. After matching the accounting and market equity information from the same month, we lag book-to-market by six months. This is a sufficiently conservative approach because accounting information is usually released well before six months pass. However, in the asset pricing literature, even longer lags are used as well.^[The definition of a time lag is another choice a researcher has to make, similar to breakpoint choices as we describe in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd).]\n\nHaving both variables, i.e., firm size lagged by one month and book-to-market lagged by six months, we merge these sorting variables to our returns using the `sorting_date`-column created for this purpose. The final step in our data preparation deals with differences in the frequency of our variables. Returns and firm size are recorded monthly. Yet the accounting information is only released on an annual basis. Hence, we only match book-to-market to one month per year and have eleven empty observations. To solve this frequency issue, we carry the latest book-to-market ratio of each firm to the subsequent months, i.e., we fill the missing observations with the most current report. This is done via the `fill()`-function after sorting by date and firm (which we identify by `permno` and `gvkey`) and on a firm basis (which we do by `group_by()` as usual). We filter out all observations with accounting data that is older than a year. 
As the last step, we remove all rows with missing entries because the returns cannot be matched to any annual report.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nme <- crsp_monthly |>\n mutate(sorting_date = month %m+% months(1)) |>\n select(permno, sorting_date, me = mktcap)\n\nbm <- book_equity |>\n inner_join(crsp_monthly, by = c(\"gvkey\", \"month\")) |>\n mutate(\n bm = be / mktcap,\n sorting_date = month %m+% months(6),\n comp_date = sorting_date\n ) |>\n select(permno, gvkey, sorting_date, comp_date, bm)\n\ndata_for_sorts <- crsp_monthly |>\n left_join(\n bm, by = c(\"permno\", \"gvkey\", \"month\" = \"sorting_date\")\n ) |>\n left_join(\n me, by = c(\"permno\", \"month\" = \"sorting_date\")\n ) |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap_lag, me, bm, exchange, comp_date\n )\n\ndata_for_sorts <- data_for_sorts |>\n arrange(permno, gvkey, month) |>\n group_by(permno, gvkey) |>\n fill(bm, comp_date) |>\n ungroup() |> \n filter(comp_date > month %m-% months(12)) |>\n select(-comp_date) |>\n drop_na()\n```\n:::\n\n\nThe last step of preparation for the portfolio sorts is the computation of breakpoints. We continue to use the same function allowing for the specification of exchanges to use for the breakpoints. Additionally, we reintroduce the argument `sorting_variable` into the function for defining different sorting variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n n_portfolios, \n exchanges) {\n breakpoints <- data |>\n filter(exchange %in% exchanges) |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = seq(0, 1, length.out = n_portfolios + 1),\n na.rm = TRUE,\n names = FALSE\n )\n\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n return(assigned_portfolios)\n}\n```\n:::\n\n\nAfter these data preparation steps, we present bivariate portfolio sorts on an independent and dependent basis.\n\n## Independent Sorts\n\nBivariate sorts create portfolios within a two-dimensional space spanned by two sorting variables. It is then possible to assess the return impact of either sorting variable by the return differential from a trading strategy that invests in the portfolios at either end of the respective variables spectrum. We create a five-by-five matrix using book-to-market and firm size as sorting variables in our example below. We end up with 25 portfolios. Since we are interested in the *value premium* (i.e., the return differential between high and low book-to-market firms), we go long the five portfolios of the highest book-to-market firms and short the five portfolios of the lowest book-to-market firms.\\index{Value premium} The five portfolios at each end are due to the size splits we employed alongside the book-to-market splits.\n\nTo implement the independent bivariate portfolio sort, we assign monthly portfolios for each of our sorting variables separately to create the variables `portfolio_bm` and `portfolio_me`, respectively.\\index{Portfolio sorts!Independent bivariate} Then, these separate portfolios are combined to the final sort stored in `portfolio_combined`. After assigning the portfolios, we compute the average return within each portfolio for each month. Additionally, we keep the book-to-market portfolio as it makes the computation of the value premium easier. The alternative would be to disaggregate the combined portfolio in a separate step. 
Notice that we weigh the stocks within each portfolio by their market capitalization, i.e., we decide to value-weight our returns.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"bm\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n ),\n portfolio_me = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"me\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_bm, portfolio_me) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )\n```\n:::\n\n\nEquipped with our monthly portfolio returns, we are ready to compute the value premium. However, we still have to decide how to invest in the five high and the five low book-to-market portfolios. The most common approach is to weigh these portfolios equally, but this is yet another researcher's choice. Then, we compute the return differential between the high and low book-to-market portfolios and show the average value premium.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_premium <- value_portfolios |>\n group_by(month, portfolio_bm) |>\n summarize(ret = mean(ret), .groups = \"drop_last\") |>\n summarize(\n value_premium = ret[portfolio_bm == max(portfolio_bm)] -\n ret[portfolio_bm == min(portfolio_bm)]\n ) |> \n summarize(\n value_premium = mean(value_premium)\n )\n```\n:::\n\n\nThe resulting monthly value premium is 0.43 percent with an annualized return of 5.3 percent.\n\n## Dependent Sorts\n\nIn the previous exercise, we assigned the portfolios without considering the second variable in the assignment. This protocol is called independent portfolio sorts. The alternative, i.e., dependent sorts, creates portfolios for the second sorting variable within each bucket of the first sorting variable.\\index{Portfolio sorts!Dependent bivariate} In our example below, we sort firms into five size buckets, and within each of those buckets, we assign firms to five book-to-market portfolios. Hence, we have monthly breakpoints that are specific to each size group. The decision between independent and dependent portfolio sorts is another choice for the researcher. Notice that dependent sorts ensure an equal amount of stocks within each portfolio.\n\nTo implement the dependent sorts, we first create the size portfolios by calling `assign_portfolio()` with `sorting_variable = \"me\"`. Then, we group our data again by month and by the size portfolio before assigning the book-to-market portfolio. The rest of the implementation is the same as before. 
Finally, we compute the value premium.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(portfolio_me = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"me\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_me) |>\n mutate(\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"bm\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_me, portfolio_bm) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )\n\nvalue_premium <- value_portfolios |>\n group_by(month, portfolio_bm) |>\n summarize(ret = mean(ret), .groups = \"drop_last\") |>\n summarize(\n value_premium = ret[portfolio_bm == max(portfolio_bm)] -\n ret[portfolio_bm == min(portfolio_bm)]\n ) |> \n summarize(\n value_premium = mean(value_premium)\n )\n```\n:::\n\n\nThe monthly value premium from dependent sorts is 0.38 percent, which translates to an annualized premium of 4.6 percent per year.\n\nOverall, we show how to conduct bivariate portfolio sorts in this chapter. In one case, we sort the portfolios independently of each other. Yet we also discuss how to create dependent portfolio sorts. Along the lines of [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd), we see how many choices a researcher has to make to implement portfolio sorts, and bivariate sorts increase the number of choices.\n\n## Exercises\n\n1. In [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd), we examine the distribution of market equity. Repeat this analysis for book equity and the book-to-market ratio (alongside a plot of the breakpoints, i.e., deciles).\n1. When we investigate the portfolios, we focus on the returns exclusively. However, it is also of interest to understand the characteristics of the portfolios. Write a function to compute the average characteristics for size and book-to-market across the 25 independently and dependently sorted portfolios.\n1. As for the size premium, also the value premium constructed here does not follow @Fama1993. Implement a p-hacking setup as in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd) to find a premium that comes closest to their HML premium.", + "markdown": "---\ntitle: Value and Bivariate Sorts\naliases:\n - ../value-and-bivariate-sorts.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: Value and Bivariate Sorts with R\n description-meta: Perform bivariate portfolio sorts to test for return predictability in asset pricing applications using the programming language R. \n---\n\n\nIn this chapter, we extend univariate portfolio analysis to bivariate sorts, which means we assign stocks to portfolios based on two characteristics. Bivariate sorts are regularly used in the academic asset pricing literature and are the basis for the Fama and French three factors. However, some scholars also use sorts with three grouping variables. Conceptually, portfolio sorts are easily applicable in higher dimensions.\n\nWe form portfolios on firm size and the book-to-market ratio. To calculate book-to-market ratios, accounting data is required, which necessitates additional steps during portfolio formation. In the end, we demonstrate how to form portfolios on two sorting variables using so-called independent and dependent portfolio sorts.\n\nThe current chapter relies on this set of R packages. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(RSQLite)\n```\n:::\n\n\n## Data Preparation\n\nFirst, we load the necessary data from our `SQLite`-database introduced in [Accessing and Managing Financial Data](accessing-and-managing-financial-data.qmd). We conduct portfolio sorts based on the CRSP sample but keep only the necessary columns in our memory. We use the same data sources for firm size as in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd).\\index{Data!CRSP}\\index{Data!Fama-French factors}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\ncrsp_monthly <- tbl(tidy_finance, \"crsp_monthly\") |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap, mktcap_lag, exchange\n ) |> \n collect() |>\n drop_na()\n```\n:::\n\n\nFurther, we utilize accounting data. The most common source of accounting data is Compustat. We only need book equity data in this application, which we select from our database. Additionally, we convert the variable `datadate` to its monthly value, as we only consider monthly returns here and do not need to account for the exact date. To achieve this, we use the function `floor_date()`.\\index{Data!Compustat}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbook_equity <- tbl(tidy_finance, \"compustat\") |>\n select(gvkey, datadate, be) |>\n collect() |>\n drop_na() |>\n mutate(month = floor_date(ymd(datadate), \"month\"))\n```\n:::\n\n\n## Book-to-Market Ratio\n\nA fundamental problem in handling accounting data is the *look-ahead bias* - we must not include data in forming a portfolio that is not public knowledge at the time. Of course, researchers have more information when looking into the past than agents had at that moment. However, abnormal excess returns from a trading strategy should not rely on an information advantage because the differential cannot be the result of informed agents' trades. Hence, we have to lag accounting information.\n\nWe continue to lag market capitalization and firm size by one month.\\index{Market capitalization}\\index{Firm size} Then, we compute the book-to-market ratio, which relates a firm's book equity to its market equity.\\index{Book equity}\\index{Book-to-market ratio} Firms with high (low) book-to-market ratio are called value (growth) firms. After matching the accounting and market equity information from the same month, we lag book-to-market by six months. This is a sufficiently conservative approach because accounting information is usually released well before six months pass. However, in the asset pricing literature, even longer lags are used as well.^[The definition of a time lag is another choice a researcher has to make, similar to breakpoint choices as we describe in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd).]\n\nHaving both variables, i.e., firm size lagged by one month and book-to-market lagged by six months, we merge these sorting variables to our returns using the `sorting_date`-column created for this purpose. The final step in our data preparation deals with differences in the frequency of our variables. Returns and firm size are recorded monthly. Yet the accounting information is only released on an annual basis. Hence, we only match book-to-market to one month per year and have eleven empty observations. 
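To make this frequency mismatch concrete, the following minimal sketch with hypothetical values shows the situation for a single firm: the ratio is observed in one month and missing in the months that follow.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Hypothetical one-firm example: an annual book-to-market ratio is observed in\n# a single month, while the following months remain empty until the next report.\ntibble(\n  month = seq(ymd(\"2021-06-01\"), ymd(\"2021-11-01\"), by = \"1 month\"),\n  bm = c(0.80, NA, NA, NA, NA, NA)\n)\n```\n:::\n\n\n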
To solve this frequency issue, we carry the latest book-to-market ratio of each firm to the subsequent months, i.e., we fill the missing observations with the most current report. This is done via the `fill()`-function after sorting by date and firm (which we identify by `permno` and `gvkey`) and on a firm basis (which we do by `group_by()` as usual). We filter out all observations with accounting data that is older than a year. As the last step, we remove all rows with missing entries because the returns cannot be matched to any annual report.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nme <- crsp_monthly |>\n mutate(sorting_date = month %m+% months(1)) |>\n select(permno, sorting_date, me = mktcap)\n\nbm <- book_equity |>\n inner_join(crsp_monthly, join_by(gvkey, month)) |>\n mutate(\n bm = be / mktcap,\n sorting_date = month %m+% months(6),\n comp_date = sorting_date\n ) |>\n select(permno, gvkey, sorting_date, comp_date, bm)\n\ndata_for_sorts <- crsp_monthly |>\n left_join(\n bm, join_by(permno, gvkey, month == sorting_date)\n ) |>\n left_join(\n me, join_by(permno, month == sorting_date)\n ) |>\n select(\n permno, gvkey, month, ret_excess,\n mktcap_lag, me, bm, exchange, comp_date\n )\n\ndata_for_sorts <- data_for_sorts |>\n arrange(permno, gvkey, month) |>\n group_by(permno, gvkey) |>\n fill(bm, comp_date) |>\n ungroup() |> \n filter(comp_date > month %m-% months(12)) |>\n select(-comp_date) |>\n drop_na()\n```\n:::\n\n\nThe last step of preparation for the portfolio sorts is the computation of breakpoints. We continue to use the same function allowing for the specification of exchanges to use for the breakpoints. Additionally, we reintroduce the argument `sorting_variable` into the function for defining different sorting variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassign_portfolio <- function(data, \n sorting_variable, \n n_portfolios, \n exchanges) {\n breakpoints <- data |>\n filter(exchange %in% exchanges) |>\n pull({{ sorting_variable }}) |>\n quantile(\n probs = seq(0, 1, length.out = n_portfolios + 1),\n na.rm = TRUE,\n names = FALSE\n )\n\n assigned_portfolios <- data |>\n mutate(portfolio = findInterval(\n pick(everything()) |>\n pull({{ sorting_variable }}),\n breakpoints,\n all.inside = TRUE\n )) |>\n pull(portfolio)\n \n return(assigned_portfolios)\n}\n```\n:::\n\n\nAfter these data preparation steps, we present bivariate portfolio sorts on an independent and dependent basis.\n\n## Independent Sorts\n\nBivariate sorts create portfolios within a two-dimensional space spanned by two sorting variables. It is then possible to assess the return impact of either sorting variable by the return differential from a trading strategy that invests in the portfolios at either end of the respective variables spectrum. We create a five-by-five matrix using book-to-market and firm size as sorting variables in our example below. We end up with 25 portfolios. 
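To fix ideas, the following sketch lays out these 25 cells with hypothetical labels and marks which cells enter the long and short legs of the value strategy described next.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch with hypothetical labels: enumerate the 25 size/book-to-market cells\n# and mark the cells that form the long and short legs of the value strategy.\nexpand_grid(portfolio_me = 1:5, portfolio_bm = 1:5) |>\n  mutate(leg = case_when(\n    portfolio_bm == 5 ~ \"long\",\n    portfolio_bm == 1 ~ \"short\",\n    .default = \"not used\"\n  ))\n```\n:::\n\n\n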
Since we are interested in the *value premium* (i.e., the return differential between high and low book-to-market firms), we go long the five portfolios of the highest book-to-market firms and short the five portfolios of the lowest book-to-market firms.\\index{Value premium} The five portfolios at each end are due to the size splits we employed alongside the book-to-market splits.\n\nTo implement the independent bivariate portfolio sort, we assign monthly portfolios for each of our sorting variables separately to create the variables `portfolio_bm` and `portfolio_me`, respectively.\\index{Portfolio sorts!Independent bivariate} Then, these separate portfolios are combined to the final sort stored in `portfolio_combined`. After assigning the portfolios, we compute the average return within each portfolio for each month. Additionally, we keep the book-to-market portfolio as it makes the computation of the value premium easier. The alternative would be to disaggregate the combined portfolio in a separate step. Notice that we weigh the stocks within each portfolio by their market capitalization, i.e., we decide to value-weight our returns.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"bm\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n ),\n portfolio_me = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"me\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_bm, portfolio_me) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )\n```\n:::\n\n\nEquipped with our monthly portfolio returns, we are ready to compute the value premium. However, we still have to decide how to invest in the five high and the five low book-to-market portfolios. The most common approach is to weigh these portfolios equally, but this is yet another researcher's choice. Then, we compute the return differential between the high and low book-to-market portfolios and show the average value premium.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_premium <- value_portfolios |>\n group_by(month, portfolio_bm) |>\n summarize(ret = mean(ret), .groups = \"drop_last\") |>\n summarize(\n value_premium = ret[portfolio_bm == max(portfolio_bm)] -\n ret[portfolio_bm == min(portfolio_bm)]\n ) |> \n summarize(\n value_premium = mean(value_premium)\n )\n```\n:::\n\n\nThe resulting monthly value premium is 0.43 percent with an annualized return of 5.3 percent.\n\n## Dependent Sorts\n\nIn the previous exercise, we assigned the portfolios without considering the second variable in the assignment. This protocol is called independent portfolio sorts. The alternative, i.e., dependent sorts, creates portfolios for the second sorting variable within each bucket of the first sorting variable.\\index{Portfolio sorts!Dependent bivariate} In our example below, we sort firms into five size buckets, and within each of those buckets, we assign firms to five book-to-market portfolios. Hence, we have monthly breakpoints that are specific to each size group. The decision between independent and dependent portfolio sorts is another choice for the researcher. Notice that dependent sorts ensure an equal amount of stocks within each portfolio.\n\nTo implement the dependent sorts, we first create the size portfolios by calling `assign_portfolio()` with `sorting_variable = \"me\"`. 
Then, we group our data again by month and by the size portfolio before assigning the book-to-market portfolio. The rest of the implementation is the same as before. Finally, we compute the value premium.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalue_portfolios <- data_for_sorts |>\n group_by(month) |>\n mutate(portfolio_me = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"me\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_me) |>\n mutate(\n portfolio_bm = assign_portfolio(\n data = pick(everything()),\n sorting_variable = \"bm\",\n n_portfolios = 5,\n exchanges = c(\"NYSE\")\n )) |>\n group_by(month, portfolio_me, portfolio_bm) |>\n summarize(\n ret = weighted.mean(ret_excess, mktcap_lag),\n .groups = \"drop\"\n )\n\nvalue_premium <- value_portfolios |>\n group_by(month, portfolio_bm) |>\n summarize(ret = mean(ret), .groups = \"drop_last\") |>\n summarize(\n value_premium = ret[portfolio_bm == max(portfolio_bm)] -\n ret[portfolio_bm == min(portfolio_bm)]\n ) |> \n summarize(\n value_premium = mean(value_premium)\n )\n```\n:::\n\n\nThe monthly value premium from dependent sorts is 0.38 percent, which translates to an annualized premium of 4.6 percent per year.\n\nOverall, we show how to conduct bivariate portfolio sorts in this chapter. In one case, we sort the portfolios independently of each other. Yet we also discuss how to create dependent portfolio sorts. Along the lines of [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd), we see how many choices a researcher has to make to implement portfolio sorts, and bivariate sorts increase the number of choices.\n\n## Exercises\n\n1. In [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd), we examine the distribution of market equity. Repeat this analysis for book equity and the book-to-market ratio (alongside a plot of the breakpoints, i.e., deciles).\n1. When we investigate the portfolios, we focus on the returns exclusively. However, it is also of interest to understand the characteristics of the portfolios. Write a function to compute the average characteristics for size and book-to-market across the 25 independently and dependently sorted portfolios.\n1. As for the size premium, also the value premium constructed here does not follow @Fama1993. Implement a p-hacking setup as in [Size Sorts and P-Hacking](size-sorts-and-p-hacking.qmd) to find a premium that comes closest to their HML premium.", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/r/wrds-crsp-and-compustat/execute-results/html.json b/_freeze/r/wrds-crsp-and-compustat/execute-results/html.json index 4525d06a..d2a6969d 100644 --- a/_freeze/r/wrds-crsp-and-compustat/execute-results/html.json +++ b/_freeze/r/wrds-crsp-and-compustat/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "8473e5ae4cd380c44e664a41545391d8", + "hash": "18c85d354f045833d161cea91bd91e35", "result": { "engine": "knitr", - "markdown": "---\ntitle: WRDS, CRSP, and Compustat\naliases:\n - ../wrds-crsp-and-compustat.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: WRDS, CRSP, and Compustat with R\n description-meta: Use the programming language R to download stock and firm data such as CRSP and Compustat from WRDS. \n---\n\n\nThis chapter shows how to connect to [Wharton Research Data Services (WRDS)](https://wrds-www.wharton.upenn.edu/), a popular provider of financial and economic data for research applications. 
We use this connection to download the most commonly used data for stock and firm characteristics, CRSP and Compustat. Unfortunately, this data is not freely available, but most students and researchers typically have access to WRDS through their university libraries. Assuming that you have access to WRDS, we show you how to prepare and merge the databases and store them in the `SQLite`-database introduced in the previous chapter. We conclude this chapter by providing some tips for working with the WRDS database.\\index{WRDS}\n\nFirst, we load the R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(scales)\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nWe use the same date range as in the previous chapter to ensure consistency.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2022-12-31\")\n```\n:::\n\n\n## Accessing WRDS\n\nWRDS is the most widely used source for asset and firm-specific financial data used in academic settings. WRDS is a data platform that provides data validation, flexible delivery options, and access to many different data sources. The data at WRDS is also organized in an SQL database, although they use the [PostgreSQL](https://www.postgresql.org/) engine. This database engine is just as easy to handle with R as SQLite. We use the `RPostgres` package to establish a connection to the WRDS database [@RPostgres]. Note that you could also use the `odbc` package to connect to a PostgreSQL database, but then you need to install the appropriate drivers yourself. `RPostgres` already contains a suitable driver.\\index{Database!PostgreSQL}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RPostgres)\n```\n:::\n\n\nTo establish a connection, you use the function `dbConnect()` with the following arguments. Note that you need to replace the `WRDS_USER` and `WRDS_PASSWORD` arguments with your own credentials. We defined environment variables for the purpose of this book because we obviously do not want (and are not allowed) to share our credentials with the rest of the world (these environment variables are stored in an `.Renviron`-file in our project directory and loaded with the `Sys.getenv()` function).\n\nAdditionally, you have to use multi-factor (i.e., two-factor) authentication since May 2023 when establishing a `PostgreSQL` or other remote connections. You have two choices to provide the additional identification. First, if you have Duo Push enabled for your WRDS account, you will receive a push notification on your mobile phone when trying to establish a connection with the code below. Upon accepting the notification, you can continue your work. Second, you can log in to a WRDS website that requires multi-factor authentication with your username and the same IP address. Once you have successfully identified yourself on the website, your username-IP combination will be remembered for 30 days, and you can comfortably use the remote connection below.\\index{Two-factor authentication}\\index{WRDS:Two-factor authentication}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrds <- dbConnect(\n Postgres(),\n host = \"wrds-pgdata.wharton.upenn.edu\",\n dbname = \"wrds\",\n port = 9737,\n sslmode = \"require\",\n user = Sys.getenv(\"WRDS_USER\"),\n password = Sys.getenv(\"WRDS_PASSWORD\")\n)\n```\n:::\n\n\nThe remote connection to WRDS is very useful. Yet, the database itself contains many different tables. 
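As a preview of the query-based approach mentioned below, the following sketch uses the existing `wrds` connection and PostgreSQL's standard information schema to list the tables in the `crsp` schema; treat it as an illustration rather than part of our pipeline.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch (assuming the wrds connection from above): list tables in the CRSP\n# schema via PostgreSQL's standard information schema.\nres <- dbSendQuery(wrds, \"\n  SELECT table_name\n  FROM information_schema.tables\n  WHERE table_schema = 'crsp'\n\")\ndbFetch(res, n = 10)\ndbClearResult(res)\n```\n:::\n\n\n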
You can check the WRDS homepage to identify the table's name you are looking for (if you go beyond our exposition). Alternatively, you can also query the data structure with the function `dbSendQuery()`. If you are interested, there is an exercise below that is based on WRDS' tutorial on [\"Querying WRDS Data using R\".](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/) Furthermore, the penultimate section of this chapter shows how to investigate the structure of databases.\n\n## Downloading and Preparing CRSP\n\n\\index{Data!CRSP}[The Center for Research in Security Prices (CRSP)](https://crsp.org/) provides the most widely used data for US stocks. We use the `wrds` connection object that we just created to first access monthly CRSP return data. Actually, we need three tables to get the desired data: (i) the CRSP monthly security file,\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsf_db <- tbl(wrds, in_schema(\"crsp\", \"msf\"))\n```\n:::\n\n\n(ii) the identifying information,\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsenames_db <- tbl(wrds, in_schema(\"crsp\", \"msenames\"))\n```\n:::\n\n\nand (iii) the delisting information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsedelist_db <- tbl(wrds, in_schema(\"crsp\", \"msedelist\"))\n```\n:::\n\n\nWe use the three remote tables to fetch the data we want to put into our local database. Just as above, the idea is that we let the WRDS database do all the work and just download the data that we actually need. We apply common filters and data selection criteria to narrow down our data of interest: (i) we keep only data in the time windows of interest, (ii) we keep only US-listed stocks as identified via share codes `shrcd` 10 and 11, and (iii) we keep only months within permno-specific start dates `namedt` and end dates `nameendt`. In addition, we add delisting codes and returns. You can read up in the great textbook of @BaliEngleMurray2016 for an extensive discussion on the filters we apply in the code below.\\index{Permno}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- msf_db |>\n filter(date >= start_date & date <= end_date) |>\n inner_join(\n msenames_db |>\n filter(shrcd %in% c(10, 11)) |>\n select(permno, exchcd, siccd, namedt, nameendt),\n by = c(\"permno\")\n ) |>\n filter(date >= namedt & date <= nameendt) |>\n mutate(month = floor_date(date, \"month\")) |>\n left_join(\n msedelist_db |>\n select(permno, dlstdt, dlret, dlstcd) |>\n mutate(month = floor_date(dlstdt, \"month\")),\n by = c(\"permno\", \"month\")\n ) |>\n select(\n permno, # Security identifier\n date, # Date of the observation\n month, # Month of the observation\n ret, # Return\n shrout, # Shares outstanding (in thousands)\n altprc, # Last traded price in a month\n exchcd, # Exchange code\n siccd, # Industry code\n dlret, # Delisting return\n dlstcd # Delisting code\n ) |>\n collect() |>\n mutate(\n month = ymd(month),\n shrout = shrout * 1000\n )\n```\n:::\n\n\nNow, we have all the relevant monthly return data in memory and proceed with preparing the data for future analyses. We perform the preparation step at the current stage since we want to avoid executing the same mutations every time we use the data in subsequent chapters. 
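If you want to convince yourself of what exactly arrived in memory, an optional quick check is the following sketch (not part of the preparation steps).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Optional sanity check: inspect the downloaded columns and a few example values.\nglimpse(crsp_monthly)\n```\n:::\n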
\n\nThe first additional variable we create is market capitalization (`mktcap`), which is the product of the number of outstanding shares `shrout` and the last traded price in a month `altprc`.\\index{Market capitalization} Note that in contrast to returns `ret`, these two variables are not adjusted ex-post for any corporate actions like stock splits. Moreover, the `altprc` is negative whenever the last traded price does not exist, and CRSP decides to report the mid-quote of the last available order book instead. Hence, we take the absolute value of the market cap. We also keep the market cap in millions of USD just for convenience as we do not want to print huge numbers in our figures and tables. In addition, we set zero market cap to missing as it makes conceptually little sense (i.e., the firm would be bankrupt).\\index{Stock price}\\index{Returns}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(\n mktcap = abs(shrout * altprc) / 10^6,\n mktcap = na_if(mktcap, 0)\n )\n```\n:::\n\n\nThe next variable we frequently use is the one-month *lagged* market capitalization. Lagged market capitalization is typically used to compute value-weighted portfolio returns, as we demonstrate in a later chapter. The most simple and consistent way to add a column with lagged market cap values is to add one month to each observation and then join the information to our monthly CRSP data.\\index{Weighting!Value}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmktcap_lag <- crsp_monthly |>\n mutate(month = month %m+% months(1)) |>\n select(permno, month, mktcap_lag = mktcap)\n\ncrsp_monthly <- crsp_monthly |>\n left_join(mktcap_lag, by = c(\"permno\", \"month\"))\n```\n:::\n\n\nIf you wonder why we do not use the `lag()` function, e.g., via `crsp_monthly |> group_by(permno) |> mutate(mktcap_lag = lag(mktcap))`, take a look at the exercises.\n\nNext, we follow @BaliEngleMurray2016 in transforming listing exchange codes to explicit exchange names. \\index{Exchange!Exchange codes}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(exchange = case_when(\n exchcd %in% c(1, 31) ~ \"NYSE\",\n exchcd %in% c(2, 32) ~ \"AMEX\",\n exchcd %in% c(3, 33) ~ \"NASDAQ\",\n .default = \"Other\"\n ))\n```\n:::\n\n\nSimilarly, we transform industry codes to industry descriptions following @BaliEngleMurray2016.\\index{Industry codes} Notice that there are also other categorizations of industries [e.g., @FamaFrench1997] that are commonly used.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(industry = case_when(\n siccd >= 1 & siccd <= 999 ~ \"Agriculture\",\n siccd >= 1000 & siccd <= 1499 ~ \"Mining\",\n siccd >= 1500 & siccd <= 1799 ~ \"Construction\",\n siccd >= 2000 & siccd <= 3999 ~ \"Manufacturing\",\n siccd >= 4000 & siccd <= 4899 ~ \"Transportation\",\n siccd >= 4900 & siccd <= 4999 ~ \"Utilities\",\n siccd >= 5000 & siccd <= 5199 ~ \"Wholesale\",\n siccd >= 5200 & siccd <= 5999 ~ \"Retail\",\n siccd >= 6000 & siccd <= 6799 ~ \"Finance\",\n siccd >= 7000 & siccd <= 8999 ~ \"Services\",\n siccd >= 9000 & siccd <= 9999 ~ \"Public\",\n TRUE ~ \"Missing\"\n ))\n```\n:::\n\n\nWe also construct returns adjusted for delistings as described by @BaliEngleMurray2016. The delisting of a security usually results when a company ceases operations, declares bankruptcy, merges, does not meet listing requirements, or seeks to become private. 
The adjustment tries to reflect the returns of investors who bought the stock in the month before the delisting and held it until the delisting date. After this transformation, we can drop the delisting returns and codes.\\index{Returns!Delisting}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(ret_adj = case_when(\n is.na(dlstcd) ~ ret,\n !is.na(dlstcd) & !is.na(dlret) ~ dlret,\n dlstcd %in% c(500, 520, 580, 584) |\n (dlstcd >= 551 & dlstcd <= 574) ~ -0.30,\n dlstcd == 100 ~ ret,\n TRUE ~ -1\n )) |>\n select(-c(dlret, dlstcd))\n```\n:::\n\n\nNext, we compute excess returns by subtracting the monthly risk-free rate provided by our Fama-French data.\\index{Returns!Excess}\\index{Risk-free rate} As we base all our analyses on the excess returns, we can drop adjusted returns and the risk-free rate from our tibble. Note that we ensure excess returns are bounded by -1 from below as a return less than -100% makes no sense conceptually. Before we can adjust the returns, we have to connect to our database and load the table `factors_ff3_monthly`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, rf) |>\n collect()\n\ncrsp_monthly <- crsp_monthly |>\n left_join(factors_ff3_monthly,\n by = \"month\"\n ) |>\n mutate(\n ret_excess = ret_adj - rf,\n ret_excess = pmax(ret_excess, -1)\n ) |>\n select(-ret_adj, -rf)\n```\n:::\n\n\nSince excess returns and market capitalization are crucial for all our analyses, we can safely exclude all observations with missing returns or market capitalization. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n drop_na(ret_excess, mktcap, mktcap_lag)\n```\n:::\n\n\nFinally, we store the monthly CRSP file in our database. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"crsp_monthly\",\n value = crsp_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\n## First Glimpse of the CRSP Sample\n\nBefore we move on to other data sources, let us look at some descriptive statistics of the CRSP sample, which is our main source for stock returns. \n\n@fig-211 shows the monthly number of securities by listing exchange over time. NYSE has the longest history in the data, but NASDAQ lists a considerably large number of stocks. The number of stocks listed on AMEX decreased steadily over the last couple of decades. By the end of 2022, there were 2,778 stocks with a primary listing on NASDAQ, 1,358 on NYSE, 162 on AMEX, and only one belonged to the other category. \\index{Exchange!NYSE}\\index{Exchange!AMEX}\\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n count(exchange, date) |>\n ggplot(aes(x = date, y = n, color = exchange, linetype = exchange)) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly number of securities by listing exchange\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Number of stocks in the CRSP sample listed at each of the US exchanges.](wrds-crsp-and-compustat_files/figure-html/fig-211-1.png){#fig-211 fig-alt='Title: Monthly number of securities by listing exchange. The figure shows a line chart with the number of securities by listing exchange from 1960 to 2022. In the earlier period, NYSE dominated as a listing exchange. 
There is a strong upwards trend for NASDAQ. Other listing exchanges do only play a minor role.' width=2100}\n:::\n:::\n\n\nNext, we look at the aggregate market capitalization grouped by the respective listing exchanges in @fig-212. To ensure that we look at meaningful data which is comparable over time, we adjust the nominal values for inflation. In fact, we can use the tables that are already in our database to calculate aggregate market caps by listing exchange and plotting it just as if they were in memory. All values in @fig-212 are at the end of 2022 USD to ensure intertemporal comparability. NYSE-listed stocks have by far the largest market capitalization, followed by NASDAQ-listed stocks.\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntbl(tidy_finance, \"crsp_monthly\") |>\n left_join(tbl(tidy_finance, \"cpi_monthly\"), by = \"month\") |>\n group_by(month, exchange) |>\n summarize(\n mktcap = sum(mktcap, na.rm = TRUE) / cpi,\n .groups = \"drop\"\n ) |>\n collect() |>\n mutate(month = ymd(month)) |>\n ggplot(aes(\n x = month, y = mktcap / 1000,\n color = exchange, linetype = exchange\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly market cap by listing exchange in billions of Dec 2022 USD\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Market capitalization is measured in billion USD, adjusted for consumer price index changes such that the values on the horizontal axis reflect the buying power of billion USD in December 2022.](wrds-crsp-and-compustat_files/figure-html/fig-212-1.png){#fig-212 fig-alt='Title: Monthly market cap by listing exchange in billion USD as of Dec 2022. The figure shows a line chart of the total market capitalization of all stocks aggregated by the listing exchange from 1960 to 2022, with years on the horizontal axis and the corresponding market capitalization on the vertical axis. Historically, NYSE listed stocks had the highest market capitalization. In the more recent past, the valuation of NASDAQ listed stocks exceeded that of NYSE listed stocks.' width=2100}\n:::\n:::\n\n\nOf course, performing the computation in the database is not really meaningful because we can easily pull all the required data into our memory. The code chunk above is slower than performing the same steps on tables that are already in memory. However, we just want to illustrate that you can perform many things in the database before loading the data into your memory. Before we proceed, we load the monthly CPI data.\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncpi_monthly <- tbl(tidy_finance, \"cpi_monthly\") |>\n collect()\n```\n:::\n\n\nNext, we look at the same descriptive statistics by industry. @fig-213 plots the number of stocks in the sample for each of the SIC industry classifiers. For most of the sample period, the largest share of stocks is in manufacturing, albeit the number peaked somewhere in the 90s. 
The number of firms associated with public administration seems to be the only category on the rise in recent years, even surpassing manufacturing at the end of our sample period.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_industry <- crsp_monthly |>\n left_join(cpi_monthly, by = \"month\") |>\n group_by(month, industry) |>\n summarize(\n securities = n_distinct(permno),\n mktcap = sum(mktcap) / mean(cpi),\n .groups = \"drop\"\n )\n\ncrsp_monthly_industry |>\n ggplot(aes(\n x = month,\n y = securities,\n color = industry,\n linetype = industry\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly number of securities by industry\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Number of stocks in the CRSP sample associated with different industries.](wrds-crsp-and-compustat_files/figure-html/fig-213-1.png){#fig-213 fig-alt='Title: Monthly number of securities by industry. The figure shows a line chart of the number of securities by industry from 1960 to 2022 with years on the horizontal axis and the corresponding number on the vertical axis. Except for stocks that are assigned to the industry public administration, the number of listed stocks decreased steadily at least since 1996. As of 2022, the segment of firms within public administration is the largest in terms of the number of listed stocks.' width=2100}\n:::\n:::\n\n\nWe also compute the market cap of all stocks belonging to the respective industries and show the evolution over time in @fig-214. All values are again in terms of billions of end of 2022 USD. At all points in time, manufacturing firms comprise of the largest portion of market capitalization. Toward the end of the sample, however, financial firms and services begin to make up a substantial portion of the market cap.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_industry |>\n ggplot(aes(\n x = month,\n y = mktcap / 1000,\n color = industry,\n linetype = industry\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly total market cap by industry in billions as of Dec 2022 USD\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Market capitalization is measured in billion USD, adjusted for consumer price index changes such that the values on the y-axis reflect the buying power of billion USD in December 2022.](wrds-crsp-and-compustat_files/figure-html/fig-214-1.png){#fig-214 fig-alt='Title: Monthly total market cap by industry in billions as of Dec 2022 USD. The figure shows a line chart of total market capitalization of all stocks in the CRSP sample aggregated by industry from 1960 to 2022 with years on the horizontal axis and the corresponding market capitalization on the vertical axis. Stocks in the manufacturing sector have always had the highest market valuation. The figure shows a general upwards trend during the most recent past. ' width=2100}\n:::\n:::\n\n\n## Daily CRSP Data\n\nBefore we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. 
This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session. \n\nThere is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easier to handle.\\index{Big data} That is, instead of downloading data about all stocks at once, download the data in small batches of stocks consecutively. Such operations can be implemented in `for()`-loops,\\index{For-loops} where we download, prepare, and store the data for a small number of stocks in each iteration. This operation might nonetheless take around 5 minutes, depending on your internet connection. To keep track of the progress, we create ad-hoc progress updates using `cat()`. Notice that we also use the function `dbWriteTable()` here with the option to append the new data to an existing table, when we process the second and all following batches. \n\nAlso notice that we adjust the returns for delistings in a different manner than for monthly returns: \n\n1. If a return is recorded in CRSP daily for a delisting date `dlstdt`, then we replace it with the delisting return `dlret`. \n2. If no return is recorded in CRSP daily for an existing delisting date, then we add a new row with the delisting return. \n3. We ensure that no returns after the delisting date `dlstdt` remain in the final table. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndsf_db <- tbl(wrds, in_schema(\"crsp\", \"dsf\"))\n\nfactors_ff3_daily <- tbl(tidy_finance, \"factors_ff3_daily\") |>\n collect()\n\npermnos <- tbl(tidy_finance, \"crsp_monthly\") |>\n distinct(permno) |>\n pull()\n\nbatch_size <- 500\nbatches <- ceiling(length(permnos) / batch_size)\n\nfor (j in 1:batches) {\n \n permno_batch <- permnos[\n ((j - 1) * batch_size + 1):min(j * batch_size, length(permnos))\n ]\n\n crsp_daily_sub <- dsf_db |>\n filter(permno %in% permno_batch &\n date >= start_date & date <= end_date) |>\n select(permno, date, ret) |>\n collect() |>\n drop_na()\n\n if (nrow(crsp_daily_sub) > 0) {\n \n msedelist_sub <- msedelist_db |> \n filter(permno %in% permno_batch) |> \n select(permno, dlstdt, dlret) |> \n collect() |>\n drop_na()\n \n crsp_daily_sub <- crsp_daily_sub |> \n left_join(msedelist_sub, by = c(\"permno\", \"date\"=\"dlstdt\")) |> \n bind_rows(msedelist_sub |> \n anti_join(crsp_daily_sub, \n by = c(\"permno\", \"dlstdt\" = \"date\"))) |> \n mutate(ret = if_else(!is.na(dlret), dlret, ret),\n date = if_else(!is.na(dlstdt), dlstdt, date)) |> \n select(-c(dlret, dlstdt)) |> \n left_join(msedelist_sub |> \n select(permno, dlstdt), by = \"permno\") |> \n mutate(dlstdt = replace_na(dlstdt, end_date)) |> \n filter(date <= dlstdt) |> \n select(-dlstdt)\n \n crsp_daily_sub <- crsp_daily_sub |>\n mutate(month = floor_date(date, \"month\")) |>\n left_join(factors_ff3_daily |>\n select(date, rf), by = \"date\") |>\n mutate(\n ret_excess = ret - rf,\n ret_excess = pmax(ret_excess, -1)\n ) |>\n select(permno, date, month, ret, ret_excess)\n\n dbWriteTable(tidy_finance,\n \"crsp_daily\",\n value = crsp_daily_sub,\n overwrite = ifelse(j == 1, TRUE, FALSE),\n append = ifelse(j != 1, TRUE, FALSE)\n )\n }\n\n cat(\"Batch\", j, \"out of\", batches, \"done (\", percent(j / batches), \")\\n\")\n}\n```\n:::\n\n\nEventually, we end up with more than 71 million rows of daily return 
data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data that we actually use. \n\n## Preparing Compustat data\n\nFirm accounting data are an important source of information that we use in portfolio analyses in subsequent chapters. The commonly used source for firm financial information is Compustat provided by [S&P Global Market Intelligence,](https://www.spglobal.com/marketintelligence/en/) which is a global data vendor that provides financial, statistical, and market information on active and inactive companies throughout the world.\\index{Data!Compustat} For US and Canadian companies, annual history is available back to 1950 and quarterly as well as monthly histories date back to 1962.\n\nTo access Compustat data, we can again tap WRDS, which hosts the `funda` table that contains annual firm-level information on North American companies.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfunda_db <- tbl(wrds, in_schema(\"comp\", \"funda\"))\n```\n:::\n\n\nWe follow the typical filter conventions and pull only data that we actually need: (i) we get only records in industrial data format, (ii) in the standard format (i.e., consolidated information in standard presentation), and (iii) only data in the desired time window.\\index{Gvkey}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- funda_db |>\n filter(\n indfmt == \"INDL\" &\n datafmt == \"STD\" &\n consol == \"C\" &\n datadate >= start_date & datadate <= end_date\n ) |>\n select(\n gvkey, # Firm identifier\n datadate, # Date of the accounting data\n seq, # Stockholders' equity\n ceq, # Total common/ordinary equity\n at, # Total assets\n lt, # Total liabilities\n txditc, # Deferred taxes and investment tax credit\n txdb, # Deferred taxes\n itcb, # Investment tax credit\n pstkrv, # Preferred stock redemption value\n pstkl, # Preferred stock liquidating value\n pstk, # Preferred stock par value\n capx, # Capital investment\n oancf, # Operating cash flow\n sale, # Revenue\n cogs, # Costs of goods sold\n xint, # Interest expense\n xsga # Selling, general, and administrative expenses\n ) |>\n collect()\n```\n:::\n\n\nNext, we calculate the book value of preferred stock and equity `be` and the operating profitability `op` inspired by the [variable definitions in Ken French's data library.](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/variable_definitions.html) Note that we set negative or zero equity to missing which is a common practice when working with book-to-market ratios [see @Fama1992 for details].\\index{Book equity}\\index{Preferred stock}\\index{Operating profitability}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |>\n mutate(\n be = coalesce(seq, ceq + pstk, at - lt) +\n coalesce(txditc, txdb + itcb, 0) -\n coalesce(pstkrv, pstkl, pstk, 0),\n be = if_else(be <= 0, as.numeric(NA), be),\n op = (sale - coalesce(cogs, 0) - \n coalesce(xsga, 0) - coalesce(xint, 0)) / be,\n )\n```\n:::\n\n\nWe keep only the last available information for each firm-year group. Note that `datadate` defines the time the corresponding financial data refers to (e.g., annual report as of December 31, 2022). Therefore, `datadate` is not the date when data was made available to the public. 
Check out the exercises for more insights into the peculiarities of `datadate`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |>\n mutate(year = year(datadate)) |>\n group_by(gvkey, year) |>\n filter(datadate == max(datadate)) |>\n ungroup()\n```\n:::\n\n\nWe also compute the investment ratio `inv` according to Ken French's variable definitions as the change in total assets from one fiscal year to another. Note that we again use the approach using joins as introduced with the CRSP data above to construct lagged assets.\\index{Investment ratio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |> \n left_join(\n compustat |> \n select(gvkey, year, at_lag = at) |> \n mutate(year = year + 1), by = c(\"gvkey\", \"year\")\n ) |> \n mutate(\n inv = at / at_lag - 1,\n inv = if_else(at_lag <= 0, as.numeric(NA), inv)\n )\n```\n:::\n\n\nWith the last step, we are already done preparing the firm fundamentals. Thus, we can store them in our local database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"compustat\",\n value = compustat,\n overwrite = TRUE\n)\n```\n:::\n\n\n## Merging CRSP with Compustat\n\nUnfortunately, CRSP and Compustat use different keys to identify stocks and firms. CRSP uses `permno` for stocks, while Compustat uses `gvkey` to identify firms. Fortunately, a curated matching table on WRDS allows us to merge CRSP and Compustat, so we create a connection to the *CRSP-Compustat Merged* table (provided by CRSP).\\index{Data!Crsp-Compustat Merged}\\index{Permno}\\index{Gvkey}\\index{Data!Linking table}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nccmxpf_linktable_db <- tbl(\n wrds,\n in_schema(\"crsp\", \"ccmxpf_linktable\")\n)\n```\n:::\n\n\nThe linking table contains links between CRSP and Compustat identifiers from various approaches. However, we need to make sure that we keep only relevant and correct links, again following the description outlined in @BaliEngleMurray2016. Note also that currently active links have no end date, so we just enter the current date via `today()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nccmxpf_linktable <- ccmxpf_linktable_db |>\n filter(linktype %in% c(\"LU\", \"LC\") &\n linkprim %in% c(\"P\", \"C\") &\n usedflag == 1) |>\n select(permno = lpermno, gvkey, linkdt, linkenddt) |>\n collect() |>\n mutate(linkenddt = replace_na(linkenddt, today()))\n```\n:::\n\n\nWe use these links to create a new table with a mapping between stock identifier, firm identifier, and month. We then add these links to the Compustat `gvkey` to our monthly stock data. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nccm_links <- crsp_monthly |>\n inner_join(ccmxpf_linktable, \n by = \"permno\", relationship = \"many-to-many\") |>\n filter(!is.na(gvkey) & \n (date >= linkdt & date <= linkenddt)) |>\n select(permno, gvkey, date)\n\ncrsp_monthly <- crsp_monthly |>\n left_join(ccm_links, by = c(\"permno\", \"date\"))\n```\n:::\n\n\nAs the last step, we update the previously prepared monthly CRSP file with the linking information in our local database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"crsp_monthly\",\n value = crsp_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\nBefore we close this chapter, let us look at an interesting descriptive statistic of our data. As the book value of equity plays a crucial role in many asset pricing applications, it is interesting to know for how many of our stocks this information is available. Hence, @fig-215 plots the share of securities with book equity values for each exchange. 
It turns out that the coverage is pretty bad for AMEX- and NYSE-listed stocks in the 60s but hovers around 80% for all periods thereafter. We can ignore the erratic coverage of securities that belong to the other category since there is only a handful of them anyway in our sample.\\index{Exchange!NYSE}\\index{Exchange!AMEX}\\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n group_by(permno, year = year(month)) |>\n filter(date == max(date)) |>\n ungroup() |>\n left_join(compustat, by = c(\"gvkey\", \"year\")) |>\n group_by(exchange, year) |>\n summarize(\n share = n_distinct(permno[!is.na(be)]) / n_distinct(permno),\n .groups = \"drop\"\n ) |>\n ggplot(aes(\n x = year, \n y = share, \n color = exchange,\n linetype = exchange\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Share of securities with book equity values by exchange\"\n ) +\n scale_y_continuous(labels = percent) +\n coord_cartesian(ylim = c(0, 1))\n```\n\n::: {.cell-output-display}\n![End-of-year share of securities with book equity values by listing exchange.](wrds-crsp-and-compustat_files/figure-html/fig-215-1.png){#fig-215 fig-alt='Title: Share of securities with book equity values by exchange. The figure shows a line chart of end-of-year shares of securities with book equity values by exchange from 1960 to 2022 with years on the horizontal axis and the corresponding share on the vertical axis. After an initial period with lower coverage in the early 1960s, typically, more than 80 percent of the entries in the CRSP sample have information about book equity values from Compustat.' width=2100}\n:::\n:::\n\n\n## Some Tricks for PostgreSQL Databases\n\nAs we mentioned above, the WRDS database runs on PostgreSQL rather than SQLite. Finding the right tables for your data needs can be tricky in the WRDS PostgreSQL instance, as the tables are organized in schemas.\\index{Database!Schema} If you wonder what the purpose of schemas is, check out [this documetation.](https://www.postgresql.org/docs/9.1/ddl-schemas.html) For instance, if you want to find all tables that live in the `crsp` schema, you run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds, Id(schema = \"crsp\"))\n```\n:::\n\n\nThis operation returns a list of all tables that belong to the `crsp` family on WRDS, e.g., ` schema = crsp, table = msenames`. Similarly, you can fetch a list of all tables that belong to the `comp` family via\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds, Id(schema = \"comp\"))\n```\n:::\n\n\nIf you want to get all schemas, then run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds)\n```\n:::\n\n\n## Exercises\n\n1. Check out the structure of the WRDS database by sending queries in the spirit of [\"Querying WRDS Data using R\"](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/) and verify the output with `dbListObjects()`. How many tables are associated with CRSP? Can you identify what is stored within `msp500`?\n1. Compute `mkt_cap_lag` using `lag(mktcap)` rather than using joins as above. Filter out all the rows where the lag-based market capitalization measure is different from the one we computed above. Why are the two measures they different?\n1. Plot the average market capitalization of firms for each exchange and industry, respectively, over time. What do you find?\n1. In the `compustat` table, `datadate` refers to the date to which the fiscal year of a corresponding firm refers to. 
Count the number of observations in Compustat by `month` of this date variable. What do you find? What does the finding suggest about pooling observations with the same fiscal year?\n1. Go back to the original Compustat data in `funda_db` and extract rows where the same firm has multiple rows for the same fiscal year. What is the reason for these observations?\n1. Keep the last observation of `crsp_monthly` by year and join it with the `compustat` table. Create the following plots: (i) aggregate book equity by exchange over time and (ii) aggregate annual book equity by industry over time. Do you notice any different patterns to the corresponding plots based on market capitalization?\n1. Repeat the analysis of market capitalization for book equity, which we computed from the Compustat data. Then, use the matched sample to plot book equity against market capitalization. How are these two variables related?", + "markdown": "---\ntitle: WRDS, CRSP, and Compustat\naliases:\n - ../wrds-crsp-and-compustat.html\npre-render:\n - pre_render_script.R\nmetadata:\n pagetitle: WRDS, CRSP, and Compustat with R\n description-meta: Use the programming language R to download stock and firm data such as CRSP and Compustat from WRDS. \n---\n\n\nThis chapter shows how to connect to [Wharton Research Data Services (WRDS)](https://wrds-www.wharton.upenn.edu/), a popular provider of financial and economic data for research applications. We use this connection to download the most commonly used data for stock and firm characteristics, CRSP and Compustat. Unfortunately, this data is not freely available, but most students and researchers typically have access to WRDS through their university libraries. Assuming that you have access to WRDS, we show you how to prepare and merge the databases and store them in the `SQLite`-database introduced in the previous chapter. We conclude this chapter by providing some tips for working with the WRDS database.\\index{WRDS}\n\nFirst, we load the R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(scales)\nlibrary(RSQLite)\nlibrary(dbplyr)\n```\n:::\n\n\nWe use the same date range as in the previous chapter to ensure consistency.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstart_date <- ymd(\"1960-01-01\")\nend_date <- ymd(\"2022-12-31\")\n```\n:::\n\n\n## Accessing WRDS\n\nWRDS is the most widely used source for asset and firm-specific financial data used in academic settings. WRDS is a data platform that provides data validation, flexible delivery options, and access to many different data sources. The data at WRDS is also organized in an SQL database, although they use the [PostgreSQL](https://www.postgresql.org/) engine. This database engine is just as easy to handle with R as SQLite. We use the `RPostgres` package to establish a connection to the WRDS database [@RPostgres]. Note that you could also use the `odbc` package to connect to a PostgreSQL database, but then you need to install the appropriate drivers yourself. `RPostgres` already contains a suitable driver.\\index{Database!PostgreSQL}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(RPostgres)\n```\n:::\n\n\nTo establish a connection, you use the function `dbConnect()` with the following arguments. Note that you need to replace the `WRDS_USER` and `WRDS_PASSWORD` arguments with your own credentials. 
We defined environment variables for the purpose of this book because we obviously do not want (and are not allowed) to share our credentials with the rest of the world (these environment variables are stored in an `.Renviron`-file in our project directory and loaded with the `Sys.getenv()` function).\n\nAdditionally, you have to use multi-factor (i.e., two-factor) authentication since May 2023 when establishing a `PostgreSQL` or other remote connections. You have two choices to provide the additional identification. First, if you have Duo Push enabled for your WRDS account, you will receive a push notification on your mobile phone when trying to establish a connection with the code below. Upon accepting the notification, you can continue your work. Second, you can log in to a WRDS website that requires multi-factor authentication with your username and the same IP address. Once you have successfully identified yourself on the website, your username-IP combination will be remembered for 30 days, and you can comfortably use the remote connection below.\\index{Two-factor authentication}\\index{WRDS:Two-factor authentication}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrds <- dbConnect(\n Postgres(),\n host = \"wrds-pgdata.wharton.upenn.edu\",\n dbname = \"wrds\",\n port = 9737,\n sslmode = \"require\",\n user = Sys.getenv(\"WRDS_USER\"),\n password = Sys.getenv(\"WRDS_PASSWORD\")\n)\n```\n:::\n\n\nThe remote connection to WRDS is very useful. Yet, the database itself contains many different tables. You can check the WRDS homepage to identify the table's name you are looking for (if you go beyond our exposition). Alternatively, you can also query the data structure with the function `dbSendQuery()`. If you are interested, there is an exercise below that is based on WRDS' tutorial on [\"Querying WRDS Data using R\".](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/) Furthermore, the penultimate section of this chapter shows how to investigate the structure of databases.\n\n## Downloading and Preparing CRSP\n\n\\index{Data!CRSP}[The Center for Research in Security Prices (CRSP)](https://crsp.org/) provides the most widely used data for US stocks. We use the `wrds` connection object that we just created to first access monthly CRSP return data. Actually, we need three tables to get the desired data: (i) the CRSP monthly security file,\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsf_db <- tbl(wrds, in_schema(\"crsp\", \"msf\"))\n```\n:::\n\n\n(ii) the identifying information,\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsenames_db <- tbl(wrds, in_schema(\"crsp\", \"msenames\"))\n```\n:::\n\n\nand (iii) the delisting information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmsedelist_db <- tbl(wrds, in_schema(\"crsp\", \"msedelist\"))\n```\n:::\n\n\nWe use the three remote tables to fetch the data we want to put into our local database. Just as above, the idea is that we let the WRDS database do all the work and just download the data that we actually need. We apply common filters and data selection criteria to narrow down our data of interest: (i) we keep only data in the time windows of interest, (ii) we keep only US-listed stocks as identified via share codes `shrcd` 10 and 11, and (iii) we keep only months within permno-specific start dates `namedt` and end dates `nameendt`. In addition, we add delisting codes and returns. 
You can read up in the great textbook of @BaliEngleMurray2016 for an extensive discussion on the filters we apply in the code below.\\index{Permno}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- msf_db |>\n filter(date >= start_date & date <= end_date) |>\n inner_join(\n msenames_db |>\n filter(shrcd %in% c(10, 11)) |>\n select(permno, exchcd, siccd, namedt, nameendt),\n join_by(permno)\n ) |>\n filter(date >= namedt & date <= nameendt) |>\n mutate(month = floor_date(date, \"month\")) |>\n left_join(\n msedelist_db |>\n select(permno, dlstdt, dlret, dlstcd) |>\n mutate(month = floor_date(dlstdt, \"month\")),\n join_by(permno, month)\n ) |>\n select(\n permno, # Security identifier\n date, # Date of the observation\n month, # Month of the observation\n ret, # Return\n shrout, # Shares outstanding (in thousands)\n altprc, # Last traded price in a month\n exchcd, # Exchange code\n siccd, # Industry code\n dlret, # Delisting return\n dlstcd # Delisting code\n ) |>\n collect() |>\n mutate(\n month = ymd(month),\n shrout = shrout * 1000\n )\n```\n:::\n\n\nNow, we have all the relevant monthly return data in memory and proceed with preparing the data for future analyses. We perform the preparation step at the current stage since we want to avoid executing the same mutations every time we use the data in subsequent chapters. \n\nThe first additional variable we create is market capitalization (`mktcap`), which is the product of the number of outstanding shares `shrout` and the last traded price in a month `altprc`.\\index{Market capitalization} Note that in contrast to returns `ret`, these two variables are not adjusted ex-post for any corporate actions like stock splits. Moreover, the `altprc` is negative whenever the last traded price does not exist, and CRSP decides to report the mid-quote of the last available order book instead. Hence, we take the absolute value of the market cap. We also keep the market cap in millions of USD just for convenience as we do not want to print huge numbers in our figures and tables. In addition, we set zero market cap to missing as it makes conceptually little sense (i.e., the firm would be bankrupt).\\index{Stock price}\\index{Returns}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(\n mktcap = abs(shrout * altprc) / 10^6,\n mktcap = na_if(mktcap, 0)\n )\n```\n:::\n\n\nThe next variable we frequently use is the one-month *lagged* market capitalization. Lagged market capitalization is typically used to compute value-weighted portfolio returns, as we demonstrate in a later chapter. The most simple and consistent way to add a column with lagged market cap values is to add one month to each observation and then join the information to our monthly CRSP data.\\index{Weighting!Value}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmktcap_lag <- crsp_monthly |>\n mutate(month = month %m+% months(1)) |>\n select(permno, month, mktcap_lag = mktcap)\n\ncrsp_monthly <- crsp_monthly |>\n left_join(mktcap_lag, join_by(permno, month))\n```\n:::\n\n\nIf you wonder why we do not use the `lag()` function, e.g., via `crsp_monthly |> group_by(permno) |> mutate(mktcap_lag = lag(mktcap))`, take a look at the exercises.\n\nNext, we follow @BaliEngleMurray2016 in transforming listing exchange codes to explicit exchange names. 
\\index{Exchange!Exchange codes}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(exchange = case_when(\n exchcd %in% c(1, 31) ~ \"NYSE\",\n exchcd %in% c(2, 32) ~ \"AMEX\",\n exchcd %in% c(3, 33) ~ \"NASDAQ\",\n .default = \"Other\"\n ))\n```\n:::\n\n\nSimilarly, we transform industry codes to industry descriptions following @BaliEngleMurray2016.\\index{Industry codes} Notice that there are also other categorizations of industries [e.g., @FamaFrench1997] that are commonly used.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(industry = case_when(\n siccd >= 1 & siccd <= 999 ~ \"Agriculture\",\n siccd >= 1000 & siccd <= 1499 ~ \"Mining\",\n siccd >= 1500 & siccd <= 1799 ~ \"Construction\",\n siccd >= 2000 & siccd <= 3999 ~ \"Manufacturing\",\n siccd >= 4000 & siccd <= 4899 ~ \"Transportation\",\n siccd >= 4900 & siccd <= 4999 ~ \"Utilities\",\n siccd >= 5000 & siccd <= 5199 ~ \"Wholesale\",\n siccd >= 5200 & siccd <= 5999 ~ \"Retail\",\n siccd >= 6000 & siccd <= 6799 ~ \"Finance\",\n siccd >= 7000 & siccd <= 8999 ~ \"Services\",\n siccd >= 9000 & siccd <= 9999 ~ \"Public\",\n TRUE ~ \"Missing\"\n ))\n```\n:::\n\n\nWe also construct returns adjusted for delistings as described by @BaliEngleMurray2016. The delisting of a security usually results when a company ceases operations, declares bankruptcy, merges, does not meet listing requirements, or seeks to become private. The adjustment tries to reflect the returns of investors who bought the stock in the month before the delisting and held it until the delisting date. After this transformation, we can drop the delisting returns and codes.\\index{Returns!Delisting}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n mutate(ret_adj = case_when(\n is.na(dlstcd) ~ ret,\n !is.na(dlstcd) & !is.na(dlret) ~ dlret,\n dlstcd %in% c(500, 520, 580, 584) |\n (dlstcd >= 551 & dlstcd <= 574) ~ -0.30,\n dlstcd == 100 ~ ret,\n TRUE ~ -1\n )) |>\n select(-c(dlret, dlstcd))\n```\n:::\n\n\nNext, we compute excess returns by subtracting the monthly risk-free rate provided by our Fama-French data.\\index{Returns!Excess}\\index{Risk-free rate} As we base all our analyses on the excess returns, we can drop adjusted returns and the risk-free rate from our tibble. Note that we ensure excess returns are bounded by -1 from below as a return less than -100% makes no sense conceptually. Before we can adjust the returns, we have to connect to our database and load the table `factors_ff3_monthly`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntidy_finance <- dbConnect(\n SQLite(),\n \"data/tidy_finance_r.sqlite\",\n extended_types = TRUE\n)\n\nfactors_ff3_monthly <- tbl(tidy_finance, \"factors_ff3_monthly\") |>\n select(month, rf) |>\n collect()\n\ncrsp_monthly <- crsp_monthly |>\n left_join(factors_ff3_monthly,\n join_by(month)\n ) |>\n mutate(\n ret_excess = ret_adj - rf,\n ret_excess = pmax(ret_excess, -1)\n ) |>\n select(-ret_adj, -rf)\n```\n:::\n\n\nSince excess returns and market capitalization are crucial for all our analyses, we can safely exclude all observations with missing returns or market capitalization. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly <- crsp_monthly |>\n drop_na(ret_excess, mktcap, mktcap_lag)\n```\n:::\n\n\nFinally, we store the monthly CRSP file in our database. 
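Before writing, it can be reassuring to glance at what we are about to store. The following sketch is only illustrative and uses columns created above; it is not part of the book's pipeline.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Illustrative sanity check of the prepared monthly CRSP data\ncrsp_monthly |>\n  summarize(\n    securities = n_distinct(permno),\n    first_month = min(month),\n    last_month = max(month)\n  )\n```\n:::\n\n\nThe following call then writes the table to our SQLite database.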
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n \"crsp_monthly\",\n value = crsp_monthly,\n overwrite = TRUE\n)\n```\n:::\n\n\n## First Glimpse of the CRSP Sample\n\nBefore we move on to other data sources, let us look at some descriptive statistics of the CRSP sample, which is our main source for stock returns. \n\n@fig-211 shows the monthly number of securities by listing exchange over time. NYSE has the longest history in the data, but NASDAQ lists a considerably large number of stocks. The number of stocks listed on AMEX decreased steadily over the last couple of decades. By the end of 2022, there were 2,778 stocks with a primary listing on NASDAQ, 1,358 on NYSE, 162 on AMEX, and only one belonged to the other category. \\index{Exchange!NYSE}\\index{Exchange!AMEX}\\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n count(exchange, date) |>\n ggplot(aes(x = date, y = n, color = exchange, linetype = exchange)) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly number of securities by listing exchange\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Number of stocks in the CRSP sample listed at each of the US exchanges.](wrds-crsp-and-compustat_files/figure-html/fig-211-1.png){#fig-211 fig-alt='Title: Monthly number of securities by listing exchange. The figure shows a line chart with the number of securities by listing exchange from 1960 to 2022. In the earlier period, NYSE dominated as a listing exchange. There is a strong upwards trend for NASDAQ. Other listing exchanges do only play a minor role.' width=2100}\n:::\n:::\n\n\nNext, we look at the aggregate market capitalization grouped by the respective listing exchanges in @fig-212. To ensure that we look at meaningful data which is comparable over time, we adjust the nominal values for inflation. In fact, we can use the tables that are already in our database to calculate aggregate market caps by listing exchange and plotting it just as if they were in memory. All values in @fig-212 are at the end of 2022 USD to ensure intertemporal comparability. NYSE-listed stocks have by far the largest market capitalization, followed by NASDAQ-listed stocks.\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntbl(tidy_finance, \"crsp_monthly\") |>\n left_join(tbl(tidy_finance, \"cpi_monthly\"), join_by(month)) |>\n group_by(month, exchange) |>\n summarize(\n mktcap = sum(mktcap, na.rm = TRUE) / cpi,\n .groups = \"drop\"\n ) |>\n collect() |>\n mutate(month = ymd(month)) |>\n ggplot(aes(\n x = month, y = mktcap / 1000,\n color = exchange, linetype = exchange\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly market cap by listing exchange in billions of Dec 2022 USD\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Market capitalization is measured in billion USD, adjusted for consumer price index changes such that the values on the horizontal axis reflect the buying power of billion USD in December 2022.](wrds-crsp-and-compustat_files/figure-html/fig-212-1.png){#fig-212 fig-alt='Title: Monthly market cap by listing exchange in billion USD as of Dec 2022. 
The figure shows a line chart of the total market capitalization of all stocks aggregated by the listing exchange from 1960 to 2022, with years on the horizontal axis and the corresponding market capitalization on the vertical axis. Historically, NYSE listed stocks had the highest market capitalization. In the more recent past, the valuation of NASDAQ listed stocks exceeded that of NYSE listed stocks.' width=2100}\n:::\n:::\n\n\nOf course, performing the computation in the database is not really meaningful because we can easily pull all the required data into our memory. The code chunk above is slower than performing the same steps on tables that are already in memory. However, we just want to illustrate that you can perform many things in the database before loading the data into your memory. Before we proceed, we load the monthly CPI data.\\index{Data!CPI}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncpi_monthly <- tbl(tidy_finance, "cpi_monthly") |>\n  collect()\n```\n:::\n\n\nNext, we look at the same descriptive statistics by industry. @fig-213 plots the number of stocks in the sample for each of the SIC industry classifiers. For most of the sample period, the largest share of stocks is in manufacturing, although the number peaked in the 1990s. The number of firms associated with public administration seems to be the only category on the rise in recent years, even surpassing manufacturing at the end of our sample period.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_industry <- crsp_monthly |>\n  left_join(cpi_monthly, join_by(month)) |>\n  group_by(month, industry) |>\n  summarize(\n    securities = n_distinct(permno),\n    mktcap = sum(mktcap) / mean(cpi),\n    .groups = "drop"\n  )\n\ncrsp_monthly_industry |>\n  ggplot(aes(\n    x = month,\n    y = securities,\n    color = industry,\n    linetype = industry\n  )) +\n  geom_line() +\n  labs(\n    x = NULL, y = NULL, color = NULL, linetype = NULL,\n    title = "Monthly number of securities by industry"\n  ) +\n  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +\n  scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Number of stocks in the CRSP sample associated with different industries.](wrds-crsp-and-compustat_files/figure-html/fig-213-1.png){#fig-213 fig-alt='Title: Monthly number of securities by industry. The figure shows a line chart of the number of securities by industry from 1960 to 2022 with years on the horizontal axis and the corresponding number on the vertical axis. Except for stocks that are assigned to the industry public administration, the number of listed stocks decreased steadily at least since 1996. As of 2022, the segment of firms within public administration is the largest in terms of the number of listed stocks.' width=2100}\n:::\n:::\n\n\nWe also compute the market cap of all stocks belonging to the respective industries and show the evolution over time in @fig-214. All values are again in billions of end-of-2022 USD. At all points in time, manufacturing firms comprise the largest portion of market capitalization. 
Toward the end of the sample, however, financial firms and services begin to make up a substantial portion of the market cap.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly_industry |>\n ggplot(aes(\n x = month,\n y = mktcap / 1000,\n color = industry,\n linetype = industry\n )) +\n geom_line() +\n labs(\n x = NULL, y = NULL, color = NULL, linetype = NULL,\n title = \"Monthly total market cap by industry in billions as of Dec 2022 USD\"\n ) +\n scale_x_date(date_breaks = \"10 years\", date_labels = \"%Y\") +\n scale_y_continuous(labels = comma)\n```\n\n::: {.cell-output-display}\n![Market capitalization is measured in billion USD, adjusted for consumer price index changes such that the values on the y-axis reflect the buying power of billion USD in December 2022.](wrds-crsp-and-compustat_files/figure-html/fig-214-1.png){#fig-214 fig-alt='Title: Monthly total market cap by industry in billions as of Dec 2022 USD. The figure shows a line chart of total market capitalization of all stocks in the CRSP sample aggregated by industry from 1960 to 2022 with years on the horizontal axis and the corresponding market capitalization on the vertical axis. Stocks in the manufacturing sector have always had the highest market valuation. The figure shows a general upwards trend during the most recent past. ' width=2100}\n:::\n:::\n\n\n## Daily CRSP Data\n\nBefore we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session. \n\nThere is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easier to handle.\\index{Big data} That is, instead of downloading data about all stocks at once, download the data in small batches of stocks consecutively. Such operations can be implemented in `for()`-loops,\\index{For-loops} where we download, prepare, and store the data for a small number of stocks in each iteration. This operation might nonetheless take around 5 minutes, depending on your internet connection. To keep track of the progress, we create ad-hoc progress updates using `cat()`. Notice that we also use the function `dbWriteTable()` here with the option to append the new data to an existing table, when we process the second and all following batches. \n\nAlso notice that we adjust the returns for delistings in a different manner than for monthly returns: \n\n1. If a return is recorded in CRSP daily for a delisting date `dlstdt`, then we replace it with the delisting return `dlret`. \n2. If no return is recorded in CRSP daily for an existing delisting date, then we add a new row with the delisting return. \n3. We ensure that no returns after the delisting date `dlstdt` remain in the final table. 
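\n\nBefore running the full loop, the following toy example illustrates the index arithmetic that splits the vector of permnos into batches. The numbers are made up and the snippet is not part of the book's pipeline; it only mirrors the indexing used in the loop below.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy illustration of the batching logic (hypothetical vector of 1,050 IDs)\npermnos_toy <- 1:1050\nbatch_size <- 500\nbatches <- ceiling(length(permnos_toy) / batch_size)\n\n# First and last index covered by each batch j\nsapply(seq_len(batches), function(j) {\n  idx <- ((j - 1) * batch_size + 1):min(j * batch_size, length(permnos_toy))\n  range(idx)\n})\n```\n:::\n\n\nThe actual loop below applies the same indexing to the full vector of permnos from our monthly CRSP data.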
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndsf_db <- tbl(wrds, in_schema(\"crsp\", \"dsf\"))\n\nfactors_ff3_daily <- tbl(tidy_finance, \"factors_ff3_daily\") |>\n collect()\n\npermnos <- tbl(tidy_finance, \"crsp_monthly\") |>\n distinct(permno) |>\n pull()\n\nbatch_size <- 500\nbatches <- ceiling(length(permnos) / batch_size)\n\nfor (j in 1:batches) {\n \n permno_batch <- permnos[\n ((j - 1) * batch_size + 1):min(j * batch_size, length(permnos))\n ]\n\n crsp_daily_sub <- dsf_db |>\n filter(permno %in% permno_batch &\n date >= start_date & date <= end_date) |>\n select(permno, date, ret) |>\n collect() |>\n drop_na()\n\n if (nrow(crsp_daily_sub) > 0) {\n \n msedelist_sub <- msedelist_db |> \n filter(permno %in% permno_batch) |> \n select(permno, dlstdt, dlret) |> \n collect() |>\n drop_na()\n \n crsp_daily_sub <- crsp_daily_sub |> \n left_join(msedelist_sub, join_by(permno, date == dlstdt)) |> \n bind_rows(msedelist_sub |> \n anti_join(crsp_daily_sub, \n join_by(permno, dlstdt == date))) |> \n mutate(ret = if_else(!is.na(dlret), dlret, ret),\n date = if_else(!is.na(dlstdt), dlstdt, date)) |> \n select(-c(dlret, dlstdt)) |> \n left_join(msedelist_sub |> \n select(permno, dlstdt), join_by(permno)) |> \n mutate(dlstdt = replace_na(dlstdt, end_date)) |> \n filter(date <= dlstdt) |> \n select(-dlstdt)\n \n crsp_daily_sub <- crsp_daily_sub |>\n mutate(month = floor_date(date, \"month\")) |>\n left_join(factors_ff3_daily |>\n select(date, rf), join_by(date)) |>\n mutate(\n ret_excess = ret - rf,\n ret_excess = pmax(ret_excess, -1)\n ) |>\n select(permno, date, month, ret, ret_excess)\n\n dbWriteTable(tidy_finance,\n \"crsp_daily\",\n value = crsp_daily_sub,\n overwrite = ifelse(j == 1, TRUE, FALSE),\n append = ifelse(j != 1, TRUE, FALSE)\n )\n }\n\n cat(\"Batch\", j, \"out of\", batches, \"done (\", percent(j / batches), \")\\n\")\n}\n```\n:::\n\n\nEventually, we end up with more than 71 million rows of daily return data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data that we actually use. \n\n## Preparing Compustat data\n\nFirm accounting data are an important source of information that we use in portfolio analyses in subsequent chapters. 
The commonly used source for firm financial information is Compustat provided by [S&P Global Market Intelligence,](https://www.spglobal.com/marketintelligence/en/) which is a global data vendor that provides financial, statistical, and market information on active and inactive companies throughout the world.\\index{Data!Compustat} For US and Canadian companies, annual history is available back to 1950 and quarterly as well as monthly histories date back to 1962.\n\nTo access Compustat data, we can again tap WRDS, which hosts the `funda` table that contains annual firm-level information on North American companies.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfunda_db <- tbl(wrds, in_schema(\"comp\", \"funda\"))\n```\n:::\n\n\nWe follow the typical filter conventions and pull only data that we actually need: (i) we get only records in industrial data format, (ii) in the standard format (i.e., consolidated information in standard presentation), and (iii) only data in the desired time window.\\index{Gvkey}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- funda_db |>\n filter(\n indfmt == \"INDL\" &\n datafmt == \"STD\" &\n consol == \"C\" &\n datadate >= start_date & datadate <= end_date\n ) |>\n select(\n gvkey, # Firm identifier\n datadate, # Date of the accounting data\n seq, # Stockholders' equity\n ceq, # Total common/ordinary equity\n at, # Total assets\n lt, # Total liabilities\n txditc, # Deferred taxes and investment tax credit\n txdb, # Deferred taxes\n itcb, # Investment tax credit\n pstkrv, # Preferred stock redemption value\n pstkl, # Preferred stock liquidating value\n pstk, # Preferred stock par value\n capx, # Capital investment\n oancf, # Operating cash flow\n sale, # Revenue\n cogs, # Costs of goods sold\n xint, # Interest expense\n xsga # Selling, general, and administrative expenses\n ) |>\n collect()\n```\n:::\n\n\nNext, we calculate the book value of preferred stock and equity `be` and the operating profitability `op` inspired by the [variable definitions in Ken French's data library.](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/variable_definitions.html) Note that we set negative or zero equity to missing which is a common practice when working with book-to-market ratios [see @Fama1992 for details].\\index{Book equity}\\index{Preferred stock}\\index{Operating profitability}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |>\n mutate(\n be = coalesce(seq, ceq + pstk, at - lt) +\n coalesce(txditc, txdb + itcb, 0) -\n coalesce(pstkrv, pstkl, pstk, 0),\n be = if_else(be <= 0, NA, be),\n op = (sale - coalesce(cogs, 0) - \n coalesce(xsga, 0) - coalesce(xint, 0)) / be,\n )\n```\n:::\n\n\nWe keep only the last available information for each firm-year group. Note that `datadate` defines the time the corresponding financial data refers to (e.g., annual report as of December 31, 2022). Therefore, `datadate` is not the date when data was made available to the public. Check out the exercises for more insights into the peculiarities of `datadate`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |>\n mutate(year = year(datadate)) |>\n group_by(gvkey, year) |>\n filter(datadate == max(datadate)) |>\n ungroup()\n```\n:::\n\n\nWe also compute the investment ratio `inv` according to Ken French's variable definitions as the change in total assets from one fiscal year to another. 
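In symbols, writing $AT_t$ for total assets in fiscal year $t$ (our shorthand, not a Compustat mnemonic), this is\n\n$$\n\\text{inv}_t = \\frac{AT_t}{AT_{t-1}} - 1.\n$$\n\n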
Note that we again use the join-based approach introduced with the CRSP data above to construct lagged assets.\\index{Investment ratio}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompustat <- compustat |> \n  left_join(\n    compustat |> \n      select(gvkey, year, at_lag = at) |> \n      mutate(year = year + 1), \n    join_by(gvkey, year)\n  ) |> \n  mutate(\n    inv = at / at_lag - 1,\n    inv = if_else(at_lag <= 0, NA, inv)\n  )\n```\n:::\n\n\nWith the last step, we are already done preparing the firm fundamentals. Thus, we can store them in our local database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n  "compustat",\n  value = compustat,\n  overwrite = TRUE\n)\n```\n:::\n\n\n## Merging CRSP with Compustat\n\nUnfortunately, CRSP and Compustat use different keys to identify stocks and firms. CRSP uses `permno` for stocks, while Compustat uses `gvkey` to identify firms. Fortunately, a curated matching table on WRDS allows us to merge CRSP and Compustat, so we create a connection to the *CRSP-Compustat Merged* table (provided by CRSP).\\index{Data!Crsp-Compustat Merged}\\index{Permno}\\index{Gvkey}\\index{Data!Linking table}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nccmxpf_linktable_db <- tbl(\n  wrds,\n  in_schema("crsp", "ccmxpf_linktable")\n)\n```\n:::\n\n\nThe linking table contains links between CRSP and Compustat identifiers from various approaches. However, we need to make sure that we keep only relevant and correct links, again following the description outlined in @BaliEngleMurray2016. Note also that currently active links have no end date, so we just enter the current date via `today()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nccmxpf_linktable <- ccmxpf_linktable_db |>\n  filter(linktype %in% c("LU", "LC") &\n    linkprim %in% c("P", "C") &\n    usedflag == 1) |>\n  select(permno = lpermno, gvkey, linkdt, linkenddt) |>\n  collect() |>\n  mutate(linkenddt = replace_na(linkenddt, today()))\n```\n:::\n\n\nWe use these links to create a new table with a mapping between stock identifier, firm identifier, and month. We then use this mapping to add the Compustat `gvkey` to our monthly stock data. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nccm_links <- crsp_monthly |>\n  inner_join(ccmxpf_linktable, \n    join_by(permno), relationship = "many-to-many") |>\n  filter(!is.na(gvkey) & \n    (date >= linkdt & date <= linkenddt)) |>\n  select(permno, gvkey, date)\n\ncrsp_monthly <- crsp_monthly |>\n  left_join(ccm_links, join_by(permno, date))\n```\n:::\n\n\nAs the last step, we update the previously prepared monthly CRSP file with the linking information in our local database.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbWriteTable(tidy_finance,\n  "crsp_monthly",\n  value = crsp_monthly,\n  overwrite = TRUE\n)\n```\n:::\n\n\nBefore we close this chapter, let us look at an interesting descriptive statistic of our data. As the book value of equity plays a crucial role in many asset pricing applications, it is interesting to know for how many of our stocks this information is available. Hence, @fig-215 plots the share of securities with book equity values for each exchange. It turns out that the coverage is pretty bad for AMEX- and NYSE-listed stocks in the 1960s but hovers around 80% for all periods thereafter. 
We can ignore the erratic coverage of securities that belong to the other category since there are only a handful of them in our sample anyway.\\index{Exchange!NYSE}\\index{Exchange!AMEX}\\index{Exchange!NASDAQ}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncrsp_monthly |>\n  group_by(permno, year = year(month)) |>\n  filter(date == max(date)) |>\n  ungroup() |>\n  left_join(compustat, join_by(gvkey, year)) |>\n  group_by(exchange, year) |>\n  summarize(\n    share = n_distinct(permno[!is.na(be)]) / n_distinct(permno),\n    .groups = "drop"\n  ) |>\n  ggplot(aes(\n    x = year, \n    y = share, \n    color = exchange,\n    linetype = exchange\n  )) +\n  geom_line() +\n  labs(\n    x = NULL, y = NULL, color = NULL, linetype = NULL,\n    title = "Share of securities with book equity values by exchange"\n  ) +\n  scale_y_continuous(labels = percent) +\n  coord_cartesian(ylim = c(0, 1))\n```\n\n::: {.cell-output-display}\n![End-of-year share of securities with book equity values by listing exchange.](wrds-crsp-and-compustat_files/figure-html/fig-215-1.png){#fig-215 fig-alt='Title: Share of securities with book equity values by exchange. The figure shows a line chart of end-of-year shares of securities with book equity values by exchange from 1960 to 2022 with years on the horizontal axis and the corresponding share on the vertical axis. After an initial period with lower coverage in the early 1960s, typically, more than 80 percent of the entries in the CRSP sample have information about book equity values from Compustat.' width=2100}\n:::\n:::\n\n\n## Some Tricks for PostgreSQL Databases\n\nAs we mentioned above, the WRDS database runs on PostgreSQL rather than SQLite. Finding the right tables for your data needs can be tricky in the WRDS PostgreSQL instance, as the tables are organized in schemas.\\index{Database!Schema} If you wonder what the purpose of schemas is, check out [this documentation.](https://www.postgresql.org/docs/9.1/ddl-schemas.html) For instance, if you want to find all tables that live in the `crsp` schema, you run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds, Id(schema = "crsp"))\n```\n:::\n\n\nThis operation returns a list of all tables that belong to the `crsp` family on WRDS, e.g., ` schema = crsp, table = msenames`. Similarly, you can fetch a list of all tables that belong to the `comp` family via\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds, Id(schema = "comp"))\n```\n:::\n\n\nIf you want to get all schemas, then run\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndbListObjects(wrds)\n```\n:::\n\n\n## Exercises\n\n1. Check out the structure of the WRDS database by sending queries in the spirit of ["Querying WRDS Data using R"](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/) and verify the output with `dbListObjects()`. How many tables are associated with CRSP? Can you identify what is stored within `msp500`?\n1. Compute `mkt_cap_lag` using `lag(mktcap)` rather than using joins as above. Filter out all the rows where the lag-based market capitalization measure is different from the one we computed above. Why are the two measures different?\n1. Plot the average market capitalization of firms for each exchange and industry, respectively, over time. What do you find?\n1. In the `compustat` table, `datadate` refers to the date to which a firm's fiscal year refers. Count the number of observations in Compustat by `month` of this date variable. What do you find? 
## Exercises\n\n1. Check out the structure of the WRDS database by sending queries in the spirit of [\"Querying WRDS Data using R\"](https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-r/querying-wrds-data-r/) and verify the output with `dbListObjects()`. How many tables are associated with CRSP? Can you identify what is stored within `msp500`?\n1. Compute `mkt_cap_lag` using `lag(mktcap)` rather than using joins as above. Filter out all the rows where the lag-based market capitalization measure is different from the one we computed above. Why are the two measures different?\n1. Plot the average market capitalization of firms for each exchange and industry, respectively, over time. What do you find?\n1. In the `compustat` table, `datadate` refers to the date to which the fiscal year of the corresponding firm refers. Count the number of observations in Compustat by `month` of this date variable. What do you find? What does the finding suggest about pooling observations with the same fiscal year?\n1. Go back to the original Compustat data in `funda_db` and extract the observations where the same firm has multiple rows for the same fiscal year. What is the reason for these observations?\n1. Keep the last observation of `crsp_monthly` by year and join it with the `compustat` table. Create the following plots: (i) aggregate book equity by exchange over time and (ii) aggregate annual book equity by industry over time. Do you notice any patterns that differ from the corresponding plots based on market capitalization?\n1. Repeat the analysis of market capitalization for book equity, which we computed from the Compustat data. Then, use the matched sample to plot book equity against market capitalization. How are these two variables related?", "supporting": [ "wrds-crsp-and-compustat_files" ], diff --git a/_freeze/r/wrds-crsp-and-compustat/figure-html/fig-215-1.png b/_freeze/r/wrds-crsp-and-compustat/figure-html/fig-215-1.png index 793e23fd..4de5676f 100644 Binary files a/_freeze/r/wrds-crsp-and-compustat/figure-html/fig-215-1.png and b/_freeze/r/wrds-crsp-and-compustat/figure-html/fig-215-1.png differ diff --git a/docs/r/accessing-and-managing-financial-data.html b/docs/r/accessing-and-managing-financial-data.html index 492dee80..c43999eb 100644 --- a/docs/r/accessing-and-managing-financial-data.html +++ b/docs/r/accessing-and-managing-financial-data.html @@ -830,8 +830,8 @@

Managing SQLite [3] "cpi_monthly" "crsp_daily" [5] "crsp_monthly" "factors_ff3_daily" [7] "factors_ff3_monthly" "factors_ff5_monthly" - [9] "factors_q_monthly" "industries_ff_monthly" -[11] "macro_predictors" "mergent" + [9] "factors_q_monthly" "fisd" +[11] "industries_ff_monthly" "macro_predictors" [13] "trace_enhanced" diff --git a/docs/r/beta-estimation.html b/docs/r/beta-estimation.html index cdb0cca9..b3d19a29 100644 --- a/docs/r/beta-estimation.html +++ b/docs/r/beta-estimation.html @@ -569,7 +569,7 @@

Esti collect() crsp_monthly <- crsp_monthly |> - left_join(factors_ff3_monthly, by = "month") + left_join(factors_ff3_monthly, join_by(month))

To estimate the CAPM regression coefficients
\[ @@ -674,7 +674,7 @@

Rolling-Window E

It is actually quite simple to perform the rolling-window estimation for an arbitrary number of stocks, which we visualize in the following code chunk and the resulting Figure 1.

beta_examples <- crsp_monthly |>
-  inner_join(examples, by = "permno") |>
+  inner_join(examples, join_by(permno)) |>
   group_by(permno) |>
   mutate(roll_capm_estimation(pick(everything()), months = 60, min_obs = 48)) |>
   ungroup() |>
@@ -693,7 +693,7 @@ 

Rolling-Window E title = "Monthly beta estimates for example stocks using 5 years of data" )

-
+
Title: Monthly beta estimates for example stocks using 5 years of data. The figure shows a time series of beta estimates based on 5 years of monthly data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimated betas vary over time and across varies but are always positive for each stock.Figure 1: The CAPM betas are estimated with monthly data and a rolling window of length 5 years based on adjusted excess returns from CRSP. We use market excess returns from Kenneth French data library. @@ -716,13 +716,13 @@

Par crsp_monthly_nested

# A tibble: 30,384 × 3
-  permno industry      data              
-   <dbl> <chr>         <list>            
-1  10000 Manufacturing <tibble [16 × 3]> 
-2  10001 Utilities     <tibble [378 × 3]>
-3  10057 Manufacturing <tibble [437 × 3]>
-4  10042 Mining        <tibble [264 × 3]>
-5  10043 Services      <tibble [159 × 3]>
+  permno industry      data             
+   <dbl> <chr>         <list>           
+1  10066 Services      <tibble [35 × 3]>
+2  10067 Manufacturing <tibble [12 × 3]>
+3  10068 Finance       <tibble [8 × 3]> 
+4  10069 Retail        <tibble [10 × 3]>
+5  10070 Manufacturing <tibble [56 × 3]>
 # ℹ 30,379 more rows
@@ -734,7 +734,7 @@

Par

Next, we want to apply the roll_capm_estimation() function to each stock. This situation is an ideal use case for map(), which takes a list or vector as input and returns an object of the same length as the input. In our case, map() returns a single data frame with a time series of beta estimates for each stock. Therefore, we use unnest() to transform the list of outputs to a tidy data frame.

crsp_monthly_nested |>
-  inner_join(examples, by = "permno") |>
+  inner_join(examples, join_by(permno)) |>
   mutate(beta = map(
     data,
     ~ roll_capm_estimation(., months = 60, min_obs = 48)
@@ -788,7 +788,7 @@ 

Estima

We make sure to keep only relevant data to save memory space. However, note that your machine might not have enough memory to read the whole daily CRSP sample. In this case, we refer you to the exercises and try working with loops as in WRDS, CRSP, and Compustat.

crsp_daily <- crsp_daily |>
-  inner_join(factors_ff3_daily, by = "date") |>
+  inner_join(factors_ff3_daily, join_by(date)) |>
   select(permno, month, ret_excess, mkt_excess)

Just like above, we nest the data by permno for parallelization.

@@ -799,7 +799,7 @@

Estima

This is what the estimation looks like for a couple of examples using map(). For the daily data, we use the same function as above but only take 3 months of data and require at least 50 daily return observations in these months. These restrictions help us to retrieve somewhat smooth coefficient estimates.

crsp_daily_nested |>
-  inner_join(examples, by = "permno") |>
+  inner_join(examples, join_by(permno)) |>
   mutate(beta_daily = map(
     data,
     ~ roll_capm_estimation(., months = 3, min_obs = 50)
@@ -839,7 +839,7 @@ 

Comparing Beta Es

What is a typical value for stock betas? To get some feeling, we illustrate the dispersion of the estimated \(\hat\beta_i\) across different industries and across time below. Figure 2 shows that typical business models across industries imply different exposure to the general market economy. However, there are barely any firms that exhibit a negative exposure to the market factor.

crsp_monthly |>
-  left_join(beta_monthly, by = c("permno", "month")) |>
+  left_join(beta_monthly, join_by(permno, month)) |>
   drop_na(beta_monthly) |>
   group_by(industry, permno) |>
   summarize(beta = mean(beta_monthly), 
@@ -852,7 +852,7 @@ 

Comparing Beta Es title = "Firm-specific beta distributions by industry" )

-
+
Title: Firm-specific beta distributions by industry. The figure shows box plots for each industry. Firms with the highest average CAPM beta belong to the public administration industry. Firms from the utility sector have the lowest average CAPM beta. The figure indicates very few outliers with negative CAPM betas. The large majority of all stocks has CAPM betas between 0.5 and 1.5.Figure 2: The box plots show the average firm-specific beta estimates by industry. @@ -885,7 +885,7 @@

Comparing Beta Es title = "Monthly deciles of estimated betas", )

-
+
Title: Monthly deciles of estimated betas. The figure shows time series of deciles of estimated betas to illustrate the distribution of betas over time. The top 10 percent quantile on average is around 2 but varies substantially over time. The lowest 10 percent quantile is around 0.4 on average but is highly correlated with the top quantile such that in general CAPM market betas seem to go up and down jointly.Figure 3: Each line corresponds to the monthly cross-sectional quantile of the estimated CAPM beta. @@ -900,11 +900,11 @@

Comparing Beta Es

To compare the difference between daily and monthly data, we combine beta estimates to a single table. Then, we use the table to plot a comparison of beta estimates for our example stocks in Figure 4.

beta <- beta_monthly |>
-  full_join(beta_daily, by = c("permno", "month")) |>
+  full_join(beta_daily, join_by(permno, month)) |>
   arrange(permno, month)
 
 beta |>
-  inner_join(examples, by = "permno") |>
+  inner_join(examples, join_by(permno)) |>
   pivot_longer(cols = c(beta_monthly, beta_daily)) |>
   drop_na() |>
   ggplot(aes(
@@ -920,7 +920,7 @@ 

Comparing Beta Es title = "Comparison of beta estimates using monthly and daily data" )

-
+
Title: Comparison of beta estimates using monthly and daily data. The figure shows a time series of beta estimates using 5 years of monthly versus 3 years of daily data for Apple, Berkshire Hathaway, Microsoft, and Tesla. The estimates based on longer periods of monthly data are smooth relative to the estimates based on daily data. However, the general trend and level is similar, irrespective of the choice of frequency.Figure 4: CAPM betas are computed using 5 years of monthly or 3 months of daily data. The two lines show the monthly estimates based on a rolling window for few exemplary stocks. @@ -944,7 +944,7 @@

Comparing Beta Es

Whenever you perform some kind of estimation, it also makes sense to do rough plausibility tests. A possible check is to plot the share of stocks with beta estimates over time. This descriptive helps us discover potential errors in our data preparation or estimation procedure. For instance, suppose there was a gap in our output where we do not have any betas. In this case, we would have to go back and check all previous steps to find out what went wrong.

beta_long <- crsp_monthly |>
-  left_join(beta, by = c("permno", "month")) |>
+  left_join(beta, join_by(permno, month)) |>
   pivot_longer(cols = c(beta_monthly, beta_daily))
 
 beta_long |>
@@ -965,7 +965,7 @@ 

Comparing Beta Es ) + coord_cartesian(ylim = c(0, 1))

-
+
Title: End-of-month share of securities with beta estimates. The figure shows two time series with end-of-year shares of securities with beta estimates using 5 years of monthly or 3 months of daily data. There is almost no missing data for the estimates based on daily data. For the beta estimates based on monthly data, around 75 percent of all stock-month combinations provide sufficient long historical periods to estimate the  beta.Figure 5: The two lines show the share of securities with beta estimates using 5 years of monthly or 3 months of daily data. @@ -1479,7 +1479,7 @@

Exercises

-