Merge pull request #74 from tidy-finance/feature/speed-up-preparing-daily-crsp-data

feature/speed-up-preparing-daily-crsp-data
christophscheuch committed Oct 14, 2023
2 parents 0cb9042 + 1def50a commit b5a7495
Showing 8 changed files with 202 additions and 142 deletions.


4 changes: 2 additions & 2 deletions _freeze/r/wrds-crsp-and-compustat/execute-results/html.json

Large diffs are not rendered by default.

181 changes: 101 additions & 80 deletions docs/python/wrds-crsp-and-compustat.html

Large diffs are not rendered by default.

79 changes: 44 additions & 35 deletions docs/r/wrds-crsp-and-compustat.html

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/search.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/sitemap.xml
@@ -46,7 +46,7 @@
</url>
<url>
<loc>https://www.tidy-finance.org/r/wrds-crsp-and-compustat.html</loc>
<lastmod>2023-10-13T16:22:00.853Z</lastmod>
<lastmod>2023-10-14T07:33:33.510Z</lastmod>
</url>
<url>
<loc>https://www.tidy-finance.org/r/hex-sticker.html</loc>
@@ -126,7 +126,7 @@
</url>
<url>
<loc>https://www.tidy-finance.org/python/wrds-crsp-and-compustat.html</loc>
<lastmod>2023-10-13T16:22:00.850Z</lastmod>
<lastmod>2023-10-14T17:25:10.833Z</lastmod>
</url>
<url>
<loc>https://www.tidy-finance.org/python/parametric-portfolio-policies.html</loc>
38 changes: 29 additions & 9 deletions python/wrds-crsp-and-compustat.qmd
@@ -1,7 +1,5 @@
---
title: WRDS, CRSP, and Compustat
execute:
cache: true
---

```{python}
@@ -397,9 +395,9 @@ market_cap_per_industry_figure.draw()

## Daily CRSP Data

Before we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session.
Before we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your Python session.

There is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easy to handle.\index{Big data} That is, instead of downloading data about many stocks all at once, download the data in small batches for each stock consecutively. Such operations can be implemented in `for()`-loops,\index{For-loops} where we download, prepare, and store the data for a single stock in each iteration. This operation might nonetheless take a couple of hours, so you have to be patient either way (we often run such code overnight). Eventually, we end up with more than 71 million rows of daily return data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data we actually use and that we can load the full daily data into our memory later. Notice that we also use the function `to_sql()` here with the option to append the new data to an existing table, when we process the second and all following batches.
There is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easier to handle.\index{Big data} That is, instead of downloading data about all stocks at once, download the data in small batches of stocks consecutively. Such operations can be implemented in `for`-loops,\index{For-loops} where we download, prepare, and store the data for a small number of stocks in each iteration. This operation might nonetheless take around 20 minutes, depending on your internet connection. To keep track of the progress, we create ad-hoc progress updates using `print()`. Notice that we also use the function `to_sql()` here with the option to append the new data to an existing table, when we process the second and all following batches.
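To make the batching arithmetic concrete before the full download loop below, here is a minimal, self-contained sketch of the chunking step: it splits a `permnos` data frame into batches of 100 and builds the quoted `IN` list used in the query. The five toy `permno` values are purely illustrative; the `batch_size`, the slicing, and the string join mirror the code that follows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the permnos pulled from the local crsp_monthly table
permnos = pd.DataFrame({"permno": [10001, 10002, 10003, 10004, 10005]})

batch_size = 100
batches = np.ceil(len(permnos) / batch_size).astype(int)

for j in range(1, batches + 1):
    # Rows (j - 1) * batch_size up to (but excluding) min(j * batch_size, len(permnos))
    permno_chunk = permnos[((j - 1) * batch_size):(min(j * batch_size, len(permnos)))]
    # Build a SQL IN list of the form ('10001', '10002', ...)
    permno_str = "('" + "', '".join(permno_chunk["permno"].astype(str)) + "')"
    print(f"Batch {j} of {batches}: WHERE permno IN {permno_str}")
```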

```{python}
#| eval: false
@@ -413,15 +411,25 @@ permnos = pd.read_sql(
"SELECT DISTINCT permno FROM crsp_monthly",
tidy_finance
)
batch_size = 100
batches = np.ceil(len(permnos) / batch_size).astype(int)
for j in range(1, batches + 1):
permno_chunk = permnos[
((j - 1) * batch_size):(min(j * batch_size, len(permnos)))
]
permno_str = "('" + "', '".join(permno_chunk["permno"].astype(str)) + "')"
for j in range(0, len(permnos)):
permno_sub = str(int(permnos.iloc[j]))
crsp_daily_sub_query = (
"SELECT permno, date, ret " +
"FROM crsp.dsf " +
"WHERE permno = " + permno_sub + " " +
"WHERE permno IN " + permno_str + " " +
"AND date BETWEEN '01/01/1960' AND '12/31/2022'"
)
crsp_daily_sub = (pd.read_sql_query(
sql=crsp_daily_sub_query,
con=wrds,
@@ -434,7 +442,7 @@ for j in range(0, len(permnos)):
if not crsp_daily_sub.empty:
crsp_daily_sub = (crsp_daily_sub
.assign(
month=lambda x: x["date"].dt.to_period("M")
month = lambda x: x["date"].dt.to_period("M").dt.to_timestamp()
)
.merge(factors_ff3_daily[["date", "rf"]],
on="date", how="left")
@@ -443,9 +451,17 @@ for j in range(0, len(permnos)):
((x["ret"] - x["rf"]).clip(lower=-1))
)
.get(["permno", "date", "month", "ret_excess"])
.assign(
date = lambda x:
((x["date"]- pd.Timestamp("1970-01-01"))
// pd.Timedelta("1d")),
month = lambda x:
((x["month"]- pd.Timestamp("1970-01-01"))
// pd.Timedelta("1d"))
)
)
if j == 0:
if j == 1:
crsp_daily_sub.to_sql(
name="crsp_daily",
con=tidy_finance,
@@ -459,8 +475,12 @@ for j in range(0, len(permnos)):
if_exists="append",
index=False
)
print(f"Chunk {j} out of {batches} done ({(j / batches) * 100:.2f}%)\n")
```

Eventually, we end up with more than 71 million rows of daily return data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data we actually use and that we can load the full daily data into our memory later.
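Since the loop stores `date` and `month` as integer day counts relative to 1970-01-01 before writing to SQLite, they have to be converted back to timestamps when the table is read later. The sketch below shows one way to do that; the database filename is an assumption based on the setup earlier in the chapter, and the row-count query is only a quick sanity check.

```python
import pandas as pd
import sqlite3

# Assumed path of the local database created earlier in the chapter
tidy_finance = sqlite3.connect("data/tidy_finance_python.sqlite")

crsp_daily_sample = (pd.read_sql_query(
    sql="SELECT permno, date, month, ret_excess FROM crsp_daily LIMIT 5",
    con=tidy_finance)
  .assign(
    # Integer day counts since 1970-01-01 back to proper timestamps
    date=lambda x: pd.to_datetime(x["date"], unit="D"),
    month=lambda x: pd.to_datetime(x["month"], unit="D")
  )
)

# Quick sanity check on the number of stored daily observations
row_count = pd.read_sql_query(
  sql="SELECT COUNT(*) AS n FROM crsp_daily",
  con=tidy_finance
)

print(crsp_daily_sample)
print(row_count)
```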

To the best of our knowledge, the daily CRSP data does not require any adjustments like the monthly data. The adjustment of the monthly data comes from the fact that CRSP aggregates daily data into monthly observations and has to decide which prices and returns to record if a stock gets delisted. In the daily data, there is simply no price or return after delisting, so there is also no aggregation problem.

## Preparing Compustat data
26 changes: 18 additions & 8 deletions r/wrds-crsp-and-compustat.qmd
@@ -337,7 +337,7 @@ crsp_monthly_industry |>

Before we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session.

There is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easy to handle.\index{Big data} That is, instead of downloading data about many stocks all at once, download the data in small batches for each stock consecutively. Such operations can be implemented in `for()`-loops,\index{For-loops} where we download, prepare, and store the data for a single stock in each iteration. This operation might nonetheless take a couple of hours, so you have to be patient either way (we often run such code overnight). To keep track of the progress, we create ad-hoc progress updates using `cat()`. Eventually, we end up with more than 71 million rows of daily return data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data we actually use and that we can load the full daily data into our memory later. Notice that we also use the function `dbWriteTable()` here with the option to append the new data to an existing table, when we process the second and all following batches.
There is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easier to handle.\index{Big data} That is, instead of downloading data about all stocks at once, download the data in small batches of stocks consecutively. Such operations can be implemented in `for()`-loops,\index{For-loops} where we download, prepare, and store the data for a small number of stocks in each iteration. This operation might nonetheless take around 20 minutes, depending on your internet connection. To keep track of the progress, we create ad-hoc progress updates using `cat()`. Notice that we also use the function `dbWriteTable()` here with the option to append the new data to an existing table, when we process the second and all following batches.

```{r}
#| eval: false
@@ -350,10 +350,17 @@ permnos <- tbl(tidy_finance, "crsp_monthly") |>
distinct(permno) |>
pull()
for (j in 1:length(permnos)) {
permno_sub <- permnos[j]
batch_size <- 100
batches <- ceiling(length(permnos) / batch_size)
for (j in 1:batches) {
permno_chunk <- permnos[
((j - 1) * batch_size + 1):min(j * batch_size, length(permnos))
]
crsp_daily_sub <- dsf_db |>
filter(permno == permno_sub &
filter(permno %in% permno_chunk &
date >= start_date & date <= end_date) |>
select(permno, date, ret) |>
collect() |>
@@ -363,12 +370,12 @@ for (j in 1:length(permnos)) {
crsp_daily_sub <- crsp_daily_sub |>
mutate(month = floor_date(date, "month")) |>
left_join(factors_ff3_daily |>
select(date, rf), by = "date") |>
select(date, rf), by = "date") |>
mutate(
ret_excess = ret - rf,
ret_excess = pmax(ret_excess, -1)
) |>
select(permno, date, month, ret_excess)
select(permno, date, month, ret, ret_excess)
dbWriteTable(tidy_finance,
"crsp_daily",
@@ -377,11 +384,14 @@ for (j in 1:length(permnos)) {
append = ifelse(j != 1, TRUE, FALSE)
)
}
cat("Index", j, "out of", length(permnos), "done (",
percent(j / length(permnos)), ")\n")
cat("Chunk", j, "out of", batches, "done (",
percent(j / batches), ")\n")
}
```

Eventually, we end up with more than 71 million rows of daily return data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data we actually use and that we can load the full daily data into our memory later.

To the best of our knowledge, the daily CRSP data does not require any adjustments like the monthly data. The adjustment of the monthly data comes from the fact that CRSP aggregates daily data into monthly observations and has to decide which prices and returns to record if a stock gets delisted. In the daily data, there is simply no price or return after delisting, so there is also no aggregation problem.

## Preparing Compustat data
