Merge pull request #74 from tidy-finance/feature/speed-up-preparing-daily-crsp-data

feature/speed-up-preparing-daily-crsp-data
christophscheuch committed Oct 14, 2023
2 parents 0cb9042 + 1def50a commit b5a7495
Showing 8 changed files with 202 additions and 142 deletions.


4 changes: 2 additions & 2 deletions _freeze/r/wrds-crsp-and-compustat/execute-results/html.json

Large diffs are not rendered by default.

181 changes: 101 additions & 80 deletions docs/python/wrds-crsp-and-compustat.html

Large diffs are not rendered by default.

79 changes: 44 additions & 35 deletions docs/r/wrds-crsp-and-compustat.html

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/search.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/sitemap.xml
@@ -46,7 +46,7 @@
</url>
<url>
<loc>https://www.tidy-finance.org/r/wrds-crsp-and-compustat.html</loc>
<lastmod>2023-10-13T16:22:00.853Z</lastmod>
<lastmod>2023-10-14T07:33:33.510Z</lastmod>
</url>
<url>
<loc>https://www.tidy-finance.org/r/hex-sticker.html</loc>
@@ -126,7 +126,7 @@
</url>
<url>
<loc>https://www.tidy-finance.org/python/wrds-crsp-and-compustat.html</loc>
<lastmod>2023-10-13T16:22:00.850Z</lastmod>
<lastmod>2023-10-14T17:25:10.833Z</lastmod>
</url>
<url>
<loc>https://www.tidy-finance.org/python/parametric-portfolio-policies.html</loc>
38 changes: 29 additions & 9 deletions python/wrds-crsp-and-compustat.qmd
@@ -1,7 +1,5 @@
---
title: WRDS, CRSP, and Compustat
execute:
cache: true
---

```{python}
@@ -397,9 +395,9 @@ market_cap_per_industry_figure.draw()

## Daily CRSP Data

Before we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session.
Before we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your Python session.

There is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easy to handle.\index{Big data} That is, instead of downloading data about many stocks all at once, download the data in small batches for each stock consecutively. Such operations can be implemented in `for()`-loops,\index{For-loops} where we download, prepare, and store the data for a single stock in each iteration. This operation might nonetheless take a couple of hours, so you have to be patient either way (we often run such code overnight). Eventually, we end up with more than 71 million rows of daily return data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data we actually use and that we can load the full daily data into our memory later. Notice that we also use the function `to_sql()` here with the option to append the new data to an existing table, when we process the second and all following batches.
There is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easier to handle.\index{Big data} That is, instead of downloading data about all stocks at once, download the data in small batches of stocks consecutively. Such operations can be implemented in `for`-loops,\index{For-loops} where we download, prepare, and store the data for a small number of stocks in each iteration. This operation might nonetheless take around 20 minutes, depending on your internet connection. To keep track of the progress, we create ad-hoc progress updates using `print()`. Notice that we also use the function `to_sql()` here with the option to append the new data to an existing table, when we process the second and all following batches.
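To make the batching arithmetic concrete before the full download loop below, here is a minimal, self-contained sketch of the chunking step: it splits a `permnos` data frame into batches of 100 and builds the quoted `IN` list used in the query. The five toy `permno` values are purely illustrative; the `batch_size`, the slicing, and the string join mirror the code that follows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the permnos pulled from the local crsp_monthly table
permnos = pd.DataFrame({"permno": [10001, 10002, 10003, 10004, 10005]})

batch_size = 100
batches = np.ceil(len(permnos) / batch_size).astype(int)

for j in range(1, batches + 1):
    # Rows (j - 1) * batch_size up to (but excluding) min(j * batch_size, len(permnos))
    permno_chunk = permnos[((j - 1) * batch_size):(min(j * batch_size, len(permnos)))]
    # Build a SQL IN list of the form ('10001', '10002', ...)
    permno_str = "('" + "', '".join(permno_chunk["permno"].astype(str)) + "')"
    print(f"Batch {j} of {batches}: WHERE permno IN {permno_str}")
```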

```{python}
#| eval: false
@@ -413,15 +411,25 @@ permnos = pd.read_sql(
"SELECT DISTINCT permno FROM crsp_monthly",
tidy_finance
)
batch_size = 100
batches = np.ceil(len(permnos) / batch_size).astype(int)
for j in range(1, batches + 1):
permno_chunk = permnos[
((j - 1) * batch_size):(min(j * batch_size, len(permnos)))
]
permno_str = "('" + "', '".join(permno_chunk["permno"].astype(str)) + "')"
for j in range(0, len(permnos)):
permno_sub = str(int(permnos.iloc[j]))
crsp_daily_sub_query = (
"SELECT permno, date, ret " +
"FROM crsp.dsf " +
"WHERE permno = " + permno_sub + " " +
"WHERE permno IN " + permno_str + " " +
"AND date BETWEEN '01/01/1960' AND '12/31/2022'"
)
crsp_daily_sub = (pd.read_sql_query(
sql=crsp_daily_sub_query,
con=wrds,
@@ -434,7 +442,7 @@ for j in range(0, len(permnos)):
if not crsp_daily_sub.empty:
crsp_daily_sub = (crsp_daily_sub
.assign(
month=lambda x: x["date"].dt.to_period("M")
month = lambda x: x["date"].dt.to_period("M").dt.to_timestamp()
)
.merge(factors_ff3_daily[["date", "rf"]],
on="date", how="left")
@@ -443,9 +451,17 @@ for j in range(0, len(permnos)):
((x["ret"] - x["rf"]).clip(lower=-1))
)
.get(["permno", "date", "month", "ret_excess"])
.assign(
date = lambda x:
((x["date"]- pd.Timestamp("1970-01-01"))
// pd.Timedelta("1d")),
month = lambda x:
((x["month"]- pd.Timestamp("1970-01-01"))
// pd.Timedelta("1d"))
)
)
if j == 0:
if j == 1:
crsp_daily_sub.to_sql(
name="crsp_daily",
con=tidy_finance,
@@ -459,8 +475,12 @@ for j in range(0, len(permnos)):
if_exists="append",
index=False
)
print(f"Chunk {j} out of {batches} done ({(j / batches) * 100:.2f}%)\n")
```

Eventually, we end up with more than 71 million rows of daily return data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data we actually use and that we can load the full daily data into our memory later.
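Since the loop stores `date` and `month` as integer day counts relative to 1970-01-01 before writing to SQLite, they have to be converted back to timestamps when the table is read later. The sketch below shows one way to do that; the database filename is an assumption based on the setup earlier in the chapter, and the row-count query is only a quick sanity check.

```python
import pandas as pd
import sqlite3

# Assumed path of the local database created earlier in the chapter
tidy_finance = sqlite3.connect("data/tidy_finance_python.sqlite")

crsp_daily_sample = (pd.read_sql_query(
    sql="SELECT permno, date, month, ret_excess FROM crsp_daily LIMIT 5",
    con=tidy_finance)
  .assign(
    # Integer day counts since 1970-01-01 back to proper timestamps
    date=lambda x: pd.to_datetime(x["date"], unit="D"),
    month=lambda x: pd.to_datetime(x["month"], unit="D")
  )
)

# Quick sanity check on the number of stored daily observations
row_count = pd.read_sql_query(
  sql="SELECT COUNT(*) AS n FROM crsp_daily",
  con=tidy_finance
)

print(crsp_daily_sample)
print(row_count)
```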

To the best of our knowledge, the daily CRSP data does not require any adjustments like the monthly data. The adjustment of the monthly data comes from the fact that CRSP aggregates daily data into monthly observations and has to decide which prices and returns to record if a stock gets delisted. In the daily data, there is simply no price or return after delisting, so there is also no aggregation problem.

## Preparing Compustat data
26 changes: 18 additions & 8 deletions r/wrds-crsp-and-compustat.qmd
@@ -337,7 +337,7 @@ crsp_monthly_industry |>

Before we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session.

There is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easy to handle.\index{Big data} That is, instead of downloading data about many stocks all at once, download the data in small batches for each stock consecutively. Such operations can be implemented in `for()`-loops,\index{For-loops} where we download, prepare, and store the data for a single stock in each iteration. This operation might nonetheless take a couple of hours, so you have to be patient either way (we often run such code overnight). To keep track of the progress, we create ad-hoc progress updates using `cat()`. Eventually, we end up with more than 71 million rows of daily return data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data we actually use and that we can load the full daily data into our memory later. Notice that we also use the function `dbWriteTable()` here with the option to append the new data to an existing table, when we process the second and all following batches.
There is a solution to this challenge. As with many *big data* problems, you can split up the big task into several smaller tasks that are easier to handle.\index{Big data} That is, instead of downloading data about all stocks at once, download the data in small batches of stocks consecutively. Such operations can be implemented in `for()`-loops,\index{For-loops} where we download, prepare, and store the data for a small number of stocks in each iteration. This operation might nonetheless take around 20 minutes, depending on your internet connection. To keep track of the progress, we create ad-hoc progress updates using `cat()`. Notice that we also use the function `dbWriteTable()` here with the option to append the new data to an existing table, when we process the second and all following batches.

```{r}
#| eval: false
@@ -350,10 +350,17 @@ permnos <- tbl(tidy_finance, "crsp_monthly") |>
distinct(permno) |>
pull()
for (j in 1:length(permnos)) {
permno_sub <- permnos[j]
batch_size <- 100
batches <- ceiling(length(permnos) / batch_size)
for (j in 1:batches) {
permno_chunk <- permnos[
((j - 1) * batch_size + 1):min(j * batch_size, length(permnos))
]
crsp_daily_sub <- dsf_db |>
filter(permno == permno_sub &
filter(permno %in% permno_chunk &
date >= start_date & date <= end_date) |>
select(permno, date, ret) |>
collect() |>
@@ -363,12 +370,12 @@ for (j in 1:length(permnos)) {
crsp_daily_sub <- crsp_daily_sub |>
mutate(month = floor_date(date, "month")) |>
left_join(factors_ff3_daily |>
select(date, rf), by = "date") |>
select(date, rf), by = "date") |>
mutate(
ret_excess = ret - rf,
ret_excess = pmax(ret_excess, -1)
) |>
select(permno, date, month, ret_excess)
select(permno, date, month, ret, ret_excess)
dbWriteTable(tidy_finance,
"crsp_daily",
@@ -377,11 +384,14 @@ for (j in 1:length(permnos)) {
append = ifelse(j != 1, TRUE, FALSE)
)
}
cat("Index", j, "out of", length(permnos), "done (",
percent(j / length(permnos)), ")\n")
cat("Chunk", j, "out of", batches, "done (",
percent(j / batches), ")\n")
}
```

Eventually, we end up with more than 71 million rows of daily return data. Note that we only store the identifying information that we actually need, namely `permno`, `date`, and `month` alongside the excess returns. We thus ensure that our local database contains only the data we actually use and that we can load the full daily data into our memory later.

To the best of our knowledge, the daily CRSP data does not require any adjustments like the monthly data. The adjustment of the monthly data comes from the fact that CRSP aggregates daily data into monthly observations and has to decide which prices and returns to record if a stock gets delisted. In the daily data, there is simply no price or return after delisting, so there is also no aggregation problem.

## Preparing Compustat data
