# Truncated Records 2025

The content limit for page captures will eventually be raised from 1 MiB to 5 MiB.

This notebook estimates the impact on WARC size and other parameters before the content limit is changed.

Once the content limit has been changed, the actual effects are also evaluated here.

Counts of truncated records are aggregated per MIME type using [AWS Athena](https://aws.amazon.com/athena/) and the following SQL query (cf. [average-warc-record-length-by-mime-type.sql](https://github.com/commoncrawl/cc-index-table/blob/main/src/sql/examples/cc-index/average-warc-record-length-by-mime-type.sql)):

```sql
SELECT COUNT(*) as n_pages,
       COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as perc_pages,
       AVG(warc_record_length) as avg_warc_record_length,
       SUM(warc_record_length) as sum_warc_record_length,
       MAX(warc_record_length) as max_warc_record_length,
       approx_percentile(warc_record_length, .95) as perc95_warc_record_length,
       SUM(warc_record_length) * 100.0 / SUM(SUM(warc_record_length)) OVER() as perc_warc_storage,
       SUM(case when content_truncated is null then 0 else 1 end) * 100.0 / COUNT(*) as perc_truncated,
       SUM(case when content_truncated is not null then warc_record_length else 0 end) as sum_warc_record_length_truncated,
       SUM(case when content_truncated is not null then warc_record_length else 0 end)
          * 100.0 / SUM(SUM(warc_record_length)) OVER() as perc_warc_storage_truncated,
       SUM(case when content_truncated = 'length' then warc_record_length else 0 end) as sum_warc_record_length_truncated_length,
       SUM(case when content_truncated = 'length' then warc_record_length else 0 end)
          * 100.0 / SUM(SUM(warc_record_length)) OVER() as perc_warc_storage_truncated_length,
       content_mime_detected,
       histogram(content_truncated) as reason_truncated,
       slice(
         array_sort(
           map_entries(map_filter(
             histogram(regexp_extract(url_path, '\.[a-zA-Z0-9_-]{1,7}$')),
             (k, v) -> v > 4)),
           (a, b) -> IF(a[2] < b[2], 1, IF(a[2] = b[2], 0, -1))),
         1, 25) as common_url_path_suffixes,
       COUNT(DISTINCT url_host_tld) as uniq_tlds,
       approx_distinct(url_host_registered_domain) as uniq_domains,
       approx_distinct(url_host_name) as uniq_hosts,
       slice(
         array_sort(
           map_entries(map_filter(histogram(url_host_tld), (k, v) -> v > 4)),
           (a, b) -> IF(a[2] < b[2], 1, IF(a[2] = b[2], 0, -1))),
         1, 25) as top_tlds,
       approx_most_frequent(25, url_host_registered_domain, 1000) as top_domains
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2025-05'
  AND subset = 'warc'
GROUP BY content_mime_detected
HAVING (COUNT(*) >= 10) -- ignore MIME types seen less than 10 times
ORDER BY n_pages DESC;
```
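
To make the truncation-related columns of the query concrete, the following is a minimal pure-Python sketch of the same per-MIME-type aggregation. The records are invented toy values, the helper `aggregate` is hypothetical, and the `HAVING` filter and most other columns are left out.

```python
from collections import defaultdict

# Toy records (hypothetical values):
# (content_mime_detected, warc_record_length, content_truncated)
records = [
    ("text/html", 1000, None),
    ("text/html", 3000, None),
    ("text/html", 6000, "length"),
    ("application/pdf", 8000, "length"),
    ("application/pdf", 2000, "disconnect"),
]


def aggregate(records):
    """Aggregate truncation statistics per MIME type, mirroring the SQL columns."""
    # corresponds to SUM(SUM(warc_record_length)) OVER() in the query
    total_length = sum(length for _, length, _ in records)
    groups = defaultdict(lambda: {"n": 0, "len": 0, "n_trunc": 0,
                                  "len_trunc": 0, "len_trunc_length": 0})
    for mime, length, reason in records:
        g = groups[mime]
        g["n"] += 1
        g["len"] += length
        if reason is not None:      # content_truncated is not null
            g["n_trunc"] += 1
            g["len_trunc"] += length
        if reason == "length":      # truncated because of the content limit
            g["len_trunc_length"] += length
    return {
        mime: {
            "n_pages": g["n"],
            "perc_truncated": 100.0 * g["n_trunc"] / g["n"],
            "perc_warc_storage": 100.0 * g["len"] / total_length,
            "perc_warc_storage_truncated": 100.0 * g["len_trunc"] / total_length,
            "perc_warc_storage_truncated_length":
                100.0 * g["len_trunc_length"] / total_length,
        }
        for mime, g in groups.items()
    }


stats = aggregate(records)
for mime, row in stats.items():
    print(mime, row)
```

In this toy data `text/html` holds half of the storage while one of its three records is truncated, so `perc_truncated` is about 33.3 and `perc_warc_storage_truncated` is 30.0.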

In [1]:
import json
import pandas as pd

data = pd.read_csv('data/warc-record-size-truncation-by-mime-type-CC-MAIN-2025-05.csv')

data[['content_mime_detected', 'n_pages', 'perc_warc_storage',
      'perc_truncated', 'perc_warc_storage_truncated', 'reason_truncated']].head(20)

| | content_mime_detected | n_pages | perc_warc_storage | perc_truncated | perc_warc_storage_truncated | reason_truncated |
|---|---|---|---|---|---|---|
| 0 | text/html | 2731118776 | 85.520576 | 2.39808 | 10.172436 | {length=64028128, disconnect=1466160, time=129} |
| 1 | application/xhtml+xml | 263868889 | 5.153182 | 0.631314 | 0.217108 | {length=1515360, disconnect=150472, time=10} |
| 2 | application/pdf | 19926901 | 9.04505 | 26.312124 | 4.968546 | {length=5232274, disconnect=10799, time=118} |
| 3 | text/plain | 3184175 | 0.036948 | 2.959793 | 0.020463 | {length=90376, disconnect=3853, time=16} |
| 4 | application/atom+xml | 3167720 | 0.012043 | 0.12359 | 0.000916 | {length=3899, disconnect=16} |
| 5 | application/rss+xml | 2445452 | 0.017838 | 0.504856 | 0.002287 | {length=11915, disconnect=429, time=2} |
| 6 | application/xml | 1943043 | 0.01816 | 1.408667 | 0.003037 | {length=26964, disconnect=404, time=3} |
| 7 | text/calendar | 1040301 | 0.002292 | 0.127944 | 0.000236 | {length=1208, disconnect=123} |
| 8 | application/json | 958061 | 0.004572 | 0.905162 | 0.001052 | {length=8596, disconnect=76} |
| 9 | application/octet-stream | 428542 | 0.033369 | 10.163765 | 0.026076 | {length=43363, disconnect=178, time=15} |


The aggregations show which MIME types are most affected by truncation.

Now let's look into the reasons for the truncation and load the histograms with the reason counts into columns:

In [2]:
# expand the embedded Presto/Trino/Athena histogram into data frame columns
# - transform to valid JSON: {length=1, time=2} -> {"length":1, "time":2}
data['reasons_truncation'] = data['reason_truncated'].str.replace(r'(\w+)=', r'"\1":', regex=True)
# - load the reason counts as columns (a missing reason counts as 0)
truncation_reason = data['reasons_truncation'].apply(
    lambda x: json.loads(x) if isinstance(x, str) else {}
).apply(pd.Series).apply(lambda s: s.fillna(0).astype(int)).add_prefix('trunc_reason_')

# - join with original data
data = data.join(truncation_reason)

data['n_pages_truncated'] \
    = data['trunc_reason_length'] + data['trunc_reason_time'] + data['trunc_reason_disconnect']
data['trunc_reason_length_perc'] = 100.0 * data['trunc_reason_length'] / data['n_pages']
data['trunc_length_gib'] = data['sum_warc_record_length_truncated_length'] / 2**30

data[['content_mime_detected', 'n_pages', 'perc_truncated', 'n_pages_truncated',
      'trunc_reason_length', 'trunc_reason_length_perc', 'trunc_length_gib']
    ].sort_values(by=['trunc_reason_length'], ascending=False).head(20)

| | content_mime_detected | n_pages | perc_truncated | n_pages_truncated | trunc_reason_length | trunc_reason_length_perc | trunc_length_gib |
|---|---|---|---|---|---|---|---|
| 0 | text/html | 2731118776 | 2.39808 | 65494417 | 64028128 | 2.344392 | 9414.133382 |
| 2 | application/pdf | 19926901 | 26.312124 | 5243191 | 5232274 | 26.257339 | 4615.277669 |
| 1 | application/xhtml+xml | 263868889 | 0.631314 | 1665842 | 1515360 | 0.574285 | 199.879604 |
| 3 | text/plain | 3184175 | 2.959793 | 94245 | 90376 | 2.838286 | 18.930999 |
| 9 | application/octet-stream | 428542 | 10.163765 | 43556 | 43363 | 10.118728 | 24.214183 |
| 6 | application/xml | 1943043 | 1.408667 | 27371 | 26964 | 1.38772 | 2.810701 |
| 5 | application/rss+xml | 2445452 | 0.504856 | 12346 | 11915 | 0.487231 | 2.118575 |
| 28 | application/epub+zip | 43534 | 27.105251 | 11800 | 11794 | 27.091469 | 11.324456 |
| 8 | application/json | 958061 | 0.905162 | 8672 | 8596 | 0.897229 | 0.974798 |
| 50 | application/x-matlab-data | 12258 | 59.781367 | 7328 | 7322 | 59.73242 | 6.644439 |
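
The histogram conversion used above can be checked on a single value. The sketch below applies the same regular expression to the `application/pdf` histogram shown earlier; the helper name `parse_histogram` is an assumption made for illustration and is not part of the notebook.

```python
import json
import re


def parse_histogram(text):
    """Parse an Athena histogram rendered as '{key=count, ...}' into a dict."""
    if not isinstance(text, str):
        # missing values (e.g. NaN read from the CSV) yield no reasons
        return {}
    # quote each key and replace '=' by ':' to obtain valid JSON
    as_json = re.sub(r'(\w+)=', r'"\1":', text)
    return json.loads(as_json)


reasons = parse_histogram('{length=5232274, disconnect=10799, time=118}')
n_truncated = sum(reasons.values())
print(reasons, n_truncated)
```

The summed reason counts agree with the `n_pages_truncated` value for `application/pdf` (5243191). The conversion assumes that histogram keys consist of word characters only.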


### Estimate of storage increase

If the content limit is raised from 1 MiB to 5 MiB, how much additional storage would be occupied by the records that are then no longer truncated or that hit the new limit?

- upper bound estimate of the storage **increase**, as product of
  - 4 (assuming that all captures grow from 1 MiB to 5 MiB, i.e. 4 MiB more content)
  - WARC storage occupied by records truncated because of the content length limit
- lower bound estimate, as product of
  - 1 (assuming that on average the content size doubles)
  - 95th percentile of the WARC record length
  - number of records truncated because of the content length limit
- more realistic estimate, as product of
  - 1.5 (assuming an average increase from 1 MiB to 2.5 MiB)
  - WARC storage occupied by records truncated because of the content length limit
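
As a sketch of how the three estimates combine their factors, the hypothetical helper `storage_increase_gib` below evaluates them for a single MIME type. The input values are the rounded `application/pdf` figures from the result tables of this notebook; the function and parameter names are assumptions for illustration.

```python
MIB = 2**20
GIB = 2**30


def storage_increase_gib(truncated_bytes, n_truncated, perc95_bytes,
                         upper_factor=4.0, mid_factor=1.5, lower_factor=1.0):
    """Return (lower, mid, upper) estimates of the added storage in GiB.

    truncated_bytes: WARC storage of records truncated by the length limit
    n_truncated:     number of records truncated by the length limit
    perc95_bytes:    95th percentile of the WARC record length
    """
    upper = upper_factor * truncated_bytes / GIB
    mid = mid_factor * truncated_bytes / GIB
    lower = lower_factor * n_truncated * perc95_bytes / GIB
    return lower, mid, upper


# application/pdf in CC-MAIN-2025-05 (rounded values from the result tables)
lower, mid, upper = storage_increase_gib(
    truncated_bytes=4615.277669 * GIB,
    n_truncated=5232274,
    perc95_bytes=0.988025 * MIB,
)
print(f"lower {lower:.0f} GiB, mid {mid:.0f} GiB, upper {upper:.0f} GiB")
```

With `upper_factor=5` the helper reproduces the `+ GiB upper` column of the notebook, which multiplies by 5 rather than 4.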

In [3]:
data['WARC GiB'] = data['sum_warc_record_length'] / 2**30
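# upper bound: the factor 5 counts the full 5 MiB per truncated record,
# not only the 4 MiB of additional content described above
data['+ GiB upper'] = 5 * data['sum_warc_record_length_truncated_length'] / 2**30
# lower bound: factor 1 * number of length-truncated records * 95th percentile of record length
data['+ GiB lower'] = 1 * data['trunc_reason_length'] * data['perc95_warc_record_length'] / 2**30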
data['+ GiB mid'] = 1.5 * data['sum_warc_record_length_truncated_length'] / 2**30
data['avg WARC MiB'] = data['avg_warc_record_length'] / 2**20
data['perc95 WARC MiB'] = data['perc95_warc_record_length'] / 2**20
data['max WARC MiB'] = data['max_warc_record_length'] / 2**20

data[['content_mime_detected', 'WARC GiB', '+ GiB upper', '+ GiB lower', '+ GiB mid', 'trunc_reason_length',
      'avg WARC MiB', 'perc95 WARC MiB', 'max WARC MiB']
    ].sort_values(by=['trunc_reason_length'], ascending=False).head(20)

| | content_mime_detected | WARC GiB | + GiB upper | + GiB lower | + GiB mid | trunc_reason_length | avg WARC MiB | perc95 WARC MiB | max WARC MiB |
|---|---|---|---|---|---|---|---|---|---|
| 0 | text/html | 79481.638957 | 47070.666908 | 5942.272938 | 14121.200072 | 64028128 | 0.029801 | 0.095035 | 1.001537 |
| 2 | application/pdf | 8406.344064 | 23076.388346 | 5048.453135 | 6922.916504 | 5232274 | 0.431984 | 0.988025 | 1.004257 |
| 1 | application/xhtml+xml | 4789.296311 | 999.39802 | 89.151125 | 299.819406 | 1515360 | 0.018586 | 0.060244 | 0.99848 |
| 3 | text/plain | 34.339026 | 94.654997 | 3.381077 | 28.396499 | 90376 | 0.011043 | 0.038309 | 1.001639 |
| 9 | application/octet-stream | 31.012809 | 121.070914 | 26.368903 | 36.321274 | 43363 | 0.074105 | 0.622691 | 1.002588 |
| 6 | application/xml | 16.877369 | 14.053504 | 1.070306 | 4.216051 | 26964 | 0.008895 | 0.040647 | 1.001264 |
| 5 | application/rss+xml | 16.578616 | 10.592875 | 0.303761 | 3.177863 | 11915 | 0.006942 | 0.026106 | 0.758889 |
| 28 | application/epub+zip | 19.962399 | 56.622282 | 11.529111 | 16.986685 | 11794 | 0.469552 | 1.001001 | 1.003022 |
| 8 | application/json | 4.249273 | 4.87399 | 0.102488 | 1.462197 | 8596 | 0.004542 | 0.012209 | 1.001091 |
| 50 | application/x-matlab-data | 6.804147 | 33.222194 | 7.138655 | 9.966658 | 7322 | 0.5684 | 0.998359 | 1.001732 |


In [4]:
print("Upper bound TiB: {:.2f} (+{:.2f}%)".format(
    data['+ GiB upper'].sum() / 2**10,
    100.0 * data['+ GiB upper'].sum() / data['WARC GiB'].sum()))
print("Lower bound TiB: {:.2f} (+{:.2f}%)".format(
    data['+ GiB lower'].sum() / 2**10,
    100.0 * data['+ GiB lower'].sum() / data['WARC GiB'].sum()))
print("Mid estimate TiB: {:.2f} (+{:.2f}%)".format(
    data['+ GiB mid'].sum() / 2**10,
    100.0 * data['+ GiB mid'].sum() / data['WARC GiB'].sum()))

Upper bound TiB: 70.13 (+77.27%)
Lower bound TiB: 10.94 (+12.05%)
Mid estimate TiB: 21.04 (+23.18%)
