# Searching Common Crawl Index

This notebook explores different ways of using the Common Crawl index:

* [Comcrawl library](#Using-comcrawl)
* [CDX Toolkit](#Using-cdx-toolkit)
* [Querying HTTP Endpoint directly](#Requesting-CDX-endpoint-Directly)

See [the related article](https://skeptric.com/searching-100b-pages-cdx/) and [Jupyter notebook](https://skeptric.com/notebooks/Searching%20Common%20Crawl%20Index.ipynb).

In [1]:
import requests
import warcio
from contextlib import closing
from bs4 import BeautifulSoup
import json

import logging
from IPython.display import HTML
import pandas as pd

# Using [comcrawl](https://github.com/michaelharms/comcrawl)

https://index.commoncrawl.org/CC-MAIN-2020-16

https://index.commoncrawl.org/CC-MAIN-2020-16-index?url=https%3A%2F%2Fwww.reddit.com%2Fr%2Fdataisbeautiful%2F*&output=json

https://index.commoncrawl.org/CC-MAIN-2021-04

https://index.commoncrawl.org/CC-MAIN-2021-04-index?url=www.workana.com&output=json
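
The query URLs above are just an index endpoint plus URL-encoded parameters. Here is a minimal sketch of assembling such a URL with the standard library. `build_query_url` is a hypothetical helper, and `safe='*'` is only there so the wildcard stays unescaped, as in the URL above.

```python
from urllib.parse import urlencode

def build_query_url(index_id, url_pattern, output="json"):
    """Assemble a CDX query URL for one Common Crawl index (hypothetical helper)."""
    base = f"https://index.commoncrawl.org/{index_id}-index"
    # safe='*' keeps the trailing wildcard readable instead of encoding it as %2A
    query = urlencode({"url": url_pattern, "output": output}, safe="*")
    return f"{base}?{query}"

print(build_query_url("CC-MAIN-2020-16", "https://www.reddit.com/r/dataisbeautiful/*"))
```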

In [2]:
! python -m pip install comcrawl



In [3]:
from comcrawl import IndexClient

In [4]:
client = IndexClient(['2020-10', '2020-16'])

In [5]:
client.search('https://www.reddit.com/r/dataisbeautiful/*')

In [6]:
pd.DataFrame(client.results).head()

|    | urlkey | timestamp | url | mime | mime-detected | status | digest | length | offset | filename | redirect | charset | languages |
|---:|:-------|:----------|:----|:-----|:--------------|:-------|:-------|:-------|:-------|:---------|:---------|:--------|:----------|
|  0 | com,reddit)/r/dataisbeautiful/comments/2wlsvz/... | 20200217065457 | http://www.reddit.com/r/dataisbeautiful/commen... | unk | application/octet-stream | 301 | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | 679 | 13689701 | crawl-data/CC-MAIN-2020-10/segments/1581875141... | https://www.reddit.com/r/dataisbeautiful/comme... |  |  |
|  1 | com,reddit)/r/dataisbeautiful/comments/2wlsvz/... | 20200217065459 | https://www.reddit.com/r/dataisbeautiful/comme... | text/html | text/html | 200 | L4C22PRVUOGG22PXMKSB7KYVCWQUKEQ7 | 74716 | 915522267 | crawl-data/CC-MAIN-2020-10/segments/1581875141... |  | UTF-8 | eng |
|  2 | com,reddit)/r/dataisbeautiful/comments/7f2sfy/... | 20200223060640 | https://www.reddit.com/r/dataisbeautiful/comme... | text/html | text/html | 200 | GEWEQE4I2JOSKTL3QXPEI7FXVI3BP52O | 29470 | 884674375 | crawl-data/CC-MAIN-2020-10/segments/1581875145... |  | UTF-8 | eng |
|  3 | com,reddit)/r/dataisbeautiful/comments/7jbefu/... | 20200217195615 | https://www.reddit.com/r/dataisbeautiful/comme... | text/html | text/html | 200 | 42HZLBLZI5DQYGQAZNUAQ5NRCMEEVERW | 21516 | 890110347 | crawl-data/CC-MAIN-2020-10/segments/1581875143... |  | UTF-8 | eng |
|  4 | com,reddit)/r/dataisbeautiful/comments/8f1rk7/... | 20200222202649 | https://www.reddit.com/r/dataisbeautiful/comme... | text/html | text/html | 200 | IDKDLHSVB7YH3L2AUIMKPJFER3VLBZRU | 95956 | 859518253 | crawl-data/CC-MAIN-2020-10/segments/1581875145... |  | UTF-8 | eng |


Only download the first two results with status 200:

In [7]:
client.results = [res for res in client.results if res['status'] == '200'][:2]

In [8]:
client.download()

In [9]:
client.results[0]['url']

'https://www.reddit.com/r/dataisbeautiful/comments/2wlsvz/why_the_mlb_rule_changes_since_2004_game_time_is/'

In [10]:
html = client.results[0]['html']

In [11]:
soup = BeautifulSoup(html, 'html5lib')

In [12]:
soup.head.title.text

'Why the MLB rule changes: Since 2004, game time is up 10%, while runs are down 13% [OC] : dataisbeautiful'

In [13]:
soup.find('div', {'class': 'usertext-body'}).p.text

'A place for visual representations of data: Graphs, charts, maps, etc.'

# Using [cdx-toolkit](https://github.com/cocrawler/cdx_toolkit)

In [14]:
!python -m pip install cdx_toolkit



In [15]:
import cdx_toolkit

In [16]:
#url = 'https://www.reddit.com/r/dataisbeautiful/*'
url = 'https://www.workana.com/*'

In [17]:
cdx = cdx_toolkit.CDXFetcher(source='cc')

Note: the Python API takes `from_ts`, whereas the command-line tool uses `--from` (`from` is a reserved word in Python).

In [18]:
#objs = list(cdx.iter(url, from_ts='202101', to='202102', limit=5, filter='=status:200'))
objs = list(cdx.iter(url, from_ts='202101', to='202102', filter='=status:200'))

In [19]:
df = pd.DataFrame([o.data for o in objs])

In [20]:
df.shape

(23705, 12)

In [21]:
df.head()

|    | urlkey | timestamp | url | mime | mime-detected | status | digest | length | offset | filename | charset | languages |
|---:|:-------|:----------|:----|:-----|:--------------|:-------|:-------|:-------|:-------|:---------|:--------|:----------|
|  0 | com,workana)/ | 20210115134610 | https://www.workana.com/ | text/html | text/html | 200 | RUACBLESF5RHTGR42PP6U5NYKVNSC7G7 | 13252 | 3462879 | crawl-data/CC-MAIN-2021-04/segments/1610703495... |  |  |
|  1 | com,workana)/ | 20210115164606 | https://www.workana.com/ | text/html | text/html | 200 | ORS7FELEAMSXRBLRKUGDNNFLPLWZA6MB | 13252 | 3402545 | crawl-data/CC-MAIN-2021-04/segments/1610703495... |  |  |
|  2 | com,workana)/ | 20210115195200 | https://www.workana.com/ | text/html | text/html | 200 | GAHKARH5YMM7L6MQQRAIKINAPN3JKWBN | 13249 | 3758775 | crawl-data/CC-MAIN-2021-04/segments/1610703496... |  |  |
|  3 | com,workana)/ | 20210115225232 | https://www.workana.com/ | text/html | text/html | 200 | U5ZHB5ICS6TMZ36HKI7ZYA2LXI3XWGMQ | 13254 | 2986780 | crawl-data/CC-MAIN-2021-04/segments/1610703497... |  |  |
|  4 | com,workana)/ | 20210116015831 | https://www.workana.com/ | text/html | text/html | 200 | S4XEJ3TZEPMXA6QP3HB7IV4M3ZRRN2TK | 13250 | 3059704 | crawl-data/CC-MAIN-2021-04/segments/1610703499... |  |  |
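
The `timestamp` column is a 14-digit `YYYYMMDDhhmmss` capture time. Here is a small sketch for turning it into a `datetime`. It assumes the value is in UTC, which is consistent with the `WARC-Date` header of the record retrieved later in this notebook. `parse_cdx_timestamp` is a hypothetical helper.

```python
from datetime import datetime, timezone

def parse_cdx_timestamp(ts):
    """Parse a 14-digit CDX timestamp (YYYYMMDDhhmmss), assumed to be UTC."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

print(parse_cdx_timestamp("20210115134610").isoformat())
```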


In [22]:
print(df.to_markdown())




In [23]:
html = objs[0].content

In [24]:
soup = BeautifulSoup(html, 'html5lib')

In [25]:
soup.head.title.text

'\n        Workana - Find Freelancers & Freelance Jobs Online     '

In [26]:
o = objs[0]

In [27]:
o.warc_record.rec_headers.get_header('WARC-Target-URI')

'https://www.workana.com/'

# Requesting CDX endpoint Directly

We can request the [Index directly](https://index.commoncrawl.org/) using [pywb's CDX API](https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference).

But first we need to know what indexes are available.

In [28]:
cdx_indexes = requests.get('https://index.commoncrawl.org/collinfo.json').json()

In [29]:
pd.options.display.max_colwidth=150
pd.options.display.max_rows=6

In [30]:
pd.DataFrame(cdx_indexes)

|     | id | name | timegate | cdx-api |
|----:|:---|:-----|:---------|:--------|
|   0 | CC-MAIN-2021-04 | January 2021 Index | https://index.commoncrawl.org/CC-MAIN-2021-04/ | https://index.commoncrawl.org/CC-MAIN-2021-04-index |
|   1 | CC-MAIN-2020-50 | November 2020 Index | https://index.commoncrawl.org/CC-MAIN-2020-50/ | https://index.commoncrawl.org/CC-MAIN-2020-50-index |
|   2 | CC-MAIN-2020-45 | October 2020 Index | https://index.commoncrawl.org/CC-MAIN-2020-45/ | https://index.commoncrawl.org/CC-MAIN-2020-45-index |
| ... | ... | ... | ... | ... |
|  75 | CC-MAIN-2012 | Index of 2012 ARC files | https://index.commoncrawl.org/CC-MAIN-2012/ | https://index.commoncrawl.org/CC-MAIN-2012-index |
|  76 | CC-MAIN-2009-2010 | Index of 2009 - 2010 ARC files | https://index.commoncrawl.org/CC-MAIN-2009-2010/ | https://index.commoncrawl.org/CC-MAIN-2009-2010-index |
|  77 | CC-MAIN-2008-2009 | Index of 2008 - 2009 ARC files | https://index.commoncrawl.org/CC-MAIN-2008-2009/ | https://index.commoncrawl.org/CC-MAIN-2008-2009-index |


In [31]:
print(pd.DataFrame(cdx_indexes).head(10).to_markdown())

|    | id              | name                 | timegate                                       | cdx-api                                             |
|---:|:----------------|:---------------------|:-----------------------------------------------|:----------------------------------------------------|
|  0 | CC-MAIN-2021-04 | January 2021 Index   | https://index.commoncrawl.org/CC-MAIN-2021-04/ | https://index.commoncrawl.org/CC-MAIN-2021-04-index |
|  1 | CC-MAIN-2020-50 | November 2020 Index  | https://index.commoncrawl.org/CC-MAIN-2020-50/ | https://index.commoncrawl.org/CC-MAIN-2020-50-index |
|  2 | CC-MAIN-2020-45 | October 2020 Index   | https://index.commoncrawl.org/CC-MAIN-2020-45/ | https://index.commoncrawl.org/CC-MAIN-2020-45-index |
|  3 | CC-MAIN-2020-40 | September 2020 Index | https://index.commoncrawl.org/CC-MAIN-2020-40/ | https://index.commoncrawl.org/CC-MAIN-2020-40-index |
|  4 | CC-MAIN-2020-34 | August 2020 Index    | https://index.commoncrawl.org/CC-MAIN-2020-34/

In [32]:
print(pd.DataFrame(cdx_indexes).tail(10).to_markdown())

|    | id                | name                           | timegate                                         | cdx-api                                               |
|---:|:------------------|:-------------------------------|:-------------------------------------------------|:------------------------------------------------------|
| 68 | CC-MAIN-2014-41   | September 2014 Index           | https://index.commoncrawl.org/CC-MAIN-2014-41/   | https://index.commoncrawl.org/CC-MAIN-2014-41-index   |
| 69 | CC-MAIN-2014-35   | August 2014 Index              | https://index.commoncrawl.org/CC-MAIN-2014-35/   | https://index.commoncrawl.org/CC-MAIN-2014-35-index   |
| 70 | CC-MAIN-2014-23   | July 2014 Index                | https://index.commoncrawl.org/CC-MAIN-2014-23/   | https://index.commoncrawl.org/CC-MAIN-2014-23-index   |
| 71 | CC-MAIN-2014-15   | April 2014 Index               | https://index.commoncrawl.org/CC-MAIN-2014-15/   | https://index.commoncrawl.org/CC-MAIN-2014-15-index   

In [33]:
api_url = cdx_indexes[0]['cdx-api']
api_url

'https://index.commoncrawl.org/CC-MAIN-2021-04-index'
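
Rather than always taking the first entry, a specific crawl can be looked up by its `id`. The sketch below uses a hand-copied subset of the `collinfo.json` rows shown above in place of a live request. `find_cdx_api` is a hypothetical helper.

```python
def find_cdx_api(indexes, crawl_id):
    """Return the 'cdx-api' URL for the given crawl id, or None if it is absent."""
    for index in indexes:
        if index["id"] == crawl_id:
            return index["cdx-api"]
    return None

# Subset of the collinfo.json entries listed above (copied by hand for illustration)
sample_indexes = [
    {"id": "CC-MAIN-2021-04", "name": "January 2021 Index",
     "cdx-api": "https://index.commoncrawl.org/CC-MAIN-2021-04-index"},
    {"id": "CC-MAIN-2020-50", "name": "November 2020 Index",
     "cdx-api": "https://index.commoncrawl.org/CC-MAIN-2020-50-index"},
]

print(find_cdx_api(sample_indexes, "CC-MAIN-2020-50"))
```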

## Basic usage

In [34]:
#r = requests.get(api_url,
#                 params = {
#                     'url': 'workana.com',
#                     'limit': 100,
#                     'output': 'json'
#                 })

In [35]:
r = requests.get(api_url,
                 params = {
                     'url': 'https://workana.com',
                     'output': 'json',
                     'filter': ['=status:200', '=mime-detected:text/html']
                 })

In [36]:
records = [json.loads(line) for line in r.text.split('\n') if line]
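
With `output=json` the server returns one JSON object per line rather than a single JSON array, which is why the response is split on newlines. Below is the same parsing as a small function, run on a hand-written two-line sample instead of a live response. `parse_cdx_lines` is a hypothetical name.

```python
import json

def parse_cdx_lines(text):
    """Parse newline-delimited JSON, skipping blank lines such as the trailing one."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

sample = (
    '{"urlkey": "com,workana)/", "timestamp": "20210115134610", "status": "200"}\n'
    '{"urlkey": "com,workana)/", "timestamp": "20210115164606", "status": "200"}\n'
)
parsed = parse_cdx_lines(sample)
print(len(parsed), parsed[0]["timestamp"])
```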

In [37]:
df = pd.DataFrame(records)

In [38]:
print(df.head().to_markdown())

|    | urlkey        |      timestamp | url                      | mime      | mime-detected   |   status | digest                           |   length |   offset | filename                                                                                                          |   charset |   languages |
|---:|:--------------|---------------:|:-------------------------|:----------|:----------------|---------:|:---------------------------------|---------:|---------:|:------------------------------------------------------------------------------------------------------------------|----------:|------------:|
|  0 | com,workana)/ | 20210115134610 | https://www.workana.com/ | text/html | text/html       |      200 | RUACBLESF5RHTGR42PP6U5NYKVNSC7G7 |    13252 |  3462879 | crawl-data/CC-MAIN-2021-04/segments/1610703495901.0/robotstxt/CC-MAIN-20210115134101-20210115164101-00528.warc.gz |       nan |         nan |
|  1 | com,workana)/ | 20210115164606 | https://www.workana.com/ | text/html | 

In [39]:
df.shape

(55, 12)

In [40]:
df.head()

|    | urlkey | timestamp | url | mime | mime-detected | status | digest | length | offset | filename | charset | languages |
|---:|:-------|:----------|:----|:-----|:--------------|:-------|:-------|:-------|:-------|:---------|:--------|:----------|
|  0 | com,workana)/ | 20210115134610 | https://www.workana.com/ | text/html | text/html | 200 | RUACBLESF5RHTGR42PP6U5NYKVNSC7G7 | 13252 | 3462879 | crawl-data/CC-MAIN-2021-04/segments/1610703495901.0/robotstxt/CC-MAIN-20210115134101-20210115164101-00528.warc.gz |  |  |
|  1 | com,workana)/ | 20210115164606 | https://www.workana.com/ | text/html | text/html | 200 | ORS7FELEAMSXRBLRKUGDNNFLPLWZA6MB | 13252 | 3402545 | crawl-data/CC-MAIN-2021-04/segments/1610703495936.3/robotstxt/CC-MAIN-20210115164417-20210115194417-00528.warc.gz |  |  |
|  2 | com,workana)/ | 20210115195200 | https://www.workana.com/ | text/html | text/html | 200 | GAHKARH5YMM7L6MQQRAIKINAPN3JKWBN | 13249 | 3758775 | crawl-data/CC-MAIN-2021-04/segments/1610703496947.2/robotstxt/CC-MAIN-20210115194851-20210115224851-00608.warc.gz |  |  |
|  3 | com,workana)/ | 20210115225232 | https://www.workana.com/ | text/html | text/html | 200 | U5ZHB5ICS6TMZ36HKI7ZYA2LXI3XWGMQ | 13254 | 2986780 | crawl-data/CC-MAIN-2021-04/segments/1610703497681.4/robotstxt/CC-MAIN-20210115224908-20210116014908-00608.warc.gz |  |  |
|  4 | com,workana)/ | 20210116015831 | https://www.workana.com/ | text/html | text/html | 200 | S4XEJ3TZEPMXA6QP3HB7IV4M3ZRRN2TK | 13250 | 3059704 | crawl-data/CC-MAIN-2021-04/segments/1610703499999.6/robotstxt/CC-MAIN-20210116014637-20210116044637-00608.warc.gz |  |  |


In [41]:
df.columns

Index(['urlkey', 'timestamp', 'url', 'mime', 'mime-detected', 'status',
       'digest', 'length', 'offset', 'filename', 'charset', 'languages'],
      dtype='object')
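
Every value in the JSON output is a string, including `status`, `length` and `offset` (note the quoted values in the raw records shown in this notebook). Here is a sketch of casting those fields before doing arithmetic on them, assuming they always hold digits. `NUMERIC_FIELDS` and `cast_numeric` are hypothetical names.

```python
NUMERIC_FIELDS = ("status", "length", "offset")

def cast_numeric(record):
    """Return a copy of a CDX record with the numeric fields converted to int."""
    converted = dict(record)
    for field in NUMERIC_FIELDS:
        if field in converted:
            converted[field] = int(converted[field])
    return converted

raw = {"url": "https://www.workana.com/", "status": "200",
       "length": "13252", "offset": "3462879"}
print(cast_numeric(raw))
```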

## Filters and fields

Let's use a few of the bells and whistles from the API.

Particularly interesting are the [filters](https://github.com/webrecorder/pywb/wiki/CDX-Server-API#filter), which let us fetch only the rows we need.
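
Passing a list as the `filter` value makes `requests` send one `filter=` parameter per entry. The sketch below uses `urllib.parse.urlencode` with `doseq=True` to show the equivalent query string without making a request. The parameter values are taken from the calls in this notebook.

```python
from urllib.parse import urlencode

params = {
    "url": "https://workana.com",
    "matchType": "prefix",
    "output": "json",
    "filter": ["=status:200", "=mime-detected:text/html"],
}

# doseq=True expands the list into repeated filter=... parameters
query = urlencode(params, doseq=True)
print(query)
```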

In [42]:
#r = requests.get(api_url,
#                 params = {
#                     'url': 'https://www.reddit.com/r/',
#                     'matchType': 'prefix',
#                     'limit': 10,
#                     'output': 'json',
#                     'fl': 'url,filename,offset,length',
#                     'filter': ['=status:200', '=mime-detected:text/html', '~url:.*/comments/']
#                 })

In [43]:
r = requests.get(api_url,
                 params = {
                     'url': 'https://workana.com',
                     'matchType': 'prefix',
                     'output': 'json',
                     'fl': 'url,filename,offset,length',
                     'filter': ['=status:200', '=mime-detected:text/html']
                 })

In [44]:
r.raise_for_status()

In [45]:
df = pd.DataFrame([json.loads(line) for line in r.text.split('\n') if line])

In [46]:
df.shape

(6822, 4)

In [47]:
df.head()

|    | url | filename | offset | length |
|---:|:----|:---------|:-------|:-------|
|  0 | https://www.workana.com/ | crawl-data/CC-MAIN-2021-04/segments/1610703495901.0/robotstxt/CC-MAIN-20210115134101-20210115164101-00528.warc.gz | 3462879 | 13252 |
|  1 | https://www.workana.com/ | crawl-data/CC-MAIN-2021-04/segments/1610703495936.3/robotstxt/CC-MAIN-20210115164417-20210115194417-00528.warc.gz | 3402545 | 13252 |
|  2 | https://www.workana.com/ | crawl-data/CC-MAIN-2021-04/segments/1610703496947.2/robotstxt/CC-MAIN-20210115194851-20210115224851-00608.warc.gz | 3758775 | 13249 |
|  3 | https://www.workana.com/ | crawl-data/CC-MAIN-2021-04/segments/1610703497681.4/robotstxt/CC-MAIN-20210115224908-20210116014908-00608.warc.gz | 2986780 | 13254 |
|  4 | https://www.workana.com/ | crawl-data/CC-MAIN-2021-04/segments/1610703499999.6/robotstxt/CC-MAIN-20210116014637-20210116044637-00608.warc.gz | 3059704 | 13250 |


## Pagination

The [introductory blog post to CDX on Common Crawl](https://commoncrawl.org/2015/04/announcing-the-common-crawl-index/) mentions it's paginated to 15,000 results by default.

Let's test that

In [48]:
r = requests.get(api_url,
                 params = {
                     'url': '*.workana.com',
                     'output': 'json',
                     'showNumPages': True,
                 })

* `pageSize` is the number of (compressed) blocks per page
* `blocks` is the total number of compressed blocks
* `pages` is `ceil(blocks / pageSize)`


In [49]:
num_pages = r.json()
num_pages

{'pages': 5, 'pageSize': 5, 'blocks': 24}

In [50]:
import math

In [51]:
math.ceil(num_pages['blocks'] / num_pages['pageSize']) == num_pages['pages']

True
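
Given the `showNumPages` response, the page count and the valid `page` values follow directly. The sketch below uses the response shown above as a literal so that it runs offline. `page_params` is a hypothetical helper that only builds parameter dictionaries and does not send any requests.

```python
import math

def page_params(base_params, num_pages):
    """Yield one parameter dict per page; pages are numbered from 0."""
    for page in range(num_pages["pages"]):
        yield {**base_params, "page": page}

num_pages = {"pages": 5, "pageSize": 5, "blocks": 24}

# 24 blocks at 5 blocks per page need ceil(24 / 5) = 5 pages
assert math.ceil(num_pages["blocks"] / num_pages["pageSize"]) == num_pages["pages"]

all_params = list(page_params({"url": "*.workana.com", "output": "json"}, num_pages))
print([p["page"] for p in all_params])
```

Each dictionary could then be passed as `params` to `requests.get(api_url, ...)`, one request per page.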

In [52]:
r = requests.get(api_url,
                 params = {
                     'url': '*.workana.com',
                     'output': 'json',
                 })

In [53]:
results = [json.loads(line) for line in r.text.split('\n') if line]

In [54]:
len(results)

12099

In [55]:
results[-1]

{'urlkey': 'com,workana)/es/job/validador-de-imei-para-text-input',
 'timestamp': '20210118043611',
 'url': 'https://www.workana.com/es/job/validador-de-imei-para-text-input',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'status': '200',
 'digest': 'L6CEA7OGHNNL37BJSHG5OKQTB5UBE42N',
 'length': '12847',
 'offset': '1064473553',
 'filename': 'crawl-data/CC-MAIN-2021-04/segments/1610703514121.8/warc/CC-MAIN-20210118030549-20210118060549-00672.warc.gz',
 'charset': 'UTF-8',
 'languages': 'spa'}

We can adjust the pageSize (in blocks) as well

In [56]:
r = requests.get(api_url,
                 params = {
                     'url': '*.workana.com',
                     'output': 'json',
                     'page': 3,
                     'pageSize': 1,
                 })

In [57]:
results2 = [json.loads(line) for line in r.text.split('\n') if line]

With a page size of one block we get 3,000 results per page:

In [58]:
len(results2)

3000

In [59]:
results[0]

{'urlkey': 'com,workana)/',
 'timestamp': '20210115134610',
 'url': 'https://www.workana.com/',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'status': '200',
 'digest': 'RUACBLESF5RHTGR42PP6U5NYKVNSC7G7',
 'length': '13252',
 'offset': '3462879',
 'filename': 'crawl-data/CC-MAIN-2021-04/segments/1610703495901.0/robotstxt/CC-MAIN-20210115134101-20210115164101-00528.warc.gz'}

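Pages are numbered from 0, so page 3 with a page size of one block should be the fourth block of the default first page (five blocks). Every one of its rows should therefore already appear in `results`: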

In [60]:
[r for r in results2 if r not in results]

[]

Going past the last page

In [61]:
r = requests.get(api_url,
                 params = {
                     'url': '*.workana.com',
                     'output': 'json',
                     'page': 409,
                 })

In [62]:
r.status_code

400

In [63]:
print(r.text)

<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="/static/__shared/shared.css"/>
</head>
<body>
<h2>Common Crawl Index Server Error</h2>
<b>Page 409 invalid: First Page is 0, Last Page is 4</b>

</body>
</html>


# Retrieving content

In [64]:
record = records[0]

In [65]:
record

{'urlkey': 'com,workana)/',
 'timestamp': '20210115134610',
 'url': 'https://www.workana.com/',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'status': '200',
 'digest': 'RUACBLESF5RHTGR42PP6U5NYKVNSC7G7',
 'length': '13252',
 'offset': '3462879',
 'filename': 'crawl-data/CC-MAIN-2021-04/segments/1610703495901.0/robotstxt/CC-MAIN-20210115134101-20210115164101-00528.warc.gz'}

In [66]:
data_url = 'https://commoncrawl.s3.amazonaws.com/' + record['filename']
data_url

'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2021-04/segments/1610703495901.0/robotstxt/CC-MAIN-20210115134101-20210115164101-00528.warc.gz'

Use a [Range header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range) to get just the data we need.

In [67]:
# HTTP byte ranges are inclusive at both ends, so the last byte is offset + length - 1
headers = {'Range': f'bytes={int(record["offset"])}-{int(record["offset"]) + int(record["length"]) - 1}'}
headers

{'Range': 'bytes=3462879-3476130'}
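
HTTP byte ranges are inclusive at both ends, so a record of `length` bytes starting at `offset` ends at byte `offset + length - 1`. Here is a small sketch of building the header that way. `range_header` is a hypothetical helper.

```python
def range_header(record):
    """Build a Range header covering exactly one record (both ends inclusive)."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    return {"Range": f"bytes={start}-{end}"}

print(range_header({"offset": "3462879", "length": "13252"}))
```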

In [68]:
r = requests.get(data_url, headers=headers)

In [69]:
import zlib

In [71]:
#data = zlib.decompress(r.content)

A plain `zlib.decompress(r.content)` fails because zlib expects a zlib header by default, whereas each record in the WARC file is compressed as its own gzip member.

To make zlib accept gzip data we need to [set the wbits](https://stackoverflow.com/a/22310760).

In [72]:
data = zlib.decompress(r.content, wbits = zlib.MAX_WBITS | 16)
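
A byte range can be decompressed on its own because each record sits in the WARC file as a separate gzip member. Here is a self-contained sketch of that idea with made-up data: two byte strings are compressed separately and concatenated, then the second one is recovered from its offset and length alone. The payloads are illustrative stand-ins, not real WARC records.

```python
import gzip
import zlib

# Two stand-in "records", each compressed as its own gzip member
payloads = [b"first record\n", b"second record\n"]
members = [gzip.compress(p) for p in payloads]
archive = b"".join(members)

# Offset and length of the second member, like the index's offset/length fields
offset = len(members[0])
length = len(members[1])
chunk = archive[offset:offset + length]

# wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
restored = zlib.decompress(chunk, wbits=zlib.MAX_WBITS | 16)
print(restored)
```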

In [73]:
print(data.decode('utf-8'))

WARC/1.0
WARC-Type: response
WARC-Date: 2021-01-15T13:46:10Z
WARC-Record-ID: <urn:uuid:77120c70-1982-49c3-9c83-5477dc4b666a>
Content-Length: 60269
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:543fe5c9-060f-48c3-bd88-c6474ad8410f>
WARC-Concurrent-To: <urn:uuid:343c496c-1230-408d-89ed-94b3e112f5b6>
WARC-IP-Address: 40.70.170.72
WARC-Target-URI: https://www.workana.com/
WARC-Payload-Digest: sha1:RUACBLESF5RHTGR42PP6U5NYKVNSC7G7
WARC-Block-Digest: sha1:L4DM2QGXZTGP7CWK53MGLPOILN25KK2Z
WARC-Identified-Payload-Type: text/html

HTTP/1.1 200 OK
Date: Fri, 15 Jan 2021 13:46:10 GMT
Content-Type: text/html; charset=UTF-8
X-Crawler-Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: appcookie[user_locale]=en_US; expires=Sat, 15-Jan-2022 13:46:10 GMT; Max-Age=31536000; path=/; domain=www.workana.com; secure; HttpOnly
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Workana-Company-Hash: null
Str