## Scrape Reddit data from Common Crawl

Credit: The code below comes from Common Crawl's [Get Started page](https://commoncrawl.org/get-started).

#### 1. References:
1.1 Searching 100 Billion Webpages Pages With Capture Index https://skeptric.com/searching-100b-pages-cdx/

1.2 Searching Common Crawl Index https://skeptric.com/notebooks/Searching%20Common%20Crawl%20Index.html

#### 2. Useful endpoints to explore:

2.1 List of available Common Crawl indexes (e.g., CC-MAIN-2020-16): https://index.commoncrawl.org/collinfo.json

2.2 cc-index-server: https://index.commoncrawl.org/CC-MAIN-2020-16/ (note: replace CC-MAIN-2020-16 with the latest index listed in 2.1, such as CC-MAIN-2023-50)
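
Picking the latest index from `collinfo.json` can be scripted rather than done by eye. The sketch below is one possible approach and rests on two assumptions: each entry in the JSON list carries an `id` field, and the ids follow the `CC-MAIN-YYYY-WW` pattern, so they sort chronologically as plain strings. The helper names `fetch_collinfo` and `latest_index_id` are made up for this example, and the sample list is illustrative data, not a real server response.

```python
import json
from urllib.request import urlopen

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def fetch_collinfo(url=COLLINFO_URL):
    """Download and parse the list of available indexes (needs network access)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def latest_index_id(collinfo):
    """Return the newest index id from a parsed collinfo.json list.

    Assumption: ids look like CC-MAIN-YYYY-WW with a zero-padded week,
    so the string maximum is also the most recent crawl.
    """
    ids = [entry["id"] for entry in collinfo if entry["id"].startswith("CC-MAIN-")]
    return max(ids)


# Illustrative entries only; call fetch_collinfo() to get the real list.
sample = [
    {"id": "CC-MAIN-2020-16"},
    {"id": "CC-MAIN-2023-50"},
    {"id": "CC-MAIN-2021-04"},
]
print(latest_index_id(sample))
```

With network access, `latest_index_id(fetch_collinfo())` would give a value to use as the index name in the cells that follow.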


In [43]:
# The URL you want to look up in the Common Crawl index
target_url = 'reddit.com/r/wallstreetbets/' #'commoncrawl.org/faq'  # Replace with your target URL

# The Common Crawl index you want to query
INDEX_NAME = 'CC-MAIN-2023-50'      # Replace with the latest index name
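
By default the index server looks up the URL exactly as given, so a lookup for a subreddit's landing page can come back empty even if pages beneath it were captured. The CDX API also accepts a `matchType` parameter (`exact`, `prefix`, `host`, `domain`) and a `limit` parameter. The sketch below only builds the query URL; `build_index_query` is a hypothetical helper name, and whether any captures exist under this path is not something the sketch can promise.

```python
from urllib.parse import urlencode


def build_index_query(index_name, url, match_type="prefix", limit=None):
    """Build a Common Crawl index query URL (hypothetical helper).

    match_type="prefix" asks for every capture whose URL starts with `url`,
    instead of only the exact URL.
    """
    params = {"url": url, "output": "json", "matchType": match_type}
    if limit is not None:
        params["limit"] = limit  # cap the number of returned index lines
    return f"https://index.commoncrawl.org/{index_name}-index?{urlencode(params)}"


query = build_index_query("CC-MAIN-2023-50", "reddit.com/r/wallstreetbets/", limit=5)
print(query)
```

The resulting URL can be passed to `requests.get` in the same way as the exact-match query in the next cell.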

In [44]:
import requests
import gzip
import json
from urllib.parse import quote_plus

# Please note: f-strings require Python 3.6+

# The URL of the Common Crawl Index server
CC_INDEX_SERVER = 'http://index.commoncrawl.org/'


# Function to search the Common Crawl Index
def search_cc_index(url):
    encoded_url = quote_plus(url)
    index_url = f'{CC_INDEX_SERVER}{INDEX_NAME}-index?url={encoded_url}&output=json'
    response = requests.get(index_url)
    print("Response from CCI:", response.text)  # Output the response from the server
    if response.status_code == 200:
        records = response.text.strip().split('\n')
        return [json.loads(record) for record in records]
    else:
        return None

# Function to fetch the content from Common Crawl
def fetch_page_from_cc(records):
    # Try each capture in turn and return the first one that downloads successfully
    for record in records:
        offset, length = int(record['offset']), int(record['length'])
        s3_url = f'https://data.commoncrawl.org/{record["filename"]}'
        # Request only this record's byte range from the WARC file
        response = requests.get(s3_url, headers={'Range': f'bytes={offset}-{offset+length-1}'})
        if response.status_code == 206:
            # The body is a gzip-compressed WARC record;
            # a library such as warcio can parse it
            return response.content
        print(f"Failed to fetch data: {response.status_code}")
    return None

# Search the index for the target URL
records = search_cc_index(target_url)
if records:
    print(f"Found {len(records)} records for {target_url}")

    # Fetch the raw, still gzip-compressed WARC record for a capture
    compressed_data = fetch_page_from_cc(records)
    if compressed_data:
        print(f"Successfully fetched content for {target_url}")
        # Uncomment to inspect the record (WARC headers, HTTP headers, then the page body):
        # decompressed_data = gzip.decompress(compressed_data)
        # decoded_data = decompressed_data.decode('utf-8')  # Assuming text-based content
        # print(decoded_data)
        print("-------------------------------------")
        # You can now process 'compressed_data' as needed
else:
    print(f"No records found for {target_url}")

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): index.commoncrawl.org:80


DEBUG:urllib3.connectionpool:http://index.commoncrawl.org:80 "GET /CC-MAIN-2023-50-index?url=reddit.com%2Fr%2Fwallstreetbets%2F&output=json HTTP/1.1" 404 None


Response from CCI: {"message": "No Captures found for: reddit.com/r/wallstreetbets/"}
No records found for reddit.com/r/wallstreetbets/
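
The run above found no captures, so the fetch step was never exercised. When `fetch_page_from_cc` does return bytes, they should be a single gzip member holding one WARC record: WARC headers, a blank line, the HTTP response headers, another blank line, then the page body. The sketch below splits those three parts using only the standard library. `split_warc_record` is a hypothetical helper, and the demo record is synthetic data built for this example, not real crawl output. For real work a dedicated parser such as warcio is the safer choice, since this sketch does not handle chunked or compressed HTTP bodies.

```python
import gzip


def split_warc_record(compressed):
    """Split one gzip-compressed WARC response record into its three parts.

    Returns (warc_headers, http_headers, body). Assumes a single 'response'
    record with CRLF line endings, as produced by a byte-range request.
    """
    raw = gzip.decompress(compressed)
    warc_headers, _, rest = raw.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    # A WARC record is terminated by two CRLFs; drop them from the payload.
    if body.endswith(b"\r\n\r\n"):
        body = body[:-4]
    return (
        warc_headers.decode("utf-8", errors="replace"),
        http_headers.decode("iso-8859-1"),
        body,
    )


# Synthetic record for demonstration only.
demo = gzip.compress(
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n"
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html>hi</html>\r\n\r\n"
)
warc_h, http_h, body = split_warc_record(demo)
print(warc_h.splitlines()[0], "|", http_h.splitlines()[0], "|", body)
```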


## Computing environment

In [45]:
%load_ext watermark

%watermark

# print out pypi packages used
%watermark --iversions

# date
%watermark -u -n -t -z

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
Last updated: 2024-03-04T22:18:45.332361+08:00

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 8.22.1

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
CPU cores   : 8
Architecture: 64bit

json    : 2.0.9
requests: 2.31.0

Last updated: Mon Mar 04 2024 22:18:45Malay Peninsula Standard Time

