## Trying to query S3 data directly (COMMON CRAWL)

Currently this code only queries the index of crawled webpages, not the scraped text itself.

Next, we need a way to use *(warc_filename, warc_record_offset, warc_record_length)* to extract the text from the relevant WARC file without downloading the whole WARC file.
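One possible approach is an HTTP Range request for just the record's bytes. This is a minimal sketch, not tested against the live bucket. It assumes each record in a Common Crawl `.warc.gz` file is compressed as its own gzip member, so the slice can be decompressed on its own. The helper names `byte_range`, `split_warc_record` and `fetch_warc_record` are hypothetical, and `s3_client` is assumed to be a boto3 S3 client.

```python
import gzip


def byte_range(offset, length):
    """Build an HTTP Range header value for one record (the end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


def split_warc_record(raw_gz):
    """Decompress one gzipped WARC response record and split it into
    (WARC headers, HTTP headers, payload). Each part ends at a blank line."""
    data = gzip.decompress(raw_gz)
    warc_headers, _, rest = data.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    return warc_headers, http_headers, body


def fetch_warc_record(s3_client, warc_filename, offset, length, bucket="commoncrawl"):
    """Fetch only the bytes of a single record instead of the whole WARC file."""
    resp = s3_client.get_object(
        Bucket=bucket,
        Key=warc_filename,
        Range=byte_range(offset, length),
    )
    return split_warc_record(resp["Body"].read())
```

With the index row shown further down, the call would be `fetch_warc_record(s3, row["warc_filename"], row["warc_record_offset"], row["warc_record_length"])`. The returned payload is the raw HTML, which still needs text extraction.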

Plan B: download the entire WARC file. This still cuts down on local processing, since the filtering has been offloaded to the S3 bucket via S3 Select.

In [23]:
import boto3
import pandas as pd
import io
import re
import time
import subprocess
import json

In [2]:
sess = boto3.session.Session(profile_name="xmiles")
# commoncrawl is hosted in us-east-1, so pin the client to that region
s3 = sess.client('s3', region_name='us-east-1')

In [3]:
# subprocess.check_output(["aws", "s3", "ls", "--no-sign-request",
#                          "s3://commoncrawl/cc-index/table/cc-main/warc/CC-MAIN-2021"])

In [25]:
%%time
# ALL FILES: s3://commoncrawl/cc-index/table/cc-main/warc/
# s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-10/subset=warc/part-00299-dbb5a216-bcb2-4bff-b117-e812a7981d21.c000.gz.parquet

sql_str = """
    SELECT * FROM S3Object s
    limit 10
"""
# WHERE s.url_host_tld='nz'

resp = s3.select_object_content(
    Bucket='commoncrawl',
    Key='cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-10/subset=warc/part-00299-dbb5a216-bcb2-4bff-b117-e812a7981d21.c000.gz.parquet',
    #Key='cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-10/subset=warc/',
    Expression=sql_str,
    ExpressionType='SQL',
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {}}
)
print("Status code:", resp['ResponseMetadata']['HTTPStatusCode'])
print("DOWNLOADED")

i = 0
end_event_received = False
infos = []

for event in resp['Payload']:
    if 'Records' in event:
        info = event['Records']['Payload'].decode()
        
        infos.append(info)
        #i += 1
        #if i == 10: break
    elif 'Progress' in event:
        print(event['Progress']['Details'])
    elif 'End' in event:
        print('Result is complete')
        end_event_received = True
            
if not end_event_received:
    raise Exception("End event not received, request incomplete.")

Status code: 200
DOWNLOADED
Result is complete
CPU times: user 32.2 ms, sys: 1.53 ms, total: 33.8 ms
Wall time: 1.43 s


In [27]:
print(infos[0])

{"url_surtkey":"ua,com,inotec)/kabeli-sinhronizacii-gg1038097","url":"https://inotec.com.ua/kabeli-sinhronizacii-gg1038097","url_host_name":"inotec.com.ua","url_host_tld":"ua","url_host_2nd_last_part":"com","url_host_3rd_last_part":"inotec","url_host_registry_suffix":"com.ua","url_host_registered_domain":"inotec.com.ua","url_host_private_suffix":"com.ua","url_host_private_domain":"inotec.com.ua","url_protocol":"https","url_path":"/kabeli-sinhronizacii-gg1038097","fetch_time":"2021-02-25T11:00:33.000Z","fetch_status":200,"content_digest":"W4UWVL5ZMJCDLSXE442CRMPKDR5FP36V","content_mime_type":"text/html","content_mime_detected":"text/html","content_charset":"UTF-8","content_languages":"rus","warc_filename":"crawl-data/CC-MAIN-2021-10/segments/1614178350942.3/warc/CC-MAIN-20210225095141-20210225125141-00410.warc.gz","warc_record_offset":363115834,"warc_record_length":12324,"warc_segment":"1614178350942.3"}
{"url_surtkey":"ua,com,inotec)/kabeli-sinhronizacii-gg1038097","url":"https://inote
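Each entry in `infos` appears to be a chunk of newline-delimited JSON rather than a single record, and a chunk boundary may fall in the middle of a record. This is a minimal sketch of turning the chunks into Python dicts and pulling out the WARC location fields. It assumes the default `\n` record delimiter of the JSON output serialization, and the helper names `parse_select_chunks` and `warc_location` are hypothetical.

```python
import json


def parse_select_chunks(chunks):
    """Join the streamed text chunks, then parse one JSON object per line."""
    text = "".join(chunks)
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def warc_location(record):
    """Return the three fields needed to locate a record inside its WARC file."""
    return (
        record["warc_filename"],
        record["warc_record_offset"],
        record["warc_record_length"],
    )
```

For example, `records = parse_select_chunks(infos)` followed by `pd.DataFrame(records)` would give a table of the index rows. Joining before parsing avoids a `json.loads` failure on a record that was split across two chunks.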

In [None]:
# import awswrangler as wr

In [None]:
# df = wr.pandas.read_parquet('commoncrawl',
#                             Key='crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/warc/CC-MAIN-20210224165708-20210224195708-00000.warc.gz')