## Trying to query S3 data directly (COMMON CRAWL)

Currently this code queries only the Parquet index of the crawled webpages, not the scraped records themselves.

It then uses each index row's *(warc_filename, warc_record_offset, warc_record_length)* to download only the relevant byte range of the relevant WARC file, from which the webpage's text is extracted.

In [174]:
import boto3
import subprocess
import json
import gzip
import warcio
import newspaper

In [2]:
sess = boto3.session.Session(profile_name="xmiles")
# commoncrawl is hosted in us-east-1, so create the client in that region
s3 = sess.client('s3', region_name='us-east-1')

In [192]:
%%time
# ALL FILES: s3://commoncrawl/cc-index/table/cc-main/warc/
# s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-10/subset=warc/part-00299-dbb5a216-bcb2-4bff-b117-e812a7981d21.c000.gz.parquet

sql_str = """
    SELECT * FROM S3Object s
    limit 20
"""
# WHERE s.url_host_tld='nz'

resp = s3.select_object_content(
    Bucket='commoncrawl',
    Key='cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-10/subset=warc/part-00290-dbb5a216-bcb2-4bff-b117-e812a7981d21.c000.gz.parquet',
    #Key='cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-10/subset=warc/part-00299-dbb5a216-bcb2-4bff-b117-e812a7981d21.c000.gz.parquet',
    #Key='cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-10/subset=warc/',
    Expression=sql_str,
    ExpressionType='SQL',
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {}}
)
print("Downloaded")

end_event_received = False
infos = []

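for event in resp['Payload']:
    if 'Records' in event:
        # Each Records event carries newline-delimited JSON rows
        payload = event['Records']['Payload'].decode()
        # Drop the last element (empty when the chunk ends with a newline)
        info = payload.split('\n')[:-1]

        # infos ends up as a list of lists: one list of JSON strings per event
        infos.append(info)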
    elif 'End' in event:
        print('Result is complete')
        end_event_received = True
            
if not end_event_received:
    raise Exception("End event not received, request incomplete.")

Downloaded
Result is complete
CPU times: user 25.7 ms, sys: 6.22 ms, total: 31.9 ms
Wall time: 1.87 s
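
The loop above decodes and splits each `Records` event on its own. A row may be split across two events, so a more defensive variant buffers the raw bytes and parses them once the stream has ended. The sketch below assumes the same event shape as `resp['Payload']`; the name `parse_select_events` and the fake events are hypothetical and only there for illustration.

```python
import json

def parse_select_events(events):
    """Collect JSON-lines rows from an S3 Select style event stream into dicts."""
    buffer = bytearray()
    ended = False
    for event in events:
        if 'Records' in event:
            # A row (or a multi-byte character) may be cut at an event
            # boundary, so only accumulate bytes here and decode later.
            buffer.extend(event['Records']['Payload'])
        elif 'End' in event:
            ended = True
    if not ended:
        raise RuntimeError("End event not received, request incomplete.")
    lines = buffer.decode('utf-8').split('\n')
    return [json.loads(line) for line in lines if line]

# Simulated stream in which the first row is split across two events.
fake_events = [
    {'Records': {'Payload': b'{"url": "https://a.example/", "fetch_'}},
    {'Records': {'Payload': b'status": 200}\n{"url": "https://b.example/", "fetch_status": 404}\n'}},
    {'Stats': {}},
    {'End': {}},
]
rows = parse_select_events(fake_events)
print(len(rows), rows[0]['fetch_status'])
```

With the real response this would be called as `parse_select_events(resp['Payload'])`, which yields a flat list of dicts rather than a list of lists of JSON strings.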


In [152]:
test = infos[0]
test[:3]

['{"url_surtkey":"com,extraspace)/storage/reserveorhold.aspx?uid=2510_7406","url":"https://www.extraspace.com/storage/reserveorhold.aspx?uid=2510_7406","url_host_name":"www.extraspace.com","url_host_tld":"com","url_host_2nd_last_part":"extraspace","url_host_3rd_last_part":"www","url_host_registry_suffix":"com","url_host_registered_domain":"extraspace.com","url_host_private_suffix":"com","url_host_private_domain":"extraspace.com","url_protocol":"https","url_path":"/storage/reserveorhold.aspx","url_query":"uid=2510_7406","fetch_time":"2021-02-24T23:29:01.000Z","fetch_status":200,"content_digest":"L5VNSRFOT6XKDLIC75WHNP7ON7S5EZ7G","content_mime_type":"text/html","content_mime_detected":"application/xhtml+xml","content_charset":"UTF-8","content_languages":"eng","warc_filename":"crawl-data/CC-MAIN-2021-10/segments/1614178349708.2/warc/CC-MAIN-20210224223004-20210225013004-00497.warc.gz","warc_record_offset":725568062,"warc_record_length":20193,"warc_segment":"1614178349708.2"}',
 '{"url_sur

In [139]:
test_json = json.loads(test[9])
test_json

{'url_surtkey': 'com,eventfuladvantage)/2020/02/11/know-your-event-purpose-goals',
 'url': 'https://eventfuladvantage.com/2020/02/11/know-your-event-purpose-goals/',
 'url_host_name': 'eventfuladvantage.com',
 'url_host_tld': 'com',
 'url_host_2nd_last_part': 'eventfuladvantage',
 'url_host_registry_suffix': 'com',
 'url_host_registered_domain': 'eventfuladvantage.com',
 'url_host_private_suffix': 'com',
 'url_host_private_domain': 'eventfuladvantage.com',
 'url_protocol': 'https',
 'url_path': '/2020/02/11/know-your-event-purpose-goals/',
 'fetch_time': '2021-03-06T01:04:13.000Z',
 'fetch_status': 200,
 'content_digest': '34NDPLJE2WBY7UGJLB7QKYN7M7MCIBUD',
 'content_mime_type': 'text/html',
 'content_mime_detected': 'text/html',
 'content_charset': 'UTF-8',
 'content_languages': 'eng',
 'warc_filename': 'crawl-data/CC-MAIN-2021-10/segments/1614178374217.78/warc/CC-MAIN-20210306004859-20210306034859-00378.warc.gz',
 'warc_record_offset': 323255944,
 'warc_record_length': 18788,
 'warc_

In [140]:
test_json['warc_record_offset'], test_json['warc_record_offset'] + test_json['warc_record_length']

(323255944, 323274732)

In [141]:
range_str = "bytes={}-{}".format(test_json['warc_record_offset'], 
                                 test_json['warc_record_offset'] + test_json['warc_record_length'])
range_str

'bytes=323255944-323274732'
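
HTTP `Range` end positions are inclusive, so a record of `warc_record_length` bytes starting at `warc_record_offset` ends at `offset + length - 1`. The range built above therefore asks for one byte more than the record. A minimal sketch of a helper that requests exactly the record (the name `byte_range_header` is hypothetical):

```python
def byte_range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value covering exactly `length` bytes from `offset`."""
    if length <= 0:
        raise ValueError("length must be positive")
    # Range end positions are inclusive, hence the -1.
    return "bytes={}-{}".format(offset, offset + length - 1)

# Offset and length taken from the index row shown earlier.
print(byte_range_header(323255944, 18788))
# bytes=323255944-323274731
```

When the range is built this way, the downloaded body should not need its last byte trimmed before decompression.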

In [150]:
test_resp = s3.get_object(
    Bucket="commoncrawl",
    Key=test_json['warc_filename'],
    Range=range_str
)

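# The Range end position is inclusive, so the request above returns one byte
# past the record; drop it, since the stray trailing byte causes a decompression error
warc_contents = gzip.decompress(test_resp['Body'].read()[:-1]).decode()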

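# How to parse the WARC contents? warc_contents is a plain str (WARC headers,
# HTTP headers, then the HTML body), so it cannot be indexed by header name:
# warc_contents['WARC-Target-URI']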

TypeError: string indices must be integers
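
One way to answer the parsing question: a decompressed WARC response record is a WARC header block, a blank line, the HTTP header block, a blank line, and then the body. The sketch below splits a single record using only the standard library. The function names and the synthetic record are hypothetical, and it assumes exactly one `response` record per byte range. The `warcio` package imported above provides a record iterator that would be the more robust choice for real data.

```python
import gzip

def parse_header_block(block: bytes) -> dict:
    """Turn 'Name: value' lines into a dict, skipping the first (version/status) line."""
    headers = {}
    for line in block.decode("utf-8", errors="replace").split("\r\n")[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers

def split_warc_record(raw: bytes):
    """Split one decompressed WARC response record into (warc_headers, http_headers, body)."""
    warc_block, _, rest = raw.partition(b"\r\n\r\n")
    http_block, _, body = rest.partition(b"\r\n\r\n")
    return parse_header_block(warc_block), parse_header_block(http_block), body

# Tiny synthetic record, gzip-compressed the way a single WARC member would be.
sample = gzip.compress(
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>hello</body></html>\r\n\r\n"
)
warc_headers, http_headers, body = split_warc_record(gzip.decompress(sample))
print(warc_headers["WARC-Target-URI"])
```

Working on bytes and decoding only the header blocks avoids a failure when the page body is not valid UTF-8.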

In [171]:
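# str.find/rfind return -1 when the marker is missing, so check before slicing
html_start = warc_contents.find('<!DOCTYPE html>')
html_end = warc_contents.rfind('</html>')
if html_start == -1 or html_end == -1:
    raise ValueError("HTML document not found in WARC record")
html_contents = warc_contents[html_start:html_end + len('</html>')]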

In [173]:
if not html_contents.endswith('</html>'):
    print("HTML extraction failed: html_contents does not end with </html>")
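
The exact-match search for `<!DOCTYPE html>` and `</html>` misses pages that use a different case or omit the doctype. A hedged sketch of a more tolerant extraction using case-insensitive regular expressions (the name `extract_html` is hypothetical, and real pages may still need a proper HTML parser):

```python
import re

def extract_html(record_text: str) -> str:
    """Return the text from the doctype (or <html> tag) to the last </html>, or '' if absent."""
    start = re.search(r"<!doctype\s+html|<html\b", record_text, flags=re.IGNORECASE)
    end = None
    for end in re.finditer(r"</html\s*>", record_text, flags=re.IGNORECASE):
        pass  # keep only the last closing tag
    if start is None or end is None or end.end() <= start.start():
        return ""
    return record_text[start.start():end.end()]

sample_text = "HTTP/1.1 200 OK\r\n\r\n<!doctype html><HTML><p>x</p></HTML>\n"
print(extract_html(sample_text))
```

Returning an empty string instead of slicing with `-1` makes a failed extraction easy to detect before the text is handed to `newspaper`.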

In [180]:
article = newspaper.Article(url='')
article.set_html(html_contents)
article.parse()
text = article.text
text

'Know Your Event Purpose and Goals\n\nLike other parts of your marketing and sales strategy, you need to know your event purpose and goals to make sure and structure your event correctly from the very start of the planning stage to ensure that you get the results that you want and the return on your investment that you need.\n\nWhile some businesses are busy planning no less than seven different ways to market their business, there is a real power of events that gives you and your clients a more personal way to grow your business:\n\nit keeps your current clients engaged with your business;\n\nit can help you gain additional clients;\n\nit can help you increase your sales; and\n\nit can increase your brand visibility and awareness.\n\nOnce the purpose and goals of your event are defined, you can then build the rest of the event to support and achieve these. What your purpose and goals are can influence so many of the different elements of your event.\n\nTiming\n\nWhat is the best Date 

In [185]:
phrases = list(filter(None, text.split('\n')))
phrases

['Know Your Event Purpose and Goals',
 'Like other parts of your marketing and sales strategy, you need to know your event purpose and goals to make sure and structure your event correctly from the very start of the planning stage to ensure that you get the results that you want and the return on your investment that you need.',
 'While some businesses are busy planning no less than seven different ways to market their business, there is a real power of events that gives you and your clients a more personal way to grow your business:',
 'it keeps your current clients engaged with your business;',
 'it can help you gain additional clients;',
 'it can help you increase your sales; and',
 'it can increase your brand visibility and awareness.',
 'Once the purpose and goals of your event are defined, you can then build the rest of the event to support and achieve these. What your purpose and goals are can influence so many of the different elements of your event.',
 'Timing',
 'What is the 