# Looking at the WAT, WET and WARC Common Crawl Archives

Common Crawl provides three types of archive:

* WAT contains metadata about the crawl: the HTTP headers, elements from the HTML `<head>` (title, meta tags, scripts) and the links found on each page
* WET contains the text extracted from the HTML of each page, in the format `Title\nText`
* WARC contains the entire crawl: the requests, the metadata and the full HTTP responses

These are all in the Web Archive (WARC) format.
Common Crawl has a [good introduction to WARC](https://commoncrawl.org/2014/04/navigating-the-warc-file-format/).
There is also a [specification for the gory details](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/).

Read the associated [article](https://skeptric.com/text-meta-data-commoncrawl) for details, and the [Jupyter Notebook](https://skeptric.com/notebooks/WAT%20WET%20WARC%20-%20Common%20Crawl%20Archives.ipynb).

From the spec, here are the types of records:

* warcinfo - describes the records that follow it, such as information about the web crawl
* metadata - content created to further describe, explain, or accompany a harvested resource, in ways not covered by other record types
* conversion - an alternative version of another record's content, created as the result of an archival process
* response - a complete response, such as the full HTTP response from the server
* request - details of a complete request, such as the HTTP request the crawler sent
* resource - a resource, without the full protocol response information
* revisit - describes the revisitation of content already archived, and might include only an abbreviated content body which has to be interpreted relative to a previous record
* continuation - content appended to corresponding prior record block(s) (e.g., from other WARC files) to create the logically complete full-sized original record
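
To check which of these record types actually occur in a file, one option is to tally the `rec_type` attribute that warcio records expose (it is used throughout this notebook). The following is a minimal sketch, not part of the original notebook: `count_record_types` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real records so the example runs without downloading anything.

```python
from collections import Counter
from types import SimpleNamespace


def count_record_types(records):
    """Count how many records of each WARC type an iterable yields.

    Works with any object exposing a `rec_type` attribute, such as the
    records produced by warcio's ArchiveIterator.
    """
    return Counter(record.rec_type for record in records)


# Hypothetical stand-ins for real records, mirroring the order seen
# in the WARC file explored in this notebook.
sample_records = [
    SimpleNamespace(rec_type=t)
    for t in ["warcinfo", "request", "response", "metadata",
              "request", "response", "metadata"]
]
type_counts = count_record_types(sample_records)
print(type_counts)
```

With a real file, the same function could be called on an `ArchiveIterator` directly, at the cost of streaming the whole archive.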

We will read the archives with [warcio](https://github.com/webrecorder/warcio).

In [1]:
!pip install warcio



In [2]:
import json
import requests
from warcio import ArchiveIterator
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

## Let's take a sample WARC URL and the corresponding WET and WAT URLs
#### The WET and WAT files are generated from the full WARC, so their URLs can be derived from it

In [3]:
warc_url = 'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2021-04/segments/1610704847953.98/warc/CC-MAIN-20210128134124-20210128164124-00799.warc.gz'
wet_url = warc_url.replace('/warc/', '/wet/').replace('warc.gz', 'warc.wet.gz')
wat_url = warc_url.replace('/warc/', '/wat/').replace('warc.gz', 'warc.wat.gz')
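
The two `replace` calls above can be wrapped in a small helper that also checks the input looks like a WARC path. This is only a sketch: `derive_urls` is a hypothetical name, and the `example.org` URL is made up for illustration. It assumes the layout seen above, where only the `/warc/` directory and the `.warc.gz` suffix change.

```python
def derive_urls(warc_url):
    """Return the (wet_url, wat_url) pair derived from a WARC URL."""
    suffix = ".warc.gz"
    if "/warc/" not in warc_url or not warc_url.endswith(suffix):
        raise ValueError(f"not a Common Crawl WARC URL: {warc_url}")
    # Strip the suffix first so only the directory name is replaced.
    stem = warc_url[: -len(suffix)]
    wet = stem.replace("/warc/", "/wet/") + ".warc.wet.gz"
    wat = stem.replace("/warc/", "/wat/") + ".warc.wat.gz"
    return wet, wat


# Hypothetical URL with the same shape as the real one above.
example = "https://example.org/crawl-data/segments/1/warc/file-00799.warc.gz"
example_wet, example_wat = derive_urls(example)
print(example_wet)
print(example_wat)
```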

# Reading WARC

In [4]:
r_warc = requests.get(warc_url, stream=True)
records_warc = ArchiveIterator(r_warc.raw)

### 1. First record is `warcinfo` about the crawl

In [5]:
record1_warc = next(records_warc)

In [6]:
record1_warc.rec_type

'warcinfo'

In [7]:
a1_warc = record1_warc.content_stream().read()

In [8]:
print(a1_warc.decode('utf-8'))

isPartOf: CC-MAIN-2021-04
publisher: Common Crawl
description: Wide crawl of the web for January 2021
operator: Common Crawl Admin (info@commoncrawl.org)
hostname: ip-10-67-67-246.ec2.internal
software: Apache Nutch 1.17 (modified, https://github.com/commoncrawl/nutch/)
robots: checked via crawler-commons 1.2-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
format: WARC File Format 1.1
conformsTo: http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
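
The `warcinfo` payload printed above is a list of `name: value` lines, so it can be turned into a dictionary for easier lookup. A minimal sketch follows, assuming one field per line (folded continuation lines are not handled); `parse_warc_fields` is a hypothetical helper, and `sample` is a shortened copy of the output above.

```python
def parse_warc_fields(payload):
    """Parse a `name: value` payload (as bytes) into a dict."""
    fields = {}
    for line in payload.decode("utf-8").splitlines():
        # Split on the first colon only, so URLs in values stay intact.
        name, sep, value = line.partition(":")
        if sep:  # skip blank lines and anything without a colon
            fields[name.strip()] = value.strip()
    return fields


sample = (
    b"isPartOf: CC-MAIN-2021-04\r\n"
    b"publisher: Common Crawl\r\n"
    b"software: Apache Nutch 1.17 (modified, https://github.com/commoncrawl/nutch/)\r\n"
    b"\r\n"
)
info = parse_warc_fields(sample)
print(info["isPartOf"])
```

In the notebook, the same function could be applied to `a1_warc`.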



### 2. The next is details about the `request` to the server

In [9]:
record2_warc = next(records_warc)

In [10]:
record2_warc.rec_type

'request'

In [11]:
record2_warc.rec_headers

StatusAndHeaders(protocol = 'WARC/1.0', statusline = '', headers = [('WARC-Type', 'request'), ('WARC-Date', '2021-01-28T16:07:26Z'), ('WARC-Record-ID', '<urn:uuid:9b996c49-f53c-43db-9ce1-18f1f72679ee>'), ('Content-Length', '269'), ('Content-Type', 'application/http; msgtype=request'), ('WARC-Warcinfo-ID', '<urn:uuid:417d8ded-caa8-4bc1-b819-8f01e3632199>'), ('WARC-IP-Address', '23.82.163.220'), ('WARC-Target-URI', 'http://006655e.com/a/137338_com/1227.html')])

In [12]:
record2_warc.rec_headers.headers

[('WARC-Type', 'request'),
 ('WARC-Date', '2021-01-28T16:07:26Z'),
 ('WARC-Record-ID', '<urn:uuid:9b996c49-f53c-43db-9ce1-18f1f72679ee>'),
 ('Content-Length', '269'),
 ('Content-Type', 'application/http; msgtype=request'),
 ('WARC-Warcinfo-ID', '<urn:uuid:417d8ded-caa8-4bc1-b819-8f01e3632199>'),
 ('WARC-IP-Address', '23.82.163.220'),
 ('WARC-Target-URI', 'http://006655e.com/a/137338_com/1227.html')]

#### `http_headers` shows the HTTP headers of the GET request

In [13]:
record2_warc.http_headers

StatusAndHeaders(protocol = 'GET', statusline = '/a/137338_com/1227.html HTTP/1.1', headers = [('User-Agent', 'CCBot/2.0 (https://commoncrawl.org/faq/)'), ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'), ('Accept-Language', 'en-US,en;q=0.5'), ('Accept-Encoding', 'br,gzip'), ('Host', '006655e.com'), ('Connection', 'Keep-Alive')])

In [14]:
record2_warc.http_headers.headers

[('User-Agent', 'CCBot/2.0 (https://commoncrawl.org/faq/)'),
 ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
 ('Accept-Language', 'en-US,en;q=0.5'),
 ('Accept-Encoding', 'br,gzip'),
 ('Host', '006655e.com'),
 ('Connection', 'Keep-Alive')]

#### There's no data in the request

In [15]:
a2_warc = record2_warc.content_stream().read()

In [16]:
a2_warc

b''

### 3. The next item is the `response` to the previous request

In [17]:
record3_warc = next(records_warc)

In [18]:
record3_warc.rec_type

'response'

In [19]:
record3_warc.rec_headers

StatusAndHeaders(protocol = 'WARC/1.0', statusline = '', headers = [('WARC-Type', 'response'), ('WARC-Date', '2021-01-28T16:07:26Z'), ('WARC-Record-ID', '<urn:uuid:f4c525de-b5c5-4f1a-adbf-8d937eaa8061>'), ('Content-Length', '7421'), ('Content-Type', 'application/http; msgtype=response'), ('WARC-Warcinfo-ID', '<urn:uuid:417d8ded-caa8-4bc1-b819-8f01e3632199>'), ('WARC-Concurrent-To', '<urn:uuid:9b996c49-f53c-43db-9ce1-18f1f72679ee>'), ('WARC-IP-Address', '23.82.163.220'), ('WARC-Target-URI', 'http://006655e.com/a/137338_com/1227.html'), ('WARC-Payload-Digest', 'sha1:64X3TUQQXQ6JLITRWR7TAHCYMSM2AW5O'), ('WARC-Block-Digest', 'sha1:MPC75GFXY4P6QXDJQWNWIPYIPOY6WMM7'), ('WARC-Identified-Payload-Type', 'application/xhtml+xml')])

In [20]:
record3_warc.rec_headers.headers

[('WARC-Type', 'response'),
 ('WARC-Date', '2021-01-28T16:07:26Z'),
 ('WARC-Record-ID', '<urn:uuid:f4c525de-b5c5-4f1a-adbf-8d937eaa8061>'),
 ('Content-Length', '7421'),
 ('Content-Type', 'application/http; msgtype=response'),
 ('WARC-Warcinfo-ID', '<urn:uuid:417d8ded-caa8-4bc1-b819-8f01e3632199>'),
 ('WARC-Concurrent-To', '<urn:uuid:9b996c49-f53c-43db-9ce1-18f1f72679ee>'),
 ('WARC-IP-Address', '23.82.163.220'),
 ('WARC-Target-URI', 'http://006655e.com/a/137338_com/1227.html'),
 ('WARC-Payload-Digest', 'sha1:64X3TUQQXQ6JLITRWR7TAHCYMSM2AW5O'),
 ('WARC-Block-Digest', 'sha1:MPC75GFXY4P6QXDJQWNWIPYIPOY6WMM7'),
 ('WARC-Identified-Payload-Type', 'application/xhtml+xml')]
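
Note that the `WARC-Concurrent-To` header of the response carries the `WARC-Record-ID` of the request record seen earlier, which is how the two records are tied together. The sketch below shows one way to collect those links from the `(name, value)` lists returned by `rec_headers.headers`; `concurrent_pairs` is a hypothetical helper, and the header lists are trimmed copies of the outputs above.

```python
def concurrent_pairs(header_lists):
    """Map each record ID to the ID named in its WARC-Concurrent-To header.

    `header_lists` is an iterable of (name, value) lists, the shape
    returned by `record.rec_headers.headers`.
    """
    pairs = {}
    for headers in header_lists:
        fields = dict(headers)
        target = fields.get("WARC-Concurrent-To")
        if target is not None:
            pairs[fields["WARC-Record-ID"]] = target
    return pairs


request_headers = [
    ("WARC-Type", "request"),
    ("WARC-Record-ID", "<urn:uuid:9b996c49-f53c-43db-9ce1-18f1f72679ee>"),
]
response_headers = [
    ("WARC-Type", "response"),
    ("WARC-Record-ID", "<urn:uuid:f4c525de-b5c5-4f1a-adbf-8d937eaa8061>"),
    ("WARC-Concurrent-To", "<urn:uuid:9b996c49-f53c-43db-9ce1-18f1f72679ee>"),
]
pairs = concurrent_pairs([request_headers, response_headers])
print(pairs)
```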

In [21]:
record3_warc.http_headers

StatusAndHeaders(protocol = 'HTTP/1.1', statusline = '200 OK', headers = [('Content-Type', 'text/html'), ('X-Crawler-Content-Encoding', 'gzip'), ('Last-Modified', 'Sat, 11 Jan 2020 04:42:04 GMT'), ('Accept-Ranges', 'bytes'), ('ETag', '"ff902b7939c8d51:0"'), ('Vary', 'Accept-Encoding'), ('Server', 'Microsoft-IIS/7.5'), ('X-Powered-By', 'ASP.NET'), ('Date', 'Thu, 28 Jan 2021 16:07:18 GMT'), ('X-Crawler-Content-Length', '4489'), ('Content-Length', '7084')])

In [22]:
record3_warc.http_headers.statusline

'200 OK'

In [23]:
record3_warc.http_headers.headers

[('Content-Type', 'text/html'),
 ('X-Crawler-Content-Encoding', 'gzip'),
 ('Last-Modified', 'Sat, 11 Jan 2020 04:42:04 GMT'),
 ('Accept-Ranges', 'bytes'),
 ('ETag', '"ff902b7939c8d51:0"'),
 ('Vary', 'Accept-Encoding'),
 ('Server', 'Microsoft-IIS/7.5'),
 ('X-Powered-By', 'ASP.NET'),
 ('Date', 'Thu, 28 Jan 2021 16:07:18 GMT'),
 ('X-Crawler-Content-Length', '4489'),
 ('Content-Length', '7084')]

In [24]:
a3_warc = record3_warc.content_stream().read()

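#### This contains the full HTML (the Chinese text is garbled here because the page declares `charset=gb2312` but is decoded as UTF-8)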

In [25]:
print(a3_warc.decode('utf-8', errors='ignore')[:1000])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<title>ۿֱֳͬڶ4146Ԫbr  _̳,www.006655.com,ϴȫ,СĲ007,ww6882,137338.com,www.00475.com</title>
<meta name="keywords" content="ۿֱ" />
<meta name="description" content="ͬڶ4146Ԫ иծ˱Ǯ⧲ˡѩҲ繫Ȼְн꣬õλ֮ôλળга졣ߴӹίϤ, СҳῪ йίйȫ" />
<link href="/skin/css/common.css" rel="stylesheet" type="text/css" />
<link href="/skin/css/style.css" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="/caiyuan/ytbf.js"></script>

</head>
<body>
<section class="LgvJ7">
	<div class="logo">
		<a href="/" title="̳,www.006655.com,ϴȫ,СĲ007,ww6882,137338.com,www.00475.com"><img src="/skin/images/logo.png" /></a>
	</div>
	<menu id="i6ZACG">
		<a href="#"  title="̳,www.006655.com,ϴȫ,СĲ007,ww6882,137338.com,www.00475.com"><img src="/skin/images/14627896762494

In [26]:
soup_warc = BeautifulSoup(a3_warc.decode('utf-8', errors='ignore'), 'html.parser')
str(soup_warc)

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n<title>ۿֱֳͬڶ4146Ԫbr  _̳,www.006655.com,ϴȫ,СĲ007,ww6882,137338.com,www.00475.com</title>\n<meta content="ۿֱ" name="keywords"/>\n<meta content="ͬڶ4146Ԫ иծ˱Ǯ⧲ˡѩҲ繫Ȼְн꣬õλ֮ôλળга졣ߴӹίϤ, СҳῪ йίйȫ" name="description"/>\n<link href="/skin/css/common.css" rel="stylesheet" type="text/css"/>\n<link href="/skin/css/style.css" rel="stylesheet" type="text/css"/>\n<script src="/caiyuan/ytbf.js" type="text/javascript"></script>\n</head>\n<body>\n<section class="LgvJ7">\n<div class="logo">\n<a href="/" title="̳,www.006655.com,ϴȫ,СĲ007,ww6882,137338.com,www.00475.com"><img src="/skin/images/logo.png"/></a>\n</div>\n<menu id="i6ZACG">\n<a href="#" title="̳,www.006655.com,ϴȫ,СĲ007,ww6882,137338.com,www.00475.com"><img src="/skin/images/14627896762494.jpg"/></a>\n<

### 4. The next record is `metadata` about the fetch:

* How long the fetch took
* Detected character set
* Detected languages

In [27]:
record4_warc = next(records_warc)

In [28]:
record4_warc.rec_type

'metadata'

In [29]:
record4_warc.rec_headers.headers

[('WARC-Type', 'metadata'),
 ('WARC-Date', '2021-01-28T16:07:26Z'),
 ('WARC-Record-ID', '<urn:uuid:31726ea5-ff9e-4759-8b57-f63835afe1f8>'),
 ('Content-Length', '202'),
 ('Content-Type', 'application/warc-fields'),
 ('WARC-Warcinfo-ID', '<urn:uuid:417d8ded-caa8-4bc1-b819-8f01e3632199>'),
 ('WARC-Concurrent-To', '<urn:uuid:f4c525de-b5c5-4f1a-adbf-8d937eaa8061>'),
 ('WARC-Target-URI', 'http://006655e.com/a/137338_com/1227.html')]

In [30]:
a4_warc = record4_warc.content_stream().read()

In [31]:
print(a4_warc.decode('utf-8'))

fetchTimeMs: 200
charset-detected: GB2312
languages-cld2: {"reliable":true,"text-bytes":2556,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.94,"score":1999.0,"name":"Chinese"}]}
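
The `charset-detected` field gives a hint for decoding the response body, which would avoid the garbled characters produced by decoding a GB2312 page as UTF-8. The following is a sketch under that assumption, not the notebook's own method: `decode_payload` is a hypothetical helper that tries the detected charset first and falls back to UTF-8.

```python
def decode_payload(payload, detected_charset=None):
    """Decode HTML bytes, preferring the charset the crawler detected."""
    if detected_charset:
        try:
            return payload.decode(detected_charset, errors="replace")
        except LookupError:
            pass  # Python has no codec registered under this name
    return payload.decode("utf-8", errors="replace")


# Small stand-in payload encoded the same way as the page above.
raw = "中文".encode("gb2312")
decoded = decode_payload(raw, "GB2312")
print(decoded)
```

In the notebook this could be called as `decode_payload(a3_warc, 'GB2312')`, using the value read from the metadata record.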




### 5. Now we move on to the next `request`

In [32]:
record5_warc = next(records_warc)
record5_warc.rec_type, record5_warc.rec_headers.get_header('WARC-Target-URI')

('request', 'http://01-news.ru/sport/apl-prodlila-pauzu-v-sezone/')

In [33]:
record6_warc = next(records_warc)
record6_warc.rec_type, record6_warc.rec_headers.get_header('WARC-Target-URI')

('response', 'http://01-news.ru/sport/apl-prodlila-pauzu-v-sezone/')

In [34]:
record7_warc = next(records_warc)
record7_warc.rec_type, record7_warc.rec_headers.get_header('WARC-Target-URI')

('metadata', 'http://01-news.ru/sport/apl-prodlila-pauzu-v-sezone/')

### 6. And the next record

In [35]:
record8_warc = next(records_warc)
record8_warc.rec_type, record8_warc.rec_headers.get_header('WARC-Target-URI')

('request', 'http://05rjo8c.cn/3871_6589_20220_659112/535091.html')

In [36]:
record9_warc = next(records_warc)
record9_warc.rec_type, record9_warc.rec_headers.get_header('WARC-Target-URI')

('response', 'http://05rjo8c.cn/3871_6589_20220_659112/535091.html')

In [37]:
record10_warc = next(records_warc)
record10_warc.rec_type, record10_warc.rec_headers.get_header('WARC-Target-URI')

('metadata', 'http://05rjo8c.cn/3871_6589_20220_659112/535091.html')

### 7. And the next

In [38]:
record11_warc = next(records_warc)
record11_warc.rec_type, record11_warc.rec_headers.get_header('WARC-Target-URI')

('request', 'http://07528888888.com.cn/index.php/product/4483.html')

In [39]:
record12_warc = next(records_warc)
record12_warc.rec_type, record12_warc.rec_headers.get_header('WARC-Target-URI')

('response', 'http://07528888888.com.cn/index.php/product/4483.html')

In [40]:
record13_warc = next(records_warc)
record13_warc.rec_type, record13_warc.rec_headers.get_header('WARC-Target-URI')

('metadata', 'http://07528888888.com.cn/index.php/product/4483.html')

### 8. And so on

In [41]:
r_warc.close()

# Reading WET

In [42]:
r_wet = requests.get(wet_url, stream=True)
records_wet = ArchiveIterator(r_wet.raw)

First record is information about the crawl

In [43]:
record1_wet = next(records_wet)

In [44]:
record1_wet.rec_type

'warcinfo'

In [45]:
a1_wet = record1_wet.content_stream().read()

In [46]:
print(a1_wet.decode('utf-8'))

Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20210112092133
Extracted-Date: Tue, 02 Feb 2021 10:23:32 GMT
robots: checked via crawler-commons 1.2-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2021-04
operator: Common Crawl Admin (info@commoncrawl.org)
description: Wide crawl of the web for January 2021
publisher: Common Crawl




### 1. The WET file doesn't contain the HTTP headers, just the `title` and `text`.

In [47]:
record2_wet = next(records_wet)

In [48]:
record2_wet.rec_type

'conversion'

In [49]:
record2_wet.rec_headers.headers

[('WARC-Type', 'conversion'),
 ('WARC-Target-URI', 'http://006655e.com/a/137338_com/1227.html'),
 ('WARC-Date', '2021-01-28T16:07:26Z'),
 ('WARC-Record-ID', '<urn:uuid:8f604c56-3ec3-4330-9fc5-f8c58385ee26>'),
 ('WARC-Refers-To', '<urn:uuid:f4c525de-b5c5-4f1a-adbf-8d937eaa8061>'),
 ('WARC-Block-Digest', 'sha1:4PD4NBQBIFRWCSM527TWPMOCY5P2CUKR'),
 ('WARC-Identified-Content-Language', 'zho'),
 ('Content-Type', 'text/plain'),
 ('Content-Length', '2767')]

In [50]:
record2_wet.http_headers

In [51]:
a2_wet = record2_wet.content_stream().read()

### 2. The first line is the `title` of the page, everything else is the `text`.

In [52]:
print(a2_wet.decode('utf-8')[:1000])

本港开奖直播现场比上年同期多4146亿元br 导致_神灯论坛,www.006655.com,香港马会资料大全,白小姐独家四不象007,港龙神算网永久域算ww6882,137338.com,www.00475.com
网站首页
神灯论坛
www.006655.com
香港马会资料大全
白小姐独家四不象007
港龙神算网永久域算ww6882
137338.com
www.00475.com
栏目导航
神灯论坛 www.006655.com 香港马会资料大全 白小姐独家四不象007 港龙神算网永久域算ww6882 137338.com www.00475.com
滚动新闻
重庆未来楼市怎样房地产是否还
本港台开奖成果直播这就是购房
白小姐四肖必选一肖让你怀才不
在线本港台直播光电能中央热水
www.684000.com中央热水器的功
水星是太阳系8大行星中体积最
叩富网同城理财是做什么的？想
六开彩开奖结果现在办企业能够
重庆检方：四种利用虚假房产信
藏宝图QQ头像的尺寸是多少？
上海票据交易所领导班子国庆节
www.47748.com重庆楼市_重庆房
抚顺市第四医院为鼓励军嫂自拍
上海票据交易所
中央热水器是什么？管家婆彩图
137338.com
当前位置：主页 > 137338.com >
本港开奖直播现场比上年同期多4146亿元br 导致
发布日期:2020-01-11 12:42 来源:未知 阅读: 次
比上年同期多4146亿元。
导致银行负债端本钱晋升。就这样猝不迭防地来了。雪天起雾也精力。假如公然职工薪酬，该单位之所以这么做，本次会议由嘀嗒出行承办。记者从国度卫生计生委获悉,小鱼儿主页马会开将，中共北京市委、北京市国民政府代表全市人民向受灾地域人民表现深切慰劳，也能很好的辅助调节不良习惯，刷牙不要过快也不要过慢。
持续3个月位于临界点以下，旁边投入价钱指数为53.就可能印证这个谜底了。哪怕已经是国家奖学金取得者，海外获得本科以上学位都能够直接落户。3、统一单位缴满六个月社保，爱吾及吾金光华广场上，五大主题打卡点……从人民公园散步到人民南商圈，例如退出《中导条约》、对《新削减战略武器公约》不满、宣布新的核策略、导弹防备、国防保险等文件，新型的小当量核武器、或可控当量核兵器等新型核武器体系的呈现可能性增添。
更是直接表白了扫兴跟反对。奥巴马发展经济的主要举动就是推
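
Since the first line is the title and the rest is the text, the payload can be split with a single `partition`. This is a minimal sketch with a hypothetical helper name, `split_wet_payload`. It assumes every page has a title line, which may not hold for pages without a `<title>`.

```python
def split_wet_payload(payload):
    """Split a WET conversion payload (bytes) into (title, text)."""
    decoded = payload.decode("utf-8", errors="replace")
    # Everything before the first newline is treated as the title.
    title, _, text = decoded.partition("\n")
    return title.strip(), text.strip()


# Made-up payload with the same Title\nText layout.
sample_payload = "Example title\nFirst line of text\nSecond line\n".encode("utf-8")
wet_title, wet_text = split_wet_payload(sample_payload)
print(wet_title)
```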

In [53]:
soup_wet = BeautifulSoup(a2_wet.decode('utf-8', errors='ignore'), 'html.parser')
str(soup_wet)

'本港开奖直播现场比上年同期多4146亿元br 导致_神灯论坛,www.006655.com,香港马会资料大全,白小姐独家四不象007,港龙神算网永久域算ww6882,137338.com,www.00475.com\n网站首页\n神灯论坛\nwww.006655.com\n香港马会资料大全\n白小姐独家四不象007\n港龙神算网永久域算ww6882\n137338.com\nwww.00475.com\n栏目导航\n神灯论坛 www.006655.com 香港马会资料大全 白小姐独家四不象007 港龙神算网永久域算ww6882 137338.com www.00475.com\n滚动新闻\n重庆未来楼市怎样房地产是否还\n本港台开奖成果直播这就是购房\n白小姐四肖必选一肖让你怀才不\n在线本港台直播光电能中央热水\nwww.684000.com中央热水器的功\n水星是太阳系8大行星中体积最\n叩富网同城理财是做什么的？想\n六开彩开奖结果现在办企业能够\n重庆检方：四种利用虚假房产信\n藏宝图QQ头像的尺寸是多少？\n上海票据交易所领导班子国庆节\nwww.47748.com重庆楼市_重庆房\n抚顺市第四医院为鼓励军嫂自拍\n上海票据交易所\n中央热水器是什么？管家婆彩图\n137338.com\n当前位置：主页 &gt; 137338.com &gt;\n本港开奖直播现场比上年同期多4146亿元br 导致\n发布日期:2020-01-11 12:42 来源:未知 阅读: 次\n比上年同期多4146亿元。\n导致银行负债端本钱晋升。就这样猝不迭防地来了。雪天起雾也精力。假如公然职工薪酬，该单位之所以这么做，本次会议由嘀嗒出行承办。记者从国度卫生计生委获悉,小鱼儿主页马会开将，中共北京市委、北京市国民政府代表全市人民向受灾地域人民表现深切慰劳，也能很好的辅助调节不良习惯，刷牙不要过快也不要过慢。\n持续3个月位于临界点以下，旁边投入价钱指数为53.就可能印证这个谜底了。哪怕已经是国家奖学金取得者，海外获得本科以上学位都能够直接落户。3、统一单位缴满六个月社保，爱吾及吾金光华广场上，五大主题打卡点……从人民公园散步到人民南商圈，例如退出《中导条约》、对《新削减战略武器公约》不满、宣布新的核策略、导弹防备、国防保险等文件，新型的小当量核武器、或可控当量核兵器等新型核武器

In [54]:
record3_wet = next(records_wet)
record3_wet.rec_type

'conversion'

In [55]:
record3_wet.rec_headers.get_header('WARC-Target-URI')

'http://01-news.ru/sport/apl-prodlila-pauzu-v-sezone/'

In [56]:
a3_wet = record3_wet.content_stream().read()

### 3. This page is a Russian news article

In [57]:
print(a3_wet.decode('utf-8')[:1000])

АПЛ продлила паузу в сезоне
НЕ ПРОПУСТИ
Конте принял решение о будущем Эриксена
Татьяна Волосожар сообщила о беременности
Кокорин посмотрел тренировку «Фиорентины» и пообщался с Рибери (видео)
Ничушкин забил в меньшинстве! Обокрал соперника и «завёз» шайбу в ворота
Вратарь «Рубина» Городовой на правах субаренды перешел в «СКА-Хабаровск»
На Халка претендует сразу несколько клубов
Косторная объяснила, почему не будет участвовать в Кубке Первого канала
Пониженный гемоглобин — симптомы и лечение
Густая кровь: 5 продуктов, которые этого не допустят
Раскрыты подробности переговоров Хабиба и президента UFC
НОВОСТНОЙ ЖУРНАЛ
Главная
АВТО
НАУКА
ЗДОРОВЬЕ
КУЛЬТУРА
ПОЛИТИКА
СПОРТ
ФИНАНСЫ
ЭКОНОМИКА
Главная » СПОРТ » АПЛ продлила паузу в сезоне
АПЛ продлила паузу в сезоне
Фото:
Michael Regan / Getty Images
Все новости на карте
Английская премьер-лига (АПЛ) не возобновит сезон-2019/2020 в начале мая. Пауза в чемпионате продлена на неопределенный срок, говорится в заявлении лиги.
«АПЛ вернется, когда э

In [58]:
soup2_wet = BeautifulSoup(a3_wet.decode('utf-8', errors='ignore'), 'html.parser')
str(soup2_wet)

'АПЛ продлила паузу в сезоне\nНЕ ПРОПУСТИ\nКонте принял решение о будущем Эриксена\nТатьяна Волосожар сообщила о беременности\nКокорин посмотрел тренировку «Фиорентины» и пообщался с Рибери (видео)\nНичушкин забил в меньшинстве! Обокрал соперника и «завёз» шайбу в ворота\nВратарь «Рубина» Городовой на правах субаренды перешел в «СКА-Хабаровск»\nНа Халка претендует сразу несколько клубов\nКосторная объяснила, почему не будет участвовать в Кубке Первого канала\nПониженный гемоглобин — симптомы и лечение\nГустая кровь: 5 продуктов, которые этого не допустят\nРаскрыты подробности переговоров Хабиба и президента UFC\nНОВОСТНОЙ ЖУРНАЛ\nГлавная\nАВТО\nНАУКА\nЗДОРОВЬЕ\nКУЛЬТУРА\nПОЛИТИКА\nСПОРТ\nФИНАНСЫ\nЭКОНОМИКА\nГлавная » СПОРТ » АПЛ продлила паузу в сезоне\nАПЛ продлила паузу в сезоне\nФото:\nMichael Regan / Getty Images\nВсе новости на карте\nАнглийская премьер-лига (АПЛ) не возобновит сезон-2019/2020 в начале мая. Пауза в чемпионате продлена на неопределенный срок, говорится в заявлении 

### 4. And the next page

In [59]:
record4_wet = next(records_wet)
record4_wet.rec_type

'conversion'

In [60]:
record4_wet.rec_headers.get_header('WARC-Target-URI')

'http://05rjo8c.cn/3871_6589_20220_659112/535091.html'

In [61]:
a4_wet = record4_wet.content_stream().read()

### 5. More text

In [62]:
print(a4_wet.decode('utf-8')[:1000])

欧美激情幼幼片,内射大奶女教师,波波妹百度云分享
歡迎光臨上海蘇鵬實業有限公司！
收藏本頁
上海愛慧氏科學儀器有限公司
021-58482099
首頁
關于我們
產品展示
市場與服務
新聞中心
聯系我們
英文版Engilsh
聯系我們
地址：上海市浦s4d6s54號
郵編：200137
電話：122437985952
傳真：44654654
郵箱：
首頁>> 新聞中心
欧美激情幼幼片,内射大奶女教师,波波妹百度云分享
4月2日，潘石屹發本站說：昨天，我和王寶強、陳蓉、郭碧婷 在十七道溝村一起做黑山豬，在農民家的柴火鍋裏做豬肉炖粉條，王寶強主廚，我切肉燒火。說到這裏，你可能就明白了，王寶強1984年出生于河北省邢台市南和縣賈宋鎮大會塔村，是地地道道的貧困出身，在童年就像其他農村孩子一樣不被人關注。在母親的記憶中，王寶強的衣裳都是撿他哥哥姐姐剩下的。所以王寶強回家看父親也是理所當然，就像打大衣哥朱之文一樣，不忘本，出身于農村，雖然成爲了大明星，但卻沒有忘記自己來自農村。一到農村就如魚得水。現在看來，此話不假。王寶強在農村很開心，并沒有像一些人那樣做作，強調農村的不衛生。是中國最出色的草根明星。潘石屹說：我心中的中國菜得有蔥姜蒜炝鍋，可是這三樣一樣都沒有。王寶強到了河北就到了他的老家，沒有任何的陌生感和違和感，就像到了他家裏一樣，指揮着我們。2018年2月，其參演的電影《唐人街探案2》在中國大陸上映。3月26日，第九屆中國電影金掃帚獎在北京頒獎，王寶強憑處女作《大鬧天竺》獲得最令人失望導演獎。他在現場領取了“最令人失望導演”，并表示“感謝金掃帚給我這樣一個機會，讓我跟觀衆說一句對不起�欧美激情幼幼片��。由此可見，王寶強是個多麽有擔當的男人，就這樣馬蓉還背叛，出軌。2016年8月，王寶強在本站發布聲明，稱馬蓉與經紀人宋喆發生婚外不正當兩性關系，鄭重決定解除與妻子馬蓉的婚姻關系，同時解除宋喆經紀人的職務。而寶寶工作也還是對員工很盡職，帶着員工去國外玩呢。而在今年2月13日，馬蓉發文表示不認同法院對王寶強離婚案判決，将會提起上訴。看來寶寶的一生就壞在這個女人身上了，不過做自己就好。
10月9日，孫俪[本站]正在本身本站上傳一張談天截圖，并配文稱：“可惡的mm早上用姨媽賬号給我發語音提示爾，本�内射大奶女教师�月朔要食齋。”她借配上淺笑臉色，看起來非常高興。小花mm也是很知心了

In [63]:
soup3_wet = BeautifulSoup(a4_wet.decode('utf-8', errors='ignore'), 'html.parser')
str(soup3_wet)

'欧美激情幼幼片,内射大奶女教师,波波妹百度云分享\n歡迎光臨上海蘇鵬實業有限公司！\n收藏本頁\n上海愛慧氏科學儀器有限公司\n021-58482099\n首頁\n關于我們\n產品展示\n市場與服務\n新聞中心\n聯系我們\n英文版Engilsh\n聯系我們\n地址：上海市浦s4d6s54號\n郵編：200137\n電話：122437985952\n傳真：44654654\n郵箱：\n首頁&gt;&gt; 新聞中心\n欧美激情幼幼片,内射大奶女教师,波波妹百度云分享\n4月2日，潘石屹發本站說：昨天，我和王寶強、陳蓉、郭碧婷 在十七道溝村一起做黑山豬，在農民家的柴火鍋裏做豬肉炖粉條，王寶強主廚，我切肉燒火。說到這裏，你可能就明白了，王寶強1984年出生于河北省邢台市南和縣賈宋鎮大會塔村，是地地道道的貧困出身，在童年就像其他農村孩子一樣不被人關注。在母親的記憶中，王寶強的衣裳都是撿他哥哥姐姐剩下的。所以王寶強回家看父親也是理所當然，就像打大衣哥朱之文一樣，不忘本，出身于農村，雖然成爲了大明星，但卻沒有忘記自己來自農村。一到農村就如魚得水。現在看來，此話不假。王寶強在農村很開心，并沒有像一些人那樣做作，強調農村的不衛生。是中國最出色的草根明星。潘石屹說：我心中的中國菜得有蔥姜蒜炝鍋，可是這三樣一樣都沒有。王寶強到了河北就到了他的老家，沒有任何的陌生感和違和感，就像到了他家裏一樣，指揮着我們。2018年2月，其參演的電影《唐人街探案2》在中國大陸上映。3月26日，第九屆中國電影金掃帚獎在北京頒獎，王寶強憑處女作《大鬧天竺》獲得最令人失望導演獎。他在現場領取了“最令人失望導演”，并表示“感謝金掃帚給我這樣一個機會，讓我跟觀衆說一句對不起�欧美激情幼幼片��。由此可見，王寶強是個多麽有擔當的男人，就這樣馬蓉還背叛，出軌。2016年8月，王寶強在本站發布聲明，稱馬蓉與經紀人宋喆發生婚外不正當兩性關系，鄭重決定解除與妻子馬蓉的婚姻關系，同時解除宋喆經紀人的職務。而寶寶工作也還是對員工很盡職，帶着員工去國外玩呢。而在今年2月13日，馬蓉發文表示不認同法院對王寶強離婚案判決，将會提起上訴。看來寶寶的一生就壞在這個女人身上了，不過做自己就好。\n10月9日，孫俪[本站]正在本身本站上傳一張談天截圖，并配文稱：“可惡的mm早上用姨媽賬号給我發語音提示爾，本�内射大奶女教师�月朔要食齋。

In [64]:
r_wet.close()

# Reading WAT

In [65]:
r_wat = requests.get(wat_url, stream=True)
records_wat = ArchiveIterator(r_wat.raw)

### 1. Again the first record is the `warcinfo` header

In [66]:
record1_wat = next(records_wat)

In [67]:
record1_wat.rec_type

'warcinfo'

In [68]:
a1_wat = record1_wat.content_stream().read()
print(a1_wat.decode('utf-8'))

Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20210112092133
Extracted-Date: Tue, 02 Feb 2021 10:23:32 GMT
ip: 10.67.67.172
hostname: ip-10-67-67-172.ec2.internal
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf




### 2. The next one is `metadata` about the WARC records themselves

In [69]:
record2_wat = next(records_wat)

In [70]:
record2_wat.rec_type

'metadata'

In [71]:
record2_wat.rec_headers.headers

[('WARC-Type', 'metadata'),
 ('WARC-Target-URI', 'CC-MAIN-20210128134124-20210128164124-00799.warc.gz'),
 ('WARC-Date', '2021-02-02T10:23:32Z'),
 ('WARC-Record-ID', '<urn:uuid:fc296a35-2579-4d84-815b-e8c91e8dd137>'),
 ('WARC-Refers-To', '<urn:uuid:417d8ded-caa8-4bc1-b819-8f01e3632199>'),
 ('Content-Type', 'application/json'),
 ('Content-Length', '1238')]

In [72]:
record2_wat.http_headers

In [73]:
a2_wat = record2_wat.content_stream().read()

In [74]:
data1_wat = json.loads(a2_wat.decode('utf-8'))
data1_wat

{'Container': {'Filename': 'CC-MAIN-20210128134124-20210128164124-00799.warc.gz',
  'Compressed': True,
  'Offset': '0',
  'Gzip-Metadata': {'Deflate-Length': '481',
   'Header-Length': '10',
   'Footer-Length': '8',
   'Inflated-CRC': '1137911655',
   'Inflated-Length': '765'}},
 'Envelope': {'Payload-Metadata': {'Actual-Content-Length': '502',
   'Block-Digest': 'sha1:WWAPEV5C2ZSXR7PKEPOWFOXFNDZ473W4',
   'Trailing-Slop-Length': '0',
   'Headers-Corrupt': True,
   'Actual-Content-Type': 'application/warc-fields',
   'WARC-Info-Metadata': {'isPartOf': 'CC-MAIN-2021-04',
    'publisher': 'Common Crawl',
    'description': 'Wide crawl of the web for January 2021',
    'operator': 'Common Crawl Admin (info@commoncrawl.org)',
    'hostname': 'ip-10-67-67-246.ec2.internal',
    'software': 'Apache Nutch 1.17 (modified, https://github.com/commoncrawl/nutch/)',
    'robots': 'checked via crawler-commons 1.2-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)',
    'format': 'WARC F

### 3. The next record contains all the `metadata` of the first *request*

In [75]:
record3_wat = next(records_wat)

In [76]:
record3_wat.rec_type

'metadata'

In [77]:
record3_wat.rec_headers.headers

[('WARC-Type', 'metadata'),
 ('WARC-Target-URI', 'http://006655e.com/a/137338_com/1227.html'),
 ('WARC-Date', '2021-02-02T10:23:32Z'),
 ('WARC-Record-ID', '<urn:uuid:1c211ddb-24e6-4b7a-92a0-8d087a1300a8>'),
 ('WARC-Refers-To', '<urn:uuid:9b996c49-f53c-43db-9ce1-18f1f72679ee>'),
 ('Content-Type', 'application/json'),
 ('Content-Length', '1382')]

In [78]:
record3_wat.http_headers

In [79]:
a3_wat = record3_wat.content_stream().read()

### 4. `Container` shows where the record sits in the WARC file; this one is about the request

In [80]:
data2_wat = json.loads(a3_wat.decode('utf-8'))
data2_wat

{'Container': {'Filename': 'CC-MAIN-20210128134124-20210128164124-00799.warc.gz',
  'Compressed': True,
  'Offset': '481',
  'Gzip-Metadata': {'Deflate-Length': '431',
   'Header-Length': '10',
   'Footer-Length': '8',
   'Inflated-CRC': '151536073',
   'Inflated-Length': '632'}},
 'Envelope': {'Payload-Metadata': {'Actual-Content-Type': 'application/http; msgtype=request',
   'HTTP-Request-Metadata': {'Request-Message': {'Method': 'GET',
     'Path': '/a/137338_com/1227.html',
     'Version': 'HTTP/1.1'},
    'Headers-Length': '267',
    'Headers': {'User-Agent': 'CCBot/2.0 (https://commoncrawl.org/faq/)',
     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
     'Accept-Language': 'en-US,en;q=0.5',
     'Accept-Encoding': 'br,gzip',
     'Host': '006655e.com',
     'Connection': 'Keep-Alive'},
    'Entity-Length': '0',
    'Entity-Digest': 'sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ',
    'Entity-Trailing-Slop-Length': '0'},
   'Actual-Content-Length': '269',
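
The `Offset` and `Deflate-Length` values look like enough to locate a single compressed record inside the WARC file: 481 + 431 = 912, which is exactly the `Offset` reported for the response record later in this notebook. The sketch below builds an HTTP `Range` header value from those two fields. `record_byte_range` is a hypothetical helper, and it assumes that `Deflate-Length` is the full compressed size of the record and that the server honours range requests.

```python
def record_byte_range(container):
    """Build an HTTP Range header value for one gzipped WARC record."""
    offset = int(container["Offset"])
    length = int(container["Gzip-Metadata"]["Deflate-Length"])
    # Range end positions are inclusive, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


# Trimmed copy of the 'Container' entry shown above.
container = {"Offset": "481", "Gzip-Metadata": {"Deflate-Length": "431"}}
range_header = record_byte_range(container)
print(range_header)
```

The result could then be passed as `headers={'Range': range_header}` to `requests.get(warc_url, ...)` to download only that record.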


### 5. Notice it's `HTTP-Request-Metadata`

In [81]:
data2_wat['Envelope']

{'Payload-Metadata': {'Actual-Content-Type': 'application/http; msgtype=request',
  'HTTP-Request-Metadata': {'Request-Message': {'Method': 'GET',
    'Path': '/a/137338_com/1227.html',
    'Version': 'HTTP/1.1'},
   'Headers-Length': '267',
   'Headers': {'User-Agent': 'CCBot/2.0 (https://commoncrawl.org/faq/)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'br,gzip',
    'Host': '006655e.com',
    'Connection': 'Keep-Alive'},
   'Entity-Length': '0',
   'Entity-Digest': 'sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ',
   'Entity-Trailing-Slop-Length': '0'},
  'Actual-Content-Length': '269',
  'Block-Digest': 'sha1:UXL4YVXNV3UXFRPLTHLDPZR65IGIIKM2',
  'Trailing-Slop-Length': '4'},
 'Format': 'WARC',
 'WARC-Header-Length': '359',
 'WARC-Header-Metadata': {'WARC-Type': 'request',
  'WARC-Date': '2021-01-28T16:07:26Z',
  'WARC-Record-ID': '<urn:uuid:9b996c49-f53c-43db-9ce1-18f1f72679ee>',
  'Conten

### 6. And the next one is about the `response`

In [82]:
record4_wat = next(records_wat)

In [83]:
record4_wat.rec_type

'metadata'

In [84]:
record4_wat.rec_headers.headers

[('WARC-Type', 'metadata'),
 ('WARC-Target-URI', 'http://006655e.com/a/137338_com/1227.html'),
 ('WARC-Date', '2021-02-02T10:23:32Z'),
 ('WARC-Record-ID', '<urn:uuid:7b9daff9-7c0e-4d36-8add-c0f38154f836>'),
 ('WARC-Refers-To', '<urn:uuid:f4c525de-b5c5-4f1a-adbf-8d937eaa8061>'),
 ('Content-Type', 'application/json'),
 ('Content-Length', '7869')]

In [85]:
record4_wat.http_headers

In [86]:
a4_wat = record4_wat.content_stream().read()

### 7. Envelope contains the details

In [87]:
data3_wat = json.loads(a4_wat.decode('utf-8'))
data3_wat

{'Container': {'Filename': 'CC-MAIN-20210128134124-20210128164124-00799.warc.gz',
  'Compressed': True,
  'Offset': '912',
  'Gzip-Metadata': {'Deflate-Length': '3711',
   'Header-Length': '10',
   'Footer-Length': '8',
   'Inflated-CRC': '1822421432',
   'Inflated-Length': '8027'}},
 'Envelope': {'Payload-Metadata': {'Actual-Content-Type': 'application/http; msgtype=response',
   'HTTP-Response-Metadata': {'Response-Message': {'Status': '200',
     'Version': 'HTTP/1.1',
     'Reason': 'OK'},
    'Headers-Length': '337',
    'Headers': {'Content-Type': 'text/html',
     'X-Crawler-Content-Encoding': 'gzip',
     'Last-Modified': 'Sat, 11 Jan 2020 04:42:04 GMT',
     'Accept-Ranges': 'bytes',
     'ETag': '"ff902b7939c8d51:0"',
     'Vary': 'Accept-Encoding',
     'Server': 'Microsoft-IIS/7.5',
     'X-Powered-By': 'ASP.NET',
     'Date': 'Thu, 28 Jan 2021 16:07:18 GMT',
     'X-Crawler-Content-Length': '4489',
     'Content-Length': '7084'},
    'HTML-Metadata': {'Head': {'Metas': [{'

### 8. Here we've got the `HTTP headers` and `response metadata`

In [88]:
data3_wat['Envelope']['Payload-Metadata']

{'Actual-Content-Type': 'application/http; msgtype=response',
 'HTTP-Response-Metadata': {'Response-Message': {'Status': '200',
   'Version': 'HTTP/1.1',
   'Reason': 'OK'},
  'Headers-Length': '337',
  'Headers': {'Content-Type': 'text/html',
   'X-Crawler-Content-Encoding': 'gzip',
   'Last-Modified': 'Sat, 11 Jan 2020 04:42:04 GMT',
   'Accept-Ranges': 'bytes',
   'ETag': '"ff902b7939c8d51:0"',
   'Vary': 'Accept-Encoding',
   'Server': 'Microsoft-IIS/7.5',
   'X-Powered-By': 'ASP.NET',
   'Date': 'Thu, 28 Jan 2021 16:07:18 GMT',
   'X-Crawler-Content-Length': '4489',
   'Content-Length': '7084'},
  'HTML-Metadata': {'Head': {'Metas': [{'content': 'text/html; charset=gb2312',
      'http-equiv': 'Content-Type'},
     {'name': 'keywords', 'content': '本港开奖直播现'},
     {'name': 'description',
      'content': '比上年同期多4146亿元。 导致银行负债端本钱晋升。就这样猝不迭防地来了。雪天起雾也精力。假如公然职工薪酬，该单位之所以这么做，本次会议由嘀嗒出行承办。记者从国度卫生计生委获悉, 小鱼儿主页马会开将 ，中共北京市委、北京市国民政府代表全市人民向受灾'}],
    'Title': '本港开奖直播现场比上年同期多4146亿元br 导致_神灯论坛,www.0

In [89]:
data3_wat['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']

{'Response-Message': {'Status': '200', 'Version': 'HTTP/1.1', 'Reason': 'OK'},
 'Headers-Length': '337',
 'Headers': {'Content-Type': 'text/html',
  'X-Crawler-Content-Encoding': 'gzip',
  'Last-Modified': 'Sat, 11 Jan 2020 04:42:04 GMT',
  'Accept-Ranges': 'bytes',
  'ETag': '"ff902b7939c8d51:0"',
  'Vary': 'Accept-Encoding',
  'Server': 'Microsoft-IIS/7.5',
  'X-Powered-By': 'ASP.NET',
  'Date': 'Thu, 28 Jan 2021 16:07:18 GMT',
  'X-Crawler-Content-Length': '4489',
  'Content-Length': '7084'},
 'HTML-Metadata': {'Head': {'Metas': [{'content': 'text/html; charset=gb2312',
     'http-equiv': 'Content-Type'},
    {'name': 'keywords', 'content': '本港开奖直播现'},
    {'name': 'description',
     'content': '比上年同期多4146亿元。 导致银行负债端本钱晋升。就这样猝不迭防地来了。雪天起雾也精力。假如公然职工薪酬，该单位之所以这么做，本次会议由嘀嗒出行承办。记者从国度卫生计生委获悉, 小鱼儿主页马会开将 ，中共北京市委、北京市国民政府代表全市人民向受灾'}],
   'Title': '本港开奖直播现场比上年同期多4146亿元br 导致_神灯论坛,www.006655.com,香港马会资料大全,白小姐独家四不象007,港龙神算网永久域算ww6882,137338.com,www.00',
   'Link': [{'path': 'LINK@/href',
     'url':

### 9. `HTML-Metadata` contains the `title`, `metas` and `scripts` from the `head`, as well as `links` from the body itself.

In [90]:
data3_wat['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']

{'Head': {'Metas': [{'content': 'text/html; charset=gb2312',
    'http-equiv': 'Content-Type'},
   {'name': 'keywords', 'content': '本港开奖直播现'},
   {'name': 'description',
    'content': '比上年同期多4146亿元。 导致银行负债端本钱晋升。就这样猝不迭防地来了。雪天起雾也精力。假如公然职工薪酬，该单位之所以这么做，本次会议由嘀嗒出行承办。记者从国度卫生计生委获悉, 小鱼儿主页马会开将 ，中共北京市委、北京市国民政府代表全市人民向受灾'}],
  'Title': '本港开奖直播现场比上年同期多4146亿元br 导致_神灯论坛,www.006655.com,香港马会资料大全,白小姐独家四不象007,港龙神算网永久域算ww6882,137338.com,www.00',
  'Link': [{'path': 'LINK@/href',
    'url': '/skin/css/common.css',
    'rel': 'stylesheet',
    'type': 'text/css'},
   {'path': 'LINK@/href',
    'url': '/skin/css/style.css',
    'rel': 'stylesheet',
    'type': 'text/css'}],
  'Scripts': [{'path': 'SCRIPT@/src',
    'url': '/caiyuan/ytbf.js',
    'type': 'text/javascript'},
   {'path': 'SCRIPT@/src',
    'url': '/plus/count.php?view=yes&aid=1227&mid=1',
    'type': 'text/javascript'}]},
 'Links': [{'path': 'IMG@/src', 'url': '/skin/images/logo.png'},
  {'path': 'A@/href',
   'url': '/',
   'title': '神灯论坛,www.
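
Because not every WAT record is an HTML response, reaching into these nested keys directly can raise `KeyError`. A tolerant accessor is sketched below: `html_metadata` is a hypothetical helper, and `sample_wat` is a trimmed-down stand-in for the parsed record above, keeping only the key names that appear in the output.

```python
def html_metadata(wat_record):
    """Return (title, link_urls) from a parsed WAT record, if present."""
    meta = (
        wat_record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    title = meta.get("Head", {}).get("Title")
    urls = [link["url"] for link in meta.get("Links", []) if "url" in link]
    return title, urls


# Trimmed-down stand-in using the same key names as the output above.
sample_wat = {
    "Envelope": {
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Head": {"Title": "Example title"},
                    "Links": [
                        {"path": "IMG@/src", "url": "/skin/images/logo.png"},
                        {"path": "A@/href", "url": "/"},
                    ],
                }
            }
        }
    }
}
page_title, page_urls = html_metadata(sample_wat)
print(page_title, page_urls)
```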

### 10. The next record corresponds to the `metadata` record of the fetch

In [91]:
record5_wat = next(records_wat)

In [92]:
record5_wat.rec_type

'metadata'

In [93]:
record5_wat.rec_headers.headers

[('WARC-Type', 'metadata'),
 ('WARC-Target-URI', 'http://006655e.com/a/137338_com/1227.html'),
 ('WARC-Date', '2021-02-02T10:23:32Z'),
 ('WARC-Record-ID', '<urn:uuid:081280ee-ef22-485f-8dab-151a912208b3>'),
 ('WARC-Refers-To', '<urn:uuid:31726ea5-ff9e-4759-8b57-f63835afe1f8>'),
 ('Content-Type', 'application/json'),
 ('Content-Length', '1233')]

In [94]:
a5_wat = record5_wat.content_stream().read()

### 11. This envelope contains `WARC-Metadata-Metadata`, which holds the actual fields of the `metadata` record.

In [95]:
data4_wat = json.loads(a5_wat)
data4_wat

{'Container': {'Filename': 'CC-MAIN-20210128134124-20210128164124-00799.warc.gz',
  'Compressed': True,
  'Offset': '4623',
  'Gzip-Metadata': {'Deflate-Length': '425',
   'Header-Length': '10',
   'Footer-Length': '8',
   'Inflated-CRC': '-1794557766',
   'Inflated-Length': '593'}},
 'Envelope': {'Payload-Metadata': {'Actual-Content-Type': 'application/metadata-fields',
   'WARC-Metadata-Metadata': {'Metadata-Records': [{'Name': 'fetchTimeMs',
      'Value': '200'},
     {'Name': 'charset-detected', 'Value': 'GB2312'},
     {'Name': 'languages-cld2',
      'Value': '{"reliable":true,"text-bytes":2556,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.94,"score":1999.0,"name":"Chinese"}]}'}]},
   'Actual-Content-Length': '202',
   'Block-Digest': 'sha1:JV3MBQP6U6WDROBGG2KP6NUYV33UHXEY',
   'Trailing-Slop-Length': '0'},
  'Format': 'WARC',
  'WARC-Header-Length': '387',
  'WARC-Header-Metadata': {'WARC-Type': 'metadata',
   'WARC-Date': '2021-01-28T16:07:26Z',
   'WARC-Re
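
The `Metadata-Records` list is a sequence of `Name`/`Value` pairs, and the `languages-cld2` value is itself a JSON string. This sketch turns the list into a plain dictionary and decodes the nested JSON. The helper name `metadata_records_to_dict` is hypothetical, and the sample payload copies only the values shown in the output above.

```python
import json


def metadata_records_to_dict(payload_metadata):
    """Turn the Name/Value pairs of a WAT metadata payload into a dict."""
    records = (payload_metadata
               .get('WARC-Metadata-Metadata', {})
               .get('Metadata-Records', []))
    fields = {record['Name']: record['Value'] for record in records}

    # languages-cld2 is JSON encoded inside the JSON, so decode it again
    if 'languages-cld2' in fields:
        fields['languages-cld2'] = json.loads(fields['languages-cld2'])
    return fields


# Values copied from the record printed above
sample_payload = {
    'Actual-Content-Type': 'application/metadata-fields',
    'WARC-Metadata-Metadata': {'Metadata-Records': [
        {'Name': 'fetchTimeMs', 'Value': '200'},
        {'Name': 'charset-detected', 'Value': 'GB2312'},
        {'Name': 'languages-cld2',
         'Value': '{"reliable":true,"text-bytes":2556,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.94,"score":1999.0,"name":"Chinese"}]}'},
    ]},
}
fields = metadata_records_to_dict(sample_payload)
top_language = fields['languages-cld2']['languages'][0]['code']  # 'zh'
```

The other values, such as `fetchTimeMs`, stay as strings. Converting them is left to the caller.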

### 12. The same pattern of request, response and metadata repeats for the next URL

In [96]:
record6_wat = next(records_wat)
record6_wat.rec_type, record6_wat.rec_headers.get_header('WARC-Target-URI')

('metadata', 'http://01-news.ru/sport/apl-prodlila-pauzu-v-sezone/')

In [97]:
data5_wat = json.loads(record6_wat.content_stream().read())
data5_wat['Envelope']['Payload-Metadata'].keys()

dict_keys(['Actual-Content-Type', 'HTTP-Request-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])

In [98]:
record7_wat = next(records_wat)
record7_wat.rec_type, record7_wat.rec_headers.get_header('WARC-Target-URI')

('metadata', 'http://01-news.ru/sport/apl-prodlila-pauzu-v-sezone/')

In [99]:
data6_wat = json.loads(record7_wat.content_stream().read())
data6_wat['Envelope']['Payload-Metadata'].keys()

dict_keys(['Actual-Content-Type', 'HTTP-Response-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])

In [100]:
record8_wat = next(records_wat)
record8_wat.rec_type, record8_wat.rec_headers.get_header('WARC-Target-URI')

('metadata', 'http://01-news.ru/sport/apl-prodlila-pauzu-v-sezone/')

In [101]:
data7_wat = json.loads(record8_wat.content_stream().read())
data7_wat['Envelope']['Payload-Metadata'].keys()

dict_keys(['Actual-Content-Type', 'WARC-Metadata-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])
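
The three `dict_keys` outputs above differ only in which one of `HTTP-Request-Metadata`, `HTTP-Response-Metadata` or `WARC-Metadata-Metadata` is present. A minimal sketch of a helper that names the kind of WAT record from that key follows. `payload_kind` is a hypothetical name, and the `'other'` fallback is an assumption for envelopes that carry none of the three keys.

```python
# Payload key -> kind of original WARC record it describes
PAYLOAD_KINDS = (
    ('HTTP-Request-Metadata', 'request'),
    ('HTTP-Response-Metadata', 'response'),
    ('WARC-Metadata-Metadata', 'metadata'),
)


def payload_kind(wat_json):
    """Name the kind of WARC record a parsed WAT record describes."""
    payload = wat_json.get('Envelope', {}).get('Payload-Metadata', {})
    for key, kind in PAYLOAD_KINDS:
        if key in payload:
            return kind
    return 'other'  # assumption: none of the three keys is present


# Made-up envelope with the same keys as the last output above
example = {'Envelope': {'Payload-Metadata': {
    'Actual-Content-Type': 'application/metadata-fields',
    'WARC-Metadata-Metadata': {'Metadata-Records': []},
}}}
kind = payload_kind(example)  # 'metadata'
```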

### 13. And so on for the URL after that

In [102]:
record9_wat = next(records_wat)
record9_wat.rec_type, record9_wat.rec_headers.get_header('WARC-Target-URI')

('metadata', 'http://05rjo8c.cn/3871_6589_20220_659112/535091.html')

In [103]:
data8_wat = json.loads(record9_wat.content_stream().read())
data8_wat['Envelope']['Payload-Metadata'].keys()

dict_keys(['Actual-Content-Type', 'HTTP-Request-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])

In [104]:
record10_wat = next(records_wat)
record10_wat.rec_type, record10_wat.rec_headers.get_header('WARC-Target-URI')

('metadata', 'http://05rjo8c.cn/3871_6589_20220_659112/535091.html')

In [105]:
data9_wat = json.loads(record10_wat.content_stream().read())
data9_wat['Envelope']['Payload-Metadata'].keys()

dict_keys(['Actual-Content-Type', 'HTTP-Response-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])

In [106]:
record11_wat = next(records_wat)
record11_wat.rec_type, record11_wat.rec_headers.get_header('WARC-Target-URI')

('metadata', 'http://05rjo8c.cn/3871_6589_20220_659112/535091.html')

In [107]:
data10_wat = json.loads(record11_wat.content_stream().read())
data10_wat['Envelope']['Payload-Metadata'].keys()

dict_keys(['Actual-Content-Type', 'WARC-Metadata-Metadata', 'Actual-Content-Length', 'Block-Digest', 'Trailing-Slop-Length'])

In [108]:
r_wat.close()
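
Calling `next` by hand gets repetitive. Here is a rough sketch of the same walk as a loop, assuming `records` is a warcio `ArchiveIterator` over a WAT file, consumed before the underlying response is closed. `payload_keys_by_url` and `COMMON_KEYS` are hypothetical names. The function uses only the record attributes already used above: `rec_type`, `rec_headers.get_header` and `content_stream()`.

```python
import json
from collections import defaultdict

# Keys every payload above shares; whatever is left identifies the record kind
COMMON_KEYS = {'Actual-Content-Type', 'Actual-Content-Length',
               'Block-Digest', 'Trailing-Slop-Length'}


def payload_keys_by_url(records, limit=None):
    """Map each WARC-Target-URI to the distinguishing payload keys seen for it.

    `limit` optionally caps how many metadata records are processed.
    """
    grouped = defaultdict(list)
    processed = 0
    for record in records:
        if limit is not None and processed >= limit:
            break
        if record.rec_type != 'metadata':
            continue  # skip anything that is not a WAT metadata record
        url = record.rec_headers.get_header('WARC-Target-URI')
        data = json.loads(record.content_stream().read())
        payload = data['Envelope']['Payload-Metadata']
        grouped[url].extend(key for key in payload if key not in COMMON_KEYS)
        processed += 1
    return dict(grouped)
```

Judging by the outputs above, each URL would be expected to map to `HTTP-Request-Metadata`, `HTTP-Response-Metadata` and `WARC-Metadata-Metadata`, in that order.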