A Python reader for the .80 file format used for 80legs web crawl results.
Both the URL and the data are decoded as UTF-8.
From http://80legs.pbworks.com/Results:
For people interested in deserializing in other languages, the file format this creates and reads is:
<classID><versionID><URL-SIZE><URL><DATA-SIZE><DATA>
Note that:
* The last 4 items (<URL-SIZE><URL><DATA-SIZE><DATA>) repeat for each url/data pair.
* <classID>, <versionID>, <URL-SIZE>, and <DATA-SIZE> are encoded 32-bit integers.
* The url is encoded using UTF-8.
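
The layout above can be sketched as a small reader. This is a minimal illustration, not the code in this repository: the function name `read_80` and its `byte_order` parameter are hypothetical, and the description above does not state the byte order or signedness of the integers, so big-endian signed values are assumed here and may need changing. The data is returned as raw bytes; call `.decode("utf-8")` on it to match the behaviour described at the top.

```python
import io
import struct


def _read_exact(stream, size):
    """Read exactly `size` bytes or raise EOFError on a truncated file."""
    if size < 0:
        raise ValueError("negative size field: %d" % size)
    chunk = stream.read(size)
    if len(chunk) != size:
        raise EOFError("expected %d bytes, got %d" % (size, len(chunk)))
    return chunk


def read_80(stream, byte_order=">"):
    """Yield (url, data) pairs from a binary stream in the .80 layout.

    ASSUMPTION: integers are signed and big-endian (">"); pass "<" for
    little-endian. The format description does not specify either.
    """
    int_format = byte_order + "i"

    def read_int():
        return struct.unpack(int_format, _read_exact(stream, 4))[0]

    # Header: <classID><versionID>, read once and ignored in this sketch.
    read_int()
    read_int()

    while True:
        # A clean end of file is only valid at a pair boundary.
        size_bytes = stream.read(4)
        if not size_bytes:
            return
        if len(size_bytes) != 4:
            raise EOFError("truncated URL size field")
        url_size = struct.unpack(int_format, size_bytes)[0]
        url = _read_exact(stream, url_size).decode("utf-8")
        data_size = read_int()
        data = _read_exact(stream, data_size)
        yield url, data


if __name__ == "__main__":
    # Build a tiny in-memory sample (made-up values) and read it back.
    url = "http://example.com/".encode("utf-8")
    body = b"hello"
    sample = (struct.pack(">ii", 1, 1)
              + struct.pack(">i", len(url)) + url
              + struct.pack(">i", len(body)) + body)
    for pair in read_80(io.BytesIO(sample)):
        print(pair)
```

For a file on disk, open it in binary mode, for example `open(path, "rb")`, and pass the file object to `read_80`.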