UTF-8 decode URL + data
turian committed Dec 16, 2009
1 parent ac1bbc9 commit 247d52b
Showing 2 changed files with 3 additions and 3 deletions.
5 changes: 2 additions & 3 deletions README
@@ -1,12 +1,11 @@
 In Python, read the .80 file format, for 80legs web crawl results.
 
+The URL and data are UTF-8 decoded.
+
 From http://80legs.pbworks.com/Results:
 For people interested in deserializing in other languages, the file format this creates and reads is:
 <classID><versionID><URL-SIZE><URL><DATA-SIZE><DATA>
 Note that:
 * The last 4 items (<URL-SIZE><URL><DATA-SIZE><DATA>) repeat for each url/data pair.
 * <classID>, <versionID>, <URL-SIZE>, and <DATA-SIZE> are encoded 32-bit integers.
 * The url is encoded using UTF-8.
-
-ISSUES:
-* I don't Unicode decode either the URL or the data.
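
The layout quoted in the README above can be sketched as a small writer. This is an illustration only, not code from this repository: the function name `pack_records`, the placeholder class and version IDs, and the little-endian signed byte order are all assumptions that the README does not confirm. Sizes are taken to be byte counts of the UTF-8 encoded fields.

```python
import struct


def pack_records(pairs, class_id=0, version_id=0):
    """Serialize (url, data) text pairs into the layout quoted in the README.

    Assumptions (not confirmed by the README): little-endian signed 32-bit
    integers ("<i"), sizes counted in bytes, placeholder IDs of 0.
    """
    # <classID><versionID> appear once, at the start of the file.
    out = [struct.pack("<i", class_id), struct.pack("<i", version_id)]
    for url, data in pairs:
        url_bytes = url.encode("utf-8")
        data_bytes = data.encode("utf-8")
        # <URL-SIZE><URL><DATA-SIZE><DATA> repeat for each url/data pair.
        out.append(struct.pack("<i", len(url_bytes)))
        out.append(url_bytes)
        out.append(struct.pack("<i", len(data_bytes)))
        out.append(data_bytes)
    return b"".join(out)
```

Because the sizes count encoded bytes rather than characters, a non-ASCII string such as "h\u00e9" is written with a size of 3, not 2.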
1 change: 1 addition & 0 deletions eightyformat.py
@@ -27,6 +27,7 @@ def read(file):
 (DATASIZE,) = struct.unpack("i", l)
 # print DATASIZE
 data = str(file.read(DATASIZE))
+data = data.decode("utf-8")
 yield (url, data)
 # print data
 # print data.decode("utf-8")
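
For comparison, here is a minimal Python 3 sketch of the same read-then-decode step, where `read()` on a binary stream returns `bytes` and the `str(...)` wrapper is unnecessary. It is not the repository's `read` function: the name `read_records`, the little-endian `"<i"` byte order (the diff uses native-order `"i"`), and the choice of `errors="replace"` for page data are assumptions made for illustration.

```python
import io
import struct


def read_records(stream):
    """Yield (url, data) text pairs from a binary stream in the .80 layout.

    Assumptions: little-endian signed 32-bit sizes, and a file that starts
    with one <classID><versionID> header followed by repeated records.
    """
    header = stream.read(8)
    if len(header) < 8:
        raise ValueError("truncated header")
    class_id, version_id = struct.unpack("<ii", header)  # not used further here
    while True:
        raw = stream.read(4)
        if not raw:
            return  # clean end of file
        (url_size,) = struct.unpack("<i", raw)
        url = stream.read(url_size).decode("utf-8")
        (data_size,) = struct.unpack("<i", stream.read(4))
        # Crawled pages are not guaranteed to be valid UTF-8, so invalid
        # bytes are replaced instead of raising UnicodeDecodeError.
        data = stream.read(data_size).decode("utf-8", errors="replace")
        yield url, data
```

A file would be opened in binary mode, for example `open(path, "rb")`, and passed to `read_records`. Whether to replace, ignore, or raise on invalid bytes in the data is a design choice; strict decoding, as in the diff above, fails on the first page that is not valid UTF-8.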