Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] PY3: use six.BytesIO and six.moves.cStringIO #803

Merged
merged 4 commits into from Jul 21, 2014

Conversation

@felixonmars
Copy link
Contributor

@felixonmars felixonmars commented Jul 15, 2014

I've followed some code and found that most of the content are binary, so I use six.BytesIO here (there's one exception that I still use StringIO for). I'm not sure if it's the best solution since it means StringIO.StringIO but not cStringIO.StringIO in Python 2.x, so performance can be affected.

An alternative way would be a try-except block to import cStringIO.StringIO first, and fallback to io.BytesIO.

felixonmars added 2 commits Jul 15, 2014
Since I don't find docs for the .reset() method, I lookup up the source
code, and it should work just the same as .seek(0).

For reference:
https://github.com/python/cpython/blob/8eeb7e9122109f5cc71c22047d5cdd312ca770a0/Modules/cStringIO.c#L281
@kmike
Copy link
Member

@kmike kmike commented Jul 15, 2014

So, some differences between cStringIO.StringIO, six.BytesIO and io.BytesIO (aka StringIO.StringIO):

  1. Speed. six.BytesIO should be the slowest, cStringIO.StringIO should be the fastest, io.BytesIO should be in between (in Python 2.7). But this really depends on how are they used. I recall trying to switch one codebase from cStringIO to BytesIO, and the code became several times slower despite the fact both are written in C. StringIO was even slower. On the other hand, I tried to replicate the results now using %timeit and failed to do so (it looks like there is something wrong with my tests - six.BytesIO was several times faster than its competitors).
  2. When cStringIO is initialzed with data, it looses "write" method.
  3. io.BytesIO raises an exception if unicode string is passed to it. six.BytesIO and cStringIO.StringIO let ASCII strings in.

I think that we shouldn't use six.BytesIO. It looks like a compatibility wrapper for Python < 2.6.

Tornado uses the following approach:

try:
    from io import BytesIO  # python 3
except ImportError:
    from cStringIO import StringIO as BytesIO  # python 2

Which is wrong - it effectively means "use io.BytesIO everywhere" because BytesIO is available since Python 2.6, and Tornado doesn't support Python 2.5.

It seems we have 3 options:

  1. use io.BytesIO everywhere;
  2. use io.BytesIO where speed doesn't matter (e.g. tests or scrapy.responsetypes) and use cStringIO in 2.x where speed matters (after checking that it makes code faster);
  3. use cStringIO / io.BytesIO fallback everywhere.

I'm leaning towards (1). I don't like (3) much because using io.BytesIO in tests is good - it will check that Scrapy code uses byte strings correctly.

If some Scrapy code uses cStringIO - specific features it should be changed to work with io.BytesIO / io.StringIO because there is no cStringIO in Python 3.x.

What do you think?

Also, //cc @dangra @pablohoffman

@felixonmars
Copy link
Contributor Author

@felixonmars felixonmars commented Jul 15, 2014

I just did a very simple test, which suggests io.BytesIO is the wtg:

import io, cStringIO, six
from timeit import timeit

def test_cStringIO():
    s = cStringIO.StringIO("initial data")
    s.read()
    s = cStringIO.StringIO()
    s.write("some data")
    s.seek(0)

def test_io_BytesIO():
    s = io.BytesIO("initial data")
    s.read()
    s = io.BytesIO()
    s.write("some data")
    s.seek(0)

def test_six_BytesIO():
    s = six.BytesIO("initial data")
    s.read()
    s = six.BytesIO()
    s.write("some data")
    s.seek(0)

print timeit(test_cStringIO, number=1000000)
print timeit(test_io_BytesIO, number=1000000)
print timeit(test_six_BytesIO, number=1000000)

Results:

2.15011286736
1.98964214325
8.85552287102
@kmike
Copy link
Member

@kmike kmike commented Jul 15, 2014

Another test:

I suggests that in Python 2.7 io.BytesIO constructor is much slower than both cStringIO and StringIO constructors. It is 400x slower than cStringIO for 3MB data. In Python 3.x it is just as slow.

At first I thought this issue can be relevant, but I'm not sure now because io.StringIO is even slower.

@kmike
Copy link
Member

@kmike kmike commented Jul 15, 2014

It seems the reason is that io.BytesIO copies the data while cStringIO.StringIO and StringIO.StringIO don't do that (they store a reference to the original buffer and create a new buffer if writing occurs in the middle).

So using io.BytesIO(data) is wasteful if it is used only to make readonly data file-like. For large data (files?) it could be slow and require 2x memory.

@felixonmars
Copy link
Contributor Author

@felixonmars felixonmars commented Jul 15, 2014

So I suppose we use io.BytesIO for tests and buffers without initial value, while adding the prefer-cStringIO tries to:

  • scrapy/contrib/pipeline/files.py
  • scrapy/contrib/pipeline/images.py
  • scrapy/contrib_exp/downloadermiddleware/decompression.py
  • scrapy/utils/gz.py
  • scrapy/utils/iterators.py

Correct me if I missed something :)

@kmike
Copy link
Member

@kmike kmike commented Jul 15, 2014

@felixonmars +1 to your suggestion.

We may want to add something copy-less for Python 3.x in files/images pipelines in future, but that's another issue.

@felixonmars
Copy link
Contributor Author

@felixonmars felixonmars commented Jul 15, 2014

Hrm, as you can see, a test failed. Pillow tries to call .fileno() but BytesIO doesn't support it (which has the method but raises an io.UnsupportedOperation error if called). This has to be solved since both Python 2 and 3 behave this way.

However, I think it's something that should be fixed on Pillow's side, since we cannot do much about it. What do you think?

@@ -77,7 +81,7 @@ def _body_or_str(obj, unicode=True):
return obj.body_as_unicode()
else:
return obj.body.decode('utf-8')
elif type(obj) is type(u''):
elif type(obj) is six.text_type:

This comment has been minimized.

@kmike

kmike Jul 15, 2014
Member

do you have any ideas why is it not isinstance?

This comment has been minimized.

@felixonmars

felixonmars Jul 15, 2014
Author Contributor

Ha, no idea actually.

AFAICS we can change it to isinstance :)

(PS: I have a big git stash about unicode, basestring, str, and so on, but still needs more effort to make sure I get each of them correct in the context)

This comment has been minimized.

@kmike

kmike Jul 15, 2014
Member

@dangra wrote this code in 3f156ad, maybe he can comment on this.

Your big stash is going to be one of the trickies parts :) See also: #778

This comment has been minimized.

@felixonmars

felixonmars Jul 15, 2014
Author Contributor

The function take a param unicode, I suppose this was why type(u'') came out. But no idea about isinstance vs type() :/

@kmike
Copy link
Member

@kmike kmike commented Jul 15, 2014

I suspect it could be hard for Pillow to fix fileno() issue - as I recall, fileno() is the only robust way to interact with Python file objects from C. But not sure. Anyways, can we fix the test (e.g. by making it use a temporary file, or by creating Image using other method)? The issue with failing test looks like a test-only issue.

@kmike
Copy link
Member

@kmike kmike commented Jul 15, 2014

LGTM. I'll leave it to @dangra or @pablohoffman to merge.

@kmike kmike changed the title PY3: use six.BytesIO and six.moves.cStringIO [MRG] PY3: use six.BytesIO and six.moves.cStringIO Jul 17, 2014
@kmike
Copy link
Member

@kmike kmike commented Jul 21, 2014

There is a good progress on BytesIO issue in Python bug tracker (see http://bugs.python.org/issue22003), so there is a chance copy-on-write will be implemented for io.BytesIO in Python 3.5.

@felixonmars
Copy link
Contributor Author

@felixonmars felixonmars commented Jul 21, 2014

Ah, good to know! So we may need more compat routines after 3.5 xD

@dangra
Copy link
Member

@dangra dangra commented Jul 21, 2014

good work @felixonmars ! thanks.

dangra added a commit that referenced this pull request Jul 21, 2014
[MRG] PY3: use six.BytesIO and six.moves.cStringIO
@dangra dangra merged commit f6b1e9b into scrapy:master Jul 21, 2014
1 check passed
1 check passed
continuous-integration/travis-ci The Travis CI build passed
Details
@dangra
Copy link
Member

@dangra dangra commented Jul 21, 2014

@kmike you rocked too (as usual) :-P

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.