-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PY3 Syntactic changes. #168
Conversation
hi @Preetwinder , what do you think of #166 ? |
Hello. Yes I think the changes in #166 are great. Judging from the stats on the codecov site adding tests for hbase, kafka and workers should give us a reasonably good coverage. |
@Preetwinder , you may be touching a few of the same files as #166 next, so I'm not sure if you should change too much "by hand". If most of this PR has been done with scripts it may not be too hard to fix any rebase issue. |
@Preetwinder , #166 has been merged. |
239ff91
to
b2fff63
Compare
Current coverage is 54.55% (diff: 68.53%)@@ master #168 diff @@
==========================================
Files 70 70
Lines 4320 4425 +105
Methods 0 0
Messages 0 0
Branches 506 522 +16
==========================================
+ Hits 2338 2414 +76
- Misses 1878 1907 +29
Partials 104 104
|
|
||
|
||
def _prepare_request_message(request): | ||
def serialize(obj): | ||
"""Recursively walk object's hierarchy.""" | ||
if isinstance(obj, (bool, int, long, float, basestring)): | ||
if isinstance(obj, (bool, int, float, six.string_types)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(obj, (bool, int, long, float, basestring)):
is there a case where a long
is passed Python2 or is this theoretical?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think this case should be covered. I have used six.integer_types instead. It covers both long
and int
type.
Overall, the changes look good to me. |
LGTM |
Hey @Preetwinder and @redapple in order to merge this we should enable tests for Python 3 also, otherwise how would we find out if it works for Py3. Next thing to do is, to port Kafka message bus testing from https://github.com/scrapinghub/frontera/blob/145d0b58da0ce5175ba2795ea76f63f1e2a12184/frontera/tests/test_message_bus.py and test HBase stuff at least manually. |
@Preetwinder , @sibiryakov ,
|
It seems MySQL-python is used by default in the MySQL tests. Since it doesn't support Python 3, I think we can force the tests to use PyMySQL instead and remove MySQL-python. |
@Preetwinder , Python 3 tests with PyMySQL fail badly (dont know why) |
I'll try using mysqlclient. I think the reason ZeroMQ tests fail might be that since both the PY2 and PY3 tests are being run in the Python 3 environment, this file - https://github.com/scrapinghub/frontera/blob/master/tests/run_zmq_broker.sh uses Python 3 even for the Python 2 tests. |
I've killed the Py35 build on Travis |
@Preetwinder , tests are run within a Py2 or Py3 virtualenv thanks to tox. |
@Preetwinder Please try running pyMysql tests locally, it should be clearer why it looses connection constantly. I wonder if it connects at all. |
In order to make pymysql work I have pushed most of the PY3 modification for single process mode here(I was originally planning to do it in a separate PR). I'll continue testing and improving the PY3 changes. Currently message_bus and scrapy_spider tests fail for PY2, I'll look into this, although the cause might be unrelated to this PR(they fail for me locally even without the changes in this PR). For PY3 the failures for message_bus, canonicalize_url and encoders are expected since I haven't worked on these modules yet. There is also an import error for sgmllib which seems to be a deprecated module. |
Regarding |
The reason for message_bus test failure on PY2 was the tests being run with language set as PY3.5. The scrapy_spider tests was failing because the test spider would be unable to crawl past the login page on the scrapinghub website. I have changed the website to dmoz.org. |
@@ -1,4 +1,5 @@ | |||
# -*- coding: utf-8 -*- | |||
from __future__ import absolute_import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 options for this test:
- regarding
sgmllib
not available in Python 3, I suggest you add a skip in Python 3 fortest_scrapy_spider
- or remove
RegexLinkExtractor
from https://github.com/scrapinghub/frontera/blob/master/tests/scrapy_spider/spiders/example.py#L23
@redapple and @Preetwinder Let's better remove RegexLinkExtractor. The goal of this test is to check the general functionality - if all is imported well, and Scrapy/Frontera API's fit. |
Regarding
|
@redapple what if we move that part to standalone library? How time consuming is it? |
@sibiryakov , you mean in w3lib? |
@Preetwinder , what do you think of @sibiryakov comments on removing |
Yes I agree with @sibiryakov. I'll remove it. I can copy-paste the function and it's tests from scrapy, but the issue about this was that changing the implementation of |
IMO, 'old' |
@Preetwinder , what about you submit a PR to w3lib about adding Removing |
@Preetwinder I'm fine with this change. But [dramatic pause] we should mention that in release notes! |
@@ -128,11 +135,11 @@ def _get(self, obj): | |||
|
|||
def update_cache(self, objs): | |||
objs = objs if type(objs) in [list, tuple] else [objs] | |||
map(self._put, objs) | |||
list(map(self._put, objs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why list is needed here? The idea is to apply self._put to every element in the objs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is because map returns a generator in PY3. So if we don't convert it to a list, it never get's applied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the map was used here, because in Py2.7 it's faster than iterate with for cycle. According to my tests the
[f(x) for x in data] # is faster than
list(map(f, data))
in python 3.5
So I'm suggesting to rewrite it that way.
@Preetwinder please avoid such a big pull requests next time, it's extremely hard to review and discuss it. |
@sibiryakov I apologize. I wasn't planning to make all the changes in the same PR(hence the name of the PR), but you can see the series of events above which led me to do this. |
self._method = str(method).upper() | ||
self._url = to_native_str(url, encoding) | ||
self._encoding = encoding | ||
self._url = safe_url_string(url, encoding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very expensive, comparing to what we have now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change isn't necessary to use frontera with scrapy, since scrapy already converts URL's to safe_url's. I'll remove it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideal would be to add additional field for storing parsed URL. Because many components are performing url parsing, so they could make use of such structure, but I would leave it for future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Frontera is designed to work without Scrapy too. So, we need this URL transformation, the question is how to make it cheap. That's why I propose to optimize it later.
@Preetwinder I would propose to collect the cases where we slowed down Frontera in a separate issue, and address that issue after we finish working on PY3 port. It can be seen as lending performance debt, and then giving ti back. |
Python 3 Syntactic changes Run tests on Python 3.5 too PY3 changes to some tests and utils.url, heap, misc PY3 changes to utils.fingerprint PY3 changes to backends
98901c7
to
424eedf
Compare
from six.moves import range | ||
|
||
|
||
def cmp(a, b): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function used somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In python 3 cmp is not an in-built function. It is used here - https://github.com/Preetwinder/frontera/blob/ec0811a21983ab724948ccb7a9e8fdea759e42b9/frontera/contrib/backends/memory/__init__.py#L80 and a few other locations below this line.
Most of the changes were produced using the modernize script. Changes include print syntax, error syntax, converting iterators and generators to lists, etc. Also includes some other changes which were missed by the script.