Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PY3 Syntactic changes. #168

Merged
merged 6 commits into from
Aug 3, 2016
Merged

Conversation

Preetwinder
Copy link
Contributor

Most of the changes were produced using the modernize script. Changes include print syntax, error syntax, converting iterators and generators to lists, etc. Also includes some other changes which were missed by the script.

@redapple
Copy link
Contributor

hi @Preetwinder , what do you think of #166 ?
I'd like to get code coverage running on Travis so that we can more easily see where tests are further needed.
If we're ok, this PR could probably be rebased on it easily.
cc @sibiryakov

@Preetwinder
Copy link
Contributor Author

Preetwinder commented Jun 27, 2016

Hello. Yes I think the changes in #166 are great. Judging from the stats on the codecov site adding tests for hbase, kafka and workers should give us a reasonably good coverage.
I'll rebase my branch once that PR is merged(or do I need to do it before?). Also should I wait for that PR to be merged before making my other PR's?

@redapple
Copy link
Contributor

@Preetwinder , you may be touching a few of the same files as #166 next, so I'm not sure if you should change too much "by hand". If most of this PR has been done with scripts it may not be too hard to fix any rebase issue.
But ideally, #166 would be merged before further change (so that we track code coverage PR by PR)
I believe @sibiryakov could approve (or not) #166 soon enough

@redapple
Copy link
Contributor

@Preetwinder , #166 has been merged.
Would you mind rebasing against current master branch?

@codecov-io
Copy link

codecov-io commented Jun 30, 2016

Current coverage is 54.55% (diff: 68.53%)

Merging #168 into master will increase coverage by 0.43%

@@             master       #168   diff @@
==========================================
  Files            70         70          
  Lines          4320       4425   +105   
  Methods           0          0          
  Messages          0          0          
  Branches        506        522    +16   
==========================================
+ Hits           2338       2414    +76   
- Misses         1878       1907    +29   
  Partials        104        104          

Powered by Codecov. Last update 13642e1...25a1e1a



def _prepare_request_message(request):
def serialize(obj):
"""Recursively walk object's hierarchy."""
if isinstance(obj, (bool, int, long, float, basestring)):
if isinstance(obj, (bool, int, float, six.string_types)):
Copy link
Contributor

@redapple redapple Jun 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if isinstance(obj, (bool, int, long, float, basestring)):

is there a case where a long is passed Python2 or is this theoretical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think this case should be covered. I have used six.integer_types instead. It covers both long and int type.

@redapple
Copy link
Contributor

Overall, the changes look good to me.
It's true there are a lot of .iteritems() wrapped in six.iteritems() now. I'm not sure all cases really need the iterator version in Python 2 (so that using .items() without six would be good enough in both Py2 and Py3) but it would be very long to check.

@redapple
Copy link
Contributor

LGTM

@redapple redapple changed the title PY3 Syntactic changes. [MRG+1] PY3 Syntactic changes. Jun 30, 2016
@sibiryakov
Copy link
Member

sibiryakov commented Jul 18, 2016

Hey @Preetwinder and @redapple in order to merge this we should enable tests for Python 3 also, otherwise how would we find out if it works for Py3. Next thing to do is, to port Kafka message bus testing from https://github.com/scrapinghub/frontera/blob/145d0b58da0ce5175ba2795ea76f63f1e2a12184/frontera/tests/test_message_bus.py

and test HBase stuff at least manually.

@redapple
Copy link
Contributor

@Preetwinder , @sibiryakov ,
so I tried with Python 3 on Travis CI (see #175 , https://travis-ci.org/scrapinghub/frontera/builds/145764784 )
but MySQL-python is having trouble getting installed.

requirements/tests.txt also includes https://github.com/PyMySQL/PyMySQL
where/when is MySQL-python>=1.2.5 needed?

@Preetwinder
Copy link
Contributor Author

It seems MySQL-python is used by default in the MySQL tests. Since it doesn't support Python 3, I think we can force the tests to use PyMySQL instead and remove MySQL-python.

@redapple
Copy link
Contributor

@redapple redapple changed the title [MRG+1] PY3 Syntactic changes. PY3 Syntactic changes. Jul 19, 2016
@redapple
Copy link
Contributor

@Preetwinder , Python 3 tests with PyMySQL fail badly (dont know why)
It may be easier to use mysqlclient-python (Python3 compatible) instead of forcing +pymysql (worth a Travis test build at least I think)

@Preetwinder
Copy link
Contributor Author

I'll try using mysqlclient. I think the reason ZeroMQ tests fail might be that since both the PY2 and PY3 tests are being run in the Python 3 environment, this file - https://github.com/scrapinghub/frontera/blob/master/tests/run_zmq_broker.sh uses Python 3 even for the Python 2 tests.

@redapple
Copy link
Contributor

I've killed the Py35 build on Travis

@redapple
Copy link
Contributor

redapple commented Jul 19, 2016

@Preetwinder , tests are run within a Py2 or Py3 virtualenv thanks to tox.
run_zmq_broker.sh should also run inside the virtualenv (this ought to be possible with tox I believe)

@sibiryakov
Copy link
Member

sibiryakov commented Jul 19, 2016

@Preetwinder Please try running pyMysql tests locally, it should be clearer why it looses connection constantly. I wonder if it connects at all.

@Preetwinder
Copy link
Contributor Author

In order to make pymysql work I have pushed most of the PY3 modification for single process mode here(I was originally planning to do it in a separate PR). I'll continue testing and improving the PY3 changes. Currently message_bus and scrapy_spider tests fail for PY2, I'll look into this, although the cause might be unrelated to this PR(they fail for me locally even without the changes in this PR). For PY3 the failures for message_bus, canonicalize_url and encoders are expected since I haven't worked on these modules yet. There is also an import error for sgmllib which seems to be a deprecated module.

@redapple
Copy link
Contributor

Regarding sgmllib, indeed, RegexLinkExtractor extends SgmlLinkExtractor -- which has been deprecated in Scrapy, and is not available in Python 3. (This makes me think that Scrapy could ship with a robust regex link extractor, without the sgmllib dependency, which doesn't bring much (anything) to RegexLinkExtractor implementation -- There is even a comment about this in scrapy tests)

@Preetwinder
Copy link
Contributor Author

The reason for message_bus test failure on PY2 was the tests being run with language set as PY3.5. The scrapy_spider tests was failing because the test spider would be unable to crawl past the login page on the scrapinghub website. I have changed the website to dmoz.org.

@@ -1,4 +1,5 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 options for this test:

@sibiryakov
Copy link
Member

@redapple and @Preetwinder Let's better remove RegexLinkExtractor. The goal of this test is to check the general functionality - if all is imported well, and Scrapy/Frontera API's fit.

@redapple
Copy link
Contributor

redapple commented Jul 22, 2016

Regarding

FAIL tests/test_utils_url.py::TestCanonicalizeUrl::test_non_ascii_percent_encoding_in_path
FAIL tests/test_utils_url.py::TestCanonicalizeUrl::test_non_ascii_percent_encoding_in_query_argument
FAIL tests/test_utils_url.py::TestCanonicalizeUrl::test_normalize_percent_encoding_in_path
FAIL tests/test_utils_url.py::TestCanonicalizeUrl::test_normalize_percent_encoding_in_query_arguments
FAIL tests/test_utils_url.py::TestCanonicalizeUrl::test_safe_characters_unicode

canonicalize_url implementation should match scrapy's (which I believe is correct), and tests need to be changed accordingly

@sibiryakov
Copy link
Member

@redapple what if we move that part to standalone library? How time consuming is it?

@redapple
Copy link
Contributor

redapple commented Jul 22, 2016

@sibiryakov , you mean in w3lib?
I can see to add it

@redapple
Copy link
Contributor

redapple commented Jul 25, 2016

@Preetwinder , what do you think of @sibiryakov comments on removing RegexLinkExtractor?
I doubt we'll have canonicalize_url in w3lib shortly, so the quickest here is to copy-paste the implementation in frontera.

@Preetwinder
Copy link
Contributor Author

Yes I agree with @sibiryakov. I'll remove it. I can copy-paste the function and it's tests from scrapy, but the issue about this was that changing the implementation of canonicalize_url changes the fingerprint calculated in the URL middleware. If you and @sibiryakov are fine with this change, I'll make it.

@redapple
Copy link
Contributor

IMO, 'old' canonicalize_url was broken, so some fingerprints will be invalidated with this, but it's for the better.

@redapple
Copy link
Contributor

@Preetwinder , what about you submit a PR to w3lib about adding canonicalize_url()?
There's already an issue for it: scrapy/w3lib#65

Removing canonicalize_url() from scrapy can be done at later stage.

@sibiryakov
Copy link
Member

@Preetwinder I'm fine with this change. But [dramatic pause] we should mention that in release notes!

@@ -128,11 +135,11 @@ def _get(self, obj):

def update_cache(self, objs):
objs = objs if type(objs) in [list, tuple] else [objs]
map(self._put, objs)
list(map(self._put, objs))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why list is needed here? The idea is to apply self._put to every element in the objs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is because map returns a generator in PY3. So if we don't convert it to a list, it never get's applied.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the map was used here, because in Py2.7 it's faster than iterate with for cycle. According to my tests the

[f(x) for x in data] # is faster than

list(map(f, data))

in python 3.5
So I'm suggesting to rewrite it that way.

@sibiryakov
Copy link
Member

@Preetwinder please avoid such a big pull requests next time, it's extremely hard to review and discuss it.

@Preetwinder
Copy link
Contributor Author

@sibiryakov I apologize. I wasn't planning to make all the changes in the same PR(hence the name of the PR), but you can see the series of events above which led me to do this.

self._method = str(method).upper()
self._url = to_native_str(url, encoding)
self._encoding = encoding
self._url = safe_url_string(url, encoding)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very expensive, comparing to what we have now...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change isn't necessary to use frontera with scrapy, since scrapy already converts URL's to safe_url's. I'll remove it for now.

Copy link
Member

@sibiryakov sibiryakov Aug 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideal would be to add additional field for storing parsed URL. Because many components are performing url parsing, so they could make use of such structure, but I would leave it for future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frontera is designed to work without Scrapy too. So, we need this URL transformation, the question is how to make it cheap. That's why I propose to optimize it later.

@sibiryakov
Copy link
Member

@Preetwinder I would propose to collect the cases where we slowed down Frontera in a separate issue, and address that issue after we finish working on PY3 port. It can be seen as lending performance debt, and then giving ti back.

Python 3 Syntactic changes

Run tests on Python 3.5 too

PY3 changes to some tests and utils.url, heap, misc

PY3 changes to utils.fingerprint

PY3 changes to backends
from six.moves import range


def cmp(a, b):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function used somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In python 3 cmp is not an in-built function. It is used here - https://github.com/Preetwinder/frontera/blob/ec0811a21983ab724948ccb7a9e8fdea759e42b9/frontera/contrib/backends/memory/__init__.py#L80 and a few other locations below this line.

@sibiryakov sibiryakov merged commit 3637a27 into scrapinghub:master Aug 3, 2016
@Preetwinder Preetwinder deleted the python-modernize branch August 7, 2016 07:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants