PY3 Syntactic changes. #168

Preetwinder · 2016-06-27T12:39:05Z

Most of the changes were produced using the modernize script. Changes include print syntax, error syntax, converting iterators and generators to lists, etc. Also includes some other changes which were missed by the script.

redapple · 2016-06-27T13:04:49Z

hi @Preetwinder , what do you think of #166 ?
I'd like to get code coverage running on Travis so that we can more easily see where tests are further needed.
If we're ok, this PR could probably be rebased on it easily.
cc @sibiryakov

Preetwinder · 2016-06-27T14:19:13Z

Hello. Yes, I think the changes in #166 are great. Judging from the stats on the codecov site, adding tests for hbase, kafka and workers should give us reasonably good coverage.
I'll rebase my branch once that PR is merged (or do I need to do it before?). Also, should I wait for that PR to be merged before opening my other PRs?

redapple · 2016-06-27T14:55:26Z

@Preetwinder , you may be touching a few of the same files as #166 next, so I'm not sure if you should change too much "by hand". If most of this PR has been done with scripts it may not be too hard to fix any rebase issue.
But ideally, #166 would be merged before further change (so that we track code coverage PR by PR)
I believe @sibiryakov could approve (or not) #166 soon enough

redapple · 2016-06-29T09:21:23Z

@Preetwinder , #166 has been merged.
Would you mind rebasing against current master branch?

codecov-io · 2016-06-30T08:26:02Z

Current coverage is 54.55% (diff: 68.53%)

Merging #168 into master will increase coverage by 0.43%

@@             master       #168   diff @@
==========================================
  Files            70         70          
  Lines          4320       4425   +105   
  Methods           0          0          
  Messages          0          0          
  Branches        506        522    +16   
==========================================
+ Hits           2338       2414    +76   
- Misses         1878       1907    +29   
  Partials        104        104

Powered by Codecov. Last update 13642e1...25a1e1a

redapple · 2016-06-30T08:47:10Z

frontera/contrib/backends/remote/codecs/msgpack.py



 def _prepare_request_message(request):
    def serialize(obj):
        """Recursively walk object's hierarchy."""
-        if isinstance(obj, (bool, int, long, float, basestring)):
+        if isinstance(obj, (bool, int, float, six.string_types)):


if isinstance(obj, (bool, int, long, float, basestring)):

Is there a case where a long is passed in Python 2, or is this theoretical?

Yes I think this case should be covered. I have used six.integer_types instead. It covers both long and int type.
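
As a rough illustration of the check discussed above, here is a minimal sketch that works on both interpreters. It builds the type tuples by hand so that it runs without any dependency; `six.integer_types` and `six.string_types` provide the same tuples. The helper name `is_scalar` is hypothetical and is not part of frontera.

```python
import sys

# Hand-rolled stand-ins for six.integer_types and six.string_types,
# used here only to keep the sketch dependency-free.
if sys.version_info[0] >= 3:
    integer_types = (int,)
    string_types = (str,)
else:
    integer_types = (int, long)   # noqa: F821 (Python 2 only)
    string_types = (basestring,)  # noqa: F821 (Python 2 only)


def is_scalar(obj):
    """Hypothetical helper: True for values that need no recursive walk."""
    return isinstance(obj, (bool, float) + integer_types + string_types)
```

On Python 2 a value such as `2 ** 70` is a `long`, so leaving `long` out of the tuple would make the check miss it; on Python 3 the same value is a plain `int`.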

redapple · 2016-06-30T09:21:07Z

Overall, the changes look good to me.
It's true there are a lot of .iteritems() calls wrapped in six.iteritems() now. I'm not sure all cases really need the iterator version in Python 2 (so that using .items() without six would be good enough in both Py2 and Py3), but it would take very long to check.

redapple · 2016-06-30T11:28:10Z

LGTM

sibiryakov · 2016-07-18T10:15:58Z

Hey @Preetwinder and @redapple, in order to merge this we should also enable tests for Python 3; otherwise, how would we find out whether it works on Py3? The next thing to do is to port the Kafka message bus testing from https://github.com/scrapinghub/frontera/blob/145d0b58da0ce5175ba2795ea76f63f1e2a12184/frontera/tests/test_message_bus.py and to test the HBase stuff, at least manually.

redapple · 2016-07-19T09:01:04Z

@Preetwinder , @sibiryakov ,
so I tried with Python 3 on Travis CI (see #175 , https://travis-ci.org/scrapinghub/frontera/builds/145764784 )
but MySQL-python is having trouble getting installed.

requirements/tests.txt also includes https://github.com/PyMySQL/PyMySQL
where/when is MySQL-python>=1.2.5 needed?

Preetwinder · 2016-07-19T09:16:36Z

It seems MySQL-python is used by default in the MySQL tests. Since it doesn't support Python 3, I think we can force the tests to use PyMySQL instead and remove MySQL-python.

redapple · 2016-07-19T09:38:39Z

SQLAlchemy mentions https://github.com/PyMySQL/mysqlclient-python
http://docs.sqlalchemy.org/en/latest/dialects/mysql.html#py3k-support

redapple · 2016-07-19T10:24:30Z

@Preetwinder , Python 3 tests with PyMySQL fail badly (don't know why).
It may be easier to use mysqlclient-python (Python 3 compatible) instead of forcing +pymysql (worth a Travis test build at least, I think).

Preetwinder · 2016-07-19T10:41:19Z

I'll try using mysqlclient. I think the reason ZeroMQ tests fail might be that since both the PY2 and PY3 tests are being run in the Python 3 environment, this file - https://github.com/scrapinghub/frontera/blob/master/tests/run_zmq_broker.sh uses Python 3 even for the Python 2 tests.

redapple · 2016-07-19T10:42:57Z

I've killed the Py35 build on Travis

redapple · 2016-07-19T10:45:11Z

@Preetwinder , tests are run within a Py2 or Py3 virtualenv thanks to tox.
run_zmq_broker.sh should also run inside the virtualenv (this ought to be possible with tox I believe)

sibiryakov · 2016-07-19T19:33:35Z

@Preetwinder Please try running the PyMySQL tests locally; it should make it clearer why it loses the connection constantly. I wonder if it connects at all.

Preetwinder · 2016-07-20T19:40:56Z

In order to make PyMySQL work, I have pushed most of the PY3 modifications for single process mode here (I was originally planning to do that in a separate PR). I'll continue testing and improving the PY3 changes. Currently the message_bus and scrapy_spider tests fail on PY2; I'll look into this, although the cause might be unrelated to this PR (they fail for me locally even without the changes in this PR). On PY3, the failures in message_bus, canonicalize_url and encoders are expected, since I haven't worked on these modules yet. There is also an import error for sgmllib, which seems to be a deprecated module.

redapple · 2016-07-21T08:45:18Z

Regarding sgmllib, indeed, RegexLinkExtractor extends SgmlLinkExtractor -- which has been deprecated in Scrapy, and is not available in Python 3. (This makes me think that Scrapy could ship with a robust regex link extractor, without the sgmllib dependency, which doesn't bring much (anything) to RegexLinkExtractor implementation -- There is even a comment about this in scrapy tests)

Preetwinder · 2016-07-22T08:22:20Z

The message_bus test failed on PY2 because the tests were being run with the language set to PY3.5. The scrapy_spider test was failing because the test spider was unable to crawl past the login page on the scrapinghub website. I have changed the website to dmoz.org.

redapple · 2016-07-22T13:43:23Z

tests/test_scrapy_spider.py

@@ -1,4 +1,5 @@
 # -*- coding: utf-8 -*-
+from __future__ import absolute_import


2 options for this test:

regarding sgmllib not available in Python 3, I suggest you add a skip in Python 3 for test_scrapy_spider

or remove RegexLinkExtractor from https://github.com/scrapinghub/frontera/blob/master/tests/scrapy_spider/spiders/example.py#L23
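
The first option could look roughly like the sketch below. It assumes a unittest-style test and uses only the standard library; the class name and the test body are hypothetical placeholders, not the actual contents of tests/test_scrapy_spider.py.

```python
import sys
import unittest

PY3 = sys.version_info[0] >= 3


# Skip the whole test case on Python 3, where sgmllib no longer exists.
@unittest.skipIf(PY3, "sgmllib is not available in Python 3")
class ScrapySpiderTest(unittest.TestCase):  # hypothetical name
    def test_spider(self):
        # Placeholder body: the real test would run the example spider,
        # which pulls in sgmllib through the link extractor.
        import sgmllib  # noqa: F401
        self.assertTrue(hasattr(sgmllib, "SGMLParser"))
```

The second option avoids the skip entirely, at the cost of no longer exercising RegexLinkExtractor in this test.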

sibiryakov · 2016-07-22T13:48:50Z

@redapple and @Preetwinder Let's rather remove RegexLinkExtractor. The goal of this test is to check the general functionality: whether everything is imported well, and whether the Scrapy/Frontera APIs fit.

redapple · 2016-07-22T13:50:26Z

Regarding

FAIL tests/test_utils_url.py::TestCanonicalizeUrl::test_non_ascii_percent_encoding_in_path
FAIL tests/test_utils_url.py::TestCanonicalizeUrl::test_non_ascii_percent_encoding_in_query_argument
FAIL tests/test_utils_url.py::TestCanonicalizeUrl::test_normalize_percent_encoding_in_path
FAIL tests/test_utils_url.py::TestCanonicalizeUrl::test_normalize_percent_encoding_in_query_arguments
FAIL tests/test_utils_url.py::TestCanonicalizeUrl::test_safe_characters_unicode

canonicalize_url implementation should match scrapy's (which I believe is correct), and tests need to be changed accordingly

sibiryakov · 2016-07-22T14:27:46Z

@redapple what if we move that part to a standalone library? How time-consuming would it be?

redapple · 2016-07-22T14:31:31Z

@sibiryakov , you mean in w3lib?
I can look into adding it.

redapple · 2016-07-25T15:09:27Z

@Preetwinder , what do you think of @sibiryakov comments on removing RegexLinkExtractor?
I doubt we'll have canonicalize_url in w3lib shortly, so the quickest here is to copy-paste the implementation in frontera.

Preetwinder · 2016-07-25T15:20:03Z

Yes, I agree with @sibiryakov. I'll remove it. I can copy-paste the function and its tests from scrapy, but the issue with this is that changing the implementation of canonicalize_url changes the fingerprint calculated in the URL middleware. If you and @sibiryakov are fine with this change, I'll make it.

redapple · 2016-07-25T15:22:32Z

IMO, 'old' canonicalize_url was broken, so some fingerprints will be invalidated with this, but it's for the better.
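
To make the invalidation concrete, here is a minimal sketch. It assumes the fingerprint is a SHA-1 hex digest of the canonicalized URL, and it uses a toy normalization step (upper-casing percent escapes) as a stand-in for the difference between the old and new canonicalize_url. Both function names are hypothetical; neither is frontera's or scrapy's actual code.

```python
import hashlib
import re


def sha1_fingerprint(url):
    """Assumed fingerprint: SHA-1 hex digest of the URL string."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def uppercase_escapes(url):
    """Toy normalization: upper-case the hex digits of %xx escapes."""
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), url)


url = "http://example.com/caf%c3%a9"
old_fp = sha1_fingerprint(url)                     # URL left as-is
new_fp = sha1_fingerprint(uppercase_escapes(url))  # URL normalized first
```

Any change in the canonical form, however small, yields a different digest, so records stored under the old fingerprint would no longer be found for URLs that the two implementations canonicalize differently.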

redapple · 2016-07-25T15:33:56Z

@Preetwinder , how about submitting a PR to w3lib that adds canonicalize_url()?
There's already an issue for it: scrapy/w3lib#65

Removing canonicalize_url() from scrapy can be done at a later stage.

sibiryakov · 2016-07-25T21:42:53Z

@Preetwinder I'm fine with this change. But [dramatic pause] we should mention that in release notes!

sibiryakov · 2016-07-26T11:43:05Z

frontera/contrib/backends/memory/__init__.py

@@ -128,11 +135,11 @@ def _get(self, obj):

    def update_cache(self, objs):
        objs = objs if type(objs) in [list, tuple] else [objs]
-        map(self._put, objs)
+        list(map(self._put, objs))


Why is list needed here? The idea is to apply self._put to every element in objs.

I think it is because map returns a lazy iterator in PY3. So if we don't convert it to a list, self._put never gets applied.

map was used here because in Py2.7 it's faster than iterating with a for loop. According to my tests,

[f(x) for x in data] # is faster than list(map(f, data))

in Python 3.5, so I suggest rewriting it that way.
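
The behaviour under discussion can be reproduced with a small sketch (Python 3 semantics; the function names are hypothetical and only mirror the shape of update_cache). A plain for loop is shown as a further alternative that applies the function without building a throwaway list; it is not what was proposed in this thread.

```python
def update_cache_lazy(put, objs):
    # Python 3: map() only creates an iterator, so put() is never called
    # unless something consumes it.
    map(put, objs)


def update_cache_listcomp(put, objs):
    # The variant suggested in the review: the comprehension consumes
    # every element, so put() runs for each one.
    [put(obj) for obj in objs]


def update_cache_loop(put, objs):
    # Alternative: same effect, no temporary list of return values.
    for obj in objs:
        put(obj)
```

With `seen = []` and `put = seen.append`, the lazy version leaves `seen` empty on Python 3, while the other two fill it.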

sibiryakov · 2016-07-26T14:59:11Z

@Preetwinder please avoid such big pull requests next time; they are extremely hard to review and discuss.

Preetwinder · 2016-07-29T14:35:39Z

@sibiryakov I apologize. I wasn't planning to make all the changes in the same PR (hence the name of the PR), but you can see the series of events above which led me to do this.

sibiryakov · 2016-08-02T10:14:56Z

frontera/core/models.py

-        self._method = str(method).upper()
-        self._url = to_native_str(url, encoding)
-        self._encoding = encoding
+        self._url = safe_url_string(url, encoding)


This is very expensive compared to what we have now...

This change isn't necessary to use frontera with scrapy, since scrapy already converts URLs to safe URLs. I'll remove it for now.

Ideally we would add an additional field for storing the parsed URL. Many components perform URL parsing, so they could make use of such a structure, but I would leave that for the future.

Frontera is designed to work without Scrapy too, so we need this URL transformation; the question is how to make it cheap. That's why I propose to optimize it later.

sibiryakov · 2016-08-02T10:21:38Z

@Preetwinder I would propose collecting the cases where we slowed down Frontera in a separate issue, and addressing that issue after we finish working on the PY3 port. It can be seen as taking on performance debt and then paying it back.


sibiryakov · 2016-08-03T08:40:07Z

frontera/contrib/backends/memory/__init__.py

+from six.moves import range
+
+
+def cmp(a, b):


Is this function used somewhere?

In Python 3, cmp is not a built-in function. It is used here - https://github.com/Preetwinder/frontera/blob/ec0811a21983ab724948ccb7a9e8fdea759e42b9/frontera/contrib/backends/memory/__init__.py#L80 and in a few other locations below this line.
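
The body of the function is not shown in this thread. A common replacement, and the expression suggested in the Python 3 porting notes, is sketched below; treat it as an assumption about what such a shim looks like rather than as the code merged here.

```python
from functools import cmp_to_key


def cmp(a, b):
    """Stand-in for the Python 2 built-in: returns -1, 0 or 1."""
    return (a > b) - (a < b)


# Comparison functions can still drive sorting through cmp_to_key.
ordered = sorted([3, 1, 2], key=cmp_to_key(cmp))
```

Where the comparison only feeds a sort, passing a key function directly is usually simpler than keeping a cmp shim.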

Preetwinder force-pushed the python-modernize branch from 239ff91 to b2fff63 Compare June 30, 2016 08:15

redapple reviewed Jun 30, 2016

redapple changed the title ~~PY3 Syntactic changes.~~ [MRG+1] PY3 Syntactic changes. Jun 30, 2016

redapple mentioned this pull request Jul 19, 2016

Preetwinder python modernize #175

Closed

redapple changed the title ~~[MRG+1] PY3 Syntactic changes.~~ PY3 Syntactic changes. Jul 19, 2016

redapple reviewed Jul 22, 2016

redapple mentioned this pull request Jul 22, 2016

Add canonicalize_url() to w3lib.url scrapy/w3lib#65

Closed

sibiryakov reviewed Jul 26, 2016

Preetwinder mentioned this pull request Jul 28, 2016

Remove obsolete modules #179

Closed

sibiryakov reviewed Aug 2, 2016

Preetwinder added 2 commits August 2, 2016 19:51

Python 2/3 Single process mode compatibility

c671989

Python 3 Syntactic changes
Run tests on Python 3.5 too
PY3 changes to some tests and utils.url, heap, misc
PY3 changes to utils.fingerprint
PY3 changes to backends

add absolute_import to sqlalchemy and hbase test

424eedf

Preetwinder force-pushed the python-modernize branch from 98901c7 to 424eedf Compare August 2, 2016 14:41

Preetwinder added 4 commits August 2, 2016 21:06

testing travis failure

a140464

fix style and remove core.models tests

0e9e9ad

remove docker install

25a1e1a

change matrix formation in travis

ec0811a

sibiryakov reviewed Aug 3, 2016

sibiryakov merged commit 3637a27 into scrapinghub:master Aug 3, 2016

Preetwinder deleted the python-modernize branch August 7, 2016 07:04
