
Asyncio support for Scrapy | GSoC 2018 #3362

Open · wants to merge 8 commits into base: master
Conversation

yashrsharma44 (Contributor) commented Aug 2, 2018

I have made the initial pull request. I am working on the requirements we discussed, and I will update this PR as things are completed, @dangra.

yashrsharma44 (Contributor) commented Aug 4, 2018

Hey @dangra, I have made the following changes:

  1. Cleaned up the code base, removing unnecessary comments.
  2. Allowed Fetch to accept a request parameter.
  3. Enabled the request to be passed through spider middlewares (though Fetch itself is an awaitable, so it cannot be passed through spider middlewares directly).
  4. Shared the crawler and spider objects implicitly.

If any changes are required, please let me know.

    asyncioreactor.install(asyncio.get_event_loop())
except Exception:
    pass
'''

dangra (Member) commented Aug 7, 2018

what is the point of importing asyncio here?

import asyncio
import logging
import scrapy
import warnings

dangra (Member) commented Aug 7, 2018

See https://www.python.org/dev/peps/pep-0008/#imports for the standard import rules. Especially the ordering:

1. Standard library imports.
2. Related third party imports.
3. Local application/library specific imports.
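
Applied to the modules that appear in this diff, PEP 8's grouping would look roughly like this (a sketch, nothing Scrapy-specific):

    # Standard library imports
    import asyncio
    import logging
    import warnings

    # Related third-party imports
    from twisted.internet import asyncioreactor

    # Local application/library imports
    import scrapy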
    asyncioreactor.install(asyncio.get_event_loop())
except Exception:
    pass
'''

dangra (Member) commented Aug 7, 2018

why are you importing asyncio in many different places? If it needs to be imported before anything else, please put it in scrapy/__init__.py only.

yashrsharma44 (Contributor) commented Aug 19, 2018

Hey @kmike, I was looking to get this code merged. Could you let me know whether the code is suitable to be merged, leaving aside tests and documentation? If it is, I will gladly complete the tests and documentation.




class Fetch():

kmike (Member) commented Aug 23, 2018

I think we should make the Request class awaitable, rather than create a separate Fetch class.
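
For illustration, a minimal sketch of an awaitable Request; the _schedule_download hook is hypothetical, not part of Scrapy's API:

    class Request:
        # Sketch: the request itself is awaitable, so user code could
        # write ``response = await Request(url)`` with no Fetch wrapper.

        def __init__(self, url):
            self.url = url

        def __await__(self):
            # Delegating to a coroutine's __await__ is what makes this
            # object usable in an ``await`` expression.
            return self._schedule_download().__await__()

        async def _schedule_download(self):
            # Hypothetical hook: hand the request to the engine and wait
            # for the response future to resolve.
            raise NotImplementedError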

yashrsharma44 (Author, Contributor) commented Aug 23, 2018

Regarding making the Request class awaitable, I was running into circular import issues.

def __init__(self, Request):
    self.Request = Request
    self.crawler = engine.concraw
    self.spider = engine.conspid

kmike (Member) commented Aug 23, 2018

There can be several Crawlers running at the same time, and several spiders; we can't use global variables. I think contextvars should be used to access the Crawler.
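
A minimal sketch of the contextvars approach (contextvars is in the standard library from Python 3.7; the names below are illustrative, not Scrapy API):

    import contextvars

    # Each task sees its own value, so concurrent crawls running in
    # separate tasks do not clobber each other the way globals would.
    current_crawler = contextvars.ContextVar("current_crawler")

    def set_crawler(crawler):
        current_crawler.set(crawler)

    def get_crawler():
        # Raises LookupError if no crawler was set in this context.
        return current_crawler.get()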

@@ -1,3 +1,4 @@
+import asyncio

kmike (Member) commented Aug 23, 2018

why is this import needed?

@@ -8,6 +8,7 @@
 from six.moves.urllib.parse import urldefrag

 from zope.interface import implementer
+from twisted.internet import reactor

kmike (Member) commented Aug 23, 2018

why is this import needed?

from scrapy.utils.reactor import CallLaterOnce
from scrapy.utils.log import logformatter_adapter, failure_to_exc_info

logger = logging.getLogger(__name__)

conspid
concraw

kmike (Member) commented Aug 23, 2018

stray lines?

@@ -110,7 +120,9 @@ def unpause(self):
         """Resume the execution engine"""
         self.paused = False

-    def _next_request(self, spider):
+    @ensure_deferred
+    async def _next_request(self, spider):

kmike (Member) commented Aug 23, 2018

we can't use async def in Scrapy code, because we need to support older Python versions. The idea is to make it possible to use async def in code which users write, while keeping Scrapy itself backwards compatible.
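
One common way to square this, sketched below under the assumption of Python 3.5+ with the asyncio reactor installed (the helper name is hypothetical): keep the engine on Deferreds and adapt user coroutines only at the boundary.

    import asyncio
    import inspect

    from twisted.internet import defer

    def as_deferred(obj):
        # If user code handed us a coroutine, run it as an asyncio task
        # and adapt it to a Deferred; the engine itself stays free of
        # async/await syntax and remains importable on older Pythons.
        if inspect.iscoroutine(obj):
            return defer.Deferred.fromFuture(asyncio.ensure_future(obj))
        return defer.maybeDeferred(lambda: obj)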

         assert self.has_capacity(), "No free spider slot when opening %r" % \
             spider.name
         logger.info("Spider opened", extra={'spider': spider})
         nextcall = CallLaterOnce(self._next_request, spider)
         scheduler = self.scheduler_cls.from_crawler(self.crawler)
-        start_requests = yield self.scraper.spidermw.process_start_requests(start_requests, spider)
+        start_requests = await self.scraper.spidermw.process_start_requests(start_requests, spider)

kmike (Member) commented Aug 23, 2018

we can't use await either

         slot = Slot(start_requests, close_if_idle, nextcall, scheduler)
         self.slot = slot
         self.spider = spider
-        yield scheduler.open(spider)
-        yield self.scraper.open_spider(spider)
+        scheduler.open(spider)

kmike (Member) commented Aug 23, 2018

why don't we wait for scheduler.open to finish?

@@ -188,6 +203,8 @@ def _process_spidermw_output(self, output, request, response, spider):
             return dfd
         elif output is None:
             pass
+        elif isinstance(output, Response):

kmike (Member) commented Aug 23, 2018

why is this needed?

try:
    import asyncio
except ImportError:
    warnings.warn("Please install asyncio to support asyncio frameworks. Without asyncio, frameworks might not be supported!")

kmike (Member) commented Aug 23, 2018

you can't install asyncio; it is a part of the Python standard library.

nctl144 (Member) commented Oct 3, 2018

@kmike I suppose we should change it to "this project is using Python < 3.4", right?
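
Since asyncio has shipped with the standard library from Python 3.4 on, the guard could test the interpreter version instead of catching ImportError. A minimal sketch:

    import sys
    import warnings

    if sys.version_info >= (3, 4):
        import asyncio
    else:
        asyncio = None
        warnings.warn("asyncio requires Python 3.4+; asyncio-based "
                      "frameworks will not be supported on this interpreter.")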

import inspect


def get_args(method_or_func):

kmike (Member) commented Aug 23, 2018

scrapy.utils.python has get_func_args and get_spec methods which do a similar thing.
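
For comparison, usage of the existing helper would look like this (the printed result is what I'd expect from get_func_args, not verified against this branch):

    from scrapy.utils.python import get_func_args

    def parse(self, response, extra):
        pass

    print(get_func_args(parse))  # expected: ['self', 'response', 'extra']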


return (r for r in result or () if _filter(r))
result = list_to_simplevalue(result)
if isinstance(result, GeneratorType):

kmike (Member) commented Aug 23, 2018

It'd be much better if the feature didn't require modifications to existing middlewares; there must be a way to support it in the framework itself. Otherwise users will have to modify all their custom middlewares in a similar way.
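
A sketch of how the framework could absorb the difference in one place, so existing middlewares keep returning plain iterables (names are illustrative):

    import inspect

    async def collect_output(output):
        # Normalize middleware results centrally: plain iterables,
        # coroutines, and async generators all reduce to a list.
        if inspect.iscoroutine(output):
            output = await output
        if inspect.isasyncgen(output):
            return [item async for item in output]
        return list(output or ())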

kmike (Member) commented Aug 23, 2018

Hey @yashrsharma44! Unfortunately, I think the current code is very far from being merged :( There may be some useful parts - maybe some of the modifications to the inline_requests library, or helpers like asyncfut_parallel - but it is very hard for me to assess, because the rest of the code needs a rewrite, and it is not clear whether these helpers will remain useful after a rewrite.

kneufeld commented Sep 29, 2018

I'm repeating a comment I left on #1455 in case it's more relevant here:

My friend @meejah wrote https://github.com/crossbario/txaio, a helper library for writing code that runs unmodified on both Twisted and asyncio.

This may help bridge the Twisted/asyncio/Python 2/Python 3 gaps.
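
A rough sketch of txaio's idea as I understand its API (unverified here): the same future-handling code runs on either event loop once a backend is selected.

    import txaio
    txaio.use_twisted()  # or txaio.use_asyncio(); the lines below are identical

    f = txaio.create_future()
    txaio.add_callbacks(f, lambda value: print("got", value), None)
    txaio.resolve(f, 42)  # the callback fires with 42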

nctl144 (Member) commented Oct 2, 2018

Hi @kmike, do you think I should make some more progress on this PR?

yashrsharma44 (Contributor) commented Oct 3, 2018

My 2 cents: I was stuck using contextvars, so that can be looked into.
