Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve scrapy top-level namespace #494

Closed
kmike opened this issue Dec 18, 2013 · 7 comments
Closed

improve scrapy top-level namespace #494

kmike opened this issue Dec 18, 2013 · 7 comments

Comments

@kmike
Copy link
Member

@kmike kmike commented Dec 18, 2013

Hey,

I think that @redapple is right in #488 that base spider template needs more imports by default, but IMHO introducing more code generation is not a solution.

Scrapy top-level namespace is a bit funky now:

>>> import scrapy
>>> dir(scrapy)
['__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 '__path__',
 '__version__',
 '_txv',
 'boto',
 'django',
 'optional_features',
 'os',
 'pkgutil',
 'print_function',
 'sys',
 'twisted_version',
 'urlparse_monkeypatches',
 'version_info',
 'warnings',
 'xlib']

it contains boto and django, but doesn't contain shortcuts like Request, Selector and BaseSpider. What do you think about making them available as scrapy.Request etc and removing unintended exports?

Currently to write a basic spider user has to remember a lot of import locations. Scrapy docs omit imports sometimes (see #493), this also doesn't help.

My current list is BaseSpider (why isn't it named Spider, by the way?), Request, FormRequest and Selector, but I don't see much downsides in being more generous (Item? Field?).

@redapple
Copy link
Contributor

@redapple redapple commented Dec 19, 2013

+1
+1 for Field and Item also in the shortcuts list

@dangra
Copy link
Member

@dangra dangra commented Dec 20, 2013

I like the idea of grouping most used imports into a single path, but I'm not sure about placing the shortcuts at the top level, it will trigger lot of internal imports by default.

I recall Fabric uses fabric.api for this, and Django provides django.shortcuts but it is not the same.

@kmike
Copy link
Member Author

@kmike kmike commented Dec 20, 2013

You're right that extra imports could be a problem. For example, in NLTK "import nltk" used to be super-slow because it actually imported all the submodules in top-level module; this was solved by LazyImport hacks. But I don't think it will be an issue for scrapy; these modules are already imported when you run e.g. scrapy list or scrapy crawl, and scrapy is rarely used as library that don't have an access to Spider, Selector and Request. Also, it doesn't seems to be slow (I tried putting these imports to see how it works, and haven't noticed the difference - no measurments, sorry). Circular import issues are also in suspiction, but I haven't found any of those (tests pass fine and other code is working).

Numpy, scipy, flask, https://github.com/pyinvoke/invoke (Fabric semi-successor) use top-level imports, and they seems to be doing fine. I think that scrapy.Spider and scrapy.Selector are more descriptive and readable that scrapy.api.Spider and scrapy.api.Selector; for those who want it it'll allow namespaced code like

import scrapy

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        sel = scrapy.Selector(response)
        # ...
@dangra
Copy link
Member

@dangra dangra commented Dec 20, 2013

The exposed interface kind of convince me, it really does.

what shortcuts do you think worth exposing?:

  • Spider
  • Request, FormRequest
  • Selector
  • Item, Field
  • ItemLoader ?
@kmike
Copy link
Member Author

@kmike kmike commented Dec 20, 2013

I'm not sure about ItemLoader because it is in contrib now. Also, if we expose ItemLoader it is tempting to to expose processors, but it seems a bit to much to have scrapy.Join.

What about moving loaders out of contrib (e.g. to scrapy.loaders) and exposing processors (Join, TakeFirst, ...) in the same namespace as ItemLoader?

@dangra
Copy link
Member

@dangra dangra commented Dec 20, 2013

@kmike: you're right, exposing ItemLoaders and processors under scrapy.loaders makes sense.

ItemLoaders were placed under contrib as a test bed just in case a better proposal arises, but so far there aren't. It's time to graduate it.

@kmike
Copy link
Member Author

@kmike kmike commented Apr 24, 2014

Fixed by #690. Let's start an another ticket when it'd be time to promote ItemLoaders.

@kmike kmike closed this Apr 24, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.