CSS selectors #176
Here is the tutorial spider re-implemented using HtmlCSSSelector: https://gist.github.com/3808079 |
I just used this feature to make scrapy download, parse, and test which css is unused on my company site. It's good stuff. If I could merge it, I would 👍 |
Very nice patch! It looks quite simple and useful, so it's definitely a +1 for me. Sorry for not being able to review it earlier. Could we add some basic tests before merging it? It would also be great to add a "CSS selectors" section to the Scrapy Selectors documentation, although that is not a blocker and can be done after merging. |
@pablohoffman thanks. This is my first contribution to a Python project, so I'm not sure my code is thoroughly pythonic, particularly 0da6be1 where I duplicate some methods for my CSSMixin. The method names mimic the original methods, although there might be a better approach. As for the documentation, I'll get down to it tomorrow. |
Great @barraponto, I look forward to those changes so we can merge this PR. The CSSMixin code looks OK to me. One thing I'm not so sure about is using My reasons in favor of using
|
Adding CSS selectors to |
As a frequent user, I think this is an awesome addition. I was about to begin a new project and figure out how to integrate PyQuery with scrapy. Having something baked in, or nearly so, would be wonderful! Is there anything I currently need to do to scrapy to get this functionality other than 'import' HtmlCSSSelector? |
@jcswart i guess someone already integrated pyquery, check this thread https://groups.google.com/forum/?fromgroups=#!topic/scrapy-developers/OnQ5eOvGz5k |
Hey @barraponto, I hope you have a chance to make those minor changes soon, so we can integrate this new functionality into master branch and have some time to test it before the 0.18 release. AFAIK, the only two changes blocking this merge are:
|
@pablohoffman I'm currently working on documentation (slowly, blame the holidays). I thought someone in #scrapy@freenode was already working on making cssselect a hard dependency. That's something I can do on my own, though. But I have no experience in writing tests for scrapy, and I don't think I'll be able to deliver that anytime soon, so I'd be glad if someone from scrapinghub stepped up either to write it or mentor me. |
@barraponto let's merge the PR once you finish the documentation, and we can add the tests afterwards (very simple, if you know how). The change to hard dependency is also very simple, so I don't mind doing it once the PR is merged. |
Ok, I guess I'm done here. |
def get(self, attr): | ||
return self.__class__(flatten([x.get(attr) for x in self])) | ||
|
||
def text(self, recursive=False): |
Here it uses `recursive` but below it uses `all`; should we normalize the argument names? Also, the documentation only refers to `all`.
Good catch, I should have updated it when I changed |
I don't really like If this change is meant only for ItemLoaders support, I'd rather create a CSSItemLoader class. |
Actually, when it comes to frontend-turned-scrapers (like myself), I think pseudo-classes trump new methods. I mean, I see no harm in actually leaving the methods in, but I appreciate the pseudo-classes. Made-up pseudo-classes is something jQuery used too, and they sure target novice frontend developers. |
I was looking at jquery css selectors and found a lot of non-standard pseudo-classes. One that surprised me was :text but it is used in a very different way. It doesn't select the text nodes in the xpath sense, but instead I think adding pseudo-classes is convenient and avoids the extra methods in selector's api which we try to keep minimal and common across all backends. what do you think about renaming |
the reason to avoid |
btw, I agree pseudo-elements are a better option and looking at cssselect source code I see adding support is fairly simple |
I have confirmed with the cssselect developer that extending translators is a feature he wants to keep in the future, but he doesn't commit to a stable API yet. Also sent a pull request scrapy/cssselect#22 to address one of the workarounds I used for |
A quick excerpt from CSS Selectors Module Level 3:
Mind the last paragraph. I started porting a scraper to the pseudo-class syntax introduced earlier, and it feels weird: Pseudo-elements, on the other hand, are anchored to the last selector, like By the way, attr() is already used in CSS as a value for the |
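To make the "anchored to the last selector" point concrete, here is a minimal sketch of how a trailing pseudo-element could be peeled off a selector string and turned into an XPath suffix. The helper name `split_pseudo_element` and the `::text` / `::attr(name)` to `/text()` / `/@name` mapping are assumptions made for illustration; this is not the code from this PR or from cssselect.

```python
import re

# Matches a trailing ::text or ::attr(name) pseudo-element (simple names only).
_PSEUDO = re.compile(r"::(?:(?P<text>text)|attr\((?P<attr>[\w-]+)\))\s*$")


def split_pseudo_element(selector):
    """Return (element_selector, xpath_suffix) for a selector string.

    Hypothetical helper: the pseudo-element may only appear at the very end,
    which mirrors how pseudo-elements attach to the last compound selector.
    """
    match = _PSEUDO.search(selector)
    if match is None:
        return selector.strip(), ""
    base = selector[:match.start()].strip()
    if match.group("text"):
        return base, "/text()"
    return base, "/@" + match.group("attr")


print(split_pseudo_element("ul.nav a::attr(href)"))
print(split_pseudo_element("h1::text"))
```

The element part would still go through the normal CSS-to-XPath translation; only the suffix is appended afterwards.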
hi @SimonSapin, We would like to add pseudo-elements support to cssselect The goal is to extend GenericTranslator and add The approach I am willing to take is very similar to how other types and their hooks are handled but without any pseudo-element implemented by default in cssselect. what do you think? is this something you want to see merged anytime soon? |
No. cssselect implements Selectors, nothing else. Feel free to extend it to implement a super-set or, well, anything; but this is not the same project. Pyquery sounds more like the project you’re looking for. cssselect might at some point implement next levels of Selectors, or proposals for features meant to go into the Selectors spec eventually. But Selectors will only ever apply to elements, not attribute or text nodes. (For example, WeasyPrint uses cssselect to apply CSS stylesheets to a tree of elements. By the way, the way pseudo-elements are handled in cssselect is completely different from pseudo-classes. You’ll need to work around that if you want to use the pseudo-element syntax for eg. |
@SimonSapin: sorry, I probably messed up my words. I know pseudo-elements are part of Selectors, in fact there are 4 defined in the W3C recommendation for selectors, but I don't plan/want to add them to cssselect. I was asking to remove the constraint imposed by cssselect/xpath.py#L183, leaving GenericTranslator hooks the chance to convert them to xpath, and to fail by default as it does now if the pseudo-element hook handler is not defined as a translator method.
|
Just bypass |
First, thanks so much for this awesome project. There is nothing even close to it in Ruby-land, so I am picking up Python just to be able to use scrapy! Well done. However, it's a major bummer to see that scrapy doesn't natively support CSS selectors. I'm trying to replace a few scrapers that I cobbled together in Ruby, and they all used CSS selectors due to my familiarity with them and because the excellent Nokogiri HTML/XML parsing (Ruby) library has great support for CSS3 selectors. I do know a little bit of XPath, but I have some complicated CSS selectors for some sites that are a challenge to scrape due to their mutable structure, so I will have to learn XPath more in-depth in order to properly convert them. I understand XPath is a valuable thing to know and it will broaden my toolset, but it's yet another impediment to me using scrapy, and I'm already anxious! :-) Just my $0.02 as someone new to both scrapy and Python. |
@abevoelker, for CSS selectors support, I have been using With XPath and CSS selectors support, I consider But then of course using |
Not sure who won between pseudo-classes and (functional) pseudo-elements for this Scrapy PR but I opened a PR to I added a test on how to extend cssselect Translator with custom functions: see Comments welcome there. |
@redapple: this PR stalled because it needed someone to reimplement it using pseudo-elements, so your work on scrapy/cssselect#29 is very valuable! |
Thanks to @redapple ’s work (though not exactly as in the PR), cssselect now has parser-only support for functional pseudo-elements. See the tests for a usage example. Play around with it, feel free to send new issues or PRs to cssselect, and let me know when you think this is stable and would like it in a PyPI release. Happy hacking! |
Ok, I'm back from Neverland :) I'm really really happy that @redapple's work motivated @SimonSapin to provide us the API needed for pseudo-elements (including functional pseudo-elements!). I don't think I can push this to the end right now, but I can commit to writing the documentation for CSS selectors once it's fixed (or even sooner). |
Here's what it could look like: |
thanks @redapple and @SimonSapin! @redapple: adapt PR to option 3 of this comment #176 (comment) |
@barraponto welcome back, as this PR started in your branch, can you pull @redapple changes once it is ready and add a short documentation on how to use cssselectors. keep in mind that this PR must not change tutorial/docs examples to CSS selectors, and doesn't integrate with Loaders yet. |
so @dangra , remove |
@redapple, yes. |
New commit: redapple@34ace6f |
Among the things to test still: namespaces in CSS selectors. Scrapy CSS selectors should probably support things like "atom|link::attr(atom|href)". See for example a test in

    parselet_script = {
        "--(atom|feed atom|entry)": {
            "title": "atom|title",
            "name": "im|name",
            "id": "atom|id @im|id",
            "images(im|image)": [{
                "height": "@height",
                "url": ".",
            }],
            "releasedate": "im|releaseDate",
        }
    }
    dsh = parslepy.selectors.DefaultSelectorHandler(
        namespaces={
            'atom': 'http://www.w3.org/2005/Atom',
            'im': 'http://itunes.apple.com/rss',
        }
    )

(it uses the "... @attrname" Parsley syntax, that I plan to complement also with ::attr(name) :) ) |
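As a rough illustration of what namespace-aware selection involves, the sketch below rewrites the CSS `prefix|name` notation into the XPath `prefix:name` form and runs it with the standard library's ElementTree, reusing the two namespace URIs from the snippet above. The feed content and the helper `css_ns_to_xpath` are made up for the example, and only the descendant combinator is handled; it says nothing about how cssselect or Scrapy actually treat namespaces.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "atom": "http://www.w3.org/2005/Atom",
    "im": "http://itunes.apple.com/rss",
}

# Made-up sample feed, only for demonstration.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:im="http://itunes.apple.com/rss">
  <entry>
    <title>First app</title>
    <im:name>First</im:name>
  </entry>
</feed>"""


def css_ns_to_xpath(css):
    """Turn 'atom|entry atom|title' into './/atom:entry//atom:title'.

    Hypothetical helper: handles type selectors joined by whitespace only.
    """
    steps = [step.replace("|", ":") for step in css.split()]
    return ".//" + "//".join(steps)


root = ET.fromstring(FEED)
titles = [el.text for el in root.findall(css_ns_to_xpath("atom|entry atom|title"), NAMESPACES)]
names = [el.text for el in root.findall(css_ns_to_xpath("atom|entry im|name"), NAMESPACES)]
print(titles, names)
```

The prefix-to-URI mapping is passed at query time, so the prefixes used in the selector do not have to match the ones used in the document.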
FYI, cssselect does basically everything about namespaces wrong. What it should do is detailed in the spec: http://www.w3.org/TR/selectors/#typenmsp |
hehe ok @SimonSapin |
If I remember correctly, |
Duly noted. |
Or, you can also send PRs to fix it :] |
Why not! Thanks again @SimonSapin |
We should add Other than that, great work and +1 to merge. @dangra will you do the honors? (looks like it'll need a rebase though) |
CSS selectors will be merged as part of #395. |
fixing value check
I'm a web developer and designer turned into a web scraper. Python is easy and I love it. Scrapy is wonderful. But XPath... it's a foreign language that mostly does what CSS selectors do — but those are second language to me. Anyone coming from jQuery or CSS will already be used to CSS selectors.
So I decided to improve scrapy to support CSS selectors, using the cssselect package that got extracted from the lxml package. I've created two Selector classes, HtmlCSSSelector and XmlCSSSelector, that basically translate the CSS selectors into XPath selectors and run the parent class select method. Easy Peasy.
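The translate-then-select idea can be sketched with the standard library alone. The toy translator below is not the PR's code and not cssselect: `css_to_xpath` and `select` are hypothetical names, and the supported subset (tag, `#id`, `[attr]`, `[attr=value]`, descendant and child combinators) is an assumption chosen because ElementTree's limited XPath can express it. Class selectors and attribute values containing spaces are deliberately left out.

```python
import re
import xml.etree.ElementTree as ET

# One compound selector: optional tag, optional #id, optional [attr] or [attr=value].
_COMPOUND = re.compile(
    r"^(?P<tag>[\w-]+|\*)?"
    r"(?:#(?P<id>[\w-]+))?"
    r"(?:\[(?P<attr>[\w-]+)(?:=['\"]?(?P<value>[^'\"\]]*)['\"]?)?\])?$"
)


def css_to_xpath(css):
    """Translate a tiny CSS subset into an ElementTree-compatible XPath."""
    # Pad the child combinator so that splitting on whitespace isolates it.
    tokens = css.replace(">", " > ").split()
    xpath, axis = ".", "//"  # the first step matches any descendant
    for token in tokens:
        if token == ">":
            axis = "/"
            continue
        match = _COMPOUND.match(token)
        if match is None:
            raise ValueError("unsupported selector: %r" % token)
        step = match.group("tag") or "*"
        if match.group("id"):
            step += "[@id='%s']" % match.group("id")
        if match.group("attr"):
            if match.group("value") is None:
                step += "[@%s]" % match.group("attr")
            else:
                step += "[@%s='%s']" % (match.group("attr"), match.group("value"))
        xpath += axis + step
        axis = "//"  # back to the descendant combinator by default
    return xpath


def select(root, css):
    """Translate the CSS selector, then run the usual XPath lookup."""
    return root.findall(css_to_xpath(css))


doc = ET.fromstring(
    "<html><body>"
    "<div id='main'><p><a href='/one'>One</a></p><a name='x'>Anchor</a></div>"
    "<div id='side'><a href='/two'>Two</a></div>"
    "</body></html>"
)
print(css_to_xpath("div#main a[href]"))  # .//div[@id='main']//a[@href]
print([a.text for a in select(doc, "div#main a[href]")])
```

A real translator such as cssselect covers far more of the Selectors grammar; the point here is only that the selector class needs nothing beyond a string translation in front of the existing XPath machinery.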
I'm looking into how to provide tests for these new classes, but would love some guidance. This is my first contribution to a Python package.