[Docs] Mention that .re() does HTML entities decoding #1704

Closed
redapple opened this issue Jan 20, 2016 · 1 comment


@redapple
Contributor

Documentation on Selectors says:

extract()
Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.

re(regex)
Apply the given regex and return a list of unicode strings with the matches.
regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex)

but it doesn't say that .re() (and .re_first()) also perform HTML entity decoding (except for &lt; and &amp;).
See #1700
and https://stackoverflow.com/questions/34887730/how-to-extract-raw-html-from-a-scrapy-selector#comment57542664_34897754

The parsel documentation says even less about this.
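
To make the behaviour described above concrete, here is a minimal stdlib-only sketch. It is an approximation for illustration, not the actual Scrapy or parsel code: `decode_entities` and `re_like` are hypothetical helper names, and the sample markup is made up. It applies a regex to serialized markup and then decodes every entity except `&lt;` and `&amp;`.

```python
import html
import re

# Entities left untouched, mirroring the "except &lt; and &amp;" note above.
KEEP = {"lt", "amp"}

# Matches named (&quot;) and numeric (&#34;) entity references.
_ENTITY = re.compile(r"&(#?\w+);")


def decode_entities(text, keep=KEEP):
    """Decode HTML entities in text, leaving the named ones in keep as-is."""
    def _replace(match):
        if match.group(1).lower() in keep:
            return match.group(0)  # keep &lt; / &amp; escaped
        return html.unescape(match.group(0))
    return _ENTITY.sub(_replace, text)


def re_like(regex, text):
    """Hypothetical stand-in for .re(): regex match, then entity decoding.

    Assumes the regex has at most one capture group, so findall()
    returns plain strings.
    """
    pattern = re.compile(regex) if isinstance(regex, str) else regex
    return [decode_entities(m) for m in pattern.findall(text)]


# Made-up serialized element, as extract() might return it.
serialized = "<p>Tom &amp; Jerry &quot;live&quot; &lt;b&gt;</p>"

raw = re.findall(r"<p>(.*)</p>", serialized)   # plain regex, no decoding
decoded = re_like(r"<p>(.*)</p>", serialized)  # regex plus entity decoding

print(raw)
print(decoded)
```

The two results differ only in the decoded `&quot;` and `&gt;`, which is the kind of silent change the documentation does not currently mention.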

@Digenis
Member

Digenis commented Jan 20, 2016

There's more to it.
Text nodes are extracted as plain text, with escaped entities decoded,
while elements are extracted as HTML.
An XPath such as ./text() | ./p followed by extract()
returns strings with different markup, one plain text and one HTML,
leaving the user with no reliable way to tell them apart.

Imagine a scenario where the context node is <div>&lt;!--<p></p></div>.
u''.join(sel.xpath('node()').extract()) will result in <!--<p></p>,
where the originally escaped text now looks like the start of a comment.
I'd prefer extract() to return only HTML,
with text and attribute nodes entity-encoded.
However, this is so backwards-incompatible (for user code)
that I just subclassed Selector to solve it.
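
The scenario above can be reproduced with a small stdlib-only sketch. This is an illustration under stated assumptions, not Scrapy's code and not the Selector subclass mentioned above: `xml.etree.ElementTree` stands in for the selector, and `nodes_mixed` and `nodes_html` are hypothetical names. The first mimics the mixed output (text decoded, elements as markup); the second shows the preferred behaviour, with text nodes entity-encoded again.

```python
import html
import xml.etree.ElementTree as ET


def _serialize(element):
    """Serialize one element as HTML, without its trailing (tail) text."""
    tail, element.tail = element.tail, None
    try:
        return ET.tostring(element, encoding="unicode", method="html")
    finally:
        element.tail = tail  # restore the tree as we found it


def _child_nodes(element):
    """Yield (is_text, node) pairs for the element's child nodes, in order."""
    if element.text:
        yield True, element.text
    for child in element:
        yield False, child
        if child.tail:
            yield True, child.tail


def nodes_mixed(element):
    """Text nodes as decoded text, elements as markup (the behaviour described)."""
    return [node if is_text else _serialize(node)
            for is_text, node in _child_nodes(element)]


def nodes_html(element):
    """Everything as HTML: text nodes are entity-encoded again."""
    return [html.escape(node, quote=False) if is_text else _serialize(node)
            for is_text, node in _child_nodes(element)]


# The context node from the comment above.
div = ET.fromstring("<div>&lt;!--<p></p></div>")

mixed = "".join(nodes_mixed(div))
as_html = "".join(nodes_html(div))

print(mixed)
print(as_html)
```

In the mixed output the text node and the element cannot be told apart once joined, while the entity-encoded variant round-trips to the original inner markup of the div.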
