extract()
Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.
re(regex)
Apply the given regex and return a list of unicode strings with the matches.
regex can be either a compiled regular expression or a string, which will be compiled to a regular expression using re.compile(regex).
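As a rough illustration of the "compiled or string" rule quoted above, here is a minimal standard-library sketch. apply_regex is a hypothetical helper, not part of Scrapy or parsel, and it ignores capture groups and any entity handling that the real re() may apply.

```python
import re
from typing import List, Pattern, Union


def apply_regex(regex: Union[str, Pattern[str]], text: str) -> List[str]:
    """Hypothetical helper: accept a pattern string or a compiled pattern."""
    if isinstance(regex, str):
        # Strings are compiled first, as the quoted documentation describes.
        regex = re.compile(regex)
    # Return the full text of every match as a list of strings.
    return [match.group(0) for match in regex.finditer(text)]


prices = "apples 12, pears 7"
from_string = apply_regex(r"\d+", prices)
from_compiled = apply_regex(re.compile(r"\d+"), prices)
```

Both calls should give the same list, since the string form is simply compiled before matching.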
There's more to it.
Text nodes are extracted as plain text, with escaped entities decoded,
while elements are serialized as HTML.
An XPath such as ./text() | ./p followed by an extract()
will return strings in two different formats, one plain text and one HTML,
leaving the user with no reliable way to distinguish them.
Imagine a scenario where the context node is <div>&lt;!--<p></p></div>.
A u''.join(sel.xpath('node()').extract()) will result in <!--<p></p>, where the decoded text now reads as the start of a comment.
I'd prefer it if extract() was extracting only html,
with text/attribute nodes entity-encoded.
This however is so backwards incompatible (for user code)
that I just subclassed Selector to solve it.
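A minimal sketch of the difference, using only the standard library rather than Scrapy itself. The strings are hand-written stand-ins for the two child nodes, not real selector output, and the sketch assumes the text node was written as &lt;!-- in the source markup. It is not the subclass mentioned above.

```python
from html import escape, unescape

# Stand-ins for the two child nodes of <div>&lt;!--<p></p></div>;
# these are hand-written strings, not real selector output.
text_node_source = "&lt;!--"   # text node as written in the markup (assumed)
element_markup = "<p></p>"     # element node, serialized as HTML

# Behaviour described above: the text node comes back with entities decoded.
extracted = [unescape(text_node_source), element_markup]
joined = "".join(extracted)

# Preferred behaviour: entity-encode text so every string is valid HTML.
encoded = [escape(extracted[0], quote=False), element_markup]
joined_encoded = "".join(encoded)
```

With the text entity-encoded, the joined string can be parsed again without the text node being mistaken for markup.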
Documentation on Selectors says:
but it doesn't say that .re() (and .re_first()) also perform HTML-entity decoding (except < and &). See #1700 and https://stackoverflow.com/questions/34887730/how-to-extract-raw-html-from-a-scrapy-selector#comment57542664_34897754
parsel documentation is even less verbose.
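To make the decoding behaviour concrete, here is a standard-library approximation of "decode every entity except < and &". It assumes the exceptions refer to the named entities &lt; and &amp;. decode_entities_except is a hypothetical name, and this is not the code Scrapy or parsel actually runs.

```python
import re
from html import unescape

# Matches named (&quot;) and numeric (&#34;, &#x22;) entity references.
_ENTITY = re.compile(r"&(#?\w+);")

# Assumption: these are the entity names left untouched.
KEEP = frozenset({"lt", "amp"})


def decode_entities_except(text, keep=KEEP):
    """Decode entity references in text, leaving those named in keep as-is."""
    def replace(match):
        if match.group(1).lower() in keep:
            return match.group(0)      # keep &lt; and &amp; encoded
        return unescape(match.group(0))
    return _ENTITY.sub(replace, text)


sample = "&lt;b&gt; &amp; &quot;x&quot; &eacute;"
decoded = decode_entities_except(sample)
```

The result is only partially decoded, so it is neither plain text nor the original HTML.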