Make sel.xpath('.') work the same for text elements #130

Open
Gallaecio opened this issue Dec 18, 2018 · 2 comments

Gallaecio commented Dec 18, 2018

Given:

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
...         <body>
...             <h1>Hello, Parsel!</h1>
...         </body>
...         </html>""")

For text, you get:

>>> subsel = sel.css('h1::text')
>>> subsel
[<Selector xpath=u'descendant-or-self::h1/text()' data=u'Hello, Parsel!'>]
>>> subsubsel = subsel.xpath('.')
>>> subsubsel
[]

However, regular elements work as you would expect:

>>> subsel = sel.css('h1')
>>> subsel
[<Selector xpath=u'descendant-or-self::h1' data=u'<h1>Hello, Parsel!</h1>'>]
>>> subsubsel = subsel.xpath('.')
>>> subsubsel
[<Selector xpath='.' data=u'<h1>Hello, Parsel!</h1>'>]

I believe text nodes should work the same way: '.' should select a text node when it is the current node.

redapple commented Dec 18, 2018

Hey @Gallaecio, I'd also like to see this.
Also, I believe the limitation is in lxml, not in libxml2 (and not in parsel either): lxml text results do not accept further XPath calls. You can only call .getparent() on the "smart strings" results, and note that "smart_strings" are disabled by default in parsel. libxml2, on the other hand, allows XPath operations on text nodes:

>>> import libxml2
>>> doc = libxml2.htmlParseDoc('''<html>
... <head>
... <meta charset="UTF-8">
... <title>Title of the document</title>
... </head>
... 
... <body>
... Content of the document......
... </body>
... 
... </html>''', 'ascii')
>>> doc
<xmlDoc (None) object at 0x7ff070272680>
>>> ctxt = doc.xpathNewContext()
>>> res = ctxt.xpathEval("//text()")
>>> res
[<xmlNode (text) object at 0x7ff0702a2560>, <xmlNode (text) object at 0x7ff071d95320>]
>>> res[0].get_content()
'Title of the document'
>>> for t in res:
...     print(t.xpathEval("parent::*"))
... 
[<xmlNode (title) object at 0x7ff07025e7e8>]
[<xmlNode (body) object at 0x7ff07025e878>]
>>> 

If you know Cython, supporting this could be a nice addition to lxml.
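
For comparison, here is a minimal sketch of the lxml side of this, assuming lxml is installed and Python 3. It only illustrates the point made above about text results; the variable names are made up for the example and none of this is parsel's actual code.

```python
from lxml import etree

# Same document shape as in the report, trimmed to one line.
root = etree.fromstring("<html><body><h1>Hello, Parsel!</h1></body></html>")

# With lxml's default "smart strings", a text() result is a str subclass
# that remembers its parent element but has no xpath() method of its own.
smart = root.xpath("//h1/text()")[0]
smart_is_str = isinstance(smart, str)
smart_has_xpath = hasattr(smart, "xpath")
parent_tag = smart.getparent().tag

# With smart_strings=False (the setting the comment above attributes to
# parsel), the result is a plain string with no link back to the tree.
plain = root.xpath("//h1/text()", smart_strings=False)[0]
plain_has_getparent = hasattr(plain, "getparent")

print(smart_is_str, smart_has_xpath, parent_tag, plain_has_getparent)
```

In both cases the result is a string rather than a node object, so there is nothing to evaluate '.' against, which would be consistent with the empty list shown in the report.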
