I can not get blank text in td tag. #62

Closed
pc10201 opened this issue Oct 19, 2016 · 4 comments

pc10201 commented Oct 19, 2016

# coding=utf-8

from parsel import Selector

html = u'''
                        <table class="table table-bordered table-hover table-condensed">
                            <thead>
                            <tr>
                                <th>#</th>
                                <th>code</th>
                                <th>vendor</th>
                                <th>name</th>
                                <th>num</th>
                            </tr>
                            </thead>
                            <tbody>
                                <tr>
                                    <th scope="row">1750</th>
                                    <td><a href="/exam/000-643">000-643</a></td>
                                    <td>IBM</td>
                                    <td></td>
                                    <td>45</td>
                                </tr>
                                </tbody>
'''

sel = Selector(text=html)
print sel.xpath('//tbody/tr//td/text()').extract()
print sel.xpath('//tbody/tr//td//text()').extract()

Output:

[u'000-643', u'IBM', u'45']
pc10201 changed the title from "I can get blank text in td tag." to "I can not get blank text in td tag." on Oct 19, 2016

redapple (Contributor) commented

XPath's data model does not consider "blank text" as text nodes:

A text node always has at least one character of data.

So the output of parsel seems correct to me. (And I get the same results with http://codebeautify.org/Xpath-Tester for example.)

What you could do is loop over the <td> elements and apply text() to each one:

>>> sel.xpath('//tbody/tr//td/text()').extract()
[u'IBM', u'45']
>>> for td in sel.xpath('//tbody/tr//td'):
...     print(td.xpath('text()').extract())
... 
[]
[u'IBM']
[]
[u'45']
>>> 


wsgggws commented Dec 19, 2018

Python 3.6.7
Scrapy 1.5.1

[info.xpath('text()').extract()
      for info in response.xpath('//td').extract()]

Error:
for info in response.xpath('//td').extract()]))\nAttributeError: 'str' object has no attribute 'xpath'\n"

Gallaecio (Member) commented, quoting wsgggws:

Python3.6.7
Scrapy 1.5.1

[info.xpath('text()').extract()
      for info in response.xpath('//td').extract()]

Error:
for info in response.xpath('//td').extract()]))\nAttributeError: 'str' object has no attribute 'xpath'\n"

Please, instead of hijacking an unrelated Scrapy issue, ask your question on Stack Overflow. It is an easy question; I bet you will get a prompt answer there.


ilyazub commented Feb 16, 2022

@pc10201 s.xpath("//tbody/tr//td").xpath("normalize-space()").getall() returns an empty string for each empty cell. text() skips those cells, as expected, because an empty <td> has no text node.

>>> s.xpath("//tbody/tr//td").xpath("normalize-space()").getall()
['000-643', 'IBM', '', '45']

Full code

from parsel import Selector

html = """
<table class="table table-bordered table-hover table-condensed">
  <thead>
    <tr>
        <th>#</th>
        <th>code</th>
        <th>vendor</th>
        <th>name</th>
        <th>num</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th scope="row">1750</th>
      <td><a href="/exam/000-643">000-643</a></td>
      <td>IBM</td>
      <td></td>
      <td>45</td>
    </tr>
  </tbody>
</table>
"""

s = Selector(text=html)

with_text = s.xpath("//tbody/tr//td//text()").getall()
with_normalize_space = s.xpath("//tbody/tr//td").xpath("normalize-space()").getall()

print(with_text, with_normalize_space)

Output

['000-643', 'IBM', '45'] ['000-643', 'IBM', '', '45']

I'm commenting on this old issue because I ran into it today.
