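# Preparation

Before using XPath, make sure the lxml library is installed. lxml is a Python parsing library that handles both HTML and XML, supports XPath queries, and parses very efficiently.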

In [15]:
%pip install lxml

Note: you may need to restart the kernel to use updated packages.
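
To confirm that the install worked before going further, a quick check along these lines can help. This is only a minimal sketch: `etree.LXML_VERSION` is lxml's version tuple, and the numbers printed depend on your environment.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
print(etree.LXML_VERSION)
```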


# A first example

The sample file test.html contains the following:

In [None]:
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>

In [16]:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>&#13;
    <ul>&#13;
         <li class="item-0"><a href="link1.html">first item</a></li>&#13;
         <li class="item-1"><a href="link2.html">second item</a></li>&#13;
         <li class="item-inactive"><a href="link3.html">third item</a></li>&#13;
         <li class="item-1"><a href="link4.html">fourth item</a></li>&#13;
         <li class="item-0"><a href="link5.html">fifth item</a>&#13;
     </li></ul>&#13;
 </div></body></html>
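
The output above shows that the HTML parser added a DOCTYPE and wrapped the fragment in `html` and `body` nodes. The same behavior can be observed without a file on disk by parsing a string with `etree.HTML`. The sketch below assumes a shortened inline snippet in place of test.html; the variable names are illustrative.

```python
from lxml import etree

# Hypothetical inline snippet standing in for test.html
text = '<div><ul><li class="item-0"><a href="link1.html">first item</a></li></ul></div>'

root = etree.HTML(text)                     # parse a string with the HTML parser
print(root.tag)                             # root element created by the parser
print([child.tag for child in root])        # direct children of the root
```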


# All nodes

In [18]:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

[<Element html at 0x24195d9fe00>, <Element body at 0x24195dab800>, <Element div at 0x24195d871c0>, <Element ul at 0x24195d933c0>, <Element li at 0x24195d93dc0>, <Element a at 0x24195d93900>, <Element li at 0x24195d93100>, <Element a at 0x24195da6a00>, <Element li at 0x24195da68c0>, <Element a at 0x24195d932c0>, <Element li at 0x24195da6700>, <Element a at 0x24195da6100>, <Element li at 0x24195da64c0>, <Element a at 0x24195da63c0>]


`*` matches every node, so all nodes in the HTML document are returned.

You can also match by node name:

In [20]:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])   # pick a single element by index

[<Element li at 0x24195d913c0>, <Element li at 0x24195d91dc0>, <Element li at 0x2419657f2c0>, <Element li at 0x2419657fac0>, <Element li at 0x2419657f440>]
<Element li at 0x24195d913c0>
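
The printed `Element` objects show only a tag name and a memory address. To see what a query matched at a glance, one option is to read each element's `.tag`. The sketch below assumes a shortened inline snippet rather than test.html.

```python
from lxml import etree

# Hypothetical inline snippet standing in for test.html
text = '<div><ul><li><a href="link1.html">first item</a></li></ul></div>'
html = etree.HTML(text)

# Collect the tag name of every node matched by //*, in document order
tags = [el.tag for el in html.xpath('//*')]
print(tags)
```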


# Child nodes

Use `/` to select the direct children of an element and `//` to select its descendants at any depth.

In [21]:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

[<Element a at 0x24195da6700>, <Element a at 0x2419657f940>, <Element a at 0x24195dd9a80>, <Element a at 0x24195dd90c0>, <Element a at 0x24195dd0240>]
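
The difference between `/` and `//` shows up when the target is not a direct child. In the sketch below (an assumed two-item snippet, not test.html), `a` sits inside `li`, so `//ul/a` finds nothing while `//ul//a` finds both links.

```python
from lxml import etree

# Hypothetical inline snippet: each a is a grandchild of ul
text = ('<ul>'
        '<li><a href="link1.html">first item</a></li>'
        '<li><a href="link2.html">second item</a></li>'
        '</ul>')
html = etree.HTML(text)

direct = html.xpath('//ul/a')    # direct children of ul only
nested = html.xpath('//ul//a')   # descendants of ul at any depth
print(len(direct), len(nested))
```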


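# Parent nodes

Use `..` to select a node's parent.

Example: select the `a` node whose `href` attribute is link1.html, move up to its parent, then read the parent's `class` attribute.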

In [25]:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link1.html"]/../@class')
print(result)

['item-0']


Alternatively, the `parent::` axis selects the parent node:

In [27]:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link1.html"]/parent::*/@class')
print(result)

['item-0']
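
When an element has already been selected, the parent can also be reached from Python with lxml's `getparent()` method instead of a second XPath expression. A minimal sketch, assuming a one-item inline snippet:

```python
from lxml import etree

# Hypothetical inline snippet standing in for test.html
text = '<ul><li class="item-0"><a href="link1.html">first item</a></li></ul>'
html = etree.HTML(text)

a = html.xpath('//a[@href="link1.html"]')[0]   # the matched a element
parent = a.getparent()                          # its enclosing li element
print(parent.tag, parent.get('class'))
```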


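# Matching by attribute

Use `@` inside a predicate to filter nodes by attribute. The query below selects the `li` nodes whose `class` is exactly `item-0`.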

In [28]:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

[<Element li at 0x241966f0ac0>, <Element li at 0x241966f2700>]
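
If the attribute value comes from a Python variable, lxml's `xpath()` accepts XPath variables as keyword arguments, which avoids building the expression by string concatenation. The helper name `li_by_class` below is hypothetical, and the snippet is an assumed stand-in for test.html.

```python
from lxml import etree

# Hypothetical inline snippet standing in for test.html
text = ('<ul>'
        '<li class="item-0"><a href="link1.html">first item</a></li>'
        '<li class="item-1"><a href="link2.html">second item</a></li>'
        '<li class="item-0"><a href="link5.html">fifth item</a></li>'
        '</ul>')
html = etree.HTML(text)

def li_by_class(tree, cls):
    """Return li elements whose class attribute equals cls exactly."""
    # $cls is an XPath variable bound by the keyword argument
    return tree.xpath('//li[@class=$cls]', cls=cls)

matches = li_by_class(html, 'item-0')
print(len(matches))
```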


# Getting text

Use the XPath `text()` node test to get the text inside a node.

In [31]:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

['first item', 'fifth item']
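
Note that `/text()` returns only the text nodes that are direct children of the selected node. When the text is split across nested tags, `//text()` or the XPath `string()` function may be closer to what you want. The sketch below assumes a snippet with a nested `b` tag, which test.html does not have.

```python
from lxml import etree

# Hypothetical snippet: part of the link text is wrapped in <b>
text = '<li class="item-0"><a href="link1.html">first <b>item</b></a></li>'
html = etree.HTML(text)

direct = html.xpath('//li/a/text()')      # text directly inside a
all_text = html.xpath('//li/a//text()')   # text inside a and its descendants
joined = html.xpath('string(//li/a)')     # concatenated text as one string
print(direct, all_text, joined)
```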


# Getting attributes

In [32]:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
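
Selecting `@href` yields bare strings, detached from the link text. If each `href` needs to stay paired with its text, one approach is to select the `a` elements and read both values in Python with `get()` and `.text`. A sketch under the assumption of a two-item snippet:

```python
from lxml import etree

# Hypothetical inline snippet standing in for test.html
text = ('<ul>'
        '<li><a href="link1.html">first item</a></li>'
        '<li><a href="link2.html">second item</a></li>'
        '</ul>')
html = etree.HTML(text)

# Pair each link target with its visible text
links = [(a.get('href'), a.text) for a in html.xpath('//li/a')]
print(links)
```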


# Matching multi-valued attributes

In [33]:
from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

['first item']
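
`contains()` is needed here because `@class="li"` compares against the whole attribute value `li li-first` and therefore matches nothing. Keep in mind that `contains()` is a substring test, so it also matches a class such as `list`. A common workaround pads the value with spaces and looks for the padded token. The snippet below is an assumed example with a second `li` added to show the difference.

```python
from lxml import etree

# Hypothetical snippet: the second li has class "list", which contains "li"
text = ('<ul>'
        '<li class="li li-first"><a href="link.html">first item</a></li>'
        '<li class="list"><a href="other.html">other item</a></li>'
        '</ul>')
html = etree.HTML(text)

exact = html.xpath('//li[@class="li"]/a/text()')            # whole-value match
loose = html.xpath('//li[contains(@class, "li")]/a/text()')  # substring match
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)                                                            # whole-token match
print(exact, loose, token)
```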


# Matching multiple attributes

In [34]:
from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

['first item']
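
The `and` operator requires every condition to hold. XPath also provides `or`, which keeps a node when any condition holds. The sketch below contrasts the two on an assumed snippet with a second `li`; the `name` values are illustrative.

```python
from lxml import etree

# Hypothetical snippet with two li nodes that differ in their name attribute
text = ('<ul>'
        '<li class="li li-first" name="item"><a href="link.html">first item</a></li>'
        '<li class="li" name="other"><a href="link2.html">second item</a></li>'
        '</ul>')
html = etree.HTML(text)

both = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
either = html.xpath('//li[@name="item" or @name="other"]/a/text()')
print(both, either)
```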


# Selecting by position

In [35]:
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result1 = html.xpath('//li[1]/a/text()')
result2 = html.xpath('//li[last()]/a/text()')
result3 = html.xpath('//li[position()<3]/a/text()')
result4 = html.xpath('//li[last()-2]/a/text()')
print(result1)
print(result2)
print(result3)
print(result4)

['first item']
['fifth item']
['first item', 'second item']
['third item']
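
XPath positions start at 1, not 0, and a predicate such as `[1]` is evaluated relative to each parent. With a single `ul`, as in test.html, that distinction is invisible, but with several lists `//li[1]` returns the first `li` of every list. Wrapping the path in parentheses applies the index to the combined result instead. A sketch on an assumed snippet with two lists:

```python
from lxml import etree

# Hypothetical snippet with two separate lists
text = ('<div>'
        '<ul><li>a</li><li>b</li></ul>'
        '<ul><li>c</li><li>d</li></ul>'
        '</div>')
html = etree.HTML(text)

per_parent = html.xpath('//li[1]/text()')    # first li within each ul
overall = html.xpath('(//li)[1]/text()')     # first li in the whole document
print(per_parent, overall)
```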
