# Using Beautiful Soup
Beautiful Soup is a Python library for parsing HTML and XML, which makes it easy to extract data from web pages.

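Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one, in which case you only have to state the original encoding.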
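As a small illustration of the encoding behaviour described above, the sketch below parses bytes that carry no charset declaration. The sample markup and the choice of Latin-1 are assumptions made for the example, and `html.parser` is used only so the snippet runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Bytes encoded as Latin-1, with no charset declared anywhere in the markup.
raw = '<p>café</p>'.encode('latin-1')

# from_encoding states the original encoding of the undeclared bytes.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='latin-1')

text = soup.p.string        # already decoded to a Unicode string
encoded = soup.p.encode()   # bytes; encode() defaults to UTF-8

print(repr(text))
print(encoded)
```

If the bytes did declare their encoding, for example in a `<meta charset>` tag, the `from_encoding` argument would normally be unnecessary.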

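Beautiful Soup relies on an underlying parser to do the parsing, for example `html.parser` from the Python standard library or the lxml parser.
The lxml parser is fast and can parse both HTML and XML, so it is the recommended choice.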
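Because lxml is a third-party package, it may not be installed everywhere. One possible way to handle that, shown here as a sketch and not as anything Beautiful Soup requires, is to fall back to the standard-library parser when lxml is missing. The variable name `parser` is chosen only for this example.

```python
from importlib.util import find_spec

from bs4 import BeautifulSoup

# Prefer lxml when it is installed; otherwise use the standard-library parser.
parser = 'lxml' if find_spec('lxml') is not None else 'html.parser'

soup = BeautifulSoup('<p>Hello</p>', parser)
print(parser, soup.p.string)
```

Keep in mind that different parsers can repair malformed markup in different ways, so results on broken HTML may not be identical between the two.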

In [1]:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml') # The second argument 'lxml' selects lxml's HTML parser
print(soup.p.string)

Hello


In [9]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story" name="the secod p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
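""" # Note: the html and body nodes are never closed in this markup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify()) # Print the HTML document with standard indentation
print('\n')
print(soup.head.string) # Text inside the head node (the title text, as title is head's only child)
print('\n')
print(soup.title.string) # Text of the title node
print('\n')
print(soup.a.string) # Attribute access only reaches the first a node, whose content here is the comment " Elsie "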

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story" name="the secod p">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


The Dormouse's story


The Dormouse's story


 Elsie 


## Selecting elements



In [7]:
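print(soup.title, '\n') # Print the whole node, tags included
print(type(soup.title), '\n') # bs4.element.Tag, a core Beautiful Soup data structure; selecting a node this way always returns a Tag
print(soup.title.string, '\n') # The string attribute of a bs4.element.Tag gives the node's text
print(soup.head, '\n')
print(soup.p, '\n') # Selecting this way only returns the first matching node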

<title>The Dormouse's story</title> 

<class 'bs4.element.Tag'> 

The Dormouse's story 

<head><title>The Dormouse's story</title></head> 

<p class="title" name="dromouse"><b>The Dormouse's story</b></p> 
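The `.string` attribute only gives a result when a node has exactly one child, and that child may be a comment and not ordinary text. The sketch below uses a shortened version of the sample markup to show both cases, together with `get_text()`, which is a Beautiful Soup method not used elsewhere in these notes. `html.parser` is used only to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<p>Once <a><!-- Elsie --></a> and <a>Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

whole = soup.p.string     # None, because p has several children
first = soup.a.string     # a Comment object, not ordinary text
text = soup.p.get_text()  # joins the ordinary text and skips comments

print(whole)
print(type(first))
print(text)
```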



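## Extracting information
A bs4.element.Tag exposes several attributes for reading information about a node:
+ `name` returns the node's name
+ `attrs` returns all of the node's attributes as a dictionary
+ `attrs['name']` returns the value of the attribute called name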

In [11]:
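print(soup.title.name, '\n') # Name of the node
print(soup.p.attrs) # All attributes of the first p node
print(soup.p.attrs['name']) # Value of the name attribute of the first p node
# Note: some attributes can hold several values, in which case the value is a list. The class attribute here is one example, since a node can have more than one class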

title 

{'class': ['title'], 'name': 'dromouse'}
dromouse
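To make the point about multi-valued attributes concrete, here is a minimal sketch with a node that carries two classes. The markup is invented for the example. It also shows the `tag['name']` shorthand and `tag.get()`, both of which are standard Tag features even though the cell above does not use them.

```python
from bs4 import BeautifulSoup

html = '<p class="title main" name="dromouse">text</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

classes = p['class']   # multi-valued attribute, returned as a list
name = p['name']       # shorthand for p.attrs['name']
missing = p.get('id')  # None when the attribute is absent, no KeyError

print(classes, name, missing)
```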


## Nested selection
Every result is a bs4.element.Tag, so you can keep selecting from it in the same way. For example, once we have the head node, we can select the title node inside it.

In [12]:
print(soup.head.title, '\n') # The title node inside the head node
print(type(soup.head.title), '\n') # Still a bs4.element.Tag
print(soup.head.title.string)

<title>The Dormouse's story</title> 

<class 'bs4.element.Tag'> 

The Dormouse's story


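## Selecting related nodes
Starting from a given node, you can select its children, parent, siblings and so on:
+ Children and descendants: the `contents` attribute returns all direct children, including the node's own text, as a list
+ `children` returns the same direct children and text, but as an iterator instead of a list
+ `descendants` returns all descendants, where text inside a node also counts as a node. The result is a generator
+ `parent` returns the direct parent of a node
+ `parents` returns all ancestors of a node
+ `previous_sibling` and `next_sibling` return the previous and the next sibling
+ `previous_siblings` and `next_siblings` return all preceding and all following siblings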

In [22]:
import bs4
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents, '\n') # List the text and all direct children of the p node
# print(soup.p.contents[1].prettify())
for con in soup.p.contents:
    if isinstance(con, bs4.element.NavigableString):
        print(type(con))
        print(con)
    else:
        print(type(con))
        print(con.prettify())

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        '] 

<class 'bs4.element.NavigableString'>

            Once upon a time there were three little sisters; and their names were
            
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/elsie" id="link1">
 <span>
  Elsie
 </span>
</a>

<class 'bs4.element.NavigableString'>


<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
</a>

<class 'bs4.element.NavigableString'>
 
            and
            
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/tillie" id="link3">


In [24]:
print(soup.p.children, '\n')
for i, child in enumerate(soup.p.children):
    print(i, child)

<list_iterator object at 0x00000237845C2358> 

0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
        


In [25]:
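# Select a node's parent
print(soup.a.parent) # Direct parent of the first a node

print(soup.a.parents) # All ancestors of the first a node; this returns a generator, so iterate over it to see the nodes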

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
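The cells above cover `contents`, `children` and `parent`. The remaining attributes from the list work along the same lines, as the sketch below tries to show. The markup is a compact variant of the sample invented for this example, and `html.parser` is used so that no `html` or `body` wrapper is added around it.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<div><p>Once <a id="link1"><span>Elsie</span></a>'
        '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p></div>')
soup = BeautifulSoup(html, 'html.parser')
first_a = soup.a

# parents is a generator that walks upward until it reaches the document object.
ancestor_names = [node.name for node in first_a.parents]

# descendants yields tags and text nodes alike.
descendants = list(first_a.descendants)

# Siblings can be text nodes too, so filter for Tag when only elements matter.
previous = first_a.previous_sibling
later_ids = [s['id'] for s in first_a.next_siblings if isinstance(s, Tag)]

print(ancestor_names)
print(descendants)
print(repr(previous), later_ids)
```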


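## Method selectors
+ `find_all(name, attrs, recursive, text, **kwargs)` finds every element that matches the conditions and returns them as a list
    + `name`: look up nodes with a given name
    + `attrs`: pass a dictionary to look up nodes with given attribute values
    + `text`: matches against node text; accepts a string or a compiled regular expression such as `re.compile('regular expression')` (Beautiful Soup 4.4.0 and later call this argument `string`)
+ `find(name, attrs, recursive, text, **kwargs)` returns the first node that matches
+ `find_parents()` / `find_parent()` search the ancestors
+ `find_next_siblings()` / `find_next_sibling()` search the siblings that follow
+ `find_previous_siblings()` / `find_previous_sibling()` search the siblings that come before
+ `find_all_next()` / `find_next()` search the nodes that follow in the document
+ `find_all_previous()` / `find_previous()` search the nodes that come before in the document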
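The following sketch exercises several of the methods listed above on a small invented document. It uses the `string` keyword, which is the newer name for `text`, and `class_`, which is Beautiful Soup's keyword for filtering on the `class` attribute. The markup and variable names are assumptions for illustration only.

```python
import re

from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
<ul class="list small" id="list-2">
  <li class="element">Hello, this is a link</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

all_items = soup.find_all('li')                      # by node name
by_attrs = soup.find_all(attrs={'id': 'list-1'})     # by attribute dictionary
by_class = soup.find_all('ul', class_='small')       # by one of several classes
matches = soup.find_all(string=re.compile('link'))   # text nodes matching a regex

first_ul = soup.find('ul')      # first match only
missing = soup.find('table')    # None when nothing matches

# Relative searches start from an already selected node.
bar = soup.find('li', string='Bar')
parent_id = bar.find_parent('ul')['id']
before = bar.find_previous_sibling('li').string
after = first_ul.find_next_sibling('ul')['id']

print(len(all_items), len(by_attrs), len(by_class), matches)
print(first_ul['id'], missing, parent_id, before, after)
```

Since `find()` returns `None` when nothing matches, it is usually worth checking the result before reading attributes from it.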
