# Beautiful Soup
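- A Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.
- It needs a parser. Python's built-in `html.parser` works without extra installation, but lxml (installed separately) is the usual choice.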
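
The parser is chosen through the second argument of the constructor. The snippet below is a minimal sketch rather than part of the original notebook: the `broken` string is a made-up input, and it uses the built-in `html.parser` so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input for illustration: a paragraph that is never closed.
broken = "<p>unclosed paragraph"

# The second argument names the parser; "html.parser" ships with Python.
# Passing "lxml" instead requires the lxml package and would also wrap
# the fragment in <html><body>...</body></html>.
soup = BeautifulSoup(broken, "html.parser")

# The missing </p> is added when the tree is serialized.
repaired = str(soup)
print(repaired)
```

Different parsers may repair the same malformed markup in different ways, so it is worth naming the parser explicitly rather than relying on whatever happens to be installed.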

In [36]:
import bs4
from bs4 import BeautifulSoup 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
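# Create the BeautifulSoup object; malformed HTML is corrected automatically during initialization.
# The parser is named explicitly (lxml here) to avoid the "no parser specified" warning.
soup = BeautifulSoup(html, 'lxml')
# A local HTML file can also be used to create the object
# soup = BeautifulSoup(open('index.html'), 'lxml')

# Formatted output: prettify() returns the document with standard indentation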
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


## The four kinds of objects
- Tag
- NavigableString
- BeautifulSoup
- Comment

### Tag
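- Corresponds to an individual tag in the HTML
- `soup.Tag` (where `Tag` stands for a tag name, for example `soup.p`) returns the first tag with that name, together with its contents
- Attributes
    - name
        - For the soup object itself, name is `[document]`
        - For any other tag, the value is the tag's own name
    - attrs
        - Returns all attributes of the tag as a dictionary
        - Get a specific attribute with `soup.Tag['attribute_name']` or `soup.Tag.get('attribute_name')`
- Modifying and deleting attributes works like the corresponding dictionary operations
    - `soup.Tag['attribute_name'] = new_value`
    - `del soup.Tag['attribute_name']`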

In [32]:
print(soup.name)
print(soup.p.name)
print(soup.p.attrs)
print(soup.p['class'])

[document]
p
{'class': ['title'], 'name': 'dromouse'}
['title']
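
The cell above only reads attributes. The dictionary-style modification and deletion from the list can be sketched as follows. This is a minimal illustration, not part of the original notebook: it rebuilds a one-tag document with `html.parser`, and the `id` value `"intro"` is made up.

```python
from bs4 import BeautifulSoup

# A one-tag document based on the first <p> of the example above.
snippet = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

p["id"] = "intro"         # add a new attribute (hypothetical value)
p["name"] = "dormouse"    # overwrite an existing attribute
del p["class"]            # delete an attribute

# .get() returns None for a missing attribute, whereas p["class"]
# would now raise KeyError, just like a dictionary.
missing = p.get("class")
attrs_after = dict(p.attrs)
print(attrs_after)
```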


### NavigableString
- Get the text inside a tag with `soup.Tag.string`


In [31]:
print(soup.p.string)

The Dormouse's story
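
One detail that the cell above does not show: `.string` only returns a value when the tag has a single child (or a single chain of nested single children). The sketch below illustrates this with a made-up `<div>` that mixes text and a nested tag. It uses `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# The <div> is a hypothetical addition used to show the multi-child case.
snippet = "<p><b>The Dormouse's story</b></p><div>one <i>two</i></div>"
soup = BeautifulSoup(snippet, "html.parser")

# <p> has exactly one child (<b>), which has exactly one string,
# so .string follows that chain down to the text.
title = soup.p.string
is_navigable = isinstance(title, NavigableString)

# <div> has two children (a string and <i>), so .string is None.
mixed = soup.div.string

# get_text() concatenates all text below the tag instead.
all_text = soup.div.get_text()
print(title, is_navigable, mixed, all_text)
```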


### BeautifulSoup
- The BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a Tag object; it is a special kind of Tag.



In [33]:
print(type(soup.name))
print(soup.name)
print(soup.attrs)

<class 'str'>
[document]
{}
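
As a rough check of the "special Tag" description, the sketch below confirms that the soup object is an instance of `Tag` and supports the same navigation and search calls. The short document is made up for illustration, and `html.parser` is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical minimal document.
doc = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# BeautifulSoup is a subclass of Tag.
is_tag = isinstance(soup, Tag)

# Tag-style navigation and searching work directly on the soup object.
title_text = soup.title.string
first_p = soup.find("p")
print(is_tag, soup.name, soup.attrs, title_text, first_p.name)
```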


### Comment
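- A Comment object is a special type of NavigableString. When printed, its content does not include the comment delimiters, so it looks like ordinary text. Unless it is handled explicitly, it can cause unexpected trouble in text processing.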
 

In [37]:
if isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)


 Elsie 
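
One possible way to keep comments out of text processing is sketched below. The helper name `visible_string` and the shortened document are assumptions made for illustration, not part of the original notebook, and `html.parser` is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Shortened version of the example: one link holds only a comment.
snippet = '<p><a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

def visible_string(tag):
    """Hypothetical helper: return tag.string unless it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

names = [visible_string(a) for a in soup.find_all("a")]

# Alternatively, remove every comment from the tree before extracting text.
# The list is built first so the tree is not modified while iterating.
comments = [node for node in soup.descendants if isinstance(node, Comment)]
for comment in comments:
    comment.extract()

remaining_text = soup.get_text()
print(names, remaining_text)
```

Filtering at read time leaves the tree untouched, while extracting the comments changes the document itself. Which one fits depends on whether the tree is reused afterwards.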
