# BeautifulSoup Parse & Extract Data

## 5-1 Use BeautifulSoup to extract web data
* Python can use BeautifulSoup to parse HTML and extract the target data from it.

### 5-1-1 BeautifulSoup Basic Usage

In [1]:
pip install beautifulsoup4




In [2]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


In [4]:
# usage:
# from bs4 import BeautifulSoup

# soup = BeautifulSoup(contents, "lxml")

* The 1st parameter is the downloaded HTML content.
* The 2nd parameter is the parser; lxml is generally faster than Python's built-in "html.parser".
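
As a minimal sketch of the two parameters, the cell below parses a small inline HTML string instead of a downloaded page. The `make_soup` helper and the sample markup are hypothetical names introduced for illustration. The fallback assumes lxml may not be installed, in which case bs4 raises `FeatureNotFound` and the built-in "html.parser" is used instead.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup standing in for downloaded HTML contents.
html = "<html><body><p>Hello, BeautifulSoup</p></body></html>"

def make_soup(contents):
    """Parse contents with lxml if available, else with html.parser."""
    try:
        return BeautifulSoup(contents, "lxml")
    except FeatureNotFound:
        # lxml is not installed; fall back to the parser bundled with Python.
        return BeautifulSoup(contents, "html.parser")

soup = make_soup(html)
print(soup.p.text)
```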

Extract the target tags from the soup object ("a" is the hyperlink tag):

In [9]:
# tags = soup("a")

After getting all the "a" tag objects, we can loop over them:

In [11]:
# for tag in tags:
#     print(tag.get("href", None))
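
The two commented steps above can be tried without a network request. This is a small sketch on hypothetical inline markup, not the page used later in this section. It shows that calling the soup object is shorthand for `find_all()`, and that `get()` returns None for a tag without the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute.
html = '<a href="/home">Home</a><a>No link</a><a href="#top">Top</a>'
soup = BeautifulSoup(html, "html.parser")

tags = soup("a")  # shorthand for soup.find_all("a")
hrefs = [tag.get("href") for tag in tags]  # None when href is absent
print(hrefs)
```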

### Sample code:

In [13]:
import requests
from bs4 import BeautifulSoup 

url = "https://fchart.github.io/"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, "lxml")
    tags = soup("a")
    for tag in tags:
        print(tag.get("href", None))
else:
    print("Error! HTTP request failed...")



#page-top
#fchart
#flowchart
#codeeditor
#python
#node
#arduino
#ardublockly
#microbit
#doc
#
https://hueyanchen.github.io/
http://blockly.is-best.net/
https://fchart.github.io/MyMind/index_zh-Hant.html
https://drive.google.com/file/d/1NKyh1nl1vwyhHf7mOLj7cxb2g6SpSx57/view?usp=sharing
https://drive.google.com/file/d/1PeCYYT-xA3v9Wn7LiwM2wq8ltvJVwGqY/view?usp=sharing
https://github.com/fchart/fChartExamples2
https://fchart.github.io/fChart6%E4%BD%BF%E7%94%A8%E6%89%8B%E5%86%8A.pdf
https://fchart.github.io/tutorial/
https://fchart.github.io
http://fchart.is-best.net
https://fchart.github.io/fChart6%E4%BD%BF%E7%94%A8%E6%89%8B%E5%86%8A.pdf
#
https://drive.google.com/file/d/1NKyh1nl1vwyhHf7mOLj7cxb2g6SpSx57/view?usp=sharing
fChart使用說明.htm
#
#
#
https://drive.google.com/file/d/1NKyh1nl1vwyhHf7mOLj7cxb2g6SpSx57/view?usp=sharing
https://hueyanchen.github.io/
fChart使用說明.htm
https://drive.google.com/file/d/1VspHydl48PIoeVT7sjw4Gx307UrJUNHB/view?usp=sharing
https://mega.nz/file/SNlw2TyY#XUIiHxhjY9
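
The output above mixes in-page anchors (values starting with "#"), relative file names, and absolute URLs. One possible way to clean such a list, which is not part of the original notebook, is sketched below with the standard library's `urljoin`. The `absolute_links` helper is a hypothetical name, and the sample hrefs are copied from the output.

```python
from urllib.parse import urljoin

base_url = "https://fchart.github.io/"
# A few href values copied from the output above.
hrefs = ["#page-top", "#", "https://hueyanchen.github.io/", "fChart使用說明.htm"]

def absolute_links(hrefs, base_url):
    """Drop in-page anchors and resolve the remaining hrefs against base_url."""
    links = []
    for href in hrefs:
        if href is None or href.startswith("#"):
            continue  # missing href or in-page anchor
        links.append(urljoin(base_url, href))
    return links

links = absolute_links(hrefs, base_url)
for link in links:
    print(link)
```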

### 5-1-2 Get related information from an HTML tag

* After getting a specific tag, we can use the following properties to get its related data:

![image.png](attachment:image.png)

Example:

In [19]:
import requests 
from bs4 import BeautifulSoup

url = "https://fchart.github.io/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
tags = soup("a")  # all <a> tags on the page
tag = tags[12]    # the 13th link; the index depends on the page's current contents
print("----------------")
print("tag:", tag)
print("----------------")
print("URL: ", tag.get("href", None))
print("Tag Content: ", tag.text)
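print("target Property: ", tag["target"])  # tag["..."] raises KeyError if the attribute is missing
tags = soup("img")  # all <img> tags
tag = tags[1]       # the 2nd image on the page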
print("----------------")
print("tag:", tag)
print("----------------")
print("Image URL: ", tag.get("src", None))
print("alt property: ", tag["alt"])
print("Property: ", tag.attrs)




----------------
tag: <a class="btn btn-info" href="http://blockly.is-best.net/" style="font-size:small" target="_blank"><b>ESP8266 Blockly for MicroPython 英文線上版</b></a>
----------------
URL:  http://blockly.is-best.net/
Tag Content:  ESP8266 Blockly for MicroPython 英文線上版
target Property:  _blank
----------------
tag: <img alt="fChart直譯器圖例" class="img-fluid rounded mb-3 mb-md-0" src="img/fchart01.png"/>
----------------
Image URL:  img/fchart01.png
alt property:  fChart直譯器圖例
Property:  {'class': ['img-fluid', 'rounded', 'mb-3', 'mb-md-0'], 'src': 'img/fchart01.png', 'alt': 'fChart直譯器圖例'}
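
The difference between `tag.get(...)` and `tag[...]` matters when an attribute may be missing. The sketch below uses a hypothetical `<img>` tag modeled on the one printed above: `get()` returns None or a default you supply, while indexing raises `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical tag modeled on the <img> element printed above.
html = '<img alt="logo" class="img-fluid rounded" src="img/logo.png"/>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.img

print(tag.name)                 # tag name
print(tag.get("src"))           # attribute value
print(tag.get("title"))         # missing attribute: get() returns None
print(tag.get("title", "n/a"))  # or a default of your choice
print(tag["class"])             # class is multi-valued, so it is a list

missing = False
try:
    tag["title"]                # indexing a missing attribute raises KeyError
except KeyError:
    missing = True
    print("no title attribute")
```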


### 5-1-3 Use BeautifulSoup to Search HTML Tags

![image.png](attachment:image.png)

* This section uses https://fchart.github.io/Elements.html as the test page:
![image.png](attachment:image.png)

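1. Use select() and select_one() to search for HTML tags with CSS selectors. select() returns a list of all matching tags; select_one() returns only the first match, or None when nothing matches: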

In [26]:
import requests
from bs4 import BeautifulSoup 

url = "https://fchart.github.io/Elements.html"
response = requests.get(url)
print("*********************")
print("HTML", response.text)
print("*********************")
soup = BeautifulSoup(response.text, "lxml")
tag = soup.select_one("h2")      # first match for a tag name
print("h2: ", tag.text)
tags = soup.select("b")          # list of all <b> tags
print("b: ", tags[0].text)
tag = soup.select_one("#q2")     # "#" selects by id
tag2 = tag.select_one("b")       # search only inside the matched tag
print("b: ", tag2.text)
tags = soup.select(".response")  # "." selects by class
for tag in tags[:4]:             # print the first four responses
    print("li: ", tag.text)



*********************
HTML <!DOCTYPE html>
<html lang="big5">
 <head>
  <meta charset="utf-8"/>
  <title>HTML清單標籤</title>
  <style>
  .question { color: brown }
  .answer { font-size: 12pt; color: green }
  </style>  
 </head>
 <body>
   <h2 id="main">問卷調查</h2>
   <ol id="survey" class="survey">
    <li id="q1" class="question"><b>請問你的性別?</b>
      <ul class="answer">
        <li class="response">男</li>
        <li class="response selected">女</li>
      </ul>
    </li>
    <li id="q2" class="question"><b>請問你是否喜歡偵探小說?</b>
      <ul class="answer">
        <li class="response">喜歡</li>
        <li class="response selected">不喜歡</li>
      </ul>
    </li>
    <li id="q3" class="question"><b>請問你是否會 HTML 網頁設計?</b>
      <ul class="answer">
        <li class="response selected">會</li>
        <li class="response">不會</li>
      </ul>
    </li>
  </ol>  
 </body>
</html>
*********************
h2:  問卷調查
b:  請問你的性別?
b:  請問你是否喜歡偵探小說?
li:  男
li:  女
li:  喜歡
li:  不喜歡
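
Selectors can also be combined. The sketch below runs on an abridged inline copy of the survey list printed above, so it needs no network access. It selects by two classes at once, selects descendants of an id, and shows that `select_one()` returns None when nothing matches. The `#q9` id is a deliberately nonexistent example.

```python
from bs4 import BeautifulSoup

# Abridged copy of the survey list from Elements.html shown above.
html = """
<ol id="survey" class="survey">
  <li id="q1" class="question"><b>請問你的性別?</b>
    <ul class="answer">
      <li class="response">男</li>
      <li class="response selected">女</li>
    </ul>
  </li>
  <li id="q2" class="question"><b>請問你是否喜歡偵探小說?</b>
    <ul class="answer">
      <li class="response">喜歡</li>
      <li class="response selected">不喜歡</li>
    </ul>
  </li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# <li> tags that carry both the "response" and "selected" classes.
selected = [tag.text for tag in soup.select("li.response.selected")]
# Every "response" tag nested inside the element with id="q2".
q2_answers = [tag.text for tag in soup.select("#q2 .response")]
# No element has id="q9", so select_one() returns None.
missing = soup.select_one("#q9")

print(selected)
print(q2_answers)
print(missing)
```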
