# Day19
## Parsing Web Page Structure: Working with CSS Selectors in the BeautifulSoup Package
- Use an HTML parser
- Use CSS selector syntax to retrieve child nodes

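## Assignment
The previous day's assignment already covered several locating tools, so today we practice more variations of CSS selectors in depth.

- Target website:
https://pokemondb.net/pokedex/all
- Use CSS selector techniques to scrape the Pokémon table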

In [1]:
from bs4 import BeautifulSoup
import requests

### `GET` Request

In [2]:
url = "https://pokemondb.net/pokedex/all"
r = requests.get(url)
print(r.status_code, "\n\n", r.text[:1000])

200 

 <!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Pokémon Pokédex: list of Pokémon with stats | Pokémon Database</title>
<link rel="preconnect" href="https://img.pokemondb.net">
<link rel="preconnect" href="https://s.pokemondb.net">
<link rel="preload" href="/static/fonts/fira-sans-v17-latin-400.woff2" as="font" type="font/woff2" crossorigin>
<link rel="preload" href="/static/fonts/fira-sans-v17-latin-400i.woff2" as="font" type="font/woff2" crossorigin>
<link rel="preload" href="/static/fonts/fira-sans-v17-latin-600.woff2" as="font" type="font/woff2" crossorigin>
<link rel="stylesheet" href="/static/css/pokemondb-aa70195104.css">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta property="og:description" name="description" content="The Pokédex contains detailed stats for every creature from the Pokémon games, up to and including the latest Scarlet/Violet games.">
<link rel="canonical" href="https://pokemondb.net/pokedex/all">
<meta pr
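
The request above checks the status code by printing it. As a minimal sketch, the same download can be wrapped in a helper that fails loudly on an HTTP error and never hangs forever. The function name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html("https://pokemondb.net/pokedex/all")
```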

### Using an HTML Parser

In [3]:
# Convert the response text into a BeautifulSoup object
soup = BeautifulSoup(r.text, "html.parser")
soup.title

<title>Pokémon Pokédex: list of Pokémon with stats | Pokémon Database</title>
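
To see what the parser step does without touching the network, here is a small sketch on a made-up HTML string. The names `demo_html` and `demo_soup` are illustrative only; `"html.parser"` is the parser bundled with Python's standard library.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document used only for illustration.
demo_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Tag names can be used as attributes to reach the first matching node.
print(demo_soup.title.string)  # Demo
print(demo_soup.p.get_text())  # Hello
```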

### Selecting Nodes That Match Given Attributes
- Find the Pokémon information table
- Use: `soup.find(<tag_name>, {<attribute>: <attribute_value>})`


In [4]:
table = soup.find('table', id='pokedex')
print(table.prettify()[:1000])

<table class="data-table sticky-header block-wide" id="pokedex">
 <thead>
  <tr>
   <th class="sorting" data-sort-type="int">
    <div class="sortwrap">
     #
    </div>
   </th>
   <th class="sorting" data-sort-type="string">
    <div class="sortwrap">
     Name
    </div>
   </th>
   <th>
    <div class="sortwrap">
     Type
    </div>
   </th>
   <th class="sorting" data-sort-default="desc" data-sort-type="int">
    <div class="sortwrap">
     Total
    </div>
   </th>
   <th class="sorting" data-sort-default="desc" data-sort-type="int">
    <div class="sortwrap">
     HP
    </div>
   </th>
   <th class="sorting" data-sort-default="desc" data-sort-type="int">
    <div class="sortwrap">
     Attack
    </div>
   </th>
   <th class="sorting" data-sort-default="desc" data-sort-type="int">
    <div class="sortwrap">
     Defense
    </div>
   </th>
   <th class="sorting" data-sort-default="desc" data-sort-type="int">
    <div class="sortwrap">
     Sp. Atk
    </div>
   </th>
   <th c
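
Since this exercise is about CSS selectors, it may help to compare `find` with its selector counterpart. The sketch below runs on a made-up two-table snippet (`sample_html` is an assumption modeled on the output above) and shows that the dictionary form, the keyword form, and `select_one("table#pokedex")` all reach the same node.

```python
from bs4 import BeautifulSoup

# Made-up HTML modeled on the page structure; not the real page.
sample_html = """
<table id="pokedex" class="data-table">
  <thead><tr><th>#</th><th>Name</th></tr></thead>
  <tbody><tr><td>0001</td><td>Bulbasaur</td></tr></tbody>
</table>
<table id="other"><tbody><tr><td>x</td></tr></tbody></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

by_dict = sample_soup.find("table", {"id": "pokedex"})  # attribute dictionary
by_kwarg = sample_soup.find("table", id="pokedex")      # keyword argument
by_css = sample_soup.select_one("table#pokedex")        # CSS selector: tag#id

# All three calls return the very same Tag object.
print(by_dict is by_kwarg is by_css)  # True
```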

### Chained Lookups
- Get the header cells and all body rows of the table

In [5]:
header = table.thead.tr.find_all('th')
body_rows = table.tbody.find_all('tr')

In [6]:
header

[<th class="sorting" data-sort-type="int"><div class="sortwrap">#</div></th>,
 <th class="sorting" data-sort-type="string"><div class="sortwrap">Name</div></th>,
 <th><div class="sortwrap">Type</div></th>,
 <th class="sorting" data-sort-default="desc" data-sort-type="int"><div class="sortwrap">Total</div></th>,
 <th class="sorting" data-sort-default="desc" data-sort-type="int"><div class="sortwrap">HP</div></th>,
 <th class="sorting" data-sort-default="desc" data-sort-type="int"><div class="sortwrap">Attack</div></th>,
 <th class="sorting" data-sort-default="desc" data-sort-type="int"><div class="sortwrap">Defense</div></th>,
 <th class="sorting" data-sort-default="desc" data-sort-type="int"><div class="sortwrap">Sp. Atk</div></th>,
 <th class="sorting" data-sort-default="desc" data-sort-type="int"><div class="sortwrap">Sp. Def</div></th>,
 <th class="sorting" data-sort-default="desc" data-sort-type="int"><div class="sortwrap">Speed</div></th>]

In [7]:
body_rows[:10]

[<tr>
 <td class="cell-num cell-fixed" data-sort-value="1"><picture class="infocard-cell-img">
 <source height="56" srcset="https://img.pokemondb.net/sprites/scarlet-violet/icon/avif/bulbasaur.avif" type="image/avif" width="60"/>
 <img alt="Bulbasaur" class="img-fixed icon-pkmn" height="56" loading="lazy" src="https://img.pokemondb.net/sprites/scarlet-violet/icon/bulbasaur.png" width="60"/>
 </picture><span class="infocard-cell-data">0001</span></td> <td class="cell-name"><a class="ent-name" href="/pokedex/bulbasaur" title="View Pokedex for #0001 Bulbasaur">Bulbasaur</a></td><td class="cell-icon"><a class="type-icon type-grass" href="/type/grass">Grass</a><br/> <a class="type-icon type-poison" href="/type/poison">Poison</a></td>
 <td class="cell-num cell-total">318</td>
 <td class="cell-num">45</td>
 <td class="cell-num">49</td>
 <td class="cell-num">49</td>
 <td class="cell-num">65</td>
 <td class="cell-num">65</td>
 <td class="cell-num">45</td>
 </tr>,
 <tr>
 <td class="cell-num cell
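
The chained lookup above can also be written with the CSS child combinator `>`, which is the "child node" syntax this exercise targets. This is a sketch on made-up HTML (`sample_html` is an assumption), not the notebook's own solution.

```python
from bs4 import BeautifulSoup

# Made-up HTML modeled on the page structure; not the real page.
sample_html = """
<table id="pokedex">
  <thead><tr><th>#</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>0001</td><td>Bulbasaur</td></tr>
    <tr><td>0002</td><td>Ivysaur</td></tr>
  </tbody>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "A > B" selects B only when it is a direct child of A.
header_cells = sample_soup.select("table#pokedex > thead > tr > th")
row_nodes = sample_soup.select("table#pokedex > tbody > tr")
# :nth-of-type(2) picks the second <td> in each row (the Name column here).
name_cells = sample_soup.select("table#pokedex > tbody > tr > td:nth-of-type(2)")

print([th.get_text() for th in header_cells])  # ['#', 'Name']
print(len(row_nodes))                          # 2
print([td.get_text() for td in name_cells])    # ['Bulbasaur', 'Ivysaur']
```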

### Matching Node Text: Find the Node Whose Text Is Ivysaur
- Hint: use `soup.find("a", string=<text_in_the_html_node>)`

In [8]:
soup.find_all('a', string='Ivysaur')

[<a class="ent-name" href="/pokedex/ivysaur" title="View Pokedex for #0002 Ivysaur">Ivysaur</a>]
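
The `string` argument accepts a compiled regex as well as an exact string, and a node can also be pinned down by an exact attribute value in a CSS selector. The sketch below illustrates all three on made-up links (`sample_html` is an assumption modeled on the output above).

```python
import re

from bs4 import BeautifulSoup

# Made-up links modeled on the page; not the real page.
sample_html = """
<a class="ent-name" href="/pokedex/ivysaur">Ivysaur</a>
<a class="ent-name" href="/pokedex/venusaur">Venusaur</a>
<a class="ent-name" href="/pokedex/bulbasaur">Bulbasaur</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact text match: only the node whose text is exactly "Ivysaur".
exact = sample_soup.find_all("a", string="Ivysaur")
# Regex text match: every node whose text ends with "saur".
ending = sample_soup.find_all("a", string=re.compile(r"saur$"))
# CSS attribute selector: exact match on the href value.
by_href = sample_soup.select_one('a[href="/pokedex/ivysaur"]')

print(len(exact), len(ending))  # 1 3
print(by_href is exact[0])      # True
```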

### Finding Nodes Whose Attribute Contains Partial Text: Find the Pokémon Type Labels

- Use a regex
- `soup.find(<tag_name>, {<attribute>: <regex_pattern>})`

In [9]:
# Select the Pokémon type labels (GRASS, POISON, ...), then use a set to keep only the distinct types

import re
regex_pattern = re.compile("type-.*")
set(soup.find_all('a', class_=regex_pattern))


{<a class="type-icon type-bug" href="/type/bug">Bug</a>,
 <a class="type-icon type-dark" href="/type/dark">Dark</a>,
 <a class="type-icon type-dragon" href="/type/dragon">Dragon</a>,
 <a class="type-icon type-electric" href="/type/electric">Electric</a>,
 <a class="type-icon type-fairy" href="/type/fairy">Fairy</a>,
 <a class="type-icon type-fighting" href="/type/fighting">Fighting</a>,
 <a class="type-icon type-fire" href="/type/fire">Fire</a>,
 <a class="type-icon type-flying" href="/type/flying">Flying</a>,
 <a class="type-icon type-ghost" href="/type/ghost">Ghost</a>,
 <a class="type-icon type-grass" href="/type/grass">Grass</a>,
 <a class="type-icon type-ground" href="/type/ground">Ground</a>,
 <a class="type-icon type-ice" href="/type/ice">Ice</a>,
 <a class="type-icon type-normal" href="/type/normal">Normal</a>,
 <a class="type-icon type-poison" href="/type/poison">Poison</a>,
 <a class="type-icon type-psychic" href="/type/psychic">Psychic</a>,
 <a class="type-icon type-rock" hr
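
The same partial match can be expressed with CSS attribute selectors: `[attr^="..."]` matches a prefix and `[attr*="..."]` matches a substring. Note that the regex `type-.*` also matches the shared `type-icon` class, so it selects these links either way. The sketch below uses made-up HTML (`sample_html` is an assumption) and collects the link text rather than the tags, which yields a plain set of type names.

```python
from bs4 import BeautifulSoup

# Made-up cells modeled on the page; not the real page.
sample_html = """
<td><a class="type-icon type-grass" href="/type/grass">Grass</a>
<a class="type-icon type-poison" href="/type/poison">Poison</a></td>
<td><a class="type-icon type-fire" href="/type/fire">Fire</a></td>
<td><a class="type-icon type-grass" href="/type/grass">Grass</a></td>
<td><a class="ent-name" href="/pokedex/ivysaur">Ivysaur</a></td>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Prefix match on href: links that point to a /type/ page.
type_links = sample_soup.select('a[href^="/type/"]')
# A set comprehension over the text removes duplicates such as "Grass".
type_names = sorted({a.get_text() for a in type_links})
print(type_names)  # ['Fire', 'Grass', 'Poison']

# Substring match on the class attribute; the "ent-name" link is excluded.
class_match = sample_soup.select('a[class*="type-"]')
print(len(class_match))  # 4
```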

### Assembling the Information into a Table

In [10]:
header_cols = [col.get_text() for col in header]
row_values = [[item.get_text() for item in row.find_all('td')] for row in body_rows]
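
Plain `get_text()` keeps the whitespace found between nested tags, which is why some columns need `strip()` afterwards. As an alternative sketch, `get_text(" ", strip=True)` strips each text fragment and joins the non-empty ones with a single space. The row below is made up (`sample_html` is an assumption modeled on the first row shown earlier).

```python
from bs4 import BeautifulSoup

# A made-up row modeled on the first table row; not the real page.
sample_html = (
    "<tr><td><span>0001</span></td> "
    "<td><a>Bulbasaur</a></td>"
    "<td><a>Grass</a><br/> <a>Poison</a></td>\n"
    "<td>318</td></tr>"
)
sample_row = BeautifulSoup(sample_html, "html.parser").tr

# Strip every text fragment, drop the empty ones, join with one space.
cells = [td.get_text(" ", strip=True) for td in sample_row.find_all("td")]
print(cells)  # ['0001', 'Bulbasaur', 'Grass Poison', '318']
```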

In [11]:
import pandas as pd
df = pd.DataFrame(row_values, columns=header_cols)
df['#'] = df['#'].apply(lambda x: x.strip())
df["Type"] = df["Type"].apply(lambda x: x.strip().split(" "))
df

|  | # | Name | Type | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001 | Bulbasaur | [Grass, Poison] | 318 | 45 | 49 | 49 | 65 | 65 | 45 |
| 1 | 0002 | Ivysaur | [Grass, Poison] | 405 | 60 | 62 | 63 | 80 | 80 | 60 |
| 2 | 0003 | Venusaur | [Grass, Poison] | 525 | 80 | 82 | 83 | 100 | 100 | 80 |
| 3 | 0003 | Venusaur Mega Venusaur | [Grass, Poison] | 625 | 80 | 100 | 123 | 122 | 120 | 80 |
| 4 | 0004 | Charmander | [Fire] | 309 | 39 | 52 | 43 | 60 | 50 | 65 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1210 | 1023 | Iron Crown | [Steel, Psychic] | 590 | 90 | 72 | 100 | 122 | 108 | 98 |
| 1211 | 1024 | Terapagos Normal Form | [Normal] | 450 | 90 | 65 | 85 | 65 | 85 | 60 |
| 1212 | 1024 | Terapagos Terastal Form | [Normal] | 600 | 95 | 95 | 110 | 105 | 110 | 85 |
| 1213 | 1024 | Terapagos Stellar Form | [Normal] | 700 | 160 | 105 | 110 | 130 | 110 | 85 |


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1215 entries, 0 to 1214
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   #        1215 non-null   object
 1   Name     1215 non-null   object
 2   Type     1215 non-null   object
 3   Total    1215 non-null   object
 4   HP       1215 non-null   object
 5   Attack   1215 non-null   object
 6   Defense  1215 non-null   object
 7   Sp. Atk  1215 non-null   object
 8   Sp. Def  1215 non-null   object
 9   Speed    1215 non-null   object
dtypes: object(10)
memory usage: 95.1+ KB
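
Every column above has dtype `object` because the scraped values are strings. If the stats are needed for arithmetic or sorting, one possible follow-up is to convert them with `pd.to_numeric`. The sketch below uses a two-row sample built from values shown earlier; `sample_df` and `numeric_cols` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the table above, kept as strings like the scraped data.
sample_df = pd.DataFrame(
    {
        "#": ["0001", "0004"],
        "Name": ["Bulbasaur", "Charmander"],
        "Total": ["318", "309"],
        "HP": ["45", "39"],
    }
)

# Convert only the stat columns; "#" stays a string to keep its leading zeros.
numeric_cols = ["Total", "HP"]
for col in numeric_cols:
    sample_df[col] = pd.to_numeric(sample_df[col])

print(sample_df.dtypes)
print(sample_df["Total"].sum())  # 627
```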
