# Day20
## Parsing Web Page Structure: Using XPath with the lxml Package
- Use `lxml.html`
- Use XPath syntax to select child nodes

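## Assignment
Since the Day18 assignment already gave us practice with some locator tools, today we use the same website as Day19 and practice more XPath variations in greater depth.

- Target website:
https://pokemondb.net/pokedex/all
- Use XPath techniques to scrape the Pokémon table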

In [1]:
import lxml.html
import requests

### `GET` Request

In [2]:
url = "https://pokemondb.net/pokedex/all"
r = requests.get(url)
print(r.status_code, "\n\n", r.text[:1000])


200 

 <!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Pokémon Pokédex: list of Pokémon with stats | Pokémon Database</title>
<link rel="preconnect" href="https://img.pokemondb.net">
<link rel="preconnect" href="https://s.pokemondb.net">
<link rel="preload" href="/static/fonts/fira-sans-v17-latin-400.woff2" as="font" type="font/woff2" crossorigin>
<link rel="preload" href="/static/fonts/fira-sans-v17-latin-400i.woff2" as="font" type="font/woff2" crossorigin>
<link rel="preload" href="/static/fonts/fira-sans-v17-latin-600.woff2" as="font" type="font/woff2" crossorigin>
<link rel="stylesheet" href="/static/css/pokemondb-aa70195104.css">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta property="og:description" name="description" content="The Pokédex contains detailed stats for every creature from the Pokémon games, up to and including the latest Scarlet/Violet games.">
<link rel="canonical" href="https://pokemondb.net/pokedex/all">
<meta pr

### Converting to an HTML Element object
- Use `lxml.html.fromstring()`

In [3]:
# Convert the response text into an Element object
tree = lxml.html.fromstring(r.text)
tree

<Element html at 0x7fb0300ba940>
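To see what the returned Element offers without hitting the network, here is a minimal sketch on a made-up HTML string (the `html_text` content is an assumption standing in for `r.text`). It reads the root tag, a text node, and the full text under an element.

```python
import lxml.html

# Made-up HTML standing in for r.text (an assumption for illustration)
html_text = """
<html>
  <head><title>Demo page</title></head>
  <body><p>Hello <b>XPath</b></p></body>
</html>
"""

# Parse the string; for a full document the root <html> element is returned
tree = lxml.html.fromstring(html_text)

root_tag = tree.tag                               # tag name of the root element
title = tree.xpath("//title/text()")[0]           # text node inside <title>
paragraph = tree.xpath("//p")[0].text_content()   # all text under <p>, children included

print(root_tag, title, paragraph)
```

`text_content()` concatenates the text of the element and all of its descendants, which is handy when a cell contains nested tags.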

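### Selecting nodes that match a given attribute
- Find the Pokémon information table
- Use: `tree.xpath('//<tag_name>[@<attribute>="<attribute_value>"]')`; `xpath()` returns a list, so index it to get a single node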


In [4]:
table = tree.xpath('//table[@id="pokedex"]')[0]
table

<Element table at 0x7fb0300ba2b0>
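Because `xpath()` always returns a list, indexing with `[0]` raises `IndexError` when nothing matches. The sketch below, run on a made-up two-table page (the HTML and the `no-such-id` value are assumptions), shows the attribute predicate together with one possible guard for the empty case.

```python
import lxml.html

# Made-up page with two tables; only one has id="pokedex"
html_text = """
<html><body>
  <table id="other"><tr><td>ignore me</td></tr></table>
  <table id="pokedex"><tr><td>Bulbasaur</td></tr></table>
</body></html>
"""
tree = lxml.html.fromstring(html_text)

# The predicate [@id="pokedex"] keeps only tables whose id attribute matches
matches = tree.xpath('//table[@id="pokedex"]')
table = matches[0] if matches else None   # guard against an empty result

# A predicate that matches nothing yields an empty list, not an error
missing = tree.xpath('//table[@id="no-such-id"]')

table_id = table.get("id")
first_cell = table.xpath(".//td/text()")[0]
print(table_id, first_cell, missing)
```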

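### Chained lookups
- Get the header cells and all body rows, starting from the `table` element with relative paths (`./`)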

In [5]:
header = table.xpath("./thead/tr/th")
body_rows = table.xpath("./tbody/tr")

In [6]:
header

[<Element th at 0x7fb030174b40>,
 <Element th at 0x7fb0300bb4d0>,
 <Element th at 0x7fb0300ba710>,
 <Element th at 0x7fb0300bb160>,
 <Element th at 0x7fb0300ba3a0>,
 <Element th at 0x7fb0300ba670>,
 <Element th at 0x7fb0300ba490>,
 <Element th at 0x7fb0300bb570>,
 <Element th at 0x7fb0300bb5c0>,
 <Element th at 0x7fb0300bb610>]

In [7]:
body_rows[:10]

[<Element tr at 0x7fb0300bb660>,
 <Element tr at 0x7fb0300bb6b0>,
 <Element tr at 0x7fb0300bb700>,
 <Element tr at 0x7fb0300bb750>,
 <Element tr at 0x7fb0300bb7a0>,
 <Element tr at 0x7fb0300bb7f0>,
 <Element tr at 0x7fb0300bb840>,
 <Element tr at 0x7fb0300bb890>,
 <Element tr at 0x7fb0300bb8e0>,
 <Element tr at 0x7fb0300bb930>]
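When chaining lookups from an element, the leading characters of the path matter: `./` and `.//` are relative to that element, while `//` restarts from the document root. A small sketch on a made-up two-table page (an assumption for illustration) makes the difference visible.

```python
import lxml.html

# Made-up page: table "a" has two rows, table "b" has one
html_text = """
<html><body>
  <table id="a"><tbody><tr><td>1</td></tr><tr><td>2</td></tr></tbody></table>
  <table id="b"><tbody><tr><td>3</td></tr></tbody></table>
</body></html>
"""
tree = lxml.html.fromstring(html_text)
table_a = tree.xpath('//table[@id="a"]')[0]

relative_rows = table_a.xpath("./tbody/tr")   # children path from table_a only
descendant_rows = table_a.xpath(".//tr")      # any depth, but still inside table_a
document_rows = table_a.xpath("//tr")         # whole document, ignores table_a

print(len(relative_rows), len(descendant_rows), len(document_rows))
```

Only the last query picks up the row from table "b", even though it is called on `table_a`.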

### Matching node text: find the node whose text is Ivysaur
- Hint: use `tree.xpath('//<tag_name>[text()="some_string"]')`

In [8]:
table.xpath('.//a[text()="Ivysaur"]/text()')

['Ivysaur']
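`text()="..."` is an exact comparison, so surrounding whitespace in the page breaks the match. The sketch below uses made-up links (the `href` values are assumptions) to compare the exact match with the standard XPath 1.0 function `normalize-space()`.

```python
import lxml.html

# Made-up links; the second one has padding spaces around its text
html_text = """
<html><body>
  <a href="/ivysaur">Ivysaur</a>
  <a href="/venusaur"> Venusaur </a>
  <a href="/mega">Mega Venusaur</a>
</body></html>
"""
tree = lxml.html.fromstring(html_text)

# Exact match on the text node
exact = tree.xpath('//a[text()="Ivysaur"]/@href')

# Exact match fails because the text node is " Venusaur " with spaces
padded_exact = tree.xpath('//a[text()="Venusaur"]/@href')

# normalize-space() trims and collapses whitespace before comparing
normalized = tree.xpath('//a[normalize-space(text())="Venusaur"]/@href')

print(exact, padded_exact, normalized)
```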

### Finding nodes whose attribute contains a substring: find the Pokémon type labels

- Contains: `tree.xpath('//<tag_name>[contains(@<attribute>, "<attribute_value>")]')`
- Does not contain: `tree.xpath('//<tag_name>[not(contains(@<attribute>, "<attribute_value>"))]')`

In [9]:
# Find the Pokémon type labels (GRASS, POISON, ...) and use a set to keep only the distinct types
set(tree.xpath('//a[contains(@class, "type-")]/text()'))

{'Bug',
 'Dark',
 'Dragon',
 'Electric',
 'Fairy',
 'Fighting',
 'Fire',
 'Flying',
 'Ghost',
 'Grass',
 'Ground',
 'Ice',
 'Normal',
 'Poison',
 'Psychic',
 'Rock',
 'Steel',
 'Water'}
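The two patterns above can be tried side by side on a few made-up anchors (the class names here are assumptions for illustration). Note that an element with no `class` attribute at all also satisfies the `not(contains(...))` form, because a missing attribute is treated as an empty string.

```python
import lxml.html

# Made-up anchors: two type labels, one name link, one link without a class
html_text = """
<html><body>
  <a class="type-icon type-grass">Grass</a>
  <a class="type-icon type-poison">Poison</a>
  <a class="ent-name">Bulbasaur</a>
  <a>No class</a>
</body></html>
"""
tree = lxml.html.fromstring(html_text)

# Anchors whose class attribute contains the substring "type-"
type_labels = tree.xpath('//a[contains(@class, "type-")]/text()')

# Anchors whose class attribute does not contain it (or is missing entirely)
other_labels = tree.xpath('//a[not(contains(@class, "type-"))]/text()')

print(type_labels, other_labels)
```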

### Assembling the information into a table

In [10]:
header_cols = [col.xpath(".//text()")[0] for col in header]
row_values = [["".join(col.xpath('.//text()')) for col in row.xpath('./td')] for row in body_rows]
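The same header and row extraction can be checked on a miniature table. This is a sketch under assumptions: the HTML is made up, and `text_content()` is swapped in for `"".join(col.xpath('.//text()'))`, which gives the same joined text for these cells.

```python
import lxml.html

# Made-up two-row table mimicking the layout of the real one
html_text = """
<table id="pokedex">
  <thead><tr><th>#</th><th>Name</th><th>Type</th></tr></thead>
  <tbody>
    <tr><td> 0001 </td><td><a>Bulbasaur</a></td><td><a>Grass</a> <a>Poison</a></td></tr>
    <tr><td> 0004 </td><td><a>Charmander</a></td><td><a>Fire</a></td></tr>
  </tbody>
</table>
"""
tree = lxml.html.fromstring(html_text)
table = tree.xpath('//table[@id="pokedex"]')[0]

# One string per header cell
header_cols = [th.text_content().strip() for th in table.xpath("./thead/tr/th")]

# One list per body row, one stripped string per cell
row_values = [
    [td.text_content().strip() for td in tr.xpath("./td")]
    for tr in table.xpath("./tbody/tr")
]

# Pair headers with values to get one dict per row
records = [dict(zip(header_cols, row)) for row in row_values]
print(records)
```

Stripping at extraction time means the later `strip()` calls on the DataFrame columns would no longer be needed.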


In [11]:
import pandas as pd

df = pd.DataFrame(row_values, columns=header_cols)
df['#'] = df['#'].apply(lambda x: x.strip())
df["Type"] = df["Type"].apply(lambda x: x.strip().split(" "))
df

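|      | #    | Name                    | Type             | Total | HP  | Attack | Defense | Sp. Atk | Sp. Def | Speed |
| ---- | ---- | ----------------------- | ---------------- | ----- | --- | ------ | ------- | ------- | ------- | ----- |
| 0    | 0001 | Bulbasaur               | [Grass, Poison]  | 318   | 45  | 49     | 49      | 65      | 65      | 45    |
| 1    | 0002 | Ivysaur                 | [Grass, Poison]  | 405   | 60  | 62     | 63      | 80      | 80      | 60    |
| 2    | 0003 | Venusaur                | [Grass, Poison]  | 525   | 80  | 82     | 83      | 100     | 100     | 80    |
| 3    | 0003 | Venusaur Mega Venusaur  | [Grass, Poison]  | 625   | 80  | 100    | 123     | 122     | 120     | 80    |
| 4    | 0004 | Charmander              | [Fire]           | 309   | 39  | 52     | 43      | 60      | 50      | 65    |
| ...  | ...  | ...                     | ...              | ...   | ... | ...    | ...     | ...     | ...     | ...   |
| 1210 | 1023 | Iron Crown              | [Steel, Psychic] | 590   | 90  | 72     | 100     | 122     | 108     | 98    |
| 1211 | 1024 | Terapagos Normal Form   | [Normal]         | 450   | 90  | 65     | 85      | 65      | 85      | 60    |
| 1212 | 1024 | Terapagos Terastal Form | [Normal]         | 600   | 95  | 95     | 110     | 105     | 110     | 85    |
| 1213 | 1024 | Terapagos Stellar Form  | [Normal]         | 700   | 160 | 105    | 110     | 130     | 110     | 85    |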


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1215 entries, 0 to 1214
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   #        1215 non-null   object
 1   Name     1215 non-null   object
 2   Type     1215 non-null   object
 3   Total    1215 non-null   object
 4   HP       1215 non-null   object
 5   Attack   1215 non-null   object
 6   Defense  1215 non-null   object
 7   Sp. Atk  1215 non-null   object
 8   Sp. Def  1215 non-null   object
 9   Speed    1215 non-null   object
dtypes: object(10)
memory usage: 95.1+ KB
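Every column above has dtype `object` because the scraped values are strings. If numeric operations are needed, one possible follow-up is converting the stat columns with `pd.to_numeric`. The sketch below uses a tiny hand-built frame (an assumption standing in for `df`) with values copied from the table above.

```python
import pandas as pd

# Small stand-in for the scraped frame: all values are strings
df = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander"],
        "Total": ["318", "309"],
        "HP": ["45", "39"],
    }
)

# Convert each stat column from string to a numeric dtype
numeric_cols = ["Total", "HP"]
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col])

total_sum = int(df["Total"].sum())
print(df.dtypes)
print(total_sum)
```

On the full table the same loop would cover `Total`, `HP`, `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, and `Speed`.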
