# Day31
## Selenium 物件定位 – XPath
- 在 Selenium 中使用 XPath
- 盤點 XPath 語法



## 作業說明
練習更多在 selenium 中使用 XPath 的變化用法
- 目標網站： https://channel.jd.com/outdoor.html

目標：
- 取得所有小類別名稱下的
  - 品牌列表（名稱、連結）

![](https://i.imgur.com/SbV4W35.png)

Hint: 
- 請根據引導完成這份代碼
- 記得先安裝 Chrome 瀏覽器，才能順利啟動 chromedriver
- 會用到我們 Day20 所學的 Xpath

### 套件安裝

In [None]:
!pip install -U selenium
!pip install webdriver_manager
!pip install fake-useragent

### 套件導入

In [1]:
from fake_useragent import UserAgent
import numpy as np
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
from tqdm import tqdm


In [3]:
# 下載 chrome webdriver 執行檔至預設位址，下載完成後會顯示位置
driver = webdriver.Chrome(ChromeDriverManager().install())



Current google-chrome version is 102.0.5005
Get LATEST chromedriver version for 102.0.5005 google-chrome
Driver [/Users/jiunyiyang/.wdm/drivers/chromedriver/mac64/102.0.5005.61/chromedriver] found in cache


### 使用 fake-useragent 產生 User Agent

In [2]:
# 目標網址
base_url = 'https://channel.jd.com/outdoor.html'
# Chromedriver path
driver_path = "/Users/jiunyiyang/.wdm/drivers/chromedriver/mac64/102.0.5005.61/chromedriver"

# 設定 user agent
opt = webdriver.ChromeOptions()
user_agent = UserAgent()
opt.add_argument('--user-agent=%s' % user_agent)


### 獲取所有小類別頁面連結 
> Day29/Day30 內容，可自行載入儲存的爬取表格，也可以再爬一次

In [3]:
# 載入目標網址
driver = webdriver.Chrome(driver_path, options=opt)
driver.get(base_url)

# 模擬用戶等待網站載入，每次都用隨機時間
time.sleep(np.random.uniform(3, 5))

# 有兩種規則
# 1. 大字類別： <div id="Categorys">...<dt>"戶外鞋服"</dt>   (網站本身英文複數用錯...)
# 2. 小字類別： <div id="Categorys">...<dd>...<a>"衝鋒衣褲"</a>

# 取得 <div id="Categorys"> 
categories = driver.find_element_by_xpath('//div[@id="Categorys"]')

# 取得所有 大字類別
# 使用相對路徑: "." 開頭
cate_names = []
for ele_group in tqdm(categories.find_elements_by_xpath('.//dl[@class="item-inner"]')):
    medium_cate = ele_group.find_element_by_xpath('.//dt').text
    for small_cate in ele_group.find_elements_by_xpath('.//a'):
        cate_names.append((medium_cate, small_cate.text, small_cate.get_attribute('href')))

driver.close()


100%|██████████| 5/5 [00:01<00:00,  3.87it/s]


In [8]:
import pandas as pd

# 將小類別頁面連結存成 DataFrame
df_cates = pd.DataFrame(cate_names, columns=["medium_cate","small_cate","url"])
df_cates.head()

Unnamed: 0,medium_cate,small_cate,url
0,户外鞋服,冲锋衣裤,https://list.jd.com/list.html?cat=1318%2C2628%...
1,户外鞋服,徒步鞋,https://list.jd.com/list.html?cat=1318%2C2628%...
2,户外鞋服,抓绒衣裤,"https://list.jd.com/list.html?cat=1318,2628,12128"
3,户外鞋服,羽绒服棉服,"https://list.jd.com/list.html?cat=1318,2628,12126"
4,户外鞋服,越野跑鞋,"https://list.jd.com/list.html?cat=1318,2628,12137"


### 取得類別下的品牌列表

In [21]:
def fetch_brand_list(cate_name, url):
    """
    取得類別下的品牌列表
    """
    try:
        # 載入網址
        driver.get(url)
    
        # 模擬用戶等待網站載入，每次都用隨機時間
        time.sleep(np.random.uniform(3, 5))

        # 先框定品牌區塊
        brand_block = driver.find_element_by_xpath('//ul[@class="J_valueList v-fixed"]')
        brand_list = []
        # 遍歷每個品牌的節點、抓取資訊
        for brand in tqdm(brand_block.find_elements_by_xpath('.//a')):
            try:
                img = brand.find_element_by_xpath('./img').get_attribute('src')
            except:
                img = None
            brand_list.append((
                cate_name,
                brand.get_attribute('title'), 
                brand.get_attribute('href'),
                img
            ))
        return brand_list
    except TimeoutException as e:
        return []


In [16]:
test_url = df_cates.url.values.tolist()[0]
test_cate = df_cates.small_cate.values.tolist()[0]

driver = webdriver.Chrome(driver_path, options=opt)
brand_list = fetch_brand_list(cate_name=test_cate, url=test_url)
driver.close()

print(brand_list[:10])


100%|██████████| 392/392 [00:09<00:00, 40.53it/s]


[('冲锋衣裤', '北面（The North Face）', 'https://list.jd.com/list.html?cat=1318%2C2628%2C12123&ev=exbrand_%E5%8C%97%E9%9D%A2%EF%BC%88The%20North%20Face%EF%BC%89%5E&cid3=12123', 'https://img30.360buyimg.com/popshop/jfs/t1/18119/19/13818/21465/5ca2cb71Ec6236d60/1460957cdb49a907.jpg'), ('冲锋衣裤', '哥伦比亚（Columbia）', 'https://list.jd.com/list.html?cat=1318%2C2628%2C12123&ev=exbrand_%E5%93%A5%E4%BC%A6%E6%AF%94%E4%BA%9A%EF%BC%88Columbia%EF%BC%89%5E&cid3=12123', 'https://img20.360buyimg.com/popshop/jfs/t1921/79/2835163344/4591/715667db/56f23bd6N5b8eb10a.jpg'), ('冲锋衣裤', '探路者（TOREAD）', 'https://list.jd.com/list.html?cat=1318%2C2628%2C12123&ev=exbrand_%E6%8E%A2%E8%B7%AF%E8%80%85%EF%BC%88TOREAD%EF%BC%89%5E&cid3=12123', 'https://img30.360buyimg.com/popshop/jfs/t2890/25/2339333782/5918/c68774ba/5762439eNad1cbdfe.jpg'), ('冲锋衣裤', '骆驼（CAMEL）', 'https://list.jd.com/list.html?cat=1318%2C2628%2C12123&ev=exbrand_%E9%AA%86%E9%A9%BC%EF%BC%88CAMEL%EF%BC%89%5E&cid3=12123', 'https://img20.360buyimg.com/popshop/jfs/t3445/7

### 遍歷所有小類別頁，取得其品牌列表
- 使用我們剛剛自定義的 function ，減少重複的撰寫、並設法讓程式變得易讀

In [22]:
driver = webdriver.Chrome(driver_path, options=opt)

brands_of_cates = []
for d in tqdm(df_cates.values.tolist()):
    small_cate, url = d[1], d[2]
    brand_list = fetch_brand_list(cate_name=small_cate, url=url)
    brands_of_cates += brand_list

driver.close()


100%|██████████| 392/392 [00:09<00:00, 40.14it/s]
100%|██████████| 382/382 [00:10<00:00, 35.50it/s]
100%|██████████| 325/325 [00:07<00:00, 44.34it/s]
100%|██████████| 274/274 [00:06<00:00, 44.48it/s]
100%|██████████| 248/248 [00:05<00:00, 45.00it/s]
100%|██████████| 230/230 [00:05<00:00, 38.41it/s]
100%|██████████| 466/466 [00:11<00:00, 41.19it/s]
100%|██████████| 321/321 [00:06<00:00, 48.80it/s]
100%|██████████| 500/500 [00:11<00:00, 45.43it/s]
100%|██████████| 314/314 [00:07<00:00, 39.60it/s]
100%|██████████| 376/376 [00:08<00:00, 41.93it/s]
100%|██████████| 499/499 [00:11<00:00, 44.41it/s]
100%|██████████| 309/309 [00:06<00:00, 50.71it/s]
100%|██████████| 377/377 [00:08<00:00, 43.76it/s]
100%|██████████| 391/391 [00:08<00:00, 46.62it/s]
100%|██████████| 500/500 [00:11<00:00, 44.36it/s]
100%|██████████| 500/500 [00:09<00:00, 52.25it/s]
100%|██████████| 187/187 [00:04<00:00, 40.30it/s]
100%|██████████| 96/96 [00:02<00:00, 45.84it/s]
 53%|█████▎    | 19/36 [26:10<23:24, 82.63s/it]


NameError: name 'TimeoutException' is not defined

In [18]:
df_brands = pd.DataFrame(brands_of_cates, columns=["small_cate", "brand_name", "brand_page", "brand_logo"])
print(df_brands.shape)
df_brands.head()

(4635, 4)


Unnamed: 0,small_cate,brand_name,brand_page,brand_logo
0,冲锋衣裤,北面（The North Face）,https://list.jd.com/list.html?cat=1318%2C2628%...,https://img30.360buyimg.com/popshop/jfs/t1/181...
1,冲锋衣裤,哥伦比亚（Columbia）,https://list.jd.com/list.html?cat=1318%2C2628%...,https://img20.360buyimg.com/popshop/jfs/t1921/...
2,冲锋衣裤,探路者（TOREAD）,https://list.jd.com/list.html?cat=1318%2C2628%...,https://img30.360buyimg.com/popshop/jfs/t2890/...
3,冲锋衣裤,骆驼（CAMEL）,https://list.jd.com/list.html?cat=1318%2C2628%...,https://img20.360buyimg.com/popshop/jfs/t3445/...
4,冲锋衣裤,Jack Wolfskin,https://list.jd.com/list.html?cat=1318%2C2628%...,https://img30.360buyimg.com/popshop/jfs/t1939/...
