# Scraping Strategies for Static Web Pages


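* Understand the scraping strategy for static web pages
* Get to know a package suited to scraping static pages: Requests
* Get to know a package suited to scraping static pages: BeautifulSoup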

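## Assignment Goals

Use Requests + BeautifulSoup to fetch and parse the content of the following two sites:

1. Dcard: https://www.dcard.tw/f
2. Zhihu: https://www.zhihu.com/explore

Then answer the following questions:

1. After Requests fetches a page, how do you get the data out, and what is its data type?
2. Why process the result with BeautifulSoup? What is the type after processing?
3. The data that comes back from Zhihu looks a bit odd. How can that be fixed?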

## 1. After Requests fetches a page, how do you get the data out, and what is its data type?

#### Dcard: https://www.dcard.tw/f

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
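# Send a browser-like User-Agent header with the request
headers = {'user-agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Mobile Safari/537.36'}
res = requests.get('https://www.dcard.tw/f', headers=headers)

# res.content holds the raw bytes of the body; decode them as UTF-8 to get a str
res_ = res.content.decode('utf-8')
res_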

'<!DOCTYPE html><html lang="zh-TW"><head prefix="og: http://ogp.me/ns#" itemscope="" itemType="https://schema.org/WebSite"><meta charSet="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1"/><meta name="apple-mobile-web-app-status-bar-style" content="default"/><link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Roboto:400,300"/><meta name="application-name" content="Dcard"/><meta name="apple-itunes-app" content="app-id=951353454"/><meta name="theme-color" content="#006aa6"/><meta name="mobile-web-app-capable" content="yes"/><meta name="apple-mobile-web-app-capable" content="yes"/><meta name="supported-color-schemes" content="light"/><meta property="fb:app_id" content="211628828926493"/><meta property="fb:pages" content="178875832200695,577748865730563,1333515469994506,619122564952487,804004803032067,178024805867764"/><meta property="al:ios:app_store_id" conten

#### Zhihu: https://www.zhihu.com/explore

In [3]:
# Same browser-like User-Agent header as for Dcard
headers = {'user-agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Mobile Safari/537.36'}
res2 = requests.get('https://www.zhihu.com/explore', headers=headers)

# Decode the raw bytes of the body as UTF-8 to get a str
res_2 = res2.content.decode('utf-8')
res_2

'<!DOCTYPE html>\n<html lang="zh-CN" dropEffect="none" class="no-js no-auth ">\n<head>\n<meta charset="utf-8" />\n\n<meta http-equiv="X-ZA-Experiment" content="default:None">\n<title>发现 - 知乎</title>\n\n<meta name="apple-itunes-app" content="app-id=432274380, app-argument=zhihu://explore">\n\n\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"/>\n<meta http-equiv="mobile-agent" content="format=html5;url=https://www.zhihu.com/explore">\n<meta id="znonce" name="znonce" content="1e11f9fbd76f425f9603b596e958ba00">\n\n\n\n<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png" sizes="152x152">\n<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png" sizes="120x120">\n<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-76.dcf79352.png" sizes="76x76">\n<link rel="apple-touch-icon" href="https://static.zhihu.com

### What is the data type?

In [4]:
print('Dcard data type:', type(res_))
print('Zhihu data type:', type(res_2))

Dcard data type: <class 'str'>
Zhihu data type: <class 'str'>
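
The difference between the raw bytes and the decoded string can be sketched without touching the network. This is a minimal illustration only: `raw_bytes` is a made-up stand-in for what `res.content` holds, built from the `<title>` tag seen in the Zhihu output above.

```python
# Hypothetical stand-in for `res.content`: the body of an HTTP response is bytes.
raw_bytes = '<title>发现 - 知乎</title>'.encode('utf-8')
print(type(raw_bytes))   # <class 'bytes'>

# Decoding with the page's charset (UTF-8 here) yields a str, like `res_` above.
decoded = raw_bytes.decode('utf-8')
print(type(decoded))     # <class 'str'>

# Non-ASCII characters take several bytes each in UTF-8, so the two lengths differ.
print(len(raw_bytes), len(decoded))
```

With a real response, `res.text` also returns a `str`: requests decodes `res.content` using `res.encoding`, which is why setting `encoding = 'utf-8'` before reading `.text` has the same effect as decoding by hand.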


# ===================================

## 2. Why process the result with BeautifulSoup? What is the type after processing?

#### Dcard

In [5]:
# soup.prettify() re-indents the parsed markup so it is easier to read
soup = BeautifulSoup(res_,'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html lang="zh-TW">
 <head itemscope="" itemtype="https://schema.org/WebSite" prefix="og: http://ogp.me/ns#">
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1, minimum-scale=1" name="viewport"/>
  <meta content="default" name="apple-mobile-web-app-status-bar-style"/>
  <link href="https://fonts.googleapis.com/css?family=Roboto:400,300" rel="stylesheet" type="text/css"/>
  <meta content="Dcard" name="application-name"/>
  <meta content="app-id=951353454" name="apple-itunes-app"/>
  <meta content="#006aa6" name="theme-color"/>
  <meta content="yes" name="mobile-web-app-capable"/>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <meta content="light" name="supported-color-schemes"/>
  <meta content="211628828926493" property="fb:app_id"/>
  <meta content="178875832200695,577748865730563,1333515469994506,619122564952487,804004803032067,178024805867764" property="fb:pages"/>
 

#### Zhihu

In [6]:
# soup.prettify() re-indents the parsed markup so it is easier to read
soup2 = BeautifulSoup(res_2,'lxml')
print(soup2.prettify())

<!DOCTYPE html>
<html class="no-js no-auth" dropeffect="none" lang="zh-CN">
 <head>
  <meta charset="utf-8"/>
  <meta content="default:None" http-equiv="X-ZA-Experiment"/>
  <title>
   发现 - 知乎
  </title>
  <meta content="app-id=432274380, app-argument=zhihu://explore" name="apple-itunes-app"/>
  <meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
  <meta content="format=html5;url=https://www.zhihu.com/explore" http-equiv="mobile-agent"/>
  <meta content="1e11f9fbd76f425f9603b596e958ba00" id="znonce" name="znonce"/>
  <link href="https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png" rel="apple-touch-icon" sizes="152x152"/>
  <link href="https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png" rel="apple-touch-icon" sizes="120x120"/>
  <link href="https://static.zhihu.com/static/revved/img/ios/touch-icon-76.dcf79352.png" rel="apple-touch-icon" sizes="76x76"/>
  <link href="https://static.zhihu.com/static/revved/i

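### Why process the result with BeautifulSoup?

- After BeautifulSoup processes it, the data is no longer a plain `str`.

- It becomes a `bs4.BeautifulSoup` object, a parsed tree of the HTML that BeautifulSoup can navigate.

- BeautifulSoup's methods then make it much easier to organize and extract the data we need.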
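
To make the last point concrete, here is a small sketch of pulling individual values out of a parsed tree. It assumes a shortened sample built from tags that appear in the Zhihu response; `sample_html`, `sample_soup`, and the other variable names are illustrative and not part of the original notebook.

```python
from bs4 import BeautifulSoup

# A few tags copied from the Zhihu response shown earlier (shortened for illustration).
sample_html = """
<html lang="zh-CN">
<head>
<meta charset="utf-8" />
<title>发现 - 知乎</title>
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"/>
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png" sizes="152x152">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png" sizes="120x120">
</head>
<body></body>
</html>
"""

# 'html.parser' ships with Python; the notebook uses 'lxml', which is a separate install.
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Text of the <title> tag
page_title = sample_soup.title.string

# Attribute lookup on a single tag; attrs= is used because `name` is find()'s own parameter
viewport = sample_soup.find('meta', attrs={'name': 'viewport'})['content']

# Collect one attribute from every matching tag
icon_urls = [link['href'] for link in sample_soup.find_all('link', rel='apple-touch-icon')]

print(page_title)
print(viewport)
print(icon_urls)
```

Doing the same with the raw `str` would mean hand-written string searches or regular expressions, which tend to break as soon as attribute order or whitespace changes.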

### What is the type after processing?

In [7]:
print('Dcard type after processing:', type(soup))
print('Zhihu type after processing:', type(soup2))

Dcard type after processing: <class 'bs4.BeautifulSoup'>
Zhihu type after processing: <class 'bs4.BeautifulSoup'>


# =================================

## 3. The data that comes back from Zhihu looks a bit odd. How can that be fixed?

In [8]:
import requests

url = 'https://www.zhihu.com/explore'

# No custom headers here, so requests sends its default User-Agent
r = requests.get(url)

r.encoding = 'utf-8'
print(r.text[0:600])

<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>
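
The error page above was spotted by reading the HTML, but the HTTP status code reports the same problem more directly. The sketch below is an illustration under stated assumptions: it builds a bare `requests.Response` by hand and sets `status_code` to 400 to imitate the rejected request, so nothing is sent over the network. `looks_blocked` is a hypothetical helper name.

```python
import requests

def looks_blocked(response):
    """Return True when the server answered with a 4xx or 5xx status."""
    return not response.ok

# Hand-built response imitating the 400 shown above; a real one comes from requests.get().
fake = requests.Response()
fake.status_code = 400

print(looks_blocked(fake))   # True

# raise_for_status() turns an error status into an exception,
# so an error page is not silently parsed as if it were real content.
try:
    fake.raise_for_status()
    error_raised = False
except requests.HTTPError:
    error_raised = True
print(error_raised)
```

In the notebook's own cell, the equivalent check would be looking at `r.status_code` (or calling `r.raise_for_status()`) right after `requests.get(url)`.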



# Solution:

In [9]:
# The site's server guards against crawler-like requests,
# so adding a user-agent to the headers to pose as an ordinary browser visitor fixes the problem

url = 'https://www.zhihu.com/explore'

headers = {'user-agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Mobile Safari/537.36'}

ress = requests.get(url,headers=headers)

ress.encoding = 'utf-8'
print(ress.text[0:6000])

<!DOCTYPE html>
<html lang="zh-CN" dropEffect="none" class="no-js no-auth ">
<head>
<meta charset="utf-8" />

<meta http-equiv="X-ZA-Experiment" content="default:None">
<title>发现 - 知乎</title>

<meta name="apple-itunes-app" content="app-id=432274380, app-argument=zhihu://explore">


<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"/>
<meta http-equiv="mobile-agent" content="format=html5;url=https://www.zhihu.com/explore">
<meta id="znonce" name="znonce" content="3d1378e3510f4496a7ca8cc0562caa00">



<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-152.87c020b9.png" sizes="152x152">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-120.496c913b.png" sizes="120x120">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/ios/touch-icon-76.dcf79352.png" sizes="76x76">
<link rel="apple-touch-icon" href="https://static.zhihu.com/static/revved/img/io
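
One way to see what the `headers` argument changes is to inspect the request before it is sent. The sketch below prepares two requests without sending them and compares their `User-Agent` headers. `prepare`, `plain`, and `disguised` are illustrative names. That the server rejects the first request specifically because of its default User-Agent is an assumption consistent with the result above, not something the response itself states.

```python
import requests

URL = 'https://www.zhihu.com/explore'
BROWSER_UA = ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/80.0.3987.122 Mobile Safari/537.36')

def prepare(url, user_agent=None):
    """Build the request a Session would send, without actually sending it."""
    headers = {'user-agent': user_agent} if user_agent else None
    session = requests.Session()
    return session.prepare_request(requests.Request('GET', url, headers=headers))

plain = prepare(URL)                  # what requests.get(url) would send
disguised = prepare(URL, BROWSER_UA)  # what requests.get(url, headers=headers) would send

# Without custom headers, requests identifies itself as "python-requests/<version>".
print(plain.headers['User-Agent'])

# With the custom header, the server sees a browser-like User-Agent instead.
print(disguised.headers['User-Agent'])
```

Header names are case-insensitive, which is why the lowercase `'user-agent'` key used in the notebook overrides the library's default `User-Agent`.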