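### To track international disease outbreak news on a regular basis:
1. Write a web crawler that fetches the World Health Organization's 2017 list of disease outbreak news (http://www.who.int/csr/don/archive/year/2017/en/)
2. Collect the link, date, title, and country of each item into a pandas DataFrame, as in the example below: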

| 連結 (link)                                                                      | 標題 (title)                                               | 日期 (date)    | 國家 (country) |
|----------------------------------------------------------------------------------|------------------------------------------------------------|----------------|--------------|
| http://www.who.int/entity/csr/don/17-august-2017-mers-saudi-arabia/en/index.html | Middle East respiratory syndrome coronavirus (MERS-CoV) –  | 17 August 2017 | Saudi Arabia |


## Fetch the page

In [55]:
import requests
res = requests.get('http://www.who.int/csr/don/archive/year/2017/en/')
res.encoding = 'utf-8'
res.encoding

'utf-8'

In [56]:
#res.text

## Pass the page content to the parser (BeautifulSoup)

In [57]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')

## Print the parsed results

In [59]:
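domain = 'http://www.who.int'
# Each <li> under .auto_archive is one outbreak report
for content in soup.select('.auto_archive li'):
    #print(content)
    # The anchor text holds the date; its href is relative to the domain
    dt        = content.select('a')[0].text
    link      = domain + content.select('a')[0]['href']
    link_info = content.select('.link_info')[0].text
    #print(link_info)
    # Split on the first ' – ' only, because a country name may contain one too
    title , country  = link_info.split(' – ', 1)
    print(country, title, link, dt)

    print('=========================')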

Italy Chikungunya http://www.who.int/entity/csr/don/15-september-2017-chikungunya-italy/en/index.html 15 September 2017
China Human infection with avian influenza A(H7N9) virus http://www.who.int/entity/csr/don/13-september-2017-ah7n9-china/en/index.html 13 September 2017
Oman Middle East respiratory syndrome coronavirus (MERS-CoV) http://www.who.int/entity/csr/don/12-september-2017-mers-oman/en/index.html 12 September 2017
Saudi Arabia Middle East respiratory syndrome coronavirus (MERS-CoV) http://www.who.int/entity/csr/don/6-september-2017-mers-saudi-arabia/en/index.html 6 September 2017
China Human infection with avian influenza A(H7N9) virus http://www.who.int/entity/csr/don/5-september-2017-ah7n9-china/en/index.html 5 September 2017
France – French Guiana Yellow fever http://www.who.int/entity/csr/don/30-august-2017-yellow-fever-french-guiana/en/index.html 30 August 2017
United Arab Emirates Middle East respiratory syndrome coronavirus (MERS-CoV) http://www.who.int/entity/csr/don/

China Human infection with avian influenza A(H7N9) virus http://www.who.int/entity/csr/don/18-january-2017-ah7n9-china/en/index.html 18 January 2017
Saudi Arabia Middle East respiratory syndrome coronavirus (MERS-CoV) http://www.who.int/entity/csr/don/17-january-2017-mers-saudi-arabia/en/index.html 17 January 2017
China Human infection with avian influenza A(H7N9) virus http://www.who.int/entity/csr/don/17-january-2017-ah7n9-china/en/index.html 17 January 2017
Brazil Yellow fever http://www.who.int/entity/csr/don/13-january-2017-yellow-fever-brazil/en/index.html 13 January 2017
Madagascar Plague http://www.who.int/entity/csr/don/09-january-2017-plague-mdg/en/index.html 9 January 2017
China Human infection with avian influenza A(H7N9) virus http://www.who.int/entity/csr/don/03-january-2017-ah7n9-china/en/index.html 3 January 2017
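To experiment with the extraction step without requesting the live site, the following is a minimal offline sketch. It uses the standard library's `html.parser.HTMLParser` as a stand-in for BeautifulSoup, so it is not the method used in this notebook. `SAMPLE_HTML` is a hypothetical fragment: its structure (`<li>` items containing an `<a>` and a `<span class="link_info">`) is only inferred from the selectors in the loop above, and the real WHO markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical markup, inferred only from the CSS selectors used above.
SAMPLE_HTML = """
<ul class="auto_archive">
  <li>
    <a href="/entity/csr/don/15-september-2017-chikungunya-italy/en/index.html">15 September 2017</a>
    <span class="link_info">Chikungunya \u2013 Italy</span>
  </li>
</ul>
"""


class ArchiveParser(HTMLParser):
    """Collect the href, date text and link_info text of every <li>."""

    def __init__(self):
        super().__init__()
        self.items = []      # one dict per <li>
        self._field = None   # field that the current text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li':
            self.items.append({'link': '', 'dt': '', 'link_info': ''})
        elif not self.items:
            return           # ignore tags that appear before the first <li>
        elif tag == 'a':
            self.items[-1]['link'] = attrs.get('href', '')
            self._field = 'dt'
        elif tag == 'span' and 'link_info' in (attrs.get('class') or '').split():
            self._field = 'link_info'

    def handle_endtag(self, tag):
        if tag in ('a', 'span'):
            self._field = None

    def handle_data(self, data):
        # Text outside <a> and .link_info (e.g. whitespace) is skipped
        if self._field and self.items:
            self.items[-1][self._field] += data


parser = ArchiveParser()
parser.feed(SAMPLE_HTML)
print(parser.items)
```

With BeautifulSoup the same three values come from `content.select('a')[0]` and `content.select('.link_info')[0]`, which is considerably shorter; the sketch only makes the assumed page structure explicit.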


## Store each record as a dictionary in a list

In [61]:
domain = 'http://www.who.int'
who_list = []
for content in soup.select('.auto_archive li'):
    dic       = {}
    dic['dt']        = content.select('a')[0].text
    dic['link']      = domain + content.select('a')[0]['href']
    link_info = content.select('.link_info')[0].text
    dic['title'] , dic['country']  = link_info.split(' – ', 1)
    who_list.append(dic)
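Unpacking `link_info.split(' – ', 1)` into two names raises `ValueError` if an item ever lacks the separator. As a hedged alternative, the sketch below builds one record with `str.partition`, which never raises, and joins the link with `urllib.parse.urljoin`. The helper name `build_record` and the constants `BASE_URL` and `SEPARATOR` are illustrative and not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.who.int'
SEPARATOR = ' \u2013 '  # en dash surrounded by spaces, as in the loop above


def build_record(date_text, href, link_info):
    """Return one row as a dict; country is '' when no separator is found."""
    # partition() splits on the first separator only and never raises,
    # unlike unpacking the result of split() when the separator is missing.
    title, _, country = link_info.partition(SEPARATOR)
    return {
        'dt': date_text.strip(),
        'link': urljoin(BASE_URL, href),  # works for relative and absolute hrefs
        'title': title.strip(),
        'country': country.strip(),
    }


print(build_record(
    '15 September 2017',
    '/entity/csr/don/15-september-2017-chikungunya-italy/en/index.html',
    'Chikungunya \u2013 Italy',
))
```

Inside the loop, this would be called as `build_record(a.text, a['href'], info.text)` with `a` and `info` taken from the two `select` calls.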

## Convert the list into a DataFrame

In [64]:
import pandas
whodf = pandas.DataFrame(who_list)
whodf.head()

|   | country | dt | link | title |
|---|---------|----|------|-------|
| 0 | Italy | 15 September 2017 | http://www.who.int/entity/csr/don/15-september... | Chikungunya |
| 1 | China | 13 September 2017 | http://www.who.int/entity/csr/don/13-september... | Human infection with avian influenza A(H7N9) v... |
| 2 | Oman | 12 September 2017 | http://www.who.int/entity/csr/don/12-september... | Middle East respiratory syndrome coronavirus (... |
| 3 | Saudi Arabia | 6 September 2017 | http://www.who.int/entity/csr/don/6-september-... | Middle East respiratory syndrome coronavirus (... |
| 4 | China | 5 September 2017 | http://www.who.int/entity/csr/don/5-september-... | Human infection with avian influenza A(H7N9) v... |


## Reorder and rename the columns

In [67]:
whodf = whodf[['link', 'title', 'dt', 'country']]
whodf.columns = ['連結', '標題', '日期' , '國家']  # link, title, date, country
whodf.head()

|   | 連結 | 標題 | 日期 | 國家 |
|---|------|------|------|------|
| 0 | http://www.who.int/entity/csr/don/15-september... | Chikungunya | 15 September 2017 | Italy |
| 1 | http://www.who.int/entity/csr/don/13-september... | Human infection with avian influenza A(H7N9) v... | 13 September 2017 | China |
| 2 | http://www.who.int/entity/csr/don/12-september... | Middle East respiratory syndrome coronavirus (... | 12 September 2017 | Oman |
| 3 | http://www.who.int/entity/csr/don/6-september-... | Middle East respiratory syndrome coronavirus (... | 6 September 2017 | Saudi Arabia |
| 4 | http://www.who.int/entity/csr/don/5-september-... | Human infection with avian influenza A(H7N9) v... | 5 September 2017 | China |
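The date column above holds plain strings, which sort alphabetically rather than chronologically. A minimal sketch of parsing them with the standard library follows. It assumes every entry uses the "day month-name year" form seen in the output, and `parse_who_date` is an illustrative name, not part of the original notebook.

```python
from datetime import datetime


def parse_who_date(text):
    """Parse a date such as '15 September 2017' into a datetime.date."""
    # %d accepts one- or two-digit days; %B is the full month name
    # (English names under Python's default C locale).
    return datetime.strptime(text.strip(), '%d %B %Y').date()


dates = [parse_who_date(d) for d in ['15 September 2017', '9 January 2017']]
print(sorted(dates))
```

Within pandas, `pandas.to_datetime` applied to the column serves the same purpose.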


## Save the data

In [70]:
#whodf.to_excel('who_news.xlsx')
whodf.to_csv('who_news.csv')

## Add the current time

In [71]:
from datetime import datetime
datetime.now()

datetime.datetime(2017, 9, 20, 10, 50, 3, 661387)

In [74]:
import requests
import pandas
from bs4 import BeautifulSoup
from datetime import datetime
# Fetch the page
res = requests.get('http://www.who.int/csr/don/archive/year/2017/en/')
# Set the response encoding
res.encoding = 'utf-8'
# Pass the page content to the parser
soup = BeautifulSoup(res.text, 'html.parser')

domain = 'http://www.who.int'
who_list = []
for content in soup.select('.auto_archive li'):
    dic       = {}
    dic['dt']        = content.select('a')[0].text
    
    # Record the search time
    dic['search_dt'] = datetime.now()
    
    dic['link']      = domain + content.select('a')[0]['href']
    link_info = content.select('.link_info')[0].text
    dic['title'] , dic['country']  = link_info.split(' – ', 1)
    who_list.append(dic)

whodf = pandas.DataFrame(who_list)
whodf = whodf[['link', 'title', 'dt', 'country', 'search_dt']]
whodf.columns = ['連結', '標題', '日期' , '國家', '搜尋時間']  # link, title, date, country, search time
whodf.head()

|   | 連結 | 標題 | 日期 | 國家 | 搜尋時間 |
|---|------|------|------|------|----------|
| 0 | http://www.who.int/entity/csr/don/15-september... | Chikungunya | 15 September 2017 | Italy | 2017-09-20 10:53:58.711384 |
| 1 | http://www.who.int/entity/csr/don/13-september... | Human infection with avian influenza A(H7N9) v... | 13 September 2017 | China | 2017-09-20 10:53:58.726984 |
| 2 | http://www.who.int/entity/csr/don/12-september... | Middle East respiratory syndrome coronavirus (... | 12 September 2017 | Oman | 2017-09-20 10:53:58.726984 |
| 3 | http://www.who.int/entity/csr/don/6-september-... | Middle East respiratory syndrome coronavirus (... | 6 September 2017 | Saudi Arabia | 2017-09-20 10:53:58.726984 |
| 4 | http://www.who.int/entity/csr/don/5-september-... | Human infection with avian influenza A(H7N9) v... | 5 September 2017 | China | 2017-09-20 10:53:58.726984 |
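Because `datetime.now()` is called inside the loop, rows from the same run can carry slightly different timestamps, as rows 0 and 1 above show. If one timestamp per run is preferred, it can be taken once before the loop. The sketch below illustrates this under that assumption; the `rows` list is stand-in data, and the use of UTC is a suggestion rather than something the original notebook does.

```python
from datetime import datetime, timezone

# Take the timestamp once so every row of a run shares the same value.
search_dt = datetime.now(timezone.utc)

rows = [{'title': 'Chikungunya'}, {'title': 'Yellow fever'}]  # stand-in rows
for row in rows:
    row['search_dt'] = search_dt

print(search_dt.isoformat())
```

A timezone-aware value also keeps runs comparable when the crawler is executed on machines in different time zones.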


## Using strip

In [15]:
s = 'csr/don/15-september-2017-chikungunya-italy/en/index.html'
s.lstrip('csr')  # strips any leading 'c', 's' or 'r' characters, not the literal prefix 'csr'

'/don/15-september-2017-chikungunya-italy/en/index.html'
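Note that `lstrip` treats its argument as a set of characters rather than as a prefix, so the result above is correct only because the character after `csr` is `/`. The sketch below contrasts the two behaviours on a made-up path; `remove_prefix` is an illustrative helper, and on Python 3.9 or later the built-in `str.removeprefix` does the same job.

```python
def remove_prefix(text, prefix):
    """Drop `prefix` from the start of `text` only if it is really there."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


s = 'csrcs/don/index.html'  # made-up path chosen to expose the difference

print(s.lstrip('csr'))          # strips every leading 'c', 's' or 'r'
print(remove_prefix(s, 'csr'))  # strips the literal prefix once
```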

In [22]:
s = '           Hello            World            '
s.strip()
s.lstrip()
s.rstrip()

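s = '大家好, 今天很高興來到疾管署'  # "Hello everyone, I'm glad to be at the CDC today"
s.lstrip('大家好')  # strips leading characters from the set {大, 家, 好}
s.replace('大家好', '你們好')  # replaces the greeting '大家好' with '你們好'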

s = '           Hello            World            '
s.split()
' '.join(s.split())

'Hello World'

## Using replace

In [23]:
s = '大家好, 今天很高興來到疾管署'  # "Hello everyone, I'm glad to be at the CDC today"
s.replace('大家好', '你們好')  # replaces the greeting '大家好' with '你們好'

'你們好, 今天很高興來到疾管署'

## Using split + join

In [24]:
s = '           Hello            World            '
s.split()
' '.join(s.split())

'Hello World'

## Using split with unpacking

In [42]:
s  = 'Chikungunya – Italy'
a = s.split(' – ')
a[0]
a[1]

'Italy'

In [43]:
c,d = 1,2

In [45]:
title, country = s.split(' – ')

In [47]:
country

'Italy'

In [52]:
s = 'Yellow fever – France – French Guiana'
s.split(' – ', 1)

['Yellow fever', 'France – French Guiana']
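The second argument to `split` matters whenever the separator can occur more than once. The short sketch below compares the variants on the same string; the en dash is written as `\u2013` so that the example does not depend on the file's encoding.

```python
s = 'Yellow fever \u2013 France \u2013 French Guiana'
sep = ' \u2013 '

print(s.split(sep))      # three parts: unpacking into two names would fail
print(s.split(sep, 1))   # at most one split, counted from the left
print(s.rsplit(sep, 1))  # at most one split, counted from the right

# Splitting from the left keeps the whole 'France – French Guiana' as the country
title, country = s.split(sep, 1)
```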