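# Case Study 2: Web Scraping (Notes)

Date: 2022-09-20

The Wikipedia [list of constellations](https://zh.wikipedia.org/zh-sg/%E6%98%9F%E5%BA%A7%E5%88%97%E8%A1%A8) now has one more column than before, but this has little impact.

The Hong Kong Space Museum's Chinese/English bright star name table has changed considerably: https://www.lcsd.gov.hk/CE/Museum/Space/archive/Research/StarName/c_research_chinengstars.htm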


## Scraping the Wikipedia constellation list

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Set up the proxy
proxies = {"http": "***", "https": "***"}
response = requests.get('https://zh.wikipedia.org/zh-sg/%E6%98%9F%E5%BA%A7%E5%88%97%E8%A1%A8', proxies=proxies)
soup = BeautifulSoup(response.content, 'html.parser') 

table = soup.find_all('div', class_='mw-parser-output')[1].find_all('table')[0]
# table = soup.find(string="包括星座的面积").parent.find_next_sibling('table')
# soup.find(string="包括星座的面积、位置").parent.find_next_sibling('table')

column_names = [th.text for th in table.tbody.tr.find_all('th')]

data_rows = []
for tr in table.tbody.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    data_rows.append(row)
    
data_rows = data_rows[1:]  # index 0 is an empty list (the header row has <th> cells, not <td>)
df = pd.DataFrame(data_rows, columns=column_names)

In [2]:
df

Unnamed: 0,中文名,简写,拉丁名,面积（平方度）,赤经（时、分）,赤纬（度、分）,象限,族,星座最亮星,建议的符号[3]\n
0,仙女座,AND,Andromeda,722.278,0 48.46,37 25.91,NQ1,英仙,仙女座α(壁宿二)\n,\n
1,唧筒座,ANT,Antlia,238.901,10 16.43,-32 29.01,SQ2,拉卡伊,唧筒座α(近天纪增二)\n,\n
2,天燕座,APS,Apus,206.327,16 8.65,-75 18,SQ3,拜耳,天燕座α(异雀八)\n,\n
3,宝瓶座,AQR,Aquarius,979.854,22 17.38,-10 47.35,SQ4,黄道,宝瓶座β(虚宿一)\n,\n
4,天鹰座,AQL,Aquila,652.473,19 40.02,3 24.65,NQ4,武仙,天鹰座α(河鼓二)\n,\n
...,...,...,...,...,...,...,...,...,...,...
83,小熊座,UMI,Ursa Minor,255.864,15 0,77 41.99,NQ3,大熊,小熊座α(勾陈一)\n,\n
84,船帆座,VEL,Vela,499.649,9 34.64,-47 10.03,SQ2,幻之水,船帆座γ(天社一)\n,\n
85,室女座,VIR,Virgo,1294.428,13 24.39,-4 9.51,SQ3,黄道,室女座α(角宿一)\n,\n
86,飞鱼座,VOL,Volans,141.354,7 47.73,-69 48.07,SQ2,拜耳,飞鱼座β(飞鱼三)\n,\n
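
The output above still carries a trailing `\n` in several cells and a footnote marker (`[3]`) in the last header. A minimal sketch of a cell cleaner, assuming footnote markers always take the bracketed form seen here; `clean_cell` is a helper name introduced for illustration and is not part of the original notebook:

```python
import re

# Matches Wikipedia-style footnote markers such as "[3]".
FOOTNOTE = re.compile(r'\[[^\]]*\]')


def clean_cell(text):
    """Strip footnote markers and surrounding whitespace from one table cell."""
    return FOOTNOTE.sub('', text).strip()


# Sample values copied from the output above.
header = clean_cell('建议的符号[3]\n')
star = clean_cell('仙女座α(壁宿二)\n')
print(header, star)
```

In the scraping loop this could be applied as `clean_cell(td.text)`. BeautifulSoup's `get_text(strip=True)` handles the whitespace part on its own, but it leaves the footnote marker in place.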


## Data cleaning and modification

In [3]:
df.columns = ['name_cn', 'abbr', 'name', 'area', 'ra', 'dec', 'quadrant', 'family', 'bs', 'symbol']

df.set_index('name', inplace=True)
df = df.join(df.bs.str.extract(r'(?P<bs_name_bayer_cn>.*)\((?P<bs_name_cn>.*)\)'))
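
The `str.extract` call above splits the brightest-star cell into the Bayer designation and the Chinese star name. The sketch below replays the same pattern with the standard `re` module so its behaviour can be checked on single strings. `PATTERN_WIDE` is an assumption added for illustration: it also accepts full-width parentheses, which the rows shown here do not use.

```python
import re

# Same pattern as in the str.extract call.
PATTERN = re.compile(r'(?P<bs_name_bayer_cn>.*)\((?P<bs_name_cn>.*)\)')
# Hypothetical variant that also tolerates full-width parentheses.
PATTERN_WIDE = re.compile(r'(?P<bs_name_bayer_cn>.*)[(（](?P<bs_name_cn>.*)[)）]')


def split_star(cell, pattern=PATTERN):
    """Return (Bayer name, Chinese star name), or (None, None) if no match."""
    m = pattern.search(cell)
    if m is None:
        return (None, None)
    return (m.group('bs_name_bayer_cn'), m.group('bs_name_cn'))


print(split_star('仙女座α(壁宿二)\n'))
print(split_star('仙女座α（壁宿二）'))                # no match with ASCII-only pattern
print(split_star('仙女座α（壁宿二）', PATTERN_WIDE))
```

A cell without parentheses produces no match, which in pandas shows up as `NaN` in both new columns.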

In [4]:
df

name,name_cn,abbr,area,ra,dec,quadrant,family,bs,symbol,bs_name_bayer_cn,bs_name_cn
Andromeda,仙女座,AND,722.278,0 48.46,37 25.91,NQ1,英仙,仙女座α(壁宿二)\n,\n,仙女座α,壁宿二
Antlia,唧筒座,ANT,238.901,10 16.43,-32 29.01,SQ2,拉卡伊,唧筒座α(近天纪增二)\n,\n,唧筒座α,近天纪增二
Apus,天燕座,APS,206.327,16 8.65,-75 18,SQ3,拜耳,天燕座α(异雀八)\n,\n,天燕座α,异雀八
Aquarius,宝瓶座,AQR,979.854,22 17.38,-10 47.35,SQ4,黄道,宝瓶座β(虚宿一)\n,\n,宝瓶座β,虚宿一
Aquila,天鹰座,AQL,652.473,19 40.02,3 24.65,NQ4,武仙,天鹰座α(河鼓二)\n,\n,天鹰座α,河鼓二
...,...,...,...,...,...,...,...,...,...,...,...
Ursa Minor,小熊座,UMI,255.864,15 0,77 41.99,NQ3,大熊,小熊座α(勾陈一)\n,\n,小熊座α,勾陈一
Vela,船帆座,VEL,499.649,9 34.64,-47 10.03,SQ2,幻之水,船帆座γ(天社一)\n,\n,船帆座γ,天社一
Virgo,室女座,VIR,1294.428,13 24.39,-4 9.51,SQ3,黄道,室女座α(角宿一)\n,\n,室女座α,角宿一
Volans,飞鱼座,VOL,141.354,7 47.73,-69 48.07,SQ2,拜耳,飞鱼座β(飞鱼三)\n,\n,飞鱼座β,飞鱼三
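
The `ra` and `dec` columns are still strings. According to the original headers they hold "hours minutes" and "degrees minutes" respectively. A possible next cleaning step, sketched here under the assumption that every cell has exactly two space-separated numbers, converts them to decimal degrees. The function names are illustrative and not part of the original notebook.

```python
def _parse_pair(text):
    """Split 'A B' into (sign, whole, minutes); accepts '-' or U+2212 as minus."""
    text = text.strip().replace('\u2212', '-')
    sign = -1.0 if text.startswith('-') else 1.0
    whole, minutes = text.lstrip('+-').split()
    return sign, float(whole), float(minutes)


def ra_to_degrees(text):
    """Convert 'hours minutes' to decimal degrees (1 hour of RA = 15 degrees)."""
    sign, hours, minutes = _parse_pair(text)
    return sign * (hours + minutes / 60.0) * 15.0


def dec_to_degrees(text):
    """Convert 'degrees minutes' to decimal degrees, keeping the sign."""
    sign, degrees, minutes = _parse_pair(text)
    return sign * (degrees + minutes / 60.0)


# Values copied from the Andromeda and Antlia rows above.
print(ra_to_degrees('0 48.46'), dec_to_degrees('-32 29.01'))
```

The sign is read from the string rather than from the parsed number so that a value such as `-0 30` would stay negative. With pandas the functions could be applied as `df['ra'].map(ra_to_degrees)`.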


## Homework: replacing the abbreviations

The English Wikipedia page listing the constellations:
https://en.wikipedia.org/wiki/IAU_designated_constellations

In [5]:
res_en = requests.get('https://en.wikipedia.org/wiki/IAU_designated_constellations', proxies=proxies)
res_en.raise_for_status()  # stop on HTTP errors; otherwise soup_en below would be undefined
soup_en = BeautifulSoup(res_en.content, 'html.parser')
    
table_en = soup_en.find('div', class_='mw-parser-output').find('table')

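column_names_en = [th.text for th in table_en.tbody.tr.find_all('th')]
# The data rows carry two abbreviation columns (IAU and NASA) where the first
# header row has one, so rename that header and insert a second name.
column_names_en[1] = 'abbr-iau'
column_names_en.insert(2, 'abbr-nasa')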
    
data_rows_en = []
for tr in table_en.tbody.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    data_rows_en.append(row)
    
data_rows_en = data_rows_en[2:]  # drop the two leading header rows
df_en = pd.DataFrame(data_rows_en, columns=column_names_en)
# Keep only the constellation name that precedes the /pronunciation/ part.
df_en['name'] = df_en['Constellation\n'].str.extract(r'(?P<name>.*)\s?/.*/')['name'].str.strip()
df_en.set_index('name', inplace=True)

df['abbr'] = df_en['abbr-iau']

In [6]:
df

name,name_cn,abbr,area,ra,dec,quadrant,family,bs,symbol,bs_name_bayer_cn,bs_name_cn
Andromeda,仙女座,And,722.278,0 48.46,37 25.91,NQ1,英仙,仙女座α(壁宿二)\n,\n,仙女座α,壁宿二
Antlia,唧筒座,Ant,238.901,10 16.43,-32 29.01,SQ2,拉卡伊,唧筒座α(近天纪增二)\n,\n,唧筒座α,近天纪增二
Apus,天燕座,Aps,206.327,16 8.65,-75 18,SQ3,拜耳,天燕座α(异雀八)\n,\n,天燕座α,异雀八
Aquarius,宝瓶座,Aqr,979.854,22 17.38,-10 47.35,SQ4,黄道,宝瓶座β(虚宿一)\n,\n,宝瓶座β,虚宿一
Aquila,天鹰座,Aql,652.473,19 40.02,3 24.65,NQ4,武仙,天鹰座α(河鼓二)\n,\n,天鹰座α,河鼓二
...,...,...,...,...,...,...,...,...,...,...,...
Ursa Minor,小熊座,UMi,255.864,15 0,77 41.99,NQ3,大熊,小熊座α(勾陈一)\n,\n,小熊座α,勾陈一
Vela,船帆座,Vel,499.649,9 34.64,-47 10.03,SQ2,幻之水,船帆座γ(天社一)\n,\n,船帆座γ,天社一
Virgo,室女座,Vir,1294.428,13 24.39,-4 9.51,SQ3,黄道,室女座α(角宿一)\n,\n,室女座α,角宿一
Volans,飞鱼座,Vol,141.354,7 47.73,-69 48.07,SQ2,拜耳,飞鱼座β(飞鱼三)\n,\n,飞鱼座β,飞鱼三
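
Because `df['abbr'] = df_en['abbr-iau']` aligns on the `name` index, a constellation whose name was parsed differently on the two pages would silently receive `NaN`. The sketch below is a plain-Python sanity check, not the pandas mechanism itself. It compares old and new abbreviations ignoring letter case and reports anything missing or different. The `pairs` dictionary and the `mismatches` helper are illustrative; the sample values are copied from the two outputs above.

```python
# (old abbreviation, new abbreviation) per constellation name.
pairs = {
    'Andromeda': ('AND', 'And'),
    'Aquila': ('AQL', 'Aql'),
    'Ursa Minor': ('UMI', 'UMi'),
}


def mismatches(pairs):
    """Names whose new abbreviation is missing or differs beyond letter case."""
    return [
        name
        for name, (old, new) in pairs.items()
        if not isinstance(new, str) or old.strip().upper() != new.strip().upper()
    ]


print(mismatches(pairs))
```

To run this against the real data, the old column would have to be copied before the assignment, for example `old = df['abbr'].copy()`, and then paired with the new values by name.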


## Scraping the bright-star data

In [7]:
import requests
from bs4 import BeautifulSoup

URL_BASE = 'https://www.lcsd.gov.hk/CE/Museum/Space/archive/Research/StarName/'
url_home = URL_BASE + 'c_research_chinengstars.htm'
response = requests.get(url_home, proxies=proxies)
soup = BeautifulSoup(response.content, 'html.parser')

In [8]:
# Collect all relevant links into the pages list
pages = []
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    for tr in soup.table.table.center.table.find_all('tr'):
        for a in tr.find_all('a'):
            pages.append(URL_BASE + a['href'])
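
Concatenating `URL_BASE + a['href']` works as long as every `href` is a bare file name. A slightly more defensive sketch uses `urllib.parse.urljoin`, which also copes with root-relative or absolute links, and removes duplicate links while keeping their order. The helper name `absolute_links` and the sample hrefs are hypothetical and not taken from the museum page.

```python
from urllib.parse import urljoin

URL_BASE = 'https://www.lcsd.gov.hk/CE/Museum/Space/archive/Research/StarName/'


def absolute_links(hrefs, base=URL_BASE):
    """Resolve hrefs against base and drop duplicates, keeping first-seen order."""
    resolved = (urljoin(base, href) for href in hrefs)
    return list(dict.fromkeys(resolved))


# Hypothetical hrefs used only for illustration.
links = absolute_links(['page_a.htm', 'page_a.htm', '/CE/Museum/Space/index.htm'])
print(links)
```

In the cell above, the list of hrefs would be `[a['href'] for a in tr.find_all('a')]` collected over all rows.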

In [9]:
# For each page in pages, extract the table; the tables are concatenated afterwards

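def get_page(url):
    """Fetch one star-name page and return its table as a DataFrame."""
    response = requests.get(url, proxies=proxies)
    # Stop on HTTP errors; otherwise the global `soup` from an earlier cell
    # would be parsed instead of this page.
    response.raise_for_status()

    # The pages are Big5-encoded. Setting response.encoding only affects
    # response.text, so the encoding is passed to BeautifulSoup directly.
    soup = BeautifulSoup(response.content, 'html.parser', from_encoding='big5')

    # Fixed English column names replace the page's own header cells.
    column_name = ['name_en', 'code', 'abbr', 'name_cn', 'brightness', 'misc']

    rows = []
    for tr in soup.table.table.find_all('table')[2].find_all('tr'):
        row = [td.text for td in tr.find_all('td')]
        rows.append(row)

    df = pd.DataFrame(rows, columns=column_name)
    # regex=True is required in pandas >= 2.0, where str.replace is literal by default.
    df['name_en'] = df['name_en'].str.rstrip(' *').str.replace(r'\s+', ' ', regex=True)
    df['name_cn'] = df['name_cn'].str.replace(r'\s+', '', regex=True)

    return df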
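
The two `str.replace` calls in `get_page` are meant as regular-expression replacements. In pandas 2.0 and later, `Series.str.replace` treats the pattern literally unless `regex=True` is passed, so it helps to check the intended behaviour on single strings. A minimal sketch with the standard `re` module follows; the function names and sample inputs are made up for illustration.

```python
import re


def normalize_name_en(text):
    """Drop trailing spaces/asterisks, then collapse whitespace runs to one space."""
    return re.sub(r'\s+', ' ', text.rstrip(' *'))


def normalize_name_cn(text):
    """Remove all whitespace, including line breaks inside the cell."""
    return re.sub(r'\s+', '', text)


# Hypothetical raw cell contents.
print(normalize_name_en('Zuben\n  Elgenubi *'))
print(normalize_name_cn('西上相,\n 太微右垣五'))
```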

In [10]:
# Scrape every page and concatenate the resulting tables

bright_stars = pd.concat([get_page(page) for page in pages])

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


In [11]:
bright_stars

Unnamed: 0,name_en,code,abbr,name_cn,brightness,misc
0,Acamar,q,Eri,天園六,2.91,
1,Achernar,a,Eri,水委一,0.46,
2,Achird,h,Cas,王良三,3.6,d
3,Acrab,b,Sco,房宿四,2.55,d
4,Acrux,a,Cru,十字架二,0.79,d
...,...,...,...,...,...,...
106,Zubenesch,b,Lib,氐宿四,2.61,
107,Zubeneschamali,b,Lib,氐宿四,2.61,
108,Zubenhakrabi,g,Lib,氐宿三,4,
109,Zubra,d,Leo,"西上相,太微右垣五",2.56,


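Known issues:

The Hong Kong Space Museum pages are encoded in `big5`, and I could not find a Simplified Chinese version, so some Chinese names fail to decode and come out garbled. A Traditional-to-Simplified conversion step should also have been added, but I got lazy. I'll improve this later when I have time.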
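
One way to investigate the garbled names is to decode the raw bytes explicitly and count the replacement characters. The sketch below tries a few Big5-family codecs from the Python standard library and keeps the first one that decodes cleanly. Which codec the museum pages actually need has not been verified here, and `decode_best` is a helper introduced for illustration.

```python
REPLACEMENT = '\ufffd'


def decode_best(raw, encodings=('big5', 'big5hkscs', 'cp950')):
    """Return (text, encoding): first strict success, else fewest U+FFFD."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    candidates = [(raw.decode(enc, errors='replace'), enc) for enc in encodings]
    return min(candidates, key=lambda pair: pair[0].count(REPLACEMENT))


good = '天園六'.encode('big5')
text, enc = decode_best(good)
print(text, enc)

# A truncated byte string ends in a lone lead byte and cannot decode cleanly.
broken, _ = decode_best(good[:-1])
print(broken.count(REPLACEMENT))
```

The decoded text could then be passed to `BeautifulSoup(text, 'html.parser')` in place of `response.content`. Converting Traditional to Simplified characters is a separate step that needs a third-party library such as OpenCC, which is not shown here.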