# How to Become a Data Analyst: 104 Scraper 2023

> https://hahow.in/cr/dajourney

郭耀仁 <yaojenkuo@datainpoint.com>

```python
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
```

## Define a function to download the first five pages of search results

```python
def download_first_five_pages():
    # Make sure the output directory exists before writing into it.
    os.makedirs("104", exist_ok=True)
    for page_number in range(1, 6):
        search_url = f"https://www.104.com.tw/jobs/search/?ro=1&kwop=7&keyword=%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90%E5%B8%AB&expansionType=area%2Cspec%2Ccom%2Cjob%2Cwf%2Cwktm&order=12&asc=0&page={page_number}&mode=s&jobsource=2018indexpoc&langFlag=0&langStatus=0&recommendJob=1&hotJob=1"
        r = requests.get(search_url)
        # Stop on an HTTP error instead of silently saving an error page.
        r.raise_for_status()
        with open(f"104/search_result_page_{page_number}.html", "w", encoding="utf-8") as file:
            file.write(r.text)
```
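
The long query string in `download_first_five_pages()` is easier to read and edit when it is assembled from a dictionary. The sketch below does that with the standard library's `urllib.parse.urlencode`. The helper name `build_search_url` is hypothetical, and the parameter names and values are simply copied from the URL above, not taken from any documented 104 API.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.104.com.tw/jobs/search/"


def build_search_url(page_number: int) -> str:
    """Rebuild the search URL used above for one result page (hypothetical helper)."""
    params = {
        "ro": 1,
        "kwop": 7,
        "keyword": "資料分析師",  # percent-encoded as UTF-8 by urlencode
        "expansionType": "area,spec,com,job,wf,wktm",
        "order": 12,
        "asc": 0,
        "page": page_number,
        "mode": "s",
        "jobsource": "2018indexpoc",
        "langFlag": 0,
        "langStatus": 0,
        "recommendJob": 1,
        "hotJob": 1,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(build_search_url(1))
```

`requests.get` also accepts the same dictionary through its `params` argument, which avoids building the string by hand.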

## Define a function to extract job links from the first five pages of search results

```python
def get_job_url_list() -> list:
    job_title_css_selector = "h2 > a"
    job_url_list = []
    for page_number in range(1, 6):
        with open(f"104/search_result_page_{page_number}.html", encoding="utf-8") as file:
            soup = BeautifulSoup(file, "html.parser")
        job_url_hrefs = [elem.get("href") for elem in soup.select(job_title_css_selector)]
        # Skip anchors without an href and links containing "hotjob_chr".
        job_urls = [f"https:{job_url_href}" for job_url_href in job_url_hrefs if job_url_href and "hotjob_chr" not in job_url_href]
        job_url_list += job_urls
    return job_url_list
```
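
`get_job_url_list()` prefixes every `href` with `https:`, which only works when the link is protocol-relative (starts with `//`). A minimal sketch of a more tolerant normalization step, using `urllib.parse.urljoin`, is shown below. The helper name `normalize_job_hrefs`, the base URL constant, and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.104.com.tw/"  # assumed base for resolving links


def normalize_job_hrefs(hrefs):
    """Turn raw href values into absolute URLs, skipping unwanted ones."""
    job_urls = []
    for href in hrefs:
        # Drop missing hrefs and links containing "hotjob_chr", as in the function above.
        if not href or "hotjob_chr" in href:
            continue
        # urljoin resolves protocol-relative links and leaves absolute ones unchanged.
        job_urls.append(urljoin(BASE_URL, href))
    return job_urls


sample_hrefs = [
    "//www.104.com.tw/job/abc12?jobsource=example",
    "https://www.104.com.tw/job/def34",
    "//www.104.com.tw/job/ghi56?jobsource=hotjob_chr",
    None,
]
print(normalize_job_hrefs(sample_hrefs))
```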

## Define a function to download the web page of each job link

```python
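def download_job_descriptions(job_url_list: list):
    # Make sure the output directory exists before writing into it.
    os.makedirs("104/job_descriptions", exist_ok=True)
    for job_url in job_url_list:
        r = requests.get(job_url)
        # Use the last path segment, without the query string, as the file name.
        job_url_split = job_url.split("?")[0]
        page_name = job_url_split.split("/")[-1]
        with open(f"104/job_descriptions/{page_name}.html", "w", encoding="utf-8") as file:
            file.write(r.text)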
```
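
The file name for each posting comes from the last path segment of its URL, with the query string removed. The same derivation can be sketched with `urllib.parse.urlsplit`, which separates the path from the query explicitly. The helper name `page_name_from_url` and the sample URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def page_name_from_url(job_url: str) -> str:
    """Return the last path segment of a job URL, ignoring the query string."""
    path = urlsplit(job_url).path
    # rstrip("/") keeps the result non-empty when the URL ends with a slash.
    return posixpath.basename(path.rstrip("/"))


print(page_name_from_url("https://www.104.com.tw/job/abc12?jobsource=example"))
```

Unlike the `split("/")[-1]` approach, this version still returns the job segment when the URL ends with a trailing slash.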

## Run the functions in order

```python
download_first_five_pages()
job_url_list = get_job_url_list()
download_job_descriptions(job_url_list)
```

## Extract the job description from each posting

```python
job_description_css_selector = "#app > div > div.container.jb-container.pt-4.position-relative > div > div.col.main > div.dialog.container-fluid.bg-white.rounded.job-description.mb-4.pt-6.pb-6 > div.job-description-table.row > div.job-description.col-12 > p"
list_dir = os.listdir("104/job_descriptions/")
job_titles, employers, job_descriptions = [], [], []
for html_file in list_dir:
    with open(f"104/job_descriptions/{html_file}", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")
    job_title = soup.select("div.job-header__title > h1")[0].text
    # Collapse whitespace and drop the last whitespace-separated token of the <h1> text.
    job_title = " ".join(job_title.split()[:-1])
    employer = soup.select("div.job-header__title> div > a:nth-child(1)")[0].text.strip()
    # Join the text of every paragraph in the description block into one string.
    job_description = [elem.text for elem in soup.select(job_description_css_selector)]
    job_description = " ".join(job_description)
    job_titles.append(job_title)
    employers.append(employer)
    job_descriptions.append(job_description)
df = pd.DataFrame()
df["employer"] = employers
df["job_title"] = job_titles
df["job_description"] = job_descriptions
```
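
The expression `" ".join(job_title.split()[:-1])` collapses runs of whitespace and drops the last whitespace-separated token of the `<h1>` text. The sketch below isolates that step so its behavior is easy to check. The helper name `clean_job_title` and the sample strings are hypothetical, and the trailing `badge` token only stands in for whatever text follows the title on the real page.

```python
def clean_job_title(raw_title: str) -> str:
    """Collapse whitespace and drop the final whitespace-separated token."""
    tokens = raw_title.split()
    return " ".join(tokens[:-1])


print(clean_job_title("  Data Analyst   [RD]\n   badge  "))  # Data Analyst [RD]
print(repr(clean_job_title("Analyst")))  # ''
```

A title made of a single token comes back empty, so this step relies on the page always placing some extra text after the title.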

```python
df
```

| | employer | job_title | job_description |
|---|---|---|---|
| 0 | 酷遊天股份有限公司 | 資料分析師 Data Analyst [RD] | 請透過以下連結投遞職缺，我們會以這邊收到的履歷為主：https://kkday.bamboo... |
| 1 | 統一資訊股份有限公司 | 商業策略分析師Business Strategy Analyst | 商業策略分析師Business Strategy Analyst\n利用數據洞察思維與對零售... |
| 2 | 台灣多耐福股份有限公司 | Data Analyst | Role:\nWe are seeking a skilled and self-motiv... |
| 3 | 圖靈數位股份有限公司 | Google Analytics 數據分析師（ GA Analyst ） | 工作內容：\n・通過使用 Google Analytics 4 等分析工具，提供網站分析服務... |
| 4 | 數聯資安股份有限公司 | 資安工程師 (SOC L3大數據資料分析師) | 主要工作內容為運用大數據Splunk/ELK平台，協助SOC監控中心日常之維運，並進行資料分... |
| ... | ... | ... | ... |
| 95 | 104人力銀行_一零四資訊科技股份有限公司 | 資深資料分析師 (視訊/線上面談) | 1.此職務會收到全公司每個單位不同的需求，因此需要對數字敏感，並透過數據分析和機器學習等方法... |
| 96 | 聯經數位股份有限公司 | 助理數據分析師 | 1. 擔任精準廣告溝通橋樑，負責監控、優化、管理專案進度。\n2. 定期分析廣告/EDM成效... |
| 97 | 癮房市整合行銷有限公司 | 廣告投手 / 數據分析師 | - 超過2年廣告投放經驗，擅長Google、Facebook、Line等各式數位媒體操作\n... |
| 98 | 微星科技股份有限公司 | 資料分析師 | 1. 與各單位溝通了解需求目的\n2. 資料整理與模型建構\n3. 協助將數據資料做成分析介... |
