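A crawler that automatically scrapes data from the Republic of China (Taiwan) Ministry of Economic Affairs, Department of Commerce "Business and Factory Registration Public Data Query Service" (https://findbiz.nat.gov.tw/fts/query/QueryBar/queryInit.do) and exports it to an Excel spreadsheet file.
The following briefly describes the packages used for development and how to run the program.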
- MacBook Pro 15 Early 2011
- 2.0GHz quad-core Intel Core i7
- 8GB 1333MHz DDR3 SDRAM
- 256GB SSD
- macOS 10.13 High Sierra
- Google Chrome version 70.0.3538.67 (Official Build) (64-bit)
- Visual Studio Code version 1.28.2
- iTerm2
- This program is a console application written in Python
A step-by-step series of examples showing how to get a development environment running.
The steps are:
- xcode
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
- git
brew install git
- wget
brew install wget
- python
brew install python
- pip
easy_install pip
- conda (recommended)
cd ~
wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
conda create --name py3mycrawler
conda install -n py3mycrawler pip
conda install -n py3mycrawler openpyxl
conda install -n py3mycrawler Selenium
source activate py3mycrawler
source deactivate # to leave the conda virtual environment
- virtualenv (optional)
- Install
pip install virtualenv
- Run
virtualenv -p python3 venv
- install openpyxl
pip install openpyxl
- install Selenium
pip install selenium
pip freeze > requirements.txt # only for the maintainer of this project
pip install pur
pip install -r requirements.txt
- Selenium WebDriver: ChromeDriver 2.43 (ChromeDriver - WebDriver for Chrome)
- BeautifulSoup
pip3 install beautifulsoup4
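Once Selenium and ChromeDriver are installed, starting Chrome with or without a visible window can be sketched as follows. This is a minimal sketch, not the code in myselenium2.py: the function names `chrome_arguments` and `open_browser` and the default window size are assumptions made for illustration, and `open_browser` assumes a `chromedriver` binary is available on the PATH.

```python
from typing import List


def chrome_arguments(headless: bool, window_size: str = "1280,1024") -> List[str]:
    """Build the Chrome command-line switches for one browser session.

    Hypothetical helper; the window size default is an arbitrary choice.
    """
    args = ["--window-size=" + window_size]
    if headless:
        # --headless runs Chrome without opening a visible window.
        args.append("--headless")
    return args


def open_browser(headless: bool):
    """Start Chrome through ChromeDriver and return the WebDriver object."""
    # Imported here so chrome_arguments() can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling `open_browser(True)` would start a hidden Chrome session; `open_browser(False)` shows the browser window, which is easier to debug.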
After opening a terminal, change to this app's working directory and activate the conda virtual environment.
cd mycrawler
source activate py3mycrawler
The command format for running the program is:
python myselenium2.py [Arg1] [Arg2] [Arg3] [Arg4] [Arg5] [Arg6]
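- Arg1: text; the address-area keyword to search for.
- Arg2: number; the start page, in Arabic numerals. The minimum is 1.
- Arg3: number; the end page, in Arabic numerals. 0 means the last page. Because the target site does not allow viewing more than 500 pages of results, the maximum value of this parameter defaults to 498.
- Arg4: number (5 bits); the data types to query. Enter five bits, for example 10100 (1 means selected; 0 means not selected).
- Bit1 (leftmost): company (公司)
- Bit2: branch office (分公司)
- Bit3: business (商業)
- Bit4: factory (工廠)
- Bit5: limited partnership (有限合夥)
- Arg5: number; whether to leave the browser open when the program ends. 1 means leave it open; 0 means close it.
- Arg6: number; whether to use Chrome headless mode. 1 means hide Chrome; 0 means show Chrome.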
python myselenium2.py 台中市北區 1 0 10000 0 1
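The six arguments above could be validated before the browser is started. The following is a minimal sketch of that idea, not the parsing code actually used by myselenium2.py: the names `decode_data_types`, `resolve_end_page`, `parse_args`, the English keys in `DATA_TYPES`, and the `last_page` parameter (the number of result pages reported by the site) are all assumptions made for illustration.

```python
import argparse
from typing import Dict, List, Optional

# Bit1..Bit5 of Arg4, left to right (hypothetical English keys).
DATA_TYPES = ["company", "branch", "business", "factory", "limited_partnership"]
MAX_END_PAGE = 498  # upper limit stated for Arg3


def decode_data_types(bits: str) -> Dict[str, bool]:
    """Turn a string such as "10100" into {data type: selected?}."""
    if len(bits) != len(DATA_TYPES) or any(b not in "01" for b in bits):
        raise ValueError("Arg4 must be exactly five characters, each 0 or 1")
    return {name: bit == "1" for name, bit in zip(DATA_TYPES, bits)}


def resolve_end_page(end_page: int, last_page: int) -> int:
    """Apply the Arg3 rules: 0 means the last page, and never go past 498."""
    wanted = last_page if end_page == 0 else min(end_page, last_page)
    return min(wanted, MAX_END_PAGE)


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse Arg1..Arg6; exits with a usage message on invalid input."""
    parser = argparse.ArgumentParser(prog="myselenium2.py")
    parser.add_argument("keyword", help="Arg1: address-area keyword")
    parser.add_argument("start_page", type=int, help="Arg2: start page (>= 1)")
    parser.add_argument("end_page", type=int, help="Arg3: end page (0 = last)")
    parser.add_argument("data_types", help="Arg4: five bits, e.g. 10100")
    parser.add_argument("keep_browser", type=int, choices=[0, 1], help="Arg5")
    parser.add_argument("headless", type=int, choices=[0, 1], help="Arg6")
    args = parser.parse_args(argv)
    if args.start_page < 1:
        parser.error("Arg2 must be at least 1")
    if args.end_page < 0:
        parser.error("Arg3 must be 0 or a positive page number")
    try:
        # Store the decoded selection next to the raw bit string.
        args.selected = decode_data_types(args.data_types)
    except ValueError as exc:
        parser.error(str(exc))
    return args
```

With the example command above, `parse_args(["台中市北區", "1", "0", "10000", "0", "1"])` selects only the company data type, and `resolve_end_page(0, last_page)` would be called once the site has reported how many result pages exist.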
- Google Chrome version 70.0.3538.67 (Official Build) (64-bit) or later
- brew
- python 3
- wget
- Selenium WebDriver - ChromeDriver 2.43
- conda
- openpyxl
- Selenium
- BeautifulSoup
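Whether the prerequisites above are present on a machine can be checked with a short script before running the crawler. This is a sketch under assumptions: the function name `check_prerequisites` is hypothetical, `bs4` is the import name of the beautifulsoup4 package, and the command names are assumed to match the executables the tools install on the PATH.

```python
import importlib.util
import shutil
from typing import Dict, Iterable

PYTHON_MODULES = ["openpyxl", "selenium", "bs4"]  # bs4 = beautifulsoup4
COMMANDS = ["brew", "wget", "conda", "chromedriver"]


def check_prerequisites(modules: Iterable[str] = PYTHON_MODULES,
                        commands: Iterable[str] = COMMANDS) -> Dict[str, bool]:
    """Report which Python modules are importable and which commands are on PATH."""
    found = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        found["module:" + name] = importlib.util.find_spec(name) is not None
    for name in commands:
        # shutil.which returns None when no executable of that name is on PATH.
        found["command:" + name] = shutil.which(name) is not None
    return found


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(("ok       " if ok else "MISSING  ") + item)
```

The script only reports presence; it does not check versions, so the Chrome and ChromeDriver version requirements still need to be verified by hand.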
Please download the operation manual.
# pyenv versions # optional # make sure the python version is 3.6.8
# mkvirtualenv py3mycrawler # optional
# workon py3mycrawler # optional
clear
cd ~
mkdir findbiz
cd findbiz
wget https://github.com/stzengpx/myCrawler2018/archive/master.zip
unzip -oq master.zip
cp myCrawler2018-master/mycrawler.sh ~/findbiz/
# source activate py3mycrawler # optional
clear; cd ~/findbiz; bash mycrawler.sh
# source deactivate # optional
# source activate py3mycrawler # optional
clear; cd ~/findbiz; bash mycrawler.sh update
# source deactivate # optional
If an error occurs and the app stops responding, you can force it to stop.
Press [Ctrl + C]
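Pressing Ctrl + C raises `KeyboardInterrupt` inside a running Python program, so a crawler can use it to shut down cleanly instead of leaving a browser process behind. The following is a minimal sketch of that pattern; `run_with_cleanup` is a hypothetical helper and is not taken from this project's code.

```python
from typing import Callable


def run_with_cleanup(task: Callable[[], None], cleanup: Callable[[], None]) -> int:
    """Run task and always call cleanup afterwards.

    Returns 0 on success and 130 (the conventional exit status for a
    program stopped by Ctrl + C) when the task was interrupted.
    """
    try:
        task()
        return 0
    except KeyboardInterrupt:
        # Ctrl + C lands here instead of printing a traceback.
        print("Interrupted, shutting down...")
        return 130
    finally:
        # Runs on success, on Ctrl + C, and on any other exception.
        cleanup()
```

In a Selenium program, `cleanup` would typically be the WebDriver's `quit` method, so that Chrome and ChromeDriver are closed even when the run is aborted.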
- Patrick Tseng - Initial work - stzengpx
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE.md file for details
- A template to make good README.md - https://goo.gl/tp2n6X
- Refactor the coding style with flake8 and yapf
- Use city and city area to get street list from China Post
- Refactor main()
- Refactor myselenium2 with Class
- Add start mail and end mail
- Add myselenium2starter
- Add "外國公司辦事處登記基本資料" (basic registration data for foreign company representative offices)
- officialSiteVersion = "1.3.6"
- officialSiteVersion = "1.3.5"
- Modify README.md
- Add TmpDataType == "外國公司登記基本資料" (basic registration data for foreign companies)
- Branch develop
- Pull request and merge
- Use 'conda' as python virtual environment
- Use the "caffeinate" command to prevent macOS from sleeping while the program is running.
- Modify "TmpCorpType" fields from 7 to 9
- Fix a bug in counting the initial popup browser windows
- Send email login notification with the macOS serial number and application parameters
- Close first Popup Page
- Modify queryCmpyDetail Fields
- officialSiteVersion = "1.3.1" # 20181113
- Modify README.md
- Use python3 directly in mycrawlerrun.sh instead of python
- Add headless option
- Add features: Auto Update
- Release to GitHub
- execute script
- Add myAppVersion
- Change the 資料種類 (data type) field in the Excel data
- Add '程式版本' (program version) and '網頁版本' (site version) to the Excel data