Skip to content

water-su/ptt-web-crawler

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ptt-web-crawler (PTT 網路版爬蟲) Build Status

特色

  • 支援單篇及多篇文章抓取
  • 過濾資料內空白、空行及特殊字元
  • JSON 格式輸出
  • 支援 Python 2.7 - 3.4
  • Support upload to firebase

輸出 JSON 格式

{
    "article_id": 文章 ID,
    "article_title": 文章標題 ,
    "author": 作者,
    "board": 板名,
    "content": 文章內容,
    "date": 發文時間,
    "ip": 發文位址,
    "message_conut": { # 推文
        "all": 總數,
        "boo": 噓文數,
        "count": 推文數-噓文數,
        "neutral": → 數,
        "push": 推文數
    },
    "messages": [ # 推文內容
        {
            "push_content": 推文內容,
            "push_ipdatetime": 推文時間及位址,
            "push_tag": 推/噓/→ ,
            "push_userid": 推文者 ID
        },
        ...
    ]
}

執行方式

python crawler.py -b 看板名稱 -i 起始索引 結束索引 (結束索引設為 -1 則自動計算最後一頁)  
python crawler.py -b 看板名稱 -a 文章ID
python crawler.py -b 看板名稱 -i -1 需求頁數  #2016.11.18 新增爬取最新n頁的參數設定

範例

python crawler.py -b PublicServan -i 100 200

會爬取 PublicServan 板第 100 頁 (https://www.ptt.cc/bbs/PublicServan/index100.html) 到第 200 頁 (https://www.ptt.cc/bbs/PublicServan/index200.html) 的內容,輸出至 PublicServan-100-200.json

python crawler.py -b PublicServan -a M.1413618360.A.4F0

會爬取 PublicServan 板文章 ID 為 M.1413618360.A.4F0 (https://www.ptt.cc/bbs/PublicServan/M.1413618360.A.4F0.html) 的內容,輸出至 PublicServan-M.1413618360.A.4F0.json

python crawler.py -b PublicServan -i -1 5

爬取 PublicServan 版文章最後5頁 輸出檔名仍為起始與終止頁碼

測試

python test.py

ptt-web-crawler is a crawler for the web version of PTT, the largest online community in Taiwan.

usage: python crawler.py [-h] -b BOARD_NAME (-i START_INDEX END_INDEX | -a ARTICLE_ID) [-v]
optional arguments:
  -h, --help                  show this help message and exit
  -b BOARD_NAME               Board name
  -i START_INDEX END_INDEX    Start and end index
  -p SERVER_URL               (optional) firebase server URL(if not exist -> dump to file)
  -f TITLE_FILTER             (optional) filter string for article title
  -a ARTICLE_ID               Article ID
  -v, --version               show program's version number and exit

Output would be BOARD_NAME-START_INDEX-END_INDEX.json (or BOARD_NAME-ID.json)

About

PTT 網路版爬蟲

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 74.4%
  • HTML 13.3%
  • JavaScript 12.3%