copen-crawler

A crawler that collects articles containing Chinese idioms.

TODO

  • scraper: social media (Facebook, mobile01, ...)
  • multisplit of sentences
  • connector: PTT Gossiping
  • scraper: PTTscraper (for news in Gossiping board)
  • coder: a summarizing method
  • coder: a general formatter method for the VRT format

VRT format spec:

general meta

  • id
  • source
  • article_type
  • date
  • author
  • gender
  • age

sub type meta: news in ptt

  • ptt_url
  • ptt_board
  • ptt_title
  • news_url
  • news_title
  • media
  • note
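
The two field lists above could be turned into the attributes of an opening `<text>` tag, as is common in VRT files. The sketch below is only an illustration: the helper name `text_open_tag`, the choice to emit missing fields as empty strings, and the `<text ...>` layout are assumptions, not something this README specifies.

```python
from xml.sax.saxutils import quoteattr

# Field names copied from the spec above.
GENERAL_META = ["id", "source", "article_type", "date", "author", "gender", "age"]
PTT_NEWS_META = ["ptt_url", "ptt_board", "ptt_title",
                 "news_url", "news_title", "media", "note"]


def text_open_tag(meta, fields=GENERAL_META + PTT_NEWS_META):
    """Build an opening <text ...> tag from a metadata dict (hypothetical helper).

    Assumption: fields missing from `meta` are written as empty strings,
    so every article carries the same attribute set.
    """
    attrs = " ".join(
        # quoteattr adds the surrounding quotes and escapes &, <, > and quotes.
        f"{name}={quoteattr(str(meta.get(name, '')))}"
        for name in fields
    )
    return f"<text {attrs}>"


if __name__ == "__main__":
    print(text_open_tag({"id": "1", "source": "ptt", "ptt_board": "Gossiping"}))
```

Keeping the field order fixed makes the output easy to diff between runs; whether empty attributes should be dropped instead is a design choice left open here.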

Separator (in regex syntax)

  • paragraph
    • \n\n+
  • sentence
    • !」?
    • \?」?
    • 。」?
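
A minimal sketch of how these separators might be applied with Python's `re` module. The function names are hypothetical, the three sentence patterns are folded into one character class, and the fullwidth `!` and `?` are included as an assumption, since Chinese text often uses them instead of the ASCII forms.

```python
import re

# Paragraphs are separated by two or more newlines.
PARAGRAPH_SEP = re.compile(r"\n\n+")

# A sentence ends at !, ? or 。, optionally followed by a closing 」.
# Assumption: fullwidth ! and ? are accepted alongside the ASCII forms.
SENTENCE_END = re.compile(r"[!?!?。]」?")


def split_paragraphs(text):
    """Split raw text into non-empty paragraphs."""
    return [p for p in PARAGRAPH_SEP.split(text) if p.strip()]


def split_sentences(paragraph):
    """Split a paragraph into sentences, keeping the end punctuation."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(paragraph):
        sentences.append(paragraph[start:match.end()].strip())
        start = match.end()
    # Keep any trailing text that has no closing punctuation.
    tail = paragraph[start:].strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]


if __name__ == "__main__":
    sample = "他說:「真的嗎?」我不知道。\n\n好!"
    for para in split_paragraphs(sample):
        print(split_sentences(para))
```

Using `finditer` rather than `re.split` keeps the punctuation (and any closing 」) attached to the sentence it ends, which is usually what a corpus format wants.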

Dependency

Weihang Lo