PolySpider - New Android Crawler
Python (v2.7) and Scrapy (v0.20.2) are used in this project, which is mainly designed for synchronizing and categorizing Android apps by grabbing data from Android markets.
- Python v2.7+
- Scrapy v0.20+
- redis-py v2.9+
- Supervisor v3.0+
- Dependencies are listed in Installation
- Access key (ak), secret key (sk), and bucket name of BaiduYun or UpYun for file uploads
- Tested on Windows (both 32-bit and 64-bit) and CentOS 6.4 (64-bit)
- `master`: the latest stable version, which can currently be used in Poly Project. NEVER commit code directly to the `master` branch.
- `develop`: the unstable development version. Team contributors should make code improvements in this branch. When the project in the `develop` branch reaches a stable level and meets the product requirements, the `master` branch will merge the pull request from the `develop` branch and release a stable version.
- `1.0` as a stable release of this project.
- static resources and website page
### Run Single Spider
- Step into the Scrapy project directory and run the `scrapy list` command to see the spiders this project provides.
- Run the `scrapy crawl spidername` command to start a crawler. It crawls the target app market, records the crawled app information in a SQLite database, downloads the APK file, and parses it to get the info_list, which includes the package name, app name, etc. If needed, it uploads the APK file to cloud storage such as BaiduYun or UpYun.
- Since the app info is stored in a SQLite database, you can run the `python check_sql_data.py` command to check what the database contains, or simply use a SQLite browser tool.
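
As a rough illustration of inspecting the crawled data by hand, the sketch below lists every table in a SQLite file together with its row count. It is not the project's `check_sql_data.py`; the function name `summarize_database` and the database path you pass in are assumptions made for this example.

```python
import sqlite3


def summarize_database(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file.

    Hypothetical helper for illustration; db_path must point at the
    database file the spiders actually write to.
    """
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.cursor()
        # sqlite_master is SQLite's built-in catalog of tables and indexes.
        cursor.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
        tables = [row[0] for row in cursor.fetchall()]
        summary = {}
        for table in tables:
            # Identifiers cannot be bound as parameters, so quote them instead.
            quoted = '"%s"' % table.replace('"', '""')
            cursor.execute("SELECT COUNT(*) FROM %s" % quoted)
            summary[table] = cursor.fetchone()[0]
        return summary
    finally:
        conn.close()
```

Calling `summarize_database(path)` on the database file and printing the returned dictionary gives a quick overview before opening the file in a browser tool. Note that `sqlite3.connect` creates an empty file if the path does not exist, so double-check the path first.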
### Run Supervisor

Supervisor is a client/server system that allows its users to control a number of processes on UNIX-like operating systems.
- All Supervisor configuration settings are included in the configuration file passed to `supervisord` (`supervisor.conf`).
- Step into the directory containing `supervisor.conf` and run `supervisord -c supervisor.conf` to start the Supervisor process. All Python processes managed by Supervisor will start automatically. Moreover, Supervisor will monitor the processes and restart them if they are interrupted or quit unexpectedly.
- The admin can watch the Supervisor status in a browser at `localhost:9001` by default, and operations such as `restart` can be performed on the Python processes from there.
- The directory `PolySpider/src/tmp/` contains the log files of Supervisor itself and of the other processes. Feel free to check it out!
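
The process status shown in the browser can also be read from a script, because Supervisor serves an XML-RPC interface under `/RPC2` on the same HTTP server. The following is a minimal sketch, assuming the web interface is reachable at the default `localhost:9001` without a username or password; the helper names `connect` and `list_process_states` are made up for this example.

```python
try:
    from xmlrpc.client import ServerProxy  # Python 3
except ImportError:
    from xmlrpclib import ServerProxy  # Python 2.7


def connect(url="http://localhost:9001/RPC2"):
    """Return the `supervisor` namespace of Supervisor's XML-RPC API.

    Assumption: the HTTP server is enabled on the default address
    and does not require authentication.
    """
    return ServerProxy(url).supervisor


def list_process_states(supervisor_rpc):
    """Return (name, state) pairs for every process Supervisor manages."""
    return [
        (info["name"], info["statename"])
        for info in supervisor_rpc.getAllProcessInfo()
    ]
```

With Supervisor running, `list_process_states(connect())` returns pairs such as a process name and `RUNNING` or `STOPPED`, which is handy for quick health checks from cron or a monitoring script.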