Google Play Store Python scraper
This Python application runs a service that gets a package ID and retrieves the app's email, title, and icon as JSON,
writing each result to the output.txt file.
You can send a list of app IDs by running app_sender.
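Assuming the fields listed above, a line in output.txt might look like the following (the field names and values here are illustrative, not the service's exact output):

```json
{"id": "com.example.app", "title": "Example App", "email": "dev@example.com", "icon": "https://example.com/icon.png"}
```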
- Make sure Python 2.7 is installed and install the requirements:
pip install -r requirements.txt
- Make sure Go is installed and build app_sender.
In the app_sender folder, run:
go build
- Start the service (runs on localhost:5000)
python app.py
- Send app IDs (sends to localhost:5000)
Windows:
app_sender/app_sender.exe
Linux:
./app_sender/app_sender
- Tail output.txt for app details
Two main threads are running:
- Web application
- Scraper Launcher
A thread that runs a simple Flask application, listening on localhost:5000.
It supports one method - GET /ScrapApp?id=ABC.
For each app ID, it inserts the ID into a simple thread-safe IDs queue (Python Queue).
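The hand-off between the web thread and the launcher thread can be sketched with the standard library alone (sketched in Python 3, where the module is `queue`; the project targets 2.7, where it is `Queue`; the names below are illustrative, not the project's actual identifiers):

```python
from queue import Queue

# Thread-safe queue shared by the Flask request handler and the launcher
# thread; Queue handles the locking, so no extra synchronization is needed.
ids_queue = Queue()

def handle_scrap_app(app_id):
    # What GET /ScrapApp?id=... boils down to: enqueue the id and return
    # immediately, leaving the scraping to the launcher thread.
    ids_queue.put(app_id)
    return "queued"

# The launcher side simply blocks on ids_queue.get() until an id arrives.
```

Because `put` returns as soon as the ID is enqueued, the HTTP response never waits on a scrape.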
A thread that loops on the IDs queue, waiting for new IDs.
For each ID it gets from the queue, it:
- Checks its cache to see whether this ID was already scraped; if so it outputs the cached data, otherwise it continues:
- Asks the proxy manager for an available proxy. If no proxy is available, it waits 2 seconds and asks again.
- Spawns a new process with a GoogleScrapper, the given app ID, and the proxy.
- When a process returns successfully, it saves the result to the cache.
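The steps above can be sketched as a single loop iteration (Python 3; `get_available_proxy` and `scrape` are illustrative stand-ins for the project's proxy manager and GoogleScrapper, not its real API):

```python
import time
from concurrent.futures import ProcessPoolExecutor
from queue import Queue

def get_available_proxy():
    # Stand-in for the proxy manager; None would mean "no proxy free".
    return "127.0.0.1:8080"

def scrape(app_id, proxy):
    # Stand-in for the GoogleScrapper run in a child process.
    return {"id": app_id, "title": "stub", "proxy": proxy}

cache = {}                            # app id -> scraped result
ids_queue = Queue()                   # filled by the web thread
pool = ProcessPoolExecutor(max_workers=1)

def launcher_step():
    app_id = ids_queue.get()          # blocks until a new id arrives
    if app_id in cache:               # already scraped: output cached data
        return cache[app_id]
    proxy = get_available_proxy()
    while proxy is None:              # no free proxy: wait 2s, ask again
        time.sleep(2)
        proxy = get_available_proxy()
    future = pool.submit(scrape, app_id, proxy)  # scrape in a new process
    cache[app_id] = future.result()   # on success, save to the cache
    return cache[app_id]
```

The cache check comes before the proxy request, so repeated IDs never consume a proxy slot or a worker process.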
- In production, the Flask web application should be put behind Nginx, Apache, or another web server.
- I used a ProcessPool rather than a ThreadPool because of Python's GIL (no real parallelism with threads).
- The ProcessPool is initialized to the number of cores; it can be increased to use more CPU.
- In the real world I would use a persistent queue (e.g. RabbitMQ) to support crash recovery (no lost app IDs).
- In the real world I would use a persistent cache (e.g. Redis) to support crash recovery (no lost app responses).
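The pool-sizing note above amounts to something like this (Python 3 sketch; in 2.7 the core count comes from `multiprocessing.cpu_count()` and the pool from the `futures` backport):

```python
import os
from concurrent.futures import ProcessPoolExecutor

# Default to one worker per core; raising max_workers beyond this lets more
# scraper processes run concurrently if there is CPU headroom to spare.
workers = os.cpu_count()
pool = ProcessPoolExecutor(max_workers=workers)
```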