Scraper Monitor

Developed using Python 3.5 (requires Python 3.4.2 or newer)

This project is a way to monitor scrapers, from how long each run takes to any errors they encounter. The logs API endpoint is designed to work directly with the Python logging HTTPHandler.
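
For example, a minimal sketch of pointing Python's built-in logging at the monitor (the host, keys, and run UUID below are placeholders; the query-string parameters are the ones listed under API Endpoints):

    import logging
    import logging.handlers

    # Placeholder values - fill in your own server address, keys, and run UUID
    host = 'localhost:5001'
    params = ('apikey=YOUR_API_KEY'
              '&scraperKey=YOUR_SCRAPER_KEY'
              '&scraperRun=RUN_UUID'
              '&environment=DEV')

    # HTTPHandler POSTs each log record to the scraper-monitor logs endpoint
    handler = logging.handlers.HTTPHandler(host,
                                           '/api/v1/logs?' + params,
                                           method='POST')

    logger = logging.getLogger('my-scraper')
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)

    logger.info('Scraper started')  # delivered to /api/v1/logs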

Install/Setup

  1. git clone https://github.com/xtream1101/scraper-monitor
  2. pip3 install -r requirements.txt (create a Python 3 virtual environment first if you would like)
  3. Rename config.py.sample to config.py and edit to reflect the values you need
  4. If using Postgres, make sure the schema exists in the database
  5. python3 manage.py db stamp head
  6. python3 manage.py db migrate
  7. python3 manage.py db upgrade
  8. To start the server: python3 main.py

To update

  1. git pull to get the latest changes
  2. pip3 install -r requirements.txt (inside your Python 3 virtual environment, if you use one)
  3. python3 manage.py db upgrade

Docker Usage

  1. Get the image: docker pull xtream1101/scraper-monitor:latest
  2. You must have a filled-in config file locally to mount into the container at run time (Docker bind mounts require an absolute host path, hence the $(pwd) below)
  3. If the database has not been set up yet, run (set the port based on your config): docker run -v $(pwd)/config.py:/src/config.py -p 5001:5001 xtream1101/scraper-monitor:latest manage db stamp head
  4. Then run the app: docker run -d -v $(pwd)/config.py:/src/config.py -p 5001:5001 xtream1101/scraper-monitor:latest

To have the logs locally, mount a volume to the /src/logs directory, e.g. add -v $(pwd)/logs:/src/logs to the docker run command.

API Endpoints

POST data is JSON.

Required by all requests, as GET query-string parameters (a usage sketch follows the endpoint list below):

  • apikey - API key of the organization the scraper is under
  • scraperKey - Unique ID for the scraper
  • scraperRun - Unique UUID for this run of the scraper
  • environment - Either DEV or PROD

  • Endpoint: /api/v1/logs

    • POST
      • This is the endpoint the Python logging HTTPHandler should point at
      • Do not POST to it directly; use Python's built-in logging (see the example above).
  • Endpoint: /api/v1/data/start

    • POST
      • startTime - Required - Python datetime (should be UTC)
  • Endpoint: /api/v1/data/stop

    • POST
      • stopTime - Required - Python datetime (should be UTC)
      • totalUrls - Optional - Number of URLs that were loaded
      • refDataCount - Optional - Total number of items that should be scraped
      • refDataSuccessCount - Optional - Total number of items that were scraped successfully
      • rowsAddedToDb - Optional - Number of rows added to the database
  • Endpoint: /api/v1/error/url

    • POST
      • url - Required - The URL in question
      • numTries - Optional - How many times this URL has been tried
      • reason - Optional - Why the URL was marked as failed
      • ref_id - Optional - The reference ID of the URL
      • ref_table - Optional - The table where this reference ID lives
      • statusCode - Optional - HTTP status code, only needed if it was the cause of the failure
      • threadName - Optional - The name of the Python thread
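
As a rough usage sketch (not taken from the project itself), a scraper might call these endpoints with the third-party requests library as below; the base URL and keys are placeholders, and the exact datetime string format the server expects is an assumption:

    import uuid
    import datetime

    import requests  # third-party: pip3 install requests

    BASE = 'http://localhost:5001/api/v1'  # placeholder host/port

    # Query-string parameters required by every request
    params = {
        'apikey': 'YOUR_API_KEY',
        'scraperKey': 'YOUR_SCRAPER_KEY',
        'scraperRun': str(uuid.uuid4()),  # unique UUID per run
        'environment': 'DEV',
    }

    # Mark the start of the run
    requests.post(BASE + '/data/start', params=params,
                  json={'startTime': str(datetime.datetime.utcnow())})

    # Report a URL that failed during the run
    requests.post(BASE + '/error/url', params=params,
                  json={'url': 'http://example.com/page',
                        'numTries': 3,
                        'reason': 'Timed out',
                        'statusCode': 504})

    # Mark the end of the run, with the optional counters
    requests.post(BASE + '/data/stop', params=params,
                  json={'stopTime': str(datetime.datetime.utcnow()),
                        'totalUrls': 120,
                        'refDataCount': 100,
                        'refDataSuccessCount': 97,
                        'rowsAddedToDb': 97})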