Developed using Python 3.5 (requires at least 3.4.2).

This project is a way to monitor scrapers, from how long each run takes to any errors it may encounter. The logs API endpoint is designed to work directly with the Python logging `HTTPHandler`.
    git clone https://github.com/xtream1101/scraper-monitor
    pip3 install -r requirements.txt

- Create a python3 virtual environment if you would like
- Rename `config.py.sample` to `config.py` and edit it to reflect the values you need
- If using postgres, make sure that the schema exists in the database
- Run:

    python3 manage.py db stamp head
    python3 manage.py db migrate
    python3 manage.py db upgrade

- To start the server:

    python3 main.py
- Get the image:

    docker pull xtream1101/scraper-monitor:latest

- You must have a locally filled-in config file to mount into the container on run
- If the databases have not been set up, run (set the port based on your config):

    docker run -v config.py:/src/config.py -p 5001:5001 xtream1101/scraper-monitor:latest manage db stamp head

- Then run the app:

    docker run -d -v config.py:/src/config.py -p 5001:5001 xtream1101/scraper-monitor:latest

- To keep the logs locally, mount a volume to the `/src/logs` directory
POST data is JSON.

Required by all requests, passed as GET query parameters (all required):

- `apikey` - API key of the organization the scraper is under
- `scraperKey` - Unique ID for the scraper
- `scraperRun` - Unique UUID for this run of the scraper
- `environment` - Either `DEV` or `PROD`
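The required parameters above can be assembled into a query string with the standard library. This is a minimal sketch; the `apikey` and `scraperKey` values below are placeholders, not real credentials.

```python
# Build the query string required by every scraper-monitor API call.
# The apikey and scraperKey values are placeholders; use those issued
# by your scraper-monitor instance.
import uuid
from urllib.parse import urlencode

common_params = {
    "apikey": "YOUR_ORG_API_KEY",      # placeholder organization API key
    "scraperKey": "my-scraper",        # placeholder unique scraper id
    "scraperRun": str(uuid.uuid4()),   # unique uuid for this run
    "environment": "DEV",              # or "PROD"
}
query_string = urlencode(common_params)
```

The same `scraperRun` UUID should be reused across all requests belonging to one run, so the server can group them together.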
Endpoint: `/api/v1/logs`

- `POST`
- This is the endpoint the Python logging `HTTPHandler` should be set to
- Do not log directly to this endpoint; use Python's built-in logging
Endpoint: `/api/v1/data/start`

- `POST`
- `startTime` - Required - Python datetime (should be UTC)
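A minimal sketch of calling this endpoint with the standard library. The base URL and credential values are assumptions; the actual send is left commented out.

```python
# Report a run's start time to /api/v1/data/start.
# BASE and the credential values are placeholder assumptions.
import datetime
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:5001/api/v1"  # assumed server address and port
params = urllib.parse.urlencode({
    "apikey": "YOUR_ORG_API_KEY",
    "scraperKey": "my-scraper",
    "scraperRun": "00000000-0000-0000-0000-000000000000",
    "environment": "DEV",
})

start_body = json.dumps({
    # Required: run start time as a UTC datetime
    "startTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}).encode()

req = urllib.request.Request(
    "{}/data/start?{}".format(BASE, params),
    data=start_body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```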
Endpoint: `/api/v1/data/stop`

- `POST`
- `stopTime` - Required - Python datetime (should be UTC)
- `totalUrls` - Optional - Number of URLs that were loaded
- `refDataCount` - Optional - Total number of things that should be scraped
- `refDataSuccessCount` - Optional - Total number of things that did scrape successfully
- `rowsAddedToDb` - Optional - Number of rows added to the database
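A sketch of the JSON body for this endpoint, using the field names listed above. The numbers are illustrative, and only `stopTime` is required.

```python
# Example JSON body for /api/v1/data/stop; values are illustrative.
import datetime
import json

stop_payload = {
    "stopTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "totalUrls": 120,            # optional: urls that were loaded
    "refDataCount": 100,         # optional: things that should be scraped
    "refDataSuccessCount": 97,   # optional: things scraped successfully
    "rowsAddedToDb": 97,         # optional: rows added to the database
}
stop_body = json.dumps(stop_payload)
```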
Endpoint: `/api/v1/error/url`

- `POST`
- `url` - Required - The URL in question
- `numTries` - Optional - How many times this URL has been tried
- `reason` - Optional - Why the URL is marked as failed
- `ref_id` - Optional - The ref ID of the URL
- `ref_table` - Optional - Where this ref ID lives
- `statusCode` - Optional - HTTP status code, only needed if it was the cause of the failure
- `threadName` - Optional - The name of the Python thread
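A sketch of the JSON body for this endpoint. The values are illustrative, and only `url` is required.

```python
# Example JSON body for /api/v1/error/url; values are illustrative.
import json

error_payload = {
    "url": "https://example.com/page/42",  # required: the url in question
    "numTries": 3,                  # optional: times this url has been tried
    "reason": "Request timed out",  # optional: why it is marked as failed
    "statusCode": 504,              # optional: only if it caused the failure
    "threadName": "worker-1",       # optional: python thread name
}
error_body = json.dumps(error_payload)
```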