Get Blogspot domains from the CommonCrawl. Runs on Heroku.
To set the CommonCrawl indexes (base URLs) to download from, set the index environment variable to a comma-separated list of index URLs.
Example: http://index.commoncrawl.org/CC-MAIN-2017-43-index,http://index.commoncrawl.org/CC-MAIN-2018-51-index
A full list of indexes is available at https://index.commoncrawl.org/.
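As a rough illustration of the format described above, a comma-separated index value could be split into individual base URLs as follows. This is a minimal sketch, not the script's actual code; the helper name parse_index_urls is hypothetical.

```python
import os


def parse_index_urls(raw):
    """Split a comma-separated string into a list of index base URLs.

    Hypothetical helper: surrounding whitespace is stripped and empty
    entries (for example from a trailing comma) are dropped.
    """
    return [url.strip() for url in raw.split(",") if url.strip()]


# "index" is the environment variable named in this README.
# If it is unset, the result is an empty list.
index_urls = parse_index_urls(os.environ.get("index", ""))
```

Dropping empty entries means a stray trailing comma in the variable does not produce a blank URL.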
For the Google Drive feature at the end of the script, two authentication files are required: client_secrets.json and credentials.txt.
client_secrets.json can be downloaded from the Google API console from a project with the Google Drive API enabled. Note that you will likely have to rename the downloaded file to client_secrets.json so the script can detect it.
credentials.txt can be generated by running the script locally and accepting the OAuth prompt.
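Before deploying, it can help to confirm that both authentication files are present. The sketch below assumes the script looks for them in the current working directory, which this README does not state; the helper name missing_auth_files is hypothetical.

```python
from pathlib import Path

# File names taken from this README.
REQUIRED_AUTH_FILES = ("client_secrets.json", "credentials.txt")


def missing_auth_files(directory="."):
    """Return the required auth file names that are not present in `directory`.

    Assumption: the files live in a single directory (the working
    directory by default). Adjust if the script looks elsewhere.
    """
    base = Path(directory)
    return [name for name in REQUIRED_AUTH_FILES if not (base / name).is_file()]
```

An empty list from missing_auth_files() means both files were found.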
In addition, you will need to set the heroku-key environment variable to a Heroku API key. The key is used to scale the dyno formation down to 0 once the script has finished cleanly. It is also used to update the driveid environment variable on the list API provider server.
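For reference, scaling a formation down to 0 through the Heroku Platform API (version 3) is a PATCH request to the app's formation endpoint. The sketch below only builds that request and does not send it. The app name, the "worker" process type, and the helper name build_scale_down_request are assumptions for illustration, not details taken from the script.

```python
import json
import urllib.request


def build_scale_down_request(app_name, api_key, process_type="worker"):
    """Build (but do not send) a request that sets a formation's quantity to 0.

    Assumptions: `app_name` and `process_type` are placeholders; the real
    script may use different values or a different HTTP client.
    """
    url = f"https://api.heroku.com/apps/{app_name}/formation/{process_type}"
    body = json.dumps({"quantity": 0}).encode("utf-8")
    headers = {
        # Selects version 3 of the Heroku Platform API.
        "Accept": "application/vnd.heroku+json; version=3",
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PATCH")
```

To send it, pass the result to urllib.request.urlopen, with the key read from os.environ["heroku-key"].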
You will need to set the logsheetid environment variable to the ID of a Google Sheet so events can be logged.
You will need to set the memsheetid environment variable to the ID of a Google Sheet so important variables can be stored during graceful shutdowns/restarts.
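A quick way to check the configuration is to verify that every environment variable named in this README is set before the script starts. This is a minimal sketch; the helper name missing_env_vars is hypothetical, and driveid is left out because it is described as living on the list API provider server rather than on this app.

```python
import os

# Variable names taken from this README.
REQUIRED_ENV_VARS = ("index", "heroku-key", "logsheetid", "memsheetid")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to os.environ; a plain dict can be passed instead.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

An empty list means all four variables are set to non-empty values.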