blogspot-domains

Get Blogspot domains from the CommonCrawl. Runs on Heroku.

To choose which CommonCrawl indexes (base URLs) to download from, set the index environment variable to a comma-separated list of index URLs. Example: http://index.commoncrawl.org/CC-MAIN-2017-43-index,http://index.commoncrawl.org/CC-MAIN-2018-51-index. A full list of indexes is available at https://index.commoncrawl.org/.
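As a rough illustration of how a comma-separated value like this could be read, here is a minimal sketch. The helper name parse_index_urls and the empty-string fallback are assumptions for illustration, not taken from worker.py.

```python
import os


def parse_index_urls(raw):
    """Split a comma-separated string of index URLs into a clean list.

    Surrounding whitespace is trimmed and empty entries (for example from
    a trailing comma) are dropped.
    """
    return [url.strip() for url in raw.split(",") if url.strip()]


# "index" is the variable name described above; falling back to an empty
# string when it is unset is an assumption made for this sketch.
index_urls = parse_index_urls(os.environ.get("index", ""))
```

With the example value above, this would yield a two-element list, one entry per index URL.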

The Google Drive feature at the end of the script requires two authentication files: client_secrets.json and credentials.txt. client_secrets.json can be downloaded from the Google API console, from a project with the Google Drive API enabled. Note that you will likely have to rename the downloaded file so the script can detect it. credentials.txt can be generated by running the script locally and accepting the OAuth prompt.
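Before deploying, it can help to confirm that both files are present under the expected names. The following is a small sketch that assumes the files sit in the script's working directory; the helper name missing_auth_files is hypothetical.

```python
import os

# File names as given above. Their location (the working directory by
# default) is an assumption of this sketch.
AUTH_FILES = ("client_secrets.json", "credentials.txt")


def missing_auth_files(directory="."):
    """Return the names of the authentication files not found in directory."""
    return [
        name
        for name in AUTH_FILES
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

An empty list means both files were found; otherwise the list names whichever file still needs to be downloaded, renamed, or generated.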

In addition, set the heroku-key environment variable to a Heroku API key. The key is used to scale the dyno formation down to 0 when the script finishes cleanly. It is also used to update the driveid environment variable on the list API provider server.
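For reference, scaling a formation to 0 through the Heroku Platform API is a PATCH request against the app's formation endpoint. The sketch below only builds that request with the standard library; it is not the code from worker.py. The app name, the "worker" process type, and the helper name build_scale_request are assumptions for illustration.

```python
import json
import urllib.request


def build_scale_request(app_name, api_key, process_type="worker", quantity=0):
    """Build (but do not send) a Heroku Platform API request that scales a
    process type to the given quantity.

    app_name and process_type are placeholders; substitute the values for
    your own app.
    """
    url = f"https://api.heroku.com/apps/{app_name}/formation/{process_type}"
    body = json.dumps({"quantity": quantity}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Accept": "application/vnd.heroku+json; version=3",
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Sending it would look like urllib.request.urlopen(build_scale_request("your-app-name", os.environ["heroku-key"])), with import os added and "your-app-name" replaced by the real app name.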

Set the logsheetid environment variable to the ID of a Google Sheet where events are logged, and set the memsheetid environment variable to the ID of a Google Sheet where important variables are stored during graceful shutdowns and restarts.
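Since several environment variables must be set before the script can run, a quick startup check can catch a missing one early. This is a minimal sketch; the helper name missing_env_vars is hypothetical, and treating an empty value as missing is an assumption.

```python
import os

# Variable names as described in this README.
REQUIRED_VARS = ("index", "heroku-key", "logsheetid", "memsheetid")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    env defaults to the process environment; passing a dict makes the
    check easy to try out without touching real settings.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

For example, missing_env_vars({"index": "http://index.commoncrawl.org/CC-MAIN-2017-43-index"}) returns the three names that still need values.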
