tech234a/blogspot-domains

blogspot-domains

Get Blogspot domains from the CommonCrawl. Runs on Heroku.

To choose which CommonCrawl indexes (base URLs) to download from, set the index environment variable to a comma-separated list of index URLs, for example: http://index.commoncrawl.org/CC-MAIN-2017-43-index,http://index.commoncrawl.org/CC-MAIN-2018-51-index. A full list of indexes is available at https://index.commoncrawl.org/.
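As a rough illustration of the format described above, the following sketch splits the value of the index variable into a list of URLs. The function name parse_index_urls is hypothetical and is not taken from the script itself; the whitespace trimming and skipping of empty entries are assumptions about what a tolerant parser might do.

```python
import os


def parse_index_urls(raw: str) -> list[str]:
    """Split a comma-separated string of index URLs into a clean list.

    Hypothetical helper: trims whitespace around each entry and drops
    empty entries (for example, one left by a trailing comma).
    """
    return [url.strip() for url in raw.split(",") if url.strip()]


if __name__ == "__main__":
    # "index" is the variable name given in this README.
    # An unset variable is treated as an empty list here (an assumption).
    for url in parse_index_urls(os.environ.get("index", "")):
        print(url)
```

Keeping the parsing in a small function makes it easy to check the value locally before deploying.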

The Google Drive feature at the end of the script requires two authentication files: client_secrets.json and credentials.txt. client_secrets.json can be downloaded from the Google API console from a project with the Google Drive API enabled. Note that you will likely have to rename the downloaded file so the script can detect it. credentials.txt can be generated by running the script locally and accepting the OAuth prompt.

In addition, you will need to set the heroku-key environment variable to a Heroku API key. The key is used to scale the dyno formation down to 0 once the script finishes cleanly. It is also used to update the driveid environment variable on the list API provider server.
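For orientation, scaling a formation to 0 through the Heroku Platform API can be sketched roughly as below. This is not the script's actual code: the function names, the app name my-app, and the process type worker are placeholders chosen for illustration. The request is built separately from being sent so it can be inspected without touching the network.

```python
import json
import urllib.request

HEROKU_API = "https://api.heroku.com"


def build_scale_request(app_name, process_type, quantity, api_key):
    """Build (but do not send) a formation-update request.

    app_name and process_type are placeholders supplied by the caller;
    this README does not state which values the script uses.
    """
    url = f"{HEROKU_API}/apps/{app_name}/formation/{process_type}"
    body = json.dumps({"quantity": quantity}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            # The Platform API expects version 3 to be requested explicitly.
            "Accept": "application/vnd.heroku+json; version=3",
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def scale_down(app_name, process_type, api_key):
    """Send the request that sets the formation quantity to 0."""
    req = build_scale_request(app_name, process_type, 0, api_key)
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A call such as scale_down("my-app", "worker", os.environ["heroku-key"]) would then run as the last step after a clean finish, assuming those placeholder names match your app.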

You will need to set the logsheetid environment variable to the ID of a Google Sheet so events can be logged. You will need to set the memsheetid environment variable to the ID of a Google Sheet so important variables can be stored during graceful shutdowns/restarts.
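Since several environment variables are needed before the script can run, a small pre-flight check can catch a missing one early. The sketch below is an assumption about how such a check might look, not part of the script; the names REQUIRED_VARS and missing_vars are hypothetical, and the list covers only the variables this README says you need to set.

```python
import os

# Variables this README says must be set on the app running the script.
REQUIRED_VARS = ("index", "heroku-key", "logsheetid", "memsheetid")


def missing_vars(environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


def check_environment() -> None:
    """Exit with a readable message if any required variable is missing."""
    missing = missing_vars(os.environ)
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```

Calling check_environment() at startup would stop the run before any download begins if the configuration is incomplete.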
