xmjw/crawl-source

What?

Keep a log/database of the top one million domains.

Download http://downloads.majestic.com/majestic_million.csv
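
A minimal Go sketch of that first step, assuming the CSV keeps its published column layout (GlobalRank, TldRank, Domain, ...):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"net/http"
)

const sourceURL = "http://downloads.majestic.com/majestic_million.csv"

func main() {
	resp, err := http.Get(sourceURL)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	r := csv.NewReader(resp.Body)

	// Skip the header row.
	if _, err := r.Read(); err != nil {
		log.Fatal(err)
	}

	// Print the first 10 domains as a smoke test.
	for i := 0; i < 10; i++ {
		record, err := r.Read()
		if err != nil {
			log.Fatal(err)
		}
		// Assumed layout: GlobalRank, TldRank, Domain, ...
		fmt.Println(record[2])
	}
}
```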

Plan

  • Use Go apps as microservices
  • Scrape each site and index its keywords
  • Provide a graph API to query the domains
  • For each domain, find the domains/pages it links to and fetch them (a link-extraction sketch follows this list).
  • Check the host/DNS records of every domain to see what else we can find, and fetch those too.
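
A sketch of the link-extraction step, using golang.org/x/net/html for parsing; the function name and example URL are illustrative, not part of the repo:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/html"
)

// extractLinks fetches a page and returns every href it finds.
func extractLinks(pageURL string) ([]string, error) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key == "href" {
					links = append(links, a.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links, nil
}

func main() {
	links, err := extractLinks("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```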

Create a Google Bigtable dataset of every website and the graph of everything it links to.
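
One way an edge in that graph might be written with the cloud.google.com/go/bigtable client. The project, instance, table, and column-family names are assumptions, not the repo's real configuration:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()

	// Placeholder project/instance names, not real config.
	client, err := bigtable.NewClient(ctx, "my-project", "crawl-instance")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	tbl := client.Open("domain-graph")

	// One row per source domain; one column per target it links to.
	mut := bigtable.NewMutation()
	mut.Set("links", "example.org", bigtable.Now(), []byte("1"))

	if err := tbl.Apply(ctx, "example.com", mut); err != nil {
		log.Fatal(err)
	}
}
```

Keying rows by source domain keeps all of a domain's outbound links in a single row, which suits the per-domain lookups a graph API would serve.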

Structure

Components that we need:

  • Something to fetch the URLs and write them to a table every n days (see the ticker sketch below). (Is this necessary? Once we have a set, surely we can just keep working from that?)
  • Something to fetch each site and crawl its pages for links.
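
If the every-n-days fetch does turn out to be needed, a time.Ticker loop is the simplest shape for it; refreshDomainList here is a hypothetical stand-in for the CSV fetch sketched earlier:

```go
package main

import (
	"log"
	"time"
)

// refreshDomainList is a hypothetical hook for the CSV fetch shown earlier.
func refreshDomainList() error {
	log.Println("refreshing domain list...")
	return nil
}

func main() {
	const n = 7 // refresh every n days; the right interval is an open question

	ticker := time.NewTicker(n * 24 * time.Hour)
	defer ticker.Stop()

	// Fetch once at startup, then on every tick.
	if err := refreshDomainList(); err != nil {
		log.Println("refresh failed:", err)
	}
	for range ticker.C {
		if err := refreshDomainList(); err != nil {
			log.Println("refresh failed:", err)
		}
	}
}
```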

Other things?

About

A utility for finding URLs to crawl
