xmjw/crawl-source

What?

Keep a log/database of the top one million domains.

Download http://downloads.majestic.com/majestic_million.csv
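
A minimal Go sketch of that first step, assuming the CSV keeps its published column layout (GlobalRank, TldRank, Domain, ...):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"net/http"
)

const sourceURL = "http://downloads.majestic.com/majestic_million.csv"

func main() {
	resp, err := http.Get(sourceURL)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	r := csv.NewReader(resp.Body)

	// Skip the header row.
	if _, err := r.Read(); err != nil {
		log.Fatal(err)
	}

	// Print the first 10 domains as a smoke test.
	for i := 0; i < 10; i++ {
		record, err := r.Read()
		if err != nil {
			log.Fatal(err)
		}
		// Assumed layout: GlobalRank, TldRank, Domain, ...
		fmt.Println(record[2])
	}
}
```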

Plan

  • Use Go apps as microservices
  • Scrape each site and index its keywords
  • Provide a graph API to query the domains
  • For each domain, find the domains/pages it links to and fetch them (a link-extraction sketch follows this list).
  • Check the host/DNS records of every domain to see what else we can find, and fetch those too.
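
A sketch of the link-extraction step, using golang.org/x/net/html for parsing; the function name and example URL are illustrative, not part of the repo:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/html"
)

// extractLinks fetches a page and returns every href it finds.
func extractLinks(pageURL string) ([]string, error) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key == "href" {
					links = append(links, a.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links, nil
}

func main() {
	links, err := extractLinks("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```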

Create a Google Bigtable dataset of every website and the graph of everything it links to.
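
One way an edge in that graph might be written with the cloud.google.com/go/bigtable client. The project, instance, table, and column-family names are assumptions, not the repo's real configuration:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()

	// Placeholder project/instance names, not real config.
	client, err := bigtable.NewClient(ctx, "my-project", "crawl-instance")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	tbl := client.Open("domain-graph")

	// One row per source domain; one column per target it links to.
	mut := bigtable.NewMutation()
	mut.Set("links", "example.org", bigtable.Now(), []byte("1"))

	if err := tbl.Apply(ctx, "example.com", mut); err != nil {
		log.Fatal(err)
	}
}
```

Keying rows by source domain keeps all of a domain's outbound links in a single row, which suits the per-domain lookups a graph API would serve.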

Structure

Components that we need:

  • Something to fetch the URLs and write them to a table every n days (see the ticker sketch below). (Is this necessary? Once we have a set, surely we can just keep working from that?)
  • Something to fetch each site and crawl its pages for links.
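
If the every-n-days fetch does turn out to be needed, a time.Ticker loop is the simplest shape for it; refreshDomainList here is a hypothetical stand-in for the CSV fetch sketched earlier:

```go
package main

import (
	"log"
	"time"
)

// refreshDomainList is a hypothetical hook for the CSV fetch shown earlier.
func refreshDomainList() error {
	log.Println("refreshing domain list...")
	return nil
}

func main() {
	const n = 7 // refresh every n days; the right interval is an open question

	ticker := time.NewTicker(n * 24 * time.Hour)
	defer ticker.Stop()

	// Fetch once at startup, then on every tick.
	if err := refreshDomainList(); err != nil {
		log.Println("refresh failed:", err)
	}
	for range ticker.C {
		if err := refreshDomainList(); err != nil {
			log.Println("refresh failed:", err)
		}
	}
}
```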

Other things?

About

A utility for finding URLs to crawl
