OD-Database

OD-Database is a web-crawling project that aims to index a very large number of file links and their basic metadata from open directories (misconfigured Apache/Nginx/FTP servers or, more often, mirrors of various public services).

Each crawler instance fetches tasks from the central server and pushes the results once a task is completed. A single instance can crawl hundreds of websites at the same time (both FTP and HTTP(S)), and the central server is capable of ingesting thousands of new documents per second.
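The fetch-then-push cycle described above can be sketched as follows. This is only an illustration under stated assumptions: `Task`, `InMemoryServer`, `fake_crawl` and `run_worker` are hypothetical names, the server is an in-memory stand-in rather than the project's real API, and no network access takes place.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical task shape: an id for the website and its root URL.
    website_id: int
    url: str


@dataclass
class InMemoryServer:
    """Hypothetical stand-in for the central server (not the real API)."""
    pending: deque = field(default_factory=deque)
    results: dict = field(default_factory=dict)

    def fetch_task(self):
        # Hand out the next queued task, or None when the queue is empty.
        return self.pending.popleft() if self.pending else None

    def push_result(self, task, files):
        # Store the crawl result under the website the task refers to.
        self.results[task.website_id] = files


def fake_crawl(task):
    """Placeholder crawl: returns one made-up file entry per task."""
    return [{"path": task.url + "index.html", "size": 0}]


def run_worker(server, crawl=fake_crawl):
    """Fetch tasks until none remain, pushing each result once completed."""
    done = 0
    while (task := server.fetch_task()) is not None:
        server.push_result(task, crawl(task))
        done += 1
    return done


server = InMemoryServer(deque([
    Task(1, "http://example.com/"),
    Task(2, "ftp://example.org/"),
]))
print(run_worker(server))  # number of tasks processed
```

A real crawler instance would run many such tasks concurrently; the sequential loop here only shows the order of operations (fetch, crawl, push).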

The data is indexed into Elasticsearch and made available via the web frontend (currently hosted at https://od-db.the-eye.eu/). There are currently ~1.93 billion files indexed (about 300 GB of raw data in total). The raw data is made available as a CSV file here.
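As a rough sanity check on those figures, the average amount of raw data per indexed file can be estimated. The sketch below assumes the 300 figure means decimal gigabytes (10^9 bytes); the function name is hypothetical.

```python
def average_bytes_per_file(total_bytes: float, file_count: float) -> float:
    """Mean raw-data size per indexed file."""
    return total_bytes / file_count


# Assumption: 300 gigabytes taken as 300e9 bytes; 1.93 billion files.
approx = average_bytes_per_file(300e9, 1.93e9)
print(f"{approx:.0f} bytes per file")  # prints "155 bytes per file"
```

Roughly 155 bytes per entry is in line with each record holding only a link and basic metadata rather than file contents.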

Contributing

Suggestions, concerns, and PRs are welcome.

Installation (Docker)

git clone --recursive https://github.com/simon987/od-database
cd od-database
mkdir oddb_pg_data/ tt_pg_data/ es_data/ wsb_data/
docker-compose up

Architecture

Running the crawl server

The Python crawler that was part of this project is discontinued; the Go implementation is currently in use.

