zillabyte/pdf_crawler
Crawling the Web for PDFs

This app identifies PDFs and other documents scattered across the web. The motivating use case is to identify prospective sales leads who would benefit from better document management services.

This app uses web_deep, one of Zillabyte's pre-crawled copies of the web. The dataset is a cached copy of the top 500 pages of the top 1 million domains on the web (500,000,000 pages total) and is refreshed with a new crawl every week. Although web_deep is a good proxy for the "important stuff on the web", it is by no means exhaustive. If you need more coverage, you'll need to use Zillabyte's live-crawling feature; the drawbacks are that it may be more expensive and will not be as quick to analyze. Contact us at support@zillabyte.com for more information.
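The app itself runs on Zillabyte's platform, but the core idea, checking each crawled page for links to PDFs and other documents, can be sketched independently. The snippet below is a minimal illustration of that check, not the app's actual code; the function names, the DOC_EXTENSIONS set, and the use of Python's standard html.parser are assumptions made for the example.

```python
# Minimal sketch: count links on a page that point to document files.
# Illustrative only -- the real app applies this kind of check to pages
# in the web_deep dataset on Zillabyte's platform. Names are hypothetical.
from html.parser import HTMLParser
from urllib.parse import urlparse

DOC_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}

class DocumentLinkCounter(HTMLParser):
    """Counts <a href> links that point to document files."""

    def __init__(self):
        super().__init__()
        self.total_links = 0
        self.document_links = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        self.total_links += 1
        path = urlparse(href).path.lower()
        if any(path.endswith(ext) for ext in DOC_EXTENSIONS):
            self.document_links += 1

def document_density(html):
    """Return (document links, total links) for a page's HTML."""
    counter = DocumentLinkCounter()
    counter.feed(html)
    return counter.document_links, counter.total_links

if __name__ == "__main__":
    sample = '<a href="/reports/q3.pdf">Q3 report</a> <a href="/about">About</a>'
    docs, total = document_density(sample)
    print(f"{docs} of {total} links point to documents")
```

A site whose pages show a consistently high ratio of document links to total links is the kind of lead the app is meant to surface.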

To start running this app, tweak the code and run zillabyte push. For more information, check out docs.zillabyte.com.

Powered by Zillabyte
