CleanScrape

About

This is a no-nonsense web scraping tool which uses pycurl to fetch public web page content, readability-lxml to clean it of ads, navigation bars, sidebars, and other irrelevant boilerplate, and wkhtmltopdf to preserve the output in PDF document format.
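The pipeline is simple enough to sketch end to end. Here is a minimal illustration of how the three pieces fit together; this is not the actual CleanScraper.py source, and the function name and defaults are hypothetical:

import subprocess
from io import BytesIO

import pycurl
from readability import Document  # provided by readability-lxml

def fetch_clean_pdf(url, pdf_path, html_path="/tmp/page.html"):
    # 1. Fetch the raw page with pycurl.
    buf = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    curl.setopt(pycurl.WRITEDATA, buf)
    curl.perform()
    curl.close()
    raw_html = buf.getvalue().decode("utf-8", errors="replace")

    # 2. Strip ads, navigation bars, and sidebars: Document.summary()
    #    returns just the main article content as HTML.
    with open(html_path, "w") as f:
        f.write(Document(raw_html).summary())

    # 3. Render the cleaned HTML to a PDF with wkhtmltopdf.
    subprocess.check_call(["/usr/local/bin/wkhtmltopdf", "--page-size", "Letter",
                           html_path, pdf_path])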

Motivation

I was getting tired of stale bookmarked links: a lot of useful blog articles disappear, and neither web.archive.org nor Google's cache is very helpful.

Additionally, too many otherwise-useful pages are cluttered with ads, sidebars, and other crap, so CleanScrape focuses on preserving the article text itself, using the readability algorithm built into readability-lxml.

Installation

You need python, pip, and wkhtmltopdf installed on your computer.

Clone this repo to your computer and install the remaining requirements from the requirements.txt file like so:

$ pip install -r requirements.txt

Edit the settings.py file as necessary to match your computer's environment.
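
The individual settings aren't documented here, but judging from the paths shown in the Usage section below, settings.py plausibly holds values along these lines (the variable names are illustrative, not necessarily the ones the file actually uses):

# settings.py -- illustrative values only; check the actual file in the repo.
WKHTMLTOPDF = "/usr/local/bin/wkhtmltopdf"  # path to the wkhtmltopdf binary
OUTPUT_DIR = "/tmp"                         # where the .html and .pdf files land
PAGE_SIZE = "Letter"                        # passed to wkhtmltopdf as --page-size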

Usage

Run CleanScrape from a command line prompt, specifying the URL to fetch and clean and the PDF file name to use for the final output.

$ ./CleanScraper.py "http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/" "strace.pdf"

The same, but from inside a Python shell:

>>> from CleanScraper import scrape
>>> scrape("http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/", "strace.pdf")                                          

If successful, the output looks like this, with the final result saved to /tmp/strace.pdf (the output folder can be changed in settings.py):

/usr/local/bin/wkhtmltopdf --page-size Letter /tmp/strace.html /tmp/strace.pdf
Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      

Cleaning with readability is optional; if you want to keep the retrieved HTML as-is, use the --noclean option:

$ ./CleanScraper.py "http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/" "strace.pdf" --noclean

Or from inside the Python shell:

>>> from CleanScraper import scrape
>>> scrape("http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/", "strace.pdf", clean_it=False)                                          
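
In terms of the hypothetical sketch in the About section, clean_it presumably just gates the readability pass, along the lines of:

# Hypothetical: when clean_it is False, hand the raw fetched HTML
# to wkhtmltopdf unchanged instead of the readability summary.
html = Document(raw_html).summary() if clean_it else raw_html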
