CleanScrape

About

This is a no-nonsense web scraping tool which uses pycurl to fetch public web page content, readability-lxml to clean it of ads, navigation bars, sidebars, and other irrelevant boilerplate, and wkhtmltopdf to preserve the output in PDF document format.
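The pipeline is simple enough to sketch end to end. Here is a minimal illustration of how the three pieces fit together; this is not the actual CleanScraper.py source, and the function name and defaults are hypothetical:

import subprocess
from io import BytesIO

import pycurl
from readability import Document  # provided by readability-lxml

def fetch_clean_pdf(url, pdf_path, html_path="/tmp/page.html"):
    # 1. Fetch the raw page with pycurl.
    buf = BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.FOLLOWLOCATION, True)
    curl.setopt(pycurl.WRITEDATA, buf)
    curl.perform()
    curl.close()
    raw_html = buf.getvalue().decode("utf-8", errors="replace")

    # 2. Strip ads, navigation bars, and sidebars: Document.summary()
    #    returns just the main article content as HTML.
    with open(html_path, "w") as f:
        f.write(Document(raw_html).summary())

    # 3. Render the cleaned HTML to a PDF with wkhtmltopdf.
    subprocess.check_call(["/usr/local/bin/wkhtmltopdf", "--page-size", "Letter",
                           html_path, pdf_path])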

Motivation

I was getting tired of stale bookmarked links: a lot of useful blog articles disappear, and neither web.archive.org nor Google's cache is very helpful.

Additionally, too many otherwise-useful pages are cluttered with ads, sidebars, and other crap, so CleanScrape focuses on preserving the article text itself, using the readability algorithm built into readability-lxml.

Installation

You need python, pip, and wkhtmltopdf installed on your computer.

Clone this repo to your computer and install the remaining requirements from the requirements.txt file like so:

$ pip install -r requirements.txt

Edit the settings.py file as necessary to match your computer's environment.
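
The individual settings aren't documented here, but judging from the paths shown in the Usage section below, settings.py plausibly holds values along these lines (the variable names are illustrative, not necessarily the ones the file actually uses):

# settings.py -- illustrative values only; check the actual file in the repo.
WKHTMLTOPDF = "/usr/local/bin/wkhtmltopdf"  # path to the wkhtmltopdf binary
OUTPUT_DIR = "/tmp"                         # where the .html and .pdf files land
PAGE_SIZE = "Letter"                        # passed to wkhtmltopdf as --page-size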

Usage

Run CleanScrape from a command line prompt, specifying the URL to fetch and clean and the PDF file name to use for the final output.

$ ./CleanScraper.py "http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/" "strace.pdf"

The same, but from inside a Python shell:

>>> from CleanScraper import scrape
>>> scrape("http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/", "strace.pdf")                                          

If successful, the output looks like this, with the final result saved to /tmp/strace.pdf (the output folder can be changed in settings.py):

/usr/local/bin/wkhtmltopdf --page-size Letter /tmp/strace.html /tmp/strace.pdf
Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done                                                                      

Cleaning with readability is optional; if you want to keep the retrieved HTML as-is, use the --noclean option:

$ ./CleanScraper.py "http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/" "strace.pdf" --noclean

Or from inside the Python shell:

>>> from CleanScraper import scrape
>>> scrape("http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/", "strace.pdf", clean_it=False)                                          
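
In terms of the hypothetical sketch in the About section, clean_it presumably just gates the readability pass, along the lines of:

# Hypothetical: when clean_it is False, hand the raw fetched HTML
# to wkhtmltopdf unchanged instead of the readability summary.
html = Document(raw_html).summary() if clean_it else raw_html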
