ruricolist/cl-boilerpipe

CL-BOILERPIPE is a Common Lisp library for extracting the main content from web pages like newspaper articles and blog posts. It was designed for expanding truncated articles in feeds.

CL-BOILERPIPE is based on the Java Boilerpipe library, which is in turn based on Kohlschütter et al., “Boilerplate Detection using Shallow Text Features”.

Only the simplest version of the Boilerpipe algorithm is implemented here; I find that it works well enough.
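To give a feel for what "shallow text features" means, here is a rough sketch of the kind of test the paper describes: a text block is judged by its word count and by the share of its words that sit inside links. This is not the code in CL-BOILERPIPE. The names `text-block`, `count-words`, `link-density` and `likely-content-p` are made up for illustration, and the thresholds are placeholder assumptions rather than values taken from the library or the paper.

```lisp
;; Illustrative sketch only: names and thresholds are assumptions,
;; not CL-BOILERPIPE's implementation.

(defstruct text-block
  (text "" :type string)          ; visible text of the block
  (link-words 0 :type integer))   ; how many of its words are inside <a> elements

(defun count-words (string)
  "Count whitespace-separated words in STRING."
  (let ((count 0)
        (in-word nil))
    (loop for ch across string
          do (if (member ch '(#\Space #\Tab #\Newline #\Return))
                 (setf in-word nil)
                 (unless in-word
                   (setf in-word t)
                   (incf count))))
    count))

(defun link-density (blk)
  "Fraction of the words in BLK that are link text (0 for an empty block)."
  (let ((words (count-words (text-block-text blk))))
    (if (zerop words)
        0
        (/ (text-block-link-words blk) words))))

(defun likely-content-p (blk &key (min-words 15) (max-link-density 1/3))
  "True if BLK is long enough and not dominated by links.
MIN-WORDS and MAX-LINK-DENSITY are placeholder thresholds."
  (and (>= (count-words (text-block-text blk)) min-words)
       (<= (link-density blk) max-link-density)))
```

With these placeholder thresholds, a navigation bar (few words, all of them links) is rejected, while a full sentence of body text with no links is kept.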

Usage

Given an HTML string, call:

 (cl-boilerpipe:strip-boilerpipe html)

This returns the main content as another HTML string.
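When expanding feed entries, it can be convenient to fall back to the original markup if extraction is unavailable. The following is a minimal sketch, assuming the library is loaded separately. `extract-main-content` is a hypothetical helper, not part of CL-BOILERPIPE. It looks the function up at run time so the snippet also loads when the library is absent.

```lisp
;; Hypothetical wrapper, not part of CL-BOILERPIPE.
(defun extract-main-content (html)
  "Return the main content of HTML, or HTML itself if CL-BOILERPIPE is not loaded."
  (let* ((pkg (find-package "CL-BOILERPIPE"))
         (sym (and pkg (find-symbol "STRIP-BOILERPIPE" pkg))))
    (if (and sym (fboundp sym))
        (funcall sym html)   ; library present: strip the boilerplate
        html)))              ; library absent: return the input unchanged
```

Code that can assume the library is loaded can call `cl-boilerpipe:strip-boilerpipe` directly, as shown above.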
