clear-html

Clean and normalize HTML while preserving embeddings (e.g. Twitter, Instagram).

Quick start

Installation

Install the library with pip:

pip install clear-html

Usage

Example usage with lxml:

from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html

html = """
        <div style="color:blue" id="main_content">
            Some text to be
            <div>cleaned up!</div>
        </div>
     """
node = fromstring(html)
cleaned_node = clean_node(node)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)

Example usage with Parsel:

from parsel import Selector
from clear_html import clean_node, cleaned_node_to_html

selector = Selector(text="""<html>
                            <body>
                                <h1>Hello!</h1>
                                <div style="color:blue" id="main_content">
                                    Some text to be
                                    <div>cleaned up!</div>
                                </div>
                            </body>
                            </html>""")
selector = selector.css("#main_content")
cleaned_node = clean_node(selector[0].root)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)

Both approaches above print the following:

<article>

<p>Some text to be</p>

<p>cleaned up!</p>

</article>

Other interesting functions:

cleaned_node_to_text: convert the cleaned node to plain text
formatted_text.clean_doc: low-level function that gives finer control over the cleanup
