Skip to content

wardbradt/HTMLST

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HTMLSentenceTokenizer (HTMLST)

A library which extracts sentences from HTML

HTMLST breaks down HTML documents into paragraph-like sections by taking into account the HTML5 Specification and web developers' typical usage of HTML tags and tokenizes the sentences in these sections using NLTK's Punkt Sentence Tokenizer to disambiguate sentence boundaries.

Example:

HTMLSentenceTokenizer can parse the following HTML in one line.

<!DOCTYPE html>
<html lang="en">
<head>
</head>
<body>
<div>
    <h1>Header One</h1>
    Hello, my <span id="foo">name</span> is Geronimo. What's yours?
    <div>
        Here is a cool picture:
        <br>
        <img src="http://www.sourcecertain.com/img/Example.png">

        Now, isn't that nice?
    </div>

    He<span>r<b>e</b></span>'s a wei<mark>rd</mark> sen<b>t<i>e<span>n</span>c</i>e wit</b>h <i>lots</i> of inline tags.
</div>
</body>
</html>

The following code prints out the sentences of this HTML, which is stored in example_html_one.html.

from htmlst import HTMLSentenceTokenizer

example_html_one = open('example_html_one.html', 'r').read()
parsed_sentences = HTMLSentenceTokenizer().feed(example_html_one)
print(parsed_sentences)

The output is ['Hello, my name is Geronimo.', "What's yours?", 'Here is a cool picture:', "Now, isn't that nice?", "Here's a weird sentence with lots of inline tags."].

Installation:

HTMLST is available on pip. To install, enter the following into terminal:

pip install htmlst

About

A library to extract sentences from HTML

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages