HTML::ExtractText perl module for extracting specific bits of text from HTML using CSS selectors


HTML::ExtractText - extract multiple text strings from HTML content, using CSS selectors


At its simplest, use CSS selectors:

use HTML::ExtractText;
my $ext = HTML::ExtractText->new;
$ext->extract({ page_title => 'title' }, $html) or die "Error: $ext";
print "Page title is $ext->{page_title}\n";

We can go fancy pants with selectors as well as extract more than one bit of text:

use HTML::ExtractText;
my $ext = HTML::ExtractText->new;
$ext->extract(
    {
        article   => 'article#main_content',
        irc_links => 'article#main_content a[href^="irc://"]',
    },
    $html,
) or die "Error: $ext";

print "IRC links:\n$ext->{irc_links}\n";
print "Full text:\n$ext->{article}\n";

We can also pass in an object and let the extractor call setter methods on it when it extracts text:

use HTML::ExtractText;
my $ext = HTML::ExtractText->new;
$ext->extract({ title => 'title' }, $html_code, $some_object )
    or die "Error: $ext";

print "Our object's ->title method is now set to:",
    $some_object->title, "\n";


The module lets you extract one or more text strings from HTML documents, using CSS selectors to declare what text needs extracting. The module can either return the results as a hashref or automatically call setter methods on a provided object.

If you're looking for extra automatic post-processing and laxer definition of what constitutes "text", see HTML::ExtractText::Extra.


my $extractor = HTML::ExtractText->new;
$extractor->extract(
    { stuff => 'title', },
    '<title>My html code!</title>',
    bless {}, 'Foo',
) or die "Extraction error: $extractor";

print "Title is: $extractor->{stuff}\n\n";

The module incorporates two overloaded methods: ->error(), which is overloaded for interpolation (use overload q|""| ...), and ->last_results(), which is overloaded for hash dereferencing (use overload q|%{}| ...).

What this means is that you can interpolate the object in a string to retrieve the error message and you can use the object as a hashref to access the hashref returned by ->last_results().
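
To illustrate the mechanism, here is a minimal sketch of a class overloaded in the same two ways. My::Result and its array-based layout are hypothetical, invented for this example. This is not the actual source of HTML::ExtractText.

```perl
use strict;
use warnings;

package My::Result;

use overload
    q|""|  => sub { $_[0]->error },         # string context gives the error
    q|%{}| => sub { $_[0]->last_results },  # hash dereference gives the results
    fallback => 1;

# The object is an arrayref, so overloading %{} cannot recurse into itself
sub new {
    my ( $class, %args ) = @_;
    return bless [ $args{error}, $args{results} ], $class;
}
sub error        { $_[0][0] }
sub last_results { $_[0][1] }

package main;

my $obj = My::Result->new(
    error   => 'something went wrong',
    results => { stuff => 'My html code!' },
);

print "Error: $obj\n";           # same as $obj->error
print "Stuff: $obj->{stuff}\n";  # same as $obj->last_results->{stuff}
```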



my $extractor = HTML::ExtractText->new;

my $extractor = HTML::ExtractText->new(
    separator        => "\n",
    ignore_not_found => 1,
); # default values for arguments are shown

Creates and returns a new HTML::ExtractText object. Takes optional arguments as key/value pairs:


separator

my $extractor = HTML::ExtractText->new(
    separator => "\n", # default value
);

my $extractor = HTML::ExtractText->new(
    separator => undef,
);

Optional. Default: \n (new line). Takes undef or a string as a value. Specifies what to do when a CSS selector matches multiple elements. If set to a string value, text from all the matching elements will be joined using that string. If set to undef, no joining will happen and results will be returned as arrayrefs instead of strings (even if a selector matches a single element).
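
As a rough sketch of the two modes, using only the constructor option and ->extract() call documented here (the sample HTML and the key name items are made up for illustration):

```perl
use strict;
use warnings;
use HTML::ExtractText;

my $html = '<ul><li>one</li><li>two</li></ul>';

# Join the text of all matched elements with a custom string
my $joined = HTML::ExtractText->new( separator => ', ' );
$joined->extract( { items => 'li' }, $html ) or die "Error: $joined";
print "Joined: $joined->{items}\n";

# With undef, each key holds an arrayref, one item per matched element
my $listed = HTML::ExtractText->new( separator => undef );
$listed->extract( { items => 'li' }, $html ) or die "Error: $listed";
print "Item: $_\n" for @{ $listed->{items} };
```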


ignore_not_found

my $extractor = HTML::ExtractText->new(
    ignore_not_found => 1,  # default value
);

my $extractor = HTML::ExtractText->new(
    ignore_not_found => 0,
);

Optional. Default: 1 (true). Takes a true or false value. Specifies whether to consider it an error when any of the given selectors match nothing. If set to a true value, any non-matching selectors will have empty strings as values and no errors will be reported. If set to a false value, all selectors must match at least one element or the module will error out.
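
A small sketch of the difference, assuming a document with no h1 element (the sample HTML and the key name heading are illustrative only, and the exact wording of the error message is not shown because it is not specified here):

```perl
use strict;
use warnings;
use HTML::ExtractText;

my $html = '<title>My html code!</title>';

# Default (true): a selector that matches nothing yields an empty string
my $lax = HTML::ExtractText->new;
$lax->extract( { heading => 'h1' }, $html ) or die "Error: $lax";
print "Heading is empty\n" if $lax->{heading} eq '';

# False: a selector that matches nothing makes ->extract() fail
my $strict = HTML::ExtractText->new( ignore_not_found => 0 );
$strict->extract( { heading => 'h1' }, $html )
    or print "Extraction failed: $strict\n";
```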


my $results = $extractor->extract(
    { stuff => 'title', },
    '<title>My html code!</title>',
    $some_object, # optional
) or die "Extraction error: $extractor";

print "Title is: $extractor->{stuff}\n\n";
# $extractor->{stuff} is the same as $results->{stuff}

Takes two mandatory arguments and one optional argument. Extracts text from the given HTML code and returns a hashref with results (see the ->last_results() method). On error, returns undef or an empty list, and the error will be available via the ->error() method. Even if errors occurred, anything that was successfully extracted will still be available through the ->last_results() method.

first argument

$extractor->extract(
    { stuff => 'title', },
    ... ,
    ... ,
) or die "Extraction error: $extractor";

Must be a hashref. The keys can be whatever you want; you will use them to refer to the extracted text. The values must be CSS selectors that match the elements you want to extract text from. All the selectors listed on are supported.

Note: the values will be modified in place in the original hashref you provided, so you can use that to your advantage, if needed.

second argument

$extractor->extract(
    ... ,
    '<title>My html code!</title>',
    ... ,
) or die "Extraction error: $extractor";

Takes a string that is HTML code you're trying to extract text from.

third argument

$extractor->extract(
    { stuff => 'title', },
    '<title>My html code!</title>',
    $some_object,
) or die "Extraction error: $extractor";

# this is what is being done automatically, during extraction,
# for each key in the first argument of ->extract():
# $some_object->stuff( $extractor->{stuff} );

Optional. No defaults. For convenience, you can supply an object and HTML::ExtractText will call methods on it. The called methods will be the keys of the first argument given to ->extract() and the extracted text will be given to those methods as the first argument.
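
For example, the sketch below passes in an object of a hypothetical My::Page class. The class and its combined getter/setter are invented for illustration. Any object with a method named after each key should work the same way.

```perl
use strict;
use warnings;
use HTML::ExtractText;

# Hypothetical class with a combined getter/setter named after the key
package My::Page;
sub new { return bless {}, shift }
sub stuff {
    my $self = shift;
    $self->{stuff} = shift if @_;
    return $self->{stuff};
}

package main;

my $page      = My::Page->new;
my $extractor = HTML::ExtractText->new;
$extractor->extract(
    { stuff => 'title' },
    '<title>My html code!</title>',
    $page,
) or die "Extraction error: $extractor";

print "Page's stuff: ", $page->stuff, "\n";
```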



$extractor->extract(
    { stuff => 'title', },
    '<title>My html code!</title>',
) or die "Extraction error: " . $extractor->error;

$extractor->extract(
    { stuff => 'title', },
    '<title>My html code!</title>',
) or die "Extraction error: $extractor";

Takes no arguments. Returns the error message as a string, if any occurred during the last call to ->extract(). Note that ->error() will only return one of the error messages, even if more than one selector failed. Examine the hashref returned by ->last_results() to find all the errors; for any selector that errored out, the value will begin with "ERROR: " and the error message will be there.
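
One way to collect every failure is to scan the results for that prefix. The helper below is a hypothetical sketch, not part of the module. It assumes plain-string values and skips arrayrefs, and the sample results are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: map each failed key to its error message,
# given a hashref shaped like the one ->last_results() returns
sub failed_selectors {
    my ($results) = @_;
    my %errors;
    for my $key ( sort keys %$results ) {
        my $value = $results->{$key};
        next if not defined $value or ref $value;  # skip undef and arrayrefs
        $errors{$key} = $1 if $value =~ /\AERROR: (.*)/s;
    }
    return \%errors;
}

# Made-up results, for illustration only
my $results = {
    title   => 'My html code!',
    heading => 'ERROR: some error message',
};
my $errors = failed_selectors($results);
print "$_ failed: $errors->{$_}\n" for sort keys %$errors;
```

In real use you would pass in $extractor->last_results after a failed ->extract() call. Note that legitimately extracted text that happens to start with "ERROR: " would be misreported by this check.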


$extractor->extract(
    { stuff => 'title', },
    '<title>My html code!</title>',
) or die "Extraction error: $extractor";

print "Stuff is " . $extractor->last_results->{stuff} . "\n";

# or

print "Stuff is $extractor->{stuff}\n";

Takes no arguments. Returns the same hashref that the last call to ->extract() returned. If ->extract() failed, you can still use ->last_results() to get anything that didn't error out (the error messages will be in the values of failed keys).

The hashref will contain the same keys as the first argument to ->extract() had and the values will be replaced with whatever the selectors matched.

If separator (see ->new()) is set to undef, the values will be arrayrefs, with each item in those arrayrefs corresponding to one matched element in HTML.

The module will attempt to DWIM (Do What I Mean) when a selector matches form controls or images, and use the value="" or alt="" attributes as text sources.



->separator() is an accessor for the separator option (see ->new()). It takes one optional argument which, if provided, becomes the new separator.



->ignore_not_found() is an accessor for the ignore_not_found option (see ->new()). It takes one optional argument which, if provided, becomes the new value of the ignore_not_found option.


sub _extract {
    my ( $self, $dom, $selector, $what ) = @_;
    return $dom->find( $what->{ $selector } )
        ->map( sub { $self->_process( @_ ) } )->each;
}

You can subclass this module by overriding either or both _extract and _process methods. Their names and purpose are guaranteed to remain unchanged. See source code for their default implementation.
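
As a hedged sketch of what a subclass might look like: the class name My::UpperCaseExtractor is hypothetical, and the code assumes the default _process returns the text of one matched element as a plain string. Check the module's source before relying on that.

```perl
package My::UpperCaseExtractor;
use strict;
use warnings;
use parent 'HTML::ExtractText';

# Assumption: the parent's _process returns one element's text as a string
sub _process {
    my $self = shift;
    my $text = $self->SUPER::_process(@_);
    return defined $text ? uc $text : $text;
}

package main;

my $ext = My::UpperCaseExtractor->new;
$ext->extract( { title => 'title' }, '<title>My html code!</title>' )
    or die "Error: $ext";
print "Title: $ext->{title}\n";
```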



This module does not automatically encode extracted text, so the examples in this documentation should really include something akin to:

use Encode;

my $title = encode 'utf8', $ext->{page_title};
print "$title\n";


HTML::ExtractText::Extra - a subclass that offers extra features

Mojo::DOM, Text::Balanced, HTML::Extract



To report bugs or request features, please use

If you can't access GitHub, you can email your request to bug-html-extracttext at



You can use and distribute this module under the same terms as Perl itself. See the LICENSE file included in this distribution for complete details.