Skip to content

Commit

Permalink
Stop assuming all tags are in the same case
Browse files Browse the repository at this point in the history
  • Loading branch information
ronsavage committed Nov 18, 2012
1 parent b784de2 commit 66af853
Show file tree
Hide file tree
Showing 7 changed files with 89 additions and 22 deletions.
6 changes: 5 additions & 1 deletion CHANGES
@@ -1,8 +1,12 @@
Revision history for Perl extension HTML::Parser::Simple.

1.07 Unreleased
1.07 Sun Nov 18 12:06:00 2012
- Replace /usr/bin/perl with /usr/bin/env perl.
- Replace common::sense with use strict and use warnings, to get uninit var warnings.
- Fix bug whereby code was assuming all tags were in the same case.
This lead to the error: Can't locate object method "getParent" via package "root"...
Due to the fix, the code now outputs all tags and attributes as lower-case.
Thanx to Satya Nemana for reporting this problem (via private email).

1.06 Sat Feb 5 15:57:00 2011
- Fix a bug where, if the 2nd or subsequent attribute had an empty string for the value,
Expand Down
24 changes: 14 additions & 10 deletions Changelog.ini
@@ -1,19 +1,23 @@
[Module]
Name=HTML::Parser::Simple
Changelog.Creator=Module::Metadata::Changes V 1.08
Changelog.Parser=Config::IniFiles V 2.66
Changelog.Creator=Module::Metadata::Changes V 2.04
Changelog.Parser=Config::IniFiles V 2.78

[V 1.07]
Date=2011-02-15T17:02:00
Date=2012-11-18T12:06:00
Comments= <<EOT
- Replace /usr/bin/perl with /usr/bin/env perl.
- Replace common::sense with use strict and use warnings to get uninit var warnings.
- Replace common::sense with use strict and use warnings, to get uninit var warnings.
- Fix bug whereby code was assuming all tags were in lower-case.
This lead to the error: Can't locate object method "getParent" via package "root"...
Due to the fix, the code now outputs all tags and attributes as lower-case.
Thanx to Satya Nemana for reporting this problem (via private email).
EOT
[V 1.06]
Date=2011-02-05T15:57:00
Comments= <<EOT
- Fix a bug where if the 2nd or subsequent attribute had an empty string for the value
- Fix a bug where, if the 2nd or subsequent attribute had an empty string for the value,
the code looped forever. Eg: name="a name" value="". For RT#65466.
- Add corresponding test t/empty_attribute.t.
- Clean up the code where the length of the attribute string was tested before spaces were trimmed.
Expand Down Expand Up @@ -53,14 +57,14 @@ Comments= <<EOT
- Patch Simple.pm to accept xhtml as a parameter to new
- Patch Simple.pm to use xhtml in a few of places. XHTML support is not finished!
- Patch Simple.pm to use accessors for object attributes as per PBP. Specifically:
get/set_*() for current_node depth input_dir node_type output_dir root verbose xhtml
- Hence rename root() to get_root()
- Hence rename verbose() to get_verbose()
- Rename new_node() to create_new_node() since that makes more sense when using get/set_*()
get/set_*() for current_node, depth, input_dir, node_type, output_dir, root, verbose, xhtml
- Hence, rename root() to get_root()
- Hence, rename verbose() to get_verbose()
- Rename new_node() to create_new_node(), since that makes more sense when using get/set_*()
- There are no methods get_result() and set_result(). The reason is efficiency. If we had
$self -> set_result($self -> get_result() + '<tag>') it would mean duplicating the result so far
each time a few chars were added
- Ship various tests with data for XHTML
- Ship various tests, with data, for XHTML
- Add depth to the hashref of data for each tag's node in the tree
- Put the code in github: git://github.com/ronsavage/html--parser--simple.git
EOT
Expand Down
53 changes: 53 additions & 0 deletions META.json
@@ -0,0 +1,53 @@
{
"abstract" : "Parse nice HTML files without needing a compiler",
"author" : [
"Ron Savage <ron@savage.net.au>"
],
"dynamic_config" : 1,
"generated_by" : "Module::Build version 0.4003, CPAN::Meta::Converter version 2.120630",
"license" : [
"artistic_1"
],
"meta-spec" : {
"url" : "http://search.cpan.org/perldoc?CPAN::Meta::Spec",
"version" : "2"
},
"name" : "HTML-Parser-Simple",
"prereqs" : {
"build" : {
"requires" : {
"Test::More" : "0",
"Test::Pod" : "0"
}
},
"configure" : {
"requires" : {
"Module::Build" : "0.36"
}
},
"runtime" : {
"requires" : {
"Carp" : "0",
"Tree::Simple" : "0",
"perl" : "5.006"
}
}
},
"provides" : {
"HTML::Parser::Simple" : {
"file" : "lib/HTML/Parser/Simple.pm",
"version" : "1.07"
},
"HTML::Parser::Simple::Attributes" : {
"file" : "lib/HTML/Parser/Simple/Attributes.pm",
"version" : "1.07"
}
},
"release_status" : "stable",
"resources" : {
"license" : [
"http://www.perlfoundation.org/artistic_license_1_0"
]
},
"version" : "1.07"
}
6 changes: 3 additions & 3 deletions META.yml
Expand Up @@ -17,14 +17,14 @@ name: HTML-Parser-Simple
provides:
HTML::Parser::Simple:
file: lib/HTML/Parser/Simple.pm
version: 1.06
version: 1.07
HTML::Parser::Simple::Attributes:
file: lib/HTML/Parser/Simple/Attributes.pm
version: 1.06
version: 1.07
requires:
Carp: 0
Tree::Simple: 0
perl: 5.006
resources:
license: http://www.perlfoundation.org/artistic_license_1_0
version: 1.06
version: 1.07
16 changes: 11 additions & 5 deletions lib/HTML/Parser/Simple.pm
Expand Up @@ -36,7 +36,7 @@ our @EXPORT = qw(
);

our $VERSION = '1.06';
our $VERSION = '1.07';

# -----------------------------------------------

Expand Down Expand Up @@ -418,12 +418,12 @@ sub new

$self -> set_depth(0);
$self -> set_node_type('global');
$self -> set_current_node($self -> create_new_node('root', '') );
$self -> set_current_node($self -> create_new_node('root', '', Tree::Simple -> ROOT) );
$self -> set_root($self -> get_current_node() );

if ($self -> get_xhtml() )
{
# Compared to the non-XHTML re, this has a extra ':' just under the ':'.
# Compared to the non-XHTML re, this has a extra ':' just under that ':'.

$$self{'_tag_with_attribute'} = q#^(<(\w+)((?:\s+[-:\w]+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>)#;
}
Expand Down Expand Up @@ -482,7 +482,9 @@ sub parse
{
if (substr($html, 0, 1) eq '<')
{
if ($html =~ /$$self{'_tag_with_attribute'}/)
# Warning: lc() means all tags and attributes are captured in lc.

if (lc($html) =~ /$$self{'_tag_with_attribute'}/)
{
substr($html, 0, length $1) = '';
$in_content = 0;
Expand Down Expand Up @@ -574,7 +576,9 @@ sub parse
{
my($re) = "(.*)<\/$stack[$#stack]\[^>]*>";

if ($html =~ /$re/s)
# lc() is needed because only lc tag names are pushed onto the stack.

if (lc($html) =~ /$re/s)
{
my($text) = $1;
$text =~ s/<!--(.*?)-->/$1/g;
Expand Down Expand Up @@ -619,6 +623,7 @@ sub parse
sub parse_end_tag
{
my($self, $tag_name, $stack) = @_;
$tag_name = lc $tag_name;

# Find the closest opened tag of the same name.

Expand Down Expand Up @@ -703,6 +708,7 @@ sub parse_file
sub parse_start_tag
{
my($self, $tag_name, $attributes, $unary, $stack) = @_;
$tag_name = lc $tag_name;

if ($$self{'_block'}{$tag_name})
{
Expand Down
2 changes: 1 addition & 1 deletion lib/HTML/Parser/Simple/Attributes.pm
Expand Up @@ -5,7 +5,7 @@ use warnings;

use Carp;

our $VERSION = '1.06';
our $VERSION = '1.07';

# -----------------------------------------------

Expand Down
4 changes: 2 additions & 2 deletions scripts/parse.html.pl
Expand Up @@ -13,8 +13,8 @@
my($p) = HTML::Parser::Simple -> new
(
{
input_dir => '/var/www',
output_dir => '/tmp',
input_dir => '/home/ron',
output_dir => '/home/ron',
verbose => 1,
}
);
Expand Down

0 comments on commit 66af853

Please sign in to comment.