Whine on first non-ASCII byte-sequence and use it to guess Latin-1 vs UTF-8
grantm committed Apr 27, 2012
1 parent c11562c commit b67b902
Showing 6 changed files with 138 additions and 0 deletions.
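
The guessing rule named in the commit title can be sketched as a standalone function. This is a minimal illustration rather than the module's code: `guess_encoding` is a hypothetical helper that only mirrors the two regular expressions used by `_try_encoding_guess` in this commit, and it does not touch the parser state, the transcoder, or the warning machinery.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Pod::Simple). Returns nothing for
# pure-ASCII input; otherwise returns the encoding name the heuristic picks.
sub guess_encoding {
    my ($bytes) = @_;

    # No byte above 0x7F: nothing to guess.
    return unless $bytes =~ /[^\x00-\x7f]/;

    # A lead byte in 0xC0-0xFD followed by a continuation byte in 0x80-0xBF
    # looks like a UTF-8 multi-byte sequence; anything else is taken as Latin-1.
    return $bytes =~ /[\xC0-\xFD][\x80-\xBF]/ ? 'UTF-8' : 'ISO8859-1';
}

# Sample inputs written as raw bytes so the script itself stays ASCII.
my @samples = (
    [ 'plain ASCII',         "cafe"             ],
    [ 'Latin-1 e-acute',     "caf\xE9"          ],
    [ 'UTF-8 euro sign',     "\xE2\x82\xAC9.99" ],
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    my $guess = guess_encoding($bytes);
    printf "%-18s => %s\n", $label, defined $guess ? $guess : '(no guess)';
}
```

The check is a heuristic, not a validation. Latin-1 text that happens to contain a byte in 0xC0-0xFD directly followed by one in 0x80-0xBF (for example the bytes 0xC3 0xA9) would be guessed as UTF-8.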
21 changes: 21 additions & 0 deletions lib/Pod/Simple/BlackBox.pm
@@ -123,6 +123,9 @@ sub parse_lines { # Usage: $parser->parse_lines(@lines)
}
}

if(!$self->{'encoding'}) {
  $self->_try_encoding_guess($line)
}

DEBUG > 5 and print "# Parsing line: [$line]\n";

@@ -395,6 +398,24 @@ sub _handle_encoding_second_level {
return;
}

sub _try_encoding_guess {
  my ($self,$line) = @_;

  return unless $line =~ /[^\x00-\x7f]/;  # Look for non-ASCII byte

  my $encoding = $line =~ /[\xC0-\xFD][\x80-\xBF]/ ? 'UTF-8' : 'ISO8859-1';
  $self->_handle_encoding_line( "=encoding $encoding" );
  $self->{'_transcoder'} && $self->{'_transcoder'}->($line);

  my ($word) = $line =~ /(\S*[^\x00-\x7f]\S*)/;

  $self->whine(
    $self->{'line_count'},
    "Non-ASCII character seen before =encoding in '$word'. Assuming $encoding"
  );

}

#~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`~`

{
11 changes: 11 additions & 0 deletions t/corpus/encwarn01.txt
@@ -0,0 +1,11 @@

=head1 NAME

Encoding Warning 1 - implicitly Latin-1

=head2 DESCRIPTION

This line should warn that the word café contains a non-ASCII character.

But château should not generate a warning - once is enough.

38 changes: 38 additions & 0 deletions t/corpus/encwarn01.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
<Document start_line="2">
<head1 start_line="2">
NAME
</head1>
<Para start_line="4">
Encoding Warning 1 - implicitly Latin-1
</Para>
<head2 start_line="6">
DESCRIPTION
</head2>
<Para start_line="8">
This line should warn that the word caf&#233; contains a
non-ASCII character.
</Para>
<Para start_line="10">
But ch&#226;teau should not generate a warning - once is
enough.
</Para>
<head1 errata="1" start_line="-321">
POD ERRORS
</head1>
<Para errata="1" start_line="-321">
Hey!
<B>
The above document had some coding errors, which are explained
below:
</B>
</Para>
<over-text errata="1" indent="4" start_line="-321">
<item-text start_line="-321">
Around line 8:
</item-text>
<Para start_line="-321">
Non-ASCII character seen before =encoding in &#39;caf&#233;&#39;.
Assuming ISO8859-1
</Para>
</over-text>
</Document>
11 changes: 11 additions & 0 deletions t/corpus/encwarn02.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@

=head1 NAME

Encoding Warning 1 - implicitly UTF-8

=head2 DESCRIPTION

This line should warn that the price €9.99 contains a non-ASCII character.

But château should not generate a warning - once is enough.

38 changes: 38 additions & 0 deletions t/corpus/encwarn02.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
<Document start_line="2">
<head1 start_line="2">
NAME
</head1>
<Para start_line="4">
Encoding Warning 1 - implicitly UTF-8
</Para>
<head2 start_line="6">
DESCRIPTION
</head2>
<Para start_line="8">
This line should warn that the price &#8364;9.99 contains
a non-ASCII character.
</Para>
<Para start_line="10">
But ch&#226;teau should not generate a warning - once is
enough.
</Para>
<head1 errata="1" start_line="-321">
POD ERRORS
</head1>
<Para errata="1" start_line="-321">
Hey!
<B>
The above document had some coding errors, which are explained
below:
</B>
</Para>
<over-text errata="1" indent="4" start_line="-321">
<item-text start_line="-321">
Around line 8:
</item-text>
<Para start_line="-321">
Non-ASCII character seen before =encoding in &#39;&#8364;9.99&#39;.
Assuming UTF-8
</Para>
</over-text>
</Document>
19 changes: 19 additions & 0 deletions t/corpus/lat1frim.xml
@@ -67,4 +67,23 @@
<Para start_line="33">
[end]
</Para>
<head1 errata="1" start_line="-321">
POD ERRORS
</head1>
<Para errata="1" start_line="-321">
Hey!
<B>
The above document had some coding errors, which are explained
below:
</B>
</Para>
<over-text errata="1" indent="4" start_line="-321">
<item-text start_line="-321">
Around line 11:
</item-text>
<Para start_line="-321">
Non-ASCII character seen before =encoding in &#39;s&#233;parant&#39;.
Assuming ISO8859-1
</Para>
</over-text>
</Document>
