Commit

fix most common cases for broken abbreviation detection in SentenceIterator
Jonathan Feinberg committed Dec 4, 2009
1 parent 5a142ec commit fff1fde
Showing 2 changed files with 18 additions and 10 deletions.
4 changes: 0 additions & 4 deletions readme.markdown
@@ -147,10 +147,6 @@ language guessing.
The iterators all operate on Strings, not Readers, which makes this library
unsuitable for use on texts too large to fit in memory.

-The SentenceIterator incorrectly breaks on "Mrs." and "Ms.", though it works
-just fine with "Mr.". I have reported this bug to Sun, since my
-SentenceIterator relies on the JDK BreakIterator.
-
## Help needed! ##

cue.language has exactly 0% test coverage. Fastidious programmers
24 changes: 18 additions & 6 deletions src/cue/lang/SentenceIterator.java
@@ -18,6 +18,7 @@
import java.text.BreakIterator;
import java.util.Locale;
import java.util.NoSuchElementException;
+import java.util.regex.Pattern;

/**
* Construct with a {@link String}; retrieve a sequence of {@link String}s, each of
@@ -47,8 +48,20 @@ public SentenceIterator(final String text, final Locale locale)
         this.text = text;
         breakIterator = BreakIterator.getSentenceInstance(locale);
         breakIterator.setText(text);
-        start = breakIterator.first();
-        end = breakIterator.next();
+        start = end = breakIterator.first();
+        advance();
     }
+
+    private static final Pattern ABBREVS = Pattern.compile("(?:Mrs?|Ms|Dr|Rev)\\.\\s*$");
+
+    private void advance()
+    {
+        start = end;
+        while (hasNext()
+                && ((end == start) || ABBREVS.matcher(text.substring(start, end)).find()))
+        {
+            end = breakIterator.next();
+        }
+    }

     public void remove()
@@ -58,17 +71,16 @@ public void remove()

     public String next()
     {
-        if (end == BreakIterator.DONE)
+        if (!hasNext())
         {
             throw new NoSuchElementException();
         }
         final String result = text.substring(start, end).replaceAll("\\s+", " ");
-        start = end;
-        end = breakIterator.next();
+        advance();
         return result;
     }

-    public boolean hasNext()
+    public final boolean hasNext()
     {
         return end != BreakIterator.DONE;
     }
