fix most common cases for broken abbreviation detection in SentenceIterator
commit fff1fde392a34a8602b8bc07e4575f86fd80d5c0 1 parent 5a142ec
Jonathan Feinberg jdf authored
Showing with 18 additions and 10 deletions.
  1. +0 −4 readme.markdown
  2. +18 −6 src/cue/lang/SentenceIterator.java
4 readme.markdown
@@ -147,10 +147,6 @@ language guessing.
The iterators all operate on Strings, not Readers, which makes this library
unsuitable for use on texts too large to fit in memory.
-The SentenceIterator incorrectly breaks on "Mrs." and "Ms.", though it works
-just fine with "Mr.". I have reported this bug to Sun, since my
-SentenceIterator relies on the JDK BreakIterator.
-
## Help needed! ##
cue.language has exactly 0% test coverage. Fastidious programmers
24 src/cue/lang/SentenceIterator.java
@@ -18,6 +18,7 @@
import java.text.BreakIterator;
import java.util.Locale;
import java.util.NoSuchElementException;
+import java.util.regex.Pattern;
/**
* Construct with a {@link String}; retrieve a sequence of {@link String}s, each of
@@ -47,8 +48,20 @@ public SentenceIterator(final String text, final Locale locale)
this.text = text;
breakIterator = BreakIterator.getSentenceInstance(locale);
breakIterator.setText(text);
- start = breakIterator.first();
- end = breakIterator.next();
+ start = end = breakIterator.first();
+ advance();
+ }
+
+ private static final Pattern ABBREVS = Pattern.compile("(?:Mrs?|Ms|Dr|Rev)\\.\\s*$");
+
+ private void advance()
+ {
+ start = end;
+ while (hasNext()
+ && ((end == start) || ABBREVS.matcher(text.substring(start, end)).find()))
+ {
+ end = breakIterator.next();
+ }
}
public void remove()
@@ -58,17 +71,16 @@ public void remove()
public String next()
{
- if (end == BreakIterator.DONE)
+ if (!hasNext())
{
throw new NoSuchElementException();
}
final String result = text.substring(start, end).replaceAll("\\s+", " ");
- start = end;
- end = breakIterator.next();
+ advance();
return result;
}
- public boolean hasNext()
+ public final boolean hasNext()
{
return end != BreakIterator.DONE;
}
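
The merging idea in this diff can be tried outside the iterator. What follows is a minimal standalone sketch, not the library's code: the class name `AbbrevAwareSplitter` and its `split` method are hypothetical, and only the `ABBREVS` pattern is copied from the change above. It walks the boundaries reported by the JDK `BreakIterator` and skips any boundary whose preceding text ends in one of the listed abbreviations. It makes two choices that differ from the committed code, both assumptions of this sketch: the final boundary is always honored, so a text that ends in "Dr." keeps its last sentence, and each sentence is trimmed.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of cue.language.
final class AbbrevAwareSplitter {
    // Same pattern as the ABBREVS constant in the diff above.
    private static final Pattern ABBREVS =
            Pattern.compile("(?:Mrs?|Ms|Dr|Rev)\\.\\s*$");

    static List<String> split(final String text, final Locale locale) {
        final BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        final List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            final boolean lastBoundary = end == text.length();
            // Skip a boundary that falls right after a known abbreviation,
            // unless it is the end of the text (assumption of this sketch).
            if (!lastBoundary && ABBREVS.matcher(text.substring(start, end)).find()) {
                continue;
            }
            // Collapse whitespace runs as the iterator does; trimming is an addition.
            sentences.add(text.substring(start, end).replaceAll("\\s+", " ").trim());
            start = end;
        }
        return sentences;
    }
}
```

Calling `AbbrevAwareSplitter.split("Mrs. Smith went home. She slept.", Locale.US)` should yield two sentences whether or not the underlying `BreakIterator` reports a boundary after "Mrs.", since a boundary there is skipped and the text is otherwise left alone.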