fix most common cases for broken abbreviation detection in SentenceIterator
commit fff1fde392a34a8602b8bc07e4575f86fd80d5c0 1 parent 5a142ec
jdf authored
Showing with 18 additions and 10 deletions.
  1. +0 −4 readme.markdown
  2. +18 −6 src/cue/lang/SentenceIterator.java
4 readme.markdown
@@ -147,10 +147,6 @@ language guessing.
The iterators all operate on Strings, not Readers, which makes this library
unsuitable for use on texts too large to fit in memory.
-The SentenceIterator incorrectly breaks on "Mrs." and "Ms.", though it works
-just fine with "Mr.". I have reported this bug to Sun, since my
-SentenceIterator relies on the JDK BreakIterator.
-
## Help needed! ##
cue.language has exactly 0% test coverage. Fastidious programmers
24 src/cue/lang/SentenceIterator.java
@@ -18,6 +18,7 @@
import java.text.BreakIterator;
import java.util.Locale;
import java.util.NoSuchElementException;
+import java.util.regex.Pattern;
/**
* Construct with a {@link String}; retrieve a sequence of {@link String}s, each of
@@ -47,8 +48,20 @@ public SentenceIterator(final String text, final Locale locale)
         this.text = text;
         breakIterator = BreakIterator.getSentenceInstance(locale);
         breakIterator.setText(text);
-        start = breakIterator.first();
-        end = breakIterator.next();
+        start = end = breakIterator.first();
+        advance();
+    }
+
+    private static final Pattern ABBREVS = Pattern.compile("(?:Mrs?|Ms|Dr|Rev)\\.\\s*$");
+
+    private void advance()
+    {
+        start = end;
+        while (hasNext()
+                && ((end == start) || ABBREVS.matcher(text.substring(start, end)).find()))
+        {
+            end = breakIterator.next();
+        }
     }
     public void remove()
@@ -58,17 +71,16 @@ public void remove()
     public String next()
     {
-        if (end == BreakIterator.DONE)
+        if (!hasNext())
         {
             throw new NoSuchElementException();
         }
         final String result = text.substring(start, end).replaceAll("\\s+", " ");
-        start = end;
-        end = breakIterator.next();
+        advance();
         return result;
     }
-    public boolean hasNext()
+    public final boolean hasNext()
     {
         return end != BreakIterator.DONE;
     }
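
The merging idea in this diff can be tried in isolation. The following is a minimal standalone sketch, not the code from the commit: the class name `AbbrevSentences` and the method `split` are hypothetical, and only the `ABBREVS` pattern is copied from the diff. It asks the JDK `BreakIterator` for sentence boundaries and keeps extending the current candidate while it ends in one of the listed abbreviations. Unlike the diff, it trims each result, and it assumes that a final fragment ending in an abbreviation (for example a text that ends in "Dr.") should still be returned.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class AbbrevSentences {
    // Same pattern as in the diff: a known title, a period, optional trailing whitespace.
    private static final Pattern ABBREVS =
            Pattern.compile("(?:Mrs?|Ms|Dr|Rev)\\.\\s*$");

    // Hypothetical helper: splits text into sentences, merging any boundary
    // that falls directly after one of the abbreviations above.
    public static List<String> split(final String text, final Locale locale) {
        final BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        final List<String> out = new ArrayList<>();
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            // Extend the candidate while it ends in an abbreviation
            // and another boundary is available.
            while (ABBREVS.matcher(text.substring(start, end)).find()) {
                final int next = it.next();
                if (next == BreakIterator.DONE) {
                    break; // assumption: keep the trailing fragment as-is
                }
                end = next;
            }
            // Collapse whitespace runs as the diff does; trimming is this sketch's choice.
            final String sentence = text.substring(start, end).replaceAll("\\s+", " ").trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
            start = end;
            end = it.next();
        }
        return out;
    }

    public static void main(final String[] args) {
        for (final String s : split("Mrs. Smith went home. She slept.", Locale.US)) {
            System.out.println(s);
        }
    }
}
```

Because the merge step runs on top of whatever boundaries `BreakIterator` reports, the result should be the same whether or not a given JDK version breaks after "Mrs." on its own.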