
FIX a bug in feature_extraction.text.strip_accents_unicode #15100

Merged
merged 1 commit into scikit-learn:master from the fix-strip-accents-bug branch on Oct 3, 2019

Conversation

Contributor

@DGrady DGrady commented Sep 26, 2019

Reference Issues/PRs

Fixes #15087

What does this implement/fix? Explain your changes.

strip_accents_unicode contained a check to see if applying NFKD normalization to the input string changed it. If the string was unchanged, then it would not attempt to remove accents. This meant that if an input string was already in NFKD form and also contained accents, the accents were not removed.

Now, strip_accents_unicode always filters out combining characters after applying NFKD normalization.
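
For illustration, here is a minimal sketch of the behaviour being fixed (the test string is a hypothetical example, not code from the library):

import unicodedata

# "n" followed by U+0303 (COMBINING TILDE) renders as "ñ" and is already in
# NFKD form, so the old equality check returned it unchanged, accent and all.
s = "n\u0303"
assert unicodedata.normalize("NFKD", s) == s

# With the fix, combining characters are always filtered out after normalization:
stripped = "".join(
    c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c)
)
assert stripped == "n"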

Any other comments?

Member

@jnothman jnothman left a comment

I presume the if check was there for efficiency on English and other languages without many combining characters. Our preprocessing and tokenisation are slow in general, and I wonder whether we can significantly speed up this removal of combining characters in Cython.

Member

@jnothman jnothman commented Sep 26, 2019

Thanks!

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
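
For reference, such an entry might look roughly like the following (a hypothetical sketch modeled on the format of neighbouring entries; the exact wording and attribution are up to the contributor):

- |Fix| :func:`feature_extraction.text.strip_accents_unicode` now correctly
  removes accents from strings that are already in NFKD form.
  :pr:`15100` by :user:`DGrady`.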

Member

@rth rth commented Sep 26, 2019

Thanks! Could you please benchmark CountVectorizer before and after this fix, say on datasets.fetch_20newsgroups()?

Contributor Author

@DGrady DGrady commented Sep 27, 2019

@rth sure. I think it would also be useful to look at benchmarks on data that is not primarily ASCII-compatible. I used the following methodology to create a set of "documents" by randomly sampling code points from certain ranges, and chose the number and length of the strings to be roughly the same as the newsgroups data. (I'm not trying to present this as a way to get realistic non-ASCII text; this is just to get a set of data that will not trigger the fast path in the string processing functions.)

import random

ranges = (
    range(0x0020, 0x007e + 1), # Basic Latin
    range(0x00a0, 0x00ff + 1), # Latin supplement; accented Latin letters
    range(0x0400, 0x04ff + 1), # Cyrillic
    range(0x0600, 0x06ff + 1), # Arabic
    range(0x4e00, 0x62ff + 1), # Some CJK
)

code_points = list()
for r in ranges:
    code_points.extend(r)

def random_string(n):
    return "".join(chr(i) for i in random.choices(code_points, k=n))

random_corpus = [random_string(2_000) for _ in range(10_000)]

with open("random_corpus.txt", "w", encoding="UTF-8") as f:
    f.write("\n".join(random_corpus))

To benchmark, I used IPython's %timeit magic with the following methodology, running each case in a fresh interpreter:

from pathlib import Path
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
newsgroups = fetch_20newsgroups()
random_corpus = Path("random_corpus.txt").read_text().splitlines()
%timeit -r50 CountVectorizer(strip_accents="unicode").fit(# CORPUS #)

where # CORPUS # is either newsgroups.data or random_corpus. The results were

| Case      | Newsgroups               | Random strings           |
|-----------|--------------------------|--------------------------|
| master    | 1.87 s ± 183 ms per loop | 4.18 s ± 468 ms per loop |
| This PR   | 3.59 s ± 219 ms per loop | 4.03 s ± 332 ms per loop |
| Variant 1 | 2.23 s ± 237 ms per loop | 4.63 s ± 463 ms per loop |
| Variant 2 | 1.85 s ± 273 ms per loop | 4.12 s ± 269 ms per loop |

Since the PR currently introduces a significant slowdown on the newsgroups data, I tried a couple of guards in the spirit of the existing code. Variant 1 is

def strip_accents_unicode(s):
    if ord(max(s)) < 128:
        return s
    else:
        normalized = unicodedata.normalize('NFKD', s)
        return ''.join([c for c in normalized if not unicodedata.combining(c)])

and variant 2 is

def strip_accents_unicode(s):
    try:
        s.encode("ASCII", errors="strict")
        return s
    except UnicodeEncodeError:
        normalized = unicodedata.normalize('NFKD', s)
        return ''.join([c for c in normalized if not unicodedata.combining(c)])

Benchmarking accurately is difficult, so I'm not sure how much weight to give these numbers. To the extent we believe them, they indicate that none of these variants makes much difference for random non-ASCII-compatible strings, but that a guard to fast-path ASCII-compatible strings is worthwhile.

I guess we should go with variant 2, if that looks all right to everyone?

@DGrady DGrady force-pushed the fix-strip-accents-bug branch from 2f874a8 to bb9e237 Compare Sep 30, 2019
@DGrady DGrady requested a review from jnothman Sep 30, 2019
Member

@rth rth commented Oct 1, 2019

Thanks for the detailed benchmarks. In Python 3.7+ there is str.isascii, which should be used instead of catching the exception, I think.

Could you try,

diff --git a/sklearn/feature_extraction/text.py b/sklearn/feature_extraction/text.py
index bb5a9d646..b3862ce35 100644
--- a/sklearn/feature_extraction/text.py
+++ b/sklearn/feature_extraction/text.py
@@ -111,6 +111,16 @@ def _analyze(doc, analyzer=None, tokenizer=None, ngrams=None,
     return doc
 
 
+def _isascii(s):
+    # XXX: this function implements str.isascii from Python 3.7+
+    #      and can be removed once support for Python <=3.6 is dropped.
+    try:
+        s.encode("ASCII", errors="strict")
+        return True
+    except UnicodeEncodeError:
+        return False
+
+
 def strip_accents_unicode(s):
     """Transform accentuated unicode symbols into their simple counterpart
 
@@ -129,10 +139,17 @@ def strip_accents_unicode(s):
         Remove accentuated char for any unicode symbol that has a direct
         ASCII equivalent.
     """
-    normalized = unicodedata.normalize('NFKD', s)
-    if normalized == s:
+    if hasattr(s, "isascii"):
+        # Python 3.7+
+        ascii_only = s.isascii()
+    else:
+        ascii_only = _isascii(s)
+
+    if ascii_only:
         return s
     else:
+        normalized = unicodedata.normalize('NFKD', s)
+
         return ''.join([c for c in normalized if not unicodedata.combining(c)])

If it's slower, we can still go with your variant 2, but with a comment saying that we should use str.isascii once support for Python 3.6 is dropped.
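
Concretely, variant 2 with such a comment might read as follows (a sketch under that suggestion, not necessarily the final code):

import unicodedata

def strip_accents_unicode(s):
    try:
        # TODO: replace this try/except with str.isascii once support for
        # Python <= 3.6 is dropped
        s.encode("ASCII", errors="strict")
        return s
    except UnicodeEncodeError:
        normalized = unicodedata.normalize('NFKD', s)
        return ''.join([c for c in normalized if not unicodedata.combining(c)])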

Member

@jnothman jnothman left a comment

Please avoid force-pushing. It makes it harder for us to see that you have pushed, and to see what's changed.

Variant 2 looks good, and we could look into a Cython implementation if appropriate (I can't find a C API for unicodedata.combining right now).

Member

@rth rth commented Oct 3, 2019

> (I can't find a C API for unicodedata.combining right now.)

Doesn't look like it's exposed in the public API?

https://github.com/python/cpython/search?q=unicodedata_UCD_combining_impl

and rg "combining" /usr/include/python3.7m/ doesn't find anything.

rth approved these changes Oct 3, 2019
Member

@rth rth left a comment

Thanks @DGrady!

@rth rth changed the title Fix a bug in strip_accents_unicode FIX a bug in feature_extraction.text.strip_accents_unicode Oct 3, 2019
@rth rth merged commit aada3ea into scikit-learn:master Oct 3, 2019
17 checks passed