
FIX a bug in feature_extraction.text.strip_accents_unicode #15100

Merged
1 commit merged into scikit-learn:master from DGrady:fix-strip-accents-bug on Oct 3, 2019

Conversation

DGrady (Contributor) commented Sep 26, 2019

Reference Issues/PRs

Fixes #15087

What does this implement/fix? Explain your changes.

strip_accents_unicode contained a check to see if applying NFKD normalization to the input string changed it. If the string was unchanged, then it would not attempt to remove accents. This meant that if an input string was already in NFKD form and also contained accents, the accents were not removed.

Now, strip_accents_unicode always filters out combining characters after applying NFKD normalization.
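For illustration, a minimal standalone sketch of the behaviour described above (strip_accents_fixed is an illustrative stand-in for the patched function, not the library code verbatim):

import unicodedata

# "n" followed by U+0303 COMBINING TILDE is already in NFKD form, so the old
# "did normalization change the string?" check returned it untouched, accent and all.
already_nfkd = "n\u0303"   # decomposed form of "ñ"
precomposed = "\u00f1"     # precomposed "ñ"; NFKD decomposes it, so it was handled correctly before
assert unicodedata.normalize("NFKD", already_nfkd) == already_nfkd

def strip_accents_fixed(s):
    # Behaviour after this fix: always normalize, then drop combining characters.
    normalized = unicodedata.normalize("NFKD", s)
    return "".join(c for c in normalized if not unicodedata.combining(c))

assert strip_accents_fixed(already_nfkd) == "n"
assert strip_accents_fixed(precomposed) == "n"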

Any other comments?

jnothman (Member) left a comment

I presume the if was in there for efficiency on English and other languages without much combination. Our preprocessing and tokenisation are slow in general, and I wonder if we can significantly improve this removal of combining characters in Cython.

jnothman (Member) commented Sep 26, 2019

Thanks!

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
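For example, an entry along these lines (hypothetical wording, following the |Fix| / :pr: / :user: conventions used by the other entries in that file) would do:

- |Fix| Fixed a bug in :func:`feature_extraction.text.strip_accents_unicode` where
  accents were not removed from strings that were already in NFKD form.
  :pr:`15100` by :user:`Daniel Grady <DGrady>`.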

rth (Member) commented Sep 26, 2019

Thanks! Could you please benchmark CountVectorizer before and after this fix, say on datasets.fetch_20newsgroups()?

DGrady (Contributor, Author) commented Sep 27, 2019

@rth sure. I think it would also be useful to look at benchmarks on data that is not primarily ASCII-compatible. I used the following methodology to create a set of "documents" by randomly sampling code points from certain ranges, and chose the number and length of the strings to be roughly the same as the newsgroups data. (I'm not trying to present this as a way to get realistic non-ASCII text; this is just to get a set of data that will not trigger the fast path in the string processing functions.)

import random

ranges = (
    range(0x0020, 0x007e + 1), # Basic Latin
    range(0x00a0, 0x00ff + 1), # Latin supplement; accented Latin letters
    range(0x0400, 0x04ff + 1), # Cyrillic
    range(0x0600, 0x06ff + 1), # Arabic
    range(0x4e00, 0x62ff + 1), # Some CJK
)

code_points = list()
for r in ranges:
    code_points.extend(r)

def random_string(n):
    return "".join(chr(i) for i in random.choices(code_points, k=n))

random_corpus = [random_string(2_000) for _ in range(10_000)]

with open("random_corpus.txt", "w", encoding="UTF-8") as f:
    f.write("\n".join(random_corpus))

To benchmark, I used IPython's %timeit magic with the following methodology, running each case in a fresh interpreter:

from pathlib import Path
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
newsgroups = fetch_20newsgroups()
random_corpus = Path("random_corpus.txt").read_text().splitlines()
%timeit -r50 CountVectorizer(strip_accents="unicode").fit(# CORPUS #)

where # CORPUS # is either newsgroups.data or random_corpus. The results were

Case        Newsgroups                  Random strings
master      1.87 s ± 183 ms per loop    4.18 s ± 468 ms per loop
This PR     3.59 s ± 219 ms per loop    4.03 s ± 332 ms per loop
Variant 1   2.23 s ± 237 ms per loop    4.63 s ± 463 ms per loop
Variant 2   1.85 s ± 273 ms per loop    4.12 s ± 269 ms per loop

Since the PR currently introduces a significant slowdown on the newsgroups data, I tried a couple of guards in the spirit of the existing code. Variant 1 is

def strip_accents_unicode(s):
    if ord(max(s)) < 128:
        # Fast path: the highest code point is ASCII, so there is nothing to strip
        return s
    else:
        normalized = unicodedata.normalize('NFKD', s)
        return ''.join([c for c in normalized if not unicodedata.combining(c)])

and variant 2 is

def strip_accents_unicode(s):
    try:
        # Fast path: a pure-ASCII string cannot contain accents
        s.encode("ASCII", errors="strict")
        return s
    except UnicodeEncodeError:
        normalized = unicodedata.normalize('NFKD', s)
        return ''.join([c for c in normalized if not unicodedata.combining(c)])

I have read that benchmarking things accurately is difficult, and I'm not sure how seriously to take these numbers. To the extent we believe them, they seem to indicate that none of these cases make much difference for random non-ASCII-compatible strings, but that it's useful to have a guard to fast-path ASCII-compatible strings.

I guess we should go with variant 2, if that looks all right to everyone?
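As a quick sanity check (a sketch assuming the variant 2 definition above, with unicodedata imported), the ASCII guard does not reintroduce the original bug: a string that is already in NFKD form but still carries combining marks is not ASCII-encodable, so it takes the slow path and the accents are removed.

assert strip_accents_unicode("n\u0303") == "n"                 # decomposed "ñ" is still stripped
assert strip_accents_unicode("plain ascii") == "plain ascii"   # ASCII input takes the fast path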

Commit by Daniel Grady:
strip_accents_unicode contained a check to see if applying NFKD
normalization to the input string changed it. If the string was
unchanged, then it would not attempt to remove accents. This meant
that if an input string was already in NFKD form and also contained
accents, the accents were not removed.

Now, strip_accents_unicode always filters out combining characters
after applying NFKD normalization.
DGrady force-pushed the DGrady:fix-strip-accents-bug branch from 2f874a8 to bb9e237 on Sep 30, 2019
DGrady requested a review from jnothman on Sep 30, 2019
rth (Member) commented Oct 1, 2019

Thanks for the detailed benchmarks. In Python 3.7+ there is str.isascii, which I think should be used instead of catching the exception.

Could you try,

diff --git a/sklearn/feature_extraction/text.py b/sklearn/feature_extraction/text.py
index bb5a9d646..b3862ce35 100644
--- a/sklearn/feature_extraction/text.py
+++ b/sklearn/feature_extraction/text.py
@@ -111,6 +111,16 @@ def _analyze(doc, analyzer=None, tokenizer=None, ngrams=None,
     return doc
 
 
+def _isascii(s):
+    # XXX: this function implements str.isascii from Python 3.7+
+    #      and can be removed once support for Python <=3.6 is dropped.
+    try:
+        s.encode("ASCII", errors="strict")
+        return True
+    except UnicodeEncodeError:
+        return False
+
+
 def strip_accents_unicode(s):
     """Transform accentuated unicode symbols into their simple counterpart
 
@@ -129,10 +139,17 @@ def strip_accents_unicode(s):
         Remove accentuated char for any unicode symbol that has a direct
         ASCII equivalent.
     """
-    normalized = unicodedata.normalize('NFKD', s)
-    if normalized == s:
+    if hasattr(s, "isascii"):
+        # Python 3.7+
+        ascii_only = s.isascii()
+    else:
+        ascii_only = _isascii(s)
+
+    if ascii_only:
         return s
     else:
+        normalized = unicodedata.normalize('NFKD', s)
+
         return ''.join([c for c in normalized if not unicodedata.combining(c)])

If it's slower, we can still go with your version 2, but with a comment saying that we should use str.isascii once support for Python 3.6 is dropped.

jnothman (Member) left a comment

Please avoid force-pushing. It makes it harder for us to see that you have pushed, and to see what's changed.

Variant 2 looks good, and we could look into a Cython implementation if appropriate (I can't find a C API for unicodedata.combining right now).

rth (Member) commented Oct 3, 2019

> (I can't find a C API for unicodedata.combining right now).

Doesn't look like it's exposed in the public API?

https://github.com/python/cpython/search?q=unicodedata_UCD_combining_impl

and rg "combining" /usr/include/python3.7m/ doesn't find anything.

rth (Member) approved these changes Oct 3, 2019

Thanks @DGrady !

rth changed the title from "Fix a bug in strip_accents_unicode" to "FIX a bug in feature_extraction.text.strip_accents_unicode" on Oct 3, 2019
rth merged commit aada3ea into scikit-learn:master on Oct 3, 2019
17 checks passed