FIX a bug in feature_extraction.text.strip_accents_unicode #15100
Conversation
I presume the `if` was in there for efficiency on English and other languages with few combining characters. Our preprocessing and tokenisation is slow in general, and I wonder if we can improve this removal of combining characters significantly in Cython.
Thanks! Please add an entry to the change log at |
Thanks! Could you please benchmark |
@rth sure. I think it would also be useful to look at benchmarks on data that is not primarily ASCII-compatible. I used the following methodology to create a set of "documents" by randomly sampling code points from certain ranges, and chose the number and length of the strings to be roughly the same as the newsgroups data. (I'm not trying to present this as a way to get realistic non-ASCII text; this is just to get a set of data that will not trigger the fast path in the string processing functions.)

```python
import random

ranges = (
    range(0x0020, 0x007e + 1),  # Basic Latin
    range(0x00a0, 0x00ff + 1),  # Latin supplement; accented Latin letters
    range(0x0400, 0x04ff + 1),  # Cyrillic
    range(0x0600, 0x06ff + 1),  # Arabic
    range(0x4e00, 0x62ff + 1),  # Some CJK
)

code_points = list()
for r in ranges:
    code_points.extend(r)

def random_string(n):
    return "".join(chr(i) for i in random.choices(code_points, k=n))

random_corpus = [random_string(2_000) for _ in range(10_000)]

with open("random_corpus.txt", "w", encoding="UTF-8") as f:
    f.write("\n".join(random_corpus))
```

To benchmark, I used IPython's `%timeit`:

```python
from pathlib import Path

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroups = fetch_20newsgroups()
random_corpus = Path("random_corpus.txt").read_text().splitlines()

%timeit -r50 CountVectorizer(strip_accents="unicode").fit(# CORPUS #)
```

where `# CORPUS #` is either `newsgroups.data` or `random_corpus`.
Since the PR currently introduces a significant slowdown on the newsgroups data, I tried a couple of guards in the spirit of the existing code. Variant 1 is

```python
def strip_accents_unicode(s):
    if ord(max(s)) < 128:
        return s
    else:
        normalized = unicodedata.normalize('NFKD', s)
        return ''.join([c for c in normalized if not unicodedata.combining(c)])
```

and variant 2 is

```python
def strip_accents_unicode(s):
    try:
        s.encode("ASCII", errors="strict")
        return s
    except UnicodeEncodeError:
        normalized = unicodedata.normalize('NFKD', s)
        return ''.join([c for c in normalized if not unicodedata.combining(c)])
```

I have read that benchmarking things accurately is difficult, and I'm not sure how seriously to take these numbers. To the extent we believe them, they seem to indicate that none of these variants makes much difference for random non-ASCII-compatible strings, but that it's useful to have a guard to fast-path ASCII-compatible strings. I guess we should go with variant 2, if that looks all right to everyone?
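(A minimal, self-contained sketch of the kind of fast-path comparison discussed above, not part of the PR; the sample documents and repetition counts below are arbitrary assumptions rather than the corpora from the original benchmarks.)

```python
import timeit
import unicodedata

def strip_accents_no_guard(s):
    # always normalize and filter out combining characters
    normalized = unicodedata.normalize('NFKD', s)
    return ''.join([c for c in normalized if not unicodedata.combining(c)])

def strip_accents_guarded(s):
    # variant 2: return ASCII-only input untouched, otherwise fall back
    try:
        s.encode("ASCII", errors="strict")
        return s
    except UnicodeEncodeError:
        return strip_accents_no_guard(s)

ascii_doc = "plain ascii text " * 200
accented_doc = "café naïve règle " * 200

for func in (strip_accents_no_guard, strip_accents_guarded):
    for name, doc in [("ascii", ascii_doc), ("accented", accented_doc)]:
        elapsed = timeit.timeit(lambda: func(doc), number=1_000)
        print(f"{func.__name__:25s} {name:9s} {elapsed:.4f}s")
```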
Thanks for the detailed benchmarks. In Python 3.7+ there is `str.isascii`. Could you try,

```diff
diff --git a/sklearn/feature_extraction/text.py b/sklearn/feature_extraction/text.py
index bb5a9d646..b3862ce35 100644
--- a/sklearn/feature_extraction/text.py
+++ b/sklearn/feature_extraction/text.py
@@ -111,6 +111,16 @@ def _analyze(doc, analyzer=None, tokenizer=None, ngrams=None,
     return doc
 
 
+def _isascii(s):
+    # XXX: this function implements str.isascii from Python 3.7+
+    # and can be removed once support for Python <=3.6 is dropped.
+    try:
+        s.encode("ASCII", errors="strict")
+        return True
+    except UnicodeEncodeError:
+        return False
+
+
 def strip_accents_unicode(s):
     """Transform accentuated unicode symbols into their simple counterpart
 
@@ -129,10 +139,17 @@ def strip_accents_unicode(s):
     Remove accentuated char for any unicode symbol that has a direct
     ASCII equivalent.
     """
-    normalized = unicodedata.normalize('NFKD', s)
-    if normalized == s:
+    if hasattr(s, "isascii"):
+        # Python 3.7+
+        ascii_only = s.isascii()
+    else:
+        ascii_only = _isascii(s)
+
+    if ascii_only:
         return s
     else:
+        normalized = unicodedata.normalize('NFKD', s)
+
         return ''.join([c for c in normalized if not unicodedata.combining(c)])
```

if it's slower, we can still go with your version 2, but with a comment saying that we should use `str.isascii` once support for Python <=3.6 is dropped.
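(As an aside, a quick sanity check, assuming a Python 3.7+ interpreter so that `str.isascii` exists, that the `_isascii` fallback sketched in the diff classifies strings the same way; the sample strings are arbitrary.)

```python
def _isascii(s):
    # same fallback as in the suggested diff above
    try:
        s.encode("ASCII", errors="strict")
        return True
    except UnicodeEncodeError:
        return False

# arbitrary test strings: empty, plain ASCII, precomposed and decomposed
# accents, and CJK text
samples = ["", "hello world", "caf\u00e9", "e\u0301", "\u4e2d\u6587"]
for s in samples:
    assert _isascii(s) == s.isascii(), s
```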
Please avoid force-pushing. It makes it harder for us to see that you have pushed, and to see what's changed.
Variant 2 looks good, and we could look into a Cython implementation if appropriate (I can't find a C API for unicodedata.combining right now).
Doesn't look like it's exposed in the public API? https://github.com/python/cpython/search?q=unicodedata_UCD_combining_impl and |
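(For context on what the pure-Python path relies on: the list comprehension keeps a character whenever `unicodedata.combining` returns 0, and drops it when the canonical combining class is nonzero, as it is for combining accents. A small illustration:)

```python
import unicodedata

# combining accents report a nonzero canonical combining class and are dropped
print(unicodedata.combining("\u0301"))  # COMBINING ACUTE ACCENT -> 230
# base characters report 0 and are kept
print(unicodedata.combining("e"))       # -> 0
print(unicodedata.combining("\u4e2d"))  # CJK ideograph -> 0
```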
Reference Issues/PRs
Fixes #15087
What does this implement/fix? Explain your changes.
strip_accents_unicode contained a check to see if applying NFKD normalization to the input string changed it. If the string was unchanged, then it would not attempt to remove accents. This meant that if an input string was already in NFKD form and also contained accents, the accents were not removed.
Now, strip_accents_unicode always filters out combining characters after applying NFKD normalization.
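For illustration, a minimal sketch of the behaviour described above (the example string is illustrative, not taken from the original issue):

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT is already in NFKD form,
# so the old equality check concluded there was nothing to strip.
s = "e\u0301"
assert unicodedata.normalize("NFKD", s) == s

# Always filtering out combining characters after normalization removes the accent.
stripped = "".join(c for c in unicodedata.normalize("NFKD", s)
                   if not unicodedata.combining(c))
print(stripped)  # -> "e"
```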
Any other comments?