FIX a bug in feature_extraction.text.strip_accents_unicode #15100
What does this implement/fix? Explain your changes.
strip_accents_unicode contained a check to see if applying NFKD normalization to the input string changed it. If the string was unchanged, then it would not attempt to remove accents. This meant that if an input string was already in NFKD form and also contained accents, the accents were not removed.
Now, strip_accents_unicode always filters out combining characters after applying NFKD normalization.
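The difference is easiest to see on a string that is already NFKD-normalized. Below is a minimal sketch contrasting the two behaviors; the function names `strip_accents_unicode_old`/`strip_accents_unicode_new` are just labels for this illustration, not names from the codebase.

```python
import unicodedata

def strip_accents_unicode_old(s):
    # Old behavior: skip stripping when NFKD leaves the string unchanged.
    normalized = unicodedata.normalize('NFKD', s)
    if normalized == s:
        return s
    return ''.join([c for c in normalized if not unicodedata.combining(c)])

def strip_accents_unicode_new(s):
    # New behavior: always filter out combining characters after NFKD.
    normalized = unicodedata.normalize('NFKD', s)
    return ''.join([c for c in normalized if not unicodedata.combining(c)])

precomposed = "\u00f1"   # 'ñ' as a single code point; NFKD decomposes it
decomposed = "n\u0303"   # 'n' + combining tilde: already in NFKD form

print(strip_accents_unicode_old(precomposed))  # 'n'  -- works
print(strip_accents_unicode_old(decomposed))   # 'ñ'  -- accent not removed
print(strip_accents_unicode_new(decomposed))   # 'n'  -- fixed
```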
Any other comments?
jnothman left a comment
I presume the if was in there for efficiency on English and other languages without much combination. Our preprocessing and tokenisation is slow in general, and I wonder if we can improve this removal of combining characters significantly in cython.
@rth sure. I think it would also be useful to look at benchmarks on data that is not primarily ASCII-compatible. I used the following methodology to create a set of "documents" by randomly sampling code points from certain ranges, and chose the number and length of the strings to be roughly the same as the newsgroups data. (I'm not trying to present this as a way to get realistic non-ASCII text; this is just to get a set of data that will not trigger the fast path in the string processing functions.)
```python
import random

ranges = (
    range(0x0020, 0x007e + 1),  # Basic Latin
    range(0x00a0, 0x00ff + 1),  # Latin supplement; accented Latin letters
    range(0x0400, 0x04ff + 1),  # Cyrillic
    range(0x0600, 0x06ff + 1),  # Arabic
    range(0x4e00, 0x62ff + 1),  # Some CJK
)
code_points = list()
for r in ranges:
    code_points.extend(r)

def random_string(n):
    return "".join(chr(i) for i in random.choices(code_points, k=n))

random_corpus = [random_string(2_000) for _ in range(10_000)]
with open("random_corpus.txt", "w", encoding="UTF-8") as f:
    f.write("\n".join(random_corpus))
```
To benchmark, I used IPython's `%timeit` magic:
```python
from pathlib import Path

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroups = fetch_20newsgroups()
random_corpus = Path("random_corpus.txt").read_text().splitlines()

%timeit -r50 CountVectorizer(strip_accents="unicode").fit(# CORPUS #)
```
Since the PR currently introduces a significant slowdown on the newsgroups data, I tried a couple of guards in the spirit of the existing code. Variant 1 is
```python
import unicodedata

def strip_accents_unicode(s):
    if ord(max(s)) < 128:
        return s
    else:
        normalized = unicodedata.normalize('NFKD', s)
        return ''.join([c for c in normalized if not unicodedata.combining(c)])
```
and variant 2 is
```python
import unicodedata

def strip_accents_unicode(s):
    try:
        s.encode("ASCII", errors="strict")
        return s
    except UnicodeEncodeError:
        normalized = unicodedata.normalize('NFKD', s)
        return ''.join([c for c in normalized if not unicodedata.combining(c)])
```
I have read that benchmarking things accurately is difficult, and I'm not sure how seriously to take these numbers. To the extent we believe them, they seem to indicate that none of these cases make much difference for random non-ASCII-compatible strings, but that it's useful to have a guard to fast-path ASCII-compatible strings.
I guess we should go with variant 2, if that looks all right to everyone?
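One point worth sanity-checking is that the ASCII guard in variant 2 cannot change results: NFKD leaves pure-ASCII strings untouched and they contain no combining characters, so the fast path returns exactly what the unguarded version would. A quick sketch of that check (the names `strip_unguarded` and `strip_variant2` are mine, for illustration only):

```python
import unicodedata

def strip_unguarded(s):
    # Always normalize and filter, with no fast path.
    normalized = unicodedata.normalize('NFKD', s)
    return ''.join([c for c in normalized if not unicodedata.combining(c)])

def strip_variant2(s):
    # Variant 2: return ASCII-only strings unchanged.
    try:
        s.encode("ASCII", errors="strict")
        return s
    except UnicodeEncodeError:
        normalized = unicodedata.normalize('NFKD', s)
        return ''.join([c for c in normalized if not unicodedata.combining(c)])

for s in ["plain ascii", "\u00e9l\u00e8ve", "n\u0303", "already stripped"]:
    assert strip_variant2(s) == strip_unguarded(s)
print("guarded and unguarded versions agree")
```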
Thanks for the detailed benchmarks. In Python 3.7+ there is `str.isascii`.
Could you try,
```diff
diff --git a/sklearn/feature_extraction/text.py b/sklearn/feature_extraction/text.py
index bb5a9d646..b3862ce35 100644
--- a/sklearn/feature_extraction/text.py
+++ b/sklearn/feature_extraction/text.py
@@ -111,6 +111,16 @@ def _analyze(doc, analyzer=None, tokenizer=None, ngrams=None,
     return doc
 
 
+def _isascii(s):
+    # XXX: this function implements str.isascii from Python 3.7+
+    # and can be removed once support for Python <=3.6 is dropped.
+    try:
+        s.encode("ASCII", errors="strict")
+        return True
+    except UnicodeEncodeError:
+        return False
+
+
 def strip_accents_unicode(s):
     """Transform accentuated unicode symbols into their simple counterpart
 
@@ -129,10 +139,17 @@ def strip_accents_unicode(s):
     Remove accentuated char for any unicode symbol that has a direct
     ASCII equivalent.
     """
-    normalized = unicodedata.normalize('NFKD', s)
-    if normalized == s:
+    if hasattr(s, "isascii"):
+        # Python 3.7+
+        ascii_only = s.isascii()
+    else:
+        ascii_only = _isascii(s)
+
+    if ascii_only:
         return s
     else:
+        normalized = unicodedata.normalize('NFKD', s)
         return ''.join([c for c in normalized if not unicodedata.combining(c)])
```
If it's slower, we can still go with your variant 2, but with a comment saying that we should use `str.isascii` once support for Python <=3.6 is dropped.
jnothman left a comment
Please avoid force-pushing. It makes it harder for us to see that you have pushed, and to see what's changed.
Variant 2 looks good, and we could look into a Cython implementation if appropriate (I can't find a C API for unicodedata.combining right now).
Doesn't look like it's exposed in the public API?