[FEA] Filtering non-alphanumeric characters #5520

Garfounkel · 2020-06-19T17:09:32Z

Is your feature request related to a problem? Please describe.
NLP pipelines very often normalize their input to remove all sorts of noise from the data. Noise can range from punctuation characters to uncommon Unicode symbols such as ℉ and ℧. Right now it is possible to filter those characters out of a string Series using regexes, but this is fairly slow because regexes are a difficult problem on the GPU. It would be useful to provide this feature out of the box, since it seems to apply to many generic NLP pipelines.

Describe the solution you'd like
A Series.str function that replaces each non-alphanumeric character with a specified character. To be more specific, non-alphanumeric characters are the individual characters matched by the following regexp: r'[^\w]'.

>>> s = Series(['abc£def', 'I am a developer', '℉℧ is not alphanumeric', 'Αγγλικά is alphanumeric'])
>>> s.str.replace_non_alphanumns(replacement_char=' ')
Series(['abc def', 'I am a developer', '   is not alphanumeric', 'Αγγλικά is alphanumeric'])
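
For reference, the requested semantics can be reproduced on the CPU with Python's standard `re` module. This is only a sketch of the expected behaviour, not the proposed cuDF implementation; the helper name `replace_non_alphanum` is hypothetical and operates on a plain list of strings rather than a Series.

```python
import re

# Matches any single character that is not a Unicode word character.
# Note that \w also covers the underscore, so '_' is left untouched.
_NON_WORD = re.compile(r"[^\w]")


def replace_non_alphanum(strings, replacement_char=" "):
    """Replace every character matched by r'[^\\w]' with replacement_char.

    Hypothetical CPU-only helper used to illustrate the requested behaviour.
    """
    return [_NON_WORD.sub(replacement_char, s) for s in strings]


docs = [
    "abc£def",
    "I am a developer",
    "℉℧ is not alphanumeric",
    "Αγγλικά is alphanumeric",
]
print(replace_non_alphanum(docs, replacement_char=" "))
```

Whitespace is itself matched by `[^\w]`, so with a replacement other than a space (for example `'-'`) the spaces between words would be replaced as well.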

Describe alternatives you've considered
Right now for the CountVectorizer PR we are using a regexp to achieve this. First we get a list of non-alphanumeric characters from the input documents:

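def _get_non_alphanumeric_characters(docs):
    """Return the distinct non-alphanumeric characters in docs, plus a newline."""
    # Split every document into single characters and keep each one once.
    characters = docs.str.character_tokenize().unique()
    # Keep only the characters matched by [^\w]; non-matches become nulls and are dropped.
    non_alpha = characters.str.extract(r'([^\w])', expand=False).dropna()
    return non_alpha.tolist() + ['\n']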

Then we replace those characters with a delimiter using Series.str.translate:

non_alpha = _get_non_alphanumeric_characters(docs)
delimiter_code = ord(delimiter)
translation_table = {ord(char): delimiter_code for char in non_alpha}
docs = docs.str.translate(translation_table)
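
The same two-step idea (collect the distinct non-word characters, then map each one to the delimiter's code point) can be tried without a GPU on plain Python strings. This is a minimal sketch under stated assumptions: the function names `non_alphanumeric_characters` and `replace_with_delimiter` are hypothetical, and the built-in `str.translate` stands in for `Series.str.translate`.

```python
import re


def non_alphanumeric_characters(docs):
    """Collect the distinct characters in docs matched by r'[^\\w]', plus a newline."""
    distinct = set("".join(docs))
    non_alpha = {c for c in distinct if re.fullmatch(r"[^\w]", c)}
    non_alpha.add("\n")  # mirrors the explicit '\n' in the snippet above
    return sorted(non_alpha)


def replace_with_delimiter(docs, delimiter=" "):
    """Replace every non-word character in docs with delimiter via a translation table."""
    delimiter_code = ord(delimiter)
    # Map each offending code point to the delimiter's code point.
    table = {ord(char): delimiter_code for char in non_alphanumeric_characters(docs)}
    return [doc.translate(table) for doc in docs]


print(replace_with_delimiter(["abc£def", "x,y\nz"], delimiter="|"))
```

The translation table only contains characters that actually occur in the input, which keeps it small but means it has to be rebuilt for every new batch of documents.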


Garfounkel · 2020-06-19T17:13:41Z

CC: @davidwendt / @randerzander .

Garfounkel added the "Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels on Jun 19, 2020

Garfounkel mentioned this issue on Jun 19, 2020: [FEA] Filtering non-alphanumeric characters and words based on size #5449 (Closed)

kkraus14 added the "Python" (Affects Python cuDF API) and "strings" (strings issues, C++ and Python) labels and removed the "Needs Triage" (Need team to review and classify) label on Jun 30, 2020

davidwendt mentioned this issue on Jul 9, 2020: [REVIEW] Add filter_characters_of_type strings API #5666 (Merged)

davidwendt closed this as completed in #5666 on Jul 15, 2020
