[FEA] Filtering non-alphanumeric characters #5520

Closed
Garfounkel opened this issue Jun 19, 2020 · 1 comment · Fixed by #5666
Labels
feature request New feature or request Python Affects Python cuDF API. strings strings issues (C++ and Python)

Comments

Garfounkel commented Jun 19, 2020

Is your feature request related to a problem? Please describe.
Very often, NLP pipelines normalize their input to remove all sorts of noise from the data. Noise can range from punctuation characters to uncommon Unicode symbols such as ℉ and ℧. It is currently possible to filter those characters from a string Series using regexes, but this is fairly slow because regexes are a difficult problem on GPU. It would be useful to provide such a feature out of the box, since it seems to apply to many generic NLP pipelines.

Describe the solution you'd like
A Series.str function that replaces every non-alphanumeric character with a specified character. More specifically, non-alphanumeric characters are the individual characters matched by the following regexp: r'[^\w]'.

>>> s = Series(['abc£def', 'I am a developer', '℉℧ is not alphanumeric', 'Αγγλικά is alphanumeric'])
>>> s.str.replace_non_alphanumns(replacement_char=' ')
Series(['abc def', 'I am a developer', '   is not alphanumeric', 'Αγγλικά is alphanumeric'])
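
For reference, the intended semantics can be sketched on the CPU with Python's `re` module. This only illustrates the expected output and is not a cuDF implementation; the helper name `replace_non_alphanumeric` is hypothetical and not an existing API.

```python
import re

# Matches any single character that is not a Unicode word character.
# Note that \w includes the underscore, and that whitespace is NOT a
# word character, so spaces are replaced too.
_NON_WORD = re.compile(r'[^\w]')


def replace_non_alphanumeric(strings, replacement_char=' '):
    """Hypothetical CPU reference: replace each non-word character."""
    return [_NON_WORD.sub(replacement_char, s) for s in strings]


docs = ['abc£def', 'I am a developer', '℉℧ is not alphanumeric']
result = replace_non_alphanumeric(docs)
```

With a replacement other than a space, existing spaces would be replaced as well, since `[^\w]` matches whitespace.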

Describe alternatives you've considered
Right now for the CountVectorizer PR we are using a regexp to achieve this. First we get a list of non-alphanumeric characters from the input documents:

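def _get_non_alphanumeric_characters(docs):
    # Split every document into single characters and deduplicate them.
    characters = docs.str.character_tokenize().unique()
    # Keep only characters matched by [^\w]; non-matches become null and are dropped.
    non_alpha = characters.str.extract(r'([^\w])', expand=False).dropna()
    # Newline is always added to the returned list.
    return non_alpha.tolist() + ['\n']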

Then we replace those characters with a delimiter using Series.str.translate:

non_alpha = _get_non_alphanumeric_characters(docs)
delimiter_code = ord(delimiter)
translation_table = {ord(char): delimiter_code for char in non_alpha}
docs = docs.str.translate(translation_table)
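
The same two-step idea can be sketched in plain Python, without cuDF, to show what the translation table ends up containing. The sample `docs` and `delimiter` values below are made up for illustration, and the built-in `str.translate` stands in for `Series.str.translate`.

```python
import re

# Made-up sample inputs for illustration.
docs = ['abc£def', 'hello, world!\n']
delimiter = ' '

# Step 1: collect the distinct non-word characters present in the input.
unique_chars = set(''.join(docs))
non_alpha = sorted(c for c in unique_chars if re.fullmatch(r'[^\w]', c))

# Step 2: map each collected code point to the delimiter's code point,
# then apply the table to every document.
delimiter_code = ord(delimiter)
translation_table = {ord(char): delimiter_code for char in non_alpha}
cleaned = [doc.translate(translation_table) for doc in docs]
```

The table only covers characters that actually occur in the input, which is why the first step has to scan every document before any replacement can happen.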

Garfounkel (author) commented

CC: @davidwendt / @randerzander

@kkraus14 kkraus14 added Python Affects Python cuDF API. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Jun 30, 2020