[FEA] Filtering non-alphanumeric characters #5520
Labels: feature request (New feature or request), Python (Affects Python cuDF API), strings (strings issues, C++ and Python)
Is your feature request related to a problem? Please describe.
Very often, NLP pipelines normalize their input to remove all sorts of noise in the data. Noise can range from punctuation characters to uncommon Unicode characters such as ℉ and ℧. Right now it is possible to filter those characters out of a string Series using regexes, but this is fairly slow since regexes are a difficult problem on GPU. It would be useful to provide such a feature out of the box, since it seems to apply to many generic NLP pipelines.
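For reference, the semantics of the regex-based filtering can be sketched on the CPU with Python's standard `re` module. This only illustrates the intended behavior on plain Python strings, not how cuDF implements it; the helper name `replace_non_alnum` and the sample documents are hypothetical.

```python
import re

# Matches any single character that is not a Unicode "word" character
# (letters, digits, underscore), i.e. the regexp r'[^\w]' used in this issue.
NON_WORD = re.compile(r"[^\w]")


def replace_non_alnum(text: str, repl: str = " ") -> str:
    """Replace every non-word character in `text` with `repl` (hypothetical helper)."""
    return NON_WORD.sub(repl, text)


# Sample documents, made up for illustration.
docs = ["Hello, world!", "20℉ outside", "R = 5℧"]
cleaned = [replace_non_alnum(d) for d in docs]
```

Note that `\w` also matches the underscore, so `_` is kept by this pattern.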
Describe the solution you'd like
A Series.str function to replace instances of non-alphanumeric characters with a specified character. To be more specific, non-alphanumeric characters are the individual characters matched by the following regexp: r'[^\w]'.
Describe alternatives you've considered
Right now, for the CountVectorizer PR, we are using a regexp to achieve this. First, we get a list of non-alphanumeric characters from the input documents. Then we replace those characters with a delimiter using Series.str.translate.
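The snippets from the original post are not reproduced here. As a rough, CPU-only sketch of the two steps described, assuming plain Python strings stand in for a cuDF string Series, the workaround could look like the following. The helper name `build_translation_table`, the delimiter default, and the sample documents are hypothetical; on the GPU, the final step would presumably go through Series.str.translate instead of `str.translate`.

```python
import re


def build_translation_table(docs, delimiter=" "):
    """Map every non-word character found in `docs` to `delimiter` (hypothetical helper)."""
    # Step 1: collect the distinct non-alphanumeric characters in the input.
    non_alnum = set()
    for doc in docs:
        non_alnum.update(re.findall(r"[^\w]", doc))
    # Step 2: build a translation table that sends each of them to the delimiter.
    return str.maketrans({ch: delimiter for ch in non_alnum})


# Sample documents, made up for illustration.
docs = ["Hello, world!", "20℉ outside"]
table = build_translation_table(docs)
cleaned = [doc.translate(table) for doc in docs]
```

The regexp runs only once to discover the characters to remove; the replacement itself is a per-character table lookup, which is the part that a dedicated string function could do without a regex engine.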