Update text_normalize() function in text_classification. #568
def text_normalize(line):
    """
    Basic normalization for a line of text.
    Normalization includes
    - lowercasing
    - replacing all non-alphanumeric characters with whitespace
    - complete some basic text normalization for English words.
I think it's a good idea to explicitly list the normalizations that are applied so that the user knows exactly what's going to happen without having to look at the code.
    add spaces before and after ')'
    add spaces before and after '!'
    add spaces before and after '?'
    replace ';' multiple spaces with single space
What does this mean?
Typo. It should read: replace multiple spaces with a single space.
But doesn't that only happen at the end? It looks like the corresponding regex only replaces ';' with a single space.
It replaces ';' with a single space. At the very end, runs of multiple spaces are collapsed into a single space.
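For readers following the thread, here is a minimal sketch of the order of operations described above: pad ')', '!' and '?' with spaces, replace each ';' with a space, and only at the end collapse repeated spaces. It is not the PR's actual code. The function name `normalize_sketch`, the exact regular expressions, and the final `strip()` are assumptions made for illustration.

```python
import re


def normalize_sketch(line):
    """Hypothetical stand-in for text_normalize(); not the code under review."""
    line = line.lower()
    # Add a space before and after ')', '!' and '?'.
    line = re.sub(r"([)!?])", r" \1 ", line)
    # Replace each ';' with a single space (two ';' give two spaces here).
    line = line.replace(";", " ")
    # Final step: collapse runs of whitespace into one space.
    # The strip() is an assumption, added so the result has no edge spaces.
    return re.sub(r"\s+", " ", line).strip()


print(normalize_sketch("Hello;;world!  Is it (ok)?"))
# hello world ! is it (ok ) ?
```

The ';' step alone can leave several consecutive spaces, which is why the docstring line that mixes the two steps reads ambiguously. Listing them as two separate items would match what the code does.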
As the title says.