Update text_normalize() function in text_classification. #568

Merged: 12 commits, Jul 26, 2019

zhangguanheng66
Contributor

As the title says.



def text_normalize(line):
    """
    Basic normalization for a line of text.
    Normalization includes
    - lowercasing
    - replacing all non-alphanumeric characters with whitespace
    - complete some basic text normalization for English words.
Contributor

I think it's a good idea to explicitly list the normalizations that are applied so that the user knows exactly what's going to happen without having to look at the code.

test/data/test_utils.py (2 outdated review threads, resolved)
add spaces before and after ')'
add spaces before and after '!'
add spaces before and after '?'
replace ';' multiple spaces with single space
Contributor

What does this mean?

Contributor Author

Typo. It should say: replace multiple spaces with a single space.

Contributor

But doesn't that only happen at the end? It looks like the corresponding regular expression only replaces ';' with a single space.

Contributor Author

It replaces ';' with a single space. At the very end, runs of multiple spaces are collapsed into a single space.
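
For readers following this thread, here is a minimal sketch of the two-stage behavior described above: each punctuation rule is applied on its own, and whitespace is collapsed only in a final pass. This is not the code from the PR. The name `normalize_sketch` and the `_RULES` table are hypothetical, and only the rules quoted in this conversation (lowercasing, ')', '!', '?', ';', multiple spaces) are covered.

```python
import re

# Hypothetical rule table (an assumption, not the PR's actual patterns):
# each entry is (compiled pattern, replacement), applied in order.
_RULES = [
    (re.compile(r"\)"), " ) "),  # add spaces before and after ')'
    (re.compile(r"!"), " ! "),   # add spaces before and after '!'
    (re.compile(r"\?"), " ? "),  # add spaces before and after '?'
    (re.compile(r";"), " "),     # replace ';' with a single space
    (re.compile(r"\s+"), " "),   # final pass: collapse whitespace runs
]


def normalize_sketch(line):
    """Lowercase the line, then apply each rule in order."""
    line = line.lower()
    for pattern, replacement in _RULES:
        line = pattern.sub(replacement, line)
    return line


print(repr(normalize_sketch("Hello;  World!")))  # 'hello world ! '
```

Because the ';' rule only inserts one space, the input "Hello;  World!" briefly contains three consecutive spaces. Only the last rule reduces them to one, which is why the docstring entry for ';' and the entry for multiple spaces describe separate steps.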

test/data/test_utils.py (2 outdated review threads, resolved)
@cpuhrsch merged commit 25fe0b2 into pytorch:master on Jul 26, 2019
@zhangguanheng66 deleted the text_normalized branch on November 25, 2019 15:29