
use tok as fallback tokenizer #11

Open · kootenpv wants to merge 1 commit into master
Conversation

@kootenpv commented Jul 9, 2019

The tok tokenizer makes for a better default (faster and saner).

Targets #10

@soaxelbrooke (Owner) left a comment

I'd prefer to avoid changing the default tokenization behavior. I can see adding tok as an optional alternative in the Encoder constructor, though.

print(next(encoder.inverse_transform(encoder.transform([example]))))
# vizzini : he didn ' t fall ? inconceivable !
@soaxelbrooke (Owner) commented on the diff:

So using tok would change the default tokenization?

@soaxelbrooke (Owner) commented Jul 10, 2019

For context, in NLP tasks it can be important to retain contractions as written, since their usage can be informative about the author of the text.

@kootenpv (Author) commented Jul 11, 2019

@soaxelbrooke I personally think the potential benefit of retaining contraction information is more than offset by the improved generalization you get from normalizing everything :)!

i.e. "not" and "' t" are never unified, and "did" and "didn" don't generalize to each other. I deem this to be worse!
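
To make that concrete, here is a quick sketch. It uses nltk's wordpunct_tokenize, which reproduces the "didn ' t" output quoted above; that this is the exact splitter in play is an assumption.

from nltk.tokenize import wordpunct_tokenize

# Punctuation-based splitting leaves contraction fragments behind:
print(wordpunct_tokenize("he didn't fall?"))
# ['he', 'didn', "'", 't', 'fall', '?']
# "didn" is a separate token from "did", and "'" + "t" never map to
# "not", so "he didn't fall" and "he did not fall" share fewer tokens.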

I'm up for further discussion. Meanwhile, how would you propose adding it as an option without also having the nltk dependency?

@soaxelbrooke (Owner) commented
@kootenpv I'm fine with people having different preferences about how those get split up; my biggest concern here is changing the interface, since that's a breaking change that would introduce bugs into dependent code. I haven't heard any complaints about the presence of NLTK, so I don't see any particular need to remove it. We could expose tok as an alternative word tokenizer in the Encoder constructor, as sketched below.
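
A minimal sketch of how that could look, which would also answer the dependency question above: defer the nltk import into the default tokenizer so it is only needed when the default is actually used. The word_tokenizer parameter and the Encoder internals shown here are assumptions for illustration, not the library's actual API.

from typing import Callable, List, Optional

def _default_word_tokenizer(text: str) -> List[str]:
    # Deferred import: nltk is only required if the caller keeps the default.
    from nltk.tokenize import wordpunct_tokenize
    return wordpunct_tokenize(text)

class Encoder:
    def __init__(self, word_tokenizer: Optional[Callable[[str], List[str]]] = None):
        # Default tokenization is unchanged; tok is strictly opt-in.
        self.word_tokenizer = word_tokenizer or _default_word_tokenizer

# Opt-in usage (assumes tok exposes a word_tokenize function):
# from tok import word_tokenize
# encoder = Encoder(word_tokenizer=word_tokenize)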

@soaxelbrooke added this to the v1.1 milestone on Jul 24, 2019