
use tok as fallback tokenizer #11

Open · kootenpv wants to merge 1 commit into master
Conversation

@kootenpv commented Jul 9, 2019

The tok tokenizer makes for a better default (faster and saner).

Targets #10

@soaxelbrooke (Owner) left a comment

I'd prefer to avoid changing the default tokenization behavior. I can see adding tok as an optional alternative in the Encoder constructor, though.

print(next(encoder.inverse_transform(encoder.transform([example]))))
# vizzini : he didn ' t fall ? inconceivable !
@soaxelbrooke (Owner) commented on the diff:

So using tok would change the default tokenization?

@soaxelbrooke (Owner) commented Jul 10, 2019

For context, in NLP tasks it can be important to retain contractions as written, since their usage can be informative about the author of the text.

@kootenpv (Author) commented Jul 11, 2019

@soaxelbrooke I personally think the potential benefit of retaining contraction information is more than offset by the improved generalization you get from normalizing everything :)!

i.e. "not" and "' t" are never unified, and "did" and "didn" don't generalize to each other. I deem this to be worse!
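
To make that concrete, here is a quick sketch. It uses nltk's wordpunct_tokenize, which reproduces the "didn ' t" output quoted above; that this is the exact splitter in play is an assumption.

from nltk.tokenize import wordpunct_tokenize

# Punctuation-based splitting leaves contraction fragments behind:
print(wordpunct_tokenize("he didn't fall?"))
# ['he', 'didn', "'", 't', 'fall', '?']
# "didn" is a separate token from "did", and "'" + "t" never map to
# "not", so "he didn't fall" and "he did not fall" share fewer tokens.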

I'm up for further discussion. Meanwhile, how would you propose adding it as an option without also having the nltk dependency?

@soaxelbrooke (Owner) commented
@kootenpv I'm fine with people having different preferences about how those get split up; my biggest concern here is changing the interface, since that's a breaking change that would introduce bugs into dependent code. I haven't heard any complaints about the presence of NLTK, so I don't see any particular need to remove it. We could expose tok as an alternative word tokenizer in the Encoder constructor, as sketched below.
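
A minimal sketch of how that could look, which would also answer the dependency question above: defer the nltk import into the default tokenizer so it is only needed when the default is actually used. The word_tokenizer parameter and the Encoder internals shown here are assumptions for illustration, not the library's actual API.

from typing import Callable, List, Optional

def _default_word_tokenizer(text: str) -> List[str]:
    # Deferred import: nltk is only required if the caller keeps the default.
    from nltk.tokenize import wordpunct_tokenize
    return wordpunct_tokenize(text)

class Encoder:
    def __init__(self, word_tokenizer: Optional[Callable[[str], List[str]]] = None):
        # Default tokenization is unchanged; tok is strictly opt-in.
        self.word_tokenizer = word_tokenizer or _default_word_tokenizer

# Opt-in usage (assumes tok exposes a word_tokenize function):
# from tok import word_tokenize
# encoder = Encoder(word_tokenizer=word_tokenize)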

@soaxelbrooke added this to the v1.1 milestone on Jul 24, 2019