Add tests for tokenizers #6

Merged
vseloved merged 4 commits into vseloved:master from dmsurti:tokenization-tests on Nov 9, 2014

@dmsurti
Contributor

dmsurti commented Nov 8, 2014

This PR adds tests for tokenizers using https://github.com/nltk/nltk/blob/develop/nltk/test/tokenize.doctest as a reference.

The test details are:

  1. 10 word tokenizer tests (8 pass, 2 fail). The incorrect tokenizations are:
    a. "Dr." ==> "Dr", ".". Expected ==> "Dr."
    b. "3:00" ==> "3", ":", "00". Expected ==> "3:00"
  2. 3 custom regular expression tokenizer tests. Unlike the NLTK tests, these skip the cases for regular expressions with named groups and back references.
  3. A simple sentence splitter test.
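
To make the two failing cases concrete, here is a minimal Python sketch (standard `re` module only) of a word tokenizer that keeps "Dr." and "3:00" as single tokens. It is not this project's tokenizer or NLTK's; the pattern, the tiny abbreviation list, and the name `tokenize` are assumptions made for illustration.

```python
import re

# Hypothetical pattern. Alternatives are ordered so that the more specific
# ones (times, known abbreviations) are tried before the generic ones.
TOKEN_RE = re.compile(
    r"""
    \d+(?::\d+)+          # times such as 3:00
  | (?:Dr|Mr|Mrs|Ms)\.    # a small, illustrative abbreviation list
  | \w+                   # plain words and numbers
  | [^\w\s]               # any single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens; whitespace is skipped, never returned."""
    # The pattern has no capturing groups, so findall returns full matches.
    return TOKEN_RE.findall(text)


print(tokenize("Dr. Smith arrives at 3:00."))
```

The trailing period of the sentence still comes out as its own token, while the periods and colons inside "Dr." and "3:00" stay attached. A hard-coded abbreviation list is the simplest option, not necessarily the right one for a real tokenizer.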

Open Questions

  1. Do we have tokenizer implementations that handle regexes containing named groups or back references? If not, are there any plans to implement them?
  2. Also, NLTK does not actually support back references. So if we address this, should we really support them, or just report the lack of support like NLTK does :-( ?

Related to this, cl-ppcre has support for named groups and back references; see http://weitz.de/cl-ppcre/#*allow-named-registers*. (After all, it's an Edi Weitz library!)
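
For readers unfamiliar with why groups are awkward in a regex tokenizer, here is a small Python sketch using only the standard `re` module. It shows a named group with a back reference, and how a `findall`-style tokenizer returns group contents rather than whole tokens once the pattern contains a capturing group. The pattern and the helper `tokenize_with_groups` are hypothetical; this is neither NLTK's nor cl-ppcre's code.

```python
import re

# Hypothetical pattern: a named group plus a back reference to it,
# matching runs of one repeated character such as "oo" or "ll".
REPEAT_RE = re.compile(r"(?P<ch>\w)(?P=ch)+")


def tokenize_with_groups(pattern, text):
    """Return the full text of each match, ignoring groups in the pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


text = "balloon aardvark"

# findall returns the contents of the single capturing group.
print(REPEAT_RE.findall(text))

# Iterating over match objects recovers the whole matched token instead.
print(tokenize_with_groups(REPEAT_RE, text))
```

Taking group 0 of each match is one way a tokenizer could accept patterns with named groups and back references without changing what counts as a token.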

vseloved added a commit that referenced this pull request Nov 9, 2014

@vseloved vseloved merged commit b2f43f2 into vseloved:master Nov 9, 2014

@dmsurti dmsurti deleted the dmsurti:tokenization-tests branch Nov 10, 2014
