
Look into Penn Treebank tokenizers for English #12

Closed
lmullen opened this issue Apr 13, 2016 · 21 comments
@lmullen lmullen (Member) commented Apr 13, 2016

No description provided.

@jrnold jrnold (Contributor) commented Jan 29, 2017

Hey, here is my attempt at a Penn Treebank tokenizer: https://gist.github.com/jrnold/c4867ddf82b2913803d2. Do you want it merged?

@lmullen lmullen (Member, Author) commented Jan 31, 2017

@jrnold Can you please provide a link to the code you derived this from? Also, can you please include some benchmark results from microbenchmark on some suitable text? I'd like to get a sense of the performance.

@jrnold jrnold (Contributor) commented Jan 31, 2017

It's pretty much a direct port of the NLTK TreebankWordTokenizer.

Do you have text you'd like it benchmarked on?

@lmullen lmullen (Member, Author) commented Jan 31, 2017

@jrnold jrnold (Contributor) commented Feb 19, 2017

On my computer, the PTB tokenizer was substantially slower than tokenize_words (roughly 17x on the median timings below):

> moby <- quanteda::data_char_mobydick

> microbenchmark(tokenize_words(moby))
Unit: milliseconds
                 expr      min       lq     mean   median       uq      max neval
 tokenize_words(moby) 59.67409 63.02587 65.97723 64.67812 67.25878 159.3838   100

> microbenchmark(tokenize_ptb(moby))
Unit: seconds
               expr      min       lq   mean   median      uq      max neval
 tokenize_ptb(moby) 1.089508 1.112156 1.1416 1.126525 1.14529 1.431831   100

@lmullen lmullen (Member, Author) commented Feb 20, 2017

@jrnold jrnold (Contributor) commented Feb 20, 2017

My bad. I read it too quickly.

The PTB tokenizer is a word tokenizer that generates tokens like those appearing in the Penn Treebank before it was tagged. Stanford CoreNLP and NLTK both use it as their default word tokenizer. Since many NLP tasks use the PTB as training data, they want a tokenizer that produces the same tokens as those appearing in the PTB.

Browsing the NLTK source and the CoreNLP docs gives a pretty good description of what it does. The main differences from tokenize_words are:

  • it keeps and standardizes quotes, ellipses, and a few other punctuation marks
  • it splits off commas and single quotes when they are followed by whitespace
  • it separates periods that appear at the end of a line
  • it correctly splits a few common English contractions, e.g. "they'll" becomes "they" and "'ll"

Here are some examples of the differences between tokenize_words and the Penn Treebank tokenizer.

> tokenize_words("Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.")
[[1]]
 [1] "good"    "muffins" "cost"    "3.88"    "in"      "new"     "york"    "please"  "buy"     "me"      "two"    
[12] "of"      "them"    "thanks" 

> tokenize_ptb("Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.")
[[1]]
 [1] "Good"    "muffins" "cost"    "$"       "3.88"    "in"      "New"     "York."   "Please"  "buy"     "me"     
[12] "two"     "of"      "them."   "Thanks"  "."      

> tokenize_words("They'll save and invest more.")
[[1]]
[1] "they'll" "save"    "and"     "invest"  "more"  

> tokenize_ptb("They'll save and invest more.")
[[1]]
[1] "They"   "'ll"    "save"   "and"    "invest" "more"   "."     

song <-  paste0("How many roads must a man walk down\n",
                "Before you call him a man?\n",
                "How many seas must a white dove sail\n",
                "Before she sleeps in the sand?\n",
                "\n",
                "How many times must the cannonballs fly\n",
                "Before they're forever banned?\n",
                "The answer, my friend, is blowin' in the wind.\n",
                "The answer is blowin' in the wind.\n")

tokenize_ptb(song)

[[1]]
 [1] "How"         "many"        "roads"       "must"        "a"           "man"         "walk"        "down"       
 [9] "Before"      "you"         "call"        "him"         "a"           "man"         "?"           "How"        
[17] "many"        "seas"        "must"        "a"           "white"       "dove"        "sail"        "Before"     
[25] "she"         "sleeps"      "in"          "the"         "sand"        "?"           "How"         "many"       
[33] "times"       "must"        "the"         "cannonballs" "fly"         "Before"      "they"        "'re"        
[41] "forever"     "banned"      "?"           "The"         "answer"      ","           "my"          "friend"     
[49] ","           "is"          "blowin"      "'"           "in"          "the"         "wind."       "The"        
[57] "answer"      "is"          "blowin"      "'"           "in"          "the"         "wind"        "."    

The value added also depends on the application. For something like topic models, tokenize_words is fine. For NLP tasks such as POS tagging, dependency parsing, and named entity recognition, where the punctuation is needed, the PTB tokenizer adds more value.

@lmullen lmullen (Member, Author) commented Feb 20, 2017

jrnold added a commit to jrnold/tokenizers that referenced this issue Feb 20, 2017
jrnold added a commit to jrnold/tokenizers that referenced this issue Feb 20, 2017
jrnold added a commit to jrnold/tokenizers that referenced this issue Feb 20, 2017
@lmullen lmullen self-assigned this Apr 1, 2017
@Ironholds Ironholds (Member) commented Apr 3, 2017

I feel like some of the speed problems could maybe be fixed by moving logic to C++, but I'd have to test to see. Want me to take a stab?

@Ironholds Ironholds (Member) commented Apr 3, 2017

Am I understanding https://gist.github.com/jrnold/c4867ddf82b2913803d2#file-gistfile1-r-L28-L44 correctly that basically you want to reduce it to just alphabetical characters?

@lmullen lmullen (Member, Author) commented Apr 3, 2017

@Ironholds If you're willing, I'm more than happy for you to speed this up.

Perhaps @jrnold can weigh in on your question about alphabetical characters. But my understanding is that the PTB tokenizer treats "they'll" as two tokens, "they" and "'ll", so that's what the logic is accounting for.

@jrnold jrnold (Contributor) commented Apr 3, 2017

Nope, https://gist.github.com/jrnold/c4867ddf82b2913803d2#file-gistfile1-r-L28-L44 isn't reducing the text to alphabetical characters at all. The Penn Treebank tokenizer does not remove punctuation. Those regexes standardize quotes and punctuation, and add spaces around punctuation so that it ends up as separate tokens.
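
For a concrete picture, here is a rough sketch of that style of preprocessing (illustrative only, not the actual gist code; the helper name ptb_style_prep and the exact patterns are invented for this example): quotes get normalized to PTB-style `` and '', and punctuation gets padded with spaces so a plain whitespace split produces separate tokens.

# Illustrative sketch only, not the actual gist code: normalize quotes and
# pad punctuation with spaces so splitting on whitespace yields separate tokens.
ptb_style_prep <- function(x) {
  x <- gsub('^"', "`` ", x)                                # opening double quote -> ``
  x <- gsub('"', " ''", x)                                 # remaining double quotes -> ''
  x <- gsub("([,;:@#$%&?!])", " \\1 ", x)                  # pad common punctuation with spaces
  x <- gsub("([^.])(\\.)$", "\\1 \\2", x)                  # split off a sentence-final period
  x <- gsub("([A-Za-z])('ll|'re|'ve|n't)", "\\1 \\2", x)   # split a few common contractions
  x
}

strsplit(trimws(ptb_style_prep("\"They'll save,\" he said.")), "\\s+")
# expected shape: "``" "They" "'ll" "save" "," "''" "he" "said" "."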

@Ironholds Ironholds (Member) commented Apr 3, 2017

Aha!

@Ironholds Ironholds (Member) commented Apr 7, 2017

It looks like I can shave some time off it, although we're probably talking milliseconds rather than seconds. Basically, by writing C++ find()-based utilities for some of the simpler string manipulations, we can cut their runtime by about four fifths. We'll see how it shapes up in practice.

@Ironholds Ironholds (Member) commented Apr 7, 2017

(Is the order in which operations are performed crucial?)

@lmullen lmullen (Member, Author) commented Apr 7, 2017

@Ironholds I don't think the order of operations is critical, though perhaps @jrnold can confirm.

@Ironholds Ironholds (Member) commented Apr 8, 2017

Having dug into this, I suspect the performance gains are small enough that they're probably not worth the added finickiness in the code. So this may well be done!

@lmullen lmullen (Member, Author) commented Apr 8, 2017

@Ironholds Thanks for looking into it. I'm keeping this issue open at the moment just to remind myself to review the documentation, but otherwise I'm considering this done.

@jrnold jrnold (Contributor) commented Apr 12, 2017

Thanks @Ironholds

@lmullen lmullen closed this in 37a7b18 Mar 12, 2018
@lmullen lmullen (Member, Author) commented Mar 12, 2018

@jrnold Thanks again for contributing this tokenizer. I've given it a more thorough review and cleaned up the documentation. There was a fairly significant problem with the code. In a number of instances it ran a regex on the documents but didn't actually save the results. E.g., see this line. You may wish to review the code in this commit to be sure it is what you want.
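
For future readers, the failure mode is the usual R pitfall: gsub() returns a modified copy rather than changing its argument, so a call whose result isn't assigned back is a no-op. A minimal illustration (hypothetical, not the actual package code):

# Illustrative only, not the actual package code: gsub() returns a modified
# copy, so a call whose result is not assigned has no effect on `doc`.
doc <- "they'll"

gsub("'ll", " 'll", doc)         # return value is discarded; `doc` is unchanged
doc                              # still "they'll"

doc <- gsub("'ll", " 'll", doc)  # the fix: assign the result back
doc                              # now "they 'll"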

@jrnold jrnold (Contributor) commented Mar 19, 2018

Wow. That is bad. I'm sorry; I don't even know how I did that. It must have been when I converted from using stringr in the original gist. Clearly, I didn't write tests, which it looks like you did.
