Fix #166 to use specified tokenizer when tagging. #167

Merged
merged 3 commits into sloria:dev on Nov 21, 2017

Conversation

@jschnurr
Collaborator

jschnurr commented Jun 4, 2017

Let me summarise my approach; I welcome any feedback.

For reference: the pos-tagging logic on BaseBlob populates both the pos_tags and tags properties.

Currently, pos_tags hands the raw string to the pos_tagger defined on the object; the text is neither tokenized nor broken into sentences first. The tagger class (PatternTagger, or the default NLTKTagger) is fully responsible for deriving the part-of-speech tags using only string input.

Visualized as an NLP pipeline, it looks like this:

TextBlob -> .raw -> tagger -> tags
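A minimal sketch of that pre-change flow (a pared-down illustration, not the actual textblob source; SimplifiedBlob is a hypothetical stand-in for BaseBlob):

# Pared-down illustration of the pre-change flow. The tagger sees only
# the raw string, so a tokenizer set on the blob is ignored entirely.
class SimplifiedBlob(object):
    def __init__(self, text, pos_tagger, tokenizer=None):
        self.raw = text
        self.pos_tagger = pos_tagger
        self.tokenizer = tokenizer  # never consulted during tagging here

    @property
    def tags(self):
        # The tagger tokenizes internally with its own defaults.
        return self.pos_tagger.tag(self.raw)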

In this PR, we use the .sentences property to first break the blob into sentences, and then pass each sentence blob to the tagger. The tagger, now receiving a BaseBlob instead of just a string, can choose to use the blob's tokenizer, depending on the implementation.

As a pipeline:

TextBlob -> Sentence -> tokenizer (optional) -> tagger -> tags

Based on their underlying dependencies, NLTKTagger will use the tokenizer, while PatternTagger will ignore it.
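Continuing the simplified class above, the post-change property could be sketched like this (illustrative, not the exact diff; it leans on the .sentences property that real blobs expose):

# Continuing SimplifiedBlob from the earlier sketch.
@property
def tags(self):
    # Post-change flow: tag sentence by sentence, handing the tagger the
    # sentence blob itself rather than a raw string. A blob-aware tagger
    # can then tokenize via the blob's own tokenizer.
    result = []
    for sentence in self.sentences:
        result.extend(self.pos_tagger.tag(sentence))
    # The real property also drops punctuation-only tags, which is why
    # '$' and '.' are absent from the output shown below.
    return result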

Here's what it looks like.

Before:

>>> blob = tb.TextBlob("Good muffins cost $3.88\nin New York.", tokenizer=nltk.tokenize.regexp.WordPunctTokenizer())
>>> blob.words
WordList(['Good', 'muffins', 'cost', '3.88', 'in', 'New', 'York'])
>>> blob.tokens
WordList(['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.'])
>>> blob.tags
[('Good', u'JJ'), ('muffins', u'NNS'), ('cost', u'VBP'), ('3.88', u'CD'), ('in', u'IN'), ('New', u'NNP'), ('York', u'NNP')]

After:

>>> blob = tb.TextBlob("Good muffins cost $3.88\nin New York.", tokenizer=nltk.tokenize.regexp.WordPunctTokenizer())
>>> blob.words
WordList(['Good', 'muffins', 'cost', '3.88', 'in', 'New', 'York'])
>>> blob.tokens
WordList(['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.'])
>>> blob.tags
[('Good', u'JJ'), ('muffins', u'NNS'), ('cost', u'VBP'), ('3', u'CD'), ('88', u'CD'), ('in', u'IN'), ('New', u'NNP'), ('York', u'NNP')]

Notice that $3.88 is handled differently by blob.tags: before the change, tagging fell back to the default tokenizer; after it, tagging honors the WordPunctTokenizer set on the BaseBlob object, which splits $3.88 apart.
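For anyone writing a custom tagger, the practical upshot is that tag() may now receive a sentence blob rather than a plain string. BaseTagger is textblob's real extension point, but the body below is only an illustrative sketch (it also assumes the NLTK perceptron tagger data is installed):

import nltk
from textblob.base import BaseTagger

class BlobAwareTagger(BaseTagger):
    # Illustrative only: honor the blob's tokenizer when one is available.
    def tag(self, blob_or_text):
        if hasattr(blob_or_text, 'tokens'):
            # A blob exposes .tokens, which uses the tokenizer set on it.
            tokens = list(blob_or_text.tokens)
        else:
            # Fall back to naive whitespace splitting for plain strings.
            tokens = blob_or_text.split()
        return nltk.pos_tag(tokens)

Since TextBlob accepts a pos_tagger argument, a blob built as tb.TextBlob(text, pos_tagger=BlobAwareTagger(), tokenizer=...) would then tag against whatever tokenizer was supplied alongside it.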

@sloria

Thanks @jschnurr for the PR. I am good with this change. Just a minor tweak to make, and this should be good to merge.

@@ -3,10 +3,11 @@
 from __future__ import absolute_import
 import nltk
+import six

@sloria
Owner

sloria commented on the diff Nov 12, 2017

Use textblob.compat instead of six.
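
Concretely, the requested tweak just swaps the import source. A hedged sketch, assuming textblob.compat exposes a text-type alias equivalent to the one used from six (verify the exact name in textblob/compat.py):

# Before (as in the diff): rely on six for the Python 2/3 text type.
import six
print(six.text_type("Good muffins cost $3.88"))

# After (requested): use the package's own compat shim instead of six.
# Assumption: textblob.compat provides an equivalent alias, shown here
# as `unicode`; check the module for the exact name.
from textblob.compat import unicode
print(unicode("Good muffins cost $3.88"))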

@sloria

Owner

sloria commented Nov 12, 2017

Also, getting this branch up-to-date with master should fix the Travis build.

@jschnurr

Collaborator

jschnurr commented Nov 21, 2017

Ok @sloria, should be all good now.

@sloria

sloria approved these changes Nov 21, 2017

@sloria merged commit 2cc13fc into sloria:dev on Nov 21, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@sloria

Owner

sloria commented Nov 21, 2017

Thanks! This is in 0.14.0.

By the way, I'm starting to use something along the lines of the Pull Request Hack. Basically, if you get a PR merged in, you get an invite to have commit access.

No pressure to do any more work than you already have. Just want to open the door to further collaboration. =)
