
Fix #166 to use specified tokenizer when tagging. #167

Merged
merged 3 commits into sloria:dev on Nov 21, 2017

Conversation

@jschnurr
Collaborator

@jschnurr jschnurr commented Jun 4, 2017

Let me summarise my approach; any feedback is welcome.

For reference: the pos_tag method on BaseBlob populates both the pos_tags and tags properties.

Currently, pos_tag provides the raw string to the pos_tagger defined on the object; it's neither tokenized nor broken down into sentences. The tagger class (PatternTagger or default NLTKTagger) is fully responsible for deriving the part-of-speech tags using only string input.

Visualized as an NLP pipeline, it looks like this:

TextBlob -> .raw -> tagger -> tags

In this PR, we use the .sentences property to first break the object down, and then pass each sentence blob to the tagger. The tagger, now having a BaseBlob instead of just a string, can choose to use the tokenizer, depending on the implementation.

As a pipeline:

TextBlob -> Sentence -> tokenizer (optional) -> tagger -> tags

Based on their underlying dependencies, NLTKTagger will use the tokenizer, while PatternTagger will ignore it.
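The pipeline change above can be sketched in plain Python. This is a minimal, hypothetical mock-up of the design (not the actual TextBlob source; all class names here are stand-ins): the new-style pos_tag hands each sentence blob to the tagger, so a token-aware tagger can read the blob's pre-computed tokens, while a raw-string tagger can still ignore them.

```python
import re

class Sentence:
    """Stand-in for textblob's Sentence: raw text plus tokens
    produced by the tokenizer set on the parent blob."""
    def __init__(self, raw, tokenizer):
        self.raw = raw
        self.tokens = tokenizer(raw)

class TokenAwareTagger:
    """Like NLTKTagger after this PR: tags the blob's own tokens."""
    def tag(self, blob):
        return [(tok, "NN") for tok in blob.tokens]  # dummy POS tags

class RawStringTagger:
    """Like PatternTagger: ignores the blob's tokenizer and works
    from the raw string directly."""
    def tag(self, blob):
        return [(tok, "NN") for tok in blob.raw.split()]

def pos_tag(sentences, tagger):
    """New-style pos_tag: tag each sentence blob, then concatenate."""
    tags = []
    for sentence in sentences:
        tags.extend(tagger.tag(sentence))
    return tags

# Approximation of WordPunctTokenizer: words vs. punctuation runs.
wordpunct = lambda s: re.findall(r"\w+|[^\w\s]+", s)

sents = [Sentence("Good muffins cost $3.88", wordpunct)]
print(pos_tag(sents, TokenAwareTagger()))  # splits $3.88 into $, 3, ., 88
print(pos_tag(sents, RawStringTagger()))   # keeps $3.88 whole
```

With a TokenAwareTagger the `$3.88` token is split per the blob's tokenizer, mirroring the "After" output below; with a RawStringTagger the tokenizer is bypassed, mirroring the "Before" behavior.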

Here's what it looks like.

Before:

>>> blob = tb.TextBlob("Good muffins cost $3.88\nin New York.", tokenizer=nltk.tokenize.regexp.WordPunctTokenizer())
>>> blob.words
WordList(['Good', 'muffins', 'cost', '3.88', 'in', 'New', 'York'])
>>> blob.tokens
WordList(['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.'])
>>> blob.tags
[('Good', u'JJ'), ('muffins', u'NNS'), ('cost', u'VBP'), ('3.88', u'CD'), ('in', u'IN'), ('New', u'NNP'), ('York', u'NNP')]

After:

>>> blob = tb.TextBlob("Good muffins cost $3.88\nin New York.", tokenizer=nltk.tokenize.regexp.WordPunctTokenizer())
>>> blob.words
WordList(['Good', 'muffins', 'cost', '3.88', 'in', 'New', 'York'])
>>> blob.tokens
WordList(['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.'])
>>> blob.tags
[('Good', u'JJ'), ('muffins', u'NNS'), ('cost', u'VBP'), ('3', u'CD'), ('88', u'CD'), ('in', u'IN'), ('New', u'NNP'), ('York', u'NNP')]

Notice that $3.88 is handled differently in blob.tags: before this change, tagging used the default tokenizer; after it, tagging honors the tokenizer set on the BaseBlob object.

Owner

@sloria sloria left a comment

Thanks @jschnurr for the PR. I am good with this change. Just a minor tweak to make, and this should be good to merge.

@@ -3,10 +3,11 @@
from __future__ import absolute_import

import nltk
import six


@sloria

sloria Nov 12, 2017
Owner

Use textblob.compat instead of six.

@sloria
Owner

@sloria sloria commented Nov 12, 2017

Also, getting this branch up-to-date with master should fix the Travis build.

@jschnurr
Collaborator Author

@jschnurr jschnurr commented Nov 21, 2017

Ok @sloria, should be all good now.

@sloria
sloria approved these changes Nov 21, 2017
@sloria sloria merged commit 2cc13fc into sloria:dev Nov 21, 2017
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
@sloria
Owner

@sloria sloria commented Nov 21, 2017

Thanks! This is in 0.14.0.

By the way, I'm starting to use something along the lines of the Pull Request Hack. Basically, if you get a PR merged in, you get an invite to have commit access.

No pressure to do any more work than you already have. Just want to open the door to further collaboration. =)
