## TextBlob: Simplified Text Processing
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

**TextBlob** stands on the giant shoulders of **NLTK** and **pattern**, and plays nicely with both.

### Features
* Noun phrase extraction
* Part-of-speech tagging
* Sentiment analysis
* Classification (Naive Bayes, Decision Tree)
* Language translation and detection powered by Google Translate
* Tokenization (splitting text into words and sentences)
* Word and phrase frequencies
* Parsing
* n-grams
* Word inflection (pluralization and singularization) and lemmatization
* Spelling correction
* Add new models or languages through extensions
* WordNet integration

In [18]:
from textblob import TextBlob
import re
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

In [3]:
# Create a textblob
blob = TextBlob(text)
blob.tags

[('The', 'DT'),
 ('titular', 'JJ'),
 ('threat', 'NN'),
 ('of', 'IN'),
 ('The', 'DT'),
 ('Blob', 'NNP'),
 ('has', 'VBZ'),
 ('always', 'RB'),
 ('struck', 'VBN'),
 ('me', 'PRP'),
 ('as', 'IN'),
 ('the', 'DT'),
 ('ultimate', 'JJ'),
 ('movie', 'NN'),
 ('monster', 'NN'),
 ('an', 'DT'),
 ('insatiably', 'RB'),
 ('hungry', 'JJ'),
 ('amoeba-like', 'JJ'),
 ('mass', 'NN'),
 ('able', 'JJ'),
 ('to', 'TO'),
 ('penetrate', 'VB'),
 ('virtually', 'RB'),
 ('any', 'DT'),
 ('safeguard', 'NN'),
 ('capable', 'JJ'),
 ('of', 'IN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('doomed', 'JJ'),
 ('doctor', 'NN'),
 ('chillingly', 'RB'),
 ('describes', 'VBZ'),
 ('it', 'PRP'),
 ('assimilating', 'VBG'),
 ('flesh', 'NN'),
 ('on', 'IN'),
 ('contact', 'NN'),
 ('Snide', 'JJ'),
 ('comparisons', 'NNS'),
 ('to', 'TO'),
 ('gelatin', 'VB'),
 ('be', 'VB'),
 ('damned', 'VBN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('concept', 'NN'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('most', 'RBS'),
 ('devastating', 'JJ'),
 ('of', 'IN'),
 ('potenti

In [15]:
blob.noun_phrases

WordList(['titular threat', 'blob', 'ultimate movie monster', 'amoeba-like mass', 'snide', 'potential consequences', 'grey goo scenario', 'technological theorists fearful', 'artificial intelligence run rampant'])

In [25]:
for sentence in blob.sentences:
    print(sentence.sentiment.polarity)

0.06000000000000001
-0.34166666666666673


In [13]:
blob.translate(to='zh-CN')

TextBlob("Blob的名义威胁一直让我成为最终的电影
怪物：一个不能容忍的饥饿，阿米巴样的物质能够穿透
几乎所有的保障措施，都能成为一个悲观的医生
描述它 - “同化接触肉体。
Snide比较明胶是该死的，这是一个最多的概念
破坏性的潜在后果，不像灰色的情况
技术理论家提出的可怕的提议
人工智能猖獗。")

**Language Support for the Phrase-Based Machine Translation Model**

Language|ISO-639-1 Code|Language|ISO-639-1 Code
--------|--------------|--------|--------------
Afrikaans|af|Latin|la
Albanian|sq|Latvian|lv
Amharic|am|Lithuanian|lt
Arabic|ar|Luxembourgish|lb
Armenian|hy|Macedonian|mk
Azerbaijani|az|Malagasy|mg
Basque|eu|Malay|ms
Belarusian|be|Malayalam|ml
Bengali|bn|Maltese|mt
Bosnian|bs|Maori|mi
Bulgarian|bg|Marathi|mr
Catalan|ca|Mongolian|mn
Cebuano|ceb (ISO-639-2)|Myanmar (Burmese)|my
Chinese (Simplified)|zh-CN (BCP-47)|Nepali|ne
Chinese (Traditional)|zh-TW (BCP-47)|Norwegian|no
Corsican|co|Nyanja (Chichewa)|ny
Croatian|hr|Pashto|ps
Czech|cs|Persian|fa
Danish|da|Polish|pl
Dutch|nl|Portuguese (Portugal, Brazil)|pt
English|en|Punjabi|pa
Esperanto|eo|Romanian|ro
Estonian|et|Russian|ru
Finnish|fi|Samoan|sm
French|fr|Scots Gaelic|gd
Frisian|fy|Serbian|sr
Galician|gl|Sesotho|st
Georgian|ka|Shona|sn
German|de|Sindhi|sd
Greek|el|Sinhala (Sinhalese)|si
Gujarati|gu|Slovak|sk
Haitian Creole|ht|Slovenian|sl
Hausa|ha|Somali|so
Hawaiian|haw (ISO-639-2)|Spanish|es
Hebrew|iw|Sundanese|su
Hindi|hi|Swahili|sw
Hmong|hmn (ISO-639-2)|Swedish|sv
Hungarian|hu|Tagalog (Filipino)|tl
Icelandic|is|Tajik|tg
Igbo|ig|Tamil|ta
Indonesian|id|Telugu|te
Irish|ga|Thai|th
Italian|it|Turkish|tr
Japanese|ja|Ukrainian|uk
Javanese|jw|Urdu|ur
Kannada|kn|Uzbek|uz
Kazakh|kk|Vietnamese|vi
Khmer|km|Welsh|cy
Korean|ko|Xhosa|xh
Kurdish|ku|Yiddish|yi
Kyrgyz|ky|Yoruba|yo
Lao|lo|Zulu|zu

### Sentiment Analysis
The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

In [26]:
testimonial = TextBlob('Textblob is amazingly simple to use. What great fun!')

In [27]:
testimonial.sentiment

Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
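
Because the result is a namedtuple, its fields can be read by name or unpacked like a plain tuple. The sketch below uses a stand-in namedtuple with the same field names rather than TextBlob itself. The `label` helper and its 0.1 threshold are assumptions made for illustration and are not part of the library.

```python
from collections import namedtuple

# Stand-in with the same field names as the tuple described above
# (an assumption for illustration; this is not TextBlob's own class).
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])


def label(sentiment, threshold=0.1):
    """Map a polarity score to a coarse label. The threshold is arbitrary."""
    if sentiment.polarity > threshold:
        return "positive"
    if sentiment.polarity < -threshold:
        return "negative"
    return "neutral"


# Values copied from the output shown above.
result = Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
polarity, subjectivity = result  # unpacking works alongside attribute access
print(label(result), round(polarity, 2), round(subjectivity, 2))
```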

### Tokenization
We can break TextBlobs into words or sentences.

Sentence objects have the same properties and methods as TextBlobs.

In [36]:
zen = TextBlob("Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex.")

In [37]:
zen.words

WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])

In [38]:
zen.word_counts

defaultdict(int,
            {'beautiful': 1,
             'better': 3,
             'complex': 1,
             'explicit': 1,
             'implicit': 1,
             'is': 3,
             'simple': 1,
             'than': 3,
             'ugly': 1})

In [39]:
zen.sentences

[Sentence("Beautiful is better than ugly."),
 Sentence("Explicit is better than implicit."),
 Sentence("Simple is better than complex.")]

In [40]:
for s in zen.sentences:
    print(s.sentiment)

Sentiment(polarity=0.2166666666666667, subjectivity=0.8333333333333334)
Sentiment(polarity=0.5, subjectivity=0.5)
Sentiment(polarity=0.06666666666666667, subjectivity=0.41904761904761906)


### Word Inflection and Lemmatization

Each word in **TextBlob.words** or **Sentence.words** is a **Word object** (a subclass of unicode in Python 2, or str in Python 3) with useful methods, e.g. for __word inflection__.

In [1]:
from textblob import TextBlob
sentence = TextBlob("Use 4 spaces per indentation level.")
sentence.words

WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])

In [4]:
# Convert the third word, "spaces", to its singular form.
sentence.words[2].singularize()

'space'

In [6]:
# Convert the last word in the sentence to its plural form.
sentence.words[-1].pluralize()

'levels'

In [3]:
from textblob import Word
w = Word('octopi')

In [4]:
w.lemmatize()

'octopus'

In [7]:
w = Word('went')
w.lemmatize('v')  # Pass in the WordNet part of speech (verb)

'go'

### WordNet Integration

You can access the **synsets** (sets of synonyms) for a Word via the **synsets** property or the **get_synsets** method, optionally passing in a part of speech.

You can access the **definitions** for each synset via the **definitions** property or the **define()** method, which can also take an optional part-of-speech argument.

For more information, see the NLTK documentation on the [WordNet Interface](http://www.nltk.org/howto/wordnet.html).

In [7]:
from textblob import Word
from textblob.wordnet import VERB

word = Word("octopus")
word.synsets

[Synset('octopus.n.01'), Synset('octopus.n.02')]

In [8]:
Word('hack').get_synsets(pos=VERB)

[Synset('chop.v.05'),
 Synset('hack.v.02'),
 Synset('hack.v.03'),
 Synset('hack.v.04'),
 Synset('hack.v.05'),
 Synset('hack.v.06'),
 Synset('hack.v.07'),
 Synset('hack.v.08')]

In [9]:
Word('octopus').definitions

['tentacles of octopus prepared as food',
 'bottom-living cephalopod having a soft oval body with eight long tentacles']

**Create synsets directly.**

In [10]:
from textblob.wordnet import Synset
octopus = Synset("octopus.n.02")
shrimp = Synset("shrimp.n.03")
octopus.path_similarity(shrimp)

0.1111111111111111
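
As far as I understand NLTK's definition, path similarity is 1 / (d + 1), where d is the number of edges on the shortest path between two synsets in the hypernym hierarchy, so the 0.111... above would correspond to d = 8. The sketch below applies that formula to a toy taxonomy. The `PARENT` table is hypothetical and much shallower than WordNet, and the code is an illustration rather than NLTK's implementation.

```python
# Hypothetical toy taxonomy (child -> parent); this is not real WordNet data.
PARENT = {
    "octopus": "cephalopod",
    "cephalopod": "mollusk",
    "mollusk": "invertebrate",
    "shrimp": "crustacean",
    "crustacean": "arthropod",
    "arthropod": "invertebrate",
    "invertebrate": "animal",
}


def ancestors(node):
    """Return {ancestor: hops from node}, including the node itself at 0 hops."""
    hops, seen = 0, {}
    while node is not None:
        seen[node] = hops
        node = PARENT.get(node)
        hops += 1
    return seen


def path_similarity(a, b):
    """Return 1 / (shortest path length + 1), or None if no path exists."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    if not common:
        return None
    # In a tree, the shortest path goes through the nearest shared ancestor.
    distance = min(up_a[c] + up_b[c] for c in common)
    return 1 / (distance + 1)


print(path_similarity("octopus", "shrimp"))
```

In this toy tree the two words are three hops each from their nearest shared ancestor, so d = 6 and the score is 1/7. A word compared with itself has d = 0 and scores 1.0.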

### WordLists

A **WordList** is just a Python list with additional methods.

In [11]:
animals = TextBlob("cat dog octopus")
animals.words

WordList(['cat', 'dog', 'octopus'])

In [12]:
animals.words.pluralize()

WordList(['cats', 'dogs', 'octopodes'])

### Spelling Correction

Use the **correct()** method to attempt spelling correction.

In [13]:
b = TextBlob("I havv goood speling!")
print(b.correct())

I have good spelling!


**Word** objects have a **Word.spellcheck()** method that returns a list of (word, confidence) tuples with spelling suggestions.

Spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector”[1] as implemented in the pattern library. It is about 70% accurate [[2]](https://textblob.readthedocs.io/en/dev/quickstart.html#id5).

In [14]:
from textblob import Word
w = Word("falibility")
w.spellcheck()

[('fallibility', 1.0)]
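
To give a feel for the approach in Norvig's essay, here is a simplified sketch of its core idea: generate every string one edit away from the input, keep those found in a word-frequency table, and rank them by relative frequency. The `WORD_FREQ` table is a tiny hypothetical stand-in for a real corpus, only edit distance 1 is considered, and this is not the code used by pattern or TextBlob.

```python
from collections import Counter

# Hypothetical tiny frequency table; a real corrector counts words in a large corpus.
WORD_FREQ = Counter({"i": 100, "have": 50, "good": 40, "spelling": 10, "fallibility": 1})

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that appear in the frequency table."""
    return {w for w in words if w in WORD_FREQ}


def spellcheck(word):
    """Return (candidate, confidence) pairs, most likely first."""
    word = word.lower()
    # Prefer the word itself if known, then known words one edit away.
    candidates = known([word]) or known(edits1(word)) or {word}
    total = sum(WORD_FREQ[c] for c in candidates) or 1
    scored = [(c, WORD_FREQ[c] / total) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(spellcheck("falibility"))
print([spellcheck(w)[0][0] for w in "I havv goood speling".split()])
```

With such a small table each misspelling has a single known neighbour, so every suggestion gets a confidence of 1.0. With a realistic vocabulary several candidates would usually compete.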

### Get Word and Noun Phrase Frequencies
There are two ways to get the frequency of a word or noun phrase in a TextBlob.

* <span style="color:red">The first is through the word_counts dictionary.</span>
* The second is to use the count() method.

<span style="color:red">I tried this and found that the first method does not work.</span>

In [24]:
monty = TextBlob("We are no longer the Knights who say Ni. We are now the Knights who say Ekki ekki ekki PTANG.")
monty.words.count('ekki')

3

In [26]:
# Specify whether the search is case-sensitive.
monty.words.count('ekki', case_sensitive=True)

2
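
The difference between the two counts comes down to whether words are lowercased before comparison. The stdlib-only sketch below reproduces both numbers, along with a lowercase-keyed counter similar in shape to the `zen.word_counts` output shown earlier. The regex split and the `count` helper are assumptions for illustration and are not TextBlob's tokenizer or implementation.

```python
import re
from collections import Counter

text = ("We are no longer the Knights who say Ni. "
        "We are now the Knights who say Ekki ekki ekki PTANG.")

# Rough word split on runs of word characters (an approximation, not TextBlob's tokenizer).
words = re.findall(r"\w+", text)


def count(words, target, case_sensitive=False):
    """Count occurrences of target, ignoring case unless case_sensitive is True."""
    if case_sensitive:
        return sum(1 for w in words if w == target)
    return sum(1 for w in words if w.lower() == target.lower())


# Lowercase keys, so lookups are case-insensitive by construction.
word_counts = Counter(w.lower() for w in words)

print(count(words, "ekki"))
print(count(words, "ekki", case_sensitive=True))
print(word_counts["ekki"])
```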

In [27]:
monty.noun_phrases

WordList(['ni', 'ekki', 'ekki ekki', 'ptang'])

### Advanced Usage: Overriding Models and the Blobber Class
TextBlob lets you specify which algorithms you want to use.

The **textblob.sentiments** module contains two sentiment analysis implementations, **PatternAnalyzer** (based on the pattern library) and **NaiveBayesAnalyzer** (an NLTK classifier trained on a movie reviews corpus).

The default implementation is **PatternAnalyzer**, but you can override the analyzer by passing another implementation into a TextBlob’s constructor.

For instance, the **NaiveBayesAnalyzer** returns its result as a namedtuple of the form: **Sentiment(classification, p_pos, p_neg)**.

In [41]:
from textblob.sentiments import NaiveBayesAnalyzer

In [44]:
blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
blob.sentiment

Sentiment(classification='pos', p_pos=0.7996209910191279, p_neg=0.2003790089808724)
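
To show where `p_pos` and `p_neg` come from, here is a minimal Naive Bayes sketch over word counts with add-one smoothing. The four training sentences are hypothetical, and the code is a simplified multinomial variant written for illustration. It is not how NLTK's classifier or TextBlob's **NaiveBayesAnalyzer** is implemented, so the probabilities are not comparable to the output above.

```python
import math
from collections import Counter, namedtuple

Sentiment = namedtuple("Sentiment", ["classification", "p_pos", "p_neg"])

# Hypothetical toy training data; the real analyzer is trained on a movie reviews corpus.
TRAIN = [
    ("i love this great library", "pos"),
    ("what a wonderful simple tool", "pos"),
    ("i hate this awful bug", "neg"),
    ("what a terrible confusing mess", "neg"),
]


def train(examples):
    """Count words per class and documents per class."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return Sentiment(classification, p_pos, p_neg) for the given text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in ("pos", "neg"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize the two scores so the probabilities sum to 1.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    p_pos = weights["pos"] / sum(weights.values())
    return Sentiment("pos" if p_pos >= 0.5 else "neg", p_pos, 1 - p_pos)


model = train(TRAIN)
result = classify("I love this library", model)
print(result)
```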

### Tokenizers
The **words** and **sentences** properties are helpers that use the **textblob.tokenizers.WordTokenizer** and **textblob.tokenizers.SentenceTokenizer** classes, respectively.

You can use other tokenizers, such as those provided by **NLTK**, by passing them into the **TextBlob constructor** and then accessing the **tokens** property.

In [45]:
from textblob import TextBlob
from nltk.tokenize import TabTokenizer
tokenizer = TabTokenizer()
blob = TextBlob('This is\ta rather tabby\tblob.', tokenizer=tokenizer)
blob.tokens

WordList(['This is', 'a rather tabby', 'blob.'])

Another way is to use the **tokenize([tokenizer])** method.

In [1]:
from textblob import TextBlob
from nltk.tokenize import BlanklineTokenizer
tokenizer = BlanklineTokenizer()
blob = TextBlob("A token\n\nof appreciation")
blob.tokenize(tokenizer)

WordList(['A token', 'of appreciation'])
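
To make the two splitting rules concrete, the stdlib-only sketch below approximates what the tab and blank-line tokenizers do to the same inputs. The function names are hypothetical and the regular expression is an approximation, not NLTK's implementation.

```python
import re


def tab_tokenize(text):
    """Split on tab characters only, leaving spaces inside each token."""
    return text.split("\t")


def blankline_tokenize(text):
    """Split wherever a blank line (possibly containing whitespace) separates text."""
    return [chunk for chunk in re.split(r"\s*\n\s*\n\s*", text) if chunk]


print(tab_tokenize("This is\ta rather tabby\tblob."))
print(blankline_tokenize("A token\n\nof appreciation"))
```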

### Noun Phrase Chunkers
TextBlob currently has two noun phrase chunker implementations, **textblob.np_extractors.FastNPExtractor** (default, based on Shlomi Babluki’s implementation from this blog post) and **textblob.np_extractors.ConllExtractor**, which uses the CoNLL 2000 corpus to train a tagger.

Use the **np_extractor** argument to explicitly pass an instance of a noun phrase extractor to a TextBlob's constructor.


In [2]:
from textblob.np_extractors import ConllExtractor
extractor = ConllExtractor()
blob = TextBlob("Python is a high-level programming language.", np_extractor=extractor)
blob.noun_phrases

WordList(['python', 'high-level programming language'])

### POS Taggers
