## TextBlob: Simplified Text Processing
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

**TextBlob** stands on the giant shoulders of **NLTK** and **pattern**, and plays nicely with both.

### Features
* Noun phrase extraction
* Part-of-speech tagging
* Sentiment analysis
* Classification (Naive Bayes, Decision Tree)
* Language translation and detection powered by Google Translate
* Tokenization (splitting text into words and sentences)
* Word and phrase frequencies
* Parsing
* n-grams
* Word inflection (pluralization and singularization) and lemmatization
* Spelling correction
* Add new models or languages through extensions
* WordNet integration

In [18]:
from textblob import TextBlob
import re
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

In [3]:
# Create a textblob
blob = TextBlob(text)
blob.tags

[('The', 'DT'),
 ('titular', 'JJ'),
 ('threat', 'NN'),
 ('of', 'IN'),
 ('The', 'DT'),
 ('Blob', 'NNP'),
 ('has', 'VBZ'),
 ('always', 'RB'),
 ('struck', 'VBN'),
 ('me', 'PRP'),
 ('as', 'IN'),
 ('the', 'DT'),
 ('ultimate', 'JJ'),
 ('movie', 'NN'),
 ('monster', 'NN'),
 ('an', 'DT'),
 ('insatiably', 'RB'),
 ('hungry', 'JJ'),
 ('amoeba-like', 'JJ'),
 ('mass', 'NN'),
 ('able', 'JJ'),
 ('to', 'TO'),
 ('penetrate', 'VB'),
 ('virtually', 'RB'),
 ('any', 'DT'),
 ('safeguard', 'NN'),
 ('capable', 'JJ'),
 ('of', 'IN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('doomed', 'JJ'),
 ('doctor', 'NN'),
 ('chillingly', 'RB'),
 ('describes', 'VBZ'),
 ('it', 'PRP'),
 ('assimilating', 'VBG'),
 ('flesh', 'NN'),
 ('on', 'IN'),
 ('contact', 'NN'),
 ('Snide', 'JJ'),
 ('comparisons', 'NNS'),
 ('to', 'TO'),
 ('gelatin', 'VB'),
 ('be', 'VB'),
 ('damned', 'VBN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('concept', 'NN'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('most', 'RBS'),
 ('devastating', 'JJ'),
 ('of', 'IN'),
 ('potenti

In [15]:
blob.noun_phrases

WordList(['titular threat', 'blob', 'ultimate movie monster', 'amoeba-like mass', 'snide', 'potential consequences', 'grey goo scenario', 'technological theorists fearful', 'artificial intelligence run rampant'])

In [25]:
for sentence in blob.sentences:
    print(sentence.sentiment.polarity)

0.06000000000000001
-0.34166666666666673


In [13]:
blob.translate(to='zh-CN')

TextBlob("Blob的名义威胁一直让我成为最终的电影
怪物：一个不能容忍的饥饿，阿米巴样的物质能够穿透
几乎所有的保障措施，都能成为一个悲观的医生
描述它 - “同化接触肉体。
Snide比较明胶是该死的，这是一个最多的概念
破坏性的潜在后果，不像灰色的情况
技术理论家提出的可怕的提议
人工智能猖獗。")

**Language Support for the Phrase-Based Machine Translation Model**

Language|ISO-639-1 Code
--------|--------------
Afrikaans|af
Albanian|sq
Amharic|am
Arabic|ar
Armenian|hy
Azerbaijani|az
Basque|eu
Belarusian|be
Bengali|bn
Bosnian|bs
Bulgarian|bg
Catalan|ca
Cebuano|ceb (ISO-639-2)
Chinese (Simplified)|zh-CN (BCP-47)
Chinese (Traditional)|zh-TW (BCP-47)
Corsican|co
Croatian|hr
Czech|cs
Danish|da
Dutch|nl
English|en
Esperanto|eo
Estonian|et
Finnish|fi
French|fr
Frisian|fy
Galician|gl
Georgian|ka
German|de
Greek|el
Gujarati|gu
Haitian Creole|ht
Hausa|ha
Hawaiian|haw (ISO-639-2)
Hebrew|iw
Hindi|hi
Hmong|hmn (ISO-639-2)
Hungarian|hu
Icelandic|is
Igbo|ig
Indonesian|id
Irish|ga
Italian|it
Japanese|ja
Javanese|jw
Kannada|kn
Kazakh|kk
Khmer|km
Korean|ko
Kurdish|ku
Kyrgyz|ky
Lao|lo
Latin|la
Latvian|lv
Lithuanian|lt
Luxembourgish|lb
Macedonian|mk
Malagasy|mg
Malay|ms
Malayalam|ml
Maltese|mt
Maori|mi
Marathi|mr
Mongolian|mn
Myanmar (Burmese)|my
Nepali|ne
Norwegian|no
Nyanja (Chichewa)|ny
Pashto|ps
Persian|fa
Polish|pl
Portuguese (Portugal, Brazil)|pt
Punjabi|pa
Romanian|ro
Russian|ru
Samoan|sm
Scots Gaelic|gd
Serbian|sr
Sesotho|st
Shona|sn
Sindhi|sd
Sinhala (Sinhalese)|si
Slovak|sk
Slovenian|sl
Somali|so
Spanish|es
Sundanese|su
Swahili|sw
Swedish|sv
Tagalog (Filipino)|tl
Tajik|tg
Tamil|ta
Telugu|te
Thai|th
Turkish|tr
Ukrainian|uk
Urdu|ur
Uzbek|uz
Vietnamese|vi
Welsh|cy
Xhosa|xh
Yiddish|yi
Yoruba|yo
Zulu|zu

In [26]:
testimonial = TextBlob('Textblob is amazingly simple to use. What great fun!')

### Sentiment Analysis
The **sentiment** property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0], where -1.0 is the most negative and 1.0 is the most positive. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

In [27]:
testimonial.sentiment

Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
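
As a rough illustration of how the two scores might be read, the sketch below uses a stand-in namedtuple with the same field names as the result above, so it runs without TextBlob installed. The `describe` helper and its cutoffs (0.1 for polarity, 0.5 for subjectivity) are assumptions chosen for illustration and are not part of the library.

```python
from collections import namedtuple

# Stand-in with the same fields as the Sentiment result shown above.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

def describe(sentiment, neutral_band=0.1):
    """Map the scores to coarse labels (hypothetical helper, arbitrary cutoffs)."""
    if sentiment.polarity > neutral_band:
        tone = "positive"
    elif sentiment.polarity < -neutral_band:
        tone = "negative"
    else:
        tone = "neutral"
    stance = "subjective" if sentiment.subjectivity >= 0.5 else "objective"
    return tone, stance

result = Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
print(describe(result))  # ('positive', 'objective')
```

Because the result is a namedtuple, the fields can also be unpacked directly, as in `polarity, subjectivity = result`.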

### Tokenization
We can break TextBlobs into words or sentences.

Sentence objects have the same properties and methods as TextBlobs.

In [36]:
zen = TextBlob("Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex.")

In [37]:
zen.words

WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])

In [38]:
zen.word_counts

defaultdict(int,
            {'beautiful': 1,
             'better': 3,
             'complex': 1,
             'explicit': 1,
             'implicit': 1,
             'is': 3,
             'simple': 1,
             'than': 3,
             'ugly': 1})
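
Note that the keys above are lowercased, while **words** keeps the original capitalization. A minimal standard-library sketch of the same idea follows. It assumes a simple regular-expression split rather than TextBlob's tokenizer, so results may differ on text with hyphens, digits, or unusual punctuation; `simple_word_counts` is a hypothetical name.

```python
import re
from collections import Counter

def simple_word_counts(text):
    """Lowercase the text, pull out runs of letters/apostrophes, and count them."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

zen_text = ("Beautiful is better than ugly. Explicit is better than implicit. "
            "Simple is better than complex.")
counts = simple_word_counts(zen_text)
print(counts["better"])     # 3
print(counts["beautiful"])  # 1
```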

In [39]:
zen.sentences

[Sentence("Beautiful is better than ugly."),
 Sentence("Explicit is better than implicit."),
 Sentence("Simple is better than complex.")]

In [40]:
for s in zen.sentences:
    print(s.sentiment)

Sentiment(polarity=0.2166666666666667, subjectivity=0.8333333333333334)
Sentiment(polarity=0.5, subjectivity=0.5)
Sentiment(polarity=0.06666666666666667, subjectivity=0.41904761904761906)


### Advanced Usage: Overriding Models and the Blobber Class
TextBlob lets you specify which algorithms you want to use.

The **textblob.sentiments** module contains two sentiment analysis implementations, **PatternAnalyzer** (based on the pattern library) and **NaiveBayesAnalyzer** (an NLTK classifier trained on a movie reviews corpus).

The default implementation is **PatternAnalyzer**, but you can override the analyzer by passing another implementation into a TextBlob’s constructor.

For instance, the **NaiveBayesAnalyzer** returns its result as a namedtuple of the form: **Sentiment(classification, p_pos, p_neg)**.

In [41]:
from textblob.sentiments import NaiveBayesAnalyzer

In [44]:
blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
blob.sentiment

Sentiment(classification='pos', p_pos=0.7996209910191279, p_neg=0.2003790089808724)
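
One way such a result might be consumed is to keep the label only when the classifier is reasonably sure. The sketch below uses a stand-in namedtuple with the same fields as the output above so that it runs without TextBlob or its corpora. The `confident_label` helper and the 0.75 threshold are assumptions for illustration, not library features.

```python
from collections import namedtuple

# Stand-in with the same fields as the NaiveBayesAnalyzer result shown above.
Sentiment = namedtuple("Sentiment", ["classification", "p_pos", "p_neg"])

def confident_label(result, threshold=0.75):
    """Return the label only when its probability clears the threshold (hypothetical helper)."""
    top = max(result.p_pos, result.p_neg)
    return result.classification if top >= threshold else "uncertain"

result = Sentiment(classification="pos", p_pos=0.7996209910191279, p_neg=0.2003790089808724)
print(confident_label(result))       # pos
print(confident_label(result, 0.9))  # uncertain
```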

### Tokenizers
The **words** and **sentences** properties are helpers that use the **textblob.tokenizers.WordTokenizer** and **textblob.tokenizers.SentenceTokenizer** classes, respectively.

You can use other tokenizers, such as those provided by **NLTK**, by passing them into the **TextBlob** constructor and then accessing the **tokens** property.

In [45]:
from textblob import TextBlob
from nltk.tokenize import TabTokenizer
tokenizer = TabTokenizer()
blob = TextBlob('This is\ta rather tabby\tblob.', tokenizer=tokenizer)
blob.tokens

WordList(['This is', 'a rather tabby', 'blob.'])

Another way is to use the **tokenize([tokenizer])** method.

In [47]:
from textblob import TextBlob
from nltk.tokenize import BlanklineTokenizer
tokenizer = BlanklineTokenizer()
blob = TextBlob("A token\n\nof appreciation")
blob.tokenize(tokenizer)

WordList(['A token', 'of appreciation'])
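
To make the behaviour of these two tokenizers concrete, here is a standard-library approximation of what each one did in the examples above. The function names are hypothetical, and NLTK's own implementations may differ in edge cases; this is only a sketch of the splitting rules, not a replacement for the tokenizer classes.

```python
import re

def split_on_tabs(text):
    """Split wherever a tab character appears."""
    return text.split("\t")

def split_on_blank_lines(text):
    """Split on blank lines: two newlines, possibly with whitespace around them."""
    return [chunk for chunk in re.split(r"\s*\n\s*\n\s*", text) if chunk]

print(split_on_tabs("This is\ta rather tabby\tblob."))
# ['This is', 'a rather tabby', 'blob.']
print(split_on_blank_lines("A token\n\nof appreciation"))
# ['A token', 'of appreciation']
```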

### Noun Phrase Chunkers
TextBlob currently has two noun phrase chunker implementations: **textblob.np_extractors.FastNPExtractor** (the default, based on Shlomi Babluki’s implementation, which he describes in a blog post) and **textblob.np_extractors.ConllExtractor**, which uses the CoNLL 2000 corpus to train a tagger.

Use the **np_extractor** argument to explicitly pass an instance of a noun phrase extractor to a TextBlob's constructor.


In [48]:
from textblob.np_extractors import ConllExtractor
extractor = ConllExtractor()
blob = TextBlob("Python is a high-level programming language.", np_extractor=extractor)
blob.noun_phrases

WordList(['python', 'high-level programming language'])

### POS Taggers
