## What is Textblob

* Textblob is an open-source Python library for NLP tasks such as lemmatization, stemming, tokenization, noun phrase extraction, POS tagging, n-grams, and sentiment analysis.

* It is built on top of NLTK and offers a simpler interface; however, it does not provide functionality such as vectorization or dependency parsing.

* Text classification and sentiment analysis can be performed using Textblob.
* Official documentation for Textblob: https://textblob.readthedocs.io/en/dev/

* Installation: pip install textblob

In [None]:
### Install Textblob
!pip install nltk
!pip install textblob

In [None]:
import nltk 
nltk.download('popular')

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

### Functionalities of Textblob
* Language Detection
* Word Correction
* Word Count
* Phrase Extraction
* POS Tagging
* Tokenization
* Pluralization of words using Textblob
* Lemmatization using Textblob
* n-gram in Textblob

#### Language Detection
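* Textblob detects the language of the input text with the help of Google Translate.
* Textblob can also translate text from one language to another. Both features rely on an external web service; they were deprecated in Textblob 0.16.0 and removed in later releases, so the cell below only runs on older versions.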

In [1]:
from textblob import TextBlob
 
blob = TextBlob("Hey John, How are You")
 
print("Detected Language is:",blob.detect_language())
 
print("Input text in Spanish:",blob.translate(to='es'))


Detected Language is: en
Input text in Spanish: hola juan como estas


### Note:
Google has changed its API, while Textblob still calls the older one, so you may get an HTTP 404 error. To avoid this, change the URL in the `translate.py` file of your environment.


Updated URL:
url = "http://translate.google.com/translate_a/t?client=te&format=html&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=at&ie=UTF-8&oe=UTF-8&otf=2&ssel=0&tsel=0&kc=1"


Location of the translate.py file: C:\Users\<user name>\Anaconda3\envs\Rython\Lib\site-packages\textblob\translate.py
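
The path above is specific to one Windows/Anaconda setup. As a rough sketch, the file can be located in whichever environment is active by asking Python where the package lives. The helper name `find_translate_module` is hypothetical, and the function returns `None` when Textblob is not installed or when the installed version no longer ships `translate.py`.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_translate_module() -> Optional[Path]:
    """Return the path to textblob/translate.py, or None if it is unavailable."""
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("textblob")
    if spec is None or spec.origin is None:
        return None  # Textblob is not installed in this environment
    # spec.origin points at textblob/__init__.py; translate.py sits next to it.
    candidate = Path(spec.origin).parent / "translate.py"
    return candidate if candidate.exists() else None


print(find_translate_module())
```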



#### Spelling Correction

In [2]:
from textblob import TextBlob
text=""" ABCD Corp alays values ttheir employees!!!"""

In [3]:
print(text)

 ABCD Corp alays values ttheir employees!!!


In [4]:
blob=TextBlob(text)

In [5]:
blob

TextBlob(" ABCD Corp alays values ttheir employees!!!")

In [7]:
blob.correct()

TextBlob(" ABCD For always values their employees!!!")

In [8]:
TextBlob('hasss').correct()

TextBlob("has")

In [9]:
### Sometimes it fails as well
TextBlob('ur').correct()

TextBlob("or")

### Word Count 
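With word counts, we can find how often a word or noun phrase occurs in a given text. The keys of `word_counts` are lowercased, so looking up a capitalized word such as "Sentiment" returns 0 even though the word appears in the text.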

In [13]:
text="Sentiment Analysis is a process by which we can find the sentiment of a text. Sentiment can be Positive, Negative or Neutral"

In [14]:
blob=TextBlob(text)

In [15]:
blob.word_counts["analysis"]

1

In [16]:
blob.word_counts["Sentiment"]

0

In [18]:
blob.word_counts["sentiment"]

3

In [19]:
blob.word_counts["Analysis"]

0
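
The zero counts for the capitalized keys are consistent with `word_counts` storing lowercased words as keys. The following is a minimal standard-library sketch of that behavior, not Textblob's actual implementation; the regular expression and the name `lowercase_word_counts` are assumptions made for illustration.

```python
import re
from collections import Counter


def lowercase_word_counts(text: str) -> Counter:
    """Count words after lowercasing, so lookups must use lowercase keys."""
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


text = ("Sentiment Analysis is a process by which we can find the sentiment "
        "of a text. Sentiment can be Positive, Negative or Neutral")
counts = lowercase_word_counts(text)

print(counts["sentiment"])          # 3
print(counts["Sentiment"])          # 0, because no key keeps its capital letter
print(counts["Sentiment".lower()])  # 3, after normalizing the key first
```

Lowercasing the lookup key before indexing avoids the silent zero.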

### POS Tagging
With the `tags` property of Textblob, we can label each word of a sentence with its part of speech, such as noun, pronoun, verb, adverb, or adjective.

In [20]:
from textblob import TextBlob
 
text = TextBlob("My name is Adam. I like to read about NLP. I work at ABCD Corp.")
print(text.tags)


[('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Adam', 'NNP'), ('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('read', 'VB'), ('about', 'IN'), ('NLP', 'NNP'), ('I', 'PRP'), ('work', 'VBP'), ('at', 'IN'), ('ABCD', 'NNP'), ('Corp', 'NNP')]


In [22]:
new_tuple=[]
for i in text.tags:
    print(i)
    if 'VBP' not in i[1]:
        new_tuple.append(i) 

('My', 'PRP$')
('name', 'NN')
('is', 'VBZ')
('Adam', 'NNP')
('I', 'PRP')
('like', 'VBP')
('to', 'TO')
('read', 'VB')
('about', 'IN')
('NLP', 'NNP')
('I', 'PRP')
('work', 'VBP')
('at', 'IN')
('ABCD', 'NNP')
('Corp', 'NNP')


In [23]:
new_tuple

[('My', 'PRP$'),
 ('name', 'NN'),
 ('is', 'VBZ'),
 ('Adam', 'NNP'),
 ('I', 'PRP'),
 ('to', 'TO'),
 ('read', 'VB'),
 ('about', 'IN'),
 ('NLP', 'NNP'),
 ('I', 'PRP'),
 ('at', 'IN'),
 ('ABCD', 'NNP'),
 ('Corp', 'NNP')]

In [24]:
value=''
for i in new_tuple:
    value=value+" " + "".join(i[0])

In [25]:
value

' My name is Adam I to read about NLP I at ABCD Corp'
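
The filtering and re-joining steps above can also be written more compactly. This sketch hardcodes the tag list printed earlier so that it runs without Textblob; `drop_tag` is a hypothetical helper name, and it compares tags for equality rather than using the substring test from the loop above, which gives the same result for this input.

```python
from typing import List, Tuple


def drop_tag(tags: List[Tuple[str, str]], unwanted: str) -> str:
    """Join the words whose POS tag differs from `unwanted`."""
    kept = [word for word, tag in tags if tag != unwanted]
    return " ".join(kept)


# Tag list copied from the output of text.tags shown earlier.
tags = [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Adam', 'NNP'),
        ('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('read', 'VB'),
        ('about', 'IN'), ('NLP', 'NNP'), ('I', 'PRP'), ('work', 'VBP'),
        ('at', 'IN'), ('ABCD', 'NNP'), ('Corp', 'NNP')]

print(drop_tag(tags, "VBP"))
```

Using `" ".join(...)` also avoids the leading space that string concatenation in a loop produces.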

#### Tokenization

* Corpus (plural: corpora) - A corpus is a collection of text data. The text may be in one language or a combination of two or more.

* Token - A token is a single unit of text, typically a word, obtained by splitting a text or corpus. The token count of a text includes every occurrence, regardless of how often a word repeats. A token is a string of contiguous characters that lies between two spaces or between a space and a punctuation mark. For example, splitting the string "abc_123_defg" on underscores ("_") yields three tokens: "abc", "123" and "defg".

**What is tokenization?**

Tokenization is the process of splitting a sentence or corpus into its smallest units, i.e. "tokens".
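
To make the idea concrete, here is a rough standard-library sketch of word and sentence tokenization. It is only an illustration under simple assumptions (words are runs of letters and digits, sentences end in `.`, `!` or `?`); Textblob's own tokenizers are more sophisticated, and the function names are hypothetical.

```python
import re
from typing import List

# Splitting on a delimiter, as in the underscore example above.
print("abc_123_defg".split("_"))  # ['abc', '123', 'defg']


def simple_word_tokens(text: str) -> List[str]:
    """Return runs of letters and digits, dropping spaces and punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def simple_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = ("It is free and runs on a variety of platforms, including Windows, "
          "Unix, and macOS. It provides an unparalleled platform.")

print(simple_word_tokens(sample))
print(simple_sentences(sample))
```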

In [26]:
text="""
R is a comprehensive statistical and graphical programming language, which is fast gaining popularity among data analysts. It is free and runs on a variety of platforms, including Windows, Unix, and macOS. It provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner. 
"""

In [27]:
blob_object = TextBlob(text)

In [28]:
# Word tokenization of the sample corpus
corpus_words = blob_object.words

In [29]:
corpus_words

WordList(['R', 'is', 'a', 'comprehensive', 'statistical', 'and', 'graphical', 'programming', 'language', 'which', 'is', 'fast', 'gaining', 'popularity', 'among', 'data', 'analysts', 'It', 'is', 'free', 'and', 'runs', 'on', 'a', 'variety', 'of', 'platforms', 'including', 'Windows', 'Unix', 'and', 'macOS', 'It', 'provides', 'an', 'unparalleled', 'platform', 'for', 'programming', 'new', 'statistical', 'methods', 'in', 'an', 'easy', 'and', 'straightforward', 'manner'])

In [30]:
print(len(corpus_words))

48


In [31]:
corpus_sentences= blob_object.sentences

In [32]:
corpus_sentences

[Sentence("
 R is a comprehensive statistical and graphical programming language, which is fast gaining popularity among data analysts."),
 Sentence("It is free and runs on a variety of platforms, including Windows, Unix, and macOS."),
 Sentence("It provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner.")]

In [33]:
print(len(corpus_sentences))

3


#### Pluralization of words using Textblob 

In [34]:
from textblob import Word
w = Word('Platform')
w.pluralize()

'Platforms'

In [35]:
from textblob import Word
w = Word('Platforms')
w.pluralize()

'Platformss'

In [36]:
blob = TextBlob("Great Learning is a great platform to learn data science. \n It helps community through blogs, Youtube, GLA,etc.")
for word,pos in blob.tags:
    if pos == 'NN':
        print (word.pluralize())

platforms
sciences
communities
etcs


#### Lemmatization using Textblob

In [37]:
blob = TextBlob("Great Learning is a great platform to learn data science. \n It helps community through blogs, Youtube, GLA,etc.")
words = blob.words

for word in words:
    print("ORIGINAL:", word, "| LEMMA:", word.lemmatize(), "| STEM:", word.stem())

ORIGINAL: Great | LEMMA: Great | STEM: great
ORIGINAL: Learning | LEMMA: Learning | STEM: learn
ORIGINAL: is | LEMMA: is | STEM: is
ORIGINAL: a | LEMMA: a | STEM: a
ORIGINAL: great | LEMMA: great | STEM: great
ORIGINAL: platform | LEMMA: platform | STEM: platform
ORIGINAL: to | LEMMA: to | STEM: to
ORIGINAL: learn | LEMMA: learn | STEM: learn
ORIGINAL: data | LEMMA: data | STEM: data
ORIGINAL: science | LEMMA: science | STEM: scienc
ORIGINAL: It | LEMMA: It | STEM: it
ORIGINAL: helps | LEMMA: help | STEM: help
ORIGINAL: community | LEMMA: community | STEM: commun
ORIGINAL: through | LEMMA: through | STEM: through
ORIGINAL: blogs | LEMMA: blog | STEM: blog
ORIGINAL: Youtube | LEMMA: Youtube | STEM: youtub
ORIGINAL: GLA | LEMMA: GLA | STEM: gla
ORIGINAL: etc | LEMMA: etc | STEM: etc


In [None]:
w = Word('learning')
w.lemmatize("n") ## n here represents noun

In [None]:
w = Word('learning')
w.lemmatize("v") ## v here represents verb

In [None]:
w = Word('peoples')
w.lemmatize("n") ## n here represents noun

#### n-gram in Textblob

An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a two-word sequence of words like “really good”, “not good”, or “your homework”, and a 3-gram (more commonly called a trigram) is a three-word sequence of words like “not at all”, or “turn off light”.
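
The definition can be sketched in a few lines of plain Python: slide a window of size `n` over the token list. This is an illustrative helper, not Textblob's implementation; it returns tuples, whereas `blob.ngrams` returns `WordList` objects.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of k tokens has k - n + 1 windows of size n (none if k < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "turn off the light now".split()

print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A text with fewer than `n` tokens yields an empty list.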

In [None]:
blob

In [None]:
blob.ngrams(n=1)

In [None]:
blob.ngrams(n=2)

In [None]:
blob.ngrams(n=3)

In [None]:
blob.ngrams(n=4)