### Text Analytics
So far we have looked at several types of data and different ways of processing them to extract useful information. In this section we turn to one of the most sought-after kinds of data today: text. Customer information is the backbone of many of the best-known companies of the 21st century. Take Google as an example. How can Google provide so many services for free? What profit does it make from them? Broadly speaking, Google analyzes your searches and earns money from advertising that is targeted using that information. Search for a product on Google and you will soon start seeing recommendations for that product on many other websites. This is one of the most basic uses of text analytics: recommending products to customers by analyzing their searches. Similarly, many companies analyze the positive and negative reviews left by customers to try to predict customer behavior. A wealth of such unstructured data exists (emails, Google searches, online surveys, tweets, online reviews, and so on), and all of it can be processed using text analysis to derive key information about people and customers.

## Natural Language Processing (NLP)


Natural language processing (NLP) is the discipline of building machines that can manipulate human language — or data that resembles human language — in the way that it is written, spoken, and organized. It evolved from computational linguistics, which uses computer science to understand the principles of language, but rather than developing theoretical frameworks, NLP is an engineering discipline that seeks to build technology to accomplish useful tasks. NLP can be divided into two overlapping subfields: natural language understanding (NLU), which focuses on semantic analysis or determining the intended meaning of text, and natural language generation (NLG), which focuses on text generation by a machine. NLP is separate from — but often used in conjunction with — speech recognition, which seeks to parse spoken language into words, turning sound into text and vice versa. 

Siri and Alexa are well-known examples of NLP in use.


We will use NLP for text analytics.


There are many libraries available for NLP in Python. We will focus on two of the most widely used:

* Natural Language Toolkit (NLTK)
* spaCy

### Tokenization

Tokenization is the process of breaking a piece of text into a list of sentences or words. When a paragraph is broken down into a list of sentences, it is called sentence tokenization.
Similarly, when the text is broken down further into a list of words, it is known as word tokenization.
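
To make the two levels concrete, here is a minimal sketch that uses only Python's standard `re` module. The splitting rules (split after `.`, `!` or `?` followed by whitespace; treat runs of word characters and single punctuation marks as tokens) and the helper names are simplifying assumptions for illustration, not the algorithms NLTK uses.

```python
import re

text = "Denmark is a Scandinavian country. It's linked to nearby Sweden."

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    """Return runs of word characters, and every other non-space character on its own."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sentences = simple_sent_tokenize(text)
words = [simple_word_tokenize(s) for s in sentences]
print(sentences)
print(words)
```

A rule this naive would wrongly split after abbreviations such as "Dr.", and it breaks "It's" into three tokens, which is why a dedicated tokenizer is normally preferred.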

### Example Text
Below is a sample paragraph. Let's see how tokenization works on it:

Denmark is a Scandinavian country comprising the Jutland Peninsula and numerous islands. It's linked to nearby Sweden via the Öresund bridge. Copenhagen, its capital, is home to royal palaces and colorful Nyhavn harbor, plus the Tivoli amusement park and the iconic “Little Mermaid” statue. Odense is writer Hans Christian Andersen’s hometown, with a medieval core of cobbled streets and half-timbered houses.

Let's tokenize this paragraph, first into sentences and then into words:

In [1]:
# Tokenizing using NLTK
import nltk

data = "Denmark is a Scandinavian country comprising the Jutland Peninsula and numerous islands. It's linked to nearby Sweden via the Öresund bridge. Copenhagen, its capital, is home to royal palaces and colorful Nyhavn harbor, plus the Tivoli amusement park and the iconic “Little Mermaid” statue. Odense is writer Hans Christian Andersen’s hometown, with a medieval core of cobbled streets and half-timbered houses"

nltk.sent_tokenize(data)

['Denmark is a Scandinavian country comprising the Jutland Peninsula and numerous islands.',
 "It's linked to nearby Sweden via the Öresund bridge.",
 'Copenhagen, its capital, is home to royal palaces and colorful Nyhavn harbor, plus the Tivoli amusement park and the iconic “Little Mermaid” statue.',
 'Odense is writer Hans Christian Andersen’s hometown, with a medieval core of cobbled streets and half-timbered houses']

In [2]:
nltk.word_tokenize(data)

['Denmark',
 'is',
 'a',
 'Scandinavian',
 'country',
 'comprising',
 'the',
 'Jutland',
 'Peninsula',
 'and',
 'numerous',
 'islands',
 '.',
 'It',
 "'s",
 'linked',
 'to',
 'nearby',
 'Sweden',
 'via',
 'the',
 'Öresund',
 'bridge',
 '.',
 'Copenhagen',
 ',',
 'its',
 'capital',
 ',',
 'is',
 'home',
 'to',
 'royal',
 'palaces',
 'and',
 'colorful',
 'Nyhavn',
 'harbor',
 ',',
 'plus',
 'the',
 'Tivoli',
 'amusement',
 'park',
 'and',
 'the',
 'iconic',
 '“',
 'Little',
 'Mermaid',
 '”',
 'statue',
 '.',
 'Odense',
 'is',
 'writer',
 'Hans',
 'Christian',
 'Andersen',
 '’',
 's',
 'hometown',
 ',',
 'with',
 'a',
 'medieval',
 'core',
 'of',
 'cobbled',
 'streets',
 'and',
 'half-timbered',
 'houses']


### Importing Libraries and Downloading NLTK Data

In [4]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True


## POS Tags and Chunking

Traditional English grammar distinguishes about nine parts of speech, but NLP tag sets are more fine-grained. NLTK's default tagger uses the Penn Treebank tag set, which splits each part of speech into several tags. For nouns alone there are four:

* NN: noun, singular ('table')
* NNS: noun, plural ('tables')
* NNP: proper noun, singular ('Denmark')
* NNPS: proper noun, plural ('Americans')

The other parts of speech are subdivided in a similar way.


In [6]:
data = 'We will see an example of POS tagging.'

pos = nltk.pos_tag(nltk.word_tokenize(data))

pos

[('We', 'PRP'),
 ('will', 'MD'),
 ('see', 'VB'),
 ('an', 'DT'),
 ('example', 'NN'),
 ('of', 'IN'),
 ('POS', 'NNP'),
 ('tagging', 'NN'),
 ('.', '.')]
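
The fine-grained tags in this output can be collapsed back into coarse categories when that is all an analysis needs. The sketch below works on the `(word, tag)` pairs copied from the output above, so it runs without NLTK. The `COARSE` mapping and the `coarse_category` helper are hypothetical names introduced for illustration, and the mapping covers only a few tag families.

```python
# (word, tag) pairs copied from the tagger output above
tagged = [("We", "PRP"), ("will", "MD"), ("see", "VB"), ("an", "DT"),
          ("example", "NN"), ("of", "IN"), ("POS", "NNP"), ("tagging", "NN"),
          (".", ".")]

# Hypothetical mapping from Penn Treebank tag prefixes to coarse categories
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_category(tag):
    """Return the coarse category for a Penn Treebank tag, or 'other'."""
    for prefix, category in COARSE.items():
        if tag.startswith(prefix):
            return category
    return "other"

# NN, NNS, NNP and NNPS all share the prefix "NN", so they all count as nouns
nouns = [word for word, tag in tagged if coarse_category(tag) == "noun"]
print(nouns)
```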

### Chunking 

Once the text has been POS-tagged, chunking can be used to give it more structure by grouping tokens according to a set of rules that we define. Chunking is also known as shallow parsing.
Let's look at chunking through the following example:

In [7]:
data = 'We will see an example of POS tagging.'

pos = nltk.pos_tag(nltk.word_tokenize(data))

# Once the text is POS-tagged, suppose we want to group nouns
# under a single node that we define ourselves.
# The rule below matches zero or more proper nouns (NNP) followed by a singular noun (NN):
my_node = "MN: {<NNP>*<NN>}"

chunk = nltk.RegexpParser(my_node)
result = chunk.parse(pos)
print(result)
result.draw()    # opens a window that shows the chunk tree graphically

(S
  We/PRP
  will/MD
  see/VB
  an/DT
  (MN example/NN)
  of/IN
  (MN POS/NNP tagging/NN)
  ./.)


### Graphical representation

<img src="chunk.PNG">


We can see that the NN and NNP tokens are now grouped under "MN" (the label we chose in the rule): `example` forms one chunk and `POS tagging` forms another.

So, whenever we need to group a sequence of tags under a single label, we can use chunking.
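
To see what the rule `{<NNP>*<NN>}` is doing, here is a rough pure-Python sketch of the same grouping applied to the tagged tokens from the example. The helper `chunk_nouns` is hypothetical and handles only this one pattern; `nltk.RegexpParser` is far more general.

```python
tagged = [("We", "PRP"), ("will", "MD"), ("see", "VB"), ("an", "DT"),
          ("example", "NN"), ("of", "IN"), ("POS", "NNP"), ("tagging", "NN"),
          (".", ".")]

def chunk_nouns(tagged, label="MN"):
    """Group every run of zero or more NNP tokens followed by one NN token."""
    result = []
    i = 0
    while i < len(tagged):
        j = i
        # Consume any leading proper nouns
        while j < len(tagged) and tagged[j][1] == "NNP":
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":
            # Pattern matched: emit tokens i..j as one labelled chunk
            result.append((label, tagged[i:j + 1]))
            i = j + 1
        else:
            # No match starting here: keep the single token unchanged
            result.append(tagged[i])
            i += 1
    return result

result = chunk_nouns(tagged)
# Chunks are (label, list_of_tokens); plain tokens are (word, tag)
chunks = [item for item in result if isinstance(item[1], list)]
print(chunks)
```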

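### Stop Words

Stop words are very common words such as "the", "is" and "and" that carry little meaning on their own and are often removed before analysis. NLTK ships with a list of English stop words:

In [8]: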
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

Let's remove the punctuation and stop words from our example paragraph:

In [9]:
import nltk
import string
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
punct = string.punctuation

data = "Denmark is a Scandinavian country comprising the Jutland Peninsula and numerous islands. It's linked to nearby Sweden via the Öresund bridge. Copenhagen, its capital, is home to royal palaces and colorful Nyhavn harbor, plus the Tivoli amusement park and the iconic “Little Mermaid” statue. Odense is writer Hans Christian Andersen’s hometown, with a medieval core of cobbled streets and half-timbered houses"
clean_data = []
for word in nltk.word_tokenize(data):
    # Keep tokens that are neither ASCII punctuation nor (lowercase) stop words
    if word not in punct and word not in stop_words:
        clean_data.append(word)

clean_data

['Denmark',
 'Scandinavian',
 'country',
 'comprising',
 'Jutland',
 'Peninsula',
 'numerous',
 'islands',
 'It',
 "'s",
 'linked',
 'nearby',
 'Sweden',
 'via',
 'Öresund',
 'bridge',
 'Copenhagen',
 'capital',
 'home',
 'royal',
 'palaces',
 'colorful',
 'Nyhavn',
 'harbor',
 'plus',
 'Tivoli',
 'amusement',
 'park',
 'iconic',
 '“',
 'Little',
 'Mermaid',
 '”',
 'statue',
 'Odense',
 'writer',
 'Hans',
 'Christian',
 'Andersen',
 '’',
 'hometown',
 'medieval',
 'core',
 'cobbled',
 'streets',
 'half-timbered',
 'houses']
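
Notice that `It`, `'s` and the curly quotes `“`, `”` and `’` survived the filter: the NLTK stop-word list is lowercase, and `string.punctuation` contains only ASCII punctuation. One possible refinement, sketched below, compares lowercased tokens and uses the standard `unicodedata` module to detect punctuation. The small `STOP_WORDS` set and the short token list are stand-ins chosen for illustration so that the snippet runs without NLTK, and the helper names are hypothetical.

```python
import unicodedata

# Stand-in for stopwords.words('english'); an assumption for this sketch
STOP_WORDS = {"it", "is", "a", "the", "to", "its", "and", "s"}

def is_punctuation(token):
    """True if every character is in a Unicode punctuation category (P*)."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def clean(tokens):
    kept = []
    for token in tokens:
        # Strip leading/trailing apostrophes so that "'s" is compared as "s"
        normalized = token.lower().strip("'’")
        if not normalized or is_punctuation(token):
            continue
        if normalized in STOP_WORDS:
            continue
        kept.append(token)
    return kept

tokens = ["It", "'s", "linked", "to", "the", "iconic",
          "“", "Little", "Mermaid", "”", "statue", "."]
cleaned = clean(tokens)
print(cleaned)
```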

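### Stemming

Stemming reduces a word to a root form by stripping suffixes; the result does not have to be a dictionary word. NLTK provides several stemmers, including Porter, Lancaster and Snowball:

In [10]: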
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer, SnowballStemmer

lancaster = LancasterStemmer()
porter = PorterStemmer()
Snowball = SnowballStemmer("english")
print('Porter stemmer')
print(porter.stem("hobby"))
print(porter.stem("hobbies"))
print(porter.stem("computer"))
print(porter.stem("computation"))
print("**************************")  
print('lancaster stemmer')
print(lancaster.stem("hobby"))
print(lancaster.stem("hobbies"))
print(lancaster.stem("computer"))
print(lancaster.stem("computation"))
print("**************************")  
print('Snowball stemmer')
print(Snowball.stem("hobby"))
print(Snowball.stem("hobbies"))
print(Snowball.stem("computer"))
print(Snowball.stem("computation"))

Porter stemmer
hobbi
hobbi
comput
comput
**************************
lancaster stemmer
hobby
hobby
comput
comput
**************************
Snowball stemmer
hobbi
hobbi
comput
comput


The same three stemmers applied to a whole sentence:

In [11]:
sent = "I was going to the office on my bike when i saw a car passing by hit the tree."
token = list(nltk.word_tokenize(sent))
for stemmer in (Snowball, lancaster, porter):
    stemm = [stemmer.stem(t) for t in token]
    print(" ".join(stemm))

i was go to the offic on my bike when i saw a car pass by hit the tree .
i was going to the off on my bik when i saw a car pass by hit the tre .
i wa go to the offic on my bike when i saw a car pass by hit the tree .


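A stemmer works by stripping suffixes, so it cannot relate an irregular form such as "ran" to "run":

In [12]: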
print(porter.stem("running"))
print(porter.stem("runs"))
print(porter.stem("ran"))

run
run
ran


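### Lemmatization

Lemmatization maps a word to its dictionary form (its lemma) by looking it up in a vocabulary, here WordNet. To do this it needs to know the part of speech; by default `lemmatize` treats every word as a noun:

In [13]: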
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()

print(lemma.lemmatize('running'))
print(lemma.lemmatize('runs'))
print(lemma.lemmatize('ran'))

running
run
ran


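When we pass `pos='v'`, the lemmatizer treats each word as a verb and returns "run" in all three cases:

In [14]: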
print(lemma.lemmatize('running',pos='v'))
print(lemma.lemmatize('runs',pos='v'))
print(lemma.lemmatize('ran',pos='v'))

run
run
run
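
In practice the `pos` argument is usually derived from the POS tagger's output rather than typed by hand. A common approach, sketched below, maps Penn Treebank tag prefixes to the single-letter codes that WordNet uses (`'n'`, `'v'`, `'a'`, `'r'`). The helper name `penn_to_wordnet` is hypothetical, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("JJ"):
        return "a"   # adjective
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # noun, and fallback for everything else

for tag in ("VBG", "NNS", "JJ", "RB", "DT"):
    print(tag, "->", penn_to_wordnet(tag))
```

The result could then be passed along as `lemma.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair returned by `nltk.pos_tag`.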


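### Named Entity Recognition

Named entity recognition (NER) finds and labels the names of real-world entities such as people, places and organizations. `nltk.ne_chunk` works on POS-tagged tokens:

In [15]: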
sent = "Denmark is a Nordic constituent country in Northern Europe. It is the most populous and politically central constituent of the Kingdom of Denmark"

words = nltk.word_tokenize(sent)
pos_tag = nltk.pos_tag(words)
namedEntity = nltk.ne_chunk(pos_tag)
print(namedEntity)
namedEntity.draw()

(S
  (GPE Denmark/NNP)
  is/VBZ
  a/DT
  (GPE Nordic/JJ)
  constituent/JJ
  country/NN
  in/IN
  (GPE Northern/NNP Europe/NNP)
  ./.
  It/PRP
  is/VBZ
  the/DT
  most/RBS
  populous/JJ
  and/CC
  politically/RB
  central/JJ
  constituent/NN
  of/IN
  the/DT
  (ORGANIZATION Kingdom/NNP)
  of/IN
  (GPE Denmark/NNP))


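### Parsing

Parsing analyzes the grammatical structure of a sentence. NLTK provides many parsers in its `nltk.parse` module:

In [16]: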
dir(nltk.parse)

['BllipParser',
 'BottomUpChartParser',
 'BottomUpLeftCornerChartParser',
 'BottomUpProbabilisticChartParser',
 'ChartParser',
 'CoreNLPDependencyParser',
 'CoreNLPParser',
 'DependencyEvaluator',
 'DependencyGraph',
 'EarleyChartParser',
 'FeatureBottomUpChartParser',
 'FeatureBottomUpLeftCornerChartParser',
 'FeatureChartParser',
 'FeatureEarleyChartParser',
 'FeatureIncrementalBottomUpChartParser',
 'FeatureIncrementalBottomUpLeftCornerChartParser',
 'FeatureIncrementalChartParser',
 'FeatureIncrementalTopDownChartParser',
 'FeatureTopDownChartParser',
 'IncrementalBottomUpChartParser',
 'IncrementalBottomUpLeftCornerChartParser',
 'IncrementalChartParser',
 'IncrementalLeftCornerChartParser',
 'IncrementalTopDownChartParser',
 'InsideChartParser',
 'LeftCornerChartParser',
 'LongestChartParser',
 'MaltParser',
 'NaiveBayesDependencyScorer',
 'NonprojectiveDependencyParser',
 'ParserI',
 'ProbabilisticNonprojectiveParser',
 'ProbabilisticProjectiveDependencyParser',
 'ProjectiveDepe

To use a parser we first need a grammar. Here is a small context-free grammar (CFG):

In [17]:
grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "slept" | "walked"
  NP -> "Rasmus" | "Anja" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)

A recursive descent parser can then use this grammar to build the parse tree of a sentence:

In [18]:
sent = "Rasmus saw Anja with a dog".split()
parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse(sent):
    print(tree) 
    tree.draw()

(S
  (NP Rasmus)
  (VP (V saw) (NP Anja) (PP (P with) (NP (Det a) (N dog)))))
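
To give a feel for what a recursive descent parser does, here is a small pure-Python sketch that tries each production of the grammar top-down and keeps the derivations that consume the whole sentence. The `GRAMMAR` dictionary restates the CFG from the example; the functions `parse` and `expand` and the nested-tuple tree format are assumptions made for this illustration, not NLTK's implementation.

```python
# The same grammar as above, written as a plain dictionary:
# each non-terminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "V": [["saw"], ["slept"], ["walked"]],
    "NP": [["Rasmus"], ["Anja"], ["Det", "N"], ["Det", "N", "PP"]],
    "Det": [["a"], ["an"], ["the"], ["my"]],
    "N": [["man"], ["dog"], ["cat"], ["telescope"], ["park"]],
    "P": [["in"], ["on"], ["by"], ["with"]],
}

def parse(symbol, tokens, start):
    """Yield (tree, end) for each way `symbol` can derive tokens[start:end]."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the next word exactly
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        for children, end in expand(rhs, tokens, start):
            yield (symbol, children), end

def expand(symbols, tokens, start):
    """Yield (subtrees, end) for each way to derive the whole symbol sequence."""
    if not symbols:
        yield [], start
        return
    for tree, mid in parse(symbols[0], tokens, start):
        for rest, end in expand(symbols[1:], tokens, mid):
            yield [tree] + rest, end

sent = "Rasmus saw Anja with a dog".split()
# Keep only the parses that consume the whole sentence
trees = [tree for tree, end in parse("S", sent, 0) if end == len(sent)]
print(len(trees))
print(trees[0])
```

This sketch terminates only because the grammar has no left-recursive rule (such as `NP -> NP PP`); with one, the top-down expansion would recurse forever.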


### N-grams

An n-gram is a sequence of n consecutive tokens taken from a text. `CountVectorizer` can extract n-grams through its `ngram_range` parameter:

In [19]:
# n-grams
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is an example of n-gram!"]

# Fit one vectorizer per n-gram size, from 1-grams up to 4-grams
for n in range(1, 5):
    vect = CountVectorizer(ngram_range=(n, n))
    vect.fit(docs)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(f"{n}-gram  :", vect.get_feature_names_out().tolist())

1-gram  : ['an', 'example', 'gram', 'is', 'of', 'this']
2-gram  : ['an example', 'example of', 'is an', 'of gram', 'this is']
3-gram  : ['an example of', 'example of gram', 'is an example', 'this is an']
4-gram  : ['an example of gram', 'is an example of', 'this is an example']
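
Note that the `n` of "n-gram" is missing from the output and everything is lowercase: by default `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters. The sketch below reproduces the same n-grams with the standard library alone, which shows how little is going on underneath; the helper name `ngrams` is hypothetical.

```python
import re

def ngrams(text, n):
    """Return the sorted, de-duplicated word n-grams of `text`."""
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return sorted(grams)

text = "This is an example of n-gram!"
for n in range(1, 5):
    print(f"{n}-gram  :", ngrams(text, n))
```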




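### Bag of Words

A bag-of-words model represents a document by the words it contains and how often each one occurs, ignoring word order. `CountVectorizer` builds the vocabulary:

In [20]:
## Bag Of Words
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is an example of bag of words!"]
vect1 = CountVectorizer()
vect1.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("bag of words :", vect1.get_feature_names_out().tolist())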

bag of words : ['an', 'bag', 'example', 'is', 'of', 'this', 'words']
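
The output above lists only the vocabulary. The "bag" itself is the vector of counts for each vocabulary word, which a short standard-library sketch makes visible. The tokenization here (lowercase, tokens of two or more word characters) is an assumption chosen to mirror `CountVectorizer`'s defaults.

```python
import re
from collections import Counter

text = "This is an example of bag of words!"

# Lowercase and keep tokens of two or more word characters
tokens = re.findall(r"\b\w\w+\b", text.lower())
bag = Counter(tokens)

vocabulary = sorted(bag)                      # one column per distinct word
vector = [bag[word] for word in vocabulary]   # how often each word occurs
print(vocabulary)
print(vector)
```

The word "of" occurs twice in the sentence, so its entry in the count vector is 2 while every other entry is 1.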


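### TF-IDF

TF-IDF (term frequency-inverse document frequency) weights each word by how often it occurs in a document, and scales that weight down for words that occur in many documents:

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfid = TfidfVectorizer(smooth_idf=False)

doc = ["This is an example.", "We will see how it works.", "IDF can be confusing"]

doc_vector = tfid.fit_transform(doc)
# One row per document, one column per vocabulary term
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
df = pd.DataFrame(doc_vector.toarray(), columns=tfid.get_feature_names_out())
df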




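| | an | be | can | confusing | example | how | idf | is | it | see | this | we | will | works |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.408248 | 0.0 | 0.0 | 0.408248 | 0.408248 | 0.0 | 0.408248 | 0.408248 | 0.408248 |
| 2 | 0.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |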
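
Where do values such as 0.5 and 0.408248 come from? The sketch below recomputes them by hand with the standard library. It assumes the formula scikit-learn documents for `smooth_idf=False`, namely idf(t) = ln(N / df(t)) + 1, followed by L2 normalization of each row; the tokenization and the helper name `tfidf_row` are assumptions made for this illustration.

```python
import math
import re
from collections import Counter

docs = ["This is an example.", "We will see how it works.", "IDF can be confusing"]

# Lowercase and keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency: the number of documents that contain each term
df_counts = Counter(t for tokens in tokenized for t in set(tokens))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # Term count times unsmoothed IDF: ln(N / df) + 1
    raw = [counts[t] * (math.log(n_docs / df_counts[t]) + 1) for t in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]   # L2-normalize the row

rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print([round(x, 6) for x in row])
```

Because no word is shared between the three documents, every term has the same IDF here, so the weights reflect only the normalization: four equal terms give 1/√4 = 0.5 each, and six equal terms give 1/√6 ≈ 0.408248 each.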


In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfid = TfidfVectorizer()

doc = ["Let's use python!", "Sklearn has package for Tf-idf.", "Vectorization is fun!"]

doc_vector = tfid.fit_transform(doc)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(doc_vector.toarray(), columns=tfid.get_feature_names_out())




In [23]:
df

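| | for | fun | has | idf | is | let | package | python | sklearn | tf | use | vectorization |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.57735 | 0.0 |
| 1 | 0.408248 | 0.0 | 0.408248 | 0.408248 | 0.0 | 0.0 | 0.408248 | 0.0 | 0.408248 | 0.408248 | 0.0 | 0.0 |
| 2 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 |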


