# 7. Text Analytics
1. Extract a sample document and apply the following preprocessing methods: tokenization, POS tagging, stop-word removal, stemming, and lemmatization.
2. Create a representation of the document by calculating term frequency (TF) and inverse document frequency (IDF).


In NLTK, Punkt is an unsupervised, trainable sentence tokenizer. It can be trained on unlabeled data, that is, data that has not been tagged with information identifying its characteristics, properties, or categories. The `punkt` package downloaded below provides the pre-trained models that `sent_tokenize()` and `word_tokenize()` rely on.

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Tokenization

* **Word tokenization:** `word_tokenize()` splits a sentence into tokens (words and punctuation marks).
* **Sentence tokenization:** `sent_tokenize()` splits a document or paragraph into sentences.



In [21]:
# Split text into word tokens
from nltk.tokenize import word_tokenize
w1 = word_tokenize("Hello My name is Mayur.")
print(w1)

['Hello', 'My', 'name', 'is', 'Mayur', '.']


In [22]:
from nltk.tokenize import sent_tokenize
w2 = sent_tokenize("hello my name is mayur. I am computer engineering student.")
print(w2)

['hello my name is mayur.', 'I am computer engineering student.']


## POS (Part of Speech) Tagging
A part-of-speech (POS) tag describes how a word is used in a sentence. The same word can play different grammatical roles and carry different meanings depending on its context. Basic natural language processing (NLP) models such as bag-of-words (BoW) fail to capture these relationships between words. POS tagging assigns each word a tag based on its context in the data, and the tags can also be used to extract relationships between words.


In [13]:
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [23]:
text ="Are you afraid of something?"
word = word_tokenize(text)
pos_tag(word)

[('Are', 'NNP'),
 ('you', 'PRP'),
 ('afraid', 'IN'),
 ('of', 'IN'),
 ('something', 'NN'),
 ('?', '.')]
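
The tags in the output above are Penn Treebank abbreviations, which are hard to read without a key. The sketch below attaches a short gloss to each tag; the `TAG_GLOSS` table and the `describe` helper are illustrative names introduced here, not part of NLTK, and the table covers only the tags that occur in this example.

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [("Are", "NNP"), ("you", "PRP"), ("afraid", "IN"),
          ("of", "IN"), ("something", "NN"), ("?", ".")]

# Glosses for the Penn Treebank tags seen in this example only
# (a hand-written, partial table; assumption for illustration).
TAG_GLOSS = {
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "IN": "preposition or subordinating conjunction",
    "NN": "noun, singular or mass",
    ".": "sentence-final punctuation",
}

def describe(tagged_tokens):
    """Return (token, tag, gloss) triples; unknown tags get a placeholder."""
    return [(tok, tag, TAG_GLOSS.get(tag, "unknown tag"))
            for tok, tag in tagged_tokens]

described = describe(tagged)
for tok, tag, gloss in described:
    print(f"{tok:<10} {tag:<4} {gloss}")
```

Reading the glosses also makes it easier to spot questionable tags: the tagger is statistical, so a result such as `NNP` for "Are" or `IN` for "afraid" is best treated as a prediction that may be wrong.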

## Stopword removal
Stop words are common words that add little meaning to a sentence, such as "the", "he", and "have". They can usually be ignored without changing what the sentence means.

In [17]:
from nltk.corpus import stopwords
stop1 = stopwords.words('english')
print(stop1)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [18]:
len(stop1)

179

In [20]:
txt = 'i like mathematics. mathematics is the easiest subject in my life'
clean_text = []
word = word_tokenize(txt)
for w in word:
  if w not in stop1:
    clean_text.append(w)

print("Original text",word)
print("after stop word removal", clean_text)

Original text ['i', 'like', 'mathematics', '.', 'mathematics', 'is', 'the', 'easiest', 'subject', 'in', 'my', 'life']
after stop word removal ['like', 'mathematics', '.', 'mathematics', 'easiest', 'subject', 'life']
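
Note that the cleaned list above still contains the `.` token, and the comparison is case-sensitive, so a capitalized "I" would not be removed. A minimal sketch of one way to handle both, assuming a small hand-written stop-word set (`STOP_WORDS` and `remove_stop_words` are hypothetical names; in the notebook you would pass `stop1` instead):

```python
import string

# Tiny illustrative stop-word set (assumption; NLTK's English list is much larger).
STOP_WORDS = {"i", "is", "the", "in", "my"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    kept = []
    for tok in tokens:
        is_punct = all(ch in string.punctuation for ch in tok)
        if tok.lower() not in stop_words and not is_punct:
            kept.append(tok)
    return kept

tokens = ['I', 'like', 'mathematics', '.', 'mathematics', 'is',
          'the', 'easiest', 'subject', 'in', 'my', 'life']
cleaned = remove_stop_words(tokens)
print(cleaned)
```

Whether punctuation should be dropped depends on the downstream task; it is removed here only because it carries no weight in a bag-of-words or TF-IDF representation.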


## Stemming
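Stemming strips the last few characters (typically suffixes) from a word to reduce it to a root form. Because stemmers usually apply rules rather than consult a dictionary, the result is often not a real word and can lose the original meaning or spelling.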

In [24]:
from nltk import SnowballStemmer
sbs = SnowballStemmer('english')
text = "Nltk full form is natural language took kit. Engineering needs a vision"
word = word_tokenize(text)
print("Original Word: Word after stemming")
for w in word:
  print(w, " : ", sbs.stem(w))

Original Word: Word after stemming
Nltk  :  nltk
full  :  full
form  :  form
is  :  is
natural  :  natur
language  :  languag
took  :  took
kit  :  kit
.  :  .
Engineering  :  engin
needs  :  need
a  :  a
vision  :  vision
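
To see why stems are often not real words, here is a deliberately naive suffix stripper. It is not the Snowball algorithm used above (which applies many more rules and would give different stems for some words); `SUFFIXES` and `toy_stem` are hypothetical names chosen for this sketch.

```python
# Suffixes tried in order; a hand-picked list for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix,
    provided at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["needs", "leaves", "Engineering", "is", "vision"]
stems = {w: toy_stem(w) for w in words}
for w, s in stems.items():
    print(w, " : ", s)
```

Even this tiny rule set turns "leaves" into "leav", which is not a word: the rule only looks at the ending and knows nothing about vocabulary.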


## Lemmatization
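Lemmatization converts a word to its meaningful base form, called the lemma, using a vocabulary and the word's part of speech. NLTK's `WordNetLemmatizer` needs the `wordnet` corpus (`nltk.download('wordnet')`) and treats every word as a noun unless a `pos` argument is given. That is why `giving` and `gave` are returned unchanged in the output below; calling `lemmatize(w, pos='v')` returns `give` for both.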

In [29]:
from nltk import WordNetLemmatizer
wlm = WordNetLemmatizer()
word = ['give', 'giving', 'leaves', 'gave']
for w in word: 
  print(wlm.lemmatize(w))

give
giving
leaf
gave


# Part 2
**TF-IDF (Term Frequency-Inverse Document Frequency)** is a commonly used weighting technique in information retrieval and text mining.

TF-IDF is a statistical method for evaluating how important a word is to a document within a collection of documents (a corpus). The importance of a word increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus.

* **Term frequency (TF)**: the number of times a given word appears in a document. This count is usually normalized by the total number of terms in the document so that long documents are not favored: whether or not a term is important, it is likely to appear more often in a long document than in a short one.

> **TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).**

In other words, TF measures how often a term (keyword) appears in a text relative to the length of that text.
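
The TF formula above can be sketched in a few lines of plain Python. The function name `term_frequency` and the sample sentence are assumptions made for this illustration, and tokens are obtained with a simple whitespace split rather than `word_tokenize`.

```python
from collections import Counter

def term_frequency(tokens):
    """Map each term to (count of term) / (total number of terms)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical six-word document.
tf = term_frequency("the cat sat on the mat".split())
for term, value in tf.items():
    print(term, round(value, 3))
```

Because every count is divided by the same total, the TF values of a document always sum to 1, which is what keeps long and short documents comparable.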


#### Example:

Consider a document containing 100 words in which the word "cat" appears 3 times.

The **term frequency (TF) for cat** is then **3 / 100 = 0.03**. Now assume we have 10 million documents and the word "cat" appears in one thousand of them.

The **inverse document frequency (IDF)** is the logarithm of the total number of documents divided by the number of documents containing the term. Using a base-10 logarithm, **log10(10,000,000 / 1,000) = 4**.

Thus, the **TF-IDF** weight is the product of these quantities: **0.03 * 4 = 0.12**.
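
The arithmetic in this example can be checked directly. This is only a sketch of the worked numbers, assuming the base-10 logarithm that yields the value 4; the variable names are chosen here for illustration.

```python
import math

term_count = 3            # occurrences of "cat" in the document
doc_length = 100          # total words in the document
n_docs = 10_000_000       # documents in the corpus
docs_with_term = 1_000    # documents that contain "cat"

tf = term_count / doc_length
idf = math.log10(n_docs / docs_with_term)
tf_idf = tf * idf

print("TF     =", tf)
print("IDF    =", idf)
print("TF-IDF =", round(tf_idf, 2))
```

The choice of logarithm base only rescales the weights. Libraries differ on it: NLTK's `TextCollection`, for instance, uses the natural logarithm, so its IDF values will not match a base-10 calculation.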

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

d0 = 'new york times'
d1 = 'new york post'
d2 = 'los angels time'
series = [d0,d1,d2]


In [32]:
#Create an object of tf-idf
tf_idf = TfidfVectorizer()
#get tf-idf values
result = tf_idf.fit_transform(series)

In [35]:
print(result)

  (0, 5)	0.680918560398684
  (0, 6)	0.5178561161676974
  (0, 2)	0.5178561161676974
  (1, 3)	0.680918560398684
  (1, 6)	0.5178561161676974
  (1, 2)	0.5178561161676974
  (2, 4)	0.5773502691896257
  (2, 0)	0.5773502691896257
  (2, 1)	0.5773502691896257
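
Each output row is `(document index, feature index)  weight`, with features numbered in alphabetical order (`angels`=0, `los`=1, `new`=2, `post`=3, `time`=4, `times`=5, `york`=6). The weights differ from the textbook formula because `TfidfVectorizer`, with its default settings, uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit Euclidean length. The sketch below reproduces the numbers by hand under those assumptions, using a whitespace split in place of the vectorizer's own tokenizer (equivalent for these three documents).

```python
import math

docs = ["new york times", "new york post", "los angels time"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocab:
    df = sum(term in doc for doc in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count * IDF for each term, then L2-normalize the row."""
    raw = {t: doc.count(t) * idf[t] for t in set(doc)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(doc) for doc in tokenized]
for i, row in enumerate(rows):
    for term in sorted(row, key=vocab.index):
        print((i, vocab.index(term)), term, round(row[term], 4))
```

In the first two documents, "times" and "post" get the highest weight because each occurs in only one document, while "new" and "york" are shared. In the third document every term is unique to it, so all three weights are equal.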


In [42]:
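from nltk.text import TextCollection
from nltk.tokenize import word_tokenize

sents = ['this is sentence one', 'this is sentence two', 'this is sentence three']

# Tokenize each sentence; each token list is treated as one document
sents = [word_tokenize(sent) for sent in sents]
print(sents)

# Build a collection of documents
cps = TextCollection(sents)
print(cps)

# TF of 'one'; passing cps itself as the text counts over all 12 tokens
tf = cps.tf('one', cps)
print(tf)

# IDF of 'one': natural log of (number of documents / documents containing the term)
idf = cps.idf('one')
print(idf)

# TF-IDF is the product of the two values above
# (named tf_idf_one so it does not overwrite the TfidfVectorizer stored in tf_idf)
tf_idf_one = cps.tf_idf('one', cps)
print(tf_idf_one)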

[['this', 'is', 'sentence', 'one'], ['this', 'is', 'sentence', 'two'], ['this', 'is', 'sentence', 'three']]
<Text: this is sentence one this is sentence two...>
0.08333333333333333
1.0986122886681098
0.0915510240556758
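
The three numbers above can be reproduced without NLTK, which makes it clearer what is being measured. This sketch assumes the natural logarithm and the plain ratio form of TF and IDF; the variable names are chosen for illustration.

```python
import math

sents = [["this", "is", "sentence", "one"],
         ["this", "is", "sentence", "two"],
         ["this", "is", "sentence", "three"]]

# TF over the whole collection: "one" occurs once among 12 tokens.
all_tokens = [tok for sent in sents for tok in sent]
tf = all_tokens.count("one") / len(all_tokens)

# IDF: natural log of (number of documents / documents containing the term).
docs_with_term = sum("one" in sent for sent in sents)
idf = math.log(len(sents) / docs_with_term)

tf_idf = tf * idf
print(tf, idf, tf_idf)

# More commonly, TF is taken per document, e.g. within the first sentence only.
tf_first = sents[0].count("one") / len(sents[0])
print(tf_first, tf_first * idf)
```

The TF of 1/12 arises because the whole collection was passed as the text. Computed against the first sentence alone, TF is 1/4 and the TF-IDF weight is three times larger.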
