# <font color = 'blue'>Advanced Text Processing </font>

Source : https://kgptalkie.com/3664-2/

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import re
from bs4 import BeautifulSoup
import unicodedata
from textblob import TextBlob
from textblob import Word
from spacy import displacy

# <font color = 'blue'> N-Grams </font>

An N-gram is a sequence of N consecutive words. For example, “KGPtalkie blog” is a 2-gram (bigram), “Write on KGPtalkie” is a 3-gram (trigram), and “A KGPtalkie blog post” is a 4-gram. On its own the definition is not very exciting; the interesting part is the probability associated with n-grams.

In [2]:
x = 'thanks for watching'

In [3]:
x

'thanks for watching'

In [9]:
tb = TextBlob(x)
tb

TextBlob("thanks for watching")

In [10]:
tb.ngrams()

[WordList(['thanks', 'for', 'watching'])]
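`tb.ngrams()` uses a default window of three words, which is why the three-word sentence above produces a single trigram. As a minimal sketch of the same idea without TextBlob, the helper `make_ngrams` below (a hypothetical name, not part of any library) slides a window of `n` words over a whitespace-split sentence:

```python
def make_ngrams(text, n):
    """Return every run of n consecutive words in text (hypothetical helper)."""
    words = text.split()
    # The last valid window starts at index len(words) - n.
    return [words[i:i + n] for i in range(len(words) - n + 1)]


sentence = 'thanks for watching'
bigrams = make_ngrams(sentence, 2)
trigrams = make_ngrams(sentence, 3)

print(bigrams)   # [['thanks', 'for'], ['for', 'watching']]
print(trigrams)  # [['thanks', 'for', 'watching']]
```

With TextBlob itself, the window size can be passed as `tb.ngrams(n=2)`.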

# <font color = 'blue'> Bag of Words (BoW) </font>

Most NLP algorithms operate on numbers, so raw text cannot be fed to them directly. The Bag of Words model preprocesses the text by converting each document into a vector of word counts over a shared vocabulary: word order is discarded and only the number of occurrences of each word is kept.

In [13]:
x = ['this is first sentence this is', 'this is second', 'this is last']
x

['this is first sentence this is', 'this is second', 'this is last']

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
cv = CountVectorizer()
text_counts = cv.fit_transform(x)
text_counts

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [16]:
text_counts.toarray()

array([[1, 2, 0, 0, 1, 2],
       [0, 1, 0, 1, 0, 1],
       [0, 1, 1, 0, 0, 1]])

In [17]:
cv.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['first', 'is', 'last', 'second', 'sentence', 'this']

In [19]:
bow = pd.DataFrame(text_counts.toarray(), columns=cv.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

In [20]:
bow

|   | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 0 | 1 | 2 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 |


In [21]:
x

# In the first sentence the word 'is' appears twice, so it is counted as 2 in the DataFrame above.

['this is first sentence this is', 'this is second', 'this is last']
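To make the counting explicit, here is a minimal sketch that rebuilds the same table with only the standard library. It assumes plain whitespace tokenization, which matches `CountVectorizer` for this simple input only; by default `CountVectorizer` also lowercases the text and ignores single-character tokens.

```python
from collections import Counter

docs = ['this is first sentence this is', 'this is second', 'this is last']

# One Counter per document: word -> number of occurrences in that document.
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is every distinct word, sorted alphabetically.
vocab = sorted(set(word for doc in docs for word in doc.split()))

# Each row holds the count of every vocabulary word (0 when the word is absent).
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

The printed rows should line up with the columns of `bow` above.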

# <font color = 'blue'> Term Frequency (TF) </font>

TF tells you how frequently a term occurs in a document. In the context of natural language, terms correspond to words or phrases. Since every document is different in length, it is possible that a term would appear more often in longer documents than shorter ones. Thus, term frequency is often divided by the total number of terms in the document as a way of normalization.

**TF(t) = (Number of times term 't' appears in a document) / (Total number of terms in that document)**

In [22]:
x

['this is first sentence this is', 'this is second', 'this is last']

In [23]:
bow

|   | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 0 | 1 | 2 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 |


In [24]:
bow.shape

(3, 6)

In [34]:
tf = bow.copy()
tf

|   | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 0 | 1 | 2 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 |


In [38]:
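tf = tf.astype(float)  # counts are integers; cast so the cells can hold fractions
for index, row in tf.iterrows():
    # divide every count in the row by the total number of terms in that document
    tf.loc[index] = row / row.sum()

The same normalization can be written without a loop as `tf = bow.div(bow.sum(axis=1), axis=0)`.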

Worked example for the first document:

Number of times the term 'first' appears in the document = 1

Total number of terms in the document = 6 (the sum of the first row of counts)

TF('first') = 1/6 = 0.166667

Similarly, TF('is') = 2/6 = 0.333333


In [36]:
tf

|   | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 0.166667 | 0.333333 | 0.0 | 0.0 | 0.166667 | 0.333333 |
| 1 | 0.0 | 0.333333 | 0.0 | 0.333333 | 0.0 | 0.333333 |
| 2 | 0.0 | 0.333333 | 0.333333 | 0.0 | 0.0 | 0.333333 |


# <font color = 'blue'> Inverse Document Frequency (IDF) </font>

Inverse Document Frequency (IDF) is a weight indicating how commonly a word is used. The more frequent its usage across documents, the lower its score. The lower the score, the less important the word becomes.

For example, the word “the” appears in almost all English texts and would thus have a very low IDF score, as it carries very little “topic” information. In contrast, the word “coffee”, while common, is not used as widely as “the”. Thus, “coffee” would have a higher IDF score than “the”.

**idf = log( (1 + N) / (n + 1) ) + 1**, where log is the natural logarithm. This is the formula sklearn uses when `smooth_idf=True` (the default).

Here N is the total number of documents (rows) and n is the number of documents (rows) in which the word is present.

In [40]:
x_df = pd.DataFrame(x, columns=['words'])
x_df

|   | words |
|---|---|
| 0 | this is first sentence this is |
| 1 | this is second |
| 2 | this is last |


In [41]:
bow

|   | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 0 | 1 | 2 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 |


In [42]:
bow.shape

(3, 6)

In [44]:
N = bow.shape[0]
N

# This is the capital 'N' in the numerator of the formula, i.e. the total number of rows (documents).

3

In [45]:
bb = bow.astype('bool')
bb

|   | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | True | True | False | False | True | True |
| 1 | False | True | False | True | False | True |
| 2 | False | True | True | False | False | True |


In [46]:
bb['is'].sum()  

# Without the boolean conversion, the sum would be 2 + 1 + 1 = 4, i.e. the total number of times 'is' occurs
# across all rows. What we want is the number of rows that contain 'is', which is 3, so the counts are
# converted to booleans first.

3

In [47]:
cols = bb.columns
cols

Index(['first', 'is', 'last', 'second', 'sentence', 'this'], dtype='object')

In [48]:
nz = []
for col in cols:
    nz.append(bb[col].sum())

In [49]:
nz

# This is the lowercase 'n' in the denominator of the formula, i.e. the number of rows in which each word is present.

[1, 3, 1, 1, 1, 3]

In [50]:
idf = []
for index, col in enumerate(cols):
    idf.append(np.log((N + 1)/(nz[index] + 1)) + 1)

In [51]:
idf

[1.6931471805599454,
 1.0,
 1.6931471805599454,
 1.6931471805599454,
 1.6931471805599454,
 1.0]

In [52]:
bow

|   | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 0 | 1 | 2 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 |


In [62]:
# for the word 'first' 

np.log( (1 + 3)  / (1 + 1)   ) + 1

#  log( (1 + N) / (n + 1) ) + 1

1.6931471805599454
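The column-by-column loops above can also be written in vectorized form. The following is a minimal sketch, assuming the counts are typed in by hand as a NumPy array rather than taken from `bow`. For comparison it also computes the unsmoothed variant, `log(N / n) + 1`, which is the formula scikit-learn documents for `smooth_idf=False`.

```python
import numpy as np

# Same counts as the bag-of-words table: rows are documents, columns are
# first, is, last, second, sentence, this.
counts = np.array([[1, 2, 0, 0, 1, 2],
                   [0, 1, 0, 1, 0, 1],
                   [0, 1, 1, 0, 0, 1]])

n_docs = counts.shape[0]               # N in the formula
doc_freq = (counts > 0).sum(axis=0)    # n: number of documents containing each word

idf_smooth = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smooth_idf=True
idf_plain = np.log(n_docs / doc_freq) + 1                # smooth_idf=False

print(doc_freq)
print(idf_smooth)
print(idf_plain)
```

Words that occur in every document ('is' and 'this') get an IDF of exactly 1 under both variants, because the logarithm of 1 is 0.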

# <font color = 'blue'> Term Frequency – Inverse Document Frequency (TF-IDF) </font>

TF-IDF is one of the most important techniques used for information retrieval to represent how important a specific word or phrase is to a given document. 

For example, when we have text (or its Bag of Words representation) and need to extract the most informative words from it, TF-IDF is a suitable approach.

The tf-idf value increases in proportion to the number of times a word appears in the document but is often offset by the frequency of the word in the corpus, which helps to adjust with respect to the fact that some words appear more frequently in general.

TF-IDF combines two statistics: Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency (TF) is the number of times a given term t appears in a document, divided by the total number of words in that document.

Inverse Document Frequency (IDF) is a measure of how much information a word provides. It weights a word by how common or rare it is across all documents in the corpus.

In practice, TF-IDF is used more often than TF or IDF alone. scikit-learn also provides `TfidfVectorizer`, which computes TF-IDF directly so that TF and IDF do not have to be calculated by hand.

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [64]:
x_df

|   | words |
|---|---|
| 0 | this is first sentence this is |
| 1 | this is second |
| 2 | this is last |


In [65]:
tfidf = TfidfVectorizer()

x_tfidf = tfidf.fit_transform(x_df['words'])

x_tfidf

<3x6 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [66]:
x_tfidf.toarray()

array([[0.45688214, 0.5396839 , 0.        , 0.        , 0.45688214,
        0.5396839 ],
       [0.        , 0.45329466, 0.        , 0.76749457, 0.        ,
        0.45329466],
       [0.        , 0.45329466, 0.76749457, 0.        , 0.        ,
        0.45329466]])

In [67]:
tfidf.idf_ # this is the idf calculated by TfidfVectorizer() function

array([1.69314718, 1.        , 1.69314718, 1.69314718, 1.69314718,
       1.        ])

In [68]:
idf  # this is the manual calculation from earlier; the values are the same

[1.6931471805599454,
 1.0,
 1.6931471805599454,
 1.6931471805599454,
 1.6931471805599454,
 1.0]
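The values in `x_tfidf` are not simply the normalized TF table multiplied by IDF. With its default settings, `TfidfVectorizer` multiplies the raw counts by the smoothed IDF and then scales each row to unit Euclidean (L2) length. A minimal sketch of that calculation, assuming the counts are entered by hand as a NumPy array:

```python
import numpy as np

# Bag-of-words counts: rows are documents, columns are
# first, is, last, second, sentence, this.
counts = np.array([[1, 2, 0, 0, 1, 2],
                   [0, 1, 0, 1, 0, 1],
                   [0, 1, 1, 0, 0, 1]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed idf, as computed above

weighted = counts * idf   # raw count times idf, cell by cell

# Scale every row to unit L2 length.
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(tfidf_manual.round(8))
```

The rounded result should agree with `x_tfidf.toarray()` above, for example 0.45688214 for 'first' and 0.5396839 for 'is' in the first document.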

# <font color = 'blue'> Word Embeddings </font>

Word Embedding is a language modeling technique used for mapping words to vectors of real numbers. 

It represents words or phrases in vector space with several dimensions. 

Word embeddings can be generated using various methods like neural networks, co-occurrence matrix, probabilistic models, etc.

## SpaCy Word2Vec

In [3]:
dictionary = spacy.load('en_core_web_md')

# for production, use the larger model: dictionary = spacy.load('en_core_web_lg')

In [4]:
doc = dictionary('thank you! dog cat lion dfasaa')

doc

thank you! dog cat lion dfasaa

In [5]:
for i in doc:
    print(i.text, '-', i.has_vector)

thank - True
you - True
! - True
dog - True
cat - True
lion - True
dfasaa - False


In [7]:
i.vector.shape  # `i` is the last token from the loop above; this is one vector with 300 dimensions (a size fixed by the pretrained model), not 300 vectors

(300,)

In [8]:
dictionary('cat').vector.shape

(300,)
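The `similarity` scores printed in the next cell are cosine similarities between token vectors. The sketch below illustrates that measure with NumPy on made-up 3-dimensional vectors; the function `cosine_similarity` and the vector values are assumptions for illustration, not spaCy's code or data. Returning 0.0 for an all-zero vector mirrors the 0.0 scores shown for the out-of-vocabulary token `dfasaa`.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))


# Made-up 3-dimensional vectors for illustration; the spaCy vectors above have 300.
dog = np.array([0.9, 0.8, 0.1])
cat = np.array([0.8, 0.9, 0.2])
unknown = np.zeros(3)   # stands in for a token without a vector

sim_same = cosine_similarity(dog, dog)
sim_close = cosine_similarity(dog, cat)
sim_unknown = cosine_similarity(dog, unknown)

print(sim_same, sim_close, sim_unknown)
```

Vectors that point in nearly the same direction score close to 1, regardless of their lengths.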

In [80]:
for token1 in doc:
    for token2 in doc:
        print(token1.text, '-', token2.text, '-', token1.similarity(token2))
    print()

thank - thank - 1.0
thank - you - 0.5647585
thank - ! - 0.52147406
thank - dog - 0.2504265
thank - cat - 0.20648488
thank - lion - 0.13629763
thank - dfasaa - 0.0

you - thank - 0.5647585
you - you - 1.0
you - ! - 0.4390223
you - dog - 0.36494097
you - cat - 0.30807978
you - lion - 0.20392051
you - dfasaa - 0.0

! - thank - 0.52147406
! - you - 0.4390223
! - ! - 1.0
! - dog - 0.29852203
! - cat - 0.29702348
! - lion - 0.19601385
! - dfasaa - 0.0

dog - thank - 0.2504265
dog - you - 0.36494097
dog - ! - 0.29852203
dog - dog - 1.0
dog - cat - 0.8016855
dog - lion - 0.47424486
dog - dfasaa - 0.0

cat - thank - 0.20648488
cat - you - 0.30807978
cat - ! - 0.29702348
cat - dog - 0.8016855
cat - cat - 1.0
cat - lion - 0.5265437
cat - dfasaa - 0.0

lion - thank - 0.13629763
lion - you - 0.20392051
lion - ! - 0.19601385
lion - dog - 0.47424486
lion - cat - 0.5265437
lion - lion - 1.0
lion - dfasaa - 0.0

dfasaa - thank - 0.0
dfasaa - you - 0.0
dfasaa - ! - 0.0
dfasaa - dog - 0.0
dfasaa - cat - 

