# Text Vectorization

Vectorization is the process of converting text data into numeric form so that it can be processed by machine learning models.

# Term Frequency - Inverse Document Frequency (TF-IDF)

TF-IDF is a text vectorization technique that converts each document into a vector whose values reflect how important each word is to that document and to the corpus as a whole.

TF - Term Frequency - denotes the importance of a word within a single document (in this notebook, each sentence is one document). It is calculated as the ratio of the number of occurrences of a term in a document to the total number of words in that document.

IDF - Inverse Document Frequency - denotes the importance of a word across the whole corpus. The more documents a word appears in, the **LESS** important it is. It is calculated by taking the log of the ratio of the total number of documents in the corpus to the number of documents in which the term is present.

The value of a word in the vector is the product of TF and IDF.
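The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration of the textbook formulas, not the scikit-learn implementation; the toy corpus and the function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are made up for this example.

```python
import math
from collections import Counter

# Toy corpus (illustrative only): each document is a list of tokens
docs = [
    ["india", "vision", "freedom"],
    ["india", "nation"],
    ["vision", "vision", "nation", "power"],
]

def term_frequency(term, doc):
    # Occurrences of the term divided by the total number of words in the document
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, corpus):
    # log(total documents / documents containing the term)
    # Assumes the term occurs in at least one document
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    # The value of a term in a document's vector is TF * IDF
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)

for term in ["vision", "nation", "power"]:
    print(term, tf_idf(term, docs[2], docs))
```

In the last document, "power" appears once but only in that document, while "vision" appears twice but also occurs elsewhere, so the rarer term receives the larger IDF.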

# About this notebook

This notebook is going to vectorize a speech by Dr A.P.J. Abdul Kalam using scikit-learn's implementation of TF-IDF - TfidfVectorizer.

* Implement a custom preprocessor to normalize the corpus before applying vectorization.
* Implement vectorization using scikit-learn's TfidfVectorizer.
* Visualize the vectorized form using a pandas DataFrame.

In [24]:
import numpy as np
import pandas as pd
import string # For removing punctuation from the corpus

from sklearn.feature_extraction.text import TfidfVectorizer 

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.tag import pos_tag

In [3]:
nltk.download('stopwords') # To define english stopwords
nltk.download('punkt') # For sentence tokenizer to work
nltk.download('averaged_perceptron_tagger') # Part-of-speech tagger used by pos_tag
nltk.download('wordnet') # For lemmatizer to work

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
# Unzipping wordnet.zip file to the specified directory. Not doing this will cause errors in Kaggle
!unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/ 

Archive:  /usr/share/nltk_data/corpora/wordnet.zip
   creating: /usr/share/nltk_data/corpora/wordnet/
  inflating: /usr/share/nltk_data/corpora/wordnet/lexnames  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adv  
  inflating: /usr/share/nltk_data/corpora/wordnet/adv.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/cntlist.rev  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/LICENSE  
  inflating: /usr/share/nltk_data/corpora/wordnet/citation.bib  
  inflating: /usr/share/nltk_data/corpora/wordnet/noun.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/verb.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/README  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.sense  
  inflating: /usr/share/nltk_data

In [5]:
# Setting stopwords
eng_stopwords = set(stopwords.words('english'))

# Corpus

In [6]:
## Speech Of DR APJ Abdul Kalam - to be vectorized
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

# Creating a custom preprocessor

In [26]:
def get_wordnet_pos(word):
    
    """
    The lemmatizer in nltk takes a 'Part of Speech' (pos) argument, which denotes the
    pos of the word in the language. This function finds the pos of the word using
    nltk's pos_tag (Penn Treebank tags) and maps it to one of the 4 WordNet pos
    values returned here - [n, v, a, r]. The word is tagged on its own, without
    sentence context, and any unmatched tag defaults to noun.
    """
    
    pos = pos_tag([word])[0][1]
    
    if pos.startswith('J'):
        return wordnet.ADJ
    elif pos.startswith('V'):
        return wordnet.VERB
    elif pos.startswith('N'):
        return wordnet.NOUN
    elif pos.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun if no match is found
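The tag-to-POS mapping can be tried without downloading any NLTK data. The sketch below reproduces the same prefix logic in plain Python; `penn_to_wordnet` is a hypothetical helper name, and the single-letter constants are the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`.

```python
# WordNet's single-letter POS codes (the values of wordnet.ADJ, VERB, NOUN, ADV in NLTK)
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code.

    Hypothetical helper: same prefix rules as get_wordnet_pos, noun by default.
    """
    prefix_map = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}
    return prefix_map.get(tag[:1], NOUN)

for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags outside the four groups, such as `IN` for prepositions, fall back to noun, which is also the lemmatizer's own default.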

In [27]:
lm = WordNetLemmatizer()

def custom_preprocessor(sentence):
    """
    Function takes in an English sentence and returns a lower-case, lemmatized version of the same, 
    without stopwords and punctuation.
    """
    words = word_tokenize(sentence) # words now contains the list of tokens in the sentence
    
    # Creating a list of lower-case, lemmatized words, while filtering out stopwords and punctuation.
    # Note: `word not in string.punctuation` is a substring test against ASCII punctuation only,
    # so tokens such as '’' are not filtered out.
    words = [
        lm.lemmatize(word.lower(), pos=get_wordnet_pos(word))
        for word in words
        if (word.lower() not in eng_stopwords) and (word not in string.punctuation)
    ]
    
    return ' '.join(words)
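The punctuation filter in the preprocessor relies on `word not in string.punctuation`, which is a substring test against a string of ASCII characters. That would explain why the curly apostrophe `’` survives into the vocabulary further down. A stricter check is sketched below using Unicode categories; `is_punctuation` is a hypothetical helper, not part of the notebook, and swapping it in would change the vocabulary shown in the outputs.

```python
import string
import unicodedata

def is_punctuation(token):
    """Return True if every character in the token is punctuation or a symbol.

    Hypothetical helper: uses Unicode general categories ('P*' and 'S*'),
    so it also covers non-ASCII marks such as the curly apostrophe.
    """
    return bool(token) and all(
        unicodedata.category(ch).startswith(("P", "S")) for ch in token
    )

# The substring test used in the notebook misses non-ASCII and doubled marks
print("’" in string.punctuation)    # curly apostrophe is not an ASCII mark
print("``" in string.punctuation)   # two backticks are not a substring

# The category-based check catches both, and keeps hyphenated words
for token in ["’", "``", ",", "hand-in-hand"]:
    print(repr(token), is_punctuation(token))
```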

# Creating TF-IDF using TfidfVectorizer

In [28]:
# Initializing TfidfVectorizer
tfidf = TfidfVectorizer(input='content', 
                     preprocessor=custom_preprocessor, # Sends each sentence to our custom preprocessor
                     tokenizer=word_tokenize, # Uses the word_tokenize function from nltk
                     token_pattern=None # To hide the warning, which would otherwise show when tokenizer is not None
                    )

In [29]:
# Converting our paragraph into a list of sentences, because TfidfVectorizer expects 
# an iterable of strings when input='content'
sentences = sent_tokenize(text=paragraph, language='english') 

In [30]:
x = tfidf.fit_transform(sentences).toarray()

In [31]:
x.shape

(31, 118)

In [32]:
# Unique words
tfidf.get_feature_names_out()

array(['10', '1857', '3000', '5', 'achievement', 'alexander', 'also',
       'among', 'anyone', 'area', 'believe', 'brahm', 'british', 'build',
       'capture', 'career', 'closely', 'come', 'conquer', 'consider',
       'culture', 'dept', 'develop', 'developed', 'development', 'dhawan',
       'do', 'dr.', 'dutch', 'economic', 'enforce', 'fall', 'father',
       'fifty', 'first', 'fortune', 'four', 'free', 'freedom', 'french',
       'gdp', 'get', 'globally', 'go', 'good', 'grabbed', 'great',
       'greek', 'growth', 'hand-in-hand', 'history', 'incorrect',
       'independence', 'india', 'invade', 'lack', 'land', 'level', 'life',
       'loot', 'lucky', 'material', 'milestone', 'military', 'mind',
       'mogul', 'must', 'nation', 'nuclear', 'nurture', 'one', 'onwards',
       'opportunity', 'others.that', 'people', 'percent', 'portuguese',
       'poverty', 'power', 'prakash', 'professor', 'protect', 'rate',
       'recognise', 'respect', 'sarabhai', 'satish', 'second', 'see',
     

In [33]:
# Number of unique words
len(tfidf.get_feature_names_out())

118

In [34]:
# Mapping of each unique term to its column number
tfidf.vocabulary_

{'three': 101,
 'vision': 110,
 'india': 53,
 '3000': 2,
 'year': 115,
 'history': 50,
 'people': 74,
 'world': 114,
 'come': 17,
 'invade': 54,
 'u': 107,
 'capture': 14,
 'land': 56,
 'conquer': 18,
 'mind': 64,
 'alexander': 5,
 'onwards': 71,
 'greek': 47,
 'turk': 106,
 'mogul': 65,
 'portuguese': 76,
 'british': 12,
 'french': 39,
 'dutch': 28,
 'loot': 59,
 'take': 98,
 'yet': 116,
 'do': 26,
 'nation': 67,
 'anyone': 8,
 'grabbed': 45,
 'culture': 20,
 'try': 105,
 'enforce': 30,
 'way': 112,
 'life': 58,
 'respect': 84,
 'freedom': 38,
 'others.that': 73,
 'first': 34,
 'believe': 10,
 'get': 41,
 '1857': 1,
 'start': 94,
 'war': 111,
 'independence': 52,
 'must': 66,
 'protect': 81,
 'nurture': 69,
 'build': 13,
 'free': 37,
 'one': 70,
 'second': 87,
 '’': 117,
 'development': 24,
 'fifty': 33,
 'develop': 22,
 'time': 102,
 'see': 88,
 'developed': 23,
 'among': 7,
 'top': 104,
 '5': 3,
 'term': 99,
 'gdp': 40,
 '10': 0,
 'percent': 75,
 'growth': 48,
 'rate': 82,
 'area': 

# Vectorized Corpus

In [35]:
x

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.31740361, ..., 0.28329014, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

# Converting to dataframe for better visualization

In [36]:
vector = pd.DataFrame(data=x, columns=tfidf.get_feature_names_out())

In [37]:
vector.head()

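| | 10 | 1857 | 3000 | 5 | achievement | alexander | also | among | anyone | area | ... | unless | vikram | vision | war | way | work | world | year | yet | ’ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.548305 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.317404 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.240312 | 0.28329 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.284327 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.588643 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.746061 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |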


In [38]:
# Creating a dataframe out of the vector representation of the ith sentence, keeping only the terms 
# that occur in that sentence

i = 2 # Sentence number, counted from 1; set any i up to the number of sentences

ith_sentence = vector[vector > 0].iloc[i-1, :].dropna() # Pandas Series

ith_sentence = pd.DataFrame(data= [ith_sentence.values], columns=ith_sentence.index) # Pandas DataFrame

ith_sentence

| | 3000 | capture | come | conquer | history | invade | land | mind | people | u | world | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.317404 | 0.317404 | 0.28329 | 0.28329 | 0.28329 | 0.317404 | 0.28329 | 0.28329 | 0.317404 | 0.240312 | 0.240312 | 0.28329 |
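The row filter used in the cell above masks the whole DataFrame before picking one row. A lighter alternative, sketched here on a made-up toy frame (the `toy` values are illustrative and not taken from the notebook output), selects the row first and then keeps only its positive entries.

```python
import pandas as pd

# Toy TF-IDF frame (illustrative values only): rows are sentences, columns are terms
toy = pd.DataFrame({
    "come":   [0.0, 0.28],
    "vision": [0.55, 0.0],
    "world":  [0.0, 0.24],
})

i = 2                       # sentence number, counted from 1
row = toy.iloc[i - 1]       # pandas Series for the ith sentence
nonzero = row[row > 0]      # keep only terms that occur in that sentence

# Turn the Series back into a one-row DataFrame for display
as_frame = nonzero.to_frame().T
print(as_frame)
```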


Second sentence:


> In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds.
         

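We can see how the second sentence has been vectorized, with column names as terms and values set to the TF-IDF of each term. The column `u` appears to be the word "us" after lemmatization, since the lemmatizer treats the trailing "s" as a plural ending.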
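The values in this row are not raw TF × IDF products. With its default `norm='l2'`, `TfidfVectorizer` rescales each row to unit Euclidean length. That can be checked against the numbers printed above for the second sentence; the dictionary below simply copies them from the output.

```python
import math

# TF-IDF values of the second sentence, copied from the notebook output
second_sentence = {
    "3000": 0.317404, "capture": 0.317404, "come": 0.28329, "conquer": 0.28329,
    "history": 0.28329, "invade": 0.317404, "land": 0.28329, "mind": 0.28329,
    "people": 0.317404, "u": 0.240312, "world": 0.240312, "year": 0.28329,
}

# Euclidean (L2) length of the row vector
norm = math.sqrt(sum(value ** 2 for value in second_sentence.values()))
print(round(norm, 3))
```

Every term occurs once in this sentence, so the differences between the values come from IDF alone: the smallest values belong to the terms that occur in the most sentences of the corpus.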