# Text Vectorization

Vectorization is the process of converting text data into numeric form so that it can be processed by ML models.

# Bag-of-words (BoW)

BoW is a text vectorization technique that converts each unique token (a word, a character, etc.) or group of tokens into a feature (column). The value of each feature is either the token's frequency (the number of times it occurs in the document) or a binary presence flag (if binary BoW is used).
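
To make the definition concrete, here is a minimal pure-Python sketch of count and binary BoW. The two-sentence toy corpus and the whitespace tokenization are assumptions made for illustration; they are not what the rest of this notebook uses.

```python
from collections import Counter

# Toy corpus (hypothetical, not part of the notebook's data)
docs = ["the cat sat on the mat", "the dog sat"]

# Naive whitespace tokenization, for illustration only
tokenized = [doc.split() for doc in docs]

# Vocabulary: one column per unique token, in sorted order
vocab = sorted({tok for toks in tokenized for tok in toks})

# Count BoW: each cell holds the token's frequency in that document
count_bow = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

# Binary BoW: each cell only records whether the token is present
binary_bow = [[int(count > 0) for count in row] for row in count_bow]

print(vocab)
print(count_bow)
print(binary_bow)
```

The only difference between the two variants is the cell for "the" in the first document: 2 in the count version, 1 in the binary version.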

# About this notebook

This notebook vectorizes a speech by Dr. A.P.J. Abdul Kalam using CountVectorizer, scikit-learn's implementation of BoW. We are going to:

* Implement a custom preprocessor to normalize the corpus before applying vectorization.
* Vectorize the corpus using scikit-learn's CountVectorizer.
* Visualize the vectorized form using a pandas DataFrame.

In [19]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer 

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.tag import pos_tag

In [20]:
nltk.download('stopwords') # To define english stopwords
nltk.download('punkt') # For sentence tokenizer to work
nltk.download('averaged_perceptron_tagger') # Model used by pos_tag to assign part-of-speech tags
nltk.download('wordnet') # For lemmatizer to work

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sharo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sharo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sharo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sharo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [21]:
# Setting stopwords
eng_stopwords = set(stopwords.words('english'))

# Corpus

In [22]:
## Speech Of DR APJ Abdul Kalam - to be vectorized
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

# Creating a custom preprocessor

In [23]:
def get_wordnet_pos(word):
    """WordNetLemmatizer.lemmatize takes a part-of-speech (pos) argument. This function
    tags the word with nltk's pos_tag (Penn Treebank tags) and simplifies the tag into one
    of the four WordNet pos values returned below: n (noun), v (verb), a (adjective) or
    r (adverb). Anything else defaults to noun."""
    
    pos = pos_tag([word])[0][1]
    
    if pos.startswith('J'):
        return wordnet.ADJ
    elif pos.startswith('V'):
        return wordnet.VERB
    elif pos.startswith('N'):
        return wordnet.NOUN
    elif pos.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun if no match is found
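
The tag-to-WordNet mapping can also be kept separate from the tagging step, which would make it possible to tag a whole sentence once (so the tagger sees context) and then map each tag. The sketch below shows only the mapping. The helper name `penn_to_wordnet` is hypothetical, and the WordNet codes are written as string literals so the block runs without nltk.

```python
# WordNet part-of-speech codes, written as literals so the sketch runs without
# nltk; they equal wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

# First letter of a Penn Treebank tag -> WordNet code
_PREFIX_TO_WORDNET = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet pos code, defaulting to noun."""
    return _PREFIX_TO_WORDNET.get(tag[:1], NOUN)


for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```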

In [24]:
lm = WordNetLemmatizer()

def custom_preprocessor(sentence):
    """
    Function takes in an English sentence and returns a lower-case, lemmatized version of the same, without stopwords.
    """
    words = word_tokenize(sentence) # List of tokens in the sentence

    # Lower-case and lemmatize each token, skipping stopwords
    words = [
        lm.lemmatize(word.lower(), pos=get_wordnet_pos(word))
        for word in words
        if word.lower() not in eng_stopwords
    ]

    return ' '.join(words)

# Creating BoW using CountVectorizer

In [25]:
# Initializing CountVectorizer
cv = CountVectorizer(input='content', 
                     preprocessor=custom_preprocessor, # Sends each sentence to our custom preprocessor
                     tokenizer=word_tokenize, # Uses the word_tokenize function from nltk
                     token_pattern=None # To hide warning, which would otherwise show when tokenizer is not None
                    )

In [26]:
# Converting our paragraph into a list of sentences, because CountVectorizer expects
# an iterable of strings (one per document) when input='content'
sentences = sent_tokenize(text=paragraph, language='english') 

In [27]:
x = cv.fit_transform(sentences).toarray()

In [28]:
x.shape

(31, 121)

In [29]:
# Unique words
cv.get_feature_names_out()

array([',', '.', '10', '1857', '3000', '5', '?', 'achievement',
       'alexander', 'also', 'among', 'anyone', 'area', 'believe', 'brahm',
       'british', 'build', 'capture', 'career', 'closely', 'come',
       'conquer', 'consider', 'culture', 'dept', 'develop', 'developed',
       'development', 'dhawan', 'do', 'dr.', 'dutch', 'economic',
       'enforce', 'fall', 'father', 'fifty', 'first', 'fortune', 'four',
       'free', 'freedom', 'french', 'gdp', 'get', 'globally', 'go',
       'good', 'grabbed', 'great', 'greek', 'growth', 'hand-in-hand',
       'history', 'incorrect', 'independence', 'india', 'invade', 'lack',
       'land', 'level', 'life', 'loot', 'lucky', 'material', 'milestone',
       'military', 'mind', 'mogul', 'must', 'nation', 'nuclear',
       'nurture', 'one', 'onwards', 'opportunity', 'others.that',
       'people', 'percent', 'portuguese', 'poverty', 'power', 'prakash',
       'professor', 'protect', 'rate', 'recognise', 'respect', 'sarabhai',
       'satish', 

In [30]:
# Number of unique words
len(cv.get_feature_names_out())

121
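
The vocabulary includes punctuation tokens such as ',' and '.', because the preprocessor only removes stopwords. If those columns are not wanted, one option is to keep only tokens that contain at least one letter or digit. The helper `is_wordlike` below is hypothetical and is not applied anywhere in this notebook; using it would change the vocabulary size reported above.

```python
def is_wordlike(token):
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


# A few tokens copied from the vocabulary shown in the notebook output
tokens = [",", ".", "?", "’", "10", "dr.", "hand-in-hand", "india"]

kept = [tok for tok in tokens if is_wordlike(tok)]
print(kept)
```

Such a check could be added as an extra condition in the list comprehension inside `custom_preprocessor`.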

In [31]:
# (Unique Word, Column Number)
cv.vocabulary_

{'three': 104,
 'vision': 113,
 'india': 56,
 '.': 1,
 '3000': 4,
 'year': 118,
 'history': 53,
 ',': 0,
 'people': 77,
 'world': 117,
 'come': 20,
 'invade': 57,
 'u': 110,
 'capture': 17,
 'land': 59,
 'conquer': 21,
 'mind': 67,
 'alexander': 8,
 'onwards': 74,
 'greek': 50,
 'turk': 109,
 'mogul': 68,
 'portuguese': 79,
 'british': 15,
 'french': 42,
 'dutch': 31,
 'loot': 62,
 'take': 101,
 'yet': 119,
 'do': 29,
 'nation': 70,
 'anyone': 11,
 'grabbed': 48,
 'culture': 23,
 'try': 108,
 'enforce': 33,
 'way': 115,
 'life': 61,
 '?': 6,
 'respect': 87,
 'freedom': 41,
 'others.that': 76,
 'first': 37,
 'believe': 13,
 'get': 44,
 '1857': 3,
 'start': 97,
 'war': 114,
 'independence': 55,
 'must': 69,
 'protect': 84,
 'nurture': 72,
 'build': 16,
 'free': 40,
 'one': 73,
 'second': 90,
 '’': 120,
 'development': 27,
 'fifty': 36,
 'develop': 25,
 'time': 105,
 'see': 91,
 'developed': 26,
 'among': 10,
 'top': 107,
 '5': 5,
 'term': 102,
 'gdp': 43,
 '10': 2,
 'percent': 78,
 'grow
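
`vocabulary_` maps each term to its column index, while `get_feature_names_out()` lists the terms in column order. The sketch below inverts a small excerpt of the dictionary shown above to make that relationship explicit; the variable names are illustrative.

```python
# Excerpt of cv.vocabulary_ copied from the output above
vocabulary = {"three": 104, "vision": 113, "india": 56, ".": 1, ",": 0}

# Invert the mapping: column index -> term
index_to_term = {idx: term for term, idx in vocabulary.items()}

# Terms listed in column order, which is how get_feature_names_out() orders them
for idx in sorted(index_to_term):
    print(idx, index_to_term[idx])
```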

# Vectorized Corpus

In [32]:
x

array([[0, 1, 0, ..., 0, 0, 0],
       [3, 1, 0, ..., 1, 0, 0],
       [9, 1, 0, ..., 0, 0, 0],
       ...,
       [3, 1, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

# Converting to dataframe for better visualization

In [33]:
vector = pd.DataFrame(data=x, columns=cv.get_feature_names_out())

In [34]:
vector.head()

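|  | , | . | 10 | 1857 | 3000 | 5 | ? | achievement | alexander | also | ... | unless | vikram | vision | war | way | work | world | year | yet | ’ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |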


In [35]:
# Creating a DataFrame from the vector representation of the i-th sentence,
# keeping only the terms that occur in that sentence

i = 2 # 1-based sentence number; any value from 1 to the number of sentences works

ith_sentence = vector[vector > 0].iloc[i-1, :].dropna() # Pandas Series

ith_sentence = pd.DataFrame(data= [ith_sentence.values], columns=ith_sentence.index) # Pandas DataFrame

ith_sentence

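|  | , | . | 3000 | capture | come | conquer | history | invade | land | mind | people | u | world | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |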


We can see how the second sentence has been vectorized: the column names are the terms, and the values are the frequency of occurrence of each term.
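
The counts in that output appear as floats (3.0, 1.0) because masking the whole frame with `vector > 0` replaces zero cells with NaN, which forces a float dtype. A possible alternative, sketched here on a small hypothetical stand-in for `vector`, selects the row first and then filters it, which keeps the integer counts.

```python
import pandas as pd

# Small hypothetical count matrix standing in for `vector`
vector = pd.DataFrame(
    [[0, 1, 1, 0], [3, 1, 0, 1]],
    columns=[",", ".", "vision", "world"],
)

i = 2                        # 1-based sentence number
row = vector.iloc[i - 1]     # Series of counts for that sentence
nonzero = row[row > 0]       # Boolean mask keeps only the terms that occur

print(nonzero.to_frame().T)  # One-row DataFrame, integer dtype preserved
```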