# Words to Vectors Conversion Techniques

Once the dataset has been preprocessed with techniques such as stemming and lemmatization, the next step is to convert the preprocessed text into vectors.

The conversion techniques we use most often are listed below:

1. One Hot Encoding (OHE)
2. Bag of Words (BOW)
3. N-grams


## 1. One Hot Encoding

One hot encoding is a simple technique that is often used in tasks such as sentiment analysis. The given corpus is broken down into sentences, the unique words across all sentences are collected into a vocabulary, and each word in the vocabulary is assigned a unique index.

Each sentence is then encoded as a vector with one position per vocabulary word: the position holds 1 if the word is found in the sentence and 0 otherwise. The resulting vector is called a one hot encoded vector.

### Example

corpus = ["I love this movie", "this movie is terrible", "the plot is confusing"]

From this corpus, the vocabulary identified is 
["I", "love", "this", "movie", "is", "terrible", "the", "plot", "confusing"]

Now our first sentence (D1) can be encoded as [1, 1, 1, 1, 0, 0, 0, 0, 0]

Similarly, (D2) can be encoded as [0, 0, 1, 1, 1, 1, 0, 0, 0]

(D3) can be encoded as [0, 0, 0, 0, 1, 0, 1, 1, 1]
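The encoding above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenization and a vocabulary ordered by first appearance; the helper name `encode` is illustrative and not part of any library.

```python
corpus = ["I love this movie", "this movie is terrible", "the plot is confusing"]

# Build the vocabulary in order of first appearance, as in the example above.
vocabulary = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def encode(sentence, vocabulary):
    """Return a binary vector: 1 where the vocabulary word occurs in the sentence."""
    words = set(sentence.split())
    return [1 if word in words else 0 for word in vocabulary]

encoded = [encode(sentence, vocabulary) for sentence in corpus]

for label, vector in zip(["D1", "D2", "D3"], encoded):
    print(label, vector)
```

Each printed vector has one position per vocabulary word, so all three documents end up with the same length of 9.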



### Advantages
- Simple to Implement
- Intuitive

### Disadvantages

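- Sparse matrix: most entries are 0, and the vector can be entirely 0 if a test sentence shares no words with the vocabulary. Large sparse matrices are computationally expensive to train on and require more RAM.
- Out of Vocabulary (OOV): words in new test data that are not in the vocabulary cannot be represented.
- There is no fixed size: if every word is encoded separately, sentences of different lengths produce encodings of different sizes.
- Semantic meaning between the words is not captured.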
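The size and OOV problems can be seen in a small sketch. It assumes one reading of the technique, in which every word gets its own one hot row; the helper `one_hot_words` and the test sentences are made up for illustration.

```python
vocabulary = ["I", "love", "this", "movie", "is", "terrible", "the", "plot", "confusing"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot_words(sentence):
    """Encode each word as its own row; unseen words become all-zero rows."""
    rows = []
    for word in sentence.split():
        row = [0] * len(vocabulary)
        if word in index:
            row[index[word]] = 1
        rows.append(row)
    return rows

m1 = one_hot_words("I love this movie")   # 4 words -> 4 rows
m2 = one_hot_words("the plot is boring")  # "boring" is not in the vocabulary
m3 = one_hot_words("this is terrible")    # 3 words -> 3 rows

print(len(m1), len(m3))  # the number of rows depends on the sentence length
print(m2[3])             # the row for "boring" carries no information
```

Under this assumption, a model that expects fixed-size input cannot consume `m1` and `m3` directly, and every unseen word collapses to the same all-zero row.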

## 2. Bag of Words (BOW)

This technique is a slight improvement over one hot encoding. Think of it as a bag of popcorn: the words of a document are tossed into the bag together, and their order is lost. The given corpus is broken down into sentences and then preprocessed: we remove the stop words, lowercase the sentences, and perform stemming or lemmatization before passing the result to the bag of words model for training.

Bag of words first identifies the vocabulary and calculates the frequency of each word in it. The features are then arranged in order from highest to lowest frequency.

Finally, for each document we record the frequency of every feature. In a binary BOW, we instead record 1 if the word is found in that particular sentence or document and 0 if it is not.

### Example

Consider a simple corpus containing three sentences (stop words are shown in italics): ["*He* *is* *a* good boy", "*She* *is* *a* good girl", "Boy *and* Girl *is* good"]

**Step 1:** Remove the stop words and lowercase the sentences. Let's skip stemming and lemmatization for now, as they are not required here. The sentences/documents then become:

D1 -> ["good", "boy"]

D2 -> ["good", "girl"]

D3 -> ["boy", "girl", "good"]

**Step 2:** Calculate the frequency of each word in the corpus above and arrange the words in order from high to low:

- f1 (good): 3
- f2 (boy): 2
- f3 (girl): 2

**Step 3:** Now we encode each sentence. Assuming a binary BOW, a feature is marked 1 if the word is present in the sentence and 0 otherwise.

| Features/Documents| f1 (good) | f2 (boy) | f3 (girl) |
| :---: | :---: | :---: | :---: |
| D1 | 1 | 1 | 0 |
| D2 | 1 | 0 | 1 |
| D3 | 1 | 1 | 1 |
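The three steps can be sketched in plain Python. This is an illustrative sketch rather than how any particular library implements it: the documents are assumed to be already cleaned as in Step 1, and the `encode` helper is a made-up name.

```python
from collections import Counter

# Step 1 output: stop words removed, sentences lowercased.
documents = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

# Step 2: count every word across the corpus and order features high to low.
# Words with equal counts keep the order in which they were first seen.
frequencies = Counter(word for doc in documents for word in doc)
features = [word for word, _ in frequencies.most_common()]

def encode(doc, binary=True):
    """Step 3: one value per feature, either presence (1/0) or the raw count."""
    counts = Counter(doc)
    if binary:
        return [1 if counts[word] > 0 else 0 for word in features]
    return [counts[word] for word in features]

binary_matrix = [encode(doc) for doc in documents]

print(features)       # ['good', 'boy', 'girl']
print(binary_matrix)  # [[1, 1, 0], [1, 0, 1], [1, 1, 1]]
print(encode(["good", "good", "boy"], binary=False))  # [2, 1, 0]
```

The last line shows the non-binary variant, where a repeated word contributes its count instead of a plain 1.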

### Advantages

- Simple and Intuitive

### Disadvantages

- Sparse matrix: the vectors still consist mostly of 0s.
- Out of Vocabulary (OOV): words in new test data that are not in the vocabulary cannot be handled.
- The order of the words is lost.
- Semantic information is not captured.
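Two of these drawbacks are easy to demonstrate. The sketch below assumes the three features from the example (good, boy, girl) and a hypothetical helper `binary_bow`; the test sentences are made up.

```python
features = ["good", "boy", "girl"]

def binary_bow(sentence):
    """Binary bag of words over a fixed feature list."""
    words = set(sentence.lower().split())
    return [1 if feature in words else 0 for feature in features]

a = binary_bow("good boy")
b = binary_bow("boy good")         # same words, different order
c = binary_bow("good clever boy")  # "clever" is out of vocabulary
d = binary_bow("clever")           # only out-of-vocabulary words

print(a == b)  # True: word order is lost
print(a == c)  # True: the unseen word is silently dropped
print(d)       # [0, 0, 0]
```

Sentences that differ only in word order, or only in unseen words, become indistinguishable once encoded.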

### Practical Implementation

In [1]:
%pip install nltk # Installing nltk library

Note: you may need to restart the kernel to use updated packages.


#### Defining corpus

In [11]:
corpus = '''Narendra Damodardas Modi[a] (born 17 September 1950)[b] is an Indian politician who has been serving as the prime minister of India since 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.[4] Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at the age of eight. At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becoming general secretary in 1998.[c] In 2001, Modi was appointed Chief Minister of Gujarat and elected to the legislative assembly soon after. His administration is considered complicit in the 2002 Gujarat riots,[d] and has been criticised for its management of the crisis. According to official records, a little over 1,000 people were killed, three-quarters of whom were Muslim; independent sources estimated 2,000 deaths, mostly Muslim.[13] A Special Investigation Team appointed by the Supreme Court of India in 2012 found no evidence to initiate prosecution proceedings against him.[e] While his policies as chief minister were credited for encouraging economic growth, his administration was criticised for failing to significantly improve health, poverty and education indices in the state.[f]'''

#### Tokenization

Convert the paragraph above into sentences.

In [13]:
import nltk # importing the nltk library

nltk.download('punkt_tab') # downloading the punkt tokenizer data

sentences = nltk.sent_tokenize(corpus) # tokenizing the paragraph into sentences

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/saikiran/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [14]:
sentences

['Narendra Damodardas Modi[a] (born 17 September 1950)[b] is an Indian politician who has been serving as the prime minister of India since 2014.',
 'Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament (MP) for Varanasi.',
 'He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation.',
 'He is the longest-serving prime minister outside the Indian National Congress.',
 '[4] Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education.',
 'He was introduced to the RSS at the age of eight.',
 'At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so.',
 'Modi became a full-time worker for the RSS in Gujarat in 1971.',
 'The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, 

In [9]:
print(sentences)

print(type(sentences))

['Narendra Damodardas Modi[a] (born 17 September 1950)[b] is an Indian politician who has been serving as the prime minister of India since 2014.', 'Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament (MP) for Varanasi.', 'He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation.', 'He is the longest-serving prime minister outside the Indian National Congress.', '[4] Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education.', 'He was introduced to the RSS at the age of eight.', 'At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so.', 'Modi became a full-time worker for the RSS in Gujarat in 1971.', 'The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becoming

#### Removing stopwords and lowercasing the sentences

In [43]:
nltk.download('stopwords') # downloading stopwords

from nltk.corpus import stopwords # library used for removing stopwords

from nltk.stem import PorterStemmer # library used for stemming
from nltk.stem import WordNetLemmatizer # library used for lemmatization

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/saikiran/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [44]:
import re # importing regular expression library

nltk.download('wordnet', quiet=True) # WordNet data is required by WordNetLemmatizer

stemmer = PorterStemmer() # creating object of PorterStemmer (not used below; lemmatization is applied instead)
lemmatizer = WordNetLemmatizer() # creating object of WordNetLemmatizer
stop_words = set(stopwords.words('english')) # building the stopword set once instead of once per word

corpus = [] # reusing the name corpus for the list of cleaned sentences

for sentence in sentences:

    review = re.sub('[^a-zA-Z]', ' ', sentence) # replacing every non-letter character with a space.
    review = review.lower() # converting all the characters of the sentence to lower case.
    review = review.split() # splitting the sentence into words.
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words] # removing the stopwords and lemmatizing the remaining words.
    review = ' '.join(review) # joining the words to form the sentence.

    corpus.append(review) # appending the cleaned sentence to the corpus list.

print(corpus)

['narendra damodardas modi born september b indian politician serving prime minister india since', 'modi chief minister gujarat member parliament mp varanasi', 'member bharatiya janata party bjp rashtriya swayamsevak sangh rss right wing hindu nationalist paramilitary volunteer organisation', 'longest serving prime minister outside indian national congress', 'modi born raised vadnagar northeastern gujarat completed secondary education', 'introduced rss age eight', 'age married jashodaben modi abandoned soon publicly acknowledging four decade later legally required', 'modi became full time worker rss gujarat', 'rss assigned bjp rose party hierarchy becoming general secretary', 'c modi appointed chief minister gujarat elected legislative assembly soon', 'administration considered complicit gujarat riot criticised management crisis', 'according official record little people killed three quarter muslim independent source estimated death mostly muslim', 'special investigation team appointed

We have removed the special characters using regular expressions, lowercased the sentences, removed the stopwords, and lemmatized the remaining words.

##### For reference: printing the stopwords of the English language

In [25]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### Importing the Bag of Words model from the sklearn library

In [26]:
from sklearn.feature_extraction.text import CountVectorizer # library used for creating bag of words model

cv = CountVectorizer(binary=True) # creating object of CountVectorizer; binary=True records presence (1/0) instead of counts

#### Training the bag of words model with our corpus

In [45]:
X = cv.fit_transform(corpus) # learning the vocabulary and encoding the corpus as a bag of words matrix

In [46]:
cv.vocabulary_ # printing the vocabulary of the bag of words model with index

# E.g., narendra is present in the bag of words model at index 66.

{'narendra': 66,
 'damodardas': 22,
 'modi': 62,
 'born': 12,
 'september': 96,
 'indian': 46,
 'politician': 78,
 'serving': 97,
 'prime': 80,
 'minister': 61,
 'india': 45,
 'since': 99,
 'chief': 13,
 'gujarat': 38,
 'member': 60,
 'parliament': 74,
 'mp': 64,
 'varanasi': 110,
 'bharatiya': 10,
 'janata': 50,
 'party': 75,
 'bjp': 11,
 'rashtriya': 86,
 'swayamsevak': 105,
 'sangh': 93,
 'rss': 92,
 'right': 89,
 'wing': 112,
 'hindu': 41,
 'nationalist': 68,
 'paramilitary': 73,
 'volunteer': 111,
 'organisation': 71,
 'longest': 57,
 'outside': 72,
 'national': 67,
 'congress': 16,
 'raised': 85,
 'vadnagar': 109,
 'northeastern': 69,
 'completed': 14,
 'secondary': 94,
 'education': 26,
 'introduced': 48,
 'age': 4,
 'eight': 27,
 'married': 59,
 'jashodaben': 51,
 'abandoned': 0,
 'soon': 100,
 'publicly': 83,
 'acknowledging': 2,
 'four': 34,
 'decade': 24,
 'later': 53,
 'legally': 54,
 'required': 88,
 'became': 8,
 'full': 35,
 'time': 108,
 'worker': 113,
 'assigned': 7,
 

#### Checking BOW for one sentence

In [47]:
corpus[0]

'narendra damodardas modi born september b indian politician serving prime minister india since'

In [48]:
X[0].toarray() # printing the bag of words model in the form of array

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0]])

## 3. N-grams

N-grams address a main drawback of both one hot encoding and bag of words: the loss of word order. By treating short sequences of adjacent words as features, they help us capture some of the context, and with it some of the meaning, of the given sentences.

Let's reconsider the same example. Earlier we had only three features: "good", "boy" and "girl". But looking at girl and boy separately tells us nothing about the words they occur with. Hence, we create bigrams.

For example, we add two more features, "good boy" and "good girl". Other examples of bigrams are "New York" and "Artificial Intelligence", which are kept together instead of being split into the separate words New, York, Artificial and Intelligence.

Here we have grouped two words together, so each group is a bigram. If we group three words together, it is a trigram. In general, a group of n consecutive words is an n-gram.

### Example 1

Consider a simple corpus containing three sentences (stop words are shown in italics): ["*He* *is* *a* good boy", "*She* *is* *a* good girl", "Boy *and* Girl *is* good"]

**Step 1:** Remove the stop words and lowercase the sentences. Let's skip stemming and lemmatization for now, as they are not required here. The sentences/documents then become:

D1 -> ["good", "boy"]

D2 -> ["good", "girl"]

D3 -> ["boy", "girl", "good"]

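**Step 2:** Calculate the frequency of each word in the corpus above and arrange the words in order from high to low:

- f1 (good): 3
- f2 (boy): 2
- f3 (girl): 2

New features (bigrams) such as "good boy" and "good girl" are then added. A complete bigram vocabulary would also contain "boy girl" and "girl good" from D3; they are left out here to keep the table small.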

**Step 3:** Now we encode each sentence. Assuming a binary BOW, a feature is marked 1 if the word or bigram is present in the sentence and 0 otherwise.

| Features/Documents| f1 (good) | f2 (boy) | f3 (girl) | f4 (good boy) | f5 (good girl) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| D1 | 1 | 1 | 0 | 1 | 0 |
| D2 | 1 | 0 | 1 | 0 | 1 |
| D3 | 1 | 1 | 1 | 0 | 0 |
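The table can be reproduced with a short sketch. It assumes the five features shown in the table are fixed in advance, and `unigrams_and_bigrams` is a hypothetical helper written for this illustration.

```python
# Cleaned documents from Step 1.
documents = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

# Features f1..f5 from the table: three unigrams and two bigrams.
features = ["good", "boy", "girl", "good boy", "good girl"]

def unigrams_and_bigrams(tokens):
    """Collect the words of a document plus every pair of adjacent words."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)

matrix = []
for doc in documents:
    present = unigrams_and_bigrams(doc)
    matrix.append([1 if feature in present else 0 for feature in features])

for label, row in zip(["D1", "D2", "D3"], matrix):
    print(label, row)
```

D3 contains the words good, boy and girl, but its adjacent pairs are "boy girl" and "girl good", so both bigram features stay 0. In scikit-learn, `CountVectorizer` accepts an `ngram_range` parameter, for example `ngram_range=(1, 2)`, to request unigrams and bigrams together.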

### Example 2

Sentence: Krish eats food 

Bigrams must be made of consecutive words, so we can't form a bigram like "Krish food" here.

**Bigrams:** Krish eats, eats food

### Example 3

Sentence: I am not feeling well

Trigrams must also be made of consecutive words.

**Trigrams:** I am not, am not feeling, not feeling well.
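Examples 2 and 3 follow the same sliding-window rule, which can be sketched as a small function. The name `ngrams` and the whitespace tokenization are assumptions made for this illustration.

```python
def ngrams(sentence, n):
    """Return every run of n consecutive words in the sentence."""
    tokens = sentence.split()
    # Slide a window of width n over the tokens; fewer than n tokens gives [].
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("Krish eats food", 2)
trigrams = ngrams("I am not feeling well", 3)

print(bigrams)   # ['Krish eats', 'eats food']
print(trigrams)  # ['I am not', 'am not feeling', 'not feeling well']
```

Because the window only ever covers adjacent words, a non-sequential pair such as "Krish food" can never be produced.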