In this chapter, we are going to cover basic to advanced feature
engineering (text to features) methods. By the end of this chapter, you will
be comfortable with the following recipes:
- Recipe 1. One Hot encoding
- Recipe 2. Count vectorizer
- Recipe 3. N-grams
- Recipe 4. Co-occurrence matrix
- Recipe 5. Hash vectorizer
- Recipe 6. Term Frequency-Inverse Document, Frequency (TF-IDF)
- Recipe 7. Word embedding
- Recipe 8. Implementing fastText

machines or algorithms cannot understand the
characters/words or sentences, they can only take numbers as input that
also includes binaries. But the inherent nature of text data is unstructured
and noisy, which makes it impossible to interact with machines

### 3.1 Converting text data into features using one hot encoding

It is a process of converting categorical variables
into features or columns and coding one or zero for the presence of that
particular category. We are going to use the same logic here, and the
number of features is going to be the number of total tokens present in the
whole corpus

In [2]:
text = "I am Learning NLP"

import pandas as pd

pd.get_dummies(text.split())

# Output has 4 features since the number of distinct words present in the input was 4

Unnamed: 0,I,Learning,NLP,am
0,1,0,0,0
1,0,0,0,1
2,0,1,0,0
3,0,0,1,0


### 3.2 Converting text to feature using count vectorizing

The approach in Recipe 3-1 has a disadvantage. It does not take the
frequency of the word occurring into consideration. If a particular word
is appearing multiple times, there is a chance of missing the information
if it is not included in the analysis. A count vectorizer will solve that
problem

Count vectorizer is almost similar to One Hot encoding. The only
difference is instead of checking whether the particular word is present or
not, it will count the words that are present in the document

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

text = ["I love NLP and I will learn NLP in 2mnths"]

vectorizer = CountVectorizer()

vectorizer.fit(text)

vector = vectorizer.transform(text)

In [10]:
print(vectorizer.vocabulary_)
print(vector.toarray())
print(vectorizer.get_feature_names_out())

{'love': 4, 'nlp': 5, 'and': 1, 'will': 6, 'learn': 3, 'in': 2, '2mnths': 0}
[[1 1 1 1 1 2 1]]
['2mnths' 'and' 'in' 'learn' 'love' 'nlp' 'will']


### 3.3 Generating N-gram sequence:

If you observe the above methods, each word is considered as a feature.
There is a drawback to this method.

It does not consider the previous and the next words, to see if that
would give a proper and complete meaning to the words.

For example: consider the word “not bad.” If this is split into individual
words, then it will lose out on conveying “good” – which is what this word
actually means

N-grams are the fusion of multiple letters or multiple words. They are
formed in such a way that even the previous and next words are captured.
- Unigrams are the unique words present in the sentence.
- Bigram is the combination of 2 words.
- Trigram is 3 words and so on.

In [14]:
text = "I am learning NLP"

from textblob import TextBlob

# For n-gram 1 use n = 1
print(TextBlob(text).ngrams(n=1))

#For bi-gram use n=2
print(TextBlob(text).ngrams(n=2))

#For trigram use n=3
print(TextBlob(text).ngrams(n=3))

[WordList(['I']), WordList(['am']), WordList(['learning']), WordList(['NLP'])]
[WordList(['I', 'am']), WordList(['am', 'learning']), WordList(['learning', 'NLP'])]
[WordList(['I', 'am', 'learning']), WordList(['am', 'learning', 'NLP'])]


In [15]:
# for generating feature of bigram
text = ["I love NLP and I will learn NLP in 2mnths"]

vectorizer = CountVectorizer(ngram_range=(2,2))

vectorizer.fit(text)

vector = vectorizer.transform(text)
print(vectorizer.vocabulary_)
print(vector.toarray())

{'love nlp': 3, 'nlp and': 4, 'and will': 0, 'will learn': 6, 'learn nlp': 2, 'nlp in': 5, 'in 2mnths': 1}
[[1 1 1 1 1 1 1]]


### 3.4 Generating Co-ocurance matrix:

A co-occurrence matrix is like a count vectorizer where it counts the
occurrence of the words together, instead of individual words

In [16]:
import numpy as np
import nltk
from nltk import bigrams
import itertools

In [21]:
def Co_occurance_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_to_index = {word:i for i, word in enumerate(vocab)}
    # Create bi-grams for all the words in the corpus
    bi_grams = list(bigrams(corpus))
    # Frequency distribution for bigram 
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))
    # Initialize the coocurance matrix
    co_occurance_matrix = np.zeros((len(vocab), len(vocab)))
    # Loop through the bi-gram and take the current and last occurance
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_to_index[current]
        pos_previous = vocab_to_index[previous]
        co_occurance_matrix[pos_current][pos_previous] = count
    co_occurance_matrix = np.matrix(co_occurance_matrix)
    return co_occurance_matrix, vocab_to_index
    

In [22]:
sentences = [['I', 'love', 'nlp'],
['I', 'love','to' 'learn'],
['nlp', 'is', 'future'],
['nlp', 'is', 'cool']]

In [23]:
sentences

[['I', 'love', 'nlp'],
 ['I', 'love', 'tolearn'],
 ['nlp', 'is', 'future'],
 ['nlp', 'is', 'cool']]

In [24]:
merged = list(itertools.chain.from_iterable(sentences))
matrix = Co_occurance_matrix(merged)

In [30]:
vocab_to_index

NameError: name 'vocab_to_index' is not defined

In [33]:
CoMatrixFinal = pd.DataFrame(matrix[0], index=matrix[1], columns=matrix[1])

matrix

In [34]:
CoMatrixFinal

Unnamed: 0,future,cool,tolearn,is,I,love,nlp
future,0.0,0.0,0.0,1.0,0.0,0.0,0.0
cool,0.0,0.0,0.0,1.0,0.0,0.0,0.0
tolearn,0.0,0.0,0.0,0.0,0.0,1.0,0.0
is,0.0,0.0,0.0,0.0,0.0,0.0,2.0
I,0.0,0.0,0.0,0.0,0.0,0.0,1.0
love,0.0,0.0,0.0,0.0,2.0,0.0,0.0
nlp,1.0,0.0,1.0,0.0,0.0,1.0,0.0


In [3]:
pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py): started
  Building wheel for wikipedia (setup.py): finished with status 'done'
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11695 sha256=76d70d239dba5fa95ab32af5182f347dae2eec2943015c233d6ff05e4424bdf5
  Stored in directory: c:\users\koolt\appdata\local\pip\cache\wheels\c2\46\f4\caa1bee71096d7b0cdca2f2a2af45cacf35c5760bee8f00948
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
Note: you may need to restart the kernel to use updated packages.


In [9]:
import wikipedia
import pandas as pd
import numpy as np
import string
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_page(page_name: str) -> list:
    '''
    Retrieves page data from wikipedia
    and stores words in lower case format in
    a list - tokenized format.
    '''
    usa_article = wikipedia.page(page_name)
    # Strip puncuation from page
    usa_article = (
        usa_article.content.translate(str.maketrans('', '', string.punctuation))
    )
    # Lower text case
    usa_article = usa_article.lower()
    # Tokenize using NLTK word tokenizer
    usa_article_token = word_tokenize(usa_article)
    return usa_article_token

def build_vocab(page: list) -> list:
    '''
    Build the vocabulary with all the word
    present in the page
    '''
    vocab = list(set(page))
    vocab.sort()
    vocab_dict = {}
    for index, word in enumerate(vocab):
        vocab_dict[word] = index
    return vocab_dict


def build_context(page:str, 
                  co_occurrence_vectors: pd.DataFrame) -> pd.DataFrame:
        '''
    Updates co-ocurrence vectors based on
    text read from the page.
    '''
        for index, element in enumerate(page):
            # build start and end of the context
            start = 0 if index-2 < 0 else index-2
            finish = len(page) if index+2 > len(page) else index+3
        # Retrieve Context for word
        context = page[start:index]+page[index+1:finish]
        for word in context:
            # Update Co-Occurrence Matrix 
            co_occurrence_vectors.loc[element, word] = (co_occurrence_vectors.loc[element, word]+1)
        return co_occurrence_vectors
    

In [10]:
usa_article_token = retrieve_page('United States of America')
vocab_dict = build_vocab(usa_article_token)

In [15]:
co_occurance_matrix = pd.DataFrame(np.zeros([len(vocab_dict), len(vocab_dict)]),
                                   index = vocab_dict.keys(), columns=vocab_dict.keys())
print(co_occurance_matrix.head())

       07    1   10  100  1000  100000  100th  102  105  107  ...  year  \
07    0.0  0.0  0.0  0.0   0.0     0.0    0.0  0.0  0.0  0.0  ...   0.0   
1     0.0  0.0  0.0  0.0   0.0     0.0    0.0  0.0  0.0  0.0  ...   0.0   
10    0.0  0.0  0.0  0.0   0.0     0.0    0.0  0.0  0.0  0.0  ...   0.0   
100   0.0  0.0  0.0  0.0   0.0     0.0    0.0  0.0  0.0  0.0  ...   0.0   
1000  0.0  0.0  0.0  0.0   0.0     0.0    0.0  0.0  0.0  0.0  ...   0.0   

      years  yearsin  yellowstone  yom  york  yorktown  youtube  zealand    ’  
07      0.0      0.0          0.0  0.0   0.0       0.0      0.0      0.0  0.0  
1       0.0      0.0          0.0  0.0   0.0       0.0      0.0      0.0  0.0  
10      0.0      0.0          0.0  0.0   0.0       0.0      0.0      0.0  0.0  
100     0.0      0.0          0.0  0.0   0.0       0.0      0.0      0.0  0.0  
1000    0.0      0.0          0.0  0.0   0.0       0.0      0.0      0.0  0.0  

[5 rows x 3398 columns]


In [16]:
co_occurance_matrix = build_context(usa_article_token, co_occurance_matrix)

In [17]:
co_occurance_matrix

Unnamed: 0,07,1,10,100,1000,100000,100th,102,105,107,...,year,years,yearsin,yellowstone,yom,york,yorktown,youtube,zealand,’
07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
york,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
yorktown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
youtube,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zealand,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
similar_words = pd.DataFrame(cosine_similarity(co_occurance_matrix), 
                             columns=vocab_dict.keys(), index = vocab_dict.keys())

In [23]:
similar_words.loc['onehalf'].sort_values(ascending = False).head(10)

07          0.0
older       0.0
olds        0.0
oldtime     0.0
olympic     0.0
olympics    0.0
on          0.0
one         0.0
onehalf     0.0
onethird    0.0
Name: onehalf, dtype: float64

### 3.5 Hash Vectorization

A count vectorizer and co-occurrence matrix have one limitation though.
In these methods, the vocabulary can become very large and cause
memory/computation issues

Hash Vectorizer is memory efficient and instead of storing the tokens
as strings, the vectorizer applies the hashing trick to encode them as
numerical indexes. The downside is that it’s one way and once vectorized,
the features cannot be retrieved.

In [27]:
from sklearn.feature_extraction.text import HashingVectorizer

text = ["The quick brown fox jumped over the lazy dog."]

vectorizer = HashingVectorizer(n_features=10)
vector = vectorizer.transform(text)

In [33]:
print(vector.shape)
print(vector.toarray())

# It created vector of size 10 and now this can be used for any supervised/unsupervised tasks.

(1, 10)
[[ 0.          0.57735027  0.          0.          0.          0.
   0.         -0.57735027 -0.57735027  0.        ]]


### 3.6 Converting text to feature using TF-IDF

Again, in the above-mentioned text-to-feature methods, there are few
drawbacks, hence the introduction of TF-IDF. Below are the disadvantages
of the above methods.

- Let’s say a particular word is appearing in all the documents
of the corpus, then it will achieve higher importance in
our previous methods. That’s bad for our analysis.

- The whole idea of having TF-IDF is to reflect on how
important a word is to a document in a collection, and
hence normalizing words appeared frequently in all the
documents.

**Term frequency (TF):** Term frequency is simply the ratio of the count of a
word present in a sentence, to the length of the sentence.

**Inverse Document Frequency (IDF):** IDF of each word is the log of
the ratio of the total number of rows to the number of rows in a particular
document in which that word is present.

IDF will measure the rareness of a term. Words like “a,” and “the” show
up in all the documents of the corpus, but rare words will not be there
in all the documents. So, if a word is appearing in almost all documents,
then that word is of no use to us since it is not helping to classify or in
information retrieval. IDF will nullify this problem.

In [34]:
text = ["The quick brown fox jump over the lazy fox.",
       "The dog.",
       "The fox"]

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

vectorizer.fit(text)

TfidfVectorizer()

In [36]:
print(vectorizer.vocabulary_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jump': 3, 'over': 5, 'lazy': 4, 'dog': 1}


In [37]:
print(vectorizer.idf_)

[1.69314718 1.69314718 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]


If you observe, “the” is appearing in all the 3 documents and it does
not add much value, and hence the vector value is 1, which is less than all
the other vector representations of the tokens.

### 3.7 Implementing word embedding

Even though all previous methods solve most of the problems, once
we get into more complicated problems where we want to capture the
semantic relation between the words, these methods fail to perform
for following reasons

- All these techniques fail to capture the context and
meaning of the words. All the methods discussed so
far basically depend on the appearance or frequency
of the words. But we need to look at how to capture the
context or semantic relations: that is, how frequently
the words are appearing close by

- For a problem like a document classification (book
classification in the library), a document is really
huge and there are a humongous number of tokens
generated. In these scenarios, your number of features
can get out of control (wherein) thus hampering the
accuracy and performance


The answer to the above questions lies in creating a representation
for words that capture their meanings, semantic relationships, and the
different types of contexts they are used in.
The above challenges are addressed by **Word Embeddings**.

**Word embedding** is the feature learning technique where words from
the vocabulary are mapped to vectors of real numbers capturing the
contextual hierarchy

**word2vec:** word2vec is the deep learning Google framework to train
word embeddings. It will use all the words of the whole corpus and predict
the nearby words. It will create a vector for all the words present in the
corpus in a way so that the context is captured. 


There are mainly 2 types in word2vec.
- Skip-Gram
- Continuous Bag of Words (CBOW)

**Skip Gram**: The skip-gram model (Mikolov et al., 2013)1 is used to predict the
probabilities of a word given the context of word or words.

In [2]:
sentences = [['I', 'love', 'nlp'],
['I', 'will', 'learn', 'nlp', 'in', '2','months'],
['nlp', 'is', 'future'],
['nlp', 'saves', 'time', 'and', 'solves',
'lot', 'of', 'industry', 'problems'],
['nlp', 'uses', 'machine', 'learning']]

In [4]:
import gensim
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

# training the model

skipgram = Word2Vec(sentences=sentences, vector_size=50, window=3, min_count=1, sg=1)
print(skipgram)

Word2Vec(vocab=21, vector_size=50, alpha=0.025)


In [5]:
print(skipgram.wv['love'])

[ 1.56351421e-02 -1.90203767e-02 -4.11062239e-04  6.93839090e-03
 -1.87794690e-03  1.67635437e-02  1.80215649e-02  1.30730104e-02
 -1.42324448e-03  1.54208085e-02 -1.70686729e-02  6.41421322e-03
 -9.27599426e-03 -1.01779131e-02  7.17923651e-03  1.07406760e-02
  1.55390259e-02 -1.15330126e-02  1.48667190e-02  1.32509898e-02
 -7.41960062e-03 -1.74912829e-02  1.08749345e-02  1.30195096e-02
 -1.57510280e-03 -1.34197138e-02 -1.41718527e-02 -4.99412045e-03
  1.02865072e-02 -7.33047491e-03 -1.87401194e-02  7.65347946e-03
  9.76895820e-03 -1.28571270e-02  2.41711619e-03 -4.14975639e-03
  4.88042824e-05 -1.97670180e-02  5.38400654e-03 -9.50021297e-03
  2.17529293e-03 -3.15245148e-03  4.39334381e-03 -1.57631543e-02
 -5.43437013e-03  5.32639492e-03  1.06933638e-02 -4.78302967e-03
 -1.90201905e-02  9.01175477e-03]


In [6]:
skipgram.save('skipgram.bin')

In [21]:
#load the model

skipgram = Word2Vec.load('skipgram.bin')

X = skipgram[skipgram.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)

AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

In [19]:
pca = PCA(n_components=2)

In [20]:
pca.fit_transform(X)

TypeError: float() argument must be a string or a number, not 'dict'

**Continuous Bag of Words**:

CBOW is a variant of Word2vec model that we saw previously in which it tries to predict the center words from the (bag of) context words. So given all the words in the context window (excluding the middle one), CBOW would tell us the most likely the word at the center.

In [22]:
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
#Example sentences

sentences = [['I', 'love', 'nlp'],
['I', 'will', 'learn', 'nlp', 'in', '2','months'],
['nlp', 'is', 'future'],
['nlp', 'saves', 'time', 'and', 'solves',
'lot', 'of', 'industry', 'problems'],
['nlp', 'uses', 'machine', 'learning']]

cbow = Word2Vec(sentences, vector_size=50, window=3, sg=1, min_count=1)
print(cbow)

Word2Vec(vocab=21, vector_size=50, alpha=0.025)


In [24]:
cbow.wv['nlp']

array([-1.0724545e-03,  4.7286032e-04,  1.0206699e-02,  1.8018546e-02,
       -1.8605899e-02, -1.4233618e-02,  1.2917743e-02,  1.7945977e-02,
       -1.0030856e-02, -7.5267460e-03,  1.4761009e-02, -3.0669451e-03,
       -9.0732286e-03,  1.3108101e-02, -9.7203208e-03, -3.6320353e-03,
        5.7531595e-03,  1.9837476e-03, -1.6570430e-02, -1.8897638e-02,
        1.4623532e-02,  1.0140524e-02,  1.3515387e-02,  1.5257311e-03,
        1.2701779e-02, -6.8107317e-03, -1.8928051e-03,  1.1537147e-02,
       -1.5043277e-02, -7.8722099e-03, -1.5023164e-02, -1.8600845e-03,
        1.9076237e-02, -1.4638334e-02, -4.6675396e-03, -3.8754845e-03,
        1.6154870e-02, -1.1861792e-02,  9.0322494e-05, -9.5074698e-03,
       -1.9207101e-02,  1.0014586e-02, -1.7519174e-02, -8.7836506e-03,
       -7.0199967e-05, -5.9236528e-04, -1.5322480e-02,  1.9229483e-02,
        9.9641131e-03,  1.8466286e-02], dtype=float32)

### 3.7 Implementing Fast text :

FastText is a library created by the Facebook Research Team for efficient learning of word representations and sentence classification.

But the question that we should be really asking is – How is FastText different from gensim Word Vectors?

FastText differs in the sense that word vectors a.k.a word2vec treats every single word as the smallest unit whose vector representation is to be found but FastText assumes a word to be formed by a n-grams of character, for example, sunny is composed of [sun, sunn,sunny],[sunny,unny,nny]  etc, where n could range from 1 to the length of the word. This new representation of word by fastText provides the following benefits over word2vec or glove.

- It is helpful to find the vector representation for rare words. Since rare words could still be broken into character n-grams, they could share these n-grams with the common words. For example, for a model trained on a news dataset, the medical terms eg: diseases can be the rare words.

- It can give the vector representations for the words not present in the dictionary (OOV words) since these can also be broken down into character n-grams. word2vec and glove both fail to provide any vector representations for words not in the dictionary.

In [28]:
# Import fasttext

from gensim.models import FastText
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

fast = FastText(sentences,vector_size=20, window=1, min_count=1,workers=5, min_n=1, max_n=2)

In [30]:
fast.wv['nlp']

array([-0.0104417 , -0.00166992,  0.00851491, -0.00545158, -0.01564237,
        0.01678064,  0.00298394,  0.00162992, -0.01518791,  0.00655622,
        0.01039656, -0.00142836, -0.01665709,  0.00949577,  0.00262533,
       -0.00541661,  0.0063507 , -0.00105192, -0.02014118,  0.00102295],
      dtype=float32)

In [31]:
fast.wv['deep']

array([-0.00718654, -0.00310375, -0.00214245, -0.00143115, -0.00600197,
        0.00922425,  0.01241926, -0.00713524, -0.0069327 , -0.00987075,
        0.01335533, -0.0081027 ,  0.01761531, -0.00716007, -0.00427308,
        0.00729467,  0.01494504, -0.0162607 ,  0.01229173,  0.01455308],
      dtype=float32)

In [34]:
fast.save('fast.bin')

In [35]:
fast = Word2Vec.load('fast.bin')

In [42]:
fast.wv.bucket

2000000