# Text feature selection

So far we have seen how to process and wrangle text data. Machine learning and deep learning models cannot work with raw text directly; they only accept numeric representations of features as inputs.

Let's learn how to work with text data, which is definitely one of the most abundant sources of unstructured data. Text data usually consists of documents that can represent words, sentences, or even paragraphs of free-flowing text. The inherent lack of structure and noisy nature of textual data makes it harder for machine learning methods to directly work on raw text data. We start to explore some of the most popular and effective strategies for extracting meaningful features from text data. These features can then be used to represent text efficiently, which can be further leveraged in building machine learning or deep learning models easily to solve complex tasks.

Feature engineering is often the key to building better-performing machine learning models. We are going to cover a wide variety of feature engineering techniques for representing text data:

* Bag of Words model
* Bag of N-Grams model
* TF-IDF model
* Similarity features
* Topic models
* Word2Vec
* GloVe
* FastText

## Vector space

Let's consider a free-flowing text in the form of words, phrases, sentences, and entire documents. Words make phrases, which in turn make sentences, which in turn make paragraphs. However, you can have a wide variety of words that can vary across documents and each sentence will also be of variable length. 

A `vector space model` is a useful concept when dealing with textual data and is very popular in information retrieval and document ranking. The vector space model is also called the `term vector model` and is defined as a mathematical and algebraic model for transforming and representing text documents as numeric vectors of specific terms, which form the vector dimensions. 

Mathematically, consider we have a document `D` in a document vector space `VS`. The number of dimensions or columns for each document will be the total number of distinct terms or words for all documents in the vector space. Hence the vector space `VS` can be denoted as follows:

VS = {$W_1, W_2,  ... , W_n$}

where there are `n` distinct words across all documents. Now we can represent document `D` in this vector space `VS` as follows:

D = {$w_{D_1}, w_{D_2},  ... , w_{D_n}$}

where $w_{D_n}$ denotes the weight for word `n` in document `D`. This weight is a numeric value and can be anything ranging from the frequency of that word in the document, the average frequency of occurrence, embedding weights, or even the `TF-IDF` weight, which we discuss shortly.
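
As a minimal sketch of this idea, the following plain-Python snippet builds the vocabulary for two toy documents and represents each document as a vector of raw term counts. The documents and the helper name `to_vector` are illustrative assumptions, and raw frequency is only one of the possible weights mentioned above.

```python
# Toy documents (illustrative only, not the corpus used later)
docs = ["blue sky", "blue blue dog"]

# VS: the sorted set of distinct words across all documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc, vocabulary):
    """Represent a document as raw term counts, one dimension per vocabulary word."""
    words = doc.split()
    return [words.count(term) for term in vocabulary]

vectors = [to_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['blue', 'dog', 'sky']
print(vectors)     # [[1, 0, 1], [2, 1, 0]]
```

Every document ends up with the same number of dimensions regardless of its length, which is what makes the vectors comparable.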

An important point to remember about feature extraction is that once we build a feature engineering model using transformations and mathematical operations, we must apply that same fitted process when extracting features from new documents to be predicted, rather than rebuilding it from scratch on the new documents.

## Building a Text Corpus

We need a text corpus to work on and demonstrate different feature engineering and representation methodologies. Let's build a simple text corpus, i.e. a collection of text documents belonging to one or more subjects or topics.  

In [1]:
%run setup.ipynb

In [2]:
# building a corpus of documents
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'
          ]
labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

Unnamed: 0,Document,Category
0,The sky is blue and beautiful.,weather
1,Love this blue and beautiful sky!,weather
2,The quick brown fox jumps over the lazy dog.,animals
3,"A king's breakfast has sausages, ham, bacon, e...",food
4,"I love green eggs, ham, sausages and bacon!",food
5,The brown fox is quick and the blue dog is lazy!,animals
6,The sky is very blue and the sky is very beaut...,weather
7,The dog is lazy but the brown fox is quick!,animals


Before we talk about feature engineering, we need to do some data preprocessing and wrangling to remove unnecessary characters, symbols, and tokens.

## Preprocessing Our Text Corpus
There can be multiple ways of cleaning and preprocessing textual data. Let's highlight some of the most important ones that are used heavily in Natural Language Processing (NLP) pipelines. 
* **Removing tags**: Our text often contains unnecessary content like HTML tags, which do not add much value when analyzing text. The BeautifulSoup library does an excellent job in providing necessary functions for this.
* **Removing accented characters**: In any text corpus, even an English-language one, you might encounter accented characters/letters. Hence, you need to make sure that these characters are converted and standardized into ASCII characters. 
* **Expanding contractions**: In the English language, contractions are shortened versions of words or syllables, created by removing specific letters and sounds. Examples include don’t (from do not) and I’d (from I would). Converting each contraction to its expanded, original form often helps with text standardization.
* **Removing special characters**: Special characters and symbols that are usually non alphanumeric characters often add to the extra noise in unstructured text. More often than not, simple regular expressions (regexes) can be used to achieve this.
* **Stemming**: Word stems are the base form of possible words that can be created by attaching affixes like prefixes and suffixes to the stem to create new words. This is known as `inflection`. The reverse process of obtaining the base form of a word is known as `stemming`. For example, the words watches, watching, and watched all have the root stem watch as their base form. 
* **Lemmatization**: It is very similar to stemming, where we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word but not the root stem. The difference being that the root word is always a lexicographically correct word (present in the dictionary) but the root stem may not always be correct.
* **Removing stopwords**: Words that have little or no significance, especially when constructing meaningful features from text, are known as stopwords. These are usually words that end up having the maximum frequency if you do a simple term or word frequency in a corpus. Words like `a`, `an`, `the`, and so on are considered to be stopwords. There is no universal stopword list, but we use a standard English language stopwords list from NLTK. You can also add your own domain specific stopwords as needed.

You can also do other standard operations like tokenization, removing extra whitespace, text lowercasing and more advanced operations like spelling corrections, grammatical error corrections, removing repeated characters.
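
A few of the steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the preprocessor used in this notebook: the function names and the three-entry `CONTRACTIONS` map are assumptions made for the example, and a real pipeline would use a much larger contraction map.

```python
import re
import unicodedata

# Tiny illustrative contraction map (an assumption; real pipelines use a much larger one)
CONTRACTIONS = {"don't": "do not", "i'd": "i would", "can't": "cannot"}

def remove_accents(text):
    """Decompose accented characters and drop the non-ASCII combining marks."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

def expand_contractions(text):
    """Replace each known contraction with its expanded form (case-insensitive match)."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS),
                         flags=re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

def remove_special_characters(text):
    """Keep only letters, digits, and whitespace."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

sample = "I don't like the café's crème brûlée!"
cleaned = remove_special_characters(expand_contractions(remove_accents(sample)))
print(cleaned)  # I do not like the cafes creme brulee
```

The order matters here: contractions are expanded before special characters are stripped, because the apostrophe is needed to recognize them.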

Define a simple text preprocessor that focuses on removing special characters, extra whitespace, digits, stopwords, and then lower casing the text corpus.

In [3]:
%%capture
%run text_libraries.ipynb

In [4]:
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
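    # remove special characters and digits, then lowercase and strip whitespace
    # note: flags must be passed by keyword; the fourth positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)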
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

Once we have our basic preprocessing pipeline ready, let’s apply it to our sample corpus so we can use it for feature engineering.

In [5]:
norm_corpus = normalize_corpus(corpus)
norm_corpus

array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog',
       'kings breakfast sausages ham bacon eggs toast beans',
       'love green eggs ham sausages bacon',
       'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
       'dog lazy brown fox quick'], dtype='<U51')

## Traditional Feature Engineering Models

Traditional (count-based) feature engineering strategies for textual data belong to a family of models popularly known as the Bag of Words model. This includes term frequencies, TF-IDF (term frequency-inverse document frequency), N-grams, and so on. While they are effective methods for extracting features from text, due to the inherent nature of the model being just a bag of unstructured words, we lose additional information like the semantics, structure, sequence, and context around nearby words in each text document. 

The traditional feature engineering models are built using mathematical and statistical methodologies. 

### Bag of Words Model

Bag of Words (BoW) is perhaps the simplest vector space representational model for unstructured text. A vector space model is simply a mathematical model to represent unstructured text (or any other data) as numeric vectors, such that each dimension of the vector is a specific feature/attribute. The Bag of Words model represents each `text document` as a `numeric vector` where each `dimension` is a `specific word` from the corpus and the `value` could be its frequency in the document, occurrence (denoted by 1 or 0), or even weighted values. The model gets its name because each document is represented literally as a bag of its own words, disregarding word order, sequences, and grammar.

Use the `CountVectorizer` class to convert a collection of text documents to a matrix of token counts.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
# get bag of words features in sparse format
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix

<8x20 sparse matrix of type '<class 'numpy.int64'>'
	with 42 stored elements in Compressed Sparse Row format>

In [7]:
# view non-zero feature positions in the sparse matrix
print(cv_matrix)

  (0, 17)	1
  (0, 3)	1
  (0, 2)	1
  (1, 17)	1
  (1, 3)	1
  (1, 2)	1
  (1, 14)	1
  (2, 15)	1
  (2, 5)	1
  (2, 8)	1
  (2, 11)	1
  (2, 13)	1
  (2, 6)	1
  (3, 12)	1
  (3, 4)	1
  (3, 16)	1
  (3, 10)	1
  (3, 0)	1
  (3, 7)	1
  (3, 18)	1
  (3, 1)	1
  (4, 14)	1
  (4, 16)	1
  (4, 10)	1
  (4, 0)	1
  (4, 7)	1
  (4, 9)	1
  (5, 3)	1
  (5, 15)	1
  (5, 5)	1
  (5, 8)	1
  (5, 13)	1
  (5, 6)	1
  (6, 17)	2
  (6, 3)	1
  (6, 2)	1
  (6, 19)	1
  (7, 15)	1
  (7, 5)	1
  (7, 8)	1
  (7, 13)	1
  (7, 6)	1


The feature matrix is traditionally represented as a sparse matrix since the number of features increases with each document considering each distinct word becomes a feature. The preceding output tells us the total count for each `(x, y)` pair. Here, `x` represents a document and `y` represents a specific word/feature and the value is the number of times `y` occurs in `x`. 
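
To make the `(x, y)` notation concrete, here is a small plain-Python sketch that expands the stored entries for documents `0` and `1` from the printed output above into dense rows. The dictionary is a hand-copied stand-in for the sparse matrix, not how SciPy stores it internally.

```python
# (document index, feature index) -> count, copied from the first seven
# stored entries in the printed output above (documents 0 and 1)
stored = {(0, 17): 1, (0, 3): 1, (0, 2): 1,
          (1, 17): 1, (1, 3): 1, (1, 2): 1, (1, 14): 1}

n_docs, n_features = 2, 20

# Expand to dense rows: every position that is not stored is an implicit zero
dense = [[stored.get((row, col), 0) for col in range(n_features)]
         for row in range(n_docs)]

for row in dense:
    print(row)

# Fraction of cells that actually need storage
density = len(stored) / (n_docs * n_features)
print(density)  # 0.175
```

Only 7 of the 40 cells hold a value, which is why the sparse format saves memory as the vocabulary grows.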

In [8]:
# view dense representation
# warning might give a memory error if data is too big
cv_matrix = cv_matrix.toarray()
cv_matrix

array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]],
      dtype=int64)

These documents have been converted into numeric vectors so that each document is represented by one vector (row) in the feature matrix and each column represents a unique word as a feature. The following code shows this in an easier-to-understand format. 

In [9]:
# get all unique words in the corpus
vocab = cv.get_feature_names_out()
# show document feature vectors
pd.DataFrame(cv_matrix, columns=vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
2,0,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,0,0,0
3,1,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0
4,1,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,1,0,0,0
5,0,0,0,1,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0
6,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1
7,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0


You can clearly see that each column or dimension in the feature vectors represents a word from the corpus and each row represents one of our documents. The value in any cell represents the number of times that word (represented by column) occurs in the specific document (represented by row). A simple example would be the first document has the words blue, beautiful, and sky occurring once each and hence the corresponding features have a value of 1 for the first row in the preceding output. Hence, if a corpus of documents consists of `N` unique words across all the documents, we would have an `N`-dimensional vector for each of the documents.

### Bag of N-Grams Model

A word is just a single token, often known as a unigram or `1`-gram. We already know that the `Bag of Words` model does not consider the order of words. But what if we also wanted to take into account phrases or collection of words that occur in a sequence? `N`-grams help us do that. 

An `N`-gram is basically a collection of word tokens from a text document such that these tokens are contiguous and occur in a sequence. 

`Bi`-grams indicate `n`-grams of order `2` (two words), `tri`-grams indicate `n`-grams of order `3` (three words), and so on. 

The Bag of `N`-Grams model is just an extension of the Bag of Words model that leverages `N`-gram based features. 
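
Before turning to scikit-learn, the definition of an `N`-gram can be illustrated in a few lines of plain Python. The helper `word_ngrams` is a hypothetical name used only for this sketch; it splits on whitespace and slides a window of `n` tokens over the document.

```python
def word_ngrams(text, n):
    """Return all contiguous sequences of n word tokens in text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc = "quick brown fox jumps lazy dog"
print(word_ngrams(doc, 1))  # unigrams
print(word_ngrams(doc, 2))  # bi-grams
print(word_ngrams(doc, 3))  # tri-grams
```

A document with `k` tokens yields `k - n + 1` n-grams, so a document shorter than `n` tokens contributes none.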

Instantiate `CountVectorizer` with the `ngram_range` parameter, a tuple `(min_n, max_n)` that gives the lower and upper bounds of the n-values for the word n-grams to extract. 

In [10]:
# you can set the n-gram range to 1,2 to get unigrams as well as bigrams
bv = CountVectorizer(ngram_range=(1,2))

bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)

Unnamed: 0,bacon,bacon eggs,beans,beautiful,beautiful sky,beautiful today,blue,blue beautiful,blue dog,blue sky,...,quick brown,sausages,sausages bacon,sausages ham,sky,sky beautiful,sky blue,toast,toast beans,today
0,0,0,0,1,0,0,1,1,0,0,...,0,0,0,0,1,0,1,0,0,0
1,0,0,0,1,1,0,1,1,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,1,1,0
4,1,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
5,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,1,0,1,1,0,0,1,...,0,0,0,0,2,1,1,0,0,1
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
# you can set the n-gram range to 2,2 to get bigrams only
bv = CountVectorizer(ngram_range=(2,2))

bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)

Unnamed: 0,bacon eggs,beautiful sky,beautiful today,blue beautiful,blue dog,blue sky,breakfast sausages,brown fox,dog lazy,eggs ham,...,lazy dog,love blue,love green,quick blue,quick brown,sausages bacon,sausages ham,sky beautiful,sky blue,toast beans
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,1,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
3,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,1
4,0,0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,0
6,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
7,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0


This gives us feature vectors for our documents, where each feature is a `bi`-gram representing a sequence of two words and each value is the number of times that `bi`-gram occurs in the document. 

#### Exercise

Play with ngram_range by setting ngram_range to (1, 3) and see the outputs. 

### TF-IDF Model

There are some potential problems that might arise with the Bag of Words model when it is used on large corpora. Since the feature vectors are based on absolute term frequencies, some terms that occur frequently across all documents may overshadow other terms in the feature set, especially words that occur less frequently but might be more interesting and effective as features for identifying specific categories.

This is where TF-IDF is useful. TF-IDF stands for `term frequency-inverse document frequency`. It is a combination of two metrics, term frequency (`tf`) and inverse document frequency (`idf`). This technique was originally developed as a metric for ranking search engine results based on user queries and has come to be a part of information retrieval and text feature extraction.

Let's formally define TF-IDF now and look at the mathematical representations before diving into its implementation. TF-IDF is the product of two metrics and can be represented as follows:

`tf-idf = tf x idf`

where term frequency (`tf`) and inverse document frequency (`idf`) represent the two metrics we just talked about. Term frequency, denoted by `tf`, is what we computed in the Bag of Words model. Term frequency in any document vector is denoted by the raw frequency value of that term in a particular document. Mathematically it can be represented as follows:

`tf(w,D) =` $f_{w_D}$

where $f_{w_D}$ denotes the frequency of word $w$ in document $D$, which becomes the term frequency ($tf$). Sometimes you can also normalize the absolute raw frequency using logarithms or averaging the frequency. We use the raw frequency in our computations.

Inverse document frequency, denoted by `idf`, is the inverse of the document frequency for each term. It is computed by dividing the total number of documents in our corpus by the document frequency for each term and then applying logarithmic scaling to the result. 
* We add 1 to the document frequency for each term and 1 to the total number of documents, as if the corpus contained one extra document that has every term in the vocabulary. This prevents potential division-by-zero errors and smooths the inverse document frequencies. 
* We also add 1 to the result of our idf computation so that terms occurring in every document, whose logarithm would otherwise be zero, are not ignored. 

Mathematically, `idf` can be represented as follows:

`idf(w) = 1 + log((1 + N) / (1 + df(w)))`

where `idf(w)` represents the `idf` for the term/word `w`, `N` represents the total number of documents in our corpus, `df(w)` represents the number of documents in which the term `w` is present, and `log` is the natural logarithm.
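
As a quick worked example, the sketch below evaluates the smoothed idf for three terms of our sample corpus. Following the smoothing described in the bullets (one extra imaginary document that contains every term), it adds 1 to both the document count and the document frequency and uses the natural logarithm. The function name `smoothed_idf` is an assumption for this illustration, and the document frequencies were counted by hand from the normalized corpus.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf = 1 + ln((1 + N) / (1 + df)), i.e. add-one smoothing on both counts."""
    return 1.0 + math.log((1 + n_docs) / (1 + doc_freq))

n_docs = 8  # documents in the sample corpus
# number of documents containing each term, counted by hand
doc_freqs = {"blue": 4, "sky": 3, "today": 1}

for term, df_count in doc_freqs.items():
    print(term, round(smoothed_idf(n_docs, df_count), 2))
# blue 1.59
# sky 1.81
# today 2.5
```

The rarer a term is across documents, the larger its idf: `today` appears in a single document and gets the highest weight of the three.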

Thus, the term frequency-inverse document frequency can be computed by multiplying these two measures. 

The final TF-IDF metric that we will be using is a normalized version of the `tfidf` matrix that we get from the product of `tf` and `idf`. 
* We normalize each document's row of the `tfidf` matrix by dividing it by its `L2` norm, also known as the `Euclidean` norm, which is the square root of the sum of the squares of the `tfidf` weights of that document's terms. 

The final tfidf feature vector is represented as follows:

`tfidf final = tfidf / ∥tfidf∥`

where `∥tfidf∥` represents the Euclidean L2 norm of a document's `tfidf` vector. 
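
The normalization step can be checked by hand for a single document. The sketch below takes a document that contains `sky`, `blue`, and `beautiful` once each, with idf values computed using add-one smoothing, `1 + ln((1 + 8) / (1 + df))`, for our eight-document corpus. The helper name `l2_normalize` is an assumption for this illustration.

```python
import math

def l2_normalize(vector):
    """Divide each weight by the Euclidean (L2) norm of the vector."""
    norm = math.sqrt(sum(w * w for w in vector))
    return [w / norm for w in vector] if norm else list(vector)

# raw tf-idf weights (tf = 1 for every term in this document)
raw = [1.0 * (1 + math.log(9 / 4)),   # sky, present in 3 documents
       1.0 * (1 + math.log(9 / 5)),   # blue, present in 4 documents
       1.0 * (1 + math.log(9 / 4))]   # beautiful, present in 3 documents

unit = l2_normalize(raw)
print([round(w, 2) for w in unit])          # [0.6, 0.53, 0.6]
print(round(sum(w * w for w in unit), 6))   # 1.0
```

After normalization every document vector has unit length, so long and short documents become directly comparable.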

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer

tt = TfidfTransformer(norm='l2', use_idf=True)
tt_matrix = tt.fit_transform(cv_matrix)
tt_matrix = tt_matrix.toarray()
vocab = cv.get_feature_names_out()
pd.DataFrame(np.round(tt_matrix, 2), columns=vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.6,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
1,0.0,0.0,0.49,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.49,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.38,0.0,0.38,0.0,0.0,0.53,0.0,0.38,0.0,0.38,0.0,0.0,0.0,0.0
3,0.32,0.38,0.0,0.0,0.38,0.0,0.0,0.32,0.0,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.32,0.0,0.38,0.0
4,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.0,0.47,0.39,0.0,0.0,0.0,0.39,0.0,0.39,0.0,0.0,0.0
5,0.0,0.0,0.0,0.37,0.0,0.42,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0
6,0.0,0.0,0.36,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.72,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.45,0.45,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.0,0.0,0.0


You can see that we used the L2 norm option in the parameters. IDF smoothing is applied as well, because `smooth_idf` defaults to `True` in `TfidfTransformer`, so terms that would otherwise have zero IDF still receive some weight and are not ignored.

The `TfidfVectorizer` from scikit-learn enables us to compute the `tfidf` vectors directly from the raw documents. It internally computes the term frequencies as well as the inverse document frequencies, which eliminates the need for a separate `CountVectorizer` step to compute the term frequencies based on the Bag of Words model. Support is also present for adding n-grams to the feature vectors. 

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., norm="l2",
                     use_idf=True, smooth_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.6,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
1,0.0,0.0,0.49,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.49,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.38,0.0,0.38,0.0,0.0,0.53,0.0,0.38,0.0,0.38,0.0,0.0,0.0,0.0
3,0.32,0.38,0.0,0.0,0.38,0.0,0.0,0.32,0.0,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.32,0.0,0.38,0.0
4,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.0,0.47,0.39,0.0,0.0,0.0,0.39,0.0,0.39,0.0,0.0,0.0
5,0.0,0.0,0.0,0.37,0.0,0.42,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0
6,0.0,0.0,0.36,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.72,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.45,0.45,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.0,0.0,0.0


We used the L2 norm option in the parameters and made sure we smoothed the idfs. You can see from the output that the tfidf feature vectors match the ones we obtained previously.

### Understanding the TF-IDF Model

We start by loading the necessary dependencies and computing the term frequencies (TF) for our sample corpus. 

In [14]:
# get unique words as feature names
unique_words = list(set([word for doc in [doc.split() for doc in norm_corpus]
                         for word in doc]))
def_feature_dict = {w: 0 for w in unique_words}
print('Feature Names:', unique_words)
print('Default Feature Dict:', def_feature_dict)

Feature Names: ['blue', 'toast', 'sky', 'today', 'dog', 'beans', 'sausages', 'eggs', 'jumps', 'fox', 'breakfast', 'quick', 'love', 'brown', 'lazy', 'kings', 'ham', 'bacon', 'beautiful', 'green']
Default Feature Dict: {'blue': 0, 'toast': 0, 'sky': 0, 'today': 0, 'dog': 0, 'beans': 0, 'sausages': 0, 'eggs': 0, 'jumps': 0, 'fox': 0, 'breakfast': 0, 'quick': 0, 'love': 0, 'brown': 0, 'lazy': 0, 'kings': 0, 'ham': 0, 'bacon': 0, 'beautiful': 0, 'green': 0}


In [15]:
from collections import Counter
# build bag of words features for each document - term frequencies
bow_features = []
for doc in norm_corpus:
    bow_feature_doc = Counter(doc.split())
    all_features = Counter(def_feature_dict)
    bow_feature_doc.update(all_features)
    bow_features.append(bow_feature_doc)
bow_features = pd.DataFrame(bow_features)
bow_features

Unnamed: 0,sky,blue,beautiful,toast,today,dog,beans,sausages,eggs,jumps,fox,breakfast,quick,love,brown,lazy,kings,ham,bacon,green
0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,1,1,0,1,0,1,1,0,0,0,0
3,0,0,0,1,0,0,1,1,1,0,0,1,0,0,0,0,1,1,1,0
4,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,1,1
5,0,1,0,0,0,1,0,0,0,0,1,0,1,0,1,1,0,0,0,0
6,2,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,1,0,0,0,0,1,0,1,0,1,1,0,0,0,0


We now compute our document frequencies (DF) for each term based on the number of documents in which the term occurs. 

In [16]:
import scipy.sparse as sp
feature_names = list(bow_features.columns)
# build the document frequency matrix
df = np.diff(sp.csc_matrix(bow_features, copy=True).indptr)
df = 1 + df # adding 1 to smoothen idf later
# show smoothened document frequencies
pd.DataFrame([df], columns=feature_names)

Unnamed: 0,sky,blue,beautiful,toast,today,dog,beans,sausages,eggs,jumps,fox,breakfast,quick,love,brown,lazy,kings,ham,bacon,green
0,4,5,4,2,2,4,2,3,3,2,4,2,4,3,4,4,2,3,3,2


This tells us the document frequency (DF) for each term and you can verify it with the documents in our sample corpus. Remember that we added 1 to each frequency value to smooth the IDF values later and prevent division-by-zero errors by assuming we have an imaginary document that contains every term once. Thus, if you check in the corpus, you will see that `bacon` occurs in `2(+1)` documents, `sky` occurs in `3(+1)` documents, and so on, considering `(+1)` for our smoothing.

Now that we have the document frequencies, we compute the inverse document frequency (IDF) by using our formula, which we defined earlier. Remember to add `1` to the total count of documents in the corpus to add the document, which we had assumed earlier to contain all the terms at least once for smoothening the idfs.

In [17]:
# compute inverse document frequencies
total_docs = 1 + len(norm_corpus)
idf = 1.0 + np.log(float(total_docs) / df)
# show smoothened idfs
pd.DataFrame([np.round(idf, 2)], columns=feature_names)

Unnamed: 0,sky,blue,beautiful,toast,today,dog,beans,sausages,eggs,jumps,fox,breakfast,quick,love,brown,lazy,kings,ham,bacon,green
0,1.81,1.59,1.81,2.5,2.5,1.81,2.5,2.1,2.1,2.5,1.81,2.5,1.81,2.1,1.81,1.81,2.5,2.1,2.1,2.5


Thus, we can see the smoothed inverse document frequencies for each feature in our corpus. We now convert this into a matrix for easier operations when we compute the overall TF-IDF score later. 

In [18]:
# compute idf diagonal matrix
total_features = bow_features.shape[1]
idf_diag = sp.spdiags(idf, diags=0, m=total_features, n=total_features)
idf_dense = idf_diag.todense()
# print the idf diagonal matrix
pd.DataFrame(np.round(idf_dense, 2))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,1.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.59,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,1.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


You can now see the idf matrix that we created based on our mathematical equation. We also converted it to a diagonal matrix, which is one way to compute the product with term frequency. Now that we have our TFs and IDFs, we can compute the raw TF-IDF feature matrix. The following snippet scales each term-frequency column by its idf using element-wise multiplication, which gives the same result as a matrix product with the diagonal idf matrix.

In [19]:
# compute tfidf feature matrix
tf = np.array(bow_features, dtype="float64")
tfidf = tf * idf
# view raw tfidf feature matrix
pd.DataFrame(np.round(tfidf, 2),columns=feature_names)

Unnamed: 0,sky,blue,beautiful,toast,today,dog,beans,sausages,eggs,jumps,fox,breakfast,quick,love,brown,lazy,kings,ham,bacon,green
0,1.81,1.59,1.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.81,1.59,1.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.1,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.81,0.0,0.0,0.0,2.5,1.81,0.0,1.81,0.0,1.81,1.81,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,2.5,0.0,0.0,2.5,2.1,2.1,0.0,0.0,2.5,0.0,0.0,0.0,0.0,2.5,2.1,2.1,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.1,2.1,0.0,0.0,0.0,0.0,2.1,0.0,0.0,0.0,2.1,2.1,2.5
5,0.0,1.59,0.0,0.0,0.0,1.81,0.0,0.0,0.0,0.0,1.81,0.0,1.81,0.0,1.81,1.81,0.0,0.0,0.0,0.0
6,3.62,1.59,1.81,0.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,1.81,0.0,0.0,0.0,0.0,1.81,0.0,1.81,0.0,1.81,1.81,0.0,0.0,0.0,0.0
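
The element-wise product `tf * idf` used above and a matrix product with the diagonal idf matrix are interchangeable. The plain-Python sketch below checks this on a tiny example; the numbers are made up for illustration and are not taken from our corpus.

```python
# term counts for two toy documents over three terms (illustrative values)
tf = [[1, 0, 2],
      [0, 3, 1]]
idf = [1.5, 2.0, 1.0]

# Option 1: element-wise scaling of each column by its idf
elementwise = [[count * weight for count, weight in zip(row, idf)] for row in tf]

# Option 2: matrix product of tf with diag(idf)
n = len(idf)
idf_diag = [[idf[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
matmul = [[sum(row[k] * idf_diag[k][j] for k in range(n)) for j in range(n)]
          for row in tf]

print(elementwise)            # [[1.5, 0.0, 2.0], [0.0, 6.0, 1.0]]
print(matmul == elementwise)  # True
```

The diagonal form is convenient with sparse matrices, while the element-wise form is simpler for small dense arrays like ours.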


We now have our tfidf feature matrix, but we still have to divide it by the L2 norm, as described in our equations earlier. The following snippet computes the tfidf norm for each document and then divides the tfidf weights by that norm to give us the final desired tfidf matrix.

In [20]:
from numpy.linalg import norm
# compute L2 norms
norms = norm(tfidf, axis=1)
# print norms for each document
print(np.round(norms, 3))

# compute normalized tfidf
norm_tfidf = tfidf / norms[:, None]
# show final tfidf feature matrix
pd.DataFrame(np.round(norm_tfidf, 2), columns=feature_names)

[3.013 3.672 4.761 6.534 5.319 4.35  5.019 4.049]


Unnamed: 0,sky,blue,beautiful,toast,today,dog,beans,sausages,eggs,jumps,fox,breakfast,quick,love,brown,lazy,kings,ham,bacon,green
0,0.6,0.53,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.49,0.43,0.49,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.0,0.0,0.0,0.53,0.38,0.0,0.38,0.0,0.38,0.38,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.38,0.0,0.0,0.38,0.32,0.32,0.0,0.0,0.38,0.0,0.0,0.0,0.0,0.38,0.32,0.32,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.39,0.0,0.0,0.0,0.0,0.39,0.0,0.0,0.0,0.39,0.39,0.47
5,0.0,0.37,0.0,0.0,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.42,0.42,0.0,0.0,0.0,0.0
6,0.72,0.32,0.36,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.45,0.45,0.0,0.0,0.0,0.0


If you compare the tfidf feature matrix obtained for the documents in our corpus with the feature matrix obtained using TfidfTransformer or TfidfVectorizer earlier, you will notice that the values are the same (only the column order differs), thus verifying that our mathematical implementation was correct. 

### Extracting Features for New Documents
Suppose you built a machine learning model to classify and categorize news articles and it is currently in production. How can you generate features for completely new documents so that you can feed them into the machine learning model for prediction? The Scikit-Learn API provides the `transform(...)` function for the vectorizers we discussed previously and we can leverage it to get features for a completely new document that was not present in our corpus (when we trained our model). 

In [21]:
new_doc = 'the sky is green today'
pd.DataFrame(np.round(tv.transform([new_doc]).toarray(), 2),
             columns=tv.get_feature_names_out())

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.63,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.46,0.0,0.63


Use the `fit_transform(...)` function to build a feature matrix on all documents in your corpus. This typically becomes the training feature set on which you build and train your predictive or other machine learning models. Once ready, leverage the `transform(...)` function to generate feature vectors for new documents. These can then be fed into your trained models to generate insights as needed.
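
The split between fitting and transforming can be illustrated without scikit-learn. The class below, `TinyCountVectorizer`, is a hypothetical and deliberately minimal stand-in written for this sketch; it is not the scikit-learn implementation, but it follows the same convention of learning the vocabulary once and reusing it for new documents.

```python
class TinyCountVectorizer:
    """Hypothetical, minimal stand-in that mimics the fit/transform split."""

    def fit(self, documents):
        # learn the vocabulary once, from the training documents only
        words = sorted({w for doc in documents for w in doc.split()})
        self.vocabulary_ = {word: idx for idx, word in enumerate(words)}
        return self

    def transform(self, documents):
        # reuse the learned vocabulary; words never seen during fit are skipped
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for word in doc.split():
                if word in self.vocabulary_:
                    row[self.vocabulary_[word]] += 1
            rows.append(row)
        return rows

train_docs = ["sky blue beautiful", "love green eggs ham"]
vectorizer = TinyCountVectorizer().fit(train_docs)

new_vector = vectorizer.transform(["sky green today"])
print(sorted(vectorizer.vocabulary_))  # vocabulary is unchanged by transform
print(new_vector)                      # [[0, 0, 0, 1, 0, 0, 1]]
```

The word `today` was not seen during fitting, so it is dropped, and the new vector keeps the same seven dimensions the model was trained on.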

### Document Similarity
Document similarity is the process of using a distance or similarity based metric that can identify how similar a text document is to any other document(s) based on features extracted from the documents, like `Bag of Words` or `TF-IDF`. Thus you can see that we can build on top of the TF-IDF-based features and use them to generate new features. Applications such as search engines, document clustering, and information retrieval can leverage these similarity-based features.

Pairwise document similarity in a corpus involves computing document similarity for each pair of documents in a corpus. Thus, if you have `C` documents in a corpus, you would end up with a `C x C` matrix, in which the cell at row `i` and column `j` holds the similarity score for the pair of documents with indices `i` and `j`. There are several similarity and distance metrics that are used to compute document similarity. These include cosine distance/similarity, Euclidean distance, Manhattan distance, BM25 similarity, Jaccard distance, and so on. In this notebook, we use one of the most popular and widely used similarity metrics, cosine similarity, and compare pairwise document similarity based on the documents' TF-IDF feature vectors.

In [22]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

Unnamed: 0,0,1,2,3,4,5,6,7
0,1.0,0.820599,0.0,0.0,0.0,0.192353,0.817246,0.0
1,0.820599,1.0,0.0,0.0,0.225489,0.157845,0.670631,0.0
2,0.0,0.0,1.0,0.0,0.0,0.791821,0.0,0.850516
3,0.0,0.0,0.0,1.0,0.506866,0.0,0.0,0.0
4,0.0,0.225489,0.0,0.506866,1.0,0.0,0.0,0.0
5,0.192353,0.157845,0.791821,0.0,0.0,1.0,0.115488,0.930989
6,0.817246,0.670631,0.0,0.0,0.0,0.115488,1.0,0.0
7,0.0,0.0,0.850516,0.0,0.0,0.930989,0.0,1.0


Cosine similarity gives us a metric representing the cosine of the angle between the feature vector representations of two text documents. The smaller the angle between the documents, the closer and more similar they are.
* angle close to 0 degrees: cosine similarity score close to 1; vectors u and v are very similar to each other
* angle close to 90 degrees: cosine similarity score close to 0; vectors u and v are not similar to each other
* angle close to 180 degrees: cosine similarity score close to -1; vectors u and v are dissimilar and point in opposite directions
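
The three cases above can be reproduced with a short plain-Python sketch. The function name `cosine_similarity_pair` is an assumption for this illustration (it handles one pair of non-zero vectors, unlike the scikit-learn function, which works on whole matrices), and the example vectors are made up.

```python
import math

def cosine_similarity_pair(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity_pair([1, 2], [2, 4]))    # same direction, angle 0
print(cosine_similarity_pair([1, 0], [0, 3]))    # orthogonal, angle 90
print(cosine_similarity_pair([1, 2], [-1, -2]))  # opposite direction, angle 180
```

Because TF-IDF weights are never negative, the cosine similarity between two TF-IDF vectors always falls between 0 and 1, which is why no negative scores appear in the matrix above.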

Documents `0`, `1`, and `6`, as well as documents `2`, `5`, and `7`, are very similar to one another, whereas documents `3` and `4` are only moderately similar to each other. This indicates that the documents within each group share some features. This is a perfect example of grouping or clustering that can be solved by unsupervised learning, especially when you are dealing with huge corpora of millions of text documents.
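
As a rough sketch of how such groups could be recovered automatically, the snippet below links any two documents whose similarity exceeds a threshold and collects the linked documents into groups. This simple thresholding is swapped in here purely for illustration and is not a full clustering algorithm; the `THRESHOLD` value of 0.5 is an assumption chosen by eye for this tiny corpus, and the scores are the non-zero off-diagonal values from the similarity matrix above, rounded to two decimals.

```python
# non-zero off-diagonal similarities from the matrix above (upper triangle, rounded)
similarities = {
    (0, 1): 0.82, (0, 5): 0.19, (0, 6): 0.82,
    (1, 4): 0.23, (1, 5): 0.16, (1, 6): 0.67,
    (2, 5): 0.79, (2, 7): 0.85,
    (3, 4): 0.51,
    (5, 6): 0.12, (5, 7): 0.93,
}
THRESHOLD = 0.5  # assumed cut-off, chosen by eye for this tiny corpus

# each document starts in its own group; linked documents are merged
group_of = {doc: doc for doc in range(8)}

def find(doc):
    """Follow parent links until the representative of the group is reached."""
    while group_of[doc] != doc:
        doc = group_of[doc]
    return doc

for (a, b), score in similarities.items():
    if score > THRESHOLD:
        group_of[find(b)] = find(a)

groups = {}
for doc in range(8):
    groups.setdefault(find(doc), []).append(doc)
print(sorted(groups.values()))  # [[0, 1, 6], [2, 5, 7], [3, 4]]
```

The three groups line up with the `weather`, `animals`, and `food` labels assigned when the corpus was built, even though the labels were never used.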