# Text Representation Techniques

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/raghavbali/workshop_text_classification/blob/main/notebooks/02_text_representation.ipynb)

In this notebook, we will get familiar with some basic text representation techniques.

Key takeaways from this notebook are:

- Learn how to transform text into a usable numeric format using Bag of Words techniques such as:
  - Count Vectorizer
  - TF-IDF
  - Similarity Features

![text_repr.png](../assets/text_repr.png)

In [None]:
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import gutenberg
import seaborn as sns
import re

%matplotlib inline
pd.options.display.max_columns=10000

In [None]:
# First things first, download the Project Gutenberg corpus
nltk.download('gutenberg')

In [None]:
# get the text of Hamlet as a list of lines; each line is treated as one document
hamlet_raw = gutenberg.open('shakespeare-hamlet.txt')
hamlet_raw = hamlet_raw.readlines()

In [None]:
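# A utility function to perform basic cleanup
# word_tokenize needs the 'punkt' tokenizer data (newer NLTK releases look for 'punkt_tab')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))

def normalize_document(doc):
    # remove special characters (keep only letters and whitespace);
    # flags must be passed by keyword, the 4th positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    # lower case and strip surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stopwords]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc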

In [None]:
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(hamlet_raw)
norm_corpus

## Bag of Words : Term Frequency
The bag of words model is a simple vector space representation for text data. A vector space model is a mathematical model that transforms text into numeric vectors, such that each dimension of the vector is a specific feature/attribute. The bag of words model represents each text document as a numeric vector in which each dimension (column) is a specific word from the vocabulary and the value could be its frequency in the document. The model gets its name because each document is represented literally as a ‘bag’ of its own words, disregarding word order, sequence and grammar.
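
To make the idea concrete, here is a minimal sketch on a three-sentence toy corpus. The sentences and the names `toy_corpus`, `toy_cv`, `toy_vocab` and `toy_counts` are made up for illustration and are not part of the Hamlet data used in the rest of the notebook. Each row of the result is one document and each column counts one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the representation
toy_corpus = [
    'the sky is blue',
    'the sun is bright',
    'the sun in the sky is bright',
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each word to its column index; sort words by that index
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(toy_vocab)   # column labels
print(toy_counts)  # one row per document, one column per word
```

Note that the third row records "the" twice but keeps no trace of where in the sentence each word appeared, which is exactly the "bag" behaviour described above.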

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix

In [None]:
cv_matrix.shape

In [None]:
# column labels: one vocabulary word per feature
vocab = cv.get_feature_names_out()

In [None]:
# show document feature vectors
pd.DataFrame(cv_matrix, columns=vocab)

## TF-IDF
Using absolute frequency counts as a measure of importance has its shortcomings. Terms that occur frequently across all documents tend to overshadow rarer, more informative terms in the feature set. The TF-IDF model combats this by down-weighting terms that appear in many documents. TF-IDF, or Term Frequency-Inverse Document Frequency, combines two metrics in its computation: __term frequency (tf)__ and __inverse document frequency (idf)__.

Mathematically, we can define TF-IDF as

``tfidf(w, D) = tf(w, D) x idf(w, D)``

where each element of the TF-IDF matrix is the score for word **w** in document **D**.

The term **tf(w, D)** is the term frequency of the word **w** in document **D**, which can be obtained from the Bag of Words model.
The term **idf(w, D)** is the inverse document frequency of the word **w**. It is computed as the log of the total number of documents in the corpus, **C**, divided by the document frequency **df(w)**, that is, the number of documents in the corpus in which **w** occurs:

``idf(w, D) = log(C / df(w))``
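
The definition above can be checked by hand on a toy corpus. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`, which use a smoothed variant of the formula, idf(w) = ln((1 + N) / (1 + df(w))) + 1 with N the number of documents, and then scale each document vector to unit L2 length. The toy sentences and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the computation
toy_corpus = [
    'the sky is blue',
    'the sun is bright',
    'the sun in the sky is bright',
]

# tf: raw term counts per document (the Bag of Words matrix)
tf = CountVectorizer().fit_transform(toy_corpus).toarray()

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                  # documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (scikit-learn default)

# tf x idf, then L2-normalize every document (row) vector
manual_tfidf = tf * idf
manual_tfidf = manual_tfidf / np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

# Compare against the library result
sk_tfidf = TfidfVectorizer().fit_transform(toy_corpus).toarray()
print(np.allclose(manual_tfidf, sk_tfidf))
```

Words that occur in every document, such as "the" and "is" here, receive the smallest idf weight, which is the normalizing effect TF-IDF is meant to provide.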

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
tv_matrix.shape

In [None]:
vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

## Bag of N-Grams Model
A word is just a single token, often known as a **unigram** or 1-gram. We already know that the Bag of Words model ignores word order. But what if we also want to take into account phrases, that is, words which occur in sequence? **N-grams** help us achieve that. An N-gram is a sequence of N contiguous word tokens from a text document. Bi-grams are n-grams of order 2 (two words), tri-grams are n-grams of order 3 (three words), and so on. The Bag of N-Grams model is hence just an extension of the Bag of Words model that also leverages N-gram based features. The following example builds bi-gram based features for each document feature vector.

In [None]:
# ngram_range=(2, 2) keeps only bi-grams; use (1, 2) to get unigrams as well as bi-grams
bv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True, ngram_range=(2, 2))
bv_matrix = bv.fit_transform(norm_corpus)

bv_matrix = bv_matrix.toarray()
bv_matrix.shape

In [None]:
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)
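
To see what the bi-gram column names above are made of, here is a small pure-Python sketch of n-gram extraction. The helper `extract_ngrams` is hypothetical and written only for illustration; the vectorizer performs an equivalent step internally after tokenizing each document.

```python
def extract_ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'to be or not to be'.split()

bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)

print(bigrams)   # six tokens yield five overlapping bi-grams
print(trigrams)  # and four overlapping tri-grams
```

A sequence of k tokens yields k - n + 1 n-grams, so a document shorter than n tokens contributes no n-gram features at all.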

## Similarity Based Features
Now that we can transform text into vectors, we can build on these engineered features to generate new, similarity based features. These are useful in domains such as search engines, document clustering and information retrieval.

Pairwise document/sentence/term similarity involves computing a similarity score for each pair of entities in a corpus. Thus, if we have N entities in a corpus, we end up with an N x N matrix in which the cell at row i and column j holds the similarity score between entity i and entity j.

There are several similarity and distance metrics that are used to compute similarity. These include, among others:
- cosine distance/similarity
- euclidean distance
- manhattan distance
- BM25 similarity
- jaccard distance
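
As an illustration of the first metric in the list, the sketch below computes a pairwise cosine similarity matrix with NumPy and compares it with scikit-learn's `cosine_similarity`. The helper `cosine_sim_matrix` and the three toy vectors are assumptions made for this example, not part of the notebook's data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    unit = X / norms         # scale every row to unit length
    return unit @ unit.T     # dot products of unit vectors are cosines

# Hypothetical document vectors: rows 0 and 1 point in the same direction,
# row 2 shares no terms with either of them
docs = np.array([
    [1, 1, 0],
    [2, 2, 0],
    [0, 0, 3],
])

sims = cosine_sim_matrix(docs)
print(np.round(sims, 2))
print(np.allclose(sims, cosine_similarity(docs)))
```

Because cosine similarity depends only on direction, the first two rows score 1.0 against each other even though one vector is twice as long as the other, which is why the metric suits documents of different lengths.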


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
similarity_matrix = cosine_similarity(tv_matrix)
similarity_matrix

In [None]:
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df