<a href="https://colab.research.google.com/github/rahiakela/transformer-research-and-practice/blob/main/mastering-transformers/01-from-bag-of-words-to-transformer/1_bow_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## BoW implementation

Bag-of-words (BoW) is a technique that represents a document by counting the words it contains.
The main data structure of the technique is a document-term matrix.

Let's look at a simple implementation in Python. The following code builds a TF-IDF-weighted document-term matrix with the scikit-learn (`sklearn`) library for a toy corpus of three sentences:

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

In [2]:
toy_corpus = [
   "the fat cat sat on the mat",
   "the big cat slept",
   "the dog chased a cat"           
]

vectorizer = TfidfVectorizer()

In [4]:
corpus_tfidf = vectorizer.fit_transform(toy_corpus)

print(f"The vocabulary size is {len(vectorizer.vocabulary_.keys())}")
print(f"The document-term matrix shape is {corpus_tfidf.shape}")

The vocabulary size is 10
The document-term matrix shape is (3, 10)


Here the matrix is `3 x 10` (3 documents by 10 vocabulary terms), but on a realistic corpus it can grow far larger, for example `10K x 10M`.

In [6]:
df = pd.DataFrame(np.round(corpus_tfidf.toarray(), 2))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df.columns = vectorizer.get_feature_names_out()

df

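|   | big | cat | chased | dog | fat | mat | on | sat | slept | the |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.42 | 0.42 | 0.42 | 0.42 | 0.0 | 0.49 |
| 1 | 0.61 | 0.36 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.61 | 0.36 |
| 2 | 0.0 | 0.36 | 0.61 | 0.61 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.36 |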


The table shows a count-based matrix whose cell values are weighted by the
Term Frequency-Inverse Document Frequency (TF-IDF) scheme.
**This approach does not care about the position of words**.

**Since word order strongly determines meaning, ignoring it leads to a loss of meaning. This is a common problem in BoW methods, which is addressed by the recurrence mechanism in RNNs and by positional encoding in Transformers.**

Each column in the matrix stands for the vector of a word in the vocabulary, and each row stands for the vector of a document.

Semantic similarity metrics can be applied
to compute the similarity or dissimilarity of the words as well as documents.
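
As a minimal sketch of such a metric, cosine similarity can be computed between document rows. The vectors below are the rounded rows of the unigram table above; the helper `cosine_similarity` is defined here for illustration and is not the scikit-learn function of the same name.

```python
import math

# Rounded TF-IDF rows copied from the unigram document-term matrix above.
# Column order: big, cat, chased, dog, fat, mat, on, sat, slept, the
doc_vectors = [
    [0.0, 0.25, 0.0, 0.0, 0.42, 0.42, 0.42, 0.42, 0.0, 0.49],
    [0.61, 0.36, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.61, 0.36],
    [0.0, 0.36, 0.61, 0.61, 0.0, 0.0, 0.0, 0.0, 0.0, 0.36],
]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Pairwise document-document similarity matrix.
similarities = [
    [cosine_similarity(u, v) for v in doc_vectors] for u in doc_vectors
]
for i, row in enumerate(similarities):
    print(i, [round(s, 2) for s in row])
```

The three documents share only `cat` and `the`, so the off-diagonal similarities are low. Applying the same function to columns instead of rows would compare words rather than documents.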


We often add bigrams such as `cat sat` and `the mat` to enrich the document
representation. For instance, when the parameter `ngram_range=(1,2)` is passed to
`TfidfVectorizer`, it builds a vector space containing both unigrams `(big, cat,
dog)` and bigrams `(big cat, cat sat)`.

Such models are also called bag-of-ngrams, a natural extension of BoW.

In [7]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

corpus_tfidf = vectorizer.fit_transform(toy_corpus)

print(f"The vocabulary size is {len(vectorizer.vocabulary_.keys())}")
print(f"The document-term matrix shape is {corpus_tfidf.shape}")

The vocabulary size is 22
The document-term matrix shape is (3, 22)
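
The count of 22 can be checked by hand. The sketch below mimics, in plain Python, what `TfidfVectorizer` does by default before weighting: lowercase the text, keep tokens of two or more word characters (which is why `a` never appears in the vocabulary), and join adjacent tokens with a space to form bigrams. The `toy_corpus` list is repeated so the block runs on its own, and `extract_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
import re

toy_corpus = [
    "the fat cat sat on the mat",
    "the big cat slept",
    "the dog chased a cat",
]

# Same pattern as scikit-learn's default token_pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def extract_ngrams(text, min_n=1, max_n=2):
    """Return all n-grams of a text for n in [min_n, max_n] (hypothetical helper)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


# Collect the distinct n-grams across the corpus, sorted alphabetically.
vocabulary = sorted({g for doc in toy_corpus for g in extract_ngrams(doc)})
unigrams = [g for g in vocabulary if " " not in g]
bigrams = [g for g in vocabulary if " " in g]

print(len(unigrams), len(bigrams), len(vocabulary))
print(bigrams)
```

Because `a` is dropped before the bigrams are formed, the third sentence contributes `chased cat` rather than `chased a` and `a cat`.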


In [9]:
df = pd.DataFrame(np.round(corpus_tfidf.toarray(), 2))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df.columns = vectorizer.get_feature_names_out()

print(df.shape)
df

(3, 22)


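|   | big | big cat | cat | cat sat | cat slept | chased | chased cat | dog | dog chased | fat | fat cat | mat | on | on the | sat | sat on | slept | the | the big | the dog | the fat | the mat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.17 | 0.29 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.29 | 0.29 | 0.29 | 0.29 | 0.29 | 0.29 | 0.29 | 0.0 | 0.34 | 0.0 | 0.0 | 0.29 | 0.29 |
| 1 | 0.42 | 0.42 | 0.25 | 0.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.42 | 0.25 | 0.42 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.42 | 0.42 | 0.42 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.42 | 0.0 | 0.0 |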


If a word appears in nearly every document, such as `and` or `the`, it is considered high-frequency. Conversely, words that hardly appear in any document are called low-frequency (or rare) words. Because both high-frequency and low-frequency words can prevent the model from working properly, TF-IDF, one of the most important and well-known weighting mechanisms, is used here as a solution.

Inverse Document Frequency (IDF) is a statistical weight that measures how informative a word is across the corpus. For example, while the word `the` has no discriminative power, `chased` can be highly informative and give clues about the subject of the text. This is because high-frequency words (stopwords, function words) have little discriminating power in understanding the documents.

The discriminative power of a term also depends on the domain. For instance, a collection of DL articles is likely to have the word `network` in almost every document. IDF scales down the weight of a term according to its Document Frequency (DF), which is the number of documents in which the term appears. Term Frequency (TF) is the raw count of a term (word) in a document.
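
The weighting can be reproduced by hand for the first toy document. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings: smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. Other TF-IDF variants use different formulas, and the variable names here are illustrative.

```python
import math
from collections import Counter

# Tokenized toy corpus; the single-character token "a" is omitted because
# TfidfVectorizer's default tokenizer drops it.
docs = [
    "the fat cat sat on the mat".split(),
    "the big cat slept".split(),
    "the dog chased cat".split(),
]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf_row(doc):
    """TF-IDF weights for one document under the assumptions stated above."""
    tf = Counter(doc)  # raw term counts
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in tf.items()
    }
    # L2-normalize so that the row has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row0 = tfidf_row(docs[0])
for term in sorted(row0):
    print(term, round(row0[term], 2))
```

The rounded weights match the first row of the unigram table: `the` occurs twice but appears in every document, so it ends at 0.49, each word unique to this document gets 0.42, and `cat` gets 0.25.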





## Modeling

For Natural Language Understanding (NLU) tasks, the traditional pipeline starts with preparation steps such as tokenization, stemming, noun phrase detection, chunking, and stop-word elimination.

Afterward, a document-term matrix is constructed with a weighting scheme, of which TF-IDF is the most popular.

Finally, the matrix serves as tabular input for Machine Learning (ML) pipelines that perform tasks such as sentiment analysis, document similarity, document clustering, or measuring the relevancy score between a query and a document.

Likewise, terms can be represented as a tabular matrix
and used as input for token classification problems such as named-entity recognition and semantic relation extraction.

The classification phase is a straightforward application of supervised ML algorithms such as Support Vector Machine (SVM), Random Forest, logistic regression, Naive Bayes, and ensemble learners (boosting or bagging).

In [10]:
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

In [11]:
# Toy labels, one per document in toy_corpus
labels = [0, 1, 0]

# Fit an SVM on the dense TF-IDF matrix held in df
clf = SVC()
clf.fit(df.to_numpy(), labels)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [12]:
# Predict on the training documents themselves (a sanity check, not an evaluation)
clf.predict(df.to_numpy())

array([0, 1, 0])
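
The `make_pipeline` import above is not used in the cells shown. As a sketch of how it could be applied, the vectorizer and classifier can be chained so that raw text goes in directly and the vocabulary learned during training is reused at prediction time. The query sentence is an illustrative assumption, and with only three training documents the prediction itself carries no meaning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

toy_corpus = [
    "the fat cat sat on the mat",
    "the big cat slept",
    "the dog chased a cat",
]
labels = [0, 1, 0]  # same toy labels as above

# Chain vectorization and classification into a single estimator.
text_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), SVC())
text_clf.fit(toy_corpus, labels)

# Unseen text is vectorized with the vocabulary learned during fit.
new_docs = ["the big dog slept"]  # hypothetical query sentence
predictions = text_clf.predict(new_docs)
print(predictions)
```

Wrapping both steps in one estimator avoids building the `DataFrame` by hand and keeps the feature columns consistent between training and prediction.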