<a href="https://colab.research.google.com/github/rahiakela/nlp-research-and-practice/blob/main/practical-natural-language-processing/3-text-representation/3_bag_of_n_gram.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bag of N-Grams

**All the representation schemes we’ve seen so far treat words as independent units. There is no notion of phrases or word ordering.** The bag-of-n-grams (BoN) approach tries to remedy this. It does so by breaking text into chunks of n contiguous words (or tokens). This can help us capture some context, which earlier approaches could not do. Each chunk is called an **n-gram**.

The corpus vocabulary, $V$, is then nothing but a collection of all unique n-grams across the text corpus. Then, each document in the corpus is represented by a vector of length $|V|$. This vector simply contains the frequency counts of n-grams present in the document and zero for the n-grams that are not present.


**Our toy corpus**

|  |  |
| --- | --- |
| D1 | Dog bites man. |
| D2 | Man bites dog. |
| D3 | Dog eats meat. |
| D4 | Man eats food. |

To elaborate, let’s consider our example corpus and construct a 2-gram (a.k.a. bigram) model for it. The set of all bigrams in the corpus is as follows: `{dog bites, bites man, man bites, bites dog, dog eats, eats meat, man eats, eats food}`. The BoN representation then consists of an eight-dimensional vector for each document, with one dimension per bigram in the order listed. The bigram representation for the first two documents is as follows: `D1 : [1,1,0,0,0,0,0,0], D2 : [0,0,1,1,0,0,0,0]`.

The other two documents follow similarly. Note that the BoW scheme is a special case of the BoN scheme, with n=1. n=2 is called a “bigram model,” and n=3 is called a “trigram model.” **Further, note that, by increasing the value of n, we can incorporate larger context; however, this further increases the sparsity.** In NLP parlance, the BoN scheme is also called “n-gram feature selection.”
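To make the growth in vocabulary concrete, here is a minimal sketch in plain Python that counts the unique n-grams in the toy corpus as the maximum n grows. The helper names `ngrams` and `vocabulary` are illustrative, and whitespace splitting is assumed as the tokenizer.

```python
# Toy corpus from the text, lowercased with the final period removed.
docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(docs, max_n):
    """Collect every unique n-gram for n = 1..max_n across the corpus."""
    vocab = set()
    for doc in docs:
        tokens = doc.split()
        for n in range(1, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return vocab


# Vocabulary size when using n-grams up to length 1, 2, and 3.
vocab_sizes = {max_n: len(vocabulary(docs, max_n)) for max_n in (1, 2, 3)}
for max_n, size in vocab_sizes.items():
    print(f"n-grams up to n={max_n}: vocabulary size {size}")
```

Each three-word document contains at most six n-grams (three unigrams, two bigrams, one trigram), so as the vocabulary grows from 6 to 14 to 18 entries, a larger share of every document vector is zero.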

Finally, this gives us the following matrix for **Bag-of-N-Grams** with n=2.

**Documents**

|  |  |
| --- | --- |
| D1 | Dog bites man. |
| D2 | Man bites dog. |
| D3 | Dog eats meat. |
| D4 | Man eats food. |

**Bag-of-bi-gram Matrix**

|   | dog bites | bites man | man bites | bites dog | dog eats | eats meat | man eats | eats food |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| D2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| D3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| D4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
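The bigram vectors described above can also be built by hand. The following is a minimal sketch in plain Python, assuming whitespace tokenization and a vocabulary ordered by first appearance (which matches the column order used in the text). The `bigrams` helper is illustrative and is not part of any library.

```python
# Toy corpus from the text, lowercased with the final period removed.
docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]


def bigrams(text):
    """Split on whitespace and return contiguous word pairs."""
    tokens = text.split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


# Build the vocabulary in order of first appearance across the corpus.
vocab = []
for doc in docs:
    for gram in bigrams(doc):
        if gram not in vocab:
            vocab.append(gram)

# One row per document: how often each vocabulary bigram occurs in it.
matrix = [[bigrams(doc).count(gram) for gram in vocab] for doc in docs]

print(vocab)
for name, row in zip(["D1", "D2", "D3", "D4"], matrix):
    print(name, row)
```

Note that D1 and D2 contain the same three words but share no bigram, so their bigram vectors do not overlap at all.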

In [1]:
documents = [
  "Dog bites man.",
  "Man bites dog.",
  "Dog eats meat.",
  "Man eats food."
]

processed_docs = [doc.lower().replace('.', '') for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

Now, let's do the main task of building the bag-of-n-grams representation. We will use `CountVectorizer` from scikit-learn on the same corpus, with unigram, bigram, and trigram features. This is done by setting `ngram_range=(1, 3)`.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Ngram vectorization example with count vectorizer and uni, bi, trigrams
count_vect = CountVectorizer(ngram_range=(1, 3))

In [4]:
# Build the bag-of-n-grams representation for the corpus (variable names keep the "bow" prefix)
bow_rep = count_vect.fit_transform(processed_docs)
print('Our vocabulary: ', count_vect.vocabulary_)

Our vocabulary:  {'dog': 3, 'bites': 0, 'man': 12, 'dog bites': 4, 'bites man': 2, 'dog bites man': 5, 'man bites': 13, 'bites dog': 1, 'man bites dog': 14, 'eats': 8, 'meat': 17, 'dog eats': 6, 'eats meat': 10, 'dog eats meat': 7, 'food': 11, 'man eats': 15, 'eats food': 9, 'man eats food': 16}


In [5]:
print("BoW representation for 'dog bites man': ", bow_rep[0].toarray())
print("BoW representation for 'man bites dog': ", bow_rep[1].toarray())

BoW representation for 'dog bites man':  [[1 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0]]
BoW representation for 'man bites dog':  [[1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0]]


In [6]:
bow_rep.toarray()

array([[1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0]])

Let's show the bag-of-n-grams vectors in a DataFrame.

In [7]:
import pandas as pd

# vocabulary_ maps each n-gram to its column index, so sort the n-grams by that
# index to make the column labels line up with the columns of the matrix
bow_cols = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)
bow_indexs = ['D1', 'D2', 'D3', 'D4']
pd.DataFrame(bow_rep.toarray(), columns=bow_cols, index=bow_indexs)

,bites,bites dog,bites man,dog,dog bites,dog bites man,dog eats,dog eats meat,eats,eats food,eats meat,food,man,man bites,man bites dog,man eats,man eats food,meat
D1,1,0,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0
D2,1,1,0,1,0,0,0,0,0,0,0,0,1,1,1,0,0,0
D3,0,0,0,1,0,0,1,1,1,0,1,0,0,0,0,0,0,1
D4,0,0,0,0,0,0,0,0,1,1,0,1,1,0,0,1,1,0


In [8]:
# Get the representation using this vocabulary, for a new text
temp = count_vect.transform(['dog and dog are friends'])
print("BoW representation for 'dog and dog are friends':", temp.toarray())

BoW representation for 'dog and dog are friends': [[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
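Only index 3 (`dog`) is nonzero because every other n-gram in the new text is missing from the fitted vocabulary. The sketch below reproduces that bookkeeping in plain Python. It assumes whitespace tokenization as a stand-in for the vectorizer's own tokenizer, and the `ngrams` helper is illustrative.

```python
from collections import Counter

# Toy corpus from the text, lowercased with the final period removed.
corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]


def ngrams(text, max_n=3):
    """Return all contiguous n-grams of the text for n = 1..max_n."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Vocabulary learned from the corpus (unigrams, bigrams, and trigrams).
vocab = {gram for doc in corpus for gram in ngrams(doc)}

# Split the n-grams of the new text into those that can be counted
# and those that are out of vocabulary and therefore ignored.
new_grams = ngrams("dog and dog are friends")
known = Counter(gram for gram in new_grams if gram in vocab)
unknown = [gram for gram in new_grams if gram not in vocab]

print("counted:", dict(known))
print("ignored:", unknown)
```

Of the twelve n-grams in the new sentence, only the two occurrences of `dog` are counted. Everything else is dropped, so the resulting vector carries very little information about the text.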


Here are the main pros and cons of BoN:

* It captures some context and word-order information in the form of n-grams.
* Thus, the resulting vector space is able to capture some semantic similarity. Documents that share n-grams have vectors that are closer to each other in Euclidean space than documents with completely different n-grams.
* As n increases, dimensionality (and therefore sparsity) increases rapidly.
* It still provides no way to address the out-of-vocabulary (OOV) problem: n-grams that were not seen when the vocabulary was built are ignored.
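The claim about Euclidean distance can be checked on the toy corpus. This is a minimal sketch in plain Python that rebuilds the 1-to-3-gram count vectors and compares every pair of documents. It assumes whitespace tokenization and an alphabetically sorted vocabulary, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations

# Toy corpus from the text, lowercased with the final period removed.
docs = {
    "D1": "dog bites man",
    "D2": "man bites dog",
    "D3": "dog eats meat",
    "D4": "man eats food",
}


def ngrams(tokens, max_n=3):
    """Return all contiguous n-grams of a token list for n = 1..max_n."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


doc_grams = {name: ngrams(text.split()) for name, text in docs.items()}

# Alphabetically sorted vocabulary of every n-gram in the corpus.
vocab = sorted({gram for grams in doc_grams.values() for gram in grams})

# Count vector for each document over that vocabulary.
vectors = {
    name: [grams.count(gram) for gram in vocab]
    for name, grams in doc_grams.items()
}

# Euclidean distance between every pair of documents.
distances = {
    (a, b): math.dist(vectors[a], vectors[b])
    for a, b in combinations(vectors, 2)
}
for (a, b), dist in distances.items():
    print(f"{a}-{b}: {dist:.3f}")
```

D1 and D2 share three unigrams and end up closest (distance √6 ≈ 2.449), while every other pair shares a single unigram and sits at √10 ≈ 3.162. In this corpus the shared n-grams happen to be unigrams only, since no bigram or trigram occurs in more than one document.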