<a href="https://colab.research.google.com/github/ucaokylong/NLP_learning/blob/main/03_BOW_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/1/lang-pic.jpg?raw=1' width=600>
</center>
    
# 1. Introduction

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">1.1 NLP series</p>

This is the **third in a series of notebooks** covering the **fundamentals of Natural Language Processing (NLP)**. I find that the best way to learn is by teaching others, hence why I am sharing my journey learning this field from scratch. I hope these notebooks can be helpful to you too.

NLP series:

1. [Tokenization](./01_Tokenization.ipynb)
2. [Preprocessing](./02_Pre_Processing.ipynb)
3. Bag of Words and Similarity
<a target="_blank" href="https://colab.research.google.com/github/JUSTSUJAY/NLP_One_Shot/blob/main/Notebooks/03_BOW_Similarity.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/Notebooks/03_BOW_Similarity.ipynb)

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">1.2 Outline</p>

We have now seen how to tokenize and pre-process text. But to be able to use machine learning, we need to **convert this text into numbers**. In this notebook, we'll see one way to do this via a **basic bag-of-words** and discuss some of its variants.

We'll then learn how to measure the **similarity between two documents** through one of the most popular approaches, namely **cosine similarity**.

# 2. Bag-of-Words

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.1 The idea</p>

After tokenization and pre-processing, we are left with **variable length** sequences of text, but the problem is machine learning algorithms require **fixed length** vectors of numbers.

The **simplest** approach to overcome this is by using a **bag-of-words**, which simply **counts how many times each word appears** in a document. It's called a **bag** because the **order of the words is ignored**: we only care about which words appear and how often, not where they appear.

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/3/basicbow.png?raw=1' width=600>
</center>
<br>

The linguistic reasoning behind this approach is that **similar documents share similar vocabularies**. For example, football articles will often use words like *score*, *pass*, *team* whereas weather reports will use a completely different set of words like *rain*, *sun*, *umbrella*.

We might want to **remove stop words** (common words that carry little meaning, like *the*, *of*, *how*). These appear in pretty much all documents, so removing them makes it easier to identify documents that are genuinely similar.
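As a minimal sketch of the counting idea, the snippet below builds a bag-of-words for two toy sentences with Python's `collections.Counter`. The sentences and the whitespace splitting are assumptions made for illustration; they are not part of the corpus used later in this notebook.

```python
from collections import Counter

# Two toy documents, assumed to be already lower-cased; we split on whitespace
doc_a = "the team will pass and pass until the team can score".split()
# Same words as doc_a, but in a different order
doc_b = "score can team the until pass and pass will team the".split()

# A Counter maps each word to the number of times it appears
bow_a = Counter(doc_a)
bow_b = Counter(doc_b)

print(bow_a["pass"])    # 2: "pass" appears twice in doc_a
print(bow_a["rain"])    # 0: words that never appear have a count of zero
print(bow_a == bow_b)   # True: word order does not change the bag
```

Both documents end up with the same bag, which is exactly the information loss that the name "bag" refers to.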

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">2.2 Binary Bag-of-Words</p>

For now, we'll focus on the **binary** version of a bag-of-words. This just indicates **whether a word appeared or not**, ignoring word order and word frequency.

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/3/binarybow.jpg?raw=1' width=600>
</center>

Each **row** in a **binary bag-of-words matrix** corresponds to a **single document** in the corpus. Each **column** corresponds to a **token** in the vocabulary. Note that the order of the tokens isn't important but it does need to be **fixed beforehand** when building the vocabulary.  

To **construct** the matrix, we place a 1 in entry (i,j) if and only if the j-th token appears in the i-th document and a 0 otherwise.

For a **general** bag-of-words, the (i,j) entry would instead be the **frequency** of the j-th token in the i-th document (but we will see there are better ways to encode frequency later).
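The construction rule can be sketched in plain Python. The toy corpus below is hypothetical and assumed to be tokenized already; the vocabulary order is fixed by sorting, which is one reasonable choice among many.

```python
# Toy corpus of pre-tokenized documents (hypothetical example data)
docs = [
    ["rain", "rain", "umbrella"],
    ["team", "score", "pass", "pass"],
    ["sun", "rain"],
]

# Fix the vocabulary order beforehand: here, sorted unique tokens
vocab = sorted({token for doc in docs for token in doc})
index = {token: j for j, token in enumerate(vocab)}

# One row per document, one column per token in the vocabulary
binary_bow = [[0] * len(vocab) for _ in docs]
count_bow = [[0] * len(vocab) for _ in docs]

for i, doc in enumerate(docs):
    for token in doc:
        j = index[token]
        binary_bow[i][j] = 1    # 1 if token j appears in document i
        count_bow[i][j] += 1    # frequency of token j in document i

print(vocab)
print(binary_bow)
print(count_bow)
```

The two matrices differ only where a token is repeated within a document, such as *rain* in the first document and *pass* in the second.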

# 3. Similarity

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">3.1 Vector Space Model</p>

We have gone from thinking of documents as a sequence of words to **points in a multi-dimensional vector space**. Importantly, the dimension of this space is **fixed**, i.e. each vector has the same length.

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/3/unitcube.png?raw=1' width=400>
</center>
<br>

This is very useful because it now allows us to **measure the distances** between these points among other things. Points (documents) that are close together will correspond to documents being **similar** in their vocabularies.

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">3.2 Cosine Similarity</p>

There are many **metrics** we could use to measure how 'close' two points are. For example, we could consider using the Euclidean distance, Manhattan distance or even Hamming distance. However, if documents in the same corpus have very different lengths, or the vocabulary is extremely large, these metrics become less reliable.

Instead, in the NLP domain it is much more common to use **Cosine Similarity**. This measures the **cosine of the angle** between any two points (more precisely their vectors starting from the origin). The **closer the score is to 1**, the smaller the angle between the vectors and the **more similar** the documents are.

$$
\Large
\cos(\theta) = \frac{a \cdot b}{\|a\| \|b\|}
$$

<br>

<br>
<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/3/cosine-similarity-vectors-original.jpg?raw=1' width=800>
</center>
<br>
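To make the effect of document length concrete, here is a small sketch comparing Euclidean distance with cosine similarity. The three count vectors are made up for illustration: `long_doc` has the same word proportions as `short_doc` but is ten times longer.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical count vectors over a 3-word vocabulary
short_doc = np.array([1, 2, 0])
long_doc = 10 * short_doc          # same proportions, ten times longer
other_doc = np.array([0, 1, 2])    # different mix of words

# Euclidean distance says short_doc is closer to other_doc than to long_doc
print(np.linalg.norm(short_doc - long_doc))
print(np.linalg.norm(short_doc - other_doc))

# Cosine similarity ignores length and only compares direction
print(cosine_sim(short_doc, long_doc))    # 1.0 up to floating-point rounding
print(cosine_sim(short_doc, other_doc))   # 0.4 up to floating-point rounding
```

Under Euclidean distance the long document looks far away even though it uses the same words in the same proportions, whereas cosine similarity treats the two as identical in direction.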

Note that the **threshold** used to decide whether two documents are similar will **change depending on the application** and it can be anywhere between 0 and 1. It will be sensitive to how we pre-process our text. Lemmatization and stop word removal can help reduce the size of the vocabulary making it easier to identify similar documents.

# 4. Encoding context

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">4.1 Drawbacks to Bag-of-Words</p>

Whilst a bag-of-words is a great tool for **simple** NLP applications, it does have a number of drawbacks that we need to be aware of.

* There is no way to handle **Out-of-Vocabulary** (OOV) words. If a new word appears in a later document, it will just be dropped.
* It creates **sparse matrices** (mostly zeros) which can be inefficient, although we can overcome this by storing only the non-zero entries, for example in a dictionary or sparse matrix representation.
* It isn't able to capture similarity between **synonyms**.
* Word order is lost so words have **no relationship** to each other. For example, "man eats bread" is very different to "bread eats man" but they would have the same representations.


## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">4.2 n-grams</p>

One way to get around the problem of losing word order information is to use **n-grams**. This is when we group **chunks of n tokens** together to behave as if they were a single token.

A 2-gram (aka **bigram**) would have 2 tokens per chunk, a 3-gram (aka **trigram**) would have 3 tokens per chunk, etc.

<center>
<img src='https://github.com/JUSTSUJAY/NLP_One_Shot/blob/main/assets/3/8ARA1.png?raw=1' width=600>
</center>
<br>

This helps us capture **some context** that using single tokens wouldn't. The **vocabulary** then becomes the **collection of n-grams** produced. Depending on the application, you might want to use unigrams and bigrams together or just bigrams. You could even filter out bigrams that aren't useful for your application (e.g. only keep highly frequent or noun-noun bigrams).

Measuring **similarity** is exactly the **same as before**. However, using n-grams can **significantly increase the size of the vocabulary** making computations slower. There is therefore a tradeoff between contextual information added and increased computational time for modelling.
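
As a rough sketch of how n-grams are formed, the helper below slides a window of size n over a list of tokens. The function name `ngrams` is a hypothetical helper written for this example, not a library function.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list, each joined into one string."""
    # A window of size n can start at positions 0 .. len(tokens) - n
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["inflation", "surge", "world"]

print(ngrams(tokens, 1))   # ['inflation', 'surge', 'world']
print(ngrams(tokens, 2))   # ['inflation surge', 'surge world']
print(ngrams(tokens, 3))   # ['inflation surge world']
print(ngrams(tokens, 4))   # []: the document is shorter than the window
```

A document with k tokens yields k - n + 1 n-grams (none if k is smaller than n), which gives a feel for how quickly the vocabulary can grow when several values of n are combined.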



# 5. Application

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">5.1 Bag-of-Words using sklearn</p>

Import the **libraries**.

In [None]:
import spacy
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Define the **corpus**. Here we use some of the top news stories from 2022.

In [None]:
# A corpus containing a collection of sentences
corpus = [
    "Inflation surges around the world.",
    "The Omicron coronavirus variant spreads.",
    "World population exceeds 8 billion.",
    "AI predicts protein structures."
]

We will use **sklearn's CountVectorizer** to create a bag-of-words matrix.

In [None]:
# Initialize vectorizer
vectorizer = CountVectorizer()

The `.fit_transform` method learns a **vocabulary** from the corpus and returns the **bag-of-words matrix**.

In [None]:
# Fit vectorizer to corpus
bow = vectorizer.fit_transform(corpus)

We can see the **vocabulary dictionary** mapping (token to column index) using the `.vocabulary_` attribute. We could also use the `.get_feature_names_out()` method to see just the words, ordered by column index.

In [None]:
# View vocabulary
vectorizer.vocabulary_

{'inflation': 5,
 'surges': 12,
 'around': 1,
 'the': 13,
 'world': 15,
 'omicron': 6,
 'coronavirus': 3,
 'variant': 14,
 'spreads': 10,
 'population': 7,
 'exceeds': 4,
 'billion': 2,
 'ai': 0,
 'predicts': 8,
 'protein': 9,
 'structures': 11}

The vectorizer **output** is a **compressed sparse row matrix**, which is done to **improve memory efficiency**.

In [None]:
bow

<4x16 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [None]:
print(bow)

  (0, 5)	1
  (0, 12)	1
  (0, 1)	1
  (0, 13)	1
  (0, 15)	1
  (1, 13)	1
  (1, 6)	1
  (1, 3)	1
  (1, 14)	1
  (1, 10)	1
  (2, 15)	1
  (2, 7)	1
  (2, 4)	1
  (2, 2)	1
  (3, 0)	1
  (3, 8)	1
  (3, 9)	1
  (3, 11)	1


To convert the sparse matrix into a **dense matrix**, we call the `.toarray()` method.

In [None]:
# Dense matrix representation
bow.toarray()

array([[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
       [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0]])

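Notice how sklearn lower-cased and **tokenized the corpus for us**. Its default token pattern also strips punctuation and keeps only tokens of two or more alphanumeric characters, which is why *8* is missing from the vocabulary. Next we will do the same using our own **custom tokenizer**, which will give us **more control** over how the text is pre-processed.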

## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">5.2 Custom tokenizer using spacy</p>

To do this, we need to define our custom tokenizer as a **function** that given a document, **returns a list of tokens**.

In [None]:
# Load english language model
nlp = spacy.load('en_core_web_sm')

# Define custom tokenizer (remove stop words and punctuation and apply lemmatization)
def custom_tokenizer(doc):
    return [t.lemma_ for t in nlp(doc) if (not t.is_punct) and (not t.is_stop)]

The tokenizer is then passed as a **callback function** to the `CountVectorizer` via its `tokenizer` parameter. We also set `binary=True` to produce a **binary** bag-of-words.

In [None]:
# Pass tokenizer as callback function to countvectorizer
vectorizer = CountVectorizer(tokenizer=custom_tokenizer, binary=True)

# Fit vectorizer to corpus
bow = vectorizer.fit_transform(corpus)

We can view the resulting **vocabulary** and matrix the same way as before.

In [None]:
# Vocabulary
vectorizer.vocabulary_

{'inflation': 5,
 'surge': 12,
 'world': 14,
 'omicron': 6,
 'coronavirus': 3,
 'variant': 13,
 'spread': 10,
 'population': 7,
 'exceed': 4,
 '8': 0,
 'billion': 2,
 'ai': 1,
 'predict': 8,
 'protein': 9,
 'structure': 11}

In [None]:
# Dense matrix representation
bow.toarray()

array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
       [1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0]])

The sparse matrix can be **sliced and indexed** like a normal array.

In [None]:
# Sparse slice
print(bow[:,0:4])

  (1, 3)	1
  (2, 0)	1
  (2, 2)	1
  (3, 1)	1


## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">5.3 Document Similarity</p>

Here we will measure the **cosine similarity** between the documents in our corpus.

In [None]:
# Cosine similarity using numpy
def cosine_sim(a,b):
    return np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b))

In [None]:
# Similarity between two documents
print(corpus[1])
print(corpus[3])
print(f'Similarity score: {cosine_sim(bow[1].toarray().squeeze(),bow[3].toarray().squeeze()):.3f}')

The Omicron coronavirus variant spreads.
AI predicts protein structures.
Similarity score: 0.000


In [None]:
# Similarity between two documents
print(corpus[0])
print(corpus[2])
print(f'Similarity score: {cosine_sim(bow[0].toarray().squeeze(),bow[2].toarray().squeeze()):.3f}')

Inflation surges around the world.
World population exceeds 8 billion.
Similarity score: 0.258


We can also use sklearn's `cosine_similarity`. This calculates all the **pairwise similarities** and returns the result in a matrix indexed by the documents.

In [None]:
# cosine_similarity takes either array-likes or sparse matrices
print(cosine_similarity(bow))

[[1.         0.         0.25819889 0.        ]
 [0.         1.         0.         0.        ]
 [0.25819889 0.         1.         0.        ]
 [0.         0.         0.         1.        ]]
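
A similarity matrix like this can be used to pick out the most similar pair of documents. The sketch below hard-codes the binary bag-of-words matrix printed in section 5.2 so that it runs on its own, and masks the diagonal because every document is trivially identical to itself.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Binary bag-of-words matrix copied from the output in section 5.2
bow = np.array([
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
])

sims = cosine_similarity(bow)

# Ignore self-similarity by overwriting the diagonal with a low value
np.fill_diagonal(sims, -1.0)

# Position of the largest remaining entry (first occurrence if tied)
i, j = np.unravel_index(np.argmax(sims), sims.shape)
print(f'Most similar pair: documents {i} and {j} (score {sims[i, j]:.3f})')
```

Here the only pair with a shared token is documents 0 and 2, which both contain *world*.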


## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">5.4 n-grams</p>

Finally, we will build a bag-of-words matrix using **n-grams**. To do this, we can pass the `ngram_range` parameter to `CountVectorizer`. It takes in a tuple, with the **first entry** indicating the **minimum** chunk size and the **second entry** indicating the **maximum** chunk size.

In [None]:
# Unigrams and bigrams with ngram_range=(1,2)
# lowercase=False skips sklearn's lower-casing step, so casing is left to the spaCy lemmatizer
vectorizer = CountVectorizer(tokenizer=custom_tokenizer, lowercase=False, binary=True, ngram_range=(1,2))

# Fit vectorizer to corpus
unibigrams = vectorizer.fit_transform(corpus)

# Print vocabulary size
print(f'Size of vocabulary: {len(vectorizer.get_feature_names_out())}')

# Print vocabulary
print(vectorizer.vocabulary_)

Size of vocabulary: 27
{'inflation': 11, 'surge': 21, 'world': 25, 'inflation surge': 12, 'surge world': 22, 'Omicron': 4, 'coronavirus': 7, 'variant': 23, 'spread': 19, 'Omicron coronavirus': 5, 'coronavirus variant': 8, 'variant spread': 24, 'population': 13, 'exceed': 9, '8': 0, 'billion': 6, 'world population': 26, 'population exceed': 14, 'exceed 8': 10, '8 billion': 1, 'AI': 2, 'predict': 15, 'protein': 17, 'structure': 20, 'AI predict': 3, 'predict protein': 16, 'protein structure': 18}


In [None]:
# Only bigrams with ngram_range=(2,2)
vectorizer = CountVectorizer(tokenizer=custom_tokenizer, lowercase=False, binary=True, ngram_range=(2,2))

# Fit vectorizer to corpus
bigrams = vectorizer.fit_transform(corpus)

# Print vocabulary size
print(f'Size of vocabulary: {len(vectorizer.get_feature_names_out())}')

# Print vocabulary
print(vectorizer.vocabulary_)

Size of vocabulary: 12
{'inflation surge': 5, 'surge world': 9, 'Omicron coronavirus': 2, 'coronavirus variant': 3, 'variant spread': 10, 'world population': 11, 'population exceed': 6, 'exceed 8': 4, '8 billion': 0, 'AI predict': 1, 'predict protein': 7, 'protein structure': 8}
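
To see what bigrams buy us, the sketch below revisits the "man eats bread" example from section 4.1 and compares cosine similarity under different `ngram_range` settings. It uses sklearn's default tokenizer rather than the spaCy one above so that it runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["man eats bread", "bread eats man"]

scores = {}
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    bow = CountVectorizer(ngram_range=ngram_range).fit_transform(docs)
    # Entry [0, 1] is the similarity between the two documents
    scores[ngram_range] = cosine_similarity(bow)[0, 1]
    print(f'ngram_range={ngram_range}: {scores[ngram_range]:.3f}')
```

With unigrams only, the two sentences are indistinguishable (score 1). With bigrams only, they share no chunks at all (score 0), and mixing unigrams and bigrams lands in between.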


# 6. Conclusion

In this notebook, we saw how to convert **sequences of text** to **vectors of numbers** by using a basic **bag-of-words**, a process sometimes called **vectorization**.  Whilst this can be used for simple applications, it has a number of **limitations** including **losing word order information**. This can be partially resolved by using **n-grams** but it comes at the cost of **increasing** the size of our vocabulary.

We also investigated how to measure document similarity using the popular **cosine similarity** metric. We will build on these ideas in future notebooks.

**References:**
* [NLP demystified](https://www.nlpdemystified.org/)

### Coming up
#### [4. TF-IDF and Document Search](./04_TFIDF_DocSearch.ipynb)

Thanks for reading!