<a href="https://colab.research.google.com/github/rahiakela/transformers-for-natural-language-processing/blob/main/8-matching-tokenizers-and-datasets/1_word2vec_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Word2Vec tokenization

As long as things go well, nobody thinks about pretrained tokenizers. It's like in real life. We can drive a car for years without thinking about the engine. Then, one day our car breaks down, and we try to find the reasons to explain the situation.

The same happens with pretrained tokenizers. Sometimes the results are not what
we expect. Some word pairs just don't fit together, as we can see.

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/tokenizers-miscalculated.png?raw=1' width='800'/>

QC refers to Quality Control. In any strategic corporate project, QC is mandatory. The quality of the output will determine the survival of a critical project. If the project is not strategic, errors will sometimes be acceptable. In a strategic project, even a few errors imply a risk management audit's intervention to see if the project should be continued or abandoned.

From the perspectives of quality control and risk management, tokenizing datasets that are irrelevant (too many useless words or critical words missing) will confuse the embedding algorithms and produce "poor results." 

In a strategic AI project, "poor results" can be a single error with a dramatic
consequence (especially in medicine, airplane or rocket assembly, or other critical domains).

## Setup

In [None]:
!pip install gensim==3.8.3

In [32]:
import math
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

import warnings 
warnings.filterwarnings(action = 'ignore')

In [None]:
import nltk
nltk.download('punkt')

In [None]:
!wget https://raw.githubusercontent.com/rahiakela/transformers-for-natural-language-processing/main/8-matching-tokenizers-and-datasets/text.txt

## Train word2vec model

`text.txt`, our dataset, contains the American Declaration of Independence, the Bill of Rights, the Magna Carta, the works of Immanuel Kant, and other texts.

We will now tokenize text.txt and train a word2vec model:

In [33]:
# read the whole file and close it when done
with open("text.txt", "r") as sample:
  s = sample.read()

# replace newlines with spaces so that sentences are not broken across lines
f = s.replace("\n", " ")

data = []
# sentence parsing
for i in sent_tokenize(f):
  temp = []
  # tokenize the sentence into words
  for j in word_tokenize(i):
    temp.append(j.lower())
  data.append(temp)

`window=5` is an interesting parameter. It limits the distance between the current word and the predicted word in an input sentence. `sg=1` means a skip-gram training algorithm is used.
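
To make the effect of the window concrete, here is a minimal sketch of how (current word, predicted word) pairs could be enumerated for a skip-gram model. The helper `skipgram_pairs` and the sample sentence are illustrative assumptions, not part of gensim, and the sketch ignores the sampling details a real implementation applies.

```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs whose positions are at most `window` apart."""
    pairs = []
    for i, center in enumerate(tokens):
        # clip the window at the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical tokenized sentence, used only for illustration
sentence = ["we", "hold", "these", "truths", "to", "be", "self-evident"]
pairs = skipgram_pairs(sentence, window=2)
print(pairs[:4])
```

With `window=2`, "we" is paired with "hold" and "these" but never with "truths", which is three positions away.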

In [34]:
# Creating Skip Gram model
model = gensim.models.Word2Vec(data, min_count=1, size=512, window=5, sg=1)
print(model)

Word2Vec(vocab=11822, size=512, alpha=0.025)


We have a word representation model with embeddings and can create a cosine
similarity function named `similarity(word1, word2)`. We will send `word1` and `word2`
to the function, which will return a cosine similarity value between -1 and 1.

In [41]:
def similarity(word1, word2):
  cosine = True
  try:
    a = model.wv[word1]
  except KeyError:
    cosine = False
    print(word1, "word 1:[unk] key not found in dictionary")

  try:
    b = model.wv[word2]
  except KeyError:
    cosine = False   # both a and b must be found
    print(word2, "word 2:[unk] key not found in dictionary")

  # Cosine similarity is only calculated if cosine == True, which means that both word1 and word2 are known
  if cosine:
    # manual computation, kept as a cross-check of the library value
    dot = np.dot(a, b)
    norma = np.linalg.norm(a)
    normb = np.linalg.norm(b)
    cos = dot / (norma * normb)

    # scikit-learn expects 2D arrays of shape (n_samples, n_features)
    aa = a.reshape(1, -1)
    ba = b.reshape(1, -1)

    cos_lib = cosine_similarity(aa, ba)
  else:
    # at least one word is unknown, so no similarity can be computed
    cos_lib = 0

  return cos_lib
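
As a sanity check on the formula, the sketch below applies the same dot-product-over-norms computation to toy vectors. The helper name `cosine` and the vectors are assumptions chosen for illustration; they are not taken from the trained model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])

same = cosine(a, 2 * a)                                      # same direction
orth = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))    # orthogonal vectors
opp = cosine(a, -a)                                          # opposite direction
print(same, orth, opp)
```

The three values should come out close to 1, 0, and -1, which shows that cosine similarity depends on direction rather than vector length and can be negative.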

## Case 0: Words in the dataset and the dictionary

The words "freedom" and "liberty" are in the dataset and their cosine similarity can be computed:

In [42]:
word1 = "freedom"
word2 = "liberty"

print("Similarity ", similarity(word1, word2))

Similarity  [[0.38462296]]


## Case 1: Words not in the dataset or the dictionary

Let's now see what happens when a word is missing.

A missing word means trouble in many ways. In this case, we send "corporations"
and "rights" to the similarity function:

In [None]:
word1 = "corporations"
word2 = "rights"

print("Similarity ", similarity(word1, word2))

The missing word will provoke a chain of events and problems that will distort the transformer model's output if the word was important. We will refer to the missing word as unk.
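
Before computing similarities, it can help to list which words of interest are missing. The following is a minimal sketch with a hypothetical helper, `find_unknown`, and a toy vocabulary; with the model trained above, the keys of `model.wv.vocab` (gensim 3.x) should be usable in place of `toy_vocab`.

```python
def find_unknown(words, vocab):
    """Return the words that are absent from `vocab`, preserving their order."""
    return [w for w in words if w not in vocab]

# Toy vocabulary standing in for the trained model's vocabulary (hypothetical contents)
toy_vocab = {"freedom", "liberty", "rights", "declaration"}

candidates = ["corporations", "rights", "freedom"]
unknown = find_unknown(candidates, toy_vocab)
print(unknown)  # ['corporations']
```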

## Case 2: Noisy relationships

In this case, the dataset contained the words "etext" and "declaration":

In [45]:
word1 = "etext"
word2 = "declaration"

print("Similarity ", similarity(word1, word2))

Similarity  [[0.55050147]]


At a trivial or social media level, everything looks good.

However, at a professional level, the result is disastrous!

"declaration" is a meaningful word related to the actual content of the Declaration of Independence.

"etext" is part of a preface Project Gutenberg adds to all of its ebooks.

This might produce erroneous natural language inferences such as "etext is a
declaration" when the transformer is asked to generate text.
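
One way to limit this kind of noise is to filter known boilerplate tokens out of the tokenized sentences before training. The sketch below assumes a hand-built noise list; the helper `drop_noise`, the toy sentences, and the choice of "etext" as the only noise word are illustrative assumptions.

```python
def drop_noise(sentences, noise_words):
    """Remove tokens listed in `noise_words`; drop sentences that end up empty."""
    cleaned = []
    for sentence in sentences:
        kept = [tok for tok in sentence if tok not in noise_words]
        if kept:
            cleaned.append(kept)
    return cleaned

# Toy tokenized sentences in the same list-of-lists shape as `data` (hypothetical contents)
toy_data = [
    ["this", "etext", "is", "free"],
    ["the", "declaration", "of", "independence"],
    ["etext"],
]
noise = {"etext"}

cleaned = drop_noise(toy_data, noise)
print(cleaned)
```

A token-level filter like this is crude: removing whole boilerplate passages before tokenization would likely be cleaner, but that depends on how the preface is marked in each source file.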

## Case 3: Rare words

Rare words produce devastating effects on the output of transformers for specific tasks that go beyond trivial applications.

Managing rare words extends to many domains of natural language.

For example, in this case, we are using the word "justiciar":

In [46]:
word1 = "justiciar"
word2 = "judgement"

print("Similarity ", similarity(word1, word2))

Similarity  [[0.23178002]]


One might think that the word "justiciar" is far-fetched. The tokenizer extracted it from the Magna Carta, which dates back to the early 13th century.

If we implement a transformer model in a law firm to summarize documents or
perform other tasks, we must be careful!
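
A simple way to spot candidates such as "justiciar" is to count token frequencies and list the words that appear only a few times. This is a minimal sketch; the helper `rare_words`, the `max_count` cutoff, and the toy sentences are assumptions for illustration.

```python
from collections import Counter

def rare_words(sentences, max_count=1):
    """Return, in alphabetical order, the words occurring at most `max_count` times."""
    counts = Counter(tok for sentence in sentences for tok in sentence)
    return sorted(w for w, c in counts.items() if c <= max_count)

# Toy tokenized sentences in the same list-of-lists shape as `data` (hypothetical contents)
toy_sentences = [
    ["the", "justiciar", "shall", "judge"],
    ["the", "judge", "shall", "decide"],
]

rare = rare_words(toy_sentences, max_count=1)
print(rare)  # ['decide', 'justiciar']
```

Frequency alone does not say whether a rare word matters; the list is only a starting point for a human review.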

## Case 4: Replacing rare words

Replacing rare words represents a project in itself. The work this takes is reserved for specific tasks and projects. If a corporate budget can cover the cost of having a knowledge base in aeronautics, for example, it's worth spending the necessary time querying the tokenized directory to find words it missed.

Problems can be grouped by topic and solved, and the knowledge base can be updated
regularly.

In case 3, we stumbled on the word "justiciar." If we go back to its origin, we can see that it comes from the Norman French language and is the root of the French Latin-like word "judiciaire."

We could replace the word "justiciar" with "judge," which conveys the same metaconcept:

In [47]:
word1 = "judge"
word2 = "judgement"

print("Similarity ", similarity(word1, word2))

Similarity  [[0.19672832]]


We could also keep the word "justiciar" but try the modern meaning of the word
and compare it to "judge."

In [48]:
word1 = "justiciar"
word2 = "judge"

print("Similarity ", similarity(word1, word2))

Similarity  [[0.35883075]]


If we are managing a critical legal project, we could have the essential documents
that contain rare words of any kind translated into standard English. The
transformer's performance on NLP tasks would improve, and the knowledge base of
the corporation would progressively grow.
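
The replacement step itself can be sketched as a lookup table applied to the tokenized sentences before training. The helper `replace_rare` and the single-entry mapping are assumptions for illustration; in practice the mapping would come from the curated knowledge base.

```python
def replace_rare(sentences, replacements):
    """Swap each token found in `replacements` for its standard-English equivalent."""
    return [[replacements.get(tok, tok) for tok in sentence] for sentence in sentences]

# Hypothetical mapping from rare words to modern equivalents
mapping = {"justiciar": "judge"}

# Toy tokenized sentence in the same list-of-lists shape as `data`
toy_sentences = [["the", "justiciar", "shall", "decide"]]

normalized = replace_rare(toy_sentences, mapping)
print(normalized)  # [['the', 'judge', 'shall', 'decide']]
```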

## Case 5: Entailment

In this case, we are interested in words in the dictionary and test them in a fixed order.

For example, let's see if "pay" + "debt" makes sense in our similarity function:

In [49]:
word1 = "pay"
word2 = "debt"

print("Similarity ", similarity(word1, word2))

Similarity  [[0.5159261]]


We could test the dataset with several word pairs and check whether they mean
something.

Suppose, for example, that the word pairs come from a company email. If the cosine similarity is above 0.9, then the email could be stripped
of useless information and the content added to the knowledge base dataset of the company.
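
The threshold test can be sketched as a filter over word pairs. The helper `pairs_above` and the lookup-based `recorded_similarity` are assumptions for illustration; the scores are the values printed earlier in this notebook, rounded to four decimals. A real run would call the `similarity` function instead and convert its result to a scalar.

```python
def pairs_above(pairs, sim_fn, threshold=0.9):
    """Keep the (word1, word2) pairs whose similarity exceeds `threshold`."""
    return [(w1, w2) for w1, w2 in pairs if sim_fn(w1, w2) > threshold]

# Scores copied from the outputs shown in this notebook (rounded)
recorded = {
    ("pay", "debt"): 0.5159,
    ("etext", "declaration"): 0.5505,
    ("freedom", "liberty"): 0.3846,
}

def recorded_similarity(w1, w2):
    """Stand-in for the model: look up a recorded score, 0.0 if the pair is unknown."""
    return recorded.get((w1, w2), 0.0)

word_pairs = list(recorded)
strong = pairs_above(word_pairs, recorded_similarity, threshold=0.9)
moderate = pairs_above(word_pairs, recorded_similarity, threshold=0.5)
print(strong)
print(moderate)
```

None of the pairs tested in this notebook reaches 0.9, so a cutoff that high would reject all of them; the noisy "etext"/"declaration" pair also scores higher than "pay"/"debt", which is a reminder that a threshold does not replace cleaning the dataset.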