<a href="https://colab.research.google.com/github/rahiakela/transformers-for-natural-language-processing/blob/main/8-matching-tokenizers-and-datasets/1_word2vec_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Word2Vec tokenization

As long as things go well, nobody thinks about pretrained tokenizers. It's like in real life. We can drive a car for years without thinking about the engine. Then, one day our car breaks down, and we try to find the reasons to explain the situation.

The same happens with pretrained tokenizers. Sometimes the results are not what
we expect. Some word pairs just don't fit together, as the figure below shows.

<img src='https://github.com/rahiakela/img-repo/blob/master/transformers-for-natural-language-processing/tokenizers-miscalculated.png?raw=1' width='800'/>

QC refers to Quality Control. In any strategic corporate project, QC is mandatory. The quality of the output will determine the survival of a critical project. If the project is not strategic, errors will sometimes be acceptable. In a strategic project, even a few errors imply a risk management audit's intervention to see if the project should be continued or abandoned.

From the perspectives of quality control and risk management, tokenizing datasets that are irrelevant (too many useless words or critical words missing) will confuse the embedding algorithms and produce "poor results." 

In a strategic AI project, "poor results" can be a single error with a dramatic
consequence (especially in medicine, airplane or rocket assembly, or other critical domains).

## Setup

In [None]:
!pip install gensim==3.8.3

In [2]:
import math
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

import warnings 
warnings.filterwarnings(action = 'ignore')

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
!wget https://raw.githubusercontent.com/rahiakela/transformers-for-natural-language-processing/main/8-matching-tokenizers-and-datasets/text.txt

## Train word2vec model

`text.txt`, our dataset, contains the American Declaration of Independence, the Bill of Rights, the Magna Carta, the works of Immanuel Kant, and other texts.

We will now tokenize `text.txt` and train a word2vec model:

In [5]:
# read the whole dataset; the with-block closes the file afterwards
with open("text.txt", "r") as sample:
  s = sample.read()

# processing escape characters
f = s.replace("\n", " ")

data = []
# sentence parsing
for i in sent_tokenize(f):
  temp = []
  # tokenize the sentence into words
  for j in word_tokenize(i):
    temp.append(j.lower())
  data.append(temp)

`window=5` is an interesting parameter. It limits the maximum distance between the current word and the predicted word within an input sentence. `sg=1` means the skip-gram training algorithm is used.
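To make the role of the window concrete, here is a minimal sketch of how skip-gram training pairs could be enumerated from one tokenized sentence. The helper name `skipgram_pairs` and the toy sentence are assumptions made for illustration; this is not gensim's code, and it ignores implementation details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """Return (current word, context word) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        # clip the window at the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# hypothetical toy sentence, already lowercased and tokenized
tokens = ["we", "hold", "these", "truths"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

With `window=1`, each word is paired only with its immediate neighbors. A larger window produces more pairs per word and lets more distant words influence each embedding.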

In [6]:
# Creating Skip Gram model
model = gensim.models.Word2Vec(data, min_count=1, size=512, window=5, sg=1)
print(model)

Word2Vec(vocab=11822, size=512, alpha=0.025)


We now have a word embedding model and can create a cosine similarity function
named `similarity(word1, word2)`. We will send `word1` and `word2` to the
function, which will return their cosine similarity. In principle, cosine
similarity ranges from -1 to 1; the function returns 0 when either word is
missing from the vocabulary.

In [7]:
def similarity(word1, word2):
  cosine = True   # stays True only if both words are in the vocabulary
  try:
    a = model.wv[word1]
  except KeyError:
    cosine = False
    print(word1, ":[unk] key not found in dictionary")

  try:
    b = model.wv[word2]
  except KeyError:
    cosine = False
    print(word2, ":[unk] key not found in dictionary")

  # Cosine similarity is only calculated if both word1 and word2 are known
  if cosine:
    # manual cosine similarity, kept for comparison with the library value
    dot = np.dot(a, b)
    norma = np.linalg.norm(a)
    normb = np.linalg.norm(b)
    cos = dot / (norma * normb)

    # scikit-learn expects 2D arrays of shape (n_samples, n_features)
    aa = a.reshape(1, 512)
    ba = b.reshape(1, 512)
    cos_lib = cosine_similarity(aa, ba)
  else:
    cos_lib = 0

  return cos_lib
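
As a sanity check on the formula itself, the following sketch computes cosine similarity for three pairs of hand-picked toy vectors, independently of the trained model. The helper name `cosine` and the vectors are assumptions chosen for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine([1.0, 2.0], [2.0, 4.0])         # same direction, different length
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])   # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])   # opposite direction
print(same, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, perpendicular vectors score 0, and opposite vectors score close to -1. Vector length does not affect the result, only direction does.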

## Case 0: Words in the dataset and the dictionary