<a href="https://colab.research.google.com/github/rahiakela/advanced-natural-language-processing-with-tensorflow-2/blob/main/1-essentials-of-nlp/2_text_vectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Vectorization

To understand how to process text, it is important to understand the general
workflow for NLP.

<img src='https://github.com/rahiakela/img-repo/blob/master/advanced-nlp-with-tensorflow-2/text-processing-workflow.png?raw=1' width='800'/>

The first two steps of the process in the preceding diagram involve collecting labeled data. A supervised model or even a semi-supervised model needs data to operate.

The next step is usually normalizing and featurizing the data. Models have a hard time processing text data as is. There is a lot of hidden structure in a given text that needs to be processed and exposed. These two steps focus on that. 

There are a couple of challenges in using the text content of messages. **The first is that text can be of arbitrary length.**

Comparing this to image data, we know that each image has a fixed width and height. Even if the corpus of images has a mixture of sizes, images
can be resized to a common size with minimal loss of information by using a variety of compression mechanisms.

In NLP, this is a bigger problem compared to computer vision. A common approach to handle this is to truncate the text.
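As a minimal sketch of the truncation idea, the snippet below forces every tokenized message to the same length. The helper name `pad_or_truncate`, the `max_len` value, and the `<pad>` placeholder are illustrative assumptions and not part of this notebook's pipeline; padding short messages is added here only so that all outputs have equal length.

```python
def pad_or_truncate(tokens, max_len, pad_token="<pad>"):
    """Return exactly max_len tokens: cut long inputs, pad short ones."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))

short_msg = "Ok lar".split()
long_msg = "Go until jurong point crazy Available only in bugis".split()

print(pad_or_truncate(short_msg, 5))
print(pad_or_truncate(long_msg, 5))
```

Truncation discards everything after the cutoff, so the choice of `max_len` is a trade-off between fixed-size inputs and lost content.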

**The second issue is that of the representation of words with a numerical quantity or feature.**

In computer vision, the smallest unit is a pixel. Each pixel has a set of
numerical values indicating color or intensity. 

In a text, the smallest unit could be a word. Aggregating the Unicode values of the characters does not convey or embody the meaning of the word.
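A quick way to see why aggregated character codes carry no meaning is to compare two anagrams. The helper `codepoint_sum` below is a hypothetical name used only for this illustration.

```python
def codepoint_sum(word):
    """Add up the Unicode code points of the characters in a word."""
    return sum(ord(ch) for ch in word)

# Two unrelated words built from the same letters get the same number
print(codepoint_sum("listen"), codepoint_sum("silent"))
```

Both words map to the same value even though their meanings differ, so this kind of aggregate cannot serve as a useful feature.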

**A core problem then is to construct a numerical representation of words. Vectorization is the process of converting a word to a vector of numbers that embodies the information contained in the word. Depending on the vectorization technique, this vector may have additional properties that may allow comparison with other words.**

The main text vectorization approaches are the following:

- **Count-based vectorization**: The simplest approach for vectorizing is to use counts of words.
- **TF-IDF based text vectorization**: This is more sophisticated, with its origins in information retrieval.
- **Word2Vec based text vectorization**: This generates embeddings, or word vectors.
- **BERT based text vectorization**: The newest method in this area.


## Setup

In [1]:
import tensorflow as tf

import os
import io
import re

import pandas as pd 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

from keras.utils import np_utils

tf.__version__

'2.4.1'

In [2]:
# Basic 1-layer neural network model for evaluation
def make_model(input_dims=3, num_units=12):
  model = tf.keras.Sequential()

  # Adds a densely-connected layer with 12 units to the model:
  model.add(tf.keras.layers.Dense(num_units, 
                                  input_dim=input_dims, 
                                  activation='relu'))

  # Add a sigmoid layer with a binary output unit:
  model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  
  return model

## Data collection

**The first step of any Machine Learning (ML) project is to obtain a dataset.**

We will be using the SMS Spam Collection dataset made available by the University of California, Irvine.

In [3]:
# Download the zip file
path_to_zip = tf.keras.utils.get_file("smsspamcollection.zip",
                  origin="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",
                  extract=True)

# Unzip the file into a folder
!unzip $path_to_zip -d data

Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Archive:  /root/.keras/datasets/smsspamcollection.zip
  inflating: data/SMSSpamCollection  
  inflating: data/readme             


In [None]:
# optional step - helps if colab gets disconnected
# from google.colab import drive
# drive.mount('/content/drive')

Reading the data file is trivial.

In [4]:
# Let's see if we read the data correctly
# lines = io.open('/content/drive/My Drive/colab-data/SMSSpamCollection').read().strip().split('\n')
lines = io.open('/content/data/SMSSpamCollection').read().strip().split('\n')
lines[0]

'ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [5]:
lines[2]

"spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

### Pre-process Data

The next step is to split each line into two columns – one with the text of the message and the other as the label. While we are separating these labels, we will also convert the labels to numeric values. Since we are interested in predicting spam messages, we can assign a value of 1 to the spam
messages. A value of 0 will be assigned to legitimate messages.

In [6]:
spam_dataset = []
spam_count = 0
ham_count = 0
for line in lines:
  label, text = line.split('\t')
  if label.lower().strip() == 'spam':
    spam_dataset.append((1, text.strip()))
    spam_count += 1
  else:
    spam_dataset.append((0, text.strip()))
    ham_count += 1

spam_dataset[:5]

[(0,
  'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'),
 (0, 'Ok lar... Joking wif u oni...'),
 (1,
  "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"),
 (0, 'U dun say so early hor... U c already then say...'),
 (0, "Nah I don't think he goes to usf, he lives around here though")]

In [7]:
print("Spam: ", spam_count, ", Ham: ", ham_count)

Spam:  747 , Ham:  4827


Now the dataset is ready for further processing in the pipeline.

In [8]:
# Convert the list of (label, text) tuples into a pandas DataFrame
df = pd.DataFrame(spam_dataset, columns=['Spam', 'Message'])
df.head()

| | Spam | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


Now let's split the dataset into training and test sets, with 80% of the records in the training set and the rest in the test set.

In [9]:
train = df.sample(frac=0.8, random_state=42)  # random_state is the seed value
test = df.drop(train.index)

train.describe()

| | Spam |
|---|---|
| count | 4459.0 |
| mean | 0.132765 |
| std | 0.339359 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 1.0 |


In [10]:
y_train = train[['Spam']]
y_test = test[['Spam']]

## Count-based vectorization


The idea behind count-based vectorization is really simple. Each unique word
appearing in the corpus is assigned a column in the vocabulary. Each document,
which would correspond to individual messages in the spam example, is assigned
a row. The counts of the words appearing in that document are entered in
the relevant cell corresponding to the document and the word. 

**With $n$ unique documents containing $m$ unique words, this results in a matrix of $n$ rows by $m$ columns.**

Consider a corpus like so:

In [None]:
corpus = [
  "I like fruits. Fruits like bananas",
  "I love bananas but eat an apple",
  "An apple a day keeps the doctor away"
]

There are three documents in this corpus of text. The scikit-learn (sklearn)
library provides methods for undertaking count-based vectorization.

The `CountVectorizer` class provides a built-in tokenizer that, by default, keeps only tokens of two or more characters, which is why single-character words such as "I" and "a" do not appear in the vocabulary below. This class takes a variety of options, including a custom tokenizer, a stop word list, the option to convert characters to lowercase prior to tokenization, and a binary mode that converts every positive count to 1.

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# On scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
vectorizer.get_feature_names()

['an',
 'apple',
 'away',
 'bananas',
 'but',
 'day',
 'doctor',
 'eat',
 'fruits',
 'keeps',
 'like',
 'love',
 'the']

The full matrix can be seen as follows:

In [None]:
X.toarray()

array([[0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0],
       [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
       [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1]])

This process has now converted a sentence such as "I like fruits. Fruits like bananas" into a vector `(0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0)`.

**This is an example of context-free vectorization. Context-free refers to the fact that the order of the words in the document makes no difference to the generated vector. The vector merely counts the instances of each word in a document.**
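The counting step can be sketched by hand to make the context-free property concrete. The snippet below assumes the default `CountVectorizer` behavior of lowercasing and keeping tokens of two or more word characters; the helper names `tokenize` and `count_vector` are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def count_vector(text, vocabulary):
    # One count per vocabulary word, in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

corpus = [
    "I like fruits. Fruits like bananas",
    "I love bananas but eat an apple",
    "An apple a day keeps the doctor away",
]
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

original = count_vector(corpus[0], vocabulary)
shuffled = count_vector("bananas like fruits. like fruits I", vocabulary)

print(original)
print(original == shuffled)  # word order does not change the vector
```

Reordering the words of the first document leaves its count vector unchanged, which is exactly what context-free means here.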

Consequently, words with multiple meanings may be grouped into one, for example, bank. This may refer to a place near the river or a place to keep money. 

**However, it does provide a method to compare documents and derive similarity. The cosine similarity or distance can be computed between two documents, to see which documents are similar to which other documents**:

In [None]:
cosine_similarity(X.toarray())

array([[1.        , 0.13608276, 0.        ],
       [0.13608276, 1.        , 0.3086067 ],
       [0.        , 0.3086067 , 1.        ]])

This shows that the first sentence and the second sentence have a similarity score of `0.136` (on a scale of 0 to 1). The first and third sentences have no words in common, so their score is 0. The second and third sentences have a similarity score of `0.309`, the highest in this set.
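To show where these scores come from, the sketch below recomputes them directly from the count vectors: cosine similarity is the dot product of two vectors divided by the product of their lengths. The function name `cosine` is an illustrative choice.

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Rows of the count matrix shown earlier
doc1 = [0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0]
doc2 = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
doc3 = [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1]

print(round(cosine(doc1, doc2), 3))  # 0.136
print(round(cosine(doc2, doc3), 3))  # 0.309
print(cosine(doc1, doc3))            # 0.0
```

Because count vectors contain no negative values, the scores always fall between 0 and 1.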

**Another use case of this technique is to check the similarity of the documents
with given keywords.**

Let's say that the query is "apple and bananas". The first step is
to compute the vector of this query, and then compute the cosine similarity scores against the documents in the corpus:

In [None]:
query = vectorizer.transform(["apple and bananas"])

cosine_similarity(X, query)

array([[0.23570226],
       [0.57735027],
       [0.26726124]])

This shows that this query matches the second sentence in the corpus the best. The third sentence would rank second, and the first sentence would rank lowest.

In a few lines, a basic search engine has been implemented, along with logic to serve queries!

## TF-IDF Vectorization

In creating the count-based vector representation of a document, only the number of times each word occurs was included; the importance of a word was not factored in. If the corpus of documents being processed is about a set of recipes with fruits, then one may expect words like apples, raspberries, and washing to appear frequently.

**Term Frequency (TF) represents how often a word or token occurs in a given document.**

In a set of documents about fruits and cooking, a word like apple may not be terribly specific to help identify a recipe. However, a word like tuile
may be uncommon in that context. Therefore, it may help to narrow the search for recipes much faster than a word like raspberry. On a side note, feel free to search the web for raspberry tuile recipes. 

**If a word is rare, we want to give it a higher weight, as it may contain more information than a common word. A term can be upweighted
by the inverse of the number of documents it appears in. Consequently, words that occur in a lot of documents will get a smaller score compared to terms that appear in fewer documents. This is called the Inverse Document Frequency (IDF).**

Mathematically, the score of each term in a document can be computed as follows:

$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$

Here, t represents the word or term, and d represents a specific document.

**It is common to normalize the TF of a term in a document by the total number of tokens in that document.**

The IDF is defined as follows:

$$ \text{IDF}(t) = \log\frac{N}{1 + n_t} $$

Here, $N$ represents the total number of documents in the corpus, and $n_t$ represents the number of documents where the term is present. The addition of 1 in the denominator avoids the divide-by-zero error.
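As a worked example of these definitions, the sketch below applies them to the tokenized toy corpus. It assumes a natural logarithm (the base is not stated above) and a TF normalized by document length; the function names are illustrative.

```python
import math
from collections import Counter

# Tokens of the three toy documents (single-character words dropped)
docs = [
    ["like", "fruits", "fruits", "like", "bananas"],
    ["love", "bananas", "but", "eat", "an", "apple"],
    ["an", "apple", "day", "keeps", "the", "doctor", "away"],
]

N = len(docs)
# n_t: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf(term, doc):
    # Term count normalized by the number of tokens in the document
    return doc.count(term) / len(doc)

def idf(term):
    return math.log(N / (1 + doc_freq[term]))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf_idf("fruits", docs[0]), 4))  # term found in one document
print(tf_idf("bananas", docs[0]))           # term found in two documents
```

With only three documents, a term that appears in two of them gets an IDF of log(3/3) = 0 under this definition, so "bananas" contributes nothing while "fruits" keeps a positive weight.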

Let's convert the counts from the previous section into their TF-IDF equivalents:

In [None]:
transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(X.toarray())

# On scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names())

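| | an | apple | away | bananas | but | day | doctor | eat | fruits | keeps | like | love | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.230408 | 0.0 | 0.0 | 0.0 | 0.0 | 0.688081 | 0.0 | 0.688081 | 0.0 | 0.0 |
| 1 | 0.321267 | 0.321267 | 0.0 | 0.321267 | 0.479709 | 0.0 | 0.0 | 0.479709 | 0.0 | 0.0 | 0.0 | 0.479709 | 0.0 |
| 2 | 0.275785 | 0.275785 | 0.411797 | 0.0 | 0.0 | 0.411797 | 0.411797 | 0.0 | 0.0 | 0.411797 | 0.0 | 0.0 | 0.411797 |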


This should give some intuition on how TF-IDF is computed. Even with three toy
sentences and a very limited vocabulary, many of the columns in each row are 0.
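The numbers in this table do not follow the textbook formula exactly. As documented for scikit-learn, `TfidfTransformer` with `smooth_idf=False` computes the IDF as ln(N / n_t) + 1 and then scales each row to unit Euclidean length. The sketch below reproduces the first row by hand under that assumption; the variable names are illustrative.

```python
import math

counts = {"bananas": 1, "fruits": 2, "like": 2}    # raw counts in the first document
doc_freq = {"bananas": 2, "fruits": 1, "like": 1}  # documents containing each term
n_docs = 3

# Raw count times (ln(N / n_t) + 1)
raw = {t: c * (math.log(n_docs / doc_freq[t]) + 1) for t, c in counts.items()}

# Scale the row to unit Euclidean (L2) length
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: v / norm for t, v in raw.items()}

print({t: round(w, 6) for t, w in weights.items()})
```

The result agrees with the first row of the table: roughly 0.2304 for "bananas" and 0.6881 for "fruits" and "like".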

**This vectorization produces sparse representations.**


Now, this can be applied to the problem of detecting spam messages. Thus far, each message has been stored as raw text in the pandas DataFrame. Now, the content of each message will be tokenized and converted into a set of columns, and the TF-IDF score of each token will be computed for each message.

In [None]:
tfidf = TfidfVectorizer(binary=True)
X = tfidf.fit_transform(train['Message']).astype('float32')
X_test = tfidf.transform(test['Message']).astype('float32')

In [None]:
X.shape

(4459, 7741)

The second dimension of the shape shows that 7,741 unique tokens were identified. These are the columns of features that will be used in the model later.

Note that the vectorizer was created with the binary flag. This means that even if a token appears multiple times in a message, its term frequency is counted as one. The IDF weighting is still applied on top of that.

In [None]:
X.toarray()[:5]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

### Modeling using TF-IDF features

The preceding code fits the TF-IDF vectorizer on the training dataset. Then, it converts the words in the test set according to the vocabulary and IDF weights learned from the training set.

Let's train a model on just these TF-IDF features.

In [None]:
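_, cols = X.shape
model2 = make_model(cols)  # input size must match the number of TF-IDF features

# Note: the encoded labels below are not used by fit(), which trains directly on y_train
lb = LabelEncoder()
y = lb.fit_transform(y_train)
dummy_y_train = np_utils.to_categorical(y)
model2.fit(X.toarray(), y_train, epochs=10, batch_size=10)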

  y = column_or_1d(y, warn=True)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f90c05dc890>

Whoa, the model classifies every training message correctly! In all honesty, the model is probably overfitting, so some regularization should be applied. The test set gives this result:

In [None]:
model2.evaluate(X_test.toarray(), y_test)



[0.054796118289232254, 0.9847533702850342]

An accuracy of 98.48% on the test set is a strong result. Checking the confusion matrix, it is evident that this model is indeed doing very well:

In [None]:
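# predict_classes was removed in TensorFlow 2.6; thresholding the sigmoid output at 0.5 is equivalent
y_test_pred = (model2.predict(X_test.toarray()) > 0.5).astype("int32").ravel()
tf.math.confusion_matrix(tf.constant(y_test.Spam), y_test_pred)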



<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[958,   2],
       [ 15, 140]], dtype=int32)>

Only 2 regular messages were classified as spam, while only 15 spam messages
were classified as being not spam. This is indeed a very good model.
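The confusion matrix also yields precision and recall for the spam class, which are more informative than accuracy when only about 13% of the messages are spam. The sketch below copies the four numbers from the output above; rows are true labels and columns are predicted labels, as in `tf.math.confusion_matrix`.

```python
# Rows are true labels (0 = ham, 1 = spam); columns are predicted labels
cm = [[958, 2],
      [15, 140]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

Recall comes out at about 0.90, so roughly one in ten spam messages in the test set still gets through.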

This model, without using a lot of pretraining and knowledge of
language, vocabulary, and grammar, was able to do a very reasonable job with the task at hand.

In [None]:
train.loc[train.Spam == 1].describe() 

| | Spam |
|---|---|
| count | 592.0 |
| mean | 1.0 |
| std | 0.0 |
| min | 1.0 |
| 25% | 1.0 |
| 50% | 1.0 |
| 75% | 1.0 |
| max | 1.0 |


However, this model ignores the relationships between words completely. It treats the words in a document as unordered items in a set. There are better models that vectorize the tokens in a way that preserves some of the relationships between the tokens.

## Word Vectors

In NLP, a lot of research has focused on learning representations of words in an unsupervised way. This is called representation learning. The output of this approach is a representation of a word in some vector space, and the word can be considered embedded in that space. Consequently, these word vectors are also called embeddings.

**The core hypothesis behind word vector algorithms is that words that occur near each other are related to each other.** 

To see the intuition behind this, consider two words, bake and oven. Given a sentence fragment of five words, where one of these words is present, what would be the probability of the other being present as well?
You would be right in guessing that the probability is likely quite high. Suppose now that words are being mapped into some two-dimensional space. In that space, these two words should be closer to each other, and probably further away from words like astronomy and tractor.

The task of learning these embeddings for the words can be then thought of as
adjusting words in a giant multidimensional space where similar words are closer to each other and dissimilar words are further apart from each other.
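The geometric intuition can be illustrated with a toy two-dimensional space. The coordinates below are hypothetical values chosen by hand for this sketch and are not produced by any trained model.

```python
import math

# Hypothetical 2-D coordinates, chosen by hand for illustration only
toy_embeddings = {
    "bake": (0.9, 0.8),
    "oven": (1.0, 0.7),
    "astronomy": (-0.8, 0.6),
    "tractor": (-0.5, -0.9),
}

def distance(a, b):
    """Euclidean distance between the toy vectors of two words."""
    return math.dist(toy_embeddings[a], toy_embeddings[b])

for other in ("oven", "astronomy", "tractor"):
    print(other, round(distance("bake", other), 2))
```

In a learned embedding space the positions come from training on a large corpus, but the goal is the same: related words end up a short distance apart.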

A revolutionary approach to do this is called Word2Vec. This approach
generally produces dense vectors of the order of 50-300 dimensions (though larger ones are known), where most of the values are non-zero.

The original paper proposed two algorithms: **continuous bag-of-words and continuous skip-gram.** On semantic tasks and overall, the performance of skip-gram was state of the art at the time of its publication. Consequently, the continuous skip-gram model with negative sampling has become synonymous with Word2Vec.
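Skip-gram learns by predicting the words surrounding a given center word, so its training data consists of (center, context) pairs taken from a sliding window. The sketch below only generates those positive pairs; the function name, the example sentence, and the window size are illustrative assumptions, and negative sampling and the training step itself are not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "bake the cake in the oven".split()
for pair in skipgram_pairs(sentence, window=1)[:4]:
    print(pair)
```

A larger window produces more pairs per word and tends to capture broader topical relations, at the cost of more training data to process.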

### Pretrained models using Word2Vec embeddings

Since we are only interested in experimenting with a pretrained model, we can
use the Gensim library and its pretrained embeddings.

In [11]:
api.info()

{'corpora': {'20-newsgroups': {'checksum': 'c92fd4f6640a86d5ba89eaad818a9891',
   'description': 'The notorious collection of approximately 20,000 newsgroup posts, partitioned (nearly) evenly across 20 different newsgroups.',
   'fields': {'data': '',
    'id': 'original id inferred from folder name',
    'set': "marker of original split (possible values 'train' and 'test')",
    'topic': 'name of topic (20 variant of possible values)'},
   'file_name': '20-newsgroups.gz',
   'file_size': 14483581,
   'license': 'not found',
   'num_records': 18846,
   'parts': 1,
   'read_more': ['http://qwone.com/~jason/20Newsgroups/'],
   'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/20-newsgroups/__init__.py',
   'record_format': 'dict'},
  '__testing_matrix-synopsis': {'checksum': '1767ac93a089b43899d54944b07d9dc5',
   'description': '[THIS IS ONLY FOR TESTING] Synopsis of the movie matrix.',
   'file_name': '__testing_matrix-synopsis.gz',
   'parts': 1,
   're

Note that these particular embeddings are approximately 1.6 GB in size, so they may take a long time to download and load.

In [12]:
model_w2v = api.load("word2vec-google-news-300")



Now, we are ready to inspect the similar words:

In [13]:
model_w2v.most_similar("cookies",topn=10)

[('cookie', 0.745154082775116),
 ('oatmeal_raisin_cookies', 0.6887780427932739),
 ('oatmeal_cookies', 0.662139892578125),
 ('cookie_dough_ice_cream', 0.6520504951477051),
 ('brownies', 0.6479344964027405),
 ('homemade_cookies', 0.6476464867591858),
 ('gingerbread_cookies', 0.6461867690086365),
 ('Cookies', 0.6341644525527954),
 ('cookies_cupcakes', 0.6275068521499634),
 ('cupcakes', 0.6258294582366943)]

This is pretty good. Let's see how this model does at an odd-one-out task:

In [15]:
model_w2v.doesnt_match(["USA","Canada","India","Tokyo"])

  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)


'Tokyo'

The model is able to guess that compared to the other words, which are all countries, Tokyo is the odd one out, as it is a city.

Now, let's try a very famous example of vector arithmetic on these word vectors:

In [16]:
king = model_w2v['king']
man = model_w2v['man']
woman = model_w2v['woman']

queen = king - man + woman  
model_w2v.similar_by_vector(queen)

[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.6454660892486572),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676948547363),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376776456832886),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]

Given that king was provided as an input to the equation, it is simple to filter the inputs from the outputs, and queen would then be the top result.
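The filtering step can be sketched in a few lines. The list below copies the top results from the output above with scores rounded, and the variable names are illustrative.

```python
# Top results copied from the output above (similarity scores rounded)
results = [
    ("king", 0.845), ("queen", 0.730), ("monarch", 0.645),
    ("princess", 0.616), ("crown_prince", 0.582),
]
inputs = {"king", "man", "woman"}

# Drop any result that was one of the words used to build the query vector
filtered = [(word, score) for word, score in results if word not in inputs]
print(filtered[0])
```

Gensim's `most_similar` method, called with `positive=["king", "woman"]` and `negative=["man"]`, performs a similar computation and leaves the input words out of its results.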

A pretrained model like the preceding can be used to vectorize a document. Using these embeddings, models can be trained for specific purposes.
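One simple way to turn word vectors into a document vector is to average the vectors of the words in the document. This is an assumption made for illustration, not something this notebook does, and the three-dimensional vectors below are hypothetical stand-ins for real 300-dimensional embeddings.

```python
# Hypothetical 3-D embeddings; a real pretrained model uses hundreds of dimensions
toy_vectors = {
    "free": [0.9, 0.1, 0.0],
    "entry": [0.7, 0.3, 0.2],
    "win": [0.8, 0.0, 0.4],
}

def document_vector(tokens, vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of the document is in the vocabulary
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]

print(document_vector("free entry to win".split(), toy_vectors))
```

Unlike the TF-IDF representation, the resulting vector is dense and has the same length for every document, whatever the size of the vocabulary.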