# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 1</a>

## Bag of Words Method

In this notebook, we go over the Bag of Words (BoW) method for converting text data into numerical values, which are later used for predictions with machine learning algorithms.

To convert text data to vectors of numbers, a vocabulary of known words (tokens) is extracted from the text, the occurrence of each word is scored, and the resulting numerical values are saved in vocabulary-long vectors. There are a few versions of BoW, corresponding to different word scoring methods. We use the Sklearn library to calculate the BoW numerical values using:

1. <a href="#1">Binary</a>
2. <a href="#2">Word Counts</a>
3. <a href="#3">Term Frequencies</a>
4. <a href="#4">Term Frequency-Inverse Document Frequencies</a>
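
The vocabulary-and-scoring procedure described above can be sketched without any library. This is only an illustration under simplifying assumptions: the `tokenize` and `bow_vector` helpers are hypothetical names introduced here, and the tokenizer (lowercasing plus alphanumeric runs) is a simplification, not Sklearn's exact behavior.

```python
import re
from collections import Counter

sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]


def tokenize(text):
    # Hypothetical helper: lowercase the text and keep alphanumeric runs
    return re.findall(r"\w+", text.lower())


# Step 1: extract the vocabulary of known words, sorted alphabetically
vocabulary = sorted({tok for s in sentences for tok in tokenize(s)})
index = {word: i for i, word in enumerate(vocabulary)}


def bow_vector(text, binary=False):
    # Step 2: score the occurrence of each known word in the text
    counts = Counter(tok for tok in tokenize(text) if tok in index)
    # Step 3: store the scores in a vocabulary-long vector
    vec = [0] * len(vocabulary)
    for word, c in counts.items():
        vec[index[word]] = 1 if binary else c
    return vec


count_vectors = [bow_vector(s) for s in sentences]
binary_vectors = [bow_vector(s, binary=True) for s in sentences]

print(vocabulary)
print(count_vectors[0])
print(binary_vectors[0])
```

Only the scoring rule changes between the BoW variants; the vocabulary and the vector layout stay the same.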


In [1]:
# Upgrade dependencies
!pip install -r ../../requirements.txt


In [2]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## 1. <a name="1">Binary</a>
(<a href="#0">Go to top</a>)

Let's calculate the first type of BoW, recording whether the word is in the sentence or not. We will also go over some useful features of Sklearn's vectorizers here.


In [3]:
sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]

# Initialize the count vectorizer with the parameter: binary=True
binary_vectorizer = CountVectorizer(binary=True)

# fit_transform() function fits the text data and gets the binary BoW vectors
x = binary_vectorizer.fit_transform(sentences)

As the vocabulary size grows, the BoW vectors also get very large. They usually contain many zeros and very few non-zero values, so Sklearn stores them in a compressed (sparse) form. If we want to use them as Numpy arrays, we call the __toarray()__ function. Here are our binary BoW features. Each row corresponds to a single document.

In [4]:
x.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1]])
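
To make the compressed storage a bit more concrete, the sketch below inspects the object returned by `fit_transform()` before converting it. It assumes SciPy is installed (Sklearn depends on it) and uses `scipy.sparse.issparse`, which is not used elsewhere in this notebook.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]

x = CountVectorizer(binary=True).fit_transform(sentences)

# The result is a SciPy sparse matrix: only non-zero entries are stored
print(issparse(x))
print(x.shape)  # (number of documents, vocabulary size)
print(x.nnz)    # number of stored non-zero values

# toarray() materializes every cell, zeros included
dense = x.toarray()
print(dense.size)
```

On this tiny corpus the saving is negligible, but with a vocabulary of tens of thousands of words most cells are zero and the dense array can become very large.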

Let's check out our vocabulary. We can use the __vocabulary___ attribute. This returns a dictionary with each word as key and its column index as value. Notice that the indices are assigned in alphabetical order.

In [5]:
binary_vectorizer.vocabulary_

{'this': 8,
 'document': 1,
 'is': 3,
 'the': 6,
 'first': 2,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}
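
As a small check of the alphabetical ordering, the sketch below inverts the `vocabulary_` dictionary to recover the word behind each column index. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]

vectorizer = CountVectorizer(binary=True)
vectorizer.fit(sentences)
vocab = vectorizer.vocabulary_

# Order the words by their column index
words_by_column = sorted(vocab, key=vocab.get)
print(words_by_column)

# The column order matches plain alphabetical order
print(words_by_column == sorted(vocab))
```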

Similar information can be obtained with the __get_feature_names()__ function. The position of each term in the list returned by .get_feature_names() corresponds to the column position of that term in the BoW matrix.

In [6]:
print(binary_vectorizer.get_feature_names())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


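We can easily compute the number of documents in which each word of the vocabulary appears. Because the vectors are binary, summing a column counts documents rather than total occurrences (for example, "document" appears four times in the corpus but in only two sentences).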

In [7]:
sum_words = x.sum(axis=0)
words_freq = [
    (word, sum_words[0, idx])
    for (idx, word) in enumerate(binary_vectorizer.get_feature_names())
]
words_freq

[('and', 1),
 ('document', 2),
 ('first', 1),
 ('is', 3),
 ('one', 1),
 ('second', 1),
 ('the', 3),
 ('third', 1),
 ('this', 3)]
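
The column sums depend on how the vectors were scored. The following sketch, which refits two vectorizers on the same corpus, compares sums over binary vectors (number of documents containing the word) with sums over count vectors (total occurrences in the corpus).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]

binary_x = CountVectorizer(binary=True).fit_transform(sentences)
count_vec = CountVectorizer()
count_x = count_vec.fit_transform(sentences)

# Words ordered by column index
words = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

# Column sums: documents containing the word vs. total occurrences
doc_freq = np.asarray(binary_x.sum(axis=0)).ravel()
total_count = np.asarray(count_x.sum(axis=0)).ravel()

for w, d, t in zip(words, doc_freq, total_count):
    print(f"{w:10s} documents={d} occurrences={t}")
```

The two only differ for "document", the one word repeated within a sentence.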

Here are the binary BoW vectors associated with each sentence of the corpus.

In [8]:
df = pd.DataFrame(
    x.toarray(), columns=binary_vectorizer.get_feature_names(), index=sentences
)
df

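| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| This document is the first document | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| This document is the second document | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| and this is the third one | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |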


How can we calculate BoW for a new text? We will use the __transform()__ function this time. As you can see below, this doesn't change the vocabulary: words that were not seen during fitting are simply skipped.

In [9]:
new_sentence = ["This is the new sentence"]

new_vectors = binary_vectorizer.transform(new_sentence)

In [10]:
new_vectors.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 1]])
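
To see which words are skipped, one option is to tokenize the new sentence with the vectorizer's own analyzer and compare the tokens against the fitted vocabulary. This is a sketch; `build_analyzer()` is part of Sklearn's vectorizer API, while the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]

vectorizer = CountVectorizer(binary=True)
vectorizer.fit(sentences)

# The analyzer applies the same preprocessing and tokenization as transform()
analyzer = vectorizer.build_analyzer()
tokens = analyzer("This is the new sentence")
unknown = [t for t in tokens if t not in vectorizer.vocabulary_]
print(tokens)
print(unknown)

# transform() leaves the fitted vocabulary untouched
vocab_before = dict(vectorizer.vocabulary_)
vectorizer.transform(["This is the new sentence"])
print(vectorizer.vocabulary_ == vocab_before)
```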

In [11]:
df2 = pd.DataFrame(
    new_vectors.toarray(),
    columns=binary_vectorizer.get_feature_names(),
    index=new_sentence,
)
pd.concat([df, df2])

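| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| This document is the first document | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| This document is the second document | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| and this is the third one | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| This is the new sentence | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |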


## 2. <a name="2">Word Counts</a>
(<a href="#0">Go to top</a>)

Word counts can be simply calculated using the same __CountVectorizer()__ function __without__ the __binary__ parameter.



In [12]:
sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]

# Initialize the count vectorizer
count_vectorizer = CountVectorizer()

xc = count_vectorizer.fit_transform(sentences)

xc.toarray()

array([[0, 2, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1]])
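
The binary and count representations are closely related: clipping every positive count to 1 should give back the binary vectors. The sketch below checks this on the same corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]

counts = CountVectorizer().fit_transform(sentences).toarray()
binary = CountVectorizer(binary=True).fit_transform(sentences).toarray()

# Any positive count becomes 1
clipped = (counts > 0).astype(int)
print(np.array_equal(clipped, binary))

# (row, column) positions where the two representations differ
print(np.argwhere(counts != binary))
```

The only differing cells are the "document" column of the first two sentences, where the word occurs twice.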

In [13]:
# Use the feature names of the vectorizer that produced xc
df = pd.DataFrame(
    xc.toarray(), columns=count_vectorizer.get_feature_names(), index=sentences
)
df

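| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| This document is the first document | 0 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| This document is the second document | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| and this is the third one | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |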


In [14]:
new_sentence = ["This is the new sentence"]
new_vectors = count_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 1]])

In [15]:
# Use the feature names of the vectorizer that produced new_vectors
df2 = pd.DataFrame(
    new_vectors.toarray(),
    columns=count_vectorizer.get_feature_names(),
    index=new_sentence,
)
pd.concat([df, df2])

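| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| This document is the first document | 0 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| This document is the second document | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| and this is the third one | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| This is the new sentence | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |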


## 3. <a name="3">Term Frequency (TF)</a>
(<a href="#0">Go to top</a>)

Term Frequency (TF) vectors, which show how important words are to the documents, are computed using

$$tf(term, doc) = \frac{\text{number of times that the term occurs in the doc}}{\text{total number of terms in the doc}}$$

From `sklearn` we use `TfidfVectorizer` with the parameter `use_idf=False`. Note that, instead of dividing by the total number of terms in the doc, it *automatically normalizes* the term frequency vectors by their Euclidean ($L_2$) norm by default.


In [16]:
tf_vectorizer = TfidfVectorizer(use_idf=False)

x = tf_vectorizer.fit_transform(sentences)

np.round(x.toarray(), 2)

array([[0.  , 0.71, 0.35, 0.35, 0.  , 0.  , 0.35, 0.  , 0.35],
       [0.  , 0.71, 0.  , 0.35, 0.  , 0.35, 0.35, 0.  , 0.35],
       [0.41, 0.  , 0.  , 0.41, 0.41, 0.  , 0.41, 0.41, 0.41]])
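
The normalization can be reproduced by hand from the raw counts. The sketch below compares a manual $L_2$ normalization with the output of `TfidfVectorizer(use_idf=False)`, and also shows that passing `norm="l1"` divides by the total number of terms in the document, as in the $tf$ formula above. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]

counts = CountVectorizer().fit_transform(sentences).toarray().astype(float)

# L2: divide each row of counts by its Euclidean norm (Sklearn's default)
l2_manual = counts / np.linalg.norm(counts, axis=1, keepdims=True)
l2_sklearn = TfidfVectorizer(use_idf=False).fit_transform(sentences).toarray()
print(np.allclose(l2_manual, l2_sklearn))

# L1: divide each row of counts by the total number of terms in the document
l1_manual = counts / counts.sum(axis=1, keepdims=True)
l1_sklearn = (
    TfidfVectorizer(use_idf=False, norm="l1").fit_transform(sentences).toarray()
)
print(np.allclose(l1_manual, l1_sklearn))

print(np.round(l1_manual, 2))
```

With $L_1$ normalization each row sums to 1; with $L_2$ normalization each row has unit Euclidean length, which is why "document" scores 0.71 rather than 2/6.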

In [17]:
new_sentence = ["This is the new sentence"]
new_vectors = tf_vectorizer.transform(new_sentence)
np.round(new_vectors.toarray(), 2)

array([[0.  , 0.  , 0.  , 0.58, 0.  , 0.  , 0.58, 0.  , 0.58]])

In [18]:
df = pd.DataFrame(
    np.round(x.toarray(), 2), columns=tf_vectorizer.get_feature_names(), index=sentences
)
df2 = pd.DataFrame(
    np.round(new_vectors.toarray(), 2),
    columns=tf_vectorizer.get_feature_names(),
    index=new_sentence,
)
pd.concat([df, df2])

| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| This document is the first document | 0.0 | 0.71 | 0.35 | 0.35 | 0.0 | 0.0 | 0.35 | 0.0 | 0.35 |
| This document is the second document | 0.0 | 0.71 | 0.0 | 0.35 | 0.0 | 0.35 | 0.35 | 0.0 | 0.35 |
| and this is the third one | 0.41 | 0.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.41 | 0.41 | 0.41 |
| This is the new sentence | 0.0 | 0.0 | 0.0 | 0.58 | 0.0 | 0.0 | 0.58 | 0.0 | 0.58 |


## 4. <a name="4">Term Frequency Inverse Document Frequency (TF-IDF)</a>
(<a href="#0">Go to top</a>)

Term Frequency Inverse Document Frequency (TF-IDF) vectors are computed using the `TfidfVectorizer()` function with the parameter `use_idf=True`. We can also skip this parameter as it is already `True` by default.


In [19]:
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]

xf = tfidf_vectorizer.fit_transform(sentences)

np.round(xf.toarray(), 2)

array([[0.  , 0.73, 0.48, 0.28, 0.  , 0.  , 0.28, 0.  , 0.28],
       [0.  , 0.73, 0.  , 0.28, 0.  , 0.48, 0.28, 0.  , 0.28],
       [0.5 , 0.  , 0.  , 0.29, 0.5 , 0.  , 0.29, 0.5 , 0.29]])

In [20]:
new_sentence = ["This is the new sentence"]
new_vectors = tfidf_vectorizer.transform(new_sentence)
np.round(new_vectors.toarray(), 2)

array([[0.  , 0.  , 0.  , 0.58, 0.  , 0.  , 0.58, 0.  , 0.58]])

In [21]:
df = pd.DataFrame(
    np.round(xf.toarray(), 2),
    columns=tfidf_vectorizer.get_feature_names(),
    index=sentences,
)
df2 = pd.DataFrame(
    np.round(new_vectors.toarray(), 2),
    columns=tfidf_vectorizer.get_feature_names(),
    index=new_sentence,
)
pd.concat([df, df2])

| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| This document is the first document | 0.0 | 0.73 | 0.48 | 0.28 | 0.0 | 0.0 | 0.28 | 0.0 | 0.28 |
| This document is the second document | 0.0 | 0.73 | 0.0 | 0.28 | 0.0 | 0.48 | 0.28 | 0.0 | 0.28 |
| and this is the third one | 0.5 | 0.0 | 0.0 | 0.29 | 0.5 | 0.0 | 0.29 | 0.5 | 0.29 |
| This is the new sentence | 0.0 | 0.0 | 0.0 | 0.58 | 0.0 | 0.0 | 0.58 | 0.0 | 0.58 |


__Note 1__: In addition to *automatically normalizing* the term frequency vectors by their Euclidean ($L_2$) norm, sklearn also uses a *smoothed version of idf*, computing

$$idf(term) = \ln \Big( \frac{n_{documents} +1}{n_{documents\,containing\,the\,term}+1}\Big) + 1$$

In [22]:
np.round(tfidf_vectorizer.idf_, 2)

array([1.69, 1.29, 1.69, 1.  , 1.69, 1.69, 1.  , 1.69, 1.  ])
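
The smoothed IDF formula and the resulting TF-IDF vectors can be reproduced from the raw counts. This is a sketch under the defaults used in this notebook (`smooth_idf=True`, `norm="l2"`, no sublinear scaling); the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = [
    "This document is the first document",
    "This document is the second document",
    "and this is the third one",
]

counts = CountVectorizer().fit_transform(sentences).toarray().astype(float)

# Smoothed idf: ln((n_documents + 1) / (n_documents_containing_term + 1)) + 1
n_documents = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((n_documents + 1) / (doc_freq + 1)) + 1

# Weight the raw counts by idf, then L2-normalize each row
weighted = counts * idf
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

vectorizer = TfidfVectorizer()
tfidf_sklearn = vectorizer.fit_transform(sentences).toarray()

print(np.allclose(idf, vectorizer.idf_))
print(np.allclose(tfidf_manual, tfidf_sklearn))
print(np.round(idf, 2))
```

For example, "document" occurs in 2 of the 3 sentences, so its IDF is $\ln(4/3) + 1 \approx 1.29$, while words present in every sentence get $\ln(4/4) + 1 = 1$.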

Here we can see how the IDF is larger for the less common terms and vice versa.

In [23]:
df = pd.DataFrame(
    [[str(a) for a in np.round(tfidf_vectorizer.idf_, 2)]],
    columns=tfidf_vectorizer.get_feature_names(),
    index=["IDF"],
)
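# words_freq was computed from the binary vectors, so it holds
# the number of documents containing each word (document frequency)
df2 = pd.DataFrame(
    [[str(w[1]) for w in words_freq]],
    columns=tfidf_vectorizer.get_feature_names(),
    index=["Document Frequency"],
)
pd.concat([df2, df])

| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| Document Frequency | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 3.0 | 1.0 | 3.0 |
| IDF | 1.69 | 1.29 | 1.69 | 1.0 | 1.69 | 1.69 | 1.0 | 1.69 | 1.0 |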
