Version: 07.01.2022

In [1]:
# Upgrade dependencies
!pip install --upgrade pip
!pip install --upgrade scikit-learn
!pip install --upgrade sagemaker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-23.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-learn
  Downloading scikit_learn-1.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.8 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd


sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

# Initialize the count vectorizer with the parameter: binary=True
binary_vectorizer = CountVectorizer(binary=True)

# fit_transform() function fits the text data and gets the binary BoW vectors
x = binary_vectorizer.fit_transform(sentences)

As the vocabulary grows, the BoW vectors grow with it. They usually consist of many zeros and very few non-zero values, so scikit-learn stores them in a compressed (sparse) form. To use them as NumPy arrays, call the __toarray()__ function. The following code cell shows the binary BoW features. Each row corresponds to a single document.

In [3]:
x.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1]])
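To make the compressed storage mentioned above more concrete, the following minimal sketch inspects the object returned by `fit_transform()` before it is converted to a dense array. It assumes SciPy is available (scikit-learn depends on it); the variable names `stored` and `density` are only illustrative.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

x = CountVectorizer(binary=True).fit_transform(sentences)

# fit_transform() returns a SciPy sparse matrix rather than a dense NumPy array
is_sparse = sparse.issparse(x)

# Only the non-zero entries are stored explicitly
n_rows, n_cols = x.shape
stored = x.nnz
density = stored / (n_rows * n_cols)

print(is_sparse, x.shape, stored, round(density, 2))
```

With only three short documents the matrix is still fairly dense. On a realistic corpus with thousands of vocabulary terms, the fraction of non-zero entries is typically far smaller, which is where the sparse format pays off.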

You can also look at the vocabulary by using the __vocabulary___ attribute. It returns a dictionary with each word as a key and its column index as the value. Notice that the indices follow the alphabetical order of the words, even though the dictionary itself is not printed in that order.

In [4]:
binary_vectorizer.vocabulary_

{'this': 8,
 'document': 1,
 'is': 3,
 'the': 6,
 'first': 2,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

You can get similar information with the __get_feature_names_out()__ function. The position of each term in the **.get_feature_names_out()** output corresponds to its column position in the BoW matrix.

In [5]:
print(binary_vectorizer.get_feature_names_out())

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


You might wonder how you can calculate BoW for a new text. You can do so by using the __transform()__ function. This function doesn't change the vocabulary, so words that were not seen during fitting are skipped.

In [6]:
new_sentence = ["This is the new sentence"]

new_vectors = binary_vectorizer.transform(new_sentence)

In [7]:
new_vectors.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 1]])
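To see which words were skipped, one option is to tokenize the new text with the vectorizer's own analyzer and compare the tokens against the fitted vocabulary. This is a minimal sketch rather than a built-in report; `build_analyzer()` is part of the scikit-learn API, while the names `tokens` and `skipped` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

vectorizer = CountVectorizer(binary=True).fit(sentences)

# build_analyzer() returns the callable the vectorizer uses to
# preprocess (lowercase) and tokenize raw text
analyze = vectorizer.build_analyzer()

new_sentence = "This is the new sentence"
tokens = analyze(new_sentence)

# Tokens that are absent from the fitted vocabulary get no column in the output
skipped = [t for t in tokens if t not in vectorizer.vocabulary_]

print(tokens)
print(skipped)
```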

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This document is the first document", "This document is the second document", "and this is the third one"]

# Initialize the count vectorizer
count_vectorizer = CountVectorizer()

xc = count_vectorizer.fit_transform(sentences)

df = pd.DataFrame(xc.toarray())
df.columns = count_vectorizer.get_feature_names_out()
df


   and  document  first  is  one  second  the  third  this
0    0         2      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1


You can use the __transform()__ function to calculate BoW for a new text without changing the vocabulary, as in the binary scenario.

In [None]:
new_sentence = ["This is the new sentence"]
new_vectors = count_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 1]])
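Because no word repeats in this new sentence, the count vector is identical to the binary one. The sketch below uses a different, hypothetical sentence with repeated words (not part of the original notebook) to show where the two vectorizers diverge.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

binary_vectorizer = CountVectorizer(binary=True).fit(sentences)
count_vectorizer = CountVectorizer().fit(sentences)

# Hypothetical sentence: "the" and "document" appear twice, "other" is unseen
repeated_sentence = ["the document and the other document"]

binary_vector = binary_vectorizer.transform(repeated_sentence).toarray()
count_vector = count_vectorizer.transform(repeated_sentence).toarray()

# The binary vector records presence only; the count vector records occurrences
print(binary_vector)
print(count_vector)
```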

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vectorizer = TfidfVectorizer(use_idf=False)

x = tf_vectorizer.fit_transform(sentences)

x.toarray()

array([[0.        , 0.70710678, 0.35355339, 0.35355339, 0.        ,
        0.        , 0.35355339, 0.        , 0.35355339],
       [0.        , 0.70710678, 0.        , 0.35355339, 0.        ,
        0.35355339, 0.35355339, 0.        , 0.35355339],
       [0.40824829, 0.        , 0.        , 0.40824829, 0.40824829,
        0.        , 0.40824829, 0.40824829, 0.40824829]])
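The values above can be reproduced by hand: with `use_idf=False` and the default settings, each row of raw counts is divided by its Euclidean norm. The following sketch recomputes this with NumPy and compares it to the scikit-learn result; `tf_manual` and `tf_sklearn` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(sentences).toarray().astype(float)

# Divide each row by its Euclidean (l2) norm
norms = np.linalg.norm(counts, axis=1, keepdims=True)
tf_manual = counts / norms

# The same quantity as computed by TfidfVectorizer without IDF weighting
tf_sklearn = TfidfVectorizer(use_idf=False).fit_transform(sentences).toarray()

print(np.allclose(tf_manual, tf_sklearn))
```

For example, the first document has counts 2, 1, 1, 1, 1, so its norm is the square root of 8, and the entry for "document" is 2 divided by that norm, about 0.7071.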

In [None]:
new_sentence = ["This is the new sentence"]
new_vectors = tf_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0.        , 0.        , 0.        , 0.57735027, 0.        ,
        0.        , 0.57735027, 0.        , 0.57735027]])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(use_idf=True)

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

xf = tfidf_vectorizer.fit_transform(sentences)

xf.toarray()

array([[0.        , 0.7284449 , 0.47890875, 0.28285122, 0.        ,
        0.        , 0.28285122, 0.        , 0.28285122],
       [0.        , 0.7284449 , 0.        , 0.28285122, 0.        ,
        0.47890875, 0.28285122, 0.        , 0.28285122],
       [0.49711994, 0.        , 0.        , 0.29360705, 0.49711994,
        0.        , 0.29360705, 0.49711994, 0.29360705]])

In [None]:
new_sentence = ["This is the new sentence"]
new_vectors = tfidf_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0.        , 0.        , 0.        , 0.57735027, 0.        ,
        0.        , 0.57735027, 0.        , 0.57735027]])

__Note__: In addition to *automatically normalizing the term frequency vectors by their Euclidean ($\ell_2$) norm*, scikit-learn also uses a *smoothed version of IDF* by computing:

$$idf(term) = \ln \Big( \frac{n_{documents} +1}{n_{documents\,that\,contain\,the\,term}+1}\Big) + 1$$

In [None]:
tfidf_vectorizer.idf_

array([1.69314718, 1.28768207, 1.69314718, 1.        , 1.69314718,
       1.69314718, 1.        , 1.69314718, 1.        ])
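The smoothed IDF formula can be checked directly against the `idf_` attribute. This sketch derives the document frequencies from the raw counts and applies the formula with NumPy; it assumes the default `smooth_idf=True` setting, and the names `doc_freq` and `idf_manual` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

counts = CountVectorizer().fit_transform(sentences).toarray()

# Number of documents, and number of documents that contain each term
n_documents = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF: ln((n + 1) / (df + 1)) + 1
idf_manual = np.log((n_documents + 1) / (doc_freq + 1)) + 1

tfidf_vectorizer = TfidfVectorizer().fit(sentences)

print(np.allclose(idf_manual, tfidf_vectorizer.idf_))
```

Terms that occur in all three documents ("is", "the", "this") get an IDF of exactly 1, because the logarithm of 4/4 is 0.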

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

tfidf_vectorizer = TfidfVectorizer()
xf = tfidf_vectorizer.fit_transform(sentences)
xf.toarray()

array([[0.        , 0.7284449 , 0.47890875, 0.28285122, 0.        ,
        0.        , 0.28285122, 0.        , 0.28285122],
       [0.        , 0.7284449 , 0.        , 0.28285122, 0.        ,
        0.47890875, 0.28285122, 0.        , 0.28285122],
       [0.49711994, 0.        , 0.        , 0.29360705, 0.49711994,
        0.        , 0.29360705, 0.49711994, 0.29360705]])
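Putting the pieces together, the TF-IDF matrix above can be approximated by hand: multiply the raw counts by the smoothed IDF weights, then normalize each row by its Euclidean norm. The sketch below assumes the default `TfidfVectorizer` settings (`norm="l2"`, `smooth_idf=True`, `sublinear_tf=False`); the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

counts = CountVectorizer().fit_transform(sentences).toarray().astype(float)

# Smoothed IDF weight per term
n_documents = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((n_documents + 1) / (doc_freq + 1)) + 1

# Weight the counts by IDF, then l2-normalize each document row
weighted = counts * idf
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(sentences).toarray()

print(np.allclose(tfidf_manual, tfidf_sklearn))
```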


*©2021 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.*