### Description

This notebook demonstrates **Bag of Words**, a method of text representation used in NLP-related problems and data analysis.

It treats each document or text as an unordered collection of words: word order is discarded and only the occurrences of each word are kept. For example, if the goal is to analyze tweets from Twitter, then each separate tweet is a document.
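To make the "unordered collection" idea concrete, here is a minimal sketch using only the Python standard library. The helper `bag_of_words` and the two example sentences are illustrative assumptions and are not part of the notebook's pipeline.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercased, whitespace-separated words; word order is discarded."""
    return Counter(text.lower().split())


first = bag_of_words("the dog chases the cat")
second = bag_of_words("the cat chases the dog")

print(first["the"])     # 2
print(first == second)  # True: same words, different order, identical bag
```

The two sentences mean different things, yet their bags are equal. This loss of word order is the main limitation of the representation.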

### 1. Data

In [13]:
corpus = [
    "The dog barks in the morning.",
    "Over the sofa lies sleeping dog.",
    "My dog name is Farell, it is very energetic.",
    "The dog barks at the cars.",
    "Cat dislikes vegetables.",
    "Cats sleep during day and hunt during night.",
    "Cats and dogs are not getting along. I prefer cats."
]

### 2. Cleaning corpus

In [2]:
import re

import nltk
for package in ["punkt", "wordnet", "stopwords"]:
    nltk.download(package)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

porter_stemmer = PorterStemmer()
wodnet_lemmatizer = WordNetLemmatizer()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\RhysL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\RhysL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\RhysL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:

def normalize_document(document, stemmer=porter_stemmer, lemmatizer=wodnet_lemmatizer):
    """Normalizes a document by performing the following steps:
        1. Changing the text to lowercase.
        2. Replacing special characters and punctuation with spaces.
        3. Dividing the text into tokens.
        4. Removing English stopwords.
        5. Stemming tokens.
        6. Lemmatizing tokens.
    """
    # Build the stopword set once per call instead of once per token
    english_stopwords = set(stopwords.words("english"))

    temp = document.lower()
    temp = re.sub(r"[^a-zA-Z0-9]", " ", temp)
    temp = word_tokenize(temp)
    temp = [token for token in temp if token not in english_stopwords]
    # Use the stemmer passed as an argument rather than the global one
    temp = [stemmer.stem(token) for token in temp]
    temp = [lemmatizer.lemmatize(token) for token in temp]

    return temp

Previewing results.

In [4]:
offset = max(map(len, corpus))
for document in corpus:
    print(document.rjust(offset), " -> ", normalize_document(document))

                      The dog barks in the morning.  ->  ['dog', 'bark', 'morn']
                   Over the sofa lies sleeping dog.  ->  ['sofa', 'lie', 'sleep', 'dog']
       My dog name is Farell, it is very energetic.  ->  ['dog', 'name', 'farel', 'energet']
                         The dog barks at the cars.  ->  ['dog', 'bark', 'car']
                           Cat dislikes vegetables.  ->  ['cat', 'dislik', 'veget']
       Cats sleep during day and hunt during night.  ->  ['cat', 'sleep', 'day', 'hunt', 'night']
Cats and dogs are not getting along. I prefer cats.  ->  ['cat', 'dog', 'get', 'along', 'prefer', 'cat']


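The output shows which tokens remain from each sentence: stopwords and punctuation are gone, and the remaining words are reduced to their stems (for example, "barks" becomes "bark" and "morning" becomes "morn").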
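To see what stemming and lemmatization contribute, it can help to run only the first four steps. The sketch below does that with the standard library alone. `simple_normalize` and `TOY_STOPWORDS` are hypothetical names, the stopword list is deliberately tiny and is not NLTK's English list, and `str.split` stands in for `word_tokenize`.

```python
import re

# Hypothetical, deliberately small stopword list (NLTK's English list is much longer)
TOY_STOPWORDS = {"the", "in", "at", "is", "it", "my", "very", "over", "and"}


def simple_normalize(document, stop_words=TOY_STOPWORDS):
    """Lowercase, strip non-alphanumerics, split on whitespace, drop stopwords."""
    text = document.lower()
    text = re.sub(r"[^a-z0-9]", " ", text)  # punctuation becomes a space
    tokens = text.split()                    # simple stand-in for word_tokenize
    return [token for token in tokens if token not in stop_words]


print(simple_normalize("The dog barks in the morning."))
# ['dog', 'barks', 'morning']
```

Compared with the notebook's output `['dog', 'bark', 'morn']`, the tokens keep their inflected forms, so "barks" and "bark" would end up as two separate entries in the bag.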

### 3. Creating Bag of Words

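Initializing the `CountVectorizer` model with `normalize_document` as a custom tokenizer (so English stopwords are removed during tokenization), fitting it on the corpus, and previewing the tokens in the bag.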

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# Using CountVectorizer with the custom tokenizer
bow = CountVectorizer(tokenizer=normalize_document)
bow.fit(corpus)  # Fitting text to this model
print(bow.get_feature_names_out())  # Key terms

['along' 'bark' 'car' 'cat' 'day' 'dislik' 'dog' 'energet' 'farel' 'get'
 'hunt' 'lie' 'morn' 'name' 'night' 'prefer' 'sleep' 'sofa' 'veget']





The bag contains 19 tokens, so each sentence is represented by a vector of length 19 in which each position holds the count of one token from the bag.

In [17]:
corpus_vectorized = bow.transform(corpus)

In [18]:
offset = max(map(len, corpus))
for document, document_vector in zip(corpus, corpus_vectorized.toarray()):
    print(document.rjust(offset), " -> ", document_vector)

                      The dog barks in the morning.  ->  [0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0]
                   Over the sofa lies sleeping dog.  ->  [0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0]
       My dog name is Farell, it is very energetic.  ->  [0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0]
                         The dog barks at the cars.  ->  [0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
                           Cat dislikes vegetables.  ->  [0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1]
       Cats sleep during day and hunt during night.  ->  [0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0]
Cats and dogs are not getting along. I prefer cats.  ->  [1 0 0 2 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0]


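Each vector now represents one sentence of the corpus: position *i* holds the number of times the *i*-th token of the bag (in the order printed by `get_feature_names_out()`) occurs in that sentence.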
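The same vectors can be reproduced by hand, which makes the mapping from tokens to positions explicit. This is a sketch of the counting idea, not of `CountVectorizer`'s internals. The token lists are copied from the normalization preview earlier in the notebook, and `count_vector` is a hypothetical helper.

```python
from collections import Counter

# Token lists copied from the normalization preview
normalized_corpus = [
    ["dog", "bark", "morn"],
    ["sofa", "lie", "sleep", "dog"],
    ["dog", "name", "farel", "energet"],
    ["dog", "bark", "car"],
    ["cat", "dislik", "veget"],
    ["cat", "sleep", "day", "hunt", "night"],
    ["cat", "dog", "get", "along", "prefer", "cat"],
]

# The bag: every distinct token, sorted alphabetically
vocabulary = sorted({token for tokens in normalized_corpus for token in tokens})


def count_vector(tokens, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vectors = [count_vector(tokens, vocabulary) for tokens in normalized_corpus]

print(len(vocabulary))  # 19
print(vectors[-1])
# [1, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```

The last vector matches the row printed for "Cats and dogs are not getting along. I prefer cats.", including the count of 2 at the position of `cat`.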

### One-Hot Encoding

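Sometimes only the presence of a word from the dictionary/bag in a sentence matters, not how many times it occurs. Setting `binary=True` sets every non-zero count to 1. In this corpus only the last sentence changes, because "cat" occurs in it twice.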

In [19]:
bow_ohe = CountVectorizer(tokenizer=normalize_document, binary=True)
corpus_vectorized_ohe = bow_ohe.fit_transform(corpus)

In [22]:
offset = max(map(len, corpus))
for document, document_vector in zip(corpus, corpus_vectorized_ohe.toarray()):
    print(document.rjust(offset), " -> ", document_vector)

                      The dog barks in the morning.  ->  [0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0]
                   Over the sofa lies sleeping dog.  ->  [0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0]
       My dog name is Farell, it is very energetic.  ->  [0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0]
                         The dog barks at the cars.  ->  [0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
                           Cat dislikes vegetables.  ->  [0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1]
       Cats sleep during day and hunt during night.  ->  [0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0]
Cats and dogs are not getting along. I prefer cats.  ->  [1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0]
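The relation between the count vectors and the binary vectors can be sketched in one line: every non-zero count is capped at 1. The snippet below applies this to the count vector of the last sentence, copied from the earlier output. It only illustrates the effect of `binary=True` and is not how scikit-learn implements it.

```python
# Count vector of the last sentence, copied from the count-based output
count_vector = [1, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]

# Presence/absence: any non-zero count becomes 1
binary_vector = [min(count, 1) for count in count_vector]

# Positions where the two representations differ
changed = [i for i, (c, b) in enumerate(zip(count_vector, binary_vector)) if c != b]

print(binary_vector)
# [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(changed)  # [3], the position of 'cat'
```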


### Ngrams

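N-grams create tokens from groups of consecutive words, which preserves some information about their context. The `ngram_range=(min_n, max_n)` parameter selects which n-gram lengths are included in the bag.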
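A minimal sketch of how word n-grams can be formed from an already normalized token list follows. `word_ngrams` is a hypothetical helper written for illustration, and the tokens are those of the first sentence of the corpus.

```python
def word_ngrams(tokens, min_n, max_n):
    """Join every run of n consecutive tokens, for each n in [min_n, max_n]."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = ["dog", "bark", "morn"]  # "The dog barks in the morning." after normalization

print(word_ngrams(tokens, 1, 2))
# ['dog', 'bark', 'morn', 'dog bark', 'bark morn']
print(word_ngrams(tokens, 2, 3))
# ['dog bark', 'bark morn', 'dog bark morn']
```

Because n-grams are built after normalization, removed stopwords do not separate the remaining tokens: "bark morn" is a bigram even though "in the" stood between the two words in the original sentence.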

- ngram range (1,2)

In [11]:
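bow_ngram_12 = CountVectorizer(tokenizer=normalize_document, ngram_range=(1,2))
bow_ngram_12.fit(corpus)
# get_feature_names() is no longer available in recent scikit-learn versions;
# .tolist() prints the feature names as a plain Python list
print(bow_ngram_12.get_feature_names_out().tolist())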

['along', 'along prefer', 'bark', 'bark car', 'bark morn', 'car', 'cat', 'cat dislik', 'cat dog', 'cat sleep', 'day', 'day hunt', 'dislik', 'dislik veget', 'dog', 'dog bark', 'dog get', 'dog name', 'energet', 'farel', 'farel energet', 'get', 'get along', 'hunt', 'hunt night', 'lie', 'lie sleep', 'morn', 'name', 'name farel', 'night', 'prefer', 'prefer cat', 'sleep', 'sleep day', 'sleep dog', 'sofa', 'sofa lie', 'veget']


- ngram range (2,2)

In [12]:
bow_ngram_22 = CountVectorizer(tokenizer=normalize_document, ngram_range=(2,2))
bow_ngram_22.fit(corpus)
print(bow_ngram_22.get_feature_names_out().tolist())

['along prefer', 'bark car', 'bark morn', 'cat dislik', 'cat dog', 'cat sleep', 'day hunt', 'dislik veget', 'dog bark', 'dog get', 'dog name', 'farel energet', 'get along', 'hunt night', 'lie sleep', 'name farel', 'prefer cat', 'sleep day', 'sleep dog', 'sofa lie']


- ngram range (1,3)

In [13]:
bow_ngram_13 = CountVectorizer(tokenizer=normalize_document, ngram_range=(1,3))
bow_ngram_13.fit(corpus)
print(bow_ngram_13.get_feature_names_out().tolist())

['along', 'along prefer', 'along prefer cat', 'bark', 'bark car', 'bark morn', 'car', 'cat', 'cat dislik', 'cat dislik veget', 'cat dog', 'cat dog get', 'cat sleep', 'cat sleep day', 'day', 'day hunt', 'day hunt night', 'dislik', 'dislik veget', 'dog', 'dog bark', 'dog bark car', 'dog bark morn', 'dog get', 'dog get along', 'dog name', 'dog name farel', 'energet', 'farel', 'farel energet', 'get', 'get along', 'get along prefer', 'hunt', 'hunt night', 'lie', 'lie sleep', 'lie sleep dog', 'morn', 'name', 'name farel', 'name farel energet', 'night', 'prefer', 'prefer cat', 'sleep', 'sleep day', 'sleep day hunt', 'sleep dog', 'sofa', 'sofa lie', 'sofa lie sleep', 'veget']


- ngram range (2,3)

In [14]:
bow_ngram_23 = CountVectorizer(tokenizer=normalize_document, ngram_range=(2,3))
bow_ngram_23.fit(corpus)
print(bow_ngram_23.get_feature_names_out().tolist())

['along prefer', 'along prefer cat', 'bark car', 'bark morn', 'cat dislik', 'cat dislik veget', 'cat dog', 'cat dog get', 'cat sleep', 'cat sleep day', 'day hunt', 'day hunt night', 'dislik veget', 'dog bark', 'dog bark car', 'dog bark morn', 'dog get', 'dog get along', 'dog name', 'dog name farel', 'farel energet', 'get along', 'get along prefer', 'hunt night', 'lie sleep', 'lie sleep dog', 'name farel', 'name farel energet', 'prefer cat', 'sleep day', 'sleep day hunt', 'sleep dog', 'sofa lie', 'sofa lie sleep']
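The outputs above also show how quickly the bag grows as longer n-grams are included. The sketch below recomputes the vocabulary sizes by hand from the normalized token lists, which are copied from the normalization preview. `ngram_vocabulary` is a hypothetical helper, and the code does not call scikit-learn.

```python
# Token lists copied from the normalization preview
normalized_corpus = [
    ["dog", "bark", "morn"],
    ["sofa", "lie", "sleep", "dog"],
    ["dog", "name", "farel", "energet"],
    ["dog", "bark", "car"],
    ["cat", "dislik", "veget"],
    ["cat", "sleep", "day", "hunt", "night"],
    ["cat", "dog", "get", "along", "prefer", "cat"],
]


def ngram_vocabulary(corpus_tokens, min_n, max_n):
    """Sorted set of all n-grams with min_n <= n <= max_n found in the corpus."""
    return sorted({
        " ".join(tokens[i:i + n])
        for tokens in corpus_tokens
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    })


for ngram_range in [(1, 1), (1, 2), (2, 2), (1, 3), (2, 3)]:
    print(ngram_range, len(ngram_vocabulary(normalized_corpus, *ngram_range)))
# (1, 1) 19
# (1, 2) 39
# (2, 2) 20
# (1, 3) 53
# (2, 3) 34
```

Even on seven short sentences the vocabulary nearly triples between `(1, 1)` and `(1, 3)`. On a real corpus most of these longer n-grams occur only once, so wider ranges trade extra context for much sparser vectors.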
