## Bag of Words in Action

Building the vocabulary is a prerequisite for the BoW methodology.
Once the vocabulary is available, each sentence can be represented as a vector:
- the length of this vector is the size of the vocabulary.
- each entry in the vector corresponds to a term in the vocabulary.
- the number in that entry is the frequency of the term in the sentence.
- the lower limit for this number is 0, which means that the vocabulary term does not appear in the sentence.
- the upper limit for an entry is the term's total frequency in the text corpus, which is reached only when every occurrence of the term falls in a single sentence. This is a rare situation.
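
To make the points above concrete, here is a minimal sketch for a single sentence. The names `toy_vocab` and `toy_sentence` are hypothetical and are not part of the notebook's data; they only illustrate how the vector length, the zero entries and the counts arise.

```python
from collections import Counter

# Hypothetical toy vocabulary and sentence, for illustration only
toy_vocab = ["data", "language", "natural", "process"]
toy_sentence = "natural language process language"

# Count how often each token occurs in the sentence
counts = Counter(toy_sentence.split())

# One entry per vocabulary term; Counter returns 0 for terms that are absent
toy_vector = [counts[term] for term in toy_vocab]
print(toy_vector)  # [0, 2, 1, 1]
```

The vector has as many entries as the vocabulary, `data` gets `0` because it does not occur, and `language` gets `2` because it occurs twice.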

In [1]:
%run ../scripts/setup.ipynb

In [2]:
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Take in a list of sentences

In [3]:
sentences = ["We are reading about Natural Language Processing Here",
            "Natural Language Processing making computers comprehend language data",
            "The field of Natural Language Processing is evolving everyday"]

### Create a pandas Series from the list

In [4]:
corpus = pd.Series(sentences)
corpus

0    We are reading about Natural Language Processi...
1    Natural Language Processing making computers c...
2    The field of Natural Language Processing is ev...
dtype: object

### Data preprocessing

In [5]:
%run text_data_preprocessing_steps.ipynb

In [6]:
common_dot_words = ['U.S.', 'Mr.', 'Mrs.', 'D.C.']

In [7]:
# Preprocessing with Lemmatization here
preprocessed_corpus = preprocess_list(corpus, keep_list = common_dot_words, stemming = False, stem_type = None,
                                lemmatization = True, remove_stopwords = True)
preprocessed_corpus

['read natural language process',
 'natural language process make computers comprehend language data',
 'field natural language process evolve everyday']

### Building the vocabulary

In [8]:
set_of_words = set()
for sentence in preprocessed_corpus:
    for word in sentence.split():
        set_of_words.add(word)
vocab = list(set_of_words)
print(vocab)

['data', 'process', 'natural', 'comprehend', 'everyday', 'language', 'make', 'read', 'field', 'evolve', 'computers']
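
Because a Python `set` has no guaranteed iteration order, the order of `vocab` printed above may differ from run to run. One option, shown here as a sketch and not as the notebook's own approach, is to sort the vocabulary so that column positions are reproducible. The list `example_corpus` is a copy of the preprocessed output shown earlier, and `sorted_vocab` is a name introduced for this illustration.

```python
# Copy of the preprocessed sentences printed earlier, so the snippet runs on its own
example_corpus = [
    "read natural language process",
    "natural language process make computers comprehend language data",
    "field natural language process evolve everyday",
]

# Collect the unique words, then sort them to fix the order across runs
sorted_vocab = sorted({word for sentence in example_corpus for word in sentence.split()})
print(sorted_vocab)
print(len(sorted_vocab))  # 11
```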


### Fetching the position of each word in the vocabulary

In [9]:
position = {}
for i, token in enumerate(vocab):
    position[token] = i
print(position)

{'data': 0, 'process': 1, 'natural': 2, 'comprehend': 3, 'everyday': 4, 'language': 5, 'make': 6, 'read': 7, 'field': 8, 'evolve': 9, 'computers': 10}


### Creating a matrix to hold the Bag of Words representation

The shape of the matrix is (number of sentences × length of vocabulary), which is 3 × 11 for this corpus.

In [10]:
bow_matrix = np.zeros((len(preprocessed_corpus), len(vocab)))

For every word in a sentence, increment the count at that word's positional index by 1 each time the word appears.

In [11]:
for i, preprocessed_sentence in enumerate(preprocessed_corpus):
    for token in preprocessed_sentence.split():
        # Row i is the sentence, column position[token] is the vocabulary term
        bow_matrix[i, position[token]] += 1

### Let's look at our Bag of Words representation

In [12]:
bow_matrix

array([[0., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0.],
       [1., 1., 1., 1., 0., 2., 1., 0., 0., 0., 1.],
       [0., 1., 1., 0., 1., 1., 0., 0., 1., 1., 0.]])
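
The raw array is easier to read when each count is paired with its vocabulary term. The following is a minimal sketch that assumes the `vocab` order and the `bow_matrix` values printed above; both are copied into the hypothetical names `vocab_copy` and `bow_rows` so that the snippet runs on its own.

```python
# Vocabulary order and matrix rows copied from the outputs above
vocab_copy = ['data', 'process', 'natural', 'comprehend', 'everyday',
              'language', 'make', 'read', 'field', 'evolve', 'computers']
bow_rows = [
    [0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 2, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0],
]

# For each sentence, keep only the terms with a non-zero count
labeled = [
    {term: count for term, count in zip(vocab_copy, row) if count}
    for row in bow_rows
]
for i, terms in enumerate(labeled):
    print(i, terms)
```

Each printed dictionary lists only the terms present in that sentence, which makes it straightforward to check a row against its preprocessed text.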

In [13]:
sentences = ["We are reading about Natural Language Processing Here",
            "Natural Language Processing making computers comprehend language data",
            "The field of Natural Language Processing is evolving everyday"]

Taking column `5` of the `bow_matrix` as an example, the values are `1`, `2` and `1` respectively.

Column `5` corresponds to index `5` in `position`, which maps to the word `language`.

`language` occurs once, twice and once in sentences 1, 2 and 3 respectively.
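
These counts can be checked directly against the preprocessed text. The sketch below copies the preprocessed sentences into the hypothetical name `example_corpus` and counts the token `language` in each one, independently of the matrix.

```python
# Copy of the preprocessed sentences printed earlier
example_corpus = [
    "read natural language process",
    "natural language process make computers comprehend language data",
    "field natural language process evolve everyday",
]

# Count the occurrences of "language" in each sentence
language_counts = [sentence.split().count("language") for sentence in example_corpus]
print(language_counts)  # [1, 2, 1]
```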