# Bag of Words

## Step #1 : Preprocessing the text

We first preprocess the text in order to:
- Convert it to lower case.
- Replace every non-word character, which includes punctuation, with a space.
- Collapse runs of whitespace into a single space.
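
These steps can be tried on a single string without NLTK. The following is only a minimal sketch: the helper name `clean_sentence` is hypothetical, and the final `strip()` is an addition that the notebook cell does not perform.

```python
import re

def clean_sentence(sentence):
    """Lower-case a sentence and replace non-word characters with spaces."""
    sentence = sentence.lower()
    sentence = re.sub(r'\W', ' ', sentence)   # \W matches anything that is not a letter, digit or underscore
    sentence = re.sub(r'\s+', ' ', sentence)  # collapse runs of whitespace into one space
    return sentence.strip()                   # assumption: also drop leading/trailing spaces

print(clean_sentence("Deep learning, it's popular!"))  # deep learning it s popular
```

Note that the apostrophe is a non-word character too, so "it's" is split into "it" and "s". Whether that is acceptable depends on the task.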


In [None]:
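import re

import nltk
import numpy as np

# sent_tokenize needs the Punkt tokenizer data; download it once.
# Newer NLTK releases may ask for 'punkt_tab' instead (follow the LookupError message).
nltk.download('punkt')

text = 'Hi there, I am working in text data. Deep learning methods are popular for natural language, primarily because they are delivering on their promise. Some of the first large demonstrations of the power of deep learning were in natural language processing, specifically speech recognition. More recently in machine translation. '

# Split the text into sentences, then normalize each sentence in place.
dataset = nltk.sent_tokenize(text)
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()                # lower case
    dataset[i] = re.sub(r'\W', ' ', dataset[i])    # non-word characters -> space
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])   # collapse whitespace
dataset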


If the Punkt tokenizer data is not installed, `nltk.sent_tokenize` raises a `LookupError` like the one below. Running `nltk.download('punkt')` once resolves it.

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\MYCOM/nltk_data'
    - 'c:\\Users\\MYCOM\\anaconda3\\envs\\myenv\\nltk_data'
    - 'c:\\Users\\MYCOM\\anaconda3\\envs\\myenv\\share\\nltk_data'
    - 'c:\\Users\\MYCOM\\anaconda3\\envs\\myenv\\lib\\nltk_data'
    - 'C:\\Users\\MYCOM\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************


You can further preprocess the text to suit your needs.

## Step #2 : Obtaining most frequent words in our text.

We apply the following steps to count how often each word occurs:
- Declare a dictionary to hold our bag of words.
- Tokenize each sentence into words.
- For each word in a sentence, check whether it already exists in the dictionary.
- If it does, increment its count by 1. If it doesn't, add it to the dictionary with a count of 1.

In [None]:
word2count = {}  # maps each word to its number of occurrences
for data in dataset:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1

In [None]:
word2count
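
The same counting logic can be expressed with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's method: the toy sentences are hypothetical, and plain whitespace splitting stands in for `nltk.word_tokenize` on the assumption that the sentences are already cleaned.

```python
from collections import Counter

# Assumption: sentences are already lower-cased and stripped of punctuation,
# so splitting on whitespace is enough for this illustration.
sentences = ["deep learning is popular", "deep learning for language"]

word_counts = Counter()
for sentence in sentences:
    word_counts.update(sentence.split())  # adds 1 for every word occurrence

print(word_counts["deep"])      # 2
print(word_counts["language"])  # 1
```

A `Counter` returns 0 for words it has never seen, so no explicit membership check is needed.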

With the sample text above, the dictionary holds 39 distinct words. When processing large texts, however, the number of distinct words could reach millions, and we do not need to use all of them. Instead, we keep only a fixed number of the most frequently used words. To implement this we use:

In [None]:
import heapq
freq_words = heapq.nlargest(20, word2count, key=word2count.get)
print(freq_words)

where 20 is the number of words we want to keep. For a larger text, choose a larger number.
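
To see what `heapq.nlargest` does with a dictionary, here is a small sketch on hypothetical toy counts (not taken from the sample text). Iterating a dictionary yields its keys, and `key=toy_counts.get` ranks those keys by their counts.

```python
import heapq

# Hypothetical toy counts, used only for illustration.
toy_counts = {"the": 3, "of": 3, "deep": 2, "speech": 1}

# Pick the two keys with the highest counts.
top_two = heapq.nlargest(2, toy_counts, key=toy_counts.get)

# Equivalent formulation: sort all keys by count, then slice.
same_as_sorted = sorted(toy_counts, key=toy_counts.get, reverse=True)[:2]

print(top_two)  # ['the', 'of']
```

For a small `n` relative to the vocabulary size, `nlargest` avoids sorting every key, which is why it suits large texts.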


## Step #3 : Building the Bag of Words model

In this step we build one vector per sentence, with one entry for each frequent word. An entry is 1 if that frequent word occurs in the sentence and 0 otherwise.

This can be implemented with the following code:


In [None]:
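X = []
for data in dataset:
    tokens = set(nltk.word_tokenize(data))  # tokenize each sentence only once
    # 1 if the frequent word occurs in this sentence, else 0
    vector = [1 if word in tokens else 0 for word in freq_words]
    X.append(vector)
X = np.asarray(X)  # shape: (number of sentences, number of frequent words)

print(X)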
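
To make the shape of the result concrete, here is a minimal sketch on hypothetical toy data. It uses plain lists and whitespace splitting in place of NumPy and `nltk.word_tokenize`, on the assumption that the sentences are already cleaned.

```python
# Hypothetical toy data: two cleaned sentences and three "frequent" words.
toy_sentences = ["deep learning is popular", "speech recognition is hard"]
toy_freq_words = ["is", "deep", "speech"]

toy_vectors = []
for sentence in toy_sentences:
    tokens = set(sentence.split())  # assumption: whitespace split is sufficient here
    # One entry per frequent word: 1 if present in the sentence, else 0.
    toy_vectors.append([1 if word in tokens else 0 for word in toy_freq_words])

print(toy_vectors)  # [[1, 1, 0], [1, 0, 1]]
```

Each row corresponds to a sentence and each column to a frequent word. Because only presence is recorded, a word that occurs several times in a sentence still contributes a single 1.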