<a href="https://colab.research.google.com/github/sidharth178/Natural-Language-Processing-Tutorial/blob/master/9_BagofWords_N_Gram.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Bag Of Words**

- Bag of words is a Natural Language Processing technique of **text modelling**. In technical terms, we can say that it is a method of **feature extraction** with text data. This approach is a simple and flexible way of extracting features from documents.

- A bag of words is a representation of text that describes the occurrence of words within a document. We just keep track of word counts and **disregard the grammatical details and the word order**.
-  It is **called a “bag”** of words because any information about the order or structure of words in the document is discarded.
- The model is only concerned with whether known words occur in the document, not where in the document.

### **Why is the Bag-of-Words algorithm used?**
- Raw text is **messy, unstructured and variable in length**, while machine learning algorithms prefer **structured, well-defined, fixed-length inputs**. The Bag-of-Words technique converts variable-length texts into fixed-length vectors.

- The machine learning models work with **numerical data** rather than textual data. By using the bag-of-words (BoW) technique, we convert a text into its equivalent **vector of numbers**.

Sentence 1: “The Cat sat here”

Sentence 2: “The Cat sat in the hat”

Sentence 3: “The Cat is with the hat”

**Step 1:** Convert the above sentences to lower case, since the case of a word carries no information here.

**Step 2:** Remove special characters and stopwords from the text. In this toy example only “here” and “is” are dropped; “the”, “in” and “with” are kept.

After applying the above steps, the sentences become:

Sentence 1: “the cat sat”

Sentence 2: “the cat sat in the hat”

Sentence 3: “the cat with the hat”

Although the resulting sentences do not read well, most of the information is still carried by the remaining words.

**Step 3:** Go through all the words in the above text and make a list of all of the words in our model vocabulary.
- ["the", "cat", "sat","in","hat","with"]

Now as the vocabulary has only 6 words, we can use a fixed-length document-representation of 6, with one position in the vector to score each word. 

Sentence 1: [1,1,1,0,0,0] 

Sentence 2: [2,1,1,1,1,0] 

Sentence 3: [2,1,0,0,1,1]

<img src = "https://miro.medium.com/max/1400/1*3IACMnNpwVlCl8kSTJocPA.png"></img>
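The three steps above can be sketched in plain Python. This is a minimal illustration that starts from the cleaned sentences and assumes the vocabulary is ordered by first appearance, which matches the list shown in Step 3; the variable names are chosen for this sketch only.

```python
# Cleaned sentences from the worked example above
sentences = [
    "the cat sat",
    "the cat sat in the hat",
    "the cat with the hat",
]

# Build the vocabulary in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Score each sentence: one count per vocabulary word
vectors = [
    [sentence.split().count(word) for word in vocabulary]
    for sentence in sentences
]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every sentence maps to a vector of length 6, regardless of how many words it contains.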


### **Disadvantages**
- Bag of words gives every word the same weight. For example, in "His brother is good." the word "good" carries more meaning than "brother", yet both are counted equally.
- It disregards grammatical details and word order when storing words.
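The loss of word order can be seen with a small sketch. The two example sentences below are hypothetical (they are not from the text above), and `bag` is a helper defined only for this illustration; it counts words with `collections.Counter` after lower-casing and splitting on whitespace.

```python
from collections import Counter

def bag(sentence):
    """Return word counts for a sentence, ignoring case and word order."""
    return Counter(sentence.lower().split())

# Two sentences with opposite meanings but the same words
a = bag("the dog bit the man")
b = bag("the man bit the dog")

print(a)
print(a == b)
```

Both sentences produce identical counts, so a bag-of-words model cannot tell them apart.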

##  **CountVectorizer** 
`CountVectorizer` converts a collection of text documents into a matrix of token counts: one row per document and one column per vocabulary word.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

In [None]:
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)


{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}


In [None]:
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

(1, 8)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


In the array above, the last entry is 2 because "the" (index 7 in the vocabulary) appears twice in the sentence.
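To make the link between `vocabulary_` and the columns of the array explicit, here is a rough pure-Python stand-in for the default behaviour used above. It assumes the defaults of `CountVectorizer` (lower-casing, tokens of two or more word characters, vocabulary indexed in alphabetical order); `count_vectorize` is a hypothetical helper for illustration, not part of scikit-learn.

```python
import re
from collections import Counter

def count_vectorize(documents):
    """Rough stand-in for CountVectorizer defaults (assumption: lowercase, 2+ character tokens)."""
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    # Column indices follow the alphabetical order of the vocabulary
    terms = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {term: index for index, term in enumerate(terms)}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in terms])
    return vocabulary, rows

text = ["The quick brown fox jumped over the lazy dog."]
vocabulary, rows = count_vectorize(text)
print(vocabulary)
print(rows)
```

The word "the" sorts last alphabetically, which is why its count sits in the final column.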

In [None]:
import numpy as np
import re

'''The first function we implement extracts the words from a document using a regular expression.
 As we do so, we convert all words to lower case and exclude our stop words.'''

def extract_words(sentence):
    ignore_words = ['a']
    # replace every non-word character with a space, then split on whitespace
    words = re.sub(r"[^\w]", " ", sentence).split()  # alternative: nltk.word_tokenize(sentence)
    # lower-case each word and drop the stop words (compared case-insensitively)
    words_cleaned = [w.lower() for w in words if w.lower() not in ignore_words]
    return words_cleaned

In [None]:
'''Next, we implement our tokenize_sentences function. This function builds our vocabulary by looping through
 all our documents (sentences), extracting the words from each, removing duplicates using the set function and 
 returning a sorted list of words.'''

def tokenize_sentences(sentences):
    words = []
    for sentence in sentences:
        w = extract_words(sentence)
        words.extend(w)

    words = sorted(set(words))
    return words

In [None]:
'''Our last function is the implementation of the bag of words model. This function takes an input of a sentence 
and words (our vocabulary). It then extracts the words from the input sentence using the previously defined function. 
It creates a vector of zeros using numpy zeros function with a length of the number of words in our vocabulary.
Finally, it increments the entry of each vocabulary word that occurs in the sentence.'''

def bagofwords(sentence, words):
    sentence_words = extract_words(sentence)
    # frequency word count
    bag = np.zeros(len(words))
    for sw in sentence_words:
        for i, word in enumerate(words):
            if word == sw:
                bag[i] += 1

    return bag

In [None]:


sentences = ["Machine learning is great","Natural Language Processing is a complex field",
"Natural Language Processing is used in machine learning"]

In [None]:
vocabulary = tokenize_sentences(sentences)
bagofwords("Machine learning is great", vocabulary)

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None, max_features = 5000) 

train_data_features = vectorizer.fit_transform(sentences)

vectorizer.transform(["Machine learning is great","Natural Language Processing is a complex field",
"Natural Language Processing is used in machine learning"]).toarray()


array([[0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0],
       [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]])


## **Bag Of Words - KN**

In [None]:
import nltk
nltk.download('popular')

In [2]:
paragraph =  """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""
               

In [3]:
               
# Cleaning the texts
import re # used for regular expression
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

ps = PorterStemmer()
wordnet=WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)
corpus = []
stop_words = set(stopwords.words('english'))  # build the stop word set once, not per word
for i in range(len(sentences)):
    # replace every character that is not a letter (a-z, A-Z) with a space
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])

    review = review.lower()
    review = review.split()

    # drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
    
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500,ngram_range = (1,1))
# cv = CountVectorizer(max_features = 1500,ngram_range = (1,1),stop_words = 'english' )
X = cv.fit_transform(corpus).toarray()



In [4]:
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
        1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 1, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0

# **What are N-Grams?**

**Sentence 1: “This is a good job. I will not miss it for anything”** 

**Sentence 2: “This is not good at all”** 

For this example, let us restrict the vocabulary to five words:
**[ good, job, miss, not, all ]** 

So, the respective vectors for these sentences are:

“This is a good job. I will not miss it for anything”= **[1,1,1,1,0]** 

“This is not good at all”= **[1,0,0,1,1]**

Can you see the problem here? Sentence 1 is positive and Sentence 2 is negative, but the vectors do not reflect this. Both vectors count "good" and "not", and nothing records that "not" applies to "good" only in Sentence 2. N-grams address this problem.

- **An N-gram is a sequence of N consecutive tokens**: a 2-gram (more commonly called a **bigram**) is a two-word sequence like “really good”, “not good”, or “your homework”, and a 3-gram (more commonly called a **trigram**) is a three-word sequence like “not at all” or “turn off light”.

For example, the bigrams in Sentence 2 above, “This is not good at all”, are as follows:

“This is” 

“is not”  

“not good” 

“good at” 

“at all”
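The bigram list above can be generated with a sliding window over the tokens. The sketch below is a minimal illustration: `ngrams` and `words` are hypothetical helpers written for this example, and the tokenization (lower-casing and keeping runs of word characters) is an assumption rather than the only possible choice.

```python
import re

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def words(sentence):
    """Lower-case a sentence and keep only runs of word characters."""
    return re.findall(r"\w+", sentence.lower())

sentence_1 = "This is a good job. I will not miss it for anything"
sentence_2 = "This is not good at all"

bigrams_1 = ngrams(words(sentence_1), 2)
bigrams_2 = ngrams(words(sentence_2), 2)
trigrams_2 = ngrams(words(sentence_2), 3)

print(bigrams_2)
print(trigrams_2)
print("not good" in bigrams_1, "not good" in bigrams_2)
```

With bigrams as features, “not good” occurs only in Sentence 2, so the two sentences are no longer described by the same "good" and "not" counts alone.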