# Bag of Words

### Steps to get the Bag of Words
    Step 1 : Lowercasing the sentences and removing stop words
    Step 2 : Stemming
    Step 3 : Lemmatization
    Step 4 : Counting the frequency of each word (creating a histogram of the words)
    Step 5 : Converting the bag of words to vectors

### Example :-
    
    Sentence                    Lowercased               After stop-word removal
    He is boy.                  he is boy                boy
    She is good girl.           she is good girl         good, girl
    
    
    Bag of Words :-
    
    word  count
    boy     1                    
    good    1
    girl    1
    
                        good      boy      girl
    sent 1               0         1        0
    sent 2               1         0        1
    
    This is how we convert text to a numerical representation.
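
The toy example above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the stop-word set `STOP_WORDS` and the helper `clean` are hypothetical names introduced here for illustration, and the column order is copied from the table rather than computed.

```python
from collections import Counter

# Hypothetical, hand-picked stop-word set for this toy example only;
# the notebook cells below use NLTK's English list instead.
STOP_WORDS = {"he", "she", "is"}

sentences = ["He is boy.", "She is good girl."]

def clean(sentence):
    """Lowercase, drop the period, and remove stop words."""
    words = sentence.lower().replace(".", "").split()
    return [w for w in words if w not in STOP_WORDS]

cleaned = [clean(s) for s in sentences]

# Word counts over the whole corpus (the "histogram").
counts = Counter(w for words in cleaned for w in words)

# Column order taken from the table above.
vocabulary = ["good", "boy", "girl"]

# One row per sentence, one column per vocabulary word.
vectors = [[words.count(v) for v in vocabulary] for words in cleaned]

print(cleaned)
print(dict(counts))
print(vectors)
```

Each row of `vectors` corresponds to one sentence, matching the `sent 1` and `sent 2` rows of the table.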

## Disadvantages of Bag of Words

### 1. Every word is represented only by its count (1 or 0 in the example above), so all words carry the same weight.


### 2. Hence we cannot determine which words matter most when deciding whether a sentence is positive or negative.

### 3. To solve this problem, we use TF-IDF (Term Frequency and Inverse Document Frequency).
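
As a rough illustration of how TF-IDF weights words differently, the sketch below uses one common textbook variant: term frequency is the word's share of the sentence, and inverse document frequency is `log(N / df)`. The corpus `docs` is hypothetical and is not taken from this notebook. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed, normalized variant by default, so their numbers will differ.

```python
import math

# Hypothetical cleaned corpus (not from the notebook): "good" occurs in every sentence.
docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocabulary = sorted({w for d in docs for w in d})  # ['boy', 'girl', 'good']

def tf(word, doc):
    # Term frequency: share of the sentence's words that are `word`.
    return doc.count(word) / len(doc)

def idf(word):
    # Inverse document frequency: log(N / number of sentences containing `word`).
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

# One row per sentence, one column per vocabulary word.
tfidf = [[round(tf(w, d) * idf(w), 3) for w in vocabulary] for d in docs]

for row in tfidf:
    print(row)
```

Because "good" appears in every sentence, its weight is zero everywhere, while "boy" and "girl" keep a positive weight in the sentences that contain them. Plain bag-of-words counts would have treated all three words alike.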

In [16]:
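import nltk
#one-time setup if the NLTK data is missing: nltk.download() for 'punkt', 'stopwords' and 'wordnet'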

In [17]:
#defining variable paragraph

paragraph = """Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence, information engineering, and human-computer interaction. 
                This field focuses on how to program computers to process and analyze large amounts of natural language data. 
                It is difficult to perform as the process of reading and understanding languages is far more complex than it seems at first glance. 
                Tokenization is the process of tokenizing or splitting a string, text into a list of tokens. 
                One can think of token as parts like a word is a token in a sentence, and a sentence is a token in a paragraph."""

In [18]:
#re is used to clean the text with regular expressions
import re

#PorterStemmer is used to reduce each word to its stem
from nltk.stem.porter import PorterStemmer

#WordNetLemmatizer is used to reduce each word to its lemma (dictionary form)
from nltk.stem import WordNetLemmatizer

#stopwords lets us remove very common words like 'for', 'then', 'from', 'and',
#which add little value when identifying a sentence
from nltk.corpus import stopwords

In [19]:
#creating objects for PorterStemmer
stemmer = PorterStemmer()

#creating objects for WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

#splitting the paragraph into sentences
sentances = nltk.sent_tokenize(paragraph)

#After cleaning the text we are going to store final result to corpus
corpus = []

In [5]:
#Cleaning the text and stemming
#building the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

for i in range(len(sentances)):
    
    #replacing every non-letter character (such as '.' and ',') with a space
    review = re.sub('[^a-zA-Z]',' ', sentances[i])      
    
    #lowercasing the sentence
    review = review.lower()
    
    #splitting the sentence into words
    review = review.split()
    
    #list comprehension (stemming every word that is not a stop word)
    review = [stemmer.stem(word) for word in review if word not in stop_words]
    
    #joining the list of words back into one string
    review = ' '.join(review)
    
    #appending the cleaned sentence to corpus
    corpus.append(review)
    
corpus

['natur languag process nlp subfield comput scienc artifici intellig inform engin human comput interact',
 'field focus program comput process analyz larg amount natur languag data',
 'difficult perform process read understand languag far complex seem first glanc',
 'token process token split string text list token',
 'one think token part like word token sentenc sentenc token paragraph']

### The stemmed words are not meaningful words, hence we use lemmatization

In [20]:
#the stemmed words are not clear, hence we do lemmatization.

#Cleaning the text and lemmatizing
#resetting corpus so the stemmed sentences from the previous loop are not kept
corpus = []

#building the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

for i in range(len(sentances)):
    
    #replacing every non-letter character (such as '.' and ',') with a space
    review = re.sub('[^a-zA-Z]',' ', sentances[i])      
    
    #lowercasing the sentence
    review = review.lower()
    
    #splitting the sentence into words
    review = review.split()
    
    #list comprehension (lemmatizing every word that is not a stop word)
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    
    #joining the list of words back into one string
    review = ' '.join(review)
    
    #appending the cleaned sentence to corpus
    corpus.append(review)
    
corpus

['natural language processing nlp subfield computer science artificial intelligence information engineering human computer interaction',
 'field focus program computer process analyze large amount natural language data',
 'difficult perform process reading understanding language far complex seems first glance',
 'tokenization process tokenizing splitting string text list token',
 'one think token part like word token sentence sentence token paragraph']
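
The cleaning steps shared by both loops above (regex, lowercasing, splitting, stop-word filtering) can be checked in isolation on a single sentence. The sketch below is a simplification that runs without NLTK data: `STOP_WORDS` is a small hand-picked set assumed for illustration, not NLTK's English list, and no stemming or lemmatization is applied.

```python
import re

# Small hand-picked stop-word set so the sketch runs without NLTK data;
# the notebook uses stopwords.words('english') instead.
STOP_WORDS = {"is", "the", "of", "or", "a", "into"}

sentence = ("Tokenization is the process of tokenizing or splitting "
            "a string, text into a list of tokens.")

# Replace everything that is not a letter with a space, then lowercase and split.
words = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()

# Keep only the words that are not stop words.
kept = [w for w in words if w not in STOP_WORDS]

print(kept)
```

The result still contains the plural "tokens"; in the notebook output above, the lemmatizer is what turns it into "token".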

In [21]:
#Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

#creating object
cv = CountVectorizer(max_features=1500)

#creating matrix
X = cv.fit_transform(corpus).toarray()

In [22]:
#Displaying the Bag of Words
X

array([[0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1,
        1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
        0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 3, 0, 0, 0, 1]],
      dtype=int64)
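
The matrix is easier to read once each column is tied to a word. `CountVectorizer` orders its columns alphabetically by vocabulary word, and in scikit-learn 1.0 or later `cv.get_feature_names_out()` returns those words. The sketch below rebuilds the same mapping in plain Python from the lemmatized corpus shown earlier. It assumes that a whitespace split matches the vectorizer's default tokenization, which holds here only because the corpus is already lowercased and contains no one-letter tokens.

```python
from collections import Counter

# Lemmatized corpus copied from the output above.
corpus = [
    'natural language processing nlp subfield computer science artificial intelligence information engineering human computer interaction',
    'field focus program computer process analyze large amount natural language data',
    'difficult perform process reading understanding language far complex seems first glance',
    'tokenization process tokenizing splitting string text list token',
    'one think token part like word token sentence sentence token paragraph',
]

# Alphabetically sorted vocabulary, mirroring the column order of CountVectorizer.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})

# Count matrix: one row per sentence, one column per vocabulary word.
matrix = []
for sentence in corpus:
    counts = Counter(sentence.split())
    matrix.append([counts[word] for word in vocabulary])

# Non-zero entries of the last sentence as word -> count.
last = {word: count for word, count in zip(vocabulary, matrix[-1]) if count}

print(len(vocabulary))
print(last)
```

Read this way, the `3` in the last row of `X` is the count of "token" and the `2` is the count of "sentence".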

## Result :- We get a 5 x 44 count matrix, with one row per sentence and one column per vocabulary word.