# Bag-of-words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval.
In this model, a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
### Example (1): John likes to watch movies. Mary likes movies too.
BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
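
The counts in BoW1 can be reproduced with a few lines of plain Python. This is a minimal sketch, not part of the original notebook: the helper name `bag_of_words` and the letters-only tokenization rule are assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, ignoring punctuation and order.

    Hypothetical helper: a token is assumed to be any run of ASCII letters.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(tokens)


bow1 = bag_of_words("John likes to watch movies. Mary likes movies too.")
print(dict(bow1))
# {'John': 1, 'likes': 2, 'to': 1, 'watch': 1, 'movies': 2, 'Mary': 1, 'too': 1}

# Swapping the two sentences changes the word order but not the bag.
reordered = bag_of_words("Mary likes movies too. John likes to watch movies.")
print(reordered == bow1)  # True
```

The second comparison shows the defining property of the model: two texts with the same words in a different order produce the same representation.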

# Here we apply stemming and build a bag-of-words model from the famous speech given by Swami Vivekananda at the Parliament of Religions, held in Chicago on September 11, 1893

In [1]:
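import nltk

# One-time downloads used by sent_tokenize and the stop-word list
# (newer NLTK releases may also ask for the 'punkt_tab' resource).
nltk.download('punkt')
nltk.download('stopwords')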
paragraph = """Sisters and Brothers of America,
It fills my heart with joy unspeakable to rise in response to the warm and cordial welcome which you have given us. 
I thank you in the name of the most ancient order of monks in the world, I thank you in the name of the mother of religions,
and I thank you in the name of millions and millions of Hindu people of all classes and sects.
My thanks, also, to some of the speakers on this platform who, referring to the delegates from the Orient, 
have told you that these men from far-off nations may well claim the honor of bearing to different lands the idea of toleration. 
I am proud to belong to a religion which has taught the world both tolerance and universal acceptance. We believe not only in universal toleration, but we accept all religions as true. 
I am proud to belong to a nation which has sheltered the persecuted and the refugees of all religions and all nations of the earth. 
I am proud to tell you that we have gathered in our bosom the purest remnant of the Israelites, 
who came to Southern India and took refuge with us in the very year in which their holy temple was shattered to pieces by Roman tyranny. 
I am proud to belong to the religion which has sheltered and is still fostering the remnant of the grand Zoroastrian nation. 
I will quote to you, brethren, a few lines from a hymn which I remember to have repeated from my earliest boyhood, 
which is every day repeated by millions of human beings: “As the different streams having their sources in different paths which men take through different tendencies, 
various though they appear, crooked or straight, all lead to Thee.”
The present convention, which is one of the most august assemblies ever held, is in itself a vindication, 
a declaration to the world of the wonderful doctrine preached in the Gita: “Whosoever comes to Me, through whatsoever form, 
I reach him; all men are struggling through paths which in the end lead to me.” Sectarianism, bigotry, and its horrible descendant, 
fanaticism, have long possessed this beautiful earth. They have filled the earth with violence, drenched it often and often with human blood, destroyed civilization and sent whole nations to despair. 
Had it not been for these horrible demons, human society would be far more advanced than it is now. But their time is come; 
and I fervently hope that the bell that tolled this morning in honor of this convention may be the death-knell of all fanaticism, of all persecutions with the sword or with the pen, 
and of all uncharitable feelings between persons wending their way to the same goal."""

# Cleaning the texts
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
# Build the stop-word set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))
sentences = nltk.sent_tokenize(paragraph)
corpus = []
for sentence in sentences:
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower()
    review = review.split()
    # Drop stop words and reduce each remaining word to its stem
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
    
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()



In [2]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
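
Each row of `X` corresponds to one sentence and each column to one vocabulary term, with the cell holding how often that term occurs in that sentence. The sketch below rebuilds such a matrix by hand on a tiny made-up corpus so the layout is easier to see. It is an approximation for illustration only: the function name `count_matrix` and the three toy documents are assumptions, and it splits on whitespace, whereas `CountVectorizer` applies its own tokenization rules (for example, its default pattern ignores single-character tokens).

```python
from collections import Counter


def count_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i].

    Hypothetical helper: documents are assumed to be pre-cleaned,
    space-separated tokens, like the entries of `corpus` above.
    """
    tokenized = [doc.split() for doc in docs]
    # Columns are ordered alphabetically, as CountVectorizer also does.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Toy documents, invented for this example.
toy_corpus = [
    "proud belong religion",
    "proud belong nation",
    "religion nation earth earth",
]
vocabulary, rows = count_matrix(toy_corpus)
print(vocabulary)  # ['belong', 'earth', 'nation', 'proud', 'religion']
for row in rows:
    print(row)
# [1, 0, 0, 1, 1]
# [1, 0, 1, 1, 0]
# [0, 2, 1, 0, 1]
```

With the real model, `cv.get_feature_names_out()` (available in recent scikit-learn versions) returns the term that belongs to each column of `X`.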

# Application
### Spam message classifier
In Bayesian spam filtering, an e-mail message is modeled as an unordered collection of words selected from one of two probability distributions: one representing spam and one representing legitimate e-mail ("ham"). Imagine there are two literal bags full of words. One bag is filled with words found in spam messages, and the other with words found in legitimate e-mail. While any given word is likely to be somewhere in both bags, the "spam" bag will contain spam-related words such as "stock", "Viagra", and "buy" significantly more frequently, while the "ham" bag will contain more words related to the user's friends or workplace.

To classify an e-mail message, the Bayesian spam filter assumes that the message is a pile of words that has been poured out randomly from one of the two bags, and uses Bayesian probability to determine which bag it more likely came from.
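
The two-bags idea can be sketched as a small naive Bayes classifier in plain Python. This is a toy illustration under stated assumptions, not a production filter: the training messages are invented, the function names `train` and `classify` are hypothetical, add-one (Laplace) smoothing is one common choice among several, and words never seen in training are simply ignored.

```python
import math
from collections import Counter


def train(messages_by_label):
    """Fill one bag of word counts per label and estimate label priors."""
    bags = {label: Counter() for label in messages_by_label}
    for label, messages in messages_by_label.items():
        for message in messages:
            bags[label].update(message.lower().split())
    vocabulary = set().union(*bags.values())
    total = sum(len(messages) for messages in messages_by_label.values())
    priors = {label: len(messages) / total
              for label, messages in messages_by_label.items()}
    return bags, vocabulary, priors


def classify(message, bags, vocabulary, priors):
    """Return the label whose bag most plausibly produced the message."""
    # Assumption: words never seen during training carry no evidence.
    words = [w for w in message.lower().split() if w in vocabulary]
    scores = {}
    for label, bag in bags.items():
        bag_size = sum(bag.values())
        # Log probabilities avoid underflow when many words are multiplied.
        score = math.log(priors[label])
        for word in words:
            # Add-one smoothing keeps unseen (word, label) pairs above zero.
            score += math.log((bag[word] + 1) / (bag_size + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training data, for illustration only.
training = {
    "spam": ["buy cheap stock now", "buy viagra now", "cheap stock offer"],
    "ham": ["lunch with friends tomorrow", "meeting at work tomorrow",
            "friends at work"],
}
bags, vocabulary, priors = train(training)
print(classify("buy stock now", bags, vocabulary, priors))  # spam
print(classify("lunch at work", bags, vocabulary, priors))  # ham
```

Because each message is treated as an unordered pile of words, `classify` only needs the word counts in each bag, which is exactly the information a bag-of-words representation keeps.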