### Bag of Words (BOW) Model

In this model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.

The BOW model considers only whether, and how often, a known word occurs in a document. It ignores meaning, context, and the order in which words appear.

The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

A disadvantage is that the representation gives no indication of which words are more important: every word is treated the same way, regardless of its meaning or context.
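To make the idea concrete, here is a minimal sketch that uses only the Python standard library. The helper name `bag_of_words` and the sample text are assumptions made for illustration; they are not part of the notebook.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return a word -> count mapping; word order is discarded."""
    # Lowercase, then keep only runs of ASCII letters and digits
    tokens = re.findall(r"[0-9a-z]+", text.lower())
    return Counter(tokens)

# Hypothetical sample text, chosen so that some words repeat
bow = bag_of_words("Bill gave the football to Fred. Fred gave it back.")
print(bow["fred"], bow["gave"], bow["football"])  # prints: 2 2 1
```

The counts keep multiplicity ("fred" appears twice), but nothing in the mapping records which word came first.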

In [1]:
import numpy as np
import pandas as pd
import wikipedia as wp
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer,PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Let's create a small paragraph for testing
test_paragraph = "Bill travelled to the 'office' for travelled ?. Bill picked up the 'football' there. Bill went to @ the bedroom. Bill gave the football to Fred."

In [3]:
type(test_paragraph)

str

In [4]:
# Can we answer the question "What did Bill give to Fred?" by using the BOW model?
# The expected answer is "football"
# Let's apply the model

In [5]:
# First, split the paragraph into sentences
test_sentences = nltk.sent_tokenize(test_paragraph)

In [6]:
test_sentences

["Bill travelled to the 'office' for travelled ?.",
 "Bill picked up the 'football' there.",
 'Bill went to @ the bedroom.',
 'Bill gave the football to Fred.']

In [7]:
len(test_sentences)

4

In [8]:
# Clean the sentences
# Remove unwanted characters sentence by sentence
stop_words = set(stopwords.words('english'))  # build the stopword set once
test_sentences_clean = []
for sent in test_sentences:
    no_punct = re.sub("[^0-9a-zA-Z]+",' ',sent)  # replace every non-alphanumeric run with a space
    tokens = no_punct.lower().split()  # lowercase, then split the sentence into words
    tokens = [w for w in tokens if w not in stop_words]  # remove stopwords
    unique_tokens = list(set(tokens))  # drop duplicates within the sentence (set order is arbitrary)
    test_sentences_clean.append(' '.join(unique_tokens))

In [9]:
# Check the distinct words in the paragraph after cleaning
# Build vocabulary
def build_vocabulary(paragraph):
    stop_words = set(stopwords.words('english'))  # build the stopword set once
    no_punct = re.sub("[^0-9a-zA-Z]+",' ',paragraph)  # replace non-alphanumeric runs with a space
    tokens = no_punct.lower().split()
    tokens = [w for w in tokens if w not in stop_words]
    return list(set(tokens))  # distinct words only

In [10]:
# Build the vocabulary; the result gets its own name so the function is not overwritten
test_words_clean = build_vocabulary(test_paragraph)

In [11]:
test_words_clean

['travelled',
 'picked',
 'bedroom',
 'fred',
 'football',
 'gave',
 'office',
 'went',
 'bill']

In [12]:
test_sentences_clean

['travelled bill office',
 'football bill picked',
 'went bill bedroom',
 'gave bill football fred']

In [13]:
# Let's apply the BOW model
bow_test = CountVectorizer()

In [14]:
# Fit the vectorizer on the cleaned test sentences
# and convert them into a matrix of token counts
X_test = bow_test.fit_transform(test_sentences_clean).toarray()

In [15]:
type(X_test)

numpy.ndarray

In [16]:
X_test.shape

(4, 9)

In [17]:
print(X_test)

[[0 1 0 0 0 1 0 1 0]
 [0 1 1 0 0 0 1 0 0]
 [1 1 0 0 0 0 0 0 1]
 [0 1 1 1 1 0 0 0 0]]


In [18]:
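# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = bow_test.get_feature_names_out().tolist()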
features

['bedroom',
 'bill',
 'football',
 'fred',
 'gave',
 'office',
 'picked',
 'travelled',
 'went']

In [19]:
df_test = pd.DataFrame(X_test,columns=features)

In [20]:
df_test

   bedroom  bill  football  fred  gave  office  picked  travelled  went
0        0     1         0     0     0       1       0          1     0
1        0     1         1     0     0       0       1          0     0
2        1     1         0     0     0       0       0          0     1
3        0     1         1     1     1       0       0          0     0


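The length of each vector (X_test.shape[1]) equals the vocabulary size (len(test_words_clean)); here both are 9. This holds as long as the vectorizer tokenises the text the same way as the manual cleaning. For example, CountVectorizer's default token pattern ignores single-character tokens, so a vocabulary that contains them would be larger than the vector.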
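The same relationship can be checked by hand. The sketch below rebuilds the count matrix in plain Python for the four cleaned sentences shown earlier. The helper name `vectorize` is an assumption for illustration, and the vocabulary is sorted alphabetically to match the column order of the feature list above.

```python
# Cleaned sentences copied from the notebook output above
sentences = [
    "travelled bill office",
    "football bill picked",
    "went bill bedroom",
    "gave bill football fred",
]

# Vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for s in sentences for word in s.split()})

def vectorize(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    words = sentence.split()
    return [words.count(term) for term in vocabulary]

matrix = [vectorize(s, vocabulary) for s in sentences]

# Every row has exactly one entry per vocabulary word
assert all(len(row) == len(vocabulary) for row in matrix)
print(len(vocabulary), matrix[0])  # prints: 9 [0, 1, 0, 0, 0, 1, 0, 1, 0]
```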

In [21]:
test_sentences_clean

['travelled bill office',
 'football bill picked',
 'went bill bedroom',
 'gave bill football fred']

#### Limitations of BOW

The df_test table above shows that the paragraph is tokenised cleanly, but the representation has drawbacks when it is used to train an ML model. Every word that occurs in a sentence is marked with 1, so the matrix cannot tell us which words carry more weight in the paragraph.
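One reason every entry is 1 here is that the cleaning step removed repeated words within each sentence. As a hedged illustration, the sketch below counts the first test sentence with and without that step. The tiny stopword set is a stand-in assumption for NLTK's English list, used only to keep the example self-contained.

```python
import re
from collections import Counter

# Stand-in for NLTK's English stopword list (assumption, kept tiny on purpose)
STOP_WORDS = {"to", "the", "for"}

sentence = "Bill travelled to the 'office' for travelled ?."

# Same cleaning as in the notebook: strip non-alphanumerics, lowercase, split
tokens = re.sub("[^0-9a-zA-Z]+", " ", sentence).lower().split()
tokens = [w for w in tokens if w not in STOP_WORDS]

with_duplicates = Counter(tokens)          # keeps multiplicity
without_duplicates = Counter(set(tokens))  # every count collapses to 1

print(with_duplicates["travelled"], without_duplicates["travelled"])  # prints: 2 1
```

Keeping the duplicates preserves raw frequency, which is at least a crude signal of how prominent a word is within a sentence.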


Semantic meaning: the basic BOW approach does not consider the meaning of the word in the document. It completely ignores the context in which it’s used. The same word can be used in multiple places based on the context or nearby words.
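A small illustration of this limitation: the two hypothetical sentences below mean different things, yet they contain the same words and therefore get the same bag-of-words representation. The sentences and the helper `bag` are assumptions for illustration.

```python
from collections import Counter

def bag(sentence):
    """Lowercased word counts; order and grammar are thrown away."""
    return Counter(sentence.lower().rstrip(".").split())

a = bag("Bill gave the football to Fred.")
b = bag("Fred gave the football to Bill.")

# Who gave and who received is lost: both bags are identical
print(a == b)  # prints: True
```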

Vector size: For a large document, the vector size can be huge resulting in a lot of computation and time. You may need to ignore words based on relevance to your use case.
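One common way to keep the vector small is to drop words that appear in too few documents. The sketch below is a plain-Python version of that idea; the function name `prune_vocabulary` and the threshold `min_df=2` are assumptions chosen for illustration. scikit-learn's CountVectorizer exposes similar controls through its `min_df`, `max_df` and `max_features` parameters.

```python
from collections import Counter

def prune_vocabulary(documents, min_df=2):
    """Keep only words that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each word once per document
    return sorted(w for w, df in doc_freq.items() if df >= min_df)

# Cleaned sentences copied from the notebook output above
docs = [
    "travelled bill office",
    "football bill picked",
    "went bill bedroom",
    "gave bill football fred",
]

full_vocabulary = sorted({w for d in docs for w in d.split()})
small_vocabulary = prune_vocabulary(docs, min_df=2)
print(len(full_vocabulary), small_vocabulary)  # prints: 9 ['bill', 'football']
```

Whether a document-frequency cut-off is the right relevance criterion depends on the use case; rare words are sometimes the most informative ones.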

In [22]:
# Let's use a Wikipedia page
paragraph = wp.summary("FIFA World Cup",sentences=10)

In [23]:
paragraph

"The FIFA World Cup, often simply called the World Cup, is an international association football competition contested by the senior men's national teams of the members of the Fédération Internationale de Football Association (FIFA), the sport's global governing body. The championship has been awarded every four years since the inaugural tournament in 1930, except in 1942 and 1946 when it was not held because of the Second World War. The current champion is France, which won its second title at the 2018 tournament in Russia.\nThe current format involves a qualification phase, which takes place over the preceding three years, to determine which teams qualify for the tournament phase. In the tournament phase, 32 teams, including the automatically qualifying host nation(s), compete for the title at venues within the host nation(s) over about a month.\nThe 21 World Cup tournaments have been won by eight national teams. Brazil have won five times, and they are the only team to have played i

In [24]:
type(paragraph)

str

In [25]:
# Let's split the paragraph into sentences
sentence = nltk.sent_tokenize(paragraph)
len(sentence)

10

In [26]:
# Function to clean the complete paragraph: keep only alphanumeric characters,
# drop stopwords, and return the distinct words (no stemming is applied)
def get_clean_words(text):
    # convert to lowercase
    lower = text.lower()
    # replace every run of non-alphanumeric characters with a space
    # (the ASCII-only pattern also splits accented words: "fédération" becomes "f" and "ration")
    paragraph = re.sub("[^0-9a-zA-Z]+"," ",lower)
    # word tokenization of the complete paragraph
    token_words = nltk.word_tokenize(paragraph)
    # remove the stopwords (build the set once rather than once per word)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    words = [w for w in token_words if w not in stop_words]
    # remove the duplicate words
    return list(set(words))

In [27]:
#https://www.freecodecamp.org/news/an-introduction-to-bag-of-words-and-how-to-code-it-in-python-for-nlp-282e87a9da04/

In [28]:
# Store the result under its own name so the function is not overwritten
clean_words = get_clean_words(paragraph)

In [29]:
len(clean_words)

114

In [30]:
# Clean each sentence
stop_words = set(stopwords.words('english'))  # build the stopword set once
clean_sentences = []
clean_word_each_sentence = []
for sent in sentence:
    clean = re.sub("[^0-9a-zA-Z]+"," ",sent)  # replace non-alphanumeric runs with a space
    tokens = clean.lower().split()
    tokens = [w for w in tokens if w not in stop_words]  # remove stopwords
    unique_words = list(set(tokens))  # drop duplicates within the sentence
    clean_word_each_sentence.append(unique_words)
    clean_sentences.append(' '.join(unique_words))

In [31]:
len(clean_word_each_sentence)

10

In [32]:
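from collections import Counter
# Flatten the per-sentence word lists and collect the words that occur in more than one sentence
flat = [word for words in clean_word_each_sentence for word in words]
counts = Counter(flat)
dupes = [w for w, n in counts.items() if n > 1]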

In [33]:
len(flat)

144

In [34]:
len(dupes)

16

In [35]:
# Words found in only one sentence, followed by one copy of each repeated word
cleaned = [word for word in flat if word not in dupes]
cleaned.extend(dupes)

In [36]:
len(cleaned)

114

In [37]:
c = sorted(cleaned)
d  = sorted(clean_words)

In [38]:
print(list(zip(c,d)))

[('1', '1'), ('17', '17'), ('1930', '1930'), ('1942', '1942'), ('1946', '1946'), ('2006', '2006'), ('2018', '2018'), ('21', '21'), ('26', '26'), ('29', '29'), ('32', '32'), ('715', '715'), ('argentina', 'argentina'), ('association', 'association'), ('automatically', 'automatically'), ('awarded', 'awarded'), ('billion', 'billion'), ('body', 'body'), ('brazil', 'brazil'), ('called', 'called'), ('champion', 'champion'), ('championship', 'championship'), ('compete', 'compete'), ('competition', 'competition'), ('contested', 'contested'), ('countries', 'countries'), ('cumulative', 'cumulative'), ('cup', 'cup'), ('current', 'current'), ('de', 'de'), ('determine', 'determine'), ('eight', 'eight'), ('england', 'england'), ('entire', 'entire'), ('estimated', 'estimated'), ('event', 'event'), ('every', 'every'), ('except', 'except'), ('f', 'f'), ('fifa', 'fifa'), ('final', 'final'), ('five', 'five'), ('followed', 'followed'), ('football', 'football'), ('format', 'format'), ('four', 'four'), ('fra

In [39]:
clean_sentences

['senior cup football global f often contested members governing body internationale de ration men world association called fifa teams simply sport national competition international',
 'years second inaugural 1942 four every held since 1930 tournament war 1946 world championship except awarded',
 'second current 2018 title france russia tournament champion',
 'years current takes phase qualify place qualification involves tournament preceding determine three format teams',
 'within phase month title automatically venues 32 compete tournament qualifying nation including teams host',
 '21 cup national eight tournaments teams world',
 'team every times played brazil tournament five',
 'uruguay winner inaugural one cup title four italy france winners argentina spain titles two world england germany',
 'association prestigious well cup widely followed football sporting tournament event single viewed world',
 '715 entire 17 match cup million matches 2006 people 26 ninth 1 world viewership p

In [40]:
len(clean_sentences)

10

In [41]:
clean_word_each_sentence[0]

['senior',
 'cup',
 'football',
 'global',
 'f',
 'often',
 'contested',
 'members',
 'governing',
 'body',
 'internationale',
 'de',
 'ration',
 'men',
 'world',
 'association',
 'called',
 'fifa',
 'teams',
 'simply',
 'sport',
 'national',
 'competition',
 'international']

In [42]:
len(clean_word_each_sentence[0])

24

In [43]:
# Number of distinct cleaned words in each sentence
leng_each_sent = [len(words) for words in clean_word_each_sentence]

In [44]:
sent_words_length = sum(leng_each_sent)
sent_words_length

144

In [45]:
# Let's create the Bag of Words model
# stop_words is passed as a list, the collection type CountVectorizer documents for custom stopwords
bow = CountVectorizer(token_pattern='(?u)[0-9a-zA-Z]+',stop_words=stopwords.words('english'))

In [46]:
Xt = bow.fit_transform(clean_sentences).toarray()

In [47]:
Xt.shape

(10, 114)

In [48]:
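# Convert a collection of text documents to a matrix of token counts.
# fit_transform returns a sparse matrix; .toarray() converts it to a dense NumPy array.
# Here the vectorizer is fitted on the raw sentences, so repeated words give counts above 1.
X = bow.fit_transform(sentence).toarray()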

In [49]:
X.shape
# 10 sentences (rows) and 114 unique words (columns); the words serve as features for training a classification model

(10, 114)

In [50]:
X

array([[0, 0, 0, ..., 0, 2, 0],
       [0, 0, 1, ..., 0, 1, 1],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 3, 0],
       [1, 1, 0, ..., 0, 2, 0]], dtype=int64)

In [51]:
X[0:1,:]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0, 0,
        0, 1, 1, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 2,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
        1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
        0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 2, 0]], dtype=int64)

In [52]:
sentence[0]

"The FIFA World Cup, often simply called the World Cup, is an international association football competition contested by the senior men's national teams of the members of the Fédération Internationale de Football Association (FIFA), the sport's global governing body."

In [53]:
clean_sentences[0]

'senior cup football global f often contested members governing body internationale de ration men world association called fifa teams simply sport national competition international'

In [54]:
leng_each_sent[0]

24

In [55]:
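# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = bow.get_feature_names_out().tolist()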

In [56]:
type(features)

list

In [57]:
len(features)

114

In [58]:
a = features
b  = sorted(clean_words)

In [59]:
list(zip(a,b,c,d))

[('1', '1', '1', '1'),
 ('17', '17', '17', '17'),
 ('1930', '1930', '1930', '1930'),
 ('1942', '1942', '1942', '1942'),
 ('1946', '1946', '1946', '1946'),
 ('2006', '2006', '2006', '2006'),
 ('2018', '2018', '2018', '2018'),
 ('21', '21', '21', '21'),
 ('26', '26', '26', '26'),
 ('29', '29', '29', '29'),
 ('32', '32', '32', '32'),
 ('715', '715', '715', '715'),
 ('argentina', 'argentina', 'argentina', 'argentina'),
 ('association', 'association', 'association', 'association'),
 ('automatically', 'automatically', 'automatically', 'automatically'),
 ('awarded', 'awarded', 'awarded', 'awarded'),
 ('billion', 'billion', 'billion', 'billion'),
 ('body', 'body', 'body', 'body'),
 ('brazil', 'brazil', 'brazil', 'brazil'),
 ('called', 'called', 'called', 'called'),
 ('champion', 'champion', 'champion', 'champion'),
 ('championship', 'championship', 'championship', 'championship'),
 ('compete', 'compete', 'compete', 'compete'),
 ('competition', 'competition', 'competition', 'competition'),
 ('c