## Practice 5 - Bag-of-N-grams models
### Strictly used for internal purpose in Singapore Polytechnic. Do not disclose!

One-hot encoding, BoW, and TF-IDF treat words as independent units, with no notion of phrases or word order. The bag-of-n-grams (BoN) approach tries to remedy this by breaking text into chunks of n contiguous words (tokens); each chunk is called an n-gram. This helps capture some context that the earlier approaches could not. Let us see how it works using the same toy corpus we used in earlier examples.
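
To make "chunks of n contiguous tokens" concrete, here is a minimal pure-Python sketch. The helper name `ngrams` is hypothetical and the whitespace split is a simplifying assumption; it is not how `CountVectorizer` is implemented, only an illustration of the idea.

```python
def ngrams(tokens, n):
    """Return every run of n contiguous tokens, joined by a space."""
    # A sequence of k tokens has k - n + 1 runs of length n (none if n > k).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: simple whitespace tokenization on an already-cleaned sentence.
tokens = "students are learning nlp".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)  # ['students', 'are', 'learning', 'nlp']
print(bigrams)   # ['students are', 'are learning', 'learning nlp']
print(trigrams)  # ['students are learning', 'are learning nlp']
```

A bigram such as `'learning nlp'` keeps the two words together, which is the local word-order information that single-word features lose.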

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [1]:
#our corpus
documents = ["Students are learning NLP.",
             "NLP workshop is interesting, students like NLP",
             "Students are studying math",
             "Math is foundation of NLP"]
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['students are learning nlp',
 'nlp workshop is interesting, students like nlp',
 'students are studying math',
 'math is foundation of nlp']

In [4]:
#N-gram vectorization example with CountVectorizer: ngram_range=(1,2) extracts unigrams and bigrams
#(use ngram_range=(1,3) to include trigrams as well)
count_vect = CountVectorizer(ngram_range=(1,2))

#Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", sorted(count_vect.vocabulary_.items()))

Our vocabulary:  [('are', 0), ('are learning', 1), ('are studying', 2), ('foundation', 3), ('foundation of', 4), ('interesting', 5), ('interesting students', 6), ('is', 7), ('is foundation', 8), ('is interesting', 9), ('learning', 10), ('learning nlp', 11), ('like', 12), ('like nlp', 13), ('math', 14), ('math is', 15), ('nlp', 16), ('nlp workshop', 17), ('of', 18), ('of nlp', 19), ('students', 20), ('students are', 21), ('students like', 22), ('studying', 23), ('studying math', 24), ('workshop', 25), ('workshop is', 26)]


In [7]:
#see the BOW rep for first 2 documents
print(f"BoW representation for '{documents[0]}': ", bow_rep[0].toarray())
print(f"BoW representation for '{documents[1]}': ",bow_rep[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect.transform(["nlp is harder than math, but students like nlp"])

print("BoW representation for 'nlp is harder than math, but students like nlp':", temp.toarray())

BoW representation for 'Students are learning NLP.':  [[1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0]]
BoW representation for 'NLP workshop is interesting, students like NLP':  [[0 0 0 0 0 1 1 1 0 1 0 0 1 1 0 0 2 1 0 0 1 0 1 0 0 1 1]]
BoW representation for 'nlp is harder than math, but students like nlp': [[0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 2 0 0 0 1 0 1 0 0 0 0]]


### Note that the number of features (and hence the size of the feature vector) grows considerably for the same data compared to the other single-word representations: 27 features here versus 12 with unigrams alone.
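
The growth in vocabulary size can be checked by counting distinct n-grams directly. The sketch below is a pure-Python approximation, not the scikit-learn code path: the helper `vocab_size` is hypothetical, and the regular expression is assumed to mirror `CountVectorizer`'s default token pattern (words of two or more characters).

```python
import re

# Same toy corpus as above, already lowercased.
docs = [
    "students are learning nlp",
    "nlp workshop is interesting, students like nlp",
    "students are studying math",
    "math is foundation of nlp",
]

# Assumption: mimics CountVectorizer's default token pattern.
TOKEN = re.compile(r"\b\w\w+\b")

def vocab_size(docs, min_n, max_n):
    """Count distinct n-grams for every n in [min_n, max_n] across docs."""
    vocab = set()
    for doc in docs:
        tokens = TOKEN.findall(doc.lower())
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                vocab.add(" ".join(tokens[i:i + n]))
    return len(vocab)

sizes = {rng: vocab_size(docs, *rng) for rng in [(1, 1), (1, 2), (1, 3)]}
print(sizes)  # {(1, 1): 12, (1, 2): 27, (1, 3): 39}
```

Even on four short sentences the feature count more than triples when going from unigrams to unigrams, bigrams, and trigrams combined; on a real corpus the vocabulary grows much faster and the vectors become correspondingly sparser.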