### Introduction to bag of words

- The bag-of-words model is a simplifying representation used in natural language processing where a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. 
- It is commonly used in document classification, where the frequency of occurrence of each word serves as a feature for training a classifier.
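
As a minimal sketch of the idea, the snippet below builds bags of words with only the Python standard library. The two sentences, the `bag_of_words` helper, and the letters-only tokenizer are illustrative assumptions, not part of the model's definition.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a multiset (word -> count) for the given text."""
    # Lowercase and keep only alphabetic tokens; this simple tokenizer
    # is an assumption made for the example.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Two made-up example documents.
docs = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]
bags = [bag_of_words(d) for d in docs]

# Shared vocabulary: one feature column per distinct word.
vocabulary = sorted(set().union(*bags))

# Each document becomes a vector of word counts over that vocabulary.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Word order is lost (the vectors only record how often each word occurs), but multiplicity is kept: "likes" and "movies" each get a count of 2 in the first document.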


![From freeCodeCamp](https://cdn-media-1.freecodecamp.org/images/qRGh8boBcLLQfBvDnWTXKxZIEAk5LNfNABHF)


For further details:
- https://www.geeksforgeeks.org/bag-of-words-bow-model-in-nlp/
- https://www.freecodecamp.org/news/an-introduction-to-bag-of-words-and-how-to-code-it-in-python-for-nlp-282e87a9da04/



Implementation using libraries:

Library documentation links:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.fit_transform

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvectorizer#sklearn.feature_extraction.text.CountVectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

https://docs.python.org/3/library/re.html

https://www.nltk.org/api/nltk.html

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
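
The text-cleaning steps used in the implementation (strip non-letters, lowercase, split, drop stop words) can be tried on a single string first. The sketch below uses only the standard library: the `STOP_WORDS` set and the `clean_review` helper are hypothetical stand-ins for NLTK's English stop-word list, the sample review is made up, and stemming is left out because it needs NLTK's `PorterStemmer`.

```python
import re

# Tiny illustrative stop-word list (an assumption; the notebook uses
# NLTK's much longer English list).
STOP_WORDS = {"the", "this", "is", "a", "and", "was", "i", "it"}

def clean_review(text, stop_words=STOP_WORDS):
    """Strip non-letters, lowercase, tokenize, and drop stop words."""
    # Replace every character that is not a letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase, then split on whitespace into individual words.
    tokens = letters_only.lower().split()
    # Keep only the words that are not stop words.
    kept = [tok for tok in tokens if tok not in stop_words]
    # Join back into one space-separated string.
    return " ".join(kept)

print(clean_review("Wow... Loved this place!!!"))  # wow loved place
```

With stemming added, "loved" would be further reduced to "love".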




In [1]:
# Natural Language Processing
# Simplest bag-of-words model

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

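# Importing the dataset
# The file is tab-separated; quoting = 3 (csv.QUOTE_NONE) makes pandas treat double quotes in reviews as ordinary characters
dataset = pd.read_csv('https://raw.githubusercontent.com/tanvirrazin/Machine-Learning-A-Z-Udemy/master/data_files/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)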

# Cleaning the texts
import re
import nltk
# nltk.download('stopwords')  # run once to download the stop-word list (uninformative words such as "this", "for", ...)
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
ps = PorterStemmer()  # stems each word to its root, e.g. loved ==> love
stop_words = set(stopwords.words('english'))  # build the stop-word set once instead of once per word
for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  # replace every non-letter character with a space
    review = review.lower()  # convert everything to lower case
    review = review.split()  # split into a list of words
    # Drop stop words and stem the rest (stop words can also be removed with the stop_words parameter of CountVectorizer)
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)  # join the remaining words back into one string
    corpus.append(review)  # add the cleaned review to the corpus

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500) # keep only the 1500 most frequent words; dimensionality reduction techniques are an alternative
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[55 42]
 [12 91]]
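
To read the matrix printed above: `confusion_matrix` puts true labels in rows and predicted labels in columns, with label 0 first. The sketch below derives the usual summary metrics from those four counts, assuming label 1 is the positive ("liked") class; the variable names are illustrative.

```python
# Confusion matrix printed above: rows = true labels, columns = predictions.
cm = [[55, 42],
      [12, 91]]

tn, fp = cm[0]  # true class 0: correctly rejected, wrongly predicted as 1
fn, tp = cm[1]  # true class 1: wrongly predicted as 0, correctly predicted
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all test reviews classified correctly
precision = tp / (tp + fp)     # share of predicted positives that are truly positive
recall = tp / (tp + fn)        # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"test reviews = {total}")
print(f"accuracy     = {accuracy:.3f}")
print(f"precision    = {precision:.3f}")
print(f"recall       = {recall:.3f}")
print(f"f1           = {f1:.3f}")
```

On these 200 test reviews the accuracy is 0.73. Recall for class 1 is high (about 0.88) while precision is lower (about 0.68), because 42 of the 97 class-0 reviews were predicted as class 1.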
