
# Bag of Words Model

### Importing the basic libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Importing the dataset

In [None]:
dataset = pd.read_csv('https://raw.githubusercontent.com/insaid2018/DeepLearning/master/Data/Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
dataset.head()

|   | Review | Liked |
|---|--------|-------|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
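
The `quoting=3` argument in `read_csv` matches the value of `csv.QUOTE_NONE` in Python's standard library, which tells the parser to treat quote characters as ordinary text. The sketch below shows the effect with the standard `csv` module on a made-up two-row sample (the sample text is an assumption, not taken from the dataset); it is only meant to illustrate why the option is useful for free-text reviews.

```python
import csv
import io

# Made-up tab-separated sample; the first review starts with a quote character.
sample = 'Review\tLiked\n"Great" food.\t1\nCrust is not good.\t0\n'

# QUOTE_NONE (numeric value 3) keeps quote characters as literal text.
rows = list(csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE))

header, first_review = rows[0], rows[1]
print(header)        # column names
print(first_review)  # the quotes around Great are preserved
```

Without this setting, a stray quote inside a review could be interpreted as the start of a quoted field and change how the row is split.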


### Cleaning the Texts

In [None]:
import re

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

- We will be **cleaning** our text reviews in the next cell.

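- Each review goes through several steps: non-letter characters are replaced with spaces, the text is lower-cased and split into tokens, stopwords are dropped, and the remaining words are stemmed.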

- At the end, we will have a **corpus** of clean reviews.

In [None]:
ps = PorterStemmer()                                          # Creating the stemmer once, outside the loop.
stop_words = set(stopwords.words('english'))                  # Building the stopword set once for fast lookups.
corpus = []
for i in range(0, len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])   # Replacing anything except letters with a space.
    review = review.lower()                                   # Changing text to lower case.
    review = review.split()                                   # Creating tokens from a review by splitting it.
    review = [ps.stem(word) for word in review if word not in stop_words]    # Dropping the stopwords, and stemming the remaining words.
    review = ' '.join(review)                                 # Joining the tokens of a review into a single string.
    corpus.append(review)                                     # Adding the cleaned review to the corpus.
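
To see what the cleaning steps do to a single review, here is a small standalone sketch. It covers the regex replacement, lower-casing, tokenising and stopword removal only; stemming is left out so that the example runs without NLTK. `DEMO_STOPWORDS` and `clean_review` are hypothetical names introduced for illustration, and the stopword set is a tiny stand-in for NLTK's much longer English list.

```python
import re

# Tiny illustrative stopword set (an assumption); NLTK's English list is much longer.
DEMO_STOPWORDS = {"the", "is", "this", "was", "and", "a"}

def clean_review(text, stop_words=DEMO_STOPWORDS):
    """Apply the cleaning steps to one review, without stemming."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)   # Replace non-letters with spaces.
    tokens = letters_only.lower().split()           # Lower-case and split into tokens.
    kept = [tok for tok in tokens if tok not in stop_words]  # Drop stopwords.
    return ' '.join(kept)                           # Join the tokens back into a string.

cleaned = clean_review("Wow... Loved this place.")
print(cleaned)
```

In the notebook's own loop, each remaining token is additionally passed through `PorterStemmer`, so words are reduced to their stems before being joined.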

### Creating the Bag of Words Model

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

- Building a `CountVectorizer` that keeps only the **1500 most frequent** words across the corpus.

In [None]:
cv = CountVectorizer(max_features=1500)

In [None]:
X = cv.fit_transform(corpus).toarray()
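
The idea behind the matrix `X` can be sketched in plain Python: each row is a document, each column is a vocabulary word, and each cell holds how many times that word occurs in that document. The helper `bag_of_words` below is a hypothetical illustration, not scikit-learn's implementation; details such as tokenisation rules and tie-breaking between equally frequent words are likely to differ from `CountVectorizer`.

```python
from collections import Counter

def bag_of_words(docs, max_features=None):
    """Build a sorted vocabulary and a document-term count matrix."""
    tokenized = [doc.split() for doc in docs]
    # Count every token across the whole corpus.
    totals = Counter(tok for toks in tokenized for tok in toks)
    # Keep the most frequent terms (all of them if max_features is None).
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok in toks:
            if tok in index:          # Tokens outside the vocabulary are ignored.
                row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix

docs = ["wow love place", "crust good", "love crust love"]
vocab, X_demo = bag_of_words(docs)
print(vocab)
print(X_demo)
```

Passing `max_features=2` to this sketch keeps only the two most frequent words, which mirrors the role of `max_features=1500` above.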

In [None]:
y = dataset['Liked'].values    # Target labels taken from the Liked column.

### Splitting the dataset into Training and Test sets


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
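
Conceptually, the split shuffles the row indices and sets aside 20% of them for testing. The sketch below illustrates this with the standard library only; `split_indices` is a hypothetical helper, and its shuffle will not reproduce the exact rows that `train_test_split` selects with `random_state=0`.

```python
import math
import random

def split_indices(n_samples, test_size=0.20, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # Seeded shuffle for reproducibility.
    n_test = math.ceil(test_size * n_samples)   # Number of held-out samples.
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state`: the same rows end up in the test set every time the cell is run.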

### Fitting Naive Bayes to the Training set

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
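
For intuition, the following is a minimal from-scratch sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`, as in the output above). The functions `fit_multinomial_nb` and `predict_multinomial_nb` and the toy data are assumptions made for illustration; scikit-learn's implementation is vectorised and handles more cases.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word over all documents of class c.
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features   # Smoothed normaliser.
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(model, X):
    """Pick the class with the highest log posterior score for each row."""
    classes, log_prior, log_likelihood = model
    predictions = []
    for row in X:
        scores = {
            c: log_prior[c] + sum(n * w for n, w in zip(row, log_likelihood[c]))
            for c in classes
        }
        predictions.append(max(classes, key=scores.get))
    return predictions

# Toy count matrix with columns ['bad', 'good', 'love'] (made-up data).
X_toy = [[0, 1, 1], [0, 2, 0], [1, 0, 0], [2, 0, 0]]
y_toy = [1, 1, 0, 0]
model = fit_multinomial_nb(X_toy, y_toy)
toy_pred = predict_multinomial_nb(model, [[0, 1, 0], [1, 0, 0]])
print(toy_pred)
```

The smoothing term keeps a word that never appeared in a class from forcing that class's probability to zero.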

### Predicting the Test set results

In [None]:
y_pred = classifier.predict(X_test)

### Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, y_pred)

0.765

- The model reaches an accuracy of **76.5%** on the test set.
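
Accuracy is simply the fraction of test samples whose predicted label matches the true label. A minimal sketch of that calculation, with a hypothetical helper `accuracy` and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels: three of the four predictions are correct.
accuracy_demo = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(accuracy_demo)
```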

### Conclusion

- This notebook gives a basic idea of how to use the **Bag of Words** model on a real dataset.

- We learn about **cleaning** text reviews and building a **corpus** of all our documents (reviews).

- Then we build a Bag of Words model using the **CountVectorizer**.

- At last, we feed the numeric representation of the text into a Machine Learning model and make predictions.