# Bag of Words on Amazon Food review Dataset

Bag of Words is a simple text representation that converts a document into a fixed-length numeric vector by counting how often each vocabulary word occurs in it.
The Amazon Fine Food Reviews dataset is available on Kaggle (https://www.kaggle.com/snap/amazon-fine-food-reviews).
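
To make the idea concrete, here is a minimal pure-Python sketch of Bag of Words on three made-up reviews. The sentences and the helper name `to_vector` are assumptions for illustration only; the notebook itself uses scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

# Three toy reviews (invented for illustration, not taken from the dataset).
docs = [
    "great taste great price",
    "bad taste",
    "great coffee",
]

# The vocabulary is the sorted set of all words seen in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_vector(doc) for doc in docs]

print(vocab)    # ['bad', 'coffee', 'great', 'price', 'taste']
for vec in vectors:
    print(vec)  # [0, 0, 2, 1, 1] then [1, 0, 0, 0, 1] then [0, 1, 1, 0, 0]
```

Each review becomes a row of counts with one column per vocabulary word, and word order inside the review is discarded.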

In this IPython notebook, I perform the following steps:
    1. Load the data and assign polarity
    2. Clean the data by removing duplicate entries and invalid information
    3. Sort the data and sample it
    4. Preprocess the data:
        a. remove stop words
        b. remove punctuation and HTML tags, if any
        c. stem
        d. convert all words to lower case
    5. Split the data into train and test sets
    6. Vectorize the reviews using Bag of Words and save the result
   

1. Load data and assign polarity to reviews

In [1]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# using the SQLite Table to read data.
con = sqlite3.connect(r'C:\Users\Admin\Downloads\database.sqlite')


#Reading reviews that can be classified as positive or negative
review_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3""", con) 


# Give reviews with Score>3 a positive rating, and reviews with a score<3 a negative rating.
def partition(x):
    if x < 3:
        return 0
    return 1

#changing reviews with score greater than 3 to be positive and vice-versa
review_data['Score'] = review_data['Score'].map(partition)
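
The filter-then-label logic of this cell can be tried without the real database. The sketch below uses an in-memory SQLite table with invented rows and a reduced set of columns (both are assumptions); only the `WHERE Score != 3` filter and the `partition` rule mirror the notebook.

```python
import sqlite3

# In-memory stand-in for database.sqlite; the rows are invented.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
demo_con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "loved it"), (2, 3, "it was ok"), (3, 1, "awful"), (4, 4, "good")],
)

# Same filter as the notebook: neutral reviews (Score == 3) are dropped.
rows = demo_con.execute(
    "SELECT Id, Score FROM Reviews WHERE Score != 3 ORDER BY Id"
).fetchall()
demo_con.close()

def partition(score):
    """Map a 1-5 star score to 0 (negative) or 1 (positive)."""
    return 0 if score < 3 else 1

labels = {review_id: partition(score) for review_id, score in rows}
print(labels)  # {1: 1, 3: 0, 4: 1}
```

Because the three-star rows never reach `partition`, the `x < 3` test is enough to separate negative from positive reviews.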

2. Data cleaning

In [2]:
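# Treat rows with the same user, profile name, timestamp and text as duplicates; keep the first.
review_data = review_data.drop_duplicates(subset=['UserId', 'ProfileName', 'Time', 'Text'], keep='first')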

In [3]:
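# Drop invalid rows where helpful votes exceed total votes; .copy() makes later in-place edits safe.
cleaned_data = review_data[review_data.HelpfulnessNumerator <= review_data.HelpfulnessDenominator].copy()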
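
The two cleaning rules can be sketched on plain tuples. The rows below are invented, and the loop is only an illustration of the logic, not how pandas implements `drop_duplicates` or boolean filtering.

```python
# Invented rows:
# (UserId, ProfileName, Time, Text, HelpfulnessNumerator, HelpfulnessDenominator)
rows = [
    ("u1", "Ann", 100, "Tasty", 1, 2),
    ("u1", "Ann", 100, "Tasty", 0, 0),   # same user, time and text: duplicate
    ("u2", "Bob", 200, "Stale", 3, 1),   # more helpful votes than total votes: invalid
    ("u3", "Cy", 300, "Fine", 0, 0),
]

seen = set()
cleaned = []
for user, name, time, text, num, den in rows:
    key = (user, name, time, text)
    if key in seen:          # keep only the first occurrence of each key
        continue
    seen.add(key)
    if num <= den:           # the numerator cannot exceed the denominator
        cleaned.append((user, name, time, text, num, den))

print([row[0] for row in cleaned])  # ['u1', 'u3']
```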

3. Data sampling

In [4]:
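# Sort newest first and keep the 100,000 most recent reviews.
cleaned_data.sort_values('Time', inplace=True, ascending=False)
# Alternative: sampled_data = cleaned_data.sample(frac=0.275, random_state=1); a time-series split can also be used.
# .copy() avoids SettingWithCopyWarning when the CleanedText column is added later.
sampled_data = cleaned_data[0:100000].copy()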

4. Data preprocessing

In [5]:
import re
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

stop = set(stopwords.words('english')) # set of English stop words
sno = SnowballStemmer('english') # initialise the Snowball stemmer

def cleanhtml(sentence): # replace any HTML tags in the sentence with a space
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext

def cleanpunc(sentence): # strip punctuation and special characters from the sentence
    # Inside a character class '|' is a literal pipe, not alternation, so one is enough.
    cleaned = re.sub(r'[?!\'"#|]', r'', sentence)
    cleaned = re.sub(r'[.,)(/|]', r' ', cleaned)
    return cleaned
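
A quick way to see what the two cleaning helpers do is to run equivalent regular expressions on a sample string. The function names `strip_html` and `strip_punctuation` and the sample text are assumptions made for this sketch; the patterns remove the same characters as the notebook's helpers.

```python
import re

def strip_html(sentence):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<.*?>", " ", sentence)

def strip_punctuation(sentence):
    """Delete ? ! ' \" # | and turn . , ) ( / into spaces."""
    cleaned = re.sub(r"[?!'\"#|]", "", sentence)
    return re.sub(r"[.,)(/]", " ", cleaned)

raw = "Great taste!<br />Isn't it (really) good?"
no_tags = strip_html(raw)
no_punct = strip_punctuation(no_tags)
tokens = no_punct.split()

print(tokens)  # ['Great', 'taste', 'Isnt', 'it', 'really', 'good']
```

Note that the apostrophe is deleted rather than replaced by a space, so "Isn't" collapses to "Isnt" instead of splitting into two tokens.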

In [6]:
final_string = []
for sent in sampled_data['Text'].values:
    filtered_sentence = []
    sent = cleanhtml(sent) # remove HTML tags
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            # keep alphabetic words longer than two characters that are not stop words
            if cleaned_words.isalpha() and len(cleaned_words) > 2:
                if cleaned_words.lower() not in stop:
                    filtered_sentence.append(sno.stem(cleaned_words.lower()))

    # join the cleaned, stemmed words back into one string per review
    final_string.append(" ".join(filtered_sentence))

In [7]:
sampled_data['CleanedText']=final_string
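
The token filter used in the loop above can be isolated into a small function for experimentation. This is a sketch under stated assumptions: `DEMO_STOP_WORDS` is a tiny hypothetical stop list rather than NLTK's, and `stem` defaults to a do-nothing stand-in so the example runs without NLTK installed.

```python
DEMO_STOP_WORDS = {"the", "was", "and", "this"}  # tiny hypothetical stop list

def clean_tokens(text, stop_words=DEMO_STOP_WORDS, stem=lambda word: word):
    """Keep alphabetic tokens longer than two characters that are not stop words."""
    kept = []
    for token in text.split():
        word = token.lower()
        if word.isalpha() and len(word) > 2 and word not in stop_words:
            kept.append(stem(word))   # stem is an identity stand-in by default
    return " ".join(kept)

cleaned = clean_tokens("The coffee was GREAT and 100 percent fresh")
print(cleaned)  # coffee great percent fresh
```

Passing a real stemmer's `stem` method as the `stem` argument, together with a full stop-word set, would bring the function closer to the notebook's behaviour.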

5. Split data into train and test

In [8]:
# We use the oldest 70% of the data for training and the newest 30% for testing
import math
sampled_data.sort_values('Time', inplace=True, ascending=True)

split = math.ceil(len(sampled_data) * .7) # index where the test set starts
X_train = sampled_data[:split]
X_test = sampled_data[split:]
y_train = sampled_data['Score'][:split]
y_test = sampled_data['Score'][split:]
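
A time-ordered 70/30 split needs a single cut point so that the two parts do not overlap. The sketch below checks this on ten invented `(time, score)` pairs; the data and variable names are assumptions for illustration.

```python
import math

# Ten invented (time, score) pairs, deliberately out of order.
reviews = [(t, t % 2) for t in [5, 3, 9, 1, 7, 2, 8, 4, 10, 6]]
reviews.sort(key=lambda r: r[0])          # oldest first

split = math.ceil(len(reviews) * 0.7)     # index where the test set starts
train, test = reviews[:split], reviews[split:]

print(len(train), len(test))              # 7 3
print(train[-1][0], test[0][0])           # 7 8
```

Every training review is older than every test review, so the model is always evaluated on data from later than anything it was trained on.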

6. Convert text to vector

Bag of words

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer() 
bow_model = count_vect.fit(X_train['CleanedText'])
final_bow_train = bow_model.transform(X_train['CleanedText'])
final_bow_test = bow_model.transform(X_test['CleanedText'])
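
The vocabulary is learned from the training reviews only, so words that appear only in the test reviews get no column. The following pure-Python sketch illustrates that fit/transform separation; `fit_vocabulary` and `transform` are hypothetical helpers written for this example, not scikit-learn functions.

```python
from collections import Counter

train_docs = ["good coffee", "bad coffee bad"]
test_docs = ["good tea", "coffee coffee"]

def fit_vocabulary(docs):
    """Assign a column index to every word seen in the training documents."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform(docs, vocabulary):
    """Count words per document; words outside the vocabulary are ignored."""
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [0] * len(vocabulary)
        for word, index in vocabulary.items():
            row[index] = counts[word]
        matrix.append(row)
    return matrix

vocabulary = fit_vocabulary(train_docs)          # {'bad': 0, 'coffee': 1, 'good': 2}
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)

print(train_matrix)  # [[0, 1, 1], [2, 1, 0]]
print(test_matrix)   # [[0, 0, 1], [0, 2, 0]]
```

The word "tea" occurs only in the test data and is silently dropped, which keeps the train and test matrices the same width.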

In [10]:
from sklearn.preprocessing import StandardScaler

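# Fit the scaler on the training data only, then apply the same scaling to the test data.
scaler = StandardScaler(with_mean=False) # with_mean=False keeps the matrix sparse
normalised_bow_train = scaler.fit_transform(final_bow_train)
normalised_bow_test = scaler.transform(final_bow_test)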
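
Scaling without centering divides each column by its standard deviation and subtracts nothing, so zero counts stay zero. The sketch below shows that arithmetic on small invented count matrices, with the scale learned from the training rows and reused for the test row. `fit_scale` and `apply_scale` are hypothetical helpers, and leaving constant columns unscaled is an assumption of this sketch.

```python
import math

def fit_scale(matrix):
    """Return one scale factor per column: its population standard deviation."""
    n_rows = len(matrix)
    scales = []
    for column in zip(*matrix):
        mean = sum(column) / n_rows
        variance = sum((value - mean) ** 2 for value in column) / n_rows
        std = math.sqrt(variance)
        scales.append(std if std > 0 else 1.0)   # constant columns are left unscaled
    return scales

def apply_scale(matrix, scales):
    """Divide each column by its scale; nothing is subtracted, so zeros stay zero."""
    return [[value / scale for value, scale in zip(row, scales)] for row in matrix]

train_counts = [[0, 2], [4, 2], [0, 2], [4, 2]]
test_counts = [[2, 1]]

scales = fit_scale(train_counts)                 # learned from training data only
train_scaled = apply_scale(train_counts, scales)
test_scaled = apply_scale(test_counts, scales)   # reuses the training scales

print(scales)       # [2.0, 1.0]
print(test_scaled)  # [[1.0, 1.0]]
```

Reusing the training scales puts both matrices in the same units; fitting a second scale on the test rows would give the same word count different values in the two sets.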

In [13]:
import scipy.sparse
scipy.sparse.save_npz('bow_train.npz', normalised_bow_train)
#final_bow_train1 = scipy.sparse.load_npz('bow_train.npz')
scipy.sparse.save_npz('bow_test.npz', normalised_bow_test)
#final_bow_test1 = scipy.sparse.load_npz('bow_test.npz')