The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>
Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review

<u> database.sqlite: Contains the table 'Reviews' </u>

#### <u>Objective </u>:
Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2). Reviews with a rating of 3 are ignored.


In [None]:
import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# LOADING the data

con= sqlite3.connect("Datasets/Amazon _reviews_set/database.sqlite")

##### Filtering data

In [None]:
# We only want the overall sentiment of each review (positive or negative),
# so reviews with a Score of 3 are deliberately ignored.
# If the score is > 3, the label is "positive" (1). Otherwise, it is "negative" (0).

filtered_data = pd.read_sql_query(
"""
SELECT * 
FROM Reviews 
WHERE Score !=3 
LIMIT 10000 
""", con)
# Only the first 10k rows are selected because of limited computational power


# Give reviews with Score>3 a positive rating, and reviews with a score<3 a negative rating.
def partition(x):
    if x < 3:
        return 0
    return 1

#Changing Score column to our definition
filtered_data["Score"]= list(map(partition,filtered_data["Score"]))

print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)


### <u> Data Cleaning </u>
##### 1. Check on DeDuplication

In [None]:
# This query groups the reviews by UserId and counts the rows per user
# The count column gives a fair idea of how many users have more than one review

#RUN ON COMPLETE DATASET

display = pd.read_sql_query("""
SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*) as sum
FROM Reviews
GROUP BY UserId
HAVING sum>1
""", con)

print(display.shape)
print(display.head())

#We can observe through the count column that many users have multiple reviews

In [None]:
# Same query, restricted to users with more than 4 reviews

display = pd.read_sql_query("""
SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*) as sum
FROM Reviews
GROUP BY UserId
HAVING sum>4
""", con)

print(display.shape)
print(display.head())

In [None]:
# Let's look at UserId = A1001WMV1CL0XH as seen above

display = pd.read_sql_query("""
SELECT UserId, ProductId, ProfileName, Time, Score, Text
FROM Reviews
WHERE Score != 3 AND UserId = 'A1001WMV1CL0XH'
ORDER BY ProductID
""", con)

print(display.shape)
print(display.head())

##### 2. Remove Duplicate Data

In [None]:
# It is observed (as shown in the table below) that the reviews data had many duplicate entries.
# Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data. 

In [None]:
# It was inferred after analysis that reviews with same parameters other than 
# ProductId belonged to the same product just having different flavour or quantity. 
# Hence in order to reduce redundancy it was decided to eliminate the rows having same parameters.

# The method used for the same was that we first sort the data according to ProductId and 
# then just keep the first similar product review and delete the others
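
As a small illustration of the rule described above, the sketch below applies the same sort-then-drop logic to a hypothetical three-row frame. The rows and ID values are made up; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows: the first two share user, time and text but differ in ProductId.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "ProductId": ["B002", "B001", "B003"],
    "UserId": ["U1", "U1", "U2"],
    "ProfileName": ["Ann", "Ann", "Bob"],
    "Time": [100, 100, 200],
    "Text": ["Great wafers", "Great wafers", "Too sweet"],
})

# Sort by ProductId so that "first" is well defined, then keep one row per
# (UserId, ProfileName, Time, Text) combination.
toy_sorted = toy.sort_values("ProductId")
toy_dedup = toy_sorted.drop_duplicates(
    subset=["UserId", "ProfileName", "Time", "Text"], keep="first"
)

print(toy_dedup[["Id", "ProductId", "UserId"]])
```

The row with the smallest ProductId survives for the duplicated review, which is why sorting comes before dropping.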

In [None]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [None]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"], keep='first', inplace=False)

#Sort again by Id (reassigning instead of sorting a derived frame in place)
final=final.sort_values("Id", axis=0, ascending=True, kind='quicksort')

final.shape

In [None]:
#Checking to see how much % of data still remains
(final['Id'].size)/(filtered_data['Id'].size)*100

##### 3. Check the Helpfulness columns (rows where numerator > denominator should be removed)

In [None]:
display=pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score!=3 AND HelpfulnessNumerator > HelpfulnessDenominator
ORDER BY HelpfulnessNumerator
""", con)

display

In [None]:
# As seen above, two such rows exist, so we remove them
final=final[final["HelpfulnessNumerator"] <= final["HelpfulnessDenominator"]]
final.reset_index(drop=True, inplace=True)

In [None]:
print("Final Shape of the Data = ",final.shape)
print("\nNo of +ve , -ve reviews present are:- \n",final["Score"].value_counts())

In [None]:
################################################################################################################################
################################################################################################################################
################################################################################################################################

# <u> Text Preprocessing. </u>

In the Preprocessing phase we do the following in the order below:-

1. Begin by removing the HTML tags
2. Remove punctuation and a limited set of special characters such as , . #
3. Check that the word consists of English letters only and is not alphanumeric
4. Check that the length of the word is greater than 2 (reportedly there are no two-letter adjectives)
5. Convert the word to lowercase
6. Remove stop words
7. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming)<br>

After which we collect the words used to describe positive and negative reviews
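
A minimal sketch of steps 1 to 6 on a single review, assuming a simple regex-based tag stripper and a tiny stop-word list. `TOY_STOPWORDS`, `clean_review` and the sample text are hypothetical names and data for illustration; stemming (step 7) is left out to keep the sketch free of extra dependencies.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
TOY_STOPWORDS = {"the", "and", "was", "this", "for"}

def clean_review(text):
    """Apply steps 1-6 to one review and return the kept tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # 1. strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. drop punctuation / special characters
    tokens = []
    for word in text.split():
        if not word.isalpha():                    # 3. letters only, no alphanumeric tokens
            continue
        if len(word) <= 2:                        # 4. keep words longer than 2 characters
            continue
        word = word.lower()                       # 5. lowercase
        if word in TOY_STOPWORDS:                 # 6. remove stop words
            continue
        tokens.append(word)
    return tokens

sample = "This was the BEST tea!<br />Bought 2 boxes, and it is great."
print(clean_review(sample))
```

The order matters: lowercasing before the stop-word check lets "This" match the lowercase entry "this".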

In [None]:
#lets see the Final dataset once
final.head()

In [None]:
#printing the reviews, to see how the text looks

for i in final["Text"]:
    print(i,"\n")

#text at position 2539 looks fishy

In [None]:
#Lets print TEXT position 2539

text_0 = final["Text"][2539]
print(text_0)

#We can see HTML Tags

In [None]:
# remove urls from text
import re

text_0 = re.sub(r"http\S+", "", text_0)
print(text_0)


In [None]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
from bs4 import BeautifulSoup

soup = BeautifulSoup(text_0, 'lxml')
text_0 = soup.get_text()
print(text_0)

In [None]:
# https://stackoverflow.com/a/47091490/4084039


def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [None]:
#remove words with numbers python: https://stackoverflow.com/a/18082370/4084039

text_0 = re.sub(r"\S*\d\S*", "", text_0).strip()
print(text_0)

In [None]:
#remove special characters: https://stackoverflow.com/a/5843547/4084039
text_0 = re.sub('[^A-Za-z0-9]+', ' ', text_0)
print(text_0)

In [None]:
# https://gist.github.com/sebleier/554280
# 'no', 'nor' and 'not' are deliberately left out of the stop words list so that negation is kept
# <br /><br /> ==> after the above steps, we are left with "br br",
# so 'br' is included in the stop words list
# (tags written as <br/> instead of <br /> would already have been removed in the 1st step)

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

##### Combining all above

In [None]:
# Combining all the above statements
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Text'].values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentance.strip())

In [None]:
preprocessed_reviews[2539]

In [None]:
################################################################################################################################
################################################################################################################################
################################################################################################################################

### <u> Featurization </u>

In [None]:

from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

import string
# imported under an alias so the custom `stopwords` set defined earlier is not overwritten
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from tqdm import tqdm
import os

### 1. Bag of Words

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Convert a collection of text documents to a matrix of token counts.
# It is used to transform a given text into a vector on the basis of
# the frequency (count) of each word that occurs in the entire text.

count_vect = CountVectorizer()

#learn a Vocabulary dictionary of all tokens in the Document
count_vect.fit(preprocessed_reviews)

#Get output feature names for transformation.
#(get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed)
print("Some Feature Names - ", count_vect.get_feature_names_out()[:10])

#Encode the documents as a document-term matrix using the learned vocabulary.
final_counts=count_vect.transform(preprocessed_reviews)

#we can also use fit_transform rather than writing fit and transform on separate lines
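
To make the idea concrete, here is a hand-rolled sketch of a bag-of-words encoding on three hypothetical, already-preprocessed reviews. It assumes plain whitespace tokenization, which is simpler than CountVectorizer's default token pattern, so it illustrates the representation rather than reproducing the library exactly.

```python
from collections import Counter

# Hypothetical, already-preprocessed toy reviews.
docs = ["good tea good price", "bad tea", "good coffee"]

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: idx for idx, word in enumerate(all_words)}

# Document-term matrix: one row per review, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(word, 0) for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row has as many entries as there are vocabulary words, and most entries are zero, which is why the library stores the result as a sparse matrix.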

In [None]:
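#TO GET INSIGHTS IN THE DATA
#vocabulary_ maps each word to its column index in the document-term matrix (not to a count)
count_of_word = "bought"
print("Column index of the word - ",count_vect.vocabulary_[count_of_word])


#mapping of all words to their column indices
print("Vocabulary- ",count_vect.vocabulary_)


# Summarizing the Encoded Texts
# (only the first few rows are converted to a dense array; the full dense matrix can be very large)
print("Encoded Document is:")
print(final_counts[:5].toarray())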


In [None]:
# This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

print("the type of count vectorizer = ",type(final_counts))

In [None]:
print("the shape of our text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])

# This means that we had 9564 reviews; each review is one row, with one column per unique word in the vocabulary
# Just as explained in the lecture

### 2. bi-gram, tri-gram and n-gram

In [None]:
count_vect = CountVectorizer(ngram_range=(1,2))
final_bigram_counts = count_vect.fit_transform(preprocessed_reviews)

print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of our text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])

# The dimensionality is much higher than with unigrams alone, so we will try to capture bigrams differently

In [None]:
### ngram_range: tuple (min_n, max_n) ###

# The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. 
# For example an ngram_range of 
# (1, 1) means only unigrams, 
# (1, 2) means unigrams and bigrams, and 
# (2, 2) means only bigrams. 

# min_df: float or int

# When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
# This value is also called the cut-off in the literature. 
# An integer value means an absolute count.

# max_features: int, default=None

# Build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.

count_vect = CountVectorizer(ngram_range=(1,2) , min_df =10 , max_features= 5000)
final_bigram_counts = count_vect.fit_transform(preprocessed_reviews)

print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of our text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])
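
The effect of `ngram_range` and `min_df` is easier to see on a tiny corpus. The sketch below uses three hypothetical two-word reviews (`toy_docs` is made up for illustration) and compares the vocabulary with and without a document-frequency cut-off.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
toy_docs = ["good tea", "good coffee", "bad tea"]

# Unigrams and bigrams, no frequency cut-off.
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)
print(sorted(uni_bi.vocabulary_))

# Same n-gram range, but keep only terms that appear in at least 2 documents.
frequent = CountVectorizer(ngram_range=(1, 2), min_df=2).fit(toy_docs)
print(sorted(frequent.vocabulary_))
```

Every bigram here occurs in a single document, so `min_df=2` leaves only the two unigrams that are shared across reviews. On real data the same cut-off removes the long tail of rare n-grams.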

### 3. TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_reviews)
print("some sample features(unique words in the corpus)\n",tf_idf_vect.get_feature_names_out()[0:10])
print('='*50)

final_tf_idf = tf_idf_vect.transform(preprocessed_reviews)
print("the type of TFIDF vectorizer output = ",type(final_tf_idf))
print("the shape of our text TFIDF vectorizer = ",final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams = ", final_tf_idf.get_shape()[1])
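
For intuition about the weights, the sketch below computes the inverse document frequency by hand on a hypothetical three-review corpus. It uses the smoothed formula that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and skips the final row normalization; the variable names are illustrative.

```python
import math

# Hypothetical toy corpus.
toy_docs = ["good tea", "good coffee", "bad tea"]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: number of reviews that contain each word.
doc_freq = {}
for tokens in tokenized:
    for word in set(tokens):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Smoothed idf: rare words get a larger weight than common ones.
idf = {
    word: math.log((1 + n_docs) / (1 + count)) + 1
    for word, count in doc_freq.items()
}

for word in sorted(idf):
    print(word, round(idf[word], 3))
```

Words that appear in two of the three reviews ("good", "tea") receive a lower idf than words that appear in only one ("bad", "coffee").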

### 4. Word2Vec

###### Using Google News Word2Vectors. 
###### To use this code-snippet, download "GoogleNews-vectors-negative300.bin" 
###### from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit


In [None]:
# To install it use -> conda install -c conda-forge gensim 
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
import os

# Each row consists of a word and its corresponding vector representation which is 300 dimension

is_your_ram_gt_16g=True
want_to_use_google_w2v = True
want_to_train_w2v = True

In [None]:
# Converting sentences to words

list_of_sentance=[]
for sentance in preprocessed_reviews:
    list_of_sentance.append(sentance.split())


print(preprocessed_reviews[0], "\n\n")
print(list_of_sentance[0])

In [None]:
# Load the pretrained Google vectors only if requested and enough RAM is available

if want_to_use_google_w2v and is_your_ram_gt_16g:
    # raw string so the backslashes in the Windows path are not treated as escape sequences
    w2v_path = r'Datasets\Google_W2V\GoogleNews-vectors-negative300.bin'
    if os.path.isfile(w2v_path):
        w2v_model=KeyedVectors.load_word2vec_format(w2v_path, binary=True)
        print("File imported")
    else:
        print("you don't have google's word2vec file, keep want_to_train_w2v = True, to train your own w2v ")

In [None]:
#TO HAVE INSIGHT IN DATA

# vectors loaded from file are indexed directly, while a trained Word2Vec model exposes them as .wv
kv = getattr(w2v_model, "wv", w2v_model)

# print the vector stored for the word "computer"
print(kv['computer'])

# cosine similarity between the two words
print(kv.similarity('woman', 'man'))

# words most similar to "woman"
print(kv.most_similar('woman'))

# 'tasti' is the stemmed word for tasty, so if we have already done stemming,
# there is a chance we won't find the word in this file (the lookup then raises KeyError)
try:
    print(kv.most_similar('tasti'))
except KeyError:
    print("'tasti' is not in the vocabulary")
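
The similarity scores reported by gensim are cosine similarities between word vectors. A minimal NumPy sketch of that measure, using hypothetical 2-dimensional vectors instead of real embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 2.0])

print(cosine_similarity(a, b))  # vectors 45 degrees apart
print(cosine_similarity(a, c))  # orthogonal vectors
```

Because the measure depends only on direction, scaling a vector (as with `c`) does not change its similarity to other vectors.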



In [None]:
# Train your own Word2Vec model using your own text corpus

if want_to_train_w2v:
    # min_count = 5 considers only words that occurred at least 5 times
    # (gensim >= 4 names the dimension argument vector_size; older releases used size)
    w2v_model=Word2Vec(list_of_sentance,min_count=5,vector_size=50, workers=4)
    
    print(w2v_model.wv.most_similar('great'))
    print('='*50)
    print(w2v_model.wv.most_similar('worst'))

In [None]:
# gensim >= 4 exposes the vocabulary as wv.key_to_index (older releases used wv.vocab)
w2v_words = list(w2v_model.wv.key_to_index)
print("number of words that occurred minimum 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])

### 5. Converting text into vectors using Avg W2V, TFIDF-W2V

(i) Avg W2v

In [None]:
# compute average word2vec for each review.

# the avg-w2v for each sentence/review is stored in this list
sent_vectors = []
w2v_vocab = set(w2v_words) # set membership tests are much faster than scanning a list
vec_dim = w2v_model.wv.vector_size # 50 for the model trained above, 300 if we use google's w2v

for sent in tqdm(list_of_sentance): # for each review/sentence
    sent_vec = np.zeros(vec_dim) # start from a zero vector with the same length as the word vectors
    cnt_words = 0 # num of words with a valid vector in the sentence/review
    
    for word in sent: # for each word in a review/sentence
        if word in w2v_vocab:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)


print(len(sent_vectors))
print(len(sent_vectors[0]))
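
The averaging step can be checked on a toy case. The sketch below assumes a hypothetical dictionary of 2-dimensional word vectors (`toy_vectors`) in place of a trained model; `average_vector` is an illustrative helper, not part of the notebook.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings" standing in for a trained model.
toy_vectors = {"good": np.array([1.0, 0.0]), "tea": np.array([0.0, 1.0])}

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of all known tokens; zero vector if none is known."""
    total = np.zeros(dim)
    count = 0
    for word in tokens:
        if word in vectors:
            total += vectors[word]
            count += 1
    return total / count if count else total

review_vec = average_vector(["good", "tea", "unknownword"], toy_vectors, dim=2)
empty_vec = average_vector(["unknownword"], toy_vectors, dim=2)
print(review_vec)
print(empty_vec)
```

Unknown words are skipped rather than counted, so they do not pull the average toward zero; a review with no known words falls back to the zero vector.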

(ii) TF-IDF weighted Word2Vec

In [None]:
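# `model` and `dictionary` are not defined earlier in this notebook; as an assumption,
# a unigram TF-IDF model is fitted here to obtain one idf value per word
model = TfidfVectorizer()
model.fit(preprocessed_reviews)
# dictionary[word] = idf value of the word in the whole corpus
dictionary = dict(zip(model.get_feature_names_out(), model.idf_))

# tfidf words/col-names (kept as a set for fast membership tests)
tfidf_feat = set(dictionary)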

# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

# the tfidf-w2v for each sentence/review is stored in this list
tfidf_sent_vectors = []
row = 0

for sent in tqdm(list_of_sentance): # for each review/sentence 
    sent_vec = np.zeros(50) # zero vector with the same length as the word vectors
    weight_sum = 0 # sum of the tf-idf weights of the words with a valid vector in the sentence/review
    
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
#           tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we compute tf-idf directly:
            # dictionary[word] = idf value of word in whole corpus
            # sent.count(word)/len(sent) = tf value of word in this review
            
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    
    
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1
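
A small worked example of the same weighting scheme, assuming hypothetical 2-dimensional word vectors and hand-picked idf values (`toy_vectors`, `toy_idf` and `tfidf_weighted_vector` are illustrative names). As in the loop above, each word occurrence contributes its vector scaled by tf * idf, and the sum is divided by the total weight.

```python
import numpy as np

# Hypothetical word vectors and idf values.
toy_vectors = {"good": np.array([1.0, 0.0]), "tea": np.array([0.0, 1.0])}
toy_idf = {"good": 1.0, "tea": 3.0}

def tfidf_weighted_vector(tokens, vectors, idf, dim):
    """TF-IDF weighted mean of the vectors of all known tokens."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for word in tokens:
        if word in vectors and word in idf:
            tf = tokens.count(word) / len(tokens)  # term frequency in this review
            weight = tf * idf[word]
            total += vectors[word] * weight
            weight_sum += weight
    return total / weight_sum if weight_sum else total

weighted = tfidf_weighted_vector(["good", "tea"], toy_vectors, toy_idf, dim=2)
print(weighted)
```

With equal term frequencies, the word with the higher idf ("tea") dominates the result, whereas the plain average would weight both words equally.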