# Amazon Game Product Sentiment Analysis using Vec2PCA

## Intro

Recommendation systems have been employed by various social network platforms to deliver a better service to their users. Based on user data, a recommendation system can infer user preferences and predict which products a user will be interested in and purchase. This work focuses on the review text of Amazon game products and applies the Word2Vec technique and Principal Component Analysis to generate topic features from textual data.

## Problem Definition and Algorithm
### - Task: 
The task is to learn a vector space representation of words and to generate topics from those words using a dimensionality reduction technique called principal component analysis (PCA). In this way we can see groupings of words that resemble certain topics about the items. The aim is to classify the ratings as 1 or 0 using the PCA features and obtain accuracy comparable to that of a bag-of-words representation.

### - Algorithm:
#### Word2Vec:
Word2Vec is a neural network algorithm that maps each word to a vector in a low-dimensional space. Let d be the number of dimensions and N the number of words. One then learns a vector v_word for each word, an N x d matrix W, and a vector of bias terms b, such that the average multinomial log loss ("cross-entropy") of "nearby_word ~ v_word*W + b" is minimized.
There are 2*N*d weight parameters to learn in total: N*d for W, and d for v_word for each of the N words.
W is shared: the same W is applied to every v_word.
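
As a rough illustration of the loss described above, the following sketch evaluates the average cross-entropy of "nearby_word ~ v_word*W + b" on a toy vocabulary. It is not the training code used in this project. The sizes, the random initialization, and the `pairs` list are made up, and it assumes W is stored as a d x N array with one bias per vocabulary word so that the shapes line up. Practical Word2Vec implementations usually approximate this full softmax rather than computing it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3                      # toy vocabulary size and embedding dimension (assumed)
V = rng.normal(size=(N, d))      # one d-dimensional vector v_word per word
W = rng.normal(size=(d, N))      # output weights, shared by every word
b = np.zeros(N)                  # one bias per candidate nearby word (assumed layout)

def nearby_word_loss(center, nearby):
    """Cross-entropy of predicting word index `nearby` from word index `center`."""
    logits = V[center] @ W + b
    logits = logits - logits.max()                    # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[nearby]

# Hypothetical (word, nearby_word) index pairs taken from a context window.
pairs = [(0, 1), (1, 2), (2, 3)]
avg_loss = np.mean([nearby_word_loss(c, n) for c, n in pairs])
print(avg_loss)
```

Training would adjust `V`, `W`, and `b` by gradient descent to reduce `avg_loss`; the rows of `V` are the word vectors that are kept afterwards.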



#### PCA:
PCA is a dimensionality reduction technique in which the first principal component is the direction that captures the highest variance in the data, the second principal component is perpendicular to the first while capturing as much of the remaining variance as possible, and so on. There can be as many principal components as there are variables.
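
The two properties above (perpendicular directions, decreasing variance) can be checked on toy data. This is a minimal sketch using a singular value decomposition in NumPy; the data is synthetic and the variable names are illustrative, not part of the project code.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data: 500 points in 3 variables, with most variance along one direction.
latent = rng.normal(size=(500, 1))
X = np.hstack([3.0 * latent, 1.5 * latent, 0.5 * latent]) + 0.3 * rng.normal(size=(500, 3))

# Center the data, then take the SVD; rows of vt are the principal directions.
Xc = X - X.mean(axis=0)
_, s, vt = np.linalg.svd(Xc, full_matrices=False)
components = vt
explained_variance = s ** 2 / (len(X) - 1)

# The directions are mutually perpendicular (and unit length).
print(np.round(components @ components.T, 6))
# The variance captured by each component, from largest to smallest.
print(explained_variance)

# Keep only the first component to reduce 3 variables to 1.
X_reduced = Xc @ components[:1].T
```

With 3 variables there are at most 3 components; dimensionality reduction comes from keeping only the first few.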


In [2]:
import pandas as pd
import numpy as np
import json

# Data:

In [3]:
# Read the JSON-lines file: one review object per line.
records = []
with open("reviews_Video_Games_5.json", "r") as json_file:
    for line in json_file:
        data = json.loads(line)
        records.append({
            'Ratings': data["overall"],
            'reviewerID': data["reviewerID"],
            'ItemID': data["asin"],
            'reviewText': data["reviewText"],
        })

# One row per review.
df_predict = pd.DataFrame(records, columns=['Ratings', 'reviewerID', 'ItemID', 'reviewText'])
   

# Methodology:

In [4]:
from nltk.corpus import stopwords
import re
from nltk.stem import WordNetLemmatizer

# Build these once; recreating them for every review is slow.
stops = set(stopwords.words("english"))
wnl = WordNetLemmatizer()

def review2wordlist(review):
    # Remove non-letters
    review_text = re.sub("[^a-zA-Z]", " ", review)
    # Convert to lowercase and split into a word list
    review_text = review_text.lower().split()
    # Remove stop words
    review_text = [w for w in review_text if w not in stops]
    # Lemmatize words
    review_text = [wnl.lemmatize(w) for w in review_text]

    return review_text
    

In [5]:
#Transform each review paragraph into a list of words
total=df_predict['reviewText']
review_list=[None]*len(total)
for i in range(len(total)):
    if i%1000==0:
        print(i,len(total))  # progress indicator
    review_list[i]=review2wordlist(total[i])

0 231780
1000 231780
2000 231780
...
77000 231780

#### Transform the words in each review into PCA feature vectors

In [6]:
# Each word is now represented by its PCA feature vector

def makeFeatureVec(words, dic, num_features):
    # Average the PCA feature vectors of all words in a given review.
    # `dic` maps each word to its feature vector.
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0.
    #
    # Loop over each word in the review and, if it is in the
    # dictionary, add its feature vector to the total
    for word in words:
        if word in dic:
            nwords = nwords + 1.
            featureVec = np.add(featureVec,dic[word])
    #
    # Divide the result by the number of words to get the average.
    # If no word was found, this is 0/0 and the vector is all NaN;
    # such reviews are dropped later.
    featureVec = np.divide(featureVec,nwords)
    return list(featureVec)
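
The `pca_dic` dictionary used in the next cell is not constructed anywhere in this notebook. One plausible construction, shown here only as a sketch, is to run PCA on the matrix of Word2Vec word vectors and store each word's projected coordinates. Everything in this block is an assumption for illustration: the toy vocabulary, the random `embeddings` standing in for trained Word2Vec vectors, and the helper name `build_pca_dic`.

```python
import numpy as np

# Hypothetical stand-in for trained Word2Vec output: one embedding row per word.
rng = np.random.default_rng(0)
vocab = ["game", "fun", "boring", "graphic", "controller", "story"]
embeddings = rng.normal(size=(len(vocab), 8))   # 6 words, 8-dimensional vectors

def build_pca_dic(words, vectors, num_features):
    """Project word vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by captured variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:num_features].T
    return {w: projected[i] for i, w in enumerate(words)}

pca_dic_toy = build_pca_dic(vocab, embeddings, num_features=3)

# Average the PCA vectors of the known words in a tokenized review,
# skipping words that are missing from the dictionary.
review = ["fun", "game", "unknownword"]
known = [pca_dic_toy[w] for w in review if w in pca_dic_toy]
review_vec = np.mean(known, axis=0)
print(review_vec.shape)
```

In the notebook itself the dictionary holds 200 features per word, matching the `200` passed to `makeFeatureVec`.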


In [None]:
#Transform the word list of each review into an average feature vector.
#pca_dic maps each word to its 200-dimensional PCA feature vector.
reviews_vec=[None]*len(review_list)
for i in range(len(review_list)):
    if i%10000==0:
        print(i,len(review_list))
    reviews_vec[i]=makeFeatureVec(review_list[i],pca_dic,200)

#### Run Elastic Net Regression against the PCA features

In [8]:
feature=np.array(reviews_vec)
target=np.array(df_predict['Ratings'])

#transform target into 1 (rating above 3) and 0 (rating of 3 or below)
target=np.where(target>3,1,0)

#throw out NAs in feature and target: a review with no known words
#has an all-NaN feature vector
NAs=np.where(np.isnan(feature).any(axis=1))[0]

target=np.delete(target,NAs)
feature=np.delete(feature,NAs, 0)

elastic_net=pd.DataFrame(feature)
elastic_net['response']=target
elastic_net.to_csv('/Users/guzi/Desktop/SignalDataScience/final project/elastic_net.csv')

##RUN ELASTIC_NET IN R

# R script for Elastic Net

In [None]:
setwd('/Users/guzi/Desktop/SignalDataScience/final project')

library(dplyr)
library(caret)
library(readr)
library(pROC)
library(glmnet)

df=read.csv("elastic_net.csv")
df=df[-1]   # drop the row-index column written by pandas
response.pca=df$response
features.pca= scale(df[1:200])

# Hold out the last 20% of the rows for testing
n=nrow(features.pca)
n.train=floor(n*0.8)

# alpha = 0 with lambda = 0 applies no penalty, so this is a plain logistic regression
m = glmnet(features.pca[1:n.train,], response.pca[1:n.train], family = "binomial", alpha = 0, lambda = 0)
prob.pca= as.numeric(predict(m, s=0, newx=features.pca[(n.train+1):n,], type="response"))
r=roc(response.pca[(n.train+1):n],prob.pca)
plot(r)


Type 'citation("pROC")' for a citation.

Attaching package: ‘pROC’

The following objects are masked from ‘package:stats’:

    cov, smooth, var

Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-5


Attaching package: ‘glmnet’

The following object is masked from ‘package:pROC’:

    auc



#### Get Bag of Words Feature 

In [29]:
#transform review_list into a list of sentences
for i in range(len(review_list)):
    review_list[i]=" ".join(review_list[i])

In [31]:
#Build bag-of-words features from review_list
from sklearn.feature_extraction.text import CountVectorizer
import scipy.io

#Drop the reviews removed from the PCA features; delete from the end
#so that earlier deletions do not shift the remaining indices
for i in sorted(NAs, reverse=True):
    del review_list[i]

vectorizer = CountVectorizer(analyzer = "word",tokenizer = None,preprocessor = None,stop_words = None,max_features = 5000)
#fit_transform returns a sparse matrix; keep it sparse to save memory
features = vectorizer.fit_transform(review_list)

#transform target into 1 and 0
target=np.array(df_predict['Ratings'])
target=np.where(target>3,1,0)
target=np.delete(target,NAs)

#Put response into a dataframe
pd.DataFrame(target).to_csv('/Users/guzi/Desktop/SignalDataScience/final project/bagging_target.csv')
#Put feature into a matrix
scipy.io.mmwrite("bagging_feature", features)
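
To make the bag-of-words representation concrete, here is a small standard-library sketch of the idea behind `CountVectorizer` with `max_features`: keep the most frequent words as the vocabulary, then count how often each one occurs in each document. The three toy documents are invented, and this is a simplified illustration rather than a reimplementation of scikit-learn's behavior.

```python
from collections import Counter

# Toy documents, already cleaned and joined back into strings.
docs = ["great game great graphic", "boring story game", "great story"]
max_features = 3

# Vocabulary: the max_features most frequent words across all documents.
totals = Counter(w for doc in docs for w in doc.split())
vocab = sorted(w for w, _ in totals.most_common(max_features))

# One row per document, one column per vocabulary word.
matrix = [[doc.split().count(w) for w in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row ignores word order and keeps only counts, which is why most entries are zero on real review data and a sparse matrix is the natural storage format.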

##RUN ELASTIC_NET IN R

In [None]:
# #Find out the number of popular items that are in the test set
# count=0
# for i in range(len(test_all)):
#     for ele in test_all[i]:
#         if ele in popular_items:
#             count+=1
# print(count)

In [None]:
# #Counting the number of recommendations that fall into the test set
# count=0
# for i in range(len(recommends_all)):
#     for ele in recommends_all[i]:
#         if ele in test_all[i]:
#             count+=1
# print(count)

In [None]:
# #find out the top 100 most popular items
# popularity=[len(item_review[i]) for i in range(len(item_review))]
# tops = heapq.nlargest(100, range(len(popularity)), popularity.__getitem__)
# popular_items=[item_review.index[i] for i in tops]