(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

This Notebook is a slightly altered homework for CS591: Data Mining Class at Boston University, taught by Professor George Kollios

# Recommender System for Amazon Electronics

In this assignment, we will be working with the Amazon dataset. The part of it that we will be using, can be found in the same repository as this notebook. We will build a recommender system to make predictions related to reviews of Electronics products on Amazon.

## Files

**train.json** 1,000,000 reviews to be used for training. It is not necessary to use all reviews for training if doing so proves too computationally intensive. The fields in this file are:

* **reviewerID** The ID of the reviewer. This is a hashed user identifier from Amazon.

* **asin** The ID of the item. This is a hashed product identifier from Amazon.

* **overall** The rating of reviewer gave the item.

* **helpful** The helpfulness votes for the review. This has 2 subfields, 'nHelpful' and 'outOf'. The latter is the total number of votes this review received. The former is the number of those that considered the review to be helpful.

* **reviewText** The text of the review.

* **summary** The summary of the review.

* **unixReviewTime** The time of the review in seconds since 1970.

**meta.json** Contains metadata of the items:

* **asin** The ID of the item.

* **categories** The category labels of the item being reviewed.

* **price** The price of the item.

* **brand** The brand of the item.

**pairs_Rating.txt** The pairs (reviewerID and asin) on which you are to predict ratings.

**pairs_Helpful.txt** The pairs on which you are to predict helpfulness votes. A third column in this file is the total number of votes from which you should predict how many were helpful.

**helpful.json** The review data associated with the helpfulness prediction test set. The 'nHelpful' field has been removed from this data since that is the value you need to predict above. This data will only be of use for the helpfulness prediction task.

## Tasks

**1. Rating prediction** Predict people's star ratings as accurately as possible for those (reviewerID, asin) pairs in 'pairs_Rating.txt'. Accuracy will be measured in terms of the root mean-squared error (RMSE).

**2. Helpfulness prediction** Predict whether a user's review of an item will be considered helpful. The file 'pairs_Helpful.txt' contains (reviewerID, asin) pairs with a third column containing the number of votes the user's review of the item received. We must predict how many of them were helpful. Accuracy will be measured in terms of the total absolute error, i.e. the difference |nHelpful - prediction|, where 'nHelpful' is the number of helpful votes the review actually received, and 'prediction' is our prediction of this quantity.

To keep track of our results a competition was created in Kaggle to keep track of the classes results. The leaderboard will show your results on half of the test data, but your ultimate score will depend on your predictions across the whole dataset.

In [1]:
import requests
import json
import numpy as np
from collections import defaultdict
from scipy import linalg
import statsmodels.api as sm
import matplotlib.pyplot as plt

def readJson(f):
    for l in open(f):
        yield eval(l)

Note: The following attempts are the ones that were the most successful ones. I tried using stochastic gradient descent as well as other techniques which were not as succesful as the ones I am going to present here. In both contests, I scored in the top thirds of the leaderboard (class of 60 people).

# Task 1

For each user in train.json I computed the amount of times he purchased any item, putting it in a dictionary (userdensity). For each item I computed the amount of times it was purchased, putting it also in a dictionary (itemdensity). Then for each of the users and items I computed the average score they either gave or got depending if they were a user or an item. These values were also  put in dictionaries.

Now, once I got in the file with the missing ratings, I had a user and an item. If that user was in userdensity and item was in itemdensity, I got the number of purchases for the user (unum) and the item (inum). I also got the average ratings for the user and the item. To find the predicted rating I used the equation:

$$(\dfrac{unum}{(unum+inum)})*user\_average+(\dfrac{inum}{(unum+inum)})*item\_average$$

which gives a weighted average of the sum of the averages of the item and the user. 

Moreover, I realized if the above equation gave me a value larger than five or less than one, it would be more efficient if I changed them to 5 and 1 respectively. 

For the tuples that were not in userdensity and itemdensity I checked them independently and returned just their item's or user's average, depending on which was present. 

Finally due to the inconsistence of such datasets, factoring in the global average on every result made our results better.

The RMSE of the method was 1.39786

In [None]:
predictions = open("predictions_Rating.txt", 'w')

itemRatings = defaultdict(dict)
for l in readJson('amazon_reviews_Electronics/train.json'):
    userId,itemId,rating = l['reviewerID'],l['asin'],l['overall']
    itemRatings[itemId][userId] = rating
    
allRatings = []
userRatings = defaultdict(dict)
for l in readJson('amazon_reviews_Electronics/train.json'):
    userId,itemId,rating = l['reviewerID'],l['asin'],l['overall']
    allRatings.append(rating)
    userRatings[userId][itemId] = rating  
    
globalAverage = (sum(allRatings) / len(allRatings)) 

userAverage = {}
for u in userRatings.keys():
    userAverage[u] = (sum(userRatings[u].values()) / len(userRatings[u]))
    
itemAverage = {}
for i in itemRatings.keys():
    if len(itemRatings[i]) == 0:
        itemAverage[i] = globalAverage
    else:
        itemAverage[i] = (sum(itemRatings[i].values()) / len(itemRatings[i]))

    
userdensity = {}
for user in userRatings.keys():
    userdensity[user] = len(userRatings[user].keys())
    
itemdensity = {}
for item in itemRatings.keys():
    itemdensity[item] = len(itemRatings[item].keys())

for l in open("amazon_reviews_Electronics/pairs_Rating.txt"):
    if l.startswith("reviewerID"):
        #header
        predictions.write(l)
        continue
    u,i = l.strip().split('-')
    #play with the biases
    if i in itemAverage:
        this_item_average = itemAverage[i]
    else:
        this_item_average = globalAverage
    if u in userAverage:
        this_user_average = userAverage[u]
    else:
        this_user_average = globalAverage
    this_item_bias = this_item_average - globalAverage
    this_user_bias = this_user_average - globalAverage
    
    dodal = {}#takes into account all others that rated i, could add it to value as another factor
    dod = []
    b =0
    for item_user in itemRatings[i].keys():
        total_score = [0.0,0.0]
        common_items = list(set(userRatings[u].keys()).intersection(userRatings[item_user].keys()))
        if common_items != []: 
            for common_item in common_items:
                score = float(itemRatings[common_item][u])-float(itemRatings[common_item][item_user])
                total_score[0] = total_score[0] + score 
                total_score[1] = total_score[1] + 1
                valueq = total_score[0]/total_score[1]
                dod.append(valueq)
            #valueq = total_score[0]/total_score[1]
            #dodal[item_user] = userRatings[item_user][i]+valueq
        else: #no common items
            dod.append(userRatings[item_user][i])
            #dodal[item_user] = userRatings[item_user][i]
        b = b + 1
    if len(dod) == 0.0:
    #if len(dodal) == 0.0: #no one has reviewed item
        val= this_user_average
    else:
        val = sum(dod)/len(dod)
        #val = sum(dodal.values())/len(dodal.keys())  

    if u in userdensity and i in itemdensity:
        unum=userdensity[u]
        inum=itemdensity[i]
        #change
        total = float(unum +inum)
        #here
        value = (unum/total)*this_user_average+(inum/total)*this_item_average
        #print (value)
        if value>5:
            value = 5
        if value<0:
            value = 0
        predictions.write(u + '-' + i + ',' + str(value*0.68+0.32*globalAverage) + '\n')
    elif i in itemRatings:
        #here
        predictions.write(u + '-' + i + ',' + str((this_item_average)*0.68+0.32*globalAverage) + '\n')
    elif u in userRatings:
        #and here
        predictions.write(u + '-' + i + ',' + str((this_user_average)*0.68+0.32*globalAverage) + '\n')
    else:
        predictions.write(u + '-' + i + ',' + str(globalAverage) + '\n')
predictions.close()

# Task 2

Each, row in train.json has a nhelpful and an outOf column. Thus, for all possible outOf values I computed the average nhelpful/outOf rate and put them in a dictionary. Now when I was asked to compute the amount of nhelpful of a review in helpful.json I just went back to the outOf dictionary I created, got the value for the respective outOf value and then multiplied it by outOf.

The absolute error of that method was 61667.70004 which translates to almost a 40% accuracy.

In [None]:
allHelpful = []
userHelpful = defaultdict(list)
itemHelpful = defaultdict(list)
userHelpfuld = defaultdict(dict)
itemHelpfuld = defaultdict(dict)
textu = defaultdict(dict)
texti = defaultdict(dict)
score = defaultdict(dict)
userOutOfs = defaultdict(dict)

for l in readJson('amazon_reviews_Electronics/train.json'):
    user,item = l['reviewerID'],l['asin']
    allHelpful.append(l['helpful'])
    textu[user][item] = len(l['reviewText'])
    score[user][item] = l["overall"]
    a = l["helpful"]
    userOutOfs[user][item] = a['outOf']
    userHelpfuld[user][item] = float(a['nHelpful'])/float(a['outOf'])


a=[]    
rt=0
mama = defaultdict(int)
mamnum = defaultdict(int)

tlen = defaultdict(dict)
rt = 0
for u in userHelpfuld:
    for i in userHelpfuld[u]:
        mama[str(userOutOfs[u][i])] = mama[str(userOutOfs[u][i])] + userHelpfuld[u][i]
        mamnum[str(userOutOfs[u][i])] = mamnum[str(userOutOfs[u][i])]+1
        if userOutOfs[u][i]==2 and textu[u][i]<100:
            rt = rt+1
            tlen[u][i]=userHelpfuld[u][i]
            

tlensum=0
for gh in tlen:
    tlensum=tlensum+sum(tlen[gh].values())
           
lis=defaultdict(dict)            
for mam in mama:
    lis[mam]= mama[mam]/mamnum[mam]




text = defaultdict(dict)
userOutOf = defaultdict(dict)
for l in readJson("amazon_reviews_Electronics/helpful.json"):
    use,ite = l['reviewerID'],l['asin']
    userOutOf[use][ite] = l['outOf']
    text[use][ite] = len(l['reviewText'])
    
predictions = open("predictions_Helpful.txt", 'w')
for l in open("amazon_reviews_Electronics/pairs_Helpful.txt"):
    boo = False 
    if l.startswith("reviewerID"):
    #header
        predictions.write(l)
        continue
    u,i,outOf = l.strip().split('-')
    outOf = int(outOf)
    o = outOf
    while boo == False:
        if o==1:
            boo = True
            predictions.write(u + '-' + i + '-' + str(o) + ',' + str(1) + '\n')
        elif str(outOf) in lis:
            boo = True
            predictions.write(u + '-' + i + '-' + str(o) + ',' + str(outOf*lis[str(outOf)]) + '\n')
        outOf = outOf-1    
predictions.close()


## Dataset Citation

**Image-based recommendations on styles and substitutes** J. McAuley, C. Targett, J. Shi, A. van den Hengel *SIGIR*, 2015

**Inferring networks of substitutable and complementary products** J. McAuley, R. Pandey, J. Leskovec *Knowledge Discovery and Data Mining*, 2015

-----------------

In [None]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()