(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Recommender System for Amazon Electronics

In this assignment, we will be working with the [Amazon dataset](http://cs-people.bu.edu/kzhao/teaching/amazon_reviews_Electronics.tar.gz). You will build a recommender system to make predictions related to reviews of Electronics products on Amazon.

Your grades will be determined by your performance on the predictive tasks as well as a brief written report about the approaches you took.

This assignment should be completed **individually**.

## Files

**train.json** 1,000,000 reviews to be used for training. It is not necessary to use all reviews for training if doing so proves too computationally intensive. The fields in this file are:

* **reviewerID** The ID of the reviewer. This is a hashed user identifier from Amazon.

* **asin** The ID of the item. This is a hashed product identifier from Amazon.

* **overall** The rating the reviewer gave the item.

* **helpful** The helpfulness votes for the review. This has 2 subfields, 'nHelpful' and 'outOf'. The latter is the total number of votes this review received. The former is the number of those that considered the review to be helpful.

* **reviewText** The text of the review.

* **summary** The summary of the review.

* **unixReviewTime** The time of the review in seconds since 1970.

**meta.json** Contains metadata of the items:

* **asin** The ID of the item.

* **categories** The category labels of the item being reviewed.

* **price** The price of the item.

* **brand** The brand of the item.

**pairs_Rating.txt** The pairs (reviewerID and asin) on which you are to predict ratings.

**pairs_Purchase.txt** The pairs on which you are to predict whether a user purchased an item or not.

**pairs_Helpful.txt** The pairs on which you are to predict helpfulness votes. A third column in this file is the total number of votes from which you should predict how many were helpful.

**helpful.json** The review data associated with the helpfulness prediction test set. The 'nHelpful' field has been removed from this data since that is the value you need to predict above. This data will only be of use for the helpfulness prediction task.

**baseline.py** A simple baseline for each task.

## Tasks

**Rating prediction** Predict people's star ratings as accurately as possible for those (reviewerID, asin) pairs in 'pairs_Rating.txt'. Accuracy will be measured in terms of the [root mean-squared error (RMSE)](http://www.kaggle.com/wiki/RootMeanSquaredError).

**Purchase prediction** Given a (reviewerID, asin) pair from 'pairs_Purchase.txt', predict whether the user purchased the item (really, whether it was one of the items they reviewed). Accuracy will be measured in terms of the [categorization accuracy](http://www.kaggle.com/wiki/HammingLoss) (1 minus the Hamming loss).

**Helpfulness prediction** Predict whether a user's review of an item will be considered helpful. The file 'pairs_Helpful.txt' contains (reviewerID, asin) pairs with a third column containing the number of votes the user's review of the item received. You must predict how many of them were helpful. Accuracy will be measured in terms of the total [absolute error](http://www.kaggle.com/wiki/AbsoluteError), i.e. you are penalized according to the difference |nHelpful - prediction|, where 'nHelpful' is the number of helpful votes the review actually received, and 'prediction' is your prediction of this quantity.
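
The three evaluation measures described above can be sketched in a few lines. This is only an illustration of the definitions, not the Kaggle scoring code; the function names are made up for the example.

```python
import math

def rmse(actual, predicted):
    """Root mean-squared error, used for the rating task."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / float(len(actual)))

def accuracy(actual, predicted):
    """Categorization accuracy (1 minus the Hamming loss), used for the purchase task."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / float(len(actual))

def total_absolute_error(actual, predicted):
    """Sum of |nHelpful - prediction|, used for the helpfulness task."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))
```

For instance, predicting 2.5 and 1 helpful votes for reviews that actually received 3 and 0 gives a total absolute error of 0.5 + 1 = 1.5.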

We set up competitions on Kaggle to keep track of your results compared to those of other members of the class. The leaderboard will show your results on half of the test data, but your ultimate score will depend on your predictions across the whole dataset.
* Kaggle competition: [rating prediction](https://inclass.kaggle.com/c/cs591-hw3-rating-prediction3) click here to [join](https://kaggle.com/join/datascience16rating)
* Kaggle competition: [purchase prediction](https://inclass.kaggle.com/c/cs591-hw3-purchase-prediction) click here to [join](https://kaggle.com/join/datascience16purchase)
* Kaggle competition: [helpfulness prediction](https://inclass.kaggle.com/c/cs591-hw3-helpful-prediction) click here to [join](https://kaggle.com/join/datascience16helpful)

## Grading and Evaluation

You will be graded on the following aspects.

* Your written report. This should describe the approaches you took to each of the 3 tasks. You should not need to invent new approaches to obtain good performance (though you are more than welcome to); you will be graded on your decision to apply reasonable approaches to each of the given tasks. (**10pts** for each task)

* Your ability to obtain a solution which outperforms the baselines on the unseen portion of the test data. Obtaining full marks requires a solution which is substantially better (at least several percent) than baseline performance. (**10pts** for each task)

* Your ranking for each of the three tasks compared to other students in the class. (**5pts** for each task)

* Your ability to obtain a solution which outperforms the baselines on the seen portion of the test data (the leaderboard). (**5pts** for each task)

## Baselines

Simple baselines have been provided for each of the 3 tasks. These are included in 'baseline.py' among the files above. These 3 baselines operate as follows:

**Rating prediction** Returns the global average rating, or the user's average if you have seen them before in the training data.

**Purchase prediction** Finds the most popular products that account for 50% of purchases in the training data. Returns '1' whenever such a product is seen at test time, '0' otherwise.

**Helpfulness prediction** Multiplies the number of votes by the global average helpfulness rate, or the user's rate if we saw this user in the training data.

Running 'baseline.py' produces 3 files containing predicted outputs. Your submission files should have the same format.
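
As an illustration of the purchase baseline described above, the following sketch collects the most-reviewed items until they account for more than half of all purchases, then predicts '1' only for those items. It is written from the description, not copied from 'baseline.py'; the function names and the toy data are hypothetical.

```python
from collections import defaultdict

def popular_items(purchases, coverage=0.5):
    """Return the most-reviewed items that together exceed `coverage` of all purchases."""
    counts = defaultdict(int)
    for _user, item in purchases:
        counts[item] += 1
    chosen, covered = set(), 0
    # Walk the items from most to least reviewed (ties broken by item ID)
    for item, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        chosen.add(item)
        covered += n
        if covered > coverage * len(purchases):
            break
    return chosen

def predict_purchase(item, popular):
    """Predict 1 for a popular item and 0 otherwise."""
    return 1 if item in popular else 0

toy = [("u1", "a"), ("u2", "a"), ("u3", "a"), ("u1", "b"), ("u2", "b"), ("u3", "c")]
popular = popular_items(toy)
```

On the toy data, item `a` covers exactly half of the six purchases, so `b` is added as well before the threshold is exceeded.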

## Dataset Citation

**Image-based recommendations on styles and substitutes** J. McAuley, C. Targett, J. Shi, A. van den Hengel *SIGIR*, 2015

**Inferring networks of substitutable and complementary products** J. McAuley, R. Pandey, J. Leskovec *Knowledge Discovery and Data Mining*, 2015

In [1]:
# Libraries
import numpy as np
import pandas as pd

import gzip
from collections import defaultdict

import time
import sys
from scipy.stats import logistic


In [2]:
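import ast

def readJson(f):
  # Each line is a Python dict literal; literal_eval parses it without running arbitrary code
  with open(f) as fh:
    for l in fh:
      yield ast.literal_eval(l)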

now = time.time()
usersitem = defaultdict(dict)

rate = {}

for l in readJson("train.json"):
    
    usersitem[l['reviewerID']][l['asin']] = l['overall']

print "Done", len(usersitem), time.time()-now


Done 509678 37.1323318481


In [16]:
now = time.time()
itemsattr = defaultdict(list)

attrset = set()

for l in readJson("meta.json"):
    categ = l['categories'].translate(None,'\'').split("[[")[1].split("]]")[0].split(",")[1:2]
    itemsattr[l['asin']] = categ
    attrset |= set(categ)
    
print "Done", time.time()-now, len(itemsattr), len(attrset)

Done 8.72165489197 498196 42
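
The chained `translate`/`split` calls in the cell above depend on the exact textual layout of the `categories` field, and a label containing a comma would be cut in two. Assuming the field is the string form of a nested Python list (an assumption, not something the data description confirms), `ast.literal_eval` is a sturdier way to pull out the same second-level label. The helper name and the sample string below are hypothetical.

```python
import ast

def second_level_category(raw):
    """Return the second label of the first category path as a one-element list, or []."""
    # Accept either the raw string or an already-parsed nested list
    paths = ast.literal_eval(raw) if isinstance(raw, str) else raw
    if not paths:
        return []
    # Slicing with [1:2] yields [] when the path has no second level
    return [label.strip() for label in paths[0][1:2]]

sample = "[['Electronics', 'Car & Vehicle Electronics', 'Car Electronics']]"
labels = second_level_category(sample)
```

Unlike the original parsing, this sketch strips the leading space from each label, so dictionary keys built from it would not match the keys printed elsewhere in this notebook.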


In [17]:
print itemsattr['03975a07fc9d5777a251e73cd7421aff026c7c5d3d58b7d66fae6d0b9d48ff7a']
print usersitem['bc19970fff3383b2fe947cf9a3a5d7b13b6e57ef2cd53abc52bb2dfedf5fb1cd']['19e5cc4a706554d37670eabca2c19f1fc4f259361d78f0b58dafb91f3a863fc1']

[' Car & Vehicle Electronics']
2.0


In [18]:

userfeature = defaultdict(dict)
itemfeature = defaultdict(dict)

avgratedict = defaultdict(list)
useroffdict = defaultdict()

start = time.time()

def populateAvgDict():
    for user in usersitem.keys():
        for item in usersitem[user].keys():
            #print user, usersitem[user], item, itemsattr[item]
            avgratedict[item].append(usersitem[user][item])
            #for feature in list(attrset):
            for feature in itemsattr[item]:
                #if feature in itemsattr[item]:
                userfeature[user][feature] = usersitem[user][item]
                itemfeature[item][feature] = usersitem[user][item]
                """
                else:
                    userfeature[user][feature] = 0.0
                    itemfeature[item][feature] = 0.0
                """

                """
                if len(userfeature) == 5:
                    print userfeature, itemfeature
                    sys.exit(0)
                """    
    return avgratedict
        
avgratedict = populateAvgDict()  
print len(userfeature), len(itemfeature), time.time()-start

509678 171185 4.56281495094


In [19]:
print userfeature['bc19970fff3383b2fe947cf9a3a5d7b13b6e57ef2cd53abc52bb2dfedf5fb1cd']
print itemfeature['03975a07fc9d5777a251e73cd7421aff026c7c5d3d58b7d66fae6d0b9d48ff7a']


{' Computers & Accessories': 2.0, ' Television & Video': 4.0, ' Accessories & Supplies': 2.0}
{' Car & Vehicle Electronics': 3.0}


In [20]:
globalsum = 0
globalsize = 0

for k,v in avgratedict.iteritems():
    globalsum += sum(v)
    globalsize += len(v)
    
globalavg = globalsum/globalsize

print globalsum, globalsize, globalavg


3837538.0 1000000 3.837538


In [21]:
useroffset = defaultdict(list)
gloffsetsum = 0.0
gloffsetlen = 0
for user,items in usersitem.iteritems():
    for item, rating in items.iteritems():        
        off = usersitem[user][item] -  sum(avgratedict[item])/len(avgratedict[item])
        useroffset[user].append(off)
        gloffsetsum += off
        gloffsetlen+=1
        
gloffsetavg = gloffsetsum/gloffsetlen
print "Computed users' offset", len(useroffset), gloffsetsum, gloffsetlen
     

Computed users' offset 509678 -7.67386154621e-11 1000000


In [105]:
import math
#head start before SVD


def SVDBase(userid, itemid, k1=25, k2=25):
    ret = 0.0
    try:
        rateavg = avgratedict[itemid]
        offset = useroffset[userid] 
        #if not len(rateavg) or not len(offset):
            #print userid, itemid
        #if len(rateavg):
        itemavg = (globalavg*k1 + sum(rateavg))/(k1 + len(rateavg))
        #else:
            #itemavg = globalavg
        #if len(offset):
        offsetavg = (gloffsetavg*k2+sum(offset))/(k2+len(offset))
        #else:
            #offsetavg = 0.0

        ret =  min(5.0, max(0.5,itemavg+offsetavg))

    except KeyError:
        ret = min(5.0, max(0.5, globalavg))
        
    return ret


ratefile = open("preSVDpred_Rating.txt","w")
#intratefile = open("INTpreSVD_Rating.txt","w")
ratefile.write("reviewerID-asin,prediction\n")
#intratefile.write("reviewerID-asin,prediction\n")
start = time.time()

reqUser = defaultdict(dict)
reqItem = defaultdict(dict)
reqAttr = set()

with open("pairs_Rating.txt") as prate:
    for pair in prate:
        #print pair
        if pair.startswith("reviewer"):
            continue
        
        # strip() drops the newline; slicing off the last character would truncate an ID on a final line without one
        userid, itemid = pair.strip().split("-")
        basepredict = SVDBase(userid,itemid)                
        
        reqUser[userid][itemid] = basepredict
        reqItem[itemid][userid] = basepredict
        reqAttr |= set(itemsattr[itemid])
        usersitem[userid][itemid] = basepredict
        ratefile.write("%s-%s,%f\n" % (userid, itemid, basepredict))
        #intratefile.write("%s-%s,%d\n" % (userid, itemid, int(round(basepredict))))
        #sys.exit(0)
ratefile.close()
#intratefile.close()
        
print "Done", time.time()-start, len(reqUser), len(reqItem), len(reqAttr)

Done 0.780683994293 88257 47289 179
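
The `SVDBase` predictor in the cell above adds a shrunken item average to a shrunken user offset: each mean is pulled toward its global value as if `k` extra observations equal to that value had been seen. A minimal sketch of that idea on plain lists follows; the helper names are made up for illustration and the global offset is assumed to be 0 by default.

```python
def damped_mean(values, prior, k=25):
    """Mean of `values` shrunk toward `prior` by k pseudo-observations."""
    return (prior * k + sum(values)) / float(k + len(values))

def base_predict(item_ratings, user_offsets, global_avg, global_offset=0.0, k1=25, k2=25):
    """Shrunken item average plus shrunken user offset, clamped to [0.5, 5.0]."""
    item_avg = damped_mean(item_ratings, global_avg, k1)
    offset_avg = damped_mean(user_offsets, global_offset, k2)
    return min(5.0, max(0.5, item_avg + offset_avg))
```

With no ratings and no offsets the prediction falls back to the global average, and an item with only a handful of 5-star ratings is predicted well below 5 until more evidence accumulates.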


------------

# Purchase Prediction
prediction(u buys i) = roundOff(prob u likes i) * roundOff(popularity index i)

prob u likes i = SVDBase(u,i)/5.0

popularity index i = rank i/num items*1.0
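
Read literally, the rule above multiplies two 0/1 indicators, so it predicts a purchase only when the user is likely to like the item and the item sits in the more popular half. The sketch below follows that reading, with rounding implemented as a threshold at 0.5 (an assumption made to avoid differences in how `round` treats ties); the function names and counts are hypothetical.

```python
def popularity_index(item_counts):
    """Map each item to rank / num_items, where rank 0 is the least-reviewed item."""
    ordered = sorted(item_counts, key=lambda item: (item_counts[item], item))
    n = float(len(ordered))
    return dict((item, rank / n) for rank, item in enumerate(ordered))

def predict_purchase(predicted_rating, pop_index):
    """Product of the rounded like-probability and the rounded popularity index."""
    likes = 1 if predicted_rating / 5.0 >= 0.5 else 0
    popular = 1 if pop_index >= 0.5 else 0
    return likes * popular

index = popularity_index({"a": 1, "b": 5, "c": 3, "d": 9})
```

Here `index` maps `a`, `c`, `b`, `d` to 0.0, 0.25, 0.5 and 0.75, so only `b` and `d` can ever receive a positive prediction.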

In [10]:
import math

itemCount = defaultdict(int)
totalPurchases = 0

for l in readJson('train.json'):
    user,item = l['reviewerID'],l['asin']
    itemCount[item] += 1
    totalPurchases += 1

# Sort ascending by review count, so rank 0 is the least-reviewed item
mostPopular = [(itemCount[x], x) for x in itemCount]
mostPopular.sort()

# Map each item to its position in the popularity ordering
itemrank = {}
for index, (ic, i) in enumerate(mostPopular):
    itemrank[i] = index




In [11]:
predictions = open("predictions_Purchase.txt", 'w')
for l in open("pairs_Purchase.txt"):
    if l.startswith("reviewerID"):
        predictions.write(l)
        continue
    u,i = l.strip().split('-')

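    try:
        # itemrank holds ascending popularity ranks, so the ratio is below 0.5 for the less-reviewed half
        pred = int(((SVDBase(u,i)>globalavg) ^ (1.0*itemrank[i]/len(itemrank) < 0.5)))
    except KeyError:
        # Item never seen in training: predict no purchase
        pred = 0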
        
    predictions.write(u + '-' + i + ","+str(pred)+"\n")
    """  
    if i in return1 and pred>globalavg:
        predictions.write(u + '-' + i + ",1\n")
    else:
        predictions.write(u + '-' + i + ",0\n")
    """
predictions.close()

## DATA REDUCTION

In [15]:
attrlist = sorted(reqAttr)
userlist = list(reqUser.keys())
itemlist = list(reqItem.keys())

r = [0.0 for _ in xrange(len(attrlist))]
userfeat = [r for _ in xrange(len(reqUser)) ]
itemfeat = [r for _ in xrange(len(reqItem)) ]

print len(userfeat), len(userfeat[0]), len(itemfeat), len(itemfeat[0])

start = time.time()
#print list(reqUser.keys()).index("f0ce42c52f549e542b28cb6351b93814be2c571809bca8eab2e191e601ada746")
U = np.array(userfeat)
V = np.array(itemfeat)

print U.shape, V.shape

for u in xrange(len(userfeat)):
    for f in xrange(len(attrlist)):            
        try:
            U[u][f] = userfeature[userlist[u]][attrlist[f]]
        except KeyError:
            U[u][f] = 0.0
    
for i in xrange(len(itemfeat)):
    for f in xrange(len(attrlist)):                
        try:
            V[i][f] = itemfeature[itemlist[i]][attrlist[f]]
        except KeyError:
            V[i][f] = 0.0
            
print "Done", time.time()-start            


88257 4 47289 4
(88257, 4) (47289, 4)
Done 0.786651849747


### NumPy SVD using a SciPy sparse matrix

In [34]:
from scipy.sparse import csc_matrix as cs
sparseRate = cs((len(userlist),1), dtype=np.float64)
for user in xrange(len(userlist)):    
    for item in xrange(len(itemlist)):
        break
    break
        

## SVD (Training too sloooow)

In [13]:
start = time.time()
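def predictRating(user,item):
    # Dot product of the user and item factor vectors, clamped once to the rating range
    ret = 0.0
    for i in xrange(len(attrlist)):
        ret += U[user][i]*V[item][i]
    return min(5.0, max(0.0, ret))
        
def train(f, user, item, rating, lrate=0.002, K=0.01):
    # One SGD step on factor f; the learning rate is applied once, in the updates below
    err = rating - predictRating(user,item)
    uv = U[user][f]  # keep the old user value so the item update does not see the new one
    
    U[user][f] += lrate*(err*V[item][f] - K*uv)
    V[item][f] += lrate*(err*uv - K*V[item][f])   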
"""
for u in xrange(10):
    for i in xrange(len(itemlist)):
        for f in xrange(len(attrlist)):
            try:
                rating = usersitem[userlist[u]][itemlist[i]]
            except KeyError:
                continue
            train(f,u,i,rating)
"""   
ratefile = open("pseudoSVD_Rating.txt","w")
ratefile.write("reviewerID-asin,prediction\n")

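# Build ID -> row lookups once; list.index() would rescan the whole list for every pair
userindex = dict((u, n) for n, u in enumerate(userlist))
itemindex = dict((i, n) for n, i in enumerate(itemlist))

with open("pairs_Rating.txt") as prate:
    for pair in prate:
        if pair.startswith("reviewer"):
            continue
        
        userid, itemid = pair.strip().split("-")
        basepredict = predictRating(userindex[userid],itemindex[itemid])                
                
        ratefile.write("%s-%s,%f\n" % (userid, itemid, basepredict))
        
ratefile.close()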
            
print "Time", time.time()-start

KeyboardInterrupt: 
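
For reference, the kind of training loop this cell is aiming at can be sketched on a toy problem. The version below is a common formulation of SGD matrix factorization (one error per observed rating, then an update of every factor), not the exact per-feature loop above; it uses plain lists instead of NumPy, and the function names, hyperparameters and toy ratings are all assumptions made for illustration.

```python
import random

def predict(U, V, u, i):
    """Dot product of user u's and item i's factor vectors."""
    return sum(a * b for a, b in zip(U[u], V[i]))

def train_mf(ratings, n_users, n_items, n_factors=2, lrate=0.01, reg=0.01, epochs=200, seed=0):
    """Fit U and V by SGD so that predict(U, V, u, i) approximates each observed rating."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.5, 1.5) for _ in range(n_factors)] for _ in range(n_users)]
    V = [[rng.uniform(0.5, 1.5) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for f in range(n_factors):
                uf, vf = U[u][f], V[i][f]  # old values, so both updates use the same state
                U[u][f] += lrate * (err * vf - reg * uf)
                V[i][f] += lrate * (err * uf - reg * vf)
    return U, V

def train_rmse(ratings, U, V):
    """RMSE over the observed (user, item, rating) triples."""
    total = sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings)
    return (total / float(len(ratings))) ** 0.5

# (user index, item index, rating) triples
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
U, V = train_mf(toy, n_users=3, n_items=3)
```

Iterating only over observed ratings, rather than over every user-item combination, is what keeps each epoch proportional to the number of reviews.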

-----------------

### Recsys library SVD

In [9]:
ratingfile = "myrating.dat"
rf = open(ratingfile,"w")
for u,v in usersitem.iteritems():
    for i,r in v.iteritems():
        rf.write("%s::%s::%s\n" %(u,i,r))
rf.close()

In [10]:
from recsys.algorithm.factorize import SVD
from recsys.datamodel.data import Data

svd = SVD()
data = Data()

start = time.time()
data.load(ratingfile, sep='::', format={'col':0, 'row':1, 'value':2, 'ids':str})
print "Data loaded", time.time()-start
K=100
svd.set_data(data)
svd.compute(k=K, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True, savefile='/tmp/itemsSVD')
print "Computed", time.time()-start

Data loaded 15.504899025
Computed 32.8592751026


In [27]:

def SVDBase(userid, itemid):
    ret = 0.0
    try:
        rateavg = avgratedict[itemid]
        offset = useroffset[userid] 
        #if not len(rateavg) or not len(offset):
            #print userid, itemid
        if len(rateavg):
            itemavg = (globalavg*25 + sum(rateavg))/(25 + len(rateavg))
        else:
            itemavg = globalavg
        if len(offset):
            offsetavg = sum(offset)/len(offset)
        else:
            offsetavg = 0.0

        ret =  min(5.0, max(0.5,itemavg+offsetavg))

    except KeyError:
        ret = min(5.0, max(0.5, globalavg))
        
    return ret




ratefile = open("recsys_SVD.txt","w")
ratefile.write("reviewerID-asin,prediction\n")

with open("pairs_Rating.txt") as prate:
    for pair in prate:
        #print pair
        if pair.startswith("reviewer"):
            continue
        
        # strip() drops the newline; slicing off the last character would truncate an ID on a final line without one
        userid, itemid = pair.strip().split("-")
        try:
            pred_rating = svd.predict(itemid, userid)
        except KeyError:
            pred_rating = SVDBase(userid, itemid)
                
        print userid, itemid, pred_rating
        ratefile.write("%s-%s,%f\n" % (userid, itemid, pred_rating))
        break
ratefile.close()


f0ce42c52f549e542b28cb6351b93814be2c571809bca8eab2e191e601ada746 6116d31a297ceb0f8f69f6f71e924e47136fc70c6f5bf75c7af0363663760159 2.00591303169


## CLF and Logistic Regression for Prediction

In [9]:
# Homework 3
# __future__ imports belong before all other statements
from __future__ import division

import numpy
import urllib
import scipy.optimize
import random
import math
import re
from sklearn.metrics import hamming_loss
from sklearn import linear_model, datasets
from sklearn.linear_model import LogisticRegression
import gzip
from collections import defaultdict
import warnings
def readJson(f):
  for l in open(f):
    yield eval(l)

def parseData(fname):
  for l in urllib.urlopen(fname):
    yield eval(l)
    
allRatings = []
userRatings = defaultdict(list)
data = []

for l in readJson("train.json"):
    data.append(l)

In [10]:
#training set
attrlist = set([])
training = []
train = data
users = set([])
items = set([])
inTrain = set([])
for l in train:
    #set up the training set with 1s-------------------
    training.append((l['reviewerID']+"-"+l['asin'], l['overall']))
    users.add(l['reviewerID'])
    items.add(l['asin'])
"""
for l in data:
    inTrain.add((l['reviewerID'],l['asin']))
#Make training set with 0's----------------------------
userID = list(users)
itemID = list(items)
#make elements that are not bought
end = 2*len(training)
while len(training)<end:
    u = userID[random.randrange(0, len(users))]
    i = itemID[random.randrange(0, len(items))]
    if (u,i) not in inTrain:
        inTrain.add((u,i))
        training.append((u+"-"+i, 0))
#Training set created        
"""
print len(training)

1000000


In [24]:
"""
userlist = list(reqUser.keys())
itemlist = list(reqItem.keys())
"""
reqAttr = set()

with open("pairs_Rating.txt") as prate:
    for pair in prate:
        #print pair
        if pair.startswith("reviewer"):
            continue
        
        # strip() drops the newline; slicing off the last character would truncate an ID on a final line without one
        userid, itemid = pair.strip().split("-")

        reqAttr |= set(itemsattr[itemid])

attrlist = sorted(reqAttr)
print len(attrlist)
#r = [0.0 for _ in xrange(len(attrlist))]
#useritemX = [r+r for _ in xrange(len(training)) ]
#iratingY = [[0] for _ in xrange(len(training)) ]

#print len(useritemX), len(useritemX[0]), len(iratingY), len(iratingY[0])

start = time.time()
#print list(reqUser.keys()).index("f0ce42c52f549e542b28cb6351b93814be2c571809bca8eab2e191e601ada746")
#U = np.array(useritemX)
#V = np.array(iratingY)

#print U.shape, V.shape

useritemX = []

attrlistrate = defaultdict(list)
uifeatlist = {}

for l in train:
    #training.append((l['reviewerID']+"-"+l['asin'], l['overall']))
    #iratingY.append(int(round(10*l['overall'])))
    
    # One slot per category, filled in below
    uAttr = [0] * len(attrlist)
    iAttr = [0] * len(attrlist)
    for f in xrange(len(attrlist)):            
        try:
            uAttr[f] = int(userfeature[l['reviewerID']][attrlist[f]])
        except KeyError:
            uAttr[f] = 0    
        try:
            iAttr[f] = int(itemfeature[l['asin']][attrlist[f]])
        except KeyError:
            iAttr[f] = 0
    
    useritemX.append(uAttr+iAttr)
    uifeatlist[l['reviewerID']+l['asin']] = "".join(map(str, uAttr+iAttr))
    #print uAttr+iAttr, "".join(map(str, uAttr+iAttr))
    #break
    attrlistrate["".join(map(str, uAttr+iAttr))].append(int(l['overall']))
print "Done", time.time()-start    



27
Done 87.5090360641
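
The feature vector built in the cell above is the user's per-category values followed by the item's per-category values, with 0 wherever a category is missing, and the same vector joined into a string serves as a grouping key. A compact sketch of that construction, with hypothetical category names and a made-up helper name:

```python
def pair_features(user_feats, item_feats, attrs):
    """Concatenate user and item per-category values, using 0 for absent categories."""
    u_part = [int(user_feats.get(a, 0)) for a in attrs]
    i_part = [int(item_feats.get(a, 0)) for a in attrs]
    return u_part + i_part

attrs = sorted(["Audio", "Cameras", "Computers"])
user_feats = {"Audio": 4.0, "Computers": 2.0}
item_feats = {"Cameras": 5.0}

vec = pair_features(user_feats, item_feats, attrs)
key = "".join(map(str, vec))  # string form used to group identical vectors
```

Using `dict.get` with a default avoids both the `try`/`except` pairs and the side effect of a `defaultdict` creating an empty entry on every failed lookup.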


In [25]:
start = time.time()
iratingY = []

for l in train:
    
    """    
    uAttr = [[0] for _ in xrange(4)]
    iAttr = [[0] for _ in xrange(4)]
    for f in xrange(len(attrlist)):            
        try:
            uAttr[f] = int(userfeature[l['reviewerID']][attrlist[f]])
        except KeyError:
            uAttr[f] = 0    
        try:
            iAttr[f] = int(userfeature[l['reviewerID']][attrlist[f]])
        except KeyError:
            iAttr[f] = 0
    """

    try:
        featvec = attrlistrate[uifeatlist[l['reviewerID']+l['asin']]]
    except KeyError:
        featvec = []
    pairsum = (globalavg*25 + sum(featvec))
    pairlen = (25+len(featvec))
    iratingY.append(int(round(10.0*pairsum/pairlen)))
    #break
print "Done", time.time()-start    


Done 31.4811000824


In [26]:
print len(useritemX), useritemX[0], iratingY[0], len(iratingY)
print "Done", time.time()-start    


1000000 [0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 37 1000000
Done 31.4854311943


In [27]:

"""
for u in xrange(len(userfeat)):
    for f in xrange(len(attrlist)):            
        try:
            U[u][f] = userfeature[userlist[u]][attrlist[f]]
        except KeyError:
            U[u][f] = 0.0
    
for i in xrange(len(itemfeat)):
    for f in xrange(len(attrlist)):                
        try:
            V[i][f] = itemfeature[itemlist[i]][attrlist[f]]
        except:
            V[i][f] = 0.0
"""            
U = np.array(useritemX)
V = np.array(iratingY)

print U.shape, V.shape
print "Done", time.time()-start, V[0], U[0]

(1000000, 54) (1000000,)
Done 33.7836720943 37 [0 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 2 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [22]:
#logistic regression: build the feature vectors X and the labels y

X=[]
y=[]
print training[0]
for (ui, rating) in training:
    
    u,i = ui.strip().split('-')
    #print u, sum(useroffset[u])
    #print i, sum(avgratedict[i])
    avgoff = sum(useroffset[u])/len(useroffset[u])
    avgrate = sum(avgratedict[i])/len(avgratedict[i])
    elem=[avgrate,avgoff]
    
    y.append(rating)
    X.append(elem)

print y[0], X[0]

('bc19970fff3383b2fe947cf9a3a5d7b13b6e57ef2cd53abc52bb2dfedf5fb1cd-a6ed402934e3c1138111dce09256538afb04c566edf37c16b9ba099d23afb764', 2.0)
2.0 [2.0, -0.354421768707483]


In [23]:
#logistic regression
logreg = linear_model.LogisticRegression()
logreg.fit(X, y)
print logreg.coef_

[[-2.03834256 -2.11525601]
 [-0.71004675 -0.75658956]
 [-0.35640411 -0.35394266]
 [ 0.06877642  0.11963284]
 [ 2.30445359  2.41390122]]


In [28]:
from sklearn import linear_model

clf = linear_model.SGDClassifier()
clf.fit(U, V)


SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)

In [29]:
#predict using regression 
warnings.filterwarnings("ignore", category=DeprecationWarning) 
clfpredictions = open("SGD_Rating.txt", 'w')
clfpredictions.write("reviewerID-asin,prediction\n")

for l in open("pairs_Rating.txt"):
    if l.startswith("reviewer"):
        #header
        continue
    userid,itemid = l[:-1].strip().split('-')
    
    uAttr = [0] * len(attrlist)
    iAttr = [0] * len(attrlist)
    for f in xrange(len(attrlist)):            
        try:
            uAttr[f] = int(userfeature[userid][attrlist[f]])
        except KeyError:
            uAttr[f] = 0    
        try:
            # Item features come from itemfeature, matching how the training vectors were built
            iAttr[f] = int(itemfeature[itemid][attrlist[f]])
        except KeyError:
            iAttr[f] = 0
                        
    warnings.filterwarnings("ignore", category=DeprecationWarning) 
    sgdpred = clf.predict([uAttr+iAttr])[0]
    
    offset = useroffset[userid] 
    if len(offset):
        offsetavg = sum(offset)/len(offset)
    else:
        offsetavg = 0.0
        
    sgdpred = min(5.0, max(0.5,sgdpred/10+offsetavg))
    #print sgdpred, sgdpred/10+offsetavg
    #break

    clfpredictions.write(userid + '-' + itemid + ","+str(sgdpred)+"\n")

    #break
clfpredictions.close()

In [55]:
#predict using regression 
warnings.filterwarnings("ignore", category=DeprecationWarning) 
predictions = open("logreg_Rating.txt", 'w')
predictions.write("reviewerID-asin,prediction\n")
clfpredictions = open("SGD_Rating.txt", 'w')
clfpredictions.write("reviewerID-asin,prediction\n")

for l in open("pairs_Rating.txt"):
    if l.startswith("reviewer"):
        #header
        continue
    userid,itemid = l[:-1].strip().split('-')
    
    rateavg = avgratedict[itemid]
    offset = useroffset[userid] 
    #if not len(rateavg) or not len(offset):
        #print userid, itemid
    if len(rateavg):
        itemavg = (globalavg*25 + sum(rateavg))/(25 + len(rateavg))
    else:
        itemavg = globalavg
    if len(offset):
        offsetavg = sum(offset)/len(offset)
    else:
        offsetavg = 0.0
        
            
    #print u,i, itemsRank[0], usersRank[0]
    warnings.filterwarnings("ignore", category=DeprecationWarning) 
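    # predict() expects a 2-D array: one row per sample
    pred = logreg.predict([[itemavg,offsetavg]])[0]
    # clf must have been fit on these same two features for this call to be valid
    sgdpred = clf.predict([[itemavg,offsetavg]])[0]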
    #print sgdpred
    #sys.exit(0)
    #print pred
    
    #sys.exit(0)
    #if (logreg.predict([itemsRank,usersRank])==[0]):
    predictions.write(userid + '-' + itemid + ","+str(pred)+"\n")
    clfpredictions.write(userid + '-' + itemid + ","+str(sgdpred)+"\n")
    #else:
        #predictions.write(u + '-' + i + ",1\n")
    #break
predictions.close()
clfpredictions.close()

In [None]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()