### Helpful Prediction:

    While looking at methods for the first task, I came across the decision tree models in scikit-learn: http://scikit-learn.org/stable/modules/tree.html.
    These models are useful for predicting multi-valued attributes. If we can estimate the fraction of helpful votes and multiply it by the total number of votes ('outOf'), we can predict nHelpful for a given user-item pair. The difficulty with a standard regression model is that the target can take infinitely many real values in [0,1].
    
    A very naive way to solve this kind of problem is to build n independent models, i.e. one for each output, and then to use those models to independently predict each one of the n outputs. However, because it is likely that the output values related to the same input are themselves correlated, an often better way is to build a single model capable of predicting simultaneously all n outputs. First, it requires lower training time since only a single estimator is built. Second, the generalization accuracy of the resulting estimator may often be increased.
    Decision trees are therefore a good alternative for fitting multi-valued data and predicting the helpful fraction for a user-item pair.
   
     To use this model effectively, I first tried sentiment analysis with vaderSentiment, which proved to be very time-consuming. Instead I used a word-counting analyzer called textstat, available at https://pypi.python.org/pypi/textstat/0.1.4 (pip install textstat). This gives the number of relevant words in the review, from which we can estimate how good the review might be. An actual sentiment score from vaderSentiment would probably work better, but this was a trade-off I chose.
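     As a rough illustration of the text features described above, here is a minimal sketch. It assumes a plain whitespace split as a stand-in for textstat's word count, so the numbers may differ slightly from what textstat reports; the function name `review_features` and the sample strings are hypothetical.

```python
import string

def review_features(review_text, summary_text):
    """Return [review word count, summary word count, punctuation per word]."""
    review_text = review_text or ''
    summary_text = summary_text or ''
    # Stand-in for textstat's word count: whitespace-separated tokens.
    review_words = len(review_text.split())
    summary_words = len(summary_text.split())
    # Count punctuation characters in the review text.
    punct_chars = sum(1 for c in review_text if c in string.punctuation)
    # Avoid dividing by zero for empty reviews.
    punct_ratio = float(punct_chars) / review_words if review_words else 0.0
    return [review_words, summary_words, punct_ratio]

# Made-up example input, for illustration only.
features = review_features("Great phone, works well!", "Great phone")
```

     Empty or missing review text yields zeros rather than an error, which mirrors the guard against a zero word count in the notebook cells.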
     
     Using these feature vectors, we then need to decide what the tree should look like. The decision tree model has several parameters, such as the minimum number of samples per leaf (min_samples_leaf) and the maximum depth of the tree (max_depth). Since these two values largely control the size of the tree, I searched for good enough values using two nested for loops, one over leaf size and one over depth.
     
     Using the word counts as a proxy for how good the review might be, together with the review rating, we can build our model. The tree is then fitted on the training dataset, so that each leaf holds a prediction. The trained model is then ready to predict the fraction of helpful votes for a review. Multiplying this by the 'outOf' field gives the actual count of helpful votes.
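     The last step, turning a predicted fraction into a vote count and scoring it, can be sketched as follows. This is an illustration only: the clamping to [0, 1] is an assumption that is not part of the notebook, and the function names and sample numbers are made up.

```python
def helpful_count(predicted_fraction, out_of):
    """Convert a predicted helpful fraction into a predicted vote count."""
    # Assumption: clamp to [0, 1] so the count stays between 0 and out_of.
    fraction = min(1.0, max(0.0, predicted_fraction))
    return fraction * out_of

def mean_absolute_error(predicted_counts, actual_counts):
    """Average absolute difference between predicted and actual counts."""
    total = sum(abs(p - a) for p, a in zip(predicted_counts, actual_counts))
    return float(total) / len(actual_counts)

# Made-up predictions (fraction, outOf) and made-up true nHelpful values.
predicted = [helpful_count(0.75, 4), helpful_count(0.5, 10), helpful_count(1.2, 2)]
actual = [3, 4, 2]
error = mean_absolute_error(predicted, actual)
```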
   
    A few advantages of decision trees listed in the scikit-learn manual:
    The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
    Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable.
    Able to handle multi-output problems.
    Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
    
    

In [1]:
from collections import defaultdict
import numpy
import scipy.optimize

from sklearn.tree import DecisionTreeRegressor as tree
from textstat.textstat import textstat
import string
import random
#from vaderSentiment.vaderSentiment import sentiment as vaderSentiment
#from vaderSentiment import sentiment as vaderSentiment 
from nltk.stem import PorterStemmer as ps


In [2]:
def readJson(f):
  for l in open(f):
    yield eval(l)
    
print "Done"

Done
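
A possible safer variant of the reader, sketched under the assumption that every line of the file is a Python dict literal (which the use of eval above suggests). `ast.literal_eval` only parses literals and does not execute arbitrary code. The name `read_records` and the sample record are hypothetical.

```python
import ast

def read_records(lines):
    """Yield one dict per non-empty line of Python-literal text."""
    for line in lines:
        line = line.strip()
        if line:
            # literal_eval parses literals only and will not run code.
            yield ast.literal_eval(line)

# Made-up sample lines in the same shape as the fields used in this notebook.
sample_lines = [
    "{'reviewerID': 'U1', 'asin': 'I1', 'helpful': {'nHelpful': 2, 'outOf': 3}}",
    "",
]
records = list(read_records(sample_lines))
```

Passing an open file object instead of a list works the same way, since both iterate line by line.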


In [24]:
#same as baseline - parse data
allHelpful = []
userRate = {}
userHelpful = defaultdict(list)
data = []
for l in readJson("train.json"):
    data.append(l)
    user,item = l['reviewerID'],l['asin']
    allHelpful.append(l['helpful'])
    userHelpful[user].append(l['helpful'])

ratingAvg = sum([x['nHelpful'] for x in allHelpful]) * 1.0 / sum([x['outOf'] for x in allHelpful])

for u in userHelpful:
    totalU = sum([x['outOf'] for x in userHelpful[u]])
    if totalU > 0:
        userRate[u] = sum([x['nHelpful'] for x in userHelpful[u]]) * 1.0 / totalU
    else:
        userRate[u] = ratingAvg
print ratingAvg


0.810134030999
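
The per-user rate with a global fallback computed above can be illustrated on a tiny example. This is a sketch with made-up votes; the function name `helpful_rates` and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict

def helpful_rates(votes):
    """votes: iterable of (user, nHelpful, outOf) tuples.

    Returns (global_rate, per_user_rate). Users with no votes at all
    fall back to the global rate.
    """
    helpful_sum = defaultdict(int)
    out_of_sum = defaultdict(int)
    for user, n_helpful, out_of in votes:
        helpful_sum[user] += n_helpful
        out_of_sum[user] += out_of
    global_rate = float(sum(helpful_sum.values())) / sum(out_of_sum.values())
    per_user = {}
    for user in out_of_sum:
        if out_of_sum[user] > 0:
            per_user[user] = float(helpful_sum[user]) / out_of_sum[user]
        else:
            per_user[user] = global_rate
    return global_rate, per_user

# Made-up votes: u2 has reviews but nobody voted on them.
votes = [('u1', 3, 4), ('u1', 1, 4), ('u2', 0, 0), ('u3', 2, 2)]
global_rate, per_user = helpful_rates(votes)
```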


In [5]:
Xytest = []

for l in data[900000:]:
    Xytest.append(l)
    
print len(Xytest)
random.shuffle(Xytest)

100000


### Fill training data
Takes some time to go through all the reviews and generate vectors

In [8]:
X = []
y = []

stemmer = ps()
for l in data[:900000]:
    userid, itemid = l['reviewerID'], l['asin']
    review,summary = l['reviewText'], l['summary']
    rating,helpful = l['overall'], l['helpful']
    
    userAvgRate = ratingAvg if userid not in userRate else userRate[userid]
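    # Note: PorterStemmer.stem treats its argument as a single word, so this mainly lowercases the review
    review = stemmer.stem(review.lower()) if review else ''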
    assert review is not None
    
    reviewCount = textstat.lexicon_count(review)
    summaryCount = textstat.lexicon_count(summary)
    
    #vader is too slow
    #reviewvs = vaderSentiment(review) 
    #summaryvs = vaderSentiment(summary)    
    #reviewscore = (-reviewvs['neg']+reviewvs['pos'])*5
    #summaryscore = (-summaryvs['neg']+summaryvs['pos'])*5

    if not reviewCount:
        punct = 0 
    else:
        punct = (1.0*sum([c in string.punctuation for c in review]))/reviewCount
        
    X.append([userAvgRate, rating, reviewCount, summaryCount, punct])
    if helpful['outOf']==0 or helpful['nHelpful'] == 0:
        y.append(userAvgRate)
    else:
        y.append((helpful['nHelpful']*1.0)/helpful['outOf'])
print X[0], y[0]        

[0.7272727272727273, 2.0, 159, 2, 0.16352201257861634] 1.0


In [9]:
XTree = []
for x in X:
    #3.88 is the global avg from task1
    x = [x[0], round(x[1]/3.88, 2), x[2], x[3], x[4]]
    XTree.append(x)

    An alternative approach with the same features is a random forest, but RandomForest gave worse predictions.

In [14]:
"""from sklearn.ensemble.forest import RandomForestRegressor
RF = RandomForestRegressor(n_estimators = 5)"""

'from sklearn.ensemble.forest import RandomForestRegressor\nRF = RandomForestRegressor(n_estimators = 5)'

In [15]:
"""X = np.array(X)
y = np.array(y)
RF.fit(X, y)
yPred = RF.predict(X)"""

'X = np.array(X)\ny = np.array(y)\nRF.fit(X, y)\nyPred = RF.predict(X)'

### Before computing the training error, find the optimal tree size
    (not exactly sure to what extent this affects predictions)
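
    The search pattern used here, two nested loops that keep the parameter pair with the lowest error, can be sketched in isolation. The error function below is a made-up surface used only to exercise the loop; in the notebook the error comes from fitting a tree and measuring the absolute error on held-out reviews.

```python
def grid_search(leaf_values, depth_values, error_fn):
    """Return (best_error, best_leaf, best_depth) over all combinations."""
    best = (float('inf'), None, None)
    for leaf in leaf_values:
        for depth in depth_values:
            err = error_fn(leaf, depth)
            # Keep the first pair that achieves the lowest error seen so far.
            if err < best[0]:
                best = (err, leaf, depth)
    return best

def toy_error(leaf, depth):
    """Hypothetical error surface with its minimum at leaf=29, depth=9."""
    return abs(leaf - 29) * 0.01 + abs(depth - 9) * 0.02 + 0.6

best_err, best_leaf, best_depth = grid_search(
    range(27, 33), range(5, 16, 2), toy_error)
```

    Recording the leaf and depth together with the error in one tuple keeps the three values in sync, which is easy to get wrong when they are tracked in separate variables.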

In [61]:
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning) 


glerr = float('inf')
optLeaf, optDepth = None, None
rate = 0.0

for leaf in xrange(27, 33):
    print 'leaf', leaf, optLeaf, rate
    """
    Predict with each model and find minimum error parameters
    """
    for depth in range(5, 16, 2):
        model = tree(min_samples_leaf=leaf, max_depth=depth)
        model = model.fit(XTree,y)

        err = 0
        for l in Xytest[:10000]:            
            user,item,review,rating,helpful,summary= l['reviewerID'],l['asin'],l['reviewText'],l['overall'],l['helpful'],l['summary']
            if user not in userRate:
                uPred = ratingAvg  
            else:
                uPred = userRate[user]
            review = review if review else ''
            reviewc = textstat.lexicon_count(review)
            summaryc = textstat.lexicon_count(summary)
            punctR = 0 if not reviewc else (1.0*sum([c in string.punctuation for c in review]))/reviewc

            x = [uPred,  round(rating/3.88,2), reviewc, summaryc, punctR]
            # predict expects a 2D array: one row per sample
            pred = model.predict([x])[0]
            #pred = RF.predict(x)
            pred = l['helpful']['outOf'] * pred
            err += abs(pred-l['helpful']['nHelpful'])

        rate = 1.0*err/10000
        if rate < glerr:
            glerr = rate
            """
            Later use these to build predicting model:
            """
            optLeaf, optDepth  = leaf, depth
        
    # NOTE: merror is not defined in any cell shown here; glerr holds the best error found so far
    print merror


leaf 27 None 0.0
0.682503844284
leaf 28 27 0.624499473423
0.682503844284
leaf 29 28 0.62347196651
0.682503844284
leaf 30 28 0.626219810311
0.682503844284
leaf 31 30 0.627378435183
0.682503844284
leaf 32 30 0.626012148483
0.682503844284


In [63]:
print optLeaf, optDepth

32 5


    Create the model with the parameters found above.
    The best results I've seen so far are with leaf=29, depth=10 (the cell below sets max_depth=12).

In [65]:
#model = tree(min_samples_leaf=optLeaf, max_depth=optDepth)
model = tree(min_samples_leaf=29, max_depth=12)
model = model.fit(XTree,y)

## Prediction results

In [66]:
pairs = open("pairs_Helpful.txt").readlines()

predfile = open("secMypredictions_Helpful.txt", 'w')
predfile.write(pairs[0])

index=0
for l in readJson("helpful.json"):
    index += 1
    
    user,item,= l['reviewerID'],l['asin'] 
    review,rating = l['reviewText'],l['overall']
    helpful,summary = l['outOf'],l['summary']
    
    userPred = ratingAvg if user not in userRate else userRate[user]
    review = review if review else ''
    assert review is not None
    
    reviewcnt = textstat.lexicon_count(review)
    summarycnt = textstat.lexicon_count(summary)
    if not reviewcnt:
        punctR = 0  
    else:
        punctR = (1.0*sum([c in string.punctuation for c in review]))/reviewcnt
    
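    # Scale the rating the same way as the training features in XTree
    x = [userPred, round(rating/3.88, 2), reviewcnt, summarycnt, punctR]
    # predict expects a 2D array: one row per sample
    prediction = model.predict([x])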
    #prediction = RF.predict(x)    
    u,i,outOf = pairs[index].strip().split('-')
    outOf = int(outOf)    
    #print u,i,outOf, prediction[0]
    #break

    predfile.write(u + '-' + i + '-' + str(outOf) + ',' + str(prediction[0]*outOf) + '\n')
    
predfile.close()