## Imports

In [1]:
import pandas as pd
import numpy as np

In [2]:
dfr = pd.read_csv('../Data/2.electronics_cleaned.csv',header=0)

In [3]:
dfr.head()

Unnamed: 0,reviewerId,productId,ratings,timestamp
0,AO94DHGC771SJ,528881469,5.0,1370131200
1,AMO214LNFCEI4,528881469,1.0,1290643200
2,A3N7T0DY83Y4IG,528881469,3.0,1283990400
3,A1H8PY3QHMQQA0,528881469,2.0,1290556800
4,A24EV6RXELQZ63,528881469,1.0,1317254400


## Model Selection

### Surprise Package

Surprise package: http://surpriselib.com/

References

* Documentation: https://surprise.readthedocs.io/en/latest/
* Installation: http://surpriselib.com/
* GitHub: https://github.com/NicolasHug/Surprise

There are a lot of different packages available to build a recommender system. For this one, I'm using the Surprise package. Surprise provides various ready-to-use prediction algorithms such as baseline algorithms, neighborhood methods, matrix factorization-based methods (SVD, PMF, SVD++, NMF), and many others. Various similarity measures (cosine, MSD, Pearson, ...) are also built in.

In this case, I need to load a custom dataset into Surprise. According to the documentation, the data frame must have three columns: the user ids, the item ids, and the ratings. We also need to specify the rating scale. In our case, users rated products on a discrete scale from 1 to 5.

From the Surprise library, I consider the algorithms below (only BaselineOnly, SVD, and CoClustering are cross-validated in this notebook):

* <b>BaselineOnly:</b> Algorithm predicting the baseline estimate for a given user and item.

* <b>KNNBaseline:</b> A basic collaborative filtering algorithm taking into account a baseline rating.

* <b>NMF:</b> A collaborative filtering algorithm based on Non-negative Matrix Factorization.

* <b>Co-clustering:</b> A collaborative filtering algorithm based on co-clustering.

* <b>SVD:</b> The matrix factorization algorithm popularized by Simon Funk during the Netflix Prize. When baselines are not used, it is equivalent to Probabilistic Matrix Factorization.

I also split the data into training and testing sets using the Surprise package.


In [4]:
# pandas is already imported as pd in the first cell
from surprise import BaselineOnly, CoClustering, Dataset, Reader, SVD, accuracy
from surprise.model_selection import cross_validate, train_test_split

To load a dataset from a pandas dataframe, we use the load_from_df() method together with a Reader object whose rating_scale parameter is specified. The dataframe must have three columns, corresponding to the user ids, the item ids, and the ratings, in this order. Each row thus corresponds to a given rating.

In [5]:
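# NOTE: ratings in this dataset run from 1 to 5, so rating_scale=(1, 5) would match the data.
# The (0, 9) scale is kept here because the outputs below were produced with it; it lets
# estimates go above 5 (see the worst predictions).
reader = Reader(rating_scale=(0, 9))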
data = Dataset.load_from_df(dfr[['reviewerId', 'productId', 'ratings']], reader)
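
The rating_scale given to the Reader also bounds the estimates: by default Surprise clips each prediction into that range. The sketch below mimics that clipping in plain Python to show what a scale wider than the observed ratings allows. `clip_estimate` is a hypothetical helper written for this illustration, not a Surprise function.

```python
def clip_estimate(est, rating_scale):
    """Clamp a raw estimate into the (lower, upper) rating scale."""
    lower, upper = rating_scale
    return max(lower, min(upper, est))


# Largest estimate shown in the worst-predictions table of this notebook.
estimate = 5.174732

wide = clip_estimate(estimate, (0, 9))   # scale used in this notebook: value is kept
tight = clip_estimate(estimate, (1, 5))  # scale matching the 1 to 5 ratings: value is capped
print(wide, tight)
```

With a (1, 5) scale, an estimate can never be off by more than 4 from a true rating, which is not the case with the wider scale.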

In [6]:
algoresults = []
algorithm = BaselineOnly()

# Cross-validate BaselineOnly on 3 folds
results = cross_validate(algorithm, data, measures=['RMSE', 'MAE'], cv=3, verbose=False)

# Average the per-fold results and append the algorithm name
tmp = pd.DataFrame.from_dict(results).mean(axis=0)
# Series.append was removed in pandas 2.0; pd.concat works on old and new versions
tmp = pd.concat([tmp, pd.Series(['BaselineOnly()'], index=['Algorithm'])])
algoresults.append(tmp)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


In [11]:
# Cross-validate SVD on 3 folds
algorithm = SVD()
results = cross_validate(algorithm, data, measures=['RMSE', 'MAE'], cv=3, verbose=False)

# Average the per-fold results and append the algorithm name
tmp = pd.DataFrame.from_dict(results).mean(axis=0)
# Series.append was removed in pandas 2.0; pd.concat works on old and new versions
tmp = pd.concat([tmp, pd.Series(['SVD()'], index=['Algorithm'])])
algoresults.append(tmp)

In [12]:
# Cross-validate CoClustering on 3 folds
algorithm = CoClustering()
results = cross_validate(algorithm, data, measures=['RMSE', 'MAE'], cv=3, verbose=False)

# Average the per-fold results and append the algorithm name
tmp = pd.DataFrame.from_dict(results).mean(axis=0)
# Series.append was removed in pandas 2.0; pd.concat works on old and new versions
tmp = pd.concat([tmp, pd.Series(['CoClustering()'], index=['Algorithm'])])
algoresults.append(tmp)

In [13]:
surprise_results = pd.DataFrame(algoresults).set_index('Algorithm').sort_values('test_rmse')
surprise_results

Algorithm,test_rmse,test_mae,fit_time,test_time
BaselineOnly(),1.103059,0.832514,6.537677,5.055301
SVD(),1.10707,0.827417,66.756933,5.729319
SVD(),1.107347,0.827523,68.095149,5.76116
CoClustering(),1.234483,0.90861,60.349193,5.203793
CoClustering(),1.240402,0.912715,55.310668,4.777427


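BaselineOnly gave the lowest RMSE (1.103, against about 1.107 for SVD and 1.23 to 1.24 for CoClustering) and was roughly ten times faster to fit, so we proceed with BaselineOnly and estimate its baselines with Alternating Least Squares (ALS). SVD and CoClustering each appear twice in the table, presumably because their cells were executed twice and appended to algoresults both times.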
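
To make the baseline estimate concrete, here is a minimal pure-Python sketch of ALS for the model `est = mu + b_u + b_i`, following the update rules described in the Surprise documentation. It is not Surprise's implementation: `als_baselines` and `predict` are hypothetical names, the toy ratings are made up, and the default `reg_u`, `reg_i`, and `n_epochs` simply mirror the values used in the fine-tuning cell of this notebook.

```python
from collections import defaultdict


def als_baselines(ratings, n_epochs=10, reg_u=12, reg_i=10):
    """Estimate the global mean and the user and item biases.

    ratings: list of (user, item, rating) tuples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Hold user biases fixed and solve for each item bias (shrunk by reg_i).
        for i, rated in by_item.items():
            dev = sum(r - mu - b_u[u] for u, r in rated)
            b_i[i] = dev / (reg_i + len(rated))
        # Hold item biases fixed and solve for each user bias (shrunk by reg_u).
        for u, rated in by_user.items():
            dev = sum(r - mu - b_i[i] for i, r in rated)
            b_u[u] = dev / (reg_u + len(rated))
    return mu, b_u, b_i


def predict(mu, b_u, b_i, user, item):
    """Baseline estimate; an unknown user or item contributes a zero bias."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


# Toy ratings, made up for illustration only.
toy = [("u1", "p1", 5.0), ("u1", "p2", 4.0), ("u2", "p1", 4.0),
       ("u2", "p2", 2.0), ("u3", "p1", 5.0)]
mu, b_u, b_i = als_baselines(toy)
print(mu, predict(mu, b_u, b_i, "u3", "p2"))
```

The regularization terms pull the biases of rarely rated items and rarely active users toward zero, which matters for data as sparse as this one.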

### Split and fine tune 

In [15]:
#ref :https://surprise.readthedocs.io/en/latest/prediction_algorithms.html
trainset, testset = train_test_split(data, test_size=0.25)
bsl_options = {'method': 'als','n_epochs': 10,'reg_u': 12,'reg_i': 10}
algo = BaselineOnly(bsl_options=bsl_options)
predictions = algo.fit(trainset).test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 1.0966


1.0965603892042317

In [16]:
predictions[:5]

[Prediction(uid='A1TWU0R0BB40UY', iid='B00918N4A0', r_ui=4.0, est=4.4966829821499115, details={'was_impossible': False}),
 Prediction(uid='A51LSGKI6ESTP', iid='B00009W44B', r_ui=1.0, est=4.002538708272469, details={'was_impossible': False}),
 Prediction(uid='A1PPGC5RH5MP8I', iid='B007R7DTRK', r_ui=5.0, est=4.594693830896099, details={'was_impossible': False}),
 Prediction(uid='A16XRPF40679KG', iid='B001U3Y5TI', r_ui=5.0, est=4.200104226616243, details={'was_impossible': False}),
 Prediction(uid='A169SWTUA8JKH0', iid='B004E10KFG', r_ui=2.0, est=3.768660666389859, details={'was_impossible': False})]

The cells above use train_test_split() to sample a trainset and a testset (25% of the ratings are held out), train the algorithm on the trainset with the fit() method, and call the test() method, which returns the predictions made for the testset. Accuracy is measured with RMSE.
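
As a worked example of the metric, the sketch below computes RMSE and MAE by hand from the five predictions printed above. The `rmse` and `mae` functions here are plain-Python stand-ins written for illustration, not the ones in `surprise.accuracy`, and five pairs are far too few to be representative of the full testset score.

```python
import math

# (true rating r_ui, estimate est) pairs copied from the five predictions printed above.
pairs = [
    (4.0, 4.4966829821499115),
    (1.0, 4.002538708272469),
    (5.0, 4.594693830896099),
    (5.0, 4.200104226616243),
    (2.0, 3.768660666389859),
]


def rmse(pairs):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error: average size of the miss."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


print(f"RMSE: {rmse(pairs):.4f}  MAE: {mae(pairs):.4f}")
```

The single 1-star rating predicted as 4.0 dominates the RMSE of this small sample, which is why RMSE comes out larger than MAE.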

In [17]:
def get_itemsCounts_byReviewer(uid):
    """uid: the raw id of the user. Returns the number of items rated by the user in the trainset."""
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
    
def get_reviwerCounts_forItem(iid):
    """iid: the raw id of the item. Returns the number of users that have rated the item in the trainset."""
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
    
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df['itemsCounts_byReviewer'] = df.uid.apply(get_itemsCounts_byReviewer)
df['reviwerCounts_forItem'] = df.iid.apply(get_reviwerCounts_forItem)
df['err'] = abs(df.est - df.rui)

In [18]:
df = df.rename(columns={
    'uid': 'reviewerId',
    'iid': 'productId',
    'rui': 'ratings',
    'est': 'predicted Ratings',
})
df = df.drop(columns=['details'])

In [19]:
df.head()

Unnamed: 0,reviewerId,productId,ratings,predicted Ratings,itemsCounts_byReviewer,reviwerCounts_forItem,err
0,A1TWU0R0BB40UY,B00918N4A0,4.0,4.496683,5,10,0.496683
1,A51LSGKI6ESTP,B00009W44B,1.0,4.002539,2,13,3.002539
2,A1PPGC5RH5MP8I,B007R7DTRK,5.0,4.594694,12,7,0.405306
3,A16XRPF40679KG,B001U3Y5TI,5.0,4.200104,12,8,0.799896
4,A169SWTUA8JKH0,B004E10KFG,2.0,3.768661,3,355,1.768661


In [20]:
best_predictions = df.sort_values(by='err')[:10]
worst_predictions = df.sort_values(by='err')[-10:]

In [21]:
best_predictions

Unnamed: 0,reviewerId,productId,ratings,predicted Ratings,itemsCounts_byReviewer,reviwerCounts_forItem,err
368795,A2NFSABGJPM2VN,B007R5YDYA,5.0,4.999994,8,802,6e-06
245433,A1VIJNOIG8BEZ0,B008OHNZI0,4.0,4.000012,4,975,1.2e-05
63491,A22X35CEX6STB3,B005I2IVB0,4.0,3.999985,2,21,1.5e-05
381733,A2JX5WDY0UQCFK,B002IT1BFO,4.0,3.999982,3,28,1.8e-05
80595,AEW4HSN0NS2RI,B001Q6TZ5S,4.0,4.000031,3,85,3.1e-05
396534,AETQIN7OH0RL9,B0000BZL5A,4.0,4.000041,37,102,4.1e-05
272088,A200BNTIJC2LQO,B007CO5DZ4,4.0,4.000042,5,140,4.2e-05
213481,A2PSVETZUVZUQR,B0096YOQRY,4.0,4.000048,5,340,4.8e-05
293874,A2WA8TDCTGUADI,B008X1BVH4,4.0,4.000053,41,25,5.3e-05
323810,A35NSIHU4KL51A,B000M3GODW,4.0,4.000054,3,95,5.4e-05


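The above are the 10 best predictions, i.e. those with the smallest absolute error. In the run shown, the closest one is product B007R5YDYA, rated 5 and predicted 4.999994, with 802 ratings in the trainset, followed by B008OHNZI0 and B005I2IVB0. Because train_test_split() samples at random, the exact products change from run to run.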

In [22]:
worst_predictions

Unnamed: 0,reviewerId,productId,ratings,predicted Ratings,itemsCounts_byReviewer,reviwerCounts_forItem,err
386713,A3V0D97QKXDN5R,B003ES5ZUU,1.0,4.993414,11,3097,3.993414
370237,A3M6HVV3XJPLS7,B000JLP5UK,1.0,5.013205,9,63,4.013205
69204,A201HVME20DUW0,B0043WJRRS,1.0,5.016983,8,440,4.016983
241677,A2SKEQT0WTB954,B00005ABC5,1.0,5.018741,76,37,4.018741
223735,A1JRD0WYIOFRJY,B00GTGETFG,1.0,5.025394,7,514,4.025394
360049,ARGO3CB3O9DXZ,B0006I1TRY,1.0,5.044767,15,120,4.044767
193419,A3LAUW2UZ5SWS6,B0095ZC7VQ,1.0,5.053597,10,37,4.053597
404893,A2M6JXJ94DS4YJ,B0011UPBMA,1.0,5.073686,15,32,4.073686
155526,A18QOK7A5XLV1Q,B001KB6Z2U,1.0,5.088349,15,131,4.088349
231799,A19IH70IIDS31C,B007SZ0EOW,1.0,5.174732,27,157,4.174732


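The above are the bottom 10 predictions, which can safely be categorized as the worst predictions. In the run shown, the largest error is for product B007SZ0EOW, rated 1 but predicted 5.174732, with 157 ratings in the trainset, followed by B001KB6Z2U and B0011UPBMA. In all ten cases the user gave 1 star to a product that the baseline expected to be rated about 5.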

The author of Surprise has written a function we can use to get the top-N recommendations for each user in our dataset.
I'll be using <a href="https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-top-n-recommendations-for-each-user">the same function</a>.

In [23]:
from collections import defaultdict


def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.
    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size at most n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, round(est, 2)))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


top_n = get_top_n(predictions, n=10)

In [24]:
# Print the recommended items for each user (disabled)
# for uid, user_ratings in top_n.items():
#     print(uid, [iid for (iid, _) in user_ratings])

In [25]:
Recom = {}
for i in range(10):
    Recom[dfr.reviewerId[i]] = top_n[dfr.reviewerId[i]]

In [26]:
for i in Recom :
    print(i, Recom[i])

AO94DHGC771SJ [('B0096TK6MI', 4.55), ('B001TQSFXS', 4.45)]
AMO214LNFCEI4 []
A3N7T0DY83Y4IG [('B004AX6Z3Y', 4.28), ('B0036UU40C', 4.26), ('0528881469', 3.79)]
A1H8PY3QHMQQA0 [('B00005AXHW', 4.69), ('B00BNIO4H8', 4.66), ('B0051WAM64', 4.57), ('B005MHN6K2', 4.47), ('B001FOM1T8', 4.05)]
A24EV6RXELQZ63 [('B001212ELY', 3.98)]
A2JXAZZI9PHK9Z [('B003TPC2JU', 4.43), ('0594451647', 4.12), ('B000GD3J3G', 4.09), ('B009X676NQ', 4.03)]
A2P5U7BDKKT7FW [('B00BJN502A', 4.2)]
AAZ084UMH8VZ2 [('B00DR0PDNE', 4.33)]
AEZ3CR6BKIROJ [('B00BP5N498', 4.08)]
A3BY5KCNQZXV5U [('B000QUUFRW', 4.93), ('B003XM73P2', 4.87), ('B002K450RM', 4.66), ('B00603RU9A', 4.64), ('B0007Y794O', 4.64), ('B00BN1Q5JA', 4.56), ('B001W6Q2H6', 4.5), ('B004OOTRPC', 4.48), ('B005B47AIU', 4.47), ('B0076AUCKU', 4.45)]


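This function doesn't always recommend ten items. It only ranks the predictions it is given, here the testset, so a user gets at most as many recommendations as they have ratings in the testset. Because the data is very sparse, with a high number of customers who have rated only one product and many products that have only one rating, many users fall short of ten.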
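
The shortfall can be reproduced on a toy example. In this sketch, `Prediction` is a hypothetical namedtuple standing in for Surprise's prediction objects so that it runs without Surprise, `top_n_from_predictions` is a trimmed copy of the function above, and the predictions are made up.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for Surprise's Prediction objects.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def top_n_from_predictions(predictions, n=10):
    """Group predictions by user and keep each user's n highest estimates."""
    top_n = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n


# Made-up testset predictions: user "a" has three test ratings, user "b" only one.
preds = [
    Prediction("a", "p1", 4.0, 4.2, {}),
    Prediction("a", "p2", 5.0, 4.8, {}),
    Prediction("a", "p3", 3.0, 3.9, {}),
    Prediction("b", "p1", 2.0, 3.1, {}),
]
top = top_n_from_predictions(preds, n=2)
print(dict(top))
```

User "b" gets a single recommendation even though two were requested, because only one prediction exists for that user. The Surprise FAQ avoids this by predicting every unrated (user, item) pair built with `trainset.build_anti_testset()`, which I would expect to be very memory-hungry on a dataset of this size.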

A simple yet effective way to generate recommendations is to recommend the most popular products.

First, I need to build a dataset of popular products. Each person could define popular in a different way, but in this scenario I'm only going to consider products that have received at least 100 reviews. Then, I'll find the average rating for each product and sort them in descending order.

In [27]:
# Count reviews per product and keep products with at least 100 reviews
review_count = dfr['productId'].value_counts()
review_count_hundred = review_count[review_count >= 100]
hundred_reviews = dfr[dfr['productId'].isin(review_count_hundred.index)]

# Rank those products by mean rating, best first
items = (hundred_reviews[['productId', 'ratings']]
         .groupby('productId')
         .agg('mean')
         .sort_values('ratings', ascending=False)
         .index)
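
The same filter-then-rank logic can be checked by hand on a small example. This sketch uses only the standard library; `popular_items` is a hypothetical helper, the ratings are made up, and the threshold is lowered from 100 to 2 to fit the toy data.

```python
from collections import defaultdict


def popular_items(ratings, min_reviews):
    """Return item ids with at least min_reviews ratings, best mean rating first."""
    by_item = defaultdict(list)
    for item, rating in ratings:
        by_item[item].append(rating)
    means = {
        item: sum(rs) / len(rs)
        for item, rs in by_item.items()
        if len(rs) >= min_reviews
    }
    return sorted(means, key=means.get, reverse=True)


# Made-up (productId, rating) pairs.
toy = [("p1", 5.0), ("p1", 4.0), ("p2", 5.0),
       ("p3", 3.0), ("p3", 5.0), ("p3", 5.0)]
ranked = popular_items(toy, min_reviews=2)
print(ranked)
```

Product "p2" has a perfect mean rating but is dropped because it has a single review, which is exactly the kind of item the review threshold is meant to exclude.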

Go over each user's recommendations and add the most popular products that aren't already recommended.

In [28]:
def recommendation_list(user_list, user_predictions, item_list, n_users=100, n=10):
    """Pad each user's top-N list with the most popular items, up to n items.

    Only the first n_users users are processed; users without predictions are skipped.
    """
    recommendations = {}
    for user in user_list[:n_users]:
        if user not in user_predictions:
            continue
        user_recs = [iid for iid, _ in user_predictions[user]]

        # Fill the remaining slots with popular items that aren't already recommended.
        for product in item_list:
            if len(user_recs) >= n:
                break
            if product not in user_recs:
                user_recs.append(product)
        recommendations[user] = user_recs
    return recommendations


recs = recommendation_list(dfr['reviewerId'].unique().tolist(), top_n, items)

In [29]:
example_user = dfr['reviewerId'].unique().tolist()[1]
recs[example_user]

['B004EBUXHQ',
 'B001BTCSI6',
 'B0033PRWSW',
 'B004FA8NOQ',
 'B007SZ0E1K',
 'B0029N3U8K',
 'B003FVVMS0',
 'B002NEGTSI',
 'B000I1X3W8',
 'B004EBX5GW']

### Conclusion

There are many ways to build and improve this recommender system.

I couldn't use the k-NN algorithms from the Surprise package because they need more memory than I had available, and I couldn't complete the run on Google Colab either.

Use more tuned parameters: I only did some basic cross-validation to compare the algorithms and used a single set of ALS parameters. To improve the recommendations, tune these parameters and make sure that the algorithm is serving the best recommendations possible with the data available.
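
One way to do that tuning is Surprise's GridSearchCV. The sketch below is a starting point under stated assumptions: the candidate values in the grid are illustrative guesses rather than tuned results, `build_param_grid` and `tune_baseline` are hypothetical helper names, and `data` refers to the Dataset object built earlier in this notebook.

```python
def build_param_grid():
    """Candidate ALS settings; the values are illustrative assumptions."""
    return {
        "bsl_options": {
            "method": ["als"],
            "n_epochs": [5, 10, 20],
            "reg_u": [5, 12, 15],
            "reg_i": [5, 10],
        }
    }


def tune_baseline(data, param_grid, cv=3):
    """Run a cross-validated grid search over BaselineOnly options.

    data: a surprise Dataset. Returns the best RMSE and the parameters that gave it.
    """
    # Imported here so the grid can be inspected without Surprise installed.
    from surprise import BaselineOnly
    from surprise.model_selection import GridSearchCV

    gs = GridSearchCV(BaselineOnly, param_grid, measures=["rmse", "mae"], cv=cv)
    gs.fit(data)
    return gs.best_score["rmse"], gs.best_params["rmse"]


grid = build_param_grid()
n_combinations = 1
for values in grid["bsl_options"].values():
    n_combinations *= len(values)
print(n_combinations, "combinations to evaluate")

# Usage inside the notebook (not run here):
# best_rmse, best_params = tune_baseline(data, grid)
```

Each combination is fitted once per fold, so the grid is best kept small on a dataset of this size.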