# User-based recommendation
This module includes the following steps:
1. Prepare the data and create the dataset required by Surprise
2. Apply LeaveOneOut to get the training set (trainSet) and the test set (testSet)
3. Apply KNNBasic to the training set to get the user similarity matrix
4. Use the user similarity matrix to get the top K similar users
5. Use the similarity scores and raw ratings to compute a rating sum (ratingSum) per item
6. Use ratingSum to get the top N recommended items
7. Calculate the hit rate
8. Save the top K similar users and top N recommended items in a Teradata table

## Input
 The input comes from dw_bi_vw.F_POS_TXN_DTL, restricted to rows where:
  - STR_FAC_NBR = 2667 (store ID)
  - txn_dt between '2019-06-01' and '2019-06-30'
  - hh_sk <> -1
  - wgt_prod_ind = 0 (product purchased by unit)
  
## Output
   Top N recommended items for each user

## Function to build a defaultdict 'topN' listing the top N recommendations
  - defaultdict 'topN'
    - Key: a raw user id
    - Value: a list of [raw item id, ratingSum] pairs for the items with the highest ratingSum
  - A loop goes through every user in trainSet:
    1. Use the inner user id, uiid, to get similarityRow
    2. Use similarityRow to build a list 'similarUsers' of (innerID, score) pairs, excluding uiid
    3. Get the top K users, 'kNeighbors', by applying heapq.nlargest to similarUsers
       - A list of (innerID, score) pairs
    4. Build a defaultdict 'candidates' with key = itemID and value = recalculated rating (ratingSum)
       - For each user in kNeighbors, recalculate the rating of each item:
         - Use 'trainSet.ur' to get all of that user's items and ratings
         - Add up rating / maxRate for each item, weighted by user similarity
    
    5. Build a dictionary of the items the user has already bought
        - key: itemId
        - value: 1
    6. Get the top N recommended items, skipping items the user has already bought
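
The six steps above can be sketched on toy data without Surprise. This is only an illustration: `recommend`, `similarity_row`, and `user_ratings` are hypothetical names, the item ids are made up, and the sketch keeps exactly `n` items by slicing rather than counting in a loop.

```python
import heapq
from collections import defaultdict

# Toy data (hypothetical): similarity of the target user (inner id 0) to every user.
similarity_row = [1.0, 0.9, 0.2, 0.6]

# Toy ratings keyed by inner user id, shaped like trainSet.ur: [(item, rating), ...].
user_ratings = {
    0: [("A", 2.0)],
    1: [("A", 1.0), ("B", 4.0)],
    2: [("C", 10.0)],
    3: [("B", 2.0), ("D", 5.0)],
}


def recommend(target, similarity_row, user_ratings, k, n, max_rate):
    # Steps 1-3: pair every other user with a score and keep the k most similar.
    similar_users = [(uid, s) for uid, s in enumerate(similarity_row) if uid != target]
    k_neighbors = heapq.nlargest(k, similar_users, key=lambda pair: pair[1])

    # Step 4: sum each item's scaled rating, weighted by neighbor similarity.
    candidates = defaultdict(float)
    for uid, score in k_neighbors:
        for item, rating in user_ratings[uid]:
            candidates[item] += (rating / max_rate) * score

    # Step 5: items the target user has already bought.
    bought = {item for item, _ in user_ratings[target]}

    # Step 6: rank the remaining items by rating sum and keep the first n.
    ranked = sorted(candidates.items(), key=lambda pair: pair[1], reverse=True)
    return [(item, round(total, 4)) for item, total in ranked if item not in bought][:n]


top_items = recommend(0, similarity_row, user_ratings, k=2, n=2, max_rate=10.0)
print(top_items)  # [('B', 0.48), ('D', 0.3)]
```

Item "B" ranks first because both neighbors bought it, so its score accumulates two weighted contributions (0.4 * 0.9 + 0.2 * 0.6).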


In [23]:


def f_get_userbase_topN(trainSet, testSet, simsMatrix, K, N, maxRate):
   """
    Module name: f_get_userbase_topN
    Purpose    : Function to get user-based top N recommended items and the hit rate
    Parameters:
       trainSet   : Surprise training set, created from LeaveOneOut  
       testSet    : Surprise test set, created from LeaveOneOut
       simsMatrix : User-based similarity matrix created by KNNBasic  
       K          : Number of nearest users    
       N          : Number of items to recommend per user
       maxRate    : Maximum rating, used to scale each rating before weighting
     Return
       topN       : defaultdict holding the recommended items for each user
     Note: uses the global fitted KNNBasic 'model' to build predictions for testSet
   """ 
   topN = defaultdict(list)
   for uiid in range(trainSet.n_users):
       # Pair every other user with their similarity to this one
       similarityRow = simsMatrix[uiid]
       
       similarUsers = []
       for innerID, score in enumerate(similarityRow):
           if (innerID != uiid):
               similarUsers.append( (innerID, score) )
       # Keep the K most similar users
       kNeighbors = heapq.nlargest(K, similarUsers, key=lambda simRow: simRow[1])
       # Add up the neighbors' scaled ratings for each item, weighted by user similarity
       candidates = defaultdict(float)
       for similarUser in kNeighbors:
           innerID = similarUser[0]
           userSimilarityScore = similarUser[1]
           theirRatings = trainSet.ur[innerID]
           for rating in theirRatings:
               candidates[rating[0]] += (rating[1] / maxRate) * userSimilarityScore
           
       # Items this user has already bought
       watched = {itemID for itemID, rating in trainSet.ur[uiid]}
           
       # Keep the highest-scoring items the user has not bought yet
       pos = 0
       for itemID, ratingSum in sorted(candidates.items(), key=itemgetter(1), reverse=True):
           if itemID not in watched:
               prodID = trainSet.to_raw_iid(itemID)
               topN[int(trainSet.to_raw_uid(uiid))].append( [int(prodID), round(ratingSum, 4)] ) 
               pos += 1
               # This check runs after the increment, so up to N + 1 items are kept
               # (the recorded output for N = 10 lists 11 items per user)
               if (pos > N):
                   break
   leftOutPredictions = model.test(testSet)   
   print("HR", f_HitRate(topN, leftOutPredictions), "for top ", K, "user, top ", N, "items with maxrate =", maxRate )  
   return topN
   

In [2]:
def f_HitRate(topNPredicted, leftOutPredictions):
    """
    Module name: f_HitRate
    Purpose    : Function to get the hit rate
                 (share of left-out items that also appear in the user's top N recommended items)
    Parameters:
       topNPredicted      : top N items recommended for each user
       leftOutPredictions : Predictions from the test dataset
     Return
       Hit rate 
    """  
    hits = 0
    total = 0
    # For each left-out rating
    for leftOut in leftOutPredictions:
        userID = leftOut[0]
        leftOutProdID = leftOut[1]
        # Is it in the predicted top N for this user?
        hit = False
        for prodID, predictedRating in topNPredicted[int(userID)]:
            if (int(leftOutProdID) == int(prodID)):
                hit = True
                break
        if (hit) :
            hits += 1
        total += 1
    # Compute the overall hit rate
    return hits/total
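
As a quick sanity check of the metric, the same computation can be run on toy values. The sketch below is a simplified stand-in, not the function above: `hit_rate`, `top_n`, and `left_out` are hypothetical names, and the left-out pairs are invented for illustration.

```python
def hit_rate(top_n, left_out):
    """Fraction of left-out (user, item) pairs whose item is in that user's list."""
    hits = 0
    for user, item in left_out:
        recommended = {rec_item for rec_item, _ in top_n.get(user, [])}
        if item in recommended:
            hits += 1
    # Guard against an empty test set instead of dividing by zero.
    return hits / len(left_out) if left_out else 0.0


# Two users with short recommendation lists (toy values).
top_n = {230: [[173497, 0.9], [1402158, 0.9]], 1470: [[2001034, 1.0]]}
# One left-out purchase per user: the first is recommended, the second is not.
left_out = [(230, 1402158), (1470, 990262)]

rate = hit_rate(top_n, left_out)
print(rate)  # 0.5
```

With one left-out item per user, the hit rate is simply the share of users whose held-out purchase was recommended.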

## Import packages/Libraries


In [3]:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
from collections import Counter
from collections import defaultdict
import heapq
from operator import itemgetter


In [4]:

from surprise import Reader
from surprise import Dataset
# Algorithms
from surprise import KNNBasic
from surprise.model_selection import LeaveOneOut

# Measure
from surprise.accuracy import rmse
from surprise import accuracy

from surprise.model_selection import train_test_split

In [5]:
tnx_file = "C:\\SYUE\\RecSys\\pos_tnx.xlsx"
df = pd.read_excel(tnx_file,sheet_name = 'pos_tnx_1')

In [6]:
df1 = df.drop(['WGT_PROD_IND','WGT_QTY', 'TXN_DT','STR_FAC_NBR','TXN_NBR' ],axis = 1) # shape = (298654, 3)

In [7]:
df1.shape


(298654, 3)

## Data cleaning

In [8]:
# 13937 unique users (HH_SK); 24664 unique products (PROD_SK); 35 distinct UNIT_QTY values
df1_x = df1[(df1['PROD_SK'] > -1) & (df1["HH_SK"]> -1) &  (df1["UNIT_QTY"]> -1)]   # df1_x.shape = (298646, 3)

## Get the most common users and products
 - Keep the 10000 most common users
 - Keep the 3000 most common products

In [9]:
user_ids_count = Counter(df1_x.HH_SK)   # type: collections.Counter
prod_ids_count = Counter(df1_x.PROD_SK)

In [10]:
# number of users and products we would like to keep
# Original size = 298646
#    n (user)  m (prod)    df_small
#   =================================
#     10000       12000     265859
#     10000        5000     214103
#     10000        3000     182206
n = 10000
m = 3000
user_ids = [u for u, c in user_ids_count.most_common(n)]
prod_ids = [p for p, c in prod_ids_count.most_common(m)]  # loop variable renamed so it no longer shadows m
df_small = df1_x[df1_x.HH_SK.isin(user_ids) & df1_x.PROD_SK.isin(prod_ids)]
df_small.shape

(182164, 3)
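
The filtering step can be illustrated on a handful of rows with `collections.Counter` alone. This is a sketch under assumed toy data: `rows`, `keep_users`, and `keep_prods` are hypothetical names, and plain tuples stand in for the DataFrame.

```python
from collections import Counter

# Hypothetical (household, product) purchase rows.
rows = [(1, "a"), (1, "b"), (2, "a"), (2, "c"), (3, "a"), (1, "c"), (4, "d")]

# Count how often each household and each product appears.
user_counts = Counter(u for u, _ in rows)
prod_counts = Counter(p for _, p in rows)

# Keep the 2 most frequent households and the 2 most frequent products.
keep_users = {u for u, _ in user_counts.most_common(2)}
keep_prods = {p for p, _ in prod_counts.most_common(2)}

# A row survives only if both its household and its product were kept.
small = [(u, p) for u, p in rows if u in keep_users and p in keep_prods]
print(small)  # [(1, 'a'), (2, 'a'), (2, 'c'), (1, 'c')]
```

Because both conditions must hold, a frequent household can still lose rows (here `(1, "b")`), which is why the filtered frame can end up with fewer distinct users than `n`.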

In [11]:
df_small.UNIT_QTY.value_counts()

1      133592
2       25875
0       11976
3        4615
4        3467
5         943
6         815
8         283
10        163
12        133
7         113
9          58
11         27
15         23
13         14
20         11
14         10
24         10
36          9
16          6
42          4
18          4
17          2
28          2
26          2
30          1
33          1
25          1
58          1
45          1
48          1
240         1
Name: UNIT_QTY, dtype: int64

In [12]:
df_small.groupby(["HH_SK","PROD_SK"]).sum()

| HH_SK | PROD_SK | UNIT_QTY |
|-------|---------|----------|
| 230   | 1067152 | 5        |
| 230   | 1517130 | 1        |
| 230   | 1586146 | 1        |
| 230   | 1871714 | 1        |
| 230   | 1945392 | 1        |
| 1470  | 288996  | 1        |
| 1470  | 913195  | 1        |
| 1470  | 1402158 | 3        |
| 1470  | 1604651 | 1        |
| 1470  | 1630169 | 1        |


In [13]:
df_sum_qty = df_small.groupby(["HH_SK","PROD_SK"]).sum().reset_index()  # shape = (158034, 3)

In [14]:
df_sum_qty.shape

(158034, 3)

In [15]:
df_sum_qty["UNIT_QTY"].value_counts()

1      107125
2       25796
0       10385
3        5984
4        4085
5        1375
6        1108
8         477
7         423
10        286
12        199
9         193
11        115
13         68
14         52
16         50
15         43
18         41
20         35
17         28
24         18
19         17
28         15
21         14
22         11
23          9
25          9
26          9
30          9
31          7
27          4
32          4
41          3
33          3
34          3
49          3
45          2
60          2
39          2
38          2
36          2
54          1
77          1
73          1
324         1
86          1
63          1
216         1
58          1
93          1
51          1
50          1
48          1
44          1
42          1
40          1
37          1
35          1
240         1
Name: UNIT_QTY, dtype: int64

In [16]:
df_sum_qty_final =  df_sum_qty[ (df_sum_qty["UNIT_QTY"] > 0 ) & (df_sum_qty["UNIT_QTY"] < 11 ) ]  # shape = (146852, 3)


In [17]:
df_sum_qty_final.shape

(146852, 3)
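
Taken together, the aggregation and the quantity filter turn transaction lines into one implicit rating per household and product. A pure-Python stand-in for the pandas steps might look like the sketch below; `lines`, `totals`, and `ratings` are hypothetical names and the quantities are made up.

```python
from collections import defaultdict

# Hypothetical transaction lines: (HH_SK, PROD_SK, UNIT_QTY).
lines = [
    (230, 1067152, 2),
    (230, 1067152, 3),   # same household and product: quantities add up to 5
    (230, 1517130, 1),
    (1470, 1402158, 0),  # total of 0 is dropped by the filter
    (1470, 913195, 12),  # total above 10 is dropped by the filter
]

# Equivalent of groupby(["HH_SK", "PROD_SK"]).sum(): one total per pair.
totals = defaultdict(int)
for hh, prod, qty in lines:
    totals[(hh, prod)] += qty

# Equivalent of keeping 0 < UNIT_QTY < 11, so totals fit the 1 to 10 rating scale.
ratings = [(hh, prod, qty) for (hh, prod), qty in sorted(totals.items()) if 0 < qty < 11]
print(ratings)  # [(230, 1067152, 5), (230, 1517130, 1)]
```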

## Create data required by Surprise
 - Reader: a class that tells Surprise how to parse the ratings, including the rating scale
 - data: a Surprise Dataset (surprise.dataset.DatasetAutoFolds) built by
   - load_from_df: converts a pandas DataFrame to a Surprise Dataset
   - The DataFrame must have three columns, in this order:
      - user (raw) ids
      - item (raw) ids
      - ratings   

In [18]:

reader = Reader(rating_scale=(1, 10))  # Reader object; rating_scale is required 
data = Dataset.load_from_df(df_sum_qty_final[['HH_SK', 'PROD_SK', 'UNIT_QTY']], reader) # type:  surprise.dataset.DatasetAutoFolds

##  Build dataset for training

 - LeaveOneOut
    - Cross-validation iterator where each user has exactly one rating in the testset. 
    - n_splits: The number of folds 

 - trainSet
    - An instance of the Surprise Trainset class
    - A trainset contains all useful data that constitutes a training set:
      - ur, ir, n_users, n_items, n_ratings, rating_scale, offset, raw2inner_id_users, raw2inner_id_items
        - ur: defaultdict of list – The users ratings, a dictionary containing lists of tuples of the form
             (item_inner_id, rating). The keys are user inner ids.
        - n_users: number of users in trainSet, e.g. 9594
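
To make the idea of the split concrete, here is a minimal pure-Python sketch of holding out one rating per user. It illustrates the concept only and is not Surprise's implementation; `leave_one_out` and the toy `ratings` are hypothetical.

```python
import random


def leave_one_out(ratings, seed=1):
    """Hold out one randomly chosen rating per user; the rest form the training set."""
    rng = random.Random(seed)

    # Group the (user, item, rating) rows by user.
    by_user = {}
    for row in ratings:
        by_user.setdefault(row[0], []).append(row)

    train, test = [], []
    for user_rows in by_user.values():
        held_out = rng.randrange(len(user_rows))
        for i, row in enumerate(user_rows):
            (test if i == held_out else train).append(row)
    return train, test


# Toy (user, item, rating) rows.
ratings = [(1, "a", 1), (1, "b", 2), (1, "c", 1), (2, "a", 3), (2, "d", 1)]
train, test = leave_one_out(ratings)
print(len(train), len(test))  # 3 2
```

Each user contributes exactly one row to the test set, so the hit rate later asks a single question per user: was the held-out item recommended?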

In [19]:
# Set aside one rating per user for testing
LOOCV = LeaveOneOut(n_splits=1, random_state=1)
for train, test in LOOCV.split(data):
    trainSet = train
    testSet  =  test

##  Model: KNNBasic
  - Parameters
    - k: The (max) number of neighbors to take into account for aggregation. Default is 40.
    - min_k (int): The minimum number of neighbors to take into account for aggregation. Default is 1.
    - sim_options: A dictionary of options for the similarity measure. Here:
       - name: cosine
       - user_based: True
  - Methods
    - compute_similarities: builds the similarity matrix
      
Notes:
  - The actual number of neighbors can be retrieved in the 'actual_k' field of each prediction's details dictionary
  - The similarity matrix can be built by any of Surprise's KNN algorithms



In [20]:
sim_opt = {'name': 'cosine',
           'user_based': True
           }
model = KNNBasic(sim_options=sim_opt, verbose=False)
model.fit(trainSet)
# User-by-user similarity matrix, indexed by inner user ids
simsMatrix = model.compute_similarities()

## Invoke 'f_get_userbase_topN' to build up a defaultdict 'topN' to list top N recommendation



In [24]:
topN = f_get_userbase_topN(trainSet, testSet, simsMatrix, 10, 10, 10.0)

HR 0.0448711470439616 for top  10 user, top  10 items with maxrate = 10.0


In [25]:
topN

defaultdict(list,
            {230: [[173497, 0.9],
              [1402158, 0.9],
              [562847, 0.8],
              [216746, 0.8],
              [1376094, 0.8],
              [1354393, 0.8],
              [361557, 0.6],
              [2049747, 0.5],
              [2045235, 0.5],
              [1107573, 0.5],
              [129589, 0.4]],
             1470: [[2001034, 1.0],
              [990262, 0.8],
              [1815395, 0.8],
              [1001413, 0.6],
              [2225074, 0.6],
              [1354393, 0.6],
              [669033, 0.5],
              [651610, 0.5],
              [727270, 0.5],
              [159076, 0.5],
              [1578298, 0.5]],
             1648: [[126024, 1.1],
              [1223871, 1.0],
              [1212709, 0.9],
              [1723909, 0.9],
              [995964, 0.9],
              [414393, 0.8],
              [727270, 0.8],
              [990262, 0.8],
              [1639363, 0.8],
              [1669387, 0.8],
              [197