# User-based recommendation for store ID 0025 from 06-01-2019 to 08-31-2019  
Purpose: 
1. Use SQL Assistant to create the table temp_tables.rs_pos_tnx_s0025_junaug19_sum by extracting, cleaning, and aggregating data from a Teradata view. Please refer to 'Input' below.  
2. Apply Scikit-Surprise to pick the top 10 recommended items for every user.

## Introduction
### Scikit-Surprise -  Simple Python Recommendation System Engine
- Scikit-Surprise is an easy-to-use Python scikit for recommender systems.
- Surprise only accepts a dataframe with the columns user_id, item_id, and rating, in that order.
- The input to our recommendation engine is a table with columns of hh_sk, prod_sk, and unit_qty
  - hh_sk is equivalent to user_id
  - prod_sk is equivalent to item_id
  - unit_qty is equivalent to rating 
- Surprise ships its own prediction algorithms, with a fit/test interface similar to scikit-learn estimators
  - This notebook applies KNN algorithms with a variety of similarity measures to get the similarity matrix  
  - KNN algorithms:
    - KNNBasic: A basic collaborative filtering algorithm
    - KNNWithMeans: A basic collaborative filtering algorithm, taking into account the mean ratings of each user
    - KNNWithZScore: A basic collaborative filtering algorithm, taking into account the z-score normalization of each user
    - KNNBaseline: A basic collaborative filtering algorithm taking into account a baseline rating
  - Available similarity measures:
    - cosine : Compute the cosine similarity between all pairs of users (or items)
    - msd	 : Compute the Mean Squared Difference similarity between all pairs of users (or items)
    - pearson: Compute the Pearson correlation coefficient between all pairs of users (or items)
    - pearson_baseline: Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means    
  - Similarity matrix
     - User-based similarity 
     - Item-based similarity  
 - Will use hit-rate to evaluate the recommendation system.
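
The similarity measures listed above can be illustrated with a small pure-Python sketch. It assumes each user is a dictionary of item to unit_qty and computes the measures over the items both users bought. It is a simplified illustration of the cosine and msd definitions, not Surprise's implementation (which also supports options such as a minimum number of common items); the function names and toy users are made up for the example.

```python
from math import sqrt

def common_items(u, v):
    """Items bought by both users (dicts mapping item_id -> unit_qty)."""
    return sorted(set(u) & set(v))

def cosine_sim(u, v):
    """Cosine similarity over the common items only; 0.0 when there is no overlap."""
    items = common_items(u, v)
    if not items:
        return 0.0
    dot = sum(u[i] * v[i] for i in items)
    norm_u = sqrt(sum(u[i] ** 2 for i in items))
    norm_v = sqrt(sum(v[i] ** 2 for i in items))
    return dot / (norm_u * norm_v)

def msd_sim(u, v):
    """Mean Squared Difference similarity: 1 / (msd + 1) over the common items."""
    items = common_items(u, v)
    if not items:
        return 0.0
    msd = sum((u[i] - v[i]) ** 2 for i in items) / len(items)
    return 1.0 / (msd + 1.0)

# Toy users (hypothetical data)
user_a = {"p1": 2, "p2": 1, "p3": 4}
user_b = {"p1": 2, "p2": 1, "p4": 3}
user_c = {"p1": 1, "p2": 3}

print(round(cosine_sim(user_a, user_b), 4))  # identical quantities on the overlap
print(round(msd_sim(user_a, user_c), 4))
```

Because only the overlapping items are compared, two users with a small but identical overlap get a similarity close to 1, which is one reason the choice of measure can affect the neighbors that are selected.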
 
### Collaborative filtering
   - Neighborhood-based collaborative filtering is applied to build the recommendation engine. 
     - Find other people like you and recommend stuff they like. 
   - In collaborative filtering, the model learns from the user's past behavior, decisions, and preferences to predict items the user might be interested in.
   - Either user-based or item-based collaborative filtering can be applied.

## This module includes the following steps:
   
1. Select the most popular users (hh_sk) and products (prod_sk) and create the dataset required by Surprise
2. Apply a KNN algorithm to get the user similarity matrix
3. Apply LeaveOneOut to get the trainSet and testSet  
4. Use the user similarity matrix to get the top K similar users  
5. Use the similarity score and unit_qty to get uqtySum
6. Use uqtySum to get the top N recommended items
7. Calculate the hit rate
8. Save the top K similar user/neighbors and top N recommended items in a Teradata table
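
The hit rate in step 7 can be sketched as follows. The notebook's own `cf_HitRate` lives in c_hit_rate.py and is not shown here, so this sketch assumes the common definition: a left-out purchase counts as a hit when it appears in that user's top N list, and the hit rate is hits divided by the number of left-out purchases. The function name `hit_rate` and the toy data are illustrative only.

```python
def hit_rate(top_n, left_out):
    """
    top_n   : dict mapping user id -> list of (item id, score) recommendations
    left_out: list of (user id, item id) pairs held out for testing
    Returns the fraction of held-out items found in the user's recommendations.
    """
    hits = 0
    for user_id, held_out_item in left_out:
        recommended = {item_id for item_id, _score in top_n.get(user_id, [])}
        if held_out_item in recommended:
            hits += 1
    return hits / len(left_out) if left_out else 0.0

# Toy data (hypothetical): one of three held-out items is recommended
top_n = {
    230: [(173497, 0.9), (1402158, 0.9)],
    470: [(2001034, 1.0), (990262, 0.8)],
}
left_out = [(230, 1402158), (470, 555), (999, 173497)]
print(hit_rate(top_n, left_out))
```

Users without any recommendations (user 999 above) still count in the denominator, so they lower the hit rate.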

## Input
  1. Create temp_tables.rs_pos_tnx_s0025_junaug19 with columns txn_dt, hh_sk, prod_sk, unit_qty
     - Input: Teradata view: dw_bi_vw.F_POS_TXN_DTL
     - Selection criteria (only select rows where):
       - STR_FAC_NBR = 0025 (store ID) 
       - txn_dt between '2019-06-01' and '2019-08-31' 
       - hh_sk > 0 and prod_sk > 0 and unit_qty between 1 and 10 
       - wgt_prod_ind = 0 (product purchased by unit) 
     - SQL to create the table: 
        """create table temp_tables.rs_pos_tnx_s0025_junaug19 as ( sel txn_dt, hh_sk, prod_sk, unit_qty  
           from dw_bi_vw.F_POS_TXN_DTL where STR_FAC_NBR  = 0025 and WGT_PROD_IND = 0 
            and prod_sk > 0 and hh_sk > 0 
            and unit_qty between 1 and 10 
            and txn_dt    between '2019-06-01' and  '2019-08-31' ) with data""" 
   2. Create temp_tables.rs_pos_tnx_s0025_junaug19_sum
      - Aggregate unit_qty by grouping on hh_sk and prod_sk
      - There are 15306 households with count(HH_SK) > 9 and 3972 products with count(PROD_SK) > 49
        - The number will be used to select most popular user (hh_sk) and products (prod_sk)   
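
The aggregation and the popularity counts described above can be sketched in plain Python. This is only an illustration on made-up rows: the real work is done in Teradata SQL, and it is an assumption here that the counts are taken on the aggregated (hh_sk, prod_sk) rows. The thresholds are lowered to 2 so the toy data produces a result.

```python
from collections import Counter, defaultdict

# Toy transaction rows: (hh_sk, prod_sk, unit_qty); values are made up
rows = [
    (1, 100, 2), (1, 100, 1), (1, 200, 3),
    (2, 100, 1), (2, 300, 4),
    (3, 200, 1),
]

# Step 1: sum unit_qty per (hh_sk, prod_sk), like the *_sum table
qty_sum = defaultdict(int)
for hh_sk, prod_sk, unit_qty in rows:
    qty_sum[(hh_sk, prod_sk)] += unit_qty

# Step 2: count aggregated rows per household and per product
hh_count = Counter(hh for hh, _prod in qty_sum)
prod_count = Counter(prod for _hh, prod in qty_sum)

# Step 3: keep only popular households and products
# (the notebook uses count > 9 for households and count > 49 for products)
min_hh_rows, min_prod_rows = 2, 2
popular_hh = {hh for hh, c in hh_count.items() if c >= min_hh_rows}
popular_prod = {p for p, c in prod_count.items() if c >= min_prod_rows}

filtered = {
    key: qty for key, qty in qty_sum.items()
    if key[0] in popular_hh and key[1] in popular_prod
}
print(filtered)
```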
        
## Output 
     - Top N recommended items
     - A dataframe with top N recommended items and top K similar users/neighbors
     
## Module name: rs_user_base.ipynb
 Author     : Sophia Yue 
 Date       : Sep 2019

## Comments
    - Use hit_rate to validate the result
    - Many combinations of KNN algorithm and similarity measure can be used to get the similarity matrix and the top N recommendations
      - The Pearson baseline measure with KNNBaseline would get the highest hit rate  
    - The hit rate would not improve much by using the mean of unit_qty for the normalization 
      - The hit rate = 0.0426
    - The hit rate by using max score = 10 for the normalization is 0.0434
    - KNNBasic and the mean of unit_qty for the normalization are used for the recommendation engine  

In [1]:

#1
def cf_get_userbase_topN(trainSet, simsMatrix, K, N, nrm_adj, verbose = True):
   """
    Module name: cf_get_userbase_topN
    Purpose    : Function to get user base top N recommended items  
    Parameters:
       trainSet   : Surprise training set, created from LeaveOneOut 
                    with columns of 
                    - ihh_sk is internal id of hh_sk
                      - Surprise will convert the hh_sk to a range of 0 to no of hh_sk - 1
                    - iprod_sk is internal id of prod_sk
                      - Surprise will convert the prod_sk to a range of 0 to no of prod_sk - 1
                    - unit_qty is equivalent to rating  
       simsMatrix : Similarity Matrix created by KNN algorithm  
       K          : No of nearest users    
       N          : Total number of products to be recommended 
       nrm_adj    : A dictionary with key as iprod_sk and value as mean of unit_qty or max scale
                    - Will dynamically generate code to calculate adjusted unit_qty  
       verbose    : Whether to print the elapsed time of executing the function
                     Default is True
     Return
       topN       : Default dictionary to keep N recommended items for each user 
       nbrK_ary   : Array of k nearest neighbor raw_id and scores which will be combined with topN to create a dataframe
 
  Functions: A loop to go through all the user in trainSet to build top N recommendation
    0. Dynamically generate code to calculate adjusted unit_qty  
    1. Build a list of similar user from simsMatrix with (innerID, score)   
    2. Get top K similar users 'kNeighbors' by using Python method heapq.nlargest to sort similarUsers
      - kNeighbors is a list of tuple with (innerID, score) 
        - innerID: inner user id
    3. Invoke cf_cr_nbrK_list to create a list of raw iuid, raw iid and score in kNeighbors     
    4. Build a defaultdict 'candidates' with key = itemID and value = recalculated rate (ratingSum)
       - For each user in kNeighbors, recalculating rate for each items  
        - Apply 'trainSet.ur' to get all the items and rating 
        - Add up ratings for each item, weighted by user similarity     
    5.  Build a dictionary with itemId that the user has already bought
        - key: itemId 
        - value : 1 
    6. Get top N  recommended items 
   """ 
   fnc_name = inspect.stack()[0][3]
   start_time = time.time()
   topN = defaultdict(list)
   nbrK_ary = []  # array of k neighbor raw_id and scores which will be combined with topN to create a dataframe
   
   """
     Perform loop to get top N similar users for each iuid defined in trainSet
     - n_users: Total number of users
     - iuid: Internal user Id
        - iuid between 0 and (n_users - 1)  
   
   """  
   for iuid in range(trainSet.n_users): 
         """
           1. Build a list of similar users and score from simsMatrix
              - Use inner user id, iuid, from TrainSet to get similarityRow from simsMatrix
              - Use simsMatrix to create a list of similarUsers
         """
         similarityRow = simsMatrix[iuid]       
         similarUsers = []        
         for innerID, score in enumerate(similarityRow):
             if (innerID != iuid):
                 similarUsers.append( (innerID, score) )             
            
         """
            2. Get top K similar users 'kNeighbors' by using Python method heapq.nlargest to sort similarUsers
               - innerID: inner user id
               - kNeighbors is a list of tuple with (innerID, score), e.g. for K = 10  
                 [(iid0, score0), (iid1, score1) .... (iid9, score9)] 
                  - iid0 has the highest score and is the closest to iuid
               -  similarUsers is a list of tuple of inner ID and score
                  - Inside 'heapq.nlargest', name the tuple as simRow
                     - simRow[1] will be the score
                        - Use score to sort similarUsers and get the top K scores
         """
         kNeighbors = heapq.nlargest(K, similarUsers, key=lambda simRow: simRow[1]) 
           
         """
            3. Invoke cf_cr_nbrK_list to create a list of raw iuid, all the raw iid and score in kNeighbors 
         """   
         nbrK = cf_cr_nbrK_list(iuid, kNeighbors)         
         nbrK_ary.append(nbrK)
            
         """
            4. Build a defaultdict 'candidates' with key = itemID and value = recalculated unit_qty (uqtySum)
               - Get the sum of unit_qty, and add them up for each candidate, weighted by user similarity score
                 - For each user in kNeighbors, recalculate unit_qty for each item  
                 - Apply 'trainSet.ur' to get all the iprod_sk and unit_qty 
                 - Add up normalized unit_qty for each iprod_sk, weighted by user similarity
                   - Utilize mean of unit_qty for iprod_sk for the normalization 
                - theirUqtys: A list of iprod_sk, unit qty for all items been purchased by candidates, e.g
                  [(385, 4),(143, 3), (99,1), (179,5 ...)] 
                   - 385, 143, 99 ... are innerID
                   - 4, 3, 1 ... are sum of adjusted unit_qty
                         
         """     
         candidates = defaultdict(float)
         for similarUser in kNeighbors:
           innerID = similarUser[0]
           userSimilarityScore = similarUser[1]

           theirUqtys = trainSet.ur[innerID]
           for iprod_sk, unit_qty in theirUqtys:
               if type(nrm_adj) is dict:
                  u_mean = nrm_adj.get(iprod_sk) 
                  candidates[iprod_sk] += (unit_qty / u_mean) * userSimilarityScore 
               else:                       
                  candidates[iprod_sk] += (unit_qty / nrm_adj) * userSimilarityScore  
         """ 
           5.  Build a dictionary with iprod_sk that the user has already bought
               - key: iprod_sk 
               - value : 1
         """
         d_bought = {}
         for _iprod_sk, _unit_qty in trainSet.ur[iuid]:
             d_bought[_iprod_sk] = 1
                
         """    
          6. Get top N prod_sk and unit_qty from similar users
             - Sort uqtySum from candidate to get top 10 prod_sk
             - Type of topN is defaultdict(list)              
               - A list of dictionary with hh_sk as key  and a list of top 10 prod_sk, and uqtySum as value, e.g.
                 {230: [[173497, 0.9], [1402158, 0.9], [562847, 0.8], [216746, 0.8], [1376094, 0.8], [1354393, 0.8],
                       [361557, 0.6], [2049747, 0.5], [2045235, 0.5], [1107573, 0.5], [129589, 0.4]],
                  470: [[2001034, 1.0], [990262, 0.8], [1815395, 0.8], [1001413, 0.6], [2225074, 0.6], [1354393, 0.6],
                       [669033, 0.5], [651610, 0.5], [727270, 0.5], [159076, 0.5], [1578298, 0.5]],
                  ....}
         """
         pos = 1
         for _iprod_sk, uqtySum in sorted(candidates.items(), key=itemgetter(1), reverse=True):
             if _iprod_sk not in d_bought:
                 prodID = trainSet.to_raw_iid(_iprod_sk)
                 #topN[int(trainSet.to_raw_uid(iuid))].append( (int(prodID), ratingSum) )
                 topN[int(trainSet.to_raw_uid(iuid))].append( [int(prodID), round(uqtySum, 4)] ) 
                 pos += 1
                 if (pos > N):
                     break
   end_time = time.time()
   if verbose:  # honor the documented verbose flag; the default (True) keeps the original behavior
       cf_elapse_time (  start_time, end_time, "Function {0} completed.".format(fnc_name))
   return topN, nbrK_ary
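
The core of `cf_get_userbase_topN` (pick the K nearest users, add up their normalized quantities weighted by similarity, drop items already bought, keep the top N) can be tried on toy data without Surprise. The sketch below uses plain dictionaries in place of `trainSet.ur` and `simsMatrix`; the function name `recommend` and all data are illustrative assumptions, not part of the notebook.

```python
import heapq
from collections import defaultdict

def recommend(target, ratings, sims, item_mean, k=2, n=2):
    """
    target   : user to recommend for
    ratings  : dict user -> {item: unit_qty}
    sims     : dict user -> {other user: similarity score}
    item_mean: dict item -> mean unit_qty, used for the normalization
    """
    # Top k most similar users, excluding the target itself
    others = [(u, s) for u, s in sims[target].items() if u != target]
    neighbors = heapq.nlargest(k, others, key=lambda pair: pair[1])

    # Similarity-weighted sum of normalized quantities per candidate item
    candidates = defaultdict(float)
    for user, score in neighbors:
        for item, qty in ratings[user].items():
            candidates[item] += (qty / item_mean[item]) * score

    # Skip items the target already bought, then keep the n best
    bought = set(ratings[target])
    ranked = sorted(candidates.items(), key=lambda pair: pair[1], reverse=True)
    return [(item, round(total, 4)) for item, total in ranked if item not in bought][:n]

# Toy data (hypothetical)
ratings = {
    "u1": {"a": 2, "b": 1},
    "u2": {"a": 1, "c": 4},
    "u3": {"b": 2, "d": 3},
    "u4": {"e": 5},
}
sims = {"u1": {"u2": 0.9, "u3": 0.5, "u4": 0.1}}
item_mean = {"a": 1.5, "b": 1.5, "c": 4.0, "d": 3.0, "e": 5.0}

print(recommend("u1", ratings, sims, item_mean))
```

With k = 2, user u4 is never consulted, so item e cannot be recommended; items a and b are scored but filtered out because u1 already bought them.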

   


In [2]:
def cf_cr_nbrK_list(uiid, kNeighbors):
       """
         Module name: cf_cr_nbrK_list
         Purpose    : Create a list of raw uiid, raw iid and score in kNeighbors
                      - The list is appended to an array that is later merged with the topN items  
         Parameter:
           uiid: Internal user id
           kNeighbors: A list of tuples with iid and score for the k neighbors of uiid      
         Return:
           A list with raw uiid, raw iid and score in kNeighbors  
         Notes:
           - Relies on the global trainSet to convert inner ids to raw ids
       """ 
    
       """
         knbr_x is built by flattening the list of tuples in kNeighbors to a flat list, e.g.
         [(uiid0, score0), (uiid1, score1) .... (uiid9, score9)] -> [uiid0, score0, uiid1, score1 .... uiid9, score9] 
       """
       knbr_x =  list (itertools.chain(*kNeighbors)) 
        
       """
        knbr_y is a list of tuple of raw id  and score, e.g 
        [uiid0, score0, uiid1, score1 .... uiid9, score9]  -> [('rid0', score0), ('rid1', score1) .... ('rid9', score9)] 
       """
       knbr_id     =  knbr_x[::2]      # Get internal user ids from knbr_x ([::2] takes every other element, skipping the scores)
       knbr_rnkscr = knbr_x[1::2]      # Get rank for item base or score for user base from knbr_x 
       knbr_rawid  = [trainSet.to_raw_uid(x) for x in knbr_id]  # Convert internal user id to raw id  
       knbr_y = list(zip(knbr_rawid, knbr_rnkscr) )   
    
       """
       knbr is a list of target user raw id, and raw id and score of user which  is close to the target user, e.g, 
       ['rid', 'rid0', score0,  .... 'rid9', score9]
       """
       return  [trainSet.to_raw_uid(uiid)] + list (itertools.chain(*knbr_y)) # convert list of tuple to a list
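
The flattening steps in `cf_cr_nbrK_list` can be followed on a small example. The sketch below is a stand-alone illustration: the dictionary `to_raw` is a hypothetical stand-in for `trainSet.to_raw_uid`, and the ids and scores are made up.

```python
from itertools import chain

# (inner user id, similarity score) pairs, as returned by heapq.nlargest
k_neighbors = [(12, 0.91), (7, 0.88), (40, 0.75)]

# Flatten the list of tuples: [12, 0.91, 7, 0.88, 40, 0.75]
flat = list(chain.from_iterable(k_neighbors))

inner_ids = flat[::2]    # every other element starting at 0: the ids
scores = flat[1::2]      # every other element starting at 1: the scores

# Hypothetical stand-in for trainSet.to_raw_uid
to_raw = {12: "hh_A", 7: "hh_B", 40: "hh_C"}
raw_pairs = [(to_raw[i], s) for i, s in zip(inner_ids, scores)]

# One flat row per target user: [target raw id, nbr1, score1, nbr2, score2, ...]
row = ["hh_target"] + list(chain.from_iterable(raw_pairs))
print(row)
```

Since the neighbors are already available as pairs, the same row could also be built directly from `k_neighbors` without the intermediate flatten and slice; the sketch keeps the intermediate lists only to mirror the function above.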
         

In [3]:
#3
def cf_get_unit_qty_mean(s_dataset):
   """
     module name: cf_get_unit_qty_mean
     Purpose    : Common function to create a dictionary with key as iprod_sk and value as mean of unit_qty                    
     Parameter
       s_dataset : Surprise  dataset with columns of ihh_sk, iprod_sk, unit_qty 
     
    return:
       d_mean : a dictionary with key as iprod_sk and value as mean of unit_qty  
       df_mean: a dataframe with columns of iprod_sk, prod_sk, and unit_qty_mean 
   
    Steps:
        1: Create a dataframe from a list of iprod_sk, prod_sk, unit_qty mean
        2: Create a dataframe to get mean of unit_qty for iprod_sk 
        3: Build a dictionary with key as iprod_sk and value as mean of unit_qty
        4: Build a dataframe as iprod_sk, prod_sk, qty_unit_mean
   """ 

   fnc_name = inspect.stack()[0][3]
   start_time = time.time()

   iprod_sk_unit = []
   for ihh_sk in range(s_dataset.n_users):
       _unit_qty = s_dataset.ur[ihh_sk]
       for iprod_sk, unit_qty in _unit_qty:
           iprod_sk_unit.append([iprod_sk, int(s_dataset.to_raw_iid(iprod_sk)),unit_qty])
   df = pd.DataFrame(iprod_sk_unit)
   df.columns = ['iprod_sk', 'prod_sk', 'unit_qty']
   print("c_get_qty_mean - df.head", df.head() )
   
   """
    step2: Create a dataframe to get mean of unit_qty group by iprod_sk and prod_sk   
           -  df_qty_mean_x with index as iprod_sk and prod_sk with column of mean of unit_qty
   """
   df_qty_mean_x = df.groupby (by = ['iprod_sk', 'prod_sk']).agg({'unit_qty':['mean']}) 
   
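   """
    step3: Build a dictionary with key as iprod_sk and value as mean 
   """
   # NOTE: the original statements here were lost in the notebook export. The lines below are a
   # reconstruction (an assumption) that derives l_iprod_sk, l_mean and mean_ary from df_qty_mean_x.
   l_iprod_sk = [idx[0] for idx in df_qty_mean_x.index]        # first index level : iprod_sk
   l_prod_sk  = [idx[1] for idx in df_qty_mean_x.index]        # second index level: prod_sk
   l_mean     = df_qty_mean_x[('unit_qty', 'mean')].tolist()   # mean of unit_qty per product
   mean_ary   = list(zip(l_iprod_sk, l_prod_sk, l_mean))
   d_mean = dict(zip(l_iprod_sk, l_mean))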
    
   # Step4: Build a dataframe as iprod_sk, prod_sk, qty_unit_mean
   df_mean = pd.DataFrame(mean_ary)
   df_mean.columns = ['iprod_sk', 'prod_sk', 'unit_qty_mean']

   end_time = time.time()
   cf_elapse_time (start_time, end_time, "Function {0} completed.".format(fnc_name))
   return d_mean, df_mean

In [4]:
#4
def cf_build_trainset(tnx_tbl, N, M, maxScale, session):
    """
     module name: cf_build_trainset
     Purpose    : build trainset
     Parameter
       session : session to connect to database
       tnx_tbl : Transaction table with schema 
       N       : No of most popular user
       M       : No of most popular product 
       maxScale: Maximum unit_qty  
    return:
       trainSet: Training set 
       testSet : Test set
    Functions:
     1. Run a query to read data from Teradata table and create a dataframe
     2. Get most popular hh_sk  and prod_sk
     3: Build surprise dataset 
     4: Build trainSet and testSet
    Notes:
     1. The step of initialization will set up the session to connect to Teradata
    """ 
    print( "tnx_tbl", tnx_tbl,"N = ",  N, "m =", M, maxScale, maxScale )    
    print("session", session)
    
    #1: Run a query to read data from Teradata table and create a dataframe
    query = """
    select HH_SK, PROD_SK, UNIT_QTY   
    from {0}""".format(tnx_tbl)
    df_sum_qty = pd.read_sql(query,session) 
    
    #2: Get most popular hh_sk  and prod_sk
    user_ids_count = Counter(df_sum_qty.HH_SK)   # type: collections.Counter
    prod_ids_count = Counter(df_sum_qty.PROD_SK)
   
    user_ids = [u for u, c in user_ids_count.most_common(N)]
    prod_ids = [m for m, c in prod_ids_count.most_common(M)]
    df_sum_qty_final = df_sum_qty[df_sum_qty.HH_SK.isin(user_ids) & df_sum_qty.PROD_SK.isin(prod_ids)]  
    
    #3: Build surprise dataset                         
    reader = Reader(rating_scale=(1, maxScale))  # Reader object; rating_scale is required 
    data = Dataset.load_from_df(df_sum_qty_final[['HH_SK', 'PROD_SK', 'UNIT_QTY']], reader) # type:  surprise.dataset.DatasetAutoFolds
    ft = data.build_full_trainset()
    print("The total no of users in ft = data.build_full_trainset():", ft.n_users,  "The total no of items in ft:", ft.n_items )
    
    """
     4: Build trainSet and testSet
     - Set aside one purchase per user for testing
       - Randomly remove one row from data to create the testSet and the rest would be trainSet
         - If the user only has one row, the user will not appear in trainSet 
    """
    LOOCV = LeaveOneOut(n_splits=1, random_state=1)
    for train, test in LOOCV.split(data):
        trainSet = train
        testSet  =  test
    print("The total number of users in trainSet:", trainSet.n_users,  "The total number of items in trainSet:", trainSet.n_items ) 
    print("The total length of testSet:", len(testSet),"\nExample of testSet:", testSet[0:2])     
    return trainSet, testSet
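
The leave-one-out split used in step 4 of `cf_build_trainset` can be illustrated with a simplified pure-Python sketch: one randomly chosen row per user goes to the test set and the rest stay in the training set. This is not Surprise's `LeaveOneOut` implementation, only the idea behind it; the function name `leave_one_out` and the toy rows are assumptions for the example.

```python
import random
from collections import defaultdict

def leave_one_out(ratings, seed=1):
    """
    ratings: list of (user, item, qty) rows
    Returns (train, test) where test holds exactly one row per user.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item, qty in ratings:
        by_user[user].append((item, qty))

    train, test = [], []
    for user, items in by_user.items():
        held = rng.randrange(len(items))          # index of the row to hold out
        for idx, (item, qty) in enumerate(items):
            target = test if idx == held else train
            target.append((user, item, qty))
    return train, test

# Toy rows (hypothetical): u1 has three purchases, u2 only one
ratings = [("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3), ("u2", "a", 4)]
train, test = leave_one_out(ratings)
print(len(train), len(test))
print(sorted({user for user, _item, _qty in train}))
```

User u2 has a single row, so it ends up in the test set only. This is consistent with the run recorded below, where the test set has one row for each of the 15266 users while the training set contains 15151 users.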

In [5]:
def cf_comb_topN_nbrK(topN, nbrK_ary, lbl_topN, lbl_nbrK ):
    """
     module name: cf_comb_topN_nbrK
     Purpose    : Build a dataframe by merging recommended items and top K neighbors
                  - The dataframe will be used to build a table
     Parameter:
       topN    : Default dictionary to keep N recommended items for each user 
       nbrK_ary: Array of raw_id and scores for top k neighbor  
       lbl_topN: Label for top N prods 
       lbl_nbrK: Label for top K neighbors  
     Return:
       df_topN_nbrK: A dataframe built by merging recommended items and top K neighbors   
     Functions:
      1. Flatten topN to a list
        - type of topN is defaultdict(list)
          - A list of dictionary with hh_sk as key  and a list of top N prod_sk, and uqtySum as value.
      2. Merge recommended items and similar k neighbors
         - user base: similar k neighbors will be similar users  
         - item base: similar k neighbors will be similar prods/items 
    """

    """
      1.Flatten topN to a list
        - Type of topN is defaultdict(list)              
          - A list of dictionary with hh_sk as key  and a list of top 10 prod_sk, and uqtySum as value, e.g.
            {230: [[173497, 0.9], [1402158, 0.9], [562847, 0.8], [216746, 0.8], [1376094, 0.8], [1354393, 0.8],
                  [361557, 0.6], [2049747, 0.5], [2045235, 0.5], [1107573, 0.5], [129589, 0.4]],
             470: [[2001034, 1.0], [990262, 0.8], [1815395, 0.8], [1001413, 0.6], [2225074, 0.6], [1354393, 0.6],
                  [669033, 0.5], [651610, 0.5], [727270, 0.5], [159076, 0.5], [1578298, 0.5]],
             ....}      
    """    
    user_prod_ary = []
    for hh_sk, prods in topN.items():
         user_prod = [hh_sk] 
         for j  in range(len(prods)):
             user_prod.extend(prods[j])   
         user_prod_ary.append(user_prod)
    print(user_prod_ary[0:1])
    """ 
      2. Merge recommended items and K neighbors
         - Create a dataframe for topN 
         - Create a dataframe for nbrK_ary 
    """
    df_topN   = pd.DataFrame.from_records(user_prod_ary, columns = lbl_topN)
    df_nbrK   = pd.DataFrame.from_records(nbrK_ary, columns = lbl_nbrK)
    df_topN_nbrK = pd.merge(df_topN, df_nbrK, how = 'left', on = 'hh_sk')   # join on the single key column
    return df_topN_nbrK  
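
The two steps of `cf_comb_topN_nbrK` (flatten each user's recommendations into one wide row, then left-join the neighbor row on hh_sk) can be followed without pandas. The sketch below uses made-up values with N = 2 and K = 1, and a dictionary lookup as a stand-in for the left merge.

```python
# Recommendations per user: {hh_sk: [[prod_sk, uqtySum], ...]} (toy values)
top_n = {
    230: [[173497, 0.9], [1402158, 0.8]],
    470: [[2001034, 1.0], [990262, 0.8]],
}
# Neighbor rows: [hh_sk, knbr_1, scrs_1]; user 470 has no row here on purpose
nbr_rows = [[230, 470, 0.95]]

# Index neighbor rows by hh_sk so they can be looked up per user
nbr_by_user = {row[0]: row[1:] for row in nbr_rows}

combined = []
for hh_sk, prods in top_n.items():
    wide = [hh_sk]
    for prod_sk, score in prods:
        wide.extend([prod_sk, score])                  # prod_1, scr_1, prod_2, scr_2
    # Left join: keep the user even when no neighbor row exists
    wide.extend(nbr_by_user.get(hh_sk, [None, None]))  # knbr_1, scrs_1
    combined.append(wide)

for wide in combined:
    print(wide)
```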
    

In [6]:
#6
def cf_get_simsMatrix(s_dataset, sim_measure, user_base_ind, knn_algm ):
    """
      Module name: cf_get_simsMatrix
      Purpose    : Function to get similarity matrix 
      Parameters :
       s_dataset : Surprise dataset
       sim_measure: Measure for similarity 
         - Valid values are "cosine", "msd", "pearson", "pearson_baseline"        
       user_base_ind
         True  - Apply user_base
         False - Apply item_base 
       knn_algm  : Surprise KNN algorithm class, e.g. KNNBasic
       Return:
        simsMatrix: Similarity matrix
        model     : The fitted KNN model
    """  
    fnc_name = inspect.stack()[0][3]
    start_time = time.time()    
    sim_opt = {'name': sim_measure,
                   'user_based': user_base_ind
                   }
    #v_knn_mdl = "{0}(sim_options=sim_opt,  verbose = False)".format(knn_algm)
    model = knn_algm(sim_options=sim_opt,  verbose = False)
    model.fit(s_dataset)
    simsMatrix = model.compute_similarities()
    
    end_time = time.time()
    cf_elapse_time (  start_time, end_time, "Function {0} completed.".format(fnc_name))    
    return simsMatrix, model 


In [7]:
#7
def cf_setup_dbs_con(userName, passWord):
    """
     module name  : cf_setup_dbs_con
     purpose      : Setup database connection
     parameter    : 
       userName: User name to access Teradata 
       passWord: Password to access Teradata 
     Return       :  
       session    : udaExec  connection
       td_enginex : Teradata engine 
    Notes:
     - Need to import the following packages/libraries 
       import sqlalchemy
       from sqlalchemy import create_engine
       import teradata     
    """
    udaExec = teradata.UdaExec (appName="Teradata_Test", version="1.0", logConsole=False)
    session = udaExec.connect(method="odbc", system="tqdpr02",
            username = userName, password= passWord )
    t_engine   = 'teradata://{0}:{1}@tqdpr02/temp_tables'.format(userName, passWord)
    print ("t_engine", t_engine)
    td_enginex = create_engine(t_engine) 
    return session,  td_enginex 
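
`cf_setup_dbs_con` formats the password straight into the engine URL and prints the URL. Characters such as `#` or `@` in a password generally need to be URL-encoded inside a connection URL, and printing the URL exposes the password. A minimal sketch of both points follows, using only the standard library. The helper names `build_engine_url` and `mask_url`, the environment variable `TD_PASSWORD`, and the host and user values are hypothetical and not part of the notebook.

```python
import os
from urllib.parse import quote_plus

def build_engine_url(user, password, host, database):
    """Build a teradata:// URL with URL-encoded credentials."""
    return "teradata://{0}:{1}@{2}/{3}".format(
        quote_plus(user), quote_plus(password), host, database)

def mask_url(url):
    """Replace the password part of a URL with *** so it is safe to print."""
    scheme, rest = url.split("://", 1)
    creds, tail = rest.rsplit("@", 1)
    user = creds.split(":", 1)[0]
    return "{0}://{1}:***@{2}".format(scheme, user, tail)

# Read the password from the environment instead of hard-coding it
# (TD_PASSWORD is a hypothetical variable name; the fallback is for the demo only)
password = os.environ.get("TD_PASSWORD", "example#pw")
url = build_engine_url("some_user", password, "some_host", "temp_tables")
print(mask_url(url))
```

In an interactive notebook, `getpass.getpass()` is another way to avoid storing the password in the cell source.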

In [8]:
def cf_ld_topN_nbrK_tbl(df_cmb, tbl_cmb,  cr_tbl_ind, td_enginex ):
    """
     Module name  : cf_ld_topN_nbrK_tbl
     Purpose      : Load data from the dataframe df_cmb, which combines recommended items and K neighbors, to a table
     Parameter    : 
       df_cmb     : Dataframe which combines recommended items and K neighbors
       tbl_cmb    : Table name to load the dataframe df_cmb 
                    - Schema name is not required
                    - Schema name is defined in td_enginex
       cr_tbl_ind :
         True     : Will create a new table
         False    : Will not create a new table
                    - Currently unused: the table is always replaced
       td_enginex : Connection of Teradata engine   
     Return       : N/A 
            
    """
    fnc_name = inspect.stack()[0][3]
    start_time = time.time()
    """
     - _df_cmb is a copy of df_cmb
     - Changes to _df_cmb will not have an impact on df_cmb
    """
    _df_cmb = df_cmb.copy() # is different from  _df_cmb = df_cmb
    _df_cmb.index.names=['custno'] # Setup dummy index
   
#    if cr_tbl_ind == True:
#       try: 
#          _df_cmb.to_sql(con=td_enginex, name=tbl_cmb, if_exists='replace')  
#       DatabaseError:     
#           
    _df_cmb.to_sql(con=td_enginex, name=tbl_cmb, if_exists='replace', index = False)
    end_time = time.time()
    cf_elapse_time (  start_time, end_time, "Function {0} to load table {1} completed.".format(fnc_name, tbl_cmb))     

In [9]:
"""
 Step0: Initialization 
  - Import packages/libraries from the file c_import.py
  - Setup database connection 
"""
prg_name = ""
path_code = "C:\\Users\\syue003\\wip_RecSys\\"
c_import  = path_code + "c_import.py"
c_hitrate = path_code + "c_hit_rate.py" 
c_timedte = path_code + "c_time_dte.py" 

exec(compile(open(c_import, 'rb').read(), c_import,  'exec'))
exec(compile(open(c_hitrate, 'rb').read(),c_hitrate, 'exec'))
exec(compile(open(c_timedte, 'rb').read(),c_timedte, 'exec'))
session, td_enginex = cf_setup_dbs_con(userName = 'syue003', passWord = 'Chungli#1')

"""
 Define table name  
"""

str_id = '0025_'             # store id
duration = 'junaug19_'
mean_nrm = '_meannrm'        # Indicator of whether the mean is applied for the normalization 
knn_algm = KNNBasic          # KNN algorithm to calculate Similarity Matrix  
knn_algm_abb = 'knnb_'       # Abbreviation of KNN algorithm
sim_mea = 'cosine'           # Similarity measure
tbl_cmb = "rs_s" + str_id + duration + knn_algm_abb + sim_mea + mean_nrm 
print ( "tbl_cmb", tbl_cmb )


t_engine teradata://syue003:Chungli#1@tqdpr02/temp_tables
tbl_cmb rs_s0025_junaug19_knnb_cosine_meannrm


In [10]:
#4.1
"""
 Step1: Invoke cf_build_trainset 
   - Create surprise trainSet and testSet
"""
N= 15306; M = 3972; maxScale = 10
tnx_tbl = "temp_tables.rs_pos_tnx_s0025_junaug19_sum"
trainSet, testSet = cf_build_trainset(tnx_tbl, N , M , maxScale, session = session )

tnx_tbl temp_tables.rs_pos_tnx_s0025_junaug19_sum N =  15306 m = 3972 10 10
session <teradata.udaexec.UdaExecConnection object at 0x0000000008F29198>
The total no of users in ft = data.build_full_trainset(): 15266 The total no of items in ft: 3972
The total number of users in trainSet: 15151 The total number of items in trainSet: 3972
The total length of testSet: 15266 
Example of testSet: [(76782197.0, 1252300.0, 1.0), (58669755.0, 1491527.0, 1.0)]


In [11]:
"""
  Step2: Invoke cf_get_unit_qty_mean
   - Create a dictionary with key as iprod_sk and value as mean of unit_qty from trainSet                   
   - Create a dataframe with columns of iprod_sk, prod_sk, and unit_qty_mean   
"""
if mean_nrm != "":
   d_mean, df_mean = cf_get_unit_qty_mean(trainSet)

c_get_qty_mean - df.head    iprod_sk  prod_sk  unit_qty
0         0   618044       1.0
1         1  1145662       5.0
2         2  2434543       1.0
3         3    20543       1.0
4         4   659732       2.0
 Function cf_get_unit_qty_mean completed. It took 1.627000 seconds - 0hh:0mm:1ss.
 start time: Sep 23 2019 15:16:16  end time:  Sep 23 2019 15:16:17


In [12]:
"""
  Step3: Invoke cf_get_simsMatrix to get similarity matrix 
"""    
simsMatrix, model = cf_get_simsMatrix(trainSet, sim_measure = sim_mea , user_base_ind = True, knn_algm = knn_algm)

 Function cf_get_simsMatrix completed. It took 152.063993 seconds - 0hh:2mm:32ss.
 start time: Sep 23 2019 15:16:19  end time:  Sep 23 2019 15:18:51


In [13]:

"""
  Step4: Invoke cf_get_userbase_topN
   - Get top 10 recommendation
"""
K = 10
N = 10
topN, nbrK_ary = cf_get_userbase_topN(trainSet, simsMatrix, K, N, d_mean)

 Function cf_get_userbase_topN completed. It took 163.179993 seconds - 0hh:2mm:43ss.
 start time: Sep 23 2019 15:18:51  end time:  Sep 23 2019 15:21:34


In [14]:
"""
  Step5: Invoke cf_HitRate to get hit rate 
   - Get top 10 recommendation
  Notes:
   - without applying unit_qty_mean for the normalization,  the HR =  0.024957421721472552 
   - Applying unit_qty_mean for the normalization,  the HR = 0  
   - leftOutPredictions is a "prediction" object containing:
     The (raw) user id uid.
     The (raw) item id iid.
     The true rating
     The estimated rating   
"""
topN_x = topN.copy() # backup topN
leftOutPredictions = model.test(testSet)   
print("HR", round(cf_HitRate(topN, leftOutPredictions), 4), "for top ", K, "top ", N, "neighbors" )


HR 0.021551159439276824 for top  10 top  10 neighbors


In [15]:
"""
  Step6: Invoke cf_comb_topN_nbrK  
   - Merge top N recommended items and top K neighbors
     -  knbr: k nearest raw user id
     -  scrs: similarity score
"""
#Merge recommended items and similar users
lbl_topN = ['hh_sk'] + ['prod_1','scr_1','prod_2','scr_2','prod_3','scr_3','prod_4','scr_4','prod_5','scr_5',
          'prod_6','scr_6','prod_7','scr_7', 'prod_8','scr_8','prod_9','scr_9','prod_10','scr_10']
lbl_nbrK= ['hh_sk'] + ['knbr_1','scrs_1','knbr_2','scrs_2','knbr_3','scrs_3','knbr_4','scrs_4','knbr_5','scrs_5',
          'knbr_6','scrs_6','knbr_7','scrs_7', 'knbr_8','scrs_8','knbr_9','scrs_9','knbr_10','scrs_10']

df_topN_nbrK = cf_comb_topN_nbrK(topN, nbrK_ary, lbl_topN, lbl_nbrK )

[[76782197, 622127, 3.3598, 1068396, 3.2342, 1491527, 2.9381, 632059, 2.7929, 82267, 2.7906, 1290148, 2.4831, 1127635, 2.451, 592484, 2.449, 625466, 2.2783, 483819, 2.2534]]


In [16]:
"""
 step7: Load the combined dataframe to a table 
"""
cr_tbl_ind = True
cf_ld_topN_nbrK_tbl(df_topN_nbrK, tbl_cmb,  cr_tbl_ind, td_enginex = td_enginex )

 Function cf_ld_topN_nbrK_tbl to load table rs_s0025_junaug19_knnb_cosine_meannrm completed. It took 410.344988 seconds - 0hh:6mm:50ss.
 start time: Sep 23 2019 15:21:41  end time:  Sep 23 2019 15:28:31


In [55]:
#AB test
def f_rs_comp(trainSet, testSet, knn_algm, sim_mea, nrm_adj, K = 10, N = 10):
    
   start_time = time.time() 
   """
     Invoke cf_get_simsMatrix to get similarity matrix 
   """    
   simsMatrix, model = cf_get_simsMatrix(trainSet, sim_measure = sim_mea , user_base_ind = True, knn_algm = knn_algm)
   """
     Invoke cf_get_userbase_topN
     - Get top 10 recommendation
   """
   topN, nbrK_ary = cf_get_userbase_topN(trainSet, simsMatrix, K, N, nrm_adj)
  
   topN_x = topN.copy() # backup topN
   leftOutPredictions = model.test(testSet)   
   hr = round(cf_HitRate(topN, leftOutPredictions), 4)
   print("Hit Rate = {0} with algorithm = {1}, Similarity Measure = {2}".format(hr, knn_algm, sim_mea ))
   if  type(nrm_adj) is dict:        
       print("for top {0} recommended products and {1} nearest neighbors applied product mean for normalization".format(N, K) )
   else:    
       print("for top {0} recommended products and {1} nearest neighbors applied max unit_qty = {2} for the normalization".format(N, K, nrm_adj) )
  
   end_time = time.time()
   cf_elapse_time (start_time, end_time, dsc = ' ')
   return topN 

In [56]:
f_rs_comp(trainSet, testSet, KNNBasic, 'cosine', d_mean, K = 10, N = 10)

 Function cf_get_simsMatrix completed. It took 161.642000 seconds - 0hh:2mm:41ss.
 start time: Sep 24 2019 10:38:01  end time:  Sep 24 2019 10:40:42
 Function cf_get_userbase_topN completed. It took 217.798000 seconds - 0hh:3mm:37ss.
 start time: Sep 24 2019 10:40:42  end time:  Sep 24 2019 10:44:20
Hit Rate = 0.0216 with algorithm = <class 'surprise.prediction_algorithms.knns.KNNBasic'>, Similarity Measure = cosine
for top 10 recommended products and 10 nearest neighbors applied product mean for normalization
   It took 385.076000 seconds - 0hh:6mm:25ss.
 start time: Sep 24 2019 10:38:00  end time:  Sep 24 2019 10:44:26


In [57]:
topN_KB_coc_mean = f_rs_comp(trainSet, testSet, KNNBasic, 'cosine', d_mean)

 Function cf_get_simsMatrix completed. It took 135.784000 seconds - 0hh:2mm:15ss.
 start time: Sep 24 2019 11:03:45  end time:  Sep 24 2019 11:06:01
 Function cf_get_userbase_topN completed. It took 153.710000 seconds - 0hh:2mm:33ss.
 start time: Sep 24 2019 11:06:01  end time:  Sep 24 2019 11:08:35
Hit Rate = 0.0216 with algorithm = <class 'surprise.prediction_algorithms.knns.KNNBasic'>, Similarity Measure = cosine
for top 10 recommended products and 10 nearest neighbors applied product mean for normalization
   It took 294.910000 seconds - 0hh:4mm:54ss.
 start time: Sep 24 2019 11:03:45  end time:  Sep 24 2019 11:08:40


In [60]:
topN_KZ_cos_mean = f_rs_comp(trainSet, testSet, KNNWithZScore, 'cosine', d_mean)
topN_KM_cos_mean = f_rs_comp(trainSet, testSet, KNNWithMeans, 'cosine', d_mean)
topN_KBL_cos_mean = f_rs_comp(trainSet, testSet, KNNBaseline, 'cosine', d_mean)

 Function cf_get_simsMatrix completed. It took 138.287000 seconds - 0hh:2mm:18ss.
 start time: Sep 24 2019 11:24:56  end time:  Sep 24 2019 11:27:14
 Function cf_get_userbase_topN completed. It took 153.251000 seconds - 0hh:2mm:33ss.
 start time: Sep 24 2019 11:27:14  end time:  Sep 24 2019 11:29:48
Hit Rate = 0.0216 with algorithm = <class 'surprise.prediction_algorithms.knns.KNNWithZScore'>, Similarity Measure = cosine
for top 10 recommended products and 10 nearest neighbors applied product mean for normalization
   It took 297.455000 seconds - 0hh:4mm:57ss.
 start time: Sep 24 2019 11:24:56  end time:  Sep 24 2019 11:29:53
 Function cf_get_simsMatrix completed. It took 132.679000 seconds - 0hh:2mm:12ss.
 start time: Sep 24 2019 11:29:54  end time:  Sep 24 2019 11:32:07
 Function cf_get_userbase_topN completed. It took 159.531000 seconds - 0hh:2mm:39ss.
 start time: Sep 24 2019 11:32:07  end time:  Sep 24 2019 11:34:46
Hit Rate = 0.0216 with algorithm = <class 'surprise.prediction_al

NameError: name 'KNNBasicline' is not defined

In [None]:
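# Keep only the most popular hh_sk and prod_sk values
from collections import Counter
user_ids_count = Counter(df_sum_qty.HH_SK)   # type: collections.Counter
prod_ids_count = Counter(df_sum_qty.PROD_SK)
n = 15306   # number of households to keep
m = 3972    # number of products to keep
user_ids = [u for u, _ in user_ids_count.most_common(n)]
prod_ids = [p for p, _ in prod_ids_count.most_common(m)]   # loop variable renamed so it no longer shadows m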
df_sum_qty_final = df_sum_qty[df_sum_qty.HH_SK.isin(user_ids) & df_sum_qty.PROD_SK.isin(prod_ids)]
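
The popularity filter in the cell above can be tried on a small example. The sketch below uses plain tuples instead of a DataFrame so that it runs without pandas; the helper name `keep_most_common` and the rows are hypothetical. Note that `Counter.most_common(n)` returns every entry when `n` is at least the number of distinct ids, in which case the filter keeps all rows.

```python
from collections import Counter

# Hypothetical (hh_sk, prod_sk, unit_qty) rows standing in for df_sum_qty
rows = [
    (1, "a", 2), (1, "b", 1), (1, "c", 3),
    (2, "a", 1), (2, "b", 4),
    (3, "a", 5),
]


def keep_most_common(rows, n_users, n_prods):
    """Keep rows whose user is among the n_users most frequent users
    and whose product is among the n_prods most frequent products."""
    user_counts = Counter(r[0] for r in rows)
    prod_counts = Counter(r[1] for r in rows)
    top_users = {u for u, _ in user_counts.most_common(n_users)}
    top_prods = {p for p, _ in prod_counts.most_common(n_prods)}
    return [r for r in rows if r[0] in top_users and r[1] in top_prods]


# Users 1 and 2 and products "a" and "b" are the two most frequent of each kind
filtered = keep_most_common(rows, n_users=2, n_prods=2)
print(filtered)

# Asking for more ids than exist leaves the data unchanged
unchanged = keep_most_common(rows, n_users=10, n_prods=10)
```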

In [61]:
topN_KBL_cos_mean = f_rs_comp(trainSet, testSet, KNNBaseline, 'cosine', d_mean)

 Function cf_get_simsMatrix completed. It took 137.354000 seconds - 0hh:2mm:17ss.
 start time: Sep 24 2019 11:41:26  end time:  Sep 24 2019 11:43:44
 Function cf_get_userbase_topN completed. It took 153.775000 seconds - 0hh:2mm:33ss.
 start time: Sep 24 2019 11:43:44  end time:  Sep 24 2019 11:46:18
Hit Rate = 0.0216 with algorithm = <class 'surprise.prediction_algorithms.knns.KNNBaseline'>, Similarity Measure = cosine
for top 10 recommended products and 10 nearest neighbors applied product mean for normalization
   It took 297.064000 seconds - 0hh:4mm:57ss.
 start time: Sep 24 2019 11:41:26  end time:  Sep 24 2019 11:46:23


In [63]:
topN_KB_cos_maxr = f_rs_comp(trainSet, testSet, KNNBasic, 'cosine', nrm_adj = 10)

 Function cf_get_simsMatrix completed. It took 147.971000 seconds - 0hh:2mm:27ss.
 start time: Sep 24 2019 12:01:04  end time:  Sep 24 2019 12:03:32
 Function cf_get_userbase_topN completed. It took 153.381000 seconds - 0hh:2mm:33ss.
 start time: Sep 24 2019 12:03:32  end time:  Sep 24 2019 12:06:06
Hit Rate = 0.025 with algorithm = <class 'surprise.prediction_algorithms.knns.KNNBasic'>, Similarity Measure = cosine
for top 10 recommended products and 10 nearest neighbors applied max unit_qty = 10 for the normalization
   It took 307.521000 seconds - 0hh:5mm:7ss.
 start time: Sep 24 2019 12:01:04  end time:  Sep 24 2019 12:06:12


In [64]:
topN_KZ_cos_maxr = f_rs_comp(trainSet, testSet, KNNWithZScore, 'cosine', nrm_adj = 10)
topN_KM_cos_maxr = f_rs_comp(trainSet, testSet, KNNWithMeans, 'cosine',  nrm_adj = 10)
topN_KBL_cos_maxr = f_rs_comp(trainSet, testSet, KNNBaseline, 'cosine',  nrm_adj = 10)

 Function cf_get_simsMatrix completed. It took 137.634000 seconds - 0hh:2mm:17ss.
 start time: Sep 24 2019 12:17:07  end time:  Sep 24 2019 12:19:25
 Function cf_get_userbase_topN completed. It took 153.988000 seconds - 0hh:2mm:33ss.
 start time: Sep 24 2019 12:19:25  end time:  Sep 24 2019 12:21:59
Hit Rate = 0.025 with algorithm = <class 'surprise.prediction_algorithms.knns.KNNWithZScore'>, Similarity Measure = cosine
for top 10 recommended products and 10 nearest neighbors applied max unit_qty = 10 for the normalization
   It took 297.304000 seconds - 0hh:4mm:57ss.
 start time: Sep 24 2019 12:17:07  end time:  Sep 24 2019 12:22:04
 Function cf_get_simsMatrix completed. It took 136.124000 seconds - 0hh:2mm:16ss.
 start time: Sep 24 2019 12:22:05  end time:  Sep 24 2019 12:24:21
 Function cf_get_userbase_topN completed. It took 153.678000 seconds - 0hh:2mm:33ss.
 start time: Sep 24 2019 12:24:21  end time:  Sep 24 2019 12:26:55
Hit Rate = 0.025 with algorithm = <class 'surprise.predic

In [67]:
topN_KB_peab_maxr = f_rs_comp(trainSet, testSet, KNNBasic, 'pearson_baseline',  nrm_adj = 10)
topN_KBL_peab_maxr = f_rs_comp(trainSet, testSet, KNNBaseline, 'pearson_baseline',  nrm_adj = 10)
topN_KZ_peab_maxr = f_rs_comp(trainSet, testSet, KNNWithZScore, 'pearson_baseline', nrm_adj = 10)
topN_KM_peab_maxr = f_rs_comp(trainSet, testSet, KNNWithMeans, 'pearson_baseline',  nrm_adj = 10)


 Function cf_get_simsMatrix completed. It took 124.425000 seconds - 0hh:2mm:4ss.
 start time: Sep 24 2019 12:34:23  end time:  Sep 24 2019 12:36:27
 Function cf_get_userbase_topN completed. It took 174.636000 seconds - 0hh:2mm:54ss.
 start time: Sep 24 2019 12:36:27  end time:  Sep 24 2019 12:39:22
Hit Rate = 0.0434 with algorithm = <class 'surprise.prediction_algorithms.knns.KNNBasic'>, Similarity Measure = pearson_baseline
for top 10 recommended products and 10 nearest neighbors applied max unit_qty = 10 for the normalization
   It took 304.254000 seconds - 0hh:5mm:4ss.
 start time: Sep 24 2019 12:34:23  end time:  Sep 24 2019 12:39:27
 Function cf_get_simsMatrix completed. It took 117.510000 seconds - 0hh:1mm:57ss.
 start time: Sep 24 2019 12:39:27  end time:  Sep 24 2019 12:41:25
 Function cf_get_userbase_topN completed. It took 181.138000 seconds - 0hh:3mm:1ss.
 start time: Sep 24 2019 12:41:25  end time:  Sep 24 2019 12:44:26
Hit Rate = 0.0434 with algorithm = <class 'surprise.pr

In [68]:
topN_KB_peab_mean = f_rs_comp(trainSet, testSet, KNNBasic, 'pearson_baseline',  nrm_adj = d_mean)
topN_KBL_peab_mean = f_rs_comp(trainSet, testSet, KNNBaseline, 'pearson_baseline',  nrm_adj = d_mean)
topN_KZ_peab_mean = f_rs_comp(trainSet, testSet, KNNWithZScore, 'pearson_baseline', nrm_adj = d_mean)
topN_KM_peab_mean = f_rs_comp(trainSet, testSet, KNNWithMeans, 'pearson_baseline',  nrm_adj = d_mean)

 Function cf_get_simsMatrix completed. It took 119.278000 seconds - 0hh:1mm:59ss.
 start time: Sep 24 2019 12:54:44  end time:  Sep 24 2019 12:56:43
 Function cf_get_userbase_topN completed. It took 180.052000 seconds - 0hh:3mm:0ss.
 start time: Sep 24 2019 12:56:43  end time:  Sep 24 2019 12:59:43
Hit Rate = 0.0426 with algorithm = <class 'surprise.prediction_algorithms.knns.KNNBasic'>, Similarity Measure = pearson_baseline
for top 10 recommended products and 10 nearest neighbors applied product mean for normalization
   It took 304.545000 seconds - 0hh:5mm:4ss.
 start time: Sep 24 2019 12:54:44  end time:  Sep 24 2019 12:59:48
 Function cf_get_simsMatrix completed. It took 117.979000 seconds - 0hh:1mm:57ss.
 start time: Sep 24 2019 12:59:49  end time:  Sep 24 2019 13:01:47
 Function cf_get_userbase_topN completed. It took 186.788000 seconds - 0hh:3mm:6ss.
 start time: Sep 24 2019 13:01:47  end time:  Sep 24 2019 13:04:54
Hit Rate = 0.0426 with algorithm = <class 'surprise.prediction_
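
The runs logged above can be compared side by side. The sketch below copies the hit rates printed for the `KNNBasic` runs (10 recommended products, 10 nearest neighbors) into a dictionary and ranks the configurations. In the visible output the other KNN classes report the same hit rate as `KNNBasic` for a given similarity measure and normalization, so only those two settings are used as keys. The variable names are illustrative.

```python
# Hit rates copied from the KNNBasic output logged above
hit_rates = {
    ("cosine", "product mean"): 0.0216,
    ("cosine", "max unit_qty = 10"): 0.025,
    ("pearson_baseline", "max unit_qty = 10"): 0.0434,
    ("pearson_baseline", "product mean"): 0.0426,
}

# Configuration with the highest hit rate
best_config = max(hit_rates, key=hit_rates.get)

# All configurations, best first
ranked = sorted(hit_rates.items(), key=lambda kv: kv[1], reverse=True)
for (sim, norm), hr in ranked:
    print(f"{sim:17s} {norm:18s} {hr:.4f}")
```

On these numbers the choice of similarity measure matters more than the normalization: both `pearson_baseline` runs score above both `cosine` runs.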