# Implicit Recommendation

### Framework
1. Data type: unary, implicit data
  - Be careful not to confuse binary and unary data. Unary data records that a user consumed something (coded as 1, much like binary data), but carries no information about whether a user disliked or skipped an item (coded as NULL rather than binary data's 0).
  - Implicit data is data we gather from the user's behaviour, with no ratings or specific actions needed. It could be which items a user purchased, how many times they played a song or watched a movie, how long they spent reading a specific article, etc.
2. Algorithms
 1. Traditional algorithms: item-item nearest-neighbor models using (a) the cosine distance metric, (b) TF-IDF or (c) BM25 weighting, with popularity-based recommendation as a baseline.
 2. ALS (Alternating Least Squares) Matrix Factorization
    - Original paper: http://yifanhu.net/PUB/cf.pdf 
    - We can use matrix factorization to reduce the dimensionality of our original “all users by all items” matrix into two much smaller matrices: “all items by some taste dimensions” and “all users by some taste dimensions”. These dimensions are called latent or hidden features, and we learn them from our data.
    - There are different ways to factor a matrix, such as Singular Value Decomposition (SVD) or Probabilistic Latent Semantic Analysis (PLSA), when we are dealing with explicit data. With implicit data, the difference lies in how we treat the missing entries of our very sparse matrix. For explicit data we treat them as unknown fields to which we should assign a predicted rating. For implicit data we cannot assume the same, because the unknown values carry information too: a missing value may mean the user disliked the item, or that they would love it but have not discovered it yet. We therefore need an approach that can learn from the missing data.
    - <img src='https://jessesw.com/images/Rec_images/ALS_Image_Test.png' width=600>
 3. Bayesian Personalized Ranking (BPR)
    - Original paper: https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf
 4. Logistic Matrix Factorization
    - Original paper: http://stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf
 5. Collaborative Less-Is-More Filtering
    - Original paper: https://www.ijcai.org/Proceedings/13/Papers/460.pdf
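
To make the ALS treatment of missing data more concrete, the formulation from the original ALS paper linked above can be summarized as follows. This is a summary in the paper's notation, not a description of the `implicit` library's internals: $r_{ui}$ is the raw observation (for example a purchase count), $p_{ui}$ the binary preference, $c_{ui}$ the confidence, and $\alpha$ and $\lambda$ are hyperparameters.

$$
p_{ui} =
\begin{cases}
1 & r_{ui} > 0 \\
0 & r_{ui} = 0
\end{cases}
\qquad
c_{ui} = 1 + \alpha \, r_{ui}
$$

$$
\min_{x_*, y_*} \; \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{T} y_i \right)^2
+ \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
$$

Every user-item pair contributes to the sum, including the unobserved ones, which enter with preference 0 and the minimum confidence 1. With the item factors $Y$ held fixed, each user vector has a closed-form solution (and symmetrically for items), which is what the alternating steps compute:

$$
x_u = \left( Y^{T} C^{u} Y + \lambda I \right)^{-1} Y^{T} C^{u} \, p(u)
$$

Here $C^{u}$ is the diagonal matrix with $C^{u}_{ii} = c_{ui}$ and $p(u)$ is the vector of preferences $p_{ui}$ for user $u$.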

    

In [None]:
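# The code below uses the pre-0.5 implicit API: fit() takes an item-user matrix
# and recommend() returns a list of (item, score) tuples.
!pip install "implicit<0.5"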

import numpy as np
import pandas as pd
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve
from sklearn import metrics
import time
import random

from implicit.als import AlternatingLeastSquares
from implicit.bpr import BayesianPersonalizedRanking

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/sparsh9012/Recommendation-Engine/master/data/data.csv')
data.head()

Unnamed: 0,customer_id,item_id
0,5000034025459,55649
1,8000040000000,52535
2,1000000000000,76125
3,8000039034732,50489
4,8000027039444,56215


In [None]:
matrix = pd.crosstab(data.customer_id, data.item_id)
sparsed = sparse.csr_matrix(matrix.values)
print('matrix shape: ',matrix.shape)
print('sparse shape: ',sparsed.shape)
matrix.head()

matrix shape:  (1252, 4978)
sparse shape:  (1252, 4978)


customer_id \ item_id,50009,50093,50096,50098,50104,50108,50112,50116,50119,50124,...,91679,91740,91742,91766,91768,91770,91785,91790,91795,91798
1000000000000,0,0,0,0,1,0,0,1,1,1,...,1,0,0,0,0,0,0,0,0,0
1000000201799,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000000216930,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000000220674,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000000237993,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
#dropping customers with very-high-frequency purchase
matrix1 = matrix.loc[matrix.sum(axis=1).values<5000,:]

#dropping customers with very-low-frequency purchase
matrix1 = matrix1.loc[matrix1.sum(axis=1).values>2,:]

#dropping products with very-low-frequency purchase
matrix1 = matrix1.loc[:,matrix1.sum(axis=0).values>2]

sparsed1 = sparse.csr_matrix(matrix1.values)
matrix1.shape
sparsed1.shape

(419, 2359)

In [None]:
#check sparsity
sparsity = round(1.0 - len(data) / float(matrix.shape[0] * matrix.shape[1]), 3)
print('The sparsity level of dataset is ' +  str(sparsity * 100) + '%')

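# NOTE: len(data) counts every interaction in the unfiltered data, so this
# understates the sparsity of the filtered matrix; sparsed1.nnz would give
# the exact number of non-zero cells.
sparsity = round(1.0 - len(data) / float(matrix1.shape[0] * matrix1.shape[1]), 3)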
print('The sparsity level of filtered dataset is ' +  str(sparsity * 100) + '%')

The sparsity level of dataset is 99.7%
The sparsity level of filtered dataset is 98.1%


In [None]:
item_dictionary = { i : matrix.columns[i] for i in range(0, len(matrix.columns) ) }
customer_dictionary = { i : matrix.index[i] for i in range(0, len(matrix.index) ) }

In [None]:
def calculate_recommendations(df_new, model_name='als', factors=32, regularization=0.01, iterations=10):
    
    # initialize models
    if model_name=='als':
        model = AlternatingLeastSquares(factors=factors, regularization=regularization, iterations=iterations)
    elif model_name=='bpr':
        model = BayesianPersonalizedRanking(factors=factors, learning_rate=0.01, regularization=regularization, iterations=iterations)
    
    '''item_users (csr_matrix) – Matrix of confidences for the liked items. 
    This matrix should be a csr_matrix where the rows of the matrix are the item, 
    the columns are the users that liked that item, and the value is the confidence 
    that the user liked the item.'''
    
    model.fit(sparsed.T)
    
    '''user_items (csr_matrix) – A sparse matrix of shape (number_users, number_items). 
    This lets us look up the liked items and their weights for the user. 
    This is used to filter out items that have already been liked from the output, 
    and to also potentially calculate the best items for this user.'''
    
    user_items = sparsed.tocsr()
    
    result = pd.DataFrame(columns=['customer_id', 'recommendation'])

    # Calculates the N best recommendations for a user, and returns a list of itemids, score
    for i in range(matrix.shape[0]):
        rc = model.recommend(i, user_items, N=10)
        result.loc[i,'customer_id'] = matrix.index[i]
        result.loc[i,'recommendation'] = rc

    x = pd.DataFrame(result.recommendation.tolist(), index=result.customer_id).stack().reset_index(level=1, drop=True).reset_index(name='recommendation')
    df_new['customer_id '+model_name] = x['customer_id']
    df_new['recommendation '+model_name] = x['recommendation'].apply(lambda x: x[0])
    df_new['score '+model_name] = x['recommendation'].apply(lambda x: x[1])

    df_new = df_new.replace({'recommendation '+model_name: item_dictionary})
    return df_new

In [None]:
calculate_recommendations(pd.DataFrame(), model_name='als').head()

100%|██████████| 50.0/50 [00:00<00:00, 104.95it/s]


Unnamed: 0,customer_id als,recommendation als,score als
0,1000000000000,55659,0.010661
1,1000000000000,54310,0.008409
2,1000000000000,63087,0.007403
3,1000000000000,56780,0.006668
4,1000000000000,51966,0.00661


In [None]:
df_n = pd.DataFrame()
df_n = calculate_recommendations(df_n, model_name='als')
df_n = calculate_recommendations(df_n, model_name='bpr')
df_n.head()

100%|██████████| 50.0/50 [00:00<00:00, 109.66it/s]
100%|██████████| 100/100 [00:00<00:00, 142.15it/s, correct=68.43%, skipped=33.39%]


Unnamed: 0,customer_id als,recommendation als,score als,customer_id bpr,recommendation bpr,score bpr
0,1000000000000,55659,0.007136,1000000000000,89962,0.708574
1,1000000000000,73975,0.007122,1000000000000,50559,0.700179
2,1000000000000,66134,0.006804,1000000000000,64293,0.697164
3,1000000000000,65875,0.006423,1000000000000,64181,0.676541
4,1000000000000,56780,0.006374,1000000000000,51507,0.666476


### Model Evaluation
- It is important to realize that we have no reliable feedback about which items are disliked. A missing interaction can have many causes, and we are also unable to track user reactions to our recommendations. Precision-based metrics, and rating-error metrics such as RMSE and MSE, are therefore not very appropriate, because they only make sense when we know which items a user dislikes. However, consuming an item is an indication of liking it, which makes recall-oriented measures applicable.
1.	Random masking and measuring predicted vs. actual values of masked values – ROC AUC score
   <img src='https://jessesw.com/images/Rec_images/MaskTrain.png' width=600>
2.	Recall-based ranking evaluation – **Mean Percentage Ranking (MPR)**, a.k.a. expected percentile ranking. Lower values of MPR are more desirable. The expected MPR of random predictions is 50%, so MPR ≥ 50% indicates an algorithm no better than random.
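
MPR is described above but not computed anywhere in this notebook, so here is a minimal pure-Python sketch of the idea. The function name `mean_percentile_rank`, its input layout (per-user score lists and per-user dicts of held-out items), and the toy data are assumptions made for illustration. In a real evaluation the items already present in the training set would normally be excluded from each user's ranking first.

```python
def mean_percentile_rank(scores, held_out):
    """Expected percentile ranking over held-out interactions.

    scores   -- list of per-user score lists (higher = recommended earlier)
    held_out -- list of per-user dicts {item_index: held-out interaction count}
    Returns a value in [0, 1]; 0 means every held-out item was ranked first.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for user_scores, user_held_out in zip(scores, held_out):
        n_items = len(user_scores)
        # Item indices ordered from most to least recommended
        order = sorted(range(n_items), key=lambda i: -user_scores[i])
        position = {item: pos for pos, item in enumerate(order)}
        for item, weight in user_held_out.items():
            # 0.0 = top of the list, 1.0 = bottom of the list
            percentile = position[item] / (n_items - 1)
            weighted_sum += weight * percentile
            total_weight += weight
    return weighted_sum / total_weight


# Hypothetical toy data: two users, three items
toy_scores = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.4]]
print(mean_percentile_rank(toy_scores, [{0: 1}, {1: 1}]))  # held-out items ranked first
print(mean_percentile_rank(toy_scores, [{1: 1}, {0: 1}]))  # held-out items ranked last
```

The first call gives the best possible value and the second the worst, which brackets the 50% expected for random rankings.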


In [None]:
def make_train(ratings, pct_test = 0.2):
    '''
    This function will take in the original user-item matrix and "mask" a percentage of the original ratings where a
    user-item interaction has taken place for use as a test set. The test set will contain all of the original ratings, 
    while the training set replaces the specified percentage of them with a zero in the original ratings matrix. 
    
    parameters: 
    
    ratings - the original ratings matrix from which you want to generate a train/test set. Test is just a complete
    copy of the original set. This is in the form of a sparse csr_matrix. 
    
    pct_test - The percentage of user-item interactions where an interaction took place that you want to mask in the 
    training set for later comparison to the test set, which contains all of the original ratings. 
    
    returns:
    
    training_set - The altered version of the original data with a certain percentage of the user-item pairs 
    that originally had interaction set back to zero.
    
    test_set - A copy of the original ratings matrix, unaltered, so it can be used to see how the rank order 
    compares with the actual interactions.
    
    user_inds - From the randomly selected user-item indices, which user rows were altered in the training data.
    This will be necessary later when evaluating the performance via AUC.
    '''
    test_set = ratings.copy() # Make a copy of the original set to be the test set. 
    test_set[test_set != 0] = 1 # Store the test set as a binary preference matrix
    training_set = ratings.copy() # Make a copy of the original data we can alter as our training set. 
    nonzero_inds = training_set.nonzero() # Find the indices in the ratings data where an interaction exists
    nonzero_pairs = list(zip(nonzero_inds[0], nonzero_inds[1])) # Zip these pairs together of user,item index into list
    random.seed(0) # Set the random seed to zero for reproducibility
    num_samples = int(np.ceil(pct_test*len(nonzero_pairs))) # Round the number of samples needed up to the next integer
    samples = random.sample(nonzero_pairs, num_samples) # Sample a random number of user-item pairs without replacement
    user_inds = [index[0] for index in samples] # Get the user row indices
    item_inds = [index[1] for index in samples] # Get the item column indices
    training_set[user_inds, item_inds] = 0 # Assign all of the randomly chosen user-item pairs to zero
    training_set.eliminate_zeros() # Get rid of zeros in sparse array storage after update to save space
    return training_set, test_set, list(set(user_inds)) # Output the unique list of user rows that were altered 

In [None]:
def auc_score(predictions, test):
    '''
    This simple function will output the area under the curve using sklearn's metrics. 
    
    parameters:
    
    - predictions: your prediction output
    
    - test: the actual target result you are comparing to
    
    returns:
    
    - AUC (area under the Receiver Operating Characteristic curve)
    '''
    fpr, tpr, thresholds = metrics.roc_curve(test, predictions)
    return metrics.auc(fpr, tpr)
  

  
def calc_mean_auc(training_set, altered_users, predictions, test_set):
    '''
    This function will calculate the mean AUC by user for any user that had their user-item matrix altered. 
    
    parameters:
    
    training_set - The training set resulting from make_train, where a certain percentage of the original
    user/item interactions are reset to zero to hide them from the model 
    
    predictions - The matrix of your predicted ratings for each user/item pair as output from the implicit MF.
    These should be stored in a list, with user vectors as item zero and item vectors as item one. 
    
    altered_users - The indices of the users where at least one user/item pair was altered from make_train function
    
    test_set - The test set constructed earlier from make_train function
    
    

    returns:
    
    The mean AUC (area under the Receiver Operating Characteristic curve) of the test set, computed only on user-item
    interactions that were originally zero in the training set, together with the AUC of a most-popular-items benchmark.
    '''
    
    
    store_auc = [] # An empty list to store the AUC for each user that had an item removed from the training set
    popularity_auc = [] # To store popular AUC scores
    pop_items = np.array(test_set.sum(axis = 0)).reshape(-1) # Get sum of item interactions to find most popular
    item_vecs = predictions[1]
    for user in altered_users: # Iterate through each user that had an item altered
        training_row = training_set[user,:].toarray().reshape(-1) # Get the training set row
        zero_inds = np.where(training_row == 0) # Find where the interaction had not yet occurred
        # Get the predicted values based on our user/item vectors
        user_vec = predictions[0][user,:]
        pred = user_vec.dot(item_vecs).toarray()[0,zero_inds].reshape(-1)
        # Get only the items that were originally zero
        # Select all ratings from the MF prediction for this user that originally had no interaction
        actual = test_set[user,:].toarray()[0,zero_inds].reshape(-1) 
        # Select the binarized yes/no interaction pairs from the original full data
        # that align with the same pairs in training 
        pop = pop_items[zero_inds] # Get the item popularity for our chosen items
        store_auc.append(auc_score(pred, actual)) # Calculate AUC for the given user and store
        popularity_auc.append(auc_score(pop, actual)) # Calculate AUC using most popular and score
    # End users iteration
    
    # Return the mean AUC rounded to three decimal places for both test and popularity benchmark
    return float('%.3f'%np.mean(store_auc)), float('%.3f'%np.mean(popularity_auc))  
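
To see what the per-user AUC above measures, it can help to compute it by hand for a single user. The sketch below is a pure-Python illustration, not the notebook's implementation: `pairwise_auc` is a hypothetical helper and the scores are made-up toy values. It uses the fact that the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive item receives the higher score.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


# One user: four candidate items that were zero in the training row,
# two of which are masked interactions (label 1) in the test row.
model_scores = [0.9, 0.8, 0.3, 0.1]  # hypothetical model predictions
popularity = [5, 40, 12, 7]          # hypothetical item interaction counts
held_out = [1, 0, 1, 0]

print(pairwise_auc(model_scores, held_out))  # 0.75
print(pairwise_auc(popularity, held_out))    # 0.25
```

In this toy case the model ranks three of the four pairs correctly, while the popularity benchmark gets only one right. `calc_mean_auc` makes the same comparison for each altered user and averages the results.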

In [None]:
### ALS ###

# hyperparameters

PCT = [0.2,0.3]
factors = [32,64,128]
regularization = [0.01,0.05,0.1,0.2]
iterations = [10,20,50]

# PCT = [0.2]
# factors = [32]
# regularization = [0.01]
# iterations = [10]

scores = pd.DataFrame(columns = ['PCT','factors','regularization','iterations','score'])

# Grid-search hyperparameter optimization
for i in PCT:
  for j in factors:
    for k in regularization: 
      for l in iterations:
        # creating train, test and altered sets
        train, test, altered = make_train(sparsed, pct_test = i)

        # calculate the confidence by multiplying it by our alpha value
        alpha_val = 15
        train = (train * alpha_val).astype('double')

        # defining the model (the class is imported directly at the top of the notebook)
        model = AlternatingLeastSquares(factors=j, regularization = k, iterations = l)

        # training the model
        model.fit(train.T)

        # AUC for our recommender system
        score = calc_mean_auc(train, altered, [sparse.csr_matrix(model.user_factors), sparse.csr_matrix(model.item_factors.T)], test)
        print(model.user_factors.shape)
        # saving in a dataframe, one new row per combination so that
        # earlier results are not overwritten
        index = len(scores)
        scores.loc[index,'PCT'] = i
        scores.loc[index,'factors'] = j
        scores.loc[index,'regularization'] = k
        scores.loc[index,'iterations'] = l
        scores.loc[index,'score'] = score
        

100%|██████████| 10.0/10 [00:00<00:00, 161.82it/s]
100%|██████████| 20.0/20 [00:00<00:00, 160.10it/s]

(1252, 32)



[Remaining training progress bars omitted. After each fit the cell printed `model.user_factors.shape`: (1252, 32), (1252, 64) or (1252, 128), matching the `factors` value of that run.]


In [None]:
pd.DataFrame(scores.score.tolist(), index=scores[['PCT','factors','regularization','iterations']], columns=['ALS','Popularity']).sort_values(by='ALS', ascending=False)

Unnamed: 0,ALS,Popularity
"(0.3, 128, 0.2, 10)",0.52,0.602
"(0.3, 128, 0.2, 20)",0.52,0.602
"(0.3, 128, 0.2, 50)",0.519,0.602


In [None]:
### BPR ###

# hyperparameters

PCT = [0.2, 0.3]
factors = [31,63,127]
regularization = [0.01, 0.03, 0.05, 0.1, 0.2]
# learning_rate = [0.01, 0.05, 0.1]
iterations = [10,20,50]

scores = pd.DataFrame(columns = ['PCT','factors','regularization','iterations','score'])

# Grid-search hyperparameter optimization
for i in PCT:
  for j in factors:
    for k in regularization: 
      for l in iterations:
        # creating train, test and altered sets
        train, test, altered = make_train(sparsed, pct_test = i)

        # calculate the confidence by multiplying it by our alpha value
        alpha_val = 15
        train = (train * alpha_val).astype('double')

        # defining the model
        model = BayesianPersonalizedRanking(factors=j, regularization = k, iterations = l)

        # training the model
        model.fit(train.T)

        # AUC for our recommender system
        score = calc_mean_auc(train, altered, [sparse.csr_matrix(model.user_factors), sparse.csr_matrix(model.item_factors.T)], test)
        
        # saving in a dataframe, one new row per combination so that
        # earlier results are not overwritten
        index = len(scores)
        scores.loc[index,'PCT'] = i
        scores.loc[index,'factors'] = j
        scores.loc[index,'regularization'] = k
        scores.loc[index,'iterations'] = l
        scores.loc[index,'score'] = score

100%|██████████| 10/10 [00:00<00:00, 148.72it/s, correct=53.86%, skipped=27.36%]
100%|██████████| 20/20 [00:00<00:00, 124.86it/s, correct=54.75%, skipped=27.40%]
100%|██████████| 50/50 [00:00<00:00, 141.36it/s, correct=55.18%, skipped=27.36%]
100%|██████████| 10/10 [00:00<00:00, 179.50it/s, correct=53.87%, skipped=27.27%]
100%|██████████| 20/20 [00:00<00:00, 145.93it/s, correct=53.54%, skipped=27.76%]
100%|██████████| 50/50 [00:00<00:00, 157.31it/s, correct=55.41%, skipped=28.05%]
100%|██████████| 10/10 [00:00<00:00, 193.12it/s, correct=53.65%, skipped=27.72%]
100%|██████████| 20/20 [00:00<00:00, 139.21it/s, correct=54.41%, skipped=27.09%]
100%|██████████| 50/50 [00:00<00:00, 164.14it/s, correct=55.21%, skipped=27.41%]
100%|██████████| 10/10 [00:00<00:00, 159.37it/s, correct=53.42%, skipped=27.46%]
100%|██████████| 20/20 [00:00<00:00, 136.66it/s, correct=53.07%, skipped=27.23%]
100%|██████████| 50/50 [00:00<00:00, 168.69it/s, correct=54.50%, skipped=28.13%]
[output truncated]

In [None]:
pd.DataFrame(scores.score.tolist(), index=scores[['PCT','factors','regularization','iterations']], columns=['BPR','Popularity']).sort_values(by='BPR', ascending=False)

Unnamed: 0,BPR,Popularity
"(0.3, 127, 0.2, 10)",0.394,0.602
"(0.3, 127, 0.2, 20)",0.3,0.602
"(0.3, 127, 0.2, 50)",0.232,0.602


In [None]:
# ALS best parameters - (0.2, 128, 0.05, 20)	
# BPR best parameters - (0.2, 127, 0.05, 20)	

df_n = pd.DataFrame()
df_n = calculate_recommendations(df_n, model_name='als', factors=128, regularization=0.05, iterations=20)
df_n = calculate_recommendations(df_n, model_name='bpr', factors=127, regularization=0.05, iterations=20)
df_n.to_csv('recommendations.csv')
df_n.head()

100%|██████████| 50.0/50 [00:00<00:00, 153.78it/s]
100%|██████████| 20/20 [00:00<00:00, 121.75it/s, correct=55.94%, skipped=34.42%]


Unnamed: 0,customer_id als,recommendation als,score als,customer_id bpr,recommendation bpr,score bpr
0,1000000000000,90596,0.008612,1000000000000,64293,0.098874
1,1000000000000,64274,0.008383,1000000000000,57616,0.091008
2,1000000000000,90387,0.008383,1000000000000,89962,0.089681
3,1000000000000,51754,0.008383,1000000000000,55467,0.085606
4,1000000000000,51694,0.008383,1000000000000,51053,0.076858


### References
1.	https://www.benfrederickson.com/distance-metrics/
2.	https://github.com/benfred/bens-blog-code/blob/master/distance-metrics/calculate_similar.py
3.	https://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/
4.	https://www.benfrederickson.com/fast-implicit-matrix-factorization/
5.	https://www.ethanrosenthal.com/2016/10/19/implicit-mf-part-1/
6.	https://jessesw.com/Rec-System/
7.	https://github.com/benfred/implicit/blob/master/examples/lastfm.py
8.	https://github.com/benfred/implicit
9.	https://towardsdatascience.com/large-scale-jobs-recommendation-engine-using-implicit-data-in-pyspark-ccf8df5d910e
10.	http://activisiongamescience.github.io/2016/01/11/Implicit-Recommender-Systems-Biased-Matrix-Factorization/
11.	https://arxiv.org/pdf/1705.00105.pdf
12.	https://www.ijcai.org/Proceedings/15/Papers/255.pdf
13.	https://www.kaggle.com/c/msdchallenge/
14.	https://stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf
15.	http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.167.5120&rep=rep1&type=pdf 
16.	http://adrem.uantwerpen.be/bibrem/pubs/verstrepen15PhDthesis.pdf 
17.	https://pdfs.semanticscholar.org/eb95/7789f53814a290bc0f8bb01dd01cbd0746cc.pdf 
18.	https://implicit.readthedocs.io/en/latest/ 
19.	https://github.com/akhilesh-reddy/Implicit-data-based-recommendation-system/blob/master/Implicit%20data%20based%20recommendation%20system%20using%20ALS.ipynb 
