## Background

*Wik Hung Pun*

*6-9-2017*

Collaborative filtering (CF) is a topic that eluded me when I first studied machine learning. This Steam games dataset gave me a reason to explore it properly. After reading some excellent posts by [Ethan Rosenthal](http://blog.ethanrosenthal.com/), [Katherine Bailey](http://katbailey.github.io/post/matrix-factorization-with-tensorflow/), and [Jesse Steinweg-Woods](https://jessesw.com/Rec-System/), I found the concept behind collaborative filtering fairly simple and easy to understand. I doubt I can explain it nearly as well as these authors do, but I would still like to share what I have learned as a primer on collaborative filtering (and to reinforce my own learning). If you are interested in the topic, please do check out the authors linked above. Now, without further ado...

# Introduction

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf # the TensorFlow 1.x API (tf.placeholder, tf.Session) is used below
import random
from sklearn.metrics import roc_curve, auc, average_precision_score

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

#from subprocess import check_output
#print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [2]:
#path = '../input/steam-200k.csv'
path = 'steam-200k.csv'
df = pd.read_csv(path, header = None,
                 names = ['UserID', 'Game', 'Action', 'Hours', 'Not Needed'])
df.head()

| | UserID | Game | Action | Hours | Not Needed |
|---|---|---|---|---|---|
| 0 | 151603712 | The Elder Scrolls V Skyrim | purchase | 1.0 | 0 |
| 1 | 151603712 | The Elder Scrolls V Skyrim | play | 273.0 | 0 |
| 2 | 151603712 | Fallout 4 | purchase | 1.0 | 0 |
| 3 | 151603712 | Fallout 4 | play | 87.0 | 0 |
| 4 | 151603712 | Spore | purchase | 1.0 | 0 |


In [3]:
df['Hours_Played'] = df['Hours'].astype('float32')

In [4]:
df.loc[(df['Action'] == 'purchase') & (df['Hours'] == 1.0), 'Hours_Played'] = 0

In [5]:
df.UserID = df.UserID.astype('int')
df = df.sort_values(['UserID', 'Game', 'Hours_Played'])

In [6]:
clean_df = df.drop_duplicates(['UserID', 'Game'], keep = 'last').drop(['Action', 'Hours', 'Not Needed'], axis = 1)
clean_df.head()

| | UserID | Game | Hours_Played |
|---|---|---|---|
| 65430 | 5250 | Alien Swarm | 4.9 |
| 65424 | 5250 | Cities Skylines | 144.0 |
| 65435 | 5250 | Counter-Strike | 0.0 |
| 65436 | 5250 | Counter-Strike Source | 0.0 |
| 65437 | 5250 | Day of Defeat | 0.0 |


In [7]:
n_users = len(clean_df.UserID.unique())
n_games = len(clean_df.Game.unique())

print('There are {0} users and {1} games in the data'.format(n_users, n_games))

There are 12393 users and 5155 games in the data


In [8]:
sparsity = clean_df.shape[0] / float(n_users * n_games)
print('{:.2%} of the user-item matrix is filled'.format(sparsity))

0.20% of the user-item matrix is filled


In [9]:
from collections import Counter

user_counter = Counter()
for user in clean_df.UserID.tolist():
    user_counter[user] +=1

game_counter = Counter()
for game in clean_df.Game.tolist():
    game_counter[game] += 1

In [10]:
user2idx = {user: i for i, user in enumerate(user_counter.keys())}
idx2user = {i: user for user, i in user2idx.items()}

game2idx = {game: i for i, game in enumerate(game_counter.keys())}
idx2game = {i: game for game, i in game2idx.items()}
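
The dictionaries above map raw IDs and titles to consecutive integer indices. As an aside, pandas can produce a similar encoding in one call with `pd.factorize`. The sketch below uses a made-up four-row frame (the user ID 76767 is hypothetical), not the Steam data, and is only meant to show the shape of the result.

```python
import pandas as pd

# Toy frame for illustration only; 76767 is a hypothetical user ID.
toy = pd.DataFrame({
    'UserID': [5250, 5250, 76767, 5250],
    'Game': ['Alien Swarm', 'Counter-Strike', 'Counter-Strike', 'Day of Defeat'],
})

# factorize returns integer codes (numbered by first appearance)
# together with the unique values those codes refer to.
user_codes, user_uniques = pd.factorize(toy['UserID'])
game_codes, game_uniques = pd.factorize(toy['Game'])

print(user_codes.tolist())  # [0, 0, 1, 0]
print(game_codes.tolist())  # [0, 1, 1, 2]
print(game_uniques[2])      # Day of Defeat
```

Here `game_uniques` plays the role of `idx2game`: indexing it with a code gives back the title.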

In [11]:
user_idx = clean_df['UserID'].apply(lambda x: user2idx[x]).values
game_idx = clean_df['gameIdx'] = clean_df['Game'].apply(lambda x: game2idx[x]).values
pref = np.repeat([1], clean_df.shape[0])
hours = clean_df['Hours_Played'].values

In [12]:
#from scipy.sparse import csr_matrix
#user_game_matrix = csr_matrix((pref, (user_idx, game_idx)), shape = (n_users, n_games))
#interactions_matrix = csr_matrix((hours, (user_idx, game_idx)), shape = (n_users, n_games))
zero_matrix = np.zeros(shape = (n_users, n_games))
user_game_pref = zero_matrix.copy()
user_game_pref[user_idx, game_idx] = 1

user_game_interactions = zero_matrix.copy()
user_game_interactions[user_idx, game_idx] = hours + 1
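
Each dense array above holds 12393 × 5155 entries (roughly 64 million), although only 0.20% of them are nonzero. The commented-out lines hint at a sparse alternative. A minimal sketch of that idea with `scipy.sparse.csr_matrix`, using small made-up index arrays in place of `user_idx`, `game_idx`, and `hours`, might look like this:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins for user_idx, game_idx and hours (hypothetical values).
toy_user_idx = np.array([0, 0, 1, 2, 2])
toy_game_idx = np.array([0, 2, 1, 0, 3])
toy_hours = np.array([4.9, 0.0, 144.0, 0.0, 12.5], dtype=np.float32)
toy_n_users, toy_n_games = 3, 4

# Binary preference matrix: 1 wherever a user owns a game.
toy_pref = csr_matrix(
    (np.ones_like(toy_hours), (toy_user_idx, toy_game_idx)),
    shape=(toy_n_users, toy_n_games),
)

# Interaction matrix: hours + 1, so owned-but-unplayed games stay nonzero.
toy_interactions = csr_matrix(
    (toy_hours + 1, (toy_user_idx, toy_game_idx)),
    shape=(toy_n_users, toy_n_games),
)

density = toy_pref.nnz / float(toy_n_users * toy_n_games)
print('{:.2%} of the toy matrix is filled'.format(density))
```

Because the placeholders later in this notebook are declared with a dense shape, a sparse matrix would presumably need `.toarray()` before being passed through `feed_dict`, so the sparse form mainly helps with storage and preprocessing.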

In [13]:
k = 10 # number of purchases to hold out per user, and the k in top-k precision

# Count the number of purchases for each user
purchase_counts = user_game_pref.sum(axis = 1)
buyers_idx = np.where(purchase_counts >= 2 * k)[0] # users who purchased at least 2 * k games
print('{0} users bought {1} or more games'.format(len(buyers_idx), 2 * k))

1265 users bought 20 or more games


In [14]:
test_frac = 0.4
test_users_idx = np.random.choice(buyers_idx,
                                  size = int(np.ceil(len(buyers_idx) * test_frac)),
                                  replace = False)

In [15]:
val_users_idx = test_users_idx[:int(len(test_users_idx) / 2)]
test_users_idx = test_users_idx[int(len(test_users_idx) / 2):]

In [16]:
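def data_process(dat, train, test, user_idx, k):
    # For each selected user, hide k randomly chosen purchases from the
    # training matrix and record them in the held-out matrix.
    # Note: train and test are modified in place.
    for user in user_idx:
        purchases = np.where(dat[user, :] == 1)[0]
        mask = np.random.choice(purchases, size = k, replace = False)

        train[user, mask] = 0
        test[user, mask] = dat[user, mask]
    return train, test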

In [17]:
train_matrix = user_game_pref.copy()
test_matrix = zero_matrix.copy()
val_matrix = zero_matrix.copy()

train_matrix, val_matrix = data_process(user_game_pref, train_matrix, val_matrix, val_users_idx, k)
train_matrix, test_matrix = data_process(user_game_pref, train_matrix, test_matrix, test_users_idx, k)

In [18]:
test_matrix[test_users_idx[0], test_matrix[test_users_idx[0], :].nonzero()[0]]

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [19]:
train_matrix[test_users_idx[0], test_matrix[test_users_idx[0], :].nonzero()[0]]

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
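
To see the effect of `data_process` without the full dataset, here is a small self-contained sketch on a made-up 2 × 6 preference matrix. The function `hold_out` is a copy of the logic above under a different name; the toy matrix, `toy_k`, and the seed are assumptions for illustration only.

```python
import numpy as np

def hold_out(dat, train, test, user_idx, k):
    # Same logic as data_process above: move k random purchases per user
    # from the training matrix into the held-out matrix (in place).
    for user in user_idx:
        purchases = np.where(dat[user, :] == 1)[0]
        mask = np.random.choice(purchases, size=k, replace=False)
        train[user, mask] = 0
        test[user, mask] = dat[user, mask]
    return train, test

np.random.seed(0)  # arbitrary seed so the sketch is repeatable
toy_pref = np.array([[1., 1., 1., 1., 0., 0.],
                     [0., 1., 1., 1., 1., 1.]])
toy_k = 2

toy_train, toy_test = hold_out(toy_pref, toy_pref.copy(),
                               np.zeros_like(toy_pref), [0, 1], toy_k)

# Each selected user has exactly k purchases moved to the held-out matrix ...
print(toy_test.sum(axis=1))  # [2. 2.]
# ... and train + held-out recovers the original matrix.
print(np.array_equal(toy_train + toy_test, toy_pref))  # True
```

The two checks mirror cells 18 and 19: held-out entries are 1 in the test matrix and 0 in the training matrix.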

In [29]:
tf.reset_default_graph()
pref = tf.placeholder(tf.float32, (n_users, n_games))
interactions = tf.placeholder(tf.float32, (n_users, n_games))
users_idx = tf.placeholder(tf.int32, (None))

In [30]:
n_features = 30
X = tf.Variable(tf.truncated_normal([n_users, n_features], mean = 0, stddev = 0.2))
Y = tf.Variable(tf.truncated_normal([n_games, n_features], mean = 0, stddev = 0.2))
conf_alpha = tf.Variable(tf.random_uniform([1], 0, 1))

In [31]:
#user_bias = np.random.normal(scale = 0.1, size = n_users).reshape(n_users, -1)
user_bias = tf.Variable(tf.truncated_normal([n_users, 1], stddev = 0.2))
X_plus_bias = tf.concat([X, 
                         #tf.convert_to_tensor(user_bias, dtype = tf.float32),
                         user_bias,
                         tf.ones((n_users, 1), dtype = tf.float32)], axis = 1)

In [32]:
#item_bias = np.random.normal(scale = 0.1, size = n_games).reshape(n_games, -1)
item_bias = tf.Variable(tf.truncated_normal([n_games, 1], stddev = 0.2))
Y_plus_bias = tf.concat([Y, 
                         #tf.convert_to_tensor(item_bias, dtype = tf.float32),
                         tf.ones((n_games, 1), dtype = tf.float32),
                         item_bias],
                         axis = 1)
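
Appending `user_bias` and a column of ones to `X`, and a column of ones and `item_bias` to `Y` (in that mirrored order), lets a single matrix product add both bias terms to every user-game score. A small NumPy check of that identity, with made-up sizes and random values standing in for the TensorFlow variables:

```python
import numpy as np

rng = np.random.RandomState(0)
n_u, n_g, n_f = 4, 5, 3  # toy sizes, not the notebook's
X_toy = rng.normal(size=(n_u, n_f))
Y_toy = rng.normal(size=(n_g, n_f))
b_user = rng.normal(size=(n_u, 1))
b_item = rng.normal(size=(n_g, 1))

# Mirror the column order used above: [X, user_bias, 1] and [Y, 1, item_bias].
X_aug = np.concatenate([X_toy, b_user, np.ones((n_u, 1))], axis=1)
Y_aug = np.concatenate([Y_toy, np.ones((n_g, 1)), b_item], axis=1)

pred_aug = X_aug.dot(Y_aug.T)
# Broadcasting adds b_user down the rows and b_item across the columns.
pred_explicit = X_toy.dot(Y_toy.T) + b_user + b_item.T

print(np.allclose(pred_aug, pred_explicit))  # True
```

The order matters: the user bias column has to line up with the ones column on the item side, and vice versa, otherwise the product would multiply the two biases together.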

In [33]:
pred_pref = tf.matmul(X_plus_bias, Y_plus_bias, transpose_b=True)
conf = 1 + conf_alpha * interactions

In [34]:
cost = tf.reduce_sum(tf.multiply(conf, tf.square(tf.subtract(pref, pred_pref))))
l2_sqr = tf.nn.l2_loss(X) + tf.nn.l2_loss(Y) + tf.nn.l2_loss(user_bias) + tf.nn.l2_loss(item_bias)
lambda_c = 0.01
loss = cost + lambda_c * l2_sqr
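
Read together, the cells above build a confidence-weighted squared error: each entry's error is scaled by `1 + conf_alpha * interactions`, and an L2 penalty on the factors and biases is added. The NumPy sketch below restates that objective on a toy example; the function name `weighted_mf_loss` and all toy values are my own assumptions, and it is meant as a reading aid rather than a replacement for the TensorFlow graph.

```python
import numpy as np

def weighted_mf_loss(pref, interactions, X, Y, b_user, b_item, alpha, lam):
    # Predicted preference: x_u . y_i + b_u + b_i for every (user, game) pair.
    pred = X.dot(Y.T) + b_user + b_item.T
    # Confidence grows linearly with the interaction strength.
    conf = 1.0 + alpha * interactions
    cost = np.sum(conf * (pref - pred) ** 2)
    # tf.nn.l2_loss(t) is sum(t ** 2) / 2, hence the factor 0.5.
    l2 = 0.5 * sum(np.sum(t ** 2) for t in (X, Y, b_user, b_item))
    return cost + lam * l2

# Toy check: with all-zero factors and biases every prediction is 0,
# so the loss is just the summed confidence of the owned entries.
toy_pref = np.array([[1., 0.], [0., 1.]])
toy_inter = np.array([[3., 0.], [0., 1.]])
toy_loss = weighted_mf_loss(toy_pref, toy_inter,
                            np.zeros((2, 2)), np.zeros((2, 2)),
                            np.zeros((2, 1)), np.zeros((2, 1)),
                            alpha=0.5, lam=0.01)
print(toy_loss)  # 4.0
```

In the toy check the two owned entries have confidence 1 + 0.5 × 3 = 2.5 and 1 + 0.5 × 1 = 1.5, which sum to 4.0. One difference from the sketch: in the notebook `conf_alpha` is a trainable variable, whereas here `alpha` is a fixed argument.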

In [35]:
lr = 0.05
optimize = tf.train.AdagradOptimizer(learning_rate = lr).minimize(loss)

In [36]:
def top_k_precision(pred, mat, k, user_idx):
    precisions = []
    
    for user in user_idx:
        rec = np.argsort(-pred[user, :])
        
        top_k = rec[:k]
        labels = mat[user, :].nonzero()[0]
        
        precision = len(set(top_k) & set(labels)) / float(k)
        precisions.append(precision)
    return np.mean(precisions)
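
A tiny worked example may make the metric more concrete. The function below is the same as the one above, repeated so the sketch runs on its own; the scores and held-out purchases are made up.

```python
import numpy as np

def top_k_precision(pred, mat, k, user_idx):
    # Fraction of each user's k highest-scored games that appear in mat,
    # averaged over the given users.
    precisions = []
    for user in user_idx:
        top_k = np.argsort(-pred[user, :])[:k]
        labels = mat[user, :].nonzero()[0]
        precisions.append(len(set(top_k) & set(labels)) / float(k))
    return np.mean(precisions)

# Hypothetical scores for 2 users over 5 games.
toy_pred = np.array([[0.9, 0.1, 0.8, 0.3, 0.2],
                     [0.2, 0.7, 0.1, 0.6, 0.5]])
# Held-out purchases: user 0 bought games 0 and 3, user 1 bought games 1 and 3.
toy_held_out = np.array([[1., 0., 0., 1., 0.],
                         [0., 1., 0., 1., 0.]])

# User 0's top 2 are games 0 and 2 (one hit); user 1's are games 1 and 3 (two hits).
print(top_k_precision(toy_pred, toy_held_out, 2, [0, 1]))  # 0.75
```

Note that the function ranks all games, including ones the user already owns in the training matrix. Those games can occupy top-k slots when the function is scored against the validation or test matrix, whereas the final recommendation cell in this notebook filters out the purchase history first.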

In [37]:
iterations = 80
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for i in range(iterations):
        sess.run(optimize, feed_dict = {pref: train_matrix,
                                        interactions: user_game_interactions})
        
        if i % 10 == 0:
            mod_loss = sess.run(loss, feed_dict = {pref: train_matrix,
                                                   interactions: user_game_interactions})            
            mod_pred = pred_pref.eval()
            #train_precision = k_precision(mod_pred, train_matrix, test_matrix, k, test_users_idx, training = True)
            #test_precision = k_precision(mod_pred, train_matrix, test_matrix, k, test_users_idx, training = False)
            train_precision = top_k_precision(mod_pred, train_matrix, k, val_users_idx)
            val_precision = top_k_precision(mod_pred, val_matrix, k, val_users_idx)
            print('Iterations {0}...'.format(i),
                  'Training Loss {:.2f}...'.format(mod_loss),
                  'Train Precision {:.3f}...'.format(train_precision),
                  'Val Precision {:.3f}'.format(val_precision)
                )

    rec = pred_pref.eval()
    test_precision = top_k_precision(rec, test_matrix, k, test_users_idx)
    print('Test Precision {:.3f}'.format(test_precision))

Iterations 0... Training Loss 4248747.50... Train Precision 0.021... Val Precision 0.004
Iterations 10... Training Loss 502785.81... Train Precision 0.303... Val Precision 0.036
Iterations 20... Training Loss 276927.94... Train Precision 0.369... Val Precision 0.036
Iterations 30... Training Loss 220119.69... Train Precision 0.411... Val Precision 0.042
Iterations 40... Training Loss 186433.84... Train Precision 0.446... Val Precision 0.046
Iterations 50... Training Loss 161255.27... Train Precision 0.470... Val Precision 0.053
Iterations 60... Training Loss 139338.53... Train Precision 0.491... Val Precision 0.060
Iterations 70... Training Loss 117164.41... Train Precision 0.508... Val Precision 0.065
Test Precision 0.064


In [None]:
users = np.random.choice(test_users_idx, size = 10, replace = False)
rec_games = np.argsort(-rec)

In [None]:
for user in users:
    print('Recommended Games for {0} are ...'.format(idx2user[user]))
    purchase_history = np.where(train_matrix[user, :] != 0)[0]
    recommendations = rec_games[user, :]

    
    new_recommendations = recommendations[~np.in1d(recommendations, purchase_history)][:k]
    
    #ground_truth = clean_df[clean_df['UserID'] == idx2user[user]]['gameIdx'].values
    print('User bought these games')
    print(', '.join([idx2game[purchase] for purchase in purchase_history.tolist()]))
    print('\n')
    print('We recommend these games')
    print(', '.join([idx2game[game] for game in new_recommendations]))
    print('\n')
    print('The games that the user actually purchased are ...')
    print(', '.join([idx2game[game] for game in np.where(test_matrix[user, :] != 0)[0]]))
    print('\n')
    print('Precision of {0}'.format(len(set(new_recommendations) & set(np.where(test_matrix[user, :] != 0)[0])) / float(k)))
    print('--------------------------------------')
    print('\n')