# News Article Recommendation based on Neural Matrix Factorisation

This is an implementation of a latent feature model for recommender systems.
The latent feature model F measures the compatibility between a user u and an item m as the dot
product of the latent feature representations of the user and the item, named a and v, respectively, each of size K^F,
so a_u for user u and v_m for item m.

The feature function is then defined as:

theta_{u,m}^F := sum_{k=1}^{K^F} (a_{u,k} * v_{m,k})
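
As a quick numerical illustration of the feature function, the following NumPy sketch scores one user-item pair and then a small batch. The function name `feature_score`, the toy vectors and the size K = 4 are assumptions made for this example only; they do not appear in the notebook code.

```python
import numpy as np

def feature_score(a_u, v_m):
    """Dot product of a user vector a_u and an item vector v_m, both of size K."""
    return float(np.sum(a_u * v_m))

# toy latent vectors with K = 4 (illustrative values)
a_u = np.array([0.5, -1.0, 0.0, 2.0])
v_m = np.array([1.0, 0.5, 3.0, 0.25])
score = feature_score(a_u, v_m)  # 0.5 - 0.5 + 0.0 + 0.5

# the same computation for a batch of users and items, one score per row
A = np.stack([a_u, a_u])
V = np.stack([v_m, np.zeros(4)])
batch_scores = np.sum(A * V, axis=1)
```

A higher score means the model considers the user and the item more compatible; the absolute value only becomes meaningful once a loss has been chosen.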

For more information on the model, see http://www.aclweb.org/anthology/N13-1008, especially Section 2.1, where the model
is defined, and Section 2.5.1, where the BPR loss is defined.

The folder also contains a small data snippet for news recommendation in Norwegian. The full dataset can be found here: http://reclab.idi.ntnu.no/dataset/ and is described in the following paper:

Gulla, J. A., Zhang, L., Liu, P., Özgöbek, Ö., & Su, X. (2017, August). The Adressa dataset for news recommendation. In Proceedings of the International Conference on Web Intelligence (pp. 1042-1048). ACM.

To run the code, you will need to install:
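- TensorFlow 1.x (the code uses `tf.contrib`, `tf.placeholder` and `tf.Session`, which TensorFlow 2 no longer provides at the top level)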
- numpy

After every run, you need to click "Kernel -> Restart & Run All"; otherwise you will get a TensorFlow error about variable sharing.

First, some data preprocessing code; skim it briefly so you roughly understand what's going on. Some less relevant parts are hidden and can be found in external Python files.

In [1]:
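from bisect import bisect
import os
import numpy as np  # used below as np.argsort
from numpy.random import choice  # used in get_feed_dicts
from preproc.map import *
from preproc.batch import *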

def read_filtered_records():
    # read filtered records and split into positive and negative training data
    # only reading some of the data; comment out the `if i == 100: break` check below to use all the data
    instances = {"userID": [], "keywords": [], "activeTime": []}
            
    with open(os.path.join("data_selected/", "selectedRecords.txt"), 'r', encoding = 'utf-8') as fin:
        i = 0
        for line in fin:
            if i == 100:  # for debugging, to make it run faster
                break

            obj_userId, obj_activeTime, obj_keywords = line.strip(" ").strip("\n").split("\t")
            
            print(line.strip("\n"))

            if obj_userId in instances["userID"]:
                ind = instances["userID"].index(obj_userId)
                ind_i = bisect(instances["activeTime"][ind], obj_activeTime)
                instances["activeTime"][ind].insert(ind_i, obj_activeTime)
                instances["keywords"][ind].insert(ind_i, obj_keywords)

            else:
                instances["userID"].append(obj_userId)
                instances["keywords"].append([obj_keywords])
                instances["activeTime"].append([obj_activeTime])

            i += 1

        
    data = {"keywords": [], "userids": [], "targets": []}
    
    # Here, the positive and negative instances are constructed based how long the user spends on a web page.
    # Assumption: articles with the longest reading time are positive instances, 
    # and those with the shortest reading time are negative instances
    for i, uID in enumerate(instances["userID"]):
        # if they've only read one article, it's hard to guess if they liked it or not based on their reading time,
        # so we skip it
        num_entries = len(instances["activeTime"][i])
        if num_entries == 1:
            continue
            
        # sort active time and other data in ascending order
        sort_ids = np.argsort([int(k) for k in instances["activeTime"][i]])
        active_time_sorted = [instances["activeTime"][i][j] for j in sort_ids]
        keys_sorted = [instances["keywords"][i][j] for j in sort_ids]
        
        #print(instances["activeTime"][i])
        #print(active_time_sorted)
        #print()
            
        # negative instances -- first 10% of articles in terms of reading time
        # NOTE: these are stored with target 1; mf_reader() selects them via not_equal(targets, 0)
        end_index = int((num_entries/10)+1)
        keyl = []
        for inst in keys_sorted[0:end_index]:
            keyl.extend(inst.split(","))
        
        data["keywords"].append(keyl)
        data["targets"].append(1)
        data["userids"].append(uID)
        
        print(uID, "0", keyl)  # printed label: 0 = negative instance
        
        # positive instances -- last 10% of articles in terms of reading time
        # NOTE: these are stored with target 0; mf_reader() selects them via not_equal(targets, 1)
        start_index = int((num_entries-(num_entries / 10)))
        
        keyl = []
        for inst in keys_sorted[start_index:]:
            keyl.extend(inst.split(","))
        data["keywords"].append(keyl)
        data["targets"].append(0)
        data["userids"].append(uID)
        
        print(uID, "1", keyl)  # printed label: 1 = positive instance
        print("")
    
    return data


def prepare_data(placeholders, data, vocab_keys=None, vocab_users=None):
    if 'keywords' in placeholders.keys():
        data = deep_seq_map(data, lower, ['keywords'])
    if vocab_keys is None:
        vocab_keys = Vocab()
        vocab_users = Vocab()
        if 'keywords' in placeholders.keys():
            for instance in data["keywords"]:
                for token in instance:
                    vocab_keys(token)
        for instance in data["userids"]:
            vocab_users(instance)

    if 'keywords' in placeholders.keys():
        data = deep_map(data, vocab_keys, ["keywords"])

    data = deep_map(data, vocab_users, ["userids"])

    # removing data that's not a placeholder
    popl = []
    for k in data.keys():
        if not k in placeholders.keys():
            popl.append(k)
    for p in popl:
        data.pop(p, None)

    return data, vocab_keys, vocab_users


def get_feed_dicts(data_train_np, placeholders, batch_size, inst_length):
    data_train_batched = []

    # sample so that there are an equal number of instances of each class per batch
    # (the names follow the stored target values: "pos" is target 1, "neg" is target 0)
    ids_pos = [i for i in range(0, inst_length) if data_train_np["targets"][i] == 1]
    ids_neg = [i for i in range(0, inst_length) if data_train_np["targets"][i] == 0]
    ids1_pos = choice(ids_pos, len(ids_pos), replace=False)
    ids1_neg = choice(ids_neg, len(ids_neg), replace=False)
    # interleave the two classes; zip stops at the shorter of the two lists
    ids = [val for pair in zip(ids1_pos, ids1_neg) for val in pair]

    # step through the interleaved ids; the last batch may be smaller than batch_size,
    # and no empty batch is created when len(ids) is a multiple of batch_size
    for start in range(0, len(ids), batch_size):
        ids_sup = ids[start:start + batch_size]
        batch_i = {}
        for key, value in data_train_np.items():
            batch_i[placeholders[key]] = [value[ii] for ii in ids_sup]

        data_train_batched.append(batch_i)

    return data_train_batched


def load_data(placeholders, batch_size=8):

    data = read_filtered_records()
    prepared_data, vocab_keys, vocab_users = prepare_data(placeholders, data)
    vocab_keys.freeze()  # this makes sure that nothing further is added to the vocab, otherwise deep_map will extend it
    vocab_users.freeze()
    numpified_data = numpify(prepared_data, pad=0)
    feed_dicts = get_feed_dicts(numpified_data, placeholders, batch_size, len(data["userids"]))
    return feed_dicts, vocab_keys, vocab_users



Now for the important part, the neural matrix factorisation model. Read carefully. 

In [2]:
import tensorflow as tf
import numpy as np

def mf_reader(placeholders, vocab_size_keys, vocab_size_users, emb_dim, fact_loss='BPR', lambda_L2 = 0.01):

    # [batch_size]
    userids = placeholders['userids']

    # [batch_size, max_keyw_length]
    keywords = placeholders['keywords']

    # [batch_size]
    targets = tf.cast(placeholders['targets'], tf.float32)

    init = tf.contrib.layers.xavier_initializer(uniform=True)

    with tf.variable_scope("u_embeddings", reuse=tf.AUTO_REUSE):
        user_embeddings = tf.get_variable("user_embeddings", [vocab_size_users, emb_dim], dtype=tf.float32, initializer=init)

    # separate embeddings for users and keywords
    with tf.variable_scope("k_embeddings", reuse=tf.AUTO_REUSE):
        key_embeddings = tf.get_variable("keyw_embeddings", [vocab_size_keys, emb_dim], dtype=tf.float32, initializer=init)

    with tf.variable_scope("keyw_embedders") as varscope:
        keyw_embedded = tf.nn.embedding_lookup(key_embeddings, keywords)

    with tf.variable_scope("uid_embedders") as varscope:
        uid_embedded = tf.nn.embedding_lookup(user_embeddings, userids)

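    # read_filtered_records() stores short-read (negative) instances with target 1
    # and long-read (positive) instances with target 0, hence the masks below
    where0 = tf.not_equal(targets, 0)  # True where target == 1, i.e. negative instances
    where1 = tf.not_equal(targets, 1)  # True where target == 0, i.e. positive instances
    user_embeddings_neg = tf.boolean_mask(uid_embedded, where0)
    user_embeddings_pos = tf.boolean_mask(uid_embedded, where1)
    key_embeddings_neg = tf.boolean_mask(keyw_embedded, where0)
    key_embeddings_pos = tf.boolean_mask(keyw_embedded, where1)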

    key_embeddings_pos = tf.nn.sigmoid(key_embeddings_pos)
    key_embeddings_neg = tf.nn.sigmoid(key_embeddings_neg)

    # positive and negative dot products
    # note that we have a separate embedding for every keyword; the code below sums the element-wise
    # products over both the keyword axis and the embedding axis to get one score per instance
    # for predictions (to be ranked)
    dotprod_pos = tf.reduce_sum(tf.reduce_sum(tf.multiply(key_embeddings_pos, tf.expand_dims(user_embeddings_pos, 1)), 1), 1)  
    
    # calculate normalised scores of positive examples
    sigma_dotprod_pos = tf.nn.sigmoid(dotprod_pos) 
    
    dotprod_neg = tf.reduce_sum(tf.reduce_sum(tf.multiply(key_embeddings_neg, tf.expand_dims(user_embeddings_neg, 1)), 1), 1)

    # calculate normalised scores of negative examples
    sigma_dotprod_neg = tf.nn.sigmoid(dotprod_neg)  
    
    diff_dotprod = tf.reduce_sum(tf.reduce_sum(tf.multiply(key_embeddings_pos, tf.expand_dims(user_embeddings_pos, 1)) - tf.multiply(key_embeddings_neg, tf.expand_dims(user_embeddings_neg, 1)), 1), 1)

    if fact_loss == 'logistic':
        loss_R = tf.reduce_sum(tf.nn.softplus(-dotprod_pos) + tf.nn.softplus(dotprod_neg))
    elif fact_loss == 'BPR':
        # TO IMPLEMENT (Exercise 1.1): the line below is only a placeholder that ignores the negative instances
        loss_R = tf.reduce_sum(tf.nn.softplus(-dotprod_pos))

    # L2 regularisation
    loss_L2_pos = tf.add(tf.nn.l2_loss(user_embeddings_pos), tf.nn.l2_loss(key_embeddings_pos))
    loss_L2_neg = tf.add(tf.nn.l2_loss(user_embeddings_neg), tf.nn.l2_loss(key_embeddings_neg))

    loss_L2 = tf.add(loss_L2_pos, loss_L2_neg)

    # total loss
    loss = tf.stack([loss_R, tf.scalar_mul(lambda_L2, loss_L2)])

    return sigma_dotprod_pos, sigma_dotprod_neg, loss

Now, the main training routine.

In [3]:
keywords_pl = tf.placeholder(tf.int32, [None, None], name="keywords")
userids_pl = tf.placeholder(tf.int32, [None], name="userids")
targets = tf.placeholder(tf.int32, [None], name="targets")

def create_placeholders():
    placeholders = {"keywords": keywords_pl, "userids": userids_pl, "targets": targets}
    return placeholders

def main(model_variant="similarity", max_epochs=500):

    placeholders = create_placeholders()

    feed_dicts, vocab_keys, vocab_users = load_data(placeholders, batch_size=8)

    # Do not take up all the GPU memory all the time.
    sess_config = tf.ConfigProto()
    sess_config.gpu_options.allow_growth = True
    with tf.Session(config=sess_config) as sess:

        # create model
        dotprod_pos, dotprod_neg, loss = mf_reader(placeholders, len(vocab_keys), len(vocab_users), emb_dim=32)

        optim = tf.train.RMSPropOptimizer(learning_rate=0.0005)
        min_op = optim.minimize(tf.reduce_mean(loss))

        tf.global_variables_initializer().run(session=sess)

        for i in range(1, max_epochs + 1):
            loss_all, correct_all, correct_all_neg, total = [], 0.0, 0.0, 0.0
            for j in range(0, len(feed_dicts)):
                batch = feed_dicts[j]
                # we get the predictions for positive and negative instances on the training data
                _, current_loss, p_pos = sess.run([min_op, loss, dotprod_pos], feed_dict=batch)

                # to test on test data, this would need to be split off the training data
                # and the following line(s) would need to be run - eval as below.
                # p_test_pos = sess.run(dotprod_pos, feed_dict=batch_test)  # for positive instances

                # as for training data, each test batch needs to contain an equal number of positive and negative samples
                p_neg = sess.run(dotprod_neg, feed_dict=batch)  # for negative instances

                loss_all.extend(current_loss)
                preds = np.round(p_pos)
                hits_pos = [pp for pp in preds if pp == 1.0]  # we only care about performing well for positive instances
                preds_neg = np.round(p_neg)
                hits_neg = [pp for pp in preds_neg if pp == 0.0]
                correct_all += len(hits_pos)
                correct_all_neg += len(hits_neg)
                total += len(preds)

            # Randomise batch IDs, so that selection of batch is random
            np.random.shuffle(feed_dicts)
            acc = correct_all / total
            acc_neg = correct_all_neg / total
            print('Epoch %d :' % i, "Acc Train Pos: ", acc, "Acc Train Neg: ", acc_neg, "Loss: ", np.mean(loss_all))

In [4]:
main()

cx:lo0s6rngjd111p7dgxxvj2y6y:33hmfzes9x0yn	9	utenriks,innenriks,trondheim,E6,midtbyen,bybrann,bilulykker
cx:hzx2pn6j7vhtjyng:sckfhjlv4pdz	466	elbil,Trondheim parkering,sissel trønsdal,Trondheim
cx:2gcnj061e2zz1dwnqk8yhv2yp:2r0fe9ekqec1v	32	Vassfjellet,Innbrudd
cx:ingb68nkjywbce3d:3aizgkitiquss	331	Baby,Bioteknologi,Fødsel,Helse,Svangerskap
cx:ifa61e6grusoket6:3nbl45cgx3lz7	134	Trøndelag politidistrikt
cx:hwcel2lt7o01whx8:3m828rocxjbne	51	tog,NSB,Samferdsel,Reiseliv,debatt
cx:1wuel58my406c277lpq0k7gmy4:39vwjf2pw5138	21	utenriks,innenriks,trondheim,E6,midtbyen,bybrann,bilulykker
cx:fz50mkepskbz1bhbsonrxatwb:2bv2dnsix8gwl	114	Vassfjellet,Innbrudd
cx:ilhmumvrt41eo6wb:1uhvfttq77gh8	46	tog,NSB,Samferdsel,Reiseliv,debatt
cx:ikt31ks11f0v988r:mgiyccdm870o	91	Hund,Ulv,Verdal
cx:2hlokd1ecthx81gia99yy464xt:13b8ml7zctj4a	48	Vassfjellet,Innbrudd
cx:ijrnuuor7mbrohmg:1ey4i9ki64xt	15	utenriks,innenriks,trondheim,E6,midtbyen,bybrann,bilulykker
cx:y59y853q1amo3am73byk7angz:1ugpwe02cp5kx	91	Vassfjellet,In




Epoch 1 : Acc Train Pos:  0.43478260869565216 Acc Train Neg:  0.5652173913043478 Loss:  3.7774727
Epoch 2 : Acc Train Pos:  0.43478260869565216 Acc Train Neg:  0.5652173913043478 Loss:  3.7005336
Epoch 3 : Acc Train Pos:  0.43478260869565216 Acc Train Neg:  0.5652173913043478 Loss:  3.6268413
Epoch 4 : Acc Train Pos:  0.4782608695652174 Acc Train Neg:  0.5217391304347826 Loss:  3.5550127
Epoch 5 : Acc Train Pos:  0.4782608695652174 Acc Train Neg:  0.5217391304347826 Loss:  3.4853141
Epoch 6 : Acc Train Pos:  0.4782608695652174 Acc Train Neg:  0.5217391304347826 Loss:  3.4173307
Epoch 7 : Acc Train Pos:  0.4782608695652174 Acc Train Neg:  0.5217391304347826 Loss:  3.349758
Epoch 8 : Acc Train Pos:  0.4782608695652174 Acc Train Neg:  0.5217391304347826 Loss:  3.2841322
Epoch 9 : Acc Train Pos:  0.4782608695652174 Acc Train Neg:  0.5217391304347826 Loss:  3.2198665
Epoch 10 : Acc Train Pos:  0.4782608695652174 Acc Train Neg:  0.4782608695652174 Loss:  3.156209
Epoch 11 : Acc Train Pos:  0

Epoch 101 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0733792
Epoch 102 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0699106
Epoch 103 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0665085
Epoch 104 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0631652
Epoch 105 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0598742
Epoch 106 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.056624
Epoch 107 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0534242
Epoch 108 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0502714
Epoch 109 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0471718
Epoch 110 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0440948
Epoch 111 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0410427
Epoch 112 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0380324
Epoch 113 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0350599
Epoch 114 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  1.0321286
Epoch 115 : Acc Train Pos:  1.0 Acc

Epoch 224 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.7874062
Epoch 225 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.7855148
Epoch 226 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.7836189
Epoch 227 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.78172636
Epoch 228 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.7798516
Epoch 229 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.7779786
Epoch 230 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.7761147
Epoch 231 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.77425843
Epoch 232 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.77240473
Epoch 233 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.7705459
Epoch 234 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.76868767
Epoch 235 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.76682085
Epoch 236 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.76496214
Epoch 237 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.7631213
Epoch 238 : Acc Train Pos:  

Epoch 346 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.58580166
Epoch 347 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.58437943
Epoch 348 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.5829689
Epoch 349 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.5815661
Epoch 350 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.5801688
Epoch 351 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.57876664
Epoch 352 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.57736707
Epoch 353 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.5759759
Epoch 354 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.5745895
Epoch 355 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.573204
Epoch 356 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.571816
Epoch 357 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.5704352
Epoch 358 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.5690598
Epoch 359 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.56769055
Epoch 360 : Acc Train Pos:  1.0

Epoch 486 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.42070517
Epoch 487 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.41974625
Epoch 488 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.4187881
Epoch 489 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.41783676
Epoch 490 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.41688895
Epoch 491 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.41593626
Epoch 492 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.41498518
Epoch 493 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.41404143
Epoch 494 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.41309953
Epoch 495 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.4121599
Epoch 496 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.41122702
Epoch 497 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.4102964
Epoch 498 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.40936768
Epoch 499 : Acc Train Pos:  1.0 Acc Train Neg:  0.0 Loss:  0.40844464
Epoch 500 : Acc Train P

## Exercises
Enter your answers for 1.1 and 1.2 here: https://tinyurl.com/y5eo334s

### 1.1 BPR Loss

In the mf_reader() function, implement the BPR loss as an alternative to the logistic loss.
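
To make the difference between the two losses concrete, here is a minimal NumPy sketch, assuming the usual BPR formulation in which each positive score is compared against one negative score. The helper names and the toy scores are made up for illustration, and this is not the TensorFlow code the exercise asks for.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def logistic_loss(score_pos, score_neg):
    # pushes positive scores up and negative scores down independently
    return float(np.sum(softplus(-score_pos) + softplus(score_neg)))

def bpr_loss(score_pos, score_neg):
    # -log(sigmoid(pos - neg)) == softplus(-(pos - neg)): only the margin matters
    return float(np.sum(softplus(-(score_pos - score_neg))))

# toy scores, paired by position (illustrative values)
score_pos = np.array([2.0, 0.5, -1.0])
score_neg = np.array([1.0, 0.5, 1.0])
print(bpr_loss(score_pos, score_neg))
# shifting both scores by the same constant leaves the BPR loss unchanged
print(bpr_loss(score_pos + 10.0, score_neg + 10.0))
```

In mf_reader(), the tensor `diff_dotprod` already holds a per-pair score difference and could play the role of `score_pos - score_neg`. Note that it pairs the i-th positive with the i-th negative instance of a batch, and these do not necessarily belong to the same user.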

### 1.2 Alternative Negative Sampling

Re-visit the negative sampling strategy implemented in the function read_filtered_records(). Think about what a good alternative or refinement to it would be and implement it.
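
One possible direction, sketched here under stated assumptions, is to treat articles a user never opened as negatives instead of relying on short reading times. The function `sample_unread_negatives` and the `read_by_user` mapping are hypothetical; the data snippet identifies articles only by their keyword string, so the article ids below are stand-ins.

```python
import random

def sample_unread_negatives(read_by_user, num_neg=1, seed=0):
    """For each user, sample articles the user never opened as negatives.

    read_by_user maps a user id to the set of article ids the user has read
    (a hypothetical structure, not the notebook's `instances` dict).
    """
    rng = random.Random(seed)
    all_articles = sorted(set().union(*read_by_user.values()))
    negatives = {}
    for user, read in read_by_user.items():
        unread = [a for a in all_articles if a not in read]
        k = min(num_neg, len(unread))  # a user may have read almost everything
        negatives[user] = rng.sample(unread, k)
    return negatives

read_by_user = {
    "u1": {"a1", "a2"},
    "u2": {"a2", "a3"},
    "u3": {"a4"},
}
negatives = sample_unread_negatives(read_by_user, num_neg=2)
```

An unread article is not necessarily a disliked one, so this trades one noisy assumption for another; comparing both strategies on held-out data is the only way to tell which works better here.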

### 2.1 Testing on Test Data

The above code tests on the training data. This is good for sanity checking purposes, but doesn't give real test results. Split the dataset into two parts (80%, 20%) and evaluate on the test data.
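
A minimal sketch of such a split, assuming it is done on instance indices before batching. The function `stratified_split` is a hypothetical helper, not part of the notebook; it splits each class separately so that both parts keep the balance between the two target values that get_feed_dicts() relies on.

```python
import numpy as np

def stratified_split(targets, test_fraction=0.2, seed=0):
    """Return (train_ids, test_ids), splitting each class separately."""
    rng = np.random.RandomState(seed)
    targets = np.asarray(targets)
    train_ids, test_ids = [], []
    for label in np.unique(targets):
        ids = np.flatnonzero(targets == label)
        rng.shuffle(ids)  # in-place shuffle of this class's indices
        n_test = int(round(len(ids) * test_fraction))
        test_ids.extend(ids[:n_test].tolist())
        train_ids.extend(ids[n_test:].tolist())
    return sorted(train_ids), sorted(test_ids)

targets = [1, 0] * 10  # 20 toy instances with alternating labels
train_ids, test_ids = stratified_split(targets)
```

Splitting by instance means a user can appear in both parts. A model with per-user embeddings has no trained representation for users that occur only in the test part, so it is worth checking how many such users a given split produces.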

### 2.2 Hyperparameter Tuning

Now split the data into three parts (80%, 10%, 10%), i.e. a training, development and test set. You can use the development set to test hyperparameters of your model -- embedding dimensionality, number of epochs, etc. You can also keep the number of epochs flexible and perform early stopping, i.e. stop training once the model has reached a certain accuracy or loss.
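
Early stopping can be sketched as a small helper that tracks the best development loss seen so far. The class `EarlyStopping`, its `patience` and `min_delta` parameters, and the loss values below are illustrative assumptions, not part of the notebook.

```python
class EarlyStopping:
    """Stop when the development loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, dev_loss):
        """Record one epoch; return True if training should stop."""
        if dev_loss < self.best - self.min_delta:
            self.best = dev_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# made-up development losses, for illustration only
dev_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.9]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, dev_loss in enumerate(dev_losses, start=1):
    if stopper.update(epoch, dev_loss):
        stopped_at = epoch
        break
```

In the training loop of main(), `update` would be called once per epoch with the loss measured on the development batches, and the final test evaluation would use the parameters from `best_epoch`.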

### 2.3 Training on the Full Dataset

Revisit read_filtered_records() and change the function such that it reads in the whole dataset, or at least a larger portion of it. Re-train your models and record performances. Reflect on what worked well and what didn't, and think about why that might be.
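
When reading more data, note that read_filtered_records() looks up each user with a list membership test and `list.index`, which scans all users seen so far for every record. The sketch below shows one way to group records with a dictionary instead. The function `group_by_user` and the sample lines are made up for illustration and only assume the tab-separated layout (user ID, active time, keywords) that the reader parses.

```python
from collections import defaultdict

def group_by_user(lines):
    """Group tab-separated records (userID, activeTime, keywords) by user.

    Returns {userID: [(activeTime, keywords), ...]} sorted by activeTime.
    """
    per_user = defaultdict(list)
    for line in lines:
        user_id, active_time, keywords = line.rstrip("\n").split("\t")
        per_user[user_id].append((int(active_time), keywords))
    for records in per_user.values():
        records.sort(key=lambda rec: rec[0])  # numeric, not string, ordering
    return dict(per_user)

# made-up records in the same layout as selectedRecords.txt
lines = [
    "u1\t120\tsport,fotball\n",
    "u2\t15\tpolitikk\n",
    "u1\t9\tvær\n",
]
grouped = group_by_user(lines)
```

Converting the active time to an integer before sorting avoids the string ordering in which "9" would sort after "120".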