# RippleNet on MovieLens using Wikidata (Python, GPU)

In this example, we will walk through each step of the RippleNet algorithm.
RippleNet is an end-to-end framework that naturally incorporates knowledge graphs into recommender systems.
To make the results of the paper reproducible, we use MovieLens as our dataset and Wikidata as our knowledge graph.

> RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems
> Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, Minyi Guo
> The 27th ACM International Conference on Information and Knowledge Management (CIKM 2018)

Online code of RippleNet: https://github.com/hwwang55/RippleNet

## Introduction

To address the sparsity and cold start problem of collaborative filtering, researchers usually make use of side information, such as social networks or item attributes, to improve recommendation performance. This paper considers the knowledge graph as the source of side information. To address the limitations of existing embedding-based and path-based methods for knowledge-graph-aware recommendation, we propose RippleNet, an end-to-end framework that naturally incorporates the knowledge graph into recommender systems. Similar to actual ripples propagating on the water, RippleNet stimulates the propagation of user preferences over the set of knowledge entities by automatically and iteratively extending a user’s potential interests along links in the knowledge graph. The multiple "ripples" activated by a user’s historically clicked items are thus superposed to form the preference distribution of the user with respect to a candidate item, which could be used for predicting the final clicking probability. Through extensive experiments on real-world datasets, we demonstrate that RippleNet achieves substantial gains in a variety of scenarios, including movie, book and news recommendation, over several state-of-the-art baselines.

![RippleNet framework](https://github.com/hwwang55/RippleNet/raw/master/framework.jpg)

The overall framework of RippleNet. It takes one user and one item as input, and outputs the predicted probability that the user will click the item. The KGs in the upper part illustrate the corresponding ripple sets activated by the user’s click history.
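
For reference, the propagation pictured above can be written compactly. The notation follows the cited paper as best recalled, so treat it as a summary rather than a verbatim quotation: $\mathbf{v}$ is the item embedding, $\mathbf{R}_i$ is the relation matrix and $\mathbf{h}_i$, $\mathbf{t}_i$ are the head and tail embeddings of the $i$-th triple in the 1-hop ripple set $S^{1}_{u}$ (ripple sets are defined later in this notebook), and $\sigma$ is the sigmoid function.

$$p_i = \mathrm{softmax}\left(\mathbf{v}^{\top} \mathbf{R}_i \mathbf{h}_i\right), \qquad \mathbf{o}^{1}_{u} = \sum_{(h_i, r_i, t_i) \in S^{1}_{u}} p_i\, \mathbf{t}_i$$

$$\mathbf{u} = \mathbf{o}^{1}_{u} + \mathbf{o}^{2}_{u} + \dots + \mathbf{o}^{H}_{u}, \qquad \hat{y}_{uv} = \sigma\left(\mathbf{u}^{\top} \mathbf{v}\right)$$

Each hop repeats the first line with the previous hop's response in place of $\mathbf{v}$, and the sum of responses serves as the user representation used to score the candidate item.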

## Implementation
Details of the Python implementation can be found [here](https://github.com/microsoft/recommenders/tree/rippleNet/reco_utils/recommender/ripplenet). The implementation is based on the original code of RippleNet: https://github.com/hwwang55/RippleNet

## RippleNet Movie Recommender

In [1]:
import sys
sys.path.append("../../")
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import papermill as pm

from reco_utils.common.timer import Timer
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_stratified_split
from reco_utils.recommender.ripplenet.preprocess import (read_item_index_to_entity_id_file, 
                                         convert_rating, 
                                         convert_kg)
from reco_utils.recommender.ripplenet.data_loader import load_kg, get_ripple_set
from reco_utils.recommender.ripplenet.model import RippleNet
from reco_utils.evaluation.python_evaluation import auc, precision_at_k, recall_at_k

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("Tensorflow version: {}".format(tf.__version__))



System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
Pandas version: 0.25.3
Tensorflow version: 1.12.0


In [2]:
# Select MovieLens data size: 100k, 1M, 10M
MOVIELENS_DATA_SIZE = '100k'
rating_threshold = 4 #Minimum rating of a movie to be considered positive

# Ripple parameters
n_epoch = 10 #the number of epochs
batch_size = 1024 #batch size
dim = 16 #dimension of entity and relation embeddings
n_hop = 2 #maximum hops
kge_weight = 0.01 #weight of the KGE term
l2_weight = 1e-7 #weight of the l2 regularization term
lr = 0.02 #learning rate
n_memory = 32 #size of ripple set for each hop
item_update_mode = 'plus_transform' #how to update item at the end of each hop. 
                                    #possible options are replace, plus, plus_transform or replace_transform
using_all_hops = True #whether using outputs of all hops or just the last hop when making prediction
optimizer_method = "adam" #optimizer method from adam, adadelta, adagrad, ftrl (FtrlOptimizer),
                          #gd (GradientDescentOptimizer), rmsprop (RMSPropOptimizer)
show_loss = False #whether or not to show the loss
seed = 12

#Evaluation parameters
TOP_K = 10


## Read original data and transform entity IDs to numerical IDs

RippleNet is built on:
- Ratings from users on Movies
- Knowledge Graph (KG) linking Movies to their connected entities in Wikidata. See [this notebook](https://github.com/microsoft/recommenders/blob/master/notebooks/01_prepare_data/wikidata_knowledge_graph.ipynb)

In [3]:
ratings_original = movielens.load_pandas_df(MOVIELENS_DATA_SIZE,
                              ('UserId', 'ItemId', 'Rating', 'Timestamp'),
                             title_col='Title',
                             genres_col='Genres',
                             year_col='Year')
ratings_original.head(3)



| | UserId | ItemId | Rating | Timestamp | Title | Genres | Year |
|---|---|---|---|---|---|---|---|
| 0 | 196 | 242 | 3.0 | 881250949 | Kolya (1996) | Comedy | 1996 |
| 1 | 63 | 242 | 3.0 | 875747190 | Kolya (1996) | Comedy | 1996 |
| 2 | 226 | 242 | 5.0 | 883888671 | Kolya (1996) | Comedy | 1996 |


In [4]:
kg_original = pd.read_csv("https://recodatasets.blob.core.windows.net/wikidata/movielens_{}_wikidata.csv".format(MOVIELENS_DATA_SIZE))
kg_original.head(3)

| | original_entity | linked_entities | name_linked_entities | movielens_title | movielens_id |
|---|---|---|---|---|---|
| 0 | Q1141186 | Q130232 | drama film | Kolya (1996) | 242 |
| 1 | Q1141186 | Q157443 | comedy film | Kolya (1996) | 242 |
| 2 | Q1141186 | Q10819887 | Andrei Chalimon | Kolya (1996) | 242 |


To link the ratings and KG IDs, we create two dictionaries that map the original KG IDs to homogeneous numerical IDs. This is done in two steps:
1. Transform both the rating IDs and the KG IDs to numerical IDs
2. Match the IDs using a dictionary
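
The two steps can be sketched on toy data in plain Python. This is a minimal illustration, not the notebook's pandas code: `build_entity_index` is a hypothetical helper, the first three pairs come from the KG preview above, and the last pair (with the placeholder ID `Q_HYPOTHETICAL`) is invented to show an entity that appears in both columns. Taking the union of the two columns gives such an entity a single numerical ID.

```python
def build_entity_index(pairs):
    """Map every distinct entity ID (head or tail) to a contiguous integer."""
    entities = sorted({h for h, _ in pairs} | {t for _, t in pairs})
    return {entity: idx for idx, entity in enumerate(entities)}


# (original_entity, linked_entity) pairs. The last row is hypothetical.
pairs = [
    ("Q1141186", "Q130232"),
    ("Q1141186", "Q157443"),
    ("Q1141186", "Q10819887"),
    ("Q_HYPOTHETICAL", "Q1141186"),
]
entity_index = build_entity_index(pairs)

# Step 1: rewrite both columns with numerical IDs (single relation type, 1).
kg_numeric = [(entity_index[h], 1, entity_index[t]) for h, t in pairs]

# Step 2: link MovieLens item IDs to the same numerical IDs.
item_to_wikidata = {242: "Q1141186"}
item_to_entity_toy = {
    item: entity_index[qid] for item, qid in item_to_wikidata.items()
}

print(entity_index)
print(kg_numeric)
print(item_to_entity_toy)
```

Sorting the entity IDs before enumerating them is a design choice made here so that the mapping is the same on every run, since the iteration order of a Python set of strings can vary between runs.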

In [5]:
def transform_id(df, entities_id, col_transform, col_name = "unified_id"):
    df = df.merge(entities_id, left_on = col_transform, right_on = "entity")
    df = df.rename(columns = {"unified_id": col_name})
    return df.drop(columns = [col_transform, "entity"])

In [6]:
# Create Dictionary that matches KG Wikidata ID to internal numerical KG ID
entities_id = pd.DataFrame({"entity":list(set(kg_original.original_entity)) + list(set(kg_original.linked_entities))}).reset_index()
entities_id = entities_id.rename(columns = {"index": "unified_id"})
entities_id.head(3)

| | unified_id | entity |
|---|---|---|
| 0 | 0 | Q509628 |
| 1 | 1 | Q4984790 |
| 2 | 2 | Q2463968 |


In [7]:
# Transforming KG IDs to internal numerical KG IDs created above 
kg = kg_original[["original_entity", "linked_entities"]].drop_duplicates()
kg = transform_id(kg, entities_id, "original_entity", "original_entity_id")
kg = transform_id(kg, entities_id, "linked_entities", "linked_entities_id")
kg["relation"] = 1
kg_wikidata = kg[["original_entity_id","relation", "linked_entities_id"]]
kg_wikidata.head(3)

| | original_entity_id | relation | linked_entities_id |
|---|---|---|---|
| 0 | 1072 | 1 | 6449 |
| 1 | 10417 | 1 | 6449 |
| 2 | 1378 | 1 | 6449 |


In [8]:
# Create Dictionary matching Movielens ID to internal numerical KG ID created above
var_id = "movielens_id"
item_to_entity = kg_original[[var_id, "original_entity"]].drop_duplicates().reset_index().drop(columns = "index")
item_to_entity = transform_id(item_to_entity, entities_id, "original_entity")
item_to_entity.head(3)

| | movielens_id | unified_id |
|---|---|---|
| 0 | 242 | 1072 |
| 1 | 242 | 10417 |
| 2 | 302 | 1378 |


In [9]:
vars_movielens = ["UserId", "ItemId", "Rating", "Timestamp"]
ratings = ratings_original[vars_movielens].sort_values(vars_movielens[1])

## Preprocess module from RippleNet

The dictionaries created above are applied to the ratings and KG dataframes to unify their IDs. The ratings are also converted from a numerical rating (1-5) to a binary rating (0-1) using `rating_threshold`.
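
The thresholding idea can be sketched as follows. This is plain thresholding only, using a hypothetical helper `binarize_ratings`. The library's `convert_rating` may do more: the training preview later in this notebook contains rows with `original_rating` 0.0 labeled 0, which suggests that unobserved items are also used as negatives.

```python
def binarize_ratings(ratings, threshold):
    """Label a rating 1 if it reaches the threshold, otherwise 0."""
    return [
        (user, item, 1 if rating >= threshold else 0)
        for user, item, rating in ratings
    ]


# (UserId, ItemId, Rating) rows taken from the ratings preview above.
toy_ratings = [(196, 242, 3.0), (63, 242, 3.0), (226, 242, 5.0)]
binary_labels = binarize_ratings(toy_ratings, threshold=4)
print(binary_labels)
```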

In [10]:
# Use dictionary Movielens ID - numerical KG ID to extract two dictionaries to be used on Ratings and KG
item_index_old2new, entity_id2index = read_item_index_to_entity_id_file(item_to_entity)

In [11]:
ratings_final = convert_rating(ratings, item_index_old2new = item_index_old2new,
                               threshold = rating_threshold, seed = 12)

INFO:reco_utils.recommender.ripplenet.preprocess:converting rating file ...
INFO:reco_utils.recommender.ripplenet.preprocess:number of users: 942
INFO:reco_utils.recommender.ripplenet.preprocess:number of items: 1677


In [12]:
kg_final = convert_kg(kg_wikidata, entity_id2index = entity_id2index)

INFO:reco_utils.recommender.ripplenet.preprocess:converting kg file ...
INFO:reco_utils.recommender.ripplenet.preprocess:number of entities (containing items): 22994
INFO:reco_utils.recommender.ripplenet.preprocess:number of relations: 1


## Split Data and Build RippleSet

The data is divided into train, test and evaluation sets with a 60/20/20 ratio, stratified by user.
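
A per-user stratified split can be sketched as below. This is an illustration of the idea under simple assumptions, not the implementation of `python_stratified_split`: `stratified_split` is a hypothetical helper that shuffles each user's rows and cuts them at the cumulative ratios, so every user contributes to each split in roughly the requested proportions.

```python
import random
from collections import defaultdict


def stratified_split(rows, ratios, user_col=0, seed=12):
    """Split rows into len(ratios) lists, applying the ratios per user."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_col]].append(row)

    splits = [[] for _ in ratios]
    for user_rows in by_user.values():
        rng.shuffle(user_rows)
        n = len(user_rows)
        start, cumulative = 0, 0.0
        for i, ratio in enumerate(ratios):
            cumulative += ratio
            # The last split takes whatever remains to avoid rounding gaps.
            end = n if i == len(ratios) - 1 else round(cumulative * n)
            splits[i].extend(user_rows[start:end])
            start = end
    return splits


# Toy data: 2 users with 10 (user, item) rows each.
toy_rows = [(user, item) for user in (0, 1) for item in range(10)]
train, test, evaluation = stratified_split(toy_rows, [0.6, 0.2, 0.2])
print(len(train), len(test), len(evaluation))
```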

In [13]:
train_data, test_data, eval_data = python_stratified_split(ratings_final, ratio=[0.6, 0.2, 0.2], col_user='user_index', col_item='item', seed=12)

In [14]:
train_data.head()

| | user_index | item | rating | original_rating |
|---|---|---|---|---|
| 129 | 0 | 3281 | 0 | 0.0 |
| 231 | 0 | 1407 | 0 | 0.0 |
| 52 | 0 | 461 | 1 | 4.0 |
| 229 | 0 | 3273 | 0 | 0.0 |
| 250 | 0 | 2007 | 0 | 0.0 |


The KG dataframe is transformed into a dictionary, and the numbers of entities and relations are extracted as parameters.

In [15]:
n_entity, n_relation, kg = load_kg(kg_final)

INFO:reco_utils.recommender.ripplenet.data_loader:reading KG file ...


The ripple set dictionary is built from the positive ratings (relevant entities) in the training data: for each user, the KG is used to collect a set of knowledge triples for each hop, up to `n_hop`.

**Relevant entity**: Given the interaction matrix $Y$ and the knowledge graph $G$, the set of $k$-hop relevant entities for user $u$ is defined as

$$E^{k}_{u} = \{t\ |\ (h,r,t) \in G\ \text{and}\ h \in E^{k-1}_{u}\}, \quad k=1,2,...,H$$

where $E^{0}_{u} = V_{u} = \{v\ |\ y_{uv} = 1\}$ is the set of items the user clicked in the past, which can be seen as the seed set of user $u$ in the KG.

**RippleSet**: The $k$-hop ripple set of user $u$ is defined as the set of knowledge triples starting from $E^{k-1}_{u}$:

$$S^{k}_{u} = \{(h,r,t)\ |\ (h,r,t) \in G\ \text{and}\ h \in E^{k-1}_{u}\}, \quad k = 1,2,...,H$$
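
The two definitions translate directly into a short loop. The sketch below is a literal reading of the formulas on a toy graph, with a hypothetical helper `build_ripple_sets`. It is not the library's `get_ripple_set`, which also takes an `n_memory` argument in this notebook, so it presumably fixes the number of triples kept per hop. That step is omitted here.

```python
def build_ripple_sets(kg, seeds, n_hop):
    """Return [S^1_u, ..., S^H_u] for one user.

    kg maps a head entity to a list of (relation, tail) pairs;
    seeds is the user's set of clicked items, i.e. E^0_u.
    """
    ripple_sets = []
    frontier = set(seeds)  # E^{k-1}_u, starting with E^0_u
    for _ in range(n_hop):
        hop = [
            (head, relation, tail)
            for head in sorted(frontier)
            for relation, tail in kg.get(head, [])
        ]
        ripple_sets.append(hop)             # S^k_u
        frontier = {t for _, _, t in hop}   # E^k_u: tails of S^k_u
    return ripple_sets


# Toy KG with a single relation type (1), as in this notebook.
toy_kg = {
    0: [(1, 10), (1, 11)],
    10: [(1, 20)],
    11: [(1, 20), (1, 21)],
}
hops = build_ripple_sets(toy_kg, seeds=[0], n_hop=2)
print(hops[0])
print(hops[1])
```

With the seed item 0, the first hop contains the triples leaving item 0, and the second hop contains the triples leaving entities 10 and 11.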

In [16]:
user_history_dict = train_data.loc[train_data.rating == 1].groupby('user_index')['item'].apply(list).to_dict()
ripple_set = get_ripple_set(kg, user_history_dict, n_hop=n_hop, n_memory=n_memory)

INFO:reco_utils.recommender.ripplenet.data_loader:constructing ripple set ...


## Build model and predict

In [17]:
ripple = RippleNet(dim=dim,n_hop=n_hop,
                   kge_weight=kge_weight, l2_weight=l2_weight, lr=lr,
                   n_memory=n_memory,
                   item_update_mode=item_update_mode, using_all_hops=using_all_hops,
                   n_entity=n_entity,n_relation=n_relation,
                   optimizer_method=optimizer_method,
                   seed=seed)

In [18]:
with Timer() as train_time:
    ripple.fit(n_epoch=n_epoch, batch_size=batch_size,
               train_data=train_data[["user_index", "item", "rating"]].to_numpy(), 
               ripple_set=ripple_set,
               show_loss=show_loss)

print("Took {} seconds for training.".format(train_time.interval))

InvalidArgumentError: indices[6,12] = 22991 is not in [0, 22908)
	 [[node embedding_lookup_6 (defined at ../../reco_utils/recommender/ripplenet/model.py:159)  = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_DOUBLE, _class=["loc:@Adam/Assign_1"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](entity_emb_matrix/read, _arg_memories_t_1_0_7, embedding_lookup/axis)]]

Caused by op 'embedding_lookup_6', defined at:
  File "<ipython-input-17-b8b001ee6b13>", line 7, in <module>
    seed=seed)
  File "../../reco_utils/recommender/ripplenet/model.py", line 74, in __init__
    self._build_model()
  File "../../reco_utils/recommender/ripplenet/model.py", line 159, in _build_model
    tf.nn.embedding_lookup(self.entity_emb_matrix, self.memories_t[i])
  File "/data/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 313, in embedding_lookup
    transform_fn=None)
  File "/data/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/ops/embedding_ops.py", line 133, in _embedding_lookup_and_transform
    result = _clip(array_ops.gather(params[0], ids, name=name),
  File "/data/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2675, in gather
    return gen_array_ops.gather_v2(params, indices, axis, name=name)
  File "/data/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3332, in gather_v2
    "GatherV2", params=params, indices=indices, axis=axis, name=name)
  File "/data/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/data/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/data/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/data/anaconda/envs/reco_base/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[6,12] = 22991 is not in [0, 22908)
	 [[node embedding_lookup_6 (defined at ../../reco_utils/recommender/ripplenet/model.py:159)  = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_DOUBLE, _class=["loc:@Adam/Assign_1"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](entity_emb_matrix/read, _arg_memories_t_1_0_7, embedding_lookup/axis)]]
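
The error above reports that an entity index (22991) was looked up in an embedding table with only 22908 rows. A check along the following lines, run before training, can surface such a mismatch earlier. It is only a sketch: `check_embedding_size` and the toy triples are hypothetical, and the root cause in this notebook is not established here. One general pitfall it illustrates is that counting distinct entities gives a valid table size only when the indices are contiguous from 0.

```python
def check_embedding_size(triples, n_entity):
    """Raise if any head or tail index falls outside an n_entity-row table."""
    largest = max(max(head, tail) for head, _, tail in triples)
    if largest >= n_entity:
        raise ValueError(
            "entity index {} is not in [0, {})".format(largest, n_entity)
        )
    return largest


# Toy (head, relation, tail) triples with non-contiguous indices.
toy_triples = [(0, 1, 3), (3, 1, 7)]

# Only 3 distinct entities appear, but the largest index is 7,
# so the embedding table needs at least 8 rows.
n_distinct = len({e for h, _, t in toy_triples for e in (h, t)})
print(n_distinct)
print(check_embedding_size(toy_triples, n_entity=8))
```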


In [None]:
with Timer() as test_time:
    labels, scores = ripple.predict(batch_size=batch_size, 
                                    data=test_data[["user_index", "item", "rating"]].to_numpy())
    predictions = [1 if i >= 0.5 else 0 for i in scores]
    
print("Took {} seconds for prediction.".format(test_time.interval))

If you need to re-create the RippleNet model, first reset the TensorFlow graph:
```python
tf.reset_default_graph()
```

## Results and Evaluation

In [None]:
test_data['scores'] = scores

In [None]:
auc_score = auc(test_data, test_data, 
            col_user="user_index",
            col_item="item",
            col_rating="rating",
            col_prediction="scores")
print("The auc score is {}".format(auc_score))

In [None]:
acc_score = np.mean(np.equal(predictions, labels))
print("The acc score is {}".format(acc_score))

In [None]:
precision_k_score = precision_at_k(test_data, test_data, 
            col_user="user_index",
            col_item="item",
            col_rating="original_rating",
            col_prediction="scores",
            relevancy_method="top_k",
            k=TOP_K)
print("The precision_k_score score at k = {}, is {}".format(TOP_K, precision_k_score))

In [None]:
recall_k_score = recall_at_k(test_data, test_data, 
            col_user="user_index",
            col_item="item",
            col_rating="original_rating",
            col_prediction="scores",
            relevancy_method="top_k",
            k=TOP_K)
print("The recall_k_score score at k = {}, is {}".format(TOP_K, recall_k_score))
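
For intuition about the two ranking metrics, the sketch below computes precision and recall at k on toy data. It is a minimal illustration with a hypothetical helper `precision_recall_at_k`. The library's `precision_at_k` and `recall_at_k` may define relevance and averaging differently.

```python
def precision_recall_at_k(scored, relevant, k):
    """Average precision@k and recall@k over users with relevant items.

    scored maps user -> list of (item, score);
    relevant maps user -> set of relevant items.
    """
    precisions, recalls = [], []
    for user, items in scored.items():
        rel = relevant.get(user, set())
        if not rel:
            continue  # a user without relevant items is skipped
        ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
        top_k = [item for item, _ in ranked[:k]]
        hits = sum(1 for item in top_k if item in rel)
        precisions.append(hits / k)
        recalls.append(hits / len(rel))
    if not precisions:
        return 0.0, 0.0
    n_users = len(precisions)
    return sum(precisions) / n_users, sum(recalls) / n_users


toy_scores = {
    1: [("a", 0.9), ("b", 0.8), ("c", 0.1)],
    2: [("a", 0.2), ("b", 0.7)],
}
toy_relevant = {1: {"a", "c"}, 2: {"b"}}
precision, recall = precision_recall_at_k(toy_scores, toy_relevant, k=2)
print(precision, recall)
```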

In [None]:
# Record results with papermill for tests - ignore this cell
pm.record("auc", auc_score)
pm.record("precision", precision_k_score)
pm.record("recall", recall_k_score)
pm.record("train_time", train_time.interval)
pm.record("test_time", test_time.interval)