# RippleNet on MovieLens using Wikidata (Python, GPU)

In this example, we will walk through each step of the [RippleNet](https://arxiv.org/pdf/1803.03467.pdf) algorithm.
RippleNet is an end-to-end framework that naturally incorporates knowledge graphs into recommender systems.
To reproduce the results of the paper, we use MovieLens as the dataset and Wikidata as the knowledge graph.


## Introduction

To address the sparsity and cold-start problems of collaborative filtering, researchers usually make use of side information, such as social networks or item attributes, to improve recommendation performance. The RippleNet paper considers the knowledge graph as the source of side information. To address the limitations of existing embedding-based and path-based methods for knowledge-graph-aware recommendation, the authors propose RippleNet, an end-to-end framework that naturally incorporates the knowledge graph into recommender systems.

Similar to actual ripples propagating on water, RippleNet stimulates the propagation of user preferences over the set of knowledge entities by automatically and iteratively extending a user's potential interests along links in the knowledge graph. The multiple "ripples" activated by a user's historically clicked items are superposed to form the preference distribution of the user with respect to a candidate item, which can be used to predict the final clicking probability. Through extensive experiments on real-world datasets, the authors report that RippleNet achieves substantial gains over several state-of-the-art baselines in a variety of scenarios, including movie, book and news recommendation.

![RippleNet framework](https://github.com/hwwang55/RippleNet/raw/master/framework.jpg)

The overall framework of RippleNet. It takes one user and one item as input, and outputs the predicted probability that the user will click the item. The KGs in the upper part illustrate the corresponding ripple sets activated by the user's click history.
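
The propagation step described above can be sketched for a single hop in plain Python. This is a minimal illustration that follows the formulas of the RippleNet paper (relevance $p_i = \mathrm{softmax}(v^T R_i h_i)$, response $o = \sum_i p_i t_i$, click probability $\sigma(o^T v)$). It is not the code used in this notebook, and the function names and toy embeddings are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def one_hop_response(item, heads, relations, tails):
    """Return (attention weights, response vector) for one ripple set.

    Hypothetical helper: item is the item embedding v, and heads, relations
    and tails hold the embeddings h_i, R_i and t_i of each triple.
    """
    # Relevance of each triple to the candidate item: v^T R_i h_i
    logits = [dot(item, matvec(R, h)) for R, h in zip(relations, heads)]
    weights = softmax(logits)
    # Response: tail embeddings averaged with the relevance weights
    response = [
        sum(w * t[d] for w, t in zip(weights, tails)) for d in range(len(item))
    ]
    return weights, response


def click_probability(user, item):
    """Sigmoid of the inner product between user and item embeddings."""
    return 1.0 / (1.0 + math.exp(-dot(user, item)))


# Toy 2-dimensional embeddings (made up for illustration)
item = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [[1.0, 0.0], [0.0, 1.0]]
relations = [identity, identity]
tails = [[0.5, 0.5], [-0.5, 0.5]]

weights, response = one_hop_response(item, heads, relations, tails)
prob = click_probability(response, item)
print(weights, response, prob)
```

With several hops, the paper sums the per-hop responses to form the user embedding before taking the inner product with the item embedding.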

## Implementation
Details of the Python implementation can be found [here](https://github.com/microsoft/recommenders/tree/rippleNet/reco_utils/recommender/ripplenet). It is based on the original RippleNet code: https://github.com/hwwang55/RippleNet

## RippleNet Movie Recommender

In [3]:
import sys
sys.path.append("../../")
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import papermill as pm

from reco_utils.common.timer import Timer
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_stratified_split
from reco_utils.recommender.ripplenet.preprocess import (read_item_index_to_entity_id_file, 
                                         convert_rating, 
                                         convert_kg)
from reco_utils.recommender.ripplenet.data_loader import load_kg, get_ripple_set
from reco_utils.recommender.ripplenet.model import RippleNet
from reco_utils.evaluation.python_evaluation import auc, precision_at_k, recall_at_k

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
Pandas version: 0.25.3
Tensorflow version: 1.12.0


In [4]:
# Select MovieLens data size: 100k, 1M, 10M
MOVIELENS_DATA_SIZE = '1M'
rating_threshold = 4 #Minimum rating of a movie to be considered positive

# Ripple parameters
n_epoch = 10 #the number of epochs
batch_size = 1024 #batch size
dim = 16 #dimension of entity and relation embeddings
n_hop = 2 #maximum hops
kge_weight = 0.01 #weight of the KGE term
l2_weight = 1e-7 #weight of the l2 regularization term
lr = 0.02 #learning rate
n_memory = 32 #size of ripple set for each hop
item_update_mode = 'plus_transform' #how to update item at the end of each hop. 
                                    #possible options are replace, plus, plus_transform or replace_transform
using_all_hops = True #whether using outputs of all hops or just the last hop when making prediction
optimizer_method = "adam" #optimizer method from adam, adadelta, adagrad, ftrl (FtrlOptimizer),
                          #gd (GradientDescentOptimizer), rmsprop (RMSPropOptimizer)
show_loss = False #whether or not to show the loss
seed = 12

#Evaluation parameters
TOP_K = 10


## Read original data and transform entity ids to numerical

RippleNet is built on:
- Ratings that users gave to movies
- A knowledge graph (KG) linking movies to their connected entities in Wikidata. See [this notebook](../01_prepare_data/wikidata_knowledge_graph.ipynb) for how the knowledge graph was created.

In [5]:
ratings_original = movielens.load_pandas_df(MOVIELENS_DATA_SIZE,
                              ('UserId', 'ItemId', 'Rating', 'Timestamp'),
                             title_col='Title',
                             genres_col='Genres',
                             year_col='Year')
ratings_original.head(3)

100%|██████████| 5.78k/5.78k [00:00<00:00, 17.9kKB/s]


Unnamed: 0,UserId,ItemId,Rating,Timestamp,Title,Genres,Year
0,1,1193,5.0,978300760,One Flew Over the Cuckoo's Nest (1975),Drama,1975
1,2,1193,5.0,978298413,One Flew Over the Cuckoo's Nest (1975),Drama,1975
2,12,1193,4.0,978220179,One Flew Over the Cuckoo's Nest (1975),Drama,1975


In [6]:
kg_original = pd.read_csv("https://recodatasets.blob.core.windows.net/wikidata/movielens_{}_wikidata.csv".format(MOVIELENS_DATA_SIZE))
kg_original.head(3)

Unnamed: 0,original_entity,linked_entities,name_linked_entities,movielens_title,movielens_id
0,Q857313,Q7005314,New American Library,One Flew Over the Cuckoo's Nest (1975),1193
1,Q857313,Q921536,Viking Press,One Flew Over the Cuckoo's Nest (1975),1193
2,Q857313,Q113013,postmodern literature,One Flew Over the Cuckoo's Nest (1975),1193


To link the ratings and the KG, we create two dictionaries that map the original KG IDs to homogeneous numerical IDs. This is done in two steps:
1. Transforming both Rating ID and KG ID to numerical
2. Matching the IDs using a dictionary
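
The two steps above can be sketched without pandas. This is a simplified illustration, not the notebook's code: `build_entity_index` is a hypothetical helper, and the third edge in the toy data is invented so that one entity appears both as a head and as a tail.

```python
def build_entity_index(pairs):
    """Map every distinct entity in (head, tail) pairs to a consecutive integer ID."""
    # A set union gives each entity exactly one ID, even if it is both head and tail
    entities = sorted({h for h, _ in pairs} | {t for _, t in pairs})
    return {entity: idx for idx, entity in enumerate(entities)}


# Toy Wikidata edges; the last one is hypothetical
pairs = [
    ("Q857313", "Q7005314"),
    ("Q857313", "Q921536"),
    ("Q7005314", "Q921536"),
]

# Step 1: transform Wikidata IDs to numerical IDs
entity_to_id = build_entity_index(pairs)
# The relation is fixed to 1, as in this notebook
encoded = [(entity_to_id[h], 1, entity_to_id[t]) for h, t in pairs]

# Step 2: match MovieLens IDs to the numerical KG IDs through a dictionary
movielens_to_wikidata = {1193: "Q857313"}
movielens_to_unified = {m: entity_to_id[q] for m, q in movielens_to_wikidata.items()}

print(entity_to_id)
print(encoded)
print(movielens_to_unified)
```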

In [7]:
def transform_id(df, entities_id, col_transform, col_name = "unified_id"):
    df = df.merge(entities_id, left_on = col_transform, right_on = "entity")
    df = df.rename(columns = {"unified_id": col_name})
    return df.drop(columns = [col_transform, "entity"])

In [8]:
# Create Dictionary that matches KG Wikidata ID to internal numerical KG ID
entities_id = pd.DataFrame({"entity":list(set(kg_original.original_entity)) + list(set(kg_original.linked_entities))}).reset_index()
entities_id = entities_id.rename(columns = {"index": "unified_id"})
entities_id.head(3)

Unnamed: 0,unified_id,entity
0,0,Q1503215
1,1,Q271189
2,2,Q832444


In [9]:
# Transforming KG IDs to internal numerical KG IDs created above 
kg = kg_original[["original_entity", "linked_entities"]].drop_duplicates()
kg = transform_id(kg, entities_id, "original_entity", "original_entity_id")
kg = transform_id(kg, entities_id, "linked_entities", "linked_entities_id")
kg["relation"] = 1
kg_wikidata = kg[["original_entity_id","relation", "linked_entities_id"]]
kg_wikidata.head(3)

Unnamed: 0,original_entity_id,relation,linked_entities_id
0,3357,1,22016
1,26376,1,22016
2,3357,1,12264


In [10]:
# Create Dictionary matching Movielens ID to internal numerical KG ID created above
var_id = "movielens_id"
item_to_entity = kg_original[[var_id, "original_entity"]].drop_duplicates().reset_index().drop(columns = "index")
item_to_entity = transform_id(item_to_entity, entities_id, "original_entity")
item_to_entity.head(3)

Unnamed: 0,movielens_id,unified_id
0,1193,3357
1,1193,26376
2,661,493


In [11]:
vars_movielens = ["UserId", "ItemId", "Rating", "Timestamp"]
ratings = ratings_original[vars_movielens].sort_values(vars_movielens[1])

## Preprocess module from RippleNet

The dictionaries created above are applied to the ratings and KG dataframes to unify their IDs. The ratings are also converted from a numerical rating (1-5) to a binary rating (0 or 1) using `rating_threshold`.
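
A minimal sketch of the binary conversion follows. It assumes, as in the original RippleNet preprocessing, that ratings at or above the threshold become positives and that an equal number of negatives is sampled from items the user never rated. That assumption is consistent with the label-0 rows with `original_rating` 0.0 shown later in `train_data.head()`, but `binarize_ratings` is a hypothetical helper and not the `convert_rating` function itself.

```python
import random


def binarize_ratings(ratings, all_items, threshold=4, seed=12):
    """Turn explicit ratings into (user, item, label) rows.

    ratings: dict mapping user -> {item: rating}.
    Assumption: negatives are drawn from unrated items, one per positive.
    """
    rng = random.Random(seed)
    rows = []
    for user, rated in ratings.items():
        positives = [item for item, r in rated.items() if r >= threshold]
        # Items rated below the threshold are neither positives nor negatives
        unrated = sorted(set(all_items) - set(rated))
        n_negatives = min(len(positives), len(unrated))
        negatives = rng.sample(unrated, n_negatives)
        rows += [(user, item, 1) for item in positives]
        rows += [(user, item, 0) for item in negatives]
    return rows


# Toy data: user 0 rated three items, user 1 rated two
ratings = {0: {10: 5.0, 11: 4.0, 12: 2.0}, 1: {10: 3.0, 13: 5.0}}
rows = binarize_ratings(ratings, all_items=range(10, 20))
print(rows)
```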

In [12]:
# Use dictionary Movielens ID - numerical KG ID to extract two dictionaries to be used on Ratings and KG
item_index_old2new, entity_id2index = read_item_index_to_entity_id_file(item_to_entity)

In [13]:
ratings_final = convert_rating(ratings, item_index_old2new = item_index_old2new,
                               threshold = rating_threshold, seed = 12)

INFO:reco_utils.recommender.ripplenet.preprocess:converting rating file ...
INFO:reco_utils.recommender.ripplenet.preprocess:number of users: 6038
INFO:reco_utils.recommender.ripplenet.preprocess:number of items: 3689


In [14]:
kg_final = convert_kg(kg_wikidata, entity_id2index = entity_id2index)

INFO:reco_utils.recommender.ripplenet.preprocess:converting kg file ...
INFO:reco_utils.recommender.ripplenet.preprocess:number of entities (containing items): 39915
INFO:reco_utils.recommender.ripplenet.preprocess:number of relations: 1


## Split Data and Build RippleSet

The data is divided into train, test and evaluation sets with a 60/20/20 ratio, stratified by user.
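
To illustrate what a per-user stratified split does, here is a minimal sketch in plain Python. It is a simplified stand-in, not the implementation of `python_stratified_split`; the function name and toy rows are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, ratios=(0.6, 0.2, 0.2), seed=12):
    """Split (user, item, label) rows so each user's rows are divided by `ratios`."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    splits = [[] for _ in ratios]
    for user_rows in by_user.values():
        rng.shuffle(user_rows)
        n = len(user_rows)
        start, cumulative = 0, 0.0
        for k, ratio in enumerate(ratios):
            cumulative += ratio
            # The last split takes whatever remains, so no row is lost to rounding
            end = n if k == len(ratios) - 1 else round(cumulative * n)
            splits[k].extend(user_rows[start:end])
            start = end
    return splits


# Toy data: 3 users with 10 rows each
rows = [(u, i, i % 2) for u in range(3) for i in range(10)]
train, test, evaluation = stratified_split(rows)
print(len(train), len(test), len(evaluation))
```

Because the split is done within each user, every user with enough interactions appears in all three sets.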

In [15]:
train_data, test_data, eval_data = python_stratified_split(ratings_final, ratio=[0.6, 0.2, 0.2], col_user='user_index', col_item='item', seed=12)

In [16]:
train_data.head()

Unnamed: 0,user_index,item,rating,original_rating
145,0,341,1,5.0
64,0,3591,1,4.0
287,0,5725,0,0.0
276,0,1403,0,0.0
223,0,1581,0,0.0


The KG dataframe is transformed into a dictionary, and the numbers of entities and relations are extracted as model parameters.

In [33]:
n_entity, n_relation, kg = load_kg(kg_final)
print("Number of entities:", n_entity)
print("Number of relations:", n_relation)

INFO:reco_utils.recommender.ripplenet.data_loader:reading KG file ...


Number of entities: 39799
Number of relations: 1


The ripple set dictionary is built from the positive ratings (relevant entities) in the training data. For each user, the KG is used to collect the knowledge triples reachable from the positively rated items, hop by hop, up to `n_hop` hops.

**Relevant entity**: Given the interaction matrix $Y$ and the knowledge graph $G$, the set of $k$-hop relevant entities for user $u$ is defined as

$$E^{k}_{u} = \{t \mid (h,r,t) \in G \text{ and } h \in E^{k-1}_{u}\}, \quad k = 1, 2, \ldots, H$$

where $E^{0}_{u} = V_{u} = \{v \mid y_{uv} = 1\}$ is the set of items the user clicked in the past, which can be seen as the seed set of user $u$ in the KG.

**RippleSet**: The $k$-hop ripple set of user $u$ is defined as the set of knowledge triples starting from $E^{k-1}_{u}$:

$$S^{k}_{u} = \{(h,r,t) \mid (h,r,t) \in G \text{ and } h \in E^{k-1}_{u}\}, \quad k = 1, 2, \ldots, H$$
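
The two definitions translate directly into a short loop. The sketch below computes the exact ripple sets on a toy KG and then samples a fixed number of triples per hop, which is what the `n_memory` parameter suggests the implementation does. Both helpers are hypothetical, and always sampling with replacement is a simplifying assumption rather than the behavior of `get_ripple_set`.

```python
import random


def ripple_sets(kg, clicked_items, n_hop):
    """Return [S^1, ..., S^H] as lists of (head, relation, tail) triples.

    kg maps head -> list of (tail, relation); clicked_items is the seed set E^0_u.
    """
    relevant = set(clicked_items)  # E^0_u
    result = []
    for _ in range(n_hop):
        hop = [(h, r, t) for h in sorted(relevant) for t, r in kg.get(h, [])]
        result.append(hop)  # S^k_u
        relevant = {t for _, _, t in hop}  # E^k_u: tails reached at this hop
    return result


def sample_fixed_size(triples, n_memory, rng):
    """Sample n_memory triples with replacement (an empty hop stays empty)."""
    return [rng.choice(triples) for _ in range(n_memory)] if triples else []


# Toy KG with a single relation type
toy_kg = {0: [(2, 1), (3, 1)], 2: [(4, 1)], 3: [(4, 1), (5, 1)]}
hops = ripple_sets(toy_kg, clicked_items=[0], n_hop=2)
memory = sample_fixed_size(hops[1], n_memory=4, rng=random.Random(12))
print(hops)
print(memory)
```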

In [18]:
user_history_dict = train_data.loc[train_data.rating == 1].groupby('user_index')['item'].apply(list).to_dict()
ripple_set = get_ripple_set(kg, user_history_dict, n_hop=n_hop, n_memory=n_memory)

INFO:reco_utils.recommender.ripplenet.data_loader:constructing ripple set ...


## Build model and predict

In [19]:
ripple = RippleNet(dim=dim,
                   n_hop=n_hop,
                   kge_weight=kge_weight, 
                   l2_weight=l2_weight, 
                   lr=lr,
                   n_memory=n_memory,
                   item_update_mode=item_update_mode, 
                   using_all_hops=using_all_hops,
                   n_entity=n_entity,
                   n_relation=n_relation,
                   optimizer_method=optimizer_method,
                   seed=seed)

INFO:numexpr.utils:Note: NumExpr detected 24 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.


In [20]:
with Timer() as train_time:
    ripple.fit(n_epoch=n_epoch, batch_size=batch_size,
               train_data=train_data[["user_index", "item", "rating"]].to_numpy(), 
               ripple_set=ripple_set,
               show_loss=show_loss)

print("Took {} seconds for training.".format(train_time.interval))

INFO:reco_utils.recommender.ripplenet.model:epoch 0  train auc: 0.9063  acc: 0.8241
INFO:reco_utils.recommender.ripplenet.model:epoch 1  train auc: 0.9303  acc: 0.8534
INFO:reco_utils.recommender.ripplenet.model:epoch 2  train auc: 0.9386  acc: 0.8638
INFO:reco_utils.recommender.ripplenet.model:epoch 3  train auc: 0.9456  acc: 0.8737
INFO:reco_utils.recommender.ripplenet.model:epoch 4  train auc: 0.9499  acc: 0.8788
INFO:reco_utils.recommender.ripplenet.model:epoch 5  train auc: 0.9522  acc: 0.8822
INFO:reco_utils.recommender.ripplenet.model:epoch 6  train auc: 0.9547  acc: 0.8860
INFO:reco_utils.recommender.ripplenet.model:epoch 7  train auc: 0.9578  acc: 0.8900
INFO:reco_utils.recommender.ripplenet.model:epoch 8  train auc: 0.9607  acc: 0.8955
INFO:reco_utils.recommender.ripplenet.model:epoch 9  train auc: 0.9624  acc: 0.8980


Took 650.0238825449924 seconds for training.


In [21]:
with Timer() as test_time:
    labels, scores = ripple.predict(batch_size=batch_size, 
                                    data=test_data[["user_index", "item", "rating"]].to_numpy())
    predictions = [1 if i >= 0.5 else 0 for i in scores]
    
print("Took {} seconds for prediction.".format(test_time.interval))

Took 5.9928844239912 seconds for prediction.


To build the RippleNet model again from scratch, first reset the TensorFlow graph:
```python
tf.reset_default_graph()
```

## Results and Evaluation

In [22]:
test_data['scores'] = scores

In [23]:
auc_score = auc(test_data, test_data, 
            col_user="user_index",
            col_item="item",
            col_rating="rating",
            col_prediction="scores")
print("The auc score is {}".format(auc_score))

The auc score is 0.9229666674514663


In [35]:
acc_score = np.mean(np.equal(predictions, labels)) # same result as in sklearn.metrics.accuracy_score 
print("The accuracy is {}".format(acc_score))

The accuracy is 0.8491929906968855
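
For intuition about the two numbers above, AUC can be read as the probability that a randomly chosen positive is scored higher than a randomly chosen negative, while accuracy depends on the 0.5 threshold. The sketch below uses these textbook definitions on made-up labels and scores. The helper names are hypothetical and the code is independent of the `auc` function used in this notebook.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def accuracy_at_threshold(labels, scores, threshold=0.5):
    """Share of examples whose thresholded score equals the label."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(predicted, labels)) / len(labels)


# Made-up labels and scores
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = pairwise_auc(toy_labels, toy_scores)
toy_acc = accuracy_at_threshold(toy_labels, toy_scores)
print(toy_auc, toy_acc)
```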


In [25]:
precision_k_score = precision_at_k(test_data, test_data, 
            col_user="user_index",
            col_item="item",
            col_rating="original_rating",
            col_prediction="scores",
            relevancy_method="top_k",
            k=TOP_K)
print("The precision_k_score score at k = {}, is {}".format(TOP_K, precision_k_score))

The precision_k_score score at k = 10, is 0.9307883405101026


In [26]:
recall_k_score = recall_at_k(test_data, test_data, 
            col_user="user_index",
            col_item="item",
            col_rating="original_rating",
            col_prediction="scores",
            relevancy_method="top_k",
            k=TOP_K)
print("The recall_k_score score at k = {}, is {}".format(TOP_K, recall_k_score))

The recall_k_score score at k = 10, is 0.5144293430105117
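
As a reminder of what these two metrics measure, here is a single-user sketch of the textbook definitions. It is an illustration only: the library functions `precision_at_k` and `recall_at_k` may handle relevancy and averaging across users differently, and `precision_recall_at_k` is a hypothetical helper.

```python
def precision_recall_at_k(scored_items, relevant_items, k):
    """Return (precision@k, recall@k) for one user.

    scored_items: list of (item, score); relevant_items: set of relevant item IDs.
    """
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    top_k = [item for item, _ in ranked[:k]]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


# Made-up scores for four items, three of which are relevant
scored = [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.3)]
precision, recall = precision_recall_at_k(scored, relevant_items={"a", "c", "d"}, k=2)
print(precision, recall)
```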


In [None]:
# Record results with papermill for tests - ignore this cell
pm.record("auc", auc_score)
pm.record("accuracy", acc_score)
pm.record("precision", precision_k_score)
pm.record("recall", recall_k_score)
pm.record("train_time", train_time.interval)
pm.record("test_time", test_time.interval)

## References

1. Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, Minyi Guo, "RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems", *The 27th ACM International Conference on Information and Knowledge Management (CIKM 2018)*, 2018. https://arxiv.org/pdf/1803.03467.pdf
1. The original implementation of RippleNet: https://github.com/hwwang55/RippleNet