<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# RAPIDS k-NN Movie Recommender (Python, GPU-RAPIDS)

About k-NN Recommender.

RAPIDS suite of open source software libraries and APIs gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs

### Prerequisite
In order to use RAPIDS, following requirements must meet:
* GPU: NVIDIA Pascal or better with compute capability 6.0+
* Supported OS: Ubuntu 16.04/18.04 or CentOS 7 with gcc 5.4 & 7.3
* Docker Prereqs: Docker CE v18+ and NVIDIA-docker v2+
* CUDA: 9.2 with driver v396.37+ or 10.0 with driver v410.48+
For details, see [RAPIDS web page](https://rapids.ai/start.html).

To quickstart, we recommend to spin-up Azure NCv2 or NCv3 Data-Science VM which meets those prerequisite.

On the machine that meets the requirements, install [RAPIDS](https://rapids.ai/start.html).

In [2]:
# Check RAPIDS packages and install if not exists
import importlib
import sys
if importlib.util.find_spec("cudf") is None:
    !conda install --yes --prefix {sys.prefix} -c rapidsai -c nvidia -c numba -c conda-forge \
        cudf=0.9 cuml=0.9

# 0 Global Settings and Imports

In [3]:
sys.path.append("../../")
import time

import cudf as cu
from cuml.neighbors import NearestNeighbors as cuNN

import numpy as np
import pandas as pd
import papermill as pm

from reco_utils.common.gpu_utils import get_gpu_info
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_stratified_split
import reco_utils.evaluation.rapids_evaluation as cu_evaluator


print("System version: {}".format(sys.version))
print("GPU info: {}".format(get_gpu_info()))
print("CuDF version: {}".format(cu.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
GPU info: [{'device_name': 'Tesla V100-PCIE-16GB', 'total_memory': 16130.5, 'free_memory': 15711.625}]
CuDF version: 0.9.0


# 1. Load Data

Note, a larger data may not fit in GPU memory. In such case, you can utilize multiple GPUs via [RAPIDS + Dask](https://rapids.ai/dask.html).

In [4]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

#### 1.1 Download MovieLens dataset and split

In [11]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
)

# Convert the float precision to 32-bit in order to reduce memory consumption 
data.head()

100%|██████████| 4.81k/4.81k [00:00<00:00, 16.6kKB/s]


Unnamed: 0,userID,itemID,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


In [7]:
train, test = python_stratified_split(data, ratio=0.75, col_user='userID', col_item='itemID', seed=42)

print("Num train vs. test = {} vs. {}".format(len(train), len(test)))

Num train vs. test = 74992 vs. 25008


#### 1.2 Load item features onto GPU (cuDF)

We build item-similarity based k-NN. You may use item features directly if available. Here, we use users' ratings as item features. 

Note, current version of cuDF (v0.9) does not support `pivot_table`. So, we pivot from Pandas DataFrame and then load into cuDF.

In [8]:
item_feats = train.pivot_table(
    index='itemID',
    columns='userID',
    values='rating'
).fillna(0)
cu_item_feats = cu.from_pandas(item_feats)

cu_item_feats.head()

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
itemID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,...,2.0,3.0,4.0,0.0,4.0,0.0,0.0,5.0,0.0,0.0
2,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
3,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# 2. Train k-NN Model

We use [cuML Nearest Neighbor](https://rapidsai.github.io/projects/cuml/en/0.9.0/api.html?highlight=nearest#cuml.NearestNeighbors). As you know, training k-NN is nothing but storing all samples in the model. Distances between the samples are calculated during prediction.

In [10]:
%%time

knn = cuNN()
knn.fit(cu_item_feats)

CPU times: user 2.61 s, sys: 616 ms, total: 3.23 s
Wall time: 3.22 s


NearestNeighbors(n_neighbors=5, n_gpus=1, devices=None, verbose=False, handle=<cuml.common.handle.Handle object at 0x7f6ced461df8>)

In [22]:
items = movielens.load_item_df(size=MOVIELENS_DATA_SIZE).set_index('itemID')
items.head()

100%|██████████| 4.81k/4.81k [00:00<00:00, 15.9kKB/s]


Unnamed: 0_level_0,title,genres,year
itemID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Toy Story (1995),Animation|Children's|Comedy,1995
2,GoldenEye (1995),Action|Adventure|Thriller,1995
3,Four Rooms (1995),Thriller,1995
4,Get Shorty (1995),Action|Comedy|Drama,1995
5,Copycat (1995),Crime|Drama|Thriller,1995


For verification purpose, we get top K similar movies of Toy Story based on "Users who rated Toy Story also rated these movies".

In [39]:
%%time
dist, ind = knn.kneighbors(cu_item_feats[:1], TOP_K+1)  # Note, the closest item is itself.

CPU times: user 3.54 s, sys: 200 ms, total: 3.74 s
Wall time: 3.12 s


In [41]:
titles = items['title']
print("Title (distance):\n")
for i in range(1, len(ind.loc[0])):
    print("{} ({})".format(items['title'][ind.loc[0][i]], dist.loc[0][i]))

Title (distance):

Rock, The (1996) (4645.0)
Pinocchio (1940) (4766.0)
Breaking the Waves (1996) (4865.0)
Striptease (1996) (4899.0)
Cold Comfort Farm (1995) (4933.0)
Swingers (1996) (4950.0)
Home Alone (1990) (5040.0)
Operation Dumbo Drop (1995) (5156.0)
Rumble in the Bronx (1995) (5170.0)
Jaws (1975) (5182.0)


In [None]:
"""TODO - top-k algorithm
1. calculate ind, dist for all items
2. for each user in the train set,
    i. create a min-heap
    ii. for each item the user have watched, insert the most similar item to the heap: item(dist, ind, src)
    iii. poll the head of the heap (min-dist) and add the next similar item of the src
    iv. add the polled item to reco-list if the user hasn't watched yet
    v. repeat until len(reco-list) == top_k
"""

In [79]:
display(top_k.head())

Unnamed: 0,userID,itemID,prediction
0,1,58,3.049881
1,1,7,3.053073
2,1,318,3.059262
3,1,210,3.095604
4,1,96,3.124997


In [80]:
eval_map = map_at_k(test, top_k, col_user='userID', col_item='itemID', col_rating='rating', k=TOP_K)

In [81]:
eval_ndcg = ndcg_at_k(test, top_k, col_user='userID', col_item='itemID', col_rating='rating', k=TOP_K)

In [82]:
eval_precision = precision_at_k(test, top_k, col_user='userID', col_item='itemID', col_rating='rating', k=TOP_K)

In [83]:
eval_recall = recall_at_k(test, top_k, col_user='userID', col_item='itemID', col_rating='rating', k=TOP_K)

In [84]:
print("Model:\t",
      "Top K:\t%d" % TOP_K,
      "MAP:\t%f" % eval_map,
      "NDCG:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

Model:	
Top K:	10
MAP:	0.110591
NDCG:	0.382461
Precision@K:	0.330753
Recall@K:	0.176385


In [86]:
# Record results with papermill for tests - ignore this cell
pm.record("map", eval_map)
pm.record("ndcg", eval_ndcg)
pm.record("precision", eval_precision)
pm.record("recall", eval_recall)
pm.record("train_time", train_time)
pm.record("test_time", test_time)