# Board Games Recommender
____________

# Part 2 - Collaborative Filtering & Content-based Recommenders

The two main categories of [recommender systems](https://en.wikipedia.org/wiki/Recommender_system) are collaborative filtering and content-based filtering. In this notebook, we will develop recommender systems under these two main categories, using the explicit data collected from the BGG community ratings on board games, and the information given for each board game.

The collaborative filtering used here falls under the umbrella of memory-based methods, also referred to as neighborhood-based collaborative filtering. It builds on the assumption that people will like items similar to those they liked in the past. The system uses the ratings (usually) that different users give to items. It provides recommendations by finding users or items with a rating history similar to the current user or item, and makes suggestions via this neighborhood. An advantage of this approach is that the system does not need to "understand" the item itself when making recommendations.

Content-based filtering builds upon the description/information of the items. These methods are most suitable when there is data on the different features for each item (name, category, etc.). These systems provide recommendations by finding items similar to what the current user likes based on the item features.

### Contents:
- [Preprocessing](#Preprocessing)
- [User-based Collaborative Recommender](#User-based-Collaborative-Recommender)
- [Item-based Collaborative Recommender](#Item-based-Collaborative-Recommender)
- [Content-based Recommender](#Content-based-Recommender)

In [150]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import sqlite3
import os

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import pairwise_distances, cosine_distances, cosine_similarity

### Import data

Import the cleaned dataframe, reference dictionaries, and user ratings.

In [135]:
# Open dataframe
with open('../datasets/boardgames/clean_bgg_GameItem.pkl', 'rb') as infile:
    df = pickle.load(infile)

In [4]:
df.shape

(7929, 20)

In [5]:
# Open dictionaries
with open('../datasets/boardgames/ref_dictionaries.pkl', 'rb') as infile:
    ref_dicts = pickle.load(infile)

In [6]:
# Extract ratings from sqlite database
# We will only use ratings from 2018 onwards (see the WHERE clause below)
conn = sqlite3.connect("../datasets/boardgames/bgg_5yrs_RatingItem.db")
cur = conn.cursor()

user_df = pd.read_sql_query("""
SELECT *,
    COUNT(bgg_user_name) OVER
         (PARTITION BY bgg_user_name) AS user_count
FROM bgg_ratings
WHERE year >= 2018
""", conn)

user_df.head()

,bgg_user_name,bgg_id,bgg_user_rating,year,month,user_count
0,-=yod@=-,463,9.0,2018,3,82
1,-=yod@=-,478,6.0,2020,1,82
2,-=yod@=-,2651,7.0,2020,11,82
3,-=yod@=-,16772,9.0,2019,3,82
4,-=yod@=-,17133,7.0,2019,12,82


In [7]:
cur.close()
conn.close()

In [8]:
user_df.shape

(7788605, 6)

In [9]:
# Save df as .pkl
with open('../datasets/boardgames/bgg_users_2018.pkl', 'wb') as outfile:
    pickle.dump(user_df, outfile)

## Preprocessing

A common problem in recommender systems is ***user cold-start***: it is difficult to recommend items to users who have consumed very few items (in this case, rated very few board games), because there is too little information to model their preferences. Moreover, the full dataset is too large to fit in the available memory. As such, we keep only the users with at least 100 rated board games.

In [178]:
# Filter dataframe to keep users with at least 100 ratings
user_df = user_df[user_df['user_count']>=100]
user_df.shape

(3940040, 6)
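
The `user_count` column used in this filter comes from the SQL window function in the import step. As a rough sketch of the same idea in pandas, the example below counts ratings per user with `groupby(...).transform('count')` and applies a threshold. The toy table, the `min_ratings` name, and the threshold of 2 are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ratings table with the same column names as user_df
toy = pd.DataFrame({
    "bgg_user_name": ["ann", "ann", "ann", "bob", "cat", "cat"],
    "bgg_id": [3, 13, 25, 13, 3, 13],
    "bgg_user_rating": [8.0, 6.0, 7.0, 9.0, 5.0, 6.5],
})

# Pandas counterpart of COUNT(...) OVER (PARTITION BY bgg_user_name)
toy["user_count"] = toy.groupby("bgg_user_name")["bgg_id"].transform("count")

min_ratings = 2  # the notebook uses 100; 2 keeps the toy example readable
active = toy[toy["user_count"] >= min_ratings]
print(active["bgg_user_name"].unique())
```

Here `bob`, who has a single rating, is dropped, while `ann` and `cat` are kept.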

We also want to extract the user ratings for the board games that we are left with after extensive EDA and cleaning.

In [179]:
# Filtering dataframe to user ratings of the board games we are concerned with
user_df = user_df[user_df['bgg_id'].isin(df['bgg_id'])]
user_df.shape

(3395661, 6)

In [181]:
# number of unique users
user_df['bgg_user_name'].nunique()

19723

#### Board game mapper

In [63]:
# Mapper (bgg_id as string -> name)
bg_mapper = dict(zip(df['bgg_id'].astype(str), df['name']))

## User-based Collaborative Recommender

The ratings provided by like-minded users are used to make recommendations for a target user. The basic idea is to find users who are similar to the target user and to predict the target user's unobserved ratings as weighted averages of the ratings given by this peer group. Similarity functions are computed between the rows of the ratings matrix to discover similar users.

#### Create pivot table

Because we're creating a user-based collaborative recommender, we'll set up our pivot table as follows:
1. `bgg_user_name` will be the index
2. `bgg_id` will be the columns
3. `bgg_user_rating` will be the values

In [182]:
# User-based pivot table
user_pivot = pd.pivot_table(user_df, index='bgg_user_name', columns='bgg_id', values='bgg_user_rating')
user_pivot

bgg_user_name,3,9,10,11,12,13,14,16,17,25,...,317519,317985,318472,318553,318977,318983,319114,319966,320698,325635
...hammer,,,,,,,,,,,...,,,,,,,,,,
0492372665,,,,,,,,,,,...,,,,,,,,,,
0815spieler,,,,,,8.0,,,,,...,,,,,,,,,,
0b1_ita,,,,,,,,,,,...,,,,,,,,,,
0xa8e,,,,,,6.0,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zyxbg,8.0,,,,,5.0,,,,,...,,,,,,,,,,
zzap1977,,,,,,5.0,,,,,...,,,,,,,,,,
zzgamer11,,,,,,5.0,,,,,...,,,,,,,,,,
zztap,,,,,,7.0,,,,,...,,,,,,,,,,


#### Create sparse matrix

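Most users have rated only a small fraction of the games, so the pivot table consists mostly of missing values. We fill these with 0 and store the result as a sparse matrix, which keeps only the non-zero entries in memory.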

In [183]:
# Sparse matrix
sparse_user_pivot = sparse.csr_matrix(user_pivot.fillna(0))
sparse_user_pivot

<19723x7927 sparse matrix of type '<class 'numpy.float64'>'
	with 3395661 stored elements in Compressed Sparse Row format>

In [184]:
# Convert type to save memory
sparse_user_pivot = sparse_user_pivot.astype(np.float32)
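
The dense pivot table above allocates a cell for every user-game pair, although the output shows that only about 2% of the cells (3,395,661 of 19,723 x 7,927) hold a rating. As a hedged alternative, the sketch below builds the CSR matrix directly from long-format ratings without materialising the dense table. The toy arrays and variable names are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np
from scipy import sparse

# Toy long-format ratings (hypothetical values standing in for user_df)
users = np.array(["ann", "bob", "ann", "cat", "bob"])
games = np.array([13, 13, 266192, 3, 3])
ratings = np.array([8.0, 6.0, 9.0, 7.0, 5.0], dtype=np.float32)

# Map each user and game to a contiguous integer position
user_labels, user_idx = np.unique(users, return_inverse=True)
game_labels, game_idx = np.unique(games, return_inverse=True)

# Build the users x games matrix from (value, (row, col)) triplets
toy_sparse = sparse.csr_matrix(
    (ratings, (user_idx, game_idx)),
    shape=(len(user_labels), len(game_labels)),
)

density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])
print(toy_sparse.toarray())
print(f"density: {density:.2f}")
```

One difference to keep in mind: `pd.pivot_table` averages duplicate user-game pairs by default, whereas the triplet constructor sums them, so duplicates would need to be aggregated beforehand.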

### User Similarities  

We use the `cosine_similarity` function to measure the similarity between two users. Each user is treated as a vector of ratings, and the cosine of the angle between two such vectors indicates whether they point in roughly the same direction.
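
For two rating vectors `u` and `v`, the cosine similarity is `(u · v) / (‖u‖ ‖v‖)`. The following is a minimal sketch with made-up ratings, assuming (as in the pivot table above) that unrated games are filled with 0; the `cosine` helper is a hypothetical stand-in for the scikit-learn function.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical ratings for four games; 0 means "not rated"
user_a = np.array([8.0, 0.0, 6.0, 0.0])
user_b = np.array([4.0, 0.0, 3.0, 0.0])   # same preferences as user_a, harsher scale
user_c = np.array([0.0, 9.0, 0.0, 7.0])   # no games in common with user_a

print(cosine(user_a, user_b))
print(cosine(user_a, user_c))
```

Because ratings are non-negative, the scores lie between 0 and 1. Users with no games in common score 0, and since each norm covers all of a user's ratings, two users with many ratings but little overlap also end up with a low score.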

In [185]:
# Similarity matrix
user_similarities = cosine_similarity(sparse_user_pivot)
user_similarities

array([[1.000001  , 0.17911129, 0.13159662, ..., 0.22616096, 0.20586593,
        0.26911274],
       [0.17911129, 1.0000001 , 0.05465441, ..., 0.11048871, 0.09939629,
        0.18030435],
       [0.13159662, 0.05465441, 0.99999994, ..., 0.11639071, 0.11296934,
        0.18239552],
       ...,
       [0.22616096, 0.11048871, 0.11639071, ..., 1.        , 0.19622406,
        0.26788387],
       [0.20586593, 0.09939629, 0.11296934, ..., 0.19622406, 1.0000015 ,
        0.31967598],
       [0.26911274, 0.18030435, 0.18239552, ..., 0.26788387, 0.31967598,
        0.99999994]], dtype=float32)

In [186]:
# Use it as a dataframe
user_cf_df = pd.DataFrame(user_similarities, index=user_pivot.index, columns=user_pivot.index)
user_cf_df.head()

bgg_user_name,...hammer,0492372665,0815spieler,0b1_ita,0xa8e,0xdeadbeef,1 family meeple,1000games,1000rpm,100pcblade,...,zybthranger,zyggy,zyklonc,zyrus,zyx0xyz,zyxbg,zzap1977,zzgamer11,zztap,zzzabiss
...hammer,1.000001,0.179111,0.131597,0.161758,0.323277,0.117465,0.213074,0.122432,0.08012,0.121473,...,0.140263,0.109555,0.126641,0.20994,0.122034,0.270151,0.329175,0.226161,0.205866,0.269113
0492372665,0.179111,1.0,0.054654,0.113391,0.192375,0.136979,0.170052,0.057777,0.049839,0.058391,...,0.115383,0.060204,0.138857,0.128756,0.163957,0.230413,0.250079,0.110489,0.099396,0.180304
0815spieler,0.131597,0.054654,1.0,0.08826,0.165677,0.112597,0.134872,0.097,0.108191,0.078472,...,0.052423,0.063978,0.111907,0.145739,0.119657,0.122137,0.111269,0.116391,0.112969,0.182396
0b1_ita,0.161758,0.113391,0.08826,1.0,0.132305,0.137549,0.12679,0.054369,0.124798,0.057187,...,0.070053,0.110262,0.087255,0.169374,0.121472,0.146482,0.215259,0.097421,0.194233,0.154082
0xa8e,0.323277,0.192375,0.165677,0.132305,1.0,0.203321,0.20769,0.105418,0.09816,0.114271,...,0.17855,0.148149,0.114605,0.225596,0.17245,0.215403,0.308718,0.270311,0.231995,0.276217


### Evaluation of Recommender

We want to evaluate our recommender to see if it matches up to our intuition. We will use an existing user profile in our dataset to do the evaluation.

In [187]:
# Similar users scores
user_input = 'joelbear'
print(user_input)
user_sim = user_cf_df[user_input].drop(user_input)
user_sim = user_sim[user_sim > 0].sort_values(ascending=False)
user_sim

joelbear


bgg_user_name
elschmear           0.377744
master thomas       0.336090
michael maschke     0.299553
traderjack          0.291412
jirka bauma         0.285821
                      ...   
stewie              0.001229
tswider             0.001099
jsmaple64           0.000980
long john silver    0.000726
hattori hanzo       0.000395
Name: joelbear, Length: 19653, dtype: float32

In [188]:
# Turn the similarity scores into weights
user_weight = user_sim.values / np.sum(user_sim)
user_weight

array([1.8993126e-04, 1.6898756e-04, 1.5061646e-04, ..., 4.9259165e-07,
       3.6480384e-07, 1.9868239e-07], dtype=float32)

In [189]:
# Ratings for board games by users
user_ratings = user_pivot.T
user_ratings.head(10)

bgg_id,...hammer,0492372665,0815spieler,0b1_ita,0xa8e,0xdeadbeef,1 family meeple,1000games,1000rpm,100pcblade,...,zybthranger,zyggy,zyklonc,zyrus,zyx0xyz,zyxbg,zzap1977,zzgamer11,zztap,zzzabiss
3,,,,,,,,,,,...,,,,,,8.0,,,,
9,,,,,,,,,,6.0,...,,,,,,,,,,
10,,,,,,,,,,8.0,...,,,,,,,,,,
11,,,,,,,,,,,...,,,,,,,,,,5.0
12,,,,,,6.0,,9.0,,,...,,,,,,,,,,
13,,,8.0,,6.0,,6.0,,,,...,,,8.0,,,5.0,5.0,5.0,7.0,7.0
14,,,,,,,,,8.0,,...,,,,,,,,,,
16,,,,,,,,,,,...,,,,,,,,,,
17,,,,,,,,,,,...,,,,,,,,,,
25,,,,,,,,,,,...,,,,,,,,,,


In [190]:
# Board games that the user has not rated
# Also, drop the user themselves (user_sim no longer contains user_input)
ratings = user_ratings[user_ratings[user_input].isnull()]
ratings = ratings.loc[:, user_sim.index]
ratings

bgg_id,elschmear,master thomas,michael maschke,traderjack,jirka bauma,lawster,_the_inquiry_,chriswray84,montsegur,hannuman,...,chrisback79,romanyudov93,usiandrew,edgar gallego,mad zombie,stewie,tswider,jsmaple64,long john silver,hattori hanzo
12,8.0,9.0,9.0,8.5,7.5,,9.0,7.5,7.0,9.0,...,,,,,,,,,,
13,6.0,9.5,8.0,7.0,7.0,,5.0,8.0,4.0,6.0,...,,,,,,,,,,
25,,,,,,,,,,,...,,,6.0,,,,,,,
26,,,7.0,,,,,,,,...,,,,,,,,,,
46,8.0,8.5,,6.0,7.5,7.0,,8.0,9.0,7.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318983,,9.5,7.0,,,,,,,,...,,,,,,,,,,
319114,,,8.0,,,,,,,,...,,,,,,,,,,
319966,,,,,,,,,,,...,,,,,,,,,,
320698,,,,,,,,,,,...,,,,,,,,,,


In [191]:
# Predicted ratings
pred_user_ratings = np.dot(ratings.fillna(0), user_weight)
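
Note that the dot product above treats a missing rating as 0, so each score is a similarity-weighted sum over all neighbours: games rated by many similar users score higher, which partly reflects popularity. A hedged alternative, not the method used in this notebook, divides by the total weight of only those neighbours who rated the game, giving a weighted average on the original rating scale. The arrays below are made up for illustration.

```python
import numpy as np

# Hypothetical ratings: rows are games, columns are neighbours; NaN = not rated
toy_ratings = np.array([
    [8.0, np.nan, np.nan],   # rated by one neighbour only
    [6.0, 7.0, 6.5],         # rated by every neighbour
])
toy_sim = np.array([0.5, 0.3, 0.2])  # similarity of each neighbour to the target user

rated = ~np.isnan(toy_ratings)               # mask of observed ratings
filled = np.where(rated, toy_ratings, 0.0)   # same effect as fillna(0)

# Weighted sum (the notebook's approach): missing ratings pull the score down
weighted_sum = filled @ (toy_sim / toy_sim.sum())

# Weighted average over the neighbours who actually rated each game
weight_seen = rated @ toy_sim
weighted_avg = (filled @ toy_sim) / np.where(weight_seen > 0, weight_seen, np.nan)

print(weighted_sum)
print(weighted_avg)
```

The weighted average ignores how many neighbours rated a game, so in practice it is often combined with a minimum number of neighbour ratings per game; that threshold would be a further design choice.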

In [192]:
# Observe in dataframe, top 20 recommendations
pd.DataFrame(pred_user_ratings, index=ratings.index.astype(str).map(bg_mapper), columns=[user_input]).sort_values(by=user_input, ascending=False).head(20)

bgg_id,joelbear
Azul,5.998407
Terraforming Mars,5.232639
Scythe,4.347414
7 Wonders Duel,4.25356
7 Wonders,4.093098
Sagrada,4.075544
The Castles of Burgundy,4.048642
Pandemic,3.765092
Splendor,3.677116
Concordia,3.636277


In [193]:
# View board games which user had already rated
user_rated_games = user_ratings[[user_input]]
user_rated_games.index = user_ratings.index.astype(str).map(bg_mapper)
user_rated_games.sort_values(by=user_input, ascending=False).head(20)

bgg_id,joelbear
Samurai,10.0
Medina,10.0
Safranito,10.0
Norenberc,10.0
Mord im Arosa,10.0
Ys,10.0
Around the World in 80 Days,10.0
Mount Drago,10.0
Pelican Cove,10.0
Strasbourg,10.0


We observe that many of the games recommended are popular Euro-style board games, somewhat similar to some of the games which the user had already rated highly.

However, there are some caveats to this approach:  
- The recommender system will only recommend board games that were rated by users within the chosen timeframe of the past 3 years. It is harder to find similar users for a new user who likes board games that are not in this list.
- If a new user has very few likes, for example if we simulate someone who just started trying out board games in general, it is difficult to pair them with a similar user.
- There is a good chance of re-recommending a board game which the user already owns because similar users may have rated the same games as the user. If we were to filter out the board games to only the ones not owned by the user, we may be left with few recommendations.
- It is hard to maintain this recommender system as each new like or rating for another board game may significantly change the recommendations. The user profiles will need to be continuously updated to achieve the best recommendations.

In [199]:
# Save items: the row/column keys and the sparse user-by-game ratings matrix
with open('../datasets/boardgames/user_similarity_keys.pkl', 'wb') as outfile:
    pickle.dump({'bgg_user_name': list(user_pivot.index), 'bgg_id': list(user_pivot.columns)}, outfile)
with open('../datasets/boardgames/user_similarity_matrix.pkl', 'wb') as outfile:
    pickle.dump(sparse_user_pivot, outfile)

## Item-based Collaborative Recommender

The item-based collaborative recommender alleviates some of the problems faced by the user-based collaborative recommender. This system recommends items that are similar to items the user already likes. To predict the target user's rating for a target item, a set S of items most similar to the target item is first determined; the user's own ratings for the items in S are then used to predict whether the user will like the target item. Similarity functions are computed between the columns of the ratings matrix to discover similar items.
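
The prediction step described above can be sketched as follows. This is an illustration under assumptions, not code from this notebook (which ranks similar games rather than predicting ratings): the similarity matrix, the ratings, the neighbourhood size `k`, and the helper name `predict_item_based` are all hypothetical.

```python
import numpy as np

# Hypothetical item-item similarity matrix for four games (symmetric, 1 on the diagonal)
item_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.6],
    [0.3, 0.4, 0.6, 1.0],
])
# Target user's ratings; NaN marks games the user has not rated
user_ratings = np.array([9.0, np.nan, 4.0, np.nan])

def predict_item_based(target, k=2):
    """Predict the user's rating for `target` from the k most similar rated games."""
    rated = np.flatnonzero(~np.isnan(user_ratings))
    # Set S: the k rated games most similar to the target
    top = rated[np.argsort(item_sim[target, rated])[::-1][:k]]
    weights = item_sim[target, top]
    return float(weights @ user_ratings[top] / weights.sum())

for game in np.flatnonzero(np.isnan(user_ratings)):
    print(game, round(predict_item_based(game), 2))
```

Game 1 is most similar to game 0, which the user rated highly, so its predicted rating is high; game 3 is closer to the low-rated game 2 and receives a lower prediction.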

#### Create pivot table

Because we're creating an item-based collaborative recommender, we'll set up our pivot table as follows:
1. `bgg_id` will be the index
2. `bgg_user_name` will be the columns
3. `bgg_user_rating` will be the values

In [91]:
# Item-based pivot table
item_pivot = pd.pivot_table(user_df, index='bgg_id', columns='bgg_user_name', values='bgg_user_rating')
item_pivot

bgg_id,0815spieler,0xdeadbeef,1 family meeple,1000games,144creations,1friidrek6,1nf1n1ty,1point21gigawatts,20sanx,21kellie08,...,zwinky,zxlitening45,zyater,zyggy,zyklonc,zyrus,zyx0xyz,zyxbg,zztap,zzzabiss
3,,,,,,,,,,,...,,,,,,,,8.0,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,4.5,,4.0,,,...,,,,,,,,,,
11,,,,,7.0,7.5,,,6.5,,...,7.0,7.0,,,,,,,,5.0
12,,6.0,,9.0,,8.0,9.0,,,,...,8.5,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318983,,,,,,,,,,,...,,,,,,,,,,
319114,,,,,,,,,,,...,,,,,,,,,,
319966,,,,,,,,,,,...,,,,,,,,,,
320698,,,,,,,,,,,...,,,,,,,,,,


#### Create sparse matrix

As before, we fill the missing ratings with 0 and store the pivot table as a sparse matrix.

In [92]:
# Sparse matrix
sparse_item_pivot = sparse.csr_matrix(item_pivot.fillna(0))
sparse_item_pivot

<7926x10708 sparse matrix of type '<class 'numpy.float64'>'
	with 2436846 stored elements in Compressed Sparse Row format>

### Item Similarities  

Similar to the user-based recommender, we use the `cosine_similarity` function to measure the similarity between two board games.

In [93]:
# Similarity matrix
item_similarities = cosine_similarity(sparse_item_pivot)
item_similarities

array([[1.        , 0.08125002, 0.18525882, ..., 0.07755587, 0.01385348,
        0.01994537],
       [0.08125002, 1.        , 0.11502174, ..., 0.02701262, 0.        ,
        0.02600494],
       [0.18525882, 0.11502174, 1.        , ..., 0.02885523, 0.00972737,
        0.01374922],
       ...,
       [0.07755587, 0.02701262, 0.02885523, ..., 1.        , 0.02828846,
        0.03087794],
       [0.01385348, 0.        , 0.00972737, ..., 0.02828846, 1.        ,
        0.        ],
       [0.01994537, 0.02600494, 0.01374922, ..., 0.03087794, 0.        ,
        1.        ]])

In [110]:
# Use it as a dataframe
item_cf_df = pd.DataFrame(item_similarities, index=item_pivot.index, columns=item_pivot.index)
item_cf_df.head()

bgg_id,3,9,10,11,12,13,14,16,17,25,...,317519,317985,318472,318553,318977,318983,319114,319966,320698,325635
3,1.0,0.08125,0.185259,0.279532,0.384177,0.289594,0.114151,0.093379,0.057257,0.006725,...,0.015025,0.0567,0.017631,0.049842,0.040523,0.066875,0.005528,0.077556,0.013853,0.019945
9,0.08125,1.0,0.115022,0.072764,0.076345,0.069921,0.068831,0.133618,0.054221,0.0,...,0.0,0.031824,0.0,0.02073,0.021392,0.055765,0.018535,0.027013,0.0,0.026005
10,0.185259,0.115022,1.0,0.240805,0.208245,0.238669,0.150527,0.119016,0.078961,0.012654,...,0.0,0.030546,0.013144,0.061012,0.057326,0.05586,0.025595,0.028855,0.009727,0.013749
11,0.279532,0.072764,0.240805,1.0,0.357944,0.481541,0.088167,0.065652,0.061485,0.016719,...,0.005677,0.06389,0.011197,0.082145,0.073663,0.059916,0.042503,0.071706,0.004763,0.011194
12,0.384177,0.076345,0.208245,0.357944,1.0,0.353952,0.119813,0.08809,0.054401,0.032171,...,0.011743,0.074599,0.013369,0.064862,0.040744,0.067818,0.005967,0.098973,0.014996,0.032472


### Evaluation of Recommender

We want to evaluate our recommender to see if it matches up to our intuition. We will use a board game in our dataset to do the evaluation.

In [115]:
# Top 20 similar board games
item_input = 266192
print(bg_mapper[str(item_input)])
item_sim = item_cf_df[item_input]
item_sim.index = item_cf_df.index.astype(str).map(bg_mapper)
item_sim = item_sim[item_sim > 0].drop(bg_mapper[str(item_input)])
item_sim.sort_values(ascending=False).head(20)

Wingspan


bgg_id
Azul                                 0.783586
The Quacks of Quedlinburg            0.730017
Terraforming Mars                    0.723364
Sagrada                              0.700626
Welcome To...                        0.685187
Scythe                               0.679145
Architects of the West Kingdom       0.678595
Everdell                             0.660366
That's Pretty Clever!                0.649137
7 Wonders Duel                       0.646902
Viticulture Essential Edition        0.646263
The Mind                             0.640758
Kingdomino                           0.640602
7 Wonders                            0.639175
Great Western Trail                  0.638327
The Castles of Burgundy              0.629260
Teotihuacan: City of Gods            0.628503
Root                                 0.627019
Clank!: A Deck-Building Adventure    0.625893
Codenames                            0.622450
Name: 266192, dtype: float64

We observe that the recommender system is effective in finding other board games that are rated similarly to the specified input. This method is usually lighter on computational resources, since there are fewer board games than users and the item-item similarity matrix is therefore smaller than the user-user one. It also sidesteps the user cold-start problem faced by new user profiles, since the system is able to provide recommendations based on a single item entry.

The advantages of memory-based techniques are that they are simple to implement and the resulting recommendations are often easy to explain. However, because similar board games are determined by user rating patterns, we are again limited to the board games rated by users within the past 3 years. If a new user likes some of the older board games that have not been rated recently by other users, it will be difficult to find similar board games. Moreover, memory-based algorithms do not work well with sparse ratings matrices: with few overlapping ratings, it may be difficult to robustly predict the target user's unobserved ratings.

## Content-based Recommender

Although the collaborative filtering recommenders above are useful, they are built on the user ratings alone. We still have the rich features of the board games, which are not yet utilized. It is hard to include these features in collaborative recommenders directly; hence, we want to explore a content-based recommender system.

In content-based filtering, the features of the dataframe are broken down into "feature baskets". These are the characteristics that represent a board game. The main idea is that if the user likes certain categories, mechanics, or types of a certain board game, then it is likely the user likes another board game that has similar characteristics. 

In [167]:
content_df = df[['bgg_id', 'name', 'game_type', 'designer', 'artist', 'publisher', 'category', 'mechanic']].copy()
content_df.head()

,bgg_id,name,game_type,designer,artist,publisher,category,mechanic
0,3,Samurai,5497,2,11883,"17,133,267,29,7340,7335,41,2973,4617,1391,8291...",10091035,208020402026284620042002
1,9,El Caballero,5497,78,74,2671333,1020,20802002
2,10,Elfenland,5499,9,74,826768181885233953,10101097,2041204020812078
3,11,Bohnanza,5499,10,"28004,44242,12035,11507,11901,65041,308,12123,...","8,267,46980,7162,2378,6818,8845,155,5530,6214,...",100210131026,20402981291520042008
4,12,Ra,5497,2,2078911883,"9,34,28620,267,29,23205,2973,8291,9881,42294,3...",10501082,201229232928292226612004


#### Create dataframe

We want to add a new column for each feature value that is considered when choosing a board game. This approach is similar to one-hot encoding. We can use `CountVectorizer` to process the large number of features.

In [161]:
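# Function to extract the ids for each feature
def feature_extract(series, prefix):
    # Each row is a comma-separated string of ids; count every id per game
    cvec = CountVectorizer()
    cvec_arr = cvec.fit_transform(series)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    columns = [prefix + feature_id for feature_id in cvec.get_feature_names_out()]
    return pd.DataFrame(cvec_arr.toarray(), columns=columns, index=series.index)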
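
One caveat worth checking: by default, `CountVectorizer` uses the token pattern `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters. Single-digit ids (for example designer `2` or `9` in the table above) would therefore be dropped, which is consistent with `designer_10` being the first designer column in the output further down. The sketch below reproduces the effect with the standard `re` module on a made-up id string. Passing `token_pattern=r'(?u)\b\w+\b'` to `CountVectorizer` is one possible remedy, though it was not used in the original notebook and would change the outputs shown here.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"   # scikit-learn's default token_pattern
RELAXED_PATTERN = r"(?u)\b\w+\b"     # also keeps single-character tokens

ids = "2,9,10,11883"  # hypothetical comma-separated id string

print(re.findall(DEFAULT_PATTERN, ids))
print(re.findall(RELAXED_PATTERN, ids))
```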

In [168]:
# Use custom function to extract the ids and add to df
for col in ['game_type', 'designer', 'artist', 'publisher', 'category', 'mechanic']:
    content_df = pd.concat([content_df, feature_extract(content_df[col], col+'_')], axis=1)

In [169]:
# Drop the original columns now that they have been encoded
content_df = content_df.drop(columns=['name', 'game_type', 'designer', 'artist', 'publisher', 'category', 'mechanic'])

# Set index as the board game id
content_df = content_df.set_index('bgg_id', drop=True)

In [170]:
# Check dataframe after updates
content_df.head()

bgg_id,game_type_4664,game_type_4665,game_type_4666,game_type_4667,game_type_5496,game_type_5497,game_type_5498,game_type_5499,game_type_99999,designer_10,...,mechanic_2999,mechanic_3000,mechanic_3001,mechanic_3002,mechanic_3003,mechanic_3004,mechanic_3005,mechanic_3006,mechanic_3007,mechanic_99999
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
11,0,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
12,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Item Similarities

Instead of user ratings, `cosine_similarity` will now compute the similarity matrix using the board game features.
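
With binary feature vectors, cosine similarity reduces to the number of shared features divided by the geometric mean of the two feature counts. A small sketch with made-up feature rows, assuming every feature appears at most once per game:

```python
import numpy as np

# Hypothetical one-hot rows: 1 means the game has that feature
game_a = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 4 features
game_b = np.array([0, 0, 1, 1, 1, 1, 1, 0])   # 5 features, 2 shared with game_a

shared = int(game_a @ game_b)
by_counts = shared / np.sqrt(game_a.sum() * game_b.sum())
by_vectors = game_a @ game_b / (np.linalg.norm(game_a) * np.linalg.norm(game_b))

print(shared, by_counts, by_vectors)
```

Both expressions give the same value. One consequence is that a game with a long list of features (for example many publishers) has a large feature count, which lowers its similarity to games that share only a few of those features.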

In [172]:
content_sim_df = pd.DataFrame(cosine_similarity(content_df), columns=content_df.index, index=content_df.index)
content_sim_df.head()

bgg_id,3,9,10,11,12,13,14,16,17,25,...,29603,35052,41066,41863,55829,61692,68264,91080,130960,233078
3,1.0,0.370625,0.113228,0.1681,0.286251,0.236096,0.108786,0.074125,0.0,0.118262,...,0.062017,0.050637,0.042796,0.0,0.054393,0.056614,0.052414,0.046225,0.09245,0.196116
9,0.370625,1.0,0.218218,0.10799,0.157622,0.091003,0.104828,0.142857,0.0,0.113961,...,0.0,0.09759,0.082479,0.0,0.104828,0.109109,0.0,0.089087,0.089087,0.125988
10,0.113228,0.218218,1.0,0.247436,0.060193,0.13901,0.080064,0.0,0.0,0.087039,...,0.0,0.0,0.062994,0.0,0.0,0.0,0.0,0.0,0.068041,0.048113
11,0.1681,0.10799,0.247436,1.0,0.148939,0.240772,0.118864,0.0,0.0,0.0,...,0.0,0.0,0.031174,0.0,0.0,0.0,0.0,0.0,0.067344,0.047619
12,0.286251,0.157622,0.060193,0.148939,1.0,0.125511,0.173494,0.157622,0.03872,0.0,...,0.0,0.0,0.045502,0.0,0.0,0.0,0.0,0.0,0.0,0.069505


### Evaluation of Recommender

We want to evaluate our recommender to see if it matches up to our intuition. We will use a board game in our dataset to do the evaluation.

In [173]:
# Top 20 similar board games
content_input = 266192
print(bg_mapper[str(content_input)])
content_sim = content_sim_df[content_input]
content_sim.index = content_sim_df.index.astype(str).map(bg_mapper)
content_sim = content_sim[content_sim > 0].drop(bg_mapper[str(content_input)])
content_sim.sort_values(ascending=False).head(20)

Wingspan


bgg_id
Tapestry                                  0.467910
Viticulture Essential Edition             0.448211
Charterstone                              0.434122
Terraforming Mars                         0.385794
Between Two Castles of Mad King Ludwig    0.379473
7 Wonders                                 0.335410
Nevermore                                 0.333712
Everdell                                  0.328688
Linko!                                    0.316228
Die Pyramiden des Jaguar                  0.316228
Scythe                                    0.303822
Coimbra                                   0.298142
The Isle of Cats                          0.293610
Jaipur                                    0.290474
Bob Ross: Art of Chill Game               0.286039
Birds of a Feather                        0.286039
Duelosaur Island                          0.285774
Sushi Go!                                 0.283981
The Bloody Inn                            0.282843
Thurn and Taxis         

We evaluate the recommender using the same input as above, i.e. the same board game for both the item-based and the content-based recommender. We can immediately see a stark difference between the top 20 recommendations of the two recommenders. In the content-based recommender, the top 3 board games are by the same publisher as the input board game, showcasing the impact of including board game features.

However, this recommender system makes an implicit assumption that the features we used to compute the similarities are all important to every person. In fact, different people may weigh the importance of different features differently, i.e. person 'A' may be looking for board games by the same designer, whilst person 'B' may be looking for board games by a different designer but within the same game category. As such, it is difficult to generalize how much importance each person places on a particular feature.

Moreover, because we did not take the user ratings into account, we may lose out on the relevance of the board games (i.e. board games which were rated more frequently in the last 3 years may be more relevant today and should carry more weight in recommendations). A better solution may be to build a **hybrid recommender system** instead, leveraging the advantages of each branch of recommenders so that each covers for the other's shortcomings.
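
As one possible starting point, and only a sketch, a hybrid score could be a linear blend of the item-based and content-based similarities for the same input game. The scores, the `blend` helper, and the mixing weight `alpha` below are all assumptions; a weighted blend is just one of several hybrid designs and is not implemented in this notebook.

```python
import numpy as np

# Hypothetical similarity scores of three candidate games to one input game
cf_sim = np.array([0.78, 0.10, 0.65])        # from rating patterns
content_sim = np.array([0.05, 0.47, 0.39])   # from game features

def blend(cf, content, alpha=0.5):
    """Linear blend; alpha=1 is pure collaborative, alpha=0 is pure content-based."""
    return alpha * cf + (1 - alpha) * content

hybrid = blend(cf_sim, content_sim, alpha=0.5)
ranking = np.argsort(hybrid)[::-1]   # candidate indices, best first
print(hybrid, ranking)
```

The third candidate ranks first because it scores reasonably well on both measures, whereas the other two are strong on only one of them.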