# Building a recommender system for board games

I am going to start by predicting a game's average rating, and then begin to build this into a recommender system. This is the initial script.

The following data are available to us:

objectid - the identifier on boardgamegeek.com<br>
name - name of the game<br>
yearpublished - the year the game was published<br>
sortindex - rank of the game on boardgamegeek.com<br>
minplayers - minimum number of players per the publishers<br>
maxplayers - maximum number of players per the publishers<br>
minplaytime - minimum playtime required per the publishers<br>
maxplaytime - maximum playtime per the publishers<br>
minage - minimum age requirement per the publishers<br>
min_community - minimum players per the community<br>
max_community - max players per the community<br>
totalvotes - total number of community votes<br>
playerage - minimum age requirement per the community<br>
languagedependence - a ranking of how much in-game text is required during game play<br>
1: none, 5: unplayable in other language<br>
usersrated - number of users that have rated the game<br>
average - user average rating from 1-10<br>
baverage - adjusted average from the site (1-10), an anti-skewing effort by BGG<br>
in which the system adds mid-range ratings<br>
stddev - average standard deviation of a rating<br>
avgweight - average complexity ("weight") rating from 1-5<br>
numweights - number of weight votes<br>
numgeeklists - number of geeks with game on list<br>
numtrading - number of people trading the game<br>
numwanting - number of people wanting the game<br>
numcomments - number of comments on the site on this game<br>
siteviews - number of views on the site<br>
numplays - number of times game was played (according to site users?)<br>
numplays_month - number of plays per month<br>
news - number of news articles on the game<br>
blogs - number of blogs regarding the game<br>
weblink - number of weblinks for the game<br>
podcast - number of podcasts on the game<br>
label - category of game (mostly boardgame)<br>
boardgamedesigner_cnt - count of designers<br>
boardgameartist_cnt - artist count<br>
boardgamepublisher_cnt - publisher count<br>
boardgamehonor_cnt - awards count<br>
boardgamecategory_cnt - category count<br>
boardgamemechanic_cnt - game mechanics count<br>
boardgameexpansion_cnt - expansion count<br>
boardgameversion_cnt - version count (languages)<br>
boardgamefamily_cnt - game family count<br>
boardgamedesigner - list of game designers<br>
boardgameartist - list of game artists<br>
boardgamepublisher - list of publishers<br>
boardgamehonor - list of awards<br>
boardgamecategory - list of categories<br>
boardgameversion - list of versions<br>
boardgamemechanic - a list of mechanics<br>
boardgameexpansion - a list of expansions<br>
boardgamefamily - a list of board game families<br>
description - full text description of game<br>
gamelink - a link to the game on bgg

Load packages

In [None]:
import math
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import warnings 
import seaborn as sns
import calendar
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

Read the data

In [None]:
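import ast

# The list columns are stored as strings; ast.literal_eval parses them into
# Python lists without executing arbitrary code the way eval would.
data = pd.read_csv('../input/20000-boardgames-dataset/boardgames1.csv',
                   converters={'boardgamemechanic': ast.literal_eval,
                               'boardgamecategory': ast.literal_eval})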
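
To make the converter step concrete, here is a minimal sketch on a made-up two-row CSV (the game names and categories are hypothetical, not taken from the dataset). It assumes the list columns are stored as Python-style list literals inside quoted strings, and parses them with `ast.literal_eval`, which only accepts literals.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for the real file: the category column holds
# Python-style list literals wrapped in quoted CSV fields.
csv_text = '''name,boardgamecategory
Game A,"['Economic', 'Negotiation']"
Game B,"['Fantasy']"
'''

toy = pd.read_csv(io.StringIO(csv_text),
                  converters={'boardgamecategory': ast.literal_eval})

# Each cell is now a real Python list rather than a string.
print(toy['boardgamecategory'].tolist())
```

Without a converter, each cell would stay a plain string and the later `.str.join` step would operate on characters instead of list items.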

First, let's drop some columns that don't contain much information (or contain too much, such as baverage and sortindex, which closely track the rating we are predicting)

In [None]:
data=data.drop(['label', 'boardgamedesigner', 'boardgameartist', 'boardgameversion','boardgamefamily',
                'boardgamehonor', 'boardgameexpansion', 'description', 'gamelink', 'boardgamepublisher',
                'min_community', 'max_community', 'baverage', 'sortindex'], axis=1)

Some potentially useful variables are in list form, so let's break these up into dummies

In [None]:
cats = data['boardgamecategory'].str.join('|').str.get_dummies()
mecs=data['boardgamemechanic'].str.join('|').str.get_dummies()
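
As a small illustration of what the join-then-dummies trick does, here is a sketch on a toy frame (the games and labels are invented for the example). Each list is joined into a single `'|'`-separated string, and `str.get_dummies` then splits on `'|'` (its default separator) into one 0/1 column per label.

```python
import pandas as pd

# Hypothetical toy frame: each cell holds a list of category labels.
toy = pd.DataFrame(
    {'boardgamecategory': [['Economic', 'Negotiation'], ['Fantasy'], ['Economic']]},
    index=['Game A', 'Game B', 'Game C'],
)

# ['Economic', 'Negotiation'] -> 'Economic|Negotiation' -> indicator columns.
toy_cats = toy['boardgamecategory'].str.join('|').str.get_dummies()
print(toy_cats)
```

One caveat: a label that itself contains a `'|'` character would be split into two columns, so this assumes the labels are free of that character.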

Some categories appear too infrequently, so let's set a minimum of 10 occurrences to keep

In [None]:
# Series.iteritems() was removed in pandas 2.0; items() is the replacement.
cats.drop([col for col, val in cats.sum().items() if val < 10], axis=1, inplace=True)
mecs.drop([col for col, val in mecs.sum().items() if val < 10], axis=1, inplace=True)
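
The same filter can also be written as a boolean column mask. A minimal sketch on a toy indicator frame, with a hypothetical threshold of 2 in place of 10 so the effect is visible on four rows:

```python
import pandas as pd

# Hypothetical indicator frame: one row per game, one column per category.
toy_cats = pd.DataFrame({
    'Economic': [1, 1, 0, 1],
    'Fantasy': [0, 1, 0, 0],
    'Trivia': [0, 0, 0, 1],
})
min_count = 2  # illustrative threshold; the notebook uses 10

# Column sums count how many games carry each label.
counts = toy_cats.sum()

# Keep only the columns whose count reaches the threshold.
kept = toy_cats.loc[:, counts >= min_count]
print(kept.columns.tolist())
```

This returns a new frame instead of modifying `toy_cats` in place, which can be easier to reason about when cells are re-run out of order.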

Add these into the main df and drop the list cols

In [None]:
data=pd.concat([data, cats], axis=1)
data=pd.concat([data, mecs], axis=1).drop(['boardgamecategory', 'boardgamemechanic'], axis=1)

Move the name to index

In [None]:
data.index=data.name

Get our features and labels (and drop some pointless cols)

In [None]:
X = data.drop(['average','name','objectid'], axis=1).select_dtypes(['number'])
y= data.average

Split to train/test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

Set up a scaling and random forest (RF) regression pipeline

In [None]:
estimators = [] 
estimators.append(('standardize', StandardScaler())) 
estimators.append(('rf', RandomForestRegressor()))
pipeline = Pipeline(estimators)
pipeline.fit(np.array(X_train), np.array(y_train))
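
For reference, here is a self-contained sketch of the same scale-then-forest pipeline on synthetic data, so it can be run without the dataset. The data, the `toy_` names, `n_estimators=50` and the fixed `random_state` are all assumptions made for the example. The point it shows is that the scaler is fitted on the training split only and then reused unchanged at prediction time.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends mainly on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

toy_pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
])

# fit() learns the scaling statistics from X_tr only, then trains the forest.
toy_pipeline.fit(X_tr, y_tr)

# predict() and score() apply the stored scaling before calling the forest.
toy_pred = toy_pipeline.predict(X_te)
toy_score = toy_pipeline.score(X_te, y_te)  # R^2 on the held-out split
print(round(toy_score, 3))
```

Tree-based models split on feature order rather than magnitude, so the scaling step is not expected to change much for a random forest. It does become useful if the final step is later swapped for a distance-based model such as `KNeighborsRegressor`.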

Predict the test data

In [None]:
y_pred=pipeline.predict(np.array(X_test))

We'll use the R^2 score, which represents the proportion of variance in the target that is explained by the
features (see: https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)

In [None]:
# A single target, so the default averaging returns one float.
r2 = r2_score(y_test, y_pred)
print(r2)

R^2 on the held-out test set is high, at around 0.9, suggesting the model fits well.
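
To make the metric concrete, R^2 can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. A small worked sketch on four invented ratings, also showing `mean_absolute_error` as a complementary error measure in rating points:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted ratings, chosen for easy arithmetic.
y_true = np.array([6.0, 7.0, 8.0, 5.0])
y_hat = np.array([6.5, 6.5, 7.5, 5.5])

# Every prediction is off by 0.5, so SS_res = 4 * 0.25 = 1.0.
ss_res = np.sum((y_true - y_hat) ** 2)

# The mean of y_true is 6.5, so SS_tot = 0.25 + 0.25 + 2.25 + 2.25 = 5.0.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot             # 1 - 1/5 = 0.8
mae = mean_absolute_error(y_true, y_hat)    # 0.5 rating points

print(r2_manual, r2_score(y_true, y_hat), mae)
```

R^2 is unitless, while the MAE reads directly on the 1-10 rating scale, so reporting both gives a fuller picture of the fit.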

In [None]:
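# Predicted vs. actual ratings on the test set
plt.scatter(y_test, y_pred)
plt.xlabel('actual average rating')
plt.ylabel('predicted average rating')
plt.show()
# Fit a standalone forest (no scaling) to inspect its feature importances
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
feat_importances = pd.Series(rf.feature_importances_, index=X_train.columns)
feat_importances.nlargest(25).plot(kind='barh')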
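
As a sanity check on how to read that bar chart, here is a sketch on synthetic data where only one feature drives the target (the column names `signal`, `noise_a` and `noise_b` are made up). The impurity-based importances from `feature_importances_` sum to 1, and the informative feature should dominate.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only 'signal' influences the target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise_a': rng.normal(size=300),
    'noise_b': rng.normal(size=300),
})
y_toy = 3.0 * X_toy['signal'] + rng.normal(scale=0.1, size=300)

toy_rf = RandomForestRegressor(n_estimators=50, random_state=0)
toy_rf.fit(X_toy, y_toy)

# Label the importances with the column names and sort them for reading.
toy_importances = pd.Series(toy_rf.feature_importances_, index=X_toy.columns)
print(toy_importances.sort_values(ascending=False))
```

On the real data the picture is less clean: these importances describe what the forest used to predict the rating, not what causes a game to be rated highly.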