### Board Game Analysis

I created this board game analysis for two reasons:
* I love board games and would like to explore this data
* I would like to show what I know!

I pulled this dataset from Kaggle. The data was originally collected from BoardGameGeek.

In [1]:
#Imports
import numpy as np
import pandas as pd
import csv
import matplotlib
import re
from collections import Counter
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error


In [2]:
#Read in CSV file
Games=pd.read_csv("bgg_dataset.csv", sep=";")
#First 5 rows
Games.head(5)

Unnamed: 0,ID,Name,Year Published,Min Players,Max Players,Play Time,Min Age,Users Rated,Rating Average,BGG Rank,Complexity Average,Owned Users,Mechanics,Domains
0,174430.0,Gloomhaven,2017.0,1,4,120,14,42055,"8,79",1,"3,86",68323.0,"Action Queue, Action Retrieval, Campaign / Bat...","Strategy Games, Thematic Games"
1,161936.0,Pandemic Legacy: Season 1,2015.0,2,4,60,13,41643,"8,61",2,"2,84",65294.0,"Action Points, Cooperative Game, Hand Manageme...","Strategy Games, Thematic Games"
2,224517.0,Brass: Birmingham,2018.0,2,4,120,14,19217,"8,66",3,"3,91",28785.0,"Hand Management, Income, Loans, Market, Networ...",Strategy Games
3,167791.0,Terraforming Mars,2016.0,1,5,120,12,64864,"8,43",4,"3,24",87099.0,"Card Drafting, Drafting, End Game Bonuses, Han...",Strategy Games
4,233078.0,Twilight Imperium: Fourth Edition,2017.0,3,6,480,14,13468,"8,70",5,"4,22",16831.0,"Action Drafting, Area Majority / Influence, Ar...","Strategy Games, Thematic Games"


In [3]:
#How much data do we have?
print(len(Games), "rows in the dataset.")

20343 rows in the dataset.


In [4]:
#Look if there is missing data
Games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20343 entries, 0 to 20342
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  20327 non-null  float64
 1   Name                20343 non-null  object 
 2   Year Published      20342 non-null  float64
 3   Min Players         20343 non-null  int64  
 4   Max Players         20343 non-null  int64  
 5   Play Time           20343 non-null  int64  
 6   Min Age             20343 non-null  int64  
 7   Users Rated         20343 non-null  int64  
 8   Rating Average      20343 non-null  object 
 9   BGG Rank            20343 non-null  int64  
 10  Complexity Average  20343 non-null  object 
 11  Owned Users         20320 non-null  float64
 12  Mechanics           18745 non-null  object 
 13  Domains             10184 non-null  object 
dtypes: float64(3), int64(6), object(5)
memory usage: 2.2+ MB


It looks like we have some missing values in the ID column. These could be expansions to games. Rating Average and Complexity Average are not registering as numbers. If you look back at the head, you can see that they use a comma as the decimal separator, so this will need to be fixed. I am also going to drop the rows where the Owned Users value is missing. These would be games no one has access to yet. They may be fun to test towards the end of our analysis, to see if we can predict their ratings from mechanics or other variables.

### Data Cleaning

In [5]:
# Create a function to fix decimal commas for Rating and Complexity
def commaremoval(data):
    # Swap the decimal comma for a point, then parse the strings as numbers
    return pd.to_numeric(data.str.replace(",", ".", regex=False))
Games["Rating Average"] = commaremoval(Games["Rating Average"])
Games["Complexity Average"] = commaremoval(Games["Complexity Average"])
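As an alternative to converting after loading, pandas can parse decimal commas while reading the file. This is only a sketch: the two-row sample below is a hypothetical stand-in for `bgg_dataset.csv`, and it assumes the real file uses the same semicolon-separated layout.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the dataset:
# semicolon-separated fields with decimal commas.
sample = io.StringIO(
    "Name;Rating Average;Complexity Average\n"
    "Gloomhaven;8,79;3,86\n"
    "Brass: Birmingham;8,66;3,91\n"
)

# decimal="," tells pandas to treat the comma as the decimal mark,
# so both average columns are parsed as floats straight away.
df = pd.read_csv(sample, sep=";", decimal=",")
print(df.dtypes)
```

With this option the two average columns should load as numeric columns, so a separate conversion step would not be needed.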

In [6]:
#Drop games with a missing Owned Users value (copy so later column assignments do not warn)
Games=Games[Games["Owned Users"].notna()].copy()

In [7]:
#Change all / to , for easier separation
Games["Mechanics"] = Games["Mechanics"].str.replace("/", ",", regex=False)
Games["Domains"] = Games["Domains"].str.replace("/", ",", regex=False)

In [8]:
#Checking Data Again
Games.head(5)

Unnamed: 0,ID,Name,Year Published,Min Players,Max Players,Play Time,Min Age,Users Rated,Rating Average,BGG Rank,Complexity Average,Owned Users,Mechanics,Domains
0,174430.0,Gloomhaven,2017.0,1,4,120,14,42055,8.79,1,3.86,68323.0,"Action Queue, Action Retrieval, Campaign , Bat...","Strategy Games, Thematic Games"
1,161936.0,Pandemic Legacy: Season 1,2015.0,2,4,60,13,41643,8.61,2,2.84,65294.0,"Action Points, Cooperative Game, Hand Manageme...","Strategy Games, Thematic Games"
2,224517.0,Brass: Birmingham,2018.0,2,4,120,14,19217,8.66,3,3.91,28785.0,"Hand Management, Income, Loans, Market, Networ...",Strategy Games
3,167791.0,Terraforming Mars,2016.0,1,5,120,12,64864,8.43,4,3.24,87099.0,"Card Drafting, Drafting, End Game Bonuses, Han...",Strategy Games
4,233078.0,Twilight Imperium: Fourth Edition,2017.0,3,6,480,14,13468,8.7,5,4.22,16831.0,"Action Drafting, Area Majority , Influence, Ar...","Strategy Games, Thematic Games"


We now have a nice list of 20,320 games to look at! Let's continue by breaking down the Mechanics and Domains into usable data for scikit-learn. I will use one-hot encoding.

In [9]:
#One Hot Encode Mechanics
GamesOHEM = Games['Mechanics'].str.get_dummies(sep=',')
#One Hot Encode Domains
GamesOHED = Games['Domains'].str.get_dummies(sep=',')
#Join tables
#Games1 holds both encodings but is not used below
Games1 = pd.concat([GamesOHEM, GamesOHED], axis=1)
#GamesF adds only the Mechanics columns to the original data
GamesF = pd.concat([Games, GamesOHEM], axis=1)
GamesF.head()

Unnamed: 0,ID,Name,Year Published,Min Players,Max Players,Play Time,Min Age,Users Rated,Rating Average,BGG Rank,...,Team-Based Game,Tile Placement,Time Track,Trading,Traitor Game,Trick-taking,Variable Phase Order,Variable Player Powers,Voting,Worker Placement
0,174430.0,Gloomhaven,2017.0,1,4,120,14,42055,8.79,1,...,0,0,0,0,0,0,0,0,0,0
1,161936.0,Pandemic Legacy: Season 1,2015.0,2,4,60,13,41643,8.61,2,...,0,0,0,0,0,0,0,0,0,0
2,224517.0,Brass: Birmingham,2018.0,2,4,120,14,19217,8.66,3,...,0,0,0,0,0,0,0,0,0,0
3,167791.0,Terraforming Mars,2016.0,1,5,120,12,64864,8.43,4,...,0,0,0,0,0,0,0,0,0,0
4,233078.0,Twilight Imperium: Fourth Edition,2017.0,3,6,480,14,13468,8.7,5,...,0,0,0,0,0,0,0,0,0,0
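Because the Mechanics strings separate entries with a comma followed by a space, splitting on a bare comma may leave leading or trailing whitespace in the column names, so the same mechanic can end up in two different columns. Here is a minimal sketch of one way to avoid that, using a hypothetical three-row Series rather than the real dataset:

```python
import pandas as pd

# Hypothetical sample shaped like the Mechanics column after "/" was replaced by ",".
mechanics = pd.Series([
    "Action Queue, Action Retrieval, Campaign , Battle Card Driven",
    "Action Retrieval, Hand Management",
    None,  # games with no listed mechanics
])

# Splitting on a bare comma keeps the surrounding spaces in the labels.
naive = mechanics.str.get_dummies(sep=",")

# Collapse "optional spaces, comma, optional spaces" into a single "|" first,
# then trim the ends, so every label is clean before encoding.
cleaned = (
    mechanics
    .str.replace(r"\s*,\s*", "|", regex=True)
    .str.strip()
    .str.get_dummies(sep="|")
)

print(sorted(naive.columns))
print(sorted(cleaned.columns))
```

In the naive version, `Action Retrieval` appears both with and without a leading space. The cleaned version has one column per mechanic, and the row with no mechanics is encoded as all zeros.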


### Data Analysis
Questions I want to answer:

* Averages for players, play time, etc.
* What Mechanics seem to be the most popular?
* What Mechanics seem to correlate to the highest scores?
* Can we create a model to predict success of a game?

In [10]:
#Averages
Games.describe()

Unnamed: 0,ID,Year Published,Min Players,Max Players,Play Time,Min Age,Users Rated,Rating Average,BGG Rank,Complexity Average,Owned Users
count,20320.0,20320.0,20320.0,20320.0,20320.0,20320.0,20320.0,20320.0,20320.0,20320.0,20320.0
mean,108210.740748,1984.22623,2.019636,5.673327,91.326772,9.600246,841.778691,6.403363,10170.563976,1.990994,1408.457628
std,98678.347583,214.117399,0.690545,15.239657,545.749554,3.64579,3513.464339,0.935762,5873.389392,0.849022,5040.179315
min,1.0,-3500.0,0.0,0.0,0.0,0.0,30.0,1.05,1.0,0.0,0.0
25%,11035.25,2001.0,2.0,4.0,30.0,8.0,55.0,5.82,5084.75,1.33,146.0
50%,88928.0,2011.0,2.0,4.0,45.0,10.0,120.0,6.43,10168.5,1.97,309.0
75%,192924.75,2016.0,2.0,6.0,90.0,12.0,385.0,7.03,15258.25,2.54,864.0
max,331787.0,2022.0,10.0,999.0,60000.0,25.0,102214.0,9.58,20344.0,5.0,155312.0


There is a game that plays for 60,000 minutes? That's more than 41 days. Let's take a look.
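As a quick check on that figure, the maximum Play Time in the summary table above is 60,000 minutes, which converts to days as follows:

```python
# Maximum Play Time taken from the describe() output, in minutes.
play_time_minutes = 60000
minutes_per_day = 60 * 24

days = play_time_minutes / minutes_per_day
print(f"{days:.1f} days")
```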

In [11]:
Games.nlargest(5,['Play Time'])

Unnamed: 0,ID,Name,Year Published,Min Players,Max Players,Play Time,Min Age,Users Rated,Rating Average,BGG Rank,Complexity Average,Owned Users,Mechanics,Domains
13420,4815.0,The Campaign for North Africa: The Desert War ...,1979.0,8,10,60000,14,146,6.1,13422,4.71,385.0,"Dice Rolling, Hexagon Grid, Simulation",Wargames
3208,29285.0,Case Blue,2007.0,2,2,22500,12,289,8.26,3210,4.58,711.0,"Dice Rolling, Hexagon Grid, Simulation",Wargames
6035,46669.0,1914: Offensive à outrance,2013.0,2,4,17280,0,108,7.98,6037,4.07,661.0,"Dice Rolling, Hexagon Grid, Simulation",Wargames
7895,158793.0,Atlantic Wall: D-Day to Falaise,2014.0,2,6,14400,16,69,8.08,7897,4.89,328.0,"Hexagon Grid, Zone of Control",Wargames
1322,254.0,Empires in Arms,1983.0,2,7,12000,14,1238,7.6,1323,4.42,2129.0,"Area Movement, Dice Rolling, Movement Points, ...",Wargames


In [12]:
#Mechanics Popular and High Scoring
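One way to approach the popularity and scoring questions is to sum each one-hot column and to average the ratings of the games that use each mechanic. The toy frame, its column names, and its values below are made up for illustration. On the real data, the same steps would presumably run over the mechanic columns of `GamesF`.

```python
import pandas as pd

# Made-up toy data: one rating column plus two one-hot mechanic columns.
toy = pd.DataFrame({
    "Rating Average": [8.0, 7.0, 6.0, 5.0],
    "Dice Rolling": [1, 1, 0, 1],
    "Hand Management": [1, 0, 1, 0],
})
mechanic_cols = ["Dice Rolling", "Hand Management"]

# Popularity: how many games use each mechanic.
popularity = toy[mechanic_cols].sum().sort_values(ascending=False)

# Scoring: mean rating of the games that use each mechanic.
mean_rating = pd.Series(
    {m: toy.loc[toy[m] == 1, "Rating Average"].mean() for m in mechanic_cols}
).sort_values(ascending=False)

print(popularity)
print(mean_rating)
```

A mechanic used by only a handful of games can have a misleadingly high mean, so it may be worth filtering on a minimum count before ranking.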

### Model Creation

In [13]:
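#Create Feature Variables
#Drop the text columns; the one-hot Mechanics columns stay in as features
a=GamesF.drop(['Mechanics','Domains','Name'],axis=1)
#Note: the scaler is fit here but never applied, so the models below see unscaled features
scaler = StandardScaler()
scaler.fit(a)
X=a.drop(['Rating Average'],axis=1)
y=a['Rating Average']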
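Fitting a `StandardScaler` does not change the data by itself. The features only become scaled once `transform` is called. Below is a minimal sketch of the usual pattern, fitting on the training split and applying the same scaling to both splits. The random data and the `*_demo` names are stand-ins, not the board game features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features and target (not the board game data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101)

# Learn the mean and standard deviation from the training rows only,
# then apply that same transformation to the test rows.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step.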




In [14]:
#Split Train and Test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101)
print('Train size: ', len(X_train), 'Test size: ', len(X_test))

Train size:  16256 Test size:  4064


In [15]:
#Multiple Linear Regression
lr_model=LinearRegression()
lr_model.fit(X_train,y_train)

#Note: X and y hold every row (train and test combined), not just the training split
y_pred = lr_model.predict(X)
print('Results for multiple linear regression on all data')
print(' Default settings')
print('Internal parameters:')
print(' Bias is ', lr_model.intercept_)
print(' Coefficients', lr_model.coef_)
print(' Score', lr_model.score(X,y))
print('MAE is ', mean_absolute_error(y, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y, y_pred)))
print('MSE is ', mean_squared_error(y, y_pred))
print('R^2 ', r2_score(y,y_pred))


Results for multiple linear regression on all data
 Default settings
Internal parameters:
 Bias is  6.715736982442919
 Coefficients [ 2.99145298e-06 -1.27107364e-05 -1.53016206e-02 -1.99241500e-04
  5.81200292e-05 -1.03015648e-02  1.14863380e-04 -9.44812150e-05
  2.24244734e-01 -8.36925955e-05  1.32458916e+06 -1.21557543e+00
  3.45281902e-02 -4.81208786e-02 -1.42998927e-02  8.99364638e-02
  2.67816924e-01 -6.78311493e+05  5.92786895e-03  1.49891234e-01
 -4.12683995e+04 -4.18077491e-01  1.67944938e-01 -3.05147291e-01
 -1.21131783e+04  3.98282862e-02  3.75903072e-02 -7.44273415e-02
  2.99328617e-02 -8.25303039e-02  1.43725561e+05  1.02421521e-01
 -1.29386719e-01 -1.34389184e+00  1.26014449e-01 -1.43725477e+05
  1.96001705e+04 -6.01774297e-02  1.92095126e-01 -1.71934114e-01
 -3.14146720e-01  3.08334533e-01  2.23136681e-01 -8.38454009e+04
  1.11113769e-01 -1.51385212e-01  9.99946414e-02 -1.57324300e-01
 -9.87044250e-02  1.48161112e-01  6.94076069e-02  2.67730776e-01
  2.91001000e-01 -

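That was ugly. The R squared value is strongly negative, meaning we would be better off just predicting the average rating. The huge coefficients that nearly cancel each other out (for example 1.437e+05 and -1.437e+05) suggest that some of the feature columns are collinear. Let's try another approach.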

In [16]:
# Decision Tree Regressor
DTR_model=DecisionTreeRegressor()
DTR_model.fit(X_train,y_train)

#Note: X and y hold every row (train and test combined), not just the training split
y_pred = DTR_model.predict(X)
print('Results for Decision Tree Regressor on all data')
print(' Default settings')
print(' Score', DTR_model.score(X,y))
print('MAE is ', mean_absolute_error(y, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y, y_pred)))
print('MSE is ', mean_squared_error(y, y_pred))
print('R^2 ', r2_score(y,y_pred))



Results for Decision Tree Regressor on all data
 Default settings
 Score 0.976416596046279
MAE is  0.0385871062992126
RMSE is  0.14370034233023515
MSE is  0.020649788385826772
R^2  0.976416596046279


That is much better. Evaluated on the full dataset, most of which the model was trained on, the predictions track the actual ratings closely. Let's try it with the test data alone.

In [17]:
y_pred_test = DTR_model.predict(X_test)
print('Results for Decision Tree Regressor on test data')
print(' Default settings')
print(' Score', DTR_model.score(X_test,y_test))
print('MAE is ', mean_absolute_error(y_test, y_pred_test))
print('RMSE is ', np.sqrt(mean_squared_error(y_test, y_pred_test)))
print('MSE is ', mean_squared_error(y_test, y_pred_test))
print('R^2 ', r2_score(y_test,y_pred_test))

Results for Decision Tree Regressor on test data
 Default settings
 Score 0.8823960369996382
MAE is  0.19293553149606296
RMSE is  0.3213237338403963
MSE is  0.10324894192913386
R^2  0.8823960369996382


It looks like the model may be overfitting. Our R squared value dropped from about 0.98 to 0.88 on the test data. 0.88 is still a pretty good score, but let's try a couple more models.
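A single train/test split gives only one estimate of how well a model generalizes. The sketch below uses k-fold cross-validation and a depth limit, one common way to rein in an overfitting tree. It runs on synthetic data from `make_regression`, and `max_depth=5` is an arbitrary illustrative value, not a tuned setting for the board game data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real features.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0)

# Limiting the depth keeps the tree from memorizing individual rows.
shallow_tree = DecisionTreeRegressor(max_depth=5, random_state=0)

# Five folds: each row is used for validation exactly once.
scores = cross_val_score(shallow_tree, X_demo, y_demo, cv=5, scoring="r2")
print("R^2 per fold:", scores.round(3))
print("Mean R^2:", scores.mean().round(3))
```

Comparing the spread of the fold scores across a few depth values would show whether the gap between training and test scores comes from tree depth.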

In [18]:
# Random Forest Regressor
RFR_model=RandomForestRegressor()
RFR_model.fit(X_train,y_train)

#Note: X and y hold every row (train and test combined), not just the training split
y_pred = RFR_model.predict(X)
print('Results for Random Forest Regressor on all data')
print(' Default settings')
print(' Score', RFR_model.score(X,y))
print('MAE is ', mean_absolute_error(y, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y, y_pred)))
print('MSE is ', mean_squared_error(y, y_pred))
print('R^2 ', r2_score(y,y_pred))

y_pred_test = RFR_model.predict(X_test)
print('Results for Random Forest Regressor on test data')
print(' Default settings')
print(' Score', RFR_model.score(X_test,y_test))
print('MAE is ', mean_absolute_error(y_test, y_pred_test))
print('RMSE is ', np.sqrt(mean_squared_error(y_test, y_pred_test)))
print('MSE is ', mean_squared_error(y_test, y_pred_test))
print('R^2 ', r2_score(y_test,y_pred_test))

Results for Random Forest Regressor on all data
 Default settings
 Score 0.9806758905458558
MAE is  0.06621390748031503
RMSE is  0.13007813515753908
MSE is  0.016920321246063002
R^2  0.9806758905458558
Results for Random Forest Regressor on test data
 Default settings
 Score 0.9373359188304733
MAE is  0.13328390748031502
RMSE is  0.2345530931555973
MSE is  0.05501515350885831
R^2  0.9373359188304733


This is pretty good. Let's try a neural network regressor just to see.

In [19]:
from sklearn.neural_network import MLPRegressor

NN_model=MLPRegressor()
NN_model.fit(X_train,y_train)

#Note: X and y hold every row (train and test combined), not just the training split
y_pred = NN_model.predict(X)
print('Results for Neural Net Regressor on all data')
print(' Default settings')
print(' Score', NN_model.score(X,y))
print('MAE is ', mean_absolute_error(y, y_pred))
print('RMSE is ', np.sqrt(mean_squared_error(y, y_pred)))
print('MSE is ', mean_squared_error(y, y_pred))
print('R^2 ', r2_score(y,y_pred))

y_pred_test = NN_model.predict(X_test)
print('Results for Neural Net Regressor on test data')
print(' Default settings')
print(' Score', NN_model.score(X_test,y_test))
print('MAE is ', mean_absolute_error(y_test, y_pred_test))
print('RMSE is ', np.sqrt(mean_squared_error(y_test, y_pred_test)))
print('MSE is ', mean_squared_error(y_test, y_pred_test))
print('R^2 ', r2_score(y_test,y_pred_test))

Results for Neural Net Regressor on all data
 Default settings
 Score -449.85612253845477
MAE is  14.682517741981306
RMSE is  19.868887172631332
MSE is  394.77267747875385
R^2  -449.85612253845477
Results for Neural Net Regressor on test data
 Default settings
 Score -502.18317832816274
MAE is  14.725500610865028
RMSE is  21.01816918291358
MSE is  441.7634358015781
R^2  -502.18317832816274
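One possible reason for the strongly negative scores is that multi-layer perceptrons are sensitive to feature scaling, and the features here span very different ranges (IDs in the hundreds of thousands next to 0/1 columns). The sketch below puts a `StandardScaler` in front of the network in a `Pipeline`. It uses synthetic data, and the layer size and iteration count are illustrative assumptions, so it shows the pattern rather than a tuned model for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real features.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101)

# The pipeline fits the scaler on the training rows and reuses it at predict time.
scaled_nn = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
])
scaled_nn.fit(X_tr, y_tr)

print("Test R^2:", scaled_nn.score(X_te, y_te))
```

Whether this helps on the board game features would need to be checked by rerunning the cell above with the same pipeline.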


The Random Forest Regressor seems to work best for predicting our board game ratings. We could set up a pipeline that lets you submit your board game's criteria through an API and returns a predicted rating for your game.
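The core of such a service could be a small helper that turns one game's attributes into a single-row frame with the same columns the model was trained on. The `predict_rating` function below is a hypothetical sketch. The toy model uses three features and four rows borrowed from the head of the dataset, far fewer than the real feature set.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy training data: a few columns from the first rows shown in the head.
train = pd.DataFrame({
    "Min Players": [1, 2, 2, 3],
    "Play Time": [120, 60, 120, 480],
    "Complexity Average": [3.86, 2.84, 3.91, 4.22],
})
ratings = pd.Series([8.79, 8.61, 8.66, 8.70])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(train, ratings)


def predict_rating(model, feature_columns, game):
    """Hypothetical helper: predict a rating from a dict of game attributes."""
    # Align the input with the training columns; missing features default to 0.
    row = pd.DataFrame([game]).reindex(columns=feature_columns, fill_value=0)
    return float(model.predict(row)[0])


new_game = {"Min Players": 2, "Play Time": 90, "Complexity Average": 3.0}
predicted = predict_rating(model, list(train.columns), new_game)
print(round(predicted, 2))
```

An API endpoint would then only need to parse the request into a dict like `new_game` and return the helper's result.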