### Board Game Analysis

I created this board game analysis for two reasons:
* I love board games and would like to explore this data.
* I would like to show what I know!

I pulled this dataset from Kaggle. The data was originally collected from BoardGameGeek.

In [90]:
#Imports
import numpy as np
import pandas as pd
import csv
import matplotlib
import re
from collections import Counter
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn import metrics

In [70]:
#Read in CSV file
Games=pd.read_csv("bgg_dataset.csv", sep=";")
#First 5 rows
Games.head(5)

Unnamed: 0,ID,Name,Year Published,Min Players,Max Players,Play Time,Min Age,Users Rated,Rating Average,BGG Rank,Complexity Average,Owned Users,Mechanics,Domains
0,174430.0,Gloomhaven,2017.0,1,4,120,14,42055,"8,79",1,"3,86",68323.0,"Action Queue, Action Retrieval, Campaign / Bat...","Strategy Games, Thematic Games"
1,161936.0,Pandemic Legacy: Season 1,2015.0,2,4,60,13,41643,"8,61",2,"2,84",65294.0,"Action Points, Cooperative Game, Hand Manageme...","Strategy Games, Thematic Games"
2,224517.0,Brass: Birmingham,2018.0,2,4,120,14,19217,"8,66",3,"3,91",28785.0,"Hand Management, Income, Loans, Market, Networ...",Strategy Games
3,167791.0,Terraforming Mars,2016.0,1,5,120,12,64864,"8,43",4,"3,24",87099.0,"Card Drafting, Drafting, End Game Bonuses, Han...",Strategy Games
4,233078.0,Twilight Imperium: Fourth Edition,2017.0,3,6,480,14,13468,"8,70",5,"4,22",16831.0,"Action Drafting, Area Majority / Influence, Ar...","Strategy Games, Thematic Games"


In [71]:
#How much data do we have?
print(len(Games), "rows in the dataset.")

20343 rows in the dataset.


In [72]:
#Check for missing data
Games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20343 entries, 0 to 20342
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  20327 non-null  float64
 1   Name                20343 non-null  object 
 2   Year Published      20342 non-null  float64
 3   Min Players         20343 non-null  int64  
 4   Max Players         20343 non-null  int64  
 5   Play Time           20343 non-null  int64  
 6   Min Age             20343 non-null  int64  
 7   Users Rated         20343 non-null  int64  
 8   Rating Average      20343 non-null  object 
 9   BGG Rank            20343 non-null  int64  
 10  Complexity Average  20343 non-null  object 
 11  Owned Users         20320 non-null  float64
 12  Mechanics           18745 non-null  object 
 13  Domains             10184 non-null  object 
dtypes: float64(3), int64(6), object(5)
memory usage: 2.2+ MB


It looks like some ID values are missing. These could be expansions to games. The rating and complexity averages are not registering as numbers: if you look back at the head, you can see they use a comma as the decimal separator, so this will need to be fixed. I am also going to drop the rows where the owned users value is missing. These would be games no one has access to yet. They may be fun to test towards the end of the analysis, to see if we can predict their ratings from mechanics or other variables.

### Data Cleaning

In [73]:
# Create a function to fix comma data for Rating and Complexity
def commaremoval(data):
    Newlist=[]
    for i in data:
        i = i.replace(",",".")
        Newlist.append(i)
    return Newlist
Games["Rating Average"] = pd.to_numeric(commaremoval(Games["Rating Average"]))
Games["Complexity Average"] = pd.to_numeric(commaremoval(Games["Complexity Average"]))
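The same conversion can also be written without an explicit Python loop. This is a minimal sketch, assuming both columns hold strings such as `8,79`; the small `demo` frame is a hypothetical stand-in for `Games`, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the raw columns: decimal commas stored as strings
demo = pd.DataFrame({
    "Rating Average": ["8,79", "8,61", "8,66"],
    "Complexity Average": ["3,86", "2,84", "3,91"],
})

# Vectorized: swap the decimal comma for a point, then convert to numbers
for col in ["Rating Average", "Complexity Average"]:
    demo[col] = pd.to_numeric(demo[col].str.replace(",", ".", regex=False))

print(demo.dtypes)
```

`pd.read_csv` also accepts a `decimal=","` argument, which would likely parse these columns as floats at load time, provided no other column uses the comma in a conflicting way.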

In [74]:
#Drop games with a missing Owned Users value (copy so later column edits apply to an independent frame)
Games=Games[Games["Owned Users"].notna()].copy()

In [75]:
#Change all / to , for easier separation
Games["Mechanics"] = Games["Mechanics"].str.replace("/", ",", regex=False)
Games["Domains"] = Games["Domains"].str.replace("/", ",", regex=False)

In [76]:
#Checking Data Again
Games.head(5)

Unnamed: 0,ID,Name,Year Published,Min Players,Max Players,Play Time,Min Age,Users Rated,Rating Average,BGG Rank,Complexity Average,Owned Users,Mechanics,Domains
0,174430.0,Gloomhaven,2017.0,1,4,120,14,42055,8.79,1,3.86,68323.0,"Action Queue, Action Retrieval, Campaign , Bat...","Strategy Games, Thematic Games"
1,161936.0,Pandemic Legacy: Season 1,2015.0,2,4,60,13,41643,8.61,2,2.84,65294.0,"Action Points, Cooperative Game, Hand Manageme...","Strategy Games, Thematic Games"
2,224517.0,Brass: Birmingham,2018.0,2,4,120,14,19217,8.66,3,3.91,28785.0,"Hand Management, Income, Loans, Market, Networ...",Strategy Games
3,167791.0,Terraforming Mars,2016.0,1,5,120,12,64864,8.43,4,3.24,87099.0,"Card Drafting, Drafting, End Game Bonuses, Han...",Strategy Games
4,233078.0,Twilight Imperium: Fourth Edition,2017.0,3,6,480,14,13468,8.7,5,4.22,16831.0,"Action Drafting, Area Majority , Influence, Ar...","Strategy Games, Thematic Games"


We now have a nice list of 20320 games to look at! Let's continue by breaking down the Mechanics and Domains columns into usable data for scikit-learn. I will use one-hot encoding.
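To illustrate what one-hot encoding does to a column of comma-separated labels, here is a small sketch using `Series.str.get_dummies`. The `mechanics` series is a toy stand-in, not the real column, and the whitespace clean-up step is an assumption about how the labels are spaced.

```python
import pandas as pd

# Toy stand-in for the Mechanics column
mechanics = pd.Series([
    "Hand Management, Income, Loans",
    "Card Drafting, Hand Management",
    None,  # games without listed mechanics become all-zero rows
])

# Normalise ", " to "," first; otherwise "Hand Management" and
# " Hand Management" (leading space) would become two separate columns
clean = mechanics.str.replace(r"\s*,\s*", ",", regex=True)
encoded = clean.str.get_dummies(sep=",")

print(encoded)
```

Each distinct label becomes its own 0/1 column, so a game with three mechanics has three ones in its row.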

In [77]:
#One Hot Encode Mechanics (strip spaces around commas so each label maps to one column)
GamesOHEM = Games['Mechanics'].str.replace(r"\s*,\s*", ",", regex=True).str.get_dummies(sep=",")
#One Hot Encode Domains
GamesOHED = Games['Domains'].str.replace(r"\s*,\s*", ",", regex=True).str.get_dummies(sep=",")
#Join tables, dropping the raw text columns so the base columns appear only once
GamesF = pd.concat([Games.drop(columns=['Mechanics', 'Domains']), GamesOHEM, GamesOHED], axis=1)


### Data Analysis
Questions I want to answer:

* What are the averages for player counts, play time, etc.?
* Which mechanics seem to be the most popular?
* Which mechanics seem to correlate with the highest scores?
* Can we create a model to predict the success of a game?

In [None]:
#Averages
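One possible way to fill in this cell is sketched below. It assumes the numeric columns have already been cleaned; `sample` is a hypothetical stand-in built from the five rows printed in the cleaned head above, so the numbers describe only those rows, not the full dataset.

```python
import pandas as pd

# Stand-in for Games, using the five rows shown in the cleaned head
sample = pd.DataFrame({
    "Min Players": [1, 2, 2, 1, 3],
    "Max Players": [4, 4, 4, 5, 6],
    "Play Time": [120, 60, 120, 120, 480],
    "Min Age": [14, 13, 14, 12, 14],
    "Rating Average": [8.79, 8.61, 8.66, 8.43, 8.70],
    "Complexity Average": [3.86, 2.84, 3.91, 3.24, 4.22],
})

# Mean and median side by side; the median is less sensitive to
# outliers such as very long play times
summary = sample.agg(["mean", "median"]).T
print(summary.round(2))
```

On the real data the same call could be applied to the matching columns of `Games`.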

In [None]:
#Mechanics Popular and High Scoring
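A rough sketch of how this cell might be approached, assuming a frame with one 0/1 column per mechanic (like `GamesF`). The `toy` frame, its values, and `mechanic_cols` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up stand-in for GamesF: a rating column plus 0/1 mechanic columns
toy = pd.DataFrame({
    "Rating Average": [8.0, 7.0, 6.0, 5.0],
    "Hand Management": [1, 1, 0, 0],
    "Dice Rolling": [0, 1, 1, 1],
})
mechanic_cols = ["Hand Management", "Dice Rolling"]

# Popularity: how many games use each mechanic
popularity = toy[mechanic_cols].sum().sort_values(ascending=False)

# Mean rating of the games that use each mechanic
mean_rating = pd.Series({
    m: toy.loc[toy[m] == 1, "Rating Average"].mean() for m in mechanic_cols
}).sort_values(ascending=False)

print(popularity)
print(mean_rating)
```

A mechanic used by only a handful of games can show an extreme mean rating by chance, so it may be worth filtering on a minimum game count before ranking.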

### Model Creation

In [79]:
#Split Train and Test sets
train_set, test_set = train_test_split(GamesF, 
                        test_size=0.2, random_state=123)
print('Train size: ', len(train_set), 'Test size: ', len(test_set))

Train size:  16256 Test size:  4064


In [None]:
# Linear Regression Model

In [None]:
# Decision Tree Regressor

In [None]:
# Random Forest Regressor
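The three model cells above are still empty. The sketch below shows one way they could be compared on a shared train/test split using RMSE. It runs on synthetic data, and the hyperparameters (`n_estimators=50`, the seeds) are arbitrary assumptions, so the scores say nothing about the board game data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a numeric feature matrix and a rating-like target
rng = np.random.default_rng(123)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=123),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=123),
}

# Fit each model and record its root mean squared error on the test set
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

for name, score in rmse.items():
    print(f"{name}: RMSE = {score:.3f}")
```

With the real data, `X` would presumably be the numeric columns of `GamesF` (without `Name` and any remaining text columns) and `y` a success measure such as `Rating Average`; the choice of target is an assumption, since the notebook does not state it.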