# Processing Board Game Data

## Background

This dataset comes from the [Board Game Geek database](http://boardgamegeek.com/). The site's database holds more than 90,000 games with crowd-sourced ratings. This subset is limited to games published between 1950 and 2016 that have at least 50 ratings, which still leaves 10,532 games. For more information, see the [tidytuesday repo](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12), from which this example was taken.



## Data Cleaning

In [9]:
import pandas as pd
import janitor
import os

### One-Shot
This cell demonstrates the cleaning process using the method-chaining approach championed by pyjanitor.

In [35]:
cleaned_df = (
    pd.read_csv("/home/jack/Downloads/board_games.csv") # ingest raw data
    .clean_names() # standardizes column names: strips whitespace, punctuation/symbols, and capitalization
    .remove_empty() # removes entirely empty rows / columns
    .drop(columns = ["image","thumbnail","compilation","game_id"]) # drops unnecessary columns
)
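
To make each step of the chain concrete, here is a minimal pandas-only sketch on a made-up frame. It only approximates what `clean_names` and `remove_empty` do: the renaming rule in `simple_clean_names` is a simplified, hypothetical stand-in and not pyjanitor's actual implementation, and the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the raw CSV.
raw = pd.DataFrame(
    {
        "Game ID": [1, 2, None],
        "Max Players": [4, 6, None],
        "Empty Col": [None, None, None],
    }
)


def simple_clean_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Lowercase column names and swap spaces for underscores (simplified rule)."""
    return frame.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))


toy_cleaned = (
    raw
    .pipe(simple_clean_names)           # "Game ID" -> "game_id", etc.
    .dropna(axis="index", how="all")    # drop rows that are entirely empty
    .dropna(axis="columns", how="all")  # drop columns that are entirely empty
    .drop(columns=["game_id"])          # drop a column we do not need
)

print(toy_cleaned.columns.tolist())
print(len(toy_cleaned))
```

Each call returns a new DataFrame, which is what allows the steps to be stacked in a single parenthesized expression.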

### Multi-Step
These cells repeat the process step by step to explain it in more detail.

In [36]:
# read in the csv
df = pd.read_csv("/home/jack/Downloads/board_games.csv")

In [18]:
# standardize column names: strip whitespace, punctuation / symbols, and capitalization
df = df.clean_names()

In [20]:
# removes entirely empty rows / columns
df = df.remove_empty()

In [69]:
# check to see if "min_playtime" and "max_playtime" columns are redundant
len(df[df["min_playtime"] != df["max_playtime"]])

1565
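
The same count can also be read off the boolean mask directly: summing it gives the number of differing rows, and taking its mean gives their share. A small sketch on made-up values (the toy frame is an assumption, not the real data):

```python
import pandas as pd

# Hypothetical playtimes in minutes.
toy = pd.DataFrame(
    {
        "min_playtime": [30, 45, 60, 90],
        "max_playtime": [30, 60, 60, 120],
    }
)

differs = toy["min_playtime"] != toy["max_playtime"]  # boolean Series
n_different = int(differs.sum())        # True counts as 1
share_different = float(differs.mean()) # fraction of rows that differ

print(n_different, share_different)
```

On the real data, 1565 differing rows out of 10,532 is roughly 15%, which suggests the two columns are not redundant.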

In [34]:
# check to see what percentage of the values in the "compilation" column are not null
len(df[df['compilation'].notnull()])/len(df)

0.03892897835169009
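
A slightly shorter way to get the same kind of ratio is to take the mean of the `notna()` mask. The sketch below uses an invented four-row frame, so the values are illustrative only:

```python
import pandas as pd

# Hypothetical data: one populated "compilation" value out of four rows.
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "compilation": ["Collection A", None, None, None],
    }
)

# Share of non-null values in a single column.
share_present = float(toy["compilation"].notna().mean())

# The same idea applied to every column at once.
share_by_column = toy.notna().mean()

print(share_present)
print(share_by_column)
```

Applying the mask to the whole frame makes it easy to spot other sparsely populated columns in one pass.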

In [None]:
# the "compilation" column is populated for under 4% of rows, so it adds little value
# the "image" and "thumbnail" columns only link to images and are not a factor in this analysis
# "game_id" can be replaced by the index
df = df.drop(columns = ["image","thumbnail","compilation","game_id"])

## Sample Analysis

In [80]:
# tallies the ten most common game categories in the "category" column
import operator

dic = {}
for i in df['category']:
    li = i.split(',')
    for x in li:
        if x not in dic:
            dic[x] = 1  # the first occurrence counts as one (starting at 0 undercounts by one)
        else:
            dic[x] += 1

sorted_dic = sorted(dic.items(), key=operator.itemgetter(1))
sorted_dic.reverse()
sorted_dic[:10]

[('Card Game', 2981),
 ('Wargame', 2034),
 ('Fantasy', 1218),
 ('Fighting', 900),
 ('Economic', 878),
 ('Science Fiction', 850),
 ('Dice', 838),
 ('Party Game', 833),
 ('Abstract Strategy', 710),
 ("Children's Game", 704)]
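
The same kind of tally can be sketched with `collections.Counter` from the standard library, which handles the counting and the sorting. The list of category strings below is made up for illustration, and skipping non-string entries is an assumption about how missing values should be treated.

```python
from collections import Counter

# Hypothetical category strings; None stands in for a missing value.
categories = [
    "Card Game,Fantasy",
    "Wargame",
    None,
    "Card Game,Dice",
    "Fantasy,Card Game",
]

counts = Counter(
    name.strip()
    for entry in categories
    if isinstance(entry, str)      # skip missing values
    for name in entry.split(",")   # one game can list several categories
)

top_two = counts.most_common(2)
print(top_two)
```

With a DataFrame, something like `df["category"].dropna()` could be passed in place of the list.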