In [4]:
import pandas as pd
import numpy as np

##### 1:
Editing and processing the data a little:

First I import the raw data from my computer:

In [29]:
basic = pd.read_csv('/Users/jasonmatiatos/Desktop/raw data/basic_data.csv')
mechanics = pd.read_csv('/Users/jasonmatiatos/Desktop/raw data/mechanics_data.csv')
categories = pd.read_csv('/Users/jasonmatiatos/Desktop/raw data/categories_data.csv')
subdomains = pd.read_csv('/Users/jasonmatiatos/Desktop/raw data/subdomain_data.csv')

Then I:
1. keep only the columns I want from 'basic' in a new dataframe, 'basics'
2. concatenate categories and subdomains into one dataframe called cats (and drop the bayes rating column)
3. store the mechanics in one dataframe called mecs (and drop the rating and bayes rating columns)
4. horizontally concatenate basics, mecs, and cats into one dataframe called all_data
5. convert all the column headers to lower case

In [32]:
type(categories)

pandas.core.frame.DataFrame

In [41]:
# 1
basics = basic[['name', 'description', 'image', 'rating', 'usersrated', 
                'minplayers', 'maxplayers', 'playingtime']]

# 2
cats = pd.concat([categories, subdomains], axis=1)
# naming the label once drops every column that carries it
cats = cats.drop(columns=['bayes_rating'])

# 3
# drop() returns a new dataframe, so the original 'mechanics' is left untouched
mecs = mechanics.drop(columns=['rating', 'bayes_rating'])

# 4
all_data = pd.concat([basics, mecs, cats], axis=1)

# 5
all_data.columns = all_data.columns.str.lower()
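
Steps 2 and 5 can be checked on a small example. This is a minimal sketch on toy dataframes, assuming (as the drop call suggests) that both input tables carry a 'bayes_rating' column; the names `toy_categories`, `toy_subdomains`, and `toy_cats` and all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values, not from the dataset).
toy_categories = pd.DataFrame({"bayes_rating": [6.1, 5.9], "Card Game": [1, 0]})
toy_subdomains = pd.DataFrame({"bayes_rating": [6.1, 5.9], "Strategy Games": [0, 1]})

# axis=1 places the frames side by side, aligning rows on the index.
toy_cats = pd.concat([toy_categories, toy_subdomains], axis=1)

# The label 'bayes_rating' now appears twice; naming it once in drop()
# removes every column with that label.
toy_cats = toy_cats.drop(columns=["bayes_rating"])

# Lower-case the headers, as in step 5.
toy_cats.columns = toy_cats.columns.str.lower()

print(list(toy_cats.columns))
```

Because `pd.concat(..., axis=1)` aligns on the index, the real tables need to share the same row order and index for the rows to line up game by game.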

In [60]:
all_data.head()

Unnamed: 0,name,description,image,rating,usersrated,minplayers,maxplayers,playingtime,alliances,area majority / influence,...,korean war,fan expansion,strategy games,abstract games,family games,thematic games,customizable games,wargames,party games,children's games
0,Die Macher,Die Macher is a game about seven sequential po...,https://cf.geekdo-images.com/original/img/uqlr...,7.62855,5054,3.0,5.0,240.0,1.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Dragonmaster,Dragonmaster is a trick-taking card game based...,https://cf.geekdo-images.com/original/img/o07K...,6.61412,545,3.0,4.0,30.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Samurai,Samurai is set in medieval Japan. Players comp...,https://cf.geekdo-images.com/original/img/mPS5...,7.44438,14332,2.0,4.0,60.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Tal der Könige,When you see the triangular box and the luxuri...,https://cf.geekdo-images.com/original/img/TgcS...,6.61683,334,2.0,4.0,60.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Acquire,"In Acquire, each player strategically invests ...",https://cf.geekdo-images.com/original/img/Bz4t...,7.34397,17800,2.0,6.0,90.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##### 2:
Here I build the separate vectors (parameters) that hold the column names, stored as lists. We want:

1. categories
2. mechanisms

Some other characteristics should appear in the user output, but it makes no sense to give them their own lists, because we won't match them against user inputs. These are:

1. descriptions
2. images
3. rating
4. number of user ratings
5. name
6. number of players (range) --> a player count n fits when minplayers <= n <= maxplayers
7. play time

These characteristics remain in the dataframe called 'all_data', so that they can be extracted when we have a specific boardgame.

In [48]:
categories_lst = list(cats.columns)
mechanisms_lst = list(mecs.columns)
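
The player-range check mentioned above could look like the following. This is only a sketch: the helper `games_for_player_count` is hypothetical, the bounds are assumed to be inclusive, and the toy rows reuse values from the preview of all_data.

```python
import pandas as pd

# Toy rows shaped like all_data (values taken from the preview shown earlier).
toy = pd.DataFrame({
    "name": ["Die Macher", "Dragonmaster", "Acquire"],
    "minplayers": [3.0, 3.0, 2.0],
    "maxplayers": [5.0, 4.0, 6.0],
    "playingtime": [240.0, 30.0, 90.0],
})

def games_for_player_count(df, n_players):
    """Return rows whose player range contains n_players (assumed inclusive)."""
    mask = (df["minplayers"] <= n_players) & (n_players <= df["maxplayers"])
    return df[mask]

matches = games_for_player_count(toy, 5)
print(matches["name"].tolist())
```

With inclusive bounds, a game listed for 3 to 5 players still matches a group of 5; strict inequalities would exclude it.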

##### 3:

Now I filter the data to keep only the board games that would realistically be worth presenting to a user. The original dataset has 286,186 board games. Of these, I only want to keep around 10,000.


##### Filter 1:

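Filter out games that have important values missing. Important values are: name, description, image, rating, number of user ratings, playtime. Note that the cell below drops a row if a value is missing in any of the 282 columns, not only in these six.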

In [49]:
all_data.shape

(286186, 282)

In [55]:
# with no arguments, dropna() removes every row that has a missing value in any column
full_data = all_data.dropna()

full_data.shape

(38126, 282)
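
The text names six important columns, while the executed cell checks every column. If only the six named columns should count, `dropna` accepts a `subset` argument. The sketch below uses toy data with hypothetical values; running it on the real data would give different row counts from the ones shown in this notebook.

```python
import numpy as np
import pandas as pd

# Columns the text names as important.
important = ["name", "description", "image", "rating", "usersrated", "playingtime"]

toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "description": ["x", None, "z"],
    "image": ["a.png", "b.png", "c.png"],
    "rating": [7.0, 6.0, 5.5],
    "usersrated": [100, 50, 25],
    "playingtime": [60.0, 30.0, 45.0],
    "minplayers": [2.0, 2.0, np.nan],  # not in the important list
})

any_missing = toy.dropna()                     # drops B and C
important_only = toy.dropna(subset=important)  # drops only B

print(len(any_missing), len(important_only))
```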

##### Filter 2:

Of these 38,126 I will only keep games with at least 20 user ratings, which suggests that their rating is reasonably reliable and not just set artificially by one or a few people.

In [74]:
rated_data = full_data[full_data['usersrated']>=20]

rated_data.shape

(12023, 282)

##### Filter 3:

Of these 12,023, keep only those with a rating above 5, because games rated 5 or lower are probably not games we would recommend to anyone.

In [75]:
good_data = rated_data[rated_data['rating']>5.0]

good_data.shape

(10556, 282)
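
Filters 2 and 3 can also be combined into one reusable step. This is a sketch only: the helper `keep_recommendable` and its parameter names are hypothetical, the defaults mirror the thresholds used in the cells above, and the toy values are made up.

```python
import pandas as pd

def keep_recommendable(df, min_ratings=20, min_rating=5.0):
    """Keep rows with at least min_ratings user ratings and a rating above min_rating."""
    enough_votes = df["usersrated"] >= min_ratings
    well_rated = df["rating"] > min_rating
    return df[enough_votes & well_rated]

# Toy data (hypothetical values) covering both boundary cases.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "usersrated": [545, 19, 20, 300],
    "rating": [6.6, 8.9, 5.0, 7.4],
})

kept = keep_recommendable(toy)
print(kept["name"].tolist())
```

Keeping the thresholds as parameters makes it easy to see how the final row count changes when they are tightened or relaxed.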

Finally, I sort the dataframe by rating in descending order:

In [76]:
final_data = good_data.sort_values('rating', ascending=False)
final_data.reset_index(drop=True, inplace=True)

final_data.head()

Unnamed: 0,name,description,image,rating,usersrated,minplayers,maxplayers,playingtime,alliances,area majority / influence,...,korean war,fan expansion,strategy games,abstract games,family games,thematic games,customizable games,wargames,party games,children's games
0,Solomon Sea,(From the Simulation Workshop site) <br/>Desig...,https://cf.geekdo-images.com/original/img/ympt...,8.80909,22,1.0,2.0,360.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,RPGQuest: Greek Mythology,"A very colorful mixture of RPG and Boardgame, ...",https://cf.geekdo-images.com/original/img/hCHA...,8.8,45,4.0,12.0,30.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Sports Action Canadian Pro Football,A Canadian Pro Football game that simulated wh...,https://cf.geekdo-images.com/original/img/ycvQ...,8.75714,35,2.0,2.0,60.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Valor of the Guards: ASL Historical Module Num...,(BGG description:)<br/><br/>This Historical Ad...,https://cf.geekdo-images.com/original/img/o6gx...,8.75212,259,2.0,4.0,600.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Connection Games,In this comprehensive study of the connection ...,https://cf.geekdo-images.com/original/img/Pdte...,8.74194,31,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##### 4:
Now I write the final dataframe to a CSV file:

In [77]:
final_data.to_csv('final_data.csv')
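
By default `to_csv` also writes the index as an unnamed first column, which comes back as 'Unnamed: 0' when the file is read again. If that column is not wanted, `index=False` leaves it out. The sketch below uses a toy dataframe and in-memory buffers instead of a file; the values are hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B"], "rating": [8.8, 8.7]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```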