# Capstone Part 4: Technical Report

#### [1. Data Collection and Adjustment](#dca)
#### [2. Algorithm Construction](#ac)
#### [3. Evaluation](#e)

# 1. Data Collection and Adjustment <a id = dca></a>
All data was collected from the Board Game Geek website in one form or another. The bgg.csv file contains approximately 4,500 unique board games, significantly fewer than the total available through the Board Game Geek API, a gap that was discovered only late in the project. The user data was collected solely from the Board Game Geek API. Each user's games are the games in that user's Board Game Geek collection with a minimum rating of 8; the reason for this threshold is explained in the algorithm construction section below.

To pair the data properly, an inner join combines the .csv data with the API data, keeping only the games that appear in both. The resulting dataframe is what the algorithm operates on.
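
As a small illustration of the inner join described above, the sketch below merges a toy user table with a toy game table. The rows and the `min_players` column are invented for the example; only the column names `User`, `Game`, and `names` mirror the ones used in this report. A game that is missing from the .csv side drops out of the result.

```python
import pandas as pd

# Toy stand-ins for the real data (all values are invented for illustration)
rec_user_demo = pd.DataFrame({
    'User': ['alice', 'alice', 'bob'],
    'Game': ['Catan', 'Obscure Game', 'Catan'],
})
bgg_demo = pd.DataFrame({
    'names': ['Catan', 'Carcassonne'],
    'min_players': [3, 2],  # hypothetical feature column
})

# Inner join: keep only rows whose game name appears in both tables
comp_demo = pd.merge(rec_user_demo, bgg_demo, how='inner',
                     left_on='Game', right_on='names')

print(comp_demo[['User', 'Game']])
```

In this toy case 'Obscure Game' is dropped because it has no match in `bgg_demo`, which is the same reason a small game list limits what the engine can recommend.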

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from bs4 import BeautifulSoup
import requests
import re
import time
import pickle

In [None]:
bgg = pd.read_csv("./Cleaned_BGG")
bgg.drop('Unnamed: 0',axis = 1, inplace = True)
bgg.drop_duplicates(subset= 'names', inplace= True)
bgg['names'].value_counts().head()

In [None]:
# # #Collect User list
# url = "https://boardgamegeek.com/findgamers.php?action=findclosest&country=US&srczip=07057&maxdist=100&B1=Submit"
# response = requests.get(url)
# html = response.text

# soup = BeautifulSoup(html, 'lxml')

# hrefs = []
# for link in soup.find_all('a'):
#     hrefs.append(link.get('href'))
    
# # #Remove Nonetypes
# hrefs = filter(None.__ne__, hrefs)
# hrefs = list(hrefs)

# # # Grab Usernames
# usernames = []
# for _ in hrefs:
#     if re.search('\/user\/', _):
#         _ = _[6:]
#         usernames.append(_)
# usernames = list(set(usernames))

# pickle.dump(usernames, open("large_un.p", "wb"))

In [None]:
# user_choices = {}

# # Collect User Data
# for user in usernames:
#     games = []
#     res = requests.get('http://www.boardgamegeek.com/xmlapi/collection/'+user+'?minrating=8')
#     soup = BeautifulSoup(res.text, "lxml")
#     for link in soup.find_all('name'):
#         game = re.findall("(?<=name sortindex=\"1\">).*(?=</name)", str(link))
#         games.extend(game)
#     user_choices.update({user:games})
#     time.sleep(2)  # pause once per user between API requests

# #Save user choices
# pickle.dump(user_choices, open("list.p", "wb"))

In [None]:
# Create User dataframe and then combine with main game dataframe
user_names= pickle.load(open("usernames.p", "rb"))
user_choices= pickle.load(open("gamelist.p", "rb"))

# Build one (user, game) row for every game in each user's collection.
# Pairing the values directly avoids splitting on '~', which would fail
# for any user name or game name that contains that character.
rows = [(user, game) for user, games in user_choices.items() for game in games]

# Create User listing
rec_user = pd.DataFrame(rows, columns = ['User','Game'])
rec_user.head()

#Combine user data with games from main bgg dataframe
comp = pd.merge(rec_user, bgg, how= 'inner', left_on= 'Game', right_on='names')

# 2. Algorithm Construction <a id = ac></a>
Because board game taste, and recommendation engines in general, are subjective, one goal was to reduce the variability of the results. Out of all the possible features, only category and mechanic were needed to make a strong recommendation. The main problem was incorporating the user data into the algorithm, since it sits on a completely different scale from the mechanic and category scores. Those two scores come from the cosine similarity function in sklearn, which returns values between -1 and 1 (and between 0 and 1 when every feature value is non-negative). To get a user score that is both on a comparable scale and a viable metric, only users who have the chosen game in their collection are considered, and, to filter out lower-rated games, only games rated at or above a threshold of 8 are counted. Once all users have been checked, every game in their collections is tallied, and each tally is multiplied by 0.04545 (roughly 1/22), since the raw value is strictly a count of how many times a game appears.
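
The two kinds of score described above can be sketched on toy data. This is a minimal illustration, not the notebook's code: the feature matrix, the game labels, and the helper `cosine_to_all` are hypothetical, and the features are assumed to be 0/1 flags. The helper computes the same normalized dot product that cosine similarity is defined by, using NumPy directly.

```python
import numpy as np
import pandas as pd

def cosine_to_all(target, matrix):
    """Cosine similarity between one vector and every row of a matrix.

    Assumes no all-zero rows, so no norm is zero.
    """
    target = np.asarray(target, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target)
    return matrix @ target / norms

# Hypothetical 0/1 mechanic flags: three games (rows), four mechanics (columns)
features = np.array([
    [1, 1, 0, 0],   # game A (the input game)
    [1, 0, 1, 0],   # game B shares one mechanic with A
    [0, 0, 0, 1],   # game C shares none
])
scores = cosine_to_all(features[0], features)
print(scores)  # A vs A is 1.0, A vs B is 0.5, A vs C is 0.0

# User score: count how often each game appears across the collections of
# users who own the input game, then scale the raw count by 0.04545
tally = pd.Series(['B', 'B', 'C', 'A', 'A']).value_counts()
user_score = tally * 0.04545
print(user_score)
```

With non-negative flags every similarity lands between 0 and 1, and the input game scores exactly 1 against itself.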

Once all scores have been computed, the largest problem is balancing them with weights, because some scores carry more recommendation strength than others. Since the goal is to limit variability, user score is the weakest, followed by category, with mechanic the strongest. This ordering rests solely on the fact that mechanics are different flavors of rules that can intermingle, even though they are less interpretable than the other features. Category is a broader topic and is more a matter of personal flavor than a defining factor in whether a person likes a game. User score is good at fitting the user in with similar users, but people are fickle and tastes change.

As a result, the category and mechanic scores depend more or less on the app user and their preferences. The ability to choose one or the other will be added in the future. For the time being, the mechanic score is weighted with a 2.0 multiplier and the category score with a 1.5 multiplier.
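
The weighting described above amounts to a weighted sum per game. The sketch below applies the 2.0 and 1.5 multipliers to invented scores; the game names and score values are hypothetical, and the user score is assumed to take no further multiplier beyond the scaling already applied to the tally.

```python
import pandas as pd

# Hypothetical per-game scores, indexed by game name
scores = pd.DataFrame(
    {'user_score': [0.09, 0.00, 0.05],
     'cat_scores': [1.0, 0.5, 0.8],
     'mech_scores': [1.0, 0.9, 0.2]},
    index=['Input Game', 'Game X', 'Game Y'],
)

# Multipliers from the text; user score is assumed to get no extra weight
weights = {'user_score': 1.0, 'cat_scores': 1.5, 'mech_scores': 2.0}

# Weighted sum of the three score columns, highest total first
total = sum(scores[col] * w for col, w in weights.items())
ranking = total.sort_values(ascending=False)
print(ranking)
```

Keeping the multipliers in one dictionary would make it straightforward to expose them later as user preferences.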

In [None]:
# Set up df index
alg_rec = pd.DataFrame([])
alg_rec['name'] = bgg['names']
alg_rec = alg_rec.set_index('name')

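# User Score
# NOTE: `ui` is not defined in this notebook; it is assumed to hold the
# name of the game entered by the app user.
# Find every user who has the input game in their collection
users = comp[comp['Game'] == ui ]
users = list(users['User'])
high_ranked= []

# Gather all games from those users' collections
for _ in users:
    temp = comp[comp['User'] == _]
    high_ranked.extend(temp['Game'])

# Tally how often each game appears, then scale the raw counts
high_ranked = pd.Series(high_ranked)
high_ranked = high_ranked.value_counts()
high_ranked = high_ranked * 0.04545453
alg_rec['user_score'] = high_ranked
alg_rec = alg_rec.fillna(0)  # games no matching user owns get a score of 0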

# Category Score
# Column 3 holds the game name; columns 69 onward are used as category features
cat = pd.DataFrame(bgg.iloc[:,3])
cat = cat.join(bgg.iloc[:,69:])
target= cat[cat['names'] == ui]

# One call scores the input game against every game, in row order.
# Names are unique after drop_duplicates, so this matches a per-name loop.
cat_scores = cosine_similarity(target.drop('names', axis = 1), cat.drop('names', axis = 1))[0]

alg_rec['cat_scores'] = cat_scores

# Mechanic Score
# Columns 20 through 68 are used as mechanic features
mech = pd.DataFrame(bgg.iloc[:,3])
mech = mech.join(bgg.iloc[:,20:69])

target= mech[mech['names'] == ui]
mech_scores = cosine_similarity(target.drop('names', axis = 1), mech.drop('names', axis = 1))[0]

alg_rec['mech_scores'] = mech_scores

# Final Computation
alg_rec['mech_scores'] = alg_rec['mech_scores'] * 2.0
alg_rec['cat_scores'] = alg_rec['cat_scores'] * 1.5
alg_rec['total'] = alg_rec ['user_score'] + alg_rec['cat_scores'] + alg_rec['mech_scores']
rec = alg_rec.sort_values('total', ascending= False).head(20)
rec = list(rec.index)
rec[0:11]

# 3. Evaluation <a id = e></a>
The engine was evaluated by creating test inputs and running them through it. If the input game appeared at the top of the list of recommendations, the test was deemed a success. In the algorithm, the user score is weighted down significantly so that it does not overshadow the other two defining scores. Since every user considered has the input game in their collection, the input game always has a user score at least as large as any other game's, and its category and mechanic scores are also the largest possible because its similarity with itself is exactly 1. Across multiple tests, the model has performed correctly one hundred percent of the time.
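
The check described above can be sketched as a small helper. Everything here is hypothetical: `top_hit_rate` is an illustrative name, and `toy_recommend` with its similarity table is a stand-in for the real engine, assumed to return a ranked list of game names for a given input.

```python
def top_hit_rate(recommend, test_inputs):
    """Fraction of inputs that come back as their own top recommendation."""
    hits = sum(1 for game in test_inputs if recommend(game)[0] == game)
    return hits / len(test_inputs)

# Stand-in engine (hypothetical): ranks games using a toy similarity table
toy_similarity = {
    'Catan': {'Catan': 1.0, 'Carcassonne': 0.6},
    'Carcassonne': {'Carcassonne': 1.0, 'Catan': 0.6},
}

def toy_recommend(game):
    # Sort candidate games by similarity to the input, highest first
    ranked = sorted(toy_similarity[game].items(),
                    key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked]

rate = top_hit_rate(toy_recommend, ['Catan', 'Carcassonne'])
print(rate)
```

A check of this kind only confirms that the input is ranked first; it says nothing about the quality of the recommendations that follow it.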

This isn't to say there isn't room for improvement. The .csv dataset is severely and disappointingly limited. The model has already been integrated into a web app, but the choices are limited to very common games and some extremely obscure ones. The next step is to dramatically increase the number of games, to at least twenty thousand, to add depth and give the user more choice. The web app itself is also very buggy and, because of the limited data, it is very hard to get a prediction without having a list of games the model recognizes.