## Notebook Goals ##

The analyses posted so far predict outcomes from variables that are only known at the game's end. Those variables correlate somewhat with the result of the game, but I don't think they are 'predictive' in the colloquial sense. Here I stick to information that is available up front.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict, Counter
import itertools
import seaborn as sns
import statsmodels.api as sm
from patsy.contrasts import Diff
sns.set_style('white')

df = pd.read_csv("../input/catanstats.csv")
# assign back rather than filling in place on an attribute, which may not modify df
df['me'] = df['me'].fillna(0)

In [None]:
Here I make a bunch of features that I think might be predictive. These include:

1. The expected number of cards (ec) on any roll
2. The ability to build items without needing to trade
3. Having ports that you can use
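
As a quick sanity check on the first feature, here is a minimal standalone sketch of how the numbers on a player's tiles turn into an expected number of cards per roll. The names `roll_prob` and `example_numbers` are made up for illustration, and the opening position is hypothetical rather than taken from the data.

```python
from fractions import Fraction
from itertools import product

# Count how many of the 36 ordered (die1, die2) outcomes give each sum.
counts = {}
for d1, d2 in product(range(1, 7), repeat=2):
    counts[d1 + d2] = counts.get(d1 + d2, 0) + 1

# Probability that a given number comes up on any single roll.
roll_prob = {total: Fraction(n, 36) for total, n in counts.items()}

# Hypothetical opening: settlements touching tiles numbered 6, 8 and 4.
example_numbers = [6, 8, 4]
expected_cards = sum(roll_prob[n] for n in example_numbers)
print(expected_cards)  # 13/36
```

Each adjacent tile contributes the probability of its number, so tiles numbered 6 and 8 (5/36 each) are worth much more than a 2 or a 12 (1/36 each).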

In [None]:
# turn the numbers that a player is on into a probability of resources per roll
probs = defaultdict(int)
for d1, d2 in itertools.combinations_with_replacement(range(1, 7), 2):
    s = d1 + d2
    probs[s] += 1 if d1 == d2 else 2
probs = {k:v/sum(probs.values()) for k, v in probs.items()}

resources = "LCSWO"
settlement_res = set("LCSW")
road_res = set("LC")  # a road costs lumber and clay (brick)
dcard_res = set("SOW")

def get_row_ec(row, cards=resources):
    ec = 0
    nums = row[15:27].tolist()[0::2]
    vals = row[15:27].tolist()[1::2]
    for n, v in zip(nums, vals):
        if v in cards:
            ec += probs[n]
    return ec

def two_port(row):
    # two ports are only considered useful if you sit on a resource that it trades
    vals = row[15:27].tolist()[1::2]
    s = sum(1 for v in vals if v[0] == '2' and v[1] in vals)
    return s

def three_port(row):
    vals = row[15:27].tolist()[1::2]
    s = sum(1 for v in vals if v[0] == '3')
    return s

def city(row):
    # fast city-building if there are more than 1 ore and a wheat
    vals = row[15:27].tolist()[1::2]
    if vals.count('O') >= 2 and 'W' in vals:
        return 1
    return 0

def can_build(row, req):
    # see if the adjacent tiles contain the input set
    vals = row[15:27].tolist()[1::2]
    if req.issubset(vals):
        return 1
    return 0

In [None]:
# get the features
df['init_ec'] = df.apply(get_row_ec, axis=1)
df['two_port'] = df.apply(two_port, axis=1)
df['three_port'] = df.apply(three_port, axis=1)
df['city'] = df.apply(city, axis=1)
df['settlement'] = df.apply(can_build, axis=1, args=(settlement_res,))
df['road'] = df.apply(can_build, axis=1, args=(road_res,))
df['dcard'] = df.apply(can_build, axis=1, args=(dcard_res,))

In [None]:
## Expected Cards vs. Points ##
The initial expected number of cards per roll is significantly correlated with final points, with a correlation of about 0.29.
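
One way to check both the size of a correlation and whether it could be chance is a permutation test. The sketch below is a standalone illustration with made-up numbers, not the notebook's data, and the helper names `pearson_r` and `permutation_p_value` are my own.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_perm

# Made-up values for illustration only, not taken from catanstats.csv.
toy_ec = [0.25, 0.31, 0.36, 0.28, 0.42, 0.33, 0.39, 0.22]
toy_points = [5, 7, 9, 6, 10, 6, 8, 4]
r = pearson_r(toy_ec, toy_points)
p = permutation_p_value(toy_ec, toy_points)
print(round(r, 2), p)
```

With the real data you would pass `df.init_ec` and `df.points` as lists. A small p-value says the correlation is unlikely to be chance, but it says nothing about how much of the variation in points is explained.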

In [None]:
# current seaborn requires x and y to be passed as keywords
g = sns.jointplot(x="init_ec", y="points", data=df, alpha=0.5);

## Point Prediction ##

Here I try to predict points with the features in a linear model. It's not very effective.

You can play with adding in the other factors I made, but none of them were significant or useful for predicting.

I also subtracted 2 from the points since every player starts with two victory points, so we're making a model that is for 'points above starting'.

In [None]:
model_eqn = "I(points-2) ~ 1 + (init_ec + C(dcard) + C(me))"
model = sm.OLS.from_formula(model_eqn, df).fit()
print(model.summary())

In [None]:
mn, mx = 1, 13
plt.scatter(model.model.endog+2, model.fittedvalues+2);
plt.xlim(mn, mx); plt.ylim(mn, mx); plt.plot([mn, mx], [mn, mx], '-k', lw=1);
plt.ylabel("Fit Value"); plt.xlabel("Actual Value"); plt.title("Actual by Fit");

Let's try breaking up the EC by card type and see if that helps.

In [None]:
# break EC up by card type
inputs = []
for card in "LCSWO":
    inputs.append('ec_' + card)
    df['ec_' + card] = df.apply(get_row_ec, axis=1, args=(card,))

In [None]:
model_eqn = "I(points - 2) ~ -1 + (" + " + ".join(inputs) + ")" #**2
model = sm.OLS.from_formula(model_eqn, df).fit()
print(model.summary())

In [None]:
mn, mx = 1, 13
plt.scatter(model.model.endog+2, model.fittedvalues+2);
plt.xlim(mn, mx); plt.ylim(mn, mx); plt.plot([mn, mx], [mn, mx], '-k', lw=1);
plt.ylabel("Fit Value"); plt.xlabel("Actual Value"); plt.title("Actual by Fit");

That didn't help much. What's the point of statistical significance when it doesn't predict well? Bleah!

## Winner Prediction ##

Let's skip point prediction, since that isn't going so well, and try to predict the winner.

I'm going to quickly do a logistic regression example and show that the accuracy from the confusion matrix isn't the right measure. The right measure is how often the player with the highest predicted win probability among the 4 in a game is the actual winner.

In [None]:
df['win'] = 0

for gnum in df.gameNum.unique():
    rows = df[df.gameNum == gnum]
    # idxmax gives the index label of the top scorer;
    # argmax is positional in current pandas and would mark the wrong row
    win_idx = rows.points.idxmax()
    df.loc[win_idx, 'win'] = 1

In [None]:
# First, predict the winner based only on the player position and whether or not our kind data
# gatherer was the player
model_eqn = "win ~ 1 + C(me) + C(player)"
model = sm.Logit.from_formula(model_eqn, df).fit()
print(model.summary())

In [None]:
t = model.pred_table()
print(t)
print("Accuracy:",np.diag(t).sum()/t.sum())

The standard accuracy measure won't do here, because we're really predicting the winner from a set of 4. 
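
To illustrate the difference between the two measures, here is a minimal standalone sketch with made-up probabilities. The function name `per_game_accuracy` and the toy numbers are assumptions for illustration only.

```python
def per_game_accuracy(games):
    """games: list of games; each game is a list of (win_probability, won) pairs."""
    correct = 0
    for players in games:
        # Index of the player with the highest predicted win probability.
        predicted = max(range(len(players)), key=lambda i: players[i][0])
        # Index of the player who actually won.
        actual = next(i for i, (_, won) in enumerate(players) if won)
        correct += predicted == actual
    return correct / len(games)

# Made-up probabilities for two 4-player games (illustration only).
toy_games = [
    [(0.40, 1), (0.20, 0), (0.25, 0), (0.15, 0)],  # favourite wins
    [(0.35, 0), (0.30, 0), (0.20, 1), (0.15, 0)],  # favourite loses
]

# Row-wise accuracy with a 0.5 threshold: every row is predicted "loses".
row_accuracy = sum((p >= 0.5) == bool(won)
                   for game in toy_games for p, won in game) / 8

print(row_accuracy)                   # 0.75
print(per_game_accuracy(toy_games))   # 0.5
```

The row-wise number looks good only because three of every four rows are losers. The per-game number is the one that answers "did we pick the winner?".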

In [None]:
def model_pred(model, data):
    # fraction of games where the player with the highest predicted win
    # probability is also the player with the most points
    right = 0
    tried = 0
    for gnum in data.gameNum.unique():
        rows = data[data.gameNum == gnum]
        # positions (0-3) within the game; ties go to the first player, as before
        actual_pos = rows.points.values.argmax()
        pred_pos = np.asarray(model.predict(rows)).argmax()
        tried += 1
        right += 1 if actual_pos == pred_pos else 0
    return right/tried

In [None]:
model_pred(model, df)

As nice as the accuracy is, the skill of the model is basically 0 (we'd get the same answer picking 'me' the whole time).
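
To make the comparison with always picking `me` concrete, here is a minimal standalone sketch. The function names, the toy games, and the placeholder model accuracy are all made up for illustration. The skill score shown is one common way to express improvement over a baseline, not something computed elsewhere in this notebook.

```python
def always_me_accuracy(games):
    """games: list of (me_seat, winner_seat) pairs, one per game."""
    return sum(me == winner for me, winner in games) / len(games)

def skill_score(model_acc, baseline_acc):
    """Improvement over the baseline as a fraction of the possible improvement."""
    return (model_acc - baseline_acc) / (1 - baseline_acc)

# Made-up seats for six games (illustration only).
toy_games = [(0, 0), (1, 1), (2, 3), (3, 3), (0, 2), (1, 1)]
baseline = always_me_accuracy(toy_games)

# Placeholder value standing in for a model's per-game accuracy.
hypothetical_model_acc = 0.5
print(round(baseline, 3))                                       # 0.667
print(round(skill_score(hypothetical_model_acc, baseline), 3))  # -0.5
```

A skill of 0 means the model does no better than the baseline, and a negative value means it does worse.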

## Multinomial Logistic ##

Let's try to account for all the players at once and pick the winner among players 1 through 4 using a multinomial logistic regression.

In [None]:
# make new data, where a single row is a full game
game_rows = []
for gnum in df.gameNum.unique():
    gamerow = []
    rows = df[df.gameNum == gnum]
    for idx, r in rows.iterrows():
        # Add whatever you think is predictive for each player here
        gamerow.extend([r.init_ec, r.me])
        if r.win:
            winner = idx % 4
    gamerow += [winner]
    game_rows.append(gamerow)
    
gamedf = pd.DataFrame(game_rows)
# I ignored column names and just renamed the last one to winner
gamedf.rename(columns={gamedf.shape[1]-1:'WINNER'}, inplace=True)
# Rename all the default numbered columns to 'V#'
gamedf.rename(columns={x: "V{}".format(x) for x in gamedf.columns if x != 'WINNER'}, inplace=True)

In [None]:
inputs = [x for x in gamedf.columns if x != "WINNER"]
# NOTE: I cheat here a bit and don't set the `me` variable as categorical.
model_eqn = "WINNER ~ -1 + " + " + ".join(inputs)
model = sm.MNLogit.from_formula(model_eqn, gamedf).fit()
print(model.summary())

In [None]:
t = model.pred_table()
print(t)
print("Accuracy:",np.diag(t).sum()/t.sum())

The MNLogit gives better than 50% in-sample accuracy just by considering all the players together. Try refitting the model with only `r.me` as the input: you should get 50% accuracy, which is what we'd expect.

The coefficients on all the variables don't make the most sense, but hey, our model is moderately skillful.

The real problem is that there isn't any out-of-sample prediction to test our model quality. I'll switch over to scikit-learn for ease of cross-validation (CV).

In [None]:
from sklearn import linear_model
# sklearn.cross_validation was removed; KFold and cross_val_score live in model_selection
from sklearn.model_selection import KFold, cross_val_score

X = gamedf[[x for x in gamedf.columns if x != 'WINNER']].values
y = gamedf.WINNER

# in current scikit-learn, lbfgs fits a multinomial model for multiclass targets by default
logreg = linear_model.LogisticRegression(solver='lbfgs')

# hold out one game at a time (50 folds, assuming 50 games)
kf = KFold(n_splits=50, shuffle=False)
print("Average CV accuracy:", cross_val_score(logreg, X, y, cv=kf).mean())
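
For reference, the holdout scheme itself is simple enough to sketch without any library. The standalone example below runs leave-one-out over made-up winner labels with a majority-class predictor, which is a useful floor to compare a fitted model against. The function name and the labels are assumptions for illustration.

```python
from collections import Counter

def leave_one_out_accuracy(labels):
    """Predict each held-out label as the most common label among the rest."""
    correct = 0
    for i, held_out in enumerate(labels):
        train = labels[:i] + labels[i + 1:]
        prediction = Counter(train).most_common(1)[0][0]
        correct += prediction == held_out
    return correct / len(labels)

# Made-up winner seats for six games (illustration only).
toy_winners = [0, 0, 0, 1, 2, 0]
loo_acc = leave_one_out_accuracy(toy_winners)
print(round(loo_acc, 3))  # 0.667
```

If the cross-validated logistic regression can't beat a number like this on the real games, the features aren't adding out-of-sample information.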

## Conclusions and Future Work ##

This was a pretty quick look at predicting winners and points using only up-front information. Cross-validation with a multinomial logistic regression wasn't able to show consistent improvement over assuming that `me` always won.

There's a lot more that could be done for feature creation and selection, though. I think what I've shown is that it's hard to predict who will win knowing only the initial board state (which is a real plus for the game; I would hope this is true!). The data are also biased due to the poster's skill at the game.

It was my hope that the features related to building capabilities would turn out to be useful, but Catan must give enough ways to earn points that there aren't very clear relationships between starting position and final result. I think we've all seen this happen: someone plans to build settlements, but ends up only able to get longest road, or goes for development cards to build largest army and pick up victory points. We also don't capture strategies like building near a port for the first expansion.

I'd be really curious to see what a neural network does with a whole bunch of possible features.