### Setting up a prediction problem

This notebook sets up the problem of predicting a match outcome from the history of each player involved in the match. I walk through my thought process as I try to avoid leaks.



In [None]:
import pandas as pd
import numpy as np
from sklearn import ensemble 
from sklearn import metrics

# this is meant to be a simple example so only matches and players are used
matches = pd.read_csv('../input/match.csv', index_col=0)
players = pd.read_csv('../input/players.csv')

test_labels = pd.read_csv('../input/test_labels.csv', index_col=0)
test_players = pd.read_csv('../input/test_player.csv')

train_labels = matches['radiant_win'].astype(int)

### Predicting Match Outcome

In this problem we are asking the question: which team will win? It is important to consider when the question is being asked. Most frequently it is asked before the match starts, but it could also be asked after the match has been running for 10 or 15 minutes. It could be asked before hero selection, when all that is known is the identity of the competitors. It could also be asked after hero selection, in which case the hero composition of each team would be something to consider. An additional case would be predicting the outcome based only on the heroes involved, without accounting for the players' identities.

The important point is that a time and a set of conditions need to be picked before trying to solve the problem. Here we will try to predict the outcome of a match when only the player identities are known, before hero selection or any gameplay starts.

Any information only available after we ask the question is off limits. This means any details at all about events in the match should be excluded as well as any information about future matches.
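
One way to make this rule concrete for player-history features is to build each statistic only from matches that started earlier. The sketch below uses a small made-up table; the `start_time` and `kills` columns and the name `kills_prior_mean` are illustrative assumptions, not part of the pipeline in this notebook.

```python
import pandas as pd

# Made-up history: one row per (player, match), with the match start time.
history = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "start_time": [10, 20, 30, 15, 25],
    "kills": [4.0, 8.0, 6.0, 2.0, 10.0],
})

# Order each player's matches in time, then average only the earlier ones.
# shift(1) drops the current match from its own feature, so a player's
# first match has no history and gets NaN.
history = history.sort_values(["account_id", "start_time"])
history["kills_prior_mean"] = (
    history.groupby("account_id")["kills"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
print(history)
```

A feature built this way never sees the match it is attached to, at the cost of leaving a player's first match without any history.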

In [None]:
# take a look at the match data
matches.head()

Of these variables only game_mode, cluster, and perhaps start_time can be determined before the match starts. None of them seem useful if the goal is to use players' past performance to predict the match outcome.

radiant_win is the target variable we are trying to predict. A time-based split is probably best here for validation: by holding out future matches we reduce the likelihood of accidentally introducing leakage.
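
As a minimal sketch of what a time-based holdout could look like, the snippet below sorts a toy table by `start_time` and holds out the most recent rows. The helper name `time_based_split`, the `holdout_frac` parameter, and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the matches table; the values are made up.
toy_matches = pd.DataFrame(
    {
        "start_time": [500, 100, 400, 200, 300],
        "radiant_win": [True, False, True, True, False],
    },
    index=pd.Index([4, 0, 3, 1, 2], name="match_id"),
)

def time_based_split(df, time_col="start_time", holdout_frac=0.2):
    """Return (earlier, later) frames, with the latest matches held out."""
    ordered = df.sort_values(time_col)
    n_holdout = max(1, int(round(len(ordered) * holdout_frac)))
    return ordered.iloc[:-n_holdout], ordered.iloc[-n_holdout:]

fit_part, holdout_part = time_based_split(toy_matches, holdout_frac=0.4)
print(fit_part.index.tolist(), holdout_part.index.tolist())
```

Every match in the held-out part starts after every match in the fitting part, which is what keeps future information out of the model.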

In [None]:
# since this is a simple example I will use very basic features, which are probably not very good
# only the column names are needed here, so take them directly by position
feature_columns = players.columns[4:17].tolist()
feature_columns

In [None]:
player_groups = players.groupby('account_id')

# these are just the means of the above values, one row per account
feature_components = player_groups[feature_columns].mean()

In [None]:
# account_id 0 is included even though it represents more than one account:
# its row holds the average stats for players who hide their account ids
feature_components.head()

In [None]:
# now to construct match_level features from the components
# account_id is needed to join with feature_components
train_ids = players[['match_id','account_id']]
test_ids = test_players[['match_id','account_id']]

In [None]:
# add player component data to full match and player data
# note: if a player is not in the train set but appears in the test set, they will have
# NaN values inserted

train_feat_comp = pd.merge(train_ids, feature_components,
                           how='left', left_on='account_id' ,
                           right_index=True)

test_feat_comp = pd.merge(test_ids, feature_components, 
                          how='left', left_on='account_id',
                          right_index=True)

In [None]:
# this is no longer needed now that the join is done 
train_feat_comp.drop(['account_id'], axis=1, inplace=True)
test_feat_comp.drop(['account_id'], axis=1, inplace=True)

# this flattens an entire match into one row, removes the redundant match_ids
# (the first 10 values, one per player), and then drops the unneeded multi-index
# is there a better way to do this?
def unstack_simplify(df):
    return df.unstack().iloc[10:].reset_index(drop=True)

In [None]:
# group by match then combine data for all players in a match into one row
test_feat_group = test_feat_comp.groupby('match_id')
test_feats = test_feat_group.apply(unstack_simplify)

In [None]:
train_feat_group = train_feat_comp.groupby('match_id')
train_feats = train_feat_group.apply(unstack_simplify)

In [None]:
test_feats.head()

In [None]:
for i in range(0,40, 10):
    print(test_feats.iloc[0,i:i+10],'\n')

Unstack interleaves the data of the different players. The output above is a visual check that the NaNs show up in a regular pattern, to make sure I didn't make a mistake.
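
To see the layout on something small, here is a toy match with 3 players and 2 features (made-up values, with the second player's stats missing). The same slicing idea is used, with `len(toy)` standing in for the 10 players of a real match.

```python
import numpy as np
import pandas as pd

# Toy match: 3 players, 2 features; the second player has no history.
toy = pd.DataFrame({
    "match_id": [7, 7, 7],
    "gold": [1.0, np.nan, 3.0],
    "kills": [10.0, np.nan, 30.0],
})

# unstack() on a frame with a plain index returns a Series laid out
# column by column: match_id x3, then gold x3, then kills x3.
flat = toy.unstack()

# Drop the leading match_id block and the (column, row) multi-index.
values = flat.iloc[len(toy):].reset_index(drop=True)
print(values.tolist())
print(values.isna().tolist())
```

The missing player shows up at the same offset inside every feature block, which is the regular NaN pattern to look for.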

Below you can see that most matches in the test set have players who are not in the train set, and this does not account for hidden account_ids.
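
A more direct check, which does not depend on how many features each player has, is to test account ids for membership in the train set. This is a sketch on made-up ids; `toy_train_ids` and `toy_test_ids` are hypothetical stand-ins for `train_ids` and `test_ids`.

```python
import pandas as pd

# Made-up (match_id, account_id) pairs standing in for train_ids / test_ids.
toy_train_ids = pd.DataFrame(
    {"match_id": [0, 0, 1, 1], "account_id": [11, 12, 11, 13]}
)
toy_test_ids = pd.DataFrame(
    {"match_id": [2, 2, 3, 3], "account_id": [11, 99, 98, 97]}
)

# Flag test rows whose account never appears in the train set.
seen = set(toy_train_ids["account_id"])
is_unseen = ~toy_test_ids["account_id"].isin(seen)

# Count the flagged players per test match.
unseen_per_match = is_unseen.groupby(toy_test_ids["match_id"]).sum()
print(unseen_per_match)
```

On the real data, hidden accounts (account_id 0) would count as seen under this check, so they would need to be handled separately.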

In [None]:
# count how many players are missing from each match because they didn't exist in
# the train set
row_nans = test_feats.isnull().sum(axis=1)
nan_counts = row_nans.value_counts()
nan_counts = nan_counts.reset_index()

nan_counts.columns = ['num_missing_players', 'count']
# each missing player contributes one NaN per feature; this assumes 12 feature values per player
nan_counts['num_missing_players'] = (nan_counts['num_missing_players'] / 12).astype(int)
nan_counts

In [None]:
rf = ensemble.RandomForestClassifier(n_estimators=150, n_jobs=-1)
rf.fit(train_feats,train_labels) 


# this is a bad way to deal with missing values 
test_feats.replace(np.nan, 0, inplace=True)

test_probs = rf.predict_proba(test_feats)
test_preds = rf.predict(test_feats)

In [None]:
metrics.log_loss(test_labels.values.ravel(), test_probs[:,1])

In [None]:
metrics.roc_auc_score(test_labels.values, test_probs[:,1])

In [None]:
print(metrics.classification_report(test_labels.values, test_preds))

Having mostly just competed on Kaggle, now I have to think about what the metrics mean ;) I would say the performance is nowhere near as good as I would like, but with the features I used that is to be expected.
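
One way to give the numbers some meaning is to compare them with trivial baselines. For binary log loss, always predicting 0.5 scores ln 2 (about 0.693), and a ROC AUC of 0.5 corresponds to a random ranking. The sketch below computes the log loss baselines by hand on made-up labels; `binary_log_loss` is a hypothetical helper written for this illustration, not the scikit-learn function.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_toy = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # made-up labels

# Baseline 1: always predict 0.5.
coin_flip = binary_log_loss(y_toy, np.full(len(y_toy), 0.5))

# Baseline 2: always predict the observed base rate of the positive class.
base_rate = binary_log_loss(y_toy, np.full(len(y_toy), y_toy.mean()))

print(coin_flip, base_rate)
```

A model whose log loss is not clearly below these baselines has learned little beyond the class balance.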

I am more concerned about whether this is the right approach to predicting match outcomes from player histories (or whether I have a bug :)). It also seems likely, given the number of missing players in the test set, that a larger dataset would be useful.

There are other tasks besides predicting match outcomes, like predicting win rate, which should be reasonably easy to set up.
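
As a rough sketch of how a per-player win rate target might be derived, the snippet below joins a toy players table to toy match outcomes. It assumes that `player_slot` values below 128 mean the player was on the Radiant side; that convention, the toy tables, and the names `wr_players` and `wr_matches` are assumptions to verify against the real data.

```python
import pandas as pd

# Made-up player rows and match outcomes.
wr_players = pd.DataFrame({
    "match_id": [0, 0, 1, 1, 2, 2],
    "account_id": [11, 12, 11, 12, 11, 12],
    "player_slot": [0, 128, 128, 0, 128, 0],
})
wr_matches = pd.DataFrame(
    {"radiant_win": [True, False, True]},
    index=pd.Index([0, 1, 2], name="match_id"),
)

# Attach the match outcome to every player row.
joined = wr_players.join(wr_matches, on="match_id")

# Assumption: slots below 128 are Radiant, the rest are Dire.
is_radiant = joined["player_slot"] < 128

# A player wins when their side matches the winning side.
joined["won"] = is_radiant == joined["radiant_win"]
win_rate = joined.groupby("account_id")["won"].mean()
print(win_rate)
```

To use this as a prediction target without leakage, the win rate would have to be computed from matches outside the period being predicted.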