# <center> Dota 2 winner prediction

<img src='https://habrastorage.org/webt/ua/vn/pq/uavnpqfoih4zwwznvxubu33ispy.jpeg'>

#### <center> Originally done by Peter Romov, translated and adapted by Yury Kashnitskiy (@yorko)
    
### Quick start

Grab features prepared by organizers, train a model and submit. 

1. [Data description](#Data-description)
2. [Features created by organizers](#Features-created-by-organizers)
3. [Training and evaluating a model](#Training-and-evaluating-a-model)
4. [Preparing a submission](#Preparing-a-submission)

### Now do it as a real Data Scientist

5. [Cross-validation](#Cross-validation)
6. [Working with all available information on Dota games](#Working-with-all-available-information-on-Dota-games)
7. [Feature engineering](#Feature-engineering)
8. [How to build initial features from scratch](#How-to-build-initial-features-from-scratch)

## Data description

We have the following files:

- `sample_submission.csv`: example of a submission file
- `train_matches.jsonl`, `test_matches.jsonl`: full "raw" training data 
- `train_features.csv`, `test_features.csv`: features created by organizers
- `train_targets.csv`: results of training games (including the winner)

## Features created by organizers

These are basic features which include simple players' statistics. Scroll to the end to see how to build these features from raw json files.

In [353]:
import os
import pandas as pd

PATH_TO_DATA = 'data/dota/'

df_train_features = pd.read_csv(os.path.join(PATH_TO_DATA, 
                                             'train_features.csv'), 
                                    index_col='match_id_hash')
df_train_targets = pd.read_csv(os.path.join(PATH_TO_DATA, 
                                            'train_targets.csv'), 
                                   index_col='match_id_hash')

df_test_features = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_features.csv'), index_col='match_id_hash')


We have ~40k games, each described by `match_id_hash` (game id) and 265 features. One of them is `game_time`: the in-game time (in seconds) at which the game state was recorded. The full length of the game (`duration`) is part of the targets.

In [354]:
df_train_features.shape

(39675, 265)

In [355]:
df_train_features.head()

match_id_hash,game_time,game_mode,lobby_type,objectives_len,chat_len,r1_hero_id,r1_kills,r1_deaths,r1_assists,r1_denies,...,d5_camps_stacked,d5_rune_pickups,d5_firstblood_claimed,d5_teamfight_participation,d5_towers_killed,d5_roshans_killed,d5_obs_placed,d5_sen_placed,d5_purchase,d5_hero_inventory
a400b8f29dece5f4d266f49f1ae2e98a,155,22,7,1,11,11,0,0,0,0,...,0,0,0,0.0,0,0,0,0,6,4
b9c57c450ce74a2af79c9ce96fac144d,658,4,0,3,10,15,7,2,0,7,...,0,0,0,0.0,0,0,0,0,12,0
6db558535151ea18ca70a6892197db41,21,23,0,0,0,101,0,0,0,0,...,0,0,0,0.0,0,0,0,0,4,3
46a0ddce8f7ed2a8d9bd5edcbb925682,576,22,7,1,4,14,1,0,3,1,...,1,3,0,0.0,0,0,2,0,8,6
b1b35ff97723d9b7ade1c9c3cf48f770,453,22,7,1,3,42,0,1,1,0,...,1,2,0,0.25,0,0,0,0,7,4


We are interested in the `radiant_win` column in `train_targets.csv`. The other target columns are not known during the game either (they come "from the future" relative to `game_time`), so they are available only for the training data.

In [356]:
df_train_targets.head()

match_id_hash,game_time,radiant_win,duration,time_remaining,next_roshan_team
a400b8f29dece5f4d266f49f1ae2e98a,155,False,992,837,
b9c57c450ce74a2af79c9ce96fac144d,658,True,1154,496,
6db558535151ea18ca70a6892197db41,21,True,1503,1482,Radiant
46a0ddce8f7ed2a8d9bd5edcbb925682,576,True,1952,1376,
b1b35ff97723d9b7ade1c9c3cf48f770,453,False,2001,1548,
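
The rows printed above hint at how the time columns relate: `game_time + time_remaining` appears to equal `duration`. Here is a minimal check on those five rows, typed in by hand for illustration (with the real data you would run the same comparison on `df_train_targets`):

```python
import pandas as pd

# The five rows shown in the output above, copied by hand.
sample = pd.DataFrame(
    {
        'game_time': [155, 658, 21, 576, 453],
        'duration': [992, 1154, 1503, 1952, 2001],
        'time_remaining': [837, 496, 1482, 1376, 1548],
    }
)

# game_time marks the moment of the snapshot; the rest of the game is "the future".
is_consistent = (sample['game_time'] + sample['time_remaining']
                 == sample['duration'])
print(is_consistent.all())
```

Whether this holds for every game in the dataset is an assumption worth verifying on the full file.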


In [357]:
df_train_features.shape, df_train_targets.shape

((39675, 265), (39675, 5))
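
Equal row counts alone do not guarantee that the two dataframes list games in the same order, and that matters once they are turned into plain NumPy arrays. Below is a small sketch of such a check on toy dataframes (ids and values are made up); with the real data you would compare `df_train_features.index` and `df_train_targets.index`.

```python
import pandas as pd

features = pd.DataFrame({'game_time': [155, 658, 21]},
                        index=pd.Index(['a', 'b', 'c'], name='match_id_hash'))
targets = pd.DataFrame({'radiant_win': [False, True, True]},
                       index=pd.Index(['c', 'a', 'b'], name='match_id_hash'))

# Same ids but a different order: taking .values would silently misalign rows.
same_order = features.index.equals(targets.index)

# Selecting the targets by the features' index restores the row order.
targets_aligned = targets.loc[features.index]
aligned_now = features.index.equals(targets_aligned.index)

print(same_order, aligned_now)
```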

In [358]:
df_train_features.describe()

,game_time,game_mode,lobby_type,objectives_len,chat_len,r1_hero_id,r1_kills,r1_deaths,r1_assists,r1_denies,...,d5_camps_stacked,d5_rune_pickups,d5_firstblood_claimed,d5_teamfight_participation,d5_towers_killed,d5_roshans_killed,d5_obs_placed,d5_sen_placed,d5_purchase,d5_hero_inventory
count,39675.0,39675.0,39675.0,39675.0,39675.0,39675.0,39675.0,39675.0,39675.0,39675.0,...,39675.0,39675.0,39675.0,39675.0,39675.0,39675.0,39675.0,39675.0,39675.0,39675.0
mean,1146.082798,19.584776,4.77235,6.524865,7.3385,51.103138,3.147876,3.268809,4.67017,6.289628,...,0.343138,4.683907,0.090132,0.415962,0.299811,0.024423,1.269288,0.783289,16.672539,4.99564
std,767.206621,6.304976,3.260582,6.492107,13.366381,34.603057,3.724282,3.283323,5.225349,8.203957,...,0.963734,4.643219,0.286375,0.267551,0.73249,0.1705,2.597549,2.437952,8.40312,1.165927
min,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,521.0,22.0,0.0,1.0,0.0,20.0,0.0,1.0,1.0,1.0,...,0.0,1.0,0.0,0.237,0.0,0.0,0.0,0.0,10.0,4.0
50%,1044.0,22.0,7.0,4.0,3.0,44.0,2.0,2.0,3.0,3.0,...,0.0,3.0,0.0,0.444,0.0,0.0,0.0,0.0,16.0,5.0
75%,1656.0,22.0,7.0,10.0,9.0,81.0,5.0,5.0,7.0,9.0,...,0.0,7.0,0.0,0.6,0.0,0.0,1.0,0.0,22.0,6.0
max,4933.0,23.0,7.0,43.0,291.0,120.0,32.0,27.0,40.0,84.0,...,29.0,57.0,1.0,2.0,9.0,5.0,26.0,47.0,62.0,6.0


## Training and evaluating a model

#### Let's construct a feature matrix `X` and a target vector `y`

In [359]:
# feature matrix and target vector as NumPy arrays

X = df_train_features.values
y = df_train_targets['radiant_win'].values

#### Perform a train/test split (a simple validation scheme)

In [360]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                      test_size=0.3, 
                                                      random_state=17)

#### Train the Random Forest model

In [361]:
%%time
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=17)
model.fit(X_train, y_train)

Wall time: 7.52 s


#### Make predictions for the holdout set

We need to predict probabilities of class 1 - that Radiant wins, thus we need index 1 in the matrix returned by the `predict_proba` method.

In [362]:
y_pred = model.predict_proba(X_valid)[:, 1]

Let's take a look:

In [363]:
y_pred

array([0.1 , 0.43, 0.51, ..., 0.46, 0.67, 0.47])
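
Column `1` is the right one only if `True` is the second entry of the model's `classes_` attribute. The sketch below, on synthetic data (not the Dota features), shows how to look up the column instead of hard-coding it:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with a boolean target, mimicking radiant_win.
rng = np.random.RandomState(17)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0

clf = RandomForestClassifier(n_estimators=10, random_state=17)
clf.fit(X_toy, y_toy)

# classes_ lists the labels in the column order used by predict_proba.
print(clf.classes_)
positive_col = list(clf.classes_).index(True)
proba_true = clf.predict_proba(X_toy)[:, positive_col]
```

For a boolean target scikit-learn sorts the labels, so `False` comes first and `True` second, which is why index 1 works in the cell above.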

#### Let's evaluate prediction quality with the holdout set

We'll calculate ROC-AUC.

In [364]:
from sklearn.metrics import roc_auc_score

valid_score = roc_auc_score(y_valid, y_pred)
print('Validation ROC-AUC score:', valid_score)

Validation ROC-AUC score: 0.7802967026655223
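
One way to read this number: ROC-AUC is the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. Here is a small sketch on made-up labels and scores that compares this pairwise definition with `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.7])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive with every negative; ties count as half a win.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

The pairwise version is quadratic in the number of samples, so it is meant for intuition only.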


Out of curiosity, we can calculate the accuracy of a classifier which predicts class 1 if the predicted probability is higher than 50%.

In [365]:
from sklearn.metrics import accuracy_score

valid_accuracy = accuracy_score(y_valid, y_pred > 0.5)
print('Validation accuracy of P>0.5 classifier:', valid_accuracy)

Validation accuracy of P>0.5 classifier: 0.6984793749474922


## Preparing a submission

Now the same for test data.

In [366]:
X_test = df_test_features.values
y_test_pred = model.predict_proba(X_test)[:, 1]

df_submission = pd.DataFrame({'radiant_win_prob': y_test_pred}, 
                                 index=df_test_features.index)

In [367]:
df_submission.head()

match_id_hash,radiant_win_prob
30cc2d778dca82f2edb568ce9b585caa,0.4
70e5ba30f367cea48793b9003fab9d38,0.85
4d9ef74d3a2025d79e9423105fd73d41,0.72
2bb79e0c1eaac1608e5a09c8e0c6a555,0.51
bec17f099b01d67edc82dfb5ce735a43,0.39


Save the submission file. It's handy to include the current datetime in the filename.

In [368]:
import datetime
submission_filename = 'submission_{}.csv'.format(
    datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
df_submission.to_csv(submission_filename)
print('Submission saved to {}'.format(submission_filename))

Submission saved to submission_2019-03-30_15-30-04.csv
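
Before uploading, it can help to sanity-check the file. The sketch below uses a hypothetical helper, `check_submission`. The expected layout (a `match_id_hash` index and a single `radiant_win_prob` column) is inferred from `df_submission` above, not from an official specification.

```python
import numpy as np
import pandas as pd


def check_submission(df, expected_index):
    """Hypothetical helper: return a list of problems found in a submission."""
    if list(df.columns) != ['radiant_win_prob']:
        return ['unexpected columns: {}'.format(list(df.columns))]
    problems = []
    if df.index.name != 'match_id_hash':
        problems.append('index should be named match_id_hash')
    if not df.index.equals(expected_index):
        problems.append('index differs from the test set index')
    probs = df['radiant_win_prob']
    if probs.isna().any():
        problems.append('missing predictions')
    if not probs.dropna().between(0, 1).all():
        problems.append('probabilities outside [0, 1]')
    return problems


# Toy data: ids and probabilities are made up.
test_index = pd.Index(['m1', 'm2', 'm3'], name='match_id_hash')
good = pd.DataFrame({'radiant_win_prob': [0.4, 0.85, 0.72]}, index=test_index)
bad = pd.DataFrame({'radiant_win_prob': [0.4, np.nan, 1.2]}, index=test_index)

print(check_submission(good, test_index))
print(check_submission(bad, test_index))
```

With the real data the call would be `check_submission(df_submission, df_test_features.index)`.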


## Cross-validation

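As we already know, cross-validation is a more reliable validation technique than a single train/test split. Here we use `StratifiedKFold` with 5 shuffled folds, so every model is trained on 80% of the data and validated on the remaining 20%, with the class balance preserved in each fold. (`ShuffleSplit` with five 70%/30% splits is left commented out as an alternative.)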

In [369]:
from sklearn.model_selection import ShuffleSplit, KFold, StratifiedKFold
#cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=17)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
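
To see how the `StratifiedKFold` object differs from the commented-out `ShuffleSplit`, the following sketch runs both splitters on a toy target (60 negatives, 40 positives; not the Dota data) and records the size and positive rate of every validation fold:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedKFold

y_toy = np.array([0] * 60 + [1] * 40)   # 40% positives
X_toy = np.zeros((len(y_toy), 1))       # features are irrelevant for splitting

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=17)

# (fold size, share of positives) for each validation fold
skf_folds = [(len(test), y_toy[test].mean())
             for _, test in skf.split(X_toy, y_toy)]
ss_folds = [(len(test), y_toy[test].mean())
            for _, test in ss.split(X_toy, y_toy)]

print('StratifiedKFold:', skf_folds)
print('ShuffleSplit:   ', ss_folds)
```

Stratified folds are disjoint, each holds 20% of the data, and they keep the class ratio fixed. `ShuffleSplit` draws independent 30% samples, which may overlap and whose class ratio can drift from split to split.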

In [370]:
from sklearn.model_selection import cross_val_score

#### Run cross-validation

We'll train two versions of the `RandomForestClassifier` model: the first with default capacity (trees are not limited in depth), the second with `min_samples_leaf=3`, i.e. each leaf must contain at least 3 instances.

In [371]:
%%time

model_rf1 = RandomForestClassifier(n_estimators=100, n_jobs=4,
                                   max_depth=None, random_state=17)

# calculate ROC-AUC for each split
cv_scores_rf1 = cross_val_score(model_rf1, X, y, cv=cv, scoring='roc_auc')

Wall time: 35.6 s


In [372]:
%%time

model_rf2 = RandomForestClassifier(n_estimators=100, n_jobs=4,
                                   min_samples_leaf=3, random_state=17)

cv_scores_rf2 = cross_val_score(model_rf2, X, y, cv=cv, 
                                scoring='roc_auc', n_jobs=-1)

Wall time: 31.6 s


#### CV results

The result returned by `cross_val_score` is an array with metric values (ROC-AUC) for each split:

In [373]:
cv_scores_rf1

array([0.78029046, 0.78263459, 0.7806766 , 0.77973637, 0.78152601])

In [374]:
cv_scores_rf2

array([0.78451867, 0.7885386 , 0.79028273, 0.78317175, 0.78421867])

Let's compare average ROC-AUC among all splits for both models.

In [375]:
print('Model 1 mean score:', cv_scores_rf1.mean())
print('Model 2 mean score:', cv_scores_rf2.mean())

Model 1 mean score: 0.7809728075100915
Model 2 mean score: 0.786146084560676


The second model is preferred: in this run it scores higher on all 5 splits, as the comparison below shows. Such agreement is not guaranteed, though. When two models are close, individual splits can rank them differently, so a single train/test split may lead to the wrong conclusion about which model is better.

In [376]:
cv_scores_rf2 > cv_scores_rf1

array([ True,  True,  True,  True,  True])
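
Besides the mean, it is worth looking at the spread across folds and at the per-fold gain. The sketch below reuses the two score arrays printed above (typed in by hand) and compares the models fold by fold:

```python
import numpy as np

# Scores copied from the outputs of cv_scores_rf1 and cv_scores_rf2 above.
cv_scores_rf1 = np.array([0.78029046, 0.78263459, 0.7806766,
                          0.77973637, 0.78152601])
cv_scores_rf2 = np.array([0.78451867, 0.7885386, 0.79028273,
                          0.78317175, 0.78421867])

# Both models were scored on the same folds, so differences are paired.
diff = cv_scores_rf2 - cv_scores_rf1

print('Model 1: {:.4f} +/- {:.4f}'.format(cv_scores_rf1.mean(),
                                          cv_scores_rf1.std()))
print('Model 2: {:.4f} +/- {:.4f}'.format(cv_scores_rf2.mean(),
                                          cv_scores_rf2.std()))
print('Per-fold gain: mean={:.4f}, min={:.4f}'.format(diff.mean(),
                                                      diff.min()))
```

A gain that stays positive on every fold is more convincing than a higher mean alone, although five folds are still a small sample.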


## Working with all available information on Dota games
Raw data descriptions for all games are given in files `train_matches.jsonl` and `test_matches.jsonl`. Each file has one entry for each game in [JSON](https://en.wikipedia.org/wiki/JSON) format. You only need to know that it can be easily converted to Python objects via the `json.loads` method.

##### Let's explore a single entry

In [None]:
import json

with open(os.path.join(PATH_TO_DATA, 'train_matches.jsonl')) as fin:
    # read the 18-th line
    for i in range(18):
        line = fin.readline()
    
    # read JSON into a Python object 
    match = json.loads(line)

The `match` object is now a big Python dictionary. In `match['players']` we have a description of each player.

You might think that this `match` object looks ugly. You're right! That's what real data looks like. And it's the ability to extract nice features from raw data that makes good Data Scientists stand out. You might even be unfamiliar with Dota (or any other application domain) but still be able to construct a good model via feature engineering. It's art and craftsmanship at the same time.

In [None]:
match
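
Printing the whole dictionary produces a lot of output. A compact alternative is to list only the top-level keys with their types and sizes. The helper `summarize` below is hypothetical and is shown on a toy dictionary that borrows a few key names from the real data:

```python
def summarize(obj):
    """Hypothetical helper: describe the top-level keys of a dict."""
    summary = {}
    for key, value in obj.items():
        if isinstance(value, (list, dict, str)):
            # containers and strings: report the type and the length
            summary[key] = '{}(len={})'.format(type(value).__name__, len(value))
        else:
            summary[key] = type(value).__name__
    return summary


# Toy stand-in for a real `match` object; values are made up.
toy_match = {
    'match_id_hash': 'abc',
    'game_time': 155,
    'players': [{'kills': 0}, {'kills': 2}],
    'teamfights': [],
}

print(summarize(toy_match))
```

The same call works on nested parts, e.g. `summarize(match['players'][0])`.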

#### Player description

In [None]:
player = match['players'][2]

In [None]:
player

In [None]:
len(match['teamfights'])

KDA: the number of kills, deaths, and assists to allies.

In [None]:
player['kills'], player['deaths'], player['assists'], len(player['killed']), len(player['hero_inventory'])

Some statistics on player abilities:

In [None]:
player['ability_uses']

In [None]:
len(player['ability_upgrades'])

#### Example: time series for each player's gold.

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt

In [None]:
for player in match['players']:
    plt.plot(player['times'], player['gold_t'])
    
plt.title('Gold change for all players');
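
Such per-player series can be turned into team-level features. The sketch below computes the Radiant gold advantage over time on a toy match. It assumes that the first five players are Radiant and the last five are Dire (the slot convention used when building the organizers' features) and that all players share the same `times` grid.

```python
import numpy as np

# Toy match: 10 players with made-up cumulative gold at three timestamps.
times = [0, 60, 120]
toy_match = {
    'players': [
        {'times': times, 'gold_t': [0, 100 + 10 * slot, 250 + 20 * slot]}
        for slot in range(10)
    ]
}

gold = np.array([p['gold_t'] for p in toy_match['players']])  # shape (10, 3)
radiant_gold = gold[:5].sum(axis=0)   # slots 0-4, assumed Radiant
dire_gold = gold[5:].sum(axis=0)      # slots 5-9, assumed Dire
gold_advantage = radiant_gold - dire_gold

print(gold_advantage)
```

The last value of `gold_advantage`, or its slope over the most recent minutes, is a natural candidate feature.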

#### Function to read files with game descriptions

The following function, `read_matches(filename)`, can be used to read raw data on Dota 2 games.

We recommend installing two Python packages: `ujson`, which makes reading JSON objects faster, and `tqdm`, which lets you track the progress of the loop.

In [None]:
import os

try:
    import ujson as json
except ModuleNotFoundError:
    import json
    print('Please install ujson to read JSON objects faster')

try:
    from tqdm import tqdm_notebook
except ModuleNotFoundError:
    # fallback that accepts (and ignores) tqdm's keyword arguments such as `total`
    def tqdm_notebook(iterable, **kwargs):
        return iterable
    print('Please install tqdm to track progress with Python loops')

def read_matches(matches_file):
    
    MATCHES_COUNT = {
        'test_matches.jsonl': 10000,
        'train_matches.jsonl': 39675,
    }
    _, filename = os.path.split(matches_file)
    total_matches = MATCHES_COUNT.get(filename)
    
    with open(matches_file) as fin:
        for line in tqdm_notebook(fin, total=total_matches):
            yield json.loads(line)

#### Reading data in a loop

Reading data on all games might take some 2-3 minutes. Thus you'd better stick to the following approach:

1. Read a small amount (10-100) of games
2. Write code to extract features from these JSON objects
3. Make sure the code works fine
4. Run the code with all available data
5. Save results to a `pickle` file so that you don't need to run all computations from scratch next time 
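
The workflow above can be sketched as follows. `toy_matches` is a made-up stand-in for `read_matches(...)` and `extract` is a hypothetical feature extractor. The part that carries over is the pattern: `itertools.islice` for a small sample, then a pickle cache.

```python
import itertools
import os
import tempfile

import pandas as pd


def toy_matches():
    """Stand-in for read_matches(...): yields made-up game dictionaries."""
    for i in range(1000):
        yield {'match_id_hash': 'm{}'.format(i), 'game_time': 10 * i,
               'players': [{'kills': i % 3} for _ in range(10)]}


def extract(match):
    """Hypothetical feature extractor used only for this illustration."""
    return {'match_id_hash': match['match_id_hash'],
            'total_kills': sum(p['kills'] for p in match['players'])}


# Steps 1-3: develop and debug on the first 50 games only.
rows = [extract(m) for m in itertools.islice(toy_matches(), 50)]
df_small = pd.DataFrame(rows).set_index('match_id_hash')

# Step 5: cache the result so that it can be reloaded without recomputation.
cache_path = os.path.join(tempfile.mkdtemp(), 'features.pkl')
df_small.to_pickle(cache_path)
df_cached = pd.read_pickle(cache_path)

print(df_cached.shape)
```

Because `read_matches` is a generator, `islice` stops reading the file after the requested number of games.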

In [None]:
for match in read_matches(os.path.join(PATH_TO_DATA, 'train_matches.jsonl')):
    match_id_hash = match['match_id_hash']
    game_time = match['game_time']
    
    # processing each game
    
    for player in match['players']:
        pass  # processing each player

## Feature engineering

In [None]:
def add_new_features(df_features, matches_file):
    
    # Process raw data and add new features
    for match in read_matches(matches_file):
        match_id_hash = match['match_id_hash']

        # Count the towers destroyed by each team
        radiant_tower_kills = 0
        dire_tower_kills = 0
        for objective in match['objectives']:
            if objective['type'] == 'CHAT_MESSAGE_TOWER_KILL':
                if objective['team'] == 2:
                    radiant_tower_kills += 1
                if objective['team'] == 3:
                    dire_tower_kills += 1

        # Write new features
        df_features.loc[match_id_hash, 'radiant_tower_kills'] = radiant_tower_kills
        df_features.loc[match_id_hash, 'dire_tower_kills'] = dire_tower_kills
        df_features.loc[match_id_hash, 'diff_tower_kills'] = radiant_tower_kills - dire_tower_kills
        
        # ... here you can add more features ...
        df_features.loc[match_id_hash, 'teamfight_count'] = len(match['teamfights'])
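
Assigning values cell by cell with `.loc` works, but it can get slow on tens of thousands of games. An alternative sketch, under the same assumptions about the `objectives` entries as the function above, collects one dictionary per game and joins the result once. The helper name `build_tower_features` and the toy matches are made up for illustration.

```python
import pandas as pd


def build_tower_features(matches):
    """Sketch: the same tower and teamfight counts, built as one dataframe."""
    rows = []
    for match in matches:
        towers = [o['team'] for o in match['objectives']
                  if o['type'] == 'CHAT_MESSAGE_TOWER_KILL']
        radiant = towers.count(2)
        dire = towers.count(3)
        rows.append({
            'match_id_hash': match['match_id_hash'],
            'radiant_tower_kills': radiant,
            'dire_tower_kills': dire,
            'diff_tower_kills': radiant - dire,
            'teamfight_count': len(match['teamfights']),
        })
    return pd.DataFrame(rows).set_index('match_id_hash')


# Toy games; 'OTHER_EVENT' is a made-up objective type.
toy_matches = [
    {'match_id_hash': 'a', 'teamfights': [{}, {}],
     'objectives': [{'type': 'CHAT_MESSAGE_TOWER_KILL', 'team': 2},
                    {'type': 'CHAT_MESSAGE_TOWER_KILL', 'team': 3},
                    {'type': 'CHAT_MESSAGE_TOWER_KILL', 'team': 2},
                    {'type': 'OTHER_EVENT'}]},
    {'match_id_hash': 'b', 'teamfights': [], 'objectives': []},
]

df_base = pd.DataFrame({'game_time': [658, 155]},
                       index=pd.Index(['b', 'a'], name='match_id_hash'))

# join aligns rows on match_id_hash, whatever order the games were read in
df_extended = df_base.join(build_tower_features(toy_matches))
print(df_extended)
```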

In [None]:
df_train_features.shape

In [None]:
# copy the dataframe with features
df_train_features_extended = df_train_features.copy()

# add new features
add_new_features(df_train_features_extended, 
                 os.path.join(PATH_TO_DATA, 
                              'train_matches.jsonl'))

In [None]:
df_ex = df_train_features_extended[['radiant_tower_kills', 'dire_tower_kills', 'diff_tower_kills','teamfight_count']]
df_ex = df_ex.join(df_train_targets)

In [None]:
df_ex.head()

In [None]:
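import seaborn as sns

# correlations are defined only for numeric/boolean columns,
# so text columns such as next_roshan_team are left out
sns.heatmap(df_ex.select_dtypes(include=['number', 'bool']).corr())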

The new features are appended as the last (rightmost) columns of `df_train_features_extended`.

#### Evaluating new features

Let's run cross-validation with a fixed model but with two different datasets:

1. with features built by organizers (base)
2. with new features that we've added (extended)

In [None]:
df_train_features_extended.shape

In [None]:
%%time

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=17)

# Optional alternative (requires the catboost package):
# from catboost import CatBoostClassifier
# model = CatBoostClassifier(random_state=17, silent=True)

cv_scores_base = cross_val_score(model, X, y, cv=cv, scoring='roc_auc', n_jobs=-1)
cv_scores_extended = cross_val_score(model, df_train_features_extended.values, y, 
                                     cv=cv, scoring='roc_auc', n_jobs=-1)

In [None]:
print('Base features: mean={} scores={}'.format(cv_scores_base.mean(), 
                                                cv_scores_base))
print('Extended features: mean={} scores={}'.format(cv_scores_extended.mean(), 
                                                    cv_scores_extended))

In [None]:
cv_scores_extended > cv_scores_base

As we see, `RandomForestClassifier` shows better cross-validation results in case of the extended dataset. Looks reasonable, that's what we build features for.

#### New submission

In [None]:
%%time
# Build the same features for the test set
df_test_features_extended = df_test_features.copy()
add_new_features(df_test_features_extended, 
                 os.path.join(PATH_TO_DATA, 'test_matches.jsonl'))

In [None]:
X.shape, len(y)

In [None]:
model = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=17)
model.fit(X, y)
df_submission_base = pd.DataFrame(
    {'radiant_win_prob': model.predict_proba(df_test_features.values)[:, 1]}, 
    index=df_test_features.index,
)
df_submission_base.to_csv('submission_base_rf.csv')

In [None]:
model

In [None]:
model_extended = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=17)
model_extended.fit(df_train_features_extended.values, y)
df_submission_extended = pd.DataFrame(
    {'radiant_win_prob': model_extended.predict_proba(df_test_features_extended.values)[:, 1]}, 
    index=df_test_features.index,
)
df_submission_extended.to_csv('submission_extended_rf.csv')

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import GridSearchCV

scaler = StandardScaler()
logit = LogisticRegression(random_state=17, solver='liblinear', class_weight='balanced')

logit_pipe = Pipeline([('scaler', scaler), ('logit', logit)])
logit_pipe_params = {'logit__C': np.logspace(-8, 8, 17)}

In [None]:
pipeline_cv = GridSearchCV(estimator=logit_pipe, param_grid=logit_pipe_params,
                           scoring='roc_auc', cv=cv, verbose=1)

# nested cross-validation: the grid search is re-run inside every outer fold
cv_score = cross_val_score(estimator=pipeline_cv,
                           X=df_train_features_extended.values, y=y,
                           cv=cv, scoring='roc_auc')
cv_score.mean()

In [None]:
pipeline_cv.fit(df_train_features_extended.values, y)
lr_submission_extended = pd.DataFrame(
    {'radiant_win_prob': pipeline_cv.predict_proba(df_test_features_extended.values)[:, 1]}, 
    index=df_test_features.index,
)
lr_submission_extended.to_csv('submission_extended_lr.csv')

## LR scores:

- cv = 0.8137461593524739
- leaderboard = 0.82587

Recorded output of the base vs. extended features comparison:

- Base features: mean=0.7743398073402079 scores=[0.77155841 0.77627012 0.77895799 0.77523123 0.76968127]
- Extended features: mean=0.782673483297628 scores=[0.78496205 0.78587811 0.78522151 0.77935413 0.77795162]
- Score = 0.78469 - 88 features

## How to build initial features from scratch

Now we disclose the code that we used to build the initial features in `train_features.csv` and `test_features.csv`. You can modify the following code to add more features.

In a nutshell:

1. the `extract_features_csv(match)` function extracts features from game descriptions and writes them into a dictionary
2. the `extract_targets_csv(match, targets)` function extracts the target variable `radiant_win`
3. iterating through the file with raw data, we collect all features
4. with `pandas.DataFrame.from_records()` we create dataframes with new features
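
The last step can be sketched in isolation: a list of ordered dictionaries, one per game, becomes a dataframe whose columns follow the key order. The records below are made up and contain only a few of the real columns.

```python
import collections

import pandas as pd

# One ordered dictionary per game, as the extract functions would return.
records = [
    collections.OrderedDict([('match_id_hash', 'a'),
                             ('game_time', 155),
                             ('r1_kills', 0)]),
    collections.OrderedDict([('match_id_hash', 'b'),
                             ('game_time', 658),
                             ('r1_kills', 7)]),
]

# Keys become columns; match_id_hash is then moved into the index.
df = pd.DataFrame.from_records(records).set_index('match_id_hash')
print(df)
```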

In [None]:
import collections

MATCH_FEATURES = [
    ('game_time', lambda m: m['game_time']),
    ('game_mode', lambda m: m['game_mode']),
    ('lobby_type', lambda m: m['lobby_type']),
    ('objectives_len', lambda m: len(m['objectives'])),
    ('chat_len', lambda m: len(m['chat'])),
]

PLAYER_FIELDS = [
    'hero_id',
    
    'kills',
    'deaths',
    'assists',
    'denies',
    
    'gold',
    'lh',
    'xp',
    'health',
    'max_health',
    'max_mana',
    'level',

    'x',
    'y',
    
    'stuns',
    'creeps_stacked',
    'camps_stacked',
    'rune_pickups',
    'firstblood_claimed',
    'teamfight_participation',
    'towers_killed',
    'roshans_killed',
    'obs_placed',
    'sen_placed',
]

PLAYER_COUNT_FIELDS = ['purchase', 'hero_inventory']

def extract_features_csv(match):
    row = [
        ('match_id_hash', match['match_id_hash']),
    ]
    
    for field, f in MATCH_FEATURES:
        row.append((field, f(match)))
        
    for slot, player in enumerate(match['players']):
        if slot < 5:
            player_name = 'r%d' % (slot + 1)
        else:
            player_name = 'd%d' % (slot - 4)

        for field in PLAYER_FIELDS:
            column_name = '%s_%s' % (player_name, field)
            row.append((column_name, player[field]))
            
        for new_field in PLAYER_COUNT_FIELDS:
            column_name = '%s_%s' % (player_name, new_field)
            row.append((column_name, len(player[new_field])))
            
    return collections.OrderedDict(row)
    
def extract_targets_csv(match, targets):
    return collections.OrderedDict([('match_id_hash', match['match_id_hash'])] + [
        (field, targets[field])
        for field in ['game_time', 'radiant_win', 'duration', 'time_remaining', 'next_roshan_team']
    ])

In [None]:
%%time

df_new_features = []
df_new_targets = []

for match in read_matches(os.path.join(PATH_TO_DATA, 'train_matches.jsonl')):
    match_id_hash = match['match_id_hash']
    features = extract_features_csv(match)
    targets = extract_targets_csv(match, match['targets'])
    
    df_new_features.append(features)
    df_new_targets.append(targets)

In [None]:
df_new_features = pd.DataFrame.from_records(df_new_features).set_index('match_id_hash')
df_new_targets = pd.DataFrame.from_records(df_new_targets).set_index('match_id_hash')

In [None]:
df_new_targets.head()

In [None]:
df_new_features.to_csv(os.path.join(PATH_TO_DATA, 'train_features.csv'))

In [None]:
df_new_targets.to_csv(os.path.join(PATH_TO_DATA, 'train_targets.csv'))

In [None]:
df_new_features = []

for match in read_matches(os.path.join(PATH_TO_DATA, 'test_matches.jsonl')):
    match_id_hash = match['match_id_hash']
    features = extract_features_csv(match)
    
    df_new_features.append(features)

In [None]:
df_new_features = pd.DataFrame.from_records(df_new_features).set_index('match_id_hash')
df_new_features.to_csv(os.path.join(PATH_TO_DATA, 'test_features.csv'))

In [None]:
df_new_features.head()

## Go on!

- Discuss new ideas in Slack 
- Create new features
- Try new models and ensembles
- Submit predictions
- Go and win!