# <center> Dota 2 winner prediction

<img src='https://habrastorage.org/webt/ua/vn/pq/uavnpqfoih4zwwznvxubu33ispy.jpeg'>

## Data description

We have the following files:

- `sample_submission.csv`: example of a submission file
- `train_matches.jsonl`, `test_matches.jsonl`: full "raw" training data 
- `train_features.csv`, `test_features.csv`: features created by organizers
- `train_targets.csv`: results of training games (including the winner)

## Features created by organizers

These are basic features which include simple players' statistics. Scroll to the end to see how to build these features from raw json files.

In [294]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [295]:
import os
import pandas as pd

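PATH_TO_DATA = 'input/'

# Build file paths from PATH_TO_DATA so the data location is set in one place
df_train_features = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_features.csv'), index_col='match_id_hash')
df_train_targets = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_targets.csv'), index_col='match_id_hash')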

In [296]:
# ## exercise, read test dataframe
df_test_features = pd.read_csv('input/test_features.csv', index_col='match_id_hash')
# df_test_targets = 

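We have ~40k games, each described by `match_id_hash` (game id) and 245 features. One of them is `game_time`: the time (in seconds) into the match at which the features were recorded. As the targets file shows, `game_time` plus `time_remaining` equals the full match `duration`.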

In [297]:
df_train_features.shape

(39675, 245)

In [298]:
df_train_features.head()

| match_id_hash | game_time | game_mode | lobby_type | objectives_len | chat_len | r1_hero_id | r1_kills | r1_deaths | r1_assists | r1_denies | ... | d5_stuns | d5_creeps_stacked | d5_camps_stacked | d5_rune_pickups | d5_firstblood_claimed | d5_teamfight_participation | d5_towers_killed | d5_roshans_killed | d5_obs_placed | d5_sen_placed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a400b8f29dece5f4d266f49f1ae2e98a | 155 | 22 | 7 | 1 | 11 | 11 | 0 | 0 | 0 | 0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 |
| b9c57c450ce74a2af79c9ce96fac144d | 658 | 4 | 0 | 3 | 10 | 15 | 7 | 2 | 0 | 7 | ... | 0.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 |
| 6db558535151ea18ca70a6892197db41 | 21 | 23 | 0 | 0 | 0 | 101 | 0 | 0 | 0 | 0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 |
| 46a0ddce8f7ed2a8d9bd5edcbb925682 | 576 | 22 | 7 | 1 | 4 | 14 | 1 | 0 | 3 | 1 | ... | 8.664527 | 3 | 1 | 3 | 0 | 0.0 | 0 | 0 | 2 | 0 |
| b1b35ff97723d9b7ade1c9c3cf48f770 | 453 | 22 | 7 | 1 | 3 | 42 | 0 | 1 | 1 | 0 | ... | 0.0 | 2 | 1 | 2 | 0 | 0.25 | 0 | 0 | 0 | 0 |


We are interested in the `radiant_win` column in `train_targets.csv`. The columns in this file are not known during the game (they come "from the future" relative to `game_time`), so they are available only for the training data.

In [299]:
df_train_targets.head()

| match_id_hash | game_time | radiant_win | duration | time_remaining | next_roshan_team |
|---|---|---|---|---|---|
| a400b8f29dece5f4d266f49f1ae2e98a | 155 | False | 992 | 837 | |
| b9c57c450ce74a2af79c9ce96fac144d | 658 | True | 1154 | 496 | |
| 6db558535151ea18ca70a6892197db41 | 21 | True | 1503 | 1482 | Radiant |
| 46a0ddce8f7ed2a8d9bd5edcbb925682 | 576 | True | 1952 | 1376 | |
| b1b35ff97723d9b7ade1c9c3cf48f770 | 453 | False | 2001 | 1548 | |


In [300]:
df_train_targets['radiant_win'] = df_train_targets['radiant_win'].astype(int)

## Training and evaluating a model

#### Let's construct a feature matrix `X` and a target vector `y`

In [301]:
X = df_train_features.values
y = df_train_targets['radiant_win'].values

#### Perform a train/test split (a simple validation scheme)

In [302]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, 
                                                  test_size=0.2, 
                                                  random_state=17)

#### Train the Random Forest model

<img src='https://www.baeldung.com/wp-content/uploads/sites/4/2022/03/decision_tree1.jpg'>

https://www.youtube.com/watch?v=cIbj0WuK41w

Most important hyperparameters of Random Forest:

- n_estimators = n of trees
- max_features = max number of features considered for splitting a node
- max_depth = max number of levels in each decision tree
- min_samples_split = min number of data points placed in a node before the node is split
- min_samples_leaf = min number of data points allowed in a leaf node
- bootstrap = method for sampling data points (with or without replacement)
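
As a minimal sketch, the hyperparameters listed above map onto `RandomForestClassifier` constructor arguments as shown here. The synthetic data from `make_classification` and the specific values are illustrative assumptions, not tuned settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 500 samples, 20 features
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=17)

rf_demo = RandomForestClassifier(
    n_estimators=50,       # number of trees
    max_features="sqrt",   # features considered at each split
    max_depth=8,           # maximum number of levels per tree
    min_samples_split=10,  # samples a node needs before it may be split
    min_samples_leaf=5,    # samples every leaf must keep
    bootstrap=True,        # sample rows with replacement for each tree
    random_state=17,
)
rf_demo.fit(X_demo, y_demo)

# Every fitted tree stays within the depth limit
max_tree_depth = max(tree.get_depth() for tree in rf_demo.estimators_)
print(len(rf_demo.estimators_), max_tree_depth)
```

Smaller `max_depth` and larger `min_samples_leaf` both restrict how closely each tree can fit the training data, which is why they are usually tuned together.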

In [303]:
%%time
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
    max_depth=23, min_samples_split=10, n_estimators=65,
    min_samples_leaf=100,
    n_jobs=-1,
    random_state=17,
)
model.fit(X_train, y_train)


CPU times: user 18.6 s, sys: 96.6 ms, total: 18.7 s
Wall time: 4.1 s


#### Make predictions for the holdout set

We need to predict probabilities of class 1 - that Radiant wins, thus we need index 1 in the matrix returned by the `predict_proba` method.

In [304]:
y_pred = model.predict_proba(X_val)[:, 1]

Let's take a look:

In [305]:
y_pred

array([0.09417954, 0.48991721, 0.50941661, ..., 0.314985  , 0.42412832,
       0.54869595])
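
To see why index 1 is the right column, here is a small sketch on toy data (the arrays and the `clf_demo` name are made up for illustration). The columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, so with labels 0 and 1 the second column holds the probability of class 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (assumption): one feature, labels 0 and 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf_demo = RandomForestClassifier(n_estimators=10, random_state=17).fit(X_demo, y_demo)
proba = clf_demo.predict_proba(X_demo)

# One column per class, in the order given by clf_demo.classes_
p_class1 = proba[:, 1]
print(clf_demo.classes_, proba.shape, p_class1.shape)
```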

#### Let's evaluate prediction quality with the holdout set

In [306]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score

Out of curiosity, we can calculate the accuracy of a classifier that predicts class 1 if the predicted probability is higher than 50%.

In [307]:
valid_accuracy = accuracy_score(y_val, y_pred > 0.5)
print('Validation accuracy of P>0.5 classifier:', valid_accuracy)

Validation accuracy of P>0.5 classifier: 0.6931316950220542


In [308]:
confusion_matrix(y_val, y_pred > 0.5)

array([[2095, 1709],
       [ 726, 3405]])

In [309]:
precision_score(y_val, y_pred > 0.5)

0.6658193195150567

In [310]:
recall_score(y_val, y_pred > 0.5)

0.8242556281771968

In [311]:
f1_score(y_val, y_pred > 0.5)

0.7366143861546782

A confusion matrix helps us understand in more detail how well our classifier is performing. An *accuracy score*, like that returned by Kaggle for our submission file, is a single number giving the fraction of predictions that were correct (0 means no prediction was correct, 1 means all of them were). The confusion matrix breaks this down into four values:
* **true negatives** (TN): the model correctly predicts the negative class
* **true positives** (TP): the model correctly predicts the positive class
* **false positives** (FP), a type I error: the model incorrectly predicts the positive class
* **false negatives** (FN), a type II error: the model incorrectly predicts the negative class

scikit-learn returns these four counts in the following layout, hence the name matrix (note that there is no standard convention for the arrangement of this matrix):

<img src='https://miro.medium.com/max/1400/1*xMl_wkMt42Hy8i84zs2WGg.png'>


The *accuracy* is given by $\frac{TN + TP}{TN + TP + FP + FN}$, in other words, the number of correct predictions divided by the total number of predictions. Another measure one may come across is the **$F_1$ score**, which is given by:

$$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$


where the *precision* is given by $\frac{TP}{TP + FP}$, and *recall* by $\frac{TP}{TP + FN}$.
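
As a quick check, the formulas can be applied by hand to the confusion matrix reported earlier for the holdout set. This sketch assumes scikit-learn's layout, where rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix from the holdout evaluation above,
# laid out as [[TN, FP], [FN, TP]]
cm = [[2095, 1709],
      [726, 3405]]
(tn, fp), (fn, tp) = cm

accuracy = (tn + tp) / (tn + tp + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The four results should agree with the values returned by `accuracy_score`, `precision_score`, `recall_score` and `f1_score` in the cells above.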

These Wikipedia pages have excellent descriptions of the meaning of these terms: 
* [Confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)
* [False positives and false negatives](https://en.wikipedia.org/wiki/False_positives_and_false_negatives)
* [Type I and type II errors](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors)
* [Receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
* [F1 score](https://en.wikipedia.org/wiki/F1_score)


## Preparing a submission

Now the same for test data.

In [312]:
df_test_features = pd.read_csv('input/test_features.csv', index_col='match_id_hash')

X_test = df_test_features.values
y_test_pred = model.predict_proba(X_test)[:, 1]

df_submission = pd.DataFrame({'radiant_win_prob': y_test_pred}, 
                                 index=df_test_features.index)

In [313]:
df_submission.head()

| match_id_hash | radiant_win_prob |
|---|---|
| 30cc2d778dca82f2edb568ce9b585caa | 0.511743 |
| 70e5ba30f367cea48793b9003fab9d38 | 0.826493 |
| 4d9ef74d3a2025d79e9423105fd73d41 | 0.66685 |
| 2bb79e0c1eaac1608e5a09c8e0c6a555 | 0.585895 |
| bec17f099b01d67edc82dfb5ce735a43 | 0.470771 |


Save the submission file. It can be handy to include the current datetime in the filename.

In [314]:
df_submission.to_csv('submission.csv')
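
One possible way to put the current datetime into the filename is sketched below. The naming scheme `submission_YYYY-MM-DD_HH-MM-SS.csv` is a hypothetical choice, not something the competition requires.

```python
from datetime import datetime

# Hypothetical naming scheme: submission_YYYY-MM-DD_HH-MM-SS.csv
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
submission_filename = f"submission_{timestamp}.csv"
print(submission_filename)

# With the DataFrame built above, the file could then be written as:
# df_submission.to_csv(submission_filename)
```

Timestamped names keep earlier submissions from being overwritten, which makes it easier to match a leaderboard score to the file that produced it.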

## Cross-validation

<img src='https://linzhenyuyuchen.github.io/img/grid_search_cross_validation.png'>

In [315]:
from sklearn.model_selection import KFold
n_fold = 3
cv = KFold(n_splits=n_fold)  # no shuffling; random_state only takes effect with shuffle=True
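
To make the splitting concrete, here is a small sketch of what `KFold` does on six toy samples (the `X_demo` array and `cv_demo` name are made up for illustration). Without shuffling, each fold's validation set is a contiguous block of rows, and every row is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(6).reshape(-1, 1)  # six toy samples

cv_demo = KFold(n_splits=3)  # no shuffling: folds are contiguous blocks
folds = [(train_idx.tolist(), val_idx.tolist())
         for train_idx, val_idx in cv_demo.split(X_demo)]

for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```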

In [316]:
from sklearn.model_selection import cross_val_score

#### Run cross-validation

We'll evaluate a `RandomForestClassifier` with limited capacity: trees are capped at `max_depth=5`, and `min_samples_leaf=10` means each leaf must contain at least 10 training instances.

In [317]:
%%time

model_rf_cv = RandomForestClassifier(
                               n_estimators=100, 
                               max_features=5,
                               max_depth=5,
                               min_samples_split=10,
                               min_samples_leaf=10,
                               n_jobs=-1, 
                               random_state=17
                              )

# calculate accuracy for each split
cv_scores_rf = cross_val_score(model_rf_cv, X, y, cv=cv, scoring='accuracy')

CPU times: user 1.89 s, sys: 680 ms, total: 2.57 s
Wall time: 7.81 s


In [318]:
cv_scores_rf

array([0.66767486, 0.68196597, 0.66820416])

In [319]:
print('Model 1 mean score:', cv_scores_rf.mean())

Model 1 mean score: 0.6726149968494014


In [331]:
model_best = RandomForestClassifier(bootstrap=False, max_depth=10, max_features=5,
                                    n_estimators=25, n_jobs=-1)


In [332]:
predictions = np.zeros(len(X_test))
average_accuracy = 0

for train_index, val_index in cv.split(X):
    X_train_cv, X_val_cv = X[train_index], X[val_index]
    y_train_cv, y_val_cv = y[train_index], y[val_index]
        
    model_best.fit(X_train_cv, y_train_cv)
    
    y_pred = model_best.predict_proba(X_val_cv)[:, 1]
    
    valid_accuracy = accuracy_score(y_val_cv, y_pred > 0.5)
    
    average_accuracy = average_accuracy + valid_accuracy
    
    predictions += model_best.predict_proba(X_test)[:, 1]
    
predictions = predictions / n_fold
average_accuracy = average_accuracy / n_fold

In [333]:
average_accuracy

0.6880403276622559

In [334]:
predictions

array([0.54258857, 0.76458633, 0.65788168, ..., 0.51697752, 0.73059422,
       0.355413  ])

In [335]:
df_submission = pd.DataFrame({'radiant_win_prob': predictions}, 
                                 index=df_test_features.index)

In [324]:
from sklearn.model_selection import GridSearchCV

param_grid = [
{'n_estimators': [10, 25], 'max_features': [5, 10], 
 'max_depth': [10, 50, None], 'bootstrap': [True, False]}
]

forest = RandomForestClassifier(n_jobs=-1)
grid_search_forest = GridSearchCV(forest, param_grid, cv=3, scoring='accuracy')
grid_search_forest.fit(X, y)

In [325]:
grid_search_forest.best_estimator_

In [326]:
grid_search_forest.best_score_

0.6899558916194076

In [327]:
grid_best = grid_search_forest.best_estimator_.predict_proba(X_test)[:, 1]

In [328]:
grid_best

array([0.51878095, 0.83633729, 0.74949021, ..., 0.5148513 , 0.75626158,
       0.34076616])

In [329]:
from sklearn.model_selection import RandomizedSearchCV
from pprint import pprint

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 200, num = 5)]
# Number of features to consider at every split
# ('auto' is deprecated in recent scikit-learn; for classifiers it meant the same as 'sqrt')
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(1, 45, num = 3)]
# Minimum number of samples required to split a node
min_samples_split = [5, 10]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}

pprint(random_grid)

{'max_depth': [1, 23, 45],
 'max_features': ['auto', 'sqrt'],
 'min_samples_split': [5, 10],
 'n_estimators': [20, 65, 110, 155, 200]}


In [336]:
rf_random = RandomizedSearchCV(estimator=forest, param_distributions=random_grid,
                               n_iter=5, cv=3, verbose=2, random_state=42,
                               n_jobs=-1, scoring='accuracy')
# Fit the random search model
rf_random.fit(X, y)

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[CV] END max_depth=1, max_features=auto, min_samples_split=5, n_estimators=20; total time=   0.9s
[CV] END max_depth=1, max_features=auto, min_samples_split=5, n_estimators=20; total time=   1.3s
[CV] END max_depth=1, max_features=auto, min_samples_split=10, n_estimators=20; total time=   1.6s
[CV] END max_depth=1, max_features=auto, min_samples_split=5, n_estimators=20; total time=   1.8s
[CV] END max_depth=1, max_features=auto, min_samples_split=10, n_estimators=20; total time=   2.0s
[CV] END max_depth=1, max_features=auto, min_samples_split=10, n_estimators=20; total time=   2.3s
[CV] END max_depth=1, max_features=sqrt, min_samples_split=5, n_estimators=155; total time=   6.0s
[CV] END max_depth=45, max_features=auto, min_samples_split=10, n_estimators=20; total time=   7.9s
[CV] END max_depth=45, max_features=auto, min_samples_split=10, n_estimators=20; total time=  10.3s
[CV] END max_depth=1, max_features=sqrt, min_samples_split=5, n_estimators=155; total time=  10.6s
[CV] END max_depth=45, max_features=auto, min_samples_split=10, n_estimators=20; total time=  11.5s
[CV] END max_depth=1, max_features=sqrt, min_samples_split=5, n_estimators=155; total time=   8.8s
[CV] END max_depth=23, max_features=sqrt, min_samples_split=10, n_estimators=65; total time=  20.8s
[CV] END max_depth=23, max_features=sqrt, min_samples_split=10, n_estimators=65; total time=  20.8s
[CV] END max_depth=23, max_features=sqrt, min_samples_split=10, n_estimators=65; total time=  19.9s


In [None]:
rf_random.best_estimator_

In [None]:
rf_random.best_score_

0.697189666036547

In [None]:
### exercise: use the best params to fit a 5-fold RF model

<img src='https://miro.medium.com/max/1400/0*VYAbVhmGMpzUC8hH.jpeg'>

In [355]:
%%time
from xgboost import XGBClassifier

model = XGBClassifier(max_depth=5, 
                      learning_rate=0.01, 
                      n_estimators=100, 
                      subsample=0.8, 
                      colsample_bytree=0.8)

model.fit(X_train, y_train)

CPU times: user 2min 58s, sys: 974 ms, total: 2min 59s
Wall time: 31.2 s


In [351]:
y_pred = model.predict_proba(X_val)

In [352]:
y_pred

array([[0.7686658 , 0.23133421],
       [0.5106107 , 0.48938927],
       [0.48774922, 0.5122508 ],
       ...,
       [0.6185827 , 0.38141724],
       [0.59530365, 0.40469638],
       [0.46836644, 0.53163356]], dtype=float32)

In [None]:
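# y_pred has one column per class, so select column 1 (probability that Radiant wins).
# Passing the full 2-D array instead raises:
# ValueError: Classification metrics can't handle a mix of binary and multilabel-indicator targets
valid_accuracy = accuracy_score(y_val, y_pred[:, 1] > 0.5)
print('Validation accuracy of P>0.5 classifier:', valid_accuracy)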

In [350]:
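# n_estimators is reused from the random forest search above.
# NOTE: 'max_features' is a scikit-learn forest parameter, not an XGBoost one,
# which is why XGBoost warns that it "might not be used";
# 'colsample_bytree' is the closest XGBoost counterpart.
random_grid = {'n_estimators': n_estimators,
               'learning_rate': [0.01, 0.1, 0.2, 0.3],
               'max_features': [50, 100],
               'max_depth': [50, 100]
               }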
# exercise
# grid search parameters with xgb
rf_random = RandomizedSearchCV(estimator=XGBClassifier(n_jobs=-1), param_distributions=random_grid,
                               n_iter=5, cv=3, verbose=2, random_state=42, n_jobs=-1, scoring='accuracy')
# Fit the random search model
rf_random.fit(X, y)


Fitting 3 folds for each of 5 candidates, totalling 15 fits
Parameters: { "max_features" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


KeyboardInterrupt: 

In [354]:
from lightgbm import LGBMClassifier
model = LGBMClassifier() # try to google the important parameters of lightgbm

model.fit(X_train, y_train)

KeyboardInterrupt: 

In [None]:
# combine the predictions of rf, xgb and lgbm
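
One simple way to approach this exercise is to average the class-1 probabilities of the three models. The sketch below uses made-up probability arrays (`pred_rf`, `pred_xgb`, `pred_lgbm` are hypothetical stand-ins for each model's `predict_proba(X_test)[:, 1]`) and an unweighted mean; weighted averages are an equally valid choice.

```python
import numpy as np

# Made-up class-1 probabilities from three models for the same four matches
pred_rf = np.array([0.54, 0.76, 0.66, 0.36])
pred_xgb = np.array([0.50, 0.80, 0.70, 0.30])
pred_lgbm = np.array([0.52, 0.84, 0.62, 0.33])

# Unweighted average of the three probability vectors
pred_blend = np.mean([pred_rf, pred_xgb, pred_lgbm], axis=0)
print(pred_blend)
```

Because all three arrays are probabilities for the same matches in the same order, the blended vector can be placed into the submission DataFrame exactly like a single model's predictions.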
