## DOTA 2 - Predicting wins using heroes, levels and xp gained during matches

#### By: Kenneth Goh, Raymond Ng, Dominic

In this notebook, we use three different types of classifiers to predict the winner of each match.
The data is sourced from Kaggle's Dota 2 Matches dataset, which you can download <a href="https://www.kaggle.com/devinanzelmo/dota-2-matches">here</a>.
We examine the dataset, experiment with more than one classifier, and consider whether an ensemble provides a better result.

### 1. Feature Engineering

First, let us import *pandas* for data wrangling, *numpy* for array manipulation and *sklearn* for the models used in this kernel. The raw dataset stores one row per player rather than one row per match, so we have to engineer the features before we can proceed to the modelling step.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
from tqdm import tqdm  # explicit import instead of a wildcard
%matplotlib inline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
# learning_curve is required by plot_learning_curve defined below
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import metrics
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

We want to understand how the number of training examples affects model performance, so we define a function that plots the learning curve (training and cross-validation scores against training-set size) for use later during training.

In [None]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
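
The function above wraps sklearn's `learning_curve`. As a minimal sketch of what that call returns, the example below runs it on synthetic data; `make_classification`, `LogisticRegression` and the `*_demo` names are stand-ins chosen for illustration and are not part of the Dota 2 pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the match features (assumption: not the Dota 2 data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=3, train_sizes=np.linspace(0.1, 1.0, 5))

# Each score array has one row per training-set size and one column per CV fold
print(sizes)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```

A persistent gap between the two mean curves usually points to overfitting, while two low curves that have converged suggest the model is too simple for the data.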

Then, let us load the pertinent datasets.

Below are their respective descriptions from Kaggle:

1. **players:** Individual players are identified by account_id but there is an option to play anonymously and roughly one third of the account_id are not available. Anonymous users have the value of 0 for account_id. Contains totals for kills, deaths, denies, etc. Player action counts are available, and are indicated by variable names beginning with unit_order. Counts for reasons for acquiring or losing gold, and gaining experience, have prefixes gold, and xp.
2. **matches:** contains top level information about each match.
3. **heroes:** hero lookup
4. **items:** item lookup
5. **Gold:** Gold collected

**NOTE:** If you wish to reuse this notebook offline, download the <a href="https://www.kaggle.com/devinanzelmo/dota-2-matches">Kaggle Dataset</a> and unzip it into the *data* folder.

In [None]:
!ls
heroes = pd.read_csv('../input/hero_names.csv')
matches = pd.read_csv('../input/match.csv')
players = pd.read_csv('../input/players.csv')

In [None]:
players.head()

Let us map the *hero_id* field to its name.

In [None]:
heroes_dict = dict(zip(heroes['hero_id'], heroes['localized_name']))
heroes_dict[0] = 'None'
prep = pd.DataFrame()
# Look up each hero_id in the dictionary (avoid shadowing the builtin `id`)
prep['hero'] = players['hero_id'].apply(lambda hero_id: heroes_dict[hero_id])

In [None]:
print(prep.head())
print(prep.shape)
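
As a small illustration of the id-to-name lookup, the sketch below uses a toy lookup table (the three rows and the `*_demo` names are illustrative only). It uses `Series.map`, a variant of the `apply` call above that yields `NaN` for ids missing from the dictionary instead of raising a `KeyError`.

```python
import pandas as pd

# Toy lookup table; only three heroes, for illustration
heroes_demo = pd.DataFrame({'hero_id': [1, 2, 3],
                            'localized_name': ['Anti-Mage', 'Axe', 'Bane']})
lookup = dict(zip(heroes_demo['hero_id'], heroes_demo['localized_name']))
lookup[0] = 'None'  # id 0 marks a missing hero pick, as in the cell above

picks = pd.Series([2, 0, 3, 1, 7])
# Series.map leaves ids that are not in the dictionary (here 7) as NaN
names = picks.map(lookup)
print(names.tolist())
```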

Let us one-hot encode the hero names so that each hero becomes its own indicator column.

In [None]:
players_heroes = pd.get_dummies(prep['hero'])
print(players_heroes.head())
print(players_heroes.shape)
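
To make the encoding concrete, here is a minimal sketch of `pd.get_dummies` on a toy series of hero names (the four picks are made up for illustration).

```python
import pandas as pd

# Four illustrative picks, one per player row
heroes_demo = pd.Series(['Axe', 'Bane', 'Axe', 'Lina'], name='hero')

# One indicator column per distinct hero, in sorted order
one_hot = pd.get_dummies(heroes_demo)
print(one_hot.columns.tolist())
print(one_hot.astype(int).values.tolist())
```

Summing these indicator rows over the five players of a team then gives a vector that marks which heroes that team picked.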

In [None]:
players_lvl_xp = {
    'level': players['level'],
    'xp_hero': players['xp_hero'],
    'xp_creep': players['xp_creep']
}
players_lvl_xp_labels = players_lvl_xp.keys()

p_lvl_xp = pd.DataFrame(players_lvl_xp, columns=players_lvl_xp_labels)
# fillna returns a new DataFrame, so assign the result back
p_lvl_xp = p_lvl_xp.fillna(0)

Next, we prefix the column names with `r_` (Radiant) and `d_` (Dire) so that the hero picks, levels and xp of each team can be aggregated per match.

In [None]:
r_heroes_cols = list(map(lambda title: 'r_' + str(title), players_heroes.columns.values))
d_heroes_cols = list(map(lambda title: 'd_' + str(title), players_heroes.columns.values))
r_lvl_cols = list(map(lambda title: 'r_' + str(title), p_lvl_xp.columns.values))
d_lvl_cols = list(map(lambda title: 'd_' + str(title), p_lvl_xp.columns.values))

In [None]:
r_hero_list = []
d_hero_list = []
r_lvl_xp_list = []
d_lvl_xp_list = []

# Assumes each match has ten consecutive rows: the first five are Radiant,
# the last five are Dire. `idx` holds the row labels of one match.
for match_id, idx in players.groupby('match_id').groups.items():
    r_hero_list.append(players_heroes.iloc[idx][:5].sum().values)
    d_hero_list.append(players_heroes.iloc[idx][5:].sum().values)
    r_lvl_xp_list.append(p_lvl_xp.iloc[idx][:5].sum().values)
    d_lvl_xp_list.append(p_lvl_xp.iloc[idx][5:].sum().values)
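
The loop above relies on row order to separate the two teams. As an alternative sketch, the split can be made explicit with a team label and a vectorised `groupby`. This assumes a `player_slot` column in which values below 128 denote Radiant, which should be checked against the dataset; the toy frame uses two players per team to stay short.

```python
import pandas as pd

# Toy data: two matches, two players per team (assumption: slot < 128 is Radiant)
demo = pd.DataFrame({
    'match_id':    [0, 0, 0, 0, 1, 1, 1, 1],
    'player_slot': [0, 1, 128, 129, 0, 1, 128, 129],
    'level':       [10, 12, 9, 8, 20, 18, 22, 25],
})

# Label each row with its team, then sum per match and team
demo['team'] = demo['player_slot'].apply(lambda s: 'r' if s < 128 else 'd')
team_totals = demo.groupby(['match_id', 'team'])['level'].sum().unstack('team')
team_totals.columns = [f'{t}_level' for t in team_totals.columns]
print(team_totals)
```

This avoids the Python-level loop and does not depend on the rows of a match being stored in a particular order.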

In [None]:
r_heroes = pd.DataFrame(r_hero_list, columns=r_heroes_cols)
d_heroes = pd.DataFrame(d_hero_list, columns=d_heroes_cols)
r_lvl_xp = pd.DataFrame(r_lvl_xp_list, columns=r_lvl_cols)
d_lvl_xp = pd.DataFrame(d_lvl_xp_list, columns=d_lvl_cols)

# Combine hero picks and level/xp totals of both teams into one feature matrix
X = pd.concat([r_heroes, d_heroes, r_lvl_xp, d_lvl_xp], axis=1)

In [None]:
print(X.shape)
print(X.head())

In [None]:
X.describe()

We want to understand how correlated the data actually is in order to proceed with the modelling since we have 228 features to consider.

In [None]:
print(X.corr())

From the values, we can see that most of the correlation is between xp_hero, xp_creep and the corresponding hero levels, which matches a heuristic understanding of the game. However, since the majority of the columns are not strongly correlated with one another, we keep the full feature set without reducing it.
Plotting the correlation matrix gives a clearer picture.
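
One way to visualise a correlation matrix is a heatmap. The sketch below draws one with matplotlib on a small synthetic frame; the column names and values are invented for illustration, and with the real data `demo` would be replaced by `X` (or a subset of its columns, since 228 labels are hard to read).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in: xp_hero is built to track level, hero flags are random
rng = np.random.default_rng(0)
level = rng.integers(50, 120, size=200)
demo = pd.DataFrame({
    'r_level': level,
    'r_xp_hero': level * 100 + rng.normal(0, 300, size=200),
    'r_Axe': rng.integers(0, 2, size=200),
    'd_Axe': rng.integers(0, 2, size=200),
})
corr = demo.corr()

# Draw the matrix as a colour grid with a shared -1..1 scale
fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(corr.values, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label='Pearson correlation')
fig.tight_layout()
```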

In [None]:
y_arr = OrdinalEncoder().fit_transform(matches['radiant_win'].values.reshape(-1,1))
col = ['r_win']
y = pd.DataFrame(y_arr, columns=col)

### 2. Predictive Modelling

Now that we have our **X** and **y** datasets, let us now proceed to the predictive modelling step.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

1st Attempt using Random Forest

In [None]:
from sklearn.ensemble import VotingClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz

#random seed for reproducibility 
RSEED=88

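# Instantiate a RandomForestClassifier with 50 estimators.
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' has been
# removed from recent scikit-learn releases.
rf = RandomForestClassifier(bootstrap=True,
                            n_estimators=50,
                            max_features='sqrt',
                            random_state=RSEED)

# Fit 'rf' to the training set (ravel turns the one-column frame into a 1-D array)
rf.fit(X_train, y_train.values.ravel())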

# Predict the test set labels 'y_pred'
y_pred_rf = rf.predict(X_test)

# Evaluate the accuracy
rf_acc = accuracy_score(y_test, y_pred_rf)
print(f'RF accuracy: {rf_acc}')

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred_rf))  
print(classification_report(y_test,y_pred_rf))  
print(accuracy_score(y_test, y_pred_rf))  

In [None]:
importances_rf = pd.Series(rf.feature_importances_, index=X_train.columns)

# Sort Importances
sorted_importances_rf = importances_rf.sort_values(ascending=False)
sorted_importances_rf = sorted_importances_rf.head(20)

# Make a horizontal bar plot
sorted_importances_rf.plot(kind='barh', color='lightblue')
plt.show()

In [None]:
estimator = rf.estimators_[8]

# Export as dot file
export_graphviz(estimator, out_file='tree.dot',
                rounded=True, proportion=False,
                precision=2, filled=True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

The first Random Forest attempt achieves an accuracy of 92.88%.

In [None]:
# Optional labels that can be passed to export_graphviz to annotate the tree
feature_names = list(X_train.columns)
class_names = [str(c) for c in rf.classes_]

Hyperparameters to try in order to improve the accuracy of the Random Forest:

In [None]:
#min_samples_leaf=10,
#n_estimators=200,
#max_features='sqrt',  # equivalent to the older 'auto' setting for classifiers
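
A systematic way to try such settings is a cross-validated grid search. The following is a minimal sketch using sklearn's `GridSearchCV` on synthetic data; the grid values are kept small so that it runs quickly and are not tuned recommendations, and `X_demo`/`y_demo` stand in for `X_train`/`y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption: not the Dota 2 data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values for two of the hyperparameters listed above
param_grid = {'n_estimators': [10, 50], 'min_samples_leaf': [1, 10]}

search = GridSearchCV(
    RandomForestClassifier(max_features='sqrt', random_state=88),
    param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The fitted `search.best_estimator_` can then be evaluated on the held-out test set in the same way as `rf`.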

Other important factors can also be seen below by extracting the feature importances of the tree: