<h1><center>Applying Data Science on eSports - A Dota2 Case</center></h1>
<h2><center>Predict the Results of an Upcoming Event</center></h2>

### Introduction
In recent years, several multiplayer competitive games have become popular all over the world, with loyal player bases on the scale of 10 to 100 million, and professional gaming (eSports) has developed rapidly.  As eSports has expanded, new roles such as “Data/Stats Analyst” have been introduced alongside fundamental roles like players and coaches, just as in highly developed “traditional sports” (NBA, FIFA, etc.).  Experience from traditional sports shows that comprehensive scientific analysis of data can help professionals identify potential problems and find better plans to improve performance, from individual training to organization management.  However, data analysis in eSports is less mature than in traditional sports.  Most analyses focus only on collecting and visualizing statistics ([Dotabuff](https://www.dotabuff.com/), [Dotamax](http://dotamax.com/home/), [Nahaz](https://www.youtube.com/channel/UCHgkSS3Vc-TIH1Wd64Hq_dQ) (a famous individual analyst), etc.), and deeper data science techniques are rarely applied.

In this tutorial, we will go through a basic (but complete) data lifecycle to perform analysis on 8 teams in an upcoming Dota2 tournament ([SL i-League Dota 2 Invitational S2](https://starladder.com/en/dota-2-invitational-s2)).  
<img src="files/A.png" alt="Drawing" style="width: 600px;"/>
<h3><center>John P Dickerson, Data Science Prof.  UMD</center></h3>
This process can yield various results and hypotheses, but to keep the tutorial focused and the workflow clear, we set its primary goal as predicting the results of the matches in the entire event with machine learning.

### Contents of the Tutorial
[Environment](#Environment)<br/>
[Data Collection](#Data-Collection)<br/>
[Data Process](#Data-Process )<br/>
[Exploratory Analysis](#Exploratory-Analysis )<br/>
[Machine Learning](#Machine-Learning )<br/>
[Summary and Insights](#Summary-and-Insight )<br/>
[Documentations](#Documentations )<br/>
[References](#References)

### Environment
This tutorial is 

In [1]:
import dota2api
import numpy as np
import pandas as pd
import requests
import time 
import util22
import matplotlib
import matplotlib.pyplot as plt
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

### Data Collection
Unlike other real-world activities, eSports run on computer programs, which means all match data are generated, collected and stored electronically.  Technically, we should be able to collect the data from the game servers without extra physical devices (like GPS trackers in football) or manual input.  However, not all game operators are willing to make match data public or to provide all details of the raw data.  For Dota 2, we have the following options:

1. The official API provided by Valve Software (including third-party databases like [Dotabuff](https://www.dotabuff.com/) that are based on this API).
2. Replay-parsing-based databases.
3. Manual collection.

Option 2 can provide more details than Option 1 because the replay of a match covers all information while the API does not.  In addition, because some changes (roster registration, etc.) are logged incorrectly, the information on the game server might be wrong.  To collect the data needed for this tutorial, we will combine all three options.  We will use [Datdota](http://datdota.com/) (Option 2) to download the match lists of all 8 teams in our event and merge them into our main dataset, and then use the official API to get the details of the matches in the dataset.  Some trivial information that cannot be accessed programmatically, such as “which 8 teams are in the event”, will be collected manually.


We first look up the 8 teams in the event on [Liquipedia](http://wiki.teamliquid.net/dota2/StarLadder/i-League_Invitational/2) and record them in a list.
<img src="files/B.png" alt="Drawing" style="width: 600px;"/>
<h3><center>Participants from Liquipedia</center></h3>

In [2]:
team_lst = ['Alliance','Newbee','Team Faceless','Team Liquid','TNC Pro Team','Vega Squadron','Invictus Gaming','Team VGJ']


We can then download one csv file per team from [Datdota](http://datdota.com/), 8 files in total. (The GitHub repo contains the files I downloaded in /dataset; since there are ongoing events, the records might change at any time.)  These files contain all professional matches that the 8 teams played and that are recorded on the Dota2 game server.
We load the files as dataframes and then merge them.


In [3]:
# load the csv files
ig = pd.read_csv("dataset/ig.csv")
nb = pd.read_csv("dataset/nb.csv")
alli = pd.read_csv("dataset/Alliance.csv")
fl = pd.read_csv("dataset/fl.csv")
tl = pd.read_csv("dataset/Team Liquid.csv")
vs = pd.read_csv("dataset/Vega Squadron.csv")
vgj = pd.read_csv("dataset/VGJ.csv")
tnc = pd.read_csv("dataset/TNC Pro Team.csv")
# merge as one called dataset
frames = [alli,nb,fl,tl,tnc,vs,ig,vgj]
dataset = pd.concat(frames, keys=team_lst)
dataset.head()

| Team (index) | Row (index) | Match | Date | League | Opponent | Result |
|---|---|---|---|---|---|---|
| Alliance | 0 | 3178869138 | 14 May 2017 | 4442 | Team Empire | Loss |
| Alliance | 1 | 3178589992 | 14 May 2017 | 4442 | Team Empire | Win |
| Alliance | 2 | 3178374532 | 14 May 2017 | 4442 | Team Empire | Loss |
| Alliance | 3 | 3176125586 | 13 May 2017 | 4442 | Natus Vincere | Loss |
| Alliance | 4 | 3175916756 | 13 May 2017 | 4442 | Natus Vincere | Loss |
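
To make the two-level index in the output above less mysterious, here is a minimal sketch of what `pd.concat(..., keys=...)` does. The two frames are made-up stand-ins, not real match records, and the variable names are only for illustration.

```python
import pandas as pd

# Two tiny stand-in frames (made-up rows, not real match records).
a = pd.DataFrame({"Match": [101, 102],
                  "Opponent": ["Newbee", "Team X"],
                  "Result": ["Win", "Loss"]})
b = pd.DataFrame({"Match": [101, 103],
                  "Opponent": ["Alliance", "Team Y"],
                  "Result": ["Loss", "Win"]})

# keys= adds an outer index level naming the frame each row came from.
merged = pd.concat([a, b], keys=["Alliance", "Newbee"])

# The outer level can be read back as a plain column ...
merged["Team"] = merged.index.get_level_values(0)
# ... and used to select the rows of one team.
alliance_rows = merged.loc["Alliance"]

# Match 101 appears once per team; keep only the first record of each match.
unique_matches = merged.drop_duplicates(subset=["Match"])
print(unique_matches)
```

The outer index level is what later allows a single team's matches to be selected with `.loc`, and it also explains why a match between two of the 8 teams shows up twice in the merged data.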


Now we have a list of all matches those 8 teams played.  To perform machine learning on the match data, we need more details for each match, which we get through dota2api.

Note: dota2api requires a special key from Steam.  Follow the tutorial here before the next step: https://dota2api.readthedocs.io/en/latest/tutorial.html


In [7]:
# initialize the dota2api
api = dota2api.Initialise()
# get the match details by dota2api and append the raw details as a new column
lst = [] # empty list to store the results
for matchid in dataset['Match'].tolist():
    # get the detail
    detail = api.get_match_details(match_id=matchid)
    # add to the result list
    lst.append(detail)
    # dota2api asks for no more than 1 request per second.
    # please read the docs of dota2api and the Steam Web API for more details.
    # because of this request limit, the loop takes a long time.  To save time, you could use
    # my dataset 'dataset.csv' dumped in my repo or another dumped source.
    time.sleep(1)
    #print (len(lst)) # debug code
    
    # this code could be revised to retry the current match when the API server raises an exception.
    # please read http://stackoverflow.com/questions/2083987/how-to-retry-after-exception-in-python for more.

APITimeoutError: 'HTTP 503: Please try again later.'
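
The `HTTP 503` above is the kind of transient failure that a retry wrapper can absorb. The following is a minimal sketch, not part of dota2api: `call_with_retry` and its parameters are hypothetical names introduced here, and the exception types to catch are left to the caller because the exact exception classes should be checked in the dota2api docs.

```python
import time

def call_with_retry(func, *args, retries=3, delay=1.0, exceptions=(Exception,), **kwargs):
    """Call func(*args, **kwargs), retrying when one of `exceptions` is raised.

    Waits `delay` seconds before each retry and doubles the wait every time.
    Re-raises the last exception once `retries` attempts have failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == retries:
                raise
            time.sleep(delay)
            delay *= 2
```

In the loop above, the direct call could then be replaced by something like `detail = call_with_retry(api.get_match_details, match_id=matchid, retries=5)`, while keeping the one-second pause between matches.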

In [None]:
# add the details to the dataset as a new column
dataset['Detail'] = lst
# dump the dataset to a pickle file
dataset.to_pickle('dataset.pkl')
dataset.head()
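
One practical reason to dump with `to_pickle` rather than to a csv file is that the 'Detail' column holds nested dictionaries. The sketch below uses a single made-up row and temporary files to show the difference; it is an illustration, not a statement about how the repo's dumped files were produced.

```python
import os
import tempfile
import pandas as pd

# One made-up row whose 'Detail' cell is a nested dict, like an API response.
df = pd.DataFrame({"Match": [1],
                   "Detail": [{"duration": 1800, "radiant_win": True}]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "demo.pkl")
    csv_path = os.path.join(tmp, "demo.csv")
    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle keeps the dict; csv only keeps its text representation.
pickle_type = type(from_pickle["Detail"].iloc[0])
csv_type = type(from_csv["Detail"].iloc[0])
print(pickle_type, csv_type)
```

With the pickle, `pd.read_pickle('dataset.pkl')` gives back cells that can be indexed directly, for example `detail['duration']`, whereas a csv round trip would require parsing each cell again.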

Now we have all the raw data of the match history.  However, before we can do actual analysis, we need to process the data into more readable forms, especially the ‘Detail’ column.


### Data Process
We will extract features like ‘first blood time’ and ‘radiant or dire’ (the side the team plays on in the game), and add the patch version based on the date of the match.

In [None]:
# turn the team level of the index into a column
dataset['Team'] = dataset.index.get_level_values(0)
# drop the duplicate matches (a match between two of the 8 teams is recorded once in each team's section).
dataset = dataset.drop_duplicates(subset=['Match'])
dataset.head()

In [None]:
# add versions
lst = []
for date in dataset['Date']:
    ver = util22.version(str(date)) # use the version() function from the util22 module
    lst.append(ver)
dataset['Version'] = lst
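
The helper `util22.version` is not shown in this tutorial. As a rough idea of how a date-to-patch lookup could work, here is a hypothetical sketch: `PATCHES` and `version_for` are names invented for this example, and the two table entries are only illustrative and should be checked against the official patch notes before use.

```python
from bisect import bisect_right
from datetime import datetime

# Illustrative (release date, patch) pairs in ascending date order.
# Assumption: verify these dates and extend the table before relying on it.
PATCHES = [
    (datetime(2016, 6, 12), "6.88"),
    (datetime(2016, 12, 12), "7.00"),
]
_RELEASE_DATES = [released for released, _ in PATCHES]

def version_for(date_str):
    """Return the latest patch released on or before date_str.

    date_str uses the '14 May 2017' layout of the 'Date' column.
    Returns None when the date is earlier than every entry in the table.
    """
    day = datetime.strptime(date_str, "%d %b %Y")
    pos = bisect_right(_RELEASE_DATES, day)
    return PATCHES[pos - 1][1] if pos else None
```

Because the release dates are sorted, `bisect_right` finds the matching patch in logarithmic time, which keeps the lookup cheap even when it runs once per match.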

In [None]:
# parse other features
# (column suffix, per-player API field) pairs; each field is summed over the 5 players of a side
stat_cols = [('GPM', 'gold_per_min'), ('XPM', 'xp_per_min'), ('kill', 'kills'),
             ('death', 'deaths'), ('assistant', 'assists'),
             ('last hit', 'last_hits'), ('denies', 'denies')]
dlst = dataset['Detail'].tolist()
teamlst = dataset['Team'].tolist()
oppolst = dataset['Opponent'].tolist()
reslst = dataset['Result'].tolist()
fb, dua, radiant, dire = [], [], [], []
totals = {side + '_' + col: [] for side in ('Ra', 'Di') for col, _ in stat_cols}
for detail, team, oppo, result in zip(dlst, teamlst, oppolst, reslst):
    fb.append(detail['first_blood_time'])
    dua.append(detail['duration'])
    # tell the sides apart by win/loss: the team played Radiant
    # if it won and Radiant won, or if it lost and Radiant lost
    try:
        team_is_radiant = (result == 'Win') == bool(detail['radiant_win'])
        radiant.append(team if team_is_radiant else oppo)
        dire.append(oppo if team_is_radiant else team)
    except KeyError:
        radiant.append(np.nan)
        dire.append(np.nan)
    # add the features of the 5 players up to the team's features
    # (players 0-4 are treated as Radiant, players 5-9 as Dire)
    for side, players in (('Ra', detail['players'][:5]), ('Di', detail['players'][5:10])):
        for col, key in stat_cols:
            totals[side + '_' + col].append(sum(p[key] for p in players))

# add to dataframe
for side in ('Ra', 'Di'):
    for col, _ in stat_cols:
        dataset[side + '_' + col] = totals[side + '_' + col]
dataset['Duration'] = dua
dataset['First_blood'] = fb
dataset['Radiant'] = radiant

# define a function to convert Win/Loss to 1/0
def winrate(s):
    if s == 'Win':
        return 1
    else:
        return 0
dataset['Result'] = dataset['Result'].map(winrate)
dataset.head()

Now we have the basic performance data of each team to evaluate how well they played in all pro matches.
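
To check the side-inference and summing logic on something small, the following sketch runs the same idea on one made-up match. `summarize_match` and `fake_detail` are hypothetical names for this example, only two of the stats are summed, and the code assumes that `players` lists the 5 Radiant players first and the 5 Dire players second.

```python
def summarize_match(detail, team, opponent, result):
    """Infer the sides from the result and sum per-player stats for each side.

    Assumes detail['players'] lists the 5 Radiant players first,
    followed by the 5 Dire players.
    """
    # The team played Radiant exactly when "team won" and "Radiant won" agree.
    team_is_radiant = (result == "Win") == bool(detail["radiant_win"])
    summary = {
        "Radiant": team if team_is_radiant else opponent,
        "Dire": opponent if team_is_radiant else team,
    }
    sides = (("Ra", detail["players"][:5]), ("Di", detail["players"][5:10]))
    for side, players in sides:
        summary[side + "_GPM"] = sum(p["gold_per_min"] for p in players)
        summary[side + "_kill"] = sum(p["kills"] for p in players)
    return summary

# Made-up match: each Radiant player has 500 GPM and 4 kills,
# each Dire player has 400 GPM and 2 kills, and Dire wins.
fake_detail = {
    "radiant_win": False,
    "players": [{"gold_per_min": 500, "kills": 4}] * 5
               + [{"gold_per_min": 400, "kills": 2}] * 5,
}
summary = summarize_match(fake_detail, "Alliance", "Team Empire", "Win")
print(summary)
```

Here Alliance won while Radiant lost, so Alliance is placed on the Dire side. The Steam Web API also reports a `player_slot` value for each player, which may be a more robust way to split the sides than relying on list position.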

### Exploratory Analysis
For eSports, frequent roster changes and patch updates make it hard to evaluate the performance of a pro team from a historical perspective.  Typically, a team plays only about 100-200 matches on a given version with a given roster.  These changes can make data from other versions and time periods far less meaningful and make it difficult to identify the actual factors that influence the team.  

For example, team Alliance won 6 Premier-class championships in 2013 and only 2 in 2014, with a drop in world ranking from No.1 to 20+.  Just by looking at GPM (gold per minute) and XPM (experience per minute), we might say that the players lost their advantage in gaining gold and XP.  But since a patch in 2014 changed the economy system of the game, even if the players were doing as well as in 2013, they still could not reach the same GPM and XPM.  

Furthermore, the lower winrate means Alliance lost more matches in 2014 than in 2013, which would also cause a drop in GPM and XPM, since losing teams typically have weaker performance data. This is why current data analysis for Dota2 usually focuses only on the recent data of a pro team, under the current version and versions without major changes.

We can group the matches of Alliance by month to get their average winrate.


In [None]:
# copy the Alliance rows of the dataset
df = dataset.loc['Alliance'].copy()
# get the data of Alliance itself (as opposed to its opponent)
is_radiant = df['Radiant'] == 'Alliance'
df['GPM'] = np.where(is_radiant, df['Ra_GPM'], df['Di_GPM'])
df['XPM'] = np.where(is_radiant, df['Ra_XPM'], df['Di_XPM'])
# change the date into month-year
df['Month'] = df['Date'].map(lambda s: s.split(" ")[1]+s.split(" ")[2]).map(lambda s: datetime.strptime(s, '%b%Y'))
# group by month and take the average of the columns we plot
dfm = df.groupby(['Month'],as_index=False)[['Result','GPM','XPM']].mean()
# plot the winrate by time
plt.figure(1, figsize=(12,8))
plt.plot(dfm['Month'],dfm['Result'],'o')
plt.xlabel('Time')
plt.ylabel('Winrate')
plt.title('Figure 1: The winrate of Alliance across time')
plt.show()
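
The monthly grouping can be tried on a few rows first. This is a minimal sketch on made-up results in the same 'Date' and 'Result' layout; it uses `pd.to_datetime` instead of string splitting, and it also counts the matches per month, since a monthly winrate based on very few matches is noisy.

```python
import pandas as pd

# Made-up results in the dataset's layout (1 = win, 0 = loss).
toy = pd.DataFrame({
    "Date": ["03 Aug 2013", "17 Aug 2013", "25 Aug 2013",
             "02 Sep 2013", "20 Sep 2013"],
    "Result": [1, 1, 0, 0, 1],
})

# Parse the day-month-year strings and truncate each date
# to the first day of its month.
parsed = pd.to_datetime(toy["Date"], format="%d %b %Y")
toy["Month"] = parsed.dt.to_period("M").dt.to_timestamp()

# The mean of a 0/1 column is the winrate; the count shows
# how many matches stand behind each monthly value.
monthly = toy.groupby("Month")["Result"].agg(["mean", "count"])
print(monthly)
```

Months with a low count could be filtered out or marked before plotting, so that a single win or loss does not read as a 100% or 0% month.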

As mentioned before, Alliance had a great winrate around 2013-08 (nearly 90%) and had a noticeable 

In [None]:
plt.figure(2)
plt.plot(dfm['Month'],dfm['GPM'],'o',label='GPM')
plt.plot(dfm['Month'],dfm['XPM'],'o',label='XPM')
plt.xlabel('Time')
plt.ylabel('GPM/XPM')
plt.legend()
plt.title('Figure 2: The Performance of Alliance across time')
plt.show()

### Machine Learning

### Summary and Insight

### Documentations
Pandas: http://pandas.pydata.org/<br/>
Dota2API: https://dota2api.readthedocs.io/en/latest/#<br/>
Numpy: http://www.numpy.org/<br/>
Scikit-learn: http://scikit-learn.org/stable/<br/>
requests: http://docs.python-requests.org/en/master/<br/>
matplotlib: https://matplotlib.org/<br/>

### References