<img src="images/playvs-league.png" class="inline"/>

**<p><center><font size="6.5">Predicting the Chance of Winning in League of Legends</font></center></p>**
*<p style="color: darkgrey;"><center>Jiaxuan Wu, Yufei Huang, Geng Liu</center></p>*

**<p><font size="7"><font color="SkyBlue">Outline</font></font></p>**

<ol>
   <li><font color="SkyBlue"><strong>Introduction</strong></font></li>
	<li><font color="SkyBlue"><strong>Getting Started:</strong></font>
	<ol>
		<li><font color="SkyBlue">Required Libraries</font></li>
		<li><font color="SkyBlue">Dataset Source</font></li>
		<li><font color="SkyBlue">Reading Data</font></li>
		<li><font color="SkyBlue">Tidying and Modifying Data</font></li>
	</ol>
	</li>
	<li><font color="SkyBlue"><strong>Exploratory Data Analysis:</strong></font>
   <ol>
		<li><font color="SkyBlue">Explore the Usage Frequency of ADC & Support Combination</font></li>
		<li><font color="SkyBlue">Explore the Gold Collection of ADC & Support Combination</font></li>
	</ol>
	</li>
	<li><font color="SkyBlue"><strong>Hypothesis Testing:</strong></font>
   <ol>
      <li><font color="SkyBlue">Null Hypothesis</font></li>
      <li><font color="SkyBlue">Multiple Linear Regression with Scikit-learn</font></li>
      <li><font color="SkyBlue">Ordinary Least Squares Regression</font></li>
   </ol>
   </li>
	<li><font color="SkyBlue"><strong>Predict with Machine Learning:</strong></font>
     <ol>
       <li><font color="SkyBlue">Predictions with Train/Test Split</font></li>
      <li><font color="SkyBlue">Predictions vs. Actual Values Plots</font></li>
    </ol>
	</li>
	<li><font color="SkyBlue"><strong>Conclusion:</strong></font></li>
</ol>

**<p><font size="6"><font color="SkyBlue">1. Introduction</font></font></p>**

<a href="https://play.na.leagueoflegends.com/en_US">League of Legends</a> (abbreviated LoL) is a multiplayer online battle arena video game developed and published by Riot Games. The game is available on both Windows and macOS. Released on October 27, 2009, League of Legends has since grown in popularity and has an active and expansive fanbase. In September 2016, the company estimated that there were over 100 million active players each month. LoL also has a professional league. The top prize for the best team is over five million dollars and the average player makes a six-figure income.

<img src="images/lol lines.png" class="inline"/> 

The goal of the game is to defeat the enemy team by destroying its nexus, which lies in the enemy base. Each game has two teams, the blue team and the red team, with 5 players each. Before the game begins, each player selects one of 142 available champions, each with a unique set of abilities. Teammates then work together to defend against the enemy's attacks. The game ends when one team destroys the enemy team’s nexus or one team surrenders.

The map, which is known as “Summoner’s Rift”, is divided into two sides by a diagonal line, known as the “river”. The blue team’s territory lies below the river, and the red team’s territory is above it. The yellow paths, known as “lanes”, are guarded by defensive structures, known as “turrets”, and automatically created characters known as “minions”. The lanes are called the “top”, “middle” and “bottom” lanes according to their positions on the map.

In the early phase of the game, each player usually keeps their champion in one lane to level up and earn gold until the team is strong enough to fight together. In this tutorial, we focus on the two champions guarding the bottom lane, often called “ADC” and “support”. Champions suited to the “ADC” role are ranged champions that can deal high physical damage. They also tend to have weak defense and low health, so they are accompanied by a “support” champion, who can heal and shield them and harass the enemy champions.

In this tutorial, we clean the game statistics from all international matches since 2015, then analyze the dataset to gain insight into the relationship between bottom lane champion combinations and victories. We use hypothesis testing to check whether the two are related, and then use regression analysis to predict wins or losses based on champion combinations.


**<p><font size="6"><font color="SkyBlue">2. Getting Started</font></font></p>**

**<p><font size="4.5"><font color="SkyBlue">A. Required Libraries</font></font></p>**
<ul>
    <li style="box-sizing: border-box;"><a href="https://pandas.pydata.org/">Pandas</a>: for data display and partitioning</li>
    <li style="box-sizing: border-box;"><a href="http://www.numpy.org/">Numpy</a>: for scientific computing with Python</li>
	<li style="box-sizing: border-box;"><a href="https://matplotlib.org/tutorials/introductory/pyplot.html">Matplotlib.pyplot</a>: for plotting Pandas data into various graphs and charts</li>
    <li style="box-sizing: border-box;"><a href="https://seaborn.pydata.org/">Seaborn</a>: for graphics on top of Matplotlib</li>
    <li style="box-sizing: border-box;"><a href="https://docs.python.org/3/library/ast.html">Abstract Syntax Trees</a>: for evaluating Python literals</li>
    <li style="box-sizing: border-box;"><a href="https://scikit-learn.org/stable/">scikit-learn</a>: machine learning library
	<ul style="box-sizing: border-box; margin: 0em 2em; list-style: square; padding-left: 0px;">
		<li style="box-sizing: border-box;">linear_model:&nbsp;calculate models for multiple linear regression</li>
		<li style="box-sizing: border-box;">model_selection: split up dataset into test and training data and evaluate predictions</li>
	</ul>
	</li>
	<li style="box-sizing: border-box;"><a href="https://www.statsmodels.org/stable/index.html">statsmodels</a> - api:&nbsp;calculate models and statistics with multiple linear regression</li>
</ul>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval
from scipy import stats
from sklearn import linear_model
from sklearn import model_selection
from statsmodels import api as sm

**<p><font size="4.5"><font color="SkyBlue">B. Dataset Source</font></font></p>**

The dataset we use aggregates basic statistics for every game since 2015 across all the professional leagues other than China's. It contains detailed information on many aspects of each player's performance, such as the number of kills, the times at which turrets were destroyed, and the amount of gold collected by each champion. In this tutorial, we are concerned with the quantitative statistics of the teams and their ADC & Support combinations in international matches.

The dataset was found on Kaggle. 
<a href="https://www.kaggle.com/chuckephron/leagueoflegends">https://www.kaggle.com/chuckephron/leagueoflegends</a>

**<p><font size="4.5"><font color="SkyBlue">C. Reading Data</font></font></p>**

We first load the LeagueofLegends.csv file. We keep only the rows for international matches, because these matches have more comprehensive data. We then drop the columns that we do not need: "League", "Year", "Season", and "Type". Finally, we display the first 5 rows to get a sense of the data.

In [None]:
df = pd.read_csv("LeagueofLegends.csv", sep=',')
df = df[df['Type'] == 'International']
df.drop(['League', 'Year', 'Season', 'Type'],axis=1, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

**<p><font size="4.5"><font color="SkyBlue">D. Tidying and Modifying Data</font></font></p>**

To better explore the data, we convert each string-encoded column to integers, lists, or whatever type it needs to be.

In [None]:
# Convert String into lists in the table
df['goldblue'] = df['goldblue'].apply(literal_eval)
df['bKills'] = df['bKills'].apply(literal_eval)
df['bTowers'] = df['bTowers'].apply(literal_eval)
df['bInhibs'] = df['bInhibs'].apply(literal_eval)
df['bDragons'] = df['bDragons'].apply(literal_eval)
df['bBarons'] = df['bBarons'].apply(literal_eval)
df['bHeralds'] = df['bHeralds'].apply(literal_eval)

df['goldred'] = df['goldred'].apply(literal_eval)
df['rKills'] = df['rKills'].apply(literal_eval)
df['rTowers'] = df['rTowers'].apply(literal_eval)
df['rInhibs'] = df['rInhibs'].apply(literal_eval)
df['rDragons'] = df['rDragons'].apply(literal_eval)
df['rBarons'] = df['rBarons'].apply(literal_eval)
df['rHeralds'] = df['rHeralds'].apply(literal_eval)

df['goldblueTop'] = df['goldblueTop'].apply(literal_eval)
df['goldblueJungle'] = df['goldblueJungle'].apply(literal_eval)
df['goldblueMiddle'] = df['goldblueMiddle'].apply(literal_eval)
df['goldblueADC'] = df['goldblueADC'].apply(literal_eval)
df['goldblueSupport'] = df['goldblueSupport'].apply(literal_eval)

df['goldredTop'] = df['goldredTop'].apply(literal_eval)
df['goldredJungle'] = df['goldredJungle'].apply(literal_eval)
df['goldredMiddle'] = df['goldredMiddle'].apply(literal_eval)
df['goldredADC'] = df['goldredADC'].apply(literal_eval)
df['goldredSupport'] = df['goldredSupport'].apply(literal_eval)
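
As a small illustration of what `literal_eval` does in the cell above: the list-valued columns arrive from the CSV as text, so each cell is a string that merely looks like a Python list. The sample string below is made up for illustration and is not taken from the dataset.

```python
from ast import literal_eval

# Hypothetical cell value: a per-minute gold record stored as text.
raw = "[2415, 2430, 2680, 3100]"

# literal_eval parses Python literals only, so no arbitrary code is executed.
parsed = literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0] + parsed[1])  # arithmetic now works on the integer entries
```

Because `literal_eval` accepts only literals, it is a safer choice than `eval` for data read from a file.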

Since we need the statistics of champion combinations and victories, we can ignore team information such as the red and blue tags in the column names.

We first isolate the red team's and the blue team's data into two data frames. Then, we rename the columns to neutral terms without a team tag. After that, we concatenate the two data frames into one, since they now share the same column names. Now we have the statistics of every team in one table.

In [None]:
df_red = df.copy(deep=True)
df_blue = df.copy(deep=True)

# We create two separate dataframes to contain data for either red team or blue team, then rename the columns to have neutral terms.
df_red.drop(columns = ['blueTeamTag', 'bResult', 'goldblue', 'bKills', 'bTowers', 'bInhibs', 'bDragons',\
                       'bBarons', 'bHeralds', 'blueTop', 'blueTopChamp', 'goldblueTop', \
                       'blueJungle', 'blueJungleChamp', 'goldblueJungle', 'blueMiddle', \
                       'blueMiddleChamp', 'goldblueMiddle', 'blueADC', 'blueADCChamp', 'goldblueADC', \
                       'blueSupport', 'blueSupportChamp', 'goldblueSupport', 'blueBans'], inplace=True)
df_blue.drop(columns = ['redTeamTag', 'rResult', 'goldred', 'rKills', 'rTowers', 'rInhibs', 'rDragons',\
                       'rBarons', 'rHeralds', 'redTop', 'redTopChamp', 'goldredTop', \
                       'redJungle', 'redJungleChamp', 'goldredJungle', 'redMiddle', \
                       'redMiddleChamp', 'goldredMiddle', 'redADC', 'redADCChamp', 'goldredADC', \
                       'redSupport', 'redSupportChamp', 'goldredSupport', 'redBans'], inplace=True)

df_red.rename({'redTeamTag': 'teamTag', 'rResult': 'result', 'goldred': 'gold',\
               'rKills': 'kills', 'rTowers': 'towers', 'rInhibs': 'inhibs',\
               'rDragons': 'dragons', 'rBarons': 'barons','rHeralds': 'heralds',\
               'redTop': 'top', 'redTopChamp': 'topChamp', 'goldredTop': 'goldTop',\
               'redJungle': 'jungle', 'redJungleChamp': 'jungleChamp', 'goldredJungle': 'goldJungle',\
               'redMiddle': 'middle', 'redMiddleChamp': 'middleChamp','goldredMiddle': 'goldMiddle',\
               'redADC': 'ADC', 'redADCChamp': 'ADCChamp', 'goldredADC': 'goldADC', \
               'redSupport': 'support', 'redSupportChamp': 'supportChamp', 'goldredSupport': 'goldSupport',\
               'redBans': 'bans'}, inplace=True, axis='columns')

df_blue.rename({'blueTeamTag': 'teamTag','bResult': 'result','goldblue': 'gold',\
               'bKills': 'kills','bTowers': 'towers', 'bInhibs': 'inhibs',\
               'bDragons': 'dragons', 'bBarons': 'barons', 'bHeralds': 'heralds',\
               'blueTop': 'top', 'blueTopChamp': 'topChamp','goldblueTop': 'goldTop',\
               'blueJungle': 'jungle','blueJungleChamp': 'jungleChamp','goldblueJungle': 'goldJungle',\
               'blueMiddle': 'middle', 'blueMiddleChamp': 'middleChamp','goldblueMiddle': 'goldMiddle',\
               'blueADC': 'ADC','blueADCChamp': 'ADCChamp','goldblueADC': 'goldADC', \
               'blueSupport': 'support','blueSupportChamp': 'supportChamp','goldblueSupport': 'goldSupport',\
               'blueBans': 'bans'}, inplace=True, axis='columns')

# Since two data frames have same columns, we can simply concatenate them into one data frame.
df_all = pd.concat([df_red, df_blue], sort = False).reset_index()
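
The same reshaping idea can be seen on a toy table. This is a minimal sketch with made-up team tags and only two columns per side; the real table has many more columns, but the pattern (select one side, rename to neutral names, stack) is the same.

```python
import pandas as pd

# Toy stand-in for the match table: one row per match, team-prefixed columns.
matches = pd.DataFrame({
    "blueTeamTag": ["AAA", "CCC"],
    "bResult": [1, 0],
    "redTeamTag": ["BBB", "DDD"],
    "rResult": [0, 1],
})

# Keep one side's columns and rename them to neutral terms.
blue = matches[["blueTeamTag", "bResult"]].rename(
    columns={"blueTeamTag": "teamTag", "bResult": "result"})
red = matches[["redTeamTag", "rResult"]].rename(
    columns={"redTeamTag": "teamTag", "rResult": "result"})

# Stack the two sides: every match now contributes two rows, one per team.
teams = pd.concat([red, blue], ignore_index=True)
print(teams)
```

After stacking, each match appears twice, once from each team's point of view, so exactly half of the rows are wins.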

**<p><font size="6"><font color="SkyBlue">3. Exploratory Data Analysis</font></font></p>**

**<p><font size="4.5"><font color="SkyBlue">A. Explore the Usage Frequency of ADC & Support Combination</font></font></p>**

First, we want to explore the frequency of all ADC & Support combinations that appeared in international matches. We will use Pandas groupby() to combine all the rows with the same ADC champion and support champion, sum the number of wins of each combination, and count the occurrences of each combination.

In [None]:
# Group by adc champion and support champion, then aggregate other data.
df_combo = df_all.groupby(['ADCChamp', 'supportChamp']).agg({'result' :  'sum',
                                                             'ADC' : 'count',}).reset_index()
combo_size = df_all.shape[0]
df_combo.rename({'ADC':'frequency', 'result':'wins'}, inplace=True, axis='columns')

# Win ratio is proportion of wins to number of times the combination was used.
df_combo['winRatio'] = df_combo['wins'] / df_combo['frequency']

# Get a sense on the size of data
df_combo.shape
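
To make the aggregation concrete, here is a minimal sketch on a four-row toy table (the rows are made up). It uses pandas named aggregation, which produces the `wins` and `frequency` columns directly instead of renaming them afterwards; the result is equivalent to the cell above.

```python
import pandas as pd

# Toy stand-in for df_all: one row per team per match.
toy = pd.DataFrame({
    "ADCChamp": ["Lucian", "Lucian", "Lucian", "Sivir"],
    "supportChamp": ["Thresh", "Thresh", "Alistar", "Thresh"],
    "result": [1, 0, 1, 1],
})

# result is 1 for a win and 0 for a loss, so its sum counts wins
# and its count gives the number of times the pair was played.
combo = (toy.groupby(["ADCChamp", "supportChamp"])
            .agg(wins=("result", "sum"), frequency=("result", "count"))
            .reset_index())
combo["winRatio"] = combo["wins"] / combo["frequency"]
print(combo)
```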

We have 272 ADC and support champion combinations. To better visualize the data, we select the 20 most frequent combinations and plot them on a bar plot.

In [None]:
# Create a column named "combo" that contains the ADC champion name and the support champion name
df_combo['combo'] = df_combo['ADCChamp'] + ', ' + df_combo['supportChamp']
df_combo.sort_values(by='frequency', inplace=True, ascending=False)
df_combo_top = df_combo.head(20)


f, ax = plt.subplots(figsize=(12,9))
freq_map = sns.barplot(x='combo', y='frequency', data=df_combo_top )
freq_map.set_xticklabels(freq_map.get_xticklabels(),rotation=90)
ax.set_title("20 Most Frequent Usage Combination of ADC and Support")
ax.set_xlabel("ADC and Support Combination")
ax.set_ylabel("Usage Frequency")
plt.show()

### Usage Frequency and Victories

Looking at the frequency of usage, we notice that the combinations are not chosen evenly. Many combinations are chosen only once or twice, while others are used more than 20 times. Therefore we don't want to plot the number of wins of each combination directly, because the number of wins can never exceed the frequency of usage. Instead, we plot the ratio of victories to frequency of usage, in order to identify combinations that are used often and have a high win ratio.

In [None]:
df_combo['winOverall'] = df_combo['wins']/combo_size
df_combo['frequenceOverall'] = df_combo['frequency']/combo_size

f, ax = plt.subplots(figsize=(15,12))
combo_win = sns.scatterplot(x='frequency', y='winRatio', data=df_combo, alpha=0.2, s=200, palette=sns.color_palette("RdBu", n_colors=7), ax=ax)
ax.set_title("ADC and Support Combination Win Ratio to Frequency")
ax.set_xlabel("Usage Frequency")
ax.set_ylabel("Win Ratio")

for i, r in df_combo.iterrows():
    if (r['frequency'] > 15):
        ax.text(r['frequency'] + 0.02, r['winRatio'], r['combo'])
plt.plot([0, 35], [0.5, 0.5], linewidth=2)
plt.show()

The shade of the dots indicates how many combinations share that particular ratio of victories to usage. We observe a regular pattern on the left side of the plot because many combinations are used only once or twice. For those combinations, the possible win ratios are limited, thus forming a mirrored distribution (e.g., if a combination is used only once, it either won or lost that match, so the possible data points are (1,1) and (1,0)).

We also observe that not all of the most used combinations have a good win ratio. A good proportion of them have a win ratio of less than 0.5.

### Champion Choice and Victories

To gain more insight on choice of champion and victory, we are constructing a heat map of champions of ADC and support, and the win ratio of that unique combination.

In [None]:
cmap = sns.diverging_palette(220,10, as_cmap=True)

pivot = df_combo.pivot(index='supportChamp', columns='ADCChamp', values='winRatio')
f, ax = plt.subplots(figsize=(18,15))
combo_win = sns.heatmap(pivot, cmap=sns.cubehelix_palette(8), square=True, linewidths=0.5, ax=ax, annot=True)
plt.show()

From the heat map, some champions stand out from the others. For example, on the first row, we can see that the support champion "Alistar" has been paired with most of the ADC champions, and many of those pairings have a high win ratio. Other support champions that are paired widely and have a high win ratio are Janna and Thresh. Alistar and Thresh are known as support characters who can deal high damage, which helps them harass the enemy champion and be more aggressive in lane.

On the other hand, some ADC champions, such as Lucian and Sivir, also stand out because they have a high win ratio and can be paired with various supports. These champions are known to produce high bursts of damage, which, if played well, can also gain a lane advantage early in the game.

We observe many squares with a black shade and a value of "1", which means that every match with that combination ended in victory. However, the data can easily be misinterpreted, because the less a combination is used, the easier it is for it to have a high win ratio (a combination that appears once and wins the match will have a win ratio of 1). Therefore, in the next step, we want to map the champion choices to frequency.
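
One common way to temper win ratios from tiny samples is to rank by the lower bound of a Wilson score interval instead of the raw ratio. This technique is not part of the original analysis; the sketch below is only an illustration, with made-up counts and a hypothetical helper name `wilson_lower_bound`.

```python
import math

def wilson_lower_bound(wins, games, z=1.96):
    """Lower end of the Wilson score interval for a win proportion.

    z = 1.96 corresponds to a confidence level of roughly 95%.
    """
    if games == 0:
        return 0.0
    p = wins / games
    centre = p + z * z / (2 * games)
    margin = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games))
    return (centre - margin) / (1 + z * z / games)

# A combination that won its only game vs. one that won 14 of 20 games.
single_game = wilson_lower_bound(1, 1)
many_games = wilson_lower_bound(14, 20)
print(round(single_game, 3), round(many_games, 3))
```

With this score, the 14-of-20 combination ranks above the 1-of-1 combination even though its raw win ratio is lower, which matches the intuition that a single game says very little.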

### Champion Choice and Frequency 

In [None]:
cmap = sns.diverging_palette(220,10, as_cmap=True)

pivot = df_combo.pivot(index='supportChamp', columns='ADCChamp', values='frequency')
f, ax = plt.subplots(figsize=(18,15))
combo_win = sns.heatmap(pivot, cmap=sns.cubehelix_palette(8), square=True, linewidths=0.5, ax=ax, annot=True)
plt.show()

By looking at both heat maps, we can get a sense of how reliable the frequency and win ratio data are. We can see that most support champions, even the most popular ones, are paired more frequently with certain ADC champions than with others. For example, Thresh has been paired with almost all ADC champions, but it is paired most often with Kalista and Lucian. The support champion Karma is also often paired with Caitlyn, Ezreal and Jhin, but not with the others. We can see that the choice of ADC can influence the choice of support: some combinations go together better than others.

**<p><font size="4.5"><font color="SkyBlue">B. Explore the Gold Collection of ADC & Support Combination</font></font></p>**

### Average Gold and Champion Combination

One important factor in the game is the amount of gold a team collects. Champions can collect gold in various ways, such as killing minions, destroying enemy turrets, and killing enemy champions. Champions can purchase items with gold to enhance attributes such as attack damage, health, and ability power. Therefore, it is important for players to collect as much gold as possible in the early phase of the game to gain an advantage over the enemy.

In this section, we want to explore the amount of gold collected by ADC and support throughout the duration of the matches and its relationship to victory. 

In [None]:
# We make a copy of the data frame and keep columns that are relevant.
df_gold = df_all.copy(deep=True)
df_gold = df_gold[['result','ADCChamp', 'goldADC', 'supportChamp', 'goldSupport']]


In [None]:
# The data frame tracks the gold collected by both the ADC and the support champion.
# Each record is a list holding the gold value at every minute of the match.
# We sum the gold of ADC and support to calculate the gold of the bottom lane champions collectively.
df_gold['goldBot'] = np.empty((len(df_gold), 0)).tolist()
df_gold['combo'] = df_gold['ADCChamp'] + ', ' + df_gold['supportChamp']
for i, r in df_gold.iterrows():
    lst = [sum(pair) for pair in zip(r['goldADC'], r['goldSupport'])]
    df_gold.at[i, 'goldBot'] = lst
    # The average of gold value in bottom lane in respect to match length
    df_gold.at[i, 'goldBotPerMin'] = sum(lst)/len(lst)
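
For a single match, the computation above reduces to an element-wise sum followed by a mean. The sketch below uses made-up numbers and NumPy arrays instead of `zip`; it assumes both records have the same length, whereas `zip` would silently truncate to the shorter one.

```python
import numpy as np

# Made-up per-minute gold records for one match.
gold_adc = [500, 620, 810, 1050]
gold_support = [500, 560, 640, 730]

# Element-wise sum gives the bottom lane's combined gold at each minute.
gold_bot = (np.array(gold_adc) + np.array(gold_support)).tolist()

# Mean of the per-minute values, the same quantity stored in goldBotPerMin.
gold_bot_per_min = sum(gold_bot) / len(gold_bot)
print(gold_bot, gold_bot_per_min)
```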


In [None]:
f,ax = plt.subplots(figsize=(18,15))
adc_gold_per_min = sns.boxplot(x='ADCChamp', y='goldBotPerMin', data=df_gold, ax=ax, boxprops=dict(alpha=0.1))
adc_gold_per_min = sns.stripplot(x='ADCChamp', y='goldBotPerMin', data=df_gold, ax=ax)
ax.set_title("Average Gold Value in Bottom Lane")
ax.set_xlabel("ADC Champion")
ax.set_ylabel("Average Gold Per Minute in Game")
for label in ax.get_xticklabels():
    label.set_rotation(90)

plt.show()

In the above plot, each dot represents a match played with that ADC champion, positioned at the average amount of gold collected by the bottom lane over the length of the match. We can see that popular ADC champions often have outliers at both ends, which means that they can have either extraordinarily good or extremely bad performances.

The champions that we observed to have high usage or high performance appear to have low median gold values. Champions such as Caitlyn, Kog'Maw, and Varus have higher median gold values. These champions are known for their long attack range, which allows them to kill minions more easily without being harassed by enemy champions with a shorter attack range.

### Gold Value by 5 Minute Intervals and Champion Combinations

Since the data set contains a record of the gold value for each minute, we can look closely at the changes in gold over the duration of a game. We plot the gold value at 5-minute intervals for champion combinations that are used more than 10 times. Note that not all games last the same amount of time, so the later intervals contain fewer data points than the earlier ones.

In [None]:
# Pick out the gold value by 5 minutes apart, and use guard to make sure empty value is replaced by NaN.
for i, r in df_gold.iterrows():
    df_gold.at[i, 'goldBot5'] = r['goldBot'][4] if len(r['goldBot']) > 4 else np.nan 
    df_gold.at[i, 'goldBot10'] = r['goldBot'][9] if len(r['goldBot']) > 9 else np.nan 
    df_gold.at[i, 'goldBot15'] = r['goldBot'][14] if len(r['goldBot']) > 14 else np.nan 
    df_gold.at[i, 'goldBot20'] = r['goldBot'][19] if len(r['goldBot']) > 19 else np.nan
    df_gold.at[i, 'goldBot25'] = r['goldBot'][24] if len(r['goldBot']) > 24 else np.nan 
    df_gold.at[i, 'goldBot30'] = r['goldBot'][29] if len(r['goldBot']) > 29 else np.nan 
    df_gold.at[i, 'goldBot35'] = r['goldBot'][34] if len(r['goldBot']) > 34 else np.nan 
    df_gold.at[i, 'goldBot40'] = r['goldBot'][39] if len(r['goldBot']) > 39 else np.nan 

    df_gold.at[i, 'frequency'] = df_combo.loc[df_combo['combo'] == r['combo']].iloc[0]['frequency']
    df_gold.at[i,'winOverall'] = df_combo.loc[df_combo['combo'] == r['combo']].iloc[0]['winOverall']
    
df_gold.sort_values(by='frequency', ascending=False, inplace=True)
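
The row-by-row lookup above works, but the same result can be sketched more compactly with a helper function and a left merge. The data frames and the helper name `gold_at` below are made up for illustration; they only mimic the shape of `df_gold` and `df_combo`.

```python
import pandas as pd

# Toy stand-ins for df_gold and df_combo (values are made up).
gold = pd.DataFrame({
    "combo": ["Lucian, Thresh", "Sivir, Thresh", "Lucian, Thresh"],
    "goldBot": [
        [1000, 1200, 1500, 1900, 2400, 3000],
        [1000, 1100, 1300],
        [1000, 1250, 1600, 2000, 2500],
    ],
})
combo = pd.DataFrame({
    "combo": ["Lucian, Thresh", "Sivir, Thresh"],
    "frequency": [2, 1],
})

def gold_at(record, minute):
    """Value recorded at the given 1-based minute, or NaN if the game ended earlier."""
    return record[minute - 1] if len(record) >= minute else float("nan")

gold["goldBot5"] = gold["goldBot"].apply(gold_at, minute=5)

# A left merge attaches each combination's frequency to every match row in one step.
gold = gold.merge(combo, on="combo", how="left")
print(gold[["combo", "goldBot5", "frequency"]])
```

A merge avoids filtering `df_combo` once per row, which matters as the number of matches grows.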

We also use dot color to mark whether a match was won or lost. Blue dots represent matches that were lost, and orange dots represent matches that were won.

In [None]:
# We are creating 8 sub plots
f,axes = plt.subplots(nrows = 8, ncols = 1, figsize=(18,74))
f.subplots_adjust(hspace = 0.5)

i = 1
for ax in axes:
    adc_gold_per_min = sns.boxplot(x='combo', y=('goldBot' + str(i*5)), data=df_gold[df_gold['frequency'] > 10], ax=ax, boxprops=dict(alpha=0.1))
    adc_gold_per_min = sns.stripplot(x='combo', y=('goldBot' + str(i*5)), data=df_gold[df_gold['frequency'] > 10], ax=ax, hue='result')
    ax.set_title("Gold Collected by Bottom Lane at " + str(i*5) + " Minutes")
    ax.set_xlabel("ADC and Support Combination")
    ax.set_ylabel("Gold")
    for label in ax.get_xticklabels():
        label.set_rotation(90)
    i += 1

plt.show()

We now go through each subplot, discuss the changes between them, and look for a relationship between victories and gold.

**First 5 Minutes** All combinations start at around the same level, but one combination has a high gold value: Lucian and Alistar. We also see the outlier with the highest gold value in Xayah and Rakan. But a high gold value doesn't mean victory, since we also see many high values marked as blue dots, which are losses.

**5-10 Minutes** The range of gold values stretches out, and the median values start to vary between bottom lane combinations.

**10-15 Minutes** The range of gold values shrinks a little. A trend starts to emerge: most orange dots are located above the median for each combination. A combination that stands out is Tristana and Lulu, which has the highest median gold value at this point of the game.

**15-20 Minutes** The trend becomes more pronounced. More blue dots are located below the median and more orange dots are placed above the median line. No single ADC consistently produces a high gold value. For example, even though Tristana can gain a high gold value with Janna or Lulu as support, it tends to do poorly with Thresh or Alistar.

**20-25 Minutes** The range of gold values stretches out again. We also see cases where champion pairs such as Tristana and Lulu, who did well at the beginning of the game, continue to gain lots of gold but still lose. A high gold value doesn't guarantee a victory, but it does seem to correspond to a better chance of victory.

**25-30 Minutes** Other champion pairs, such as Twitch and Janna, seem to catch up on gold value. We also see a handful of outliers dropping out of the plot, most likely because such a large gold advantage allows those teams to defeat the enemy team quickly and end the game before 30 minutes.

**30-35 Minutes** The pattern no longer holds, as we see more blue dots above the median line. The later the game goes, the less gold is worth. Each champion has 6 item slots, and after 6 items are purchased, any extra gold the champion collects is wasted. Therefore, beyond a certain point more gold no longer provides an advantage, and we see more losses with high gold values.

**Beyond 40 Minutes** At this point, many of the matches have concluded. The remaining matches display chaotic distributions, as both teams have finished buying all their items. We can see that some champions do amazingly well in the late game, such as Twitch, who won almost every game that lasted longer than 40 minutes.

**<p><font size="6"><font color="SkyBlue">4. Hypothesis Testing</font></font></p>**

**<p><font size="4.5"><font color="SkyBlue">A. Null Hypothesis</font></font></p>**

We are looking at the impacts of:
1. ADC & support combination on the win ratio 
2. The amount of gold collected by bottom lane on the win ratio

**Null Hypothesis: Neither the ADC & Support combination nor gold has a noticeable impact on the win ratio.**

To test the null hypothesis, we will perform a Multiple Linear Regression on our dataset by using scikit-learn.

**<p><font size="4.5"><font color="SkyBlue">B. Multiple Linear Regression with Scikit-learn</font></font></p>**

We will use our tidied dataset to create 2 new dataframes for the features and the target.
1. Features: independent variables (ADC & Support Combinations, and Gold) 
2. Target: dependent variables (overall win ratio).

Here, we use the usage frequency to express the impact of the champion combination. If the usage frequency of the combination that a team chooses has a noticeable impact on the overall win ratio, then we conclude that the choice of combination affects win ratios. We also use the amount of gold per minute that the bottom lane collects to express the impact of gold. If the amount of gold per minute has a significant impact on win ratios, then the total gold collected by the bottom lane has a significant impact on win ratios.

To better use them with Scikit-learn, we will first standardize two features.
1. Standardize combination using the probability of appearance of the combination
2. Standardize gold using z-score

In [None]:
df_gold['p_fre'] = df_gold['frequency']/combo_size
df_gold['p_gold'] = stats.zscore(df_gold['goldBotPerMin'])
df_sklearn = df_gold.copy(deep=True)
df_sklearn = df_sklearn[['p_fre', 'p_gold', 'winOverall']]
df_sklearn.sort_index(inplace = True)
df_sklearn.head()
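
As a quick check of what `stats.zscore` computes, the sketch below standardizes a few made-up gold averages and compares the result with the textbook formula (value minus mean, divided by standard deviation). Both SciPy and NumPy default to the population standard deviation (`ddof=0`) here.

```python
import numpy as np
from scipy import stats

values = np.array([1200.0, 1350.0, 1500.0, 1650.0, 1800.0])  # made-up gold averages

z_scipy = stats.zscore(values)                      # population std (ddof=0) by default
z_manual = (values - values.mean()) / values.std()  # NumPy's std also defaults to ddof=0

print(np.allclose(z_scipy, z_manual))
print(z_scipy.mean(), z_scipy.std())
```

After standardization the feature has mean 0 and standard deviation 1, which puts its coefficient on a scale of "change per standard deviation of gold".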

Now, we already have the standardized data for combinations and gold. Let's set up X and Y for our use in scikit-learn LinearRegression() function. Then, fit the model.

In [None]:
group = ['p_fre', 'p_gold']

features = df_sklearn[group]
target = df_sklearn[['winOverall']]


X = features
Y = target['winOverall']

linear = linear_model.LinearRegression()
model = linear.fit(X,Y)

Now, let us get an R-squared score to test how well the variance is explained by the model. 

In [None]:
linear.score(X,Y)

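For a least-squares fit with an intercept, scored on the data it was fitted to, R-squared lies between 0 and 1, so the ~0.84 we got means that about 84% of the variance in the overall win ratio is explained by our model.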
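
To see where the score comes from, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses synthetic data (not the match data) and checks the manual value against `LinearRegression.score`.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 0.8 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=0.3, size=200)

reg = linear_model.LinearRegression().fit(X_demo, y_demo)
pred = reg.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)           # variation the model fails to explain
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, reg.score(X_demo, y_demo))
```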

Now, let us find the coefficients from our model to see whether they have impacts, and how much they impact on the overall win ratio, and which feature has a larger impact overall.

In [None]:
coef = linear.coef_.tolist()
for f, c in zip(group, coef):
    print("Feature: {}, Coefficient: {}".format(f,c))

From the coefficients above, we can confirm our previous findings. We can see that the combination of ADC & Support has a significant impact on overall win ratios. The coefficient of combination frequency is positive, which agrees with our earlier expectation that a combination that appears more often leads to a higher quality of play, which in turn leads to a higher probability of winning.

On the other hand, the average amount of gold per minute that the bottom lane earns has a positive impact on winning; however, the value is too small to be significant. Therefore, we may say that the gold amount does not have a noticeable impact on overall winning. Note that this result only shows whether the average gold collected by the bottom lane per minute affects winning.

**<p><font size="4.5"><font color="SkyBlue">C. Ordinary Least Squares Regression</font></font></p>**
Next, we want to examine this result in more detail through p-values, t-values, and standard errors. For that, we fit an Ordinary Least Squares regression model with statsmodels.

In [None]:
# use Ordinary Least Squares
OLS_model = sm.OLS(Y,X).fit()
OLS_model.summary()

Given the assumption that the covariance matrix of the errors is correctly specified, we may say that the standard errors are small enough for a good fit.

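The R-squared value here is a little larger than the one we got previously. Note that the two values are not directly comparable: `sm.OLS(Y, X)` does not add an intercept unless we pass `sm.add_constant(X)`, so statsmodels reports the uncentered R-squared of a regression through the origin, while scikit-learn's `LinearRegression` fits an intercept by default. The model still explains most of the variance; however, a high R-squared on the data the model was fitted to may also indicate some overfitting, which is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably".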

Looking at the 'coef' column above, we observe that the frequency of champion combinations has a significant impact on the overall win ratio. Also, the p-value of combination frequency is reported as 0, well below the critical value of 0.05, which further supports this finding.

In addition, we are more confident that the impact of average gold per minute is indeed negligible. Not only is its coefficient extremely small, but its p-value (the 'P>|t|' column in the middle table) is also far greater than the critical value of 0.05.

All of this means that we reject the null hypothesis, because the champion combinations have a significant impact on the overall win ratio.

**<p><font size="6"><font color="SkyBlue">5. Predict with Machine Learning</font></font></p>**

**<p><font size="4.5"><font color="SkyBlue">A. Predictions with Train/Test Split</font></font></p>**

Here we split the dataset into training and testing data for both the features and the target. The train/test proportion is a matter of choice; for our data, we use 20% of the data for testing and 80% for training. We first fit a multiple linear regression model on the training data. Then, we make predictions for the testing data. After that, we compare the predictions with the actual results (Y_test).

In [None]:
# create training and testing data from same X and y used in regression above
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)

# create and fit the linear model
linear = linear_model.LinearRegression()
model = linear.fit(X_train, Y_train)

# generate predictions for the win ratio to compare with Y_test
predictions = linear.predict(X_test)

# display the first 5 predicted win ratios
predictions[0:5]

After taking a look at the predictions, we plot the predicted values from the linear model against the actual values from our dataset.
We also add the identity line (y=x, in red) to show how close the predictions are to the actual win ratios. The better the predictions, the more closely the plotted points follow the identity line.

**<p><font size="4.5"><font color="SkyBlue">B. Predictions vs. Actual Values Plots</font></font></p>**

In [None]:
plt.figure(figsize=(12,12))
plt.title("Predicted vs. Actual Values for Winning ratio", fontsize=16)
plt.scatter(Y_test,predictions)
plt.plot(Y_test, Y_test, color="Red") # identity line y=x
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted')
# sns.distplot is deprecated; kdeplot draws the same density curves
sns.kdeplot(Y_test, label="Actual", ax=ax)
sns.kdeplot(predictions, label="Linear Regression Predictions", ax=ax)
ax.legend()

plt.show()

In [None]:
print("R-squared on test data (1 is a perfect fit): {}".format(model.score(X_test, Y_test)))

From the two plots above, we observe that our model predicts the overall win ratios well for the testing dataset.
The model's `score` method returns the R-squared value on the test data (1 is a perfect fit); ours is around 0.83, which is a considerably high score for the model.
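
For reference, scikit-learn's `LinearRegression.score` returns the coefficient of determination rather than a classification accuracy. In standard notation (the symbols are ours, not the notebook's), with $y_i$ the actual win ratios in the test set, $\hat{y}_i$ the corresponding predictions, and $\bar{y}$ the mean of the actual values:

$$
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
$$

A value of 1 means the predictions match the actual values exactly, 0 means the model does no better than always predicting the mean, and negative values are possible when it does worse than that.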

**<p><font size="6"><font color="SkyBlue">6. Conclusion</font></font></p>**

By analyzing data from all international matches since 2015, we explored some interesting patterns in bottom lane champion selection and other game statistics. From the exploratory data analysis, we learned that some ADC & support champion combinations are used more often than others, but frequency doesn't necessarily lead to victory. We also learned that gold collected by the bottom lane has a different impact in different phases of the game. For games that last more than 30 minutes, the actual value of gold starts to decrease as both teams approach their item limits.

Then, with hypothesis testing and regression models, we found that the champion combination has a high impact on the win ratio, while bottom lane gold doesn't have a significant impact. Finally, we built a predictive model with linear regression, which produces results that fit the testing dataset pretty well.

League of Legends is looking to celebrate its 10-year anniversary next year. This dataset provides only a fragment of insight into the gameplay. There is much more gameplay data, full of interesting facts, waiting to be analyzed. We hope that by presenting this data tutorial, we have aroused your interest in the game. The game is complicated and constantly evolving, so there might be a bit of a learning curve for new players. If you are a new player, we definitely encourage you to use all the resources available to familiarize yourself with the game, including streams, video tutorials, discussion boards, and friends around you who play the game. If you are more interested in the data side, we encourage you to check out the <a href="https://developer.riotgames.com/">Riot API</a>, which can provide you with more data, or other interesting tutorials such as <a href="https://www.kaggle.com/verhalenn/why-is-blue-a-better">Why is Blue A Better?</a>.