# League of Legends Match Outcome Classifier
This notebook visualizes the features from the original dataframe, plus additional features I obtained through the Riot Games MatchV4 API.
I then experiment with different models to get the best results.

To see how I interact with the API, the source code can be found here: https://github.com/jbofill10/LoL-MatchOutcome-Predictor
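
The extra data was collected under the API's limit of 100 requests per 2 minutes and then pickled. The block below is a minimal sketch of that kind of throttled, cached loop, not the code from the repository. The names `fetch_with_cache`, `fetch_fn` and `sleep_fn` are hypothetical, and `fetch_fn` stands in for whatever function performs the real API call.

```python
import os
import pickle
import tempfile
import time


def fetch_with_cache(game_ids, fetch_fn, cache_path,
                     max_requests=100, window_seconds=120,
                     sleep_fn=time.sleep):
    """Fetch one record per game ID, pausing after each full batch of requests.

    fetch_fn is a hypothetical callable (game_id -> dict) standing in for the
    real API call. Results are pickled so later runs skip the network.
    """
    if os.path.isfile(cache_path):
        with open(cache_path, 'rb') as file:
            return pickle.load(file)

    game_ids = list(game_ids)
    results = {}
    for count, game_id in enumerate(game_ids, start=1):
        results[game_id] = fetch_fn(game_id)
        # After every full batch, wait out the rate-limit window
        if count % max_requests == 0 and count < len(game_ids):
            sleep_fn(window_seconds)

    with open(cache_path, 'wb') as file:
        pickle.dump(results, file)
    return results


# Usage with a stand-in fetcher and no real sleeping
def fake_fetch(game_id):
    return {'gameId': game_id}


cache_file = os.path.join(tempfile.mkdtemp(), 'games.pkl')
games = fetch_with_cache(range(5), fake_fetch, cache_file, sleep_fn=lambda s: None)
print(len(games))
```

Passing the sleep function in as a parameter keeps the sketch testable without actually waiting two minutes per batch.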

In [None]:
import numpy as np 
import pandas as pd 
import os
import pickle
# I took the original data set and used Riot's API to get more data about the games.
# The API has a limit of 100 requests per 2 min, so I pickled the results after making a request for every game
df = pd.read_pickle('/kaggle/input/riotapi-pickles/riotapi_lower_res')

preproc_df = df.copy()

df.head()

In [None]:
df.columns

Originally, I tested my models using train_test_split from sklearn and achieved really high scores, which made me suspicious that my model was overfitting. To combat that, I decided to grab an additional 5k games from the Riot API using the same methods as the original author of the data set.

The "more_games_df" dataframe represents those extra games.

I will train my model using the entire data set provided by the author, and then test using the games I have queried on my own
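
One thing worth checking with a separately collected test set is that none of the new games already appear in the training data. A minimal sketch, assuming both frames carry a `gameId` column; the helper name `drop_overlapping_games` and the toy frames are made up for illustration.

```python
import pandas as pd


def drop_overlapping_games(train_df, holdout_df, key='gameId'):
    """Return the hold-out rows whose key does not appear in the training set."""
    overlap = holdout_df[key].isin(train_df[key])
    return holdout_df[~overlap].reset_index(drop=True)


# Toy frames standing in for the original data set and the extra games
train = pd.DataFrame({'gameId': [1, 2, 3], 'blueWins': [1, 0, 1]})
holdout = pd.DataFrame({'gameId': [3, 4, 5], 'blueWins': [0, 1, 0]})

clean_holdout = drop_overlapping_games(train, holdout)
print(clean_holdout['gameId'].tolist())
```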

In [None]:
more_games_df = pd.read_pickle('/kaggle/input/riotapi-pickles/more_games_lower')

more_games_df.head(10)

Sometimes, games didn't contain ban info, causing some values to be NaN. I don't know if the author of the data set also experienced this issue, but I just decided to drop the rows with NaNs.

In [None]:
# These two columns from the Riot API duplicate blueFirstBlood / redFirstBlood from the original data set
df.drop(['blue_firstBlood', 'red_firstBlood'], axis=1, inplace=True)

df['redWins'] = df['blueWins'].apply(lambda x: 1 if x == 0 else 0)
preproc_df['redWins'] = df['redWins']

more_games_df.dropna(inplace=True)
# drop=True avoids adding the old index as an extra 'index' column
more_games_df.reset_index(drop=True, inplace=True)
more_games_df.rename(columns={"redfirstBlood": 'redFirstBlood', 'bluefirstBlood': 'blueFirstBlood'}, inplace=True)
more_games_df['redWins'] = more_games_df['blueWins'].apply(
    lambda x: 1 if x == 0 else 0
)

# Match the column order of the original data set (I am particular about order)
more_games_df = more_games_df[df.columns.tolist()]

# Combining the two temporarily for part of the preprocessing
combined_df = pd.concat([preproc_df, more_games_df], axis=0, ignore_index=True)
combined_df.drop(['blue_firstBlood', 'red_firstBlood'], axis=1, inplace=True)
combined_df.head()

I used the Riot API to get all the champions selected and banned, along with some other information regarding team statistics which I will explain later on. Champion selections are very important, because any team-based game at the top level of play has a "meta": the style of play and team compositions that achieve maximum value within a game. That being said, people who pick the "meta" champions are more likely to win.

The same idea applies to bans, as some champions are just too strong in a certain patch and are better off banned every game. You will see later on that Kassadin is practically permanently banned in the patch this data set is from.

# EDA
I will look at the features in depth to get an idea of the data that I am working with.

The 'redWins' column created earlier makes the data easier to manipulate. I will now make two dataframes, one for games Blue won and one for games Red won, primarily to see the stats of winning teams and compare them to those of losing teams.

In [None]:
blue_win = df[df['blueWins'] == 1]
red_win = df[df['blueWins'] == 0]

In [None]:
import plotly.graph_objs as go
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot, plot
from plotly.subplots import make_subplots
init_notebook_mode(connected=True)

## Target Variable
I am hoping that there is an even balance of wins and losses, as I do not like having to rebalance the classes by filling in values, knowing that some of them are incorrect.

In [None]:
fig = go.Figure()

blue_loss = df[df['blueWins'] == 0]

fig.add_trace(go.Bar(x=[0], y=list(blue_win['blueWins'].value_counts()), name='Blue', marker_color='#084177', width=0.5))
fig.add_trace(go.Bar(x=[1], y=list(blue_loss['blueWins'].value_counts()), name='Red',
                     marker_color=['#d63447'], width=0.5))

fig.update_layout(
    xaxis=dict(
        showticklabels=True,
        tickvals=[0, 1],
        ticktext=[i for i in ['Blue', 'Red']],
    ),
    yaxis_title='Wins',
    title='Wins From Each Team',
    height=800,
    width=800
)

iplot(fig)

Pretty good stuff actually. I am glad that they are nearly even.
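
The balance can also be checked numerically rather than by eye. A minimal sketch with a made-up target column; `class_balance` is a hypothetical helper, not something from the notebook.

```python
from collections import Counter


def class_balance(labels):
    """Return the share of each class label as a fraction of all rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy target column: 1 = blue win, 0 = red win
toy_blue_wins = [1, 0, 1, 1, 0, 0, 1, 0]
print(class_balance(toy_blue_wins))
```

On the real dataframe, `df['blueWins'].value_counts(normalize=True)` gives the same kind of summary directly.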

Now I would like to go through the features.

## Wards Placed
Having more information than your opponent in any game is one of the biggest advantages a team can have over another. Therefore, wards are extremely important for the success of a team as they grant extra vision of the enemy and map.

In [None]:
fig = go.Figure(data=[
    go.Box(name='Blue Win', y=blue_win['blueWardsPlaced'], boxmean=True),
    go.Box(name='Blue Loss', y=red_win['blueWardsPlaced'], boxmean=True),
    go.Box(name='Red Win', y=red_win['redWardsPlaced'], boxmean=True),
    go.Box(name='Red Loss', y=blue_win['redWardsPlaced'], boxmean=True)
])

fig.update_layout(
    title='Wards Placed Distribution',
    height=800,
    width=800
)

iplot(fig)

On average, about 20 wards a game are placed. It is interesting and mind boggling how some games have up to 276 wards placed. Keep in mind this data is for the first 10 min of a game.

It is clear that the distribution of wards placed is right skewed (most games sit near the average, with a long tail of very high counts), which may require a transformation.
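
To illustrate why a transformation helps with a tail like this, here is a minimal sketch that measures skewness before and after `np.log1p`. The ward counts are made up, and `sample_skewness` is a hypothetical helper that computes the biased sample skewness by hand.

```python
import numpy as np


def sample_skewness(values):
    """Biased sample skewness: third central moment over variance ** 1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5


# Made-up ward counts with a few extreme games, not taken from the data set
toy_wards = np.array([10, 12, 14, 15, 16, 18, 20, 22, 30, 45, 80, 150])

skew_before = sample_skewness(toy_wards)
skew_after = sample_skewness(np.log1p(toy_wards))
print(skew_before, skew_after)
```

The log compresses the large values much more than the small ones, so the skewness moves closer to zero without the outliers being removed.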

Here we can see that there is a slight trend between the number of wards placed and a team winning. Usually the winning team has placed more wards that game.

Both Blue and Red have a higher mean of wards placed when winning, but the difference is not as clear for Blue as it is for Red.

## Wards Destroyed
Just as wards are important because of the information they provide to the team, denying that information is also key to winning. Because of this, I would suspect that the team with more destroyed wards has a better chance of winning the game.

In [None]:
fig = go.Figure(data=[
    go.Histogram(name='Blue Win', x=blue_win['blueWardsDestroyed']),
    go.Histogram(name='Blue Loss', x=red_win['blueWardsDestroyed']),
    go.Histogram(name='Red Win', x=red_win['redWardsDestroyed']),
    go.Histogram(name='Red Loss', x=blue_win['redWardsDestroyed'])
])

fig.update_layout(
    title='Wards Destroyed Distribution',
    height=800,
    width=800
)

iplot(fig)

In the range of 0-2 wards destroyed, there isn't a trend at all regarding the outcome of the game. Once the wards destroyed count exceeds 3, there is a clear trend that destroyed ward count correlates with winning. This is most likely because a team that destroys so many wards that quickly is dominating or being very aggressive, putting a lot of pressure on the other team.
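
A threshold effect like this can be checked by comparing win rates on either side of a cut-off. A minimal sketch with made-up values; `win_rate_by_threshold` is a hypothetical helper and the cut-off of 3 is only an example.

```python
def win_rate_by_threshold(values, wins, threshold):
    """Compare win rates for games at or above a threshold with those below it."""
    above = [w for v, w in zip(values, wins) if v >= threshold]
    below = [w for v, w in zip(values, wins) if v < threshold]

    def rate(outcomes):
        return sum(outcomes) / len(outcomes) if outcomes else float('nan')

    return {'at_or_above': rate(above), 'below': rate(below)}


# Made-up values: wards destroyed by blue and whether blue won (1) or lost (0)
toy_wards_destroyed = [0, 1, 2, 2, 3, 4, 5, 1]
toy_blue_wins = [0, 1, 0, 1, 1, 1, 1, 0]

print(win_rate_by_threshold(toy_wards_destroyed, toy_blue_wins, threshold=3))
```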

## First Bloods
First blood is important because it provides gold for the team, which allows wards to be set up early. This helps prevent the jungler from ganking as well.

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Blue Win', x=[0], y=[np.sum(blue_win['blueFirstBlood'])], width=0.5),
    go.Bar(name='Blue Loss', x=[1], y=[np.sum(red_win['blueFirstBlood'])], width=0.5),
    go.Bar(name='Red Win', x=[2], y=[np.sum(red_win['redFirstBlood'])], width=0.5),
    go.Bar(name='Red Loss', x=[3], y=[np.sum(blue_win['redFirstBlood'])], width=0.5)
])

fig.update_layout(
    title='The Importance of First Kills',
    height=800,
    width=800,
    xaxis=dict(
        tickvals=[i for i in range(4)],
        ticktext=[i for i in ['Blue Win', 'Blue Loss', 'Red Win', 'Red Loss']],
        showticklabels=True
    ),
)

iplot(fig)

This bar graph takes the sum of games where a certain team got first blood. It is clear that teams that got a first kill are more likely to win, since it does provide advantages as stated earlier.

But it is also possible that the team was just overall better and won regardless of the "first blood advantage".
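
Raw sums depend on how many games each group contains, so a row-normalised cross-tabulation is a more direct way to read the win rate given first blood. A minimal sketch on made-up games using `pd.crosstab`; it still cannot separate the first-blood advantage from the better team simply winning more often.

```python
import pandas as pd

# Made-up games: did blue take first blood, and did blue win?
toy = pd.DataFrame({
    'blueFirstBlood': [1, 1, 1, 1, 0, 0, 0, 0],
    'blueWins':       [1, 1, 1, 0, 1, 0, 0, 0],
})

# Each row sums to 1: the share of losses and wins given the first-blood outcome
win_rates = pd.crosstab(toy['blueFirstBlood'], toy['blueWins'], normalize='index')
print(win_rates)
```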

## Total Team Kills
Kills in League are very important because they slow down the progression of your opponent's build, along with giving you gold to upgrade your champion. For this reason, it is pretty clear that kills will be a very important indicator of whether a team wins or loses.

In [None]:
fig = go.Figure(data=[
    go.Histogram(name='Blue Win', x=blue_win['blueKills']),
    go.Histogram(name='Blue Loss', x=red_win['blueKills']),
    go.Histogram(name='Red Win', x=red_win['redKills']),
    go.Histogram(name='Red Loss', x=blue_win['redKills'])
])

fig.update_layout(
    title='Distribution of Team Kills when Winning and Losing',
    height=800,
    width=800,
)

iplot(fig)

All four distributions in this histogram are slightly right skewed.

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Blue Win', x=[0], y=[np.mean(blue_win['blueKills'])], width=0.5),
    go.Bar(name='Blue Loss', x=[1], y=[np.mean(red_win['blueKills'])], width=0.5),
    go.Bar(name='Red Win', x=[2], y=[np.mean(red_win['redKills'])], width=0.5),
    go.Bar(name='Red Loss', x=[3], y=[np.mean(blue_win['redKills'])], width=0.5)
])

fig.update_layout(
    title='Average Kills of Teams when Winning and Losing',
    height=800,
    width=800,
    xaxis=dict(
        tickvals=[i for i in range(4)],
        ticktext=[i for i in ['Blue Win', 'Blue Loss', 'Red Win', 'Red Loss']],
        showticklabels=False,
        title='Team'
    ),
)

iplot(fig)

## Team Deaths
If kills are vital to the success of a team, it must mean that survivability is also very important. This means that most likely the team that dies least has the higher chance of winning.

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Blue Win', x=[0], y=[np.mean(blue_win['blueDeaths'])], width=0.5),
    go.Bar(name='Blue Loss', x=[1], y=[np.mean(red_win['blueDeaths'])], width=0.5),
    go.Bar(name='Red Win', x=[2], y=[np.mean(red_win['redDeaths'])], width=0.5),
    go.Bar(name='Red Loss', x=[3], y=[np.mean(blue_win['redDeaths'])], width=0.5)
])

fig.update_layout(
    title='Average Deaths of Teams when Winning and Losing',
    height=800,
    width=800,
    xaxis=dict(
        tickvals=[i for i in range(4)],
        ticktext=[i for i in ['Blue Win', 'Blue Loss', 'Red Win', 'Red Loss']],
        showticklabels=False,
        title='Team'
    ),
)

iplot(fig)

As suspected, teams that win tend to die less on average.

## Assists
A team with more assists also tends to have more kills, which should give it the advantage over its opponents. I expect the team with more assists to win more games.

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Blue Win', x=[0], y=[np.mean(blue_win['blueAssists'])], width=0.5),
    go.Bar(name='Blue Loss', x=[1], y=[np.mean(red_win['blueAssists'])], width=0.5),
    go.Bar(name='Red Win', x=[2], y=[np.mean(red_win['redAssists'])], width=0.5),
    go.Bar(name='Red Loss', x=[3], y=[np.mean(blue_win['redAssists'])], width=0.5)
])

fig.update_layout(
    title='Average Assists of Teams when Winning and Losing',
    height=800,
    width=800,
    xaxis=dict(
        tickvals=[i for i in range(4)],
        ticktext=[i for i in ['Blue Win', 'Blue Loss', 'Red Win', 'Red Loss']],
        showticklabels=True
    ),
)

iplot(fig)

It is evident that teams with more assists end up winning more games.

## Towers Destroyed
Towers prevent the enemy team from attacking the Nexus. The more towers that are destroyed, the easier it becomes to attack the Nexus.
Due to this idea, teams that win would have destroyed more towers than their opponents.

In [None]:
fig = go.Figure(data=[
    go.Histogram(name='Blue Win', x=blue_win['blueTowersDestroyed']),
    go.Histogram(name='Blue Loss', x=red_win['blueTowersDestroyed']),
    go.Histogram(name='Red Win', x=red_win['redTowersDestroyed']),
    go.Histogram(name='Red Loss', x=blue_win['redTowersDestroyed'])
])

fig.update_layout(
    title='Distribution of Towers Destroyed when Winning and Losing',
    height=800,
    width=800,
)

iplot(fig)

In the first 10 min, usually no towers are destroyed. If one is destroyed, usually that team wins.

## Epic Monsters
Epic Monsters are important because they provide high gold/experience and buffs to the team that defeats them. This in turn provides an advantage for the team. I would expect to see that teams will win more on average if they kill more epic monsters.

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Blue Win', y=[0], x=[np.sum(blue_win['blueEliteMonsters'])], width=0.5, orientation='h'),
    go.Bar(name='Blue Loss', y=[1], x=[np.sum(red_win['blueEliteMonsters'])], width=0.5, orientation='h'),
    go.Bar(name='Red Win', y=[2], x=[np.sum(red_win['redEliteMonsters'])], width=0.5, orientation='h'),
    go.Bar(name='Red Loss', y=[3], x=[np.sum(blue_win['redEliteMonsters'])], width=0.5, orientation='h')
])

fig.update_layout(
    title='Epic Monsters Killed',
    height=800,
    width=800,
    yaxis=dict(
        tickvals=[i for i in range(4)],
        ticktext=[i for i in ['Blue Win', 'Blue Loss', 'Red Win', 'Red Loss']],
        showticklabels=True
    ),
)

iplot(fig)

## Total Experience
The more experience a champion has earned, the stronger the build the champion will have. Therefore teams that win will most likely have more experience.

In [None]:
fig = go.Figure(data=[
    go.Box(name='Blue Win', x=blue_win['blueTotalExperience']),
    go.Box(name='Blue Loss', x=red_win['blueTotalExperience']),
    go.Box(name='Red Win', x=red_win['redTotalExperience']),
    go.Box(name='Red Loss', x=blue_win['redTotalExperience'])
])

fig.update_layout(
    title='Total Experience Distribution of Champions',
    height=800,
    width=800,
)

iplot(fig)

Champions with more xp will be stronger since they can level more abilities and probably would have more gold to have a stronger build. So the teams with more xp are more likely to win

## Total Gold
Stronger items and upgrades to items can be obtained with gold. So most likely the teams that win will have more gold since they had stronger champions that contributed to winning the game.

In [None]:
fig = go.Figure(data=[
    go.Violin(name='Blue Win', y=blue_win['blueTotalGold'], meanline_visible=True),
    go.Violin(name='Blue Loss', y=red_win['blueTotalGold'], meanline_visible=True),
    go.Violin(name='Red Win', y=red_win['redTotalGold'], meanline_visible=True),
    go.Violin(name='Red Loss', y=blue_win['redTotalGold'], meanline_visible=True),
])

fig.update_layout(
    title='Gold on Winning and Losing Teams',
    height=800,
    width=800,
)

iplot(fig)

Same idea as with xp. The more gold you have, the better a build your champion can afford. This will lead into a stronger build, making your champion and team stronger.

## CS Per Min
Creep Score, or CS, is one of the most important aspects of League. Having a good creep score means a reliable and steady income, which, as pointed out earlier, is very important for winning games. Due to this, I expect the winning teams to have a higher CS than the losing teams.

In [None]:
fig = go.Figure(data=[
    go.Box(name='Blue Win', x=blue_win['blueCSPerMin']),
    go.Box(name='Blue Loss', x=red_win['blueCSPerMin']),
    go.Box(name='Red Win', x=red_win['redCSPerMin']),
    go.Box(name='Red Loss', x=blue_win['redCSPerMin'])
])

fig.update_layout(
    title='CS Per Min Distribution',
    height=800,
    width=800,
)

iplot(fig)

## Champion Bans
These are the 20 most frequent bans when a team wins. Unfortunately the API doesn't give information on who picked first, so when it says "Blue" or "Red" it just means the team that won. In the legend, the champions go in order of most frequent. For example, Other is the most banned, followed by Kassadin in Ban 1.

Keep in mind that there are 135 champions in League, meaning the Other category consists of 115 champions!

### Helper Methods to get Champion names
Since the Riot API only gives champion IDs, I needed to make a request to another API to get the mapping from champion IDs to names for the current patch. I have that pickled as well.

In [None]:
import requests

def get_champions(ids):
    # Load the champion ID -> name mapping from the pickle, or build it from Data Dragon
    if not os.path.isfile('/kaggle/input/riotapi-pickles/champions_lower'):
        r = requests.get('http://ddragon.leagueoflegends.com/cdn/10.10.3216176/data/en_US/champion.json')

        response = r.json()

        champions_reformatted = {}

        for champion in response['data']:
            champ_id = response['data'][champion]['key']

            # The payload lists Wukong under the name 'MonkeyKing'
            champions_reformatted[int(champ_id)] = 'Wukong' if champion == 'MonkeyKing' else champion

        with open('/kaggle/input/riotapi-pickles/champions_lower', 'wb') as file:
            pickle.dump(champions_reformatted, file)
    else:
        with open('/kaggle/input/riotapi-pickles/champions_lower', 'rb') as file:
            champions_reformatted = pickle.load(file)

    # An ID of -1 is mapped to the string 'None' (no champion recorded)
    champions = ['None' if champ_id == -1 else champions_reformatted[champ_id] for champ_id in ids]

    # A single ID returns a bare name instead of a one-element list
    return champions[0] if len(champions) == 1 else champions

def format_champs(df, cols, head):
    # Total count per champion across all of the given columns
    champ_dict = {}

    for col in cols:
        champs = df[col].value_counts()

        champ_names = get_champions(list(champs.index))
        # get_champions returns a bare string when given a single ID
        if isinstance(champ_names, str):
            champ_names = [champ_names]

        for name, freq in zip(champ_names, champs.values):
            champ_dict[name] = champ_dict.get(name, 0) + freq

    # Sort champions from most to least frequent
    ranked = sorted(champ_dict.items(), key=lambda item: item[1], reverse=True)

    top_n = dict(ranked[:head])

    # Everything outside the top `head` champions is grouped into one 'Other' slice.
    # Slicing from `head` (not head+1) keeps the champion ranked just below the cut.
    top_n['Other'] = sum(count for _, count in ranked[head:])

    return top_n

In [None]:
bans = ['ban_1', 'ban_2', 'ban_3', 'ban_4', 'ban_5',
        'ban_6', 'ban_7', 'ban_8', 'ban_9', 'ban_10']
figs = []
for ban in bans:
    fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'domain'}, {'type': 'domain'}]],
                        subplot_titles=('Blue {} {}'.format(ban[0:3], ban[4:]),
                                        'Red {} {}'.format(ban[0:3], ban[4:]))
    )

    # Ban counts among the games won by each side
    blue_win_all_ban = blue_win[ban].value_counts()
    red_win_all_ban = red_win[ban].value_counts()

    # Keep the 20 most frequent bans; the rest are grouped as 'Other'
    blue_win_top_20 = blue_win_all_ban.head(20)
    red_win_top_20 = red_win_all_ban.head(20)

    blue_win_other = np.sum(blue_win_all_ban) - np.sum(blue_win_top_20)
    red_win_other = np.sum(red_win_all_ban) - np.sum(red_win_top_20)

    blue_vals = np.append(blue_win_top_20.values, blue_win_other)
    red_vals = np.append(red_win_top_20.values, red_win_other)

    blue_bans = get_champions(list(blue_win_top_20.keys()))
    red_bans = get_champions(list(red_win_top_20.keys()))

    fig.add_trace(go.Pie(
        name=ban,
        labels=blue_bans + ['Other'],
        values=blue_vals),
        row=1,
        col=1
    )

    fig.add_trace(go.Pie(
        name=ban,
        labels=red_bans + ['Other'],
        values=red_vals),
        row=1,
        col=2
    )

    fig.update_layout(
        height=600,
        width=800
    )
    
    figs.append(fig)

## Ban 1

In [None]:
iplot(figs[0])

## Ban 2

In [None]:
iplot(figs[1])

## Ban 3

In [None]:
iplot(figs[2])

## Ban 4

In [None]:
iplot(figs[3])

## Ban 5

In [None]:
iplot(figs[4])

## Ban 6

In [None]:
iplot(figs[5])

## Ban 7

In [None]:
iplot(figs[6])

## Ban 8

In [None]:
iplot(figs[7])

## Ban 9

In [None]:
iplot(figs[8])

## Ban 10

In [None]:
iplot(figs[9])

It seems that people in diamond and master are not fond of Kassadin at all. Generally, it seems that the ban selections are very strict at the top level of play.

## Champion Selections
### Among Winning and Losing Teams

In [None]:
champs = ['blue_champ_1', 'blue_champ_2', 'blue_champ_3', 'blue_champ_4', 'blue_champ_5',
          'red_champ_1', 'red_champ_2', 'red_champ_3', 'red_champ_4', 'red_champ_5']

champs_formatted = format_champs(df, champs, 35)

fig = go.Figure(data=[
    go.Pie(
        labels=list(champs_formatted.keys()),
        values=list(champs_formatted.values())
    )
])

fig.update_layout(
    height=900,
    width=800,
    title='Most Frequently Selected Champions'
)

iplot(fig)

The idea presented by the pie chart makes sense, as games built around team composition typically have metas that shift between patches or seasons. So it makes sense to see champions that work best together have pick rates close to one another's.

### Blue Winning

I suspect the choices will be relatively similar to those in the champion selection visualization above, which covers both winning and losing teams.

In [None]:
blue_champs = champs[0:5]
red_champs = champs[5:]

blue_win_champs_formatted = format_champs(blue_win, blue_champs, 25)
blue_lose_champs_formatted = format_champs(red_win, blue_champs, 25)

red_win_champs_formatted = format_champs(red_win, red_champs, 25)
red_lose_champs_formatted = format_champs(blue_win, red_champs, 25)

fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'domain'}, {'type': 'domain'}]],
                    subplot_titles=('Top 25 Blue Champion Selections (Win)',
                                    'Top 25 Red Champion Selections (Lose)')
                    )

fig.add_trace(
    go.Pie(
        name='Blue',
        labels=list(blue_win_champs_formatted.keys()),
        values=list(blue_win_champs_formatted.values())
    ),
    row=1,
    col=1
)

fig.add_trace(
    go.Pie(
        name='Red',
        labels=list(red_lose_champs_formatted.keys()),
        values=list(red_lose_champs_formatted.values())
    ),
    row=1,
    col=2
)

fig.update_layout(
    height=800,
    width=800
)

iplot(fig)

It's interesting to see that Lee Sin is selected more on Red when they lose. I don't think this contributes to Red losing though.

### Red Winning

In [None]:
fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'domain'}, {'type': 'domain'}]],
                    subplot_titles=('Top 25 Red Champion Selections (Win)',
                                    'Top 25 Blue Champion Selections (Lose)')
                    )

fig.add_trace(
    go.Pie(
        name='Red',
        labels=list(red_win_champs_formatted.keys()),
        values=list(red_win_champs_formatted.values())
    ),
    row=1,
    col=1
)

fig.add_trace(
    go.Pie(
        name='Blue',
        labels=list(blue_lose_champs_formatted.keys()),
        values=list(blue_lose_champs_formatted.values())
    ),
    row=1,
    col=2
)

fig.update_layout(
    height=800,
    width=800
)

iplot(fig)

Interesting that when Blue loses, Lee Sin is very popular. At the same time, Lee Sin is the third most frequent pick when Red wins.

## First Inhibitor
Inhibitors are very important structures in League of Legends because they prevent the opposition from spawning super minions; once a lane's inhibitor is destroyed, super minions start pushing that lane. That being said, the team with a destroyed inhibitor is put at a major disadvantage. I would suspect that teams win significantly more often when they destroy the first inhibitor than when they do not.

### Blue Winning With and Without First Inhibitor

In [None]:
blue_win_finhibit = blue_win[blue_win['blue_firstInhibitor'] == 1]
blue_win_ninhibit = blue_win[blue_win['blue_firstInhibitor'] == 0]

red_win_finhibit = red_win[red_win['red_firstInhibitor'] == 1]
red_win_ninhibit = red_win[red_win['red_firstInhibitor'] == 0]

fig = go.Figure(data=[
    go.Pie(
        labels=['Blue Win with First Inhibitor', 'Blue Win without First Inhibitor'],
        values=[np.sum(blue_win_finhibit['blueWins']), np.sum(blue_win_ninhibit['blueWins'])]
    )
])

fig.update_layout(
    title='Blue Wins With and Without First Inhibitor',
    height=800,
    width=800
)

iplot(fig)

Quite a difference. The data set is relatively small, but I expect that if more data were added, the difference would still be large.

### Red Winning With and Without First Inhibitor

In [None]:
fig = go.Figure(data=[
    go.Pie(
        labels=['Red Win with First Inhibitor', 'Red Win without First Inhibitor'],
        values=[np.sum(red_win_finhibit['redWins']), np.sum(red_win_ninhibit['redWins'])]
    )
])

fig.update_layout(
    title='Red Wins With and Without First Inhibitor',
    height=800,
    width=800
)

iplot(fig)

Same result as Blue.

This means that destroying the inhibitor first is probably a good indicator of whether a team wins or not.
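
How strong an indicator a yes/no feature is can be summarised with the phi coefficient, which is the Pearson correlation between two binary variables. A minimal sketch on made-up games; `phi_coefficient` is a hypothetical helper and the values are not from the data set.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length sequences of 0/1 values."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denominator if denominator else float('nan')


# Made-up games: did the team destroy the first inhibitor, and did it win?
toy_first_inhibitor = [1, 1, 1, 1, 0, 0, 0, 0]
toy_wins = [1, 1, 1, 0, 0, 0, 0, 1]

print(phi_coefficient(toy_first_inhibitor, toy_wins))
```

A value near 1 or -1 means the feature nearly determines the outcome, while a value near 0 means it says little on its own.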

## First Baron
The Baron is the strongest monster in League. The team that defeats it gains buffs, including increased attack damage, increased ability power, and stronger minions. Due to this, I suspect that this will also be a strong indicator of who wins the match.

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Blue', x=[0], y=[np.sum(df['blue_firstBaron'])], width=0.5, marker_color='#084177'),
    go.Bar(name='Red', x=[1], y=[np.sum(df['red_firstBaron'])], width=0.5, marker_color='#d63447')
])

fig.update_layout(
    title='First Baron Count',
    xaxis=dict(
        tickvals=[i for i in range(2)],
        ticktext=['Blue', 'Red'],
        showticklabels=True
    ),
    width=800,
    height=800
)

iplot(fig)

It seems teams usually win more when getting the first baron, but the difference is not as severe as it is with inhibitors.

In [None]:
blue_win_fbaron = blue_win[blue_win['blue_firstBaron'] == 1]
blue_win_nbaron = blue_win[blue_win['blue_firstBaron'] == 0]

red_win_fbaron = red_win[red_win['red_firstBaron'] == 1]
red_win_nbaron = red_win[red_win['red_firstBaron'] == 0]

fig = go.Figure(data=[
    go.Pie(
        labels=['Blue Win with First Baron', 'Blue Win without First Baron'],
        values=[np.sum(blue_win_fbaron['blueWins']), np.sum(blue_win_nbaron['blueWins'])]
    )
])

fig.update_layout(
    title='Blue Wins With and Without First Baron',
    height=800,
    width=800
)

iplot(fig)

The idea is the same as it is with Blue.

I would suspect that getting the first baron helps to lead to the first inhibitor if not destroyed already, which would help secure the game.

## Tower Kills
Since deaths in League can be detrimental to a team, I would assume that if one team dies a lot to towers that they would most likely lose the match.

In [None]:
fig = go.Figure(data=[
    go.Histogram(
        name='Blue Team Killed by Towers (Won)',
        x=blue_win['red_towerKills']
    ),
    go.Histogram(
        name='Blue Team Killed by Towers (Lost)',
        x=red_win['red_towerKills']
    ),
    go.Histogram(
        name='Red Team Killed by Towers (Won)',
        x=red_win['blue_towerKills']
    ),
    go.Histogram(
        name='Red Team Killed by Towers (Lost)',
        x=blue_win['blue_towerKills']
    )
])

fig.update_layout(
    title='Distribution of Deaths due to Towers',
    height=800,
    width=800
)

iplot(fig)

This was basically the distribution I had in mind. The more deaths a team suffers to towers, the less likely it is to win the game. I think this will also be a strong indicator of winning or losing if data values lie outside of 4-6 tower deaths.

# Machine Learning
I will try using several different models in order to pick the most accurate result.

* Model selection will proceed by getting feature importances, tuning hyperparameters with 5-fold CV, and then using a train/test split to see which model performs best
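
As a reminder of what 5-fold cross-validation does, here is a minimal sketch of the index splitting only, written in plain Python. `k_fold_indices` is a hypothetical helper; it does not train or score any model.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, valid_idx


for train_idx, valid_idx in k_fold_indices(n_rows=10, k=5):
    print(len(train_idx), len(valid_idx))
```

In practice scikit-learn's `KFold`, or `GridSearchCV` with `cv=5`, would handle the splitting and the hyperparameter search together.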

## Feature Selection

In [None]:
import matplotlib.style as style

df_for_corr = df.copy()

plt.figure(figsize=(20, 20))

style.use('seaborn-poster')

df_for_corr.drop(bans + champs + ['redWins', 'redFirstBlood', 'red_firstInhibitor', 'red_firstBaron', 'red_firstRiftHerald'],  axis=1, inplace=True)
corr_df = df_for_corr.corr()

sns.heatmap(corr_df)
plt.title("Correlation Matrix", fontsize=25)
plt.tight_layout()

Some features will be removed regardless of correlation: redFirstBlood, red_firstInhibitor, red_firstBaron, red_firstRiftHerald, and gameId. The first four have Blue counterparts, and gameId is not necessary.

Other than those mentioned, I will keep everything else.

I plan to one-hot encode the champion-related columns, but this is problematic. There are currently 135 champions in League of Legends, which means that once I finish one-hot encoding, I will have ~2025 extra columns added to my original data set. So I plan to try different methods of reducing the dimensionality and compare their scores.
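
The number of columns that one-hot encoding adds is the sum of the distinct values in each encoded column. A minimal sketch on made-up rows; `one_hot_width` is a hypothetical helper and the champion names are only placeholders.

```python
def one_hot_width(rows, columns):
    """Number of indicator columns that one-hot encoding would create."""
    return sum(len({row[col] for row in rows}) for col in columns)


# Made-up picks and bans, not taken from the data set
toy_rows = [
    {'blue_champ_1': 'Ahri', 'red_champ_1': 'Lee Sin', 'ban_1': 'Kassadin'},
    {'blue_champ_1': 'Ezreal', 'red_champ_1': 'Lee Sin', 'ban_1': 'Kassadin'},
    {'blue_champ_1': 'Ahri', 'red_champ_1': 'Thresh', 'ban_1': 'Yasuo'},
]

print(one_hot_width(toy_rows, ['blue_champ_1', 'red_champ_1', 'ban_1']))  # 6
```

On a dataframe, summing `DataFrame.nunique()` over the champion and ban columns gives the same count.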

In [None]:
def evaluate_dist(df, cols):
    champ_cols = ['blue_champ_1', 'blue_champ_2', 'blue_champ_3', 'blue_champ_4', 'blue_champ_5',
                  'red_champ_1', 'red_champ_2', 'red_champ_3', 'red_champ_4', 'red_champ_5', 'ban_1',
                  'ban_2', 'ban_3', 'ban_4', 'ban_5', 'ban_6', 'ban_7', 'ban_8', 'ban_9', 'ban_10']
    kurtosis_results = dict()
    skewness = dict()

    for i in cols:
        if i not in champ_cols:
            kurtosis_results[i] = kurtosis(df[i])
            skewness[i] = skew(df[i])
    
    # Sort both result sets from lowest to highest value
    kurtosis_results = dict(sorted(kurtosis_results.items(), key=lambda item: item[1]))
    skewness = dict(sorted(skewness.items(), key=lambda item: item[1]))

    print('Skewness:\n')
    for col, value in skewness.items():
        print("{}: {}".format(col, value))

    print('\nKurtosis:\n')
    for col, value in kurtosis_results.items():
        print("{}: {}".format(col, value))

    print("\n\n")

## Preprocessing

* Use the RobustScaler from scikit-learn
* Perform MCA on nominal data
* Perform PCA on continuous data
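
With its default settings, RobustScaler subtracts the median of each feature and divides by its interquartile range, so a few extreme games have little influence on the scaling. A minimal NumPy sketch of that calculation for one column; `robust_scale` is a hypothetical helper, the gold totals are made up, and the sketch assumes the interquartile range is not zero.

```python
import numpy as np


def robust_scale(column):
    """Centre on the median and divide by the interquartile range."""
    column = np.asarray(column, dtype=float)
    median = np.median(column)
    q1, q3 = np.percentile(column, [25, 75])
    return (column - median) / (q3 - q1)


# Made-up gold totals with one outlier
toy_gold = np.array([15000, 16000, 17000, 18000, 30000])
print(robust_scale(toy_gold))
```

The outlier stays an outlier after scaling, but it does not shrink the other values the way mean and standard deviation scaling would.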

In [None]:
from scipy.stats import skew, kurtosis
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from prince import MCA

### Applying Log1p Transformations
Since PCA is driven by variance and works best when the data is roughly Gaussian, I will apply log1p to the heavily skewed features.

### Before

In [None]:
numerical_cols = [i for i in df if df[i].dtype in ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']]

evaluate_dist(preproc_df, numerical_cols)

### After

In [None]:
cols_to_be_transformed = ['blueWardsDestroyed', 'redWardsDestroyed',
                          'blueWardsPlaced', 'redWardsPlaced',
                          'redTowersDestroyed', 'blueTowersDestroyed']

for col in cols_to_be_transformed:
    preproc_df[col] = np.log1p(preproc_df[col])

evaluate_dist(preproc_df, cols_to_be_transformed)

I would've liked blue/redTowersDestroyed to be a little lower, but it should be OK. The log transformation helped lower the kurtosis score, but outliers are still very present in towers destroyed.

This is fine though, as the reason for the outliers is that many times, no towers are destroyed within the first 10 min. That means that if one tower is destroyed, it is significant regarding whether a team wins or not.

## Reducing Dimensionality
I began by looking into Logistic PCA, but unfortunately I could not find a Python package for it. I would've made one myself had I known Linear Algebra (haven't taken that course yet, but soon I will).

I then looked into MCA, which deals with nominal values in data. I read some papers to grasp an understanding of Multiple Correspondence Analysis and decided this would be my approach to reducing the dimensionality of my data.

I plan to do a grid search on the optimal n_components for MCA, but for now I am using 5 components for ban selections and 3 components for champion selections (red and blue).

In [None]:
combined_df.drop(['redFirstBlood', 'red_firstInhibitor', 'red_firstBaron', 'red_firstRiftHerald', 'gameId'], axis=1, inplace=True)

train_target = combined_df['blueWins'].iloc[:len(preproc_df)].reset_index(drop=True)
test_target = combined_df['blueWins'].iloc[len(preproc_df):].reset_index(drop=True)

champ_cols = ['blue_champ_1', 'blue_champ_2', 'blue_champ_3', 'blue_champ_4', 'blue_champ_5',
                  'red_champ_1', 'red_champ_2', 'red_champ_3', 'red_champ_4', 'red_champ_5', 'ban_1',
                  'ban_2', 'ban_3', 'ban_4', 'ban_5', 'ban_6', 'ban_7', 'ban_8', 'ban_9', 'ban_10']

combined_df.drop(['blueWins', 'redWins'], axis=1, inplace=True)

for col in champ_cols:
    combined_df[col] = get_champions(list(combined_df[col].values))

numerical_cols = [i for i in combined_df if combined_df[i].dtype in ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']]



cols_to_be_transformed = ['blueWardsDestroyed', 'redWardsDestroyed',
                          'blueWardsPlaced', 'redWardsPlaced',
                          'redTowersDestroyed', 'blueTowersDestroyed']

# Split the combined dfs back to the original data set and my own test data set
train_df = combined_df.iloc[:len(preproc_df)].reset_index(drop=True)
test_df = combined_df.iloc[len(preproc_df):, :].reset_index(drop=True)


# Preprocess train data

df_for_scale = train_df[train_df.columns[~train_df.columns.isin(champ_cols)]]

scaler = RobustScaler()
scaled_data = scaler.fit_transform(df_for_scale)
pca = PCA(.95)
pcs = pca.fit_transform(scaled_data)

pca_df = pd.DataFrame(pcs, columns=['PC_{}'.format(i) for i in range(np.size(pcs, 1))])

champ_df = train_df[train_df.columns[train_df.columns.isin(champ_cols)]]
champ_select_df = champ_df[champ_cols[:10]]
champ_ban_df = champ_df[champ_cols[10:]]

mca_ban = MCA(n_components=5)
mca_select = MCA(n_components=3)

ban_mca = mca_ban.fit_transform(champ_ban_df)
select_mca = mca_select.fit_transform(champ_select_df)

ban_mca.columns = ['MCA_Ban_{}'.format(i) for i in range(np.size(ban_mca, 1))]
select_mca.columns = ['MCA_Select_{}'.format(i) for i in range(np.size(select_mca, 1))]

train_reduced_df = pd.concat([ban_mca, select_mca, pca_df], axis=1)

# Preprocess Test Data

test_df_for_scale = test_df[test_df.columns[~test_df.columns.isin(champ_cols)]]

scaled_data = scaler.transform(test_df_for_scale)

pcs = pca.transform(scaled_data)

test_pca_df = pd.DataFrame(pcs, columns=['PC_{}'.format(i) for i in range(np.size(pcs, 1))])

champ_df = test_df[test_df.columns[test_df.columns.isin(champ_cols)]]
champ_select_df = champ_df[champ_cols[:10]]
champ_ban_df = champ_df[champ_cols[10:]]

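# Reuse the MCA models fitted on the training data. Refitting on the test set would
# produce components that do not line up with the ones the classifiers were trained on.
ban_mca = mca_ban.transform(champ_ban_df)
select_mca = mca_select.transform(champ_select_df)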

ban_mca.columns = ['MCA_Ban_{}'.format(i) for i in range(np.size(ban_mca, 1))]
select_mca.columns = ['MCA_Select_{}'.format(i) for i in range(np.size(select_mca, 1))]

test_reduced_df = pd.concat([ban_mca, select_mca, test_pca_df], axis=1)
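
The fit-on-train, transform-on-test pattern used for the scaler and PCA above can also be written as a single scikit-learn pipeline. This is only a sketch on random data. The array names are placeholders, and the MCA step is left out.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Random stand-ins for the numeric train/test features (assumption)
rng = np.random.default_rng(0)
train_X = rng.normal(size=(200, 8))
test_X = rng.normal(size=(50, 8))

# Scale, then keep enough components to explain 95% of the variance
reducer = make_pipeline(RobustScaler(), PCA(n_components=0.95))

train_pcs = reducer.fit_transform(train_X)  # statistics learned from train only
test_pcs = reducer.transform(test_X)        # the same mapping applied to test

print(train_pcs.shape, test_pcs.shape)
```

Because the pipeline is fitted once, the test set always gets the same number of components, in the same order, as the training set.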

In [None]:
train_reduced_df.head()

In [None]:
test_reduced_df.head()

Ready to start training!

## Model Selection
For every model, I run a grid search with cv = 5 to find the best hyperparameter combination, then predict on the test data set to see how the model performs.
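
The grid searches themselves are not shown in this kernel, so here is a minimal sketch of what one could look like with `GridSearchCV`. The synthetic data and the values in `param_grid` are assumptions for illustration, not the grid I actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the reduced features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid over the regularization strength
param_grid = {"C": [0.1, 0.5, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is refit on all of the training data by default, so it can be used directly to predict on the test set.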

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

In [None]:
x_train, x_test, y_train, y_test = train_test_split(train_reduced_df, train_target)

### Logistic Regression

In [None]:
model = LogisticRegression(C=0.5, fit_intercept=True, n_jobs=-1, penalty='l2')

model.fit(x_train, y_train)

y_pred = model.predict(x_test)

In [None]:
accuracy_score(y_test, y_pred)

Pretty high scores, so the model is possibly overfitting.

### XGBoost Classifier
The best cv score was 96%

In [None]:
model = XGBClassifier(learning_rate=0.09, n_estimators=500, n_jobs=-1)

model.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [None]:
accuracy_score(y_test, y_pred)

Really high score. I suspect overfitting, but I haven't looked into it much yet.
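
One quick way to look into a suspicion like this is to compare the accuracy on the training data with the cross-validated accuracy. A large gap hints at overfitting. The sketch below uses synthetic data and a decision tree as a stand-in for XGBoost, and the helper name `train_vs_cv_accuracy` is made up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def train_vs_cv_accuracy(model, X, y, cv=5):
    """Return (accuracy on the training data, mean cross-validated accuracy)."""
    # cross_val_score fits clones, so the model passed in is left untouched here
    cv_acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    train_acc = model.fit(X, y).score(X, y)
    return train_acc, cv_acc

# Noisy synthetic data: 10% of labels are flipped at random
X, y = make_classification(n_samples=400, n_features=12, flip_y=0.1, random_state=0)

train_acc, cv_acc = train_vs_cv_accuracy(DecisionTreeClassifier(random_state=0), X, y)
print(f"train={train_acc:.3f}  cv={cv_acc:.3f}  gap={train_acc - cv_acc:.3f}")
```

An unpruned tree memorizes the training data, so the gap is large here. A small gap on the real data would suggest the high scores are not just memorization.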

I think I will try to just pull random games from the League database and see how this model does.

I also plan to add more models to experiment some more.

## Using Custom Data Set that mimics the original

### Logistic Regression


In [None]:
model = LogisticRegression(C=0.5, fit_intercept=True, n_jobs=-1, penalty='l2')

model.fit(train_reduced_df, train_target)

y_pred = model.predict(test_reduced_df)
logit_matrix = confusion_matrix(test_target, y_pred)

In [None]:
accuracy_score(test_target, y_pred)

So the model performed slightly worse, but this test set is more than double the size of the one from the train test split variation. I am leaning slightly towards the view that the model just scores well and is not overfitting.

This is where I ask for help though, as I am relatively new and would appreciate it if more experienced Kagglers shared their take on whether my approach to achieving these scores is correct and avoids problems such as overfitting.

### XGBoost Classifier

In [None]:
model = XGBClassifier(learning_rate=0.09, n_estimators=500, n_jobs=-1, max_depth=5)

model.fit(train_reduced_df, train_target)
y_pred = model.predict(test_reduced_df)
xgboost_matrix = confusion_matrix(test_target, y_pred)

In [None]:
accuracy_score(test_target, y_pred)

Logistic Regression seems to outperform XGBoost by 3%, which is interesting.

Again, scores drop overall on my own test data set, but the models remain pretty accurate.

Considering that the train test split approach is not as good, I will discontinue it and use my own test data set as the validation data while I continue to experiment with other models.

### Random Forests Classifier

In [None]:
model = RandomForestClassifier(bootstrap=True, 
                               max_depth=5,
                               max_features='sqrt',  # same behavior as the removed 'auto' option for classifiers
                               min_samples_leaf=4, 
                               min_samples_split=5, 
                               n_estimators=500, 
                               oob_score=False)

model.fit(train_reduced_df, train_target)
y_pred = model.predict(test_reduced_df)
rf_matrix = confusion_matrix(test_target, y_pred)

In [None]:
accuracy_score(test_target, y_pred)

Seems that RF has performed the worst so far. 

### Support Vector Machine

In [None]:
# degree only applies to the 'poly' kernel, so it is omitted for 'rbf'
model = SVC(C=0.5, gamma='auto', kernel='rbf')

model.fit(train_reduced_df, train_target)
y_pred = model.predict(test_reduced_df)
svm_matrix = confusion_matrix(test_target, y_pred)

In [None]:
accuracy_score(test_target, y_pred)

SVM scores only a little worse than XGBoost, but ranks 3rd of the 4 models tested.
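
Since every model is fitted and scored the same way, the ranking can also be produced in one loop. This is a sketch on synthetic data with a held-out split. XGBoost is left out to keep the example dependent on scikit-learn only, and the hyperparameters are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for train_reduced_df / test_reduced_df
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(C=0.5, max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
    "SVM": SVC(C=0.5, gamma="auto", kernel="rbf"),
}

# Fit each model on the training split and score it on the held-out split
scores = {
    name: accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in models.items()
}

ranking = pd.Series(scores, name="accuracy").sort_values(ascending=False)
print(ranking)
```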

# Model Results

In [None]:
model_names = ["Logistic Regression", "XGBoost", "Random Forest", "SVM"]
model_results = [logit_matrix, xgboost_matrix, rf_matrix, svm_matrix]

fig, axs = plt.subplots(2,2, figsize=(11, 8))
grid_counter = 0
for i in range(2):
    for j in range(2):
        sns.heatmap(model_results[grid_counter], cmap='Blues', cbar=False, annot=True,
                    fmt='g', annot_kws={'size': 14}, ax=axs[i][j])

        axs[i][j].title.set_text(model_names[grid_counter])

        grid_counter += 1
plt.tight_layout()
plt.show()
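
The heatmaps show raw counts. To compare models on more than accuracy, precision and recall can be read off each confusion matrix. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predictions, and class 1 means a blue win) and uses made-up counts. The helper name `summarize_confusion` is hypothetical.

```python
import numpy as np

def summarize_confusion(cm):
    """Accuracy, precision and recall for a 2x2 matrix laid out as [[tn, fp], [fn, tp]]."""
    cm = np.asarray(cm, dtype=float)
    tn, fp, fn, tp = cm.ravel()
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),  # of the predicted blue wins, how many were right
        "recall": tp / (tp + fn),     # of the actual blue wins, how many were found
    }

example = [[50, 10], [5, 35]]  # made-up counts, not results from this kernel
stats = summarize_confusion(example)
print(stats)
```

The same function could be applied to `logit_matrix`, `xgboost_matrix`, `rf_matrix` and `svm_matrix` from the cells above.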

# Conclusion

This data set was pretty fun to explore, and I am excited to be able to test random games at will and see how my models do. That idea and adding more models are on my to-do list.

Please feel free to criticize or give advice on my kernel, as I am looking to improve my skills.




# Acknowledgements

[Multiple Correspondence Analysis](https://www.researchgate.net/profile/Dominique_Valentin/publication/239542271_Multiple_Correspondence_Analysis/links/54a979900cf256bf8bb95c95.pdf) by Hervé Abdi & Dominique Valentin

[Variable Extractions using Principal Components
Analysis and Multiple Correspondence Analysis for
Large Number of Mixed Variables Classification
Problems](https://s3.amazonaws.com/academia.edu.documents/50782331/GJPAM_Published.pdf?response-content-disposition=inline%3B%20filename%3DVariable_Extractions_using_Principal_Com.pdf&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIATUSBJ6BAG7F36CVX%2F20200530%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200530T070022Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEKX%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJHMEUCIQD3gSoTHQ63nANvvqMflElvp2TbA9R7XSTkFbbPulMQ6QIgQgPJBUDeT5o%2BJsw5Pc7J7jWr4s38dLlCHw%2B1a8XY%2BAMqvQMI%2Fv%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAAGgwyNTAzMTg4MTEyMDAiDKwvShduLVGpTOEQUSqRA4XRIgf7eWeKKhYlHQzWolE5Vk1g9dfu73310GIKNjPIPqN5ocYMMnsJJkSPha4Cpbab41m6x3hIxjstzVkMBf8qa65WEE7vJA%2BWtPfntmgNieznvvb8fdEiEPjjwmHlFbbRDVw8WVboG5CvCBV2En1f3Audu1%2BVvOr3x6wfS1uJSQVPjYFP2cXkIr3wIJBDhqjWQZfl36pNHR%2BU2z5pTrn8jgQ0lgkiw%2Bnp01hJqnc3CsteBLcES5J1TgfPkBnBF2VfQECRuYejOkzyhq%2F8fsvP%2BIbEisjuzDiTgHqeWwIs%2BCXlZsiJLR1OK2iEd2hMl0gzUDXxl2yoqmkpv83AUchh%2BtORGjTJE7yA26MSX57aqeGZz%2Fqf1kj4Hj8pW4B7%2FUUcPXhpQ5ryDYKzyYtULzDCRQpif%2FjBUr1bZFj48jeIUY5bppEvyGsF8YJ5YiAS13jRysxz7FPG8XjoQm8%2BR84uYJqLsbNhBb0N3XMrisJ89er2DOg8iuoMH%2FWWBXIsdRjkFymb75xItgLipBEbFZqfMJLFx%2FYFOusBGpR9%2BtvXEK4vPfs95BDFzEu8kGK4pgB1M9PqxJPY9SzATCWo3%2FD3MDRC3OjWwbyZkhtYvehnipLldJwwjXNZAoQ%2BHJ2lwX8G8ZZMVvLkHb1Hp8WKhhOs4qEbaYrzbLgwi0UucN4yRw1Doi6BADo8k4fgZ1n5sVM0mH6mzPNpos%2BSuzm0lrVVd4SEQB1X7%2FEdiagrGzurSiVpMtCynpZzWnlKqWz10AcJS81KhZoLY%2FLbQiPWeutHOSZNvLVRzesqIx6wWoNliRH1yndEJhFC95IAuTT6fGphp6OFVACauf84tYzx5WMMYF0Csg%3D%3D&X-Amz-Signature=1365def7559093df3d23d4dae2e967cefb5f825fc808e02bc5095d18acc89b1d) by Hashibah Hamid, Nazrina Aziz, and Penny Ngu Ai Huong