# KNN modeling for LOL match result predictions (75.3% accuracy)

<img src="https://image.winudf.com/v2/image/Y29tLmxvbHdhbGxwYXBlci5oZC5sb2xwaWN0dXJlcy5waG90b3MuYmFja2dyb3VuZC5jdXRlLmNvb2wuYXJ0LmxvbGltYWdlcy5oZC5mcmVlX3NjcmVlbl8zXzE1MzEyNjgyNDhfMDgx/screen-3.jpg?fakeurl=1&type=.jpg"> </img>

League of Legends is a competitive multiplayer online game in which two teams of 5 players each, blue and red, race to destroy each other's base and, ultimately, its central structure, the Nexus, whose destruction ends the game. Each match usually lasts from 20 to 50 minutes. The objective of this study is **to develop a K-nearest-neighbors model to predict, based on game data from the first 10 minutes, whether the winner is the blue team or the red team**. The dataset used contains information from roughly 10,000 matches between high-ranked players, and includes statistics extracted at the 10-minute mark of the game plus the final match result.

Brief glossary about the game:

* Warding totem: An item that a player can put on the map to reveal the nearby area. Very useful for map/objectives control.
* Minions: NPCs that belong to both teams (each side has its own). They give gold when killed by players.
* Jungle minions: NPCs that belong to NO TEAM. They give gold and buffs when killed by players.
* Elite monsters: Monsters with high HP/damage that give a massive bonus (gold/XP/stats) when killed by a team.
* Dragons: Elite monsters that give a team bonus when killed. The 4th dragon killed by a team gives a massive stats bonus. The 5th dragon (Elder Dragon) offers a huge advantage to the team.
* Herald: Elite monster that gives a stats bonus when killed by a player. It helps to push a lane and destroy structures.
* Towers: Structures you have to destroy to reach the enemy Nexus. They give gold.
* Level: Champion level. Start at 1. Max is 18.

In [None]:
#importing necessary libraries

import pandas as pd
from sklearn.preprocessing import MaxAbsScaler
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('../input/league-of-legends-diamond-ranked-games-10-min/high_diamond_ranked_10min.csv')

print ('Importing and reading done successfully!')

First we are going to take a look at the dataset:

1) Checking for null values.

2) Checking the features we have and their datatypes.

In [None]:
#checking for nan

nan_counts = df.isna().sum()
nan_columns = nan_counts[nan_counts > 0]
if nan_columns.empty:
    print('No missing values in your dataset!')
else:
    #column names together with their missing-value counts
    print(nan_columns)
    
#checking dtypes and listing features
print (df.dtypes)

We have no null values and all our data is numeric, so we don't have to worry about mapping categorical features. It's important to notice, however, that the FirstBlood feature has a categorical nature (it is a 0/1 flag rather than a quantity), so we should be careful with this one because it may introduce weird behavior in a distance-based model.
Now, we should check our target variable to see if it is evenly distributed. If it isn't, we may have some problems regarding bias.

In [None]:
#check blueWins distribution
print (df.blueWins.value_counts())


So we have roughly 50% wins for each team in our dataset, which is very good.
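If you prefer to see the balance as proportions instead of raw counts, `value_counts(normalize=True)` does that directly. Below is a minimal sketch on a tiny made-up frame called `toy` (an assumption, standing in for `df`), so the numbers are illustrative only:

```python
import pandas as pd

# Toy stand-in for df: 1 = blue win, 0 = red win (made-up values).
toy = pd.DataFrame({"blueWins": [1, 0, 1, 0, 0, 1, 1, 0]})

# normalize=True returns the share of each class instead of the count.
shares = toy["blueWins"].value_counts(normalize=True)
print(shares)

# Largest gap between a class share and a perfectly even split.
imbalance = (shares - 0.5).abs().max()
print("max deviation from 50/50:", imbalance)
```

On the real data the same two lines would run on `df.blueWins`.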

Now, to the feature engineering:
We have to select (and create) the most relevant features to be used in our model. League of Legends is a competitive game, and this means that each and every metric available is only meaningful with context: there is no point in analyzing how much gold the blue team has by itself, for example. However, how much **more** gold the blue team has than the red team is most definitely relevant. With that in mind, the feature engineering here will focus on creating "Difference" features that express the **lead** each team has in the match.

We will discard the absolute features (such as redKills and blueDeaths) in favor of the "difference" features we created. We will also use the gold and experience differences already included in the dataset (the `blueGoldDiff` and `blueExperienceDiff` columns), which leaves us with 17 features and the target column (blueWins).

In [None]:
#creating new features

df['WardPlaceDiff']=df['blueWardsPlaced']-df['redWardsPlaced']
df['WardDestroyDiff']=df['blueWardsDestroyed']-df['redWardsDestroyed']
df['FirstBloodDiff']=df['blueFirstBlood']-df['redFirstBlood']
df['KillDiff']=df['blueKills']-df['redKills']
df['DeathDiff']=df['blueDeaths']-df['redDeaths']
df['AssistDiff']=df['blueAssists']-df['redAssists']
df['EliteMonsterDiff']=df['blueEliteMonsters']-df['redEliteMonsters']
df['DragonDiff']=df['blueDragons']-df['redDragons']
df['HeraldDiff']=df['blueHeralds']-df['redHeralds']
df['TowerDestroyDiff']=df['blueTowersDestroyed']-df['redTowersDestroyed']
df['AvgLevelDiff']=df['blueAvgLevel']-df['redAvgLevel']
df['MinionsDiff']=df['blueTotalMinionsKilled']-df['redTotalMinionsKilled']
df['JungleMinionsDiff']=df['blueTotalJungleMinionsKilled']-df['redTotalJungleMinionsKilled']
df['CSdiff']=df['blueCSPerMin']-df['redCSPerMin']
df['GPMdiff']=df['blueGoldPerMin']-df['redGoldPerMin']

#selecting relevant features

relevant=[
          'blueWins',
          'WardPlaceDiff',
          'WardDestroyDiff',
          'FirstBloodDiff',
          'KillDiff',
          'DeathDiff',
          'AssistDiff',
          'EliteMonsterDiff',
          'DragonDiff',
          'HeraldDiff',
          'TowerDestroyDiff',
          'AvgLevelDiff',
          'MinionsDiff',
          'JungleMinionsDiff',
          'blueGoldDiff',
          'blueExperienceDiff',
          'CSdiff',
          'GPMdiff'
           ]

print ('Step saved successfully!')
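The cell above writes each difference by hand. When the blue and red columns share the same suffix, the same idea can be sketched as a small loop. The helper name `add_diff_features` and the `toy` frame are hypothetical, and the generated names (`KillsDiff`, `DragonsDiff`) differ slightly from the ones used in this notebook:

```python
import pandas as pd

def add_diff_features(frame, stats):
    """Return a copy of frame with '<stat>Diff' = blue<stat> - red<stat> columns."""
    out = frame.copy()
    for stat in stats:
        out[stat + "Diff"] = out["blue" + stat] - out["red" + stat]
    return out

# Made-up values for two matches, using the dataset's blue/red naming pattern.
toy = pd.DataFrame({
    "blueKills": [5, 2], "redKills": [3, 6],
    "blueDragons": [1, 0], "redDragons": [0, 1],
})

toy = add_diff_features(toy, ["Kills", "Dragons"])
print(toy[["KillsDiff", "DragonsDiff"]])
```

A positive value means blue is ahead on that stat, a negative one means red is ahead.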

Now, with the feature design out of the way, we have to split the data into our training and testing subsets.

I have chosen to use, for training, a subset of 7750 of the 9879 samples we have (~78%).


It's also important to:

* Observe the scaled data distribution and try to have some intuition about which features are more important
* Randomize our dataset order to minimize possible bias.
* Transform our numerical features to the same scale.


The scaling step is important because KNN uses the distance between samples, computed from their numerical features, as its criterion. With raw data, features measured in larger numbers (gold, for example) naturally weigh a lot more in that distance even if they are not necessarily more relevant.
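To make this concrete, here is a small sketch with three made-up matches described by a gold difference and a dragon difference. It uses the Manhattan distance, which is what `p = 1` selects in `KNeighborsClassifier`. The array `X` and the helper `manhattan` are assumptions for illustration:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up rows: [gold difference, dragon difference].
X = np.array([
    [ 2000.0,  1.0],
    [ 1900.0, -1.0],
    [-2000.0,  1.0],
])

def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski distance with p=1)."""
    return np.abs(a - b).sum()

# Raw data: the gold column dominates, the dragon column barely counts.
raw_01 = manhattan(X[0], X[1])
raw_02 = manhattan(X[0], X[2])
print("raw:   ", raw_01, raw_02)

# MaxAbsScaler divides each column by its largest absolute value.
X_scaled = MaxAbsScaler().fit_transform(X)
scaled_01 = manhattan(X_scaled[0], X_scaled[1])
scaled_02 = manhattan(X_scaled[0], X_scaled[2])
print("scaled:", scaled_01, scaled_02)
```

Before scaling, row 1 looks far closer to row 0 than row 2 does, purely because of gold. After scaling, the opposite dragon result weighs about as much as the full swing in gold, so the two distances end up nearly equal.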


In [None]:
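dados = df[relevant]

#scaling every column just for this inspection (blueWins stays 0/1)
scaler2 = MaxAbsScaler()
analisedados = scaler2.fit_transform(dados)
analisedf = pd.DataFrame(data=analisedados, columns=dados.columns)

#mean of each scaled feature per match result (0 = red win, 1 = blue win)
print (analisedf.groupby(by='blueWins').mean().T)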


Now, with all features scaled into a comparable range, we can see which of them relate more strongly to the match result by comparing each feature's mean in the games blue won against its mean in the games blue lost. By this criterion, for example, we're hinted that maybe we could discard the features WardPlaceDiff, WardDestroyDiff, AssistDiff, HeraldDiff, TowerDestroyDiff and JungleMinionsDiff, because the gap between their two means is relatively small. Also, it's important to test carefully with FirstBloodDiff because of its categorical nature. We will test these variations later.
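The comparison described above can also be turned into an explicit ranking. This is a rough sketch on synthetic data, not on the real dataset: `GoldDiffToy` is built to shift with the outcome and `WardDiffToy` is pure noise, and both names and effect sizes are made up:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(0)
n = 1000
wins = rng.integers(0, 2, size=n)

toy = pd.DataFrame({
    "blueWins": wins,
    # Shifts with the outcome (made-up effect size).
    "GoldDiffToy": rng.normal(0, 1000, n) + 1500 * (2 * wins - 1),
    # Independent of the outcome.
    "WardDiffToy": rng.normal(0, 10, n),
})

# Scale the features only, then average them per match result.
features = toy.drop(columns="blueWins")
scaled = pd.DataFrame(MaxAbsScaler().fit_transform(features),
                      columns=features.columns)
class_means = scaled.groupby(toy["blueWins"]).mean()

# Absolute gap between the two class means, largest first.
gap = (class_means.loc[1] - class_means.loc[0]).abs().sort_values(ascending=False)
print(gap)
```

A small gap is only a hint that a feature may be weak, which is why the variations are still tested with the model afterwards.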

In [None]:
#getting the subset of our elected features and randomizing it using a seed to get reproducible results

dados = df[relevant]

dados_embaralhados=dados.sample(frac=1, random_state = 4234)

#splitting the target column out of the dataframe

x = dados_embaralhados.loc[:,dados_embaralhados.columns!='blueWins'].values
y = dados_embaralhados.loc[:,dados_embaralhados.columns=='blueWins'].values

#defining our training sample size and splitting our data

q = 7750

x_treino = x[:q,:]
y_treino = y[:q].ravel()

x_teste = x[q:,:]
y_teste = y[q:].ravel()

#scaling the features

scaler = MaxAbsScaler()
scaler.fit(x_treino)

x_treino = scaler.transform(x_treino)
x_teste = scaler.transform(x_teste)

print ('Step saved successfully!')
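For reference, the shuffle, split and scale steps above can also be written with scikit-learn's `train_test_split`. This is an alternative sketch on random stand-in arrays, not the method used in this notebook, and it will not pick the same rows as `sample(frac=1, random_state=4234)` even with the same seed:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

# Random stand-in data: 100 samples, 3 features (made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# An integer train_size is an absolute number of training samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=78, shuffle=True, random_state=4234
)

# Fit the scaler on the training rows only, then apply it to both subsets.
scaler = MaxAbsScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)
```

As in the cell above, the scaler never sees the test rows while being fitted, so the test set stays unseen during preprocessing.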

Now all that's left is to build the classifier itself:


As the K choice is highly experimental, we will build a testing loop to see which K favors us the most, printing our % accuracy in each instance of the loop:

In [None]:
print ( "\n  K TRAINING  TEST")
print ( " -- ------ ------")

for k in range(40,60):

    classificador = KNeighborsClassifier(
        n_neighbors = k,
        weights     = 'uniform',
        p           = 1
        )
    classificador = classificador.fit(x_treino,y_treino)

    y_resposta_treino = classificador.predict(x_treino)
    y_resposta_teste  = classificador.predict(x_teste)
    
    acuracia_treino = sum(y_resposta_treino==y_treino)/len(y_treino)
    acuracia_teste  = sum(y_resposta_teste ==y_teste) /len(y_teste)
    
    print(
        "%3d"%k,
        "%6.1f" % (100*acuracia_treino),
        "%6.1f" % (100*acuracia_teste)
        )
    
    

That gives us, for the K range tested, an optimal K of 58. Our best model so far, with an accuracy of roughly 74.2%, is:

In [None]:
classificador = KNeighborsClassifier(
    n_neighbors = 58,
    weights     = 'uniform',
    p           = 1
    )
classificador = classificador.fit(x_treino,y_treino)

y_resposta_treino = classificador.predict(x_treino)
y_resposta_teste  = classificador.predict(x_teste)
    
acuracia_treino = sum(y_resposta_treino==y_treino)/len(y_treino)
acuracia_teste  = sum(y_resposta_teste ==y_teste) /len(y_teste)
    
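print(
        #report the K actually used by this classifier, not the leftover loop variable
        "%3d" % classificador.n_neighbors,
        "%6.1f" % (100*acuracia_treino),
        "%6.1f" % (100*acuracia_teste)
        )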
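The loop above picks K by looking at the test accuracy. Another common option is to choose K by cross-validation on the training data only, so the test set is used just once at the end. Here is a hedged sketch with `GridSearchCV` on synthetic data from `make_classification`. The K grid and dataset sizes are arbitrary assumptions, not tuned for this problem:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=0)

# Putting the scaler in the pipeline refits it inside every CV fold.
pipe = make_pipeline(
    MaxAbsScaler(),
    KNeighborsClassifier(weights="uniform", p=1),
)

search = GridSearchCV(
    pipe,
    # make_pipeline names the step after the lowercased class name.
    param_grid={"kneighborsclassifier__n_neighbors": list(range(5, 30, 5))},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("best K:", best_k, "mean CV accuracy:", round(search.best_score_, 3))
```

With the real data, `X` and `y` would be the unscaled training subset and the grid would cover the 40 to 59 range tested above.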

# Testing with discarded features

Now, we should try to comment out some of the features and repeat the modeling to see if our model's performance improves. I've done some testing and the best combination I have found is the following:

In [None]:
relevant=['blueWins',
          # 'WardPlaceDiff',
          # 'WardDestroyDiff',
          # 'FirstBloodDiff',
          'KillDiff',
          'DeathDiff',
          # 'AssistDiff',
          'EliteMonsterDiff',
          'DragonDiff',
          # 'HeraldDiff',
          # 'TowerDestroyDiff',
          'AvgLevelDiff',
          'MinionsDiff',
          #'JungleMinionsDiff',
          'blueGoldDiff',
          'blueExperienceDiff',
          'CSdiff',
          'GPMdiff'
          ]
print ('Step saved successfully!')
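The combination above was found by manual testing. For comparison, scikit-learn's `SequentialFeatureSelector` can automate a similar search by dropping one feature at a time and keeping whichever removal hurts cross-validated accuracy the least. This is only a sketch on synthetic data. The K value and the number of features to keep are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic data: 6 features, of which 3 carry signal.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

knn = make_pipeline(MaxAbsScaler(), KNeighborsClassifier(n_neighbors=15, p=1))

# Backward selection: start with all features and remove them one by one.
selector = SequentialFeatureSelector(
    knn,
    n_features_to_select=3,
    direction="backward",
    scoring="accuracy",
    cv=5,
)
selector.fit(X, y)

kept = selector.get_support()  # boolean mask over the original columns
print(kept)
```

A greedy search like this is not guaranteed to find the best subset, so it works better as a starting point for manual testing than as a replacement for it.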

In [None]:
#getting the subset of our elected features and randomizing it using a seed to get reproducible results

dados = df[relevant]

dados_embaralhados=dados.sample(frac=1, random_state = 4234)

#splitting the target column out of the dataframe

x = dados_embaralhados.loc[:,dados_embaralhados.columns!='blueWins'].values
y = dados_embaralhados.loc[:,dados_embaralhados.columns=='blueWins'].values

#defining our training sample size and splitting our data

q = 7750

x_treino = x[:q,:]
y_treino = y[:q].ravel()

x_teste = x[q:,:]
y_teste = y[q:].ravel()

#scaling the features

scaler = MaxAbsScaler()
scaler.fit(x_treino)

x_treino = scaler.transform(x_treino)
x_teste = scaler.transform(x_teste)

print ('Step saved successfully!')

In [None]:
print ( "\n  K TRAINING  TEST")
print ( " -- ------ ------")

for k in range(40,60):

    classificador = KNeighborsClassifier(
        n_neighbors = k,
        weights     = 'uniform',
        p           = 1
        )
    classificador = classificador.fit(x_treino,y_treino)

    y_resposta_treino = classificador.predict(x_treino)
    y_resposta_teste  = classificador.predict(x_teste)
    
    acuracia_treino = sum(y_resposta_treino==y_treino)/len(y_treino)
    acuracia_teste  = sum(y_resposta_teste ==y_teste) /len(y_teste)
    
    print(
        "%3d"%k,
        "%6.1f" % (100*acuracia_treino),
        "%6.1f" % (100*acuracia_teste)
        )
    

We can see here that this new model with selected features did perform a little better, reaching 75.3% accuracy for k=48!

As such, we can see that more data does not always equal better models. Sometimes discarding potentially irrelevant or difficult-to-analyze features can improve the result, not to mention reducing wasted computational power. This is a very important conclusion!

Our final model, therefore, is:

In [None]:
classificador = KNeighborsClassifier(
    n_neighbors = 48,
    weights     = 'uniform',
    p           = 1
    )
classificador = classificador.fit(x_treino,y_treino)

y_resposta_treino = classificador.predict(x_treino)
y_resposta_teste  = classificador.predict(x_teste)
    
acuracia_treino = sum(y_resposta_treino==y_treino)/len(y_treino)
acuracia_teste  = sum(y_resposta_teste ==y_teste) /len(y_teste)
    
print(
        'accuracy:',
        "%6.1f" % (100*acuracia_teste)
        )
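Accuracy alone doesn't show whether the errors fall more on blue wins or on red wins. A confusion matrix does, and `sklearn.metrics` also provides `accuracy_score` as a replacement for the manual `sum(...)/len(...)`. The labels below are made up for illustration. On the real model they would be `y_teste` and `y_resposta_teste`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true results and predictions (1 = blue win, 0 = red win).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)

# Rows are the true class, columns the predicted class, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print("accuracy:", acc)
print(cm)
```

The off-diagonal entries are the misclassified matches: top right are red wins predicted as blue wins, bottom left the opposite.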

# Considerations

We have to consider in our analysis some highly relevant aspects of this complex game:

* Most features are highly interdependent: Kills and Elite Monster kills, for example, also give gold and XP. This makes it very hard to isolate each variable's influence to build a model.
* Highly influential human factors: players are susceptible to making decisive bad plays and losing games they should have won.
* Small time window analyzed: 10 minutes can be as little as 20% of the total duration of a regular game, so we're extrapolating a lot of info here.
* Some characters are better at different stages of the game, creating a certain bias depending on exactly which ones were being played in that specific match.
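The interdependence mentioned in the first point can be checked with a correlation matrix. The sketch below uses synthetic columns and assumes, for illustration, that a per-minute stat at the 10-minute mark is simply the total divided by 10. The column names and the 0.95 threshold are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
minions = rng.integers(-40, 41, size=200)

toy = pd.DataFrame({
    "MinionsDiffToy": minions,
    # Assumed relation: per-minute value = total / 10 at the 10-minute mark.
    "CSdiffToy": minions / 10.0,
    # Unrelated noise.
    "WardDiffToy": rng.integers(-20, 21, size=200),
})

corr = toy.corr().abs()

# List each pair of distinct columns whose absolute correlation is very high.
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.95
]
print(pairs)
```

Pairs flagged this way carry almost the same information, so keeping both effectively counts that information twice in the KNN distance.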

Considering the factors above and the number of delicate points that determine the outcome of a match, I would say 75.3% accuracy using a simple model such as this one is a very decent result. It also reflects a certain level of predictability, which reinforces the idea that games at a higher skill level tend to have less room for human mistakes, because each player is playing much closer to the maximum potential of their character and situation. It's probably much harder to predict results involving average players because their games involve a lot more random human mistakes, thus being less predictable.
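To get a feeling for how much the 75.3% figure could move with a different test split, we can compute a rough interval from the size of the test set. This is a back-of-the-envelope sketch that assumes a normal approximation to the binomial and independent test samples, using only numbers already quoted in this notebook:

```python
import math

n_total, n_train = 9879, 7750   # sample counts quoted in the notebook
n_test = n_total - n_train      # matches held out for testing
p = 0.753                       # reported test accuracy

# Standard error of a proportion under the normal approximation.
se = math.sqrt(p * (1 - p) / n_test)
low, high = p - 1.96 * se, p + 1.96 * se

print(f"test samples: {n_test}")
print(f"standard error: {100 * se:.2f} percentage points")
print(f"approx. 95% interval: {100 * low:.1f}% to {100 * high:.1f}%")
```

Differences of about one percentage point between two models are of the same order as this uncertainty, so they are worth confirming on more than one split.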

One very relevant conclusion, however, is that this statistical approach suggests that, at the end of the day, wards placed and destroyed in the early game don't make a lot of difference in deciding the game winner. Based on my game knowledge, my guess is that the relevance of warding ramps up over the course of the game, because at the start both teams have all their towers, so the players have a lot less space to make vision-based outplays. This, combined with the fact that high-level players, with their experience, can guess very well what is happening without necessarily having vision of the enemy, may explain this result.

If you liked this notebook and found it useful in any way, I'd greatly appreciate your upvote :)
