# Introduction
The term esport refers to competitions held using video games. Esport competitions are mostly based on multiplayer games, but on some occasions they are held using single-player games. In recent years the esport scene has developed rapidly. In this notebook I will analyse the development of the esport scene over the years 1998-2020.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# omit one of the warnings
import warnings
warnings.filterwarnings("ignore", message="Glyph 146 missing from current font.")
pd.options.mode.chained_assignment = None

# read the data
esport = pd.read_csv("/kaggle/input/esports-earnings/EsportsEarnings_final.csv", encoding = "ISO-8859-1")

esport

Without changing the encoding, the data could not be read. In the later stages of the analysis there was an issue with glyph 146; as it is not a crucial factor in the analysis, I have decided to suppress the warning to keep the analysis clear.
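
The encoding issue can be reproduced in isolation. This is a minimal sketch, assuming the problematic character is the single byte `0x92` (decimal 146, which would match the glyph warning); the sample string is made up for illustration.

```python
# Hypothetical sample: a byte string containing 0x92 (decimal 146).
raw = b"Player\x92s Cup"

# UTF-8 rejects a lone 0x92 byte, which would explain why the default read fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every byte to the code point with the same number,
# so 0x92 becomes the control character U+0092, which most fonts cannot draw.
latin1_text = raw.decode("ISO-8859-1")

# Windows-1252 maps the same byte to a typographic apostrophe (U+2019).
cp1252_text = raw.decode("cp1252")

print(utf8_ok, ord(latin1_text[6]), cp1252_text)
```

If the file was in fact saved as Windows-1252, passing `encoding="cp1252"` to `pd.read_csv` might make the warning filter unnecessary; this has not been verified against the dataset.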

# Data cleaning

In [None]:
esport.info()

All data types seem to be in order. Numeric features are either int or float, and categorical variables and names are objects. *Date* is the only exception; I am going to handle it in one of the following sections.

In [None]:
esport.describe()

Almost all columns seem to have reasonable ranges of values, except *ReleaseDate*. I am wondering which game was released in the year 11...

In [None]:
esport[esport['ReleaseDate']==esport['ReleaseDate'].min()]

**Forza Motorsport 4** was released in **2011**, so the number 11 must stand for the year 2011. I am going to change it so that it matches the notation of the other release dates. But let's see whether this is the case with other titles.

In [None]:
# check whether there are any entries with a release date earlier than 1900, which should be impossible
esport[esport.ReleaseDate<1900]

**Forza Motorsport 4** is the only case, so I will change it manually.

In [None]:
esport.loc[1576, 'ReleaseDate'] = 2011
esport.describe()

Now everything seems fine, so we may proceed.
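
The manual fix relies on the row label 1576. A more general sketch is shown below, assuming that any two-digit value in *ReleaseDate* is a year in the 2000s written without its century. That holds for the single case found here but is an assumption in general, and the frame is toy data.

```python
import pandas as pd

# Toy data standing in for the real frame (values are illustrative only).
toy = pd.DataFrame({
    "Game": ["Game A", "Game B", "Game C"],
    "ReleaseDate": [1998, 11, 2015],
})

# Assumption: a two-digit release year means 20xx.
two_digit = toy["ReleaseDate"] < 100
toy.loc[two_digit, "ReleaseDate"] += 2000

print(toy["ReleaseDate"].tolist())
```

Selecting by condition rather than by row label keeps the fix valid if the file is re-sorted or extended.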

# Missing Values

In [None]:
esport.isnull().sum()

There are no missing values in the data set.

# Date features extraction
The *Date* feature seems to consist of the day, month and year of the event. I will split this information and store it in separate columns.

In [None]:
esport[["Year", "Month", "Day"]] = esport.Date.str.split('-', expand=True)
esport[["Year", "Month", "Day"]] = esport[["Year", "Month", "Day"]].astype(int)
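
An alternative sketch of the same extraction uses pandas datetime parsing. It assumes every *Date* value follows the `YYYY-MM-DD` layout implied by the split on `-`; the toy frame is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"Date": ["1998-01-01", "2011-08-01", "2020-03-01"]})

# Parse once, then read the components from the .dt accessor.
parsed = pd.to_datetime(toy["Date"], format="%Y-%m-%d")
toy["Year"] = parsed.dt.year
toy["Month"] = parsed.dt.month
toy["Day"] = parsed.dt.day

print(toy)
```

With an explicit `format`, a malformed date raises an error instead of silently producing odd column values.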

# Number of tournaments versus prize pool, and how the release date influences them

I am going to check which game had the most tournaments and which game had the highest prize pool, and analyse whether the release date influenced them.

In [None]:
# Data grouping and aggregation
esport_counts = esport[["Game", "Tournaments", "ReleaseDate", "Earnings", "Genre"]].groupby("Game").agg({"Tournaments": np.sum, "ReleaseDate": np.mean, 
                                                                                                        "Earnings": np.sum, "Genre": lambda x:x.value_counts().index[0]})
esport_counts.sort_values("Tournaments", ascending=False, inplace=True)
esport_counts.reset_index(inplace=True)

# create figure
fig, ax = plt.subplots(figsize=(30,15))
ax = sns.scatterplot(x = "ReleaseDate", y = "Tournaments", size = "Earnings", hue="Genre", data=esport_counts, sizes=(100, 10000), alpha=.5)
# find N games with the most numbers of tournaments
most_tournamets = esport_counts.nlargest(15, "Tournaments")
# add annotation to found games
for line in range(0,esport_counts.shape[0]):
    if esport_counts.Game[line] in list(most_tournamets.Game):
        ax.text(esport_counts.ReleaseDate[line], esport_counts.Tournaments[line], esport_counts.Game[line], horizontalalignment='center', 
                size='x-large', color='black', weight='semibold')
# modify legend
handles, labels = ax.get_legend_handles_labels()
to_skip = len(np.unique(esport_counts.Genre))+2
for h in handles[to_skip:]:
    sizes = [s / 100 for s in h.get_sizes()] # smaller Earnings scatter points on legend
    label = h.get_label()
    label = str(float(label)*100) +" mln"
    h.set_sizes(sizes) # set them
    h.set_label(label)
#plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., fontsize='large', title_fontsize='40') # bigger legend font size
plt.legend(loc=2, fontsize='x-large')
ax.xaxis.label.set_size(20)
ax.xaxis.set_tick_params(labelsize='x-large')
ax.yaxis.label.set_size(20)
ax.yaxis.set_tick_params(labelsize='x-large')
plt.rc('axes', labelsize="x-large")    # fontsize of the axes labels
plt.title("Game release date to number of overall tournaments with overall prize pools", fontsize=24)
plt.show()

**StarCraft II** had the most tournaments, as shown on the scatter plot. Right after it come **Counter-Strike: Global Offensive** and **Super Smash Bros. Melee**.

**Dota 2** had the biggest overall prize pool. Interestingly, despite a relatively recent release date and a number of tournaments between 500 and 1000, **Fortnite**'s overall prize pool is very large.

Based on the plot, *ReleaseDate* does not seem to have any significant influence on the number of tournaments or the prize pool. Logically, *ReleaseDate* should have some influence on the number of tournaments, since older games have had more time to accumulate them, but it is not the most important factor.
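
The visual impression could be backed by a rank correlation. This is a minimal sketch on made-up numbers (not taken from the dataset), computing Spearman's coefficient as the Pearson correlation of the ranks; `spearman` is a hypothetical helper.

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({
    "ReleaseDate": [1998, 2001, 2010, 2013, 2017],
    "Tournaments": [120, 40, 900, 300, 60],
})

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranked values."""
    return a.rank().corr(b.rank())

rho = spearman(toy["ReleaseDate"], toy["Tournaments"])
print(round(rho, 2))
```

On the real data this could be called as `spearman(esport_counts["ReleaseDate"], esport_counts["Tournaments"])`; a value close to zero would support the reading of the plot.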

In [None]:
import calendar
# Data grouping and aggregation
esport_dota2 = esport[esport["Game"] == "Dota 2"]
esport_dota2 = esport_dota2.groupby(["Year", "Month"]).agg({"Tournaments": np.sum, "ReleaseDate": np.mean, "Earnings": np.mean, "Genre": lambda x:x.value_counts().index[0]})
esport_dota2.sort_values("Tournaments", ascending=False, inplace=True)
esport_dota2.reset_index(inplace=True)
esport_dota2["Month name"] = esport_dota2["Month"].apply(lambda x: calendar.month_abbr[x])
#esport_dota2["Date"] = pd.to_datetime(esport_dota2[["Year", "Month"]])]
plt.figure(figsize=(25,10))
plt.subplot(1,2,1)
plt.title("Dota 2 amount of tournaments", fontsize=24)
ax = sns.boxplot(x="Month name", y="Tournaments", data=esport_dota2)
ax.xaxis.label.set_size(20)
ax.xaxis.set_tick_params(labelsize='x-large')
ax.yaxis.label.set_size(20)
ax.yaxis.set_tick_params(labelsize='x-large')
plt.subplot(1,2,2)
plt.title("Dota 2 value of the rewards", fontsize=24)
ax = sns.boxplot(x="Month name", y="Earnings", data=esport_dota2)
ax.xaxis.label.set_size(20)
ax.xaxis.set_tick_params(labelsize='x-large')
ax.yaxis.label.set_size(20)
ax.yaxis.set_tick_params(labelsize='x-large')
plt.show()

Across the months of the years 2011-2020, **Dota 2** did not show consistency in the number of tournaments. For the rewards the lack of consistency is very noticeable: in August the **Dota 2** tournament The International is held, and its prize pools are usually enormous. I have to mention, however, that this reward is split between teams, their sponsors and the players; an individual player does not get 35 mln for winning the tournament.

In [None]:
# Data grouping and aggregation
esport_sc2 = esport[esport["Game"] == "StarCraft II"]
esport_sc2 = esport_sc2.groupby(["Year", "Month"]).agg({"Tournaments": np.sum, "ReleaseDate": np.mean, "Earnings": np.mean, "Genre": lambda x:x.value_counts().index[0]})
esport_sc2.sort_values("Tournaments", ascending=False, inplace=True)
esport_sc2.reset_index(inplace=True)
esport_sc2["Month name"] = esport_sc2["Month"].apply(lambda x: calendar.month_abbr[x])
#esport_dota2["Date"] = pd.to_datetime(esport_dota2[["Year", "Month"]])]
plt.figure(figsize=(25,10))
plt.subplot(1,2,1)
plt.title("StarCraft 2 amount of tournaments")
ax = sns.boxplot(x="Month name", y="Tournaments", data=esport_sc2)
ax.xaxis.label.set_size(20)
ax.xaxis.set_tick_params(labelsize='x-large')
ax.yaxis.label.set_size(20)
ax.yaxis.set_tick_params(labelsize='x-large')
plt.subplot(1,2,2)
plt.title("StarCraft 2 value of the rewards")
ax = sns.boxplot(x="Month name", y="Earnings", data=esport_sc2)
ax.xaxis.label.set_size(20)
ax.xaxis.set_tick_params(labelsize='x-large')
ax.yaxis.label.set_size(20)
ax.yaxis.set_tick_params(labelsize='x-large')
plt.show()

**StarCraft II**, on the other hand, is very consistent in terms of the number of tournaments. The biggest rewards are usually given out in November, but the difference between the November rewards and those of other months is not as big as in the **Dota 2** case.
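
The two cells above differ only in the game name, so the aggregation could be factored into a helper. This is a sketch under the assumption that the frame has the *Game*, *Year*, *Month*, *Tournaments* and *Earnings* columns used above; `monthly_summary` is a hypothetical name and the rows are toy values.

```python
import calendar
import pandas as pd

def monthly_summary(df, game):
    """Per (Year, Month): total tournaments and mean earnings for one game."""
    subset = df[df["Game"] == game]
    out = subset.groupby(["Year", "Month"], as_index=False).agg(
        Tournaments=("Tournaments", "sum"),
        Earnings=("Earnings", "mean"),
    )
    # Abbreviated month name for the box plot axis.
    out["Month name"] = out["Month"].map(lambda m: calendar.month_abbr[m])
    return out

# Toy rows, for illustration only.
toy = pd.DataFrame({
    "Game": ["Dota 2", "Dota 2", "Dota 2", "StarCraft II"],
    "Year": [2019, 2019, 2019, 2019],
    "Month": [8, 8, 9, 8],
    "Tournaments": [2, 3, 1, 7],
    "Earnings": [100.0, 300.0, 50.0, 10.0],
})
summary = monthly_summary(toy, "Dota 2")
print(summary)
```

The returned frame can be passed to `sns.boxplot` with `x="Month name"` in the same way as `esport_dota2` and `esport_sc2`.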


So far I have treated games individually, but now I want to consider them as series. For example, there are **FIFA 08** and **FIFA 12**, and I want to check how many tournaments the **FIFA** series has had over the years. So I will filter out the numbers, the brackets and some other signs using regex.

In [None]:
esport_NF = esport.copy()
# cut the name at the first number, bracket, + or ' sign, then at the first colon
esport_NF.Game = esport_NF.Game.str.split(r"\s[+0-9()']").str[0]
esport_NF.Game = esport_NF.Game.str.split('[:]').str[0]
# removing Roman numerals (\b keeps words such as "Instinct" intact)
esport_NF.replace(r"\s(?:III|II|IV|IX|VI|V|XIII|XII|XI|X|I)\b", '', regex=True, inplace=True)

# Data grouping and aggregation
esport_counts = esport_NF[["Game", "Tournaments", "ReleaseDate", "Earnings", "Genre"]].groupby("Game").agg({"Tournaments": np.sum, "ReleaseDate": np.mean, 
                                                                                                        "Earnings": np.sum, "Genre": lambda x:x.value_counts().index[0]})
esport_counts.sort_values("Tournaments", ascending=False, inplace=True)
esport_counts.reset_index(inplace=True)

esport_counts = esport_counts.nlargest(30, 'Tournaments')
fig, ax = plt.subplots(figsize=(20,20))
# newer seaborn versions require x and y as keyword arguments
sns.barplot(x=esport_counts.Tournaments, y=esport_counts.Game, ax=ax)
ax.xaxis.tick_top()
ax.xaxis.label.set_size(20)
ax.xaxis.set_tick_params(labelsize='x-large')
ax.yaxis.label.set_size(20)
ax.yaxis.set_tick_params(labelsize='x-large')
plt.show()

For clarity I have shown only the 30 entries with the most tournaments. The **StarCraft** franchise had the most tournaments, followed by **Counter-Strike**.

This is no surprise: both franchises have a long history and are incredibly popular. Most fighting games are problematic, as they tend to have more complex names which are hard to filter. Despite that, **Super Smash Bros** took third place among the franchises most played at tournaments.
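
The name normalisation can be tried on a few titles outside pandas. This is a sketch using the standard `re` module; `franchise_name` is a hypothetical helper, and the trailing `\b` keeps the numeral pattern from eating the first letter of a word such as *Instinct*.

```python
import re

# Standalone Roman numerals preceded by whitespace.
ROMAN = r"\s(?:III|II|IV|IX|VI|V|XIII|XII|XI|X|I)\b"

def franchise_name(title):
    # Cut at the first space followed by a digit, bracket, plus sign or apostrophe.
    name = re.split(r"\s[+0-9()']", title)[0]
    # Drop any subtitle after a colon.
    name = name.split(":")[0]
    # Remove a standalone Roman numeral.
    return re.sub(ROMAN, "", name)

titles = [
    "FIFA 08",
    "StarCraft II",
    "Counter-Strike: Global Offensive",
    "Killer Instinct",
    "Super Smash Bros. Melee",
]
for title in titles:
    print(title, "->", franchise_name(title))
```

Titles such as **Super Smash Bros. Melee** pass through unchanged, which illustrates why fighting games need extra handling.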

# How do the rewards of the 9 biggest games evolve through the years?

In [None]:
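# Assumption: game_money is not defined earlier in this notebook; here it is taken to be
# the games ranked by their total earnings.
game_money = esport.groupby("Game", as_index=False)["Earnings"].sum().sort_values("Earnings", ascending=False)
fig, axes = plt.subplots(3, 3, figsize=(40,16))
names_MRG = iter(game_money.Game.head(9))
for row in axes:
    for col in row:
        game_name = next(names_MRG)
        game_esport = esport[esport["Game"] == game_name]
        # average only the numeric columns; newer pandas raises on text columns
        game_esport = game_esport.groupby("Year").mean(numeric_only=True)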
        game_esport.reset_index(inplace=True)
        game_esport = game_esport[game_esport.Year != 2020] # 2020 has not ended
        col.scatter(game_esport.Year, game_esport.Earnings)
        col.plot(game_esport.Year, game_esport.Earnings)
        col.set_title(game_name, size=26)

2020 has not ended, so the earnings from this year are not included in the plots. **Dota 2** is thriving, and rewards for **Dota 2** competitions have been growing over the years. **Fortnite** and **PUBG** are relatively new, so there is not much data to show. **League of Legends** and **StarCraft** seem to have some drops in reward amounts in 2019.

Most striking is the fact that although **Fortnite** and **PUBG** are quite new games, they have made it into the top 6. What is more, **Fortnite** is third in terms of prizes while having spent the least time on the market of the considered titles.

**Overwatch** and **Hearthstone** show positive trends, whereas **Heroes of the Storm** rewards dropped to 0 in 2019.
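
The ranking and the yearly averages behind such plots could be derived along these lines. This is a sketch on toy rows (not dataset values), assuming the ranking is by total *Earnings* per game; `top_games` and `yearly` are hypothetical names.

```python
import pandas as pd

# Toy rows, for illustration only.
toy = pd.DataFrame({
    "Game": ["A", "A", "B", "B", "C", "C"],
    "Year": [2018, 2019, 2018, 2019, 2019, 2020],
    "Earnings": [10.0, 30.0, 5.0, 5.0, 100.0, 1.0],
})

# Rank games by total prize money and keep the top two.
top_games = toy.groupby("Game")["Earnings"].sum().nlargest(2).index

# Mean earnings per year for those games, leaving out the unfinished 2020.
complete = toy[(toy["Year"] != 2020) & toy["Game"].isin(top_games)]
yearly = complete.pivot_table(index="Year", columns="Game", values="Earnings", aggfunc="mean")

print(yearly)
```

Each column of `yearly` then corresponds to one subplot, and a missing value marks a year in which the game had no recorded earnings.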


# Which genre dominates the tournaments?

In [None]:
esport_domination = esport.groupby("Genre").agg({"Tournaments": np.sum})
esport_domination.reset_index(inplace=True)
plt.figure(figsize=(26,6))
ax = sns.barplot(x="Genre", y="Tournaments", data = esport_domination)
ax.xaxis.label.set_size(20)
ax.xaxis.set_tick_params(rotation=90)
ax.xaxis.set_tick_params(labelsize='x-large')
ax.yaxis.label.set_size(20)
ax.yaxis.set_tick_params(labelsize='x-large')
plt.show()

First-Person Shooter games clearly dominate the tournaments. Right after them are fighting games, which is interesting because only the Smash Bros franchise has appeared in the analysis so far.
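
To put numbers on the bars, the counts can be turned into shares. This is a sketch on toy values (illustrative only, not dataset figures).

```python
import pandas as pd

# Toy rows, for illustration only.
toy = pd.DataFrame({
    "Genre": ["First-Person Shooter", "Fighting Game", "First-Person Shooter", "Strategy"],
    "Tournaments": [50, 30, 10, 10],
})

# Total tournaments per genre, largest first, expressed as a fraction of all tournaments.
share = (
    toy.groupby("Genre")["Tournaments"].sum()
       .sort_values(ascending=False)
       .pipe(lambda s: s / s.sum())
)
print(share)
```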

# Which fighting game and franchise had the most tournaments? Which usually offers the biggest prizes?

## Number of tournaments compared with other fighting games

In [None]:
# filter fighting games only and find its share in Tournaments
esport_fight = esport[esport["Genre"] == "Fighting Game"]
esport_fight = esport_fight[["Game", "Tournaments"]].groupby(by = "Game").sum()
esport_fight["Share"] = esport_fight["Tournaments"] / esport_fight["Tournaments"].sum()
# def figure
plt.figure(figsize=(20,5))
# define cmap
cmap = plt.get_cmap("Greens")
cmap = iter(cmap([i/10 for i in range(10)]))
# find biggest prized games - give them color
cols = []
labs = []
mostvalue = esport_fight.nlargest(10, 'Tournaments')
for i, gamename in enumerate(esport_fight.index):
    if gamename in mostvalue.index:
        cols.append(next(cmap))
        labs.append(gamename)
    else:
        cols.append("gray")
        labs.append("")
patches, texts = plt.pie(esport_fight.Tournaments, labels=labs, colors=cols, radius=2, textprops={'fontsize': 12})
plt.show()

For clarity I have coloured the 10 games with the largest number of tournaments. It is clear that **Super Smash Bros. Melee** dominates in terms of the number of competitions. Fighting game franchises tend to have many releases in their history. Games released within a franchise **usually** contain similar characters with similar move sets. So now I am going to base my analysis on whole franchises.

In [None]:
esport_fight_NF = esport_fight.copy()
esport_fight_NF.reset_index(inplace=True)
# cut the name at the first number, bracket, + or ' sign, then at the first colon
esport_fight_NF.Game = esport_fight_NF.Game.str.split(r"\s[+0-9()']").str[0]
esport_fight_NF.Game = esport_fight_NF.Game.str.split('[:]').str[0]
esport_fight_NF.Game = esport_fight_NF.Game.str.split(r'\sXX').str[0]
# removing Roman numerals and the "Xrd" suffix (\b keeps words such as "Instinct" intact)
esport_fight_NF.replace(r"\s(?:III|II|IV|IX|VI|V|XIII|XII|XI|XX|Xrd|X|I)\b", '', regex=True, inplace=True)
#esport_fight_NF = esport_fight_NF.Game.str.findall("Street\sFighter|Soul\sCalibur|Super\sSmash\sBros|Guilty\sGear")
expresion = r"Street\sFighter|Soul\sCalibur|Super\sSmash\sBros|Tekken|Dragon\sBall|Dead\sor\sAlive|Guilty\sGear"
# collapse known franchises to their common name; .loc avoids chained assignment
is_franchise = esport_fight_NF.Game.str.contains(expresion)
esport_fight_NF.loc[is_franchise, "Game"] = esport_fight_NF.Game.str.findall(expresion).str[0]
esport_fight_NF = esport_fight_NF.groupby("Game").sum()

plt.figure(figsize=(20,5))
# define cmap
cmap = plt.get_cmap("Greens")
cmap = iter(cmap([i/10 for i in range(10)]))
# find biggest prized games - give them color
cols = []
labs = []
mostvalue = esport_fight_NF.nlargest(10, 'Tournaments')
for i, gamename in enumerate(esport_fight_NF.index):
    if gamename in mostvalue.index:
        cols.append(next(cmap))
        labs.append(gamename)
    else:
        cols.append("gray")
        labs.append("")
patches, texts = plt.pie(esport_fight_NF.Tournaments, labels=labs, colors=cols, radius=2, textprops={'fontsize': 12})
#plt.legend(patches, esport_fight_NF.index, loc="upper right", bbox_to_anchor=(1,1),bbox_transform=plt.gcf().transFigure)
plt.show()

The **Super Smash Bros.** franchise clearly has the most tournaments of them all.
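
The pie-chart cells repeat the same labelling loop, which could be factored into a helper. This is a sketch in which `top_n_labels_and_colors` is a hypothetical name, the series is toy data, and `palette` must hold at least `n` colours.

```python
import pandas as pd

def top_n_labels_and_colors(values, n, palette, other_color="gray"):
    """Label and colour the n largest entries; grey out and unlabel the rest."""
    top = set(values.nlargest(n).index)
    shades = iter(palette)
    labels, colors = [], []
    for name in values.index:
        if name in top:
            labels.append(name)
            colors.append(next(shades))
        else:
            labels.append("")
            colors.append(other_color)
    return labels, colors

# Toy series, for illustration only.
toy = pd.Series({"A": 5, "B": 50, "C": 1, "D": 20})
labels, colors = top_n_labels_and_colors(toy, 2, ["lightgreen", "darkgreen"])
print(labels, colors)
```

The result can be passed to `plt.pie(values, labels=labels, colors=colors)`; with a matplotlib colormap, `palette` could be `cmap([i/10 for i in range(10)])` as in the cells above.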

## Tournament prizes compared with other fighting games

In [None]:
# filter fighting games only and find its share Earnings
esport_fight = esport[esport["Genre"] == "Fighting Game"]
esport_fight = esport_fight[["Game", "Earnings"]].groupby(by = "Game").mean()
esport_fight["Share"] = esport_fight["Earnings"] / esport_fight["Earnings"].sum()
# def figure
plt.figure(figsize=(20,5))
# define cmap
cmap = plt.get_cmap("Reds")
cmap = iter(cmap([i/10 for i in range(10)]))
# find biggest prized games - give them color
cols = []
labs = []
mostvalue = esport_fight.nlargest(10, 'Earnings')
for i, gamename in enumerate(esport_fight.index):
    if gamename in mostvalue.index:
        cols.append(next(cmap))
        labs.append(gamename)
    else:
        cols.append("gray")
        labs.append("")
patches, texts = plt.pie(esport_fight.Earnings, labels=labs, colors=cols, radius=2, textprops={'fontsize': 12})
#plt.legend(patches, esport_fight.index, loc="upper right", bbox_to_anchor=(1,1),bbox_transform=plt.gcf().transFigure)
plt.show()

There are too many games to draw clear conclusions. For clarity I have coloured the 10 games with the highest earnings.

In [None]:
esport_fight_NF = esport_fight.copy()
esport_fight_NF.reset_index(inplace=True)
# cut the name at the first number, bracket, + or ' sign, then at the first colon
esport_fight_NF.Game = esport_fight_NF.Game.str.split(r"\s[+0-9()']").str[0]
esport_fight_NF.Game = esport_fight_NF.Game.str.split('[:]').str[0]
esport_fight_NF.Game = esport_fight_NF.Game.str.split(r'\sXX').str[0]
# removing Roman numerals and the "Xrd" suffix (\b keeps words such as "Instinct" intact)
esport_fight_NF.replace(r"\s(?:III|II|IV|IX|VI|V|XIII|XII|XI|XX|Xrd|X|I)\b", '', regex=True, inplace=True)
#esport_fight_NF = esport_fight_NF.Game.str.findall("Street\sFighter|Soul\sCalibur|Super\sSmash\sBros|Guilty\sGear")
expresion = r"Street\sFighter|Soul\sCalibur|Super\sSmash\sBros|Tekken|Dragon\sBall|Dead\sor\sAlive|Guilty\sGear"
# collapse known franchises to their common name; .loc avoids chained assignment
is_franchise = esport_fight_NF.Game.str.contains(expresion)
esport_fight_NF.loc[is_franchise, "Game"] = esport_fight_NF.Game.str.findall(expresion).str[0]
esport_fight_NF = esport_fight_NF.groupby("Game").sum()

plt.figure(figsize=(20,5))
# define cmap
cmap = plt.get_cmap("Reds")
cmap = iter(cmap([i/10 for i in range(10)]))
# find biggest prized games - give them color
cols = []
labs = []
mostvalue = esport_fight_NF.nlargest(10, 'Earnings')
for i, gamename in enumerate(esport_fight_NF.index):
    if gamename in mostvalue.index:
        cols.append(next(cmap))
        labs.append(gamename)
    else:
        cols.append("gray")
        labs.append("")
patches, texts = plt.pie(esport_fight_NF.Earnings, labels=labs, colors=cols, radius=2, textprops={'fontsize': 12})
#plt.legend(patches, esport_fight_NF.index, loc="upper right", bbox_to_anchor=(1,1),bbox_transform=plt.gcf().transFigure)
plt.show()

**Soul Calibur** players usually receive the largest rewards, followed by **Street Fighter** and **Super Smash Bros.**

## <center> Work in progress, more content is yet to come :) </center>

 <center> <span style="font-size:larger;"> I hope you have enjoyed reading this notebook, </span> </center>
    <center> <span style="font-size:larger;"> feel free to give feedback in any form :) </span> </center>