This project is a modified version of the original notebook at [https://www.kaggle.com/slehkyi/football-why-winners-win-and-losers-lose](https://www.kaggle.com/slehkyi/football-why-winners-win-and-losers-lose).
For data visualization I have used the matplotlib and seaborn libraries.

# Exploring 5 Years of European Football

Intro
In this notebook we will explore modern football metrics (xG, xGA and xPTS) and their influence on sports analytics.

Expected Goals (xG) - measures the quality of a shot based on several variables such as assist type, shot angle and distance from goal, whether it was a headed shot and whether it was defined as a big chance.

Expected Assists (xA) - measures the likelihood that a given pass will become a goal assist. It considers several factors including the type of pass, pass end-point and length of the pass.

Expected Points (xPTS) - measures how likely a given game is to bring points to the team.

These metrics let us look much deeper into football statistics, understand the performance of players and teams, and see the roles that luck and skill play in it. Disclaimer: both are important.
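
To make the xPTS idea concrete: under the usual 3-1-0 scoring, the expected points from a single match are 3 × P(win) + 1 × P(draw). The sketch below assumes the win and draw probabilities are already known (in practice they would be derived from the chances both teams created). The function name and the example probabilities are made up for illustration, and this is not necessarily how understat.com computes its figure.

```python
def expected_points(p_win, p_draw):
    """Expected points from one match: 3 for a win, 1 for a draw, 0 for a loss."""
    if not (0 <= p_win <= 1 and 0 <= p_draw <= 1 and p_win + p_draw <= 1 + 1e-9):
        raise ValueError("probabilities must lie in [0, 1] and sum to at most 1")
    return 3 * p_win + 1 * p_draw

# Hypothetical (P(win), P(draw)) pairs for three matches of one team
matches = [(0.60, 0.25), (0.30, 0.30), (0.10, 0.20)]

# Season xPTS is simply the sum of the per-match expectations
season_xpts = sum(expected_points(w, d) for w, d in matches)
print(round(season_xpts, 2))
```

A team that collects more actual points than this sum over a season has outperformed its expectation, which is the gap the rest of the notebook looks at.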

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('../input/extended-football-stats-for-european-leagues-xg/understat.com.csv')
df = df.rename(index=int, columns={'Unnamed: 0': 'league', 'Unnamed: 1': 'year'}) 
df.head()

Standard parameters: position, team, number of matches played, wins, draws, losses, goals scored, goals conceded (the `missed` column), points.

Additional metrics:

xG - expected goals, a statistical measure of the quality of chances created and conceded. More at understat.com

xG_diff - difference between actual goals scored and expected goals.

npxG - expected goals without penalties and own goals.

xGA - expected goals against.

xGA_diff - difference between actual goals conceded and expected goals against.

npxGA - expected goals against without penalties and own goals.

npxGD - difference between "for" and "against" expected goals without penalties and own goals.

ppda_coef - passes allowed per defensive action in the opposition half (power of pressure)

oppda_coef - opponent passes allowed per defensive action in the opposition half (power of opponent's pressure)

deep - passes completed within an estimated 20 yards of goal (crosses excluded)

deep_allowed - opponent passes completed within an estimated 20 yards of goal (crosses excluded)

xpts - expected points

xpts_diff - difference between actual and expected points
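
The definitions above do not say in which order the `*_diff` columns subtract actual from expected, and that order decides whether a negative value means over- or under-performance. A minimal sketch of how one might check this against the data is shown below; the helper name `diff_convention` and the toy numbers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def diff_convention(frame, actual, expected, diff, tol=1e-6):
    """Report which subtraction order reproduces the stored `diff` column."""
    if np.allclose(frame[expected] - frame[actual], frame[diff], atol=tol):
        return 'expected - actual'
    if np.allclose(frame[actual] - frame[expected], frame[diff], atol=tol):
        return 'actual - expected'
    return 'neither'

# Toy rows (made-up numbers) where the diff is stored as expected - actual
toy = pd.DataFrame({'scored': [80, 40],
                    'xG': [72.5, 44.0],
                    'xG_diff': [-7.5, 4.0]})
print(diff_convention(toy, 'scored', 'xG', 'xG_diff'))
```

Once the data is loaded, the same call can be made with the real frame, for example `diff_convention(df, 'scored', 'xG', 'xG_diff')`, and likewise for the `xGA_diff` and `xpts_diff` columns. If the stored values are rounded, the tolerance may need to be loosened.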

In [None]:
# Number of rows (team-seasons) per league
df['league'].value_counts()

In the next visualization we will check which teams from each league finished in the top 4 during the last 5 years. It can tell us something about how stable the top teams are in different countries.

In [None]:
# One subplot per league, showing the points of every top-4 finisher by season
leagues_to_plot = ['Bundesliga', 'EPL', 'La_liga', 'Serie_A', 'Ligue_1', 'RFPL']
f = plt.figure(figsize=(25,12))
for i, league in enumerate(leagues_to_plot, start=1):
    ax = f.add_subplot(3,2,i)
    top4 = df[(df['league'] == league) & (df['position'] <= 4)]
    sns.barplot(x='team', y='pts', hue='year', data=top4, ax=ax)
    ax.set_title(league)
    # Rotate the team names so that they do not overlap
    ax.tick_params(axis='x', labelrotation=45)

As these bar charts show, some teams finished in the top 4 only once in the last 5 years. Since that is uncommon, digging deeper may reveal that luck played in these teams' favour. It's just a theory, so let's look more closely at those outliers.

The teams that were in the top 4 only once during the last 5 seasons are:

- Wolfsburg (2014) and Schalke 04 (2017) from Bundesliga
- Leicester (2015) from EPL
- Villareal (2015) and Sevilla (2016) from La Liga
- Lazio (2014) and Fiorentina (2014) from Serie A
- Lille (2018) and Saint-Etienne (2018) from Ligue 1
- FC Rostov (2015) and Dinamo Moscow (2014) from RFPL

Let's save these teams.

In [None]:
outlier_teams = ['Wolfsburg', 'Schalke 04', 'Leicester', 'Villareal', 'Sevilla', 'Lazio',
                 'Fiorentina', 'Lille', 'Saint-Etienne', 'FC Rostov', 'Dinamo Moscow']

# Keep only the columns needed for our analysis
df_xg = df[['league', 'year', 'position', 'team', 'scored', 'xG', 'xG_diff', 'missed',
            'xGA', 'xGA_diff', 'pts', 'xpts', 'xpts_diff']]
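
The list of one-time top-4 teams was read off the bar charts by eye. As a cross-check, the same question can be put to the data directly. The sketch below is one possible way to do it; the helper name `one_time_top` and the toy frame are illustrative assumptions, and on the real data its output may not match the hand-picked list exactly.

```python
import pandas as pd

def one_time_top(frame, cutoff=4):
    """Rows for teams that finished in the top `cutoff` of their league exactly once."""
    top = frame[frame['position'] <= cutoff]
    # Number of top finishes per (league, team), aligned with the rows of `top`
    counts = top.groupby(['league', 'team'])['year'].transform('count')
    return top.loc[counts == 1, ['league', 'team', 'year']].reset_index(drop=True)

# Toy data: two seasons of a three-team league, using a top-2 cutoff
toy = pd.DataFrame({
    'league': ['EPL'] * 6,
    'year': [2014, 2014, 2014, 2015, 2015, 2015],
    'team': ['A', 'B', 'C', 'A', 'C', 'B'],
    'position': [1, 2, 5, 1, 2, 5],
})
print(one_time_top(toy, cutoff=2))
```

With the notebook's data this would be called as `one_time_top(df)`, since `df` already has the `league`, `team`, `year` and `position` columns.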

In [None]:
# Checking whether finishing first requires phenomenal execution
first_place = df_xg[df_xg['position'] == 1]

# Get list of leagues
leagues = df['league'].drop_duplicates()
leagues = leagues.tolist()

# Get list of years
years = df['year'].drop_duplicates()
years = years.tolist()

# Bundesliga

What goes into a title-winning season? We start with the German league.



In [None]:
bu=first_place[first_place['league']=='Bundesliga']
bu

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(bu["year"],bu["pts"],label='Points')
ax.bar(bu["year"],bu["xpts"],label='expected Points',alpha=0.8)
ax.legend()
ax.set_xlabel('year')
ax.set_ylabel('points')
ax.set_title('Comparing Actual and Expected Points for Winner Team in Bundesliga')
plt.show()


In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(bu["year"],bu["missed"],label='Actual goals conceded')
ax.bar(bu["year"],bu["xGA"],label='Expected goals against',alpha=0.8)
ax.legend()
ax.set_xlabel('year')
ax.set_ylabel('goals')
ax.set_title('Comparing actual goals conceded and expected goals against for Winner Team in Bundesliga')
plt.show()

From the table and bar charts we see that Bayern earned more points than expected every year: they scored more than expected and conceded fewer than expected. The exception is 2018, which did not stop them from winning the title but hints that Bayern played worse that season, although their competitors failed to take advantage.
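
The over-performance described above can also be put into numbers rather than read off the charts. The following is a small sketch that subtracts expected from actual values explicitly, so it does not depend on how the dataset's own `*_diff` columns are signed. The helper name `champion_surplus` and the toy figures are made up for illustration.

```python
import pandas as pd

def champion_surplus(champions):
    """Actual minus expected, per season, for points, goals scored and goals conceded."""
    out = champions[['year', 'team']].copy()
    out['pts_surplus'] = champions['pts'] - champions['xpts']
    out['goals_surplus'] = champions['scored'] - champions['xG']
    # Negative here means the team conceded fewer goals than expected
    out['conceded_surplus'] = champions['missed'] - champions['xGA']
    return out.sort_values('year').reset_index(drop=True)

# Toy champion rows (invented numbers, not taken from the dataset)
toy = pd.DataFrame({
    'year': [2018, 2017], 'team': ['X', 'X'],
    'pts': [78, 84], 'xpts': [75.0, 80.5],
    'scored': [88, 92], 'xG': [85.0, 80.0],
    'missed': [32, 28], 'xGA': [30.0, 31.5],
})
print(champion_surplus(toy))
```

In the notebook, `champion_surplus(bu)` would give the Bundesliga champions' figures, and the same call works for any frame with these columns.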

In [None]:
# and from this table we see that Bayern dominates here totally, even when they do not play well
bu2=df_xg[(df_xg['position'] <= 2)&(df_xg['league']=='Bundesliga')].sort_values(by=['year','pts'], ascending=False)
bu2

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(x='year',y='pts',hue='team',data=bu2)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)


# La Liga

In [None]:
la=first_place[first_place['league']=='La_liga']
la

In [None]:
def points(df,league):
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(df["year"],df["pts"],label='Points')
    ax.bar(df["year"],df["xpts"],label='expected Points',alpha=0.8)
    ax.legend()
    ax.set_xlabel('year')
    ax.set_ylabel('points')
    ax.set_title('Comparing Actual and Expected Points for Winner Team in '+league)
    plt.show()

In [None]:
points(la,'La liga')

In [None]:
# comparing with runner-up
la2=df_xg[(df_xg['position'] <= 2) & (df_xg['league'] == 'La_liga')].sort_values(by=['year','xpts'], ascending=False)
la2

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(x='year',y='pts',hue='team',data=la2)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

# EPL

In [None]:
ep=first_place[first_place['league'] == 'EPL']
ep

In [None]:
points(ep,'EPL')

In the EPL we see a clear trend: "To win you have to be better than the statistics". An interesting case is Leicester's title in 2015: they got 12 points more than expected, while Arsenal got 6 points fewer than expected! This is why we love football: such inexplicable things happen. I am not saying it was all luck, but luck played its role here.

Another interesting case is Manchester City in 2018: they are super stable! They scored just one goal more than expected, conceded 2 fewer and got 7 additional points, while Liverpool fought really well and had a bit more luck on their side, but couldn't win the title despite finishing 13 points ahead of their expected total.

Pep is putting the finishing touches on a machine of destruction. Man City create and convert their chances through skill rather than luck, which makes them very dangerous for the next season.

In [None]:
# comparing with runners-up
ep2=df_xg[(df_xg['position'] <= 2) & (df_xg['league'] == 'EPL')].sort_values(by=['year','xpts'], ascending=False)
ep2

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(x='year',y='pts',hue='team',data=ep2)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

# Ligue 1

In [None]:
li=first_place[first_place['league'] == 'Ligue_1']
li

In [None]:
points(li,'Ligue_1')

In French Ligue 1 we continue to see the trend "to win you have to execute at 110%, because 100% is not enough". Here Paris Saint-Germain dominates totally. The only outlier is 2016, when Monaco scored 30 goals more than expected and got almost 17 points more than expected! Luck? A fair share of it. PSG was good that year, but Monaco was extraordinary. Again, we cannot claim it was pure luck or pure skill, but rather a perfect combination of both at the right place and time.

In [None]:
# comparing with runners-up
li2=df_xg[(df_xg['position'] <= 2) & (df_xg['league'] == 'Ligue_1')].sort_values(by=['year','xpts'], ascending=False)
li2

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(x='year',y='pts',hue='team',data=li2)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

# Serie A

In [None]:
se=first_place[first_place['league'] == 'Serie_A']
se

In [None]:
points(se,'Serie_A')

In Italian Serie A, Juventus has dominated for 8 years in a row, although without any major success in the Champions League. From this chart and these numbers I think we can see that Juve lacks strong enough competition inside the country and collects a lot of "lucky" points, which again stems from multiple factors. Napoli outperformed Juventus on xPTS twice, but this is real life: in 2017, for example, Juve was on fire and scored 26 goals more than expected (creating goals out of nowhere), while Napoli conceded 3 more than expected (due to goalkeeping errors, or perhaps the excellence of an opponent in 1 or 2 particular matches). As with La Liga in the season Real Madrid became champion, I am sure we could find 1 or 2 games that were key that year.

Details matter in football. You see, one error here, one woodwork there and you've lost the title.
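
Seasons in which the runner-up had more expected points than the champion, as with Napoli and Juventus, can be listed programmatically. The sketch below is one way to do it, assuming a frame with the `league`, `year`, `position`, `team` and `xpts` columns used in this notebook; the helper name `xpts_upsets` and the toy rows are illustrative only.

```python
import pandas as pd

def xpts_upsets(frame, league):
    """Seasons in `league` where the runner-up had more expected points than the champion."""
    in_league = frame[frame['league'] == league]
    cols = ['year', 'team', 'xpts']
    champions = in_league.loc[in_league['position'] == 1, cols]
    runners_up = in_league.loc[in_league['position'] == 2, cols]
    # One row per season with champion and runner-up side by side
    both = champions.merge(runners_up, on='year', suffixes=('_champion', '_runner_up'))
    upsets = both[both['xpts_runner_up'] > both['xpts_champion']]
    return upsets.sort_values('year').reset_index(drop=True)

# Toy data (invented numbers): the runner-up leads on xpts in 2017 only
toy = pd.DataFrame({
    'league': ['Serie_A'] * 4,
    'year': [2016, 2016, 2017, 2017],
    'position': [1, 2, 1, 2],
    'team': ['J', 'N', 'J', 'N'],
    'xpts': [80.0, 75.0, 78.0, 82.0],
})
print(xpts_upsets(toy, 'Serie_A'))
```

Called as `xpts_upsets(df_xg, 'Serie_A')`, this would return the seasons that the paragraph above refers to, if the data supports that reading.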

In [None]:
# comparing with runners-up
se2=df_xg[(df_xg['position'] <= 2) & (df_xg['league'] == 'Serie_A')].sort_values(by=['year','xpts'], ascending=False)
se2

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(x='year',y='pts',hue='team',data=se2)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

# RFPL

In [None]:
rf=first_place[first_place['league'] == 'RFPL']
rf

In [None]:
points(rf,'RFPL')

I do not follow the Russian Premier League, so looking coldly at the data we see the same pattern of scoring more than you deserve, plus an interesting situation with CSKA Moscow from 2015 to 2017. During these years they were good, but converted their advantage into a title only once. In the other two seasons the rule held: if you do not convert, you get punished, or your main competitor simply converts better.

There is no justice in football :D. That said, I believe VAR will make the numbers more stable in the coming seasons, because refereeing errors are one of the sources of those additional goals and points.

In [None]:
# comparing with runners-up
# (named rf2 so that the EPL frame ep2 is not overwritten)
rf2=df_xg[(df_xg['position'] <= 2) & (df_xg['league'] == 'RFPL')].sort_values(by=['year','xpts'], ascending=False)
rf2

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(x='year',y='pts',hue='team',data=rf2)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

# Statistical Overview
As there are 6 leagues with different teams and stats, I decided to focus on one first to test different approaches and then replicate the final analysis on the other 5. As I mostly watch the EPL and know the most about it, I will start with this competition.

In [None]:
# Creating separate DataFrames per each league
EPL = df_xg[df_xg['league'] == 'EPL']
print(EPL)
EPL.describe()


In [None]:
def print_records_antirecords(df):
    print('Presenting some records and antirecords: \n')
    stats = df.describe()  # computed once instead of on every lookup
    for col in stats.columns:
        if col not in ['index', 'year', 'position']:
            val_min = stats.loc['min', col]
            val_max = stats.loc['max', col]
            # First row that attains the minimum / maximum of this column
            row_min = df[df[col] == val_min].iloc[0]
            row_max = df[df[col] == val_max].iloc[0]
            print('The lowest value of {0} belongs to {1} in {2}: {3:.2f}'.format(col.upper(), row_min['team'], row_min['year'], val_min))
            print('The highest value of {0} belongs to {1} in {2}: {3:.2f}'.format(col.upper(), row_max['team'], row_max['year'], val_max))
            print('='*100)

In [None]:
# replace EPL with any league you want
print_records_antirecords(EPL)

In [None]:
#sns.set_palette(['blue','red','green','yellow','purple'])
hue_colors = {2018:'b',2017:'g',2016:'r',2015:'c',2014:'m'}
g=sns.relplot(x='position',y='xG_diff',hue='year',data=EPL,kind='line',palette=hue_colors,
            height=6,aspect=3)
g.fig.suptitle('Comparing xG gap between positions',fontsize=20)

plt.show()

In [None]:
#sns.set_palette(['blue','red','green','yellow','purple'])
hue_colors = {2018:'b',2017:'g',2016:'r',2015:'c',2014:'m'}
g=sns.relplot(x='position',y='xGA_diff',hue='year',data=EPL,kind='line',palette=hue_colors,
            height=6,aspect=3)
g.fig.suptitle('Comparing xGA gap between positions',fontsize=20)

plt.show()

In [None]:
#sns.set_palette(['blue','red','green','yellow','purple'])
hue_colors = {2018:'b',2017:'g',2016:'r',2015:'c',2014:'m'}
g=sns.relplot(x='position',y='xpts_diff',hue='year',data=EPL,kind='line',palette=hue_colors,
            height=6,aspect=3)
g.fig.suptitle('Comparing xPTS gap between positions',fontsize=20)

plt.show()

From the charts above we can clearly see that top teams score more, concede less and get more points than expected. That's why these teams are top teams. The situation is the exact opposite for the teams at the bottom, while mid-table teams perform about average. Totally logical, no huge insights here.
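
The visual impression from the line charts can be summarised in a small table by averaging the difference columns over top, middle and bottom finishers. This is only a sketch: the band sizes (`top=4`, `bottom=3`) and the helper name `diff_by_band` are assumptions chosen for illustration, and the sign of the averages has to be read according to the dataset's own convention for the `*_diff` columns.

```python
import pandas as pd

def diff_by_band(frame, top=4, bottom=3):
    """Average the *_diff columns for top, middle and bottom finishers."""
    # League size per season, so that "bottom" adapts to 18- or 20-team leagues
    n_teams = frame.groupby('year')['position'].transform('max')
    band = pd.Series('middle', index=frame.index)
    band[frame['position'] <= top] = 'top'
    band[frame['position'] > n_teams - bottom] = 'bottom'
    cols = ['xG_diff', 'xGA_diff', 'xpts_diff']
    return frame.groupby(band)[cols].mean()

# Toy single-season league of six teams (invented numbers)
toy = pd.DataFrame({
    'year': [2018] * 6,
    'position': [1, 2, 3, 4, 5, 6],
    'xG_diff': [-6.0, -4.0, 0.0, 1.0, 3.0, 5.0],
    'xGA_diff': [4.0, 2.0, 0.0, 0.0, -2.0, -4.0],
    'xpts_diff': [-8.0, -6.0, 1.0, -1.0, 5.0, 7.0],
})
print(diff_by_band(toy, top=2, bottom=2))
```

For the notebook's data the call would be `diff_by_band(EPL)`.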

In [None]:
# Check mean differences
def league_mean(df):
    m=df.groupby('year')[['xG_diff', 'xGA_diff', 'xpts_diff']].mean()
    return m 

league_mean(EPL)

In [None]:
# Check median differences
def league_median(df):
    me=df.groupby('year')[['xG_diff', 'xGA_diff', 'xpts_diff']].median()
    return me 

league_median(EPL)
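
Since the plan is to replicate the EPL analysis on the other five leagues, the per-year averages can be computed for all leagues in one pass by grouping on both `league` and `year`. A minimal sketch, assuming the same column names as `df_xg`; the helper name `mean_diff_by_league` and the toy rows are illustrative.

```python
import pandas as pd

def mean_diff_by_league(frame):
    """Mean of the *_diff columns for every (league, year) pair."""
    cols = ['xG_diff', 'xGA_diff', 'xpts_diff']
    return frame.groupby(['league', 'year'])[cols].mean()

# Toy data: two leagues, one season, two teams each (invented numbers)
toy = pd.DataFrame({
    'league': ['EPL', 'EPL', 'La_liga', 'La_liga'],
    'year': [2018, 2018, 2018, 2018],
    'xG_diff': [-2.0, 4.0, 1.0, 3.0],
    'xGA_diff': [0.0, 2.0, -1.0, 1.0],
    'xpts_diff': [-3.0, 1.0, 2.0, 2.0],
})
summary = mean_diff_by_league(toy)
print(summary)
```

On the full data, `mean_diff_by_league(df_xg).loc['La_liga']` would give the same kind of table as `league_mean(EPL)`, but for La Liga. Replacing `.mean()` with `.median()` gives the median version.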