## Exploring the 2019-2020 season of the English Premier League

The dataset contains a range of per-game statistics, with one row per team per game:

* xG, xGA: Expected goals for the team and its opponent
* scored, missed: Goals scored and conceded
* xpts, pts: Expected and received points
* wins, draws, losses: Binary variables showing the result of the game
* tot_goal, tot_con: Total goals scored and conceded from the beginning of the season

There are also basic stats such as shots, shots on target, corner kicks, yellow cards, and red cards. We also have information about the date and time of the games.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')

%matplotlib inline

In [None]:
df_epl = pd.read_csv("../input/epl-stats-20192020/epl2020.csv")

print(df_epl.shape)

To be able to display all columns, we need to adjust the display.max_columns setting:

In [None]:
pd.set_option("display.max_columns",45)

df_epl.head()

In [None]:
# Drop the redundant index column that was saved with the CSV
df_epl.drop(['Unnamed: 0'], axis=1, inplace=True)

# Reset the index
df_epl = df_epl.reset_index(drop=True)

In [None]:
df_epl.columns

In [None]:
df_epl.matchDay.value_counts()

Most of the games are played on Saturdays.

We can quickly build a standings table from the total number of points collected so far. Because tot_points is cumulative, its maximum value for each team is the most up-to-date total:

In [None]:
df_epl[['teamId','tot_points']].groupby('teamId').max().sort_values(by='tot_points', ascending=False)[:10]

Only the top 10 teams are displayed. If you are a football (i.e. soccer) fan, you may have heard of Liverpool's dominance in the English Premier League this season. Liverpool leads by 25 points.
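As a cross-check on the cumulative column, the same standings can be rebuilt by summing the per-game points. The sketch below uses a small made-up frame (the team names and values are hypothetical) and assumes pts holds the points earned in each game, as described in the column list above.

```python
import pandas as pd

# Toy rows in the same shape as the dataset: one row per team per game.
# Team names and values are invented for illustration only.
toy = pd.DataFrame({
    "teamId": ["A", "B", "A", "B", "C", "C"],
    "pts":    [3, 0, 1, 1, 3, 3],
})

# Summing per-game points gives each team's total so far.
standings = (
    toy.groupby("teamId")["pts"]
       .sum()
       .sort_values(ascending=False)
)
print(standings)
```

If the totals computed this way differ from the maximum of tot_points on the real data, that would be worth investigating before trusting either column.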

### Expected vs Actual Goals and Points

Advances in technology and data science have introduced new stats to football. One relatively new family is “expected” stats, such as expected goals and expected points. Let’s check how close the expected and actual values are. There are different ways to make the comparison; one is to check the distribution of the difference:

In [None]:
plt.figure(figsize=(10,6))
plt.title("Expected vs Actual Goals - Distribution of Difference", fontsize=18)

diff_goal = df_epl.xG - df_epl.scored

# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
sns.kdeplot(diff_goal, color='blue')

The distribution is roughly bell-shaped with a mean close to zero, so expected goals are close to actual goals on average. There are, of course, exceptions, and these exceptions are what make football exciting.
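A plot gives the shape, but a few summary numbers make "close to zero" easier to judge. The following is a minimal sketch on made-up values (not taken from the dataset) that computes the mean, spread, and range of the difference between expected and actual goals.

```python
import pandas as pd

# Invented xG and goal values, for illustration only.
toy = pd.DataFrame({
    "xG":     [1.2, 0.8, 2.5, 0.4],
    "scored": [1, 1, 2, 1],
})

# Positive values: the team scored fewer goals than expected.
diff = toy["xG"] - toy["scored"]

# Mean near zero means xG is unbiased on average; std shows the typical miss.
summary = diff.agg(["mean", "std", "min", "max"])
print(summary)
```

On the real data, the same call on df_epl.xG - df_epl.scored would show how large the typical gap is, not just that it averages out.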

I do not know exactly how expected goals are calculated, but they should be related to shots and shot accuracy. We can check the correlation between expected goals (xG) and some other stats using the corr function of pandas.

In [None]:
df_epl[df_epl.h_a == 'h'][['xG','HS.x','HST.x','HtrgPerc','tot_goal']].corr()

In [None]:
df_epl[df_epl.h_a == 'a'][['xG','AS.x','AST.x','AtrgPerc','tot_goal']].corr()

Shots and shots on target are definitely correlated with expected goals. There is also a weak positive correlation between expected goals and the number of goals a team has scored so far in the season.

We can also get an idea of goalkeeper performance by comparing expected and actual goals. If a team concedes fewer goals than the opponent's expected goals, it indicates that the goalkeeper is performing well. On the other hand, if a team concedes more goals than expected, the goalkeeper's performance is not so good. The ratio of goals conceded to xGA captures this: values above 1 mean more goals conceded than expected.

In [None]:
df_epl['keep_performance'] = df_epl['missed'] / df_epl['xGA']
df_epl[['teamId','keep_performance']].groupby('teamId').mean().sort_values(by='keep_performance', ascending=False)

On average, Man City concedes 2.22 times as many goals as expected in a game, which suggests poor goalkeeper performance. The blame is not only on the keeper, though: the defensive players share the responsibility.

On the other hand, Newcastle United and Leicester show outstanding goalkeeper performance.
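One caveat about averaging per-game ratios: a game with a very low xGA can produce a huge ratio and dominate the mean. A possible alternative, sketched here on made-up numbers, is to divide season totals instead. The column names season_ratio and mean_game_ratio are hypothetical and only used for this comparison.

```python
import pandas as pd

# Invented values: team A concedes 2 goals in a game where xGA was only 0.5.
toy = pd.DataFrame({
    "teamId": ["A", "A", "B", "B"],
    "missed": [2, 0, 1, 1],
    "xGA":    [0.5, 1.5, 1.0, 1.0],
})

# Ratio of season totals: total goals conceded over total expected goals against.
totals = toy.groupby("teamId")[["missed", "xGA"]].sum()
totals["season_ratio"] = totals["missed"] / totals["xGA"]

# Mean of per-game ratios, as computed in the cell above, for comparison.
toy["game_ratio"] = toy["missed"] / toy["xGA"]
totals["mean_game_ratio"] = toy.groupby("teamId")["game_ratio"].mean()

print(totals)
```

In this toy case team A conceded exactly as many goals as expected over both games, yet the mean of per-game ratios is 2.0 because of the single low-xGA game.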

Let's also compare expected and actual points received in a game:

In [None]:
plt.figure(figsize=(10,6))
plt.title("Expected vs Actual Points - Distribution of Difference", fontsize=18)

diff_pts = df_epl.xpts - df_epl.pts

# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
sns.kdeplot(diff_pts, color='blue')

Since a team earns 0, 1, or 3 points in a game and expected points also fall between 0 and 3, the difference always lies between -3 and +3. The curve extends slightly past these limits only because the density estimate smooths the tails.

### Match day effect on performance

Liverpool has dropped only 5 points so far this season, so there is little to compare. Let’s instead check the match day effect for the second-placed team, Man City.

In [None]:
df_epl[df_epl.teamId == 'Man City'][['pts','matchDay']].groupby('matchDay').agg(['mean','count'])

It seems like Man City does not like Sundays. Their average on Fridays is 0 points, but that comes from a single game, so we cannot draw a conclusion from it. The same check can be extended to all teams to get a general idea of the match day effect on team performance.
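Extending the check to all teams could look like the sketch below, which pivots average points into a team-by-day table. The data is made up, and the day labels "Sat" and "Sun" are assumptions about how matchDay is encoded.

```python
import pandas as pd

# Invented rows: one row per team per game, with the day the game was played.
toy = pd.DataFrame({
    "teamId":   ["A", "B", "A", "B", "A", "B"],
    "matchDay": ["Sat", "Sat", "Sun", "Sun", "Sat", "Sat"],
    "pts":      [3, 0, 1, 1, 3, 0],
})

# Average points per team (rows) and match day (columns).
by_day = toy.pivot_table(index="teamId", columns="matchDay",
                         values="pts", aggfunc="mean")

# Number of games behind each average, to spot cells backed by very few games.
n_games = toy.pivot_table(index="teamId", columns="matchDay",
                          values="pts", aggfunc="count")

print(by_day)
print(n_games)
```

Reading the two tables side by side helps avoid over-interpreting an average that rests on one or two games, like the Friday result above.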

### Goals and points per game

In [None]:
df_epl['goals']= df_epl['scored'] + df_epl['missed']
df_epl['goals'].mean()

The average is 2.72 goals per game. Home teams usually score more than away teams and thus collect more points, which is often attributed to the support of fans in the stadium.

In [None]:
df_epl[['h_a','scored','pts']].groupby('h_a').mean()

Home teams, in general, dominate the games. We can also see that in the number of shots per game. Let’s compare shots for home teams and away teams:

In [None]:
print("Home team stats \n {} \n".format(df_epl[df_epl.h_a == 'h'][['HS.x','HST.x','HtrgPerc']].mean()))
print("Away team stats \n {} \n".format(df_epl[df_epl.h_a == 'a'][['AS.x','AST.x','AtrgPerc']].mean()))

Home teams outperform away teams in shots and shots on target. However, away teams are slightly more accurate.

### Team performances

One way to measure a team's performance is to compare the points they collect with their expected points. There is, of course, a “luck” factor in some cases, but it is an interesting statistic. Let’s check the average difference between actual and expected points, which shows how well each team meets expectations.

In [None]:
df_epl['performance'] = df_epl['pts'] - df_epl['xpts']
df_perf = df_epl[['teamId','performance']].groupby('teamId').mean().sort_values(by='performance', ascending=False)
    
print("Above expectation \n {} \n".format(df_perf[df_perf.performance > 0]))
print("Below expectation \n {} \n".format(df_perf[df_perf.performance < 0]))

Liverpool outperforms the others by far, which makes sense because they have dropped only 5 of a possible 87 points in 29 games. Man City, Man Utd, and Chelsea are more surprising: on average, they collect fewer points than expected.

### Referees

Some referees tend to show yellow and red cards more readily than others, and I think players keep that in mind. Let’s see how many cards each referee shows per game on average.

In [None]:
df_epl['cards'] = df_epl['HY.x'] + df_epl['HR.x'] + df_epl['AY.x'] + df_epl['AR.x']
df_epl[['Referee.x','cards']].groupby('Referee.x').agg(['mean','count'])

Among the referees with 15 or more games, A Taylor, C Pawson, M Dean, and S Attwell have shown more than 4 cards per game on average. Players should keep that in mind. Please note that the actual number of games is half of the "count" in the dataframe above, because the original dataset has two rows for each game: one from the home team's side and one from the away team's side.
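The doubled count can be corrected directly in the table. The sketch below uses made-up referees and card totals, halves the row count to get the number of games, and then filters by a minimum number of games. The games column and the threshold of 2 are hypothetical choices for this toy example.

```python
import pandas as pd

# Invented rows: each game appears twice (home row and away row),
# so both rows of a game carry the same card total.
toy = pd.DataFrame({
    "Referee.x": ["R1", "R1", "R2", "R2", "R1", "R1"],
    "cards":     [4, 4, 2, 2, 6, 6],
})

ref = toy.groupby("Referee.x")["cards"].agg(["mean", "count"])

# Two rows per game, so the number of games is half the row count.
ref["games"] = ref["count"] // 2

# Keep referees with enough games, sorted by average cards per game.
busy = ref[ref["games"] >= 2].sort_values(by="mean", ascending=False)
print(busy)
```

The mean is unaffected by the duplication, since both rows of a game hold the same value; only the count needs correcting.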

### Liverpool

Liverpool is having a spectacular season. They have collected 82 of a possible 87 points so far. Let's check a few stats on their home and away performances, because their stadium, Anfield Road, is said to push the players to perform even better.

In [None]:
liv = df_epl[df_epl.teamId == 'Liverpool']
liv.shape

In [None]:
print("Home shots \n {} \n".format(liv[liv.h_a == 'h'][['HS.x','HST.x','HtrgPerc']].mean()))
print("Away shots \n {} \n".format(liv[liv.h_a == 'a'][['AS.x','AST.x','AtrgPerc']].mean()))

Shots, shots on target, and shot accuracy are all higher in home games, which I'd like to call the "Anfield Road" effect.

There are many more performance metrics we can come up with regarding team, player, and referee performances. I have tried to cover some interesting criteria in football. Pandas provides many useful and easy-to-use functions and methods for exploratory data analysis. Visualizations are also great tools to explore the data.

Thanks for reading. Please let me know if you have any feedback.