# **Table of Contents**

1. Reading Data

2. Adding Calculated Fields

3. Average Goals in the League

4. Teams with most goals and misses

5. Correlation within variables

6. Comparison of RFPL and Other Leagues

7. How Paris Saint Germain Lost the Championship Once

In [None]:
# Dataset source:
# https://www.kaggle.com/slehkyi/extended-football-stats-for-european-leagues-xg
import pprint
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Reading Data

In [None]:
overall = pd.read_csv('../input/extended-football-stats-for-european-leagues-xg/understat.com.csv')
overall = overall.rename(index=int, columns={'Unnamed: 0': 'league', 'Unnamed: 1': 'year'}) 

match_info = pd.read_csv('../input/extended-football-stats-for-european-leagues-xg/understat_per_game.csv')

In [None]:
print(match_info.dtypes)

# 2. Adding Calculated Fields

The average goals scored and missed (conceded) per match are added for every team in each year.
The average points per match are also calculated.

In [None]:
overall['avg_miss'] = overall['missed']/overall['matches']
overall['avg_goal'] = overall['scored']/overall['matches']
overall['avg_pts'] = overall['pts']/overall['matches']

# 3. Average Goals in the League

I will show both the numerical data and a graph of the average goals per match.
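
The calculation can be illustrated on a small made-up table. This is a minimal sketch that assumes, as in `understat_per_game.csv`, that every match contributes two rows (one per team). The league names and numbers below are invented for illustration.

```python
import pandas as pd

# Toy per-team-per-game rows (invented values): each match appears twice,
# once for the home side ('h') and once for the away side ('a').
toy = pd.DataFrame({
    'league': ['A', 'A', 'A', 'A', 'B', 'B'],
    'year':   [2019, 2019, 2019, 2019, 2019, 2019],
    'h_a':    ['h', 'a', 'h', 'a', 'h', 'a'],
    'scored': [2, 1, 0, 3, 1, 1],
})

summary = (
    toy.groupby(['league', 'year'])
       .agg(rows=('h_a', 'count'), scored=('scored', 'sum'))
       .reset_index()
)
# Two rows per match, so halve the row count to get the number of matches.
summary['matches'] = summary['rows'] / 2
summary['avg_goal'] = summary['scored'] / summary['matches']
print(summary[['league', 'year', 'matches', 'avg_goal']])
```

Summing `scored` over both rows of a match gives the total goals in that match, which is why only the row count needs to be halved.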

In [None]:
# Count rows and total goals per league and year
matches = match_info.groupby(['league', 'year']).agg({'h_a': 'count', 'scored': 'sum'}).reset_index()
matches = matches.rename(columns={'h_a': 'matches'})

# Each match appears twice (once per team), so halve the row count
matches['matches'] = matches['matches'] / 2
matches['avg_goal'] = matches['scored'] / matches['matches']
print(matches)

In [None]:
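# Visualization of average goals per match by league
plt.figure(figsize=(15,10))

# Sort the league names so the line order and colors are reproducible
for l in sorted(set(matches['league'])):
    temp = matches[matches['league']==l]
    plt.plot(temp['year'], temp['avg_goal'], label=l)

plt.xlabel('Year')
plt.ylabel('Average goals per match')
plt.legend(loc='best')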

The plot shows that the Bundesliga has the highest average goals per match from 2017 to 2019.

RFPL (Russian Premier League) has the fewest goals in every year.
In a later section, I will look deeper into RFPL.
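
The reading of the chart can be cross-checked numerically. The sketch below uses an invented table with the same columns as `matches` (`league`, `year`, `avg_goal`); the league names and values are placeholders, not results from the dataset.

```python
import pandas as pd

# Invented values shaped like the `matches` table.
toy_matches = pd.DataFrame({
    'league':   ['X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'year':     [2018, 2018, 2018, 2019, 2019, 2019],
    'avg_goal': [3.1, 2.7, 2.3, 3.2, 2.8, 2.4],
})

# Pivot to one row per year and one column per league.
wide = toy_matches.pivot(index='year', columns='league', values='avg_goal')

# idxmax/idxmin along the columns give the league with the highest and
# lowest average for each year.
extremes = pd.DataFrame({
    'highest': wide.idxmax(axis=1),
    'lowest': wide.idxmin(axis=1),
})
print(extremes)
```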

# **4. Teams with most goals and misses**

In [None]:
# Top 20 team-seasons by average goals per match
groupby = overall.groupby(['league', 'year','team'])['avg_goal'].mean()
groupby = groupby.nlargest(20)
color = plt.cm.RdYlGn(np.linspace(0,1,20))
groupby.plot.bar(color=color)


It is not surprising that Barcelona appears in the list several times. However, the team with the highest average goals is Real Madrid in 2014.

Atalanta in 2019 is ranked No. 19. Their performance in the UEFA Champions League was also impressive.

**The teams with the most misses are shown below**

In [None]:
# Top 20 team-seasons by average goals missed per match
most_miss = overall.groupby(['league', 'year','team'])['avg_miss'].mean()
most_miss = most_miss.nlargest(20)
print(most_miss)
color = plt.cm.RdYlGn(np.linspace(0,1,20))
# Plot the misses series
most_miss.plot.bar(color=color)

# **5. Correlation within variables**

I would like to look at how goals and misses correlate with passing performance.

Let's start with the overall picture.

In [None]:

overall_subset = overall[['league','scored','missed','ppda_coef','oppda_coef','deep','deep_allowed']]
# 'league' is a text column, so drop it before computing correlations
corrmat = overall_subset.drop(columns='league').corr()
plt.figure(figsize=(10,10))
g = sns.heatmap(corrmat,annot=True,cmap="RdYlGn")

It is not surprising that scoring is correlated with deep passes (`deep`), with a correlation coefficient of 0.8, while misses are correlated with `deep_allowed` at 0.69.

The `oppda_coef` variable also shows a notable correlation with goals scored, at 0.64.
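
The numbers in the heatmap are Pearson correlation coefficients, which is the default method of `DataFrame.corr`. As a minimal sketch of what that coefficient measures, the function below computes it by hand for two short invented series. `pearson` is a hypothetical helper written for this illustration, not part of pandas, and the values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Invented values standing in for two columns such as `deep` and `scored`.
deep = [150, 220, 310, 400, 480]
scored = [30, 41, 48, 70, 85]

print(round(pearson(deep, scored), 3))
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means no linear relationship.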

Let's look at the heatmaps for each league. 

In [None]:
# One correlation heatmap per league
for l in sorted(set(overall_subset['league'])):
    df = overall_subset[overall_subset['league']==l]
    # Drop the text column before computing correlations
    corrmat = df.drop(columns='league').corr()
    plt.figure(figsize=(10,10))
    ax = plt.axes()
    ax.set_title(l)
    g = sns.heatmap(corrmat,annot=True,cmap="RdYlGn")

# **6. Comparison of RFPL and Other Leagues**

Let's look deeper into how RFPL differs from the other leagues.


In [None]:

subsets = ['scored','xG','ppda_coef','deep','pts']
RFPL = match_info[match_info['league']=='RFPL']
RFPL = RFPL[subsets]
Other = match_info[match_info['league']!='RFPL']
Other = Other[subsets]

# Round after doubling to avoid floating-point artifacts in the printed value
for i in ['scored','xG','ppda_coef','deep']:
    print('Average',i,'in RFPL =',round(RFPL[i].mean()*2,3))
    print('Average',i,'in Other Leagues =',round(Other[i].mean()*2,3))
    print('\n')

From the results above, the passing data for RFPL and the other leagues are very close. However, there are significant differences in the number of actual goals and expected goals.

Let's look at the competitiveness of each league by calculating the standard deviation of points. A smaller standard deviation means the teams finish closer together. By this measure, RFPL is the most competitive league.
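
The idea can be sketched on invented data: a league where teams finish close together has a smaller standard deviation of points than one with a runaway leader. Because leagues may not all play the same number of matches, the sketch also computes the spread of points per match. That normalisation is an assumption added for illustration, and the league names and numbers are made up.

```python
import pandas as pd

# Invented season tables: league 'Tight' is evenly matched, 'Loose' is not.
toy = pd.DataFrame({
    'league':  ['Tight'] * 4 + ['Loose'] * 4,
    'team':    ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
    'matches': [30, 30, 30, 30, 38, 38, 38, 38],
    'pts':     [48, 45, 42, 40, 95, 60, 40, 20],
})
toy['avg_pts'] = toy['pts'] / toy['matches']

# Sample standard deviation (pandas default, ddof=1) per league.
spread = toy.groupby('league')[['pts', 'avg_pts']].std()
print(spread.round(3))

# A smaller spread means teams finish closer together.
most_competitive = spread['avg_pts'].idxmin()
print('Most competitive by this measure:', most_competitive)
```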

In [None]:
league = []
std_list = []
for l in sorted(set(overall['league'])):
    df = overall[overall['league']==l]
    std = round(df['pts'].std(),3)
    print('Std of points in',l,'=',std)
    league.append(l)
    std_list.append(std)

plt.figure(figsize=(10,8))
plt.plot(league,std_list,marker='o')
plt.xlabel('League')
plt.ylabel('Standard Deviation of League Points')

# 7. How Paris Saint Germain Lost the Championship Once

Paris Saint Germain has dominated Ligue 1 for many years, but they lost the title in 2016. Let's drill into the details to find out what happened.

We will start by looking at the basic figures.

In [None]:

Paris = overall[overall['team']=='Paris Saint Germain']

# y-axis label and title for each metric
labels = {
    'pts': ("No. of Points Achieved", "Points achieved by Paris Saint Germain"),
    'scored': ("No. of Goals Scored", "Goals scored by Paris Saint Germain"),
    'missed': ("No. of Goals Missed", "Goals missed by Paris Saint Germain"),
}

# One bar plot per metric, each in its own figure
for i, (y_axis, title) in labels.items():
    plt.figure(figsize=(10, 5))
    plt.bar(Paris['year'], Paris[i])
    plt.xlabel("Year")
    plt.ylabel(y_axis)
    plt.title(title)
    plt.show()

From the results above, the overall performance of Paris in 2016 was not the worst between 2014 and 2019. The results of 2014 look worse.
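
To put one season in context, each figure can be compared with the team's own average across seasons. The sketch below uses an invented table shaped like the `Paris` subset (`year`, `pts`, `scored`, `missed`); the numbers are placeholders and are not taken from the dataset.

```python
import pandas as pd

# Invented values shaped like the `Paris` subset.
toy_team = pd.DataFrame({
    'year':   [2014, 2015, 2016, 2017],
    'pts':    [80, 96, 88, 92],
    'scored': [80, 100, 84, 108],
    'missed': [36, 20, 28, 28],
}).set_index('year')

# Difference between each season and the team's multi-season mean.
# Positive pts/scored and negative missed indicate a better-than-usual season.
diff_from_mean = toy_team - toy_team.mean()
print(diff_from_mean)

# Rank seasons by points (1 = best season).
pts_rank = toy_team['pts'].rank(ascending=False).astype(int)
print(pts_rank)
```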


(I will add more content on this topic later)