# Loading data

## Files
* Overall league data

## Variables explanation

* **xG** - expected goals, a statistical measure of the quality of chances created and conceded. More at understat.com.

* **xG_diff** - difference between actual goals scored and expected goals.

* **npxG** - expected goals without penalties and own goals.

* **xGA** - expected goals against.

* **xGA_diff** - difference between actual goals conceded and expected goals against.

* **npxGA** - expected goals against without penalties and own goals.

* **npxGD** - difference between "for" and "against" expected goals without penalties and own goals.

* **ppda_coef** - passes allowed per defensive action in the opposition half (power of pressure).

* **oppda_coef** - opponent passes allowed per defensive action in the opposition half (power of opponent's pressure).

* **deep** - passes completed within an estimated 20 yards of goal (crosses excluded).

* **deep_allowed** - opponent passes completed within an estimated 20 yards of goal (crosses excluded).

* **xpts** - expected points.

* **xpts_diff** - difference between actual and expected points.

In [None]:
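import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)  # show every column when displaying frames
df = pd.read_csv("../input/extended-football-stats-for-european-leagues-xg/understat.com.csv")

# Give the first two columns readable names; the rest keep their original names
cols = list(df.columns)
cols[0] = "league"
cols[1] = "year"
df.columns = cols
df.head(2)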

# Competitiveness

## Proportion of points

There are a number of ways in which we can measure the degree of competition that exists between teams.  One such measure is the concentration ratio. This is often used by industrial economists who are interested in the degree to which the largest n firms dominate a particular industry. 
Applied to a sporting context, the concentration ratio can be used to measure the proportion of a season’s points won by the top n sides within a k team league (where n<k).

Formula: $ C_n = \frac{\sum \limits _{i=1} ^{n} S_{i}}{\sum \limits _{i=1} ^{k} S_{i}} $

$S_i$ denotes the share of points won by the $i$-th ranked team, so the numerator sums over the top n sides and the denominator over the whole league.

If we plot this concentration ratio over time, two things emerge. First, the dominance of the top three sides varies from year to year as their strength changes relative to the rest of the league. Starting at a given peak, the top sides set a standard of play which strongly differentiates them from the rest of the division. In subsequent seasons, the chasing pack revise their own level of play (usually by recruiting better quality players and changing tactics), enabling some degree of ‘catch-up’. Eventually, the top sides raise the bar again and the process continues. Second, there is an underlying increase in the dominance of the top three sides, the implication being that the Premiership is showing signs of falling competitiveness. This may arise because the top sides benefit from playing in the Champions League: they learn new tactics from outside the domestic league and acquire the resources to hire even better quality players.
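
As a quick sanity check of the ratio, here is a minimal sketch on an invented six-team league. The helper name `concentration_ratio` and the point totals are assumptions made for illustration only; they are not part of the dataset.

```python
import pandas as pd

def concentration_ratio(points, n):
    """Share (in %) of all points won by the n best teams."""
    points = pd.Series(points, dtype="float64")
    return round(float(points.nlargest(n).sum() / points.sum() * 100), 1)

# Hypothetical six-team league; the point totals are invented
toy_points = [90, 80, 70, 40, 30, 20]
cr3 = concentration_ratio(toy_points, 3)
print(cr3)  # 240 of 330 points, i.e. 72.7

# A perfectly balanced league of the same size gives n/k = 3/6
balanced = concentration_ratio([55] * 6, 3)
print(balanced)  # 50.0
```

The balanced case gives the lower bound n/k, so values well above it suggest that the top sides dominate.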


In [None]:
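def comp(df, n):
    """Plot the share (%) of each season's points won by the top n teams, per league."""
    rows = []
    for league in df["league"].unique():
        for year in df["year"].unique():
            d = df[(df["year"] == year) & (df["league"] == league)]
            if d.empty:  # skip league/season pairs with no data
                continue
            # nlargest(n) picks the n best totals; the default nlargest() would cap this at 5
            share = d["pts"].nlargest(n).sum() * 100 / d["pts"].sum()
            rows.append([league, year, round(share, 1)])
    sns.set(rc={'figure.figsize': (11.7, 8.27)})
    sns.lineplot(data=pd.DataFrame(rows, columns=["league", "year", "percntg pts"]),
                 x="year", y="percntg pts", hue="league", style="league", linewidth=3)


comp(df, 3)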


## Movement in rankings

For each consecutive pair of years, what is the average ranking movement of the top k teams?
Example: if the top 5 teams in the EPL are mostly consistent and their rankings move less than those in La Liga, then La Liga can be considered more competitive, because more transitions are happening in the top spots.

 $ C(N_t)=\frac{\sum \limits _{i=1} ^{N_t} |R_{i,t}-R_{i,t-1}|}{N_t} $
 
 $R_{i,t}$ is the ranking of team $i$ in year $t$.
 $N_t$ is the number of teams that were in a top k spot in year $t$ or $t-1$.
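
To make the measure concrete, here is a small sketch on invented final positions for two seasons. The helper `avg_movement` and the toy table are hypothetical. Teams missing from one of the two seasons are counted as zero movement, which is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical final positions of four teams in two consecutive seasons
toy = pd.DataFrame({
    "team": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "year": [2018] * 4 + [2019] * 4,
    "position": [1, 2, 3, 4, 3, 1, 2, 4],
})

def avg_movement(d, k, t):
    """Mean absolute position change between t-1 and t for teams in the top k in either season."""
    prev = d[d["year"] == t - 1].set_index("team")["position"]
    curr = d[d["year"] == t].set_index("team")["position"]
    # Teams that finished in the top k in at least one of the two seasons
    top = prev[prev <= k].index.union(curr[curr <= k].index)
    # A team absent from one season yields NaN, treated here as zero movement
    diffs = (curr.reindex(top) - prev.reindex(top)).abs().fillna(0)
    return diffs.sum() / len(top)

movement = avg_movement(toy, k=2, t=2019)
print(movement)
```

With k = 2, teams A, B and C qualify (A and B in 2018, B and C in 2019). Their position changes are 2, 1 and 1, so the average movement is 4/3.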

In [None]:
def move(df, k):
    """Average absolute position change between consecutive seasons for teams in the top k in either one."""
    rows = []
    years = df["year"].unique()
    for league in df["league"].unique():
        for year in years:
            if year - 1 not in years:  # no previous season to compare with
                continue
            d = df[(df["league"] == league) & df["year"].isin([year - 1, year])]
            d = d[["team", "year", "position", "league"]]
            # Teams that finished in the top k in either of the two seasons
            top = d[d["position"] <= k]["team"].unique()
            d = d[d["team"].isin(top)]
            # Per-team change in position; teams present in only one season count as 0
            diffs = d.groupby("team")["position"].diff().fillna(0)
            rows.append([league, year, diffs.abs().sum() / len(top)])
    return pd.DataFrame(rows, columns=["league", "year", "movement"])

# year must be numeric so that year - 1 works inside move()
df["year"] = pd.to_numeric(df["year"], errors='coerce')
sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.lineplot(data=move(df, 10), x="year", y="movement", hue="league", style="league", linewidth=3)
plt.gca().set_xticks(df["year"].unique())

> ***Ligue 1 shows a higher average movement in the most recent seasons, 2018 and 2019. This means that its top spots are more volatile, which is a good indicator of competitiveness in the league.***

## Closeness of stats

If stats like **expected goals**, **points**, **goals scored**, **ppda_coef** and **oppda_coef** (pressure, opponent's pressure) are close across teams, then the league can be assumed to be competitive.

Let's calculate the standard deviation of these stats for each league and compare it across years.
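
To illustrate why a smaller standard deviation points to a tighter league, here is a toy comparison of two invented four-team leagues with the same average points. The league names and numbers are hypothetical.

```python
import pandas as pd

# Two invented leagues, both averaging 50 points per team
toy = pd.DataFrame({
    "league": ["Balanced"] * 4 + ["Lopsided"] * 4,
    "pts": [52, 50, 48, 50, 90, 60, 30, 20],
})

# pandas computes the sample standard deviation (ddof=1) by default
spread = toy.groupby("league")["pts"].std()
print(spread)
```

The means are identical, so only the spread separates the two leagues: the lopsided one has a far larger standard deviation.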


In [None]:
stat = "pts"  # name of the column whose spread we want to compare

def SD_stats(league, year, stat):
    """Standard deviation of `stat` across the teams of one league in one season."""
    d = df[(df["league"] == league) & (df["year"] == year)]
    return d[stat].std()

# One row per league and season
df_stat = []
for i in df.league.unique():
    for j in df.year.unique():
        df_stat.append([i, j, SD_stats(i, j, stat)])
df_stat = pd.DataFrame(df_stat, columns=["league", "year", stat + "_SD"])
sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.lineplot(data=df_stat, x="year", y=stat + "_SD", hue="league", style="league", linewidth=3)


> ***Ligue 1 has a much lower standard deviation in points, whereas Serie A has a comparatively high SD in points, which reflects the imbalance in strength between its teams.***

> ***The same could be done for the other stats mentioned above.***
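
One possible way to cover several stats in a single pass is a `groupby` over league and year. The sketch below is only an illustration: the helper `sd_by_league_year` is hypothetical, and the toy frame just borrows the `league`, `year`, `pts` and `xG` column names used above.

```python
import pandas as pd

def sd_by_league_year(data, stats):
    """Standard deviation of each listed stat per (league, year)."""
    return data.groupby(["league", "year"])[stats].std().reset_index()

# Invented numbers, only to show the shape of the result
toy = pd.DataFrame({
    "league": ["X"] * 3 + ["Y"] * 3,
    "year": [2019] * 6,
    "pts": [80, 50, 20, 55, 50, 45],
    "xG": [70.0, 45.0, 25.0, 50.0, 48.0, 46.0],
})

sd_table = sd_by_league_year(toy, ["pts", "xG"])
print(sd_table)
```

On the real data the call could look like `sd_by_league_year(df, ["pts", "xG", "ppda_coef"])`, assuming those columns are numeric. The result is in wide form with one column per stat, so each stat can be passed to `sns.lineplot` as `y` in turn.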