# Football Data Analysis

In this notebook I'll explore past football transfers, and then try my hand at predicting players' transfer fees from certain attributes. First, let's take a look at what we're working with.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
transfers = pd.read_csv('../input/top-250-football-transfers-from-2000-to-2018/top250-00-19.csv')
transfers.head()


In [None]:
print(transfers.dtypes)
print(transfers.shape)

Some things to settle before we get into the data analysis:
1. Season should be an int value with just the starting year.
2. Transfer fees and market values both have a lot of trailing zeros, so I'll convert both into millions to make them easier to read.
3. Positions are inconsistent: some players are just labelled 'Defender', 'Midfielder' or 'Sweeper' when in fact they play a specific position.
4. One player has an age of 0 in the dataset, which is impossible.

For points 3 and 4, if a large group of players has these misleading labels, I'll correct them manually. If the group is small, the rows will simply be dropped.
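
To decide between fixing and dropping, it helps to count the affected rows first. Here is a minimal sketch on a made-up table: the column names `Position` and `Age` come from the dataset, but the rows and the `generic_labels` list are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the transfers table (made-up rows, real column names)
toy = pd.DataFrame({
    "Position": ["Centre-Back", "Midfielder", "Left Winger", "Defender", "Goalkeeper"],
    "Age": [24, 27, 0, 30, 22],
})

# Labels that do not name a specific position (assumed list)
generic_labels = ["Defender", "Midfielder", "Sweeper"]

# Count the rows with a generic position label and the rows with an impossible age
n_generic = int(toy["Position"].isin(generic_labels).sum())
n_age_zero = int((toy["Age"] == 0).sum())
print(n_generic, n_age_zero)
```

If both counts are tiny compared with `len(toy)`, dropping the rows loses very little data.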

In [None]:
transfers['Season_transferred']=transfers['Season'].str.split('-').str[0]
transfers = transfers.astype({'Season_transferred':'int64'})
transfers = transfers.drop(columns =['Season'])
transfers.head()

In [None]:
print(transfers.Position.unique())
transfers.Position = transfers.Position.replace(to_replace=['Second Striker','Centre-Forward','Sweeper'],value = ['Forward','Forward','Defender'])
print(transfers.Position.unique())

In [None]:
print(transfers.Age.unique())
transfers_weird = transfers.loc[transfers.Age == 0]
print(transfers_weird)

This is the player listed with an age of 0. As there is just one such row, I'll drop it.

In [None]:
transfers_midfield = transfers.loc[transfers.Position == 'Midfielder']
print(transfers_midfield.head(20))
transfers_defenders = transfers.loc[transfers.Position =='Defender']
print(transfers_defenders.head(20))

Only 3 players carry these generic labels, so let's drop all of them to keep things simple.

In [None]:
# Drop the generically labelled positions and the age-0 row; .copy() avoids SettingWithCopyWarning below
transfers_cleaned = transfers[~((transfers.Position=='Midfielder')|(transfers.Position=='Defender')|(transfers.Age ==0))].copy()
# Express fees and market values in millions
transfers_cleaned['Transfer_fee_in_mln']=transfers_cleaned['Transfer_fee']/1000000
transfers_cleaned['Market_value_in_mln']=transfers_cleaned['Market_value']/1000000
transfers_cleaned = transfers_cleaned.drop(labels = ['Transfer_fee','Market_value'],axis = 1)
# Season_transferred stays an int year: later cells compare and merge on it as a number
transfers_cleaned.head()


Let's plot out a distribution of transfer fee by season.

In [None]:
import matplotlib.style as style
style.use('fivethirtyeight')
fig, ax = plt.subplots()
# stripplot places the seasons at positions 0, 1, 2, ..., so the percentile lines are drawn
# against the same positions (use_index=False) rather than against the year values
fees_by_season = transfers_cleaned.groupby('Season_transferred')['Transfer_fee_in_mln']
fees_by_season.quantile(.75).plot(ax = ax, use_index = False, linewidth = 1.0, color = 'g')
fees_by_season.quantile(.50).plot(ax = ax, use_index = False, linewidth = 1.0, color = 'b')
fees_by_season.quantile(.90).plot(ax = ax, use_index = False, linewidth = 1.0, color = 'r')
sns.stripplot(x = 'Season_transferred', y = 'Transfer_fee_in_mln',
              data = transfers_cleaned, size = 2, ax = ax)
plt.xlabel('Season transferred')
plt.ylabel('Transfer Fee in Millions')
plt.xticks(fontsize = 8)
ax.annotate(text = 'Ronaldo', xy = (9,95), xycoords = 'data',
            xytext = (0,25), textcoords = 'offset points', arrowprops = dict(arrowstyle ='->', color = 'black'))
ax.annotate(text ='Neymar', xy = (17,220), xycoords = 'data',
            xytext = (0,-50), textcoords = 'offset points', arrowprops = dict(arrowstyle = '->',color = 'black'))
plt.show()

From this first graph, a few conclusions can be drawn:

**1. Transfer fees have been steadily rising with the years.**

Not only have the best players' transfer fees risen as shown by the red line representing the 90th percentile of player transfers, the 75th and 50th percentile lines (green and blue lines respectively) have also shown a steady shift upwards from 2000 to 2018. 

**2. The price of the best players has increased at a higher rate than that of good, but not great, players.**

Visually, the number of outliers has risen together with transfer fees over time. For an outlier-to-outlier comparison, Ronaldo cost 90 million in 2009, while Neymar cost upwards of 200 million in 2017. The 90th percentile line also pulls further away from the 75th and 50th percentile lines as time passes.

**3. 2017 was a crazy year.**

The 90th, 75th and 50th percentiles all show a large rise from 2016 to 2017. This coincides with the transfer of Neymar from Barcelona to PSG, Ousmane Dembele from Dortmund to Barcelona and many other expensive transfers. Prices kept climbing after that (Mbappe, Coutinho and Ronaldo are examples from 2018), greatly increasing the number of transfers above 100 million.
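
The widening gap between the percentile lines can also be checked numerically rather than by eye. This is a rough sketch on made-up fees for two seasons; only the column names are taken from the notebook, and `gap` is a helper column I've introduced for illustration.

```python
import pandas as pd

# Made-up fees (in millions) for two seasons
toy = pd.DataFrame({
    "Season_transferred": [2016] * 5 + [2017] * 5,
    "Transfer_fee_in_mln": [5, 10, 15, 20, 30, 8, 15, 25, 50, 200],
})

# Median and 90th percentile per season, one row per season
q = (toy.groupby("Season_transferred")["Transfer_fee_in_mln"]
        .quantile([0.5, 0.9])
        .unstack())
q.columns = ["median", "p90"]

# How far the top end has pulled away from the typical transfer
q["gap"] = q["p90"] - q["median"]
print(q)
```

A `gap` that grows from season to season is the numerical counterpart of the red line drifting away from the blue one.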

In [None]:
fig2, ax2 = plt.subplots()
sns.barplot(x='Age',y='Transfer_fee_in_mln',data = transfers_cleaned)
plt.xlabel('Age')
plt.ylabel('Transfer Fee in Millions')


From this graph, a few conclusions can be drawn:

**1. Player transfer fees increase as age increases up till 27 years old, and then gradually decrease.**

The graph shows a general upward trend from 15 to 27 years old. It already begins levelling off at about 24, and a drastic decrease is observed once players are over 30.

**2. Remarkably, the variance of player transfer fees is generally independent of age.**

This finding was rather unexpected: one would expect the variation in transfer fees to decrease as players get older, as assessments of a player's ability relative to their age group become more settled. However, this was not observed, as the confidence interval bars (the black bars) are roughly the same length across different ages. Bear in mind that these bars are confidence intervals for the mean, so their length also depends on how many transfers fall into each age group.

**3. Ronaldo has an incredible effect on statistics.**

His transfer to Juventus is the only reason why the bar for age 33 is an anomaly.
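
Since the black bars in a seaborn bar plot describe the uncertainty of the mean rather than the spread itself, a more direct check of the variance claim is to compute the standard deviation per age. This is a minimal sketch on made-up numbers, not the notebook's actual result.

```python
import pandas as pd

# Made-up fees (in millions) for three ages; the age-33 group has one huge outlier
toy = pd.DataFrame({
    "Age": [21] * 4 + [27] * 4 + [33] * 2,
    "Transfer_fee_in_mln": [5, 10, 20, 45, 10, 20, 30, 60, 3, 100],
})

# Sample size, mean and standard deviation of the fee for each age
spread = toy.groupby("Age")["Transfer_fee_in_mln"].agg(["count", "mean", "std"])
print(spread)
```

Reading `count` next to `std` shows whether a group looks unusual because its fees really vary more, or because it only contains a handful of transfers.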

In [None]:
transfers_33 = transfers_cleaned[transfers_cleaned['Age']==33]
print(transfers_33.Transfer_fee_in_mln.max())
print(transfers_33.Transfer_fee_in_mln.min())
transfers_15 = transfers_cleaned[transfers_cleaned['Age']==15]
print(transfers_15.head())

In [None]:
#fig3, ax3 = plt.subplots()
#sns.stripplot(x='Position',y='Transfer_fee_in_mln',data = transfers_cleaned)
#transfers_cleaned.groupby('Position')['Transfer_fee_in_mln'].quantile(.75).plot(linewidth = 1.0,color = 'g')
#transfers_cleaned.groupby('Position')['Transfer_fee_in_mln'].quantile(.50).plot(linewidth = 1.0)
#transfers_cleaned.groupby('Position')['Transfer_fee_in_mln'].quantile(.90).plot(linewidth = 1.0, color = 'r')
#plt.xlabel('Position')
#plt.xticks(rotation = 60)
#plt.ylabel('Transfer Fee in Millions')

In [None]:
transfers_cleaned.groupby('Position')['Transfer_fee_in_mln'].quantile(.75).plot(linewidth = 1.0,color = 'g')
transfers_cleaned.groupby('Position')['Transfer_fee_in_mln'].quantile(.50).plot(linewidth = 1.0)
transfers_cleaned.groupby('Position')['Transfer_fee_in_mln'].quantile(.90).plot(linewidth = 1.0, color = 'r')
plt.xticks(rotation = 90)
plt.show()

What is the correlation between market value and transfer fees?

In [None]:
sns.lmplot(x='Market_value_in_mln',y='Transfer_fee_in_mln',data=transfers_cleaned, ci = None, hue = 'Season_transferred',height = 10)
# Overlay one best fit line across all seasons; scatter=False draws the line without re-plotting the points
sns.regplot(x='Market_value_in_mln',y='Transfer_fee_in_mln',data=transfers_cleaned, ci = None, scatter = False, label ='Aggregated transfers')
plt.ylabel('Transfer Fee in Millions')
plt.xlabel('Market Value in Millions')

The blue line is the aggregated best fit line of transfer fees to market values. A few conclusions can be drawn here:

**1. Clubs are more prone to overpaying now than in the past.**

The ratio of a player's transfer fee to their listed market value is given by the gradient of the line. By inspection, the most recent years, namely 2015-2018, have the steepest gradients and pull the aggregated best fit line upwards, since more of the per-season lines lie below the aggregated line than above it.

Overpayment thus appears to be a relatively recent phenomenon that has become more pronounced in recent years.

**HOWEVER**, an important caveat is that market values for many players are null in the DataFrame we are working with, especially for earlier years, and this may lead to inaccuracies. While this does not change the fact that teams are overpaying for players, it may mask the extent of overpayment in past years.

**2. Inflation is a big thing in football transfers, and even more so recently.**

Market value and transfer fees are both a representation of larger economic forces (e.g. inflation) at play. While it is impossible to determine whether Neymar in 2017 was objectively a better player than Ronaldo in 2009, it cannot be denied that inflation has taken place in that span as the large majority of outlying points are from recent years (2015-2018). 

**3. That being said, market value is subjective.**

It is important to note that these figures are not representative of a player's actual value, and as different clubs will value different players differently, there is no such thing as an inherently quantifiable market value for a player. So it's definitely possible and likely that Juve didn't overpay for an ageing Ronaldo, or PSG for Neymar - after all, the only valuation that actually matters is that of the team buying the player and selling the player.
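
The gradient reading from the first point can be made concrete by fitting one straight line per season and comparing the slopes. Here is a sketch on made-up numbers, using `np.polyfit` for the fit; the `slope` helper is my own and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up market values and fees (in millions) for an early and a recent season
toy = pd.DataFrame({
    "Season_transferred": [2005] * 4 + [2017] * 4,
    "Market_value_in_mln": [5, 10, 20, 40, 5, 10, 20, 40],
    "Transfer_fee_in_mln": [5, 11, 19, 41, 8, 16, 29, 62],
})

def slope(group):
    # Degree-1 least-squares fit: the first coefficient is the gradient
    gradient, _intercept = np.polyfit(
        group["Market_value_in_mln"], group["Transfer_fee_in_mln"], 1
    )
    return gradient

# One gradient per season
slopes = {season: slope(g) for season, g in toy.groupby("Season_transferred")}
print(slopes)
```

A slope near 1 means fees track market values closely, while a slope well above 1 means fees grow faster than the listed values.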

In [None]:
goalkeepers = transfers_cleaned[transfers_cleaned['Position']=='Goalkeeper']
goalkeepers.sort_values(['Transfer_fee_in_mln'],ascending =False).head()

In [None]:
cb = transfers_cleaned[transfers_cleaned['Position']=='Centre-Back']
cb.sort_values(['Transfer_fee_in_mln'],ascending =False).head()

These are just a few examples to back up my points qualitatively: prices for elite players have gone up recently even in defensive positions, and the overpayment is very pronounced for defenders. Van Dijk and Laporte were sold for more than twice their listed market value.

# Machine Learning

**Can we guess the transfer fee of a player by factors that do not explicitly quantify their playing ability?**

Inspecting the dataset, we see that a lot of column headers have nothing to do with how good the player is. At most, there is a tangential connection, but the assumption that good players go to good teams is tenuous at best because 1) football is fluid, the teams which are good change all the time and 2) the same team that signed Neymar also signed Eric Maxim Choupo-Moting.

For this, I have dropped the names and market value so that the model cannot lean on them. Names would let the model memorise individual players (e.g. Zlatan and Ronaldo both command very high transfer fees, and the model should not be trained on who is being transferred), and market value is obviously correlated with the transfer fee.

The team the player is going to can still be used as an indicator, even though it is an 'after-the-fact' item, because transfer rumours are so common. As such, we often have a good idea of the team, or possible teams, that the player will be transferred to.
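
To make the feature preparation concrete, here is a small sketch of dropping the leaky columns and one-hot encoding what is left with `pd.get_dummies`. The players and clubs are placeholders I made up; only the column names follow the notebook.

```python
import pandas as pd

# Made-up rows with the notebook's column names
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Team_to": ["Club X", "Club Y", "Club X"],
    "Age": [22, 27, 30],
    "Market_value_in_mln": [10.0, 20.0, 5.0],
    "Transfer_fee_in_mln": [12.0, 25.0, 4.0],
})

# Drop the name, the market value and the target, then one-hot encode the text columns
features = pd.get_dummies(
    toy.drop(columns=["Name", "Market_value_in_mln", "Transfer_fee_in_mln"])
)
print(list(features.columns))
```

Every distinct team becomes its own column, so the feature count grows quickly on the real data.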

In [None]:
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score, accuracy_score
transfers_cleaned_without_name = transfers_cleaned.drop(columns = ['Name','Market_value_in_mln'])
transfers_cleaned_without_name = pd.get_dummies(transfers_cleaned_without_name)
y= transfers_cleaned_without_name['Transfer_fee_in_mln']
X= transfers_cleaned_without_name.drop(columns = ['Transfer_fee_in_mln'])


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 69,test_size = 0.3)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
lr = LinearRegression()
fit = lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)

coef = lr.coef_
mse = mean_squared_error(y_test,y_pred)
r2score = r2_score(y_test,y_pred) 
print(mse)
print(r2score)

So we get an R² of about 0.21, meaning the model explains roughly 21% of the variance in transfer fees. Of course, this isn't good by any means.

That being said, 21% is actually remarkably good, as these variables say little at most about a player's ability, which should be the main determinant of the transfer fee. This goes to show that at least some correlation exists between a player's non-ability-related attributes and his transfer fee. The most obvious possible conclusion is that the price range is higher when selling to certain teams. This may have less to do with the inherent ability of the player and more to do with the resources of the owner: Chelsea, Man City and PSG are all owned by really rich people and have thus spent higher fees on players. Whether this correlation is related to the hypothesis that 'as teams get richer, they will buy better players' requires more discussion.
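
For context, the score printed by `r2_score` is the coefficient of determination, not a classification accuracy: it compares the model's squared errors with those of always predicting the mean fee. Below is a hand-rolled sketch of that formula on made-up fees; the `r_squared` helper is mine and is only meant to show the arithmetic.

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

fees = [5.0, 10.0, 20.0, 45.0]   # made-up fees in millions, mean = 20
baseline = [20.0] * 4            # always predict the mean fee

print(r_squared(fees, baseline))  # 0.0: no better than guessing the mean
print(r_squared(fees, fees))      # 1.0: perfect predictions
```

A score of 0.21 therefore sits between "no better than predicting the average fee" and a perfect fit.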

For that reason, we will carry on to analyse if transfer fees of players can be estimated WITH factors relating to their playing ability.

**How much can we improve our model by?**

Quantitatively judging a player's ability is difficult: although we are sure that Ronaldo and Messi are one tier ahead of everyone else, how do we know where great players like Hazard, Mbappe, Haaland etc. stand relative to each other?

The best option would be to use FIFA overall and potential ratings as a proxy for a player's inherent present and future quality. While there has been and always will be debate about the legitimacy of FIFA ratings, they are the closest thing we have to a measure of a player's ability. We use the [FIFA 20 complete player dataset](https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset) for this purpose.

**Why just overall and potential?**

To decrease the number of features considered and prevent overfitting. Also, some statistics such as value and wage can be subsumed under market value.

In [None]:
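# Keep the 2015-2018 seasons; .copy() avoids SettingWithCopyWarning when adding Lastname
transfers_cleaned_2015_18 = transfers_cleaned[transfers_cleaned['Season_transferred']>2014].copy()

# Use the last word of the name as the join key with the FIFA data
transfers_cleaned_2015_18['Lastname'] = transfers_cleaned_2015_18['Name'].str.split(' ').str[-1]
transfers_cleaned_2015_18 = transfers_cleaned_2015_18.drop(columns = ['Name'])
transfers_cleaned_2015_18.head()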

Several key issues exist with working with the FIFA dataset together with the transfers dataset.

1. Data only exists from 2015 to 2021, and transfer data is up till 2018.

For this, we have no choice but to use only the FIFA 15 to 18 ratings within the dataset.

2. The way data is entered differs between the two datasets.

For example, the transfers dataset may list Manchester United as 'Man Utd', while the FIFA dataset lists it as 'Manchester United'. To standardise, I will use fuzzywuzzy to find the most similar team names (e.g. Barcelona and FC Barcelona) above a certain scoring threshold, and standardise the names as such.

As for player names, only their last names will be included in the data. While this may lead to inconsistencies (it was observed that there are three players named Martinez at Porto), it greatly simplifies joining the two dataframes, which was the more critical concern given that this was done in the span of two days.

Due to the presence of certain special characters, all data will also be converted to ASCII to make joining the two dataframes easier.

Finally, a round of manual replacement will be done. As many values are still likely to differ after that, the target is to end up with 75% of the data from the 2015-2018 transfers.

3. Transfer data may cause inconsistencies within the FIFA dataset.

Was the information in the FIFA dataset collected before or after the transfer took place? Both are plausible, given that there are two transfer windows in a season and FIFA is released in between them.

To limit the ambiguity as much as possible, the transfer data will first be merged on the team the player left (the Team_from column). The rows that come back null will then be collected and re-merged on the team the player joined (the Team_to column). That way, the player is accounted for whether the data was collected before or after the transfer.
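
The name standardisation described in point 2 can be sketched with the standard library alone. Note the assumptions: `difflib` is used here as a stand-in for fuzzywuzzy (it scores similarity from 0 to 1 rather than 0 to 100 and will not give identical results), the helper names `to_ascii` and `best_match` are mine, and the accented club spelling is a made-up example.

```python
import difflib
import unicodedata

def to_ascii(text):
    # Decompose accented characters, then drop anything that is not plain ASCII
    return (unicodedata.normalize("NFKD", text)
            .encode("ascii", errors="ignore")
            .decode("ascii"))

def best_match(name, candidates, cutoff=0.8):
    # Return the closest candidate, or None if nothing is similar enough
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

transfer_clubs = ["FC Barcelona", "Atletico Madrid", "Man Utd"]

print(to_ascii("Atlético Madrid"))                      # accent stripped
print(best_match("Barcelona", transfer_clubs))          # close enough to match
print(best_match("Manchester United", transfer_clubs))  # too different: None
```

The last call shows why a round of manual replacement is still needed: abbreviations such as 'Man Utd' are too far from the full name for a similarity threshold to catch.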

In [None]:
fifa_15 = pd.read_csv('../input/fifa-20-complete-player-dataset/players_15.csv',usecols = ['short_name','overall','potential','club'])
fifa_15['Season'] = 2015
fifa_16 = pd.read_csv('../input/fifa-20-complete-player-dataset/players_16.csv',usecols = ['short_name','overall','potential','club'])
fifa_16['Season'] = 2016
fifa_17 = pd.read_csv('../input/fifa-20-complete-player-dataset/players_17.csv',usecols = ['short_name','overall','potential','club'])
fifa_17['Season'] = 2017
fifa_18 = pd.read_csv('../input/fifa-20-complete-player-dataset/players_18.csv',usecols = ['short_name','overall','potential','club'])
fifa_18['Season'] = 2018

In [None]:
fifas_merged = pd.concat([fifa_15,fifa_16,fifa_17,fifa_18],ignore_index=True)
fifas_merged['Lastname'] = fifas_merged['short_name'].str.split(' ').str[-1]
fifas_merged = fifas_merged.drop(columns = ['short_name'])
fifas_merged.head()

In [None]:
from fuzzywuzzy import process
unique = transfers_cleaned_2015_18.Team_from.unique()
for team in fifas_merged.club.unique():
    for found, score in process.extract(team,unique,limit = 1):
        if score > 80:
            fifas_merged['club']=fifas_merged['club'].replace([team],found)

fifas_merged.head()

In [None]:
fifas_merged['Lastname'] = fifas_merged['Lastname'].str.normalize('NFKD')\
       .str.encode('ascii', errors='ignore')\
       .str.decode('utf-8')
fifas_merged['club'] = fifas_merged['club'].str.normalize('NFKD')\
       .str.encode('ascii', errors='ignore')\
       .str.decode('utf-8')
transfers_cleaned_2015_18['Lastname'] = transfers_cleaned_2015_18['Lastname'].str.normalize('NFKD')\
       .str.encode('ascii', errors='ignore')\
       .str.decode('utf-8')
transfers_cleaned_2015_18['Team_from'] = transfers_cleaned_2015_18['Team_from'].str.normalize('NFKD')\
       .str.encode('ascii', errors='ignore')\
       .str.decode('utf-8')
transfers_cleaned_2015_18['Team_to'] = transfers_cleaned_2015_18['Team_to'].str.normalize('NFKD')\
       .str.encode('ascii', errors='ignore')\
       .str.decode('utf-8')

In [None]:
#fifas_merged['Lastname']=fifas_merged['Lastname'].replace(to_replace = '.*ić$',value = '.*ic$',regex = True)
fifas_merged['club']=fifas_merged['club'].replace(to_replace = ['FC Girondins de Bordeaux','Manchester United','Manchester City','Tottenham Hotspur','Olympique Lyonnais','Borussia Dortmund','Wolverhampton Wanderers','Galatasaray SK','FC Bayern München'],\
                                                  value = ['G. Bordeaux','Man Utd','Man City','Spurs','Olympique Lyon','Bor. Dortmund','Wolves','Galatasaray','Bayern Munich'])
fifas_merged['Lastname']=fifas_merged['Lastname'].replace(to_replace =['Yanga-M\'Biwa','Yılmaz','N\'Zonzi','Adama'],value = ['Yanga-Mbiwa','Yilmaz','Nzonzi','Traore'])

In [None]:
transfers_cleaned_with_fifa = transfers_cleaned_2015_18.merge(fifas_merged, how = 'left',left_on=['Lastname','Season_transferred','Team_from'],right_on=[ 'Lastname','Season','club'])
transfers_null = transfers_cleaned_with_fifa[transfers_cleaned_with_fifa['club'].isnull()]
transfers_null = transfers_null.drop(columns = ['club','overall','potential','Season'])
transfers_null_with_fifa = transfers_null.merge(fifas_merged, how = 'left',left_on=['Lastname','Season_transferred','Team_to'],right_on=[ 'Lastname','Season','club'])
transfers_with_fifa = pd.concat([transfers_cleaned_with_fifa,transfers_null_with_fifa],ignore_index = True).dropna()

In [None]:
print(transfers_with_fifa.shape)

We end up with 731 rows, which is about 73% of the data we started with. This is short of the 75% target that was set, and having less data to train on may hurt model performance.
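
The two-pass merge and the resulting match rate can be illustrated on a toy example. Everything here is made up (players, clubs, ratings); only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Three made-up transfers
transfers_toy = pd.DataFrame({
    "Lastname": ["Alpha", "Beta", "Gamma"],
    "Season_transferred": [2017, 2017, 2018],
    "Team_from": ["Club A", "Club B", "Club C"],
    "Team_to": ["Club X", "Club Y", "Club Z"],
})
# Made-up ratings: Beta is already listed at the buying club, Gamma is missing
fifa_toy = pd.DataFrame({
    "Lastname": ["Alpha", "Beta"],
    "Season": [2017, 2017],
    "club": ["Club A", "Club Y"],
    "overall": [80, 75],
})
fifa_keys = ["Lastname", "Season", "club"]

# Pass 1: match on the selling club
first = transfers_toy.merge(
    fifa_toy, how="left",
    left_on=["Lastname", "Season_transferred", "Team_from"], right_on=fifa_keys,
)
matched_first = first[first["club"].notna()]

# Pass 2: retry the unmatched rows on the buying club
unmatched = first[first["club"].isna()][transfers_toy.columns]
second = unmatched.merge(
    fifa_toy, how="left",
    left_on=["Lastname", "Season_transferred", "Team_to"], right_on=fifa_keys,
)
matched_second = second[second["club"].notna()]

matched = pd.concat([matched_first, matched_second], ignore_index=True)
match_rate = len(matched) / len(transfers_toy)
print(len(matched), round(match_rate, 2))
```

Here two of the three transfers find a rating, which is the same kind of ratio as the 73% quoted for the real data.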

We will run a random forest regressor with all the data this time. To reduce the number of features, we drop the leagues as the leagues and teams are heavily correlated.


In [None]:
transfers_fifa_without_name = transfers_with_fifa.drop(columns = ['club','Season','League_from','League_to'])
transfers_fifa_without_name = pd.get_dummies(transfers_fifa_without_name)
y2= transfers_fifa_without_name['Transfer_fee_in_mln']
X2= transfers_fifa_without_name.drop(columns = ['Transfer_fee_in_mln'])

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2,y2, random_state = 169,test_size = 0.3)
print(X_train2.shape)
print(X_test2.shape)
print(y_train2.shape)
print(y_test2.shape)
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 270,max_features = 650, random_state =169)
fit2 = rfr.fit(X_train2,y_train2)
y_pred2 = rfr.predict(X_test2)

mse2 = mean_squared_error(y_test2,y_pred2)
r2score2 = r2_score(y_test2,y_pred2) 
print(mse2)
print(r2score2)

We get an R² of about **0.80** (80.0%), which is not too bad for a first try.

The following should be noted about this score and the quality of the results:

**1. The transfer fee of a player is far more strongly tied to their market value than to their overall/potential ratings.**

In an earlier run of the RF regressor, I was only able to obtain an R² of about 0.44 when dropping the market value column (please try it out for yourself too!). This is interesting: if we assume that FIFA ratings are a somewhat accurate representation of conventional opinion on a player's current and future ability, then ability as FIFA rates it explains much less of the price a player is actually sold for than market value does. I initially assumed that market value and FIFA judgments of ability were rather correlated, which is why I wanted to drop the feature. It turns out that they overlap far less than I expected.

If we compare against the first linear regression (admittedly a much simpler model), the effect of adding market value far outweighs the effect of adding the FIFA ratings. If we trust the judgment of the FIFA raters, it is very much possible that teams are overpaying and/or getting bargains for players in the transfer market.

**2. This model is REALLY limited.**

This model suffers from a lack of data, with only four seasons' worth of transfer data (2015-2018) present, of which only 73% was used due to difficulties in the data cleaning process. Furthermore, FIFA ratings are insufficient for determining the ability of a player, and composite measures (maybe football stats from Opta) would give a more accurate picture of how good or bad a player is. With more data, the model could likely be improved.

**3. Our model is affected by the outliers heavily.**

Our mean squared error stands at about 47.90, which makes for a root mean squared error (RMSE) of about 6.92 million. Looking at the previous scatterplot of market value against transfer fees, we see that many players are sold for below 25 million. An RMSE of roughly 7 million is thus rather significant relative to the model's predictions.

However, in the context of the earlier analysis, 2015-2018 was when most of the outliers on the graph popped up, and the errors in predicting big-money transfers such as Neymar's, Mbappe's and Coutinho's, which all cost above 100 million, would push up the RMSE. Given the wide variation of transfer fees at the top end of the spectrum, the R² of 0.80 is the more useful metric to focus on here than the MSE.
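
To see how strongly a single big-money miss can move the RMSE, here is a small sketch with made-up prediction errors (in millions). The `rmse` helper is my own, and the median absolute error is included only as a contrast that barely reacts to outliers.

```python
import math
import statistics

def rmse(errors):
    # Root of the mean squared error
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Made-up prediction errors in millions; the last one mimics a badly missed mega-transfer
typical = [2.0, -3.0, 1.0, -2.0, 3.0, -1.0, 2.0, -2.0, 1.0]
with_outlier = typical + [60.0]

print(round(rmse(typical), 2), round(rmse(with_outlier), 2))
print(statistics.median(abs(e) for e in typical),
      statistics.median(abs(e) for e in with_outlier))

# The RMSE quoted above follows directly from the MSE
print(round(math.sqrt(47.90), 2))
```

One large error multiplies the RMSE several times over while the median absolute error stays put, which is why the RMSE looks harsh on a dataset with a few 100-million-plus transfers.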

# Conclusion

Thank you all for bearing with me, and I hope you enjoyed this dive into Football Transfer Analysis and had as much fun as I had. Please give me comments for improvement and let me know if you liked what I did. Definitely, there is room for improvement in the models and the data analysis, so I'll come back from time to time and try to improve on this model. See you soon :)

And for the friends who I shared this journey with, thanks so much for viewing my creation. It means a lot and I really appreciate y'all. <3