# SC1015 Mini Project

---

# How do the non-football attributes of a footballer affect his wages?

### Essential Libraries

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [51]:
# Basic Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt # we only need pyplot
import warnings
warnings.filterwarnings('ignore')
import plotly.express as ex
from plotly.subplots import make_subplots
import plotly.graph_objs as go
sns.set() # set the default Seaborn style for graphics

### Import the FIFA Dataset

The dataset is in CSV format, so we use the `read_csv` function from Pandas to import the FIFA 19 dataset.

In [52]:
fifaData = pd.read_csv('Fifa 19.csv')
fifaData.head()

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0,€196.4M
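
The `Unnamed: 0` column in the preview is most likely a row index that was written into the CSV when it was saved. A minimal sketch, using a small in-memory CSV as a stand-in for the real file (the sample text and the `demo_*` names are illustrative, not part of the project), shows how `index_col=0` reads such a column back as the index instead of as data:

```python
import io

import pandas as pd

# Toy stand-in for a CSV that was saved together with its row index.
csv_text = """,Name,Age,Wage
0,L. Messi,31,€565K
1,Cristiano Ronaldo,33,€405K
"""

# Default read: the unnamed first column becomes a data column "Unnamed: 0".
demo_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead.
demo_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(demo_default.columns))
print(list(demo_indexed.columns))
```

Whether this applies to `Fifa 19.csv` depends on how that file was written, so it is worth checking the first column before relying on it.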


In [53]:
fifaData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 89 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                18207 non-null  int64  
 1   ID                        18207 non-null  int64  
 2   Name                      18207 non-null  object 
 3   Age                       18207 non-null  int64  
 4   Photo                     18207 non-null  object 
 5   Nationality               18207 non-null  object 
 6   Flag                      18207 non-null  object 
 7   Overall                   18207 non-null  int64  
 8   Potential                 18207 non-null  int64  
 9   Club                      17966 non-null  object 
 10  Club Logo                 18207 non-null  object 
 11  Value                     18207 non-null  object 
 12  Wage                      18207 non-null  object 
 13  Special                   18207 non-null  int64  
 14  Prefer

### Dropping the columns of data that cannot be used 

First, we drop the columns that cannot be used in determining wages, because they carry no useful information for this analysis.

The columns are 'ID', 'Photo', 'Flag', 'Club', 'Club Logo', 'Value', 'Special', 'Preferred Foot', 'Work Rate', 'International Reputation', 'Weak Foot', 'Position', 'Skill Moves', 'Body Type', 'Real Face', 'Jersey Number', 'Joined', 'Loaned From', 'Contract Valid Until' and 'Release Clause'.


In [54]:
fifaData.drop(['ID', 'Photo', 'Flag', 'Club', 'Club Logo', 'Value', 'Special', 'Preferred Foot', 'Work Rate', 
               'International Reputation', 'Weak Foot', 'Position', 'Skill Moves', 'Body Type', 
               'Real Face', 'Jersey Number', 'Joined', 'Loaned From', 'Contract Valid Until', 
               'Release Clause'], axis = 1, inplace = True)
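
Because the drop above is done in place, running the same cell a second time raises a `KeyError`, as the columns are already gone. One possible alternative, sketched here on a toy frame (the `players`, `unused_columns` and `trimmed` names are illustrative), is to return a new frame and pass `errors="ignore"` so that missing labels are skipped:

```python
import pandas as pd

# Toy frame with a few of the column names from the FIFA 19 data.
players = pd.DataFrame({
    "ID": [158023, 20801],
    "Name": ["L. Messi", "Cristiano Ronaldo"],
    "Age": [31, 33],
    "Photo": ["messi.png", "ronaldo.png"],
    "Wage": ["€565K", "€405K"],
})

# "Flag" is deliberately absent from the toy frame.
unused_columns = ["ID", "Photo", "Flag"]

# errors="ignore" skips labels that are not present, so the cell can be re-run.
trimmed = players.drop(columns=unused_columns, errors="ignore")

print(trimmed.columns.tolist())
```

The original `players` frame is left untouched, which makes it easier to go back and inspect a dropped column later.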

In [55]:
fifaData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 69 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       18207 non-null  int64  
 1   Name             18207 non-null  object 
 2   Age              18207 non-null  int64  
 3   Nationality      18207 non-null  object 
 4   Overall          18207 non-null  int64  
 5   Potential        18207 non-null  int64  
 6   Wage             18207 non-null  object 
 7   Height           18159 non-null  object 
 8   Weight           18159 non-null  object 
 9   LS               16122 non-null  object 
 10  ST               16122 non-null  object 
 11  RS               16122 non-null  object 
 12  LW               16122 non-null  object 
 13  LF               16122 non-null  object 
 14  CF               16122 non-null  object 
 15  RF               16122 non-null  object 
 16  RW               16122 non-null  object 
 17  LAM         

### Dropping the columns of data that are non-specific

We also drop the columns that contain each player's rating when placed in every possible position. These ratings are not specific to the position a player actually plays, so we do not use them.

`Overall` is a better variable to use, since it summarises a player's ability in a single rating.

The columns being dropped are 'LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW', 'LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM', 'LWB', 'LDM', 'CDM', 'RDM', 'RWB', 'LB', 'LCB', 'CB', 'RCB' and 'RB'.

In [56]:
# Drop the 26 position-rating columns, which sit at column positions 9 to 34
fifaData.drop(columns=fifaData.columns[9:35], inplace=True)
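
The positional slice `9:35` only selects the right columns while the column order stays exactly as shown above. A sketch of a name-based alternative follows; it assumes the 26 position names listed in the text, and the `players` frame with its values is a toy example:

```python
import pandas as pd

# The 26 position-rating columns named in the text above.
position_columns = [
    "LS", "ST", "RS", "LW", "LF", "CF", "RF", "RW", "LAM", "CAM", "RAM",
    "LM", "LCM", "CM", "RCM", "RM", "LWB", "LDM", "CDM", "RDM", "RWB",
    "LB", "LCB", "CB", "RCB", "RB",
]

# Toy frame containing only two of the position columns (values are made up).
players = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Overall": [94, 91],
    "ST": ["88+2", "70+2"],
    "CB": ["47+2", "60+2"],
})

# Keep only the labels that exist, then drop them by name.
present = [col for col in position_columns if col in players.columns]
trimmed = players.drop(columns=present)

print(trimmed.columns.tolist())
```

Dropping by name gives the same result even if an earlier cell adds or removes a column.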

In [57]:
fifaData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 43 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       18207 non-null  int64  
 1   Name             18207 non-null  object 
 2   Age              18207 non-null  int64  
 3   Nationality      18207 non-null  object 
 4   Overall          18207 non-null  int64  
 5   Potential        18207 non-null  int64  
 6   Wage             18207 non-null  object 
 7   Height           18159 non-null  object 
 8   Weight           18159 non-null  object 
 9   Crossing         18159 non-null  float64
 10  Finishing        18159 non-null  float64
 11  HeadingAccuracy  18159 non-null  float64
 12  ShortPassing     18159 non-null  float64
 13  Volleys          18159 non-null  float64
 14  Dribbling        18159 non-null  float64
 15  Curve            18159 non-null  float64
 16  FKAccuracy       18159 non-null  float64
 17  LongPassing 

### Dropping the columns of data that are football-related

Lastly, we drop the specific skill statistics, because the aim of our project is to find out how non-football attributes determine wages.

The columns being dropped are 'Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower', 'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning' and 'GKReflexes'.

In [58]:
# Drop the 34 football skill columns, which sit at column positions 9 to 42
fifaData.drop(columns=fifaData.columns[9:43], inplace=True)

In [59]:
fifaData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   18207 non-null  int64 
 1   Name         18207 non-null  object
 2   Age          18207 non-null  int64 
 3   Nationality  18207 non-null  object
 4   Overall      18207 non-null  int64 
 5   Potential    18207 non-null  int64 
 6   Wage         18207 non-null  object
 7   Height       18159 non-null  object
 8   Weight       18159 non-null  object
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


### Data Cleaning

We now focus on the variables that we are most likely to use in determining wages. First, we have to clean the data so that we can use it.

In [60]:
fifaData.head()

Unnamed: 0.1,Unnamed: 0,Name,Age,Nationality,Overall,Potential,Wage,Height,Weight
0,0,L. Messi,31,Argentina,94,94,€565K,5'7,159lbs
1,1,Cristiano Ronaldo,33,Portugal,94,94,€405K,6'2,183lbs
2,2,Neymar Jr,26,Brazil,92,93,€290K,5'9,150lbs
3,3,De Gea,27,Spain,91,93,€260K,6'4,168lbs
4,4,K. De Bruyne,27,Belgium,91,92,€355K,5'11,154lbs


In [61]:
# Clean the Wage column and convert it from object (e.g. '€565K') to integer
# The '€' and 'K' are removed as literal text, not as regular expressions
fifaData['Wage'] = (fifaData['Wage']
                    .str.replace('€', '', regex=False)
                    .str.replace('K', '', regex=False)
                    .astype(int))

# Remove the rows with a Wage of 0
fifaData = fifaData.loc[fifaData['Wage'] != 0]

# Rename the column from Wage to Wage (Thousands)
fifaData = fifaData.rename(columns={'Wage': 'Wage (Thousands)'})


In [None]:
fifaData.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 17966 entries, 0 to 18206
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Unnamed: 0        17966 non-null  int64 
 1   Name              17966 non-null  object
 2   Age               17966 non-null  int64 
 3   Nationality       17966 non-null  object
 4   Overall           17966 non-null  int64 
 5   Potential         17966 non-null  int64 
 6   Wage (Thousands)  17966 non-null  int64 
 7   Height            17918 non-null  object
 8   Weight            17918 non-null  object
dtypes: int64(5), object(4)
memory usage: 1.4+ MB
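
`Height` and `Weight` are still stored as text (for example `5'7` and `159lbs`) and contain some missing values, so they cannot be passed straight to `astype(int)`. The following is a minimal sketch of one way to convert them to numbers, assuming every non-missing value follows those two formats; the helper names, the new column names and the toy `players` frame are illustrative:

```python
import pandas as pd

CM_PER_INCH = 2.54
KG_PER_LB = 0.45359237


def height_to_cm(height):
    """Convert a feet'inches string such as "5'7" to centimetres."""
    if pd.isna(height):
        return float("nan")
    feet, inches = height.split("'")
    return (int(feet) * 12 + int(inches)) * CM_PER_INCH


def weight_to_kg(weight):
    """Convert a string such as "159lbs" to kilograms."""
    if pd.isna(weight):
        return float("nan")
    return int(weight.replace("lbs", "")) * KG_PER_LB


# Toy frame: two rows from the preview above plus one missing row.
players = pd.DataFrame({
    "Height": ["5'7", "6'2", None],
    "Weight": ["159lbs", "183lbs", None],
})

players["Height (cm)"] = players["Height"].map(height_to_cm)
players["Weight (kg)"] = players["Weight"].map(weight_to_kg)

print(players[["Height (cm)", "Weight (kg)"]].round(1))
```

Missing values stay as `NaN`, so the decision to drop or fill those rows can be made separately.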


### Correlation Matrices

In [None]:
# Correlate the numeric columns only; text columns such as Name cannot be correlated
numericData = fifaData.select_dtypes(include='number')

fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
                    subplot_titles=('Pearson Correlation', 'Spearman Correlation'))

# Pearson correlation (top panel); its colour scale is hidden so only one scale is shown
pearson = numericData.corr(method='pearson')
fig.add_trace(
    go.Heatmap(x=pearson.columns, y=pearson.index, z=pearson.values,
               name='pearson', showscale=False, xgap=1, ygap=1),
    row=1, col=1
)

# Spearman rank correlation (bottom panel)
spearman = numericData.corr(method='spearman')
fig.add_trace(
    go.Heatmap(x=spearman.columns, y=spearman.index, z=spearman.values,
               name='spearman', xgap=1, ygap=1),
    row=2, col=1
)

fig.update_layout(height=700, width=900, title_text='Correlations')
fig.show()

### As can be seen, the numerical data that correlate most with __ are.
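
A heatmap can be hard to read precisely, so it may help to list the correlations with one target column in order of strength. The sketch below assumes `Wage (Thousands)` is the column of interest and uses a small toy frame; the values are for illustration only and the resulting numbers say nothing about the full dataset:

```python
import pandas as pd

# Toy frame with the numeric columns kept after cleaning (values are illustrative).
players = pd.DataFrame({
    "Age": [31, 33, 26, 27, 27, 19],
    "Overall": [94, 94, 92, 91, 91, 70],
    "Potential": [94, 94, 93, 93, 92, 85],
    "Wage (Thousands)": [565, 405, 290, 260, 355, 5],
})

target = "Wage (Thousands)"

# Correlation of every numeric column with the target, excluding the target itself.
with_target = players.corr(method="spearman")[target].drop(target)

# Order by absolute strength while keeping the sign of each coefficient.
order = with_target.abs().sort_values(ascending=False).index
ranking = with_target.reindex(order)

print(ranking)
```

Swapping `method="spearman"` for `method="pearson"` gives the ranking for the other matrix.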

### Bar chart
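
As a rough sketch of what a bar chart for this project could look like, the example below groups players into age bands and plots the mean wage per band. The age bands, their labels and the toy `players` frame are assumptions made for illustration; the real analysis may need different bins:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame (values are illustrative, not taken from the full dataset).
players = pd.DataFrame({
    "Age": [19, 22, 24, 26, 27, 31, 33, 35],
    "Wage (Thousands)": [5, 20, 60, 290, 260, 565, 405, 80],
})

# Hypothetical age bands; pd.cut uses right-closed intervals, e.g. (15, 20].
age_bins = [15, 20, 25, 30, 35, 50]
age_labels = ["16-20", "21-25", "26-30", "31-35", "36+"]
players["Age Group"] = pd.cut(players["Age"], bins=age_bins, labels=age_labels)

# Mean wage per age band; observed=True leaves out bands with no players.
mean_wage = players.groupby("Age Group", observed=True)["Wage (Thousands)"].mean()

fig, ax = plt.subplots(figsize=(8, 4))
mean_wage.plot(kind="bar", ax=ax, color="#1092c9")
ax.set_xlabel("Age Group")
ax.set_ylabel("Mean Wage (Thousands)")
ax.set_title("Mean Wage by Age Group")
fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes; in a script, call `plt.show()` afterwards.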

In [None]:
# plt.rcParams.update({'font.size': 18})
# # ax.tick_params(labelrotation=45)
# fig.tight_layout()
# fig.patch.set_facecolor('#fafafa')
# f, axes = plt.subplots(9, 2, figsize=(18,60))
# f.subplots_adjust(hspace=.45)

# table=pd.crosstab(fifaData.Age,fifaData.Wage)
# table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True,ax = axes[0,0],  color = ['#1092c9','#c91010'])
# axes[0,0].set_title('Age Group vs Wage', y=1.1 , x=1.1)
# plt.subplot(922)
# plt.grid(color = 'gray', linestyle = ':', axis = 'y', zorder = 0,  dashes = (1,7))
# a2 = sns.barplot(data = dst_st_age, x = dst_st_age['age_group'], y = dst_st_age['percent'], hue = dst_st_age['hypertension'], palette = ['#1092c9','#c91010',])
# plt.xticks(rotation = 10)
# plt.ylabel('')
# plt.xlabel('')
# plt.legend('').set_visible(False)

# ##########

# table=pd.crosstab(stroke.stroke,fifaData.Wage)
# table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True,ax = axes[1,0],  color = ['#1092c9','#c91010',])
# axes[1,0].set_title('Stroke vs Wage', y=1.1 , x=1.1)
# plt.subplot(924)
# plt.grid(color = 'gray', linestyle = ':', axis = 'y', zorder = 0,  dashes = (1,7))
# b2 = sns.barplot(data = stroke1, x = stroke1['stroke'], y = stroke1['percent'], hue = stroke1['hypertension'], palette = ['#1092c9','#c91010',])
# plt.ylabel('')
# plt.xlabel('')
# plt.legend('').set_visible(False)

# ##########

# table=pd.crosstab(stroke.heart_disease,fifaData.Wage)
# table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True,ax = axes[2,0], color = ['#1092c9','#c91010',])
# axes[2,0].set_title('Heart Disease vs Wage', y=1.1 , x=1.1)
# plt.subplot(926)
# plt.grid(color = 'gray', linestyle = ':', axis = 'y', zorder = 0,  dashes = (1,7))
# c2 = sns.barplot(data = heart, x = heart['heart_disease'], y = heart['percent'], hue = heart['hypertension'], palette = ['#1092c9','#c91010',])
# plt.ylabel('')
# plt.xlabel('')
# plt.legend('').set_visible(False)

# ##########
# table=pd.crosstab(stroke.ever_married,fifaData.Wage)
# table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True,ax = axes[3,0], color = ['#1092c9','#c91010',])
# axes[3,0].set_title('Ever_Married vs Wage', y=1.1 , x=1.1)
# plt.subplot(928)
# plt.grid(color = 'gray', linestyle = ':', axis = 'y', zorder = 0,  dashes = (1,7))
# d2 = sns.barplot(data = marry, x = marry['ever_married'], y = marry['percent'], hue = marry['hypertension'], palette = ['#1092c9','#c91010'])
# plt.ylabel('')
# plt.xlabel('')
# plt.legend('').set_visible(False)
# ##########
# table=pd.crosstab(stroke.work_type,fifaData.Wage)
# table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True,ax = axes[4,0], color = ['#1092c9','#c91010',])
# axes[4,0].set_title('Work Type vs Wage', y=1.1 , x=1.1)
# plt.subplot(9,2,10)
# plt.grid(color = 'gray', linestyle = ':', axis = 'y', zorder = 0,  dashes = (1,7))
# e2 = sns.barplot(data = work, x = work['work_type'], y = work['percent'], hue = work['hypertension'], palette = ['#1092c9','#c91010'])
# plt.ylabel('')
# plt.xlabel('')
# plt.legend('').set_visible(False)
# ##########
# table=pd.crosstab(stroke.Residence_type,fifaData.Wage)
# table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True,ax = axes[5,0], color = ['#1092c9','#c91010',])
# axes[5,0].set_title('Residence Type vs Wage', y=1.1 , x=1.1)
# plt.ylabel('')
# plt.xlabel('')
# plt.legend('').set_visible(False)

# plt.subplot(9,2,12)
# plt.grid(color = 'gray', linestyle = ':', axis = 'y', zorder = 0,  dashes = (1,7))
# f2 = sns.barplot(data = residence, x = residence['Residence_type'], y = residence['percent'], hue = residence['hypertension'], palette = ['#1092c9','#c91010'])
# plt.ylabel('')
# plt.xlabel('')
# plt.legend('').set_visible(False)
# ##########

# table=pd.crosstab(stroke.glucose_group,fifaData.Wage)
# table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True,ax = axes[6,0], color = ['#1092c9','#c91010'])
# axes[6,0].set_title('Glucose Group vs Wage', y=1.1 , x=1.1)
# plt.subplot(9,2,14)
# plt.grid(color = 'gray', linestyle = ':', axis = 'y', zorder = 0,  dashes = (1,7))
# g2 = sns.barplot(data = glucose_group, x = glucose_group['glucose_group'], y = glucose_group['percent'], hue = glucose_group['hypertension'], palette = ['#1092c9','#c91010'])
# plt.ylabel('')
# plt.xlabel('')
# plt.legend('').set_visible(False)
# ##########
# table=pd.crosstab(stroke.bmi_group,fifaData.Wage)
# table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True,ax = axes[7,0],color = ['#1092c9','#c91010'])
# axes[7,0].set_title('BMI vs Wage', y=1.1 , x=1.1)
# plt.subplot(9,2,16)
# plt.grid(color = 'gray', linestyle = ':', axis = 'y', zorder = 0,  dashes = (1,7))
# h2 = sns.barplot(data = bmi_group, x = bmi_group['bmi_group'], y = bmi_group['percent'], hue = bmi_group['hypertension'], palette = ['#1092c9','#c91010'])
# plt.ylabel('')
# plt.xlabel('')
# plt.legend('').set_visible(False)
# ##########
# table=pd.crosstab(stroke.smoking_status, fifaData.Wage)
# table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True,ax = axes[8,0],color = ['#1092c9','#c91010'])
# axes[8,0].set_title('Smoking Status vs Hypertension', y=1.1 , x=1.1)
# plt.subplot(9,2,18)
# plt.grid(color = 'gray', linestyle = ':', axis = 'y', zorder = 0,  dashes = (1,7))
# j2 = sns.barplot(data = smoking, x = smoking['smoking_status'], y = smoking['percent'], hue = smoking['hypertension'], palette = ['#1092c9','#c91010'])
# plt.ylabel('')
# plt.xlabel('')
# plt.legend('').set_visible(False)
# ##########
# # add annotations
# for i in [a,b,c,d,e,f,g,h,j]:
#     for p in i.patches:
#         height = p.get_height()
#         i.annotate(f'{height:g}', (p.get_x() + p.get_width() / 2, p.get_height()), 
#                    ha = 'center', va = 'center', 
#                    size = 10,
#                    xytext = (0, 5), 
#                    textcoords = 'offset points')

# for i in [a2,b2,c2,d2,e2,f2,g2,h2,j2]:
#     for p in i.patches:
#         height = p.get_height()
#         i.annotate(f'{height:g}%', (p.get_x() + p.get_width() / 2, p.get_height()), 
#                    ha = 'center', va = 'center', 
#                    size = 10,
#                    xytext = (0, 5), 
#                    textcoords = 'offset points')
        
# plt.show()
