# FIFA 20 Player Statistics Analysis

### Importing the libraries needed

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Read the data
Also raise the maximum column display so that every column in the dataset is visible

In [None]:
pd.set_option("display.max_columns", 74)
df = pd.read_csv('../input/fifa-20-complete-player-dataset-for-manager-mode/fifa20data.csv', low_memory=False)
df.head(10)

### Start with exploratory data analysis.

#### The table below shows information about the columns, using .info

In [None]:
df.info()

#### Check the shape of the data using .shape

In [None]:
df.shape

#### As shown above, every column's non-null count equals the total number of rows, meaning there are no null or NaN values, so the dataset is ready to be processed.

But just to make sure, use isna().sum() to count the NaN values in each column of interest.

In [None]:
df[['Overall', 'PAC', 'SHO', 'PAS', 'DRI', 'DEF', 'PHY', 'foot']].isna().sum()
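The check above covers eight columns. A minimal sketch of the same idea applied to every column is shown here on a made-up toy frame (the values and the `toy` name are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Overall": [90, 85, None],
    "PAC": [80, 70, 60],
    "foot": ["Right", None, "Left"],
})

# Count missing values in every column.
missing_per_column = toy.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

On the real data, an empty result from the last step would confirm that no column contains NaN.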

#### Because there are no NaN values, the data is ready to be processed.
Start with the average, min and max values of the variables using .describe

In [None]:
df[['Overall', 'PAC', 'SHO', 'PAS', 'DRI', 'DEF', 'PHY']].describe()

The table above describes the data, including one of the statistics we want to focus on: the mean. Let's print these values as shown below, and also draw a bar chart with sns.barplot so that they are easier to compare.


In [None]:
overall_mean = df['Overall'].mean()
pace_mean = df['PAC'].mean()
shot_mean = df['SHO'].mean()
passing_mean = df['PAS'].mean()
dribbling_mean = df['DRI'].mean()
defence_mean = df['DEF'].mean()
physic_mean = df['PHY'].mean()

ol = {
    "abilities" : ["Overall","PAC", "SHO", "PAS", "DRI", "DEF", "PHY"],
    "average_value_of_the_abilities" : [overall_mean, pace_mean, shot_mean, passing_mean, dribbling_mean, defence_mean, physic_mean]
}

ob = pd.DataFrame(ol)
plt.figure(figsize=(16,7))
sns.barplot(x="abilities", y="average_value_of_the_abilities", data=ob)
plt.ylabel("Average Score")
plt.xlabel("Abilities")
print("Average value of player's ability score is {}".format(overall_mean))
print("Average value of player's pace is {}".format(pace_mean))
print("Average value of player's shot is {}".format(shot_mean))
print("Average value of player's passing is {}".format(passing_mean))
print("Average value of player's dribbling is {}".format(dribbling_mean))
print("Average value of player's defence is {}".format(defence_mean))
print("Average value of player's physic is {}".format(physic_mean))

### The lowest average value is the average defense score, and the highest average value is the average pace score
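The seven separate `.mean()` calls can also be written as one column-wise mean. The sketch below is a possible compact variant, run on made-up numbers with only three of the columns; `toy`, `ability_cols` and `ability_means` are illustrative names:

```python
import pandas as pd

# Toy data with made-up scores; the real notebook would use df instead.
toy = pd.DataFrame({
    "Overall": [60, 70, 80],
    "PAC": [50, 60, 70],
    "SHO": [40, 50, 60],
})

ability_cols = ["Overall", "PAC", "SHO"]

# One call computes the mean of every selected column.
ability_means = toy[ability_cols].mean()
for name, value in ability_means.items():
    print(f"Average {name} score: {value:.2f}")

# idxmin / idxmax name the abilities with the lowest and highest mean.
print("Lowest:", ability_means.idxmin(), "Highest:", ability_means.idxmax())

# Reshape into the two-column frame that sns.barplot expects.
ob = ability_means.rename_axis("abilities").reset_index(
    name="average_value_of_the_abilities"
)
```

The resulting `ob` frame has the same two columns as the dictionary-based version above, so the plotting call would not need to change.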

#### Now let's see the data distribution of these abilities / variables, using plot.hist

In [None]:
df["Overall"].plot.hist(bins=49, figsize=(15,7), color='tab:blue', title='Overall Ability')

In [None]:
df["PAC"].plot.hist(figsize=(15,7), bins=34, color='tab:orange', title='Pace')

In [None]:
df["SHO"].plot.hist(figsize=(15,7), bins=25, color='tab:green', title='Shooting')

In [None]:
df["PAS"].plot.hist(figsize=(15,7), bins=34, color='tab:red', title='Passing')

In [None]:
df["DRI"].plot.hist(figsize=(15,7), bins=34, color='tab:purple', title='Dribbling')

In [None]:
df["DEF"].plot.hist(figsize=(15,7), bins=39, color='tab:brown', title='Defence')

In [None]:
df["PHY"].plot.hist(figsize=(15,7), bins=22, color='tab:pink', title='Physical')

#### Compare all of these distributions side by side to see the real differences, as shown below. plt.figure creates a larger plot frame, and each variable is drawn as a KDE curve so that the distributions don't hide each other

In [None]:
plt.figure(figsize=(15,7), dpi= 80)
# sns.distplot is deprecated in recent seaborn releases; sns.kdeplot draws the same KDE curves
sns.kdeplot(df['Overall'], color='tab:green', label="Overall Ability")
sns.kdeplot(df['PAC'], color='tab:red', label="Pace")
sns.kdeplot(df['SHO'], color='tab:blue', label="Shooting")
sns.kdeplot(df['PAS'], color='tab:orange', label="Passing")
sns.kdeplot(df['DRI'], color='tab:olive', label="Dribbling")
sns.kdeplot(df['DEF'], color='tab:purple', label="Defence")
sns.kdeplot(df['PHY'], color='tab:brown', label="Physic")
plt.xlabel("Ability Score")
plt.legend()  # show which curve belongs to which ability


### Above are all of the distributions combined so they can be compared side by side.

### 1. The overall ability score is the most concentrated around its average value.

### 2. The pace, passing, dribbling and physic distributions are slightly more dispersed but still keep a roughly normal shape

### 3. The shooting and defence distributions are far more dispersed toward the lower scores, meaning quite a lot of players have low shooting and defence scores. Who are those "bad shooters" and "bad defenders"?

### There are 7 ability scores in FIFA 20's system: the overall ability plus 6 main abilities. They are:

1. Overall ability
2. Pace
3. Shooting
4. Passing
5. Dribbling
6. Defence
7. Physic

## After the exploratory data analysis, we can start analyzing the aspects we want to focus on. Those are:

1. Average value of overall ability, pace, shooting, passing, dribbling, defence and physic.
2. Data distribution of overall ability, pace, shooting, passing, dribbling, defence and physic.
3. Correlation between overall ability, pace, shooting, passing, dribbling, defence and physic.
4. Football country analysis
5. Shooting and defense ability score analysis (you'll find out why it should be shooting and defense)

Which leads to the following main questions:

1. What is the average value of overall ability, pace, shooting, passing, dribbling, defence and physic in FIFA 20? Which of those abilities / variables has the lowest and the highest average score?
2. How are overall ability, pace, shooting, passing, dribbling, defence and physic distributed in FIFA 20?
3. What is the correlation between overall ability, pace, shooting, passing, dribbling, defence and physic in FIFA 20? Among the 6 main abilities, which influence the overall ability the most? What makes a high-ability player good?
4. Which country produces the most top-ability players? Is it true that Brazil produces the most? Which are the top 10 footballer-producing countries?
5. What do the shooting and defense scores in FIFA 20 tell us?

The plots shown are:

1. Average value of overall ability, pace, shooting, passing, dribbling, defence and physic bar chart
2. Data distribution of overall ability, pace, shooting, passing, dribbling, defence and physic chart
3. Comparison of data distribution of overall ability, pace, shooting, passing, dribbling, defence and physic KDE chart
4. Pairplot of the correlation between overall ability, pace, shooting, passing, dribbling, defence and physic
5. Heatmap of the correlation between overall ability, pace, shooting, passing, dribbling, defence and physic
6. 10 Best footballers producer countries bar chart
7. Shooting ability pie chart
8. Defense ability pie chart





#### Correlation analysis of these abilities / variables
Select the overall ability, pace, shooting, passing, dribbling, defense and physic columns into a single new dataset

In [None]:
dfp = df[['Overall', 'PAC', 'SHO', 'PAS', 'DRI', 'DEF', 'PHY']]
dfp.head(10)

#### Generate a pairplot with sns.pairplot to see the correlation between these abilities / variables. One thing to remember is that I treat overall ability as the dependent variable, because it is the score influenced by the 6 main abilities in FIFA 20

In [None]:
sns.pairplot(dfp)

#### From the look of these scatter plots, there is a strong positive correlation between overall ability and both passing and dribbling. There is also a strong positive correlation between passing and dribbling

#### Generating a heatmap and its annotation to see the value of the correlation, using sns.heatmap

In [None]:
sns.heatmap(dfp.corr(), annot=True)

#### The picture above is the annotated heatmap. Brighter cells indicate a higher (more positive) correlation and darker cells a lower (more negative) one, on a gradual scale, which is really helpful when working with many variables. As a rule of thumb here, a value above 0.5 means a strong positive correlation, a value below -0.5 means a strong negative correlation, and a value around 0.0 means little or no linear correlation. The result is similar to the scatter plots above:

#### 1. There is a fairly strong positive correlation between overall ability and both passing and dribbling (0.67 and 0.63).
#### 2. There is also a strong positive correlation between passing and dribbling (0.83)

Now, to make this clear, let's generate a table ranking these correlations from strongest to weakest, using .corr with the Pearson method and overall ability as the dependent variable.

In [None]:
dfp.corr(method='pearson', min_periods=20)['Overall'].abs().sort_values(ascending=False)

### 1. Passing and dribbling correlate with the overall ability the most.

### 2. As passing ability increases, dribbling ability tends to increase too: statistically, a player who is good at passing also tends to be good at dribbling, and vice versa
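To make concrete what the Pearson coefficient in the ranking above measures, here is a small pure-Python sketch of the textbook formula (covariance divided by the product of the standard deviations). The `pearson_r` helper and the score lists are made up for illustration; this is not how pandas implements it internally:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (textbook formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up scores: dribbling rises roughly in step with passing.
passing = [50, 60, 70, 80]
dribbling = [55, 58, 75, 82]
r = pearson_r(passing, dribbling)
print(round(r, 2))
```

A value close to 1 means the two scores rise together almost linearly; it says nothing about which one drives the other.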


#### Continue with the football country analysis. Which country produces the most top players? Is it true that Brazil produces the most? Which are the top 10 footballer-producing countries?


In [None]:
df.head(10)

Select the name, country, and overall ability columns into a new dataset.

In [None]:
null_count_bp = df[['Name', 'Country', 'Overall']]
null_count_bp.head(10)

Check if there's any NaN value using isna.any

In [None]:
null_count_bp.isna().any()

#### Every column shows "False", meaning there are no null or NaN values, so the dataset is ready to use

#### First, sort the table by overall ability using sort_values and keep only the top 8800 rows using .head. Why 8800? It is a crude approximation of the number of top players across 20 football leagues (1 team = 22 players, 1 league = 22 players * 20 teams, 20 leagues = 22 players * 20 teams * 20 leagues)

In [None]:
players_rank = df.sort_values(by=['Overall'], ascending=False,).head(8800)
players_rank.head(10)
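The cutoff used above follows directly from the three assumed counts. A tiny sketch of the arithmetic, with illustrative variable names:

```python
# Crude assumptions taken from the text above.
players_per_team = 22
teams_per_league = 20
leagues = 20

# 22 * 20 = 440 players per league, times 20 leagues.
top_n = players_per_team * teams_per_league * leagues
print(top_n)  # 8800
```

Keeping the three factors as named values makes it easy to try a different approximation, for example fewer leagues.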

After sorting and cutting the quantity, group players based on their country, using .groupby

In [None]:
grouped_rank = players_rank.groupby('Country')

#### Count the rows in each country using .count, then sort by the Name column. After counting, every column holds the same count for a given country (as long as there are no missing values), so it doesn't matter which column is used

In [None]:
countries_rank = grouped_rank.count().sort_values(by=['Name'], ascending=False).head(10)
countries_rank

### Brazil produces the most top-ability players, with 717 players in the top 8800 worldwide, followed by Spain with 703 players and Argentina with 635 players
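The group, count and sort steps can also be expressed with `value_counts`, which counts rows per value and returns them in descending order. A minimal sketch on made-up rows (the `toy` frame is an assumption, not the real data):

```python
import pandas as pd

# Made-up players; only the Country column matters for the count.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Country": ["Brazil", "Spain", "Brazil", "Spain", "Brazil", "Argentina"],
})

# value_counts groups by value, counts rows, and sorts descending in one step.
top_countries = toy["Country"].value_counts().head(2)
print(top_countries)
```

Unlike `groupby(...).count()`, this returns a single Series of counts rather than one count per column.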

Here is the info of the countries rank table

In [None]:
countries_rank.info()

#### Make a bar chart of the table above using plt.figure for bigger plot, and sns.barplot for the bar.

In [None]:
ff = countries_rank.index
plt.figure(figsize=(16,7), dpi= 80)
sns.barplot(data=countries_rank, x='Name', y=ff)
plt.ylabel("Country")
plt.xlabel("Players Involved")
plt.title("Top 10 High Ability Player Producer Countries")

### 1. Brazil produces the most top-ability players, with 717 players in the top 8800 worldwide, followed by Spain with 703 players and Argentina with 635 players.

### 2. The top 10 countries producing the most top-ability players are Brazil, Spain, Argentina, France, Germany, England, Italy, Colombia, Portugal and Netherlands

#### What is surprising is that Colombia made the list ahead of Portugal and Netherlands, which have some of the best clubs and leagues in the world


For the next analysis, let's go back to our second analysis about the distribution of the ability / variable scores. The shooting and defense score distributions are the most distinct among the 7 abilities / variables: both lean toward the lower scores, meaning quite a lot of players are bad at shooting or at defense. Who are those "bad shooters" and "bad defenders"? Redraw the distribution plots using plot.hist

In [None]:
df["SHO"].plot.hist(figsize=(15,7), bins=25, color='tab:green')
plt.title("Shooting Ability Score Data Distribution")
plt.xlabel("Shooting Ability Score")

In [None]:
df["DEF"].plot.hist(figsize=(15,7), bins=39, color='tab:brown', title='Defense Ability Score Data Distribution')
plt.xlabel("Defense Ability Score")

One possible reason for so many low shooting and defense scores is that player abilities are polarized between attacking roles and defending roles. In real-life football it is common to see a defender with low shooting ability and an attacker with low defending ability. So many of the players with a bad shooting score may simply be defenders, and many of the players with a bad defending score may simply be attackers. Among players whose shooting ability is below the average, how many are defenders? And among players whose defending ability is below the average, how many are attackers?

To start the analysis we need the players' positions, stored in the 'Position' column. This column holds strings, so we can use .unique to list its distinct values.

In [None]:
df['Position'].unique()

#### Players can have many different roles. In this analysis I define the defending roles as GK (goalkeeper), CB (center back), LB (left back), RB (right back), RWB (right wing back) and LWB (left wing back). The attacking roles are ST (striker), CF (center forward), RW (right wing), LW (left wing) and CAM (central attacking midfielder).

#### First, among players who have shooting ability below the average, how many are defenders?
Sort the dataset by lowest shooting ability using sort_values and ascending=True, and make it a new dataset.

In [None]:
shot_rank = df.sort_values(by=['SHO'], ascending=True)
shot_rank.head(10)

Compute the average shooting score using .mean

In [None]:
df['SHO'].mean()

#### The average shooting score is 53.47, so select the players with a shooting score of 52 or lower using .loc, and make it a new dataset.

In [None]:
shot_rank = df.loc[df['SHO'] <= 52].sort_values(by=['SHO'], ascending=True)
shot_rank.head(10)

Define the defending roles. As mentioned before, they are GK (goalkeeper), CB (center back), LB (left back), RB (right back), RWB (right wing back) and LWB (left wing back).

In [None]:
def_pos = ['CB', 'LB', 'RB', 'GK', 'RWB', 'LWB']

#### Build the count. Because this step searches string data, we look for 'CB', 'LB', 'RB', 'GK', 'RWB' or 'LWB' in the 'Position' column; when one is found, add 1 to the counter, break out of the inner loop, and move on to the next row.

In [None]:
number_of_def_position = 0

for pos in shot_rank['Position']:
    for bp in def_pos:
        if bp in pos:
            number_of_def_position = number_of_def_position + 1
            break

print(number_of_def_position)

The value is 5822, meaning that in this dataset of players with a shooting score of 52 or lower, 5822 play a defending role. Now let's make a pie chart using plt.pie, taking the total number of players in the dataset from len(shot_rank.index)

In [None]:
number_total_def = len(shot_rank.index)


sh_pos_chrt = [number_of_def_position, (number_total_def - number_of_def_position)]
posisi = ['defense position', 'non defense position']
colors = ["#1f77b4", "#ff7f0e"]
plt.pie(sh_pos_chrt, labels=posisi, colors=colors,
autopct='%1.1f%%', shadow=True)

### Among players who have shooting ability below the average, 71.6% play a defending position / role
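The nested counting loop and the share shown in the pie chart can be wrapped into one small helper. This is only a sketch: `count_role_matches` is a hypothetical name, and the sample positions are made up rather than taken from the dataset:

```python
def count_role_matches(positions, role_codes):
    """Count position strings that contain at least one of the role codes."""
    # any() stops at the first matching code, like the break in the loop.
    return sum(any(code in pos for code in role_codes) for pos in positions)

# Made-up sample of position strings.
sample_positions = ["CB", "ST", "GK", "CM", "LWB"]
def_pos = ["CB", "LB", "RB", "GK", "RWB", "LWB"]

n_def = count_role_matches(sample_positions, def_pos)
share = n_def / len(sample_positions)
print(n_def, f"{share:.1%}")
```

The same helper would work for the attacking roles by passing a different list of codes.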

#### Next, among players who have defending ability below the average, how many are attackers?

Apply the same method to the defense score column.

Sort the data using sort_values based on lowest defending ability score, and make it a new dataset.

In [None]:
def_rank = df.sort_values(by=['DEF'], ascending=True)
def_rank.head(10)

Find the average value using .mean

In [None]:
df['DEF'].mean()

#### The average value is 49.97, so select the players with a defending score of 48 or lower using .loc, and make it a new dataset.

In [None]:
def_rank = df.loc[df['DEF'] <= 48].sort_values(by=['DEF'], ascending=True)
def_rank.head(10)

Define the attacking roles. As mentioned before, they are ST (striker), CF (center forward), RW (right wing), LW (left wing) and CAM (central attacking midfielder).

In [None]:
att_pos = ['ST', 'CF', 'RW', 'LW', 'CAM']

#### Build the count. Because this step searches string data, we look for 'ST', 'CF', 'RW', 'LW' or 'CAM' in the 'Position' column; when one is found, add 1 to the counter, break out of the inner loop, and move on to the next row.

In [None]:
number_of_att_position = 0

for pos in def_rank['Position']:
    for bp in att_pos:
        if bp in pos:
            number_of_att_position = number_of_att_position + 1
            break

print(number_of_att_position)

The value is 5376, meaning that in this dataset of players with a defending score of 48 or lower, 5376 play an attacking role. Now let's make a pie chart using plt.pie, taking the total number of players in the dataset from len(def_rank.index)

In [None]:
number_total_att = len(def_rank.index)


sh_pos_chrt_d = [number_of_att_position, (number_total_att - number_of_att_position)]
posisi_d = ['attack position', 'non attack position']
colors = ["#1f77b4", "#ff7f0e"]
plt.pie(sh_pos_chrt_d, labels=posisi_d, colors=colors,
autopct='%1.1f%%', shadow=True)

### Among players who have defending ability below the average, 62.3% play an attacking position / role
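One caveat about the substring check (`bp in pos`) used in the counting loops: 'RW' is a substring of 'RWB' and 'LW' is a substring of 'LWB', so a wing back would be counted as an attacker. Whether such rows occur in the below-cutoff subset, and how the 'Position' strings are formatted, is not shown here, so the sketch below is only an illustration of exact code matching under the assumption that codes are upper-case letters separated by other characters:

```python
import re

att_pos = {"ST", "CF", "RW", "LW", "CAM"}

def is_attacker_substring(position):
    """Substring check, as in the counting loop above."""
    return any(code in position for code in att_pos)

def is_attacker_exact(position):
    """Split the string into whole position codes and match them exactly."""
    tokens = re.findall(r"[A-Z]+", position)
    return any(token in att_pos for token in tokens)

# Made-up position strings; the comma-separated form is an assumption.
for pos in ["RW", "RWB", "LWB", "ST, CF"]:
    print(pos, is_attacker_substring(pos), is_attacker_exact(pos))
```

If the exact version gives a different count on the real data, the percentage above would shift accordingly.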

### What the shooting and defense scores in FIFA 20 tell us:

### 1. Among players who have shooting ability below the average, 71.6% play a defending position / role

### 2. Among players who have defending ability below the average, 62.3% play an attacking position / role

## Analyzing high-potential players under 22 years old

#### Among the top 500 youth players, which country produces the most?

#### Make a new dataset of high-potential players under 22 years old
#### Filter by age (<=21 years old)
#### Sort by potential

In [None]:
youth_df = df.loc[df['Age']<=21].sort_values(by=['Potential'], ascending=False)
youth_df.head(10)

#### Cut the list so that only 500 names are left

In [None]:
youth_rank = youth_df.sort_values(by=['Potential'], ascending=False).head(500)
youth_rank.head(10)

#### Group by Country
#### Count
#### Sort by the Name count, so we know how many players there are in each country
#### Cut the list so that only the top 10 countries are left

In [None]:
grouped_youth_rank = youth_rank.groupby('Country').count().sort_values(by=['Name'], ascending=False).head(10)
grouped_youth_rank

In [None]:
bv = grouped_youth_rank.index
cv = grouped_youth_rank['Name']

plt.figure(figsize=(15,7), dpi=80)
sns.barplot(data=grouped_youth_rank, x=cv, y=bv)
plt.title("Top 10 Best Youth Player Producer Countries")

### 1. England produces the most top youth players, with 65 players in the top 500 youth players worldwide, followed by France with 48 players and Spain with 37 players.

### 2. The top 10 countries producing the most top youth players are England, France, Spain, Argentina, Brazil, Portugal, Netherlands, Germany, Italy and Belgium



## Comparing rivals in each continent

#### The comparisons use the two-sample t-test (ttest_ind from scipy.stats, two-tailed by default)
#### Each comparison is between rivals on the same continent: Asia is represented by Korea Republic and Japan, Europe by Germany and Spain, Africa by Nigeria and Cameroon, and South America by Brazil and Argentina. The countries were chosen based on subjective opinion.


#### First, create a dataset with players from the mentioned countries only

In [None]:
datarc = df[df["Country"].str.contains("Spain|Germany|Argentina|Korea Republic|Brazil|Japan|Nigeria|Cameroon")]
datarc.head(10)
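Note that `str.contains` treats its argument as a regular expression and matches substrings, while `isin` matches whole values. The sketch below contrasts the two on a made-up list of country names; the shorter pattern "Guinea" is a hypothetical example, not one used in this notebook:

```python
import pandas as pd

# Made-up country column to show the difference between the two filters.
toy = pd.DataFrame({
    "Country": ["Guinea", "Guinea Bissau", "Equatorial Guinea", "Japan"]
})

# Substring / regex match: every name containing "Guinea" is kept.
by_substring = toy[toy["Country"].str.contains("Guinea")]

# Exact match: only rows whose value equals one of the listed names.
by_exact = toy[toy["Country"].isin(["Guinea"])]

print(len(by_substring), len(by_exact))
```

With the full country names used in the cell above the two approaches should usually agree, but `isin` avoids surprises if one name happens to be contained in another.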

#### Create a boxplot and a violin plot using sns.boxplot and sns.violinplot to see the central tendency

In [None]:
plt.figure(figsize=(15,7))

sns.boxplot(x="Country", y="Overall", order=['Spain','Germany','Argentina','Brazil','Japan','Korea Republic','Nigeria','Cameroon'], data=datarc)
plt.ylabel("Overall Ability Score")

In [None]:
plt.figure(figsize=(16,7))

sns.violinplot(x="Country", y="Overall", order=['Spain','Germany','Argentina','Brazil','Japan','Korea Republic','Nigeria','Cameroon'], data=datarc)
plt.ylabel("Overall Ability Score")

#### The distributions look skewed, so we use the median as the center in Levene's test to check whether the two groups being compared have equal variances.

#### Import Levene's test from scipy.stats

In [None]:
from scipy.stats import levene

#### Because each sample is far larger than 30, the sampling distribution of the mean is approximately normal (central limit theorem), so a separate normality test is skipped

#### Hypotheses for Levene's test
#### H0 = The variances of the Japanese and Korean players' overall scores are equal
#### H1 = The variances of the Japanese and Korean players' overall scores differ

#### Hypotheses for the t-test
#### H0 = The mean overall ability of Japanese and Korean players is equal
#### H1 = The mean overall ability of Japanese and Korean players differs

#### significance level = 0.05
#### if p value is <= 0.05, reject H0
#### if p value is > 0.05, fail to reject H0

#### The t-test is performed after Levene's test

In [None]:
from scipy.stats import ttest_ind
from scipy.stats import ttest_1samp
koreans = df[df["Country"].str.contains("Korea Republic")]
japanese = df[df["Country"].str.contains("Japan")]
Koreanss = koreans["Overall"]
Japaneses = japanese["Overall"]

levene(Koreanss, Japaneses, center='median')

#### The p value is 0.001, which is below the significance level (0.05). Therefore, the two groups do not have equal variances

In [None]:
ttest_ind(Koreanss, Japaneses, equal_var=False)

#### The p value is 0.004, which is below the significance level (0.05). Therefore, we reject H0: the mean overall ability of Japanese and Korean players differs significantly
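The same two-step procedure (Levene's test first, then a t-test whose `equal_var` setting follows from it) is repeated for each pair of countries. A minimal sketch of how it could be wrapped in one function is given here; `compare_means` is a hypothetical helper and the two groups are made-up numbers, not player data:

```python
from scipy.stats import levene, ttest_ind

def compare_means(sample_a, sample_b, alpha=0.05):
    """Run Levene's test, then a two-tailed t-test with matching equal_var."""
    _, levene_p = levene(sample_a, sample_b, center="median")
    # Failing to reject Levene's H0 means equal variances are assumed.
    equal_var = bool(levene_p > alpha)
    t_stat, t_p = ttest_ind(sample_a, sample_b, equal_var=equal_var)
    return {
        "levene_p": levene_p,
        "equal_var": equal_var,
        "t_stat": t_stat,
        "t_p": t_p,
        "reject_h0": bool(t_p <= alpha),
    }

# Made-up groups with the same spread but clearly different means.
group_a = [60, 62, 64, 66, 68, 70, 72, 74]
group_b = [80, 82, 84, 86, 88, 90, 92, 94]
result = compare_means(group_a, group_b)
print(result)
```

With `equal_var=False`, `ttest_ind` performs Welch's t-test, which is what the cells here use when Levene's test rejects equal variances.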

#### Hypotheses for Levene's test
#### H0 = The variances of the Brazilian and Argentinian players' overall scores are equal
#### H1 = The variances of the Brazilian and Argentinian players' overall scores differ

#### Hypotheses for the t-test
#### H0 = The mean overall ability of Brazilian and Argentinian players is equal
#### H1 = The mean overall ability of Brazilian and Argentinian players differs

#### significance level = 0.05
#### if p value is <= 0.05, reject H0
#### if p value is > 0.05, fail to reject H0

#### The t-test is performed after Levene's test

In [None]:
brazilian = df[df["Country"].str.contains("Brazil")]
argentinian = df[df["Country"].str.contains("Argentina")]
Brazilians = brazilian["Overall"] 
Argentinians = argentinian["Overall"]

levene(Brazilians, Argentinians, center='median')

#### The p value is 0.0004, which is below the significance level (0.05). Therefore, the two groups do not have equal variances

In [None]:
ttest_ind(Brazilians, Argentinians, equal_var=False)

#### The p value is about 3.06e-19, which is below the significance level (0.05). Therefore, we reject H0: the mean overall ability of Brazilian and Argentinian players differs significantly

#### Hypotheses for Levene's test
#### H0 = The variances of the German and Spanish players' overall scores are equal
#### H1 = The variances of the German and Spanish players' overall scores differ

#### Hypotheses for the t-test
#### H0 = The mean overall ability of German and Spanish players is equal
#### H1 = The mean overall ability of German and Spanish players differs

#### significance level = 0.05
#### if p value is <= 0.05, reject H0
#### if p value is > 0.05, fail to reject H0

#### The t-test is performed after Levene's test

In [None]:
german = df[df["Country"].str.contains("Germany")]
spaniard = df[df["Country"].str.contains("Spain")]
germans = german["Overall"] 
spaniards = spaniard["Overall"]

levene(germans, spaniards, center='median')

#### The p value is 0.2, which is above the significance level (0.05). Therefore, we cannot reject equal variances and treat the two groups as having equal variances

In [None]:
ttest_ind(germans, spaniards, equal_var=True)

#### The p value is about 3.00e-40, which is below the significance level (0.05). Therefore, we reject H0: the mean overall ability of German and Spanish players differs significantly

#### Hypotheses for Levene's test
#### H0 = The variances of the Cameroonian and Nigerian players' overall scores are equal
#### H1 = The variances of the Cameroonian and Nigerian players' overall scores differ

#### Hypotheses for the t-test
#### H0 = The mean overall ability of Cameroonian and Nigerian players is equal
#### H1 = The mean overall ability of Cameroonian and Nigerian players differs

#### significance level = 0.05
#### if p value is <= 0.05, reject H0
#### if p value is > 0.05, fail to reject H0

#### The t-test is performed after Levene's test

In [None]:
cameroon = df[df["Country"].str.contains("Cameroon")]
nigeria = df[df["Country"].str.contains("Nigeria")]
cameroonian = cameroon["Overall"] 
nigerian = nigeria["Overall"]

levene(cameroonian, nigerian, center='median')

#### The p value is 0.71, which is above the significance level (0.05). Therefore, we cannot reject equal variances and treat the two groups as having equal variances

In [None]:
ttest_ind(cameroonian, nigerian, equal_var=True)

#### The p value is 0.224, which is above the significance level (0.05). Therefore, we fail to reject H0: no statistically significant difference was found between the mean overall ability of Cameroonian and Nigerian players