# CE4032 Data Analytic Project

In [None]:
import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import re
import time
import seaborn as sns
#handle warnings
import warnings
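try:
    from sklearn.utils._testing import ignore_warnings  # newer scikit-learn releases
except ImportError:
    from sklearn.utils.testing import ignore_warnings  # older scikit-learn releases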
from sklearn.exceptions import ConvergenceWarning
from sklearn.exceptions import DataConversionWarning
#models and validation scores
from sklearn import preprocessing,metrics 
from sklearn.model_selection import train_test_split, cross_val_score,GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn import svm
from sklearn.ensemble import VotingClassifier
from sklearn.cluster import KMeans,DBSCAN
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import cdist 
import scipy.cluster.hierarchy as sch
#features selection, dimensionality reduction
from sklearn.feature_selection import SelectKBest,chi2, f_classif
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline


In [None]:
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)


## Setup: Import the dataset

Dataset from Kaggle : **"European Soccer Database"**     
Source: https://www.kaggle.com/hugomathien/soccer

The dataset is stored in `database.sqlite`; we use the `sqlite3` module to open a connection to it.  
Immediately after importing, we take a quick look at the tables available for us to explore.
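
As a minimal sketch of that quick look, the table names can be read from SQLite's built-in `sqlite_master` catalogue. The helper `list_tables` and the two stand-in tables below are hypothetical and only illustrate the query; with the real file you would pass the `conn` opened on `database.sqlite` instead.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Demo on an in-memory database with two stand-in tables (hypothetical schema)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Player (id INTEGER PRIMARY KEY, player_name TEXT)")
demo_conn.execute("CREATE TABLE Player_Attributes (id INTEGER PRIMARY KEY, overall_rating INTEGER)")
tables = list_tables(demo_conn)
print(tables)
```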

In [None]:
path = "data/"  #Insert path here
database = path + 'database.sqlite'
conn = sqlite3.connect(database)

## Overview of European Soccer Database

### Understanding the data set

What each attribute does and how it affects the player: https://www.fifauteam.com/fifa-16-attributes/  
Player positions inferred from SOFIFA: https://www.kaggle.com/hugomathien/soccer/discussion/60282

A soccer player has an `overall rating (OR)` as well as `six scores for the key stats (6S)`. Each key stat is calculated as a sum of attributes, each multiplied by a coefficient that reflects the importance of that attribute for the stat. Goalkeepers are a special case: their overall rating is determined mainly by the goalkeeping attributes.

- Pace :      {sprint_speed, acceleration}
- Shooting :  {finishing, long_shots, shot_power, positioning, penalties, volleys, free_kick_accuracy}
- Passing :   {short_passing, vision, crossing, long_passing, curve, free_kick_accuracy}
- Dribbling : {dribbling, ball_control, agility, balance}
- Defending : {marking, standing_tackle, interceptions, heading_accuracy, sliding_tackle}
- Physical :  {strength, stamina, aggression, jumping, reactions}
- GoalKeeping : {gk_diving, gk_handling, gk_kicking, gk_positioning, gk_reflexes}
- Does not affect overall rating : {overall_rating, potential, preferred_foot, attacking_work_rate, defensive_work_rate}
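
To make the "sum of attributes multiplied by their coefficient" idea concrete, here is a small sketch. The coefficients are hypothetical values chosen for illustration only; the real weights used by the game are not part of this dataset.

```python
def key_stat(attributes, coefficients):
    """Weighted sum of attribute values; the coefficients are assumed to sum to 1."""
    return sum(attributes[name] * weight for name, weight in coefficients.items())

# Hypothetical coefficients, for illustration only (not the game's real values)
pace_coefficients = {"sprint_speed": 0.55, "acceleration": 0.45}
player = {"sprint_speed": 90, "acceleration": 80}

pace = key_stat(player, pace_coefficients)
print(round(pace, 2))
```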

### Defining a question 

Knowing that the database was largely built from the FIFA video game, we look into the players and their attributes.
**The objective of this analysis is to identify the key attributes (or features) that directly affect a player's position class [Attack, Defence, Midfielder, Goalkeeper], so that people can use the findings in real life to model their own attributes and find a direction for improving their game. Coaches can likewise use them to assess which aspects of their players' game to improve, based on what they want to achieve, for a better training camp.**


1. Data Exploration Objective (Understanding the business and the data): 
    a. Clean the data and merge external information into the current database. 
        (Player/Position inferred from SOFIFA https://www.kaggle.com/hugomathien/soccer/discussion/60282 )
    b. Derive the main positions and the main position classes of players.
    c. Determine the correlation of these stats to the player's position class [Attack, Defence, Midfielder, Goalkeeper]
    d. Build the data to be used for the Classification, Clustering and Recommendation models (prepping the data)
    
2. Classification (Selecting a model Part 1).
    Dataset: All Players
    The main objective of classification is to see which attributes matter in determining a player's position class.
    a. In our case we used supervised models (Naive Bayes, Logistic Regression, ANN, SVM, and Ensemble)
        1. We assess the trade-off between accuracy and speed of the predictions made by these models
        2. What can we improve in these models?
            a. Feature Selection: Using SelectKBest
                a1. What voting classifier is used and why it is important.
                a2. Do the features that matter most according to feature selection tally with the findings from the data exploration in Part 1?
            b. Dimensionality Reduction: PCA
                b1. Will reducing the number of features result in a better trade-off between accuracy and making the data easier to understand?
                b2. Does it agree with the number of features chosen by feature selection? In other words, do the selected features explain the majority of the data?
                
3. Clustering (Selecting a model Part 2)
    Dataset: Top 50 Players from each class [Classes: Attack, Defence, Midfielder, Goalkeeper]
    The main objective of clustering in this case is to view the top players for each position class and how they are distributed given their stats. Are there players who play like one positional class even though they are marked as another class in the previously imported data (players_position)?
    a. K-means - Objective is to separate the players into 4 bins [Classes: Attack, Defence, Midfielder, Goalkeeper] and see how well they are distributed.
        1. Will there be any misclassification when K-means is used to predict the player's position class?
        2. We checked the number of bins using The (People's) Elbow Method to determine the optimum; it still came to 4.
    b. DBSCAN - Objective is to cluster players based on the similarity of their stats and determine which players are set apart.
        1. Reasons why players may be set apart from a cluster although they are near it. 
        2. Give a reason why a portion of midfielders are identified as noise, and comment on the rest of the noise identified and the main clusters.
        
4. Recommendation (Suggest to players what they can train on based on their objective class)
    a. Based on their stats, predict what their most likely class is
    b. Choose between training to become a different class of player or training to be a better player in the current class.
        b1. If training for a different class, show what they need to train more on.
        b2. If training for their current class, show what they can improve on to emulate the best in the leagues.
            
       

## 1. Cleaning and Prepping Data:
1. Merge External Information from Current Database and  Player_Position inferred from SOFIFA;
2. Prepping Needed Data: Player positions and their position's class
3. Players Role (Variation: Players Position), Players Position Class

    a. Player Position Class: Defence
        CB - Centreback
        SW - Sweeper
        FB(LB/RB)- Fullback (LeftBack/RightBack)
        WB(LWB/RWB)- WingBack (LeftWingBack/RightWingBack)
    b. Player Position Class: Midfielders
        CM - Central Midfielders
        CDM - Defensive Midfielders
        CAM - Attacking Midfielders
        WM(LM/RM) - Wide Midfielders (LeftMidfielder/RightMidfielder)
    c. Player Position Class: Attackers
        CF(CF/ST) - Centre Forward (CentreForward/Striker)
        WF(LF/RF) - Wing Forward (LeftForward/RightForward)
        WA(LW/RW) - Wing Attacker (LeftWing/RightWing)
    d. Player Position Class: Goalkeeper
        GK - Goal Keeper

In [None]:
# Specific position -> role; positions not listed here (CB, SW, CM, CAM, GK, ...) are already roles
ROLE_BY_POSITION = {
    'LB': 'FB', 'RB': 'FB',      # Full backs
    'LWB': 'WB', 'RWB': 'WB',    # Wing backs
    'CDM': 'DM',                 # Defensive midfielders
    'LM': 'WM', 'RM': 'WM',      # Wide midfielders
    'CF': 'CF', 'ST': 'CF',      # Centre forwards / strikers
    'LF': 'WF', 'RF': 'WF',      # Wing forwards
    'LW': 'WA', 'RW': 'WA',      # Wing attackers
}

# Role -> position class
CLASS_BY_ROLE = {
    'CF': 'Attack', 'WF': 'Attack', 'WA': 'Attack',
    'CB': 'Defence', 'SW': 'Defence', 'FB': 'Defence', 'WB': 'Defence',
    'CM': 'Midfield', 'DM': 'Midfield', 'CAM': 'Midfield', 'WM': 'Midfield',
}

def classify_role(pos):
    #Map a specific position to its role; unknown values are returned unchanged
    return ROLE_BY_POSITION.get(pos, pos)

def classify_position_class(pos):
    #Map a role to its position class; any other value (including 'GK') falls back to "GoalKeeper"
    return CLASS_BY_ROLE.get(pos, "GoalKeeper")
    
#players_position contains the original data.
player_pos_path = "./data/player_positions.csv"
players_position = pd.read_csv(player_pos_path)
#See the original data format from player_position csv file
players_position.head(5)

#### A player's main position is marked by the value 1, while secondary positions are marked by 2, 3, 4, ...
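
A small sketch of this encoding on a toy table. The column names and the use of 0 for "never plays there" are assumptions made for the illustration, not the real layout of `player_positions.csv`. Restricting the comparison to the position columns keeps an ID that happens to equal 1 from being read as a position flag.

```python
import pandas as pd

# Toy stand-in for the positions file: 1 = main position, 2, 3, ... = secondary,
# 0 = never plays there (the zero encoding is an assumption for this sketch)
toy = pd.DataFrame(
    {"playerID": [101, 102, 103], "ST": [1, 0, 2], "LM": [2, 1, 0], "GK": [0, 0, 1]}
)
position_cols = ["ST", "LM", "GK"]

# For each row, pick the first position column whose value equals 1
is_main = toy[position_cols].eq(1)
main = is_main.astype(int).idxmax(axis=1)
# Rows without any 1 get an empty string instead of a misleading first column
toy["main_position"] = main.where(is_main.any(axis=1), "")
print(toy[["playerID", "main_position"]])
```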

In [None]:
#Modifying data: for each row, find the position columns equal to 1, i.e. the player's main position
#(playerID is excluded so that an ID of 1 is not mistaken for a position flag)
position_cols = players_position.columns.drop('playerID')
players_position["specific_player_position"] = (players_position[position_cols] == 1).apply(lambda y: position_cols[y.tolist()].tolist(), axis=1)
#join the resulting list into one string, as we only want the main position
players_position['specific_player_position'] = players_position["specific_player_position"].str.join(',')
players_position.head(5)

### 1.1 Arrange Player's specific position, role and position's class
For Example:
A player whose specific position is LM (Left Midfielder) has the role Wide Midfielder (WM), meaning they play on either the left or the right side of midfield. This role in turn belongs to the Midfield class.

In [None]:
#Copy only playerID and the player's position.
players_position = players_position[['playerID', 'specific_player_position']].copy()  #copy so the column assignments below do not act on a view
#define role of players from their position
players_position['player_role'] = players_position['specific_player_position'].apply(classify_role)
#define class of player from role
players_position['player_position_class'] = players_position['player_role'].apply(classify_position_class)
#write result to csv
#df.to_csv("./data/player_positions_cleaned.csv",index=False)
#This is the resulting position, role, and position class for the players.
players_position.head(5)

### 1.2 Combine Players' Positions with Players' Attributes: See which of a player's attributes have the highest correlation to their role as football players and overall training.
Note: The PlayerID for players_position is equivalent to player_fifa_api_id 
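
Because the two tables name their key differently, the merge has to name both columns. The sketch below uses toy frames (the values are made up) and adds `validate="one_to_one"`, which makes pandas raise if either key column holds duplicates; that is one way to confirm that each player contributes a single attribute row.

```python
import pandas as pd

positions = pd.DataFrame({"playerID": [1, 2, 3], "player_role": ["CF", "CB", "GK"]})
attributes = pd.DataFrame({"player_fifa_api_id": [2, 3, 4], "overall_rating": [70, 81, 66]})

# Inner join (the default) keeps only IDs present in both frames;
# validate="one_to_one" raises MergeError if a key is duplicated on either side
merged = pd.merge(positions, attributes, left_on="playerID",
                  right_on="player_fifa_api_id", validate="one_to_one")
print(merged)
```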

In [None]:
path = "./data/"  #Insert path here
database = path + 'database.sqlite'
conn = sqlite3.connect(database)
#Merging and exploring Players table with their attributes 
#One row per player: with a single MAX() aggregate, SQLite fills the other selected columns
#from the row that holds the maximum, i.e. the player's latest update
Player_Attributes = pd.read_sql("""SELECT pa.id, pa.player_fifa_api_id, p.player_name, MAX(pa.date) AS date, pa.overall_rating, pa.potential, pa.preferred_foot, pa.attacking_work_rate, pa.defensive_work_rate, pa.crossing, pa.finishing, pa.heading_accuracy, pa.short_passing, pa.volleys, pa.dribbling, pa.curve, pa.free_kick_accuracy, pa.long_passing, pa.ball_control, pa.acceleration, pa.sprint_speed, pa.agility, pa.reactions, pa.balance, pa.shot_power, pa.jumping, pa.stamina, pa.strength, pa.long_shots, pa.aggression, pa.interceptions, pa.positioning, pa.vision, pa.penalties, pa.marking, pa.standing_tackle, pa.sliding_tackle, pa.gk_diving, pa.gk_handling, pa.gk_kicking, pa.gk_positioning, pa.gk_reflexes
                                    FROM Player_Attributes pa
                                    INNER JOIN (SELECT player_name, player_fifa_api_id as api FROM Player) p 
                                    ON pa.player_fifa_api_id = p.api
                                    GROUP BY pa.player_fifa_api_id;""", conn)
Player_Attributes.head(3)

In [None]:
#before dropping empty players
players = pd.merge(players_position, Player_Attributes, left_on="playerID",right_on="player_fifa_api_id")
print(f"Before dropping empty players without position and empty stats: {players.shape}")
players.head(3)

#### Cleaning empty data and dropping any players without roles.

In [None]:
#Checking for null data points 
players.isnull().sum(axis=0)
#dropping null rows and those with empty roles
players = players.dropna()
players.isnull().sum(axis=0)
players = players[players['player_role'] != '']

#### Encode categorical features into numbers:
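
`LabelEncoder` assigns integer codes in sorted label order, and a single encoder instance only remembers the last column it was fitted on. The sketch below, on made-up labels, keeps a dedicated encoder (the name `class_encoder` is hypothetical) so the class codes can be mapped back to names later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

classes = pd.Series(["Midfield", "Attack", "GoalKeeper", "Defence", "Attack"])

# A dedicated encoder for the target keeps its label <-> code mapping available
class_encoder = LabelEncoder()
codes = class_encoder.fit_transform(classes)

# classes_ holds the labels in code order (sorted alphabetically)
mapping = {label: code for code, label in enumerate(class_encoder.classes_)}
decoded = class_encoder.inverse_transform(codes)
print(mapping)
```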

In [None]:
#Pre-process: label-encode the categorical values that may contribute to the overall rating
le = preprocessing.LabelEncoder()
players['preferred_foot'] = le.fit_transform(players['preferred_foot'])
players['attacking_work_rate'] = le.fit_transform(players['attacking_work_rate'])
players['defensive_work_rate'] = le.fit_transform(players['defensive_work_rate'])
players['player_position_class_cat'] = le.fit_transform(players['player_position_class'])
#How many players are left after dropping empty role players and empty attributes
players.shape
print(f"After dropping empty players without position and empty stats: {players.shape}")

### 1.3 Checking the statistics for all the players: identifying the quartiles, best and worst values for each stat.

In [None]:
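#Keep only the stats: drop identifiers, names, dates and the position labels
player_att = players.drop(columns=['playerID', 'specific_player_position', 'player_role', 'id', 'player_fifa_api_id', 'player_name',
       'date', 'player_position_class', 'player_position_class_cat'])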
player_att.describe().transpose()

### Check the Relationship amongst the Variables

Correlation between the variables, followed by scatter plots for every pair of variables.

In [None]:
# print(six_major.corr())
fig1, axes = plt.subplots(1, 1, figsize=(20,20))
sb.heatmap(player_att.corr(), vmin = -1, vmax = 1, annot = True, fmt = ".2f")

In [None]:
sb.pairplot(data = player_att)

### 1.4 Exploring Data: Correlation of the rest of the attributes with Position class and Overall Rating

In [None]:
players.columns

In [None]:
#Drop all columns that cannot be attributed to the player position class
player_attributes = players.drop(columns=['playerID', 'specific_player_position', 'player_role','id', 'player_fifa_api_id', 'player_name',
       'date','player_position_class'])

correlation = []
for attribute in player_attributes:
    atts = []
    atts.append(attribute)
    att = player_attributes['player_position_class_cat'].corr(player_attributes[attribute])
    overall = player_attributes['overall_rating'].corr(player_attributes[attribute])
    atts.append(att)
    atts.append(overall)
    #print("%s: %f" % (attribute, overall))
    correlation.append(atts)
#Sort the correlation stats from greatest to least
correlation_stats = pd.DataFrame(correlation,columns=['Attribute', 'Correlation_Score_PlayerClass','Correlation_Score_Overall']).sort_values(by=['Correlation_Score_PlayerClass'],ascending=False)
correlation_stats

#### Observation: 
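The correlation coefficient has a range of `[-1, +1]`, where `0` means that there is no linear correlation at all. From the values above we can tell that the attribute **ball_control** has the highest coefficient compared to the other 4 attributes. Note that `player_position_class_cat` is a label-encoded nominal variable, so its coefficients depend on the arbitrary order of the class codes and should be read as a rough indicator only.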
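
One way to look at the relationship without relying on the order of the class codes is to correlate each attribute with a 0/1 indicator per class. This is a sketch on made-up numbers, offered as an alternative view rather than the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "player_position_class": ["Attack", "Attack", "Defence", "Defence", "GoalKeeper", "GoalKeeper"],
    "finishing": [85, 80, 40, 35, 15, 10],
    "gk_reflexes": [10, 12, 11, 9, 85, 88],
})

# One 0/1 indicator column per class
indicators = pd.get_dummies(toy["player_position_class"]).astype(float)

# Rows: classes, columns: attributes, values: Pearson correlation with the indicator
per_class_corr = pd.DataFrame({
    attr: indicators.corrwith(toy[attr]) for attr in ["finishing", "gk_reflexes"]
})
print(per_class_corr.round(2))
```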


In [None]:
correlation_stats.shape

In [None]:
plt.plot(correlation_stats.Attribute, correlation_stats.Correlation_Score_PlayerClass, color = 'red', marker = 'o', linewidth = 1)
plt.plot(correlation_stats.Attribute, correlation_stats.Correlation_Score_Overall, color = 'blue', marker = 'o', linewidth = 1)
fig = plt.gcf()
fig.set_size_inches(15, 8)
plt.title('Correlation for Player\'s Role against attributes ', fontsize = 14)
plt.xticks(rotation = 80);
plt.ylabel('Correlation Value', fontsize = 14)
plt.grid(True)
plt.show()

Splitting up the attributes into their respective key scores

### 1.5 Prepping Data

### Prepping Data: Classification
Separate the features to train on from the target to predict

In [None]:
#get columns of attributes to train on
X_player = players.drop(columns =['playerID','specific_player_position','player_role','player_position_class','id','player_fifa_api_id',
                                  'player_name','date','player_position_class_cat'],axis=1)
#get columns to predict
Y_player = players.filter(["player_position_class"],axis=1)

### Prepping Data for Clustering: Sample Top 50 of each Class
By sampling the top 50 players of each class, we find out where they are clustered. The aim is to cluster players based on their stats and see whether they are also suited to playing in another position class. Example: is Cristiano Ronaldo really an attacker, or could he, with his stats, be a good defender too?
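
The per-class sampling can also be written as a single sort followed by a grouped `head`. A minimal sketch on toy data (names and ratings are made up; `top_n` would be 50 for the real table):

```python
import pandas as pd

toy = pd.DataFrame({
    "player_name": ["A", "B", "C", "D", "E", "F"],
    "player_position_class": ["Attack", "Attack", "Attack", "Defence", "Defence", "Defence"],
    "overall_rating": [90, 85, 88, 70, 82, 79],
})
top_n = 2

# Sort once by rating, then keep the first top_n rows of every class
top = (toy.sort_values("overall_rating", ascending=False)
          .groupby("player_position_class")
          .head(top_n))
print(top)
```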

In [None]:
#Get top 50 players of each class
top_gk = players[players['player_position_class'] == 'GoalKeeper'].sort_values(by=['overall_rating'], ascending=False).head(50)
top_attacker = players[players['player_position_class'] == 'Attack'].sort_values(by=['overall_rating'], ascending=False).head(50)
top_defender = players[players['player_position_class'] == 'Defence'].sort_values(by=['overall_rating'], ascending=False).head(50)
top_mid_fielder = players[players['player_position_class'] == 'Midfield'].sort_values(by=['overall_rating'], ascending=False).head(50)
top_attacker.head(3)

In [None]:
## Since there are too few top players with the roles SW and WF, we skip clustering those roles
top_players= pd.concat([top_attacker,top_defender,top_mid_fielder,top_gk])
#Save names for later reference
names = top_players.player_name
player_class = top_players['player_position_class'].str[0]
name_and_class = names + '_' + player_class
name_and_class = name_and_class.tolist()
#Drop all non_stats column
top_players = top_players.drop(['player_name','playerID','specific_player_position','player_role','player_position_class','id','player_fifa_api_id','date','player_position_class_cat'],axis=1)

## 2. Supervised Classification
We assess the models below in terms of accuracy and speed in predicting the position class of players. This lets us choose the model with a good balance between speed and accuracy. 

The model chosen at this stage will be used to tell players which position class they are classified as and, if they want to change class, which attributes they need to train.
 1. Naive Bayes (NB)
 2. Support Vector Machine (SVM)
 3. Logistic Regression (LR)
 4. Artificial Neural Networks (ANN)
 5. Ensemble: a hard-voting combination of NB, SVM and LR
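
To separate "how accurate" from "how fast", training time and prediction time can be measured independently on a held-out split. The sketch below uses synthetic data from `make_classification` and only two of the models; it illustrates the measurement and is not the evaluation routine used in this notebook.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic 4-class data standing in for the player stats
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

report = {}
for label, model in [("NB", GaussianNB()), ("LR", LogisticRegression(max_iter=5000))]:
    start = time.perf_counter()
    model.fit(X_train, y_train)                  # training cost
    fit_seconds = time.perf_counter() - start
    start = time.perf_counter()
    accuracy = model.score(X_test, y_test)       # prediction cost on unseen rows
    predict_seconds = time.perf_counter() - start
    report[label] = (accuracy, fit_seconds, predict_seconds)
    print(f"{label}: accuracy={accuracy:.3f} fit={fit_seconds:.4f}s predict={predict_seconds:.4f}s")
```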

In [None]:
#Split Train and Test Data, based on the player's stats that are already available (See Prepping Data: Classification for more details)
Train_X, Test_X, Train_Y, Test_Y = train_test_split(X_player,Y_player,test_size=0.2)

In [None]:
@ignore_warnings(category=ConvergenceWarning)
@ignore_warnings(category=DataConversionWarning)
def test_models_NB_SVM_LR_accuracy(train_x, test_x, train_y, test_y):
    #Flatten the single-column target frames to the 1-D arrays that sklearn expects
    y_train = train_y.values.ravel()
    y_test = test_y.values.ravel()
    #Declare models and ensemble
    NB = GaussianNB()
    LR = LogisticRegression(solver = 'lbfgs', multi_class = 'multinomial' ,random_state=3,max_iter =5000)
    ANN = MLPClassifier(hidden_layer_sizes=(100,50,), solver='sgd', early_stopping=True, random_state=1, max_iter=5000)
    SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto',random_state=1)
    #Hard-voting ensemble of NB, LR and SVM (the ANN is not part of it)
    ensemble_voting_classifier = VotingClassifier(estimators=[('gnb', NB), ('lr', LR), ('svm', SVM)], voting ='hard',)
    models = [NB, LR, ANN, SVM, ensemble_voting_classifier]
    labels = ['Naive Bayes', 'Logistic Regression', 'ANN','Support Vector Machine', 'Ensemble']
    for clf in models:
        clf.fit(train_x, y_train)
    #Print performance of the 4 models and the ensemble 
    print("4 models and ensemble: Accuracy")
    results = []
    timings = []
    for clf, label in zip(models, labels):
        start = time.time()
        #Note: cross_val_score clones and refits each model on folds of the test split,
        #so the fits above are not what is scored and the timing includes refitting
        scores = cross_val_score(clf, test_x, y_test, scoring='accuracy', cv=5)
        elapsed = time.time() - start
        print("Accuracy (Std Deviation): %0.6f (+/- %0.6f) [%s]" % (scores.mean(), scores.std(), label))
        print('Time taken: {:.2f}s'.format(elapsed))
        results.append(scores.mean())
        timings.append(round(elapsed, 4))
    return results, timings

#### 2.1 Test Classification performance: NB, SVM and LR and Ensemble (Voting class); without feature reduction

In [None]:
%%time
results,timings = test_models_NB_SVM_LR_accuracy(Train_X, Test_X, Train_Y, Test_Y)

In [None]:
results

In [None]:
timings

In [None]:
%%time
Train_X1, Test_X1, Train_Y1, Test_Y1 = train_test_split(X_player.drop(columns=['gk_diving','gk_handling','gk_kicking','gk_positioning','gk_reflexes']),Y_player,test_size=0.2)
results1,timing1 = test_models_NB_SVM_LR_accuracy(Train_X1, Test_X1, Train_Y1, Test_Y1)

### 2.2 Feature selection using SelectKBest (Univariate Method). Purpose: Score the attributes based on the scoring function used by SelectKBest and pick the number of features to represent the whole dataset.

#### The scores of features according to SelectKBest using chi2 scoring function
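
As a small illustration of how `SelectKBest` with `chi2` ranks features, the sketch below builds a toy table with one column that clearly depends on the class and two noise columns, then reads the chosen column names back with `get_support()`. The data is synthetic. `chi2` requires non-negative feature values, and its scores grow with the scale of a feature, so columns on very different scales are not directly comparable.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
toy_X = pd.DataFrame({
    # Clearly higher for class 1 than for class 0
    "informative": np.where(labels == 1, 80, 20) + rng.integers(0, 10, size=100),
    # Unrelated to the class
    "noise_a": rng.integers(0, 100, size=100),
    "noise_b": rng.integers(0, 100, size=100),
})

selector = SelectKBest(score_func=chi2, k=1).fit(toy_X, labels)
# get_support() is a boolean mask over the input columns
selected = toy_X.columns[selector.get_support()].tolist()
print(selected)
```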

In [None]:
bestfeatures = SelectKBest(score_func=chi2, k=38)# chi2 requires non-negative feature values
fit = bestfeatures.fit(X_player,Y_player)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_player.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Features','Score']  #naming the dataframe columns
featureScores = featureScores.sort_values(by=['Score'],ascending = False)
featureScores

In [None]:
def select_features_test_model_perf(x_players,y_players,num_best_features):
    print("For number of best features: " + str(num_best_features))
    ### apply SelectKBest class to extract top k best features
    bestfeatures = SelectKBest(score_func=chi2, k=num_best_features)# chi2 requires non-negative feature values
    fit = bestfeatures.fit(x_players,y_players.values.ravel())
    #Pair each feature name with its score and sort from highest to lowest
    featureScores = pd.DataFrame({'Features': x_players.columns, 'Score': fit.scores_})
    featureScores = featureScores.sort_values(by=['Score'],ascending = False)
    #list the features to test
    player_attributes_reduced = featureScores.Features.head(num_best_features).to_list()
    print("Features selected: ")
    print(player_attributes_reduced)
    #Keep only the selected columns of the data that was passed in (not the global `players` frame)
    x_selected = x_players[player_attributes_reduced]
    #Split Train and Test Data
    Train_X, Test_X, Train_Y, Test_Y = train_test_split(x_selected,y_players,test_size=0.2)
    #Returns the (results, timings) pair produced by the model test
    return test_models_NB_SVM_LR_accuracy(Train_X, Test_X, Train_Y, Test_Y)

In [None]:
%%time
# from 1 to max features
features = list(range(1, X_player.columns.nunique()+1))

#Save results in a dataframe to plot and see results
overall_results = []
overall_timings = []
for num_features in features:
    #record test for this num of features
    results,timings = select_features_test_model_perf(X_player,Y_player,num_features)
    #append to list of list, tagging each row with the number of features used
    overall_results.append(results + [num_features])
    overall_timings.append(timings + [num_features])
#make data_frame
df_result = pd.DataFrame(overall_results,columns=['NB', 'LR', 'ANN', 'SVM', 'Ensemble','Num_features'])
df_timings = pd.DataFrame(overall_timings,columns=['NB', 'LR', 'ANN', 'SVM', 'Ensemble','Num_features'])

In [None]:
df_result

In [None]:
df_timings

In [None]:
#Write results to csv
# df_result.to_csv("./data/feature_reduc_results7.csv",index=False)
# df_timings.to_csv("./data/feature_reduc_timings3.csv",index=False)

In [None]:
#Read results from csv
df_result = pd.read_csv("./data/feature_reduc_results5.csv")
df_result1 = pd.melt(df_result,  id_vars =['Num_features'],value_vars=['NB', 'LR','ANN', 'SVM','Ensemble'], value_name='Accuracy')
df_timings = pd.read_csv("./data/feature_reduc_timings3.csv")
df_timings1 = pd.melt(df_timings,  id_vars =['Num_features'],value_vars=['NB', 'LR','ANN', 'SVM','Ensemble'], value_name='Timing')
df_result1

In [None]:
df_timings1

#### Top 10 Performing Models

In [None]:
max1 = df_result1.nlargest(10, ['Accuracy']) 
max1

In [None]:
#The 10 largest timings, i.e. the slowest model/feature-count combinations
max1_timing = df_timings1.nlargest(10, ['Timing']) 
max1_timing

### 2.2.2 Plot graph and decide best number of features to represent stats to predict position class

In [None]:
fig, ax = plt.subplots(figsize=(15, 12))
colors = ['steelblue', 'green','blue','red','purple']
sns.set(font_scale=1.5, style="darkgrid",palette = 'rocket')
sns.lineplot(x='Num_features', y='Accuracy',data=df_result1, hue='variable',palette =colors, sort=False)
plt.title("Accuracy of models SelectKBest", fontsize = 20) # for title
plt.xlabel("Number of Features", fontsize = 25) # label for x-axis
plt.ylabel("Accuracy of model", fontsize = 25) # label for y-axis
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)# Put the legend out of the figure
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(15, 12))
colors = ['steelblue', 'green','blue','red','purple']
sns.set(font_scale=1.5, style="darkgrid",palette = 'rocket')
sns.lineplot(x='Num_features', y='Timing',data=df_timings1, hue='variable',palette =colors, sort=False)
plt.title("Timings of models SelectKBest", fontsize = 20) # for title
plt.xlabel("Number of Features", fontsize = 25) # label for x-axis
plt.ylabel("Timings of model", fontsize = 25) # label for y-axis
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)# Put the legend out of the figure
plt.show()

#### Conclusion: Accuracy peaks at around 24 attributes. However, what happens if we choose only the attributes that made the accuracy go up?
Features, by their SelectKBest rank, whose addition increased accuracy for Logistic Regression:
    1. 3: gk_positioning
    2. 6: marking
    3. 8: standing_tackle
    4. 10: interceptions
    5. 12-13: volleys, long_shots
    6. 15-19: curve, heading_accuracy, free_kick_accuracy, crossing, ball_control
    7. 21-22: shot_power, long_passing
    8. 27: agility
    9. 30: balance
    10. 33: attacking_work_rate
    11. 36: potential


### Result: Pick 24 Features from SelectKBest. 
#### Since accuracy at 24 features is similar to or better than at 38

In [None]:
%%time
# These are the best features from SelectKBest
final_results = select_features_test_model_perf(X_player,Y_player,24)

### 2.3 Dimensionality Reduction using PCA: Find best params

In [None]:
%%time
# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# set the tolerance to a large value to make the example faster
logistic = LogisticRegression(max_iter=10000, tol=0.01)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = {
    'pca__n_components': [2,10,13,15, 20,24, 30, 35,38],
    'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_player, Y_player)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

# Plot the PCA spectrum
pca.fit(X_player)

fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(12, 20))
ax0.plot(np.arange(1, pca.n_components_ + 1),
         pca.explained_variance_ratio_, '+', linewidth=2)
ax0.set_ylabel('PCA explained variance ratio')

ax0.axvline(search.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='Number of components chosen')
ax0.legend(prop=dict(size=12))

# For each number of components, find the best classifier results
results = pd.DataFrame(search.cv_results_)
components_col = 'param_pca__n_components'
best_clfs = results.groupby(components_col).apply(
    lambda g: g.nlargest(1, 'mean_test_score'))

best_clfs.plot(x=components_col, y='mean_test_score', yerr='std_test_score',
               legend=False, ax=ax1)
ax1.set_ylabel('Classification accuracy (val)')
ax1.set_xlabel('Number of components chosen')

plt.xlim(-1, 70)

plt.tight_layout()
plt.show()

#### Findings: LR/SVM performed best as standalone models, but ensembling the models gave the best performance. The models and features can be tuned further to make better predictions. As for features, the top 15 are enough to explain most of a player's position class. 

In [None]:
#Normalize the data so that every attribute lies in the same [0, 1] range; PCA is driven by variance,
#so attributes with larger ranges would otherwise dominate the components
scaler = preprocessing.MinMaxScaler()
x_scaled_player = scaler.fit_transform(X_player)
X_norm_player = pd.DataFrame(x_scaled_player)
pca = PCA(n_components = 15) # keep the first 15 principal components
reduced = pd.DataFrame(pca.fit_transform(X_norm_player))
reduced

In [None]:
#Split Train and Test Data, using the PCA-reduced features
Train_X, Test_X, Train_Y, Test_Y = train_test_split(reduced,Y_player,test_size=0.2)

In [None]:
%%time
results = test_models_NB_SVM_LR_accuracy(Train_X, Test_X, Train_Y, Test_Y)

#### Conclusion: As you can see, dimensionality reduction resulted in an improvement in the models' performance.

## Part 3 Clustering: 
Explore which algorithm performs best and see their behaviour.
1. K-means
2. DBSCAN
     
*Limitation to note: the stats used for each player come from that player's latest update. If Player A's stats are from 2015, they may be compared against a player whose stats were taken in 2014.

#### Principal component analysis (PCA)
PCA is a technique for reducing the dimensionality of such datasets, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.
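
The two properties named above, uncorrelated new variables and successively maximized variance, can be checked on synthetic data. This sketch computes the components with a plain NumPy SVD instead of `sklearn.decomposition.PCA`, purely to keep the illustration self-contained; the helper `pca_svd` is hypothetical.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its top principal components; return scores and variance ratios."""
    centered = X - X.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    # Share of the total variance carried by each component, largest first
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    scores = centered @ components[:n_components].T
    return scores, explained_ratio

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Three strongly correlated columns plus one independent column (synthetic data)
X = np.hstack([base + 0.1 * rng.normal(size=(200, 3)), rng.normal(size=(200, 1))])

scores, ratio = pca_svd(X, n_components=2)
print(np.round(ratio, 3))                                # non-increasing variance shares
print(np.round(np.corrcoef(scores, rowvar=False), 3))    # off-diagonal entries are ~0
```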

In [None]:
### Prepare dataset with best features

In [None]:
top_players = top_players.filter(['gk_reflexes', 'gk_diving', 'gk_positioning', 'gk_handling', 'gk_kicking', 'marking', 'sliding_tackle', 'standing_tackle', 'finishing', 'interceptions', 'positioning', 'volleys', 'long_shots', 'dribbling', 'curve', 'heading_accuracy', 'free_kick_accuracy', 'crossing', 'ball_control', 'penalties', 'shot_power', 'long_passing', 'short_passing', 'vision'],axis=1)
top_players 

In [None]:
#Normalize the data so that every attribute is on the same scale, which helps produce good-quality clusters.
#This is an essential step before clustering, as Euclidean distance is very sensitive to differences
#in the ranges of the attributes
x_player = top_players.values # numpy array
scaler = preprocessing.MinMaxScaler()
x_scaled_player = scaler.fit_transform(x_player)
X_norm_player = pd.DataFrame(x_scaled_player)
pca = PCA(n_components = 2) # 2D PCA for the plot
reduced = pd.DataFrame(pca.fit_transform(X_norm_player))

### 3.1. K-means Clustering
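
Once every player has a cluster label, a cross-tabulation against the known position class shows how well the clusters line up with the classes and which players fall outside their class's majority cluster. A sketch on made-up labels (the column names are assumptions for the illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "position_class": ["Attack", "Attack", "Attack", "Midfield", "Midfield", "GoalKeeper"],
    "cluster": [0, 0, 1, 1, 1, 2],
})

# Rows: known classes, columns: cluster ids, values: player counts
table = pd.crosstab(toy["position_class"], toy["cluster"])

# The most common class inside each cluster acts as that cluster's label
majority = table.idxmax(axis=0)
mismatched = toy[toy["position_class"] != toy["cluster"].map(majority)]
print(table)
print(mismatched)
```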

In [None]:
# specify the number of clusters
kmeans = KMeans(n_clusters=4)
# fit the input data
kmeans = kmeans.fit(reduced)
# get the cluster labels
labels = kmeans.predict(reduced)
# centroid values
centroid = kmeans.cluster_centers_
# cluster values
clusters = kmeans.labels_.tolist()

In [None]:
reduced['cluster'] = clusters
reduced['name_and_class'] = name_and_class
reduced.columns = ['x', 'y', 'cluster', 'name_and_class']
reduced.head()

In [None]:
sns.set(style="white")
def set_size(w,h, ax=None):
    """ w, h: width, height in inches """
    if not ax: ax=plt.gca()
    l = ax.figure.subplotpars.left
    r = ax.figure.subplotpars.right
    t = ax.figure.subplotpars.top
    b = ax.figure.subplotpars.bottom
    figw = float(w)/(r-l)
    figh = float(h)/(t-b)
    ax.figure.set_size_inches(figw, figh)
    
ax = sns.lmplot(x="x", y="y", hue='cluster', data = reduced, legend=False,
                   fit_reg=False, size = 10, scatter_kws={"s": 200})

texts = []
for x, y, s in zip(reduced.x, reduced.y, reduced.name_and_class):
    texts.append(plt.text(x, y, s))
set_size(10,25)
ax.set(ylim=(-1.5, 1.7))
plt.tick_params(labelsize=15)
plt.xlabel("PC 1", fontsize = 20)
plt.ylabel("PC 2", fontsize = 20)

plt.show()



#### 3.1.1 Testing for optimal clusters for k-mean: Elbow method

In [None]:
pca = PCA(n_components = 2) # 2D PCA for the plot
reduced1 = pd.DataFrame(pca.fit_transform(X_norm_player))

In [None]:
distortions = [] 
inertias = [] 
mapping1 = {} 
mapping2 = {} 
K = range(1,12) 
 
for k in K: 
    # Build and fit the model (a single fit is enough)
    kmeanModel = KMeans(n_clusters=k).fit(reduced1)

    # Distortion: mean Euclidean distance from each point to its nearest centroid
    distortion = sum(np.min(cdist(reduced1, kmeanModel.cluster_centers_,
                      'euclidean'), axis=1)) / reduced1.shape[0]
    distortions.append(distortion)
    # Inertia: sum of squared distances from each point to its nearest centroid
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_

In [None]:
plt.plot(K, distortions, 'bx-') 
plt.xlabel('Values of K') 
plt.ylabel('Distortion') 
plt.title('The Elbow Method using Distortion') 
plt.show() 
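
Reading the elbow off the plot is subjective. As a rough cross-check, one possible heuristic is to pick the k where the curve bends most sharply, measured by the second difference of the distortion values. The function name and the numbers below are hypothetical and are not taken from this notebook's output.

```python
def elbow_by_second_difference(ks, distortions):
    """Return the k where the distortion curve bends most sharply.

    Heuristic only: picks the k with the largest second difference
    d[i-1] - 2*d[i] + d[i+1].
    """
    if len(ks) != len(distortions) or len(ks) < 3:
        raise ValueError("need at least three (k, distortion) pairs")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = distortions[i - 1] - 2 * distortions[i] + distortions[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical distortion values for k = 1..7.
example_ks = list(range(1, 8))
example_distortions = [1.00, 0.70, 0.42, 0.16, 0.14, 0.13, 0.12]
print(elbow_by_second_difference(example_ks, example_distortions))
```

With the notebook's own variables the call would be `elbow_by_second_difference(list(K), distortions)`. The result should be treated as a hint to compare against the plot, not as a definitive answer.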

### 3.2 DBSCAN

In [None]:
# Normalize the data so that every attribute lies in [0, 1]. This is an essential step before clustering:
# Euclidean distance is very sensitive to differences in scale, so without it attributes with larger
# ranges would dominate the distances and degrade the quality of the clusters.
x_player = top_players.values # numpy array
scaler = preprocessing.MinMaxScaler()
x_scaled_player = scaler.fit_transform(x_player)
X_norm_player = pd.DataFrame(x_scaled_player)
pca = PCA(n_components = 2) # 2D PCA for the plot
reduced2 = pd.DataFrame(pca.fit_transform(X_norm_player))
# train the model using DBSCAN
db = DBSCAN(eps=0.34, min_samples=25)
# prediction for dbscan clusters
db_clusters = db.fit_predict(reduced2)

In [None]:
reduced2['cluster'] = db_clusters
reduced2['name_and_class'] = name_and_class
reduced2.columns = ['x', 'y', 'cluster', 'name_and_class']
reduced2.head()

In [None]:
### Plot DBSCAN
sns.set(style="white")
ax = sns.lmplot(x="x", y="y", hue='cluster', data = reduced2, legend=False,
                   fit_reg=False, size = 10, scatter_kws={"s": 200})

texts = []
for x, y, s in zip(reduced2.x, reduced2.y, reduced2.name_and_class):
    texts.append(plt.text(x, y, s))

set_size(10,25)
ax.set(ylim=(-1.5, 1.7))
plt.tick_params(labelsize=15)
plt.xlabel("PC 1", fontsize = 20)
plt.ylabel("PC 2", fontsize = 20)

plt.show()

#### Optimal eps is around 0.315, min_samples = 20

In [None]:
reduced2 = pd.DataFrame(pca.fit_transform(X_norm_player))
# train the model using DBSCAN
db = DBSCAN(eps=0.24, min_samples=20)
# prediction for dbscan clusters
db_clusters = db.fit_predict(reduced2)
reduced2['cluster'] = db_clusters
reduced2['name_and_class'] = name_and_class
reduced2.columns = ['x', 'y', 'cluster', 'name_and_class']
### Plot DBSCAN
sns.set(style="white")
ax = sns.lmplot(x="x", y="y", hue='cluster', data = reduced2, legend=False,
                   fit_reg=False, size = 10, scatter_kws={"s": 200})

texts = []
for x, y, s in zip(reduced2.x, reduced2.y, reduced2.name_and_class):
    texts.append(plt.text(x, y, s))

set_size(10,25)
ax.set(ylim=(-1.5, 1.7))
plt.tick_params(labelsize=15)
plt.xlabel("PC 1", fontsize = 20)
plt.ylabel("PC 2", fontsize = 20)

plt.show()
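
A common way to narrow down `eps` for DBSCAN is the k-distance curve: sort every point's distance to its k-th nearest neighbour and look for the knee. The sketch below computes that curve with plain NumPy on hypothetical data (a tight blob plus one outlier). The function name is made up for illustration, and k would typically be set close to `min_samples`.

```python
import numpy as np

def k_distance_curve(points, k):
    """Sorted distance from each point to its k-th nearest neighbour (self excluded)."""
    points = np.asarray(points, dtype=float)
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    # Full pairwise Euclidean distance matrix; fine for a few thousand points.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0).
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)


# Hypothetical 2-D data: a tight blob plus one far-away outlier.
rng = np.random.default_rng(1)
blob = rng.normal(loc=0.0, scale=0.05, size=(50, 2))
outlier = np.array([[3.0, 3.0]])
curve = k_distance_curve(np.vstack([blob, outlier]), k=5)
print(curve[:3], curve[-1])
```

Plotting `curve` for the PCA-reduced player data and reading off the value where it turns sharply upward gives a candidate `eps` to compare against the values tried above.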

## Part 4: Recommendation
We are going to build a stats viewer for aspiring football players to gauge which class of player suits the stats that their camp coaches gave them. The purpose is to help them develop their skills further in their chosen class or, if they want to switch and train for another class, to recommend which skills they need to work on to be more like the players of that class, stats-wise.

In real life, if this were used during training camps, it would help coaches and students develop the skills needed, introduce students to football, and train them to become better players. If the measurement of these stats were standardized, it would give coaches and students better insights, help schools develop better training programs, and help aspiring athletes visualize their abilities.

Parts of the recommendation:
   1. Start of training camp (Class Prediction, Player Assessment and Development):
            a. 24 features: With 24 attributes, predict the player's position class.
            b. Decide what skills they need to work on for their desired position.
   2. End of training camp (Show improvement):
            a. Show players their improvement.
            b. Inspire them by showing how they compare with the top 1%.
            
Recap on Stats:
1. All Stats.
        - Pace :      {sprint_speed, acceleration}
        - Shooting :  {finishing, long_shots, shot_power, positioning, penalties, volleys, free_kick_accuracy}
        - Passing :   {short_passing, vision, crossing, long_passing, curve}
        - Dribbling : {dribbling, ball_control, agility, balance}
        - Defending : {marking, standing_tackle, interceptions, heading_accuracy, sliding_tackle}
        - Physical :  {strength, stamina, aggression, jumping,reactions}
        - GoalKeeping : {gk_diving,gk_handling, gk_kicking, gk_positioning, gk_reflexes}
        Excluded
        - Does not affect overall rating : {overall_rating,potential,preferred_foot,attacking_work_rate,defensive_work_rate}
        
2. 24 Included Stats for Prediction in Amateurs:
        - Shooting :  {finishing,long_shots,shot_power,positioning, penalties, volleys, free_kick_accuracy}
        - Passing : {short_passing,vision,crossing,long_passing,curve}
        - Dribbling : {dribbling,ball_control}
        - Defending : {marking, standing_tackle,interceptions,heading_accuracy,sliding_tackle}
        - GoalKeeping : {gk_diving,gk_handling, gk_kicking, gk_positioning, gk_reflexes}
        - Excluded, does not affect overall rating : {overall_rating,potential,preferred_foot,attacking_work_rate,defensive_work_rate}
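
A quick sanity check of the grouping: with `free_kick_accuracy` counted under Shooting only, the five groups should add up to exactly 24 attributes with no attribute listed twice. The dictionary name below is hypothetical.

```python
from collections import Counter

# The five groups used for prediction, each attribute listed once.
FEATURE_GROUPS = {
    "Shooting": ["finishing", "long_shots", "shot_power", "positioning",
                 "penalties", "volleys", "free_kick_accuracy"],
    "Passing": ["short_passing", "vision", "crossing", "long_passing", "curve"],
    "Dribbling": ["dribbling", "ball_control"],
    "Defending": ["marking", "standing_tackle", "interceptions",
                  "heading_accuracy", "sliding_tackle"],
    "GoalKeeping": ["gk_diving", "gk_handling", "gk_kicking",
                    "gk_positioning", "gk_reflexes"],
}

# Flatten the groups and look for attributes that appear more than once.
all_features = [name for group in FEATURE_GROUPS.values() for name in group]
duplicates = [name for name, count in Counter(all_features).items() if count > 1]

print(len(all_features), duplicates)
```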

In [None]:
import ipywidgets as widgets
from ipywidgets import Layout, Button, Box, FloatText, Textarea, Dropdown, Label, IntSlider
from math import pi

In [None]:
top_players.columns

#### Look at a professional player's data attributes:

In [None]:
X_player.head(3)

#### Declare Model and Feature: Logistic Regression 15 Features.

In [None]:
@ignore_warnings(category=ConvergenceWarning)
@ignore_warnings(category=DataConversionWarning)
def set_up_player_class_model(x_players,y_players,num_best_features):
    print("For number of best features: " + str(num_best_features))
    ### apply SelectKBest class to extract top k best features
    # chi2 requires non-negative feature values
    bestfeatures = SelectKBest(score_func=chi2, k=num_best_features)
    fit = bestfeatures.fit(x_players,y_players)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(x_players.columns)
    #concat two dataframes for better visualization 
    featureScores = pd.concat([dfcolumns,dfscores],axis=1)
    featureScores.columns = ['Features','Score']  #naming the dataframe columns
    featureScores = featureScores.sort_values(by=['Score'],ascending = False)
    #list the features to test
    player_attributes_reduced = featureScores.Features.head(num_best_features).to_list()
    print("Features selected: ")
    print(player_attributes_reduced)
    #get columns of attributes to train on
    #(note: the training data is read from the global `players` DataFrame, not from x_players/y_players)
    X_player = players.filter(player_attributes_reduced,axis=1)
    #get column to predict
    Y_player = players.filter(["player_position_class"],axis=1)
    #Split Train and Test Data (the 20% test split is held out but not scored here)
    Train_X, Test_X, Train_Y, Test_Y = train_test_split(X_player,Y_player,test_size=0.2)
    LR = LogisticRegression(solver = 'lbfgs', multi_class = 'multinomial' ,random_state=3,max_iter =5000)
    LR.fit(Train_X,Train_Y.values.ravel())
    return LR
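
The function above selects features on the full dataset and never scores the held-out split. One possible alternative, sketched here on hypothetical synthetic data (not the player data), is to wrap selection and classifier in a `Pipeline` so that feature scoring only sees the training rows, and then to report accuracy on the test rows. All variable names below are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical data: 3 classes, 2 informative features, 2 noise features.
# All values are non-negative, as chi2 requires.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 60)
X = np.column_stack([
    20 + 30 * y + rng.integers(0, 6, size=y.size),   # informative
    80 - 30 * y + rng.integers(0, 6, size=y.size),   # informative
    rng.integers(40, 60, size=y.size),               # noise
    rng.integers(40, 60, size=y.size),               # noise
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Selection is fitted on the training rows only, then reused at predict time.
pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

selected_mask = pipe.named_steps["select"].get_support()
test_accuracy = pipe.score(X_test, y_test)
print(selected_mask, test_accuracy)
```

Because the pipeline stores the selection step, the same columns are picked automatically whenever the fitted model is used to predict new players.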

In [None]:
X_player

#### 4.1 Create: Model for predicting player class at the start of training camp (fast identification for a lot of people)

In [None]:
model_24_atts = set_up_player_class_model(X_player,Y_player,24)

#### Divide players into their player class: Attack, Defence, Midfield, and Goalkeeper
Get the average stats of the players in each class:

In [None]:
players.head(1)

In [None]:
for pos_class in sorted(players.player_position_class.unique()):
     print(pos_class)

In [None]:
#Make Data frame with the average and top 1% of each class
class_stats = []
column_name = []
for pos_class in sorted(players.player_position_class.unique()):
    print(pos_class)
    player_class = players[players['player_position_class'] == pos_class].sort_values(by=['overall_rating'], ascending=False).drop(columns=['potential','preferred_foot','attacking_work_rate','defensive_work_rate','player_name','playerID','specific_player_position','player_role','player_position_class','id','player_fifa_api_id','date','player_position_class_cat'])
    player_summary = player_class.describe(percentiles = [.10, .50, .99,] )
    player_summary.insert(0, 'Pos_class', pos_class)
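    # describe() rows: count, mean, std, min, 10%, 50%, 99%, max
    class_stats.append(player_summary.iloc[1,:].to_list())  # row 1: mean (class average)
    class_stats.append(player_summary.iloc[6,:].to_list())  # row 6: 99th percentile (top 1%)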
    column_name = player_summary.columns.to_list()

professional_players_stats = pd.DataFrame(class_stats,columns=column_name)
professional_players_stats
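
The positional lookups `iloc[1]` and `iloc[6]` rely on the row order that `describe` produces for these three percentiles. A minimal sketch on a hypothetical one-column frame shows that order, and that selecting by label gives the same rows:

```python
import pandas as pd

# Hypothetical ratings 1..100, only to make the summary easy to verify by hand.
ratings = pd.DataFrame({"overall_rating": range(1, 101)})
summary = ratings.describe(percentiles=[.10, .50, .99])

# Row order: count, mean, std, min, 10%, 50%, 99%, max.
print(summary.index.tolist())

# Label-based selection is equivalent to positions 1 and 6.
mean_row = summary.loc["mean"]
top_1_percent_row = summary.loc["99%"]
print(mean_row["overall_rating"], top_1_percent_row["overall_rating"])
```

Selecting by label is a little more robust, since the positions shift if the list of percentiles changes.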

### Start of training camp (Class Prediction, Player Assessment and Development):

#### Input 24 features to predict the player's class:
Recap on Stats:
1. 24 Included Stats for Prediction in Amateurs:
        - Shooting(7) :  {finishing,long_shots,shot_power,positioning, penalties, volleys, free_kick_accuracy}
        - Passing(5) : {short_passing,vision,crossing,long_passing,curve}
        - Dribbling(2) : {dribbling,ball_control}
        - Defending(5) : {marking, standing_tackle,interceptions,heading_accuracy,sliding_tackle}
        - GoalKeeping(5) : {gk_diving,gk_handling, gk_kicking, gk_positioning, gk_reflexes}
The inputs are arranged in this order.
     

In [None]:
form_item_layout = Layout(
    display='flex',
    flex_flow='row',
    justify_content='space-between'
)

form_items = [
    Box([Label(value= r'\(\color{Green} {' + 'New Player Attribute'  + '}\)')], layout=form_item_layout),
    #Shooting 7 
    Box([Label(value= r'\(\color{blue} {' + 'Shooting'  + '}\)')], layout=form_item_layout),
        Box([Label(value='Finishing'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Long shots'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Shot power'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Positioning'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Penalties'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Volleys'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Free Kick Accuracy'), IntSlider(min=5, max=100)], layout=form_item_layout),
    #Passing 5 
    Box([Label(value=r'\(\color{blue} {' + 'Passing'  + '}\)')], layout=form_item_layout),
        Box([Label(value='Short Passing'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Vision'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Crossing'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Long Passing'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Curve'), IntSlider(min=5, max=100)], layout=form_item_layout),
    
    #Dribbling 2
    Box([Label(value=r'\(\color{blue} {' + 'Dribbling'  + '}\)')], layout=form_item_layout),
        Box([Label(value='Dribbling'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Ball Control'), IntSlider(min=5, max=100)], layout=form_item_layout),
    
    #Defending 5 (same order as the new_players columns, so each slider maps to the right column)
    Box([Label(value=r'\(\color{blue} {' + 'Defending'  + '}\)')], layout=form_item_layout),
        Box([Label(value='Marking'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Standing Tackle'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Interceptions'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Heading Accuracy'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Sliding Tackle'), IntSlider(min=5, max=100)], layout=form_item_layout),
    
    #Goal Keeping 5
    Box([Label(value=r'\(\color{blue} {' + 'Goal Keeping'  + '}\)')], layout=form_item_layout),
        Box([Label(value='Diving'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Handling'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Kicking'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Positioning'), IntSlider(min=5, max=100)], layout=form_item_layout),
        Box([Label(value='Reflexes'), IntSlider(min=5, max=100)], layout=form_item_layout),
    #Name
    Box([Label(value=r'\(\color{blue} {' + 'Name of Player'  + '}\)')], layout=form_item_layout),
        Box([Label(value='Name'), Textarea()], layout=form_item_layout),
]

form = Box(form_items, layout=Layout(
    display='flex',
    flex_flow='column',
    border='solid 2px',
    align_items='stretch',
    width='50%'
))


form

In [None]:
#Declare new batch of players
new_players = pd.DataFrame(columns=[
    'name',
    'finishing','long_shots','shot_power','positioning','penalties','volleys','free_kick_accuracy',
    'short_passing','vision','crossing','long_passing','curve',
    'dribbling','ball_control',
    'marking', 'standing_tackle','interceptions','heading_accuracy','sliding_tackle',
    'gk_diving','gk_handling', 'gk_kicking', 'gk_positioning', 'gk_reflexes'])

def get_new_player():
    new_player = []
    #Player Name
    new_player.append(form.children[31].children[1].value.rstrip("\n"))
    #Shooting 7
    new_player.append(form.children[2].children[1].value)
    new_player.append(form.children[3].children[1].value)
    new_player.append(form.children[4].children[1].value)
    new_player.append(form.children[5].children[1].value)
    new_player.append(form.children[6].children[1].value)
    new_player.append(form.children[7].children[1].value)
    new_player.append(form.children[8].children[1].value)
    #Passing 5
    new_player.append(form.children[10].children[1].value)
    new_player.append(form.children[11].children[1].value)
    new_player.append(form.children[12].children[1].value)
    new_player.append(form.children[13].children[1].value)
    new_player.append(form.children[14].children[1].value)                      
    #Dribbling 2
    new_player.append(form.children[16].children[1].value)
    new_player.append(form.children[17].children[1].value)
    #Defending 5
    new_player.append(form.children[19].children[1].value)
    new_player.append(form.children[20].children[1].value)
    new_player.append(form.children[21].children[1].value)
    new_player.append(form.children[22].children[1].value)
    new_player.append(form.children[23].children[1].value)
    #Goal Keeping 5
    new_player.append(form.children[25].children[1].value)
    new_player.append(form.children[26].children[1].value)
    new_player.append(form.children[27].children[1].value)
    new_player.append(form.children[28].children[1].value)
    new_player.append(form.children[29].children[1].value)
    
    #Add player to new players (new_players.loc adds the row in place)
    new_players.loc[len(new_players), :] = new_player
    return new_player
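
The hard-coded `form.children[...]` indices only work while the form layout and the column list stay in step. One way to keep them in step is to derive the indices from a single specification. The sketch below is a hypothetical helper (`FORM_SPEC` and `child_index_by_column` are not part of the notebook) and assumes the layout used above: one title box at index 0, then a heading box followed by one box per input for every group, with inputs displayed in column order.

```python
# (group heading, [(input label, DataFrame column), ...]) in display order.
FORM_SPEC = [
    ("Shooting", [("Finishing", "finishing"), ("Long shots", "long_shots"),
                  ("Shot power", "shot_power"), ("Positioning", "positioning"),
                  ("Penalties", "penalties"), ("Volleys", "volleys"),
                  ("Free Kick Accuracy", "free_kick_accuracy")]),
    ("Passing", [("Short Passing", "short_passing"), ("Vision", "vision"),
                 ("Crossing", "crossing"), ("Long Passing", "long_passing"),
                 ("Curve", "curve")]),
    ("Dribbling", [("Dribbling", "dribbling"), ("Ball Control", "ball_control")]),
    ("Defending", [("Marking", "marking"), ("Standing Tackle", "standing_tackle"),
                   ("Interceptions", "interceptions"),
                   ("Heading Accuracy", "heading_accuracy"),
                   ("Sliding Tackle", "sliding_tackle")]),
    ("Goal Keeping", [("Diving", "gk_diving"), ("Handling", "gk_handling"),
                      ("Kicking", "gk_kicking"), ("Positioning", "gk_positioning"),
                      ("Reflexes", "gk_reflexes")]),
    ("Name of Player", [("Name", "name")]),
]


def child_index_by_column(spec, first_index=1):
    """Map each column name to its position in form.children.

    Assumes one title box at index 0, then for every group one heading
    box followed by one box per input.
    """
    index_of = {}
    position = first_index
    for _heading, fields in spec:
        position += 1              # skip the group's heading box
        for _label, column in fields:
            index_of[column] = position
            position += 1
    return index_of


index_of = child_index_by_column(FORM_SPEC)
print(index_of["finishing"], index_of["gk_reflexes"], index_of["name"])
```

With such a mapping, a value could be read as `form.children[index_of[column]].children[1].value`, and reordering the sliders would no longer silently shift values into the wrong columns.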


In [None]:
#Call this function every time you want to add a new player, after adjusting the input sliders above
new_player = get_new_player()

In [None]:
new_players

### Predict New Batch of Players

In [None]:
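#Drop the prediction column from a previous run; errors='ignore' avoids a KeyError on the first run
new_players = new_players.drop(columns=['player_initial_position'], errors='ignore')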
new_players

In [None]:
#Show Players their initial classes
new_players['player_initial_position'] = model_24_atts.predict(new_players.drop(columns=['name']))
new_players
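
One thing to watch when predicting: the classifier expects the feature columns in the same order as the data it was fitted on, while `new_players` uses the form's order. Depending on the scikit-learn version, a mismatch may be rejected or silently accepted with values read into the wrong features. A minimal sketch of aligning the columns first, where `align_features` and `training_order` are hypothetical names and the two-feature frame is made up:

```python
import pandas as pd

def align_features(frame, feature_order):
    """Return only the model's features, in the order used at fit time."""
    missing = [name for name in feature_order if name not in frame.columns]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return frame.loc[:, list(feature_order)]


# Hypothetical example; `training_order` stands in for the column order
# of the DataFrame the classifier was fitted on.
training_order = ["gk_reflexes", "finishing"]
batch = pd.DataFrame({"name": ["A"], "finishing": [70], "gk_reflexes": [12]})
aligned = align_features(batch, training_order)
print(aligned)
```

In this notebook the training order would be the list printed as "Features selected" by `set_up_player_class_model`, assuming that list is kept available after the model is built.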

In [None]:
#Save 
#new_players.to_csv('./data/new_players.csv',index=False)

#### Conclusion from Initial Prediction: none of the players is predicted to be good enough to be a league-level attacker

## Join professional players and Marvin for easier comparison of data

In [None]:
professional_players_stats

In [None]:
def compare_player_stats(professional_player,player,loc):
    #League summary rows restricted to the class label plus the 24 attributes
    compare_player1 = professional_player.filter(['Pos_class','gk_reflexes', 'gk_diving', 'gk_positioning', 'gk_handling', 'gk_kicking', 'marking', 'sliding_tackle', 'standing_tackle', 'finishing', 'interceptions', 'positioning', 'volleys', 'long_shots', 'dribbling', 'curve', 'heading_accuracy', 'free_kick_accuracy', 'crossing', 'ball_control', 'penalties', 'shot_power', 'long_passing', 'short_passing', 'vision'])
    #Row of the new player at position loc, without the predicted class (if it has been added)
    new_row = player.drop(columns=['player_initial_position'], errors='ignore').iloc[[loc]]
    #pd.concat replaces DataFrame.append, which is no longer available in recent pandas versions
    compare_player1 = pd.concat([compare_player1, new_row], sort=False)
    ##Arrange player_stats column in this order
    compare_player1 = compare_player1[['Pos_class','finishing','long_shots','shot_power','positioning', 'penalties', 'volleys', 'free_kick_accuracy',
  'short_passing','vision','crossing','long_passing','curve',
  'dribbling','ball_control',
  'marking', 'standing_tackle','interceptions','heading_accuracy','sliding_tackle',
  'gk_diving','gk_handling', 'gk_kicking', 'gk_positioning','gk_reflexes',]]
    
    return compare_player1
def plot_radar(idx):
    # categories
    categories = compare_stats.iloc[idx,1:].index.tolist()
    N = len(categories) # get number of categories
    
    # values
    values= compare_stats.iloc[idx,1:].values.tolist()
    values += values[:1] # repeat first value to close poly
    # calculate angle for each category
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1] # repeat first angle to close poly
    # plot
    plt.polar(angles, values, marker='.') # lines
    plt.fill(angles, values, alpha=0.3) # area
    # xticks
    plt.xticks(angles[:-1], categories)
    # yticks
    plt.gca().set_rlabel_position(0) # yticks position on the current polar axes
    plt.yticks([20,40,60,80,100], color="grey", size=10)
    plt.ylim(0,100)
def show_player_vs_league():
    #new player (appended as the last row of compare_stats)
    idx1 = 8
    #league average: attackers
    idx2 = 0
    #league average: defenders
    idx3 = 2
    #league average: goalkeepers
    idx4 = 4
    #league average: mid-fielders
    idx5 = 6

    fig = plt.figure(figsize=(25,25))
    #radar 1: new player vs attacker league average
    ax = plt.subplot(221, polar=True)
    plt.title('Attacker',color='blue',fontsize=20)
    plot_radar(idx1)
    plot_radar(idx2)

    #radar 2: new player vs defender league average
    ax = plt.subplot(222, polar=True)
    plt.title('Defender',color='blue',fontsize=20)
    plot_radar(idx1)
    plot_radar(idx3)

    #radar 3: new player vs goalkeeper league average
    ax = plt.subplot(223, polar=True)
    plt.title('Goal Keeper',color='blue',fontsize=20)
    plot_radar(idx1)
    plot_radar(idx4)

    #radar 4: new player vs mid-fielder league average
    ax = plt.subplot(224, polar=True)
    plt.title('Mid-Fielder',color='blue',fontsize=20)
    plot_radar(idx1)
    plot_radar(idx5)
    plt.show()
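
The radar plot spreads the N categories evenly around the circle and repeats the first point so the outline closes. That step can be isolated and checked on its own. The helper name and the four values below are hypothetical.

```python
from math import pi

def closed_polygon(values):
    """Return (angles, values) with the first point repeated to close the outline."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    # One angle per category, evenly spaced over the full circle.
    angles = [i / n * 2 * pi for i in range(n)]
    return angles + angles[:1], list(values) + list(values[:1])


# Four hypothetical attribute values.
angles, closed = closed_polygon([70, 55, 80, 40])
print(angles)
print(closed)
```

Both returned lists have N + 1 entries, which is why `plt.xticks` in `plot_radar` uses `angles[:-1]` for the N category labels.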

In [None]:
compare_stats  = compare_player_stats(professional_players_stats,new_players,8)
compare_stats

### Plot Players

In [None]:
new_players

In [None]:
compare_stats

### Show the new players stats against league average

In [None]:
form_item_layout = Layout(
    display='flex',
    flex_flow='row',
    justify_content='space-between'
)

form_items = [
    Box([Label(value= r'\(\color{Green} {' + 'Player'  + ' index' + '}\)')], layout=form_item_layout),
        Box([Label(value='Index'), IntSlider(min=0, max=int(new_players.name.count()-1))], layout=form_item_layout),]

form1 = Box(form_items, layout=Layout(
    display='flex',
    flex_flow='column',
    border='solid 2px',
    align_items='stretch',
    width='50%'
))


form1

#### Compare league vs the new players

In [None]:
compare_stats  = compare_player_stats(professional_players_stats,new_players,form1.children[1].children[1].value)
show_player_vs_league()
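
Besides the radar plots, the skills to work on can be listed directly by ranking the shortfall between a player and the league average of the desired class. A minimal sketch with hypothetical numbers (the function name, the four attributes and all values are made up):

```python
def skill_gaps(player, target_average, top_n=3):
    """Largest shortfalls of `player` against `target_average`, biggest first."""
    gaps = {
        skill: target_average[skill] - player[skill]
        for skill in target_average
        if skill in player and target_average[skill] > player[skill]
    }
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical numbers for illustration only.
attacker_average = {"finishing": 68, "dribbling": 70, "marking": 30, "volleys": 60}
trainee = {"finishing": 50, "dribbling": 66, "marking": 45, "volleys": 35}
print(skill_gaps(trainee, attacker_average))
```

Skills where the player already meets or exceeds the class average (here `marking`) are left out, so the list contains only what needs training.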

### During Training Camp: 
The players would then choose whether to train in the position they are good at or develop their skills towards their desired position. Let's see if the training makes a difference in the end.

In our example we pick Marvin Gaye: he is predicted to be a mid-fielder but aspires to be an attacker. Let's see if training him as an attacker will make him become one.

#### Show Marvin: Where he is at now vs league standard at his position vs his aspired position

#### Show Marvin his current Progress: Is he good enough to be an attacker?

In [None]:
#Call this function every time you want to add a new player, after adjusting the input sliders above
marvin_gaye_new = get_new_player()