# Find Your Idols: Clustering and Neighbor Analysis

Find your favorite star in the FIFA-21 data set and analyze his strengths and weaknesses!

![FIFA21](https://nimg.ws.126.net/?url=http%3A%2F%2Fdingyue.ws.126.net%2F2020%2F0619%2Fb596f42bj00qc62tw0048c000su00g0m.jpg&thumbnail=650x2147483647&quality=80&type=jpg)

With the help of machine learning and feature engineering, you can predict each player's position on the pitch and explore the correlations among the attributes. You can also customize your own player and find an idol player to learn from.

#### Caution: 
Since code running on Kaggle cannot read your input directly, 
**I strongly recommend that you copy this notebook into your own notebook, then uncomment and run the commented-out code in Part Ⅴ**
to try your own data and enjoy yourself!

This notebook will proceed in the following order:

### Ⅰ. Data Cleaning
### Ⅱ. Descriptive Statistics
### Ⅲ. Feature Engineering
### Ⅳ. Classifier
### Ⅴ. Neighbor Analysis and Display

Wish you a pleasant journey through the FIFA-21 data set!

## Ⅰ. Data Cleaning

Data cleaning is a necessary step before analysis or mining. In this part, we drop or fix invalid data and transform some columns into a better format for further analysis.

First, let's import the FIFA-21 data set and take a quick look at it. The original column names are long, so I have shortened some of them.

In [None]:
import numpy as np
import pandas as pd
from pandas import read_csv

file_name='../input/fifa-21-complete-player-dataset/players_21.csv'
data=read_csv(file_name)
pd.set_option('display.max_columns',None)
data.rename(columns={'height_cm':'Height(cm)','weight_kg':'Weight(kg)','skill_moves':'Skill Moves',
                        'attacking_crossing':'Crossing','attacking_finishing':'Finishing',
                        'attacking_heading_accuracy':'Heading Accuracy','attacking_short_passing':'Short Passing',
                        'attacking_volleys':'Volleys','skill_dribbling':'Dribbling',
                        'skill_curve':'Curve','skill_fk_accuracy':'FK. Accuracy',
                        'skill_long_passing':'Long Passing','skill_ball_control':'Ball Control',
                        'movement_acceleration':'Acceleration','movement_sprint_speed':'Sprint Speed',
                        'movement_agility':'Agility','movement_reactions':'Reactions',
                        'movement_balance':'Balance','power_shot_power':'Shot Power',
                        'power_jumping':'Jumping','power_stamina':'Stamina',
                        'power_strength':'Strength','power_long_shots':'Long Shots',
                        'mentality_aggression':'Aggression','mentality_interceptions':'Interceptions',
                        'mentality_positioning':'Positioning','mentality_vision':'Vision',
                        'mentality_penalties':'Penalties','mentality_composure':'Composure',
                        'defending_marking':'Marking','defending_standing_tackle':'Standing Tackle',
                        'defending_sliding_tackle':'Sliding Tackle'},inplace=True)
data

To make the subsequent work smoother, I keep only the first entry of the column 'player_positions' as the player's position, and compute a fuzzy position for each player:

ST, CF -> ST

LW, RW, LM, RM -> WF

CAM, CDM, CM -> MF

LWB, RWB, LB, RB -> WB

CB -> CB

GK -> GK

This mapping is used later when looking for similar players.
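The grouping listed above can also be expressed as a lookup table. The following is only a minimal sketch: `ROUGH_POSITION` and `rough_position` are illustrative names that do not appear in this notebook, and the 'MF' label for central midfielders follows the `pos_trans` function in the code cell of this part.

```python
# Hypothetical lookup table: primary position -> fuzzy position
ROUGH_POSITION = {
    'ST': 'ST', 'CF': 'ST',
    'LW': 'WF', 'RW': 'WF', 'LM': 'WF', 'RM': 'WF',
    'CAM': 'MF', 'CDM': 'MF', 'CM': 'MF',
    'LWB': 'WB', 'RWB': 'WB', 'LB': 'WB', 'RB': 'WB',
    'CB': 'CB',
    'GK': 'GK',
}

def rough_position(player_positions):
    """Map a comma-separated position string to its fuzzy position."""
    # Keep only the first listed position, as the notebook does
    primary = player_positions.partition(',')[0].strip()
    return ROUGH_POSITION.get(primary)  # None for unknown positions

print(rough_position('RW, ST, CF'))  # WF
```

A dictionary keeps the mapping in one place, so adding or regrouping a position is a one-line change.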

I also add a new column 'Right Foot', an integer in [-4, 4] that indicates how much stronger the player's right foot is than the left (the 'weak_foot' rating runs from 1 to 5). For example, Messi's 'Right Foot' is -1, while Ronaldo's is 1.

A tricky point is that the 'Marking' column is empty in the original data set. I estimate the missing values roughly from the 'defending' (DEF) column, by solving the weighted formula for DEF (the one used for 'defending' in Part Ⅴ) for 'Marking'.
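To make the two derived columns concrete, here is a small standalone sketch. The function names and the sample numbers are made up for illustration; the weights are the ones used for 'defending' in Part Ⅴ of this notebook, and the 1 to 5 range of `weak_foot` is an assumption about the data set.

```python
def right_foot_score(preferred_foot, weak_foot):
    """Signed footedness: positive means the right foot is stronger."""
    gap = 5 - weak_foot  # weak_foot is assumed to be a 1-5 star rating
    return gap if preferred_foot == 'Right' else -gap

def estimate_marking(defending, standing_tackle, interceptions,
                     heading_accuracy, sliding_tackle):
    """Solve DEF = 0.3*MAR + 0.3*STA + 0.2*INT + 0.1*HEA + 0.1*SLI for MAR."""
    return int((10 * defending - 3 * standing_tackle - 2 * interceptions
                - heading_accuracy - sliding_tackle) / 3)

# A left-footed player with a 4-star weak foot
print(right_foot_score('Left', 4))           # -1

# DEF = 0.3*80 + 0.3*70 + 0.2*60 + 0.1*50 + 0.1*40 = 66, so Marking comes back as 80
print(estimate_marking(66, 70, 60, 50, 40))  # 80
```

Because DEF is itself a rounded value in the data, the recovered 'Marking' is only an approximation for real players.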

In [None]:
def sim_pos(row):
    return row.player_positions.partition(',')[0]

def foot_trans(row):
    if row.preferred_foot=='Right':
        return 5-row.weak_foot
    else:
        return row.weak_foot-5
    
def pos_trans(row):
    if row.player_positions in ['ST','CF']:
        return 'ST'
    if row.player_positions in ['LW','RW','LM','RM']:
        return 'WF'
    if row.player_positions in ['CAM','CDM','CM']:
        return 'MF'
    if row.player_positions in ['LWB','RWB','LB','RB']:
        return 'WB'
    if row.player_positions in ['CB']:
        return 'CB'
    if row.player_positions in ['GK']:
        return 'GK'

def calc_marking(row):
    return int((10*row['defending']-3*row['Standing Tackle']-2*row['Interceptions']
                -row['Heading Accuracy']-row['Sliding Tackle'])/3) if row['player_positions']!='GK' else np.nan
    
data['player_positions']=data.apply(sim_pos,axis=1)
data['Right Foot']=data.apply(foot_trans,axis=1)
data['rough_position']=data.apply(pos_trans,axis=1)
data['Marking']=data.apply(calc_marking, axis=1)


Divide the data into two groups: gkdata and nogkdata.

In [None]:
gkdata=data[data['player_positions']=='GK']
gkdata=gkdata.dropna(axis=1,how='all')
gkdata=gkdata.dropna(axis=0,how='any',subset=('gk_diving','gk_handling','gk_kicking','gk_reflexes','gk_speed','gk_positioning'))
nogkdata=data[data['player_positions']!='GK']
nogkdata=nogkdata.dropna(axis=1,how='all')
nogkdata=nogkdata.dropna(axis=0,how='any',subset=('pace','shooting','passing','dribbling','defending','physic'))

## Ⅱ. Descriptive Statistics

In this part, we draw plots that describe the features of the whole data set.

Let us start by counting the players in each position.

The pie chart shows that CB ranks first, while CF, LWB, and RWB are relatively rare. The fuzzy positions defined earlier seem reasonable.

In [None]:
import plotly.express as px

all_pos=['LS','ST','RS','LW','LF','CF','RF','RW','LAM','CAM','RAM','LM','LCM','CM','RCM','RM','LWB','LDM','CDM','RDM','RWB',
         'LB','LCB','CB','RCB','RB','GK']
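# Count players per primary position, keeping the display order of all_pos
counts=data['player_positions'].value_counts()
sta_pos=pd.DataFrame([[counts.get(pos,0) for pos in all_pos]],columns=all_pos)

# Drop positions that never occur as a primary position
sta_pos=sta_pos.loc[:,~(sta_pos==0).all()]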

fig = px.pie(sta_pos, values=sta_pos.loc[0,:], names=sta_pos.columns, 
             title='Proportion of Each Position',
             color_discrete_sequence=px.colors.sequential.Plotly3)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

For non-GK players and GK players, we draw separate box plots to view the distributions of their basic attributes.

We can see that shooting and defending are widely spread, and that goalkeepers generally have low speed.

In [None]:
import plotly.graph_objects as go

col1='rgb(0,152,158)'
col2='rgb(180,215,141)'
col3='rgb(255,230,122)'
col4='rgb(246,172,26)'
col5='rgb(216,34,13)'
col6='#c77eb5'
cols=[col1,col2,col3,col4,col5,col6]

nogklabel=['PAC','SHO','PAS','DRI','DEF','PHY']
nogkcol=['pace','shooting','passing','dribbling','defending','physic']
nogkval=[nogkdata.loc[:,i] for i in nogkcol]
gklabel=['DIV','HAN','KIC','REF','SPD','POS']
gkcol=['gk_diving','gk_handling','gk_kicking','gk_reflexes','gk_speed','gk_positioning']
gkval=[gkdata.loc[:,i] for i in gkcol]

for i in range(2):
    fig=go.Figure()
    for label, val, col in zip(nogklabel if i==0 else gklabel, nogkval if i==0 else gkval, cols):
        fig.add_trace(go.Box(
            y=val,
            name=label,
            boxpoints='outliers',
            pointpos=-1.2,
            marker_color=col,
            line_color=col
        ))
    fig.update_layout(title_text="Non-GK Boxplot" if i==0 else "GK Boxplot")
    fig.show()

The next two heat maps show how players in different positions differ.

The x axis is the player's position. In the first heat map, the y axis is the player's overall rating in 5-point bins, and the color of each cell represents the number of players with that rating at that position.
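The binning behind the first heat map can be sketched without pandas. This is an illustration on made-up data, not the notebook's own cell: `rating_bin` is a hypothetical helper, and the bin edges are an assumption (each bin includes its lower bound, so 85 covers 85 to 89).

```python
from collections import Counter

def rating_bin(overall, width=5):
    """Lower bound of the fixed-width bin that contains `overall`."""
    return (overall // width) * width

# Made-up (position, overall) pairs, for illustration only
players = [('ST', 91), ('ST', 87), ('CB', 85), ('CB', 62), ('ST', 85)]

# Count players per (rating bin, position) cell of the heat map
cells = Counter((rating_bin(overall), pos) for pos, overall in players)

print(cells[(85, 'ST')])  # 2
print(cells[(60, 'CB')])  # 1
```

Each key of `cells` corresponds to one square of the heat map, and its count determines the color.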

In [None]:
all_pos=['ST','LW','CF','RW','CAM','LM','CM','RM','LWB','CDM','RWB','LB','CB','RB','GK']
segs=[90,85,80,75,70,65,60,55,50,45]
sta=pd.DataFrame(np.zeros(150).reshape((10,15)),index=segs,columns=all_pos)

def add(row):
    # Assign the player to the highest bin whose lower bound he reaches
    for i in segs:
        if row.overall>=i:
            seg=i
            break
    sta.loc[seg,row.player_positions]+=1
    
data.apply(add,axis=1)
fig=go.Figure(data=go.Heatmap(
    x=all_pos,
    y=segs,
    z=sta,
    type='heatmap',
    colorscale='Viridis'))
fig.update_layout(title='Overall Rating Distribution Group by Player Position')
fig.show()

In the second heat map, the y axis lists the player's basic attributes, and the color of each cell represents the average value of that attribute at that position.

In [None]:
nogk_pos=['ST','LW','CF','RW','CAM','LM','CM','RM','LWB','CDM','RWB','LB','CB','RB']
segnames=['PAC','SHO','PAS','DRI','DEF','PHY']
segs=['pace','shooting','passing','dribbling','defending','physic']
segnames=segnames[::-1]
segs=segs[::-1]
sta=pd.DataFrame(np.zeros(84).reshape((6,14)),index=segs,columns=nogk_pos)

for seg in segs:
    for pos in nogk_pos:
        # Average value of this attribute over all players at this position
        sta.loc[seg,pos]=nogkdata.loc[nogkdata['player_positions']==pos,seg].mean()
        
fig=go.Figure(data=go.Heatmap(
    x=nogk_pos,
    y=segnames,
    z=sta,
    type='heatmap',
    colorscale='Viridis'))
fig.update_layout(title='Average Level of Each Attribute Group by Player Position')
fig.show()

## Ⅲ. Feature Engineering

Here we use PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis) separately to extract features from the attributes. 

Using the top three components, we project the players whose overall rating is 80 or higher into three-dimensional space to see their distribution.

Zooming in on the LDA projection, we can easily spot among the outliers some players who are good at playing multiple positions, such as Fernandinho, Javi Martínez and Thomas Müller.
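For readers who want to see what the PCA projection does under the hood, here is a minimal NumPy sketch. It is a stand-in for `sklearn.decomposition.PCA`, not the code this notebook runs; `pca_project` is a hypothetical helper and the data is synthetic. LDA differs in that it also uses the position labels, which this sketch does not cover.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)                # variance ratio per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # synthetic stand-in for player attributes
X[:, 0] *= 5                     # give one feature a dominant variance

Z, ratio = pca_project(X)
print(Z.shape)                   # (200, 3)
print(ratio.sum())               # share of total variance kept by 3 components
```

The sum of `ratio` plays the same role as the "Total Explained Variance" shown in the PCA plot title.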

In [None]:
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

attributes=['Height(cm)','Weight(kg)','Right Foot','Skill Moves',
            'Crossing','Finishing','Heading Accuracy','Short Passing','Volleys',
            'Dribbling','Curve','FK. Accuracy','Long Passing','Ball Control',
            'Acceleration','Sprint Speed','Agility','Reactions','Balance',
            'Shot Power','Jumping','Stamina','Strength','Long Shots',
            'Aggression','Interceptions','Positioning','Vision','Penalties','Composure',
            'Marking','Standing Tackle','Sliding Tackle']

show_data=nogkdata[nogkdata.overall>=80]
X = show_data[attributes]
Y = show_data['rough_position']

pca = PCA(n_components=3)
components_pca = pca.fit_transform(X)

total_var = pca.explained_variance_ratio_.sum() * 100

fig = px.scatter_3d(
    components_pca, x=0, y=1, z=2, color=Y,color_discrete_map={'ST':col1,'WF':col2,'MF':col3,'WB':col4,'CB':col5},
    title=f'PCA (Total Explained Variance: {total_var:.2f}%)',
    labels={'0': 'Dim 1', '1': 'Dim 2', '2': 'Dim 3'},
    hover_name=show_data['short_name']#,opacity=0.8
)
fig.show()
lda = LinearDiscriminantAnalysis(n_components=3)
components_lda = lda.fit_transform(X,Y)

fig = px.scatter_3d(
    components_lda, x=0, y=1, z=2, color=Y,color_discrete_map={'ST':col1,'WF':col2,'MF':col3,'WB':col4,'CB':col5},
    title='LDA',
    labels={'0': 'Dim 1', '1': 'Dim 2', '2': 'Dim 3'},
    hover_name=show_data['short_name']#,opacity=0.8
)
fig.show()

## Ⅳ. Classifier

In this part, we use Naive Bayes and Random Forest to classify players by position.

I also tried a support vector machine at first, but it was painfully slow on this much data with this many classes.

First, we need to split the data into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(nogkdata[attributes], nogkdata['player_positions'], test_size=0.3)

With naive Bayes, we get an accuracy of about 60%. Not satisfying enough, is it?

In [None]:
from sklearn.naive_bayes import GaussianNB

bayes = GaussianNB()
bayes.fit(X_train, y_train)
print(bayes.score(X_test, y_test))

Random forest improves the accuracy to about 70%. To be honest, in comparison with deep neural networks, this isn't a good performance either. But a random forest is convenient to use, which is an advantage.

We did not try neural networks in this notebook. Note that the three-component PCA in Part Ⅲ explains less than 80% of the variance, which suggests that high-dimensional input is needed to achieve higher accuracy.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
print(rfc.score(X_test, y_test))

Let's look at the feature importances of the random forest. 'Sliding Tackle' weighs the most, while 'Skill Moves' has little effect.

In [None]:
fi=rfc.feature_importances_

fig = go.Figure()
sorted_indexes = fi.argsort()[::-1]
fig = px.bar(pd.DataFrame(np.array(attributes)[sorted_indexes],fi[sorted_indexes]),
             x=np.array(attributes)[sorted_indexes], y=fi[sorted_indexes],
             color=fi[sorted_indexes],
             hover_name=np.array(attributes)[sorted_indexes],
             labels={'x':'','y':'Importance'}, height=500, opacity=0.8
             )
fig.update_layout(title='Feature Importances in Random Forest',xaxis=dict(tickangle=45))

If we allow the random forest to give the three most likely answers, the accuracy rate can be increased to more than 80%.
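Top-3 accuracy counts a prediction as correct when the true label is among the three classes with the highest predicted probability. A minimal sketch on toy data follows; `top_k_accuracy`, the class list and the probabilities are made up for illustration and are not outputs of the model above.

```python
import numpy as np

def top_k_accuracy(probas, y_true, classes, k=3):
    """Fraction of rows whose true label is among the k most probable classes."""
    # Class indices per row, sorted by descending probability, truncated to k
    top_k = np.argsort(probas, axis=1)[:, ::-1][:, :k]
    hits = [y in classes[row] for y, row in zip(y_true, top_k)]
    return float(np.mean(hits))

classes = np.array(['CB', 'CM', 'ST', 'WB'])      # column order of probas
probas = np.array([[0.10, 0.20, 0.60, 0.10],
                   [0.40, 0.30, 0.20, 0.10],
                   [0.05, 0.15, 0.30, 0.50]])
y_true = ['ST', 'CM', 'ST']

print(top_k_accuracy(probas, y_true, classes, k=1))  # only the first row is a hit
print(top_k_accuracy(probas, y_true, classes, k=3))  # all three rows are hits
```

With a fitted scikit-learn classifier, `classes` would be `classes_` and `probas` the result of `predict_proba`, since the probability columns follow the order of `classes_`.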

In [None]:
# Class labels in the column order of predict_proba
attr_labels=list(rfc.classes_)

probas=rfc.predict_proba(X_test[attributes])

# Indices of the three most probable classes per row, most probable first
top3=np.argsort(probas,axis=1)[:,::-1][:,:3]
rk=rfc.classes_[top3]

# A prediction counts as correct if the true position is among the top three
acc=sum(y_test.iloc[i] in rk[i] for i in range(len(y_test)))

print(acc/len(y_test))

Now we define a position prediction function. It will be used in the next part.

In [None]:
def rfc_predict_pos(attr_data):
    # Probabilities for the first (and only) row of attr_data
    proba=rfc.predict_proba(attr_data)[0]

    # Class indices sorted by descending probability; keep the top three
    rk=np.argsort(proba)[::-1][:3]
    labels=rfc.classes_

    print('Most probable position: {}'.format(labels[rk[0]]))
    print('Potential positions: {} {}'.format(labels[rk[1]],labels[rk[2]]))

## Ⅴ. Neighbor Analysis and Display

This is the most interesting part of all! Here you can find your favorite player and analyze his strengths and weaknesses. You can also input your own attributes to see how you would perform in FIFA-21.

#### Caution: 
Since code running on Kaggle cannot read your input directly, 
**I strongly recommend that you copy this notebook into your own notebook, then uncomment and run the commented-out code**
to try your own data and enjoy yourself!

Each player in the data set is uniquely identified by a sofifa_id. Please enter a sofifa_id to continue.

The sofifa_id values of some famous players are as follows:

FW:

L. Messi: 158023

C. Ronaldo: 20801

Lewandowski: 188545

Neymar Jr: 190871

K. Mbappé: 231747

MF:

K. De Bruyne: 192985

F. de Jong: 228702

J. Lingard: 207494

J. Kimmich: 212622

T. Kroos: 182521

DF:

V. van Dijk: 203376

L. Bonucci: 184344

Azpilicueta: 184432

D. Alaba: 197445

Sergio Ramos: 155862

GK:

J. Oblak: 200389

M. ter Stegen: 192448

T. Courtois: 192119

M. Neuer: 167495

De Gea: 193080

In [None]:
sofifa_id=207494

'''
#!RELEASE! !THIS! !ANNOTATION!

sofifa_id=int(input('Please input a sofifa_id:\n'
                    'Examples: L. Messi:158023, J. Oblak:200389, F. de Jong:228702, J. Lingard:207494\n\n'))

#!RELEASE! !THIS! !ANNOTATION!
'''

In [None]:
def base_info(player):
    if player.empty:
        print("Can't find a player with sofifa_id = "+str(sofifa_id))
    else:
        player=player.reset_index()
        print()
        print('Sofifa_id: '+str(player['sofifa_id'][0]))
        print('Name: '+player['short_name'][0])
        print('Age: '+str(player['age'][0]))
        print('Nation: '+player['nationality'][0])
        print('Club: '+str(player['club_name'][0]))  # str() guards against a missing club_name
        print('Position: '+player['player_positions'][0])
        print('Overall Rating: '+str(player['overall'][0]))
        
player=data[data['sofifa_id']==sofifa_id].reset_index()
base_info(player)

Then draw a radar plot of the chosen player to show his basic attributes.

In [None]:
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
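# Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')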

color1=(0,152/255,158/255,0.8)
color2=(180/255,215/255,141/255,0.8)
color3=(255/255,230/255,122/255,0.8)
color4=(246/255,172/255,26/255,0.8)
color5=(216/255,34/255,13/255,0.8)
colors=[color1,color2,color3,color4,color5]

def radar_plot(player):

    player=player.reset_index()

    angles=np.linspace(0,2*np.pi,6,endpoint=False)
    angles=np.concatenate((angles, [0]))

    if player.player_positions[0]!='GK':
        attributes=player.loc[0,['pace','shooting','passing','dribbling','defending','physic','pace']]
        labels=('PAC\n{:d}'.format(int(player.loc[0,'pace'])),
                'SHO\n{:d}\n'.format(int(player.loc[0,'shooting'])),
                'PAS\n{:d}\n'.format(int(player.loc[0,'passing'])),
                'DRI\n{:d}'.format(int(player.loc[0,'dribbling'])),
                '\nDEF\n{:d}'.format(int(player.loc[0,'defending'])),
                '\nPHY\n{:d}'.format(int(player.loc[0,'physic'])),
                'PAC\n{:d}'.format(int(player.loc[0,'pace'])))
    else:
        attributes=player.loc[0,['gk_diving','gk_handling','gk_kicking','gk_reflexes','gk_speed','gk_positioning','gk_diving']]
        labels=('DIV\n{:d}'.format(int(player.loc[0,'gk_diving'])),
                'HAN\n{:d}\n'.format(int(player.loc[0,'gk_handling'])),
                'KIC\n{:d}\n'.format(int(player.loc[0,'gk_kicking'])),
                'REF\n{:d}'.format(int(player.loc[0,'gk_reflexes'])),
                '\nSPD\n{:d}'.format(int(player.loc[0,'gk_speed'])),
                '\nPOS\n{:d}'.format(int(player.loc[0,'gk_positioning'])),
                'DIV\n{:d}'.format(int(player.loc[0,'gk_diving'])))

    fig=plt.figure(figsize=(6,6))
    rect0 = [0, 0, 1, 1]
    rect1 = [0.00005, 0.00005, 0.9999, 0.9999]
    rect2 = [0.1, 0.1, 0.8, 0.8]
    rect3 = [0.2, 0.2, 0.6, 0.6]
    rect4 = [0.3, 0.3, 0.4, 0.4]
    rect5 = [0.4, 0.4, 0.2, 0.2]

    rects=[rect1,rect2,rect3,rect4,rect5]

    for rect,col in zip(rects,colors):
        bk=plt.axes(rect,projection='polar')
        bk.fill(angles,[1,1,1,1,1,1,1],color=col)
        bk.set_ylim(0,1)
        bk.axis('off')

    rdp=plt.axes(rect0,projection='polar')
    rdp.plot(angles,attributes,color='#222222',linestyle="-",lw=2)
    rdp.fill(angles,attributes,color='#222222',alpha=0.3)
    rdp.set_thetagrids(angles/np.pi*180,labels,fontsize=13)
    rdp.set_title(player.short_name[0],fontsize=20)
    rdp.patch.set_alpha(0)
    rdp.set_rticks([])
    rdp.set_ylim(0,100)
    #rdp.axis('off')
    plt.show()
    
radar_plot(player)

Next, we will analyze the strengths and weaknesses of this player.

We select all players who share this player's fuzzy position and whose overall rating is within 5 points of his. Then we rank the player on each attribute among the selected players; the attributes on which he ranks highest are his strengths, and those on which he ranks lowest are his weaknesses.

Finally, the bar plots show, for each of these attributes, the percentage of similar players that this player is stronger than, as well as the percentage that he is weaker than.
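The ranking step boils down to counting how many peers have a higher value for an attribute. Here is a small standalone sketch on made-up numbers; `share_beaten` is a hypothetical helper, not a function of this notebook.

```python
def share_beaten(player_value, peer_values):
    """Percentage of peers whose value does not exceed the player's."""
    higher = sum(1 for v in peer_values if v > player_value)
    return round(100 * (1 - higher / len(peer_values)))

# Made-up 'Finishing' values of peers with the same fuzzy position
# and a similar overall rating
peers = [70, 75, 80, 85, 90, 65, 72, 88, 79, 81]

print(share_beaten(86, peers))  # 80: only two peers (88 and 90) are higher
print(share_beaten(66, peers))  # 10: nine of the ten peers are higher
```

An attribute with a high share is a candidate strength, and one with a low share is a candidate weakness. Ties count in the player's favor here, because only strictly higher peers are counted.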

In [None]:
def str_wea(player):
    nearby_data=data[(data.rough_position==player.rough_position[0])
                     &(abs(data.overall-player.overall[0])<=5)
                     &(data.sofifa_id!=player.sofifa_id[0])]

    comp_attr=[i for i in attributes if i not in ['Height(cm)','Weight(kg)','Right Foot','Skill Moves']]

    rank=pd.DataFrame(np.zeros(len(comp_attr)).reshape(1,-1),columns=comp_attr)
    for attr in comp_attr:
        # Number of similar players who are strictly better on this attribute
        rank.loc[0,attr]=(nearby_data[attr]>player[attr][0]).sum()
    rank=rank.apply(lambda col: col/len(nearby_data),axis=0)
    sorted_rank=rank.loc[0,:].argsort()

    strength,weakness=[],[]
    for attr in np.array(comp_attr)[sorted_rank]:
        if (rank.loc[0,attr]<=0.10 or len(strength)<3) and not np.isnan(player[attr][0]):
            strength.append(rank[attr])
    for attr in np.array(comp_attr)[sorted_rank[::-1]]:
        if (rank.loc[0,attr]>=0.90 or len(weakness)<3) and not np.isnan(player[attr][0]):
            weakness.append(rank[attr])
    strength=pd.DataFrame(strength).T.apply(lambda col:1-col,axis=0)
    weakness=pd.DataFrame(weakness).T

    str_x=pd.concat([strength, 1-strength], axis=0).T
    str_x.columns=['over','beneath']
    str_x['over']=pd.to_numeric(100*str_x['over']).round(0).astype(int)
    str_x['beneath']=pd.to_numeric(100*str_x['beneath']).round(0).astype(int)
    str_x=np.array(str_x)
    str_y=list(strength.columns)
    wea_x=pd.concat([weakness, 1-weakness], axis=0).T
    wea_x.columns=['over','beneath']
    wea_x['over']=pd.to_numeric(100*wea_x['over']).round(0).astype(int)
    wea_x['beneath']=pd.to_numeric(100*wea_x['beneath']).round(0).astype(int)
    wea_x=np.array(wea_x)
    wea_y=list(weakness.columns)

    colors=['#00C12B','#F76F87']
    top_labels=['Stronger than','Weaker than']
    titles=['Strength','Weakness']

    for t in range(2):
        if t==1:
            colors=colors[::-1]
            top_labels=top_labels[::-1]
        fig = go.Figure()
        x_data,y_data=[str_x,wea_x][t],[str_y,wea_y][t] 
        for i in range(0, len(x_data[0])):
            for xd, yd in zip(x_data, y_data):
                fig.add_trace(go.Bar(
                    x=[xd[i]], y=[yd],
                    orientation='h',
                    marker=dict(color=colors[i],opacity=0.8,
                        line=dict(color='rgb(248, 248, 249)'))))

        fig.update_layout(
            xaxis=dict(
                showgrid=False,
                showline=False,
                showticklabels=False,
                zeroline=False,
                domain=[0.1, 1]
            ),
            yaxis=dict(
                showgrid=False,
                showline=False,
                showticklabels=False,
                zeroline=False,
            ),
            barmode='stack',
            paper_bgcolor='rgb(248, 248, 255)',
            plot_bgcolor='rgb(248, 248, 255)',
            showlegend=False,
        )

        annotations = []

        for yd, xd in zip(y_data, x_data):
            annotations.append(dict(xref='paper', yref='y',
                                    x=0.09, y=yd,
                                    xanchor='right',
                                    text=str(yd)+' ({})'.format(str(int(player.loc[0,yd]))),
                                    font=dict(family='Arial',size=15,
                                              color='rgb(67, 67, 67)'),
                                    showarrow=False))

            space = 0
            for i in range(len(xd)):
                    annotations.append(dict(xref='x', yref='y',
                                            x=space + (xd[i]/2), y=yd,
                                            text=str(xd[i]) + '%',
                                            font=dict(family='Arial',size=15,
                                                      color='rgb(248, 248, 255)'),
                                            showarrow=False))
                    if yd == y_data[-1]:
                        annotations.append(dict(xref='x', yref='paper',
                                                x=space + (xd[i]/2), y=1+0.7/len(y_data),
                                                text=top_labels[i],
                                                font=dict(family='Arial',size=15,
                                                          color='rgb(67, 67, 67)'),
                                                showarrow=False))
                    space += xd[i]

        fig.update_layout(annotations=annotations,title=titles[t]+' of {}'.format(str(player.short_name[0])),
                          width=990, height=40*len(y_data)+110,
                          margin=dict(l=80, r=0, b=10, t=100, autoexpand=True),font=dict(size=20))

        fig.show()
        
str_wea(player)

Now comes the most exciting part! Feel free to customize your own player, or input 'D' to use the default data (which describes the author of this notebook).

From the detailed attributes you enter, the basic attributes such as shooting and defending are calculated with weighted formulas.
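Each basic attribute is a weighted sum of detailed attributes, truncated to an integer. The sketch below shows two of them in isolation. The weights are copied from the 'pace' and 'physic' formulas in the formula cell of this part; `WEIGHTS`, `basic_attribute` and the sample values are illustrative only.

```python
# Weights taken from the notebook's 'pace' and 'physic' formulas
WEIGHTS = {
    'pace': {'Acceleration': 0.45, 'Sprint Speed': 0.55},
    'physic': {'Strength': 0.50, 'Stamina': 0.25,
               'Aggression': 0.20, 'Jumping': 0.05},
}

def basic_attribute(name, attrs):
    """Weighted sum of detailed attributes, truncated to an integer."""
    return int(sum(w * attrs[key] for key, w in WEIGHTS[name].items()))

# Made-up detailed attributes, for illustration only
attrs = {'Acceleration': 80, 'Sprint Speed': 90,
         'Strength': 60, 'Stamina': 70, 'Aggression': 50, 'Jumping': 64}

print(basic_attribute('pace', attrs))    # 0.45*80 + 0.55*90 = 85.5 -> 85
print(basic_attribute('physic', attrs))  # 30 + 17.5 + 10 + 3.2 = 60.7 -> 60
```

The weights of each formula sum to 1, so a basic attribute always lies between the lowest and highest of its inputs.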

In [None]:
typ='D'

'''
#!RELEASE! !THIS! !ANNOTATION!

typ=input('If you want to type in your own data, then type in "O" to continue.\n'
          'If you want to use the default data, then type in "D".\n\n')
        
#!RELEASE! !THIS! !ANNOTATION!
'''

In [None]:
if typ=='D':
    user_attr=pd.DataFrame(np.array([np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 180, 60, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
        np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 4, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
        np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
        60, 50, 40, 60, 60, 55, 30, 50, 50, 55, 50, 50, 50, 55, 55, 40, 40, 60, 40, 40, 50,
        50, 65, 60, 50, 50, 40, 30, 20, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
        np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
        np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 3, np.nan]).reshape(1,-1),index=[0],columns=nogkdata.columns)
    user_attr['short_name']='Lei Luo'
    
else:
    print('\nPlease type in your attributes in the order of:')
    user_attr=pd.DataFrame(index=[0],columns=nogkdata.columns)
    user_attr.loc[0,'short_name']=input('Name:\n')
    for attr in attributes:
        user_attr.loc[0,attr]=int(input('{}:\n'.format(attr)))
        
# Each basic attribute is a weighted sum of detailed attributes, truncated to an integer.
# astype(int) is used because calling int() on a Series is deprecated in recent pandas.
user_attr['pace']=(0.45*user_attr['Acceleration']
                   +0.55*user_attr['Sprint Speed']).astype(int)
user_attr['shooting']=(0.05*user_attr['Positioning']
                       +0.45*user_attr['Finishing']
                       +0.20*user_attr['Long Shots']
                       +0.05*user_attr['Penalties']
                       +0.20*user_attr['Shot Power']
                       +0.05*user_attr['Volleys']).astype(int)
user_attr['passing']=(0.35*user_attr['Short Passing']
                      +0.20*user_attr['Vision']
                      +0.20*user_attr['Crossing']
                      +0.15*user_attr['Long Passing']
                      +0.05*user_attr['Curve']
                      +0.05*user_attr['FK. Accuracy']).astype(int)
user_attr['dribbling']=(0.35*user_attr['Dribbling']
                        +0.35*user_attr['Ball Control']
                        +0.10*user_attr['Agility']
                        +0.05*user_attr['Balance']
                        +0.10*user_attr['Reactions']
                        +0.05*user_attr['Composure']).astype(int)
user_attr['defending']=(0.30*user_attr['Marking']
                        +0.30*user_attr['Standing Tackle']
                        +0.20*user_attr['Interceptions']
                        +0.10*user_attr['Heading Accuracy']
                        +0.10*user_attr['Sliding Tackle']).astype(int)
user_attr['physic']=(0.50*user_attr['Strength']
                     +0.25*user_attr['Stamina']
                     +0.20*user_attr['Aggression']
                     +0.05*user_attr['Jumping']).astype(int)

In [None]:
rfc_predict_pos(user_attr[attributes])
radar_plot(user_attr)

We can also find the player most similar to you in the data set, according to the attributes you input.

It is worth noting that this recommendation is based on Euclidean distance.
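A nearest-neighbor search by Euclidean distance can be written in a few vectorized lines. This is a sketch on a made-up three-attribute table, not the notebook's own search; `nearest_player` is a hypothetical helper.

```python
import numpy as np

def nearest_player(query, candidates):
    """Index and distance of the row closest to `query`, skipping exact copies."""
    dists = np.linalg.norm(candidates - query, axis=1)
    dists[dists == 0] = np.inf   # ignore the query itself if it is in the table
    best = int(np.argmin(dists))
    return best, float(dists[best])

# Made-up attribute rows, for illustration only
candidates = np.array([[60., 50., 40.],
                       [62., 49., 43.],
                       [90., 85., 80.]])
query = np.array([60., 50., 40.])

index, distance = nearest_player(query, candidates)
print(index)     # 1: the second row is the closest one that is not identical
print(distance)  # sqrt(2**2 + 1**2 + 3**2)
```

Euclidean distance is sensitive to the overall level of the attributes, so the closest player tends to have a similar overall rating as well as a similar profile.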

In [None]:
self=user_attr
min_dis=-1
min_id=0

def find_sim(row):
    global min_dis,min_id
    dis=np.linalg.norm(row[attributes]-self.loc[0,attributes])
    if dis!=0 and (dis<min_dis or min_dis==-1):
        min_dis=dis
        min_id=row.sofifa_id

def gk_find_sim(row):
    global min_dis,min_id
    gk_attr=['Height(cm)','Weight(kg)','Right Foot','gk_diving','gk_handling','gk_kicking','gk_reflexes','gk_speed','gk_positioning']
    dis=np.linalg.norm(row[gk_attr]-self.loc[0,gk_attr])
    if dis!=0 and (dis<min_dis or min_dis==-1):
        min_dis=dis
        min_id=row.sofifa_id

nogkdata.apply(find_sim,axis=1)
sim_player=nogkdata[nogkdata['sofifa_id']==min_id]
print('Similar Player Info:')
base_info(sim_player)
radar_plot(sim_player)

We can also find idol players for you to learn from.

This recommendation is based on cosine similarity: it returns the most similar player among those whose overall rating is at least 80.
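Cosine similarity compares the direction of two attribute vectors and ignores their length, which is why it suits an "idol" search: a much stronger player with the same profile still scores close to 1. The sketch below contrasts it with Euclidean distance on made-up vectors; all names and numbers are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two attribute vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

me = np.array([40., 60., 30.])     # made-up amateur profile
idol = np.array([60., 90., 45.])   # same profile, 1.5 times the level
rival = np.array([45., 55., 35.])  # close in level, different profile

# The idol points in the same direction but is far away in absolute terms
print(cosine_similarity(me, idol), np.linalg.norm(me - idol))

# The rival is nearby in absolute terms but points in a slightly different direction
print(cosine_similarity(me, rival), np.linalg.norm(me - rival))
```

Because of floating-point rounding, a similarity that is mathematically 1 may come out very slightly below or above it, so comparing with a tolerance is safer than testing for exact equality.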

In [None]:
self=user_attr
max_cos=-1
max_id=0

def find_idol(row):
    global max_cos,max_id
    vector1=row[attributes]
    vector2=self.loc[0,attributes]
    cos=np.dot(vector1,vector2)/(np.linalg.norm(vector1)*(np.linalg.norm(vector2)))
    if cos!=1 and cos>max_cos:
        max_cos=cos
        max_id=row.sofifa_id

show_data.apply(find_idol,axis=1)
idol_player=nogkdata[nogkdata['sofifa_id']==max_id]
print('Idol Player Info:')
base_info(idol_player)
radar_plot(idol_player)

Finally, let's take an overall look at the selected player's performance.

In [None]:
base_info(player)
radar_plot(player)

self=player
min_dis=-1
min_id=0

if player['player_positions'][0]!='GK':
    nogkdata.apply(find_sim,axis=1)
    sim_player=nogkdata[nogkdata['sofifa_id']==min_id]
else:
    gkdata.apply(gk_find_sim,axis=1)
    sim_player=gkdata[gkdata['sofifa_id']==min_id]
    
print('\nSimilar Player of {}:'.format(str(player.short_name[0])))
base_info(sim_player)
radar_plot(sim_player)

str_wea(player)

##### That's all. Thank you for reading, and feel free to build on and improve this code.

##### Any comments and suggestions will be appreciated.