<h1 style='background:#AD2CDA; border:0; color:white'><center>EDA for NBA 2k20 player dataset</center></h1>

<center><img src="https://store-images.s-microsoft.com/image/apps.54802.14513657308079221.c5077776-1962-4ea7-a75e-ae4bfaeddc0c.f813dfc0-93f4-408f-91a4-9123fd6a9801"></center>

<h2 style='background:#AD2CDA; border:0; color:white'><center>Basic libraries</center></h2>

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random
from datetime import date

In [None]:
class Plot:
    """Small helper for drawing a grid of regression plots."""

    def __init__(self):
        # Palette of matplotlib named colours to draw from.
        self.color = ['r', 'g', 'b', 'y', 'orange', 'grey', 'lightcoral', 'crimson',
                      'springgreen', 'teal', 'c', 'm', 'gold', 'skyblue', 'darkolivegreen',
                      'tomato']

    def regplot_one_vs_many(self, x, y, data, rows, cols):
        """Plot `x` against every feature in `y` on a rows x cols grid."""
        # Shuffle the palette once so each subplot gets a distinct colour;
        # colours repeat only if there are more features than colours.
        palette = random.sample(self.color, len(self.color))
        for n, feature in enumerate(y, start=1):
            colour = palette[(n - 1) % len(palette)]
            plt.subplot(rows, cols, n)
            plt.subplots_adjust(hspace=0.5, wspace=0.5)
            sns.regplot(x=x, y=feature, data=data, color=colour)


plots = Plot()

<h2 style='background:#AD2CDA; border:0; color:white'><center>Analysis</center></h2>

In [None]:
nba_df = pd.read_csv('/kaggle/input/nba2k20-player-dataset/nba2k20-full.csv')
nba_df.head()

<h3><center>Let's check how many rows and columns the dataset has, as well as the names of the columns.</center></h3>

In [None]:
print(f'There are {nba_df.shape[0]} rows and {nba_df.shape[1]} columns.\n')
print(f'Column names: {nba_df.columns.values}')

<h4><center>Each row is one player, so we have 429 players in this dataset.</center></h4>

<h2 style='background:#AD2CDA; border:0; color:white'><center>Data Types of Columns</center></h2>

<h3><center>We often need to manipulate the values in a dataset's columns. It helps to know which column holds which type of data, since we may need to cast a column to another data type or choose operations that suit its current type.</center></h3>

In [None]:
nba_df.dtypes
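
As an illustration of casting, here is a minimal sketch on a toy frame. The column names and the value formats (a "$" prefix on salaries, a "#" prefix on jersey numbers) are assumptions made for the example, not something confirmed by the output above; check `nba_df.dtypes` and `nba_df.head()` before applying the same idea to the real data.

```python
import pandas as pd

# Hypothetical rows: names and value formats are assumptions for illustration.
toy = pd.DataFrame({
    'full_name': ['Player A', 'Player B'],
    'salary': ['$1500000', '$900000'],
    'jersey': ['#23', '#0'],
})

# Strip the non-numeric prefix, then cast the remaining digits to integers.
toy['salary'] = toy['salary'].str.lstrip('$').astype('int64')
toy['jersey'] = toy['jersey'].str.lstrip('#').astype('int64')

print(toy.dtypes)
```

Once a column is numeric, aggregations such as `mean()` or `max()` behave as expected instead of operating on strings.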

<h2 style='background:#AD2CDA; border:0; color:white'><center>Identifying NULL values</center></h2>

<h3><center>A dataset is almost never clean, so we need to identify which columns have null values. Once we have identified them, we can either fill them with a suitable replacement or remove them altogether.</center></h3>

In [None]:
print(nba_df.isna().sum())
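
The two options mentioned above (replace or remove) could look roughly like this. It is a sketch on a toy frame; the column names and the replacement labels ("Free agent", "No college") are hypothetical choices for illustration, not the notebook's actual cleaning step.

```python
import pandas as pd

# Hypothetical rows with some missing values.
toy = pd.DataFrame({
    'full_name': ['Player A', 'Player B', 'Player C'],
    'team': ['Team X', None, 'Team Y'],
    'college': [None, 'College Z', None],
})

# Option 1: replace missing values with an explicit label, column by column.
filled = toy.fillna({'team': 'Free agent', 'college': 'No college'})

# Option 2: drop only the rows where a required column is missing.
dropped = toy.dropna(subset=['team'])

print(filled)
print(f'Rows kept after dropna: {len(dropped)}')
```

Which option fits depends on the column: a label keeps every player in the analysis, while dropping rows is simpler when the missing field is essential.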

<h2 style='background:#AD2CDA; border:0; color:white'><center>Now let's analyze each column of the table.</center></h2>

In [None]:
print("Highest rating: ", nba_df.rating.max(), " - ", nba_df[nba_df.rating == nba_df.rating.max()].full_name.values[0])

In [None]:
print("Lowest rating: ", nba_df.rating.min(), " - ", nba_df[nba_df.rating == nba_df.rating.min()].full_name.values[0])

<h3><center>Let's see what the average rating is across all players.</center></h3>

In [None]:
n_rating_bins = nba_df.rating.nunique()  # one histogram bin per distinct rating value
print('Average rating value: {}'.format(round(nba_df.rating.mean(), 1)))

In [None]:
fig = plt.figure(figsize = (10, 5))
plt.hist(nba_df.rating, bins=n_rating_bins)
plt.xlabel('Rating')
plt.ylabel('Players')
plt.title('Players and ratings histogram')
plt.show()
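
The histogram above sets the number of bins to the number of distinct ratings. If some integer ratings are absent, equal-width bins will not line up with whole numbers. One possible alternative, sketched here on made-up ratings, is to place bin edges at half-integers so that each integer rating falls in its own bin.

```python
import numpy as np

# Made-up ratings for illustration only.
ratings = np.array([70, 70, 72, 75, 75, 75, 78])

# Edges at 69.5, 70.5, ..., 78.5: each integer rating gets exactly one bin.
edges = np.arange(ratings.min() - 0.5, ratings.max() + 1.5, 1)
counts, _ = np.histogram(ratings, bins=edges)

print(counts)
```

The same `edges` array can be passed as `bins=edges` to `plt.hist`.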

In [None]:
def age(born):
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
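
To see how the tuple comparison in `age` works, here is a small sketch that uses the same arithmetic but takes the reference date as a parameter. The name `age_on` and the dates are hypothetical, introduced only so the result does not depend on the day the notebook is run.

```python
from datetime import date


def age_on(born, today):
    """Age in whole years on `today`; same arithmetic as age(), with the date passed in."""
    # The comparison is True (counted as 1) when the birthday has not yet occurred this year.
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))


born = date(1990, 6, 15)
print(age_on(born, date(2020, 6, 14)))  # the day before the birthday: 29
print(age_on(born, date(2020, 6, 15)))  # on the birthday: 30
```

Note that `age` in the notebook uses `date.today()`, so the ages it produces change depending on when the notebook is executed.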

In [None]:
nba_df['b_day'] = pd.to_datetime(nba_df['b_day'])
nba_df['age'] = nba_df['b_day'].apply(age)

<h3><center>5 oldest players.</center></h3>

In [None]:
nba_df.sort_values(by = 'age', ascending = False)[['full_name', 'rating', 'team', 'age']].head(5)

<h3><center>5 youngest players.</center></h3>

In [None]:
nba_df.sort_values(by = 'age', ascending = True)[['full_name', 'rating', 'team', 'age']].head(5)

<h3><center>How does the player's age affect the rating?</center></h3>

In [None]:
values = ['rating']
plt.figure(1, figsize = (10, 4))
plots.regplot_one_vs_many(x = 'age', y = values, data = nba_df, rows = 1, cols = 1)
plt.title('Scatter Plot of Age vs Rating')
plt.show()
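
A regression plot shows the trend visually; a correlation coefficient puts a number on it. The sketch below uses made-up ages and ratings, so the printed value says nothing about the real dataset. In the notebook, the equivalent call would be `nba_df['age'].corr(nba_df['rating'])`.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    'age': [20, 23, 26, 29, 32, 35],
    'rating': [72, 75, 80, 83, 79, 76],
})

# Series.corr computes the Pearson correlation by default (range -1 to 1).
r = toy['age'].corr(toy['rating'])
print(f'Pearson correlation between age and rating: {r:.2f}')
```

A value near 0 suggests little linear relationship, although a curved pattern (ratings peaking mid-career, for example) can still hide behind a low coefficient.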

<h3><center>Let's see what jersey numbers NBA players prefer to wear.</center></h3>

In [None]:
jersey_counts = nba_df.jersey.value_counts(dropna=False)  # count once, reuse below
labels = list(jersey_counts.index)
values = list(jersey_counts.values)

x = np.arange(len(labels))
fig, ax = plt.subplots(figsize=(10,10))
rects = ax.barh(x, values)

ax.set_xlabel('Players')
ax.set_ylabel('Jersey numbers')
ax.set_yticks(ticks=x)
ax.set_yticklabels(labels)
ax.set_title('Jersey numbers')

plt.show()

<h4><center>Interesting: a surprisingly large number of players have "0" as their jersey number.</center></h4>

<h3><center>Let's see how many players are on each team.</center></h3>

In [None]:
team_counts = nba_df.team.value_counts(dropna=False)  # count once, reuse below
labels = list(team_counts.index)
values = list(team_counts.values)

x = np.arange(len(labels))
fig, ax = plt.subplots()
rects = ax.barh(x, values)

ax.set_xlabel('Number of players in teams')
ax.set_ylabel('Team')
ax.set_yticks(ticks=x)
ax.set_yticklabels(labels)
ax.set_title('Number of players in teams')

plt.show()

<h4><center>The team is "nan" for more than 20 players. These are presumably free agents who are not currently on any team.</center></h4>

<h3><center>Let's display the exact information about free agents.</center></h3>

In [None]:
free_agents = nba_df[nba_df['team'].isna()]
print(f'Total free agents: {free_agents.shape[0]}')
free_agents

<h4><center>Here we see that the 22 free agents are recorded in the dataset with jersey number "0", which explains why there are so many players with jersey number "0".</center></h4>
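
One way to check this kind of claim directly is to count how many of the "0" jerseys belong to players without a team. The sketch below runs on a toy frame; the "#0" string format and the rows themselves are assumptions for illustration, so adapt the comparison to whatever the `jersey` column actually contains.

```python
import pandas as pd

# Hypothetical rows: a missing team marks a free agent.
toy = pd.DataFrame({
    'full_name': ['A', 'B', 'C', 'D', 'E'],
    'team': ['Team X', None, None, 'Team Y', None],
    'jersey': ['#0', '#0', '#0', '#23', '#0'],
})

toy_free_agents = toy[toy['team'].isna()]

# Compare the "#0" count among free agents with the count over all players.
zero_total = int((toy['jersey'] == '#0').sum())
zero_free = int((toy_free_agents['jersey'] == '#0').sum())

print(f'{zero_free} of {zero_total} players with jersey "#0" are free agents')
```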

<h3><center>Let's see the information regarding the positions of the players.</center></h3>

In [None]:
fig = plt.figure(figsize = (10, 5))
sns.countplot(x = 'position', data = nba_df, order = nba_df['position'].value_counts().index)
plt.xlabel('Player positions')
plt.ylabel('Count of players')
plt.title('Player positions')
plt.show()

<h4><center>I am not a basketball expert, but judging from an Internet search, position "G" (Guard) covers both point guards and shooting guards.</center></h4>

<h3><center>What about the average rating of players by position?</center></h3>

In [None]:
position_rating = nba_df[['position','rating']].groupby('position').mean().sort_values(by='rating', ascending=False)
fig, ax = plt.subplots(figsize=(10,5))
sns.barplot(x=position_rating.rating, y=position_rating.index)
# Ratings are on the x-axis and positions on the y-axis, so label them accordingly.
plt.xlabel('Average rating')
plt.ylabel('Position')
plt.title('Average rating of players by position')
plt.show()

<h4><center>You can see that the average rating is approximately the same for each position.</center></h4>
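
A mean on its own can hide how much ratings vary within a position or how few players a position has. As a sketch, the toy example below (made-up values) adds the standard deviation and player count next to the mean; in the notebook the same `agg` call could be applied to `nba_df`.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    'position': ['G', 'G', 'F', 'F', 'C', 'C'],
    'rating': [78, 82, 75, 85, 77, 79],
})

# Mean, spread and sample size per position, highest mean first.
summary = (toy.groupby('position')['rating']
              .agg(['mean', 'std', 'count'])
              .sort_values('mean', ascending=False))

print(summary)
```

If the differences between means are small compared with the standard deviations, "approximately the same" is a reasonable reading.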

<h1 style='background:#AD2CDA; border:0; color:white'><center>Work in progress ...</center></h1>