# Data visualization for NBA 2k20 player dataset.

Hi!   
This kernel visualizes the NBA 2k20 player dataset.    
There are many things we could plot, so I focused on graphs that carry useful information.    

Update seaborn to version 0.11.0, which introduced the `histplot` function used later in this kernel.

In [None]:
!pip install -q -U seaborn==0.11.0 --use-feature=2020-resolver

Import everything we need.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns


sns.set(font_scale=1.5)

Load the data.

In [None]:
data = pd.read_csv("../input/nba2k20-player-dataset/nba2k20-full.csv")
data.head()

In [None]:
data.dtypes

Some useful columns are stored as text, so let's convert them to numbers.   
*Note: I also added a new column, "start_age". It is the player's approximate age in the year he was drafted (draft year minus birth year).*

In [None]:
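# Drop the first character (the currency sign) and convert to an integer.
data["salary"] = data["salary"].str[1:].astype("int64")
# Keep the part after the "/" (the metric value) for height and weight.
data["height"] = data["height"].str.split("/").str[1].astype("float")
data["weight"] = data["weight"].str.split("/").str[1].str[0:-3].astype("float")
# Age in the draft year: draft year minus birth year.
data["start_age"] = data["draft_year"] - pd.to_datetime(data["b_day"]).dt.year
# Encode undrafted players as 0 so both columns become numeric.
data["draft_round"] = data["draft_round"].replace({"Undrafted": 0}).astype("int8")
data["draft_peak"] = data["draft_peak"].replace({"Undrafted": 0}).astype("int8")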
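
The slicing in the cell above depends on the exact layout of the raw strings. As a hedged alternative, here is a minimal sketch that extracts the numbers with regular expressions instead. The two sample rows are made up, and the assumed raw formats (`$<amount>`, `<feet-inches> / <metres>`, `<lbs> lbs. / <kg> kg.`, `MM/DD/YY` birthdays) as well as the helper name `clean_sample` are assumptions for illustration, not part of the original kernel.

```python
import pandas as pd

# Two made-up rows; the raw string formats are an assumption about the CSV.
sample = pd.DataFrame({
    "salary": ["$37436858", "$898310"],
    "height": ["6-9 / 2.06", "6-1 / 1.85"],
    "weight": ["250 lbs. / 113.4 kg.", "185 lbs. / 83.9 kg."],
    "b_day": ["12/30/84", "06/01/97"],
    "draft_year": [2003, 2019],
})


def clean_sample(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: regex-based version of the conversions above."""
    out = df.copy()
    # Remove the literal "$" and cast to integer.
    out["salary"] = out["salary"].str.replace("$", "", regex=False).astype("int64")
    # Capture the number that follows the "/" (metres).
    out["height"] = out["height"].str.extract(r"/\s*([\d.]+)")[0].astype(float)
    # Capture the number that sits between the "/" and "kg".
    out["weight"] = out["weight"].str.extract(r"/\s*([\d.]+)\s*kg")[0].astype(float)
    # An explicit format avoids guessing how two-digit years are parsed.
    birth_year = pd.to_datetime(out["b_day"], format="%m/%d/%y").dt.year
    out["start_age"] = out["draft_year"] - birth_year
    return out


cleaned = clean_sample(sample)
print(cleaned[["salary", "height", "weight", "start_age"]])
```

If a pattern does not match, `str.extract` yields a missing value rather than a wrong number, which makes unexpected rows easier to spot.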

We are ready to visualize the data!

In [None]:
plt.figure(figsize=(15, 5))
sns.countplot(x="rating", data=data, palette="rocket")
# Set labels after plotting; seaborn's categorical plots write their own axis labels.
plt.xlabel("Rating", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.title("Rating distribution", fontsize=18);

In [None]:
plt.figure(figsize=(15, 5))
plt.xlabel("Rating", fontsize=14)
plt.ylabel("Salary", fontsize=14)
plt.title("Salary dependence on player rating", fontsize=18)
sns.scatterplot(x="rating", y="salary", data=data, color="deeppink", s=60);

In [None]:
plt.figure(figsize=(15, 5))
plt.xlabel("Draft year", fontsize=14)
plt.ylabel("Salary", fontsize=14)
plt.title("Salary dependence on draft year", fontsize=18)
sns.scatterplot(x="draft_year", y="salary", data=data, color="deeppink", s=60);

The scatter plots suggest a positive correlation between a player's rating and his salary: players with higher ratings tend to earn more.
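
To put a number on that visual impression, one option is to compute correlation coefficients directly. The sketch below is only an illustration: the helper name `rating_salary_correlation` and the toy values are made up, and on the real data you would pass `data` instead. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which can be more informative when the relationship is monotonic but not linear.

```python
import pandas as pd


def rating_salary_correlation(df: pd.DataFrame) -> dict:
    """Hypothetical helper: Pearson and Spearman correlation of rating vs. salary."""
    pearson = df["rating"].corr(df["salary"])
    # Spearman = Pearson correlation of the ranked values.
    spearman = df["rating"].rank().corr(df["salary"].rank())
    return {"pearson": pearson, "spearman": spearman}


# Made-up toy values, only to show the call.
toy = pd.DataFrame({
    "rating": [70, 75, 80, 90, 97],
    "salary": [1_000_000, 2_500_000, 8_000_000, 25_000_000, 37_000_000],
})
result = rating_salary_correlation(toy)
print(result)
```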

In [None]:
plt.figure(figsize=(15, 5))
sns.countplot(x="position", data=data, palette="rocket")
# Label after plotting so seaborn's default axis labels do not replace ours.
plt.xlabel("Position", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.title("Position distribution", fontsize=18);

In [None]:
plt.figure(figsize=(15, 8))
sns.boxplot(x="position", y="salary", data=data, palette="rocket")
# Label after plotting so seaborn's default axis labels do not replace ours.
plt.xlabel("Position", fontsize=14)
plt.ylabel("Salary", fontsize=14)
plt.title("Salary distribution based on players positions", fontsize=18);

From these graphs, we can see that players in the C-F position have the highest median salary. There are also a few outliers: players in the F, G and F-C positions who earn much more than the others.
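
The same reading can be checked numerically. The sketch below computes the median per group and counts high outliers with the common 1.5 × IQR rule, which is also the default whisker length in seaborn's `boxplot`. The helper names and the toy values are hypothetical; on the real data you would call `median_and_outliers(data, "position")`.

```python
import pandas as pd


def count_high_outliers(values: pd.Series) -> int:
    """Number of values above Q3 + 1.5 * IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return int((values > upper_fence).sum())


def median_and_outliers(df: pd.DataFrame, group_col: str,
                        value_col: str = "salary") -> pd.DataFrame:
    """Hypothetical helper: median and high-outlier count per group."""
    summary = df.groupby(group_col)[value_col].agg(
        median="median",
        n_high_outliers=count_high_outliers,
    )
    return summary.sort_values("median", ascending=False)


# Made-up toy values, only to show the call.
toy = pd.DataFrame({
    "position": ["G", "G", "G", "G", "G", "C", "C", "C"],
    "salary": [1, 2, 3, 4, 100, 5, 6, 7],
})
summary = median_and_outliers(toy, "position")
print(summary)
```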

In [None]:
plt.figure(figsize=(15, 8))
sns.boxplot(x="position", y="height", data=data, palette="rocket")
# Label after plotting so seaborn's default axis labels do not replace ours.
plt.xlabel("Position", fontsize=14)
plt.ylabel("Height", fontsize=14)
plt.title("Height distribution based on players positions", fontsize=18);

The F-C, C and C-F positions are occupied by the tallest players, and the G position by the shortest.

In [None]:
plt.figure(figsize=(15, 5))
plt.xlabel("Start age", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.title("Start age distribution", fontsize=18)
sns.histplot(x="start_age", data=data, bins=20, color="deeppink");

We can see that most players start their careers between the ages of 19 and 24.
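
A quick way to back up the word "most" is to compute the share of players inside that age range. This is a minimal sketch with made-up ages; the helper name `share_between` is hypothetical, and on the real data you would pass `data["start_age"]`.

```python
import pandas as pd


def share_between(ages: pd.Series, low: int, high: int) -> float:
    """Fraction of non-missing values in the closed interval [low, high]."""
    ages = ages.dropna()
    return float(ages.between(low, high).mean())


# Made-up toy ages, only to show the call.
toy_ages = pd.Series([18, 19, 20, 21, 22, 23, 24, 27])
share = share_between(toy_ages, 19, 24)
print(f"{share:.0%} of the toy players fall between 19 and 24")
```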

In [None]:
plt.figure(figsize=(15, 8))

plt.title("Salary distribution based on players countries", fontsize=18)
x = sns.boxplot(x="country", y="salary", data=data, color="deeppink")
x.set_xticklabels(x.get_xticklabels(), rotation=90);

Montenegro and the Dominican Republic have the highest median salaries, but their boxes are very narrow, which suggests they are represented by only a few players. We can also see several players from the USA who earn much more than the others.
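
Because a median based on one or two players says little, it can help to show the group size next to it. The sketch below is an illustration only: `median_salary_by_group`, its `min_players` parameter and the toy values are hypothetical names and data. On the real data you would call it with `data` and `"country"` (or `"team"`).

```python
import pandas as pd


def median_salary_by_group(df: pd.DataFrame, group_col: str,
                           min_players: int = 1) -> pd.DataFrame:
    """Hypothetical helper: median salary and player count per group."""
    summary = (
        df.groupby(group_col)["salary"]
        .agg(median_salary="median", players="count")
        .sort_values("median_salary", ascending=False)
    )
    # Optionally hide groups with too few players to be meaningful.
    return summary[summary["players"] >= min_players]


# Made-up toy values, only to show the call.
toy = pd.DataFrame({
    "country": ["A", "B", "B", "B", "C", "C"],
    "salary": [30, 10, 20, 12, 5, 7],
})
all_groups = median_salary_by_group(toy, "country")
robust = median_salary_by_group(toy, "country", min_players=2)
print(all_groups)
print(robust)
```

In the toy data, group "A" tops the ranking with a single player and disappears once `min_players=2` is required.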

In [None]:
plt.figure(figsize=(15, 8))
plt.title("Salary distribution based on players teams", fontsize=18)
x = sns.boxplot(x="team", y="salary", data=data, color="deeppink")
x.set_xticklabels(x.get_xticklabels(), rotation=90);

We can see that the Miami Heat have the highest median salary. We can also notice that almost every team has one or two players who earn much more than the rest of the team.

In [None]:
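# Rank colleges by median salary and keep the ten highest.
colleges = data.groupby("college")["salary"].median()
top_colleges = colleges.sort_values(ascending=False)[0:10].index
tmp = data[data['college'].isin(top_colleges)]

plt.figure(figsize=(10,5))
# Note: barplot shows the mean salary per college by default, not the median used for the ranking.
x = sns.barplot(x="college", y="salary", data=tmp, color="deeppink")
x.set_xticklabels(x.get_xticklabels(), rotation=90);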

Let's also build a correlation matrix to see how the features relate to each other.

In [None]:
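plt.figure(figsize=(15,10))

# Restrict to numeric columns; newer pandas versions raise an error if corr() meets text columns.
sns.heatmap(data.select_dtypes("number").corr(), cmap='rocket', annot=True, fmt=".2f");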

The strongest correlations are:  
rating and draft year (logical: the earlier a player was drafted, the longer he has had to earn a high rating)    
salary and draft year (the earlier a player was drafted, the higher the salary he has managed to earn)    
salary and rating (the higher the rating, the higher the salary)    
weight and height (the taller the player, the heavier he tends to be)   
draft round and draft peak   
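
Reading the strongest pairs off a large heatmap is error-prone, so they can also be listed programmatically. The following is a minimal sketch, not the method used in this kernel: the helper name `strongest_pairs` and the toy frame are hypothetical. It keeps only the upper triangle of the matrix so that each pair appears once, then sorts by absolute value so strong negative correlations are not missed.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Hypothetical helper: the n feature pairs with the largest |correlation|."""
    corr = df.select_dtypes("number").corr()
    # True strictly above the diagonal: one entry per pair, no self-correlation.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)


# Made-up toy values; the text column is ignored by select_dtypes.
toy = pd.DataFrame({
    "name": ["p1", "p2", "p3", "p4", "p5"],
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [3, 1, 4, 1, 5],
})
top = strongest_pairs(toy, n=1)
print(top)
```

On the real data, `strongest_pairs(data)` would return a Series indexed by pairs of column names.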