<h1>Data Interrogation - A Practical Approach</h1>

<p>Interrogating your data is an important part of a data science project. It is estimated that 60-70% of the time on a data science project is spent cleaning and understanding the data. This notebook explains how to clean your data and get the most out of it with Plotly visualizations.</p>

Economist Ronald Coase said:
><h2>If you torture the data long enough, it will confess to anything.</h2>


In [None]:
import pandas as pd
import plotly.express as px
import numpy as np
import datetime
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.figure_factory as ff

<h2>Let's start the process of interrogating the data</h2>
<h3>Importing the data</h3>

In [None]:
src = pd.read_csv("../input/nba2k20-player-dataset/nba2k20-full.csv",parse_dates=True)

<h3>Is the data correctly imported? Checking the first few rows helps</h3>

In [None]:
src.head()

<h3>What are the dimensions of the data?</h3>

In [None]:
print(f"No of rows: {src.shape[0]}")
print(f"No of columns: {src.shape[1]}")

<h3>What are the columns of the data?</h3>

In [None]:
print(f"Columns in the dataset\n{pd.Series(src.columns).T}")

<h3>Do all the columns have the right data types?</h3>

In [None]:
src.dtypes

<p>A few columns, such as 'b_day', 'height', 'weight', and 'salary', have the wrong data type. We need to correct them.</p>

<h3>How many unique values/categories does each column have?</h3>

In [None]:
src.nunique()

<h3>What are the unique values in each column?</h3>

<p>The player name column is excluded, as its values are obviously unique.</p>

In [None]:
for i in src.columns[1:]:
    print(i)
    print(src[i].unique())

<h3>What is the percentage of missing values in each column, if any?</h3>

In [None]:
print(f"% of missing values\n{np.round(src.isnull().mean()*100,2)}")
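
The checks above (data types, unique counts, missing percentages) can also be collected into a single table. The following is a minimal sketch on a toy frame; the helper name `overview` and the `toy` values are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, unique count, and missing share for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "pct_missing": (df.isnull().mean() * 100).round(2),
    })


# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "team": ["A", "B", None, "A"],
    "rating": [90, 85, 85, 70],
})
summary = overview(toy)
print(summary)
```

On the real data, the same helper could be called as `overview(src)`.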

<h1>Data Cleaning</h1>
<p>A few columns, such as 'b_day', 'height', 'weight', and 'salary', have the wrong data type. They must be cleaned.</p>

<h3>Convert the 'b_day' column to datetime</h3>

In [None]:
def to_date(x):
    return datetime.datetime.strptime(x,"%m/%d/%y")

src["b_day"] = src["b_day"].map(to_date)
print(f"datatype: {src.b_day.dtype}\nFew Unique Values\n{src.b_day.unique()[:5]}")
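
One caveat with the `%y` directive is that two-digit years are ambiguous: values 00 to 68 are read as 20xx and 69 to 99 as 19xx. The sketch below shows a vectorised parse with `pd.to_datetime` and one possible correction. The toy dates, the `reference` date, and the rule "a birthday cannot lie in the future" are assumptions made for illustration.

```python
import pandas as pd

# Toy birthdays in the same "%m/%d/%y" layout (hypothetical values).
raw = pd.Series(["12/30/84", "03/06/68", "07/01/99"])

# Vectorised parse; two-digit years 00-68 are read as 20xx, 69-99 as 19xx.
parsed = pd.to_datetime(raw, format="%m/%d/%y")

# Assumption: no birthday can lie after the reference date, so any
# "future" date is shifted back by one century.
reference = pd.Timestamp("2020-01-01")
fixed = parsed.where(parsed <= reference, parsed - pd.DateOffset(years=100))
print(fixed.dt.year.tolist())
```

Whether such a correction is needed depends on the birth years actually present in the data.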

<p>After cleaning, the 'b_day' column looks good</p>
<h3>Clean the 'height' column</h3>
    
<p>This column contains the height in both feet and meters, separated by "/". We'll split it into two new columns, "height_ft" and "height_cm".</p>

In [None]:
src = src.reset_index(drop=True)
# Split the text on " / " into two parts
height_parts = src["height"].str.split(" / ", expand=True)
# First part is the height in ft, kept as text
src.insert(7, "height_ft", height_parts[0].str.strip())
# Second part is the height in m; multiply by 100 to convert to cm
src.insert(8, "height_cm", height_parts[1].str.strip().astype(float) * 100)

<p>Let's look at the first few rows.</p>

In [None]:
src.head()

<h3>Clean the 'weight' column</h3>
    
<p>This column contains the weight in both lbs and kg, separated by "/". We'll split it into two new columns, "weight_lbs" and "weight_kg".</p>

In [None]:
src = src.reset_index(drop=True)
# Split the text on " / " into two parts
weight_parts = src["weight"].str.split(" / ", expand=True)
# Strip the unit suffixes and convert both parts to float
src.insert(10, "weight_lbs", weight_parts[0].str.replace("lbs.", "", regex=False).str.strip().astype(float))
src.insert(11, "weight_kg", weight_parts[1].str.replace("kg.", "", regex=False).str.strip().astype(float))

<p>Let's look at the first few rows.</p>

In [None]:
src.head()

<h3>Clean the 'salary' column</h3>
    
<p>Let's check if all the entries in this column start with $.</p>

In [None]:
"All the Salaries are in $" if all(src["salary"].str.startswith("$")) else "Not All the Salaries are in $"

<p>As all the entries in the 'salary' column start with \$, we'll remove the \$ and convert the column to a float data type.</p>

In [None]:
src["salary"] = src["salary"].str.replace("$","",regex=False).astype("float64")
src.head()

<h3>Removing redundant columns like 'height', 'height_ft', 'weight', and 'weight_lbs'</h3>

In [None]:
src.drop(columns=["height","height_ft","weight","weight_lbs"],inplace=True)

<h3>Let's check the structure of the data to ensure the cleaning is done</h3>

In [None]:
src.dtypes

In [None]:
src.head()

<h1>Imputing Missing Values</h1>
There are missing values in the 'team' and 'college' columns.

Knowing how the data was collected would help explain why these values are missing and let us devise a better imputation strategy.

Imputing the missing team and college values with the mode would not be a good strategy: we find no pattern in the missing data, so the values appear to be missing at random, and filling them with the mode would further inflate the category that is already the most common. It is better to mark them as "Not Known".

In [None]:
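# Assign the result back; calling fillna(inplace=True) on a selected column may not update src in recent pandas versions
src["team"] = src["team"].fillna("Not Known")
src["college"] = src["college"].fillna("Not Known")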
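
Whether the gaps are related to other columns can be probed before settling on an imputation strategy. The following is a rough sketch on toy data that compares a numeric column between rows with and without a missing value; the frame, its values, and the `college_missing` flag are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "college": ["Duke", None, "Kentucky", None, "Duke", "UCLA"],
    "rating": [80, 78, 75, 82, 77, 79],
})

# Flag rows where the value is missing, then compare another column across the flag.
toy["college_missing"] = toy["college"].isnull()
by_flag = toy.groupby("college_missing")["rating"].agg(["count", "mean"])
print(by_flag)

# Fill only after the check, assigning the result back to the column.
toy["college"] = toy["college"].fillna("Not Known")
```

A large difference between the two groups would hint that the values are not missing purely by chance; a small one is consistent with, but does not prove, random missingness.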

<h1>Feature Engineering</h1>

<h3>Let's see if we can create additional variables from the existing variables</h3>

First we can create a <i><b>BMI = (weight in kg)/(height in m)<sup>2</sup></b></i> variable from the height and weight of the players and also the BMI class.

In [None]:
src.insert(8,"bmi",0.0)
src.insert(9,"bmi_class","") # start with an empty string so the column holds text
# BMI = weight (kg) / height (m)^2, rounded to one decimal place
src["bmi"] = np.round(src["weight_kg"] / ((src["height_cm"]/100)**2),1)
src.loc[src["bmi"]<18.5,"bmi_class"] = "underweight"
src.loc[(src["bmi"]>=18.5) & (src["bmi"]<=24.9),"bmi_class"] = "normal"
src.loc[(src["bmi"]>=25) & (src["bmi"]<=29.9),"bmi_class"] = "overweight"
src.loc[src["bmi"]>=30,"bmi_class"] = "obese"
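
The same classification can be written with `pd.cut`, which assigns every value to exactly one bin and so leaves no gaps between the class boundaries. This is an alternative sketch on toy values, not the notebook's own method; the sample BMI numbers are made up.

```python
import numpy as np
import pandas as pd

bmi = pd.Series([18.0, 18.5, 24.9, 25.0, 29.9, 30.0])

# Left-closed bins: [-inf, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_class = pd.cut(
    bmi,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,
)
print(bmi_class.tolist())
```

The result is a categorical column whose categories keep the order given in `labels`, which also fixes the bar order in later plots.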

<h3>We can also calculate the age from the 'b_day' of the players</h3>

In [None]:
def age_calc(dob):
    today = datetime.datetime.today()
    return np.floor(((today-dob).days)/365)
src.insert(6,"age",0)
src["age"] = src["b_day"].map(age_calc)
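
Dividing the day count by 365 ignores leap days, and using today's date makes the ages change every time the notebook is run. A calendar-based variant with a fixed reference date could look like the sketch below; the helper `age_on`, the reference date, and the sample birthdays are assumptions for illustration.

```python
import pandas as pd


def age_on(b_day: pd.Series, ref: pd.Timestamp) -> pd.Series:
    """Completed years between each birthday and the reference date."""
    # True where the birthday has not yet occurred in the reference year.
    before_birthday = (b_day.dt.month > ref.month) | (
        (b_day.dt.month == ref.month) & (b_day.dt.day > ref.day)
    )
    return ref.year - b_day.dt.year - before_birthday.astype(int)


b_days = pd.Series(pd.to_datetime(["1984-12-30", "1999-07-01"]))
ages = age_on(b_days, pd.Timestamp("2020-12-29"))
print(ages.tolist())
```

The first player turns 36 one day after the reference date, so the function still reports 35.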

<h3>A final look at the data before we proceed</h3>

In [None]:
src.head()

<h2>As we are done with the cleaning and feature engineering, let's start interrogating the data</h2>
<h1>Let's look at how the numerical variables are distributed</h1>

In [None]:
fig = make_subplots(rows=2, cols=3, subplot_titles=("Age", "Rating", "Salary","BMI","Height","Weight"))

fig.add_trace(
    go.Histogram(x=src["age"]),
    row=1, col=1
)

fig.add_trace(
    go.Histogram(x=src["rating"]),
    row=1, col=2
)

fig.add_trace(
    go.Histogram(x=src["salary"]),
    row=1, col=3
)

fig.add_trace(
    go.Histogram(x=src["bmi"]),
    row=2, col=1
)

fig.add_trace(
    go.Histogram(x=src["height_cm"]),
    row=2, col=2
)

fig.add_trace(
    go.Histogram(x=src["weight_kg"]),
    row=2, col=3
)

fig.update_layout(title_text="Distribution of Numerical Variables",showlegend=False)

In [None]:
pd.set_option('display.float_format', lambda x: '%.1f' % x)
src[["age", "rating", "salary","bmi","height_cm","weight_kg"]].describe()

<h1>Let's see the skewness of the above variables</h1>

In [None]:
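# Restrict to numeric columns; recent pandas versions raise an error on text columns otherwise
src.skew(numeric_only=True)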
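
A positive skewness value indicates a longer right tail and a negative value a longer left tail. As a small illustration on made-up numbers, the sketch below shows a right-skewed sample and how a log transform, one common remedy that is not applied in this notebook, reduces the skew.

```python
import numpy as np
import pandas as pd

# Toy right-skewed sample (hypothetical values): one very large entry.
salary = pd.Series([1, 1, 2, 2, 3, 50], dtype="float64")

raw_skew = salary.skew()
log_skew = np.log1p(salary).skew()
print(round(raw_skew, 2), round(log_skew, 2))
```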

<h1>Observations from the above plots</h1>
<ol>
<li>Most of the players are between 22 and 32 years old, the youngest being 19 and the oldest 40.</li>
<li>Most of the players have a rating between 72 and 79, the lowest being 67 and the highest 97.</li>
<li>Most of the players have a salary between \$50K and \$6 million, the lowest being \$50K and the highest \$40 million.</li>
<li>Most of the players have a BMI between 22 and 27, the lowest being 20.3 and the highest 32.9.</li>
<li>Most of the players have a height (in cm) between 191 and 210, the lowest being 175 and the highest 225.</li>
<li>Most of the players have a weight (in kg) between 86 and 110, the lowest being 77 and the highest 131.5.</li>
</ol>
<h1>Let's look at the distributions of a few categorical variables</h1>

In [None]:
teams = pd.DataFrame(src["team"].value_counts()).reset_index()
teams.columns=["team","count"]

pos = pd.DataFrame(src["position"].value_counts()).reset_index()
pos.columns=["position","count"]

# Name the columns explicitly so the code does not depend on the pandas version's value_counts() naming
cntry = src["country"].value_counts().rename_axis("country").reset_index(name="count")
# Lump countries with fewer than 0.5% of the players into "Others"
cntry.loc[cntry["count"]<(0.005*cntry["count"].sum()),"country"] = "Others"
cntry = cntry.groupby("country").sum().reset_index()
cntry.sort_values(by="count",ascending=False,inplace=True)

coll = src["college"].value_counts().rename_axis("college").reset_index(name="count")
# Lump colleges with fewer than 1% of the players into "Others"
coll.loc[coll["count"]<(0.01*coll["count"].sum()),"college"] = "Others"
coll = coll.groupby("college").sum().reset_index()
coll.sort_values(by="count",ascending=False,inplace=True)

bmi_dist = pd.DataFrame(src["bmi_class"].value_counts()).reset_index()
bmi_dist.columns=["bmi_class","count"]

fig = make_subplots(rows=4, cols=2,specs=[[{},{}],
                                          [{"colspan": 2},None],
                                         [{"colspan": 2},None],
                                         [{"colspan": 2},None]],
                    subplot_titles=("Position", "BMI_Class","Country", "Team","College"))

fig.add_trace(
    go.Bar(y=pos["count"],x=pos["position"]),
    row=1, col=1
)

fig.add_trace(
    go.Bar(y=bmi_dist["count"],x=bmi_dist["bmi_class"]),
    row=1, col=2
)

fig.add_trace(
    go.Bar(y=cntry["count"],x=cntry["country"]),
    row=2, col=1
)

fig.add_trace(
    go.Bar(y=teams["count"],x=teams["team"]),
    row=3, col=1
)

fig.add_trace(
    go.Bar(y=coll["count"],x=coll["college"]),
    row=4, col=1
)

fig.update_layout(title_text="Distribution of Categorical Variables",showlegend=False,height=1400)
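
The "lump rare categories into Others" step used for country and college can be factored into a small reusable helper. The sketch below is one possible version; `lump_rare`, its parameters, and the toy series are hypothetical names introduced here.

```python
import pandas as pd


def lump_rare(values: pd.Series, min_share: float, other: str = "Others") -> pd.Series:
    """Replace categories whose share of rows is below min_share with `other`."""
    share = values.value_counts(normalize=True)
    rare = share[share < min_share].index
    return values.where(~values.isin(rare), other)


country = pd.Series(["USA"] * 8 + ["Canada"] * 3 + ["Latvia"])
counts = lump_rare(country, min_share=0.1).value_counts()
print(counts)
```

Because the helper returns a series of the same length as its input, it can be applied before `value_counts()` or before a cross-tabulation alike.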

<h1>Observations from the above plots</h1>
<ol>
<li>Most of the players are in the G and F positions. G-F, F-G, and C-F have the fewest players.</li>
<li>Most of the players fall in the 'normal' BMI category, with a few in 'overweight' and only 1 player in 'obese'.</li>
<li>Most of the players are from the US, followed by Canada and Australia.</li>
<li>The Milwaukee Bucks have the most players in this dataset and the Golden State Warriors the fewest. There are many missing values here.</li>
<li>Kentucky and Duke are the most common colleges. There are also a good number of missing values here.</li>
</ol>
<h1>A look at the relationships among numerical variables</h1>

In [None]:
px.scatter_matrix(src[["age","bmi","height_cm","weight_kg","rating","salary"]],height=1500)

<h1>Let's quantify the above relationships using a Correlation Matrix</h1>

In [None]:
cols = ["age","bmi","height_cm","weight_kg","rating","salary"]
fig = ff.create_annotated_heatmap(np.round(src[cols].corr().values,2),x=cols,y=cols)
fig.show()
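
To read the strongest relationships off a correlation matrix without scanning the heatmap, the upper triangle can be flattened into a ranked list of pairs. This is a minimal sketch on toy data; the values are invented and only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "height_cm": [180, 190, 200, 210, 220],
    "weight_kg": [80, 88, 101, 109, 120],
    "age": [30, 22, 35, 25, 28],
})

corr = toy.corr()
# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flatten to (row, column) pairs and rank by absolute correlation.
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs.round(2))
```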

<h1>Observations from the above plots</h1>
<ol>
<li>Age and salary have a moderate positive correlation.</li>
<li>Rating and salary have a strong positive correlation.</li>
<li>Weight and BMI have a moderate positive correlation.</li>
<li>Weight and height have a strong positive correlation.</li>
 </ol>
<h1>Looking at the relationships between a few important categorical variables and numerical variables</h1>

In [None]:
fig = make_subplots(rows=6, cols=1)

fig.add_trace(
    go.Box(x=src["team"],y=src["age"]),
    row=1, col=1
)

fig.add_trace(
    go.Box(x=src["team"],y=src["bmi"]),
    row=2, col=1
)
fig.add_trace(
    go.Box(x=src["team"],y=src["height_cm"]),
    row=3, col=1
)


fig.add_trace(
    go.Box(x=src["team"],y=src["weight_kg"]),
    row=4, col=1
)

fig.add_trace(
    go.Box(x=src["team"],y=src["salary"]),
    row=5, col=1
)

fig.add_trace(
    go.Box(x=src["team"],y=src["rating"]),
    row=6, col=1
)


fig.update_yaxes(title_text="Age", row=1, col=1)
fig.update_yaxes(title_text="BMI", row=2, col=1)
fig.update_yaxes(title_text="Height", row=3, col=1)
fig.update_yaxes(title_text="Weight", row=4, col=1)
fig.update_yaxes(title_text="Salary", row=5, col=1)
fig.update_yaxes(title_text="Rating", row=6, col=1)

fig.update_layout(title_text="Relationship B/w Team and Important Numerical Variables",showlegend=False,height=2000)

<h3>The above plots show no strong relationships between team and the other variables.</h3>

In [None]:
fig = make_subplots(rows=6, cols=1)

fig.add_trace(
    go.Box(x=src["position"],y=src["age"]),
    row=1, col=1
)

fig.add_trace(
    go.Box(x=src["position"],y=src["bmi"]),
    row=2, col=1
)
fig.add_trace(
    go.Box(x=src["position"],y=src["height_cm"]),
    row=3, col=1
)


fig.add_trace(
    go.Box(x=src["position"],y=src["weight_kg"]),
    row=4, col=1
)

fig.add_trace(
    go.Box(x=src["position"],y=src["salary"]),
    row=5, col=1
)

fig.add_trace(
    go.Box(x=src["position"],y=src["rating"]),
    row=6, col=1
)


fig.update_yaxes(title_text="Age", row=1, col=1)
fig.update_yaxes(title_text="BMI", row=2, col=1)
fig.update_yaxes(title_text="Height", row=3, col=1)
fig.update_yaxes(title_text="Weight", row=4, col=1)
fig.update_yaxes(title_text="Salary", row=5, col=1)
fig.update_yaxes(title_text="Rating", row=6, col=1)

fig.update_layout(title_text="Relationship B/w Position and Important Numerical Variables",showlegend=False,height=1700)

<h1>Observations from the above plots</h1>
<ol>
<li>Age shows no strong relationship with position. However, older players tend to be in position C.</li>
<li>Height varies strongly by position. Tall players are in position C.</li>
<li>Weight varies moderately by position. Heavy players are in position C.</li>
<li>Salary shows no strong relationship with position. A few players in positions F and G earn more, while a few in F-C earn the least.</li>
</ol>
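
Observations read from box plots can be backed with numbers by computing the same statistics per group. The sketch below does this for a toy frame; the heights are invented, and the quartile method may differ slightly from the one Plotly uses to draw the boxes.

```python
import pandas as pd

toy = pd.DataFrame({
    "position": ["C", "C", "G", "G", "F", "F"],
    "height_cm": [213, 211, 188, 191, 203, 206],
})

# Median and interquartile range per group roughly mirror what a box plot draws.
grouped = toy.groupby("position")["height_cm"]
summary = pd.DataFrame({
    "median": grouped.median(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
})
print(summary.sort_values("median", ascending=False))
```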

In [None]:
fig = make_subplots(rows=3, cols=2)

fig.add_trace(
    go.Box(x=src["bmi_class"],y=src["age"]),
    row=1, col=1
)

fig.add_trace(
    go.Box(x=src["bmi_class"],y=src["bmi"]),
    row=1, col=2
)
fig.add_trace(
    go.Box(x=src["bmi_class"],y=src["height_cm"]),
    row=2, col=1
)


fig.add_trace(
    go.Box(x=src["bmi_class"],y=src["weight_kg"]),
    row=2, col=2
)

fig.add_trace(
    go.Box(x=src["bmi_class"],y=src["salary"]),
    row=3, col=1
)

fig.add_trace(
    go.Box(x=src["bmi_class"],y=src["rating"]),
    row=3, col=2
)


fig.update_yaxes(title_text="Age", row=1, col=1)
fig.update_yaxes(title_text="BMI", row=1, col=2)
fig.update_yaxes(title_text="Height", row=2, col=1)
fig.update_yaxes(title_text="Weight", row=2, col=2)
fig.update_yaxes(title_text="Salary", row=3, col=1)
fig.update_yaxes(title_text="Rating", row=3, col=2)

fig.update_layout(title_text="Relationship B/w BMI Class and Important Numerical Variables",showlegend=False,height=1700)

<h1>Observations from the above plots</h1>
<ol>
<li>Age does not appear to influence the BMI category.</li>
<li>As expected, BMI strongly determines the BMI category, since the category is derived from it.</li>
<li>Height and weight, taken separately, have little influence on the BMI category.</li>
</ol>
<h1>Visualizing the relationship between height, weight and BMI class using a scatter plot</h1>

In [None]:
src1 = src[["height_cm","weight_kg","bmi","bmi_class"]].copy()
src1["bmi"] = (src1["bmi"]-src1["bmi"].min()) / (src1["bmi"].max()-src1["bmi"].min())
px.scatter(src1,x="height_cm",y="weight_kg",size="bmi",color="bmi",trendline="lowess",title="Weight, Height, BMI")

In the above scatter plot, both the size and the colour of each bubble encode 'bmi'.

The bubble size increases more noticeably along the y-axis than along the x-axis, which suggests that weight influences BMI more than height does, as we saw in the correlation matrix.
<h1>Looking at the relationships between a few important categorical variables</h1>

In [None]:
pos_bmi = pd.crosstab(index=src["position"],columns=src["bmi_class"],normalize="index")
px.bar(pos_bmi,title="Position vs BMI Class",labels={'value':"% of players"})

The above plot shows that position 'F-G' has a high percentage of players with a normal 'bmi_class', while position 'C' has a high percentage of 'overweight' players.
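
For readers unfamiliar with `normalize="index"`, the following toy example shows what the cross-tabulation computes: each row is divided by its own total, so the values are shares between 0 and 1 rather than percentages. The data here is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "position": ["C", "C", "C", "G", "G"],
    "bmi_class": ["overweight", "overweight", "normal", "normal", "normal"],
})

# normalize="index" divides each row by its own total, so every row sums to 1.
shares = pd.crosstab(index=toy["position"], columns=toy["bmi_class"], normalize="index")
print(shares.round(2))
```

Multiplying the result by 100 would give percentages if the axis is to be labelled in percent.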

In [None]:
cntry = src["country"].value_counts()
# Countries with fewer than 0.5% of the players
less_cnt = cntry[cntry<(0.005*cntry.sum())].index

cnt_bmi = src[["country","bmi_class"]].copy()
cnt_bmi.loc[cnt_bmi["country"].isin(less_cnt),"country"] = "Others"
cnt_bmi = pd.crosstab(index=cnt_bmi["country"],columns=cnt_bmi["bmi_class"],normalize="index")
px.bar(cnt_bmi,title="Country vs BMI Class",labels={'value':"% of players"})

The above plot shows players from France, Latvia, and Germany have a normal 'bmi_class'. Turkey has the highest percentage of overweight players.