In [None]:

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Welcome to my second notebook!
I will try to use the NBA 2K20 dataset to predict players' salaries. Please give me your feedback, I will be glad to read it! If you like the kernel, please upvote it.

## Content

1. [Introduction](#Introduction)
2. [Reading and cleaning data](#Reading+and+cleaning+data)
3. [Tree-Based Models](#Tree-Based+Models)

## 1. Introduction

#### Context
Each entry of the dataset represents an NBA player. The idea is to predict the salary of the players based on some characteristics, which I describe below.

#### Columns definitions

* **full_name** (string) - Player's full name
* **rating** (numeric) - Rating in NBA 2K20
* **jersey** (string) - Jersey's number
* **team** (string) - Team
* **position** (string) - Position in the team
* **b_day** (string) - Birth date
* **height** (string) - Height
* **weight** (string) - Weight
* **salary** (string) - Salary
* **country** (string) - Country of birth
* **draft_year** (string) - Year in which the player was drafted
* **draft_round** (string) - Round in which the player was drafted
* **draft_peak** (string) - Overall draft pick in which the player was selected
* **college** (string) - College attended

## 2. Reading and cleaning data

As it is not the focus of this notebook, I will not give many details, but feel free to ask questions in the comments =).

In [None]:

import pandas as pd
import numpy as np

df = pd.read_csv("/kaggle/input/nba2k20-player-dataset/nba2k20-full.csv")

Removing unnecessary columns

In [None]:
df.drop(["jersey", "full_name"], axis = 1, inplace = True)

Position column cleaning

In [None]:
def clean_position(row):

    if row.position == "G-F":
        row.position = "F-G"
    elif row.position == "C-F":
        row.position = "F-C"
    
    return row

df = df.apply(clean_position, axis = "columns")

Get age of players

In [None]:
import datetime as dt
from dateutil.relativedelta import relativedelta

def get_year(row, col):
    
    row[col] = -(row[col].years)

    return row


df["age"] = pd.to_datetime(df["b_day"])
now = dt.datetime.now()
df.age = df.age.apply(relativedelta, args = (now, ) )

df = df.apply(get_year, args = ("age",), axis = "columns")
df.drop("b_day", axis = 1, inplace = True)
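The cell above measures age against `dt.datetime.now()`, so the ages change depending on when the notebook is run. As a small sketch of the same calculation with only the standard library, the snippet below computes an age in whole years against a fixed reference date. The function name `age_in_years`, the birth date and the reference date are all made up for illustration.

```python
import datetime as dt

def age_in_years(birth, today):
    # Subtract one year if this year's birthday has not happened yet
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - int(before_birthday)

# Hypothetical player born on 1984-12-30, measured on a fixed (assumed) date
birth = dt.date(1984, 12, 30)
reference = dt.date(2020, 3, 1)

age = age_in_years(birth, reference)
print(age)
```

Fixing the reference date is a design choice: it makes the feature reproducible, at the cost of not reflecting the players' current ages.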

Get weight in kilos and height in meters

In [None]:
df["height"] = np.array([float(x.strip()[-4:]) for x in df.height])
df["weight"] = np.array([float(x.split("/")[-1].split()[0]) for x in df.weight])
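To make the string slicing above more concrete, here is a small sketch on two made-up values. The formats `"6-9 / 2.06"` and `"250 lbs. / 113.4 kg."` are my assumption about how the raw columns look, inferred from the parsing code, and the helper names are hypothetical.

```python
def parse_height_m(raw):
    # Assumed format: "<feet>-<inches> / <meters>"; keep the part after the slash
    return float(raw.split("/")[-1].strip())

def parse_weight_kg(raw):
    # Assumed format: "<pounds> lbs. / <kilos> kg."; keep the first token after the slash
    return float(raw.split("/")[-1].split()[0])

height = parse_height_m("6-9 / 2.06")
weight = parse_weight_kg("250 lbs. / 113.4 kg.")
print(height, weight)
```

Splitting on the slash instead of taking the last four characters does not depend on the metric height always having exactly two decimal places.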

Create a column *conference*, which sets the player's team conference

In [None]:
west = ["Denver Nuggets", "Minnesota Timberwolves", "Oklahoma City Thunder", "Portland Trail Blazers", 
           "Utah Jazz", "Dallas Mavericks", "Houston Rockets", "Memphis Grizzlies", "New Orleans Pelicans", 
           "San Antonio Spurs", "Golden State Warriors", "Los Angeles Clippers", "Los Angeles Lakers", "Phoenix Suns",
           "Sacramento Kings"]
east = ["Boston Celtics", "Brooklyn Nets", "New York Knicks", "Philadelphia 76ers", "Toronto Raptors",
            "Atlanta Hawks", "Charlotte Hornets", "Miami Heat", "Orlando Magic", "Washington Wizards", "Chicago Bulls",
           "Cleveland Cavaliers", "Detroit Pistons", "Indiana Pacers", "Milwaukee Bucks"]

def find_conference(row):
    
    if row.team in west:
        row["conference"] = "West"
    elif row.team in east:
        #print("{} is East".format(row.team))
        row["conference"] = "East"
    else:
        #print("{} is NaN".format(row.team))
        row["conference"] = "No team"
    
    return row

df = df.apply(find_conference, axis = "columns")
df.drop("team", axis = 1, inplace = True)

Create a column *region*, which sets the player's region of birth

In [None]:
def find_country(row):
    
    if row.country == "USA":
        row["region"] = "USA"
    else:
        row["region"] = "Not USA"
    
    return row

df = df.apply(find_country, axis = "columns")
df.drop("country", axis = 1, inplace = True)

Removing the *$* in the salary column and transforming it to float.

In [None]:
df.salary = [float(x[1:]) for x in df.salary]

Find the number of seasons in the NBA

In [None]:
df["seasons"] = pd.to_datetime(df["draft_year"], format = "%Y")
now = dt.datetime.now()

df.seasons = df.seasons.apply(relativedelta, args = (now, ) )
df = df.apply(get_year, args = ("seasons",), axis = "columns")
df.drop("draft_year", axis = 1, inplace = True)

Correcting the name of the column *draft_peak* and changing *Undrafted* values in the draft columns to the number 0

In [None]:
df.rename(columns = {"draft_peak":"draft_pick"}, inplace = True)

df["draft_pick"] = pd.to_numeric(df["draft_pick"].map(lambda x: 0 if x == "Undrafted" else x))
df["draft_round"] = pd.to_numeric(df["draft_round"].map(lambda x: 0 if x == "Undrafted" else x))

Split colleges into those that have 5 or more players in the NBA and those that have fewer than 5.

In [None]:
aux = df.college.value_counts()
big_schools = aux.index[0:16]
big_schools = big_schools.values

small_schools = aux.index[16:]
small_schools = small_schools.values
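The cell above takes the first 16 entries of `value_counts()` as the big schools, which relies on the counts in this particular dataset. A possible alternative, sketched below with only the standard library, derives the two groups directly from the threshold of 5 players. The college list is made-up toy data and the names `MIN_PLAYERS` and `school_size` are hypothetical.

```python
from collections import Counter

MIN_PLAYERS = 5  # threshold stated in the text above

# Toy data: None stands for a player without a college
colleges = ["Kentucky"] * 6 + ["Duke"] * 5 + ["Gonzaga"] * 2 + [None] * 3

# Count players per college, ignoring players without one
counts = Counter(c for c in colleges if c is not None)
big_schools = {c for c, n in counts.items() if n >= MIN_PLAYERS}
small_schools = set(counts) - big_schools

def school_size(college):
    if college in big_schools:
        return "Big"
    if college in small_schools:
        return "Small"
    return "Didn't go to college"

sizes = [school_size(c) for c in colleges]
print(sorted(big_schools), sorted(small_schools))
```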

In [None]:
def size_school(row):

    if row.college in big_schools:
        row["school_size"] = "Big"
    elif row.college in small_schools:
        row["school_size"] = "Small"
    else:
        row["school_size"] = "Didn't go to college"
        

    return row

df = df.apply(size_school, axis = "columns")
df.drop("college", axis = 1, inplace = True)

Converting categorical variables into numeric through label encoding

In [None]:
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
for col in df.columns.values:
    if df[col].dtype == "object":
        # Assign the whole column so that it takes the integer dtype of the encoded labels
        df[col] = lbl.fit_transform(df[col].astype(str))
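To show what label encoding does to a column, here is a minimal pure-Python sketch. It mimics the behaviour of `LabelEncoder`, which sorts the distinct values and numbers them from 0, but it is not the scikit-learn implementation, and the function name `label_encode` is hypothetical.

```python
def label_encode(values):
    # Sort the distinct values and map each one to its position
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["West", "East", "No team", "West"])
print(codes, mapping)
```

The integers impose an arbitrary order on the categories (here East < No team < West). Tree-based models can still split on such a column, but for other model families one-hot encoding is often a safer choice.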

## 3. Tree-Based Models

I will only use tree-based models on this data, because that is what I am studying at the moment.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import BaggingRegressor

from sklearn.metrics import mean_squared_error

In [None]:
columns = df.columns.values
y_columns = ["salary"]
x_columns = [x for x in columns if x != "salary"]

X = df[x_columns]
y = df[y_columns]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

First, I will use a Regression Tree to look at the variance problem of this model based on its *max_depth* hyperparameter.

In [None]:
max_depth = np.arange(1, 22, 2)
train_error = []
test_error = []

for i in max_depth:
    # "mse" was renamed to "squared_error" in scikit-learn 1.0
    model = DecisionTreeRegressor(criterion = "squared_error", splitter = "best", max_depth = i)
    model.fit(X_train, y_train)
    
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    train_error.append(mean_squared_error(y_train, y_pred_train))
    test_error.append(mean_squared_error(y_test, y_pred_test))
    

plt.figure(figsize=(6,4))
plt.plot(max_depth, train_error, '-o', color = "red", label = "Train Error")
plt.plot(max_depth, test_error, '-o', color = "blue", label = "Test Error")
plt.xlabel('max_depth', fontsize = 15)
plt.ylabel('MSE', fontsize = 15)
plt.xticks(max_depth)
plt.legend()
plt.show()

As we can see, the Regression Tree has a lot of bias when *max_depth* is low (around 1 to 3), and as I increase the depth of the tree, it starts overfitting (high variance).
One technique that can help reduce this variance problem is bagging (a.k.a. Bootstrap Aggregating), which is used in the Random Forest.

Another technique used in the Random Forest algorithm to help reduce the variance is to consider only $k$ randomly chosen features out of the $d$ available features when splitting a node, where $k \le d$ (a common choice is $k = \sqrt{d}$).
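As a rough sketch of this idea, the snippet below draws a random subset of features for one split. The feature names follow the columns left after the cleaning steps above, rounding $\sqrt{d}$ down is my assumption, and the function name `sample_split_features` is hypothetical. The real Random Forest repeats this draw at every node of every tree.

```python
import math
import random

def sample_split_features(feature_names, rng):
    d = len(feature_names)
    # k = sqrt(d), rounded down, but never fewer than one feature
    k = max(1, int(math.sqrt(d)))
    return rng.sample(feature_names, k)

features = ["rating", "position", "height", "weight", "draft_round", "draft_pick",
            "age", "conference", "region", "seasons", "school_size"]

rng = random.Random(0)  # seeded only to make the sketch reproducible
k = max(1, int(math.sqrt(len(features))))
chosen = sample_split_features(features, rng)
print(k, chosen)
```

Because each split only sees a few features, the trees in the forest become less correlated with each other, which is what makes averaging them effective.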

I will fix a high number of estimators (trees) and vary the max depth.

In [None]:
max_depth = np.arange(1, 22, 2)
train_error = []
test_error = []

for i in max_depth:
    model = RandomForestRegressor(n_estimators = 5000, criterion = "squared_error", max_depth = i,
                                 max_features = "sqrt")
    model.fit(X_train, y_train.to_numpy().ravel())
    
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    train_error.append(mean_squared_error(y_train, y_pred_train))
    test_error.append(mean_squared_error(y_test, y_pred_test))
    

plt.figure(figsize=(6,4))
plt.plot(max_depth, train_error, '-o', color = "red", label = "Train Error")
plt.plot(max_depth, test_error, '-o', color = "blue", label = "Test Error")
plt.xlabel('max_depth', fontsize = 15)
plt.ylabel('MSE', fontsize = 15)
plt.xticks(max_depth)
plt.legend()
plt.show()

Now we can see that the test error does not increase as the *max_depth* parameter increases, which shows that bagging helped control the model variance.

Note that with bagging I could use the *out-of-bag (OOB) error* to evaluate the model instead of splitting the data into training and test sets beforehand: each bootstrap sample is expected to contain only about 63% of the distinct rows of the original dataset, so the remaining rows can serve as a validation set for the tree fitted on that sample. To keep things simple, I chose not to use the OOB error.
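The share of the original rows that ends up in a bootstrap sample can be checked numerically. When drawing $n$ rows with replacement from $n$ rows, the probability that a given row is picked at least once is $1 - (1 - 1/n)^n$, which approaches $1 - 1/e \approx 0.632$ as $n$ grows. The simulation below is a small sketch of that result; the function name and the values of `n` and the number of repetitions are arbitrary choices of mine.

```python
import random

def bootstrap_unique_fraction(n, rng):
    # Draw n row indices with replacement and count how many distinct rows appear
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

rng = random.Random(0)  # seeded only to make the sketch reproducible
n = 1000
fractions = [bootstrap_unique_fraction(n, rng) for _ in range(200)]

mean_fraction = sum(fractions) / len(fractions)
expected = 1 - (1 - 1 / n) ** n
print(round(mean_fraction, 3), round(expected, 3))
```

The rows that are not drawn (roughly 37%) are the out-of-bag rows for that bootstrap sample.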

Another technique that can be used is called *boosting*. It combines many weak learners, fitted one after another, into a strong learner, and therefore reduces the model bias. In this case, the weak learners will be trees with low *max_depth*.

I will use Gradient Boosting, one of the oldest boosting algorithms, to model the data.
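Before using the library, here is a toy sketch of the idea for the squared-error loss: each new weak learner (a one-split tree, or stump) is fitted to the residuals of the current prediction and added with a small learning rate. This is a simplified illustration on made-up one-dimensional data, not scikit-learn's implementation, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    # Try every threshold and keep the split with the lowest squared error
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds, learning_rate):
    # Start from the mean, then repeatedly fit a stump to the residuals
    preds = [sum(ys) / len(ys)] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 2 for x in xs]

mse_base = mse(ys, boost(xs, ys, 0, 0.5))
mse_one = mse(ys, boost(xs, ys, 1, 0.5))
mse_many = mse(ys, boost(xs, ys, 100, 0.5))
print(mse_base, mse_one, mse_many)
```

Each stump alone is a very biased model, but the training error keeps shrinking as more of them are added, which is the bias reduction mentioned above.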

In [None]:
max_depth = np.arange(1, 6, 1)
train_error = []
test_error = []

for i in max_depth:
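    # Pass max_depth = i so that each iteration actually fits trees of a different depth
    model = GradientBoostingRegressor(loss = "squared_error", learning_rate = 0.01, n_estimators = 5000,
                                     criterion = "squared_error", max_depth = i)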
    model.fit(X_train, y_train.to_numpy().ravel())
    
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    train_error.append(mean_squared_error(y_train, y_pred_train))
    test_error.append(mean_squared_error(y_test, y_pred_test))
    

plt.figure(figsize=(6,4))
plt.plot(max_depth, train_error, '-o', color = "red", label = "Train Error")
plt.plot(max_depth, test_error, '-o', color = "blue", label = "Test Error")
plt.xlabel('max_depth', fontsize = 15)
plt.ylabel('MSE', fontsize = 15)
plt.xticks(max_depth)
plt.legend()
plt.show()

Comparing our test and train error to the Regression Tree and Random Forest results, Gradient Boosting had a much better result with low *max_depth* values.