# Feature engineering


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

In [None]:
data = pd.read_csv("../input/nba2k20-player-dataset/nba2k20-full.csv")
data.head()

First of all, I think it is pointless to pass "full_name" to the model, so I drop this column.      
The other columns may be useful to the model.     
At first sight the "jersey" column looks useless, but if you look at its distribution, you notice that "#0" occurs three times more often than other numbers, so I keep it in case it carries some signal for the model.  
I also keep the "college" column, because many basketball players graduated from the same few colleges, so I think this information may be useful.         
I see no reason to doubt the usefulness of the remaining columns.    
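
The jersey observation can be checked by counting how often each number occurs. This is a minimal sketch on a small made-up frame (`toy` is hypothetical, not the real dataset), so the counts are illustrative only; in the notebook the same call would be applied to `data["jersey"]`.

```python
import pandas as pd

# Hypothetical toy sample; the real notebook would use data["jersey"].
toy = pd.DataFrame({"jersey": ["#0", "#23", "#0", "#3", "#0", "#11"]})

# Count occurrences of each jersey number, most frequent first.
jersey_counts = toy["jersey"].value_counts()
print(jersey_counts)
```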

From the "b_day" column I keep only the year.   
From the "height" column I keep only the height in meters.   
From the "weight" column I keep only the weight in kilograms.   
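
To make the extraction concrete, here is a sketch on a single made-up row. The raw string formats ("month/day/year", "feet-inches / meters", "lbs. / kg.") are assumptions about how the CSV stores these fields, and `row` is a hypothetical name.

```python
import pandas as pd

# One made-up row; the string formats are assumed, not read from the CSV.
row = pd.DataFrame({
    "b_day": ["12/30/84"],
    "height": ["6-9 / 2.06"],
    "weight": ["250 lbs. / 113.4 kg."],
})

# Keep only the year of the birth date.
year = pd.to_datetime(row["b_day"], format="%m/%d/%y").dt.year
# Keep the part after "/" (meters) and strip the surrounding spaces.
height_m = row["height"].str.split("/").str[1].str.strip().astype("float")
# Keep the part after "/", cut the trailing "kg.", then strip and convert.
weight_kg = row["weight"].str.split("/").str[1].str[:-3].str.strip().astype("float")

print(year[0], height_m[0], weight_kg[0])
```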

In [None]:
data.dtypes

The columns jersey, b_day, height, weight, salary, draft_round, and draft_peak must also be converted to a numeric dtype.

In [None]:
data = data.drop("full_name", axis=1)
data["jersey"] = data["jersey"].str[1:].astype("int8")
data["b_day"] = pd.to_datetime(data["b_day"]).dt.year
data["height"] = data["height"].str.split("/").str[1].astype("float")
data["weight"] = data["weight"].str.split("/").str[1].str[0:-3].astype("float")
data["salary"] = data["salary"].str[1:].astype("int64")
data["draft_round"] = data["draft_round"].replace({"Undrafted": 0}).astype("int8")
data["draft_peak"] = data["draft_peak"].replace({"Undrafted": 0}).astype("int8")
data

In [None]:
data.dtypes

Now we will replace all categorical values with one-hot encoded vectors. But first, let's fill the null values in the dataset.

In [None]:
data.isnull().sum()

In [None]:
data['team'] = data['team'].fillna('No team')
data['college'] = data['college'].fillna('No college')

In [None]:
for column in ['team', 'position', 'country', 'college']:
    encoded_columns = pd.get_dummies(data[column], prefix=column)
    data = data.join(encoded_columns).drop(column, axis=1)
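
For readers unfamiliar with `pd.get_dummies`, the loop above can be traced on a tiny example. The frame `toy` and its values are made up for illustration; only the encoding pattern matches the notebook.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({"position": ["G", "F", "C", "G"], "rating": [90, 85, 88, 80]})

# One indicator column per distinct category, named "position_<value>".
encoded = pd.get_dummies(toy["position"], prefix="position")

# Attach the indicator columns and drop the original categorical column.
toy_encoded = toy.join(encoded).drop("position", axis=1)
print(toy_encoded.columns.tolist())
```

Each row has exactly one active indicator among the `position_*` columns, so the category is preserved without implying any order between values.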

### Split data

In [None]:
y = data["salary"]
X = data.drop("salary", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Normalization

In [None]:
scaler = preprocessing.StandardScaler().fit(X_train)  
X_train_normalized = scaler.transform(X_train)      
X_test_normalized = scaler.transform(X_test)    
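
The scaler is fitted on the training split only and then reused for the test split, so the test data does not influence the mean and standard deviation. A small sketch with made-up numbers (`X_train_toy` and `X_test_toy` are hypothetical) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, for illustration only.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

# Learn mean and standard deviation from the training data only.
scaler_toy = StandardScaler().fit(X_train_toy)

# Apply the same training statistics to both splits.
train_scaled = scaler_toy.transform(X_train_toy)
test_scaled = scaler_toy.transform(X_test_toy)

print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has zero mean and unit variance, while the test value is simply expressed in training standard deviations.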

# Salary prediction

I decided to use the Random Forest Regressor model, with GridSearchCV to search for the best hyperparameters.   

Earlier we normalized the features, but not the target (y_train, y_test). The target values are large, so I will use np.log() to shrink them. The transform is invertible, so no information is lost, and the smaller numbers are easier for the model to work with and more convenient for us when evaluating the result. Note that the errors reported below are therefore measured on the log scale.

Later, to recover a real salary, we must apply np.exp() to the predicted value.
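
A short sketch of the round trip, using made-up salary figures (the array `salaries` is hypothetical), shows that np.exp() undoes np.log():

```python
import numpy as np

# Made-up salaries in dollars, for illustration only.
salaries = np.array([1_500_000, 8_000_000, 37_000_000], dtype="float64")

# Compress the target to the log scale, as done before fitting.
log_salaries = np.log(salaries)

# Map log-scale values (for example, model predictions) back to dollars.
restored = np.exp(log_salaries)

print(log_salaries)
print(restored)
```

In the notebook the same step would be applied to the model output, for example `np.exp(model_RF.predict(X_test_normalized))`.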

In [None]:
model_RF = RandomForestRegressor(random_state=7)
params_RF = {
    "n_estimators": [200, 150],
    "max_depth": [15, 10],
    "min_samples_split": [2, 4, 8],
    "max_features": ["sqrt", "log2"],
}
model_RF = GridSearchCV(model_RF, params_RF, scoring="neg_mean_squared_error")
model_RF.fit(X_train_normalized, np.log(y_train))

Best parameters from GridSearchCV.

In [None]:
model_RF.best_params_

Best cross-validation score.

In [None]:
model_RF.best_score_

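Why is the score negative?

GridSearchCV always tries to maximize the score, so for an error metric we use "neg_mean_squared_error", which is the mean squared error multiplied by -1. A score closer to zero therefore means a smaller error.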
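
The sign convention can be seen on synthetic data. This is a sketch under stated assumptions: the data is randomly generated, the names `X_toy`, `y_toy`, and `scores` are hypothetical, and a small forest is used only to keep the run fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = X_toy[:, 0] + rng.normal(scale=0.1, size=60)

# With this scorer every fold returns minus the mean squared error.
scores = cross_val_score(
    RandomForestRegressor(n_estimators=10, random_state=0),
    X_toy,
    y_toy,
    cv=3,
    scoring="neg_mean_squared_error",
)

# Flip the sign to get the usual MSE, then take the root for RMSE.
mse_per_fold = -scores
rmse_per_fold = np.sqrt(mse_per_fold)

print(scores)
print(rmse_per_fold)
```

The same conversion applies to the grid search result: `-model_RF.best_score_` is the cross-validated MSE on the log scale.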

In [None]:
y_pred_RF = model_RF.predict(X_test_normalized)
mse = mean_squared_error(np.log(y_test), y_pred_RF)
print("Test mean squared error:", mse)