## Salary Prediction for Hitters Dataset with Linear and Non-Linear Models

In this project, we build several linear and non-linear models on the Hitters dataset to predict player salaries. After creating base models, some of them are tuned to reduce prediction error, measured by the root-mean-squared error (RMSE). Models are evaluated with K-fold cross-validation rather than a single train/test split. The linear models are Linear Regression, Ridge, Lasso, and ElasticNet. The non-linear models are the regression implementations of K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Random Forest (RF), Support Vector Machine (SVM), Gradient Boosting Machine (GBM), eXtreme Gradient Boosting (XGB), and LightGBM (LGBM).

### Contents of the Study:

* Understanding Data
* Exploratory Data Analysis
* Data Preparation
* Feature Engineering
* Constructing Base Models (Linear and Non-linear)
* Hyperparameter Tuning for Some Models
* Conclusion

### Understanding Data

The Hitters dataset includes Major League Baseball (MLB) data from the 1986 and 1987 seasons.
Source: This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University.
It is a data frame with 322 observations of major league players on the following 20 variables.

#### Variables:
* AtBat: Number of times at bat in 1986
* Hits: Number of hits in 1986
* HmRun: Number of home runs in 1986
* Runs: Number of runs in 1986
* RBI: Number of runs batted in in 1986
* Walks: Number of walks in 1986
* PutOuts: Number of put outs in 1986
* Assists: Number of assists in 1986
* Errors: Number of errors in 1986
* CAtBat: Number of times at bat during his career
* CHits: Number of hits during his career
* CHmRun: Number of home runs during his career
* CRuns: Number of runs during his career
* CRBI: Number of runs batted in during his career
* CWalks: Number of walks during his career
* Years: Number of years in the major leagues
* League: A factor with levels A and N indicating player's league at the end of 1986
* Division: A factor with levels E and W indicating player's division at the end of 1986
* NewLeague: A factor with levels A and N indicating player's league at the beginning of 1987
* Salary: 1987 annual salary on opening day in thousands of dollars


In [None]:
# install required model libraries
#!pip install xgboost
#!pip install lightgbm

In [None]:
# ignore warnings
import warnings
warnings.simplefilter(action='ignore', category=Warning)

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score

# linear models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
# non-linear models
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

In [None]:
# read the Hitters dataset
data = pd.read_csv("../input/hitters-baseball-data/Hitters.csv")
# copy loaded data into df
df = data.copy()

### Exploratory Data Analysis

In [None]:
df.head()

In [None]:
# number of rows and columns in data
df.shape

In [None]:
# descriptive statistics for data
df.describe().T

### Data Preparation

In [None]:
# categorical features
cat_cols = [col for col in df.columns if df[col].dtypes == "O"]
# numerical features except for target (Salary)
# (use != here: `col not in "Salary"` would be a substring test on the string "Salary")
num_cols = [col for col in df.columns if df[col].dtypes != "O" and col != "Salary"]

In [None]:
# cap outliers: clip each numerical feature to quantile-based thresholds
# (q1 and q3 are the 10th and 90th percentiles here, not true quartiles)
q1 = 0.10
q3 = 0.90
for col in num_cols:
    lower_quantile = df[col].quantile(q1)
    upper_quantile = df[col].quantile(q3)
    interquantile_range = upper_quantile - lower_quantile
    up_limit = upper_quantile + 1.5 * interquantile_range
    low_limit = lower_quantile - 1.5 * interquantile_range
    # values outside [low_limit, up_limit] are replaced by the nearest limit
    df[col] = df[col].clip(lower=low_limit, upper=up_limit)
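
To make the thresholding rule concrete, here is a minimal sketch on a toy series. The helper name `cap_outliers` and the toy values are assumptions for illustration and are not part of the notebook; the rule itself (quantiles at 0.10 and 0.90, widened by 1.5 times their difference) follows the cell above.

```python
import pandas as pd

def cap_outliers(series, q_low=0.10, q_high=0.90, k=1.5):
    """Clip a numeric Series to [low - k*spread, high + k*spread] (hypothetical helper)."""
    low = series.quantile(q_low)
    high = series.quantile(q_high)
    spread = high - low  # distance between the two quantiles
    return series.clip(lower=low - k * spread, upper=high + k * spread)

# toy data: ten ordinary values and one extreme value
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 1000.0])
capped = cap_outliers(toy)

# quantiles are 2.0 and 10.0, so the upper limit is 10 + 1.5 * 8 = 22
print(capped.tolist())
```

Only the extreme value is changed; every value already inside the limits is left as it was.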

In [None]:
# remove outliers for Salary and keep NaN values
salary_up = int(df["Salary"].quantile(q3))
df = df[(df["Salary"] < salary_up) | (df["Salary"].isnull())]
# data shape after removing outliers in Salary
df.shape

In [None]:
# number of NaN values in each feature
df.isnull().sum()

In [None]:
# remove rows containing NaN values
df.dropna(inplace=True)
df.shape

In [None]:
# label encoding of the categorical features with two classes (League, Division, NewLeague)
binary_cols = [col for col in df.columns if df[col].dtype not in [int, float] and df[col].nunique() == 2]
for col in binary_cols:
    labelencoder = LabelEncoder()
    df[col] = labelencoder.fit_transform(df[col])
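
As a small illustration of what the encoder does to a two-level column, the sketch below encodes a toy `League`-like series. The toy values are assumptions; `LabelEncoder` assigns integer codes in sorted order of the class labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy column with the same two levels as League (A and N)
league = pd.Series(["A", "N", "N", "A"])

encoder = LabelEncoder()
codes = encoder.fit_transform(league)

# classes_ holds the original labels; the position of each label is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Keeping the fitted encoder (or the `mapping` dictionary) makes it possible to translate the codes back to the original labels later with `inverse_transform`.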

In [None]:
df.head()

### Feature Engineering

Creating useful new features requires domain knowledge about the data. For this purpose, I drew on the glossary of mlb.com, the official site of Major League Baseball. I also created some features myself that I thought might be useful.

In [None]:
df["New_BattingAverage"] = df["CHits"] / df["CAtBat"]
df["New_TotalBases"] =  ((df["CHits"] * 2) + (4 * df["CHmRun"]))
df["New_SluggingPercentage"] = df["New_TotalBases"] / df["CAtBat"]
df["New_IsolatedPower"] = df["New_SluggingPercentage"] - df["New_BattingAverage"]
df["New_TripleCrown"] = (df["CHmRun"] * 0.4) + (df["CRBI"] * 0.25) + (df["New_BattingAverage"] * 0.35)
df["New_BattingAverageOnBalls"] = (df["CHits"] - df["CHmRun"]) / (df["CAtBat"] - df["CHmRun"])
df["New_RunsCreated"] = df["New_TotalBases"] * (df["CHits"] + df["CWalks"]) / (df["CAtBat"] + df["CWalks"])
df["New_FieldingPercentage"] = 1 - ((df["PutOuts"] + df["Assists"]) / (df["PutOuts"] + df["Assists"] + df["Errors"] + 1))

df["New_CRunsYearsRatio"] = df["CRuns"] / df["Years"]
df['New_PutOutsYears'] = df['PutOuts'] * df['Years']
df["New_RBIWalks"] = df["RBI"] * df["Walks"]
df["New_RBIWalksRatio"] = df["RBI"] / df["Walks"]
df["New_CHmRunCAtBatRatio"] = df["CHmRun"] / df["CAtBat"]
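
Ratio features such as `RBI / Walks` are undefined when the denominator is zero: pandas returns `inf` (or `NaN` for 0/0) rather than raising an error. Whether such rows remain in the filtered data is not stated here, so the following is only a sketch on toy values, and filling with 0 is an assumed policy, not the notebook's.

```python
import numpy as np
import pandas as pd

# toy rows: a regular player, one with zero walks, one with zero RBI and zero walks
toy = pd.DataFrame({"RBI": [40, 10, 0], "Walks": [20, 0, 0]})

# plain division: 2.0, inf (10/0), NaN (0/0)
ratio = toy["RBI"] / toy["Walks"]

# assumed policy: treat undefined ratios as 0 so that models receive finite inputs
safe_ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)
print(safe_ratio.tolist())
```

A quick `np.isinf(df.select_dtypes("number")).any().any()` check after feature creation would show whether any such values are present.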

### Constructing Base Models (Linear and Non-linear)

In [None]:
# assign X (input features) and y (target feature)
X = df.drop(["Salary"], axis=1)
y = df["Salary"]

In [None]:
# list feature importances for a regressor model like LGBM
pre_model = LGBMRegressor(random_state=17).fit(X, y)
feature_imp = pd.DataFrame({'Feature': X.columns, 'Value': pre_model.feature_importances_})
feature_imp.sort_values("Value", ascending=False)

In [None]:
base_models = [('LR', LinearRegression()), 
               ("Ridge", Ridge(random_state=17)),
               ("Lasso", Lasso(random_state=17)),
               ("ElasticNet", ElasticNet(random_state=17)),
               ('KNN', KNeighborsRegressor()),
               ('CART', DecisionTreeRegressor(random_state=17)),
               ('RF', RandomForestRegressor(random_state=17)),
               ('SVR', SVR()),
               ('GBM', GradientBoostingRegressor(random_state=17)),
               ("XGBoost", XGBRegressor(objective='reg:squarederror', random_state=17)),
               ("LightGBM", LGBMRegressor(random_state=17))]

for name, model in base_models:
    rmse = np.mean(np.sqrt(-cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")))
    print(f"RMSE: {round(rmse, 4)} ({name}) ")
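
The loop above reports the mean of the per-fold RMSE values. The sketch below, on synthetic data (an assumption, not the Hitters data), shows that this matches scikit-learn's built-in `neg_root_mean_squared_error` scorer, and that it is not the same quantity as the square root of the mean fold MSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for X and y
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=17)
model = LinearRegression()

# approach used above: per-fold MSE -> per-fold RMSE -> mean over folds
neg_mse = cross_val_score(model, X_demo, y_demo, cv=10, scoring="neg_mean_squared_error")
rmse_from_mse = np.mean(np.sqrt(-neg_mse))

# same quantity with the built-in RMSE scorer
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=10, scoring="neg_root_mean_squared_error")
rmse_direct = np.mean(-neg_rmse)

# a different aggregate: square root of the mean fold MSE (never smaller than the mean fold RMSE)
rmse_of_mean_mse = np.sqrt(np.mean(-neg_mse))

print(round(rmse_from_mse, 4), round(rmse_direct, 4), round(rmse_of_mean_mse, 4))
```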

### Hyperparameter Tuning for Some Models

In [None]:
# GridSearchCV parameter space for selected models
knn_params = {"n_neighbors": list(range(1, 31)),
              "algorithm": ["auto", "ball_tree", "kd_tree", "brute"]}

cart_params = {'max_depth': range(1, 20),
               "min_samples_split": range(2, 30)}

rf_params = {"max_depth": [5, 8, 15, None],
             # 1.0 uses all features; it replaces "auto", which newer scikit-learn versions no longer accept
             "max_features": [5, 7, 1.0],
             "min_samples_split": [8, 15, 20],
             "n_estimators": [100, 200, 300]}

lightgbm_params = {"learning_rate": [0.01, 0.1, 0.001],
                   "n_estimators": [300, 500, 1500],
                   "colsample_bytree": [0.5, 0.7, 1]}

In [None]:
fast_models = [("KNN", KNeighborsRegressor(), knn_params),
              ("CART", DecisionTreeRegressor(random_state=17), cart_params),
              ("RF", RandomForestRegressor(random_state=17), rf_params),
              ("LightGBM", LGBMRegressor(random_state=17), lightgbm_params)]

In [None]:
best_models = {}

for name, model, params in fast_models:
    rmse = np.mean(np.sqrt(-cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")))
    print(f"RMSE (Before): {round(rmse, 4)} ({name}) ")
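    # no `scoring` is passed, so candidates are ranked by the estimator's default score (R^2) with 3-fold CV
    gs_best = GridSearchCV(model, params, cv=3, n_jobs=-1, verbose=False).fit(X, y)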
    final_model = model.set_params(**gs_best.best_params_)
    rmse = np.mean(np.sqrt(-cross_val_score(final_model, X, y, cv=10, scoring="neg_mean_squared_error")))
    print(f"RMSE (After): {round(rmse, 4)} ({name}) ")
    print(f"{name} best params: {gs_best.best_params_}", end="\n\n")
    best_models[name] = final_model
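
If the goal is to select hyperparameters by RMSE directly, `GridSearchCV` accepts a `scoring` argument. The sketch below is one possible variant, not the procedure used above: it runs on synthetic data with a deliberately small, hypothetical grid (`demo_params`) and uses the same 10 folds for the search as for evaluation.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in for X and y
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=17)

# hypothetical, reduced grid for illustration
demo_params = {"n_neighbors": [3, 5, 10, 20]}

gs = GridSearchCV(KNeighborsRegressor(),
                  demo_params,
                  cv=10,
                  scoring="neg_root_mean_squared_error").fit(X_demo, y_demo)

# best_score_ is the negated mean RMSE of the winning candidate
best_rmse = -gs.best_score_
# best_estimator_ is already refit on all of the data with the best parameters
tuned_model = gs.best_estimator_

print(gs.best_params_, round(best_rmse, 4))
```

Using `gs.best_estimator_` also avoids mutating the original model object through `set_params`.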

### Conclusion

In conclusion, the best model appears to be **RF**, with a root-mean-squared error (RMSE) of **154.9561**. RMSE is a frequently used measure of the difference between the values predicted by a model or estimator and the observed values.
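
For reference, the metric can be written out as follows. The symbols are notation introduced here: $y_i$ is the observed salary, $\hat{y}_i$ the predicted salary, $n$ the number of players in a held-out fold, and $K = 10$ the number of folds.

$$
\mathrm{RMSE}_k = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\overline{\mathrm{RMSE}} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{RMSE}_k
$$

Because Salary is recorded in thousands of dollars, the RMSE is in the same unit: a score of 154.9561 corresponds to a typical prediction error of roughly 155 thousand dollars.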

Judging by the feature importances, the newly created features contribute to model performance, although their contribution varies with the model used.

* After setting q1=0.05, q3=0.95 and removing outliers in "Salary", 249 rows were left and the tuned RF achieved an RMSE of 185.6107.
* After setting q1=0.10, q3=0.90 and removing outliers in "Salary", 236 rows were left and the tuned RF achieved an RMSE of 154.9561.
* After setting q1=0.20, q3=0.80 and removing outliers in "Salary", 210 rows were left and the tuned RF achieved an RMSE of 126.4241.
* After setting q1=0.25, q3=0.75 and removing outliers in "Salary", 193 rows were left and the tuned RF achieved an RMSE of 118.4944.