# Hitters - Non-Linear Models

## Aim
The aim of this notebook is to build non-linear models that predict baseball players' salaries from their 1986 season and career statistics, 
and to reduce the RMSE (Root Mean Squared Error) on a held-out test set as much as possible.


## Description
**Context**

This dataset is part of the R-package ISLR and is used in the related book by G. James et al. (2013) "An Introduction to Statistical Learning with applications in R" to demonstrate how Ridge regression and the LASSO are performed using R.

**Content**

This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.

**Format**

A data frame with 322 observations of major league players on the following 20 variables.

**AtBat**: Number of times at bat in 1986

**Hits**: Number of hits in 1986

**HmRun**: Number of home runs in 1986 

**Runs**: Number of runs in 1986 

**RBI**: Number of runs batted in in 1986 

**Walks**: Number of walks in 1986 

**Years**: Number of years in the major leagues 

**CAtBat**: Number of times at bat during his career 

**CHits**: Number of hits during his career 

**CHmRun**: Number of home runs during his career 

**CRuns**: Number of runs during his career 

**CRBI**: Number of runs batted in during his career 

**CWalks**: Number of walks during his career 

**League**: A factor with levels A and N indicating player’s league at the end of 1986 

**Division**: A factor with levels E and W indicating player’s division at the end of 1986 

**PutOuts**: Number of put outs in 1986 

**Assists**: Number of assists in 1986 

**Errors**: Number of errors in 1986 

**Salary**: 1987 annual salary on opening day in thousands of dollars 

**NewLeague**: A factor with levels A and N indicating player’s league at the beginning of 1987

In [None]:
import warnings
warnings.simplefilter(action='ignore')

# xgboost, lightgbm and catboost are separate packages; install them before importing
!pip install xgboost
!pip install lightgbm
!pip install catboost

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import LocalOutlierFactor, KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder, RobustScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# Reading the dataset

In [None]:
hitters = pd.read_csv('../input/hitters-baseball-data/Hitters.csv')
df = hitters.copy()
df.head()

# Creation and assignment of new variables

In [None]:
df['HitRatio'] = df['Hits'] / df['AtBat']
df['RunRatio'] = df['HmRun'] / df['Runs']
df['CHitRatio'] = df['CHits'] / df['CAtBat']
df['CRunRatio'] = df['CHmRun'] / df['CRuns']

df['Avg_AtBat'] = df['CAtBat'] / df['Years']
df['Avg_Hits'] = df['CHits'] / df['Years']
df['Avg_HmRun'] = df['CHmRun'] / df['Years']
df['Avg_Runs'] = df['CRuns'] / df['Years']
df['Avg_RBI'] = df['CRBI'] / df['Years']
df['Avg_Walks'] = df['CWalks'] / df['Years']
df['Avg_PutOuts'] = df['PutOuts'] / df['Years']
df['Avg_Assists'] = df['Assists'] / df['Years']
df['Avg_Errors'] = df['Errors'] / df['Years']
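
The ratio variables above divide by totals that can be zero (for example a player with `Runs == 0`). pandas does not raise in that case: it returns NaN for 0/0 and inf for a positive count over zero. A minimal sketch on a toy frame (the values are invented for illustration and are not taken from Hitters) of how such rows can be made visible before they are dropped:

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; not taken from Hitters.csv.
toy = pd.DataFrame({"HmRun": [5, 0, 2], "Runs": [20, 0, 10]})

# Same construction as RunRatio in the notebook.
toy["RunRatio"] = toy["HmRun"] / toy["Runs"]

# 0 / 0 evaluates to NaN, so a later dropna() silently removes that row.
n_nan = int(toy["RunRatio"].isna().sum())

# A positive count over a zero denominator would give inf, which dropna() keeps.
# Mapping inf to NaN first makes both cases show up in the same check.
toy["RunRatio"] = toy["RunRatio"].replace([np.inf, -np.inf], np.nan)

print(toy)
print("rows with an undefined ratio:", n_nan)
```

Counting these rows before `dropna` shows how many observations are lost to the new variables rather than to missing salaries.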

# Changing categorical variables into binary using Label Encoder

In [None]:
le = LabelEncoder()
df['League'] = le.fit_transform(df['League'])
df['Division'] = le.fit_transform(df['Division'])
df['NewLeague'] = le.fit_transform(df['NewLeague'])
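
Label Encoder sorts the class labels before numbering them, so in these columns `A` and `E` become 0 while `N` and `W` become 1. Because the cell above reuses a single encoder, only the last fitted mapping stays available afterwards. A minimal sketch on toy values (invented for illustration) that keeps one encoder per column in a hypothetical `encoders` dict, so the labels can be decoded later:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values invented for illustration; the real columns come from Hitters.csv.
toy = pd.DataFrame({
    "League": ["A", "N", "N", "A"],
    "Division": ["W", "E", "W", "E"],
})

# `encoders` is a hypothetical helper dict, not part of the notebook.
encoders = {}
for col in ["League", "Division"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the original labels in the order of their integer codes.
for col, enc in encoders.items():
    print(col, list(enc.classes_))

# Decode the integer codes back to the original labels.
decoded = encoders["Division"].inverse_transform(toy["Division"])
print(list(decoded))
```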

# Dropping N/A values

In [None]:
df.dropna(inplace = True)

# Dropping outliers using Local Outlier Factor

In [None]:
clf = LocalOutlierFactor(n_neighbors = 20, contamination = 0.1)
clf.fit_predict(df)
df_scores = clf.negative_outlier_factor_
np.sort(df_scores)[0:15]

In [None]:
thrs = np.sort(df_scores)[3]
thrs

In [None]:
df.drop(df[df_scores < thrs].index, inplace = True)
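
The threshold rule above takes the fourth-lowest score as the cut-off, so the three rows with strictly lower scores are removed. A minimal sketch of the same rule on synthetic data (100 random inliers plus three planted outliers, all invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 points around the origin plus 3 planted outliers.
inliers = rng.normal(0, 1, size=(100, 2))
outliers = np.array([[15.0, 15.0], [-18.0, 12.0], [20.0, -20.0]])
toy = pd.DataFrame(np.vstack([inliers, outliers]), columns=["a", "b"])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(toy)
scores = lof.negative_outlier_factor_  # more negative means more anomalous

# Fourth-lowest score as threshold: rows scoring strictly below it are dropped.
threshold = np.sort(scores)[3]
kept = toy[scores >= threshold]

print(len(toy), "->", len(kept))
```

Filtering with a boolean mask, as in this sketch, returns a new frame instead of modifying the original in place.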

# Creating dependent and independent variables for the model

In [None]:
dfx = df.copy()

In [None]:
dfx = dfx.drop(['AtBat','Hits','HmRun','Runs','RBI','Salary','League','Division','NewLeague'], axis = 1)
# I dropped 'Salary' since it is the dependent variable
# I dropped 'League', 'Division' and 'NewLeague' so that the 0/1 encoded variables are not scaled
# I dropped 'AtBat', 'Hits', 'HmRun', 'Runs' and 'RBI' because the newly created variables represent them better

# Scaling with Robust Scaler and assigning X and y

In [None]:
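cols = dfx.columns
scaler = RobustScaler()
# Note: the scaler is fitted on all rows here, before the train/test split
X = scaler.fit_transform(dfx)
X = pd.DataFrame(X, columns = cols)
# 1-D target, the shape scikit-learn regressors expect
y = df['Salary']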
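
The cell above fits the scaler on every row before the split, so the medians and quantile ranges it learns also reflect the rows that later form the test set. A common alternative, shown here only as a sketch on synthetic data (the column names `f1` to `f3` are invented), is to split first and fit the scaler on the training rows alone:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(46)

# Synthetic stand-in for the feature table and the target.
X_toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y_toy = pd.Series(rng.normal(size=50), name="target")

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=46)

# Fit on the training rows only, then apply the same transform to both parts.
scaler = RobustScaler().fit(X_tr)
X_tr_s = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_s = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

# The training medians are centred at zero; the test medians generally are not.
print(X_tr_s.median())
print(X_te_s.median())
```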

# Creating Train and Test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state  = 46)

# Creating a function that performs required operations using model names

In [None]:
def model_func(alg):
    if alg == CatBoostRegressor:
        model = alg(verbose = False).fit(X_train, y_train)
    else:
        model = alg().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
    model_name = alg.__name__
    print(model_name, 'RMSE: ', RMSE)

# Creating a function that performs hyperparameter optimization using model names and parameters

In [None]:
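def cv_func(alg, **param):
    # Keyword arguments form the grid: each name maps to the list of values to try
    params = dict(param)

    # GridSearchCV clones and fits the estimator itself, so an unfitted instance is enough.
    # Without a `scoring` argument, candidates are ranked by the estimator's default score (R^2 for regressors).
    cv_model = GridSearchCV(alg(), params, cv = 10, verbose = 2, n_jobs = -1).fit(X_train, y_train)
    print(cv_model.best_params_)

    # Refit on the training set with the best parameters and evaluate on the test set
    tuned_model = alg(**cv_model.best_params_).fit(X_train, y_train)
    y_pred = tuned_model.predict(X_test)
    RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
    model_name = alg.__name__
    print(model_name, 'RMSE: ', RMSE)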
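
Grid Search ranks the candidates by the estimator's default score unless a `scoring` argument is given. Since this notebook reports RMSE, one option is to make the search optimise the same metric. A minimal sketch on synthetic data, assuming scikit-learn's `neg_root_mean_squared_error` scorer and using KNeighborsRegressor only as an example:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the Hitters features and Salary.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=46)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=46)

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": np.arange(2, 30)},
    cv=10,
    scoring="neg_root_mean_squared_error",  # higher is better, hence the negative sign
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr.
y_pred = search.best_estimator_.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
cv_rmse = -search.best_score_  # flip the sign to read it as an RMSE

print(search.best_params_)
print("cross-validated RMSE:", round(cv_rmse, 2), "test RMSE:", round(rmse, 2))
```

Reusing `best_estimator_` also avoids fitting the tuned model a second time.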

# RMSE values for all non-linear models

In [None]:
models = [KNeighborsRegressor, SVR, MLPRegressor, GradientBoostingRegressor, DecisionTreeRegressor, RandomForestRegressor, XGBRegressor, LGBMRegressor, CatBoostRegressor]
for model in models:
    model_func(model)
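
The loop above prints one line per model. When many models are compared, it can help to collect the scores in a table and sort them. A minimal sketch, assuming a hypothetical helper `baseline_rmse` and synthetic data in place of Hitters; only scikit-learn estimators are used so that the block runs without the extra boosting packages:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def baseline_rmse(estimator_classes, X_tr, X_te, y_tr, y_te):
    """Fit each estimator with default parameters and return test RMSEs, best first."""
    rows = []
    for cls in estimator_classes:
        model = cls().fit(X_tr, y_tr)
        rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
        rows.append({"model": cls.__name__, "rmse": rmse})
    return pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)


# Synthetic stand-in for the Hitters features and Salary.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=46)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=46)

results = baseline_rmse(
    [KNeighborsRegressor, DecisionTreeRegressor, RandomForestRegressor],
    X_tr, X_te, y_tr, y_te,
)
print(results)
```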

# RMSE values for all non-linear models after hyperparameter optimization

In [None]:
cv_func(alg = KNeighborsRegressor, n_neighbors = np.arange(2,30,1))

In [None]:
cv_func(alg = SVR, C = [0.01, 0.02, 0.2, 0.1, 0.5, 0.8, 1])

In [None]:
cv_func(alg = MLPRegressor, alpha = [0.1, 0.02, 0.01, 0.001, 0.0001], hidden_layer_sizes = [(10,20), (5,5), (100,100)])

In [None]:
cv_func(alg = GradientBoostingRegressor, max_depth = [3,5,8], learning_rate = [0.001,0.01,0.1], n_estimators = [100,200,500,1000], subsample = [0.3,0.5,0.8,1])

In [None]:
cv_func(alg = DecisionTreeRegressor, max_depth = [2,3,4,5,10,20], min_samples_split = [2,5,10,20,30,50])

In [None]:
cv_func(alg = RandomForestRegressor, max_depth = [5,10,None], max_features = [5,10,15,20], n_estimators = [500, 1000], min_samples_split = [2,5,20,30])

In [None]:
cv_func(alg = XGBRegressor, max_depth = [2,3,4,5,8], learning_rate = [0.1,0.5,0.01], n_estimators = [100,200,500,1000], colsample_bytree = [0.4,0.7,1])

In [None]:
cv_func(alg = LGBMRegressor, max_depth = [1,2,3,4,5,6,7,8,9,10], n_estimators = [20,40,100,200,500,1000], learning_rate = [0.1,0.01,0.5,1])

In [None]:
cv_func(alg = CatBoostRegressor, iterations = [200], learning_rate = [0.02, 0.03, 0.05], depth = [8, 10])

# Final Model

In [None]:
# The best model can differ between runs. Gradient Boosting Regressor gave the lowest RMSE when I ran the code, so I created the final model manually from its best parameters.
final_model = GradientBoostingRegressor(learning_rate = 0.01, max_depth = 5, n_estimators = 500, subsample = 0.8)

In [None]:
final_model
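
The final model above is constructed but not yet fitted. A minimal sketch of fitting and evaluating a model with the same hyperparameters, on synthetic data in place of Hitters; the `random_state` argument is an added assumption that makes the row subsampling (`subsample` below 1) repeatable between runs:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled Hitters features and Salary.
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=46)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=46)

# Same hyperparameters as the final model; random_state is an assumption added here.
final_toy = GradientBoostingRegressor(
    learning_rate=0.01, max_depth=5, n_estimators=500, subsample=0.8, random_state=46
)
final_toy.fit(X_tr, y_tr)

rmse = np.sqrt(mean_squared_error(y_te, final_toy.predict(X_te)))
print("test RMSE:", round(rmse, 2))

# Impurity-based importances sum to 1; list feature indices from most to least important.
importances = final_toy.feature_importances_
print(np.argsort(importances)[::-1])
```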

# Comments

#### Hitters Dataset is read

#### Data Preprocessing 

* New variables (ratios and per-year averages) are created from some of the existing variables.
* Categorical variables are turned into 0/1 labels using Label Encoder.
* Rows with N/A values are dropped.
* Local Outlier Factor is used for outlier analysis, and the outliers are dropped.
* The dependent variable (Salary) and some other variables are dropped, and the remaining variables are scaled with Robust Scaler.
* Dependent and independent variables are assigned and split into train and test sets.

#### Modelling

* Two functions are defined. The first one takes a model class, fits it with default parameters, predicts on the test set and calculates the RMSE. The second one does the same after tuning the hyperparameters with Grid Search over the given parameter values.
* Both functions are run with the following non-linear models: KNeighborsRegressor, SVR, MLPRegressor, GradientBoostingRegressor, DecisionTreeRegressor, RandomForestRegressor, XGBRegressor, LGBMRegressor, CatBoostRegressor.
* The final model is created from the non-linear model with the lowest RMSE, using its optimal hyperparameters.
