# Salary Prediction for US Major League Baseball Players with Four Different Models

In this project, the data described below will be used to predict the salaries of baseball players. The data were retrieved from "https://www.kaggle.com".
    
    
### Description
    
#### Context

This dataset is part of the R-package ISLR and is used in the related book by G. James et al. (2013) "An Introduction to Statistical Learning with applications in R" to demonstrate how Ridge regression and the LASSO are performed using R.

#### Content
This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.

#### Format

A data frame with 322 observations of major league players on the following 20 variables.

- AtBat: Number of times at bat in 1986
- Hits: Number of hits in 1986
- HmRun: Number of home runs in 1986
- Runs: Number of runs in 1986
- RBI: Number of runs batted in in 1986
- Walks: Number of walks in 1986
- Years: Number of years in the major leagues
- CAtBat: Number of times at bat during his career
- CHits: Number of hits during his career
- CHmRun: Number of home runs during his career
- CRuns: Number of runs during his career
- CRBI: Number of runs batted in during his career
- CWalks: Number of walks during his career
- League: A factor with levels A and N indicating player’s league at the end of 1986
- Division: A factor with levels E and W indicating player’s division at the end of 1986
- PutOuts: Number of put outs in 1986
- Assists: Number of assists in 1986
- Errors: Number of errors in 1986
- Salary: 1987 annual salary on opening day in thousands of dollars
- NewLeague: A factor with levels A and N indicating player’s league at the beginning of 1987

#### Acknowledgements

Please cite/acknowledge: James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, www.StatLearning.com, Springer-Verlag, New York.
   


In [None]:
# Importing necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

# Silence all warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, ElasticNet, Lasso, LassoCV
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.preprocessing import RobustScaler


In [None]:
# Reading data

df = pd.read_csv("../input/hitters/Hitters.csv")  

### Understanding Data

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
df.shape

In [None]:
# detecting missing values 

df.isnull().sum()

The Salary variable has 59 missing values.

In [None]:
# The missingno package is needed to visualize the missing values.
# When working with Anaconda, it may have to be installed first:

conda install -c conda-forge/label/cf202003 missingno


In [None]:
#Visualizing missing values

import missingno as msno
msno.bar(df);

In [None]:
# Correlations among the features whose absolute correlation with Salary is above 0.5
# (the filter below keeps only those features)

# Only numeric columns can be correlated; the categorical columns are still strings here
correlation_matrix = df.select_dtypes(include="number").corr().round(2)
filtre=np.abs(correlation_matrix['Salary'])>0.50
corr_features=correlation_matrix.columns[filtre].tolist()
sns.clustermap(df[corr_features].corr(),annot=True,fmt=".2f")
plt.title('Correlation between features')
plt.show()

In [None]:
# Some of the variables are very highly correlated with each other. Normally this multicollinearity should be addressed, but I will leave it as it is here.
# Here I will delete the rows with missing values

df = df.dropna()

In [None]:
df.shape

In [None]:
df.sort_values('Salary', ascending = False).head()


In [None]:
# I have 3 categorical variables

df['League'].value_counts()

In [None]:
df['NewLeague'].value_counts()

In [None]:
df['Division'].value_counts()

In [None]:
# Transform the nominal variables with one-hot encoding.
# Each of the three variables has only two levels, so with drop_first=True each one becomes a single dummy column,
# which is equivalent to label encoding here. One-hot encoding matters most for nominal variables with 3 or more categories.

df = pd.get_dummies(df, columns = ['League', 'Division', 'NewLeague'], drop_first = True)
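
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data and the `toy_encoded` name are assumptions for illustration, not part of the Hitters file). Each two-level column collapses to a single indicator column, with the first level acting as the baseline.

```python
import pandas as pd

# Hypothetical toy frame with the same two-level categories as the Hitters data
toy = pd.DataFrame({"League": ["A", "N", "A"],
                    "Division": ["E", "W", "W"]})

# drop_first=True keeps one indicator per column (the first level is the baseline)
toy_encoded = pd.get_dummies(toy, columns=["League", "Division"], drop_first=True)

# Cast to int so the indicators read as 0/1 regardless of the pandas version
toy_encoded = toy_encoded.astype(int)
print(toy_encoded)
```

A row with `League_N = 0` is therefore a player in league A, and `Division_W = 0` means division E.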

In [None]:
df.head()

In [None]:
# For detecting outliers I will use LocalOutlierFactor. I will use default values of 20 and 'auto'.

clf=LocalOutlierFactor(n_neighbors=20, contamination='auto')
clf.fit_predict(df)
# Keep the scores in row order so they can be used as a row mask later
df_scores=clf.negative_outlier_factor_
# Sort only a copy to inspect the 20 lowest (most outlying) scores
np.sort(df_scores)[0:20]

In [None]:
?LocalOutlierFactor

In [None]:
# I will use the score at index 5 of the sorted scores as the threshold, because the scores after it are close to each other.
# First, however, I will visualize the outlier scores.

sns.boxplot(df_scores);

In [None]:
threshold=np.sort(df_scores)[5]
print(threshold)
df = df.loc[df_scores > threshold]
df = df.reset_index(drop=True)
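
The filtering step only removes the intended rows if the scores are still in the same order as the rows of the data frame. A minimal sketch on synthetic data, assuming one planted outlier (the `toy` frame and all names in it are hypothetical), shows the pattern: sort a copy to choose the threshold, then build the mask from the unsorted scores.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Hypothetical data: 100 clustered points plus one planted outlier in the last row
points = rng.normal(0, 1, size=(100, 2))
toy = pd.DataFrame(np.vstack([points, [[15.0, 15.0]]]), columns=["a", "b"])

lof = LocalOutlierFactor(n_neighbors=20, contamination="auto")
lof.fit_predict(toy)
scores = lof.negative_outlier_factor_   # one score per row, in row order

# Sort only a copy to pick the threshold; the mask uses the unsorted scores
threshold = np.sort(scores)[0]
toy_clean = toy.loc[scores > threshold].reset_index(drop=True)
print(toy.shape, toy_clean.shape)
```

If the mask were built from sorted scores instead, it would drop the first rows of the frame by position rather than the rows with the lowest scores.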

In [None]:
df.shape

In [None]:
# Standardization
# Salary is the dependent variable, and League_N, Division_W and NewLeague_N are dummy variables.
# I will first drop them, so that only the remaining independent variables (df_X) are scaled.
# Afterwards I will combine all of the independent variables again.

df_X=df.drop(['Salary','League_N','Division_W','NewLeague_N'], axis=1)
df_X.head()


In [None]:
from sklearn.preprocessing import StandardScaler
scaled_cols=StandardScaler().fit_transform(df_X)



scaled_cols=pd.DataFrame(scaled_cols, columns=df_X.columns)
scaled_cols.head()
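
The scaler above is fitted on all rows before the train/test split. A common alternative, not used in this notebook, is to fit the scaler on the training rows only and reuse its statistics for the test rows. The sketch below illustrates that variant on synthetic data; the `features` array and the other names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical numeric features standing in for the player statistics
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

train, test = train_test_split(features, test_size=0.20, random_state=46)

scaler = StandardScaler().fit(train)      # statistics come from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # the test rows reuse the training statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

With this ordering the test rows have no influence on the means and standard deviations used for scaling.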

In [None]:
cat_df=df.loc[:, "League_N":"NewLeague_N"]
cat_df.head()

In [None]:
Salary=pd.DataFrame(df['Salary'])

In [None]:
df=pd.concat([Salary,scaled_cols, cat_df], axis=1)
df.head()

In [None]:
# Dependent variable y = Salary; independent variables X = all variables except Salary

y = df['Salary']
X = df.drop('Salary', axis =1)

In [None]:
X

In [None]:
y

In [None]:
# We will evaluate our model results relative to the mean value of the predicted variable (y)

y.mean()

### MODELING

In [None]:
# Split the data into train and test sets.
# The test set is 20% of the data and random_state is 46 for all of the models, so that the models can be compared.

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=46)

### Linear Regression

In [None]:
linreg = LinearRegression()
model = linreg.fit(X_train,y_train)
y_pred = model.predict(X_test)
df_linreg_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_linreg_rmse

##### In the recorded run, the RMSE of the linear regression model is 382.00085575367274, compared with a mean salary (y.mean()) of 538.2316872586872.
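
To put these two numbers side by side, the following sketch computes the RMSE from its definition on a tiny made-up example and then divides the reported RMSE by the reported mean salary. The `rmse` helper and the toy values are assumptions for illustration; only the two long constants come from the text above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Tiny made-up example: errors of 3 and 4 give sqrt((9 + 16) / 2)
toy_rmse = rmse([10.0, 20.0], [13.0, 16.0])

# Values reported above for the linear model and the mean salary
relative_rmse = 382.00085575367274 / 538.2316872586872
print(toy_rmse, round(relative_rmse, 2))
```

The ratio comes out at roughly 0.71, so the typical prediction error of the untuned linear model is about 71% of the mean salary.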


### Ridge Regression

In [None]:
ridreg = Ridge()
model = ridreg.fit(X_train, y_train)
y_pred = model.predict(X_test)
df_ridreg_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_ridreg_rmse 

### Lasso Regression

In [None]:
lasreg = Lasso()
model = lasreg.fit(X_train,y_train)
y_pred = model.predict(X_test)
df_lasreg_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_lasreg_rmse

### Elastic Net Regression

In [None]:
enet = ElasticNet()
model = enet.fit(X_train,y_train)
y_pred = model.predict(X_test)
df_enet_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_enet_rmse

In [None]:
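# Root Mean Squared Error (RMSE) of the four models

def compML(df, y, alg):
    # Rebuild the same split from the arguments (same test_size and random_state as above)
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(y, axis=1), df[y], test_size=0.20, random_state=46)
    model = alg().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
    model_name = alg.__name__
    print(model_name, "Model RMSE:", RMSE)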

In [None]:
models = [LinearRegression, Ridge, Lasso, ElasticNet] 

In [None]:
for model in models:
    compML(df, 'Salary', model)
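
A single train/test split can make the ranking of the models depend on which rows happen to land in the test set. As a hedged complement, the sketch below compares the same four estimators by 10-fold cross-validated RMSE. It runs on synthetic data from `make_regression`, so `X_demo`, `y_demo` and `cv_rmse` are hypothetical names and the numbers say nothing about the Hitters data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic regression data standing in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=20.0, random_state=46)

cv_rmse = {}
for alg in [LinearRegression, Ridge, Lasso, ElasticNet]:
    # scikit-learn maximizes scores, so the MSE is returned negated
    neg_mse = cross_val_score(alg(), X_demo, y_demo, cv=10,
                              scoring="neg_mean_squared_error")
    # Convert each fold's MSE to RMSE, then average over the folds
    cv_rmse[alg.__name__] = float(np.sqrt(-neg_mse).mean())

for name, value in cv_rmse.items():
    print(name, "CV RMSE:", round(value, 2))
```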

## Model Tuning

### Ridge Regression Model Tuning

In [None]:
# Hyperparameter optimization with cross-validation.
# We will try to tune the model by trying different alpha values.
# The default alpha value in Ridge regression is 1.0.
# The best alpha value will be used in the final model.
# The features are already standardized, so no extra normalization is applied here.

alpha = [0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1]
ridreg_cv = RidgeCV(alphas = alpha, scoring = "neg_mean_squared_error", cv = 10)
ridreg_cv.fit(X_train, y_train)
ridreg_cv.alpha_

#Final Model 

ridreg_tuned = Ridge(alpha = ridreg_cv.alpha_).fit(X_train,y_train)
y_pred = ridreg_tuned.predict(X_test)
df_ridge_tuned_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_ridge_tuned_rmse

### Lasso Regression Model Tuning

In [None]:
# Hyperparameter optimization with cross-validation.
# We will try to tune the model by trying different alpha values.
# The default alpha value in Lasso regression is 1.0.
# The best alpha value will be used in the final model.
# The features are already standardized, so no extra normalization is applied here.

alpha = [0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1]
lasso_cv = LassoCV(alphas = alpha, cv = 10)
lasso_cv.fit(X_train, y_train)
lasso_cv.alpha_

# Final Model 

lasso_tuned = Lasso(alpha = lasso_cv.alpha_).fit(X_train,y_train)
y_pred = lasso_tuned.predict(X_test)
df_lasso_tuned_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
df_lasso_tuned_rmse

In [None]:
?Lasso

### Elastic Net Regression Model Tuning

In [None]:
?ElasticNet

In [None]:
# Hyperparameter optimization with cross-validation.
# We will try to tune the model by trying different alpha and l1_ratio values.
# The default alpha value is 1.0 and the default l1_ratio is 0.5 in ElasticNet regression.
# The best parameter values will be used in the final model.
# The search is fitted on the training data only, so the test set stays unseen during tuning.

enet_params = {"l1_ratio": [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],
              "alpha":[0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1]}
enet = ElasticNet()
enet_cv = GridSearchCV(enet, enet_params, cv = 10).fit(X_train, y_train)
enet_cv.best_params_

#Final Model 

enet_tuned = ElasticNet(**enet_cv.best_params_).fit(X_train,y_train)
y_pred = enet_tuned.predict(X_test)
df_enet_tuned_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_enet_tuned_rmse 
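
When no `scoring` argument is given, `GridSearchCV` ranks parameter settings by the estimator's default score, which is R² for regressors. Since the models here are compared by RMSE, one option is to select by RMSE as well. The sketch below shows this on synthetic data with a reduced grid; `X_demo`, `y_demo`, `param_grid` and `search` are hypothetical names, and `max_iter=10000` is an assumption made to help the small alpha values converge.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical synthetic data standing in for the scaled Hitters features
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=20.0, random_state=46)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.20, random_state=46)

param_grid = {"l1_ratio": [0.1, 0.5, 0.9, 1], "alpha": [0.001, 0.01, 0.1, 1]}

# Scores are negated RMSE values, so the best setting has the score closest to zero
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=10,
                      scoring="neg_root_mean_squared_error")
search.fit(X_tr, y_tr)   # tune on the training rows only

print(search.best_params_)
print("CV RMSE of best setting:", -search.best_score_)
```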

### Comparison of Results for the Four Basic and Tuned Models

In [None]:

ComparableResults_df =pd.DataFrame({"LINEAR":[df_linreg_rmse],"RIDGE":[df_ridreg_rmse],"RIDGE TUNED":[df_ridge_tuned_rmse],
                             "LASSO":[df_lasreg_rmse],"LASSO TUNED":[df_lasso_tuned_rmse], 
                             "ELASTIC NET":[df_enet_rmse], "ELASTIC NET TUNED":[df_enet_tuned_rmse]})

ComparableResults_df




## Result

In this project, four linear regression models were used to predict the salaries of US Major League Baseball players. The root mean squared error (RMSE) was calculated for the Linear, Ridge, Lasso, and ElasticNet regression models. The RMSE measures the typical deviation of the predictions from the observed values. Hyperparameter optimization was then used to try to decrease the RMSE values. In the recorded run, the lowest RMSE (295.98) was obtained from the tuned ElasticNet regression model. According to these results, the tuned ElasticNet regression model is the best of the compared models for predicting a US Major League Baseball player's salary.