> # Hitters Dataset Regression Models Workout

## Aim

The aim of this notebook is to build regression models that predict the salaries of baseball players from their statistics, and to compare the models by RMSE (Root Mean Square Error).
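As a reminder of what the metric measures, here is a minimal sketch of RMSE written directly with NumPy. The helper name `rmse` and the toy salary values are made up for illustration and are not part of the notebook or the dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy salaries in thousands of dollars (made-up values, not from the dataset).
actual = [500.0, 750.0, 100.0]
predicted = [400.0, 750.0, 300.0]

# Errors are 100, 0 and -200, so the result is sqrt(50000 / 3), about 129.1.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, one large miss weighs more than several small ones, and the result is in the same unit as the target (thousands of dollars here).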

## Description
### Context
This dataset is part of the R-package ISLR and is used in the related book by G. James et al. (2013) "An Introduction to Statistical Learning with applications in R" to demonstrate how Ridge regression and the LASSO are performed using R.

### Format
A data frame with 322 observations of major league players on the following 20 variables.

* AtBat: Number of times at bat in 1986
* Hits: Number of hits in 1986
* HmRun: Number of home runs in 1986
* Runs: Number of runs in 1986
* RBI: Number of runs batted in in 1986
* Walks: Number of walks in 1986
* Years: Number of years in the major leagues
* CAtBat: Number of times at bat during his career
* CHits: Number of hits during his career
* CHmRun: Number of home runs during his career
* CRuns: Number of runs during his career
* CRBI: Number of runs batted in during his career
* CWalks: Number of walks during his career
* League: A factor with levels A and N indicating player’s league at the end of 1986
* Division: A factor with levels E and W indicating player’s division at the end of 1986
* PutOuts: Number of put outs in 1986
* Assists: Number of assists in 1986
* Errors: Number of errors in 1986
* Salary: 1987 annual salary on opening day in thousands of dollars
* NewLeague: A factor with levels A and N indicating player’s league at the beginning of 1987

## Importing Libraries and Reading Data

In [None]:
import warnings
warnings.simplefilter(action='ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, ElasticNet, Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.preprocessing import RobustScaler

Hitters=pd.read_csv("../input/hitters/Hitters.csv")

## Data Understanding

In [None]:
df=Hitters.copy()
df.info()

In [None]:
df.describe().T

In [None]:
df[df.isnull().any(axis=1)].head(3)

In [None]:
df.isnull().sum().sum()

## Data Pre-Processing

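All of the missing values are in the 'Salary' variable. Even though it is the target variable, I'll fill them rather than drop the rows: each missing salary is replaced with the mean salary of players in the same league, division and years-of-experience band.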
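Group-wise mean imputation of this kind can be sketched on a small toy frame. The column names mirror the notebook, but the values are made up and only one grouping column is used to keep the example short.

```python
import numpy as np
import pandas as pd

# Made-up toy data; the column names mirror the notebook, the values do not.
toy = pd.DataFrame({
    "League": ["A", "A", "A", "N", "N"],
    "Salary": [100.0, np.nan, 300.0, np.nan, 50.0],
})

# Replace each missing salary with the mean of its own League group.
toy["Salary_filled"] = toy.groupby("League")["Salary"].transform(
    lambda s: s.fillna(s.mean())
)

# A group whose salaries are all missing would stay NaN, so check afterwards.
remaining = int(toy["Salary_filled"].isnull().sum())
print(toy)
print("still missing:", remaining)
```

Since the target itself is being imputed, the filled rows pull the models toward the group means, which is worth keeping in mind when reading the RMSE values later on.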

In [None]:
df=df.copy()

In [None]:
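# League, Division and NewLeague are still strings here, so restrict to numeric columns
df.select_dtypes(include="number").corr()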

In [None]:
df['Year_lab'] = pd.cut(x=df['Years'], bins=[0, 3, 6, 10, 15, 19, 24])
df.groupby(['League','Division', 'Year_lab']).agg({'Salary':'mean'})

In [None]:
df['Salary'] = df.groupby(['League', 'Division', 'Year_lab'])['Salary'].transform(lambda x: x.fillna(x.mean()))

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.shape

### Transformation Process
Converting categorical variables into binary codes

In [None]:
le = LabelEncoder()
df['League'] = le.fit_transform(df['League'])
df['Division'] = le.fit_transform(df['Division'])
df['NewLeague'] = le.fit_transform(df['NewLeague'])
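To make the encoding step concrete, the following sketch shows what `LabelEncoder` does with a two-level column such as League. The input list is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Made-up league labels; classes are sorted, so "A" becomes 0 and "N" becomes 1.
codes = le.fit_transform(["A", "N", "N", "A"])

# Recover the label-to-code mapping from the fitted encoder.
mapping = {str(label): int(code)
           for label, code in zip(le.classes_, le.transform(le.classes_))}

print(list(codes))
print(mapping)
```

Note that calling `fit_transform` again on another column refits the same encoder, so `le.classes_` only reflects the last column it was fitted on. Keeping one encoder per column is an option if the mapping has to be inverted later.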

In [None]:
df.head()

In [None]:
df['Year_lab'] = le.fit_transform(df['Year_lab'])

In [None]:
df.head()

In [None]:
df.info()

### Normalization

In [None]:
df_X= df.drop(["Salary","League","Division","NewLeague"], axis=1)

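# Note: preprocessing.normalize rescales each row (player) to unit L2 norm by default;
# it does not scale the columns (features) individually.
scaled_cols5=preprocessing.normalize(df_X)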

scaled_cols=pd.DataFrame(scaled_cols5, columns=df_X.columns)
scaled_cols.head()
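The difference between row-wise normalization and column-wise scaling is easy to miss, so here is a small sketch on a made-up array. `StandardScaler` is shown only as a contrast and is not used in this notebook.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# Made-up data: the second row is ten times the first.
X = np.array([[3.0, 4.0],
              [30.0, 40.0],
              [1.0, 0.0]])

# Default is norm="l2", axis=1: every row is divided by its own length.
row_scaled = preprocessing.normalize(X)

# Column-wise alternative: each feature gets zero mean and unit variance.
col_scaled = StandardScaler().fit_transform(X)

print(row_scaled)   # first two rows become identical
print(col_scaled)
```

After `normalize`, the first two rows are indistinguishable even though one player's raw numbers are ten times the other's. Whether that loss of magnitude is acceptable depends on the model, so it is worth comparing both scalings.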

In [None]:
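# Note: the "NewLeague":"Year_lab" slice also selects Year_lab, which is already in scaled_cols,
# so the combined frame below carries Year_lab twice (normalized and label-encoded).
cat_df=pd.concat([df.loc[:,"League":"Division"],df.loc[:,"NewLeague":"Year_lab"]], axis=1)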
cat_df.head()

In [None]:
df= pd.concat([scaled_cols,cat_df,df["Salary"]], axis=1)
df

In [None]:
df.shape

## Modeling

### Linear Regression

In [None]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=46)

linreg = LinearRegression()
model = linreg.fit(X_train,y_train)
y_pred = model.predict(X_test)
df_linreg_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_linreg_rmse

### Ridge Regression

In [None]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=46)


ridreg = Ridge()
model = ridreg.fit(X_train, y_train)
y_pred = model.predict(X_test)
df_ridreg_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_ridreg_rmse 

### Lasso Regression

In [None]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=46)


lasreg = Lasso()
model = lasreg.fit(X_train,y_train)
y_pred = model.predict(X_test)
df_lasreg_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_lasreg_rmse

### Elastic Net Regression

In [None]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=46)


enet = ElasticNet()
model = enet.fit(X_train,y_train)
y_pred = model.predict(X_test)
df_enet_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_enet_rmse
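The baseline cells above repeat the same split, fit and score steps for each model. One way to avoid the repetition is a small helper, sketched below. The function name `holdout_rmse` is hypothetical, and synthetic data from `make_regression` stands in for the Hitters frame so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def holdout_rmse(model, X, y, test_size=0.20, random_state=46):
    """Fit `model` on a train split and return RMSE on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, y_pred)))


# Synthetic stand-in data (an assumption for the demo, not the Hitters set).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

models = (LinearRegression(), Ridge(), Lasso(), ElasticNet())
scores = {type(m).__name__: holdout_rmse(m, X_demo, y_demo) for m in models}
print(scores)
```

Because the same `random_state` is used on every call, all models are scored on the same held-out rows, which matches what the separate cells do.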

## Model Tuning

### Ridge Regression Model Tuning

In [None]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=46)


alpha = [0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1]
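# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions drop it or scale the features beforehand.
ridreg_cv = RidgeCV(alphas = alpha, scoring = "neg_mean_squared_error", cv = 10, normalize = True)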
ridreg_cv.fit(X_train, y_train)
ridreg_cv.alpha_

#Final Model

ridreg_tuned = Ridge(alpha = ridreg_cv.alpha_).fit(X_train,y_train)
y_pred = ridreg_tuned.predict(X_test)
df_ridge_tuned_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_ridge_tuned_rmse
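On scikit-learn versions where `normalize` is no longer accepted, one possible replacement is to put a scaler and `RidgeCV` in a pipeline. The sketch below uses `StandardScaler` on synthetic data; this is not numerically identical to the old flag, so the selected alpha may differ from what the cell above reports.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for the demo, not the Hitters set).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

alphas = [0.001, 0.01, 0.1, 0.2, 0.3, 0.5, 0.8, 0.9, 1]

# The scaler is refit inside every CV fold, so no fold sees the others' statistics.
pipe = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=alphas, scoring="neg_mean_squared_error", cv=10),
)
pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
best_alpha = pipe.named_steps["ridgecv"].alpha_
print(best_alpha)
```

The fitted pipeline can be used for prediction directly, so a separate final `Ridge` model is optional with this layout.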

### Lasso Regression Model Tuning

In [None]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=46)

alpha = [0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1]
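# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions drop it or scale the features beforehand.
lasso_cv = LassoCV(alphas = alpha, cv = 10, normalize = True)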
lasso_cv.fit(X_train, y_train)
lasso_cv.alpha_

#Final Model

lasso_tuned = Lasso(alpha = lasso_cv.alpha_).fit(X_train,y_train)
y_pred = lasso_tuned.predict(X_test)
df_lasso_tuned_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
df_lasso_tuned_rmse

### Elastic Net Regression Model Tuning

In [None]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.20, 
                                                    random_state=46)

enet_params = {"l1_ratio": [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],
              "alpha":[0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1]}

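# GridSearchCV clones the estimator, so fitting it first is not required.
enet_model = ElasticNet().fit(X_train,y_train)
# Note: the search is fitted on the full X, y, so the test rows also influence the chosen
# hyperparameters; fit on X_train, y_train for a leakage-free test RMSE.
enet_cv = GridSearchCV(enet_model, enet_params, cv = 10).fit(X, y)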
enet_cv.best_params_

#Final Model

enet_tuned = ElasticNet(**enet_cv.best_params_).fit(X_train,y_train)
y_pred = enet_tuned.predict(X_test)
df_enet_tuned_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_enet_tuned_rmse 
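For comparison, here is a minimal sketch of the same search restricted to the training split and scored by RMSE instead of the default R². It runs on synthetic data with a reduced grid, both of which are assumptions made to keep the example fast and self-contained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption for the demo, not the Hitters set).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=46
)

# Reduced grid for speed; the notebook's grid is larger.
param_grid = {"l1_ratio": [0.1, 0.5, 0.9], "alpha": [0.01, 0.1, 1]}

search = GridSearchCV(ElasticNet(), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)      # the test split is never seen by the search

cv_rmse = -search.best_score_     # scorer is negated RMSE, so flip the sign
test_rmse = float(np.sqrt(mean_squared_error(y_test, search.predict(X_test))))
print(search.best_params_, cv_rmse, test_rmse)
```

With the default `refit=True`, `search.predict` uses the best model refitted on the whole training split, so the test RMSE here is not influenced by the test rows.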

In [None]:
basicresult_df = pd.DataFrame({"CONDITIONS":["df: filled with mean, normalized",],
                              "LINEAR":[df_linreg_rmse],
                               "RIDGE":[df_ridreg_rmse],
                              "RIDGE TUNED":[df_ridge_tuned_rmse],
                              "LASSO":[df_lasreg_rmse],
                              "LASSO TUNED":[df_lasso_tuned_rmse],                              
                              "ELASTIC NET":[df_enet_rmse],
                              "ELASTIC NET TUNED":[df_enet_tuned_rmse]
                              })

basicresult_df

# Reporting

The aim of this notebook was to build regression models that predict the salaries of baseball players from their statistics, and to compare the models by RMSE (Root Mean Square Error).

#### 1) Libraries were imported and the Hitters data set was read
#### 2) With Exploratory Data Analysis;
* Structural information of the dataset was checked.
* Types of variables in data set were examined.
* The size information of the data set has been accessed.
* The number of missing observations in each variable was checked. There were 59 missing observations, all of them in "Salary", which is the dependent variable.
* Descriptive statistics of the data set were examined.

#### 3) In Data Pre-Processing section;
* For df: NA values in "Salary" were filled with the mean salary of each league, division and years-of-experience group. The categorical variables were label-encoded, and the numeric X variables were normalized.

#### 4) During the Model Building phase;

Using the Linear, Ridge, Lasso and ElasticNet machine learning models, **RMSE** values representing the difference between actual and predicted values were calculated. Hyperparameter optimization was then applied to Ridge, Lasso and ElasticNet to try to reduce the error further.

#### 5) Conclusion;

The lowest RMSE (283.49) was obtained when the Elastic Net model with optimized hyperparameters was applied to the df data frame.