# Hitters

## Aim
The aim of this notebook is to build a regression model that predicts baseball players' salaries from their statistics,
and to reduce the RMSE (root mean square error) as much as possible.
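
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. Below is a minimal sketch of that calculation on made-up numbers; the helper name `rmse` and the toy salaries are illustrative assumptions, not part of the dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy salaries in thousands of dollars (made-up values for illustration).
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 320.0]

# Errors are -10, 10 and -20, so the mean squared error is 200.
toy_rmse = rmse(actual, predicted)
```

The model cells later in the notebook compute the same quantity with `np.sqrt(mean_squared_error(...))`.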


## Description
**Context**

This dataset is part of the R-package ISLR and is used in the related book by G. James et al. (2013) "An Introduction to Statistical Learning with applications in R" to demonstrate how Ridge regression and the LASSO are performed using R.

**Content**

This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University. This is part of the data that was used in the 1988 ASA Graphics Section Poster Session. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York.

**Format**

A data frame with 322 observations of major league players on the following 20 variables.\
**AtBat**: Number of times at bat in 1986 \
**Hits**: Number of hits in 1986 \
**HmRun**: Number of home runs in 1986 \
**Runs**: Number of runs in 1986 \
**RBI**: Number of runs batted in in 1986 \
**Walks**: Number of walks in 1986 \
**Years**: Number of years in the major leagues \
**CAtBat**: Number of times at bat during his career \
**CHits**: Number of hits during his career \
**CHmRun**: Number of home runs during his career \
**CRuns**: Number of runs during his career \
**CRBI**: Number of runs batted in during his career \
**CWalks**: Number of walks during his career \
**League**: A factor with levels A and N indicating player’s league at the end of 1986 \
**Division**: A factor with levels E and W indicating player’s division at the end of 1986 \
**PutOuts**: Number of put outs in 1986 \
**Assists**: Number of assists in 1986 \
**Errors**: Number of errors in 1986 \
**Salary**: 1987 annual salary on opening day in thousands of dollars \
**NewLeague**: A factor with levels A and N indicating player’s league at the beginning of 1987

In [None]:
import warnings
warnings.simplefilter(action='ignore')

import pandas as pd
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, ElasticNet, ElasticNetCV, Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

## Reading the data

In [None]:
hitters = pd.read_csv('../input/hitters/Hitters.csv')
df = hitters.copy()

In [None]:
df.head()

## Exploratory Data Analysis

In [None]:
df.describe().T

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df['Salary'].describe()

## Handling the missing values

All of the missing values are in the 'Salary' variable. Even though it is the target variable, I fill it in rather than dropping the rows, since the dataset is very small. League, Division and Years are, I think, likely to influence Salary. Therefore I binned 'Years' into intervals and then filled each missing salary with the mean salary of its League, Division and Year_interval group.
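
To make the idea concrete, here is a small sketch of group-wise mean imputation on a toy frame. The values are made up and only one grouping column is used to keep it short; the notebook cells below group by three columns.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): two leagues, one missing salary in each.
toy = pd.DataFrame({
    "League": ["A", "A", "A", "N", "N", "N"],
    "Salary": [100.0, np.nan, 300.0, 50.0, 70.0, np.nan],
})

# Within each group, replace NaN with that group's mean of the observed values.
toy["Salary_filled"] = (
    toy.groupby("League")["Salary"]
       .transform(lambda s: s.fillna(s.mean()))
)
```

League A has an observed mean of 200 and league N a mean of 60, so those are the values written into the gaps. A group with no observed salary at all would stay missing, because its mean is itself NaN.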

In [None]:
df['Year_interval'] = pd.cut(x=df['Years'], bins=[0, 3, 6, 10, 15, 19, 24])

In [None]:
df.head()

In [None]:
df.groupby(['League','Division', 'Year_interval']).agg({'Salary':'mean'})

In [None]:
df['Salary'] = df.groupby(['League', 'Division', 'Year_interval'])['Salary'].transform(lambda x: x.fillna(x.mean()))

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.drop('Year_interval', axis = 1, inplace = True)

## Changing categorical variables into binary

In [None]:
le = LabelEncoder()
df['League'] = le.fit_transform(df['League'])
df['Division'] = le.fit_transform(df['Division'])
df['NewLeague'] = le.fit_transform(df['NewLeague'])

In [None]:
df.head()

## First Model Test

In [None]:
X = df.drop('Salary', axis = 1)
y = df[['Salary']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.20, random_state = 46)

In [None]:
reg_model = LinearRegression()

In [None]:
reg_model.fit(X_train, y_train)

In [None]:
y_pred_reg = reg_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_reg))

## Outlier Analysis

In [None]:
import seaborn as sns
# Pass the series by keyword so the plot does not depend on positional-argument handling.
sns.boxplot(x=df['Salary']);

I dropped two observations manually.

In [None]:
df.sort_values('Salary', ascending = False)

In [None]:
df.drop(df.iloc[217:218,:].index, inplace = True)

In [None]:
df.drop(df.iloc[294:295, :].index, inplace = True)
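
One thing to keep in mind with the two `iloc`-based drops above: `iloc` is positional, so the second drop runs on a frame that is already one row shorter, and position 294 there is not the row that sat at position 294 originally. A minimal sketch on a toy frame (made-up values, hypothetical names) of capturing the index labels first and dropping them in one call:

```python
import pandas as pd

# Toy frame with a default integer index 0..4 (made-up values).
toy = pd.DataFrame({"Salary": [10, 20, 30, 40, 50]})

# Capture the labels at positions 1 and 3 before any row is removed.
labels_to_drop = toy.index[[1, 3]]

# Drop both rows by label in a single call; positions cannot shift in between.
toy_clean = toy.drop(labels_to_drop)
```

Whether this matters here depends on which rows were meant to be removed; the sketch only shows a way to make the selection independent of the order of the drops.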

## Model

In [None]:
X = df.drop('Salary', axis = 1)
y = df[['Salary']]
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.20, random_state = 46)

In [None]:
reg_model.fit(X_train, y_train)

In [None]:
y_pred_reg = reg_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_reg))

RMSE decreased to 298 by dropping just 2 observations manually.

## Modelling with other regression techniques

In [None]:
enet = ElasticNet().fit(X_train, y_train)
y_pred_enet = enet.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_enet))

In [None]:
lasso_model = Lasso().fit(X_train, y_train)
y_pred_lass = lasso_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_lass))

In [None]:
ridge_model = Ridge().fit(X_train, y_train)
y_pred_ridge = ridge_model.predict(X_test)  # predict with the ridge model, not the elastic net
np.sqrt(mean_squared_error(y_test, y_pred_ridge))

## Modelling with Hyperparameter Optimization

In [None]:
alphas1 = np.linspace(0,1,1000)
alphas2 = 10**np.linspace(10,-2,100)*0.5
alphas3 = np.random.randint(0,1000,100)
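
Note that `alphas3` is drawn without a seed, so it changes on every run, and it can contain 0, which means no regularization at all. As a hedged alternative, the sketch below builds a seeded, strictly positive random grid and a log-spaced grid. The names `alphas_random` and `alphas_log` and the seed value are assumptions for illustration and are not used elsewhere in this notebook.

```python
import numpy as np

# Seeded generator so the candidate grid is the same on every run.
# The seed 46 simply mirrors the random_state used for the train/test split.
rng = np.random.default_rng(46)

# 100 random integers in [1, 999]; the upper bound is exclusive, and 0 is never drawn.
alphas_random = rng.integers(1, 1000, size=100)

# 100 values from 0.01 to 1000, evenly spaced on a log scale.
alphas_log = np.logspace(-2, 3, 100)
```

Either array could be passed as `alphas` to the `*CV` estimators in place of `alphas3`; the selected alpha and the resulting RMSE would then differ from the runs shown here.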

In [None]:
enet_cv = ElasticNetCV(alphas = alphas3, cv = 10).fit(X_train, y_train)

In [None]:
enet_cv.alpha_

In [None]:
enet_tuned = ElasticNet(alpha = enet_cv.alpha_, l1_ratio = 0.999).fit(X_train, y_train)

In [None]:
y_pred_enett = enet_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_enett))

In [None]:
lasso_cv = LassoCV(alphas = alphas3, cv = 10).fit(X_train, y_train)
lasso_cv.alpha_

In [None]:
lasso_tuned = Lasso(alpha = lasso_cv.alpha_).fit(X_train, y_train)
y_pred_lassot = lasso_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_lassot))

In [None]:
ridge_cv = RidgeCV(alphas = alphas3, cv = 10).fit(X_train, y_train)
ridge_cv.alpha_

In [None]:
ridge_tuned = Ridge(alpha = ridge_cv.alpha_).fit(X_train, y_train)
y_pred_ridget = ridge_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_ridget))

## Outlier Detection with Local Outlier Factor

In [None]:
clf = LocalOutlierFactor(n_neighbors = 20, contamination = 0.1)

In [None]:
clf.fit_predict(df)

In [None]:
df_scores = clf.negative_outlier_factor_

In [None]:
np.sort(df_scores)[:30]

In [None]:
threshold = np.sort(df_scores)[4]

In [None]:
threshold

In [None]:
df.drop(df[df_scores<threshold].index, inplace = True)
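
To illustrate how the score threshold works, here is a small sketch on synthetic 2-D points (made-up data, hypothetical variable names). `negative_outlier_factor_` is close to -1 for points in dense regions and much lower for isolated points, so sorting the scores in ascending order puts the strongest outlier candidates first.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: 50 points around the origin plus one far-away point.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
outlier = np.array([[10.0, 10.0]])
points = np.vstack([inliers, outlier])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(points)
scores = lof.negative_outlier_factor_

# The lowest (most negative) score marks the most isolated point.
most_anomalous = int(np.argmin(scores))

# Use the second-lowest score as the threshold: only rows scoring
# strictly below it are removed, which here is the single far-away point.
threshold_toy = np.sort(scores)[1]
kept = points[scores >= threshold_toy]
```

By the same logic, taking `np.sort(df_scores)[4]` as the threshold and dropping rows with a strictly lower score removes the four lowest-scoring observations, assuming there are no ties at the threshold.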

## Last Model Tests

In [None]:
X = df.drop('Salary', axis = 1)
y = df[['Salary']]
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.20, random_state = 46)

In [None]:
reg_model.fit(X_train, y_train)
y_pred_reg = reg_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_reg))

In [None]:
enet = ElasticNet().fit(X_train, y_train)
y_pred_enet = enet.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_enet))

In [None]:
lasso_model = Lasso().fit(X_train, y_train)
y_pred_lass = lasso_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_lass))

In [None]:
ridge_model = Ridge().fit(X_train, y_train)
y_pred_ridge = ridge_model.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_ridge))

## Last models with hyperparameter optimization

In [None]:
enet_cv = ElasticNetCV(alphas = alphas3, cv = 10).fit(X_train, y_train)
enet_cv.alpha_

In [None]:
enet_tuned = ElasticNet(alpha = enet_cv.alpha_, l1_ratio = 0.01).fit(X_train, y_train)
y_pred_enett = enet_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_enett))

In [None]:
ridge_cv = RidgeCV(alphas = alphas1, cv = 10).fit(X_train, y_train)
ridge_cv.alpha_

In [None]:
ridge_tuned = Ridge(alpha = ridge_cv.alpha_).fit(X_train, y_train)
y_pred_ridget = ridge_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_ridget))

In [None]:
lasso_cv = LassoCV(alphas = alphas1, cv = 10).fit(X_train, y_train)
lasso_cv.alpha_

In [None]:
lasso_tuned = Lasso(alpha = lasso_cv.alpha_).fit(X_train, y_train)
y_pred_lassot = lasso_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_lassot))

## Final model: an unusually large alpha turns out to be optimal

In [None]:
enet_tuned = ElasticNet(alpha = 11250, l1_ratio = 0.7).fit(X_train, y_train)
y_pred_enett = enet_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test, y_pred_enett))