# Baseball Regression Modeling: Model Selection Using L1

## Learning Objectives

Run a basic regression using L1 regularization to conduct variable selection
    * run a simple OLS
    * analyze output
    * run L1 regression and identify best regularization parameter
    * interpret model parameters

## Imports

In [1]:
import pandas as pd
import statsmodels.api as sms
import sklearn.linear_model as lm

In [2]:
%matplotlib inline

## Get Data and Subset Data

In [3]:
# retrieve csv file and store to dataframe
df = pd.read_csv('baseball_data.csv')

# subset the dataframe removing rows with NULL values
bix = df.notnull().all(axis=1)
df = df[bix]

## A First Regression Model Using All Variables

The hope here is that we can get a batch look at how the variables relate to the target. The model does fit, but strong correlations among the covariates (multicollinearity) make the individual coefficients hard to trust, so this first look is of limited use.

In [4]:
# separate the covariates from the target (salary is the first column of df)
covariates = df.iloc[:, 1:]
covariates_with_ones = sms.add_constant(covariates)
target = df.salary_in_thousands_of_dollars

In [5]:
# build first model
model = sms.OLS(target, covariates_with_ones)

Notes:
    * we achieve a fairly high R2 (.711) right off the bat with this approach
    * several variables are significant or nearly so
        - on_base_percentage
        - number_of_runs
        - number_of_runs_batted_in
        - number_of_strike_outs
        - number_of_stolen_bases
        - indicator_of_free_agency_eligibility
        - indicator_of_free_agent_in_1991_1992
        - indicator_of_arbitration_eligibility
        - indicator_of_arbitration_in_1991_1992
    * there are a large number of variables; how do we know which should be in the model and which should be left out?
    * warning 2 in the printed output below states there may be high multicollinearity (high correlation between covariates)
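
One common way to quantify the multicollinearity mentioned in the last note is the variance inflation factor (VIF): regress each covariate on all the others and compute 1 / (1 - R2). The sketch below does this with NumPy on synthetic data, since the baseball file is not reproduced here. The helper name `vif` and the toy columns are illustrative assumptions, not part of the original notebook.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # regress column j on an intercept plus all remaining columns
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


# synthetic stand-ins for three covariates (not the real baseball data)
rng = np.random.default_rng(0)
hits = rng.normal(100, 20, size=300)
runs = 0.5 * hits + rng.normal(0, 2, size=300)  # strongly tied to hits
errors = rng.normal(10, 3, size=300)            # unrelated to the others

vifs = vif(np.column_stack([hits, runs, errors]))
print(dict(zip(["hits", "runs", "errors"], vifs.round(1))))
```

A VIF near 1 means a covariate is nearly uncorrelated with the rest; values well above 5 or 10 are a common rule-of-thumb signal that coefficients and their p-values are unstable.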

In [6]:
result = model.fit()

In [7]:
print(result.summary())

                                  OLS Regression Results                                  
Dep. Variable:     salary_in_thousands_of_dollars   R-squared:                       0.711
Model:                                        OLS   Adj. R-squared:                  0.696
Method:                             Least Squares   F-statistic:                     47.90
Date:                            Fri, 24 Mar 2017   Prob (F-statistic):           4.71e-74
Time:                                    10:46:30   Log-Likelihood:                -2602.7
No. Observations:                             329   AIC:                             5239.
Df Residuals:                                 312   BIC:                             5304.
Df Model:                                      16                                         
Covariance Type:                        nonrobust                                         
                                            coef    std err          t      P>|t|      [0.

## Model Selection Using L1 Regression

Notes:
        * fit an L1 penalized model, using the BIC criterion to select a "best" fit
        * this may force some of the parameters to zero
        * the nonzero variables are the selected variables
        * last we will refit the "best" model and interpret the variables
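
For reference, the information criteria printed in a statsmodels summary follow AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k counts the estimated parameters (including the constant) and n is the number of observations. As a quick sanity check, the sketch below recomputes both from the values reported for the full OLS model above (n = 329, k = 17, ln L = -2602.7). The function names are illustrative. `LassoLarsIC` evaluates its own version of the criterion along the regularization path, so this only shows what the criterion trades off, not scikit-learn's exact computation.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood


# values read off the first OLS summary: 16 covariates plus the constant
n_obs, n_params, loglik = 329, 17, -2602.7

print(round(aic(loglik, n_params)))         # summary reports AIC 5239.
print(round(bic(loglik, n_params, n_obs)))  # summary reports BIC 5304.
```

Because the BIC penalty per parameter is ln(n), about 5.8 here, versus 2 for AIC, BIC tends to favor smaller models, which suits variable selection.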

In [8]:
covariates_with_ones.columns

Index([u'const', u'batting_average', u'on_base_percentage', u'number_of_runs',
       u'number_of_hits', u'number_of_doubles', u'number_of_triples',
       u'number_of_home_runs', u'number_of_runs_batted_in', u'number_of_walks',
       u'number_of_strike_outs', u'number_of_stolen_bases',
       u'number_of_errors', u'indicator_of_free_agency_eligibility',
       u'indicator_of_free_agent_in_1991_1992',
       u'indicator_of_arbitration_eligibility',
       u'indicator_of_arbitration_in_1991_1992'],
      dtype='object')

In [9]:
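# prepare the model; BIC picks the regularization strength along the LARS path
# (the `normalize` argument has been removed from recent scikit-learn releases;
# there, standardize the covariates before fitting instead)
reg = lm.LassoLarsIC(normalize=True, criterion='bic', fit_intercept=True)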

In [10]:
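# fit the model (the const column is redundant because fit_intercept=True,
# so its coefficient comes back as 0; it is kept so indices match the columns)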
reg.fit(covariates_with_ones, df.salary_in_thousands_of_dollars)

LassoLarsIC(copy_X=True, criterion='bic', eps=2.2204460492503131e-16,
      fit_intercept=True, max_iter=500, normalize=True, positive=False,
      precompute='auto', verbose=False)

In [11]:
# review coefficients after the fit
reg.coef_

array([    0.        ,     0.        ,     0.        ,     6.31852769,
           0.        ,     0.        ,     0.        ,    11.10020813,
          15.11688211,     0.        ,    -5.2473569 ,     6.58256519,
          -6.24006217,  1265.34889908,  -227.93452271,   693.88293699,
         204.81194441])

In [12]:
# Selecting variables with non-zero coefficients
selected_columns = []

for i in range(len(reg.coef_)):
    if abs(reg.coef_[i]) > 0:
        selected_columns.append(covariates_with_ones.columns[i])
        
selected_columns

['number_of_runs',
 'number_of_home_runs',
 'number_of_runs_batted_in',
 'number_of_strike_outs',
 'number_of_stolen_bases',
 'number_of_errors',
 'indicator_of_free_agency_eligibility',
 'indicator_of_free_agent_in_1991_1992',
 'indicator_of_arbitration_eligibility',
 'indicator_of_arbitration_in_1991_1992']
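
The loop above can also be written as a single comprehension that pairs each column name with its coefficient. The sketch below replays the values printed in this section as plain lists so that it runs without the fitted `reg` object; with the real objects one would pair `covariates_with_ones.columns` with `reg.coef_` in the same way.

```python
# column names and coefficients copied from the outputs above
columns = [
    'const', 'batting_average', 'on_base_percentage', 'number_of_runs',
    'number_of_hits', 'number_of_doubles', 'number_of_triples',
    'number_of_home_runs', 'number_of_runs_batted_in', 'number_of_walks',
    'number_of_strike_outs', 'number_of_stolen_bases', 'number_of_errors',
    'indicator_of_free_agency_eligibility',
    'indicator_of_free_agent_in_1991_1992',
    'indicator_of_arbitration_eligibility',
    'indicator_of_arbitration_in_1991_1992',
]
coefs = [
    0.0, 0.0, 0.0, 6.31852769,
    0.0, 0.0, 0.0, 11.10020813,
    15.11688211, 0.0, -5.2473569, 6.58256519,
    -6.24006217, 1265.34889908, -227.93452271, 693.88293699,
    204.81194441,
]

# keep the names whose L1 coefficient was not shrunk to exactly zero
selected = [name for name, coef in zip(columns, coefs) if coef != 0]
dropped = [name for name, coef in zip(columns, coefs) if coef == 0]

print(len(selected), "selected;", len(dropped), "dropped")
print(dropped)
```

Listing the dropped names makes it easy to see that `const`, `batting_average`, `on_base_percentage`, `number_of_hits`, `number_of_doubles`, `number_of_triples` and `number_of_walks` never reach the refits.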

## Rerun Regression on Selected Variables

Notes:
    * Using the variables selected above, we re-fit with plain OLS (no L1 regularization).
    * In the first fit below (Model 1) number_of_runs, number_of_strike_outs, number_of_errors and indicator_of_arbitration_in_1991_1992 are not significant, so we re-fit after dropping them (Model 2).
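
One caveat when reading the summaries below: `selected_columns` does not contain `const` (its L1 coefficient is zero), so these refits have no intercept, which matches the reported degrees of freedom (10 model, 319 residual). For models without a constant, statsmodels reports R-squared against the uncentered total sum of squares, so the values below should not be compared directly with the .711 of the first model. The toy example here, with made-up numbers and a single regressor, shows how far the two definitions can diverge.

```python
# made-up data: y has a large mean that a no-intercept line cannot capture well
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.5, 11.0, 12.5, 12.0, 14.0]

# least-squares slope for the no-intercept model y = beta * x
beta = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# residual sum of squares of that fit
ssr = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))

mean_y = sum(y) / len(y)
centered_tss = sum((yi - mean_y) ** 2 for yi in y)  # used when a constant is present
uncentered_tss = sum(yi ** 2 for yi in y)           # used when there is no constant

r2_centered = 1 - ssr / centered_tss
r2_uncentered = 1 - ssr / uncentered_tss

print(round(r2_centered, 3), round(r2_uncentered, 3))
```

The uncentered value is never smaller than the centered one, because the uncentered total sum of squares is at least as large. AIC and BIC do not have this problem, which makes them a safer basis for comparing the models in this notebook.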

##### Model 1

In [13]:
model = sms.OLS(target, 
                covariates_with_ones[selected_columns])

In [14]:
results = model.fit()

In [15]:
print(results.summary())

                                  OLS Regression Results                                  
Dep. Variable:     salary_in_thousands_of_dollars   R-squared:                       0.855
Model:                                        OLS   Adj. R-squared:                  0.850
Method:                             Least Squares   F-statistic:                     187.4
Date:                            Fri, 24 Mar 2017   Prob (F-statistic):          4.37e-127
Time:                                    10:46:30   Log-Likelihood:                -2606.5
No. Observations:                             329   AIC:                             5233.
Df Residuals:                                 319   BIC:                             5271.
Df Model:                                      10                                         
Covariance Type:                        nonrobust                                         
                                            coef    std err          t      P>|t|      [0.

##### Model 2 - Removed insignificant variables (the constant term is also excluded)

Interpretation:
    * for the continuous variables number_of_runs_batted_in and number_of_stolen_bases we interpret the coefficients as follows, holding the other covariates fixed
        - one additional run batted in corresponds to an 18.7 thousand dollar increase in salary
        - one additional stolen base corresponds to an 11.4 thousand dollar increase in salary
    * for the indicator variables we interpret the coefficients as
        - free agency eligibility corresponds to an average 1.33 million dollar increase in salary
        - arbitration eligibility corresponds to an average 0.87 million dollar increase in salary
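
To make the interpretation concrete, the sketch below treats the coefficients quoted above as a small lookup table and computes the implied salary change for a hypothetical change in a player's statistics. The dictionary and the `salary_change` helper are illustrative only. The coefficient for indicator_of_free_agent_in_1991_1992 is not quoted in the text, so it is left out.

```python
# coefficients quoted in the interpretation, in thousands of dollars
coefficients = {
    'number_of_runs_batted_in': 18.7,
    'number_of_stolen_bases': 11.4,
    'indicator_of_free_agency_eligibility': 1330.0,
    'indicator_of_arbitration_eligibility': 870.0,
}


def salary_change(deltas):
    """Implied change in salary (thousands of dollars) for the given changes
    in covariates, holding everything else fixed."""
    return sum(coefficients[name] * delta for name, delta in deltas.items())


# hypothetical player: 10 more runs batted in and 5 more stolen bases
change = salary_change({
    'number_of_runs_batted_in': 10,
    'number_of_stolen_bases': 5,
})
print(round(change, 1), "thousand dollars")
```

Because the model is linear, the effects simply add: 10 x 18.7 plus 5 x 11.4 thousand dollars. These are associations in this sample rather than causal effects.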

In [16]:
model = sms.OLS(target, 
                    covariates_with_ones[[
                        'number_of_home_runs',
                        'number_of_runs_batted_in',
                        'number_of_stolen_bases',
                        'indicator_of_free_agency_eligibility',
                        'indicator_of_free_agent_in_1991_1992',
                        'indicator_of_arbitration_eligibility'
                        ]])

In [17]:
results = model.fit()

In [18]:
print(results.summary())

                                  OLS Regression Results                                  
Dep. Variable:     salary_in_thousands_of_dollars   R-squared:                       0.829
Model:                                        OLS   Adj. R-squared:                  0.826
Method:                             Least Squares   F-statistic:                     261.9
Date:                            Fri, 24 Mar 2017   Prob (F-statistic):          7.70e-121
Time:                                    10:46:30   Log-Likelihood:                -2632.7
No. Observations:                             329   AIC:                             5277.
Df Residuals:                                 323   BIC:                             5300.
Df Model:                                       6                                         
Covariance Type:                        nonrobust                                         
                                           coef    std err          t      P>|t|      [0.0

##### Model 3 - Further removal of an insignificant variable (number_of_home_runs)

In [19]:
model = sms.OLS(target, 
                    covariates_with_ones[[
                        'number_of_runs_batted_in',
                        'number_of_stolen_bases',
                        'indicator_of_free_agency_eligibility',
                        'indicator_of_free_agent_in_1991_1992',
                        'indicator_of_arbitration_eligibility'
                        ]])

In [20]:
results = model.fit()

In [21]:
print(results.summary())

                                  OLS Regression Results                                  
Dep. Variable:     salary_in_thousands_of_dollars   R-squared:                       0.828
Model:                                        OLS   Adj. R-squared:                  0.826
Method:                             Least Squares   F-statistic:                     312.4
Date:                            Fri, 24 Mar 2017   Prob (F-statistic):          1.39e-121
Time:                                    10:46:30   Log-Likelihood:                -2633.9
No. Observations:                             329   AIC:                             5278.
Df Residuals:                                 324   BIC:                             5297.
Df Model:                                       5                                         
Covariance Type:                        nonrobust                                         
                                           coef    std err          t      P>|t|      [0.0