# Baseball Regression Modeling: Managing Multicollinearity with PCA

## Learning Objectives

Use principal component analysis (PCA) to model out multicollinearity:
    * identify multicollinearity in the covariates
    * fit and interpret PCA
    * run regression on principal components

## Imports

## Get Data and Subset Data

## A First Regression Model Using All Variables

The hope here is to see, in a single pass, how each variable relates to the target. Unfortunately, tight correlations among the covariates make the individual coefficient estimates hard to trust.

In [None]:
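# sms is assumed to be an alias for statsmodels.api; every column after the first is used as a covariate
model = sms.OLS(df.salary_in_thousands_of_dollars, sms.add_constant(df.iloc[:, 1:]))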

Notes:
    * we achieve a fairly high R-squared right off the bat with this approach
    * several variables are significant or nearly so
        - on_base_percentage
        - number_of_runs
        - number_of_runs_batted_in
        - number_of_strike_outs
        - number_of_stolen_bases
        - indicator_of_free_agency_eligibility
        - indicator_of_free_agent_in_1991_1992
        - indicator_of_arbitration_eligibility
        - indicator_of_arbitration_in_1991_1992
    * there are a large number of variables; how do we know which belong in the model and which do not?
    * warning 2 in the printed output below states that there may be strong multicollinearity (high correlation between covariates)

In [None]:
result = model.fit()

In [None]:
print(result.summary())
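
One way to put a number on the multicollinearity that the warning hints at is the variance inflation factor (VIF): regress each covariate on all the others and compute 1 / (1 - R-squared). The sketch below is not part of the original notebook. It uses plain NumPy and synthetic data, and the arrays `at_bats`, `hits`, and `stolen_bases` as well as the helper `variance_inflation_factors` are hypothetical names chosen for illustration, not columns or functions taken from `df`.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (2-D array without a constant column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic stand-in data (an assumption, not the baseball data set).
rng = np.random.default_rng(0)
at_bats = rng.normal(400, 100, size=300)
hits = 0.27 * at_bats + rng.normal(0, 5, size=300)  # nearly a multiple of at_bats
stolen_bases = rng.normal(10, 5, size=300)          # unrelated to the other two
X_demo = np.column_stack([at_bats, hits, stolen_bases])

vif_demo = variance_inflation_factors(X_demo)
print(dict(zip(["at_bats", "hits", "stolen_bases"], vif_demo.round(1))))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a covariate is largely explained by the others; here the two correlated columns should land well above that range and the unrelated column close to 1.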

## Look at Correlations in the Covariates

Next we inspect the pairwise correlations among the covariates.

Chaining `.style.background_gradient(cmap='viridis')` onto the end of the `corr()` call colors the correlation matrix, which helps us spot highly correlated pairs.

Notes:
    * there are high correlations between salary and a number of variables that are not significant in the first model
    * there is low correlation between salary and a number of the variables that appear significant
    * some of the variables that are highly correlated with salary, yet do not show up as significant, are also highly correlated with one another
    * these correlations between the covariates may be causing us issues
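
A colored matrix is good for eyeballing, but with many covariates it can also help to list the offending pairs explicitly. The following is a minimal sketch under stated assumptions: it uses NumPy on synthetic data instead of the notebook's `df`, the helper `high_correlation_pairs` and the 0.8 threshold are illustrative choices, and the column names are borrowed from the notes above only as labels.

```python
import numpy as np

def high_correlation_pairs(X, names, threshold=0.8):
    """List (name_a, name_b, r) for column pairs with |r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    # Strongest correlations first.
    return sorted(pairs, key=lambda t: -abs(t[2]))

# Synthetic stand-in data (an assumption, not the baseball data set).
rng = np.random.default_rng(1)
hits = rng.normal(100, 30, size=200)
runs = 0.5 * hits + rng.normal(0, 5, size=200)  # built to track hits closely
stolen = rng.normal(10, 5, size=200)            # independent of the others
names_demo = ["number_of_hits", "number_of_runs", "number_of_stolen_bases"]

pairs_demo = high_correlation_pairs(np.column_stack([hits, runs, stolen]), names_demo)
for a, b, r in pairs_demo:
    print(f"{a} vs {b}: r = {r:.2f}")
```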

## Managing Correlations in Covariates

###### Create and Fit a PCA Object

###### Review the Components

If the model is meant to be interpreted and not just provide predictions, then we need to provide an interpretation of the variables.

Below we review the PCs and see:
    - the `number_of_*` variables, which count things that happen while batting, all load heavily on PC1
    - on PC2, number_of_hits and number_of_runs load heavily positive, while home_runs, rbis, walks, and strike_outs load heavily negative

Our interpretation for the first two PCs is as follows:
    - PC1 describes number of at bats
        * as noted above, the variance explained here comes from the variables that count things that happen while at bat
    - PC2 describes at bat efficacy
        * here the major loadings are positive number_of_hits and negative number_of_strike_outs (the loadings are "competing")

In [None]:
# components_ has shape (n_components, n_features); transpose so each row is a variable
pd.DataFrame(pca.components_.T,
             columns=['PC' + str(i + 1) for i in range(pca.components_.shape[0])],
             index=df.iloc[:, 1:].columns)
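
To make the layout of a loadings table concrete, here is a small sketch that computes principal components directly from the singular value decomposition of the centered data. It is an illustration only: it uses NumPy and synthetic data rather than the notebook's `pca` object, and the variable names are hypothetical labels. The rows of `Vt` play the role of the component vectors (one row per component), so the transpose has one row per variable and one column per PC.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the baseball data set).
rng = np.random.default_rng(2)
n = 250
at_bats = rng.normal(400, 100, size=n)
X = np.column_stack([
    0.27 * at_bats + rng.normal(0, 8, size=n),  # hits, driven by at_bats
    0.13 * at_bats + rng.normal(0, 6, size=n),  # runs, driven by at_bats
    rng.normal(10, 3, size=n),                  # stolen bases, unrelated
])
names = ["number_of_hits", "number_of_runs", "number_of_stolen_bases"]

# Center the columns, then take the SVD; rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:2]       # keep two components: shape (2, 3)
loadings = components.T   # shape (3, 2): rows are variables, columns are PCs
pc_labels = ["PC" + str(i + 1) for i in range(components.shape[0])]

print(pc_labels)
for name, row in zip(names, loadings):
    print(f"{name:>24}: " + "  ".join(f"{v:+.2f}" for v in row))
```

Because the two count-like columns share a common driver and have much larger variance than the third, they should dominate the first component, while the unrelated column contributes almost nothing to it. The sign of each component is arbitrary, so only the relative signs within a column are meaningful.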

###### Review the Scree Plot
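
A scree plot shows the share of total variance explained by each component, in order. The sketch below computes those shares from the singular values of the centered data; it is a hedged illustration using NumPy and synthetic data, and the helper name `explained_variance_ratio` is an assumption made for this example rather than code from the notebook.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    variances = s ** 2 / (X.shape[0] - 1)     # variance along each component
    return variances / variances.sum()

# Synthetic stand-in data (an assumption, not the baseball data set).
rng = np.random.default_rng(3)
n = 250
at_bats = rng.normal(400, 100, size=n)
X_scree = np.column_stack([
    0.27 * at_bats + rng.normal(0, 8, size=n),
    0.13 * at_bats + rng.normal(0, 6, size=n),
    rng.normal(10, 3, size=n),
])

ratio = explained_variance_ratio(X_scree)
cumulative = np.cumsum(ratio)
for i, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{i}: {r:.3f} of variance, {c:.3f} cumulative")
```

Plotting `ratio` against the component number gives the scree plot; the point where the curve flattens is a common guide for how many components to keep.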

###### Get the Transformed Data

In [None]:
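# project the covariates onto the principal components
transformed = pca.fit_transform(df.iloc[:, 1:])
# keep only the scores for the first two components
reduced = transformed[:, :2]
data = pd.DataFrame(reduced, columns=['p1', 'p2'])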

###### Fit Regression on the Transformed Data

In [None]:
# reset the target's index so that it aligns with the fresh 0..n-1 index of `data`
model = sms.OLS(df.salary_in_thousands_of_dollars.reset_index(drop=True), sms.add_constant(data))

In [None]:
results = model.fit()

In [None]:
print(results.summary())
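
The reason this second regression avoids the multicollinearity warning is that principal component scores are uncorrelated with one another by construction. The sketch below walks through the same steps (project, keep two components, add a constant, fit by least squares) using only NumPy. Everything in it is an assumption made for illustration: the data are synthetic, and the `salary` relationship is invented, not estimated from the baseball data.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the baseball data set).
rng = np.random.default_rng(4)
n = 300
at_bats = rng.normal(400, 100, size=n)
hits = 0.27 * at_bats + rng.normal(0, 8, size=n)
runs = 0.13 * at_bats + rng.normal(0, 6, size=n)
stolen = rng.normal(10, 3, size=n)
X = np.column_stack([hits, runs, stolen])
salary = 2.0 * hits + 3.0 * runs + rng.normal(0, 20, size=n)  # invented target

# Project the centered covariates onto the first two principal axes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T

# Ordinary least squares on a constant plus the two score columns.
design = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(design, salary, rcond=None)
fitted = design @ coef
r2 = 1.0 - ((salary - fitted) ** 2).sum() / ((salary - salary.mean()) ** 2).sum()

# The regressors are uncorrelated, unlike the original covariates.
score_corr = np.corrcoef(scores, rowvar=False)[0, 1]
raw_corr = np.corrcoef(hits, runs)[0, 1]
print(f"corr(hits, runs) = {raw_corr:.2f}, corr(PC1, PC2) = {score_corr:.1e}, R^2 = {r2:.2f}")
```

The trade-off is interpretability: each coefficient now describes a component rather than a single statistic, which is why the interpretation of the loadings matters.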