# S&P 500 P/E Analysis - Regression Modeling - Model Selection Using L1  

### Credit:
    Data for this exercise is based on the S&P 500 companies fundamental data provided by Dominik Gawlik at
    https://www.kaggle.com/dgawlik/nyse

## Learning Objectives

Run a basic regression using L1 regularization to conduct variable selection
    * run a simple OLS
    * analyze output
    * run L1 regression
    * interpret model parameters

## Imports

## Get Data and Subset Data

In [None]:
# Import data from the csv file


In [None]:
# subset the dataframe removing rows with NULL values
bix = df.notnull().all(axis=1)
df = df[bix]

## A First Regression Model Using All Variables

In [None]:
df.describe()

In [None]:
columns = [
    'Pre-TaxROE',
    'AfterTaxROE',
    'CashRatio',
    'QuickRatio',
    'OperatingMargin',
    'Pre-TaxMargin',
    'profit_margin',
    'operating_cash_flow_margin',
    'debt_to_equity',
    'debt_to_asset',
    'capital_surplus_to_asset',
    'Goodwill_to_asset',
]

#### Specify the OLS Model

#### Fit the model

Notes:

* An adjusted R-squared of 0.778 indicates a reasonably good model fit

* Several variables are statistically significant
  * Pre-TaxROE
  * AfterTaxROE
  * OperatingMargin
  * Pre-TaxMargin
  * profit_margin
  * capital_surplus_to_asset
  * Goodwill_to_asset


* Warning signs
  * The model summary warns of strong multicollinearity
  * Some regression coefficients defy common sense
    * The AfterTaxROE coefficient (statistically significant) is negative (-2.5335), implying a higher valuation for firms with lower ROE
    * The Pre-TaxMargin coefficient (statistically significant) is negative (-2.4156), implying a higher valuation for firms with lower margins


Many of the variables are collinear. How do we select ones that are useful yet uncorrelated?
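
One common way to quantify the multicollinearity warning is the variance inflation factor (VIF). The sketch below uses synthetic data rather than the S&P 500 data; the names `pre_tax`, `after_tax`, `cash_ratio`, and the helper `vif` are hypothetical stand-ins and are not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: two nearly redundant ratios and one unrelated ratio.
pre_tax = rng.normal(size=n)
after_tax = 0.95 * pre_tax + 0.05 * rng.normal(size=n)  # almost a copy of pre_tax
cash_ratio = rng.normal(size=n)
X = np.column_stack([pre_tax, after_tax, cash_ratio])
names = ["pre_tax", "after_tax", "cash_ratio"]


def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2), where R^2 comes
    from regressing that column on the remaining columns plus an intercept."""
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


vifs = vif(X)
corr = np.corrcoef(X, rowvar=False)

for name, value in zip(names, vifs):
    print(f"{name}: VIF = {value:.1f}")
```

A VIF well above roughly 10 is a widely used rule of thumb for problematic collinearity. Here the two related columns should show very large values, while the unrelated column stays close to 1.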

#### Print the summary of the model results

### Variable selection using L1 regression

Notes:
    A good model should be no more complex than necessary.
        * fit an L1-penalized model; as the penalty grows, it forces an increasing number of the parameters to exactly zero
        * choose the model with the minimal Bayesian information criterion (BIC)
        * the variables with nonzero coefficients are the selected variables
        * finally, refit the "best" model on the selected variables and interpret the coefficients
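
The steps above can be sketched on synthetic data. This example assumes scikit-learn's `LassoLarsIC` with `criterion="bic"`, which is one estimator that exposes `coef_` and picks the penalty by BIC. Whether the notebook uses this exact class is an assumption, and the data and `true_coef` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(42)
n, p = 500, 10

# Synthetic data: only the first three of ten predictors drive the response.
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(size=n)

# Fit the lasso path and keep the penalty with the smallest BIC.
# (Assumption: the notebook's `reg` is an estimator of this kind.)
reg = LassoLarsIC(criterion="bic")
reg.fit(X, y)

# Predictors whose coefficients were not shrunk to zero are "selected".
selected = [j for j, c in enumerate(reg.coef_) if c != 0]
print("chosen alpha:", reg.alpha_)
print("selected column indices:", selected)
```

Because the lasso shrinks the coefficients it keeps, the selected variables are usually refit with plain OLS before interpretation, which is the approach the notes describe.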

#### Specify L1 Regression Model

#### Fit the L1 Regression Model

#### View the model coefficients

#### Select non-zero variables

Of the twelve variables, the four selected variables explain the data nearly as well as all twelve combined.

In [None]:
# keep the columns whose L1 coefficient was not shrunk to zero
subset_columns = [col for col, coef in zip(columns, reg.coef_) if abs(coef) > 0]
subset_columns

### Run regression on the "simple" model

#### Build a new regression model using the selected variables

In [None]:
model = sms.OLS(df['p/e'], df[subset_columns])

In [None]:
result = model.fit()      

Notes:
    * Adjusted R-squared is 0.705

    * The coefficient for AfterTaxROE is positive, indicating that firms with higher ROE are valued more highly than firms with lower ROE. Note that this coefficient was negative in our "all-variable" model.

    * operating_cash_flow_margin, which was insignificant in our previous model, is now significant.
    
    
Interpretation:
    
    Holding all else equal,
    * a 1% increase in AfterTaxROE increases the P/E by 0.4343
    * a 1% increase in operating cash flow margin increases the P/E by 0.1946
    * a 1% increase in capital surplus relative to assets increases the P/E by 0.2852
    * a 1% increase in goodwill relative to assets increases the P/E by 0.3120
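
The interpretation above is plain linear arithmetic: the change in predicted P/E is each coefficient times the change in its variable, summed over the variables that move. The sketch below uses the four coefficients reported in the notes. The helper `pe_change` is hypothetical, and changes are assumed to be in the same units as the regression inputs.

```python
# Coefficients of the refit model, as reported in the notes.
coef = {
    "AfterTaxROE": 0.4343,
    "operating_cash_flow_margin": 0.1946,
    "capital_surplus_to_asset": 0.2852,
    "Goodwill_to_asset": 0.3120,
}


def pe_change(deltas):
    """Change in predicted P/E for the given changes in predictors,
    holding every predictor not listed in `deltas` fixed."""
    return sum(coef[name] * delta for name, delta in deltas.items())


# A two-unit rise in AfterTaxROE alone.
single = pe_change({"AfterTaxROE": 2.0})

# ROE rises by one unit while goodwill-to-asset falls by one unit.
combined = pe_change({"AfterTaxROE": 1.0, "Goodwill_to_asset": -1.0})

print(round(single, 4), round(combined, 4))
```

The effects add because the model is linear with no interaction terms, so offsetting moves in two variables partly cancel.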

In [None]:
print(result.summary())