# S&P 500 P/E Analysis - Regression Modeling - Model Selection Using L1  

### Credit:
    Data for this exercise is based on the S&P 500 companies fundamental data provided by Dominik Gawlik at
    https://www.kaggle.com/dgawlik/nyse

## Learning Objectives

Run a basic regression using L1 regularization to conduct variable selection:

* Run a simple OLS
* Analyze the output
* Run an L1 regression
* Interpret the model parameters

## Imports

In [1]:
import pandas as pd
import statsmodels.api as sms
import sklearn.linear_model as lm

## Get Data and Subset Data

In [2]:
# Import data from the csv file
df = pd.read_csv('data/relative_valuation.csv')

# subset the dataframe removing rows with NULL values
bix = df.notnull().all(axis=1)
df = df[bix]
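
The row filter above keeps only rows with no missing values. As a minimal sketch on a made-up toy frame (the `toy` data below is hypothetical and only stands in for `relative_valuation.csv`), the same result can be obtained with `DataFrame.dropna()`, and comparing lengths shows how many rows were discarded:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for relative_valuation.csv
toy = pd.DataFrame({
    'p/e': [15.0, np.nan, 22.0, 18.0],
    'AfterTaxROE': [12.0, 9.0, np.nan, 20.0],
})

# Boolean-mask approach, as used in the cell above
mask = toy.notnull().all(axis=1)
by_mask = toy[mask]

# Equivalent one-liner: drop every row containing at least one missing value
by_dropna = toy.dropna()

print(f'{len(toy)} rows before, {len(by_dropna)} rows after dropping nulls')
```

Checking the row count before and after is a cheap way to see how much of the sample the filter removes.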

## A First Regression Model Using All Variables

In [3]:
columns = [ 'Pre-TaxROE',
            'AfterTaxROE',
            'CashRatio', 
           'QuickRatio',
           'OperatingMargin',
           'Pre-TaxMargin', 
           'profit_margin',
           'operating_cash_flow_margin',
           'debt_to_equity', 
           'debt_to_asset', 
           'capital_surplus_to_asset',
           'Goodwill_to_asset',
          ]

In [4]:
model = sms.OLS(df['p/e'], df[columns])

In [5]:
result = model.fit()

Notes:

* Adjusted R-squared of 0.778 indicates a reasonably good model fit

* Several variables are statistically significant
  * Pre-TaxROE
  * AfterTaxROE
  * OperatingMargin
  * Pre-TaxMargin
  * profit_margin
  * capital_surplus_to_asset
  * Goodwill_to_asset


* Warning signs
  * The summary output warns of strong multicollinearity
  * Some regression coefficients defy common sense
    * The AfterTaxROE coefficient (statistically significant) is negative (-2.5335), implying a higher valuation for firms with lower ROE.
    * The Pre-TaxMargin coefficient (statistically significant) is negative (-2.4156), implying a higher valuation for firms with lower margins.
  
  
The model contains a large number of collinear variables. How do we select the ones that are useful yet uncorrelated?

In [6]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                    p/e   R-squared:                       0.796
Model:                            OLS   Adj. R-squared:                  0.778
Method:                 Least Squares   F-statistic:                     43.67
Date:                Fri, 24 Mar 2017   Prob (F-statistic):           2.25e-40
Time:                        13:26:28   Log-Likelihood:                -562.01
No. Observations:                 146   AIC:                             1148.
Df Residuals:                     134   BIC:                             1184.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Pre-TaxROE          

### Variable selection using L1 regression

Notes:

A good model should be no more complex than necessary.

* Fit an L1-penalized (lasso) model along a path of penalty strengths
* As the penalty grows, an increasing number of coefficients are forced to exactly zero
* Choose the model with the minimal Bayesian information criterion (BIC)
* The variables with nonzero coefficients are the selected variables
* Finally, refit the "best" model and interpret its coefficients
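
For reference, the procedure can be written out in the usual textbook notation. The symbols are assumptions introduced here, not taken from the notebook: $y$ is the vector of P/E values, $X$ the matrix of ratios, $n$ the number of observations, $\alpha \ge 0$ the penalty strength, $\hat{L}(\alpha)$ the maximized likelihood of the fitted model, and $k(\alpha)$ the number of nonzero coefficients. The lasso estimate for a given penalty is

$$
\hat{\beta}(\alpha) = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1
$$

and the penalty is chosen to minimize

$$
\mathrm{BIC}(\alpha) = -2 \log \hat{L}(\alpha) + k(\alpha) \log n
$$

The first term rewards fit and the second charges $\log n$ for every variable kept, which is why BIC tends to favor sparse models.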

In [7]:
from sklearn import linear_model

In [8]:
reg = linear_model.LassoLarsIC(normalize=True, criterion='bic')

In [9]:
reg.fit(df[columns], df['p/e'])

LassoLarsIC(copy_X=True, criterion='bic', eps=2.2204460492503131e-16,
      fit_intercept=True, max_iter=500, normalize=True, positive=False,
      precompute='auto', verbose=False)

In [10]:
reg.coef_

array([ 0.        , -0.36885292,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        , -0.0952389 ,  0.        ,  0.        ,
        0.05578808,  0.14554785])
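
Later scikit-learn releases deprecated and then removed the `normalize` argument shown above, so the call may fail on a current installation. A minimal sketch of one alternative, on synthetic data, is to standardize the features in a pipeline before `LassoLarsIC`. This is an assumption about how one might adapt the cell, and `StandardScaler` is not numerically identical to the old `normalize=True`, so the selected variables could differ.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only the first two of eight features matter
rng = np.random.default_rng(2)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Standardize inside the pipeline instead of passing normalize=True
pipe = make_pipeline(StandardScaler(), LassoLarsIC(criterion='bic'))
pipe.fit(X, y)

# make_pipeline names each step after its lowercased class name
lasso = pipe.named_steps['lassolarsic']
selected = np.flatnonzero(lasso.coef_)

print('chosen alpha:', lasso.alpha_)
print('selected feature indices:', selected)
print('BIC along the path:', lasso.criterion_.round(1))
```

The fitted estimator's `criterion_` attribute holds the BIC at each step of the path, and `alpha_` is the penalty at which it is smallest.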

#### Select non-zero variables

Of the twelve variables, the four selected ones explain the data nearly as well as all twelve combined (adjusted R-squared of 0.705 versus 0.778).

In [11]:
subset_columns = [col for col, coef in zip(columns, reg.coef_) if coef != 0]
subset_columns

['AfterTaxROE',
 'operating_cash_flow_margin',
 'capital_surplus_to_asset',
 'Goodwill_to_asset']

### Run regression on the "simple" model

#### Build a new regression model using the selected variables

In [12]:
model = sms.OLS(df['p/e'], df[subset_columns])

In [13]:
result = model.fit()      

Notes:

* Adjusted R-squared is 0.705

* The coefficient for AfterTaxROE is positive, indicating that firms with higher ROE are valued more highly than firms with lower ROE. Note that this coefficient was negative in our "all-variable" model.

* operating_cash_flow_margin, which was insignificant in our previous model, is significant now.


Interpretation:

Holding all else equal,

* a 1% increase in AfterTaxROE increases the P/E by 0.4343
* a 1% increase in operating cash flow margin increases the P/E by 0.1946
* a 1% increase in capital surplus relative to assets increases the P/E by 0.2852
* a 1% increase in goodwill relative to assets increases the P/E by 0.3120
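
Because the model is linear, the effects listed above add up. The sketch below copies the four coefficients from the interpretation and applies them to a hypothetical scenario. The `deltas` values are made up for illustration and are expressed in the same units as the data.

```python
# Coefficients copied from the interpretation above
coefs = {
    'AfterTaxROE': 0.4343,
    'operating_cash_flow_margin': 0.1946,
    'capital_surplus_to_asset': 0.2852,
    'Goodwill_to_asset': 0.3120,
}

# Hypothetical scenario: ROE rises by 2 units, goodwill-to-asset falls by 1 unit
deltas = {'AfterTaxROE': 2.0, 'Goodwill_to_asset': -1.0}

# Linear model: total change is the sum of coefficient * change
change_in_pe = sum(coefs[name] * delta for name, delta in deltas.items())
print(f'Predicted change in P/E: {change_in_pe:+.4f}')
```

Here the predicted change is 2 × 0.4343 − 1 × 0.3120 = 0.5566, holding the other two ratios fixed.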

In [14]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                    p/e   R-squared:                       0.713
Model:                            OLS   Adj. R-squared:                  0.705
Method:                 Least Squares   F-statistic:                     88.13
Date:                Fri, 24 Mar 2017   Prob (F-statistic):           1.73e-37
Time:                        13:26:28   Log-Likelihood:                -587.10
No. Observations:                 146   AIC:                             1182.
Df Residuals:                     142   BIC:                             1194.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
AfterTaxROE         