## Train-Test

### Model Overfit
As I mentioned in an earlier exercise, the R^2 in regression models can always be improved by adding more predictors.  This effect occurs because you are adding information to the prediction of your dependent variable.  Note that the information may not necessarily be RELEVANT or SIGNIFICANT: it is just information.

In general, the more predictors you add, the better your model will predict that dataset.  HOWEVER, the model becomes more tied to that PARTICULAR dataset.  The model has been trained "too well". It may be very accurate on your sample, but it may not generalize well _beyond_ that dataset. Overfitted models explain not only the dependent variable, but also capture the noise (or error) in the data. Since noise tends to be specific to that dataset, that capability is a bad thing.
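To see the effect described above, here is a minimal sketch on made-up data (not the Newark dataset). It assumes one real predictor plus up to 20 pure-noise predictors, fits ordinary least squares with numpy, and reports R^2 on the rows used for fitting and on rows held back. The helper name `r_squared` and all the numbers are illustrative assumptions.

```python
import numpy as np

def r_squared(X, y, beta):
    """R^2 of the predictions X @ beta against y."""
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 60                                   # small sample, as in this exercise
x = rng.normal(size=2 * n)
y = 3.0 * x + rng.normal(size=2 * n)     # one real predictor plus random error
noise = rng.normal(size=(2 * n, 20))     # 20 predictors unrelated to y

train, test = slice(0, n), slice(n, 2 * n)
counts = list(range(0, 21, 5))           # add 0, 5, 10, 15, 20 noise columns
train_r2, test_r2 = [], []
for k in counts:
    X = np.column_stack([np.ones(2 * n), x, noise[:, :k]])
    # Fit on the training rows only
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    train_r2.append(r_squared(X[train], y[train], beta))
    test_r2.append(r_squared(X[test], y[test], beta))

for k, tr, te in zip(counts, train_r2, test_r2):
    print(f"{k:2d} noise predictors: train R^2 = {tr:.3f}, test R^2 = {te:.3f}")
```

The training R^2 can only go up as columns are added, even though the added columns carry no real information. The held-out R^2 has no such guarantee, which is why it is the more honest gauge.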

![Model Fit](model_fit.png)

To gauge model overfitting, a common practice is to separate our data into a train and a test dataset.  We then use the train dataset to develop the model, and then use the test dataset to gauge its generalizability.

I should also mention model _underfitting_, which is generally less of a concern: it simply means your model does not fit the training set well. In the above example, the model does not capture the nonlinear trend in the data, so its fit will be poor, especially for lower and higher values.

__One important note before we continue__: I am using the Newark real estate dataset again, mainly because you know it well.  The rule of thumb is about 30-50 observations per predictor (or X variable).  That means a model with four predictors needs 120-200 rows of data.

But there is another problem: the rule of thumb applies to the rows the model is actually fit on.  If we keep 70% of the dataset for training, the training set alone needs 120-200 rows, so the full dataset needs roughly 170-285 rows (120/0.7 to 200/0.7)!

With 166 rows in the Newark dataset (fewer if we drop the rows with missing values), we barely have enough data for 4 predictors/features.  We should therefore be careful of model overfit, which we explore next.
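The arithmetic above can be redone for any split. The sketch below assumes the training portion alone has to meet the rule of thumb, so its totals may differ a little from a quick mental estimate. The function name `rows_needed` is made up for this illustration.

```python
import math

def rows_needed(n_predictors, obs_per_predictor, train_fraction):
    """Total rows so that the training portion alone meets the rule of thumb."""
    train_rows = n_predictors * obs_per_predictor
    return math.ceil(train_rows / train_fraction)

# Four predictors, 30-50 observations per predictor, two possible splits
for frac in (0.7, 0.5):
    low = rows_needed(4, 30, frac)
    high = rows_needed(4, 50, frac)
    print(f"train fraction {frac}: {low}-{high} rows in total")
```

Either way, the requirement is well above what the Newark data can supply.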

In [22]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [23]:
df = pd.read_csv('OldNewark.csv')
temp = df.loc[df['SQFT'] < 4000].copy() # remove outliers
df2 = temp.dropna().copy()

In [24]:
df2.columns

Index(['ID', 'STREETNO', 'STREET', 'ZESTIMATE', 'SQFT', 'BEDR', 'BATHR',
       'YRBUILT', 'LOTSIZE', 'SOLDFOR', 'YRSOLD'],
      dtype='object')

In [25]:
df2.shape

(118, 11)

### NOTE
Since we have done a lot of swimming in the Newark dataset, I am skipping it for now.  As you have learned, however, swimming is an important part of any analysis!

In [26]:
df2['FACEWEST'] = df2['STREETNO'] % 2
df2['ORCHARD'] = 0
df2.loc[df2['STREET'] == 'Orchard', 'ORCHARD'] = 1
df2['AGE'] = 2019 - df2['YRBUILT']

lotsqft = df2['LOTSIZE'] * 43560 # convert acres to sqft (1 acre = 43,560 sqft)
df2['COVERAGE'] = (df2['SQFT']/2) / lotsqft # assume two stories. Half of house is on the ground

In [38]:
# split data into X and Y dataframes
X = df2[['SQFT', 'FACEWEST', 'AGE','COVERAGE']].copy()
Y = df2['ZESTIMATE'].copy()

## Splitting the data into Train and Test
You will see the following code for the rest of the semester, since testing for model overfit is a fundamental requirement when evaluating any ML model.  The following code splits the data into four objects--X and Y for training and X and Y for testing.

Fortunately, sklearn provides an easy function that randomly selects rows from our dataframe.  The random_state argument allows you to replicate the results--that is, setting the number to 2 ensures you will get the same results as me when you run the analysis.

In [39]:
# Split the dataset into the Training set and Test set
# test_size reflects the % to use in the test set.  Usually set to around 0.2, but we
# have a small sample
# To get the same results every time, set random_state to the same value (here, 2).

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, \
        test_size = 0.5, random_state = 2)
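As a quick check of what `random_state` buys you, the sketch below splits a tiny made-up dataset twice with the same seed and once with a different one. The names `X_demo` and `y_demo` and their values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real X and Y (values are made up)
X_demo = pd.DataFrame({"SQFT": np.arange(1000, 2000, 100)})
y_demo = pd.Series(np.arange(10) * 10_000, name="ZESTIMATE")

# Same random_state -> identical split on every run
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.5, random_state=2)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.5, random_state=2)
print(a_test.index.equals(b_test.index))

# A different random_state will usually select different rows
c_train, c_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.5, random_state=3)
print(sorted(a_test.index), sorted(c_test.index))

# Train and test never share rows
print(set(a_train.index) & set(a_test.index))
```

Because the split is random, results on a small dataset can change noticeably from one seed to another, so it is worth trying a few values before trusting any single comparison.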

## Training the Model

In [40]:
# Run regression using statsmodels
import statsmodels.api as sm
import math
from sklearn.metrics import mean_squared_error

X_train = sm.add_constant(X_train) # required if constant expected
est = sm.OLS(y_train,X_train).fit() # fit model
predictions = est.predict() # get predicted values
print(est.summary()) # prints full regression results
print("\nAverage error: {:.2f}.".format(math.sqrt(est.mse_resid)))

                            OLS Regression Results                            
Dep. Variable:              ZESTIMATE   R-squared:                       0.471
Model:                            OLS   Adj. R-squared:                  0.432
Method:                 Least Squares   F-statistic:                     12.02
Date:                Wed, 11 Nov 2020   Prob (F-statistic):           4.70e-07
Time:                        14:19:38   Log-Likelihood:                -706.03
No. Observations:                  59   AIC:                             1422.
Df Residuals:                      54   BIC:                             1432.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       7.666e+04   3.82e+04      2.005      0.0

## Testing the Model
The process for testing the model mirrors training the model: we fit the same specification to the test data, then compare the test results to the train results.  The following parameters should be comparable:
* R^2 and adj. R^2, especially the difference between the two.  
* Regression coefficients (especially be concerned if the sign flips or significant predictors become insignificant...or vice versa)
* Error levels, including the final model error
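The comparison in the list above can be lined up side by side. The following is a rough sketch on synthetic data (not the Newark file): it fits the same specification to each half with numpy and prints R^2, the residual standard error, and a flag whenever a coefficient changes sign. The helper `fit_ols`, the column names, and the data are assumptions for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Return coefficients, R^2, and residual standard error (X includes a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ssr = resid @ resid
    r2 = 1.0 - ssr / np.sum((y - y.mean()) ** 2)
    # Square root of SSR / residual degrees of freedom,
    # the same quantity as sqrt(est.mse_resid) in statsmodels
    rse = np.sqrt(ssr / (len(y) - X.shape[1]))
    return beta, r2, rse

# Synthetic stand-in data (made up): 118 rows, 3 predictors, the last one irrelevant
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(118), rng.normal(size=(118, 3))])
y = X @ np.array([100.0, 50.0, -20.0, 0.0]) + rng.normal(scale=40.0, size=118)

idx = rng.permutation(118)
train, test = idx[:59], idx[59:]
b_tr, r2_tr, rse_tr = fit_ols(X[train], y[train])
b_te, r2_te, rse_te = fit_ols(X[test], y[test])

print(f"R^2:   train {r2_tr:.3f}  test {r2_te:.3f}")
print(f"Error: train {rse_tr:.1f}  test {rse_te:.1f}")
for name, tr, te in zip(["const", "x1", "x2", "x3"], b_tr, b_te):
    flag = "  <-- sign flips" if np.sign(tr) != np.sign(te) else ""
    print(f"{name:>5}: train {tr:8.2f}  test {te:8.2f}{flag}")
```

A sign flip on a predictor that has no real effect (like `x3` here) is not alarming by itself; a flip on a predictor you believe matters is.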

In [41]:
# Run regression using statsmodels
import statsmodels.api as sm

X_test = sm.add_constant(X_test) # required if constant expected
est = sm.OLS(y_test,X_test).fit() # fit model
predictions = est.predict() # get predicted values
print(est.summary()) # prints full regression results
print("\nAverage error: {:.2f}.".format(math.sqrt(est.mse_resid)))

                            OLS Regression Results                            
Dep. Variable:              ZESTIMATE   R-squared:                       0.402
Model:                            OLS   Adj. R-squared:                  0.357
Method:                 Least Squares   F-statistic:                     9.060
Date:                Wed, 11 Nov 2020   Prob (F-statistic):           1.13e-05
Time:                        14:19:39   Log-Likelihood:                -720.38
No. Observations:                  59   AIC:                             1451.
Df Residuals:                      54   BIC:                             1461.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       9.923e+04   4.18e+04      2.372      0.0

### Problems and more Problems
If you compare the train results with the test results, you will see big differences in the model's explanation and prediction metrics.  So what is going on?

Recall that I wrote that we were pushing the model a bit, given our limited sample size.  The model overfit may be a result of pushing this model beyond where we should.  With so little data, overfitting and coefficient instability aren't a surprise.

Notice the warnings under the model.  Take a look especially at warning 2.  This indicates that there may also be another issue--multicollinearity--which I will visit next.

### Multicollinearity
In general, you want your target column to be correlated with your features (correlation is, after all, the basis of regression). Multicollinearity, however, refers to high correlations among a model's *features*.  Correlation among features means the model contains duplicate (or redundant) information.  When the same information is entered multiple times, the model cannot cleanly divide the credit among the overlapping features, so the coefficient estimates become unstable.  Multicollinearity does not affect R^2 values, but it can greatly distort the values of your coefficients.
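A small simulation makes the point concrete. This is a sketch on made-up data, not the Newark model: it fits `y ~ x1 + x2` on many fresh samples, once with an independent `x2` and once with an `x2` that is almost a copy of `x1`, and tracks how much the coefficient of `x1` moves around. The function name `simulate` and every number in it are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(collinear, n=60, reps=500):
    """Fit y ~ x1 + x2 on many fresh samples; return x1's coefficients and the R^2 values."""
    coefs, r2s = [], []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        # collinear: x2 is almost a copy of x1; otherwise x2 is independent of x1
        x2 = x1 + 0.1 * rng.normal(size=n) if collinear else rng.normal(size=n)
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # true coefficients are both 1
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        coefs.append(beta[1])
        r2s.append(1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2))
    return np.array(coefs), np.array(r2s)

coef_ind, r2_ind = simulate(collinear=False)
coef_col, r2_col = simulate(collinear=True)

print(f"independent x2: coef of x1 ranges {coef_ind.min():.2f} to {coef_ind.max():.2f}")
print(f"collinear x2:   coef of x1 ranges {coef_col.min():.2f} to {coef_col.max():.2f}")
print(f"collinear fits where x1 gets the wrong sign: {(coef_col < 0).mean():.0%}")
print(f"mean R^2: independent {r2_ind.mean():.2f}, collinear {r2_col.mean():.2f}")
```

In the collinear case the overall fit stays healthy while the individual coefficient swings widely and sometimes takes the wrong sign, which is the pattern to watch for in your own output.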

If you look at the above model, there is a warning which may indicate multicollinearity (#2 at the bottom).  To diagnose the problem, let's run the VIF (variance inflation factor) scores.

In [42]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = est.model.exog # get model features
vif = pd.DataFrame() # create a dataframe
vif["VIF Factor"] = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
vif["features"] = X_test.columns
print('VIF: {}'.format(vif))

VIF:    VIF Factor  features
0   40.043273     const
1    1.901000      SQFT
2    1.041476  FACEWEST
3    1.006898       AGE
4    1.860831  COVERAGE


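In general, features with VIF values > 2 should be examined.  None of the features crosses that line here, but SQFT (1.90) and COVERAGE (1.86) both come close, while FACEWEST and AGE sit near 1.  (The large value for const can generally be ignored; the VIF of the intercept column is not interpreted.)  Does it make sense that COVERAGE and SQFT are correlated?  Let's see...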
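It may help to see where a VIF comes from. The sketch below computes it by hand with numpy, assuming the usual definition: regress each feature on all the others and take 1 / (1 - R^2). The function name `vif` and the data are made up; for the non-constant columns the results should agree with statsmodels when its design matrix includes a constant.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no constant column).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all the other columns plus an intercept.
    """
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Made-up data: 'coverage' is built from 'sqft', 'age' is unrelated to both
rng = np.random.default_rng(1)
sqft = rng.normal(1800, 400, size=200)
coverage = sqft / 20000 + rng.normal(0, 0.02, size=200)
age = rng.normal(80, 15, size=200)

vifs = vif(np.column_stack([sqft, coverage, age]))
for name, value in zip(["sqft", "coverage", "age"], vifs):
    print(f"{name:>8}: {value:.2f}")
```

A VIF of 1 means the other features explain none of that feature; the more of it they explain, the higher the VIF climbs.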

In [43]:
corr = X_test.corr()
corr.style.background_gradient()



          const      SQFT  FACEWEST       AGE  COVERAGE
const       NaN       NaN       NaN       NaN       NaN
SQFT        NaN  1.000000 -0.187469  0.043300  0.680068
FACEWEST    NaN -0.187469  1.000000 -0.076238 -0.122782
AGE         NaN  0.043300 -0.076238  1.000000  0.019655
COVERAGE    NaN  0.680068 -0.122782  0.019655  1.000000


You can ignore the nan entries: they occur because the const column is all ones.

The correlation between COVERAGE and SQFT is moderately high.  In hindsight, it makes sense that SQFT and COVERAGE would be correlated, since larger houses have greater coverage (the lot sizes in Old Newark are fairly similar)! Please note that it is appropriate to explore them both as predictors! Whether or not they offer unique information needs to be tested.  In general, however, features that have a correlation > 0.50 need to be examined carefully.
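For a model with more features, scanning the matrix by eye gets tedious. One possible way to list only the pairs above the 0.5 threshold is sketched below on made-up data; `X_demo` and its values are assumptions that only imitate the column names used here. Dropping the constant column first also avoids the NaN row and column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the design matrix, including the constant column
rng = np.random.default_rng(3)
sqft = rng.normal(1800, 400, size=200)
X_demo = pd.DataFrame({
    "const": 1.0,
    "SQFT": sqft,
    "FACEWEST": rng.integers(0, 2, size=200),
    "AGE": rng.normal(80, 15, size=200),
    "COVERAGE": sqft / 20000 + rng.normal(0, 0.01, size=200),
})

# Drop the constant first: a column with zero variance has no defined correlation
corr = X_demo.drop(columns="const").corr()

# Visit each pair once and keep those whose absolute correlation exceeds 0.5
cols = list(corr.columns)
flagged = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.5
]
print(flagged)
```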

Let's remove COVERAGE:

In [48]:
X = df2[['SQFT', 'FACEWEST', 'AGE']].copy() # remove COVERAGE
Y = df2['ZESTIMATE'].copy()

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, \
        test_size = 0.5, random_state = 2)

In [50]:
X_train = sm.add_constant(X_train) # required if constant expected
est = sm.OLS(y_train,X_train).fit() # fit model
predictions = est.predict() # get predicted values
print(est.summary()) # prints full regression results
print("\nAverage error: {:.2f}.".format(math.sqrt(est.mse_resid)))

                            OLS Regression Results                            
Dep. Variable:              ZESTIMATE   R-squared:                       0.463
Model:                            OLS   Adj. R-squared:                  0.434
Method:                 Least Squares   F-statistic:                     15.80
Date:                Wed, 11 Nov 2020   Prob (F-statistic):           1.57e-07
Time:                        14:20:42   Log-Likelihood:                -706.48
No. Observations:                  59   AIC:                             1421.
Df Residuals:                      55   BIC:                             1429.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       7.026e+04   3.75e+04      1.873      0.0

In [51]:
X_test = sm.add_constant(X_test) # required if constant expected
est = sm.OLS(y_test,X_test).fit() # fit model
predictions = est.predict() # get predicted values
print(est.summary()) # prints full regression results
print("\nAverage error: {:.2f}.".format(math.sqrt(est.mse_resid)))

                            OLS Regression Results                            
Dep. Variable:              ZESTIMATE   R-squared:                       0.381
Model:                            OLS   Adj. R-squared:                  0.347
Method:                 Least Squares   F-statistic:                     11.27
Date:                Wed, 11 Nov 2020   Prob (F-statistic):           7.20e-06
Time:                        14:21:14   Log-Likelihood:                -721.39
No. Observations:                  59   AIC:                             1451.
Df Residuals:                      55   BIC:                             1459.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       9.363e+04    4.2e+04      2.231      0.0

In [52]:
variables = est.model.exog # get model features
vif = pd.DataFrame() # create a dataframe
vif["VIF Factor"] = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
vif["features"] = X_train.columns
print('VIF: {}'.format(vif))

VIF:    VIF Factor  features
0   39.661607     const
1    1.037335      SQFT
2    1.041443  FACEWEST
3    1.006729       AGE


As you can see, multicollinearity is no longer a problem.  I leave it to you to run the test dataset.  

The error at the end of the regression output remains much higher for the test set than for the training set.  Remember that the concerns with sample size remain, and they are the likely cause of the warning that is still generated.

### What's the bottom line?
The purpose of separating a dataset into train and test sets is to see whether the models for the two different samples are comparable.  If the training model's fit is substantively better than the test model's, your regression model may be too complicated for the data.  This can happen when you have too many X's (or features) in your model.  Smaller sample sizes (especially ones less than 100) are susceptible to overfitting.

How do you tell if your model is overfitted (or has other diagnostic problems)?
* R^2 and error levels differ between the train and test sets.
* The regression coefficients differ, especially in the signs and/or the significance levels 
* If you have sufficient sample size, be aware of coefficients that are highly significant in one model, but less so in the other.
* The general rule of thumb is that you need 30 observations for each X variable in your model.  If you have three X's, that would mean you need about 100 observations.  BUT remember that you are splitting into a train and a test set! That means you need 200 observations!
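A complementary check, which the cells above do not use, is to keep the coefficients from the training fit and score them on the held-out rows. The sketch below does this with numpy on synthetic data; the helper `rmse_and_r2` and the data are assumptions. In statsmodels the equivalent would be calling `predict` on the training-fit results with the test design matrix.

```python
import numpy as np

# Synthetic stand-in data (made up): 118 rows, constant + 3 predictors
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(118), rng.normal(size=(118, 3))])
y = X @ np.array([100.0, 50.0, -20.0, 0.0]) + rng.normal(scale=40.0, size=118)

idx = rng.permutation(118)
train, test = idx[:59], idx[59:]

# Fit on the training rows only
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

def rmse_and_r2(X_part, y_part):
    """Score the training-fit coefficients on any set of rows."""
    resid = y_part - X_part @ beta
    rmse = np.sqrt(np.mean(resid ** 2))
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_part - y_part.mean()) ** 2)
    return rmse, r2

rmse_tr, r2_tr = rmse_and_r2(X[train], y[train])
rmse_te, r2_te = rmse_and_r2(X[test], y[test])
print(f"train: RMSE {rmse_tr:.1f}, R^2 {r2_tr:.3f}")
print(f"test:  RMSE {rmse_te:.1f}, R^2 {r2_te:.3f}")
```

A test RMSE far above the training RMSE is the same overfitting signal as in the list above, seen from the prediction side.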

### Summary of important points:
* In practice, the chance of overfitting a linear regression model is probably small, *if you have sufficient sample size*.  
* Always review the coefficient values and signs to ensure they make sense.  
* If you have a negative coefficient that should be positive (or vice versa), you may have two (or more) X variables that are highly correlated.  That will result in model instability and wacky values for your regression coefficients.  
* Always run a correlation matrix for your _X_ variables to confirm their correlations are less than 0.5. If some are higher than that, be sure to run the VIF stats to confirm the VIFs are less than 2.