In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

# Further Regression Considerations

- Collinearity
- Cleaning and Preparing Data
- Test Train Split for Assessment


### Collinearity

Collinearity is closely tied to the notion of independence among variables. Briefly, collinearity arises whenever there are strong relationships between the independent (predictor) variables. As we saw earlier, `newspaper` spending was related to spending on the other media. Collinearity can be detected by plotting the variables against one another, examining the correlation coefficients between variables, and calculating the Variance Inflation Factor (VIF) for each feature.
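To make the VIF idea concrete, here is a minimal sketch that computes it by hand with NumPy on synthetic data. The helper name `vif` and the toy columns `a`, `b`, and `c` are made up for illustration; `statsmodels` provides its own `variance_inflation_factor` function, which is what you would normally use in practice.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all of the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Synthetic stand-in data (not the credit or ads data)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)  # nearly a copy of a
c = rng.normal(size=200)                 # unrelated to a and b

vif_values = vif(np.column_stack([a, b, c]))
print(vif_values)
```

The two nearly identical columns should receive large values, while the unrelated column should stay close to 1. A common rule of thumb treats values above roughly 5 or 10 as a sign of problematic collinearity.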

In [None]:
credit = pd.read_csv('data/credit.csv')
ads = pd.read_csv('data/ads.csv', index_col = 'Unnamed: 0')

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(credit);

Note the relationships among `Limit`, `Rating`, and `Balance`. Both `Limit` and `Rating` seem to be related to `Balance`; however, they are also strongly related to one another. This is not to be confused with the relationship between `TV` and `radio` that we saw earlier. We can see the difference clearly by plotting each pair side by side.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
plt.figure(figsize = (10, 5))
plt.subplot(1, 2, 1)
plt.scatter(ads['TV'], ads['radio'], alpha = 0.3);
plt.title("Television and Radio")

plt.subplot(1, 2, 2)
plt.scatter(credit['Rating'], credit['Limit'], alpha = 0.3);
plt.title("Rating and Limit")
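The visual comparison above can also be summarized numerically. The sketch below lists every pair of columns whose absolute correlation exceeds a cutoff. The toy frame only imitates a `Limit`/`Rating` style relationship, and both the data and the 0.9 cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a table with two nearly redundant columns
rng = np.random.default_rng(1)
limit = rng.normal(5000, 2000, size=300)
toy = pd.DataFrame({
    "Limit": limit,
    "Rating": 0.07 * limit + rng.normal(0, 10, size=300),
    "Age": rng.normal(50, 15, size=300),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# 0.9 is an arbitrary cutoff; masked (NaN) entries never pass the comparison
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

The same steps could be tried on `credit.corr()` once any non-numeric columns have been set aside.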

### Collinearity Example

The `longley` dataset, available through the `statsmodels` datasets package, is another example of a highly collinear dataset. Here, we are interested in fitting a regression that predicts employment (the `Employed` column).

In [None]:
# statsmodels.api exposes the datasets module used in the next cell
import statsmodels.api as sm

In [None]:
longley = sm.datasets.get_rdataset('longley')

In [None]:
longley.data.head()

In [None]:
print(longley.__doc__)

In [None]:
long_data = longley.data

In [None]:
long_data.columns

In [None]:
long_data.head()

In [None]:
corr_mat = long_data.corr()

In [None]:
plt.figure()
sns.heatmap(corr_mat)

In [None]:
scatter_matrix(long_data);

### Problem

Return to the `Credit` dataset from the earlier example. Remove any features you believe are highly correlated and refit your model. Discuss how the performance changes.

### Feature Engineering and Cleaning


We now return to our housing example and consider how to use some of `scikit-learn`'s functionality to deal with missing values. We want to decide how to handle each column individually, using what we know about the data to inform these decisions. If observations are missing values, we can either exclude those observations or encode the missing values with a designated value (for example, a numeric code or an explicit category).
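The two options can be sketched on a tiny hypothetical frame. This example uses plain pandas rather than `scikit-learn` (whose `SimpleImputer` offers similar strategies). The column names only echo the housing data, and the choice of the median for the numeric column is an assumption, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Ames data
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, np.nan, "Pave"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
})

# Option 1: exclude observations that have any missing value
dropped = toy.dropna()

# Option 2: encode the missing values instead of dropping rows
filled = toy.copy()
filled["Alley"] = filled["Alley"].fillna("None")  # treat missing as "no alley"
filled["LotFrontage"] = filled["LotFrontage"].fillna(
    filled["LotFrontage"].median()
)

print(len(dropped), "rows kept after dropping")
print(filled)
```

Dropping is simple but discards half of this toy table, which is why encoding is often preferred when missingness itself carries meaning.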


In [None]:
ames = pd.read_csv('data/ames_housing.csv')

In [None]:
ames.head()

In [None]:
ames.info()

In [None]:
ames['Alley'].value_counts()

In [None]:
ames['Alley'] = ames['Alley'].fillna("None")

In [None]:
ames['Alley'].value_counts()

In [None]:
ames['FireplaceQu'].value_counts()

In [None]:
ames['FireplaceQu'] = ames['FireplaceQu'].fillna("None")

In [None]:
ames['MiscFeature'].value_counts()

In [None]:
ames['MoSold'].value_counts()

Note the existence of a number of ordinal features. We can encode these following the data dictionary: https://ww2.amstat.org/publications/jse/v19n3/decock/datadocumentation.txt
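As a small illustration of ordinal encoding, the sketch below maps quality labels to integers with `Series.map`. The series `ExampleQual` is hypothetical, and treating a missing label as 0 is an assumption that only makes sense when missing means the feature is absent.

```python
import pandas as pd

# Ordered quality scale, from poor to excellent
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Hypothetical column with one missing entry
toy = pd.Series(["TA", "Gd", None, "Ex", "Fa"], name="ExampleQual")

# Labels not found in the mapping (including missing ones) become NaN,
# which we then encode as 0
encoded = toy.map(quality_scale).fillna(0).astype(int)
print(encoded.tolist())
```

Unlike `replace`, `map` turns any label that is not in the dictionary into `NaN`, which makes unexpected categories easy to spot before they are filled.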

In [None]:
ames = ames.replace({"BsmtCond": {"No": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

In [None]:
ames['BsmtCond'].value_counts()

In [None]:
ames = ames.replace({"BsmtQual" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA": 3, "Gd" : 4, "Ex" : 5}})

In [None]:
ames['BsmtQual'].value_counts()

**PROBLEMS**

Continue to code a few more columns and make sure to replace any `na` values in at least:

- `OverallQual`
- `OverallCond`
- `GarageQual`
- `GarageCond`
- `PoolArea`
- `PoolQC`

### Adding New Features

We can create many new features to help improve our model's performance. For example, any of the measures that are described by multiple columns could be combined. Take `Overall`, `Garage`, and `Pool`, for example. We can create combinations of the subcolumns as follows.

In [None]:
ames['BasementOverall'] = ames['BsmtCond'] * ames['BsmtQual']

**PROBLEMS**


Continue to add additional features that combine other existing ones in a sensible way.  Here are a few additional ideas:

```python
ames['OverallGrade'] = ames['OverallQual'] * ames['OverallCond']
ames['GarageOverall'] = ames['GarageQual'] * ames['GarageCond']
ames['PoolOverall'] = ames['PoolArea'] * ames['PoolQC']
```

Be sure you've coded these as numeric vectors before creating columns based on arithmetic involving them.
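One way to check this is sketched below on a hypothetical two-column frame: inspect the dtypes, encode any text column, and only then multiply. The toy values and the quality scale are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame: one column still holds text labels
toy = pd.DataFrame({
    "GarageQual": ["TA", "Gd", "Fa"],
    "GarageCond": [3, 4, 2],
})
print(toy.dtypes)  # GarageQual is not numeric yet

scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
toy["GarageQual"] = toy["GarageQual"].map(scale)

# Confirm the column is numeric before doing arithmetic with it
assert pd.api.types.is_numeric_dtype(toy["GarageQual"])
toy["GarageOverall"] = toy["GarageQual"] * toy["GarageCond"]
print(toy)
```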

### Scikit-learn Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
ads['TVradio'] = ads['TV'] * ads['radio']

In [None]:
ads_X = ads.drop(['sales', 'newspaper'], axis = 1)

In [None]:
ads_label = ads['sales'].copy()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(ads_X, ads_label)

In [None]:
lm = LinearRegression()

In [None]:
lm.fit(X_train, y_train)

In [None]:
lm.coef_

In [None]:
lm.intercept_

In [None]:
lm.score(X_train, y_train)

In [None]:
lm.score(X_test, y_test)

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
predictions = lm.predict(X_test)

In [None]:
predictions[:8]

In [None]:
# Compare against the test labels, since the predictions were made on X_test
y_test[:8]

In [None]:
mse = mean_squared_error(y_test, predictions)

In [None]:
rmse = np.sqrt(mse)

In [None]:
print("MSE: ", mse, "\nRMSE: ", rmse)
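For reference, the two error measures printed above can be computed by hand. The sketch below uses made-up numbers rather than the advertising data: MSE is the average squared difference between actual and predicted values, and RMSE is its square root, which puts the error back in the units of the target.

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([10.0, 12.0, 15.0, 9.0])
y_pred = np.array([11.0, 12.0, 13.0, 9.0])

errors = y_true - y_pred             # [-1, 0, 2, 0]
mse_by_hand = np.mean(errors ** 2)   # (1 + 0 + 4 + 0) / 4
rmse_by_hand = np.sqrt(mse_by_hand)

print("MSE: ", mse_by_hand, "\nRMSE: ", rmse_by_hand)
```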

**PROBLEM**

Using the `sklearn` implementation of `LinearRegression()`, create a test and train set from your housing data. To begin, fit a linear model on the **logarithm** of the sale price column using the `GrLivArea` feature. Use this as the baseline against which to compare your transformations.

Include the transformations from above into a second linear model and try it out on the test set. Did the performance improve with your adjustments and transformations? 

Add polynomial features into the mix and see if you can improve the model further.
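As a starting point for the polynomial step, here is a minimal sketch on synthetic data, not the housing data. It compares a plain linear fit with a pipeline that adds degree-2 terms through `PolynomialFeatures`. The variable names, the degree, and the simulated relationship are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a curved relationship between x and y
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=(300, 1))
y = 1.0 + 2.0 * x[:, 0] + 1.5 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(x, y, random_state=0)

# Baseline: a straight line in x
baseline = LinearRegression().fit(X_tr, y_tr)

# Same model after expanding x into [x, x^2]
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X_tr, y_tr)

baseline_r2 = baseline.score(X_te, y_te)
poly_r2 = poly_model.score(X_te, y_te)
print("Baseline test R^2:", baseline_r2)
print("Polynomial test R^2:", poly_r2)
```

Because the simulated relationship is quadratic, the polynomial model should score noticeably higher on the test set. On real data the gain may be smaller, and higher degrees can overfit, so always judge the change on the held-out set.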