# Variance Inflation Factors with NYC Building Sales

In this notebook we'll use the NYC Building Sales dataset to explore what colinearity is, and how to measure it using a statistic known as the **Variance Inflation Factor** (VIF).

## Colinearity

Colinearity (or [multicolinearity](https://en.wikipedia.org/wiki/Multicollinearity)) is the phenomenon in which one column in a dataset can be written as a linear function of another (or, in the multicolinear case, of several others). In mathematical terms, column vectors $\vec{A}$ and $\vec{B}$ are perfectly colinear when there exist an $\alpha$ and a $\beta$ such that:

$$\vec{B} = \alpha \vec{A} + \beta$$

For example, suppose our dataset contains two columns, "2015 Price" and "2016 Adjusted Price". The difference between these two columns might be that while the former is in terms of 2015 dollars, the latter is in terms of 2016 dollars (due to inflation). Since inflation is a constant multiple, these two columns would be perfectly colinear!

Perfect colinearity will stop analytically solved ordinary least squares regressors cold, because it creates matrices without unique solutions in the math. In practice, however, perfectly colinear columns are rare. More often we will see partially colinear columns: for example, two different estimates of the odds of players winning in a tennis match. These two columns will probably be close to colinear because we expect bookies to estimate winners' odds very closely to one another.
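
To make the "no unique solution" point concrete, here is a minimal sketch of the price example. The price values and the 2% inflation rate are made up for illustration; the point is only that the design matrix loses a rank when one column is a constant multiple of another.

```python
import numpy as np

# Hypothetical prices in 2015 dollars (illustrative values, not real data).
price_2015 = np.arange(1.0, 51.0) * 10_000
# Assumed 2% inflation: the 2016 column is a constant multiple of the 2015 one.
price_2016 = 1.02 * price_2015

# Design matrix with an intercept column, as an OLS solver would see it.
design = np.column_stack([np.ones_like(price_2015), price_2015, price_2016])

# Three columns, but only two of them carry independent information.
rank = np.linalg.matrix_rank(design)
corr = np.corrcoef(price_2015, price_2016)[0, 1]

print("columns:", design.shape[1], "rank:", rank)
print("correlation:", corr)
```

Because the rank is lower than the number of columns, the normal equations have infinitely many solutions, which is what breaks the analytic OLS solution.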

Colinearity is a special case of correlation. Two columns that are highly colinear are also highly correlated, in this specific, line-wise way.

Colinearity causes problems with ordinary least squares regression. The reason why is intuitive. OLS fits a hyperplane that minimizes the squared distance to the points in n-dimensional space. If two of the dimensions in that space carry nearly the same information, the data cannot tell the algorithm how to split the credit between them: many different pairs of coefficients fit almost equally well, so the individual coefficient estimates become unstable and can swing widely when the data changes slightly.

Importantly, the usual model fit metrics will not flag this. [The usual OLS metrics](https://www.kaggle.com/residentmario/regression-metrics) measure how well the predictions match the response, and a model with unstable coefficients can still score well on them!

We will need to use a special fit statistic to determine when this is happening. VIF is that statistic.

## Variance Inflation Factor

The Variance Inflation Factor, or VIF, is a way of estimating the severity of multicolinearity in your model.

Suppose that we have an [ordinary least squares coefficient vector](https://www.kaggle.com/residentmario/pumpkin-price-linear-regression), $\beta$, a matrix of observations $X$ with columns $X_1, \ldots, X_k$ (as would be inputted into `scikit-learn` during training/prediction), a response vector $y$ (as would be inputted into `scikit-learn` during training), and a vector of predictions from our model $\hat{y}$ (as would be generated by `scikit-learn` during prediction) such that:

$$\hat{y} = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k$$
$$y = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k + \varepsilon$$

($\varepsilon$ is the error term)

Then the [variance](https://en.wikipedia.org/wiki/Variance) of an individual coefficient $\beta_j$ can be estimated by the statistic:

$$\text{Var}(\beta_j) \approx \frac{\text{MSE}}{(n-1)\hat{\text{Var}}(X_j)} \cdot \frac{1}{1 - R^2_j}$$

Where $R^2_j$ is the $R^2$ of the regression of $X_j$ on the remaining variables in the observation matrix $X$.
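
One way to build confidence in this formula is to check it numerically against the standard matrix expression for the coefficient variances, $\sigma^2 (D^T D)^{-1}$, where $D$ is $X$ with an intercept column added. The sketch below uses synthetic data and assumes a known error variance `sigma2` standing in for the MSE; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors: x2 is deliberately correlated with x1.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
x3 = rng.normal(size=n)
X_sim = np.column_stack([x1, x2, x3])
sigma2 = 1.0  # assumed error variance, playing the role of the MSE

# Matrix form: variances are the diagonal of sigma^2 * (D^T D)^-1,
# skipping the first entry, which belongs to the intercept.
D = np.column_stack([np.ones(n), X_sim])
matrix_form = sigma2 * np.linalg.inv(D.T @ D).diagonal()[1:]

def r2_on_others(X, j):
    """R^2 from regressing column j on the other columns plus an intercept."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    return 1 - resid @ resid / np.sum((target - target.mean()) ** 2)

# Formula from the text: sigma^2 / ((n - 1) Var(X_j)) * 1 / (1 - R^2_j).
r2 = np.array([r2_on_others(X_sim, j) for j in range(X_sim.shape[1])])
formula = sigma2 / ((n - 1) * X_sim.var(axis=0, ddof=1)) / (1 - r2)

print(matrix_form)
print(formula)
```

The two printed vectors should agree up to floating point error, and the entries for the two correlated columns should be visibly larger than the entry for the independent one.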

The math for why things work this way is mildly complicated, and out of the scope of this brief tutorial. Basically, this formula tells us that the variance in our estimators (how different we can expect the coefficient to be, given a change or a reshuffling in the data) is related to the mean squared error (a regression fit metric we covered [here](https://www.kaggle.com/residentmario/regression-metrics)), the number of samples in our dataset, the variance of the predictor $X_j$, and this second $R^2$-based term. Put another way, this formula tells us that:

* The more samples in our dataset, the more precise our coefficient estimate is, on average.
* The lower the mean squared error, the more precise our coefficient estimate is, on average.
* The higher the variance of the predictor $X_j$ (the more spread out its values are), the more precise our coefficient estimate is, on average.
* The lower the second term in the equation is, the more precise our coefficient estimate is, on average.

This second term is the variance inflation factor! Its position in this formula is why we call it that: the variance of our $\beta_j$ estimate is proportional to the VIF, i.e.:

$$\text{Var}(\beta_j) \propto \text{VIF} = \frac{1}{1 - R^2_j}$$

And VIF is defined mathematically as:

$$\text{VIF}(\beta_j) = \frac{1}{1 - R^2_j}$$

This is reasonably easy to implement. Sadly, `scikit-learn` doesn't have this built-in (though `statsmodels` does).

In [None]:
import numpy as np
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(X, clf):
    """Return the VIF of each column of the 2-D array X, using regressor clf."""
    vifs = []

    for i in range(X.shape[1]):
        # Regress column i on all of the remaining columns.
        sub_X = np.delete(X, i, axis=1)
        sub_y = X[:, i]
        clf.fit(sub_X, sub_y)
        sub_y_pred = clf.predict(sub_X)

        # R^2_j of this auxiliary regression, then VIF = 1 / (1 - R^2_j).
        sub_r2 = r2_score(sub_y, sub_y_pred)
        vifs.append(1 / (1 - sub_r2))

    return vifs
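
As a cross-check on the regression-based definition, the VIFs also equal the diagonal of the inverse of the predictors' correlation matrix. The sketch below is a NumPy-only illustration on synthetic data; the function names `vifs_from_correlation` and `vifs_from_regressions` are made up for this example and are not part of any library.

```python
import numpy as np

def vifs_from_correlation(X):
    """VIFs as the diagonal of the inverse correlation matrix of X's columns."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def vifs_from_regressions(X):
    """VIFs from the definition: regress each column on the rest (with intercept)."""
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

# Synthetic data: the second column is a noisy copy of the first.
rng = np.random.default_rng(1)
base = rng.normal(size=300)
X_check = np.column_stack([
    base,
    base + 0.3 * rng.normal(size=300),
    rng.normal(size=300),
])

print(vifs_from_correlation(X_check))
print(vifs_from_regressions(X_check))
```

Both approaches should print the same three numbers. The correlation-matrix route needs a single matrix inversion instead of one regression per column, which can be convenient when there are many columns.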

To test out our algorithm (and our understanding of VIF), let's create an artificial dataset of columns generated from the same distribution.

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
clf = LinearRegression()

np.random.seed(42)
X = (np.array(sorted(list(range(5))*20)).reshape(20, 5) +
     np.random.normal(size=100, scale=0.5).reshape(20, 5))
y = (np.array(sorted(list(range(5))*20)).reshape(20, 5) +
     np.random.normal(size=100, scale=0.5).reshape(20, 5))

The following `seaborn` plot shows how all five of our input variables in $X$ are very close to linear with one another.

In [None]:
import seaborn as sns
import pandas as pd
sns.pairplot(pd.DataFrame(X))

As a rule of thumb, a VIF of >10 indicates extreme multicolinearity. And indeed, if we run our VIF algorithm on the dataset we've just generated, we find that's exactly what we have:

In [None]:
variance_inflation_factors(X, clf)

Since our dataset has five variables, there are five VIF scores, one for each variable.

Authorities differ (obviously) on how high the VIF score can get before you should start to worry about it. The [following blog post](https://statisticalhorizons.com/multicollinearity) mentions 2.5 as a cutoff value. Here's what a dataset with approximately this degree of multicolinearity looks like:

In [None]:
X = (np.array(sorted(list(range(5))*20)).reshape(20, 5) +
     np.random.normal(size=100, scale=1.25).reshape(20, 5))
y = (np.array(sorted(list(range(5))*20)).reshape(20, 5) +
     np.random.normal(size=100, scale=1.25).reshape(20, 5))
sns.pairplot(pd.DataFrame(X))

Here are the VIF scores for this dataset:

In [None]:
variance_inflation_factors(X, clf)

To interpret the VIF, it's helpful to restate it in terms of the more familiar $R^2$. $R^2$ is the fraction of the variance in a target that a regression explains. Recall from the formula that VIF is just a transformation of the *between variables* $R^2_j$: the $R^2$ we get when we regress predictor $X_j$ on all of the other predictors. So a VIF score of 2.5 implies that:

$$\text{VIF}(\beta_j) = \frac{1}{1 - R^2_j} = 2.5 \implies R^2_j = 0.6$$

Another way to look at it is to say that a VIF of 2.5 implies that the variance of a particular coefficient is 150% larger than it would be if the predictor was completely uncorrelated with all other predictors.

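Since $R^2_j$ ranges from 0 to 1, VIF ranges from 1 (the predictor is completely uncorrelated with the others) up towards infinity (the predictor is perfectly colinear with the others). The lower the VIF, the more stable the coefficient estimates of a regression model will be!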
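
The conversion between VIF and $R^2_j$ is simple enough to tabulate. The helper names below are made up for this sketch; the formulas are just the VIF definition and its inverse.

```python
def vif_from_r2(r2):
    """VIF implied by a between-variables R^2 (requires r2 < 1)."""
    return 1.0 / (1.0 - r2)

def r2_from_vif(vif):
    """Between-variables R^2 implied by a VIF (requires vif >= 1)."""
    return 1.0 - 1.0 / vif

# A few reference points, including the 2.5 and 10 cutoffs mentioned above.
for r2 in [0.0, 0.5, 0.6, 0.8, 0.9, 0.99]:
    print(f"R^2_j = {r2:.2f} -> VIF = {vif_from_r2(r2):.1f}")

for vif in [2.5, 10.0]:
    print(f"VIF = {vif:.1f} -> R^2_j = {r2_from_vif(vif):.2f}")
```

Reading the table in both directions helps when comparing rules of thumb: a cutoff of 2.5 corresponds to the other predictors explaining 60% of a predictor's variance, while a cutoff of 10 corresponds to 90%.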

## Correcting for colinearity

VIF is a useful metric to compute ahead of time when you are planning on running a regression model. It can help stop you from creating poor models. If you want to run a regression model on a dataset with a high VIF, to maximize the utility of the model you will want to "fix" colinearity somehow. Examples of actions you could take are:

* Removing highly redundant columns, either by hand or by using feature selection.
* Running a compression technique like [Principal Components Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) on the dataset, which replaces highly correlated variables with a smaller set of uncorrelated components.
* Using a model that's robust against colinearity, like lasso regression or ridge regression.
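
The first option in the list above can be sketched as a simple loop: compute the VIFs, drop the column with the largest one, and repeat until every remaining VIF is under a chosen cutoff. This is one common heuristic rather than the only way to do it; the function name `drop_high_vif`, the default cutoff of 10, and the synthetic data are assumptions made for illustration. The VIFs are computed from the inverse correlation matrix to keep the example NumPy-only.

```python
import numpy as np

def drop_high_vif(X, threshold=10.0):
    """Return the indices of the columns kept after iteratively dropping
    the column with the highest VIF until all VIFs are <= threshold."""
    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        corr = np.corrcoef(X[:, kept], rowvar=False)
        vifs = np.diag(np.linalg.inv(corr))
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        # Remove the most redundant column and recompute on the rest.
        del kept[worst]
    return kept

# Synthetic data: column 1 is a near-copy of column 0, column 2 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X_demo = np.column_stack([
    a,
    a + 0.05 * rng.normal(size=500),
    rng.normal(size=500),
])

kept = drop_high_vif(X_demo)
print("kept column indices:", kept)
```

Only one of the two near-duplicate columns should survive, alongside the independent column. Dropping one column at a time matters because removing a single redundant column can bring the VIFs of the others back down.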

## Application to NYC Buildings

You don't have to be planning a regression model to apply VIF to a dataset! Indeed, the VIF is a pretty useful general-purpose statistic. It can be used to determine how redundant a set of variables is when they are combined with one another. In this way it's like taking the correlation matrix of a dataset, except that it is explicitly targeted at colinearity and considers all of the variables at once, instead of one pair at a time.

Let's go ahead and apply it to a real dataset.

The NYC Building Sales dataset is a record of one year of real estate transactions in New York City, as recorded by the NYC Department of Finance.

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)  # full option name; the short form can be ambiguous in newer pandas
sales = pd.read_csv("../input/nyc-rolling-sales.csv", index_col=0)
sales.head()

In this dataset, there are a few variables that seem like they might be colinear:

* `RESIDENTIAL UNITS` and `TOTAL UNITS`
* `LAND SQUARE FEET` and `GROSS SQUARE FEET`

Let's see if they are:

In [None]:
clf = LinearRegression()
X = sales.loc[:, ['RESIDENTIAL UNITS', 'TOTAL UNITS']].dropna().values
variance_inflation_factors(X, clf)

These variables are highly colinear! In fact:

$$\text{VIF} \approx 4.7 \implies R^2 \approx 0.8$$

In other words, a linear function of the number of residential units explains roughly 80% of the variance in the number of total units.

What about `LAND SQUARE FEET` and `GROSS SQUARE FEET`?

In [None]:
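X = (sales.loc[:, ['LAND SQUARE FEET', 'GROSS SQUARE FEET']]
         .replace(' -  ', np.nan)  # this placeholder string marks missing values
         .dropna()
         .astype(float)            # cast in case the columns were read in as strings
         .values
    )
variance_inflation_factors(X, clf)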

These variables are not that highly colinear.

$$\text{VIF} \approx 1.7 \implies R^2 \approx 0.4$$

So if we were to try to use regression to predict `GROSS SQUARE FEET` using `LAND SQUARE FEET`, we would explain only about 40% of its variance. This is still a noticeable amount of correlation, so it's something that's useful to keep in mind, but fine for regression.
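
When only two columns are involved, as in both checks above, the VIF is fully determined by the ordinary Pearson correlation $r$ between them: $R^2 = r^2$, so $\text{VIF} = 1 / (1 - r^2)$, and both columns get the same score. The sketch below illustrates this on synthetic stand-ins for the two square footage columns; the numbers are invented and are not drawn from the sales data.

```python
import numpy as np

# Synthetic stand-ins for two moderately correlated columns (not real data).
rng = np.random.default_rng(0)
land = rng.normal(size=1000)
gross = 0.65 * land + rng.normal(scale=0.76, size=1000)

# Route 1: straight from the Pearson correlation.
r = np.corrcoef(land, gross)[0, 1]
vif_from_corr = 1 / (1 - r ** 2)

# Route 2: from the definition, regressing one column on the other.
A = np.column_stack([np.ones_like(land), land])
coef, *_ = np.linalg.lstsq(A, gross, rcond=None)
resid = gross - A @ coef
r2 = 1 - resid @ resid / np.sum((gross - gross.mean()) ** 2)
vif_from_regression = 1 / (1 - r2)

print("r:", r, "r^2:", r ** 2, "R^2:", r2)
print("VIF:", vif_from_corr, vif_from_regression)
```

So for a pair of columns, VIF adds nothing beyond the correlation coefficient. It becomes more informative than a correlation matrix once three or more columns are involved, because a column can be well explained by a combination of others without being strongly correlated with any single one.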

That concludes this notebook. Hopefully you now know what a Variance Inflation Factor is, and how to use it, both in exploratory data analysis and when building your models!

Until next time!