# Multiple Linear Regression

Previously, we used a single feature to predict the sale price of a house. Any number of features may be used with linear regression to build a model. **Multiple Linear Regression** refers to linear regression with two or more features. Let's see the general form of the model.

$$\hat{y} = w_{0} + w_{1}x_{1} + w_{2}x_{2} + ... + w_{p}x_{p}$$

Where `p` is the number of predictor variables (features). The above equation is no longer an equation for a line but for a [hyperplane][1].
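As a minimal sketch of the general form, the prediction for each observation is the intercept plus a weighted sum of its feature values. The coefficients and data below (`w0`, `w`, `X_demo`) are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical coefficients for p = 3 features (illustrative values only)
w0 = 2.0                          # intercept
w = np.array([0.5, -1.0, 3.0])    # w_1, w_2, w_3

# Two made-up observations: one row per observation, one column per feature
X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# y_hat = w0 + w1*x1 + w2*x2 + w3*x3, computed for every row at once
y_hat = w0 + X_demo @ w
print(y_hat)
```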

## Same Goal - minimize squared error

Multiple linear regression has the same goal as simple linear regression - to choose coefficients that minimize the sum of squared errors.

### Naive Iterative process no longer feasible

The naive iterative process that we used to search for the optimal coefficients is no longer feasible, because the number of coefficient combinations to try grows exponentially with each added feature. The method used by scikit-learn, by contrast, finds the coefficient values efficiently.
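To get a feel for the growth, assume a naive search that tries 100 candidate values per coefficient (an arbitrary number chosen for illustration). The second half of the sketch solves a least-squares problem directly with NumPy on synthetic data. It shows one way coefficients can be found without searching, and is not meant as a description of scikit-learn's internals.

```python
import numpy as np

# Assumption: a naive grid search tries 100 candidate values per coefficient
candidates_per_coef = 100
for p in (1, 2, 3):
    n_combinations = candidates_per_coef ** (p + 1)   # p slopes plus the intercept
    print(f"{p} feature(s): {n_combinations:,} combinations")

# Synthetic data generated from known coefficients (no noise added)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_w0, true_w = 4.0, np.array([1.5, -2.0, 0.5])
y_demo = true_w0 + X_demo @ true_w

# Prepend a column of ones so the intercept is estimated along with the slopes
A = np.column_stack([np.ones(len(X_demo)), X_demo])
coefs, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
print(coefs.round(3))   # intercept first, then the three slopes
```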

## Choose features to build a model

Let's read in the housing dataset and select `GarageArea` and `FullBath` in addition to `GrLivArea` as features in the model.

[1]: https://en.wikipedia.org/wiki/Hyperplane

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
housing = pd.read_csv('../data/housing_sample.csv')
housing.head()

Our input data now consists of three features. Let's select them, along with the target `SalePrice`.

In [None]:
X = housing[['GrLivArea', 'GarageArea', 'FullBath']]
y = housing['SalePrice']

## Import, Instantiate, Fit

Let's use the same three-step process to build a model from these three features.

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)

### Output the coefficients of the model

To see the trained coefficients of the model, access the `intercept_` and `coef_` attributes.

In [None]:
lr.intercept_

In [None]:
lr.coef_

### Formally write the model

We can formally write out the mathematical definition of our model as the following, where $x_{1}$, $x_{2}$, and $x_{3}$ represent our features.

$$\hat{y} = -15836 + 70x_{1} + 132x_{2} + 17942x_{3}$$
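The written-out formula is exactly what `predict` computes. The sketch below checks this on synthetic data (the names `X_demo`, `y_demo`, and `model` are stand-ins, not the housing data), by applying the intercept and coefficients by hand and comparing the result with `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for three features and a target (values are made up)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(30, 3))
y_demo = 5 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=30)

model = LinearRegression().fit(X_demo, y_demo)

# Apply the formula by hand: intercept plus the weighted sum of the features
manual = model.intercept_ + X_demo @ model.coef_
print(np.allclose(manual, model.predict(X_demo)))
```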

## Make predictions

It is uncommon to create your own input data for predictions. More often, you will have newly observed data that lacks the target value, and you will use your model to predict it. For instance, when a new house comes on the market, you can use your model to estimate what it might sell for.

### Using our training data to make predictions

Instead of creating our own data by hand, we can select rows of our training data to use as input data for prediction. Let's select the first 10 rows of our input data to do so.

In [None]:
lr.predict(X.head(10)).round(-3)

We can compare these predictions to the actual sale prices.

In [None]:
y.values[:10]

### Plot predicted vs actual values
It can be helpful to plot the predicted and actual values together. We do so below for the first 10 observations. The x-axis corresponds to the position of each observation (the tick labels are hidden), and the sale price is plotted on the y-axis. For each observation, two points are plotted: one for the predicted value and one for the actual value. A few predictions are very close to the actual observed sale price.

In [None]:
y_pred = lr.predict(X.head(10)).round(-3)
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(np.arange(10), y_pred, label='predicted')
ax.scatter(np.arange(10), y.head(10), label='actual')
ax.legend()
ax.set_title('Actual vs Predicted for first 10 rows')
ax.set_ylabel('Sale Price')
ax.set_xticks([]);

## Evaluating model performance
We can use the `score` method again to compute the $R^2$ of our model. Keep in mind that this value is calculated on data the model has already seen, so it is not an accurate measure of how the model will perform on unseen data.

In [None]:
lr.score(X, y)
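One common way to estimate performance on unseen data is to hold out part of the data before fitting. The sketch below uses scikit-learn's `train_test_split` on synthetic data. The 25% test size and the variable names are assumptions for illustration, not part of the workflow above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the housing features and sale price
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 3))
y_demo = 10 + X_demo @ np.array([3.0, 1.0, -2.0]) + rng.normal(size=200)

# Hold out 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)   # R^2 on data the model has seen
test_r2 = model.score(X_test, y_test)      # R^2 on held-out data
print(round(train_r2, 3), round(test_r2, 3))
```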

## Comparing multivariate and univariate model

Let's compare the performance of our new multivariate model to the univariate model that uses only `GrLivArea`. We don't have to import the linear regression model again. We begin by instantiating a new model, then fit it and calculate $R^2$.

In [None]:
lr2 = LinearRegression()
X2 = housing[['GrLivArea']]
lr2.fit(X2, y)
lr2.score(X2, y)

The new model explains about 12% more of the variance in the data, which suggests that there is value in using garage area and the number of full baths. This is a tentative conclusion, however, as we are still evaluating on the training set and not on unseen data.
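One reason for caution is that, for ordinary least squares with an intercept, the training-set $R^2$ cannot decrease when a feature is added, even if that feature is unrelated to the target. The sketch below illustrates this with synthetic data, where `x_noise` is pure random noise by construction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 100
x_signal = rng.normal(size=n)
x_noise = rng.normal(size=n)            # unrelated to the target by construction
y_demo = 2 * x_signal + rng.normal(size=n)

X_one = x_signal.reshape(-1, 1)                 # one informative feature
X_two = np.column_stack([x_signal, x_noise])    # same feature plus noise

# Both scores are computed on the same data used for fitting
r2_one = LinearRegression().fit(X_one, y_demo).score(X_one, y_demo)
r2_two = LinearRegression().fit(X_two, y_demo).score(X_two, y_demo)
print(r2_one <= r2_two)
```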

## Multiple linear regression model interpretation

Interpretation of the linear regression model when using multiple features is more complex than in the univariate case. From our model above, every one-unit increase in `GrLivArea` (one square foot) translates to an increase of 70 dollars in the predicted sale price, assuming the other features are held constant.

For instance, if a particular house has 1,000 square feet of living area, 500 square feet of garage area, and two full baths, then increasing the living area to 1,001 square feet while keeping the garage area and number of full baths the same leads to a 70 dollar increase in the predicted sale price.
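The arithmetic can be checked directly with the rounded coefficients from the written-out model. The helper `predict_price` is a hypothetical function introduced only for this check.

```python
def predict_price(gr_liv_area, garage_area, full_bath):
    """Apply the rounded coefficients from the written-out model."""
    return -15836 + 70 * gr_liv_area + 132 * garage_area + 17942 * full_bath

before = predict_price(1000, 500, 2)
after = predict_price(1001, 500, 2)   # one more square foot, all else equal
print(before, after, after - before)
```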

A major issue with this interpretation is that many features are correlated with one another, meaning that a change in one usually comes with a change in another. Holding the other features constant is then an unrealistic scenario, so the interpretation above is cleanest when the features are uncorrelated with one another.

With our particular data, this isn't the case. Using the DataFrame `corr` method, we see that above ground living area and garage area have a correlation of nearly 0.5, so houses with more living area tend to have larger garages as well.

In [None]:
X.corr().round(2)

### Predictions are still viable

Linear regression makes several assumptions. One is that the features are not highly correlated with one another (strictly, that no feature is an exact linear combination of the others). Real datasets frequently show substantial correlation between features. Read more about assumptions and model interpretation [here][2]. Even when model assumptions are violated, the model can still make good predictions.

### No statistical output with scikit-learn

Scikit-learn does not produce the statistical output and metrics commonly found in statistics textbooks. If you want more formal statistical output, use the [statsmodels][3] library.

[2]: https://en.wikipedia.org/wiki/Linear_regression#Assumptions
[3]: https://www.statsmodels.org/stable/index.html

## Exercises

### Problem 1
<span  style="color:green; font-size:16px">Build several models with more than one feature. Track the $R^2$ of each model you build. Which combination of features produces the highest $R^2$?</span>