# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model by using statistical performance metrics pertaining to overall model and specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one hot encode the categorical features of interest
* Log and scale the selected continuous features

In [1]:
import pandas as pd
import numpy as np

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']


## Continuous Features

In [2]:
# Log transform and normalize
ames_cont = ames[continuous]

# log features
log_names = [f'{column}_log' for column in ames_cont.columns]

ames_log = np.log(ames_cont)
ames_log.columns = log_names

# normalize 
def normalize(feature):
    return (feature - feature.mean()/feature.std())

ames_log_norm = ames_log.apply(normalize)

## Categorical Features

In [3]:
# One hot encode categoricals
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True)

## Combine Categorical and Continuous Features

In [4]:
# combine features into a single dataframe called preprocessed

preprocessed = pd.concat([ames_log_norm, ames_ohe], axis=1)
preprocessed.head()

Unnamed: 0,LotArea_log,1stFlrSF_log,GrLivArea_log,SalePrice_log,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,KitchenQual_Fa,KitchenQual_Gd,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,-8.56533,-15.302616,-14.344884,-17.853682,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,-8.437733,-14.914433,-14.648679,-17.992365,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,-8.279128,-15.230512,-14.301399,-17.784209,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,-8.442955,-15.186912,-14.340799,-18.251978,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,-8.042038,-15.011726,-14.093829,-17.67216,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0


## Run a linear model with SalePrice as the target variable in statsmodels

In [5]:
# Your code here

X = preprocessed.drop('SalePrice_log', axis=1)
y = preprocessed['SalePrice_log']

In [6]:
import statsmodels.api as sm
X_int = sm.add_constant(X)
model = sm.OLS(y,X_int).fit()
model.summary()

0,1,2,3
Dep. Variable:,SalePrice_log,R-squared:,0.839
Model:,OLS,Adj. R-squared:,0.834
Method:,Least Squares,F-statistic:,156.5
Date:,"Wed, 24 Feb 2021",Prob (F-statistic):,0.0
Time:,22:24:02,Log-Likelihood:,601.65
No. Observations:,1460,AIC:,-1107.0
Df Residuals:,1412,BIC:,-853.6
Df Model:,47,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-8.3053,0.302,-27.517,0.000,-8.897,-7.713
LotArea_log,0.0797,0.015,5.475,0.000,0.051,0.108
1stFlrSF_log,0.1724,0.020,8.584,0.000,0.133,0.212
GrLivArea_log,0.4513,0.019,24.114,0.000,0.415,0.488
BldgType_2fmCon,-0.0685,0.032,-2.173,0.030,-0.130,-0.007
BldgType_Duplex,-0.1679,0.025,-6.813,0.000,-0.216,-0.120
BldgType_Twnhs,-0.0561,0.037,-1.513,0.130,-0.129,0.017
BldgType_TwnhsE,-0.0205,0.024,-0.858,0.391,-0.067,0.026
KitchenQual_Fa,-0.3994,0.035,-11.315,0.000,-0.469,-0.330

0,1,2,3
Omnibus:,289.988,Durbin-Watson:,1.967
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1242.992
Skew:,-0.886,Prob(JB):,1.22e-270
Kurtosis:,7.159,Cond. No.,1620.0


## Run the same model in scikit-learn

In [7]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)

LinearRegression()

In [8]:
# coefficients
linreg.coef_

array([ 0.07972232,  0.17239913,  0.45127205, -0.06849094, -0.16790514,
       -0.05605952, -0.02045271, -0.39939595, -0.15259939, -0.2673328 ,
        0.09126571,  0.23411019,  0.12586955,  0.0132195 ,  0.00642584,
        0.11977699,  0.04707234,  0.06982549,  0.42606959,  0.35024342,
        0.39789053,  0.4403098 , -0.08512761,  0.02114409, -0.18483139,
       -0.25957286, -0.08396174, -0.0303953 , -0.0328894 , -0.30408946,
       -0.03914605, -0.3842061 , -0.27635109, -0.10198873, -0.17602786,
       -0.00637144, -0.10690515,  0.14505362,  0.14483992, -0.37350736,
       -0.27952174, -0.18991196, -0.09311116,  0.03795979,  0.17159285,
        0.00227384,  0.0509805 ])

In [9]:
# intercept
linreg.intercept_

-8.305339002628404

## Predict the house price given the following characteristics (before manipulation!!)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt

## Summary
Congratulations! You pre-processed the Ames Housing data using scaling and standardization. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!