# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model by using statistical performance metrics pertaining to overall model and specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one hot encode the categorical features of interest
* Log and scale the selected continuous features

In [55]:
import pandas as pd
import numpy as np

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']


## Continuous Features

In [26]:
# Log transform and normalize
ames_cont = ames[continuous]

log_name = [f'{column}_log' for column in ames_cont.columns]

ames_log = np.log(ames_cont)
ames_log.columns = log_name

def normalize(feature):
    return (feature - feature.mean()) / feature.std() 
ames_log_norm = ames_log.apply(normalize)


## Categorical Features

In [32]:
# One hot encode categoricals
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True)
if is_bool.any().any():
    ames_ohe = ames_ohe.astype(np.uint8)

## Combine Categorical and Continuous Features

In [33]:
# combine features into a single dataframe called preprocessed
preprocessed = pd.concat([ames_log_norm, ames_ohe], axis=1)
preprocessed.head()


Unnamed: 0,LotArea_log,1stFlrSF_log,GrLivArea_log,SalePrice_log,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,KitchenQual_Fa,KitchenQual_Gd,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,-0.133185,-0.803295,0.529078,0.559876,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0.113403,0.418442,-0.381715,0.212692,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0.419917,-0.576363,0.659449,0.733795,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0.103311,-0.439137,0.541326,-0.437232,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0.878108,0.112229,1.281751,1.014303,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0


## Run a linear model with SalePrice as the target variable in statsmodels

In [34]:
# Your code here
y = preprocessed['SalePrice_log']
X = preprocessed.drop('SalePrice_log', axis=1)
import statsmodels.api as sm
X_int = sm.add_constant(X)
model = sm.OLS(y, X_int).fit()
model.summary()

0,1,2,3
Dep. Variable:,SalePrice_log,R-squared:,0.839
Model:,OLS,Adj. R-squared:,0.834
Method:,Least Squares,F-statistic:,156.5
Date:,"Mon, 08 Jul 2024",Prob (F-statistic):,0.0
Time:,13:47:43,Log-Likelihood:,-738.14
No. Observations:,1460,AIC:,1572.0
Df Residuals:,1412,BIC:,1826.0
Df Model:,47,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.1317,0.263,-0.500,0.617,-0.648,0.385
LotArea_log,0.1033,0.019,5.475,0.000,0.066,0.140
1stFlrSF_log,0.1371,0.016,8.584,0.000,0.106,0.168
GrLivArea_log,0.3768,0.016,24.114,0.000,0.346,0.407
BldgType_2fmCon,-0.1715,0.079,-2.173,0.030,-0.326,-0.017
BldgType_Duplex,-0.4203,0.062,-6.813,0.000,-0.541,-0.299
BldgType_Twnhs,-0.1403,0.093,-1.513,0.130,-0.322,0.042
BldgType_TwnhsE,-0.0512,0.060,-0.858,0.391,-0.168,0.066
KitchenQual_Fa,-0.9999,0.088,-11.315,0.000,-1.173,-0.827

0,1,2,3
Omnibus:,289.988,Durbin-Watson:,1.967
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1242.992
Skew:,-0.886,Prob(JB):,1.22e-270
Kurtosis:,7.159,Cond. No.,109.0


## Run the same model in scikit-learn

In [37]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X,y)

In [38]:
# coefficients
LinearRegression()

In [39]:
# coefficients
linreg.coef_

array([ 0.10327192,  0.1371289 ,  0.37682133, -0.1714623 , -0.42033885,
       -0.14034113, -0.05120194, -0.99986001, -0.38202198, -0.66924909,
        0.22847737,  0.5860786 ,  0.31510567,  0.0330941 ,  0.01608664,
        0.29985338,  0.11784232,  0.17480326,  1.06663561,  0.87681007,
        0.99609131,  1.10228499, -0.21311107,  0.05293276, -0.46271253,
       -0.64982261, -0.21019239, -0.07609253, -0.08233633, -0.76126683,
       -0.09799942, -0.96183328, -0.69182575, -0.2553217 , -0.44067351,
       -0.01595046, -0.26762962,  0.36313165,  0.36259667, -0.93504972,
       -0.69976325, -0.47543141, -0.23309732,  0.09502969,  0.42957077,
        0.0056924 ,  0.12762613])

In [40]:
# intercept
linreg.intercept_

-0.13169736916670582

## Predict the house price given the following characteristics (before manipulation!!)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt

In [56]:
# Your code here - predict the house price given the following characteristics
# getting the used column names for the df
continuous.remove("SalePrice")

used_cols = [*continuous, *categoricals]
used_cols

['LotArea',
 '1stFlrSF',
 'GrLivArea',
 'BldgType',
 'KitchenQual',
 'SaleType',
 'MSZoning',
 'Street',
 'Neighborhood']

In [57]:
# creating an empty dataframe for the new row
new_row = pd.DataFrame(columns=used_cols)
new_row

Unnamed: 0,LotArea,1stFlrSF,GrLivArea,BldgType,KitchenQual,SaleType,MSZoning,Street,Neighborhood


In [58]:

# Create a new row as a DataFrame
new_data = pd.DataFrame({
    "LotArea": [14977],
    '1stFlrSF': [1976],
    'GrLivArea': [1976],
    'BldgType': ['1Fam'],
    'KitchenQual': ['Gd'],
    'SaleType': ['New'],
    'MSZoning': ['RL'],
    'Street': ['Pave'],
    'Neighborhood': ['NridgHt']
})

# Print the new row DataFrame
print("\nNew Row DataFrame:")
print(new_data)

# Use pd.concat to add the new row to the DataFrame
new_row = pd.concat([new_row, new_data], ignore_index=True)

# Print the updated DataFrame

new_row




New Row DataFrame:
   LotArea  1stFlrSF  GrLivArea BldgType KitchenQual SaleType MSZoning Street  \
0    14977      1976       1976     1Fam          Gd      New       RL   Pave   

  Neighborhood  
0      NridgHt  


Unnamed: 0,LotArea,1stFlrSF,GrLivArea,BldgType,KitchenQual,SaleType,MSZoning,Street,Neighborhood
0,14977,1976,1976,1Fam,Gd,New,RL,Pave,NridgHt


In [59]:
# first we'll tackle the continuous columns
new_row_cont = new_row[continuous]

# log features
log_names = [f'{column}_log' for column in new_row_cont.columns]

new_row_log = np.log(new_row_cont.astype(float)) # won't work unless float
new_row_log.columns = log_names

# normalizing
for col in continuous:
    # normalize using mean and std from overall dataset
    new_row_log[f'{col}_log'] = (new_row_log[f'{col}_log'] - ames[col].mean()) / ames[col].std()
new_row_log


Unnamed: 0,LotArea_log,1stFlrSF_log,GrLivArea_log
0,-1.052694,-2.987777,-2.869517


In [61]:
# now time for the categoricals
new_row_cat = new_row[categoricals]

new_row_ohe = pd.DataFrame(columns = ames_ohe.columns)

# using complicated for loops to ohe the new row
ohe_dict = {}
for col_type in new_row_cat.columns:
    col_list = [c for c in new_row_ohe.columns if col_type in c]
    for x in col_list:
        if new_row_cat[col_type].values[0] in x:
            ohe_dict[x] = 1
        else:
            ohe_dict[x] = 0

# Convert the dictionary to a DataFrame
new_row_ohe = pd.DataFrame([ohe_dict])
new_row_ohe

Unnamed: 0,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,KitchenQual_Fa,KitchenQual_Gd,KitchenQual_TA,SaleType_CWD,SaleType_Con,SaleType_ConLD,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,0,0,0,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [62]:
# putting together this row's data - both continuous and categorical
new_row_processed = pd.concat([new_row_log, new_row_ohe], axis=1)
new_row_processed

Unnamed: 0,LotArea_log,1stFlrSF_log,GrLivArea_log,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,KitchenQual_Fa,KitchenQual_Gd,KitchenQual_TA,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,-1.052694,-2.987777,-2.869517,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0


In [63]:
# now - FINALLY - we can model

new_row_pred_log = linreg.predict(new_row_processed)
new_row_pred_log

array([-0.66800851])

In [64]:
# prediction needs to be scaled and exponentiated
np.exp(new_row_pred_log) * ames["SalePrice"].std() + ames["SalePrice"].mean()

array([221653.64351813])

## Summary
Congratulations! You pre-processed the Ames Housing data using scaling and standardization. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!