A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at prices below their actual value and resell them at a higher price. To this end, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV file below.

 

The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.

 

The company wants to know:

1. Which variables are significant in predicting the price of a house, and

2. How well those variables describe the price of a house.

Also, determine the optimal value of lambda for ridge and lasso regression.
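
For reference, the two penalised objectives can be written as below. This is the form documented for scikit-learn's `Ridge` and `Lasso`, where the penalty strength that the brief calls lambda is exposed as the `alpha` parameter. The symbols $X$ (feature matrix), $y$ (sale prices), $w$ (coefficients) and $n$ (number of training rows) are notation introduced here, not taken from the brief.

```latex
\begin{aligned}
\text{Ridge:}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
```

A larger $\alpha$ shrinks the coefficients more strongly. The lasso's $\ell_1$ penalty can set some of them exactly to zero.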

### Business Goal 

You are required to model the price of houses with the available independent variables. This model will then be used by the management to understand how exactly the prices vary with the variables. They can adjust the firm's strategy accordingly and concentrate on areas that will yield high returns. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
pd.set_option('display.max_columns', None) 
pd.set_option("display.max_rows", 100)

### Reading the data

In [None]:
house_df = pd.read_csv('train.csv')
house_df_copy = house_df.copy()
house_df.head()
house_df.BedroomAbvGr.unique()

### Visualising and cleaning data

#### Checking the numerical value relationships

In [None]:
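plt.figure(figsize=(30, 15))
# correlations are only defined for numeric columns, so select those first
sns.heatmap(
    house_df.select_dtypes(include='number').corr(), annot=True, cmap="YlGnBu",
)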

#### Checking data types

In [None]:
house_df.dtypes

#### Checking for duplicate rows

In [None]:
house_df[house_df.duplicated()]

#### Removing columns with a single unique value, as they add nothing to the analysis


In [None]:
house_df = house_df[
    [column for column in list(house_df) if house_df[column].nunique() > 1]
]
len(house_df.columns)

#### Checking null percent of columns

In [None]:
house_df.isnull().sum() / len(house_df) *100

#### Replacing nulls with default values

In [None]:
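# Categorical gaps become an explicit 'NA' category
house_df['Fence'] = house_df['Fence'].fillna('NA')
house_df['FireplaceQu'] = house_df['FireplaceQu'].fillna('NA')
# Numeric gaps are filled with the column median
house_df['LotFrontage'] = house_df['LotFrontage'].fillna(house_df['LotFrontage'].median())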

In [None]:
# Drop columns where at least 30% of the values are null
NA_columns = house_df.isnull().sum()
NA_columns = NA_columns[NA_columns.values >= (0.3 * len(house_df))].index
house_df.drop(labels=NA_columns, axis=1, inplace=True)
# Drop Id (a row identifier) and MoSold (month of sale), which are not used in the model
house_df.drop('Id', inplace=True, axis=1)
house_df.drop('MoSold', inplace=True, axis=1)
house_df.shape
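
The threshold rule used above can be checked on a small frame. This is a minimal sketch on made-up data, not the housing set. The names `toy`, `null_share` and `trimmed` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): "b" is 75% missing, "c" is 25% missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# Fraction of nulls per column; isnull().mean() equals isnull().sum() / len(toy).
null_share = toy.isnull().mean()

# Same rule as above: drop every column with at least 30% nulls.
to_drop = null_share[null_share >= 0.3].index
trimmed = toy.drop(columns=to_drop)

print(null_share)
print(list(trimmed.columns))
```

Column `c` survives because its null share is below the threshold. Its remaining null row would then be handled by the row-wise `dropna` step.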


#### Removing rows with null values

In [None]:

house_df.dropna(inplace=True)
house_df.shape

#### Converting year to age

In [None]:
house_df['Age'] = 2020 - house_df.YearBuilt
house_df['RemodAge'] = 2020 - house_df.YearRemodAdd
house_df['SoldAge'] = 2020 - house_df.YrSold
house_df.drop(['YearBuilt', 'YearRemodAdd', 'YrSold'], axis=1, inplace=True)

#### Removing outliers

In [None]:
house_df = house_df[(house_df["LotArea"] < house_df["LotArea"].quantile(0.95))]
house_df = house_df[(house_df["LotFrontage"] < house_df["LotFrontage"].quantile(0.996))]
house_df = house_df[(house_df["MasVnrArea"] < house_df["MasVnrArea"].quantile(0.95))]
house_df = house_df[(house_df["BsmtFinSF1"] < house_df["BsmtFinSF1"].quantile(0.98))]
house_df = house_df[(house_df["2ndFlrSF"] < house_df["2ndFlrSF"].quantile(0.99))]
house_df = house_df[(house_df["SalePrice"] < house_df["SalePrice"].quantile(0.99))]
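
The repeated quantile filters above could be folded into one helper. The sketch below assumes a hypothetical function `trim_upper_quantile` and runs on synthetic numbers rather than the real data. Each cut-off is computed on the rows that survived the previous filter, which matches the sequential cell above.

```python
import pandas as pd

def trim_upper_quantile(df, limits):
    """Keep rows strictly below each column's upper quantile (hypothetical helper).

    Filters run one after another, so each quantile is computed on the rows
    left over from the previous filter.
    """
    for column, q in limits.items():
        df = df[df[column] < df[column].quantile(q)]
    return df

# Synthetic stand-in: 100 rows with evenly spaced values.
toy = pd.DataFrame({"LotArea": range(1, 101), "SalePrice": range(100, 200)})
trimmed = trim_upper_quantile(toy, {"LotArea": 0.95, "SalePrice": 0.99})

print(len(toy), "->", len(trimmed))
```

Because the filters are sequential, changing their order can change which rows are removed.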

In [None]:
house_df.describe()

#### Creating dummy variables

In [None]:
def create_dummy_variable(df, column):
    """Replace `column` in `df` with k-1 dummy columns named '<column>_<category>'.

    Returns the updated frame and the dummy columns on their own.
    """
    dummy_df = pd.get_dummies(df[column], drop_first=True)
    dummy_df.columns = [str(column) + '_' + str(category) for category in dummy_df.columns]
    df = pd.concat([df, dummy_df], axis=1)
    df = df.drop(column, axis=1)
    return df, dummy_df

In [None]:
categorical_variables = ['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']

In [None]:
# Drop the names of categorical columns that have already been removed from house_df
categorical_variables_temp = categorical_variables.copy()
for column in categorical_variables_temp:
    if column not in house_df.columns:
        categorical_variables.remove(column)
        print('removed: ' + column)

#### Creating dummy variables from categorical variables

In [None]:
for column in categorical_variables:
    house_df, df_dummy = create_dummy_variable(house_df, column)
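
As a point of comparison, `pd.get_dummies` can encode several columns in one call and produces the same `<column>_<category>` naming. The sketch below uses a three-row toy frame with made-up values. It is an alternative to the loop above, not a description of what the notebook ran.

```python
import pandas as pd

# Toy frame (hypothetical values) with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "CentralAir": ["Y", "N", "Y"],
    "LotArea": [8450, 9600, 11250],
})

# columns= picks what to encode; drop_first=True drops the first category of
# each column so the dummies are not perfectly collinear.
encoded = pd.get_dummies(toy, columns=["Street", "CentralAir"], drop_first=True)

print(list(encoded.columns))
```

The untouched numeric column stays in place and the dummy columns are appended after it.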

### Model Building

#### Creating test train data

In [None]:
y = house_df.loc[:, 'SalePrice']
X = house_df.loc[:, house_df.columns != 'SalePrice']

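# scale
# the scaler is only fitted here; X itself is not transformed, so the
# models below are trained on the original, unscaled feature values
scaler = StandardScaler()
scaler.fit(X)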

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2, 
                                                    random_state = 1)
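
Penalised regressions are sensitive to feature scale, and the scaler in the cell above is fitted but never used to transform `X`. One common way to apply scaling without letting held-out rows influence it is to put the scaler and the model in a single pipeline. The sketch below assumes synthetic data and illustrative names (`X_demo`, `y_demo`, `pipe`, `search`). It is not the notebook's actual workflow.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 5))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Inside GridSearchCV the whole pipeline is refitted on the training part of
# every fold, so the scaler never sees that fold's held-out rows.
pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": [0.01, 1.0, 10.0, 100.0]},
    scoring="r2",
    cv=KFold(n_splits=5, shuffle=True, random_state=4),
)
search.fit(X_tr, y_tr)

print(search.best_params_, round(search.score(X_te, y_te), 3))
```

With scaled inputs, each coefficient describes the effect of a one-standard-deviation change in its feature rather than a one-unit change.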

In [None]:
lm = LinearRegression()
lm.fit(X_train, y_train)

# keep the 100 highest-ranked features (the count is passed by keyword)
rfe = RFE(lm, n_features_to_select=100)

rfe = rfe.fit(X_train, y_train)

In [None]:
# Columns filtered by RFE
cols = X_train.columns[rfe.support_]
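
To see what `rfe.support_` contains, here is a small sketch of recursive feature elimination on synthetic data where only two of six columns influence the target. The data and the names `X_demo`, `y_demo` and `selector` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
# Only columns 0 and 3 drive the target in this synthetic example.
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# RFE repeatedly fits the estimator and discards the weakest feature
# until n_features_to_select remain.
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X_demo, y_demo)

print(selector.support_)   # boolean mask over the six columns
print(selector.ranking_)   # rank 1 marks a selected column
```

Indexing a column index with the boolean mask, as done above with `X_train.columns[rfe.support_]`, yields the names of the retained features.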

#### Training model using Lasso

In [None]:
# grid search CV

# set up cross validation scheme
folds = KFold(n_splits = 5, shuffle = True, random_state = 4)

# specify range of hyperparameters
params = {'alpha': [0.001, 0.01, 1.0, 5.0, 10.0, 100.0, 500.0, 1000.0]}

# grid search
# lasso model
model = Lasso()
model_cv = GridSearchCV(estimator = model, param_grid = params, 
                        scoring= 'r2', 
                        cv = folds, 
                        return_train_score=True, verbose = 1)            
model_cv.fit(X_train[cols], y_train) 

In [None]:
print(model_cv.best_params_)
print(model_cv.best_score_)

In [None]:
cv_results = pd.DataFrame(model_cv.cv_results_)
cv_results.T
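
The full `cv_results_` table is wide, so a narrower view may make the comparison across alphas easier to read. The sketch below fits a small grid on synthetic data and keeps only the mean train score, the mean test score, and their difference. The `gap` column and all variable names are illustrative additions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression problem (not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 8))
y_demo = 5.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(size=150)

search = GridSearchCV(
    Lasso(),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]},
    scoring="r2",
    cv=KFold(n_splits=5, shuffle=True, random_state=4),
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# Keep the columns that matter for choosing alpha and add the train/test gap.
summary = pd.DataFrame(search.cv_results_)[
    ["param_alpha", "mean_train_score", "mean_test_score", "std_test_score"]
].copy()
summary["gap"] = summary["mean_train_score"] - summary["mean_test_score"]

print(summary.sort_values("mean_test_score", ascending=False))
```

A small gap combined with a high mean test score suggests an alpha that generalises well. `best_params_` reports the alpha with the highest mean test score.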

##### In both the train and test scores, the variation appears to be smallest when alpha is 10

In [None]:
# plot
cv_results['param_alpha'] = cv_results['param_alpha'].astype('float32')
plt.plot(cv_results['param_alpha'], cv_results['mean_train_score'])
plt.plot(cv_results['param_alpha'], cv_results['mean_test_score'])
plt.xlabel('alpha')
plt.ylabel('r2 score')
plt.xscale('log')
plt.show()

In [None]:
# model with optimal alpha
# lasso regression
lm = Lasso(alpha=10)
lm.fit(X_train[cols], y_train)

# predict
y_train_pred = lm.predict(X_train[cols])
print(r2_score(y_true=y_train, y_pred=y_train_pred))
y_test_pred = lm.predict(X_test[cols])
print(r2_score(y_true=y_test, y_pred=y_test_pred))
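
R² is unitless, so it can help to report an error in price units alongside it. The notebook imports `mean_squared_error`, and the sketch below shows how a root mean squared error could be derived from it. The prices and predictions are hypothetical numbers chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical sale prices and predictions, for illustration only.
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 185000.0])

# RMSE is the square root of the mean squared error, in the same units as the target.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.0f}, R^2: {r2:.3f}")
```

The same two calls can be applied to `y_test` and `y_test_pred` from the cell above.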

In [None]:
# lasso model parameters
model_parameters = list(lm.coef_)
model_parameters.insert(0, lm.intercept_)
model_parameters = [round(x, 3) for x in model_parameters]
# the model was fitted on the RFE-selected columns, so pair the
# coefficients with those names rather than with all of X.columns
param_names = cols.insert(0, "constant")
parameters_list = list(zip(param_names, model_parameters))

In [None]:
parameters_list = [parameter for parameter in parameters_list if parameter[1] != 0]

In [None]:
parameter_df = pd.DataFrame(parameters_list, columns=['variable', 'factor'])
parameter_df = parameter_df.sort_values(by='factor', ascending=False).set_index('variable')
parameter_df = parameter_df[parameter_df.index != 'constant']
parameter_df

### The values above identify the top predictors of house price. A `positive factor` means the predicted `Sale Price` `rises` by that amount when the variable increases by one unit, with the other variables held fixed; a `negative factor` means the predicted `Sale Price` `decreases` by that amount for the same one-unit increase.

`Note`: For a dummy variable containing an underscore,
the name before the underscore is the original categorical column and the name after it is the category value.
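
In a linear model a coefficient is an additive effect: raising a feature by one unit changes the prediction by exactly the coefficient. The sketch below demonstrates this on an exact toy relationship with made-up numbers. `area` and `price` are illustrative names, not columns from the data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Exact toy relationship: price = 50000 + 100 * area (hypothetical numbers).
area = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])
price = 50000.0 + 100.0 * area.ravel()

model = LinearRegression().fit(area, price)

# Predict for two inputs that differ by exactly one unit of area.
p1, p2 = model.predict([[1200.0], [1201.0]])

print(model.coef_[0], p2 - p1)
```

The difference between the two predictions equals the fitted coefficient, whatever the starting area.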

#### Creating test train data

In [None]:
y = house_df.loc[:, 'SalePrice']
X = house_df.loc[:, house_df.columns != 'SalePrice']

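# scale
# the scaler is only fitted here; X itself is not transformed, so the
# models below are trained on the original, unscaled feature values
scaler = StandardScaler()
scaler.fit(X)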

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2, 
                                                    random_state = 1)

In [None]:
lm = LinearRegression()
lm.fit(X_train, y_train)

# keep the 100 highest-ranked features (the count is passed by keyword)
rfe = RFE(lm, n_features_to_select=100)

rfe = rfe.fit(X_train, y_train)

In [None]:
# Columns filtered by RFE
cols = X_train.columns[rfe.support_]

#### Training model using Ridge

In [None]:
# grid search CV

# set up cross validation scheme
folds = KFold(n_splits = 5, shuffle = True, random_state = 4)

# specify range of hyperparameters
params = {'alpha': [0.001, 0.01, 1.0, 5.0, 10.0, 100.0, 500.0, 1000.0]}

# grid search
# ridge model
model = Ridge()
model_cv = GridSearchCV(estimator = model, param_grid = params, 
                        scoring= 'r2', 
                        cv = folds, 
                        return_train_score=True, verbose = 1)            
model_cv.fit(X_train[cols], y_train) 

In [None]:
print(model_cv.best_params_)
print(model_cv.best_score_)

In [None]:
cv_results = pd.DataFrame(model_cv.cv_results_)
cv_results.T

##### In both the train and test scores, the variation appears to be smallest when alpha is 100

In [None]:
# plot
cv_results['param_alpha'] = cv_results['param_alpha'].astype('float32')
plt.plot(cv_results['param_alpha'], cv_results['mean_train_score'])
plt.plot(cv_results['param_alpha'], cv_results['mean_test_score'])
plt.xlabel('alpha')
plt.ylabel('r2 score')
plt.xscale('log')
plt.show()

In [None]:
# model with optimal alpha
# ridge regression
lm = Ridge(alpha=100)
lm.fit(X_train[cols], y_train)

# predict
y_train_pred = lm.predict(X_train[cols])
print(r2_score(y_true=y_train, y_pred=y_train_pred))
y_test_pred = lm.predict(X_test[cols])
print(r2_score(y_true=y_test, y_pred=y_test_pred))

In [None]:
# ridge model parameters
model_parameters = list(lm.coef_)
model_parameters.insert(0, lm.intercept_)
model_parameters = [round(x, 3) for x in model_parameters]
# the model was fitted on the RFE-selected columns, so pair the
# coefficients with those names rather than with all of X.columns
param_names = cols.insert(0, "constant")
parameters_list = list(zip(param_names, model_parameters))

In [None]:
parameters_list = [parameter for parameter in parameters_list if parameter[1] != 0]

In [None]:
parameter_df = pd.DataFrame(parameters_list, columns=['variable', 'factor'])
parameter_df = parameter_df.sort_values(by='factor', ascending=False).set_index('variable')
parameter_df = parameter_df[parameter_df.index != 'constant']
parameter_df
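
Filtering on non-zero coefficients affects the two models differently. The lasso penalty can set coefficients exactly to zero, while ridge only shrinks them towards zero. The sketch below illustrates this on synthetic data where only two of ten features matter. The data, the alpha values and the variable names are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
# Only the first two columns matter in this synthetic target.
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(n_zero_lasso, n_zero_ridge)
```

This is why the non-zero filter acts as feature selection for the lasso table but typically leaves the ridge table unchanged.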

### The values above identify the top predictors of house price. A `positive factor` means the predicted `Sale Price` `rises` by that amount when the variable increases by one unit, with the other variables held fixed; a `negative factor` means the predicted `Sale Price` `decreases` by that amount for the same one-unit increase.

`Note`: For a dummy variable containing an underscore,
the name before the underscore is the original categorical column and the name after it is the category value.