# Inherited House Price Prediction

## Objectives

* Use the Inherited model to predict the prices of inherited properties

## Inputs

* Trained and validated ML model, inherited houses data file

## Outputs

* outputs/datasets/predict_sale_price/predicted_prices_for_inherited_houses.csv

---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from the current folder to its parent folder
* os.getcwd() returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/pp5_project_heritage_housing/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/pp5_project_heritage_housing'
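
Re-running the directory-change cell moves up one more level each time it is executed. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the path shown earlier; `move_to_project_root` is a hypothetical helper name, not part of the original notebook:

```python
import os


def move_to_project_root(notebooks_dirname="jupyter_notebooks"):
    """Move to the parent folder only while still inside the notebooks folder."""
    cwd = os.getcwd()
    # Only change directory if the current folder is the notebooks subfolder,
    # so running this twice does not climb above the project root.
    if os.path.basename(cwd) == notebooks_dirname:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, because the project root is not named `jupyter_notebooks`.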

Import the cleaned dataset and split it into train and test sets

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

# read data
df = pd.read_csv("outputs/datasets/collection/house_price_records.csv") 

# split data 80/20 (train/test)
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['SalePrice'], axis=1),
    df['SalePrice'],
    test_size=0.2,
    random_state=0,
)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)


* Train set: (1168, 23) (1168,) 
* Test set: (292, 23) (292,)
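
The printed shapes can be checked by hand. When `test_size` is a float, scikit-learn rounds the test-set count up and assigns the remaining rows to the train set. A small sketch of that arithmetic, where the total of 1460 rows is inferred from the shapes printed above (1168 + 292):

```python
import math

n_samples = 1460  # inferred from the printed shapes: 1168 + 292
test_size = 0.2

# Test count is rounded up; the train set receives the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # prints: 1168 292
```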


Import the pipeline components and define the pipeline

In [5]:
from sklearn.pipeline import Pipeline

### Data Cleaning
from feature_engine.imputation import MeanMedianImputer
from feature_engine.selection import DropFeatures
from feature_engine.imputation import CategoricalImputer
from feature_engine.imputation import RandomSampleImputer

### Feature Engineering
from feature_engine.encoding import OrdinalEncoder
from feature_engine.outliers import Winsorizer

### Feature Scaling
from sklearn.preprocessing import StandardScaler

### Feature Selection
from sklearn.feature_selection import SelectFromModel

### ML Algorithm
from sklearn.ensemble import GradientBoostingRegressor

### Create pipeline
ml_pipeline = Pipeline([
    # Data cleaning (copied from Data Cleaning notebook)
    ( 'drop',  DropFeatures(features_to_drop=['EnclosedPorch', 'GarageYrBlt', 'WoodDeckSF'])),
    ( 'categorical',  CategoricalImputer(imputation_method='missing',
                                     fill_value='None',
                                     variables=['GarageFinish'])),
    ( 'random_sample',  RandomSampleImputer(
                                     variables=['LotFrontage' ,
                                     'BsmtFinType1','2ndFlrSF','MasVnrArea'])),
    ( 'mean',  MeanMedianImputer(imputation_method='mean',
                                     variables=['BedroomAbvGr']) ),

    # Feature engineering (copied from Feature Engineering notebook)
    ( 'OrdinalCategoricalEncoder', OrdinalEncoder(encoding_method='arbitrary',
                                                variables = ['BsmtExposure',
                                                            'BsmtFinType1',
                                                            'GarageFinish',
                                                            'KitchenQual'])),
    ("Winsoriser_iqr",Winsorizer(capping_method='iqr', fold=3, tail='both', 
                                                  variables=['1stFlrSF',
                                                            'GarageArea',
                                                            'GrLivArea',
                                                            'YearBuilt',
                                                            'TotalBsmtSF',])),
    ("feat_scaling", StandardScaler()),
    ("feat_selection",  SelectFromModel(GradientBoostingRegressor(
                                        random_state=0,
                                        learning_rate=0.1,
                                        max_depth=3,
                                        min_samples_leaf=50,
                                        min_samples_split=2,
                                        n_estimators=140), threshold="0.75*mean")),
    ("model", GradientBoostingRegressor(random_state=0,
                                        learning_rate=0.1,
                                        max_depth=3,
                                        min_samples_leaf=50,
                                        min_samples_split=2,
                                        n_estimators=140)),
    ])
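
The `Winsorizer` step with `capping_method='iqr'`, `fold=3` and `tail='both'` caps each listed variable at Q1 - 3·IQR and Q3 + 3·IQR. The sketch below reproduces that rule with the standard library only. The sample values are made up, `iqr_caps` and `cap` are hypothetical helper names, and `statistics.quantiles` with the `inclusive` method stands in for the library's own quantile calculation:

```python
import statistics


def iqr_caps(values, fold=3):
    """Return (lower, upper) capping limits based on the inter-quartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


def cap(values, fold=3):
    """Clip every value to the IQR-based limits (both tails)."""
    lower, upper = iqr_caps(values, fold)
    return [min(max(v, lower), upper) for v in values]


# Made-up floor areas with one extreme value
sample = [800, 900, 1000, 1100, 1200, 5000]
print(iqr_caps(sample))  # (175.0, 1925.0)
print(cap(sample))       # the 5000 entry is capped at 1925.0
```

With `fold=3` only very extreme values are capped; a smaller fold would pull the limits closer to the quartiles.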

In [6]:
# fit pipeline
ml_pipeline.fit(X_train, y_train)

Pipeline(steps=[('drop',
                 DropFeatures(features_to_drop=['EnclosedPorch', 'GarageYrBlt',
                                                'WoodDeckSF'])),
                ('categorical',
                 CategoricalImputer(fill_value='None',
                                    variables=['GarageFinish'])),
                ('random_sample',
                 RandomSampleImputer(variables=['LotFrontage', 'BsmtFinType1',
                                                '2ndFlrSF', 'MasVnrArea'])),
                ('mean',
                 MeanMedianImputer(imputation_method='mean',
                                   variables=['B...
                 Winsorizer(capping_method='iqr', tail='both',
                            variables=['1stFlrSF', 'GarageArea', 'GrLivArea',
                                       'YearBuilt', 'TotalBsmtSF'])),
                ('feat_scaling', StandardScaler()),
                ('feat_selection',
                 SelectFromModel(estimator=Gradient
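
The `feat_selection` step uses `threshold="0.75*mean"`, which keeps a feature when its importance is at least 0.75 times the mean importance across all features. A minimal sketch of that rule follows; the importance values are invented for illustration and are not taken from the fitted model:

```python
# Hypothetical importances (illustrative values only, not model output)
importances = {
    "OverallQual": 0.50,
    "GrLivArea": 0.25,
    "TotalBsmtSF": 0.17,
    "GarageArea": 0.05,
    "YearBuilt": 0.03,
}

mean_importance = sum(importances.values()) / len(importances)
threshold = 0.75 * mean_importance

# Keep features whose importance reaches the threshold
selected = [name for name, value in importances.items() if value >= threshold]
print(selected)
```

In this made-up example the mean importance is about 0.2, so the threshold is about 0.15 and the two weakest features are dropped.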

In [7]:
import numpy as np

print('Training set R² score:', np.round(ml_pipeline.score(X_train, y_train), 4))
print('Test set R² score:', np.round(ml_pipeline.score(X_test, y_test), 4))

Training set R² score: 0.8739
Test set R² score: 0.7913


The training score (R², which is what score() returns for a regression pipeline) matches the estimate from the Modelling and Evaluation notebook, so we can move forward with prediction.
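
For a regressor, `score()` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, rather than a classification accuracy. A small hand-rolled sketch of that formula, using made-up prices and a hypothetical helper name `r_squared`:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 390]
print(round(r_squared(y_true, y_pred), 3))  # 0.992
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean sale price.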

## Inherited house price prediction

In [8]:
# load raw dataset
X_inherited_house_data = pd.read_csv("outputs/datasets/collection/inherited_houses.csv")

# predict house prices
y_predicted_price = ml_pipeline.predict(X_inherited_house_data)

# round to whole numbers and append as a new column
X_inherited_house_data['Predicted Sale Price'] = y_predicted_price.round(decimals=0)

X_inherited_house_data

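| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | Predicted Sale Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 896 | 0 | 2 | No | 468.0 | Rec | 270.0 | 0 | 730.0 | Unf | ... | 80.0 | 0.0 | 0 | 6 | 5 | 882.0 | 140 | 1961 | 1961 | 129451.0 |
| 1 | 1329 | 0 | 3 | No | 923.0 | ALQ | 406.0 | 0 | 312.0 | Unf | ... | 81.0 | 108.0 | 36 | 6 | 6 | 1329.0 | 393 | 1958 | 1958 | 157565.0 |
| 2 | 928 | 701 | 3 | No | 791.0 | GLQ | 137.0 | 0 | 482.0 | Fin | ... | 74.0 | 0.0 | 34 | 5 | 5 | 928.0 | 212 | 1997 | 1998 | 166748.0 |
| 3 | 926 | 678 | 3 | No | 602.0 | GLQ | 324.0 | 0 | 470.0 | Fin | ... | 78.0 | 20.0 | 36 | 6 | 6 | 926.0 | 360 | 1998 | 1998 | 177049.0 |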


In [9]:
X_inherited_house_data.to_csv("outputs/datasets/predict_sale_price/predicted_prices_for_inherited_houses.csv")
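
`to_csv` does not create missing folders, so the save fails if `outputs/datasets/predict_sale_price` does not exist yet. A minimal sketch of a guard, where `ensure_parent_dir` is a hypothetical helper name and `index=False` in the commented usage line is an optional choice that avoids writing an extra index column:

```python
import os


def ensure_parent_dir(file_path):
    """Create the parent folder of file_path if it is missing, then return the path."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return file_path


output_path = "outputs/datasets/predict_sale_price/predicted_prices_for_inherited_houses.csv"
# Usage in the notebook (assumes X_inherited_house_data from the cell above):
# X_inherited_house_data.to_csv(ensure_parent_dir(output_path), index=False)
```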

# Conclusion