<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2A: Problem Statement, Initial Data Cleaning and Dummification

## Problem Statement

This project uses a data science approach to identify the areas that contribute to high transacted prices and the areas where the highest transaction volume occurs, so as to help the realtors of Skywalker Property Advisors gain a competitive advantage in the Ames housing market.

### Contents:
- [Background](#Background)
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
    - [Import Libraries](#Import-Libraries)
    - [Import CSV files](#Import-CSV-files)
    - [Check and clean up dataframes](#Check-and-clean-up-dataframes)
- [Export cleaned dataframe into CSV file](#Export-cleaned-dataframe-into-CSV-file)
- [Import cleaned CSV file](#Import-cleaned-CSV-file)
- [Dummify columns](#Dummify-columns)
- [Export dummified dataframe into CSV file](#Export-dummified-dataframe-into-CSV-file)

## Background

The city of Ames is located in the state of Iowa in the U.S.

The housing market in Ames is stable: around 400 units were sold each year from 2006 to 2009, even though the US went through the subprime financial crisis during that period. As of July 2010, around 200 properties had been sold, which is on track to meet the calculated average of around 400 units sold per year.

Skywalker Property Advisors is one of the many property advisors in Ames.

As data scientists contracted by Skywalker Property Advisors, the 5am Club Data Science Agency has been tasked with analysing the data of properties sold between 2006 and 2010 (up to July). The analysis and observations will be used to give recommendations and advice to the realtors of Skywalker Property Advisors, to help them improve their sales and gain a competitive advantage in the Ames housing market in the current year of 2010.

## Data Import and Cleaning

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score

import scipy.stats as stats

### Import CSV files

In [2]:
train = pd.read_csv('../data/train.csv')

train.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,...,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,...,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,...,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,138500


### Check and clean up dataframes

In [3]:
#change all column names to lower case and replace ' ' with '_'
train.columns = train.columns.str.lower().str.replace(' ', '_')

train.head()

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,...,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,...,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,...,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,138500


In [4]:
#checking number of rows and columns for dataframes
print('train dataframe (rows, columns) =', train.shape)

train dataframe (rows, columns) = (2051, 81)


In [5]:
#set max viewable rows so displays will not be truncated
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 250)
np.set_printoptions(threshold=np.inf)

#checking for NULL values for train dataframe
train.isnull().sum()

id                    0
pid                   0
ms_subclass           0
ms_zoning             0
lot_frontage        330
lot_area              0
street                0
alley              1911
lot_shape             0
land_contour          0
utilities             0
lot_config            0
land_slope            0
neighborhood          0
condition_1           0
condition_2           0
bldg_type             0
house_style           0
overall_qual          0
overall_cond          0
year_built            0
year_remod/add        0
roof_style            0
roof_matl             0
exterior_1st          0
exterior_2nd          0
mas_vnr_type         22
mas_vnr_area         22
exter_qual            0
exter_cond            0
foundation            0
bsmt_qual            55
bsmt_cond            55
bsmt_exposure        58
bsmtfin_type_1       55
bsmtfin_sf_1          1
bsmtfin_type_2       56
bsmtfin_sf_2          1
bsmt_unf_sf           1
total_bsmt_sf         1
heating               0
heating_qc      
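Because the full null-count listing is long, a filtered view can be easier to scan. The sketch below is a minimal illustration on a toy frame; `demo` and its values are hypothetical stand-ins for `train`, and the same two lines could be applied to `train` directly.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (hypothetical, for illustration only)
demo = pd.DataFrame({
    'lot_frontage': [np.nan, 43.0, 68.0],
    'alley': [np.nan, np.nan, 'Pave'],
    'lot_area': [13517, 11492, 7922],
})

# Count nulls per column, keep only columns that have any, largest count first
null_counts = demo.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```

Columns without missing values drop out of the listing, which leaves only the columns that need a cleaning decision.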

In [6]:
#check datatypes of each column in train dataframe
train.dtypes

id                   int64
pid                  int64
ms_subclass          int64
ms_zoning           object
lot_frontage       float64
lot_area             int64
street              object
alley               object
lot_shape           object
land_contour        object
utilities           object
lot_config          object
land_slope          object
neighborhood        object
condition_1         object
condition_2         object
bldg_type           object
house_style         object
overall_qual         int64
overall_cond         int64
year_built           int64
year_remod/add       int64
roof_style          object
roof_matl           object
exterior_1st        object
exterior_2nd        object
mas_vnr_type        object
mas_vnr_area       float64
exter_qual          object
exter_cond          object
foundation          object
bsmt_qual           object
bsmt_cond           object
bsmt_exposure       object
bsmtfin_type_1      object
bsmtfin_sf_1       float64
bsmtfin_type_2      object
b

In [7]:
train.head()

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,BrkFace,289.0,Gd,TA,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,BrkFace,132.0,Gd,TA,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,TA,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,Gd,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,,0.0,TA,TA,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,,,,0,3,2010,WD,138500


#### Filling in the NA or empty cells with either 0 or 'no'

We assume that an NA or empty cell means the house has no such feature, so numerical cells are changed to 0 and categorical cells are changed to 'no' to indicate this.

In [8]:
#changing nan or empty cells to 0 for all relevant columns
train[['lot_frontage', 
       'mas_vnr_area', 
       'bsmtfin_sf_1', 
       'bsmtfin_sf_2', 
       'bsmt_unf_sf', 
       'total_bsmt_sf', 
       'bsmt_full_bath', 
       'bsmt_half_bath', 
       'garage_cars', 
       'garage_area']] = train[['lot_frontage', 
                                'mas_vnr_area', 
                                'bsmtfin_sf_1', 
                                'bsmtfin_sf_2', 
                                'bsmt_unf_sf', 
                                'total_bsmt_sf', 
                                'bsmt_full_bath', 
                                'bsmt_half_bath', 
                                'garage_cars', 
                                'garage_area']].fillna(0)


In [9]:
#changing nan or empty cells to no for all relevant columns
train[['alley', 
       'mas_vnr_type', 
       'bsmt_qual', 
       'bsmt_cond', 
       'bsmt_exposure', 
       'bsmtfin_type_1', 
       'bsmtfin_type_2', 
       'fireplace_qu', 
       'garage_type', 
       'garage_finish', 
       'garage_qual', 
       'garage_cond', 
       'pool_qc', 
       'fence', 
       'misc_feature']] = train[['alley', 
                                 'mas_vnr_type', 
                                 'bsmt_qual', 
                                 'bsmt_cond', 
                                 'bsmt_exposure', 
                                 'bsmtfin_type_1', 
                                 'bsmtfin_type_2', 
                                 'fireplace_qu', 
                                 'garage_type', 
                                 'garage_finish', 
                                 'garage_qual', 
                                 'garage_cond', 
                                 'pool_qc', 
                                 'fence', 
                                 'misc_feature']].fillna('no')
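The two fill steps above can also be written with the column lists held in variables, which avoids repeating each list and makes it easy to confirm that nothing was missed. This is only a sketch on a toy frame; `demo`, `numeric_cols` and `categorical_cols` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (hypothetical, for illustration only)
demo = pd.DataFrame({
    'garage_area': [475.0, np.nan],
    'garage_type': ['Attchd', np.nan],
})

numeric_cols = ['garage_area']      # missing is read as "none of this feature" -> 0
categorical_cols = ['garage_type']  # missing is read as "feature absent" -> 'no'

demo[numeric_cols] = demo[numeric_cols].fillna(0)
demo[categorical_cols] = demo[categorical_cols].fillna('no')

# Confirm that no nulls remain in the columns that were filled
remaining = int(demo[numeric_cols + categorical_cols].isnull().sum().sum())
print('nulls remaining:', remaining)
```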


#### Checking the changed columns to ensure the values and datatypes are changed correctly and no duplicate values

##### lot_frontage column

In [10]:
#change nan values to 0
print(np.unique(train['lot_frontage']))

train['lot_frontage'].dtypes

[  0.  21.  22.  24.  25.  26.  30.  32.  33.  34.  35.  36.  37.  38.
  39.  40.  41.  42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.
  53.  54.  55.  56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.
  67.  68.  69.  70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.
  81.  82.  83.  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.
  95.  96.  97.  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108.
 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122.
 123. 124. 125. 128. 129. 130. 134. 135. 137. 138. 140. 141. 144. 150.
 153. 155. 160. 174. 195. 200. 313.]


dtype('float64')

##### alley column

In [11]:
#change NA values to 'no'
print(np.unique(train['alley']))

train['alley'].dtypes

['Grvl' 'Pave' 'no']


dtype('O')

##### mas_vnr_area column

In [12]:
#changed empty cells to 0
print(np.unique(train['mas_vnr_area']))

train['mas_vnr_area'].dtypes

[0.000e+00 1.000e+00 3.000e+00 1.400e+01 1.600e+01 1.800e+01 2.000e+01
 2.200e+01 2.300e+01 2.400e+01 2.700e+01 2.800e+01 3.000e+01 3.100e+01
 3.200e+01 3.600e+01 3.800e+01 3.900e+01 4.000e+01 4.100e+01 4.200e+01
 4.400e+01 4.500e+01 4.600e+01 4.700e+01 5.000e+01 5.100e+01 5.200e+01
 5.400e+01 5.600e+01 5.700e+01 5.800e+01 6.000e+01 6.200e+01 6.300e+01
 6.400e+01 6.500e+01 6.600e+01 6.700e+01 6.800e+01 6.900e+01 7.000e+01
 7.200e+01 7.400e+01 7.500e+01 7.600e+01 8.000e+01 8.200e+01 8.400e+01
 8.500e+01 8.600e+01 8.700e+01 8.800e+01 8.900e+01 9.000e+01 9.200e+01
 9.400e+01 9.500e+01 9.600e+01 9.700e+01 9.800e+01 9.900e+01 1.000e+02
 1.010e+02 1.020e+02 1.040e+02 1.050e+02 1.060e+02 1.080e+02 1.090e+02
 1.100e+02 1.120e+02 1.130e+02 1.140e+02 1.150e+02 1.160e+02 1.170e+02
 1.180e+02 1.190e+02 1.200e+02 1.210e+02 1.220e+02 1.230e+02 1.240e+02
 1.250e+02 1.260e+02 1.270e+02 1.280e+02 1.300e+02 1.320e+02 1.340e+02
 1.350e+02 1.360e+02 1.380e+02 1.400e+02 1.420e+02 1.430e+02 1.440e+02
 1.450

dtype('float64')

##### bsmt_qual, bsmt_cond, bsmtfin_type_1, bsmtfin_type_2 columns

In [13]:
#changed empty cells to 'no'

print(np.unique(train['bsmt_qual']))
print(np.unique(train['bsmt_cond']))
print(np.unique(train['bsmtfin_type_1']))
print(np.unique(train['bsmtfin_type_2']))

train[['bsmt_qual', 'bsmt_cond', 'bsmtfin_type_1', 'bsmtfin_type_2']].dtypes

['Ex' 'Fa' 'Gd' 'Po' 'TA' 'no']
['Ex' 'Fa' 'Gd' 'Po' 'TA' 'no']
['ALQ' 'BLQ' 'GLQ' 'LwQ' 'Rec' 'Unf' 'no']
['ALQ' 'BLQ' 'GLQ' 'LwQ' 'Rec' 'Unf' 'no']


bsmt_qual         object
bsmt_cond         object
bsmtfin_type_1    object
bsmtfin_type_2    object
dtype: object

##### bsmtfin_sf_1, bsmtfin_sf_2, bsmt_unf_sf, total_bsmt_sf columns

In [14]:
#changed empty cells to 0
print('bsmtfin_sf_1\n', np.unique(train['bsmtfin_sf_1']))
print('\nbsmtfin_sf_2\n', np.unique(train['bsmtfin_sf_2']))
print('\nbsmt_unf_sf\n', np.unique(train['bsmt_unf_sf']))
print('\ntotal_bsmt_sf\n', np.unique(train['total_bsmt_sf']))

train[['bsmtfin_sf_1', 'bsmtfin_sf_2', 'bsmt_unf_sf', 'total_bsmt_sf']].dtypes

bsmtfin_sf_1
 [0.000e+00 2.000e+00 1.600e+01 2.000e+01 2.400e+01 2.500e+01 2.700e+01
 2.800e+01 3.200e+01 3.500e+01 3.600e+01 4.000e+01 4.100e+01 4.200e+01
 4.800e+01 5.000e+01 5.100e+01 5.200e+01 5.400e+01 5.500e+01 5.600e+01
 5.700e+01 6.000e+01 6.300e+01 6.400e+01 6.500e+01 6.800e+01 7.000e+01
 7.200e+01 7.600e+01 7.800e+01 8.000e+01 8.100e+01 8.500e+01 8.800e+01
 9.400e+01 9.600e+01 1.000e+02 1.040e+02 1.080e+02 1.100e+02 1.110e+02
 1.130e+02 1.140e+02 1.160e+02 1.190e+02 1.200e+02 1.210e+02 1.260e+02
 1.280e+02 1.290e+02 1.300e+02 1.310e+02 1.320e+02 1.330e+02 1.340e+02
 1.380e+02 1.400e+02 1.410e+02 1.430e+02 1.440e+02 1.490e+02 1.500e+02
 1.520e+02 1.550e+02 1.560e+02 1.620e+02 1.670e+02 1.680e+02 1.700e+02
 1.720e+02 1.730e+02 1.760e+02 1.790e+02 1.800e+02 1.810e+02 1.820e+02
 1.860e+02 1.870e+02 1.890e+02 1.900e+02 1.910e+02 1.920e+02 1.930e+02
 1.940e+02 1.960e+02 1.980e+02 2.000e+02 2.010e+02 2.030e+02 2.050e+02
 2.060e+02 2.070e+02 2.080e+02 2.090e+02 2.100e+02 2.130e+02 2.

bsmtfin_sf_1     float64
bsmtfin_sf_2     float64
bsmt_unf_sf      float64
total_bsmt_sf    float64
dtype: object

##### bsmt_full_bath, bsmt_half_bath columns

In [15]:
#changed empty cells to 0
print(np.unique(train['bsmt_full_bath']))
print(np.unique(train['bsmt_half_bath']))

train[['bsmt_full_bath', 'bsmt_half_bath']].dtypes

[0. 1. 2. 3.]
[0. 1. 2.]


bsmt_full_bath    float64
bsmt_half_bath    float64
dtype: object

##### fireplace_qu column

In [16]:
#change NA to 'no'
print(np.unique(train['fireplace_qu']))

train['fireplace_qu'].dtypes

['Ex' 'Fa' 'Gd' 'Po' 'TA' 'no']


dtype('O')

##### garage_type, garage_finish, garage_qual, garage_cond columns

In [17]:
#change NA and empty cells to 'no'
print(np.unique(train['garage_type']))
print(np.unique(train['garage_finish']))
print(np.unique(train['garage_qual']))
print(np.unique(train['garage_cond']))

train[['garage_type', 'garage_finish', 'garage_qual', 'garage_cond']].dtypes

['2Types' 'Attchd' 'Basment' 'BuiltIn' 'CarPort' 'Detchd' 'no']
['Fin' 'RFn' 'Unf' 'no']
['Ex' 'Fa' 'Gd' 'Po' 'TA' 'no']
['Ex' 'Fa' 'Gd' 'Po' 'TA' 'no']


garage_type      object
garage_finish    object
garage_qual      object
garage_cond      object
dtype: object

##### garage_cars, garage_area columns

In [18]:
#changed empty cells to 0
print(np.unique(train['garage_cars']))
print(np.unique(train['garage_area']))
                                              
train[['garage_cars', 'garage_area']].dtypes

[0. 1. 2. 3. 4. 5.]
[   0.  100.  160.  162.  164.  180.  185.  195.  198.  200.  205.  207.
  208.  209.  210.  213.  215.  216.  217.  220.  224.  225.  226.  228.
  230.  231.  234.  240.  242.  246.  248.  249.  250.  252.  253.  254.
  255.  256.  257.  260.  261.  263.  264.  265.  267.  270.  271.  273.
  275.  276.  280.  281.  282.  283.  284.  286.  288.  292.  293.  294.
  295.  296.  297.  299.  300.  301.  303.  304.  305.  306.  308.  309.
  310.  311.  312.  313.  315.  316.  317.  319.  320.  322.  323.  324.
  325.  326.  327.  330.  331.  336.  338.  342.  343.  349.  350.  351.
  352.  353.  356.  357.  358.  360.  363.  364.  366.  368.  370.  371.
  372.  373.  375.  378.  379.  380.  384.  386.  388.  389.  390.  392.
  393.  394.  396.  397.  398.  399.  400.  401.  402.  403.  404.  405.
  406.  408.  410.  412.  416.  418.  420.  422.  423.  427.  428.  429.
  430.  431.  432.  433.  434.  435.  436.  437.  438.  439.  440.  441.
  442.  444.  447.  449.  450. 

garage_cars    float64
garage_area    float64
dtype: object

##### pool_qc, fence, misc_feature columns

In [19]:
#change the NA cells to 'no'
print(np.unique(train['pool_qc']))
print(np.unique(train['fence']))
print(np.unique(train['misc_feature']))

train[['pool_qc', 'fence', 'misc_feature']].dtypes

['Ex' 'Fa' 'Gd' 'TA' 'no']
['GdPrv' 'GdWo' 'MnPrv' 'MnWw' 'no']
['Elev' 'Gar2' 'Othr' 'Shed' 'TenC' 'no']


pool_qc         object
fence           object
misc_feature    object
dtype: object

#### Modify remaining columns for standardisation and switching to binary

##### mas_vnr_type column

In [20]:
#change "None" to 'no' to standardise values
train['mas_vnr_type'] = train['mas_vnr_type'].replace('None', 'no')

print(np.unique(train['mas_vnr_type']))

train['mas_vnr_type'].dtypes

['BrkCmn' 'BrkFace' 'Stone' 'no']


dtype('O')

##### bsmt_exposure column

In [21]:
#bsmt_exposure change 'No' to 'NE' to indicate that there is basement but no exposure
#differentiate it from the NA cells that were changed to 'no' where it means there is no basement at all
train['bsmt_exposure'] = train['bsmt_exposure'].replace('No', 'NE')

print(np.unique(train['bsmt_exposure']))

train['bsmt_exposure'].dtypes

['Av' 'Gd' 'Mn' 'NE' 'no']


dtype('O')

##### central_air column

In [22]:
#central_air contains Y and N, change it to binary so we don't need to dummify this column
print('original central_air values: ', np.unique(train['central_air']))

original central_air values:  ['N' 'Y']


In [23]:
# Y change to 1
# N change to 0
train['central_air'] = train['central_air'].replace(['Y', 'N'], [1, 0])

print('central_air values after changing: ', np.unique(train['central_air']))
train['central_air'].dtypes

central_air values after changing:  [0 1]


dtype('int64')
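An alternative way to write the same binary encoding is `Series.map` with a dictionary. Unlike `replace`, `map` turns any label that is not in the dictionary into NaN, so unexpected values show up immediately. The sketch below uses a toy series; `demo`, `encoded` and `unmapped` are hypothetical names.

```python
import pandas as pd

# Toy series standing in for train['central_air'] (hypothetical values)
demo = pd.Series(['Y', 'N', 'Y'], name='central_air')

# Y -> 1, N -> 0; any other label would become NaN
encoded = demo.map({'Y': 1, 'N': 0})

# Number of values that were not covered by the mapping
unmapped = int(encoded.isnull().sum())
print(encoded.tolist(), 'unmapped:', unmapped)
```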

##### Check for other issues with data and edit

All other columns were checked with the np.unique() function for values that need to be changed.

Only the columns with issues are shown here, as part of documenting the cleaning process.

In [24]:
print('original values:\n', np.unique(train['ms_zoning']))

#ms_zoning column: keep only the zoning code and drop the bracketed suffix, e.g. 'A (agr)' becomes 'A'
train['ms_zoning'] = train['ms_zoning'].str.split().str[0]

print('\nvalues after changing:\n', np.unique(train['ms_zoning']))

original values:
 ['A (agr)' 'C (all)' 'FV' 'I (all)' 'RH' 'RL' 'RM']

values after changing:
 ['A' 'C' 'FV' 'I' 'RH' 'RL' 'RM']


In [25]:
print('original values:\n', np.unique(train['exterior_1st']))

#exterior_1st column: replace spaces in cell values with '_'
train['exterior_1st'] = train['exterior_1st'].str.replace(' ', '_', regex=False)

print('\nvalues after changing:\n', np.unique(train['exterior_1st']))

original values:
 ['AsbShng' 'AsphShn' 'BrkComm' 'BrkFace' 'CBlock' 'CemntBd' 'HdBoard'
 'ImStucc' 'MetalSd' 'Plywood' 'Stone' 'Stucco' 'VinylSd' 'Wd Sdng'
 'WdShing']

values after changing:
 ['AsbShng' 'AsphShn' 'BrkComm' 'BrkFace' 'CBlock' 'CemntBd' 'HdBoard'
 'ImStucc' 'MetalSd' 'Plywood' 'Stone' 'Stucco' 'VinylSd' 'WdShing'
 'Wd_Sdng']


In [26]:
print('original values:\n', np.unique(train['exterior_2nd']))

#exterior_2nd column: replace spaces in cell values with '_'
train['exterior_2nd'] = train['exterior_2nd'].str.replace(' ', '_', regex=False)

print('\nvalues after changing:\n', np.unique(train['exterior_2nd']))

original values:
 ['AsbShng' 'AsphShn' 'Brk Cmn' 'BrkFace' 'CBlock' 'CmentBd' 'HdBoard'
 'ImStucc' 'MetalSd' 'Plywood' 'Stone' 'Stucco' 'VinylSd' 'Wd Sdng'
 'Wd Shng']

values after changing:
 ['AsbShng' 'AsphShn' 'BrkFace' 'Brk_Cmn' 'CBlock' 'CmentBd' 'HdBoard'
 'ImStucc' 'MetalSd' 'Plywood' 'Stone' 'Stucco' 'VinylSd' 'Wd_Sdng'
 'Wd_Shng']


##### Drop columns:
pid <br>
gr_liv_area <br>
garage_yr_blt

In [27]:
#dropping pid column as it does not serve any purpose in the model since we will use the id to reference each house

#dropping gr_liv_area column since:   gr_liv_area  =  1st_flr_sf  +  2nd_flr_sf  +  low_qual_fin_sf
#keeping the other 3 columns, drop gr_liv_area to reduce multicollinearity

#dropping garage_yr_blt column as building a garage is considered remodelling the house
#it is included in year_remod/add

train = train.drop(columns=['pid', 'gr_liv_area', 'garage_yr_blt'])

print(train.shape)
list(train.columns)

(2051, 78)


['id',
 'ms_subclass',
 'ms_zoning',
 'lot_frontage',
 'lot_area',
 'street',
 'alley',
 'lot_shape',
 'land_contour',
 'utilities',
 'lot_config',
 'land_slope',
 'neighborhood',
 'condition_1',
 'condition_2',
 'bldg_type',
 'house_style',
 'overall_qual',
 'overall_cond',
 'year_built',
 'year_remod/add',
 'roof_style',
 'roof_matl',
 'exterior_1st',
 'exterior_2nd',
 'mas_vnr_type',
 'mas_vnr_area',
 'exter_qual',
 'exter_cond',
 'foundation',
 'bsmt_qual',
 'bsmt_cond',
 'bsmt_exposure',
 'bsmtfin_type_1',
 'bsmtfin_sf_1',
 'bsmtfin_type_2',
 'bsmtfin_sf_2',
 'bsmt_unf_sf',
 'total_bsmt_sf',
 'heating',
 'heating_qc',
 'central_air',
 'electrical',
 '1st_flr_sf',
 '2nd_flr_sf',
 'low_qual_fin_sf',
 'bsmt_full_bath',
 'bsmt_half_bath',
 'full_bath',
 'half_bath',
 'bedroom_abvgr',
 'kitchen_abvgr',
 'kitchen_qual',
 'totrms_abvgrd',
 'functional',
 'fireplaces',
 'fireplace_qu',
 'garage_type',
 'garage_finish',
 'garage_cars',
 'garage_area',
 'garage_qual',
 'garage_cond',
 'pave
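The identity used to justify dropping gr_liv_area can be checked before the drop. The sketch below runs the check on a toy frame built from the first three rows of the `train.head()` output shown earlier; `demo`, `parts` and `mismatches` are hypothetical names, and on the real data the check would have to run on `train` before the column is removed.

```python
import pandas as pd

# First three rows of the relevant columns, copied from the train.head() output
demo = pd.DataFrame({
    '1st_flr_sf': [725, 913, 1057],
    '2nd_flr_sf': [754, 1209, 0],
    'low_qual_fin_sf': [0, 0, 0],
    'gr_liv_area': [1479, 2122, 1057],
})

# Sum of the three component areas per row
parts = demo[['1st_flr_sf', '2nd_flr_sf', 'low_qual_fin_sf']].sum(axis=1)

# Number of rows where the sum differs from gr_liv_area
mismatches = int((parts != demo['gr_liv_area']).sum())
print('rows where the identity fails:', mismatches)
```

A count of zero on the full dataframe would support treating gr_liv_area as redundant with the three component columns.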

##### Changing rating system to numerical format

In [28]:
#change the columns with rating systems into numerical format to avoid dummification

#exter_qual
#exter_cond
#bsmt_qual
#bsmt_cond
#heating_qc
#kitchen_qual
#fireplace_qu
#garage_qual
#garage_cond
#pool_qc

train['exter_qual'] = train['exter_qual'].replace({
    'Ex': 5, 
    'Gd': 4, 
    'TA': 3, 
    'Fa': 2, 
    'Po': 1})

train['exter_cond'] = train['exter_cond'].replace({
    'Ex': 5, 
    'Gd': 4, 
    'TA': 3, 
    'Fa': 2, 
    'Po': 1})

train['bsmt_qual'] = train['bsmt_qual'].replace({
    'Ex': 5, 
    'Gd': 4, 
    'TA': 3, 
    'Fa': 2, 
    'Po': 1, 
    'no': 0
})

train['bsmt_cond'] = train['bsmt_cond'].replace({
    'Ex': 5, 
    'Gd': 4, 
    'TA': 3, 
    'Fa': 2, 
    'Po': 1, 
    'no': 0
})

train['heating_qc'] = train['heating_qc'].replace({
    'Ex': 5, 
    'Gd': 4, 
    'TA': 3, 
    'Fa': 2, 
    'Po': 1
})

train['kitchen_qual'] = train['kitchen_qual'].replace({
    'Ex': 5, 
    'Gd': 4, 
    'TA': 3, 
    'Fa': 2, 
    'Po': 1
})

train['fireplace_qu'] = train['fireplace_qu'].replace({
    'Ex': 5, 
    'Gd': 4, 
    'TA': 3, 
    'Fa': 2, 
    'Po': 1, 
    'no': 0
})

train['garage_qual'] = train['garage_qual'].replace({
    'Ex': 5, 
    'Gd': 4, 
    'TA': 3, 
    'Fa': 2, 
    'Po': 1, 
    'no': 0
})

train['garage_cond'] = train['garage_cond'].replace({
    'Ex': 5, 
    'Gd': 4, 
    'TA': 3, 
    'Fa': 2, 
    'Po': 1, 
    'no': 0
})

train['pool_qc'] = train['pool_qc'].replace({
    'Ex': 5, 
    'Gd': 4, 
    'TA': 3, 
    'Fa': 2,
    'no': 0
})

In [29]:
#show the changes
print(np.unique(train['exter_qual']))
print(np.unique(train['exter_cond']))
print(np.unique(train['bsmt_qual']))
print(np.unique(train['bsmt_cond']))
print(np.unique(train['heating_qc']))
print(np.unique(train['kitchen_qual']))
print(np.unique(train['fireplace_qu']))
print(np.unique(train['garage_qual']))
print(np.unique(train['garage_cond']))
print(np.unique(train['pool_qc']))

[2 3 4 5]
[1 2 3 4 5]
[0 1 2 3 4 5]
[0 1 2 3 4 5]
[1 2 3 4 5]
[2 3 4 5]
[0 1 2 3 4 5]
[0 1 2 3 4 5]
[0 1 2 3 4 5]
[0 2 3 4 5]


In [30]:
#change the columns with rating systems into numerical format to avoid dummification

#lot_shape
#bsmt_exposure
#utilities
#bsmtfin_type_1
#bsmtfin_type_2
#functional
#garage_finish
#paved_drive

train['lot_shape'] = train['lot_shape'].replace({
    'Reg': 4, 
    'IR1': 3, 
    'IR2': 2, 
    'IR3': 1
})

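train['bsmt_exposure'] = train['bsmt_exposure'].replace({
    'Gd': 3,   #good exposure ranks above average exposure
    'Av': 2, 
    'Mn': 1, 
    'NE': 0,   #basement present but no exposure
    'no': 0    #no basement; shares the value 0 with 'NE'
})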

train['utilities'] = train['utilities'].replace({
    'AllPub': 4, 
    'NoSewr': 3, 
    'NoSeWa': 2, 
    'ELO': 1
})

train['bsmtfin_type_1'] = train['bsmtfin_type_1'].replace({
    'GLQ': 6, 
    'ALQ': 5, 
    'BLQ': 4, 
    'Rec': 3,
    'LwQ': 2, 
    'Unf': 1, 
    'no': 0
})

train['bsmtfin_type_2'] = train['bsmtfin_type_2'].replace({
    'GLQ': 6, 
    'ALQ': 5, 
    'BLQ': 4, 
    'Rec': 3,
    'LwQ': 2, 
    'Unf': 1, 
    'no': 0
})

train['functional'] = train['functional'].replace({
    'Typ': 8, 
    'Min1': 7, 
    'Min2': 6, 
    'Mod': 5, 
    'Maj1': 4,
    'Maj2': 3, 
    'Sev': 2, 
    'Sal': 1
})

train['garage_finish'] = train['garage_finish'].replace({
    'Fin': 3, 
    'RFn': 2, 
    'Unf': 1, 
    'no': 0
})

train['paved_drive'] = train['paved_drive'].replace({
    'Y': 2, 
    'P': 1, 
    'N': 0
})

In [31]:
#show the changes
print(np.unique(train['lot_shape']))
print(np.unique(train['bsmt_exposure']))
print(np.unique(train['utilities']))
print(np.unique(train['bsmtfin_type_1']))
print(np.unique(train['bsmtfin_type_2']))
print(np.unique(train['functional']))
print(np.unique(train['garage_finish']))
print(np.unique(train['paved_drive']))

[1 2 3 4]
[0 1 2 3]
[2 3 4]
[0 1 2 3 4 5 6]
[0 1 2 3 4 5 6]
[1 2 3 4 5 6 7 8]
[0 1 2 3]
[0 1 2]


## Export cleaned dataframe into CSV file

In [32]:
train.to_csv('../data/train_cleaned.csv', index=False)

## Import cleaned CSV file 

In [33]:
train_clean = pd.read_csv('../data/train_cleaned.csv')

train_clean.head()

Unnamed: 0,id,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,60,RL,0.0,13517,Pave,no,3,Lvl,4,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,BrkFace,289.0,4,3,CBlock,3,3,0,6,533.0,1,0.0,192.0,725.0,GasA,5,1,SBrkr,725,754,0,0.0,0.0,2,1,3,1,4,6,8,0,0,Attchd,2,2.0,475.0,3,3,2,0,44,0,0,0,0,0,no,no,0,3,2010,WD,130500
1,544,60,RL,43.0,11492,Pave,no,3,Lvl,4,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,BrkFace,132.0,4,3,PConc,4,3,0,6,637.0,1,0.0,276.0,913.0,GasA,5,1,SBrkr,913,1209,0,1.0,0.0,2,1,4,1,4,8,8,1,3,Attchd,2,2.0,559.0,3,3,2,0,74,0,0,0,0,0,no,no,0,4,2009,WD,220000
2,153,20,RL,68.0,7922,Pave,no,4,Lvl,4,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,no,0.0,3,4,CBlock,3,3,0,6,731.0,1,0.0,326.0,1057.0,GasA,3,1,SBrkr,1057,0,0,1.0,0.0,1,0,3,1,4,5,8,0,0,Detchd,1,1.0,246.0,3,3,2,0,52,0,0,0,0,0,no,no,0,1,2010,WD,109000
3,318,60,RL,73.0,9802,Pave,no,4,Lvl,4,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,no,0.0,3,3,PConc,4,3,0,1,0.0,1,0.0,384.0,384.0,GasA,4,1,SBrkr,744,700,0,0.0,0.0,2,1,3,1,3,7,8,0,0,BuiltIn,3,2.0,400.0,3,3,2,100,0,0,0,0,0,0,no,no,0,4,2010,WD,174000
4,255,50,RL,82.0,14235,Pave,no,3,Lvl,4,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd_Sdng,Plywood,no,0.0,3,3,PConc,2,4,0,1,0.0,1,0.0,676.0,676.0,GasA,3,1,SBrkr,831,614,0,0.0,0.0,2,0,3,1,3,6,8,0,0,Detchd,1,2.0,484.0,3,3,0,0,59,0,0,0,0,0,no,no,0,3,2010,WD,138500
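The 'no' placeholder survives the export and re-import, as the output above shows. This matters because `pd.read_csv` treats certain strings, 'NA' among them, as missing values by default. The sketch below illustrates the difference with an in-memory buffer instead of a file; `demo`, `buffer` and `reloaded` are hypothetical names.

```python
import io
import pandas as pd

# 'no' is kept as text on re-import, while 'NA' is parsed as a missing value
demo = pd.DataFrame({'alley': ['no', 'Pave'], 'fence': ['NA', 'MnPrv']})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded)
print(reloaded.isnull().sum())
```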


## Dummify columns

In [34]:
train_clean = pd.get_dummies(columns=['ms_subclass', 
                                      'ms_zoning', 
                                      'street', 
                                      'alley', 
                                      'land_contour', 
                                      'lot_config', 
                                      'land_slope', 
                                      'neighborhood', 
                                      'condition_1', 
                                      'condition_2', 
                                      'bldg_type', 
                                      'house_style', 
                                      'roof_style', 
                                      'roof_matl', 
                                      'exterior_1st', 
                                      'exterior_2nd', 
                                      'mas_vnr_type', 
                                      'foundation', 
                                      'heating', 
                                      'electrical', 
                                      'garage_type', 
                                      'fence', 
                                      'misc_feature', 
                                      'sale_type'], 
                             drop_first=True, data=train_clean)
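To illustrate what `drop_first=True` does, the sketch below dummifies a single column on a toy frame; `demo` and `dummies` are hypothetical names. The first category in sorted order ('Grvl') is dropped and serves as the baseline, so only one indicator column remains for a two-level feature.

```python
import pandas as pd

# Toy frame standing in for `train_clean` (hypothetical rows)
demo = pd.DataFrame({
    'street': ['Grvl', 'Pave', 'Pave'],
    'lot_area': [13517, 11492, 7922],
})

# One indicator column per category, minus the first category
dummies = pd.get_dummies(demo, columns=['street'], drop_first=True)

print(list(dummies.columns))
```

Dropping one level per feature avoids perfectly collinear indicator columns in a linear model with an intercept.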


In [35]:
print(train_clean.shape)
list(train_clean)

(2051, 217)


['id',
 'lot_frontage',
 'lot_area',
 'lot_shape',
 'utilities',
 'overall_qual',
 'overall_cond',
 'year_built',
 'year_remod/add',
 'mas_vnr_area',
 'exter_qual',
 'exter_cond',
 'bsmt_qual',
 'bsmt_cond',
 'bsmt_exposure',
 'bsmtfin_type_1',
 'bsmtfin_sf_1',
 'bsmtfin_type_2',
 'bsmtfin_sf_2',
 'bsmt_unf_sf',
 'total_bsmt_sf',
 'heating_qc',
 'central_air',
 '1st_flr_sf',
 '2nd_flr_sf',
 'low_qual_fin_sf',
 'bsmt_full_bath',
 'bsmt_half_bath',
 'full_bath',
 'half_bath',
 'bedroom_abvgr',
 'kitchen_abvgr',
 'kitchen_qual',
 'totrms_abvgrd',
 'functional',
 'fireplaces',
 'fireplace_qu',
 'garage_finish',
 'garage_cars',
 'garage_area',
 'garage_qual',
 'garage_cond',
 'paved_drive',
 'wood_deck_sf',
 'open_porch_sf',
 'enclosed_porch',
 '3ssn_porch',
 'screen_porch',
 'pool_area',
 'pool_qc',
 'misc_val',
 'mo_sold',
 'yr_sold',
 'saleprice',
 'ms_subclass_30',
 'ms_subclass_40',
 'ms_subclass_45',
 'ms_subclass_50',
 'ms_subclass_60',
 'ms_subclass_70',
 'ms_subclass_75',
 'ms_subc

## Export dummified dataframe into CSV file

In [36]:
train_clean.to_csv('../data/train_cleaned_dummified.csv', index=False)