![Downtown Ames, Iowa](ames_iowa_downtown.avif "Ames downtown city")

The aim of this project is to predict house sale prices in Ames, Iowa, USA. Ames is a typical small American town with a population of around 66 thousand (according to the 2020 census). The approach might generalize to other small towns in the United States, especially in Iowa. However, the main purpose of this notebook is to present a typical data science workflow for a regression problem.

# Presets

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import matplotlib.pyplot as plt

#from sklearn.model_selection import train_test_split
#from sklearn import metrics

# Hyperparams tuning
#from sklearn.model_selection import GridSearchCV, ParameterGrid
#from sklearn_genetic import GASearchCV
#from sklearn_genetic.space import Continuous, Categorical, Integer

np.random.seed(42)

In [20]:
pd.set_option('display.max_columns', 500)

In [3]:
data = pd.read_csv('train.csv')

# Basic statistics

In [4]:
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [28]:
print('Nrows:', data.shape[0])
print('Ncols:', data.shape[1])
print('rows/cols ratio:', data.shape[0]/data.shape[1])

Nrows: 1460
Ncols: 81
rows/cols ratio: 18.02469135802469


- The rows/cols ratio is fairly small, especially given that many variables are of type 'object' and would expand into several columns once encoded. Dimensionality reduction will therefore be needed
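To make the dimensionality concern concrete, the sketch below estimates how wide a table becomes after naive one-hot encoding of every non-numeric column. The helper name `one_hot_width` and the toy frame are hypothetical and only for illustration; on the real data one would pass `data` instead.

```python
import pandas as pd

def one_hot_width(df: pd.DataFrame) -> int:
    """Number of columns after one-hot encoding every non-numeric column."""
    cat_cols = df.select_dtypes(exclude="number").columns
    n_numeric = df.shape[1] - len(cat_cols)
    # Each categorical column expands into one dummy column per distinct level
    # (NaN is not counted by nunique, which matches pd.get_dummies defaults).
    return n_numeric + int(df[cat_cols].nunique().sum())

# Toy frame standing in for the real `data` (values are illustrative only).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "MSZoning": ["RL", "RL", "RM", "FV"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

width = one_hot_width(toy)
print("columns before:", toy.shape[1])
print("columns after one-hot:", width)
print("rows/cols ratio after:", toy.shape[0] / width)
```

The same number can be cross-checked with `pd.get_dummies(toy).shape[1]`; the helper just avoids materializing the encoded table.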

In [10]:
print(data.dtypes.to_string())

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
BsmtQual          object
BsmtCond          object
BsmtExposure      object
BsmtFinType1      object
BsmtFinSF1         int64
BsmtFinType2      object
BsmtFinSF2         int64
BsmtUnfSF          int64
TotalBsmtSF        int64
Heating           object


In [30]:
print('Number of integer columns:', len(data.dtypes[data.dtypes == 'int64']))
print('Number of float columns:', len(data.dtypes[data.dtypes == 'float64']))
print('Number of string columns:', len(data.dtypes[data.dtypes == 'object']))

Number of integer columns: 35
Number of float columns: 3
Number of string columns: 43


In [15]:
print(data.isna().sum().to_string())

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           37
BsmtCond           37
BsmtExposure       38
BsmtFinType1       37
BsmtFinSF1          0
BsmtFinType2       38
BsmtFinSF2          0
BsmtUnfSF           0
TotalBsmtSF         0
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
1stFlrSF            0
2ndFlrSF            0
LowQualFin

- MiscFeature is totally absent?
- PoolQC, FireplaceQu, Alley and Fence are probably missing for natural reasons (the house simply has no pool, fireplace, alley access or fence)
- The others must be investigated more thoroughly
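A small helper can make the missing-value picture easier to read by showing counts and shares side by side. This is a minimal sketch on a toy frame; the name `missing_report` is hypothetical, and treating a missing Alley as "feature not present" is an assumption that follows the reasoning in the notes above.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, largest share first."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "share": n_missing / len(df),
    })
    # Keep only the columns that actually have gaps.
    return report[report["n_missing"] > 0].sort_values("share", ascending=False)

# Toy frame (illustrative values, not the real data).
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
report = missing_report(toy)
print(report)

# ASSUMPTION: for columns such as Alley, NaN means "no such feature",
# so an explicit label may be more honest than imputing a value.
toy["Alley"] = toy["Alley"].fillna("None")
```

Columns where a gap means "unknown" rather than "not present" (LotFrontage, for example) would need a different treatment, which is why the two cases are kept apart here.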

In [21]:
data.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1379.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1978.506164,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,24.689725,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1900.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,223.0,795.75,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,808.0,1298.25,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2002.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,2010.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


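- There are significant outliers in almost every variable (e.g. LotArea has a maximum of 215245 against a 75th percentile of 11601.5)
- The houses were sold between 2006 and 2010, i.e. shortly before and during the global financial crisis
- All of the houses were built in or before 2010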
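The outlier remark can be quantified per column. The sketch below uses the 1.5 x IQR rule, which is only one common convention and not necessarily the criterion behind the observation above; the function name `iqr_outlier_counts` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Per numeric column, count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "LotArea": [7500, 8450, 9600, 9550, 11250, 215245],
    "OverallQual": [5, 6, 6, 7, 7, 8],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

On the real data, `iqr_outlier_counts(data)` would give a ranked list of the columns most affected, which helps decide where clipping or a log transform is worth considering.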

# Qualitative (business) analysis

With the statistics shown above, we will try to analyze and predict the influence of each variable on the sale price. Each variable is given a predefined influence (low, medium, high) and the corresponding effect sign (stimulant: raises the price; destimulant: lowers it; mixed).
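One way to keep these expectations checkable later is to record them in a small table. A minimal sketch, with a few entries copied from the list below; the frame name `priors` and its column names are assumptions made for illustration.

```python
import pandas as pd

# Expected influence and direction for a few variables (taken from the notes).
priors = pd.DataFrame(
    [
        ("MSSubClass", "low", "mixed"),
        ("MSZoning", "high", "mixed"),
        ("LotArea", "high", "stimulant"),
        ("Alley", "medium", "stimulant"),
    ],
    columns=["variable", "influence", "sign"],
).set_index("variable")

# An ordered categorical makes sorting follow low < medium < high
# instead of alphabetical order.
priors["influence"] = pd.Categorical(
    priors["influence"], categories=["low", "medium", "high"], ordered=True
)
print(priors.sort_values("influence", ascending=False))
```

Keeping the priors in a frame rather than in prose makes it straightforward to join them with measured statistics once the modelling starts.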

- **MSSubClass** — the type of a dwelling:
  - should be recoded as a string, as each number identifies a type of dwelling
  - hard to determine the influence of each type, and there seem to be too many categories, so binning may be applied
  - seems more like an ID variable
  - predefined influence: low, mixed
  
  
- **MSZoning** — the general zoning classification: 
  - seems to be a very important variable, as houses near industrial zones intuitively have lower prices than houses near a park
  - predefined influence: high, mixed
  
- **LotFrontage** — length of street connected to the property:
  - May be important starting from some length (as too short a frontage may make a poor impression)
  - Should rather be a one-hot or string variable
  - predefined influence: low, stimulant
  
- **LotArea** — size of the lot in square feet:
  - Surely an important variable; the question is the form of the dependency (linear or non-linear)
  - predefined influence: high, stimulant

- **Street** — type of road access to the house:
  - Seems not so important; may be strongly correlated with other features (e.g. MSZoning)
  - predefined influence: low, stimulant

- **Alley** — type of alley access to the house:
  - May be recoded as 0-1; alley access may make the house much more appealing to the buyer
  - predefined influence: medium, stimulant
 
- **LotShape** — general shape of the property:
  - 4 categories may be redundant; may be recoded as 0-1 (1 = strongly irregular)
  - predefined influence: medium, destimulant
  
- **LandContour** — flatness of the property:
  - at some point the angle may be significant (for a very old house)
  - seems to be more of an expert-knowledge factor
  - could also be recoded as 0-1
  - predefined influence: low, destimulant

- **Utilities** — type of utilities available:
  - surely an important variable, as utilities are hard and time-consuming to set up on your own
  - may also be recoded as 0-1
  - predefined influence: high, stimulant
  
- **LotConfig** — lot configuration (placement of the lot relative to the street):
  - may be recoded as a 3-category variable (corner, cul-de-sac, other) or as 0-1
  - cul-de-sac seems to be important in shaping buyer behavior, but not as much as, e.g., area
  - may be correlated with Alley, Street and other geographical variables
  - predefined influence: medium, destimulant

- **LandSlope** — slope of the property:
  - at some point the angle may be significant (especially for senior buyers)
  - could also be recoded as 0-1
  - predefined influence: low, destimulant
  
- **Neighborhood** — physical location of the house within Ames city limits:
  - will surely be correlated with some geographical variables
  - may also reflect ethnic, cultural, racial or class differences within Ames; unfortunately, I'm not a resident, so the effect of different neighborhoods is unknown to me :(
  - predefined influence: medium, mixed
  
- **Condition1**, **Condition2** — proximity of the house to various conditions (e.g. railroad, park):
  - surely important for houses near a railroad (noisy and dirty) or a park (calm and clean)
  - predefined influence: high, mixed

- **BldgType** — type of the house:
  - surely an important variable, but may be correlated with some other features
  - predefined influence: high, mixed
  
- **HouseStyle** — style of the house:
  - more like an ID variable, but an unfinished level may be important
  - predefined influence: low, mixed
  
- **OverallQual** — overall material and finish of the house:
  - may be correlated with other features
  - some categories may be redundant
  - predefined influence: medium, destimulant
  
- **OverallCond** — overall condition of the house:
  - may be correlated with other features
  - some categories may be redundant
  - predefined influence: medium, destimulant
  
- **YearBuilt** — year the house was built:
  - at some point may be important, but there is probably little difference between, say, 2000 and 2005, so it should be recoded somehow
  - predefined influence: high, stimulant
  
- **YearRemodAdd** — remodel date:
  - what matters is rather whether a remodel occurred and how many times
  - may be correlated with other variables
  - predefined influence: high, destimulant
  
- **RoofStyle** — type of roof:
  - may be part of some new feature
  - predefined influence: low, mixed
  
- **RoofMatl** — material of the roof:
  - may be part of some new feature
  - some materials may be costly
  - predefined influence: low, mixed
  
- **Exterior1st**, **Exterior2nd** — exterior covering of the house:
  - may be part of some new feature
  - some materials may be costly
  - predefined influence: low, mixed
  
- **MasVnrType** — masonry veneer type (brick type) of the house:
  - may be part of some new feature
  - some types may be costly
  - predefined influence: low, mixed
  
- **ExterQual**, **ExterCond** — the quality and the present condition of the material on the exterior:
  - may be part of some new feature
  - may be correlated with each other
  - predefined influence: low, mixed
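Several of the recoding ideas noted above (MSSubClass as a string with rare levels pooled, Alley as a 0-1 flag, YearBuilt in coarser bins) can be sketched in one helper. The function name `recode`, the `min_count` threshold, the new column names `HasAlley` and `DecadeBuilt`, and the choice of decade bins are all assumptions made for illustration, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd

def recode(df: pd.DataFrame, min_count: int = 2) -> pd.DataFrame:
    """Apply a few illustrative recodings and return a new frame."""
    out = df.copy()

    # MSSubClass: the numbers are labels, so treat them as strings...
    out["MSSubClass"] = out["MSSubClass"].astype(str)
    # ...and pool levels seen fewer than `min_count` times into "Other".
    counts = out["MSSubClass"].value_counts()
    rare = counts[counts < min_count].index
    out.loc[out["MSSubClass"].isin(rare), "MSSubClass"] = "Other"

    # Alley: 1 if any alley access is recorded, 0 if the value is missing.
    out["HasAlley"] = out["Alley"].notna().astype(int)

    # YearBuilt: coarse decade bins (2003 -> 2000), one possible recoding.
    out["DecadeBuilt"] = (out["YearBuilt"] // 10) * 10

    return out.drop(columns=["Alley"])

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70, 20],
    "Alley": [np.nan, "Grvl", np.nan, "Pave", np.nan],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
})
recoded = recode(toy)
print(recoded)
```

On the full data the `min_count` threshold would need to be much larger than 2; it is kept small here only so the toy example shows the pooling.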
  
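- **Foundation** — type of foundation
- **BsmtQual** — height of the basement
- **BsmtCond** — general condition of the basement
- **BsmtExposure** — walkout or garden level basement walls
- **BsmtFinType1** — rating of basement finished area
- **BsmtFinSF1** — type 1 finished square feet
- **BsmtFinType2** — rating of second basement finished area (if present)
- **BsmtFinSF2** — type 2 finished square feet
- **BsmtUnfSF** — unfinished square feet of basement area
- **TotalBsmtSF** — total square feet of basement area

- **Heating** — type of heating
- **HeatingQC** — heating quality and condition
- **CentralAir** — central air conditioning
- **Electrical** — electrical system
- **1stFlrSF** — first floor square feet
- **2ndFlrSF** — second floor square feet
- **LowQualFinSF** — low quality finished square feet (all floors)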



- **GrLivArea** — above grade (ground) living area in square feet
- **BsmtFullBath** — basement full bathrooms
- **BsmtHalfBath** — basement half bathrooms
- **FullBath** — full bathrooms above grade
- **HalfBath** — half baths above grade
- **BedroomAbvGr** — bedrooms above grade (does NOT include basement bedrooms)
- **KitchenAbvGr** — kitchens above grade
- **KitchenQual** — kitchen quality
- **TotRmsAbvGrd** — total rooms above grade (does not include bathrooms)
- **Functional** — home functionality (assume typical unless deductions are warranted)
- **Fireplaces** — number of fireplaces
- **FireplaceQu** — fireplace quality

- **GarageType** — garage location
- **GarageYrBlt** — year garage was built
- **GarageFinish** — interior finish of the garage
- **GarageCars** — size of garage in car capacity
- **GarageArea** — size of garage in square feet
- **GarageQual** — garage quality
- **GarageCond** — garage condition

- **PavedDrive** — paved driveway
- **WoodDeckSF** — wood deck area in square feet
- **OpenPorchSF** — open porch area in square feet
- **EnclosedPorch** — enclosed porch area in square feet
- **3SsnPorch** — three season porch area in square feet
- **ScreenPorch** — screen porch area in square feet

- **PoolArea** — pool area in square feet
- **PoolQC** — pool quality

- **Fence** — fence quality
- **MiscFeature** — miscellaneous feature not covered in other categories
- **MiscVal** — dollar value of miscellaneous feature
- **MoSold** — month sold (MM)
- **YrSold** — year sold (YYYY)
- **SaleType** — type of sale
- **SaleCondition** — condition of sale