# 1 Introduction to Project

## 1.1 Overview

The real estate market plays a critical role in the economy, with the sale price of homes often serving as a key indicator of market conditions. Accurate predictions of housing prices are important for a variety of stakeholders, including home buyers, sellers, and real estate professionals. In this project, we aim to develop a machine learning model that can accurately predict the sale price of homes based on a variety of features about the properties and the local real estate market. Our objectives are to identify the most important factors that influence housing prices and to evaluate the performance of the model using metrics such as mean squared error and mean absolute error. To achieve these goals, we will use a dataset containing information on a wide range of residential properties in Ames, Iowa, collected between 2006 and 2010.

In this machine learning project, I will use the Python libraries matplotlib, numpy, pandas, seaborn, scipy and scikit-learn to analyse and visualise the dataset. I will also explore several regression models: linear regression, random forest, gradient boosted trees and support vector machines.

In this project, I will work through the following steps:
1. Understanding the problem. I will examine each variable and consider how it is likely to bear on the sale price.
2. Data cleaning. I will clean the dataset, perform feature and label extraction, and apply feature scaling.
3. Univariate study. I will analyse the dependent variable, `SalePrice`.
4. Multivariate study. I will analyse how `SalePrice` relates to the independent variables.
5. Model building. I will build the regression models, evaluate them and compare their performance.
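
As a rough illustration of the feature and label extraction and the feature scaling mentioned in step 2, here is a minimal sketch. The `toy` DataFrame and its values are hypothetical and only mimic the column names of the Ames data; it is not the pipeline used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names mimic the Ames dataset
toy = pd.DataFrame({
    "GrLivArea": [1500.0, 2000.0, 2500.0],
    "OverallQual": [5.0, 7.0, 9.0],
    "SalePrice": [150000, 200000, 300000],
})

# Label extraction: SalePrice is the target, everything else is a feature
y = toy["SalePrice"]
X = toy.drop(columns="SalePrice")

# Feature scaling: rescale each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

When a train and test split is involved, the scaler would normally be fitted on the training rows only and then applied to the test rows, so that no information from the test set leaks into training.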

Now, let's take a deeper dive into our dataset.

## 1.2 References

***
- Title: Ames housing dataset
- Author: -
- Date accessed: 4 Jan 2023
- Availability: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data
***
- Title: Comprehensive data exploration with Python
- Author: Pedro Marcelino
- Date accessed: 4 Jan 2023
- Availability: https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python#'SalePrice',-her-buddies-and-her-interests
***
- Title: Practical Introduction to 10 Regression Algorithm
- Author: Fares Sayah
- Date accessed: 4 Jan 2023
- Availability: https://www.kaggle.com/code/faressayah/practical-introduction-to-10-regression-algorithm#📊-Models-Comparison
***
- Title: How to interpret MAE (simply explained)
- Author: Stephen Allwright
- Date accessed: 4 Jan 2023
- Availability: https://stephenallwright.com/interpret-mae/
***
- Title: Mean Squared Error: Definition and Example
- Author: -
- Date accessed: 4 Jan 2023
- Availability: https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/mean-squared-error/
***
- Title: RMSE: Root Mean Square Error
- Author: -
- Date accessed: 4 Jan 2023
- Availability: https://www.statisticshowto.com/probability-and-statistics/regression-analysis/rmse-root-mean-square-error/
***
- Title: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions
- Author: Jim Frost
- Date accessed: 4 Jan 2023
- Availability: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
***
- Title: How to Solve Reproducibility in ML
- Author: Ejiro Onose
- Date accessed: 4 Jan 2023
- Availability: https://neptune.ai/blog/how-to-solve-reproducibility-in-ml
***

# 2 Description of Dataset

The Ames Housing dataset is a collection of detailed housing-related data compiled by Dean De Cock for use in data science education. The dataset includes information on homes sold in Ames, Iowa between 2006 and 2010, and includes 79 explanatory variables describing various aspects of the homes, such as the size of the home, the number of bedrooms and bathrooms, and various other features, alongside the sale price. The data was collected from a variety of sources, including public records and real estate listings. The dataset is organized into a tabular format with one row per home and one column per variable. It is widely used in machine learning research and is often used as a benchmark for evaluating the performance of models to predict housing prices. The dataset can be accessed via the Kaggle official website.

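The Kaggle training file includes 81 columns describing a wide range of characteristics of 1,460 homes (rows) in Ames, Iowa, sold between 2006 and 2010, and the CSV file is 463KB in size. The `AmesHousing.csv` file loaded in Section 4 is larger: `housing.info()` reports 2,930 rows and 82 columns.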

We can examine each variable to try to grasp its significance and relation to this issue in order to understand our data. Although this takes a lot of time, it will give us a sense of our dataset.

To be more methodical in our analysis, we split our data into two categories:
- Categorical
- Numerical

A categorical variable takes one of a finite, typically fixed set of values and assigns each unit of observation to a group or nominal category based on some qualitative attribute. In this dataset it is represented as a string. Numerical data is expressed as numbers rather than in any linguistic or descriptive form. It is also known as quantitative data, and can be analysed mathematically and statistically. It is represented as an integer or a float. Missing data is represented with NA in the dataset, which we will clean up when pre-processing the data.
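
To make the NA convention concrete, here is a small sketch of how pandas parses such a file. The three-row CSV text is made up for illustration and is not taken from the real dataset; by default `pd.read_csv` treats the string `NA` as a missing value.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the values are made up for illustration
csv_text = (
    "LotArea,Alley,LotFrontage\n"
    "8450,NA,65\n"
    "9600,Pave,NA\n"
    "11250,NA,80\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# LotArea stays an integer column, Alley is read as text, and LotFrontage
# becomes a float column because NaN is a floating-point value
print(sample.dtypes)

# Number of missing values per column
print(sample.isna().sum())
```

This is also why a numerical column with gaps shows up as `float64` rather than `int64` in the `housing.info()` output.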

---

# 3 Project Objectives

## 3.1 Objectives

The aim of a housing price machine learning project would likely be to build a model that can predict the sale price of a home given a set of features about the home and the local real estate market. The objectives of the project might include:

- Developing a model that can make accurate predictions about housing prices.
- Identifying the most important features that influence housing prices, such as the size of the home, the location, and the condition of the property.
- Evaluating the performance of the model using metrics such as mean squared error or mean absolute error.
- Fine-tuning the model to improve its performance by adjusting the model architecture, adjusting the hyperparameters, or adding additional data.
- Deploying the model in a web application or other tool that can be used by real estate professionals or individual home buyers and sellers to predict housing prices.
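
As a sketch of the evaluation metrics named above, the snippet below computes the mean absolute error, the mean squared error and its square root on a handful of hypothetical prices. The numbers are invented for illustration and are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical sale prices and model predictions, in dollars
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 330000.0])

# MAE: average size of the errors, in the same unit as the prices
mae = mean_absolute_error(y_true, y_pred)

# MSE: average squared error, which penalises large misses more heavily
mse = mean_squared_error(y_true, y_pred)

# RMSE: square root of the MSE, back in the unit of the prices
rmse = np.sqrt(mse)

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

Because squaring emphasises large errors, the RMSE is never smaller than the MAE; a wide gap between the two suggests that a few predictions are far off.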

## 3.2 Impact and Contribution

A housing price prediction model could have a number of impacts and contributions, depending on how it is used and who uses it. Some potential impacts and contributions of a housing price machine learning project could include:

- Providing a useful tool for real estate professionals to help them determine the value of properties and make informed decisions when buying or selling homes.
- Helping home buyers and sellers to get a better understanding of the market value of a home, which could help them to make more informed decisions when negotiating the sale or purchase price of a property.
- Improving the efficiency of the real estate market by providing a more accurate and objective way of determining the value of a home, rather than relying on subjective opinions or incomplete data.
- Providing a basis for further research into the factors that influence housing prices and how they interact with one another.
- Potentially leading to the development of other machine learning models that can be used to predict other outcomes in the real estate market, such as rental prices or demand for properties in a particular area.

---

# 4 Data Preprocessing

## 4.1 Overview of the Data

In [1]:
# General imports for data handling
import numpy as np
import pandas as pd

# Scikit-learn imports for machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

# Additional utilities from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler  # If feature scaling is needed


In [2]:
# Get data from CSV file
housing = pd.read_csv('AmesHousing.csv')

In [3]:
# Display the first 5 rows of the data
housing.head()

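| | Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 189900 |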


In [4]:
# Returns all the columns in the data
housing.columns

Index(['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
       'G

In [5]:
# Differentiate numerical features and categorical features
categorical_features = housing.select_dtypes(include = ["object"]).columns
numerical_features = housing.select_dtypes(exclude = ["object"]).columns
numerical_features = numerical_features.drop("SalePrice")

print("Numerical features : " + str(len(numerical_features)))
print("Categorical features : " + str(len(categorical_features)))

Numerical features : 38
Categorical features : 43


In [6]:
# Returns all the type of data of each variable
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   MS Zoning        2930 non-null   object 
 4   Lot Frontage     2440 non-null   float64
 5   Lot Area         2930 non-null   int64  
 6   Street           2930 non-null   object 
 7   Alley            198 non-null    object 
 8   Lot Shape        2930 non-null   object 
 9   Land Contour     2930 non-null   object 
 10  Utilities        2930 non-null   object 
 11  Lot Config       2930 non-null   object 
 12  Land Slope       2930 non-null   object 
 13  Neighborhood     2930 non-null   object 
 14  Condition 1      2930 non-null   object 
 15  Condition 2      2930 non-null   object 
 16  Bldg Type        2930 non-null   object 
 17  House Style   

## 4.2 Checking for Duplicated Data

First, let's check for duplicated data.

In [7]:
# Check for duplicates
uniqueEntries = len(set(housing.Id))
totalEntries = housing.shape[0]
duplicatedEntries = totalEntries - uniqueEntries

print("There are " + str(duplicatedEntries) + " duplicate housings for " + str(totalEntries) + " total entries")

There are 0 duplicate housings for 2930 total entries
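
The check above compares the number of unique `Id` values with the number of rows. An alternative sketch uses `DataFrame.duplicated`, which can also flag rows that are identical in every column. The `toy` frame below is hypothetical and stands in for `housing`.

```python
import pandas as pd

# Hypothetical toy frame in which the third row repeats the second
toy = pd.DataFrame({
    "Id": [1, 2, 2, 3],
    "SalePrice": [215000, 105000, 105000, 172000],
})

# Rows whose Id already appeared in an earlier row
n_dup_ids = toy.duplicated(subset="Id").sum()

# Rows that repeat an earlier row in every column
n_dup_rows = toy.duplicated().sum()

# Keep only the first occurrence of each duplicated row
deduplicated = toy.drop_duplicates()

print(n_dup_ids, n_dup_rows, len(deduplicated))
```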


## 4.3 Checking for Missing Data

It is good practice to check for null values before processing any dataset. Missing values reduce the statistical power of a dataset and can bias the results when training a machine learning model. Many models also require complete input data; for instance, scikit-learn's linear regression raises an error when the input contains null values.

Some things to consider when facing missing data:
- How frequently are data missing? 
- Is there a pattern or randomness in the missing data? 
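
To illustrate why null values matter for modelling, the sketch below shows scikit-learn's `LinearRegression` rejecting input that contains `NaN`, then fitting once the incomplete row is removed. The arrays are hypothetical toy data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature column with one missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Fitting on data that contains NaN raises a ValueError
try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as error:
    fit_failed = True
    print("Fit failed:", error)

# Fitting succeeds once the incomplete row is removed
complete = ~np.isnan(X).any(axis=1)
model = LinearRegression().fit(X[complete], y[complete])
print(model.coef_)
```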

In [8]:
# Compute the total number and percentage of missing data per variable
total = housing.isna().sum().sort_values(ascending=False)
percent = (housing.isna().sum()/housing.isna().count()).sort_values(ascending=False)

# Display in a table
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

| | Total | Percent |
|---|---|---|
| Pool QC | 2917 | 0.995563 |
| Misc Feature | 2824 | 0.963823 |
| Alley | 2732 | 0.932423 |
| Fence | 2358 | 0.804778 |
| Mas Vnr Type | 1775 | 0.605802 |
| Fireplace Qu | 1422 | 0.485324 |
| Lot Frontage | 490 | 0.167235 |
| Garage Cond | 159 | 0.054266 |
| Garage Finish | 159 | 0.054266 |
| Garage Yr Blt | 159 | 0.054266 |


Let's analyse the results so we know how to deal with the missing data. We are missing data in these areas:
1. Pool quality
2. Miscellaneous features
3. Alley
4. Fence
5. Fireplace
6. Linear feet of street connected to property
7. Garage
8. Basement
9. Masonry
10. Electrical


There are two approaches we can take when dealing with missing data:
- We can omit the entire entry by removing the row from the dataset
- We can impute the missing data with the mean or median of the variable
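
The two approaches can be sketched as follows. The `toy` frame and its values are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing LotFrontage value
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 100.0],
    "SalePrice": [150000, 160000, 200000, 250000],
})

# Approach 1: omit every row that has a missing value
dropped = toy.dropna()

# Approach 2: impute the missing value with the median of the variable
median_frontage = toy["LotFrontage"].median()
imputed = toy.fillna({"LotFrontage": median_frontage})

print(len(dropped), "rows left after dropping")
print(imputed)
```

Dropping rows keeps only observed values but shrinks the dataset, while imputing keeps every row at the cost of introducing estimated values.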

Let's look at the first four variables, where more than 80% of the data is missing. Since these variables are almost entirely empty, they carry very little information for the model, so we remove them entirely. Furthermore, pool quality, miscellaneous features, alleys and fences are probably not aspects most people consider when purchasing a house, which could explain the lack of data.

In [9]:
# Remove variables (columns) with more than 80% missing data.
# drop() removes rows by default, so the columns must be named explicitly.
housing = housing.drop(columns=missing_data[missing_data['Percent'] > 0.8].index)

# Remove the rows with missing electrical data
housing = housing.drop(housing.loc[housing['Electrical'].isnull()].index)

# Drop the Id column as well, since we do not need it
housing = housing.drop(columns="Id")

For the remaining features, we replace the null values with 0 in the numerical columns and with the string "NA" in the categorical columns, so that the encoding step can map them to numbers.

In [None]:
# Strip the spaces from the column names (e.g. "Fireplace Qu" -> "FireplaceQu")
# so that they match the names used in the rest of this notebook
housing.columns = housing.columns.str.replace(" ", "")

# Categorical features: fill missing values with the string "NA"
categorical_na = ["FireplaceQu", "GarageCond", "GarageType", "GarageFinish",
                  "GarageQual", "BsmtFinType2", "BsmtExposure", "BsmtQual",
                  "BsmtCond", "BsmtFinType1", "MasVnrType"]
housing[categorical_na] = housing[categorical_na].fillna("NA")

# Numerical features: fill missing values with the number 0
numerical_na = ["LotFrontage", "GarageYrBlt", "MasVnrArea"]
housing[numerical_na] = housing[numerical_na].fillna(0)

## 4.4 Encoding Categorical Features

Finally, we will be encoding the categorical features so the regression model is able to understand the data.

In [None]:
# Encode the remaining categorical features
housing = housing.replace({
    "MSZoning" : {"A" : 1, "C" : 2, "FV" : 3, "I" : 4, "RH" : 5, "RL" : 6, "RP" : 7, "RM" : 8},
    "Street" : {"Grvl" : 1, "Pave" : 2},
    "LotShape" : {"IR3" : 1, "IR2" : 2, "IR1" : 3, "Reg" : 4},
    "LandContour" : {"Lvl": 1, "Bnk": 2, "HLS": 3, "Low": 4},
    "Utilities" : {"ELO" : 1, "NoSeWa" : 2, "NoSewr" : 3, "AllPub" : 4},
    "LotConfig" : {"Inside" : 1, "Corner" : 2, "CulDSac" : 3, "FR2" : 4, "FR3": 5},
    "LandSlope" : {"Sev" : 1, "Mod" : 2, "Gtl" : 3},
    "Neighborhood" : {"Blmngtn" : 1, "Blueste" : 2, "BrDale" : 3, "BrkSide" : 4, "ClearCr" : 5, "CollgCr" : 6, "Crawfor" : 7, "Edwards" : 8, "Gilbert" : 9, "IDOTRR" : 10, "MeadowV" : 11, "Mitchel" : 12, "NAmes" : 13, "NoRidge" : 14, "NPkVill" : 15, "NridgHt" : 16, "NWAmes" : 17, "OldTown" : 18, "SWISU" : 19, "Sawyer" : 20, "SawyerW" : 21, "Somerst" : 22, "StoneBr" : 23, "Timber" : 24, "Veenker" : 25},
    "Condition1" : {"Artery" : 1, "Feedr" : 2, "Norm" : 3, "RRNn" : 4, "RRAn" : 5, "PosN" : 6, "PosA" : 7, "RRNe" : 8, "RRAe" : 9},
    "Condition2" : {"Artery" : 1, "Feedr" : 2, "Norm" : 3, "RRNn" : 4, "RRAn" : 5, "PosN" : 6, "PosA" : 7, "RRNe" : 8, "RRAe" : 9},
    "BldgType" : {"1Fam" : 1, "2FmCon" : 2, "Duplx" : 3, "TwnhsE" : 4, "TwnhsI": 5},
    "HouseStyle" : {"1Story" : 1, "1.5Fin" : 2, "1.5Unf" : 3, "2Story" : 4, "2.5Fin" : 5, "2.5Unf" : 6, "SFoyer" : 7, "SLvl" : 8},
    "RoofStyle" : {"Flat" : 1, "Gable" : 2, "Gambrel" : 3, "Hip" : 4, "Mansard" : 5, "Shed" : 6},
    "RoofMatl" : {"ClyTile" : 1, "CompShg" : 2, "Membran" : 3, "Metal" : 4, "Roll" : 5, "Tar&Grv" : 6, "WdShake" : 7, "WdShngl" : 8},
    "Exterior1st" : {"AsbShng" : 1, "AsphShn" : 2, "BrkComm" : 3, "BrkFace" : 4, "CBlock" : 5, "CemntBd" : 6, "HdBoard" : 7, "ImStucc" : 8, "MetalSd" : 9, "Other" : 10, "Plywood" : 11, "PreCast" : 12, "Stone" : 13, "Stucco" : 14, "VinylSd" : 15, "Wd Sdng" : 16, "WdShing" : 17},
    "Exterior2nd" : {"AsbShng" : 1, "AsphShn" : 2, "BrkComm" : 3, "BrkFace" : 4, "CBlock" : 5, "CemntBd" : 6, "HdBoard" : 7, "ImStucc" : 8, "MetalSd" : 9, "Other" : 10, "Plywood" : 11, "PreCast" : 12, "Stone" : 13, "Stucco" : 14, "VinylSd" : 15, "Wd Sdng" : 16, "WdShing" : 17},
    "MasVnrType" : {"BrkCmn" : 1, "BrkFace" : 2, "CBlock" : 3, "None" : 4, "Stone": 5, "NA": 6},
    "ExterQual" : {"Po" : 1, "Fa" : 2, "TA": 3, "Gd": 4, "Ex" : 5},
    "ExterCond" : {"Po" : 1, "Fa" : 2, "TA": 3, "Gd": 4, "Ex" : 5},
    "Foundation" : {"BrkTil" : 1, "CBlock" : 2, "PConc" : 3, "Slab" : 4, "Stone" : 5, "Wood" : 6},
    "BsmtQual" : {"NA" : 0, "Po" : 1, "Fa" : 2, "TA": 3, "Gd" : 4, "Ex" : 5},
    "BsmtCond" : {"NA" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
    "BsmtExposure" : {"No" : 0, "Mn" : 1, "Av": 2, "Gd" : 3, "NA": 4},
    "BsmtFinType1" : {"NA" : 0, "Unf" : 1, "LwQ": 2, "Rec" : 3, "BLQ" : 4, "ALQ" : 5, "GLQ" : 6},
    "BsmtFinType2" : {"NA" : 0, "Unf" : 1, "LwQ": 2, "Rec" : 3, "BLQ" : 4, "ALQ" : 5, "GLQ" : 6},
    "Heating" : {"Floor" : 1, "GasA" : 2, "GasW": 3, "Grav" : 4, "OthW" : 5, "Wall" : 6},
    "HeatingQC" : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
    "CentralAir" : {"N" : 0, "Y" : 1},
    "Electrical" : {"SBrkr" : 1, "FuseA" : 2, "FuseF" : 3, "FuseP" : 4, "Mix" : 5},
    "KitchenQual" : {"Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
    "Functional" : {"Sal" : 1, "Sev" : 2, "Maj2" : 3, "Maj1" : 4, "Mod": 5, "Min2" : 6, "Min1" : 7, "Typ" : 8},                       
    "FireplaceQu" : {"NA" : 1, "Po" : 2, "Fa" : 3, "TA" : 4, "Gd" : 5, "Ex" : 6},
    "GarageType" : {"2Types" : 1, "Attchd" : 2, "Basment" : 3, "BuiltIn" : 4, "CarPort": 5, "Detchd" : 6, "NA" : 7},
    "GarageFinish" : {"Fin" : 1, "RFn" : 2, "Unf" : 3, "NA" : 4},
    "GarageQual" : {"NA" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
    "GarageCond" : {"NA" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5},
    "PavedDrive" : {"N" : 0, "P" : 1, "Y" : 2},
    "SaleType" : {"WD" : 1, "CWD" : 2, "VWD" : 3, "New" : 4, "COD" : 5, "Con" : 6, "ConLw" : 7, "ConLI" : 8, "ConLD" : 9, "Oth": 10},
    "SaleCondition" : {"Normal" : 1, "Abnorml" : 2, "AdjLand": 3, "Alloca" : 4, "Family" : 5, "Partial" : 6}
    }
)
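
The mapping above assigns an integer to every category. For ordered categories such as quality ratings, the integers preserve a meaningful ranking. For unordered categories such as neighbourhoods, the integers imply an order that does not exist, and one-hot encoding is a common alternative. The sketch below contrasts the two on a hypothetical `toy` frame; it is an illustration of the design choice, not the encoding used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with one ordered and one unordered feature
toy = pd.DataFrame({
    "ExterQual": ["TA", "Gd", "Ex", "Fa"],
    "Neighborhood": ["NAmes", "Gilbert", "NAmes", "OldTown"],
})

# Ordered feature: map each rating to its rank
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
toy["ExterQual"] = toy["ExterQual"].map(quality_order)

# Unordered feature: one indicator column per category
encoded = pd.get_dummies(toy, columns=["Neighborhood"], dtype=int)

print(encoded)
```

One-hot encoding adds one column per category, so it is best suited to features with a modest number of distinct values.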

## 4.5 Final Check

After all that pre-processing, we do a final check for missing data.

In [None]:
# Compute the total number and percentage of missing data per variable
total = housing.isna().sum().sort_values(ascending=False)
percent = (housing.isna().sum()/housing.isna().count()).sort_values(ascending=False)

# Display in a table
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(5)

---

# 5 Statistical Analysis 

## 5.1 Establishing Correlation between SalePrice and Other Features

### 5.1.1 Establishing the Most Important Features

### 5.1.2 Visualising with a Heatmap

### 5.1.3 Visualising with a Scatterplot

## 5.2 Deep Dive into `SalePrice`

---

# 6 Model Development

## 6.1 Train and Test Splits

## 6.2 Linear Regression

## 6.3 Gradient Boosted Trees

## 6.4 Random Forest

## 6.5 Support Vector Machine

---

# 7 Model Evaluation

---

# 8 Project Evaluation

## 8.1 Summary

## 8.2 Results between Models

## 8.3 Reproducibility