# Predicting Housing Prices using Advanced Regression Techniques

## Introduction

#### Project Scope:

The goal of this project is to predict the sale price of each home in Ames, Iowa. The dataset describes 79 features that can influence the price of a residential home. We selected it because of the variety of these features and the different ways each of them affects the house price.

#### Motivation & purpose: 

- The predicted prices will help residents who intend to buy a house in Ames understand the price ranges across different areas of the city.
- Realtors can also use them when pitching a final sale price (after adjusting for their commission) to their clients.

#### Dataset:

- Continuous variables: linear feet of street connected to the property, lot size, year built, masonry veneer area in square feet, total square feet of basement area, etc.
- Categorical variables: basement type, roof material, lot configuration, etc.
- Quality measures: exterior quality, basement quality, exterior condition, etc.


## Data cleaning & preprocessing

#### Detecting outliers
![image.png](img1.png)

Two observations were removed as outliers, since they have extremely large areas but very low sale prices.
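A minimal sketch of this kind of outlier filter is shown here. The write-up does not name the area column or the cutoffs, so `GrLivArea`, `AREA_LIMIT`, and `PRICE_LIMIT` are assumptions for illustration, and the data frame is a toy stand-in for the real training set.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values).
train = pd.DataFrame({
    "GrLivArea": [1500, 1710, 2200, 4676, 5642],
    "SalePrice": [180000, 208500, 250000, 184750, 160000],
})

# Assumed cutoffs: a very large living area sold at a low price.
AREA_LIMIT = 4000
PRICE_LIMIT = 300000

# Flag rows that are both unusually large and unusually cheap.
is_outlier = (train["GrLivArea"] > AREA_LIMIT) & (train["SalePrice"] < PRICE_LIMIT)

# Keep everything else and renumber the rows.
cleaned = train.loc[~is_outlier].reset_index(drop=True)
```

In practice the cutoffs would be read off the scatter plot rather than fixed in advance.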

#### Missing values

![image.png](img6.png)

![image.png](img5.png)

The following columns were dropped since they had more than 80% missing values:
- PoolQC
- MiscFeature
- Alley
- Fence
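The 80% rule can be sketched as follows. The data frame is a toy example with hypothetical values; only the column names and the threshold come from the text above.

```python
import numpy as np
import pandas as pd

# Toy data: two columns that are 90% missing and one complete column.
train = pd.DataFrame({
    "PoolQC": [np.nan] * 9 + ["Gd"],
    "Alley": [np.nan] * 9 + ["Pave"],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120, 7420],
})

# Fraction of missing values per column.
missing_share = train.isna().mean()

# Drop every column whose missing share exceeds 80%.
to_drop = missing_share[missing_share > 0.80].index.tolist()
reduced = train.drop(columns=to_drop)
```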

For many columns, NA indicates that the house does not have that feature (based on the data description). Missing values in the following columns were therefore replaced with 'None':
- MasVnrType
- FireplaceQu
- GarageQual
- GarageCond
- GarageFinish
- GarageType
- BsmtExposure
- BsmtCond
- BsmtQual
- BsmtFinType1
- BsmtFinType2
- Utilities
- LandSlope
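As a rough illustration, the replacement could look like this. Only three of the listed columns are shown, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy data with NaN where the house lacks the feature.
train = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "BsmtQual": ["Gd", "TA", np.nan],
    "FireplaceQu": [np.nan, "TA", np.nan],
})

# Columns where a missing value means "feature absent".
none_cols = ["GarageType", "BsmtQual", "FireplaceQu"]

# Replace the missing values with the explicit level 'None'.
train[none_cols] = train[none_cols].fillna("None")
```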


Missing values in the following features were filled with the column mode, since each has only a few missing values and most houses would typically have the feature:
- Electrical
- KitchenQual
- Exterior1st
- Exterior2nd
- SaleType
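A small sketch of mode imputation, assuming the mode is computed on the same frame that is being filled (the write-up does not say whether train and test were combined). The values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column.
train = pd.DataFrame({
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
    "KitchenQual": ["TA", np.nan, "TA", "Gd"],
})

for col in ["Electrical", "KitchenQual"]:
    # mode() ignores NaN; take the first value if there are ties.
    most_common = train[col].mode()[0]
    train[col] = train[col].fillna(most_common)
```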

### Feature Engineering

#### Factorization:

A couple of features, MSSubClass and OverallCond, were read in as numeric but are actually categorical, so they were converted to string type.
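The conversion might look like the following sketch, shown on a toy frame with made-up values.

```python
import pandas as pd

# Toy data: the first two columns are category codes stored as integers.
train = pd.DataFrame({
    "MSSubClass": [60, 20, 70],
    "OverallCond": [5, 8, 5],
    "LotArea": [8450, 9600, 11250],
})

# Cast the coded columns to strings so they are treated as categories.
for col in ["MSSubClass", "OverallCond"]:
    train[col] = train[col].astype(str)
```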


#### Numerical variables
![image.png](img7.png)

![image.png](img9.png)

All numerical features with a correlation above 0.30 with the target, SalePrice, were selected.

![image.png](img8.png)
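The correlation filter can be sketched as below. This assumes the correlation is the Pearson correlation with `SalePrice` and that the signed value (not the absolute value) is compared with 0.30; the toy data is hypothetical.

```python
import pandas as pd

# Toy numeric data; MoSold is only weakly related to the price.
train = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "OverallQual": [4, 5, 6, 8, 9],
    "MoSold": [6, 2, 7, 3, 5],
    "SalePrice": [100000, 150000, 210000, 260000, 320000],
})

# Correlation of every numeric feature with the target.
corr_matrix = train.select_dtypes("number").corr()
corr_with_target = corr_matrix["SalePrice"].drop("SalePrice")

# Keep the features whose correlation exceeds the 0.30 threshold.
selected = corr_with_target[corr_with_target > 0.30].index.tolist()
```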

#### Transformation of skewed variables


#### Numerical variables

![image.png](img11.png)


#### Target variable: SalePrice

![image.png](img3.png)  ![image.png](img4.png) 
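The write-up does not state which transformation was applied. A common choice for right-skewed prices and areas is `log1p`, so the sketch below assumes that, together with a hypothetical skewness cutoff `SKEW_LIMIT`. The data is a toy example.

```python
import numpy as np
import pandas as pd

# Toy data: LotArea and SalePrice have a long right tail, YearBuilt does not.
train = pd.DataFrame({
    "LotArea": [5000, 6000, 7000, 9000, 12000, 40000],
    "YearBuilt": [1960, 1970, 1980, 1990, 2000, 2010],
    "SalePrice": [100000, 120000, 140000, 160000, 180000, 750000],
})

SKEW_LIMIT = 0.75  # assumed cutoff, not taken from the write-up

# Sample skewness per column; flag the strongly skewed ones.
skewness = train.skew()
skewed_cols = skewness[skewness.abs() > SKEW_LIMIT].index.tolist()

# Apply log(1 + x) to the skewed columns only.
transformed = train.copy()
for col in skewed_cols:
    transformed[col] = np.log1p(train[col])
```

If the target is log-transformed like this, predictions need to be mapped back with `np.expm1` before they are reported as prices.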


#### Categorical variables

- Checked the levels of each variable to see how the observations are distributed across them

![image.png](img10.png)

The variables Utilities and Street were dropped, since nearly all of their values fall in a single level.
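One way to run this level check is sketched here. The cutoff `DOMINANCE_LIMIT` is an assumption, since the write-up does not give one, and the data is a toy example.

```python
import pandas as pd

# Toy data: Street and Utilities have a single level, MSZoning does not.
train = pd.DataFrame({
    "Street": ["Pave"] * 10,
    "Utilities": ["AllPub"] * 10,
    "MSZoning": ["RL", "RL", "RM", "RL", "FV", "RL", "RM", "RL", "RH", "RL"],
})

# Share of rows taken by the most frequent level of each column.
top_level_share = train.apply(lambda s: s.value_counts(normalize=True).iloc[0])

DOMINANCE_LIMIT = 0.99  # assumed cutoff for "values are in one level"

# Drop the columns that carry almost no variation.
single_level_cols = top_level_share[top_level_share > DOMINANCE_LIMIT].index.tolist()
reduced = train.drop(columns=single_level_cols)
```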

- Used LabelEncoder for quality-measure variables such as FireplaceQu and BsmtQual

- Used dummy encoding for the other categorical variables, such as BsmtExposure and MSZoning
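The two encodings can be sketched with scikit-learn's `LabelEncoder` and pandas' `get_dummies`. The toy values are hypothetical. Note that `LabelEncoder` assigns codes by sorted label order, not by quality rank, so the integers do not follow the Ex > Gd > TA ordering.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data: one quality measure and one nominal variable.
train = pd.DataFrame({
    "BsmtQual": ["Gd", "TA", "Ex", "None", "Gd"],
    "MSZoning": ["RL", "RM", "RL", "FV", "RL"],
})

# Label-encode the quality measure (codes follow sorted label order).
encoder = LabelEncoder()
train["BsmtQual"] = encoder.fit_transform(train["BsmtQual"])

# One indicator column per level for the nominal variable.
encoded = pd.get_dummies(train, columns=["MSZoning"])
```

If the quality ranking matters to the model, an explicit mapping from level to rank is an alternative to `LabelEncoder`.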

### Model Building

- LASSO Regression - Lasso performs L1 regularization, which adds a penalty proportional to the sum of the absolute values of the coefficients

- Ridge Regression - Ridge performs L2 regularization, which adds a penalty proportional to the sum of the squared coefficients

- Elastic Net Regression - A hybrid of LASSO and Ridge that combines the L1 and L2 penalties

- Boosting - Builds a prediction model as an ensemble of weak prediction models; XGBoost and LightGBM were used

- RandomForest - An ensemble of decision trees trained with the bagging method
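For reference, the three penalized regressions can be written as the following objectives. The notation is an assumption introduced here, not taken from the project: $y$ is the target vector, $X$ the feature matrix, $\beta$ the coefficient vector with $p$ entries, and $\lambda, \lambda_1, \lambda_2 \ge 0$ the penalty strengths.

$$
\begin{aligned}
\text{LASSO:} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \\
\text{Ridge:} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Elastic Net:} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert + \lambda_2 \sum_{j=1}^{p} \beta_j^2
\end{aligned}
$$

The L1 term can shrink some coefficients exactly to zero, which is why LASSO also acts as a feature selector, while the L2 term only shrinks them toward zero.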

### Model RMSE Comparison

![image.png](img12.png)
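A comparison of this kind can be sketched with scikit-learn as follows. This is an illustration under assumptions: the write-up does not say how RMSE was computed, so 5-fold cross-validation is assumed; the data is synthetic; the hyperparameters are arbitrary; and XGBoost and LightGBM are left out to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the prepared features.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Candidate models with arbitrary, untuned hyperparameters.
models = {
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}


def cv_rmse(model, X, y, folds=5):
    """Mean RMSE across cross-validation folds."""
    # scikit-learn returns negated MSE, so flip the sign first.
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=folds)
    return float(np.sqrt(mse).mean())


rmse_by_model = {name: cv_rmse(model, X, y) for name, model in models.items()}

# The model with the lowest cross-validated RMSE.
best_model = min(rmse_by_model, key=rmse_by_model.get)
```

The ranking on synthetic data says nothing about the ranking on the Ames data; the sketch only shows how the scores can be put on a common footing.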


### Final Model Selection - LASSO
![image.png](img13.png)