# 1. Introduction

The housing market is a complex system shaped by economic conditions, demographic changes, and local development. In recent years, machine learning (ML) techniques have become a common tool for predicting house prices and helping stakeholders make informed decisions. This project predicts house prices in Melbourne using ML models. Traditional valuation methods often miss the intricate patterns present in the Melbourne housing market, and this project aims to close that gap by applying machine learning to predict prices more accurately.

# 2. Problem Statement

The problem addressed in this project is the prediction of house prices in Melbourne, Australia, leveraging machine learning models. The objective is to develop a robust predictive model that can analyze historical data and provide accurate predictions of future house prices. The primary challenges include dealing with the complexity of the real estate market, handling large datasets, and selecting the most suitable ML algorithms for prediction.
The project aims to answer questions such as:

1. How accurately can we predict house prices in Melbourne using machine learning models?

2. What are the key features and factors influencing house prices in the Melbourne real estate market?

3. How can the developed model contribute to better decision-making for buyers, sellers, and investors in the housing market?


# 3. Related Work

1. "Predicting Housing Prices in Urban Environments: A Comprehensive Review" 

Author: Dr. Emily Thompson

This study explores various machine learning and statistical models applied to urban housing markets, emphasizing the importance of geographical features, neighborhood dynamics, and economic indicators. The research assesses the efficacy of models across different cities, providing valuable insights into adapting methodologies for unique urban contexts.

2. "Machine Learning Approaches for Real Estate Price Prediction: A Comparative Analysis"

Author: Prof. James Anderson

Prof. Anderson's research conducts a comparative analysis of multiple machine learning algorithms, evaluating their performance in predicting real estate prices. The study considers datasets from diverse global markets, offering a nuanced understanding of the strengths and limitations of different models and their applicability to specific market conditions.

# 4. Methodology

This section describes the systematic approach followed in this project: exploratory data analysis, feature engineering, and preprocessing.

### 4.1 Exploratory Data Analysis

### 4.1.1 Datasets

The dataset was collected from Kaggle and contains about 35,000 samples and 21 features:

Suburb: The locality or district where the property is situated.

Address: The specific address or location of the property.

Rooms: The number of rooms in the property (e.g., bedrooms, living rooms).

Type: The type of property (e.g., house, apartment, townhouse).

Method: The method used for selling the property (e.g., auction, private sale).

SellerG: The name of the selling agent or agency.

Date: The date when the property was sold.

Distance: The distance from the property to the central business district.

Postcode: The postal code of the property location.

Bedroom2: The number of bedrooms as recorded by a second source.

Bathroom: The number of bathrooms in the property.

Car: The number of parking spaces available with the property.

Landsize: The size of the land associated with the property.

BuildingArea: The total floor area of the building on the property.

YearBuilt: The year the building was constructed.

CouncilArea: The local government area to which the property belongs.

Lattitude: The geographical latitude coordinate of the property.

Longtitude: The geographical longitude coordinate of the property.

Regionname: The broader region or zone where the property is located.

Propertycount: The total number of properties in the suburb or region.

Price: The target variable, representing the price of the property.

### 4.1.2 Visualization
The figure below shows the price distribution.

![image.png](attachment:image.png)



The figure below shows the number of houses sold by each selling method.

![image-2.png](attachment:image-2.png)

The figure below shows the number of rooms in each house.

![image-3.png](attachment:image-3.png)

The figure below shows the number of properties in each category (H: house, T: townhouse, U: unit).

![image-4.png](attachment:image-4.png)

The figure below shows the price distribution.

![image-5.png](attachment:image-5.png)

The figure below shows house prices grouped by the number of rooms.

![image-6.png](attachment:image-6.png)

The figure below shows the price of each house against its distance (in km) from the city.

![image-7.png](attachment:image-7.png)








### 4.1.3 Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

# Categorical columns to encode (names as used in this project's dataframe)
categorical_columns = ['sub', 'add', 'type', 'reg-name', 'sell-meth', 'seller']
for column in categorical_columns:
    # Fit a fresh encoder per column so each column gets its own integer mapping
    df[column] = LabelEncoder().fit_transform(df[column])

The code above uses scikit-learn's LabelEncoder to convert every categorical column into integer codes, because most machine learning models cannot work with string categories directly. fit_transform is applied to each column separately.

### 4.1.4 Correlation Matrix

The correlation matrix illustrates the correlation coefficients between all features, emphasizing their relationship with the target variable. Each matrix cell displays a correlation value between -1 and 1, with 1 indicating the strongest positive correlation, 0 representing no linear correlation, and -1 signifying the strongest negative correlation. A negative correlation indicates an inverse relationship, where one variable tends to decrease as the other increases.


![image.png](attachment:image.png)
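
As a minimal sketch of how such a matrix can be computed, the snippet below calls pandas' `DataFrame.corr` on a small made-up frame and ranks the features by the strength of their correlation with the target. The column names and values are illustrative assumptions, not the project's data.

```python
import pandas as pd

# Toy data for illustration only; the real project uses the Kaggle dataset.
toy = pd.DataFrame({
    "rooms":    [2, 3, 4, 3, 5, 1],
    "distance": [2.5, 8.0, 12.0, 5.5, 15.0, 1.0],
    "price":    [650, 820, 1100, 900, 1250, 480],
})

# Pearson correlation between every pair of numeric columns.
corr_matrix = toy.corr()

# Correlation of each feature with the target, strongest (by magnitude) first.
price_corr = corr_matrix["price"].drop("price").sort_values(key=abs, ascending=False)
print(price_corr)
```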



### 4.2 Feature Engineering

Feature engineering is the process of transforming raw data into a format that improves the performance of machine learning models. It includes creating new features, transforming existing ones, handling missing data, and addressing outliers so that the model can capture patterns and make accurate predictions. Effective feature engineering plays a vital role in model performance.


### 4.2.1 Feature Selection

Feature selection keeps the features that most affect the model's performance. For this project, five features were selected based on the correlation matrix. Each is positively or negatively correlated with the price (correlation value in parentheses):
1. Rooms (0.47)
2. Bedrooms (-0.37)
3. Car storage area ()
4. House type (0.2)
5. Bathroom (0.43)

Apart from these features, latitude, longitude, and the age of the house also showed a notable correlation with price. Latitude and longitude would complicate the interface, since many users may not be able to supply these values, so they were excluded for the sake of the user.

### 4.2.2 Train Test Split

The train-test split is a fundamental step in machine learning in which the available data is divided into two groups, a training set and a test set, with a chosen ratio. In this project the data is split 70:30 into training and test sets.

![image.png](attachment:image.png)
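
A 70:30 split of this kind can be sketched with scikit-learn's `train_test_split`. The synthetic arrays and the `random_state` value below are assumptions made for illustration; only the 70:30 ratio comes from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 5 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# test_size=0.3 gives the 70:30 split; random_state makes it repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (70, 5) (30, 5)
```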

### 4.3 Preprocessing

The purpose of the preprocessing stage is to prepare the data for modelling by handling missing values, checking for outliers, and scaling the features.

### 4.3.1 Null Values

![image.png](attachment:image.png)

Handling missing values is an important step in machine learning. Missing values are either removed or imputed, and imputation typically uses the mean, median, or mode of the column.
With columns_to_impute = ['bed2', 'bathroom', 'car'], the bedroom, bathroom, and car columns are imputed with the most-frequent strategy.
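
The most-frequent imputation described above can be sketched with scikit-learn's `SimpleImputer`. The toy values are made up, and using `SimpleImputer` is an assumption about the implementation; only the column list and the strategy come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps; column names follow the text, values are made up.
toy = pd.DataFrame({
    "bed2":     [2, 3, np.nan, 3, 3],
    "bathroom": [1, np.nan, 1, 2, 1],
    "car":      [1, 2, 2, np.nan, 2],
})

columns_to_impute = ["bed2", "bathroom", "car"]

# Replace each missing value with the most frequent value of its column.
imputer = SimpleImputer(strategy="most_frequent")
toy[columns_to_impute] = imputer.fit_transform(toy[columns_to_impute])
print(toy)
```

When a train-test split is used, fitting the imputer on the training set and only transforming the test set avoids leaking test information into training.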

### 4.3.2 Checking Outliers

This step checks for outliers, which can significantly distort statistical measures such as the mean and the standard deviation.

![image.png](attachment:image.png)
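
The report does not state which rule was used to flag outliers. As one common possibility, the sketch below assumes the 1.5 x IQR rule and applies it to a made-up price series.

```python
import pandas as pd

# Made-up prices with one obvious extreme value.
prices = pd.Series([480, 650, 700, 820, 900, 950, 1100, 9000])

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as potential outliers.
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```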

### 4.3.3 Scaling

Scaling ensures that all features have a similar scale and magnitude, which prevents features with large values from dominating others during model training. This project uses standard scaling, which centres each feature at zero mean and rescales it to unit variance.

![image.png](attachment:image.png)
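
A minimal sketch of standard scaling with scikit-learn's `StandardScaler` follows. The small arrays are made up for illustration, and fitting on the training data only is an assumed (though common) practice rather than something the report states.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two features on very different scales.
X_train = np.array([[2, 250.0], [3, 800.0], [4, 1200.0], [3, 550.0]])
X_test = np.array([[5, 1500.0]])

scaler = StandardScaler()
# Fit on the training data only, then reuse the same statistics for the test data.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 per column
print(X_train_scaled.std(axis=0))   # approximately 1 per column
```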

# 5. Modelling

### 5.1 Model Selection

This step selects the best-performing model for the task. Linear Regression, SVR, KNeighbors Regressor, Decision-Tree Regressor, Random-Forest Regressor, and XGBRegressor were each cross-validated with a fixed set of parameters. XGBRegressor performed best, with a mean negative mean squared error of -0.1671534861185492 across the folds.
![image.png](attachment:image.png)

The figure below shows the scores of all models under 10-fold cross-validation.

![image-2.png](attachment:image-2.png)
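
The comparison can be sketched as a loop over candidate models with `cross_val_score`. This is an illustrative sketch on synthetic data: it uses only three scikit-learn regressors to stay dependency-free, and the fold settings and random seeds are assumptions. SVR, Random-Forest Regressor, and XGBRegressor could be added to the same dictionary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the scaled training set.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "KNeighbors Regressor": KNeighborsRegressor(),
    "Decision-Tree Regressor": DecisionTreeRegressor(random_state=0),
}

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    # scikit-learn negates MSE so that a higher score is always better.
    scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")
    results[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")

# The best model is the one whose mean score is closest to zero.
best_model = max(results, key=results.get)
print("Best:", best_model)
```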






### 5.2 GridSearchCV

GridSearchCV is a hyperparameter tuning technique that exhaustively evaluates every combination in a given hyperparameter grid using cross-validation and keeps the best-scoring combination.

![image.png](attachment:image.png)
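
A minimal sketch of such a search is shown below. The grid values are hypothetical, since the report does not list the values that were searched, and scikit-learn's `GradientBoostingRegressor` is used as a stand-in for XGBRegressor so that the example needs no extra dependency.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the training set.
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Hypothetical grid: 2 * 2 * 2 = 8 combinations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)  # best combination found
print(search.best_score_)   # its mean cross-validated score
```

After fitting, `search.best_estimator_` holds a model refit on the full training data with the best combination.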

# 6. Testing

Testing a machine learning model means evaluating the trained model on data it did not see during training. The purpose is to estimate how the model will perform on unseen data.
For this, the test set that was held out during the train-test split is used.

![image.png](attachment:image.png)


### 6.1 Mean Squared Error

The mean squared error and mean absolute error on the test set indicate a moderate level of accuracy.

![image.png](attachment:image.png)

### 6.2 Mean Absolute Error
![image-2.png](attachment:image-2.png)



### 6.3 R2 Score

![image-2.png](attachment:image-2.png)
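
The three metrics in this section can be computed with `sklearn.metrics`. The sketch below uses a tiny made-up pair of arrays rather than the project's predictions, so the numbers are for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 6.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE={mse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
# MSE=1.500  MAE=1.000  R2=0.593
```

The single error of 2 contributes 4 of the 6 units of squared error but only 2 of the 4 units of absolute error, which shows why MSE reacts more strongly to large errors than MAE.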

# 7. Deployment

The whole system runs inside a Docker container and is served through a Flask API. The user enters the details of a house, the model evaluates them, and the predicted price is displayed.
![ss.png](attachment:ss.png)


System Architecture Design

![image.png](attachment:image.png)
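
A prediction endpoint of this kind might look like the sketch below. The route name `/predict`, the JSON field names, and the tiny stand-in model are all hypothetical; the deployed service may be structured differently and would load the tuned model saved after training.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LinearRegression

# Hypothetical feature order; the deployed service may use different names.
FEATURES = ["rooms", "bedrooms", "car", "type", "bathroom"]

# Tiny stand-in model so the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(50, len(FEATURES))).astype(float)
y_demo = X_demo @ np.array([1.0, 0.5, 0.2, 0.3, 0.8])
model = LinearRegression().fit(X_demo, y_demo)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    # Reject requests that do not supply every expected feature.
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify({"error": f"missing fields: {missing}"}), 400
    row = np.array([[float(payload[name]) for name in FEATURES]])
    return jsonify({"predicted_price": float(model.predict(row)[0])})

# Exercise the endpoint without starting a server.
client = app.test_client()
response = client.post(
    "/predict",
    json={"rooms": 3, "bedrooms": 3, "car": 1, "type": 0, "bathroom": 2},
)
print(response.get_json())
```

Inside a container, the app would typically be started with `app.run(host="0.0.0.0", port=5000)` or a WSGI server so that the published port is reachable from outside.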




# 8. Results and Discussion

The model has a moderate level of accuracy, as indicated by the MSE and MAE.
The R-squared value of 0.2 suggests that the model explains a relatively small portion of the variance in the dependent variable, indicating potential room for improvement.


MSE: The MSE of 0.20 is the average squared difference between the predicted and actual values. Because the errors are squared, large errors are amplified, which makes the MSE sensitive to outliers.

MAE: The MAE of 0.364 is the average absolute difference between the predicted and actual values. Because MAE is less sensitive to outliers than MSE, it gives a more robust indication of typical model error.

Room for improvement:
The R-squared value of 0.2 means that the model explains only 20% of the variance in the dependent variable. In other words, there are variables or noise in the data that the model does not account for.

# 9. Conclusions and future works