# **Real Estate Price Prediction**


# **Table of Contents**
* [1.Introduction](#1)
* [2.Import Libraries](#2)
* [3.Load and Split Data](#3)
* [4.Data Understanding](#4)
* [5.Data Preparation](#5)
  * [Handling the missing values](#6)
  * [Convert type of attributes](#7)
  * [Preprocessing Data](#8)


## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

<a id="1"></a> <br>
# **1. Introduction**

The dataset we analyze was scraped from Immoweb.be in Belgium. It describes each house and apartment listed for sale. The dataset has 52,077 rows and 20 columns.

#### **Attributes:**
The columns, as inferred from their names and from how they are used later in this notebook:

* "locality" - postal code of the house/apartment
* "type_of_property" - house or apartment
* "subtype_of_property" - more specific subtype of the property
* "price" - listed price (some rows contain "no price")
* "type_of_sale" - type of sale
* "number_of_rooms" - number of rooms
* "house_area" - area of the house/apartment
* "fully_equipped_kitchen" - whether the kitchen is fully equipped
* "furnished" - whether the property is furnished
* "open_fire" - whether the property has an open fire
* "terrace" - whether the property has a terrace
* "terrace_area" - area of the terrace
* "garden" - whether the property has a garden
* "garden_area" - area of the garden
* "surface_of_the_land" - surface of the land
* "surface_of_the_plot_of_land" - surface of the plot of land
* "number_of_facades" - number of facades
* "swimming_pool" - whether the property has a swimming pool
* "state_of_the_building" - condition of the building
* "construction_year" - year of construction

This dataset does not include the longitude and latitude of each locality, so we also use the [zipcode data of Belgium](https://github.com/jief/zipcode-belgium/blob/master/zipcode-belgium.csv).

## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

<a id="2"></a> <br>
# **2. Import Libraries**

In [None]:
# Load data libraries
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O
from sklearn.model_selection import train_test_split

# For visualizations
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Data preparation
from sklearn.preprocessing import RobustScaler, StandardScaler
from datetime import datetime
import math

## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

<a id="3"></a> <br>
# **3. Load and Split the Data**

In [None]:
# First, let's load the data
df = pd.read_csv("../data/dataset_house_apartment.csv")

df.head()

In [None]:
df.shape

In [None]:
# Load zipcode data of Belgium
zipcode = pd.read_csv("../data/code-postaux-belge.csv", sep=";")

In [None]:
zipcode.head()

In [None]:
# Drop empty columns
zipcode.drop(columns=["coordonnees", "geom"], inplace=True)

# Rename the columns
zipcode.rename(
    columns={
        "column_1": "locality",
        "column_2": "city_name",
        "column_3": "lattitude",
        "column_4": "longitude",
    },
    inplace=True,
)

# Drop the localities' duplicates
zipcode.drop_duplicates(subset=["locality"], inplace=True)
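
`drop_duplicates(subset=["locality"])` keeps the first row it meets for each postal code, so when several city names share one code only the first one survives. A minimal sketch on made-up rows (the values are illustrative and not taken from the zipcode file):

```python
import pandas as pd

# Hypothetical rows: two city names share postal code 1000
toy_zip = pd.DataFrame(
    {
        "locality": [1000, 1000, 9000],
        "city_name": ["Bruxelles", "Brussel", "Gent"],
    }
)

# keep="first" is the default: the first row per locality survives
deduped = toy_zip.drop_duplicates(subset=["locality"])
print(deduped)
```

If a different row should win, sorting the table before dropping duplicates is one way to control which one is kept.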

In [None]:
zipcode.shape

Merge the real estate data with the zipcode data.

In [None]:
dfinal = pd.merge(df, zipcode, on=["locality"], how="inner")

In [None]:
dfinal.shape
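
An inner merge silently drops every listing whose postal code is missing from the zipcode table. The sketch below, on hypothetical toy tables, shows one way to count such rows with the `indicator` parameter of `pd.merge` before committing to the inner join:

```python
import pandas as pd

# Hypothetical listings and zipcode tables (illustrative values only)
toy_listings = pd.DataFrame(
    {"locality": [1000, 9000, 1234], "price": [250000, 310000, 180000]}
)
toy_zip = pd.DataFrame(
    {"locality": [1000, 9000], "city_name": ["Bruxelles", "Gent"]}
)

# A left merge with indicator=True flags listings without a zipcode match
checked = pd.merge(toy_listings, toy_zip, on="locality", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(f"Listings without a zipcode match: {len(unmatched)}")

# The inner merge keeps only the matched rows
inner = pd.merge(toy_listings, toy_zip, on="locality", how="inner")
print(inner.shape)
```

Comparing `df.shape` with `dfinal.shape` gives the same count on the real data, assuming each postal code appears only once in `zipcode`.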

In [None]:
dfinal.head(5)

## - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

<a id="4"></a> <br>
# **4. Data Understanding**

In [None]:
dfinal.info()

Check the number of duplicated rows in the data, then drop them.

In [None]:
dfinal.duplicated().sum()

In [None]:
dfinal.drop_duplicates(inplace=True)

In [None]:
# Rename the 'locality' column to 'postal_code'

dfinal.rename(columns={"locality": "postal_code"}, inplace=True)

In [None]:
# Drop columns with only 1 unique value
dfinal.drop(
    columns=["type_of_sale", "furnished", "surface_of_the_plot_of_land"], inplace=True
)
dfinal.shape
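
The columns to drop can also be found programmatically rather than listed by hand. A minimal sketch, assuming a toy frame with hypothetical values, that uses `nunique()` to find columns with at most one distinct value:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "type_of_sale": ["regular sale"] * 3,  # hypothetical constant value
        "price": [250000, 310000, 180000],
        "furnished": [None, None, None],  # all missing: zero distinct values
    }
)

# nunique() ignores missing values by default, so an all-missing column counts as 0
unique_counts = toy.nunique()
constant_cols = unique_counts[unique_counts <= 1].index.tolist()
print(constant_cols)
```

On the real data, `dfinal.nunique()` would show whether the three dropped columns are the only ones of this kind.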

In [None]:
dfinal.describe()

Check missing values in the data.

In [None]:
dfinal.isnull().sum()

In [None]:
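def display_only_missing(df):
    """Print the percentage of missing values for each column that has any."""
    all_data_na = (df.isnull().sum() / len(df)) * 100
    # Keep only the columns with at least one missing value, largest ratio first
    all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(
        ascending=False
    )
    missing_data = pd.DataFrame({"Missing Ratio": all_data_na})
    print(missing_data)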

In [None]:
print("Percentage Missing Value %")
display_only_missing(dfinal)
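
Note that `isnull()` only counts real `NaN` values. The cells in section 5 suggest that several columns store missing values as the string "None", and those would not show up in the ratios above. A hedged sketch on a toy frame (hypothetical values) of how such placeholders could be included in the count:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "house_area": ["120", "None", "85", "None"],  # placeholder strings
        "garden_area": [50.0, np.nan, np.nan, np.nan],  # real NaN values
    }
)

# isnull() only sees real NaN values, so "None" strings are not counted
ratio_raw = toy.isnull().mean() * 100

# Treat the "None" placeholder as missing before counting
ratio_with_placeholders = toy.replace("None", np.nan).isnull().mean() * 100
print(ratio_with_placeholders)
```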

In [None]:
# Calculate the missing ratio for each attribute
missing_ratio = (dfinal.isnull().sum() / len(dfinal)) * 100

# Create a DataFrame to store the missing values information
missing_data = pd.DataFrame(
    {"Attribute": dfinal.columns, "MissingRatio": missing_ratio}
)
missing_data = missing_data[missing_data["MissingRatio"] > 0].sort_values(
    by="MissingRatio", ascending=False
)

# Plot the missing values with values at the top of each bar
plt.figure(figsize=(12, 8))
barplot = sns.barplot(
    x="Attribute", y="MissingRatio", data=missing_data, palette="viridis"
)

# Add values at the top of each bar
for index, value in enumerate(missing_data["MissingRatio"]):
    barplot.text(
        index,
        value + 0.2,
        f"{value:.2f}%",
        ha="center",
        va="bottom",
        fontsize=8,
        color="black",
    )

plt.title("Attributes with Missing Values and Their Quantity")
plt.xlabel("Attribute")
plt.ylabel("Missing Ratio (%)")
plt.xticks(rotation=45, ha="right")  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

<a id="5"></a> <br>
# **5. Data Preparation**

<a id="6"></a> 
### **Handling the missing values**

We drop the attributes with too many null values, namely "garden_area", "terrace_area" and "construction_year".

In [None]:
dfinal.drop(columns=["garden_area", "terrace_area", "construction_year"], inplace=True)
dfinal.shape

We impute the missing values of the "surface_of_the_land" attribute by replacing "None" values with 0.

In [None]:
dfinal["surface_of_the_land"] = (
    dfinal["surface_of_the_land"].replace({"None": 0}).astype(int)
)

dfinal.surface_of_the_land.value_counts()
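
`astype(int)` raises an error if the column holds any other non-numeric string besides "None". As an alternative sketch (not the method used in this notebook), `pd.to_numeric` with `errors="coerce"` turns every non-numeric value into `NaN` first; the sample values below are hypothetical:

```python
import pandas as pd

# Hypothetical raw values as they might appear in an object column
raw = pd.Series(["250", "None", "1200", "None"], name="surface_of_the_land")

# errors="coerce" turns anything non-numeric (including "None") into NaN
as_number = pd.to_numeric(raw, errors="coerce")

# Fill the gaps with 0 and convert to integer
surface = as_number.fillna(0).astype(int)
print(surface.tolist())
```

The trade-off is that coercion also hides unexpected values, so checking `as_number.isna().sum()` against the number of "None" entries is a useful sanity check.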

For the "state_of_the_building" attribute, we replace "None" values with "unknown".

In [None]:
dfinal["state_of_the_building"] = dfinal["state_of_the_building"].replace(
    {"None": "unknown"}
)
dfinal.state_of_the_building.value_counts()

For the "number_of_facades" attribute, we replace "None" values with 0.

In [None]:
dfinal["number_of_facades"] = (
    dfinal["number_of_facades"].replace({"None": 0}).astype(int)
)
dfinal.number_of_facades.value_counts()

For the "house_area" and "number_of_rooms" attributes, we drop the rows with missing values.

In [None]:
dfinal = dfinal[(dfinal["house_area"] != "None")]
dfinal.house_area.value_counts()

In [None]:
dfinal = dfinal[(dfinal["number_of_rooms"] != "None")]
dfinal.number_of_rooms.value_counts()

The "price" attribute contains rows with the value "no price". We count these rows and remove them.

In [None]:
num_rows_with_no_price = (dfinal["price"] == "no price").sum()

print(f"Number of rows with 'no price': {num_rows_with_no_price}")

dfinal = dfinal[dfinal["price"] != "no price"]
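
The three row filters in this section can also be combined into a single boolean mask, which makes it easy to report how many rows are removed in total. A minimal sketch on a toy frame with hypothetical values:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "price": ["250000", "no price", "310000", "180000"],
        "house_area": ["120", "85", "None", "60"],
        "number_of_rooms": ["3", "2", "4", "None"],
    }
)

# One boolean mask that is True only for rows passing all three checks
keep = (
    (toy["price"] != "no price")
    & (toy["house_area"] != "None")
    & (toy["number_of_rooms"] != "None")
)
cleaned = toy[keep]
print(f"Dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```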

<a id="7"></a> 
### **Convert type of attributes**

The "type_of_property" attribute is a categorical variable with two categories, "house" and "apartment", so we can convert it to a numeric column with binary encoding.

In [None]:
dfinal["type_of_property"] = dfinal["type_of_property"].replace(
    {"house": 0, "apartment": 1}
)
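
`replace` leaves any label that is not in the dictionary untouched, so a stray category would stay in the column as a string. As an alternative sketch, `Series.map` returns `NaN` for unmapped labels, which makes them easy to spot; the stray label below is hypothetical:

```python
import pandas as pd

# The last value is a hypothetical stray label
toy_type = pd.Series(["house", "apartment", "house", "HOUSE"])

# map() returns NaN for labels missing from the dictionary
encoded = toy_type.map({"house": 0, "apartment": 1})

# List the labels that were not encoded
unexpected = toy_type[encoded.isna()].unique().tolist()
print(unexpected)
```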

Convert the data type of the "price", "number_of_rooms" and "house_area" attributes from object to integer.

In [None]:
dfinal["price"] = dfinal["price"].astype(int)
dfinal["number_of_rooms"] = dfinal["number_of_rooms"].astype(int)
dfinal["house_area"] = dfinal["house_area"].astype(int)

This leaves 51532 instances with 23 different variables to work on.