# Melbourne Housing - Data Analysis

The objective of this analysis is to explore the Melbourne housing dataset in order to identify how property prices behave across regions, suburbs, land size, and location, using exploratory data analysis and geospatial visualization.

The dataset was first inspected to understand its structure, available features, and overall completeness.
Most core variables, such as price, location, and property characteristics, are well populated, which makes exploratory analysis feasible.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import zipfile
from pathlib import Path

In [2]:
file_path = Path("archive.zip")
folder = Path("data")

with zipfile.ZipFile(file_path, "r") as zip_ref:
    zip_ref.extractall(folder)

csv_path = folder / "melb_data.csv"
df = pd.read_csv(csv_path)

In [3]:
df.head()

In [4]:
df.shape

In [5]:
df.info()

In [6]:
df.describe()

In [7]:
df["Price"].describe()

In [8]:
df.isnull().sum().sort_values(ascending=False)

In [9]:
(df.isnull().mean() * 100).sort_values(ascending=False)

Missing Data Evaluation

Some variables contain a large number of missing values, notably BuildingArea and YearBuilt.
Due to the high proportion of null entries, these variables were excluded from the analysis to avoid introducing bias or unreliable assumptions.
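
As a rough sketch of that exclusion rule, the snippet below drops every column whose share of missing values exceeds a cutoff. The 40% threshold, the helper name `drop_sparse_columns`, and the toy values are assumptions made for illustration; the analysis itself does not state a specific cutoff.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, max_missing: float = 0.40) -> pd.DataFrame:
    """Return a copy of `frame` without columns whose null share exceeds `max_missing`."""
    missing_share = frame.isnull().mean()  # fraction of nulls per column
    keep = missing_share[missing_share <= max_missing].index
    return frame.loc[:, keep].copy()


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame(
    {
        "Price": [500_000, 750_000, 1_200_000, 640_000],
        "BuildingArea": [np.nan, np.nan, 150.0, np.nan],  # 75% missing
        "YearBuilt": [np.nan, 1970.0, np.nan, np.nan],    # 75% missing
        "Landsize": [200.0, 350.0, np.nan, 600.0],        # 25% missing
    }
)

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

With the real data, the same helper could be applied to `df` once a threshold has been chosen from the missing-value percentages computed above.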

### Price Analysis

Property prices present a right-skewed distribution with significant outliers.
Because of this, the median is used as the primary measure of central tendency instead of the mean.
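
To illustrate why the median is the safer summary here, the sketch below uses five made-up prices, one of which is an extreme outlier. The numbers are hypothetical and are not drawn from the dataset.

```python
import pandas as pd

# Four typical prices plus one very expensive property (hypothetical values).
sample_prices = pd.Series([400_000, 500_000, 600_000, 700_000, 9_000_000])

mean_price = sample_prices.mean()      # pulled upward by the single outlier
median_price = sample_prices.median()  # unaffected by how extreme the outlier is
skewness = sample_prices.skew()        # positive for a right-skewed sample

print(f"mean:   {mean_price:,.0f}")
print(f"median: {median_price:,.0f}")
print(f"skew:   {skewness:.2f}")
```

The mean lands far above four of the five values, while the median stays at a price that a typical listing in the sample actually resembles.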

1 - By Region

The analysis starts by examining property prices across Melbourne regions.
Significant price disparities are observed, reinforcing that regional location is a major determinant of property value.

Median prices are preferred over averages due to the presence of high-value outliers that skew the mean.

In [None]:
priceByRegion = (
    df.groupby("Regionname")
      .agg(
          avg_price=("Price", "mean"),
          median_price=("Price", "median"),
          listings=("Price", "count")
      )
      .sort_values("median_price", ascending=False)
      .reset_index()
)

pd.options.display.float_format = '{:,.2f}'.format
priceByRegion


In [15]:
fig = px.bar(
    priceByRegion,
    x="Regionname",
    y="avg_price",
    title="Average property price by region",
    text_auto=".2s"
)

fig.update_layout(
    xaxis_title = "Region",
    yaxis_title = "Average Price",
    yaxis_tickprefix = "$"
)

fig.show()

2 - Price Analysis by Land Size

Property prices were analyzed based on land size to evaluate whether larger parcels consistently command higher prices.
While price generally increases with land size at lower ranges, this trend does not hold for very large plots.

This suggests diminishing returns as land size grows.
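
Because the comparison depends on how land sizes are grouped, the sketch below shows on toy values how `pd.cut` assigns sizes to bins of this kind. By default the intervals are closed on the right and open on the left, so a size of exactly 0 falls outside the first bin. Using `np.inf` as the last edge and passing `include_lowest=True` are shown as possible alternatives, not as what the analysis does.

```python
import numpy as np
import pandas as pd

# Toy land sizes in square metres (not taken from the dataset).
sizes = pd.Series([0, 50, 100, 101, 2000, 5000], name="Landsize")

edges = [0, 100, 200, 300, 500, 800, 1200, 2000, np.inf]  # open-ended top bin
labels = ["0-100", "100-200", "200-300", "300-500",
          "500-800", "800-1200", "1200-2000", "2000+"]

# Default: intervals are (left, right], so 0 is not assigned to any bin.
default_bins = pd.cut(sizes, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 is kept.
with_zero = pd.cut(sizes, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({"size": sizes, "default": default_bins, "with_zero": with_zero}))
```

Whether zero-sized records should be kept is a modelling decision; in listings data a land size of 0 may simply mean the value was not recorded.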

In [1]:
df["Landsize_bin"] = pd.cut(
    df["Landsize"],
    bins = [0, 100, 200, 300, 500, 800, 1200, 2000, df["Landsize"].max()],
    labels=["0-100", "100-200", "200-300", "300-500", "500-800", "800-1200", "1200-2000", "2000+"]
)

priceByLandSize = (
    df.groupby("Landsize_bin")
    .agg(
        median_price = ("Price", "median"),
        listing = ("Price", "count")
    )
    .query("listing >= 50")
    .reset_index()
)

priceByLandSize


In [22]:
fig = px.bar(
    priceByLandSize,
    x="Landsize_bin",
    y="median_price",
    title="Median property price by land size",
    text_auto=".2s"
)

fig.update_layout(
    xaxis_title = "Land Size",
    yaxis_title = "Median price",
    yaxis_tickprefix = "$"
)

fig.show()

3 - Price × Land Size × Region Analysis

Land size and price were further analyzed jointly by region to understand regional differences.
In several regions, particularly high-demand areas, prices decrease for very large land sizes.

This indicates that location outweighs land size, and large plots are often located in less desirable or less central areas.

In [23]:
priceLandRegion = (
    df.groupby(["Regionname", "Landsize_bin"])
    .agg(
        median_price = ("Price", "median"),
        listings = ("Price", "count")
    )
    .query("listings >= 30")
    .reset_index()
)

priceLandRegion


In [32]:
fig = px.line(
    priceLandRegion,
    x = "Landsize_bin",
    y = "median_price",
    color = "Regionname",
    markers=True,
    title= "Median Property price by Land Size and Region"
)

fig.update_layout(
    xaxis_title = "Land Size (m²)",
    yaxis_title = "Median Price",
    yaxis_tickprefix = "$",
    template = "plotly_white"
)

fig.show()

In [33]:
fig = px.box(
    df,
    x = "Landsize_bin", 
    y = "Price",
    color = "Regionname",
    title = "Price Distribution by Land Size and Region"
)

fig.update_layout(
    xaxis_title = "Land Size (m²)",
    yaxis_title = "Price",
    yaxis_tickprefix = "$",
    template = "plotly_white"
)

fig.show()

In [36]:
fig = px.scatter(
    df,
    x = "Landsize",
    y = "Price",
    color = "Regionname",
    opacity = 0.4,
    title = "Property Price vs Land Size by Region"
)

fig.update_layout(
    xaxis_title = "Land Size (m²)",
    yaxis_title = "Price",
    yaxis_tickprefix = "$",
    template = "plotly_white"
)

fig.show()

4 - Price Analysis by Suburb

A suburb-level analysis provides a finer geographic resolution.
Even within the same region, median prices vary significantly between suburbs.

To ensure meaningful comparisons, only suburbs with a sufficient number of listings were included.

In [37]:
priceBySuburb = (
    df.groupby("Suburb")
    .agg(
        median_price = ("Price", "median"),
        listings = ("Price", "count"),
    )
    .query("listings >= 50")
    .sort_values("median_price", ascending=False)
    .reset_index()
)

priceBySuburb

In [45]:
# .copy() avoids a SettingWithCopyWarning when the label column is added below
top10 = priceBySuburb.head(10).copy()
bottom10 = priceBySuburb.tail(10).copy()

top10["group"] = "Top 10"
bottom10["group"] = "Bottom 10"

suburb_extremes = (
    pd.concat([top10, bottom10])
    .sort_values("median_price", ascending=True)
)

fig = px.bar(
    suburb_extremes,
    x = "median_price",
    y = "Suburb",
    color = "group",
    orientation = "h",
    title = "Top 10 and Bottom 10 Suburbs by Median Property Price",
    text = suburb_extremes["median_price"].map(lambda x: f"${x:,.0f}"),
    # custom_data is split per color group, so hover values stay aligned with each bar
    custom_data = ["listings"]
)

fig.update_layout(
    xaxis_title = "Median Price",
    yaxis_title = "Suburb",
    xaxis_tickprefix = "$",
    template = "plotly_white",
    legend_title_text = ""
)

fig.update_traces(
    hovertemplate=
    "<b>%{y}</b><br>" +
    "Median Price: %{x:$,.0f}<br>" +
    "Listings: %{customdata[0]}"
)

fig.show()

5 - Impact of Room, Bedroom, and Bathroom Counts

Structural characteristics such as number of rooms, bedrooms, and bathrooms were analyzed across price tiers and regions.
These variables show relatively limited variation and appear to act as secondary price modifiers rather than primary drivers.

This reinforces the dominance of location over property layout.

In [46]:
df["Price_bin"] = pd.qcut(
    df["Price"],
    q = 4,
    labels = ["Low", "Mid", "High", "Premium"]
)

In [49]:
roomsByPriceByRegion = (
    df.groupby(["Regionname", "Price_bin"])
    .agg(
        median_rooms = ("Rooms", "median"),
        median_bedrooms = ("Bedroom2", "median"),
        median_bathrooms = ("Bathroom", "median"),
        listings = ("Price", "count")
    )
    .query("listings >= 30")
    .reset_index()
)

fig = px.density_heatmap(
    roomsByPriceByRegion,
    x="Price_bin",
    y="Regionname",
    z="median_rooms",
    color_continuous_scale="Blues",
    title="Median Number of Rooms by Price Tier and Region"
)

fig.update_layout(
    xaxis_title="Price Tier",
    yaxis_title="Region",
    template="plotly_white"
)

fig.show()

6 - Geospatial Visualization of Properties

A geospatial visualization was used to map individual properties across Melbourne.
Clear spatial clustering of prices emerges, with higher-priced properties concentrated in central and premium areas.

Using price categories instead of continuous values improves visual clarity and highlights regional segmentation.
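
As a small illustration of how such price categories are formed, the sketch below applies `pd.qcut` with four quantiles to eight made-up prices. The tier labels match those used in this analysis; the values themselves are hypothetical.

```python
import pandas as pd

# Eight hypothetical prices, already sorted for readability.
toy_prices = pd.Series(
    [300_000, 450_000, 520_000, 610_000, 700_000, 880_000, 1_100_000, 2_500_000]
)

# q=4 splits at the quartiles, so each tier holds roughly a quarter of the rows.
tiers, bin_edges = pd.qcut(
    toy_prices,
    q=4,
    labels=["Low", "Mid", "High", "Premium"],
    retbins=True,  # also return the numeric edges of each tier
)

tier_counts = tiers.value_counts()
print(tier_counts)
print(bin_edges)
```

Because the tiers are quantile-based, a single very expensive property widens only the top tier instead of stretching the whole color range, which is what makes the categorical map easier to read.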

In [56]:
fig = px.scatter_mapbox(
    df,
    lat="Lattitude",
    lon="Longtitude",
    color="Price_bin",
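    # Price_bin is categorical, so a continuous color scale would have no effect;
    # category_orders keeps the legend in tier order instead
    category_orders={"Price_bin": ["Low", "Mid", "High", "Premium"]},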
    zoom=9,
    opacity=0.6,
    hover_data=[
        "Suburb",
        "Price",
        "Rooms",
        "Bathroom",
        "Bedroom2",
        "Landsize"
    ]
)

fig.update_layout(
    mapbox_style="carto-positron",
    margin={"r":0, "t":0, "l":0, "b":0}
)

fig.show()

Final Takeaways

Location is the strongest factor influencing property prices in Melbourne.
Land size has a diminishing impact beyond certain thresholds, and structural features provide marginal adjustments rather than defining value.
Geospatial analysis confirms and visually reinforces these findings.