# **Link of the colab notebook :**
https://colab.research.google.com/drive/1thDBYWpVRJ2I8Kra0fgY0m47e83y1SOH?usp=sharing
# **Link of the dataset :**
https://drive.google.com/file/d/1jzbEEFmpDkssj5HDachcdSivYFty9QXO/view?usp=sharing

# **Step 1: Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualisation style
sns.set(style="whitegrid")

# **Step 2: Loading the Dataset**

In [None]:
#mount the gdrive
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/Exploratory_Data_Analysis/Dataset/melb_data.csv')
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


# **Step 3: Exploring the Data**

In [None]:
# Get basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [None]:
# Summary statistics for numerical columns
df.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [None]:
# Check for missing values
# Count the null values in each column (variable)
df.isnull().sum()

Unnamed: 0,0
Suburb,0
Address,0
Rooms,0
Type,0
Price,0
Method,0
SellerG,0
Date,0
Distance,0
Postcode,0


In [None]:
# Number of unique values in each categorical (non-numeric) column
cols = df.columns
# select_dtypes is the public way to pick out the numeric columns
num_cols = df.select_dtypes(include='number').columns
cat_cols = list(set(cols) - set(num_cols))

df[cat_cols].nunique()


Unnamed: 0,0
CouncilArea,33
Date,58
SellerG,268
Method,5
Suburb,314
Regionname,8
Address,13378
Type,3


In [None]:
# Distinct values of selected categorical columns
print(df['Type'].unique())
print(df['Method'].unique())
print(df['Regionname'].unique())

['h' 'u' 't']
['S' 'SP' 'PI' 'VB' 'SA']
['Northern Metropolitan' 'Western Metropolitan' 'Southern Metropolitan'
 'Eastern Metropolitan' 'South-Eastern Metropolitan' 'Eastern Victoria'
 'Northern Victoria' 'Western Victoria']


In [None]:
# get the number of missing data points per column
missing_values_count = df.isnull().sum()

# look at the number of missing points
missing_values_count

Unnamed: 0,0
Suburb,0
Address,0
Rooms,0
Type,0
Price,0
Method,0
SellerG,0
Date,0
Distance,0
Postcode,0


Find the percentage of missing values in the whole dataset.

In [None]:
# how many total missing values do we have?
total_cells = np.prod(df.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

4.648292306613367
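
The overall percentage hides which columns are responsible for the gaps. A minimal sketch of a per-column breakdown follows. It uses a small made-up frame (`toy`) rather than the real dataset, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 0.0],
    "BuildingArea": [np.nan, 79.0, np.nan, 142.0],
})

# isnull().mean() gives the fraction of missing entries in each column;
# multiplying by 100 turns it into a percentage.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the full DataFrame, the same expression would rank the columns by how incomplete they are.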

# **Step 4: Data Cleaning**

Looking at the data, the columns with missing values are Car, BuildingArea, YearBuilt, and CouncilArea.

Some of these values are probably missing because they were not recorded, rather than because they don't exist. For such columns, it makes sense to estimate (impute) the missing values rather than leave them as NA.

On the other hand, some columns have a large share of missing fields. If a value is missing because it genuinely does not exist, guessing it makes little sense. For such a column, it is more sensible to either leave it empty or replace the NAs with an explicit placeholder category such as "neither".
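
As a rough illustration of the placeholder idea, the sketch below fills a categorical column with an explicit label. The frame is made up, and the label "Unknown" is a hypothetical choice, not something taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with two missing entries.
toy = pd.DataFrame({"CouncilArea": ["Yarra", np.nan, "Example Council", np.nan]})

# Replace NA with an explicit placeholder category instead of guessing a value.
# "Unknown" is an illustrative label; any name that cannot clash with a real
# category would work.
toy["CouncilArea_filled"] = toy["CouncilArea"].fillna("Unknown")
print(toy["CouncilArea_filled"].value_counts())
```

Keeping the placeholder as its own category means later group-bys still show how many rows had no recorded value.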

Dropping rows with missing values is a simple but sometimes drastic approach to handling missing data. Before dropping, it’s important to analyze how many rows will be removed and whether that will lead to significant data loss.

In [None]:
# Drop rows with missing values (if any)
df_cleaned = df.dropna()

In [None]:
# Check for missing values
df_cleaned.isnull().sum()

Unnamed: 0,0
Suburb,0
Address,0
Rooms,0
Type,0
Price,0
Method,0
SellerG,0
Date,0
Distance,0
Postcode,0


### **Markdown Explanation:**
First, we check how many rows and columns contain missing data. If only a small portion of the dataset is missing values, dropping rows may be acceptable.

We also calculate the percentage of rows with missing values to ensure we're not losing too much data. Generally, if more than 5-10% of the dataset has missing values, alternative methods like imputation should be considered.

After dropping rows, we verify the dataset's new shape to confirm the number of rows left.
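
The checks described above can be sketched as follows. The frame here is a tiny stand-in built from the first four rows shown earlier (Price, BuildingArea, YearBuilt), so the percentages are not those of the full dataset.

```python
import numpy as np
import pandas as pd

# Small stand-in for df, using values from the first rows of the dataset.
toy = pd.DataFrame({
    "Price": [1480000.0, 1035000.0, 1465000.0, 850000.0],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan],
    "YearBuilt": [np.nan, 1900.0, 1900.0, np.nan],
})

# A row counts as incomplete if any of its cells is missing.
rows_with_na = toy.isnull().any(axis=1).sum()
pct_rows_with_na = rows_with_na / len(toy) * 100

toy_cleaned = toy.dropna()

print(f"Rows with missing values: {rows_with_na} ({pct_rows_with_na:.1f}%)")
print(f"Shape before: {toy.shape}, after: {toy_cleaned.shape}")
```

Running the same three steps on `df` before calling `dropna()` shows how much data the drop would cost.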

In [None]:
# Remove duplicates
df_cleaned = df_cleaned.drop_duplicates()

In [None]:
# Convert the Date column to datetime format, specifying that the day comes first
df_cleaned['Date'] = pd.to_datetime(df_cleaned['Date'], dayfirst=True)

# **Step 5: Data Transformation**

In [None]:
# Create a new column for the difference between Lattitude and Longtitude
df_cleaned['Difference'] = (df_cleaned['Lattitude'] - df_cleaned['Longtitude'])
df_cleaned.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount,Difference
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,2016-02-04,2.5,3067.0,...,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0,-182.8013
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,2017-03-04,2.5,3067.0,...,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0,-182.8037
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,2016-06-04,2.5,3067.0,...,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0,-182.8013
6,Abbotsford,124 Yarra St,3,h,1876000.0,S,Nelson,2016-05-07,2.5,3067.0,...,0.0,245.0,210.0,1910.0,Yarra,-37.8024,144.9993,Northern Metropolitan,4019.0,-182.8017
7,Abbotsford,98 Charles St,2,h,1636000.0,S,Nelson,2016-10-08,2.5,3067.0,...,2.0,256.0,107.0,1890.0,Yarra,-37.806,144.9954,Northern Metropolitan,4019.0,-182.8014


In [None]:
# Compute the mean distance for each region
region_summary = df_cleaned.groupby('Regionname')['Distance'].mean().reset_index()
print(region_summary)

                   Regionname   Distance
0        Eastern Metropolitan  13.602627
1            Eastern Victoria  34.486957
2       Northern Metropolitan   7.841424
3           Northern Victoria  34.768421
4  South-Eastern Metropolitan  23.950955
5       Southern Metropolitan   8.725623
6        Western Metropolitan   9.747414
7            Western Victoria  30.750000


### **Markdown Explanation:**
This step helps in understanding how the distance varies across different regions. By grouping the data by Regionname, we can calculate the mean distance for each region and generate a concise summary for further analysis.

Grouping by regions allows us to segment the data, potentially uncovering regional patterns or trends that may not be apparent from the overall data.

In [None]:
# Create a new categorical column from Landsize (Small, Medium, Big)
df_cleaned['Landsize_category'] = pd.cut(df_cleaned['Landsize'], bins=[0.0, 200, 500, np.inf], labels=['Small', 'Medium', 'Big'])
df_cleaned.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount,Difference,Landsize_category
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,2016-02-04,2.5,3067.0,...,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0,-182.8013,Small
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,2017-03-04,2.5,3067.0,...,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0,-182.8037,Small
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,2016-06-04,2.5,3067.0,...,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0,-182.8013,Small
6,Abbotsford,124 Yarra St,3,h,1876000.0,S,Nelson,2016-05-07,2.5,3067.0,...,245.0,210.0,1910.0,Yarra,-37.8024,144.9993,Northern Metropolitan,4019.0,-182.8017,Medium
7,Abbotsford,98 Charles St,2,h,1636000.0,S,Nelson,2016-10-08,2.5,3067.0,...,256.0,107.0,1890.0,Yarra,-37.806,144.9954,Northern Metropolitan,4019.0,-182.8014,Medium


### **Markdown Explanation:**
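The pd.cut() function segments the Landsize data into bins (intervals). With the default right=True, each bin includes its right edge but not its left edge, so the bins here are Small (0, 200], Medium (200, 500], and Big (greater than 500). A Landsize of exactly 0 falls outside the first bin and becomes NaN unless include_lowest=True is passed.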

The labels argument assigns the corresponding category names to each bin.

This categorization helps in analyzing the data based on the size of the land, making it easier to group properties for various analyses.
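
To see how values on the bin edges are treated, here is a small sketch on a hand-picked Series (the values are chosen for illustration, not sampled from the dataset). It compares the default call with one that passes `include_lowest=True`.

```python
import numpy as np
import pandas as pd

# Hand-picked values sitting on or next to the bin edges.
landsize = pd.Series([0.0, 156.0, 200.0, 201.0, 500.0, 651.0])
bins = [0.0, 200, 500, np.inf]
labels = ["Small", "Medium", "Big"]

# Default: right-closed intervals, so 0.0 is outside (0, 200] and maps to NaN.
default_cut = pd.cut(landsize, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(landsize, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "Landsize": landsize,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

Whether a zero land size should count as "Small" or stay missing is a modelling decision; the sketch only shows what each option does.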

# **Step 6: Grouping and Aggregation**

In [None]:
# Total Rooms and Average Distance by Regionname
Regionname_summary = df_cleaned.groupby('Regionname').agg({
    'Rooms': 'sum',
    'Distance': 'mean'
}).reset_index()

print(Regionname_summary)

                   Regionname  Rooms   Distance
0        Eastern Metropolitan   1900  13.602627
1            Eastern Victoria     80  34.486957
2       Northern Metropolitan   5032   7.841424
3           Northern Victoria     66  34.768421
4  South-Eastern Metropolitan    529  23.950955
5       Southern Metropolitan   6277   8.725623
6        Western Metropolitan   4229   9.747414
7            Western Victoria     50  30.750000


In [None]:
# Average Price by Regionname and Type
Regionname_Type_summary = df_cleaned.groupby(['Regionname', 'Type']).agg({
    'Price': 'mean'
}).reset_index()

print(Regionname_Type_summary)

                    Regionname Type         Price
0         Eastern Metropolitan    h  1.217288e+06
1         Eastern Metropolitan    t  8.128921e+05
2         Eastern Metropolitan    u  6.742043e+05
3             Eastern Victoria    h  6.801355e+05
4             Eastern Victoria    u  4.470000e+05
5        Northern Metropolitan    h  1.017514e+06
6        Northern Metropolitan    t  7.347460e+05
7        Northern Metropolitan    u  5.408250e+05
8            Northern Victoria    h  5.568947e+05
9   South-Eastern Metropolitan    h  9.676200e+05
10  South-Eastern Metropolitan    t  8.974792e+05
11  South-Eastern Metropolitan    u  5.838846e+05
12       Southern Metropolitan    h  1.861504e+06
13       Southern Metropolitan    t  1.160534e+06
14       Southern Metropolitan    u  6.447187e+05
15        Western Metropolitan    h  9.761629e+05
16        Western Metropolitan    t  7.114521e+05
17        Western Metropolitan    u  4.732055e+05
18            Western Victoria    h  3.910714e+05


In [None]:
# Advanced Data Manipulation
# Pivot Table to summarize data by Regionname and Type
pivot_table = df_cleaned.pivot_table(values='Rooms', index='Regionname', columns='Type', aggfunc='sum')

print(pivot_table)

Type                             h      t       u
Regionname                                       
Eastern Metropolitan        1568.0  167.0   165.0
Eastern Victoria              77.0    NaN     3.0
Northern Metropolitan       3711.0  475.0   846.0
Northern Victoria             66.0    NaN     NaN
South-Eastern Metropolitan   468.0   32.0    29.0
Southern Metropolitan       4222.0  618.0  1437.0
Western Metropolitan        3354.0  410.0   465.0
Western Victoria              50.0    NaN     NaN


# **Alternative ways to fill missing values**

## Filling missing values with 0

In [None]:
# Filling missing values with 0
df_filled = df.fillna(0)

# Displaying the first few rows of the DataFrame with filled values
df_filled.head()


Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,0.0,0.0,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,0.0,0.0,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [None]:
# Check for missing values
df_filled.isnull().sum()

Unnamed: 0,0
Suburb,0
Address,0
Rooms,0
Type,0
Price,0
Method,0
SellerG,0
Date,0
Distance,0
Postcode,0


### **Markdown Explanation:**
Using fillna(0) replaces all missing values in the dataset with the value 0. This can be useful when the missing values should represent an absence of data, like missing numerical values that might imply "no value" or "zero".

However, a default value like 0 is not always appropriate, because missing data can mean different things in different columns. In the output above, for example, rows 0 and 3 now show a BuildingArea and YearBuilt of 0.0, which are not plausible values. Consider the context of each column before choosing a fill value.
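
One way to respect that context is to pass `fillna()` a dictionary that maps column names to fill values, so each column gets its own default and unlisted columns are left alone. The sketch below uses a made-up frame, and the chosen defaults (0 for Car, "Unknown" for CouncilArea) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap in each column.
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0],
    "YearBuilt": [np.nan, 1900.0, 2014.0],
    "CouncilArea": ["Yarra", np.nan, "Yarra"],
})

# Only the listed columns are filled; YearBuilt keeps its NaN because
# 0 would not be a meaningful year.
filled = toy.fillna({"Car": 0, "CouncilArea": "Unknown"})
print(filled)
```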

## Filling Numerical Columns with Mean or Median:

In [None]:
# Filling missing values in numerical columns with the mean
df_filled_mean = df.copy()
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns
df_filled_mean[numerical_columns] = df[numerical_columns].fillna(df[numerical_columns].mean())


# Displaying the first few rows of the dataframes
df_filled_mean.head()



Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,151.96765,1964.684217,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,151.96765,1964.684217,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [None]:
# Filling missing values in numerical columns with the median
df_filled_median = df.copy()
df_filled_median[numerical_columns] = df[numerical_columns].fillna(df[numerical_columns].median())

# Displaying the first few rows of the dataframes
df_filled_median.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,126.0,1970.0,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,126.0,1970.0,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


## Filling Categorical Columns with Mode (Most Frequent Value):

In [None]:
# Filling missing values in categorical columns with the mode (most frequent value)
df_filled_mode = df.copy()
categorical_columns = df.select_dtypes(include=['object']).columns

# Filling each categorical column with its mode
for column in categorical_columns:
    df_filled_mode[column] = df[column].fillna(df[column].mode()[0])

# Displaying the first few rows of the dataframe
df_filled_mode.head()


Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


### **Markdown Explanation:**
**Filling Numerical Data:**

The fillna() method is applied to numerical columns using either the mean or the median. The mean works well when the data is roughly symmetric, while the median is more robust to outliers, making it a better fit for skewed data.

**Filling Categorical Data:**

For categorical columns, filling missing values with the most frequent value (mode) makes sense when missing values are assumed to belong to the most common category.
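
The effect of an outlier on the two fill values can be seen in a short sketch. The Series below reuses the BuildingArea quartiles and maximum from the summary statistics (93, 126, 174, 44515) plus one missing entry; it is a toy example, not the real column.

```python
import numpy as np
import pandas as pd

# Four known areas (one extreme) and one missing value.
area = pd.Series([93.0, 126.0, 174.0, np.nan, 44515.0])

mean_value = area.mean()      # pulled far upward by the single outlier
median_value = area.median()  # stays near the typical values

filled_mean = area.fillna(mean_value)
filled_median = area.fillna(median_value)

print(f"mean fill: {mean_value}, median fill: {median_value}")
```

Here the mean fill is 11227.0 while the median fill is 150.0, which is why the median is usually the safer default for a column with extreme values.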

## Removing all the columns that have at least one missing value instead

In [None]:
# remove all columns with at least one missing value
columns_with_na_dropped = df.dropna(axis=1)
columns_with_na_dropped.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,202.0,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,156.0,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,134.0,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,94.0,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,3.0,1.0,120.0,-37.8072,144.9941,Northern Metropolitan,4019.0


In [None]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % df.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 21 

Columns with na's dropped: 17


### **Markdown Explanation:**
The dropna(axis=1) command removes entire columns that have any missing values. This is often done when the number of missing values in a column is too high to warrant imputation or when those columns are not critical for analysis.

This approach can significantly reduce the number of features (columns), so it's important to assess how much data will be lost.

If only a few columns have missing values, dropping them might be fine. However, if many columns are dropped, it could result in a loss of valuable information.
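
A less drastic variant is to drop only the columns whose share of missing values exceeds a cutoff. The sketch below is one possible way to do that on a made-up frame; the 0.5 cutoff (`max_missing_share`) is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Made-up frame: Car is 20% missing, BuildingArea is 60% missing.
toy = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4],
    "Car": [1.0, 0.0, np.nan, 1.0, 2.0],
    "BuildingArea": [np.nan, 79.0, np.nan, np.nan, 142.0],
})

max_missing_share = 0.5  # hypothetical cutoff, tune to the analysis

# Keep a column only if its fraction of missing values is within the cutoff.
missing_share = toy.isnull().mean()
reduced = toy.loc[:, missing_share <= max_missing_share]

print("Kept columns:", list(reduced.columns))
```

Columns that survive the cutoff but still contain gaps, such as Car here, can then be imputed instead of discarded.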


## Filling in missing values automatically

Another option is to try and fill in the missing values. For this next bit, we take a small sub-section of the data so that it will print well.

In [None]:
# get a small subset of the dataset (columns Bathroom through Regionname, first five rows)
subset_df = df.loc[:, 'Bathroom':'Regionname'].head()
subset_df

Unnamed: 0,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname
0,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan
1,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan
2,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan
3,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan
4,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan


## Backward-filling missing values in each column, then filling any remaining missing values with 0

In [None]:
# replace each NA with the value that comes directly after it in the same column,
# then replace all the remaining NAs with 0
# (bfill() replaces the deprecated fillna(method='bfill'))
subset_df.bfill(axis=0).fillna(0)


Unnamed: 0,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname
0,1.0,1.0,202.0,79.0,1900.0,Yarra,-37.7996,144.9984,Northern Metropolitan
1,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan
2,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan
3,2.0,1.0,94.0,142.0,2014.0,Yarra,-37.7969,144.9969,Northern Metropolitan
4,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan


## Filling missing values with backward-filling in each column

In [None]:
subset_df.bfill(axis=0)


Unnamed: 0,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname
0,1.0,1.0,202.0,79.0,1900.0,Yarra,-37.7996,144.9984,Northern Metropolitan
1,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan
2,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan
3,2.0,1.0,94.0,142.0,2014.0,Yarra,-37.7969,144.9969,Northern Metropolitan
4,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan


## Filling missing values with forward-filling in each column

In [None]:
subset_df.ffill(axis=0)


Unnamed: 0,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname
0,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan
1,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan
2,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan
3,2.0,1.0,94.0,150.0,1900.0,Yarra,-37.7969,144.9969,Northern Metropolitan
4,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan


### **Markdown Explanation:**

**Backfill (bfill):**

The bfill method fills missing values with the next valid entry in the column. This is useful when you assume that missing data should take the subsequent value.


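**Forward Fill (ffill):**

The ffill method fills missing values with the previous valid entry in the column. It is helpful when you assume that missing data should take the prior value. Missing values at the start of a column, such as row 0 above, stay NaN because there is no prior value to copy.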

**Combining bfill and Filling with 0:**

After applying bfill, if there are still missing values (e.g., at the end of the column), we replace them with 0. This ensures no missing values remain in the DataFrame.
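
The three behaviours can be compared side by side on a single made-up Series with gaps at the start, in the middle, and at the end. This is a sketch using the `bfill()` and `ffill()` methods directly.

```python
import numpy as np
import pandas as pd

# Gaps at the start, in the middle, and at the end.
s = pd.Series([np.nan, 79.0, np.nan, np.nan, 142.0, np.nan])

back = s.bfill()           # each gap takes the next valid value; trailing gap remains
forward = s.ffill()        # each gap takes the previous valid value; leading gap remains
both = s.ffill().bfill()   # forward fill first, then back fill the leading gap

print(pd.DataFrame({"original": s, "bfill": back, "ffill": forward, "ffill+bfill": both}))
```

Chaining the two directions is an alternative to a final `fillna(0)` when 0 would not be a sensible value for the column.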