Now that we have all of our house data, we need to clean it as much as possible and isolate the columns that are best represented across all the houses.

In [None]:
# Import libraries
import os
import pandas as pd

# Configure Notebook
import warnings
warnings.filterwarnings('ignore')
%config Completer.use_jedi = False

In [None]:
# create dataframe of houses
house_data = pd.read_csv('houses_data.csv')
house_data.head()

In [None]:
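# convert longitudes and latitudes into floats and drop rows with no coordinates
house_data['propertyLng'] = house_data['propertyLng'].str.strip("\",")
house_data['propertyLat'] = house_data['propertyLat'].str.strip("\",")

# after stripping, a missing coordinate is an empty string (or NaN); keep rows that have both
house_data = house_data.dropna(subset=['propertyLng', 'propertyLat'])
house_data = house_data[(house_data['propertyLng'] != '') & (house_data['propertyLat'] != '')]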

house_data['propertyLng'] = house_data['propertyLng'].astype(float)
house_data['propertyLat'] = house_data['propertyLat'].astype(float)
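
As a side note, the same cleaning step could also be sketched with `pd.to_numeric`, which turns anything it cannot parse into NaN when `errors='coerce'` is set, so missing coordinates can be dropped in one pass. The toy values below are invented and only assume that the raw strings look like `"-79.38",`.

```python
import pandas as pd

# Toy rows imitating the assumed raw format (values are invented for illustration)
toy = pd.DataFrame({
    'propertyLng': ['"-79.38",', '"",', '"-79.40",'],
    'propertyLat': ['"43.65",', '"43.70",', '"",'],
})

for col in ['propertyLng', 'propertyLat']:
    # strip the quote/comma debris, then coerce unparseable values (e.g. '') to NaN
    toy[col] = pd.to_numeric(toy[col].str.strip('",'), errors='coerce')

# keep only rows where both coordinates were parsed
toy = toy.dropna(subset=['propertyLng', 'propertyLat'])
print(toy)
```

Only the first toy row survives, since each of the other two is missing one coordinate.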

In [None]:
# drop duplicates
house_data.drop_duplicates(inplace=True)
house_data.head()

Now that we have removed addresses with no location and any duplicates, we can look at which columns have the most and least null values.

In [None]:
pd.DataFrame({'count': house_data.isnull().sum()})
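
To make the best and worst represented columns easier to spot, the null counts can be sorted and shown next to a percentage. This is only a sketch on a small invented frame standing in for `house_data`; the `percent` column and the `null_summary` name are our own additions.

```python
import pandas as pd

# Toy frame standing in for house_data (invented values)
toy = pd.DataFrame({
    'Sold Price': [500000, None, 750000, 620000],
    'Pool': [None, None, None, 'Inground'],
    'address': ['a', 'b', 'c', 'd'],
})

null_summary = pd.DataFrame({
    'count': toy.isnull().sum(),           # number of nulls per column
    'percent': toy.isnull().mean() * 100,  # share of rows that are null
}).sort_values('count', ascending=False)   # worst represented columns first
print(null_summary)
```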

After much consideration of the amount of data gathered for each column and the potential importance of the information it holds, we ended up with the following 50 features, which will be used for further analysis and for the future model.

In [None]:
# keep only the selected columns
house_data_final = house_data[['Unnamed: 0','address', 'isResidentialProperty', 'propertyLat', 'propertyLng', 
                              'searchNeighborhood', 'List Date', 'Sold Price', 'Original Price', 'Type', 'Style', 
                             'Size (sq ft)', 'Age', 'Community', 'List Price', 'Bedrooms', 'Bathrooms', 'Kitchens',
                             'Rooms', 'Air Conditioning', 'Fireplace', 'Basement', 'Heating', 'Exterior', 
                             'Exterior Features', 'Driveway', 'Garage', 'Parking Places', 'Covered Parking Places',
                             'Taxes', 'Feature', 'Fronting On', 'Frontage', 'Lot Depth', 'Pool', 'Sewer', 
                             'Cross Street', 'Municipality District', 'Lot Code', 'Bedrooms Plus', 'Gas',
                               'Waterfront', 'Rooms Plus', 'Washrooms Type 3 # Pcs', 'Kitchens Plus',
                               'Parking Total', 'Furnished', 'Laundry Access', 'Private Entrance', 'Lease Term',
                               'Ensuite Laundry', 'Property Type']]

house_data_final.info()

First, to narrow the scope of this project, we will only look at residential properties. Also, since our analysis will be temporal, let's get rid of houses with no list date.

In [None]:
# Keep only residential properties (the raw values carry a trailing comma: 'true,')
house_data_final = house_data_final[house_data_final['isResidentialProperty'] == 'true,']
# Drop rows with no list date
house_data_final = house_data_final.dropna(subset=['List Date'])

In [None]:
# keep only houses whose list date mentions a year from 2013 to 2023
years = '|'.join(str(year) for year in range(2013, 2024))
house_data_final = house_data_final[house_data_final['List Date'].str.contains(years)]
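
Matching the year as a substring works as long as those four digits never appear elsewhere in the date string. A stricter variant would parse the dates first and filter on the year. The sketch below assumes an ISO-like date format, which may not match the real `List Date` column; the toy values and the `parsed` and `toy_recent` names are hypothetical.

```python
import pandas as pd

# Toy list dates; the real 'List Date' format is an assumption here
toy = pd.DataFrame({
    'List Date': ['2012-11-30', '2013-01-15', '2021-06-01', 'not a date'],
})

# unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(toy['List Date'], errors='coerce')

# keep listings dated 2013 through 2023 inclusive; NaT rows compare as False
toy_recent = toy[parsed.dt.year.between(2013, 2023)]
print(toy_recent)
```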

Let's save this cleaned data to its own CSV file to create another checkpoint.

In [None]:
house_data_final.to_csv('houses_data_final.csv')