# Introduction and Business Problem

Purchasing a property is one of the biggest investment decisions of the average person's life. The choices are overwhelming, and every suburb comes with tradeoffs between location, price and desirability. Most people who buy in their hometown have a good sense of what different neighbourhoods are like, but what about someone who is new to a city and wants a better idea of which suburbs are good to invest or live in?

With this in mind, the business problem we are trying to solve is: "Where should someone unfamiliar with Melbourne buy a house in metropolitan Melbourne?" This question is relevant to the large group of people looking to purchase property in Melbourne. There are a number of elements to this problem, reflecting the different criteria that prospective buyers should consider, for example:

- What are average property prices in different suburbs?
- What amenities are available in different suburbs?
- What are different suburbs "like" in terms of character?
- What are some of the drivers of property prices?

Each of these questions is explored in this analysis.


# Data

#### Australian postcode location data
- This data contains latitude and longitude data for each postcode in Australia.
- This will be needed to plot different suburbs on a map of Melbourne
- Data source: http://www.corra.com.au/australian-postcode-location-data/

#### Postcode remoteness data
- This data contains a remoteness index for each postcode in Melbourne (some postcodes are a mixture of remoteness indices, but we take the index that covers the majority of the postcode for simplicity)
- We will need this to narrow down our data to metropolitan Melbourne only and remove regional and remote areas
- Data source: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1270.0.55.005July%202016?OpenDocument
    
#### House price data by postcode
- This data contains recent data on house prices in different suburbs of Melbourne
- We will use this to add average house prices for each suburb to our analysis
- Data source: https://discover.data.vic.gov.au/dataset/victorian-property-sales-report-median-house-by-suburb

#### Suburb geojson files
- This data contains suburb boundaries for each suburb of Australia 
- This will allow us to create a choropleth map of Melbourne
- Data source: https://github.com/tonywr71/GeoJson-Data/blob/master/australian-suburbs.geojson

#### Foursquare API data
- The data we will look at contains details of venues that sit within a particular geographical area
- We will use this data to cluster suburbs with similar characters


# Methodology

#### Suburbs and House Prices
- We start with Australian postcode location data, containing latitude and longitude for each postcode and suburb, and import this into a dataframe
- We then join a second dataset containing the remoteness index for each postcode. Some postcodes are split between multiple remoteness categories by percentage, so we pre-processed this data to keep whichever category covers the largest proportion of each postcode
- We then use this data to filter for postcodes categorised as "Metropolitan Victoria" only
- We then join on a separate data source containing house price information for each suburb
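
The joins and filter described above can be sketched in pandas roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code: the column names (`postcode`, `suburb`, `remoteness`, `share`, `median_price`), the coordinates, the prices and the `"Other"` category label are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the source tables (all names and values are assumed).
locations = pd.DataFrame({
    "postcode": [3000, 3121, 3550],
    "suburb": ["Melbourne", "Richmond", "Bendigo"],
    "lat": [-37.814, -37.823, -36.758],
    "lon": [144.963, 144.998, 144.280],
})
# One row per (postcode, remoteness category) with the share of the postcode covered.
shares = pd.DataFrame({
    "postcode": [3000, 3121, 3121, 3550],
    "remoteness": ["Metropolitan Victoria", "Metropolitan Victoria", "Other", "Other"],
    "share": [1.0, 0.9, 0.1, 1.0],
})
prices = pd.DataFrame({
    "suburb": ["Melbourne", "Richmond", "Bendigo"],
    "median_price": [900000, 1200000, 450000],  # made-up values
})

# Pre-processing: keep the category covering the largest share of each postcode.
majority = shares.loc[shares.groupby("postcode")["share"].idxmax(),
                      ["postcode", "remoteness"]]

# Join remoteness onto locations, keep metropolitan postcodes, then add prices.
suburbs = locations.merge(majority, on="postcode", how="left")
metro = suburbs[suburbs["remoteness"] == "Metropolitan Victoria"]
metro = metro.merge(prices, on="suburb", how="left")
```

Using `how="left"` keeps every suburb even when a price is missing, so gaps in the price data show up as `NaN` rather than silently dropping rows.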

#### Determinants of House Price - Distance?
- We then create a map showing the location of each suburb
- We import a json file containing shapes and locations of each suburb - this had to be manually edited in a json editor as the notebook cannot handle too many shapes in one go
- We then use the joined house price information to create a choropleth map colouring more expensive suburbs in a darker colour
- From the map, there seems to be a negative relationship between distance to the city centre and house prices
- We define a function that will calculate the distance in km from the suburb centre to the Melbourne city centre, and add this as a new field in the dataframe
- Correlation analysis shows a moderate correlation between distance and house prices
- We then run regression analysis to test the strength of the relationship
- Simple linear regression fitted on a training sample (R^2 = 0.24) does not perform well on test data (R^2 = 0.16)
- Polynomial regression with degree 2 performs better than the linear model on both training data (R^2 = 0.37) and test data (R^2 = 0.31). However, there is a pattern in the residuals, and the presence of another city centre (Geelong) is likely skewing the data
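
The distance calculation and the linear versus degree-2 comparison can be sketched as below. This is an illustrative sketch under stated assumptions: the haversine formula is one common way to compute such a distance (the notebook's own function may differ), the city-centre coordinates are approximate, and the price data is synthetic, so the R^2 values it produces are not the report's results.

```python
import math
import numpy as np

# Approximate Melbourne city centre (assumed reference point).
CBD_LAT, CBD_LON = -37.8136, 144.9631

def distance_km(lat, lon, lat0=CBD_LAT, lon0=CBD_LON):
    """Great-circle (haversine) distance in km from (lat0, lon0) to (lat, lon)."""
    r = 6371.0  # mean Earth radius in km
    p0, p1 = math.radians(lat0), math.radians(lat)
    dphi = p1 - p0
    dlmb = math.radians(lon - lon0)
    a = math.sin(dphi / 2) ** 2 + math.cos(p0) * math.cos(p1) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic data: price falls with distance, with some curvature and noise.
rng = np.random.default_rng(0)
dist = rng.uniform(1, 50, 200)
price = 1.5e6 - 4e4 * dist + 500 * dist**2 + rng.normal(0, 1.5e5, 200)

corr = np.corrcoef(dist, price)[0, 1]  # Pearson correlation

# Fit on the first 150 points, evaluate on the remaining 50.
train, test = np.arange(0, 150), np.arange(150, 200)
scores = {}
for degree in (1, 2):
    coefs = np.polyfit(dist[train], price[train], degree)
    scores[degree] = (
        r_squared(price[train], np.polyval(coefs, dist[train])),  # train R^2
        r_squared(price[test], np.polyval(coefs, dist[test])),    # test R^2
    )
```

Comparing train and test R^2 for each degree is what exposes overfitting: a model that only improves on the training sample has not learned a more general relationship.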

#### Venues
- We use the Foursquare API to call venues data for each suburb, and store it in a dataframe
- To avoid repeating the API calls each time we load the notebook (the number of calls is limited), we save the result as a csv, comment out the API call code, and load the csv on later runs
- We group the data by venue type and sort largest to smallest to see the most common venue types in Melbourne
- We then use the get_dummies function on the venue type field to see how common each type is in each suburb
- We then create a new dataframe and run a customised loop which displays the top 10 venue types for each suburb (or, if there are fewer than ten venue types, stops when there are no venue types left in that suburb)
- This dataframe allows us to look up any suburb and see its most common venue types
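
The caching and ranking steps above might look something like the following sketch. The helper names `load_or_fetch` and `top_venue_types`, the column names `suburb` and `venue_type`, and the toy rows are all hypothetical. The `fetch` argument stands in for whatever code calls the Foursquare API, which is deliberately not shown here.

```python
import os
import pandas as pd

def load_or_fetch(csv_path, fetch):
    """Load cached venues from csv_path if present; otherwise call fetch() once and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    venues = fetch()  # hypothetical callable wrapping the API calls
    venues.to_csv(csv_path, index=False)
    return venues

def top_venue_types(venues, n=10):
    """Return {suburb: venue types ranked by count}, at most n per suburb, zero counts skipped."""
    onehot = pd.get_dummies(venues["venue_type"]).astype(int)
    onehot["suburb"] = venues["suburb"].values
    counts = onehot.groupby("suburb").sum()
    ranked = {}
    for suburb, row in counts.iterrows():
        present = row[row > 0].sort_values(ascending=False, kind="mergesort")
        ranked[suburb] = list(present.index[:n])
    return ranked

# Toy venue data (made up for illustration).
venues = pd.DataFrame({
    "suburb": ["Richmond", "Richmond", "Richmond", "Carlton", "Carlton"],
    "venue_type": ["Cafe", "Cafe", "Pub", "Park", "Cafe"],
})

common = venues["venue_type"].value_counts()  # most common venue types overall
top = top_venue_types(venues, n=10)
```

Filtering on `row > 0` before slicing is what makes the list stop early for suburbs with fewer than `n` venue types, instead of padding it with types that do not occur there.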

#### Other Determinants of House Price 
- We can now test for a relationship between house prices and the number of particular common venue types in each suburb
- We start with cafes: we create a dataframe with each suburb, house prices, and the number of cafes (we have also included coffee shops)
- We run a regression, but the R^2 is too low to confirm a positive relationship between the number of cafes and house prices
- We perform similar regressions with bakeries and grocery stores, but again the R^2 is low and no relationship can be shown
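
As a rough sketch of the cafe regression, assuming scikit-learn and the hypothetical column names `suburb`, `venue_type` and `median_price`: the data below is randomly generated with no built-in relationship, so the low R^2 it yields only illustrates the mechanics and is not evidence about Melbourne.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 120

# Synthetic suburbs and venues (prices and venue types are made up).
suburbs = pd.DataFrame({
    "suburb": [f"S{i}" for i in range(n)],
    "median_price": rng.normal(9e5, 2e5, n),
})
venues = pd.DataFrame({
    "suburb": rng.choice(suburbs["suburb"].to_numpy(), 600),
    "venue_type": rng.choice(["Cafe", "Coffee Shop", "Bakery", "Pub"], 600),
})

# Count cafes (including coffee shops) per suburb; suburbs with none get 0.
is_cafe = venues["venue_type"].isin(["Cafe", "Coffee Shop"])
cafes = venues[is_cafe].groupby("suburb").size().rename("n_cafes")
data = suburbs.join(cafes, on="suburb").fillna({"n_cafes": 0})

# Simple linear regression of price on cafe count, scored on train and test samples.
X_train, X_test, y_train, y_test = train_test_split(
    data[["n_cafes"]], data["median_price"], test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
```

The same pattern applies to bakeries or grocery stores by changing the list passed to `isin`.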

#### Clustering
- In order to get a better sense of different suburb characteristics, we now cluster them together
- We use the dummied venue data as the input dataset for the k-means clustering (we are clustering based on the venues in each suburb)
- We test k = 1 to 10 and plot the distortion to try and use the elbow method to find the optimal k, but there is no clear result
- We try the silhouette score method to test for the optimal k instead.
- We choose 6 as the best k
- We use K-means clustering to group up similar suburbs based on suburb venues
- We add the cluster labels onto the existing datasets
- We then map the clusters on a simple map of Melbourne
- We then run a function similar to the one for the top 10 venue types per suburb, but this time showing the top 5 venue types in each cluster (or fewer if the cluster has fewer than 5 venue types)
- We can then run further exploratory analysis on the clusters (the example in the notebook is cluster 5), such as:
        - Which suburbs are in each cluster?
        - What is the average distance to the city centre of the suburbs in each cluster?
        - How many of each venue type are there in each cluster?
- Using this information, we name the clusters
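
The search over k described above can be sketched with scikit-learn as follows. The input here is a synthetic matrix with three deliberately well-separated groups, standing in for the suburb-by-venue-type matrix, so the k it selects reflects the toy data and not the k = 6 chosen in this report.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the suburb x venue-type matrix: three clear groups.
rng = np.random.default_rng(0)
centres = np.array([[5, 0, 0], [0, 5, 0], [0, 0, 5]])
X = np.vstack([c + rng.normal(0, 0.3, (30, 3)) for c in centres])

inertia, silhouette = {}, {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_  # distortion, plotted for the elbow method
    if k >= 2:  # the silhouette score is undefined for a single cluster
        silhouette[k] = silhouette_score(X, km.labels_)

# Pick the k with the highest mean silhouette score, then fit the final model.
best_k = max(silhouette, key=silhouette.get)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```

The silhouette score gives a single number per k to compare, which helps when the distortion curve has no obvious elbow. On real venue data the scores may be close together, so the final choice still involves judgement.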



# Results

- Moderate negative relationship between suburb house price and distance to the city centre. This is clear from the map, but the regression fit was not strong (test R^2 of 0.16 for the linear model and 0.31 for the degree-2 polynomial)
- No strong relationship between the number of cafes and suburb house prices
- No strong relationship between the number of grocery stores and suburb prices
- No strong relationship between the number of bakeries and suburb prices
- 6 distinct types of suburbs were identified
        - Pizza and Food
        - Parks and Nature
        - Other
        - Cafes and Shopping
        - Accessible Inner Suburbs
        - Accessible Outer Suburbs

# Discussion and Conclusion
- This analysis provides useful information for someone considering purchasing a property in Melbourne
- Data allows for exploration of the types of venues in each suburb, as well as details on house prices
- A number of further analyses could be undertaken which may provide useful insight:
        - Time-series analysis of suburb price over time (this is available in the data)
        - Polynomial regression of multiple venue types and suburb house prices, or even simple linear regression of the total number of venues in the suburb
        - Clustering based on other suburb characteristics, such as location, crime rate, etc.
        - Removing Geelong data and re-running existing analysis may prove useful, as this likely distorted some findings, particularly the regression analyses.