# Peer-graded Assignment Capstone Project - The Battle of Neighborhoods

##### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Listings on real estate websites often give potential homebuyers little detail about a neighborhood's amenities (e.g. parks, schools, restaurants, bus stops). A great neighborhood should include important amenities such as grocery stores, shops, and restaurants, since most people like to frequent places that are convenient. Schools are another important amenity: even if you don't have kids, many buyers will be on the lookout for good schools if you want to sell your home in the future. Both the quality of local schools and their distance from the house are important factors to consider.

Access to parks, recreation, shopping, restaurants and coffee shops is another key factor in a great neighborhood. These are the amenities a potential owner should look into before purchasing a home in a new neighborhood. In this project we will use Foursquare's location data to explore the surroundings of multiple addresses (i.e. listings available on real estate websites) and, using k-means clustering, return to the user the best neighborhood, or clusters of similar neighborhoods, within the selected city according to a set of priorities that the user rates.

We define the acceptable walking distance as within 1.5 km of each address.

## Data <a name="data"></a>

Based on the definition of our problem, we must consider the following factors:
* the user's input of selected addresses
* the type and number of venues in the area surrounding each selected address

We use a regularly spaced grid of locations, centered on each selected address, as the starting points for collecting the amenities in the surrounding neighborhood.

The following data sources are needed to extract or generate the required information:
* the centers of the candidate areas will be generated algorithmically, and the approximate addresses of those centers will be obtained using **Google Maps API reverse geocoding**
* the number of amenities in every neighborhood, along with their type and location, will be obtained using the **Foursquare API**


## Methodology <a name="methodology"></a>

#### User Variables, Python Libraries and API keys
Import Python libraries for data analysis

In [1]:
#import libraries
import re
from geopy.geocoders import GoogleV3
import shapely.geometry
import pyproj
import math
import folium
import requests
import pandas as pd
import time
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

Assume the user is interested in the following 8 properties listed on realtor.ca and would like to know which venues lie within walking distance (~1.5 km) of each.

We will load the addresses into a dataframe:

In [2]:
# Define the locations of interest:
within_km = 1.5 # search range in km (walking distance)
bubble_detail_factor = 7 # factor determining how big each bubble will be; higher value = smaller bubble

# Read the addresses of the locations of interest into a dataframe
d = {'listed_location': ['12680 HARRISON AVE RICHMOND, BC', 
                         'G23 255 W 1ST STREET North Vancouver, British Columbia V7M3G8',
                         '201 1000 BEACH AVENUE Vancouver, British Columbia V6E4M2',
                         '305 2985 PRINCESS CRESCENT Coquitlam, British Columbia V3B7P3',
                         '605 4134 MAYWOOD STREET Burnaby, British Columbia V5H4C9',
                         '11978 90 AVENUE Delta, British Columbia V4C3H6',
                         '301 1785 MARTIN DRIVE Surrey, British Columbia V4A9T5',
                         '7338 GOLLNER AVE RICHMOND, British Columbia'
                         
                        ]}
dfListing = pd.DataFrame(data=d)
dfListing

Unnamed: 0,listed_location
0,"12680 HARRISON AVE RICHMOND, BC"
1,"G23 255 W 1ST STREET North Vancouver, British ..."
2,"201 1000 BEACH AVENUE Vancouver, British Colum..."
3,"305 2985 PRINCESS CRESCENT Coquitlam, British ..."
4,"605 4134 MAYWOOD STREET Burnaby, British Colum..."
5,"11978 90 AVENUE Delta, British Columbia V4C3H6"
6,"301 1785 MARTIN DRIVE Surrey, British Columbia..."
7,"7338 GOLLNER AVE RICHMOND, British Columbia"


Read the API keys and client ID for both Google and Foursquare into variables:

In [3]:
#Read in API keys from a separate text file, and define the version of the Foursquare API
with open("himitsu.txt", "r") as himitsu:
    secrets = himitsu.read()  # read once; the context manager closes the file
# raw strings keep \S as a regex token; each search returns the text right after the label
google_api_key = re.search(r'(?<=Google API:)\S+', secrets)[0]
foursquare_client_id = re.search(r'(?<=Foursquare CLIENT_ID:)\S+', secrets)[0]
foursquare_client_secret = re.search(r'(?<=Foursquare CLIENT_SECRET:)\S+', secrets)[0]

VERSION = '20180605' # Foursquare API version
LIMIT = 7 # A default Foursquare API limit value
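
The key lookup can also be done in one pass. The following is a minimal sketch, assuming the key file holds one `Label:value` pair per line (a layout inferred from the regular expressions above, not confirmed by the notebook). `parse_keys` is a hypothetical helper, the sample values are made up, and unlike the lookbehind patterns it tolerates a space after the colon.

```python
import re

def parse_keys(text):
    """Map each 'Label:value' line to its value (first whitespace-free token)."""
    keys = {}
    for line in text.splitlines():
        match = re.match(r'\s*([^:]+):\s*(\S+)', line)
        if match:
            keys[match.group(1).strip()] = match.group(2)
    return keys

# Hypothetical file contents; the values are placeholders, not real keys.
sample = (
    "Google API:abc123\n"
    "Foursquare CLIENT_ID:id456\n"
    "Foursquare CLIENT_SECRET:sec789\n"
)
keys = parse_keys(sample)
print(keys["Foursquare CLIENT_ID"])
```

Keeping the keys in a dictionary means a missing label fails with a `KeyError` naming the label, rather than an indexing error on a failed regex match.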

#### Neighborhood Candidates
For this project, we need to select centroids (latitude and longitude coordinates) from which to scan nearby areas. The method is to create a grid of cells within 1.5 km of each selected address, covering an area of approximately 3 x 3 km centered on that address.

Let's first find the latitude and longitude of each listed address, using the Google Maps geocoding API.

In [4]:
geolocator = GoogleV3(api_key=google_api_key)

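def get_coordinates(listing):
    # Return (latitude, longitude) for an address, or None if geocoding fails
    try:
        geocode_result = geolocator.geocode(listing)
    except IndexError:
        print("Address was wrong...")
        return None
    except Exception as e:
        print("Unexpected error occurred.", e )
        return None
    if geocode_result is None:  # geocode() returns None when nothing matches
        print("Address was not found:", listing)
        return None
    return geocode_result[1]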


Now let's create a grid of area candidates, equally spaced, centered on the given property address and extending 1.5 km from it. Each grid "bubble" is a circular area with a radius of ~0.2143 km (1.5 km divided by the bubble detail factor of 7), so the centers of neighboring bubbles are ~0.4286 km apart.
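
The grid dimensions follow directly from the two user variables. This short sketch reuses the notebook's `within_km` and `bubble_detail_factor` values; `bubble_radius` and `center_spacing` are names introduced here for illustration only.

```python
within_km = 1.5
bubble_detail_factor = 7

m = within_km * 1000                       # search range in meters
bubble_radius = m / bubble_detail_factor   # radius of each drawn bubble, in meters
center_spacing = 2 * bubble_radius         # equals m / (bubble_detail_factor / 2)

print(round(bubble_radius, 1), round(center_spacing, 1))   # 214.3 428.6
```

Because the spacing is exactly twice the radius, neighboring bubbles touch without overlapping.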

To calculate distances accurately, we follow a method already available online: we create our grid of locations in a 2D Cartesian coordinate system, which allows us to calculate distances in meters (not in latitude/longitude degrees). Then we project those coordinates back to latitude/longitude degrees to show them on a Folium map. So let's create functions to convert between the WGS84 spherical coordinate system (latitude/longitude degrees) and the UTM Cartesian coordinate system (X/Y coordinates in meters).

UTM stands for "Universal Transverse Mercator".

Reference:
https://openpress.usask.ca/introgeomatics/chapter/cartesianprojected-coordinate-systems-utm/#:~:text=Universal%20Transverse%20Mercator%20(UTM)%20is,in%20metric%20units%20(metres).&text=Conversions%20from%20one%20coordinate%20system,a%20mathematical%20process%20called%20projection.
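
UTM divides the globe into 60 zones, each 6 degrees of longitude wide, and a projection is only accurate near its own zone. The sketch below computes the standard zone number for a longitude; `utm_zone` is a hypothetical helper, and it ignores the special-case zones around Norway and Svalbard. The conversion functions in this notebook hard-code `zone=33`, which covers 12°E to 18°E, while the Vancouver-area addresses fall in zone 10.

```python
import math

def utm_zone(longitude):
    """Standard 6-degree UTM zone number (1-60) for a longitude in degrees."""
    return int(math.floor((longitude + 180) / 6)) % 60 + 1

print(utm_zone(-123.1))   # Vancouver area -> 10
print(utm_zone(15.0))     # a longitude inside zone 33 -> 33
```

Projecting with a zone far from the data distorts distances, so a grid laid out in such coordinates is not spaced in true meters on the ground.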


In [5]:
#longitude & latitude to UTM xy
# NOTE: zone 33 covers 12°E-18°E, while the Vancouver area lies in UTM zone 10.
# The outputs shown in this notebook were produced with zone=33.
# pyproj.transform is deprecated since pyproj 2.6.1; pyproj.Transformer is its replacement.
def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

#UTM xy to longitude & latitude
def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

#calculate distance between the two points (x1, y1) and (x2, y2)
def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)


The next step is to create a **hexagonal grid of cells**: we offset every other row and adjust the vertical row spacing so that **every cell center is equally distant from all its neighbors**.

For more details, please refer to: https://www.redblobgames.com/grids/hexagons/#:~:text=Size%20and%20Spacing%23&text=In%20the%20pointy%20orientation%2C%20a,from%20sin(60%C2%B0).&text=The%20horizontal%20distance%20between%20adjacent,is%20h%20*%203%2F4%20.
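
The offset-row construction can be checked in isolation. The following is a minimal sketch rather than the notebook's own cell: `hex_grid` is a hypothetical helper that builds the lattice symmetrically around a center at the origin and keeps the points within a given radius. Rows are `spacing * sqrt(3) / 2` apart and odd rows are shifted by half a step.

```python
import math

def hex_grid(cx, cy, radius, spacing):
    """Return hexagonal-lattice points within `radius` of (cx, cy)."""
    k = math.sqrt(3) / 2                    # row height as a fraction of spacing
    n = int(radius / (spacing * k)) + 1     # rows and columns to scan on each side
    points = []
    for i in range(-n, n + 1):
        y = cy + i * spacing * k
        x_offset = spacing / 2 if i % 2 else 0.0   # shift odd rows by half a step
        for j in range(-n, n + 1):
            x = cx + j * spacing + x_offset
            if math.hypot(x - cx, y - cy) <= radius:
                points.append((x, y))
    return points

# Same proportions as the notebook: 1500 m range, centers 2 * 1500 / 7 m apart
grid = hex_grid(0.0, 0.0, radius=1500.0, spacing=2 * 1500.0 / 7)
print(len(grid))   # 43
```

With these values the sketch yields 43 points per center, which agrees with the 344 grid bubbles reported for the 8 properties.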

Apply the get_coordinates() function and add the results to the dataframe:

In [6]:
dfListing['location_corr'] = dfListing['listed_location'].apply(get_coordinates)
dfListing

Unnamed: 0,listed_location,location_corr
0,"12680 HARRISON AVE RICHMOND, BC","(49.177881, -123.086432)"
1,"G23 255 W 1ST STREET North Vancouver, British ...","(49.3139382, -123.0844117)"
2,"201 1000 BEACH AVENUE Vancouver, British Colum...","(49.275755, -123.133655)"
3,"305 2985 PRINCESS CRESCENT Coquitlam, British ...","(49.2867563, -122.7959497)"
4,"605 4134 MAYWOOD STREET Burnaby, British Colum...","(49.22483949999999, -123.0116611)"
5,"11978 90 AVENUE Delta, British Columbia V4C3H6","(49.1667291, -122.8914059)"
6,"301 1785 MARTIN DRIVE Surrey, British Columbia...","(49.0343698, -122.8042969)"
7,"7338 GOLLNER AVE RICHMOND, British Columbia","(49.1685243, -123.1412603)"


Let's map out the addresses and their surrounding grid bubbles:

In [7]:
#map the listed locations and build the grid bubbles around each one
Property=[]
latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
map_city = folium.Map(location=dfListing['location_corr'][1], zoom_start=(11))

for (index, row) in (dfListing.iterrows()):
    listed_location_x, listed_location_y = lonlat_to_xy(row ['location_corr'][1], row ['location_corr'][0]) 
    folium.Marker(row['location_corr'], popup=row['listed_location']).add_to(map_city)
    m = within_km * 1000
    k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
    x_min = listed_location_x - m
    x_step = m/(bubble_detail_factor/2)
    y_min = listed_location_y - m - (int(21/k)*k*m/bubble_detail_factor - 2*m)/2
    y_step = m/(bubble_detail_factor/2) * k
    for i in range(0, int(21/k)):
        y = y_min + i * y_step
        x_offset = (m/bubble_detail_factor) if i%2==0 else 0
        for j in range(0, 21):
            x = x_min + j * x_step + x_offset
            distance_from_center = calc_xy_distance(listed_location_x, listed_location_y, x, y)
            if (distance_from_center <= (m+1)):
                lon, lat = xy_to_lonlat(x, y)
                Property.append(row['listed_location'])
                latitudes.append(lat)
                longitudes.append(lon)
                distances_from_center.append(distance_from_center)
                xs.append(x)
                ys.append(y)   

In [8]:
for lat, lon in zip(latitudes, longitudes):
    #folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_city) 
    folium.Circle([lat, lon], radius=m/bubble_detail_factor, color='blue', fill=False).add_to(map_city)
    #folium.Marker([lat, lon]).add_to(map_city)
map_city

Let's now use the Google Maps API to get the approximate addresses of those grid bubbles.
First, we define the get_address() function to retrieve the address from the JSON response.

In [9]:
def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except Exception:  # request failed or no results were returned
        return None


Add all the location addresses to a list:

In [10]:
print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(google_api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    addresses.append(address)
    print(' .', end='')
print(' done')

Obtaining location addresses:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done


In [11]:
addresses[0:10] #approximate addresses of the first 10 grid bubbles

['13480 Crestwood Pl, Richmond, BC V6V 2K1, Canada',
 '13731 Mayfield Pl, Richmond, BC V6V 2G9, Canada',
 '4311 Viking Way #170, Richmond, BC V6V 2K9, Canada',
 '180 Jacombs Rd, Richmond, BC V6V 2H7, Canada',
 '12460 Flury Dr, Richmond, BC V6V 1H5, Canada',
 'Northbound No. 6 Rd @ Commerce Parkway, Richmond, BC V6V 1T1, Canada',
 '13800 Maycrest Way, Richmond, BC V6V 3E2, Canada',
 '13200 Delf Pl, Richmond, BC V6V 2A2, Canada',
 'Westbound Jack Bell Dr @ Jacombs Rd, Richmond, BC V6V 2T7, Canada',
 '12437 Cambie Rd, Richmond, BC V6V 1G5, Canada']

Let's put everything together and create a dataframe structure:

In [12]:
df_locations = pd.DataFrame({'Property' : Property,
                             'Address': addresses, #address of each grid bubble
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations

Unnamed: 0,Property,Address,Latitude,Longitude,X,Y,Distance from center
0,"12680 HARRISON AVE RICHMOND, BC","13480 Crestwood Pl, Richmond, BC V6V 2K1, Canada",49.187815,-123.076128,-2.490412e+06,1.364519e+07,1484.614978
1,"12680 HARRISON AVE RICHMOND, BC","13731 Mayfield Pl, Richmond, BC V6V 2G9, Canada",49.182406,-123.072143,-2.491055e+06,1.364557e+07,1285.714286
2,"12680 HARRISON AVE RICHMOND, BC","4311 Viking Way #170, Richmond, BC V6V 2K9, Ca...",49.184356,-123.076518,-2.490627e+06,1.364557e+07,1133.893419
3,"12680 HARRISON AVE RICHMOND, BC","180 Jacombs Rd, Richmond, BC V6V 2H7, Canada",49.186306,-123.080892,-2.490198e+06,1.364557e+07,1133.893419
4,"12680 HARRISON AVE RICHMOND, BC","12460 Flury Dr, Richmond, BC V6V 1H5, Canada",49.188256,-123.085267,-2.489770e+06,1.364557e+07,1285.714286
...,...,...,...,...,...,...,...
339,"7338 GOLLNER AVE RICHMOND, British Columbia","7560 Moffatt Rd, Richmond, BC V6Y 1X8, Canada",49.158146,-123.142408,-2.488030e+06,1.365124e+07,1285.714286
340,"7338 GOLLNER AVE RICHMOND, British Columbia","7295 Moffatt Rd, Richmond, BC V6Y 3E5, Canada",49.160093,-123.146785,-2.487601e+06,1.365124e+07,1133.893419
341,"7338 GOLLNER AVE RICHMOND, British Columbia","6577 Livingstone Pl, Richmond, BC V7C 5N1, Canada",49.162040,-123.151162,-2.487173e+06,1.365124e+07,1133.893419
342,"7338 GOLLNER AVE RICHMOND, British Columbia","6500 Azure Rd, Richmond, BC V7C 2R9, Canada",49.163986,-123.155540,-2.486744e+06,1.365124e+07,1285.714286


So, we have 344 grid bubbles to look at. The next step is to use Foursquare to collect all the venues within each grid bubble (if any).

In [13]:
#remove any unnamed road (not land)
#df_locations = df_locations[~df_locations['Address'].str.contains('Unnamed Road')]
#df_locations['Property'] = listed_location
#df_locations


### Foursquare
Next, we are going to start utilizing the Foursquare API to explore the addresses and segment them.

In [16]:
def getNearbyVenues(names, latitudes, longitudes, prop, radius=m/20):
    
    venues_list=[]
    for name, lat, lng, prop in zip(names, latitudes, longitudes, prop):
        print(prop, '->', name) # progress output: shows which grid bubble is being queried
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            foursquare_client_id, 
            foursquare_client_secret, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
       
        # return only relevant information for each nearby venue
        venues_list.append([(
            prop,
            name, 
            lat, 
            lng, 
            v['venue']['id'],
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    print(venues_list)
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    print(nearby_venues)
    nearby_venues.columns = ['Property','Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'id',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

city_venues = getNearbyVenues(names=df_locations['Address'],
                                   latitudes=df_locations['Latitude'],
                                   longitudes=df_locations['Longitude'],
                                  prop = df_locations['Property']
                                  )

12680 HARRISON AVE RICHMOND, BC -> 13480 Crestwood Pl, Richmond, BC V6V 2K1, Canada
12680 HARRISON AVE RICHMOND, BC -> 13731 Mayfield Pl, Richmond, BC V6V 2G9, Canada
12680 HARRISON AVE RICHMOND, BC -> 4311 Viking Way #170, Richmond, BC V6V 2K9, Canada
12680 HARRISON AVE RICHMOND, BC -> 180 Jacombs Rd, Richmond, BC V6V 2H7, Canada
12680 HARRISON AVE RICHMOND, BC -> 12460 Flury Dr, Richmond, BC V6V 1H5, Canada
12680 HARRISON AVE RICHMOND, BC -> Northbound No. 6 Rd @ Commerce Parkway, Richmond, BC V6V 1T1, Canada
12680 HARRISON AVE RICHMOND, BC -> 13800 Maycrest Way, Richmond, BC V6V 3E2, Canada
12680 HARRISON AVE RICHMOND, BC -> 13200 Delf Pl, Richmond, BC V6V 2A2, Canada
12680 HARRISON AVE RICHMOND, BC -> Westbound Jack Bell Dr @ Jacombs Rd, Richmond, BC V6V 2T7, Canada
12680 HARRISON AVE RICHMOND, BC -> 12437 Cambie Rd, Richmond, BC V6V 1G5, Canada
12680 HARRISON AVE RICHMOND, BC -> 12020 Greenland Dr Unit 57, Richmond, BC V6V 2M8, Canada
12680 HARRISON AVE RICHMOND, BC -> 11531 Danie
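
The flattening step inside `getNearbyVenues` can be tried without calling the API. This is a minimal sketch under stated assumptions: `flatten_items` is a hypothetical helper, and the payload is a made-up sample shaped like `response['groups'][0]['items']` as accessed in the function above. It also guards against a venue with an empty `categories` list, which would raise an `IndexError` in the list comprehension above.

```python
def flatten_items(prop, name, items):
    """Turn Foursquare 'explore' items into flat rows, one per venue."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((prop, name, venue["id"], venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Hypothetical payload; the ids, names and coordinates are placeholders.
items = [
    {"venue": {"id": "a1", "name": "Corner Cafe",
               "location": {"lat": 49.17, "lng": -123.08},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"id": "b2", "name": "Unlabelled Spot",
               "location": {"lat": 49.18, "lng": -123.09},
               "categories": []}},
]
rows = flatten_items("Sample property", "Sample bubble address", items)
for row in rows:
    print(row)
```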

Let's look at the dataframe to see how many venues were returned:

In [17]:
city_venues

Unnamed: 0,Property,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,id,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"12680 HARRISON AVE RICHMOND, BC","13480 Crestwood Pl, Richmond, BC V6V 2K1, Canada",49.187815,-123.076128,571d299b498ead458e0cd197,Speeders Indoor ProKarts,49.188023,-123.076349,Go Kart Track
1,"12680 HARRISON AVE RICHMOND, BC","13220 Smallwood Pl, Richmond, BC V6V 2C1, Canada",49.170524,-123.078073,5843dd8c9465dd4262e1dd51,HairMasters,49.170300,-123.078183,Salon / Barbershop
2,"12680 HARRISON AVE RICHMOND, BC","Southbound No. 5 Rd @ Westminster Hwy, Richmon...",49.169455,-123.091969,4be72032151d76b063bd7b9d,Urban Farm Market,49.169982,-123.092354,Farmers Market
3,"G23 255 W 1ST STREET North Vancouver, British ...","Unnamed Road, North Vancouver, BC V7M 2Y4, Canada",49.324319,-123.083227,4ba7cad5f964a520c8b339e3,Park fen burdett,49.324971,-123.083241,Park
4,"G23 255 W 1ST STREET North Vancouver, British ...","269 4th St E, North Vancouver, BC V7L 1J1, Canada",49.311537,-123.070859,4c44dd45429a0f474c5d491e,Lower Lonsdale Park,49.312092,-123.071286,Playground
...,...,...,...,...,...,...,...,...,...
162,"7338 GOLLNER AVE RICHMOND, British Columbia","110-6551 Westminster Hwy, Richmond, BC V7C 4V4...",49.170906,-123.154776,4e67dcb5a80980267582b418,NuSkin,49.170350,-123.154198,Pharmacy
163,"7338 GOLLNER AVE RICHMOND, British Columbia","110-6551 Westminster Hwy, Richmond, BC V7C 4V4...",49.170906,-123.154776,4ecd6b4e0e01f1a87c68bc5e,Paintballgear.ca Richmond,49.171248,-123.154020,Sporting Goods Shop
164,"7338 GOLLNER AVE RICHMOND, British Columbia","110-6551 Westminster Hwy, Richmond, BC V7C 4V4...",49.170906,-123.154776,51a3f50d498e3b4c75ddc139,Steve Nash Fitness World,49.171292,-123.154724,Gym / Fitness Center
165,"7338 GOLLNER AVE RICHMOND, British Columbia","7171 Moffatt Rd, Richmond, BC V6Y 1X9, Canada",49.161606,-123.142026,4d4a09855129a35d94e43fac,Richmond Athletic Turf,49.161642,-123.141734,Soccer Field


So, in total we have 167 venues from all grid bubbles.
Let's find all the unique venue categories:

In [18]:
city_venues['Venue Category'].unique()

array(['Go Kart Track', 'Salon / Barbershop', 'Farmers Market', 'Park',
       'Playground', 'Convenience Store', 'Business Service', 'Brewery',
       'Pub', 'Ice Cream Shop', 'Juice Bar',
       'Vegetarian / Vegan Restaurant', 'Thai Restaurant',
       'Gym / Fitness Center', 'Coffee Shop', 'Harbor / Marina',
       'Donut Shop', 'Gym', 'Hotel', 'Spa', 'Cosmetics Shop',
       'Concert Hall', 'Lebanese Restaurant', 'Japanese Curry Restaurant',
       'Sandwich Place', 'Plaza', 'Movie Theater', 'Italian Restaurant',
       'Yoga Studio', 'New American Restaurant', 'Bed & Breakfast',
       'Taco Place', 'Restaurant', 'Pharmacy', 'Mexican Restaurant',
       'Diner', 'Indian Restaurant', 'Hostel', 'Bar',
       'Seafood Restaurant', 'Trail', 'French Restaurant', 'Wine Bar',
       'Breakfast Spot', 'Café', 'Grocery Store',
       'Middle Eastern Restaurant', "Men's Store", 'Hot Dog Joint',
       'Theater', 'Tea Room', 'Bakery', 'Boat or Ferry', 'Bus Stop',
       'Water Park', 'Pizza

Let's check how many venues were returned in each category:

In [19]:
city_venues.groupby('Venue Category').count()

Venue Category,Property,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,id,Venue,Venue Latitude,Venue Longitude
American Restaurant,2,2,2,2,2,2,2,2
Asian Restaurant,1,1,1,1,1,1,1,1
Bakery,4,4,4,4,4,4,4,4
Bank,3,3,3,3,3,3,3,3
Bar,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...
Vietnamese Restaurant,2,2,2,2,2,2,2,2
Water Park,1,1,1,1,1,1,1,1
Wine Bar,1,1,1,1,1,1,1,1
Women's Store,1,1,1,1,1,1,1,1


In [54]:
print('There are {} unique categories.'.format(len(city_venues['Venue Category'].unique())))

There are 98 unique categories.


## Analysis <a name="analysis"></a>

#### Analyze each Property's surrounding area

Using one-hot encoding, each venue category becomes a column in the dataframe:

In [21]:
# one hot encoding
city_venues_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

# add property column back to dataframe
city_venues_onehot['Property'] = city_venues['Property'] 

# move property column to the first column
fixed_columns = [city_venues_onehot.columns[-1]] + list(city_venues_onehot.columns[:-1])
city_venues_onehot = city_venues_onehot[fixed_columns]

city_venues_onehot.head()

Unnamed: 0,Property,American Restaurant,Asian Restaurant,Bakery,Bank,Bar,Bed & Breakfast,Boat or Ferry,Boutique,Breakfast Spot,...,Thai Restaurant,Theater,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Water Park,Wine Bar,Women's Store,Yoga Studio
0,"12680 HARRISON AVE RICHMOND, BC",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"12680 HARRISON AVE RICHMOND, BC",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"12680 HARRISON AVE RICHMOND, BC",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"G23 255 W 1ST STREET North Vancouver, British ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"G23 255 W 1ST STREET North Vancouver, British ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
city_venues_onehot.shape

(167, 99)

Next, let's group rows by property and take the mean of each one-hot column, which gives the frequency of occurrence of each category:

In [23]:
city_venues_grouped = city_venues_onehot.groupby('Property').mean().reset_index()
city_venues_grouped

Unnamed: 0,Property,American Restaurant,Asian Restaurant,Bakery,Bank,Bar,Bed & Breakfast,Boat or Ferry,Boutique,Breakfast Spot,...,Thai Restaurant,Theater,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Water Park,Wine Bar,Women's Store,Yoga Studio
0,"11978 90 AVENUE Delta, British Columbia V4C3H6",0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,...,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.066667,0.0
1,"12680 HARRISON AVE RICHMOND, BC",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"201 1000 BEACH AVENUE Vancouver, British Colum...",0.0,0.0,0.042857,0.0,0.014286,0.014286,0.014286,0.0,0.028571,...,0.0,0.028571,0.0,0.028571,0.0,0.0,0.014286,0.014286,0.0,0.014286
3,"301 1785 MARTIN DRIVE Surrey, British Columbia...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"305 2985 PRINCESS CRESCENT Coquitlam, British ...",0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0
5,"605 4134 MAYWOOD STREET Burnaby, British Colum...",0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"7338 GOLLNER AVE RICHMOND, British Columbia",0.064516,0.0,0.0,0.064516,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0
7,"G23 255 W 1ST STREET North Vancouver, British ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.071429,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0
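
The mean of a one-hot column within a group is simply the share of that group's venues in the category. The following plain-Python sketch shows the same calculation without pandas; `category_frequencies` is a hypothetical helper and the records are made up for illustration.

```python
from collections import Counter, defaultdict

def category_frequencies(records):
    """records: iterable of (property, venue_category) pairs."""
    counts = defaultdict(Counter)
    for prop, category in records:
        counts[prop][category] += 1
    return {prop: {cat: n / sum(c.values()) for cat, n in c.items()}
            for prop, c in counts.items()}

# Hypothetical records: property "A" has three venues, "B" has one.
records = [("A", "Park"), ("A", "Park"), ("A", "Coffee Shop"), ("B", "Bakery")]
freq = category_frequencies(records)
print(freq["A"])
print(freq["B"])
```

Each property's frequencies sum to 1, so a property with few venues gets large values per category. This is why a property with only three venues, each in a different category, shows 0.33 for all three.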


#### Let's confirm the new size

In [24]:
city_venues_grouped.shape

(8, 99)

Let's print each property along with its top 7 most common venues:

In [25]:
num_top_venues = 7

for hood in city_venues_grouped['Property']:
    print("----"+hood+"----")
    temp = city_venues_grouped[city_venues_grouped['Property'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----11978 90 AVENUE Delta, British Columbia V4C3H6----
                venue  freq
0  Salon / Barbershop  0.07
1         Gas Station  0.07
2      Discount Store  0.07
3         Pizza Place  0.07
4            Pharmacy  0.07
5      Sandwich Place  0.07
6      Cosmetics Shop  0.07


----12680 HARRISON AVE RICHMOND, BC----
                 venue  freq
0   Salon / Barbershop  0.33
1       Farmers Market  0.33
2        Go Kart Track  0.33
3  American Restaurant  0.00
4   Miscellaneous Shop  0.00
5           Playground  0.00
6          Pizza Place  0.00


----201 1000 BEACH AVENUE Vancouver, British Columbia V6E4M2----
               venue  freq
0              Hotel  0.06
1        Coffee Shop  0.06
2             Bakery  0.04
3               Park  0.04
4                Spa  0.03
5  Indian Restaurant  0.03
6         Restaurant  0.03


----301 1785 MARTIN DRIVE Surrey, British Columbia V4A9T5----
                        venue  freq
0                        Park  0.13
1  Construction & Landscapin

#### Let's put that into a _pandas_ dataframe

First, let's write a function to sort the venues in descending order.

In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [38]:
num_top_venues = 7

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Property']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix beyond the third entry
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
city_venues_sorted = pd.DataFrame(columns=columns)
city_venues_sorted['Property'] = city_venues_grouped['Property']

for ind in np.arange(city_venues_grouped.shape[0]):
    city_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_venues_grouped.iloc[ind, :], num_top_venues)

city_venues_sorted

Unnamed: 0,Property,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,"11978 90 AVENUE Delta, British Columbia V4C3H6",Cosmetics Shop,Gas Station,Sandwich Place,Lounge,Salon / Barbershop,Discount Store,Sushi Restaurant
1,"12680 HARRISON AVE RICHMOND, BC",Salon / Barbershop,Go Kart Track,Farmers Market,Yoga Studio,Food Truck,Dessert Shop,Dim Sum Restaurant
2,"201 1000 BEACH AVENUE Vancouver, British Colum...",Coffee Shop,Hotel,Bakery,Park,Grocery Store,Breakfast Spot,Indian Restaurant
3,"301 1785 MARTIN DRIVE Surrey, British Columbia...",Park,Construction & Landscaping,Chinese Restaurant,Convenience Store,Thai Restaurant,College Rec Center,Fast Food Restaurant
4,"305 2985 PRINCESS CRESCENT Coquitlam, British ...",Burger Joint,Fast Food Restaurant,Park,Office,Coffee Shop,Dim Sum Restaurant,Bank
5,"605 4134 MAYWOOD STREET Burnaby, British Colum...",Portuguese Restaurant,Furniture / Home Store,Park,Restaurant,Coffee Shop,Dog Run,Fast Food Restaurant
6,"7338 GOLLNER AVE RICHMOND, British Columbia",American Restaurant,Bank,Bubble Tea Shop,Coffee Shop,Pharmacy,Hotel,Gym / Fitness Center
7,"G23 255 W 1ST STREET North Vancouver, British ...",Pub,Convenience Store,Brewery,Gym / Fitness Center,Ice Cream Shop,Juice Bar,Park
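
The column labels above rely on an `IndexError` fallback to append "th", which is correct for up to 20 columns but would produce "21th" beyond that. A small sketch of a general suffix rule follows; `ordinal` is a hypothetical helper, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"          # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

columns = ["Property"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 8)]
print(columns)
```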


#### Cluster Neighborhoods

Run _k_-means to cluster the properties into `kclusters` clusters, based on their venue category frequencies.

In [39]:
# set number of clusters
kclusters = 5

city_venues_grouped_clustering = city_venues_grouped.drop(columns='Property')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(city_venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 1, 2, 3, 4, 0, 2, 2])
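
To show what the clustering step does with the frequency vectors, here is a plain-Python sketch of Lloyd's algorithm. It is not scikit-learn's implementation: it seeds the centroids with the first `k` points instead of using k-means++, and the four sample vectors are made up for illustration.

```python
import math

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # assignment step: index of the nearest centroid for each point
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical frequency vectors over three categories, one row per property
vectors = [
    (0.6, 0.4, 0.0),
    (0.0, 0.1, 0.9),
    (0.5, 0.5, 0.0),
    (0.1, 0.0, 0.9),
]
labels, centroids = kmeans(vectors, k=2)
print(labels)   # [0, 1, 0, 1]
```

Properties with similar category mixes end up with the same label. With only 8 properties and 5 clusters, as in the labels above, most clusters hold a single property.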

Let's create a new dataframe that includes the cluster label as well as the top 7 venues for each property.

In [40]:
# add clustering labels
city_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

city_venues_merged = df_locations

# join the cluster labels and top venues onto df_locations, keyed by property
city_venues_merged = city_venues_merged.join(city_venues_sorted.set_index('Property'), on='Property')


city_venues_merged # check the last columns!

Unnamed: 0,Property,Address,Latitude,Longitude,X,Y,Distance from center,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,"12680 HARRISON AVE RICHMOND, BC","13480 Crestwood Pl, Richmond, BC V6V 2K1, Canada",49.187815,-123.076128,-2.490412e+06,1.364519e+07,1484.614978,1,Salon / Barbershop,Go Kart Track,Farmers Market,Yoga Studio,Food Truck,Dessert Shop,Dim Sum Restaurant
1,"12680 HARRISON AVE RICHMOND, BC","13731 Mayfield Pl, Richmond, BC V6V 2G9, Canada",49.182406,-123.072143,-2.491055e+06,1.364557e+07,1285.714286,1,Salon / Barbershop,Go Kart Track,Farmers Market,Yoga Studio,Food Truck,Dessert Shop,Dim Sum Restaurant
2,"12680 HARRISON AVE RICHMOND, BC","4311 Viking Way #170, Richmond, BC V6V 2K9, Ca...",49.184356,-123.076518,-2.490627e+06,1.364557e+07,1133.893419,1,Salon / Barbershop,Go Kart Track,Farmers Market,Yoga Studio,Food Truck,Dessert Shop,Dim Sum Restaurant
3,"12680 HARRISON AVE RICHMOND, BC","180 Jacombs Rd, Richmond, BC V6V 2H7, Canada",49.186306,-123.080892,-2.490198e+06,1.364557e+07,1133.893419,1,Salon / Barbershop,Go Kart Track,Farmers Market,Yoga Studio,Food Truck,Dessert Shop,Dim Sum Restaurant
4,"12680 HARRISON AVE RICHMOND, BC","12460 Flury Dr, Richmond, BC V6V 1H5, Canada",49.188256,-123.085267,-2.489770e+06,1.364557e+07,1285.714286,1,Salon / Barbershop,Go Kart Track,Farmers Market,Yoga Studio,Food Truck,Dessert Shop,Dim Sum Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,"7338 GOLLNER AVE RICHMOND, British Columbia","7560 Moffatt Rd, Richmond, BC V6Y 1X8, Canada",49.158146,-123.142408,-2.488030e+06,1.365124e+07,1285.714286,2,American Restaurant,Bank,Bubble Tea Shop,Coffee Shop,Pharmacy,Hotel,Gym / Fitness Center
340,"7338 GOLLNER AVE RICHMOND, British Columbia","7295 Moffatt Rd, Richmond, BC V6Y 3E5, Canada",49.160093,-123.146785,-2.487601e+06,1.365124e+07,1133.893419,2,American Restaurant,Bank,Bubble Tea Shop,Coffee Shop,Pharmacy,Hotel,Gym / Fitness Center
341,"7338 GOLLNER AVE RICHMOND, British Columbia","6577 Livingstone Pl, Richmond, BC V7C 5N1, Canada",49.162040,-123.151162,-2.487173e+06,1.365124e+07,1133.893419,2,American Restaurant,Bank,Bubble Tea Shop,Coffee Shop,Pharmacy,Hotel,Gym / Fitness Center
342,"7338 GOLLNER AVE RICHMOND, British Columbia","6500 Azure Rd, Richmond, BC V7C 2R9, Canada",49.163986,-123.155540,-2.486744e+06,1.365124e+07,1285.714286,2,American Restaurant,Bank,Bubble Tea Shop,Coffee Shop,Pharmacy,Hotel,Gym / Fitness Center
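
The join above repeats each property's cluster label and top venues on every grid location that belongs to that property. A minimal sketch of the same pattern, using small hypothetical frames (`locations` and `sorted_venues` are stand-ins, not the notebook's data):

```python
import pandas as pd

# Hypothetical stand-in for df_locations: several grid points per property
locations = pd.DataFrame({
    "Property": ["A", "A", "B"],
    "Latitude": [49.18, 49.19, 49.16],
})

# Hypothetical stand-in for city_venues_sorted: one row per property
sorted_venues = pd.DataFrame({
    "Cluster Labels": [1, 2],
    "Property": ["A", "B"],
    "1st Most Common Venue": ["Park", "Bank"],
})

# Look up each location's property in the per-property table
merged = locations.join(sorted_venues.set_index("Property"), on="Property")
print(merged)
```

A property missing from the per-property table would get `NaN` in the joined columns, so it may be worth checking for missing cluster labels before mapping.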


Finally, let's visualize the resulting clusters.

In [53]:
# create map
map_clusters = folium.Map(location=dfListing['location_corr'][1], zoom_start=10.4)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(city_venues_merged['Latitude'], city_venues_merged['Longitude'], city_venues_merged['Property'], city_venues_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.Circle(
        [lat, lon],
        radius=m/bubble_detail_factor,
        popup=label,
        color=rainbow[cluster-1],
        fill=False,
        fill_color=rainbow[cluster-1]).add_to(map_clusters)
       
map_clusters
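
The colour scheme only needs one distinct colour per cluster. A minimal sketch of building that palette on its own, assuming `kclusters = 5` (inferred from the five clusters examined below) and indexing the palette directly by label; `palette` and `color_for` are illustrative names:

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 5  # assumption: matches the five clusters (0 to 4) examined below

# Sample the rainbow colormap at evenly spaced points and convert to hex strings
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map each cluster label to its own colour
color_for = {label: palette[label] for label in range(kclusters)}
print(color_for)
```

The cell above indexes with `cluster-1`, so cluster 0 wraps around to the last colour; each cluster still receives a unique colour either way.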

#### Examine Clusters

Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster.

In [42]:
# keep one row per property: drop the per-location columns, then remove the duplicated rows
examine_clusters = city_venues_merged.drop(columns=['Address','Latitude','Longitude','X','Y','Distance from center']).drop_duplicates()
examine_clusters

Unnamed: 0,Property,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,"12680 HARRISON AVE RICHMOND, BC",1,Salon / Barbershop,Go Kart Track,Farmers Market,Yoga Studio,Food Truck,Dessert Shop,Dim Sum Restaurant
43,"G23 255 W 1ST STREET North Vancouver, British ...",2,Pub,Convenience Store,Brewery,Gym / Fitness Center,Ice Cream Shop,Juice Bar,Park
86,"201 1000 BEACH AVENUE Vancouver, British Colum...",2,Coffee Shop,Hotel,Bakery,Park,Grocery Store,Breakfast Spot,Indian Restaurant
129,"305 2985 PRINCESS CRESCENT Coquitlam, British ...",4,Burger Joint,Fast Food Restaurant,Park,Office,Coffee Shop,Dim Sum Restaurant,Bank
172,"605 4134 MAYWOOD STREET Burnaby, British Colum...",0,Portuguese Restaurant,Furniture / Home Store,Park,Restaurant,Coffee Shop,Dog Run,Fast Food Restaurant
215,"11978 90 AVENUE Delta, British Columbia V4C3H6",2,Cosmetics Shop,Gas Station,Sandwich Place,Lounge,Salon / Barbershop,Discount Store,Sushi Restaurant
258,"301 1785 MARTIN DRIVE Surrey, British Columbia...",3,Park,Construction & Landscaping,Chinese Restaurant,Convenience Store,Thai Restaurant,College Rec Center,Fast Food Restaurant
301,"7338 GOLLNER AVE RICHMOND, British Columbia",2,American Restaurant,Bank,Bubble Tea Shop,Coffee Shop,Pharmacy,Hotel,Gym / Fitness Center
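
The per-cluster cells that follow repeat the same filter for each label. As an alternative, the table can be split in one pass with `groupby`. This is a minimal sketch on a hypothetical stand-in for `examine_clusters`, not the notebook's actual data:

```python
import pandas as pd

# Hypothetical stand-in for examine_clusters
examine = pd.DataFrame({
    "Property": ["A", "B", "C"],
    "Cluster Labels": [0, 2, 2],
    "1st Most Common Venue": ["Park", "Pub", "Bank"],
})

# One sub-table per cluster, with the label column dropped
per_cluster = {
    int(label): group.drop(columns="Cluster Labels")
    for label, group in examine.groupby("Cluster Labels")
}

for label, table in per_cluster.items():
    print(f"Cluster {label}: {len(table)} properties")
```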


##### Cluster 0 -- Foodie Place!

In [43]:
examine_clusters.loc[examine_clusters['Cluster Labels'] == 0, examine_clusters.columns[[0] + list(range(2, examine_clusters.shape[1]))]]

Unnamed: 0,Property,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
172,"605 4134 MAYWOOD STREET Burnaby, British Colum...",Portuguese Restaurant,Furniture / Home Store,Park,Restaurant,Coffee Shop,Dog Run,Fast Food Restaurant


##### Cluster 1 -- Fun place with a Barbershop

In [46]:
examine_clusters.loc[examine_clusters['Cluster Labels'] == 1, examine_clusters.columns[[0] + list(range(2, examine_clusters.shape[1]))]]

Unnamed: 0,Property,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
0,"12680 HARRISON AVE RICHMOND, BC",Salon / Barbershop,Go Kart Track,Farmers Market,Yoga Studio,Food Truck,Dessert Shop,Dim Sum Restaurant


##### Cluster 2 -- Very balanced cluster

In [47]:
examine_clusters.loc[examine_clusters['Cluster Labels'] == 2, examine_clusters.columns[[0] + list(range(2, examine_clusters.shape[1]))]]

Unnamed: 0,Property,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
43,"G23 255 W 1ST STREET North Vancouver, British ...",Pub,Convenience Store,Brewery,Gym / Fitness Center,Ice Cream Shop,Juice Bar,Park
86,"201 1000 BEACH AVENUE Vancouver, British Colum...",Coffee Shop,Hotel,Bakery,Park,Grocery Store,Breakfast Spot,Indian Restaurant
215,"11978 90 AVENUE Delta, British Columbia V4C3H6",Cosmetics Shop,Gas Station,Sandwich Place,Lounge,Salon / Barbershop,Discount Store,Sushi Restaurant
301,"7338 GOLLNER AVE RICHMOND, British Columbia",American Restaurant,Bank,Bubble Tea Shop,Coffee Shop,Pharmacy,Hotel,Gym / Fitness Center


##### Cluster 3 -- Park and construction?

In [48]:
examine_clusters.loc[examine_clusters['Cluster Labels'] == 3, examine_clusters.columns[[0] + list(range(2, examine_clusters.shape[1]))]]

Unnamed: 0,Property,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
258,"301 1785 MARTIN DRIVE Surrey, British Columbia...",Park,Construction & Landscaping,Chinese Restaurant,Convenience Store,Thai Restaurant,College Rec Center,Fast Food Restaurant


##### Cluster 4 -- Fast food and office

In [49]:
examine_clusters.loc[examine_clusters['Cluster Labels'] == 4, examine_clusters.columns[[0] + list(range(2, examine_clusters.shape[1]))]]

Unnamed: 0,Property,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue
129,"305 2985 PRINCESS CRESCENT Coquitlam, British ...",Burger Joint,Fast Food Restaurant,Park,Office,Coffee Shop,Dim Sum Restaurant,Bank


## Results and Discussion <a name="results"></a>

Our analysis shows that the bubbles surrounding each address contain different categories of venues. Each property has a distinct blend of amenities, so the criteria for selecting the "right" property depend directly on the buyer's personal preferences. Note that this exercise does not include crime incidents, pricing, square footage or neighbourhood density. To extend the project, we could gather this additional data to better support the buyer's decision.
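
One way to make the buyer's preferences explicit is to weight the venue categories and score each property by the weights of the categories found in its top venues. The sketch below is only an illustration: the weights, property names and venue lists are hypothetical, and this scoring scheme is not part of the analysis above.

```python
# Hypothetical preference weights (higher = more important to the buyer)
preferences = {"Park": 3, "Gym / Fitness Center": 2, "Coffee Shop": 1}

# Hypothetical top-venue lists, one per property
top_venues = {
    "Property A": ["Park", "Bank", "Coffee Shop"],
    "Property B": ["Pub", "Gym / Fitness Center", "Hotel"],
    "Property C": ["Bank", "Hotel", "Pharmacy"],
}

def preference_score(venues, weights):
    """Sum the weights of the preferred categories found in a venue list."""
    return sum(weights.get(venue, 0) for venue in venues)

# Rank properties from best to worst match
ranked = sorted(
    top_venues,
    key=lambda name: preference_score(top_venues[name], preferences),
    reverse=True,
)
print(ranked)
```

Different buyers would supply different weights, so the same venue data can produce different rankings.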

## Conclusion <a name="conclusion"></a>

Speaking for myself as a buyer, and assuming all of these properties are similar in price and size, I would prefer an area close to a park for my kids, close to restaurants, and with a gym nearby for exercise. It seems that index 301 (7338 Gollner Ave, Richmond) is the ideal place.