# The Battle of Neighborhoods
### Applied Data Science Capstone Project - IBM Data Science Certificate

## Table of contents
* [Introduction: Business Problem](#intro)
* [Data](#data)
* [Methodology](#methodology)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## 1. Introduction: Business Problem <a name="intro"></a>

Starting a business can be an extremely arduous task. Anyone who decides to open a company faces a multitude of challenges and doubts: the type of business, the location, the costs involved and whether the product will attract enough customers for the company to succeed, to mention just a few. Now imagine that a client is moving abroad and wants to open a branch of his company in a new country.

This project discusses a case in which a stakeholder already has expertise in the bike rental business in their country of origin. A member of the group is moving to Vancouver, Canada, and sees an opportunity to expand the company abroad by opening a branch of the bike rental company in the City of Vancouver. However, although this individual has the know-how to run such a business, he has insufficient knowledge of Vancouver's neighborhoods and their peculiarities. The stakeholder wants the firm to open in a location where it will have a higher chance of prospering. Moreover, the individual who will run the branch wants to live in the vicinity of the shop, if possible, so he can commute on foot or by bike.

The challenge of this project is to find, using data science tools, the most suitable location for this individual to open his bike rental shop, considering that the chosen neighborhood should also be a good place to live.

## 2. Data <a name="data"></a>

To solve the problem and find a suitable location for the business, the City of Vancouver will be segmented into neighborhoods according to their postal codes. To do so, we will use the website <a href="https://www.geonames.org/postalcode-search.html?q=vancouver&country=CA&adminCode1=BC">geonames.org</a>, which has a free and extensive geographical database, to extract information about the neighborhoods, such as their names, postal codes and geographic coordinates.

After extracting and cleaning the data, we will use a Python library called <b>folium</b> to make an interactive map of the region, identifying all the neighborhoods. In the following step, the Foursquare API will be used to determine the most common types of venues in each neighborhood.

With the support of the data obtained from the <b>Foursquare API</b> we will be able to segment the neighborhoods into different clusters, according to their similarities, and identify a location that matches the stakeholder's requirements. We will preferably look for a neighborhood that is easy to access and has plenty of parks and trails, where the demand for bikes is probably higher than in other regions of the city. Furthermore, it would be desirable for such a neighborhood to also be a good place to live, with a considerable number of restaurants, stores and other services nearby.

#### These are the tools and libraries that we will utilize

In [59]:
#!pip install folium
import folium
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe
from geopy.geocoders import Nominatim
import requests # library to handle requests
from bs4 import BeautifulSoup
# import k-means from clustering stage
from sklearn.cluster import KMeans
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
%matplotlib inline

#### Extracting data from the database <a href="https://www.geonames.org/postalcode-search.html?q=vancouver&country=CA&adminCode1=BC">geonames.org</a>

The extracted table provides sufficient information about Vancouver's neighborhoods, such as their names, postal codes and geographic coordinates.

In [60]:
postal_url = 'https://www.geonames.org/postalcode-search.html?q=vancouver&country=CA&adminCode1=BC'
html_text = requests.get(postal_url).text

In [61]:
# Utilizing Beautiful Soup to extract the table data
soup = BeautifulSoup(html_text, 'html.parser')
table = soup.find('table', attrs={'class':'restable'})
trs = table.find_all('tr')

# Extracting the text from the table cells
rows = list()
for tr in trs:
    td = tr.find_all('td')
    row = [ele.text.strip() for ele in td]
    if row:
        # Ignore empty rows with no 'td',
        # applicable for the column headers row.
        rows.append(row)
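
To make the row-extraction logic easier to follow, here is a dependency-free sketch of the same idea using only the standard library's `html.parser`. The `SAMPLE_HTML` string is a hypothetical, trimmed stand-in for the geonames table (the real page's markup may differ), and `TableRows` is an illustrative helper, not part of the notebook. It shows why the header row is skipped (it has no `td` cells) and why each place yields two rows.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed stand-in for the geonames table: a header row built
# from <th> cells, then a data row followed by a coordinates-only row.
SAMPLE_HTML = """
<table class="restable">
  <tr><th></th><th>Place</th><th>Code</th></tr>
  <tr><td>1</td><td>Vancouver (Killarney)</td><td>V5S</td></tr>
  <tr><td></td><td>49.218/-123.038</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the stripped text of every <td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the <tr> currently being read
        self._cell = None   # text pieces of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # header rows have no <td>, so they are skipped
                self.rows.append(self._row)
            self._row = None


parser = TableRows()
parser.feed(SAMPLE_HTML)
sample_rows = parser.rows
print(sample_rows)
```

Each place therefore contributes a name/code row followed by a coordinates row, which is the alternating pattern the cleaning steps have to untangle.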

#### Creating a dataframe from the extracted data

Let's first create a table with the exact rows and columns found at <b>geonames.org</b>.

In [62]:
df = pd.DataFrame(rows, columns=['N', 'Place', 'Code', 'Country', 'Admin1', 'Admin2', 'Admin3'])
df.head()

Unnamed: 0,N,Place,Code,Country,Admin1,Admin2,Admin3
0,1.0,Vancouver (Killarney),V5S,Canada,British Columbia,Vancouver,
1,,49.218/-123.038,,,,,
2,2.0,Vancouver (North Hastings-Sunrise),V5K,Canada,British Columbia,Vancouver,
3,,49.281/-123.04,,,,,
4,3.0,Vancouver (North Grandview-Woodlands),V5L,Canada,British Columbia,Vancouver,


#### Cleaning the data

The dataframe created from the raw extracted data is not ready for analysis. Let's clean the data by removing irrelevant columns and rows.

In [63]:
# Dropping irrelevant columns
df = df.drop(['N', 'Country','Admin1','Admin2' , 'Admin3'], axis=1)

In [64]:
# Examining the table, we discovered that the last 3 rows are outliers, from a location outside Vancouver
df.tail()

Unnamed: 0,Place,Code
86,Vancouver (NE Downtown / Harbour Centre / Gast...,V6B
87,49.279/-123.114,
88,Parksville,V9P
89,49.316/-124.319,
90,,


In [65]:
# Removing the last 3 rows
df.drop(df.tail(3).index,inplace=True)

#### Splitting our dataframe

Notice that the neighborhood names and the coordinates are interleaved in alternating rows.

To make our data clean and clear, let's separate our dataframe into two: one for the neighborhoods and postal codes and the other for the coordinates. We are also renaming and reordering the columns for the sake of coherence.

In [66]:
# df_postal will be the dataframe containing the neighborhoods and postal codes
df_postal = df.iloc[::2]

# Fixing the rows' index numbers
df_postal.index = range(44)

# Renaming column index labels
df_postal.columns = ['Neighborhood', 'PostalCode']

# Reordering the columns
df_postal = df_postal[['PostalCode', 'Neighborhood']]

df_postal.head()

Unnamed: 0,PostalCode,Neighborhood
0,V5S,Vancouver (Killarney)
1,V5K,Vancouver (North Hastings-Sunrise)
2,V5L,Vancouver (North Grandview-Woodlands)
3,V5P,Vancouver (SE Kensington / Victoria-Fraserview)
4,V5R,Vancouver (South Renfrew-Collingwood)


Now that we have our new dataframe <b>df_postal</b>, let's create <b>df_coord</b>, which will contain the coordinate data.

In [67]:
# df_coord will be the dataframe containing the coordinates
# (every second row, starting from the second one)
df_coord = df.iloc[1::2]

# Fixing the rows' index numbers
df_coord.index = range(44)

# Dropping 'Code' column
df_coord = df_coord.drop(['Code'], axis=1)

# Renaming column index
df_coord.columns = ['Coordinates']

df_coord.head()

Unnamed: 0,Coordinates
0,49.218/-123.038
1,49.281/-123.04
2,49.279/-123.067
3,49.222/-123.068
4,49.24/-123.041


We now have two separate dataframes, but there is an issue in the second one, <b>df_coord</b>: the geographic coordinates are combined into single cells.
Thus, let's split the "Coordinates" column into "Latitude" and "Longitude".

In [68]:
df_coord_split = df_coord["Coordinates"].str.split("/", n = 1, expand = True)

# Defining the new columns' names
df_coord["Latitude"]= df_coord_split[0]
df_coord["Longitude"]= df_coord_split[1]

# Dropping the 'old' column
df_coord.drop(['Coordinates'], axis=1, inplace=True)

df_coord.head()

Unnamed: 0,Latitude,Longitude
0,49.218,-123.038
1,49.281,-123.04
2,49.279,-123.067
3,49.222,-123.068
4,49.24,-123.041
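
The splitting steps above can be sketched without pandas to make the logic explicit. This is a minimal illustration, not the notebook's implementation: `raw_rows` holds the first two entries shown above, and, unlike the notebook (which keeps the coordinates as text), the sketch converts them to `float`.

```python
# Rows as they come out of the scrape: a name/code row, then a coordinates row.
# The values are copied from the first entries shown above.
raw_rows = [
    ("Vancouver (Killarney)", "V5S"),
    ("49.218/-123.038", None),
    ("Vancouver (North Hastings-Sunrise)", "V5K"),
    ("49.281/-123.04", None),
]

name_rows = raw_rows[0::2]    # even positions: neighborhood and postal code
coord_rows = raw_rows[1::2]   # odd positions: "lat/lng" strings

records = []
for (name, code), (coords, _) in zip(name_rows, coord_rows):
    # Split on the first "/" only, mirroring str.split("/", n=1)
    lat_text, lng_text = coords.split("/", maxsplit=1)
    records.append(
        {
            "PostalCode": code,
            "Neighborhood": name,
            "Latitude": float(lat_text),    # assumption: numeric values are wanted
            "Longitude": float(lng_text),
        }
    )

for record in records:
    print(record)
```

Slicing with a step of 2 pairs each neighborhood with the coordinates row that follows it, so the pairing depends on the rows staying in their scraped order.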


#### Merging the two dataframes

Now that we have treated the two dataframes <b>df_postal</b> and <b>df_coord</b>, it is time to join them into a single coherent dataframe.

In [70]:
df_van =  df_postal.join(df_coord)

# Let's check it out
df_van.head()

Unnamed: 0,PostalCode,Neighborhood,Latitude,Longitude
0,V5S,Vancouver (Killarney),49.218,-123.038
1,V5K,Vancouver (North Hastings-Sunrise),49.281,-123.04
2,V5L,Vancouver (North Grandview-Woodlands),49.279,-123.067
3,V5P,Vancouver (SE Kensington / Victoria-Fraserview),49.222,-123.068
4,V5R,Vancouver (South Renfrew-Collingwood),49.24,-123.041


In [71]:
# Utilizing geopy to get the coordinates of Vancouver
address = 'Vancouver, Canada'

geolocator = Nominatim(user_agent="vancouver_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Vancouver are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Vancouver are 49.2608724, -123.1139529.


#### Creating a map of Vancouver with the neighborhoods superimposed

In [72]:
# create map of Vancouver using latitude and longitude values
map_vancouver = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df_van['Latitude'], df_van['Longitude'], df_van['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_vancouver)  
    
map_vancouver

#### Making some adjustments

Now that we can explore the map, some adjustments have to be made. First, we fix the coordinates of postal code <b>V6T</b>, which belongs to the University of British Columbia and is incorrectly displayed. Then, we exclude the neighborhoods of North Vancouver and West Vancouver: although they are part of the metropolitan area, the project studies only the City of Vancouver. Finally, we remove the word "Vancouver" from the neighborhood cells in our table to make it cleaner.

In [73]:
# Fixing the latitude and longitude values of the University of British Columbia - UBC
df_van.at[41, 'Latitude'] = 49.267
df_van.at[41, 'Longitude'] = -123.242

# Removing neighborhoods of North Vancouver and West Vancouver
searchfor = ['North Vancouver', 'West Vancouver']
vancouver_data = df_van[~df_van.Neighborhood.str.contains('|'.join(searchfor))].copy()

# Removing the leading "Vancouver" from the neighborhood cells
# (str.strip('Vancouver') would strip a set of characters, not the word)
pd.set_option('mode.chained_assignment', None)
vancouver_data['Neighborhood'] = vancouver_data['Neighborhood'].str.replace(r'^Vancouver', '', regex=True)

# Fixing the rows' index numbers
vancouver_data.index = range(31)

In [74]:
# Let's check what our final vancouver_data dataframe looks like
vancouver_data.head()

Unnamed: 0,PostalCode,Neighborhood,Latitude,Longitude
0,V5S,(Killarney),49.218,-123.038
1,V5K,(North Hastings-Sunrise),49.281,-123.04
2,V5L,(North Grandview-Woodlands),49.279,-123.067
3,V5P,(SE Kensington / Victoria-Fraserview),49.222,-123.068
4,V5R,(South Renfrew-Collingwood),49.24,-123.041
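
Two string operations in this cleanup are easy to misread, so here is a small standalone sketch of how they behave. The entries marked as made-up are hypothetical labels for illustration only. It shows that joining the search terms with `|` builds a regular-expression alternation, and it compares `str.strip`, which removes a set of characters from both ends, with removing only the leading word.

```python
import re

names = [
    "Vancouver (Killarney)",
    "North Vancouver (hypothetical entry)",   # made-up label for illustration
    "West Vancouver (hypothetical entry)",    # made-up label for illustration
    "Vancouver Cove",                         # made-up name to show the pitfall
]

# '|'.join(...) builds the alternation pattern 'North Vancouver|West Vancouver'.
pattern = "|".join(["North Vancouver", "West Vancouver"])
kept = [n for n in names if not re.search(pattern, n)]

# str.strip treats its argument as a set of characters, not as a word.
stripped = [n.strip("Vancouver") for n in kept]

# Removing only the leading word leaves the rest of the name untouched.
prefix_removed = [re.sub(r"^Vancouver", "", n) for n in kept]

print(kept)
print(stripped)
print(prefix_removed)
```

For labels of the form `Vancouver (…)` both approaches give the same result, because the space and the closing parenthesis stop the stripping. The difference only shows up for names that begin or end with letters found in the word "Vancouver".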


Let's examine the map again, to check if everything is ok.

In [76]:
# create map of Vancouver using latitude and longitude values
map_vancouver = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(vancouver_data['Latitude'], vancouver_data['Longitude'], vancouver_data['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_vancouver)  
    
map_vancouver

In [77]:
# Checking the size of our dataframe
vancouver_data.shape

(31, 4)

Now our data is ready to be processed. Notice that we have 31 rows of postal codes and 4 columns (PostalCode, Neighborhood, Latitude and Longitude).

One important observation should be made. Each postal code contains from one to four neighborhoods. Since the neighborhoods combined in a postal code are adjacent, it is convenient to treat them as a unit. Thus, we are considering 31 neighborhoods for this project.
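
As a quick illustration of how one label can hold several neighborhoods, the sketch below splits a few labels from the scraped table on the `" / "` separator. The helper name `split_label` is hypothetical, and the assumption that `" / "` always separates neighborhoods is inferred from the labels themselves.

```python
# Labels as they appear in the scraped table; " / " separates the
# neighborhoods that share one postal code.
labels = [
    "(Killarney)",
    "(SE Kensington / Victoria-Fraserview)",
    "(SE Oakridge / East Marpole / South Sunset)",
    "(South Shaughnessy / NW Oakridge / NE Kerrisdale / SE Arbutus Ridge)",
]


def split_label(label):
    """Return the individual neighborhood names inside one postal-code label."""
    inner = label.strip().strip("()")   # drop surrounding spaces and parentheses
    return [part.strip() for part in inner.split(" / ")]


neighborhood_counts = [len(split_label(label)) for label in labels]

for label, count in zip(labels, neighborhood_counts):
    print(count, split_label(label))
```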

## 3. Methodology <a name="methodology"></a>

The focus of this project is to detect regions of Vancouver with an abundance of parks and trails as well as plenty of services such as restaurants, cafés, bars and a good transport system. To achieve our objective we will use the Foursquare API to search for the most common venues in each neighborhood. A venue will be assigned to a neighborhood if it is located within 500 meters of the neighborhood's coordinates.

First, we test our approach by exploring the first neighborhood in our dataframe. We will search for the top 100 venues located in the region, with the assistance of the Foursquare API, and analyze the results.

Then, we replicate this process for all the neighborhoods in Vancouver. At this point we can inspect the number of venues found for each neighborhood, limited to 100, and the number of unique venue categories.

In the following step, we use one-hot encoding to convert the categorical data to numerical data, so our machine learning model can properly deal with it. The venues will be grouped by neighborhood, taking the mean of the frequency of occurrence of each unique venue category. After this process, we can observe the most common venue categories per neighborhood.

Finally, the 31 neighborhoods will be segmented into 8 clusters using <b>k-means</b>, an unsupervised machine learning algorithm that groups similar neighborhoods into the same cluster. Once all 8 clusters are established, a new dataframe containing the top 10 venue categories per neighborhood will be created, and we can plot a map and view the results. At this point we should have enough information to identify the optimal location for the stakeholder.

### Analysis

#### Working with the Foursquare API

In [78]:
# The code was removed by Watson Studio for sharing.

In [93]:
VERSION = '20201231' # Foursquare API version

#### Exploring the first neighborhood in the dataframe

In [94]:
# Getting the neighborhood's name
vancouver_data.loc[0, 'Neighborhood']

' (Killarney)'

In [95]:
# Getting coordinates for Killarney
neighborhood_latitude = vancouver_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = vancouver_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = vancouver_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of  (Killarney) are 49.218, -123.038.


#### Let's get the top 100 venues located in Killarney within a radius of 500 meters.

The Foursquare API returns a JSON document containing all the extracted information.

In [96]:
# We need to create a GET request
radius = 500 # define radius
LIMIT = 100 # limit of number of venues returned by Foursquare API

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
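
The same request URL can also be assembled from a dictionary of parameters, which keeps each value next to its name. The sketch below uses only the standard library. The credentials are placeholders, and the `demo_` names are hypothetical so they do not clash with the notebook's variables.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder credentials: the real values are kept out of the notebook.
demo_params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20201231",            # API version, as set in VERSION
    "ll": "49.218,-123.038",    # Killarney, written as "lat,lng"
    "radius": 500,
    "limit": 100,
}

demo_base_url = "https://api.foursquare.com/v2/venues/explore"
demo_url = demo_base_url + "?" + urlencode(demo_params)
print(demo_url)

# Round trip: the query string decodes back to the same parameters.
decoded = parse_qs(urlsplit(demo_url).query)
print(decoded["ll"], decoded["radius"])
```

`requests.get` accepts the same dictionary through its `params` argument, so the URL does not have to be formatted by hand.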

In [97]:
# Now let's examine the results
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ea2063798205d7478cb4387'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Killarney',
  'headerFullLocation': 'Killarney, Vancouver',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 17,
  'suggestedBounds': {'ne': {'lat': 49.222500004500006,
    'lng': -123.03112351337492},
   'sw': {'lat': 49.2134999955, 'lng': -123.04487648662507}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b636f14f964a5207f792ae3',
       'name': 'Champlain Square',
       'location': {'address': '7180 Kerr St.',
        'crossStreet': '@ 54th Ave.',
        'lat': 49.21877353130896,
        'lng': -123.04038966866126,
        'labeledLatLngs': [{'la

Inspecting the JSON, we can see that all the important information is under the <b>items</b> key. Let's extract the category of each venue.

In [98]:
# Defining a function to extract the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


Now let's clean the JSON and structure it into a pandas dataframe called <b>nearby_venues</b>.

In [87]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Champlain Square,Shopping Mall,49.218774,-123.04039
1,A&W,Fast Food Restaurant,49.219269,-123.040876
2,Kin's Farm Market,Farmers Market,49.219534,-123.040562
3,Sushi Go,Sushi Restaurant,49.219544,-123.041
4,Subway,Sandwich Place,49.218948,-123.039908


Above we find a list of venues found in Killarney.
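
As a sanity check on the 500 meter radius, the distance between the search point and a few of the returned venues can be estimated with the haversine formula. This is a sketch under stated assumptions: the Earth is treated as a sphere with a mean radius of 6,371 km, and whether Foursquare measures its radius in exactly this way is not confirmed here. The coordinates are copied from the table above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


killarney = (49.218, -123.038)
sample_venues = {
    "Champlain Square": (49.218774, -123.04039),
    "A&W": (49.219269, -123.040876),
    "Sushi Go": (49.219544, -123.041),
}

distances = {
    name: haversine_m(*killarney, lat, lng)
    for name, (lat, lng) in sample_venues.items()
}

for name, d in distances.items():
    print(f"{name}: {d:.0f} m from the search point")
```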

#### Now let's create a function to repeat the same process for all the neighborhoods in Vancouver

In [99]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues) 
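
One fragile spot in the function above is `v['venue']['categories'][0]`, which would raise an `IndexError` for a venue with an empty category list. A hedged sketch of a more defensive extraction step follows. The helper `extract_venue` is hypothetical, and the second sample item is a made-up venue that only demonstrates the empty case.

```python
def extract_venue(item):
    """Return (name, lat, lng, category) for one entry of the 'items' list.

    The category is None when the venue has no categories listed.
    """
    venue = item["venue"]
    categories = venue.get("categories") or []
    category = categories[0]["name"] if categories else None
    return (
        venue["name"],
        venue["location"]["lat"],
        venue["location"]["lng"],
        category,
    )


# Trimmed items in the shape of the JSON shown earlier; the second entry is a
# made-up venue with an empty category list.
sample_items = [
    {"venue": {"name": "Champlain Square",
               "location": {"lat": 49.21877353130896, "lng": -123.04038966866126},
               "categories": [{"name": "Shopping Mall"}]}},
    {"venue": {"name": "Hypothetical Venue",
               "location": {"lat": 49.218, "lng": -123.038},
               "categories": []}},
]

extracted = [extract_venue(item) for item in sample_items]
for row in extracted:
    print(row)
```

Rows with a `None` category could then be dropped or kept under a placeholder label before the one-hot encoding step.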

We can run the function above on each neighborhood and create a new dataframe called <b>vancouver_venues</b>, containing all venues found for each neighborhood, limited to 100.

In [100]:
vancouver_venues = getNearbyVenues(names=vancouver_data['Neighborhood'],
                                   latitudes=vancouver_data['Latitude'],
                                   longitudes=vancouver_data['Longitude']
                                  )

 (Killarney)
 (North Hastings-Sunrise)
 (North Grandview-Woodlands)
 (SE Kensington / Victoria-Fraserview)
 (South Renfrew-Collingwood)
 (East Mount Pleasant)
 (East Fairview / South Cambie)
 (South West End)
 (Central Kitsilano)
 (NW Arbutus Ridge)
 (Dunbar-Southlands / Musqueam)
 (West Kitsilano / Jericho)
 (SW Downtown)
 (Bentall Centre)
 (Pacific Centre)
 (South Hastings-Sunrise / North Renfrew-Collingwood)
 (South Grandview-Woodlands / NE Kensington)
 (SE Oakridge / East Marpole / South Sunset)
 (Waterfront / Coal Harbour / Canada Place)
 (North West End / Stanley Park)
 (West Fairview / Granville Island / NE Shaughnessy)
 (NW Shaughnessy / East Kitsilano / Quilchena)
 (SE Kerrisdale / SW Oakridge / West Marpole)
 (Chaldecutt / South University Endowment Lands)
 (West Kensington / NE Riley Park-Little Mountain)
 (West Mount Pleasant / West Riley Park-Little Mountain)
 (South Shaughnessy / NW Oakridge / NE Kerrisdale / SE Arbutus Ridge)
 (SE Riley Park-Little Mountain / SW Kensingt

Let's check the size of the <b>vancouver_venues</b> dataframe.

In [101]:
print(vancouver_venues.shape)
vancouver_venues.head()

(778, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,(Killarney),49.218,-123.038,Champlain Square,49.218774,-123.04039,Shopping Mall
1,(Killarney),49.218,-123.038,A&W,49.219269,-123.040876,Fast Food Restaurant
2,(Killarney),49.218,-123.038,Kin's Farm Market,49.219534,-123.040562,Farmers Market
3,(Killarney),49.218,-123.038,Sushi Go,49.219544,-123.041,Sushi Restaurant
4,(Killarney),49.218,-123.038,Subway,49.218948,-123.039908,Sandwich Place


We can observe that a total of 778 venues were identified. Now let's check the quantity of venues returned for each neighborhood.

In [102]:
vancouver_venues.groupby('Neighborhood').count()['Venue']

Neighborhood
 (Bentall Centre)                                                                  9
 (Central Kitsilano)                                                              29
 (Chaldecutt / South University Endowment Lands)                                   2
 (Dunbar-Southlands / Musqueam)                                                    3
 (East Fairview / South Cambie)                                                   20
 (East Mount Pleasant)                                                            25
 (Killarney)                                                                      17
 (NE Downtown / Harbour Centre / Gastown / Yaletown)                              37
 (NW Arbutus Ridge)                                                                3
 (NW Shaughnessy / East Kitsilano / Quilchena)                                    15
 (North Grandview-Woodlands)                                                      37
 (North Hastings-Sunrise)                           

Let's check how many unique categories we have.

In [103]:
print('There are {} unique categories.'.format(len(vancouver_venues['Venue Category'].unique())))

There are 171 unique categories.


#### Now we are going to analyze each neighborhood

We are using one-hot encoding to convert categorical data to numerical data.

In [104]:
# one hot encoding
vancouver_onehot = pd.get_dummies(vancouver_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
vancouver_onehot['Neighborhood'] = vancouver_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [vancouver_onehot.columns[-1]] + list(vancouver_onehot.columns[:-1])
vancouver_onehot = vancouver_onehot[fixed_columns]

vancouver_onehot.head()

Unnamed: 0,Neighborhood,Airport Terminal,American Restaurant,Amphitheater,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Australian Restaurant,Bagel Shop,...,Thrift / Vintage Store,Toy / Game Store,Trade School,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Shop,Yoga Studio
0,(Killarney),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,(Killarney),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,(Killarney),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,(Killarney),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,(Killarney),0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [105]:
# Let's examine the dataframe size
vancouver_onehot.shape

(778, 172)

Observe that the dataframe <b>vancouver_onehot</b> now has 778 rows, one for each venue detected, and 172 columns: the neighborhood name plus 171 unique venue categories.

#### Now let's group the rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [106]:
vancouver_grouped = vancouver_onehot.groupby('Neighborhood').mean().reset_index()
vancouver_grouped.head()

Unnamed: 0,Neighborhood,Airport Terminal,American Restaurant,Amphitheater,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Australian Restaurant,Bagel Shop,...,Thrift / Vintage Store,Toy / Game Store,Trade School,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Shop,Yoga Studio
0,(Bentall Centre),0.111111,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,(Central Kitsilano),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,...,0.0,0.0,0.0,0.0,0.068966,0.0,0.034483,0.0,0.034483,0.034483
2,(Chaldecutt / South University Endowment Lands),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,(Dunbar-Southlands / Musqueam),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0
4,(East Fairview / South Cambie),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0


In [107]:
# Let's check the new size
vancouver_grouped.shape

(31, 172)

Now that we have grouped the neighborhoods, our <b>vancouver_grouped</b> dataframe shows 31 rows and 172 columns.
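
The mean of a one-hot column is simply the share of a neighborhood's venues that fall in that category. That is why Bentall Centre, with 9 venues, shows values such as 0.111111 (1/9). The pure-Python sketch below illustrates the equivalence on a small made-up sample, where neighborhoods "A" and "B" are hypothetical.

```python
from collections import Counter, defaultdict

# (neighborhood, venue category) pairs; a small made-up sample
toy_venues = [
    ("A", "Park"), ("A", "Park"), ("A", "Yoga Studio"), ("A", "Trail"),
    ("B", "Coffee Shop"), ("B", "Park"),
]

categories = sorted({cat for _, cat in toy_venues})

# One-hot encode: one row per venue, one 0/1 column per category.
onehot_rows = defaultdict(list)
for hood, cat in toy_venues:
    onehot_rows[hood].append([1 if cat == c else 0 for c in categories])

# Column-wise mean per neighborhood, as groupby(...).mean() would compute.
grouped = {}
for hood, rows in onehot_rows.items():
    n = len(rows)
    grouped[hood] = {c: sum(r[i] for r in rows) / n for i, c in enumerate(categories)}

# The same numbers, computed directly as count / total.
counts = defaultdict(Counter)
for hood, cat in toy_venues:
    counts[hood][cat] += 1

for hood in grouped:
    total = sum(counts[hood].values())
    for cat in categories:
        print(hood, cat, grouped[hood][cat], counts[hood][cat] / total)
```

Because each row sums to 1, neighborhoods with very few venues end up with a few large frequencies, which is worth keeping in mind when the clusters are interpreted.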

Let's print each neighborhood along with the top 5 most common venues.

In [108]:
num_top_venues = 5

for hood in vancouver_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = vancouver_grouped[vancouver_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- (Bentall Centre)----
                 venue  freq
0                Plaza  0.22
1     Airport Terminal  0.11
2  American Restaurant  0.11
3            Irish Pub  0.11
4            Gastropub  0.11


---- (Central Kitsilano)----
                           venue  freq
0                    Coffee Shop  0.10
1                           Café  0.07
2  Vegetarian / Vegan Restaurant  0.07
3                    Yoga Studio  0.03
4                 Farmers Market  0.03


---- (Chaldecutt / South University Endowment Lands)----
                       venue  freq
0                       Park   1.0
1           Airport Terminal   0.0
2    New American Restaurant   0.0
3  Middle Eastern Restaurant   0.0
4         Miscellaneous Shop   0.0


---- (Dunbar-Southlands / Musqueam)----
                     venue  freq
0    Vietnamese Restaurant  0.33
1             Home Service  0.33
2     Fast Food Restaurant  0.33
3         Airport Terminal  0.00
4  New American Restaurant  0.00


---- (East Fairview / So

Let's put this data into a pandas dataframe in descending order.

In [109]:
# Sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [110]:
# Display the top 10 venues for each neighborhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = vancouver_grouped['Neighborhood']

for ind in np.arange(vancouver_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(vancouver_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,(Bentall Centre),Plaza,Airport Terminal,American Restaurant,Outdoor Sculpture,Irish Pub,Breakfast Spot,Gym,Gastropub,Farm,Fish & Chips Shop
1,(Central Kitsilano),Coffee Shop,Vegetarian / Vegan Restaurant,Café,Yoga Studio,Restaurant,Frozen Yogurt Shop,Burger Joint,Food Truck,Bus Stop,Liquor Store
2,(Chaldecutt / South University Endowment Lands),Park,Yoga Studio,Falafel Restaurant,Food Court,Fish & Chips Shop,Financial or Legal Service,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market
3,(Dunbar-Southlands / Musqueam),Vietnamese Restaurant,Home Service,Fast Food Restaurant,Yoga Studio,Falafel Restaurant,Food Court,Fish & Chips Shop,Financial or Legal Service,Filipino Restaurant,Field
4,(East Fairview / South Cambie),Bus Stop,Coffee Shop,Chinese Restaurant,Grocery Store,Malay Restaurant,Café,Light Rail Station,Sushi Restaurant,Bank,Dessert Shop
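
Note that when a neighborhood has fewer distinct categories than the number of columns, the remaining slots are filled with categories whose frequency is zero (the Chaldecutt row lists only one category that was actually observed, Park). The sketch below shows one way such entries could be filtered out. The helper `top_categories` and its `drop_zero` flag are hypothetical, not part of the notebook.

```python
def top_categories(freqs, n, drop_zero=True):
    """Return up to n category names ordered by descending frequency."""
    items = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    if drop_zero:
        items = [(cat, f) for cat, f in items if f > 0]
    return [cat for cat, _ in items[:n]]


# Frequencies in the spirit of the Chaldecutt row: only parks were found.
toy_freqs = {
    "Park": 1.0,
    "Yoga Studio": 0.0,
    "Falafel Restaurant": 0.0,
    "Food Court": 0.0,
}

observed_only = top_categories(toy_freqs, 3)
with_fillers = top_categories(toy_freqs, 3, drop_zero=False)

print(observed_only)
print(with_fillers)
```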


#### Segmenting the neighborhoods into clusters

The <b>k-means</b> method will be used to divide the neighborhoods into 8 clusters.

In [111]:
# set number of clusters
kclusters = 8

vancouver_grouped_clustering = vancouver_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(vancouver_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([6, 1, 2, 0, 5, 1, 1, 1, 7, 1], dtype=int32)
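
To illustrate what the clustering step does, here is a minimal pure-Python sketch of the basic k-means loop (Lloyd's algorithm) on made-up frequency vectors. It is not scikit-learn's implementation, which uses a more careful initialization (k-means++ by default). The helper `simple_kmeans`, the choice of the first `k` points as seeds and the toy data are assumptions for illustration.

```python
def simple_kmeans(points, k, n_iter=100):
    """Basic Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:   # no change: the algorithm has converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up venue frequencies over (Park, Coffee Shop, Restaurant)
toy_points = [
    (0.9, 0.1, 0.0),   # park-heavy
    (0.1, 0.5, 0.4),   # commercial
    (0.8, 0.0, 0.2),   # park-heavy
    (0.0, 0.6, 0.4),   # commercial
    (1.0, 0.0, 0.0),   # park-heavy
]

toy_labels, toy_centroids = simple_kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

The label numbers themselves carry no meaning: they only indicate which neighborhoods were grouped together.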

A new dataframe <b>vancouver_merged</b> will be created including the cluster and the top 10 venues for each neighborhood.

In [112]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

vancouver_merged = vancouver_data

# merge vancouver_data with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
vancouver_merged = vancouver_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

vancouver_merged.head()

Unnamed: 0,PostalCode,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,V5S,(Killarney),49.218,-123.038,1,Deli / Bodega,Coffee Shop,Shopping Mall,Fast Food Restaurant,Liquor Store,Sandwich Place,Farmers Market,Mobile Phone Shop,Chinese Restaurant,Sushi Restaurant
1,V5K,(North Hastings-Sunrise),49.281,-123.04,1,Park,Event Space,Theme Park,Gas Station,Fair,Pizza Place,Market,Farm,Portuguese Restaurant,Bus Station
2,V5L,(North Grandview-Woodlands),49.279,-123.067,1,Asian Restaurant,Pizza Place,Sushi Restaurant,Café,Brewery,Chinese Restaurant,Deli / Bodega,Coffee Shop,Grocery Store,Theater
3,V5P,(SE Kensington / Victoria-Fraserview),49.222,-123.068,1,Pizza Place,Park,Gas Station,Convenience Store,Motorcycle Shop,Pet Store,Pharmacy,Middle Eastern Restaurant,Restaurant,Sandwich Place
4,V5R,(South Renfrew-Collingwood),49.24,-123.041,1,Park,Hotel,Bus Stop,Fish & Chips Shop,Asian Restaurant,Bar,Farm,Food Court,Financial or Legal Service,Filipino Restaurant
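
One thing to watch with this kind of left join: a neighborhood that returned no venues has no row in `neighborhoods_venues_sorted`, so its cluster label becomes `NaN` and the column turns into floats, which cannot be used to index the color list later. The sketch below reproduces this on toy data and shows one possible way to handle it. The frames `areas` and `venues` are hypothetical stand-ins for `vancouver_data` and `neighborhoods_venues_sorted`; whether the real data contains unmatched neighborhoods is not shown in the notebook output.

```python
import pandas as pd

# Toy stand-ins for vancouver_data and neighborhoods_venues_sorted
areas = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Latitude': [49.21, 49.28, 49.25],
})
venues = pd.DataFrame({
    'Neighborhood': ['A', 'B'],   # 'C' returned no venues
    'Cluster Labels': [1, 0],
})

# Left join on the neighborhood name, as in the cell above
merged = areas.join(venues.set_index('Neighborhood'), on='Neighborhood')
print(merged['Cluster Labels'].dtype)   # float64, because of the NaN for 'C'

# Drop unmatched rows and restore integer labels
merged = merged.dropna(subset=['Cluster Labels']).astype({'Cluster Labels': int})
print(merged)
```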


#### Let's plot the map with the clusters

In [113]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one hex color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(vancouver_merged['Latitude'], vancouver_merged['Longitude'], vancouver_merged['Neighborhood'], vancouver_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
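
To make the color assignment easier to follow, here is a small standalone sketch of how a palette with one color per cluster can be built and looked up by label. It assumes the same `matplotlib` helpers used in the cell above; `palette` and `color_by_cluster` are illustrative names. Note that in the map cell an index of `cluster-1` sends label 0 to the last palette entry through Python's negative indexing, so every cluster still gets a distinct color, only shifted by one position.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 8

# Sample the rainbow colormap at kclusters evenly spaced points and convert to hex
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map every label produced by k-means (0 .. kclusters-1) to its own color
color_by_cluster = {label: palette[label] for label in range(kclusters)}
print(color_by_cluster)
```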

#### Let's examine the clusters to discover how they differ from each other

In [114]:
# Cluster 0
vancouver_merged.loc[vancouver_merged['Cluster Labels'] == 0, vancouver_merged.columns[[1] + list(range(5, vancouver_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,(Dunbar-Southlands / Musqueam),Vietnamese Restaurant,Home Service,Fast Food Restaurant,Yoga Studio,Falafel Restaurant,Food Court,Fish & Chips Shop,Financial or Legal Service,Filipino Restaurant,Field


In [115]:
# Cluster 1
vancouver_merged.loc[vancouver_merged['Cluster Labels'] == 1, vancouver_merged.columns[[1] + list(range(5, vancouver_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,(Killarney),Deli / Bodega,Coffee Shop,Shopping Mall,Fast Food Restaurant,Liquor Store,Sandwich Place,Farmers Market,Mobile Phone Shop,Chinese Restaurant,Sushi Restaurant
1,(North Hastings-Sunrise),Park,Event Space,Theme Park,Gas Station,Fair,Pizza Place,Market,Farm,Portuguese Restaurant,Bus Station
2,(North Grandview-Woodlands),Asian Restaurant,Pizza Place,Sushi Restaurant,Café,Brewery,Chinese Restaurant,Deli / Bodega,Coffee Shop,Grocery Store,Theater
3,(SE Kensington / Victoria-Fraserview),Pizza Place,Park,Gas Station,Convenience Store,Motorcycle Shop,Pet Store,Pharmacy,Middle Eastern Restaurant,Restaurant,Sandwich Place
4,(South Renfrew-Collingwood),Park,Hotel,Bus Stop,Fish & Chips Shop,Asian Restaurant,Bar,Farm,Food Court,Financial or Legal Service,Filipino Restaurant
5,(East Mount Pleasant),Sushi Restaurant,Vietnamese Restaurant,Ethiopian Restaurant,Park,Bar,Liquor Store,Market,Pub,Sports Bar,Japanese Restaurant
7,(South West End),Japanese Restaurant,Bakery,Food Truck,Dessert Shop,Hotel,Coffee Shop,Gay Bar,Mexican Restaurant,Sushi Restaurant,Indian Restaurant
8,(Central Kitsilano),Coffee Shop,Vegetarian / Vegan Restaurant,Café,Yoga Studio,Restaurant,Frozen Yogurt Shop,Burger Joint,Food Truck,Bus Stop,Liquor Store
12,(SW Downtown),Mexican Restaurant,Bakery,Middle Eastern Restaurant,Sushi Restaurant,Café,Greek Restaurant,Hotel,Pool,French Restaurant,Dog Run
14,(Pacific Centre),Hotel,Food Truck,Clothing Store,Dessert Shop,Lounge,Concert Hall,Steakhouse,Coffee Shop,Italian Restaurant,Cosmetics Shop


In [116]:
# Cluster 2
vancouver_merged.loc[vancouver_merged['Cluster Labels'] == 2, vancouver_merged.columns[[1] + list(range(5, vancouver_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,(West Kitsilano / Jericho),Park,Yoga Studio,Falafel Restaurant,Food Court,Fish & Chips Shop,Financial or Legal Service,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market
23,(Chaldecutt / South University Endowment Lands),Park,Yoga Studio,Falafel Restaurant,Food Court,Fish & Chips Shop,Financial or Legal Service,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market


In [117]:
# Cluster 3
vancouver_merged.loc[vancouver_merged['Cluster Labels'] == 3, vancouver_merged.columns[[1] + list(range(5, vancouver_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25,(West Mount Pleasant / West Riley Park-Little...,Chinese Restaurant,Dessert Shop,Coffee Shop,Farm,Food Truck,Food Court,Fish & Chips Shop,Financial or Legal Service,Filipino Restaurant,Field


In [118]:
# Cluster 4
vancouver_merged.loc[vancouver_merged['Cluster Labels'] == 4, vancouver_merged.columns[[1] + list(range(5, vancouver_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,(North West End / Stanley Park),Park,Outdoor Sculpture,Trail,Garden,Yoga Studio,Falafel Restaurant,Fish & Chips Shop,Financial or Legal Service,Filipino Restaurant,Field


In [119]:
# Cluster 5
vancouver_merged.loc[vancouver_merged['Cluster Labels'] == 5, vancouver_merged.columns[[1] + list(range(5, vancouver_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,(East Fairview / South Cambie),Bus Stop,Coffee Shop,Chinese Restaurant,Grocery Store,Malay Restaurant,Café,Light Rail Station,Sushi Restaurant,Bank,Dessert Shop
22,(SE Kerrisdale / SW Oakridge / West Marpole),Chinese Restaurant,Sushi Restaurant,Bubble Tea Shop,Thai Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Bus Stop,Dessert Shop,Gas Station
26,(South Shaughnessy / NW Oakridge / NE Kerrisd...,Bus Stop,Chinese Restaurant,Asian Restaurant,Sushi Restaurant,Coffee Shop,Yoga Studio,Farmers Market,Food Court,Fish & Chips Shop,Financial or Legal Service


In [122]:
# Cluster 6
vancouver_merged.loc[vancouver_merged['Cluster Labels'] == 6, vancouver_merged.columns[[1] + list(range(5, vancouver_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,(Bentall Centre),Plaza,Airport Terminal,American Restaurant,Outdoor Sculpture,Irish Pub,Breakfast Spot,Gym,Gastropub,Farm,Fish & Chips Shop


In [123]:
# Cluster 7
vancouver_merged.loc[vancouver_merged['Cluster Labels'] == 7, vancouver_merged.columns[[1] + list(range(5, vancouver_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,(NW Arbutus Ridge),Caribbean Restaurant,Italian Restaurant,Bakery,Yoga Studio,Farmers Market,Food Truck,Food Court,Fish & Chips Shop,Financial or Legal Service,Filipino Restaurant
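
The eight nearly identical cells above could also be condensed into a single loop. The following is a minimal sketch on toy data, assuming a frame shaped like `vancouver_merged` (the frame `merged` and its values are made up for illustration). It first counts the neighborhoods in each cluster and then prints the members cluster by cluster.

```python
import pandas as pd

# Toy stand-in for vancouver_merged with only the columns needed here
merged = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D'],
    'Cluster Labels': [1, 0, 1, 2],
    '1st Most Common Venue': ['Park', 'Café', 'Park', 'Trail'],
})

# Number of neighborhoods per cluster, ordered by label
cluster_sizes = merged['Cluster Labels'].value_counts().sort_index()
print(cluster_sizes)

# Print the members of each cluster in a single loop
for label, group in merged.groupby('Cluster Labels'):
    print(f'Cluster {label}:')
    print(group.drop(columns='Cluster Labels').to_string(index=False))
```

Looking at the cluster sizes first makes it easy to spot clusters that contain only one neighborhood.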


## 4. Results and Discussion <a name="results"></a>

After analyzing the results of our model, we found that Vancouver has a multitude of outdoor venues that could be suitable for bikers. In many clusters, parks, gardens and fields appear among the top 10 most common categories. Some specific clusters caught our attention, considering the characteristics we were searching for.

Of the eight clusters generated by the model, two stood out, namely clusters 2 and 4. Both clusters include neighborhoods surrounded by parks and trails, and they would probably be a smart choice when deciding on a location to open a bike rental company.

To help the stakeholder choose between them, I will narrow the analysis to clusters 2 and 4. In addition to the positive aspects both regions share, cluster 4, which covers North West End and Stanley Park, has parks, outdoor sculptures, trails, gardens and fields among its top 10 most common venues. These are strong indicators that this region is one of the most suitable for biking and, as a consequence, for starting a bike rental company. Furthermore, North West End and Stanley Park are located close to cluster 1, which covers some neighborhoods in the downtown region of the city and also contains some parks, as well as hotels, restaurants, cafés, stores, and bus and metro stations. The region has a large variety of venues and is easily accessible via public transport, which may attract both locals and tourists. Finally, these are also clear signals that the region and its surroundings might be a good place to live, an important aspect for the stakeholder.
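
The comparison between clusters 2 and 4 can be made a little more concrete by counting outdoor categories in their top 10 lists. This is only a sketch: the venue lists are copied from the cluster tables above, while the set `OUTDOOR` of bike-friendly categories is a manual assumption made for illustration and not something derived from the data.

```python
# Top 10 venue categories copied from the cluster tables above
top_venues = {
    'Cluster 2 (West Kitsilano / Jericho)': [
        'Park', 'Yoga Studio', 'Falafel Restaurant', 'Food Court',
        'Fish & Chips Shop', 'Financial or Legal Service',
        'Filipino Restaurant', 'Field', 'Fast Food Restaurant', 'Farmers Market',
    ],
    'Cluster 4 (North West End / Stanley Park)': [
        'Park', 'Outdoor Sculpture', 'Trail', 'Garden', 'Yoga Studio',
        'Falafel Restaurant', 'Fish & Chips Shop', 'Financial or Legal Service',
        'Filipino Restaurant', 'Field',
    ],
}

# Assumption: which categories count as outdoor / bike friendly is a manual choice
OUTDOOR = {'Park', 'Trail', 'Garden', 'Field', 'Outdoor Sculpture'}

# Count how many of the top 10 categories fall in the outdoor set
outdoor_counts = {
    name: sum(venue in OUTDOOR for venue in venues)
    for name, venues in top_venues.items()
}
for name, count in outdoor_counts.items():
    print(f'{name}: {count} of 10 categories are outdoor')
# Cluster 2 (West Kitsilano / Jericho): 2 of 10 categories are outdoor
# Cluster 4 (North West End / Stanley Park): 5 of 10 categories are outdoor
```

A count like this should be read with care. The same run of categories (Falafel Restaurant, Fish & Chips Shop, Financial or Legal Service, Filipino Restaurant, Field) closes several of the top 10 lists, which may indicate ties among rarely occurring categories rather than venues that are actually common there.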

## 5. Conclusion <a name="conclusion"></a>

The main purpose of this project was to identify the optimal neighborhood or region for a stakeholder to open a branch of a bike rental company in Vancouver. The idea was to search for locations with parks and trails nearby, with easy access and good infrastructure. Through data analysis and machine learning we segmented the City of Vancouver into clusters, narrowing our focus to specific regions that match the stakeholder's demands. The final decision should be made by the stakeholder, considering our recommendation in section 4 and other aspects of each region that may appeal to the client.