### IBM Data Science Capstone Project 
#### This capstone project presents a detailed analysis to help prospective restaurant owners find an ideal neighborhood in a city for a new restaurant of a chosen style.
###### 1. The region analyzed in this project is York Region, Ontario, a suburban area within the Greater Toronto Area (GTA).
###### 2. The restaurant style is Noodle Bar.
###### Note: The method presented in this exercise can be applied to any city and any restaurant style.
#### Data to be used in this analysis:
##### 1. Neighborhood data (borough, neighborhood, geographical location, population), obtained by web scraping;
##### 2. The number of existing noodle bars in each neighborhood, obtained via the Foursquare API;
##### 3. Ethnic group data in percent, used to determine the Asian population share in each neighborhood (the Asian community is assumed here to be the main customer base for noodle shops).
#### Methods of Analysis:
##### 1. Get neighborhood data by web scraping and load it into a pandas dataframe;
##### 2. Query the Foursquare API (search for a specific venue category) for each neighborhood;
##### 3. Count the existing noodle bars within a fixed radius of each neighborhood center (the search cell below sets `radius = 15000`, i.e. 15 km);
##### 4. Scrape population data and the Asian ethnic group ratio (in percent) for each neighborhood from the web, where available;
##### 5. Rank all neighborhoods in ascending order of an indicator (number of existing noodle bars / Asian population ratio), so that the least-served neighborhoods come first;
##### 6. Pick the top 3 neighborhoods and visualize them geographically with Folium.
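
Step 5 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the neighborhood labels, counts and ratios are made-up values rather than results of this analysis, and the field names `noodle_bars` and `asian_ratio` as well as the helper `rank_neighborhoods` are hypothetical. The list is sorted with the lowest indicator first, matching the ascending sort used later in the notebook.

```python
def rank_neighborhoods(rows, top_n=3):
    """Rank neighborhoods by existing noodle bars per unit of Asian population ratio.

    Rows with a zero or missing ratio are skipped because the indicator
    is undefined for them.
    """
    scored = []
    for row in rows:
        ratio = row.get("asian_ratio")
        if not ratio:
            continue
        indicator = row["noodle_bars"] / ratio
        scored.append({**row, "indicator": indicator})
    # lowest indicator first: fewest competitors relative to the target population
    scored.sort(key=lambda r: r["indicator"])
    return scored[:top_n]


# Made-up illustration values, not results from this analysis
sample = [
    {"neighborhood": "A", "noodle_bars": 12, "asian_ratio": 0.40},
    {"neighborhood": "B", "noodle_bars": 3, "asian_ratio": 0.30},
    {"neighborhood": "C", "noodle_bars": 5, "asian_ratio": 0.10},
    {"neighborhood": "D", "noodle_bars": 2, "asian_ratio": 0.0},
]
top = rank_neighborhoods(sample)
print([r["neighborhood"] for r in top])
```

On a pandas dataframe the same ranking could be obtained by adding the indicator as a column and calling `sort_values` on it.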

## Import Libraries and packages

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # install geopy package
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # install folium (pinned to version 0.5.0)
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


In [2]:
### Scraping the Wikipedia page for the table of postal codes of Canada

# pd.read_html returns a list of dataframes, one per HTML table found on the page
raw_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_L')

raw_df



[                           0  \
 0               L1APort Hope   
 1       L1BBowmanville(East)   
 2       L1CBowmanville(West)   
 3   L1ECourtice(Bowmanville)   
 4         L1GOshawa(Central)   
 5       L1HOshawa(Southeast)   
 6       L1JOshawa(Southwest)   
 7            L1KOshawa(East)   
 8           L1LOshawa(North)   
 9           L1MWhitby(North)   
 10      L1NWhitby(Southeast)   
 11      L1PWhitby(Southwest)   
 12        L1RWhitby(Central)   
 13        L1SAjax(Southwest)   
 14        L1TAjax(Northwest)   
 15   L1VPickering(Southwest)   
 16       L1WPickering(South)   
 17     L1XPickering(Central)   
 18       L1YPickering(North)   
 19             L1ZAjax(East)   
 
                                                     1  \
 0                                        L2AFort Erie   
 1                                     L2BNot assigned   
 2                                     L2CNot assigned   
 3                           L2ENiagara Falls(Central)   
 4             

In [3]:
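### Read raw data into a one-column dataframe

# Define a dummy column name

column_names = ['temp_info']

# raw_df[0] is the postal-code grid; walk its 8 columns (labelled 0 to 7) and
# collect every cell. DataFrame.append was removed in pandas 2.0, so the cells
# are gathered in a list and the dataframe is built once.
cells = []
for i in range(0, 8):
    for data in raw_df[0][i]:
        cells.append(data)

temp_table = pd.DataFrame(cells, columns=column_names)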

In [4]:
### Take a look at what the dataframe looks like
temp_table.head()

Unnamed: 0,temp_info
0,L1APort Hope
1,L1BBowmanville(East)
2,L1CBowmanville(West)
3,L1ECourtice(Bowmanville)
4,L1GOshawa(Central)


In [5]:
### Dimension of the dataframe
temp_table.shape

(160, 1)

### Split one column to three columns for further processing:

In [6]:
temp_table['postal code'] = temp_table['temp_info'].str.slice(stop=3)

temp_table['temp_info'] = temp_table['temp_info'].str.slice(start=3)

temp_table.head(160)

Unnamed: 0,temp_info,postal code
0,Port Hope,L1A
1,Bowmanville(East),L1B
2,Bowmanville(West),L1C
3,Courtice(Bowmanville),L1E
4,Oshawa(Central),L1G
5,Oshawa(Southeast),L1H
6,Oshawa(Southwest),L1J
7,Oshawa(East),L1K
8,Oshawa(North),L1L
9,Whitby(North),L1M


In [7]:
### Get rid of brackets

temp_table[['Borough', 'Neighborhoods']] = temp_table.temp_info.str.split("(", expand=True)

In [8]:
temp_table

Unnamed: 0,temp_info,postal code,Borough,Neighborhoods
0,Port Hope,L1A,Port Hope,
1,Bowmanville(East),L1B,Bowmanville,East)
2,Bowmanville(West),L1C,Bowmanville,West)
3,Courtice(Bowmanville),L1E,Courtice,Bowmanville)
4,Oshawa(Central),L1G,Oshawa,Central)
5,Oshawa(Southeast),L1H,Oshawa,Southeast)
6,Oshawa(Southwest),L1J,Oshawa,Southwest)
7,Oshawa(East),L1K,Oshawa,East)
8,Oshawa(North),L1L,Oshawa,North)
9,Whitby(North),L1M,Whitby,North)


In [9]:
temp_table = temp_table.drop(['temp_info'], axis=1)
temp_table

Unnamed: 0,postal code,Borough,Neighborhoods
0,L1A,Port Hope,
1,L1B,Bowmanville,East)
2,L1C,Bowmanville,West)
3,L1E,Courtice,Bowmanville)
4,L1G,Oshawa,Central)
5,L1H,Oshawa,Southeast)
6,L1J,Oshawa,Southwest)
7,L1K,Oshawa,East)
8,L1L,Oshawa,North)
9,L1M,Whitby,North)


In [10]:
### Get rid of the closing bracket and "/"

# regex=False treats ')' and '/' as literal characters rather than regex patterns
temp_table['Neighborhoods'] = temp_table['Neighborhoods'].str.replace(')', '', regex=False)
temp_table['Neighborhoods'] = temp_table['Neighborhoods'].str.replace('/', ',', regex=False)

temp_table


Unnamed: 0,postal code,Borough,Neighborhoods
0,L1A,Port Hope,
1,L1B,Bowmanville,East
2,L1C,Bowmanville,West
3,L1E,Courtice,Bowmanville
4,L1G,Oshawa,Central
5,L1H,Oshawa,Southeast
6,L1J,Oshawa,Southwest
7,L1K,Oshawa,East
8,L1L,Oshawa,North
9,L1M,Whitby,North


In [11]:
cleaned_df = temp_table.rename(columns={"postal code":"PostalCode", "Neighborhoods":"Neighborhood"})

cleaned_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,L1A,Port Hope,
1,L1B,Bowmanville,East
2,L1C,Bowmanville,West
3,L1E,Courtice,Bowmanville
4,L1G,Oshawa,Central
5,L1H,Oshawa,Southeast
6,L1J,Oshawa,Southwest
7,L1K,Oshawa,East
8,L1L,Oshawa,North
9,L1M,Whitby,North
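
The slicing, splitting and bracket cleanup in cells 6 to 11 can also be expressed as a single regular expression. The sketch below is plain Python and only an illustration: `parse_cell` is a hypothetical helper, and it assumes every cell has the shape seen in the scraped table, namely a three-character postal code, a borough name, and an optional neighborhood in parentheses. When no neighborhood is given it falls back to the borough name, which is the result the later cells arrive at.

```python
import re

# three-character postal code, borough, optional "(neighborhood)"
CELL_PATTERN = re.compile(r"^(?P<code>.{3})(?P<borough>[^(]*)(?:\((?P<hood>.*)\))?$")


def parse_cell(cell):
    """Split one scraped cell into postal code, borough and neighborhood."""
    match = CELL_PATTERN.match(cell.strip())
    if match is None:
        raise ValueError("unexpected cell format: {!r}".format(cell))
    borough = match.group("borough").strip()
    hood = match.group("hood")
    # "/" separates several neighborhoods; the notebook turns it into ","
    hood = hood.replace("/", ",").strip() if hood else borough
    return {"PostalCode": match.group("code"), "Borough": borough, "Neighborhood": hood}


print(parse_cell("L1BBowmanville(East)"))
print(parse_cell("L1APort Hope"))
```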


In [12]:
### Drop rows with "Not assigned" Borough

# .copy() makes cleaned_df_1 an independent dataframe, so the assignment below does not raise SettingWithCopyWarning
cleaned_df_1 = cleaned_df[cleaned_df.Borough != 'Not assigned'].copy()

### Replace Neighborhood name with Borough name if Neighborhood name is "Not assigned" (missing values are handled in the following cells)

cleaned_df_1['Neighborhood'] = np.where(cleaned_df_1['Neighborhood']=='Not assigned', cleaned_df_1['Borough'], cleaned_df_1['Neighborhood'])


In [13]:
cleaned_df_1.head(100)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,L1A,Port Hope,
1,L1B,Bowmanville,East
2,L1C,Bowmanville,West
3,L1E,Courtice,Bowmanville
4,L1G,Oshawa,Central
5,L1H,Oshawa,Southeast
6,L1J,Oshawa,Southwest
7,L1K,Oshawa,East
8,L1L,Oshawa,North
9,L1M,Whitby,North


In [14]:
# mark missing neighborhoods with 0; assign returns a new dataframe, which avoids SettingWithCopyWarning
cleaned_df_1 = cleaned_df_1.assign(Neighborhood=cleaned_df_1['Neighborhood'].fillna(0))


In [15]:
cleaned_df_1

Unnamed: 0,PostalCode,Borough,Neighborhood
0,L1A,Port Hope,0
1,L1B,Bowmanville,East
2,L1C,Bowmanville,West
3,L1E,Courtice,Bowmanville
4,L1G,Oshawa,Central
5,L1H,Oshawa,Southeast
6,L1J,Oshawa,Southwest
7,L1K,Oshawa,East
8,L1L,Oshawa,North
9,L1M,Whitby,North


In [16]:
# replace the 0 markers with the borough name; assign returns a new dataframe, which avoids SettingWithCopyWarning
cleaned_df_1 = cleaned_df_1.assign(Neighborhood=np.where(cleaned_df_1['Neighborhood']==0, cleaned_df_1["Borough"], cleaned_df_1['Neighborhood']))


In [25]:
cleaned_df_1.head(160)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,L1A,Port Hope,Port Hope
1,L1B,Bowmanville,East
2,L1C,Bowmanville,West
3,L1E,Courtice,Bowmanville
4,L1G,Oshawa,Central
5,L1H,Oshawa,Southeast
6,L1J,Oshawa,Southwest
7,L1K,Oshawa,East
8,L1L,Oshawa,North
9,L1M,Whitby,North


In [18]:
cleaned_df_1.shape

(130, 3)

In [19]:
### Cleaned Borough and Neighborhood Data for further analysis

df_cleaned = cleaned_df_1

## Getting Geospatial Data

In [20]:
# Install the geocoder package

!conda install -c conda-forge geocoder --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [31]:
import geocoder # import geocoder (the lookups below use geopy's Nominatim)

locator = Nominatim(user_agent='myGeocoder')

count = 0

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood; the dataframe is built after the loop
# because DataFrame.append was removed in pandas 2.0
rows = []

for PostalCode, Borough, Neighborhood in zip(df_cleaned['PostalCode'],df_cleaned['Borough'], df_cleaned['Neighborhood']):

    # initialize the location variable to None
    location = None

    # loop until you get the coordinates; only the borough is geocoded,
    # so neighborhoods of the same borough share one coordinate pair
    while(location is None):
        location = locator.geocode('{}, Ontario'.format(Borough))

    # read the coordinates only after a successful lookup
    lat = location.latitude
    print('got lat!:', lat)
    lng = location.longitude
    print('got lng!:', lng)

    count += 1
    print(count)

    rows.append({'Borough': Borough,
                 'Neighborhood': Neighborhood,
                 'Latitude': lat,
                 'Longitude': lng})

# instantiate the dataframe
neighborhoods = pd.DataFrame(rows, columns=column_names)
    


43.9515755
-78.2939704
<class 'float'>
<class 'float'>
1
43.9122995
-78.6891675
<class 'float'>
<class 'float'>
2
43.9122995
-78.6891675
<class 'float'>
<class 'float'>
3
43.904861
-78.78831389119298
<class 'float'>
<class 'float'>
4
43.8975558
-78.8635324
<class 'float'>
<class 'float'>
5
43.8975558
-78.8635324
<class 'float'>
<class 'float'>
6
43.8975558
-78.8635324
<class 'float'>
<class 'float'>
7
43.8975558
-78.8635324
<class 'float'>
<class 'float'>
8
43.8975558
-78.8635324
<class 'float'>
<class 'float'>
9
43.87982
-78.9421751
<class 'float'>
<class 'float'>
10
43.87982
-78.9421751
<class 'float'>
<class 'float'>
11
43.87982
-78.9421751
<class 'float'>
<class 'float'>
12
43.87982
-78.9421751
<class 'float'>
<class 'float'>
13
43.8505287
-79.0208814
<class 'float'>
<class 'float'>
14
43.8505287
-79.0208814
<class 'float'>
<class 'float'>
15
43.835765
-79.090576
<class 'float'>
<class 'float'>
16
43.835765
-79.090576
<class 'float'>
<class 'float'>
17
43.835765
-79.090576
<class '

In [32]:
neighborhoods.head(5)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Port Hope,Port Hope,43.951575,-78.29397
1,Bowmanville,East,43.9123,-78.689167
2,Bowmanville,West,43.9123,-78.689167
3,Courtice,Bowmanville,43.904861,-78.788314
4,Oshawa,Central,43.897556,-78.863532
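
The geocoding loop above repeats the request for every neighborhood of the same borough and has no bounded fallback when a lookup returns nothing. A minimal sketch of a cached lookup with a limited number of retries follows. `geocode_cached` and its parameters are hypothetical; `geocode_fn` stands for any callable that returns an object with `latitude` and `longitude` attributes or `None`, such as the `locator.geocode` method used above. The demonstration uses a fake geocoder so that it runs without network access.

```python
import time
from collections import namedtuple


def geocode_cached(geocode_fn, query, cache, max_attempts=3, delay_seconds=1.0):
    """Return (lat, lng) for query, or None after max_attempts failed lookups."""
    if query in cache:
        return cache[query]
    result = None
    for attempt in range(max_attempts):
        location = geocode_fn(query)
        if location is not None:
            result = (location.latitude, location.longitude)
            break
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # pause before retrying
    cache[query] = result
    return result


# Fake geocoder standing in for locator.geocode
Point = namedtuple("Point", ["latitude", "longitude"])
calls = []


def fake_geocode(query):
    calls.append(query)
    return Point(43.8975558, -78.8635324) if query.startswith("Oshawa") else None


cache = {}
print(geocode_cached(fake_geocode, "Oshawa, Ontario", cache, delay_seconds=0))
print(geocode_cached(fake_geocode, "Oshawa, Ontario", cache, delay_seconds=0))  # served from the cache
print(len(calls))
```

Nominatim's public usage policy asks for at most one request per second, so a delay of about one second between attempts is a reasonable default against the real service.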


### Import Foursquare API Credentials:

In [50]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (redacted)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret (redacted)
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN' # your Foursquare Access Token (redacted)
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET:YOUR_CLIENT_SECRET
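
Hard-coded keys are easy to leak when a notebook is shared. One common alternative, sketched below, is to read them from environment variables. The variable names (`FOURSQUARE_CLIENT_ID` and so on) and the helper `load_credentials` are assumptions made for this example, not names required by Foursquare.

```python
import os

# Hypothetical environment variable names chosen for this example
REQUIRED_KEYS = (
    "FOURSQUARE_CLIENT_ID",
    "FOURSQUARE_CLIENT_SECRET",
    "FOURSQUARE_ACCESS_TOKEN",
)


def load_credentials(env=None):
    """Return the three credentials from env, raising KeyError if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


# Demonstration with a plain dict standing in for os.environ
demo_env = {key: "dummy-value" for key in REQUIRED_KEYS}
credentials = load_credentials(demo_env)
print(sorted(credentials))
```

In the notebook, `load_credentials()` without arguments would read from the real environment, and the values would not need to be printed.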


### Call Foursquare API to Search for a specific venue category 

In [58]:
# Search for noodle shops with keyword = 'Noodle'
search_query = 'Noodle'
# search radius in meters (15 km)
radius = 15000
print(search_query + ' .... OK!')

# define the new dataframe columns
column_names_1 = ['Borough', 'Neighborhood', 'Latitude', 'Longitude', 'Num_of_Noodle_Shop'] 

# instantiate the dataframe
neighborhoods_noodle_shops = pd.DataFrame(columns=column_names_1)

for Borough, Neighborhood, latitude, longitude in zip(neighborhoods['Borough'], neighborhoods['Neighborhood'], neighborhoods['Latitude'] ,neighborhoods['Longitude']):

    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude,ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)
    results = requests.get(url).json()

    # assign relevant part of JSON to venues; a KeyError here means the response
    # carried no venue list, for example because the daily quota was exceeded
    venues = results['response']['venues']

    # transform venues into a dataframe; each call returns at most LIMIT venues,
    # so the count below cannot exceed LIMIT
    dataframe = json_normalize(venues)
    num_of_noodle_shops = dataframe.shape[0]
    #print(num_of_noodle_shops)

    # add the row in place (DataFrame.append was removed in pandas 2.0)
    neighborhoods_noodle_shops.loc[len(neighborhoods_noodle_shops)] = [Borough, Neighborhood, latitude, longitude, num_of_noodle_shops]





Noodle .... OK!




KeyError: 'venues'

In [None]:
# The KeyError above was raised because the Foursquare quota for venue search calls was exhausted.
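
The `KeyError` above appears because the code indexes `results['response']['venues']` without checking whether the request succeeded. A defensive sketch follows. It assumes the response envelope carries a `meta` object with a numeric `code` next to `response`; `count_venues` is a hypothetical helper, and the two payloads are hand-written stand-ins rather than real API output.

```python
def count_venues(payload):
    """Return the number of venues in a search payload, or None if the call failed."""
    meta = payload.get("meta", {})
    venues = payload.get("response", {}).get("venues")
    if meta.get("code") != 200 or venues is None:
        return None  # lets the caller tell "no data" apart from "zero venues"
    return len(venues)


# Hand-written stand-ins for a successful and a failed response
ok_payload = {"meta": {"code": 200}, "response": {"venues": [{"name": "x"}, {"name": "y"}]}}
failed_payload = {"meta": {"code": 429, "errorType": "quota_exceeded"}, "response": {}}

print(count_venues(ok_payload))
print(count_venues(failed_payload))
```

Storing `None` rather than 0 for failed calls keeps a quota failure from being read as a neighborhood without noodle shops.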

In [59]:
neighborhoods_noodle_shops.head(130) 

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Num_of_Noodle_Shop
0,Port Hope,Port Hope,43.951575,-78.29397,0
1,Bowmanville,East,43.9123,-78.689167,0
2,Bowmanville,West,43.9123,-78.689167,0
3,Courtice,Bowmanville,43.904861,-78.788314,0
4,Oshawa,Central,43.897556,-78.863532,0
5,Oshawa,Southeast,43.897556,-78.863532,0
6,Oshawa,Southwest,43.897556,-78.863532,0
7,Oshawa,East,43.897556,-78.863532,0
8,Oshawa,North,43.897556,-78.863532,0
9,Whitby,North,43.87982,-78.942175,0


In [62]:
# sort the pandas dataframe by Num_of_Noodle_Shop in ascending order

neigborhoods_noodle_shops_sorted = neighborhoods_noodle_shops.sort_values(['Num_of_Noodle_Shop'], ascending=True)
neigborhoods_noodle_shops_sorted

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Num_of_Noodle_Shop
0,Port Hope,Port Hope,43.951575,-78.29397,0
35,Port Colborne,Port Colborne,42.886239,-79.25139,0
36,Grimsby,Grimsby,43.193209,-79.560692,0
54,Barrie,"North,East",44.389311,-79.690174,0
55,Barrie,"South,West",44.389311,-79.690174,0
56,Keswick,Keswick,44.239617,-79.468656,0
57,Midland,Midland,44.750147,-79.885712,0
12,Whitby,Central,43.87982,-78.942175,0
34,Welland,West,42.992218,-79.248419,0
10,Whitby,Southeast,43.87982,-78.942175,0


### Visualize the top 3 neighborhoods

In [63]:
neigborhoods_noodle_shops_sorted_top3 = neigborhoods_noodle_shops_sorted.head(3)

In [64]:
neigborhoods_noodle_shops_sorted_top3

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Num_of_Noodle_Shop
0,Port Hope,Port Hope,43.951575,-78.29397,0
35,Port Colborne,Port Colborne,42.886239,-79.25139,0
36,Grimsby,Grimsby,43.193209,-79.560692,0


### Getting the coordinates for the province of Ontario. 

In [66]:
address = 'Ontario, Canada'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude_o = location.latitude
longitude_o = location.longitude
print('The geographical coordinates of Ontario are {}, {}'.format(latitude_o, longitude_o))

The geographical coordinates of Ontario are 50.000678, -86.000977


### Visualization

In [67]:
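# zoom_start lowered from 10 to 5: the map is centered on Ontario as a whole, far from the markers in southern Ontario
ontario = folium.Map(location=[latitude_o, longitude_o], zoom_start=5)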

# add markers to map
for lat, lng, borough, neighborhood in zip(neigborhoods_noodle_shops_sorted_top3['Latitude'], neigborhoods_noodle_shops_sorted_top3['Longitude'],neigborhoods_noodle_shops_sorted_top3['Borough'], neigborhoods_noodle_shops_sorted_top3['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    # convert the string to html for label which works in folium 
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(ontario)  
    
ontario
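
The map above is centered on the geocoded position of Ontario as a whole, which lies far north of the three markers. One option is to fit the view to the markers instead. The helper below is a hypothetical plain-Python sketch that computes a bounding box from the top-3 coordinates listed earlier; the result has the `[[south, west], [north, east]]` shape that folium's `Map.fit_bounds` accepts.

```python
def bounding_box(points):
    """Return [[min_lat, min_lng], [max_lat, max_lng]] for (lat, lng) pairs."""
    lats = [lat for lat, _ in points]
    lngs = [lng for _, lng in points]
    return [[min(lats), min(lngs)], [max(lats), max(lngs)]]


# Coordinates of the three neighborhoods shown in the top-3 table
top3 = [(43.951575, -78.29397), (42.886239, -79.25139), (43.193209, -79.560692)]
bounds = bounding_box(top3)
print(bounds)
```

With the map object from the cell above, `ontario.fit_bounds(bounds)` would then zoom the view to these markers.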