# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a restaurant. Specifically, this report will be targeted to stakeholders interested in opening an **Chinese restaurant** in **Toronto**, Canada, the number of whom is increasing because of the fast growth of Chinese population in the city.

Since there are lots of restaurants in Toronto we will try to detect **locations that are not already crowded with restaurants**. We are also particularly interested in **areas with no or few Chinese restaurants in vicinity**. We would also prefer locations **as close to city center (downtown Toronto) as possible**, assuming that first two conditions are met.

We will use our data science powers to generate a few most promissing neighborhoods based on this criteria. Advantages of each area will then be clearly expressed so that best possible final location can be chosen by stakeholders.

## Data <a name="data"></a>

Based on definition of our problem, factors that will influence our decission are:
* number of existing restaurants in the neighborhood (any type of restaurant)
* number of and distance to Chinese restaurants in the neighborhood, if any
* review of the Chinese restaurants in the neighborhood, if any
* distance of neighborhood from city center

We decided to use definition from [this wikipedia link](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) to define our neighborhoods.

Following data sources will be needed to extract/generate the required information:
* number of restaurants and their type, review and location in every neighborhood will be obtained using **Foursquare API**
* coordinate of Toronto neighborhood will be obtained using a csv file input

### Neighborhood in Toronto

We download the neighborhood table from the wikipedia page, the table contains all neighborhoods in Canada, therefore we extract the ones only from Toronto, and attach the coordinates of the neighborhoods using a csv file.

In [21]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import urllib.request
import html5lib
import numpy as np
from bs4 import BeautifulSoup
import requests
from math import radians, sin, cos, acos

p_can_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
fp = urllib.request.urlopen(p_can_url)
p_can_html = fp.read()
p_can_html = p_can_html.decode("utf8")
fp.close()
p_can = BeautifulSoup(p_can_html)

df_p_can = []
for tr in p_can.table.find_all('tr'):
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    df_p_can.append(row)
df_p_can=pd.DataFrame(df_p_can,columns=['Postcode','Borough','Neighborhood'])
df_p_can=df_p_can[df_p_can['Borough']!='Not assigned']
df_p_can=df_p_can[df_p_can.Borough.notnull()]
df_p_can['Neighborhood']=df_p_can['Neighborhood'].replace({'\n':''}, regex=True)

df_p_can['Neighborhood'] = np.where(df_p_can['Neighborhood'] == 'Not assigned', df_p_can['Borough'], df_p_can['Neighborhood'])
df_p_can=df_p_can.groupby(['Postcode','Borough'])['Neighborhood'].apply(','.join).reset_index()
coord = pd.read_csv('http://cocl.us/Geospatial_data')
df_coord_can=pd.merge(df_p_can, coord, left_on='Postcode', right_on='Postal Code').drop(['Postal Code'], axis=1)
df_t=df_coord_can[df_coord_can['Borough'].isin(['Downtown Toronto','East Toronto','West Toronto','Central Toronto'])]
dist = []
for lat, lon in zip(df_t['Latitude'], df_t['Longitude']):
    dist_c = 6371.01 * acos(sin(lat)*sin(43.6532) + cos(lat)*cos(43.6532)*cos(lon - (-79.3832)))
    dist.append(dist_c)
df_t['Distance to Center'] = dist
df_t.reset_index(inplace = True, drop = True)
df_t.head(5)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Distance to Center
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,565.308076
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,251.919464
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,421.020998
3,M4M,East Toronto,Studio District,43.659526,-79.340923,258.326879
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,477.900296


The data looks good, now we have all the neighborhoods in Toronto and their coordinates with the distance to Toronto city center. Let's visualize the neighborhoods in the map of Toronto.

In [14]:
#!pip install folium

import folium

map_t = folium.Map(location=[43.6532, -79.3832], zoom_start=12)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_t['Latitude'], df_t['Longitude'], df_t['Borough'], df_t['Neighborhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_t)  
    
map_t

### Foursquare

Now that we have our neighborhoods data, we can use Foursquare API to get info on (Chinese) restaurants in each neighborhood.

From the API we will get a list of the restaurants, and their types, locations, reviews. And from the list we could feature the neighborhoods with the number of restaurants, the number of Chinese restaurants, average review of the restaurants, and along with their distance to the city center. Using the features, we can cluster the neighborhoods and identify the best (group of) neighborhoods for openning a Chinese restaurant.

In [3]:
CLIENT_ID = 'NDY0KYVXE2M1QNO1TEQZ3JOJZNG1031QQVXXYR1XX0LVHNO5'
CLIENT_SECRET = '3RQYFN53TWP4DJZXCGQAV1QNRHO2VEJQTFSEIYBFTBARWGZM'
VERSION = '20190401' # Foursquare API version

In [26]:
# Category IDs corresponding to Chinese restaurants were taken from Foursquare web site (https://developer.foursquare.com/docs/resources/categories):

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

chinese_restaurant_categories = ['4bf58dd8d48988d145941735','52af3a5e3cf9994f4e043bea','52af3a723cf9994f4e043bec',
                                 '52af3a7c3cf9994f4e043bed','58daa1558bbb0b01f18ec1d3','52af3a673cf9994f4e043beb',
                                 '52af3a903cf9994f4e043bee','4bf58dd8d48988d1f5931735','52af3a9f3cf9994f4e043bef',
                                 '52af3aaa3cf9994f4e043bf0','52af3ab53cf9994f4e043bf1','52af3abe3cf9994f4e043bf2',
                                 '52af3ac83cf9994f4e043bf3','52af3ad23cf9994f4e043bf4','52af3add3cf9994f4e043bf5',
                                 '52af3af23cf9994f4e043bf7','52af3ae63cf9994f4e043bf6','52af3afc3cf9994f4e043bf8',
                                 '52af3b053cf9994f4e043bf9','52af3b213cf9994f4e043bfa','52af3b293cf9994f4e043bfb',
                                 '52af3b343cf9994f4e043bfc','52af3b3b3cf9994f4e043bfd','52af3b463cf9994f4e043bfe',
                                 '52af3b633cf9994f4e043c01','52af3b513cf9994f4e043bff','52af3b593cf9994f4e043c00',
                                 '52af3b6e3cf9994f4e043c02','52af3b773cf9994f4e043c03','52af3b813cf9994f4e043c04',
                                 '52af3b893cf9994f4e043c05','52af3b913cf9994f4e043c06','52af3b9a3cf9994f4e043c07',
                                 '52af3ba23cf9994f4e043c08']

def is_chinese_restaurant(categories, specific_filter=None):
    specific = False
    for c in categories:
        category_id = c[1]
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
    return specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, VERSION, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   item['venue']['location'],
                   item['venue']['location']['distance']) for item in results]        
    except:
        venues = []
    return venues


def get_restaurants(lats, lons, neighborhoods):
    restaurants = {}
    all_chinese_restaurants = {}
    location_restaurants = []
    location_chinese_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon, neighborhood in zip(lats, lons, neighborhoods):
        venues = get_venues_near_location(lat, lon, food_category, CLIENT_ID, CLIENT_SECRET, radius=600, limit=100)
        area_restaurants = []
        chinese_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_neighborhood = neighborhood
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_chinese = is_chinese_restaurant(venue_categories, specific_filter=chinese_restaurant_categories)
            restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_chinese)
            if venue_distance<=500:
                area_restaurants.append(restaurant)
            restaurants[venue_id] = restaurant
            if is_chinese:
                all_chinese_restaurants[venue_id] = restaurant
                chinese_restaurants.append(restaurant)
        location_restaurants.append(area_restaurants)
        location_chinese_restaurants.append(chinese_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, all_chinese_restaurants, location_restaurants, location_chinese_restaurants

restaurants, all_chinese_restaurants, location_restaurants, location_chinese_restaurants = get_restaurants(df_t['Latitude'], df_t['Longitude'], df_t['Neighborhood'])


Obtaining venues around candidate locations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


In [27]:
print('Total number of restaurants:', len(restaurants))
print('Total number of Chinese restaurants:', len(all_chinese_restaurants))
print('Percentage of Chinese restaurants: {:.2f}%'.format(len(all_chinese_restaurants) / len(restaurants) * 100))
print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

Total number of restaurants: 1171
Total number of Chinese restaurants: 36
Percentage of Chinese restaurants: 3.07%
Average number of restaurants in neighborhood: 31.157894736842106


There are only 2.82% Chinese restaurant in general, which already indicates a good opportunity. Let's now see all the collected restaurants in our area of interest on map, and let's also show Chinese restaurants in different color.

In [6]:
map_t = folium.Map(location=[43.6532, -79.3832], zoom_start=12)
folium.Marker([43.6532, -79.3832], popup='Toronto').add_to(map_t)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_chinese = res[6]
    color = 'red' if is_chinese else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_t)
map_t

## Methodology <a name="methodology"></a>

In this project we will try to identify neighborhoods in Toronto that have low restaurant density, especially those with low Chinese restaurant density. We use neighborhoods definition from the wikipedia page listed in the previous section.

We have already imported the data needed for the project from two sources: wikipedia page and foursquare API. The data contain the names and locations (coordinates) of the neighborhoods and the restaurants within them.

In the following parts, we will explore the restaurant density across neighborhoods in Toronto and deep dive into the ones with low restaurant density to find the most promising neighborhoods for openning a new Chinese restaurant for our stakeholders. We will use both a rule based method and k-means clustering to filter out the neighborhoods.

Finnally, we will summarize the analysis and present actionable conclusions for the stakeholders.

## Analysis <a name="analysis"></a>

In [28]:
location_restaurants_count = [len(res) for res in location_restaurants]
df_t['Restaurants in area'] = location_restaurants_count
print('Average number of restaurants in every area with radius=500m:', np.array(location_restaurants_count).mean())

Average number of restaurants in every area with radius=500m: 31.157894736842106


In [29]:
location_chinese_restaurants_count = [len(res) for res in location_chinese_restaurants]
df_t['Chinese restaurants in area'] = location_chinese_restaurants_count
print('Average number of restaurants in every area with radius=500m:', np.array(location_chinese_restaurants_count).mean())

Average number of restaurants in every area with radius=500m: 1.1578947368421053


In [30]:
distances_to_chinese_restaurant = []

for lat, lon in zip(df_t['Latitude'], df_t['Longitude']):
    min_distance = 10000
    for res in all_chinese_restaurants.values():
        lat_c = res[2]
        lon_c = res[3]
        dist = 6371.01 * acos(sin(lat)*sin(lat_c) + cos(lat)*cos(lat_c)*cos(lon - lon_c))
        if dist<min_distance:
            min_distance = dist
    distances_to_chinese_restaurant.append(min_distance)

df_t['Distance to Chinese restaurant'] = distances_to_chinese_restaurant

In [31]:
df_t.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Distance to Center,Chinese restaurants in area,Restaurants in area,Distance to Chinese restaurant
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,565.308076,0,2,301.350432
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,251.919464,0,36,129.380301
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,421.020998,0,11,156.765835
3,M4M,East Toronto,Studio District,43.659526,-79.340923,258.326879,1,26,12.801723
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,477.900296,1,1,34.403163
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197,381.783753,0,6,67.786976
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,419.14765,1,8,33.612505
7,M4S,Central Toronto,Davisville,43.704324,-79.38879,327.481919,1,33,16.957813
8,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316,231.741157,0,1,63.956403
9,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049,234.959186,2,7,37.721665


In [32]:
print('Average distance to closest Chinese restaurant from each area center:', df_t['Distance to Chinese restaurant'].mean())

Average distance to closest Chinese restaurant from each area center: 66.59376987701742


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.scatter(df_t['Distance to Chinese restaurant'], df_t['Distance to Center'], s=df_t['Restaurants in are'], c=df_t['Chinese restaurants in area'], cmap="Blues", alpha=0.4, edgecolors="grey", linewidth=2)
 
# Add titles (main and on axis)
plt.xlabel("Distance to Chinese restaurant")
plt.ylabel("Distance to Center")
plt.title("Battle of the Neighborhood - Chinese restaurants")
 
plt.show()


## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>