# Applied Data Science Capstone - The Battle of Neighborhoods (Week 1)
## Where to Open a New East Asian Restaurant in Toronto

This notebook completes the Week 1 requirements of the capstone project for the Coursera course Applied Data Science Capstone. The topic is **the battle of neighborhoods**.

First, let us import the libraries we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

import json # library to handle JSON files
import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed
#!conda install -c conda-forge folium --yes
import folium # map rendering library

## Table of contents
Currently this notebook includes only the Introduction and Data sections required in Week 1.
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

Canada is a country of immigrants, and its large cities are home to many people who came from abroad. Toronto is one example: about 46 percent of its population are immigrants, according to public data. Besides helping to develop the country's economy, immigrants bring their own traditions and cultures, and food is among the most visible of these. As a result, many local restaurants offer dishes that originate in other countries. For example, you can come across Japanese restaurants in Toronto.
As the immigrant population continues to grow, new restaurants keep opening. We are therefore interested in finding a good area of Toronto in which to launch a new East Asian restaurant. The three main countries in East Asia are China, Korea, and Japan. This report should be especially useful for stakeholders interested in opening a new East Asian restaurant in Toronto.

## Data <a name="data"></a>

We need neighborhood information for Toronto.
We can load the table into a dataframe from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

We then need to clean the data into a well-structured neighbourhood dataframe:
1. Convert the "Not assigned" entries in the table to NaN in the dataframe.
2. Drop the rows containing NaN values and reset the index.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
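# Read the table that contains 'Postal Code'; treat "Not assigned" cells as NaN
dfs = pd.read_html(url, match = 'Postal Code', na_values = ['Not assigned'])
df = dfs[0]
# Drop rows with missing values, then renumber the index from 0
df.dropna(axis = 0, inplace = True)
df.reset_index(drop = True, inplace = True)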
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
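
The two cleaning steps can also be tried on a small in-memory table, without fetching the Wikipedia page. The sketch below is only an illustration: the rows in `raw` are made-up sample values in the page's column layout, and it uses `DataFrame.replace` instead of the `na_values` argument of `pd.read_html`, so it is not the notebook's actual loading code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows in the same layout as the Wikipedia table
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Step 1: turn the "Not assigned" placeholder into NaN
clean = raw.replace("Not assigned", np.nan)

# Step 2: drop rows with any NaN and renumber the index from 0
clean = clean.dropna(axis=0).reset_index(drop=True)

print(clean)
```

Only the two fully assigned rows should remain, indexed 0 and 1.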


Now that we have a dataframe with the postal code, borough name, and neighborhood name of each neighborhood, we need the latitude and longitude of each one in order to use the Foursquare location data.
We will load the geographical coordinates of each postal code from the CSV file at http://cocl.us/Geospatial_data

In [3]:
csv = 'http://cocl.us/Geospatial_data'
df_coor = pd.read_csv(csv)
df_coor

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437
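
Before combining the two tables, it can be worth confirming that every postal code in the neighbourhood table has a row in the coordinate table, since an inner merge silently drops unmatched codes. A minimal sketch with made-up frames (the names `hoods` and `coords` and their contents are illustrative stand-ins, not the notebook's data):

```python
import pandas as pd

# Hypothetical stand-ins for df and df_coor
hoods = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M5A"]})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Postal codes that have no coordinates and would be lost in an inner merge
missing = sorted(set(hoods["Postal Code"]) - set(coords["Postal Code"]))
print(missing)
```

An empty list would mean the merge keeps every neighbourhood row.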


Next, we merge the coordinate dataframe with the neighborhood dataframe. For consistency, we rename the column "Neighbourhood" to "Neighborhood". The result contains 103 distinct regions by postal code, although some of them share a neighborhood name while having different coordinates.

In [4]:
# 'Postal Code' is the only column the two frames share, so it is the join key
df_toronto_neighborhood = df.merge(df_coor, on = 'Postal Code')
df_toronto_neighborhood.rename(columns = {"Neighbourhood": "Neighborhood"}, inplace = True)
print(df_toronto_neighborhood.nunique())
df_toronto_neighborhood

Postal Code     103
Borough          10
Neighborhood     99
Latitude        103
Longitude        75
dtype: int64
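
The gap between 103 postal codes and 99 distinct neighborhood names means that a few names repeat. One way to list such rows is `Series.duplicated` with `keep=False`, sketched here on made-up rows (the frame `sample` and its values are purely illustrative):

```python
import pandas as pd

# Hypothetical rows: two postal codes share one neighborhood name
sample = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A"],
    "Neighborhood": ["Alpha", "Beta", "Alpha"],
})

# keep=False marks every row whose name occurs more than once
repeated = sample[sample["Neighborhood"].duplicated(keep=False)]
print(repeated)
```

Applied to `df_toronto_neighborhood`, the same expression would show which postal codes share a name.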


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
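
Called without a key, `DataFrame.merge` joins on the columns the two frames have in common and performs an inner join. A slightly more defensive variant names the key and asks pandas to verify that it is unique on both sides. The sketch below uses small made-up frames (`left` and `right` are hypothetical names; the coordinates are copied from the table above):

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})
right = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# validate="one_to_one" raises MergeError if a postal code repeats on either side
merged = left.merge(right, on="Postal Code", how="inner", validate="one_to_one")
merged = merged.rename(columns={"Neighbourhood": "Neighborhood"})
print(merged)
```

The row order follows the left frame, so each neighborhood picks up its own coordinates even though `right` lists the codes in a different order.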


With `df_toronto_neighborhood` storing the neighborhood information of Toronto, we will couple it with the Foursquare location data to analyze the distribution of East Asian restaurants in Toronto.