

<table>
    <tr> 
        <td colspan=2><h1 align=center><font size = 5>SIMILARITY AND DISSIMILARITY BETWEEN DOWNTOWN NEW YORK AND DOWNTOWN TORONTO</font></h1></td>
    </tr>
    <tr>
        <td><img src = "https://cdn.pixabay.com/photo/2020/06/06/06/58/new-york-5265458_960_720.jpg" width = 670></td>
        <td><img src = "https://cdn.pixabay.com/photo/2012/02/25/19/06/downtown-16916__340.jpg" width = 600></td>
    </tr>
    <tr>
        <td><h2 align=center><font size = 5>Downtown New York City</font></h2></td>
        <td><h2 align=center><font size = 5>Downtown Toronto City</font></h2></td>
    </tr>       

</table>
 




## Analytic approach

With the business problem clearly stated, the analytic approach step expresses it in terms of statistical and machine-learning techniques, so that the stakeholders can identify the technique best suited to the desired outcome.
In this case study the goal is to group neighborhoods by similarity, which makes it a clustering problem. The model chosen for the geographical data is k-means.
Data requirements: the selected data must be geographical data for New York and Toronto.

### How data will be used to solve the problem
- The data will be used to analyse the neighborhoods and build the model, so we need extensive data on different neighborhoods.
- The machine learning model should be able to group similar neighborhoods into clusters.
- To build a good model, the dataset should be rich, with many observations (rows) covering a variety of neighborhoods.

#### Here are the steps that we will follow:
- Data wrangling: identify and handle missing values, then standardize and normalize the data.
- Exploratory data analysis: analyze the neighborhoods using visualization and descriptive statistics.
- Model development: a k-means model will be built to assign neighborhoods to clusters. The model will help us understand the relationship between the two cities.
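
To make the model-development step more concrete, here is a minimal sketch of k-means (Lloyd's algorithm) written with NumPy. The `kmeans` helper, the toy `profiles` array and its values are illustrative assumptions, not the model built in this notebook; in practice a library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # start from k distinct observations chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # squared distance of every point to every centroid
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its members (keep it if empty)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# toy neighborhood profiles (share of cafes, share of parks); invented values
profiles = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

The first two rows should end up in one cluster and the last two in the other. Which cluster receives which label number depends on the random initialisation.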


# Data description

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#item1">Download and Explore Datasets</a>

2.  <a href="#item2">Explore Neighborhoods</a>

    </font>
    </div>


#### Import all the dependencies.


In [27]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)


print('Libraries imported.')

Libraries imported.


<a id='item1'></a>


## 1. Download and Explore Datasets


### 1.1 New York
New York City has a total of 5 boroughs and 306 neighborhoods. To segment the neighborhoods and explore them, especially those in the Manhattan borough, I need a dataset that contains the 5 boroughs, the neighborhoods in each borough, and the latitude and longitude coordinates of each neighborhood.

The New York dataset is available at [https://geo.nyu.edu/catalog/nyu_2451_34572](https://geo.nyu.edu/catalog/nyu_2451_34572).


In [28]:
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

Data downloaded!


#### Load and explore New York's data


In [29]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Notice how all the relevant data is in the _features_ key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.


In [30]:
neighborhoods_data = newyork_data['features']

Let's take a look at the first item in this list.


In [31]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
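
The record above stores its coordinates as `[longitude, latitude]`, which is easy to read the wrong way round. The following sketch, using a trimmed copy of that record and a hypothetical helper named `feature_to_row`, shows one way to pull out the four fields this notebook needs.

```python
# A single feature, trimmed to the fields used in this notebook
feature = {
    'geometry': {'type': 'Point',
                 'coordinates': [-73.84720052054902, 40.89470517661]},
    'properties': {'name': 'Wakefield', 'borough': 'Bronx'},
}

def feature_to_row(feature):
    """Flatten one feature into a dict with borough, name and coordinates."""
    lon, lat = feature['geometry']['coordinates']  # order is [longitude, latitude]
    return {
        'Borough': feature['properties']['borough'],
        'Neighborhood': feature['properties']['name'],
        'Latitude': lat,
        'Longitude': lon,
    }

row = feature_to_row(feature)
print(row)
```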

#### Let's transform the nested Python dictionaries into a _pandas_ dataframe.


In [32]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.


In [33]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the data, collect one row per neighborhood, and fill the dataframe.


In [34]:
# collect one dict per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
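
The same table can also be built without an explicit loop. The sketch below is one possible alternative, using `pd.json_normalize` on two records shaped like the entries of `newyork_data['features']` (values copied from the first rows shown in this notebook). The variable names `features`, `flat` and `table` are illustrative.

```python
import pandas as pd

# two records shaped like the entries of newyork_data['features']
features = [
    {'geometry': {'coordinates': [-73.847201, 40.894705]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'geometry': {'coordinates': [-73.829939, 40.874294]},
     'properties': {'name': 'Co-op City', 'borough': 'Bronx'}},
]

flat = pd.json_normalize(features)  # nested keys become dotted column names

table = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    # each coordinates cell is a [longitude, latitude] list
    'Latitude': flat['geometry.coordinates'].map(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].map(lambda c: c[0]),
})
print(table)
```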

Quickly examine the resulting dataframe.


In [35]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


The dataset has all 5 boroughs and 306 neighborhoods.


In [36]:
print("New York's dataframe has {} boroughs and {} neighborhoods.".format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

New York's dataframe has 5 boroughs and 306 neighborhoods.


#### Use geopy library to get the latitude and longitude values of New York City.


In order to define an instance of the geocoder, let's define a user_agent. The name of the agent is <em>ny_explorer</em>, as shown below.


In [37]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


So let's slice the original dataframe and create a new dataframe of the Manhattan data.

In [38]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


Let's get the geographical coordinates of Manhattan.


In [39]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.
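
The geocoding cells repeat the same steps, and geopy's `geocode` returns `None` when the service finds no match, which would make `location.latitude` fail. A small wrapper is one way to handle both points. In this sketch `get_coordinates`, `FakeLocation` and `fake_geocode` are hypothetical names, and the fake geocoder only exists so the example runs offline.

```python
def get_coordinates(address, geocode):
    """Return (latitude, longitude) for an address, or None when nothing matches.

    `geocode` is any callable shaped like geopy's Nominatim.geocode.
    """
    location = geocode(address)
    if location is None:  # no match found for this address
        return None
    return location.latitude, location.longitude


# Stand-in geocoder so the sketch runs offline; with geopy one would pass
# Nominatim(user_agent="ny_explorer").geocode instead.
class FakeLocation:
    latitude, longitude = 40.7896239, -73.9598939

def fake_geocode(address):
    return FakeLocation() if address == 'Manhattan, NY' else None

print(get_coordinates('Manhattan, NY', fake_geocode))
print(get_coordinates('Nowhere', fake_geocode))
```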


### 1.2 Download and explore Toronto dataset

In [40]:
# website url
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Extract table data
dft = pd.read_html(url)

# Get first table
df = dft[0]

# Extract the Postal Code, Borough, and Neighbourhood columns and rename them
# (renaming the selection directly avoids pandas' SettingWithCopyWarning)
df2 = df[['Postal Code','Borough', 'Neighbourhood']].rename(
    columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'})

# Print data shape
print('Dataframe shape is:', df2.shape)

# Print first data
df2.head(12)

Dataframe shape is: (180, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


#### Drop rows whose borough is Not assigned.

In [41]:
# is_assigned is a boolean Series: True where the borough is assigned
is_assigned = df2['Borough'] != 'Not assigned'

# Keep only the rows with an assigned borough
df_assigned = df2[is_assigned].reset_index(drop=True)

print("Dataframe shape after dropping rows with an unassigned borough is:", df_assigned.shape)

#df_assigned
df_assigned.head(12)

Dataframe shape after dropping rows with an unassigned borough is: (103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


#### Replace a not assigned Neighborhood with its Borough in Toronto

In [45]:
# Identify rows whose neighborhood is not assigned
is_notassigned_ngbhd = df_assigned['Neighborhood'] == 'Not assigned'

# Count them
print("The number of not assigned neighborhoods in Toronto is:", is_notassigned_ngbhd.sum())


# Replace a not assigned neighborhood with its borough
df_assigned.loc[is_notassigned_ngbhd, 'Neighborhood'] = df_assigned.loc[is_notassigned_ngbhd, 'Borough']

The number of not assigned neighborhoods in Toronto is: 0
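
Because the real table currently contains no unassigned neighborhoods, the replacement step has no visible effect. The sketch below runs both cleaning steps on a tiny invented frame (the rows in `toy` are made up for illustration) so that the behaviour can be seen.

```python
import pandas as pd

# invented rows: one unassigned borough, one assigned borough without a neighborhood
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# step 1: keep only rows whose borough is assigned
assigned = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# step 2: where the neighborhood is not assigned, fall back to the borough name
missing = assigned['Neighborhood'] == 'Not assigned'
assigned.loc[missing, 'Neighborhood'] = assigned.loc[missing, 'Borough']

print(assigned)
```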


In [46]:
# Print the shape of the dataframe
print("The shape of Toronto's city dataframe is: ", df_assigned.shape )

The shape of Toronto's city dataframe is:  (103, 3)


#### Let's create a dataframe with latitude and longitude added

In [47]:
# url for longitude and latitude reference file
url = 'https://cocl.us/Geospatial_data'
 
# read csv file    
df_longlat = pd.read_csv(url)
print("longitude and latitude dataframe columns are:", df_longlat.columns)

# Rename columns 
df_longlat.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

# join the longlat dataframe and the neighborhood dataframe on their shared PostalCode column
neighborhoods_T = pd.merge(df_assigned, df_longlat, on='PostalCode')

#print head data
neighborhoods_T.head(12)

longitude and latitude dataframe columns are: Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


In [48]:
# Print the shape of the dataframe
print("The shape of the dataframe is: ", neighborhoods_T.shape )

The shape of the dataframe is:  (103, 5)
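
The merged frame has the same 103 rows as before, which suggests every postal code found its coordinates. One way to check this explicitly is a left join with pandas' `indicator` and `validate` options. The frames `codes` and `coords` below are invented miniatures, and the postal code `M9Z` is a made-up value that deliberately has no match.

```python
import pandas as pd

# invented miniature versions of the two tables being joined
codes = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                      'Borough': ['North York', 'North York', 'Etobicoke']})
coords = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# a left join keeps every postal code; `indicator` records where each row was found,
# and `validate` raises if a postal code appears more than once on either side
merged = codes.merge(coords, on='PostalCode', how='left',
                     validate='one_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that no neighborhood was silently left without coordinates.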


#### Let's now explore the Manhattan and Downtown Toronto neighborhoods.


<a id='item2'></a>
## 2. Explore Neighborhoods

We are going to start utilizing the Foursquare API to explore the neighborhoods.


#### Define Foursquare Credentials and Version


In [49]:
import os

# Read the credentials from environment variables rather than hard-coding them in the notebook
# (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are placeholders)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Credentials loaded.')

Credentials loaded.


### 2.1 Explore Neighborhoods in Downtown New York (Manhattan)

#### Let's create a function that uses Foursquare to retrieve the venues near each neighborhood in Manhattan


In [50]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
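
The function above indexes straight into the response, so it would raise an exception if a request failed (no `groups` key) or if a venue came back with an empty `categories` list. Assuming such responses can occur, the sketch below shows a more defensive way to parse one response. `extract_venues` and the hand-written `payload` are hypothetical and only mirror the nesting that `getNearbyVenues` relies on.

```python
def extract_venues(payload, name, lat, lng):
    """Turn one parsed response into row tuples, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # a venue may carry no category; record None instead of failing
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# hand-written payload mimicking the nesting used by getNearbyVenues
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 1.0, 'lng': 2.0},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Mystery B', 'location': {'lat': 1.1, 'lng': 2.1},
               'categories': []}},
]}]}}

rows = extract_venues(payload, 'Toy Hill', 1.0, 2.0)
print(rows)
print(extract_venues({'response': {}}, 'Toy Hill', 1.0, 2.0))  # payload without groups
```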

#### Now let's run the above function on each neighborhood and create a new dataframe called _manhattan_venues_.


In [51]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


#### Let's check the size of the resulting dataframe


In [55]:
print("Shape of Manhattan venues", manhattan_venues.shape)
manhattan_venues.head()

Shape of Manhattan venues (3211, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
1,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop
4,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop


#### There are 3211 venues in Downtown New York (Manhattan)

Let's check how many venues were returned for each neighborhood of Manhattan.


In [56]:
manhattan_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Battery Park City,66,66,66,66,66,66
Carnegie Hill,89,89,89,89,89,89
Central Harlem,45,45,45,45,45,45
Chelsea,100,100,100,100,100,100
Chinatown,100,100,100,100,100,100
Civic Center,100,100,100,100,100,100
Clinton,100,100,100,100,100,100
East Harlem,40,40,40,40,40,40
East Village,100,100,100,100,100,100
Financial District,100,100,100,100,100,100


#### Let's find out how many unique categories can be curated from all the returned venues of Manhattan

In [58]:
print('There are {} unique categories in Downtown New York.'.format(len(manhattan_venues['Venue Category'].unique())))

There are 321 unique categories in Downtown New York.


### 2.2 Explore Neighborhoods in Downtown Toronto

In [59]:
toronto_data = neighborhoods_T[neighborhoods_T['Borough']== 'Downtown Toronto'].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


In [60]:
address = 'Downtown Toronto'

geolocator = Nominatim(user_agent="tt_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


#### Retrieve the venues near each neighborhood and create a new dataframe called toronto_venues.

In [61]:

toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town, Cabbagetown
First Canadian Place, Underground city
Church and Wellesley


#### Let's check the size of the resulting dataframe

In [62]:

print(toronto_venues.shape)
toronto_venues.head()

(1248, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


#### There are 1248 venues in Downtown Toronto, compared with 3211 venues in Downtown New York (Manhattan)

#### Let's check how many venues were returned for each neighborhood in Downtown Toronto

In [63]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,55,55,55,55,55,55
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,68,68,68,68,68,68
Christie,16,16,16,16,16,16
Church and Wellesley,75,75,75,75,75,75
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
"First Canadian Place, Underground city",100,100,100,100,100,100
"Garden District, Ryerson",100,100,100,100,100,100
"Harbourfront East, Union Station, Toronto Islands",100,100,100,100,100,100
"Kensington Market, Chinatown, Grange Park",74,74,74,74,74,74


#### Let's find out how many unique categories can be curated from all the returned venues

In [64]:

print('There are {} unique categories in Downtown Toronto.'.format(len(toronto_venues['Venue Category'].unique())))

There are 213 unique categories in Downtown Toronto.


#### There are 213 unique categories in Downtown Toronto and 321 unique categories in Downtown New York (Manhattan).
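
The two category counts alone do not say how much the cities have in common. As one possible follow-up, the sketch below compares two category collections by their shared entries and their Jaccard similarity (shared categories divided by all distinct categories). The helper `category_overlap` and the short category lists are illustrative assumptions; on the real data one would pass the `Venue Category` columns of `manhattan_venues` and `toronto_venues`.

```python
def category_overlap(categories_a, categories_b):
    """Return (shared categories, Jaccard similarity) for two category collections."""
    a, b = set(categories_a), set(categories_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# short invented lists standing in for the 'Venue Category' columns
manhattan_categories = ['Coffee Shop', 'Pizza Place', 'Yoga Studio', 'Diner']
toronto_categories = ['Coffee Shop', 'Bakery', 'Spa', 'Diner']

shared, jaccard = category_overlap(manhattan_categories, toronto_categories)
print(sorted(shared), round(jaccard, 2))
```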

## Thanks!
