# Moving from New York to Toronto

## 1. Background

Imagine you live in bustling New York with your family and, although you are doing fine in the Big Apple, you receive an irresistible job offer from a great company in Toronto, Canada. New York offers its residents many opportunities, but Canada is a very open-minded and welcoming country, particularly when it comes to public health care.

Although the two cities are not too far apart (about 8 hours by car), moving to another city with the whole family is a big change and requires a lot of planning. After the initial excitement and celebrations, you and your significant other have many decisions to make. Choosing which Toronto neighborhood to move to is undoubtedly one of the biggest.
To soften the impact of the change, you decide to explore Toronto and look for a neighborhood as similar as possible to the one you lived in in New York.

This work simulates that analysis: it explores the neighborhoods of both cities and groups them by the similarity of the venues found there.

## 2. Data

- For both Toronto and New York, we will collect neighborhood data, including the coordinates of each neighborhood.
- Using the Foursquare API, we will collect the venues in each neighborhood.
- With the venue data in place, we will use the KMeans algorithm to identify which neighborhoods are most similar across the two cities.

In [168]:
import folium
import pandas as pd
import os
import requests
import seaborn as sns

from IPython.display import display, HTML
from sklearn.cluster import KMeans

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Collecting New York's neighborhoods data

In [22]:
ny_geospatial = requests.get(
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
    "IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json"
).json()

In [81]:
# build the dataframe with each neighborhood's name and coordinates
# (DataFrame.append was removed in pandas 2.0, so collect the rows first)
newyork = pd.DataFrame(
    [
        {
            "Neighborhood": nb['properties']['name'],
            # GeoJSON stores coordinates as [longitude, latitude]
            "Latitude": nb['geometry']['coordinates'][1],
            "Longitude": nb['geometry']['coordinates'][0],
        }
        for nb in ny_geospatial['features']
    ],
    columns=['Neighborhood', 'Latitude', 'Longitude'],
)

In [84]:
# Optional. To save time on downloading, we can store/retrieve the nb dataframe
# newyork.to_csv("newyork_nbs.csv", index=False)
# newyork = pd.read_csv("newyork_nbs.csv")

newyork.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Wakefield,40.894705,-73.847201
1,Co-op City,40.874294,-73.829939
2,Eastchester,40.887556,-73.827806
3,Fieldston,40.895437,-73.905643
4,Riverdale,40.890834,-73.912585


### Collecting Toronto's neighborhoods data

In [14]:
toronto_nb = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M",
    match="Neighbourhood"
)[0]
toronto_nb.columns = ["Postal Code", "Borough", "Neighborhood"]

def fix_nb(entry):
    # fall back to the borough name when no neighborhood is assigned
    if entry["Neighborhood"] == "Not assigned":
        entry["Neighborhood"] = entry["Borough"]
    return entry

# ignore 'Not assigned' boroughs
toronto = toronto_nb[toronto_nb['Borough'] != "Not assigned"]

# Fix 'Not assigned' neighborhoods
toronto = toronto.apply(fix_nb, axis=1)

# Check that each postal code appears only once
num_postal_codes = toronto['Postal Code'].nunique()
if num_postal_codes == toronto.shape[0]:
    print('All postal codes in the dataframe contain a single neighborhood')
else:
    print('There is more than one neighborhood for some postal codes')

# add the coordinates of each postal code, then keep only the neighborhood columns
toronto_geospatial = pd.read_csv("https://cocl.us/Geospatial_data")
toronto = pd.merge(toronto, toronto_geospatial, on='Postal Code').drop(columns=['Postal Code', 'Borough'])


All postal codes in the dataframe contain a single neighborhood



In [19]:
# Optional. To save time on downloading, we can store/retrieve the nb dataframe
# toronto.to_csv("toronto_nbs.csv", index=False)
# toronto = pd.read_csv("toronto_nbs.csv")

toronto.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Parkwoods,43.753259,-79.329656
1,Victoria Village,43.725882,-79.315572
2,"Regent Park, Harbourfront",43.65426,-79.360636
3,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


### Displaying neighborhoods on maps

In [85]:
def create_map(df):
    lat_avg, lng_avg = df['Latitude'].mean(), df['Longitude'].mean()
    return folium.Map(location=[lat_avg, lng_avg], zoom_start=10)

def add_markers_to_map(entry, city_map):
    label = folium.Popup(f"{entry['Neighborhood']}", parse_html=True)
    folium.CircleMarker(
        [entry["Latitude"], entry["Longitude"]],
        radius=5,
        popup=label,
        color='#7a1c1c',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.6,
        parse_html=False
    ).add_to(city_map)  

In [169]:
newyork_map = create_map(newyork)
newyork.apply(add_markers_to_map, axis=1, args=(newyork_map,))

toronto_map = create_map(toronto)
toronto.apply(add_markers_to_map, axis=1, args=(toronto_map,))

def display_maps(maps):
    # render each folium map into its own iframe so the maps sit side by side
    base_html = '<iframe srcdoc="{}" style="display:inline-block; width: calc(50% - 6px); height: 500px; border: 0; margin: 0" ></iframe>'
    # `m` avoids shadowing the built-in `map`
    html = "\n".join([base_html.format(m.get_root().render().replace('"', '&quot;')) for m in maps])
    display(HTML(html))

In [170]:
display_maps([newyork_map, toronto_map])

In [88]:
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET")
VERSION = "20180605"

def make_search_url(lat, lng, radius=500, limit=100):
    return (
        f'https://api.foursquare.com/v2/venues/explore?'
        f'client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}'
        f'&ll={lat},{lng}&radius={radius}&limit={limit}'
    )

def get_venues_from_neighborhood(entry):
    url = make_search_url(entry["Latitude"], entry["Longitude"])
    results = requests.get(url).json()["response"].get("groups", [{}])[0].get("items", [])
    if not results:
        print(f"Could not find venues for {entry['Neighborhood']}")
    return [(
        entry["Neighborhood"],
        entry["Latitude"], 
        entry["Longitude"], 
        v['venue']['name'], 
        v['venue']['location']['lat'],
        v['venue']['location']['lng'],
        v['venue']['categories'][0]['name']) for v in results
    ]

def flatten_venues_list(venues_lists):
    # flatten the list of per-neighborhood venue lists into a single dataframe
    venues = pd.DataFrame([venue for sublist in venues_lists for venue in sublist])
    venues.columns = [
        'Neighborhood', 
        'Neighborhood Latitude', 
        'Neighborhood Longitude', 
        'Venue', 
        'Venue Latitude', 
        'Venue Longitude', 
        'Venue Category'
    ]
    return venues
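
The parsing step above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that comes back without any category. The following is a minimal offline sketch of a more defensive parser, not the notebook's actual code: the helper name `parse_explore_response`, the `"Uncategorized"` placeholder, and the sample payload are all assumptions made for illustration. The payload only mimics the nesting that `get_venues_from_neighborhood` reads.

```python
def parse_explore_response(payload, neighborhood):
    """Return (neighborhood, name, lat, lng, category) tuples from an explore-style payload.

    Hypothetical helper: tolerates a missing "groups" key and venues without categories.
    """
    groups = payload.get("response", {}).get("groups") or [{}]
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories") or []
        # "Uncategorized" is an illustrative placeholder, not a Foursquare category
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((
            neighborhood,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Made-up payload for illustration only; the second venue has no category
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Rite Aid",
                           "location": {"lat": 40.896649, "lng": -73.844846},
                           "categories": [{"name": "Pharmacy"}]}},
                {"venue": {"name": "Unnamed Spot",
                           "location": {"lat": 40.89, "lng": -73.84},
                           "categories": []}},
            ]
        }]
    }
}

sample_rows = parse_explore_response(sample_payload, "Wakefield")
empty_rows = parse_explore_response({"response": {}}, "Upper Rouge")
print(sample_rows)
print(empty_rows)
```

A neighborhood with no venues simply yields an empty list, which matches how the "Could not find venues" cases are handled in the notebook.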

In [89]:
# request venues for all of New York's neighborhoods
newyork_venues_lists = newyork.apply(get_venues_from_neighborhood, axis=1)
newyork_venues = flatten_venues_list(newyork_venues_lists)

Could not find venues for Schuylerville
Could not find venues for North Corona
Could not find venues for Stapleton
Could not find venues for New Lots
Could not find venues for Concourse Village
Could not find venues for Sutton Place


In [173]:
# Optional. To save time on downloading, we can store/retrieve the venues dataframe
# newyork_venues.to_csv("newyork_venues.csv", index=False)
# newyork_venues = pd.read_csv("newyork_venues.csv")
print(newyork_venues.shape)
newyork_venues.head()

(9891, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
2,Wakefield,40.894705,-73.847201,Walgreens,40.896528,-73.8447,Pharmacy
3,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
4,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop


In [43]:
# request venues for all of Toronto's neighborhoods
toronto_venues_lists = toronto.apply(get_venues_from_neighborhood, axis=1)
toronto_venues = flatten_venues_list(toronto_venues_lists)

Could not find venues for Islington Avenue, Humber Valley Village
Could not find venues for West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Could not find venues for Upper Rouge


In [175]:
# Optional. To save time on downloading, we can store/retrieve the venues dataframe
# toronto_venues.to_csv("toronto_venues.csv", index=False)
# toronto_venues = pd.read_csv("toronto_venues.csv")
print(toronto_venues.shape)
toronto_venues.head()

(2120, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.753259,-79.329656,TTC stop - 44 Valley Woods,43.755402,-79.333741,Bus Stop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


In [176]:
# TODO: drop Neighborhoods without any venues
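
One possible way to handle this TODO is sketched below. It assumes the intent is to remove, from the neighborhoods dataframes (`newyork`, `toronto`), the entries for which no venues were returned. The helper name `drop_neighborhoods_without_venues` is hypothetical, and the toy dataframes only stand in for the real ones (the coordinates of the third row are placeholders).

```python
import pandas as pd


def drop_neighborhoods_without_venues(neighborhoods, venues):
    """Keep only the neighborhoods that appear at least once in the venues dataframe."""
    has_venues = neighborhoods["Neighborhood"].isin(venues["Neighborhood"])
    return neighborhoods[has_venues].reset_index(drop=True)


# Toy stand-in for the `toronto` dataframe; the last row has placeholder coordinates
toy_neighborhoods = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Victoria Village", "Upper Rouge"],
    "Latitude": [43.753259, 43.725882, 0.0],
    "Longitude": [-79.329656, -79.315572, 0.0],
})

# Toy stand-in for `toronto_venues`; "Upper Rouge" has no rows here
toy_venues = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village"],
    "Venue": ["Brookbanks Park", "Variety Store", "Tim Hortons"],
})

kept = drop_neighborhoods_without_venues(toy_neighborhoods, toy_venues)
print(kept)
```

Note that the flattened venues dataframes already contain no rows for such neighborhoods, so this filter only matters where the neighborhoods dataframes are reused later, for example when plotting clusters on the maps.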

In [174]:
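def get_categories_dummies(df):
    # one hot encoding of the venue categories
    df_dummies = pd.get_dummies(df['Venue Category'], prefix="", prefix_sep="", dtype=float)

    # a venue category can itself be named "Neighborhood"; drop it so it
    # cannot clash with the neighborhood name column added below
    df_dummies = df_dummies.drop(columns="Neighborhood", errors="ignore")

    # put the neighborhood name in the first column
    df_dummies.insert(0, "Neighborhood", df["Neighborhood"])

    # average the dummies to get each category's share of venues per neighborhood
    df_dummies = df_dummies.groupby('Neighborhood').mean().reset_index()
    return df_dummies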
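
To make the one-hot-and-average step concrete, here is a small worked example on made-up data. The neighborhoods "A" and "B" and their venues are purely illustrative. Each row of the result holds the share of a neighborhood's venues that falls into each category, so every row sums to 1.

```python
import pandas as pd

# Made-up venues: "A" has four venues, "B" has two
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Coffee Shop", "Coffee Shop", "Bakery", "Park", "Park"],
})

# One indicator column per category, one row per venue
dummies = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
dummies.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# Averaging the indicators per neighborhood gives category frequencies
freq = dummies.groupby("Neighborhood").mean().reset_index()
print(freq)
```

In this toy case, half of the venues in "A" are coffee shops, while every venue in "B" is a park.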

In [177]:
# TODO: continue...
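
The clustering step announced in the Data section is not written yet. The sketch below shows one way it might look, under several assumptions: the per-city frequency tables are stacked into a single table, categories missing in one city are filled with 0, and the number of clusters (2 here) is arbitrary. The toy tables, the `City` and `Cluster` columns, and the helper `similar_toronto_neighborhoods` are hypothetical names introduced for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frequency tables standing in for the output of get_categories_dummies
ny_freq = pd.DataFrame({
    "Neighborhood": ["NY-1", "NY-2"],
    "Coffee Shop": [0.8, 0.1],
    "Park": [0.2, 0.9],
})
to_freq = pd.DataFrame({
    "Neighborhood": ["TO-1", "TO-2"],
    "Coffee Shop": [0.7, 0.0],
    "Hockey Arena": [0.1, 0.1],
    "Park": [0.2, 0.9],
})

# Tag each row with its city, then stack both tables;
# a category absent from one city gets frequency 0 there
ny_freq.insert(0, "City", "New York")
to_freq.insert(0, "City", "Toronto")
combined = pd.concat([ny_freq, to_freq], ignore_index=True).fillna(0.0)

# Cluster on the category frequencies only (n_clusters=2 is an arbitrary choice)
features = combined.drop(columns=["City", "Neighborhood"])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
combined["Cluster"] = kmeans.labels_


def similar_toronto_neighborhoods(clustered, ny_name):
    """Return the Toronto neighborhoods in the same cluster as a New York neighborhood."""
    cluster = clustered.loc[clustered["Neighborhood"] == ny_name, "Cluster"].iloc[0]
    mask = (clustered["City"] == "Toronto") & (clustered["Cluster"] == cluster)
    return clustered.loc[mask, "Neighborhood"].tolist()


matches = similar_toronto_neighborhoods(combined, "NY-1")
print(matches)
```

Clustering both cities in one table is what lets a New York neighborhood and a Toronto neighborhood end up with the same label. With real data, the number of clusters would need to be chosen with more care, for instance by comparing inertia across several values.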