# Compare Global Cities: New York City, Toronto, London, and Paris

## 1. Objective

New York City, Toronto, London, and Paris are all global cities. I have never been to New York City or London, but I am quite familiar with Paris and Shanghai. I am going to compare these four cities to find their similarities and differences. I expect the comparison would help me enjoy New York City and London more if I get the chance to travel there.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## 2. Overall look

#### First, let's compare the four cities at the city-wide level.

I collected the data from the Wikipedia pages below:
1. New York City: https://en.wikipedia.org/wiki/New_York_City; https://en.wikipedia.org/wiki/Neighborhoods_in_New_York_City
2. Toronto: https://en.wikipedia.org/wiki/Toronto; https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto
3. London: https://en.wikipedia.org/wiki/London; https://en.wikipedia.org/wiki/List_of_London_boroughs; https://en.wikipedia.org/wiki/London_postal_district
4. Paris: https://en.wikipedia.org/wiki/Paris; https://en.wikipedia.org/wiki/Arrondissements_of_Paris

In [3]:
column_names=['City', 'Population (million)', 'Area (km2)', 'Sub-district numbers', 'Sub-sub-district numbers']
# One row per city; None marks a level that is not defined for that city.
rows=[
    ['New York City', 20.0, 784, 5, 57],
    ['Toronto', 2.7, 630, 9, 140],
    ['London', 8.8, 1572, 33, None],
    ['Paris', 2.1, 105, 20, None],
]
cities=pd.DataFrame(rows, columns=column_names)
cities

| | City | Population (million) | Area (km2) | Sub-district numbers | Sub-sub-district numbers |
|---|---|---|---|---|---|
| 0 | New York City | 20.0 | 784 | 5 | 57.0 |
| 1 | Toronto | 2.7 | 630 | 9 | 140.0 |
| 2 | London | 8.8 | 1572 | 33 | |
| 3 | Paris | 2.1 | 105 | 20 | |


1. New York City has 5 *boroughs* and 57 *communities*. Each community contains a number of *neighborhoods*, but their names and borders are not officially defined and change from time to time.
2. Toronto has 140 officially recognized *neighbourhoods* in 9 *districts/boroughs*.
3. London has the largest area. It has 32 *boroughs* plus the City of London, so we can count 33 boroughs.
4. Paris has only 20 *arrondissements*, which are somewhat like the boroughs of New York City or London but much smaller, and it is hard to define smaller units below the arrondissement.

But analyzing nearby venue categories at this level makes little sense, because the areas of the smallest units vary widely from city to city.
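To make the size mismatch concrete, here is a minimal sketch that divides each city's area by the number of its smallest listed units. The figures are copied from the table above; the column name `Mean unit area (km2)` and the choice to fall back to the sub-district count for London and Paris are my own assumptions for illustration.

```python
import pandas as pd

# Figures copied from the city-wide table above.
cities = pd.DataFrame(
    [
        ['New York City', 784, 5, 57],
        ['Toronto', 630, 9, 140],
        ['London', 1572, 33, None],
        ['Paris', 105, 20, None],
    ],
    columns=['City', 'Area (km2)', 'Sub-district numbers', 'Sub-sub-district numbers'],
)

# Use the finest level available for each city: the sub-sub-district count
# where it exists, otherwise the sub-district count (assumption).
finest = pd.to_numeric(cities['Sub-sub-district numbers'])
finest = finest.fillna(cities['Sub-district numbers'])

# Hypothetical helper column: average area of one smallest unit.
cities['Mean unit area (km2)'] = (cities['Area (km2)'] / finest).round(2)
print(cities[['City', 'Mean unit area (km2)']])
```

With these numbers the average unit ranges from 4.5 km² for a Toronto neighbourhood to about 47.6 km² for a London borough, roughly a tenfold difference.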

#### Next, let's compare the four cities' center parts.

In [4]:
column_names=['City', 'Borough / District', 'Area (km2)', 'Sub-district numbers']
# One row per city center.
rows=[
    ['New York City', 'Manhattan', 59, 40],
    ['Toronto', 'Old Toronto', 97, 38],
    ['London', 'Inner London', 319, 183],
    ['Paris', 'Paris', 105, 22],
]
city_centers=pd.DataFrame(rows, columns=column_names)
city_centers

| | City | Borough / District | Area (km2) | Sub-district numbers |
|---|---|---|---|---|
| 0 | New York City | Manhattan | 59 | 40 |
| 1 | Toronto | Old Toronto | 97 | 38 |
| 2 | London | Inner London | 319 | 183 |
| 3 | Paris | Paris | 105 | 22 |


Paris is small, so I take the whole of Paris as its *center part*. Venue analysis makes more sense when it is based on these center parts and their sub-districts.
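A quick way to check that the center parts are more comparable is to compute the average sub-district area for each of them. This is only a sketch using the numbers from the table above; the column name `Mean sub-district area (km2)` and the `spread` ratio are introduced here for illustration.

```python
import pandas as pd

# Figures copied from the city-center table above.
city_centers = pd.DataFrame(
    [
        ['New York City', 'Manhattan', 59, 40],
        ['Toronto', 'Old Toronto', 97, 38],
        ['London', 'Inner London', 319, 183],
        ['Paris', 'Paris', 105, 22],
    ],
    columns=['City', 'Borough / District', 'Area (km2)', 'Sub-district numbers'],
)

# Hypothetical helper column: average area of one sub-district.
mean_area = city_centers['Area (km2)'] / city_centers['Sub-district numbers']
city_centers['Mean sub-district area (km2)'] = mean_area

# Ratio between the coarsest and the finest city; a value closer to 1
# means the units are more comparable.
spread = mean_area.max() / mean_area.min()
print(city_centers[['City', 'Mean sub-district area (km2)']])
print('Largest / smallest mean area:', round(spread, 2))
```

With these figures the means fall between roughly 1.5 and 4.8 km², so the sub-districts of the four center parts are of a broadly similar scale.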

#### Based on these small districts, how can I find their similarities?

## 3. Each city's data

#### 3.1 Get New York City's data from the earlier course

We learned how to get Manhattan's neighborhood data earlier in the course. Repeating that process in the New York notebook "Newyork.ipynb" gives a dataset like the one below.

In [7]:
# @hidden_cell
df_newyork=pd.read_csv("manhattan_neighborhoods.csv", index_col=0)
df_newyork['City']='New York City'
ny_col=df_newyork.columns
ny_col=[ny_col[-1]]+list(ny_col[:-1])
df_newyork=df_newyork[ny_col]
df_newyork

| | City | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | New York City | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | New York City | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | New York City | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | New York City | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | New York City | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |
| 5 | New York City | Manhattan | Manhattanville | 40.816934 | -73.957385 |
| 6 | New York City | Manhattan | Central Harlem | 40.815976 | -73.943211 |
| 7 | New York City | Manhattan | East Harlem | 40.792249 | -73.944182 |
| 8 | New York City | Manhattan | Upper East Side | 40.775639 | -73.960508 |
| 9 | New York City | Manhattan | Yorkville | 40.77593 | -73.947118 |
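The loading cell appends a `City` column and then reorders the columns so that it comes first. As an alternative sketch, `DataFrame.insert` can place the column at position 0 in one step. The small frame below is a stand-in built from two rows of the table above, not the real CSV file.

```python
import pandas as pd

# Toy stand-in for a per-city CSV file (two rows copied from the table above).
df = pd.DataFrame({
    'Borough': ['Manhattan', 'Manhattan'],
    'Neighborhood': ['Marble Hill', 'Chinatown'],
    'Latitude': [40.876551, 40.715618],
    'Longitude': [-73.91066, -73.994279],
})

# Insert the constant 'City' column directly at position 0,
# so no separate column reordering is needed.
df.insert(0, 'City', 'New York City')
print(df.columns.tolist())
```

The same call would work for the Toronto, London, and Paris frames, with only the city name changed.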


#### 3.2 Get Toronto's data from the earlier assignment

We already got Toronto's neighbourhood data in an earlier assignment. Repeating that process in the Toronto notebook "Toronto.ipynb" gives a dataset like the one below.

In [12]:
# @hidden_cell
df_toronto=pd.read_csv("old_toronto_neighbourhoods.csv", index_col=0)
df_toronto=df_toronto.reset_index(drop=True)
df_toronto['City']='Toronto'
tr_col=df_toronto.columns
tr_col=[tr_col[-1]]+list(tr_col[:-1])
df_toronto=df_toronto[tr_col]
df_toronto

| | City | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | Toronto | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | Toronto | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 2 | Toronto | M4L | East Toronto | The Beaches West, India Bazaar | 43.668999 | -79.315572 |
| 3 | Toronto | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | Toronto | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
| 5 | Toronto | M4P | Central Toronto | Davisville North | 43.712751 | -79.390197 |
| 6 | Toronto | M4R | Central Toronto | North Toronto West | 43.715383 | -79.405678 |
| 7 | Toronto | M4S | Central Toronto | Davisville | 43.704324 | -79.38879 |
| 8 | Toronto | M4T | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.38316 |
| 9 | Toronto | M4V | Central Toronto | Deer Park, Forest Hill SE, Rathnelly, South Hi... | 43.686412 | -79.400049 |


#### 3.3 Get London's data

1. https://en.wikipedia.org/wiki/London
2. https://en.wikipedia.org/wiki/List_of_London_boroughs
3. https://en.wikipedia.org/wiki/London_postal_district
4. https://www.doogal.co.uk/london_postcodes.php

The links above explain London's boroughs and postcode districts. After extracting, transforming, and loading (ETL) the data, I get a dataset like this.

In [13]:
# @hidden_cell
df_london=pd.read_csv("london_postcode_districts.csv", index_col=0)
df_london['City']='London'
ld_col=df_london.columns
ld_col=[ld_col[-1]]+list(ld_col[:-1])
df_london=df_london[ld_col]
# Report how many postcode districts were loaded.
print('London points:', df_london.shape[0])
df_london

London points: 183


| | City | Postcode area | Postcode area name | Postcode district | District name | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | London | E | Eastern | E1 | Head district | 51.517389 | -0.059507 |
| 1 | London | E | Eastern | E10 | Leyton | 51.567991 | -0.014095 |
| 2 | London | E | Eastern | E11 | Leytonstone | 51.568739 | 0.013628 |
| 3 | London | E | Eastern | E12 | Manor Park | 51.550735 | 0.053017 |
| 4 | London | E | Eastern | E13 | Plaistow | 51.52774 | 0.026491 |
| 5 | London | E | Eastern | E14 | Poplar | 51.50608 | -0.018702 |
| 6 | London | E | Eastern | E15 | Stratford | 51.540511 | 0.003748 |
| 7 | London | E | Eastern | E16 | Victoria Docks and North Woolwich | 51.510764 | 0.028954 |
| 8 | London | E | Eastern | E17 | Walthamstow | 51.58693 | -0.020629 |
| 9 | London | E | Eastern | E18 | Woodford and South Woodford | 51.592747 | 0.024935 |
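The ETL step itself is not shown in this notebook. One possible approach, sketched here under stated assumptions, is to start from postcode-level rows, derive the postcode district and postcode area from each postcode, and average the coordinates per district. The input column names (`Postcode`, `Latitude`, `Longitude`), the sample postcodes, and their coordinates are all illustrative and not taken from the real download.

```python
import pandas as pd

# Hypothetical postcode-level rows; names and coordinates are illustrative only.
postcodes = pd.DataFrame({
    'Postcode': ['E1 0AA', 'E1 9ZZ', 'E10 5AB', 'N1 1AA'],
    'Latitude': [51.51, 51.52, 51.57, 51.54],
    'Longitude': [-0.06, -0.05, -0.01, -0.10],
})

# Treat the outward code (the text before the space) as the postcode district,
# and its leading letters as the postcode area (assumption).
postcodes['Postcode district'] = postcodes['Postcode'].str.split(' ').str[0]
postcodes['Postcode area'] = postcodes['Postcode district'].str.extract(
    r'^([A-Z]+)', expand=False
)

# One row per district, located at the mean of its postcodes.
districts = (
    postcodes
    .groupby(['Postcode area', 'Postcode district'], as_index=False)[['Latitude', 'Longitude']]
    .mean()
)
print(districts)
```

Averaging gives one representative point per district, which matches the one-row-per-district shape of the dataset above, although the actual coordinates there may have been obtained differently.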


#### 3.4 Get Paris's data

Paris has 20 arrondissements. Each arrondissement's information is linked from this web page: http://zip-code.en.mapawi.com/france/1/arrondissement-de-paris/3/150/. Scraping those pages gives a dataset like this.

In [79]:
# @hidden_cell
df_paris=pd.read_csv("paris_arrondissements.csv", index_col=0)
df_paris['City']='Paris'
pr_col=df_paris.columns
pr_col=[pr_col[-1]]+list(pr_col[:-1])
df_paris=df_paris[pr_col]
# Report how many rows were loaded.
print('Paris points:', df_paris.shape[0])
df_paris

Paris points: 22


| | City | Arrondissement | Name | Zip code | Latitude | Longitude | Area (km2) |
|---|---|---|---|---|---|---|---|
| 0 | Paris | 1st | Louvre | 75001 | 48.8592 | 2.3417 | 1.83 |
| 1 | Paris | 2nd | Bourse | 75002 | 48.8655 | 2.3426 | 0.99 |
| 2 | Paris | 3rd | Temple | 75003 | 48.8637 | 2.3615 | 1.17 |
| 3 | Paris | 4th | Hôtel-de-Ville | 75004 | 48.8601 | 2.3507 | 1.6 |
| 4 | Paris | 5th | Panthéon | 75005 | 48.8448 | 2.3471 | 2.54 |
| 5 | Paris | 6th | Luxembourg | 75006 | 48.8493 | 2.33 | 2.15 |
| 6 | Paris | 7th | Palais-Bourbon | 75007 | 48.8565 | 2.321 | 4.09 |
| 7 | Paris | 8th | Élysée | 75008 | 48.8763 | 2.3183 | 3.88 |
| 8 | Paris | 9th | Opéra | 75009 | 48.8718 | 2.3399 | 2.18 |
| 9 | Paris | 10th | Entrepôt | 75010 | 48.8709 | 2.3561 | 2.89 |


#### 3.5 Aggregate all cities' data into one dataset.

In [80]:
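# @hidden_cell
cols=['City', 'District', 'Sub-district', 'Latitude', 'Longitude']

# Select the matching columns from each city, copy them so the source
# frames stay untouched, and give them common names.
df1=df_newyork[['City', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']].copy()
df1.columns=cols

df2=df_toronto[['City', 'Borough', 'PostalCode', 'Latitude', 'Longitude']].copy()
df2.columns=cols

df3=df_london[['City', 'Postcode area', 'Postcode district', 'Latitude', 'Longitude']].copy()
df3.columns=cols

df4=df_paris[['City', 'Arrondissement', 'Name', 'Latitude', 'Longitude']].copy()
df4.columns=cols

# ignore_index=True gives the combined frame a single running index.
all_districts=pd.concat([df1,df2,df3,df4], ignore_index=True)
all_districts.head()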

| | City | District | Sub-district | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | New York City | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | New York City | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | New York City | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | New York City | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | New York City | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


In [81]:
# @hidden_cell
all_districts.shape

(283, 5)
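The total of 283 rows should equal the sum of the per-city sub-district counts reported earlier (40, 38, 183, and 22). A small sanity check along these lines could confirm that no rows were lost or duplicated during concatenation. The frame below is a stand-in that carries only the `City` column, since the real CSV files are not available here.

```python
import pandas as pd

# Row counts per city as reported earlier in this notebook.
expected = {'New York City': 40, 'Toronto': 38, 'London': 183, 'Paris': 22}

# Stand-in for all_districts containing only the 'City' column.
all_districts = pd.DataFrame(
    {'City': [city for city, n in expected.items() for _ in range(n)]}
)

# Count rows per city and compare with the expected figures.
counts = all_districts['City'].value_counts()
assert counts.to_dict() == expected
assert len(all_districts) == sum(expected.values())
print(counts)
```

On the real `all_districts` frame, only the `value_counts()` call and the two assertions would be needed.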