# Capstone Project - The Battle of Neighborhoods in Shanghai

Xiaoyue Zou

20/07/2020

In this notebook, I cluster the districts (neighborhoods) of Shanghai according to their most frequent venue categories, retrieved through the Foursquare API. The district table is first cleaned and enriched with geographic coordinates. The venue categories are then one-hot encoded to build the feature matrix, and K-Means clustering is applied because it is simple and fast to run. The result is five clusters of Shanghai districts, intended to offer insights about the city to both visitors and locals.

## 1. Import libraries

In [1]:
#!pip install beautifulsoup4
#!pip install lxml
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images and HTML
from IPython.display import Image, HTML, display_html

# transforming JSON data into a pandas dataframe
from pandas import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Libraries imported.')

Libraries imported.


## 2. Build a dataframe containing Shanghai district spatial information

In [2]:
# Scraping the Wikipedia page for Shanghai districts
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_administrative_divisions_of_Shanghai')[3]

In [3]:
df

Unnamed: 0_level_0,Unnamed: 0_level_0,County Level,County Level,County Level,County Level,County Level,County Level,County Level,County Level
Unnamed: 0_level_1,Unnamed: 0_level_1.1,Name,Chinese,Hanyu Pinyin,Division code[2],Division code[2].1,Area (km²)[3],Population (2018 census)[3],Density (/km²)
0,,Huangpu District[4](City seat),黄浦区,Huángpǔ Qū,310101,HGP,20.46,653800,31955
1,,Xuhui District,徐汇区,Xúhuì Qū,310104,XHI,54.76,1084400,19803
2,,Changning District,长宁区,Chángníng Qū,310105,CNQ,38.3,694000,18120
3,,Jing'an District,静安区,Jìng'ān Qū,310106,JAQ,36.88,1062800,28818
4,,Putuo District,普陀区,Pǔtuó Qū,310107,PTQ,54.83,1281900,23380
5,,Hongkou District,虹口区,Hóngkǒu Qū,310109,HKQ,23.48,797000,33944
6,,Yangpu District,杨浦区,Yángpǔ Qū,310110,YPU,60.73,1312700,21615
7,,Pudong New Area,浦东新区,Pǔdōng Xīnqū,310115,PDX,1210.41,5550200,4585
8,,Minhang District,闵行区,Mǐnháng Qū,310112,MHQ,370.75,2543500,6860
9,,Baoshan District,宝山区,Bǎoshān Qū,310113,BAO,270.99,2042300,7536


In [5]:
# delete irrelevant columns
df = df.drop(df.columns[[0,3,4,5]], axis=1)

# rename the columns
df.columns = ['Neighborhood', 'Chinese', 'Area', 'Population', 'Density']
df.head()

Unnamed: 0,Neighborhood,Chinese,Area,Population,Density
0,Huangpu District[4](City seat),黄浦区,20.46,653800,31955
1,Xuhui District,徐汇区,54.76,1084400,19803
2,Changning District,长宁区,38.3,694000,18120
3,Jing'an District,静安区,36.88,1062800,28818
4,Putuo District,普陀区,54.83,1281900,23380


In [6]:
# modify the district name
df['Neighborhood'] = df['Neighborhood'].replace(['Huangpu District[4](City seat)'],'Huangpu District')
df.head()

Unnamed: 0,Neighborhood,Chinese,Area,Population,Density
0,Huangpu District,黄浦区,20.46,653800,31955
1,Xuhui District,徐汇区,54.76,1084400,19803
2,Changning District,长宁区,38.3,694000,18120
3,Jing'an District,静安区,36.88,1062800,28818
4,Putuo District,普陀区,54.83,1281900,23380


In [7]:
# add geographical coordinates of the Shanghai districts (Nominatim is already imported in section 1)
geolocator = Nominatim(user_agent="Shanghai_explorer")

df['Major_Dist_Coord']= df['Chinese'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df[['Latitude', 'Longitude']] = df['Major_Dist_Coord'].apply(pd.Series)

df.drop(['Major_Dist_Coord'], axis=1, inplace=True)
df

Unnamed: 0,Neighborhood,Chinese,Area,Population,Density,Latitude,Longitude
0,Huangpu District,黄浦区,20.46,653800,31955,31.233593,121.479864
1,Xuhui District,徐汇区,54.76,1084400,19803,31.163698,121.427994
2,Changning District,长宁区,38.3,694000,18120,31.209276,121.389986
3,Jing'an District,静安区,36.88,1062800,28818,31.229776,121.44306
4,Putuo District,普陀区,54.83,1281900,23380,31.251326,121.391229
5,Hongkou District,虹口区,23.48,797000,33944,31.266703,121.501751
6,Yangpu District,杨浦区,60.73,1312700,21615,31.262011,121.52143
7,Pudong New Area,浦东新区,1210.41,5550200,4585,31.221783,121.53874
8,Minhang District,闵行区,370.75,2543500,6860,31.114767,121.376943
9,Baoshan District,宝山区,270.99,2042300,7536,31.406634,121.485158
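
The `apply(lambda x: (x.latitude, x.longitude))` call above raises an `AttributeError` if the geocoder returns `None` for a name it cannot resolve. A minimal sketch of a None-safe variant follows. The helper name `coords_or_nan` and the stand-in result object are illustrative assumptions, not part of the original notebook.

```python
import math
from types import SimpleNamespace

def coords_or_nan(location):
    """Return (latitude, longitude), or (nan, nan) when geocoding found nothing."""
    if location is None:
        return (math.nan, math.nan)
    return (location.latitude, location.longitude)

# Stand-in for a geopy Location object (hypothetical values, no network call)
fake_hit = SimpleNamespace(latitude=31.23, longitude=121.47)

print(coords_or_nan(fake_hit))  # (31.23, 121.47)
print(coords_or_nan(None))      # (nan, nan)
```

In the notebook this could replace the lambda, for example `df['Chinese'].apply(geolocator.geocode).apply(coords_or_nan)`, so that unresolved districts show up as NaN rows instead of stopping the cell.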


In [8]:
df.shape

(16, 7)

## 3. Explore and cluster the districts in Shanghai

In [9]:
# get the geographical coordinates of Shanghai
address = 'Shanghai'

geolocator = Nominatim(user_agent="Shanghai_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Shanghai are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Shanghai are 31.2252985, 121.4890497.


In [10]:
# create map of Shanghai using latitude and longitude
map_shanghai = folium.Map(location=[latitude, longitude], zoom_start=9)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_shanghai) 
    
map_shanghai

In [11]:
# define Foursquare Credentials and Version
CLIENT_ID = 'BEJ113Z31J02GX124QVSURAHKSOZ5H0LB0XUO2HPA53I0KSE' # your Foursquare ID
CLIENT_SECRET = 'TPFUWZEY10J1REJOWUYZTNYVJRCF1PNXVNGUN1TJQAS1HFO4' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: BEJ113Z31J02GX124QVSURAHKSOZ5H0LB0XUO2HPA53I0KSE
CLIENT_SECRET:TPFUWZEY10J1REJOWUYZTNYVJRCF1PNXVNGUN1TJQAS1HFO4


In [12]:
# define radius and limit of venues to get
radius=500
LIMIT=100

In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
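
`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for any venue whose category list is empty. The sketch below shows one way to flatten the `items` list defensively. The function name `venue_rows`, the `'Unknown'` placeholder and the sample payload are assumptions for illustration. The payload only mirrors the fields that the function above reads.

```python
def venue_rows(name, lat, lng, items):
    """Flatten Foursquare 'explore' items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to a placeholder label when no category is attached
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows

# Hypothetical payload shaped like the fields read by getNearbyVenues
sample_items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 31.0, 'lng': 121.0},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 31.1, 'lng': 121.1},
               'categories': []}},
]

rows = venue_rows('Demo District', 31.05, 121.05, sample_items)
print(rows)
```

The tuples have the same seven fields, in the same order, as the columns assigned to `nearby_venues`.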

In [14]:
shanghai_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Huangpu District
Xuhui District
Changning District
Jing'an District
Putuo District
Hongkou District
Yangpu District
Pudong New Area
Minhang District
Baoshan District
Jiading District
Jinshan District
Songjiang District
Qingpu District
Fengxian District
Chongming District


In [15]:
shanghai_venues.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Huangpu District,31.233593,121.479864,Campanile Hotel and Restaurant,31.232123,121.479144,Hotel
1,Huangpu District,31.233593,121.479864,M1NT Restaurant & Grill,31.23692,121.479641,Restaurant
2,Huangpu District,31.233593,121.479864,The Westin Bund Center (上海威斯汀大饭店),31.233935,121.482653,Hotel
3,Huangpu District,31.233593,121.479864,Old Beijing Qianmen Roast Duck (老北京前门烤鸭),31.23248,121.482457,Peking Duck Restaurant
4,Huangpu District,31.233593,121.479864,Épices & Foie-gras,31.237557,121.47958,French Restaurant
5,Huangpu District,31.233593,121.479864,台北纯K,31.230826,121.477331,Karaoke Bar
6,Huangpu District,31.233593,121.479864,M1NT,31.236609,121.479798,Nightclub
7,Huangpu District,31.233593,121.479864,Foreign Languages Bookstore (外文书店),31.235917,121.478498,Bookstore
8,Huangpu District,31.233593,121.479864,东莱 海上,31.234365,121.47776,Seafood Restaurant
9,Huangpu District,31.233593,121.479864,Tock's,31.236397,121.48131,Deli / Bodega


In [16]:
shanghai_venues.shape

(222, 7)

In [17]:
shanghai_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Baoshan District,4,4,4,4,4,4
Changning District,10,10,10,10,10,10
Fengxian District,7,7,7,7,7,7
Hongkou District,4,4,4,4,4,4
Huangpu District,54,54,54,54,54,54
Jiading District,2,2,2,2,2,2
Jing'an District,91,91,91,91,91,91
Jinshan District,2,2,2,2,2,2
Minhang District,5,5,5,5,5,5
Pudong New Area,23,23,23,23,23,23
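
Grouping by neighborhood only lists districts that returned at least one venue, so a district with an empty response disappears silently from this table. A small sketch for detecting such districts follows, using toy stand-ins for `df` and `shanghai_venues` (the district names and venues below are made up).

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `shanghai_venues`
districts = pd.DataFrame({'Neighborhood': ['A District', 'B District', 'C District']})
venues = pd.DataFrame({
    'Neighborhood': ['A District', 'A District', 'B District'],
    'Venue': ['Cafe 1', 'Hotel 1', 'Park 1'],
})

# number of venues returned per district
venue_counts = venues.groupby('Neighborhood')['Venue'].count()

# districts present in the district table but absent from the venue table
missing = districts.loc[~districts['Neighborhood'].isin(venues['Neighborhood']), 'Neighborhood']

print(venue_counts)
print(list(missing))  # ['C District']
```

Applied to the real dataframes, the same check would make explicit which districts cannot receive a cluster label later on.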


In [18]:
# one hot encoding
shanghai_onehot = pd.get_dummies(shanghai_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
shanghai_onehot['Neighborhood'] = shanghai_venues['Neighborhood'] 

# the Neighborhood column is left as the last column here; groupby(...).reset_index() in the next step moves it to the front
shanghai_onehot.head()

Unnamed: 0,Art Gallery,Art Museum,Asian Restaurant,Auto Garage,BBQ Joint,Bakery,Bar,Beer Garden,Bistro,Bookstore,...,Szechuan Restaurant,Tea Room,Thai Restaurant,Theater,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Bar,Xinjiang Restaurant,Neighborhood
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Huangpu District
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Huangpu District
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Huangpu District
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Huangpu District
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Huangpu District


In [19]:
shanghai_onehot.shape

(222, 89)

In [20]:
# group rows by neighborhood, taking the mean of each one-hot column (i.e. the frequency of each category)
shanghai_grouped = shanghai_onehot.groupby('Neighborhood').mean().reset_index()
shanghai_grouped.head()

Unnamed: 0,Neighborhood,Art Gallery,Art Museum,Asian Restaurant,Auto Garage,BBQ Joint,Bakery,Bar,Beer Garden,Bistro,...,Supermarket,Szechuan Restaurant,Tea Room,Thai Restaurant,Theater,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Bar,Xinjiang Restaurant
0,Baoshan District,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Changning District,0.0,0.1,0.0,0.0,0.2,0.0,0.0,0.0,0.0,...,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Fengxian District,0.0,0.0,0.142857,0.142857,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Hongkou District,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Huangpu District,0.018519,0.0,0.0,0.0,0.0,0.018519,0.037037,0.0,0.0,...,0.0,0.018519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018519
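
To see why the mean of the one-hot columns is a frequency, here is a toy example with made-up rows. It follows the same `get_dummies` and `groupby(...).mean()` steps as above; `dtype=int` is added as an assumption so that the indicator columns are 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy venue table (made-up rows) with the same two columns used above
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Hotel', 'Hotel', 'Cafe', 'Cafe'],
})

# one indicator column per category
onehot = pd.get_dummies(toy[['Venue Category']], prefix="", prefix_sep="", dtype=int)
onehot['Neighborhood'] = toy['Neighborhood']

# the mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

Neighborhood A has two hotels out of three venues, so its `Hotel` frequency is 2/3. Each row sums to 1, which means a district with only two venues is described by two values of 0.5.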


In [21]:
# print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in shanghai_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = shanghai_grouped[shanghai_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Baoshan District----
              venue  freq
0  Asian Restaurant  0.25
1       Coffee Shop  0.25
2              Park  0.25
3        Hobby Shop  0.25
4       Art Gallery  0.00


----Changning District----
          venue  freq
0     BBQ Joint   0.2
1  Noodle House   0.1
2      Tea Room   0.1
3   Coffee Shop   0.1
4      Gym Pool   0.1


----Fengxian District----
                     venue  freq
0         Asian Restaurant  0.14
1              Auto Garage  0.14
2                    Plaza  0.14
3  Fruit & Vegetable Store  0.14
4            Grocery Store  0.14


----Hongkou District----
                venue  freq
0               Plaza  0.25
1              Bakery  0.25
2           Multiplex  0.25
3  Chinese Restaurant  0.25
4         Art Gallery  0.00


----Huangpu District----
                venue  freq
0               Hotel  0.15
1         Coffee Shop  0.11
2           Bookstore  0.06
3  Chinese Restaurant  0.06
4   French Restaurant  0.04


----Jiading District----
               

In [22]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [23]:
# create the new dataframe and display the top 10 venues for each neighborhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # ranks beyond 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = shanghai_grouped['Neighborhood']

for ind in np.arange(shanghai_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(shanghai_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Baoshan District,Asian Restaurant,Coffee Shop,Hobby Shop,Park,Xinjiang Restaurant,Grocery Store,Doner Restaurant,Dumpling Restaurant,Fast Food Restaurant,Flea Market
1,Changning District,BBQ Joint,Gym Pool,Art Museum,Dumpling Restaurant,Fast Food Restaurant,Hotel,Tea Room,Coffee Shop,Noodle House,Gym
2,Fengxian District,Plaza,Grocery Store,Asian Restaurant,Auto Garage,Steakhouse,Shanghai Restaurant,Fruit & Vegetable Store,Dessert Shop,Dim Sum Restaurant,Doner Restaurant
3,Hongkou District,Bakery,Chinese Restaurant,Multiplex,Plaza,Xinjiang Restaurant,Gym,Doner Restaurant,Dumpling Restaurant,Fast Food Restaurant,Flea Market
4,Huangpu District,Hotel,Coffee Shop,Bookstore,Chinese Restaurant,French Restaurant,Bar,Restaurant,Italian Restaurant,Gym,Shanghai Restaurant
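
The column labels above rely on a `try`/`except` around a three-element suffix list, which is sufficient for ranks 1 to 10. With more ranks it would produce labels such as "21th". A general helper is sketched below; the function name `ordinal` is an assumption and not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    # 11, 12 and 13 are irregular and always take 'th'
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns[:4])
```

For `num_top_venues = 10` this yields the same column names as the loop above.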


In [24]:
# set number of clusters
kclusters = 5

shanghai_grouped_clustering = shanghai_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(shanghai_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 
# the labels are int32; use .astype() if another dtype is needed

array([0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 4, 4, 2, 0, 3], dtype=int32)
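
Before interpreting the clusters, it helps to check how many districts fall into each one. The sketch below counts the labels printed above. It uses only the standard library, and the list is copied by hand from that output.

```python
from collections import Counter

# labels copied from the kmeans.labels_ output above
labels = [0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 4, 4, 2, 0, 3]

sizes = Counter(labels)
for cluster in sorted(sizes):
    print('cluster {}: {} districts'.format(cluster, sizes[cluster]))
```

Label 0 covers 9 of the 15 labelled districts, while labels 1 and 3 each contain a single district, so the partition is strongly imbalanced.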

In [25]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster_Labels', kmeans.labels_)

shanghai_merged = df

# join the cluster labels and top venues onto the district dataframe, which holds latitude/longitude for each neighborhood
shanghai_merged = shanghai_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

shanghai_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Chinese,Area,Population,Density,Latitude,Longitude,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Huangpu District,黄浦区,20.46,653800,31955,31.233593,121.479864,0.0,Hotel,Coffee Shop,Bookstore,Chinese Restaurant,French Restaurant,Bar,Restaurant,Italian Restaurant,Gym,Shanghai Restaurant
1,Xuhui District,徐汇区,54.76,1084400,19803,31.163698,121.427994,0.0,Italian Restaurant,Supermarket,Shopping Mall,Buffet,Xinjiang Restaurant,Grocery Store,Doner Restaurant,Dumpling Restaurant,Fast Food Restaurant,Flea Market
2,Changning District,长宁区,38.3,694000,18120,31.209276,121.389986,0.0,BBQ Joint,Gym Pool,Art Museum,Dumpling Restaurant,Fast Food Restaurant,Hotel,Tea Room,Coffee Shop,Noodle House,Gym
3,Jing'an District,静安区,36.88,1062800,28818,31.229776,121.44306,0.0,Coffee Shop,Café,Japanese Restaurant,Hotel Bar,Dumpling Restaurant,Gym / Fitness Center,Pizza Place,Shanghai Restaurant,Hotel,Shopping Mall
4,Putuo District,普陀区,54.83,1281900,23380,31.251326,121.391229,4.0,Stadium,Hotel,Fast Food Restaurant,Szechuan Restaurant,Motel,Gym,Dim Sum Restaurant,Doner Restaurant,Dumpling Restaurant,Flea Market


In [26]:
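# districts that returned no venues have NaN labels after the join; drop them before casting
shanghai_merged = shanghai_merged.dropna().copy()
shanghai_merged['Cluster_Labels'] = shanghai_merged['Cluster_Labels'].astype(int)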

In [27]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=9)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(shanghai_merged['Latitude'], shanghai_merged['Longitude'], shanghai_merged['Neighborhood'], shanghai_merged['Cluster_Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
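
The marker color is looked up with `rainbow[cluster-1]`, so label 0 maps to index -1, which Python resolves to the last list element. Each label still gets its own color, but the assignment is rotated by one position. The toy palette below uses placeholder names rather than the real hex values to make this visible.

```python
kclusters = 5
# placeholder palette standing in for the `rainbow` list of hex colors
palette = ['color{}'.format(i) for i in range(kclusters)]

# same lookup as in the map cell: index cluster - 1
mapping = {cluster: palette[cluster - 1] for cluster in range(kclusters)}
print(mapping)
# {0: 'color4', 1: 'color0', 2: 'color1', 3: 'color2', 4: 'color3'}
```

Indexing with `rainbow[cluster]` instead would give the unrotated assignment; either choice keeps the five clusters visually distinct.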

## 4. Examine the clusters

In [28]:
# Cluster 1
shanghai_merged.loc[shanghai_merged['Cluster_Labels'] == 0, 
                    shanghai_merged.columns[[1] + list(range(5, shanghai_merged.shape[1]))]]

Unnamed: 0,Chinese,Latitude,Longitude,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,黄浦区,31.233593,121.479864,0,Hotel,Coffee Shop,Bookstore,Chinese Restaurant,French Restaurant,Bar,Restaurant,Italian Restaurant,Gym,Shanghai Restaurant
1,徐汇区,31.163698,121.427994,0,Italian Restaurant,Supermarket,Shopping Mall,Buffet,Xinjiang Restaurant,Grocery Store,Doner Restaurant,Dumpling Restaurant,Fast Food Restaurant,Flea Market
2,长宁区,31.209276,121.389986,0,BBQ Joint,Gym Pool,Art Museum,Dumpling Restaurant,Fast Food Restaurant,Hotel,Tea Room,Coffee Shop,Noodle House,Gym
3,静安区,31.229776,121.44306,0,Coffee Shop,Café,Japanese Restaurant,Hotel Bar,Dumpling Restaurant,Gym / Fitness Center,Pizza Place,Shanghai Restaurant,Hotel,Shopping Mall
5,虹口区,31.266703,121.501751,0,Bakery,Chinese Restaurant,Multiplex,Plaza,Xinjiang Restaurant,Gym,Doner Restaurant,Dumpling Restaurant,Fast Food Restaurant,Flea Market
7,浦东新区,31.221783,121.53874,0,Fast Food Restaurant,Coffee Shop,Ramen Restaurant,Plaza,Steakhouse,Performing Arts Venue,Convenience Store,Science Museum,Flea Market,Metro Station
8,闵行区,31.114767,121.376943,0,Fast Food Restaurant,Bakery,Metro Station,Café,Plaza,Xinjiang Restaurant,Gym,Doner Restaurant,Dumpling Restaurant,Flea Market
9,宝山区,31.406634,121.485158,0,Asian Restaurant,Coffee Shop,Hobby Shop,Park,Xinjiang Restaurant,Grocery Store,Doner Restaurant,Dumpling Restaurant,Fast Food Restaurant,Flea Market
14,奉贤区,30.920449,121.469383,0,Plaza,Grocery Store,Asian Restaurant,Auto Garage,Steakhouse,Shanghai Restaurant,Fruit & Vegetable Store,Dessert Shop,Dim Sum Restaurant,Doner Restaurant


In [29]:
# Cluster 2
shanghai_merged.loc[shanghai_merged['Cluster_Labels'] == 1, 
                   shanghai_merged.columns[[1] + list(range(5, shanghai_merged.shape[1]))]]

Unnamed: 0,Chinese,Latitude,Longitude,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,金山区,30.744817,121.337257,1,Fast Food Restaurant,Supermarket,Xinjiang Restaurant,Gym,Dim Sum Restaurant,Doner Restaurant,Dumpling Restaurant,Flea Market,French Restaurant,Fruit & Vegetable Store


In [30]:
# Cluster 3
shanghai_merged.loc[shanghai_merged['Cluster_Labels'] == 2, 
                   shanghai_merged.columns[[1] + list(range(5, shanghai_merged.shape[1]))]]

Unnamed: 0,Chinese,Latitude,Longitude,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,嘉定区,31.377756,121.260612,2,Hotel,Chinese Restaurant,Xinjiang Restaurant,Gym,Doner Restaurant,Dumpling Restaurant,Fast Food Restaurant,Flea Market,French Restaurant,Fruit & Vegetable Store
12,松江区,31.029593,121.210838,2,Hotel,Convenience Store,Chinese Restaurant,Xinjiang Restaurant,Gym,Doner Restaurant,Dumpling Restaurant,Fast Food Restaurant,Flea Market,French Restaurant


In [31]:
# Cluster 4
shanghai_merged.loc[shanghai_merged['Cluster_Labels'] == 3, 
                   shanghai_merged.columns[[1] + list(range(5, shanghai_merged.shape[1]))]]

Unnamed: 0,Chinese,Latitude,Longitude,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,杨浦区,31.262011,121.52143,3,Coffee Shop,Fast Food Restaurant,Museum,Xinjiang Restaurant,Gym,Doner Restaurant,Dumpling Restaurant,Flea Market,French Restaurant,Fruit & Vegetable Store


In [32]:
# Cluster 5
shanghai_merged.loc[shanghai_merged['Cluster_Labels'] == 4, 
                   shanghai_merged.columns[[1] + list(range(5, shanghai_merged.shape[1]))]]

Unnamed: 0,Chinese,Latitude,Longitude,Cluster_Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,普陀区,31.251326,121.391229,4,Stadium,Hotel,Fast Food Restaurant,Szechuan Restaurant,Motel,Gym,Dim Sum Restaurant,Doner Restaurant,Dumpling Restaurant,Flea Market
13,青浦区,31.152164,121.119552,4,Fast Food Restaurant,Asian Restaurant,Hotel,Xinjiang Restaurant,Gym / Fitness Center,Doner Restaurant,Dumpling Restaurant,Flea Market,French Restaurant,Fruit & Vegetable Store


## 5. Conclusion

The venue lists of these five clusters do not, by themselves, make the differences between the districts clear. The map does show, however, that the districts in the city center are concentrated in a single cluster. There are two possible reasons for this result.
First, Shanghai is highly urbanized and infrastructure is spread fairly evenly across its districts, so restaurants, shops and similar venues are distributed in a similar way. Second, the data obtained from Foursquare may be incomplete: most of the venues are restaurants and hotels, while city parks and other types of attractions have few reviews and ratings. This is probably related to the fact that Foursquare is not a mainstream social network in China, so the data used here is of limited reliability.