# Coursera Applied Data Science Capstone Project

### Project Title: Where to start up a business in Guangzhou?

#### Author: Ziqing Xu

#### Date: Oct 29

**Project Description:** The goal of this project is to segment and cluster the neighbourhoods of Guangzhou by exploring and comparing the venues around them. By analyzing the resulting clusters, we can recommend locations for starting a specific type of business in Guangzhou.

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
# library to handle data in a vectorized manner
import numpy as np 
# library for data analysis
import pandas as pd 
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
# library to handle JSON files
import json 
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 
#  library to handle requests
import requests 
# transform JSON file into a pandas dataframe (the pandas.io.json import path is deprecated since pandas 1.0)
from pandas import json_normalize
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
# map rendering library
import folium
# library for web scraping
from bs4 import BeautifulSoup
# csv library
import csv
from collections import deque

print('Libraries imported.')

Libraries imported.


## 1. Download and Pre-process Dataset

### 1.1 Get the dataset by web scraping

In [2]:
# make a GET request; the response body is the HTML of the page
res = requests.get('https://en.wikipedia.org/wiki/List_of_township-level_divisions_of_Guangdong')
html = res.text
soup = BeautifulSoup(html,'html.parser')

In [3]:
Guangzhou_Borough = soup.find_all(class_='mw-headline')[1:13]
for i, d in enumerate(Guangzhou_Borough):
    Guangzhou_Borough[i] = d.text.split()[0]
number_of_Borough = len(Guangzhou_Borough)
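
The slice `[1:13]` above depends on the exact order of headlines on the page, so it can break silently if the page layout changes. As a hedged alternative, the sketch below selects headlines by their text rather than by position. The HTML snippet is a hypothetical stand-in written for this example (it is not copied from the Wikipedia page), and the "ends with District" rule is an assumption about how the headlines are worded.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only meant to mimic headline spans followed by lists.
SAMPLE_HTML = """
<h2><span class="mw-headline">Guangzhou</span></h2>
<h3><span class="mw-headline">Baiyun District</span></h3>
<ul><li>Jingtai Subdistrict, Songzhou Subdistrict</li></ul>
<h3><span class="mw-headline">Haizhu District</span></h3>
<ul><li>Chigang Subdistrict</li></ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect the text of every headline span.
headlines = [tag.get_text() for tag in soup.find_all(class_="mw-headline")]

# Keep only headlines that look like districts and take their first word.
boroughs = [text.split()[0] for text in headlines if text.endswith("District")]
print(boroughs)  # ['Baiyun', 'Haizhu']
```

Filtering on content keeps the city-level headline out without counting positions by hand.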

In [4]:
raw_Neighborhoods = soup.find_all('ul')[20:37] # Baiyun to Zengcheng
i = 0
blocks_for_borough = [2,1,1,1,1,1,2,2,1,1,2,2]
Neighborhoods = [[] for _ in range(number_of_Borough)] # one empty list per borough
for j in range(number_of_Borough):
    neigh = raw_Neighborhoods[i:i+blocks_for_borough[j]]
    i+=blocks_for_borough[j] # update i
    for n in neigh:
        for m in n.text.split(','):
            Neighborhoods[j] += [m.split()[0]]
    #print(Guangzhou_Borough[j],Neighborhoods[j]) #uncomment it for testing

# Xinhua should be added in Huadu
Neighborhoods[Guangzhou_Borough.index('Huadu')] += ['Xinhua']
# Jiulong should be added in Luogang
Neighborhoods[Guangzhou_Borough.index('Luogang')] += ['Jiulong']

#uncomment the following two lines for checking
#for b, n in zip(Guangzhou_Borough,Neighborhoods):
#    print(b,n) 
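
The bookkeeping with `i` and `blocks_for_borough` can also be written as a small helper that cuts a flat list into consecutive groups. This is only a sketch: `group_blocks` is a hypothetical name, and the strings below stand in for the text of the scraped `<ul>` blocks.

```python
from itertools import islice


def group_blocks(blocks, sizes):
    """Split `blocks` into consecutive groups whose lengths are given by `sizes`."""
    if sum(sizes) != len(blocks):
        raise ValueError("sizes must add up to the number of blocks")
    iterator = iter(blocks)
    return [list(islice(iterator, size)) for size in sizes]


# Hypothetical stand-ins for the text of each scraped block.
blocks = [
    "Jingtai Subdistrict, Songzhou Subdistrict",
    "Renhe Town",
    "Chigang Subdistrict",
]
sizes = [2, 1]  # the first borough spans two blocks, the second spans one

# Keep the first word of every comma-separated entry, as in the cell above.
neighbourhoods = [
    [item.split()[0] for block in group for item in block.split(",")]
    for group in group_blocks(blocks, sizes)
]
print(neighbourhoods)  # [['Jingtai', 'Songzhou', 'Renhe'], ['Chigang']]
```

The length check makes a mismatch between the block counts and the scraped lists fail loudly instead of shifting neighbourhoods into the wrong borough.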

### 1.2 Write the data scraped from the web into a csv file.

In [5]:
with open('Guangzhou_Neighborhood.csv','w', newline='', encoding='utf-8') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csvwriter.writerow(['City','Borough','Neighbourhood'])
    for i, b in enumerate(Guangzhou_Borough):
        for n in Neighborhoods[i]:
            csvwriter.writerow(['Guangzhou',b,n])

### 1.3 Transform the csv data into pandas dataframe

In [6]:
Guangzhou_df = pd.read_csv('Guangzhou_Neighborhood.csv')
Guangzhou_df.head(10)

Unnamed: 0,City,Borough,Neighbourhood
0,Guangzhou,Baiyun,Jingtai
1,Guangzhou,Baiyun,Songzhou
2,Guangzhou,Baiyun,Tongde
3,Guangzhou,Baiyun,Huangshi
4,Guangzhou,Baiyun,Tangjing
5,Guangzhou,Baiyun,Xinshi
6,Guangzhou,Baiyun,Sanyuanli
7,Guangzhou,Baiyun,Tonghe
8,Guangzhou,Baiyun,Jingxi
9,Guangzhou,Baiyun,Yongping


### 1.4 Use geopy library to get the geospatial data of Guangzhou

In [7]:
# create the geocoder once and reuse it for every lookup
geolocator = Nominatim(user_agent="guangzhou-explorer")
rows = []

for i in range(Guangzhou_df.shape[0]):
    borough = Guangzhou_df.loc[i,'Borough']
    neighbourhood = Guangzhou_df.loc[i,'Neighbourhood']

    # look up the neighbourhood together with its borough
    coordinate = geolocator.geocode("{},{},Guangzhou,China".format(neighbourhood,borough))

    # try one more search without the borough
    if coordinate is None:
        coordinate = geolocator.geocode("{},Guangzhou,China".format(neighbourhood))

    # skip the neighbourhoods that Nominatim is unable to locate
    if coordinate is None:
        print("The geospatial data of {} in {} is not available!".format(neighbourhood,borough))
    else:
        rows.append({'City': 'Guangzhou',
                     'Borough': borough,
                     'Neighbourhood': neighbourhood,
                     'Latitude': coordinate.latitude,
                     'Longitude': coordinate.longitude})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
Guangzhou_data = pd.DataFrame(rows, columns=['City','Borough','Neighbourhood','Latitude','Longitude'])

Guangzhou_data.head(10)

The geospatial data of Songzhou in Baiyun is not available!
The geospatial data of Nanhuaxi in Haizhu is not available!
The geospatial data of Longfeng in Haizhu is not available!
The geospatial data of Fengyang in Haizhu is not available!
The geospatial data of Jiangnanzhong in Haizhu is not available!
The geospatial data of Zhangzhou in Huangpu is not available!
The geospatial data of Suidong in Huangpu is not available!
The geospatial data of Lilian in Huangpu is not available!
The geospatial data of Jinhua in Liwan is not available!
The geospatial data of Changhua in Liwan is not available!
The geospatial data of Hailong in Liwan is not available!
The geospatial data of Xiagang in Luogang is not available!
The geospatial data of Dalong in Panyu is not available!
The geospatial data of Shilou in Panyu is not available!
The geospatial data of Lanhe in Panyu is not available!
The geospatial data of Hongqiao in Yuexiu is not available!
The geospatial data of Dadong in Yuexiu is not ava

Unnamed: 0,City,Borough,Neighbourhood,Latitude,Longitude
0,Guangzhou,Baiyun,Jingtai,23.171167,113.260877
1,Guangzhou,Baiyun,Tongde,23.166263,113.229654
2,Guangzhou,Baiyun,Huangshi,23.205192,113.260667
3,Guangzhou,Baiyun,Tangjing,23.175695,113.248646
4,Guangzhou,Baiyun,Xinshi,23.187983,113.255349
5,Guangzhou,Baiyun,Sanyuanli,23.16141,113.251742
6,Guangzhou,Baiyun,Tonghe,23.199603,113.320919
7,Guangzhou,Baiyun,Jingxi,23.187745,113.320533
8,Guangzhou,Baiyun,Yongping,23.240796,113.28416
9,Guangzhou,Baiyun,Junhe,23.258613,113.285623
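
The "try with the borough, then without it" logic can be factored into a helper that walks through a list of candidate queries. The sketch below is an assumption-laden illustration: `geocode_with_fallback` is a hypothetical name, and the geocoder is replaced by a lookup table so the example runs offline. A real geopy geocoder returns a location object rather than a tuple, and geopy's `RateLimiter` (in `geopy.extra.rate_limiter`) may be a better way to space out requests than the plain `time.sleep` used here.

```python
import time
from typing import Callable, Iterable, Optional, Tuple

Coordinate = Tuple[float, float]


def geocode_with_fallback(
    geocode: Callable[[str], Optional[Coordinate]],
    queries: Iterable[str],
    delay_seconds: float = 0.0,
) -> Optional[Coordinate]:
    """Try each query in order and return the first result found, else None."""
    for query in queries:
        result = geocode(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # pause before the next attempt
    return None


# Hypothetical stand-in for a geocoder: a lookup table with one known place.
KNOWN = {"Jingtai,Guangzhou,China": (23.171167, 113.260877)}
fake_geocode = KNOWN.get  # returns None for unknown queries

found = geocode_with_fallback(
    fake_geocode,
    ["Jingtai,Baiyun,Guangzhou,China", "Jingtai,Guangzhou,China"],
)
missing = geocode_with_fallback(
    fake_geocode,
    ["Songzhou,Baiyun,Guangzhou,China", "Songzhou,Guangzhou,China"],
)
print(found)    # (23.171167, 113.260877)
print(missing)  # None
```

Passing the geocoder in as a function keeps the fallback order easy to test without sending requests.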


## 2. Explore and cluster the neighborhoods in Guangzhou.

### 2.1 Visualize neighbourhoods in Guangzhou

Let's get the geographical coordinates of Guangzhou

In [8]:
address = 'Guangzhou, China'

gz_location = geolocator.geocode(address)
gz_latitude = gz_location.latitude
gz_longitude = gz_location.longitude
print('The geographical coordinates of Guangzhou are {}, {}.'.format(gz_latitude, gz_longitude))

The geographical coordinates of Guangzhou are 23.1301964, 113.2592945.


Let's visualize Guangzhou and the neighbourhoods in it

In [9]:
# create map of Guangzhou using gz_latitude and gz_longitude values
map_guangzhou = folium.Map(location=[gz_latitude, gz_longitude], zoom_start=8)

# add markers to map
for lat, lng, label in zip(Guangzhou_data['Latitude'], Guangzhou_data['Longitude'], Guangzhou_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_guangzhou)  
    
map_guangzhou

### 2.2 Explore Venues Nearby Each Neighbourhood

Define Foursquare Credentials and Version

In [10]:
CLIENT_ID = 'hidden_id' # Foursquare ID
CLIENT_SECRET = 'hidden_secret' # Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

Define a function that fetches the nearby venues for every neighbourhood in Guangzhou.

In [11]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    print('Done.')
    
    return nearby_venues
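
The function above indexes straight into the response (`["response"]['groups'][0]['items']` and `['categories'][0]`), so a single venue without a category or an error response would raise an exception. Below is a hedged sketch of a more tolerant parser. `extract_venues` is a hypothetical helper, and the sample payload is hand-written to mirror only the keys that the function above reads; it is not a recorded API response.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows, tolerating missing fields."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            category,
        ))
    return rows


# Hand-written sample; the second venue has no category on purpose.
sample = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Wanda International Cinemas",
                           "location": {"lat": 23.173979, "lng": 113.261186},
                           "categories": [{"name": "Multiplex"}]}},
                {"venue": {"name": "Unnamed place",
                           "location": {"lat": 23.17, "lng": 113.26},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venues(sample, "Jingtai", 23.171167, 113.260877)
print(len(rows), rows[0][6], rows[1][6])  # 2 Multiplex None
print(extract_venues({}, "Jingtai", 23.171167, 113.260877))  # []
```

Rows with a missing category can then be dropped or labelled explicitly before the one-hot encoding step.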

Now, run the above function on each neighbourhood and create a new dataframe called Guangzhou_venues.

In [13]:
Guangzhou_venues = getNearbyVenues(Guangzhou_data.Neighbourhood,
                            Guangzhou_data.Latitude,
                            Guangzhou_data.Longitude)

Done.


Let's check the size of the resulting dataframe

In [14]:
print(Guangzhou_venues.shape)
Guangzhou_venues.head(10)

(833, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Jingtai,23.171167,113.260877,Wanda International Cinemas (万达国际电影城),23.173979,113.261186,Multiplex
1,Jingtai,23.171167,113.260877,Wanda Plaza (万达广场),23.175312,113.261407,Shopping Mall
2,Jingtai,23.171167,113.260877,SUBWAY (赛百味),23.174236,113.260859,Sandwich Place
3,Jingtai,23.171167,113.260877,Hannashan Korean BBQ,23.174181,113.260906,Korean Restaurant
4,Jingtai,23.171167,113.260877,Boya Holiday Hotel,23.172244,113.263937,Hotel
5,Jingtai,23.171167,113.260877,Walmart 沃尔玛,23.174473,113.261047,Big Box Store
6,Jingtai,23.171167,113.260877,广州时尚旅酒店 Smart Hotel Guangzhou,23.174362,113.259552,Hotel
7,Jingtai,23.171167,113.260877,Tairyo Teppanyaki Restaurant,23.175244,113.260989,Japanese Restaurant
8,Huangshi,23.205192,113.260667,山东老家,23.204588,113.260489,Chinese Restaurant
9,Huangshi,23.205192,113.260667,麦霸KTV,23.20549,113.262745,Karaoke Bar


Let's check how many venues were returned for each neighbourhood

In [15]:
Guangzhou_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Baihedong,2,2,2,2,2,2
Baiyun,7,7,7,7,7,7
Beijing,34,34,34,34,34,34
Binjiang,7,7,7,7,7,7
Caihong,6,6,6,6,6,6
...,...,...,...,...,...,...
Zengjiang,3,3,3,3,3,3
Zhanqian,37,37,37,37,37,37
Zhengguo,1,1,1,1,1,1
Zhuguang,17,17,17,17,17,17


List the neighbourhoods with the highest number of venues

In [16]:
Guangzhou_venues.groupby("Neighborhood").Venue.count().sort_values(ascending=False).head()

Neighborhood
Jianshe      42
Tianhenan    42
Zhanqian     37
Huale        36
Beijing      34
Name: Venue, dtype: int64

Total number of unique categories in the data

In [17]:
print('There are {} unique categories.'.format(len(Guangzhou_venues['Venue Category'].unique())))

There are 127 unique categories.


## 3. Analyze Each Neighbourhood

In [18]:
# one hot encoding
Guangzhou_onehot = pd.get_dummies(Guangzhou_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Guangzhou_onehot['Neighborhood'] = Guangzhou_venues['Neighborhood'] 

# move neighborhood column to the first column
t_index = list(Guangzhou_onehot.columns).index("Neighborhood")
cols = deque(Guangzhou_onehot.columns)
cols.rotate(-t_index)
cols = list(cols)
Guangzhou_onehot = Guangzhou_onehot[cols]

Guangzhou_onehot.head(10)

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Badminton Court,Bakery,...,Thai Restaurant,Theater,Theme Park,Toll Booth,Toy / Game Store,Train Station,Turkish Restaurant,Vietnamese Restaurant,Warehouse Store,Women's Store
0,Jingtai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Jingtai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Jingtai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Jingtai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Jingtai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Jingtai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Jingtai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Jingtai,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Huangshi,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Huangshi,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's examine the new dataframe size

In [19]:
Guangzhou_onehot.shape

(833, 128)

Let's group the rows by neighbourhood and take the mean of each one-hot column, which gives the frequency of occurrence of each category

In [20]:
Guangzhou_grouped = Guangzhou_onehot.groupby('Neighborhood').mean().reset_index()
Guangzhou_grouped

Unnamed: 0,Neighborhood,Accessories Store,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Badminton Court,Bakery,...,Thai Restaurant,Theater,Theme Park,Toll Booth,Toy / Game Store,Train Station,Turkish Restaurant,Vietnamese Restaurant,Warehouse Store,Women's Store
0,Baihedong,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000
1,Baiyun,0.0,0.0,0.0,0.0,0.142857,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.142857,0.000000,0.000000
2,Beijing,0.0,0.0,0.0,0.0,0.000000,0.0,0.029412,0.0,0.0,...,0.029412,0.000000,0.0,0.0,0.000000,0.0,0.0,0.029412,0.000000,0.000000
3,Binjiang,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000
4,Caihong,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110,Zengjiang,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000
111,Zhanqian,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.027027,0.0,0.0,0.027027,0.0,0.0,0.000000,0.027027,0.027027
112,Zhengguo,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000
113,Zhuguang,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000


Let's confirm the new size

In [21]:
Guangzhou_grouped.shape

(115, 128)
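
Taking the mean of one-hot columns per neighbourhood is the same as computing each category's share of that neighbourhood's venues. The sketch below spells out that arithmetic without pandas, as a plain illustration rather than a replacement for the cells above. The Jingtai categories are the eight rows displayed earlier; for Huangshi only the two displayed rows are used, so its shares are illustrative.

```python
from collections import Counter, defaultdict

# (neighbourhood, venue category) pairs taken from the rows displayed earlier.
venues = [
    ("Jingtai", "Multiplex"),
    ("Jingtai", "Shopping Mall"),
    ("Jingtai", "Sandwich Place"),
    ("Jingtai", "Korean Restaurant"),
    ("Jingtai", "Hotel"),
    ("Jingtai", "Big Box Store"),
    ("Jingtai", "Hotel"),
    ("Jingtai", "Japanese Restaurant"),
    ("Huangshi", "Chinese Restaurant"),
    ("Huangshi", "Karaoke Bar"),
]

# Count categories per neighbourhood.
counts = defaultdict(Counter)
for neighbourhood, category in venues:
    counts[neighbourhood][category] += 1

# Divide by the number of venues: this equals the mean of the one-hot column.
frequencies = {
    neighbourhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for neighbourhood, counter in counts.items()
}

print(frequencies["Jingtai"]["Hotel"])  # 0.25
print(frequencies["Huangshi"])  # {'Chinese Restaurant': 0.5, 'Karaoke Bar': 0.5}
```

Because every row sums to 1, neighbourhoods with very few venues end up with a handful of large values, which matters for the clustering step later.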

Define a function to sort the venues in descending order

In [22]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Let's create a new dataframe and display the top 10 venues for each neighbourhood

In [23]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Guangzhou_venues_sorted = pd.DataFrame(columns=columns)
Guangzhou_venues_sorted['Neighborhood'] = Guangzhou_grouped['Neighborhood']

for ind in np.arange(Guangzhou_grouped.shape[0]):
    Guangzhou_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Guangzhou_grouped.iloc[ind, :], num_top_venues)

Guangzhou_venues_sorted.head(10)



Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Baihedong,Hotpot Restaurant,Chinese Restaurant,Department Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
1,Baiyun,Vietnamese Restaurant,Shopping Mall,Food,Asian Restaurant,Clothing Store,Chinese Restaurant,Fast Food Restaurant,Women's Store,Farmers Market,Food & Drink Shop
2,Beijing,Nightclub,Pizza Place,Chinese Restaurant,Hotel,Fast Food Restaurant,Restaurant,Convenience Store,Jewelry Store,Noodle House,Dessert Shop
3,Binjiang,Convenience Store,Pharmacy,Bus Station,Noodle House,Sandwich Place,Coffee Shop,Pizza Place,Discount Store,Dog Run,Dessert Shop
4,Caihong,Convenience Store,Hotel,Fast Food Restaurant,Noodle House,French Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant
5,Chajiao,Department Store,BBQ Joint,Pet Store,Chinese Restaurant,Cantonese Restaurant,Café,Diner,Hookah Bar,Hostel,Discount Store
6,Changgang,Fast Food Restaurant,Chinese Restaurant,Hotel,Pizza Place,Coffee Shop,Dumpling Restaurant,Food & Drink Shop,Food,Flea Market,Farmers Market
7,Changxing,Convenience Store,Metro Station,Shopping Mall,Fried Chicken Joint,Hostel,Hookah Bar,Dim Sum Restaurant,Diner,Discount Store,Dog Run
8,Chebei,Metro Station,Rental Car Location,Shopping Mall,Cantonese Restaurant,Women's Store,Farmers Market,Food,Flea Market,Fast Food Restaurant,Dumpling Restaurant
9,Chigang,Chinese Restaurant,Hotel,Fast Food Restaurant,Women's Store,French Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant
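
One caveat when reading this table: a neighbourhood with only a few venues still gets ten entries. Baihedong, for example, has 2 venues in the count table above, so at most two of its ten listed categories can have a non-zero frequency; the rest are ties at zero placed in whatever order the sort returns. A hedged sketch of a variant that keeps only categories that actually occur is shown below. `top_categories` is a hypothetical helper, and the frequencies in `row` are illustrative values, not figures from the dataset.

```python
def top_categories(frequencies, n):
    """Return up to n category names with non-zero frequency, most frequent first."""
    nonzero = [(cat, freq) for cat, freq in frequencies.items() if freq > 0]
    # Sort by descending frequency; break ties alphabetically for a stable result.
    nonzero.sort(key=lambda pair: (-pair[1], pair[0]))
    return [cat for cat, _ in nonzero[:n]]


# Illustrative frequencies for a neighbourhood with two venues.
row = {
    "Hotpot Restaurant": 0.5,
    "Chinese Restaurant": 0.5,
    "Department Store": 0.0,
    "Dog Run": 0.0,
}

print(top_categories(row, 10))  # ['Chinese Restaurant', 'Hotpot Restaurant']
```

With this variant, short lists signal directly that a neighbourhood has little venue data behind it.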


## 4. Cluster Neighborhoods

Use **k-means** to group the neighbourhoods into 5 clusters

In [24]:
# set number of clusters
kclusters = 5

Guangzhou_grouped_clustering = Guangzhou_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Guangzhou_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 1, 1, 1, 3, 1, 1, 0, 0, 2], dtype=int32)
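
To make explicit what the clustering step does with the frequency vectors, here is a minimal pure-Python sketch of the basic k-means loop (Lloyd's algorithm): assign every point to its nearest centroid, then move each centroid to the mean of its members. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the centroids start at the first `k` points instead of using a smarter initialisation, and the toy vectors are made up.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = [list(p) for p in points[:k]]  # naive initialisation
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy frequency vectors: (restaurant share, hotel share, other share).
points = [
    (0.9, 0.1, 0.0),  # mostly restaurants
    (0.1, 0.8, 0.1),  # mostly hotels
    (0.8, 0.2, 0.0),
    (0.0, 0.9, 0.1),
]

labels = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1]
```

The label numbers themselves carry no meaning; only which neighbourhoods share a label matters.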

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

In [25]:
# add clustering labels to a copy, so that re-running this cell does not fail on a duplicate 'Labels' column
Guangzhou_venues_sorted_temp = Guangzhou_venues_sorted.copy()
Guangzhou_venues_sorted_temp.insert(0, 'Labels', kmeans.labels_)

Guangzhou_merged = Guangzhou_data.copy()

# join the labelled top-venue table onto Guangzhou_data so each neighbourhood keeps its latitude/longitude
Guangzhou_merged = Guangzhou_merged.join(Guangzhou_venues_sorted_temp.set_index('Neighborhood'), on='Neighbourhood')

Guangzhou_merged.head(10) # check the last columns!


Unnamed: 0,City,Borough,Neighbourhood,Latitude,Longitude,Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Guangzhou,Baiyun,Jingtai,23.171167,113.260877,1.0,Hotel,Japanese Restaurant,Shopping Mall,Multiplex,Big Box Store,Korean Restaurant,Sandwich Place,Dumpling Restaurant,Farmers Market,Food Court
1,Guangzhou,Baiyun,Tongde,23.166263,113.229654,,,,,,,,,,,
2,Guangzhou,Baiyun,Huangshi,23.205192,113.260667,2.0,Chinese Restaurant,Department Store,Karaoke Bar,Fried Chicken Joint,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
3,Guangzhou,Baiyun,Tangjing,23.175695,113.248646,1.0,Spa,Hotel,Resort,Fried Chicken Joint,Café,Fast Food Restaurant,Korean Restaurant,Coffee Shop,Convenience Store,Discount Store
4,Guangzhou,Baiyun,Xinshi,23.187983,113.255349,1.0,Fast Food Restaurant,Supermarket,Women's Store,Fried Chicken Joint,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant
5,Guangzhou,Baiyun,Sanyuanli,23.16141,113.251742,1.0,Hotel,Fast Food Restaurant,Toll Booth,Café,Metro Station,Food & Drink Shop,Park,Pizza Place,Department Store,Clothing Store
6,Guangzhou,Baiyun,Tonghe,23.199603,113.320919,0.0,Dim Sum Restaurant,Chinese Restaurant,Metro Station,Fried Chicken Joint,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
7,Guangzhou,Baiyun,Jingxi,23.187745,113.320533,0.0,Shopping Mall,Metro Station,Multiplex,Women's Store,Fast Food Restaurant,Food & Drink Shop,Food,Flea Market,Farmers Market,French Restaurant
8,Guangzhou,Baiyun,Yongping,23.240796,113.28416,0.0,Hotel,Metro Station,Shopping Mall,Chinese Restaurant,Women's Store,Farmers Market,Food,Flea Market,Fast Food Restaurant,Dongbei Restaurant
9,Guangzhou,Baiyun,Junhe,23.258613,113.285623,2.0,Chinese Restaurant,Women's Store,Department Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market


Let's remove the neighbourhoods for which no venues were returned (the rows with missing values, such as Tongde above)

In [26]:
Guangzhou_merged.dropna(inplace=True)
Guangzhou_merged = Guangzhou_merged.astype({'Labels': 'int32'})
Guangzhou_merged.head(10)

Unnamed: 0,City,Borough,Neighbourhood,Latitude,Longitude,Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Guangzhou,Baiyun,Jingtai,23.171167,113.260877,1,Hotel,Japanese Restaurant,Shopping Mall,Multiplex,Big Box Store,Korean Restaurant,Sandwich Place,Dumpling Restaurant,Farmers Market,Food Court
2,Guangzhou,Baiyun,Huangshi,23.205192,113.260667,2,Chinese Restaurant,Department Store,Karaoke Bar,Fried Chicken Joint,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
3,Guangzhou,Baiyun,Tangjing,23.175695,113.248646,1,Spa,Hotel,Resort,Fried Chicken Joint,Café,Fast Food Restaurant,Korean Restaurant,Coffee Shop,Convenience Store,Discount Store
4,Guangzhou,Baiyun,Xinshi,23.187983,113.255349,1,Fast Food Restaurant,Supermarket,Women's Store,Fried Chicken Joint,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant
5,Guangzhou,Baiyun,Sanyuanli,23.16141,113.251742,1,Hotel,Fast Food Restaurant,Toll Booth,Café,Metro Station,Food & Drink Shop,Park,Pizza Place,Department Store,Clothing Store
6,Guangzhou,Baiyun,Tonghe,23.199603,113.320919,0,Dim Sum Restaurant,Chinese Restaurant,Metro Station,Fried Chicken Joint,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
7,Guangzhou,Baiyun,Jingxi,23.187745,113.320533,0,Shopping Mall,Metro Station,Multiplex,Women's Store,Fast Food Restaurant,Food & Drink Shop,Food,Flea Market,Farmers Market,French Restaurant
8,Guangzhou,Baiyun,Yongping,23.240796,113.28416,0,Hotel,Metro Station,Shopping Mall,Chinese Restaurant,Women's Store,Farmers Market,Food,Flea Market,Fast Food Restaurant,Dongbei Restaurant
9,Guangzhou,Baiyun,Junhe,23.258613,113.285623,2,Chinese Restaurant,Women's Store,Department Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
13,Guangzhou,Baiyun,Renhe,23.338189,113.290232,1,Fast Food Restaurant,Hotel,Metro Station,Pizza Place,Women's Store,Farmers Market,Food & Drink Shop,Food,Flea Market,Dongbei Restaurant


Then, let's visualize the resulting clusters

In [27]:
# create map
map_clusters = folium.Map(location=[gz_latitude, gz_longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Guangzhou_merged['Latitude'], Guangzhou_merged['Longitude'], Guangzhou_merged['Neighbourhood'], Guangzhou_merged['Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
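
In the cell above the colour is looked up with `rainbow[cluster-1]`, so cluster 0 wraps around to the last colour in the list. That works, but a direct label-to-colour mapping is easier to read back from the map. The sketch below builds such a mapping with the standard-library `colorsys` module instead of a matplotlib colormap; this is a swapped-in technique for illustration, and `cluster_colours` is a hypothetical helper.

```python
import colorsys


def cluster_colours(k):
    """Return k evenly spaced hues as hex strings, one per cluster label 0..k-1."""
    colours = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.9, 0.9)
        colours.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colours


palette = cluster_colours(5)

# Label n always maps to palette[n], with no offset.
colour_of = {label: palette[label] for label in range(5)}
print(colour_of)
```

The resulting hex strings can be passed as `color` and `fill_color` in the same way as the values in `rainbow`.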

## 5. Examine Clusters

### Cluster 1

In this cluster, restaurants (Chinese, dim sum, Cantonese, fast food), metro stations, and stores (shopping malls, convenience stores) are the most recommended.

In [28]:
Guangzhou_merged.loc[Guangzhou_merged['Labels'] == 0, Guangzhou_merged.columns[[1] + list(range(5, Guangzhou_merged.shape[1]))]]


Unnamed: 0,Borough,Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Baiyun,0,Dim Sum Restaurant,Chinese Restaurant,Metro Station,Fried Chicken Joint,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
7,Baiyun,0,Shopping Mall,Metro Station,Multiplex,Women's Store,Fast Food Restaurant,Food & Drink Shop,Food,Flea Market,Farmers Market,French Restaurant
8,Baiyun,0,Hotel,Metro Station,Shopping Mall,Chinese Restaurant,Women's Store,Farmers Market,Food,Flea Market,Fast Food Restaurant,Dongbei Restaurant
25,Haizhu,0,Hotpot Restaurant,Hotel,Art Gallery,Metro Station,Seafood Restaurant,Farmers Market,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant
28,Haizhu,0,Hotel,Metro Station,Bus Station,Women's Store,Fast Food Restaurant,Food Court,Food & Drink Shop,Food,Flea Market,Farmers Market
42,Huangpu,0,Metro Station,Women's Store,Fried Chicken Joint,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
43,Huangpu,0,Metro Station,Park,Big Box Store,Fast Food Restaurant,Food Court,Food & Drink Shop,Food,Flea Market,Women's Store,Fried Chicken Joint
55,Liwan,0,Resort,Metro Station,Chinese Restaurant,Women's Store,Farmers Market,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant,Dongbei Restaurant
78,Panyu,0,Metro Station,Café,Chinese Restaurant,Women's Store,Fast Food Restaurant,Food Court,Food & Drink Shop,Food,Flea Market,Farmers Market
85,Panyu,0,Snack Place,Metro Station,Women's Store,French Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant


### Cluster 2

In this cluster, the types of business are diverse. Based on the result, it is likely to be an entertainment area. Cafés, restaurants, shopping malls, resorts, parks, and sports courts are recommended.

In [29]:
Guangzhou_merged.loc[Guangzhou_merged['Labels'] == 1, Guangzhou_merged.columns[[1] + list(range(5, Guangzhou_merged.shape[1]))]]


Unnamed: 0,Borough,Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Baiyun,1,Hotel,Japanese Restaurant,Shopping Mall,Multiplex,Big Box Store,Korean Restaurant,Sandwich Place,Dumpling Restaurant,Farmers Market,Food Court
3,Baiyun,1,Spa,Hotel,Resort,Fried Chicken Joint,Café,Fast Food Restaurant,Korean Restaurant,Coffee Shop,Convenience Store,Discount Store
4,Baiyun,1,Fast Food Restaurant,Supermarket,Women's Store,Fried Chicken Joint,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant
5,Baiyun,1,Hotel,Fast Food Restaurant,Toll Booth,Café,Metro Station,Food & Drink Shop,Park,Pizza Place,Department Store,Clothing Store
13,Baiyun,1,Fast Food Restaurant,Hotel,Metro Station,Pizza Place,Women's Store,Farmers Market,Food & Drink Shop,Food,Flea Market,Dongbei Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
128,Yuexiu,1,Hotel,Snack Place,Dessert Shop,Nightclub,Plaza,Pizza Place,Metro Station,Fast Food Restaurant,Cantonese Restaurant,Soup Place
129,Yuexiu,1,Vietnamese Restaurant,Shopping Mall,Food,Asian Restaurant,Clothing Store,Chinese Restaurant,Fast Food Restaurant,Women's Store,Farmers Market,Food & Drink Shop
130,Yuexiu,1,Indian Restaurant,Fried Chicken Joint,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market,Fast Food Restaurant
140,Zengcheng,1,Coffee Shop,Seafood Restaurant,Bus Station,Women's Store,French Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant


### Cluster 3

In this cluster, opening a **Chinese restaurant** is most recommended, as the most common venue is a Chinese restaurant for most of the neighbourhoods in this cluster. **Department stores and women's stores** are also recommended.

In [30]:
Guangzhou_merged.loc[Guangzhou_merged['Labels'] == 2, Guangzhou_merged.columns[[1] + list(range(5, Guangzhou_merged.shape[1]))]]


Unnamed: 0,Borough,Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Baiyun,2,Chinese Restaurant,Department Store,Karaoke Bar,Fried Chicken Joint,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
9,Baiyun,2,Chinese Restaurant,Women's Store,Department Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
18,Haizhu,2,Chinese Restaurant,Hotel,Fast Food Restaurant,Women's Store,French Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant
21,Haizhu,2,Chinese Restaurant,Department Store,Cantonese Restaurant,Spa,Discount Store,Dog Run,Dongbei Restaurant,Diner,Dessert Shop,Farmers Market
23,Haizhu,2,Chinese Restaurant,Fast Food Restaurant,Metro Station,Coffee Shop,Women's Store,Food Court,Food & Drink Shop,Food,Flea Market,Dumpling Restaurant
30,Haizhu,2,Chinese Restaurant,Metro Station,Women's Store,Fried Chicken Joint,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
40,Huangpu,2,Chinese Restaurant,Movie Theater,Women's Store,Dessert Shop,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
52,Liwan,2,Chinese Restaurant,Fast Food Restaurant,Big Box Store,Women's Store,Fried Chicken Joint,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant
62,Liwan,2,Hotpot Restaurant,Chinese Restaurant,Department Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
92,Tianhe,2,Chinese Restaurant,Women's Store,Department Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
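
The claim that Chinese restaurants lead in most neighbourhoods of this cluster can be checked by counting the "1st Most Common Venue" column. The sketch below does this for the ten rows displayed in the table above, typed in by hand; with the dataframe available, the same count could be taken from the column directly.

```python
from collections import Counter

# "1st Most Common Venue" for the ten rows shown in the Cluster 3 table.
first_venues = [
    "Chinese Restaurant",  # row 2
    "Chinese Restaurant",  # row 9
    "Chinese Restaurant",  # row 18
    "Chinese Restaurant",  # row 21
    "Chinese Restaurant",  # row 23
    "Chinese Restaurant",  # row 30
    "Chinese Restaurant",  # row 40
    "Chinese Restaurant",  # row 52
    "Hotpot Restaurant",   # row 62
    "Chinese Restaurant",  # row 92
]

tally = Counter(first_venues)
top_venue, top_count = tally.most_common(1)[0]
share = top_count / len(first_venues)

print(top_venue, top_count, share)  # Chinese Restaurant 9 0.9
```

The same tally applied to the other clusters gives a quick, comparable summary for each description in this section.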


### Cluster 4

In this cluster, opening a **hotel** is most recommended, as the most common venue is a hotel for most of the neighbourhoods in this cluster. From the result, the neighbourhoods in this cluster are likely to be tourist areas. Starting up a tourism-related business is recommended.

In [31]:
Guangzhou_merged.loc[Guangzhou_merged['Labels'] == 3, Guangzhou_merged.columns[[1] + list(range(5, Guangzhou_merged.shape[1]))]]


Unnamed: 0,Borough,Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
26,Haizhu,3,Hotel,Women's Store,Fried Chicken Joint,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
38,Huadu,3,Convenience Store,Hotel,Fried Chicken Joint,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
50,Liwan,3,Convenience Store,Hotel,Fast Food Restaurant,Noodle House,French Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant
69,Nansha,3,Hotel,Port,Women's Store,French Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant
77,Panyu,3,Hotel,Metro Station,Women's Store,Farmers Market,Food Court,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant,Dumpling Restaurant
87,Panyu,3,Convenience Store,Hotel,Fried Chicken Joint,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
131,Yuexiu,3,Hotel,Pizza Place,Park,Fast Food Restaurant,Women's Store,Food Court,Dim Sum Restaurant,Diner,Discount Store,Dog Run
134,Conghua,3,Hotel,Hot Spring,Women's Store,French Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant


### Cluster 5

In this cluster, opening a **restaurant** is most recommended, as most of the common venues are related to dining. Among them, **Cantonese food** is the most popular.

In [32]:
Guangzhou_merged.loc[Guangzhou_merged['Labels'] == 4, Guangzhou_merged.columns[[1] + list(range(5, Guangzhou_merged.shape[1]))]]


Unnamed: 0,Borough,Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
56,Liwan,4,Cantonese Restaurant,Women's Store,Department Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
61,Liwan,4,Cantonese Restaurant,Chinese Restaurant,Women's Store,Dessert Shop,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
79,Panyu,4,Chinese Restaurant,Cantonese Restaurant,Bus Station,Women's Store,French Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant
90,Tianhe,4,Cantonese Restaurant,Bus Station,Women's Store,Dessert Shop,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
143,Zengcheng,4,Cantonese Restaurant,Women's Store,Department Store,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Dongbei Restaurant,Dumpling Restaurant,Farmers Market
