# Coursera Applied Data Science Capstone Project

### Project Title: Where to start up a business in Guangzhou?

#### Author: Ziqing Xu

#### Date: Oct 29

**Project Description:** The goal of this project is to segment and cluster the neighbourhoods of Guangzhou by exploring and comparing them. By analyzing the resulting clusters, we can recommend the most suitable location for starting a specific type of business in Guangzhou.

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [82]:
# library to handle data in a vectorized manner
import numpy as np 
# library for data analysis
import pandas as pd 
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
# library to handle JSON files
import json 
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 
#  library to handle requests
import requests 
# transform JSON file into a pandas dataframe
# (pandas.io.json.json_normalize is deprecated since pandas 1.0)
from pandas import json_normalize
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
# map rendering library
import folium
# library for data scraping
from bs4 import BeautifulSoup
# csv library
import csv

print('Libraries imported.')

Libraries imported.


## 1. Download and Pre-process Dataset

### 1.1 Get the dataset by web scraping

In [83]:
# make a GET request; the response body is the HTML of the page
res = requests.get('https://en.wikipedia.org/wiki/List_of_township-level_divisions_of_Guangdong')
html = res.text
soup = BeautifulSoup(html,'html.parser')

In [84]:
Guangzhou_Borough = soup.find_all(class_='mw-headline')[1:13]
for i, d in enumerate(Guangzhou_Borough):
    Guangzhou_Borough[i] = d.text.split()[0]
number_of_Borough = len(Guangzhou_Borough)
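The headline extraction above can be tried offline on a small fragment. The following is a minimal sketch, assuming a hypothetical HTML snippet (`sample_html`) that only imitates the `mw-headline` spans of the Wikipedia page; the real page layout may differ and can change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the Wikipedia page.
sample_html = """
<h2><span class="mw-headline">Guangzhou</span></h2>
<h3><span class="mw-headline">Baiyun District</span></h3>
<h3><span class="mw-headline">Haizhu District</span></h3>
"""

soup = BeautifulSoup(sample_html, "html.parser")
headlines = soup.find_all(class_="mw-headline")

# Skip the city-level heading, then keep the first word of each district name.
boroughs = [h.text.split()[0] for h in headlines[1:]]
print(boroughs)
```

Because the slice positions depend on the page structure, it is worth printing the extracted names once and checking them against the page before relying on them.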

In [85]:
raw_Neighborhoods = soup.find_all('ul')[20:37] # Baiyun to Zengcheng
i = 0
# number of <ul> blocks that belong to each borough, in page order
blocks_for_borough = [2,1,1,1,1,1,2,2,1,1,2,2]
# one empty list per borough ([[]*n] would create only a single inner list)
Neighborhoods = [[] for _ in range(number_of_Borough)]
for j in range(number_of_Borough):
    neigh = raw_Neighborhoods[i:i+blocks_for_borough[j]]
    i += blocks_for_borough[j] # update i
    for n in neigh:
        for m in n.text.split(','):
            # keep only the first word of each entry
            Neighborhoods[j].append(m.split()[0])
    #print(Guangzhou_Borough[j],Neighborhoods[j]) #uncomment it for testing

# Xinhua should be added in Huadu
Neighborhoods[Guangzhou_Borough.index('Huadu')] += ['Xinhua']
# Jiulong should be added in Luogang
Neighborhoods[Guangzhou_Borough.index('Luogang')] += ['Jiulong']

#uncomment the following two lines for checking
#for b, n in zip(Guangzhou_Borough,Neighborhoods):
#    print(b,n) 
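The cell above walks through a flat list of `<ul>` blocks and hands each borough a fixed number of consecutive blocks (the 12 counts in `blocks_for_borough` sum to 17, matching the 17 blocks in the slice `[20:37]`). The same grouping idea can be sketched in isolation; `group_by_counts` and the sample data are hypothetical names used only for illustration.

```python
from itertools import islice

def group_by_counts(items, counts):
    """Split a flat list into consecutive groups whose sizes are given by counts."""
    if sum(counts) != len(items):
        raise ValueError("counts must add up to the number of items")
    iterator = iter(items)
    return [list(islice(iterator, n)) for n in counts]

# Hypothetical stand-ins for the <ul> blocks and the per-borough block counts.
blocks = ["a1", "a2", "b1", "c1", "c2"]
counts = [2, 1, 2]

groups = group_by_counts(blocks, counts)
print(groups)  # [['a1', 'a2'], ['b1'], ['c1', 'c2']]
```

The explicit length check makes a mismatch between the counts and the scraped blocks fail loudly instead of silently shifting neighbourhoods into the wrong borough.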

### 1.2 Write the scraped data to a CSV file

In [86]:
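# newline='' avoids blank rows on Windows, as recommended by the csv module docs
with open('Guangzhou_Neighborhood.csv','w', newline='', encoding='utf-8') as csvfile: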
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csvwriter.writerow(['City','Borough','Neighbourhood'])
    for i, b in enumerate(Guangzhou_Borough):
        for n in Neighborhoods[i]:
            csvwriter.writerow(['Guangzhou',b,n])
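To check the file layout without touching the disk, the same write-then-read round trip can be sketched with an in-memory buffer. This is an illustrative example only: the two sample rows are taken from the table shown later, and `buffer`, `rows`, and `records` are names introduced here.

```python
import csv
import io

# Two sample (borough, neighbourhood) pairs used only for illustration.
rows = [("Baiyun", "Jingtai"), ("Baiyun", "Songzhou")]

# An in-memory text buffer plays the role of the CSV file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(["City", "Borough", "Neighbourhood"])
for borough, neighbourhood in rows:
    writer.writerow(["Guangzhou", borough, neighbourhood])

# Rewind and read the rows back as dictionaries keyed by the header.
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["Neighbourhood"])  # Jingtai
```

Reading the data back with `csv.DictReader` confirms that the header row and the three columns line up as expected.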

### 1.3 Load the CSV data into a pandas DataFrame

In [88]:
Guangzhou_df = pd.read_csv('Guangzhou_Neighborhood.csv')
Guangzhou_df.head(10)

| | City | Borough | Neighbourhood |
|---|---|---|---|
| 0 | Guangzhou | Baiyun | Jingtai |
| 1 | Guangzhou | Baiyun | Songzhou |
| 2 | Guangzhou | Baiyun | Tongde |
| 3 | Guangzhou | Baiyun | Huangshi |
| 4 | Guangzhou | Baiyun | Tangjing |
| 5 | Guangzhou | Baiyun | Xinshi |
| 6 | Guangzhou | Baiyun | Sanyuanli |
| 7 | Guangzhou | Baiyun | Tonghe |
| 8 | Guangzhou | Baiyun | Jingxi |
| 9 | Guangzhou | Baiyun | Yongping |


### 1.4 Use the geopy library to get the geospatial data of Guangzhou

In [92]:
# create the geocoder once and reuse it for every query
geolocator = Nominatim(user_agent="guangzhou-explorer")

# collect the located neighbourhoods here (DataFrame.append was removed in pandas 2.0)
rows = []

for i in range(Guangzhou_df.shape[0]):
    borough = Guangzhou_df.loc[i,'Borough']
    neighbourhood = Guangzhou_df.loc[i,'Neighbourhood']

    # look up the coordinates, first with the borough in the query
    coordinate = geolocator.geocode("{},{},Guangzhou,China".format(neighbourhood,borough))

    # try one more search without the borough
    if coordinate is None:
        coordinate = geolocator.geocode("{},Guangzhou,China".format(neighbourhood))

    # skip the neighbourhoods that Nominatim cannot locate
    if coordinate is None:
        print("The geospatial data of {} in {} is not available!".format(neighbourhood,borough))
    else:
        rows.append({'City': 'Guangzhou',
                     'Borough': borough,
                     'Neighbourhood': neighbourhood,
                     'Latitude': coordinate.latitude,
                     'Longitude': coordinate.longitude})

Guangzhou_data = pd.DataFrame(rows, columns = ['City','Borough','Neighbourhood','Latitude','Longitude'])

Guangzhou_data.head(10)

The geospatial data of Songzhou in Baiyun is not available!
The geospatial data of Nanhuaxi in Haizhu is not available!
The geospatial data of Longfeng in Haizhu is not available!
The geospatial data of Fengyang in Haizhu is not available!
The geospatial data of Jiangnanzhong in Haizhu is not available!
The geospatial data of Zhangzhou in Huangpu is not available!
The geospatial data of Suidong in Huangpu is not available!
The geospatial data of Lilian in Huangpu is not available!
The geospatial data of Jinhua in Liwan is not available!
The geospatial data of Changhua in Liwan is not available!
The geospatial data of Hailong in Liwan is not available!
The geospatial data of Xiagang in Luogang is not available!
The geospatial data of Dalong in Panyu is not available!
The geospatial data of Shilou in Panyu is not available!
The geospatial data of Lanhe in Panyu is not available!
The geospatial data of Hongqiao in Yuexiu is not available!
The geospatial data of Dadong in Yuexiu is not available!

| | City | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Guangzhou | Baiyun | Jingtai | 23.171167 | 113.260877 |
| 1 | Guangzhou | Baiyun | Tongde | 23.166263 | 113.229654 |
| 2 | Guangzhou | Baiyun | Huangshi | 23.205192 | 113.260667 |
| 3 | Guangzhou | Baiyun | Tangjing | 23.175695 | 113.248646 |
| 4 | Guangzhou | Baiyun | Xinshi | 23.187983 | 113.255349 |
| 5 | Guangzhou | Baiyun | Sanyuanli | 23.16141 | 113.251742 |
| 6 | Guangzhou | Baiyun | Tonghe | 23.199603 | 113.320919 |
| 7 | Guangzhou | Baiyun | Jingxi | 23.187745 | 113.320533 |
| 8 | Guangzhou | Baiyun | Yongping | 23.240796 | 113.28416 |
| 9 | Guangzhou | Baiyun | Junhe | 23.258613 | 113.285623 |
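The lookup in section 1.4 tries a specific query first and falls back to a broader one. That pattern can be sketched on its own, without network access, by passing the geocoding function in as a parameter. Here `geocode_with_fallback` and `fake_geocode` are hypothetical helpers, and the lookup table inside `fake_geocode` is made up for the demonstration; it does not reflect which query actually succeeded for any neighbourhood.

```python
def geocode_with_fallback(geocode, neighbourhood, borough,
                          city="Guangzhou", country="China"):
    """Try the most specific query first, then retry without the borough."""
    queries = [
        "{},{},{},{}".format(neighbourhood, borough, city, country),
        "{},{},{}".format(neighbourhood, city, country),
    ]
    for query in queries:
        result = geocode(query)
        if result is not None:
            return result
    # Neither query produced a match.
    return None

# Hypothetical stand-in for a geocoder: it only knows one borough-free query.
def fake_geocode(query):
    known = {"Jingtai,Guangzhou,China": (23.171167, 113.260877)}
    return known.get(query)

found = geocode_with_fallback(fake_geocode, "Jingtai", "Baiyun")
missing = geocode_with_fallback(fake_geocode, "Songzhou", "Baiyun")
print(found)    # (23.171167, 113.260877)
print(missing)  # None
```

With geopy, the `geocode` argument could be `geolocator.geocode`, optionally wrapped in `geopy.extra.rate_limiter.RateLimiter` with `min_delay_seconds=1` so that a long loop stays within the one-request-per-second limit of the public Nominatim service.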


## 2. Explore and Cluster the neighbourhoods in Guangzhou