In [1]:
import pandas as pd

# 1. Introduction/Business Understanding

## 1.1 Discussion of the background

Shanghai is one of the four municipalities of the People's Republic of China. It is located on the southern estuary of the Yangtze, with the Huangpu River flowing through it. With a population of 24.28 million as of 2019, it is the most populous urban area in China and the second most populous city proper in the world. Greater Shanghai is a global center for finance, technology, innovation and transportation and the Port of Shanghai is the world's busiest container port.

Shanghai has been described as the "showpiece" of the booming economy of China. Featuring several architecture styles such as Art Deco and shikumen, the city is renowned for its Lujiazui skyline, museums and historic buildings—including the City God Temple, Yu Garden, the China Pavilion and buildings along the Bund. Shanghai is also known for its sugary cuisine, distinctive dialect and vibrant international flair. Every year, the city hosts numerous national and international events, including Shanghai Fashion Week, the Chinese Grand Prix and ChinaJoy.

## 1.2 Description of the problem

In the current uncertain economic and financial scenario, there is an urgent need for machine learning tools that help homebuyers in Shanghai make wise and effective decisions. The business problem we pose is therefore: how can we support homebuyers in purchasing suitable real estate in Shanghai under these conditions?

To solve this business problem, we are going to cluster Shanghai neighborhoods in order to recommend venues, together with the current average real estate price, where homebuyers can invest. We will recommend profitable venues according to the amenities and essential facilities surrounding them, such as elementary schools, high schools, hospitals and grocery stores.

# 2. Data Requirements

### For this project we need the following data:

#### Shanghai data containing the list of districts (wards) along with their latitude and longitude
Data source: https://en.wikipedia.org/wiki/List_of_administrative_divisions_of_Shanghai

Description: We will scrape the table of Shanghai districts (wards) from Wikipedia and get the coordinates of these 16 major districts using the Nominatim geocoder of the Geopy client.

#### Facilities in each neighborhood of Shanghai
Data source: Foursquare APIs

Description: Using this API we will get all the venues in each neighborhood. We can then filter these venues to keep facilities such as restaurants, elementary schools, high schools, hospitals and grocery stores.
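
The filtering step described above can be sketched in a few lines. This is only an illustration: the record layout (`district`, `name`, `category`) and the exact category labels are assumptions, not the Foursquare response format, and the sample records are made up.

```python
from collections import Counter

# Facility categories named in the text; the exact labels used by the API are an assumption
FACILITY_CATEGORIES = {
    "Restaurant", "Elementary School", "High School", "Hospital", "Grocery Store",
}


def count_facilities(venues):
    """Count venues per (district, category), keeping only the facility categories."""
    counts = Counter()
    for venue in venues:
        if venue["category"] in FACILITY_CATEGORIES:
            counts[(venue["district"], venue["category"])] += 1
    return counts


# Hypothetical venue records, shaped like rows of a flattened API response
sample_venues = [
    {"district": "Huangpu District", "name": "Venue 1", "category": "Hospital"},
    {"district": "Huangpu District", "name": "Venue 2", "category": "Hospital"},
    {"district": "Huangpu District", "name": "Venue 3", "category": "Karaoke Bar"},
    {"district": "Xuhui District", "name": "Venue 4", "category": "Grocery Store"},
]
facility_counts = count_facilities(sample_venues)
```

The resulting counts per district and category are the kind of feature table that a later clustering step could consume.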

# 3. Methodology

The Methodology section describes the main components of our analysis and prediction system. It comprises four stages:
1. Collect Data
2. Explore and Understand Data
3. Data preparation and preprocessing
4. Modeling
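
As a rough sketch of what the modeling stage could look like, the snippet below clusters districts by their amenity counts. The text only says that neighborhoods will be clustered; the choice of k-means, the value of `k`, and the amenity counts are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means. Returns (labels, centroids); the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical amenity counts per district: (schools, hospitals, grocery stores)
amenities = {
    "District A": (12, 3, 20),
    "District B": (2, 1, 4),
    "District C": (11, 4, 18),
    "District D": (3, 0, 5),
}
labels, centroids = kmeans(list(amenities.values()), k=2)
clusters = dict(zip(amenities, labels))
```

In practice a library implementation with randomized restarts would be preferable; the hand-written version is used here only to keep the example dependency-free.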

## 3.1 Data Preparation

### 3.1.1 Scraping the Shanghai Districts Table from Wikipedia
I first use the Wikipedia list of Shanghai districts, scraping its table to create a dataframe. For this, I used pandas to transform the table on the Wikipedia page into a dataframe containing the name, area and population of the 16 districts (wards) of Shanghai. We start as below:

In [2]:
df=pd.read_html( 'https://en.wikipedia.org/wiki/List_of_administrative_divisions_of_Shanghai')[3]

In [3]:
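# Drop the two header rows and the first column, then renumber the rows from 0
df=df.drop([0,1])
df=df.drop(columns=[0])
df=df.reset_index(drop=True)
# Use .loc instead of chained indexing (df[1][0]=...) to avoid SettingWithCopyWarning
df.loc[0, 1]="Huangpu District"
df.columns = ['name','chinese','hanyu pinyin',
                     'postcode','division code','Area (km²)','Population (2018 census)','Density (/km²)']
df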


| | name | chinese | hanyu pinyin | postcode | division code | Area (km²) | Population (2018 census) | Density (/km²) |
|---|---|---|---|---|---|---|---|---|
| 0 | Huangpu District | 黄浦区 | Huángpǔ Qū | 310101 | HGP | 20.46 | 653800.0 | 31955.0 |
| 1 | Xuhui District | 徐汇区 | Xúhuì Qū | 310104 | XHI | 54.76 | 1084400.0 | 19803.0 |
| 2 | Changning District | 长宁区 | Chángníng Qū | 310105 | CNQ | 38.3 | 694000.0 | 18120.0 |
| 3 | Jing'an District | 静安区 | Jìng'ān Qū | 310106 | JAQ | 36.88 | 1062800.0 | 28818.0 |
| 4 | Putuo District | 普陀区 | Pǔtuó Qū | 310107 | PTQ | 54.83 | 1281900.0 | 23380.0 |
| 5 | Hongkou District | 虹口区 | Hóngkǒu Qū | 310109 | HKQ | 23.48 | 797000.0 | 33944.0 |
| 6 | Yangpu District | 杨浦区 | Yángpǔ Qū | 310110 | YPU | 60.73 | 1312700.0 | 21615.0 |
| 7 | Pudong New Area | 浦东新区 | Pǔdōng Xīnqū | 310115 | PDX | 1210.41 | 5550200.0 | 4585.0 |
| 8 | Minhang District | 闵行区 | Mǐnháng Qū | 310112 | MHQ | 370.75 | 2543500.0 | 6860.0 |
| 9 | Baoshan District | 宝山区 | Bǎoshān Qū | 310113 | BAO | 270.99 | 2042300.0 | 7536.0 |


### 3.1.2 Getting Coordinates of Major Districts: Geopy Client

The next objective is to get the coordinates of these 16 major districts using the Nominatim geocoder of the Geopy client, as follows:

In [4]:
import os # Operating System
import numpy as np
import pandas as pd
import datetime as dt # Datetime
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0: from pandas import json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Libraries imported.
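
A minimal sketch of the district geocoding step, assuming each district can be resolved by querying its name followed by ", Shanghai, China". The helper names, the query suffix, and the user agent string are hypothetical; only `Nominatim(user_agent=...)` and its `geocode` method come from geopy.

```python
import time


def geocode_districts(names, geocode, suffix=", Shanghai, China", delay_seconds=0.0):
    """Return {name: (latitude, longitude)}; unresolved districts map to None.

    `geocode` is any callable that takes an address string and returns an object
    with `.latitude` and `.longitude` attributes, or None when nothing is found.
    """
    coordinates = {}
    for name in names:
        location = geocode(name + suffix)
        coordinates[name] = (location.latitude, location.longitude) if location else None
        time.sleep(delay_seconds)  # pause between requests to stay within rate limits
    return coordinates


def make_nominatim_geocoder(user_agent="shanghai_explorer"):
    """Build a geopy Nominatim geocode function (the user agent is a placeholder)."""
    from geopy.geocoders import Nominatim  # imported lazily; requires the geopy package
    return Nominatim(user_agent=user_agent).geocode
```

With the dataframe built earlier, a call such as `coords = geocode_districts(df['name'], make_nominatim_geocoder(), delay_seconds=1)` would return one coordinate pair per district. The one-second delay is a conservative choice for the public Nominatim service, and districts that come back as `None` need to be checked by hand.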


In [8]:
CLIENT_ID = '0UPGDZNNTAMW1GDZTYSN0HBQTDAUYM2BSY1W1SPEHWLPMPIA'
CLIENT_SECRET = 'XD42P425SMIJ22X2HNBAJUMLEP3VPWFA5VMBFKIADLECI402'
VERSION = '20180604'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)

Your credentials:
CLIENT_ID: 0UPGDZNNTAMW1GDZTYSN0HBQTDAUYM2BSY1W1SPEHWLPMPIA
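
Hard-coding API credentials in a notebook exposes them to anyone who can read the file. One common alternative, sketched here, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper name are assumptions chosen for this example.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from environment variables (names are assumptions)."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a clear message instead of sending an unauthenticated request
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

The cell above could then start with `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`, with the two variables exported in the shell before the notebook is launched.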


In [9]:
LIMIT = 100 # Maximum is 100
PIZZA_CATEGORY_ID = "4bf58dd8d48988d1ca941735" # PIZZA PLACE CATEGORY ID
cities = ["New York, NY", 'Chicago, IL', 'San Francisco, CA', 'Jersey City, NJ', 'Boston, MA']
results = {}
for city in cities:
    # Pass the query via `params` so that requests URL-encodes values such as "New York, NY"
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'near': city,
        'limit': LIMIT,
        'categoryId': PIZZA_CATEGORY_ID,
    }
    results[city] = requests.get('https://api.foursquare.com/v2/venues/explore', params=params).json()

In [10]:
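df_venues={}
for city in cities:
    # Flatten the nested JSON items into a table with one row per venue
    venues = json_normalize(results[city]['response']['groups'][0]['items'])
    # Keep name, address and coordinates; .copy() avoids renaming columns on a slice of `venues`
    df_venues[city] = venues[['venue.name', 'venue.location.address', 'venue.location.lat', 'venue.location.lng']].copy()
    df_venues[city].columns = ['Name', 'Address', 'Lat', 'Lng']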

In [11]:
maps = {}
for city in cities:
    city_lat = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lat'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lat']])
    city_lng = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lng'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lng']])
    maps[city] = folium.Map(location=[city_lat, city_lng], zoom_start=11)

    # add markers to map
    for lat, lng, label in zip(df_venues[city]['Lat'], df_venues[city]['Lng'], df_venues[city]['Name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(maps[city])  
    print(f"Total number of pizza places in {city} = ", results[city]['response']['totalResults'])
    print("Showing Top 100")

Total number of pizza places in New York, NY =  284
Showing Top 100
Total number of pizza places in Chicago, IL =  219
Showing Top 100
Total number of pizza places in San Francisco, CA =  167
Showing Top 100
Total number of pizza places in Jersey City, NJ =  131
Showing Top 100
Total number of pizza places in Boston, MA =  186
Showing Top 100


In [12]:
maps[cities[0]]

In [13]:
maps[cities[1]]

In [14]:
maps[cities[2]]

In [15]:
maps[cities[3]]

In [16]:
maps[cities[4]]

In [None]:
maps = {}
for city in cities:
    city_lat = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lat'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lat']])
    city_lng = np.mean([results[city]['response']['geocode']['geometry']['bounds']['ne']['lng'],
                        results[city]['response']['geocode']['geometry']['bounds']['sw']['lng']])
    maps[city] = folium.Map(location=[city_lat, city_lng], zoom_start=11)
    venues_mean_coor = [df_venues[city]['Lat'].mean(), df_venues[city]['Lng'].mean()] 
    # add markers to map
    for lat, lng, label in zip(df_venues[city]['Lat'], df_venues[city]['Lng'], df_venues[city]['Name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(maps[city])
        folium.PolyLine([venues_mean_coor, [lat, lng]], color="green", weight=1.5, opacity=0.5).add_to(maps[city])
    
    label = folium.Popup("Mean Co-ordinate", parse_html=True)
    folium.CircleMarker(
        venues_mean_coor,
        radius=10,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(maps[city])

    print(city)
    print("Mean Distance from Mean coordinates")
    print(np.mean(np.apply_along_axis(lambda x: np.linalg.norm(x - venues_mean_coor),1,df_venues[city][['Lat','Lng']].values)))

New York, NY
Mean Distance from Mean coordinates
0.023022796176729497
Chicago, IL
Mean Distance from Mean coordinates
0.058123374185776955
San Francisco, CA
Mean Distance from Mean coordinates
0.028732195910738036
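
The values printed above are Euclidean norms computed directly on latitude and longitude, so they are expressed in degrees, not in kilometres. Because a degree of longitude covers less ground at higher latitudes, they are also not strictly comparable between cities. A possible refinement, sketched here with the haversine formula on a spherical Earth, reports the mean distance in kilometres. The function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the spherical model is an approximation


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def mean_distance_km(points):
    """Mean great-circle distance (km) from each (lat, lng) point to the mean coordinate."""
    mean_lat = sum(lat for lat, _ in points) / len(points)
    mean_lng = sum(lng for _, lng in points) / len(points)
    return sum(haversine_km(lat, lng, mean_lat, mean_lng) for lat, lng in points) / len(points)
```

For a city's venues this could be called as `mean_distance_km(list(zip(df_venues[city]['Lat'], df_venues[city]['Lng'])))`.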
