# Capstone Project: The Battle of Neighbourhoods (Week 1)

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

There are many restaurants in Tokyo, Japan.
Our customers run a sushi restaurant in Yokohama, Japan.
They want to open a sushi restaurant in Tokyo next year.
However, they do not know where in Tokyo to open it.
We analyze where the customers should open their sushi restaurant.

Many people in Japan travel by train.
As a result, many people eat near stations.
We recommend to the customers a station that has a large number of users, few sushi restaurants, and an atmosphere close to that of Yokohama.

The target audience is anyone who wants to open a sushi restaurant in Tokyo.

## Data <a name="data"></a>

This demonstration will make use of the following data sources:

#### List of stations in Tokyo
Data will be retrieved from Wikipedia: <a href='https://en.wikipedia.org/wiki/List_of_East_Japan_Railway_Company_stations'>List_of_East_Japan_Railway_Company_stations</a>.

This page lists the number of daily station entries for each JR East station, including the stations in Tokyo.


#### Location data for stations in Tokyo, retrieved with the geopy `Nominatim` geocoder
Station coordinates will be retrieved with geopy's `Nominatim` geocoder.

#### Venues near stations in Tokyo from the Foursquare API
(Foursquare website: www.foursquare.com)

I will use the Foursquare API to explore the neighborhoods around selected stations in Tokyo. The Foursquare explore function will be used to get the most common venue categories in each neighborhood, and these categories will then be used to group the neighborhoods into clusters. The following information is retrieved on the first query:
* Venue ID
* Venue Name
* Coordinates : Latitude and Longitude
* Category Name
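
The four fields listed above can be pulled out of a single item of an explore-style response roughly as follows. This is only a minimal sketch: the `sample_item` payload is invented for illustration (its ID and coordinates are made up), and `flatten_item` is a hypothetical helper, not part of the Foursquare API. It assumes the nested `venue` layout that the `getNearbyVenues` function later in this notebook relies on, plus an `id` key on the venue.

```python
# Hypothetical sample shaped like one item of a Foursquare "explore" response.
# All values are invented for illustration.
sample_item = {
    "venue": {
        "id": "example-venue-id",
        "name": "Example Sushi",
        "location": {"lat": 35.6900, "lng": 139.7000},
        "categories": [{"name": "Sushi Restaurant"}],
    }
}


def flatten_item(item):
    """Return the venue ID, name, coordinates, and first category name."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    return {
        "Venue ID": venue.get("id"),
        "Venue Name": venue["name"],
        "Latitude": venue["location"]["lat"],
        "Longitude": venue["location"]["lng"],
        # Some venues may carry no category; fall back to None in that case.
        "Category Name": categories[0]["name"] if categories else None,
    }


row = flatten_item(sample_item)
print(row)
```

Falling back to `None` for a missing category is a design choice of this sketch; it avoids an `IndexError` when a venue has an empty `categories` list.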

Importing Python Libraries
This section imports the Python libraries required for processing the data.
Although this first part of the notebook is for data acquisition, some of the libraries are also used for data visualization.

In [1]:
#### Importing Python Libraries

#!conda install -c conda-forge folium=0.5.0 --yes # comment/uncomment if not yet installed.
#!conda install -c conda-forge geopy --yes        # comment/uncomment if not yet installed

import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

# Show all columns and rows when displaying dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import seaborn as sns
# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library

import requests # library to handle requests
import lxml.html as lh
from bs4 import BeautifulSoup
import urllib.request

print('Libraries imported.')

Libraries imported.


### 1. List of stations in Tokyo with daily station entries using BeautifulSoup

In [2]:
# Download the HTML of the Wikipedia page
html = requests.get('https://en.wikipedia.org/wiki/List_of_East_Japan_Railway_Company_stations').text

In [3]:
soup = BeautifulSoup(html,'lxml')

In [4]:
stations = soup.find('table', class_ = 'wikitable sortable')
stations_rows = stations.find_all('tr')

In [5]:
info_list = []
for row in stations_rows :
    info_list.append([t.text.strip('\n') for t in row.find_all(['td', 'th'])])

info_list

[['Station',
  'JR East Lines',
  'Other Lines',
  'Code',
  'Daily Station Entries',
  'Source Year'],
 ['Abiko',
  '■ Jōban Line■ Jōban Line (Local)■ Jōban Line (Rapid)■ Narita Line',
  '',
  '',
  '29,989',
  '2011[1]'],
 ['Ageo', '■ Takasaki Line', '', '', '40,395', '2011[1]'],
 ['Aihara', '■ Yokohama Line', '', '', '10,099', '2011[1]'],
 ['Ajiki', '■ Narita Line', '', '', '3,137', '2011[1]'],
 ['Ajiro', '■ Itō Line', '', '', '913', '2011[1]'],
 ['Akabane',
  '■ Keihin-Tōhoku Line■ Saikyō Line■■ Shōnan-Shinjuku Line■ Takasaki Line■ Utsunomiya Line (Tōhoku Main Line)',
  '■ Saitama Rapid Railway Line (Akabane-iwabuchi) Tokyo Metro Namboku Line (Akabane-iwabuchi)',
  'ABN',
  '87,346',
  '2011[1]'],
 ['Akatsuka', '■ Jōban Line', '', '', '5,532', '2011[1]'],
 ['Akigawa', '■ Itsukaichi Line', '', '', '7,310', '2011[1]'],
 ['Akihabara',
  '■ Chūō-Sōbu Line■ Keihin-Tōhoku Line■ Yamanote Line',
  ' Toei Shinjuku Line (Iwamotochō) Tokyo Metro Hibiya LineTX Tsukuba Express',
  'AKB',
  '230

#### Create a pandas dataframe from the list of stations in Tokyo

In [6]:
df = pd.DataFrame(info_list[1:], columns=info_list[0])

df.head(10)

| | Station | JR East Lines | Other Lines | Code | Daily Station Entries | Source Year |
|---|---|---|---|---|---|---|
| 0 | Abiko | ■ Jōban Line■ Jōban Line (Local)■ Jōban Line (... | | | 29989 | 2011[1] |
| 1 | Ageo | ■ Takasaki Line | | | 40395 | 2011[1] |
| 2 | Aihara | ■ Yokohama Line | | | 10099 | 2011[1] |
| 3 | Ajiki | ■ Narita Line | | | 3137 | 2011[1] |
| 4 | Ajiro | ■ Itō Line | | | 913 | 2011[1] |
| 5 | Akabane | ■ Keihin-Tōhoku Line■ Saikyō Line■■ Shōnan-Shi... | ■ Saitama Rapid Railway Line (Akabane-iwabuchi... | ABN | 87346 | 2011[1] |
| 6 | Akatsuka | ■ Jōban Line | | | 5532 | 2011[1] |
| 7 | Akigawa | ■ Itsukaichi Line | | | 7310 | 2011[1] |
| 8 | Akihabara | ■ Chūō-Sōbu Line■ Keihin-Tōhoku Line■ Yamanote... | Toei Shinjuku Line (Iwamotochō) Tokyo Metro H... | AKB | 230689 | 2011[1] |
| 9 | Akishima | ■ Ōme Line | | | 25526 | 2011[1] |


#### Clean up the data and sort it by number of daily station entries

In [7]:
# Work on a copy so the column assignments below do not hit a slice of the original frame
df = df[['Station', 'Daily Station Entries']].copy()
# Remove thousands separators and footnote characters (',', 'a', '[' and ']')
df['Daily Station Entries'] = df['Daily Station Entries'].str.replace(r'[,a\[\]]', '', regex=True)
# Drop rows without a count and the summary row
df = df[df['Daily Station Entries'] != '']
df = df[df['Station'] != 'Total']
df['Daily Station Entries'] = df['Daily Station Entries'].astype('int')
df = df.sort_values('Daily Station Entries', ascending=False)


df.head(20)

| | Station | Daily Station Entries |
|---|---|---|
| 487 | Shinjuku | 734154 |
| 141 | Ikebukuro | 544762 |
| 470 | Shibuya | 402766 |
| 605 | Yokohama | 394900 |
| 551 | Tokyo | 380997 |
| 484 | Shinagawa | 323893 |
| 473 | Shimbashi | 243890 |
| 433 | Ōmiya | 235744 |
| 8 | Akihabara | 230689 |
| 528 | Takadanobaba | 199741 |
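
The cleanup step above can also be written more defensively. The following is a minimal sketch on invented toy rows, not the notebook's actual data: `clean_entries` is a hypothetical helper, and it assumes footnote markers always appear as a bracketed suffix such as `[a]`. Unlike the cell above, it removes the whole bracketed marker and lets `pd.to_numeric` turn blank strings into missing values instead of raising.

```python
import pandas as pd

# Toy rows that mimic the scraped strings; the values are invented.
raw = pd.DataFrame({
    "Station": ["Alpha", "Beta", "Gamma", "Total"],
    "Daily Station Entries": ["29,989", "1,234[a]", "", "31,223"],
})


def clean_entries(frame):
    """Convert entry strings to integers, dropping blank rows and the total row."""
    out = frame[["Station", "Daily Station Entries"]].copy()
    # Remove bracketed footnote markers such as "[a]", then thousands separators.
    text = out["Daily Station Entries"].str.replace(r"\[[^\]]*\]", "", regex=True)
    text = text.str.replace(",", "", regex=False)
    # Blank strings become NaN instead of raising an error.
    out["Daily Station Entries"] = pd.to_numeric(text, errors="coerce")
    out = out.dropna(subset=["Daily Station Entries"])
    out = out[out["Station"] != "Total"]
    out["Daily Station Entries"] = out["Daily Station Entries"].astype(int)
    return out.sort_values("Daily Station Entries", ascending=False).reset_index(drop=True)


cleaned = clean_entries(raw)
print(cleaned)
```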


#### Remove stations outside Tokyo, keep the top 10 stations, and append the Yokohama data

In [8]:
# Keep the Yokohama row aside so it can be appended after filtering
df_yokohama = df[df['Station'] == 'Yokohama']
# Drop Yokohama and the stations outside Tokyo, then keep the 10 busiest stations
df = df[~df['Station'].isin(['Yokohama', 'Ōmiya', 'Kawasaki', 'Funabashi'])]
df = df[0:10]
df = df.reset_index(drop=True)
df = pd.concat([df, df_yokohama], ignore_index=True)
df

| | Station | Daily Station Entries |
|---|---|---|
| 0 | Shinjuku | 734154 |
| 1 | Ikebukuro | 544762 |
| 2 | Shibuya | 402766 |
| 3 | Tokyo | 380997 |
| 4 | Shinagawa | 323893 |
| 5 | Shimbashi | 243890 |
| 6 | Akihabara | 230689 |
| 7 | Takadanobaba | 199741 |
| 8 | Kita-Senju | 194136 |
| 9 | Ueno | 174832 |


### 2. Get location data for the Tokyo stations and Yokohama station using the geopy `Nominatim` geocoder

In [9]:
df['Latitude'] = 0.0
df['Longitude'] = 0.0

# Create the geocoder once and reuse it for every station
geolocator = Nominatim(user_agent="TO_explorer")
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for idx, town in df['Station'].items():
    address = town + " station"
    location = geolocator.geocode(address)
    df.loc[idx, 'Latitude'] = location.latitude
    df.loc[idx, 'Longitude'] = location.longitude

In [12]:
# The geolocator does not return the correct latitude and longitude for Tokyo station, so set them manually
df.loc[df['Station'] == 'Tokyo', 'Latitude'] = 35.681236
df.loc[df['Station'] == 'Tokyo', 'Longitude'] = 139.767125

In [13]:
df

| | Station | Daily Station Entries | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Shinjuku | 734154 | 35.689596 | 139.700478 |
| 1 | Ikebukuro | 544762 | 35.730445 | 139.708519 |
| 2 | Shibuya | 402766 | 35.659391 | 139.701917 |
| 3 | Tokyo | 380997 | 35.681236 | 139.767125 |
| 4 | Shinagawa | 323893 | 35.629368 | 139.739273 |
| 5 | Shimbashi | 243890 | 35.666111 | 139.759721 |
| 6 | Akihabara | 230689 | 35.698557 | 139.773142 |
| 7 | Takadanobaba | 199741 | 35.71264 | 139.703874 |
| 8 | Kita-Senju | 194136 | 35.748916 | 139.804754 |
| 9 | Ueno | 174832 | 35.711964 | 139.777839 |
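
The manual correction for Tokyo station suggests one way to structure the lookup: check a table of known overrides first, and only then ask the geocoder. The sketch below illustrates that idea under stated assumptions. `resolve_coordinates` and `fake_geocode` are hypothetical names introduced here, and the fake geocoder stands in for `geolocator.geocode` so that the example runs without network access. The coordinates are the ones shown in the table above.

```python
from types import SimpleNamespace


def resolve_coordinates(stations, geocode, overrides=None):
    """Map each station name to (latitude, longitude), preferring manual overrides."""
    overrides = overrides or {}
    coords = {}
    for name in stations:
        if name in overrides:
            coords[name] = overrides[name]
            continue
        location = geocode(name + " station")
        # geopy geocoders return None when no match is found.
        if location is None:
            coords[name] = (None, None)
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords


def fake_geocode(query):
    """Stand-in for geolocator.geocode; knows a single station only."""
    known = {"Ueno station": SimpleNamespace(latitude=35.711964, longitude=139.777839)}
    return known.get(query)


coords = resolve_coordinates(
    ["Tokyo", "Ueno", "Nowhere"],
    fake_geocode,
    overrides={"Tokyo": (35.681236, 139.767125)},
)
print(coords)
```

In the notebook itself, `geolocator.geocode` could be passed in place of `fake_geocode`.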


### Generate a map with folium

In [14]:
geo = Nominatim(user_agent='My-IBMNotebook')
address = 'Tokyo'
location = geo.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Tokyo are {}, {}.'.format(latitude, longitude))

# create map of Tokyo using latitude and longitude values
map_tokyo = folium.Map(location=[latitude, longitude],tiles="OpenStreetMap", zoom_start=10)

# add markers to map
for lat, lng, town in zip(
    df['Latitude'],
    df['Longitude'],
    df['Station']):
    # use the station name as the popup label
    label = folium.Popup(town, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#87cefa',
        fill_opacity=0.5).add_to(map_tokyo)
map_tokyo

The geographical coordinates of Tokyo are 35.6828387, 139.7594549.


## Segmenting and Clustering Stations in Tokyo and Yokohama Station
### Retrieving Foursquare places of interest

Using the Foursquare API, the **explore** function is used to get the most common venue categories in each neighborhood, and these categories are then used to group the neighborhoods into clusters. The *k*-means clustering algorithm is used for the analysis.
Finally, the Folium library is used to visualize the recommended neighborhoods and their emerging clusters.

In this notebook, the function **getNearbyVenues** extracts the following information for the dataframe it generates:
* Station Name
* Station Coordinates : Latitude and Longitude
* Venue Name
* Venue Coordinates : Latitude and Longitude
* Venue Category Name
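
The grouping step described above (venue categories per station, then *k*-means) might look roughly like the following. This is a sketch under stated assumptions, not the notebook's final analysis: the station names `A`, `B`, `C` and their venues are invented, and the choice of two clusters is arbitrary. Each station is represented by the share of its venues in each category, and stations with similar shares end up in the same cluster.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue table; station names and categories are invented.
venues = pd.DataFrame({
    "Station": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Venue Category": [
        "Sushi Restaurant", "Coffee Shop", "Coffee Shop",
        "Sushi Restaurant", "Sushi Restaurant", "Coffee Shop",
        "Bar", "Bar", "Ramen Restaurant",
    ],
})

# One-hot encode the categories, then average per station
# to get the frequency of each category around that station.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Station", venues["Station"])
profile = onehot.groupby("Station").mean()

# Cluster the stations on their category-frequency profiles.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profile)
profile["Cluster"] = kmeans.labels_
print(profile)
```

In this toy data, stations `A` and `B` share the same two categories and `C` shares none with them, so `A` and `B` are expected to fall into one cluster and `C` into the other.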

In [18]:
# Define Foursquare Credentials and Version
LIMIT = 100

CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

CLIENT_ID: 
CLIENT_SECRET:


In [16]:
def getNearbyVenues(names, latitudes, longitudes, radius=200):
    """Return a dataframe with one row per venue found near each station."""
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Station', 
                  'Station Latitude', 
                  'Station Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [17]:
restorant_station = getNearbyVenues(names=df['Station'], latitudes=df['Latitude'], longitudes=df['Longitude'])
print(restorant_station.shape)
restorant_station.head()

(497, 7)


| | Station | Station Latitude | Station Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Shinjuku | 35.689596 | 139.700478 | Lumine the Yoshimoto (ルミネtheよしもと) | 35.68949 | 139.701115 | Comedy Club |
| 1 | Shinjuku | 35.689596 | 139.700478 | Verve Coffee Roasters | 35.688269 | 139.701198 | Coffee Shop |
| 2 | Shinjuku | 35.689596 | 139.700478 | SUSHI TOKYO TEN | 35.688184 | 139.700285 | Sushi Restaurant |
| 3 | Shinjuku | 35.689596 | 139.700478 | Sarutahiko Coffee & TiKiTaKa Ice Cream (猿田彦珈琲と... | 35.689201 | 139.699075 | Coffee Shop |
| 4 | Shinjuku | 35.689596 | 139.700478 | AKOMEYA TOKYO | 35.688908 | 139.702209 | Organic Grocery |
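
One of the criteria stated in the introduction is a station with many users but few sushi restaurants. A minimal sketch of how the `Station` and `Venue Category` columns from the table above could be combined with the daily entries to rank stations on that criterion is shown below. The station names, venues, and entry counts are invented toy values, and the ranking rule (fewest sushi restaurants first, then most entries) is an assumption made for illustration.

```python
import pandas as pd

# Toy venue rows; station names and categories are invented.
venues = pd.DataFrame({
    "Station": ["A", "A", "A", "B", "B", "C"],
    "Venue Category": [
        "Sushi Restaurant", "Coffee Shop", "Sushi Restaurant",
        "Coffee Shop", "Sushi Restaurant", "Bar",
    ],
})
# Toy daily entries per station; the numbers are invented.
entries = pd.Series({"A": 700000, "B": 500000, "C": 400000}, name="Daily Station Entries")

# Count sushi restaurants per station; stations with none get a count of 0.
is_sushi = venues["Venue Category"] == "Sushi Restaurant"
sushi_counts = (
    venues[is_sushi]
    .groupby("Station")
    .size()
    .reindex(entries.index, fill_value=0)
    .rename("Sushi Restaurants")
)

# Rank: fewest sushi restaurants first, ties broken by the most daily entries.
summary = pd.concat([entries, sushi_counts], axis=1)
summary = summary.sort_values(
    ["Sushi Restaurants", "Daily Station Entries"], ascending=[True, False]
)
print(summary)
```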


## Methodology <a name="methodology"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>