### IBM Applied Data Science Capstone Project

# Segmenting and Clustering Neighborhoods in London
In this notebook, neighborhoods in the city of London are explored, segmented, and clustered. The neighborhood data comes from a Wikipedia page that lists the areas of London and contains all the information needed for this analysis. The page is scraped, and the data is wrangled, cleaned, and read into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we analyze it to decide where we would recommend opening a Japanese restaurant.

# Methodology

Install the required packages.

In [3]:
!pip install arcgis
!pip install wikipedia
!conda install -c conda-forge geopy --yes
!pip install geocoder
!pip install folium
print('Libraries Installed.')

Libraries Installed.


### Importing the required packages.

In [4]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import geocoder # to get coordinates
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

import wikipedia as wp

from arcgis.geocoding import geocode
from arcgis.gis import GIS
gis = GIS()
print('Libraries imported.')

Libraries imported.


The task is to explore the city and plot a map showing the neighborhoods under consideration, then build a model that clusters similar neighborhoods together, and finally plot a new map with the clustered neighborhoods.

## 1. Web-scrape and Explore Dataset
### Exploring London City

### Neighborhoods of London

Collect the data needed for our business solution from Wikipedia.

### Data Collection

In [12]:
# Get the HTML source of the Wikipedia page
html = wp.page("List of areas of London").html().encode("UTF-8")
# Parse the tables on the page; index 1 is the table that lists the areas
df = pd.read_html(html, flavor='html5lib')[1]
df.head()

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728
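
The index `[1]` used above depends on the order of the tables on the Wikipedia page, which may change over time. A minimal sketch of a more defensive selection follows. The helper name `pick_table` and the stand-in tables are hypothetical and not part of the original notebook. The helper picks the first table whose header contains the expected column names:

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names.

    Hypothetical helper: `tables` is a list of DataFrames, such as the
    list returned by pd.read_html.
    """
    required = set(required_columns)
    for table in tables:
        # Column labels are not always strings (e.g. integers), so cast first.
        if required.issubset({str(c) for c in table.columns}):
            return table
    raise ValueError(f"No table contains columns {sorted(required)}")

# Stand-in for the list returned by pd.read_html (illustrative data only).
tables = [
    pd.DataFrame({0: ["navigation box"]}),
    pd.DataFrame({"Location": ["Abbey Wood", "Acton"],
                  "Post town": ["LONDON", "LONDON"]}),
]

areas = pick_table(tables, ["Location", "Post town"])
print(areas)
```

If the scraped headers contain unusual whitespace, the required names would need to match them exactly, so it may be safer to check for a single distinctive column such as `Location`.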


### Data Preprocessing

In [6]:
df.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
df.head()

Unnamed: 0,Location,London borough,Post_town,Postcode district,Dial code,OS_grid_ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728
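
In the output above, some headers (for example `London borough`) still contain a space after the rename. One possible cause, stated here as an assumption that the notebook does not confirm, is that the scraped headers contain non-breaking spaces (`\xa0`), which `replace(" ", "_")` does not match. A minimal sketch that treats any run of whitespace as a separator, using a hypothetical helper `normalize_column`:

```python
import re
import pandas as pd

def normalize_column(name):
    """Trim a header and replace each run of whitespace with one underscore."""
    # In Python 3, \s matches Unicode whitespace, including U+00A0.
    return re.sub(r"\s+", "_", str(name).strip())

# Illustrative frame: the first header uses a non-breaking space (assumed case).
demo = pd.DataFrame(columns=["London\xa0borough", "Post town", " OS grid ref "])
demo = demo.rename(columns=normalize_column)
print(list(demo.columns))
```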


### Feature Selection
Keep only the relevant columns (borough, post town, and postcode district) for the following steps.

In [7]:
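# Drop the location, dial code and OS grid reference columns
df1 = df.drop([df.columns[0], df.columns[4], df.columns[5]], axis=1)
df1.columns = ['borough', 'town', 'post_code']
# Strip trailing footnote markers such as "[8]" from the borough names
df1['borough'] = df1['borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))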
df1.head()

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Croydon,CROYDON,CR0
3,Croydon,CROYDON,CR0
4,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
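
The chained `rstrip` calls remove a trailing marker such as `[8]`, but they leave a trailing space when the marker is preceded by one (as in `Bexley, Greenwich [7]`), and they only handle a marker at the very end of the string. A hedged alternative, assuming footnotes always take the form of a number in square brackets, uses a regular expression:

```python
import pandas as pd

# Sample values taken from the table shown earlier in this notebook.
boroughs = pd.Series([
    "Bexley, Greenwich [7]",
    "Ealing, Hammersmith and Fulham[8]",
    "Croydon[8]",
    "Bexley",
])

# Remove every "[digits]" marker plus any whitespace before it, then trim.
cleaned = boroughs.str.replace(r"\s*\[\d+\]", "", regex=True).str.strip()
print(cleaned.tolist())
```

This is a sketch, not the notebook's method. It would need adjusting if the page used other footnote styles, such as lettered notes.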


Dimension of the dataframe

In [8]:
df1.shape

(531, 3)

The data currently has 531 records and 3 columns. Next comes the feature engineering step: keep only the rows whose post town contains LONDON.

In [9]:
df1 = df1[df1['town'].str.contains('LONDON')]
df1

Unnamed: 0,borough,town,post_code
0,"Bexley, Greenwich",LONDON,SE2
1,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
6,City,LONDON,EC3
7,Westminster,LONDON,WC2
9,Bromley,LONDON,SE20
10,Islington,LONDON,"EC1, N1"
12,Islington,LONDON,N19
14,Barnet,"BARNET, LONDON","EN5, NW7"
15,Enfield,LONDON,"N11, N14"
16,Wandsworth,LONDON,SW12
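
`str.contains('LONDON')` performs a substring match, which is why rows with a combined post town such as `BARNET, LONDON` are kept. A small sketch shows how this differs from an exact comparison. The rows are based on the tables above, except that the missing post town in the last row is invented for illustration. `na=False` is an added safeguard that is not in the original cell; it treats missing values as non-matches.

```python
import pandas as pd

sample = pd.DataFrame({
    "borough": ["Bexley, Greenwich", "Croydon", "Barnet", "Bexley"],
    "town": ["LONDON", "CROYDON", "BARNET, LONDON", None],
})

# Substring match: keeps "LONDON" and "BARNET, LONDON".
contains_london = sample[sample["town"].str.contains("LONDON", na=False)]

# Exact match: keeps only rows whose post town is exactly "LONDON".
exactly_london = sample[sample["town"] == "LONDON"]

print(contains_london["borough"].tolist())
print(exactly_london["borough"].tolist())
```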


In [10]:
df1.shape

(308, 3)

Only 308 rows remain, so we can proceed with the next steps, starting with some descriptive statistics.
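
Descriptive statistics for a frame shaped like `df1` can be sketched as follows. The rows below are copied from the filtered output above and are only a small sample, not the full 308-row data, so the numbers they produce are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "borough": ["City", "Westminster", "Bromley", "Islington", "Islington"],
    "town": ["LONDON"] * 5,
    "post_code": ["EC3", "WC2", "SE20", "EC1, N1", "N19"],
})

# For text columns, describe() reports count, unique, top and freq.
summary = sample.describe(include="all")
print(summary)

# Number of listed areas per borough, most frequent first.
areas_per_borough = sample["borough"].value_counts()
print(areas_per_borough)
```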

# References:
* [London Areas Wiki](https://en.wikipedia.org/wiki/List_of_areas_of_London)
* [Foursquare API](https://foursquare.com/)
* [ArcGIS API](https://www.arcgis.com/index.html)

<hr>

### Thank You