# Battle of the Neighbourhoods
## Exploring the London Data Set - Method 2

In this workbook I create a location data set for the London boroughs/postcode areas. I plan to do this with two methods and select the better one. I follow the methods used for the New York and Toronto data sets in the course notebooks, modifying them for the different source files and city.

The list of boroughs was pulled from Wikipedia (https://en.wikipedia.org/wiki/List_of_London_boroughs) and Beautiful Soup was used to construct a data frame. The table contains borough names and coordinates, so only a single data source had to be scraped to get this information.

The data was cleaned within the dataframe, and I then plotted the points on a map of London using Folium as a quality check.
This method worked much better than the other method I tried, so I will proceed with it for the final notebook.


## Step 1 - Import Libraries and the website data
All libraries used in this notebook are imported here in the first code cell.

In [1]:
# import libraries
import numpy as np
import time
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import lxml
import xlrd
from bs4 import BeautifulSoup 
import json 
import requests 
from pandas.io.json import json_normalize 
from geopy.geocoders import Nominatim 

import folium

In [2]:
# Web scrape the Wikipedia page listing the London boroughs
source = requests.get('https://en.wikipedia.org/wiki/List_of_London_boroughs').text
soup = BeautifulSoup(source, 'lxml')
soup.encode("utf-8-sig")


In [3]:
# start making the basis of the dataframe
BoroughName = []
Population = []
Coordinates = []

for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 0:
        BoroughName.append(cells[0].text.rstrip('\n'))
        Population.append(cells[7].text.rstrip('\n'))
        Coordinates.append(cells[8].text.rstrip('\n'))
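
The column positions 7 and 8 used above depend on the table layout at the time of scraping. As a minimal sketch, assuming the header cells start with the texts `Borough`, `Population` and `Co-ordinates`, the indices could be looked up from the header row instead. The sample HTML and the helper name `extract_columns` are hypothetical stand-ins, not taken from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the Wikipedia table.
SAMPLE_HTML = """
<table>
  <tr><th>Borough</th><th>Status</th><th>Population</th><th>Co-ordinates</th></tr>
  <tr><td>Barnet</td><td></td><td>369,088</td><td>51.6252; -0.1517</td></tr>
  <tr><td>Bexley</td><td></td><td>236,687</td><td>51.4549; 0.1505</td></tr>
</table>
"""


def extract_columns(html, wanted):
    """Return one dict per data row, keyed by the wanted header prefixes."""
    table = BeautifulSoup(html, "html.parser").find("table")
    rows = table.find_all("tr")
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    # Map each wanted prefix to the index of the first header starting with it.
    index = {w: next(i for i, h in enumerate(headers) if h.startswith(w))
             for w in wanted}
    records = []
    for row in rows[1:]:
        cells = row.find_all("td")
        if cells:  # skip rows without data cells
            records.append({w: cells[i].get_text(strip=True)
                            for w, i in index.items()})
    return records


records = extract_columns(SAMPLE_HTML, ["Borough", "Population", "Co-ordinates"])
print(records[0])
```

If a column is later added to or removed from the page, a lookup like this fails loudly on a missing header rather than silently reading the wrong cell.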

In [4]:
# Form a dataframe
data = {'BoroughName': BoroughName,
        'Population': Population,
        'Coordinates': Coordinates}
df_lon = pd.DataFrame.from_dict(data)
df_lon.head()

|  | BoroughName | Population | Coordinates |
|---|---|---|---|
| 0 | Barking and Dagenham [note 1] | 194352 | 51°33′39″N 0°09′21″E / 51.5607°N 0.1557°E /... |
| 1 | Barnet | 369088 | 51°37′31″N 0°09′06″W / 51.6252°N 0.1517°W /... |
| 2 | Bexley | 236687 | 51°27′18″N 0°09′02″E / 51.4549°N 0.1505°E /... |
| 3 | Brent | 317264 | 51°33′32″N 0°16′54″W / 51.5588°N 0.2817°W /... |
| 4 | Bromley | 317899 | 51°24′14″N 0°01′11″E / 51.4039°N 0.0198°E /... |


In [5]:
# Strip unwanted texts
df_lon['BoroughName'] = df_lon['BoroughName'].map(lambda x: x.rstrip(']'))
df_lon['BoroughName'] = df_lon['BoroughName'].map(lambda x: x.rstrip('1234567890.'))
df_lon['BoroughName'] = df_lon['BoroughName'].str.replace('note','')
df_lon['BoroughName'] = df_lon['BoroughName'].map(lambda x: x.rstrip(' ['))
df_lon.head()

|  | BoroughName | Population | Coordinates |
|---|---|---|---|
| 0 | Barking and Dagenham | 194352 | 51°33′39″N 0°09′21″E / 51.5607°N 0.1557°E /... |
| 1 | Barnet | 369088 | 51°37′31″N 0°09′06″W / 51.6252°N 0.1517°W /... |
| 2 | Bexley | 236687 | 51°27′18″N 0°09′02″E / 51.4549°N 0.1505°E /... |
| 3 | Brent | 317264 | 51°33′32″N 0°16′54″W / 51.5588°N 0.2817°W /... |
| 4 | Bromley | 317899 | 51°24′14″N 0°01′11″E / 51.4039°N 0.0198°E /... |
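
The chain of `rstrip` and `replace` calls in cell 5 could also be written as a single regular expression. The sketch below is an alternative, not the method used in this notebook. It assumes every footnote marker is a bracketed suffix such as `[note 1]`, and the sample names are illustrative.

```python
import pandas as pd

# Illustrative sample of scraped names, some carrying footnote markers.
names = pd.Series(["Barking and Dagenham [note 1]", "Barnet", "Greenwich [note 2]"])

# Remove any bracketed footnote such as "[note 1]" plus the whitespace before it.
cleaned = names.str.replace(r"\s*\[[^\]]*\]", "", regex=True).str.strip()
print(cleaned.tolist())
```

Unlike `str.replace('note', '')`, this pattern only touches text inside square brackets, so a borough name that happened to contain the letters "note" would be left alone.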


In [6]:
# Clean coordinates
df_lon[['Coordinates1','Coordinates2','Coordinates3']] = df_lon['Coordinates'].str.split('/',expand=True)
df_lon.head()

|  | BoroughName | Population | Coordinates | Coordinates1 | Coordinates2 | Coordinates3 |
|---|---|---|---|---|---|---|
| 0 | Barking and Dagenham | 194352 | 51°33′39″N 0°09′21″E / 51.5607°N 0.1557°E /... | 51°33′39″N 0°09′21″E | 51.5607°N 0.1557°E | 51.5607; 0.1557 (Barking and Dagenham) |
| 1 | Barnet | 369088 | 51°37′31″N 0°09′06″W / 51.6252°N 0.1517°W /... | 51°37′31″N 0°09′06″W | 51.6252°N 0.1517°W | 51.6252; -0.1517 (Barnet) |
| 2 | Bexley | 236687 | 51°27′18″N 0°09′02″E / 51.4549°N 0.1505°E /... | 51°27′18″N 0°09′02″E | 51.4549°N 0.1505°E | 51.4549; 0.1505 (Bexley) |
| 3 | Brent | 317264 | 51°33′32″N 0°16′54″W / 51.5588°N 0.2817°W /... | 51°33′32″N 0°16′54″W | 51.5588°N 0.2817°W | 51.5588; -0.2817 (Brent) |
| 4 | Bromley | 317899 | 51°24′14″N 0°01′11″E / 51.4039°N 0.0198°E /... | 51°24′14″N 0°01′11″E | 51.4039°N 0.0198°E | 51.4039; 0.0198 (Bromley) |


In [7]:
# Keep only the decimal "lat; lon" part and split it into two columns
df_lon.drop(labels=['Coordinates','Coordinates1','Coordinates2'], axis=1,inplace = True)
df_lon[['Latitude','Longitude']] = df_lon['Coordinates3'].str.split(';',expand=True)
df_lon.head()

|  | BoroughName | Population | Coordinates3 | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Barking and Dagenham | 194352 | 51.5607; 0.1557 (Barking and Dagenham) | 51.5607 | 0.1557 (Barking and Dagenham) |
| 1 | Barnet | 369088 | 51.6252; -0.1517 (Barnet) | 51.6252 | -0.1517 (Barnet) |
| 2 | Bexley | 236687 | 51.4549; 0.1505 (Bexley) | 51.4549 | 0.1505 (Bexley) |
| 3 | Brent | 317264 | 51.5588; -0.2817 (Brent) | 51.5588 | -0.2817 (Brent) |
| 4 | Bromley | 317899 | 51.4039; 0.0198 (Bromley) | 51.4039 | 0.0198 (Bromley) |


In [8]:
# Strip byte-order marks, the borough name in parentheses and thousands separators
df_lon.drop(labels=['Coordinates3'], axis=1,inplace = True)
df_lon['Latitude'] = df_lon['Latitude'].map(lambda x: x.rstrip(u'\ufeff'))
df_lon['Latitude'] = df_lon['Latitude'].map(lambda x: x.lstrip())
df_lon['Longitude'] = df_lon['Longitude'].map(lambda x: x.rstrip(')'))
df_lon['Longitude'] = df_lon['Longitude'].map(lambda x: x.rstrip('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '))
df_lon['Longitude'] = df_lon['Longitude'].map(lambda x: x.rstrip(' ('))
df_lon['Longitude'] = df_lon['Longitude'].map(lambda x: x.rstrip(u'\ufeff'))
df_lon['Longitude'] = df_lon['Longitude'].map(lambda x: x.lstrip())
df_lon['Population'] = df_lon['Population'].str.replace(',','')
df_lon.head()

|  | BoroughName | Population | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Barking and Dagenham | 194352 | 51.5607 | 0.1557 |
| 1 | Barnet | 369088 | 51.6252 | -0.1517 |
| 2 | Bexley | 236687 | 51.4549 | 0.1505 |
| 3 | Brent | 317264 | 51.5588 | -0.2817 |
| 4 | Bromley | 317899 | 51.4039 | 0.0198 |
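
Cells 6 to 8 reach the decimal coordinates through several splits and strips. As a hedged alternative sketch, the signed decimal pair could be captured in one step with `Series.str.extract`. The sample strings are reassembled from the split columns shown earlier, and the pattern assumes the decimal pair is always written as `lat; lon`.

```python
import pandas as pd

# Sample coordinate strings, including the invisible \ufeff characters.
raw = pd.Series([
    "51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N 0.1557°E\ufeff / "
    "51.5607; 0.1557\ufeff (Barking and Dagenham)",
    "51°37′31″N 0°09′06″W\ufeff / \ufeff51.6252°N 0.1517°W\ufeff / "
    "51.6252; -0.1517\ufeff (Barnet)",
])

# Capture the signed decimal pair "lat; lon" and convert both parts to float.
coords = raw.str.extract(r"(-?\d+\.\d+);\s*(-?\d+\.\d+)").astype(float)
coords.columns = ["Latitude", "Longitude"]
print(coords)
```

Because the capture groups match only digits, sign and decimal point, the byte-order marks and the borough name in parentheses never reach the result, and the columns are numeric from the start.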


In [9]:
df_lon['BoroughName'].unique()

array(['Barking and Dagenham', 'Barnet', 'Bexley', 'Brent', 'Bromley',
       'Camden', 'Croydon', 'Ealing', 'Enfield', 'Greenwich', 'Hackney',
       'Hammersmith and Fulham', 'Haringey', 'Harrow', 'Havering',
       'Hillingdon', 'Hounslow', 'Islington', 'Kensington and Chelsea',
       'Kingston upon Thames', 'Lambeth', 'Lewisham', 'Merton', 'Newham',
       'Redbridge', 'Richmond upon Thames', 'Southwark', 'Sutton',
       'Tower Hamlets', 'Waltham Forest', 'Wandsworth', 'Westminster'],
      dtype=object)

## Step 2 - Plot the dataframe with Folium as a quality control check

At first the points did not plot. The Latitude and Longitude columns turned out to be stored as the object data type; once they were converted to floats, Folium plotted the points correctly.

In [10]:
df_lon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   BoroughName  32 non-null     object
 1   Population   32 non-null     object
 2   Latitude     32 non-null     object
 3   Longitude    32 non-null     object
dtypes: object(4)
memory usage: 1.1+ KB


In [11]:
#Use geolocator for the coordinates
address = 'London, UK'

geolocator = Nominatim(user_agent="london_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London are 51.5073219, -0.1276474.


In [12]:
# The lat and long are stored in the dataframe as objects, not floats; change the data type to float
df_lon['Latitude']=df_lon['Latitude'].astype(float)
df_lon['Longitude']=df_lon['Longitude'].astype(float)
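
`astype(float)` raises an error on the first value it cannot parse. A minimal sketch of a more forgiving variant uses `pd.to_numeric` with `errors="coerce"`, which makes leftover scraping debris show up as `NaN` rows that can be inspected. The sample frame and its `"n/a"` value are made up for illustration.

```python
import pandas as pd

# Made-up sample: the last latitude is deliberately unparseable.
df = pd.DataFrame({"Latitude": ["51.5607", "51.6252", "n/a"],
                   "Longitude": ["0.1557", "-0.1517", "0.1505"]})

for col in ["Latitude", "Longitude"]:
    # errors="coerce" turns unparseable strings into NaN instead of raising.
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Rows where either coordinate failed to convert.
bad_rows = df[df[["Latitude", "Longitude"]].isna().any(axis=1)]
print(df.dtypes)
print(bad_rows)
```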

In [13]:
# London's latitude and longitude are 51.509865, -0.118092 (from the web); use these to centre the Folium map
Lat = 51.509865
Long = -0.118092
London_map = folium.Map(location=[Lat, Long], zoom_start=10)

for lat, lng, borough, population in zip(df_lon['Latitude'], df_lon['Longitude'], df_lon['BoroughName'], df_lon['Population']):
    label = '{}, {}'.format(population, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='blue', fill=True).add_to(London_map)

London_map
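
The visual check on the map could be complemented by a numeric one. The sketch below flags any row whose coordinates fall outside a bounding box. The box limits are an assumed, deliberately loose approximation of Greater London rather than official figures, and the sample frame with its swapped row is made up.

```python
import pandas as pd

# Assumed, deliberately loose bounding box around Greater London.
LAT_MIN, LAT_MAX = 51.2, 51.8
LON_MIN, LON_MAX = -0.6, 0.4

df = pd.DataFrame({
    "BoroughName": ["Barnet", "Bexley", "Bad row"],
    "Latitude": [51.6252, 51.4549, 0.1505],   # last row has lat/lon swapped
    "Longitude": [-0.1517, 0.1505, 51.4549],
})

# True where both coordinates lie inside the box.
inside = (df["Latitude"].between(LAT_MIN, LAT_MAX)
          & df["Longitude"].between(LON_MIN, LON_MAX))
outliers = df.loc[~inside, "BoroughName"].tolist()
print(outliers)
```

An empty list would suggest that no coordinates were swapped or mangled during cleaning.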

In [14]:
df_lon.to_csv('Data_London.csv', index = False)

## Workbook Summary
### This data scraping provided valid data to use going forward
The method used here gave reliable information for the London boroughs and will be used for the final notebook for London data.

