# Battle of the Neighbourhoods
## Exploring the London Data Set - Method 1

In this workbook I create a location data set for the London boroughs and postcode areas. I plan to do this with two methods and select the better one. I follow the methods used for the New York and Toronto data sets in the course notebooks, modifying them for the different source files and city.

The list of boroughs was pulled from Wikipedia (https://en.wikipedia.org/wiki/List_of_areas_of_London) and Beautiful Soup was used to construct a data frame. This table contained borough names and postcodes.

The coordinates were then obtained with the ArcGIS geocoder and merged in to create a single data frame with all the postcode, borough, latitude and longitude data.

The data was cleaned within the data frame, and I then plotted the points on a map of London using Folium as a quality check.
This method had trouble getting coordinates for all the postcodes and left numerous NaN values further down the data frame. A possible workaround was to reduce the data frame to fewer rows rather than use all 300+ postcodes.

A second method used a different web page, and the data from that method was used in the final notebook.


## Part 1 - Obtain the Borough and postcode data from website

The data was obtained from the Wikipedia page, and Beautiful Soup was used to scrape the borough and postcode information into a data frame.

In [1]:
# import libraries
import json
import time

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import lxml  # parser backend used by BeautifulSoup
import xlrd
import requests
from bs4 import BeautifulSoup
from pandas import json_normalize  # pandas >= 1.0; formerly pandas.io.json.json_normalize

import folium  # map rendering
import geocoder  # provides the ArcGIS geocoder used below
from geopy.geocoders import Nominatim

In [2]:
# Scrape the Wikipedia page that lists the areas of London
source = requests.get('https://en.wikipedia.org/wiki/List_of_areas_of_London').text
soup = BeautifulSoup(source, 'lxml')
soup.encode("utf-8-sig")

b'\xef\xbb\xbf<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n<head>\n<meta charset="utf-8-sig"/>\n<title>List of areas of London - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"1c785a77-979a-4acb-a31e-d3d77cb9f1bc","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_areas_of_London","wgTitle":"List of areas of London","wgCurRevisionId":987192367,"wgRevisionId":987192367,"wgArticleId":11915713,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Use dmy dates from August 2015","U

In [3]:
# Pull the data from the second table on the page for the data frame construction
table = soup.find_all('table')[1]  # grab the second table
    
Location = []
LondonBorough = []
PostalTown = []
PostCode = []

for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 0:
        Location.append(cells[0].text.rstrip('\n'))
        LondonBorough.append(cells[1].text.rstrip('\n'))
        PostalTown.append(cells[2].text.rstrip('\n'))
        PostCode.append(cells[3].text.rstrip('\n'))        

### Finding Correct Table
Care had to be taken in the scraping to ensure that the second table on the page, which holds the area data, was the one captured.
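Selecting the table by position breaks if the page layout changes. As a minimal sketch, the table could instead be chosen by the column headers it contains. The HTML snippet, the header names and the helper `find_table_by_headers` below are all hypothetical stand-ins, not taken from the live Wikipedia page.

```python
from bs4 import BeautifulSoup

# Tiny stand-in page: the first table is navigation, the second holds the data.
html = """
<table><tr><th>Navigation</th></tr><tr><td>ignore me</td></tr></table>
<table>
  <tr><th>Location</th><th>London borough</th><th>Post town</th><th>Postcode district</th></tr>
  <tr><td>Abbey Wood</td><td>Bexley, Greenwich</td><td>LONDON</td><td>SE2</td></tr>
</table>
"""

def find_table_by_headers(soup, required_headers):
    """Return the first table whose header cells include every required header (hypothetical helper)."""
    for candidate in soup.find_all('table'):
        headers = {th.get_text(strip=True) for th in candidate.find_all('th')}
        if required_headers <= headers:  # subset test
            return candidate
    return None

demo_soup = BeautifulSoup(html, 'html.parser')
demo_table = find_table_by_headers(demo_soup, {'Location', 'Postcode district'})
first_row = [td.get_text(strip=True) for td in demo_table.find_all('td')]
print(first_row)
```

The helper returns `None` when no table matches, so a changed page fails loudly instead of silently scraping the wrong table.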

In [4]:
# Build the data frame from the scraped columns (avoid shadowing the built-in name dict)
data = {'Location' : Location,
        'London Borough' : LondonBorough,
        'Postal Town' : PostalTown,
        'Post Code' : PostCode
       }
df_lon = pd.DataFrame.from_dict(data)
df_lon.head(20)

Unnamed: 0,Location,London Borough,Postal Town,Post Code
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4"
2,Addington,Croydon[8],CROYDON,CR0
3,Addiscombe,Croydon[8],CROYDON,CR0
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
5,Aldborough Hatch,Redbridge[9],ILFORD,IG2
6,Aldgate,City[10],LONDON,EC3
7,Aldwych,Westminster[10],LONDON,WC2
8,Alperton,Brent[11],WEMBLEY,HA0
9,Anerley,Bromley[11],LONDON,SE20


In [5]:
# Strip trailing citation markers such as "[7]" or "[note 1]" from the borough names
df_lon['London Borough'] = df_lon['London Borough'].map(lambda x: x.rstrip(']'))
df_lon['London Borough'] = df_lon['London Borough'].map(lambda x: x.rstrip('1234567890.'))
df_lon['London Borough'] = df_lon['London Borough'].str.replace('note','')
df_lon['London Borough'] = df_lon['London Borough'].map(lambda x: x.rstrip(' ['))
df_lon.head()

Unnamed: 0,Location,London Borough,Postal Town,Post Code
0,Abbey Wood,"Bexley, Greenwich",LONDON,SE2
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Addington,Croydon,CROYDON,CR0
3,Addiscombe,Croydon,CROYDON,CR0
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14"
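
The chained `rstrip` calls in cell In [5] could alternatively be written as a single regular expression. This is only a sketch on a small stand-in frame (`demo` is hypothetical); it assumes every unwanted marker is enclosed in square brackets.

```python
import pandas as pd

demo = pd.DataFrame({'London Borough': ['Bexley, Greenwich [7]',
                                        'Ealing, Hammersmith and Fulham[8]',
                                        'Bexley']})

# Remove any bracketed marker such as "[7]" or "[note 1]", then trim whitespace.
demo['London Borough'] = (demo['London Borough']
                          .str.replace(r'\s*\[[^\]]*\]', '', regex=True)
                          .str.strip())
cleaned = demo['London Borough'].tolist()
print(cleaned)
```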


In [6]:
df_lon.shape

(532, 4)

In [7]:
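# Split multi-postcode cells into one row per postcode (the original index label repeats)
postcodes = df_lon['Post Code'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).rename('Postcode')
df_lon = df_lon.drop('Post Code', axis=1).join(postcodes)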

In [8]:
df_lon.shape

(637, 4)
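
The row count grows from 532 to 637 because each area with several postcodes becomes several rows. The same idea can be sketched with `DataFrame.explode` (pandas 0.25 or later) on a small stand-in frame; `demo` and `exploded` are hypothetical names. The sketch also strips the space left behind by splitting on a bare comma.

```python
import pandas as pd

demo = pd.DataFrame({'Location': ['Abbey Wood', 'Acton'],
                     'Post Code': ['SE2', 'W3, W4']})

# Turn each comma-separated cell into a list, then give every list item its own row.
exploded = (demo.assign(Postcode=demo['Post Code'].str.split(','))
                .explode('Postcode')
                .drop(columns='Post Code'))
exploded['Postcode'] = exploded['Postcode'].str.strip()
print(exploded)
```

Note that the index label of the original row is repeated for every postcode it produced (here `0, 1, 1`), so the index is no longer unique after this step.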

In [9]:
# Keep it to London postcodes
df_lon = df_lon[df_lon['Postal Town'].str.contains('LONDON')]

In [10]:
df_lon.shape

(381, 4)

In [11]:
df_lon.head()

Unnamed: 0,Location,London Borough,Postal Town,Postcode
0,Abbey Wood,"Bexley, Greenwich",LONDON,SE2
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,W3
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,W4
6,Aldgate,City,LONDON,EC3
7,Aldwych,Westminster,LONDON,WC2


In [12]:
df_lon.to_csv('LocationWikiData_London.csv', index = False)

## Part 2 - Obtain the Latitude and Longitude for the table and merge to a single data frame

At first the points would not plot. The Latitude and Longitude columns turned out to have the object data type; once they were converted to floats, the points plotted correctly with Folium.
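
The conversion itself is not shown in this notebook. One way it might be done is with `pd.to_numeric`, sketched here on a stand-in frame (`demo` is hypothetical); `errors='coerce'` is an assumption that unparseable entries should become NaN rather than raise.

```python
import pandas as pd

demo = pd.DataFrame({'Latitude': ['51.49245', '51.51324', 'n/a'],
                     'Longitude': ['0.12127', '-0.26746', '']})

# Convert text columns to floats; anything that cannot be parsed becomes NaN.
for column in ['Latitude', 'Longitude']:
    demo[column] = pd.to_numeric(demo[column], errors='coerce')
print(demo.dtypes)
```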

In [13]:
def get_latlng(postal_code):
    """Return [latitude, longitude] for a London postcode using the ArcGIS geocoder."""
    # Initialise the coordinates to None
    lat_lng_coords = None

    # Keep querying until the geocoder returns coordinates for this postcode
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, London, United Kingdom'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords
# Geocoder ends here
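
The `while` loop above never exits if the service keeps returning nothing for a postcode. A bounded retry is one possible alternative, sketched below. The helper `geocode_with_retry`, its parameters and the stand-in `fake_lookup` are hypothetical; in the notebook the lookup would be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time

def geocode_with_retry(query, lookup, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # give up so the caller can decide how to handle the gap

# Stand-in lookup that fails once and then succeeds (no network access needed).
calls = {'count': 0}
def fake_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 2 else [51.48747, 0.02795]

result = geocode_with_retry('SE7, London, United Kingdom', fake_lookup)
missing = geocode_with_retry('unknown', lambda q: None, max_attempts=2)
print(result, missing)
```

Returning `None` after the last attempt means downstream code has to handle missing coordinates explicitly, for example by filtering them out before building the coordinates data frame.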

In [14]:
sample2 = get_latlng('SE7')
sample2

[51.48747000000003, 0.02795000000003256]

In [15]:
gg = geocoder.geocodefarm(sample2, method = 'reverse')
gg

<[OK] Geocodefarm - Reverse [395 Woolwich Road, Charlton, SE7 7AL, United Kingdom]>

In [16]:
df_lon['Postcode']

0        SE2
1         W3
1         W4
6        EC3
7        WC2
9       SE20
10       EC1
10        N1
12       N19
14       EN5
14       NW7
15       N11
15       N14
16      SW12
17       SE1
18       EC1
18       EC2
22      SW13
24       NW7
24       EN5
26        N1
27      SW11
28        W2
29       BR3
29      SE20
30        E6
30       E16
30      IG11
34        W4
35       SW1
36       SE6
39       NW3
41       SE1
43        E2
45       DA6
45       DA7
45       SE2
49       EC4
50       SE3
51       SE3
51      SE12
52       E14
54       WC1
56       N11
56       N22
57        E3
58       N22
60       NW2
60       NW4
61      NW10
63       SW2
63       SW9
63       SE5
64       SE4
66        E3
68       SW3
69       NW6
70       N11
73       NW4
74       SE5
75        E2
76       NW1
77       E14
78       E11
79       E16
80        N1
82      SW13
84       SE6
86       NW1
87       WC2
88       SE7
91       SW3
94       NW2
95        W1
96      SE12
97        E4
99        W4

In [28]:
postal_codes = df_lon['Postcode']    
coordinates = [get_latlng(postal_code) for postal_code in postal_codes.tolist()]


In [29]:
# Check the list; it doesn't seem to be fully populating all the postcodes
print(coordinates)

[[51.492450000000076, 0.12127000000003818], [51.51324000000005, -0.2674599999999714], [51.48944000000006, -0.26193999999992457], [51.51200000000006, -0.08057999999994081], [51.51651000000004, -0.11967999999995982], [51.41009000000008, -0.05682999999993399], [51.523610000000076, -0.09876999999994496], [51.52969000000007, -0.08696999999995114], [51.56393000000003, -0.12944999999996298], [51.64441500000004, -0.1791830389999518], [51.615680000000054, -0.2451099999999542], [51.616310000000055, -0.1383899999999585], [51.63429000000008, -0.13365999999996347], [51.44822000000005, -0.1483899999999494], [51.499960000000044, -0.09567999999995891], [51.523610000000076, -0.09876999999994496], [51.518410000000074, -0.08814999999992779], [51.47469000000007, -0.24162999999992962], [51.615680000000054, -0.2451099999999542], [51.64441500000004, -0.1791830389999518], [51.52969000000007, -0.08696999999995114], [51.46760000000006, -0.16289999999997917], [51.51494000000008, -0.1804799999999318], [51.4150950

In [30]:
# Convert the list of coordinates into a data frame with named columns,
# then add the Latitude and Longitude columns to df_lon

df_coordinates = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])
df_lon['Latitude'] = df_coordinates['Latitude']
df_lon['Longitude'] = df_coordinates['Longitude']
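
A note on this step: assigning a pandas Series to a data frame column matches rows by index label, not by position. `df_coordinates` has a fresh index 0, 1, 2, ..., while `df_lon` keeps the repeated and filtered labels from the earlier split and filter (0, 1, 1, 6, 7, ...). This may explain the NaN values and misplaced points seen later, although the notebook did not confirm it. The stand-in frames below (`places`, `coords`) are hypothetical and only illustrate the behaviour.

```python
import pandas as pd

# Frame with a non-contiguous, repeated index, similar in shape to df_lon.
places = pd.DataFrame({'Postcode': ['SE2', 'W3', 'W4', 'EC3']}, index=[0, 1, 1, 6])
coords = pd.DataFrame({'Latitude': [51.49, 51.51, 51.48, 51.512]})  # index 0..3

# Label-based assignment: rows are matched by index label.
by_label = places.copy()
by_label['Latitude'] = coords['Latitude']

# Position-based assignment: the values are copied row by row.
by_position = places.copy()
by_position['Latitude'] = coords['Latitude'].to_numpy()

print(by_label)
print(by_position)
```

In `by_label`, both rows labelled 1 receive the same value and the row labelled 6 gets NaN because `coords` has no label 6. Resetting the index of the target frame first, or assigning the raw values as in `by_position`, keeps each postcode paired with its own coordinates.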

In [31]:
print(df_coordinates)

      Latitude  Longitude
0    51.492450   0.121270
1    51.513240  -0.267460
2    51.489440  -0.261940
3    51.512000  -0.080580
4    51.516510  -0.119680
5    51.410090  -0.056830
6    51.523610  -0.098770
7    51.529690  -0.086970
8    51.563930  -0.129450
9    51.644415  -0.179183
10   51.615680  -0.245110
11   51.616310  -0.138390
12   51.634290  -0.133660
13   51.448220  -0.148390
14   51.499960  -0.095680
15   51.523610  -0.098770
16   51.518410  -0.088150
17   51.474690  -0.241630
18   51.615680  -0.245110
19   51.644415  -0.179183
20   51.529690  -0.086970
21   51.467600  -0.162900
22   51.514940  -0.180480
23   51.415095  -0.035403
24   51.410090  -0.056830
25   51.532920   0.054610
26   51.509130   0.015280
27   51.533120   0.084077
28   51.489440  -0.261940
29   51.497140  -0.138290
30   51.437220  -0.018680
31   51.555060  -0.173480
32   51.499960  -0.095680
33   51.526690  -0.062570
34   51.506420  -0.127210
35   51.470520   0.146705
36   51.492450   0.121270
37   51.5138

In [32]:
df_lon.to_csv('LocationsCoordinates_London.csv', index = False)

In [33]:
df_lon.shape

(381, 6)

In [34]:
df_lon.head(50)

Unnamed: 0,Location,London Borough,Postal Town,Postcode,Latitude,Longitude
0,Abbey Wood,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,W3,51.51324,-0.26746
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,W4,51.51324,-0.26746
6,Aldgate,City,LONDON,EC3,51.52361,-0.09877
7,Aldwych,Westminster,LONDON,WC2,51.52969,-0.08697
9,Anerley,Bromley,LONDON,SE20,51.644415,-0.179183
10,Angel,Islington,LONDON,EC1,51.61568,-0.24511
10,Angel,Islington,LONDON,N1,51.61568,-0.24511
12,Archway,Islington,LONDON,N19,51.63429,-0.13366
14,Arkley,Barnet,"BARNET, LONDON",EN5,51.49996,-0.09568


In [35]:
df_lon.head(381)

Unnamed: 0,Location,London Borough,Postal Town,Postcode,Latitude,Longitude
0,Abbey Wood,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,W3,51.51324,-0.26746
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,W4,51.51324,-0.26746
6,Aldgate,City,LONDON,EC3,51.52361,-0.09877
7,Aldwych,Westminster,LONDON,WC2,51.52969,-0.08697
9,Anerley,Bromley,LONDON,SE20,51.644415,-0.179183
10,Angel,Islington,LONDON,EC1,51.61568,-0.24511
10,Angel,Islington,LONDON,N1,51.61568,-0.24511
12,Archway,Islington,LONDON,N19,51.63429,-0.13366
14,Arkley,Barnet,"BARNET, LONDON",EN5,51.49996,-0.09568


## Check what the city plots as

## Looking at the geocoder output, only the first 200 or so rows received coordinates and the rest are NaN, so for now I will drop the NaN rows and add the Folium map

In [36]:
# Keep only the rows that have coordinates
df_lon_new = df_lon.dropna()
df_lon_new.shape

(270, 6)

In [37]:
df_lon.head(20)

Unnamed: 0,Location,London Borough,Postal Town,Postcode,Latitude,Longitude
0,Abbey Wood,"Bexley, Greenwich",LONDON,SE2,51.49245,0.12127
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,W3,51.51324,-0.26746
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,W4,51.51324,-0.26746
6,Aldgate,City,LONDON,EC3,51.52361,-0.09877
7,Aldwych,Westminster,LONDON,WC2,51.52969,-0.08697
9,Anerley,Bromley,LONDON,SE20,51.644415,-0.179183
10,Angel,Islington,LONDON,EC1,51.61568,-0.24511
10,Angel,Islington,LONDON,N1,51.61568,-0.24511
12,Archway,Islington,LONDON,N19,51.63429,-0.13366
14,Arkley,Barnet,"BARNET, LONDON",EN5,51.49996,-0.09568


## Part 3 - Take the created dataframe and plot it on folium as a quality control check.

At first it was not plotting: errors were occurring due to NaN values being found. It was discovered that the get_latlng function was not filling all the table entries, and the number of postcodes would need to be reduced for this to be more effective.
For the sake of this workbook I dropped the NaN rows and replotted with Folium below.
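
Before dropping rows it can help to count how many coordinates are missing, and to restrict the drop to the coordinate columns so that gaps in other columns do not remove rows. A minimal sketch on a stand-in frame (`demo`, `missing_per_column` and `plottable` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({'Postcode': ['SE2', 'W3', 'N22'],
                     'Latitude': [51.49245, 51.51324, None],
                     'Longitude': [0.12127, -0.26746, None]})

# Count the missing values in each coordinate column.
missing_per_column = demo[['Latitude', 'Longitude']].isna().sum()

# Drop only the rows that lack a coordinate.
plottable = demo.dropna(subset=['Latitude', 'Longitude'])
print(missing_per_column)
print(plottable.shape)
```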

In [38]:
import folium
# Latitude and longitude of London, UK, taken from the web (change for another city): 51.509865, -0.118092
Lat = 51.509865
Long = -0.118092
London_map = folium.Map(location=[Lat, Long], zoom_start=10)

for lat, lng, borough, neighbourhood in zip(df_lon_new['Latitude'], df_lon_new['Longitude'], df_lon_new['London Borough'], df_lon_new['Location']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='blue', fill=True).add_to(London_map)

London_map


## Issues with this Map

Looking at the coordinates above, some are wrong: for example, "Hammersmith and Fulham" is located in the south-east near Dartford on the map.
The full list of postcodes did not receive coordinate data, with numerous rows left as NaN after the get_latlng function.
Rather than investigate the cause further, I used a list of boroughs on a wiki page that already had coordinates; this is collated in another notebook.