# Capstone Project - The Battle of Neighborhoods


### Data Section <br>
London is one of the most ethnically diverse cities in the world. At the 2011 census, London had a population of 8,173,941. Of this number, 44.9% were White British. 37% of the population were born outside the UK, including 24.5% born outside of Europe.<br> <br>
The demography of London is analysed by the Office for National Statistics, and data is produced for each of the Greater London wards, the City of London and the 32 London boroughs, the Inner London and Outer London statistical sub-regions, each of the Parliamentary constituencies in London, and for Greater London as a whole.<br><br>
For our fashion store problem, we will focus on the Boroughs of London and work on getting the data from all the Boroughs. There are 32 London Boroughs with a population of around 150,000 to 300,000.<br><br>
To solve our problem of finding the best location to open an Indian fashion store in London, we need datasets covering the following:
1.	List of **areas of London** available at: https://en.wikipedia.org/wiki/List_of_areas_of_London<br>
2.	The latitudes and longitudes of those areas, obtained with the help of the **geopy.geocoders** library in Python<br>
3.	Population of target audience in all the **boroughs of London** based on their **ethnicity** at **London Datastore**, which is a free and open data-sharing portal where anyone can access data relating to the city. The data is available in XLS and CSV format, which we can download and can use as-is for solving our problem. https://data.london.gov.uk/dataset/ethnic-groups-borough<br>


The cleansed data will then be used alongside **Foursquare** data, which is readily available. Foursquare location data will be leveraged to explore or compare **neighbourhoods around London**.<br><br>
Data Science Workflow:<br>
1.	Get the population in borough based on ethnicity<br>
2.	Clean the dataset and find the borough with the highest population of Asians.<br>
3.	Select the borough with the highest Asian population as the preferred location for the store<br>
4.	Get the list of all the boroughs with their latitudes and longitudes<br>
5.	Plot all the neighbourhoods on a map<br>
6.	Get all the neighbourhoods of the selected borough with their latitudes and longitudes<br>
7.	Visualize the neighbourhood of the borough<br>
8.	Explore all the neighbourhoods with the Foursquare API<br>
9.	Analyse each neighbourhood of the selected borough<br>
10.	Display the top ten venues of each neighbourhood<br>
11.	Compare the venues of each neighbourhood for cafés, Indian restaurants and parks<br>
12.	Display the selected neighbourhoods of the borough on a map<br>
13.	Outcome and conclusion


#### Importing required libraries

In [3]:
# Import libraries
import numpy as np # library to handle data in a vectorized manner
import json # library to handle JSON files
import pandas as pd

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; the pandas.io.json import path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

from bs4 import BeautifulSoup # library to parse HTML pages

# Import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Downloading the ethnic groups by borough dataset

In [132]:
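# Note: this URL is the dataset's landing page, so wget saves an HTML page
# (see "[text/html]" in the log below), not the spreadsheet itself.
# The direct file link has to be copied from that page.
!wget -O ethnic-groups-by-borough.xls https://data.london.gov.uk/dataset/ethnic-groups-borough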

--2020-06-19 19:04:29--  https://data.london.gov.uk/dataset/ethnic-groups-borough
Resolving data.london.gov.uk (data.london.gov.uk)... 99.86.109.72, 99.86.109.39, 99.86.109.57, ...
Connecting to data.london.gov.uk (data.london.gov.uk)|99.86.109.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘ethnic-groups-by-borough.xls’

    [ <=>                                   ] 138,203     --.-K/s   in 0.03s   

2020-06-19 19:04:29 (3.97 MB/s) - ‘ethnic-groups-by-borough.xls’ saved [138203]
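
The `[text/html]` in the log above suggests that the saved file may be a web page rather than a spreadsheet. A minimal sketch of a sanity check follows, assuming the file name used in the `wget` call. The helper `looks_like_excel` is hypothetical (not part of any library) and only inspects the first bytes of the file:

```python
from pathlib import Path

# Leading bytes ("magic numbers") of the two Excel container formats
XLS_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 compound file)
XLSX_MAGIC = b"PK\x03\x04"                     # .xlsx (ZIP archive)

def looks_like_excel(first_bytes: bytes) -> bool:
    """Return True if the bytes start like an .xls or .xlsx file."""
    return first_bytes.startswith(XLS_MAGIC) or first_bytes.startswith(XLSX_MAGIC)

# File name taken from the wget call; the check is skipped if the file is absent
path = Path("ethnic-groups-by-borough.xls")
if path.exists():
    with path.open("rb") as fh:
        print("Excel file:", looks_like_excel(fh.read(8)))
```

If the check prints `False`, the download is probably the HTML landing page and `pd.read_excel` is unlikely to read it.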



#### Accessing the dataset via Watson Studio

In [109]:
# The code was removed by Watson Studio for sharing.

#### Getting the data into a pandas DataFrame

In [110]:
df = pd.read_excel(streaming_body_3, sheet_name='2018', header=1)

In [111]:
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,White,Asian,Black,Mixed/ Other,Total,Unnamed: 7,White.1,Asian.1,Black.1,Mixed/ Other.1,Total.1
0,,,,,,,,,,,,,
1,E09000001,City of London,-,-,-,-,9000.0,,-,-,-,-,6000.0
2,E09000002,Barking and Dagenham,109000,54000,36000,15000,215000.0,,11000,8000,6000,4000,15000.0
3,E09000003,Barnet,250000,57000,30000,54000,390000.0,,22000,10000,7000,10000,27000.0
4,E09000004,Bexley,195000,17000,21000,15000,248000.0,,15000,5000,5000,4000,17000.0


### Clean the dataset

#### Renaming columns

In [112]:
df.rename(columns={"Unnamed: 0":"code","Unnamed: 1":"area"},inplace= True)
df.head()

Unnamed: 0,code,area,White,Asian,Black,Mixed/ Other,Total,Unnamed: 7,White.1,Asian.1,Black.1,Mixed/ Other.1,Total.1
0,,,,,,,,,,,,,
1,E09000001,City of London,-,-,-,-,9000.0,,-,-,-,-,6000.0
2,E09000002,Barking and Dagenham,109000,54000,36000,15000,215000.0,,11000,8000,6000,4000,15000.0
3,E09000003,Barnet,250000,57000,30000,54000,390000.0,,22000,10000,7000,10000,27000.0
4,E09000004,Bexley,195000,17000,21000,15000,248000.0,,15000,5000,5000,4000,17000.0


##### Dropping the first two rows as they are not important

In [113]:
df.drop(index=[0,1],inplace = True)
df.head()

Unnamed: 0,code,area,White,Asian,Black,Mixed/ Other,Total,Unnamed: 7,White.1,Asian.1,Black.1,Mixed/ Other.1,Total.1
2,E09000002,Barking and Dagenham,109000,54000,36000,15000,215000.0,,11000,8000,6000,4000,15000.0
3,E09000003,Barnet,250000,57000,30000,54000,390000.0,,22000,10000,7000,10000,27000.0
4,E09000004,Bexley,195000,17000,21000,15000,248000.0,,15000,5000,5000,4000,17000.0
5,E09000005,Brent,102000,107000,62000,56000,328000.0,,13000,13000,10000,9000,23000.0
6,E09000006,Bromley,267000,15000,21000,28000,330000.0,,21000,5000,6000,7000,24000.0


##### Dropping the empty column

In [114]:
df.drop(columns="Unnamed: 7", inplace=True)  # 'columns=' already implies axis=1
df.head()

Unnamed: 0,code,area,White,Asian,Black,Mixed/ Other,Total,White.1,Asian.1,Black.1,Mixed/ Other.1,Total.1
2,E09000002,Barking and Dagenham,109000,54000,36000,15000,215000.0,11000,8000,6000,4000,15000.0
3,E09000003,Barnet,250000,57000,30000,54000,390000.0,22000,10000,7000,10000,27000.0
4,E09000004,Bexley,195000,17000,21000,15000,248000.0,15000,5000,5000,4000,17000.0
5,E09000005,Brent,102000,107000,62000,56000,328000.0,13000,13000,10000,9000,23000.0
6,E09000006,Bromley,267000,15000,21000,28000,330000.0,21000,5000,6000,7000,24000.0


#### Dropping rows that have NaN data

In [115]:
df.dropna()  # returns a new DataFrame and leaves df unchanged; the in-place drop is done in In [118]
df.head()

Unnamed: 0,code,area,White,Asian,Black,Mixed/ Other,Total,White.1,Asian.1,Black.1,Mixed/ Other.1,Total.1
2,E09000002,Barking and Dagenham,109000,54000,36000,15000,215000.0,11000,8000,6000,4000,15000.0
3,E09000003,Barnet,250000,57000,30000,54000,390000.0,22000,10000,7000,10000,27000.0
4,E09000004,Bexley,195000,17000,21000,15000,248000.0,15000,5000,5000,4000,17000.0
5,E09000005,Brent,102000,107000,62000,56000,328000.0,13000,13000,10000,9000,23000.0
6,E09000006,Bromley,267000,15000,21000,28000,330000.0,21000,5000,6000,7000,24000.0
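
`DataFrame.dropna()` returns a new frame by default, so a bare call does not change `df`. A minimal sketch on toy data (the values are illustrative and not taken from the dataset) shows the difference between discarding and keeping the result:

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row (illustrative values only)
toy = pd.DataFrame({"area": ["A", "B", "C"], "Asian": [54000, np.nan, 17000]})

toy.dropna()            # result is discarded, toy still has 3 rows
rows_before = len(toy)

toy = toy.dropna()      # reassigning keeps the cleaned frame
rows_after = len(toy)

print(rows_before, rows_after)
```

Either reassigning the result or passing `inplace=True` makes the drop stick.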


### Dropping rows from position 33 onwards, as they contain data for places outside London, which is not required

In [116]:
df.drop(df.index[33:],inplace=True)

In [118]:
df.dropna(inplace= True)
df.reset_index(inplace= True, drop= True)
df.head()

Unnamed: 0,code,area,White,Asian,Black,Mixed/ Other,Total,White.1,Asian.1,Black.1,Mixed/ Other.1,Total.1
0,E09000002,Barking and Dagenham,109000,54000,36000,15000,215000.0,11000,8000,6000,4000,15000.0
1,E09000003,Barnet,250000,57000,30000,54000,390000.0,22000,10000,7000,10000,27000.0
2,E09000004,Bexley,195000,17000,21000,15000,248000.0,15000,5000,5000,4000,17000.0
3,E09000005,Brent,102000,107000,62000,56000,328000.0,13000,13000,10000,9000,23000.0
4,E09000006,Bromley,267000,15000,21000,28000,330000.0,21000,5000,6000,7000,24000.0


#### Getting a DataFrame that contains only the Asian population of each borough

In [120]:
df_asian=df[['code','area','Asian']]
df_asian.head()

Unnamed: 0,code,area,Asian
0,E09000002,Barking and Dagenham,54000
1,E09000003,Barnet,57000
2,E09000004,Bexley,17000
3,E09000005,Brent,107000
4,E09000006,Bromley,15000


In [129]:
# Reassign the sorted result instead of sorting a slice in place (avoids SettingWithCopyWarning)
df_asian = df_asian.sort_values(by=['Asian'], ascending=False)
df_asian.head()


Unnamed: 0,code,area,Asian
23,E09000025,Newham,166000
28,E09000030,Tower Hamlets,128000
24,E09000026,Redbridge,126000
3,E09000005,Brent,107000
15,E09000017,Hillingdon,100000
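
The selection step can also be written without a full sort. The sketch below is an illustration under assumptions, not the notebook's method: it rebuilds a toy frame from the five rows shown above, adds a `-` placeholder row like the one seen for City of London, coerces the column to numbers, and picks the top entries.

```python
import pandas as pd

# Values copied from the table above, plus a '-' placeholder row (illustrative)
toy = pd.DataFrame({
    "area": ["City of London", "Newham", "Tower Hamlets", "Redbridge", "Brent", "Hillingdon"],
    "Asian": ["-", 166000, 128000, 126000, 107000, 100000],
})

# Turn placeholders into NaN so the column becomes numeric
toy["Asian"] = pd.to_numeric(toy["Asian"], errors="coerce")

top3 = toy.nlargest(3, "Asian")                       # rows with the three largest counts
top_borough = toy.loc[toy["Asian"].idxmax(), "area"]  # idxmax skips NaN by default
print(top_borough)
```

Coercing first matters because a column that mixes strings and numbers cannot be compared reliably.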


## As Newham has the highest Asian population, we will consider this borough as the preferred location for our Indian fashion store.

#### Read the latitude and longitude coordinates of all Boroughs in London from a Wikipedia link

In [7]:
URL = "https://en.wikipedia.org/wiki/List_of_London_boroughs"
res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')

df_list = []
# Skip the header row, then read the borough name and coordinates from each table row
for items in soup.find('table', class_='wikitable sortable').find_all('tr')[1:]:
    data = items.find_all(['td'])
    try:
        # Column 0 holds the borough name; drop footnote markers that start with '['
        borough_name = data[0].get_text().split('[')[0].strip()

        # Column 8 holds the coordinates; the decimal "lat; long" pair follows the second '/'
        ll = data[8].get_text().split('/')
        lat_long = ll[2].split('(')[0].split(';')
        latitude = float(lat_long[0].strip())
        longitude = float(lat_long[1].strip().replace(u'\ufeff', ''))

        # Append the borough name, latitude and longitude to the list
        df_list.append((borough_name, latitude, longitude))
    except IndexError:
        # Rows without the expected cells are skipped
        pass
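
The string handling in the loop above can be isolated into a small helper, which makes it easier to check without fetching the page. This is a sketch: `parse_lat_long` is a hypothetical name, and the sample cell text is an assumption inferred from the parsing steps (three `/`-separated parts, with the decimal pair last), not a copy of the live page.

```python
def parse_lat_long(cell_text):
    """Extract (latitude, longitude) from a 'DMS / degrees / lat; long (name)' cell."""
    decimal_part = cell_text.split('/')[2]    # third part holds "lat; long (name)"
    decimal_part = decimal_part.split('(')[0]  # drop the trailing "(name)"
    lat_text, lon_text = decimal_part.split(';')
    # Remove zero-width markers before stripping ordinary whitespace
    lat = float(lat_text.replace('\ufeff', '').strip())
    lon = float(lon_text.replace('\ufeff', '').strip())
    return lat, lon

# Assumed cell layout, for illustration only
sample = ("51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N 0.1557°E\ufeff / "
          "51.5607; 0.1557\ufeff (Barking and Dagenham)")
print(parse_lat_long(sample))
```

The `\ufeff` character is removed explicitly because `str.strip()` does not treat it as whitespace.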

In [10]:
df_boroughs = pd.DataFrame(df_list, columns=['Borough', 'Latitude' , 'Longitude'])
print(df_boroughs.shape)
df_boroughs.head()

(32, 3)


Unnamed: 0,Borough,Latitude,Longitude
0,Barking and Dagenham,51.5607,0.1557
1,Barnet,51.6252,-0.1517
2,Bexley,51.4549,0.1505
3,Brent,51.5588,-0.2817
4,Bromley,51.4039,0.0198


In [12]:
print(df_boroughs.dtypes)
df_boroughs.loc[df_boroughs['Borough'] == 'Newham']

Borough       object
Latitude     float64
Longitude    float64
dtype: object


Unnamed: 0,Borough,Latitude,Longitude
23,Newham,51.5077,0.0469


#### Get the Latitude and Longitude of London City using geopy library

In [14]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
address = 'London, UK'
# Nominatim expects an application-specific user_agent; "london_explorer" is a placeholder name
geolocator = Nominatim(user_agent="london_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London City are 51.5073219, -0.1276474.


### Preferred location for Indian Fashion Store - Newham Borough
##### As Newham has the highest Asian population, we will consider only the neighbourhoods of this borough. For that, we need the latitude and longitude of every area (neighbourhood) in Newham.

#### Read the latitude and longitude coordinates of all the neighborhoods (areas) in Newham Borough

In [20]:
import re
URL = "https://en.wikipedia.org/wiki/List_of_areas_of_London"
res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')

codes = []
areas_list = []
href_links_list = []
# Collect the name and grid reference code of every area that belongs to Newham
for items in soup.find('table', class_='wikitable sortable').find_all('tr')[1:]:
    data = items.find_all(['td'])
    area_name = data[0].text
    borough_name = data[1].text.split('[')[0]  # drop footnote markers
    code = data[5].text.strip()                # grid reference, e.g. TQ435815

    if borough_name == 'Newham':
        codes.append(code)
        areas_list.append((borough_name, area_name, code))

# Collect the GeoHack link of every grid reference found above
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://tools\.wmflabs\.org")}):
    htext = link.text
    if htext in codes:
        hlink = link.get('href')
        href_links_list.append((htext, hlink))

### Create a DataFrame from the Areas list

In [21]:
df_areas = pd.DataFrame(areas_list, columns=['Borough', 'Area', 'Code'])
df_areas.head()

Unnamed: 0,Borough,Area,Code
0,Newham,Beckton,TQ435815
1,Newham,Canning Town,TQ405815
2,Newham,Custom House,TQ408807
3,Newham,East Ham,TQ425835
4,Newham,Forest Gate,TQ405855


In [23]:
print(df_areas.columns)
print(df_areas.shape)

Index(['Borough', 'Area', 'Code'], dtype='object')
(14, 3)


### Create a DataFrame from the list of href links

In [24]:
df_links = pd.DataFrame(href_links_list, columns=['Code','href'])
print(df_links.columns)
print(df_links.shape)
df_links.head()

Index(['Code', 'href'], dtype='object')
(15, 2)


Unnamed: 0,Code,href
0,TQ435815,https://tools.wmflabs.org/geohack/en/51.514205...
1,TQ405815,https://tools.wmflabs.org/geohack/en/51.514959...
2,TQ408807,https://tools.wmflabs.org/geohack/en/51.507695...
3,TQ425835,https://tools.wmflabs.org/geohack/en/51.532429...
4,TQ405855,https://tools.wmflabs.org/geohack/en/51.550902...


#### Merge the Areas and href Links DataFrames

In [25]:
cols = df_links.columns.difference(df_areas.columns)
cols

Index(['href'], dtype='object')

In [28]:
df_areas_links = pd.concat([df_areas, df_links[cols]], axis=1)
print(df_areas_links.shape)
df_areas_links

(15, 4)


Unnamed: 0,Borough,Area,Code,href
0,Newham,Beckton,TQ435815,https://tools.wmflabs.org/geohack/en/51.514205...
1,Newham,Canning Town,TQ405815,https://tools.wmflabs.org/geohack/en/51.514959...
2,Newham,Custom House,TQ408807,https://tools.wmflabs.org/geohack/en/51.507695...
3,Newham,East Ham,TQ425835,https://tools.wmflabs.org/geohack/en/51.532429...
4,Newham,Forest Gate,TQ405855,https://tools.wmflabs.org/geohack/en/51.550902...
5,Newham,Little Ilford,TQ435855,https://tools.wmflabs.org/geohack/en/51.550147...
6,Newham,Manor Park,TQ425855,https://tools.wmflabs.org/geohack/en/51.550401...
7,Newham,Maryland,TQ391849,https://tools.wmflabs.org/geohack/en/51.545857...
8,Newham,North Woolwich,TQ435795,https://tools.wmflabs.org/geohack/en/51.496234...
9,Newham,Plaistow,TQ405825,https://tools.wmflabs.org/geohack/en/51.523944...


#### Remove the row where there is no data

In [29]:
df_areas_links = df_areas_links.dropna(how='any')
df_areas_links

Unnamed: 0,Borough,Area,Code,href
0,Newham,Beckton,TQ435815,https://tools.wmflabs.org/geohack/en/51.514205...
1,Newham,Canning Town,TQ405815,https://tools.wmflabs.org/geohack/en/51.514959...
2,Newham,Custom House,TQ408807,https://tools.wmflabs.org/geohack/en/51.507695...
3,Newham,East Ham,TQ425835,https://tools.wmflabs.org/geohack/en/51.532429...
4,Newham,Forest Gate,TQ405855,https://tools.wmflabs.org/geohack/en/51.550902...
5,Newham,Little Ilford,TQ435855,https://tools.wmflabs.org/geohack/en/51.550147...
6,Newham,Manor Park,TQ425855,https://tools.wmflabs.org/geohack/en/51.550401...
7,Newham,Maryland,TQ391849,https://tools.wmflabs.org/geohack/en/51.545857...
8,Newham,North Woolwich,TQ435795,https://tools.wmflabs.org/geohack/en/51.496234...
9,Newham,Plaistow,TQ405825,https://tools.wmflabs.org/geohack/en/51.523944...


### Get the geo co-ordinates for all the areas in the Newham borough

In [30]:
geo_codes = []
for row in df_areas_links.itertuples():
    url = row.href
    code = row.Code
    res = requests.get(url).text
    soup1 = BeautifulSoup(res, 'lxml')

    # Read the decimal coordinates from the 'latitude' and 'longitude' spans of the page
    latitude = float(soup1.find('span', {'class': 'latitude'}).get_text())
    longitude = float(soup1.find('span', {'class': 'longitude'}).get_text())

    geo_codes.append((code, latitude, longitude))

print(geo_codes)

[('TQ435815', 51.514206, 0.066634), ('TQ405815', 51.514959, 0.023429), ('TQ408807', 51.507696, 0.027431), ('TQ425835', 51.53243, 0.053041), ('TQ405855', 51.550902, 0.025024), ('TQ435855', 51.550148, 0.068263), ('TQ425855', 51.550401, 0.05385), ('TQ391849', 51.545857, 0.004608), ('TQ435795', 51.496234, 0.065821), ('TQ405825', 51.523945, 0.023828), ('TQ415795', 51.496738, 0.037029), ('TQ385845', 51.54241, -0.004196), ('TQ405837', 51.534728, 0.024306), ('TQ405837', 51.534728, 0.024306)]


#### Create a DataFrame from the above list

In [31]:
df_geo_codes = pd.DataFrame(geo_codes, columns=['Code','Latitude','Longitude'])
df_geo_codes

Unnamed: 0,Code,Latitude,Longitude
0,TQ435815,51.514206,0.066634
1,TQ405815,51.514959,0.023429
2,TQ408807,51.507696,0.027431
3,TQ425835,51.53243,0.053041
4,TQ405855,51.550902,0.025024
5,TQ435855,51.550148,0.068263
6,TQ425855,51.550401,0.05385
7,TQ391849,51.545857,0.004608
8,TQ435795,51.496234,0.065821
9,TQ405825,51.523945,0.023828


#### Now merge the Neighborhoods and Geocodes DataFrames

In [32]:
print(df_areas.columns)
print(df_areas.shape)
print(df_geo_codes.columns)
print(df_geo_codes.shape)

Index(['Borough', 'Area', 'Code'], dtype='object')
(14, 3)
Index(['Code', 'Latitude', 'Longitude'], dtype='object')
(14, 3)


In [33]:
cols = df_geo_codes.columns.difference(df_areas.columns)
cols

Index(['Latitude', 'Longitude'], dtype='object')

In [34]:
Newham_borough = pd.concat([df_areas, df_geo_codes[cols]], axis=1)
Newham_borough.head()

Unnamed: 0,Borough,Area,Code,Latitude,Longitude
0,Newham,Beckton,TQ435815,51.514206,0.066634
1,Newham,Canning Town,TQ405815,51.514959,0.023429
2,Newham,Custom House,TQ408807,51.507696,0.027431
3,Newham,East Ham,TQ425835,51.53243,0.053041
4,Newham,Forest Gate,TQ405855,51.550902,0.025024


#### Change the name of the column 'Area' to 'Neighborhood'

In [35]:
Newham_borough = Newham_borough.rename(columns={'Area' :'Neighborhood'})

#### We do not need the column Code for our further analysis, so we will drop it

In [36]:
Newham_borough.drop(['Code'], axis=1, inplace=True)
print(Newham_borough.columns)
Newham_borough.head()

Index(['Borough', 'Neighborhood', 'Latitude', 'Longitude'], dtype='object')


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Newham,Beckton,51.514206,0.066634
1,Newham,Canning Town,51.514959,0.023429
2,Newham,Custom House,51.507696,0.027431
3,Newham,East Ham,51.53243,0.053041
4,Newham,Forest Gate,51.550902,0.025024


In [37]:
Newham_borough.dtypes

Borough          object
Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object

# End of DATA SECTION