# Battle of the Neighborhoods

## 1. Introduction

In this project I focus on the neighborhoods of London. The central question is: which area of London is the best place to open a new restaurant? As a related question, I also try to find the best place to open a cafe. The insights gained along the way can help future business owners find promising locations for their planned establishments. Among other things, I take into account the current distribution of cafes and restaurants across the neighborhoods and, if the data is available, the population density.

## 2. Data Description

The first step is to obtain the names and locations of London's neighborhoods. I import the list at https://en.wikipedia.org/wiki/List_of_areas_of_London, which names all areas in London. Each entry of this table has an attribute referring to a webpage like https://geohack.toolforge.org/geohack.php?pagename=List_of_areas_of_London&params=51.48648031512_N_0.10859224316653_E_region:GB_scale:25000, from which the geographical position can be determined. Once I have constructed a table with neighborhood names and locations, I use the Foursquare database to explore each area individually. The attributes I focus on are the density of restaurants, the density of cafes and the population (i.e. potential customer) density. In addition, I apply a clustering algorithm to make high-density regions easier to identify.
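
The GeoHack link quoted above carries the position in its `params` query field. As a minimal sketch of how the coordinates could be read from such a link, the function below (a hypothetical helper named `coords_from_geohack`, not part of the notebook) assumes the decimal-degree layout `<lat>_<N|S>_<lon>_<E|W>_...` seen in the example URL; other GeoHack layouts, such as degrees/minutes/seconds, are not handled.

```python
from urllib.parse import urlparse, parse_qs


def coords_from_geohack(url):
    """Return (latitude, longitude) in decimal degrees from a GeoHack URL.

    Assumes the 'params' field starts with <lat>_<N|S>_<lon>_<E|W>,
    as in the example link above.
    """
    query = parse_qs(urlparse(url).query)
    parts = query["params"][0].split("_")
    lat, lat_dir, lon, lon_dir = parts[:4]

    # southern latitudes and western longitudes are negative
    latitude = float(lat) * (1 if lat_dir == "N" else -1)
    longitude = float(lon) * (1 if lon_dir == "E" else -1)
    return latitude, longitude


example_url = (
    "https://geohack.toolforge.org/geohack.php"
    "?pagename=List_of_areas_of_London"
    "&params=51.48648031512_N_0.10859224316653_E_region:GB_scale:25000"
)
print(coords_from_geohack(example_url))
```

Parsing the query string with `urllib.parse` rather than slicing the URL by hand keeps the sketch robust against the order of the query fields.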

In [1]:
import pandas as pd
from urllib.request import urlopen
import xml.etree.ElementTree as et  # parse_XML further below calls et.parse
import bs4 as bs
import numpy as np

In [2]:
# wikipedia page containing areas in London
path_wiki = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'

# read web-page
full_wiki_df = pd.read_html(path_wiki)

# extract the relevant table
df = full_wiki_df[1]

print('Initial table has shape ' + str(df.shape))
df.head()

Initial table has shape (533, 6)


|   | Location | London borough | Post town | Postcode district | Dial code | OS grid ref |
|---|----------|----------------|-----------|-------------------|-----------|-------------|
| 0 | Abbey Wood | Bexley, Greenwich [7] | LONDON | SE2 | 20 | TQ465785 |
| 1 | Acton | Ealing, Hammersmith and Fulham[8] | LONDON | W3, W4 | 20 | TQ205805 |
| 2 | Addington | Croydon[8] | CROYDON | CR0 | 20 | TQ375645 |
| 3 | Addiscombe | Croydon[8] | CROYDON | CR0 | 20 | TQ345665 |
| 4 | Albany Park | Bexley | BEXLEY, SIDCUP | DA5, DA14 | 20 | TQ478728 |
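
The borough column in the preview still carries Wikipedia footnote markers such as `[7]` and `[8]`, and some cells list more than one borough. A possible clean-up step, sketched here with hypothetical helper names (`strip_footnotes`, `split_boroughs`) that do not appear in the notebook, could look like this:

```python
import re

# matches a footnote marker like "[8]" together with any whitespace before it
FOOTNOTE = re.compile(r"\s*\[\d+\]")


def strip_footnotes(text):
    """Remove Wikipedia-style footnote markers from a table cell."""
    return FOOTNOTE.sub("", text).strip()


def split_boroughs(cell):
    """Return the individual borough names contained in one cell."""
    return [name.strip() for name in strip_footnotes(cell).split(",")]


for cell in ["Bexley, Greenwich [7]", "Ealing, Hammersmith and Fulham[8]", "Croydon[8]"]:
    print(split_boroughs(cell))
```

In a pandas workflow the same function could be applied to the column with `df['London borough'].apply(split_boroughs)`; that step is left out here so the sketch runs without the Wikipedia table.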


In [125]:
def read_wiki_page(url):
  
    identifier = 'List of places UK England London'

    # extract first entry directly
    temp_df = pd.read_html(url)

    # check if there's a box on top
    if temp_df[0].shape==(1,2):
        temp_df = temp_df[1]
    else:
        temp_df = temp_df[0]
        
    # get column names
    temp_columns = temp_df.columns.tolist()
        
    contains_coord = False
        
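    # go through the rows and look for the infobox row that holds the coordinates
    for row in range(len(temp_df)):

        # check the second column of the current row for the identifier
        if identifier in str(temp_df.iloc[row, 1]):
            contains_coord = True
            print('    Coordinates readable')
            break
            
    if not contains_coord:
        raise ValueError('url not valid')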
        
    

In [126]:
wikipedia = 'https://en.wikipedia.org/wiki/'
identifier = 'List of places UK England London'

list_neighborhoods = []
list_coordinates = []

for row in range(0,3):  # limited to the first 3 rows for testing; use len(df) to process the full table
    
    # get location string of current row
    curr_loc = df.iloc[row]['Location']
    
    # some entries have an alternative name written like 'name (alternative name)' -> remove the brackets and their content
    loc_no_bracket = curr_loc.split(' (')[0]
    
    # replace white spaces between words by _
    no_spaces = loc_no_bracket.replace(' ','_')
    
    # drop every occurrence of an apostrophe (')
    no_apostrophe = no_spaces.replace("'","")
    
    # construct possible url
    url_guess = wikipedia + no_apostrophe
    
    print(url_guess)
    
    try:
        read_wiki_page(url_guess)
    except Exception:
        
        print('    Address not valid, try alternative')
        
        # if the url has not been correct so far, add ',_London' to construct a second guess
        url_guess = url_guess + ',_London'
        print('    ' + url_guess)
        
        try:
            read_wiki_page(url_guess)
        except Exception:
            print('    Alternative address also not valid, discard entry')
            continue

https://en.wikipedia.org/wiki/Abbey_Wood
    Coordinates readable
https://en.wikipedia.org/wiki/Acton
    Address not valid, try alternative
    https://en.wikipedia.org/wiki/Acton,_London
    Coordinates readable
https://en.wikipedia.org/wiki/Addington
    Address not valid, try alternative
    https://en.wikipedia.org/wiki/Addington,_London
    Coordinates readable


In [118]:
temp_df.shape == (1,2)

True

In [13]:
def parse_XML(xml_file, df_cols): 
    """Parse the input XML file and store the result in a pandas 
    DataFrame with the given columns. 
    
    The first element of df_cols is supposed to be the identifier 
    variable, which is an attribute of each node element in the 
    XML data; other features will be parsed from the text content 
    of each sub-element. 
    """
    
    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    rows = []
    
    for node in xroot: 
        res = []
        res.append(node.attrib.get(df_cols[0]))
        for el in df_cols[1:]: 
            if node is not None and node.find(el) is not None:
                res.append(node.find(el).text)
            else: 
                res.append(None)
        rows.append({df_cols[i]: res[i] 
                     for i, _ in enumerate(df_cols)})
    
    out_df = pd.DataFrame(rows, columns=df_cols)
        
    return out_df

import os
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage.
# The API key is read from an environment variable so that it is not stored in the shared notebook
# (the variable name IBM_API_KEY_ID is an assumption; use whatever name your setup provides).
client_5e07c498d6f54713b4afa36144761a42 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=os.environ['IBM_API_KEY_ID'],
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.eu-geo.objectstorage.service.networklayer.com')

# Your data file was loaded into a botocore.response.StreamingBody object.
# Please read the documentation of ibm_boto3 and pandas to learn more about the possibilities to load the data.
# ibm_boto3 documentation: https://ibm.github.io/ibm-cos-sdk-python/
# pandas documentation: http://pandas.pydata.org/
streaming_body_1 = client_5e07c498d6f54713b4afa36144761a42.get_object(Bucket='courseracapstoneproject-donotdelete-pr-mvaxyenlgvfj8y', Key='doc.kml')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(streaming_body_1, "__iter__"): streaming_body_1.__iter__ = types.MethodType( __iter__, streaming_body_1 ) 
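
The object loaded above is a KML file. KML elements live in the namespace `http://www.opengis.net/kml/2.2`, so lookups with plain tag names generally find nothing unless the namespace is passed along. The sketch below shows one way to read placemark names and point coordinates with the standard library. The contents of `doc.kml` are not shown in this notebook, so the sample document and the helper name `placemarks_from_kml` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Made-up sample; the real doc.kml may be structured differently.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Sample area</name>
      <Point><coordinates>0.1086,51.4865,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""


def placemarks_from_kml(kml_text):
    """Return a list of dicts with name, latitude and longitude per placemark."""
    root = ET.fromstring(kml_text)
    rows = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default=None, namespaces=KML_NS)
        coords = placemark.findtext(".//kml:coordinates", default="", namespaces=KML_NS).strip()
        if not coords:
            continue  # skip placemarks without coordinates

        # KML stores coordinates as 'longitude,latitude[,altitude]'
        lon, lat = coords.split()[0].split(",")[:2]
        rows.append({"name": name, "latitude": float(lat), "longitude": float(lon)})
    return rows


print(placemarks_from_kml(SAMPLE_KML))
```

With the streaming body from the cell above, the text could be obtained with `streaming_body_1.read()` and the resulting list passed to `pd.DataFrame`.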
