# IBM Capstone Project
This file will hold my code for the IBM Data Science Capstone Project

## Introduction
As part of the Coursera Data Science Capstone project, I will be using Python and Foursquare to analyse the number of Italian restaurants by postcode in the Greater London area. The aim is to understand the relative density of restaurants and so show entrepreneurs which areas offer the greatest opportunity for new restaurants.

London has one of the widest selections of restaurants in the world, with hundreds of new restaurants opening in the capital every month. However, it is a highly competitive market, and hundreds of restaurants also close every month. Choosing the "right" location to launch a new restaurant is an important decision for any restaurateur. This analysis focuses on Italian restaurants, to limit the immediate scope and to test the functionality of the tool; it could be broadened at a later time to cater to a user's chosen cuisine.

## Data
Throughout this analysis I will be using Python and a range of libraries which have been pre-installed or added. The code below imports the required tools:

In [201]:
# Import pandas and numpy
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print('numpy, pandas, ..., imported...')

# library for BeautifulSoup, for web scraping
from bs4 import BeautifulSoup

# library to handle JSON files
import json

!pip -q install geopy
print('geopy installed...')

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
print('Nominatim imported...')

# library to handle requests
import requests
print('requests imported...')

# transform JSON file into a pandas dataframe
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated since pandas 1.0
print('json_normalize imported...')

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
print('matplotlib imported...')

# import k-means from clustering stage
from sklearn.cluster import KMeans
print('Kmeans imported...')

# install the Geocoder
!pip -q install geocoder
import geocoder

!pip -q install folium
print('folium installed...')
import folium # map rendering library
print('folium imported...')
print('...Done')

numpy, pandas, ..., imported...
geopy installed...
Nominatim imported...
requests imported...
json_normalize imported...
matplotlib imported...
Kmeans imported...
folium installed...
folium imported...
...Done


#### London data
To filter the data from Foursquare for London, I will need to identify the postcode areas associated with London. I will use Wikipedia for this: the attached [webpage](https://en.wikipedia.org/wiki/List_of_areas_of_London) is where I have pulled the information from. However, the table needs cleansing, which is why I imported BeautifulSoup above (an HTML parsing library).

In [203]:
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'
wikipedia_page = requests.get(wikipedia_link)

# Cleans html file
soup = BeautifulSoup(wikipedia_page.content, 'html.parser')
# This extracts the "tbody" within the table where class is "wikitable sortable"
table = soup.find('table', {'class':'wikitable sortable'}).tbody
# Extracts all "tr" (table rows) within the table above
rows = table.find_all('tr')
# Extracts the column headers, removes and replaces possible '\n' with space for the "th" tag
columns = [i.text.replace('\n', '')
           for i in rows[0].find_all('th')]
# Collects the cleaned cell values of every data row. The first row (rows[0]) is skipped because it is the header
records = []
for row in rows[1:]:
    tds = row.find_all('td')
    # Removes newlines and non-breaking spaces from every cell
    values = [td.text.replace('\n', '').replace('\xa0', '') for td in tds]
    records.append(values)

# Builds the pd dataframe "df" in one step; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(records, columns = columns)

df.columns = ["Location","Borough","Town","Postcode", "Dialcode","Gridref"]

# Remove Borough reference numbers with []
df["Borough"] = df["Borough"].map(lambda x: x.rstrip("]").rstrip("0123456789").rstrip("["))

df0 = df.drop("Postcode", axis=1).join(df["Postcode"].str.split(",", expand=True).stack().reset_index(level=1, drop=True).rename("Postcode"))

df1 = df0[["Location", "Borough", "Postcode", "Town"]].reset_index(drop=True)

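# Keeps only the location name and post town (this replaces the postcode-split df1 built above),
# then filters to the rows whose post town is LONDON
df1 = df[["Location","Town"]]
df2 = df1[df1['Town'].str.contains('LONDON')]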

df3 = df2[["Location"]].reset_index(drop=True)
df3.head()

| | Location |
|---|---|
| 0 | Abbey Wood |
| 1 | Acton |
| 2 | Aldgate |
| 3 | Aldwych |
| 4 | Anerley |
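
The chained `rstrip` calls used on the Borough column leave a trailing space behind whenever the reference marker is preceded by one. A minimal sketch of an alternative, assuming the markers always take the form `[7]` at the end of the cell; the helper name `strip_reference` and the sample strings are illustrative only:

```python
import re

# Matches optional whitespace followed by a trailing footnote marker such as "[7]"
REFERENCE_MARKER = re.compile(r"\s*\[\d+\]$")

def strip_reference(text):
    """Remove one trailing Wikipedia-style reference marker, if present."""
    return REFERENCE_MARKER.sub("", text)

# Illustrative inputs, not scraped values
samples = ["Bexley, Greenwich [7]", "Ealing[8]", "Tower Hamlets"]
cleaned = [strip_reference(s) for s in samples]
print(cleaned)
```

In the notebook this could be applied with `df["Borough"].map(strip_reference)`, although I have not verified it against every row of the Wikipedia table.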


In [194]:
# Geocoder starts here
# Defining a function to use --> get_latlng()
def get_latlng(location_name):
    
    # Initialize the Location (lat. and long.) to "None"
    lat_lng_coords = None
    
    # While loop retries the lookup until the location coordinates are geocoded
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, London, United Kingdom'.format(location_name))
        lat_lng_coords = g.latlng
    return lat_lng_coords
# Geocoder ends here

# ItalianDF is created by the Foursquare cell (In [205]) under "Location data"
locations = ItalianDF['Location']
coordinates = [get_latlng(location) for location in locations.tolist()]

df_coordinates = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])
ItalianDF['Latitude'] = df_coordinates['Latitude']
ItalianDF['Longitude'] = df_coordinates['Longitude']
ItalianDF.head()

| | Location | Count_Italians | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Abbey Wood | 0 | 51.49245 | 0.12127 |
| 1 | Acton | 3 | 51.51324 | -0.26746 |
| 2 | Aldgate | 50 | 51.513304 | -0.077771 |
| 3 | Aldwych | 50 | 51.513291 | -0.117093 |
| 4 | Anerley | 0 | 51.41233 | -0.06539 |
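
The `while` loop in `get_latlng` never exits if a location cannot be geocoded at all. A minimal sketch of a bounded retry, assuming the lookup returns `[lat, lng]` or `None` as `geocoder.arcgis(...).latlng` does in the cell above; `geocode_with_retry` and `flaky_lookup` are hypothetical names, and the stand-in lookup replaces the real network call:

```python
import time

def geocode_with_retry(lookup, query, max_attempts=3, pause_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    lookup is any callable returning [lat, lng] or None.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause_seconds)
    return None  # the caller decides how to handle an unresolved location

# Stand-in for a real geocoder: fails once, then succeeds
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 2 else [51.51324, -0.26746]

result = geocode_with_retry(flaky_lookup, "Acton, London, United Kingdom")
never = geocode_with_retry(lambda q: None, "Nowhere", max_attempts=2)
```

Returning `None` after a fixed number of attempts means unresolved locations have to be filtered out before plotting, which is a trade-off against the original loop's guarantee that every row gets coordinates.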


#### Location data
This analysis requires data from Foursquare, a location data provider. Foursquare holds information about the number, location, ratings and many more details of venues across the world. I will focus on the data associated with London and Italian restaurants (category ID: 4bf58dd8d48988d110941735). This will enable me to gauge the number of Italian restaurants in the city and then associate it with given postcodes.

The code below looks up on Foursquare the Italian restaurants that are within 1000m of the location, then counts the number of restaurants. If the area is not recognised by Foursquare, then Foursquare will return an error, and the if statement allows the process to continue and return a 0 value for the number of Italian restaurants. Because `LIMIT` is set to 50, the count for any one location is capped at 50.

In [204]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret

In [205]:
VERSION = '20180604'
RADIUS = 1000
LIMIT = 50
ItalianID = '4bf58dd8d48988d110941735'
LocationList = []
CountList = []
ItalianDF = pd.DataFrame(columns=['Location', 'Count_Italians'])

for i, row in df3.iterrows():
    search_query = row["Location"] + ", London, UK"
    search_query = search_query.replace(" ", "+")
    search_query = search_query.replace(",", "")
    search_query = search_query.replace("++", "+")

    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&categoryId={}&near={}&v={}&limit={}&radius={}'.format(CLIENT_ID, CLIENT_SECRET, ItalianID, search_query, VERSION, LIMIT, RADIUS)
    results = requests.get(url).json()
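    status_code = results['meta']['code']
    # Treats any non-200 response (e.g. 400 when Foursquare cannot resolve the area) as zero restaurants
    if status_code != 200: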
        LocationList.append(row["Location"])
        CountList.append(0)
    else:
        venues = results['response']['venues']
        dataframe = json_normalize(venues)
        LocationList.append(row["Location"])
        CountList.append(dataframe.shape[0])

ItalianDF = pd.DataFrame({'Location': LocationList, 'Count_Italians': CountList})
ItalianDF.head()

| | Location | Count_Italians |
|---|---|---|
| 0 | Abbey Wood | 0 |
| 1 | Acton | 3 |
| 2 | Aldgate | 50 |
| 3 | Aldwych | 50 |
| 4 | Anerley | 0 |
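
The manual `replace` calls that build the search query could also be handled by the standard library. A rough sketch, assuming the endpoint and parameter names used in the cell above; `build_search_url` and `count_venues` are hypothetical helpers, and the two response dictionaries are hand-written stand-ins rather than real Foursquare output. Unlike the original, this keeps the commas in the query and percent-encodes them:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/search"

def build_search_url(location, client_id, client_secret, category_id,
                     version="20180604", limit=50, radius=1000):
    """Build the venue search URL, letting urlencode escape the query."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "categoryId": category_id,
        "near": "{}, London, UK".format(location),
        "v": version,
        "limit": limit,
        "radius": radius,
    }
    return "{}?{}".format(BASE_URL, urlencode(params))

def count_venues(results):
    """Return the number of venues, or 0 for any non-200 response."""
    if results.get("meta", {}).get("code") != 200:
        return 0
    return len(results.get("response", {}).get("venues", []))

url = build_search_url("Abbey Wood", "ID", "SECRET", "4bf58dd8d48988d110941735")

# Hand-written stand-ins for API responses
ok = {"meta": {"code": 200}, "response": {"venues": [{"name": "a"}, {"name": "b"}]}}
failed = {"meta": {"code": 400}, "response": {}}
print(count_venues(ok), count_venues(failed))
```

Keeping the counting logic in a small function also makes it possible to check the zero-on-error behaviour without calling the API.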


In order to make sense of the data and start to visualise it, we need latitude and longitude co-ordinates for each location. The `get_latlng` function and the code in the geocoder cell (In [194]) above create the co-ordinates and append them to the dataframe.

In [196]:
address = 'London, United Kingdom'
geolocator = Nominatim(user_agent="ln_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_london = folium.Map(location = [latitude, longitude], zoom_start = 11)

# Adding markers to map
for lat, lng, loc, no in zip(ItalianDF['Latitude'], 
                         ItalianDF['Longitude'],
                         ItalianDF['Location'],
                         ItalianDF['Count_Italians']):
    label = '{} - {}'.format(loc,no)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_london)  
    
display(map_london)

### Visualising the data
Now that I've got a dataset with the locations and the total number of restaurants, I can start to visualise where the high density areas may be located. The cell above uses the Folium module to create a map of London, using the Nominatim geocoder to centre the map on the city.

In [197]:
ItalianDF['marker_color'] = pd.cut(ItalianDF['Count_Italians'], bins=4, 
                              labels=['lightblue', 'blue', 'red', 'darkred'])
for lat, lng, loc, no, col in zip(ItalianDF['Latitude'], 
                             ItalianDF['Longitude'],
                             ItalianDF['Location'],
                             ItalianDF['Count_Italians'],
                             ItalianDF['marker_color']):
    label = '{} - {}'.format(loc,no)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=col,
        fill=True,
        fill_color=col,
        fill_opacity=0.7).add_to(map_london)  
    
display(map_london)

Having visualised the various locations across London, it is not immediately obvious where the high and low density areas are using a single colour scheme: you need to click on the icons to find out how many Italian restaurants there are. The cell above (In [197]) therefore uses the `pd.cut` method to bucket the locations into 4 groups based on their count of restaurants, with the lowest ones coloured light blue and the highest density areas dark red.
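
To make the bucketing concrete, here is a small pure-Python sketch of equal-width, right-closed binning, which is how `pd.cut` behaves when given an integer number of bins. It is a simplification rather than a reimplementation of pandas; `equal_width_bucket` is a hypothetical helper and the `counts` list is illustrative:

```python
import math

def equal_width_bucket(value, low, high, labels):
    """Assign value to one of len(labels) equal-width, right-closed bins."""
    width = (high - low) / len(labels)
    # ceil(...) - 1 puts a value sitting exactly on an edge into the lower bin
    index = math.ceil((value - low) / width) - 1
    # The minimum itself would land at -1, so clamp it into the first bin
    index = max(0, min(index, len(labels) - 1))
    return labels[index]

labels = ['lightblue', 'blue', 'red', 'darkred']
counts = [0, 3, 13, 26, 50]  # illustrative counts between 0 and the LIMIT of 50
colours = [equal_width_bucket(c, min(counts), max(counts), labels) for c in counts]
print(list(zip(counts, colours)))
```

With counts ranging from 0 to 50, each bin is 12.5 wide, so a location needs more than 37 restaurants to be drawn in dark red.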

### Results
As the map above shows, the highest density areas are in Central London, around Mayfair, Chelsea, Covent Garden, Soho, Aldgate, Farringdon and a number of other areas. However, other areas in Central London have a low density, e.g. Temple and Blackfriars.

### Discussion
Based on the trends shown in the map above, a new restaurant would be best placed in Central London, but away from the existing high density areas. This would minimise competition while retaining access to high footfall. Candidate areas include Marylebone or Westminster, as both are large tourist hubs with people looking for quality restaurants.

There are limitations to the analysis: some geocoding did not accurately identify the correct locations, and Foursquare was not able to search for some locations, potentially causing incorrect 0 values. These issues would need to be worked through, and further datasets brought in to corroborate the outcomes.
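
One hypothetical way to turn this reasoning into a shortlist is to keep only locations within some distance of the city centre and sort them by restaurant count. The sketch below uses the five rows shown in the `ItalianDF.head()` output; the centre point and the 12 km cut-off are my own assumptions, chosen only for illustration:

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Assumed centre point for London and an arbitrary cut-off distance
CENTRE = (51.5074, -0.1278)
MAX_KM = 12

# (location, count, latitude, longitude) rows from the ItalianDF.head() output
rows = [
    ("Abbey Wood", 0, 51.49245, 0.12127),
    ("Acton", 3, 51.51324, -0.26746),
    ("Aldgate", 50, 51.513304, -0.077771),
    ("Aldwych", 50, 51.513291, -0.117093),
    ("Anerley", 0, 51.41233, -0.06539),
]

# Keep locations inside the cut-off, then list the lowest counts first
nearby = [r for r in rows if haversine_km(CENTRE[0], CENTRE[1], r[2], r[3]) <= MAX_KM]
shortlist = sorted(nearby, key=lambda r: r[1])
names = [r[0] for r in shortlist]
print(names)
```

A tighter cut-off would be needed to restrict the shortlist to Central London proper, and zero counts should be checked against the geocoding and lookup limitations before being read as genuine gaps in the market.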

### Conclusion

In conclusion, there are clear trends that Central London areas have higher densities of restaurants, but some areas have a lower density than their neighbours. These pose a potential opportunity for aspiring Italian restaurateurs, since the sustained presence of a large number of Italian restaurants nearby indicates the level of demand.