<h1 align=center><font size = 5>Capstone Project - The Battle of Neighborhoods</font></h1>

<h1 align=left><font size = 3>Introduction</font></h1>

Tech companies are moving to Austin, Texas in large numbers, drawn by lower costs and a laid-back lifestyle. According to the Austin Chamber of Commerce, 58 major companies relocated to the Austin area in 2019 alone, not including tech giants such as Apple, Amazon, and Google, which opened new offices in the region. Tech companies aren’t the only ones flocking to Austin, either. Nearly 100 other companies in various sectors have announced that they are moving to the area or expanding their local operations in the coming year. Tens of thousands of well-paying new jobs are on their way to Austin, with more being announced every day.

<h1 align=left><font size = 3>Business Problem</font></h1>

All those jobs are going to require smart, motivated, skilled workers to fill them, and those workers need places to live and restaurants or food joints to eat at. The objective of this capstone project is to find the most suitable location for an entrepreneur to open a new Italian restaurant in Austin, Texas. Using data science and machine learning methods such as clustering, this project will recommend the most suitable location for the restaurant. As with any business, and restaurants in particular, location is of utmost importance, so we will take several things into consideration before suggesting an optimal location.

<h1 align=left><font size = 3>Data</font></h1>

The following data is required for this project:
<br>
<li> List of Austin neighborhoods, scraped from the Wikipedia page that lists them </li>
<br>
<li> Latitude and longitude of these neighborhoods, obtained with the GeoPy Nominatim geocoder</li>
<br>
<li> Venue data for these neighborhoods, obtained using the Foursquare API</li>
<br>

<h1 align=left><font size = 3>Methodology</font></h1>

The first step in this project is to collect data on the neighbourhoods of Austin from <a href="https://en.wikipedia.org/wiki/List_of_Austin_neighborhoods"> Wikipedia </a>. Since the data is not available preformatted, it has to be scraped from the Wikipedia page. The location coordinates of each neighbourhood will then be obtained with the GeoPy Nominatim geolocator and appended to the neighbourhood data. Using this data, a folium map of Austin neighbourhoods will be created.

The second step will be to explore each of the neighbourhoods and their venues using Foursquare location data. The venues of the neighbourhoods will be analyzed in detail to discover patterns, which will be done by grouping the neighbourhoods using k-means clustering. K-means picks k centroids, assigns every data point to its nearest centroid, and then moves each centroid to the mean of its assigned points, repeating until the assignments stop changing; this keeps the distances between points and their centroids as small as possible. It is one of the simplest and most popular unsupervised machine learning algorithms, and it is well suited to this project. Following this, each cluster will be examined and a decision will be made regarding which cluster fits our need. The factor that will determine this is the frequency of occurrence of restaurants and other food venues within the cluster.
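To make the assign-and-update loop described above concrete, here is a minimal NumPy sketch of k-means on toy 2-D points. The function name `kmeans_sketch`, the toy data, and the hand-picked starting centroids are illustrative assumptions; the project itself relies on scikit-learn's `KMeans`.

```python
import numpy as np

def kmeans_sketch(points, centroids, n_iter=10):
    """Illustrative k-means loop: assign points to the nearest centroid, then update centroids."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(n_iter):
        # distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# toy data: two obvious groups (made up for illustration)
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans_sketch(toy_points, centroids=[[0, 0], [10, 10]])
print(labels)
print(centroids)
```

In the notebook the "points" are neighbourhoods and each coordinate is the frequency of one venue category, so the same loop runs in a much higher-dimensional space.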

Once a cluster is picked, the neighbourhoods in that cluster will be investigated with regard to the number of Italian restaurants in their vicinity. The results of the analysis will highlight potential neighbourhoods where an Italian restaurant may be opened based on geographical location and proximity to competitors.
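The competitor check described above could be sketched as follows. This is only an illustration on made-up data: the helper name `italian_counts`, the toy frames, and the category label `'Italian Restaurant'` are assumptions, while the column names mirror the ones used later in this notebook.

```python
import pandas as pd

# toy stand-ins for the venue table and the cluster assignments (made-up values)
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Venue Category': ['Italian Restaurant', 'Cafe', 'Italian Restaurant',
                       'Taco Place', 'Pizza Place', 'Italian Restaurant'],
})
clusters = pd.DataFrame({'Neighborhood': ['A', 'B', 'C'],
                         'Cluster Labels': [0, 0, 1]})

def italian_counts(venues, clusters, cluster_id, category='Italian Restaurant'):
    """Count venues of `category` in every neighborhood of one cluster, fewest first."""
    in_cluster = clusters.loc[clusters['Cluster Labels'] == cluster_id, 'Neighborhood'].tolist()
    matches = venues[venues['Venue Category'] == category]
    counts = matches.groupby('Neighborhood').size()
    # neighborhoods without any match get an explicit zero
    return counts.reindex(in_cluster, fill_value=0).sort_values()

counts = italian_counts(venues, clusters, cluster_id=0)
print(counts)
```

Neighbourhoods at the top of the result have the fewest direct competitors within the chosen cluster.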

In [None]:
import re
import json  # library to handle JSON files

import requests  # library to handle HTTP requests
from bs4 import BeautifulSoup  # HTML parsing for the Wikipedia page

import numpy as np  # library to handle data in a vectorized manner

import pandas as pd  # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
from pandas import json_normalize  # transform JSON into a pandas dataframe (pandas >= 1.0)

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium  # map rendering library

print('Libraries imported.')

Load the data hosted on Wikipedia

In [None]:
List_url = "https://en.wikipedia.org/wiki/List_of_Austin_neighborhoods"
source = requests.get(List_url).text
soup = BeautifulSoup(source,"html.parser")

Create a Dataframe to store the data

In [None]:
import re
lis_toc = []

# capture all table-of-contents entries so they can be excluded from the location dataframe
for tag in soup.find_all('li', class_=re.compile("toc")):
    lis_toc.append(tag.text)
df_toc = pd.DataFrame(lis_toc)
df_toc.columns = ['Neighborhood']

#capture all li content, Woodstone Village is the last neighborhood in Wikipedia
lis_neighbor = []
for tag in soup.find_all('li'):
    lis_neighbor.append(tag.text)
    if(tag.text == "Woodstone Village"):
        break
df_neighbors = pd.DataFrame(lis_neighbor)
df_neighbors.columns = ['Neighborhood']

# remove the table-of-contents entries so that only actual neighborhoods remain
cond = df_neighbors['Neighborhood'].isin(df_toc['Neighborhood'])
df_neighbors.drop(df_neighbors[cond].index, inplace = True)
df_neighbors.reset_index(inplace=True,drop=True)

#Wiki text for these 4 neighborhoods needs to be cleansed
df_neighbors.loc[89,'Neighborhood'] ='Sunset Valley'
df_neighbors.loc[83,'Neighborhood'] ='Slaughter-Manchaca'
df_neighbors.loc[68,'Neighborhood'] ='South Lamar'
df_neighbors.loc[43,'Neighborhood'] ='Great Hills'

df_neighbors
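The filtering idea used above (collect every `li`, then discard the ones that belong to the table of contents) can be tried on a small inline page. The HTML snippet below is a made-up stand-in for the Wikipedia markup, so treat it as a sketch of the technique rather than a copy of the real page structure.

```python
import re
from bs4 import BeautifulSoup

# made-up miniature page: one table-of-contents entry and two list entries
sample_html = """
<ul>
  <li class="toclevel-1"><a href="#Central">1 Central</a></li>
</ul>
<h2 id="Central">Central</h2>
<ul>
  <li>Bouldin Creek</li>
  <li>Hyde Park</li>
</ul>
"""

soup_sample = BeautifulSoup(sample_html, "html.parser")

# entries whose class contains "toc" belong to the table of contents
toc_items = {tag.text for tag in soup_sample.find_all('li', class_=re.compile("toc"))}

# keep every other list item
neighborhoods = [tag.text.strip() for tag in soup_sample.find_all('li')
                 if tag.text not in toc_items]
print(neighborhoods)
```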

In [None]:
# define the data frame columns
column_names = ['Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the data frame
geo_neighborhoods = pd.DataFrame(columns=column_names)

In [None]:
geolocator = Nominatim(user_agent="Austin_Explorer", timeout=5)

rows = []
for name in df_neighbors['Neighborhood']:
    location = geolocator.geocode(name + ', Austin')
    # neighborhoods that cannot be geocoded get (0, 0) and are filtered out during data cleaning
    if location is None:
        latitude, longitude = 0, 0
    else:
        latitude, longitude = location.latitude, location.longitude
    rows.append({'Neighborhood': name, 'Latitude': latitude, 'Longitude': longitude})

# build the dataframe in one step from the collected rows
geo_neighborhoods = pd.DataFrame(rows, columns=column_names)
geo_neighborhoods
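A variation worth considering is to separate the lookup loop from the geocoder itself, which makes the missing-result handling easy to test without network access. In the sketch below, `geocode_all` and `fake_geocode` are hypothetical helpers and the coordinates are made up; with GeoPy, `geocode_fn` would be a thin wrapper around `geolocator.geocode`.

```python
def geocode_all(names, geocode_fn, suffix=', Austin'):
    """Look up every name and record None for places the geocoder cannot resolve."""
    rows = []
    for name in names:
        result = geocode_fn(name + suffix)
        if result is None:
            lat, lng = None, None
        else:
            lat, lng = result
        rows.append({'Neighborhood': name, 'Latitude': lat, 'Longitude': lng})
    return rows

def fake_geocode(address):
    """Offline stand-in for a real geocoder (coordinates are invented)."""
    known = {'Hyde Park, Austin': (30.30, -97.73)}
    return known.get(address)

rows_demo = geocode_all(['Hyde Park', 'Nowhere Heights'], fake_geocode)
print(rows_demo)
```

Recording `None` instead of `0` lets the unresolved rows be removed later with `dropna`, independent of any bounding-box filter.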

Data Cleaning
Remove neighborhoods with missing geo coordinates

In [None]:
city = "Austin, TX"
geolocator = Nominatim(user_agent="austin_explorer")
location = geolocator.geocode(city)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Austin are {}, {}.'.format(latitude, longitude))

In [None]:
geo_neighborhoods['Latitude']=geo_neighborhoods['Latitude'].astype(float)
geo_neighborhoods['Longitude']=geo_neighborhoods['Longitude'].astype(float)

# keep only rows inside a bounding box around Austin; this also drops the (0, 0) placeholders
geo_neighborhoods = geo_neighborhoods[
    (geo_neighborhoods.Latitude > 29.8)
    & (geo_neighborhoods.Latitude < 30.7)
    & (geo_neighborhoods.Longitude < -97)
].reset_index(drop=True)
geo_neighborhoods

<h1><font size = 3>Part 2 Coordinates - Latitude and Longitude of each neighborhood</font></h1>

<h1><font size = 3>Part 3 Cluster Neighborhoods in Austin</font></h1>

Create a map of Austin with neighborhoods

In [None]:
Austin_map = folium.Map(location=[latitude, longitude], zoom_start=10)
Austin_map

Add markers for the neighborhoods

In [None]:
for lat, lng, neighborhood in zip(
        geo_neighborhoods['Latitude'], 
        geo_neighborhoods['Longitude'],         
        geo_neighborhoods['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Austin_map)  

Austin_map

Configure Foursquare handle with required Credentials and Version <br>
*client_id and client_secret removed prior to Github commit

In [None]:
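import os

# credentials are read from environment variables (the variable names are assumed here)
# so that they never appear in the notebook itself
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')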
VERSION = '20180605'

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    """Return one row per venue found within `radius` meters of each neighborhood."""
    url = 'https://api.foursquare.com/v2/venues/explore'
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # build the query parameters for the API request
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': limit,
        }

        # make the GET request
        results = requests.get(url, params=params).json()["response"]['groups'][0]['items']

        # keep only relevant information for each nearby venue
        venues_list.extend(
            (name,
             lat,
             lng,
             v['venue']['name'],
             v['venue']['location']['lat'],
             v['venue']['location']['lng'],
             v['venue']['categories'][0]['name'])
            for v in results)

    nearby_venues = pd.DataFrame(venues_list, columns=[
        'Neighborhood',
        'Neighborhood Latitude',
        'Neighborhood Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])

    return nearby_venues
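The function above indexes straight into the JSON payload, so a response without groups or a venue without categories would raise an error. A more defensive extraction step might look like the sketch below; `extract_venues` is a hypothetical helper, and `sample_response` is a hand-written payload that only imitates the nesting the function above relies on (`response` → `groups` → `items` → `venue`).

```python
# hand-written payload imitating the nesting used above (values are made up)
sample_response = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Trattoria Uno",
                   "location": {"lat": 30.27, "lng": -97.74},
                   "categories": [{"name": "Italian Restaurant"}]}},
        {"venue": {"name": "Unnamed Spot",
                   "location": {"lat": 30.28, "lng": -97.75},
                   "categories": []}},
    ]}]}
}

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, tolerating missing groups or categories."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows

print(extract_venues(sample_response))
print(extract_venues({}))
```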

In [None]:
austin_venues = getNearbyVenues(names=geo_neighborhoods['Neighborhood'],
                                   latitudes=geo_neighborhoods['Latitude'],
                                   longitudes=geo_neighborhoods['Longitude']
                                  )

In [None]:
austin_venues.head()

In [None]:
# number of venues returned for each neighborhood
venue_count = austin_venues.groupby('Neighborhood')['Venue'].count().reset_index()

# keep only neighborhoods with at least 10 venues
suitable_neighborhoods = venue_count[venue_count.Venue >= 10].reset_index(drop=True)
suitable_neighborhoods

Let's review all venues in the suitable neighborhoods

In [None]:
majorvenues_list = suitable_neighborhoods['Neighborhood'].values.tolist()

# keep only venues that belong to a suitable neighborhood
austin_venues = austin_venues[austin_venues['Neighborhood'].isin(majorvenues_list)]
austin_venues = austin_venues.reset_index(drop=True)
austin_venues

In [None]:
# one hot encoding
austin_onehot = pd.get_dummies(austin_venues[['Venue Category']], prefix="", prefix_sep="")
austin_onehot.drop(['Neighborhood'],axis=1,inplace=True) 
#austin_onehot['Neighborhood'] = austin_venues['Neighborhood']
austin_onehot.insert(loc=0, column='Neighborhood', value=austin_venues['Neighborhood'] )
austin_onehot.shape
austin_onehot

In [None]:
austin_grouped = austin_onehot.groupby('Neighborhood').mean().reset_index()
austin_grouped.head()
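Taking the mean of one-hot columns per neighborhood yields the share of each venue category in that neighborhood, which is what the clustering step later uses as features. A toy example with made-up data shows the effect:

```python
import pandas as pd

# made-up venues: neighborhood A has four venues, B has two
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bar', 'Bar', 'Cafe', 'Park', 'Bar', 'Park'],
})

# one column per category, 1.0 where the venue has that category
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# the mean of 0/1 indicators is the relative frequency of each category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with very different venue counts remain comparable.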

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = austin_grouped['Neighborhood']

for ind in np.arange(austin_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(austin_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Cluster the neighborhoods

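The first step is to determine the optimal value of k for the dataset using the Silhouette Coefficient method.

The Silhouette Coefficient of a point compares how close the point is to its own cluster with how close it is to the nearest neighbouring cluster. It ranges from -1 to 1, and a higher value indicates that the point is well matched to its own cluster and poorly matched to neighbouring clusters.

A higher mean Silhouette Coefficient therefore corresponds to a model with better-defined clusters.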
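For intuition, the coefficient of a point i is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The sketch below computes the mean coefficient by hand on four toy points; `silhouette_by_hand` is an illustrative helper, not the scikit-learn implementation used in this notebook.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient with Euclidean distances (illustrative only)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = dists[i, same].mean()
        b = min(dists[i, labels == other].mean()
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

toy_X = [[0], [1], [10], [11]]
good = silhouette_by_hand(toy_X, [0, 0, 1, 1])  # clusters follow the obvious gap
bad = silhouette_by_hand(toy_X, [0, 1, 0, 1])   # clusters mix the two groups
print(good, bad)
```

The grouping that follows the gap in the data scores close to 1, while the mixed grouping scores below 0.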

In [None]:
from sklearn.metrics import silhouette_score

austin_grouped_clustering = austin_grouped.drop(columns='Neighborhood')

for n_cluster in range(2, 10):
    # fix the random seed so the scores are reproducible between runs
    kmeans = KMeans(n_clusters=n_cluster, random_state=0).fit(austin_grouped_clustering)
    label = kmeans.labels_
    sil_coeff = silhouette_score(austin_grouped_clustering, label, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, sil_coeff))
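Instead of reading the best value off the printed list, the scores can be stored and the maximum picked programmatically. The sketch below does this on synthetic, well-separated blobs; the data, the seed, and the range of k are assumptions made for illustration and are unrelated to the Austin results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic data: three tight blobs around hand-picked centers
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, model.labels_, metric='euclidean')

# the k with the highest score is taken as the preferred number of clusters
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```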

The Silhouette Coefficient is the highest for n_clusters=3. Therefore, the neighbourhoods shall be grouped into 3 clusters (k=3) using k-means clustering.

In [None]:
# set number of clusters
kclusters = 3

austin_grouped_clustering = austin_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(austin_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

austin_merged = geo_neighborhoods

# join the coordinates with the cluster labels and top venues of each neighborhood
austin_merged = austin_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
# neighborhoods with fewer than 10 venues were not clustered; drop them and restore integer labels
austin_merged = austin_merged.dropna(subset=['Cluster Labels']).reset_index(drop=True)
austin_merged['Cluster Labels'] = austin_merged['Cluster Labels'].astype(int)
austin_merged.head()

In [None]:

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
rainbow[2]='#006000'
rainbow[1]='#006ff6'
# add markers to the map
for lat, lon, poi, cluster in zip(austin_merged['Latitude'], austin_merged['Longitude'], austin_merged['Neighborhood'], austin_merged['Cluster Labels']):
    if pd.isna(cluster):
        continue  # neighborhood was not clustered
    cluster = int(cluster)
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Examine the Clusters

In [None]:
# show the neighborhood name (column 0) and its most common venues (columns 4 onward)
austin_merged.loc[austin_merged['Cluster Labels'] == 0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]

In [None]:
austin_merged.loc[austin_merged['Cluster Labels'] == 1, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]

In [None]:
austin_merged.loc[austin_merged['Cluster Labels'] == 2, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]