# The Battle of Neighbourhoods
    
    This notebook exists to complete the Coursera capstone project assignment.

#### 1. Introduction / Business Problem

    This report addresses the following question: "Which neighborhoods in and around Pittsburgh are likely candidates for a new grocery store location?" To answer it, we focus on the variety of venues within each of Pittsburgh's neighborhoods and on where those neighborhoods sit within the area. We then group the neighborhoods based on these factors and make observations. The target audience of this report is the stakeholders of a national grocery chain.

#### 2. Data

    For this project, we use the Allegheny County Zip Codes dataset in conjunction with Foursquare location data. The Allegheny Zip Code dataset demarcates the zip code boundaries that lie within Allegheny County and provides zip codes, neighborhood names, and geospatial coordinates. The Foursquare Places API is used to gather venues found within a 500 m radius of each neighborhood's coordinates. The data is organized into dataframes and projected onto an interactive map for visualization.

    The Allegheny Zip Codes dataset is accessed via Pennsylvania Spatial Data Access (PASDA), an official public access open geospatial data portal.

In [1]:
import pandas as pd


url = 'https://www.pasda.psu.edu/spreadsheet/AlleghenyCounty__ZipCodeBoundaries2020.csv'
pittsburgh_nbh = pd.read_csv(url)
pittsburgh_nbh.head()


Unnamed: 0,FID,ZIP,NAME,ZIPTYPE,STATE,STATEFIPS,COUNTYFIPS,COUNTYNAME,S3DZIP,LAT,...,EMPTYCOL,TOTRESCNT,MFDU,SFDU,BOXCNT,BIZCNT,RELVER,COLOR,Shape_Leng,Shape_Area
0,0,15224,PITTSBURGH,NON-UNIQUE,PA,42,42003,ALLEGHENY,152,40.464263,...,,5113,845,4063,205,495,1.9.3,0,34291.10853,27729040.0
1,1,15202,PITTSBURGH,NON-UNIQUE,PA,42,42003,ALLEGHENY,152,40.467764,...,,14090,2933,10866,291,961,1.9.3,8,96211.11859,146664800.0
2,2,15012,BELLE VERNON,NON-UNIQUE,PA,42,42129,WESTMORELAND,150,40.15614,...,,7110,180,6786,144,651,1.9.3,10,8748.136272,3777756.0
3,3,15142,PRESTO,NON-UNIQUE,PA,42,42003,ALLEGHENY,151,40.380401,...,,1037,0,919,118,30,1.9.3,5,56105.50969,55544800.0
4,4,15216,PITTSBURGH,NON-UNIQUE,PA,42,42003,ALLEGHENY,152,40.401802,...,,11008,1817,9054,137,535,1.9.3,6,80277.15303,95351930.0


    The .csv file contains many fields describing the zip code boundaries and neighborhoods found within Pittsburgh and Allegheny County. We focus our analysis on the Zip Code, Neighborhood, Latitude, and Longitude fields (the ZIP, NAME, LAT, and LON columns).

In [2]:
#instantiate a new dataframe for our target variables
column_names = ['ZIP', 'Neighborhood', 'Latitude', 'Longitude']
neighborhoods = pd.DataFrame(columns = column_names)

In [3]:
neighborhoods

Unnamed: 0,ZIP,Neighborhood,Latitude,Longitude


In [4]:
#fill the new dataframe with the relevant data
neighborhoods['ZIP'] = pittsburgh_nbh['ZIP'].astype(str)
neighborhoods['Neighborhood'] = pittsburgh_nbh['NAME']
neighborhoods['Latitude'] = pittsburgh_nbh['LAT']
neighborhoods['Longitude'] = pittsburgh_nbh['LON']

In [5]:
neighborhoods

Unnamed: 0,ZIP,Neighborhood,Latitude,Longitude
0,15224,PITTSBURGH,40.464263,-79.945118
1,15202,PITTSBURGH,40.467764,-80.053123
2,15012,BELLE VERNON,40.156140,-79.812132
3,15142,PRESTO,40.380401,-80.120993
4,15216,PITTSBURGH,40.401802,-80.034334
...,...,...,...,...
119,15241,PITTSBURGH,40.331863,-80.082840
120,15219,PITTSBURGH,40.442916,-79.988152
121,15236,PITTSBURGH,40.382820,-79.945023
122,15642,IRWIN,40.302046,-79.703196


In [6]:
#remove any rows in which location data is not found
nan_value = float('NaN')

neighborhoods.replace(0, nan_value, inplace = True)

neighborhoods.dropna(subset = ['Latitude'], inplace = True)

In [7]:
#The resulting dataframe contains our target locations
print('The dataframe has {} zip codes and {} unique neighborhoods.'.format(
        neighborhoods.shape[0],
        len(neighborhoods['Neighborhood'].unique())))



The dataframe has 120 zip codes and 73 unique neighborhoods.


#### 3. Methodology

In [8]:
#import libraries
import numpy as np
import requests

import json

# json_normalize is available from the top-level pandas namespace since pandas 1.0
from pandas import json_normalize

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Libraries imported.')

Libraries imported.


In [9]:
#find the geolocation of Pittsburgh, PA
address = 'Pittsburgh, PA'

geolocator = Nominatim(user_agent = "pittsburgh_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Pittsburgh are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Pittsburgh are 40.4416941, -79.9900861.


In [10]:
#Map the ZIP codes of Pittsburgh using folium
map_pittsburgh = folium.Map(location = [latitude, longitude], zoom_start = 11)

for lat, lng, label in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(map_pittsburgh)
    
map_pittsburgh

In [11]:
#Get the location of our first Neighborhood (Zip Code)
nbh_latitude = neighborhoods.loc[0, 'Latitude']
nbh_longitude = neighborhoods.loc[0, 'Longitude']
nbh_name = neighborhoods.loc[0, 'Neighborhood']
nbh_zip = neighborhoods.loc[0, 'ZIP']

print('The latitude and longitude values of {} ({}) are {}, {}.'.format(nbh_name, nbh_zip, nbh_latitude, nbh_longitude))

The latitude and longitude values of PITTSBURGH (15224) are 40.46426325, -79.94511825.


    Now that we have the location of our first neighborhood, let's explore it and find any nearby venues.
    
    We define our Foursquare credentials and request up to 100 venues within a 500 m radius.

In [12]:
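# @hidden_cell
import os

# Credentials are read from environment variables so they are not stored in the notebook
# (the variable names below are placeholders; set them before running this cell)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180604'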

In [13]:
LIMIT = 100

radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    nbh_latitude, 
    nbh_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180604&ll=40.46426325,-79.94511825&radius=500&limit=100'
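
    The request URL above is assembled with string formatting. As a minimal sketch of an alternative, the same query string can be built from a dictionary with the standard library's `urlencode`, which also takes care of escaping. The helper name `build_explore_url` and the placeholder credentials are assumptions made for illustration; only the endpoint and parameter names come from this notebook.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the venues/explore URL used in this notebook from keyword parameters.

    Hypothetical helper, not part of any Foursquare client library.
    """
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-escapes the values, so the comma in 'll' becomes %2C
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# placeholder credentials, not real ones
example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180604', 40.46426325, -79.94511825)
print(example_url)
```

    `requests.get` accepts the same dictionary through its `params` argument, which avoids building the string by hand at all.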

In [14]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '6011a68af1d69f48c7ba8a43'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': '$-$$$$', 'key': 'price'},
    {'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Bloomfield',
  'headerFullLocation': 'Bloomfield, Pittsburgh',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 39,
  'suggestedBounds': {'ne': {'lat': 40.468763254500004,
    'lng': -79.93921454579136},
   'sw': {'lat': 40.4597632455, 'lng': -79.95102195420863}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4ad7af38f964a520a50d21e3',
       'name': 'Spak Brothers Pizza and More',
       'location': {'address': '5107 Penn Ave',
        'lat': 40.4650112798643,
        'lng': -79.94254271652281,
        'labeledLa

In [15]:
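def get_category_type(row):
    # the category list is stored under 'categories' or 'venue.categories', depending on the row's shape
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # venues without any category yield None
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']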

In [16]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Spak Brothers Pizza and More,Pizza Place,40.465011,-79.942543
1,Mixtape,Lounge,40.465413,-79.944878
2,Artisan Cafe,Coffee Shop,40.465171,-79.94396
3,bantha tea,Tea Room,40.46511,-79.94392
4,Most-Wanted Fine Art,Art Gallery,40.46506,-79.94345


In [17]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

39 venues were returned by Foursquare.
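
    As a sanity check on the 500 m search radius, the distance between a neighborhood's coordinates and a returned venue can be estimated with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6,371 km and uses the ZIP 15224 coordinates and the first venue's coordinates printed earlier in this notebook; the function name `haversine_m` is introduced here for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# ZIP 15224 coordinates and the first venue of the response, both taken from outputs above
centroid = (40.46426325, -79.94511825)
venue = (40.4650112798643, -79.94254271652281)

distance = haversine_m(*centroid, *venue)
print('{:.0f} m from the ZIP coordinates, within radius: {}'.format(distance, distance <= 500))
```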


    Now let's repeat this request for every neighborhood (ZIP code) in the dataframe.

In [18]:
def getNearbyVenues(ZIP, names, latitudes, longitudes, radius=500):

    venues_list = []
    # zip_code avoids shadowing the ZIP argument inside the loop
    for zip_code, name, lat, lng in zip(ZIP, names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue;
        # a venue without categories gets None instead of raising an IndexError
        venues_list.append([(
            zip_code,
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])

    # flatten the per-neighborhood lists into one dataframe with named columns
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns = ['ZIP',
                   'Neighborhood',
                   'Neighborhood Latitude',
                   'Neighborhood Longitude',
                   'Venue',
                   'Venue Latitude',
                   'Venue Longitude',
                   'Venue Category'])

    return nearby_venues

In [19]:
pittsburgh_venues = getNearbyVenues(ZIP = neighborhoods['ZIP'], names = neighborhoods['Neighborhood'], latitudes = neighborhoods['Latitude'], longitudes = neighborhoods['Longitude'])

PITTSBURGH
PITTSBURGH
BELLE VERNON
PRESTO
PITTSBURGH
DUQUESNE
PITCAIRN
WEXFORD
MC DONALD
CLINTON
PITTSBURGH
HOMESTEAD
PITTSBURGH
EAST MC KEESPORT
PITTSBURGH
MCKEESPORT
PITTSBURGH
IMPERIAL
RURAL RIDGE
MCKEESPORT
PITTSBURGH
PITTSBURGH
OAKMONT
BRACKENRIDGE
WILMERDING
PITTSBURGH
PITTSBURGH
PITTSBURGH
PITTSBURGH
CRESCENT
SAXONBURG
BRIDGEVILLE
NEW KENSINGTON
BRADDOCK
RUSSELLTON
BADEN
MURRYSVILLE
MORGAN
MARS
BAIRDFORD
FINLEYVILLE
AMBRIDGE
NATRONA HEIGHTS
LEETSDALE
MC KEES ROCKS
SOUTH PARK
SEWICKLEY
PITTSBURGH
PITTSBURGH
BUNOLA
INDIANOLA
PITTSBURGH
CARNEGIE
MCKEESPORT
PITTSBURGH
CHESWICK
GREENOCK
PITTSBURGH
PITTSBURGH
ALLISON PARK
GLASSPORT
PITTSBURGH
BETHEL PARK
PITTSBURGH
PITTSBURGH
CORAOPOLIS
PITTSBURGH
BAKERSTOWN
CLAIRTON
PITTSBURGH
GLENSHAW
SUTERSVILLE
PITTSBURGH
CECIL
OAKDALE
MCKEESPORT
PITTSBURGH
CUDDY
CREIGHTON
MONONGAHELA
BRADFORD WOODS
TURTLE CREEK
BUENA VISTA
SPRINGDALE
PITTSBURGH
PITTSBURGH
WEST MIFFLIN
PITTSBURGH
PITTSBURGH
VALENCIA
PITTSBURGH
WEST ELIZABETH
PITTSBURGH
TARENTUM
WE

In [20]:
print(pittsburgh_venues.shape)
pittsburgh_venues

(683, 8)


Unnamed: 0,ZIP,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,15224,PITTSBURGH,40.464263,-79.945118,Spak Brothers Pizza and More,40.465011,-79.942543,Pizza Place
1,15224,PITTSBURGH,40.464263,-79.945118,Mixtape,40.465413,-79.944878,Lounge
2,15224,PITTSBURGH,40.464263,-79.945118,Artisan Cafe,40.465171,-79.943960,Coffee Shop
3,15224,PITTSBURGH,40.464263,-79.945118,bantha tea,40.465110,-79.943920,Tea Room
4,15224,PITTSBURGH,40.464263,-79.945118,Most-Wanted Fine Art,40.465060,-79.943450,Art Gallery
...,...,...,...,...,...,...,...,...
678,15201,PITTSBURGH,40.475403,-79.953685,Dive Bar,40.479549,-79.954980,Bar
679,15201,PITTSBURGH,40.475403,-79.953685,Remedy Restaurant and Lounge,40.478927,-79.955443,Bar
680,15201,PITTSBURGH,40.475403,-79.953685,Barb's Country Kitchen,40.474496,-79.957889,Diner
681,15201,PITTSBURGH,40.475403,-79.953685,"Metamorphosis Organic Salon, Spa, & Wellness S...",40.478726,-79.955444,Salon / Barbershop


In [21]:
print('There are {} unique categories.'.format(len(pittsburgh_venues['Venue Category'].unique())))

There are 189 unique categories.


In [22]:
#one hot encoding
pittsburgh_onehot = pd.get_dummies(pittsburgh_venues[['Venue Category']], prefix = "", prefix_sep="")

#add neighborhood column back to dataframe
pittsburgh_onehot['Neighborhood'] = pittsburgh_venues['Neighborhood']

#move neighborhood column to first column
fixed_columns = [pittsburgh_onehot.columns[-1]]+list(pittsburgh_onehot.columns[:-1])
pittsburgh_onehot = pittsburgh_onehot[fixed_columns]

pittsburgh_onehot.head()

Unnamed: 0,Neighborhood,ATM,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,PITTSBURGH,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,PITTSBURGH,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,PITTSBURGH,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,PITTSBURGH,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,PITTSBURGH,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
#add ZIP column back to dataframe
pittsburgh_onehot['ZIP'] = pittsburgh_venues['ZIP']

#move ZIP column to first column
fixed_columns = [pittsburgh_onehot.columns[-1]]+list(pittsburgh_onehot.columns[:-1])
pittsburgh_onehot = pittsburgh_onehot[fixed_columns]

pittsburgh_onehot['ZIP'] = pittsburgh_onehot['ZIP'].astype(str)

pittsburgh_onehot

Unnamed: 0,ZIP,Neighborhood,ATM,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,15224,PITTSBURGH,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,15224,PITTSBURGH,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,15224,PITTSBURGH,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,15224,PITTSBURGH,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,15224,PITTSBURGH,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
678,15201,PITTSBURGH,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
679,15201,PITTSBURGH,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
680,15201,PITTSBURGH,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
681,15201,PITTSBURGH,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
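#Group rows by ZIP and take the mean frequency of occurrence for each category
#(the text-valued Neighborhood column is dropped first so that only the one-hot columns are averaged)
pittsburgh_grouped = pittsburgh_onehot.drop(columns='Neighborhood').groupby('ZIP').mean().reset_index()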

pittsburgh_grouped

Unnamed: 0,ZIP,ATM,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,15003,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,15006,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,15007,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,15014,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,15015,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,15668,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
91,16046,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92,16056,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93,16059,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
pittsburgh_grouped.shape

(95, 190)
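
    Taking the mean of one-hot rows per ZIP code gives each category's share of that ZIP code's venues. The sketch below shows the same arithmetic in plain Python on a tiny made-up sample (not the real Foursquare results), so the numbers can be checked by hand.

```python
from collections import Counter, defaultdict

# (ZIP, venue category) pairs; a made-up sample for illustration only
venues = [
    ('15224', 'Pizza Place'),
    ('15224', 'Coffee Shop'),
    ('15224', 'Pizza Place'),
    ('15224', 'Art Gallery'),
    ('15017', 'Golf Course'),
]

# count how often each category occurs within each ZIP code
counts = defaultdict(Counter)
for zip_code, category in venues:
    counts[zip_code][category] += 1

# share of each category among a ZIP code's venues; this equals the column
# means of the one-hot rows that belong to that ZIP code
frequencies = {
    zip_code: {cat: n / sum(c.values()) for cat, n in c.items()}
    for zip_code, c in counts.items()
}
print(frequencies['15224'])
```

    A ZIP code with a single venue ends up with one category at a frequency of 1.0, which is why sparsely covered ZIP codes are described by very few categories.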

    Next, let's print the top 5 most common venues for each ZIP code.

In [26]:
num_top_venues = 5

for ZIP in pittsburgh_grouped['ZIP']:
    print("----"+ZIP+"----")
    temp = pittsburgh_grouped[pittsburgh_grouped['ZIP']==ZIP].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq']=temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending = False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----15003----
                    venue  freq
0            Liquor Store  0.33
1             Coffee Shop  0.33
2  Furniture / Home Store  0.33
3            Optical Shop  0.00
4             Other Event  0.00


----15006----
                 venue  freq
0         Liquor Store  0.25
1  Sporting Goods Shop  0.12
2          Pizza Place  0.12
3       Shipping Store  0.12
4        Grocery Store  0.12


----15007----
          venue  freq
0   Pizza Place  0.50
1  Burger Joint  0.25
2    Playground  0.25
3           ATM  0.00
4     Pet Store  0.00


----15014----
              venue  freq
0          Ski Area  0.25
1       Post Office  0.25
2  Business Service  0.25
3       Planetarium  0.25
4         Pet Store  0.00


----15015----
          venue  freq
0  Home Service  0.33
1   Post Office  0.33
2     Gift Shop  0.33
3           ATM  0.00
4     Pet Store  0.00


----15017----
          venue  freq
0   Golf Course   1.0
1           ATM   0.0
2   Music Venue   0.0
3     Nightclub   0.0
4  Noodle 

4             Optical Shop   0.0


----15206----
          venue  freq
0   Coffee Shop  0.33
1  Intersection  0.33
2  Tennis Court  0.33
3           ATM  0.00
4  Noodle House  0.00


----15207----
              venue  freq
0      Home Service  0.25
1        Water Park  0.25
2  Business Service  0.25
3               Bar  0.25
4               ATM  0.00


----15209----
                        venue  freq
0               Grocery Store  0.33
1  Construction & Landscaping  0.33
2              Farmers Market  0.33
3                         ATM  0.00
4                Optical Shop  0.00


----15210----
         venue  freq
0      Brewery  0.33
1  Video Store  0.33
2         Park  0.33
3    Pet Store  0.00
4    Nightclub  0.00


----15211----
                 venue  freq
0  American Restaurant   0.2
1          Pizza Place   0.2
2                 Park   0.2
3             Pharmacy   0.2
4       Breakfast Spot   0.2


----15213----
                venue  freq
0         Pizza Place  0.07
1         C

In [27]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [28]:
#Organize our venues by the most commonly occurring within each Zip Code
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['ZIP']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))
        
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['ZIP'] = pittsburgh_grouped['ZIP']

for ind in np.arange(pittsburgh_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(pittsburgh_grouped.iloc[ind, :], num_top_venues)
    
neighborhoods_venues_sorted.head()


Unnamed: 0,ZIP,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,15003,Furniture / Home Store,Liquor Store,Coffee Shop,Yoga Studio,Entertainment Service,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market
1,15006,Liquor Store,Sporting Goods Shop,Pizza Place,Grocery Store,Pet Store,Shipping Store,Bank,Electronics Store,Fish & Chips Shop,Field
2,15007,Pizza Place,Burger Joint,Playground,Yoga Studio,Electronics Store,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
3,15014,Post Office,Planetarium,Business Service,Ski Area,Yoga Studio,Electronics Store,Field,Fast Food Restaurant,Farmers Market,Farm
4,15015,Home Service,Gift Shop,Post Office,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit


In [29]:
neighborhoods_venues_sorted.shape

(95, 11)
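
    One thing to keep in mind when reading the table above: for ZIP 15003 the top-5 printout shows only three categories with a non-zero frequency, yet the sorted table still fills all ten columns. The remaining entries are ties at a frequency of zero, placed by sort order. As a sketch of one way to avoid this, the hypothetical helper below ranks only categories that actually occur; it works on a plain dictionary and is not the function used in this notebook.

```python
def top_venues(freqs, n):
    """Return up to n categories ordered by frequency, skipping those at zero.

    Hypothetical variant of the ranking step; ties are broken alphabetically
    so the result does not depend on column order.
    """
    present = [(cat, f) for cat, f in freqs.items() if f > 0]
    present.sort(key=lambda pair: (-pair[1], pair[0]))
    return [cat for cat, _ in present[:n]]

# frequencies printed earlier for ZIP 15003 (rounded to two decimals)
zip_15003 = {
    'Liquor Store': 0.33,
    'Coffee Shop': 0.33,
    'Furniture / Home Store': 0.33,
    'Optical Shop': 0.00,
    'Other Event': 0.00,
}
print(top_venues(zip_15003, 10))
```

    With this variant a ZIP code with three observed categories yields a list of three names, so a shorter list signals thin data instead of being padded.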

    Now, with our venues sorted, we can begin clustering neighborhoods using k-means. We use k-means here because it is simple and works directly on the numeric category frequencies computed for each ZIP code. It groups ZIP codes with similar venue profiles, which lets us characterize each cluster by the venue categories that dominate it.

In [30]:
kclusters = 7

pittsburgh_grouped_clustering = pittsburgh_grouped.drop(columns='ZIP')

kmeans = KMeans(n_clusters = kclusters, random_state = 1).fit(pittsburgh_grouped_clustering)

kmeans.labels_[0:10]

kmeans.labels_.dtype

dtype('int32')
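
    For intuition about what the clustering step does with the frequency vectors, here is a minimal pure-Python sketch of Lloyd's algorithm, the basic k-means iteration. It is not scikit-learn's implementation: it uses a simple deterministic farthest-point initialisation and a fixed number of iterations, and the toy vectors are made up for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=20):
    """Basic Lloyd's algorithm; returns (labels, centroids).

    Illustrative sketch only (assumed farthest-point initialisation, no restarts).
    """
    # initialisation: first point, then repeatedly the point farthest from all chosen centroids
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))

    labels = [0] * len(points)
    for _ in range(n_iter):
        # assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# made-up frequency vectors over (Golf Course, Pizza Place, Recreation Center)
points = [
    (1.0, 0.0, 0.0), (0.9, 0.1, 0.0),
    (0.0, 1.0, 0.0), (0.1, 0.9, 0.0),
    (0.0, 0.0, 1.0), (0.0, 0.1, 0.9),
]
labels, centroids = kmeans(points, k=3)
print(labels)
```

    Rows dominated by the same category end up with the same label, which mirrors how ZIP codes with a single dominant venue type form small, tight clusters in the results.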

In [31]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

pittsburgh_merged = neighborhoods

pittsburgh_merged = pittsburgh_merged.join(neighborhoods_venues_sorted.set_index('ZIP'), on = 'ZIP', how ='inner')

pittsburgh_merged.head()

Unnamed: 0,ZIP,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,15224,PITTSBURGH,40.464263,-79.945118,0,Grocery Store,Pizza Place,Sandwich Place,Art Gallery,New American Restaurant,Coffee Shop,Bookstore,Breakfast Spot,Italian Restaurant,Japanese Restaurant
1,15202,PITTSBURGH,40.467764,-80.053123,0,Food Truck,Discount Store,Beer Store,Grocery Store,Sandwich Place,Pharmacy,Yoga Studio,Entertainment Service,Field,Fast Food Restaurant
4,15216,PITTSBURGH,40.401802,-80.034334,0,Print Shop,Plaza,Pizza Place,Light Rail Station,Train Station,Park,Bar,Yoga Studio,Electronics Store,Fast Food Restaurant
5,15110,DUQUESNE,40.372431,-79.850058,0,American Restaurant,Bus Station,Discount Store,Business Service,Grocery Store,Yoga Studio,Event Space,Flower Shop,Fish & Chips Shop,Field
6,15140,PITCAIRN,40.405381,-79.775416,4,Pizza Place,Auto Garage,Park,Yoga Studio,Electronics Store,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm


In [42]:
pittsburgh_merged.shape

(96, 15)

    A map similar to the one used earlier allows us to visualize how the neighborhood clusters are spread throughout Pittsburgh and the surrounding area.

In [32]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

x = np.arange(kclusters)
ys = [i +x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0,1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(pittsburgh_merged['Latitude'], pittsburgh_merged['Longitude'], pittsburgh_merged['Neighborhood'], pittsburgh_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[cluster-1],
        fill = True,
        fill_color = rainbow[cluster-1],
        fill_opacity = 0.7).add_to(map_clusters)
                        
map_clusters
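
    The map code looks colours up as `rainbow[cluster-1]`, so label 0 wraps around to the last palette entry. Each cluster still gets its own colour, but the mapping is easy to misread. As a sketch of a more direct alternative, the hypothetical helper below builds one hex colour per label with the standard library's `colorsys`, using evenly spaced hues in place of the matplotlib colormap.

```python
import colorsys

def cluster_palette(k):
    """Return k hex colours with evenly spaced hues (hypothetical helper)."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        palette.append('#{:02x}{:02x}{:02x}'.format(round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(7)
# index directly by label, so cluster 0 gets palette[0]
colour_for_cluster = {label: palette[label] for label in range(7)}
print(colour_for_cluster)
```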

#### 4. Results

    With our clustering complete, we can narrow down which neighborhoods could be suitable for a new grocery store location. Let's begin with Cluster 0. A significant portion of our neighborhoods has been grouped into Cluster 0, including the entirety of downtown Pittsburgh. For simplicity's sake, we will set Cluster 0 aside for now.

In [33]:
#Cluster 0
pittsburgh_merged.loc[pittsburgh_merged['Cluster Labels'] == 0, pittsburgh_merged.columns[[0] + [1] + list(range(5, pittsburgh_merged.shape[1]))]]

Unnamed: 0,ZIP,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,15224,PITTSBURGH,Grocery Store,Pizza Place,Sandwich Place,Art Gallery,New American Restaurant,Coffee Shop,Bookstore,Breakfast Spot,Italian Restaurant,Japanese Restaurant
1,15202,PITTSBURGH,Food Truck,Discount Store,Beer Store,Grocery Store,Sandwich Place,Pharmacy,Yoga Studio,Entertainment Service,Field,Fast Food Restaurant
4,15216,PITTSBURGH,Print Shop,Plaza,Pizza Place,Light Rail Station,Train Station,Park,Bar,Yoga Studio,Electronics Store,Fast Food Restaurant
5,15110,DUQUESNE,American Restaurant,Bus Station,Discount Store,Business Service,Grocery Store,Yoga Studio,Event Space,Flower Shop,Fish & Chips Shop,Field
11,15120,HOMESTEAD,Rental Service,Pizza Place,Sandwich Place,Baseball Field,Liquor Store,Yoga Studio,Electronics Store,Field,Fast Food Restaurant,Farmers Market
...,...,...,...,...,...,...,...,...,...,...,...,...
117,15214,PITTSBURGH,Construction & Landscaping,Bus Stop,Gym,Baseball Field,Yoga Studio,Event Space,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant
118,15082,STURGEON,Bar,Trail,Post Office,Pizza Place,Dry Cleaner,Electronics Store,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market
120,15219,PITTSBURGH,American Restaurant,Coffee Shop,Lounge,Sandwich Place,Rental Car Location,Hotel,Theater,Burger Joint,Sporting Goods Shop,Residential Building (Apartment / Condo)
121,15236,PITTSBURGH,Home Service,Lounge,Food Truck,Food,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm


    Next, Cluster 1 reveals our first trend in common venues, with Construction & Landscaping topping the list, followed by Yoga Studios and Electronics Stores. It can also be noted that these neighborhoods occupy distinct locations throughout the Pittsburgh area, as seen in the map above. The neighborhoods in Cluster 1 also list Flower Shops, Fields, and Farmers Markets among their common venues, although the identical ordering of these lower-ranked entries across ZIP codes suggests that many of them are ties at a frequency of zero rather than venues actually found there.

In [34]:
#Cluster 1
pittsburgh_merged.loc[pittsburgh_merged['Cluster Labels'] == 1, pittsburgh_merged.columns[[0] + [1] + list(range(5, pittsburgh_merged.shape[1]))]]

Unnamed: 0,ZIP,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,15276,PITTSBURGH,Moving Target,Construction & Landscaping,Entertainment Service,Food,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
17,15126,IMPERIAL,Construction & Landscaping,Yoga Studio,Electronics Store,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit
18,15075,RURAL RIDGE,Construction & Landscaping,Yoga Studio,Electronics Store,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit
73,15116,GLENSHAW,Moving Target,Construction & Landscaping,Entertainment Service,Food,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
75,15231,PITTSBURGH,Construction & Landscaping,Yoga Studio,Electronics Store,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit
77,15071,OAKDALE,Construction & Landscaping,Yoga Studio,Electronics Store,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit
98,15085,TRAFFORD,Construction & Landscaping,Yoga Studio,Electronics Store,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit
104,15227,PITTSBURGH,Construction & Landscaping,Photography Studio,Yoga Studio,Electronics Store,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm


    Cluster 2 contains only two neighborhoods. Looking at the map, we can see that these two neighborhoods are just a couple of blocks apart. They are also in close proximity to Good Shepherd Catholic Cemetery, the Monroeville Landfill, and Restland Memorial Park. Because so little space is available to work with, it is safe to move on from Cluster 2.

In [35]:
#Cluster 2
pittsburgh_merged.loc[pittsburgh_merged['Cluster Labels'] == 2, pittsburgh_merged.columns[[0] + [1] + list(range(5, pittsburgh_merged.shape[1]))]]

Unnamed: 0,ZIP,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,15035,EAST MC KEESPORT,Garden Center,Yoga Studio,Food Service,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit
24,15148,WILMERDING,Garden Center,Yoga Studio,Food Service,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit


    Cluster 3 contains three neighborhoods. Two of them, Bridgeville (15017) and Pittsburgh [Upper St. Clair] (15241), are adjacent and, looking at the map, are centered around Boyce Mayview Park and St. Clair Country Club. The third, McKeesport (15135), contains the Youghiogheny Country Club. Thus, the cluster's most common venue of Golf Course makes sense.

In [36]:
#Cluster 3
pittsburgh_merged.loc[pittsburgh_merged['Cluster Labels'] == 3, pittsburgh_merged.columns[[0] + [1] + list(range(5, pittsburgh_merged.shape[1]))]]

Unnamed: 0,ZIP,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,15017,BRIDGEVILLE,Golf Course,Food Service,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit,Event Space
78,15135,MCKEESPORT,Golf Course,Business Service,Food Service,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit
119,15241,PITTSBURGH,Golf Course,Pool,Electronics Store,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit


    Cluster 4 shows a familiar trend in common venues, including Pizza Places, Yoga Studios, and Electronics Stores, similar to Cluster 1. Cluster 4's neighborhoods also occupy distinct areas on the map, with none too close to another. This separation is important, as it signifies that these are similar neighborhoods in completely separate areas of the city.

In [37]:
#Cluster 4
pittsburgh_merged.loc[pittsburgh_merged['Cluster Labels'] == 4, pittsburgh_merged.columns[[0] + [1] + list(range(5, pittsburgh_merged.shape[1]))]]

Unnamed: 0,ZIP,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,15140,PITCAIRN,Pizza Place,Auto Garage,Park,Yoga Studio,Electronics Store,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
63,15045,GLASSPORT,Pizza Place,Sandwich Place,Candy Store,Yoga Studio,Electronics Store,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
67,15237,PITTSBURGH,Pizza Place,Yoga Studio,Electronics Store,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit
70,15007,BAKERSTOWN,Pizza Place,Burger Joint,Playground,Yoga Studio,Electronics Store,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
76,15321,CECIL,Construction & Landscaping,Pizza Place,Baseball Field,Yoga Studio,Entertainment Service,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market


    In contrast to the previous cluster, Cluster 5 contains two neighborhoods located right next to each other. Likely as a result of this proximity, their ten most common venues are identical.

In [38]:
#Cluster 5
pittsburgh_merged.loc[pittsburgh_merged['Cluster Labels'] == 5, pittsburgh_merged.columns[[0] + [1] + list(range(5, pittsburgh_merged.shape[1]))]]

Unnamed: 0,ZIP,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,15090,WEXFORD,Recreation Center,Yoga Studio,Electronics Store,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit
39,16046,MARS,Recreation Center,Yoga Studio,Electronics Store,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm,Exhibit


    Lastly, Cluster 6 resembles Clusters 1 and 4 in being spread out around the city of Pittsburgh. The venue shared across this cluster appears to be Baseball Field, likely indicating suburban areas near school districts.

In [39]:
#Cluster 6
pittsburgh_merged.loc[pittsburgh_merged['Cluster Labels'] == 6, pittsburgh_merged.columns[[0] + [1] + list(range(5, pittsburgh_merged.shape[1]))]]

Unnamed: 0,ZIP,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,15057,MC DONALD,Home Service,Baseball Field,Electronics Store,Auto Workshop,Park,Event Space,Food,Flower Shop,Fish & Chips Shop,Field
10,15239,PITTSBURGH,Baseball Field,Auto Garage,Yoga Studio,Entertainment Service,Food,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market
38,15064,MORGAN,Home Service,Baseball Field,Electronics Store,Auto Workshop,Park,Event Space,Food,Flower Shop,Fish & Chips Shop,Field
50,15290,PITTSBURGH,Baseball Field,Bakery,Yoga Studio,Event Space,Food,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market
58,15047,GREENOCK,Bike Trail,Harbor / Marina,Baseball Field,Yoga Studio,Entertainment Service,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market
65,15102,BETHEL PARK,Baseball Field,Yoga Studio,Food Truck,Food,Flower Shop,Fish & Chips Shop,Field,Fast Food Restaurant,Farmers Market,Farm
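
    The per-cluster cells above repeat the same selection with only the cluster label changing. As a minimal sketch, that selection could be wrapped in a small helper. The function name `show_cluster` is hypothetical, and the toy frame below only imitates the assumed layout of `pittsburgh_merged` (the names and values of columns 2 to 4 are placeholders, not the notebook's data).

```python
import pandas as pd


def show_cluster(merged: pd.DataFrame, label: int) -> pd.DataFrame:
    """Return the rows of one cluster, keeping columns 0, 1, and 5 onward."""
    keep = [0, 1] + list(range(5, merged.shape[1]))
    return merged.loc[merged["Cluster Labels"] == label, merged.columns[keep]]


# Toy frame: "Latitude"/"Longitude" and their zero values are placeholders (assumed).
toy = pd.DataFrame({
    "ZIP": [15090, 16046, 15017],
    "Neighborhood": ["WEXFORD", "MARS", "BRIDGEVILLE"],
    "Latitude": [0.0, 0.0, 0.0],
    "Longitude": [0.0, 0.0, 0.0],
    "Cluster Labels": [5, 5, 3],
    "1st Most Common Venue": ["Recreation Center", "Recreation Center", "Golf Course"],
})

cluster_5 = show_cluster(toy, 5)
print(cluster_5)
```

    With the real dataframe, `show_cluster(pittsburgh_merged, 6)` would stand in for the longer expression used in each cell.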


#### 5. Discussion

    With our neighborhoods now clustered and analyzed, let's discuss our results. To begin, let's dig a little deeper into our Zip Code dataset. Sourced from Pennsylvania Spatial Data Access (PASDA), the dataset we used in our analysis was a subset of a larger dataset designed for mapping the Zip Code boundaries of Pittsburgh and the surrounding area. When searching for relevant data that could be used to answer our question, "Which neighborhoods would be suitable for a new grocery store location?", this dataset was chosen because it conveniently includes both zip codes, which could be used for identification, and geospatial coordinates, which could be used for mapping. However, the Neighborhood field of the data is troublesome, because 'Pittsburgh' is frequently listed as the neighborhood in place of the proper Township or Borough. Fortunately, this problem did not directly affect the results of our study, and it could be solved in future analysis by acquiring additional data on the Townships and Boroughs of the area.
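
    As a rough sketch of how the Neighborhood field could be repaired once township and borough data is available, a ZIP-to-municipality lookup can be mapped over the table, falling back to the original name where no entry exists. The lookup dictionary here is hypothetical; its two entries are the ones already named in this report, and a complete lookup would have to come from an additional data source.

```python
import pandas as pd

# Toy stand-in for the zip code table, using ZIP/NAME pairs that appear in this report.
zips = pd.DataFrame({
    "ZIP": [15241, 15227, 15017],
    "NAME": ["PITTSBURGH", "PITTSBURGH", "BRIDGEVILLE"],
})

# Hypothetical ZIP -> township/borough lookup (assumed; would be sourced separately).
municipality_by_zip = {15241: "UPPER ST. CLAIR", 15227: "BRENTWOOD"}

# Prefer the municipality where one is known, otherwise keep the original NAME.
zips["Neighborhood"] = zips["ZIP"].map(municipality_by_zip).fillna(zips["NAME"])
print(zips)
```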
    
    Next, let's acknowledge Cluster 0 once more. Containing more than 50% of our neighborhoods, Cluster 0 shows few clear trends among its most common venues upon initial review. The size and variety of this cluster, likely due to the close proximity of the neighborhoods within downtown Pittsburgh and nearby areas, make analysis difficult. Solutions to this issue may include restricting our analysis to just these downtown neighborhoods, narrowing our search radius, increasing the number of clusters when conducting k-means, or attempting a different method of clustering altogether. It can also be noted that this cluster contains the only occurrence of 'Grocery Store' within the top 10 most common venues, found in Neighborhood 0 - Pittsburgh (15224).
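
    One way to explore the "more clusters" option is to compare several values of k with a silhouette score rather than fixing k in advance. The sketch below uses scikit-learn on synthetic data; the feature matrix `X` is a stand-in and not the notebook's venue-frequency data, and silhouette scoring is a suggested technique, not the method used in this study.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the neighborhood feature matrix: three well-separated groups.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score indicates tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

    On real venue data the scores are usually much closer together than on this toy example, so the result is better read as a guide than as a definitive choice of k.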
    
    Despite these potential issues, the clusters produced by our study still provide interesting results. In particular, Clusters 1, 4, and 6 each provide insights into several neighborhoods in the Pittsburgh metropolitan area. Beginning with Cluster 1, we see that these neighborhoods have construction and landscaping businesses as their most common venue, suggesting ongoing industry and development within these areas. Additionally, these neighborhoods host fields, farms, and farmers markets, which all suggest more suburban or rural environments. This is supported by their locations on the map: all are found on the outskirts of the metropolitan area, with the exception of Brentwood - Pittsburgh (15227). Moving on to Cluster 4, we see that these neighborhoods are also located on the outskirts of the mapped area. Sharing a variety of restaurants, shops, and stores among their most common venues, this cluster suggests suburban areas with a busy economic base. In contrast to Cluster 1, grocery-style venues such as farms and farmers markets are less common in this cluster, suggesting room for opportunity. Finally, Cluster 6 stands out for its most common venue, baseball fields. Including suburban areas such as South Fayette, Bethel Park, and Plum, the neighborhoods of this cluster include popular school districts of the Pittsburgh area, suggesting a large youth and family-based population.
    



#### 6. Conclusion

    In conclusion, we have successfully narrowed down our search for neighborhoods that could be suitable to host a new grocery store location. By focusing efforts on Clusters 1, 4, and 6, we have found 19 neighborhoods that could likely support and utilize a new grocery store, reducing our initial search by 80%. With further analysis and research, this shortlist could be narrowed even further. Additionally, a more detailed study into the neighborhoods of Cluster 0 and the downtown Pittsburgh area would allow for further insight. By continuing this study with demographic and economic data on Pittsburgh's neighborhoods, plus data on the locations of existing grocery stores, we can very likely find a more concrete answer, identifying more specific locations and their likelihood of success.
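
    The shortlist size and the share of the search space removed can be computed directly from the cluster labels. The sketch below uses a toy label series, not the notebook's actual labels, so its numbers are illustrative only. With the real data, the `Cluster Labels` column of `pittsburgh_merged` would take its place.

```python
import pandas as pd

# Toy cluster labels, one per neighborhood (illustrative values only).
labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 4, 6], name="Cluster Labels")

# Share of neighborhoods falling in each cluster.
shares = labels.value_counts(normalize=True).sort_index()

# Neighborhoods in the clusters of interest, and the resulting reduction in the search.
target_clusters = [1, 4, 6]
shortlist_size = int(labels.isin(target_clusters).sum())
reduction = 1 - shortlist_size / len(labels)

print(shares)
print(f"{shortlist_size} of {len(labels)} neighborhoods shortlisted ({reduction:.0%} reduction)")
```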