# Battle of the Neighborhoods - Finding Less Developed Areas with Access to Services in Utah County, Utah
#### by Quinn Koller

## Introduction

Utah County is the second-largest county in the U.S. state of Utah by population, with an estimated 636,235 residents. It is also the fastest-growing population center in the state, so housing is expanding quickly, with services and restaurants following suit. Driving through the area, one is struck by the impression that people are living literally on top of each other. If one wants to find an underdeveloped area that still has access to a reasonable number of services, where would one look? This analysis addresses that question.

## Data

The data sets that will be used for this project are as follows:

US zip code latitude and longitude file downloaded from https://public.opendatasoft.com/explore/?sort=modified&q=zip+code. This data set doesn’t have Utah county information, so another data set with this information will need to be used. 

Utah county information with zip codes downloaded from https://opendata.gis.utah.gov/datasets/utah-zip-code-areas/data. This data set doesn’t have latitude and longitude.

A new data set will be created by combining the two previous files using zip code as the common key, dropping extraneous columns, and then filtering the data down to include only the zip codes for Utah County.
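As a rough illustration of the merge-and-filter step described above, the sketch below joins two tiny hand-made tables on a shared `Zip` column and keeps a single county. The column names follow the notebook and the three sample rows are copied from the outputs further down. The `validate` argument is an optional safeguard assumed here; the notebook's own merge does not use it.

```python
import pandas as pd

# Tiny stand-ins for the two source files (rows taken from the notebook outputs).
zips = pd.DataFrame({
    "Zip": [84621, 84042, 84097],
    "COUNTYNBR": [20, 25, 25],
    "NAME": ["AXTELL", "LINDON", "OREM"],
})
latlong = pd.DataFrame({
    "Zip": [84621, 84042, 84097],
    "City": ["Axtell", "Lindon", "Orem"],
    "Latitude": [39.050838, 40.338552, 40.301444],
    "Longitude": [-111.84775, -111.7162, -111.67485],
})

# Inner join on the shared key; validate raises if a zip code is duplicated
# on either side, which would otherwise silently multiply rows.
merged = pd.merge(zips, latlong, on="Zip", how="inner", validate="one_to_one")

# Keep only Utah County (COUNTYNBR 25), then drop the column once it has
# served its purpose.
utah_county = merged.loc[merged["COUNTYNBR"] == 25].drop(columns=["COUNTYNBR"])
```

Both files must store the zip code with the same type (both integers or both strings) for the keys to match.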

The resulting Utah County geocoded data set will be used to pull venue data for the area of interest from FourSquare.

## Methodology

• The US Zip Code latitude and longitude file will be merged with the Utah county information file to create a new file for Utah County communities and zip codes including latitude and longitudes.

• Venue data will be pulled from FourSquare.

• Data will be K-Means clustered to arrive at similarities amongst neighborhoods.

• Finally, the data will be visually assessed using plots and maps from various Python libraries.

## Results

### Step 1 - import the zip code and latitude/longitude data, clean them, merge them, and create the master geocode file that will be used in running queries against the FourSquare database.

In [1]:
# The code was removed by Watson Studio for sharing.



Unnamed: 0,ZIP5,COUNTYNBR,NAME
0,84621,20,AXTELL
1,84622,20,CENTERFIELD
2,84634,20,GUNNISON
3,84638,14,LEAMINGTON
4,84728,14,GARRISON


In [2]:
# Let's rename the field we intend to use as the merge key so it has the same name in both files.

UtahZips.rename(columns={'ZIP5':'Zip'}, inplace = True)
UtahZips.head()

Unnamed: 0,Zip,COUNTYNBR,NAME
0,84621,20,AXTELL
1,84622,20,CENTERFIELD
2,84634,20,GUNNISON
3,84638,14,LEAMINGTON
4,84728,14,GARRISON


In [3]:
# Now let's read in the second data file and take a look at it

body = client_6e41aa84f0314e319cc417c5c2a8f720.get_object(Bucket='wasatchfronthomepricesandvenuedat-donotdelete-pr-w40fp5ut7rkdgd',Key='Utah_LatLong.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

Utah_LatLong = pd.read_csv(body)
Utah_LatLong.head()

# That worked too!


Unnamed: 0,Zip,City,Latitude,Longitude
0,84401,Ogden,41.224911,-111.98346
1,84020,Draper,40.514843,-111.87294
2,84123,Salt Lake City,40.659514,-111.92226
3,84639,Levan,39.47178,-111.94431
4,84716,Boulder,37.94512,-111.09991


In [4]:
# Now let's use the Zip field to merge the two files together and have a look at it.

UtahCounty = pd.merge(UtahZips, Utah_LatLong, on='Zip')
UtahCounty.head()

Unnamed: 0,Zip,COUNTYNBR,NAME,City,Latitude,Longitude
0,84621,20,AXTELL,Axtell,39.050838,-111.84775
1,84622,20,CENTERFIELD,Centerfield,39.114649,-111.80511
2,84634,20,GUNNISON,Gunnison,39.193513,-111.85047
3,84638,14,LEAMINGTON,Leamington,39.532384,-112.27113
4,84728,14,GARRISON,Garrison,38.970536,-113.7085


In [5]:
# Now we will filter the file to include only zip codes within Utah County, which in this case is COUNTYNBR 25

UtahCountyGeocode = UtahCounty.loc[UtahCounty['COUNTYNBR'] == 25]
UtahCountyGeocode.head()

Unnamed: 0,Zip,COUNTYNBR,NAME,City,Latitude,Longitude
11,84013,25,CEDAR VALLEY,Cedar Valley,40.288953,-112.09859
40,84042,25,LINDON,Lindon,40.338552,-111.7162
50,84097,25,OREM,Orem,40.301444,-111.67485
70,84663,25,SPRINGVILLE,Springville,40.168205,-111.59577
71,84664,25,MAPLETON,Mapleton,40.123394,-111.56665


In [6]:
# We don't need the COUNTYNBR column anymore, so let's drop it

Geocode = UtahCountyGeocode.drop(['COUNTYNBR'], axis=1)
Geocode.head()

Unnamed: 0,Zip,NAME,City,Latitude,Longitude
11,84013,CEDAR VALLEY,Cedar Valley,40.288953,-112.09859
40,84042,LINDON,Lindon,40.338552,-111.7162
50,84097,OREM,Orem,40.301444,-111.67485
70,84663,SPRINGVILLE,Springville,40.168205,-111.59577
71,84664,MAPLETON,Mapleton,40.123394,-111.56665


In [7]:
# Let's rename the Zip column to Neighborhood because that is the column name the FourSquare venue code below uses
Geocode.rename(columns={'Zip':'Neighborhood'}, inplace = True)


### Step 2 - now we will use the geocode file we created to pull data from FourSquare

In [8]:
#variable data needed to construct FourSquare url
CLIENT_ID = 'LV2X45YE1ZXFTWTVTP02AU0SL3JAJJRDGPFQL3NMZQLW4DWZ'
CLIENT_SECRET = 'TFFPP3AQ32R1GDBW5QDEJ45KERA3WH12X0ULRQWMV4UO0LEM'
VERSION = '20180604'
RADIUS = 1610 # Set a mile radius because this is a large area
LIMIT = 200

In [9]:
# Let's define a function to pull the FourSquare data

def getNearbyVenues(names, latitudes, longitudes):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            RADIUS, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
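The function above indexes straight into the response, so a zip code whose response has no `groups` entry, or a venue with an empty category list, would raise an error. A minimal sketch of a more defensive parser follows. `parse_explore_items` is a hypothetical helper, not part of the notebook, and the sample payload is hand-written to mimic the nesting that `getNearbyVenues` relies on; it is not a real API response.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore-style response into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # Fall back to a single empty dict when the category list is empty.
        categories = venue.get("categories") or [{}]
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Uncategorized"),
        ))
    return rows


# Hand-written payload with the same nesting that getNearbyVenues indexes into.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "The Grotto",
               "location": {"lat": 40.299631, "lng": -112.105453},
               "categories": [{"name": "Speakeasy"}]}},
    {"venue": {"name": "Made-up venue without a category",
               "location": {"lat": 40.3, "lng": -112.1},
               "categories": []}},
]}]}}

rows = parse_explore_items(sample, 84013, 40.288953, -112.09859)

# A response with no groups yields an empty list instead of an exception.
empty = parse_explore_items({"response": {}}, "no-venues", 0.0, 0.0)
```

The rows come back in the same column order the notebook assigns to `nearby_venues`, so a helper like this could be dropped into the loop without changing the rest of the analysis.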

In [10]:
# Now we pull the data from FourSquare

UtahCountyVenues = getNearbyVenues(names=Geocode['Neighborhood'],
                                 latitudes=Geocode['Latitude'],
                                 longitudes=Geocode['Longitude']
                                 )

84013
84042
84097
84663
84664
84601
84062
84059
84057
84003
84004
84606
84604
84602
84653
84655
84651
84660
84633
84626
84043
84058


In [11]:
# Let's take a peek and see what we got

UtahCountyVenues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,84013,40.288953,-112.09859,The Grotto,40.299631,-112.105453,Speakeasy
1,84013,40.288953,-112.09859,"Cedar Valley, UT",40.30241,-112.103012,Playground
2,84042,40.338552,-111.7162,Lindon Aquatic Center,40.339921,-111.716945,Pool
3,84042,40.338552,-111.7162,Smoking Apple,40.33885,-111.717337,BBQ Joint
4,84042,40.338552,-111.7162,Kneaders Bakery & Cafe,40.332979,-111.712375,Bakery


In [12]:
# Group by Neighborhood

UtahCountyVenues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
84003,7,7,7,7,7,7
84004,7,7,7,7,7,7
84013,2,2,2,2,2,2
84042,34,34,34,34,34,34
84043,3,3,3,3,3,3
84057,56,56,56,56,56,56
84058,37,37,37,37,37,37
84062,7,7,7,7,7,7
84097,43,43,43,43,43,43
84601,32,32,32,32,32,32


In [13]:
# one hot encoding
UtahCountyVenues_onehot = pd.get_dummies(UtahCountyVenues[['Venue Category']], prefix="", prefix_sep="")
UtahCountyVenues_onehot.drop(['Neighborhood'], axis = 1, inplace = True, errors='ignore') 
UtahCountyVenues_onehot.insert(loc = 0, column = 'Neighborhood', value = UtahCountyVenues['Neighborhood'] )
UtahCountyVenues_onehot.shape

(394, 143)
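To make the encoding and the averaging in the next cells concrete, here is a small sketch on a five-row toy table built from the first rows of the venue output above. After one-hot encoding, taking the mean per zip code gives the share of that zip code's venues that fall in each category. The `dtype=int` argument is an addition assumed here so the indicator columns print as 0/1 regardless of pandas version.

```python
import pandas as pd

# First five venue rows from the notebook output, reduced to two columns.
venues = pd.DataFrame({
    "Neighborhood": [84013, 84013, 84042, 84042, 84042],
    "Venue Category": ["Speakeasy", "Playground", "Pool", "BBQ Joint", "Bakery"],
})

# One indicator column per category, then put the zip code back in front.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the indicators per zip code = fraction of venues in each category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
```

With two venues in 84013, each of its categories gets a share of 0.5; with three venues in 84042, each gets one third. This is why the grouped table in the notebook shows values such as 0.029412 (1 of 34 venues) for Lindon.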

In [14]:
# Another peek. One hot worked.

UtahCountyVenues_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,American Restaurant,Art Museum,Asian Restaurant,Automotive Shop,BBQ Joint,Bakery,Bank,...,Thai Restaurant,Theater,Tour Provider,Toy / Game Store,Trail,Train Station,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park
0,84013,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,84013,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,84042,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,84042,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,84042,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
#Regroup the results by neighborhood and reset the index

UtahCountyVenues_grouped = UtahCountyVenues_onehot.groupby('Neighborhood').mean().reset_index()
UtahCountyVenues_grouped.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,American Restaurant,Art Museum,Asian Restaurant,Automotive Shop,BBQ Joint,Bakery,Bank,...,Thai Restaurant,Theater,Tour Provider,Toy / Game Store,Trail,Train Station,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park
0,84003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,84004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,84013,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,84042,0.0,0.0,0.0,0.0,0.0,0.0,0.029412,0.029412,0.029412,...,0.029412,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0
4,84043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0


In [16]:
# Function to find most common venues

def most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [17]:
# Now to find the most common venues by neighborhood

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = UtahCountyVenues_grouped['Neighborhood']

for ind in np.arange(UtahCountyVenues_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = most_common_venues(UtahCountyVenues_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,84003,Park,Home Service,Business Service,Disc Golf,Gym / Fitness Center,Golf Course,Water Park,Farmers Market,Fabric Shop,Event Space
1,84004,Park,Financial or Legal Service,Moving Target,Mattress Store,Business Service,Farmers Market,Event Space,Fast Food Restaurant,Fabric Shop,Water Park
2,84013,Speakeasy,Playground,Electronics Store,Food,Financial or Legal Service,Fast Food Restaurant,Farmers Market,Fabric Shop,Event Space,Dry Cleaner
3,84042,Pizza Place,Fast Food Restaurant,Mexican Restaurant,Park,Video Store,Convenience Store,Burrito Place,Chinese Restaurant,Sandwich Place,Restaurant
4,84043,Baseball Field,Trail,Water Park,Electronics Store,Food,Financial or Legal Service,Fast Food Restaurant,Farmers Market,Fabric Shop,Event Space
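The column labels above rely on an exception to switch from "st/nd/rd" to "th". As an alternative sketch, a small helper can build the same labels explicitly. `ordinal` is a hypothetical name introduced here; it also handles cases such as 11th and 22nd that the list-based approach would not need for ten columns but would get wrong for longer lists.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10

# Same column labels as the notebook, built without try/except.
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
```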


### Step 3 - Use K-Means to cluster everything

In [18]:
kclusters = 5

UtahCountyVenues_grouped_clustering = UtahCountyVenues_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(UtahCountyVenues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([4, 4, 3, 1, 0, 1, 1, 4, 1, 1], dtype=int32)

In [19]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

#neighborhoods_venues_sorted
UtahCounty_merged = Geocode

# join the sorted venue table onto the geocode table to add latitude/longitude for each neighborhood
UtahCounty_merged = UtahCounty_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

UtahCounty_merged.head()

Unnamed: 0,Neighborhood,NAME,City,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,84013,CEDAR VALLEY,Cedar Valley,40.288953,-112.09859,3.0,Speakeasy,Playground,Electronics Store,Food,Financial or Legal Service,Fast Food Restaurant,Farmers Market,Fabric Shop,Event Space,Dry Cleaner
40,84042,LINDON,Lindon,40.338552,-111.7162,1.0,Pizza Place,Fast Food Restaurant,Mexican Restaurant,Park,Video Store,Convenience Store,Burrito Place,Chinese Restaurant,Sandwich Place,Restaurant
50,84097,OREM,Orem,40.301444,-111.67485,1.0,Trail,Park,Pizza Place,Dessert Shop,Grocery Store,Water Park,Gym,Plaza,Pet Store,Mexican Restaurant
70,84663,SPRINGVILLE,Springville,40.168205,-111.59577,1.0,Pizza Place,Park,Mexican Restaurant,Sandwich Place,Gas Station,Pharmacy,Pool,Construction & Landscaping,Miscellaneous Shop,Coffee Shop
71,84664,MAPLETON,Mapleton,40.123394,-111.56665,4.0,Park,Home Service,Cosmetics Shop,Construction & Landscaping,Water Park,Fast Food Restaurant,Farmers Market,Fabric Shop,Event Space,Electronics Store


In [20]:
# Drop rows with NaN (zip codes that returned no venues); otherwise the map cell fails when it converts the cluster label to int

UtahCounty_merged = UtahCounty_merged.dropna()
UtahCounty_merged



Unnamed: 0,Neighborhood,NAME,City,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,84013,CEDAR VALLEY,Cedar Valley,40.288953,-112.09859,3.0,Speakeasy,Playground,Electronics Store,Food,Financial or Legal Service,Fast Food Restaurant,Farmers Market,Fabric Shop,Event Space,Dry Cleaner
40,84042,LINDON,Lindon,40.338552,-111.7162,1.0,Pizza Place,Fast Food Restaurant,Mexican Restaurant,Park,Video Store,Convenience Store,Burrito Place,Chinese Restaurant,Sandwich Place,Restaurant
50,84097,OREM,Orem,40.301444,-111.67485,1.0,Trail,Park,Pizza Place,Dessert Shop,Grocery Store,Water Park,Gym,Plaza,Pet Store,Mexican Restaurant
70,84663,SPRINGVILLE,Springville,40.168205,-111.59577,1.0,Pizza Place,Park,Mexican Restaurant,Sandwich Place,Gas Station,Pharmacy,Pool,Construction & Landscaping,Miscellaneous Shop,Coffee Shop
71,84664,MAPLETON,Mapleton,40.123394,-111.56665,4.0,Park,Home Service,Cosmetics Shop,Construction & Landscaping,Water Park,Fast Food Restaurant,Farmers Market,Fabric Shop,Event Space,Electronics Store
73,84601,PROVO,Provo,40.230954,-111.68006,1.0,Park,Mexican Restaurant,Sandwich Place,Fast Food Restaurant,Chinese Restaurant,Latin American Restaurant,Convenience Store,Automotive Shop,Trail,Farmers Market
76,84062,PLEASANT GROVE,Pleasant Grove,40.38584,-111.73333,4.0,Trail,Photography Studio,Bike Trail,Construction & Landscaping,Locksmith,Park,Water Park,Electronics Store,Fast Food Restaurant,Farmers Market
107,84057,OREM,Orem,40.311854,-111.70561,1.0,Fast Food Restaurant,Mexican Restaurant,Sandwich Place,Coffee Shop,Gym / Fitness Center,Park,Pizza Place,Grocery Store,Video Store,Theater
120,84003,AMERICAN FORK,American Fork,40.394235,-111.79449,4.0,Park,Home Service,Business Service,Disc Golf,Gym / Fitness Center,Golf Course,Water Park,Farmers Market,Fabric Shop,Event Space
121,84004,ALPINE,Alpine,40.465161,-111.76279,4.0,Park,Financial or Legal Service,Moving Target,Mattress Store,Business Service,Farmers Market,Event Space,Fast Food Restaurant,Fabric Shop,Water Park


### Step 4 - Create the map

In [21]:
# Set the map centerpoint

address = 'Provo, UT'

geolocator = Nominatim(user_agent="qkoller@gmail.com")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of ',address,' are {}, {}.'.format(latitude, longitude))

The geographical coordinates of  Provo, UT  are 40.2338438, -111.6585337.


In [22]:
# Create the map

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(UtahCounty_merged['Latitude'], UtahCounty_merged['Longitude'], UtahCounty_merged['Neighborhood'], UtahCounty_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=7,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


## Discussion

The first thing to note is that I used zip codes in place of neighborhoods. This is because a city can have multiple zip codes under the same city name. Using zip codes gives me a unique identifier for each area to use while querying the FourSquare database.

Second, since this is a large geographic area, zip codes do not offer much granularity. Some of the fastest growing areas have only one zip code and will likely be split into multiple zip codes in the future. Because of the size of the area, the search radius for FourSquare was set to 1610 meters, or roughly one mile.
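As a quick sanity check on that radius, the sketch below converts 1610 meters to miles and measures how far one returned venue sits from its zip code's reference point. The two coordinates come from the first row of the venue table (zip code 84013 and "The Grotto"). The haversine formula and the spherical Earth radius of 6,371 km are assumptions of this sketch; the notebook itself does no distance calculation.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6,371 km."""
    r = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


RADIUS = 1610                          # search radius used in the notebook, meters
radius_miles = RADIUS / 1609.344       # meters per international mile

# Zip code 84013 reference point and "The Grotto", both from the venue table.
d = haversine_m(40.288953, -112.09859, 40.299631, -112.105453)
within = d <= RADIUS
```

A one-mile circle around a single reference point covers only a small part of a large rural zip code, which is one reason sparsely settled areas return so few venues.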

Third, I used five clusters to see whether that would produce greater differentiation between areas in the clustering.
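The number of clusters was chosen by hand. One common way to compare candidate values is the silhouette score, sketched below on synthetic data. This is a different technique from anything in the notebook, and the three made-up blobs stand in for the venue-frequency table, so the result says nothing about Utah County itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three tight, well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.2, size=(20, 2)) for c in centers])

# Fit K-Means for several values of k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score gives the best-separated clusters on this data.
best_k = max(scores, key=scores.get)
```

On the real table, `X` would be `UtahCountyVenues_grouped_clustering`. With only a couple of dozen zip codes the scores will be noisy, so they are better read as a rough guide than as a definitive answer.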

Lastly, it was necessary to drop NaN rows from the merged data set because some of the zip code areas are remote and have no services.
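The sketch below shows where those NaN rows come from. Joining the full geocode table to the clustered table leaves empty cells for any zip code that returned no venues, and it also turns the integer cluster labels into floats, which is why they print as 3.0 and 1.0 in the merged output. The zip code 99999 is a made-up placeholder for an area with no venues.

```python
import pandas as pd

# Geocode table: two real zip codes plus a made-up one with no venues.
geocode = pd.DataFrame({"Neighborhood": [84013, 84042, 99999]})

# Clustered table only contains zip codes that had at least one venue.
clusters = pd.DataFrame({
    "Neighborhood": [84013, 84042],
    "Cluster Labels": [3, 1],
})

# Left join, as in the notebook: unmatched zip codes get NaN.
merged = geocode.join(clusters.set_index("Neighborhood"), on="Neighborhood")
has_missing = merged["Cluster Labels"].isna().any()
label_dtype_before = merged["Cluster Labels"].dtype

# Drop the unmatched rows and restore integer labels.
cleaned = merged.dropna().copy()
cleaned["Cluster Labels"] = cleaned["Cluster Labels"].astype(int)
```

Keeping a list of the dropped zip codes before calling `dropna` may be worthwhile, since areas with no venues at all are arguably the least developed ones.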



## Conclusion

The main roads in Utah County are U.S. Interstate 15 and U.S. Highway 89. On the map, all the zip codes clustered together in purple follow these two routes, and K-Means clustering shows little differentiation between them. In fact, anyone driving either route will find houses and apartment buildings nearly on top of each other, with the breaks between them filled by food and service venues. To escape living in an overbuilt area, one needs to go to one of the other clusters. The cluster of five zip codes in orange is close to the I-15 corridor and will likely become part of the purple cluster in the near future. The single cluster near Lehi is also close enough to I-15 that it may be subsumed into the purple cluster as well. Eagle Mountain (green) and Santaquin (blue) are both farther from the purple cluster, with Eagle Mountain being the farther of the two from services and the most likely to resist overdevelopment the longest. Santaquin is closer to the main transportation corridor and services. To drill down further in area selection, a real estate analysis comparing home prices between Eagle Mountain and Santaquin would be needed, which is beyond the scope of this analysis.