# Schoenrock Final Project - City Relocation
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction <a name="introduction"></a>

I am currently in the process of looking for a new place to live. For my final project, I will try to find the optimal city for me to move to based upon specific criteria and similarity to other cities that I have enjoyed living in. Specifically, I am looking for a city based in the United States that is similar to **San Francisco** and **Chicago**.

Since there are a lot of cities in the United States, I will limit my search to 5 cities: Seattle, Denver, Atlanta, Austin, and Nashville. I will try to detect which city and neighborhood has a high similarity to my favorite previous locations and has a minimum threshold for parks, museums, and restaurants.

## Data <a name="data"></a>

Based on the requirements for this project, the factors that will influence our decision are:
* number of existing restaurants in the neighborhood (any type of restaurant)
* unique types of restaurants in the neighborhood
* number of museums in the neighborhood
* similarity to Cow Hollow, San Francisco
* similarity to Lakeview, Chicago

We use a regularly spaced grid of locations, centered on each city center, to define our neighborhoods.
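
As a minimal sketch of the grid idea (not code from this notebook, whose cells query only the city centers), the points could be generated as below. The function name `grid_around`, the 3 km half-width, and the 750 m spacing are illustrative assumptions, and the degree-to-meter conversion is a flat-earth approximation that is only reasonable over a few kilometers.

```python
import math

def grid_around(center_lat, center_lon, half_width_m=3000.0, step_m=750.0):
    """Return (lat, lon) points on a square grid centered on the given point.

    half_width_m and step_m are hypothetical defaults chosen for illustration.
    """
    # Approximate meters per degree; adequate for a small area around a city center.
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(center_lat))

    n = int(half_width_m // step_m)  # number of steps on each side of the center
    points = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            lat = center_lat + (i * step_m) / m_per_deg_lat
            lon = center_lon + (j * step_m) / m_per_deg_lon
            points.append((lat, lon))
    return points

# San Francisco center as printed by the geocoding cell in this notebook
sf_grid = grid_around(37.7790262, -122.419906)
print(len(sf_grid))
```

Each grid point could then be passed to the venue query in place of the single city-center coordinate.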

The following data sources are needed to extract or generate the required information:
* the coordinates of each city center, obtained using **Geopy Geocoders**
* the number of restaurants and their types, the number of museums, and the number of parks in every neighborhood, obtained using the **Foursquare API**


### Neighborhoods

Let's first look at the Cow Hollow and Lakeview neighborhoods.

In [44]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image, HTML

# transforming a JSON file into a pandas DataFrame
from pandas import json_normalize


! pip install folium==0.5.0
import folium # plotting library

import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


In [19]:
sf_address = 'San Francisco, California'

geolocator = Nominatim(user_agent="foursquare_agent")
sf_location = geolocator.geocode(sf_address)
sf_latitude = sf_location.latitude
sf_longitude = sf_location.longitude
print(sf_latitude, sf_longitude)

37.7790262 -122.419906


In [20]:
chi_address = 'Chicago, Illinois'

geolocator = Nominatim(user_agent="foursquare_agent")
chi_location = geolocator.geocode(chi_address)
chi_latitude = chi_location.latitude
chi_longitude = chi_location.longitude
print(chi_latitude, chi_longitude)

41.8755616 -87.6244212


In [21]:
seattle_address = 'Seattle, Washington'

geolocator = Nominatim(user_agent="foursquare_agent")
seattle_location = geolocator.geocode(seattle_address)
seattle_latitude = seattle_location.latitude
seattle_longitude = seattle_location.longitude
print(seattle_latitude, seattle_longitude)

47.6038321 -122.3300624


In [22]:
denver_address = 'Denver, Colorado'

geolocator = Nominatim(user_agent="foursquare_agent")
denver_location = geolocator.geocode(denver_address)
denver_latitude = denver_location.latitude
denver_longitude = denver_location.longitude
print(denver_latitude, denver_longitude)

39.7392364 -104.9848623


In [23]:
atlanta_address = 'Atlanta, Georgia'

geolocator = Nominatim(user_agent="foursquare_agent")
atlanta_location = geolocator.geocode(atlanta_address)
atlanta_latitude = atlanta_location.latitude
atlanta_longitude = atlanta_location.longitude
print(atlanta_latitude, atlanta_longitude)

33.7489924 -84.3902644


In [24]:
austin_address = 'Austin, Texas'

geolocator = Nominatim(user_agent="foursquare_agent")
austin_location = geolocator.geocode(austin_address)
austin_latitude = austin_location.latitude
austin_longitude = austin_location.longitude
print(austin_latitude, austin_longitude)

30.2711286 -97.7436995


In [25]:
nash_address = 'Nashville, Tennessee'

geolocator = Nominatim(user_agent="foursquare_agent")
nash_location = geolocator.geocode(nash_address)
nash_latitude = nash_location.latitude
nash_longitude = nash_location.longitude
print(nash_latitude, nash_longitude)

36.1622296 -86.7743531
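
The seven geocoding cells above repeat the same steps, so they could be folded into one helper. The sketch below is an assumption about how that might look, not the notebook's code: `geocode_cities` is a hypothetical name, and it takes the geocoding callable as an argument so that `Nominatim(user_agent="foursquare_agent").geocode` can be passed in a live session. Here an offline stand-in, built from two coordinates printed above, keeps the example runnable without network access.

```python
from types import SimpleNamespace

def geocode_cities(addresses, geocode):
    """Map each address to (latitude, longitude) using the supplied geocode callable."""
    coords = {}
    for address in addresses:
        location = geocode(address)
        if location is None:  # geopy returns None when no match is found
            continue
        coords[address] = (location.latitude, location.longitude)
    return coords

# Offline stand-in for Nominatim(...).geocode (illustration only).
_known = {
    'San Francisco, California': (37.7790262, -122.419906),
    'Chicago, Illinois': (41.8755616, -87.6244212),
}

def fake_geocode(address):
    hit = _known.get(address)
    return SimpleNamespace(latitude=hit[0], longitude=hit[1]) if hit else None

coords = geocode_cities(
    ['San Francisco, California', 'Chicago, Illinois', 'Nowhere, ZZ'],
    fake_geocode,
)
print(coords)
```

Addresses that cannot be resolved are skipped rather than raising an `AttributeError` on `None`.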


In [26]:
column_names = ["City", "State", "Latitude","Longitude"]

# Build the table in one step (DataFrame.append was removed in pandas 2.0).
# Note: Atlanta is geocoded above but is not included in this table.
locations = pd.DataFrame([
    ['San Francisco', 'CA', sf_latitude, sf_longitude],
    ['Chicago', 'IL', chi_latitude, chi_longitude],
    ['Denver', 'CO', denver_latitude, denver_longitude],
    ['Seattle', 'WA', seattle_latitude, seattle_longitude],
    ['Austin', 'TX', austin_latitude, austin_longitude],
    ['Nashville', 'TN', nash_latitude, nash_longitude],
], columns=column_names)

locations.head(7)

| | City | State | Latitude | Longitude |
|---|---|---|---|---|
| 0 | San Francisco | CA | 37.779026 | -122.419906 |
| 1 | Chicago | IL | 41.875562 | -87.624421 |
| 2 | Denver | CO | 39.739236 | -104.984862 |
| 3 | Seattle | WA | 47.603832 | -122.330062 |
| 4 | Austin | TX | 30.271129 | -97.7437 |
| 5 | Nashville | TN | 36.16223 | -86.774353 |


### Foursquare
Now that we have our location candidates, let's use the Foursquare API to get information on the venues near each city center.

In [32]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN' # your Foursquare Access Token
VERSION = '20180604'
LIMIT = 300
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET: YOUR_CLIENT_SECRET


In [36]:
def getNearbyVenues(City, Latitude, Longitude, radius=750, search_query=''):
    """Return a DataFrame of Foursquare venues within `radius` meters of each city center."""
    # search_query was undefined in the original cell; an empty default
    # (no keyword filter) is assumed here.
    venues_list = []
    for city, lat, lng in zip(City, Latitude, Longitude):
        print(city)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            lat,
            lng,
            ACCESS_TOKEN,  # the original referenced an undefined lowercase access_token
            VERSION,
            search_query,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            city,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-city lists into a single table
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City',
                  'Latitude',
                  'Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
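
The function above indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for a venue returned without a category. The following is a hedged sketch of a more defensive extraction step, assuming the same item shape the function already relies on. `extract_venues` is a hypothetical helper name, the `'Unknown'` fallback label is an assumption, and the second sample item is invented purely to exercise that fallback.

```python
def extract_venues(city, items):
    """Turn Foursquare 'explore' items into (city, name, lat, lng, category) rows."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to a placeholder label when a venue has no category.
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((city, venue.get('name'), location.get('lat'),
                     location.get('lng'), category))
    return rows

# First item mirrors a venue shown in this notebook's output;
# the second is a made-up venue with an empty category list.
sample_items = [
    {'venue': {'name': 'Urban Bowls',
               'location': {'lat': 37.778139, 'lng': -122.422168},
               'categories': [{'name': 'Poke Place'}]}},
    {'venue': {'name': 'Hypothetical Venue',
               'location': {'lat': 37.78, 'lng': -122.42},
               'categories': []}},
]

rows = extract_venues('San Francisco', sample_items)
for row in rows:
    print(row)
```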


In [37]:
all_venues = getNearbyVenues(City=locations['City'],
                                   Latitude=locations['Latitude'],
                                   Longitude=locations['Longitude']
                                  )

San Francisco
Chicago
Denver
Seattle
Austin
Nashville


In [38]:
print(all_venues.shape)
all_venues.head()

(600, 7)


| | City | Latitude | Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | San Francisco | 37.779026 | -122.419906 | Louise M. Davies Symphony Hall | 37.777976 | -122.420157 | Concert Hall |
| 1 | San Francisco | 37.779026 | -122.419906 | War Memorial Opera House | 37.778601 | -122.420816 | Opera House |
| 2 | San Francisco | 37.779026 | -122.419906 | Herbst Theater | 37.779548 | -122.420953 | Concert Hall |
| 3 | San Francisco | 37.779026 | -122.419906 | San Francisco Ballet | 37.77858 | -122.420798 | Dance Studio |
| 4 | San Francisco | 37.779026 | -122.419906 | Urban Bowls | 37.778139 | -122.422168 | Poke Place |


In [39]:
# one hot encoding
all_onehot = pd.get_dummies(all_venues[['Venue Category']], prefix="", prefix_sep="")

# add the City column back to the dataframe
all_onehot['City'] = all_venues['City'] 

# move the City column to the first position
fixed_columns = [all_onehot.columns[-1]] + list(all_onehot.columns[:-1])
all_onehot = all_onehot[fixed_columns]

all_onehot.head()

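| | City | Accessories Store | American Restaurant | Arcade | Arepa Restaurant | Art Gallery | Art Museum | Arts & Crafts Store | Asian Restaurant | Athletics & Sports | ... | Toy / Game Store | Trail | Train Station | Vegetarian / Vegan Restaurant | Vietnamese Restaurant | Whisky Bar | Wine Bar | Wine Shop | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | San Francisco | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | San Francisco | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | San Francisco | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | San Francisco | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | San Francisco | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |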


In [40]:
all_grouped = all_onehot.groupby('City').mean().reset_index()
all_grouped

| | City | Accessories Store | American Restaurant | Arcade | Arepa Restaurant | Art Gallery | Art Museum | Arts & Crafts Store | Asian Restaurant | Athletics & Sports | ... | Toy / Game Store | Trail | Train Station | Vegetarian / Vegan Restaurant | Vietnamese Restaurant | Whisky Bar | Wine Bar | Wine Shop | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Austin | 0.01 | 0.02 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.01 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 | 0.0 | 0.0 |
| 1 | Chicago | 0.0 | 0.01 | 0.0 | 0.01 | 0.01 | 0.02 | 0.04 | 0.02 | 0.0 | ... | 0.0 | 0.01 | 0.0 | 0.01 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 |
| 2 | Denver | 0.0 | 0.03 | 0.0 | 0.0 | 0.02 | 0.02 | 0.0 | 0.02 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 |
| 3 | Nashville | 0.0 | 0.06 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.01 | 0.0 |
| 4 | San Francisco | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.02 | 0.01 | 0.0 | 0.03 | 0.01 | 0.0 | 0.01 |
| 5 | Seattle | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | ... | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 |


In [41]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [54]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # ranks beyond 3rd fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
city_venues_sorted = pd.DataFrame(columns=columns)
city_venues_sorted['City'] = all_grouped['City']

for ind in np.arange(all_grouped.shape[0]):
    city_venues_sorted.iloc[ind, 1:] = return_most_common_venues(all_grouped.iloc[ind, :], num_top_venues)

city_venues_sorted.head()

| | City | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Austin | Coffee Shop | Cocktail Bar | Bar | Hotel | Speakeasy | Lounge | Music Venue | Gay Bar | Steakhouse | Movie Theater |
| 1 | Chicago | Pizza Place | Arts & Crafts Store | Coffee Shop | Café | Garden | Hotel | Fountain | Park | Sandwich Place | Italian Restaurant |
| 2 | Denver | Sandwich Place | Coffee Shop | Burger Joint | Breakfast Spot | Bar | American Restaurant | Hotel | Italian Restaurant | Marijuana Dispensary | Yoga Studio |
| 3 | Nashville | Bar | Music Venue | Hotel | American Restaurant | Park | Steakhouse | Cocktail Bar | Candy Store | Diner | Mexican Restaurant |
| 4 | San Francisco | Performing Arts Venue | Coffee Shop | Cocktail Bar | Mexican Restaurant | French Restaurant | Wine Bar | Juice Bar | Sushi Restaurant | Clothing Store | Theater |


## Methodology <a name="methodology"></a>

In the first step we collected the required **coordinate data** for each candidate city in the United States, as well as for our baseline cities of San Francisco and Chicago. We then retrieved the **location and type (category) of every venue near each city center of interest** using the Foursquare API.

Our next step is to create **clusters of the cities** (using **k-means clustering**) to determine which candidates share similarities with our baseline cities. We then present a map of the clustered cities to identify the most similar city and the best one to relocate to.
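
The clustering step works on per-city category frequencies, which the notebook computes with one-hot encoding followed by a grouped mean. As a rough illustration of what those numbers represent, the sketch below derives the same kind of frequencies with the standard library only. The function name `category_frequencies` and the toy city names and venue pairs are assumptions made up for this example, not data from the project.

```python
from collections import Counter, defaultdict

def category_frequencies(venues):
    """Return {city: {category: share of that city's venues}} from (city, category) pairs."""
    counts = defaultdict(Counter)
    for city, category in venues:
        counts[city][category] += 1

    freqs = {}
    for city, counter in counts.items():
        total = sum(counter.values())
        # Equivalent to the mean of a one-hot column within one city.
        freqs[city] = {cat: n / total for cat, n in counter.items()}
    return freqs

# Toy input (hypothetical cities and venues, for illustration only)
toy_venues = [
    ('City A', 'Coffee Shop'), ('City A', 'Coffee Shop'),
    ('City A', 'Bar'), ('City A', 'Park'),
    ('City B', 'Bar'), ('City B', 'Bar'),
]

freqs = category_frequencies(toy_venues)
print(freqs)
```

Each city's frequencies sum to 1, so cities with different venue counts remain comparable.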

## Analysis <a name="analysis"></a>

In [76]:
austin_venues = all_venues.groupby(['City', 'Venue Category']).size()['Austin']
austin_venues.sort_values(inplace=True, ascending=False)
austin_venues.head(10)

Venue Category
Coffee Shop     7
Cocktail Bar    6
Bar             5
Hotel           5
Lounge          4
Speakeasy       4
Music Venue     3
Steakhouse      3
Gay Bar         3
Restaurant      3
dtype: int64

In [77]:
sf_venues = all_venues.groupby(['City', 'Venue Category']).size()['San Francisco']
sf_venues.sort_values(inplace=True, ascending=False)
sf_venues.head(10)

Venue Category
Coffee Shop              4
Cocktail Bar             4
Performing Arts Venue    4
Juice Bar                3
Sushi Restaurant         3
French Restaurant        3
Mexican Restaurant       3
Clothing Store           3
Wine Bar                 3
Theater                  3
dtype: int64

In [78]:
chi_venues = all_venues.groupby(['City', 'Venue Category']).size()['Chicago']
chi_venues.sort_values(inplace=True, ascending=False)
chi_venues.head(10)

Venue Category
Pizza Place            5
Arts & Crafts Store    4
Coffee Shop            4
Hotel                  3
Garden                 3
Café                   3
Park                   2
Sandwich Place         2
Italian Restaurant     2
Fountain               2
dtype: int64

In [79]:
denver_venues = all_venues.groupby(['City', 'Venue Category']).size()['Denver']
denver_venues.sort_values(inplace=True, ascending=False)
denver_venues.head(10)

Venue Category
Sandwich Place          6
Coffee Shop             5
Bar                     4
Breakfast Spot          4
Burger Joint            4
American Restaurant     3
Hotel                   3
Italian Restaurant      3
Marijuana Dispensary    3
Noodle House            2
dtype: int64

In [80]:
seattle_venues = all_venues.groupby(['City', 'Venue Category']).size()['Seattle']
seattle_venues.sort_values(inplace=True, ascending=False)
seattle_venues.head(10)

Venue Category
Coffee Shop            10
Hotel                   8
Café                    6
Cocktail Bar            5
Donut Shop              3
Japanese Restaurant     3
Gift Shop               2
Hotel Bar               2
Italian Restaurant      2
Lounge                  2
dtype: int64

In [81]:
nash_venues = all_venues.groupby(['City', 'Venue Category']).size()['Nashville']
nash_venues.sort_values(inplace=True, ascending=False)
nash_venues.head(10)

Venue Category
Bar                    15
Music Venue             9
American Restaurant     6
Hotel                   6
Park                    4
Steakhouse              4
Candy Store             3
Cocktail Bar            3
Pizza Place             2
Restaurant              2
dtype: int64

In [55]:
# set number of clusters
kclusters = 3

location_grouped_clustering = all_grouped.drop(columns='City')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(location_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 1, 1, 0, 1, 2], dtype=int32)

In [56]:
# add clustering labels
city_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

cities_merged = locations

# merge the sorted venues with the locations table to add latitude/longitude for each city
cities_merged = cities_merged.join(city_venues_sorted.set_index('City'), on='City')

cities_merged.head(6)

| | City | State | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | San Francisco | CA | 37.779026 | -122.419906 | 1 | Performing Arts Venue | Coffee Shop | Cocktail Bar | Mexican Restaurant | French Restaurant | Wine Bar | Juice Bar | Sushi Restaurant | Clothing Store | Theater |
| 1 | Chicago | IL | 41.875562 | -87.624421 | 1 | Pizza Place | Arts & Crafts Store | Coffee Shop | Café | Garden | Hotel | Fountain | Park | Sandwich Place | Italian Restaurant |
| 2 | Denver | CO | 39.739236 | -104.984862 | 1 | Sandwich Place | Coffee Shop | Burger Joint | Breakfast Spot | Bar | American Restaurant | Hotel | Italian Restaurant | Marijuana Dispensary | Yoga Studio |
| 3 | Seattle | WA | 47.603832 | -122.330062 | 2 | Coffee Shop | Hotel | Café | Cocktail Bar | Donut Shop | Japanese Restaurant | Gift Shop | Scenic Lookout | Seafood Restaurant | Chinese Restaurant |
| 4 | Austin | TX | 30.271129 | -97.7437 | 2 | Coffee Shop | Cocktail Bar | Bar | Hotel | Speakeasy | Lounge | Music Venue | Gay Bar | Steakhouse | Movie Theater |
| 5 | Nashville | TN | 36.16223 | -86.774353 | 0 | Bar | Music Venue | Hotel | American Restaurant | Park | Steakhouse | Cocktail Bar | Candy Store | Diner | Mexican Restaurant |


In [57]:
center_address = 'United States'

geolocator = Nominatim(user_agent="ny_explorer")
center_location = geolocator.geocode(center_address)
center_latitude = center_location.latitude
center_longitude = center_location.longitude
print('The geographical coordinates of the US are {}, {}.'.format(center_latitude, center_longitude))

The geographical coordinates of the US are 39.7837304, -100.4458825.


In [61]:
# create map
map_clusters = folium.Map(location=[center_latitude, center_longitude], zoom_start=5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(cities_merged['Latitude'], cities_merged['Longitude'], cities_merged['City'], cities_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Results and Discussion <a name="results"></a>

Our analysis shows that, although venue mixes vary widely across cities in the United States, San Francisco, Chicago, and Denver share general similarities: k-means placed all three in the same cluster. Among their most common venue types, all three feature performing arts and other non-food venues, and coffee shops are similarly popular in each.

By comparison, the other cities have a larger share of bars and drinking venues than our baseline cities.

Overall, the k-means clustering suggests that Denver is the most comparable city to San Francisco and Chicago.
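
Cluster membership says only whether two cities fall in the same group, not how close they are. One possible complement, which this notebook does not perform, is to rank candidates by cosine similarity between category-frequency vectors. The sketch below swaps in that technique in place of k-means. The helper name `cosine_similarity` and the toy frequency vectors are illustrative assumptions, not results from this project.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {category: frequency}."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Toy frequency vectors (hypothetical, for illustration only)
baseline = {'Coffee Shop': 0.5, 'Park': 0.5}
candidates = {
    'city_a': {'Coffee Shop': 0.5, 'Park': 0.5},
    'city_b': {'Bar': 1.0},
}

# Rank candidates from most to least similar to the baseline
ranked = sorted(
    candidates,
    key=lambda city: cosine_similarity(baseline, candidates[city]),
    reverse=True,
)
print(ranked)
```

With real data, `baseline` could be the average of the San Francisco and Chicago frequency rows, which would give a graded ranking of the candidates alongside the cluster labels.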

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify cities across the US that most closely resemble San Francisco, CA and Chicago, IL in order to aid my relocation efforts. Candidate cities were chosen, and a list of venues and venue types was retrieved around each city center.

The final decision, based on my selection criteria and the similarity analysis, is to relocate to Denver, Colorado.