## Neighborhoods around the World

### Introduction and Problem
Every country around the world has a unique culture and way of life. In order to thrive in this globalized economy it is important to understand our differences in order to interact respectfully with our counterparts around the world. This study will take a look at how similar major cities are around the world. We will focus on the size of the city, popular venues, available green space and other amenities located in metropolitan centers to better understand the priorities of different populations. This study is meant to inform high ranking business officials on how different cities are structured which tells us a lot about the behavior of the people that live there. This will give us a better understanding of their way of life before we do business with them.

<b>Question:</b> How similar are popular establishments in cities around the world? 

### Data Sources 
In this project, I will be using the Foursquare API to extract location data for major cities around the world. I will use the Foursquare API to extract the top 10 most popular venues in each city. I will then use the World Cities Basic Database (https://simplemaps.com/data/world-cities) to obtain a city's population. We will use this information along with a K-means algorithm to identify similar groups of cities. We will then add to this by including a country's economic data taken from The World Bank (https://data.worldbank.org/indicator/NY.GDP.PCAP.CD) including GDP per capita and unemployment as a csv. We will then use a similar algorithm to identify if these factors impact popular venues in cities. 

Includes Data Set from: https://datahub.io/JohnSnowLabs/country-and-continent-codes-list

Cities will be finalized after data is reviewed to ensure enough data is available to support the study.

### Exploratory Data Analysis

In [2]:
import pandas as pd
import numpy as np
import folium

In [73]:
world_cities = pd.read_csv("worldcities.csv")
world_cities.drop(columns=["city_ascii", "iso2", "capital", "id", "admin_name"], inplace=True)
large_cities = world_cities[world_cities["population"] > 500000]

conti = pd.read_csv("country-and-continent.csv")
conti.drop(columns=["Country_Name", "Country_Number", "Continent_Code", "Two_Letter_Country_Code"], inplace=True)
conti.drop_duplicates(subset="Three_Letter_Country_Code", keep='first', inplace=True)

lcity_w_conti = large_cities.set_index("iso3").join(conti.set_index("Three_Letter_Country_Code"))

cities_data = lcity_w_conti.groupby('Continent_Name')[["city", "country", "population", "lat", "lng", "Continent_Name"]].apply(lambda s:s.sample(min(len(s), 50), random_state = 2))

print(cities_data.shape)
cities_data.set_index("city", inplace=True)
cities_data.head()

(257, 6)


Unnamed: 0_level_0,country,population,lat,lng,Continent_Name
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Kigali,Rwanda,745261.0,-1.9536,30.0606,Africa
Asmara,Eritrea,963000.0,15.3333,38.9167,Africa
Ibadan,Nigeria,2628000.0,7.3964,3.9167,Africa
Alexandria,Egypt,4870000.0,31.2,29.9167,Africa
Kisangani,Congo (Kinshasa),935977.0,0.5153,25.1911,Africa


In [69]:
# Function to Color Folium Map Markers by Population Size
def colorbypop(mag):
    if mag > 3000000:
        color = '#6e1010'
        rad = 10
    elif mag > 1000000:
        color = '#f5443d'
        rad = 7
    else:
        color = '#f5a73b'
        rad = 3
    return color, rad

In [135]:
# create map of World using latitude and longitude values
world_map = folium.Map(location=[33,48], zoom_start=2, width=1024, height=500)

# add markers to map
for lat, lng, city, ctry in zip(cities_data['lat'], cities_data['lng'], 
                                cities_data.index, cities_data['country']):
    label = '{}, {}'.format(city, ctry)
    color, rad = colorbypop(cities_data.loc[city, 'population'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=float(rad),
        popup=label,
        color= color,
        fill=True,
        fill_color= color,
        fill_opacity=0.5,
        parse_html=False).add_to(world_map)  

world_map

This map shows the cities that will be used for the study. Each marker represents a city with the radius and color representing
the population.
- Small Yellow points represent cities with a population between 500,000 and 1,000,000
- Medium Orange points represent cities with a population between 1,000,000 and 3,000,000
- Large Maroon points represent cities with a population greater than 3,000,000

In [136]:
# Function to Color Folium Map Markers by Continent
def colorbyregion(name):
    if  "Asia" in name:
        color = '#338139'
    elif "Afr" in name:
        color = '#FFA500'
    elif "Ocea" in name:
        color = '#800080' 
    elif "Sou" in name:
        color = '#C35817'
    elif "Nort" in name:
        color = '#F62817'
    elif "Europe" in name:
        color = '#000080'
    else:
        color = '#000000'
        #print(name)
    return color

In [137]:
# create map of World using latitude and longitude values
map_by_conti = folium.Map(location=[33,48], zoom_start=2, width=1024, height=512)

# add markers to map
for lat, lng, city, ctry in zip(cities_data['lat'], cities_data['lng'], 
                                cities_data.index, cities_data['country']):
    label = '{}, {}'.format(city, ctry)
    color = colorbyregion(cities_data.loc[city, 'Continent_Name'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color= color,
        fill=True,
        fill_color= color,
        fill_opacity=0.5,
        parse_html=False).add_to(map_by_conti)  

map_by_conti

### Using the FourSquare API for Data Acquisition

In [4]:
import requests

In [5]:
CLIENT_ID = 'FB1FG25U2R32CW5WZIDI31NFHENKIY41K55JGTX2ANOGVXQN' # your Foursquare ID
CLIENT_SECRET = 'XYXO4C0PIPUAJ0EQX04B5EYUY5XKZ52JY4QKAPTIHAMBOSJU' # your Foursquare Secret
ACCESS_TOKEN = 'XICWZWPEXXFHRK4SNZPYEVGRPRPGGTMBD3SCIW14RHOJI1GD' # your FourSquare Access Token
VERSION = '20180604'
LIMIT = 30
radius = 500
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: FB1FG25U2R32CW5WZIDI31NFHENKIY41K55JGTX2ANOGVXQN
CLIENT_SECRET:XYXO4C0PIPUAJ0EQX04B5EYUY5XKZ52JY4QKAPTIHAMBOSJU


In [6]:
cities_data.tail()

Unnamed: 0_level_0,country,population,lat,lng,Continent_Name
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Cochabamba,Bolivia,632013.0,-17.3935,-66.157,South America
Porto Alegre,Brazil,1484941.0,-30.0328,-51.23,South America
Cuiabá,Brazil,585367.0,-15.5958,-56.0969,South America
Maceió,Brazil,1029129.0,-9.6658,-35.735,South America
Uberlândia,Brazil,604013.0,-18.9189,-48.2769,South America


In [26]:
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat_lng["lat"], 
            lat_lng["lng"], 
            radius, 
            LIMIT)

In [27]:
results = requests.get(url).json()
res = results["response"]['groups'][0]['items']

In [38]:
def getNearbyVenues(names, lats, long, country, continent, radius=500):
    venues_list=[]
    for name, cty, conti, lat, lng in zip(names, country, continent, lats, long): 
        print(name)
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        print(url)
        results = requests.get(url).json()
        print(results)
        res = results["response"]['groups'][0]['items']
        
        if len(res) >= 5:
            venues_list.append([(
                    name,
                    cty,
                    conti,
                    lat, 
                    lng, 
                    v['venue']['name'], 
                    v['venue']['location']['lat'], 
                    v['venue']['location']['lng'],  
                    v['venue']['categories'][0]['name']) for v in res])
            
    nearby_venues = pd.DataFrame(item for venue_list in venues_list for item in venue_list)
    nearby_venues.columns = ['City',
                  'Country',
                  'Continent',
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    
    return(nearby_venues)
 


In [39]:
nby_ven = getNearbyVenues(cities_data.index, cities_data["lat"], cities_data["lng"], cities_data["country"], cities_data["Continent_Name"], radius=500)
nby_ven.head(10)

Blantyre
https://api.foursquare.com/v2/venues/explore?&client_id=FB1FG25U2R32CW5WZIDI31NFHENKIY41K55JGTX2ANOGVXQN&client_secret=XYXO4C0PIPUAJ0EQX04B5EYUY5XKZ52JY4QKAPTIHAMBOSJU&v=20180604&ll=-15.7861,35.0058&radius=500&limit=30
{'meta': {'code': 429, 'errorType': 'quota_exceeded', 'errorDetail': 'Quota exceeded', 'requestId': '60f1cc41f1fb0416ec79c2b7'}, 'response': {}}


KeyError: 'groups'

In [40]:
print('There are {} uniques categories.'.format(len(nby_ven['Venue Category'].unique())))

There are 367 uniques categories.


In [41]:
nby_ven.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Blantyre,-15.7861,35.0058,Protea Hotel Ryalls,-15.784764,35.004161,Hotel
1,Blantyre,-15.7861,35.0058,21 Bar & Grill,-15.78421,35.004445,Hotel Bar
2,Blantyre,-15.7861,35.0058,Debonairs,-15.785421,35.004051,Pizza Place
3,Blantyre,-15.7861,35.0058,Blantyre CBD,-15.786844,35.005274,Arcade
4,Blantyre,-15.7861,35.0058,Mount Soche Hotel,-15.784119,35.006034,Hotel


In [42]:
# one hot encoding
world_onehot = pd.get_dummies(nby_ven[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
world_onehot['City'] = nby_ven['City'] 

# move neighborhood column to the first column
fixed_columns = [world_onehot.columns[-1]] + list(world_onehot.columns[:-1])
world_onehot = world_onehot[fixed_columns]

world_onehot.head()

Unnamed: 0,City,ATM,Acai House,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Aquarium,Arcade,...,Vietnamese Restaurant,Volleyball Court,Watch Shop,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Blantyre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Blantyre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Blantyre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Blantyre,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Blantyre,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
world_grouped = world_onehot.groupby('City').mean().reset_index()
world_grouped.head()

Unnamed: 0,City,ATM,Acai House,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Aquarium,Arcade,...,Vietnamese Restaurant,Volleyball Court,Watch Shop,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Accra,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0
2,Aguascalientes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Akron,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Albuquerque,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [46]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
city_ven_sort = pd.DataFrame(columns=columns)
city_ven_sort['City'] = world_grouped['City']

for ind in np.arange(world_grouped.shape[0]):
    city_ven_sort.iloc[ind, 1:] = return_most_common_venues(world_grouped.iloc[ind, :], num_top_venues)

city_ven_sort.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Accra,Hotel,Department Store,African Restaurant,Restaurant,Yoga Studio
1,Adelaide,Café,Pizza Place,Coffee Shop,Asian Restaurant,Italian Restaurant
2,Aguascalientes,Mexican Restaurant,Garden,Bar,Theater,Bakery
3,Akron,Bar,Sandwich Place,Coffee Shop,Thai Restaurant,Deli / Bodega
4,Albuquerque,Brewery,Mexican Restaurant,Breakfast Spot,Hotel,Discount Store


In [53]:
new_df = city_ven_sort.set_index("City").join(cities_data)
new_df.tail(5)

Unnamed: 0_level_0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,country,population,lat,lng,Continent_Name
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Yekaterinburg,Coffee Shop,Gym / Fitness Center,Bar,Wine Shop,Art Gallery,Russia,1468833.0,56.8356,60.6128,Europe
Zaragoza,Spanish Restaurant,Modern European Restaurant,Plaza,Mediterranean Restaurant,Tapas Restaurant,Spain,649404.0,41.6483,-0.883,Europe
Zhengzhou,Fast Food Restaurant,Monument / Landmark,Bus Station,Hotel,Asian Restaurant,China,7005000.0,34.7492,113.6605,Asia
İzmir,Café,Neighborhood,Convenience Store,Dessert Shop,Spa,Turkey,4320519.0,38.4127,27.1384,Europe
Ḩalwān,Café,Sports Club,Supermarket,Hot Spring,Fish Market,Egypt,619293.0,29.8419,31.3342,Africa


In [51]:
new_df.groupby("Continent_Name").count()

Unnamed: 0_level_0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,country,population,lat,lng
Continent_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Africa,18,18,18,18,18,18,18,18,18
Asia,11,11,11,11,11,11,11,11,11
Europe,48,48,48,48,48,48,48,48,48
North America,38,38,38,38,38,38,38,38,38
Oceania,6,6,6,6,6,6,6,6,6
South America,37,37,37,37,37,37,37,37,37


## K Means Clustering

In [57]:
from sklearn.cluster import KMeans 
import matplotlib.cm as cm
import matplotlib.colors as colors

In [60]:
# set number of clusters
kclusters = 6

world_grouped_clust = world_grouped.drop('City', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(world_grouped_clust)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:5] 

array([2, 4, 0, 0, 0])

In [65]:
# add clustering labels
new_df.drop('Cluster Labels', 1, inplace=True)
new_df.insert(0, 'Cluster Labels', kmeans.labels_)
new_df.tail(10)

Unnamed: 0_level_0,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,country,population,lat,lng,Continent_Name
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Vladivostok,4,Café,Hookah Bar,Park,Chinese Restaurant,Bus Stop,Russia,606589.0,43.1167,131.9,Europe
Warsaw,4,Café,Theater,Radio Station,Palace,Historic Site,Poland,1790658.0,52.2167,21.0333,Europe
Washington,4,Italian Restaurant,New American Restaurant,Southern / Soul Food Restaurant,Gym,Gym / Fitness Center,United States,5379184.0,38.9047,-77.0163,North America
Winnipeg,4,Café,Coffee Shop,Sandwich Place,Donut Shop,History Museum,Canada,705244.0,49.8844,-97.1464,North America
Yaroslavl,3,Gym,Café,Massage Studio,Pet Store,Ethiopian Restaurant,Russia,608079.0,57.6167,39.85,Europe
Yekaterinburg,0,Coffee Shop,Gym / Fitness Center,Bar,Wine Shop,Art Gallery,Russia,1468833.0,56.8356,60.6128,Europe
Zaragoza,5,Spanish Restaurant,Modern European Restaurant,Plaza,Mediterranean Restaurant,Tapas Restaurant,Spain,649404.0,41.6483,-0.883,Europe
Zhengzhou,1,Fast Food Restaurant,Monument / Landmark,Bus Station,Hotel,Asian Restaurant,China,7005000.0,34.7492,113.6605,Asia
İzmir,4,Café,Neighborhood,Convenience Store,Dessert Shop,Spa,Turkey,4320519.0,38.4127,27.1384,Europe
Ḩalwān,3,Café,Sports Club,Supermarket,Hot Spring,Fish Market,Egypt,619293.0,29.8419,31.3342,Africa


In [67]:
# create map
map_clusters = folium.Map(location=[33,48], zoom_start=2, width=1024, height=512)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(new_df['lat'], new_df['lng'], new_df.index, new_df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=False,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.5).add_to(map_clusters)
       
map_clusters