<a href="https://colab.research.google.com/github/tamiresco/ibm3/blob/master/Capstone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

In this lab, I will learn how to convert addresses into their equivalent latitude and longitude values, and I will use the Foursquare API to explore the capitals of the world. I will use the **explore** function to get the most common venue categories around each capital, and then use these categories as features to group the capitals into clusters with the *k*-means clustering algorithm. Finally, I will use the Folium library to visualize the capitals and their emerging clusters. The code reuses the column names of the original neighborhoods lab, so `Borough` holds the country and `Neighborhood` holds the capital.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Capitals in the World</a>

3. <a href="#item3">Analyze Each Country</a>

4. <a href="#item4">Cluster Capitals in the World</a>

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

In [344]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas.io.json.json_normalize is deprecated since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium # map rendering library

print('Libraries imported.')

/bin/bash: conda: command not found
/bin/bash: conda: command not found
Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

In [345]:
World_cities_raw = pd.read_html("https://lab.lmnixon.org/4th/worldcapitals.html")
neighborhoods = pd.DataFrame(World_cities_raw[0])

new_header = neighborhoods.iloc[0] #grab the first row for the header
neighborhoods = neighborhoods[1:] #take the data less the header row
neighborhoods.columns = new_header #set the header row as the df header

neighborhoods.rename(columns={"Country": "Borough", "Capital": "Neighborhood"}, inplace=True)

In [346]:
new_latitude = []
new_longitude = []

for i, rows in neighborhoods.iterrows():

  if str(neighborhoods.loc[i,'Latitude'])[-1] == 'S':
    new_latitude.append('-' + str(neighborhoods.loc[i,'Latitude'])[0:-1] + '00')
  else:
    new_latitude.append(str(neighborhoods.loc[i,'Latitude'])[0:-1] + '00')

  if str(neighborhoods.loc[i,'Longitude'])[-1] == 'W':
    new_longitude.append('-' + str(neighborhoods.loc[i,'Longitude'])[0:-1] + '00')
  else:
    new_longitude.append(str(neighborhoods.loc[i,'Longitude'])[0:-1] + '00')
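
The loop above turns hemisphere-suffixed strings such as `15.47S` into signed values. A minimal sketch of the same conversion as a reusable helper follows. The name `parse_coordinate` is hypothetical, and the optional `degrees_minutes` flag covers an assumption the source table does not confirm: that a value like `15.47` may mean 15 degrees 47 minutes rather than decimal degrees.

```python
def parse_coordinate(text, degrees_minutes=False):
    """Convert a string such as '15.47S' into a signed float.

    Hypothetical helper: S and W become negative, N and E stay positive.
    With degrees_minutes=True the digits after the point are read as
    arc minutes (an assumption about the source table, not a known fact).
    """
    text = str(text).strip()
    hemisphere = text[-1].upper()
    if hemisphere not in 'NSEW':
        raise ValueError('missing hemisphere letter: {!r}'.format(text))
    number = text[:-1]
    if degrees_minutes:
        degrees, _, minutes = number.partition('.')
        # pad to two digits so that '8.5' is read as 8 degrees 50 minutes
        value = int(degrees) + int(minutes.ljust(2, '0')[:2]) / 60
    else:
        value = float(number)
    return -value if hemisphere in 'SW' else value


print(parse_coordinate('15.47S'))                        # decimal reading
print(parse_coordinate('47.55W', degrees_minutes=True))  # degrees and minutes reading
```

Returning floats instead of strings would also make the later `pd.to_numeric` calls unnecessary.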

In [347]:
neighborhoods.Latitude = new_latitude
neighborhoods.Longitude = new_longitude

In [348]:
neighborhoods.drop([201,202,203], inplace = True)

In [349]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
1,Afghanistan,Kabul,34.28,69.11
2,Albania,Tirane,41.18,19.49
3,Algeria,Algiers,36.42,3.08
4,American Samoa,Pago Pago,-14.16,-170.43
5,Andorra,Andorra la Vella,42.31,1.32
6,Angola,Luanda,-8.5,13.15
7,Antigua and Barbuda,West Indies,17.2,-61.48
8,Argentina,Buenos Aires,-36.3,-60.0
9,Armenia,Yerevan,40.1,44.31
10,Aruba,Oranjestad,12.32,-70.02


In [350]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 200 boroughs and 200 neighborhoods.


In [351]:
latitude = 0.00 # location.latitude
longitude = 0.00 #location.longitude

In [352]:
# create a world map centered on latitude 0, longitude 0
world_map = folium.Map(location=[latitude, longitude], zoom_start=2)

# add one marker per capital
for lat, lng, borough, neighborhood in zip(pd.to_numeric(neighborhoods['Latitude']), pd.to_numeric(neighborhoods['Longitude']), neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(world_map)

world_map

**Folium** is a great visualization library. Feel free to zoom into the above map and click on each circle marker to reveal the name of the capital (the `Neighborhood` column) and its country (the `Borough` column).

#### Define Foursquare Credentials and Version

In [353]:
CLIENT_ID = 'MVI4V2DONN0NCWC3YV32ZSZ424XSDKIU5IO2VKC50GJJEN21' # your Foursquare ID
CLIENT_SECRET = 'VYH3YTKIVZ5EP5C2CNWJIY53LKSZFDIWSYNKL0O0DRALPVYN' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: MVI4V2DONN0NCWC3YV32ZSZ424XSDKIU5IO2VKC50GJJEN21
CLIENT_SECRET:VYH3YTKIVZ5EP5C2CNWJIY53LKSZFDIWSYNKL0O0DRALPVYN
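
The cell above hardcodes the credentials and prints them in full. A minimal sketch of an alternative reads them from environment variables and masks them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are illustrative assumptions, not a Foursquare convention.

```python
import os


def load_credentials(env=os.environ):
    """Read Foursquare credentials from a mapping of environment variables.

    The variable names are hypothetical; pick whatever your setup uses.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret


def mask(secret, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return secret[:visible] + '*' * (len(secret) - visible)


# demo with a plain dict so the sketch runs without real credentials
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
client_id, client_secret = load_credentials(demo_env)
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```

In the notebook itself, calling `load_credentials()` without arguments would read from the real environment.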


In [354]:
neighborhood_latitude = neighborhoods.loc[26, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods.loc[26, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods.loc[26, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Brasilia are -15.4700, -47.5500.


In [355]:
LIMIT = 1000 # limit of number of venues returned by Foursquare API
radius = 10000000 # search radius in meters (10^7)

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=MVI4V2DONN0NCWC3YV32ZSZ424XSDKIU5IO2VKC50GJJEN21&client_secret=VYH3YTKIVZ5EP5C2CNWJIY53LKSZFDIWSYNKL0O0DRALPVYN&v=20180605&ll=-15.4700,-47.5500&radius=10000000&limit=1000'
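
The URL above is assembled with string formatting. As a sketch of an alternative, the query string can be built from a dictionary with the standard library, which also percent-encodes the values. The function name `build_explore_url` is hypothetical, and the credentials, radius and limit in the demo call are placeholder values chosen for illustration.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL with percent-encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude as one parameter
        'radius': radius,                # meters
        'limit': limit,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# placeholder credentials and illustrative radius/limit values
url = build_explore_url('ID', 'SECRET', '20180605', '-15.4700', '-47.5500', 100000, 50)
print(url)
```

The comma inside `ll` is encoded as `%2C`, which a server decodes back to a comma.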

Send the GET request and examine the results.

In [356]:
results = requests.get(url).json()
results

{'meta': {'code': 400,
  'errorDetail': 'Your geographic boundary is too big. Please search a smaller area.',
  'errorType': 'geocode_too_big',
  'requestId': '5f4d9410ad35771c6b909d77'},
 'response': {}}
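
The response above carries error code 400 (`geocode_too_big`) and an empty `response`, so any later lookup of `results['response']['groups']` fails with a `KeyError`. A minimal sketch of a guard that checks the `meta` block first is shown here. It assumes that a successful call reports `meta` code 200 and uses the `groups`/`items` layout that the following cells index into; `extract_items` is a hypothetical helper.

```python
def extract_items(payload):
    """Return the venue items of an explore response, or raise on an API error."""
    meta = payload.get('meta', {})
    if meta.get('code') != 200:
        raise ValueError('Foursquare error {}: {}'.format(
            meta.get('errorType'), meta.get('errorDetail')))
    groups = payload.get('response', {}).get('groups', [])
    # an empty list means the call succeeded but found no venues
    return groups[0]['items'] if groups else []


# the error payload shown above, minus the request id
error_payload = {
    'meta': {'code': 400,
             'errorDetail': 'Your geographic boundary is too big. Please search a smaller area.',
             'errorType': 'geocode_too_big'},
    'response': {}}

try:
    extract_items(error_payload)
except ValueError as err:
    print(err)
```

With this guard the failure surfaces as the API's own message, which points at the search radius rather than at a missing key.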

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [357]:
# function that extracts the category of the venue
def get_category_type(row):
    # the column is named 'categories' or 'venue.categories',
    # depending on how the response was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [358]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
# venues
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.describe()

KeyError: ignored

And how many venues were returned by Foursquare?

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

<a id='item2'></a>

## 2. Explore Capitals in the World

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=100000):
    # radius is in meters; very large values such as 10^8 are rejected by
    # Foursquare with 'geocode_too_big', so 100 km is used here as a default

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request; skip capitals for which no venue groups come back
        groups = requests.get(url).json().get('response', {}).get('groups', [])
        if not groups:
            continue
        results = groups[0]['items']

        # return only relevant information for each nearby venue,
        # ignoring venues that have no category
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results if v['venue']['categories']])

    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood',
                 'Neighborhood Latitude',
                 'Neighborhood Longitude',
                 'Venue',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])

    return nearby_venues

In [None]:
world_capitals_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

#### Let's check the size of the resulting dataframe

In [None]:
print(world_capitals_venues.shape)
world_capitals_venues

Let's check how many venues were returned for each neighborhood

In [None]:
world_capitals_venues.groupby('Neighborhood').count()

#### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(world_capitals_venues['Venue Category'].unique())))

<a id='item3'></a>

## 3. Analyze Each Country

In [None]:
# one hot encoding
world_capitals_onehot = pd.get_dummies(world_capitals_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
world_capitals_onehot['Neighborhood'] = world_capitals_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [world_capitals_onehot.columns[-1]] + list(world_capitals_onehot.columns[:-1])
world_capitals_onehot = world_capitals_onehot[fixed_columns]

world_capitals_onehot.head()

And let's examine the new dataframe size.

In [None]:
world_capitals_onehot.shape

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
world_capitals_grouped = world_capitals_onehot.groupby('Neighborhood').mean().reset_index()
world_capitals_grouped
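
Averaging the one-hot rows of a neighborhood gives the share of each category among that neighborhood's venues. The following sketch reproduces that number with the standard library only, on made-up toy data. `category_frequencies` is a hypothetical helper and the capital and category names are placeholders.

```python
from collections import Counter, defaultdict


def category_frequencies(rows):
    """Map each neighborhood to {category: share of its venues}.

    rows is an iterable of (neighborhood, category) pairs. The shares equal
    the per-neighborhood mean of the one-hot encoded categories.
    """
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hood, counter in counts.items()
    }


# toy data: four venues around one capital, two around another
toy_rows = [('Capital A', 'Cafe'), ('Capital A', 'Cafe'),
            ('Capital A', 'Hotel'), ('Capital A', 'Park'),
            ('Capital B', 'Hotel'), ('Capital B', 'Park')]
frequencies = category_frequencies(toy_rows)
print(frequencies['Capital A'])  # {'Cafe': 0.5, 'Hotel': 0.25, 'Park': 0.25}
```

Unlike the grouped dataframe, this dictionary omits categories that never occur around a capital; in the dataframe those appear as columns with frequency 0.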

#### Let's confirm the new size

In [None]:
world_capitals_grouped.shape

#### Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in world_capitals_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = world_capitals_grouped[world_capitals_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    if ind < len(indicators):
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    else:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = world_capitals_grouped['Neighborhood']

for ind in np.arange(world_capitals_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(world_capitals_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
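
The column labels above use `st`, `nd` and `rd` for the first three ranks and `th` for the rest, which is correct up to rank 20. As a sketch, a small helper can produce the right English suffix for any rank. The name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
print(columns[:4])
# ['Neighborhood', '1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue']
```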

<a id='item4'></a>

## 4. Cluster Capitals in the World

Run *k*-means to group the neighborhoods (here, the capitals) into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

world_capitals_grouped_clustering = world_capitals_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(world_capitals_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = world_capitals_grouped['Neighborhood']

for ind in np.arange(world_capitals_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(world_capitals_grouped.iloc[ind, :], num_top_venues)

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# join the cluster labels and top venues onto the capitals table, which holds latitude/longitude
world_capitals_merged = neighborhoods.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# drop capitals for which Foursquare returned no venues
world_capitals_merged.dropna(inplace=True)

# capitals missing from the join turn the labels into floats; convert them back to integers
world_capitals_merged['Cluster Labels'] = world_capitals_merged['Cluster Labels'].astype(int)

Finally, let's visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=2)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(pd.to_numeric(world_capitals_merged['Latitude']), pd.to_numeric(world_capitals_merged['Longitude']), world_capitals_merged['Neighborhood'], world_capitals_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>

## 5. Examine Clusters

#### Cluster 1

In [None]:
world_capitals_merged.loc[world_capitals_merged['Cluster Labels'] == 0, world_capitals_merged.columns[[1] + list(range(5, world_capitals_merged.shape[1]))]]

#### Cluster 2

In [None]:
world_capitals_merged.loc[world_capitals_merged['Cluster Labels'] == 1, world_capitals_merged.columns[[1] + list(range(5, world_capitals_merged.shape[1]))]]

#### Cluster 3

In [None]:
world_capitals_merged.loc[world_capitals_merged['Cluster Labels'] == 2, world_capitals_merged.columns[[1] + list(range(5, world_capitals_merged.shape[1]))]]

#### Cluster 4

In [None]:
world_capitals_merged.loc[world_capitals_merged['Cluster Labels'] == 3, world_capitals_merged.columns[[1] + list(range(5, world_capitals_merged.shape[1]))]]

#### Cluster 5

In [None]:
world_capitals_merged.loc[world_capitals_merged['Cluster Labels'] == 4, world_capitals_merged.columns[[1] + list(range(5, world_capitals_merged.shape[1]))]]