# Coursera Capstone - Week 3

This notebook is composed of three parts, as requested by the assignment directives.  
For ease of use, use the links below to go directly to each part if needed.
>
> <a href="#Part-1:-Creating-a-Dataframe-by-Webscrapping">Part 1: Creating a Dataframe by Webscrapping </a>
>
> <a href="#Part-2:-Getting-Coordinates">Part 2: Getting Coordinates </a>
>
> <a href="#Part-3:-Exploring-and-Clustering-Neighborhoods">Part 3: Exploring and Clustering Neighborhoods </a>

But first, let's import all the libraries that will be used in this notebook.

In [1]:
import requests
import folium
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans

## Part 1: Creating a Dataframe by Webscrapping

Using BeautifulSoup for web scraping:

In [2]:
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly so bs4 does not have to guess
table = soup.find('table',{'class':'wikitable sortable'})

The loop below fills the lists `postal_code`, `borough` and `neighborhood` with the values extracted from the table in the Wikipedia page.

In [3]:
postal_code = []
borough =[]
neighborhood = []

rows = table.find_all('tr')

for row in rows:
    cells = row.find_all('td')
        
    # each data row needs all three cells (postal code, borough, neighborhood)
    if len(cells) >= 3:
        postal = cells[0]
        postal_code.append(postal.text.strip())
            
        br = cells[1]
        borough.append(br.text.strip())
            
        nh = cells[2]
        neighborhood.append(nh.text.strip())            
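
The extraction logic can be tried offline on a small inline table. The sketch below is only an illustration: the HTML snippet and the names `sample_html` and `records` are made up for the example, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td> M1A </td><td> Not assigned </td><td> Not assigned </td></tr>
  <tr><td> M3A </td><td> North York </td><td> Parkwoods </td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

records = []
for row in table.find_all("tr"):
    cells = row.find_all("td")  # the header row only has <th> cells, so this is empty
    if len(cells) >= 3:
        # strip() removes the whitespace that surrounds the cell text
        records.append(tuple(cell.text.strip() for cell in cells[:3]))

print(records)
```

The header row is skipped automatically because it contains no `<td>` cells.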

Now the populated lists are used to build the dataframe.

In [4]:
df_tor = pd.DataFrame({
    'Postal Code': postal_code,
    'Borough': borough,
    'Neighborhood': neighborhood,
})
df_tor.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Now let's drop the rows whose borough is "Not assigned", sort the dataframe by postal code and reset its index.

In [5]:
df_tor.drop(df_tor.loc[df_tor['Borough']=='Not assigned'].index, inplace=True)

In [6]:
df_tor.sort_values(by=['Postal Code'], inplace=True)

In [7]:
df_tor.reset_index(drop=True, inplace=True)

In [8]:
df_tor.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
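
The three cleaning steps can also be chained without `inplace=True`. A minimal sketch on a toy dataframe (the rows and the names `toy` and `cleaned` are assumptions for illustration only):

```python
import pandas as pd

# Toy data with the same columns as df_tor (values chosen for the example).
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M1A", "M1B"],
    "Borough": ["North York", "Not assigned", "Scarborough"],
    "Neighborhood": ["Parkwoods", "Not assigned", "Malvern, Rouge"],
})

cleaned = (
    toy[toy["Borough"] != "Not assigned"]  # keep only rows with an assigned borough
    .sort_values("Postal Code")            # order by postal code
    .reset_index(drop=True)                # renumber the rows from 0
)

print(cleaned)
```

Chaining returns a new dataframe and leaves the original untouched, which makes it easier to re-run a cell without side effects.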


Finally, the **.shape** attribute is used to print the number of rows in the dataframe.

In [9]:
print("The dataframe contains {} rows.".format(df_tor.shape[0]))

The dataframe contains 103 rows.


## Part 2: Getting Coordinates

The coordinates for the postal codes were obtained from the .csv file provided in the assignment.  
Link to the dataset: https://cocl.us/Geospatial_data

In [10]:
df_pc = pd.read_csv(r'C:\Users\rodol\Documents\Data Science\9. Applied Data Science Capstone\Data Sets\Geospatial_Coordinates.csv')
df_pc.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [11]:
df = pd.merge(df_tor, df_pc, on='Postal Code')

In [12]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [13]:
df.shape[0]

103
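
The row count matching (103 before and after) suggests every postal code found its coordinates. A hedged way to check this explicitly is a left merge with `indicator=True`; the sketch below uses toy frames, and the postal code `M9Z` is invented to force one unmatched row.

```python
import pandas as pd

# Toy stand-ins for df_tor and df_pc; 'M9Z' is a made-up code with no coordinates.
left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Made-up"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column telling where each row came from.
checked = pd.merge(left, right, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner merge dropped nothing.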

In [14]:
# Checking if the Neighborhood column has any "Not assigned".
# If there is, it should be replaced by the Borough name.
df.loc[df['Neighborhood']=='Not assigned']

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude


No values found. The dataframe is complete.
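
Had any "Not assigned" neighborhoods been found, the replacement described in the comment above could look like the following sketch. The toy rows are assumptions for illustration only.

```python
import pandas as pd

# Toy rows: the first one has a borough but no neighborhood assigned.
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Woburn"],
})

# Select the affected rows and copy the borough name into the neighborhood column.
mask = toy["Neighborhood"] == "Not assigned"
toy.loc[mask, "Neighborhood"] = toy.loc[mask, "Borough"]

print(toy)
```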

## Part 3: Exploring and Clustering Neighborhoods

In this step, we are going to explore and cluster the neighborhoods in Toronto. A map will be generated to visualize the neighborhoods and how they cluster together.

### 3.1 Boroughs in Toronto

Let's see all the boroughs in Toronto.

In [15]:
from geopy.geocoders import Nominatim

city = 'Toronto, CA'

geolocator = Nominatim(user_agent="city_explorer")
location = geolocator.geocode(city)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [16]:
# create the map for the Toronto area
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

The map centered on the Toronto coordinates shows a lot of Lake Ontario, so another pair of coordinates will be used to get a broader view of the city's land area.

In [17]:
# using the coordinates for Roselawn, Central Toronto:
lt_index = int(df[df['Neighborhood']=='Roselawn'].index[0])
latitude = df.loc[lt_index, 'Latitude']
longitude = df.loc[lt_index, 'Longitude']
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

In [96]:
# add markers to map
for lat, lng, bor, neigh in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neigh, bor)
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#FFFFFF',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto

The map is rendered showing all the postal codes in Toronto, associated with their boroughs and neighborhoods. Note that one postal code can refer to several neighborhoods, and that one borough can have more than one postal code in it.

### 3.2 Define Foursquare Credentials and Version

In [19]:
CLIENT_ID = 'FOURSQUARE_ID' # your Foursquare ID
CLIENT_SECRET = 'FOURSQUARE_SECRET' # your Foursquare Secret
VERSION = '20191130' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value


**IMPORTANT:  
We'll explore Toronto's neighborhoods by Postal Codes, since one postal code can have more than one neighborhood and the latitudes and longitudes were obtained for the postal codes, not for the neighborhoods.**  

### 3.3 Exploring one Postal Code


In [20]:
postcode = df.loc[0, 'Postal Code']
postcode_latitude = df.loc[0, 'Latitude']
postcode_longitude = df.loc[0, 'Longitude']
postcode_borough = df.loc[0, 'Borough']
postcode_neighborhoods = df.loc[0, 'Neighborhood']

print('The postcode {} is located in the {} borough and covers the following neighborhoods: {}.'\
      .format(postcode, postcode_borough, postcode_neighborhoods))
print('The region latitude is {} and the longitude is {}.'.format(postcode_latitude, postcode_longitude))

The postcode M1B is located in the Scarborough borough and covers the following neighborhoods: Malvern, Rouge.
The region latitude is 43.806686299999996 and the longitude is -79.19435340000001.


With the Foursquare API, let's get the top 100 venues within 500 meters of our location, and then evaluate the categories of venues.

In [21]:
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    postcode_latitude, 
    postcode_longitude, 
    radius, 
    LIMIT)
results = requests.get(url).json()
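
The long format string can be replaced by a dictionary of query parameters, which is easier to read and encodes the values safely. This is a sketch using only the standard library; the helper name `build_explore_url` is hypothetical and the credentials are the same placeholders used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL from named parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude pair
        "radius": radius,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)

example_url = build_explore_url(
    "FOURSQUARE_ID", "FOURSQUARE_SECRET", "20191130", 43.806686, -79.194353
)
print(example_url)
```

`requests.get` also accepts a `params` dictionary and performs the encoding itself, so the same dictionary could be passed to it directly.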

The function below extracts the category of each venue:

In [22]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
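
To see how the two branches behave, here is a small sketch with plain dictionaries standing in for dataframe rows. The name `category_name` and both sample records are made up for the example; the key names follow the ones used in this notebook.

```python
def category_name(row):
    """Return the first category name of a venue record, or None if it has none."""
    try:
        categories = row["categories"]        # layout of the search endpoint
    except KeyError:
        categories = row["venue.categories"]  # layout of the flattened explore results
    return categories[0]["name"] if categories else None

# Hypothetical records: one in the explore layout, one without any category.
explore_row = {"venue.name": "Wendy's", "venue.categories": [{"name": "Fast Food Restaurant"}]}
search_row = {"name": "Some Venue", "categories": []}

print(category_name(explore_row))
print(category_name(search_row))
```

Catching only `KeyError` keeps unrelated errors visible instead of silently falling through to the second lookup.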

In [23]:
venues = results['response']['groups'][0]['items']
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy’s,Fast Food Restaurant,43.807448,-79.199056


In [24]:
print('Foursquare returned {} venues.'.format(nearby_venues.shape[0]))

Foursquare returned 1 venues.


### 3.4 Exploring all the Postal Codes
Now let's define a function that repeats the same analysis for all the postal codes in Toronto.

In [25]:
def getNearbyVenues(postcodes, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for pc, lat, lng in zip(postcodes, latitudes, longitudes):
                    
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            pc, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Postal Code Latitude', 
                  'Postal Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
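
The function above assumes that every response contains a `groups` list and that every venue has at least one category. A more defensive parsing step could look like the sketch below. The helper `extract_venues` and the fake payload are assumptions for illustration; only the nesting (`response` → `groups` → `items` → `venue`) is taken from the code above.

```python
def extract_venues(payload, postcode, lat, lng):
    """Turn one decoded explore response into a list of row tuples (hypothetical helper)."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None  # tolerate uncategorized venues
        rows.append((
            postcode, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# Fake payload shaped like the responses used in this notebook.
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 43.8, "lng": -79.2},
               "categories": [{"name": "Bar"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 43.7, "lng": -79.1},
               "categories": []}},
]}]}}

rows = extract_venues(fake_payload, "M1X", 43.80, -79.19)
print(rows)
print(extract_venues({"response": {}}, "M1X", 43.80, -79.19))  # no groups: empty list
```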

Now run the function to create a new dataframe called `toronto_venues`.

In [26]:
toronto_venues = getNearbyVenues(postcodes=df['Postal Code'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Checking the `toronto_venues` dataframe:

In [27]:
print(toronto_venues.shape)
toronto_venues.head()

(2136, 7)


Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,M1C,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,M1C,43.784535,-79.160497,SEBS Engineering Inc. (Sustainable Energy and ...,43.782371,-79.15682,Construction & Landscaping
3,M1E,43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank
4,M1E,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


Next, count the number of venues returned for each postal code.

In [28]:
toronto_venues.groupby('Postal Code').count()

Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1B,1,1,1,1,1,1
M1C,2,2,2,2,2,2
M1E,8,8,8,8,8,8
M1G,4,4,4,4,4,4
M1H,8,8,8,8,8,8
...,...,...,...,...,...,...
M9N,2,2,2,2,2,2
M9P,6,6,6,6,6,6
M9R,4,4,4,4,4,4
M9V,10,10,10,10,10,10


Unique categories from the returned values:

In [29]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 273 unique categories.


### 3.5 Analyze each Postal Code

In [30]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add postal code column back to dataframe
toronto_onehot['Postal Code'] = toronto_venues['Postal Code'] 

# move postal code column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot

Unnamed: 0,Postal Code,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2131,M9V,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2132,M9W,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2133,M9W,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2134,M9W,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now group the rows by postal code and take the mean of each one-hot column, which gives the frequency of occurrence of each category within the postal code.

In [31]:
toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()
toronto_grouped

Unnamed: 0,Postal Code,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,M9N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,M9P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,M9R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,M9V,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
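
Why the mean of one-hot columns equals a frequency is easier to see on a tiny example. The sketch below uses a toy dataframe (venue rows assumed for illustration) with two venues in one postal code and four in another.

```python
import pandas as pd

# Toy venue list: 2 venues in M1C, 4 venues in M1G.
toy_venues = pd.DataFrame({
    "Postal Code": ["M1C", "M1C", "M1G", "M1G", "M1G", "M1G"],
    "Venue Category": [
        "Bar", "Construction & Landscaping",
        "Coffee Shop", "Coffee Shop", "Mexican Restaurant", "Korean Restaurant",
    ],
})

# One column per category, 1 where the venue belongs to it and 0 elsewhere.
onehot = pd.get_dummies(toy_venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Postal Code", toy_venues["Postal Code"])

# Mean of 0/1 values per postal code = share of venues in that category.
freq = onehot.groupby("Postal Code").mean()
print(freq)
```

Two coffee shops out of four venues give a mean of 0.5 for `Coffee Shop` in M1G, which is exactly the share of venues in that category.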


Print the first 5 postal codes along with the 5 most common venue categories in each:

In [32]:
num_top_venues = 5

for post in toronto_grouped['Postal Code'].head(5):
    print("----"+post+"----")
    temp = toronto_grouped[toronto_grouped['Postal Code'] == post].T.reset_index()
    temp.columns = ['Venue','Freq']
    temp = temp.iloc[1:]
    temp['Freq'] = temp['Freq'].astype(float)
    temp = temp.round({'Freq': 2})
    print(temp.sort_values('Freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M1B----
                       Venue  Freq
0       Fast Food Restaurant   1.0
1          Accessories Store   0.0
2  Middle Eastern Restaurant   0.0
3        Moroccan Restaurant   0.0
4        Monument / Landmark   0.0


----M1C----
                        Venue  Freq
0  Construction & Landscaping   0.5
1                         Bar   0.5
2           Accessories Store   0.0
3           Mobile Phone Shop   0.0
4                       Motel   0.0


----M1E----
                Venue  Freq
0      Breakfast Spot  0.12
1      Medical Center  0.12
2                Bank  0.12
3  Mexican Restaurant  0.12
4        Intersection  0.12


----M1G----
                Venue  Freq
0         Coffee Shop  0.50
1  Mexican Restaurant  0.25
2   Korean Restaurant  0.25
3   Accessories Store  0.00
4   Mobile Phone Shop  0.00


----M1H----
                  Venue  Freq
0                  Bank  0.12
1  Caribbean Restaurant  0.12
2       Thai Restaurant  0.12
3   Fried Chicken Joint  0.12
4    Athletics & Spo

Putting the above information in a dataframe leads us to:

In [33]:
# function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
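
Here is what the function does to a single row, as a minimal sketch. The row values are invented, and `top_categories` is a hypothetical variant that uses `nlargest` instead of a full sort.

```python
import pandas as pd

# One row shaped like toronto_grouped: postal code first, then category frequencies.
row = pd.Series({
    "Postal Code": "M1X",
    "Coffee Shop": 0.5,
    "Korean Restaurant": 0.2,
    "Yoga Studio": 0.0,
    "Mexican Restaurant": 0.3,
})

def top_categories(row, n):
    """Return the names of the n most frequent categories in a row."""
    freqs = row.iloc[1:].astype(float)  # skip the postal code, make the values numeric
    return freqs.nlargest(n).index.tolist()

print(top_categories(row, 2))
```

Note that when a postal code has fewer venues than `num_top_venues`, the remaining slots are filled with categories whose frequency is 0, so the lower ranks in the table carry no information for such rows.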

In [67]:
# create the new dataframe and display the top 10 venues for each postal code

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
postcodes_venues_sorted = pd.DataFrame(columns=columns)
postcodes_venues_sorted['Postal Code'] = toronto_grouped['Postal Code']

for ind in np.arange(toronto_grouped.shape[0]):
    postcodes_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

postcodes_venues_sorted.head()

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Fast Food Restaurant,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Health & Beauty Service
1,M1C,Construction & Landscaping,Bar,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
2,M1E,Breakfast Spot,Restaurant,Electronics Store,Medical Center,Rental Car Location,Intersection,Mexican Restaurant,Bank,Yoga Studio,Doner Restaurant
3,M1G,Coffee Shop,Mexican Restaurant,Korean Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Yoga Studio
4,M1H,Gas Station,Fried Chicken Joint,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Hakka Restaurant,Electronics Store,Eastern European Restaurant


### 3.6 Cluster Postal Codes

In this section, we'll use k-means clustering to create 5 clusters.

In [68]:
# set number of clusters
k = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Postal Code')

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([4, 1, 0, 0, 0, 0, 0, 0, 2, 0])
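
To get a feel for what `labels_` contains, here is a small sketch on a made-up frequency matrix with two obvious groups. The data is an assumption for illustration, and `n_init=10` is set explicitly so the sketch does not depend on the library's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four toy "postal codes" described by three category frequencies.
# Rows 0-1 are dominated by the first category, rows 2-3 by the last one.
X = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = model.labels_  # one integer label per row, in row order
print(labels)
```

The label numbers themselves are arbitrary; only which rows share a label is meaningful.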

Create a new dataframe with the cluster and the top 10 venues for each postal code.

In [69]:
# add clustering labels
postcodes_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [70]:
toronto_merged = df

In [71]:
toronto_merged = toronto_merged.join(postcodes_venues_sorted.set_index('Postal Code'), on='Postal Code')

In [72]:
toronto_merged.dropna(axis=0, inplace=True)

In [73]:
toronto_merged

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,4.0,Fast Food Restaurant,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant,Health & Beauty Service
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,1.0,Construction & Landscaping,Bar,Yoga Studio,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.0,Breakfast Spot,Restaurant,Electronics Store,Medical Center,Rental Car Location,Intersection,Mexican Restaurant,Bank,Yoga Studio,Doner Restaurant
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,Coffee Shop,Mexican Restaurant,Korean Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Yoga Studio
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.0,Gas Station,Fried Chicken Joint,Bakery,Bank,Athletics & Sports,Thai Restaurant,Caribbean Restaurant,Hakka Restaurant,Electronics Store,Eastern European Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188,3.0,Park,Yoga Studio,Eastern European Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Electronics Store
99,M9P,Etobicoke,Westmount,43.696319,-79.532242,0.0,Pizza Place,Coffee Shop,Discount Store,Sandwich Place,Chinese Restaurant,Intersection,Eastern European Restaurant,Electronics Store,Dumpling Restaurant,Dim Sum Restaurant
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724,3.0,Pizza Place,Park,Sandwich Place,Bus Line,Donut Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437,0.0,Pizza Place,Grocery Store,Fried Chicken Joint,Sandwich Place,Pharmacy,Liquor Store,Beer Store,Fast Food Restaurant,Gluten-free Restaurant,Department Store


And now for the visualization of the clusters:

In [74]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

In [75]:
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Postal Code'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
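
The cell above indexes the palette with `int(cluster-1)`, so label 0 wraps around to the last color. The mapping is still one color per cluster, but a direct lookup may be easier to follow. A sketch, with `palette` and `color_for` as hypothetical names:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # number of clusters, as in the notebook

# One hex color per cluster label, evenly spaced along the rainbow colormap.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

def color_for(cluster_label):
    """Map a cluster label (possibly a float such as 4.0) to its hex color."""
    return palette[int(cluster_label)]

print(palette)
print(color_for(0.0), color_for(4.0))
```

The `int()` conversion matters because the labels become floats after the join and `dropna` steps.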

### 3.7 Name the Clusters

Each cluster is named after the category that appears most often as the 1st most common venue among its postal codes.

### Cluster 1 - The One with the Coffee Shops

In [86]:
cl_1 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [87]:
cl_1['1st Most Common Venue'].value_counts()

Coffee Shop             20
Pizza Place             10
Café                     6
Clothing Store           5
Indian Restaurant        2
Gym                      2
Sandwich Place           2
Athletics & Sports       2
Japanese Restaurant      2
Grocery Store            2
Breakfast Spot           2
Park                     1
Martial Arts School      1
Trail                    1
Convenience Store        1
Department Store         1
Airport Lounge           1
Drugstore                1
Bar                      1
Gift Shop                1
Print Shop               1
Fast Food Restaurant     1
Greek Restaurant         1
River                    1
Playground               1
Field                    1
Bus Line                 1
Gas Station              1
Metro Station            1
Restaurant               1
Skating Rink             1
Health Food Store        1
Garden                   1
Bakery                   1
Ramen Restaurant         1
Accessories Store        1
College Stadium          1
B

### Cluster 2 - The One with the Constructors and Landscapers

In [88]:
cl_2 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [89]:
cl_2['1st Most Common Venue'].value_counts()

Construction & Landscaping    2
Baseball Field                1
Name: 1st Most Common Venue, dtype: int64

### Cluster 3 - The One with the American Food

In [90]:
cl_3 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [91]:
cl_3['1st Most Common Venue'].value_counts()

American Restaurant    1
Name: 1st Most Common Venue, dtype: int64

### Cluster 4 - The One with the Parks

In [92]:
cl_4 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [93]:
cl_4['1st Most Common Venue'].value_counts()

Park            6
Pizza Place     2
Airport         1
Playground      1
Bakery          1
Intersection    1
Trail           1
Name: 1st Most Common Venue, dtype: int64

### Cluster 5 - The One with Fast Food

In [94]:
cl_5 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [95]:
cl_5['1st Most Common Venue'].value_counts()

Fast Food Restaurant    1
Name: 1st Most Common Venue, dtype: int64
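
The five near-identical cells above could be condensed into a single `groupby`. The sketch below runs on a toy dataframe with the same two column names; the rows are assumptions for illustration and not the notebook's results.

```python
import pandas as pd

# Toy stand-in for toronto_merged with only the two columns needed here.
toy_merged = pd.DataFrame({
    "Cluster Labels": [0.0, 0.0, 0.0, 1.0, 3.0, 3.0],
    "1st Most Common Venue": [
        "Coffee Shop", "Coffee Shop", "Pizza Place",
        "Construction & Landscaping", "Park", "Park",
    ],
})

# For each cluster, pick the category that is most often ranked first.
top_per_cluster = (
    toy_merged.groupby("Cluster Labels")["1st Most Common Venue"]
    .agg(lambda s: s.value_counts().idxmax())
)
print(top_per_cluster)
```

This gives one naming candidate per cluster in a single table, although the full `value_counts()` output is still useful to judge how dominant that category is.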

### Testing the Search Query

In [None]:
search_query = 'music'
radius = 500
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET,  
    postcode_latitude, 
    postcode_longitude,
    VERSION,
    search_query,
    radius, 
    LIMIT)
search_results = requests.get(url).json()

In [None]:
music_venues = search_results['response']['venues']
nearby_music_venues = pd.json_normalize(music_venues) # flatten JSON

# filter columns
music_filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
nearby_music_venues = nearby_music_venues.loc[:, music_filtered_columns]

# filter the category for each row
nearby_music_venues['categories'] = nearby_music_venues.apply(get_category_type, axis=1)

# clean columns
nearby_music_venues.columns = [col.split(".")[-1] for col in nearby_music_venues.columns]

nearby_music_venues.head()