<html>
    
    <center><h1>Coursera</h1></center>
    <center><h2>Applied Data Science Capstone Course</h2></center>
    <center><h2>Toronto Neighborhood Segmentation Project</h2></center>

    <h5>Author: Royden Lynch<br>
        Since:  04/16/2019</h5>
        
    <h4>Description:</h4>
    <p>
        This is the main notebook for the week three project on segmenting the neighborhoods of the city of Toronto.
    </p>

</html>

### 1.0 - Project Setup

#### 1.1 - Install Packages

Before we start coding, we need to install and import all the required packages. Let's do that at the top. A comment above each import explains what the package will be used for.

In [1]:
# BeautifulSoup
#   This will be used for handling the web-scraping portion
#   of the assignment. It provides many useful methods for
#   scraping data from HTML files or from the HTML of pages
#   on the internet.
from bs4 import BeautifulSoup as BS

# Folium
#   This will be used for rendering maps.
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium

# Geocoder
#   This will be used to find the latitude and longitude
#   coordinates of a given postal code.
#   It is not very effective.
import geocoder

# Nominatim
#   This will be used to find the latitude and longitude
#   coordinates of a given address. I'm not sure if
#   it is a good package or not.
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

import numpy as np

# Pandas
#   This will be used for handling data in the format of
#   dataframes. It provides many useful methods for dealing
#   with tabular-like data.
import pandas as pd

# Requests
#   This will be used for handling HTTP requests. This includes
#   requests for the HTML source of the websites we scrape
#   data from, as well as API calls to our location data
#   provider, Foursquare.
import requests

### 2.0 - Data Preprocessing

#### 2.1 - Retrieve Toronto Post Code Data

One of the packages we imported is "BeautifulSoup". This is the first package we'll need, as the first step in this project is to scrape data from this <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">Wikipedia page</a>. The page contains a table of the post codes, boroughs, and neighborhoods in Toronto, Ontario, Canada.<br>
Let's go ahead and scrape the data, with each step explained in comments below.

In [2]:
# Get HTML Source File of Wikipedia Page
source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

In [3]:
# Create BeautifulSoup Object
soup = BS(source, "lxml")
# Print the HTML
#print(soup.prettify())

In [4]:
# Find the Main Table
table = soup.find("table")
# Print the Table Found
#print(table.prettify())

In [5]:
# Create Empty DataFrame, with Specified Columns
column_names = ["PostCode", "Borough", "Neighborhood"]
to_hoods = pd.DataFrame(columns=column_names)

In [6]:
# For Each Row in the Table Body
for row in table.tbody.find_all("tr"):
    
    # Variables to Store Row Data
    row_data = []
    
    # For Each Column in the Row
    for column in row.find_all("td"):
        
        # Store the Column's Data (Text)
        col_data = column.text
            
        # Add the Data to the Row Data
        row_data.append(col_data)
        
    # Sanity Check for Empty Rows
    if len(row_data) == 0:
        continue

    # Create Row Data Dictionary
    row_data = {
        "PostCode"     : row_data[0],
        "Borough"      : row_data[1],
        "Neighborhood" : row_data[2][0:-1]
    }
    
    # If Borough is Undefined, Forget About Data    
    if row_data["Borough"] == "Not assigned":
        continue
        
    # If Neighborhood is Undefined, Let It Equal
    if row_data["Neighborhood"] == "Not assigned":
        row_data["Neighborhood"] = row_data["Borough"]
            
    # Append the Row of Data to the DataFrame
    to_hoods = to_hoods.append(row_data, ignore_index=True)

# Show the Resulting DataFrame
to_hoods.head(10)

Unnamed: 0,PostCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


Great! We now have the raw data from the page scraped and stored in a pandas DataFrame object.
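
The loop above calls `to_hoods.append(...)` once per row. `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so that loop only runs on older pandas versions. Below is a minimal sketch of the same filtering logic that collects plain dictionaries and builds the DataFrame once at the end. The `scraped_rows` list and the `to_hoods_sketch` name are hypothetical stand-ins for the text read from the table cells, not part of the original notebook.

```python
import pandas as pd

# Hypothetical cell text, standing in for what the scraping loop reads
# from each <tr>; the trailing "\n" mimics the last cell of a table row.
scraped_rows = [
    ["M1A", "Not assigned", "Not assigned\n"],
    ["M3A", "North York", "Parkwoods\n"],
    ["M7A", "Queen's Park", "Not assigned\n"],
]

records = []
for postcode, borough, neighborhood in scraped_rows:
    # Remove the trailing newline from the last cell
    neighborhood = neighborhood.strip()
    # Skip rows whose borough is undefined
    if borough == "Not assigned":
        continue
    # Fall back to the borough name when the neighborhood is undefined
    if neighborhood == "Not assigned":
        neighborhood = borough
    records.append(
        {"PostCode": postcode, "Borough": borough, "Neighborhood": neighborhood}
    )

# Build the DataFrame once, instead of appending row by row
to_hoods_sketch = pd.DataFrame(
    records, columns=["PostCode", "Borough", "Neighborhood"]
)
print(to_hoods_sketch)
```

Building the frame once also avoids copying the whole DataFrame on every iteration, which is what a row-by-row append does.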

#### 2.2 - Formatting Toronto Post Code Data

Now that we have the raw data, we can format it to our liking. In this case, we'll format it as directed by the prompt for this project. This means that we need to group the data by the post code and combine the neighborhoods into one string for each post code.

In [7]:
# Create a Copy of the Original DataFrame
to_hoods_byPostCode = to_hoods.copy()

# Group the Neighborhoods (by PostCode/Borough) and Combine the Neighborhoods as a List
to_hoods_byPostCode = to_hoods_byPostCode.groupby(["PostCode", "Borough"])["Neighborhood"].apply(list).reset_index()

# Save a List Version for Later
to_hoods_byPostCode_asList = to_hoods_byPostCode.copy()
# Convert List Obj to a Comma-Separated String (Joining Keeps Apostrophes in Names Intact)
to_hoods_byPostCode["Neighborhood"] = to_hoods_byPostCode["Neighborhood"].apply(", ".join)

# Show Head of Data
to_hoods_byPostCode.head()


Unnamed: 0,PostCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
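
As a small illustration of the grouping step, here is a sketch on toy data that groups by post code and borough and joins the neighborhood names in a single aggregation. The `toy` and `grouped` names are made up for this example. Joining the strings directly keeps apostrophes such as the one in "Queen's Park", which a quote-stripping approach would lose.

```python
import pandas as pd

# Toy stand-in for the scraped table (values are illustrative only)
toy = pd.DataFrame({
    "PostCode": ["EX1", "EX1", "EX2"],
    "Borough": ["York", "York", "York"],
    "Neighborhood": ["Queen's Park", "Rouge", "Malvern"],
})

# Group by post code and borough, joining the names with ", "
grouped = (
    toy.groupby(["PostCode", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(grouped)
```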


Let's take a look at the shape of the data.

In [8]:
# Show Shape of Data
to_hoods_byPostCode.shape

(103, 3)

#### 2.3 - Retrieving Toronto Post Code Location Data

The data we have so far is useful, but we really want the location (in the form of latitude and longitude coordinates) of these data points. We'll start by finding the location data for each post code.

In [9]:
# The function below shows how I attempted to retrieve the latitude and
# longitude data using the Python geocoder package. However, as mentioned
# in the assignment's description, the package is apparently very unstable
# and thus unreliable. I think my code is correct, but for some reason the
# package is not finding any coordinates.
# Therefore, I put the code in an unused function, rather than deleting it.
def add_latlng_coords_from_geocoder(max_attempts=10):
    # Collect One Latitude/Longitude Record Per Post Code
    latlng_rows = []

    # For Each Row in the to_hoods_byPostCode DataFrame
    for index, row in to_hoods_byPostCode.iterrows():
        # Find Coords at the Post Code, Giving Up After max_attempts Tries
        # (an Unbounded Loop Never Ends If the Lookup Keeps Failing)
        coords_ll = None
        attempts = 0
        while coords_ll is None and attempts < max_attempts:
            g = geocoder.google("{}, Toronto, Ontario".format(row["PostCode"]))
            coords_ll = g.latlng
            attempts += 1
        # Store the Coords (NaN If Nothing Was Found)
        if coords_ll is None:
            coords_ll = [np.nan, np.nan]
        latlng_rows.append({
            "Latitude"  : coords_ll[0],
            "Longitude" : coords_ll[1]
        })

    # Create DataFrame for Latitude and Longitude Coordinates
    latlng_df = pd.DataFrame(latlng_rows, columns=["Latitude", "Longitude"])

    # Concatenate the latlng_df DataFrame to the Side of the to_hoods_byPostCode DataFrame
    return pd.concat([to_hoods_byPostCode, latlng_df], axis=1)

In [10]:
# Read Latitude and Longitude Data from CSV Given by Assignment
to_PostCode_withLatLng = pd.read_csv("Geospatial_Coordinates.csv")
print(to_PostCode_withLatLng.shape)
to_PostCode_withLatLng.head()

(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


With the location data that we've read from the CSV file given by the prompt, we can now concatenate it to our DataFrame, giving us a DataFrame that relates the PostCode, Borough, Neighborhoods, Latitude, and Longitude.

In [11]:
# It seems that the data given to us by the prompt is
# in the same order as our DataFrame from step one.
# Therefore, I will take this as an assumption and just
# concatenate the two DataFrames. If this were false, then
# I would need to go back and sort the latitude/longitude
# DataFrame and remove unnecessary PostalCodes.
to_hoods_byPostCode_withLatLng = pd.concat([to_hoods_byPostCode, to_PostCode_withLatLng["Latitude"], to_PostCode_withLatLng["Longitude"]], axis=1)
to_hoods_byPostCode_asList_withLatLng = pd.concat([to_hoods_byPostCode_asList, to_PostCode_withLatLng["Latitude"], to_PostCode_withLatLng["Longitude"]], axis=1)
to_hoods_byPostCode_withLatLng.head()

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


Great! Now we have our DataFrame that has all the information we need to do some initial visualizations!
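
The concatenation above relies on both tables listing the post codes in the same order. If that assumption ever failed, the coordinates would silently end up on the wrong rows. A minimal sketch of an order-independent alternative is to merge on the post code itself. The toy frames below (`hoods`, `coords`) are illustrative stand-ins that reuse the column names "PostCode" and "Postal Code" from the notebook.

```python
import pandas as pd

# Toy stand-ins: the two tables are deliberately in different orders,
# and the coordinates table has an extra post code
hoods = pd.DataFrame({
    "PostCode": ["EX2", "EX1"],
    "Borough": ["York", "York"],
})
coords = pd.DataFrame({
    "Postal Code": ["EX1", "EX2", "EX3"],
    "Latitude": [20.48, 3.142, 0.0],
    "Longitude": [17.76, 1.618, 0.0],
})

# Match rows by post code; validate raises if a key is duplicated
merged = hoods.merge(
    coords,
    left_on="PostCode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
).drop(columns="Postal Code")
print(merged)
```

With `how="left"`, every row of the neighborhood table is kept and unmatched post codes would show up as missing coordinates, which makes a mismatch easy to spot.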

#### 2.4 - Visualize Toronto Post Code Location Data

So, let's do some!<br>
Before we start, however, let's get an idea of the number of boroughs, post codes, and neighborhoods that we are working with.

In [12]:
# Find Number of Unique Boroughs, PostCodes, and Neighborhoods
n_Boroughs      = len(to_hoods_byPostCode_asList_withLatLng["Borough"].unique())
n_PostCodes     = len(to_hoods_byPostCode_asList_withLatLng["PostCode"].unique())
n_Neighborhoods = 0
for index,row in to_hoods_byPostCode_asList_withLatLng.iterrows():
    n_Neighborhoods += len(row["Neighborhood"])

print("The dataframe contains {} boroughs, {} post codes, and {} neighborhoods.".format(n_Boroughs, n_PostCodes, n_Neighborhoods))


The dataframe contains 11 boroughs, 103 post codes, and 211 neighborhoods.
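
For reference, the same counts can be computed without an explicit loop. This is a minimal sketch on toy data shaped like the list-valued DataFrame; the `toy` frame and the lowercase variable names are made up for the example.

```python
import pandas as pd

# Toy stand-in with list-valued neighborhoods
toy = pd.DataFrame({
    "PostCode": ["EX1", "EX2"],
    "Borough": ["York", "York"],
    "Neighborhood": [["A", "B"], ["C"]],
})

# nunique() counts distinct values in a column
n_boroughs = toy["Borough"].nunique()
n_postcodes = toy["PostCode"].nunique()
# map(len) gives the list length per row; sum() adds them up
n_neighborhoods = int(toy["Neighborhood"].map(len).sum())

print(n_boroughs, n_postcodes, n_neighborhoods)
```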


Cool, now we can go ahead and visualize our post codes on a map of Toronto using the "folium" package.

In [13]:
# Find Coordinates of Toronto
to_address = "Toronto, Ontario"
geolocator = Nominatim(user_agent="to_explorer")
to_latlng = geolocator.geocode(to_address)
to_lat = to_latlng.latitude
to_lng = to_latlng.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(to_lat, to_lng))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [14]:
# Create Smaller Name for DataFrame Reference
data = to_hoods_byPostCode_asList_withLatLng.copy()

# Create Map of Toronto
to_map = folium.Map(location=[to_lat,to_lng], zoom_start=10)

# Add Markers to Map
for postal_code, borough, lat, lng in zip(data["PostCode"], data["Borough"], data["Latitude"], data["Longitude"]):
    label = "{}, {}".format(postal_code, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color="blue",
        fill=True,
        fill_color="#3186cc",
        fill_opacity=0.7,
        parse_html=False
    ).add_to(to_map)
to_map

#### 2.5 - Recap of Data Preprocessing Until This Point

Since we've done quite a bit of preprocessing, let's take a second to review all the DataFrames that we have collected. Then we can determine what else we need to get, what we want to do, what we want to analyze, etc.

Current DataFrames:<br>

<br>
<center><b>'to_hoods'</b></center>

|   | PostCode | Borough | Neighborhood |
|---|----------|---------|--------------|
| 0 | EX1      | York    | Da' Hood     |
| 1 | EX1      | York    | Anotha One   |
| 2 | EX2      | York    | Go Crazy     |
| . | .        | .       | .            |

<br>
<center><b>'to_hoods_byPostCode'</b></center>

|   | PostCode | Borough | Neighborhood         |
|---|----------|---------|----------------------|
| 0 | EX1      | York    | Da' Hood, Anotha One |
| 1 | EX2      | York    | Go Crazy             |
| . | .        | .       | .                    |

<br>
<center><b>'to_hoods_byPostCode_asList'</b></center>

|   | PostCode | Borough | Neighborhood           |
|---|----------|---------|------------------------|
| 0 | EX1      | York    | [Da' Hood, Anotha One] |
| 1 | EX2      | York    | [Go Crazy]             |
| . | .        | .       | .                      |

<br>
<center><b>'to_hoods_byPostCode_withLatLng'</b></center>

|   | PostCode | Borough | Neighborhood         | Latitude | Longitude |
|---|----------|---------|----------------------|----------|-----------|
| 0 | EX1      | York    | Da' Hood, Anotha One | 20.48    | 17.76     |
| 1 | EX2      | York    | Go Crazy             | 3.142    | 1.618     |
| . | .        | .       | .                    | .        | .         |

<br>
<center><b>'to_hoods_byPostCode_asList_withLatLng'</b></center>

|   | PostCode | Borough | Neighborhood           | Latitude | Longitude |
|---|----------|---------|------------------------|----------|-----------|
| 0 | EX1      | York    | [Da' Hood, Anotha One] | 20.48    | 17.76     |
| 1 | EX2      | York    | [Go Crazy]             | 3.142    | 1.618     |
| . | .        | .       | .                      | .        | .         |

Okay, so at its most expansive point, we have data connecting:  
-> Each PostCode to Its Borough<br>
-> Each PostCode to All Its Neighborhoods<br>
-> Each PostCode to Its Latitude<br>
-> Each PostCode to Its Longitude<br>

<br>
Clearly, our data is keyed on PostCodes, so the easiest analysis for us to complete will be based around the PostCodes in the Toronto area.<br>
Similar to the New York Segmentation Lab, we can segment these PostCodes based on the venue data that we can retrieve from the Foursquare API with our developer account.

#### 2.6 - Retrieving Location Data from Foursquare API

So, let's go ahead and do that.<br>
The first step is to save our developer Client_ID, Client_Secret, and Version as local variables. For privacy's sake, these values were redacted after running the notebook.

In [15]:
# @hidden_cell

# Client_ID
CLIENT_ID     = "--[redacted]--"
# Client_Secret
CLIENT_SECRET = "--[redacted]--"
# Version
VERSION       = "--[redacted]--"

With our developer keys defined as constants, we can create a function that requests the popular venues near each location in a list of geographical locations. In our case, we will pass this function our PostCodes, with their Latitude and Longitude values, and get all the venues near each PostCode.

In [16]:
# explore_venues(...) Method
#   @param  centers_names       - names of the geographical locations to explore
#   @param  centers_latitudes   - latitudes of the geographical locations to explore
#   @param  centers_longitudes  - longitudes of the geographical locations to explore
#   @param  radius              - exploration radius of each geographical location (default=500)
#   @param  limit               - maximum number of venues to find (default=100)
#
#   @return nearby_venues       - DataFrame of ["PostCode",
#                                               "PostCode Latitude",
#                                               "PostCode Longitude",
#                                               "Venue",
#                                               "Venue Latitude",
#                                               "Venue Longitude",
#                                               "Venue Category"]
def explore_venues(centers_names, centers_latitudes, centers_longitudes, radius=500, limit=100):
    
    # Create Empty DataFrame, with Specified Columns
    column_names = ["PostCode",
                    "PostCode Latitude",
                    "PostCode Longitude",
                    "Venue",
                    "Venue Latitude",
                    "Venue Longitude",
                    "Venue Category"]
    nearby_venues = pd.DataFrame(columns=column_names)
    
    # For Each Center
    for center_name, center_latitude, center_longitude in zip(centers_names, centers_latitudes, centers_longitudes):
        
        # Print Name of Current Center
        print("Center Name: "+str(center_name))
        
        # Create the Foursquare API Url
        url = "https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            center_latitude,
            center_longitude,
            radius,
            limit
        )
        
        # Make the GET Request
        results = requests.get(url).json()["response"]["groups"][0]["items"]
        
        # Keep Track of Whether Any Venue Was Appended
        appended = False
        
        # For Venue in Results
        for v in results:
            # Append Center/Venue Data
            toAppend = {
                "PostCode"           : center_name,
                "PostCode Latitude"  : center_latitude,
                "PostCode Longitude" : center_longitude,
                "Venue"              : v["venue"]["name"],
                "Venue Latitude"     : v["venue"]["location"]["lat"],
                "Venue Longitude"    : v["venue"]["location"]["lng"],
                "Venue Category"     : v["venue"]["categories"][0]["name"],
            }
            nearby_venues = nearby_venues.append(toAppend, ignore_index=True)
            # Mark Append Action as True
            appended = True
            
        # If Nothing was Appended (There Must've Been No Venues)
        if not appended:
            # Append Center with No Venue Data
            toAppend = {
                "PostCode"           : center_name,
                "PostCode Latitude"  : center_latitude,
                "PostCode Longitude" : center_longitude,
                "Venue"              : "None",
                "Venue Latitude"     : "NaN",
                "Venue Longitude"    : "NaN",
                "Venue Category"     : "NaN",
            }
            nearby_venues = nearby_venues.append(toAppend, ignore_index=True)
    
    # Return DataFrame
    return(nearby_venues)
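
The row-building part of `explore_venues` can be separated from the network call, which makes it testable without API keys. Below is a minimal sketch under stated assumptions: `venues_to_rows` is a hypothetical helper, `sample_items` is made-up data shaped like the `items` list that the function reads from the response, and the guard for a venue without categories is an added assumption rather than something the original code handles.

```python
def venues_to_rows(center_name, center_lat, center_lng, items):
    """Flatten explore-style items into one dict per venue."""
    rows = []
    for item in items:
        venue = item["venue"]
        # Assumption: a venue may come back without categories,
        # so fall back to an empty dict instead of indexing blindly
        categories = venue.get("categories") or [{}]
        rows.append({
            "PostCode": center_name,
            "PostCode Latitude": center_lat,
            "PostCode Longitude": center_lng,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            "Venue Category": categories[0].get("name", "NaN"),
        })
    # Keep a placeholder row for centers with no venues,
    # mirroring the "None"/"NaN" strings used in the notebook
    if not rows:
        rows.append({
            "PostCode": center_name,
            "PostCode Latitude": center_lat,
            "PostCode Longitude": center_lng,
            "Venue": "None",
            "Venue Latitude": "NaN",
            "Venue Longitude": "NaN",
            "Venue Category": "NaN",
        })
    return rows


# Made-up items for illustration only
sample_items = [
    {"venue": {"name": "Some Park",
               "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unlabeled Spot",
               "location": {"lat": 1.1, "lng": 2.1},
               "categories": []}},
]

rows = venues_to_rows("EX1", 1.0, 2.0, sample_items)
empty_rows = venues_to_rows("EX2", 3.0, 4.0, [])
print(len(rows), len(empty_rows))
```

The lists returned for all centers can be concatenated and passed to `pd.DataFrame` once, which also avoids the row-by-row `DataFrame.append` calls that pandas 2.0 removed.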

In [17]:
toronto_venues = explore_venues(
                    centers_names      = data["PostCode"],
                    centers_latitudes  = data["Latitude"],
                    centers_longitudes = data["Longitude"]
                 )

Center Name: M1B
Center Name: M1C
Center Name: M1E
Center Name: M1G
Center Name: M1H
Center Name: M1J
Center Name: M1K
Center Name: M1L
Center Name: M1M
Center Name: M1N
Center Name: M1P
Center Name: M1R
Center Name: M1S
Center Name: M1T
Center Name: M1V
Center Name: M1W
Center Name: M1X
Center Name: M2H
Center Name: M2J
Center Name: M2K
Center Name: M2L
Center Name: M2M
Center Name: M2N
Center Name: M2P
Center Name: M2R
Center Name: M3A
Center Name: M3B
Center Name: M3C
Center Name: M3H
Center Name: M3J
Center Name: M3K
Center Name: M3L
Center Name: M3M
Center Name: M3N
Center Name: M4A
Center Name: M4B
Center Name: M4C
Center Name: M4E
Center Name: M4G
Center Name: M4H
Center Name: M4J
Center Name: M4K
Center Name: M4L
Center Name: M4M
Center Name: M4N
Center Name: M4P
Center Name: M4R
Center Name: M4S
Center Name: M4T
Center Name: M4V
Center Name: M4W
Center Name: M4X
Center Name: M4Y
Center Name: M5A
Center Name: M5B
Center Name: M5C
Center Name: M5E
Center Name: M5G
Center Name: M

In [18]:
# Let's Get a Look at the Results Data
print(toronto_venues.shape)
toronto_venues.head()

(2244, 7)


Unnamed: 0,PostCode,PostCode Latitude,PostCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Wendy's,43.8074,-79.1991,Fast Food Restaurant
1,M1B,43.806686,-79.194353,Interprovincial Group,43.8056,-79.2004,Print Shop
2,M1C,43.784535,-79.160497,Royal Canadian Legion,43.7825,-79.1631,Bar
3,M1E,43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.7677,-79.1899,Pizza Place
4,M1E,43.763573,-79.188711,G & G Electronics,43.7653,-79.1915,Electronics Store


#### 2.7 - Formatting Location Data from Foursquare API

Great! Looks like the data came back just how we wanted it, without any problems.<br>
If we want to do some segmentation with this data, we need to group the DataFrame by PostCode and then somehow combine all the venues of that PostCode. We can do this with one-hot encoding.

In [19]:
# Let's See the Count of Venues Per PostCode
toronto_venues.groupby("PostCode").count()["Venue"].reset_index()

Unnamed: 0,PostCode,Venue
0,M1B,2
1,M1C,1
2,M1E,8
3,M1G,4
4,M1H,7
5,M1J,2
6,M1K,6
7,M1L,10
8,M1M,2
9,M1N,4


In [20]:
# Let's See How Many Unique Categories of Venues There Are
print('There are {} unique categories.'.format(len(toronto_venues["Venue Category"].unique())))

There are 278 unique categories.


In [21]:
# Encode the Venue Category Using Onehot Encoding
toronto_venues_onehot = pd.get_dummies(toronto_venues[["Venue Category"]], prefix="", prefix_sep="")
# Add the PostCode Column to the Onehot Encoded DataFrame
toronto_venues_onehot["PostCode"] = toronto_venues["PostCode"]

# Move the PostCode Column to the Front
fixed_columns = [toronto_venues_onehot.columns[-1]] + list(toronto_venues_onehot.columns[:-1])
toronto_venues_onehot = toronto_venues_onehot[fixed_columns]

# Show Head of DataFrame
toronto_venues_onehot.head()

Unnamed: 0,PostCode,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1C,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now, this is something we can use. With the venues encoded, we can now find the frequencies of each venue category for each PostCode. Let's do that.
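
To make the idea concrete, here is a minimal sketch on toy data: each venue becomes a row of 0/1 indicators, and the per-PostCode mean of those indicators is the share of that PostCode's venues falling in each category. The `venues`, `onehot`, and `freq` names and the values are illustrative only.

```python
import pandas as pd

# Toy venue list: EX1 has four venues, EX2 has one
venues = pd.DataFrame({
    "PostCode": ["EX1", "EX1", "EX1", "EX1", "EX2"],
    "Venue Category": ["Park", "Park", "Cafe", "Bank", "Cafe"],
})

# One indicator column per category (integers rather than booleans)
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "PostCode", venues["PostCode"])

# The mean of 0/1 indicators is the category's relative frequency
freq = onehot.groupby("PostCode").mean()
print(freq)
```

Each row of `freq` sums to 1, so PostCodes with very different venue counts end up on the same scale.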

In [22]:
# Drop the Column Indicating the Presence of Nothing
toronto_venues_onehot = toronto_venues_onehot.drop(columns=["NaN"])
toronto_venues_onehot.shape

(2244, 278)

In [23]:
# Group by the Postcode and Find the Frequency of Each Venue Category for Each Post Code
toronto_venues_grouped = toronto_venues_onehot.groupby("PostCode").mean().reset_index()

print(toronto_venues_grouped.shape)
toronto_venues_grouped.head()

(103, 278)


Unnamed: 0,PostCode,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We're getting really close to being able to use some machine learning techniques, such as KMeans clustering. We now have the frequency of every category for every PostCode. Before we cluster, let's find the most common venue categories for each PostCode, specifically the top ten.
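
As a quick illustration of ranking categories by frequency, here is a minimal sketch on a toy frequency table indexed by PostCode. The `freq` table, the `top_categories` helper, and the values are hypothetical; the frequencies are chosen without ties so the ordering is unambiguous.

```python
import pandas as pd

# Toy frequency table: rows are PostCodes, columns are categories
freq = pd.DataFrame(
    {"Cafe": [0.5, 0.0], "Park": [0.3, 0.6], "Bank": [0.2, 0.4]},
    index=["EX1", "EX2"],
)

def top_categories(row, n):
    # Sort the frequencies, largest first, and keep the first n labels
    return list(row.sort_values(ascending=False).index[:n])

# Map each PostCode to its two most frequent categories
top_two = {code: top_categories(row, 2) for code, row in freq.iterrows()}
print(top_two)
```

When several categories share the same frequency (for example many zeros), their relative order depends on the sort, so the lower ranks of a sparse PostCode should not be read as meaningful.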

In [24]:
# return_most_common_venues(...) Method
#   @param  row             - row of the grouped DataFrame (PostCode followed by category frequencies)
#   @param  num_top_venues  - number of top venue categories to return
#
#   @return toReturn        - array of the num_top_venues most frequent venue category names
def return_most_common_venues(row, num_top_venues):
    # Get the Row's Categories (Every Column Except at Index 0)
    row_categories = row.iloc[1:]
    # Sort the Values, Largest First
    row_categories_sorted = row_categories.sort_values(ascending=False)
    # Return the Top num_top_venues Results
    toReturn = row_categories_sorted.index.values[0:num_top_venues]
    return toReturn

In [25]:
# Find Top Ten Most Common Venues
num_top_venues = 10

# For Column Names (1"st", 2"nd", 3"rd")
indicators = ['st', 'nd', 'rd']

# Create Columns
columns = ["PostCode"]
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create New DataFrame with Columns
toronto_top_venues_sorted = pd.DataFrame(columns=columns)
# Add PostCodes to DataFrame
toronto_top_venues_sorted["PostCode"] = toronto_venues_grouped["PostCode"]

# For Each PostCode Row in the Grouped DataFrame
for ind in np.arange(toronto_venues_grouped.shape[0]):
    # Store Top Venues
    toronto_top_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_venues_grouped.iloc[ind, :], num_top_venues)

# Show Head of Data
toronto_top_venues_sorted.head()

Unnamed: 0,PostCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Print Shop,Fast Food Restaurant,Doner Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Yoga Studio
1,M1C,Bar,Yoga Studio,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Department Store
2,M1E,Electronics Store,Medical Center,Rental Car Location,Mexican Restaurant,Breakfast Spot,Intersection,Spa,Pizza Place,Drugstore,Dumpling Restaurant
3,M1G,Coffee Shop,Korean Restaurant,Soccer Field,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
4,M1H,Athletics & Sports,Caribbean Restaurant,Thai Restaurant,Bakery,Bank,Fried Chicken Joint,Hakka Restaurant,Eastern European Restaurant,Dumpling Restaurant,Drugstore


### 3.0 - Data Analytics

#### 3.1 - KMeans Clustering on Toronto PostCode Venues

We have enough of the right data to perform some clustering techniques to segment the data about Toronto's PostCodes. We'll start with KMeans Clustering as it is one of the most basic forms, and very applicable to our data.

In [26]:
# Import KMeans Clustering Package
from sklearn.cluster import KMeans

# Set to Cluster with 5 Clusters
kclusters = 5

# Drop the PostCode Column, It Isn't Used to Cluster
toronto_grouped_clustering = toronto_venues_grouped.drop(columns="PostCode")

# Create KMeans Model
kmeans_model = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# Show First 10 Labels
kmeans_model.labels_[0:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
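
When reading the labels, keep in mind that the cluster numbers are arbitrary identifiers: only which rows share a label matters. A minimal sketch on toy frequency vectors shows this. The array `X` is made up, and `n_init` is set explicitly so the sketch does not depend on the library default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency vectors: the first two rows are dominated by one
# category, the last two by the other
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Rows with similar profiles share a label; the label value itself
# (0 or 1) carries no meaning
print(labels)
```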

In [27]:
# Add the Cluster Labels to the DataFrame
toronto_top_venues_sorted.insert(0, "Cluster Labels", kmeans_model.labels_)

# Merge DataFrames
toronto_merged = to_hoods_byPostCode_withLatLng
toronto_merged = toronto_merged.join(toronto_top_venues_sorted.set_index("PostCode"), on="PostCode")

# Show Head of DataFrame
toronto_merged.head()

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1,Print Shop,Fast Food Restaurant,Doner Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Yoga Studio
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,1,Bar,Yoga Studio,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Department Store
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1,Electronics Store,Medical Center,Rental Car Location,Mexican Restaurant,Breakfast Spot,Intersection,Spa,Pizza Place,Drugstore,Dumpling Restaurant
3,M1G,Scarborough,Woburn,43.770992,-79.216917,1,Coffee Shop,Korean Restaurant,Soccer Field,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,Athletics & Sports,Caribbean Restaurant,Thai Restaurant,Bakery,Bank,Fried Chicken Joint,Hakka Restaurant,Eastern European Restaurant,Dumpling Restaurant,Drugstore


Awesome! We've performed clustering on the PostCodes, and now each PostCode has been labeled with its respective cluster. While this is cool abstractly, let's get a little more practical by showing this result on a map.

In [28]:
# Packages for Plotting
import matplotlib.cm as cm
import matplotlib.colors as colors

# Create Map
map_clusters = folium.Map(location=[to_lat, to_lng], zoom_start=11)

# Set Color Scheme for Clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add Markers to Map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostCode'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Interesting...<br>
Looking at the map, nothing immediately strikes me as to why the clusters came out the way they did. Perhaps it would be more helpful to look at the top venues of the PostCodes in each cluster.

##### Cluster 0

In [29]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Scarborough,0,Gym,Playground,Park,Doner Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
23,North York,0,Park,Bank,Yoga Studio,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Donut Shop
25,North York,0,Fast Food Restaurant,Park,Food & Drink Shop,Yoga Studio,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
30,North York,0,Bus Stop,Park,Airport,Snack Place,Doner Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
40,East York,0,Park,Convenience Store,Coffee Shop,Yoga Studio,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
44,Central Toronto,0,Photography Studio,Bus Line,Park,Swim School,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant
50,Downtown Toronto,0,Park,Trail,Playground,Yoga Studio,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run
64,Central Toronto,0,Bus Line,Park,Sushi Restaurant,Jewelry Store,Trail,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
74,York,0,Park,Pharmacy,Fast Food Restaurant,Market,Women's Store,College Rec Center,College Gym,Empanada Restaurant,Electronics Store,Eastern European Restaurant
79,North York,0,Basketball Court,Park,Bakery,Construction & Landscaping,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop


##### Cluster 1

In [30]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,1,Print Shop,Fast Food Restaurant,Doner Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Yoga Studio
1,Scarborough,1,Bar,Yoga Studio,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Department Store
2,Scarborough,1,Electronics Store,Medical Center,Rental Car Location,Mexican Restaurant,Breakfast Spot,Intersection,Spa,Pizza Place,Drugstore,Dumpling Restaurant
3,Scarborough,1,Coffee Shop,Korean Restaurant,Soccer Field,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
4,Scarborough,1,Athletics & Sports,Caribbean Restaurant,Thai Restaurant,Bakery,Bank,Fried Chicken Joint,Hakka Restaurant,Eastern European Restaurant,Dumpling Restaurant,Drugstore
5,Scarborough,1,Playground,Cosmetics Shop,Yoga Studio,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Donut Shop
6,Scarborough,1,Bus Station,Coffee Shop,Discount Store,Hobby Shop,Department Store,Convenience Store,Yoga Studio,Doner Restaurant,Dim Sum Restaurant,Diner
7,Scarborough,1,Bus Line,Bakery,Metro Station,Intersection,Bus Station,Soccer Field,Fast Food Restaurant,Park,Dumpling Restaurant,Drugstore
8,Scarborough,1,Motel,American Restaurant,Yoga Studio,Deli / Bodega,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
9,Scarborough,1,General Entertainment,College Stadium,Café,Skating Rink,Yoga Studio,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run


##### Cluster 2

In [31]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
94,Etobicoke,2,Bank,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Drugstore,Department Store


##### Cluster 3

In [32]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,North York,3,Cafeteria,Doner Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Yoga Studio,Deli / Bodega


##### Cluster 4

In [33]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
91,Etobicoke,4,Baseball Field,Drugstore,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio,Department Store
97,North York,4,Furniture / Home Store,Baseball Field,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Deli / Bodega,Donut Shop
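
The five cells above repeat the same selection with a different label each time. As a minimal sketch, the selection could be wrapped in a small helper and looped over the labels. The `cluster_view` helper and the `toy` frame are hypothetical; the names and values of columns 0 and 2 to 4 are placeholders assumed for illustration, since only `Borough`, `Cluster Labels`, and the venue columns are visible in the outputs.

```python
import pandas as pd

# Toy stand-in for toronto_merged. The Borough, Cluster Labels, and venue
# values are copied from rows 0, 23, and 94 above; the other columns are
# ASSUMED placeholders so that the positional selection has something to skip.
toy = pd.DataFrame(
    {
        "PostCode": ["A", "B", "C"],         # placeholder
        "Borough": ["Scarborough", "North York", "Etobicoke"],
        "Neighborhood": ["n1", "n2", "n3"],  # placeholder
        "Latitude": [0.0, 0.0, 0.0],         # placeholder
        "Longitude": [0.0, 0.0, 0.0],        # placeholder
        "Cluster Labels": [1, 0, 2],
        "1st Most Common Venue": ["Print Shop", "Park", "Bank"],
    }
)


def cluster_view(df, label, label_col="Cluster Labels"):
    """Return the rows of one cluster, keeping column 1 and columns 5 onward."""
    keep = [1] + list(range(5, df.shape[1]))
    return df.loc[df[label_col] == label, df.columns[keep]]


# One view per cluster label, keyed by plain Python ints.
views = {
    int(label): cluster_view(toy, label)
    for label in sorted(toy["Cluster Labels"].unique())
}

for label, view in views.items():
    print(f"Cluster {label}")
    print(view, end="\n\n")
```

The positional selection mirrors the cells above, so it relies on the same column order as `toronto_merged`.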


### 4.0 - Results

#### 4.1 - Clustering of Toronto Post Codes Analysis

While the map wasn't that useful in deciphering the results of the clustering, looking at the top venues in each cluster certainly was. Setting aside Clusters 2, 3, and 4 for a moment (they contain only one or two PostCodes each), Clusters 0 and 1 clearly describe different types of PostCodes. Cluster 1 seems to be more centered around active, busy city life, perhaps even commercial or business areas, whereas Cluster 0 seems much more residential. In the rows shown above, every Cluster 0 PostCode lists a Park among its top three venues, whereas Parks and Trails rarely appear in Cluster 1.
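
The Park and Trail observation can be checked numerically rather than by eye. The sketch below computes, per cluster, the share of PostCodes whose listed top venues include a Park or Trail. The `green_space_share` helper is hypothetical, and `sample` holds only the top three venues of the first three displayed rows of Clusters 0 and 1, so it illustrates the method and is not a result for the full data.

```python
import pandas as pd

# Top three venues of rows 14, 23, 25 (Cluster 0) and rows 0, 1, 2 (Cluster 1),
# copied from the tables above. This is a small excerpt, not the full data.
sample = pd.DataFrame(
    {
        "Cluster Labels": [0, 0, 0, 1, 1, 1],
        "1st Most Common Venue": [
            "Gym", "Park", "Fast Food Restaurant",
            "Print Shop", "Bar", "Electronics Store",
        ],
        "2nd Most Common Venue": [
            "Playground", "Bank", "Park",
            "Fast Food Restaurant", "Yoga Studio", "Medical Center",
        ],
        "3rd Most Common Venue": [
            "Park", "Yoga Studio", "Food & Drink Shop",
            "Doner Restaurant", "Dim Sum Restaurant", "Rental Car Location",
        ],
    }
)


def green_space_share(df, venues=("Park", "Trail"), label_col="Cluster Labels"):
    """Fraction of rows per cluster with any of `venues` in the venue columns."""
    venue_cols = [c for c in df.columns if c.endswith("Most Common Venue")]
    has_green = df[venue_cols].isin(list(venues)).any(axis=1)
    return has_green.groupby(df[label_col]).mean()


share = green_space_share(sample)
print(share)
```

Run against the full `toronto_merged` frame, the same function would cover all ten venue columns, since it picks them up by their shared name suffix.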

This has definitely been an interesting lab assignment, perhaps my favorite one yet, and I can't wait to see how we'll keep building on the work done in this lab!