## Capstone Project

### Introduction / Business Problem  
#### Postal Code Assignment to Neighborhood Groups in the Toronto Area
In previous analysis, various areas in Toronto, Ontario, Canada were examined using characteristics associated with location data such as latitude and longitude. One of those characteristics was the grouping of locations by postal code. A nationwide revision of Canadian mail-area coding was introduced around 1971 (1) so that automated methods could be used to sort and route mail (2). After early integration problems, the new postal code system was slowly adopted after 1974. Postal code assignment (Forward Sortation Area, FSA, and Local Delivery Unit, LDU) was apparently governed by an ever-increasing volume of mail items into rural and non-rural areas, although the process itself does not appear to be plainly documented.  

The assignment of a postal code appears to be linked to the population density of an area (4). For example, neighborhoods such as "Riverdale" and "The Danforth West" are assigned to the FSA code M4K. There are 140 official neighborhoods in the Toronto area (5), and there appear to be unofficial neighborhoods as well. These neighborhoods are grouped and assigned to their respective FSAs, as shown in a data table from Canada Post (5). Scarce resources in the form of organizational budget, time, and other logistics-related costs are required to move this volume of mail to its destinations. As populated areas continue to change (and grow), it stands to reason that a fiscally responsible organization such as Canada Post will want an efficient means of grouping new neighborhoods and assigning postal codes, so that mail-processing costs are kept to a minimum.  

It is this grouping of areas that is of particular interest. In our recent course of study, we have come to know this kind of association as *clustering*. The question posed, then, is: can machine-learning (ML) algorithms be used to help associate geographically similar areas with one another?  

**In this pursuit, we intend to compare a promising ML algorithm's grouping results against publicly available *ground truth* data tables such as the one exemplified above.** To assist with location accuracy, we will also determine whether the much-touted Foursquare API can supply client-curated location data for our neighborhoods of interest beyond the six decimal places of precision provided by the data tables (5).
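
To put the six-decimal-place figure in perspective, the arithmetic behind reference (5) can be sketched as follows. This is a rough illustration only: it assumes a spherical approximation in which one degree of arc spans 111.32 km at the equator, and it uses a latitude of 43.7 degrees as a stand-in for Toronto. The function name `decimal_degree_precision` is introduced here purely for illustration.

```python
import math

# Approximate length of one degree of arc at the equator, in metres.
# This is the figure implied by reference (5): 111.32 km per degree.
METRES_PER_DEGREE = 111_320.0


def decimal_degree_precision(decimals, latitude_deg=0.0):
    """Return (north_south_m, east_west_m) for the smallest step at this precision."""
    step_deg = 10.0 ** -decimals
    north_south = step_deg * METRES_PER_DEGREE
    # East-west distance shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns_equator, ew_equator = decimal_degree_precision(6)
# 43.7 degrees is an assumed, approximate latitude for Toronto.
ns_toronto, ew_toronto = decimal_degree_precision(6, latitude_deg=43.7)
print(f"equator: {ns_equator * 1000:.2f} mm N/S, {ew_equator * 1000:.2f} mm E/W")
print(f"Toronto: {ns_toronto * 1000:.2f} mm N/S, {ew_toronto * 1000:.2f} mm E/W")
```

Under these assumptions, a six-decimal coordinate resolves to roughly a tenth of a metre, which is far finer than the size of any neighborhood.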


### Data  
#### Data Requirements
(1) Which data is required?  
For our analysis, we will need location (latitude, longitude) and postal code information for the Toronto-area neighborhoods. Initially, this requirement can be fulfilled through the following three sources of data:  
- Data source 1: Neighbourhood data file from https://open.toronto.ca/dataset/neighbourhoods/  
- Data source 2: Geospatial data from https://cocl.us/Geospatial_data  
- Data source 3: Postal code data from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M  

(2) How to source or collect the data?  
To gain access to the data files we will utilize the "Find and Add Data" tool from IBM Watson Studio. Since we have already downloaded the files locally, we will simply upload them with the tool. The uploaded files are (in order):  
- neighborhoods.csv  
- Geospatial_Coordinates.csv  
- toronto_boroughs.xlsx  

(3) How to understand or work with the data?  
Let's bring the data files in one at a time to see what kind of data they contain. Note that file access credentials will be removed prior to upload to GitHub.

#### Data Collection
(1) How to source or collect the data?  
To gain access to the data files we will utilize the "Find and Add Data" tool from IBM Watson Studio. Since we have already downloaded the files locally, we will simply upload them with the tool to IBM Cloud Object Storage (COS). The uploaded files are (in order):  
- neighborhoods.csv  
- Geospatial_Coordinates.csv  
- toronto_boroughs.xlsx  

(2) How to understand or work with the data?  
Let's bring the data files in one at a time to see what kind of data they contain.

In [43]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3
def __iter__(self): return 0

Note that the file access credentials below will be removed prior to upload to GitHub.

In [44]:
# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_75ce548a9ac5464cb355f5fea1293a88 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<REDACTED>',  # credential removed before sharing, as noted above
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body1 = client_75ce548a9ac5464cb355f5fea1293a88.get_object(Bucket='capstoneprojectnotebook-donotdelete-pr-wbxciyzmta7tdr',Key='neighborhoods.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body1, "__iter__"): body1.__iter__ = types.MethodType( __iter__, body1 )
    
body2 = client_75ce548a9ac5464cb355f5fea1293a88.get_object(Bucket='capstoneprojectnotebook-donotdelete-pr-wbxciyzmta7tdr',Key='Geospatial_Coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body2, "__iter__"): body2.__iter__ = types.MethodType( __iter__, body2 )
    
body3 = client_75ce548a9ac5464cb355f5fea1293a88.get_object(Bucket='capstoneprojectnotebook-donotdelete-pr-wbxciyzmta7tdr',Key='toronto_boroughs.xlsx')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body3, "__iter__"): body3.__iter__ = types.MethodType( __iter__, body3 )
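
Since the credentials are meant to stay out of the shared notebook, one common alternative to pasting the key into a cell is to read it from an environment variable. The following is a minimal sketch under that assumption; the variable name `COS_API_KEY` and the helper `load_cos_api_key` are hypothetical names chosen for this example and are not part of IBM Watson Studio or `ibm_boto3`.

```python
import os


def load_cos_api_key(env_var="COS_API_KEY"):
    """Return the API key stored in an environment variable.

    COS_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell")
    return api_key


# Usage sketch: pass the result to the client instead of a literal string, e.g.
#   ibm_boto3.client(service_name='s3', ibm_api_key_id=load_cos_api_key(), ...)
```

With this approach the notebook source never contains the key, so there is nothing to strip out before the upload to GitHub.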

In [45]:
# this is the specific neighborhood lat/lon information from the City of Toronto open data portal (data source 1)
df1 = pd.read_csv(body1)
df1.head()

| | _id | AREA_ID | AREA_ATTR_ID | PARENT_AREA_ID | AREA_SHORT_CODE | AREA_LONG_CODE | AREA_NAME | AREA_DESC | X | Y | LONGITUDE | LATITUDE | OBJECTID | Shape__Area | Shape__Length | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2101 | 25886861 | 25926662 | 49885 | 94 | 94 | Wychwood (94) | Wychwood (94) | | | -79.425515 | 43.676919 | 16491505 | 3217960.0 | 7515.779658 | {u'type': u'Polygon', u'coordinates': (((-79.4... |
| 1 | 2102 | 25886820 | 25926663 | 49885 | 100 | 100 | Yonge-Eglinton (100) | Yonge-Eglinton (100) | | | -79.40359 | 43.704689 | 16491521 | 3160334.0 | 7872.021074 | {u'type': u'Polygon', u'coordinates': (((-79.4... |
| 2 | 2103 | 25886834 | 25926664 | 49885 | 97 | 97 | Yonge-St.Clair (97) | Yonge-St.Clair (97) | | | -79.397871 | 43.687859 | 16491537 | 2222464.0 | 8130.411276 | {u'type': u'Polygon', u'coordinates': (((-79.3... |
| 3 | 2104 | 25886593 | 25926665 | 49885 | 27 | 27 | York University Heights (27) | York University Heights (27) | | | -79.488883 | 43.765736 | 16491553 | 25418210.0 | 25632.335242 | {u'type': u'Polygon', u'coordinates': (((-79.5... |
| 4 | 2105 | 25886688 | 25926666 | 49885 | 31 | 31 | Yorkdale-Glen Park (31) | Yorkdale-Glen Park (31) | | | -79.457108 | 43.714672 | 16491569 | 11566690.0 | 13953.408098 | {u'type': u'Polygon', u'coordinates': (((-79.4... |


We see that there are 140 observations of integer, object, and float types contained within 16 features.

In [46]:
# let's look at the table information
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 16 columns):
_id                140 non-null int64
AREA_ID            140 non-null int64
AREA_ATTR_ID       140 non-null int64
PARENT_AREA_ID     140 non-null int64
AREA_SHORT_CODE    140 non-null int64
AREA_LONG_CODE     140 non-null int64
AREA_NAME          140 non-null object
AREA_DESC          140 non-null object
X                  0 non-null float64
Y                  0 non-null float64
LONGITUDE          140 non-null float64
LATITUDE           140 non-null float64
OBJECTID           140 non-null int64
Shape__Area        140 non-null float64
Shape__Length      140 non-null float64
geometry           140 non-null object
dtypes: float64(6), int64(7), object(3)
memory usage: 17.6+ KB


In [47]:
# this is the PostalCodes lat/lon information
df2 = pd.read_csv(body2)
df2.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


We see that there are 103 observations of object and float types contained within 3 features.

In [48]:
# let's look at the table information
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
Postal Code    103 non-null object
Latitude       103 non-null float64
Longitude      103 non-null float64
dtypes: float64(2), object(1)
memory usage: 2.5+ KB


In [49]:
# this is the initial neighborhood grouping file that includes the 'PostalCode', aka, 'the truth file' y_true labels
df3 = pd.read_excel(body3)
df3.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M6C | York | Humewood-Cedarvale |
| 1 | M6E | York | Caledonia-Fairbank |
| 2 | M6M | York | Del Ray |
| 3 | M6M | York | Keelesdale |
| 4 | M6M | York | Mount Dennis |


We see that there are 288 observations of type object contained within 3 features. We note that each neighborhood has a 'Postcode' assigned; this is the group, or y_true, cluster indicator.
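
Because the 'Postcode' column acts as the ground-truth group label, any predicted grouping can be scored against it. One simple way to do so is the Rand index, which counts the fraction of item pairs on which two groupings agree. The sketch below is an illustration of that idea, not necessarily the metric used later in the project; the predicted labels in `y_pred` are made up for the example.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two groupings agree (1.0 = identical grouping)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # A pair agrees when both groupings keep it together or both split it.
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)


# Postcodes of the first five rows of the truth table shown above.
y_true = ["M6C", "M6E", "M6M", "M6M", "M6M"]
# Hypothetical cluster ids from some algorithm (illustrative only).
y_pred = [0, 1, 2, 2, 1]
print(rand_index(y_true, y_pred))
```

The score depends only on which items share a group, so the cluster ids do not need to match the postal codes by name. Library implementations such as scikit-learn's `adjusted_rand_score` additionally correct for chance agreement.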

In [50]:
# let's look at the table information
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 3 columns):
Postcode         288 non-null object
Borough          288 non-null object
Neighbourhood    288 non-null object
dtypes: object(3)
memory usage: 6.8+ KB


#### Data Understanding
We noted that the three data tables contain different data types and different numbers of observations. Many of the observations exist because of postal codes (FSAs) that were not yet assigned; others exist because a table holds one row per neighborhood name. We want to work with consistent data, so we will need to rename features, select the features to use, drop the features we do not need, add the features we want, reduce any duplicated observations, and modify some of the object values. We can do all of this within the Data Preparation section.
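
To make the "one row per neighborhood name" structure concrete, the sketch below collapses such rows into one entry per postal code while skipping unassigned and repeated names. It uses plain Python on the five rows from the `df3.head()` output above; the helper name `group_by_postal_code` is hypothetical, and a pandas `groupby` on the postal code column would be a natural equivalent on the full table.

```python
from collections import defaultdict

# (postal code, borough, neighbourhood) rows copied from the df3.head() output above.
rows = [
    ("M6C", "York", "Humewood-Cedarvale"),
    ("M6E", "York", "Caledonia-Fairbank"),
    ("M6M", "York", "Del Ray"),
    ("M6M", "York", "Keelesdale"),
    ("M6M", "York", "Mount Dennis"),
]


def group_by_postal_code(rows):
    """Map each postal code to its list of distinct, assigned neighbourhood names."""
    grouped = defaultdict(list)
    for code, _borough, name in rows:
        if name == "Not assigned":
            continue  # unassigned rows carry no usable label
        if name not in grouped[code]:
            grouped[code].append(name)  # keep first occurrence only
    return dict(grouped)


groups = group_by_postal_code(rows)
for code, names in groups.items():
    print(code, names)
```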

#### Data Preparation  
We need to shape the data sources into a form that is ready for initial visualization and subsequent analysis.

In [51]:
import json
import time
import requests
from re import sub

# adjust table display as desired
# pd.set_option('display.width', None)
# pd.set_option('display.max_rows', 20)
# pd.set_option('display.expand_frame_repr', True)
# pd.set_option('display.max_columns', None)
# pd.set_option('max_colwidth', 60)
# pd.set_option('precision', 4)

# drop some columns that we don't need
df1.drop(['_id','AREA_ID', 'AREA_ATTR_ID', 'AREA_SHORT_CODE', 'AREA_LONG_CODE',  'PARENT_AREA_ID', 'AREA_DESC', 'X', 'Y', 'OBJECTID', 'Shape__Area', 'Shape__Length', 'geometry'], axis=1, inplace=True)

# add some columns
df1['Foursquare_lat'] = .0
df1['Foursquare_lon'] = .0

# rename some columns
df2.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df3.rename(columns={'Postcode':'y_true', 'Neighbourhood':'Neighborhood'}, inplace=True)

# drop the rows with "Not assigned" as a neighborhood value
# df1 = df1[df1['Neighborhood'] != 'Not assigned']
df3 = df3.query("Neighborhood != 'Not assigned'")

# remove any '(someInteger)' text from the 'AREA_NAME' feature
for ex in range(len(df1)):
    df1.loc[ex, 'AREA_NAME'] = sub(' [(][0-9]*[)]', '', df1.loc[ex, 'AREA_NAME']).rstrip()
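
The effect of the regular expression in the loop above can be checked in isolation on a few of the raw `AREA_NAME` values shown earlier. This standalone sketch uses only the standard `re` module and the same pattern; the helper name `strip_area_id` is introduced here for illustration.

```python
import re

# Same pattern as in the loop above: a space followed by a
# parenthesised run of digits, e.g. " (94)".
AREA_ID_SUFFIX = re.compile(r" [(][0-9]*[)]")


def strip_area_id(area_name):
    """Remove the ' (someInteger)' text and any trailing whitespace."""
    return AREA_ID_SUFFIX.sub("", area_name).rstrip()


samples = ["Wychwood (94)", "Yonge-St.Clair (97)", "York University Heights (27)"]
cleaned = [strip_area_id(name) for name in samples]
print(cleaned)
```

On a full DataFrame column, pandas' vectorized `Series.str.replace` with `regex=True` should give the same result without an explicit loop.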

In [52]:
# check the output
# this is the specific neighborhood lat/lon information from the City of Toronto open data portal (data source 1)
# df1.style.set_properties(**{'text-align': 'left'})
print(df1, '\t', df1.shape, '\n')

                               AREA_NAME  LONGITUDE   LATITUDE  \
0                               Wychwood -79.425515  43.676919   
1                         Yonge-Eglinton -79.403590  43.704689   
2                         Yonge-St.Clair -79.397871  43.687859   
3                York University Heights -79.488883  43.765736   
4                     Yorkdale-Glen Park -79.457108  43.714672   
5                     Lambton Baby Point -79.496045  43.657420   
6                       Lansing-Westgate -79.424748  43.754271   
7                    Lawrence Park North -79.403978  43.730060   
8                    Lawrence Park South -79.406039  43.717212   
9                     Leaside-Bennington -79.366072  43.703797   
10                       Little Portugal -79.430323  43.647536   
11                           Long Branch -79.533345  43.592362   
12                               Malvern -79.222517  43.803658   
13                            Maple Leaf -79.480758  43.715574   
14        

In [53]:
# check the output
# this is the PostalCodes lat/lon information
print(df2, '\t', df2.shape, '\n')

    PostalCode   Latitude  Longitude
0          M1B  43.806686 -79.194353
1          M1C  43.784535 -79.160497
2          M1E  43.763573 -79.188711
3          M1G  43.770992 -79.216917
4          M1H  43.773136 -79.239476
5          M1J  43.744734 -79.239476
6          M1K  43.727929 -79.262029
7          M1L  43.711112 -79.284577
8          M1M  43.716316 -79.239476
9          M1N  43.692657 -79.264848
10         M1P  43.757410 -79.273304
11         M1R  43.750072 -79.295849
12         M1S  43.794200 -79.262029
13         M1T  43.781638 -79.304302
14         M1V  43.815252 -79.284577
15         M1W  43.799525 -79.318389
16         M1X  43.836125 -79.205636
17         M2H  43.803762 -79.363452
18         M2J  43.778517 -79.346556
19         M2K  43.786947 -79.385975
20         M2L  43.757490 -79.374714
21         M2M  43.789053 -79.408493
22         M2N  43.770120 -79.408493
23         M2P  43.752758 -79.400049
24         M2R  43.782736 -79.442259
25         M3A  43.753259 -79.329656
2

In [54]:
# check the output
# this is the initial neighborhood grouping file that includes the 'PostalCode', aka, 'the truth file' y_true labels
print(df3, '\t', df3.shape, '\n')

    y_true           Borough                     Neighborhood
0      M6C              York               Humewood-Cedarvale
1      M6E              York               Caledonia-Fairbank
2      M6M              York                          Del Ray
3      M6M              York                       Keelesdale
4      M6M              York                     Mount Dennis
5      M6M              York                      Silverthorn
6      M6N              York               The Junction North
7      M6N              York                        Runnymede
8      M9N              York                           Weston
9      M6H      West Toronto               Dovercourt Village
10     M6H      West Toronto                         Dufferin
11     M6J      West Toronto                  Little Portugal
12     M6J      West Toronto                          Trinity
13     M6K      West Toronto                         Brockton
14     M6K      West Toronto                 Exhibition Place
15     M

In [55]:
# install the folium package if not already integrated
# !conda install -c conda-forge folium

In [56]:
import folium
# let's see how the neighborhood and 'PostalCode' data plots using a map
# first, create a map base using the latitude and longitude values of M4P Davisville North (arbitrary desired center of map plot, just to look nice and centered on screen)
davis_lat = 43.712751
davis_lon = -79.390197
map_neighborhood = folium.Map(location=[davis_lat, davis_lon], zoom_start=11)

# add markers to map so we can see where the initial neighborhoods are located
for lat, lon, area in zip(df1['LATITUDE'], df1['LONGITUDE'], df1['AREA_NAME']):
    label = '{}'.format(area)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        tooltip='',      # we might use this parameter later for information that's nice to display on mouse hover
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.1,
        parse_html=False).add_to(map_neighborhood)

# add markers to the map where the 'PostalCode's are located
for lat, lon, pcode in zip(df2['Latitude'], df2['Longitude'], df2['PostalCode']):
    label = '{}'.format(pcode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=2,
        popup=label,
        tooltip=pcode,
        color='red',
        fill=True,
        fill_color='',
        fill_opacity=0.1,
        parse_html=False).add_to(map_neighborhood)

# display map
map_neighborhood

#### In the map above we can see the neighborhood centers represented by the blue rings and the postal-code centroids (truth-labels) represented by the smaller red circles. The data is ready for filtering and location refinement by the Foursquare API during subsequent analysis in the second week of the project. In the second week portion, we will discuss and apply one of two ML algorithms that might provide a predictive response to match the truth-labels we have available.
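
As a point of comparison for any ML result, a simple geometric baseline is to assign each neighborhood center to its nearest postal-code centroid. The sketch below does this with the haversine great-circle distance. It is an assumed baseline for illustration, not one of the algorithms chosen for the project, and it uses a mean Earth radius of 6371 km; the coordinates are copied from the `df1` and `df2` outputs above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_centroid(lat, lon, centroids):
    """Return the code whose centroid is closest to (lat, lon)."""
    return min(centroids, key=lambda code: haversine_km(lat, lon, *centroids[code]))


# Postal-code centroids copied from the df2 output above (latitude, longitude).
centroids = {
    "M1B": (43.806686, -79.194353),
    "M1C": (43.784535, -79.160497),
    "M1E": (43.763573, -79.188711),
}
# Malvern's neighborhood center from the df1 output above (latitude, longitude).
malvern = (43.803658, -79.222517)
print(nearest_centroid(*malvern, centroids))
```

Only three centroids are included here to keep the example short; with the full table the same function would search all 103 postal codes.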

#### References
(1) "Postal code service for Canada to be inaugurated on April first". The Stanstead Journal. 18 March 1971. p. 5.  
(2) "New postal code for all of Canada to speed delivery and avoid errors". L'Avenir. 30 January 1973. p. 19.  
(3) Demarino, Guy (7 January 1975). "Will 'gentle persuasion' aid postal code?". Montreal Gazette. p. 9.  
(4) "Canada Post decides when to urbanize a certain community when its population reaches a certain level, though different factors may also be involved.", *Urbanization*, https://en.wikipedia.org/wiki/Postal_codes_in_Canada  
(5) We note that a six-decimal location value will be precise to 111.32 mm (N/S or E/W) at the equator. https://en.wikipedia.org/wiki/Decimal_degrees