# Applied Data Science Capstone Project

#### Gregory Smith

This notebook contains the capstone project for the Applied Data Science specialization on Coursera.

In [319]:
import pandas as pd
import numpy as np

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Segmenting and Clustering Neighborhoods in Toronto

### Importing and Cleaning Dataframe

Scraping Toronto postal codes from Wikipedia

In [320]:
import urllib.request

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
req = urllib.request.urlopen(url)
article = req.read().decode()

with open('List_of_postal_codes_of_Canada:_M.html', 'w') as fo:
    fo.write(article)

df_toronto_neigh = pd.read_html('List_of_postal_codes_of_Canada:_M.html')[0]

In [321]:
df_toronto_neigh.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Dropping postcodes that are unassigned to a borough

In [322]:
df_toronto_neigh = df_toronto_neigh[df_toronto_neigh['Borough']!='Not assigned']
df_toronto_neigh.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Creating a new dataframe in which all neighbourhoods that share a postal code and borough are grouped into a single row. This was done by creating a blank dataframe, building a temporary dataframe for each postal code and borough combination, concatenating the neighbourhoods within that temporary dataframe into one comma-separated string, and concatenating the resulting row onto the new dataframe.

In [323]:
# generating empty dataframe with same columns as 'df_toronto_neigh'
df_toronto_neigh_2 = pd.DataFrame(columns=['Postcode','Borough','Neighbourhood'])

# enumerating through distinct 'Borough' and 'Postcodes' strings in original dataframe
for i, borough in enumerate(df_toronto_neigh['Borough'].unique()):
    for j, post in enumerate(df_toronto_neigh['Postcode'].unique()):
        # generating a temporary df consisting of the entries of the original dataframe where the current enumerated
        # 'Borough' and 'Postcode' is present
        temp_df = df_toronto_neigh.loc[(df_toronto_neigh['Borough']==borough) & (df_toronto_neigh['Postcode']==post)]
        # while the df is larger than one row, we will append the neighbourhood element from the second row to the neighbourhood
        # element of the first row and then drop that second row from the temp_df
        while temp_df.shape[0]>1:
            temp_df.iloc[0,2]=temp_df.iloc[0,2]+', '+temp_df.iloc[1,2]
            temp_df.drop(temp_df.index[1], inplace=True)
        # append the df row generated during the while loop to the new dataframe
        df_toronto_neigh_2=pd.concat([df_toronto_neigh_2,temp_df])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [324]:
df_toronto_neigh_2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
5,M6A,North York,"Lawrence Heights, Lawrence Manor"
13,M3B,North York,Don Mills North
18,M6B,North York,Glencairn
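
The same grouping can probably be done without the nested loops, which would also avoid the copy-of-a-slice warning shown above. The following is only a sketch, assuming the input has the three columns shown earlier; `sample` is a small stand-in dataframe, not the scraped table. Note that the row order it produces follows first appearance in the input, which may differ from the borough-by-borough order produced by the loop.

```python
import pandas as pd

# Small stand-in for df_toronto_neigh after dropping 'Not assigned' boroughs
sample = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M6A', 'M6A'],
    'Borough': ['North York', 'North York', 'North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village',
                      'Lawrence Heights', 'Lawrence Manor'],
})

# One row per (Postcode, Borough) pair; neighbourhood names joined with ', '
grouped = (
    sample
    .groupby(['Postcode', 'Borough'], sort=False, as_index=False)
    .agg({'Neighbourhood': ', '.join})
)

print(grouped)
```

Because `groupby` builds a new dataframe rather than modifying a slice of the original, no `SettingWithCopyWarning` should be raised.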


Finding which boroughs do not have an assigned neighbourhood

In [325]:
df_toronto_neigh_2.loc[df_toronto_neigh_2['Neighbourhood']=='Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood
9,M9A,Queen's Park,Not assigned


For rows with no assigned neighbourhood, setting the neighbourhood to the borough name

In [326]:
for i in range(df_toronto_neigh_2.shape[0]):
    if (df_toronto_neigh_2.iloc[i,2]=='Not assigned'):
        df_toronto_neigh_2.iloc[i,2] = df_toronto_neigh_2.iloc[i,1]

In [327]:
df_toronto_neigh_2.loc[df_toronto_neigh_2['Neighbourhood']=='Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood


In [328]:
df_toronto_neigh_2.loc[df_toronto_neigh_2['Neighbourhood']==df_toronto_neigh_2['Borough']]

Unnamed: 0,Postcode,Borough,Neighbourhood
9,M9A,Queen's Park,Queen's Park
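
As an alternative to the row-by-row loop, the same replacement can be written with a boolean mask. This is a minimal sketch on a stand-in dataframe (`sample` is illustrative, not the full table), assuming the placeholder string is exactly `'Not assigned'`.

```python
import pandas as pd

# Stand-in rows: one normal entry and one with a missing neighbourhood
sample = pd.DataFrame({
    'Postcode': ['M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ["Queen's Park", 'Not assigned'],
})

# Select rows whose neighbourhood is the placeholder string
mask = sample['Neighbourhood'] == 'Not assigned'

# Copy the borough name into the neighbourhood column for those rows only
sample.loc[mask, 'Neighbourhood'] = sample.loc[mask, 'Borough']

print(sample)
```

Selecting by label with `.loc` avoids depending on the column positions (1 and 2) that the `iloc` version relies on.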


Cleaning the dataframe by renaming a column and resetting the indices

In [329]:
df_toronto_neigh_2.rename(columns={'Postcode': 'Postal Code'}, inplace=True)

In [330]:
df_toronto_neigh_2.reset_index(drop=True, inplace=True)

Details of cleaned dataframe 'df_toronto_neigh_2'

In [331]:
df_toronto_neigh_2.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M6A,North York,"Lawrence Heights, Lawrence Manor"
3,M3B,North York,Don Mills North
4,M6B,North York,Glencairn


In [310]:
df_toronto_neigh_2.describe()

Unnamed: 0,Postal Code,Borough,Neighbourhood
count,103,103,103
unique,103,11,102
top,M4V,North York,Queen's Park
freq,1,24,2


In [332]:
df_toronto_neigh_2.loc[df_toronto_neigh_2['Neighbourhood']=='Queen\'s Park']

Unnamed: 0,Postal Code,Borough,Neighbourhood
25,M7A,Downtown Toronto,Queen's Park
43,M9A,Queen's Park,Queen's Park


The only neighbourhood that appears more than once is Queen's Park. Upon investigating Queen's Park, I found its postal code listed as M7A. At first I thought that M9A might be reserved for government buildings within Queen's Park, but that does not appear to be the case, at least not based on my research. I will keep this possible issue in mind when working further on the project.
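
A quick way to list every repeated neighbourhood at once, rather than reading it off `describe()`, is `Series.duplicated(keep=False)`. The sketch below uses a three-row stand-in dataframe (`sample`) built from values shown in this notebook.

```python
import pandas as pd

# Stand-in for the cleaned dataframe, including the repeated Queen's Park rows
sample = pd.DataFrame({
    'Postal Code': ['M7A', 'M9A', 'M3A'],
    'Borough': ['Downtown Toronto', "Queen's Park", 'North York'],
    'Neighbourhood': ["Queen's Park", "Queen's Park", 'Parkwoods'],
})

# keep=False marks every member of a duplicated group, not just the later ones
dupes = sample[sample['Neighbourhood'].duplicated(keep=False)]

print(dupes)
```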

In [333]:
df_toronto_neigh_2.shape

(103, 3)

### Adding Geographic Coordinates to Data Frame

Creating a dataframe of Toronto postal codes and corresponding latitude and longitude

In [334]:
df_toronto_latlong = pd.read_csv('https://cocl.us/Geospatial_data')
df_toronto_latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Joining 'df_toronto_neigh_2' and 'df_toronto_latlong' on the 'Postal Code' column

In [335]:
df_toronto_neigh_2 = df_toronto_neigh_2.merge(df_toronto_latlong, on='Postal Code')

In [336]:
df_toronto_neigh_2.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
3,M3B,North York,Don Mills North,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073
5,M3C,North York,"Flemingdon Park, Don Mills South",43.7259,-79.340923
6,M2H,North York,Hillcrest Village,43.803762,-79.363452
7,M3H,North York,"Bathurst Manor, Downsview North, Wilson Heights",43.754328,-79.442259
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
9,M3J,North York,"Northwood Park, York University",43.76798,-79.487262
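
The default inner join silently drops any postal code that is missing from the coordinates table. One way to check for that, sketched here on stand-in data, is a left join with `indicator=True`. The postal code `'M0X'` and its borough are made up for illustration and do not come from either source table.

```python
import pandas as pd

# Stand-in neighbourhood table; 'M0X' is a hypothetical code with no coordinates
neigh = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M0X'],
    'Borough': ['North York', 'North York', 'Hypothetical Borough'],
})

# Stand-in coordinates table using values shown in this notebook
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# Left join keeps every neighbourhood row; validate raises if a code repeats
merged = neigh.merge(coords, on='Postal Code', how='left',
                     validate='one_to_one', indicator=True)

# Rows present only in the left table have no matching coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()

print(unmatched)
```

An empty `unmatched` list would suggest that the inner join used above loses no rows.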


### Cluster Analysis of Toronto Neighbourhoods

In [346]:
{
    "tags": [
        "hide_input",
    ]
}

import datetime

CLIENT_ID = 'ON2TUF0QITX32Z5D2VZ5RJSBDRV1NUZTX4FTCGO0CQTWFQZR' # your Foursquare ID
CLIENT_SECRET = 'PJAJJPIIJYTTAO4YQET0OY2JKPCLG5NQJ2AZHTBOQZULFKJK' # your Foursquare Secret
VERSION = datetime.datetime.now().strftime("%Y%m%d")
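
Hard-coding API credentials in a shared notebook exposes them to anyone who can read it. A common alternative is to read them from environment variables. The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, and the helper `load_foursquare_credentials` is hypothetical; none of these names are defined by Foursquare or used elsewhere in this notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from an environment-style mapping.

    The variable names are an assumed convention, not a Foursquare requirement.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError(f'Missing environment variable: {missing}') from None


# Demonstration with a plain dict standing in for os.environ
demo_id, demo_secret = load_foursquare_credentials({
    'FOURSQUARE_CLIENT_ID': 'id-placeholder',
    'FOURSQUARE_CLIENT_SECRET': 'secret-placeholder',
})
print(demo_id)
```

In the notebook itself, the call would be made without arguments so that the values come from the real environment.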

In [341]:
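import requests

# NOTE: LIMIT (the maximum number of venues per request) is not defined in the
# cells shown here and must be set before this function is called.
def getNearbyVenues(names, latitudes, longitudes, radius=500):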
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
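
The list comprehension in `getNearbyVenues` assumes that every response contains at least one group and that every venue has at least one category. The sketch below shows one defensive way to flatten a response into rows. The helper `venues_to_rows` is hypothetical, and `fake_payload` is an invented dictionary that only mimics the nesting accessed in the function above; it is not real Foursquare output.

```python
import pandas as pd


def venues_to_rows(name, lat, lng, payload):
    """Flatten one explore-style response into a list of row tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            # Fall back to None when a venue has no category
            categories[0]['name'] if categories else None,
        ))
    return rows


# Invented payload with the same nesting; venue names and coordinates are made up
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Park',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Example Shop',
               'location': {'lat': 43.76, 'lng': -79.32},
               'categories': []}},
]}]}}

rows = venues_to_rows('Parkwoods', 43.753259, -79.329656, fake_payload)
frame = pd.DataFrame(rows, columns=[
    'Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
    'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'])
print(frame)
```

An empty or malformed response then yields zero rows instead of raising an `IndexError` or `KeyError` partway through the loop over neighbourhoods.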

In [344]:
df_toronto_neigh_2.loc[0, 'Neighbourhood']

'Parkwoods'