# Coursera Capstone Week 3 Assignment

### Segmenting and Clustering Neighborhoods in Toronto

# Part 1

In [2]:
!pip install geocoder

#Import necessary packages
import pandas as pd
import geocoder


First, we need to scrape the table from Wikipedia. We will do this using pandas.

In [3]:
#use read_html to import tables on the page
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
print(dfs[0].head())

  Postal Code           Borough              Neighbourhood
0         M1A      Not assigned               Not assigned
1         M2A      Not assigned               Not assigned
2         M3A        North York                  Parkwoods
3         M4A        North York           Victoria Village
4         M5A  Downtown Toronto  Regent Park, Harbourfront


In [4]:
#check to see if there are more tables
print(dfs[1].head())

                                                  0   ...   17
0                                                NaN  ...  NaN
1  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...  ...  NaN
2                                                 NL  ...   YT
3                                                  A  ...    Y

[4 rows x 18 columns]


It looks like the first table is the one we want. In the next code cell, we'll select it (index 0).
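Selecting by position works here, but the index could shift if the page layout changes. A minimal sketch of choosing the table by its column names instead: `pick_table` is a hypothetical helper (not part of pandas), and the two small frames are illustrative stand-ins for the list returned by `pd.read_html`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        # column labels may be integers when a table has no header row
        if required.issubset(set(map(str, table.columns))):
            return table
    raise ValueError("no table with columns {}".format(sorted(required)))

# Stand-ins for the tables returned by pd.read_html (illustrative data only)
navigation = pd.DataFrame({0: ["NL", "A"], 1: ["NS", "B"]})
postal = pd.DataFrame({
    "Postal Code": ["M1A", "M3A"],
    "Borough": ["Not assigned", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

chosen = pick_table([navigation, postal], ["Postal Code", "Borough", "Neighbourhood"])
print(chosen.columns.tolist())
```

With the real data this would be called as `pick_table(dfs, [...])`, assuming the column names shown in the output above.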

### Requirements for data preparation:
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
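The data requirements above can be sketched on a small hand-made frame. The rows below are illustrative only and use a one-neighbourhood-per-row layout, which is an assumption about how an uncombined table would look. The values are not taken from the scraped data.

```python
import pandas as pd

# Illustrative rows, one neighbourhood per row (not the scraped data)
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Keep only rows with an assigned borough; copy so later assignments
#    do not operate on a view of the original frame
clean = raw[raw["Borough"] != "Not assigned"].copy()

# 2. An unassigned neighbourhood takes the borough's name
unassigned = clean["Neighbourhood"] == "Not assigned"
clean.loc[unassigned, "Neighbourhood"] = clean.loc[unassigned, "Borough"]

# 3. One row per postal code, neighbourhoods joined with commas
clean = (
    clean.groupby(["Postal Code", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(clean)
print(clean.shape)
```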

In [5]:
df = dfs[0]

In [6]:
# Filter out the cells that have 'Not assigned' in the Borough column
df = df[df.Borough != "Not assigned"]

# Check result
print(df.head(10))

   Postal Code           Borough                                Neighbourhood
2          M3A        North York                                    Parkwoods
3          M4A        North York                             Victoria Village
4          M5A  Downtown Toronto                    Regent Park, Harbourfront
5          M6A        North York             Lawrence Manor, Lawrence Heights
6          M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government
8          M9A         Etobicoke      Islington Avenue, Humber Valley Village
9          M1B       Scarborough                               Malvern, Rouge
11         M3B        North York                                    Don Mills
12         M4B         East York              Parkview Hill, Woodbine Gardens
13         M5B  Downtown Toronto                     Garden District, Ryerson


Conveniently, the rows are already combined where Neighbourhoods share a single Postal Code. We don't need to do any additional processing for this. 

In [7]:
# Check for rows where Neighbourhood has a value of 'Not assigned'
print(df[df.Neighbourhood == "Not assigned"])

Empty DataFrame
Columns: [Postal Code, Borough, Neighbourhood]
Index: []


There are no entries in the 'Neighbourhood' column where the value is 'Not assigned'. We don't need to do any processing here, either.

### Conclusion
We now have a cleaned dataframe.
- [x] Three columns: Postal Code, Borough, and Neighbourhood
- [x] Only process cells with an assigned borough
- [x] Where neighbourhoods share a postal code, include them in a single entry for the postal code in the 'Neighbourhood' column, separated by commas
- [x] If the 'Neighbourhood' value is 'Not assigned', change it to be the same as the value in the 'Borough' column

In [36]:
df.shape

(103, 3)

# Part 2

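Now we need to get the latitude and longitude values for each entry. The geocoder package was installed earlier, but the cell below instead queries the geocodeapi.io search endpoint directly with requests.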

In [57]:

import requests

longitudes = []
latitudes = []

for index, row in df.iterrows():
  try:
    #get Postal Code for each entry
    neigh = row['Postal Code']
    headers = {
    #This key will be deleted once the assignment is complete
    "apikey": "32fcbcb0-65d7-11eb-b90f-1ff6313abf17"}
    params = {"text": neigh + ", Toronto, Ontario, Canada"}
    response = requests.get('https://app.geocodeapi.io/api/v1/search', headers=headers, params=params)
    result = response.json()
    longitude = result['features'][0]['geometry']['coordinates'][0]
    latitude = result['features'][0]['geometry']['coordinates'][1]
    longitudes.append(longitude)
    latitudes.append(latitude)
  except Exception as err:
    # request failed, or the response had no usable match; record missing values
    print('something broke - using default values', neigh, err)
    longitudes.append(None)
    latitudes.append(None)
  
df['longitude'] = longitudes
df['latitude'] = latitudes

print(df.head())


  Postal Code           Borough  ...  longitude   latitude
0         M3A        North York  ... -79.328265  43.754227
1         M4A        North York  ... -79.313559  43.724686
2         M5A  Downtown Toronto  ... -79.363640  43.656078
3         M6A        North York  ... -79.452785  43.721307
4         M7A  Downtown Toronto  ...  26.171175  44.427782

[5 rows x 5 columns]


In [59]:
df.shape

(102, 5)

Alright! Now we have our data and we are ready to do some exploratory analysis.
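One caveat before moving on: in the output above, M7A comes back with a longitude of about 26.17 and a latitude of about 44.43, while the other rows all sit near longitude -79. A minimal sketch of a sanity check that flags such rows. The bounding-box limits are a rough assumption, not an official city boundary.

```python
import pandas as pd

# First five rows copied from the output above
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A", "M6A", "M7A"],
    "longitude": [-79.328265, -79.313559, -79.363640, -79.452785, 26.171175],
    "latitude": [43.754227, 43.724686, 43.656078, 43.721307, 44.427782],
})

# Rough bounding box around Toronto (assumed limits, for illustration only)
LAT_MIN, LAT_MAX = 43.4, 44.0
LNG_MIN, LNG_MAX = -79.8, -79.0

in_box = (
    coords["latitude"].between(LAT_MIN, LAT_MAX)
    & coords["longitude"].between(LNG_MIN, LNG_MAX)
)

# Rows with missing coordinates also fail the check, because
# comparisons against NaN evaluate to False
suspect = coords[~in_box]
print(suspect["Postal Code"].tolist())
```

Flagged rows could be geocoded again with a more specific query or dropped before mapping. Which option fits is a judgment call for the analysis.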

In [63]:
# !conda install -c conda-forge geopy --yes  # uncomment this line if geopy is not installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# First we'll find the latitude and longitude of Toronto
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [68]:
import folium
# Now we can map the neighborhoods using folium, with the neighborhood data points superimposed over the map of Toronto
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for pc, lat, lng, borough, neighbourhood in zip(df['Postal Code'], df['latitude'], df['longitude'], df['Borough'], df['Neighbourhood']):
    label = 'Postal Code: {}\nBorough: {}\nNeighbourhood(s): {}'.format(pc, borough, neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

We'll do the same thing as in the lab - looking at nearby venues.

In [70]:
# Setup and test Foursquare API request
LIMIT = 100
radius = 500
VERSION = '20180605' # Foursquare API version
# These will be reset
CLIENT_ID = 'C5KFFWHCPEDZHER2PK10Y3K2ZIZWN1G404VRJAVE2IHT0SQS'
CLIENT_SECRET = 'Y1P0UJZHJ21QI5FEL3F1FZJ3COKTC24LFQM2CHDLQCBLBIKT'
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '601a42fe7dbe280ee2c8c15d'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-5227bb01498e17bf485e6202-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/neighborhood_',
          'suffix': '.png'},
         'id': '4f2a25ac4b909258e854f55f',
         'name': 'Neighborhood',
         'pluralName': 'Neighborhoods',
         'primary': True,
         'shortName': 'Neighborhood'}],
       'id': '5227bb01498e17bf485e6202',
       'location': {'cc': 'CA',
        'city': 'Toronto',
        'country': 'Canada',
        'distance': 113,
        'formattedAddress': ['Toronto ON', 'Canada'],
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.65323167517444,
          'lng': -79.38529600606677}],
        'lat': 43.6532

Now that we've confirmed we can get venue data from the Foursquare API, let's get the data for each of our data points.
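Since the same request URL is built for every postal code, it may help to construct it in one place and keep the credentials out of the notebook. A minimal sketch under stated assumptions: `build_explore_url` is a hypothetical helper, and the environment variable names are made up for illustration. The endpoint and parameter names are the ones used in the request above.

```python
import os
from urllib.parse import urlencode, urlparse, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build the explore URL with properly escaped query parameters."""
    query = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    })
    return "{}?{}".format(EXPLORE_ENDPOINT, query)

# Hypothetical environment variable names; placeholders are used when unset
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET")

url = build_explore_url(43.6534817, -79.3839347, client_id, client_secret)

# Round-trip the query string to confirm the coordinates survived encoding
print(parse_qs(urlparse(url).query)["ll"])
```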

In [73]:
# First we'll create a function to be run for each postal code
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

# And now we'll run it for every postal code
toronto_venues = getNearbyVenues(names = df['Postal Code'],
                                latitudes = df['latitude'],
                                longitudes = df['longitude']
                                )

M3A
M4A
M5A
M6A
M7A
M9A
M1B
M3B
M4B
M5B
M6B
M9B
M1C
M3C
M4C
M5C
M6C
M9C
M1E
M4E
M5E
M6E
M1G
M4G
M5G
M6G
M1H
M2H
M3H
M4H
M5H
M6H
M1J
M2J
M3J
M4J
M5J
M6J
M1K
M2K
M3K
M4K
M6K
M1L
M2L
M3L
M4L
M5L
M6L
M9L
M1M
M2M
M3M
M4M
M5M
M6M
M9M
M1N
M2N
M3N
M4N
M5N
M6N
M9N
M1P
M2P
M4P
M5P
M6P
M9P
M1R
M2R
M4R
M5R
M6R
M7R
M9R
M1S
M4S
M5S
M6S
M1T
M4T
M5T
M1V
M4V
M5V
M8V
M9V
M1W
M4W
M5W
M8W
M9W
M1X
M4X
M5X
M8X
M4Y
M7Y
M8Y
M8Z
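The list comprehension in `getNearbyVenues` indexes `categories[0]` for every venue, so a venue with an empty category list would raise an `IndexError` and stop the whole run. Whether the API ever returns such a venue is an assumption here. The sketch below shows one defensive way to flatten the items. `venue_rows` is a hypothetical helper, and the sample items are hand-written to mimic the response shape shown earlier (they are not real API output).

```python
def venue_rows(name, lat, lng, items):
    """Flatten explore items into tuples, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # fall back to None instead of raising IndexError on an empty list
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# Hand-written items mimicking the response structure (illustrative only)
sample_items = [
    {"venue": {"name": "Example Park",
               "location": {"lat": 43.75, "lng": -79.33},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorised Spot",
               "location": {"lat": 43.76, "lng": -79.32},
               "categories": []}},
]

rows = venue_rows("M3A", 43.754227, -79.328265, sample_items)
print(rows)
```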


In [77]:
# And now we can check the size of our dataset
print(toronto_venues.shape)
toronto_venues.head()

(6764, 7)


|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | M3A | 43.754227 | -79.328265 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | M3A | 43.754227 | -79.328265 | Brookbanks Pool | 43.751389 | -79.332184 | Pool |
| 2 | M3A | 43.754227 | -79.328265 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
| 3 | M4A | 43.724686 | -79.313559 | Victoria Village Arena | 43.723481 | -79.315635 | Hockey Arena |
| 4 | M4A | 43.724686 | -79.313559 | Portugril | 43.725819 | -79.312785 | Portuguese Restaurant |


Wow, that's quite a few venues. Maybe we could have looked at a specific borough. Oh well, let's forge ahead with some further analysis. I'm curious what types of venues are available in Toronto, and how frequent each type is.

In [82]:
venue_type_count = toronto_venues.groupby(['Venue Category'])['Venue'].count()
# Okay now that we have the venues counted by category, let's check what is the most common
venue_type_count.sort_values(inplace=True, ascending=False)
print(venue_type_count.head())

Venue Category
Coffee Shop            721
Hotel                  498
Restaurant             389
Café                   388
Japanese Restaurant    251
Name: Venue, dtype: int64


Wow! That is a whole lot of coffee shops! Apparently Toronto is fueled by caffeine! Let's look at the top 10 to see if there are any other notable venue types. It seems like restaurants are broken down by type of cuisine, so the list may be deceiving...

In [83]:
print(venue_type_count.head(10))

Venue Category
Coffee Shop            721
Hotel                  498
Restaurant             389
Café                   388
Japanese Restaurant    251
Gym                    251
Seafood Restaurant     196
American Restaurant    193
Asian Restaurant       186
Steakhouse             185
Name: Venue, dtype: int64


Alright, now we're getting a clearer picture. Even if we do not count 'Coffee Shop' in the restaurant category, seven of the ten most common venue types in Toronto are restaurants. The top two 'restaurant' type venues (Restaurant and Café) account for 777 of the 6,764 venues returned.
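Because cuisine-specific categories split the restaurant count, one way to see the combined picture is to group categories by name. A minimal sketch using only the top-10 counts printed above. Treating 'Café' and 'Steakhouse' as places to eat is an illustrative choice, not something the data defines.

```python
import pandas as pd

# Top-10 counts copied from the output above
top10 = pd.Series({
    "Coffee Shop": 721, "Hotel": 498, "Restaurant": 389, "Café": 388,
    "Japanese Restaurant": 251, "Gym": 251, "Seafood Restaurant": 196,
    "American Restaurant": 193, "Asian Restaurant": 186, "Steakhouse": 185,
})

# Categories whose name contains "Restaurant"
is_restaurant = top10.index.str.contains("Restaurant")
restaurant_total = int(top10[is_restaurant].sum())

# Café and Steakhouse do not match the pattern, so they are listed explicitly
eateries = is_restaurant | top10.index.isin(["Café", "Steakhouse"])
eatery_total = int(top10[eateries].sum())

print(restaurant_total, eatery_total)
```

On the full dataset the same mask could be applied to `venue_type_count`, which would also pick up restaurant categories outside the top 10.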