# Task - 1

### AIM:
Use a Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.
1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
2. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.
4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
6. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
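
Requirements 2 to 4 can be tried on a small hand-written frame before touching the real page. This is a minimal sketch: the rows are illustrative rather than scraped, and the variable names (`raw`, `assigned`, `combined`) are assumptions made for this example.

```python
import pandas as pd

# Hand-written sample rows (illustrative only, not scraped data)
raw = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Requirement 2: keep only rows with an assigned borough
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# Requirement 4: an unassigned neighborhood takes the borough's name
missing = assigned['Neighborhood'] == 'Not assigned'
assigned.loc[missing, 'Neighborhood'] = assigned.loc[missing, 'Borough']

# Requirement 3: merge rows that share a postal code into one comma-separated row
combined = (assigned
            .groupby(['PostalCode', 'Borough'])['Neighborhood']
            .agg(', '.join)
            .reset_index())

print(combined)
```

On the current version of the page the neighborhoods of one postal code already share a single cell, so the grouping step may turn out to be a no-op there.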

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re
import geocoder
import folium

### Getting the Postal Codes data from Wikipedia

**We can do this by using BeautifulSoup to scrape the required data from Wikipedia.**  
**First we must inspect the page's source code to find where and how the data is stored. The data sits in a ```<table>``` tag, where each row is a ```<tr>``` tag containing 3 ```<td>``` tags.**  
**The table also carries the class ```class="wikitable"```.**  
**Now that we know where to look for the data, let's start.**

In [2]:
# Retrieving the page HTML
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

Find the table tag with class wikitable.

In [3]:
table = soup.find('table', class_ = 'wikitable')

Find all the ```<td>``` tags and store their values in a list.  
Then convert that list to a NumPy array and replace the empty values ('') with 'NA' so they are easier to spot.  
Reshape the array to 180x3, that is 180 rows and 3 columns, which is the actual shape of the data.  
Once the array is reshaped, it is converted to a DataFrame.

In [4]:
tableData = table.find_all('td')
temp = []
for value in tableData:
    string = str(value.string).strip('\n')
    temp.append(string)
temp

['M1A',
 'Not assigned',
 '',
 'M2A',
 'Not assigned',
 '',
 'M3A',
 'North York',
 'Parkwoods',
 'M4A',
 'North York',
 'Victoria Village',
 'M5A',
 'Downtown Toronto',
 'Regent Park / Harbourfront',
 'M6A',
 'North York',
 'Lawrence Manor / Lawrence Heights',
 'M7A',
 'Downtown Toronto',
 "Queen's Park / Ontario Provincial Government",
 'M8A',
 'Not assigned',
 '',
 'M9A',
 'Etobicoke',
 'Islington Avenue',
 'M1B',
 'Scarborough',
 'Malvern / Rouge',
 'M2B',
 'Not assigned',
 '',
 'M3B',
 'North York',
 'Don Mills',
 'M4B',
 'East York',
 'Parkview Hill / Woodbine Gardens',
 'M5B',
 'Downtown Toronto',
 'Garden District, Ryerson',
 'M6B',
 'North York',
 'Glencairn',
 'M7B',
 'Not assigned',
 '',
 'M8B',
 'Not assigned',
 '',
 'M9B',
 'Etobicoke',
 'West Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale',
 'M1C',
 'Scarborough',
 'Rouge Hill / Port Union / Highland Creek',
 'M2C',
 'Not assigned',
 '',
 'M3C',
 'North York',
 'Don Mills',
 'M4C',
 'East York',
 'W

In [5]:
temp = np.array(temp)
temp = np.where(temp=='','NA',temp)
data = np.reshape(temp,(180,3))
print("Shape of the Data:",data.shape)
dataFrame = pd.DataFrame(data=data, columns=['PostalCode','Borough','Neighborhood'])
dataFrame.head()

Shape of the Data: (180, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
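
The hard-coded 180 in the reshape depends on how many rows the page had when it was scraped. A minimal sketch, assuming every row holds exactly three `<td>` cells, lets NumPy infer the row count with `-1`. The HTML string is a hand-written stand-in for the real page.

```python
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (illustrative only)
html = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='wikitable')

# get_text(strip=True) returns '' for an empty cell instead of None
cells = [td.get_text(strip=True) for td in table.find_all('td')]

# -1 asks NumPy to infer the number of rows from the number of cells
data = np.array(cells).reshape(-1, 3)
data = np.where(data == '', 'NA', data)

frame = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(frame)
```

If the cell count is not a multiple of three, `reshape` raises a `ValueError`, which is a useful early warning that the page layout has changed.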


In [6]:
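# A cell with several child nodes has .string == None, stored above as 'None'; restore its text
dataFrame['Neighborhood'] = dataFrame['Neighborhood'].replace(to_replace='None', value='CN Tower / King and Spadina / Railway Lands \
/ Harbourfront West / Bathurst / Quay / South Niagara / Island airport')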

**As per the guidelines, we will drop the rows with ```Borough='Not assigned'```.**

In [7]:
dataFrame.drop(dataFrame.Borough.loc[dataFrame.Borough=='Not assigned'].index,inplace = True,
               axis=0)
dataFrame.reset_index(drop=True, inplace=True)

**Notice how all the rows with Neighborhood='NA' were also removed. The reason is that only the rows with Borough='Not assigned' had Neighborhood='NA'.**

The neighborhoods are also separated by '/' instead of ',', so let's fix that.

In [8]:
# Replace ' /' with ',' and strip stray double quotes in one vectorized pass
dataFrame['Neighborhood'] = (dataFrame['Neighborhood']
                             .str.replace(' /', ',', regex=False)
                             .str.replace('"', '', regex=False))
dataFrame.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [9]:
dataFrame.shape

(103, 3)

**Our data is now ready to be exported to a CSV**

In [10]:
dataFrame.to_csv('postalCodes_scraped.csv')

# Task - 2

**Now that we have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.**

In [11]:
df = pd.read_csv('postalCodes_scraped.csv')
df.drop('Unnamed: 0', inplace=True, axis=1)

In [12]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [13]:
df = df.sort_values(by=['PostalCode','Borough'])
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [14]:
postalCodes = df['PostalCode'].to_list()

**Function to get the latitude and longitude of all the postal codes**

In [15]:
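def getLatLong(location, max_retries=10):
    # The ArcGIS geocoder sometimes returns no result, so retry a bounded number of times
    for _ in range(max_retries):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(location))
        if g.latlng is not None:
            return g.latlng
    raise RuntimeError('Could not geocode {!r}'.format(location))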

In [16]:
print(getLatLong('M1B'))

[43.80862623100006, -79.18991284599997]
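
The retry logic around the geocoder can be tried without network access by passing the lookup in as a callable and capping the number of attempts. In this sketch `retry_lookup` and `flaky_lookup` are hypothetical names used only for illustration, and the stand-in lookup returns made-up coordinates rather than calling a real geocoding service.

```python
import time

def retry_lookup(lookup, location, max_retries=5, delay=0.0):
    """Call lookup(location) until it returns a result or the retries run out."""
    for attempt in range(max_retries):
        result = lookup(location)
        if result is not None:
            return result
        time.sleep(delay)  # pause before the next attempt
    raise RuntimeError('No result for {!r} after {} attempts'.format(location, max_retries))

# Stand-in lookup that fails twice before answering (illustrative only)
calls = {'count': 0}

def flaky_lookup(location):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.0, -79.0]

coords = retry_lookup(flaky_lookup, 'M1B')
print(coords, calls['count'])
```

A small non-zero `delay` is usually kinder to a public service than retrying immediately.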


In [17]:
coordinates = [getLatLong(code) for code in postalCodes]

In [18]:
coordinates = pd.DataFrame(coordinates, columns=['latitude','longitude'])
df['Latitude'] = coordinates['latitude']
df['Longitude'] = coordinates['longitude']

In [19]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.808626,-79.189913
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.785779,-79.157368
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765806,-79.185284
3,M1G,Scarborough,Woburn,43.771545,-79.218135
4,M1H,Scarborough,Cedarbrae,43.768791,-79.238813


In [21]:
df.to_csv('geocodedPostalCodes.csv', index = False)

# Task - 3
### Clustering

We will work only with the boroughs whose names contain 'Toronto'.

In [23]:
torontoData = df[df.Borough.str.contains('Toronto')].reset_index(drop=True)
torontoData.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.678148,-79.295349
1,M4K,East Toronto,"The Danforth West, Riverdale",43.683424,-79.354564
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668291,-79.315578
3,M4M,East Toronto,Studio District,43.648,-79.33926
4,M4N,Central Toronto,Lawrence Park,43.729455,-79.386415


In [24]:
torontoData

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.678148,-79.295349
1,M4K,East Toronto,"The Danforth West, Riverdale",43.683424,-79.354564
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668291,-79.315578
3,M4M,East Toronto,Studio District,43.648,-79.33926
4,M4N,Central Toronto,Lawrence Park,43.729455,-79.386415
5,M4P,Central Toronto,Davisville North,43.713171,-79.38887
6,M4R,Central Toronto,North Toronto West,43.714139,-79.406456
7,M4S,Central Toronto,Davisville,43.703327,-79.385649
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.690328,-79.383522
9,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686378,-79.402372


In [25]:
torontoLatLong = getLatLong('')

In [26]:
torontoLatLong

[43.648690000000045, -79.38543999999996]

Let's create a Map of Toronto

In [27]:
map_tor = folium.Map(location=[torontoLatLong[0],torontoLatLong[1]], zoom_start=11)

# Adding markers for the neighborhoods
for lat, lng, borough, neigh in zip(torontoData['Latitude'], torontoData['Longitude'],
                                    torontoData['Borough'], torontoData['Neighborhood']):
    label = '{}, {}'.format(neigh, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=3,
        color='green',
        fill=True,
        fill_color='green',
        fill_opacity=0.7).add_to(map_tor)
    folium.Marker([lat,lng],popup=label).add_to(map_tor)
map_tor

Define Foursquare Credentials

In [28]:
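import os
# Read the credentials from environment variables rather than hard-coding them.
# The variable names below are an assumption; use whatever your setup defines.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')
VERSION = '20180605'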

In [29]:
torontoData.loc[0,'Neighborhood']

'The Beaches'

In [30]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    torontoData.loc[0,'Latitude'],
    torontoData.loc[0,'Longitude'],
    radius,
    LIMIT)
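
The query string can also be assembled with the standard library's `urlencode`, which escapes each value and avoids the stray `?&` at the start of the query. This is only a sketch: the credentials are placeholders, and the coordinates are the ones shown for the first neighborhood.

```python
from urllib.parse import urlencode

# Placeholder credentials (hypothetical values, not real keys)
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180605',
    'll': '{},{}'.format(43.678148, -79.295349),
    'radius': 500,
    'limit': 100,
}

# urlencode escapes each value and joins the pairs with '&'
explore_url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)
print(explore_url)
```

The comma in `ll` is percent-encoded as `%2C`, which is the standard URL escape for a comma inside a query value.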

In [31]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e89fb6d02a172002823e907'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 6,
  'suggestedBounds': {'ne': {'lat': 43.68264828050006,
    'lng': -79.28913883913675},
   'sw': {'lat': 43.673648271500056, 'lng': -79.30155978086314}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd461bc77b29c74a07d9282',
       'name': 'Glen Manor Ravine',
       'location': {'address': 'Glen Manor',
        'crossStreet': 'Queen St.',
        'lat': 43.67682094413784,
        'lng': -79.29394208780985,
        'labeledLatLngs': [{'labe

In [32]:
venues = results['response']['groups'][0]['items']
venues

[{'reasons': {'count': 0,
   'items': [{'summary': 'This spot is popular',
     'type': 'general',
     'reasonName': 'globalInteractionReason'}]},
  'venue': {'id': '4bd461bc77b29c74a07d9282',
   'name': 'Glen Manor Ravine',
   'location': {'address': 'Glen Manor',
    'crossStreet': 'Queen St.',
    'lat': 43.67682094413784,
    'lng': -79.29394208780985,
    'labeledLatLngs': [{'label': 'display',
      'lat': 43.67682094413784,
      'lng': -79.29394208780985}],
    'distance': 186,
    'cc': 'CA',
    'city': 'Toronto',
    'state': 'ON',
    'country': 'Canada',
    'formattedAddress': ['Glen Manor (Queen St.)', 'Toronto ON', 'Canada']},
   'categories': [{'id': '4bf58dd8d48988d159941735',
     'name': 'Trail',
     'pluralName': 'Trails',
     'shortName': 'Trail',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/hikingtrail_',
      'suffix': '.png'},
     'primary': True}],
   'photos': {'count': 0, 'groups': []}},
  'referralId': 'e-0-4bd461bc77b

In [33]:
nearby_ = pd.json_normalize(venues)
filtered_columns = ['venue.name','venue.categories','venue.location.lat',
                    'venue.location.lng']
nearby_ = nearby_.loc[:, filtered_columns]
nearby_

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,Glen Manor Ravine,"[{'id': '4bf58dd8d48988d159941735', 'name': 'T...",43.676821,-79.293942
1,The Big Carrot Natural Food Market,"[{'id': '50aa9e744b90af0d42d5de0e', 'name': 'H...",43.678879,-79.297734
2,Grover Pub and Grub,"[{'id': '4bf58dd8d48988d11b941735', 'name': 'P...",43.679181,-79.297215
3,Dip 'n Sip,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",43.678897,-79.297745
4,Upper Beaches,"[{'id': '4f2a25ac4b909258e854f55f', 'name': 'N...",43.680563,-79.292869
5,Calvary Baptist Church,"[{'id': '4bf58dd8d48988d132941735', 'name': 'C...",43.681059,-79.299246


In [34]:
nearby_.loc[0, 'venue.categories']

[{'id': '4bf58dd8d48988d159941735',
  'name': 'Trail',
  'pluralName': 'Trails',
  'shortName': 'Trail',
  'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/hikingtrail_',
   'suffix': '.png'},
  'primary': True}]

In [35]:
def getCategories(row):
    try:
        cat_ = row['categories']
    except KeyError:
        cat_ = row['venue.categories']
    if len(cat_)==0:
        return None
    else:
        return cat_[0]['name']

In [36]:
# Get the category for each row.
nearby_['venue.categories'] = nearby_.apply(getCategories, axis=1)

In [37]:
# clean the columns
nearby_.columns = [col.split(".")[-1] for col in nearby_.columns]

In [38]:
nearby_

Unnamed: 0,name,categories,lat,lng
0,Glen Manor Ravine,Trail,43.676821,-79.293942
1,The Big Carrot Natural Food Market,Health Food Store,43.678879,-79.297734
2,Grover Pub and Grub,Pub,43.679181,-79.297215
3,Dip 'n Sip,Coffee Shop,43.678897,-79.297745
4,Upper Beaches,Neighborhood,43.680563,-79.292869
5,Calvary Baptist Church,Church,43.681059,-79.299246
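
The flattening steps above can be reproduced on a hand-written sample, which makes it easier to see what `json_normalize` does with nested keys. The two items below only mimic the shape of the Foursquare response; the second one is invented to show the empty-category case.

```python
import pandas as pd

# Hand-written items shaped like the 'items' list above (illustrative only)
sample_items = [
    {'venue': {'name': 'Glen Manor Ravine',
               'location': {'lat': 43.6768, 'lng': -79.2939},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Unlabelled Spot',
               'location': {'lat': 43.6790, 'lng': -79.2970},
               'categories': []}},
]

# Nested dicts become dotted column names; lists are kept as-is
flat = pd.json_normalize(sample_items)
wanted = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
flat = flat.loc[:, wanted].copy()

# Keep the first category name, or None when the list is empty
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)

# Drop the 'venue.' and 'venue.location.' prefixes
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```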


In [39]:
def getNearbyVenues(neigh, lat_ ,lng_ ,radius=500):
    venues_list=[]
    LIMIT = 100
    for neigh_,lat,lng in zip(neigh, lat_, lng_):
        print(neigh_)
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT
        )
        # Fetch the JSON response and keep the list of venues for this neighborhood
        results = requests.get(url).json()['response']['groups'][0]['items']
        # Keep only the relevant information about each venue
        venues_list.append([(
            neigh_,
            lat,
            lng,
            venue['venue']['name'], # Name of the venue from JSON object
            venue['venue']['location']['lat'], # Latitude of the venue
            venue['venue']['location']['lng'], # Longitude of the venue
            venue['venue']['categories'][0]['name'] # Category of the venue
        ) for venue in results])
    nearbyVenues = pd.DataFrame([item for venue in venues_list for item in venue])
    nearbyVenues.columns = ['Neighborhood',
                           'NeighborhoodLatitude',
                           'NeighborhoodLongitude',
                           'Name',
                           'VenueLatitude',
                           'VenueLongitude',
                           'Category']
    return nearbyVenues
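
Indexing `venue['venue']['categories'][0]` raises an `IndexError` when a venue has no category. A hedged sketch of a helper that tolerates that case follows; `venue_to_row` is a hypothetical name, and the sample item is hand-written rather than taken from an API response.

```python
def venue_to_row(neigh, lat, lng, item):
    """Flatten one 'items' entry into a tuple for the venues table."""
    venue = item['venue']
    categories = venue.get('categories') or []
    # Fall back to None when the venue carries no category
    category = categories[0]['name'] if categories else None
    return (neigh, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category)

# Hand-written item without a category (illustrative only)
item = {'venue': {'name': 'Unlabelled Spot',
                  'location': {'lat': 43.679, 'lng': -79.297},
                  'categories': []}}

row = venue_to_row('The Beaches', 43.678148, -79.295349, item)
print(row)
```

Rows with a `None` category can then be dropped or labelled before the one-hot encoding step, depending on how they should be treated.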

In [40]:
torontoVenues = getNearbyVenues(torontoData['Neighborhood'],
                     torontoData['Latitude'],
                     torontoData['Longitude'])
torontoVenues.head()

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst, Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High Park, The Junction South
Parkdale, Ro

Unnamed: 0,Neighborhood,NeighborhoodLatitude,NeighborhoodLongitude,Name,VenueLatitude,VenueLongitude,Category
0,The Beaches,43.678148,-79.295349,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.678148,-79.295349,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.678148,-79.295349,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.678148,-79.295349,Dip 'n Sip,43.678897,-79.297745,Coffee Shop
4,The Beaches,43.678148,-79.295349,Upper Beaches,43.680563,-79.292869,Neighborhood


In [41]:
torontoVenues.shape

(1672, 7)

Let's check how many venues were returned for each neighborhood

In [42]:
torontoVenues.groupby('Neighborhood').count()

Neighborhood,NeighborhoodLatitude,NeighborhoodLongitude,Name,VenueLatitude,VenueLongitude,Category
Berczy Park,64,64,64,64,64,64
"Brockton, Parkdale Village, Exhibition Place",44,44,44,44,44,44
Business reply mail Processing CentrE,100,100,100,100,100,100
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst, Quay, South Niagara, Island airport",67,67,67,67,67,67
Central Bay Street,83,83,83,83,83,83
Christie,13,13,13,13,13,13
Church and Wellesley,86,86,86,86,86,86
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,27,27,27,27,27,27
Davisville North,6,6,6,6,6,6


Let's find out how many unique categories can be curated from all the returned venues

In [43]:
print('{} unique categories found.'.format(len(torontoVenues['Category'].unique())))

230 unique categories found.


# Analyze Each Neighborhood

In [44]:
torontoOneHot = pd.get_dummies(torontoVenues[['Category']], prefix="")
torontoOneHot['Neighborhood'] = torontoVenues['Neighborhood']
fixed_columns = [torontoOneHot.columns[-1]] + list(torontoOneHot.columns[:-1])
torontoOneHot = torontoOneHot[fixed_columns]
torontoOneHot.head()

Unnamed: 0,Neighborhood,_Accessories Store,_Afghan Restaurant,_American Restaurant,_Art Gallery,_Arts & Crafts Store,_Asian Restaurant,_Athletics & Sports,_BBQ Joint,_Baby Store,...,_Tibetan Restaurant,_Toy / Game Store,_Trail,_Train Station,_Vegetarian / Vegan Restaurant,_Video Game Store,_Vietnamese Restaurant,_Wine Bar,_Women's Store,_Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
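
The leading underscore in the column names comes from `get_dummies` joining the empty `prefix` and the category name with its default separator `'_'`. A small sketch on hand-written data shows the difference when `prefix_sep` is also set to an empty string.

```python
import pandas as pd

# Hand-written categories (illustrative only)
venues = pd.DataFrame({'Category': ['Trail', 'Pub', 'Trail']})

# Empty prefix with the default separator '_' yields '_Pub', '_Trail'
with_underscore = pd.get_dummies(venues[['Category']], prefix='')

# Setting prefix_sep='' as well yields the bare category names
without_underscore = pd.get_dummies(venues[['Category']], prefix='', prefix_sep='')

print(list(with_underscore.columns))
print(list(without_underscore.columns))
```

Keeping the underscore has one side benefit here: Foursquare returns a venue category that is itself called `Neighborhood`, and the prefixed name `_Neighborhood` does not collide with the `Neighborhood` column added right after the encoding.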



In [46]:
torontoOneHot.shape

(1672, 231)

Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [47]:
torontoGrouped = torontoOneHot.groupby('Neighborhood').mean().reset_index()
torontoGrouped

Unnamed: 0,Neighborhood,_Accessories Store,_Afghan Restaurant,_American Restaurant,_Art Gallery,_Arts & Crafts Store,_Asian Restaurant,_Athletics & Sports,_BBQ Joint,_Baby Store,...,_Tibetan Restaurant,_Toy / Game Store,_Trail,_Train Station,_Vegetarian / Vegan Restaurant,_Video Game Store,_Vietnamese Restaurant,_Wine Bar,_Women's Store,_Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.015625,0.0,...,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.022727,0.0,0.0,0.022727,0.022727,0.0,0.0,0.0,0.0,...,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing CentrE,0.0,0.0,0.03,0.01,0.01,0.02,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.014925,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.012048,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.012048,0.0,0.0,0.0,0.012048,0.012048,0.012048,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.076923,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.011628,0.011628,0.0,0.011628,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011628
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.04,0.01,0.0,0.0,0.0,0.01,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0
8,Davisville,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
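
Taking the mean of 0/1 indicator columns gives the share of a neighborhood's venues that fall in each category, so every row sums to 1. A minimal sketch on hand-written data:

```python
import pandas as pd

# Hand-written venues (illustrative only)
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Category':     ['Pub', 'Pub', 'Trail', 'Park', 'Trail', 'Trail'],
})

# dtype=int keeps the indicators numeric across pandas versions
onehot = pd.get_dummies(venues['Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

For example, neighborhood `A` has two pubs out of four venues, so its `Pub` frequency is 0.5.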


In [48]:
torontoGrouped.shape

(39, 231)

In [49]:
torontoGrouped.iloc[0,1:].sort_values(ascending=False)

_Coffee Shop                  0.109375
_Cocktail Bar                 0.046875
_Restaurant                   0.046875
_Seafood Restaurant           0.046875
_Bakery                        0.03125
                                ...   
_Miscellaneous Shop                  0
_Middle Eastern Restaurant           0
_Mexican Restaurant                  0
_Metro Station                       0
_Accessories Store                   0
Name: 0, Length: 230, dtype: object

Let's print each neighborhood along with its top 5 most common venues

In [50]:
NUM_TOP = 5
for hood in torontoGrouped['Neighborhood']:
    print("----"+hood+"----")
    temp = torontoGrouped[torontoGrouped['Neighborhood']==hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq':2})
    print(temp.sort_values('freq',ascending=False).reset_index(drop=True).head(NUM_TOP))
    print('\n')

----Berczy Park----
                 venue  freq
0         _Coffee Shop  0.11
1          _Restaurant  0.05
2  _Seafood Restaurant  0.05
3        _Cocktail Bar  0.05
4              _Bakery  0.03


----Brockton, Parkdale Village, Exhibition Place----
                     venue  freq
0             _Coffee Shop  0.09
1                    _Café  0.07
2  _Thrift / Vintage Store  0.05
3              _Restaurant  0.05
4               _Gift Shop  0.05


----Business reply mail Processing CentrE----
                 venue  freq
0         _Coffee Shop  0.10
1               _Hotel  0.04
2                 _Bar  0.04
3          _Restaurant  0.04
4  _Seafood Restaurant  0.03


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst, Quay, South Niagara, Island airport----
                   venue  freq
0           _Coffee Shop  0.07
1                  _Café  0.06
2            _Restaurant  0.06
3     _French Restaurant  0.04
4  _Gym / Fitness Center  0.04


----Central Bay Street---

Let's put this data into a DataFrame

In [51]:
def getMostFrequent(row, num_top_venues):
    # Sort the category frequencies in descending order and keep the top labels
    sortedRow = row.sort_values(ascending=False)
    return sortedRow.head(num_top_venues).index

In [52]:
NUM_TOP = 10
columns=['Neighborhood']
indicators = ['st', 'nd', 'rd']
for ind in np.arange(NUM_TOP):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
torontoSorted = pd.DataFrame(columns = columns)
torontoSorted['Neighborhood'] = torontoGrouped['Neighborhood']
# torontoSorted
for index in np.arange(torontoGrouped.shape[0]):
    torontoSorted.iloc[index,1:] = getMostFrequent(torontoGrouped.iloc[index,1:], NUM_TOP)

torontoSorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,_Coffee Shop,_Cocktail Bar,_Restaurant,_Seafood Restaurant,_Bakery,_Lounge,_Cheese Shop,_Farmers Market,_Breakfast Spot,_Beer Bar
1,"Brockton, Parkdale Village, Exhibition Place",_Coffee Shop,_Café,_Gift Shop,_Restaurant,_Thrift / Vintage Store,_Accessories Store,_North Indian Restaurant,_Caribbean Restaurant,_Sandwich Place,_Chiropractor
2,Business reply mail Processing CentrE,_Coffee Shop,_Bar,_Restaurant,_Hotel,_American Restaurant,_Seafood Restaurant,_Pub,_Café,_Italian Restaurant,_Tea Room
3,"CN Tower, King and Spadina, Railway Lands, Har...",_Coffee Shop,_Café,_Restaurant,_Park,_Gym / Fitness Center,_French Restaurant,_Lounge,_Bar,_Italian Restaurant,_Speakeasy
4,Central Bay Street,_Coffee Shop,_Clothing Store,_Japanese Restaurant,_Thai Restaurant,_Spa,_Bookstore,_Bubble Tea Shop,_Sandwich Place,_Restaurant,_Sushi Restaurant


# Cluster Neighborhoods

**We will now use the KMeans Clustering technique to cluster the neighborhoods into 5 clusters.**

In [53]:
from sklearn.cluster import KMeans

In [54]:
clusters = 5

torontoGroupedClustered = torontoGrouped.drop('Neighborhood', axis=1)

kmeans = KMeans(n_clusters=clusters, random_state=0).fit(torontoGroupedClustered)
kmeans.labels_[0:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
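
The first ten labels are all 1, which hints that most neighborhoods fall into a single cluster. A quick way to check cluster sizes is `np.bincount` on the labels. The sketch below uses hand-written points rather than the venue frequencies, and `n_init=10` is set explicitly as an assumption to keep behaviour stable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of hand-written points (illustrative only)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Count how many points landed in each cluster
sizes = np.bincount(model.labels_, minlength=2)
print(model.labels_, sizes)
```

Very uneven sizes, such as several clusters with a single member, can be a sign that a different number of clusters deserves a try.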

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

In [55]:
torontoSorted.insert(0, 'Cluster Labels', kmeans.labels_)

torontoMerged = torontoData.copy()
torontoMerged = torontoMerged.join(torontoSorted.set_index('Neighborhood'), on='Neighborhood')
torontoMerged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.678148,-79.295349,1,_Coffee Shop,_Church,_Health Food Store,_Neighborhood,_Trail,_Pub,_Farmers Market,_Falafel Restaurant,_Farm,_Yoga Studio
1,M4K,East Toronto,"The Danforth West, Riverdale",43.683424,-79.354564,0,_Business Service,_Park,_Discount Store,_Grocery Store,_Bus Line,_Farm,_Fountain,_Food Truck,_Food Court,_Food & Drink Shop
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668291,-79.315578,1,_Sandwich Place,_Park,_Italian Restaurant,_Fast Food Restaurant,_Food & Drink Shop,_Burrito Place,_Steakhouse,_Liquor Store,_Sushi Restaurant,_Movie Theater
3,M4M,East Toronto,Studio District,43.648,-79.33926,1,_Government Building,_Baseball Field,_Night Market,_Business Service,_Diner,_Discount Store,_Fountain,_Food Truck,_Food Court,_Food & Drink Shop
4,M4N,Central Toronto,Lawrence Park,43.729455,-79.386415,4,_Bus Line,_Swim School,_Lawyer,_Yoga Studio,_Fountain,_Food Truck,_Food Court,_Food & Drink Shop,_Food,_Flower Shop


Let's visualize our clusters

In [56]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [57]:
map_clusters = folium.Map(location=[torontoLatLong[0],torontoLatLong[1]], zoom_start=11)

x = np.arange(clusters)
ys = [i + x + (i*x)**2 for i in range(clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(torontoMerged['Latitude'], torontoMerged['Longitude'], torontoMerged['Neighborhood'], torontoMerged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
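
The colour list only needs one hex string per cluster. A shorter sketch of the same idea follows. Note that indexing with `cluster-1`, as in the cell above, maps cluster 0 to the last colour through Python's negative indexing, whereas this sketch indexes with the cluster label directly.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

clusters = 5

# Sample the rainbow colormap at evenly spaced points, one per cluster
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, clusters))]

# Look up the colour for each cluster label directly
for cluster in range(clusters):
    print(cluster, palette[cluster])
```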

# Examine Clusters

## Cluster 1

In [58]:
torontoMerged.loc[torontoMerged['Cluster Labels'] == 0, torontoMerged.columns[[1] + list(range(5, torontoMerged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,East Toronto,0,_Business Service,_Park,_Discount Store,_Grocery Store,_Bus Line,_Farm,_Fountain,_Food Truck,_Food Court,_Food & Drink Shop
6,Central Toronto,0,_Playground,_Garden,_Park,_Gym Pool,_Elementary School,_Food Court,_Food & Drink Shop,_Food,_Flower Shop,_Fish Market
10,Downtown Toronto,0,_Playground,_Park,_Candy Store,_Grocery Store,_Yoga Studio,_Ethiopian Restaurant,_Food Court,_Food & Drink Shop,_Food,_Flower Shop
23,Central Toronto,0,_Home Service,_Park,_Ethiopian Restaurant,_Food Truck,_Food Court,_Food & Drink Shop,_Food,_Flower Shop,_Fish Market,_Fish & Chips Shop


## Cluster 2

In [59]:
torontoMerged.loc[torontoMerged['Cluster Labels'] == 1, torontoMerged.columns[[1] + list(range(5, torontoMerged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,1,_Coffee Shop,_Church,_Health Food Store,_Neighborhood,_Trail,_Pub,_Farmers Market,_Falafel Restaurant,_Farm,_Yoga Studio
2,East Toronto,1,_Sandwich Place,_Park,_Italian Restaurant,_Fast Food Restaurant,_Food & Drink Shop,_Burrito Place,_Steakhouse,_Liquor Store,_Sushi Restaurant,_Movie Theater
3,East Toronto,1,_Government Building,_Baseball Field,_Night Market,_Business Service,_Diner,_Discount Store,_Fountain,_Food Truck,_Food Court,_Food & Drink Shop
5,Central Toronto,1,_Gym,_Department Store,_Convenience Store,_Park,_Food & Drink Shop,_Breakfast Spot,_Yoga Studio,_Farmers Market,_Fast Food Restaurant,_Fish & Chips Shop
7,Central Toronto,1,_Dessert Shop,_Café,_Sandwich Place,_Pizza Place,_Coffee Shop,_Italian Restaurant,_Thai Restaurant,_Seafood Restaurant,_Fast Food Restaurant,_Farmers Market
8,Central Toronto,1,_Playground,_Convenience Store,_Summer Camp,_Gym,_Yoga Studio,_Ethiopian Restaurant,_Food Court,_Food & Drink Shop,_Food,_Flower Shop
9,Central Toronto,1,_Coffee Shop,_Light Rail Station,_Park,_Skating Rink,_Supermarket,_Liquor Store,_Yoga Studio,_Falafel Restaurant,_Food & Drink Shop,_Food
11,Downtown Toronto,1,_Coffee Shop,_Pub,_Italian Restaurant,_Chinese Restaurant,_Restaurant,_Bakery,_Café,_Pizza Place,_Park,_Playground
12,Downtown Toronto,1,_Coffee Shop,_Japanese Restaurant,_Restaurant,_Gay Bar,_Café,_Pub,_Burger Joint,_Hotel,_Dance Studio,_Sushi Restaurant
13,Downtown Toronto,1,_Pub,_Coffee Shop,_Music Venue,_Café,_Athletics & Sports,_Bank,_Chocolate Shop,_Seafood Restaurant,_French Restaurant,_Tech Startup


## Cluster 3

In [60]:
torontoMerged.loc[torontoMerged['Cluster Labels'] == 2, torontoMerged.columns[[1] + list(range(5, torontoMerged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Central Toronto,2,_Home Service,_Spa,_Ethiopian Restaurant,_Food Truck,_Food Court,_Food & Drink Shop,_Food,_Flower Shop,_Fish Market,_Fish & Chips Shop


## Cluster 4

In [61]:
torontoMerged.loc[torontoMerged['Cluster Labels'] == 3, torontoMerged.columns[[1] + list(range(5, torontoMerged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Downtown Toronto,3,_Harbor / Marina,_Theme Park,_Park,_Farm,_Yoga Studio,_Ethiopian Restaurant,_Food Truck,_Food Court,_Food & Drink Shop,_Food


## Cluster 5

In [62]:
torontoMerged.loc[torontoMerged['Cluster Labels'] == 4, torontoMerged.columns[[1] + list(range(5, torontoMerged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,4,_Bus Line,_Swim School,_Lawyer,_Yoga Studio,_Fountain,_Food Truck,_Food Court,_Food & Drink Shop,_Food,_Flower Shop
