# Task 1 - Data Scraping to get the neighbourhood data in Canada
## This task will consist of numerous steps involved in scraping data from a webpage and building a dataframe using the scraped data

First, we need to install the required libraries. We use the 'BeautifulSoup' package because it lets us parse HTML with very little code.  
We also use the 'requests' library to download the page.

In [21]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [22]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


With those packages installed, we also install 'lxml', the parser we pass to BeautifulSoup, and then import the libraries we need.

In [23]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import lxml

We use 'requests' to get the webpage as text and then use 'BeautifulSoup' to parse it into a variable 'soup'.

In [4]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

By looking closely at the page's HTML, we see that all the data relevant to us is inside the table element with class "wikitable sortable", so we only need to work on this element. We therefore define a variable 'table' holding that element and inspect it with the 'prettify' method.

In [5]:
table = soup.find('table',{'class':'wikitable sortable'})
print(table.prettify())


<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postcode
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Parkwoods" title="Parkwoods">
     Parkwoods
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Victoria_Village" title="Victoria Village">
     Victoria Village
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
     Downtown Toronto
    </a>
   </td>
   <td>
    <a href="

Now, to separate the data into columns, we loop over the table rows found with 'findAll' and collect the text of each cell. The first row holds the headers (Postcode, Borough and Neighbourhood); every later row is appended to 'data'. We then use pandas to build a dataframe from the collected rows.

In [6]:
data = []
columns = []

for index, tr in enumerate(table.findAll('tr')): #Using for loop to extract data and append it into 'content'
    content = []
    for td in tr.findAll(['th','td']):
        content.append(td.text.rstrip())
        
    if (index == 0):               #First row consisted of Header row elements, so we use the if-else function here
        columns = content
    else:
        data.append(content)       #From the second row onwards, we append the content in 'data'

df = pd.DataFrame(data = data, columns = columns)  #Creating a pandas Dataframe
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
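
To see the header-row/data-row split in isolation, here is a small dependency-free sketch. It swaps BeautifulSoup for the standard library's 'html.parser', and the HTML string is hand-written to mimic the shape of the Wikipedia table, so treat it as an illustration of the loop's logic rather than the code used above.

```python
from html.parser import HTMLParser

# Hypothetical, hand-written HTML that mimics the shape of the Wikipedia table.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell (links included).
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableRows()
parser.feed(SAMPLE_HTML)

# Same split as in the loop above: the first row is the header, the rest is data.
columns, data = parser.rows[0], parser.rows[1:]
print(columns)
print(data)
```

The link text inside a cell ends up in the cell's value, which is why 'North York' appears as plain text in the dataframe.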


After creating the pandas dataframe, we need to remove the rows where 'Borough' is 'Not assigned', since we're only interested in rows that have an actual 'Borough' value. We use the 'replace' command to replace 'Not assigned' with NaN (in every column) so that the rows with a missing 'Borough' can be dropped using the 'dropna' command. Then we reset the index, since dropping rows leaves gaps in it.

In [7]:
df.replace("Not assigned", np.nan, inplace = True)

df.dropna(subset = ["Borough"], axis = 0, inplace = True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
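
As a quick check of what this step does, the sketch below runs it on a three-row toy frame (the values are illustrative, not scraped) and compares it with a boolean mask. The mask is an alternative we did not use above; the difference is that it leaves 'Not assigned' untouched in the other columns.

```python
import numpy as np
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Route 1: replace, then drop rows whose Borough is missing.
via_nan = (
    toy.replace("Not assigned", np.nan)
       .dropna(subset=["Borough"])
       .reset_index(drop=True)
)

# Route 2: a boolean mask on Borough only (alternative, not used above).
via_mask = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

print(via_nan)
print(via_mask)
```

With route 1, a row that has a Borough but no Neighbourhood keeps a NaN in 'Neighbourhood', which matters for the later clean-up step.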


Now we combine the rows that share a postal code, listing their neighborhoods in one row separated by commas. To do this, we group the data by postal code and transform the 'Neighbourhood' column with a lambda that joins the names. This writes the full comma-separated list into every row of the group.  
The rows with the same postal code are therefore now exact duplicates, so we drop the duplicate rows to clean our data.

In [8]:
df["Neighbourhood"] = df.groupby("Postcode")["Neighbourhood"].transform(lambda x:','.join(x))
df.head()

df = df.drop_duplicates()
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
5,M7A,Downtown Toronto,Queen's Park
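
The same result can also be reached in one step with 'groupby' and 'agg'. The sketch below is an alternative on a toy frame, not the code used above; it assumes every postal code maps to a single borough, which holds for the rows shown here.

```python
import pandas as pd

# Toy rows taken from the shape of the table above.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M6A", "M6A"],
    "Borough": ["Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor"],
})

# One row per (Postcode, Borough); names are joined in order of appearance.
combined = (
    toy.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
       .agg(",".join)
       .reset_index()
)
print(combined)
```

This version also leaves a clean 0..n-1 index, whereas 'drop_duplicates' keeps the original row labels (note the jump from 3 to 5 in the output above).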


Finally, for the rows where the Neighbourhood value is 'Not assigned', we need to make the Neighbourhood the same as the Borough. First we check whether there are any rows in the 'Neighbourhood' column with a NaN value (since we initially replaced all the 'Not assigned' values with NaN). This returns 'False' for all 103 rows, which means there are no NaN values in the 'Neighbourhood' column.  
For the sake of practice, we still replace any NaN values with the value of Borough wherever applicable.

In [9]:
missing_data = df.isnull()
print(missing_data["Neighbourhood"].value_counts())
df["Neighbourhood"] = df["Neighbourhood"].fillna(df["Borough"])  # fill real NaN values, not the text "NaN"

False    103
Name: Neighbourhood, dtype: int64
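
A real missing value is a NaN, not the text "NaN", so 'replace' and 'fillna' behave differently here. The toy frame below (illustrative values) shows the difference; it is only a sketch of the idea.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", np.nan],
})

# replace("NaN", ...) looks for the literal text "NaN", so a real missing value stays missing.
untouched = toy["Neighbourhood"].replace("NaN", "placeholder")

# fillna aligns on the index and copies the Borough only where Neighbourhood is missing.
filled = toy["Neighbourhood"].fillna(toy["Borough"])

print(untouched.isna().sum())
print(filled.tolist())
```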


For the last part of our task, we just need to print the dimensions of our dataframe using the '.shape' attribute. It returns the number of rows (103) and the number of columns (3).

In [10]:
df.shape

(103, 3)

# Task 2 - Finding Latitude and Longitude Coordinates of Neighborhoods and adding them to the DataFrame

The first step is to import the csv file containing the geospatial data (latitude and longitude) of the locations in the dataframe above.

In [17]:
import pandas as pd
import numpy as np

link = 'https://cocl.us/Geospatial_data'
latlong = pd.read_csv(link)
latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We have to standardize this table and match its format with the dataframe containing the Borough and Neighbourhood information, so that both tables can be linked through a 'PostalCode' index. So we set the postal code column as the index of each table and give it the same name before combining them.

In [22]:
latlong = latlong.set_index("Postal Code")
latlong.rename_axis("PostalCode", axis='index', inplace=True)
latlong.head()

PostalCode,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [23]:
df = df.set_index("Postcode")
df.rename_axis("PostalCode", axis='index', inplace=True)
df.head()

PostalCode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M6A,North York,"Lawrence Heights,Lawrence Manor"
M7A,Downtown Toronto,Queen's Park


Now we can join both the tables to get our required table containing 5 columns.

In [25]:
neighborhoods = df.join(latlong)
neighborhoods.head(10)

PostalCode,Borough,Neighbourhood,Latitude,Longitude
M3A,North York,Parkwoods,43.753259,-79.329656
M4A,North York,Victoria Village,43.725882,-79.315572
M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
M3B,North York,Don Mills North,43.745906,-79.352188
M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
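
'DataFrame.join' matches rows on the index and, by default, keeps every row of the left table. The sketch below shows this on two tiny frames; 'M0Z' is a made-up postal code added only to show what happens when no coordinates are found.

```python
import pandas as pd

left = pd.DataFrame(
    {"Borough": ["North York", "Scarborough", "Nowhere"]},
    index=pd.Index(["M3A", "M1B", "M0Z"], name="PostalCode"),  # 'M0Z' is hypothetical
)
right = pd.DataFrame(
    {"Latitude": [43.806686, 43.753259], "Longitude": [-79.194353, -79.329656]},
    index=pd.Index(["M1B", "M3A"], name="PostalCode"),
)

# Left join on the index: keeps the left frame's rows and their order.
joined = left.join(right)
print(joined)
```

Rows without a match get NaN coordinates, so checking 'isnull()' on the joined frame is a cheap way to confirm that every postal code was found.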


# Task 3A - Exploring the neighborhoods in Toronto

Checking the number of unique Boroughs in our new dataframe

In [26]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


We first need to get the lat, long coordinates for Toronto using geopy's Nominatim geocoder.

In [27]:
from geopy.geocoders import Nominatim

address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Now we can use the lat, long coordinates to visualize all the neighborhoods in Toronto on a map.

In [31]:
import folium

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighbourhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=6,
        popup=label,
        color='blue',
        fill=True,
        fill_color='lightblue',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Let's choose the borough 'Downtown Toronto' and closely analyse its neighborhoods and their venues. So we define another dataframe 'dttoronto'.

In [39]:
dttoronto = neighborhoods[neighborhoods['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
dttoronto.head()

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
0,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,Downtown Toronto,Queen's Park,43.662301,-79.389494
2,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
3,Downtown Toronto,St. James Town,43.651494,-79.375418
4,Downtown Toronto,Berczy Park,43.644771,-79.373306


We get the lat, long coordinates for Downtown Toronto using the same geopy geocoder.

In [38]:
address = 'Downtown Toronto, Toronto'

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6541737, -79.38081164513409.


Let's visualize Downtown Toronto on a map to get a better idea of its location

In [41]:
# create map of Downtown Toronto using latitude and longitude values
map_dttoronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighbourhood in zip(dttoronto['Latitude'], dttoronto['Longitude'], dttoronto['Borough'], dttoronto['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=6,
        popup=label,
        color='blue',
        fill=True,
        fill_color='lightblue',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dttoronto)  
    
map_dttoronto

To explore the venues near Downtown Toronto, we will use the Foursquare API. To use Foursquare, we need to define the credentials, version, limit and radius (in meters).

In [42]:
import json

CLIENT_ID = 'TJFZA1ML3UOHLHST04BKKOW3CG52IJSKE4L0DFFLEKF00Y0M' # your Foursquare ID
CLIENT_SECRET = 'UEHSR52QJVP4CJQDYROCUVBVASMY5SSJXFEQDWWQZ2CVZVRW' # your Foursquare Secret
VERSION = '20180605'
LIMIT = 100
radius = 500

Then we define the URL used to request data from the Foursquare API, which responds in JSON format.

In [43]:
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=TJFZA1ML3UOHLHST04BKKOW3CG52IJSKE4L0DFFLEKF00Y0M&client_secret=UEHSR52QJVP4CJQDYROCUVBVASMY5SSJXFEQDWWQZ2CVZVRW&ll=43.6541737,-79.38081164513409&v=20180605&radius=500&limit=100'
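
An alternative to formatting the string by hand is to build the query from a dictionary. The sketch below uses the standard library's 'urlencode'; the credentials are placeholders, and the other values are the ones used above.

```python
from urllib.parse import urlencode

# Placeholder credentials, for illustration only.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "ll": "43.6541737,-79.38081164513409",
    "v": "20180605",
    "radius": 500,
    "limit": 100,
}

base = "https://api.foursquare.com/v2/venues/explore"
# safe="," keeps the comma in the ll value unescaped.
url = base + "?" + urlencode(params, safe=",")
print(url)
```

Keeping the parameters in a dictionary makes it easy to change one value per request, and reading the credentials from environment variables instead of typing them into a cell keeps them out of a shared notebook.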

In [44]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e565de5be61c9001b288aea'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Bay Street Corridor',
  'headerFullLocation': 'Bay Street Corridor, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 131,
  'suggestedBounds': {'ne': {'lat': 43.6586737045, 'lng': -79.37460365419369},
   'sw': {'lat': 43.6496736955, 'lng': -79.38701963607448}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '57eda381498ebe0e6ef40972',
       'name': 'UNIQLO ユニクロ',
       'location': {'address': '220 Yonge St',
        'crossStreet': 'at Dundas St W',
        'lat': 43.65591027779457,
        'lng': -79.38064099181345,
        'labeledLatLngs': [

Now we define get_category_type function to extract the category of the nearby venues

In [45]:
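def get_category_type(row):
    # the categories may sit under 'categories' or, after flattening, under 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    # a venue without categories has no name to return
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']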
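
To check how the function behaves, we can call it on a few hand-written rows. The rows below are plain dictionaries standing in for dataframe rows, and the last one is a made-up venue with no category; this is only a sketch of the three cases.

```python
def get_category_type(row):
    """Return the first category name of a venue row, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Hypothetical rows: plain dicts standing in for rows of the flattened response.
flat_row = {'venue.name': 'LUSH', 'venue.categories': [{'name': 'Cosmetics Shop'}]}
plain_row = {'name': 'Indigo', 'categories': [{'name': 'Bookstore'}]}
empty_row = {'venue.name': 'Unnamed venue', 'venue.categories': []}

for row in (flat_row, plain_row, empty_row):
    print(get_category_type(row))
```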

Now we normalize the data and filter it to get nearby venues and their respective categories

In [48]:
# pd.json_normalize (pandas 1.0 or later) transforms nested JSON into a flat pandas dataframe

venues = results['response']['groups'][0]['items']  
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(10)

Unnamed: 0,name,categories,lat,lng
0,UNIQLO ユニクロ,Clothing Store,43.65591,-79.380641
1,Elgin And Winter Garden Theatres,Theater,43.653394,-79.378507
2,LUSH,Cosmetics Shop,43.653557,-79.3804
3,Ed Mirvish Theatre,Theater,43.655102,-79.379768
4,Indigo,Bookstore,43.653515,-79.380696
5,CF Toronto Eaton Centre,Shopping Mall,43.653534,-79.380551
6,Yonge-Dundas Square,Plaza,43.656054,-79.380495
7,Eggspectation Bell Trinity Square,Breakfast Spot,43.653144,-79.38198
8,SEPHORA,Cosmetics Shop,43.653688,-79.38012
9,JOEY Eaton Centre,Restaurant,43.655404,-79.381929


In [49]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


We can define getNearbyVenues function which can be used to repeat the above process for all the neighborhoods in Downtown Toronto

In [50]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
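
The function above reads 'categories'[0] for every venue, which would fail for a venue that has no category. The sketch below is a hypothetical helper ('venue_rows' is our own name, not part of the notebook or of Foursquare) that tolerates that case; it runs on hand-written items shaped like the response, so no API call is made.

```python
def venue_rows(name, lat, lng, items):
    """Build one tuple per venue; venues without a category get None."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Hand-written items in the shape the function above reads from the response.
items = [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorised place',  # made-up venue
               'location': {'lat': 43.654, 'lng': -79.361},
               'categories': []}},
]

per_neighbourhood = [venue_rows('Harbourfront', 43.65426, -79.360636, items)]

# Same flattening idiom as above: a list of lists becomes one list of rows.
flat = [row for rows in per_neighbourhood for row in rows]
print(flat)
```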

We run the function above on all the neighborhoods and define another dataframe 'dttoronto_venues'

In [59]:
dttoronto_venues = getNearbyVenues(names=dttoronto['Neighbourhood'],
                                   latitudes=dttoronto['Latitude'],
                                   longitudes=dttoronto['Longitude']
                                  )

dttoronto_venues = pd.DataFrame(dttoronto_venues)

dttoronto_venues

Harbourfront
Queen's Park
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Christie
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown,St. James Town
First Canadian Place,Underground city
Church and Wellesley


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653191,-79.357947,Gym / Fitness Center
3,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
...,...,...,...,...,...,...,...
1295,Church and Wellesley,43.66586,-79.383160,Noah's Natural Foods,43.668532,-79.385885,Food & Drink Shop
1296,Church and Wellesley,43.66586,-79.383160,A&W,43.666415,-79.378235,Fast Food Restaurant
1297,Church and Wellesley,43.66586,-79.383160,Flash,43.664319,-79.380190,Strip Club
1298,Church and Wellesley,43.66586,-79.383160,Croissant Tree,43.669575,-79.382331,Coffee Shop


In [61]:
dttoronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653191,-79.357947,Gym / Fitness Center
3,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot


In [62]:
print('There are {} unique categories.'.format(len(dttoronto_venues['Venue Category'].unique())))

There are 207 unique categories.


Now let's analyze each neighborhood separately based on the venue categories. To do that, we use one-hot encoding to get one column per venue category, which will then be used to analyze the mix of categories in each neighborhood.

In [70]:
# one hot encoding
dttoronto_onehot = pd.get_dummies(dttoronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dttoronto_onehot['Neighbourhood'] = dttoronto_venues['Neighborhood'] 

# move neighbourhood column to the first column
fixed_columns = [dttoronto_onehot.columns[-1]] + list(dttoronto_onehot.columns[:-1])
dttoronto_onehot = dttoronto_onehot[fixed_columns]

dttoronto_onehot.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Harbourfront,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now let's group the rows by neighborhood and take the mean of each one-hot column, which gives the frequency of occurrence of each venue category in that neighborhood.

In [71]:
dttoronto_grouped = dttoronto_onehot.groupby('Neighbourhood').mean().reset_index()
dttoronto_grouped

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,...,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.0625,0.0625,0.125,0.1875,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Cabbagetown,St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,0.0,...,0.0,0.012658,0.0,0.012658,0.0,0.012658,0.0,0.0,0.0,0.012658
5,"Chinatown,Grange Park,Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.035294,0.0,0.0,0.047059,0.011765,0.0,0.0,0.0,0.0
6,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Church and Wellesley,0.011905,0.0,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,...,0.0,0.0,0.0,0.0,0.011905,0.0,0.0,0.011905,0.0,0.011905
8,"Commerce Court,Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,...,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
9,"Design Exchange,Toronto Dominion Centre",0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,...,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
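
To make the one-hot and group-mean steps concrete, here is a sketch on six made-up venues in two made-up neighborhoods 'A' and 'B'. Because each venue has exactly one category, the mean of a one-hot column is the share of venues in that category, and each row of the result sums to 1.

```python
import pandas as pd

# Made-up venues: four in neighborhood 'A', two in neighborhood 'B'.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Café", "Bar", "Park", "Coffee Shop"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# Mean of each one-hot column per neighborhood = share of venues in that category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```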


The table above is hard to read, so we rearrange the data to rank the most common venue categories for each neighborhood. We define a function 'return_most_common_venues' that sorts a row's frequencies in descending order, and use it to display the 10 most common venues for each neighborhood.

In [74]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # past 'rd', fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = dttoronto_grouped['Neighbourhood']

for ind in np.arange(dttoronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dttoronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Café,Bar,Restaurant,Thai Restaurant,Steakhouse,Cosmetics Shop,Sushi Restaurant,Burger Joint,Hotel
1,Berczy Park,Coffee Shop,Café,Cheese Shop,Farmers Market,Seafood Restaurant,Bakery,Restaurant,Beer Bar,Cocktail Bar,Diner
2,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport Service,Airport Lounge,Airport Terminal,Plane,Bar,Rental Car Location,Sculpture Garden,Boutique,Boat or Ferry,Harbor / Marina
3,"Cabbagetown,St. James Town",Coffee Shop,Chinese Restaurant,Italian Restaurant,Restaurant,Convenience Store,Bakery,Café,Pizza Place,Pub,Diner
4,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Juice Bar,Japanese Restaurant,Burger Joint,Ice Cream Shop,Department Store,Chinese Restaurant,Salad Place
5,"Chinatown,Grange Park,Kensington Market",Bar,Café,Chinese Restaurant,Vietnamese Restaurant,Coffee Shop,Bakery,Mexican Restaurant,Dumpling Restaurant,Vegetarian / Vegan Restaurant,Cocktail Bar
6,Christie,Grocery Store,Café,Park,Italian Restaurant,Restaurant,Coffee Shop,Gas Station,Diner,Athletics & Sports,Nightclub
7,Church and Wellesley,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant,Fast Food Restaurant,Gym,Men's Store,Mediterranean Restaurant,Burger Joint
8,"Commerce Court,Victoria Hotel",Coffee Shop,Restaurant,Café,Hotel,Gym,American Restaurant,Italian Restaurant,Gastropub,Deli / Bodega,Seafood Restaurant
9,"Design Exchange,Toronto Dominion Centre",Coffee Shop,Café,Hotel,Restaurant,Italian Restaurant,Seafood Restaurant,Bar,Bakery,Gastropub,American Restaurant
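
The ranking itself is just a descending sort of one row of frequencies. The sketch below shows it on a made-up row, together with a small helper for the column labels ('ordinal' is our own hypothetical function) that also handles numbers such as 11 or 21, which the 'st/nd/rd' list above does not need for a top 10.

```python
import pandas as pd


def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ... with 11th to 13th handled."""
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


# Made-up category frequencies for a single neighborhood.
freq = pd.Series({"Bar": 0.15, "Coffee Shop": 0.5, "Park": 0.10, "Café": 0.25})

# Sort descending and keep the three most frequent categories.
top = freq.sort_values(ascending=False).head(3)
labelled = {
    f"{ordinal(i)} Most Common Venue": name
    for i, name in enumerate(top.index, start=1)
}
print(labelled)
```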


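By looking at the data above, we can see that 'Coffee Shop' is the most common venue in most of the neighborhoods: it ranks first in 7 of the 10 rows displayed, the exceptions being led by 'Airport Service', 'Bar' and 'Grocery Store'.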

# Task 3B - K-Means Clustering for neighborhoods in Downtown Toronto

Now we'll perform K-Means clustering to make clusters of the neighborhood venues. We can set the number of clusters to 4

In [77]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 4

dttoronto_grouped_clustering = dttoronto_grouped.drop('Neighbourhood', axis = 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dttoronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 3, 1, 1, 1, 2, 1, 1, 1], dtype=int32)

In [78]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dttoronto_merged = dttoronto

# join dttoronto with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
dttoronto_merged = dttoronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

Finally, we'll visualize the clusters on a map to get a better understanding of how the clusters are defined.

In [83]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dttoronto_merged['Latitude'], dttoronto_merged['Longitude'], dttoronto_merged['Neighbourhood'], dttoronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

From the map above, most of the neighborhoods fall in Cluster 1 (purple), whereas each of the other clusters has at most 1 member. Because K-Means was run on the venue category frequencies rather than on the coordinates, this essentially means the clustering helped us identify outliers in venue mix: the neighborhood in Cluster 3, for example, is dominated by airport-related venues, so it is far from the others in feature space and is clustered separately.

We can do the same analysis on various other neighborhoods in Toronto and get useful insights about the venues in those neighborhoods.