# Assignment - Segmenting and Clustering Neighborhoods in Toronto

## 1 - Get and create a Pandas DataFrame containing Neighbourhood and PostalCode information

In [0]:
import pandas as pd

Let's get the tables from the Wikipedia page. `pd.read_html` returns a list with one DataFrame per HTML table it finds.

In [0]:
df_toronto = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [3]:
df_toronto

[    Postcode           Borough          Neighbourhood
 0        M1A      Not assigned           Not assigned
 1        M2A      Not assigned           Not assigned
 2        M3A        North York              Parkwoods
 3        M4A        North York       Victoria Village
 4        M5A  Downtown Toronto           Harbourfront
 ..       ...               ...                    ...
 282      M8Z         Etobicoke              Mimico NW
 283      M8Z         Etobicoke     The Queensway West
 284      M8Z         Etobicoke  Royal York South West
 285      M8Z         Etobicoke         South of Bloor
 286      M9Z      Not assigned           Not assigned
 
 [287 rows x 3 columns],
                                                   0   ...   17
 0                                                NaN  ...  NaN
 1  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...  ...  NaN
 2                                                 NL  ...   YT
 3                                                  A  ..

Get our desired dataset (the first one, that has the PostalCodes)

In [0]:
df_toronto = df_toronto[0]

Remove all rows that do not have a Borough (Not assigned)

In [0]:
df_toronto = df_toronto[df_toronto['Borough'] != 'Not assigned']

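When a neighbourhood is not assigned, it takes the name of its Borough. pandas emits a `SettingWithCopyWarning` here because `df_toronto` was produced by filtering another DataFrame, so pandas cannot tell whether the assignment targets a view or a copy. *Despite the warning, the assignment takes effect.*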

In [6]:
df_toronto.loc[df_toronto['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df_toronto[df_toronto['Neighbourhood'] == 'Not assigned']['Borough']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

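The warning above can usually be avoided by taking an explicit copy of the filtered frame before writing to it. The following is a minimal sketch on a toy frame, not the notebook's actual cell: the rows are illustrative and the names `raw`, `assigned` and `missing` are assumptions introduced for this example.

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative rows only)
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# .copy() makes the filtered frame independent of `raw`,
# so the later assignment is unambiguous and raises no warning
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# Where the neighbourhood is missing, fall back to the borough name
missing = assigned['Neighbourhood'] == 'Not assigned'
assigned.loc[missing, 'Neighbourhood'] = assigned.loc[missing, 'Borough']

print(assigned)
```

Computing the boolean mask once and reusing it on both sides of the assignment also keeps the two selections aligned.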

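Create a Series with the unique neighbourhoods of each postcode, joined by commas. A `set` is used instead of a `list` to drop duplicates; note that a set has no guaranteed order, so the order of the names within each string is arbitrary.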

In [0]:
group_neigh = df_toronto[['Postcode', 'Neighbourhood']].groupby('Postcode')['Neighbourhood'].apply(set).apply(lambda x : ', '.join(x))

In [8]:
group_neigh

Postcode
M1B                                       Rouge, Malvern
M1C               Port Union, Highland Creek, Rouge Hill
M1E                    Guildwood, Morningside, West Hill
M1G                                               Woburn
M1H                                            Cedarbrae
                             ...                        
M9N                                               Weston
M9P                                            Westmount
M9R    St. Phillips, Martin Grove Gardens, Richview G...
M9V    Mount Olive, South Steeles, Albion Gardens, Be...
M9W                                            Northwest
Name: Neighbourhood, Length: 103, dtype: object

Do the same as above, but for boroughs

In [0]:
group_boroughs = df_toronto[['Postcode', 'Borough']].groupby('Postcode')['Borough'].apply(set).apply(lambda x : ', '.join(x))

In [10]:
group_boroughs

Postcode
M1B    Scarborough
M1C    Scarborough
M1E    Scarborough
M1G    Scarborough
M1H    Scarborough
          ...     
M9N           York
M9P      Etobicoke
M9R      Etobicoke
M9V      Etobicoke
M9W      Etobicoke
Name: Borough, Length: 103, dtype: object

Create a DataFrame from the two grouped Series. It must be transposed (`.T`); otherwise the postal codes would become the columns. The index is then reset so that the postal code becomes a regular column, matching the structure of the example, and `Postcode` is renamed to `PostalCode`.

In [0]:
df_toronto_clean = pd.DataFrame([group_boroughs, group_neigh]).T.reset_index().rename(columns={'Postcode': 'PostalCode'})

In [12]:
df_toronto_clean

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Port Union, Highland Creek, Rouge Hill |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | St. Phillips, Martin Grove Gardens, Richview G... |
| 101 | M9V | Etobicoke | Mount Olive, South Steeles, Albion Gardens, Be... |


In [13]:
df_toronto_clean.shape

(103, 3)
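
The two `groupby` calls and the transpose could probably be collapsed into a single aggregation. The following is a minimal sketch on a toy frame shaped like the data above. `join_unique` is a hypothetical helper, not part of the notebook, and it uses `Series.unique()` instead of `set` so that names keep their first-appearance order.

```python
import pandas as pd

# Toy rows shaped like the filtered Wikipedia table (row order is illustrative)
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1C', 'M1C', 'M1C'],
    'Borough': ['Scarborough'] * 5,
    'Neighbourhood': ['Rouge', 'Malvern', 'Highland Creek', 'Rouge Hill', 'Port Union'],
})

def join_unique(values):
    # Series.unique() drops duplicates but keeps first-appearance order
    return ', '.join(values.unique())

grouped = (
    toy.groupby('Postcode', as_index=False)
       .agg({'Borough': join_unique, 'Neighbourhood': join_unique})
       .rename(columns={'Postcode': 'PostalCode'})
)

print(grouped)
```

With `as_index=False` the postcode stays a regular column, so no transpose or `reset_index()` is needed.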

## 2 - Merge Neighbourhood information with Geospatial data (longitude, latitude)

Because the Google API did not work reliably, I used the CSV data instead.

In [0]:
df_geo_coord = pd.read_csv('Geospatial_Coordinates.csv')

In [0]:
df_geo_coord = df_geo_coord.rename(columns={'Postal Code': 'PostalCode'})

In [16]:
df_geo_coord

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| ... | ... | ... | ... |
| 98 | M9N | 43.706876 | -79.518188 |
| 99 | M9P | 43.696319 | -79.532242 |
| 100 | M9R | 43.688905 | -79.554724 |
| 101 | M9V | 43.739416 | -79.588437 |


Now, merge the Latitude and Longitude data with our previously created dataset.

In [0]:
df_toronto_clean = df_toronto_clean.merge(df_geo_coord, on='PostalCode')
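
The default merge is an inner join, so a postal code missing from the coordinates file would be dropped silently. One way to check for that, sketched here on toy frames, is to use a left join with `indicator=True` and `validate`. The postal code `M0X` is made up for illustration, and the frame names are assumptions.

```python
import pandas as pd

# Toy frames; 'M0X' is a made-up code with no coordinates
left = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M0X'],
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
right = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# how='left' keeps every row of `left`; indicator=True adds a '_merge' column;
# validate raises if a postal code is duplicated on either side
checked = left.merge(right, on='PostalCode', how='left',
                     validate='one_to_one', indicator=True)

# Postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner join loses no rows.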

In [18]:
df_toronto_clean['Borough'].value_counts(dropna=False)

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           11
Central Toronto      9
West Toronto         6
York                 5
East Toronto         5
East York            5
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

## 3 - Select Borough, get information about each place on Foursquare API, create clusters and show them in a map

First, let's install our requirements

In [19]:
!pip install folium==0.5.0

Collecting folium==0.5.0
  Downloading https://files.pythonhosted.org/packages/07/37/456fb3699ed23caa0011f8b90d9cad94445eddc656b601e6268090de35f5/folium-0.5.0.tar.gz (79kB)
Building wheels for collected packages: folium
  Building wheel for folium (setup.py) ... done
  Created wheel for folium: filename=folium-0.5.0-cp36-none-any.whl size=76240 sha256=fd95971d8eee4d678ea819ee241562c96de4a07cd0687ba711c2ff7a65893677
  Stored in directory: /root

And then, import the required packages

In [0]:
import folium
import time
import numpy as np
import requests
from ipywidgets import IntProgress
from IPython.display import display

Now, let's create a map with the entire dataset we have of Toronto Boroughs

In [21]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[43.729097, -79.369131], zoom_start=11)

# add one marker per postal code, labelled with its neighbourhood(s) and borough
for lat, lng, borough, neighborhood in zip(df_toronto_clean['Latitude'], df_toronto_clean['Longitude'], df_toronto_clean['Borough'], df_toronto_clean['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto

Now we have to select a borough to analyze. In this case, we will analyze all rows whose Borough is Downtown Toronto.

In [0]:
df_analysis = df_toronto_clean[df_toronto_clean['Borough'] == 'Downtown Toronto']

Set up our Foursquare API credentials

In [0]:
CLIENT_ID = '<CLIENT_ID>'
CLIENT_SECRET = '<CLIENT_SECRET>'
VERSION = '20180605'
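
Credentials typed directly into a notebook are easy to leak when the notebook is shared. A possible alternative, sketched below, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions made for this example, not names defined by Foursquare or by this notebook.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    # The variable names below are hypothetical; choose whatever your setup uses
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret

# Demonstration with an explicit mapping instead of the real environment
demo_id, demo_secret = load_credentials({
    'FOURSQUARE_CLIENT_ID': 'example-id',
    'FOURSQUARE_CLIENT_SECRET': 'example-secret',
})
print(demo_id)
```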

The function below queries the API for venues near each selected neighbourhood

In [0]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000, LIMIT=200):

    # Get the number of places to be evaluated
    max_count = len(names)
    # instantiate the progress bar
    f = IntProgress(min=0, max=max_count) 
    # Display the progress bar
    display(f)
    # Start counter
    count = 0  
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        
        f.value += 1 # signal to increment the progress bar
        count += 1


    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']


    return(nearby_venues)
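
The list comprehension in `getNearbyVenues` assumes that every venue has at least one category and that the response always contains a group. A more defensive way to flatten one response might look like the sketch below. `extract_venue_rows` is a hypothetical helper, and the payload shape is only inferred from the lookups in the function above, not taken from the Foursquare documentation.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into tuples, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # A venue without categories gets None instead of raising IndexError
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Toy payload (made-up venues) mirroring the keys accessed in getNearbyVenues
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 43.68, 'lng': -79.38},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.67, 'lng': -79.37},
               'categories': []}},
]}]}}

rows = extract_venue_rows(sample, 'Rosedale', 43.679563, -79.377529)
print(rows)
```

Rows with a `None` category can then be dropped or relabelled before the encoding step.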

In [25]:
downtown_toronto_venues = getNearbyVenues(df_analysis['Neighbourhood'], df_analysis['Latitude'], df_analysis['Longitude'])

IntProgress(value=0, max=19)

Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Grange Park, Kensington Market, Chinatown
Bathurst Quay, CN Tower, Island airport, Railway Lands, King and Spadina, South Niagara, Harbourfront West
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Queen's Park


Now we can see how many venues we got across all analyzed neighbourhoods (one row per venue)

In [26]:
downtown_toronto_venues.shape

(1584, 7)

In [27]:
print(f'There are {len(downtown_toronto_venues["Venue Category"].unique())} unique categories')

There are 208 unique categories


Next, we get the list of unique categories. These will be the feature columns when we encode each neighbourhood.

In [0]:
list_unique_categories = list(downtown_toronto_venues['Venue Category'].unique())

The cell below loops over each unique neighbourhood. For each one, `row_numeric_data` holds the number of venues in each category (in the order of `list_unique_categories`), and `row_normalized_data` holds the proportion of each category among that neighbourhood's venues. Each neighbourhood then becomes a dictionary whose keys are the column names (the neighbourhood name plus the venue categories) and whose values are the corresponding entries.

In [0]:
row_list = []
for neighborhood in downtown_toronto_venues['Neighborhood'].unique():
  # venue categories observed in this neighbourhood (filtered once, not once per category)
  venue_category_data = downtown_toronto_venues[downtown_toronto_venues['Neighborhood'] == neighborhood]['Venue Category']
  # number of venues in each category, in the order of list_unique_categories
  row_numeric_data = [int((venue_category_data == category).sum()) for category in list_unique_categories]
  # convert counts to proportions, guarding against a neighbourhood with no venues
  total = sum(row_numeric_data)
  if total > 0:
    row_normalized_data = [number_category / total for number_category in row_numeric_data]
  else:
    row_normalized_data = row_numeric_data
  row_list.append(dict(zip(['Neighborhood'] + list_unique_categories, [neighborhood] + row_normalized_data)))

Now, with that list of dictionaries, we will create a DataFrame containing all the calculated data.

In [0]:
df_venue_category_data = pd.DataFrame(row_list)
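
The same table of per-neighbourhood category proportions can likely be built without an explicit loop using `pd.crosstab` with `normalize='index'`. This is a sketch on toy data (the venue rows are made up), not a drop-in replacement: `crosstab` sorts the neighbourhoods alphabetically, so the row order differs from the loop above.

```python
import pandas as pd

# Toy venue table with the same two columns used by the loop (made-up rows)
venues = pd.DataFrame({
    'Neighborhood': ['Rosedale', 'Rosedale', 'Christie', 'Christie', 'Christie', 'Christie'],
    'Venue Category': ['Park', 'Trail', 'Coffee Shop', 'Coffee Shop', 'Grocery Store', 'Park'],
})

# Count venues per (neighbourhood, category), then divide each row by its total
proportions = pd.crosstab(venues['Neighborhood'], venues['Venue Category'],
                          normalize='index')

print(proportions)
```

Because the neighbourhood names become the index, any later step that attaches cluster labels should align on that index rather than on row position.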

Let's start the analysis by importing the scikit-learn `KMeans` class.

In [0]:
from sklearn.cluster import KMeans

Then we fit k-means with 8 clusters on the category proportions of all the neighbourhoods

In [0]:
kmeans = KMeans(n_clusters=8, random_state=0).fit(df_venue_category_data.drop(columns=['Neighborhood']))
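
The notebook fixes the number of clusters at 8. One common way to sanity-check that choice is to compare the inertia (within-cluster sum of squares) over a range of `k`. The sketch below uses a synthetic matrix as a stand-in for the real feature table, so its numbers say nothing about Toronto; the names `X` and `inertias` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in: 19 rows of proportions over 5 made-up categories
X = rng.dirichlet(np.ones(5), size=19)

inertias = {}
for k in range(2, 9):
    # n_init=10 restarts k-means from 10 initialisations and keeps the best fit
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 4))
```

A value of `k` beyond which the inertia stops dropping sharply is often taken as a reasonable choice, although with only 19 samples the curve may not show a clear elbow.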

Below, we can see the label for each neighbourhood

In [33]:
kmeans.labels_

array([4, 7, 5, 5, 5, 0, 0, 5, 0, 0, 0, 0, 1, 1, 2, 0, 0, 6, 3],
      dtype=int32)
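
Counting how many neighbourhoods fall into each cluster helps when reading the labels. The sketch below copies the label array printed above and counts it with `np.bincount`; the variable names are assumptions.

```python
import numpy as np

# Labels copied from the kmeans.labels_ output above
labels = np.array([4, 7, 5, 5, 5, 0, 0, 5, 0, 0, 0, 0, 1, 1, 2, 0, 0, 6, 3])

# counts[i] is the number of neighbourhoods assigned to cluster i
counts = np.bincount(labels, minlength=8)
print(counts)  # [8 2 1 1 1 4 1 1]
```

Cluster 0 holds 8 of the 19 neighbourhoods and cluster 5 holds 4, while five clusters contain a single neighbourhood each.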

Now let's copy the DataFrame that holds the information for each neighbourhood. Using the `.copy()` method matters: plain assignment would only create a second reference to the same DataFrame, whereas `.copy()` creates a new DataFrame, `df_result_cluster`, with exactly the same data as `df_analysis`.

In [0]:
df_result_cluster = df_analysis.copy()

Now we assign the cluster label to each neighbourhood in our analyzed dataset. This works because the feature rows were built in the same neighbourhood order as `df_analysis`.

In [0]:
df_result_cluster['Cluster'] = kmeans.labels_
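
Assigning `kmeans.labels_` by position depends on both tables having the same rows in the same order. If a neighbourhood returned no venues, the lengths would differ and the assignment would fail. A more robust variant, sketched here on toy frames with illustrative labels, maps the labels by neighbourhood name instead. All names in the block are assumptions.

```python
import pandas as pd

# Toy feature table (one row per clustered neighbourhood) and its labels
features = pd.DataFrame({'Neighborhood': ['Rosedale', 'Christie']})
labels = [4, 6]

# Toy analysis table in a different order, with one neighbourhood that was not clustered
analysis = pd.DataFrame({'Neighbourhood': ['Christie', "Queen's Park", 'Rosedale']})

# Index the labels by neighbourhood name, then look each name up
label_by_name = pd.Series(labels, index=features['Neighborhood'], name='Cluster')
result = analysis.copy()
result['Cluster'] = result['Neighbourhood'].map(label_by_name)

print(result)
```

Neighbourhoods without a label come out as `NaN`, which makes a mismatch visible instead of shifting labels onto the wrong rows.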

Import some modules for color selection

In [0]:
import matplotlib.cm as cm
import matplotlib.colors as colors

Finally let's create a map with the clusters we got.

In [37]:
# create map
map_clusters = folium.Map(location=[43.650097, -79.369131], zoom_start=12)

# set color scheme for the clusters: one distinct color per cluster label (0-7)
colors_array = cm.rainbow(np.linspace(0, 1, 8))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lon, poi, cluster in zip(df_result_cluster['Latitude'], 
                                  df_result_cluster['Longitude'], 
                                  df_result_cluster['Neighbourhood'],
                                  df_result_cluster['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

I think this is a very interesting result because it shows a similar pattern in neighbourhoods that are close to each other. The neighbourhoods in the city centre of Toronto appear similar to one another, while more distant neighbourhoods can have different characteristics. This was expected, since nearby areas tend to share similarities.