## Segmenting and Clustering Neighborhoods in the City of Toronto, Canada Parts 2 and 3
In Part 1 we built code to:
* scrape data from the following site: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
* create an organized dataframe from the data extracted

In Part 2 we will:
* Obtain and add latitude and longitude coordinates to our table

In Part 3 we will:
* Explore, visualize and cluster the Toronto neighborhoods

### 1. Scraping the Data and creating a table with postal codes, boroughs and Neighborhoods
#### From Part 1

In [4]:
#import pandas
import pandas as pd

#create a url reference and pull the data
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_df = pd.read_html(wiki_url)
df_setup = wiki_df[0]

#Convert the data to a dataframe
df = pd.DataFrame(df_setup[['Postal Code','Borough','Neighborhood']])
df.rename(columns={"Postal Code":"PostalCode"}, inplace=True)
#let's view the unmodified data
df

#drop all rows without assigned Boroughs
df.drop(df[df['Borough'].str.contains('Not assigned')].index, inplace=True)

In [5]:
#check for any Neighborhoods with 'Not assigned' values
df.loc[df['Neighborhood'] == 'Not assigned']

#group the results together by unique PostalCode and create a new dataframe
df_grouped = df['Neighborhood'].groupby(df['PostalCode']).unique()

#convert to a dataframe and remove brackets
df_grouped_2 = pd.DataFrame(df_grouped)
df_grouped_2['Neighborhood'] = df_grouped_2['Neighborhood'].str.get(0)

df_grouped_2

#merge the dataframes
df_final = pd.merge(df_grouped_2, df[['PostalCode', 'Borough']], on='PostalCode')
df_final = df_final[['PostalCode','Borough','Neighborhood']]

#final dataframe
df_final

#final row count
print("The final dataframe contains this many rows and columns:", df_final.shape)

The final dataframe contains this many rows and columns: (103, 3)
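
The grouping step above keeps only the first entry of each `unique()` array. If the scraped table ever lists one postal code on several rows, a join-based aggregation would keep every neighborhood. A minimal sketch on made-up rows (the `sample` frame and its values are hypothetical, not taken from the Wikipedia table):

```python
import pandas as pd

# Hypothetical rows: one postal code split across two lines
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M4E"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "East Toronto"],
    "Neighborhood": ["Regent Park", "Harbourfront", "The Beaches"],
})

# Collapse to one row per postal code, joining neighborhood names with commas
combined = (
    sample.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)
print(combined)
```

Grouping on both `PostalCode` and `Borough` also removes the need for the later merge, assuming each postal code belongs to a single borough.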


### 2. Obtaining coordinates for our neighborhoods
#### This is part 2
We will be using the following CSV file (https://cocl.us/Geospatial_data) to obtain the latitude and longitude values and add them to the table we created in part 1.

In [6]:
#obtain data from the CSV file
df_latlng = pd.read_csv('https://cocl.us/Geospatial_data')

In [7]:
#view the data
df_latlng.head(5)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [8]:
#convert the column name Postal Code to PostalCode
df_latlng.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

#merge the new latitude and longitude columns with the previously generated dataframe from part 1
df_final_latlng = pd.merge(df_final, df_latlng[['PostalCode','Latitude','Longitude']], on='PostalCode')
df_final_latlng

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [9]:
#check the shape
print("Shape of df_final_latlng:",df_final_latlng.shape)

Shape of df_final_latlng: (103, 5)
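
The default inner merge silently drops postal codes that have no match in the coordinates file. The row count is 103 both before and after the merge, which suggests nothing was lost here, but a left merge with `indicator=True` makes that check explicit. A sketch on hypothetical frames (`left`, `coords` and the postal code `M9Z` are illustrative only):

```python
import pandas as pd

left = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M9Z"],
                     "Borough": ["Scarborough", "Scarborough", "Etobicoke"]})
coords = pd.DataFrame({"PostalCode": ["M1B", "M1C"],
                       "Latitude": [43.806686, 43.784535],
                       "Longitude": [-79.194353, -79.160497]})

# how="left" keeps every row of `left`; the _merge column records the match status
checked = left.merge(coords, on="PostalCode", how="left", indicator=True)

# Postal codes that found no coordinates
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```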


### 3. Examining clustered venue data
#### This is part 3. 
We now have a dataframe containing the postal codes, boroughs, neighborhoods, latitudes and longitudes. We will use this data to explore, visualize and cluster the neighborhoods of Toronto.

This part has four steps. They are briefly explained in the Markdown segments below.

-----------------

#### Step 1. We import relevant libraries and take a look at our current data.

In [2]:
#import numpy
import numpy as np

#import requests
import requests

#import plotting and clustering modules
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!pip install folium
#! conda install -c conda-forge folium=0.5.0 --yes
import folium

Collecting folium
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


In [10]:
#retrieve data for boroughs containing the word Toronto
toronto_data = df_final_latlng[df_final_latlng['Borough'].str.contains("Toronto")].reset_index(drop=True)
#view the data
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [11]:
#find the average latitude and longitude value in our dataframe
latitude = toronto_data['Latitude'].mean(axis=0, skipna=True)
longitude = toronto_data['Longitude'].mean(axis=0, skipna=True)

#create a map of our points to view the preliminary data
map_toronto_ave = folium.Map(location=[latitude, longitude], zoom_start=13)
for lat, lng, neighborhood, borough in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood'], toronto_data['Borough']):
    info = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(info, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='red', fill=True, fill_color='red', fill_opacity=0.5).add_to(map_toronto_ave)

#view the map
map_toronto_ave

#### Step 2. We will obtain venue information using Foursquare's API for our Toronto neighborhoods.

In [12]:
#define Foursquare credentials
CLIENT_ID = 'TCER0LFUIQJDBLBIHUBKWL3YEAEXZECJPUQEUQ5TLCOVYTEU'
CLIENT_SECRET = 'E3Z4PYK3SGTYNEWJBMFZBDGCEKXKPRMUYZNRLFLHTARHEWHH'
VERSION = '20180605'

In [13]:
#we will be examining popular venues around Toronto, based on location.

#define a function to loop through each neighborhood or neighborhoods at a specific postal code
LIMIT= 100
radius = 500
def NearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        #define the Foursquare API
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
        
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(name, lat, lng, 
                             v['venue']['name'], 
                             v['venue']['location']['lat'], 
                             v['venue']['location']['lng'],
                             v['venue']['categories'][0]['name']) for v in results])
        
    df_nearbyvenues = pd.DataFrame(item for venue_list in venues_list for item in venue_list)
    df_nearbyvenues.columns = ['Neighborhood','Neighborhood_Latitude','Neighborhood_Longitude','Venue','Venue_Latitude','Venue_Longitude','Venue_Category']
    
    return(df_nearbyvenues)
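
The function above assumes every response contains at least one group and every venue has at least one category; a venue with an empty category list would raise an `IndexError`. One way to guard against that is to separate the parsing from the request. The helper below is a sketch, not part of the original notebook: `parse_explore_items` is a hypothetical name, and the payload layout is assumed to match the keys the function above already reads.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore response into rows, skipping incomplete venues."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # Skip venues that lack a name or coordinates
        if "name" not in venue or "lat" not in location or "lng" not in location:
            continue
        # Fall back to a placeholder when no category is listed
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((name, lat, lng, venue["name"],
                     location["lat"], location["lng"], category))
    return rows

# Hand-made payload that mimics the assumed layout (values are illustrative)
fake = {"response": {"groups": [{"items": [
    {"venue": {"name": "Glen Manor Ravine",
               "location": {"lat": 43.676821, "lng": -79.293942},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "No Category Place",
               "location": {"lat": 43.0, "lng": -79.0},
               "categories": []}},
]}]}}
rows = parse_explore_items(fake, "The Beaches", 43.676357, -79.293031)
print(rows)
```

Keeping the parser free of network calls also makes it easy to check against a saved response before spending API quota.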

In [14]:
#find the nearby venues for our Toronto data
toronto_venues = NearbyVenues(names = toronto_data['Neighborhood'], latitudes = toronto_data['Latitude'], longitudes = toronto_data['Longitude'], radius = 500)

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West, Lawrence Park
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North & West, Forest Hill Road Park
The Annex, North Midtown, Yorkville
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Dufferin, Dovercourt Village
Little Portugal, Trinity
Brockton, Parkdale Village, Exhibition Place
High 

In [15]:
#check the shape of the dataframe
print("The shape of toronto_venues is:", toronto_venues.shape)
toronto_venues.head(15)

The shape of toronto_venues is: (1625, 7)


Unnamed: 0,Neighborhood,Neighborhood_Latitude,Neighborhood_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop
5,"The Danforth West, Riverdale",43.679557,-79.352188,La Diperie,43.677702,-79.352265,Ice Cream Shop
6,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
7,"The Danforth West, Riverdale",43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
8,"The Danforth West, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
9,"The Danforth West, Riverdale",43.679557,-79.352188,Louis Cifer Brew Works,43.677663,-79.351313,Brewery


In [16]:
#check the number of unique venue types
print("There are {} unique venue types.".format(len(toronto_venues['Venue_Category'].unique())))

There are 232 unique venue types.


#### Step 3. We will group our neighborhoods according to venue categories using one-hot encoding, analyze the data, and prepare the data for clustering.

In [17]:
#one hot encode for venue category
toronto_onehot = pd.get_dummies(toronto_venues['Venue_Category'])

#we have a neighborhood column for neighborhood venues. Let's rename it
toronto_onehot.rename(columns={'Neighborhood':'Neighborhood_Venue'}, inplace=True)

toronto_onehot.head()

Unnamed: 0,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
#check our columns
columns = toronto_onehot.columns.tolist()
columns

#add the neighborhood column to the one hot table
toronto_onehot['Neighborhoods'] = toronto_venues['Neighborhood']

columns_revised = toronto_onehot.columns.tolist()

In [19]:
#move the last column to the front
columns_revised = columns_revised[-1:] + columns_revised[:-1]
toronto_onehot = toronto_onehot[columns_revised]
toronto_onehot

Unnamed: 0,Neighborhoods,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
#group rows by Neighborhood and mean of frequency
toronto_grouped = toronto_onehot.groupby('Neighborhoods').mean().reset_index()
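
Because each row of the one-hot table marks exactly one venue, the group mean is the share of a neighborhood's venues that fall in each category. A small sketch with made-up venues (the `venues` frame is hypothetical):

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue_Category": ["Park", "Café", "Café", "Pub", "Park", "Park"],
})

# One column per category, one row per venue
onehot = pd.get_dummies(venues["Venue_Category"], dtype=float)
onehot.insert(0, "Neighborhoods", venues["Neighborhood"])

# Mean of the 0/1 indicators = fraction of venues in each category
freq = onehot.groupby("Neighborhoods").mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with very few venues end up with a handful of large values, which is worth keeping in mind when reading the clusters.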

In [21]:
#Let's figure out how many venues were listed for each Neighborhood
toronto_venues['Neighborhood'].value_counts()

First Canadian Place, Underground city                                                                        100
Garden District, Ryerson                                                                                      100
Commerce Court, Victoria Hotel                                                                                100
Toronto Dominion Centre, Design Exchange                                                                      100
Harbourfront East, Union Station, Toronto Islands                                                             100
Stn A PO Boxes                                                                                                 97
Richmond, Adelaide, King                                                                                       94
St. James Town                                                                                                 84
Church and Wellesley                                        

Since the smallest neighborhood has only 2 venues and the next smallest has 4, we will keep the top 4 venue categories for each neighborhood.

In [22]:
#import numpy
import numpy as np

#define a function to obtain the four top venues
number_of_top_venues = 4
def most_common_venues (row, number_of_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:number_of_top_venues]

#create a new dataframe for our ordered top four data
toronto_top4_venues = pd.DataFrame(columns = ['Neighborhoods', '1st Venue', '2nd Venue', '3rd Venue', '4th Venue'])

#Add the Neighborhoods column
toronto_top4_venues['Neighborhoods'] = toronto_grouped['Neighborhoods']

#add the data
for ind in np.arange(toronto_grouped.shape[0]):
    toronto_top4_venues.iloc[ind, 1:] = most_common_venues(toronto_grouped.iloc[ind, :], number_of_top_venues)
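
To see what `most_common_venues` returns, it can be applied to a single hand-made row. The row below is hypothetical; the function body is repeated from the cell above so the snippet runs on its own.

```python
import pandas as pd

def most_common_venues(row, number_of_top_venues):
    row_categories = row.iloc[1:]  # skip the neighborhood name
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:number_of_top_venues]

# Hypothetical frequencies for one neighborhood
row = pd.Series({"Neighborhoods": "Example", "Park": 0.1, "Café": 0.5,
                 "Pub": 0.3, "Trail": 0.04, "Hotel": 0.01})
top = most_common_venues(row, 4)
print(list(top))
```

Categories with equal frequencies are not guaranteed to come back in any particular order, so tied entries in the top-four table should be read with some caution.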

In [23]:
#Let's view the new dataframe
toronto_top4_venues.head()

Unnamed: 0,Neighborhoods,1st Venue,2nd Venue,3rd Venue,4th Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Seafood Restaurant
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Yoga Studio
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Farmers Market,Garden,Park
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport,Boat or Ferry,Plane
4,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Café


In [24]:
#let's view the shape
print("Top4 Venues shape:", toronto_top4_venues.shape)

Top4 Venues shape: (39, 5)


#### Step 4: We will now begin to cluster our neighborhoods using k-means

In [25]:
#define k clusters
kclusters = 5
toronto_clusters = toronto_grouped.drop(columns=['Neighborhoods'])
kmeans = KMeans(n_clusters = kclusters, init='k-means++',random_state=0).fit(toronto_clusters)

#check the kmean labels
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 2, 0, 1, 0,
       0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
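
The labels are heavily unbalanced, which is easier to see once they are counted. The sketch below copies the label array printed above into a plain list and tallies it with the standard library:

```python
from collections import Counter

# Labels copied from the kmeans.labels_ output above
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 2, 0, 1, 0,
          0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

cluster_sizes = Counter(labels)
for cluster, size in sorted(cluster_sizes.items()):
    print(f"Cluster {cluster}: {size} neighborhoods")
```

Cluster 0 holds 34 of the 39 neighborhoods, so the remaining four clusters mostly isolate individual outliers rather than describing broad groups.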

In [26]:
#insert the cluster data into the modified one hot dataframe
toronto_top4_venues.insert(0, 'Clusters', kmeans.labels_)

In [29]:
toronto_merged = toronto_data

#we will need to rename our Neighborhoods column
toronto_top4_venues.rename(columns={'Neighborhoods':'Neighborhood'}, inplace = True)

#merge the original Toronto dataframe with the top venues dataframe
toronto_merged = toronto_merged.join(toronto_top4_venues.set_index('Neighborhood'), on='Neighborhood')

#Let's view the top rows
toronto_merged.head(39)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Clusters,1st Venue,2nd Venue,3rd Venue,4th Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Neighborhood_Venue,Health Food Store,Pub,Trail
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,0,Intersection,Ice Cream Shop,Brewery,Sandwich Place
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,Bakery,Brewery
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2,Park,Bus Line,Swim School,Yoga Studio
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197,0,Hotel,Park,Food & Drink Shop,Sandwich Place
6,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678,0,Clothing Store,Sporting Goods Shop,Coffee Shop,Yoga Studio
7,M4S,Central Toronto,Davisville,43.704324,-79.38879,0,Dessert Shop,Sandwich Place,Sushi Restaurant,Coffee Shop
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,1,Trail,Park,Restaurant,Lawyer
9,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049,0,Coffee Shop,Pub,Liquor Store,Sports Bar


In [28]:
#generate a new map from the latitude and longitude we defined earlier
cluster_map = folium.Map(location=[latitude, longitude], zoom_start=12)

#define a set of colors
x = np.arange(kclusters)
ys= [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0,1, len(ys)))
cluster_color = [colors.rgb2hex(i) for i in colors_array]

#create markers to add to the map
for lat, lng, cluster, neighborhood, borough in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Clusters'], toronto_merged['Neighborhood'], toronto_merged['Borough']):
    info = 'Cluster {}, {}, {}'.format(cluster, neighborhood, borough)
    label = folium.Popup(info, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color=cluster_color[cluster], fill=True, fill_color=cluster_color[cluster], fill_opacity=0.5).add_to(cluster_map)

cluster_map

Given the data, I think a more accurate analysis could have been produced by using only neighborhoods with 10 or more venues. Even so, the neighborhoods that fall outside cluster 0 appear to favor non-food venues among their top four. The neighborhoods at postal codes M4E and M5V are examples of locations that may be better suited to a different cluster.
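
The idea of restricting the analysis to neighborhoods with 10 or more venues could be sketched as follows. This was not run as part of the notebook; the `venues` frame and the `min_venues` name are hypothetical stand-ins for `toronto_venues` and the threshold mentioned above.

```python
import pandas as pd

# Hypothetical venue table: "Busy" has 12 venues, "Quiet" only 2
venues = pd.DataFrame({
    "Neighborhood": ["Busy"] * 12 + ["Quiet"] * 2,
    "Venue_Category": ["Coffee Shop"] * 12 + ["Park"] * 2,
})

min_venues = 10  # assumed threshold
counts = venues["Neighborhood"].value_counts()

# Keep only neighborhoods that meet the threshold
keep = counts[counts >= min_venues].index
filtered = venues[venues["Neighborhood"].isin(keep)]
print(filtered["Neighborhood"].unique())
```

The filtered table would then go through the same one-hot, grouping and k-means steps as before.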

Thank you for viewing my notebook!