# Toronto Neighborhood Cluster Analysis

The neighborhoods of Toronto are clustered by the similarity of the venues present in each neighborhood.
The venues range from colleges and universities to hospitals, art exhibitions and hotels. [Click here](https://developer.foursquare.com/docs/resources/categories) for more information on the venue categories. The data is fetched from [Foursquare](https://developer.foursquare.com/), a local search-and-discovery mobile app that gives its users personalized recommendations of places to go near their current location, based on their previous browsing and check-in history.

### Loading Data 

Importing libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

The list of neighborhoods can be found [here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). The same table can also be used to group the neighborhoods into boroughs and to measure the similarity of the boroughs.

In [2]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
print("Fetching Data From:\n",url,"\n========================")
source=requests.get(url).text
print("Completed")

Fetching Data From:
 https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 
Completed


In [3]:
soup=BeautifulSoup(source)
table=soup.find("table",class_="wikitable").tbody
#print(table.prettify())
rows=table.find_all("tr")
l=[]
for row in rows:
    d=[data.text for data in row.find_all("td")]
    if len(d)>0:
        l.append(d)
df=pd.DataFrame(l)
df.columns=["pcode","borough","neighborhood"]
df.head()

| | pcode | borough | neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |
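
The row-extraction loop can be tried offline on a small HTML snippet. The sketch below uses a made-up table (not the real Wikipedia markup) and assumes the `html.parser` backend. It shows why the header row is skipped and where the trailing `\n` in the `neighborhood` column comes from.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the layout of the Wikipedia table.
html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    cells = [td.text for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so its list is empty
        rows.append(cells)

print(rows)
```

The last cell of each row keeps the newline that precedes `</td>`, which is why the cleaning step has to remove it.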


### Cleaning Data

Making sure that the data is clean and can be used for further analysis.

In [4]:
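df["neighborhood"]=df["neighborhood"].str.strip()   # remove the trailing newline left by the HTML cells
df=df[df.borough!="Not assigned"]                   # drop postal codes that have no borough
# combine the neighborhoods that share a postal code into one comma-separated row
df=df.groupby(["pcode","borough"])["neighborhood"].apply(lambda x:", ".join(x)).reset_index()
# a neighborhood without a name takes the name of its borough
mask=df.neighborhood=="Not assigned"
df.loc[mask,"neighborhood"]=df.loc[mask,"borough"]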
df.head()

| | pcode | borough | neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
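
To see what each cleaning step does, the same sequence can be run on a handful of rows. This is a minimal sketch on made-up data in the shape of the scraped table; the frame names `raw` and `clean` are illustrative, and the sketch uses `.str.strip()` to remove the trailing newline.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped table.
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned\n"],
        ["M5A", "Downtown Toronto", "Harbourfront\n"],
        ["M5A", "Downtown Toronto", "Regent Park\n"],
        ["M7A", "Queen's Park", "Not assigned\n"],
    ],
    columns=["pcode", "borough", "neighborhood"],
)

clean = raw.copy()
clean["neighborhood"] = clean["neighborhood"].str.strip()    # remove trailing newline
clean = clean[clean["borough"] != "Not assigned"]            # drop unassigned postal codes
clean = (
    clean.groupby(["pcode", "borough"])["neighborhood"]
    .apply(lambda names: ", ".join(names))                   # one row per postal code
    .reset_index()
)
mask = clean["neighborhood"] == "Not assigned"
clean.loc[mask, "neighborhood"] = clean.loc[mask, "borough"]  # fall back to the borough name

print(clean)
```

The two `M5A` rows collapse into one, and the unnamed `M7A` neighborhood inherits the name of its borough.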


In [5]:
l_br=[br for br in df.borough.unique() if "Toronto" in br]
l_br

['East Toronto', 'Central Toronto', 'Downtown Toronto', 'West Toronto']

In [6]:
df_tor=df[df["borough"].isin(l_br)].reset_index(drop=True)

In [7]:
tor_geo=pd.read_csv("Geospatial_Coordinates.csv")
tor_geo.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [8]:
tor=df_tor.merge(tor_geo,left_on="pcode",right_on="Postal Code").drop(columns="Postal Code")   # keyword form; the positional axis argument was removed in pandas 2.0
tor.head()

| | pcode | borough | neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 2 | M4L | East Toronto | The Beaches West, India Bazaar | 43.668999 | -79.315572 |
| 3 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |


In [9]:
print("There are {} postal codes.".format(tor.shape[0]))

There are 38 postal codes.
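
The merge in cell `In [8]` relies on pandas' default inner join, so a postal code missing from `Geospatial_Coordinates.csv` would be dropped without any message. One way to check for that, sketched here on made-up frames (`codes`, `geo` and the code `M9Z` are hypothetical), is a left join with `indicator=True`:

```python
import pandas as pd

# Hypothetical frames: "M9Z" has no entry in the coordinates table.
codes = pd.DataFrame({"pcode": ["M4E", "M4K", "M9Z"]})
geo = pd.DataFrame(
    {
        "Postal Code": ["M4E", "M4K"],
        "Latitude": [43.676357, 43.679557],
        "Longitude": [-79.293031, -79.352188],
    }
)

# A left join keeps every postal code; `_merge` records where each row matched.
checked = codes.merge(
    geo, left_on="pcode", right_on="Postal Code", how="left", indicator=True
)
missing = checked.loc[checked["_merge"] == "left_only", "pcode"].tolist()
print(missing)
```

An empty `missing` list would mean that every postal code found its coordinates.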


### Displaying Map

In [10]:
import folium
import matplotlib.pyplot as plt

In [11]:
tor_map=folium.Map(location=[tor.Latitude.mean(),tor.Longitude.mean()],zoom_start=12)   # named tor_map to avoid shadowing the built-in map()
# add markers
for lat,lng in zip(tor.Latitude,tor.Longitude):
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        color="green",
        fill_opacity=.4, fill=True).add_to(tor_map)

# display
tor_map

These are the neighborhoods of the Toronto boroughs whose names contain "Toronto":
* East Toronto
* Central Toronto
* West Toronto
* Downtown Toronto

### Collecting Data from Foursquare

Using the API, the program fetches JSON data from Foursquare. Given a latitude and longitude, the API returns the venues near that location, up to the specified limit and within the specified radius.


In [12]:
from geopy.geocoders import Nominatim
import json
from pandas import json_normalize   # top-level location since pandas 1.0

Replace `c_id` and `c_sec` with the credentials of your Foursquare account. Initialize the following:
* `c_id`  : Client ID
* `c_sec` : Client Secret
* `limit` : the maximum number of venues to be returned
* `rad`   : the search radius in meters; only venues within it are returned
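
As an illustration of how these settings end up in the request, the sketch below builds the query string with the standard library instead of string formatting. The endpoint and parameter names are the ones used in this notebook; the helper `build_explore_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"  # endpoint used in this notebook

def build_explore_url(client_id, client_secret, lat, lng, radius, limit, version="20181218"):
    """Return the explore URL for one location (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),   # latitude,longitude of the postal code
        "radius": radius,                 # search radius
        "limit": limit,                   # maximum number of venues
    }
    return EXPLORE_URL + "?" + urlencode(params)

url = build_explore_url("ID", "SECRET", 43.676357, -79.293031, radius=700, limit=100)
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"], query["limit"])
```

The same `params` dictionary could also be handed to `requests.get(EXPLORE_URL, params=params)`, which performs the encoding itself.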

In [13]:
c_id="***************"
c_sec="***************"
rad=700
limit=100
total=len(tor.pcode)   # total number of api requests
api="https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v=20181218&ll={},{}&radius={}&limit={}"
l=[]
# count keeps track of how many requests have finished
for count,(code,lat,lng) in enumerate(zip(tor.pcode,tor.Latitude,tor.Longitude),start=1):
    response=requests.get(api.format(c_id,c_sec,lat,lng,rad,limit))
    response.raise_for_status()   # stop on an HTTP error instead of failing later on a missing key
    res=response.json()['response']['groups'][0]['items']
    for ite in res:
        # keep the postal code and the first listed category of each venue
        l.append([code,ite["venue"]["categories"][0]["name"]])
    print("{}% completed".format(round(count/total*100,2)))
longdf=pd.DataFrame(l,columns=["pcode","category"])
longdf.shape

2.63% completed
5.26% completed
7.89% completed
10.53% completed
13.16% completed
15.79% completed
18.42% completed
21.05% completed
23.68% completed
26.32% completed
28.95% completed
31.58% completed
34.21% completed
36.84% completed
39.47% completed
42.11% completed
44.74% completed
47.37% completed
50.0% completed
52.63% completed
55.26% completed
57.89% completed
60.53% completed
63.16% completed
65.79% completed
68.42% completed
71.05% completed
73.68% completed
76.32% completed
78.95% completed
81.58% completed
84.21% completed
86.84% completed
89.47% completed
92.11% completed
94.74% completed
97.37% completed
100.0% completed


(2436, 2)
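
The loop indexes straight into `['response']['groups'][0]['items']` and `["categories"][0]`, which raises an error if a response has no groups or a venue has no category. A more defensive variant could look like the sketch below. The helper `extract_categories` and the sample payload are made up for illustration; only the nesting of the keys follows the code in this notebook.

```python
def extract_categories(payload, code):
    """Return [code, category] pairs from one explore response (hypothetical helper)."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    pairs = []
    for item in items:
        categories = item.get("venue", {}).get("categories", [])
        if categories:  # skip venues that carry no category
            pairs.append([code, categories[0]["name"]])
    return pairs

# Made-up payload with the same nesting as the fields read in the loop.
sample = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "A", "categories": [{"name": "Bakery"}]}},
                    {"venue": {"name": "B", "categories": []}},
                ]
            }
        ]
    }
}

print(extract_categories(sample, "M4E"))
print(extract_categories({"response": {}}, "M4E"))
```

Venues without a category are skipped, and a response without groups yields an empty list rather than an exception.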

In [14]:
longdf.head()

| | pcode | category |
|---|---|---|
| 0 | M4E | Vegetarian / Vegan Restaurant |
| 1 | M4E | Gastropub |
| 2 | M4E | Indie Movie Theater |
| 3 | M4E | Ice Cream Shop |
| 4 | M4E | Bakery |


In [15]:
dumy=pd.get_dummies(longdf.category,prefix="",prefix_sep="") # spreading the categories into columns
dumy.shape     # check that the number of rows is the same as in longdf

(2436, 279)
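
A toy example makes the effect of `pd.get_dummies` easier to see: every distinct category becomes a column, and each venue row gets a single 1 in the column of its category. The frame `toy` below is made up for illustration.

```python
import pandas as pd

# Hypothetical venue list: one row per venue.
toy = pd.DataFrame(
    {
        "pcode": ["M4E", "M4E", "M4K"],
        "category": ["Bakery", "Pub", "Bakery"],
    }
)

# One column per distinct category, one row per venue.
onehot = pd.get_dummies(toy["category"], prefix="", prefix_sep="")
print(onehot.astype(int))
```

The number of rows is unchanged, and the number of columns equals the number of distinct categories, which is why `dumy` has 2436 rows and 279 columns.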

Adding the postal code to the spread DataFrame and rearranging the columns so that the postal code comes first.

In [16]:
dumy["pcode"]=longdf.pcode
col=[dumy.columns[-1]]+list(dumy.columns[:-1])
dumy=dumy[col]
dumy.to_csv("spread.csv",index=False)
dumy.loc[0:1]

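| | pcode | ATM | Accessories Store | Adult Boutique | Afghan Restaurant | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | ... | Tunnel | University | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Wine Bar | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M4E | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | M4E | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |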


Calculating how frequently each venue category occurs in each neighborhood.
Creating a `top10` DataFrame to display the 10 most common categories in each neighborhood.

In [17]:
dumy_grp=dumy.groupby("pcode").mean().reset_index()   # share of venues per category for each postal code
df_l=[]
for hood in dumy_grp.pcode:
    # transpose the row of this postal code into (venue, freq) pairs
    temp=dumy_grp[dumy_grp['pcode'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:].copy()   # drop the pcode row
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    top=temp.sort_values('freq', ascending=False).reset_index(drop=True).head(10).venue
    df_l.append(list(top))
top10=pd.DataFrame(df_l)
# column labels: 1st, 2nd, 3rd, then 4th to 10th
ind=["st","nd","rd"]
top10.columns=['{}{} Most Common Venue'.format(i+1, ind[i] if i<len(ind) else "th") for i in range(10)]
top10["pcode"]=dumy_grp.pcode
top10.head()

| | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue | pcode |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pub | Bar | Breakfast Spot | Sandwich Place | Gastropub | Health Food Store | Pizza Place | Café | Shoe Store | Salon / Barbershop | M4E |
| 1 | Greek Restaurant | Coffee Shop | Pub | Grocery Store | Café | Fast Food Restaurant | Ice Cream Shop | Restaurant | Caribbean Restaurant | Bookstore | M4K |
| 2 | Indian Restaurant | Sandwich Place | Coffee Shop | Fast Food Restaurant | Café | Park | Gym | Brewery | Fish & Chips Shop | Snack Place | M4L |
| 3 | Café | Bakery | Coffee Shop | Bar | American Restaurant | Italian Restaurant | Sandwich Place | Latin American Restaurant | Pizza Place | Diner | M4M |
| 4 | Construction & Landscaping | Business Service | Bus Line | Swim School | Park | ATM | Neighborhood | Noodle House | Nightclub | New American Restaurant | M4N |
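
The frequency and ranking step can be followed by hand on a small example. The sketch below uses made-up one-hot rows (`toy`) and keeps only the top 2 categories. Taking the mean of a 0/1 column within a postal code gives the share of that category among the venues there.

```python
import pandas as pd

# Hypothetical one-hot rows: one row per venue, as in `dumy`.
toy = pd.DataFrame(
    {
        "pcode":  ["M4E", "M4E", "M4E", "M4E", "M4K", "M4K", "M4K"],
        "Bakery": [1, 0, 0, 0, 1, 0, 0],
        "Pub":    [0, 1, 1, 1, 0, 0, 0],
        "Park":   [0, 0, 0, 0, 0, 1, 1],
    }
)

freq = toy.groupby("pcode").mean()   # share of venues per category
top2 = {
    code: row.sort_values(ascending=False).head(2).index.tolist()
    for code, row in freq.iterrows()
}
print(freq)
print(top2)
```

For `M4E`, three of the four venues are pubs, so `Pub` has a frequency of 0.75 and ranks first. Categories with equal frequencies have no guaranteed order, which matters near the bottom of a top-10 list where many frequencies are tied.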


In [18]:
top10.to_csv("top10Toronto.csv",index=False)

### K-means Clustering

Using the K-means algorithm to find out which neighborhoods share similar venues, based on the frequency of the venue types present in each neighborhood.

In [19]:
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

Running the **K-means** algorithm to cluster the neighborhoods based on the similarity of the venues present in each neighborhood.

In [20]:
kclusters=2   # the number of clusters
X=dumy_grp.drop(columns="pcode")   # keyword form; the positional axis argument was removed in pandas 2.0
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(X)
cl=kmeans.labels_  # cluster allotted to each postal code
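
To get a feel for what K-means does with frequency vectors, here is a minimal sketch on four made-up rows. The column meanings and the values in `X_toy` are hypothetical, and `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency vectors: columns are [Restaurant, Pub, Park, Playground].
X_toy = np.array(
    [
        [0.6, 0.4, 0.0, 0.0],   # food-heavy
        [0.5, 0.4, 0.1, 0.0],   # food-heavy
        [0.0, 0.1, 0.5, 0.4],   # park-heavy
        [0.1, 0.0, 0.6, 0.3],   # park-heavy
    ]
)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
labels = model.labels_
print(labels)                  # the two groups receive different labels
print(model.cluster_centers_)  # mean frequency vector of each cluster
```

The label numbers themselves are arbitrary: which group is called 0 and which is called 1 depends on the initialization, so only the grouping should be interpreted.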

Creating a DataFrame to contain the neighborhood, cluster type and geographical information.

In [21]:
clust_df=dumy_grp[["pcode"]].assign(cluster=list(cl))
clust_df_geo=clust_df.merge(tor_geo,left_on="pcode",right_on="Postal Code").drop(columns="Postal Code")
clust_df_geo.head()

| | pcode | cluster | Latitude | Longitude |
|---|---|---|---|---|
| 0 | M4E | 1 | 43.676357 | -79.293031 |
| 1 | M4K | 1 | 43.679557 | -79.352188 |
| 2 | M4L | 1 | 43.668999 | -79.315572 |
| 3 | M4M | 1 | 43.659526 | -79.340923 |
| 4 | M4N | 0 | 43.72802 | -79.38879 |


### Visualizing Cluster on Map

In [22]:
# create map
map_clusters = folium.Map(location=[clust_df_geo.Latitude.mean(), clust_df_geo.Longitude.mean()], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# add markers to the map (label_id is used so that the labels array cl is not overwritten)
for lat, lon, code, label_id in zip(clust_df_geo['Latitude'], clust_df_geo['Longitude'], clust_df_geo['pcode'], clust_df_geo['cluster']):
    label = folium.Popup(str(code) + ' Cluster ' + str(label_id), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8,
        popup=label,
        color="black",
        fill=True,
        fill_color=rainbow[label_id],   # cluster labels start at 0
        fill_opacity=0.9).add_to(map_clusters)

map_clusters

### Explore Groups

Creating a DataFrame that contains the postal code, the cluster and the top 10 venue categories.

In [23]:
clust_top10=clust_df.merge(top10,on="pcode").groupby("cluster")

In [24]:
for i in range(kclusters):
    temp=clust_top10.get_group(i)
    print("cluster:{}".format(i+1))
    print(temp.head())
    

cluster:1
   pcode  cluster       1st Most Common Venue 2nd Most Common Venue  \
4    M4N        0  Construction & Landscaping      Business Service   
8    M4T        0                        Park         Grocery Store   
10   M4W        0                        Park            Playground   

   3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue  \
4               Bus Line           Swim School                  Park   
8                    Gym                  Bank       Thai Restaurant   
10                 Trail                   ATM           Music Venue   

   6th Most Common Venue 7th Most Common Venue    8th Most Common Venue  \
4                    ATM          Neighborhood             Noodle House   
8           Tennis Court            Playground            Movie Theater   
10          Noodle House             Nightclub  New American Restaurant   

   9th Most Common Venue   10th Most Common Venue  
4              Nightclub  New American Restaurant  
8          

### Result

The neighborhoods are divided into 2 clusters as follows:
* Cluster 1: has more amenities such as parks, tennis courts and ATMs
* Cluster 2: has more restaurants, pubs and cafes
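
One way to back up a summary like this with numbers is to average the category frequencies within each cluster and look at the largest values. The sketch below uses made-up frequencies and labels (`freq`, `labels` and the postal codes `A` to `D` are hypothetical); in this notebook the corresponding inputs would be `dumy_grp` and `clust_df`.

```python
import pandas as pd

# Hypothetical per-postal-code frequencies and cluster labels.
freq = pd.DataFrame(
    {
        "pcode": ["A", "B", "C", "D"],
        "Park": [0.50, 0.40, 0.05, 0.00],
        "Coffee Shop": [0.10, 0.20, 0.45, 0.55],
        "Pub": [0.40, 0.40, 0.50, 0.45],
    }
)
labels = pd.DataFrame({"pcode": ["A", "B", "C", "D"], "cluster": [0, 0, 1, 1]})

# Mean frequency of every category within each cluster.
profile = (
    freq.merge(labels, on="pcode")
    .drop(columns="pcode")
    .groupby("cluster")
    .mean()
)
# Category with the highest mean frequency in each cluster.
print(profile.idxmax(axis=1))
```

Sorting each row of `profile` instead of taking only `idxmax` would give a ranked list of the categories that characterize each cluster.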