<h1><center>ANALYSING TORONTO NEIGHBOURHOOD & CLUSTERING</center></h1>

### Using pandas to import the data from the Wikipedia page given below

In [2]:
import pandas as pd
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

df=pd.read_html(url, header=0)[0]

df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### Taking care of the following three requirements
1. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
2. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
3. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [48]:
import numpy as np  # needed for np.where in step 3

# 1. Ignore cells with a borough that is Not assigned.
df = df[df.Borough != 'Not assigned']

# 2. Combine the neighbourhoods that share a postal code.
df = df.groupby(['Postal Code', 'Borough'], sort=False).agg(', '.join)
df.reset_index(inplace=True)

# 3. Replace 'Not assigned' neighbourhoods with the name of the borough.
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])


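Interestingly, once step 1 has been applied, steps 2 and 3 appear to be redundant for this table: the page seems to already list each postal code on a single row with its neighbourhoods combined (see M5A in the output above).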
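One way to check that observation rather than eyeball it is to count how many rows steps 2 and 3 would actually touch. The sketch below does this on a few hand-made rows (illustrative only, not the scraped data); `redundancy_report` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

# Illustrative rows only; the real frame comes from pd.read_html above.
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': [None, 'Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

def redundancy_report(frame):
    """Count the rows that steps 2 and 3 would actually change."""
    assigned = frame[frame['Borough'] != 'Not assigned']
    missing = assigned['Neighborhood'].isna() | (assigned['Neighborhood'] == 'Not assigned')
    return {
        # Step 2 only matters if a postal code appears on more than one row.
        'duplicate_postal_codes': int(assigned['Postal Code'].duplicated().sum()),
        # Step 3 only matters if a neighbourhood is missing or 'Not assigned'.
        'unnamed_neighborhoods': int(missing.sum()),
    }

report = redundancy_report(sample)
print(report)
```

If both counts are zero on the real frame, steps 2 and 3 leave it unchanged.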

In [47]:
df.shape


(103, 3)

### Importing the geospatial coordinates CSV file from the link provided

In [5]:
# importing csv for Geospatial Coordinates
import pandas as pd
gc=pd.read_csv('http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv')

Checking the data that has been downloaded:

In [6]:
gc.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Merging the two DataFrames on Postal Code

In [7]:
merged_df=pd.merge(df,gc,on=["Postal Code"])

Checking the Output - 

In [8]:
merged_df.head()


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [9]:
merged_df.shape

(103, 5)
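`pd.merge` defaults to an inner join, which silently drops postal codes that have no match in the other frame. The row count of 103 before and after the merge suggests nothing was lost here. The sketch below shows how the same check could be made explicit with the `how`, `indicator` and `validate` parameters of `pd.merge`; the frames are illustrative and `M9Z` is a made-up code with no coordinates.

```python
import pandas as pd

# Illustrative frames; 'M9Z' is a made-up postal code with no coordinates.
left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Example']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M5A'],
                      'Latitude': [43.753259, 43.725882, 43.654260],
                      'Longitude': [-79.329656, -79.315572, -79.360636]})

# how='left' keeps every row of `left`; indicator=True adds a '_merge' column;
# validate='one_to_one' raises if a postal code repeats on either side.
checked = pd.merge(left, right, on='Postal Code', how='left',
                   indicator=True, validate='one_to_one')

# Rows tagged 'left_only' found no coordinates.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```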

### Checking the dataset for the number of boroughs and neighbourhoods

In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(merged_df['Borough'].unique()),
        merged_df.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.
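Note that `merged_df.shape[0]` counts rows, and each row is a postal code that may hold several comma-separated neighbourhoods. A minimal sketch of counting individual neighbourhood names instead, on illustrative rows and assuming `', '` is the separator used throughout:

```python
import pandas as pd

# Illustrative rows in the same layout as merged_df.
rows = pd.DataFrame({
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

# Split each cell on ', ' and flatten to one neighbourhood name per row.
names = rows['Neighborhood'].str.split(', ').explode()

n_rows = rows.shape[0]      # number of postal-code rows
n_names = names.nunique()   # number of distinct neighbourhood names
print(n_rows, n_names)
```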


### Viewing the data on a map of Toronto. For that we first get the latitude and longitude of Toronto, which requires the geopy package


In [11]:
# install the geopy geocoding package
!conda install -c conda-forge geopy --yes 

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [12]:
import folium
from geopy.geocoders import Nominatim
# Get the latitude and longitude of Toronto first
address = 'Toronto,CA'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto,CA are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto,CA are 43.6534817, -79.3839347.
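The `geocode` call can return `None` when the service finds no match, in which case reading `.latitude` would fail. The sketch below guards against that with a fallback. `resolve_coordinates` is a hypothetical helper, and the geocoder is passed in as a plain callable (with a stand-in class) so the example runs without network access.

```python
def resolve_coordinates(geocode, address, fallback):
    """Return (lat, lon) from geocode(address), or `fallback` if nothing is found."""
    location = geocode(address)
    if location is None:
        return fallback
    return (location.latitude, location.longitude)

# Stand-in for a geocoder result, so the sketch needs no network call.
class FakeLocation:
    latitude = 43.6534817
    longitude = -79.3839347

found = resolve_coordinates(lambda addr: FakeLocation(), 'Toronto,CA', (0.0, 0.0))
missing = resolve_coordinates(lambda addr: None, 'Toronto,CA', (43.65, -79.38))
print(found, missing)
```

In the notebook, `geolocator.geocode` would take the place of the lambda.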


### Next we use the folium package to view the boroughs and neighbourhoods on the Toronto map

In [13]:
from IPython.display import display
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(merged_df['Latitude'], merged_df['Longitude'], merged_df['Borough'], merged_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    # parse_html belongs to the Popup, not to the CircleMarker
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
display(map_toronto)

### Now extract only those boroughs whose name contains "Toronto"

In [26]:
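# .copy() makes an independent frame, so adding a column later does not warn about modifying a slice
TorontoBRO = merged_df[merged_df['Borough'].str.contains('Toronto', regex=False)].copy()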
TorontoBRO.shape

(39, 5)

In [27]:
TorontoBRO.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031


### Plotting it again on the map to check

In [29]:
from IPython.display import display
map_toronto1 = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(TorontoBRO['Latitude'], TorontoBRO['Longitude'], TorontoBRO['Borough'], TorontoBRO['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    # parse_html belongs to the Popup, not to the CircleMarker
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto1)
    
display(map_toronto1)

### Using K-means clustering to cluster the neighbourhoods

In [35]:
# K-means clustering on the latitude/longitude columns
from sklearn.cluster import KMeans
k = 5
# Keep only the numeric coordinate columns as features.
TRNTOCLSTR = TorontoBRO.drop(['Postal Code', 'Borough', 'Neighborhood'], axis=1)
kmeans = KMeans(n_clusters=k, random_state=0).fit(TRNTOCLSTR)
# Store each row's cluster label as the first column.
TorontoBRO.insert(0, 'ClusterLabel', kmeans.labels_)

In [36]:
kmeans.labels_

array([0, 0, 0, 0, 1, 0, 0, 3, 0, 2, 0, 3, 1, 0, 3, 1, 0, 1, 4, 4, 4, 4,
       2, 4, 3, 2, 4, 3, 2, 4, 3, 4, 0, 0, 0, 0, 0, 0, 1])
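A quick way to read this array is to count how many postal codes fall into each cluster. The sketch below copies the labels from the output above and tallies them with `np.bincount`.

```python
import numpy as np

# Labels copied from the kmeans.labels_ output above.
labels = np.array([0, 0, 0, 0, 1, 0, 0, 3, 0, 2, 0, 3, 1, 0, 3, 1, 0, 1, 4, 4, 4, 4,
                   2, 4, 3, 2, 4, 3, 2, 4, 3, 4, 0, 0, 0, 0, 0, 0, 1])

# bincount returns how many points carry each label 0..k-1.
sizes = np.bincount(labels, minlength=5)
print(sizes.tolist())
```

The clusters are uneven: cluster 0 holds 16 of the 39 rows, while cluster 2 holds only 4.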

In [50]:
TorontoBRO.head()

Unnamed: 0,ClusterLabel,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,0,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,0,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,1,M4E,East Toronto,The Beaches,43.676357,-79.293031
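K-means measures plain Euclidean distance on the feature columns, so a degree of latitude and a degree of longitude are treated as equal lengths. On the ground, a degree of longitude at Toronto's latitude covers only about cos(43.65°) of the distance of a degree of latitude. The sketch below shows one possible rescaling. It is an assumption about how the features could be prepared, not something the notebook does, and `scale_longitude` is a hypothetical helper.

```python
import math

def scale_longitude(lat_deg, lon_deg, ref_lat_deg):
    """Shrink longitude by cos(reference latitude) so both axes share a ground scale."""
    return (lat_deg, lon_deg * math.cos(math.radians(ref_lat_deg)))

ref_lat = 43.6534817  # Toronto latitude printed earlier in the notebook
factor = math.cos(math.radians(ref_lat))

# Coordinates of the M5A row from the table above.
scaled = scale_longitude(43.65426, -79.360636, ref_lat)
print(round(factor, 3), scaled)
```

For an area as small as these boroughs the effect on cluster assignments may be modest, so treat this as an optional refinement.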


### Finally, plotting the K-means clustering results on the map using folium

In [41]:
# create map
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(TorontoBRO['Latitude'], TorontoBRO['Longitude'], TorontoBRO['Neighborhood'], TorontoBRO['ClusterLabel']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run 0..k-1, so they index the list directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters