# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto.
The neighborhood data is not readily available on the internet in a ready-to-use form, so it has to be scraped and cleaned first. What is interesting about data science is that each project is challenging in its own way, so you need to stay agile and sharpen your ability to learn new libraries and tools quickly as each project demands.

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import os # library for operating system utilities
import requests # library to handle requests
from bs4 import BeautifulSoup # library to extract data from a web page
import csv # library to export the scraped data into a CSV file

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installation of the folium library
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/DSX-Python35

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2018.8.24          |        py35_1001         139 KB  conda-forge
    altair-2.2.2               |           py35_1         462 KB  conda-forge
    ca-certificates-2019.3.9   |       hecc5488_0         146 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    openssl-1.0.2r             |       h14c3975_0         3.1 MB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         4.0 MB

The following NEW packages will

# 1-Download and explore the dataset

We retrieve the page with requests.get, which makes an HTTP/HTTPS request. Once the request completes, we can recover the contents of the page with .content. Finally, we pass this content to BeautifulSoup to parse it.

In [2]:
requete=requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
page=requete.content
soup=BeautifulSoup(page,'lxml') # name the parser explicitly
table=soup.find('table',{'class':'wikitable sortable'})





Wrap the table above in a minimal new HTML page so that its rows are easier to process.

In [3]:
source='<!doctype html><html><head></head><body>'+str(table)+'</body></html>'
soup2=BeautifulSoup(source,'lxml') # name the parser explicitly
ligne=soup2.find_all('tr')





Collect the table rows and skip the header row.

In [4]:
ligne=soup2.find_all('tr')
corp=ligne[1:]  

Export the data to a CSV file.

In [5]:
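# the with block closes the file even if a row fails to parse
with open("donnees.csv","w",newline='') as csv_file:
    csv_writer=csv.writer(csv_file)
    csv_writer.writerow(['Postcode','Borough','Neighborhood'])
    for elem in corp:
        # the row text splits into ['', postcode, borough, neighborhood, ...]
        t=elem.text.split('\n')
        csv_writer.writerow([t[1],t[2],t[3]])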
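
The indices 1 to 3 used when writing each row follow from how the row text splits on newlines. A minimal sketch, assuming the text of one table row looks like the hypothetical `sample_row_text` below (the real page layout may differ and can change over time):

```python
import csv
import io

# Hypothetical text of one <tr>, as elem.text might return it (an assumption)
sample_row_text = "\nM3A\nNorth York\nParkwoods\n"

# Splitting on '\n' yields an empty string first, then the three cell values
fields = sample_row_text.split("\n")
print(fields)

# Write the three cells to an in-memory CSV buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Postcode", "Borough", "Neighborhood"])
writer.writerow(fields[1:4])
csv_text = buffer.getvalue()
print(csv_text)
```

If a row had a different number of line breaks, the same indices would pick the wrong cells, so it is worth checking a few rows by hand before exporting.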

Inspect the exported data.

In [6]:
df=pd.read_csv('donnees.csv')
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [7]:
df=df[df.Borough!='Not assigned']
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
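
The filter above is a boolean mask applied to the dataframe. A small sketch on toy rows (the `toy` frame is illustrative, not the scraped data) shows that the surviving rows keep their original index labels, which is why the output above starts at 2:

```python
import pandas as pd

# Toy rows mirroring the first rows of the exported table
toy = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True where the borough is assigned
mask = toy["Borough"] != "Not assigned"
assigned = toy[mask]
print(assigned)
```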


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma.

In [8]:
df=df.groupby(['Postcode','Borough'])['Neighborhood'].apply(','.join).reset_index()
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
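
To see the grouping step in isolation, here is a sketch on three toy rows (the `toy` frame is illustrative). Rows that share a postcode and borough collapse into one row whose neighborhoods are joined with a comma:

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Group by postcode and borough, then join each group's neighborhood names
combined = (
    toy.groupby(["Postcode", "Borough"])["Neighborhood"]
    .apply(",".join)
    .reset_index()
)
print(combined)
```

Because `groupby` sorts by the group keys by default, the combined frame comes back ordered by postcode, which matches the reordered output above.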


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [9]:
df.loc[df.Neighborhood=='Not assigned','Neighborhood']=df.loc[df.Neighborhood=='Not assigned','Borough']
df.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
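
The same fallback can also be written with `Series.mask`, which replaces values wherever a condition holds. A sketch on two toy rows (illustrative data, not the scraped table):

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Keep the neighborhood where it is assigned; otherwise fall back to the borough
unassigned = toy["Neighborhood"] == "Not assigned"
toy["Neighborhood"] = toy["Neighborhood"].mask(unassigned, toy["Borough"])
print(toy)
```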


Print the number of rows in the dataframe.

In [10]:
df.shape

(103, 3)

# 2-Download the project's geographic data

In [11]:
!wget -q -O 'geodata.csv' http://cocl.us/Geospatial_data

Inspect the geographic data.

In [12]:
df2=pd.read_csv('geodata.csv')
df2.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the two dataframes (df, df2) and store the result in df.

In [13]:
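# match rows on the postal code rather than on row position
df=df.merge(df2,left_on='Postcode',right_on='Postal Code').drop('Postal Code',axis=1)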
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
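
Combining two tables by row position works only when both list the postal codes in the same order. A minimal sketch of matching on the postal code itself, using toy frames whose rows are deliberately shuffled (the frame names `neighborhoods` and `coordinates` are illustrative):

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
})
# Same postal codes, deliberately listed in a different order
coordinates = pd.DataFrame({
    "Postal Code": ["M1E", "M1B", "M1C"],
    "Latitude": [43.763573, 43.806686, 43.784535],
    "Longitude": [-79.188711, -79.194353, -79.160497],
})

# A left merge keeps the row order of the left frame and looks up each key
merged = neighborhoods.merge(
    coordinates, left_on="Postcode", right_on="Postal Code", how="left"
).drop("Postal Code", axis=1)
print(merged)
```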


# 3-Explore and cluster the neighborhoods in Toronto.
#### Boroughs whose name contains the word 'Toronto'

In [14]:
df_toronto=df[df.Borough.str.contains('Toronto')].copy() # copy so columns can be added later
df_toronto.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
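
The selection relies on substring matching, so "East Toronto" and "Downtown Toronto" both qualify while "Scarborough" and "North York" do not. A sketch on toy rows (illustrative data); `regex=False` is an optional choice here that treats the pattern as plain text:

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M1B", "M4E", "M5A", "M3A"],
    "Borough": ["Scarborough", "East Toronto", "Downtown Toronto", "North York"],
})

# True for every borough whose name contains the substring 'Toronto'
in_toronto = toy["Borough"].str.contains("Toronto", regex=False)
toronto_only = toy[in_toronto].copy()
print(toronto_only)
```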


### Use geopy library to get the latitude and longitude values of Toronto

In [15]:
address = 'Toronto'

geolocator = Nominatim(user_agent="tr_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


##### Create a map of Toronto with neighborhoods superimposed on top

In [16]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Cluster Neighborhoods

In [17]:
df_toronto_clustering = df_toronto[['Latitude','Longitude']]
kclusters = 6
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 5, 3, 5, 0, 0, 0, 0, 0, 0], dtype=int32)
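
To make the clustering step less of a black box, here is a simplified sketch of the basic k-means loop (Lloyd's iterations) in NumPy. It is not scikit-learn's implementation, which uses a more careful initialization and stopping rule; the function name `kmeans_sketch` and the toy coordinates are hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Assign each point to its nearest center, then recompute the centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random (fancy indexing copies them)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distances from every point to every center, shape (n, k)
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (leave it if it has none)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Toy (latitude, longitude) pairs forming an eastern and a western group
points = np.array([
    [43.67, -79.29], [43.66, -79.31], [43.67, -79.30],
    [43.65, -79.46], [43.66, -79.48], [43.65, -79.45],
])
labels, centers = kmeans_sketch(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary: which group is called 0 or 1 depends on the random starting centers, just as the numbering of the six clusters above depends on `random_state`.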

Let's add the cluster labels to the dataframe as a new first column.

In [18]:
df_toronto.insert(0, 'Cluster Labels', kmeans.labels_)
df_toronto.head()

Unnamed: 0,Cluster Labels,Postcode,Borough,Neighborhood,Latitude,Longitude
37,3,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,5,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,3,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,5,M4M,East Toronto,Studio District,43.659526,-79.340923
44,0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


Finally, let's visualize the resulting clusters

In [19]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood'], df_toronto['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine Clusters
Now you can examine each cluster. Because the clustering above uses only latitude and longitude, each cluster groups postal codes that lie close together; based on the area a cluster covers, you can then assign a name to it.
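
One way to characterize clusters built from coordinates alone is to summarize each cluster's size and center. A sketch using the five labeled rows shown earlier (the `toy` frame and the summary column names are illustrative; the named-aggregation syntax assumes a reasonably recent pandas):

```python
import pandas as pd

toy = pd.DataFrame({
    "Cluster Labels": [3, 5, 3, 5, 0],
    "Neighborhood": [
        "The Beaches", "The Danforth West,Riverdale",
        "The Beaches West,India Bazaar", "Studio District", "Lawrence Park",
    ],
    "Latitude": [43.676357, 43.679557, 43.668999, 43.659526, 43.72802],
    "Longitude": [-79.293031, -79.352188, -79.315572, -79.340923, -79.38879],
})

# Per cluster: number of postal codes and the mean coordinates
summary = toy.groupby("Cluster Labels").agg(
    count=("Neighborhood", "size"),
    mean_latitude=("Latitude", "mean"),
    mean_longitude=("Longitude", "mean"),
)
print(summary)
```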

### Cluster 1

In [20]:
df_toronto.loc[df_toronto['Cluster Labels'] == 0, df_toronto.columns[1:]]

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049
63,M5N,Central Toronto,Roselawn,43.711695,-79.416936
64,M5P,Central Toronto,"Forest Hill North,Forest Hill West",43.696948,-79.411307


### Cluster 2

In [21]:
df_toronto.loc[df_toronto['Cluster Labels'] == 1, df_toronto.columns[1:]]

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
76,M6H,West Toronto,"Dovercourt Village,Dufferin",43.669005,-79.442259
82,M6P,West Toronto,"High Park,The Junction South",43.661608,-79.464763
83,M6R,West Toronto,"Parkdale,Roncesvalles",43.64896,-79.456325
84,M6S,West Toronto,"Runnymede,Swansea",43.651571,-79.48445


### Cluster 3

In [22]:
df_toronto.loc[df_toronto['Cluster Labels'] == 2, df_toronto.columns[1:]]

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
52,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
54,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
55,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
56,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
57,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
58,M5H,Downtown Toronto,"Adelaide,King,Richmond",43.650571,-79.384568
59,M5J,Downtown Toronto,"Harbourfront East,Toronto Islands,Union Station",43.640816,-79.381752
60,M5K,Downtown Toronto,"Design Exchange,Toronto Dominion Centre",43.647177,-79.381576
61,M5L,Downtown Toronto,"Commerce Court,Victoria Hotel",43.648198,-79.379817
68,M5V,Downtown Toronto,"CN Tower,Bathurst Quay,Island airport,Harbourf...",43.628947,-79.39442


### Cluster 4

In [23]:
df_toronto.loc[df_toronto['Cluster Labels'] == 3, df_toronto.columns[1:]]

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
87,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558


### Cluster 5

In [24]:
df_toronto.loc[df_toronto['Cluster Labels'] == 4, df_toronto.columns[1:]]

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
65,M5R,Central Toronto,"The Annex,North Midtown,Yorkville",43.67271,-79.405678
66,M5S,Downtown Toronto,"Harbord,University of Toronto",43.662696,-79.400049
67,M5T,Downtown Toronto,"Chinatown,Grange Park,Kensington Market",43.653206,-79.400049
75,M6G,Downtown Toronto,Christie,43.669542,-79.422564
77,M6J,West Toronto,"Little Portugal,Trinity",43.647927,-79.41975
78,M6K,West Toronto,"Brockton,Exhibition Place,Parkdale Village",43.636847,-79.428191


### Cluster 6

In [25]:
df_toronto.loc[df_toronto['Cluster Labels'] == 5, df_toronto.columns[1:]]

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
43,M4M,East Toronto,Studio District,43.659526,-79.340923
50,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
51,M4X,Downtown Toronto,"Cabbagetown,St. James Town",43.667967,-79.367675
53,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
