## Wikipedia scrape notebook - Toronto Neighbourhood Clusters

In [5]:
import numpy as np 
import pandas as pd 
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [6]:
wiki_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [7]:
# Query the website and store the response in `page`
page = urlopen(wiki_page)
# Parse the HTML and store the result in `soup`
soup = BeautifulSoup(page, 'html.parser')

Now that the Wikipedia page is parsed and stored in `soup`, we can extract the postal code table and convert it into a dataframe.

In [8]:
# Extract the first table on the page and convert it into a dataframe
table = soup.find_all('table')[0]
df = pd.read_html(str(table))[0]
# Promote the first row to column headers, then drop it from the data
header = df.iloc[0]
df = df[1:]
df = df.rename(columns=header)
df.head(15)

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
10,M8A,Not assigned,Not assigned
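
The header step can be sketched on a tiny stand-in table. This is only an illustration: `raw` and `promoted` are made-up names, and the rows are copied from the output above. Unlike the cell above, it also renumbers the rows from 0.

```python
import pandas as pd

# Toy stand-in for a parsed table whose first row holds the column names.
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
])

# Promote the first row to the header and renumber the remaining rows from 0.
promoted = raw.iloc[1:].reset_index(drop=True)
promoted.columns = raw.iloc[0].tolist()

print(promoted)
```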


Drop rows whose borough is not assigned, replace "Not assigned" neighborhoods with the borough name, and combine rows that share a Postcode into one row.

In [9]:
# Drop rows with no borough; copy so later assignments act on a real dataframe
df = df[df.Borough != 'Not assigned'].copy()
# Where the neighbourhood is missing, fall back to the borough name
mask = df.Neighbourhood == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']
# Combine neighbourhoods that share a postcode into one comma-separated row
df = df.groupby(['Postcode','Borough']).agg({'Neighbourhood':lambda x: ', '.join(x)}).reset_index()
df.head(10)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
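
As a rough sketch, the fill-and-combine step can be tried on three rows taken from the first table. The names `toy` and `combined` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Not assigned"],
})

# A single .loc call writes into `toy` itself; toy[mask]['Neighbourhood'] = ...
# would write into a temporary copy and leave `toy` unchanged.
mask = toy["Neighbourhood"] == "Not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]

# One row per postcode, with the neighbourhood names joined by ", ".
combined = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

The two M5A rows should collapse into one row, and the M7A row should take its borough name as the neighbourhood.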


In [10]:
df.shape

(103, 3)

The dataframe above lists each postal code along with its borough and neighborhood names.

Now let's get the latitude and longitude for the data above.

In [11]:
!pip install geocoder
import geocoder 
!pip install folium
import folium
import geopy
from geopy.geocoders import Nominatim


In [12]:
# Start with float placeholders so the geocoded values keep their decimals
df['Latitude'] = 0.0
df['Longitude'] = 0.0

In [13]:


# Create the geocoder once; the user_agent value is a placeholder name
geolocator = Nominatim(user_agent='toronto_neighbourhood_clusters')
lat_col = df.columns.get_loc('Latitude')
lon_col = df.columns.get_loc('Longitude')
for i in range(0, len(df)):
    # Note: a bare borough name such as 'Scarborough' is ambiguous worldwide
    address = df['Borough'].iloc[i]
    location = geolocator.geocode(address, timeout=100)
    # Write through df.iloc so the value lands in df rather than in a copy
    df.iloc[i, lat_col] = location.latitude
    df.iloc[i, lon_col] = location.longitude


In [14]:
df.head(25)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",54.28476,-0.409034
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",54.28476,-0.409034
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",54.28476,-0.409034
3,M1G,Scarborough,Woburn,54.28476,-0.409034
4,M1H,Scarborough,Cedarbrae,54.28476,-0.409034
5,M1J,Scarborough,Scarborough Village,54.28476,-0.409034
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",54.28476,-0.409034
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",54.28476,-0.409034
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",54.28476,-0.409034
9,M1N,Scarborough,"Birch Cliff, Cliffside West",54.28476,-0.409034


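The dataframe above lists each postal code along with its borough, neighborhood names, and latitude/longitude coordinates. Note that the Scarborough rows show 54.28476, -0.409034, which is the location of Scarborough in North Yorkshire, England, not the Toronto district (Toronto lies near 43.7° N, 79.4° W). Geocoding the bare borough name is ambiguous.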
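
Since the lookup uses only the borough name, every row in a borough gets the same coordinates, and a name like Scarborough can match places outside Toronto. One possible way to handle both points is sketched below: qualify the query and look each distinct borough up only once. The helper names (`qualify`, `geocode_unique`, `fake_lookup`) and the suffix string are assumptions for illustration, and the stand-in lookup makes no network call.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Optional[Tuple[float, float]]

def qualify(borough: str, suffix: str = "Toronto, Ontario, Canada") -> str:
    """Build a less ambiguous query string (the suffix is an assumption)."""
    return "{}, {}".format(borough, suffix)

def geocode_unique(names: Iterable[str],
                   lookup: Callable[[str], Coordinates]) -> Dict[str, Coordinates]:
    """Call `lookup` once per distinct name and cache the answers."""
    cache: Dict[str, Coordinates] = {}
    for name in names:
        if name not in cache:
            cache[name] = lookup(qualify(name))
    return cache

# Stand-in lookup that records the queries it receives.
seen = []

def fake_lookup(query: str) -> Coordinates:
    seen.append(query)
    return (0.0, 0.0)  # placeholder coordinates, not real data

coords = geocode_unique(["Scarborough", "Scarborough", "North York"], fake_lookup)
print(seen)
# → ['Scarborough, Toronto, Ontario, Canada', 'North York, Toronto, Ontario, Canada']
```

With geopy, `lookup` could wrap `geolocator.geocode` and return `(location.latitude, location.longitude)`, or `None` when nothing is found. The cached values could then be mapped back onto the rows with `df['Borough'].map(coords)`.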

In [15]:
#!pip install folium
from sklearn.cluster import KMeans
import folium # map rendering library
import matplotlib.cm as cm
import matplotlib.colors as colors

After importing the necessary libraries, let's begin forming clusters from the data. To begin, compute the mean latitude and longitude to use as the map center.

In [17]:
c_lat = df.Latitude.mean()
c_lon = df.Longitude.mean()
c_lat,c_lon

(45.936112825242766, -62.575821765048545)

In [20]:
kclusters = 5
# Keep only the numeric coordinate columns for clustering
cluster_toronto = df.drop(['Neighbourhood','Borough','Postcode'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cluster_toronto)
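
To get a feel for what `labels_` holds, here is a minimal sketch on four made-up coordinate pairs (not real data) that form two obvious groups. The names `points` and `model` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up coordinates forming two well-separated groups (not real data).
points = np.array([
    [43.80, -79.20],
    [43.81, -79.21],
    [43.65, -79.55],
    [43.66, -79.56],
])

# n_init is set explicitly because its default differs between scikit-learn versions.
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(points)
labels = model.labels_

# Label numbers are arbitrary; only the grouping is meaningful.
print(labels)
print(model.cluster_centers_)
```

Rows with identical coordinates always land in the same cluster, so when coordinates are looked up per borough, the number of distinct points limits how many meaningful clusters can form.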

After defining cluster_toronto and running k-means on it, we need to visualize the clusters on the map. So let's visualize!

In [26]:
df['Cluster Labels'] = kmeans.labels_
map_clusters = folium.Map(location=[c_lat, c_lon], zoom_start=10)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map; labels run from 0 to kclusters-1, so index directly
for lat, lon, poi, cluster in zip(df['Latitude'], df['Longitude'], df['Neighbourhood'], df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' (Cluster: ' + str(cluster) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
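
The color scheme boils down to one hex color string per cluster label. A small sketch of that mapping, where `cluster_palette` is a hypothetical helper name and not part of the notebook:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

def cluster_palette(k):
    """Return k hex colors sampled evenly from the rainbow colormap."""
    rgba = cm.rainbow(np.linspace(0, 1, k))
    return [colors.rgb2hex(c) for c in rgba]

palette = cluster_palette(5)
print(palette)
```

Each entry is a string such as `#rrggbb`, which is the form folium accepts for `color` and `fill_color`.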