<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto City</font></h1>

## Introduction

In this project, we explore, segment, and cluster the neighborhoods in the city of Toronto.

For the Toronto neighborhood data, we rely on a Wikipedia page that lists the postal codes, boroughs, and neighborhoods we need. We scrape the page, clean and wrangle the data, and read it into a pandas dataframe so that it is in a structured format.

We will learn how to convert addresses into their equivalent latitude and longitude values. We will also use the Foursquare API to explore neighborhoods in Toronto. We will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. We will use the *k*-means clustering algorithm to complete this task. Finally, we will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Fetch the Latitude and Longitude of the Postal Codes</a>

3. <a href="#item3">Explore Neighborhoods in Toronto</a>

4. <a href="#item4">Analyze Each Neighborhood</a>

5. <a href="#item5">Cluster Neighborhoods</a>

6. <a href="#item6">Examine Clusters</a>  
</font>
</div>

Before fetching and exploring the data, let's import all the dependencies that we will need.

In [11]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import needs pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

from bs4 import BeautifulSoup # Scraping done easily
from copy import deepcopy # for copying the variables

import geocoder # to fetch the latitude and longitude of a postal code
print('Libraries imported.')

Libraries imported.


# 1.  Download and Explore the Dataset

### Function to extract the table from the Wikipedia link

In [2]:
def extract_table(url):
    # Download the page passed in as `url` and parse it ('lxml' must be installed)
    wiki_text = requests.get(url).text
    soup = BeautifulSoup(wiki_text, 'lxml')
    # soup.find returns the first table carrying these classes
    toronto_table = soup.find('table', {'class': 'wikitable sortable'})
    # One list of non-empty cell strings per table row
    all_rows = [row.getText().split('\n') for row in toronto_table.find_all('tr')]
    all_rows = [[cell for cell in row if cell] for row in all_rows]

    return all_rows

In [3]:
# Extract the table from the Wikipedia link
url_wiki = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
rows = extract_table(url_wiki)
rows[:10]

[['Postcode', 'Borough', 'Neighbourhood'],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned']]
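The row parsing in `extract_table` relies on each table cell ending up on its own line of the row's text. The sketch below replays that split-and-filter step on a few hand-written strings, so it runs without a network connection. The strings are illustrative stand-ins for what `row.getText()` returns, and `split_cells` is a hypothetical helper name, not part of the notebook.

```python
# Toy row text, shaped like the text of one <tr> element
# (hand-written strings, not scraped from the live page).
raw_rows = [
    "\nPostcode\nBorough\nNeighbourhood\n",
    "\nM3A\nNorth York\nParkwoods\n",
    "\nM7A\nQueen's Park\nNot assigned\n",
]


def split_cells(row_text):
    """Split one row's text on newlines and drop the empty fragments."""
    return [cell for cell in row_text.split("\n") if cell]


parsed = [split_cells(text) for text in raw_rows]
header, body = parsed[0], parsed[1:]

# A genuinely empty cell would be filtered out as well and leave its row
# short, so it is worth checking every row against the header length.
short_rows = [row for row in body if len(row) != len(header)]
print(parsed)
print("rows with missing cells:", short_rows)
```

If `short_rows` is ever non-empty on real data, the columns of those rows have shifted and the table needs cell-by-cell parsing instead.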

### Convert the table to a DataFrame

In [4]:
toronto_city = pd.DataFrame(rows[1:],columns = rows[0])
print(toronto_city.shape)
toronto_city.head()

(289, 3)


| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


### Drop the rows in which Borough is Not assigned

In [5]:
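# Assign the result back instead of calling replace(inplace=True) on a column selection
toronto_city['Borough'] = toronto_city['Borough'].replace('Not assigned', np.nan)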
toronto_city.dropna(inplace=True)
toronto_city.reset_index(drop=True,inplace=True)
toronto_city.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
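The same filtering can also be written as a boolean mask, which skips the intermediate `NaN` step. This is a minimal sketch on a toy frame that reuses the notebook's column names; the three rows are illustrative, not the scraped data.

```python
import pandas as pd

# Toy frame with the same columns as toronto_city (values are illustrative).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows whose Borough is assigned, then renumber the index.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Unlike `dropna()`, the mask only looks at the `Borough` column, so rows with missing values elsewhere are kept.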


### If a Neighborhood is not assigned, set it to the Borough name

In [6]:
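# Rows whose neighbourhood is unassigned take the borough name instead.
# A single .loc assignment avoids the chained-indexing pitfall of ['col'].loc[i] = ...
not_assigned = toronto_city['Neighbourhood'] == 'Not assigned'
toronto_city.loc[not_assigned, 'Neighbourhood'] = toronto_city.loc[not_assigned, 'Borough']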

In [7]:
toronto_city.shape

(212, 3)

### Find the postal codes assigned to multiple Neighborhoods and combine them

In [8]:
toronto_city = toronto_city.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
toronto_city.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [9]:
# Alternative: group by Postcode only and keep the first Borough
#toronto_city = toronto_city.groupby('Postcode').agg({'Borough': 'first',
#                                                     'Neighbourhood': ', '.join
#                                                    }).reset_index()
#toronto_city.head(10)

In [10]:
print(toronto_city.shape)

(103, 3)
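To see what the grouping step does, the sketch below runs both the `groupby(...).apply(', '.join)` form and the commented-out `agg` alternative on a three-row toy frame. The rows are taken from the table preview above; the variable names `by_both` and `by_code` are only for illustration. Note that `groupby` sorts by the group key by default, which is why the combined table starts at M1B rather than M3A.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Variant 1: group on both keys and join the neighbourhood names.
by_both = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)

# Variant 2: group on Postcode only and keep the first Borough seen.
by_code = (
    toy.groupby("Postcode")
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
    .reset_index()
)

print(by_both)
# Compare the two results cell by cell.
print(by_both.values.tolist() == by_code.values.tolist())
```

The two variants agree as long as each postal code belongs to a single borough; if one code ever mapped to two boroughs, variant 1 would keep two rows and variant 2 would keep only the first borough.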


# 2.   Get the latitude and longitude of the postal codes

In [14]:
## initialize your variable to None
#lat_lng_coords = None
#
## loop until you get the coordinates
#while(lat_lng_coords is None):
#    g = geocoder.google('{}, Toronto, Ontario'.format('M5G'))
#    lat_lng_coords = g.latlng
#
#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
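The commented-out loop above retries until the geocoder answers, which never terminates if the service keeps returning nothing. A bounded retry is one possible alternative. In this sketch `geocode_with_retry` and its `lookup` argument are hypothetical names: `lookup` stands for any callable that returns `[lat, lng]` or `None`, and the stand-in used here returns made-up coordinates so that the example runs offline.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable returning [lat, lng] or None.
    Returns None if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None


# Stand-in lookup that fails twice and then succeeds (made-up coordinates).
responses = iter([None, None, [43.0, -79.0]])
coords = geocode_with_retry(
    lambda query: next(responses),
    "M5G, Toronto, Ontario",
    delay_seconds=0,
)
print(coords)
```

Because the function returns `None` on failure, the caller has to check the result before indexing into it.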

In [15]:
lat_long = pd.read_csv('Geospatial_Coordinates.csv')
lat_long.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [17]:
lat_long.rename(columns={'Postal Code':'Postcode'},inplace=True)

In [16]:
lat_long.shape

(103, 3)

In [18]:
toronto_city_ll_codes = toronto_city.merge(lat_long,how='inner',on='Postcode')

In [19]:
toronto_city_ll_codes.head(10)

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |


In [20]:
toronto_city_ll_codes.shape

(103, 5)
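An inner join silently drops any postal code that has no match in the coordinates file. Here the row count stays at 103, so nothing was lost, but the check can be made explicit. The sketch below uses a left join with pandas' `indicator` and `validate` options on toy frames; `M9Z` is a made-up postal code added only to show an unmatched row.

```python
import pandas as pd

# Toy inputs: M9Z is a made-up code with no coordinates.
codes = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# indicator=True adds a _merge column saying where each row was found;
# validate raises if a Postcode is duplicated on either side.
checked = codes.merge(coords, how="left", on="Postcode",
                      validate="one_to_one", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

An empty `missing` list means every postal code received coordinates, in which case the left and inner joins give the same rows.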