# Segmenting and Clustering Neighborhoods in Toronto

## Introduction  
In this assignment, I explore, segment, and cluster the neighborhoods of the city of Toronto.

First, I import the libraries I need.

In [1]:
from bs4 import BeautifulSoup
import urllib.request

import numpy as np   # numerical computing
import pandas as pd  # DataFrame data structures

Next, I scrape the Toronto postal code table from a Wikipedia page and load it into a DataFrame.

In [2]:
postcode = []
borough = []
neighborhood = []

with urllib.request.urlopen('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M') as response:
    webpage = response.read()

soup = BeautifulSoup(webpage, 'html.parser')

# The first table has three columns, so cell i belongs to column i % 3:
# 0 = postcode, 1 = borough, 2 = neighborhood.
for i, cell in enumerate(soup.table.find_all('td')):
    if i % 3 == 0:
        postcode.append(cell.text)
    elif i % 3 == 1:
        borough.append(cell.text)
    else:
        neighborhood.append(cell.text.replace('\n', ''))

df = pd.DataFrame(data = {'Postcode': postcode, 'Borough': borough, 'Neighborhood': neighborhood})
df.head(10)

|   | Postcode | Borough | Neighborhood |
|---|----------|---------|--------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |
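
The `i % 3` column assignment used in the scraping cell can be tried offline. The following is a minimal sketch, assuming a three-column table like the one on the Wikipedia page; `SAMPLE_HTML` and `CellCollector` are hypothetical names introduced for illustration, and the standard library's `html.parser` stands in for BeautifulSoup so that the example needs no network access.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the Wikipedia table (assumption: three columns).
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only text inside a <td> is kept; header cells (<th>) are ignored.
        if self._in_td:
            self.cells[-1] += data


collector = CellCollector()
collector.feed(SAMPLE_HTML)
cells = [c.strip() for c in collector.cells]

# Cell i belongs to column i % 3, so slicing with a step of 3 splits the columns.
postcode = cells[0::3]
borough = cells[1::3]
neighborhood = cells[2::3]

print(list(zip(postcode, borough, neighborhood)))
```

Slicing with a step of 3 is equivalent to the `if`/`elif`/`else` chain on `i % 3`; both rely on every row having exactly three cells.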


Next, I clean the data. First, I drop the rows whose borough is "Not assigned".

In [3]:
# Drop rows whose borough is not assigned
df = df[df.Borough != 'Not assigned']
df.reset_index(drop=True, inplace=True)
df.head(10)

|   | Postcode | Borough | Neighborhood |
|---|----------|---------|--------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |


Then I fill in each missing neighborhood name with the name of its borough.

In [4]:
# Replace missing neighborhood names with the borough name
df.loc[df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df['Borough']
df.reset_index(drop=True, inplace=True)
df.head(10)

|   | Postcode | Borough | Neighborhood |
|---|----------|---------|--------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
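
The fill step relies on pandas aligning the right-hand side to the selected rows by index label. A minimal sketch on a hypothetical two-row frame (`sample` and `missing` are illustrative names) makes that explicit by applying the same mask on both sides of the assignment:

```python
import pandas as pd

# Hypothetical sample rows; only the second one lacks a neighborhood name.
sample = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# True where the neighborhood name is missing.
missing = sample["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only.
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]
print(sample)
```

Assigning the whole `Borough` column, as the notebook cell does, gives the same result because the values are matched by index label rather than by position.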


Finally, I combine the neighborhoods that share a postcode into a single comma-separated row.

In [5]:
df = df.groupby(['Postcode','Borough'])['Neighborhood'].apply(lambda x: ', '.join(x))
df = df.to_frame().reset_index()

df.head(10)

|   | Postcode | Borough | Neighborhood |
|---|----------|---------|--------------|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood]], Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
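
The grouping step can be sketched on a small hypothetical frame (`sample` and `grouped` are illustrative names). It uses `agg(", ".join)`, which should behave like the `apply(lambda x: ', '.join(x))` call in the cell above, and it shows why the result comes back ordered by postcode: `groupby` sorts the group keys by default.

```python
import pandas as pd

# Hypothetical sample rows, deliberately not in postcode order.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M1B", "M1B", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "Scarborough", "Scarborough", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park",
                     "Rouge", "Malvern", "Parkwoods"],
})

# One row per (Postcode, Borough) pair, neighborhoods joined with ", ".
grouped = (
    sample.groupby(["Postcode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

Within each group the neighborhoods keep their original row order, so "Rouge, Malvern" is not re-sorted alphabetically.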


In [6]:
df.shape

(103, 3)