# Segmenting and Clustering Neighborhoods in the city of Toronto, Canada

**Before getting the data and exploring it, I import all the dependencies that will be needed.**

In [1]:
import random 
import numpy as np 
import pandas as pd 
import requests

import matplotlib.pyplot as plt 
%matplotlib inline 

from bs4 import BeautifulSoup
from sklearn.cluster import KMeans 
# make_blobs is exposed by sklearn.datasets; the samples_generator module was removed in newer scikit-learn releases
from sklearn.datasets import make_blobs

import bs4 as bs
import urllib.request

print('Libraries imported.')

Libraries imported.




**I used the BeautifulSoup package to scrape and parse the Wikipedia article.**

In [8]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Download the page and parse the HTML
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source, 'html.parser')

# The postal code table is assumed to be the first table on the page
table = soup.find('table')
table_rows = table.find_all('tr')

# Collect the non-empty cell texts of each row; rows without <td> cells (headers) are skipped
l = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        l.append(row)
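To see what the loop above produces without touching the network, the same row-extraction logic can be run on a small inline HTML snippet. This is only a sketch: the sample table is made up for illustration, and the helper name `table_to_rows` is hypothetical rather than part of the notebook.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics the layout of the Wikipedia table (assumption).
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def table_to_rows(html):
    """Return the non-empty <td> texts of each row in the first table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("table").find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td") if td.text.strip()]
        if cells:  # the header row has only <th> cells, so it yields nothing
            rows.append(cells)
    return rows


rows = table_to_rows(SAMPLE_HTML)
print(rows)
```

Each inner list has one entry per non-empty cell, which is why it can be passed straight to `pd.DataFrame` with three column names.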

**I constructed a dataframe consisting of 3 columns (PostalCode, Borough, Neighbourhood).**

In [3]:
df = pd.DataFrame(l, columns=["PostalCode", "Borough", "Neighbourhood"])
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge


**I processed the "Borough" cells and dropped the rows whose borough is "Not assigned".**

In [10]:
df = df[df.Borough != 'Not assigned']
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West
9,M1N,Scarborough,Birch Cliff / Cliffside West


**I combined rows that share a postal code, because one postal code can cover several neighbourhoods.**

In [5]:
df = df.groupby(['PostalCode', 'Borough']).agg(', '.join)
df = df.reset_index()
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West
9,M1N,Scarborough,Birch Cliff / Cliffside West
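The effect of the `groupby` step is easier to see on rows that actually repeat a postal code. The sketch below uses a made-up three-row frame (not the scraped data) and selects the `Neighbourhood` column explicitly before joining.

```python
import pandas as pd

# Toy rows (made up): postal code M5A appears twice with different neighbourhoods.
toy = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M1B"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
        "Neighbourhood": ["Regent Park", "Harbourfront", "Malvern"],
    }
)

# Join the neighbourhood names of each (PostalCode, Borough) group with ", ".
merged = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

Because `groupby` sorts by the group keys by default, the result comes back ordered by postal code, with one row per code.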


**Final cleanup: rows that have a borough but a "Not assigned" neighbourhood get a neighbourhood value filled in.**

In [6]:
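# Hard-coded fill value: this equals "same as the borough" only when the affected row's borough is Queen's Park
df.loc[df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = "Queen's Park"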
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West
9,M1N,Scarborough,Birch Cliff / Cliffside West
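A more general way to fill a "Not assigned" neighbourhood is to copy the value from the row's own borough instead of typing it by hand. The following is a minimal sketch on made-up rows, assuming the placeholder text is exactly `Not assigned`.

```python
import pandas as pd

# Toy rows (made up): the first row has a borough but no neighbourhood.
toy = pd.DataFrame(
    {
        "PostalCode": ["M7A", "M9A"],
        "Borough": ["Queen's Park", "Etobicoke"],
        "Neighbourhood": ["Not assigned", "Islington Avenue"],
    }
)

# Select the rows to fix, then copy each row's borough into its neighbourhood.
mask = toy["Neighbourhood"] == "Not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy)
```

With this form the fill value follows the data, so it keeps working if a different borough ever has a missing neighbourhood.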


<b>What have I done so far in the data cleaning above?</b>
1. Imported and scraped the Wikipedia page using the BeautifulSoup package.
2. Built a dataframe with three columns: PostalCode, Borough, and Neighbourhood.
3. Kept only the cells that have an assigned borough and ignored cells whose borough is Not assigned.
4. Combined rows with the same PostalCode and Borough into one row, with the neighbourhoods separated by commas.
5. Where a cell has a borough but a Not assigned neighbourhood, set the neighbourhood to the same value as the borough.

Here I used the .shape attribute to print the number of rows and columns in my dataframe.

In [9]:
df.shape

(103, 3)
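Beyond the row count, a couple of quick checks can confirm that the cleaning rules held. The sketch below runs on a made-up two-row frame, and the helper `check_clean` is hypothetical; it is not part of the original notebook.

```python
import pandas as pd


def check_clean(frame):
    """Raise AssertionError if the frame violates the cleaning rules."""
    # Every postal code should appear exactly once after the groupby step.
    assert frame["PostalCode"].is_unique, "duplicate postal codes"
    # No placeholder values should remain in either text column.
    for col in ("Borough", "Neighbourhood"):
        assert not (frame[col] == "Not assigned").any(), f"unassigned {col}"
    return frame.shape


# Toy frame (made up) standing in for the cleaned dataframe.
toy = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M3A"],
        "Borough": ["Scarborough", "North York"],
        "Neighbourhood": ["Malvern / Rouge", "Parkwoods"],
    }
)

shape = check_clean(toy)
print(shape)
```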

## This is the first part of my Final Assignment