### Segmenting and Clustering Neighborhoods in Toronto

#### Scraping the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the data and transform it into a pandas dataframe like the one shown below:

![alt text](https://i.ibb.co/wgBDk3Y/7-JXaz3-NNEei-Mw-Ape4i-f-Lg-40e690ae0e927abda2d4bde7d94ed133-Screen-Shot-2018-06-18-at-7-17-57-PM.png "Dataframe")

In [1]:
# importing required libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
print('libraries are imported')

libraries are imported


In [2]:
# gathering data from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
print('data is gathered')

data is gathered
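
The request above assumes the page always loads. As a minimal sketch, the download could be wrapped in a small helper that fails loudly on HTTP errors and does not hang indefinitely. The name `fetch_html` and the 10-second default timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on HTTP errors or timeouts."""
    # timeout (in seconds) is an illustrative default, not a value from the notebook
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError when the server answers with a 4xx or 5xx status
    response.raise_for_status()
    return response.text
```

With this helper, the cell above would become `source = fetch_html(url)`.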


In [3]:
# parsing the page HTML with the lxml parser
soup = BeautifulSoup(source, 'lxml')

#### Transforming the data into a Pandas Dataframe

In [4]:
# defining the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighbourhood'] 

# instantiating the dataframe
neighbourhoods = pd.DataFrame(columns=column_names)

In [5]:
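# loop to fill the dataframe from the first table on the page
table = soup.find('table')
for tr_table in table.find_all('tr'):
    # collecting the stripped text of every data cell in this row
    raw_data = []
    for td_table in tr_table.find_all('td'):
        raw_data.append(td_table.text.strip())
    # keeping only complete rows; rows without three <td> cells are skipped
    if len(raw_data) == 3:
        neighbourhoods.loc[len(neighbourhoods)] = raw_data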
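
Appending to a dataframe one row at a time works, but it can be slow on larger tables. A possible alternative, sketched here under a few assumptions, collects the rows in a plain list and builds the dataframe once. The names `table_to_dataframe` and `sample_html` are hypothetical, the HTML is a tiny hand-written stand-in for the Wikipedia table, and `'html.parser'` is used instead of `'lxml'` only so the sketch has no extra parser dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# A tiny hand-written table standing in for the Wikipedia page (illustrative only).
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods</td></tr>
</table>
"""


def table_to_dataframe(html, columns):
    """Parse the first <table> in html into a dataframe with the given columns."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.text.strip() for td in tr.find_all('td')]
        # the header row uses <th>, so it yields no <td> cells and is skipped
        if len(cells) == len(columns):
            rows.append(cells)
    # building the dataframe once, after all rows are collected
    return pd.DataFrame(rows, columns=columns)


sample_df = table_to_dataframe(sample_html, ['PostalCode', 'Borough', 'Neighbourhood'])
```

On the real page, the call would be `table_to_dataframe(source, column_names)`, assuming the postal-code table is still the first table in the document.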


In [6]:
neighbourhoods

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


#### Data Cleansing

In [7]:
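# removing rows whose borough is 'Not assigned'
neighbourhoods = neighbourhoods[neighbourhoods.Borough != 'Not assigned']
neighbourhoods = neighbourhoods[neighbourhoods.Borough != 0]
neighbourhoods.reset_index(drop=True, inplace=True)

# a neighbourhood that is 'Not assigned' takes the name of its borough;
# a single .loc assignment is used because chained indexing (iloc[i][2] = ...)
# may write to a temporary copy and leave the dataframe unchanged
mask = neighbourhoods.Neighbourhood == 'Not assigned'
neighbourhoods.loc[mask, 'Neighbourhood'] = neighbourhoods.loc[mask, 'Borough']

# combining neighbourhoods that share a postal code into one comma-separated row
df = neighbourhoods.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df.head()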


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
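
To make the three cleaning rules concrete, here is a small worked sketch on made-up rows: drop rows with an unassigned borough, let an unassigned neighbourhood take its borough's name, and merge rows that share a postal code. The frame `toy` and the helper `clean` are hypothetical names, and the rows are invented for illustration rather than taken from the scraped table.

```python
import pandas as pd

# Invented rows covering each cleaning rule (not real scraped data).
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})


def clean(frame):
    """Apply the three cleaning rules and return one row per postal code."""
    # rule 1: drop rows whose borough is 'Not assigned'
    kept = frame[frame['Borough'] != 'Not assigned'].copy()
    # rule 2: an unassigned neighbourhood takes the name of its borough
    unnamed = kept['Neighbourhood'] == 'Not assigned'
    kept.loc[unnamed, 'Neighbourhood'] = kept.loc[unnamed, 'Borough']
    # rule 3: join neighbourhoods that share a postal code with ', '
    return (kept.groupby(['PostalCode', 'Borough'])['Neighbourhood']
                .agg(', '.join)
                .reset_index())


cleaned = clean(toy)
```

Here `cleaned` holds two rows: `M5A` with `Harbourfront, Regent Park`, and `M7A` with `Queen's Park` as both borough and neighbourhood.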


In [8]:
df

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |


In [9]:
df.loc[df.Borough =='Not assigned']

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|


In [10]:
# dropping any remaining rows with an unassigned borough (none are left at this point)
df = df[df['Borough'] != 'Not assigned']

In [11]:
df.loc[df.Neighbourhood =='Not assigned']

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|


In [12]:
# keeping only rows with an assigned postal code
df2 = df[df.PostalCode != 'Not assigned']

In [13]:
df2

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |


In [14]:
print('Dataframe shape is', df2.shape)

Dataframe shape is (103, 3)
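
The manual checks in the cells above could also be gathered into one reusable function. This is only a sketch: `check_clean` and the example frame `example` are hypothetical names, and the rules it tests (unique postal codes, no `'Not assigned'` values left) are the ones this section's cleaning steps aim for.

```python
import pandas as pd


def check_clean(frame):
    """Return a list of problems found in a cleaned frame (empty if none)."""
    problems = []
    # after grouping, each postal code should appear exactly once
    if frame['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    # no column should still contain the 'Not assigned' placeholder
    for column in ['PostalCode', 'Borough', 'Neighbourhood']:
        if (frame[column] == 'Not assigned').any():
            problems.append(f"'Not assigned' left in {column}")
    return problems


# Small example frame shaped like the cleaned output (values copied from the rows shown above).
example = pd.DataFrame({
    'PostalCode': ['M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Malvern, Rouge', 'Woburn'],
})
print(check_clean(example))
```

Calling `check_clean(df2)` on the final dataframe would be expected to return an empty list if the cleaning steps behaved as intended.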
