<h1 align="center" style="font-weight:bold;">Exploring, Segmenting, and Clustering Neighborhoods in Toronto</h1>

<h3 align="Justify" style="font-weight:bold;">Introduction</h3>

<p>In this assignment I am required to explore, segment, and cluster the neighborhoods in the city of Toronto.</p>

I use this notebook to build the code that scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts the data from its table of postal codes, and transforms that data into a pandas DataFrame.

Before getting the data and exploring it, let's import all the dependencies that we will need.

In [2]:
import numpy as np #library to handle data in a vectorized manner
import pandas as pd #library for data analysis
import requests as rqt #library to handle requests 




In [2]:
pip install BeautifulSoup4

Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
Collecting soupsieve>1.2 (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/6f/8f/457f4a5390eeae1cc3aeab89deb7724c965be841ffca6cfca9197482e470/soupsieve-2.0.1-py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.1 soupsieve-2.0.1
Note: you may need to restart the kernel to use updated packages.


<h3 style="font-weight:bold;">Download and Explore Dataset</h3> 

In [3]:
url_wiki = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_rst = rqt.get(url_wiki).text
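
The request above does not check whether the page was actually returned. The following is a minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither is part of the original notebook), that raises an error on a failed response instead of passing an error page on to the parser:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the 10-second default are illustrative assumptions,
    not part of the original notebook.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

With this helper, the cell above could be written as `wiki_rst = fetch_html(url_wiki)`.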



In [7]:
#Import BeautifulSoup to pull data out of the HTML page
from bs4 import BeautifulSoup
soup = BeautifulSoup(wiki_rst,'html.parser')

#Extract only the postal code table from the page
table = soup.find('table',attrs={'class':'wikitable sortable'})


In [24]:
print(table.tr.text)


Postal Code

Borough

Neighbourhood



<h4 style='font-weight:bold;'>Extracting Data Text from the Table</h4>


In [23]:
#Extract the column names from the header cells
headers = table.find_all('th')
for k, head in enumerate(headers):
    headers[k] = str(head).replace("<th>","").replace("</th>","").replace("\n","")

#Get the data rows, skipping the header row
rows = table.find_all('tr')
rows = rows[1:]

#Strip the opening and closing row markup so only the cell separators remain
for j, row in enumerate(rows):
    rows[j] = str(row).replace("\n</td></tr>","").replace("<tr>\n<td>","")

#Build the DataFrame by splitting each row into its three cells
df = pd.DataFrame(rows)
df[headers] = df[0].str.split("</td>\n<td>", n = 2, expand = True)
df.drop(columns=[0],inplace=True)
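
The string replacements above depend on the exact markup of the page. As an alternative, here is a minimal sketch that reads the visible text of each cell with BeautifulSoup's `get_text(strip=True)`. The inline `sample_html` is a made-up stand-in for the Wikipedia table, and the names `sample_table` and `sample_df` are illustrative only:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia table; the rows are sample values only.
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td><a title="North York">North York</a></td><td>Parkwoods</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

# get_text(strip=True) returns a cell's visible text without tags or newlines
columns = [th.get_text(strip=True) for th in sample_table.find_all("th")]
records = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in sample_table.find_all("tr")[1:]  # skip the header row
]

sample_df = pd.DataFrame(records, columns=columns)
print(sample_df)
```

Because the tags and trailing newlines never reach the DataFrame, this approach would not need the later title extraction and newline stripping steps. It is shown here only on sample data, not on the live page.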
   


In [16]:
#Ignore cells whose borough is "Not assigned"
#(values that still end in "\n" do not match here; the filter is applied again once the newline is stripped)
df = df.drop(df[(df.Borough == "Not assigned")].index)

#If a cell has a borough but a "Not assigned" neighbourhood, use the borough as the neighbourhood
df["Neighbourhood"] = df["Neighbourhood"].mask(df["Neighbourhood"] == "Not assigned", df["Borough"])

#Copy the Borough value to Neighbourhood where it is NaN
df["Neighbourhood"] = df["Neighbourhood"].fillna(df["Borough"])

#Eliminate duplicate rows from the DataFrame
df = df.drop_duplicates()

#Show the shape (rows, columns) of the DataFrame
df.shape

(180, 3)
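
The cleaning rules above can be checked on a small example. This is a minimal sketch on a made-up frame named `toy` (an assumption for illustration, not the scraped data). It strips whitespace first so that values such as `"Not assigned\n"` would also match the filter:

```python
import numpy as np
import pandas as pd

# Made-up sample rows covering each cleaning rule
toy = pd.DataFrame({
    "Borough": ["Not assigned\n", "North York", "Etobicoke", "Scarborough"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned", np.nan],
})

# Strip surrounding whitespace so trailing newlines do not defeat the comparisons
toy = toy.apply(lambda col: col.str.strip())

# Rule 1: drop rows whose borough is not assigned
toy = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: a missing or "Not assigned" neighbourhood takes the borough's name
unassigned = toy["Neighbourhood"].isna() | (toy["Neighbourhood"] == "Not assigned")
toy["Neighbourhood"] = toy["Neighbourhood"].mask(unassigned, toy["Borough"])

# Rule 3: remove exact duplicate rows
toy = toy.drop_duplicates()
print(toy)
```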

<h4 style='font-weight:bold'>Extracting Titles from Columns</h4>
    

In [12]:
df.update(
    df.Neighbourhood.loc[
        lambda t: t.str.contains('title')
    ].str.extract('title=\"([^\"]*)',expand=False))

df.update(
    df.Borough.loc[
        lambda t: t.str.contains('title')
    ].str.extract('title=\"([^\"]*)',expand=False))
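
To illustrate what the `title` extraction does, here is a minimal sketch on two made-up cell strings (the `cells` series is sample data, not taken from the page). `str.extract` returns the first capture group, so a linked cell is reduced to the value of its `title` attribute, while a plain-text cell yields NaN and keeps its original text:

```python
import pandas as pd

# One linked cell and one plain-text cell (sample values only)
cells = pd.Series([
    '<a href="/wiki/North_York" title="North York">North York</a>',
    "Parkwoods",
])

# Capture everything between title=" and the next double quote
titles = cells.str.extract(r'title="([^"]*)', expand=False)

# Cells without a title attribute yield NaN, so fall back to the original text
cleaned = titles.fillna(cells)
print(cleaned.tolist())
```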

In [22]:
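#Remove the "Toronto" suffix from Neighbourhood values
#regex=False treats the pattern as literal text, so the parentheses need no escaping
df.update(
    df.Neighbourhood.loc[
        lambda x: x.str.contains('Toronto')
    ].str.replace(", Toronto","",regex=False))
df.update(
    df.Neighbourhood.loc[
        lambda x: x.str.contains('Toronto')
    ].str.replace("(Toronto)","",regex=False))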
df


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods
3,M4A\n,North York\n,Victoria Village
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z\n,Not assigned\n,Not assigned\n
176,M6Z\n,Not assigned\n,Not assigned\n
177,M7Z\n,Not assigned\n,Not assigned\n
178,M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor,..."


In [47]:
#Rename the Postal Code Column to PostalCode
dfRen = df.rename(columns={'Postal Code':'PostalCode','Neighbourhood':'Neighborhood'},inplace=False)

#Combine the neighborhoods that share a postal code into one comma-separated row
dfNew = pd.DataFrame({'PostalCode':dfRen.PostalCode.unique()})
dfNew['Borough']=pd.DataFrame(list(set(dfRen['Borough'].loc[dfRen['PostalCode'] == x['PostalCode']])) for i, x in dfNew.iterrows())
dfNew['Neighborhood']=pd.Series(list(set(dfRen['Neighborhood'].loc[dfRen['PostalCode'] == x['PostalCode']])) for i, x in dfNew.iterrows())
dfNew['Neighborhood']=dfNew['Neighborhood'].apply(lambda x: ', '.join(x))

#Strip the trailing newline from PostalCode and Borough, then drop unassigned boroughs
dfNew['PostalCode'] = dfNew['PostalCode'].map(lambda x: x.rstrip('\n'))
dfNew['Borough'] = dfNew['Borough'].map(lambda x: x.rstrip('\n'))
dfNew = dfNew.drop(dfNew[(dfNew.Borough == "Not assigned")].index)
dfNew.head(10)


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [45]:
dfNew.shape

(103, 3)
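
The combining step can also be expressed with `groupby`. This is a minimal sketch on a made-up frame named `toy` (illustrative sample rows, not the scraped data). It assumes each postal code belongs to a single borough, and it keeps the neighborhoods in their original order rather than passing them through a `set`:

```python
import pandas as pd

# Two sample rows share the postal code M5A
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# One row per (PostalCode, Borough); neighborhoods joined with ", "
combined = (
    toy.groupby(["PostalCode", "Borough"], sort=False, as_index=False)
       .agg(Neighborhood=("Neighborhood", ", ".join))
)
print(combined)
```

Unlike the `iterrows` version, this does not rescan the whole frame for every postal code.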

<h3 style="font-weight:bold;">Getting the Latitude and Longitude Coordinates of Each Neighborhood</h3>

In [53]:
#Read the geospatial data from a CSV file
dfG = pd.read_csv("http://cocl.us/Geospatial_data")
dfG.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

#Inner join on the shared PostalCode column
#(the earlier set_index calls discarded their result, so they had no effect and are omitted)
geoData = pd.merge(dfNew, dfG, on="PostalCode")
geoData.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
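
An inner merge silently drops any postal code that has no coordinates. The following minimal sketch shows one way to check for that, using the `how`, `validate`, and `indicator` parameters of `pd.merge`. The frames `left` and `right` are made-up samples, and `M9Z` with `Sample Borough` is a deliberately unmatched, invented row:

```python
import pandas as pd

# Sample neighborhood rows; "M9Z" is an invented code with no coordinates
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Sample Borough"],
})
right = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every neighborhood row; validate raises if a code repeats;
# indicator=True adds a "_merge" column recording where each row came from
checked = pd.merge(left, right, on="PostalCode", how="left",
                   validate="one_to_one", indicator=True)

# Postal codes that found no match in the coordinate data
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

An empty `missing` list would mean every postal code received coordinates.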
