# Collecting the Neighborhoods in Toronto

All the boroughs and neighborhoods of Toronto are listed on [this Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). The page can be scraped to collect this information.

Importing the necessary libraries.

In [1]:
import pandas as pd 
import requests
from bs4 import BeautifulSoup
import re

Fetching the page at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, which contains the required information in a table.

In [2]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
print("Fetching Data From:\n",url,"\n========================")
source=requests.get(url).text
print("Completed")

Fetching Data From:
 https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 
Completed
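
The request above has no timeout or status check, so a slow or failed response could go unnoticed. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html`, the 10-second timeout, and the `User-Agent` string are illustrative assumptions and not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    Hypothetical helper: the timeout and User-Agent values are assumptions.
    """
    headers = {"User-Agent": "toronto-neighborhoods-notebook/0.1"}  # illustrative value
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, `source = fetch_html(url)` could stand in for `requests.get(url).text`.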


Converting the received HTML content into a soup object from the `bs4` package for easier extraction of the necessary information, then finding the table that contains the postal codes and neighborhoods of Toronto.

In [3]:
soup=BeautifulSoup(source,"html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
#print(soup.prettify())
table=soup.find("table",class_="wikitable sortable") 
#print(table.prettify())
rows=table.find_all("tr")
#print(rows)
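
To see what `find` and `find_all` return without hitting the network, the same calls can be tried on a small inline table. The HTML string below is made up for illustration and only mimics the structure of the Wikipedia table.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure of the Wikipedia table (assumption).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")
sample_rows = sample_table.find_all("tr")

print(len(sample_rows))                    # number of <tr> elements, header included
print(len(sample_rows[0].find_all("td")))  # the header row has <th> cells only
print(len(sample_rows[1].find_all("td")))  # a data row has one <td> per column
```

The header row contributes a `<tr>` but no `<td>` cells, which is why it shows up as an empty entry when the rows are turned into lists.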

By iterating through all the rows of the table, the content can be accessed and stored as a list. A DataFrame is then created from this list, containing the postal code, borough, and neighborhood of each entry.

In [4]:
l=[]
for row in rows:
    td=row.find_all("td")
    r=[d.text for d in td]
    l.append(r)
df=pd.DataFrame(l)


Because the first row holds the column headers in `<th>` tags rather than `<td>` tags, an empty list is returned for it and stored in the list. Deleting the first row and naming the columns of the DataFrame.

In [5]:
df.drop(0,inplace=True)  # as the first row contains the header information
df_col=["PostalCode","Borough","Neighborhood"]
df.columns=df_col
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 1 | M1A | Not assigned | Not assigned\n |
| 2 | M2A | Not assigned | Not assigned\n |
| 3 | M3A | North York | Parkwoods\n |
| 4 | M4A | North York | Victoria Village\n |
| 5 | M5A | Downtown Toronto | Harbourfront\n |
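
As an alternative to dropping the first row afterwards, rows without `<td>` cells can be skipped while the list is built, and the column names passed straight to the constructor. This is a sketch of that variant, not the approach used above. The `sample_cells` list is made-up data standing in for the text extracted from each `<tr>`.

```python
import pandas as pd

# Made-up stand-in for the per-row cell text; the first entry mimics the
# header row, which has no <td> cells and therefore yields an empty list.
sample_cells = [
    [],
    ["M1A", "Not assigned", "Not assigned\n"],
    ["M3A", "North York", "Parkwoods\n"],
]

# Keep only rows that actually contain data cells.
data_rows = [cells for cells in sample_cells if cells]

sample_df = pd.DataFrame(data_rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df.shape)
```

One difference to keep in mind: with this variant the index starts at 0, whereas dropping row 0 afterwards leaves an index that starts at 1.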


In the `Neighborhood` column, each value ends with `\n`, which needs to be removed.
All rows whose `Borough` is `Not assigned` must be deleted.
If the `Neighborhood` is `Not assigned`, the borough name is used as the neighborhood.

In [6]:
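df["Neighborhood"]=df["Neighborhood"].str.rstrip("\n")  # drop the trailing newline; no regex needed
df=df[df.Borough!="Not assigned"].copy()  # .copy() avoids SettingWithCopyWarning on the assignment below
mask=df.Neighborhood=="Not assigned"
df.loc[mask,"Neighborhood"]=df.loc[mask,"Borough"]  # fall back to the borough name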
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
| 6 | M5A | Downtown Toronto | Regent Park |
| 7 | M6A | North York | Lawrence Heights |
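
The three cleaning rules are easier to check on a tiny frame that contains one row for each case. The sketch below uses illustrative rows and the hypothetical names `raw`, `clean`, and `unassigned`. It strips the newline with `str.rstrip` rather than a regular expression, which is an assumption about the simplest equivalent and not necessarily the notebook's exact call.

```python
import pandas as pd

# Illustrative rows covering the three cases the cleaning has to handle.
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M7A"],
        "Borough": ["Not assigned", "North York", "Queen's Park"],
        "Neighborhood": ["Not assigned\n", "Parkwoods\n", "Not assigned\n"],
    }
)

clean = raw.copy()
# 1. Strip the trailing newline from every neighborhood.
clean["Neighborhood"] = clean["Neighborhood"].str.rstrip("\n")
# 2. Drop rows whose borough is unassigned.
clean = clean[clean["Borough"] != "Not assigned"].copy()
# 3. Where the neighborhood is unassigned, fall back to the borough name.
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

print(clean.to_string(index=False))
```

The order matters: the newline has to be stripped first, otherwise the comparison with `"Not assigned"` in step 3 would never match.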


Each postal code can contain multiple neighborhoods, so all neighborhoods that share a postal code are combined into a single line separated by `, `.
As a result, there are no duplicates in `PostalCode`, i.e. each postal code occurs only once in the dataset.

In [7]:
cln_df=df.groupby(["PostalCode","Borough"])['Neighborhood'].apply(lambda x:", ".join(x)).reset_index()
cln_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
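
The grouping step can be tried in isolation on a few rows. The sketch below uses the `M5A` and `M6A` rows shown earlier and the hypothetical names `sample` and `merged`. It passes `", ".join` directly to `agg`, which is assumed here to be equivalent to the `apply` with a lambda used above.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M6A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
    }
)

merged = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)  # join all neighborhoods of a group into one string
    .reset_index()   # turn the group keys back into ordinary columns
)
print(merged)
```

The two `M5A` rows collapse into a single row whose `Neighborhood` is `Harbourfront, Regent Park`, while `M6A` is left with its single neighborhood.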


In [8]:
cln_df.shape
print("There are {} rows in the Dataset.".format(cln_df.shape[0]))

There are 103 rows in the Dataset.
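
Beyond the row count, the claim that each postal code occurs only once can be checked directly. The sketch below is one way to do that; the helper `has_unique_postal_codes` and the two small frames are hypothetical and only illustrate the check.

```python
import pandas as pd


def has_unique_postal_codes(frame):
    """Return True if no value in the PostalCode column is repeated."""
    return frame["PostalCode"].is_unique


# Made-up frames: one with a repeated postal code, one without.
before = pd.DataFrame({"PostalCode": ["M5A", "M5A", "M6A"]})
after = pd.DataFrame({"PostalCode": ["M5A", "M6A"]})

print(has_unique_postal_codes(before))
print(has_unique_postal_codes(after))
```

In the notebook, `has_unique_postal_codes(cln_df)` would be expected to return `True` if the grouping worked as described.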
