<h1>Segmenting and Clustering Neighborhoods in Toronto | Part 1</h1>

<h3>Import Libraries</h3>

In [1]:
import pandas as pd      # Data analysis
import requests          # HTTP requests

from bs4 import BeautifulSoup  # HTML parsing for web scraping

print('Libraries imported!')

Libraries imported!


<h3>Scrape data from the Wikipedia page into a DataFrame</h3>

In [2]:
# Create URL
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050"

# Send GET request and obtain data in text format
wiki_data = requests.get(url).text

# Parse the data from the html into a beautifulsoup object
soup = BeautifulSoup(wiki_data, 'html.parser')

In [3]:
# Create three empty lists to store table data
postalcode = []
borough = []
neighborhood = []

# Find the first table on the page and collect its rows;
# the next cell walks the same rows and extracts the cell values
table_rows = soup.find('table').find_all('tr')

In [4]:
# For each table row, append the cell values to the respective lists
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if cells:  # the header row has no <td> cells, so it is skipped
        postalcode.append(cells[0].text.strip())
        borough.append(cells[1].text.strip())
        neighborhood.append(cells[2].text.strip())  # drop the trailing newline in the neighborhood cell
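
The row-extraction loop can be tried without a network call. The following is a minimal sketch that assumes a small inline HTML fragment (`sample_html`, a made-up stand-in shaped like the Wikipedia table) and uses `get_text(strip=True)` to trim whitespace; the names `sample_html`, `sample_soup` and `sample_rows` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the Wikipedia table (illustration only)
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for tr in sample_soup.find("table").find_all("tr"):
    tds = tr.find_all("td")
    if tds:  # the header row contains only <th> cells, so it is skipped
        sample_rows.append([td.get_text(strip=True) for td in tds])

print(sample_rows)
```

Each inner list holds the postal code, borough and neighborhood for one table row, with the trailing newline removed.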

In [5]:
# create a new DataFrame from the above three lists
toronto_df = pd.DataFrame({"PostalCode": postalcode,
                           "Borough": borough,
                           "Neighborhood": neighborhood})

toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [6]:
toronto_df.shape

(287, 3)

<h4>Removing all the rows with 'Not assigned' value in "Borough" Column</h4>

In [7]:
# Keep only the rows whose "Borough" is assigned, then renumber the index from 0
toronto_df = toronto_df[toronto_df.Borough != "Not assigned"].reset_index(drop=True)
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
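
To see what the filter does step by step, here is a small sketch on a three-row frame. The frame `demo` and the names `mask` and `filtered` are illustrative assumptions, not part of the notebook's data pipeline.

```python
import pandas as pd

# Illustrative three-row frame (not the scraped data)
demo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True where the borough is assigned
mask = demo["Borough"] != "Not assigned"

# Select the True rows and rebuild a clean 0..n-1 index
filtered = demo[mask].reset_index(drop=True)

print(mask.tolist())
print(filtered)
```

Without `reset_index(drop=True)` the surviving rows would keep their original labels (1 and 2 here), which is why the notebook's output starts again at 0.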


<h4>Grouping neighborhoods with same "PostalCode" and "Borough"</h4>

In [8]:
# Group rows that share "PostalCode" and "Borough", joining their neighborhood names with commas
toronto_df = toronto_df.groupby(["PostalCode", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
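
The grouping step can be illustrated on a few rows. This sketch assumes a small hand-built frame (`demo`) and passes `", ".join` for the "Neighborhood" column through a dictionary, which is one of several equivalent ways to express the aggregation.

```python
import pandas as pd

# Illustrative rows: two neighborhoods share postal code M6A
demo = pd.DataFrame({
    "PostalCode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighborhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# One output row per (PostalCode, Borough) pair; names are joined with ", "
grouped = demo.groupby(["PostalCode", "Borough"], as_index=False).agg(
    {"Neighborhood": ", ".join}
)

print(grouped)
```

Because `groupby` sorts by the group keys by default, M3A appears before M6A in the result, which matches the alphabetical postal-code order seen in the notebook's output.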


<h4>Assign the "Borough" value to 'Not assigned' "Neighborhood" values</h4>
If a row has a "Borough" value but its "Neighborhood" value is 'Not assigned', set the "Neighborhood" value to the "Borough" value.

In [9]:
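# Writing to the row yielded by iterrows() is not guaranteed to update the DataFrame,
# so assign through a boolean mask instead
not_assigned = toronto_df["Neighborhood"] == "Not assigned"
toronto_df.loc[not_assigned, "Neighborhood"] = toronto_df.loc[not_assigned, "Borough"]
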
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
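
The replacement rule can be checked on a tiny frame. This is a sketch under the assumption of two made-up rows (`demo`); it uses a boolean mask with `.loc`, since the pandas documentation warns that writing to rows yielded by `iterrows()` may have no effect on the original frame.

```python
import pandas as pd

# Illustrative rows: the first has a borough but no assigned neighborhood
demo = pd.DataFrame({
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Woburn"],
})

# Select the rows to fix, then copy the borough into the neighborhood column
mask = demo["Neighborhood"] == "Not assigned"
demo.loc[mask, "Neighborhood"] = demo.loc[mask, "Borough"]

print(demo)
```

Only the masked row changes; rows that already have a neighborhood are left untouched.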


#### Shape of Cleaned DataFrame

In [10]:
print('Shape of obtained DataFrame: {}\n'.format(toronto_df.shape))
print('There are {} rows in the obtained DataFrame.'.format(toronto_df.shape[0]))

Shape of obtained DataFrame: (103, 3)

There are 103 rows in the obtained DataFrame.


#### DataFrame Conversion to CSV File
Save the cleaned DataFrame to a CSV file.

In [11]:
# Write the cleaned DataFrame to a CSV file, omitting the index column
toronto_df.to_csv('toronto_cleaned.csv', index=False)
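
A quick way to confirm that comma-joined neighborhood names survive the CSV round trip is to write to an in-memory buffer and read it back. This sketch assumes a one-row frame (`demo`) and uses `io.StringIO` in place of a file on disk; both are illustrative choices.

```python
import io

import pandas as pd

# Illustrative row whose neighborhood value contains a comma
demo = pd.DataFrame({
    "PostalCode": ["M1B"],
    "Borough": ["Scarborough"],
    "Neighborhood": ["Rouge, Malvern"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the text back into a DataFrame
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

`to_csv` quotes the field that contains a comma, so `read_csv` restores it as a single "Neighborhood" value rather than splitting it into two columns.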

<h3>Check whether the obtained DataFrame matches the one required by the question</h3>

In [12]:
column_names = ["PostalCode", "Borough", "Neighborhood"]

req_postalcodes = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

# DataFrame.append was removed in pandas 2.0, so collect the matching rows
# in the requested order and concatenate them once
matches = [toronto_df[toronto_df["PostalCode"] == postcode] for postcode in req_postalcodes]
req_dataframe = pd.concat(matches, ignore_index=True)[column_names]

req_dataframe

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Woodbine Gardens, Parkview Hill"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Maryvale, Wexford"
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo..."
