# IBM Data Science Professional Certificate | Applied Data Science Capstone

## Segmenting and Clustering Neighborhoods in Toronto

### Part 1 - Explore and Cluster the Neighborhoods in Toronto

The Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M contains all the information we need to explore and cluster the neighborhoods in Toronto. We will scrape the page, clean and wrangle the data, and read it into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we can start the analysis to explore and cluster the neighborhoods in the city of Toronto.

#### Methodology

In this notebook, we build the code to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, obtain the data in its table of postal codes, and transform that data into a pandas dataframe. Python offers several web-scraping libraries and packages. One of the most common is BeautifulSoup, which we use in this project. The package's main documentation page is http://beautiful-soup-4.readthedocs.io/en/latest/

- The table consists of three columns: PostalCode, Borough, and Neighborhood
- We only process the rows that have an assigned Borough and ignore rows whose Borough is "Not assigned"
- More than one Neighborhood can exist in one Postal Code area. We combine such rows into a single row, with the Neighborhoods separated by commas
- If a row has a Borough but a "Not assigned" Neighborhood, the Neighborhood is set to the same value as the Borough
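
The rules above can be sketched on a small hand-made table. This is only an illustration under stated assumptions: the helper name `clean_postal_table` and the sample rows are made up for the example, and the column names follow the spelling used on the Wikipedia page (`Postal Code`, `Neighbourhood`).

```python
import pandas as pd

def clean_postal_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the cleaning rules to a table with
    columns 'Postal Code', 'Borough', and 'Neighbourhood'."""
    # Keep only rows with an assigned Borough
    df = raw[raw['Borough'] != 'Not assigned'].copy()

    # A 'Not assigned' Neighbourhood falls back to the Borough name
    missing = df['Neighbourhood'] == 'Not assigned'
    df.loc[missing, 'Neighbourhood'] = df.loc[missing, 'Borough']

    # Merge rows that share a postal code into one comma-separated row
    df = (df.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
            .agg(', '.join)
            .reset_index())
    return df

# Made-up sample rows, not taken from the real table
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})
cleaned = clean_postal_table(sample)
print(cleaned)
```

Applying the fallback before the grouping step means a filled-in Neighbourhood takes part in the comma-separated join like any other value.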

##### Import necessary Libraries

In [56]:
import requests # library to handle requests
import pandas as pd # library for data analysis
from bs4 import BeautifulSoup
print('Libraries Imported.')

Libraries Imported.


##### Scraping the Wikipedia page for the table of Postal Codes of Canada

In [57]:
# Downloading the page and locating its first table
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
soup = BeautifulSoup(source, 'html.parser') # the page is HTML, so use an HTML parser rather than 'xml'
table = soup.find('table')
print('Data Scraped.')

Data Scraped.
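
To see what `soup.find('table')` gives back without hitting the network, here is a minimal sketch that parses a tiny hand-written table. The HTML snippet is made up for illustration, and the choice of the built-in `'html.parser'` is an assumption of this example.

```python
from bs4 import BeautifulSoup

# A tiny hand-written table standing in for the downloaded page
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no extra install
table = soup.find('table')                 # first <table> element in the document

rows = []
for tr in table.find_all('tr'):
    # get_text(strip=True) drops surrounding whitespace and newlines
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    rows.append(cells)

header, records = rows[0], rows[1:]
print(header)
print(records)
```

Walking the `<tr>` and `<td>` tags by hand like this is an alternative to letting pandas parse the table, and can be useful when cells need custom cleaning.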


##### Converting the Scraped HTML table to Pandas Dataframe for Preprocessing

In [58]:
# Loading the scraped HTML table into a Dataframe
from io import StringIO # newer pandas versions deprecate passing a raw HTML string to read_html
df = pd.read_html(StringIO(str(table)))[0] # read_html returns a list of tables; take the first
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [59]:
# Checking the shape of the Dataframe
df.shape

(180, 3)

##### Data Preprocessing and Cleaning

In [60]:
# Removing rows where Borough is 'Not assigned'
df = df[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [61]:
# Checking the shape of the Dataframe
df.shape

(103, 3)

In [62]:
# Checking if there are rows with 'Not Assigned' Neighborhood
df[df['Neighbourhood']=='Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


In [26]:
# Assigning the Borough name to rows with a 'Not assigned' Neighbourhood (no such cases in this dataset)
# mask = df['Neighbourhood'] == 'Not assigned'
# df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']

In [63]:
# Checking if any Postal Code appears in more than one row (keep=False marks every repeated row)
duplicaterows = df[df.duplicated(subset='Postal Code', keep=False)]
print(duplicaterows)

Empty DataFrame
Columns: [Postal Code, Borough, Neighbourhood]
Index: []
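
With no arguments, `DataFrame.duplicated()` compares whole rows, whereas passing `subset='Postal Code'` compares only the postal code. The sketch below shows the difference on made-up rows in which one postal code is split across two rows.

```python
import pandas as pd

# Made-up rows: M5A appears twice with different neighbourhoods
toy = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Regent Park', 'Harbourfront'],
})

# Whole-row comparison: no row is an exact copy of another
full_row_dupes = toy[toy.duplicated()]

# Postal-code comparison: keep=False marks every row of a repeated code
postal_dupes = toy[toy.duplicated(subset='Postal Code', keep=False)]

print(len(full_row_dupes))
print(postal_dupes['Neighbourhood'].tolist())
```

Only the second check flags the split postal code, which is the case the combining step is meant to handle.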


In [64]:
# Combining the rows into one row with the neighborhoods separated by a comma where there are Multiple Neighborhoods for a PostalCode
df = df.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
df.reset_index(inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [65]:
# Checking the shape of the Dataframe
df.shape

(103, 3)

##### Exporting the Dataframe to a csv file to be used in Part-2

In [66]:
# Export data as 'Segmenting and Clustering Neighborhoods in Toronto_Part1.csv'
df.to_csv('Segmenting and Clustering Neighborhoods in Toronto_Part1.csv',index=False)
print('Successfully Exported.')

Successfully Exported.
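
Before moving on to Part 2, it can be worth confirming that the file reads back as expected. The following is a minimal sketch on made-up rows, written to a temporary directory. The file name `toronto_part1_check.csv` is hypothetical and is not the file exported above.

```python
import os
import tempfile
import pandas as pd

# Made-up rows standing in for the cleaned dataframe
df_out = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Regent Park, Harbourfront'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toronto_part1_check.csv')  # hypothetical file name
    df_out.to_csv(path, index=False)  # index=False keeps the row index out of the file
    df_back = pd.read_csv(path)

# Neighbourhood values that contain commas are quoted on write and restored on read
print(df_back)
```

Writing with `index=False` avoids an extra unnamed index column appearing when the CSV is loaded again.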
