# Week3-2: Segmenting and Clustering Neighborhoods in Toronto

## Part 1

Import the libraries used in this exercise

In [1]:
import pandas as pd              # DataFrame construction and manipulation
import requests                  # HTTP requests to fetch the web page
from bs4 import BeautifulSoup    # parsing of the downloaded page for web scraping
# Note: BeautifulSoup's 'xml' parser (used below) requires the lxml package to be installed

Read the data source from Wikipedia.
The page lists the postal codes in Canada whose first letter is M. Postal codes beginning with M (except M0R and M7R) are located within the city of Toronto in the province of Ontario. Only the first three characters are listed, corresponding to the Forward Sortation Area.
The URL is:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
data_source = requests.get(url).text

Parse the downloaded page with BeautifulSoup

In [3]:
data_soup = BeautifulSoup(data_source, 'xml')

Get the first table on the page

In [4]:
soup_table=data_soup.find('table')

Build a list of records from the table cells (skipping cells marked "Not assigned"), load it into a DataFrame, and display the first five rows

In [5]:
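table_contents = []

for row in soup_table.find_all('td'):
    # Skip cells whose postal code has no borough assigned
    if row.span.text == 'Not assigned':
        continue
    cell = {}
    cell['PostalCode'] = row.p.text[:3]              # first three characters: the FSA code
    cell['Borough'] = row.span.text.split('(')[0]    # text before the opening parenthesis
    # Text inside the parentheses, with ' /' separators turned into commas
    cell['Neighborhood'] = row.span.text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    table_contents.append(cell)

# print(table_contents)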
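The chained string calls in the loop are easier to check in isolation. The sketch below pulls them into a helper, `split_borough_and_neighborhood` (a hypothetical name, not part of the original notebook). It assumes the cell text has the form `Borough(Neighborhood / Neighborhood)` with a single pair of parentheses. Unlike the loop, it also strips whitespace around the borough and returns an empty neighborhood instead of raising when there is no parenthesis.

```python
def split_borough_and_neighborhood(span_text):
    """Split 'Borough(Hood A / Hood B)' into ('Borough', 'Hood A, Hood B')."""
    # partition() never raises: if '(' is absent, `inside` is an empty string
    borough, _, inside = span_text.partition('(')
    # Same clean-up steps as the loop: drop the trailing ')', turn ' /' into ','
    neighborhood = inside.strip(')').replace(' /', ',').replace(')', ' ').strip()
    return borough.strip(), neighborhood


print(split_borough_and_neighborhood('Downtown Toronto(Regent Park / Harbourfront)'))
print(split_borough_and_neighborhood('North York(Parkwoods)'))
```

Keeping the parsing in one small function makes it possible to try edge cases (missing parentheses, extra separators) without re-downloading the page.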

In [6]:
df=pd.DataFrame(table_contents)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


Check the number of records in the DataFrame

In [7]:
df.shape

(103, 3)

Validate that neither Borough nor Neighborhood contains the value "Not assigned"

In [8]:
df.Borough.value_counts()

North York                                                      24
Scarborough                                                     17
Downtown Toronto                                                17
Etobicoke                                                       11
Central Toronto                                                  9
West Toronto                                                     6
York                                                             5
East York                                                        4
East Toronto                                                     4
East YorkEast Toronto                                            1
Downtown TorontoStn A PO Boxes25 The Esplanade                   1
MississaugaCanada Post Gateway Processing Centre                 1
East TorontoBusiness reply mail Processing Centre969 Eastern     1
EtobicokeNorthwest                                               1
Queen's Park                                                  
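The counts above show several Borough values where the scraped text ran together with an address or a second borough name (for example `EtobicokeNorthwest`). One way to tidy them is an explicit lookup table, sketched below. The replacement names on the right-hand side are assumptions chosen for readability, not values taken from the source page, and `clean_borough` is a hypothetical helper.

```python
# Keys are the fused values seen in the value_counts() output;
# values are assumed, hand-picked replacements.
BOROUGH_FIXES = {
    'East YorkEast Toronto': 'East York/East Toronto',
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
}


def clean_borough(name):
    """Return the tidied borough name, or the input unchanged if no fix is listed."""
    return BOROUGH_FIXES.get(name, name)


print(clean_borough('EtobicokeNorthwest'))
print(clean_borough('North York'))
```

With the notebook's DataFrame, the same mapping could be applied with `df['Borough'] = df['Borough'].replace(BOROUGH_FIXES)`, since `Series.replace` accepts a dict of old-to-new values.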

In [9]:
df.Neighborhood.value_counts()

Downsview Northwest                                  1
High Park, The Junction South                        1
Bathurst Manor, Wilson Heights, Downsview North      1
Birch Cliff, Cliffside West                          1
North Toronto West                                   1
                                                    ..
Harbourfront East, Union Station, Toronto Islands    1
University of Toronto, Harbord                       1
Guildwood, Morningside, West Hill                    1
Woburn                                               1
Cedarbrae                                            1
Name: Neighborhood, Length: 103, dtype: int64
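Reading `value_counts()` output by eye works for a table this small, but the same check can be made explicit. The sketch below scans a list of record dictionaries shaped like `table_contents` and returns any that still carry the marker. The name `find_unassigned` and the sample records are illustrative assumptions, not part of the notebook.

```python
def find_unassigned(records, fields=('Borough', 'Neighborhood'), marker='Not assigned'):
    """Return the records whose value in any of `fields` equals `marker`."""
    return [rec for rec in records if any(rec.get(f) == marker for f in fields)]


# Made-up sample: one clean record and one that should be flagged
sample_records = [
    {'PostalCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'},
    {'PostalCode': 'M1A', 'Borough': 'Not assigned', 'Neighborhood': 'Not assigned'},
]

leftovers = find_unassigned(sample_records)
print(len(leftovers), 'record(s) still unassigned')
```

On the scraped data, `find_unassigned(table_contents)` would be expected to return an empty list. In pandas, `df.isin(['Not assigned']).any()` gives an equivalent per-column answer.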

Display the full contents of the DataFrame

In [10]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Print the dimensions (rows, columns) of the DataFrame

In [11]:
df.shape

(103, 3)

## Part 2

I was not able to get the geographical coordinates of the neighborhoods using the Geocoder package, so I used the CSV file suggested in the instructions:
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv

Read the CSV file and load it into a DataFrame

In [12]:
csv_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv'
df_csv = pd.read_csv(csv_url)
df_csv.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Join the two DataFrames on the postal code to add Latitude and Longitude to each neighborhood. The two DataFrames list the postal codes in different orders (df starts at M3A, df_csv at M1B), so the rows must be matched on the key rather than on row position.
After the join, drop the redundant "Postal Code" column to leave the final DataFrame in the required shape.

In [13]:
final_df = df.merge(df_csv, how='left', left_on='PostalCode', right_on='Postal Code')
final_df = final_df.drop(['Postal Code'], axis=1)
final_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


Validate that the final DataFrame still has the same number of rows as before the join (103)

In [14]:
final_df.shape

(103, 5)
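A matching row count only confirms that no rows were added or lost; it does not show that every postal code found its own coordinates. The sketch below is one way to check that, using a key-based `merge` with `validate` and `indicator`. The helper name `merge_coordinates` is hypothetical, and the sample coordinates are placeholder numbers, not real locations.

```python
import pandas as pd


def merge_coordinates(neighborhoods, coordinates):
    """Left-join coordinates onto neighborhoods by postal code.

    Returns the merged frame and the list of postal codes with no match.
    """
    merged = neighborhoods.merge(
        coordinates,
        how='left',
        left_on='PostalCode',
        right_on='Postal Code',
        validate='one_to_one',   # raises MergeError if a key is duplicated on either side
        indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
    )
    unmatched = merged.loc[merged['_merge'] != 'both', 'PostalCode'].tolist()
    merged = merged.drop(columns=['Postal Code', '_merge'])
    return merged, unmatched


# Placeholder data: the coordinate rows are deliberately in a different order
sample_neighborhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Unknown'],
})
sample_coordinates = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A'],
    'Latitude': [2.0, 1.0],
    'Longitude': [-2.0, -1.0],
})

merged, unmatched = merge_coordinates(sample_neighborhoods, sample_coordinates)
print(merged)
print('Postal codes without coordinates:', unmatched)
```

Applied to the notebook's data as `merge_coordinates(df, df_csv)`, an empty `unmatched` list together with the unchanged row count would indicate that every postal code was paired with a coordinate row.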