### Segmenting and Clustering Neighborhoods in Toronto
##### Capstone project assignment - Samuel Mensah
In this assignment we explore and cluster the neighborhoods of Toronto, using data scraped from a Wikipedia page.

In [1]:
# import necessary packages
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

#### Scrape the Toronto table from Wikipedia
We use the requests and BeautifulSoup libraries to fetch and parse the page at a given URL.

In [2]:
html_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source   = requests.get(html_url).text
soup = BeautifulSoup(source, 'html.parser')

# find table using soup object
table = soup.find("table", class_="wikitable sortable") # inspect the table in html to get the class name
# get all rows from table
all_rows = table.find_all('tr')

#### Read table into dataframe
We read the text of each row and add it to a dataframe. The first row (index 0) is the header.  
The first column is the Postcode, the second column is the Borough and the third column is the Neighbourhood.

In [3]:
# loop over rows and collect values
records = []
for count, row in enumerate(all_rows):
    if count == 0: # header
        # define dataframe columns
        column_names = row.text.split('\n')[1:4]
    else:
        # get values of rows
        postcode, borough, hood = row.text.split('\n')[1:4]
        # collect the row; DataFrame.append was removed in pandas 2.0
        records.append({'Postcode': postcode,
                        'Borough': borough,
                        'Neighbourhood': hood})
# build the dataframe once from all collected rows
neighborhoods = pd.DataFrame(records, columns=column_names)
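The row-parsing step above can be illustrated without fetching the page. The following is a minimal sketch, assuming each row's text looks like newline-delimited cells (the `row_texts` samples and the `rows_to_frame` helper are hypothetical, not part of the notebook). It shows why the slice `[1:4]` is used and how the dataframe can be built in one call.

```python
import pandas as pd

# Hypothetical row texts, shaped like `row.text` for a three-cell table row
row_texts = [
    "\nPostcode\nBorough\nNeighbourhood\n",
    "\nM3A\nNorth York\nParkwoods\n",
    "\nM4A\nNorth York\nVictoria Village\n",
]

def rows_to_frame(texts):
    """Build a DataFrame from newline-delimited row texts; the first row is the header."""
    # splitting on '\n' yields an empty string first, so fields 1 to 3 hold the cells
    parsed = [text.split("\n")[1:4] for text in texts]
    header, body = parsed[0], parsed[1:]
    # construct the frame once instead of growing it row by row
    return pd.DataFrame(body, columns=header)

sample = rows_to_frame(row_texts)
print(sample)
```

Collecting the rows first and constructing the dataframe once avoids copying the whole frame on every iteration.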

Let's take a quick look at the data.

In [4]:
neighborhoods.head()

,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Preprocess the data

Drop the rows whose borough is 'Not assigned'.

In [5]:
neighborhoods = neighborhoods[neighborhoods['Borough'] != 'Not assigned'].reset_index(drop=True)

Let's confirm that no Borough has a 'Not assigned' value.

In [6]:
neighborhoods['Borough'].unique()

array(['North York', 'Downtown Toronto', "Queen's Park", 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)
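The visual check above could also be expressed as a boolean test. Below is a minimal sketch on a small made-up frame (`toy` and its rows are illustrative only), applying the same filter and then verifying that no 'Not assigned' borough remains.

```python
import pandas as pd

# Illustrative data only; the real frame comes from the scraped table
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# keep only rows with an assigned borough, then renumber the index from 0
kept = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# programmatic version of the visual check on unique()
no_unassigned = not (kept["Borough"] == "Not assigned").any()
print(no_unassigned)
```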

Aggregate the neighbourhoods that share a postal code.

In [7]:
# group by postcode
neighborhoods = neighborhoods.groupby('Postcode').agg(list).reset_index()
# remove all duplicates in borough column using a set function on the column
neighborhoods['Borough'] = neighborhoods['Borough'].apply(lambda x : ",".join(list(set(x))))
# format the neighbourhood column
neighborhoods['Neighbourhood'] = neighborhoods['Neighbourhood'].apply(lambda x : ", ".join(x)) 
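As an alternative illustration of the same aggregation, the sketch below passes one aggregation per column to `agg` in a single call. It uses a small made-up frame and assumes every postcode maps to exactly one borough, so taking the first borough is safe. That assumption is not checked here, whereas the set-based approach above also handles postcodes with several boroughs.

```python
import pandas as pd

# Illustrative data only: two postcodes, two neighbourhoods each
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M1B", "M1B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Rouge", "Malvern"],
})

grouped = (
    toy.groupby("Postcode", as_index=False)
       # assumption: one borough per postcode, so "first" loses nothing
       .agg({"Borough": "first", "Neighbourhood": ", ".join})
)
print(grouped)
```

By default `groupby` sorts the group keys, so the result is ordered by postcode.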

Let's confirm the aggregation was successful.

In [8]:
neighborhoods.head()

,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


Let's check against the example given in the assignment.

In [9]:
print(neighborhoods[neighborhoods['Postcode'] == 'M5A'])

   Postcode           Borough              Neighbourhood
53      M5A  Downtown Toronto  Harbourfront, Regent Park


Find the rows where a borough is assigned but the neighbourhood is 'Not assigned'.

In [10]:
missing_hood = neighborhoods[(neighborhoods['Borough'] != 'Not assigned') &
                             (neighborhoods['Neighbourhood'] == 'Not assigned')]
missing_hood

,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Not assigned


In [11]:
# get index of 'Not assigned' neighbourhood
indices = missing_hood.index.tolist()

Assign the borough name to the neighbourhood wherever the neighbourhood is 'Not assigned'.

In [12]:
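# positional and label indices coincide here because the index was reset after grouping
for idx in indices:
    # column 2 is Neighbourhood, column 1 is Borough
    neighborhoods.iloc[idx,2] = neighborhoods.iloc[idx,1]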

Check that the neighbourhood name now equals the borough name.

In [13]:
neighborhoods.iloc[idx,:]

Postcode                  M7A
Borough          Queen's Park
Neighbourhood    Queen's Park
Name: 85, dtype: object
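The same replacement can be written without collecting indices or looping. The following is a minimal sketch on a made-up two-row frame, using a boolean mask with `.loc`. It refers to columns by name rather than by position, so it does not depend on the column order.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront, Regent Park", "Not assigned"],
})

# rows whose neighbourhood is missing
mask = toy["Neighbourhood"] == "Not assigned"
# copy the borough name into the neighbourhood column for those rows only
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy)
```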

Shape of the dataframe

In [14]:
neighborhoods.shape

(103, 3)

Take a look at the head and tail of the dataframe

In [15]:
neighborhoods.head()

,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [16]:
neighborhoods.tail()

,Postcode,Borough,Neighbourhood
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
102,M9W,Etobicoke,Northwest


### 2. Get the geographical coordinates of the neighbourhood and create a dataframe

Let's read the geographical data from the CSV file at the URL below.

In [17]:
coordinates_df = pd.read_csv('https://cocl.us/Geospatial_data')
coordinates_df.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Next, we set the postal code as the index of both the neighbourhoods data and the coordinates data.  
This lets us join the two dataframes on their index.

In [18]:
coordinates_df = coordinates_df.set_index('Postal Code')
neighborhoods  = neighborhoods.set_index('Postcode')

We can now concatenate the two dataframes and reset the index.

In [19]:
df = pd.concat([neighborhoods, coordinates_df], axis=1).reset_index()
# We rename the index back to PostalCode
df.rename(columns={"index":"PostalCode"}, inplace=True)
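For comparison, the join can also be done on the key columns directly with `merge`, without setting an index first. This is a minimal sketch on two made-up frames that keep the postal code as an ordinary column (the frame names `hoods` and `coords` are illustrative; the coordinate values are taken from the output shown earlier). The `validate` argument makes pandas raise an error if a postal code appears more than once on either side.

```python
import pandas as pd

# Illustrative frames with the postal code kept as a regular column
hoods = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],  # deliberately in a different order
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

merged = (
    hoods.merge(coords, left_on="Postcode", right_on="Postal Code",
                how="left", validate="one_to_one")
         .drop(columns="Postal Code")                 # drop the duplicate key column
         .rename(columns={"Postcode": "PostalCode"})  # match the notebook's column name
)
print(merged)
```

With `how="left"`, a postal code that is missing from the coordinates would keep its row with NaN coordinates instead of being dropped.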

Let's look at the dataframe.

In [20]:
df.head(20)

,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
