<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Introduction

This assignment explores, segments, and clusters the neighborhoods of the city of Toronto based on their postal code and borough information.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#item1">Part 1 - Extract Toronto neighborhood data from Wikipedia and display the top rows with the Postal Code, Borough and Neighborhood columns</a>
2.  <a href="#item1">Part 2 - Add Geocode</a>
3.  <a href="#item1">Part 3 - Explore and cluster the neighborhoods in Toronto</a>
    

</font>
</div>

### Part 1 - Extract Toronto neighborhood data from Wikipedia, clean it, and display the top rows

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

!pip install geocoder
import geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [2]:
#Read data from Wikipedia
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(source, 'html5lib')

In [3]:
# Clean the data and save it in a dictionary keyed by postal code

postal_codes_dict = {}
for table_cell in soup.find_all('td'):
    try:
        postal_code = table_cell.p.b.text
        neighborhoods_data = table_cell.span.text
    except AttributeError:
        # Skip cells that lack the expected <p><b> and <span> elements
        continue

    # Skip postal codes that have no borough assigned
    if neighborhoods_data == 'Not assigned':
        continue

    # The text before the first '(' is the borough
    parts = neighborhoods_data.split('(')
    borough = parts[0]

    if len(parts) > 1:
        # The text inside the parentheses lists neighborhoods separated by '/'
        neighborhoods = parts[1].replace(')', ' ')
        neighborhoods_clean = ', '.join(name.strip() for name in neighborhoods.split('/'))
    else:
        # No neighborhood is listed, so reuse the borough name
        borough = borough.strip('\n')
        neighborhoods_clean = borough

    postal_codes_dict[postal_code] = {'borough': borough,
                                      'neighborhoods': neighborhoods_clean}
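
The parsing rule used in the loop above can be tried on its own, without fetching the page. The following is a minimal sketch: the helper name `split_cell_text` and the two sample strings are hypothetical, chosen only to resemble the cell text the loop expects (a borough followed by an optional parenthesised, slash-separated list of neighborhoods). Unlike the loop, the helper strips whitespace from the borough in both branches.

```python
def split_cell_text(cell_text):
    """Split cell text into (borough, neighborhoods) using the same rule as the loop."""
    parts = cell_text.split('(')
    borough = parts[0].strip()
    if len(parts) > 1:
        # Neighborhoods are inside the parentheses, separated by '/'
        names = parts[1].replace(')', ' ').split('/')
        neighborhoods = ', '.join(name.strip() for name in names)
    else:
        # No parentheses: fall back to the borough name
        neighborhoods = borough
    return borough, neighborhoods


# Hypothetical sample inputs, not taken from the live page
print(split_cell_text('Downtown Toronto(Regent Park / Harbourfront)'))
print(split_cell_text('Example Borough'))
```

Keeping the rule in a small function like this makes it easier to check against new sample strings if the layout of the page changes.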

In [4]:
# Create and populate a DataFrame
columns = ['Postal Code', 'Borough', 'Neighborhood']

# Build one row per postal code. The column is named 'Postal Code' so that it
# matches the key used when merging with the coordinates data in Part 2.
rows = [
    {'Postal Code': postal_code,
     'Borough': data['borough'],
     'Neighborhood': data['neighborhoods']}
    for postal_code, data in postal_codes_dict.items()
]
toronto_data = pd.DataFrame(rows, columns=columns)
toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


In [5]:
print('The DataFrame shape is', toronto_data.shape)

The DataFrame shape is (103, 3)


### Part 2 - Add Geocode

In [6]:
# Load the latitude and longitude of each postal code from the CSV file
link = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv"

geocsv_data = pd.read_csv(link)
geocsv_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
# Merge the two DataFrames on the shared 'Postal Code' column
df_final = pd.merge(geocsv_data, toronto_data, on='Postal Code')

# Keep the columns in a readable order
df_final = df_final[['Postal Code', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']]

df_final.head()


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
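
Because `pd.merge` performs an inner join by default, any postal code missing from the coordinates file would be dropped without warning. The sketch below shows one way to check for that, using `how='left'` together with `indicator=True`. The two small DataFrames are hypothetical stand-ins for `toronto_data` and `geocsv_data`, and the postal code `M9Z` and its borough are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
codes = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Example Borough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left join keeps every postal code; the '_merge' column records
# whether a matching row was found in the coordinates table
checked = pd.merge(codes, coords, on='Postal Code', how='left', indicator=True)

# Postal codes that have no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty list from the same check on the real tables would indicate that the inner join lost no rows.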
