## Segmenting and Clustering Neighborhoods in Toronto, Canada

### Introduction

In this assignment, I will convert addresses into their equivalent latitude and longitude values. I will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. I will use the *k*-means clustering algorithm to complete this task. Finally, I will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset (Current assignment)</a>

2. <a href="#item2">Explore Neighborhoods in Toronto (Next Assignment) </a>
   
</font>
</div>

Before we start exploring the data, let's import the dependencies we need and fetch the Wikipedia page that lists Toronto's postal codes.

In [1]:
import requests
import numpy as np
import pandas as pd
from urllib.request import urlopen

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(url)
response.text[:100] # Access the HTML with the text property


'\n<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<titl'

### Web scraping using the BeautifulSoup package

In [2]:
!pip install requests beautifulsoup4  # the leading "!" runs pip as a shell command from the notebook


In [3]:
!pip install lxml  # the leading "!" runs pip as a shell command from the notebook


In [4]:
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')  # parse the HTML text with the lxml parser
table = soup.find_all('table')[0]            # grab the first table on the page

# Collect the header names, replacing newlines and trimming surrounding spaces
t_headers = []
for th in table.find_all("th"):
    t_headers.append(th.text.replace('\n', ' ').strip())

# Build one dictionary per table row
new_table = []
for tr in table.tbody.find_all("tr"):  # find all tr's in the table's tbody
    # Default keys, so the header row (which has no td cells) still yields a record
    t_row = {'Borough': '', 'Neighborhood': ''}

    # Pair each of the row's td cells with its header name
    for td, th in zip(tr.find_all("td"), t_headers):
        t_row[th] = td.text.replace('\n', '').strip()
    new_table.append(t_row)

new_table[6]

{'Borough': 'North York',
 'Neighborhood': 'Lawrence Manor, Lawrence Heights',
 'Postal Code': 'M6A'}
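
To see the header and cell pairing in isolation, here is a minimal sketch that runs a similar loop on a small inline HTML string instead of the live Wikipedia page. The HTML snippet and the names `sample_html`, `headers`, and `rows` are made up for illustration, and the sketch assumes Python's built-in `html.parser` backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content)
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")

# Header names, in the order they appear in the table
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:  # skip the header row, which has only <th> cells
        continue
    # Pair each cell with its header name
    rows.append({h: td.get_text(strip=True) for h, td in zip(headers, cells)})

print(rows[1])
```

Unlike the cell above, this variant skips the header row, so it does not produce an empty first record.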

#### Convert the new_table list of dictionaries into a DataFrame df

In [5]:
df=pd.DataFrame(new_table) # convert new_table to dataframe df
df.head()                  # visualize first 5 rows

|   | Borough | Neighborhood | Postal Code |
|---|---|---|---|
| 0 |  |  |  |
| 1 | Not assigned |  | M1A |
| 2 | Not assigned |  | M2A |
| 3 | North York | Parkwoods | M3A |
| 4 | North York | Victoria Village | M4A |


In [6]:
cols = df.columns.tolist() # listing all columns names
cols

['Borough', 'Neighborhood', 'Postal Code']

The 'Postal Code' column comes last, so we move it to the front to make it the first column.

In [7]:
cols = cols[-1:] + cols[:-1] # bringing Postal code as first column
cols

['Postal Code', 'Borough', 'Neighborhood']
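
The slicing trick above can be sketched on a plain Python list, independent of pandas. The helper name `move_last_to_front` is hypothetical and is introduced here only for illustration.

```python
def move_last_to_front(names):
    """Return a new list with the last element moved to the front."""
    # names[-1:] is a one-element list holding the last item;
    # names[:-1] is everything before it.
    return names[-1:] + names[:-1]


cols = ['Borough', 'Neighborhood', 'Postal Code']
reordered = move_last_to_front(cols)
print(reordered)  # ['Postal Code', 'Borough', 'Neighborhood']
```

Because slicing builds new lists, the original `cols` list is left unchanged, and an empty list is returned as an empty list.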

In [8]:
df = df[cols] # dataframe with reorganized columns
df.head()

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 |  |  |  |
| 1 | M1A | Not assigned |  |
| 2 | M2A | Not assigned |  |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |


Some rows contain blank values or a borough marked "Not assigned".

In [12]:
df = df.replace(r'^\s*$', np.nan, regex=True)  # replace blank (empty or whitespace-only) values with NaN
df.head()                                      # show the first five rows

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 6 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 7 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [14]:
df.dropna(subset=["Neighborhood"], inplace=True)  # drop rows whose Neighborhood is NaN
df.head(10)                                       # show the first ten rows

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 6 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 7 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 9 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 10 | M1B | Scarborough | Malvern, Rouge |
| 12 | M3B | North York | Don Mills |
| 13 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 14 | M5B | Downtown Toronto | Garden District, Ryerson |
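
The two cleaning steps above (blank strings to NaN, then dropping rows without a neighborhood) can be tried on a tiny DataFrame. This is a minimal sketch: the rows in `sample` are invented to mimic the scraped table, and the names `sample` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample rows that mimic the scraped table
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["", "Parkwoods", "  "],
})

# Turn empty or whitespace-only strings into NaN
cleaned = sample.replace(r'^\s*$', np.nan, regex=True)

# Keep only the rows that have a neighborhood name
cleaned = cleaned.dropna(subset=["Neighborhood"])

print(cleaned)
```

Note that `dropna` keeps the original index labels, which is why the output above starts at 3 and skips 8 and 11. If consecutive labels are preferred, `reset_index(drop=True)` renumbers the rows.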


In [11]:
df.shape

(103, 3)