# __Segmenting and Clustering Neighborhoods in Toronto__ Part 1


<br/>

## *Web Scraping*

<br/>

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

<br/>

1. Start by creating a new Notebook for this assignment.

<br/>

2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

<br/>

![alt text](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1581984000000&hmac=aqqnfeTZdyKUZ-RkUdcZZEunf_3-V_IR0cy_wrB4KTw)

<br/>

3. To create the above dataframe:

<br/>

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is __Not assigned.__
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that __M5A__ is listed twice and has two neighborhoods: __Harbourfront__ and __Regent Park__. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in __row 11__ in the above table.
- If a cell has a borough but a __Not assigned neighborhood__, then the neighborhood will be the same as the borough. So for the __9th__ cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be __Queen's Park.__
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the __.shape__ attribute to print the number of rows of your dataframe.
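
The three cleaning rules above can be sketched with pandas on a handful of rows. This is a minimal illustration, not the solution used later in this notebook: the frame `raw` and its four rows are made up to mirror the examples in the list (the __M5A__ duplicate and the __Queen's Park__ case), and the names `raw`, `cleaned`, and `unassigned` are assumptions.

```python
import pandas as pd

# Illustrative rows only; they mimic the shape of the Wikipedia table
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 1: keep only rows with an assigned borough
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood takes the borough's name
unassigned = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighborhood"] = cleaned.loc[unassigned, "Borough"]

# Rule 3: merge rows that share a postal code into one comma-separated row
cleaned = (
    cleaned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(cleaned)
print(cleaned.shape)
```

Grouping on both the postal code and the borough keeps the borough column in the result without a second aggregation step.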

<br/>

4. Submit a link to your Notebook on your GitHub repository. __(10 marks)__

<br/>

__Note__: *There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas to read the table into a pandas dataframe.*

*Another way, which is worth learning for more complicated web scraping cases, is to use the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/*

*The package is so popular that there is a plethora of tutorials and examples on how to use it. Here is a very good YouTube video on how to use the BeautifulSoup package: https://www.youtube.com/watch?v=ng2o98k983k*

*Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.*

<br/>
<br/>

___

<br/>

## Install Pandas 1.0.1 for New Functionality

<br/>

In [0]:
!pip3 -q install pandas==1.0.1

<br/>

## Import Necessary Packages for Web Scraping

<br/>

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
print("Pandas Version: " + pd.__version__)

Pandas Version: 1.0.1


<br/>

## Web Scrape the List of Postal Codes of Canada using __Beautiful Soup__ with *html.parser*

<br/>

In [0]:
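url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
postal_codes_canada = requests.get(url)
postal_codes_canada.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(postal_codes_canada.text, 'html.parser')
# The postal code table is the first table with class "wikitable sortable"
table = soup.find('table', {'class':'wikitable sortable'}).tbody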


<br/>

## Find Column Headers of Table

<br/>

In [4]:
column_headers = table.find_all('th')
column_headers = [c.text.replace('\n', '') for c in column_headers]
print(column_headers)

['Postcode', 'Borough', 'Neighbourhood']


<br/>

## Set Column Headers of Table to a new DataFrame: __df_postal_codes__. Then, print the column headers.

<br/>

In [5]:
df_postal_codes = pd.DataFrame(columns = column_headers)
df_postal_codes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood


<br/>

## Scrape for the column data in the table. Append the scraped data into the DataFrame and organize the DataFrame based on the directions in __STEP 3__.

<br/>

In [0]:
# Note: DataFrame.append was removed in pandas 2.0; this cell relies on the pinned 1.0.1
table_rows = table.find_all('tr')
row_prev = None
for i in range(1, len(table_rows)):  # start at 1 to skip the header row
  cells = table_rows[i].find_all('td')
  postcode, borough = cells[0].text, cells[1].text
  neighbourhood = cells[2].text.replace('\n', '')
  row = [postcode, borough, neighbourhood]
  if borough != 'Not assigned' and neighbourhood == 'Not assigned':
    # Borough without a neighbourhood: reuse the borough name
    row = [postcode, borough, borough]
    df_postal_codes = df_postal_codes.append(pd.Series(row, index = column_headers), ignore_index = True)
  elif borough != 'Not assigned' and neighbourhood != 'Not assigned':
    if row_prev is not None and postcode == row_prev[0]:
      # Same postcode as the previous row: combine the neighbourhoods
      row = [postcode, borough, neighbourhood + ", " + row_prev[2]]
    df_postal_codes = df_postal_codes.append(pd.Series(row, index = column_headers), ignore_index = True)
  row_prev = row

<br/>
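
As an alternative to appending to the DataFrame inside the loop, the rows can be collected in a plain list and turned into a DataFrame once at the end. The sketch below runs against a small inline HTML string rather than the live Wikipedia page; the `html` snippet and the names `records` and `df` are assumptions made for illustration, and the cleaning rules from __STEP 3__ are deliberately left out.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

# Header cells give the column names
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Collect one list of cell strings per data row, skipping the header row
records = []
for tr in table.find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == len(headers):
        records.append(cells)

# Build the DataFrame in a single step
df = pd.DataFrame(records, columns=headers)
print(df)
```

Building the frame once avoids copying the whole DataFrame on every row, which is what repeated appends do.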

## Remove Duplicate Rows with the same __Postcode__ based on directions in __STEP 3__.

<br/>

In [0]:
df_postal_codes.drop_duplicates(subset = ['Postcode'], keep = 'last', inplace = True, ignore_index = True)

<br/>
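
The effect of `keep = 'last'` can be seen on a toy frame. In the sketch below, the frame `df` and its values are illustrative assumptions; the combined row is placed after the single-neighbourhood row, which is the order the scraping loop produces.

```python
import pandas as pd

# Illustrative rows: the combined M5A row comes after the single-neighbourhood one
df = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Neighbourhood": ["Harbourfront", "Regent Park, Harbourfront", "Lawrence Heights"],
})

# keep='last' retains the final row for each Postcode;
# ignore_index=True (pandas 1.0 or later) renumbers the result from 0
deduped = df.drop_duplicates(subset=["Postcode"], keep="last", ignore_index=True)
print(deduped)
```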

## Print First 20 Rows in the DataFrame, __df_postal_codes__.

<br/>

In [8]:
df_postal_codes.head(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


<br/>

## Print Number of Rows using the __.shape__ attribute based on directions in __STEP 3__.

<br/>

In [9]:
df_postal_codes.shape

(103, 3)

<br/>

## Convert DataFrame to __.csv__ File to use in the next portion of the assignment.

<br/>

In [0]:
df_postal_codes.to_csv("postal_codes_canada_m.csv")

<br/>
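
Because `to_csv` writes the DataFrame index by default, the file gains an extra unnamed first column, which is why it is read back with `index_col=0` in the next part. The round trip can be sketched as follows; `io.StringIO` is used here as an in-memory stand-in for the file, and the two-row frame is illustrative.

```python
import io
import pandas as pd

# Illustrative frame with two rows from the table above
df = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})

# io.StringIO stands in for a file on disk so the sketch leaves nothing behind
buffer = io.StringIO()
df.to_csv(buffer)  # index=True by default, so the index is written as a column

# Without index_col, the saved index comes back as a column named "Unnamed: 0"
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 restores the first column as the index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` would be another way to avoid the extra column altogether.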

# __Segmenting and Clustering Neighborhoods in Toronto__ Part 2

<br/>

## *Adding Latitude and Longitude Coordinates to Postal Codes*

<br/>


Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, you need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data.

<br/>


In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

<br/>




## Geocoder did not work

<br/>

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

<br/>

Use the Geocoder package or the csv file to create the following dataframe:

<br/>

![alt text](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1582070400000&hmac=pJ5WPxucteThdms8RjVnATQ9zADCb6v4WwZSBiAQJYY)

<br/>

Important Note: There is a limit on how many times you can call the geocoder.google function: 2500 times per day. This should be more than enough for you to get acquainted with the package and to use it to get the geographical coordinates of the neighborhoods in Toronto.
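
Since lookups can fail and calls are limited, one cautious pattern is to cap the number of attempts per postal code. The sketch below is only an illustration: `lookup_with_retries`, `max_attempts`, and the stand-in `flaky_geocode` are hypothetical names, and the geocoder is passed in as a callable so the example runs without network access. In real use the callable might wrap the Geocoder package, for example `lambda q: geocoder.google(q).latlng`, which is an assumption about how you would wire it up.

```python
def lookup_with_retries(postal_code, geocode, max_attempts=5):
    """Return (latitude, longitude) for a postal code, or None after max_attempts."""
    query = "{}, Toronto, Ontario".format(postal_code)
    for _ in range(max_attempts):
        coords = geocode(query)
        if coords is not None:
            return tuple(coords)
    return None  # give up instead of looping forever and exhausting the daily limit


# Hypothetical stand-in that fails twice before answering, to mimic an unreliable service
calls = {"n": 0}

def flaky_geocode(query):
    calls["n"] += 1
    if calls["n"] < 3:
        return None
    return [43.806686, -79.194353]  # coordinates listed for M1B in the csv file


result = lookup_with_retries("M1B", flaky_geocode)
print(result, "after", calls["n"], "calls")
```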

<br/>

Once you are able to create the above dataframe, submit a link to the new Notebook on your GitHub repository. __(2 marks)__

<br/>

___

<br/>

## Install Pandas 1.0.1 for New Functionality

<br/>

In [0]:
!pip3 -q install pandas==1.0.1

<br/>

## Import Necessary Packages

<br/>

In [12]:
import pandas as pd
import numpy as np
from google.colab import files


print("Pandas Version: " + pd.__version__)

Pandas Version: 1.0.1


<br/>

## __Optional__: Upload the CSV File Created in the Last Portion of the Assignment

<br/>

In [13]:
dataframe = files.upload()

<br/>

## Use Pandas to read CSV and set it to DataFrame: __df_postal_codes__

<br/>

In [14]:
df_postal_codes = pd.read_csv("postal_codes_canada_m.csv", index_col=0)
df_postal_codes

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,Queen's Park
...,...,...,...
98,M8X,Etobicoke,"Old Mill North, Montgomery Road, The Kingsway"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
101,M8Y,Etobicoke,"Sunnylea, Royal York South East, The Queensway..."


<br/>

## Read the Geospatial Data CSV into the DataFrame __df_geo_coords__ and Print the First 20 Rows.

<br/>

In [15]:
df_geo_coords = pd.read_csv("http://cocl.us/Geospatial_data")
df_geo_coords.head(20)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


<br/> 

## Retrieve Latitude and Longitude Values from DataFrame, __df_geo_coords__.

<br/>

In [0]:
# Index the coordinates by postal code so each lookup is a direct label match
coords_by_code = df_geo_coords.set_index('Postal Code')

# Align the coordinates to the row order of df_postal_codes;
# a postcode with no match becomes NaN instead of shifting later rows
latitude = df_postal_codes['Postcode'].map(coords_by_code['Latitude']).to_numpy()
longitude = df_postal_codes['Postcode'].map(coords_by_code['Longitude']).to_numpy()

<br/>

## Add Latitude and Longitude Values to DataFrame, __df_postal_codes__, as new Columns

<br/>

In [0]:
df_postal_codes['Latitude'] = latitude
df_postal_codes['Longitude'] = longitude

<br/>
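
The same result can also be reached with a single merge, which makes unmatched postal codes easy to detect. The sketch below uses three rows taken from the tables in this notebook; the names `codes`, `coords`, `merged`, and `missing` are assumptions for illustration.

```python
import pandas as pd

# Small illustrative frames; the values come from the tables in this notebook
codes = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coords = pd.DataFrame({
    "Postal Code": ["M5A", "M3A", "M4A"],
    "Latitude": [43.654260, 43.753259, 43.725882],
    "Longitude": [-79.360636, -79.329656, -79.315572],
})

# A left merge keeps every row of `codes` in its original order;
# validate="many_to_one" raises if a postal code appears twice in `coords`
merged = codes.merge(
    coords,
    how="left",
    left_on="Postcode",
    right_on="Postal Code",
    validate="many_to_one",
).drop(columns="Postal Code")

# Any postcode without coordinates shows up as NaN
missing = int(merged["Latitude"].isna().sum())
print(merged)
print("rows without coordinates:", missing)
```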

## Display the Updated DataFrame, __df_postal_codes__.

<br/>

In [18]:
df_postal_codes

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"Old Mill North, Montgomery Road, The Kingsway",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558
101,M8Y,Etobicoke,"Sunnylea, Royal York South East, The Queensway...",43.636258,-79.498509


<br/>

## Convert DataFrame to __.csv__ File to use in the next portion of the assignment.

<br/>

In [0]:
df_postal_codes.to_csv("postal_codes_canada_latlng.csv")