# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto #

##### In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

##### For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

##### Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

##### Your submission will be a link to your Jupyter Notebook on your Github repository. 

##### Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

In [1]:
#Section for Importing Libraries
import pandas as pd

In [2]:
#reading in the data
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)

In [3]:
#checking to see what df looks like
df

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

<i> pd.read_html returns a list of DataFrames, and it seems that three tables were read into df.
Let's check each one of them. </i>

In [4]:
df[0]

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [5]:
df[1]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,,Canadian postal codes,,,,,,,,,,,,,,,,
1,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,,,,,,,,,,,,,,,
2,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
3,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


In [6]:
df[2]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
1,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


<i> It seems that df[0] has the data that we need, so we will extract it and work on this dataset. </i>
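
Picking the table by position works for this page, but the index could shift if the page layout changes. As a minimal sketch, the hypothetical helper `pick_table` below (not part of the assignment) selects the first table whose columns include the expected names. The small `tables` list is made-up data standing in for the list returned by `pd.read_html`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(set(map(str, table.columns))):
            return table
    raise ValueError("No table contains columns: {}".format(sorted(required)))

# Made-up stand-ins for the list returned by pd.read_html(url)
tables = [
    pd.DataFrame({0: ["NL", "A"], 1: ["NS", "B"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postal_table = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
print(postal_table)
```

Selecting by column names keeps the rest of the notebook independent of how many navigation tables the page happens to contain.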

In [7]:
postal_code_M = df[0]
postal_code_M

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


##### To create the required dataframe:

##### The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood ✓

In [8]:
postal_code_M.rename(columns={'Postal Code': 'PostalCode'}, inplace = True)

##### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned. ✓ 

<i> Check to see which rows have a borough of "Not assigned" </i>

In [9]:
postal_code_M[postal_code_M['Borough'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
7,M8A,Not assigned,Not assigned
10,M2B,Not assigned,Not assigned
15,M7B,Not assigned,Not assigned
...,...,...,...
174,M4Z,Not assigned,Not assigned
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned


<i> Therefore, the 77 rows above will have to be removed from the dataset.<br>
Let's create a new dataframe to contain the filtered set. </i>

In [10]:
#Remove "Not assigned" boroughs from the DataFrame and reset the index
M_with_borough = postal_code_M[postal_code_M['Borough'] != 'Not assigned'].reset_index(drop=True)
M_with_borough

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


##### More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table. ✓ 
<i>Check to see if there are any duplicates in PostalCode </i>

In [11]:
M_with_borough[M_with_borough.duplicated(['PostalCode'])]

Unnamed: 0,PostalCode,Borough,Neighbourhood


<i>Apparently, no two rows have the same Postal Code. <br>
Let's check M5A to make sure the requirement is indeed met.</i>

In [12]:
M_with_borough[M_with_borough['PostalCode'] == 'M5A']

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"


<i> It seems that neighbourhoods sharing the same Postal Code are already merged into one row, separated by commas. Moving on... </i>
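
In this scrape no merging was needed. If the table ever lists one postal code on several rows, a groupby could combine them. The following is a minimal sketch on made-up rows (the `raw` frame is illustrative, not scraped data), assuming the cleaned column names used in this notebook.

```python
import pandas as pd

# Illustrative rows only: M5A appears twice, once per neighbourhood
raw = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Join the neighbourhoods of each (PostalCode, Borough) pair with a comma;
# sort=False keeps the groups in order of first appearance
merged = (
    raw.groupby(["PostalCode", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(merged)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result without needing a separate aggregation for it.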

##### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.✓ 

In [13]:
postal_code_M[(postal_code_M['Neighbourhood'] == 'Not assigned') & (postal_code_M['Borough'] != 'Not assigned')]

Unnamed: 0,PostalCode,Borough,Neighbourhood


<i> Hence, no cell has a borough but a Not assigned neighbourhood. </i>
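
The check above came back empty, so nothing had to be changed. Had any such rows existed, the rule could be applied with a boolean mask and `.loc`. A minimal sketch on made-up rows (the `sample` frame is illustrative, not scraped data):

```python
import pandas as pd

# Illustrative rows only: the first one has a borough but no neighbourhood
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Rows with an assigned borough but an unassigned neighbourhood
mask = (sample["Neighbourhood"] == "Not assigned") & (sample["Borough"] != "Not assigned")

# Copy the borough name into the neighbourhood column for those rows
sample.loc[mask, "Neighbourhood"] = sample.loc[mask, "Borough"]
print(sample)
```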

##### Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.✓ 

In [14]:
M_with_borough

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


##### In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe. ✓ 

In [15]:
M_with_borough.shape

(103, 3)

### Part 2
<i> The package suggested in the assignment did not yield any results, so I will use the Google API (through geopy) to retrieve the coordinates needed. </i>

In [16]:
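import re
#read the Google API key from a local text file
with open("himitsu.txt", "r") as himitsu:
    #expects a line of the form "Google API:<key>" with no space after the colon
    API = re.search(r'Google API:(\S+)', himitsu.read()).group(1)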

In [18]:
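from geopy.geocoders import GoogleV3
import time

#work on a copy so that M_with_borough is left unchanged
Tor_borough_xy = M_with_borough.copy()

geolocator = GoogleV3(api_key=API)
Tor_borough_xy['Longitude'] = float('nan')
Tor_borough_xy['Latitude'] = float('nan')

for x in range(len(Tor_borough_xy)):
    time.sleep(1) #to add delay between requests in case of large DFs
    geocode_result = geolocator.geocode('{}, Toronto, Ontario'.format(Tor_borough_xy.at[x, 'PostalCode']))
    if geocode_result is None: #no match returned for this postal code
        print(str(x+1) + ' Out Of '+ str(len(Tor_borough_xy)) + ' --> no result')
        continue
    #use .at to write single cells without chained assignment
    Tor_borough_xy.at[x, 'Latitude'] = geocode_result.latitude
    Tor_borough_xy.at[x, 'Longitude'] = geocode_result.longitude
    print(str(x+1) + ' Out Of '+ str(len(Tor_borough_xy)) + ' --> ' + geocode_result.address)

Tor_borough_xy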

1 Out Of 103 --> North York, ON M3A, Canada
2 Out Of 103 --> North York, ON M4A, Canada
3 Out Of 103 --> Toronto, ON M5A, Canada
4 Out Of 103 --> North York, ON M6A, Canada
5 Out Of 103 --> North York, ON M7A, Canada
6 Out Of 103 --> Etobicoke, ON M9A, Canada
7 Out Of 103 --> Scarborough, ON M1B, Canada
8 Out Of 103 --> North York, ON M3B, Canada
9 Out Of 103 --> Toronto, ON M4B, Canada
10 Out Of 103 --> Toronto, ON M5B, Canada
11 Out Of 103 --> Toronto, ON M6B, Canada
12 Out Of 103 --> Etobicoke, ON M9B, Canada
13 Out Of 103 --> Scarborough, ON M1C, Canada
14 Out Of 103 --> Toronto, ON M3C, Canada
15 Out Of 103 --> Toronto, ON M4C, Canada
16 Out Of 103 --> Toronto, ON M5C, Canada
17 Out Of 103 --> Toronto, ON M6C, Canada
18 Out Of 103 --> Etobicoke, ON M9C, Canada
19 Out Of 103 --> Scarborough, ON M1E, Canada
20 Out Of 103 --> Toronto, ON M4E, Canada
21 Out Of 103 --> Toronto, ON M5E, Canada
22 Out Of 103 --> Toronto, ON M6E, Canada
23 Out Of 103 --> Scarborough, ON M1G, Canada
24 Out

Unnamed: 0,PostalCode,Borough,Neighbourhood,Longitude,Latitude
0,M3A,North York,Parkwoods,-79.3297,43.7533
1,M4A,North York,Victoria Village,-79.3156,43.7259
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",-79.3606,43.6543
3,M6A,North York,"Lawrence Manor, Lawrence Heights",-79.4648,43.7185
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",-79.3895,43.6623
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",-79.5069,43.6537
99,M4Y,Downtown Toronto,Church and Wellesley,-79.3832,43.6659
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",-79.3216,43.6627
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",-79.4985,43.6363
