# Data Science Capstone Project

#### Author: Santiago Martínez Quintero
#### Civil Engineer

## Clustering Part - Toronto Neighbourhoods

### PART 1

In this section, we explore and cluster the neighborhoods in Toronto. All the information about the data set is available at the following link: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
#Import the libraries that we are going to use
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
from sklearn.cluster import KMeans

Now, we are going to import the data and then clean it.

In [17]:
#Import the data from Wikipedia, directly to a pandas dataframe
df_Toronto = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df_Toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


- Now that we have imported the data, we are going to clean it using the following rules:
    * The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    * Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    * More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
    * If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

1. We are going to drop all the rows that contain a 'Not assigned' Borough

In [18]:
Condition1 = df_Toronto[ df_Toronto['Borough'] == 'Not assigned' ].index
# Delete the rows selected by Condition1 from the DataFrame
df_Toronto.drop(Condition1 , inplace=True)
df_Toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


2. Now, we are going to search for duplicates in the 'Postal Code' column, in order to group them, as the instructions say

In [27]:
Duplicated = df_Toronto[df_Toronto.duplicated(['Postal Code'], keep=False)]
Duplicated

Unnamed: 0,Postal Code,Borough,Neighbourhood


We noticed that no Postal Code is duplicated, but let's look at the DataFrame to check that everything is in order

In [31]:
df_Toronto

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
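
No postal code is duplicated in this version of the table, so no grouping was needed. If a postal code had been listed on several rows, the rows could be combined roughly as in the sketch below. The toy rows are an assumption made for illustration and are not taken from the Wikipedia table.

```python
import pandas as pd

# Toy rows (not the real Wikipedia data) in which M5A appears twice
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# One row per postal code; neighbourhood names are joined with ', '
# sort=False keeps the postal codes in order of first appearance
combined = (
    toy.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```

Grouping on both 'Postal Code' and 'Borough' keeps the borough column in the result without needing a separate aggregation for it.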


3. Now, we are going to replace all the Neighbourhoods with their Borough, if they are 'Not assigned'

In [33]:
#Search for Neighbourhoods with 'Not assigned' value
df_search = df_Toronto[ df_Toronto['Neighbourhood'] == 'Not assigned' ]
df_search

Unnamed: 0,Postal Code,Borough,Neighbourhood


We noticed that no Neighbourhood has a 'Not assigned' value, so there is no need for a loop to change the value
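
Even if some neighbourhoods had been 'Not assigned', a loop would not be required: a boolean mask can do the replacement in one step. The following is a minimal sketch on toy rows, which are an assumption for illustration and not part of the real data.

```python
import pandas as pd

# Toy rows (not the real data); M7A has a borough but no neighbourhood
toy = pd.DataFrame({
    'Postal Code': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Boolean mask of the rows to fix, then a single vectorized assignment
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']
print(toy)
```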

In [35]:
#Check the shape of the DataFrame
df_Toronto.shape

(103, 3)

___

### PART 2

Now, we are going to get all the latitude and longitude coordinates and add them to the DataFrame

In [38]:
#Install Geocoder
! pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [None]:
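import geocoder # import geocoder

def get_lat_lng(postal_code, max_tries=5):
    # try a limited number of times, so the loop ends even if the service keeps failing
    for _ in range(max_tries):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        if g.latlng:
            return g.latlng
    return None

# geocode one postal code at a time, not the whole column
postal_code = df_Toronto['Postal Code'].iloc[0]
lat_lng_coords = get_lat_lng(postal_code)

if lat_lng_coords:
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
#The code was run many times and did not work as expected, so the coordinates are loaded from a CSV file instead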

Because the geocoder package did not work, we manually imported all the latitude and longitude coordinates, as follows:

In [43]:
df_loc = pd.read_csv('http://cocl.us/Geospatial_data')
df_loc.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now, we are going to join both dataframes, using the Postal Code column as the merging column

In [44]:
df_merge_Toronto = pd.merge(df_Toronto, df_loc, on='Postal Code')

In [46]:
df_merge_Toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
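
By default, `pd.merge` performs an inner join, so a postal code missing from either DataFrame would be dropped without warning. A hedged sketch of how the join could be checked is shown below. The toy frames and the postal code 'M9Z' are hypothetical and serve only to produce an unmatched row.

```python
import pandas as pd

# Toy frames (not the real data); the hypothetical code M9Z has no coordinates
boroughs = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every borough row; indicator=True adds a '_merge' column;
# validate raises MergeError if a postal code repeats in either frame
checked = pd.merge(boroughs, coords, on='Postal Code', how='left',
                   indicator=True, validate='one_to_one')

# Postal codes that found no coordinates in the second frame
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would suggest that every postal code received coordinates.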
