# IBM Applied Data Science Capstone // Week 3
### IBM Data Science Specialization
#### by Yohann Rousselet

This workbook is used to complete the IBM Applied Data Science Capstone, which is Course 9 of 9 in the IBM Data Science Specialization.

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Geolocation](#geo)

## Introduction <a name="introduction"></a>

The purpose of this assignment is to explore and cluster the neighborhoods in Toronto.

## Data <a name="data"></a>

The source data for this notebook will be scraped from **Wikipedia** (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

More specifically, the data will be obtained from the table of postal codes and transformed into a **pandas dataframe**.

#### Load required libraries

In [11]:
import pandas as pd
import numpy as np

#### Scrape data

In [12]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

#### Ignore cells with a borough that is 'Not assigned'

In [13]:
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace = True)
df.reset_index(drop=True, inplace = True)
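
The drop-by-index step above can also be expressed as a boolean mask. The following is a minimal sketch on a small hand-made frame; the rows are illustrative sample values, not the scraped table.

```python
import pandas as pd

# Illustrative sample rows (assumed values, not the scraped table)
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose borough is assigned, then renumber the index from 0
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Filtering into a new frame avoids `inplace=True` and leaves the original data untouched, which can make it easier to re-run a cell.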

#### Replace ' / ' with ', ' in Neighborhood column

In [14]:
df['Neighborhood'] = df['Neighborhood'].str.replace(' /', ',')
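
The replacement above matches the text ' /' literally. Because the default of the `regex` argument of `Series.str.replace` has differed between pandas versions, it may be safer to pass it explicitly. A small sketch, using made-up values in the same ' / ' style:

```python
import pandas as pd

# Illustrative values (assumed sample, not the scraped table)
names = pd.Series(["Regent Park / Harbourfront", "Parkwoods"])

# regex=False treats ' /' as a literal substring rather than a pattern
cleaned = names.str.replace(" /", ",", regex=False)
print(cleaned.tolist())
```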

#### Replace 'Not assigned' Neighborhood with Borough

In [16]:
df.loc[(df.Neighborhood == 'Not assigned'), 'Neighborhood'] = df['Borough']
df.rename(columns={"Postal code": "Postal Code"}, inplace = True)
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
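
The borough fallback used in this cell can be checked on a tiny frame. This is a sketch under the assumption that unassigned neighborhoods carry the exact string 'Not assigned'; the two rows are illustrative, not taken from the scraped table.

```python
import pandas as pd

# Illustrative rows (assumed values)
sample = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Where the neighborhood is unassigned, copy the borough name instead
unassigned = sample["Neighborhood"] == "Not assigned"
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]
print(sample)
```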


#### Display number of rows of Dataframe 

In [17]:
print("Number of rows =", df.shape[0])

Number of rows = 103


## Geolocation <a name="geo"></a>

Now that a dataframe containing the postal code, borough name, and neighborhood name of each neighborhood has been created, we need the **latitude** and **longitude** coordinates of each neighborhood in order to use the **Foursquare location data**.

#### Loop through Dataframe and get coordinates (not used; see below)

In [None]:
!pip -q install geocoder

import geocoder

# Find the coordinates for a given postal code
def geoloc(PostCD):

    # initialize the result to None
    lat_lng_coords = None

    # loop until the geocoder returns coordinates
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(PostCD))
        lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]

    return latitude, longitude

# Loop through all postal codes to find coordinates and populate the dataframe
# (the column was renamed to 'Postal Code' earlier in the notebook)
df['Latitude'], df['Longitude'] = zip(*df['Postal Code'].apply(geoloc))
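
The loop in this cell retries until the service returns coordinates, so it would never finish for a postal code that cannot be resolved. One possible variant caps the number of attempts. This is a sketch, not the notebook's method: `geoloc_with_retries`, `lookup`, `max_attempts`, `delay_seconds`, and `fake_lookup` are hypothetical names introduced here, and the lookup is a stand-in so the example runs without network access.

```python
import time

def geoloc_with_retries(postal_code, lookup, max_attempts=5, delay_seconds=0.0):
    """Return (latitude, longitude), or (None, None) if every attempt fails."""
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords[0], coords[1]
        # Pause before the next attempt (0 seconds by default)
        time.sleep(delay_seconds)
    return None, None

# Hypothetical stand-in for the geocoder call: fails twice, then succeeds
calls = {"count": 0}

def fake_lookup(postal_code):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.753259, -79.329656]

print(geoloc_with_retries("M3A", fake_lookup))
```

With the real service, `lookup` could wrap the same call used in the cell, for example `lambda pc: geocoder.arcgis('{}, Toronto, Ontario'.format(pc)).latlng`.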

#### Using csv file to get Postal Code coordinates

In [32]:
#Define url
url='http://cocl.us/Geospatial_data'

#Load CSV
pcode_df = pd.read_csv(url)
pcode_df.head()

#Merge Dataframes
df_with_pcode = pd.merge(df, pcode_df, how='inner', on='Postal Code', sort=False)
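
An inner merge silently drops postal codes that have no match in the coordinates file. A left merge with `indicator=True` is one way to check for that. The sketch below uses illustrative frames; 'M9Z' is a made-up code included only to show an unmatched row.

```python
import pandas as pd

# Illustrative frames (assumed values; 'M9Z' is a made-up unmatched code)
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column showing where each row came from
checked = pd.merge(left, coords, how="left", on="Postal Code",
                   validate="one_to_one", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

If `missing` is empty, the inner merge and the left merge contain the same rows.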

#### Display first 5 rows of Dataframe with longitude and latitude data

In [33]:
df_with_pcode.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


#### Display number of rows of Dataframe with longitude and latitude data 

In [34]:
print("Number of rows =", df_with_pcode.shape[0])

Number of rows = 103
