## NOTEBOOK on IBM Watson Studio shared via GitHub

This notebook will be used for Applied Data Science Capstone Project week 3 assignment (this is part 2 - adding coordinates to the data framework, as per instructions for 2 points)

In [1]:
import pandas as pd
import numpy as np
import requests
import urllib.request
from bs4 import BeautifulSoup
import lxml

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

Setting up the basics:

In [2]:
# Data will be retrieved from the given wiki page
wikipedia_link_to_Canada_postal_codes='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# Request and sava data in 'page' object
page = requests.get(wikipedia_link_to_Canada_postal_codes)
# Use 'BeautifulSoup' tool to work with retrieved data
soup = BeautifulSoup(page.text, 'lxml')
# Creating an empty data frame to store the data 
df = pd.DataFrame()

Next, the data from the messy page needs to be retrieved. The needed data is in the only table within that page

In [3]:
# Finding the table in the page
match_table = soup.find('table', class_='wikitable sortable')

# Needed infrmation first will be collected in lists
List_PostalCode = []
List_Borough = []
List_Neighborhood = []

# The following cyclicly goes through each row of the table and collects information
i=1
for match_element in match_table.find_all('td'):
    if i==1: List_PostalCode.append(match_element.text)
    if i==2: List_Borough.append(match_element.text)
    if i==3: 
        List_Neighborhood.append(match_element.text)
        i=0
    i=i+1

Collected data lists are assigned to the data frame

In [4]:
df['PostalCode']=List_PostalCode
df['Borough']=List_Borough
df['Neighborhood']=List_Neighborhood

Next, transformations are applied to the data to form the data frame required by this exercise.

In [5]:
# Cleaning up neighbouhood names
for index, element in df.iterrows():
    element['Neighborhood']=element['Neighborhood'].strip("\n")

# Removing rows which do not have Boroughs assigned
# Where neighbourhoods are not assigned, the Borough name is used
for index, element in df.iterrows():
    if element['Borough'] == 'Not assigned': df=df.drop(index)
    if element['Neighborhood'] == 'Not assigned': element['Neighborhood']=element['Borough']

# Dataframe is reindexed
df = df.reset_index(drop=True)

# Neighborboods under the same Boroughs are merged and separated by commas, as per example
df = df.groupby(['PostalCode','Borough'], sort=False, as_index=False).agg(', '.join)

Veryfying that the resulting data frame is as requested per exercise

In [6]:
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Not assigned
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


In [7]:
df.shape

(103, 3)

Coordinates will be retrieved from given URL with csv file:

In [8]:
df_coordinates = pd.read_csv("http://cocl.us/Geospatial_data")

# Colums for Latitude and Longitude are created
df['Latitude']=''
df['Longitude']=''

The following goes through the file with coordinates and iteratively retrieves the ones matching a certain Postal Code:

In [9]:
Lat=0
Lon=0
i=-1
for index, element in df.iterrows():
    if i<len(df):
        i=i+1
        # Finding the right coordinates
        row_coordinate = df_coordinates.loc[df_coordinates['Postal Code'] == element['PostalCode']]
        # Adding the coordinates to a list element
        Lat = row_coordinate['Latitude'].tolist()
        Lon = row_coordinate['Longitude'].tolist()
        # Assigning the values to the data frame
        df.at[i,'Latitude']=Lat[0]
        df.at[i,'Longitude']=Lon[0]

In [10]:
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7533,-79.3297
1,M4A,North York,Victoria Village,43.7259,-79.3156
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.6543,-79.3606
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.7185,-79.4648
4,M7A,Queen's Park,Not assigned,43.6623,-79.3895
5,M9A,Etobicoke,Islington Avenue,43.6679,-79.5322
6,M1B,Scarborough,"Rouge, Malvern",43.8067,-79.1944
7,M3B,North York,Don Mills North,43.7459,-79.3522
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.7064,-79.3099
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.6572,-79.3789


In [11]:
df.shape

(103, 5)