# Question 2

## Agenda: Add the latitude and longitude coordinates to the dataframe.

In [1]:

# Importing necessary libraries
# !pip install bs4
# !pip install requests
# !pip install pandas

import requests
import pandas as pd
from bs4 import BeautifulSoup


The data is extracted from Wikipedia.  
Reference: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M  
I'll use the BeautifulSoup library to parse the page and extract the human-readable data.

In [2]:
# Downloading url data from wikipedia
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_data = requests.get(url).text
CN_data = BeautifulSoup(html_data, 'lxml')

In [3]:
# Creating the dataframe and assigning column labels
column_names = ['Postalcode','Borough','Neighborhood']
Toronto = pd.DataFrame(columns = column_names)

In [4]:

# Assigning values to the dataframe

table = CN_data.find('table').tbody

# Collect one dict per table cell and build the dataframe once at the end
# (DataFrame.append was removed in pandas 2.0)
rows = []

for tr in table.find_all('tr'):
    for td in tr.find_all('td'):
        content = td.getText(separator='|', strip=True).split('|')
        # Drop fragments that are only punctuation separators
        clean_content = [value for value in content if value not in ('(', ')', '/', ',')]

        # Count the cleaned fragments so the indexing below stays in range
        number_of_entry = len(clean_content)
        if number_of_entry == 0:
            continue

        if number_of_entry == 1:
            postcode = clean_content[0]
            borough = ['']
            neighborhood = ['']
        elif number_of_entry == 2:
            postcode = clean_content[0]
            borough = clean_content[1]
            neighborhood = ['']
        else:
            postcode = clean_content[0]
            borough = clean_content[1]
            neighborhood = ','.join([str(item) for item in clean_content[2:]])

        rows.append({'Postalcode': postcode, 'Borough': borough, 'Neighborhood': neighborhood})

Toronto = pd.DataFrame(rows, columns=column_names)
Toronto.head(20)

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | [] |
| 1 | M2A | Not assigned | [] |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park,Harbourfront |
| 5 | M6A | North York | Lawrence Manor,Lawrence Heights |
| 6 | M7A | Queen's Park | (Ontario Provincial Government) |
| 7 | M8A | Not assigned | [] |
| 8 | M9A | Etobicoke | Islington Avenue |
| 9 | M1B | Scarborough | Malvern,Rouge |
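
The per-cell logic in the loop above can also be isolated into a small helper and tried on hand-written fragments, without downloading the page. This is only a sketch: `parse_cell` and `SEPARATORS` are hypothetical names, the sample fragment lists are assumptions about what `getText(separator='|')` yields for a cell, and missing fields are returned as empty strings rather than `['']`.

```python
# Punctuation-only fragments that carry no information
SEPARATORS = {'(', ')', '/', ','}

def parse_cell(parts):
    """Turn the text fragments of one table cell into (postcode, borough, neighborhood)."""
    clean = [p for p in parts if p not in SEPARATORS]
    postcode = clean[0] if clean else ''
    borough = clean[1] if len(clean) > 1 else ''
    # Everything after the borough is treated as a neighborhood name
    neighborhood = ','.join(clean[2:])
    return postcode, borough, neighborhood

# Assumed fragment lists, modeled on the rows shown in the table above
print(parse_cell(['M5A', 'Downtown Toronto', 'Regent Park', '/', 'Harbourfront']))
print(parse_cell(['M1A', 'Not assigned']))
```

Keeping the parsing in a pure function makes it easy to check edge cases, such as a cell that contains only a separator, before running the full scrape.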


In [5]:
# Cleaning Dataframe
Toronto = Toronto[Toronto.Borough != 'Not assigned']
Toronto = Toronto[Toronto.Borough != 0]
Toronto = Toronto.reset_index(drop=True)

# If a neighborhood is 'Not assigned', use the borough name instead
not_assigned = Toronto['Neighborhood'] == 'Not assigned'
Toronto.loc[not_assigned, 'Neighborhood'] = Toronto.loc[not_assigned, 'Borough']

# One row per postal code and borough, with the neighborhood names joined
dataframe = Toronto.groupby(['Postalcode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
dataframe.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern,Rouge |
| 1 | M1C | Scarborough | Rouge Hill,Port Union,Highland Creek |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [6]:
dataframe = dataframe.dropna()
null = 'Not assigned'
dataframe = dataframe[(dataframe.Postalcode != null) & (dataframe.Borough != null) & (dataframe.Neighborhood != null)]

In [7]:

dataframe.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern,Rouge |
| 1 | M1C | Scarborough | Rouge Hill,Port Union,Highland Creek |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [8]:
def neighborhood_list(grouped):    
    return ', '.join(sorted(grouped['Neighborhood'].tolist()))
                    
grp = dataframe.groupby(['Postalcode', 'Borough'])
dataframe_2 = grp.apply(neighborhood_list).reset_index(name='Neighborhood')

In [9]:
dataframe_2.rename(columns={'Postalcode' : 'Postal Code'}, inplace=True)
dataframe_2.head()



| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern,Rouge |
| 1 | M1C | Scarborough | Rouge Hill,Port Union,Highland Creek |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
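
To see the grouping step in isolation, here is a minimal sketch on a few made-up rows with one neighborhood per row. The sample rows and the name `combined` are assumptions for illustration. It uses `agg` with a lambda instead of the `apply` helper above, which should give the same kind of result for this case.

```python
import pandas as pd

# Toy input (made up for illustration): one row per neighborhood
rows = pd.DataFrame({
    'Postalcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Join the sorted neighborhood names of each (postal code, borough) pair
combined = (
    rows.groupby(['Postalcode', 'Borough'])['Neighborhood']
        .agg(lambda names: ', '.join(sorted(names)))
        .reset_index()
)
print(combined)
```

Rows that share a postal code and borough collapse into a single row whose `Neighborhood` cell lists all their names.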


## Retrieve postcode coordinates

Now that we have a dataframe with the postal code, borough name, and neighborhood name of each neighborhood, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data. The intended tool is the Google Maps Geocoding API, but it is a paid service (http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/), so the alternative is the Geocoder Python package: https://geocoder.readthedocs.io/index.html.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code. A call for a postal code can return None, and the same call made again can return the coordinates. To make sure we get coordinates for all of our neighborhoods, we can run a while loop for each postal code that repeats the call until it succeeds.
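
The retry idea can be sketched as follows. This is a minimal sketch, not the notebook's method: `get_latlng`, `max_tries`, `delay`, and `flaky_lookup` are hypothetical names, and the loop is bounded rather than a bare `while` so it cannot run forever when a postal code never resolves. The stand-in lookup fails twice and then returns the M1B coordinates from the CSV. A real lookup would wrap a Geocoder call, which needs network access.

```python
import time

def get_latlng(postal_code, lookup, max_tries=5, delay=0.0):
    """Call lookup(postal_code) until it yields coordinates or max_tries is reached."""
    for _ in range(max_tries):
        coords = lookup(postal_code)
        if coords:  # neither None nor an empty list
            return coords
        time.sleep(delay)  # pause before asking again
    return None  # give up instead of looping forever

# Stand-in for a real geocoder call: returns None twice, then coordinates.
# With the Geocoder package, lookup could instead wrap something like
# geocoder.arcgis('M1B, Toronto, Ontario').latlng (an assumption, not run here).
calls = {'n': 0}

def flaky_lookup(postal_code):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.806686, -79.194353]

coords = get_latlng('M1B', flaky_lookup)
print(coords, 'after', calls['n'], 'calls')
```

Passing the lookup in as a function keeps the retry logic independent of the geocoding provider, so it can be tested offline.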

For this task, we simply use a prepared CSV file to retrieve the coordinates.

Load the CSV with the Toronto geographical coordinates into a dataframe.

In [10]:
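toronto_geocsv = 'https://cocl.us/Geospatial_data'
# IPython expands {toronto_geocsv} to the value of the Python variable before running wget
!wget -q -O 'toronto_m.geospatial_data.csv' {toronto_geocsv}
# Read the downloaded file and index it by postal code
geocsv_data = pd.read_csv('toronto_m.geospatial_data.csv').set_index("Postal Code")
geocsv_data.head()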

| Postal Code | Latitude | Longitude |
|---|---|---|
| M1B | 43.806686 | -79.194353 |
| M1C | 43.784535 | -79.160497 |
| M1E | 43.763573 | -79.188711 |
| M1G | 43.770992 | -79.216917 |
| M1H | 43.773136 | -79.239476 |


Let's now combine the two dataframes on the postal code.

In [11]:
dataframe = pd.merge(geocsv_data, dataframe_2, on='Postal Code')
dataframe.head()

| | Postal Code | Latitude | Longitude | Borough | Neighborhood |
|---|---|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 | Scarborough | Malvern,Rouge |
| 1 | M1C | 43.784535 | -79.160497 | Scarborough | Rouge Hill,Port Union,Highland Creek |
| 2 | M1E | 43.763573 | -79.188711 | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | 43.770992 | -79.216917 | Scarborough | Woburn |
| 4 | M1H | 43.773136 | -79.239476 | Scarborough | Cedarbrae |
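
The default merge is an inner join, so a postal code with no coordinates would silently drop out of the result. One way to check for that, sketched here on toy frames, is a left join with `indicator=True`. The frames `hoods` and `geo` are hypothetical, and `geo` deliberately omits M1E to show the effect.

```python
import pandas as pd

# Toy neighborhood frame, using rows shown earlier in the notebook
hoods = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern,Rouge',
                     'Rouge Hill,Port Union,Highland Creek',
                     'Guildwood,Morningside,West Hill'],
})

# Toy coordinate frame; M1E is left out on purpose for this illustration
geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Keep every neighborhood row and record where each row found a match
checked = pd.merge(hoods, geo, on='Postal Code', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

An empty `missing` list on the real data would indicate that every postal code received coordinates.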
