### 1. Install the lxml parser

In [20]:
pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/55/6f/c87dffdd88a54dd26a3a9fef1d14b6384a9933c455c54ce3ca7d64a84c88/lxml-4.5.1-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Installing collected packages: lxml
Successfully installed lxml-4.5.1
Note: you may need to restart the kernel to use updated packages.


### 2. Import all required libraries. Assign all the values from the Wikipedia table to a pandas dataframe.

In [340]:
# This page helped me a lot: 
# https://simpleanalytical.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas

import pandas as pd
import numpy as np
import urllib.request
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml') # name the parser explicitly: the lxml parser installed in step 1

# print(soup.prettify()) # This will return HTML code for the whole wikipedia page: "List_of_postal_codes_of_Canada:_M" 

table = soup.find('table', class_='wikitable sortable')

A = []
B = []
C = []

for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells)==3:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))

# The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
# All values from lists A, B, and C are assigned to the dataframe.

df_extracted = pd.DataFrame(A,columns=['PostalCode'])
df_extracted['Borough']=B
df_extracted['Neighborhood']=C

df_extracted.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A\n | Not assigned\n | \n |
| 1 | M2A\n | Not assigned\n | \n |
| 2 | M3A\n | North York\n | Parkwoods\n |
| 3 | M4A\n | North York\n | Victoria Village\n |
| 4 | M5A\n | Downtown Toronto\n | Regent Park, Harbourfront\n |


### 3. Remove the '\n' in each cell from the dataframe:

In [341]:
# Remove the '\n' in each cell from the dataframe:

df = df_extracted.copy() # work on a copy so that df_extracted keeps the raw values
for col in ['PostalCode', 'Borough', 'Neighborhood']:
    df[col] = df[col].str.rstrip('\n') # remove only the trailing newline, never a real character

df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
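
As an alternative to cleaning one column at a time, the whitespace can be stripped from every column in a single pass. The sketch below assumes that all columns hold strings, and the `raw` frame is toy data standing in for the scraped table.

```python
import pandas as pd

# Toy frame mimicking the scraped cells, each ending in a newline.
raw = pd.DataFrame({
    "PostalCode": ["M3A\n", "M4A\n"],
    "Borough": ["North York\n", "North York\n"],
    "Neighborhood": ["Parkwoods\n", "Victoria Village\n"],
})

# Apply str.strip column by column; this removes leading and trailing whitespace only.
clean = raw.apply(lambda col: col.str.strip())

print(clean)
```

Unlike slicing off the last character, `str.strip` leaves a value untouched when it has no trailing newline.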


### 4. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [342]:
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

df_assigned = df[df['Borough'] != 'Not assigned']
df_assigned.reset_index(drop=True, inplace=True) # The index values are reset
df_assigned.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
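
The filter can also be written as a single chained expression. This is a minimal sketch on toy data (the `toy` frame is made up); taking an explicit `.copy()` of the filtered rows is one common way to make later column assignments unambiguous, so that pandas has no reason to raise a `SettingWithCopyWarning`.

```python
import pandas as pd

# Toy data: one unassigned row and two assigned rows.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["", "Parkwoods", "Victoria Village"],
})

# Keep the assigned rows, copy them, and renumber the index from 0.
assigned = toy[toy["Borough"] != "Not assigned"].copy().reset_index(drop=True)

print(assigned)
```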


### 5. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma, as shown in row 2 of the table above:

In [125]:
# The Wikipedia page has since been updated: each postal code now appears only once,
# with its neighborhoods already separated by commas.
# This step is thus not required since the dataframe is already correct.
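
Had the table still listed one neighborhood per row, the rows could be combined with a `groupby`. The following is a sketch under that assumption, run on toy data laid out the way this step describes the older page (M5A listed twice); it is not needed for the current page.

```python
import pandas as pd

# Toy data in the one-neighborhood-per-row layout.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group rows that share a postal code and borough, then join their neighborhoods.
# sort=False keeps the groups in order of first appearance.
combined = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)

print(combined)
```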

### 6. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough:

In [343]:
# We already removed all the rows where the Borough is not assigned. Now all we have to do is assign the
# value of the Borough to every row whose Neighborhood is empty, 'Not assigned', or missing (NaN).

# Let's first check whether there are any unassigned neighborhoods:

unassigned = df_assigned['Neighborhood'].isna() | df_assigned['Neighborhood'].isin(['', 'Not assigned'])
tot = int(unassigned.sum())
print("Total unassigned neighborhoods: ", tot)

Total unassigned neighborhoods:  0


##### Since there are no unassigned neighborhoods, we do not have to replace any neighborhood value with its equivalent Borough value.
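
For reference, if some neighborhoods had been unassigned, the replacement could look roughly like this. It is a sketch on invented rows (the postal code `M0Z` and its values are made up), treating empty strings, the text `Not assigned`, and `NaN` as unassigned.

```python
import pandas as pd

# Invented rows: the first one has a borough but no neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M0Z", "M3A"],
    "Borough": ["Etobicoke", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Flag the unassigned neighborhoods.
missing = toy["Neighborhood"].isna() | toy["Neighborhood"].isin(["", "Not assigned"])

# Where the flag is True, take the value from the Borough column instead.
toy["Neighborhood"] = toy["Neighborhood"].mask(missing, toy["Borough"])

print(toy)
```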

### 7. In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe:

In [344]:
df_final = df_assigned # We now know that the dataframe is final. We give it the name df_final

df_final.shape

(103, 3)

### 8. Get the latitude and longitude values for each postal code:

In [345]:
df_geo = pd.read_csv("Geospatial_Coordinates.csv")

# Left-join the latitude and longitude values from df_geo onto df_final,
# matching df_final's "PostalCode" column to df_geo's "Postal Code" column.
# A merge returns a new dataframe, keeps the row order of df_final, and leaves
# NaN coordinates for any postal code that is missing from df_geo.

df_final = df_final.merge(df_geo, left_on="PostalCode", right_on="Postal Code", how="left")
df_final = df_final.drop(columns="Postal Code") # drop the duplicate key column


In [361]:
df_final.head(50)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
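
One way to check that every postal code actually received coordinates is a left merge with `indicator=True`, which adds a `_merge` column recording where each row came from. The sketch below uses toy frames; `M0Z` is a made-up code included only to show what an unmatched row looks like.

```python
import pandas as pd

codes = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M0Z"]})  # "M0Z" is made up
geo = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True marks each row as "both", "left_only", or "right_only".
merged = codes.merge(geo, left_on="PostalCode", right_on="Postal Code",
                     how="left", indicator=True)

# Postal codes that found no match in the coordinates table.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()

print(unmatched)
```

An empty `unmatched` list means that no row will end up with `NaN` coordinates on the map.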


### 9. Create a map of Toronto

In [321]:
pip install geopy

Collecting geopy
  Downloading https://files.pythonhosted.org/packages/ab/97/25def417bf5db4cc6b89b47a56961b893d4ee4fec0c335f5b9476a8ff153/geopy-1.22.0-py2.py3-none-any.whl (113kB)
Collecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/8b/62/26ec95a98ba64299163199e95ad1b0e34ad3f4e176e221c40245f211e425/geographiclib-1.50-py3-none-any.whl
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-1.22.0
Note: you may need to restart the kernel to use updated packages.


In [322]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.22.0               |     pyh9f0ad1d_0          63 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          97 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-1.22.0-pyh9f0ad1d_0



Downloading and Extracting Packages
geopy-1.22.0         | 63 KB     | ##################################### | 100% 
geographiclib-1.50   | 34 KB     | ###############################

In [351]:
neighborhoods = df_final

In [360]:
# create map of Toronto using latitude and longitude values
latitude = 43.70011
longitude = -79.4163

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'],
                                           neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True) # parse_html=True escapes the label text
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto
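
The map center above is hard-coded. As a possible alternative, it could be derived from the data by averaging the coordinates. This sketch uses three rows copied from the table in step 8; the resulting `center` list is an assumption about what one might pass as `location` to `folium.Map`, not what this notebook does.

```python
import pandas as pd

# Three coordinate pairs taken from the dataframe shown in step 8.
points = pd.DataFrame({
    "Latitude": [43.753259, 43.725882, 43.65426],
    "Longitude": [-79.329656, -79.315572, -79.360636],
})

# The mean latitude and longitude give a rough center for the markers.
center = [points["Latitude"].mean(), points["Longitude"].mean()]

print(center)
```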