Segmenting Toronto
==================

In [1]:
# First, install Beautiful Soup and import it
!conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.



In [2]:
# Install lxml, an optional parser backend for Beautiful Soup and pandas.read_html
!conda install -c conda-forge lxml --yes


## Get the Wikipedia data and parse it

In [3]:
# Get the Wikipedia page
import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
r.raise_for_status()  # fail early if the request did not succeed

# lxml did not work in this environment, so fall back to Python's built-in parser
soup = BeautifulSoup(r.text, "html.parser")
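
Since lxml was not cooperating above, here is a minimal sketch of how the parser choice could be made automatic: try `lxml` first and fall back to `html.parser` if Beautiful Soup cannot find it. The helper name `make_soup` and the inline `sample_html` are assumptions made for illustration; they are not part of the original notebook.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Hypothetical helper: parse with lxml if available, else html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        return BeautifulSoup(html, "html.parser")


# Tiny stand-in for the Wikipedia page, so the sketch runs offline
sample_html = "<table><tr><th>Postcode</th></tr><tr><td>M3A</td></tr></table>"
sample_soup = make_soup(sample_html)

# Same idea as the notebook: grab the first table, then read its cells
first_table = sample_soup.find_all("table")[0]
cells = [td.get_text(strip=True) for td in first_table.find_all("td")]
print(cells)
```

Either parser handles a simple table like this one the same way, so the rest of the notebook should not care which branch was taken.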

In [4]:
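# We are interested in the first table on the page
table_soup = soup.find_all('table')[0]

# pandas can read an HTML table; read_html returns a list of DataFrames
import pandas as pd
from io import StringIO

# Wrapping the markup in StringIO avoids the deprecation warning newer pandas versions emit for literal HTML strings
df = pd.read_html(StringIO(str(table_soup)))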

In [5]:
# Let's have a look at the first DataFrame in the list
df[0]

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor


## Got the data, now let's process it according to the instructions:
### 1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood


In [6]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 
# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

# print out neighborhoods to check the column names (same as the New York example)
neighborhoods


Unnamed: 0,PostalCode,Borough,Neighborhood


In [7]:
# Assign the columns and make sure the dataframe has the right format
neighborhoods['PostalCode'] = df[0]['Postcode']
neighborhoods['Borough'] = df[0]['Borough']
neighborhoods['Neighborhood'] = df[0]['Neighbourhood']

# print out Neighborhoods to make sure we got the right column names and data
neighborhoods


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor
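
Creating an empty frame and filling it column by column works, but the same result can probably be reached in one step with `DataFrame.rename`. Below is a small sketch on made-up rows that mimic the scraped table; the frame `raw` stands in for `df[0]` and is an assumption for illustration only.

```python
import pandas as pd

# Toy stand-in for df[0], using the column names from the Wikipedia table
raw = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront"],
})

# rename returns a new frame; columns not listed in the mapping keep their names
renamed = raw.rename(columns={"Postcode": "PostalCode",
                              "Neighbourhood": "Neighborhood"})
print(list(renamed.columns))
```

The original frame is left untouched, which makes it easy to compare before and after.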


### 2. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.


In [8]:
# Get bad rows index and then drop them
selectedRowsToDrop = neighborhoods[neighborhoods['Borough'] == "Not assigned"].index
print("Dropped {} rows where a borough was unassigned".format(len(selectedRowsToDrop)))
neighborhoods.drop(selectedRowsToDrop, axis=0, inplace=True)
#print out neighborhoods again to make sure we are ok
neighborhoods

Dropped 77 rows where a borough was unassigned


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
...,...,...,...
282,M8Z,Etobicoke,Kingsway Park South West
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
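
An alternative to collecting the index and calling `drop` is to keep only the rows that pass a boolean mask. This is just a sketch on a toy frame (the data in `toy` is made up), and it also resets the index so the row labels are contiguous again.

```python
import pandas as pd

# Toy frame with one unassigned borough
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows whose borough is assigned; reset_index renumbers from 0
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
dropped = len(toy) - len(assigned)
print("Dropped {} rows where a borough was unassigned".format(dropped))
```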


### 3. More than one neighborhood can exist in one postal code area.
For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.

In [9]:
# groupby does the work here (found in several online tutorials); an earlier version used an explicit loop
neighborhoods = neighborhoods.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
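
To make the grouping step easier to follow, here is the M5A example from the instructions worked through on a three-row toy frame. The data is typed in by hand for illustration; `agg` is used here, which for a plain string join should behave like the `apply` call in the cell above.

```python
import pandas as pd

# M5A appears twice, once per neighborhood
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

# Group on postal code and borough, then join the neighborhood names per group
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

Grouping on both `PostalCode` and `Borough` keeps the borough as a regular column in the result instead of losing it during aggregation.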


### 4. If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.
So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [10]:
# Boolean mask of rows whose neighborhood is "Not assigned"
unassigned_neigh = neighborhoods['Neighborhood'] == "Not assigned"

# We really only find one
print("Found the following not assigned neighborhoods:")
print(neighborhoods[unassigned_neigh])

print("Renaming {} rows where a neighborhood was unassigned".format(unassigned_neigh.sum()))

# Copy the borough name into the Neighborhood column for the masked rows
neighborhoods.loc[unassigned_neigh, 'Neighborhood'] = neighborhoods.loc[unassigned_neigh, 'Borough']

print("Check that the renaming worked:")
print(neighborhoods[unassigned_neigh])


Found the following not assigned neighborhoods:
   PostalCode       Borough  Neighborhood
85        M7A  Queen's Park  Not assigned
Renaming 1 rows where a neighborhood was unassigned
Check that the renaming worked:
   PostalCode       Borough  Neighborhood
85        M7A  Queen's Park  Queen's Park
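
As a small, self-contained illustration of the step 4 rule, the sketch below applies it to a toy frame with `Series.mask`, which replaces values wherever a condition is true. The two rows are made up to mirror the Queen's Park case.

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Not assigned"],
})

# Where the neighborhood is "Not assigned", take the value from Borough instead
is_unassigned = toy["Neighborhood"] == "Not assigned"
toy["Neighborhood"] = toy["Neighborhood"].mask(is_unassigned, toy["Borough"])
print(toy)
```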


In [11]:
# 5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.


In [12]:
# 6. In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
print("Number of rows in dataframe: {}".format(neighborhoods.shape[0]))

Number of rows in dataframe: 103
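
Beyond the row count, a few quick sanity checks could confirm that the cleaning steps did what they were supposed to. The helper `check_clean` below is hypothetical and is shown on a toy frame; it only assumes the three column names used throughout this notebook.

```python
import pandas as pd


def check_clean(frame):
    """Hypothetical helper: collect simple sanity checks for the cleaned frame."""
    return {
        "rows": frame.shape[0],
        # After grouping, each postal code should appear exactly once
        "unique_postal_codes": bool(frame["PostalCode"].is_unique),
        # Steps 2 and 4 should have removed every "Not assigned" value
        "no_unassigned_borough": not (frame["Borough"] == "Not assigned").any(),
        "no_unassigned_neighborhood": not (frame["Neighborhood"] == "Not assigned").any(),
    }


toy = pd.DataFrame({
    "PostalCode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront, Regent Park", "Queen's Park"],
})
report = check_clean(toy)
print(report)
```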


In [13]:
# Print the final dataframe one more time
neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
