In [1]:
# The code was removed by Watson Studio for sharing.

# Segmenting and Clustering Neighborhoods in Toronto
## DataFrame Creation

### Introduction

In this notebook, we scrape a table from a Wikipedia page that lists every postal code in the Toronto area, along with the borough and neighborhoods each code covers. We then build a DataFrame with one row per postal code, giving its borough and all of the neighborhoods located there.

### Importing Libraries

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np
import csv

### Scraping Data

Our first step is to use the BeautifulSoup package to create an object that allows us to parse through the source code of the page for what we need.

In [4]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

Now we extract the data from the table on that page. The table body is wrapped in a `tbody` tag and each cell in a `td` tag, so we first collect every cell in the page's first table body (`soup.tbody` returns the first match) and then keep just the text of each cell.

In [5]:
# Collect the stripped text of every <td> cell in the first table body
table_list = [item.text.strip() for item in soup.tbody.find_all('td')]

table_list[0:9]  # View the first nine items in the list

['M1A',
 'Not assigned',
 'Not assigned',
 'M2A',
 'Not assigned',
 'Not assigned',
 'M3A',
 'North York',
 'Parkwoods']
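
The extraction step can be tried without a network connection. The following is a minimal sketch, assuming a small hand-written HTML snippet in place of the Wikipedia page and Python's built-in `html.parser` in place of `lxml`; the snippet and the name `cells` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: one table with two data rows.
html = """
<table class="wikitable">
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned </td></tr>
    <tr><td>M3A</td><td>North York</td><td> Parkwoods</td></tr>
  </tbody>
</table>
"""

# "html.parser" ships with Python; the notebook itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# Header cells are <th>, so find_all("td") returns only the data cells.
cells = [td.text.strip() for td in soup.tbody.find_all("td")]
print(cells)
```

Because the header row uses `th` rather than `td`, it does not end up in the list, which is why the flat list can later be cut into rows of three.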

### Wrangling

Now we begin to create our DataFrame. First, we reshape our 1D list of items into a three-column DataFrame, assuming the list contains no extraneous data that would break the repeating pattern.

In [6]:
# Define the DataFrame columns
column_names = ['PostalCode','Borough', 'Neighborhood'] 

# Reshape the flat list into an n x 3 array (-1 lets NumPy infer n), then wrap it in a DataFrame
neighborhoods = pd.DataFrame(np.array(table_list).reshape(-1, 3), columns=column_names)
neighborhoods.tail()  # Check the tail to confirm the PostalCode, Borough, Neighborhood pattern held all the way through

,PostalCode,Borough,Neighborhood
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor
287,M9Z,Not assigned,Not assigned
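
The reshape relies on the cell count being an exact multiple of three. A minimal sketch of checking that assumption before reshaping, using a made-up six-item list (`sample_cells` and `sample_df` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same flat layout as table_list.
sample_cells = ['M1A', 'Not assigned', 'Not assigned',
                'M3A', 'North York', 'Parkwoods']
column_names = ['PostalCode', 'Borough', 'Neighborhood']

# Fail early, with a readable message, if the list cannot form complete rows.
if len(sample_cells) % len(column_names) != 0:
    raise ValueError("cell count is not a multiple of the column count")

# -1 lets NumPy infer the number of rows from the list length.
sample_df = pd.DataFrame(
    np.array(sample_cells).reshape(-1, len(column_names)),
    columns=column_names,
)
print(sample_df.shape)
```

Note that a divisible length does not prove the rows are aligned, so inspecting the head and tail, as done above, is still worthwhile.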


Now we need to get rid of any PostalCodes that were not assigned a borough.

In [7]:
# Replace any "Not assigned" values with NaN
neighborhoods.replace("Not assigned", np.nan, inplace = True)

# Drop any rows with NaN in the "Borough" column
neighborhoods.dropna(subset=["Borough"], axis=0, inplace=True)

# Reset index
neighborhoods.reset_index(drop=True, inplace=True)

#Check tail to see if (for example) M9Z was dropped
neighborhoods.tail()

,PostalCode,Borough,Neighborhood
206,M8Z,Etobicoke,Kingsway Park South West
207,M8Z,Etobicoke,Mimico NW
208,M8Z,Etobicoke,The Queensway West
209,M8Z,Etobicoke,Royal York South West
210,M8Z,Etobicoke,South of Bloor
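
The same three steps can also be written as one chained expression that returns a new DataFrame instead of modifying the original in place. This is a sketch of an alternative style on made-up rows, not the notebook's own code; `sample` and `cleaned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one unassigned postal code, one complete row,
# and one row with a borough but no neighborhood.
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

cleaned = (
    sample.replace('Not assigned', np.nan)   # mark placeholders as missing
          .dropna(subset=['Borough'])        # drop rows without a borough
          .reset_index(drop=True)            # renumber rows from 0
)
print(cleaned)
```

The row for M7A survives because only the `Borough` column is checked; its missing neighborhood is handled in the next step.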


Since `NaN == NaN` evaluates to `False`, a value that is not equal to itself must be NaN. We can use this as an easy check to find which neighborhoods were "Not assigned" and give them their borough's name.

In [8]:
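for ind, neighborhood in enumerate(neighborhoods['Neighborhood']):
    if neighborhood != neighborhood:  # True only for NaN
        # Use .loc rather than chained indexing so the assignment reliably writes to the DataFrame
        neighborhoods.loc[ind, 'Neighborhood'] = neighborhoods.loc[ind, 'Borough']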
        
neighborhoods.head(10) #See M7A for example of this working

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
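
As an alternative to looping, pandas can fill the gaps in one vectorized call. A minimal sketch on two made-up rows, assuming `Series.fillna` with another Series as the fill source (values are matched by index label); `sample` is an illustrative name.

```python
import numpy as np
import pandas as pd

# NaN is the only value that is not equal to itself.
print(np.nan == np.nan)  # prints False

# Hypothetical rows: the second one has no neighborhood.
sample = pd.DataFrame({
    'PostalCode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Lawrence Heights', np.nan],
})

# Fill each missing neighborhood with the borough from the same row.
sample['Neighborhood'] = sample['Neighborhood'].fillna(sample['Borough'])
print(sample)
```

This avoids writing to the DataFrame while iterating over it and leaves existing neighborhood names untouched.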


Now we combine the neighborhoods that share a postal code by grouping the DataFrame by PostalCode and joining all strings in the Neighborhood column with a comma separator.

In [9]:
# Join the neighborhoods of each postal code into one comma-separated string
neighborhoods_concatenated = neighborhoods.groupby('PostalCode')['Neighborhood'].apply(', '.join).reset_index()

neighborhoods_concatenated.head()

,PostalCode,Neighborhood
0,M1B,"Rouge, Malvern"
1,M1C,"Highland Creek, Rouge Hill, Port Union"
2,M1E,"Guildwood, Morningside, West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae
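
Grouping sorts the postal codes alphabetically by default, which is why the output starts at M1B rather than M3A. If the original row order matters, `sort=False` keeps groups in order of first appearance. A small sketch on made-up rows (`sample` and `joined` are illustrative names):

```python
import pandas as pd

# Hypothetical rows in scraped order, with two shared postal codes.
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M1B', 'M1B', 'M3A'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Rouge', 'Malvern', 'Parkwoods'],
})

joined = (
    sample.groupby('PostalCode', sort=False)['Neighborhood']
          .agg(', '.join)      # one comma-separated string per postal code
          .reset_index()       # turn the PostalCode index back into a column
)
print(joined)
```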


We can then create a separate DataFrame with just the first two columns and drop any duplicate rows we have so that it will line up with our new `neighborhoods_concatenated` DataFrame.

In [10]:
boroughs = neighborhoods[['PostalCode','Borough']].copy()
boroughs.drop_duplicates(inplace = True)

boroughs.head()

,PostalCode,Borough
0,M3A,North York
1,M4A,North York
2,M5A,Downtown Toronto
4,M6A,North York
6,M7A,Queen's Park


Lastly, we merge `boroughs` and `neighborhoods_concatenated` on their only common column, `PostalCode`, to get our final DataFrame...

In [11]:
final = boroughs.merge(neighborhoods_concatenated, on='PostalCode')  # Name the join key explicitly
final.head()

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
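
The merge assumes each postal code appears exactly once on both sides. A minimal sketch of making that assumption explicit with the `validate` argument of `DataFrame.merge`, on made-up rows (`boroughs_sample`, `hoods_sample`, and `merged` are illustrative names):

```python
import pandas as pd

# Hypothetical inputs shaped like `boroughs` and `neighborhoods_concatenated`.
boroughs_sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
})
hoods_sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M3A'],
    'Neighborhood': ['Harbourfront, Regent Park', 'Parkwoods'],
})

# validate='one_to_one' raises pandas.errors.MergeError if a key repeats on either side.
merged = boroughs_sample.merge(
    hoods_sample, on='PostalCode', how='inner', validate='one_to_one'
)
print(merged)
```

With an inner join the result follows the row order of the left DataFrame, so the rows stay in scraped order even though the right side is sorted differently.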


...and we see that we end up with 103 different postal codes that have boroughs associated with them.

In [12]:
final.shape

(103, 3)
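
Beyond the row count, a couple of cheap checks can confirm the table has the intended shape: one row per postal code and no missing values. The helper below is a hypothetical sketch, not part of the notebook, shown on made-up rows.

```python
import pandas as pd

def check_postal_table(df):
    """Hypothetical helper: return the row count, or raise if the table looks wrong."""
    if not df['PostalCode'].is_unique:
        raise ValueError('a postal code appears more than once')
    if df.isna().any().any():
        raise ValueError('the table still contains missing values')
    return len(df)

# Made-up rows shaped like the final DataFrame.
sample_final = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Harbourfront, Regent Park'],
})
row_count = check_postal_table(sample_final)
print(row_count)
```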

### Export DataFrame

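Now, to prepare for the second part of the project, we export the DataFrame to a .csv file in project storage. The `project` object is presumably defined in the first cell, whose code Watson Studio removed for sharing.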

In [13]:
project.save_data(data = final.to_csv(index = False), file_name = 'neighborhoods.csv', overwrite = True)

{'asset_id': '601a3aab-b8bf-42a7-aadc-08152408c1dc',
 'bucket_name': 'capstoneprojectbattleoftheneighbo-donotdelete-pr-zevqh8gcg4tipk',
 'file_name': 'neighborhoods.csv',
 'message': 'File saved to project storage.'}
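
Outside Watson Studio, where the `project` object is not available, plain pandas can produce the same CSV text. A minimal sketch on made-up rows, writing to an in-memory buffer rather than a file so that it has no side effects; passing a path such as `'neighborhoods.csv'` to `to_csv` would write to disk instead.

```python
import io
import pandas as pd

# Made-up rows shaped like the final DataFrame.
sample_final = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Harbourfront, Regent Park'],
})

# index=False leaves out the row numbers, matching the notebook's export.
csv_text = sample_final.to_csv(index=False)

# Read the text back to confirm the round trip keeps every value.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Fields that contain commas, such as the joined neighborhood names, are quoted automatically, so they survive the round trip as single values.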