# Scraping and Analyzing Toronto with Python
# Part 1 - Web Scraping

## Introduction

In Part 1 of this three-part series, we read Toronto neighborhood data from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), extract the relevant data using Beautiful Soup, and load it into a data frame for pre-processing in Part 2.

### Import the necessary libraries

In [16]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Read data from Wikipedia

In [17]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

### Extract data using Beautiful Soup

In [18]:
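# One list per table column: 0 = postal code, 1 = borough, 2 = neighbourhood
data = {0: [], 1: [], 2: []}

# Locate the postal code table and collect all of its rows
table = soup.find(['table'], attrs = {'class': 'wikitable sortable'})
trs = table.tbody.find_all('tr')

# Skip the first row, which holds the <th> header cells
for tr in trs[1:]:
    tds = tr.find_all('td')

    # Strip newlines and surrounding whitespace from each cell
    for i, td in enumerate(tds):
        item = td.text.replace('\n', '').strip()
        data[i].append(item)
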
# HTML extract
# <table class="wikitable sortable">
# <tbody><tr>
# <th>Postal Code
# </th>
# <th>Borough
# </th>
# <th>Neighbourhood
# </th></tr>
# <tr>
# <td>M1A
# </td>
# <td>Not assigned
# </td>
# <td>Not assigned
# </td></tr>
# ...
# </tbody></table>
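The extraction loop can be tried offline against a small inline copy of the HTML extract above. This is a minimal sketch, not the notebook's own cell: the `html` string is a hand-written stand-in for the live page, and building `rows` directly (rather than through per-column lists) is an alternative chosen here for brevity.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (assumption: same layout as the extract above)
html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='wikitable')

rows = []
for tr in table.find_all('tr'):
    # get_text(strip=True) removes the trailing newline inside each cell
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    # The header row has only <th> cells, so it yields an empty list and is skipped
    if len(cells) == 3:
        rows.append(cells)

print(rows)
```

Checking the cell count per row also guards against `data[i]` style lookups failing if a row ever carries an unexpected number of cells.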

In [19]:
rows = []

for PostalCode, Borough, Neighborhood in zip(data[0], data[1], data[2]):
    rows.append([PostalCode, Borough, Neighborhood])

### Put data into dataframe

In [20]:
df_toronto = pd.DataFrame(rows, columns = ['PostalCode', 'Borough', 'Neighborhood'], index = data[0])
df_toronto

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| M1A | M1A | Not assigned | Not assigned |
| M2A | M2A | Not assigned | Not assigned |
| M3A | M3A | North York | Parkwoods |
| M4A | M4A | North York | Victoria Village |
| M5A | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| M5Z | M5Z | Not assigned | Not assigned |
| M6Z | M6Z | Not assigned | Not assigned |
| M7Z | M7Z | Not assigned | Not assigned |
| M8Z | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


### Process Invalid Data

If a neighborhood name is not assigned, replace it with the borough name.

In [21]:
unassigned = df_toronto['Neighborhood'] == 'Not assigned'
df_toronto.loc[unassigned, 'Neighborhood'] = df_toronto.loc[unassigned, 'Borough']

Remove rows where both neighborhood and borough names are not assigned.

In [22]:
df_toronto.drop(df_toronto[(df_toronto['Borough'] == 'Not assigned') \
    & (df_toronto['Neighborhood'] == 'Not assigned')].index, inplace = True)
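The two cleaning steps above can be checked on a tiny frame before running them on the full table. This is a sketch under stated assumptions: the frame `df` is illustrative, and the `M0X` row is made up purely to exercise the fill step. It uses a boolean mask with `.loc` for the fill and a boolean filter for the removal, which is one of several equivalent ways to write these steps.

```python
import pandas as pd

# Illustrative rows; the 'M0X' row is made up to exercise the fill step
df = pd.DataFrame(
    {
        'Borough': ['Not assigned', 'North York', 'Etobicoke'],
        'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
    },
    index=['M1A', 'M3A', 'M0X'],
)

# Step 1: fall back to the borough name where the neighborhood is unassigned
unassigned = df['Neighborhood'] == 'Not assigned'
df.loc[unassigned, 'Neighborhood'] = df.loc[unassigned, 'Borough']

# Step 2: keep only rows whose borough is assigned
cleaned = df[df['Borough'] != 'Not assigned']

print(cleaned)
```

After step 1, a neighborhood can only remain unassigned if its borough is unassigned too, so filtering on `Borough` alone removes the same rows as testing both columns.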

M7R and M7Y do not have valid neighborhood names. Replace them with their borough names.

In [23]:
# Before replacement
df_toronto.loc['M7R', 'Neighborhood'], df_toronto.loc['M7Y', 'Neighborhood']

('Canada Post Gateway Processing Centre',
 'Business reply mail Processing Centre, South Central Letter Processing Plant Toronto')

In [24]:
df_toronto.loc['M7R', 'Neighborhood'] = df_toronto.loc['M7R', 'Borough']
df_toronto.loc['M7Y', 'Neighborhood'] = df_toronto.loc['M7Y', 'Borough']

In [25]:
# After replacement
df_toronto.loc['M7R', 'Neighborhood'], df_toronto.loc['M7Y', 'Neighborhood']

('Mississauga', 'East Toronto')
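When setting a single cell, passing the row label and column label together in one `.loc[row, column]` call is the dependable form. Chained indexing such as `.loc[row][column] = value` can end up writing to a temporary copy, depending on the pandas version and settings. A minimal sketch on a one-row frame, using the M7R values shown above (the frame `df` itself is an illustrative stand-in for `df_toronto`):

```python
import pandas as pd

# One-row stand-in for df_toronto, using the M7R values from the output above
df = pd.DataFrame(
    {
        'Borough': ['Mississauga'],
        'Neighborhood': ['Canada Post Gateway Processing Centre'],
    },
    index=['M7R'],
)

# Row label and column label in a single .loc call, so the write targets df itself
df.loc['M7R', 'Neighborhood'] = df.loc['M7R', 'Borough']

print(df.loc['M7R', 'Neighborhood'])
```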

### The resulting dataframe

In [26]:
df_toronto

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| M3A | M3A | North York | Parkwoods |
| M4A | M4A | North York | Victoria Village |
| M5A | M5A | Downtown Toronto | Regent Park, Harbourfront |
| M6A | M6A | North York | Lawrence Manor, Lawrence Heights |
| M7A | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| M8X | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| M4Y | M4Y | Downtown Toronto | Church and Wellesley |
| M7Y | M7Y | East Toronto | East Toronto |
| M8Y | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


In [27]:
df_toronto.shape

(103, 3)

Write the dataframe to a CSV file for use in Part 2.

In [28]:
df_toronto.to_csv('toronto.csv')
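Because the index has no name, `to_csv` writes it as a first column with an empty header. The sketch below shows what that means when the file is read back. It assumes the file will be loaded with `pd.read_csv`, which is a guess about Part 2 rather than something stated here, and it uses an in-memory buffer and a two-row frame as stand-ins for `toronto.csv` and `df_toronto`.

```python
import io

import pandas as pd

# Two-row stand-in for df_toronto, using rows from the output above
df = pd.DataFrame(
    [['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village']],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
    index=['M3A', 'M4A'],
)

buffer = io.StringIO()  # in-memory stand-in for 'toronto.csv'
df.to_csv(buffer)

# Default read: the unnamed index comes back as a regular column called 'Unnamed: 0'
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with index_col=0 restores the postal codes as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(plain.columns.tolist())
print(restored.index.tolist())
```

Passing `index=False` to `to_csv` would be another option, since `PostalCode` already duplicates the index. Which is preferable depends on how Part 2 expects the file to look.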