# Coursera Data Science Capstone - Week 3 Assignment
Developed by: Yongkang Liu  
Created on November 4, 2019.  
Updated on November 4, 2019.

<a name="toc"></a>
# Table of contents

1. [Task 1. Web Scraping](#Q1)

1. [Task 2. Obtain Latitude and Longitude information](#Q2)

1. [Task 3. Explore and Cluster the neighbourhoods in Toronto](#Q3)


<a name="Q1"></a>
## 1. Web Scraping using Beautiful Soup
[Back to ToC](#toc)

Question: Use the notebook to build the code that scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M in order to obtain the data in the table of postal codes, and transform that data into a pandas dataframe.


### 1.1. Obtaining HTML document data

Reference: [Beautiful Soup documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)

Beautiful Soup is a Python library for pulling data out of HTML and XML files. As of November 4, 2019, the latest major version is Beautiful Soup 4, which works with Python 2.7 and Python 3.2.

In [141]:
# import modules
from bs4 import BeautifulSoup
import requests     # an HTTP client to get the document behind a URL as Beautiful Soup expects a document instead of a URL
#import lxml


In [142]:
# Obtain the HTML document from the URL
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(url)
data = r.text

In [143]:
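# Parse the data; naming the parser explicitly avoids Beautiful Soup's
# "no parser was explicitly specified" warning and keeps results consistent across machines
soup = BeautifulSoup(data, 'html.parser')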
# Find the target table
table = soup.find('table', {'class': 'wikitable sortable'})
#table

In [144]:
# Find all rows in the table
table_rows = table.find_all('tr')   # 'tr' is the table row tag in html
print(f"Got table_rows, type: {type(table_rows)}, size: {len(table_rows)}")

Got table_rows, type: <class 'bs4.element.ResultSet'>, size: 289


In [145]:
# Find the headline
print(f'Check the first row: {table_rows[0]}')

Check the first row: <tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>


In [146]:
# extract columns
columns = [th.text.strip('\n') for th in table_rows[0].find_all('th')]  # 'th' is the table header tag
print(f'The headline is {columns}')

The headline is ['Postcode', 'Borough', 'Neighbourhood']


In [147]:
# extract table rows
for rIndex, row in enumerate(table_rows):
    if rIndex > 0:
        row_content = [td.text.strip('\n') for td in row.find_all('td')]   # 'td' is the table cell tag
        print(f'Row {rIndex}: {row_content}')

Row 1: ['M1A', 'Not assigned', 'Not assigned']
Row 2: ['M2A', 'Not assigned', 'Not assigned']
Row 3: ['M3A', 'North York', 'Parkwoods']
Row 4: ['M4A', 'North York', 'Victoria Village']
Row 5: ['M5A', 'Downtown Toronto', 'Harbourfront']
Row 6: ['M5A', 'Downtown Toronto', 'Regent Park']
Row 7: ['M6A', 'North York', 'Lawrence Heights']
Row 8: ['M6A', 'North York', 'Lawrence Manor']
Row 9: ['M7A', "Queen's Park", 'Not assigned']
Row 10: ['M8A', 'Not assigned', 'Not assigned']
Row 11: ['M9A', 'Etobicoke', 'Islington Avenue']
Row 12: ['M1B', 'Scarborough', 'Rouge']
Row 13: ['M1B', 'Scarborough', 'Malvern']
Row 14: ['M2B', 'Not assigned', 'Not assigned']
Row 15: ['M3B', 'North York', 'Don Mills North']
Row 16: ['M4B', 'East York', 'Woodbine Gardens']
Row 17: ['M4B', 'East York', 'Parkview Hill']
Row 18: ['M5B', 'Downtown Toronto', 'Ryerson']
Row 19: ['M5B', 'Downtown Toronto', 'Garden District']
Row 20: ['M6B', 'North York', 'Glencairn']
Row 21: ['M7B', 'Not assigned', 'Not assigned']
Row 22:
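
The steps in this subsection can be condensed into one helper that works on any HTML string, which makes the parsing logic checkable without a network request. The following is a minimal sketch rather than the notebook's own code: the function name `parse_wikitable` and the inline `SAMPLE_HTML` snippet are hypothetical, and the built-in `html.parser` backend is an assumption chosen so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td>North York</td>
<td>Parkwoods
</td></tr>
</table>
"""


def parse_wikitable(html):
    """Return (header, rows) for the first 'wikitable' table in html."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any table whose class list contains 'wikitable'
    table = soup.find("table", class_="wikitable")
    if table is None:
        raise ValueError("no table with class 'wikitable' found")
    header, rows = [], []
    for tr in table.find_all("tr"):
        # get_text(strip=True) trims surrounding whitespace, including '\n'
        ths = [th.get_text(strip=True) for th in tr.find_all("th")]
        tds = [td.get_text(strip=True) for td in tr.find_all("td")]
        if ths and not header:
            header = ths      # first row with <th> cells is the header
        elif tds:
            rows.append(tds)  # rows with <td> cells are data rows
    return header, rows


header, rows = parse_wikitable(SAMPLE_HTML)
print(header)
print(rows)
```

Separating header and data rows by tag type (`th` versus `td`) is a design choice that removes the need to skip the first row by index.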

### 1.2. Loading data into a DataFrame

Create a Pandas dataframe to store all rows as shown above and use the headline row to name the columns.

In [148]:
lstRow = []
# extract table rows
for rIndex, row in enumerate(table_rows):
    if rIndex > 0:
        row_content = [td.text.strip('\n') for td in row.find_all('td')]
        lstRow.append(row_content)
        #print(f'Row {rIndex}: {row_content}')

In [149]:
import pandas as pd
df_tnt = pd.DataFrame(lstRow, columns=columns)
df_tnt.shape

(288, 3)

In [150]:
df_tnt.Postcode.nunique()  # number of unique Postcodes in the table, including those with no borough assigned

180

In [151]:
df_tnt.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


### 1.3. Clean and Shape the DataFrame

#### 1.3.1. Remove rows without borough assigned

"Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**."

In [152]:
df_tnt_clean = df_tnt[df_tnt['Borough']!='Not assigned']
df_tnt_clean.shape

(211, 3)

In [153]:
df_tnt_clean.Postcode.nunique()  # how many Postcodes assigned to boroughs

103

In [154]:
df_tnt_clean.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
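
Filtering with a boolean mask returns an object that pandas may treat as derived from the original frame, so later column assignments on it can raise `SettingWithCopyWarning`. Below is a minimal sketch of the same filter with an explicit `.copy()`. The frame `df_demo` and its three rows are hypothetical stand-ins for the scraped table.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped table
df_demo = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# Keep only rows with an assigned borough; .copy() makes the result an
# independent DataFrame, so later assignments modify it unambiguously
df_demo_clean = df_demo[df_demo["Borough"] != "Not assigned"].copy()
print(df_demo_clean.shape)
```

The original row labels are preserved by the filter, which is why the cleaned table above starts at index 2.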


#### 1.3.2. Update Neighbourhood names

"If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park."

In [155]:
# Check the neighbourhoods that need to be renamed
df_empty_neighbor = df_tnt_clean[df_tnt_clean['Neighbourhood']=='Not assigned']
df_empty_neighbor

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 8 | M7A | Queen's Park | Not assigned |


In [156]:
# Work on an explicit copy: df_tnt_clean was produced by filtering df_tnt, and
# assigning to a column of that filtered result triggers SettingWithCopyWarning
df_tnt_clean = df_tnt_clean.copy()

def replace_name(x, y):
    """Return y when x is 'Not assigned', otherwise return x."""
    if x == 'Not assigned':
        return y
    else:
        return x

# Row by row, fall back to the borough name when no neighbourhood is assigned
df_tnt_clean['Neighbourhood'] = df_tnt_clean.apply(lambda x : replace_name(x.Neighbourhood, x.Borough), axis=1)
#df_tnt_clean.loc[df_tnt_clean.Neighbourhood=='Not assigned', 'Neighbourhood']=df_tnt_clean['Borough']

In [157]:
df_tnt_clean.head(10) # double check M7A Queen's Park

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Queen's Park |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |
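
The fallback rule quoted at the start of this subsection can also be expressed as a single column-wise operation with `Series.mask`, which replaces values wherever a condition is true. This is a minimal sketch on a hypothetical two-row frame `df_demo`, offered as one possible alternative and not as the notebook's method.

```python
import pandas as pd

# Hypothetical sample rows; the second one has no neighbourhood assigned
df_demo = pd.DataFrame(
    {
        "Postcode": ["M6A", "M7A"],
        "Borough": ["North York", "Queen's Park"],
        "Neighbourhood": ["Lawrence Manor", "Not assigned"],
    }
)

# Series.mask replaces values where the condition is True; passing the
# Borough column as the replacement aligns the two columns by index
unassigned = df_demo["Neighbourhood"] == "Not assigned"
df_demo["Neighbourhood"] = df_demo["Neighbourhood"].mask(unassigned, df_demo["Borough"])
print(df_demo)
```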


#### 1.3.3. Group Neighbourhoods by Postcode

"More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table."

In [158]:
# Option 1
#df_tnt_unique = df_tnt_clean.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(lambda x: ','.join(x)).reset_index()
# Option 2
df_tnt_unique = df_tnt_clean.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(','.join).reset_index()

In [159]:
df_tnt_unique.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
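
As the output shows, `groupby` sorts the result by the group keys by default, so M1B now comes first. The sketch below repeats the grouping step on a small hypothetical frame `df_demo` with two assumptions that differ from the cell above: `sort=False` keeps the groups in first-seen order, and `', '` (with a space) is used as the separator.

```python
import pandas as pd

# Hypothetical sample rows: M5A appears twice with different neighbourhoods
df_demo = pd.DataFrame(
    [
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M3A", "North York", "Parkwoods"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# One row per (Postcode, Borough); neighbourhood names joined into one string
df_demo_unique = (
    df_demo.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(df_demo_unique)
```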


**The shape of the final dataframe is shown below.**

In [160]:
df_tnt_unique.shape

(103, 3)

### This is the end of Assignment Task 1

<a name="Q2"></a>
## 2. Obtain Latitude and Longitude information
[Back to ToC](#toc)

### This is the end of Task 2

<a name="Q3"></a>
## 3. Explore and Cluster the neighbourhoods in Toronto
[Back to ToC](#toc)

### This is the end of Task 3

<a name="end"></a>
## End of Notebook
[Back](#toc)