### Scrape a Wikipedia page to retrieve Toronto neighborhoods

This notebook scrapes the PostalCode, Borough and Neighborhood columns from this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) into a pandas DataFrame.
It then adds the coordinates for each postal code, loaded from `Geospatial_Coordinates.csv`.

In [26]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

The resulting DataFrame looks like this sample:
![Sample DataFrame](https://github.com/swmk/Coursera_Capstone/raw/master/sample_df_wiki.png)

In [3]:
# URL to Wikipedia page
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

#### Scrape the content from the Wikipedia page

In [4]:
# Send a request to the Wikipedia page and stop early on an HTTP error
response = requests.get(wiki_url, timeout=30)
response.raise_for_status()
site_text = response.text

# Creates BeautifulSoup instance
soup = BeautifulSoup(site_text, 'lxml')

# Retrieves html table containing the postal codes
postal_table = soup.find('table', {'class': 'wikitable sortable'})

# Retrieves all content rows without header row
all_trs = postal_table.find_all('tr')[1:]
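
The table lookup can be tried offline before hitting the live page. The following is a minimal sketch, assuming a hand-written HTML snippet (`sample_html`, not the real page source) shaped like the Wikipedia table. It uses the built-in `html.parser` and a CSS selector, a variation on the `find` call above that matches the two classes in any order.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page source (assumption: same table shape)
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1B</td><td>Scarborough</td><td>Rouge</td></tr>
  <tr><td>M1B</td><td>Scarborough</td><td>Malvern</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Select the table that carries both classes, regardless of their order
sample_table = sample_soup.select_one('table.wikitable.sortable')

# Skip the header row, then read the text of every cell in each data row
sample_trs = sample_table.find_all('tr')[1:]
sample_cells = [[td.get_text(strip=True) for td in tr.find_all('td')]
                for tr in sample_trs]
```

If `select_one` returns `None`, the class names on the page have probably changed and the selector needs updating.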

#### Transform content into pandas DataFrame
Rows whose borough is "Not assigned" are skipped.

In [5]:
# Column names and an empty DataFrame to hold the scraped rows
df_cols = ['PostalCode', 'Borough', 'Neighborhood']
df = pd.DataFrame(columns=df_cols)

In [6]:
# Walk the table rows, turning each row's cell text into a dictionary
# keyed by column name. Rows whose borough is 'Not assigned' are skipped.
rows = []
for tr in all_trs:
    tds = tr.find_all('td')
    # Pair each column name with the stripped text of its cell
    df_row = {col: val.text.rstrip() for col, val in zip(df_cols, tds)}
    if df_row['Borough'] != 'Not assigned':
        rows.append(df_row)

# Build the DataFrame in one step; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows, columns=df_cols)
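
To see the row filter in isolation, here is a small sketch that works on hand-written cell strings instead of parsed HTML. The `sample_rows` values are illustrative stand-ins for `td.text` (the `M1A` row is made up for the example). Rows are collected in a list and the DataFrame is built once at the end.

```python
import pandas as pd

df_cols = ['PostalCode', 'Borough', 'Neighborhood']

# Illustrative cell text, with the trailing newline that td.text often carries
sample_rows = [
    ['M1A\n', 'Not assigned\n', 'Not assigned\n'],
    ['M1B\n', 'Scarborough\n', 'Rouge\n'],
    ['M1B\n', 'Scarborough\n', 'Malvern\n'],
]

records = []
for cells in sample_rows:
    # Pair column names with stripped cell text
    row = {col: text.rstrip() for col, text in zip(df_cols, cells)}
    # Keep only rows that have an assigned borough
    if row['Borough'] != 'Not assigned':
        records.append(row)

sample_df = pd.DataFrame(records, columns=df_cols)
```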

#### Clean and prepare data.
* Replace "Not assigned" Neighborhood with value from Borough.
* Merge multiple Neighborhood values for the same PostalCode into one row of the DataFrame.

In [7]:
# Clean and prep the data frame

# Where the neighborhood is 'Not assigned', copy the borough value into it.
not_assigned = df['Neighborhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighborhood'] = df.loc[not_assigned, 'Borough']

# Join the neighborhoods that share a postal code into one comma-separated row.
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].agg(', '.join).reset_index()
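
The two cleaning steps can be checked on a tiny frame. This is a sketch under assumptions: the `M9Z` / `Example Borough` row is invented to exercise the "Not assigned" case, and the fill uses a boolean mask with `.loc`.

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Example Borough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Not assigned'],
})

# Step 1: fall back to the borough name where no neighborhood is assigned
mask = sample['Neighborhood'] == 'Not assigned'
sample.loc[mask, 'Neighborhood'] = sample.loc[mask, 'Borough']

# Step 2: one row per postal code, neighborhoods joined with ', '
cleaned = (sample.groupby(['PostalCode', 'Borough'])['Neighborhood']
           .agg(', '.join)
           .reset_index())
```

After these steps `cleaned` has one row per postal code, and the invented `M9Z` row carries its borough name as the neighborhood.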

In [8]:
df.shape

(103, 3)

In [9]:
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [11]:
# Load Geospatial data
geo_df = pd.read_csv('Geospatial_Coordinates.csv')

In [16]:
# Left-merge the coordinates onto the postal code DataFrame, matching on postal code
merged_df = pd.merge(df, geo_df, how='left', left_on='PostalCode', right_on='Postal Code')

In [24]:
# Drop the duplicate 'Postal Code' key column left over from the merge
merged_df = merged_df.drop(columns=['Postal Code'])
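
Because the merge is a left join, a postal code with no match in the coordinates file stays in the result with empty coordinate cells. The sketch below shows one way to spot such rows with the `indicator` option of `merge`. The `Latitude` and `Longitude` column names and the coordinate values are placeholders assumed for illustration, not values from `Geospatial_Coordinates.csv`.

```python
import pandas as pd

codes = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})

# Placeholder coordinates (assumed column names, made-up values)
coords = pd.DataFrame({
    'Postal Code': ['M1B'],
    'Latitude': [0.0],
    'Longitude': [0.0],
})

# indicator=True adds a '_merge' column recording where each row matched
checked = codes.merge(coords, how='left',
                      left_on='PostalCode', right_on='Postal Code',
                      indicator=True)

# Postal codes that found no coordinates on the right-hand side
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()

# Remove the duplicate key column and the helper column
checked = checked.drop(columns=['Postal Code', '_merge'])
```

An empty `missing` list means every postal code received coordinates.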