# Applied Data Science - Capstone Project

## What's required in this assignment
### To create the above dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood - <span style="color: green"><big>&#10004;</big></span>
- Only process cells that have complete information; skip cells that are greyed out or marked 'Not assigned'  - <span style="color: green"><big>&#10004;</big></span>
- For each cell, the postal code will go under the PostalCode column, the first line under the postal code will go under Borough, and the remaining lines will go under the Neighborhood column formatted nicely and separated with commas as shown in the sample dataframe above. For example, for cell (1, 3) on the Wikipedia page, M3A will go under PostalCode, North York will go under Borough, and Parkwoods will go under Neighborhood  - <span style="color: green"><big>&#10004;</big></span>
- If a cell has only one line under the postal code, like cell (1, 7), then that line will go under the Borough and the Neighborhood columns. So for cell (1, 7), the value of the Borough and the Neighborhood column will be Queen's Park  - <span style="color: green"><big>&#10004;</big></span>
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making  - <span style="color: green"><big>&#10004;</big></span>
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe  - <span style="color: green"><big>&#10004;</big></span>
- Submit a link to your Notebook on your Github repository - <span style="color: green"><big>&#10004;</big></span>
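
The requirement that neighborhoods sharing a postal code end up in one comma-separated value could be met with a `groupby`. The following is only a minimal sketch on toy rows, not the notebook's own code: the column names follow the requirement list, and the `X1X` / `Example Borough` / `Alpha` / `Beta` values are hypothetical placeholders.

```python
import pandas as pd

# Toy rows, one neighborhood per row (the X1X rows are made-up placeholders)
rows = pd.DataFrame({
    "PostalCode": ["M3A", "X1X", "X1X"],
    "Borough": ["North York", "Example Borough", "Example Borough"],
    "Neighborhood": ["Parkwoods", "Alpha", "Beta"],
})

# Collapse rows that share a postal code and borough,
# joining their neighborhoods into one comma-separated string
combined = (
    rows.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
        .agg(", ".join)
)
print(combined)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result without needing a separate aggregation for it.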

## Install required Python packages

In [43]:
!conda install -c conda-forge geopy --yes 
!conda install -c conda-forge folium --yes 
!conda install -c conda-forge pyquery --yes

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.



## Get Wiki page containing Toronto Boroughs/Neighborhoods
### Note: using pandas.io.html to read the wiki table into a pandas DataFrame

In [53]:
import io
import requests
import numpy as np
import pandas as pd
from pandas.io.html import read_html

# Define the wiki page URL
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# Issue an HTTP request to get the page content and fail early on HTTP errors
req = requests.get(WIKI_URL)
req.raise_for_status()
# Parse the downloaded HTML with pandas read_html (reuses the response instead of fetching the page twice)
wikitables = read_html(io.StringIO(req.text), index_col=None, header=0, attrs={"class":["sortable","wikitable"]})
# The first matching table holds the postal codes
Toronto = wikitables[0]

## Data cleaning

In [54]:
# Convert empty entries to np.nan so they can be dropped in the next step
Toronto['Borough'] = Toronto['Borough'].replace('', np.nan)
# Drop rows with a missing 'Borough', since they carry no meaningful data
Toronto = Toronto.dropna(subset=['Borough'])
# Keep only rows whose 'Borough' is assigned; copy so later edits act on an independent frame
Toronto = Toronto[Toronto['Borough'] != 'Not assigned'].copy()
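
The effect of these cleaning steps can be checked on a small table. This is a sketch on toy data only: the `raw` frame and the `Example Borough` / `Alpha` values are hypothetical, and the `Neighbourhood` column name is an assumption about the scraped table.

```python
import numpy as np
import pandas as pd

# Toy table (hypothetical values) mimicking the raw scrape
raw = pd.DataFrame({
    "Borough": ["North York", "", "Not assigned", "Example Borough"],
    "Neighbourhood": ["Parkwoods", "Alpha", "Not assigned", "Not assigned"],
})

# Treat empty strings as missing, then drop rows with no borough
cleaned = raw.assign(Borough=raw["Borough"].replace("", np.nan))
cleaned = cleaned.dropna(subset=["Borough"])
# Keep only rows with an assigned borough
cleaned = cleaned[cleaned["Borough"] != "Not assigned"].copy()

print(len(raw), "rows before,", len(cleaned), "rows after")
```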

## Data processing - replace 'Not assigned' neighbourhoods with the borough

In [55]:
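# Where 'Neighbourhood' is 'Not assigned', use the row's 'Borough' instead.
# A single .loc call with a boolean mask writes to the dataframe itself;
# chained indexing (Toronto.loc[i][...] = ...) would only modify a temporary copy.
not_assigned = Toronto['Neighbourhood'] == 'Not assigned'
Toronto.loc[not_assigned, 'Neighbourhood'] = Toronto.loc[not_assigned, 'Borough']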
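
As a small self-contained illustration of this fill step, the sketch below applies it to a two-row toy frame. The frame `df` is hypothetical, and the `Neighbourhood` column name is assumed to match the scraped table.

```python
import pandas as pd

# Toy frame: the first row has no neighbourhood assigned
df = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Select the affected rows once, then assign through a single .loc call
# so the write reaches df rather than a temporary copy
mask = df["Neighbourhood"] == "Not assigned"
df.loc[mask, "Neighbourhood"] = df.loc[mask, "Borough"]

print(df)
```

Assigning through one `.loc[mask, column]` call avoids the silent no-op that chained indexing can produce.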

## Reindex the dataframe

In [51]:
# Rebuild a clean 0..n-1 index after filtering; drop=True discards the old index
# instead of inserting it as an extra 'index' column
Toronto.reset_index(drop=True, inplace=True)
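
One detail worth checking when resetting an index is whether the old index is kept as a column. The sketch below, on a hypothetical three-row frame, compares `reset_index()` with `reset_index(drop=True)`.

```python
import pandas as pd

df = pd.DataFrame({"Borough": ["A", "B", "C"]})
# Filtering leaves gaps in the index: the remaining labels are 0 and 2
filtered = df[df["Borough"] != "B"]

# Default behaviour: the old index is moved into a new 'index' column
with_old_index = filtered.reset_index()
# drop=True: the old index is discarded and the column count is unchanged
clean = filtered.reset_index(drop=True)

print(with_old_index.shape, clean.shape)
```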

## Dataframe shape

In [57]:
# Check dataframe shape
Toronto.shape

(212, 3)

## Number of rows in the dataframe

In [58]:
# Print the number of rows in the dataframe
print('Number of rows in Toronto dataframe: {}'.format(Toronto.shape[0]))

Number of rows in Toronto dataframe: 212


# Thank you for reviewing my work!