# Applied Data Science - Capstone Project

## What's required in this assignment
### To create the above dataframe:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood - <span style="color: green"><big>&#10004;</big></span>
- Only process the cells that have complete information and are neither greyed out nor marked "Not assigned" - <span style="color: green"><big>&#10004;</big></span>
- For each cell, the postal code will go under the PostalCode column, the first line under the postal code will go under Borough, and the remaining lines will go under the Neighborhood column formatted nicely and separated with commas as shown in the sample dataframe above. For example, for cell (1, 3) on the Wikipedia page, M3A will go under PostalCode, North York will go under Borough, and Parkwoods will go under Neighborhood  - <span style="color: green"><big>&#10004;</big></span>
- If a cell has only one line under the postal code, like cell (1, 7), then that line will go under the Borough and the Neighborhood columns. So for cell (1, 7), the value of the Borough and the Neighborhood column will be Queen's Park  - <span style="color: green"><big>&#10004;</big></span>
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making  - <span style="color: green"><big>&#10004;</big></span>
- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe  - <span style="color: green"><big>&#10004;</big></span>
- Submit a link to your Notebook on your Github repository - <span style="color: green"><big>&#10004;</big></span>
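
The cell-to-row rules in the list above can be sketched in plain Python. This is only an illustration of the rules, not the code used later in this notebook: `cell_to_row` and its `cell_lines` input (the postal code followed by the other lines of one table cell) are hypothetical names, and the postal code `M7A` for cell (1, 7) is an assumption based on the row order of the table.

```python
def cell_to_row(cell_lines):
    """Turn the text lines of one table cell into (PostalCode, Borough, Neighborhood).

    cell_lines is a hypothetical input: the postal code first, then the
    remaining lines of the cell. Returns None for cells without an assignment.
    """
    lines = [line.strip() for line in cell_lines if line.strip()]
    if len(lines) < 2 or lines[1] == "Not assigned":
        return None
    postal_code, borough = lines[0], lines[1]
    # With only one line under the postal code, it serves as both columns.
    neighborhoods = lines[2:] or [borough]
    return postal_code, borough, ", ".join(neighborhoods)


print(cell_to_row(["M3A", "North York", "Parkwoods"]))
print(cell_to_row(["M7A", "Queen's Park"]))   # postal code assumed for cell (1, 7)
print(cell_to_row(["M1A", "Not assigned"]))
```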

## Install required Python packages

In [43]:
!conda install -c conda-forge geopy --yes 
!conda install -c conda-forge folium --yes 
!conda install -c conda-forge pyquery --yes

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.



## Get Wiki page containing Toronto Boroughs/Neighborhoods
### Note: using requests and BeautifulSoup to fetch the page, then pandas.read_html to load the wiki table into a pandas DataFrame

In [20]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'lxml')
wikitables = soup.find_all('table') 
Toronto = pd.read_html(str(wikitables[0]), index_col=None, header=0)[0]
Toronto.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [21]:
# TO VERIFY THAT BOTH METHODS PRODUCE SIMILAR DIMENSIONS
Toronto.shape

(289, 3)

## Alternative way to read in the content and produce a dataframe using pandas.io.html
### Note: Results are consistent between the two methods

In [22]:
import numpy as np
import pandas as pd
from pandas.io.html import read_html

# URL of the Wikipedia page
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# read_html fetches the URL itself, so no separate HTTP request is needed
wikitables = read_html(WIKI_URL, index_col=None, header=0, attrs={"class":["sortable","wikitable"]})
# The first matching table is the postal code table
Toronto = wikitables[0]
Toronto.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [23]:
# TO VERIFY THAT BOTH METHODS PRODUCE SIMILAR DIMENSIONS
Toronto.shape

(289, 3)

## Data cleaning

In [24]:
# Convert empty entries to np.nan so they can be dropped in the next step
Toronto['Borough'] = Toronto['Borough'].replace('', np.nan)
# Drop rows whose 'Borough' is missing, since they carry no meaningful data
Toronto = Toronto.dropna(subset=['Borough'])
# Drop rows whose 'Borough' is 'Not assigned'; copy so later assignments modify this frame
Toronto = Toronto[Toronto['Borough'] != 'Not assigned'].copy()
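
To see what the cleaning step keeps and drops, here is a minimal sketch on a small hand-made sample. The rows are illustrative only: `M9Z` with an empty borough is invented to exercise the empty-string case, and the `sample` and `cleaned` names are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hand-made sample; "M9Z" is a made-up code used only to show the empty-string case.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Queen's Park", ""],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned", "Not assigned"],
})

# Treat empty strings as missing, then drop missing and unassigned boroughs.
sample["Borough"] = sample["Borough"].replace("", np.nan)
cleaned = sample.dropna(subset=["Borough"])
cleaned = cleaned[cleaned["Borough"] != "Not assigned"].copy()

print(cleaned["Postcode"].tolist())
```

Only the rows with a real borough remain; rows whose neighbourhood alone is "Not assigned" are kept for the next step.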

## Data processing - replace 'Not assigned' neighbourhoods with the borough name

In [25]:
# Where 'Neighbourhood' is 'Not assigned', use the borough name instead.
# A single .loc call with a boolean mask writes to the dataframe itself.
not_assigned = Toronto['Neighbourhood'] == 'Not assigned'
Toronto.loc[not_assigned, 'Neighbourhood'] = Toronto.loc[not_assigned, 'Borough']
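
A note on how this replacement behaves in pandas: an assignment of the form `df.loc[i]['col'] = value` usually writes to a temporary copy of the row, so the dataframe is left unchanged, whereas one `.loc` call with a row mask and a column label writes in place. A minimal sketch on a hand-made two-row sample (the `sample` and `mask` names are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Select the rows to fix, then assign through a single .loc call.
mask = sample["Neighbourhood"] == "Not assigned"
sample.loc[mask, "Neighbourhood"] = sample.loc[mask, "Borough"]

print(sample["Neighbourhood"].tolist())
```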

## Dataframe shape

In [26]:
# Check dataframe shape
Toronto.shape

(212, 3)

## Number of rows in the dataframe

In [27]:
# Print the number of rows in the dataframe
print('Number of rows in Toronto dataframe: {}'.format(Toronto.shape[0]))

Number of rows in Toronto dataframe: 212
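
The requirements list asks for the neighbourhoods of a postal code to appear on one row, separated by commas, while the dataframe here keeps one row per neighbourhood. Below is a minimal sketch of one way to combine them, run on a small hand-made sample rather than the full table. The `sample` and `combined` names are illustrative, and this step was not part of the run above, so the row count of 212 does not reflect it.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

# Group rows that share a postal code and borough, then join their neighbourhoods.
combined = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```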


In [82]:
!conda install -c conda-forge geocoder --yes

Solving environment: done

## Package Plan ##

  environment location: /opt/ibm/conda/miniconda3

  added / updated specs: 
    - geocoder


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geocoder-1.38.1            |             py_0          52 KB  conda-forge
    orderedset-2.0             |           py35_0         685 KB  conda-forge
    ratelim-0.1.6              |           py35_0           5 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         742 KB

The following NEW packages will be INSTALLED:

    geocoder:   1.38.1-py_0  conda-forge
    orderedset: 2.0-py35_0   conda-forge
    ratelim:    0.1.6-py35_0 conda-forge


Downloading and Extracting Packages
geocoder-1.38.1      | 52 KB     | ##################################### | 100% 
orderedset-2.0       | 685 KB    | ##########################

## API compensator - at times geocoder.google returns None for the same postal code
### Create a dictionary of coordinates keyed by postal code, to be added to the dataframe in the next step

In [28]:
import geocoder
from collections import defaultdict

# Map each postal code to its coordinates. defaultdict(list) returns an
# empty list for any postal code that could not be geocoded.
latitude = defaultdict(list)
longitude = defaultdict(list)
for _, row in Toronto.iterrows():
    postcode = row['Postcode']
    g = geocoder.google('{}, Toronto, Ontario'.format(postcode.strip()))
    lat_lng_coords = g.latlng

    # g.latlng is None when the lookup fails; skip those postal codes
    if lat_lng_coords is not None:
        latitude[postcode] = lat_lng_coords[0]
        longitude[postcode] = lat_lng_coords[1]
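
Because the geocoding call sometimes returns None for a postal code that resolves on another attempt, one option is to retry each lookup a few times with a short pause. The sketch below is an assumption about how that could be done, not what this notebook ran: `latlng_with_retry`, its `attempts` and `delay` parameters, and the `flaky_lookup` stand-in are all hypothetical, and the stand-in replaces the real service so the example runs offline.

```python
import time


def latlng_with_retry(lookup, query, attempts=3, delay=1.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for attempt in range(attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        # Pause before the next try, but not after the final one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return None


# Stand-in for the geocoding service: fails twice, then succeeds.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.7533, -79.3297]


print(latlng_with_retry(flaky_lookup, "M3A, Toronto, Ontario", delay=0))
```

With the geocoder package, the `lookup` argument could be something like `lambda q: geocoder.google(q).latlng`, which mirrors the call used in this notebook.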

In [29]:
# Collect the coordinates for every row; postal codes without a result yield []
lat = []
lon = []
for postcode in Toronto['Postcode']:
    lat.append(latitude[postcode])
    lon.append(longitude[postcode])

In [30]:
Toronto = Toronto.assign(Latitude=lat, Longitude=lon)
Toronto.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M3A | North York | Parkwoods | 43.7533 | -79.3297 |
| 3 | M4A | North York | Victoria Village | [] | [] |
| 4 | M5A | Downtown Toronto | Harbourfront | 43.6543 | -79.3606 |
| 5 | M5A | Downtown Toronto | Regent Park | 43.6543 | -79.3606 |
| 6 | M6A | North York | Lawrence Heights | [] | [] |
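
The `[]` entries mark postal codes for which no coordinates were returned. If plain dictionaries are used instead, mapping the `Postcode` column through them leaves those rows as NaN, which makes the gaps easy to list. A minimal sketch using the values shown above (the `sample` and `missing` names are illustrative and not part of the notebook):

```python
import pandas as pd

sample = pd.DataFrame({"Postcode": ["M3A", "M4A", "M5A"]})

# Plain dicts: postal codes without a geocoding result are simply absent.
latitude = {"M3A": 43.7533, "M5A": 43.6543}
longitude = {"M3A": -79.3297, "M5A": -79.3606}

# Series.map leaves NaN where the dict has no entry for the postal code.
sample["Latitude"] = sample["Postcode"].map(latitude)
sample["Longitude"] = sample["Postcode"].map(longitude)

# List the postal codes that still need coordinates.
missing = sample.loc[sample["Latitude"].isna(), "Postcode"].tolist()
print(missing)
```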


In [None]:
Toronto.to_csv('Toronto.csv')

In [19]:
Toronto = Toronto.reset_index(drop=True)
Toronto.head()

# Thank you for reviewing my work!