# Applied Data Science Capstone - Week 3
## Segmenting and Clustering Neighborhoods in Toronto
- Data source: Wikipedia website https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
- The dataframe will consist of three columns: Postcode, Borough, and Neighborhood.
- Only process the cells that have an assigned Borough. Ignore cells with a Borough that is not assigned.
- More than one neighborhood can exist in one postal code area. Such rows will be combined into one row, with the neighborhoods separated by commas.
- If a cell has a borough but its neighborhood is 'Not assigned', then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
- Submit a link to your Notebook on your Github repository. (10 marks)

### Install beautifulsoup if necessary and import libraries

In [1]:
!pip install beautifulsoup4
import pandas as pd
from bs4 import BeautifulSoup
import requests



## Web Scraping
### 1. Read Wikipedia page
### 2. Parse HTML with standard parser

In [2]:
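# fetch the page and stop early if the request failed (e.g. 404 or 500)
response = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')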

### Find table on Wikipedia page

In [3]:
table = soup.find('table',{'class':'wikitable sortable'})

### Find all rows in the table

In [4]:
trs = table.find_all('tr')
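
To see what `find` and `find_all` return without fetching the live page, the following minimal sketch runs the same calls on a small hand-written table. The HTML string and the `sample_*` names are assumptions made up for illustration; they only mimic the layout of the Wikipedia table.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (illustrative only).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', {'class': 'wikitable sortable'})
sample_trs = sample_table.find_all('tr')

# One list of cell texts per row; the header row contains no <td> cells.
sample_rows = [[td.text.strip() for td in tr.find_all('td')] for tr in sample_trs]
print(len(sample_trs))
print(sample_rows)
```

The header row is found as a `tr` like any other, but it contributes an empty list because its cells are `th` elements.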

### Append rows

In [5]:
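# Collect the text of every <td> cell, one list per table row.
# The header row holds only <th> cells, so it yields an empty list,
# which pandas pads with None.
rows = []
for tr in trs:
    rows.append([td.text.strip() for td in tr.find_all('td')])

df = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighborhood'])
# Drop the all-None row produced by the header.
df = df[~df['Postcode'].isnull()]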

print(df.head())
print('---')
print(df.tail())

  Postcode           Borough      Neighborhood
1      M1A      Not assigned      Not assigned
2      M2A      Not assigned      Not assigned
3      M3A        North York         Parkwoods
4      M4A        North York  Victoria Village
5      M5A  Downtown Toronto      Harbourfront
---
    Postcode       Borough           Neighborhood
283      M8Z     Etobicoke              Mimico NW
284      M8Z     Etobicoke     The Queensway West
285      M8Z     Etobicoke  Royal York South West
286      M8Z     Etobicoke         South of Bloor
287      M9Z  Not assigned           Not assigned
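
The printed index starts at 1 because the row built from the table header is filtered out. A minimal sketch of that behaviour, using made-up `demo_*` names and a two-row list in place of the scraped data:

```python
import pandas as pd

# First entry mimics the header row (no <td> cells), second a regular row.
demo_rows = [[], ['M3A', 'North York', 'Parkwoods']]
demo_df = pd.DataFrame(demo_rows, columns=['Postcode', 'Borough', 'Neighborhood'])

# The short row is padded with None, so isnull() flags it.
print(demo_df['Postcode'].isnull().tolist())

# Same filter as in the cell above: keep rows that have a Postcode.
demo_df = demo_df[~demo_df['Postcode'].isnull()]
print(demo_df)
```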


### Remove rows with borough='Not assigned' and reindex

In [6]:
# keep only rows whose borough is assigned, then renumber the index from 0
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)

print(df.head())
print('---')
print(df.tail())

  Postcode           Borough      Neighborhood
0      M3A        North York         Parkwoods
1      M4A        North York  Victoria Village
2      M5A  Downtown Toronto      Harbourfront
3      M6A        North York  Lawrence Heights
4      M6A        North York    Lawrence Manor
---
    Postcode    Borough              Neighborhood
205      M8Z  Etobicoke  Kingsway Park South West
206      M8Z  Etobicoke                 Mimico NW
207      M8Z  Etobicoke        The Queensway West
208      M8Z  Etobicoke     Royal York South West
209      M8Z  Etobicoke            South of Bloor


### If there is more than one neighborhood for the same postcode, aggregate to 1 row with neighborhoods separated by commas and re-index

In [7]:
df = df.groupby(['Postcode', 'Borough'])['Neighborhood'].agg(', '.join).reset_index()

print(df.head())
print('---')
print(df.tail())

  Postcode      Borough                            Neighborhood
0      M1B  Scarborough                          Rouge, Malvern
1      M1C  Scarborough  Highland Creek, Rouge Hill, Port Union
2      M1E  Scarborough       Guildwood, Morningside, West Hill
3      M1G  Scarborough                                  Woburn
4      M1H  Scarborough                               Cedarbrae
---
    Postcode    Borough                                       Neighborhood
98       M9N       York                                             Weston
99       M9P  Etobicoke                                          Westmount
100      M9R  Etobicoke  Kingsview Village, Martin Grove Gardens, Richv...
101      M9V  Etobicoke  Albion Gardens, Beaumond Heights, Humbergate, ...
102      M9W  Etobicoke                                          Northwest


### If neighborhood = 'Not assigned' then set neighborhood = borough

In [8]:
print('Example:')
print('Postcode M7A old:')
print(df.loc[df['Postcode'] == 'M7A'])

df.loc[df['Neighborhood']=="Not assigned",'Neighborhood']=df.loc[df['Neighborhood']=="Not assigned",'Borough']

print('---------------------------------------')
print('Postcode M7A new:')
print(df.loc[df['Postcode'] == 'M7A'])

Example:
Postcode M7A old:
   Postcode       Borough  Neighborhood
85      M7A  Queen's Park  Not assigned
---------------------------------------
Postcode M7A new:
   Postcode       Borough  Neighborhood
85      M7A  Queen's Park  Queen's Park
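
The same replacement can also be written with `Series.mask`, which some readers may find easier to scan than the two matching `.loc` selections. This is an alternative sketch on made-up toy rows, not the code used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# mask() replaces values where the condition is True
# with the value from the same row of the Borough column.
toy['Neighborhood'] = toy['Neighborhood'].mask(
    toy['Neighborhood'] == 'Not assigned', toy['Borough']
)
print(toy)
```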


## Show the number of rows and columns in the dataframe

In [9]:
df.shape

(103, 3)

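## Show the number of boroughs and postcode areas in the dataframe

In [10]:
# Each row is one postcode area (neighborhoods sharing a postcode were merged above),
# so the row count is the number of postcode areas, not of individual neighborhoods.
print('The dataframe has {} boroughs and {} postcode areas.'.format(
        df['Borough'].nunique(),
        df.shape[0]
    )
)

The dataframe has 11 boroughs and 103 postcode areas.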
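
Because neighborhoods that share a postcode were joined into one row, the row count equals the number of postcode rows. If the number of distinct neighborhood names is wanted instead, one possible approach is to split the joined strings again. The sketch below uses three rows from the output above and assumes that no neighborhood name itself contains `', '`.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1G'],
    'Neighborhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union', 'Woburn'],
})

# Split each joined string into a list, then give every name its own row.
names = toy['Neighborhood'].str.split(', ').explode()
print(len(toy), names.nunique())
```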


## This Jupyter Notebook is available on GitHub
('Applied Data Science Capstone Week 3.ipynb')

https://github.com/steveshep/Coursera_Capstone/blob/master/Applied%20Data%20Science%20Capstone%20Week%203.ipynb

## Get latitude and longitude for each postcode in the dataframe
### Use the CSV file, because geocoder.google() does not work

In [11]:
# copy the dataframe for further processing with geo data
# (.copy() is needed; plain assignment would only give df a second name)
geo_df = df.copy()

# add empty Latitude and Longitude columns to the new dataframe
geo_df['Latitude'] = ''
geo_df['Longitude'] = ''

In [12]:
# read the CSV file with the geo coordinates of each postcode into a dataframe, as the geocoders don't work very well
geo_coordinates = pd.read_csv('https://cocl.us/Geospatial_data')

In [13]:
# define function to look up latitude and longitude for a postcode in the coordinates dataframe
def get_geo_coord(postcode):
    match = geo_coordinates.loc[geo_coordinates['Postal Code'] == postcode].iloc[0]
    return match['Latitude'], match['Longitude']

# loop to add latitude and longitude to the dataframe;
# .loc avoids the chained assignment geo_df['Latitude'][i] = ..., which may write to a copy
for i in range(len(geo_df)):
    lat, long = get_geo_coord(geo_df.loc[i, 'Postcode'])
    geo_df.loc[i, 'Latitude'] = lat
    geo_df.loc[i, 'Longitude'] = long
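
As an alternative to the row-by-row loop, the coordinates could be attached in one step with a join on the postcode. The following is a minimal sketch under the assumption that the coordinates file has the columns `Postal Code`, `Latitude` and `Longitude` used above. The `toy_*` frames are made up for illustration, with values taken from this notebook's output.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Postcode': ['M4E', 'M4K'],
    'Borough': ['East Toronto', 'East Toronto'],
})
toy_coords = pd.DataFrame({
    'Postal Code': ['M4K', 'M4E', 'M4L'],
    'Latitude': [43.6796, 43.6764, 43.669],
    'Longitude': [-79.3522, -79.293, -79.3156],
})

# A left join keeps every row of toy_df in its original order and
# pulls in the matching coordinates; the duplicate key column is dropped.
toy_geo = (
    toy_df.merge(toy_coords, how='left', left_on='Postcode', right_on='Postal Code')
          .drop(columns='Postal Code')
)
print(toy_geo)
```

A join also leaves the coordinate columns numeric, whereas columns initialised with empty strings keep the `object` dtype.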

## Use a dataframe containing only boroughs that contain the word 'Toronto'

In [14]:
geo_df_to = geo_df[geo_df['Borough'].str.contains('Toronto')].reset_index(drop=True)

geo_df_to.head()

  Postcode          Borough                    Neighborhood  Latitude  Longitude
0      M4E     East Toronto                     The Beaches   43.6764    -79.293
1      M4K     East Toronto    The Danforth West, Riverdale   43.6796   -79.3522
2      M4L     East Toronto  The Beaches West, India Bazaar    43.669   -79.3156
3      M4M     East Toronto                 Studio District   43.6595   -79.3409
4      M4N  Central Toronto                   Lawrence Park    43.728   -79.3888
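
`str.contains` treats its pattern as a regular expression by default and returns a missing value for missing entries. Neither matters for the data above, but a slightly more defensive variant of the filter might look like the sketch below. The toy frame, including its missing borough, is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'Borough': ['East Toronto', 'North York', None, 'Central Toronto']})

# regex=False matches the literal text; na=False treats missing boroughs as non-matches.
keep = toy['Borough'].str.contains('Toronto', regex=False, na=False)
toronto_only = toy[keep].reset_index(drop=True)
print(toronto_only)
```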


### Define Foursquare credentials and version

In [15]:
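# @hidden cell
# Credentials are read from environment variables so that they are not published with the notebook.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are assumptions; use your own.
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret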
VERSION = '20180604' # Foursquare API version

### Explore the first neighborhood in Toronto

In [16]:
# show the name of the first neighborhood
geo_df_to.loc[0, 'Neighborhood']

'The Beaches'
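
A possible next step is to build a request URL for venues around this neighborhood. The sketch below only assembles the URL and sends nothing. It assumes the legacy Foursquare v2 `venues/explore` endpoint, which may no longer be served. The credentials are placeholder strings, and `radius` and `limit` are illustrative values that do not come from this notebook. The coordinates are those shown for The Beaches above.

```python
from urllib.parse import urlencode

# Placeholders; substitute real credentials before sending a request.
client_id = 'YOUR_CLIENT_ID'
client_secret = 'YOUR_CLIENT_SECRET'
version = '20180604'

# Coordinates of 'The Beaches' as listed in the dataframe above.
latitude, longitude = 43.6764, -79.293

params = {
    'client_id': client_id,
    'client_secret': client_secret,
    'v': version,
    'll': '{},{}'.format(latitude, longitude),
    'radius': 500,   # metres (assumed value)
    'limit': 100,    # maximum number of venues (assumed value)
}

# safe=',' keeps the comma in the ll parameter readable.
url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')
print(url)
```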