# Applied Data Science Capstone - Week 3
## Segmenting and Clustering Neighborhoods in Toronto
- Data source: Wikipedia website https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
- The dataframe will consist of three columns: Postcode, Borough, and Neighborhood.
- Only process the cells that have an assigned Borough. Ignore cells with a Borough that is not assigned.
- More than one neighborhood can exist in one postal code area. Such rows will be combined into one row, with the neighborhoods separated by commas.
- If a cell has a borough but its neighborhood is "Not assigned", the neighborhood will be set to the borough name.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.
- Submit a link to your Notebook on your GitHub repository. (10 marks)

### Install beautifulsoup if necessary and import libraries

In [207]:
!pip install beautifulsoup4
import pandas as pd
from bs4 import BeautifulSoup
import requests



## Start Web Scraping
### 1. Read Wikipedia page
### 2. Parse HTML with standard parser

In [208]:
# Download the page and stop early if the request failed
response = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

### Find table on Wikipedia page

In [209]:
table = soup.find('table',{'class':'wikitable sortable'})

### Find all rows in the table

In [210]:
trs = table.find_all('tr')

### Append rows

In [211]:
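rows = []
for tr in trs:
    # The header row holds only <th> cells, so it yields an empty list here
    rows.append([td.text.strip() for td in tr.find_all('td')])

df = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighborhood'])
# The empty header row becomes a row of missing values; drop it
df = df[~df['Postcode'].isnull()]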

print(df.head())
print('---')
print(df.tail())

  Postcode           Borough      Neighborhood
1      M1A      Not assigned      Not assigned
2      M2A      Not assigned      Not assigned
3      M3A        North York         Parkwoods
4      M4A        North York  Victoria Village
5      M5A  Downtown Toronto      Harbourfront
---
    Postcode       Borough           Neighborhood
283      M8Z     Etobicoke              Mimico NW
284      M8Z     Etobicoke     The Queensway West
285      M8Z     Etobicoke  Royal York South West
286      M8Z     Etobicoke         South of Bloor
287      M9Z  Not assigned           Not assigned
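The printed index starts at 1 because the table's header row contributes no `td` cells. A minimal sketch of that effect on toy data, assuming the scraped rows look like the hypothetical `rows` list below (it is a stand-in, not the real page content):

```python
import pandas as pd

# Toy stand-in for the scraped rows (assumed shape, not real page data):
# the header row has no <td> cells, so it contributes an empty list.
rows = [
    [],
    ['M3A', 'North York', 'Parkwoods'],
    ['M4A', 'North York', 'Victoria Village'],
]

# pandas pads the short first row with missing values
raw = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighborhood'])
print(raw['Postcode'].isnull().tolist())

# Dropping the padded row leaves the original index labels in place
cleaned = raw[~raw['Postcode'].isnull()]
print(cleaned.index.tolist())
```

Filtering on a missing `Postcode` therefore removes the header row without having to skip it by position.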


### Remove rows with Borough='Not assigned' and reindex

In [212]:
# Keep only rows whose Borough is assigned, then renumber the index from 0
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)

print(df.head())
print('---')
print(df.tail())

  Postcode           Borough      Neighborhood
0      M3A        North York         Parkwoods
1      M4A        North York  Victoria Village
2      M5A  Downtown Toronto      Harbourfront
3      M6A        North York  Lawrence Heights
4      M6A        North York    Lawrence Manor
---
    Postcode    Borough              Neighborhood
205      M8Z  Etobicoke  Kingsway Park South West
206      M8Z  Etobicoke                 Mimico NW
207      M8Z  Etobicoke        The Queensway West
208      M8Z  Etobicoke     Royal York South West
209      M8Z  Etobicoke            South of Bloor


### If a Postcode has more than one Neighborhood, aggregate into one row with Neighborhoods separated by commas and reindex

In [213]:
df = df.groupby(['Postcode', 'Borough'])['Neighborhood'].agg(', '.join).reset_index()

print(df.head())
print('---')
print(df.tail())

  Postcode      Borough                            Neighborhood
0      M1B  Scarborough                          Rouge, Malvern
1      M1C  Scarborough  Highland Creek, Rouge Hill, Port Union
2      M1E  Scarborough       Guildwood, Morningside, West Hill
3      M1G  Scarborough                                  Woburn
4      M1H  Scarborough                               Cedarbrae
---
    Postcode    Borough                                       Neighborhood
98       M9N       York                                             Weston
99       M9P  Etobicoke                                          Westmount
100      M9R  Etobicoke  Kingsview Village, Martin Grove Gardens, Richv...
101      M9V  Etobicoke  Albion Gardens, Beaumond Heights, Humbergate, ...
102      M9W  Etobicoke                                          Northwest


### If Neighborhood = 'Not assigned' set Neighborhood = Borough

In [214]:
print('Example:')
print('Postcode M7A old:')
print(df.loc[df['Postcode'] == 'M7A'])

df.loc[df['Neighborhood']=="Not assigned",'Neighborhood']=df.loc[df['Neighborhood']=="Not assigned",'Borough']

print('---------------------------------------')
print('Postcode M7A new:')
print(df.loc[df['Postcode'] == 'M7A'])

Example:
Postcode M7A old:
   Postcode       Borough  Neighborhood
85      M7A  Queen's Park  Not assigned
---------------------------------------
Postcode M7A new:
   Postcode       Borough  Neighborhood
85      M7A  Queen's Park  Queen's Park
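The same fallback can also be written with `Series.mask`, which some readers find easier to scan than the two `.loc` selections. This is only an alternative sketch on toy data; the `toy` frame and the `unassigned` name are assumptions for illustration:

```python
import pandas as pd

# Toy frame (illustrative rows only)
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

# Where the neighborhood is unassigned, take the value from Borough instead
unassigned = toy['Neighborhood'] == 'Not assigned'
toy['Neighborhood'] = toy['Neighborhood'].mask(unassigned, toy['Borough'])
print(toy)
```

Rows whose neighborhood is already assigned are left unchanged.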


### Show number of rows and columns

In [215]:
df.shape

(103, 3)
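Beyond checking the shape, the cleaning rules themselves can be verified. The sketch below uses a hypothetical helper, `check_clean`, which is not part of the original notebook, and runs it on toy frames rather than the scraped data:

```python
import pandas as pd

def check_clean(frame):
    """Hypothetical helper: list the cleaning rules a frame violates."""
    problems = []
    if (frame['Borough'] == 'Not assigned').any():
        problems.append('unassigned borough')
    if (frame['Neighborhood'] == 'Not assigned').any():
        problems.append('unassigned neighborhood')
    if not frame['Postcode'].is_unique:
        problems.append('duplicate postcode')
    return problems

# Toy frames (illustrative rows only)
good = pd.DataFrame({
    'Postcode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Lawrence Heights, Lawrence Manor', "Queen's Park"],
})
bad = pd.DataFrame({
    'Postcode': ['M6A', 'M6A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Lawrence Heights', 'Not assigned'],
})

print(check_clean(good))
print(check_clean(bad))
```

Calling `check_clean(df)` on the final dataframe would return an empty list if all three rules hold.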

## Jupyter Notebook available on GitHub:

https://github.com/steveshep/Coursera_Capstone/blob/master/Applied%20Data%20Science%20Capstone%20Week%203.ipynb