# Segmenting and Clustering Neighborhoods in Toronto

---
## Week 3 Assignment Part 1: Collecting Data and Data Cleansing

*Note: This assignment is a part of IBM's Data Science Capstone Project on Coursera.* 

---
**Assignment outline**:

+ Scrape the required data from Wikipedia's [List of postal codes of Canada](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).
+ The final dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
+ Only process the cells that have an assigned borough; ignore cells whose borough is `Not assigned`.
+ Combine neighbourhoods that share a postal code into one row.
+ A `Not assigned` neighbourhood should be replaced by its borough's name.
---
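
The three cleaning rules in the outline can be sketched end to end on a tiny hand-made frame before touching the real page. This is only an illustration under stated assumptions: `raw` and `clean_postcodes` are hypothetical names, and the rows are a small sample typed in by hand using the column names of the scraped table, not scraped output.

```python
import pandas as pd

# Hypothetical toy rows in the same layout as the scraped table (illustration only).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M6A", "M6A", "M9A"],
    "Borough": ["Not assigned", "North York", "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Lawrence Heights",
                      "Lawrence Manor", "Not assigned"],
})

def clean_postcodes(df):
    """Apply the outline's three cleaning rules to a copy of df."""
    # Rule 1: keep only rows with an assigned borough.
    out = df[df["Borough"] != "Not assigned"].copy()
    # Rule 2: an unassigned neighbourhood takes its borough's name.
    unassigned = out["Neighbourhood"] == "Not assigned"
    out.loc[unassigned, "Neighbourhood"] = out.loc[unassigned, "Borough"]
    # Rule 3: one row per postal code, neighbourhoods joined with ", ".
    return out.groupby(["Postcode", "Borough"], as_index=False).agg(
        {"Neighbourhood": ", ".join}
    )

cleaned = clean_postcodes(raw)
print(cleaned)
```

On this toy input the unassigned `M1A` row is dropped, the two `M6A` rows collapse into one, and the `M9A` neighbourhood becomes `Queen's Park`.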

### Import necessary modules for data collection and cleansing

In [1]:
import requests
import pandas as pd
from io import StringIO

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)                                       # fetch the page
df = pd.read_html(StringIO(response.text))[0]                      # parse the first table on the page into a dataframe
df.to_csv('beautifulsoup_pandas.csv', header=False, index=False)   # save a local copy (no header row, no index)
df.shape

(287, 3)

#### Let's see how our data looks:

In [2]:
df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


### Remove rows that have `Borough` as `Not assigned`

In [3]:
df1 = df[df.Borough.str.strip() != 'Not assigned'].reset_index(drop=True)  # drop rows whose Borough is 'Not assigned' and renumber the index
df1.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |


### Replace `Not assigned` values in `Neighbourhood` with the respective `Borough` name

#### Find rows where `Neighbourhood` is `Not assigned`

In [4]:
df1[df1['Neighbourhood'] == 'Not assigned']

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 6 | M9A | Queen's Park | Not assigned |


#### Replace those values with the `Borough` name

In [5]:
mask = df1['Neighbourhood'] == 'Not assigned'               # rows whose neighbourhood is missing
df1.loc[mask, 'Neighbourhood'] = df1.loc[mask, 'Borough']   # replace with the borough name in a single .loc assignment
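
Pandas can warn about, or under copy-on-write ignore, assignments made through chained indexing such as `df[col][mask] = value`, so a single-step replacement is the safer pattern. One option is `Series.mask`, which swaps in another value wherever a condition is true. The sketch below uses a hypothetical two-row frame named `toy`, made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only.
toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Where the condition is True, take the aligned value from Borough;
# everywhere else keep the existing Neighbourhood value.
toy["Neighbourhood"] = toy["Neighbourhood"].mask(
    toy["Neighbourhood"] == "Not assigned", toy["Borough"]
)
print(toy["Neighbourhood"].tolist())
```

Because the result is assigned back to the whole column, no intermediate copy is written to.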

### List all neighbourhoods that share a postal code in one row

In [6]:
df2 = df1.groupby(["Postcode", "Borough"], as_index=False).agg({"Neighbourhood": ", ".join})  # one row per postal code, neighbourhoods joined with ", "
df2.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


In [7]:
df2.shape

(103, 3)
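
The shape confirms the row and column counts, but not that the cleaning rules held. A few extra checks can cover that. The following is a minimal sketch: `check_clean` and `sample` are hypothetical names introduced here, and in the notebook the function would be called on `df2` instead of the hand-typed sample.

```python
import pandas as pd

def check_clean(df):
    """Return a list of problems found in a cleaned postal-code frame."""
    expected = ["Postcode", "Borough", "Neighbourhood"]
    if list(df.columns) != expected:
        # Later checks rely on these columns, so stop here.
        return ["unexpected columns"]
    problems = []
    # Each postal code should appear in exactly one row after grouping.
    if df["Postcode"].duplicated().any():
        problems.append("duplicate postal codes")
    # No placeholder values should survive the cleaning steps.
    if (df[["Borough", "Neighbourhood"]] == "Not assigned").any().any():
        problems.append("'Not assigned' values remain")
    return problems

# Hand-typed sample rows in the cleaned layout (illustration only).
sample = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})
print(check_clean(sample))
```

An empty list means all three checks passed.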