# Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Scrape Toronto neighborhood data from a Wikipedia page
<br/>
Use a notebook to build the code that scrapes Toronto neighborhood data from the Wikipedia page (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M), wrangles and cleans the data, and then reads it into a pandas dataframe structured as follows:
  - This pandas dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
  - Only process the cells that have an assigned borough. 
    * Ignore cells with a borough that is Not assigned
  - More than one neighborhood can exist in one postal code area. 
    * When a postal code area has multiple neighborhoods, separate them with a comma 
    * For example, in the table on the Wiki page, notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table
  - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough

Add Markdown cells to explain the work and any assumptions being made. Also, in the last cell of the notebook, use the .shape attribute to print the number of rows of the dataframe.

### Use BeautifulSoup for web scraping

Import the libraries required to get the data into a structured format. 
First, pass the content of the postal code Wiki URL into BeautifulSoup (BS). 

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

postalCodesURL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(postalCodesURL).text  # retrieve the content as text
soup = BeautifulSoup(source,'lxml')
#print(soup.prettify())

Analyze the HTML content and identify the HTML element from which to extract the postal code table.

In [2]:
# Extract the postal codes table; it is the table with the 'wikitable sortable' class
postalCodeTable = soup.find('table', {'class':'wikitable sortable'})
#print(postalCodeTable.tr.text)

### Perform data wrangling

Transform and map the data from the cells in postal codes table into a pandas dataframe

In [3]:
# this dataframe consists of three columns: Postalcode, Borough, and Neighborhood
columnNames = ['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns = columnNames)

# Obtain postcode, borough, neighborhood data in the postal codes table and append into the dataframe (one row at a time)
for trCell in postalCodeTable.find_all('tr'):
    tempRowData=[]
    for tdCell in trCell.find_all('td'):
        tempRowData.append(tdCell.text.strip())
    if len(tempRowData)==3:
        df.loc[len(df)] = tempRowData
        
df.head(10) 

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Not assigned
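
The row-by-row extraction above can also be sketched offline. The snippet below is a minimal illustration, not the notebook's own code: `sample_html` is a made-up stand-in for the Wikipedia table, the parser is the built-in `html.parser` rather than `lxml`, and the rows are collected in a plain list so the dataframe is built once instead of being appended to one row at a time.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia page, so the sketch runs without network access.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M6A</td><td>North York</td><td>Lawrence Heights</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
# class_="wikitable" matches any table whose class list contains "wikitable".
sample_table = sample_soup.find("table", class_="wikitable")

rows = []
for tr in sample_table.find_all("tr"):
    # The header row only has <th> cells, so it yields an empty list and is skipped.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:
        rows.append(cells)

# Build the dataframe in one step from the collected rows.
sample_df = pd.DataFrame(rows, columns=["Postalcode", "Borough", "Neighborhood"])
print(sample_df)
```

Collecting rows first and constructing the dataframe once tends to be faster than repeated `df.loc[len(df)] = ...` assignments, although for a table of this size either approach is fine.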


### Perform Data Cleaning

+ Remove the rows whose borough is "Not assigned".
+ More than one neighborhood can exist in one postal code area. Combine these neighborhoods into one row with the neighborhoods separated with a comma.
+ If a cell has a borough but a "Not assigned" neighborhood, then make the neighborhood the same as the borough.

In [4]:
# Remove the rows whose borough is "Not assigned"
df = df[df['Borough'] != 'Not assigned']
df.head(10)  # confirm that the rows with a "Not assigned" borough are removed

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Queen's Park,Not assigned
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


In [5]:
# Combine all rows with the same postal code into one row with the neighborhoods separated with a comma
group = df.groupby(['Postalcode','Borough'], sort=False).agg(', '.join)
df = group.reset_index()
df.head(10)  # rows with the same postal code are now combined, with the neighborhoods separated with a comma

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Not assigned
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


In [6]:
# If a cell has a borough but a Not assigned neighborhood, then make the neighborhood the same as the borough
df.loc[df['Neighborhood'] =='Not assigned' , 'Neighborhood'] = df['Borough']
df.head(10)  # confirm that each "Not assigned" Neighborhood cell is replaced by the borough

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
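
The three cleaning steps above can be condensed into one function and checked on a small hand-made dataframe. This is a sketch under stated assumptions: `clean_postal_codes` and the `raw` sample rows are hypothetical names and data introduced for illustration, and `Series.mask` is used in place of the `df.loc[...]` assignment, which should give the same result here.

```python
import pandas as pd

# Hand-made sample rows (illustrative only, not scraped data).
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M6A", "North York", "Lawrence Heights"],
        ["M6A", "North York", "Lawrence Manor"],
        ["M9A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postalcode", "Borough", "Neighborhood"],
)


def clean_postal_codes(frame):
    """Apply the three cleaning rules in the same order as the cells above."""
    # 1. Drop rows without an assigned borough.
    out = frame[frame["Borough"] != "Not assigned"]
    # 2. Join the neighborhoods that share a postal code, keeping first-seen order.
    out = (
        out.groupby(["Postalcode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )
    # 3. Where the neighborhood is still "Not assigned", fall back to the borough.
    out["Neighborhood"] = out["Neighborhood"].mask(
        out["Neighborhood"] == "Not assigned", out["Borough"]
    )
    return out


cleaned = clean_postal_codes(raw)
print(cleaned)
```

Wrapping the steps in a function makes it easy to rerun the cleaning on a fresh scrape and to test each rule against known inputs.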


### Use the .shape attribute to print the number of rows in the dataframe

In [7]:
df.shape

(103, 3)
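
Because `.shape` is an attribute that returns a `(rows, columns)` tuple, the row count can be unpacked directly. The short sketch below uses a hypothetical two-row dataframe named `demo`, and also checks that each postal code appears only once, which is what the grouping step is meant to guarantee.

```python
import pandas as pd

# Hypothetical two-row dataframe standing in for the cleaned result.
demo = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighborhood": ["Parkwoods", "Victoria Village"],
    }
)

n_rows, n_cols = demo.shape  # shape is a (rows, columns) tuple, not a method call
print(f"{n_rows} rows, {n_cols} columns")

# After grouping, every postal code should occupy exactly one row.
codes_unique = demo["Postalcode"].is_unique
print(codes_unique)
```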

## Part 2: Add the latitude and the longitude coordinates of each neighborhood
<br/>
Part 1 produced a dataframe with the postal code, borough name, and neighborhood name of each neighborhood. To use the Foursquare location data, we also need the latitude and longitude coordinates of each neighborhood.

+ Use the dataframe built in Part 1 and add two columns, the latitude and longitude coordinates of each neighborhood, to form a new dataframe with 5 columns.
+ The Geocoder Python package is usually used to get the geographical coordinates. However, given that this package can be very unreliable, use a link to a CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

### Build a dataframe from geospatial data CSV file 

In [8]:
# read the data in the CSV file into a dataframe
lat_lng_df = pd.read_csv('http://cocl.us/Geospatial_data')
lat_lng_df.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


### Merge the two dataframes on their common column, postal code 

In [10]:
# build dataframe in structured format as required
lat_lng_df.rename(columns = {'Postal Code':'Postalcode'}, inplace=True)
torontoGeoDf = pd.merge(df, lat_lng_df, on='Postalcode')
torontoGeoDf = torontoGeoDf[['Postalcode','Borough','Neighborhood','Latitude','Longitude']]
torontoGeoDf.head(10)

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
5,M9A,Queen's Park,Queen's Park,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
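
`pd.merge` performs an inner join by default, so any postal code missing from the coordinates file would be dropped without warning. One way to guard against that, sketched below on made-up data, is a left join with `indicator=True` and `validate="one_to_one"`. The `neighborhoods` and `coords` frames and the postal code "M0Z" are hypothetical and exist only to show an unmatched row.

```python
import pandas as pd

# Made-up sample frames; "M0Z" is a fictional code with no coordinates.
neighborhoods = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M4A", "M0Z"],
        "Borough": ["North York", "North York", "Nowhere"],
        "Neighborhood": ["Parkwoods", "Victoria Village", "Nowhere"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# rename returns a new frame, so the original column names are left untouched.
coords = coords.rename(columns={"Postal Code": "Postalcode"})

merged = neighborhoods.merge(
    coords,
    on="Postalcode",
    how="left",              # keep every neighborhood row, matched or not
    validate="one_to_one",   # raise if a postal code repeats on either side
    indicator=True,          # adds a "_merge" column describing each row's match
)

# Postal codes that found no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)
```

If `unmatched` is empty, the left join and the default inner join give the same rows, and the `_merge` column can be dropped.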
