# Segmenting and Clustering Neighborhoods in Toronto

## The procedure will be as follows
1 - Scraping the data from Wikipedia  
2 - Transforming the data into the appropriate dataframe


### 1 - Scraping data from Wikipedia

1.1 - Importing libraries

In [5]:
import bs4 as bs
import urllib.request
import numpy as np
import pandas as pd 

1.2 - Setting the URL

In [6]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

1.3 - Getting the data

In [7]:
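def scrape_url(cname,cols):
    # Download the page and parse it (reads the module-level `url` set in 1.2)
    page  = urllib.request.urlopen(url).read()
    soup  = bs.BeautifulSoup(page,'lxml')
    # First table on the page that carries the requested CSS class
    table = soup.find("table",class_=cname)
    # Column names come from the <th> cells
    header = [head.find_all(string=True)[0].strip() for head in table.find_all("th")]
    # One list per <tr>, holding the first text node of each <td>
    data   = [[td.find_all(string=True)[0].strip() for td in tr.find_all("td")]
              for tr in table.find_all("tr")]
    # Keep only complete rows; this also drops the header row, which has no <td> cells
    data    = [row for row in data if len(row) == cols]
    # Store data to this temporary dataframe
    raw_df = pd.DataFrame(data,columns=header)
    return raw_df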

In [8]:
raw_data = scrape_url("wikitable",3)

In [9]:
raw_data.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
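
The row filter in `scrape_url` can be tried offline on a small HTML snippet. The sketch below is an illustration under stated assumptions, not the notebook's own code: `SAMPLE_HTML` and `parse_table` are hypothetical names, the parser is the built-in `html.parser` rather than `lxml`, and `get_text(strip=True)` stands in for taking the first text node of each cell.

```python
import bs4 as bs

# Hypothetical miniature of the Wikipedia table, used only for illustration.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="#">North York</a></td><td>Parkwoods</td></tr>
</table>
"""

def parse_table(html, cname, cols):
    """Return (header, rows) for the first table with CSS class `cname`."""
    soup = bs.BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=cname)
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]
    # The header row contains no <td> cells, so it yields an empty list
    # and is removed here together with any incomplete row.
    rows = [row for row in rows if len(row) == cols]
    return header, rows

header, rows = parse_table(SAMPLE_HTML, "wikitable", 3)
print(header)
print(rows)
```

Passing the HTML in as an argument, instead of reading a global `url`, is a design choice made here so the parsing logic can be checked without a network connection.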


### 2 - Transforming the data into the appropriate dataframe

2.1 - Removing postal codes with an unassigned borough

In [10]:
# .copy() makes postal_codes_df an independent dataframe, so the assignment in 2.2 does not raise SettingWithCopyWarning
postal_codes_df=raw_data[~raw_data['Borough'].isin(['Not assigned'])].copy()

2.2 - Assigning the borough to unassigned neighbourhoods

In [11]:
unassigned = postal_codes_df['Neighbourhood'] == 'Not assigned'
postal_codes_df.loc[unassigned,'Neighbourhood'] = postal_codes_df.loc[unassigned,'Borough']


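Steps 2.1 and 2.2 can be reproduced on a small frame to see what they do. This is a minimal sketch with made-up sample rows (`sample`, `cleaned` and `mask` are illustrative names). Taking an explicit `.copy()` of the filtered rows is one common way to keep pandas from emitting `SettingWithCopyWarning` on the later `.loc` assignment.

```python
import pandas as pd

# Made-up sample rows in the same shape as the scraped table.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M7A", "M3A"],
    "Borough": ["Not assigned", "Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods"],
})

# 2.1: drop rows without a borough; .copy() detaches the result from `sample`.
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# 2.2: where the neighbourhood is missing, reuse the borough name.
mask = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[mask, "Neighbourhood"] = cleaned.loc[mask, "Borough"]

print(cleaned)
```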
2.3 - Grouping neighbourhoods that share a postal code

In [12]:
postal_codes_df = postal_codes_df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()

In [13]:
postal_codes_df.shape

(103, 3)
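
To see what the `groupby` in 2.3 produces, the same call can be run on a few rows. The sketch below uses made-up sample data (`sample` and `grouped` are illustrative names): rows that share a `Postcode` and `Borough` collapse into one row whose `Neighbourhood` is a comma-separated string.

```python
import pandas as pd

# Made-up sample: two rows share the postal code M5A.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M1B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Rouge"],
})

# Join the neighbourhood names of each (Postcode, Borough) group.
grouped = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)

print(grouped)
```

Because `groupby` sorts by the group keys by default, the output is ordered by postal code rather than by the original row order.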

# PART II

The latitude and longitude of each postal code are taken from the geospatial CSV file.

In [51]:
!wget -q -O 'toronto_data.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


Prepare to load the data from the CSV file

In [52]:
filename = 'toronto_data.csv'
headers = ['Postal Code', 'Latitude', 'Longitude']  # columns expected in the file (not passed to read_csv below)

Load and preview the data

In [54]:
df = pd.read_csv(filename)
df.head()

|   | Postal Code | Latitude | Longitude |
|---|-------------|----------|-----------|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
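
The `headers` list defined earlier is not used by `read_csv`, but it could serve as a check that the file has the expected columns. The following is a sketch only: it reads an in-memory sample (`SAMPLE_CSV`, a hypothetical stand-in for `toronto_data.csv`) so that it runs without the download.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for toronto_data.csv.
SAMPLE_CSV = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

headers = ["Postal Code", "Latitude", "Longitude"]

coords = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Fail early if a column the later merge relies on is absent.
missing = [name for name in headers if name not in coords.columns]
if missing:
    raise ValueError(f"CSV is missing expected columns: {missing}")

print(coords.dtypes)
```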


Add the coordinates to the dataframe by merging on the postal code

In [55]:
result = pd.merge(postal_codes_df, df, left_on='Postcode', right_on='Postal Code')
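
`pd.merge` performs an inner join by default, so a postal code with no match in the coordinates file would be dropped without notice. One way to check for that, sketched here on made-up data (`codes`, `coords`, `checked` and the postal code `M9Z` are hypothetical), is a left join with `indicator=True`, which adds a `_merge` column marking where each row came from.

```python
import pandas as pd

# Made-up sample: M9Z has no entry in the coordinates table.
codes = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Keep every row of `codes`; `_merge` is "left_only" where no coordinates matched.
checked = pd.merge(
    codes, coords,
    left_on="Postcode", right_on="Postal Code",
    how="left", indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()

print(unmatched)
```

An empty `unmatched` list would indicate that the inner join used in the notebook loses no rows.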

Remove the extra column and preview the final output

In [48]:
result.drop('Postal Code', axis=1, inplace=True)
result.head()

|   | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|----------|---------|---------------|----------|-----------|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
