# BATTLE OF NEIGHBOURHOODS

This notebook is mainly for the Coursera capstone project. Given a city, I will segment it into neighbourhoods using the geographical coordinates of the center of each neighbourhood.

Using a combination of location data and machine learning, we will group the neighbourhoods into clusters.

We will use the Foursquare API as the location data provider.

In [25]:
import pandas as pd
import numpy as np

In [26]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


# Exploring and Clustering the Neighbourhoods in Toronto

This part of the notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the data in its table of postal codes, and transforms that data into a pandas dataframe.

In [27]:
import requests # for sending http requests
from bs4 import BeautifulSoup # used for parsing HTML and XML documents


In [28]:
wiki_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# requests.get(url).text asks the website for the page and returns its HTML as a string
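
The request above assumes the download succeeds. As a minimal sketch, the same call can be wrapped so that an error response is raised instead of being parsed silently. The helper name `fetch_html` and its `get` parameter are hypothetical (the parameter only exists so the function can be exercised without network access); `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising if the server reports an error.

    `get` defaults to requests.get; it is a hypothetical hook for this sketch
    so a stand-in can be passed when no network is available.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Usage (needs network access):
# wiki_html = fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
```

The returned string can be passed to `BeautifulSoup` exactly like the `.text` value used in this notebook.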

In [29]:
soup = BeautifulSoup(wiki_url,'lxml')
#print(soup.prettify()) # prettify shows how the tags are nested in the document we have parsed.

We need to find where to start in the tags above. On close inspection, all the postal code and neighbourhood details sit in a table with the class 'wikitable sortable', so we look for that class in the parsed page.

In [30]:
find_table = soup.find('table',{'class':'wikitable sortable'})

Looking closely again, we can see that all the information lies between 'tr' tags, and every 'td' holds one of our content values.

In [31]:
rowValues = []

In [32]:
# collect the text of every 'td' cell, calling find_all only once
for cell in find_table.find_all('td'):
    rowValues.append(cell.get_text())

In [33]:
# creating a separate list for each column, to be used as the pandas column values
PostalCodeList = []
BoroughList = []
NeighbourhoodList = []

In [34]:
for postcode in range(0, len(rowValues),3):
    PostalCodeList.append(rowValues[postcode])
    
for borough in range(1, len(rowValues),3):
    BoroughList.append(rowValues[borough])
    
for neighbourhood in range(2, len(rowValues),3):
    NeighbourhoodList.append(rowValues[neighbourhood])    
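
Stepping through the flat list in strides of three relies on every row having exactly three 'td' cells. As a hedged alternative, the table can be walked one 'tr' at a time so that each row's cells stay together. The HTML below is a small stand-in written for this sketch, not the live Wikipedia markup, and it uses Python's built-in `html.parser` rather than `lxml`.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia table; the real page layout may differ.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable")

rows = []
for tr in sample_table.find_all("tr"):
    # strip=True trims surrounding whitespace such as trailing newlines
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        rows.append(cells)

# Transpose the rows into one list per column.
postal_codes, boroughs, neighbourhoods = map(list, zip(*rows))
print(rows)
```

Because each row is checked for three cells, a malformed row is skipped rather than shifting every later value into the wrong column.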

In [35]:
import pandas as pd

tableDF = pd.DataFrame()
tableDF['PostalCode'] = PostalCodeList
tableDF['Borough'] = BoroughList
tableDF['Neighbourhood'] = NeighbourhoodList

In [36]:
#removing the newline characters (\n) from every column
tableDF = tableDF.replace('\n','',regex = True)
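
The regex replace removes newline characters anywhere in a cell. If only leading and trailing whitespace needs to go, a possible alternative is `str.strip()` applied column by column. The toy frame below is an assumption made for illustration (the padded values are invented); it is not the scraped data.

```python
import pandas as pd

# Toy frame with invented stray whitespace around the values.
toy_raw = pd.DataFrame({
    "Borough": ["North York\n"],
    "Neighbourhood": [" Parkwoods\n"],
})

# str.strip() trims leading/trailing whitespace, including newlines, in each cell.
toy_clean = toy_raw.apply(lambda col: col.str.strip())
print(toy_clean.iloc[0].tolist())
```

Unlike the regex replace, this leaves any newline in the middle of a value untouched, so which one fits depends on the data.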

In [37]:
tableDF.shape

(287, 3)

The table is cleaned according to three rules.

Rows with a borough that is Not assigned are removed.

More than one neighbourhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighbourhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighbourhoods separated by a comma.

If a cell has a borough but a Not assigned neighbourhood, then the neighbourhood will be the same as the borough.

In [38]:
#Removing the rows with a borough that is Not assigned.
tableDF = tableDF[tableDF['Borough'] != 'Not assigned']

tableDF = tableDF.reset_index(drop = True)

In [39]:
tableDF.shape

(210, 3)

In [40]:
tableDF.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


In [41]:
#If a cell has a borough but a Not assigned neighbourhood, then the neighbourhood will be the same as the borough.
not_assigned = tableDF['Neighbourhood'] == 'Not assigned'
# assign with .loc so the borough name is written into the selected rows
tableDF.loc[not_assigned, 'Neighbourhood'] = tableDF.loc[not_assigned, 'Borough']

tableDF = tableDF.reset_index(drop = True)

In [42]:
tableDF.shape

(210, 3)

In [43]:
tableDF.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


In [45]:
# one row per postal code: keep the borough and join the neighbourhood names with commas
tableDF_final = tableDF.groupby('PostalCode',as_index=False,sort=False).agg({
    'Borough' : lambda x: x.max(),
    'Neighbourhood': lambda x: ', '.join(x)
})

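Since we are using groupby on PostalCode, rows that share a postal code are merged into a single row, with their neighbourhood names joined by commas.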

In [46]:
tableDF_final.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


In [47]:
tableDF_final.shape

(103, 3)
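
To see the three cleaning rules working together, here is a minimal sketch on a toy frame. The postal codes, boroughs and neighbourhoods are made up so that each rule is triggered at least once, and `'first'` is used for the borough on the assumption that all rows of a postal code share the same borough.

```python
import pandas as pd

# Made-up rows that exercise all three rules; not taken from the Wikipedia table.
toy = pd.DataFrame({
    "PostalCode": ["X1A", "X2A", "X3A", "X3A", "X4A"],
    "Borough": ["Not assigned", "North", "South", "South", "East"],
    "Neighbourhood": ["Not assigned", "Alpha", "Beta", "Gamma", "Not assigned"],
})

# Rule 1: drop rows whose borough is Not assigned.
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 3: a Not assigned neighbourhood takes the borough's name.
unassigned = toy["Neighbourhood"] == "Not assigned"
toy.loc[unassigned, "Neighbourhood"] = toy.loc[unassigned, "Borough"]

# Rule 2: one row per postal code, neighbourhoods joined with a comma.
toy_final = toy.groupby("PostalCode", as_index=False, sort=False).agg(
    {"Borough": "first", "Neighbourhood": lambda names: ", ".join(names)}
)
print(toy_final)
```

Here X1A is dropped, X3A collapses into a single row with "Beta, Gamma", and X4A inherits "East" as its neighbourhood, which mirrors the drop in row count seen in the real table.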

Now that we have built a dataframe with the postal code, borough name and neighbourhood name of each neighbourhood, we need the latitude and longitude coordinates of each neighbourhood in order to use the Foursquare location data.

In [55]:
geospatial = pd.read_csv("http://cocl.us/Geospatial_data")

In [57]:
geospatial.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [58]:
geospatial.shape

(103, 3)

In [59]:
#renaming the column to make it easier for us to apply join
geospatial = geospatial.rename(columns = {"Postal Code" : "PostalCode"})

geospatial.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [63]:
full_df = tableDF_final.join(geospatial.set_index('PostalCode'), on = 'PostalCode')

In [64]:
full_df

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.654260,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
