# **IBM Applied Data Science Capstone Course by Coursera**

 * *Build a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name in Toronto.*
 * *Get the geographical coordinates of the neighborhoods in Toronto.*
 * *Explore and cluster the neighborhoods in Toronto (replicate the same analysis we did to New York City data).*
 
 -------
 ## **1. Import Library**


In [20]:
!pip install folium

Collecting folium
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


In [50]:
import numpy as np #to handle data in a vectorized manner
import pandas as pd #for data analysis
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

import json #library to handle JSON files

from geopy.geocoders import Nominatim #convert an address into latitude and longitude

import requests #library to handle requests

from bs4 import BeautifulSoup #library to parse HTML and XML documents

from pandas.io.json import json_normalize #transform a JSON file into a pandas dataframe

#matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#import k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Library imported.')

Library imported.


## 2. Scrape data from the Wikipedia page into a DataFrame

In [51]:
# send get request
data= requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [52]:
#parse data from the html into a beautifulsoup object
soup=BeautifulSoup(data,'html.parser')

In [53]:
#create three lists to store the table data
postalCodeList=[]
boroughList=[]
neighborhoodList=[]

**Using BeautifulSoup**

In [54]:
#find the table
table=soup.find('table')

#find all rows of the table
rows=table.find_all('tr')

#for each row of the table, find all the table data
for row in rows:
    cells=row.find_all('td')

In [55]:
#append the data to the respective lists
for row in soup.find('table').find_all('tr'):
    cells=row.find_all('td')
    if(len(cells)>0):
        postalCodeList.append(cells[0].text.rstrip('\n'))
        boroughList.append(cells[1].text.rstrip('\n'))
        neighborhoodList.append(cells[2].text.rstrip('\n')) #avoid new line in neighbourhood cell

In [56]:
#create a new dataFrame from the three lists

toronto_df=pd.DataFrame({'PostalCode':postalCodeList,'Borough':boroughList,'Neighborhood':neighborhoodList})
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
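
The row-walking logic above can be checked offline. The sketch below runs the same `find_all('tr')` / `find_all('td')` loop on a hand-written HTML fragment. The fragment and the `parse_rows` helper are illustrative assumptions, not the live Wikipedia markup, and `get_text(strip=True)` is used here in place of `rstrip('\n')` to trim whitespace.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (assumption, not the real page).
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

def parse_rows(html):
    """Return one (postal code, borough, neighborhood) tuple per data row."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for row in soup.find('table').find_all('tr'):
        cells = row.find_all('td')
        # The header row holds only <th> cells, so it has no <td> and is skipped.
        if len(cells) >= 3:
            records.append(tuple(cell.get_text(strip=True) for cell in cells[:3]))
    return records

rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Collecting tuples per row, rather than three parallel lists, keeps each record together; `pd.DataFrame(rows, columns=[...])` would then build the same table.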


## 3. Drop the rows whose borough is "Not assigned"

In [57]:
#drop rows whose borough is not assigned
toronto_df_dropna=toronto_df[toronto_df.Borough !="Not assigned"].reset_index(drop=True)
toronto_df_dropna.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


## 4. Group neighborhoods that share the same postal code and borough

In [58]:
#group neighborhoods that share the same postal code and borough into one comma-separated string
toronto_df_grouped=toronto_df_dropna.groupby(['PostalCode','Borough'],as_index=False).agg(lambda x: ", ".join(x))
toronto_df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
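
To see what the aggregation does when one postal code spans several rows, here is a minimal sketch on a toy frame. The input rows are an assumption made up for illustration (in the scraped table most postal codes already arrive as a single row), and the sketch selects the `Neighborhood` column explicitly before joining.

```python
import pandas as pd

# Toy input (assumption): postal code M1B is split across two rows.
df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern', 'Rouge', 'Woburn'],
})

# Join the neighborhood names of each (PostalCode, Borough) group with ", ".
grouped = (
    df.groupby(['PostalCode', 'Borough'])['Neighborhood']
      .agg(', '.join)
      .reset_index()
)
print(grouped)
```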


## 5. For Neighborhood='Not assigned', make the value the same as Borough

In [59]:
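#for neighborhood='Not assigned', make the value the same as Borough
#(assigning to the row yielded by iterrows() is not guaranteed to update the dataframe, so use a boolean mask)

not_assigned=toronto_df_grouped['Neighborhood']=='Not assigned'
toronto_df_grouped.loc[not_assigned,'Neighborhood']=toronto_df_grouped.loc[not_assigned,'Borough']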
        
toronto_df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
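
The first five rows shown here contain no "Not assigned" neighborhood, so the output cannot confirm that the replacement took effect. A minimal sketch of a mask-based replacement on a toy frame follows; the postal code `M0Z` and the borough name `Example Borough` are made-up values used only to trigger the replacement.

```python
import pandas as pd

# Toy rows (assumption); the second row has a missing neighborhood.
df = pd.DataFrame({
    'PostalCode': ['M1B', 'M0Z'],
    'Borough': ['Scarborough', 'Example Borough'],
    'Neighborhood': ['Malvern, Rouge', 'Not assigned'],
})

# Boolean mask of the rows to fix, then a single .loc assignment writes
# the borough name into the neighborhood column for exactly those rows.
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']

remaining = int((df['Neighborhood'] == 'Not assigned').sum())
print(df)
print('rows still not assigned:', remaining)
```

Counting the remaining "Not assigned" values afterwards gives a direct check that does not depend on which rows `head()` happens to show.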


## 6. Check whether the result matches what the question requires

In [60]:
# build a test dataframe with the rows for the postal codes named in the question
test_list=["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

# collect the matching rows in the order of test_list and concatenate them once
test_df=pd.concat([toronto_df_grouped[toronto_df_grouped['PostalCode']==postcode] for postcode in test_list],ignore_index=True)

test_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Parkview Hill, Woodbine Gardens"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Wexford, Maryvale"
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har..."
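
An alternative to looping over the test list is to index by postal code and reindex in the requested order, which also makes it easy to report codes that are absent. This is a sketch on a toy frame, assuming postal codes are unique after grouping; `M0Z` is a made-up code that is deliberately missing.

```python
import pandas as pd

df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1G', 'M1H'],
    'Borough': ['Scarborough'] * 3,
    'Neighborhood': ['Malvern, Rouge', 'Woburn', 'Cedarbrae'],
})

wanted = ['M1H', 'M1B', 'M0Z']  # 'M0Z' is a made-up code that is absent

# Reindexing keeps the requested order; absent codes become all-NaN rows,
# which are dropped before the index is turned back into a column.
subset = (
    df.set_index('PostalCode')
      .reindex(wanted)
      .dropna(how='all')
      .reset_index()
)

# Report requested codes that have no row in the dataframe.
present = set(df['PostalCode'])
missing = [code for code in wanted if code not in present]
print(subset)
print('missing codes:', missing)
```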


## 7. Print the number of rows of the cleaned dataframe

In [61]:
# print the number of rows of the clean data frame
toronto_df_grouped.shape

(103, 3)

## 8. Load the coordinates from the CSV file provided on Coursera

In [65]:
# load the coordinates from the csv file provided on Coursera (read here from a copy hosted on GitHub)
coordinates = pd.read_csv("https://raw.githubusercontent.com/limchiahooi/Coursera_Capstone/master/Geospatial_Coordinates.csv")
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [68]:
# rename the column "Postal Code" to "PostalCode" so it matches the neighborhood dataframe
coordinates.rename(columns={'Postal Code':'PostalCode'},inplace=True)
coordinates.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


## 9. Merge the two tables to get the coordinates

In [70]:
# merge the two tables on the column "PostalCode"
toronto_df_new=toronto_df_grouped.merge(coordinates,on="PostalCode",how='left')
toronto_df_new.head()


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
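
A left merge keeps every neighborhood row even when no coordinates match, so missing matches show up only as `NaN`. The sketch below, on toy frames, uses the `validate` and `indicator` parameters of `DataFrame.merge` to surface such cases; the postal code `M0Z` and `Example Borough` are made-up values.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M0Z'],  # 'M0Z' is a made-up code
    'Borough': ['Scarborough', 'Scarborough', 'Example Borough'],
})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# validate raises MergeError if a postal code is duplicated on either side;
# indicator adds a "_merge" column saying where each row was found.
merged = neighborhoods.merge(
    coords, on='PostalCode', how='left',
    validate='one_to_one', indicator=True,
)

# Postal codes that received no coordinates.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(merged)
print('postal codes without coordinates:', unmatched)
```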


## 10. Finally, check that the coordinates were added as required by the question

In [74]:
#build a test dataframe with the rows for the postal codes named in the question
test_list = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

#collect the matching rows in the order of test_list and concatenate them once
test_df=pd.concat([toronto_df_new[toronto_df_new['PostalCode']==postcode] for postcode in test_list],ignore_index=True)
    
test_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Wexford, Maryvale",43.750072,-79.295849
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442


## 11. Use the geopy library to get the latitude and longitude of Toronto

In [76]:
address='Toronto'

geolocator=Nominatim(user_agent='My-application')
location=geolocator.geocode(address)
latitude=location.latitude
longitude=location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude,longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
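
geopy geocoders return `None` when a query has no match, in which case `location.latitude` would raise an `AttributeError`. The sketch below wraps the lookup in a small guard. The helper `coordinates_or_raise` and the offline stand-in `fake_geocode` are hypothetical names introduced for illustration; with geopy available, `Nominatim(user_agent=...).geocode` would be passed in instead of the stand-in.

```python
from types import SimpleNamespace

def coordinates_or_raise(geocode, address):
    """Return (latitude, longitude) from a geopy-style geocode callable."""
    location = geocode(address)
    # A geocoder returns None when it finds no match for the query.
    if location is None:
        raise LookupError('No geocoding result for {!r}'.format(address))
    return location.latitude, location.longitude

def fake_geocode(address):
    """Offline stand-in for a real geocoder (assumption, for illustration)."""
    known = {
        'Toronto': SimpleNamespace(latitude=43.6534817, longitude=-79.3839347),
    }
    return known.get(address)

print(coordinates_or_raise(fake_geocode, 'Toronto'))
```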


## 12. Create a map of Toronto with neighborhoods superimposed on top

In [90]:
# create map of Toronto using latitude and longitude values
map_toronto=folium.Map(location=[latitude,longitude],zoom_start=10)

#add one circle marker per postal code to the map

for lat, lng, borough,neighborhood in zip(toronto_df_new['Latitude'],toronto_df_new['Longitude'],toronto_df_new['Borough'],toronto_df_new['Neighborhood']):
    label='{},{}'.format(neighborhood,borough)
    label= folium.Popup(label,parse_html=True)
    folium.CircleMarker([lat,lng],radius=5,popup=label,color='blue',fill=True,fill_color='#3186cc',fill_opacity=0.7).add_to(map_toronto)
    
map_toronto