<H1>Capstone Project</H1>
This notebook is mainly used to complete the capstone project of the IBM Data Science course.

In [39]:
# import packages
import pandas as pd
import numpy as np 
from bs4 import BeautifulSoup
import requests
import geocoder
from sklearn.cluster import KMeans
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

<h1>Section 01: Web Scraping</h1>
<p>In this section, neighborhood details for Toronto are extracted from a Wikipedia page and stored in a pandas DataFrame.<br>
The extracted details are also saved to a CSV file for future reference.<br>
For more details, please refer to the comments.
</p>

In [2]:
# url to the target Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# get html document
page = requests.get(url)
# parsing html document to BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
# extracting tables
tables = soup.find_all('table')

In [3]:
# save the table with postal codes as a list
nbTable = tables[0].text.splitlines()
# define list to store all rows
neighborhoods = []
# looping row by row to extract necessary values
for row in nbTable:
    if row != '' and 'Not assigned' not in row:
        # define dict to store the values
        cell = {}
        # extracting values
        neighborhood = row.split('(')[1].strip(')')
        neighborhood = neighborhood.replace(' / ', ',').replace(')', ',').strip(' ')

        # storing values inside dict
        cell['PostalCode'] = row[0:3]
        cell['Brough'] = row[3:].split('(')[0]
        cell['Neighborhood'] = neighborhood

        # appending dict to the list
        neighborhoods.append(cell)
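
<p>The parsing step above can be checked in isolation on a single string. The sketch below is a minimal version of the same slicing logic. The sample text and the helper name <code>parse_cell</code> are assumptions inferred from how the loop slices each row, not a copy of the live Wikipedia page.</p>

```python
def parse_cell(text):
    """Split one table-cell string into postal code, borough and neighborhoods."""
    # the first three characters are the postal code, e.g. 'M5A'
    postal_code = text[:3]
    # everything between the code and the first '(' is the borough
    borough, _, rest = text[3:].partition('(')
    # drop the closing parenthesis and turn ' / ' separators into commas
    names = [name.strip() for name in rest.rstrip(')').split(' / ') if name.strip()]
    return {
        'PostalCode': postal_code,
        'Brough': borough.strip(),
        'Neighborhood': ','.join(names),
    }

# assumed cell layout: code, borough, then neighborhoods in parentheses
sample = "M5ADowntown Toronto(Regent Park / Harbourfront)"
print(parse_cell(sample))
```

<p>Using <code>partition</code> rather than indexing the result of <code>split('(')</code> means a cell without parentheses yields an empty neighborhood string instead of raising an <code>IndexError</code>.</p>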

In [4]:
# transform the extracted values to a DF
neighborhoods = pd.DataFrame(neighborhoods)
neighborhoods['Brough'] = neighborhoods['Brough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                                'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                                'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                                'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

In [5]:
# export dataframe as a csv
neighborhoods.to_csv('TorontoNeighborhoods.csv', index=False)

<h1>Section 02: Location Update</h1>
<p>In this section, the previously created DataFrame is updated with location details (the latitude and longitude of each postal code).<br>
For further details, please refer to the comments.
</p>

In [6]:
# initialize variable
locations = []
# loop over the rows and look up the coordinates of each address
for index, row in neighborhoods.iterrows():
    location = None
    cell = {}
    address = "{}, {}".format(row['PostalCode'], row['Brough'])
    # introduce a loop limiter
    i = 0
    while(location is None):
        # extracting location data
        geoData = geocoder.google(address)
        location = geoData.latlng
        # loop counter
        i = i+1
        if i>5:
            break
    # appending values to the dict
    cell['PostalCode'] = row['PostalCode']
    cell['Brough'] = row['Brough']
    cell['Neighborhood'] = row['Neighborhood']
    if location is not None:
        cell['Latitude'] = location[0]
        cell['Longitude'] = location[1]
    else:
        cell['Latitude'] = None
        cell['Longitude'] = None
    
    # append dict to the list
    locations.append(cell)

KeyboardInterrupt: 

<p>I tried to extract the latitude and longitude of every location with the loop above, but the lookups did not return results and the cell had to be interrupted.<br>
So I moved on and used the provided CSV file instead.</p>
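
<p>The bounded retry idea from the loop above could be isolated in a small helper so it can be tested without a network call. This is only a sketch: <code>geocode_with_retry</code> and <code>fake_lookup</code> are hypothetical names, and the lookup function is passed in as a parameter rather than tied to any particular geocoding service.</p>

```python
import time

def geocode_with_retry(lookup, address, max_attempts=5, delay=0.0):
    """Call lookup(address) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        latlng = lookup(address)
        if latlng is not None:
            return latlng
        # pause between attempts; 0.0 keeps the example fast
        time.sleep(delay)
    return None

# stand-in for a real geocoding call: fails twice, then returns coordinates
calls = {'count': 0}

def fake_lookup(address):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.65426, -79.360636]

print(geocode_with_retry(fake_lookup, "M5A, Downtown Toronto"))
```

<p>Because the helper returns <code>None</code> after the last attempt, the caller can store missing coordinates and continue with the next row instead of looping indefinitely.</p>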

In [7]:
# read the csv file to a df
geoSpatialData = pd.read_csv("Geospatial_Coordinates.csv")
geoSpatialData.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [8]:
# rename column to merge two dfs
geoSpatialData.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
# merge two dfs
df = pd.merge(neighborhoods, geoSpatialData, on='PostalCode')

<h1>Section 03: Clustering</h1>
<p>In this section, neighborhoods are clustered using the k-means method.<br>
For further details, please refer to the comments.</p>
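
<p>To illustrate what k-means does with venue-frequency rows, here is a bare-bones version of Lloyd's algorithm in NumPy. It is a teaching sketch on made-up toy data, not scikit-learn's implementation; the function name <code>kmeans</code> and the toy values are assumptions.</p>

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Bare-bones Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # start from k distinct rows chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # distance from every point to every centroid, shape (n, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # assign each point to its nearest centroid
        labels = distances.argmin(axis=1)
        # move each centroid to the mean of its members (keep it if empty)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# toy rows: two neighborhoods dominated by one category, two by another
toy = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels, centroids = kmeans(toy, k=2)
print(labels)
```

<p>The cluster numbers themselves are arbitrary; only which rows share a label is meaningful.</p>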

In [9]:
# extracting neighborhoods in Toronto
toronto = df[df['Brough'].str.contains('Toronto')].reset_index(drop=True)
toronto.head()

Unnamed: 0,PostalCode,Brough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park,Harbourfront",43.65426,-79.360636
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


<p>I tried to get the location of Toronto using geocoder, but without success.<br>
So I use the first location in the DataFrame as the center of the map.</p>

In [10]:
# extracting latitudes and longitudes of first row in DF
latitudeToronto = toronto.loc[0, 'Latitude']
longitudeToronto = toronto.loc[0, 'Longitude']
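
<p>As a possible alternative to using the first row, the map could be centered on the mean of the coordinates. The sketch below uses only the five rows shown in the table above; <code>mean_center</code> is a hypothetical helper name.</p>

```python
# coordinates copied from the first five rows of the table shown above
coordinates = [
    (43.65426, -79.360636),
    (43.657162, -79.378937),
    (43.651494, -79.375418),
    (43.676357, -79.293031),
    (43.644771, -79.373306),
]

def mean_center(points):
    """Return the arithmetic mean latitude and longitude of (lat, lng) pairs."""
    lats = [lat for lat, _ in points]
    lngs = [lng for _, lng in points]
    return sum(lats) / len(lats), sum(lngs) / len(lngs)

centerLatitude, centerLongitude = mean_center(coordinates)
print(round(centerLatitude, 4), round(centerLongitude, 4))
```

<p>With the full DataFrame, the same values could be obtained from <code>toronto['Latitude'].mean()</code> and <code>toronto['Longitude'].mean()</code>.</p>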

In [11]:
# create a map object
mapToronto = folium.Map(location=[latitudeToronto, longitudeToronto], zoom_start=12)

# add markers to map
for latitude, longitude, label in zip(toronto['Latitude'], toronto['Longitude'], toronto['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        parse_html=True
    ).add_to(mapToronto)

# show the map
mapToronto

In [12]:
# Foursquare credentials
CLIENT_ID = 'AZ00BPKDRUC3GAPGW3O5Z01VNQV4UWDU44WO1ZW4ZZ4YTA2P' # Foursquare ID
CLIENT_SECRET = 'EMQ4TLHUPGCXYWPXB53LCW1GPGLIRA5QFDU4HH5F3TXTACBE' # Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [13]:
# get nearby venues for each neighborhood
# define a list to store venues
venues = []
# loop through the DF to get venue details
for latitude, longitude, neighborhood in zip(toronto['Latitude'], toronto['Longitude'], toronto['Neighborhood']):
    # request url
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            latitude, 
            longitude, 
            500, 
            LIMIT)
    # make a get request
    results = requests.get(url).json()['response']['groups'][0]['items']
    # appending venue details to list
    venues.append([(
      neighborhood,
      latitude,
      longitude,
      venue['venue']['name'],
      venue['venue']['location']['lat'],
      venue['venue']['location']['lng'],
      venue['venue']['categories'][0]['name']) for venue in results])

In [14]:
# converting venue list to DF
nearbyVenues = pd.DataFrame([item for venue in venues for item in venue])
nearbyVenues.columns = ['Neighborhood',
                            'Neighborhood Latitude',
                            'Neighborhood Longitude',
                            'Venue',
                            'Venue Latitude',
                            'Venue Longitude',
                            'Venue Category']
nearbyVenues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park,Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,"Regent Park,Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
2,"Regent Park,Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park,Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park,Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


In [16]:
# one hot encoding
torontoOneHot = pd.get_dummies(nearbyVenues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column to the DF
torontoOneHot['Neighborhood'] = nearbyVenues['Neighborhood']
# rearranging columns
columnNames = ['Neighborhood'] + [columnName for columnName in list(torontoOneHot.columns) if columnName!='Neighborhood']
torontoOneHot = torontoOneHot[columnNames]

In [17]:
# create grouped DF
torontoGrouped = torontoOneHot.groupby('Neighborhood').mean()
torontoGrouped.reset_index(inplace=True)

In [24]:
# define a function to get most common venues
def getMostCommonVenues(row, topVenueCount):
    categories = row.iloc[1:]
    categoriesSorted = categories.sort_values(ascending=False)
    return categoriesSorted.index[0:topVenueCount]

In [25]:
# select top venue count
numTopVenues = 10
indicators = ["st", "nd", "rd"]

# create a list of column names
columnNames = ['Neighborhood']
for i in range(numTopVenues):
    if i<len(indicators):
        columnNames.append("{}{} Most Common Venue".format(i+1, indicators[i]))
    else:
        columnNames.append("{}th Most Common Venue".format(i+1))

In [60]:
# create a DF with new column names
sortedNeighborhoodVenues = pd.DataFrame(columns=columnNames)
sortedNeighborhoodVenues['Neighborhood'] = torontoGrouped['Neighborhood']
# filling values to DF using getMostCommonVenues function
for i in range(torontoGrouped.shape[0]):
    sortedNeighborhoodVenues.iloc[i, 1:] = getMostCommonVenues(torontoGrouped.iloc[i, :], numTopVenues)
sortedNeighborhoodVenues.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Farmers Market,Beer Bar,Italian Restaurant,Cheese Shop,Seafood Restaurant,Restaurant,Pub
1,"Brockton,Parkdale Village,Exhibition Place",Café,Breakfast Spot,Coffee Shop,Italian Restaurant,Performing Arts Venue,Convenience Store,Nightclub,Music Venue,Restaurant,Climbing Gym
2,"CN Tower,King and Spadina,Railway Lands,Harbou...",Airport Service,Airport Terminal,Airport,Boat or Ferry,Sculpture Garden,Rental Car Location,Plane,Airport Food Court,Coffee Shop,Harbor / Marina
3,Central Bay Street,Coffee Shop,Café,Sandwich Place,Italian Restaurant,Bubble Tea Shop,Restaurant,Burger Joint,Salad Place,Ramen Restaurant,Donut Shop
4,Christie,Grocery Store,Café,Park,Italian Restaurant,Restaurant,Candy Store,Athletics & Sports,Coffee Shop,Nightclub,Baby Store


In [61]:
# clustering neighborhoods
# number of clusters
kClusters = 7
# removing labels from DF
torontoGroupedClustering = torontoGrouped.drop(columns='Neighborhood')
# cluster the dataset using KMeans
kMeans = KMeans(n_clusters=kClusters, random_state=0).fit(torontoGroupedClustering)


In [62]:
# adding cluster labels to DF
sortedNeighborhoodVenues.insert(0, 'Cluster Labels', kMeans.labels_)
# creating a new DF to merge the other DFs into
torontoMerged = toronto
torontoMerged = torontoMerged.join(sortedNeighborhoodVenues.set_index('Neighborhood'), on='Neighborhood')
torontoMerged.head()

Unnamed: 0,PostalCode,Brough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park,Harbourfront",43.65426,-79.360636,2,Coffee Shop,Pub,Bakery,Park,Café,Theater,Breakfast Spot,Event Space,Farmers Market,Restaurant
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,2,Coffee Shop,Clothing Store,Sandwich Place,Café,Italian Restaurant,Middle Eastern Restaurant,Japanese Restaurant,Cosmetics Shop,Hotel,Pizza Place
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,2,Café,Coffee Shop,Cocktail Bar,Beer Bar,Restaurant,Creperie,Gastropub,Lingerie Store,Seafood Restaurant,Moroccan Restaurant
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,6,Health Food Store,Trail,Pub,Asian Restaurant,Monument / Landmark,Malay Restaurant,Market,Martial Arts School,Massage Studio,Mediterranean Restaurant
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,2,Coffee Shop,Cocktail Bar,Bakery,Farmers Market,Beer Bar,Italian Restaurant,Cheese Shop,Seafood Restaurant,Restaurant,Pub


In [63]:
# create a map object
mapTorontoClusters = folium.Map(location=[latitudeToronto, longitudeToronto], zoom_start=12)
# set colors for different markers
colorsArray = cm.rainbow(np.linspace(0, 1, kClusters))
rainbow = [colors.rgb2hex(rgba) for rgba in colorsArray]
# add markers to map
for latitude, longitude, label, cluster in zip(torontoMerged['Latitude'], 
                                                torontoMerged['Longitude'], 
                                                torontoMerged['Neighborhood'],
                                                torontoMerged['Cluster Labels']):
    label = folium.Popup(label + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7,
        parse_html=True
    ).add_to(mapTorontoClusters)
# show map
mapTorontoClusters