## Introduction/Business Problem
This is the Capstone Project assignment. The project uses three data sets: the Toronto data, the North York borough data, and the combined data of both. Our aim is to analyse all three with k-means clustering and visualise the resulting clusters on Folium maps. The analysis should ultimately let us find a similar neighbourhood in Toronto for a North York resident, and vice versa. We also vary the number of calls to the Foursquare API to check how consistent its results are and how that affects our clusters.

## Input Data
We use three inputs: the geographical coordinates of the boroughs of interest, the neighbourhood details from a Wikipedia page, and the results returned by the Foursquare API.

### Import Libraries and Set Path

In [57]:
from bs4 import BeautifulSoup
import pandas as pd
import os
import requests 
import csv
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
os.chdir("C:/Users/VIVEKANANDPANT/Desktop/GITProject/Coursera_Capstone")

### Connect to the Wikipedia page and read the data using the BeautifulSoup library

In [58]:
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(URL) 
soup = BeautifulSoup(r.content, 'html5lib') 

### Loop through the Wikipedia table rows by finding the tr and td tags, and append each row's values to a list

In [59]:
table = soup.find('table', attrs = {'class':'wikitable sortable'}) 
datas=[]
for tr in table.find_all('tr')[1:]:
    tds = tr.find_all('td')
    data = {}
    # strip() drops surrounding whitespace such as the trailing newline in the last cell
    data['Postcode'] = tds[0].text.strip()
    data['Borough'] = tds[1].text.strip()
    data['Neighbourhood'] = tds[2].text.strip()
    # a neighbourhood that is not assigned falls back to its borough name
    if data['Neighbourhood'] == 'Not assigned':
        data['Neighbourhood'] = data['Borough']
    datas.append(data)
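
The row-cleaning rule used in the loop above can be tried in isolation. The sketch below is only an illustration: `clean_row` is a hypothetical helper, and the sample cells are typed in by hand rather than scraped from Wikipedia.

```python
def clean_row(cells):
    """Turn three raw cell strings into a Postcode/Borough/Neighbourhood dict."""
    postcode, borough, neighbourhood = (c.strip() for c in cells)
    # an unassigned neighbourhood falls back to the borough name
    if neighbourhood == 'Not assigned':
        neighbourhood = borough
    return {'Postcode': postcode, 'Borough': borough, 'Neighbourhood': neighbourhood}


# hand-written sample cells (the second row is made up to trigger the fallback)
sample_rows = [
    ['M1B', 'Scarborough', 'Rouge\n'],
    ['M9Z', 'Example Borough', 'Not assigned\n'],
]
cleaned = [clean_row(r) for r in sample_rows]
```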


### Name the columns, filter out rows whose borough is 'Not assigned', and group the neighbourhoods by postcode. Then add location details from the coordinates CSV file.

In [60]:
data_df=pd.DataFrame(datas,columns=['Postcode','Borough','Neighbourhood'])
data_df=data_df[data_df['Borough']!='Not assigned']
data_df=data_df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
coord=pd.read_csv("Geospatial_Coordinates.csv")
merge_data_df=pd.merge(data_df,coord,left_on="Postcode",right_on="Postal Code")
merge_data_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
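
To make the grouping and merge steps concrete, here is a plain-Python sketch of what `groupby(...).apply(','.join)` followed by `pd.merge` produces. It is an illustration only: the helper name `group_and_merge` is hypothetical, and the rows are copied from the output shown above.

```python
from collections import defaultdict


def group_and_merge(rows, coords):
    """Join neighbourhoods that share a postcode, then attach coordinates."""
    grouped = defaultdict(list)
    for postcode, borough, neighbourhood in rows:
        grouped[(postcode, borough)].append(neighbourhood)

    merged = []
    for (postcode, borough), names in sorted(grouped.items()):
        # inner join: postcodes without coordinates are dropped
        if postcode in coords:
            lat, lng = coords[postcode]
            merged.append((postcode, borough, ','.join(names), lat, lng))
    return merged


rows = [
    ('M1B', 'Scarborough', 'Rouge'),
    ('M1B', 'Scarborough', 'Malvern'),
    ('M1G', 'Scarborough', 'Woburn'),
]
coords = {'M1B': (43.806686, -79.194353), 'M1G': (43.770992, -79.216917)}
result = group_and_merge(rows, coords)
```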


In [50]:
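# build each mask from merge_data_df itself so the boolean index aligns with the rows being filtered
toronto_data=merge_data_df[merge_data_df['Borough'].str.contains('Toronto')]
north_york_data=merge_data_df[merge_data_df['Borough'].str.contains('North York')]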
north_york_toronto_data=pd.concat([toronto_data,north_york_data],axis=0)
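
Note that `str.contains('Toronto')` is a substring match, so it selects every borough whose name includes the word. A small sketch with a hand-typed list of borough names (an assumption about what the table contains) shows the effect:

```python
# hand-typed borough names, assumed to resemble those in the Wikipedia table
boroughs = ['Downtown Toronto', 'East Toronto', 'West Toronto',
            'Central Toronto', 'North York', 'East York', 'Scarborough']

# plain substring tests mirror what str.contains does for these simple patterns
toronto_like = [b for b in boroughs if 'Toronto' in b]
north_york_like = [b for b in boroughs if 'North York' in b]
```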

### Set a LIMIT of 500 and define a function that takes the location and radius of each neighbourhood as input and fetches the relevant information from the Foursquare API

In [9]:
#Setting up Foursquare
# Credentials are read from environment variables (the variable names are an assumption) rather than stored in the notebook
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret
VERSION = '20191001'
LIMIT = 500


def FetchNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request and keep only the list of recommended items
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'],
            v['reasons']['items'][0]['summary']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category',
                  'Reason Summary' ]
    
    return(nearby_venues)
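
As a sketch of what `FetchNearbyVenues` does for a single neighbourhood, the snippet below builds the request URL with `urllib.parse.urlencode` and flattens one response into rows. It makes no network call: `sample_response` is a hand-written stand-in that only mirrors the keys the function above reads, and `build_url` and `flatten_items` are hypothetical helper names.

```python
from urllib.parse import urlencode


def build_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL used above from its query parameters."""
    query = urlencode({
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    })
    return 'https://api.foursquare.com/v2/venues/explore?' + query


def flatten_items(name, lat, lng, items):
    """Keep one tuple per venue, in the same column order as the DataFrame above."""
    return [(name, lat, lng,
             v['venue']['name'],
             v['venue']['location']['lat'],
             v['venue']['location']['lng'],
             v['venue']['categories'][0]['name'],
             v['reasons']['items'][0]['summary']) for v in items]


# hand-written stand-in for requests.get(url).json()
sample_response = {'response': {'groups': [{'items': [{
    'venue': {'name': "Tori's Bakeshop",
              'location': {'lat': 43.672114, 'lng': -79.290331},
              'categories': [{'name': 'Vegetarian / Vegan Restaurant'}]},
    'reasons': {'items': [{'summary': 'This spot is popular'}]},
}]}]}}

url = build_url('ID', 'SECRET', '20191001', 43.676357, -79.293031, 1000, 500)
items = sample_response['response']['groups'][0]['items']
venue_rows = flatten_items('The Beaches', 43.676357, -79.293031, items)
```

Passing the parameters through `urlencode` percent-encodes the comma in `ll`, which avoids hand-formatting the query string.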

In [13]:
def N_common_venues(row, num_top_venues):
    categories = row.iloc[1:]
    categories_sorted = categories.sort_values(ascending=False)  
    return categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))
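
The try/except above works for the ten columns used here. For longer lists, a small helper can pick the suffix explicitly; the sketch below (with the hypothetical name `ordinal`) also handles 11th to 13th and 21st, which a three-item lookup would not.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
```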

### For the Toronto data: fetch the venues with the Foursquare function above and one-hot encode them as input for the ML algorithm in later steps

In [51]:
toronto_venues_500 = FetchNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )
toronto_venues_500.to_excel("toronto_venues_500.xlsx")  # recent pandas versions no longer write the legacy .xls format
# one hot encoding to convert Venue categories in data to binary form suitable for ML input
toronto500_oh = pd.get_dummies(toronto_venues_500[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto500_oh['Neighborhood'] = toronto_venues_500['Neighborhood'] 

# below will give summary of each neighbourhood in a row
toronto500_grouped = toronto500_oh.groupby('Neighborhood').mean().reset_index()
neighborhood_sort = pd.DataFrame(columns=columns)
neighborhood_sort['Neighborhood'] = toronto500_grouped['Neighborhood']

for ind in np.arange(toronto500_grouped.shape[0]):
    neighborhood_sort.iloc[ind, 1:] = N_common_venues(toronto500_grouped.iloc[ind, :], num_top_venues)
neighborhood_sort.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Café,Coffee Shop,Hotel,Theater,Sushi Restaurant,Japanese Restaurant,Restaurant,Asian Restaurant,Steakhouse,Beer Bar
1,Berczy Park,Coffee Shop,Hotel,Café,Restaurant,Japanese Restaurant,Beer Bar,Italian Restaurant,Bakery,Park,Gym
2,"Brockton,Exhibition Place,Parkdale Village",Café,Coffee Shop,Furniture / Home Store,Bakery,Tibetan Restaurant,Restaurant,Bar,Indian Restaurant,Arts & Crafts Store,Thrift / Vintage Store
3,Business Reply Mail Processing Centre 969 Eastern,Park,Coffee Shop,Pizza Place,Brewery,Pet Store,Sushi Restaurant,Burrito Place,Italian Restaurant,Pub,Restaurant
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Harbor / Marina,Café,Coffee Shop,Sculpture Garden,Track,Garden,Airport,Airport Lounge,Scenic Lookout,Dance Studio
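
Taking the mean of one-hot columns per neighbourhood is equivalent to computing each category's share of that neighbourhood's venues. The pure-Python sketch below illustrates this on a made-up list of (neighbourhood, category) pairs; `category_shares` and `top_categories` are hypothetical helper names, not part of the notebook.

```python
from collections import Counter, defaultdict


def category_shares(pairs):
    """Map each neighbourhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighbourhood, category in pairs:
        counts[neighbourhood][category] += 1
    return {n: {cat: k / sum(c.values()) for cat, k in c.items()}
            for n, c in counts.items()}


def top_categories(shares, n):
    """Return the n categories with the largest share, most common first."""
    return [cat for cat, _ in sorted(shares.items(), key=lambda kv: -kv[1])[:n]]


# made-up venues for illustration
pairs = [
    ('Berczy Park', 'Coffee Shop'), ('Berczy Park', 'Coffee Shop'),
    ('Berczy Park', 'Hotel'), ('Berczy Park', 'Café'),
    ('The Beaches', 'Trail'), ('The Beaches', 'Gastropub'),
]
shares = category_shares(pairs)
top2 = top_categories(shares['Berczy Park'], 2)
```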


In [61]:
toronto_venues_500.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Reason Summary
0,The Beaches,43.676357,-79.293031,Tori's Bakeshop,43.672114,-79.290331,Vegetarian / Vegan Restaurant,This spot is popular
1,The Beaches,43.676357,-79.293031,The Beech Tree,43.680493,-79.288846,Gastropub,This spot is popular
2,The Beaches,43.676357,-79.293031,The Fox Theatre,43.672801,-79.287272,Indie Movie Theater,This spot is popular
3,The Beaches,43.676357,-79.293031,Ed's Real Scoop,43.67263,-79.287993,Ice Cream Shop,This spot is popular
4,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail,This spot is popular


### For the North York data: fetch the venues with the Foursquare function above and one-hot encode them as input for the ML algorithm in later steps

In [52]:
north_york_venues_500 = FetchNearbyVenues(names=north_york_data['Neighbourhood'],
                                   latitudes=north_york_data['Latitude'],
                                   longitudes=north_york_data['Longitude']
                                  )
north_york_venues_500.to_excel("north_york_venues_500.xlsx")  # recent pandas versions no longer write the legacy .xls format
# one hot encoding to convert Venue categories in data to binary form suitable for ML input
north_york500_oh = pd.get_dummies(north_york_venues_500[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
north_york500_oh['Neighborhood'] = north_york_venues_500['Neighborhood'] 

# below will give summary of each neighbourhood in a row
north_york500_grouped = north_york500_oh.groupby('Neighborhood').mean().reset_index()

neighborhood_sort1 = pd.DataFrame(columns=columns)
neighborhood_sort1['Neighborhood'] = north_york500_grouped['Neighborhood']

for ind in np.arange(north_york500_grouped.shape[0]):
    neighborhood_sort1.iloc[ind, 1:] = N_common_venues(north_york500_grouped.iloc[ind, :], num_top_venues)
neighborhood_sort1.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor,Downsview North,Wilson Heights",Coffee Shop,Chinese Restaurant,Mediterranean Restaurant,Ski Area,Bridal Shop,Dog Run,Shopping Mall,Diner,Sandwich Place,Restaurant
1,Bayview Village,Japanese Restaurant,Bank,Café,Fast Food Restaurant,Skating Rink,Grocery Store,Skate Park,Shopping Mall,Trail,Chinese Restaurant
2,"Bedford Park,Lawrence Manor East",Italian Restaurant,Coffee Shop,Fast Food Restaurant,Hobby Shop,Butcher,Juice Bar,Sandwich Place,Restaurant,Intersection,Pub
3,"CFB Toronto,Downsview East",Coffee Shop,Turkish Restaurant,Gym,Electronics Store,Latin American Restaurant,Pizza Place,Soccer Field,Business Service,Café,Sandwich Place
4,Don Mills North,Japanese Restaurant,Coffee Shop,Pizza Place,Burger Joint,Spa,Greek Restaurant,Ice Cream Shop,Liquor Store,Diner,Restaurant


### For the North York and Toronto data combined: fetch the venues with the Foursquare function above and one-hot encode them as input for the ML algorithm in later steps

In [53]:
north_york_toronto_venues_500 = FetchNearbyVenues(names=north_york_toronto_data['Neighbourhood'],
                                   latitudes=north_york_toronto_data['Latitude'],
                                   longitudes=north_york_toronto_data['Longitude']
                                  )

# one hot encoding to convert Venue categories in data to binary form suitable for ML input
north_york_toronto500_oh = pd.get_dummies(north_york_toronto_venues_500[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
north_york_toronto500_oh['Neighborhood'] = north_york_toronto_venues_500['Neighborhood'] 

# below will give summary of each neighbourhood in a row
north_york_toronto500_grouped = north_york_toronto500_oh.groupby('Neighborhood').mean().reset_index()

neighborhood_sort2 = pd.DataFrame(columns=columns)
neighborhood_sort2['Neighborhood'] = north_york_toronto500_grouped['Neighborhood']

for ind in np.arange(north_york_toronto500_grouped.shape[0]):
    neighborhood_sort2.iloc[ind, 1:] = N_common_venues(north_york_toronto500_grouped.iloc[ind, :], num_top_venues)
neighborhood_sort2.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Café,Coffee Shop,Hotel,Theater,Japanese Restaurant,Sushi Restaurant,Asian Restaurant,Restaurant,Pizza Place,Clothing Store
1,"Bathurst Manor,Downsview North,Wilson Heights",Coffee Shop,Bridal Shop,Sandwich Place,Sushi Restaurant,Supermarket,Fast Food Restaurant,Mediterranean Restaurant,Middle Eastern Restaurant,Ski Chalet,Ski Area
2,Bayview Village,Bank,Japanese Restaurant,Fast Food Restaurant,Skate Park,Chinese Restaurant,Shopping Mall,Intersection,Trail,Park,Skating Rink
3,"Bedford Park,Lawrence Manor East",Coffee Shop,Italian Restaurant,Hobby Shop,Fast Food Restaurant,Thai Restaurant,Bank,Indian Restaurant,Sushi Restaurant,Intersection,Pub
4,Berczy Park,Coffee Shop,Hotel,Café,Restaurant,Beer Bar,Japanese Restaurant,Italian Restaurant,Bakery,Park,Steakhouse


### We now have three data sets: Toronto, North York, and Toronto + North York. Next we create k-means clusters for each of them.

In [21]:
from sklearn.cluster import KMeans
toronto500_grouped_cluster = toronto500_grouped.drop(columns='Neighborhood')
# k-means clustering
clusters = KMeans(n_clusters=5, random_state=0).fit(toronto500_grouped_cluster)
# clustering labels 
clusters.labels_
# merge neighborhood_sort and toronto_data to create input for map by adding location detail
neighborhood_sort.insert(0, 'Cluster Labels', clusters.labels_)
toronto500_merged = toronto_data
toronto500_merged = toronto500_merged.merge(neighborhood_sort.set_index('Neighborhood'),
                                      left_on='Neighbourhood',right_on='Neighborhood')
toronto500_merged.to_excel("toronto500_merged.xlsx")  # recent pandas versions no longer write the legacy .xls format

north_york500_grouped_cluster = north_york500_grouped.drop(columns='Neighborhood')
# k-means clustering
clusters1 = KMeans(n_clusters=5, random_state=0).fit(north_york500_grouped_cluster)
# clustering labels 
clusters1.labels_
# merge neighborhood_sort and north_york_data to create input for map by adding location detail
neighborhood_sort1.insert(0, 'Cluster Labels', clusters1.labels_)
north_york500_merged = north_york_data
north_york500_merged = north_york500_merged.merge(neighborhood_sort1.set_index('Neighborhood'),
                                      left_on='Neighbourhood',right_on='Neighborhood')
north_york500_merged.to_excel("north_york500_merged.xlsx")  # recent pandas versions no longer write the legacy .xls format
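
For readers curious about what `KMeans(...).fit(...)` does, here is a minimal pure-Python sketch of Lloyd's algorithm, the iteration that k-means is built on. It is not scikit-learn's implementation (which adds smarter initialisation and multiple restarts); the function name `lloyd_kmeans`, the fixed initial centres, and the toy points are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, centres, max_iter=100):
    """Alternate assignment and centre-update steps until labels stop changing."""
    centres = [list(c) for c in centres]  # work on a copy of the initial centres
    labels = None
    for _ in range(max_iter):
        # assignment step: each point joins its nearest centre
        new_labels = [min(range(len(centres)),
                          key=lambda k: squared_distance(p, centres[k]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # update step: each centre moves to the mean of its members
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its old centre
                centres[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centres


# toy venue-share vectors: two "café-heavy" and two "park-heavy" neighbourhoods
points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centres = lloyd_kmeans(points, centres=[[1.0, 0.0], [0.0, 1.0]])
```

With only a handful of neighbourhoods per borough, a cluster can easily end up with a single member, which is worth keeping in mind when reading the maps.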


In [62]:
north_york500_merged

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,North York,Hillcrest Village,M2H,43.803762,-79.363452,3,Park,Pharmacy,Coffee Shop,Residential Building (Apartment / Condo),Sandwich Place,Chinese Restaurant,Pizza Place,Diner,Recreation Center,Bank
1,M2J,North York,"Fairview,Henry Farm,Oriole",M2J,43.778517,-79.346556,0,Clothing Store,Coffee Shop,Fast Food Restaurant,Sandwich Place,Bakery,Japanese Restaurant,Juice Bar,Shopping Mall,Fried Chicken Joint,Salon / Barbershop
2,M2K,North York,Bayview Village,M2K,43.786947,-79.385975,0,Japanese Restaurant,Bank,Café,Fast Food Restaurant,Skating Rink,Grocery Store,Skate Park,Shopping Mall,Trail,Chinese Restaurant
3,M2L,North York,"Silver Hills,York Mills",M2L,43.75749,-79.374714,1,Park,Pool,Yoga Studio,Discount Store,Fast Food Restaurant,Falafel Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dog Run
4,M2M,North York,"Newtonbrook,Willowdale",M2M,43.789053,-79.408493,0,Café,Korean Restaurant,Coffee Shop,Middle Eastern Restaurant,Grocery Store,Japanese Restaurant,Park,Dessert Shop,Gym,Ramen Restaurant
5,M2N,North York,Willowdale South,M2N,43.77012,-79.408493,0,Coffee Shop,Bubble Tea Shop,Ramen Restaurant,Pizza Place,Korean Restaurant,Japanese Restaurant,Sandwich Place,Fast Food Restaurant,Restaurant,Café
6,M2P,North York,York Mills West,M2P,43.752758,-79.400049,3,Park,Coffee Shop,Tennis Court,Restaurant,Gym,French Restaurant,Convenience Store,Grocery Store,Dentist's Office,Bank
7,M2R,North York,Willowdale West,M2R,43.782736,-79.442259,3,Pharmacy,Convenience Store,Bus Line,Pizza Place,Coffee Shop,Eastern European Restaurant,Bakery,Butcher,Park,Dog Run
8,M3A,North York,Parkwoods,M3A,43.753259,-79.329656,3,Park,Convenience Store,Shopping Mall,Bus Stop,Pharmacy,Supermarket,Discount Store,Skating Rink,Laundry Service,Café
9,M3B,North York,Don Mills North,M3B,43.745906,-79.352188,0,Japanese Restaurant,Coffee Shop,Pizza Place,Burger Joint,Spa,Greek Restaurant,Ice Cream Shop,Liquor Store,Diner,Restaurant


In [36]:
north_york_toronto500_grouped_cluster = north_york_toronto500_grouped.drop(columns='Neighborhood')
# k-means clustering
clusters2 = KMeans(n_clusters=5, random_state=0).fit(north_york_toronto500_grouped_cluster)
# clustering labels 
clusters2.labels_
# merge neighborhood_sort and north_york_toronto_data to create input for map by adding location detail
neighborhood_sort2.insert(0, 'Cluster Labels', clusters2.labels_)
north_york_toronto500_merged = north_york_toronto_data
north_york_toronto500_merged = north_york_toronto500_merged.merge(neighborhood_sort2.set_index('Neighborhood'),
                                      left_on='Neighbourhood',right_on='Neighborhood')
north_york_toronto500_merged.to_excel("north_york_toronto500_merged.xlsx")  # recent pandas versions no longer write the legacy .xls format

### Map to visualize the clusters using Toronto data

In [42]:
# create map
map_clusters = folium.Map(location=[43.65795, -79.3874], zoom_start=11)
# set color scheme for the clusters
x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto500_merged['Latitude'], toronto500_merged['Longitude'], toronto500_merged['Neighbourhood'], toronto500_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    
    folium.CircleMarker(
        [lat, lon],
        radius=7,
        popup=label,
        color=rainbow[cluster],  # cluster labels 0-4 index the five colours directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
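
The colour list only needs one distinct hex colour per cluster. As an alternative sketch that avoids matplotlib, evenly spaced hues from the standard-library `colorsys` module also work; `cluster_colours` is a hypothetical helper, and the resulting palette differs from `cm.rainbow`.

```python
import colorsys


def cluster_colours(n):
    """Return n hex colours with evenly spaced hues."""
    colours = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.9, 0.9)
        colours.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colours


palette = cluster_colours(5)
# a cluster label in 0..4 can then index the list directly: palette[label]
```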

### Observation - Clusters 2 and 3 each contain a single location, while cluster 1 is concentrated in nearby locations. Clusters 4 and 5 contain many locations spread throughout Toronto.

- CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara: Cluster 2
- Lawrence Park: Cluster 3


### Map to visualize the clusters using North York data

In [47]:
# create map
map_clusters1 = folium.Map(location=[43.7527583,-79.4000493], zoom_start=10)
# set color scheme for the clusters
x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(north_york500_merged['Latitude'], north_york500_merged['Longitude'], north_york500_merged['Neighbourhood'], north_york500_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    
    folium.CircleMarker(
        [lat, lon],
        radius=7,
        popup=label,
        color=rainbow[cluster],  # cluster labels 0-4 index the five colours directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters1)
       
map_clusters1

### Observation - Clusters 1, 2 and 4 each contain a single location (listed below), while clusters 0 and 3 contain several locations

- Emery,Humberlea: Cluster 4
- Downsview Central: Cluster 2
- Silver Hills,York Mills: Cluster 1


### Map to visualize the clusters using Toronto and North York combined data

In [48]:
# create map
map_clusters2 = folium.Map(location=[43.7527583,-79.4000493], zoom_start=10)
# set color scheme for the clusters
x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(north_york_toronto500_merged['Latitude'], north_york_toronto500_merged['Longitude'], north_york_toronto500_merged['Neighbourhood'], north_york_toronto500_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    
    folium.CircleMarker(
        [lat, lon],
        radius=7,
        popup=label,
        color=rainbow[cluster],  # cluster labels 0-4 index the five colours directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters2)
       
map_clusters2

### No similar location was found for the 3 neighbourhoods below
- Downsview Central: Cluster 4
- Willowdale West: Cluster 1
- Silver Hills,York Mills: Cluster 3
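
Neighbourhoods that sit alone in their cluster can be listed programmatically instead of read off the map. The sketch below uses plain Python and a hand-typed subset of labels; `singleton_clusters` is a hypothetical helper.

```python
from collections import defaultdict


def singleton_clusters(labels_by_neighbourhood):
    """Return {cluster: neighbourhood} for clusters with exactly one member."""
    members = defaultdict(list)
    for neighbourhood, cluster in labels_by_neighbourhood.items():
        members[cluster].append(neighbourhood)
    return {c: names[0] for c, names in members.items() if len(names) == 1}


# the three singletons are from the list above; the cluster 0 labels are made up
labels_by_neighbourhood = {
    'Downsview Central': 4,
    'Willowdale West': 1,
    'Silver Hills,York Mills': 3,
    'Berczy Park': 0,
    'Bayview Village': 0,
}
alone = singleton_clusters(labels_by_neighbourhood)
```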

Finding - Willowdale West is a neighbourhood in the North York borough. When we clustered only the North York data, cluster 3 contained 7 neighbourhoods and Willowdale West was one of them.
When we clustered the North York and Toronto data combined, Willowdale West was the only neighbourhood in its cluster (cluster 1).
Digging deeper into the data, we noticed that this happens because the results returned by the Foursquare API are not consistent.
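
One way to quantify how consistent two API responses are is the Jaccard similarity of the venue sets they return for the same neighbourhood. The sketch below is an illustration with made-up venue names, not a measurement from this project; `jaccard` is a hypothetical helper.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (1.0 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# made-up results of two calls for the same location
first_call = ['Venue A', 'Venue B', 'Venue C', 'Venue D']
second_call = ['Venue A', 'Venue B', 'Venue E']
similarity = jaccard(first_call, second_call)
```

A value well below 1.0 for repeated calls at the same coordinates would indicate that the venue lists, and therefore the one-hot features fed to k-means, differ from run to run.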