### Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas  dataframe

In [1]:
import pandas as pd
#!pip3 install lxml

In [2]:
df=pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


#### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned

In [3]:
df = df[df.Borough != "Not assigned"]
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


#### More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. 
### These two rows will be combined into one row with the neighborhoods separated with a comma

In [4]:
df2 = df.groupby(['Postal Code','Borough'])['Neighbourhood'].apply(lambda x: ','.join(x.astype(str))).reset_index()
df2.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


#### If a cell has a borough but a Not assigned  neighborhood, then the neighborhood will be the same as the borough

In [5]:
# seaching "not assigned"
df2.loc[df2['Neighbourhood'] == ('Not assigned')].count()

Postal Code      0
Borough          0
Neighbourhood    0
dtype: int64

In [6]:
#no need to change anything: 'not assigned' not found

#### In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [7]:
df2.shape

(103, 3)

In [8]:
print('The number of rows in dataframe is: ', (int(df2.shape[0])))

The number of rows in dataframe is:  103


#### In this assignment, I prefer use the link provided instead of loop function

In [9]:
geosp = pd.read_csv('http://cocl.us/Geospatial_data')

In [10]:
geosp.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### Merging everything in a single dataframe

In [11]:
df3 = pd.merge(df2, geosp, left_on = 'Postal Code', right_on = 'Postal Code')
df3.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


### Importing libraries to use k-means and others

In [12]:
import numpy as np 

#!pip install geopy
from geopy.geocoders import Nominatim

#!pip install requests
import requests 

from pandas.io.json import json_normalize 
import json

import matplotlib.cm as cm
import matplotlib.colors as colors

#!pip install sklearn
from sklearn.cluster import KMeans

#!pip install folium
import folium

In [13]:
df3['Borough'].unique()

array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       'Mississauga', 'Etobicoke'], dtype=object)

#### I will analyze 'Downtown Toronto'. Looks like a cool part of Toronto (never been there)

In [14]:
df_dt = df3[df3['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
df_dt

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752


In [15]:
#from https://www.google.com/search?ei=D44NYOXsMNWy5OUPhYeZkAs&q=downtown+toronto+latitude+longitude&oq=downtown+toronto+latitude+longitude&gs_lcp=CgZwc3ktYWIQAzIICCEQFhAdEB4yCAghEBYQHRAeOgQIABBHUOIkWO8wYN8xaABwAngAgAGMAogB1Q6SAQMyLTiYAQCgAQGqAQdnd3Mtd2l6yAEIwAEB&sclient=psy-ab&ved=0ahUKEwilsMKn7bTuAhVVGbkGHYVDBrIQ4dUDCA0&uact=5
e_lat = 43.6548
e_long = -79.3883

# map of Toronto Downtown using latitude and longitude values
map_dt = folium.Map(location=[e_lat, e_long], zoom_start=13)

# add markers to map
for lat, lng, label in zip(df_dt['Latitude'], df_dt['Longitude'], df_dt['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dt)

In [16]:
map_dt

### Using Fousquare to get information about Toronto and neighboorhood

In [17]:
CLIENT_ID = 'KF1NUMEWXWSWVCALEAO2ZEQACAT5TGWSXQYHZZMURABZKAAM' # your Foursquare ID
CLIENT_SECRET = '2YBQW1Q1UAWRIK3TBVJNCDOGTLOJ4NUU0MUI4DYXNWHQ5RQO' # your Foursquare Secret
ACCESS_TOKEN = 'ZT4HFFZFHB55EQA0J2GP5OA00GKAZRZHFKJOMJHQPCYFPAOX' # your FourSquare Access Token
VERSION = '20180604'

In [18]:
#call function for venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            800, 
            300)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#### Getting nearby venues in Toronto Downtown. Compare to figure plotted above

In [19]:
dt_venues = getNearbyVenues(names=df_dt['Neighbourhood'],
                                   latitudes=df_dt['Latitude'],
                                   longitudes=df_dt['Longitude']
                                  )

Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Queen's Park, Ontario Provincial Government


#### Finding 5 top places for each neighboorhood

In [20]:
dt_onehot = pd.get_dummies(dt_venues[['Venue Category']], prefix="", prefix_sep="")
dt_onehot['Neighbourhood'] = dt_venues['Neighbourhood'] 
fixed_columns = [dt_onehot.columns[-1]] + list(dt_onehot.columns[:-1])
dt_venuelist = dt_onehot.groupby('Neighbourhood').mean().reset_index()


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
e_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
e_neighborhoods_venues_sorted['Neighbourhood'] = dt_venuelist['Neighbourhood']

for ind in np.arange(dt_venuelist.shape[0]):
    e_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dt_venuelist.iloc[ind, :], num_top_venues)

e_neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Berczy Park,Coffee Shop,Café,Japanese Restaurant,Hotel,Park
1,"CN Tower, King and Spadina, Railway Lands, Har...",Harbor / Marina,Rental Car Location,Airport Terminal,Airport Service,Sculpture Garden
2,Central Bay Street,Coffee Shop,Café,Clothing Store,Art Gallery,Japanese Restaurant
3,Christie,Grocery Store,Korean Restaurant,Coffee Shop,Café,Pizza Place
4,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Restaurant


#### Applying K-means 

In [21]:
# set number of clusters. Pick 6 after trial and error 
kclusters = 6

dt_venuelist_clustering = dt_venuelist.drop('Neighbourhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dt_venuelist_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([3, 5, 1, 4, 1, 3, 3, 1, 3, 0])

In [22]:
# add clustering labels
e_neighborhoods_venues_sorted.insert(0, 'Clusters', kmeans.labels_)

In [23]:
# create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.


df_dt_merged = df_dt

df_dt_merged = df_dt_merged.join(e_neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

# merge dt_grouped with dtn_data to add latitude/longitude for each neighborhood
df_dt_merged.dropna(axis=0,inplace = True)
df_dt_merged['Clusters'] = df_dt_merged['Clusters'].astype('int')

df_dt_merged

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Clusters,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,2,Park,Trail,Playground,Grocery Store,Bank
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675,1,Coffee Shop,Pizza Place,Grocery Store,Pharmacy,Café
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,1,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Restaurant
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Restaurant,Park,Italian Restaurant,Pub
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Coffee Shop,Japanese Restaurant,Gastropub,Burger Joint,Italian Restaurant
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,3,Coffee Shop,Café,Gastropub,Seafood Restaurant,Bakery
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,3,Coffee Shop,Café,Japanese Restaurant,Hotel,Park
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,1,Coffee Shop,Café,Clothing Store,Art Gallery,Japanese Restaurant
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568,3,Café,Coffee Shop,Hotel,Theater,Gym
9,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752,3,Coffee Shop,Hotel,Brewery,Café,Park


### Result visualization

In [24]:
# create map
map_clusters = folium.Map(location=[e_lat, e_long], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_dt_merged['Latitude'], df_dt_merged['Longitude'], df_dt_merged['Neighbourhood'],df_dt_merged['Clusters']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters