Hi there. This notebook will be used primarily for the capstone project on Coursera's Data Science and Machine Learning Course.

In [1]:
import pandas as pd
import numpy as np

In [2]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


Introduction:

You are a real estate agent in the city of Montreal, and resources are stretched. You have many enquiries coming in and not enough time to handle them all efficiently. There are young adults looking for their first apartment, middle-aged couples with children in school, and even the occasional retiree who just wants to relax and pass the rest of their days in a peaceful neighbourhood. But time is money, and it is very limited right now.

What you need is a way to efficiently cluster the neighbourhoods in Montreal, so that you can recommend specific locations based on your clients’ profiles, instead of going on a wild goose chase around the city. Further, by clustering an existing dataset of client profiles, every new client can be assigned a cluster based on their profile, simplifying your task further.

This project aims to:

1)	Cluster the neighbourhoods in Montreal based on the facilities and amenities available in each location.

2)	Cluster an existing client database into a small number of client profiles.

3)	Assign client profiles to neighbourhood clusters based on assumptions regarding popular activities.




Data:

The relevant data for this project will be as follows:

1.	Foursquare location data for Montreal – using the Foursquare API
This data will help provide a list of points of interest across multiple locations within the greater Montreal metropolitan area.

2.	List of postal codes in Montreal – available at the link:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_H

We will need to retrieve coordinate data for each neighbourhood using the postal codes available at this address, in order to facilitate accurate use of the Foursquare API.
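
The cell that attached coordinates to the postal codes was removed before sharing, so its method is not shown in this notebook. As a minimal sketch only, assuming the coordinates are already available in a lookup table keyed by postal code (the `coords` frame below is a hypothetical stand-in, filled with values that appear in this notebook's own output), a left merge keeps every postal code and makes any missing coordinates easy to spot:

```python
import pandas as pd

# Scraped postal codes (three rows taken from this notebook's output).
postal = pd.DataFrame({
    "Postal Code": ["H1A", "H2A", "H3A"],
    "Neighborhood": ["Pointe-aux-Trembles", "Saint-Michel East", "Downtown Montreal North"],
})

# Hypothetical lookup table; in practice it might come from a geocoding
# service or a CSV file. H3A is left out on purpose to show a gap.
coords = pd.DataFrame({
    "Postal Code": ["H1A", "H2A"],
    "Latitude": [45.4543, 45.561567],
    "Longitude": [-73.483774, -73.601288],
})

# how="left" keeps every scraped row; validate guards against duplicate keys.
with_coords = postal.merge(coords, on="Postal Code", how="left", validate="one_to_one")

# Rows whose coordinates could not be found.
missing = with_coords.loc[with_coords["Latitude"].isna(), "Postal Code"].tolist()
print(with_coords)
print("Postal codes without coordinates:", missing)
```

Checking `missing` before calling the Foursquare API avoids sending requests with empty coordinates.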

3.	A pre-existing database of customer profiles from the course, available at the link:
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%204/data/Cust_Segmentation.csv

This data contains details such as age, income and education levels, all of which may be relevant to client clustering.


In [3]:
!pip install bs4

  from cryptography.utils import int_from_bytes
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1272 sha256=f36ceb6f92cc158854d9e4d3081b1742dc983b78c602fb451644884d987a0543
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/0a/9e/ba/20e5bbc1afef3a491f0b3bb74d508f99403aabe76eda2167ca
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


#### IMPORT REQUIRED PACKAGES

In [4]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from sklearn.cluster import KMeans

#### Webpage with list of postal codes to be scraped

In [5]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_H"

html_data = requests.get(url).text

In [6]:
b_soup1 = BeautifulSoup(html_data, "html5lib")

In [7]:
tag = b_soup1.title
tag

<title>List of postal codes of Canada: H - Wikipedia</title>

In [8]:
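table_contents = []
table = b_soup1.find('table')
for row in table.find_all('td'):
    # Skip postal codes that have no neighbourhood assigned.
    if row.span.text == 'Not assigned':
        continue
    cell = {}
    # The first three characters are the forward sortation area, e.g. "H1A".
    cell['Postal Code'] = row.p.text[:3]
    # Keep the text before the first "(" and normalise the remaining separators.
    cell['Neighborhood'] = (((((row.span.text).split('(')[0]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
    table_contents.append(cell)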

In [9]:
df1=pd.DataFrame(table_contents)

In [10]:
pd.set_option('display.max_rows',200)

In [24]:
df1

Unnamed: 0,Postal Code,Neighborhood
0,H1A,Pointe-aux-Trembles
1,H2A,"Saint-Michel,East"
2,H3A,Downtown Montreal North
3,H4A,Notre-Dame-de-GrâceNortheast
4,H5A,Place Bonaventure
5,H7A,Duvernay-Est
6,H9A,Dollard-des-OrmeauxNorthwest
7,H1B,Montreal East
8,H2B,AhuntsicNorth
9,H3B,Downtown MontrealEast
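
Several names in the table above have a direction fused onto the place name (for example `AhuntsicNorth`), while later tables in this notebook show them with a space. The cell that did that cleanup is not shown, so the helper below is only one possible approach: a hypothetical function `split_fused_words` that inserts a space wherever a lowercase letter is immediately followed by an uppercase one.

```python
def split_fused_words(name: str) -> str:
    """Insert a space between a lowercase letter and a following uppercase letter."""
    pieces = []
    # Pair each character with the one before it (a space stands in for "nothing").
    for prev, cur in zip(" " + name, name):
        if prev.islower() and cur.isupper():
            pieces.append(" ")
        pieces.append(cur)
    return "".join(pieces)


for raw in ["Notre-Dame-de-GrâceNortheast", "AhuntsicNorth", "Pointe-aux-Trembles"]:
    print(raw, "->", split_fused_words(raw))
```

`str.islower` and `str.isupper` handle accented letters such as `â`, which a plain `[a-z]` pattern would miss. A name with a legitimate internal capital (say, one starting with "Mc") would also be split, so the result is worth checking by eye.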


In [12]:
import os
os.getcwd()

'/home/wsuser/work'

In [15]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0.1,Unnamed: 0,Postal Code,Neighborhood,Latitude,Longitude
0,0,H1A,Pointe-aux-Trembles,45.4543,-73.483774
1,1,H2A,Saint-Michel East,45.561567,-73.601288
2,2,H3A,Downtown Montreal North,45.506992,-73.568941
3,3,H4A,Notre-Dame-de-Grâce Northeast,45.476496,-73.622168
4,4,H5A,Place Bonaventure,45.500795,-73.56523


In [16]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python-3.7-main

  added / updated specs:
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _libgcc_mutex-0.1          |      conda_forge           3 KB  conda-forge
    _openmp_mutex-4.5          |           1_llvm           5 KB  conda-forge
    _py-xgboost-mutex-2.0      |            cpu_0           8 KB  conda-forge
    _pytorch_select-0.2        |            gpu_0           2 KB
    absl-py-0.13.0             |     pyhd8ed1ab_0          97 KB  conda-forge
    aiohttp-3.7.4.post0        |   py37h5e8e339_0  

In [17]:
df_data_1.head()

Unnamed: 0.1,Unnamed: 0,Postal Code,Neighborhood,Latitude,Longitude
0,0,H1A,Pointe-aux-Trembles,45.4543,-73.483774
1,1,H2A,Saint-Michel East,45.561567,-73.601288
2,2,H3A,Downtown Montreal North,45.506992,-73.568941
3,3,H4A,Notre-Dame-de-Grâce Northeast,45.476496,-73.622168
4,4,H5A,Place Bonaventure,45.500795,-73.56523


In [21]:
map_mon = folium.Map(location=[45.4543, -73.483774], zoom_start=10)

In [22]:
for lat, lng, neighborhood in zip(df_data_1['Latitude'], df_data_1['Longitude'], df_data_1['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_mon) 

In [25]:
map_mon

In [27]:
CLIENT_ID = 'KPWP4MLQUQWEC0GJXNXNLSP54F2KXA1UOTTKJHSOH5FSN0WI'
CLIENT_SECRET = '3X0HZRFAN3SVXMC0GEFZWD1J5JH1J0CAHH1TT3EQPMUVUMWV'
VERSION = '20180605'
LIMIT = 100

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: KPWP4MLQUQWEC0GJXNXNLSP54F2KXA1UOTTKJHSOH5FSN0WI
CLIENT_SECRET:3X0HZRFAN3SVXMC0GEFZWD1J5JH1J0CAHH1TT3EQPMUVUMWV


In [29]:
lat = df_data_1.loc[0, 'Latitude']
lng = df_data_1.loc[0, 'Longitude']

neighborhood_name = df_data_1.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               lat, 
                                                               lng))

Latitude and longitude values of Pointe-aux-Trembles are 45.4543, -73.483774.


In [30]:
def getNearbyVenues(names, latitudes, longitudes, radius=350):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
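
`getNearbyVenues` indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so an error response or a venue without a category would raise an exception partway through the loop. The sketch below shows a more defensive version of just the parsing step. The helper name `extract_venue_rows` is hypothetical, and the payload is hand-written in the shape the function above expects; it is not a real API response.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Return one tuple per venue, or an empty list if the payload has no items."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        # Some venues may carry no category; record None instead of failing.
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written sample; the second venue is invented to show the no-category case.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "STM Station Saint-Michel",
               "location": {"lat": 45.559425, "lng": -73.599749},
               "categories": [{"name": "Metro Station"}]}},
    {"venue": {"name": "Uncategorised venue",
               "location": {"lat": 45.56, "lng": -73.60},
               "categories": []}},
]}]}}

rows = extract_venue_rows(sample, "Saint-Michel East", 45.561567, -73.601288)
print(rows)
print(extract_venue_rows({}, "Nowhere", 0.0, 0.0))
```

Returning an empty list for a neighbourhood with no results lets the caller keep going and build the DataFrame from whatever was found.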

In [31]:
mon_venues = getNearbyVenues(names=df_data_1['Neighborhood'],
                                   latitudes=df_data_1['Latitude'],
                                   longitudes=df_data_1['Longitude']
                                  )

Pointe-aux-Trembles
Saint-Michel East
Downtown Montreal North
Notre-Dame-de-Grâce Northeast
Place Bonaventure
Duvernay-Est
Dollard-des-Ormeaux Northwest
Montreal East
Ahuntsic North
Downtown Montreal East
Notre-Dame-de-Grâce Southwest
Place Desjardins
Saint-François
Dollard-des-Ormeaux East
Rivière-des-Prairies Northeast
Ahuntsic Central
Griffintown
Saint-Henri
Saint-Vincent-de-Paul
L'Île-Bizard Northeast
Rivière-des-Prairies Southwest
Villeray Northeast
L'Île-Des-Soeurs
Ville Émard
Duvernay
L'Île-Bizard Southwest
Montréal-Nord North
Petite-Patrie Northeast
Downtown Montreal Southeast
Verdun North
Pont-Viau
Dollard-des-Ormeaux Southwest
Reserved0H0: Santa Claus
Montréal-Nord South
Plateau Mont-Royal North
Downtown Montreal Southwest
Verdun South
Auteuil West
Sainte-Geneviève, Pierrefonds
Anjou West
Plateau Mont-Royal North Central
Petite-Bourgogne
Cartierville Central
Auteuil Northeast
Kirkland
Anjou East
Centre-Sud North
Pointe-Saint-Charles
Cartierville Southwest
Auteuil South
Sennev

In [32]:
mon_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Saint-Michel East,45.561567,-73.601288,STM Station Saint-Michel,45.559425,-73.599749,Metro Station
1,Saint-Michel East,45.561567,-73.601288,Marché Aux Puces Saint-Michel,45.562502,-73.605079,Flea Market
2,Saint-Michel East,45.561567,-73.601288,Petro-Canada,45.560984,-73.602396,Gas Station
3,Saint-Michel East,45.561567,-73.601288,Restaurant Kim Hour,45.561836,-73.605112,Chinese Restaurant
4,Saint-Michel East,45.561567,-73.601288,Poissonnerie Mediterraniénne,45.562328,-73.605625,Fish & Chips Shop
...,...,...,...,...,...,...,...
869,Tour de la Bourse,45.515558,-73.531910,Complexe aquatique de l'Île,45.513078,-73.534298,Pool
870,Tour de la Bourse,45.515558,-73.531910,Biosphère,45.513979,-73.531582,Science Museum
871,Tour de la Bourse,45.515558,-73.531910,Tour de Lévis,45.517164,-73.533460,Historic Site
872,Tour de la Bourse,45.515558,-73.531910,STM Ligne 767 La Ronde,45.512670,-73.530872,Bus Stop


In [33]:
mon_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Ahuntsic Central,11,11,11,11,11,11
Ahuntsic East,2,2,2,2,2,2
Ahuntsic North,1,1,1,1,1,1
Ahuntsic Southeast,6,6,6,6,6,6
Ahuntsic Southwest,6,6,6,6,6,6
Anjou East,4,4,4,4,4,4
Anjou West,3,3,3,3,3,3
Auteuil Northeast,1,1,1,1,1,1
Auteuil South,4,4,4,4,4,4
Auteuil West,2,2,2,2,2,2


In [34]:
print('There are {} unique categories.'.format(len(mon_venues['Venue Category'].unique())))

There are 190 unique categories.


In [35]:
mon_onehot = pd.get_dummies(mon_venues[['Venue Category']], prefix="", prefix_sep="")

mon_onehot['Neighborhood'] = mon_venues['Neighborhood'] 

fixed_columns = [mon_onehot.columns[-1]] + list(mon_onehot.columns[:-1])
mon_onehot = mon_onehot[fixed_columns]

mon_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,...,Train Station,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Saint-Michel East,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Saint-Michel East,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Saint-Michel East,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Saint-Michel East,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Saint-Michel East,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [36]:
mon_onehot.shape

(874, 191)

In [37]:
mon_grouped = mon_onehot.groupby('Neighborhood').mean().reset_index()
mon_grouped

Unnamed: 0,Neighborhood,ATM,Accessories Store,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,...,Train Station,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Ahuntsic Central,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0
1,Ahuntsic East,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Ahuntsic North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Ahuntsic Southeast,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0
4,Ahuntsic Southwest,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Anjou East,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Anjou West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Auteuil Northeast,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Auteuil South,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Auteuil West,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [38]:
mon_grouped.shape

(108, 191)
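
Taking the mean of the one-hot columns per neighbourhood gives the share of that neighbourhood's venues that fall in each category, which is how the 874 venue rows collapse into 108 neighbourhood rows. A toy illustration with made-up venues (the neighbourhood names `A` and `B` are placeholders):

```python
import pandas as pd

# Five invented venues across two placeholder neighbourhoods.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Café", "Bakery", "Park"],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# Mean of the 0/1 indicators = fraction of venues in each category.
grouped = onehot.groupby("Neighborhood").mean()
print(grouped)
```

Each row of `grouped` sums to 1, so neighbourhoods with very few venues end up with a few large values and many zeros.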

In [39]:
num_top_venues = 3

for hood in mon_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = mon_grouped[mon_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Ahuntsic Central----
            venue  freq
0             Bar  0.09
1  Ice Cream Shop  0.09
2      Restaurant  0.09


----Ahuntsic East----
                venue  freq
0  Athletics & Sports   0.5
1          Playground   0.5
2                 ATM   0.0


----Ahuntsic North----
          venue  freq
0  Soccer Field   1.0
1           ATM   0.0
2   Record Shop   0.0


----Ahuntsic Southeast----
                venue  freq
0  Italian Restaurant  0.33
1       Women's Store  0.17
2      Sandwich Place  0.17


----Ahuntsic Southwest----
          venue  freq
0  Home Service  0.33
1        Bakery  0.17
2          Park  0.17


----Anjou East----
                venue  freq
0   Convenience Store  0.25
1         Pizza Place  0.25
2  Chinese Restaurant  0.25


----Anjou West----
          venue  freq
0  Burger Joint  0.33
1    Sports Bar  0.33
2           Gym  0.33


----Auteuil Northeast----
                 venue  freq
0              Stables   1.0
1          Record Shop   0.0
2  Monument / L

In [40]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

import numpy as np

In [41]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = mon_grouped['Neighborhood']

for ind in np.arange(mon_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(mon_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Ahuntsic Central,Breakfast Spot,Ice Cream Shop,Coffee Shop,Pharmacy,Restaurant,Café,Sushi Restaurant,Bar,Italian Restaurant,Plaza
1,Ahuntsic East,Playground,Athletics & Sports,Yoga Studio,Discount Store,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store
2,Ahuntsic North,Soccer Field,Yoga Studio,College Science Building,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore
3,Ahuntsic Southeast,Italian Restaurant,Women's Store,Clothing Store,Sandwich Place,Bank,Dog Run,Farm,Factory,Event Space,Escape Room
4,Ahuntsic Southwest,Home Service,Ice Cream Shop,Pizza Place,Bakery,Park,Yoga Studio,Dog Run,Farm,Factory,Event Space
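
Because most categories have a frequency of 0 in any one neighbourhood, taking the top ten pads sparse neighbourhoods with categories that do not occur there at all: Ahuntsic North has a single venue (Soccer Field, frequency 1.0), yet Yoga Studio and Farm are listed as its 2nd and 4th most common venues. One possible variant, sketched here with a hypothetical helper `top_venues_nonzero`, reports only categories that actually occur and fills the rest with `None`:

```python
import pandas as pd


def top_venues_nonzero(freqs: pd.Series, num_top_venues: int) -> list:
    """Return up to num_top_venues category names with frequency > 0, padded with None."""
    nonzero = freqs[freqs > 0].sort_values(ascending=False)
    top = nonzero.index.tolist()[:num_top_venues]
    return top + [None] * (num_top_venues - len(top))


# A sparse row like Ahuntsic North, and a denser invented row.
sparse_row = pd.Series({"ATM": 0.0, "Soccer Field": 1.0, "Yoga Studio": 0.0})
dense_row = pd.Series({"Bar": 0.2, "Park": 0.5, "Gym": 0.3})

print(top_venues_nonzero(sparse_row, 3))
print(top_venues_nonzero(dense_row, 2))
```

This only changes the summary table; the clustering itself uses the full frequency matrix and is unaffected.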


In [44]:
kclusters = 5

mon_grouped_clustering = mon_grouped.drop('Neighborhood', axis=1)

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mon_grouped_clustering)

kmeans.labels_[0:10]


array([2, 0, 2, 2, 1, 2, 2, 2, 2, 2], dtype=int32)
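
The value `kclusters = 5` is set by hand. One common way to sanity-check such a choice is to compare the silhouette score across several values of k. The sketch below runs on synthetic data with three obvious groups, not on the Montreal features; to apply it here, `X_demo` would be replaced by `mon_grouped_clustering`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.1, size=(20, 2)) for c in centers])

# Silhouette score for each candidate k (higher means better-separated clusters).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On real, sparse venue-frequency data the scores are usually much closer together, so this is a guide rather than a rule.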

In [47]:
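# Drop labels left over from a previous run; errors='ignore' lets the cell run on a fresh kernel too.
neighborhoods_venues_sorted.drop(['Cluster Labels'], axis=1, inplace=True, errors='ignore')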


neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

mon_merged = df_data_1

mon_merged = mon_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

mon_merged

Unnamed: 0.1,Unnamed: 0,Postal Code,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,H1A,Pointe-aux-Trembles,45.4543,-73.483774,,,,,,,,,,,
1,1,H2A,Saint-Michel East,45.561567,-73.601288,2.0,Fish & Chips Shop,Gas Station,Metro Station,Flea Market,Chinese Restaurant,Factory,Event Space,Escape Room,English Restaurant,Electronics Store
2,2,H3A,Downtown Montreal North,45.506992,-73.568941,2.0,Coffee Shop,Concert Hall,Hotel,Sushi Restaurant,Theater,Jazz Club,Pizza Place,Pharmacy,French Restaurant,Bank
3,3,H4A,Notre-Dame-de-Grâce Northeast,45.476496,-73.622168,2.0,Coffee Shop,Pizza Place,Bakery,Yoga Studio,Park,Café,Seafood Restaurant,Sandwich Place,Restaurant,Pub
4,4,H5A,Place Bonaventure,45.500795,-73.56523,2.0,Coffee Shop,Sandwich Place,Restaurant,Café,Hotel,Bank,Breakfast Spot,Fast Food Restaurant,Chinese Restaurant,Juice Bar
5,5,H7A,Duvernay-Est,45.661552,-73.593811,3.0,Park,Yoga Studio,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore,Donut Shop
6,6,H9A,Dollard-des-Ormeaux Northwest,45.494021,-73.834634,,,,,,,,,,,
7,7,H1B,Montreal East,45.630963,-73.507206,,,,,,,,,,,
8,8,H2B,Ahuntsic North,45.569708,-73.644639,2.0,Soccer Field,Yoga Studio,College Science Building,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore
9,9,H3B,Downtown Montreal East,45.499745,-73.569435,2.0,Coffee Shop,Hotel,Restaurant,Juice Bar,Cocktail Bar,Mexican Restaurant,Liquor Store,Cosmetics Shop,Market,Skating Rink


In [48]:
mon_merged.isnull().sum()

Unnamed: 0                 0
Postal Code                0
Neighborhood               0
Latitude                   0
Longitude                  0
Cluster Labels            15
1st Most Common Venue     15
2nd Most Common Venue     15
3rd Most Common Venue     15
4th Most Common Venue     15
5th Most Common Venue     15
6th Most Common Venue     15
7th Most Common Venue     15
8th Most Common Venue     15
9th Most Common Venue     15
10th Most Common Venue    15
dtype: int64
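
The 15 missing values per column most likely belong to neighbourhoods for which the venue search returned nothing, so they never reach `mon_grouped` and have no cluster label to join (Pointe-aux-Trembles is one such row in the table above). Before dropping them, it can help to list exactly which rows failed to match. A small sketch using `merge` with `indicator=True`, on a three-row excerpt with one invented match table:

```python
import pandas as pd

# Three neighbourhoods from the table above.
neighbourhoods = pd.DataFrame({
    "Neighborhood": ["Pointe-aux-Trembles", "Saint-Michel East", "Montreal East"],
})

# Only one of them has a cluster label in this excerpt.
clustered = pd.DataFrame({"Neighborhood": ["Saint-Michel East"], "Cluster Labels": [2]})

# indicator=True adds a "_merge" column saying where each row was found.
merged = neighbourhoods.merge(clustered, on="Neighborhood", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Neighborhood"].tolist()
print(unmatched)
```

Knowing which neighbourhoods drop out matters for the final recommendation, since they cannot be offered to any client profile.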

In [49]:
mon_merged.dropna(inplace=True)

In [50]:
mon_merged.isnull().sum()

Unnamed: 0                0
Postal Code               0
Neighborhood              0
Latitude                  0
Longitude                 0
Cluster Labels            0
1st Most Common Venue     0
2nd Most Common Venue     0
3rd Most Common Venue     0
4th Most Common Venue     0
5th Most Common Venue     0
6th Most Common Venue     0
7th Most Common Venue     0
8th Most Common Venue     0
9th Most Common Venue     0
10th Most Common Venue    0
dtype: int64

In [51]:
mon_merged['Cluster Labels'] = mon_merged['Cluster Labels'].astype(int)

In [52]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [53]:
map_clusters = folium.Map(location=[45.5017, -73.5673], zoom_start=11)


# One colour per cluster, evenly spaced along the rainbow colormap.
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]


for lat, lon, poi, cluster in zip(mon_merged['Latitude'], mon_merged['Longitude'], mon_merged['Neighborhood'], mon_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # Cluster labels run from 0 to kclusters - 1, so they index the colour list directly.
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## The Most Common Venues in each Neighbourhood and Cluster are in the tables below:

In [54]:
mon_merged.loc[mon_merged['Cluster Labels'] == 0, mon_merged.columns[[1] + list(range(5, mon_merged.shape[1]))]]

Unnamed: 0,Postal Code,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
58,H2M,0,Playground,Athletics & Sports,Yoga Studio,Discount Store,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store
108,H3X,0,Playground,Convenience Store,Grocery Store,Yoga Studio,Dog Run,Farm,Factory,Event Space,Escape Room,English Restaurant


In [55]:
mon_merged.loc[mon_merged['Cluster Labels'] == 1, mon_merged.columns[[1] + list(range(5, mon_merged.shape[1]))]]

Unnamed: 0,Postal Code,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,H5B,1,Home Service,Music Venue,Drugstore,Yoga Studio,Dog Run,Farm,Factory,Event Space,Escape Room,English Restaurant
18,H7C,1,Breakfast Spot,Italian Restaurant,Park,Cosmetics Shop,Donut Shop,Fast Food Restaurant,Farm,Factory,Event Space,Escape Room
21,H2E,1,Bakery,Park,Deli / Bodega,Gas Station,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store
31,H9G,1,Breakfast Spot,Playground,Park,Dog Run,Fast Food Restaurant,Farm,Factory,Event Space,Escape Room,English Restaurant
40,H2J,1,Park,Karaoke Bar,Supermarket,Bike Rental / Bike Share,Dog Run,Brewery,Drugstore,Fast Food Restaurant,Farm,Factory
53,H3L,1,Home Service,Ice Cream Shop,Pizza Place,Bakery,Park,Yoga Studio,Dog Run,Farm,Factory,Event Space
57,H1M,1,Pharmacy,Park,Brewery,Gas Station,Yoga Studio,Dog Run,Farm,Factory,Event Space,Escape Room
64,H3N,1,Bakery,Park,Hockey Arena,Music Store,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store
74,H9P,1,Park,Convenience Store,Creperie,Big Box Store,Dog Run,Farm,Factory,Event Space,Escape Room,English Restaurant
81,H9R,1,Convenience Store,Yoga Studio,Dog Run,Fast Food Restaurant,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store


In [56]:
mon_merged.loc[mon_merged['Cluster Labels'] == 2, mon_merged.columns[[1] + list(range(5, mon_merged.shape[1]))]]

Unnamed: 0,Postal Code,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,H2A,2,Fish & Chips Shop,Gas Station,Metro Station,Flea Market,Chinese Restaurant,Factory,Event Space,Escape Room,English Restaurant,Electronics Store
2,H3A,2,Coffee Shop,Concert Hall,Hotel,Sushi Restaurant,Theater,Jazz Club,Pizza Place,Pharmacy,French Restaurant,Bank
3,H4A,2,Coffee Shop,Pizza Place,Bakery,Yoga Studio,Park,Café,Seafood Restaurant,Sandwich Place,Restaurant,Pub
4,H5A,2,Coffee Shop,Sandwich Place,Restaurant,Café,Hotel,Bank,Breakfast Spot,Fast Food Restaurant,Chinese Restaurant,Juice Bar
8,H2B,2,Soccer Field,Yoga Studio,College Science Building,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore
9,H3B,2,Coffee Shop,Hotel,Restaurant,Juice Bar,Cocktail Bar,Mexican Restaurant,Liquor Store,Cosmetics Shop,Market,Skating Rink
10,H4B,2,Athletics & Sports,Grocery Store,Japanese Restaurant,Pizza Place,Café,Asian Restaurant,Korean Restaurant,Gym / Fitness Center,Escape Room,Dog Run
14,H1C,2,Coffee Shop,Business Service,Yoga Studio,Donut Shop,Fast Food Restaurant,Farm,Factory,Event Space,Escape Room,English Restaurant
15,H2C,2,Breakfast Spot,Ice Cream Shop,Coffee Shop,Pharmacy,Restaurant,Café,Sushi Restaurant,Bar,Italian Restaurant,Plaza
16,H3C,2,Ice Cream Shop,Grocery Store,Pharmacy,Garden Center,Yoga Studio,Discount Store,Factory,Event Space,Escape Room,English Restaurant


In [57]:
mon_merged.loc[mon_merged['Cluster Labels'] == 3, mon_merged.columns[[1] + list(range(5, mon_merged.shape[1]))]]

Unnamed: 0,Postal Code,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,H7A,3,Park,Yoga Studio,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore,Donut Shop
13,H9B,3,Park,Yoga Studio,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore,Donut Shop
42,H4J,3,Park,Pharmacy,Discount Store,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore
44,H9J,3,Park,Construction & Landscaping,Yoga Studio,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore
54,H4L,3,Park,Yoga Studio,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore,Donut Shop
65,H4N,3,Park,Yoga Studio,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore,Donut Shop
78,H4R,3,Park,Playground,Discount Store,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore
114,H3Y,3,Park,Scenic Lookout,Discount Store,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store,Drugstore


In [58]:
mon_merged.loc[mon_merged['Cluster Labels'] == 4, mon_merged.columns[[1] + list(range(5, mon_merged.shape[1]))]]

Unnamed: 0,Postal Code,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,H7B,4,Construction & Landscaping,Yoga Studio,Dog Run,Fast Food Restaurant,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store
19,H9C,4,Construction & Landscaping,Boat or Ferry,Yoga Studio,Donut Shop,Fast Food Restaurant,Farm,Factory,Event Space,Escape Room,English Restaurant
48,H4K,4,Construction & Landscaping,Health & Beauty Service,Yoga Studio,Dog Run,Fast Food Restaurant,Farm,Factory,Event Space,Escape Room,English Restaurant
122,H8Z,4,Construction & Landscaping,Yoga Studio,Dog Run,Fast Food Restaurant,Farm,Factory,Event Space,Escape Room,English Restaurant,Electronics Store


### Importing customer dataset

In [59]:
!wget -O Cust_Segmentation.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%204/data/Cust_Segmentation.csv

--2021-06-23 17:42:30--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%204/data/Cust_Segmentation.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33426 (33K) [text/csv]
Saving to: ‘Cust_Segmentation.csv’


2021-06-23 17:42:31 (436 KB/s) - ‘Cust_Segmentation.csv’ saved [33426/33426]



In [60]:
cust_df = pd.read_csv("Cust_Segmentation.csv")
cust_df.head()

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,Address,DebtIncomeRatio
0,1,41,2,6,19,0.124,1.073,0.0,NBA001,6.3
1,2,47,1,26,100,4.582,8.218,0.0,NBA021,12.8
2,3,33,2,10,57,6.111,5.802,1.0,NBA013,20.9
3,4,29,2,4,19,0.681,0.516,0.0,NBA009,6.3
4,5,47,1,31,253,9.308,8.908,0.0,NBA008,7.2


In [72]:
df = cust_df.drop('Address', axis=1)
df.head()

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio
0,1,41,2,6,19,0.124,1.073,0.0,6.3
1,2,47,1,26,100,4.582,8.218,0.0,12.8
2,3,33,2,10,57,6.111,5.802,1.0,20.9
3,4,29,2,4,19,0.681,0.516,0.0,6.3
4,5,47,1,31,253,9.308,8.908,0.0,7.2


In [73]:
from sklearn.preprocessing import StandardScaler
X = df.values[:,1:]
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)
Clus_dataSet

array([[ 0.74291541,  0.31212243, -0.37878978, ..., -0.59048916,
        -0.52379654, -0.57652509],
       [ 1.48949049, -0.76634938,  2.5737211 , ...,  1.51296181,
        -0.52379654,  0.39138677],
       [-0.25251804,  0.31212243,  0.2117124 , ...,  0.80170393,
         1.90913822,  1.59755385],
       ...,
       [-1.24795149,  2.46906604, -1.26454304, ...,  0.03863257,
         1.90913822,  3.45892281],
       [-0.37694723, -0.76634938,  0.50696349, ..., -0.70147601,
        -0.52379654, -1.08281745],
       [ 2.1116364 , -0.76634938,  1.09746566, ...,  0.16463355,
        -0.52379654, -0.2340332 ]])
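
`StandardScaler` rescales each column to zero mean and unit variance, so that no single feature dominates the Euclidean distances that k-means relies on. Note that a later cell in this notebook fits the model on the raw `X` rather than on `Clus_dataSet`. The sketch below shows what fitting on the scaled matrix would look like, using four columns (Age, Edu, Years Employed, Income) of the five customers printed above; `n_clusters=2` is an arbitrary choice for so few rows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Age, Edu, Years Employed, Income for the five customers shown by cust_df.head().
X_small = np.array([
    [41, 2, 6, 19],
    [47, 1, 26, 100],
    [33, 2, 10, 57],
    [29, 2, 4, 19],
    [47, 1, 31, 253],
], dtype=float)

# After scaling, every column has mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X_small)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))

# Fit on the scaled matrix so Income does not outweigh the other columns.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(model.labels_)
```

Passing `random_state` makes the labels reproducible between runs, which is useful when later text refers to clusters by number.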

In [63]:
clusterNum = 3
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
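# Note: the model is fitted on the raw X, not on the standardised Clus_dataSet,
# so columns with larger values, such as Income, weigh more heavily in the distances.
k_means.fit(X)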
labels = k_means.labels_
print(labels)

[1 0 1 1 2 0 1 0 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 1
 1 1 0 1 0 1 2 1 0 1 1 1 0 0 1 1 0 0 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 0 0 1
 1 1 1 1 0 1 0 0 2 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1
 1 1 1 1 1 1 0 1 0 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1
 1 1 1 1 0 1 1 0 1 0 1 1 0 2 1 0 1 1 1 1 1 1 2 0 1 1 1 1 0 1 1 0 0 1 0 1 0
 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 2 0 1 1 1 1 1 1 1 0 1 1 1 1
 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 0 1 1 1 1 1 1
 1 1 1 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 0 1 0 0 1
 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 2 1 1 1 0 1 0 0 0 1
 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 2
 1 1 1 1 1 1 0 1 1 1 2 1 1 1 1 0 1 2 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1 1 1 1 1
 1 0 1 1 1 1 0 1 1 1 0 1 

In [64]:
df["Clus_km"] = labels
df.head(5)

Unnamed: 0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio,Clus_km
0,1,41,2,6,19,0.124,1.073,0.0,6.3,1
1,2,47,1,26,100,4.582,8.218,0.0,12.8,0
2,3,33,2,10,57,6.111,5.802,1.0,20.9,1
3,4,29,2,4,19,0.681,0.516,0.0,6.3,1
4,5,47,1,31,253,9.308,8.908,0.0,7.2,2


In [65]:
df.groupby('Clus_km').mean()

Unnamed: 0_level_0,Customer Id,Age,Edu,Years Employed,Income,Card Debt,Other Debt,Defaulted,DebtIncomeRatio
Clus_km,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,402.295082,41.333333,1.956284,15.256831,83.928962,3.103639,5.765279,0.171233,10.72459
1,432.468413,32.964561,1.614792,6.374422,31.164869,1.032541,2.104133,0.285185,10.094761
2,410.166667,45.388889,2.666667,19.555556,227.166667,5.678444,10.907167,0.285714,7.322222


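Based on the cluster means above, the customers can be classified as follows:

Type A (Cluster 0): Middle-aged (mean age about 41), upper-middle income (mean income about 84)

Type B (Cluster 1): Young (mean age about 33), upwardly mobile (mean income about 31, the lowest of the three)

Type C (Cluster 2): Older (mean age about 45), high income (mean income about 227)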
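
The introduction notes that each new client can be assigned to a cluster from their profile. With scikit-learn this is a call to `predict` on the fitted model, using the same scaler that was fitted on the existing clients. The sketch below uses invented (Age, Income) values rather than rows from `Cust_Segmentation.csv`, and two clusters instead of three, purely to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical (Age, Income) rows for existing clients; not taken from the dataset.
existing = np.array([
    [25, 30], [28, 35], [30, 32],    # younger, lower income
    [45, 90], [47, 100], [50, 85],   # older, higher income
], dtype=float)

# Fit the scaler and the model on the existing clients only.
scaler = StandardScaler().fit(existing)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(existing))

# A new client is scaled with the same scaler, then assigned to the nearest centroid.
new_client = np.array([[29, 33]], dtype=float)
new_label = int(model.predict(scaler.transform(new_client))[0])
print("New client belongs to cluster", new_label)
```

The new client lands in the same cluster as the younger, lower-income rows, and the neighbourhood clusters matched to that profile could then be recommended.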

#### Thank you for your time. Please refer to the report for results and observations.