# Customer clustering using K-Means

<img src="https://github.com/retkowsky/images/blob/master/AzureMLservicebanniere.png?raw=true">

> Author: Serge Retkowsky Microsoft<br>
> Date: 03-Sept-2020

## Description
In this notebook, we’ll use k-means clustering to segment customers into distinct groups based on their purchasing habits. k-means is an unsupervised learning technique, so it does not need a target variable. All we need is to format the data in a way the algorithm can process, and we’ll let it determine the customer segments, or clusters. This makes k-means clustering well suited to exploratory analysis and a good jumping-off point for more detailed analysis.

## Objectives
The k-means clustering algorithm groups similar observations together, using Euclidean distance as its measure of similarity. The practitioner selects the number of clusters (k), and the algorithm iteratively looks for the centroids that best represent those clusters. The practitioner can then examine the clusters to determine which characteristics their members share. For customers, these would be their buying preferences.
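
To make the assign-then-update loop concrete, here is a minimal NumPy sketch of the idea. It is not the scikit-learn implementation used later in this notebook: the function name `kmeans_sketch`, the plain random initialisation, and the tiny data set are all assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Toy k-means loop: assign points to the nearest centroid, then move the centroids."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points picked at random (not k-means++).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; keep it in place if it has none.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated, made-up groups of points.
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
print(centroids_demo)
```

The first three points should end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random start.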

## Steps
1. Load the data
2. Apply a k-means model to cluster the customer database
3. Save the results to an Azure ML experiment

<img src="https://github.com/retkowsky/images/blob/master/clusteringgraph.jpg?raw=true">

## 0. Settings

In [1]:
import sys
print('You are using Python ', sys.version)

You are using Python  3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) 
[GCC 7.3.0]


In [2]:
import datetime
now = datetime.datetime.now()
print('Today is', now)

Today is 2020-09-03 10:32:11.268281


In [3]:
import azureml.core
print("You are using Azure ML", azureml.core.VERSION)

You are using Azure ML 1.13.0


In [4]:
import logging
import os
import random

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace

### Connection to your Azure ML Workspace

In [5]:
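import os
from azureml.core import Workspace

# Read the workspace coordinates from the environment, falling back to the defaults below.
subscription_id = os.environ.get("SUBSCRIPTION_ID", "70b8f39e-8863-49f7-b6ba-34a80799550c")
resource_group = os.environ.get("RESOURCE_GROUP", "azuremlsynapse-rg")
workspace_name = os.environ.get("WORKSPACE_NAME", "azuremlsynapse")

try:
    ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)
    # Save the connection details locally so that Workspace.from_config() can find them later.
    ws.write_config()
    print("OK")
except Exception as err:
    # Report the underlying error instead of silently swallowing it.
    print("Error: Workspace not found:", err)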

OK


In [6]:
from azureml.core import Workspace

try:
   ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
   ws.write_config()
   print("Workspace is available")
except Exception as err:
   print("No workspace:", err)

Workspace is available


In [7]:
ws = Workspace.from_config()

experiment = Experiment(workspace=ws, name='KMeansClustering')

output = {}
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

| Property | Value |
|---|---|
| Workspace | azuremlsynapse |
| Resource Group | azuremlsynapse-rg |
| Location | westeurope |
| Experiment Name | KMeansClustering |


In [8]:
run = experiment.start_logging(snapshot_directory=None)

## 1. Data Preparation

In [9]:
from azureml.core import Dataset

# Reuse the workspace connection (ws) created above to fetch the registered dataset.
dataset = Dataset.get_by_name(ws, name='Clients')
df = dataset.to_pandas_dataframe()

In [10]:
# Drop the text columns (address, city, first name, last name): k-means needs numeric features.
df = df.drop(columns=['Adresse', 'Commune', 'Prenom', 'Nom'])

In [11]:
df.head()

| | CodeClient | CodePostal | Klout | Points | AppMobile | Newsletter | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 495501 | 24100 | 15.0 | 73.0 | 0.0 | 0.0 | 44.854218 | 0.485675 |
| 1 | 918238 | 90300 | 1.0 | 55.0 | 0.0 | 0.0 | 47.670783 | 6.839947 |
| 2 | 1419459 | 94270 | 20.0 | 37.0 | 0.0 | 0.0 | 48.808894 | 2.359035 |
| 3 | 1470729 | 59495 | 52.0 | 46.0 | 0.0 | 0.0 | 51.05383 | 2.440681 |
| 4 | 1470850 | 62100 | 36.0 | 26.0 | 0.0 | 0.0 | 50.946969 | 1.83164 |
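
Because k-means relies on Euclidean distance, columns with large numeric ranges (for example identifiers such as `CodeClient`) can dominate columns such as `Newsletter`, which only take the values 0 and 1. This notebook fits the model on the raw columns. The sketch below shows one optional alternative, standardising the features with scikit-learn's `StandardScaler`. The `toy` frame and its values are made up and only borrow the column names shown above; dropping the identifier column is an assumption, not something the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows that only mimic the column names of the customer table.
toy = pd.DataFrame({
    "CodeClient": [100001, 100002, 100003, 100004],
    "Klout": [15.0, 1.0, 20.0, 52.0],
    "Points": [73.0, 55.0, 37.0, 46.0],
    "Newsletter": [0.0, 1.0, 0.0, 1.0],
})

# Assumption: identifier columns carry no behavioural signal, so they are left out.
features = toy.drop(columns=["CodeClient"])

# Rescale every remaining column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(features),
    columns=features.columns,
    index=features.index,
)
print(scaled.round(2))
```

After scaling, every column contributes to the distance on a comparable footing, so the clusters are less likely to be driven by whichever column happens to have the largest values.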


## 2. K-Means modelling

In [12]:
from sklearn.cluster import KMeans
Classes=4

In [13]:
print("You want to use", Classes, 'clusters.')

You want to use 4 clusters.


In [14]:
run.log('Number of Clusters', Classes)

In [15]:
kmeans = KMeans(n_clusters=Classes)
kmeans.fit(df)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
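
The fitted model above uses `random_state=None`, so the cluster numbering and sizes may change from one run to the next. The sketch below, on synthetic data from `make_blobs` (a stand-in for the customer table, not the real data), fixes `random_state` for repeatable results and compares the inertia (within-cluster sum of squared distances) for several values of k. It is one common way to sanity-check the choice of 4 clusters, not the method used in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 groups, used only as a stand-in for the customer table.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

inertias = {}
for k in range(2, 8):
    # A fixed random_state makes the fit repeatable from run to run.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia generally decreases as k grows, so we look for the point where adding another cluster stops bringing a large improvement (the "elbow") rather than for the minimum.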

### Let's assign clusters

In [16]:
Clusters = kmeans.labels_.tolist()   

In [17]:
Clusters = pd.DataFrame(Clusters)

In [18]:
Clusters.columns = ['Cluster'] 

In [19]:
Clusters['Cluster'] = Clusters['Cluster'].astype(str)

### Clusters distribution

In [20]:
Clusters.Cluster.value_counts()

0    326
2    247
3    111
1    105
Name: Cluster, dtype: int64
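
Cluster sizes alone do not say what distinguishes each segment. A possible next step, sketched here on a small made-up frame (the `toy` values are invented and only reuse two column names from the customer table), is to attach the labels to the data and compare per-cluster averages.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up customers: three with low scores, three with high scores.
toy = pd.DataFrame({
    "Klout":  [10.0, 12.0, 11.0, 60.0, 62.0, 58.0],
    "Points": [20.0, 22.0, 19.0, 80.0, 85.0, 78.0],
})

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Attach the cluster label to each row, then average every feature per cluster.
profiled = toy.assign(Cluster=model.labels_)
profile = profiled.groupby("Cluster").mean()
sizes = profiled["Cluster"].value_counts()

print(profile)
print(sizes)
```

Each row of `profile` describes the "average customer" of one cluster, which is usually easier to interpret than the raw labels.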

In [21]:
labels = kmeans.predict(df)
centroids = kmeans.cluster_centers_

### Let's compute the silhouette index

In [24]:
from sklearn.metrics import silhouette_score

silhouette=silhouette_score(df, kmeans.labels_)
print("Silhouette index =", silhouette)

Silhouette index = 0.629169364806592


> **Silhouette analysis** refers to a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It can be used to study the separation distance between the resulting clusters.
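
The silhouette score ranges from -1 to 1, and higher values indicate better-separated clusters. As a rough sketch of how it can help compare several candidate values of k, the example below uses synthetic data from `make_blobs` (an assumption standing in for the customer table) and keeps the k with the highest score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data, used only as a stand-in for the customer table.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette over all samples for this value of k.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, value in scores.items():
    print(k, round(value, 3))
print("Best k by silhouette:", best_k)
```

The silhouette score is only one indicator, so the value of k it suggests is worth checking against business knowledge of the customer base.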

## 3. Export to the Azure ML experiment

In [25]:
file_namecsv = 'Clusters.csv'
Clusters.to_csv(file_namecsv, sep=',')
run.upload_file(name=file_namecsv, path_or_stream=file_namecsv)

<azureml._restclient.models.batch_artifact_content_information_dto.BatchArtifactContentInformationDto at 0x7faf801f29e8>

In [26]:
run.log("Silhouette", silhouette)

In [27]:
experiment

| Name | Workspace | Report Page | Docs Page |
|---|---|---|---|
| KMeansClustering | azuremlsynapse | Link to Azure Machine Learning studio | Link to Documentation |


### Overview of the experiment

<img src="https://github.com/retkowsky/images/blob/master/kmeans1.jpg?raw=true">

<img src="https://github.com/retkowsky/images/blob/master/kmeans2.jpg?raw=true">

In [28]:
run.complete()

> End of notebook