## Unsupervised Learning using K-means Clustering

An example from the book *Data Smart* by John Foreman. Given a set of transactions on wine lots, the goal is to recognise customer clusters and their buying patterns.

In [7]:
import os
import pandas as pd
from sklearn import metrics
from sklearn.cluster import KMeans

PATH = "./"

def load_wine_data(path):
    offers = os.path.join(path, "WineKMC-OfferInfo.csv")
    trans = os.path.join(path, "WineKMC-Trans.csv")
    wine_offers = pd.read_csv(offers)
    wine_trans = pd.read_csv(trans)
    return wine_offers, wine_trans


Pivot the customer transaction table into a 0/1 matrix with one row per offer and one column per customer, so that K-means can measure distances between the offers based on which customers took them.

In [8]:
def create_distance_map(wine_trans):
    # One column per customer (sorted by last name); rows are added per offer below.
    customers = wine_trans['Customer Last Name'].drop_duplicates().sort_values()
    distance_map = pd.DataFrame(columns=pd.Index(customers, name='Customer Last Name'))

    # Mark a 1 wherever a customer took an offer. DataFrame.set_value was
    # removed in pandas 1.0, so assign through .loc instead.
    for _, row in wine_trans.sort_values(['Offer #']).iterrows():
        distance_map.loc[row['Offer #'], row['Customer Last Name']] = 1

    # Offers a customer did not take become 0.
    return distance_map.fillna(0).astype(int)
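
The same offer-by-customer matrix can probably be built without an explicit loop. The following is a minimal sketch, assuming the transactions carry the `Offer #` and `Customer Last Name` columns used above. The function name `create_distance_map_crosstab` and the toy transactions are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def create_distance_map_crosstab(wine_trans):
    # Hypothetical alternative to create_distance_map: count transactions per
    # (offer, customer) pair, then clip so repeat purchases still map to 1.
    counts = pd.crosstab(wine_trans['Offer #'], wine_trans['Customer Last Name'])
    return counts.clip(upper=1)

# Toy transactions (invented for illustration, not the WineKMC data).
toy_trans = pd.DataFrame({
    'Customer Last Name': ['Smith', 'Smith', 'Jones', 'Jones'],
    'Offer #': [2, 1, 2, 2],
})
toy_map = create_distance_map_crosstab(toy_trans)
print(toy_map)
```

`pd.crosstab` sorts both the offers and the customer names, so the layout should match the loop-based version whenever every customer appears in at least one transaction.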

In [9]:
wine_offers, wine_trans = load_wine_data(PATH)
wine_offers

Unnamed: 0,Offer #,Campaign,Varietal,Minimum Qty (kg),Discount (%),Origin,Past Peak
0,1,January,Malbec,72,56,France,False
1,2,January,Pinot Noir,72,17,France,False
2,3,February,Espumante,144,32,Oregon,True
3,4,February,Champagne,72,48,France,True
4,5,February,Cabernet Sauvignon,144,44,New Zealand,True
5,6,March,Prosecco,144,86,Chile,False
6,7,March,Prosecco,6,40,Australia,True
7,8,March,Espumante,6,45,South Africa,False
8,9,April,Chardonnay,144,57,Chile,False
9,10,April,Prosecco,72,52,California,False


In [10]:
wine_offers, wine_trans = load_wine_data(PATH)
distance_map = create_distance_map(wine_trans)
distance_map

Customer Last Name,Adams,Allen,Anderson,Bailey,Baker,Barnes,Bell,Bennett,Brooks,Brown,...,Turner,Walker,Ward,Watson,White,Williams,Wilson,Wood,Wright,Young
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,1,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,1
7,0,0,0,1,1,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
8,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,1,0,0,0
9,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
10,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [11]:
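clusters = 4
# Each row of distance_map is one offer, so K-means assigns a cluster label to every offer.
estimator = KMeans(init='k-means++', n_clusters=clusters, n_init=1000, random_state=42, max_iter=10000)
estimator.fit(distance_map)
# labels_ follows the row order of distance_map (offers sorted by 'Offer #'),
# which is assumed to match the row order of wine_offers.
wine_offers['Cluster'] = estimator.labels_
wine_offers.sort_values(['Cluster'])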

Unnamed: 0,Offer #,Campaign,Varietal,Minimum Qty (kg),Discount (%),Origin,Past Peak,Cluster
25,26,October,Pinot Noir,144,83,Australia,False,0
1,2,January,Pinot Noir,72,17,France,False,0
23,24,September,Pinot Noir,6,34,Italy,False,0
16,17,July,Pinot Noir,12,47,Germany,False,0
12,13,May,Merlot,6,43,Chile,False,1
7,8,March,Espumante,6,45,South Africa,False,1
17,18,July,Espumante,6,50,Oregon,False,1
28,29,November,Pinot Grigio,6,87,France,False,1
29,30,December,Malbec,6,54,France,False,1
6,7,March,Prosecco,6,40,Australia,True,1
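
The number of clusters above is fixed by hand at 4. One way to sanity-check such a choice is to compare silhouette scores across several values of k, using the `metrics` module that the notebook imports but does not otherwise use. The sketch below runs on a synthetic 0/1 matrix rather than the wine data; the toy matrix, the range of k, and the smaller `n_init` are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

# Synthetic stand-in for the offer-by-customer matrix (not the wine data):
# three groups of 10 rows, each group mostly "buying" from its own 10 columns.
rng = np.random.default_rng(0)
blocks = []
for g in range(3):
    probs = np.full(30, 0.05)
    probs[g * 10:(g + 1) * 10] = 0.8
    blocks.append(rng.random((10, 30)) < probs)
toy_matrix = np.vstack(blocks).astype(int)

# Fit K-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    model = KMeans(init='k-means++', n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(toy_matrix)
    scores[k] = metrics.silhouette_score(toy_matrix, labels)

# Higher silhouette means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data the same loop could be run with `distance_map` in place of `toy_matrix`. The score is only a guide, and a value of k that gives readable clusters may still be preferable.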


#### Clusters by Customer Buying Behavior
- Cluster 0: Customers interested in Pinot Noir
- Cluster 1: Customers interested in offers with small lot sizes (6 kg)
- Cluster 3: Customers interested in Champagne from France
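
Descriptions like these come from reading the offer table one cluster at a time. A quick tally per cluster can support that reading. The sketch below uses only the ten rows for clusters 0 and 1 shown in the output above; the helper name `summarise_clusters` is hypothetical.

```python
import pandas as pd

def summarise_clusters(offers):
    # Hypothetical helper: per cluster, count the offers, find the most
    # common varietal, and take the median minimum quantity.
    return offers.groupby('Cluster').agg(
        n_offers=('Offer #', 'count'),
        top_varietal=('Varietal', lambda s: s.mode().iloc[0]),
        median_min_qty=('Minimum Qty (kg)', 'median'),
    )

# Rows copied from the clustered offer table above (clusters 0 and 1 only).
sample = pd.DataFrame({
    'Offer #': [26, 2, 24, 17, 13, 8, 18, 29, 30, 7],
    'Varietal': ['Pinot Noir', 'Pinot Noir', 'Pinot Noir', 'Pinot Noir', 'Merlot',
                 'Espumante', 'Espumante', 'Pinot Grigio', 'Malbec', 'Prosecco'],
    'Minimum Qty (kg)': [144, 72, 6, 12, 6, 6, 6, 6, 6, 6],
    'Cluster': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
summary = summarise_clusters(sample)
print(summary)
```

For these rows, cluster 0 is entirely Pinot Noir and every cluster 1 offer has a 6 kg minimum, which is consistent with the first two descriptions.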