## K-Means Assumptions
- Variables have symmetric (non-skewed) distributions
- All variables are standardized to the same scale (equal mean and variance)

## Sequence of Pre-Processing Pipeline
1. Apply a log transformation to unskew the data
2. Standardize the variables to the same mean and standard deviation
3. Store the result as a separate array used for clustering, so the original values remain available for computing cluster statistics afterwards
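
The three steps above can be sketched on a small table. This is a minimal illustration, not the notebook's actual data: the `raw` DataFrame and its values are hypothetical, and the standardization uses the population standard deviation (`ddof=0`) as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy RFM table; the values are illustrative only
raw = pd.DataFrame(
    {"Recency": [1, 10, 100, 300],
     "Frequency": [50, 12, 3, 1],
     "MonetaryValue": [2000.0, 450.0, 60.0, 15.0]},
    index=pd.Index([101, 102, 103, 104], name="CustomerID"),
)

# 1. Log transformation to reduce skew (all values here are positive)
logged = np.log(raw)

# 2. Standardize each column to mean 0 and standard deviation 1
#    (population standard deviation, ddof=0, chosen as an assumption)
processed = (logged - logged.mean()) / logged.std(ddof=0)

# 3. `processed` is a separate object used for clustering; `raw` is untouched,
#    so cluster summaries can later be computed in the original units
print(processed.round(2))
```

Keeping `raw` and `processed` apart means cluster labels fitted on `processed` can be attached to `raw` by index when summarizing segments.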

### Inspect the Data

In [None]:
# Print the average values of the variables in the dataset
datamart.mean()
# Print the standard deviation of the variables in the dataset
datamart.std()
# Get the key statistics of the dataset
datamart.describe()

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
# Plot the distributions of Recency and Frequency
# (sns.histplot with kde=True replaces the deprecated sns.distplot)
plt.subplot(2, 1, 1)  # first of two stacked subplots in one figure
sns.histplot(datamart['Recency'], kde=True)
plt.subplot(2, 1, 2)  # second subplot
sns.histplot(datamart['Frequency'], kde=True)
plt.show()

### Deal with Skewed Data
Skewed variables should be made more symmetric before clustering. A logarithmic transformation un-skews right-skewed distributions, but it applies to positive values only.

In [None]:
import numpy as np
frequency_log = np.log(datamart['Frequency']) # Apply log transformation
sns.histplot(frequency_log, kde=True) # Plot the distribution (histplot replaces the deprecated distplot)
plt.show()

The result may not be perfectly symmetrical, but it should show very little skewness.
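
One way to check the effect numerically, rather than by eye, is to compare the skewness before and after the transformation with pandas' `Series.skew()`. The sketch below uses hypothetical, exponentially spaced values in place of the real `Frequency` column.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (exponentially spaced), for illustration only
frequency = pd.Series(np.exp(np.linspace(0, 5, 50)))

# Apply the log transformation
frequency_log = np.log(frequency)

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = frequency.skew()
skew_after = frequency_log.skew()
print(round(skew_before, 2), round(skew_after, 2))
```

On real data the skewness after the transformation will rarely be exactly zero; a value much closer to zero than before is usually what one looks for.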

#### For negative values
- Add a constant before the log transformation so that the smallest value becomes 1
- Use a cube root transformation
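
Both options can be sketched with NumPy. The `values` array is hypothetical; `np.cbrt` is NumPy's cube root, which is defined for negative inputs.

```python
import numpy as np

# Hypothetical values including zero and negatives, for illustration only
values = np.array([-8.0, -1.0, 0.0, 1.0, 27.0])

# Option 1: shift the data so the smallest value becomes 1, then take the log
shifted = values - values.min() + 1
log_shifted = np.log(shifted)

# Option 2: cube root, which works for negative numbers without shifting
cube_root = np.cbrt(values)

print(log_shifted.round(2))
print(cube_root)
```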

### Center and Scaling

In [None]:
datamart_rfm.describe()

In [None]:
# Normalize the data by applying both centering and scaling
data_normalized = (data - data.mean()) / data.std()

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # Initialize a scaler
scaler.fit(data) # Fit the scaler
data_normalized = scaler.transform(data) # Scale and center the data
data_normalized = pd.DataFrame(data_normalized, index=data.index, columns=data.columns) # Create a pandas DataFrame

In [None]:
print(data_normalized.describe().round(2)) # Print summary statistics

In [None]:
datamart_normalized.mean(axis=0)
datamart_normalized.std(axis=0)
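
A detail worth keeping in mind when comparing the manual formula with `StandardScaler`: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` scales by the population standard deviation (`ddof=0`). The sketch below reproduces both conventions with pandas only, on a hypothetical two-column table, to show that the results differ by a constant factor.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
data = pd.DataFrame({"Recency": [1.0, 2.0, 3.0, 4.0],
                     "Frequency": [10.0, 20.0, 40.0, 50.0]})

# Manual formula with pandas defaults: sample standard deviation (ddof=1)
z_sample = (data - data.mean()) / data.std()

# Same formula with the population standard deviation (ddof=0),
# the convention StandardScaler follows
z_population = (data - data.mean()) / data.std(ddof=0)

# The two results differ by the factor sqrt(n / (n - 1))
ratio = (z_population / z_sample).iloc[0, 0]
print(ratio)
```

For large datasets the factor is close to 1, so the choice rarely changes the clustering, but it explains why `describe()` may report a standard deviation slightly different from 1 after `StandardScaler`.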

## Fit KMeans

In [None]:
from sklearn.cluster import KMeans
# initialize kmeans
kmeans = KMeans(n_clusters=2, random_state=1)

In [None]:
# compute k-means clustering
kmeans.fit(datamart_normalized)

In [None]:
# extract cluster labels
cluster_labels = kmeans.labels_

How clusters vary from one another

In [None]:
# create a cluster label column in the original dataframe
datamart_rfm_k2 = datamart_rfm.assign(Cluster=cluster_labels)

In [None]:
# calculate average rfm values and size for each cluster
datamart_rfm_k2.groupby(['Cluster']).agg({
    'Recency':'mean',
    'Frequency':'mean',
    'MonetaryValue':['mean','count']
}).round(0)

## Evaluating KMeans
#### Visual methods - elbow criterion
Plot the number of clusters against the within-cluster sum of squared errors (SSE).

#### The elbow represents an optimal number of clusters

In [2]:
from sklearn.cluster import KMeans

sse = {}
# Compute SSE for each k
for k in range(1, 11):
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)
    # Fit KMeans on the normalized dataset
    kmeans.fit(data_normalized)
    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_

In [None]:
# Elbow Criterion Chart
plt.title('The Elbow Method')
plt.xlabel('k'); plt.ylabel('SSE')
# Plot SSE values for each key in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()

- It is best to choose the point at the elbow, or the next point
- Use the elbow as a guide, but test multiple solutions
- Compare the solutions against each other and choose the one that makes the most business sense

## Interpretation of Clusters

1. Summary Statistics for each cluster
2. Snake plots
3. Relative importance of cluster attributes compared to population

Snake plots visualize the cluster averages of each attribute on a line plot.

In [None]:
# Transform datamart_normalized into a DataFrame and add a Cluster column
# (assumes datamart_rfm_k3 was built like datamart_rfm_k2, from a 3-cluster model)
datamart_normalized = pd.DataFrame(datamart_normalized,
                                  index=datamart_rfm.index,
                                  columns=datamart_rfm.columns)
datamart_normalized['Cluster'] = datamart_rfm_k3['Cluster']

In [None]:
# Melt the data into long format so RFM values and metric names are stored in 1 column each
datamart_melt = pd.melt(datamart_normalized.reset_index(),
                       id_vars=['CustomerID','Cluster'],
                       value_vars=['Recency','Frequency','MonetaryValue'],
                       var_name='Attribute',  # column holding the metric names
                       value_name='Value')    # column holding the metric values

In [None]:
plt.title('Snake Plot of Standardized Variables')
# Plot a line for each value of the cluster variable
sns.lineplot(x='Attribute', y='Value',
            hue='Cluster', data=datamart_melt)

Relative Importance of Segment Attributes

1. Calculate the average values of each cluster
2. Calculate the average values of the population
3. Calculate the importance score by dividing the cluster average by the population average and subtracting 1

In [None]:
# Calculate average RFM values for each cluster
cluster_avg = datamart_rfm_k3.groupby(['Cluster']).mean()

In [None]:
# Calculate average RFM values for the total customer population
population_avg = datamart_rfm.mean()

In [None]:
# Calculate relative importance of cluster's attribute value compared to population
relative_imp = cluster_avg/population_avg - 1

The result is a relative importance score for each RFM value of the segments. The further that ratio is from zero, the more important that attribute is for a segment relative to the total population.
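
A small worked example may make the arithmetic concrete. The table below is hypothetical (two clusters, two attributes); the calculation follows the same three steps.

```python
import pandas as pd

# Hypothetical data with cluster labels already assigned
datamart = pd.DataFrame({
    "Recency": [10, 20, 90, 80],
    "Frequency": [8, 12, 2, 2],
    "Cluster": [0, 0, 1, 1],
})

# Step 1: average values of each cluster
cluster_avg = datamart.groupby("Cluster").mean()

# Step 2: average values of the whole population (without the label column)
population_avg = datamart.drop(columns="Cluster").mean()

# Step 3: ratio of cluster average to population average, minus 1
relative_imp = cluster_avg / population_avg - 1
print(relative_imp.round(2))
```

Here cluster 0 has an average Recency of 15 against a population average of 50, giving 15 / 50 - 1 = -0.7: its customers purchased far more recently than the typical customer.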

In [None]:
relative_imp.round(2)

In [None]:
plt.figure(figsize=(8,2))
plt.title('Relative Importance of Attributes')
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
plt.show()

#### Mathematical methods - silhouette coefficient
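
As a rough sketch of what the silhouette coefficient measures: for each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to the points of any other cluster, and the score is `(b - a) / max(a, b)`. Values near 1 suggest well-separated clusters, and values below 0 suggest points assigned to the wrong cluster. The helper `silhouette_coefficient` below is a hypothetical plain-NumPy illustration on made-up points; in practice scikit-learn provides `sklearn.metrics.silhouette_score`.

```python
import numpy as np

def silhouette_coefficient(X, labels):
    """Mean silhouette over all points (Euclidean distance, at least 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical points forming two obvious groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_coefficient(X, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_coefficient(X, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 2), round(bad, 2))
```

Comparing the score across several values of k gives a numerical complement to the elbow chart.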