# K-means clustering: Customer segmentation

The following example helps you understand the use of k-means clustering in market segmentation. 

Assume that you are working for a bank and have gathered relevant customer data in a file called customer_segmentation.csv (taken from Kaggle and modified). The dataset has the following columns:

- Customer Id: ID of the customer
- Age: Age of the customer
- Edu: Education level of the customer
- Income: Annual income in thousands of dollars

Given the data, you are asked what kinds of products (policies, loans, investment strategies, etc.) the bank should develop. Specific questions might include: should the bank develop one investment product for all its customers, or different products for different sets of customers? To answer such questions, you use k-means clustering to find distinct groups of customers that the bank can then address independently.

Here's a template that assigns several tasks for you and also provides a roadmap to guide you. Please note, you are free to follow your own coding style and name the variables the way you like.

## 1. Import the relevant libraries

## 2. Load the data

## 3. Explore the descriptive statistics

## 4. Data preprocessing

If necessary, we carry out preprocessing here. However, in this example, we don't need to do this. So, we can proceed ahead.

## 5. Perform K-means clustering 

### 5.1 Declare the inputs 

### 5.2 Feature scaling

Feature scaling is an important aspect of k-means clustering. Because the k-means algorithm is based on Euclidean distances between data points, features with large magnitudes dominate the distance and can lead to less meaningful clusters. Feature scaling is therefore generally performed before running k-means.
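As a rough illustration of why scaling matters, the sketch below standardizes a small made-up table with scikit-learn's `StandardScaler`. The values and the names `x` and `x_scaled` are assumptions for illustration only; they are not taken from customer_segmentation.csv, and standardization is just one of several possible scaling choices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows of (Age, Edu, Income); not the real dataset
x = np.array([
    [25.0, 1.0,  20.0],
    [40.0, 2.0,  60.0],
    [55.0, 3.0, 150.0],
    [33.0, 1.0,  35.0],
])

# Rescale each column to zero mean and unit variance
x_scaled = StandardScaler().fit_transform(x)

print(x.std(axis=0))         # raw spreads differ widely between columns
print(x_scaled.std(axis=0))  # after scaling, every column has the same spread
```

Without scaling, the Income column would contribute far more to the Euclidean distance than Edu simply because its numbers are larger.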

### 5.3 Build the k-means model and find out the optimal number of clusters based on the elbow method

In [None]:
from sklearn.cluster import KMeans

nmin = None  # TODO: minimum number of clusters you would like to investigate
nmax = None  # TODO: maximum number of clusters you would like to investigate (excluded by range)

wcss = []  # List to store the within-cluster sum of squares (WCSS) for each number of clusters

for i in range(nmin, nmax):
    kmeans = None     # TODO: instantiate a KMeans model with i clusters
    # TODO: use the fit method
    wcss_iter = None  # TODO: WCSS of the fitted model
    wcss.append(wcss_iter)

In [None]:
wcss

In [None]:
# Elbow method

number_clusters = range(nmin,nmax)
plt.plot(number_clusters,wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster Sum of Squares')
plt.show()

What value of "number of clusters" would you choose based on the elbow method?
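To get a feel for how an elbow shows up in the numbers, here is a minimal sketch on synthetic data, not on the customer data. It assumes three well-separated 2D blobs and uses the fitted model's `inertia_` attribute, which is scikit-learn's name for the WCSS; the names `points`, `wcss_demo`, and `drops` are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2D blobs (an assumption for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

ks = list(range(1, 7))
wcss_demo = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    wcss_demo.append(model.inertia_)  # inertia_ holds the WCSS of the fitted model

# Reduction in WCSS gained by adding one more cluster
drops = [wcss_demo[j] - wcss_demo[j + 1] for j in range(len(ks) - 1)]
for k, d in zip(ks[1:], drops):
    print(f"k={k}: WCSS falls by {d:.1f}")
```

On data like this, the reduction is large up to the true number of blobs and small afterwards; that change in slope is the "elbow". Real data is rarely this clean, so the choice usually involves some judgment.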

### 5.4 Clustering results based on the "optimal number of clusters" chosen from the elbow method

In [None]:
# Perform k-means clustering by choosing the optimal number of clusters
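One possible shape for this step is sketched below on made-up data. The column names follow the dataset description, but the values, the choice `k = 4`, and the names `demo` and `demo_with_clusters` are assumptions for illustration; the plotting cell that follows expects a dataframe with a `Cluster` column, which is what the last lines construct.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the customer table (not the real dataset)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Age": rng.integers(20, 65, size=60),
    "Edu": rng.integers(1, 5, size=60),
    "Income": rng.integers(15, 200, size=60),
})

k = 4  # placeholder; use the value suggested by your own elbow plot
scaled = StandardScaler().fit_transform(demo[["Age", "Edu", "Income"]])
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)

# Keep the original (unscaled) columns and attach the cluster label to each row
demo_with_clusters = demo.copy()
demo_with_clusters["Cluster"] = labels
print(demo_with_clusters.head())
```

Clustering on the scaled values while plotting the unscaled ones keeps the axes readable in the original units.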

In [None]:
%matplotlib notebook
from mpl_toolkits.mplot3d import Axes3D
 
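fig = plt.figure(figsize=(9, 9))
# Recent Matplotlib versions no longer attach Axes3D(fig) to the figure automatically,
# so create the 3D axes through the figure and set the viewing angle explicitly.
ax = fig.add_axes([0, 0, .95, 1], projection='3d')
ax.view_init(elev=48, azim=134)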

ax.scatter(data_with_clusters['Edu'], data_with_clusters['Age'], data_with_clusters['Income'], 
           c = data_with_clusters['Cluster'], 
           s = 200, 
           cmap = 'hot', 
           alpha = 0.5, 
           edgecolor = 'darkgrey')
ax.set_xlabel('Education', 
              fontsize = 16)
ax.set_ylabel('Age', 
              fontsize = 16)
ax.set_zlabel('Income', 
              fontsize = 16)

plt.show()

Based on the result above:
1. How do you think the groups can be named? 
2. How would you choose to develop products for these different groups?

For example, with four clusters the groups might be named as follows:

- Group 1: Younger, lower education and lower income
- Group 2: Older, lower education and lower income
- Group 3: Higher education and lower income
- Group 4: Older and higher income

## 6. Regression following cluster analysis

Now that you have identified distinct customer groups (or segments), you can examine each group separately to understand its data better and make additional predictions that inform suitable products and their pricing.

In [None]:
# As an example, create a dataframe that contains data of the customers pertaining to cluster 0 alone.


In [None]:
# Make a scatter plot of Age vs. Income of this customer segment

%matplotlib inline
plt.scatter...

plt.show()

Homework: You can build a multiple regression model on this particular customer segment and find out how income depends on Age and the level of education.
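As a starting point for the homework, the sketch below fits a multiple regression with scikit-learn's `LinearRegression` on a made-up segment. The data is generated from a known linear relationship so that the fitted coefficients can be checked; the dataframe `segment` and its values are assumptions, not the actual cluster 0 customers.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up segment data with a known linear relationship plus noise
rng = np.random.default_rng(2)
segment = pd.DataFrame({
    "Age": rng.uniform(20, 40, size=200),
    "Edu": rng.integers(1, 4, size=200).astype(float),
})
noise = rng.normal(scale=1.0, size=200)
segment["Income"] = 5.0 + 1.5 * segment["Age"] + 4.0 * segment["Edu"] + noise

# Multiple regression: Income as a function of Age and Edu
features = segment[["Age", "Edu"]]
reg = LinearRegression().fit(features, segment["Income"])
print("coefficients (Age, Edu):", reg.coef_)
print("intercept:", reg.intercept_)
print("R^2:", reg.score(features, segment["Income"]))
```

On the real segment, the same pattern applies after filtering the clustered dataframe down to the rows of one cluster.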

# K-means clustering: Image compression and segmentation

The following example helps you in understanding the use of k-means for image compression and segmentation.

In this case, we shall use one of the sample images available from sklearn, and see how image compression/segmentation can be achieved.

In [None]:
# pick a sample image from sklearn
from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")

# plot the image
ax = plt.axes(xticks=[], yticks=[])
ax.imshow(china);

The image itself is stored in a three-dimensional array of size (height, width, RGB), containing red/green/blue contributions as integers from 0 to 255:

In [None]:
#check the shape of "china"


This shows that the image has 427 $\times$ 640 pixels. Each pixel has an R/G/B value associated with it, hence the third dimension has size 3.

In [None]:
china[0,0,:]

This shows that the pixel at position (0, 0) has a color with an R/G/B combination of 174, 201, 231.

One way we can view this set of pixels is as a cloud of points in a three-dimensional color space. We will reshape the data to [n_samples x n_features], and rescale the colors so that they lie between 0 and 1:

In [None]:
# Rescale the pixel data so that its values lie in the range [0, 1]. Such a simple scaling approach is enough here
# since all the R/G/B values lie between 0 and 255.

data = None  # TODO: rescaled pixel values

# Reshape the data so that it is 2D, with the first dimension equal to 427 x 640 (the number of pixels).
# hint: you can use the np.reshape method

data = None  # TODO: reshaped array
data.shape
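The rescale-and-reshape step can be tried out first on a tiny made-up array, where the result is easy to inspect by eye. The array `img` and the names `scaled_img` and `pixels` below are assumptions for illustration and stand in for the real image.

```python
import numpy as np

# Tiny made-up "image": 2 rows x 3 columns of RGB pixels, values in 0-255
img = np.array([
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[0, 0, 0], [128, 128, 128], [255, 255, 255]],
], dtype=np.uint8)

scaled_img = img / 255.0            # floats in [0, 1]
pixels = scaled_img.reshape(-1, 3)  # one row per pixel, one column per channel

print(img.shape, "->", pixels.shape)  # (2, 3, 3) -> (6, 3)
```

Passing `-1` lets NumPy infer the number of rows, so the same line works for any image size.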

Let's look at a 3D plot in R/G/B space of 2000 pixels picked at random from the 273,280 pixels in the image. A sample of this size is enough to convey the distribution of colors and keeps the plot from running into memory issues.

In [None]:
# Randomly select N row indices from data
N = 2000
i = np.random.randint(0, data.shape[0], N)

# Extract the colors associated with the randomly selected indices
plot_data = data[i, :]

%matplotlib notebook
from mpl_toolkits.mplot3d import Axes3D
 
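fig = plt.figure(figsize=(9, 9))
# Create the 3D axes through the figure; recent Matplotlib versions do not attach Axes3D(fig) automatically.
ax = fig.add_axes([0, 0, .95, 1], projection='3d')
ax.view_init(elev=48, azim=134)

# No `c` argument is given, so `cmap` would be ignored and is omitted here
ax.scatter(plot_data[:,0], plot_data[:,1], plot_data[:,2], 
           s = 200, 
           alpha = 0.5, 
           edgecolor = 'darkgrey')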
ax.set_xlabel('Red', 
              fontsize = 16)
ax.set_ylabel('Green', 
              fontsize = 16)
ax.set_zlabel('Blue', 
              fontsize = 16)

plt.show()

In [None]:
# Instantiate a KMeans model with a certain number of clusters
kmeans = KMeans(n_clusters=None)  # TODO: replace None with the number of colors (clusters) you want

# Train the model on the data
kmeans.fit(data)

In [None]:
# Predict which cluster each point in the data belongs to 
# and find the color associated with the center of that cluster
identified_clusters = kmeans.predict(data)

# assign new colors to each pixel based on the centroid values
new_colors = kmeans.cluster_centers_[identified_clusters]
new_colors.shape

In [None]:
# Replot the RGB data, but now with clustering results.

# Randomly select N row indices from data
N = 2000
i = np.random.randint(0, data.shape[0], N)

# Extract the colors associated with the randomly selected indices
plot_data = data[i, :]
plot_clusters = identified_clusters[i]

%matplotlib notebook
from mpl_toolkits.mplot3d import Axes3D
 
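fig = plt.figure(figsize=(9, 9))
# Recent Matplotlib versions no longer attach Axes3D(fig) to the figure automatically,
# so create the 3D axes through the figure and set the viewing angle explicitly.
ax = fig.add_axes([0, 0, .95, 1], projection='3d')
ax.view_init(elev=48, azim=134)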

ax.scatter(plot_data[:,0], plot_data[:,1], plot_data[:,2], 
           c = plot_clusters, 
           s = 200, 
           cmap = 'hot', 
           alpha = 0.5, 
           edgecolor = 'darkgrey')
ax.set_xlabel('Red', 
              fontsize = 16)
ax.set_ylabel('Green', 
              fontsize = 16)
ax.set_zlabel('Blue', 
              fontsize = 16)

plt.show()

In [None]:
# Replot the image and compare with the original one
china_recolored = new_colors.reshape(china.shape)

%matplotlib inline
fig, ax = plt.subplots(1, 2, figsize=(16, 6),
                       subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('Reduced-color Image', size=16);

What do you observe? 
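One way to put a number on the comparison is to count distinct colors before and after recoloring. The sketch below does this on a small made-up image rather than the sample photo; the random image, the palette size `k = 8`, and the names `image`, `pixels`, and `quantized` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 20 x 30 RGB image with values already scaled to [0, 1]
rng = np.random.default_rng(3)
image = rng.random((20, 30, 3))
pixels = image.reshape(-1, 3)

k = 8  # hypothetical palette size
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Replace every pixel by the centroid of its cluster, then restore the image shape
quantized = model.cluster_centers_[model.labels_].reshape(image.shape)

n_before = len(np.unique(pixels, axis=0))
n_after = len(np.unique(quantized.reshape(-1, 3), axis=0))
print(f"distinct colors: {n_before} -> {n_after}")
```

Since every pixel is mapped to one of at most `k` centroid colors, the recolored image can be stored as a small palette plus one cluster index per pixel, which is where the compression comes from.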