Title: K-means clustering

Authors: Bauyrzhan Taimanov - Madina Amanatayeva

Date: Jan 18, 2024

In [2]:
# imports 
from matplotlib import pyplot as plt
import pandas as pd

# 0) Context


## Customer Segmentation with Parallel K-Means

In business, we often say "Customer is king" and "Customer is always right." Basically, it means we need to focus on what customers want. So, the idea here is to use this concept to group products that match what different groups of customers like. This can help a lot in online stores because it means showing customers the products they're most likely to be interested in.

![alt text](https://editor.analyticsvidhya.com/uploads/73654cluster.jpg)

This concept is useful wherever digital marketing tools are used to promote products. For example, in some types of ads, we show products to people based on attributes such as their interests. Then, by looking at whether they click on the ads, we can infer who else might like those products.

Similarly, in other types of ads, such as real-time bidding, we bid in real time for space to show our products. We base the bid on what we know about the people seeing the ads and on what has worked well in the past. This way, we can show the right products to the right people at the right time.

## Concept

Customers buy products based on different needs and constraints, like budget and preference for certain items. To understand these behaviors, we can analyze customer invoice data and other non-personal data.
Here's a simplified approach:
* Analyze customer invoice data to find buying patterns.
* Use clustering algorithms to group customers based on similar purchase behaviors.
* Identify products that match each customer group's preferences.
* Tailor digital campaigns to each customer group for better targeting.

However, given the limitations of our current data, we may not have all the details needed to build a complete model.
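As a rough illustration of the first step in the list above, invoice lines can be aggregated into one row of features per customer, which is the input shape a clustering algorithm expects. This is only a sketch: the tiny `invoices` table and its column names (`CustomerID`, `InvoiceNo`, `Quantity`, `UnitPrice`) are assumptions made for the example, not the actual dataset used in this notebook.

```python
import pandas as pd

# Hypothetical invoice lines; the data and column names are assumptions for illustration.
invoices = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "Quantity": [2, 1, 10, 5, 3, 1],
    "UnitPrice": [5.0, 20.0, 1.5, 2.0, 4.0, 100.0],
})

# Value of each invoice line.
invoices["LineTotal"] = invoices["Quantity"] * invoices["UnitPrice"]

# One row per customer: how often they buy, how much, and how much they spend.
customer_features = invoices.groupby("CustomerID").agg(
    n_invoices=("InvoiceNo", "nunique"),
    total_quantity=("Quantity", "sum"),
    total_spend=("LineTotal", "sum"),
)
customer_features["avg_invoice_value"] = (
    customer_features["total_spend"] / customer_features["n_invoices"]
)

print(customer_features)
```

Which features are worth computing depends on the data that is actually available; the four columns here are just one plausible starting point.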

# 1) Analysis of the Serial Algorithm

In customer segmentation, the goal is to group customers into distinct segments based on similarities in their behavior, preferences, or characteristics. K-Means clustering is a popular algorithm for this task. The list below describes K-Means and three alternative clustering algorithms that can be applied to the customer segmentation problem:

* K-Means Algorithm:
  * Description: K-Means is an iterative clustering algorithm that aims to partition a dataset into K clusters.
  * Usage in Customer Segmentation: You can use K-Means to cluster customers based on features such as purchase history, demographics, or behavioral data. Given the number of clusters (K), K-Means assigns each customer to the nearest centroid, creating distinct customer segments.
  * Process:
    1. Choose the number of clusters (K).
    2. Initialize K centroids randomly.
    3. Assign each data point to the nearest centroid.
    4. Update each centroid by computing the mean of all data points assigned to it.
    5. Repeat steps 3 and 4 until convergence or until a specified number of iterations is reached.
  * Considerations: K-Means is sensitive to the initial placement of centroids and may converge to local optima. Therefore, it's essential to run the algorithm multiple times with different initializations and choose the best clustering solution based on evaluation metrics or domain knowledge.
* Hierarchical Clustering:
  * Description: Hierarchical clustering builds a tree-like hierarchy of clusters by recursively merging or splitting clusters based on similarity.
  * Usage in Customer Segmentation: Hierarchical clustering can reveal hierarchical relationships among customer segments. It doesn't require specifying the number of clusters beforehand and can provide insights into the natural grouping structure of the data.
  * Process (agglomerative, bottom-up variant):
    1. Start with each data point as a separate cluster.
    2. Merge the two closest clusters into a single cluster.
    3. Repeat step 2 until all data points are in one cluster or until a stopping criterion is met.
  * Considerations: Hierarchical clustering can be computationally intensive for large datasets, because it typically relies on pairwise distances between points, so its cost grows roughly quadratically with the number of customers. Additionally, choosing where to cut the hierarchy, and therefore the final number of clusters, can be subjective.
* DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
  * Description: DBSCAN is a density-based clustering algorithm that groups together closely packed points.
  * Usage in Customer Segmentation: DBSCAN can identify dense regions of customers in the feature space, automatically detecting outliers as noise points.
  * Process:
    1. Define two parameters: epsilon (the radius within which to search for neighboring points) and min_samples (the minimum number of points required to form a dense region).
    2. Identify core points (points with at least min_samples points within distance epsilon) and border points (points within distance epsilon of a core point but with fewer than min_samples neighbors).
    3. Expand clusters by connecting core points to their neighbors and assigning border points to a cluster if they are reachable from a core point.
    4. Mark the remaining points, which are neither core nor border points, as noise (outliers).
  * Considerations: DBSCAN is robust to outliers and doesn't require specifying the number of clusters beforehand. However, it may struggle with clusters of varying densities and can be sensitive to the choice of epsilon and min_samples.
* Gaussian Mixture Models (GMM):
  * Description: GMM is a probabilistic model that assumes data points are generated from a mixture of several Gaussian distributions.
  * Usage in Customer Segmentation: GMM can capture elliptical clusters of different sizes and orientations, and it models uncertainty by giving each customer a probability of belonging to each cluster.
  * Process:
    1. Initialize the parameters of each Gaussian distribution (mean, covariance, and mixing coefficient).
    2. Run the Expectation-Maximization (EM) algorithm: alternate between computing each data point's probability of belonging to each component (E-step) and updating the parameters to increase the likelihood of the observed data (M-step).
    3. Assign each data point to the cluster for which it has the highest probability.
  * Considerations: GMM can be computationally expensive, especially for high-dimensional data. It may also converge to local optima and requires careful initialization.
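Returning to K-Means, the algorithm this notebook focuses on: the five-step process listed above can be sketched serially in a few lines of NumPy. This is a minimal illustration under some assumptions rather than a reference implementation: the function name `kmeans`, the choice to initialize centroids by sampling K data points, and the synthetic two-group data are all made up for the example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain serial K-Means following the steps listed above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    # (one common way to "initialize randomly"; an assumption of this sketch).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points;
        # a cluster that received no points keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Synthetic stand-in for customer features: two well-separated groups.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])

centroids, labels = kmeans(X, k=2)
print(centroids)
print(np.bincount(labels))
```

The assignment step computes a distance from every point to every centroid, so each iteration costs on the order of N × K × d operations for N points in d dimensions. Because each point's assignment is independent of the others, this step is usually the first candidate for parallelization.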

Each of these algorithms has its strengths and weaknesses, and the choice depends on factors such as the nature of the data, the desired level of interpretability, and computational considerations. It's often helpful to experiment with multiple algorithms and evaluate their performance using metrics such as silhouette score, Davies–Bouldin index, or domain-specific criteria. Additionally, combining multiple algorithms or using ensemble methods can sometimes yield improved segmentation results.
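One way to run such an experiment is to fit all four algorithms on the same scaled feature matrix and compare their scores. The sketch below assumes scikit-learn is installed and uses synthetic blobs from `make_blobs` in place of real customer features. The parameter values (`n_clusters=3`, `eps=0.3`, `min_samples=5`) are illustrative guesses, not tuned settings. A higher silhouette score and a lower Davies–Bouldin index both indicate better-separated clusters.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer features (an assumption of this sketch).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)  # put all features on a comparable scale

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
    "GMM": GaussianMixture(n_components=3, random_state=0),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN labels noise points as -1; leave them out of the scoring.
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        scores[name] = None  # both metrics need at least two clusters
        continue
    scores[name] = (
        silhouette_score(X[mask], labels[mask]),
        davies_bouldin_score(X[mask], labels[mask]),
    )

for name, result in scores.items():
    if result is None:
        print(f"{name}: fewer than two clusters found")
    else:
        print(f"{name}: silhouette={result[0]:.3f}, davies_bouldin={result[1]:.3f}")
```

Note that excluding DBSCAN's noise points from the metrics tends to flatter its scores, so the share of points labeled as noise is worth reporting alongside them.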